
The World Wide Telescope an Archetype for Online-Science

Jim Gray (Microsoft)

Alex Szalay (Johns Hopkins University)

Microsoft Academic Days in Silicon Valley

http://research.microsoft.com/~gray/talks


First, an aside: 2 other projects

• TerraServer – joint with USGS

• Giga Byte File Transfers – joint with Caltech and CERN


TerraServer

• Seamless mosaic of US
• ~20 TB of imagery
• 30 M web hits/day
• A scalability laboratory

TerraServer Bricks – A High Availability Cluster Alternative (2004)
TerraServer Cluster and SAN Experience (2004)
TerraService.NET: An Introduction to Web Services (2002)
Microsoft TerraServer: A Spatial Data Warehouse (1999)
The Microsoft TerraServer™ (1998)


Gigabyte Per Second File Mover

• CERN to Pasadena
  – Windows TCP/IP, NTFS
  – Quantifying performance
  – Working on better algorithms
  – Opteron
  – Disk-to-disk at 550 MBps now (~2 TB/hour)
• GOAL: 1 GBps disk-to-disk

Gigabyte Bandwidth Enables Global Co-Laboratories
Sequential Disk IO Tests for GBps Land Speed Record

OC192 = 9.9 Gbps

[Chart: CERN-Caltech transfer speeds, Newisys->Newisys, Mar-04 to Sep-04. Y-axis: MBps, 0 to 1000. Series: file transfer MBps, 1-stream TCP MBps. Reference lines: PCI-X limit, TCP limit.]
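The "550 MBps is about 2 TB/hour" figure on this slide is a unit conversion that can be checked directly. The sketch below assumes decimal units (1 TB = 1,000,000 MB); the function names are illustrative and not part of the talk.

```python
def tb_per_hour(mb_per_second: float) -> float:
    """Convert a sustained rate in MB/s to TB/hour (decimal units assumed)."""
    return mb_per_second * 3600 / 1_000_000


def hours_to_move(terabytes: float, mb_per_second: float) -> float:
    """Hours needed to move `terabytes` at a sustained `mb_per_second`."""
    return terabytes / tb_per_hour(mb_per_second)


print(f"550 MBps  -> {tb_per_hour(550):.2f} TB/hour")   # the rate quoted as current
print(f"1000 MBps -> {tb_per_hour(1000):.2f} TB/hour")  # the 1 GBps goal
```

At the current rate, a 20 TB data set (roughly the TerraServer imagery volume mentioned earlier) would take about ten hours under these assumptions.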



The Evolution of Science

• Observational Science
  – Scientist gathers data by direct observation
  – Scientist analyzes data
• Analytical Science
  – Scientist builds analytical model
  – Makes predictions
• Computational Science
  – Simulate analytical model
  – Validate model and make predictions
• Data Exploration Science
  – Data captured by instruments, or data generated by simulator
  – Processed by software
  – Placed in a database / files
  – Scientist analyzes database / files


Information Avalanche

• In science, industry, government, ...
  – better observational instruments and
  – better simulations are producing a data avalanche
• Examples
  – BaBar: grows 1 TB/day (2/3 simulation information, 1/3 observational information)
  – CERN: LHC will generate 1 GB/s, ~10 PB/y
  – VLBA (NRAO) generates 1 GB/s today
  – Pixar: 100 TB/movie
• New emphasis on informatics:
  – Capturing, Organizing, Summarizing, Analyzing, Visualizing

[Images: image courtesy C. Meneveau & A. Szalay @ JHU; BaBar, Stanford; Space Telescope; P&E Gene Sequencer, from http://www.genome.uci.edu/]


The Big Picture

[Diagram: Experiments & Instruments, Simulations, Literature, and Other Archives each contribute facts; questions go in and answers come out.]

The Big Problems

• Data ingest
• Managing a petabyte
• Common schema
• How to organize it?
• How to reorganize it?
• How to coexist with others?
• Query and Vis tools
• Support/training
• Performance
  – Execute queries in a minute
  – Batch query scheduling


FTP - GREP

• Download (FTP and GREP) is not adequate
  – You can GREP 1 MB in a second
  – You can GREP 1 GB in a minute
  – You can GREP 1 TB in 2 days
  – You can GREP 1 PB in 3 years
• Oh!, and 1 PB ~ 3,000 disks
• At some point we need indices to limit search, plus parallel data search and analysis
• This is where databases can help
• Next generation technique: Data Exploration
  – Bring the analysis to the data!
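The GREP figures above are rules of thumb, and a small calculation shows what sequential scan rate each one implies. This is only a sketch: the sizes use decimal units and a 365-day year, both of which are assumptions, and the helper names are invented for illustration.

```python
SECONDS = {"second": 1, "minute": 60, "day": 86_400, "year": 365 * 86_400}

# (label, bytes, elapsed seconds), taken from the four GREP bullets above
SLIDE_FIGURES = [
    ("1 MB", 10**6, 1 * SECONDS["second"]),
    ("1 GB", 10**9, 1 * SECONDS["minute"]),
    ("1 TB", 10**12, 2 * SECONDS["day"]),
    ("1 PB", 10**15, 3 * SECONDS["year"]),
]


def implied_mb_per_second(num_bytes: int, elapsed_seconds: int) -> float:
    """Scan rate (MB/s) implied by reading num_bytes in elapsed_seconds."""
    return num_bytes / elapsed_seconds / 10**6


def scan_days(num_bytes: int, mb_per_second: float) -> float:
    """Days needed to scan num_bytes sequentially at mb_per_second."""
    return num_bytes / (mb_per_second * 10**6) / SECONDS["day"]


for label, size, elapsed in SLIDE_FIGURES:
    print(f"{label}: ~{implied_mb_per_second(size, elapsed):.1f} MB/s implied")

print(f"1 PB at 10 MB/s: ~{scan_days(10**15, 10) / 365:.1f} years")
```

The implied rates differ from bullet to bullet, which is expected for order-of-magnitude estimates; the point that survives is that a single sequential scan stops being practical somewhere between a terabyte and a petabyte.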


The Speed Problem

• Many users want to search the whole DB with ad hoc queries, often combinatorial
• Want ~1 minute response
• Brute force (parallel search):
  – 1 disk = 50 MBps => ~1M disks/PB ~ 300M$/PB
• Indices (limit search, do column store)
  – 1,000x less equipment: 1M$/PB
• Pre-compute answer
  – No one knows how to do it for all questions.
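The brute-force estimate can be reproduced from the numbers on the slide: divide the data set across enough disks that each finishes its share within the response-time target. The sketch below assumes decimal units and perfectly parallel scans; the 1,000x reduction for indices is taken from the slide, not derived.

```python
import math

PETABYTE = 10**15


def disks_for_scan(dataset_bytes: float, disk_mb_per_second: float,
                   target_seconds: float) -> int:
    """Number of disks needed so a fully parallel scan meets the target time."""
    bytes_per_disk = disk_mb_per_second * 10**6 * target_seconds
    return math.ceil(dataset_bytes / bytes_per_disk)


# Brute force: every query reads the whole petabyte in one minute.
brute_force = disks_for_scan(PETABYTE, disk_mb_per_second=50, target_seconds=60)

# Indices / column store: assume (per the slide) 1,000x less data is touched.
indexed = disks_for_scan(PETABYTE / 1000, disk_mb_per_second=50, target_seconds=60)

print(brute_force, indexed)
```

Under these assumptions the brute-force count comes out at a few hundred thousand disks, the same order of magnitude as the slide's "~1M disks/PB", and the indexed case needs a few hundred.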


Next-Generation Data Analysis

• Looking for
  – Needles in haystacks – the Higgs particle
  – Haystacks: Dark matter, Dark energy
• Needles are easier than haystacks
• Global statistics have poor scaling
  – Correlation functions are N^2, likelihood techniques N^3
• As data and computers grow at same rate, we can only keep up with N log N
• A way out?
  – Relax notion of optimal (data is fuzzy, answers are approximate)
  – Don't assume infinite computational resources or memory
• Combination of statistics & computer science
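The claim that only N log N algorithms keep up can be made concrete: if the data size and the available compute both grow by the same factor, the wall-clock time of an algorithm with cost f(N) changes by f(gN) / f(N) / g. The following is a minimal sketch of that ratio; the cost functions and the choice of N are illustrative assumptions.

```python
import math


def runtime_growth(cost, n: float, growth: float) -> float:
    """Factor by which wall-clock time changes when both the data size
    and the compute speed are multiplied by `growth`."""
    return cost(n * growth) / cost(n) / growth


COSTS = {
    "N log N": lambda n: n * math.log2(n),
    "N^2": lambda n: n ** 2,
    "N^3": lambda n: n ** 3,
}

for name, cost in COSTS.items():
    factor = runtime_growth(cost, n=1e6, growth=2)
    print(f"{name}: runtime x{factor:.2f} after data and compute both double")
```

With a doubling of both, the N^2 method takes twice as long and the N^3 method four times as long, while N log N stays within a few percent of its original runtime.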


Analysis and Databases

• Much statistical analysis deals with
  – Creating uniform samples
  – Data filtering
  – Assembling relevant subsets
  – Estimating completeness
  – Censoring bad data
  – Counting and building histograms
  – Generating Monte-Carlo subsets
  – Likelihood calculations
  – Hypothesis testing
• Traditionally these are performed on files
• Most of these tasks are much better done inside a database
• Move Mohamed to the mountain, not the mountain to Mohamed.
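Two of the tasks listed above, censoring bad data and building a histogram, can be expressed as a single query so that only the binned counts leave the database. The sketch below uses Python's built-in sqlite3 module as a stand-in for a real archive; the table, columns, and values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE objects (id INTEGER PRIMARY KEY, mag REAL, flag INTEGER)")
rows = [
    (1, 14.2, 0), (2, 14.8, 0), (3, 15.1, 0),
    (4, 15.6, 1),  # flag = 1 marks a bad measurement in this toy example
    (5, 15.9, 0), (6, 16.3, 0),
]
conn.executemany("INSERT INTO objects VALUES (?, ?, ?)", rows)

# Filter out flagged rows and count objects per one-magnitude bin,
# all inside the database engine.
histogram = conn.execute(
    """
    SELECT CAST(mag AS INTEGER) AS mag_bin, COUNT(*) AS n
    FROM objects
    WHERE flag = 0
    GROUP BY mag_bin
    ORDER BY mag_bin
    """
).fetchall()

print(histogram)
```

The client receives three rows rather than the full table, which is the "move the analysis to the data" idea at toy scale.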


Organization & Algorithms

• Use of clever data structures (trees, cubes):
  – Up-front creation cost, but only N log N access cost
  – Large speedup during the analysis
  – Tree-codes for correlations (A. Moore et al 2001)
  – Data Cubes for OLAP (all vendors)
• Fast, approximate heuristic algorithms
  – No need to be more accurate than cosmic variance
  – Fast CMB analysis by Szapudi et al (2001)
• N log N instead of N^3 => 1 day instead of 10 million years
• Take cost of computation into account
  – Controlled level of accuracy
  – Best result in a given time, given our computing resources
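The trade of an up-front build cost for cheap access can be illustrated with pair counting, the core operation of a correlation function. The sketch below is not the tree code cited above: it swaps in a much simpler structure, a sorted list searched with bisect, and works in one dimension with integer positions. It only shows the pattern of paying O(N log N) once to avoid an O(N^2) scan.

```python
import bisect
import random


def pairs_brute(xs, r):
    """Count pairs closer than or equal to r by checking every pair: O(N^2)."""
    n = len(xs)
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if abs(xs[i] - xs[j]) <= r)


def pairs_sorted(xs, r):
    """Same count using a sorted list: O(N log N) to build and query."""
    s = sorted(xs)                               # up-front creation cost
    total = 0
    for i, x in enumerate(s):
        hi = bisect.bisect_right(s, x + r)       # one past the last value <= x + r
        total += hi - i - 1                      # neighbours to the right of i
    return total


rng = random.Random(0)
points = [rng.randrange(1_000_000) for _ in range(2_000)]
assert pairs_brute(points, 500) == pairs_sorted(points, 500)
```

Both functions return the same count; the sorted version simply never looks at pairs that are too far apart to matter.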


World Wide Telescope / Virtual Observatory
http://www.ivoa.net/

• Premise: Most data is (or could be) online
• The Internet is the world's best telescope:
  – It has data on every part of the sky
  – In every measured spectral band: optical, x-ray, radio...
  – As deep as the best instruments (2 years ago).
  – It is up when you are up. The "seeing" is always great (no working at night, no clouds, no moons, no...).
  – It's a smart telescope: links objects and data to literature on them.


Why Astronomy?

• Community has lots of data
• Data is real and well documented
  – High-dimensional (with confidence intervals)
  – Spatial, temporal
• Diverse and distributed
  – Many different instruments from many different places and many different times
• Community wants to share/cross compare
  – Can freely share data and algorithms.
  – "DataMining, Not Data MINE!!" Mark Ellisman, UCSD
• They are well organized
• Community is small and homogeneous
• No commercial or privacy concerns
  – All the problems are technical or social.


The WWT Components

• Data Sources
  – Literature
  – Archives
• Unified Definitions
  – Units
  – Semantics/Concepts/Metrics, Representations
  – Provenance
• Object model
• Classes and methods
• Portals


Data Sources

• Literature online and cross indexed
  – Simbad, ADS, NED: http://simbad.u-strasbg.fr/Simbad, http://adswww.harvard.edu/, http://nedwww.ipac.caltech.edu/
• Many curated archives online
  – FIRST, DPOSS, 2MASS, USNO, IRAS, SDSS, VizieR, ...
  – Typically files with English meta-data and some programs
• Groups, Researchers, Amateurs Publish
  – Datasets online in various formats
  – Data publications are ephemeral (may disappear)
  – Many have unknown provenance
• Documentation varies; some good and some none.


Unified Definitions

• Unified Content Descriptors (UCDs) http://vizier.u-strasbg.fr/doc/UCD.htx
  – Collated all table heads from all the literature
  – 100,000 terms reduced to ~1,500
  – Rough consensus that this is the right thing.
  – Refinement in progress as people use UCDs
• Defines
  – Units: gram, radian, second, jansky...
  – Semantic Concepts / Metrics: Std error, Chi2 fit, magnitude, flux @ passband, velocity, ...
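The practical value of a shared vocabulary is that columns from different archives can be matched by meaning rather than by name. The sketch below shows the idea with two made-up catalogs; the column names and the UCD strings are illustrative assumptions and are not taken from the UCD list referenced above.

```python
# Hypothetical column -> UCD maps for two made-up catalogs.
CATALOG_A = {"ra": "pos.eq.ra", "dec": "pos.eq.dec", "r_mag": "phot.mag"}
CATALOG_B = {"RAJ2000": "pos.eq.ra", "DEJ2000": "pos.eq.dec", "Kmag": "phot.mag"}


def columns_by_ucd(column_ucds):
    """Invert a column -> UCD map into UCD -> column."""
    return {ucd: column for column, ucd in column_ucds.items()}


def align(a, b):
    """Pair up the columns of two catalogs that carry the same UCD."""
    by_a, by_b = columns_by_ucd(a), columns_by_ucd(b)
    shared = sorted(by_a.keys() & by_b.keys())
    return {ucd: (by_a[ucd], by_b[ucd]) for ucd in shared}


for ucd, (col_a, col_b) in align(CATALOG_A, CATALOG_B).items():
    print(f"{ucd}: {col_a} <-> {col_b}")
```

A real catalog can tag several columns with the same descriptor (for example, magnitudes in different passbands), so a production mapping would need more than this one-to-one inversion.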


Provenance

• Most data will be derived.
• To do science, need to trace derived data back to source.
• So programs and inputs must be registered.
• Must be able to re-run them.
• Example: Space Telescope Calibrated Data
  – Run on demand
  – Can specify software version (to get old answers)
• Scientific Data Provenance and Curation are largely unsolved problems (some ideas but no science).
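One way to picture "programs and inputs must be registered" is a record that names the program, its version, its parameters, and a digest of each input, so that a derived result can be traced and the run repeated. The sketch below is one possible shape for such a record under those assumptions; the class, field names, and program name are hypothetical and do not describe any existing provenance system.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ProvenanceRecord:
    program: str           # name of the registered program (hypothetical)
    version: str           # software version, so old answers can be reproduced
    input_digests: tuple   # SHA-256 digest of each input
    parameters: tuple      # sorted (name, value) pairs


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def register(program, version, inputs, parameters):
    """Build a record from raw input bytes and a parameter dict."""
    return ProvenanceRecord(
        program=program,
        version=version,
        input_digests=tuple(digest(blob) for blob in inputs),
        parameters=tuple(sorted(parameters.items())),
    )


def record_id(record: ProvenanceRecord) -> str:
    """Stable identifier: identical program, version, inputs, and parameters
    always yield the same id."""
    payload = json.dumps(asdict(record), sort_keys=True).encode()
    return digest(payload)


run_v1 = register("calibrate", "1.0", [b"raw frame"], {"gain": 2})
run_v2 = register("calibrate", "2.0", [b"raw frame"], {"gain": 2})
print(record_id(run_v1) != record_id(run_v2))
```

Changing only the software version yields a different identifier, which mirrors the Space Telescope example of requesting a specific version to get old answers.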


Object Model

• General acceptance of XML
• Recent acceptance of XML Schema (XSD over DTD)
• Wait-and-See about SOAP/WSDL/...
  – "Web Services are just Corba with angle brackets."
  – FTP is good enough for me.
• Personal opinion:
  – Web Services are much more than "Corba + <>"
  – Huge focus on interop
  – Huge focus on integrated tools
• But the community says "Show me!"
  – Many technologists convinced, but not yet the astronomers

[Diagram: your program calls a Web Service via SOAP and receives an object in XML, which becomes data in your address space. For comparison, your program calls a Web Server via HTTP and receives a web page.]


Classes and Methods

• First Class: VOTable http://www.us-vo.org/VOTable/
  – Represents an answer set in XML
    • Defined by an XML Schema (XSD)
    • Metadata (in terms of UCDs)
    • Data representation (numbers and text)
  – First method
    • Cone Search: Get objects in this cone http://voservices.org/cone/

[Diagram: your program calls a Web Service via SOAP and receives an object in XML, which becomes data in your address space.]
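Geometrically, a cone search selects every object whose angular distance from a given sky position is within a given radius. The sketch below shows that selection over an in-memory list, using the haversine formula for great-circle separation; it illustrates the geometry only and does not reproduce the web service interface. The catalog entries and field names are made up.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two (RA, Dec) positions
    given in degrees, using the haversine formula."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def cone_search(catalog, ra, dec, radius_deg):
    """Return the catalog objects that fall inside the cone."""
    return [obj for obj in catalog
            if angular_separation_deg(ra, dec, obj["ra"], obj["dec"]) <= radius_deg]


# Made-up catalog for illustration.
catalog = [
    {"id": 1, "ra": 10.00, "dec": 20.0},
    {"id": 2, "ra": 10.05, "dec": 20.0},
    {"id": 3, "ra": 12.00, "dec": 20.0},
]
print([obj["id"] for obj in cone_search(catalog, ra=10.0, dec=20.0, radius_deg=0.1)])
```

A production service would answer this with a spatial index rather than a linear pass over the catalog; the linear version is kept here for clarity.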


Other Classes

• Space-Time class
  – http://hea-www.harvard.edu/~arots/nvometa/STCdoc.pdf
• Image Class (returns pixels)
  – SdssCutout
  – Simple Image Access Protocol http://bill.cacr.caltech.edu/cfdocs/usvo-pubs/files/ACF8DE.pdf
  – HyperAtlas http://bill.cacr.caltech.edu/usvo-pubs/files/hyperatlas.pdf
• Spectral
  – Simple Spectral Access Protocol
  – 500K spectra available at http://voservices.net/wave
• Query Services
  – ADQL and SkyNode http://skyservice.pha.jhu.edu/develop/vo/adql/
  – And http://SkyQuery.Net
• Registry:
  – see below

[Diagram: your program calls a Web Service via SOAP and receives an object in XML, which becomes data in your address space.]


The Registry

• UDDI seemed inappropriate
  – Complex
  – Irrelevant questions
  – Relevant questions missing
• Evolved Dublin Core
  – Represent Datasets, Services, Portals
  – Needs to be machine readable
  – Federation (DNS model)
  – Push & Pull: register then harvest
• http://www.ivoa.net/twiki/bin/view/IVOA/IvoaResReg


Demo

• SkyServer:
  – Navigator: showing cutout web service
  – List: showing many calls and variant use.
• SkyQuery:
  – Show integration of various archives.
  – Explain spatial join xMatch operator.


SkyServer.SDSS.org

• A modern Astronomy archive
  – Raw Pixel data lives in file servers
  – Catalog data (derived objects) lives in Database
  – Online query to any and all
• Also used for education
  – 150 hours of online Astronomy
  – Implicitly teaches data analysis
• Interesting things
  – Spatial data search
  – Client query interface via Java Applet
  – Query interface via Emacs
  – Popular
  – Cloned by other surveys (a template design)
  – Web services are core of it.


SkyQuery.Net: A Prototype WWT

• Started with SDSS data and schema
• Imported 12 other datasets into that spine schema (a day per dataset plus load time)
• Unified them with a portal
• Implicit spatial join among the datasets.
• All built on Web Services
  – Pure XML
  – Pure SOAP
  – Used .NET toolkit


Federation: SkyQuery.Net

• Combine 4 archives initially
• Added 9 more
• Send query to portal, portal joins data from archives.
• Problem: want to do multi-step data analysis (not just single query).
• Solution: allow personal databases on portal
• Problem: some queries are monsters
• Solution: "batch schedule" on portal server, which deposits the answer in the personal database.


SkyQuery Structure

• Each SkyNode publishes
  – Schema Web Service
  – Database Web Service
• Portal
  – Plans query (2 phase)
  – Integrates answers
  – Is a web service

[Diagram: SkyQuery Portal connected to the 2MASS, INT, SDSS, and FIRST nodes and to an Image Cutout service.]


MyDB
http://skyserver.sdss.org/cas

• Portal allows federation of data but…

• Intermediate results may be large.

• Intermediate results feed into next analysis step.

• Sending them back-and-forth to client is costly and sometimes infeasible.

• Solution: create a working DB for client at Portal: MyDB


MyDB
http://skyserver.sdss.org/cas

• Anyone can create a personal DB at SkyServer portal.
  – It is about 100 MB
  – It is private
• Simple queries done immediately
• Complex queries done by batch scheduler
• All queries can create/read/write MyDB tables
• Very popular with "serious" users.
• MyDB will be shareable by a group.
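The MyDB pattern, where one query deposits its result in a personal database and the next step reads it from there, can be imitated locally. The sketch below uses sqlite3 with an attached in-memory database standing in for the personal workspace; the table, columns, and rows are invented, and nothing here reflects the actual SkyServer schema or interface.

```python
import sqlite3

# "Archive" database, with a personal workspace attached under the name mydb.
archive = sqlite3.connect(":memory:")
archive.execute("ATTACH DATABASE ':memory:' AS mydb")

archive.execute("CREATE TABLE photo (obj_id INTEGER, ra REAL, dec REAL, r REAL)")
archive.executemany(
    "INSERT INTO photo VALUES (?, ?, ?, ?)",
    [(1, 180.1, 0.2, 17.5), (2, 180.4, 0.1, 21.3), (3, 181.0, -0.3, 16.9)],
)
archive.commit()

# Step 1: the intermediate result stays server-side, in the personal database.
archive.execute(
    "CREATE TABLE mydb.bright AS "
    "SELECT obj_id, ra, dec, r FROM photo WHERE r < 18"
)

# Step 2: the next analysis step reads the intermediate table directly.
count, mean_r = archive.execute(
    "SELECT COUNT(*), AVG(r) FROM mydb.bright"
).fetchone()
print(count, round(mean_r, 2))
```

Only the final aggregate crosses to the client; the intermediate table never leaves the server, which is the cost the earlier MyDB slide set out to avoid.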


Open SkyQuery

• SkyQuery being adopted by AstroGrid as reference implementation for OGSA-DAI (Open Grid Services Architecture, Data Access and Integration).

• SkyNode: basic archive object
  http://www.ivoa.net/twiki/bin/view/IVOA/SkyNode

• SkyQuery Language (VoQL) is evolving.
  http://www.ivoa.net/twiki/bin/view/IVOA/IvoaVOQL


The WWT Components (Outline)

• Data Sources
  – Literature
  – Archives
• Unified Definitions
  – Units
  – Semantics/Concepts/Metrics, Representations
  – Provenance
• Object model
• Classes and methods
• Portals
• WWT is a poster child for the Data Grid.

What we learned

• Astro is a community of 10,000
• Homogeneous & Cooperative
• If you can't do it for Astro, do not bother with 3M bio-info.
• Agreement
  – Takes time
  – Takes endless meetings
• Big problems are non-technical
  – Legacy is a big problem.
• Plumbing and tools are there, but...
  – What is the object model?
  – What do you want to save?
  – How to document provenance?


References (all are MSR TRs)

Where the Rubber Meets the Sky: Bridging the Gap between Databases and Science

When Database Systems Meet the Grid

There Goes the Neighborhood: Relational Algebra for Spatial Data Search

Extending the SDSS Batch Query System to the National Virtual Observatory Grid

The World-Wide Telescope, an Archetype for Online Science

Data Mining the SDSS SkyServer Database

The SDSS SkyServer – Public Access to the Sloan Digital Sky Server Data

Web Services for the Virtual Observatory

Online Scientific Data Curation, Publication, and Archiving

Petabyte Scale Data Mining: Dream or Reality?

Designing and Mining Multi-Terabyte Astronomy Archives: The Sloan Digital Sky Survey