Designing and Mining Multi-Terabyte Astronomy Archives: The Sloan Digital Sky Survey

Alexander S. Szalay, Peter Z. Kunszt, Ani Thakar
Dept. of Physics and Astronomy, The Johns Hopkins University

Jim Gray, Don Slutz
Microsoft Research

Robert J. Brunner
California Institute of Technology




Towards the Digital Sky

Goal: interactive exploration of astronomical data
➲ Efforts underway to capture digital images of the sky
➲ Multiple wavelengths: x-rays, ultraviolet, visible, infrared
➲ Diverse data types: images, text, numerical attributes
➲ Data is big: a set of multi-TB archives
➲ No need to wait for access to a telescope

Image: NGC 5033, from “Image of the week” (1/5000 of first light image, May 27-28, 1998)


Astronomy 101

Celestial Sphere

©Sky Publishing Corp

Declination (degrees)

Right ascension (time: h, m, s)

Surface area is measured in “square” degrees, a unit of solid angle
Full sphere = 41252.96 deg²

Arcminute = 1/60 degree
Arcsecond = 1/60 arcminute
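
The unit relationships on this slide can be checked with a few lines of Python. This is a minimal sketch; the function names are illustrative and are not part of any SDSS tool.

```python
import math

# Full sphere: 4*pi steradians, and 1 radian = 180/pi degrees.
FULL_SPHERE_DEG2 = 4.0 * math.pi * (180.0 / math.pi) ** 2


def hms_to_degrees(hours, minutes, seconds):
    """Convert right ascension from time units to degrees (24 h = 360 deg)."""
    return 15.0 * (hours + minutes / 60.0 + seconds / 3600.0)


def arcsec_to_degrees(arcsec):
    """1 arcminute = 1/60 degree and 1 arcsecond = 1/60 arcminute."""
    return arcsec / 3600.0


def sky_fraction(area_deg2):
    """Fraction of the celestial sphere covered by an area in square degrees."""
    return area_deg2 / FULL_SPHERE_DEG2


print(round(FULL_SPHERE_DEG2, 2))  # 41252.96
print(hms_to_degrees(6, 30, 0))    # 97.5
```

For example, `sky_fraction(10000)` evaluates to roughly 0.24, so a 10000 deg² survey area covers about a quarter of the sphere.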


Sloan Digital Sky Survey

Goals (1999)
➲ Map ~10000 deg² of northern sky (~1/4 celestial sphere)
➲ Determine position and brightness of 100M celestial objects
➲ Measure distance to 1M galaxies, create 3D model
➲ Measure distance to 100K quasars
➲ Make data available to the public

As of data release 6 (data through June 2006)
➲ Images, attributes of ~287M objects over 9583 deg²
➲ 1.27 million spectra of stars, galaxies, quasars and blank sky (for sky subtraction) over 7425 deg²
➲ Additional estimates of stellar temperatures, gravities, metallicities
➲ Data, search tools available on web (http://skyserver.sdss.org)


Where is the data acquired?

➲ Apache Point Observatory (APO), Sunspot, NM
  far away from large cities: dark night sky
  altitude: 9200 feet
  little water vapor
  few pollutants
  many cloudless, moonless nights!

Photo: Fermilab Visual Media Services


Telescopes

➲ 2.5 meter reflecting telescope
  wide angle: 3° field of view (about 6 full-moon diameters across, ~30 full moons in area)
  camera: 120 Mpixel, 30 CCDs, each 2 inches square, 5 color filters
  2 spectrographs measure spectra of ~600 objects at once
  generates up to 200 GB/night

Photos: Fermilab Visual Media Services


Telescopes

➲ 0.5 meter photometric telescope
  used to monitor atmosphere during survey (temperature, pressure)
  calibrate brightness of objects captured by main telescope

Photos: Fermilab Visual Media Services


Drift scan imaging

➲ Telescope is positioned once
➲ Images taken as sky moves past
  Reading of CCD lines synchronized with sky movement
  Exposure time: 55 sec
  Two scans (runs) form a stripe
  5-color columns split into fields, 2048x1489, 2B/pixel
  5-color images (+ ~60 attributes)
➲ Output: photometric catalog
  Atlas images, 500+ attributes for each of 100M galaxies, 100M stars, 1M quasars
  Attributes: position, magnitude, size, color, ...
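
The drift-scan numbers above can be tied together with some rough arithmetic. The sketch below is not from the slides: it assumes scanning near the celestial equator (where the sky drifts at about 15.041 arcseconds per second), 2048 x 2048 pixel CCDs (consistent with 120 Mpixel over 30 CCDs), and a 10-hour observing night.

```python
# Assumptions (not from the slide): scan near the celestial equator,
# 2048 x 2048 pixel CCDs, and a 10-hour observing night.
SIDEREAL_RATE_ARCSEC_PER_SEC = 15.041
CCD_ROWS = 2048
CCD_COLUMNS = 2048
EXPOSURE_SEC = 55.0      # time for the sky to cross one CCD (from the slide)
BYTES_PER_PIXEL = 2      # from the slide
NUM_CCDS = 30            # from the telescope slide


def pixel_scale_arcsec():
    """Sky angle per CCD row if the sky crosses all rows in one exposure."""
    return SIDEREAL_RATE_ARCSEC_PER_SEC * EXPOSURE_SEC / CCD_ROWS


def field_bytes(width=2048, height=1489, colors=5):
    """Raw pixel bytes in one 5-color field."""
    return width * height * BYTES_PER_PIXEL * colors


def nightly_bytes(hours=10.0):
    """Raw bytes read from all CCDs during continuous scanning."""
    rows_per_sec = CCD_ROWS / EXPOSURE_SEC
    bytes_per_sec = rows_per_sec * CCD_COLUMNS * BYTES_PER_PIXEL * NUM_CCDS
    return bytes_per_sec * hours * 3600.0


print(round(pixel_scale_arcsec(), 2))  # 0.4
print(field_bytes())                   # 30494720
```

Under these assumptions a full night of scanning yields roughly 165 GB of raw pixels, the same order as the "up to 200 GB/night" quoted for the camera.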

Image: M45, the Pleiades (Christoph Flohr, www.driftscan.com)


Spectroscopic survey

➲ Target specific objects automatically chosen from photometric survey
  1M galaxies, 100K stars, 100K quasars
  Up to 5000 spectra collected per night
➲ Classify objects (stars, galaxies, quasars...)
  template matching against standard spectra for each object class
  examine spectra for object properties (e.g., chemical composition)
➲ Create 3D map of galaxy distribution
  Measure distance using Doppler shift
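
As a rough illustration of the last step, a spectral line's shift gives a redshift, and at low redshift Hubble's law turns that into a distance. The sketch below is an assumption-laden example, not the survey pipeline: the Hubble constant of 70 km/s/Mpc and the function names are illustrative, and the linear relation only holds for small redshifts.

```python
SPEED_OF_LIGHT_KM_S = 299792.458
HUBBLE_CONSTANT_KM_S_MPC = 70.0  # assumed value, not taken from the slides


def redshift(observed_wavelength, rest_wavelength):
    """Redshift z from a spectral line's observed and rest wavelengths."""
    return (observed_wavelength - rest_wavelength) / rest_wavelength


def hubble_distance_mpc(z, h0=HUBBLE_CONSTANT_KM_S_MPC):
    """Low-redshift approximation: velocity v = c*z, distance d = v / H0."""
    return SPEED_OF_LIGHT_KM_S * z / h0


# Hypothetical example: a line with rest wavelength 6562.8 Angstrom
# observed 2% redder.
z = redshift(6562.8 * 1.02, 6562.8)
distance = hubble_distance_mpc(z)
```

With these inputs `z` is 0.02 and `distance` is about 85.7 Mpc.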


Data archives

raw data
  FedEx tapes to Fermilab for processing, reduction

operational archive
  processed data in instrumental form
  perform calibration
  information for target selection

science archive
  object catalog: positions, magnitudes, colors, sizes, radial profiles, classifications, etc. for over 100 million objects
  housekeeping data: calibrations and logs
  atlas images in 5 colors for all identified objects
  one-dimensional spectra of all spectroscopic targets

local archive
  replica of science archive

public archive
  scientifically verified
  recalibrated (if necessary)


Typical queries

Q1: Find all galaxies without saturated pixels within 1 arcsecond of a given point in the sky (right ascension and declination).

  spatial lookup

Q2: Find all galaxies with blue surface brightness between 30 and 40, and -10 < super galactic latitude (sgb) < 10, and declination less than zero.

  search for galaxies with a specified blue brightness in a given region of sky
  coordinate system needs translation

Q3: Find all galaxies brighter than magnitude 22, where the local extinction is > 0.75.

  local extinction indicates amount of dust in a given direction (dust masks light)

Q15: Provide a list of moving objects consistent with an asteroid.

  Objects are classified as moving: 5 successive observations from the 5 color bands.
  SQL: select moving object where sqrt((deltax5-deltax1)^2 + (deltay5-deltay1)^2) < 2 arcseconds.
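
Two of the predicates above, the Q1 radius test and the Q15 displacement test, can be sketched in Python. This is only an illustration of the arithmetic: the function names and the `(obj_id, dx, dy)` record layout are hypothetical, and the real queries run as SQL against the archive.

```python
import math


def angular_separation_arcsec(ra1, dec1, ra2, dec2):
    """Angle between two sky positions (degrees in, arcseconds out).

    Uses the haversine formula, which stays accurate for tiny separations.
    """
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2.0) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2.0) ** 2)
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(a)))) * 3600.0


def within_radius(obj_ra, obj_dec, ra, dec, radius_arcsec=1.0):
    """Q1-style spatial test: is the object within the radius of (ra, dec)?"""
    return angular_separation_arcsec(obj_ra, obj_dec, ra, dec) <= radius_arcsec


def displacement_arcsec(dx, dy):
    """Q15-style displacement between the first and fifth band positions.

    dx and dy hold the five per-band offsets in arcseconds (hypothetical layout).
    """
    return math.hypot(dx[4] - dx[0], dy[4] - dy[0])


def moving_candidates(objects, limit_arcsec=2.0):
    """Return ids of objects whose displacement is under the limit."""
    return [obj_id for obj_id, dx, dy in objects
            if displacement_arcsec(dx, dy) < limit_arcsec]
```

A brute-force radius test like `within_radius` has to touch every object, which is why a spatial index matters for queries such as Q1.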


Database design

Original design was based on an object-oriented database (OODB, ObjectivityDB); it was later changed to a relational database (reported in SIGMOD 2002)

Alexander S. Szalay, Jim Gray, Ani R. Thakar, Peter Z. Kunszt, Tanu Malik, Jordan Raddick, Christopher Stoughton, Jan vandenBerg. “The SDSS SkyServer – Public Access to the Sloan Digital Sky Server Data”, SIGMOD 2002

Figure labels: 80 million objects, 5 color images; target selection; follow-up on selected targets


Schema: photographic objects

PhotoObj: star & galaxy attributes
  records for 80 million objects
  each ~470 attributes (~2KB)
  heavily indexed (“tens of indices”)
  30% of storage space devoted to indices

Field
  processing used for objects in field, all frames

Neighbors
  computed after the data is loaded
  For every object, list of objects within 1/2 arcminute (~10 objects)

Views
  PhotoPrimary: PhotoObj with mode=1 (best instance of deblended object)
  Stars: PhotoPrimary with type='star'
  Galaxies: PhotoPrimary with type='galaxy'
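
The layering of views described above can be sketched with an in-memory SQLite database. This is a toy stand-in for the real SQL Server schema: only five of the ~470 columns are included, and the column names, the text-valued `type`, and the sample rows are assumptions for illustration.

```python
import sqlite3


def build_demo_db():
    """Create a tiny PhotoObj table plus the three views named on the slide."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE PhotoObj (
            objID INTEGER PRIMARY KEY,
            ra    REAL,
            dec   REAL,
            mode  INTEGER,   -- 1 = best instance of a deblended object
            type  TEXT       -- 'star' or 'galaxy' in this toy example
        );
        CREATE VIEW PhotoPrimary AS SELECT * FROM PhotoObj WHERE mode = 1;
        CREATE VIEW Stars    AS SELECT * FROM PhotoPrimary WHERE type = 'star';
        CREATE VIEW Galaxies AS SELECT * FROM PhotoPrimary WHERE type = 'galaxy';
    """)
    # Made-up sample rows, purely for demonstration.
    conn.executemany(
        "INSERT INTO PhotoObj VALUES (?, ?, ?, ?, ?)",
        [(1, 185.0, 15.5, 1, "galaxy"),
         (2, 185.1, 15.6, 1, "star"),
         (3, 185.0, 15.5, 2, "galaxy")],   # secondary instance, hidden by views
    )
    return conn


def ids(conn, view_name):
    """Return the sorted object ids visible through a table or view."""
    rows = conn.execute(f"SELECT objID FROM {view_name} ORDER BY objID")
    return [row[0] for row in rows]
```

Querying `Galaxies` in this toy database returns only object 1, because object 3 has mode 2 and is filtered out by `PhotoPrimary` before the type filter applies.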


Spatial Data Access

Coordinate systems
  right-ascension and declination
  hierarchical triangular mesh (HTM): recursive partitioning of celestial sphere

HTM recursively assigns a number to each point on the sphere

Recursion 20 levels deep: smallest triangles < 0.1 arcsecond on a side

HTM index is built as an extension of SQL Server’s B-trees

Spatial queries use the HTM index to limit searches to small set of triangles
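
The recursive numbering can be sketched as follows. This is a simplified illustration of the idea, not the SDSS HTM library: it starts from the eight faces of an octahedron, splits each spherical triangle into four at the edge midpoints, and appends two bits per level. The root ordering and the resulting ID values are assumptions and may not match the IDs stored in the archive.

```python
import math


def radec_to_xyz(ra_deg, dec_deg):
    """Unit vector for a sky position given in degrees."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (math.cos(dec) * math.cos(ra), math.cos(dec) * math.sin(ra), math.sin(dec))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _midpoint(a, b):
    """Midpoint of the great-circle arc between two unit vectors."""
    s = (a[0] + b[0], a[1] + b[1], a[2] + b[2])
    n = math.sqrt(_dot(s, s))
    return (s[0] / n, s[1] / n, s[2] / n)


def _containment(p, tri):
    """Non-negative when p lies inside the counterclockwise triangle tri."""
    a, b, c = tri
    return min(_dot(_cross(a, b), p), _dot(_cross(b, c), p), _dot(_cross(c, a), p))


def _root_triangles():
    """Eight octahedron faces: four southern, then four northern."""
    equator = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (-1.0, 0.0, 0.0), (0.0, -1.0, 0.0)]
    roots = []
    for pole in ((0.0, 0.0, -1.0), (0.0, 0.0, 1.0)):
        for i in range(4):
            a, b = equator[i], equator[(i + 1) % 4]
            # Keep vertices counterclockwise as seen from outside the sphere.
            roots.append((a, b, pole) if _dot(_cross(a, b), pole) > 0 else (b, a, pole))
    return roots


def _children(tri):
    """Split a spherical triangle into four using its edge midpoints."""
    v0, v1, v2 = tri
    w0, w1, w2 = _midpoint(v1, v2), _midpoint(v0, v2), _midpoint(v0, v1)
    return [(v0, w2, w1), (v1, w0, w2), (v2, w1, w0), (w0, w1, w2)]


def htm_id(ra_deg, dec_deg, depth=20):
    """Illustrative HTM-style id: root in 8..15, then two bits per level."""
    p = radec_to_xyz(ra_deg, dec_deg)
    roots = _root_triangles()
    index = max(range(8), key=lambda i: _containment(p, roots[i]))
    tri, ident = roots[index], 8 + index
    for _ in range(depth):
        kids = _children(tri)
        k = max(range(4), key=lambda i: _containment(p, kids[i]))
        tri, ident = kids[k], ident * 4 + k
    return ident
```

Because each level appends two bits, the ID of a coarse triangle is a prefix of the IDs of every point inside it. A region of sky therefore maps to a few contiguous ID ranges, which is what lets an ordinary B-tree serve as a spatial index.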


Thoughts on server architecture

➲ Use commodity servers and storage
  Processors, memory costs 10x lower than high end
  Storage cost 3x lower
  Deploy as much processing as one can afford
➲ Partition data spatially
  Repartition as servers added, removed
➲ Replicate high traffic data
➲ Exploit parallelism
➲ Deploy as network service initially
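
The slide does not say how the spatial partitioning would be done. One simple possibility, sketched below purely as an assumption, is to cut the sky into equal-width right-ascension stripes, one per server, and recompute the assignment whenever the server count changes. The function names and the `(obj_id, ra_deg)` record layout are hypothetical.

```python
from collections import defaultdict


def server_for(ra_deg, n_servers):
    """Map a right ascension (degrees) to one of n equal-width RA stripes."""
    stripe_width = 360.0 / n_servers
    return min(int((ra_deg % 360.0) / stripe_width), n_servers - 1)


def partition(objects, n_servers):
    """Group (obj_id, ra_deg) pairs by the server that owns their stripe."""
    placement = defaultdict(list)
    for obj_id, ra_deg in objects:
        placement[server_for(ra_deg, n_servers)].append(obj_id)
    return dict(placement)


# Hypothetical objects: repartitioning is just a second call with a new count.
objects = [(1, 10.0), (2, 100.0), (3, 200.0), (4, 350.0)]
two_servers = partition(objects, 2)
four_servers = partition(objects, 4)
```

Equal-width stripes ignore the fact that object density varies across the sky, so a real deployment would more likely balance by object count or by ranges of a spatial index.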


SkyServer

SDSS DR1 is about 900GB (3.4B rows)

SkyServer cluster
➲ Web front ends (3)
  Hardware: Dell PowerEdge 1750 servers, 2GB memory, dual Gbit Ethernet, two 36GB Ultra320 SCSI disks, RAID1
  Software: Windows Server 2003, IIS 6.0, Microsoft Network Load Balancing
➲ Database servers (3)
  1 DB server: short queries on the public website
  2 DB servers: longer queries for registered users, failover
  Hardware: Dell 4600 database servers, 4GB memory, 1.2 TB of 10k rpm Ultra SCSI drives, 4 drives/SCSI channel, RAID0
  Software: Windows Server 2003 and SQL Server 2000
  Data rates: 400MBps (simple query), 160-200 Mbps (typical multi user load)
➲ Log server (1)
  same configuration as DB server?

all back-ends on private network

http://skyserver.sdss.org

Table          Records  Bytes
Field          14k      60MB
Frame          73k      6GB
PhotoObj       14m      31GB
Profile        14m      9GB
Neighbors      111m     5GB
Plate          98       80KB
SpecObj        63k      1GB
SpecLine       1.7m     225MB
SpecLineIndex  1.8m     142MB
xcRedShift     1.9m     157MB
elRedShift     51k      3MB

Major tables, records and sizes. Indices double the storage. (SIGMOD 2002)
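
The table figures can be cross-checked with a short script. This is a sketch that assumes decimal units (1 KB = 1000 bytes) and treats the rounded values in the table as exact, so the results are only approximate.

```python
# Rounded values transcribed from the table; decimal units are assumed.
TABLES = {
    "Field": ("14k", "60MB"),
    "Frame": ("73k", "6GB"),
    "PhotoObj": ("14m", "31GB"),
    "Profile": ("14m", "9GB"),
    "Neighbors": ("111m", "5GB"),
    "Plate": ("98", "80KB"),
    "SpecObj": ("63k", "1GB"),
    "SpecLine": ("1.7m", "225MB"),
    "SpecLineIndex": ("1.8m", "142MB"),
    "xcRedShift": ("1.9m", "157MB"),
    "elRedShift": ("51k", "3MB"),
}

COUNT_SUFFIX = {"k": 1e3, "m": 1e6}
BYTE_UNITS = {"KB": 1e3, "MB": 1e6, "GB": 1e9}


def parse_count(text):
    """Turn '14k' or '1.7m' into a number of records."""
    if text[-1] in COUNT_SUFFIX:
        return float(text[:-1]) * COUNT_SUFFIX[text[-1]]
    return float(text)


def parse_bytes(text):
    """Turn '60MB' or '31GB' into a number of bytes."""
    return float(text[:-2]) * BYTE_UNITS[text[-2:]]


def bytes_per_record(table):
    records, size = TABLES[table]
    return parse_bytes(size) / parse_count(records)


def total_bytes():
    return sum(parse_bytes(size) for _, size in TABLES.values())
```

`bytes_per_record("PhotoObj")` comes out near 2.2 KB, in line with the roughly 2KB per record quoted for PhotoObj, and `total_bytes()` is about 52.6 GB before indices.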