Data-Intensive Scientific Computing in Astronomy
Why Is Astronomy Special?
- Especially attractive for the wide public
- Community is not very large
- It is real and well documented
  - High-dimensional (with confidence intervals)
  - Spatial, temporal
- Diverse and distributed
  - Many different instruments from many different places and times
- The questions are interesting
- There is a lot of it (soon petabytes)

WORTHLESS!
- It has no commercial value
- No privacy concerns, freely share results with others
- Great for experimenting with algorithms
Scientific Data Analysis Today
- Scientific data is doubling every year, reaching PBs
- Data is everywhere, never will be at a single location
- Need randomized, incremental algorithms
  - Best result in 1 min, 1 hour, 1 day, 1 week
- Architectures increasingly CPU-heavy, IO-poor
  - Data-intensive scalable architectures needed
- Most scientific data analysis done on small to midsize Beowulf clusters, from faculty startup
- Universities hitting the power wall
- Soon we cannot even store the incoming data stream
- Not scalable, not maintainable
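The "best result in 1 min, 1 hour, 1 day" idea means any-time answers that improve the longer the data streams by. A minimal sketch of this incremental style is reservoir sampling, which maintains a uniform random sample of a stream of unknown length and can report an estimate after any prefix (the specific estimator and sizes here are illustrative, not from the talk):

```python
import random

def reservoir_sample(stream, k, rng=random.Random(42)):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    sample = []
    for i, x in enumerate(stream):
        if i < k:
            sample.append(x)
        else:
            j = rng.randrange(i + 1)
            if j < k:
                sample[j] = x  # replace with probability k/(i+1)
    return sample

# The longer we read, the better the estimate -- but any prefix gives an answer.
data = range(1_000_000)
sample = reservoir_sample(data, 1000)
estimate = sum(sample) / len(sample)  # approximates the true mean, 499999.5
```

The same pattern underlies the randomized, incremental algorithms the slide calls for: the full data set is never required to be in one place at one time.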
Building Scientific Databases
- 10 years ago we set out to explore how to cope with the data explosion (with Jim Gray)
- Started in astronomy, with the Sloan Digital Sky Survey
- Expanded into other areas, while exploring what can be transferred
- During this time data sets grew from 100GB to 100TB
- Interactions with every step of the scientific process
  - Data collection, data cleaning, data archiving, data organization, data publishing, mirroring, data distribution, data curation
Sloan Digital Sky Survey
"The Cosmic Genome Project"
- Two surveys in one
  - Photometric survey in 5 bands
  - Spectroscopic redshift survey
- Data is public
  - 2.5 terapixels of images
  - 40 TB of raw data => 120 TB processed
  - 5 TB catalogs => 35 TB in the end
- Started in 1992, finished in 2008
- Database and spectrograph built at JHU (SkyServer)
Participating institutions: The University of Chicago, Princeton University, The Johns Hopkins University, The University of Washington, New Mexico State University, Fermi National Accelerator Laboratory, US Naval Observatory, The Japanese Participation Group, The Institute for Advanced Study, Max Planck Inst. Heidelberg. Funding: Sloan Foundation, NSF, DOE, NASA.
SDSS Now Finished!
- As of May 15, 2008 SDSS is officially complete
- Final data release (DR7): Oct 31, 2008
- Final archiving of the data in progress
  - Paper archive at U. Chicago Library
  - Deep digital archive at JHU Library
  - CAS mirrors at FNAL + JHU P&A
- Archive contains >120 TB
  - All raw data
  - All processed/calibrated data
  - All versions of the database (>35 TB)
  - Full email archive and technical drawings
  - Full software code repository
  - Telescope sensor stream, IR fisheye camera, etc.

Survey instruments:
- SDSS: 2.5m, 0.12 Gpixel
- PanSTARRS: 1.8m, 1.4 Gpixel
- LSST: 8.4m, 3.2 Gpixel
Impact of Sky Surveys

Continuing Growth
- How long does the data growth continue?
- High end always linear
- Exponential comes from technology + economics
  - Rapidly changing generations, like CCDs replacing plates, becoming ever cheaper
- How many generations of instruments are left?
- Are there new growth areas emerging?
  - Software is becoming a new kind of instrument
  - Value-added federated data sets
  - Large and complex simulations
  - Hierarchical data replication

Cosmological Simulations
- Cosmological simulations have 10^9 particles and produce over 30 TB of data (Millennium)
  - Build up dark matter halos
  - Track merging history of halos
  - Use it to assign star formation history
  - Combination with spectral synthesis
  - Realistic distribution of galaxy types
- Hard to analyze the data afterwards -> need a DB
- What is the best way to compare to real data?
- Next generation of simulations with 10^12 particles and 500 TB of output are under way (Exascale-Sky)
Immersive Turbulence
- Understand the nature of turbulence
- Consecutive snapshots of a 1,024^3 simulation of turbulence: now 30 terabytes
- Treat it as an experiment: observe the database!
- Throw test particles (sensors) in from your laptop, immerse into the simulation, like in the movie Twister
New paradigm for analyzing HPC simulations!
with C. Meneveau, S. Chen (Mech. E), G. Eyink (Applied Math), R. Burns (CS)
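The "throw test particles in from your laptop" idea can be sketched as advecting sensors through a velocity field that is queried point by point. In the real system those queries go to the turbulence database; here an analytic 2-D Taylor-Green vortex stands in for the database, and the integrator and field are illustrative assumptions, not the project's actual API:

```python
import math

def velocity(x, y, t):
    """Stand-in velocity field (2-D Taylor-Green vortex).
    The real system would answer this query from the simulation database."""
    u = math.sin(x) * math.cos(y) * math.exp(-2 * t)
    v = -math.cos(x) * math.sin(y) * math.exp(-2 * t)
    return u, v

def advect(particles, t0, t1, dt=1e-3):
    """Euler-integrate test particles 'thrown into' the flow; the particles
    live on the laptop, the field data stays on the server."""
    t = t0
    while t < t1:
        particles = [(x + velocity(x, y, t)[0] * dt,
                      y + velocity(x, y, t)[1] * dt) for x, y in particles]
        t += dt
    return particles

sensors = [(1.0, 1.0), (2.0, 0.5)]
tracks = advect(sensors, 0.0, 0.1)
```

The design point is that each timestep needs only a handful of point queries, so a laptop can "observe" a 30 TB simulation without ever downloading it.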
Sample Applications
- Lagrangian time correlation in turbulence (Yu & Meneveau, Phys. Rev. Lett. 104, 084502, 2010)
- Measuring velocity gradient using a new set of 3 invariants (Luethi, Holzner & Tsinober, J. Fluid Mechanics 641, pp. 497-507, 2010)
- Experimentalists testing PIV-based pressure-gradient measurement (X. Liu & Katz, 61st APS-DFD meeting, November 2008)
Commonalities
- Huge amounts of data, aggregates needed
  - But also need to keep raw data
  - Need for parallelism
- Use patterns enormously benefit from indexing
  - Rapidly extract small subsets of large data sets
  - Geospatial everywhere
  - Compute aggregates
  - Fast sequential read performance is critical!!!
  - But, in the end everything goes... search for the unknown!!
- Data will never be in one place
  - Newest (and biggest) data are live, changing daily
- Fits DB quite well, but no need for transactions
- Design pattern: class libraries wrapped in SQL UDFs
- Take analysis to the data!!

Astro-Statistical Challenges
- The crossmatch problem (multi-, time domain)
- The distance problem, photometric redshifts
- Spatial correlations (auto, cross, higher order)
- Outlier detection in many dimensions
- Statistical errors vs systematics
- Comparing observations to models
- ...The unknown unknown
- Scalability!!!
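The "indexing to rapidly extract small subsets" pattern can be sketched with a simple declination-zone index, in the spirit of the zones approach used in SkyServer-style databases. The zone height and toy catalog below are made-up illustration values:

```python
from collections import defaultdict

ZONE_HEIGHT = 0.5  # degrees; a tuning knob, not a value from the talk

def build_zone_index(objects):
    """Bucket (id, ra, dec) objects by declination zone so a small sky
    region can be fetched without scanning the whole catalog."""
    index = defaultdict(list)
    for obj_id, ra, dec in objects:
        index[int(dec // ZONE_HEIGHT)].append((obj_id, ra, dec))
    return index

def region_query(index, ra_min, ra_max, dec_min, dec_max):
    """Touch only the zones overlapping the search box."""
    hits = []
    for z in range(int(dec_min // ZONE_HEIGHT), int(dec_max // ZONE_HEIGHT) + 1):
        for obj_id, ra, dec in index.get(z, []):
            if ra_min <= ra <= ra_max and dec_min <= dec <= dec_max:
                hits.append(obj_id)
    return hits

catalog = [(1, 10.0, 0.1), (2, 10.2, 0.3), (3, 200.0, -45.0)]
idx = build_zone_index(catalog)
found = region_query(idx, 9.0, 11.0, 0.0, 1.0)  # never visits object 3's zone
```

In a database this lives as a zone column plus a clustered index, which is also what makes the "take analysis to the data" UDF pattern fast.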
The Cross Match
- Match objects in catalog A to catalog B
- Cardinalities soon in the billions
- How to estimate and include priors?
- How to deal with moving objects?
- How to come up with fast, parallel algorithms?
- How to create tuples among many surveys and avoid a combinatorial explosion?
- Was an ad hoc, heuristic process for a long time

The Cross Matching Problem
- The Bayes factor compares two hypotheses:
  - H: all observations are of the same object
  - K: they might be from separate objects
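A sketch of the two-catalog Bayes factor in its commonly quoted small-angle form (following the Budavari & Szalay line of work; the exact normalization here is an assumption of this sketch, and the separations and errors below are made-up values):

```python
import math

ARCSEC = math.pi / (180 * 3600)  # radians per arcsecond

def bayes_factor(psi, sigma1, sigma2):
    """Small-angle Bayes factor for two detections separated by angle psi,
    with circular astrometric errors sigma1, sigma2 (all in radians).
    B >> 1 favors H (same object); B << 1 favors K (separate objects)."""
    s2 = sigma1 ** 2 + sigma2 ** 2
    return (2.0 / s2) * math.exp(-psi ** 2 / (2.0 * s2))

# Two detections 0.2" apart with 0.1" errors: strong evidence for a match...
b_close = bayes_factor(0.2 * ARCSEC, 0.1 * ARCSEC, 0.1 * ARCSEC)
# ...while 5" apart the factor collapses toward zero.
b_far = bayes_factor(5.0 * ARCSEC, 0.1 * ARCSEC, 0.1 * ARCSEC)
```

Because the factor depends only on separations and error ellipses, it extends naturally to tuples across many surveys, which is what keeps the combinatorial explosion in check.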
(Astrometry on the sky: Budavari & Szalay 2009)

Photometric Redshifts
- Normally, distances come from Hubble's Law
- Measure the Doppler shift of spectral lines => distance!
- But spectroscopy is very expensive
  - SDSS: 640 spectra in 45 minutes vs. 300K 5-color images
- Future big surveys will have no spectra
  - LSST, Pan-STARRS
  - Billions of galaxies
- Idea: multicolor images are like a crude spectrograph
- Statistical estimation of the redshifts/distances
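A minimal sketch of such statistical estimation: a nearest-neighbor lookup in color space against galaxies with known spectroscopic redshifts. The training colors and redshifts below are fabricated for illustration, not real SDSS photometry, and production estimators use trees or GPUs to scale to billions of objects:

```python
def knn_photoz(colors, training_set, k=3):
    """Estimate redshift as the average over the k training galaxies whose
    colors are closest to the query object's colors."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = sorted(training_set, key=lambda g: dist2(g[0], colors))[:k]
    return sum(z for _, z in nearest) / k

# Toy training set: (u-g, g-r, r-i, i-z) colors with known spectroscopic z.
train = [
    ((1.2, 0.5, 0.2, 0.1), 0.05),
    ((1.4, 0.7, 0.3, 0.1), 0.10),
    ((1.8, 1.0, 0.5, 0.3), 0.30),
    ((2.0, 1.2, 0.6, 0.4), 0.35),
    ((2.3, 1.4, 0.8, 0.5), 0.50),
]
z_est = knn_photoz((1.9, 1.1, 0.55, 0.35), train, k=3)
```

This is the "multicolor images as a crude spectrograph" idea made concrete: the estimate is only as good as the training set's coverage, which is where the systematic errors discussed next come from.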
Photometric Redshift Methods
- Phenomenological (PolyFit, ANNz, kNN, RF)
  - Simple, quite accurate, fairly robust
  - Little physical insight, difficult to extrapolate, Malmquist bias
- Template-based (KL, HyperZ)
  - Simple, physical model
  - Calibrations, templates, issues with accuracy
- Hybrid (base learner)
  - Physical basis, adaptive
  - Complicated, compute intensive
- Important for next generation surveys!
- We must understand the errors!
  - Most errors are systematic
  - Lessons from the Netflix challenge

Cyberbricks
- 36-node Amdahl cluster using 1200W total
- Zotac Atom/ION motherboards
  - 4GB of memory, N330 dual core Atom, 16 GPU cores
- Aggregate disk space 43.6 TB
  - 63 x 120GB SSD = 7.7 TB
  - 27 x 1TB Samsung F1 = 27.0 TB
  - 18 x 0.5TB Samsung M1 = 9.0 TB
- Blazing I/O performance: 18 GB/s
- Amdahl number = 1 for under $30K
- Using the GPUs for data mining:
  - 6.4B multidimensional regressions (photo-z) in 5 minutes over 1.2 TB
  - Ported RF module from R to C#/CUDA
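Amdahl's balanced-system rule of thumb calls for about one bit of sequential IO per instruction per second. The cluster's "Amdahl number = 1" claim can be checked with rough arithmetic, under the simplifying assumption of ~1 instruction per cycle per Atom core:

```python
# Rough check of the 'Amdahl number = 1' claim for the 36-node cluster.
# Assumption: ~1 instruction per cycle per core (a simplification).
nodes = 36
cores_per_node = 2           # N330 dual-core Atom
clock_hz = 1.6e9             # N330 clock rate
seq_io_bytes_per_sec = 18e9  # aggregate sequential IO, 18 GB/s

instr_per_sec = nodes * cores_per_node * clock_hz   # ~1.15e11
io_bits_per_sec = seq_io_bytes_per_sec * 8          # ~1.44e11
amdahl_number = io_bits_per_sec / instr_per_sec     # ~1.25, i.e. of order 1
```

The point of the design: ordinary servers are IO-starved (Amdahl number far below 1), while pairing slow CPUs with many locally attached disks restores the balance cheaply.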
The Impact of GPUs
- Reconsider the "N log N only" approach
- Once we can run 100K threads, maybe running SIMD N^2 on smaller partitions is also acceptable
- Recent JHU effort on integrating CUDA with SQL Server, using SQL UDFs
- Galaxy spatial correlations: 600 trillion galaxy pairs using a brute-force N^2 algorithm
- Faster than the tree codes!
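The brute-force N^2 kernel behind such correlation measurements is just pair counting in separation bins. A minimal sketch (the points and bins are toy values; the GPU version runs the same doubly nested loop across 100K threads):

```python
import math

def pair_counts(points, bin_edges):
    """Brute-force N^2 pair counting in separation bins -- the building
    block of two-point correlation estimators. Tree codes prune pairs;
    on GPUs the naive loop can win by keeping all threads busy."""
    counts = [0] * (len(bin_edges) - 1)
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):
            r = math.dist(points[i], points[j])
            for b in range(len(counts)):
                if bin_edges[b] <= r < bin_edges[b + 1]:
                    counts[b] += 1
                    break
    return counts

pts = [(0, 0), (1, 0), (0, 1), (3, 4)]
dd = pair_counts(pts, [0.0, 2.0, 6.0])  # 3 close pairs, 3 distant pairs
```

Per pair the work is a handful of flops with no pointer chasing, which is exactly the regular, memory-coalesced pattern where SIMD hardware beats the irregular traversal of a tree code.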
(Tian, Budavari, Neyrinck, Szalay 2010; BAO)

The Hard Problems
- Outlier detection, extreme value distributions
- Comparing observations to models
- The unknown unknown
- In 10 years catalogs in the billions, raw data 100 PB+
- Many epochs, many colors, many instruments
- SCALABILITY!!!

DISC Needs Today
- Disk space, disk space, disk space!!!!
- Current problems not on Google scale yet:
  - 10-30 TB easy, 100 TB doable, 300 TB really hard
  - For detailed analysis we need to park data for several months
- Sequential IO bandwidth
  - If not sequential for large data sets, we cannot do it
- How do we move 100 TB within a university?
  - 1 Gbps: 10 days
  - 10 Gbps: 1 day (but need to share backbone)
  - 100 lbs box: few hours
- From outside? Dedicated 10 Gbps or FedEx

Tradeoffs Today
- Stu Feldman: Extreme computing is about tradeoffs
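The 1 Gbps vs 10 Gbps transfer-time figures follow from simple arithmetic, assuming an idealized sustained line rate with no protocol overhead:

```python
def transfer_days(terabytes, gbps):
    """Days to move a data set at a sustained line rate (idealized:
    no protocol overhead, no contention on a shared backbone)."""
    bits = terabytes * 1e12 * 8
    return bits / (gbps * 1e9) / 86400

# 100 TB over the network, as quoted above:
t_1g = transfer_days(100, 1)    # roughly 9-10 days at 1 Gbps
t_10g = transfer_days(100, 10)  # roughly 1 day at 10 Gbps
```

A 100 lbs box of disks driven across campus beats both, which is why "FedEx bandwidth" remains a serious option at these scales.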
Ordered priorities for data-intensive scientific computing:
1. Total storage (-> low redundancy)
2. Cost (-> total cost vs price of raw disks)
3. Sequential IO (-> locally attached disks, fast controllers)
4. Fast stream processing (-> GPUs inside the server)
5. Low power (-> slow normal CPUs, lots of disks per motherboard)

The order will be different in a few years... and scalability may appear as well.

Cost of a Petabyte
(From backblaze.com, Aug 2009)
JHU Data-Scope
- Funded by NSF MRI to build a new instrument to look at data
- Goal: 10^2 servers for $1M + about $200K switches + racks
- Two-tier: performance (P) and storage (S)
- Large (5PB) + cheap + fast (400+ GBps), but... a special-purpose instrument

                     1P     1S    90P    12S   Full
  servers             1      1     90     12    102
  rack units          4     12    360    144    504
  capacity [TB]      24    252   2160   3024   5184
  price [$K]        8.5   22.8    766    274   1040
  power [kW]          1    1.9     94     23    116
  GPU [TF]            3      0    270      0    270
  seq IO [GBps]     4.6    3.8    414     45    459
  network bw [Gbps]  10     20    900    240   1140

Proposed Projects at JHU

  Discipline          data [TB]
  Astrophysics            930
  HEP/Material Sci.       394
  CFD                     425
  BioInformatics          414
  Environmental           660
  Total                  2823
- 19 projects total proposed for the Data-Scope, more coming, data lifetimes between 3 months and 3 years

Fractal Vision
- The Data-Scope created a lot of excitement but also a lot of fear at JHU
  - Pro: solve problems that exceed group scale, collaborate
  - Con: are we back to centralized research computing?
- Clear impedance mismatch between monolithic large systems and individual users
- e-Science needs different tradeoffs from e-commerce
- Larger systems are more efficient
- Smaller systems have more agility
- How to make it all play nicely together?
Increased Diversification
- One shoe does not fit all!
- Diversity grows naturally, no matter what
- Evolutionary