It’s not a data deluge – it’s worse than that
Craig Stewart, Executive Director, Indiana University Pervasive Technology Institute
Associate Dean, Research Technologies
Keynote address for the Third International Workshop on “Data Intensive Distributed Computing (DIDC'10)”, held in conjunction with HPDC'10 in Chicago, IL on June 22nd, 2010, organized by Tevfik Kosar, DIDC'10 Workshop Chair and Assistant Professor, Department of Computer Science & Center for Computation & Technology, Louisiana State University. http://www.cct.lsu.edu/~kosar/didc10/
License Terms
• Except where otherwise noted, contents of this presentation are Copyright 2011 by the Trustees of Indiana University.
• Stewart, C.A. 2010. It’s not a data deluge – it’s worse than that. Keynote presentation. Third International Workshop on “Data Intensive Distributed Computing (DIDC'10)” held in conjunction with HPDC'10. June 22nd, 2010, Chicago, IL. http://hdl.handle.net/2022/13195
• This document is released under the Creative Commons Attribution 3.0 Unported license (http://creativecommons.org/licenses/by/3.0/). This license includes the following terms: You are free to share – to copy, distribute and transmit the work and to remix – to adapt the work under the following conditions: attribution – you must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work). For any reuse or distribution, you must make clear to others the license terms of this work.
Graphic from Gantz, J.F. and D. Reinsel. IDC Digital Universe Study, Sponsored by EMC, "The Digital Universe Decade - Are You Ready?" May 2010. Used by permission of IDC. May not be reused without permission from IDC (www.idc.com).
Files Stored in IU HPSS
This slide licensed under Creative Commons 3.0 Attribution license. If any portion of this slide is reused please cite: Kurt Seiffert, http://pti.iu.edu/storage/
Aggregate Data Stored in IU HPSS
This slide licensed under Creative Commons 3.0 Attribution license. If any portion of this slide is reused please cite: Kurt Seiffert, http://pti.iu.edu/storage/
Data set sizes and time to move

Data Size    Network capability    Time to move
1 TB         100 Mbps              1 day
1 TB         1 Gbps                2.4 hours
1 TB         10 Gbps               14.7 mins
1 TB         100 Gbps              88 seconds
1 PB         100 Mbps              2.9 years
1 PB         1 Gbps                3.4 months
1 PB         10 Gbps               10.4 days
1 PB         100 Gbps              25 hours
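The transfer times above can be checked with a few lines of arithmetic. The following is a minimal sketch, not the author's calculation: it assumes binary units (1 TB = 2**40 bytes, 1 PB = 2**50 bytes), which appear to be what the figures are based on, and it assumes the link runs at its full nominal rate with no protocol overhead. The function names are illustrative only.

```python
# Assumptions (not stated on the slide): binary units and an ideal link
# that sustains its nominal rate with no protocol overhead.
TB = 2**40  # bytes
PB = 2**50  # bytes

SECONDS_PER_DAY = 86400
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY
SECONDS_PER_MONTH = SECONDS_PER_YEAR / 12


def transfer_seconds(size_bytes: float, link_bits_per_second: float) -> float:
    """Ideal time in seconds to move size_bytes over the given link."""
    return size_bytes * 8 / link_bits_per_second


def humanize(seconds: float) -> str:
    """Express a duration in the largest unit that gives a value >= 1."""
    units = (
        ("years", SECONDS_PER_YEAR),
        ("months", SECONDS_PER_MONTH),
        ("days", SECONDS_PER_DAY),
        ("hours", 3600),
        ("mins", 60),
    )
    for name, length in units:
        if seconds >= length:
            return f"{seconds / length:.1f} {name}"
    return f"{seconds:.0f} seconds"


if __name__ == "__main__":
    links = (("100 Mbps", 100e6), ("1 Gbps", 1e9),
             ("10 Gbps", 10e9), ("100 Gbps", 100e9))
    for label, size in (("1 TB", TB), ("1 PB", PB)):
        for link_name, rate in links:
            print(label, link_name, humanize(transfer_seconds(size, rate)))
```

Real transfers are usually slower than these ideal figures, since disks, protocols, and shared links rarely sustain the nominal rate end to end.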
Why it’s worse than just a deluge
• Not all data are equal #1: sensitive data
• Campus networks and proliferation of digital instruments
• Need for complexity hiding interfaces (Science gateways and others)
• Metadata to enable later use and use in other disciplines
• Where are the definitive repositories?
• Not all data are equal #2: Who decides what is worth storing, and where do you store it?
Cyberinfrastructure
• What is cyberinfrastructure?
  – Cyberinfrastructure consists of computing systems, data storage systems, advanced instruments and data repositories, visualization environments, and people, all linked together by software and high performance networks to improve research productivity and enable breakthroughs not otherwise possible.
• Much of this discussion is about cyberinfrastructure for handling data rather than basic computer science
Handling sensitive data
• Two established approaches
  – Protect the data really well
  – Collect only the data that are essential
Fetal Alcohol Spectrum Disorder
This slide licensed under Creative Commons 3.0 Attribution license. If any portion of this slide is reused please cite: Jeff Rogers, the UITS Research Technologies Advanced Visualization Lab, http://www.avl.iu.edu/
Dietary Intake Monitoring Application
• Portable device to track nutrition anywhere
• Bar code scanner for easiest input
• Icons for foods without bar codes
• Real-time feedback
This slide licensed under Creative Commons 3.0 Attribution license. If any portion of this slide is reused please cite: Kay Connelly, http://www.cs.indiana.edu/surg/Projects/DIMA_SURG_Page/DIMA.html
Funded by grant NIBIB EB007083: “Self Monitoring of Dietary and Fluid Intake Using a PDA”, PI Janet Welch
Ability to deal with sensitive data - Increasingly important
• What are you doing about H1N1?
• What are you doing about Haiti?
• What are you doing about the oil spill?
• What are you doing about _______?
• Bayer and Baycol
• Someday we will all be ex-human subjects
• Someday we (generally) hope to be old, and applications that help us do that well are interesting
• HIPAA - How to secure data is a reasonably well-solved problem; it is just hard and a matter of doing it.
Data creation and transport
• Unite data and compute
• Network bottleneck issues
• A few example custom-crafted solutions
Large-scale data sources

Data Source                                        Data Production / Year
Large Synoptic Survey Telescope                    6 PB
Large Hadron Collider                              15 PB
One Degree Imager                                  500 TB – 1.5 PB
COLA (Center for Ocean-Land-Atmosphere Studies)    1.8 PB/year
Ocean Observatories Initiative                     1 PB
CRESIS                                             20 TB/expedition
LEAD                                               4.7 TB/prediction season
EVIA - 200 hours video in a 2-month trip           1.2 TB/trip
Earthscope                                         920 GB

– Put data and compute together (traditional supercomputer center model)?
  • Subsetting services?
– Create a data grid?
– Accept some inefficiency?
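To put the yearly production figures in network terms, one can ask what sustained link rate would be needed just to keep up with a source that produces data continuously. This is a rough sketch under stated assumptions: binary petabytes (1 PB = 2**50 bytes), data produced evenly across a 365-day year, and no protocol overhead. Only the sources listed with per-year totals are included, and the dictionary and function names are illustrative.

```python
# Assumptions: binary petabytes, perfectly even production over a 365-day
# year, and an ideal link. Real sources are bursty, so peak needs are higher.
PB = 2**50
SECONDS_PER_YEAR = 365 * 24 * 3600

# Yearly totals taken from the table above.
YEARLY_PRODUCTION_BYTES = {
    "Large Synoptic Survey Telescope": 6 * PB,
    "Large Hadron Collider": 15 * PB,
    "COLA": 1.8 * PB,
    "Ocean Observatories Initiative": 1 * PB,
}


def sustained_gbps(bytes_per_year: float) -> float:
    """Average link rate (Gbps) needed to move one year of output in one year."""
    return bytes_per_year * 8 / SECONDS_PER_YEAR / 1e9


if __name__ == "__main__":
    for source, volume in YEARLY_PRODUCTION_BYTES.items():
        print(f"{source}: {sustained_gbps(volume):.2f} Gbps sustained")
```

Under these assumptions the Large Hadron Collider figure works out to a little over 4 Gbps around the clock, which is one way to see why each of the options listed above involves a trade-off.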
Citations for data sources

Data Source                                        Citation
Large Synoptic Survey Telescope                    http://www.lsst.org/lsst/science/concept_data
Large Hadron Collider                              http://public.web.cern.ch/public/en/LHC/Computing-en.html
One Degree Imager                                  www.astro.yale.edu/odisurvey/talks/harbeck.pdf
COLA (Center for Ocean-Land-Atmosphere Studies)    http://www.iges.org/cola.html
Ocean Observatories Initiative                     http://ci.oceanobservatories.org/
CRESIS                                             https://www.cresis.ku.edu/data
LEAD                                               http://portal.leadproject.org/gridsphere/gridsphere
EVIA - 200 hours video in a 2-month trip           http://www.eviada.org/
Earthscope                                         http://www.earthscope.org/
IU’s Data Capacitor WAN
• 1 pair Dell PowerEdge 2950 for MDS
• 2 pair Dell PowerEdge 2950 for OSS
  – 2 x 3.0 GHz Dual Core Xeon
  – Myrinet 10G Ethernet
  – Dual port Qlogic 2432 HBA (4 x FC)
  – 2.6 Kernel (RHEL 5)
• DDN S2A9550 Controller
  – Over 2.4 GB/sec measured throughput
  – 360 Terabytes of spinning SATA disk
• Currently running Lustre 1.8.1.1
• UID mapping
• Rootsquash
• Trade efficiency for effectiveness and simplicity in globally-accessible file system
This slide licensed under Creative Commons 3.0 Attribution license. If any portion of this slide is reused please cite: Stephen Simms, UITS Research Technologies Data Capacitor Team, http://pti.iu.edu/dc
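The UID mapping and root squash bullets name two ideas that matter when one file system is mounted from several sites with independent account databases. The sketch below only illustrates the general idea in plain Python; it is not Lustre's implementation or configuration syntax. The site names, UID values, the `effective_uid` function, and the choice to squash unmapped users are all assumptions made for the example.

```python
# Conceptual illustration only; not Lustre code or configuration.
# Assumption: 65534 is used as the unprivileged "nobody" UID.
NOBODY_UID = 65534


def effective_uid(site: str, remote_uid: int, uid_map: dict,
                  squash_uid: int = NOBODY_UID) -> int:
    """Return the UID the file system should act as for a remote request.

    Root squash: a remote root user (UID 0) is never trusted as root here.
    UID mapping: a remote user's UID is translated to the local UID that
    belongs to the same person. Unmapped users are squashed as well, which
    is a design choice of this sketch.
    """
    if remote_uid == 0:
        return squash_uid
    return uid_map.get((site, remote_uid), squash_uid)


if __name__ == "__main__":
    # Hypothetical mapping: the same researcher has UID 1001 at one site
    # and UID 5001 on the file system's home site.
    mapping = {("site_a", 1001): 5001}
    print(effective_uid("site_a", 1001, mapping))  # mapped user
    print(effective_uid("site_a", 0, mapping))     # remote root, squashed
```

Mapping and squashing keep ownership consistent and limit the trust placed in remote administrators, which is part of what makes a wide-area mount practical.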
Gas Giant Planet Formation
• Uses IU Lustre-WAN
• Accept some inefficiency on reads
• But interactivity a big gain
This research by: Scott Michael, Stephen Simms, and UITS Research Technologies Data Capacitor Team. This slide licensed under Creative Commons 3.0 Attribution license. Please cite Scott Michael, Stephen Simms and http://pti.iu.edu/dc if any portion of this slide is reused.
Single Lab Biological Instruments

Type of instrument                        Model                                                   Raw image data    Data products
Light Microscopy                          BD Pathway 855 Bioimager                                N/A               7 GB/day
Genome sequencing                         Roche 454 Life Sciences genome analyzer system          39 GB/day         9 GB/day
Genome sequencing                         Illumina-Solexa genome analyzer system                  367 GB/day        100 GB/day
Genome sequencing                         ABI SOLID 3                                             238 GB/day        150 GB/day
Microarray Gene Expression Chip Reader    Molecular Devices GenePix Professional 4200A Scanner    N/A               8 MB/day
Microarray Gene Expression Chip Reader    NimbleGen Hybridization System 4                        N/A               300 MB/day

– Common instance of data bottleneck examples
Citations for data sources

Model                                                   Reference
BD Pathway 855 Bioimager                                http://www.bdbiosciences.com/eu/instruments/pathway/index.jsp
Roche 454 Life Sciences genome analyzer system          https://www.roche-applied-science.com/servlet/RCProductDisplay?storeId=10202&catalogId=10202&langId=-1&countryId=us&forCountryId=us&productId=3.8.8.1.1.3
Illumina-Solexa genome analyzer system                  http://www.illumina.com/systems/genome_analyzer.ilmn
ABI SOLID 3                                             http://www.illumina.com/systems/genome_analyzer.ilmn
Molecular Devices GenePix Professional 4200A Scanner    http://www.selectscience.net/products/genepix-professional-4200a-microarray-scanner/?prodID=81832
NimbleGen Hybridization System 4                        http://www.nimblegen.com/products/instruments/hybridization/4-slide/
The OptIPuter Project: Creating High Resolution Portals Over Dedicated Optical Channels to Global Science Data
Picture Source: Mark Ellisman, David Lee, Jason Leigh
Calit2 (UCSD, UCI), SDSC, and UIC Leads: Larry Smarr, PI
Univ. Partners: NCSA, USC, SDSU, NW, TA&M, UvA, SARA, KISTI, AIST
Industry: IBM, Sun, Telcordia, Chiaro, Calient, Glimmerglass, Lucent
Scalable Adaptive Graphics Environment (SAGE)
• Overprovisioning everywhere will not work (CIO firing, economy)
• IU
  – Distributed storage
  – Machine room backplane
  – Custom crafted network runs
• Cal-IT2
  – Quartzite
  – Optiputer
The above image licensed under Creative Commons 3.0 Attribution license. If reused please cite: Maria Morris, Matthew Link, David Y. Hancock, Stephen Simms, and UITS Research Technologies Division, http://rc.uits.indiana.edu/ci
PolarGrid (Geoffrey Fox, PI)
Local storage -> Local Analysis -> Upload image -> Image analyzed via Web
Local storage -> Local Analysis -> Ship disks to Indiana -> Analyze in detail
Photos on this slide are licensed under Creative Commons 3.0 Attribution license. If reused please cite: Keith Lehigh, Indiana University
Ethnographic Video for Instruction and Analysis: EVIA (Ruth Stone, project director)
[Diagram labels: Samba; Video Acquisition Server; Compression/Annotation Server; HPSS; link distances of 1 mile and 346 miles]
This slide licensed under Creative Commons 3.0 Attribution license. If any portion of this slide is reused please cite: Ruth Stone, EVIA team, Stephen Simms and UITS Research Technologies Data Capacitor Team, http://www.eviada.org/, http://pti.iu.edu/dc
Summing up data movement
• Some proven approaches
  – Just put everything together
  – Build a data grid
• Newer approaches
  – Handcraft solutions
  – Distributed file systems
  – Accept some inefficiencies
• Network advances will continue to help
  – 100 Gbps
  – Dynamic networking services
  – Don’t underestimate bandwidth of a FedEx shipment
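The remark about shipping can be made concrete by treating a box of disks as a link: its effective bandwidth is the data volume divided by the door-to-door time. The sketch below uses made-up numbers for illustration (20 drives of 2 TB each, 24 hours in transit); they are not from the talk. It also ignores the time needed to copy data onto and off the drives, which can be substantial.

```python
# Hypothetical illustration: drive count, drive size, and transit time are
# invented for the example. Copy-in and copy-out time is ignored.
TB = 2**40  # binary terabyte, in bytes
PB = 2**50  # binary petabyte, in bytes


def shipment_gbps(total_bytes: float, transit_seconds: float) -> float:
    """Effective bandwidth of a physical shipment, in gigabits per second."""
    return total_bytes * 8 / transit_seconds / 1e9


def shipping_is_faster(total_bytes: float, transit_seconds: float,
                       link_bits_per_second: float) -> bool:
    """True if the shipment arrives before an ideal network transfer finishes."""
    network_seconds = total_bytes * 8 / link_bits_per_second
    return transit_seconds < network_seconds


if __name__ == "__main__":
    box = 20 * 2 * TB      # 20 drives x 2 TB (hypothetical)
    overnight = 24 * 3600  # one day door to door (hypothetical)
    print(f"{shipment_gbps(box, overnight):.2f} Gbps effective")
    # 1 PB shipped in two days versus an ideal 10 Gbps link:
    print(shipping_is_faster(PB, 2 * 24 * 3600, 10e9))
```

The comparison flips for small data sets: a single terabyte moves over a 1 Gbps link in a few hours, well inside any overnight delivery window.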
Good crop blight severity forecasts using LEAD weather data
This slide licensed under Creative Commons 3.0 Attribution license. If any portion of this slide is reused please cite: Beth Plale, http://www.soic.indiana.edu/people/profiles/plale-beth.shtml
6/3/2010 14Z radar (left) and helicity (bottom)
LEAD II participation in Spring 2010 Vortex2 tornado field study: 240 total weather workflow runs, 9000+ data products
Trident workflow workbench used to launch hybrid workflows across Linux and HPC Server
This slide licensed under Creative Commons 3.0 Attribution license. If any portion of this slide is reused please cite: Beth Plale, http://www.soic.indiana.edu/people/profiles/plale-beth.shtml
MyOSG – Grid Consolidation
Discovery, Health, Information Gathering, Accounting
All available in a variety of human and machine readable formats including XML, UWI, iGoogle and Wave, mobile, CSV, along with HTML.
This slide licensed under Creative Commons 3.0 Attribution license. If any portion of this slide is reused please cite: Rob Quick and Soichi Hayashi, http://osggoc.blogspot.com/
Complexity Hiding Interfaces
• Allow creation of complex workflows on the fly
• Good demonstration of the “not otherwise possible” part of the CI definition
• Outreach and Engagement
• Bridge the gap between creation of sophisticated data products and information in one domain and their application in another (Plale)
Why we need metadata
Hwæt! We Gardena in geardagum, þeodcyninga, þrym gefrunon, hu ða æþelingas ellen fremedon. Oft Scyld Scefing sceaþena þreatum…
30 0 0.0 139.5 000.0 0.0060 0.02123 -20.48 098.4571 26.2 . .0053 .02123 -20.48 98.4557 . .0057 .02123 -20.47 98.4536 . .0060 .02123 -20.44 98.4533 . .0055 .02123 -20.46 98.4557 . .5760 .43607 0.00 98.4396 408.03 . .5707 .43247 0.00 98.4319 408.03 . .5696 .43161 0.00 98.4350 408.03 . .5718 .43325 0.00 98.4305 408.83 . .5755 .43450 0.00 98.4305 409.16
30 0 5.0 142. . .0045 .02169 1.38 98.8949 26.4 . .0047 .02169 1.39 98.8938 . .0045 .02167 1.38 98.8952 . .0045 .02167 1.41 98.8942 . .0045 .02164 1.41 98.8942 . .4821 .36409 5.45 98.9020 412.24 . .4821 .36512 5.46 98.9020 412.18 . .4847 .36733 5.46 98.8991 412.01 . .4857 .36851 5.46 98.8960 411.78 . .4879 .37028 5.46 98.8949 411.78
CIFASD Data Dictionary example
This slide licensed under Creative Commons 3.0 Attribution license. If any portion of this slide is reused please cite: Yelena Yezerets, Andrew Arenson, http://hdl.handle.net/2022/9021
XMC Cat Metadata catalog (Plale)
[Diagram labels: XMC Cat Metadata Catalog; Browse, Monitor, Search, Compose; Workflow Portal; Message Bus; Workflow Notifications; Inputs, Intermediate Results, Results, Event Notifications, Configuration]
Metadata catalog that fits into SOA, can work with alternate back end storage solutions (e.g., file system, iRODS)
Adapts to community specified XML schema and configures GUI on the fly
Harvests metadata in real time
“Metadata is key to being able to share results” - UK e-Science Core Programme
This slide licensed under Creative Commons 3.0 Attribution license. If any portion of this slide is reused please cite: Beth Plale, http://www.soic.indiana.edu/people/profiles/plale-beth.shtml
Metadata & Provenance
• Need for reusability of complex data sets (and data sets whose creation is complex)
  – Multidisciplinary and interdisciplinary research
  – Future use
  – Finding the darn stuff
  – ‘Walking the cat backwards’
  – Are these data really what they say they are?
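Two of the questions above, whether data are what they claim to be and how to walk a result back to its sources, can be illustrated with a very small provenance record. This is a sketch under stated assumptions, not the design of any catalog named in this talk: the record fields, file names, and sample bytes are invented, and only Python's standard `hashlib` and `datetime` modules are used.

```python
import hashlib
from datetime import datetime, timezone


def describe(data: bytes, columns: dict, derived_from: list,
             created_by: str) -> dict:
    """Build a minimal metadata and provenance record (hypothetical fields)."""
    return {
        "sha256": hashlib.sha256(data).hexdigest(),  # fixity check value
        "size_bytes": len(data),
        "columns": columns,            # data dictionary: name -> meaning
        "derived_from": derived_from,  # names of the direct input data sets
        "created_by": created_by,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


def verify(data: bytes, record: dict) -> bool:
    """Are these data really what the record says they are?"""
    return hashlib.sha256(data).hexdigest() == record["sha256"]


def lineage(name: str, catalog: dict) -> list:
    """Walk derived_from links back to the original sources."""
    seen, stack = [], [name]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.append(current)
        stack.extend(catalog[current]["derived_from"])
    return seen


if __name__ == "__main__":
    raw = b"station,temp_c\nA,26.2\n"  # made-up sample bytes
    catalog = {
        "raw.csv": describe(raw, {"temp_c": "air temperature, Celsius"},
                            [], "instrument"),
        "clean.csv": describe(b"cleaned", {}, ["raw.csv"], "qc-script"),
        "plot.png": describe(b"image", {}, ["clean.csv"], "plot-script"),
    }
    print(verify(raw, catalog["raw.csv"]))
    print(lineage("plot.png", catalog))
```

Even a record this small answers the questions the unlabeled column of numbers earlier in the talk cannot: what the values mean, where they came from, and whether they have changed.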
Who decides what has to be kept and shared ... And where?
• Financial models?
• Legislation and policy
  – NIH data policy
  – H.R. 5037 ‘Federal Research Public Access Act of 2009’
  – New NSF data policy
• Astronomy & telescope data
• Please tell me it’s not all worth keeping
Libraries? VOs? Institutions (which?)
Library Volumes TB
Great Library of Alexandria 20,000
Library of Congress 111M 20
Google Books 80M
DataOne 10s
Data Conservancy 500
Artistic Rendering of "The Great Library of Alexandria" by O. Von Corven from Tolzmann, Don Heinrich, Alfred Hessel and Reuben Peiss. The Memory of Mankind. New Castle, DE: Oak Knoll Press, 2001. www.ils.unc.edu/dpr/path/alexandria/
TeraGrid is not likely the answer
• No clear mandate or funding to be big data warehouse
• TG 2008 BOF Data management identified significant problems in data movement related to the TeraGrid:
  – Problem #1: moving data from outside TG into TG
  – Problem #2: moving data within TG
This slide licensed under Creative Commons 3.0 Attribution license. If any portion of this slide is reused please cite: Courtesy Indiana University Pervasive Technology Institute, based on illustration by Nicolle Rager Fuller, National Science Foundation.
Overview of iRODS Architecture
[Diagram labels: User (can search, access, add, and manage data and metadata); iRODS Data System; iRODS Data Server (disk, tape, etc.); iRODS Metadata Catalog (track data); iRODS Rule Engine (track policies)]
This slide licensed under Creative Commons 3.0 Attribution license. If any portion of this slide is reused please cite: Reagan Moore, RENCI, http://www.renci.org/staff/reagan-moore
Lessons on people and organizations
• Things once considered inconceivable have often happened
• Resilience in physical design
• Resilience in organizational design
  – VOs
  – Consortia of Universities & Libraries
  – Public/Private Partnerships
• The really hard problems: ontologies, organizational structures, data ownership/privacy/sharing (Astronomy may have the best model now going)
In Conclusion…
• The problem is much more complex than just the media
• Needs for handling sensitive data will increase
• Networks (and the economic factors limiting network capacity growth) are more important than storage media as limiting factors
• Complexity hiding interfaces are essential for expanding usefulness and enabling multi/inter-disciplinary research
• Metadata and provenance issues are critical
• Not everything is worth saving
• A complex multi-faceted problem calls for multiple research approaches and many partial solutions rather than a new digital Library of Alexandria
Acknowledgments and Thanks
• CIFASD (Stewart/Barnett): Part of this work was done in conjunction with the Collaborative Initiative on Fetal Alcohol Spectrum Disorders funded by NIH NIAAA incl. grant 1U24AA014818-01
• DIMA (Connelly): Lilly Endowment; NIBIB EB007083 (PI Janet Welch)
• AVIDD (McRobbie/Stewart): NSF 0116050
• Data Capacitor, TeraGrid (Stewart/Simms/Plale): NSF CNS-0521433, ACI-0338618, OCI-0451237, OCI-0535258, and OCI-0504075
• Optiputer, Quartzite (Smarr, Papadopoulos, Cal-IT2 and UCSD): NSF 0225642, 0421555
• LEAD, XMCCat, Karma (Plale): 0721674, 0720580, 0630322, 0331480, Microsoft
• PolarGrid (Fox): CNS-0723054
• iRODS (Moore): NSF 1032732, 0910431, 0721400
• Pervasive Technology Institute: Lilly Endowment
• Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation, National Institutes of Health, Lilly Endowment, or other funding agencies.
• Thanks to many colleagues locally at IU and globally