Big Data R&D InitiativeR&D Initiative - DELS Microsite...
Transcript of Big Data R&D InitiativeR&D Initiative - DELS Microsite...
Big Data R&D InitiativeBig Data R&D Initiative
Suzi Iacono CISE Directorate
National Science Foundation
National Academies of ScienceNational Academies of ScienceIntegrating Environmental Health Data to Advance Discovery
January 11, 2013
Image Credit: Exploratorium.
Advances in information technologies are transforming the
fabric of our society and data represents a transformative newrepresents a transformative new
currency for science, engineering, education and commerce.
Image Credit: CCC and SIGACT CATCS
Where do the data come from?
Why do we have a national initiative?Why do we have a national initiative?
The Big Data Landscape I: Big Science
• Science gathers data at an ever-increasing rate across all scales and complexities of natural phenomenaSloan Digital Sky Survey in 2000 collected more data in its 1st• Sloan Digital Sky Survey in 2000 collected more data in its 1st
few weeks than had been amassed in the entire history of astronomy– Within a decade, over 140 terabytes of information
collected• Large Hadron Collider generates scores of petabytes a yearg g p y y• The proposed Large Synoptic Survey Telescope (3.3 gigapixel
digital camera) will generate 40 terabytes of data nightlyB 2015 th ld ill t th i l t f• By 2015, the world will generate the equivalent of approximately 93 million Libraries of Congress
The Big Data Landscape II: Smart Sensing, Reasoning and Decision-making
Situation Awareness: Humans as
Emergency ResponseEnvironment Sensing
Percepts (sensors)
Agent (Reasoning)
Humans as sensors feed multi-modal data streams Credit: Photo by US Geological Survey
Actions (controllers)
Pervasive Computing
People‐Centric Sensing Smart Health CareSocial Informatics
SenseEvaluate
Public Sensing
Social
Identify
Assess
Intervene
Source: Sajal Das, Keith Marzullo
Personal Sensing
Social Sensing
The Big Data Landscape III: New Paradigms for Communications
1988
Today
1988Remarkable
Pace of Innovation
EMAIL VOIP
BLOGSMOBILE SOCIAL NETWORKS
VIDEO
Communications Volume & Traffic DiversityTraffic Diversity
V IP
663M registered Skype users in 2011. Represents 20% of long distance minutes world-wide. If Skype
VoIP were a carrier, it would be the 3rd largest in the world (behind China Mobile and Vodaphone). Largest provider of cross-border communication.Recent estimates as high as 60% of internet traffic is
VideoRecent estimates as high as 60% of internet traffic is video and music sharing; 35 hours of new videos are uploaded every minute in 2011; 2 billion views per day.
Twitter Currently 175 million registered usersTwitter Currently 175 million registered users.
Broadband 20% of global internet users have residential broadband; 68% in US subscribe to broadband. b oadba d; 68% US subsc be to b oadba d
Mobile5.3 billion mobile phone subscribers; 85% of new handsets will be able to access the mobile web; 1 in 5 ;has access to fast service, 3G or better; IM, MMS, SMS expected to exceed 10 trillion message by 2013.
The Big Data Landscape IV: The Long Tail f S iof Science
• Hundreds of thousands of scientists and engineers work individually or in small distributed disconnected groups all generating data thatin small, distributed, disconnected groups – all generating data that collectively represent an enormous, largely untapped scientific resource
F i i l ti i t t– From running simulations, experiments, etc.
• Making heterogeneous data across many areas of science more homogeneous could give way to breakthroughs across all areas of science and engineering
• Estimated 40 exabytes of unique new information generated worldwide in 2010
• Only 5% of the information created is “structured,” however, in a standard format of words or numbers; the rest are unstructured text, voice images etcvoice, images, etc.
How Big is Big?
• “Big Data”: “Datasets whose size gare beyond the ability of typical database software tools to capture, store, manage, and analyze”
McKinsey Global Institute Big data: the-McKinsey Global Institute, Big data: the next frontier for innovation, competition, and productivity, May 2011.
Image Credit: Sigrid Knemeyer
…Not Just Volumes of Data
• The science of big data is not just aboutThe science of big data is not just about volumes and velocity of data, but also– Heterogeneity and diversityg y y
• Levels of granularity• Media formats
S i ifi di i li• Scientific disciplines– Complexity
• Uncertainty• Uncertainty• Incompleteness• Representation types
Why is Big Data Important?Why is Big Data Important?• Critical to transforming how science is done and to
accelerating the pace of discovery in almost everyaccelerating the pace of discovery in almost every science and engineering discipline
• Transformative implications for commerce and economy• Potential for addressing some of society’s most pressing
challenges
Image Credit: Chi Birmingham
Paradigm Shift: from Hypothesis-driven to Data-driven Discoveryto Data driven Discovery
The Fourth Paradigm: Data‐Intensive Scientific
The Economist, The data deluge and how to Data Intensive Scientific
Discovery (2009, Microsoft Corporation).
deluge and how to handle it: A 14‐page special report (Feb 25,
2010).
http://www.sciencemag.org/site/special/data/ http://www.economist.com/node/15579717 http://research.microsoft.com/en‐us/collaboration/fourthparadigm/
The Age of Data: From Data to Knowledge to ActionFrom Data to Knowledge to Action
• Data-driven discovery is l ti i i i tifi l tirevolutionizing scientific exploration
and engineering innovations
• Automatic extraction of new kno ledge abo t the ph sicalknowledge about the physical, biological and cyber world continues to accelerate
Multi cores concurrent and parallel• Multi-cores, concurrent and parallel algorithms, virtualization and advanced server architectures will enable data mining and machine l i d di dlearning, and discovery and visualization of Big Data
Potential for Transformational Science & Engineering:Engineering:
From Data to Knowledge to Action
I t ti f di i li ( di• Integration of discipline (or media format…) specific data, examine for relationshipsrelationships– Disaster informatics
• 3D toxic fume images• 3D toxic fume images• Simulations of gas spread• Maps of census concentrations• First responder on-the-ground findingsground findings• Evacuation routing
From Data to Knowledge to Action
• Researchers seek to fundamentally transform understanding of spinningtransform understanding of spinning giants
– The task involves assembling data from more storm variables-such as updraft, downdraft and vorticity or g p , yregions of spin--than what can be observed from ground tornado chasers or even actually produced in the atmosphere. For example, researchers need to better understand how changes in wind direction with height cause the updrafts in a storm to rotate, preceding the formation of a tornado.
– To solve the quandary, Amy McGovern, an associate professor in OU's School of Computer Science, and her team create tornado models with super computers that can process vast amounts of data. McGovern and her colleagues use the models to analyze how storm variables interact in order to identify tornadic and non-tornadic storms.
Examples of Research Challenges
• More data are being collected than we can store • Analyze the data as it becomes available• Decide what to archive and what to discard• Decide what to archive and what to discard
• Many data sets are too large to download• Analyze the data wherever it resides
M d t t t l i d t b bl• Many data sets are too poorly organized to be usable• Better organize and retrieve data
• Many data sets are heterogeneous in type, structure, semantics, organization granularity accessibilityorganization, granularity, accessibility …• Integrate and customize access to federate data
• Utility of data is limited by our ability to interpret and use itExtract and visualize actionable knowledge• Extract and visualize actionable knowledge
• Evaluate results• Large and linked datasets may be exploited to identify individuals
D i t d l i ith b ilt i i• Design management and analysis with built-in privacy preserving characteristics
A National Imperative
PCAST ll th F d l t• PCAST calls on the Federal government to increase R&D investments for collecting, storing, preserving, managing, analyzing, and sharing the g g, y g, gincreasing quantities of data.
• Furthermore, PCAST observed that the potential to gain new insights to movepotential to gain new insights … to move from data to knowledge to action has tremendous potential to transform all areas of national priority.
Source: PCAST (December 2010), “Report to the President and Congress: Designing a Digital Future…”– a periodic congressionally-mandated review of the Federal Networking and Information Technology Research and Development (NITRD) Program.
Administration’s Big Data Research and D l t I iti tiDevelopment Initiative
• Big Data Senior Steering Group – chartered in spring 2011 under the Networking and Information Technology R&D (NITRD) Program– Members from DARPA, DOD OSD, DHS, DOE-Science, HHS, NARA, NASA NIST NOAA NSA OFRNASA, NIST, NOAA, NSA, OFR, USGS, etc.– Co-chaired by NSF (and NIH)Co chaired by NSF (and NIH)– Initial charge was to come upwith a plan, a strategy
18
Image Credit: Fuqing Zhang and Yonghui Weng, Pennsylvania State University; Frank Marks, NOAA; Gregory P. Johnson, Romy Schneider, John Cazes, Karl Schulz, Bill Barth, The University of Texas at Austin
Big Data MembershipBiven Laura DOE Science [email protected] Alan NSFNational Science Foundation [email protected] Collica Leslie NISTNational Institute of Standards and Technology [email protected] Deift Abby NSFNational Science Foundation [email protected] Downing Gregory HHSDepartment of Health and Human Services [email protected]
Espina Pedro OSTPWhite House Office of Science and Technology Policy [email protected] Policy
Gerr Neil DARPADefense Advanced Research Projects Agency [email protected] Linda USGSU.S. Geological Survey [email protected]
Hall Alan NOAANational Oceanic and Atmospheric Administration [email protected]
Iacono Suzanne NSFNational Science Foundation [email protected] David OSDOffice of the Secretary of Defense ‐ATL [email protected] Kaufman Daniel DARPADefense Advanced Research Projects Agency [email protected]
OSTPWhit H Offi f S i d T h lLarson Phillip P. OSTPWhite House Office of Science and Technology Policy [email protected]
Lee Tsengdar NASANational Aeronautics and Space Administration [email protected]
Lipman David NIHNational Institutes of Health /NLMNIH’s National Library of Medicine /NCBI [email protected]
Little Michael NASANational Aeronautics and Space Administration [email protected]
Luker Mark NCONational Coordination Office for NITRD /NITRDNetworking and Information Technology Research and Development
Marth Lisa NISTNational Institute of Standards and Technology [email protected]
Muoio Patricia A. DNI [email protected]
Pantula Sastry NSFNational Science Foundation [email protected]
Preuss Don NIHNational Institutes of Health /NLMNIH’s National Library of Medicine /NCBI [email protected]
Quade Brittany NSFNational Science Foundation [email protected] Romine Charles NISTNational Institute of Standards and Technology [email protected] g
Smith Darren NOAANational Oceanic and Atmospheric Administration [email protected]
Spengler Sylvia NSFNational Science Foundation [email protected] Tom NSFNational Science Foundation [email protected]
Strawn George NCONational Coordination Office for NITRD /NITRDNetworking and Information Technology Research and Development
Suskin Mark NSFNational Science Foundation [email protected] Jennifer NIHNational Institutes of Health /NIGMS [email protected]
Wigen Wendy NCONational Coordination Office for NITRD /NITRDNetworking and Information Technology Research and Development
Zhao Fen NSFNational Science Foundation [email protected] Nowell Lucy DOE Science [email protected]
Big Data Membership
B r i s to l S k y U S G S U .S . G e o lo g i c a l S u r v e y s b r i s to l@ u s g s .g o vK ie lm a n J o s e p h D H S D e p a rt m e n t o f H o m e la n d S e c u r i t y jo s e p h .k ie lm a n @ d h s . g o vP e t te rs J o n a th a n D O E S c ie n c e J o n a th a n . P e t t e r s@ s c ie n c e . d o e .g o v
l d d fC a r v e r D o r i s N S F N a t io n a l S c ie n c e F o u n d a t io n d c a r ve r @ n s f .g o v A d o l f ie L a u r a O S D O f f i c e o f t h e S e c r e t a r y o f D e f e n s e l a u ra . a d o l f ie@ o s d . m i lC h a d d u c k R ob e r t N S F N a t io n a l S c ie n c e F o u n d a t io n r c h a d d u c@ n s f .g o v C r ow d e r G ra c e N S A N a t io n a l S e c u r i t y A g e n c y g a c r o w d @ n s a . g o v D u n n M ich e l le N IH N a t io n a l In s t i tu t e s o f H e a l t h d u n n m 3@ m a i l .n ih .g o v F l V l i N IH N t i l I t i t t f H l t h f l @ i l ihF lo r a n ce V a le r ie N IH N a t io n a l In s t i tu t e s o f H e a l t h f lo r a n c e v @ m a i l .n ih .g o v F r e h i l l L i s a O S D O f f i c e o f t h e S e c r e t a r y o f D e f e n s e l i s a .f r e h i l l .C T R @ d a r p a .m i l H e l la n d B a rb a ra D O E D e p a rt m e n t o f E n e r g y ‐ S c ie n c e b a rb a r a . h e l la n d @ s c ie n c e . d o e . g o v
H o a n g T h u c D O E D e p a rt m e n t o f E n e r g y ‐N N S A t h u c .h o a n g @ n n s a .d o e .g o v K a n n a n N a n d in i N S F N a t io n a l S c ie n c e F o u n d a t io n n k a n n a n @ n s f . g o v W T d N S F N t i l S i F d t i t @ fW a r n o w T a n d y N S F N a t io n a l S c ie n c e F o u n d a t io n t w a r n o w @ n s f .g o vD e a n D a v id D O E S c ie n c e d a v id . d e a n @ s c ie n c e .d o e . g o v L y s te r P e te r N IH N a t io n a l In s t i tu t e s o f H e a l t h l y s t e r p @ m a i l . n ih .g o v M i l le m a c i J o h n O S D O f f i c e o f t h e S e c r e t a r y o f D e f e n s e jo h n . m i l le m a c i .c t r @ o s d .m i lP e a rc e C la u d ia N S A N a t io n a l S e c u r i t y A g e n c y c e p e a r c e @ n s a .g o v A l le n M a r c N A S A N a t io n a l A e ro n a u t ic s a n d S p a c e A d m i n i s t r a t io n m a r c a l le n @ n a s a g o vA l le n M a r c N A S A N a t io n a l A e ro n a u t ic s a n d S p a c e A d m i n i s t r a t io n m a r c .a l le n @ n a s a .g o v P e a r l J e n n i f e r N S F N a t io n a l S c ie n c e F o u n d a t io n j s l im o w i @ n s f . g o v B la s z k o w s k y D a v id T re a s u r y D e p a rt m e n t o f t h e T r e a s u r y O F R d a v id . b la s z k o w s k y @ tr e a s u ry .g o v
S z y k m a n J a m e s E P A E n v i ro n m e n ta l P ro t e c t io n A g e n c y ja m e s . j . s z y k m a n @ n a s a . g o v T o m p k in s J e r ry N S A N a t io n a l S e c u r i t y A g e n c y g s t o m p k @ r a d iu m . n c s c .m i l F lo o d M a r k T re a s u r y D e p a rt m e n t o f t h e T r e a s u r y O F R m a r k f lo o d @ t r e a s u ry g o vF lo o d M a r k T re a s u r y D e p a rt m e n t o f t h e T r e a s u r y O F R m a r k .f lo o d @ t r e a s u ry .g o v H o lm J e a n n e D a ta .g o v je a n n e .m . h o lm .j p l .n a s a .g o v M is a w a E d u a r d o N S F N a t io n a l S c ie n c e F o u n d a t io n e m is a w a @ n s f . g o v
Big Data Launch
• Federal Big Data R&D Initiative launched by White House OSTP on March 29, 2012 at AAAS
• Federal Announcements:• NSF – Subra Suresh• NIH Francis Collins• NIH – Francis Collins• USGS – Marcia McNutt• DoD – Zach Lemnios• DARPA - Ken Gabriel• DOE – William Brinkman
• Panel Discussion: • Moderator - Steve Lohr, New York Times
Image Credit: National Science Foundation
,• Daphne Koller, Stanford University• James Manyika, McKinsey & Company • Lucila Ohno-Machado, UC San Diego • Alex Szalay Johns Hopkins University• Alex Szalay, Johns Hopkins University
More information available at: http://nsf.gov/news/news_summ.jsp?org=CISE&cntn_id=123607&preview=false
Strategy to Address Big Datagy g
Foundational research to develop new techniques and
technologies to derive knowledge
New cyberinfrastructure to manage, curate, and serve data to
h i itechnologies to derive knowledge from data research communities
Policy
New approaches for education and workforce development
New types of inter‐disciplinary collaborations, grand challenges,
and competitions
Core Techniques and Technologies for Advancing Big Data Science & Engineering (BIG DATA)
Program Solicitation: NSF 12-499
Foundational research to extract knowledge from data
Foundational research to advance the core techniques
d t h l i fand technologies for managing, analyzing, visualizing, and extracting g, guseful information from large, diverse, distributed and heterogeneous dataand heterogeneous data sets.
C Di t t P NSF Wid
Image Credit: Jurgen Schulze, Calit2, UC‐San Diego
Cross‐Directorate Program: NSF WideMulti‐agency Commitment: NSF and NIH
BIG DATA Research ThrustsCollection, Storage, and
Management of “Big Data”
•3 awards
Data Analytics
• 4 awards
Research in Data Sharing and Collaboration
• 1 award (+1 shared with data collection)
•Foundations of big data management
•Mitigating tradeoffs among speed of data ingestion quicker
• Novel machine learning where multi-dimensional vector data points are replaced by distributions
• Design and test
data collection)
• Open source tools for infrastructure for improving discovery through use of social analytic dataingestion, quicker
answers and the freshness of data through the design of new storage devices with extreme capacities
Design and test mathematical and statistical techniques for large-scale heterogeneous data in DNA repositories
analytic data
• Databridge– linking data, human interactions, and usage practices for the long-tail of science
•Databridge – linking data, human interactions and usage practices for the long-tail of science
• Data analytics problems in next generation sequencing
• Theory and algorithms fro couples tensors and
Credit: Fermilab Photo
fro couples tensors and associated software toolkits to make analysis possible
Eight mid‐scale (up to $1M a year) awards out of over 136 projects announced on Oct. 3.
Award CitationsAward Citations
• DCM:– Dan Suciu – University of WA
• A formal foundation for big data management– Michael Bender – SUNY at Stony Brook &Michael Bender SUNY at Stony Brook &
Martin Farach-Colton – Rutgers University• Eliminating the data ingestion bottleneck in big data
applicationspp– Arcot Rajasekar – University of North Carolina,
Chapel Hill & Gary King – Harvard University & Justin Zhan – North Carolina Agriculture & Technical State U i itUniversity
• Databridge – A sociometric system for long-tail science data collections
25
Award CitationsAward Citations• Data Analytics
Eli Upfal Brown University– Eli Upfal – Brown University• Analytic approaches to massive data computation with
applications to genomicsAarti Singh Carnegie Mellon University– Aarti Singh – Carnegie-Mellon University
• Distribution-based machine learning for high dimensional datasets
Srinvas Aluru Iowa State University & Wuchun– Srinvas Aluru – Iowa State University & Wuchun Feng – Virginia Polytechnic Institute & State University & Oyekunie Olukotun – Stanford UniversityUniversity
• Genomes Galore – Core techniques, libraries, and domain specific languages for high throughput DNA sequencing
Award CitationsAward Citations
• Data Analytics (continued)Data Analytics (continued)– Christos Faloutsos – Carnegie Mellon
University & Nikolaos Sidiropoulos –University & Nikolaos Sidiropoulos University of Minnesota – Twin Cities
• Big Tensor Mining: Theory Scalable Algorithms g g y gand Applications
27
Award CitationsAward Citations
• E-Science Collaboration EnvironmentsE Science Collaboration Environments– Thorsten Joachim – Cornell University & Paul
Kantor – Rutgers UniversityKantor Rutgers University• Discovery and social analysis for large-scale
scientific literature
28
Ideation Contest LaunchIdeation Contest Launch
• Opportunity to expand the innovation ecosystempp y p y• Joint among NASA, NSF and DOE Office of
Science• A contest focused on “How to make• A contest focused on How to make
heterogeneous data seem more homogeneous?”• 5 judges• 5 criteria• Launched on Challenge.gov and the Top Coder
platform on Oct 3 with a two week windowplatform on Oct. 3 with a two week window– http://challenge.gov/NASA/425-big-data-challenge-
serieshttp://community topcoder com/coeci/nitrd/– http://community.topcoder.com/coeci/nitrd/
Ongoing Big Data Programs at NSF
• Dear Colleague Letters:– Encourage CIF21 IGERTs to educate and support a new– Encourage CIF21 IGERTs to educate and support a new
generation of researchers able to address fundamental Big Data challenges: http://www.nsf.gov/pubs/2012/nsf12555/nsf12555.htmD t I t i Ed ti R l t d R h F di– Data-Intensive Education-Related Research Funding Opportunities announcing an Ideas Lab, for which cross disciplinary participation will be solicited, to generate transformative ideas for using large datasets to enhance the effectiveness of teaching and learning environments: http://www.nsf.gov/pubs/2012/nsf12060/nsf12060.jsp
– Data Citation to the Geosciences Community to encourage transparency and increased opportunities for the use and analysis of data sets:use and analysis of data sets: http://www.nsf.gov/pubs/2012/nsf12058/nsf12058.jsp
Earthcube: GEO Science InfrastructureG f• EAGER awards announced as part of White House Big Data Launch
• Integrates geosciences data and high-performance computing technologies in an open, adaptable and sustainable framework to
bl f i h d d i i E h Senable transformative research and education in Earth System Science
• Innovative Model: Community designed, community owned, it dcommunity governed
• Interdisciplinary research: – Building and sustaining “new” communities– Workshops to bring together (GEO, SBE, CISE) communities– EAGER awards to seed new research
A Complex Policy SettingA Complex Policy Setting
Researchers ant data• Researchers want data.• Public policy requires access to data.• Public policy also requires protection of privacy andPublic policy also requires protection of privacy and
intellectual property and other sensitive information.• Much more to be done: Policy on data management
and data access.
Data PrivacyData Privacy
• Never more important than today• However, not all data contain people’s identities (as in data
landscape III)– Not a Big Brother scenario
Go ernment (NSF) in ests in pri ac research– Government (NSF) invests in privacy research• Values in design research community:
– Identity cloaking, anonymization– Do-not-track cookie management
Obfuscation blurring– Obfuscation, blurring– Privacy preserving data mining, search, payment– Just-in-time crypto– Secure data distribution– ….
Privacy in technology; privacy inspired technology
Opportunities for the Future
• Our investments in research and education have already returned exceptional dividends to the Nationreturned exceptional dividends to the Nation.
• Many of tomorrow’s breakthroughs will occur as a result of t h i d t h l i f d i Bi D tnew techniques and technologies for advancing Big Data
science and engineering.
• In turn, Big Data scientific discovery and technological innovation are at the core of our response to national and societal challenges – from environment, energy,societal challenges from environment, energy, transportation, sustainability, and healthcare to cyber security and national defense.
Thanks!