Data- and Compute-Driven Transformation of Modern Science

22
Data- and Compute-Driven Transformation of Modern Science Edward Seidel Assistant Director, Mathematical and Physical Sciences, NSF (Director, Office of Cyberinfrastructure) 1

description

Data- and Compute-Driven Transformation of Modern Science. Edward Seidel Assistant Director, Mathematical and Physical Sciences, NSF (Director, Office of Cyberinfrastructure). Profound Transformation of Science Gravitational Physics. Galileo, Newton usher in birth of modern science: c. 1600 - PowerPoint PPT Presentation

Transcript of Data- and Compute-Driven Transformation of Modern Science

Page 1: Data- and Compute-Driven  Transformation of Modern Science

Data- and Compute-Driven Transformation of Modern

Science

Edward SeidelAssistant Director, Mathematical and

Physical Sciences, NSF(Director, Office of Cyberinfrastructure)

1

Page 2: Data- and Compute-Driven  Transformation of Modern Science

2

Profound Transformation of Science

Gravitational Physics Galileo, Newton usher in birth of

modern science: c. 1600 Problem: single “particle” (apple)

in gravitational field (General 2 body-problem already too hard)

MethodsData: notebooks (Kbytes)Theory: driven by dataComputation: calculus by hand (1

Flop/s) Collaboration

1 brilliant scientist, 1-2 student

Page 3: Data- and Compute-Driven  Transformation of Modern Science

33

Profound Transformation of Science

Collision of Two Black Holes• Science Result

• The “Pair of Pants”

• Year: 1994

• Team size

• ~ 10

• Data produced

• ~ 50Mbytes

• Science Result

• The “Pair of Pants”

• Year: 1994

• Team size

• ~ 10

• Data produced

• ~ 50Mbytes

Science ResultThe “Pair of Pants”

Year: 1972 Team size

1 person (S. Hawking) Computation

Flop/s Data produced

~ Kbytes (text, hand-drawn sketch)

400 years later…same!

Page 4: Data- and Compute-Driven  Transformation of Modern Science

Move to 3D: 1000x more data!

4

• 3D Collision

• Science Result

• Year: 1998

• Team size

• ~ 15

• Data produced

• ~ 50Gbytes

• 3D Collision

• Science Result

• Year: 1998

• Team size

• ~ 15

• Data produced

• ~ 50Gbytes

Page 5: Data- and Compute-Driven  Transformation of Modern Science

5

Just ahead: Complexity of UniverseLHC, Gamma-ray bursts!

Gamma-ray bursts! GR now soluble: complex problems in

relativistic astrophysics All energy emitted in lifetime of sun

bursts out in a few seconds: what are they?! Colliding BH-NS? SN?

GR, hydrodynamics, nuclear physics, radiation transport, neutrinos, magnetic fields: globally distributed collab!

Scalable algorithms, complex AMR codes, viz, PFlops*week, PB output!

LHC: Higgs particle? ~10K scientists, 33+ countries, 25PB Planetary lab for scientific discovery! Remote InstrumentRemote Instrument

Remote InstrumentRemote Instrument

Page 6: Data- and Compute-Driven  Transformation of Modern Science

6

Grand Challenge Communities Combine it All...

Where is it going to go?

6

Same CI useful for black holes, hurricanes

Page 7: Data- and Compute-Driven  Transformation of Modern Science

Framing the QuestionScience is Radically Revolutionized by CI

Modern science Data- and compute-

intensive Integrative

Multiscale Collaborations for Complexity Individuals, groups Teams, communities

Must Transition NSF CI approach to support Integrative, multiscale 4 centuries of

constancy, 4 decades 109-12 change!

7

…But such radical change cannot be

adequately addressed with (our current) incremental approach!

…But such radical change cannot be

adequately addressed with (our current) incremental approach!

We still think like

this…

We still think like

this…

Students take note!Students

take note!

Page 8: Data- and Compute-Driven  Transformation of Modern Science

Scientific Computing and Imaging Institute, Scientific Computing and Imaging Institute, University of UtahUniversity of Utah

Data Crisis: Information Big BangPCAST Digital DataPCAST Digital Data

NSF Experts StudyNSF Experts Study

Wired, NatureWired, Nature

Storage Networking Industry Association (SNIA) 100 Year Archive Requirements Survey Report

“there is a pending crisis in archiving… we have to create long-term methods for preserving information, for making it available for analysis in the future.” 80% respondents: >50 yrs; 68% > 100 yrs

IndustryIndustry

Page 9: Data- and Compute-Driven  Transformation of Modern Science

9

Explosive Trends in Data Growth Comparative Metagenomics

DNA sequencing of entire families of organisms

Already hundreds of TB, thousands of users HD Collaborations and Optiportals

Multichannel HD, gigapixel visualizations Petascale-Exascale simulation

They generate peta-exabytes per simulation! Square Kilometer Array

3000 radio receivers, 1 km2 area!19 countries! Possibly beginning in 201X,

operational 202XData: exabyte per week! Analysis: Exaflops!

Page 10: Data- and Compute-Driven  Transformation of Modern Science

Provenance in Science

Provenance is as important as the result

Not a new issue Lab notebooks have been used for a long time What is new?

Large volumes of data Complex analyses—

computational processes Writing notes is no

longer an option… GC Communities require

open, sharable data, standards, metadata

DNA recombinationBy Lederberg

WhenWhen

Observed dataObserved data

AnnotationAnnotation

Source: Juliana Freire, U of Utah

Page 11: Data- and Compute-Driven  Transformation of Modern Science

11

“Data Deluge” Drives Change at NSF Data Issues resonate the most across NSF units! DataNet – $100M investment in Sustainable

Archive & Access Partners – development of widely accessible network of interactive data archives; Driven by today’s grand challenges, integrating multiple disciplines

INTEROP – Community-led interoperability – interdisciplinary, community approaches to combine and re-use data in ways not envisioned by their creators

Data-intensive computing: SDSC Gordon facility NSF Data Policy – The “Data Working Group” (NSF-

wide group of Program Directors) working to assure that data are shared within and across disciplines

There is a major shift in science towards data-

intensive methods. NSF is

responding…

There is a major shift in science towards data-

intensive methods. NSF is

responding…

Page 12: Data- and Compute-Driven  Transformation of Modern Science

NSF Vision and National CI Blueprint

12

Track 1Track 1Track 1Track 1

Track 2Track 2Track 2Track 2

Track 2Track 2Track 2Track 2

Track 2Track 2Track 2Track 2

CampusCampusCampusCampusCampusCampusCampusCampus

CampusCampusCampusCampusCampusCampusCampusCampus CampusCampusCampusCampus

CampusCampusCampusCampusCampusCampusCampusCampus

CampusCampusCampusCampus

DataNetDataNetDataNetDataNet

DataNetDataNetDataNetDataNet

SoftwareSoftwareSoftwareSoftware

NetsNetsNetsNets

DataNetDataNetDataNetDataNet

DataNetDataNetDataNetDataNet

DataNetDataNetDataNetDataNet

Education Crisis: Education Crisis: I need all of this I need all of this to start to solve to start to solve

mymy problem! problem!

Education Crisis: Education Crisis: I need all of this I need all of this to start to solve to start to solve

mymy problem! problem! Science is becoming unreproducible in this environment. Validation?Provenance? Reproducibility?

Science is becoming unreproducible in this environment. Validation?Provenance? Reproducibility?

Page 13: Data- and Compute-Driven  Transformation of Modern Science

The Shift Towards DataImplications

All science is becoming data-dominatedExperiment, computation, theory

Totally new methodologiesAlgorithms, mathematicsAll disciplines from science and engineering

to arts and humanities End-to-end networking becomes critical

part of CI ecosystemCampuses, please note!

How do we train “data-intensive” scientists?

Data policy becomes critical!13

Page 14: Data- and Compute-Driven  Transformation of Modern Science

Recent NSF Activities on Data Policy and Implementation

14

Page 15: Data- and Compute-Driven  Transformation of Modern Science

15

Fundamental points on data and publication policy

Publicly funded scientific data and publications should be available, and science benefits

There has to be a place to keep data, and a way to access it

There needs to be an affordable, sustainable cost model for this

15

Who pays? The NSF? The

Institution? What is the cost model? What is

reasonable?

Who pays? The NSF? The

Institution? What is the cost model? What is

reasonable?

What data must be made available? Raw

data? Peer reviewed? When is it available? 6 months?

1 year? After publication?

What data must be made available? Raw

data? Peer reviewed? When is it available? 6 months?

1 year? After publication?

Where is it placed?

Author web site?

Library? NSF sites?

Where is it placed?

Author web site?

Library? NSF sites?

How long is it made available?

How do we enforce it post-

award?

How long is it made available?

How do we enforce it post-

award?

There is great variability in requirements across science communities: peer review can help guide this process.

There is great variability in requirements across science communities: peer review can help guide this process.

Page 16: Data- and Compute-Driven  Transformation of Modern Science

Changes Coming for Data! Long-standing NSF Policy on Data (Proposal &

Award Policies & Procedures Guide)“Investigators are expected to share with other

researchers, at no more than incremental cost and within a reasonable time, the primary data... created or gathered in the course of work under NSF grants”

16

NSF will soon require a Data Management Plan (DMP), subject to peer review; criterion for award The DMP will be in the form of a 2-page

supplementary document to the proposal It will not be possible to submit proposals without

such a document Customization by discipline, program necessary

Page 17: Data- and Compute-Driven  Transformation of Modern Science

Upcoming Implementation of NSF Data Policy

Directorate-Specific Issues & Peer Review Many details are implemented/enforced via

peer review and Program Officer discretion, including things like embargo period, standards, etc.

The challenge at NSF is that “no one size fits all” so each Directorate will be responsible for its own recommendations for DMP content, appropriate institutional repositories, etc.

This does not address Open Access as applied

17

Page 18: Data- and Compute-Driven  Transformation of Modern Science

Electronic Access to Scientific Publications

18

Page 19: Data- and Compute-Driven  Transformation of Modern Science

Why is this Important? Science requires it

Science progress accelerated by making publications available and searchable

Results in one community need to easily propagate to another for multidisciplinary complex problem solving

Search technologies can be brought to bear Publications need to be associated with rich information:

videos of simulations, supporting data, simulation and analysis codes…

Equality and Broadening participation Young scientists at smaller universities at a needless

disadvantage without it. They may lose journal access. This hurts science and puts talent at risk

US Administration focus on transparency and accountability

19

Page 20: Data- and Compute-Driven  Transformation of Modern Science

Current Activities We have begun serious discussions within

NSF on these issuesNational Science Board Committee on Data

startedGoals similar to those for Data

We have had numerous visits from funding agencies from around the worldPrimary topic: what is NSF doing on OA?

Discussion with various publishers, libraries to explore optionsQuality of science relies on peer review systems

of best journals; need a way to support OA20

Page 21: Data- and Compute-Driven  Transformation of Modern Science

On Working with Publishers

Quality of science, identification of talented scientists: we rely on the peer review systems of the best journals NSF receives an assurance that the work done on a grant

meets a standard Universities use impact factors as part of their tenure and

promotion process

I believe it is in the interests of science, and hence the public interest, to help journals find a viable OA business model..

Bernard Schutz, Presentation to NSF, May 200921

Page 22: Data- and Compute-Driven  Transformation of Modern Science

Final Remarks

Science is becoming collaborative and data dominated

We are accelerating efforts to advance NSF in all aspects of dataScience requires that data need to be open and

accessible; we are working to achieve this All forms of data are important, and must be

more tightly connected in the futureCollections, software, publications

Time is of the essence

22