Uk Research Infrastructure Workshop E-infrastructure Juan Bicarregui
Building an Open Data Infrastructure for Science: Policy and Practice Juan Bicarregui STFC e-Science...
-
Upload
kenya-stapleford -
Category
Documents
-
view
217 -
download
2
Transcript of Building an Open Data Infrastructure for Science: Policy and Practice Juan Bicarregui STFC e-Science...
Building an Open Data Infrastructure for Science:
Policy and Practice
Juan BicarreguiSTFC e-Science
WIFI: BMA Guest“apa2 conference” “visitor”
APA conference, November 2011
Overview
• Introduction– What is STFC? What do we do? Why do we do it?
• An example project (Practice)– Not: ODE, APARSEN, EUDAT, SCAPE, SCIDIP-ES
• RCUK Data Principles (Policy)
Programme includes: • Neutron and Muon Source• Synchrotron Radiation Source• Lasers• Space Science • Particle Physics• Compuing and Data Management• Microstructures• Nuclear Physics• Radio Communications
What is STFC?
250m
ESRF & ILL, GrenobleDaresbury Laboratory
Square Kilometre Array Large Hadron Collider
What is the science?
©John Womersley/Keith Jeffery/STFC
Tomorrow’s Digital Infrastructure for Science
APA Conference 2007 John Womersley/Keith Jeffery
The Innovation Lifecycle
The Body of Knowledge
The GovernmentProcess
The ResearchProcess
Aggregation of Knowledge lies at the heart of the innovation lifecycle
Enabling Knowledge Creation
Enabling Wealth Creation
Quality Assessment
Strategic Direction
Improved Quality of Life
Improved Understanding
Data and the Research Process
Data centric view of research
DataCreation
Archival
Access
Storage ComputeNetwork
Services
Curation
the researcher actsthrough ingest and access
Virtual Research Environment
the researcher shouldn’t have to worry about the information infrastructure
Information Infrastructure
The 7 C’s
Creation Collection
Capacity
Computation
Curation
Collaboration Communication
Data
Linked systems for:
• Proposal submission• User management• Data acquisition
Metadata carried from each system to the next
Detectors moving from Hz to KHz, MHz, GHz, ...
Creation
Examining the detectors on MAPS instrument on ISIS
Barrel toroid magnet and detector module from ATLAS at CERN
1997 1998 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 201010
100
1,000
10,000
Data Stored at Rutherford Appleton Lab
Capacity
Currently store about
7 PetaBytes of data
1PB = 1015 Bytes
= a Billion Floppys
= a Million CDs
= a Thousand PCs
LHC experiments ~ 2PB
Active file system based HSM ~2PB(facilities)
Dark archive ~2PB(facilities and non-STFC services)
of Today’s
}
10 x Moore’s Law (2 years)2 x Moore’s Law (1.5 years)
Moore’s Lawx1000 in 13 yearsDoubling every 1.3 years
Computation
Compute intensive components on the grid
Computational applications for Laser Theory Group’s adoption of HPC
Laser real-time diagnostics & data flow pipeline.
Fitting of experimental data to model
CurationComplexity of Facility Archives• All ISIS data (~25 years)
> 3,000,000 files
• All Diamond Data (~2.5 years)> 11,000,000 files
Breadth of Data Sources:• STFC (Tier 1)• NERC (BADC, NEODC)• BBSRC (All institutes)• AHRC• MRC • Others:
– The STFC Data Portal – The STFC Publications Archive – The CCPs (Collaborative Computational Projects) – The Chemical Database Service – The Digital Curation Centre – The EUROPRACTICE Software service– The HPCx Supercomputer– The JISCmail service– The NERC Datagrid – The Starlink Software suite – The UK Grid Support Centre – The World Data Centre for Solar-Terrestrial Physics
Atlas Datastore Tape Robot
The StorageTek tape robot with capacity of 20PB
Collection
Proposal
Approval
SchedulingExperiment
Data cleansing
Record Publication
Scientist submits application for
beamtime
Facility committee approves application
Facility registers, trains, and schedules
scientist’s visit
Scientists visits, facility run’s experiment
Subsequent publication registered
with facility
Raw data filtered and cleansed
Data analysis
Tools for processing made available
CommunicationImmense Expectations ! • Web enables:
– access to everything • Everything on-line
• Interlinking enables:– Validation of results – Repetition of experiment
• Discovery enables:– new knowledge from old
• Archiving enables:– Unplanned reuse of data
• Antarctic environmental data– Reuse of knowledge
• One paper has >20,000 downloads– Completing the cycle
• Publications entered in next proposal
STFC’s “e-pubs” Institutional Repository has records of 30,000 publications spanning 25 years
“The web has changed everything...”
Collaboration
Technology integration facilitates scientific collaboration• Cross facility/beamline• Cross disciplinary
Technology integration improves facility efficiency • PaN-data –Photon and Neutron Data infrastructure• ICAT also used in Australian Synchrotron and Oak Ridge National Lab
Overview
• Introduction– What is STFC? What do we do? Why do we do it?
• An example project – Not: ODE, APARSEN, EUDAT, SCAPE, SCIDIP-ES
• RCUK Data Policy Principles
Tools for virtual research environments
Tools for virtual research environments
Generic services, storage and computation
OA participatory infrastructure
Agricultur
e
Environment
Physics, Engineering
Biolo
gy
Medici
n
e
Atmosphere/Space Physics
Social SciencesScientific Data
(Discipline Specific)
Other Data
Researcher 1
Non Scientific World
Scientific WorldResearcher 2
Aggregated Data Sets(Temporary or Permanent)
Workflows
Aggregation Path
transPLANT
EUDAT
AgINFRA
iMarine
OPENAire Plus
diXa
SCIDIP-ES
ESPAS
ENGAGE
PanDataODI
Scientific Data Landscape of Initiatives – results from call9
VREs
VREs
PaNdataODI
PaN-data bring together 11 major European Research Infrastructures
PaN-data is coordinated by the e-Science Department at the Rutherford Appleton Laboratory, UK
ISIS is the world’s leading pulsed spallation neutron source
ILL operates the most intense slow neutron source in the world
PSI operates the Swiss Light Source, SLS, and Neutron Spallation Source, SINQ, and is developing the SwissFEL Free Electron Laser
HZB operates the BER II research reactor the BESSY II synchrotron
CEA/LLB operates neutron scattering spectrometers from the Orphée fission reactor
ESRF is a third generation synchrotron light source jointly funded by 19 European countries
Diamond is new 3rd generation synchrotron funded by the UK and the Wellcome Trust
DESY operates two synchrotrons, Doris III and Petra III, and the FLASH free electron laser
Soleil is a 2.75 GeV synchrotron radiation facility in operation since 2007
ELETTRA operates a 2-2.4 GeV synchrotron and is building the FERMI Free Electron Laser
ALBA is a new 3 GeV synchrotron facility due to become operational in 2010
PaN-data Partners
JCNS Juelich Centre for Neutron Science MaxLab, Max IV Synchrotron
The partners operate hundreds of instruments used by over 30,000 scientists each year
EDNS - European Data Infrastructure for Neutron and Synchrotron Sources
PaNdata Vision
Single Infrastructure Single User Experience
CapacityStorage
Publications Repositories
Data Repositories
Software Repositories
Raw Data Data Analysis
Analysed Data
Publication Data
Publications
Facility 1
Raw Data Data Analysis
Analysed Data
Publication Data
Publications
Facility 2
Raw Data Data Analysis
Analysed Data
Publication Data
Publications
Facility 3
Different Infrastructures Different User ExperiencesRaw Data Catalogue
Data Analysis
Analysed Data Catalogue
Publication Data Catalogue
Publications Catalogue
Science driver – Data IntegrationNeutron diffraction X-ray diffraction
}NMR
High-qualitystructure refinement
}
PaN-data Standardisation
PaN-data Europe is undertaking 5 standardisation activities:
1. Development of a common data policy framework
2. Agreement on protocols for shared user information exchange
3. Definition of standards for common scientific data formats
4. Strategy for the interoperation of data analysis software enabling the most appropriate software to be used independently of where the data is collected
5. Integration and cross-linking of research outputs completing the lifecycle of research, linking all information underpinning publications, and supporting the long-term preservation of the research outputs
PaN-data Europe – building a sustainable data infrastructure for Neutron and Photon laboratories
2.1 Data Policy
2.2Software Policy
2.3UserPolicy
2.4Integrated Policy
4.1User
Proposal
4.2User
Workshop
4.3User
Revision
5.1Data
Proposal
5.2Data
Workshop
5.3Data
Revision
6.1SoftwareReview
6.2Software Workshop
6.3SoftwareProposal
6.4Software Revision
7.1Integration Report
7.2Integration Proposal
7.3Integration Revision
3.4
Final
Workshop
Project Management, Knowledge Exchange and Dissemination Activities
Dependencies between the major project tasks
PaNdata Activities
ERA Open Access Sharing Initiatives (examples, etc)
ERA Infrastructure Platform Initiatives (EGI, etc)
PaNdata Support Action
(Ends 30 Nov 11)
Policies and Standards
PaNdataODI
(begins end2011)
JRAs
Users
Data
Software
Integration
Provenance
Preservation
Scalability
PaNdataODI
(begins end2011)
ServicesUsers
Data
PaNdataODI
Virtual Labs
Policies Powder Diff
SAXS & SANS
Tomography
ObjectivesObjective 2 – UsersTo deploy, operate and evaluate a system for pan-European user identification across the participating facilities and
implement common processes for the joint maintenance of that system.
Objective 3 – DataTo deploy, operate and evaluate a generic catalogue of scientific data across the participating facilities and promote
its integration with other catalogues beyond the project.
Objective 4 – Provenance To research and develop a conceptual framework, defined as a metadata model, which can record the analysis
process, and to provide a software infrastructure which implements that model to record analysis steps hence enabling the tracing of the derivation of analysed data outputs.
Objective 5 – PreservationTo add to the PaNdata infrastructure extra capabilities oriented towards long-term preservation and to integrate
these within selected virtual laboratories of the project to demonstrate benefits. These capabilities should, as for the developments in the provenance JRA, be integrated into the normal scientific lifecycle as far as possible. The conceptual foundations will be the OAIS standard and the NeXus file format.
Objective 6 – Scalability To develop a scalable data processing framework, combining parallel filesystems with a parallelized standard data
formats (pNexus pHDF5) to permit applications to make most efficient use of dedicated multi-core environments and to permit simultaneous ingest of data from various sources, while maintaining the possibility for real-time data processing.
Objective 7 – DemonstrationTo deploy and operate the services and technology developed in the project in virtual laboratories for three specific
techniques providing a set of integrated end-to-end data services.
PaNdata ODI Joint Research Activities
PaNdata ODI Service Activities
PaNdata ODI Service ReleasesStandards from
PaNdataSupport Action
uCat
dCat
vLabs
Prov
Pres
Scale
Rel 1 Rel 2 Rel 3 Rel 4
users
data
s/w
Integ
Jun 2014Jun 2013 Dec 2013Dec 2012
Data
The Research Lifecycle – a personal view
the researcher actsthrough ingest and access Research
Environment
Creation
Archival
Access
Storage ComputeNetwork
Data
Services
the researcher shouldn’t have to worry about the information infrastructure
Information Infrastructure
MetaData/ Catalogues
PortalsUser Info feed
DAQ feedData Analysis feed
EGIGEANT
Local resources
e-Infrastructure
Provenanced Research
Overview
• Introduction– What is STFC? What do we do? Why do we do it?
• An example project – Not: ODE, APARSEN, EUDAT, SCAPE, SCIDIP-ES
• RCUK Data Policy Principles
RCUK Principles on Data Policy
Seven (fairly) orthogonal principles: • Public good • Preservation• Discoverability• Confidentiality • First use• Recognition• Costs
} Data
} Access
} Rights
Motivation
Data are a critical output of the research process:
• For the integrity, transparency and robustness of the research record
• Often value increases through aggregation
• Enables new research questions to be addressed
• Supports the wider exploitation of data
Repeat, Repeal, Repurpose
Why might we want access to data?
Three distinct reasons for sharing data:
• Repeat - Validation of previous analysis
– How does this fit with peer review?
• Reconsider/Reform/Repeal/Reverse - Alternative hypotheses in the same
field
– c.f. Reuse
– How does this fit with “right” to first use?
• Repurpose - New research in another field
– c.f. Recycle
– How does this fit with recognition of Intellectual contribution? (What’s in it for me?)
Different concerns and requirements for each type of sharing
Data are a Public GoodPublicly funded research data are a public good, produced in the public interest, which should be made openly available with as few restrictions as possible in a timely and responsible manner that does not harm intellectual property.
Public good – is nonrival and non-excludable [wikipedia] consumption by one does not reduce availability for othersno one can be effectively excluded from using
Research Data recorded factual material commonly retained by and accepted in the
scientific community as necessary to validate research findings As few restrictions as possible
Later (distinguish registration from restriction)Timely
Later (discipline specific)Responsible
Later (maximising access does not necessarily maximising research benefit)Intellectual Property
Later (balance contribution from sharing and from primary research)
Data should be managedInstitutional and project specific data management policies and plans should be in accordance with relevant standards and community best practice. Data with acknowledged long term value should be preserved and remain accessible and usable for future researchPolicies and Plans
DMPs should exist for all dataInstitutional/Departmental v. Project
Standards and Best practicediscipline specific
Long term value eg. NERC has the concept of the Data Value Checklist.
Future research (by current and future generations)Don’t lose it by accident
Data should be discoverableTo enable research data to be discoverable and effectively reused by others, sufficient metadata should be recorded and made openly available to enable other researchers to understand the research and re-use potential of the data. Published results should always include information on how to access the supporting data
Effectively reused by othersfor validation (the data behind the graph – and derivation/provenance?)for reworking (alternative hypothesis)for repurposing(same data or data merging)
Sufficient Metadata ... openly available (stronger than for data)Understand the re-use potential Discoverable – repository? registration?
Published results should include (pointers)could be an email address (but note longevity requirement)
Data may be protectedRCUK recognises that there are legal, ethical and commercial constraints on release of research data. To ensure that the research process is not damaged by inappropriate release of data, research organisation policies and practices should ensure that these are considered at all stages in the research process. Legal
Data Protection Act, Freedom of Information Act, Environmental Information Regulations, EU INSPIRE Directive (Spatial Data) Some Case law: UEA “Climategate” and a few others - can go either way
ICO developing Guidelines on FoIProtection of Freedoms Bill
EthicalConsent, Privacy, National Security, Consent eg longitudinal cohort studies – protection of cohort participationCommercial – shared funding, patent pending, commercial in confidence... all stages in the research process ...
Plan in proposal – peer reviewedexternal review of access requests
Originators may have first use
To ensure that research teams get appropriate recognition for the effort involved in collecting and analysing data, those who undertake research council funded work may be entitled to a limited period of privileged use of the data they have collected to enable them to publish the results of their research. The length of this period varies by research discipline and, where appropriate, is discussed further in the published policies of individual Research Councils.
... may be entitled ...c.f. will be entitled
... limited period of privileged use ... “Withholding of data without good reason beyond this period is not
acceptable”... to publish the results of their research ... to work on and publish the results ... the length of this period varies ...
individual research councils’ policies elaborate
Reusers have responsibilities
In order to recognise the intellectual contributions of researchers who generate, preserve and share key research datasets, all users of research data should acknowledge the sources of their data and abide by the terms and conditions under which they are accessed.
... abide by the terms and conditions ... terms and conditions may exist
to monitor useto promote terms and conditions of use
... should acknowledge the sources of their data ...
Data citation c.f. should be required to acknowledge....
Data sharing is not freeIt is appropriate to use public funds to support the management and sharing of publicly-funded research data. To maximise the research benefit which can be gained from limited budgets, the mechanisms for these activities should be both efficient and cost-effective in the use of public funds.
Data sharing costs!Marginal cost may be small but intial cost may be high
Even if data are free at the point of use (which it may not be) there are costs behind.
Cost is fundable by RCs - but needs tensioning against other research• The policies, models and mechanisms for managing and providing access to research data• Funders will work with grant holders to .....”• No prescription on how funding should be raised.”
You can ask for funding... ... but be reasonable
Outcomes
• Ensure the continuing availability of data of long-term value
• Facilitate the development of mechanisms to improve the management, accessibility and attribution of data
• Promote awareness of legislation and guidance relating to the management and dissemination of information
The Seven Ps
• Public good Public good
• Preservation Preservation
• Discoverability Promotion
• Confidentiality Protection & Privacy
• First use Privilege
• Recognition Probity
• Costs Price
Overview
• Introduction– What is STFC? What do we do? Why do we do it?
• An example project – Not: ODE, APARSEN, EUDAT, SCAPE, SCIDIP-ES
• RCUK Data Policy Principles
• a final thought
The 7 C’s
Creation Collection
Capacity
Computation
Curation
Collaboration Communication
Permanent Access
Provenanced Research
The KnowledgeLifecycle
DataCreation
Archival
Access
Storage ComputeNetworkServicesCuration
www.pan-data.eu
www.rcuk.ac.uk/research/Pages/DataPolicy.aspx
Thank You
Building an Open Data Infrastructure for Science
Policy and Practice
Juan BicarreguiSTFC e-Science
APA conference, November 2011