Overview of the data pilot and OpenAIRE tools, Elly Dijk and Marjan Grootveld (OpenAIRE workshop,...

The Data Pilot and OpenAIRE tools Update on Research Data Management Elly Dijk and Marjan Grootveld Data Archiving and Networked Services (DANS)


Outline
1. Introduction to the Open Research Data Pilot
2. Prepare for responsible research
3. During the research project
4. Data services
5. In summary
6. Introduction to the afternoon's In Practice sessions

1. Introduction to the Open Research Data Pilot
- Horizon 2020
- Open Research Data Pilot

- Financing European research and innovation projects
- Increasing the competitive position of Europe and finding solutions for societal challenges, e.g. climate change, food security, health and wellbeing, secure societies
- Successor of the FP7 programme (KP7)
- Period 2014-2020; the budget is 80 billion euros
- National Contact Points for the H2020 programme
http://ec.europa.eu/programmes/horizon2020

New EC guidelines: 2015


http://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/oa_pilot/h2020-hi-oa-pilot-guide_en.pdf

OpenAIRE supports Horizon 2020 demands

OpenAIRE support for data

All information is available via https://www.openaire.eu/opendatapilot

Open Research Data Pilot
Aim: to make the research data generated by selected Horizon 2020 projects accessible with as few restrictions as possible, while at the same time protecting sensitive data from inappropriate access.

EC: information already paid for by the public should not be paid for again.

Open data is data that is free to access and reuse

Two types of data:
- Data, including metadata, needed to validate the results in scientific publications
- Other data, including metadata, as specified in the Data Management Plan, like raw data

Which research has to participate in the pilot?
- Future and Emerging Technologies
- Research infrastructures
- Leadership in enabling and industrial technologies
- Nanotechnologies, Advanced Materials, Advanced Manufacturing and Processing, and Biotechnology
- Societal Challenge: Food security, sustainable agriculture and forestry, marine and maritime and inland water research and the bioeconomy
- Societal Challenge: Climate Action, Environment, Resource Efficiency and Raw materials
- Societal Challenge: Europe in a changing world - inclusive, innovative and reflective Societies
- Science with and for Society
- Cross-cutting activities - focus areas part Smart and Sustainable Cities

Opting out / opting in

Opting out of the pilot is possible when motivated

And opting in is also possible

Reasons for total or partial opting out
- Incompatible with the Horizon 2020 obligation to protect results if they can reasonably be expected to be commercially or industrially exploited;
- Incompatible with the need for confidentiality in connection with security issues;
- Incompatible with existing rules concerning the protection of personal data;
- If the project will not generate / collect any research data;
- If there are other legitimate reasons to not take part in the Pilot

Opting in
Voluntary opting in is also possible

When a researcher wants to publish and share his/her data as open access

Mandate for open access to publications: aim to deposit at the same time the research data needed to validate the results ("underlying data")

Opt in / Opt out numbers
Basis: 3,699 Horizon 2020 signed grant agreements

Calls in core areas: opt out 34.6% (149/431 proposals)

Other areas: voluntary opt in 12.5% (409/3268 proposals)

Conclusion: These numbers in the proposals for the first calls of Horizon 2020 are encouraging.

Comprehensive follow up needed

Numbers by Daniel Spichtinger, European Commission, at OpenCon 14-11-15
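
The percentages on this slide can be reproduced from the raw counts. The following is a minimal Python sketch that uses only the figures quoted on the slide; the variable and function names are our own, chosen for illustration.

```python
# Counts quoted on the slide (Horizon 2020 signed grant agreements).
core_total, core_opt_out = 431, 149      # calls in core areas
other_total, other_opt_in = 3268, 409    # calls in other areas

def percentage(part: int, whole: int) -> float:
    """Return part/whole as a percentage, rounded to one decimal."""
    return round(100 * part / whole, 1)

opt_out_rate = percentage(core_opt_out, core_total)
opt_in_rate = percentage(other_opt_in, other_total)
total_agreements = core_total + other_total

print(f"Opt out in core areas: {opt_out_rate}%")
print(f"Voluntary opt in elsewhere: {opt_in_rate}%")
print(f"Total agreements: {total_agreements}")
```

The two groups together add up to the 3,699 agreements given as the basis.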


Reasons for opting out

Numbers by Daniel Spichtinger, European Commission, at OpenCon 14-11-15

relates to controversial or security issues that might have undesired societal consequences if research results became known prematurely

Requirements Open Data Pilot
Data Management Plan required within six months after project grant

Deposit your data in a research data repository

Open data is data that is free to access and reuse: Creative Commons Licence CC-BY or CC0


2. Prepare for responsible research
- Data management planning
- Stakeholders

Data Management Planning
Video by Research Data Netherlands, http://datasupport.researchdata.nl

How to write a DMP
Template available from https://dmponline.dcc.ac.uk/

And from a few national DMPonline sites, e.g. in Spain and Belgium
See https://www.openaire.eu/opendatapilot-dmp
 - Spain: http://pgd.consorciomadrono.es/
 - Belgium: forthcoming

Second, you can select your organisation, but it is no problem if it's not on the list. Note that you may also find projects here, such as ELIXIR for life sciences.

You may want to include the guidance provided by the DCC. This is a good addition to the guidance that the EC provides on the questions of the template. Next, click CREATE.

The DMP is not a fixed document
Self-assigned ID

You're asked to provide some basic information. Please note that the ID here is one that you enter yourself, for your convenience. I'll show you in a second where I did this.

Briefly specify
- how data will be captured/created
- how it will be documented
- according to what standards
- who will be able to access it
- where it will be stored
- how it will be backed up, and
- where and how it will be shared and preserved long-term

This page summarises that the DMP is a deliverable to be submitted within 6 months into the project. Below the orange bar it lists the topics of the initial DMP.

ID of the dataset, assigned by PI
EC guidance
PI's answer
Initial DMP

You're asked to provide some basic information. Please note that the ID here is one that you enter yourself, for your convenience and that of your collaborators.

In this way the researcher proceeds to write the plan. More details follow in a second, but let's first look ahead:

Template mid-term review DMP

Broad notions: the data and associated metadata should be managed in a way that allows for future reuse

And make sure that you know what will be asked of you for the mid-term and the final review: the focus here is on enabling reuse of your data by your future self and others. In a couple of minutes I'll tell you why this is a bit underspecified.

Okay, this is the easy part: there is a template. What's really at stake of course is: what to write in the plan, and who should be involved?

Roles and responsibilities

Institution: RDM policy, facilities

Research funders

Publishers: Data Availability Policy

Commercial partners

The process of planning is also a process of communication, increasingly important in interdisciplinary / multi-partner research. Collaboration will be more harmonious if project partners (in industry, other universities, other countries) are in accord.

Open Science encourages and indeed requires heterogeneous stakeholder groups to work together for a shared societal goal. It's worth bearing in mind that RDM and DMP are similarly hybrid activities, involving multiple stakeholder types:

- The principal investigator (usually ultimately responsible for data)
- Research assistants (may be more involved in day-to-day data management)
- Ideally, they have a front office (FO) in the institute and/or in the domain: Library/IT/Legal/Funding office. (The library may issue PIDs, or liaise with an external service that does this, e.g. DataCite. The funding office may have a compliance role.)
- And the FO ideally relies on back-office (BO) services, such as long-term archives and high-capacity data transfer.
- Partners based in other institutions
- Commercial partners
- Publishers
- etc.

Many of us have a role in the FO or the BO: hand raising!!

Let's recall the goal:
Open access to research data refers to the right to access and re-use digital research data. Openly accessible research data can typically be accessed, mined, exploited, reproduced and disseminated free of charge for the user. The use of a Data Management Plan (DMP) is required for projects participating in the Open Research Data Pilot, detailing what data the project will generate, whether and how they will be exploited or made accessible for verification and re-use, and how they will be curated and preserved.

http://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/oa_pilot/h2020-hi-oa-data-mgt_en.pdf

Remember, we are still in the early stage of a project.

Negative intermezzo
- Stored data is not in itself curated and preserved
- Preserved (or: archived) data is not in itself findable
- Findable data is not in itself accessible
- Accessible data is not in itself understandable
- Understandable data is not in itself usable

What should be archived for long-term reuse is a package of data + context:


What should be deposited?
- The data needed to validate results in scientific publications (minimally!).
- The associated metadata: the dataset's creator, title, year of publication, repository, identifier etc. Follow a metadata standard in your line of work, or a generic standard, e.g. Dublin Core or DataCite. Standards are important for discovering and exchanging data.
- The repository will assign a persistent ID to the dataset: important for discovering and citing the data.
- Documentation like code books, lab journals, informed consent forms: domain-dependent, and important for understanding the data and combining them with other data sources.
- Software, hardware, tools, syntax queries, machine configurations: domain-dependent, and important for really using the data. (Alternative: information about the software etc.)
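
To make the metadata item concrete, here is a minimal Python sketch of a record holding the five elements named on the slide, with a check for missing ones. It is an illustration only: the field names are our own labels, the values are invented, and the sketch does not implement the Dublin Core or DataCite schemas.

```python
# Hypothetical minimal metadata record; field names follow the slide,
# values are invented for illustration only.
REQUIRED_FIELDS = ("creator", "title", "publication_year", "repository", "identifier")

def missing_fields(record: dict) -> list:
    """Return the required fields that are absent or empty in the record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]

example_record = {
    "creator": "Example, A.",
    "title": "Example survey data",
    "publication_year": 2015,
    "repository": "Example Data Archive",
    "identifier": "",  # left empty here: the repository assigns the persistent ID
}

print(missing_fields(example_record))
```

A check like this could be run before deposit; the persistent identifier is the one element the researcher cannot fill in alone, because the repository assigns it.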

Basically, everything that is needed to replicate a study should be available for others. Hence the name "replication package", although the aspiration is reuse rather than replication: more is most welcome. More data, more information in the package and described in the DMP.

Re Software etc: in many cases copyright will prevent the archiving of software and tools. The alternative is a sensible description.

More about this in the break-out session after the lunch break

http://www.veryicon.com/icons/object/package-icons/packageicon-zip.html

https://commons.wikimedia.org/wiki/File%3ABudget_Debate_2011_(5611505228).jpg

At this point, usually a lively discussion ensues, because "yes, but this is different in our domain". Exactly, and that's why domains should maintain or develop and promote their standards. Researchers are quite capable of giving a sensible interpretation to the message "manage the data properly" for their own line of work, and of helping to implement and foster measures to do so.

Now, should one really deposit and publish all data, raw, intermediate results and so on for eternity?


Open Access to all data, unless

Confidentiality and security issues can be good reasons not to publish or share all data. Note in the DMP* the reasons for not giving access, and deposit that part of the data under a Restricted Access regime.
E.g. when regenerating data would be cheaper than archiving, don't archive. Spend time on selecting what data you'll need and want to retain. Motivate your criteria in the DMP.

* See http://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/oa_pilot/h2020-hi-oa-data-mgt_en.pdf
For selection criteria see https://www.openaire.eu/opendatapilot

Grant Agreement, Art. 29.3, Open Access to research data:

One size doesn't fit all. When you're a criminologist, your respondents will probably like to remain anonymous.

Repository, archive, ehm?
A pilot requirement is to deposit your data in a research data repository: a digital archive collecting and displaying datasets and their metadata. Select a data repository that will preserve your data, metadata and possibly tools in the long term. It is advisable to contact the repository of your choice when writing the first version of your DMP. Repositories may offer guidelines for sustainable data formats and metadata standards, as well as support for dealing with sensitive data and licensing.

But how to find a repository? More in a few minutes


EC guidance
PI's answer
Initial DMP

Recap: the researcher has made the planning, together with the stakeholders. Now finish the plan...

Several export formats

... and select an export format; for the EC, PDF is fine.

Deliver the DMP
Send the initial DMP version to the Commission within six months.
EC: "Since DMPs are expected to mature during the project, more developed versions of the plan can be included as additional deliverables at later stages. (...) New versions of the DMP should be created whenever important changes to the project occur due to inclusion of new data sets, changes in consortium policies or external factors."

The DMP is a deliverable/milestone to be delivered in the first 6 months AFTER the start of the project. The project officer and reviewers will ask for it, will evaluate it and give it a mark like any other deliverable (excellent, good, needs revision, rejected). This usually happens at the first review, unless the Project Officer is quite meticulous.

In subsequent reviews (or any time they feel like it) the PO and reviewers may check to see if the DMP is followed (e.g., data files deposited, access status, metadata format, ...).

3. During the project
Data management is part of good research

Roles and responsibilities

Institution: RDM policy, facilities

Research funders

Publishers: Data Availability Policy

Commercial partners

While the researchers and research assistants carry out the project, they might need some support in dealing with data:
- For anonymising sensitive data
- For deciding whether they can share particular data within their discipline with colleagues from institutions outside the project consortium (non-beneficiaries)
- For dealing with unexpected data formats, due to new instruments
- When institutional repositories turn out to be unable to sustainably preserve big data
- For dealing with developments in the Open Access to publications sector, with potential ramifications for the underlying data
- For dealing with changing policies or regulations, at the institutional, national or international level: the announced new European data protection regulation might have serious impact on using personal data in research
- Et cetera

That's OK, as long as you are around.

Linking data and publications
From a data-centric perspective publications are part of a dataset's context. However, there is no need to include publications in the replication package:
- A lot of data repositories also accept publications, and allow linking between publications and their underpinning data.
- By means of smart, persistent identifiers, consistently used, linking is also possible across repositories.

"Consistently used": as in many situations this is only to a small extent a technical matter; to a larger extent it depends on organising, agreements, and people citing responsibly. So, if you are in a position to stimulate this, please do so. On top of this, the OpenAIRE project investigates and improves automatic linking, also often via the PID.

Incentives 1

In several domains there are indications that publications WITH links to data receive more citations than papers WITHOUT data links. This study in astrophysics is a recent example.

Vegetation map 1977 reused in 2015 expedition

H.D. Heinemeijer & A.J. van Dijk (1977): Vegetation map Rosenbergdalen, Edgeøya, Svalbard

http://sees.nl/

During an expedition to Spitsbergen in 1977, much data on vegetation and biomass was collected. This was used to make the map on the left-hand side. Fortunately, the underlying data was still available and interpretable when, a couple of months ago, another expedition went to Spitsbergen. Researchers from the Arctic Centre in Groningen in the Netherlands were able to reuse the data for plotting and analysing the changes that have occurred in four decades.

Incentives 3

Image: https://www.flickr.com/photos/dmh650/4031607067/in/gallery-wlef70-72157633022909105/

"Data management is a part of good research practice."
RCUK Policy and Code of Conduct on the Governance of Good Research Conduct

"Responsible data management is part of good research."
NWO, Introduction to the pilot Data Management

Proper data management won't prevent theft of your laptop, but will help you to keep your data safe even if they are not meant to go public.

4. Data Services
- Trustworthy digital repositories
- Find a data repository
- Research data in OpenAIRE

Storage and Trust
Local storage facilities during the research

Network of trustworthy digital repositories for long-term preservation of (a selection of) the data after the research is finished

Certification of digital repositories in order to establish trust

4 certification standards available

- mission to provide reliable, long-term access to managed digital resources to its designated community, now and into the future
- constant monitoring, planning, and maintenance
- understand threats to and risks within its systems
- regular cycle of audit and/or certification

DIN 31644 / ISO 16363

Council for Science World Data System (ICSU-WDS). With this certification, WDS confirms that DANS is reliable with regard to the authenticity, integrity, trustworthiness and availability of data and data services.

Where to find a repository?
In order of preference, use:
- an external data archive or repository in your research domain
- an institutional research data repository, or your research group's established data management facilities
- Zenodo.org
- or search for other data repositories at re3data.org

http://www.zenodo.org/ http://www.re3data.org/

- Use an external data archive or repository already established for your research domain to preserve the data according to recognised standards in your discipline.
- If available, use an institutional research data repository, or your research group's established data management facilities.
- Use a well-known data repository in your own country.
- Use a cost-free (data) repository such as Zenodo.
- Search for other data repositories here: re3data.org
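
The order of preference can be read as a simple fall-through rule: take the first option that is actually available to the project. A minimal Python sketch, assuming our own labels for the options (they are not names used by OpenAIRE or the EC):

```python
# Order of preference as listed in the notes above; the labels are our own.
PREFERENCE_ORDER = [
    "domain_repository",
    "institutional_repository",
    "national_repository",
    "zenodo",
]

def choose_repository(available: set) -> str:
    """Return the most preferred option that is available to the project."""
    for option in PREFERENCE_ORDER:
        if option in available:
            return option
    # Nothing on the list is available: search the re3data.org registry.
    return "search_re3data"

print(choose_repository({"zenodo", "institutional_repository"}))
```

In this example the institutional repository wins over Zenodo because it comes earlier in the list.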


Main criteria for choosing a data repository:

- Certification as a Trustworthy Digital Repository, with an explicit ambition to keep the data available in the long term.
- Matches your particular data needs: e.g. formats accepted; mixture of Open and Restricted Access.
- Gives your submitted dataset a persistent and globally unique identifier: for sustainable citations both for data and publications and to link back to particular researchers and grants.
- Provides guidance on how to cite the data that has been deposited.

How to select a repository?
https://www.openaire.eu/opendatapilot-repository

- Certification as a trustworthy digital repository (TDR), with an explicit ambition to keep the data available in the long term. We know of course that several domains have longstanding archives that are not certified as TDR, because they are unsure how much effort a certification process entails. We think that's a pity. three-tiered proces... And the Open Science, Open Access, Open Data effort should really encourage the willing repositories to apply for certification.
- Matches your particular data needs (e.g. formats accepted; access, back-up and recovery, and sustainability of the service). Most of this information should be contained within the data repository's policy pages.
- Gives your submitted dataset a persistent and unique identifier: for sustainable citations both for data and publications and to link back to particular researchers and grants.
- Lands visitors at the dataset or its metadata.
- Helps to track how the data has been used by providing access and download statistics.
- Offers clear terms and conditions that meet legal requirements (e.g. for data protection) and allow reuse without unnecessary licensing conditions.
- Provides guidance on how to cite the data that has been deposited.

Elly will tell more about trustworthy repositories and also say a few words about storing data safely DURING the project, because that's also part of data management.

- re3data.org is a global registry of research data repositories from different academic disciplines
- It presents repositories for the permanent storage and access of data sets
- Funded by the German Research Foundation (DFG)
- 2015: 1,368 reviewed repositories

No data available for Latvia, Belarus, Bulgaria, Moldova and some other countries

Zenodo is developed by CERN under the EU FP7 project OpenAIREplus


https://zenodo.org/

Contents


OpenAIRE2020

https://zenodo.org/features

- Research. Shared.: all research outputs from across all fields of science are welcome!
- Citeable. Discoverable.: uploads get a Digital Object Identifier (DOI) to make them easily and uniquely citeable.
- Community Collections: accept or reject uploads to your own community collections (e.g. workshops, EU projects or your complete own digital repository).
- Funding: integrated in reporting lines for research funded by the European Commission via OpenAIRE.
- Flexible licensing: because not everything is under Creative Commons.
- Safe: your research output is stored safely for the future in the same cloud infrastructure as research data from CERN's Large Hadron Collider.
- DropBox integration: upload files straight from your DropBox.

OpenAIRE

Research data in OpenAIRE

HYPOX: FP7 project
- 52 publications from 20 different OpenAIRE data providers
- 392 datasets from PANGAEA
Slide from Pedro Principe, University of Minho

FP7 projects: publications + datasets HYPOX > https://www.openaire.eu/search/project?projectId=corda_______::abb5725eaf2617c39ae240b4ce1cce3e http://hypox.net;

Slide from Pedro Principe, University of Minho

392 datasets from PANGAEA AND 52 publications from:
- Unknown Repository (30)
- Biogeosciences (BG) (11)
- Biogeosciences (7)
- Biogeosciences Disc... (6)
- OceanRep (6)
- Ocean Science (OS) (4)
- Europe PubMed Central (3)
- Open Repository and... (2)
- Electronic Publicat... (2)
- NERC Open Research ... (2)
- Ocean Science Discu... (1)
- PLoS ONE (1)
- Ocean Science (OS) (1)
- e-Prints Soton (1)
- University of South... (1)
- Research@StAndrews:... (1)
- ArchiMer - Institut... (1)
- Earth-prints Reposi... (1)
- Archive ouverte UNIGE (1)
- Ghent University Ac... (1)

- Open Access funded publications aggregated from repositories & journals
- Datasets from data repositories

Concrete examples:
Publication: https://www.openaire.eu/search/publication?articleId=dedup_wf_001::d5a93e225ee49168ddd6bb8c85acd4c6
Dataset: http://doi.pangaea.de/10.1594/PANGAEA.779512
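
The dataset link in this example carries a DOI, the kind of persistent identifier that makes linking across repositories possible. As a minimal Python sketch, the DOI can be pulled out of the PANGAEA URL and turned into a link through the generic doi.org resolver. The regular expression is an assumption that covers the common DOI shape; it is not a full validator, and the function names are ours.

```python
import re

# Pattern for the common "10.<registrant>/<suffix>" DOI shape; an assumption
# that covers typical DOIs, not a full validator.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")

def extract_doi(text: str):
    """Return the first DOI-like string found in text, or None."""
    match = DOI_PATTERN.search(text)
    return match.group(0) if match else None

def resolver_url(doi: str) -> str:
    """Build a link through the generic doi.org resolver."""
    return f"https://doi.org/{doi}"

dataset_url = "http://doi.pangaea.de/10.1594/PANGAEA.779512"
doi = extract_doi(dataset_url)
print(doi)
print(resolver_url(doi))
```

Citing the resolver form rather than a repository-specific address keeps the reference stable if the repository reorganises its website.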



Pilot and/or RDM costs
EC: "Costs relating to the implementation of the pilot will be eligible."
Some considerations based on a presentation by the Dutch NCP, Michael Schijns, during a meeting with liaison officers, 28 Sept. 2015:
General conditions for direct costs to be eligible (not exhaustive):
- Actually incurred by the beneficiary
- During the project period, i.e. within the duration of the Grant Agreement
  How does this relate to long-term availability?
- Identifiable and verifiable
  IT and library support often not explicitly linked to specific project or PI > not identifiable? To be furnished by research organisation?
- Necessary for the project
- Indicated in the estimated budget
  Many initiatives for budgeting RDM costs, but no accepted model yet

Still a lot to be done, also in other funding schemes.

RDA/WDS Publishing Data Cost Recovery for Data Centres: https://rd-alliance.org/groups/rdawds-publishing-data-cost-recovery-data-centres.html
4C: Collaboration to Clarify the Costs of Curation: http://4cproject.eu/index.php

I hoped that we could avoid this topic

It's good that this is a pilot, because in this area the rules really should become clearer, also for other funders. DIRECT costs primarily include personnel costs; costs of goods, works and services; depreciation costs of equipment, infrastructure and other assets; travel costs. We ignore INDIRECT costs here: this slide is just to give you an impression of issues that have no clear answer yet.

Question about costs
Costs related to the pilot are eligible for reimbursement

Would the reimbursement of research data management costs fall under 'open access incurred by beneficiaries are eligible for reimbursement during the duration of the project' (article 6.2.D.3 of the Model Grant Agreement)? Should these costs be included in H2020 funding applications or can they be claimed back during the financial reporting periods?

At a recent meeting about reimbursement of data management costs for the pilot, our NCP said that costs can be claimed, provided (among other things) that they fall within the project duration and are "indicated in the estimated budget". That would mean the second question gets a double yes: yes, the costs must be included (at least indicated in the estimated budget) in the application, AND yes, they are then eligible for reimbursement.

5. Summary
Research projects in 9 designated Horizon 2020 areas are automatically part of the pilot, e.g. Future and emerging technologies; Nanotechnologies; Climate action; Sustainable agriculture.

Opting in / opting out is possible

Data Management Plan required within six months after project grant

Deposit the research data in a trusted research data repository

Open data is data that is free to access and reuse: Creative Commons Licence CC-BY or CC0

11,000 open datasets in OpenAIRE


Slogan EC: As open as possible, as closed as needed

6. Introduction to the afternoon's In Practice session

WP 4 Training and Support
Task 4.3. Research Data Management training and support

DANS (Data Archiving and Networked Services) is task leader

Support kit for Open Research Data Pilot: https://www.openaire.eu/opendatapilot

Briefing paper: Research Data management - Support for Open Research Data Pilot OpenAIRE 2020


Programme In Practice session
Situation of RDM in your country:
- Introduction of you and the situation in your country regarding RDM

Breakout sessions: feedback on the briefing paper RDM
- Section 2: Reusability and data management
- Section 3: How to plan data management
- Section 5: Roles and responsibilities in RDM

Wrap up and other questions / suggestions for future support materials

[email protected]@dans.knaw.nl

www.openaire.eu
@openaire_eu
facebook.com/groups/openaire
linkedin.com/groups/OpenAIRE-3893548
Thank you!