EUDAT: Towards a Collaborative Data Infrastructure. Bielefeld 10th International Conference. Daan Broeder (MPI for Psycholinguistics; EUDAT, CLARIN, DASISH)

Page 1: EUDAT Towards a Collaborative Data Infrastructure (conference.ub.uni-bielefeld.de/programme/presentations/Broeder_BC2012.pdf)

EUDAT

Towards a Collaborative Data Infrastructure

Bielefeld 10th International Conference

Daan Broeder

- MPI for Psycholinguistics

- EUDAT

- CLARIN

- DASISH

Page 2

Data

• These days it is very easy to create data, but still far less easy to manage it.
– Experiment data
– Sensor-produced data
– Simulations
– Digital libraries
– The Web
– …

• How to store, to administrate, to find, to enrich, to link, to process, to share, to reuse, …, to publish

• For this we need a data infrastructure

• One that is efficient, sustainable and cost effective

Page 3

Data creation cycle

[Diagram of the data creation cycle. Labels: raw data; registration & preservation; referable data; analysis & enrichment; temp. data; citable data; citable publication]

Page 4

The current data infrastructure landscape

Long history of data management in Europe: several existing data infrastructures dealing with established and growing user communities (e.g., ESO, ESA, EBI, CERN)

New Research Infrastructures (ESFRI roadmap) are emerging and are also trying to build data infrastructure solutions to meet their needs (CLARIN, EPOS, ELIXIR, ESS, etc.)

However, most of these infrastructures and initiatives address primarily the needs of a specific discipline and user community

Challenges
• Compatibility and interoperability for cross-disciplinary research
• Data growth in volume and complexity: strong impact on costs, threatening the sustainability of the infrastructure

Opportunities
• Synergies do exist: although disciplines have different workflows and ambitions, they have common basic needs and requirements that can be matched with generic services supporting multiple communities

Strategy needed at pan-European level

Page 5

Collaborative Data Infrastructure

Page 6

EUDAT short fact list

Project name: EUDAT – European Data
Start date: 1 October 2011
Duration: 36 months
Budget: 16.3 M€ (including 9.3 M€ from the EC)
EC call: Call 9 (INFRA-2011-1.2.2): Data infrastructure for e-Science (11.2010)
Participants: 25 partners from 13 countries (national data centres, technology providers, research communities, and funding agencies)
Objectives: “To deliver cost-efficient and high quality Collaborative Data Infrastructure (CDI) with the capacity and capability for meeting researchers’ needs in a flexible and sustainable way, across geographical and disciplinary boundaries.”

Page 8

Research Communities

Page 9

Research fields

EUDAT targets all scientific disciplines (discipline neutral):
• To enable the capture and identification of cross-discipline requirements
• To involve the scientists of all the communities in the shaping of the infrastructure and its services

Environmental Science: ENES, EPOS, LifeWatch, EMSO, IAGOS-ERI, ICOS, Euro-Argo
Social Sciences and Humanities: CLARIN, CESSDA, DARIAH
Biological and Medical Science: VPH, ELIXIR, BBMRI, ECRIN, DiXA
Physical Sciences and Engineering: WLCG, ISIS, PanData
Material Science: ESS…

Page 10

EUDAT service design activities

1. Capturing Community Requirements (WP4)
• 1st round of interviews with the five initial communities (Oct. 2011 - Dec. 2012)
– Understand how data is organised in each community
– Collect first wishes and specific requirements for a common data service layer
• Next phase: refine the analysis and expand it to other communities

2. Building the corresponding services (WP5)
• Technology appraisal (ongoing)
– What is already available at partners' sites to build the corresponding services?
– What are the gaps and market failures that should be addressed by EUDAT?
• Next phase: developing candidate services
– Adapt services to match the requirements
– Integrate with community and SP services
– Test and evaluate with communities

3. Deploying the services and operating the federated infrastructure (WP6)
• Designing the federated infrastructure and the interfaces for cross-site operations (ongoing)
• Next phase: integrating and coordinating resource provision, operations and support

Page 11

EUDAT Core Service Areas

Core services are the building blocks of EUDAT's Common Data Infrastructure, mainly included in the bottom layer of data services.

Community-oriented services
• Simple data access and upload
• Long-term preservation
• Shared workspaces
• Execution and workflow
• Joint metadata and data visibility
• Simple storage facility for individual scientists and small projects

Enabling services (making use of existing services where possible)
• Persistent identifier service (EPIC, DataCite, ...)
• Federated AAI service (NRENs, eduGAIN)
• Network services
• Monitoring and accounting

Page 12

Data Management Service Cases

• Safe Replication
– Replicate data between selected centers
– Based on user-specified policies
– For LTA, for easy access, …
– Technology: iRODS
• Dynamic Replication (data staging)
– Moving data to HPC workspaces and storing the results
– Technology: iRODS + grid tools
• Usable PID framework
– Facilitate administrating data replication
– Allow identifying ‘parts’ of objects
– Data verifiability, …
– Technology: HS + EPIC and DataCite
• Center registry
– Listing EUDAT services, centers and their capabilities
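To make the "usable PID framework" points concrete, here is a minimal Python sketch of a PID record that carries one location per replica and a checksum for verifiability. It is an illustration only, not the Handle System, EPIC or DataCite API: the `PidRecord` class, its fields, the PID string and the URLs are all hypothetical.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class PidRecord:
    """Hypothetical PID record: one identifier, one location per replica."""
    pid: str
    checksum: str  # SHA-256 hex digest of the object's bytes
    locations: list[str] = field(default_factory=list)

    def add_replica(self, url: str) -> None:
        # Register a new copy under the same PID; duplicates are ignored.
        if url not in self.locations:
            self.locations.append(url)

    def verify(self, data: bytes) -> bool:
        # A replica is considered intact when its bytes hash to the stored checksum.
        return hashlib.sha256(data).hexdigest() == self.checksum


payload = b"example research data"
record = PidRecord(pid="11100/0000-EXAMPLE",  # made-up identifier
                   checksum=hashlib.sha256(payload).hexdigest())
record.add_replica("https://center-a.example.org/objects/1")
record.add_replica("https://center-b.example.org/objects/1")
record.add_replica("https://center-a.example.org/objects/1")  # already present
```

Keeping the checksum in the record lets any center check a replica it receives without contacting the original depositor.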

Page 13

Data Management Service Cases

• Joint metadata domain
– A metadata catalogue for (all?) research data
– Interdisciplinary (re-)use of data
– Semantic interoperability: explicit semantics and flexible relations, or hard-wired mappings, …
– Granularity: include individual resources or data sets only
– Commenting function
– Platform permitting data-set promotion: proper acknowledgements for data creators
– Technologies: ICAT, Mercury, OAI-PMH, XSD, RDF, …
• Simple Store
– A safe repository for all research data in need (YouTube or Dropbox model)
– (Detailed?) metadata
– Sharing
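Since OAI-PMH is listed among the candidate technologies for the joint metadata domain, the following Python sketch shows what a harvesting step could look like: building a `ListRecords` request and reading identifiers and Dublin Core titles from a response. The verb, the `metadataPrefix` argument and the XML namespaces come from OAI-PMH 2.0; the endpoint URL, the sample response and the function names are assumptions made for illustration, and no network call is made.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc") -> str:
    # Build the query string for an OAI-PMH ListRecords request.
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query


def extract_titles(response_xml: str) -> list[tuple[str, str]]:
    # Return (record identifier, Dublin Core title) pairs from a response.
    root = ET.fromstring(response_xml)
    pairs = []
    for rec in root.iter(OAI_NS + "record"):
        ident = rec.findtext(f"{OAI_NS}header/{OAI_NS}identifier", default="")
        title = rec.findtext(f".//{DC_NS}title", default="")
        pairs.append((ident, title))
    return pairs


# Hand-written sample response, trimmed to the elements used above.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example corpus</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

url = list_records_url("https://repo.example.org/oai")
records = extract_titles(SAMPLE)
```

A real harvester would also follow resumption tokens and handle deleted records; both are left out to keep the sketch short.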

Page 14

EUDAT Architecture

[Architecture diagram. Components shown: EUDAT community centers; EUDAT data centers; a PRACE HPC center and an EUDAT HPC center, each with an HPC workspace; LTA facilities; the EUDAT PID Service; the EUDAT Metadata Service (harvesting metadata); the EUDAT center registry; the EUDAT Simple Store. Boxes labelled "D" appear at several of these components.]

Page 15

Collaborations

• With the ESFRI (cluster) projects

• With service providers: EPIC, DataCite, …

• EUDAT <-> EGI collaboration (& competition)

• US DataNET: DataOne, Data Conservancy,…

• DAITF - Data Access & Interoperability Task Force
– This task will contribute to the efforts to establish an international task force. This work will be carried out in collaboration with OpenAIRE and other relevant initiatives/projects focusing on data.

Page 16

Thank you for your attention

Page 17

Interlinking data and publications

[Diagram. Labels: datasets & metadata; publications; API (shown twice)]
• Roles: data depositor, data curator, reviewer, author, editor
• Identifiers for actors (ORCID)
• Identifiers for data & publications (HS, DOI, URN)

Page 18

Organizations guiding data management infrastructure building

[Diagram. Organizations shown: ICSU, iCORDI, DAITF, COAR, CODATA, WDC, EUDAT, OpenAIRE]

Page 19

Move to DAITF & iCORDI inspired by OpenAIRE and EUDAT

[Diagram. Recoverable labels:]
• DAITF process: conferences, working groups, hands-on training
• DAITF steering board
• HLSF process: workshops, working groups
• Domain of data infrastructure practitioners
• Domain of top scientists, senior technologists, policy makers
• iCORDI programmes: knowledge exchange programme, workshop programme, prototype programme, analysis programme
• Arrows: influencing, interacting, informing
• Participants: horizontal data infrastructures, discipline/domain data infrastructures, data scientists, young scientists, technologists
• Organisations: EC, NSF, CNRS, KNAW, MPG, CNR, DFG, NWO, STFC
• Bottom-up process towards solutions, driven by science
• Top-down process about strategies and needs, driven by science
• How to organize and support this process? IETF? DWF?
• Other stakeholders: RCs, ROs, funders, etc.
• 1st workshop: March 2012, Copenhagen; next workshop in October, Washington

Page 20

What has been done so far?

[Timeline diagram with axis labels 2008, 2009, 2010, 2011, 2012. Event labels, in extraction order: 2006/8; DataNet1; DAITF Prepar. Workshop; UIPIU; Data2012 Workshop; ASIST Workshop; DataNet2; DWF Concept]
• Brainstorming on data issues, need for global action & first focussed actions tackling first data topics
• Global interaction in place

Page 22

Safe Replication Use Case

• Objective: Allow communities to replicate data to selected data centers for storage in a robust, reliable and highly available manner, while respecting existing conventions on stewardship and security.
– Using user-defined policies, e.g. make 4 copies, don’t copy to the UK, …
• Application: To move data to locations where (1) curation and/or LTP services are present, (2) processing requiring HPC can take place, or (3) user data accessibility is improved.
• Replicated digital objects are identified through a single PID, with multiple locations associated with the PID record; one location per copy.
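The policy example on this slide ("make 4 copies, don't copy to the UK") can be sketched in a few lines of Python. This is only an illustration of evaluating such a policy against a list of centers; it is not how iRODS rules are written, and the `Center` and `Policy` classes, the center names and the country assignments are all made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Center:
    name: str
    country: str  # ISO 3166-1 alpha-2 code


@dataclass(frozen=True)
class Policy:
    copies: int
    excluded_countries: frozenset[str] = frozenset()


def select_targets(centers: list[Center], policy: Policy) -> list[Center]:
    # Drop centers in excluded countries, then take the first N that remain.
    allowed = [c for c in centers if c.country not in policy.excluded_countries]
    if len(allowed) < policy.copies:
        raise ValueError("not enough eligible centers to satisfy the policy")
    return allowed[: policy.copies]


# Hypothetical centers; "GB" is the ISO code for the United Kingdom.
centers = [
    Center("center-a", "DE"),
    Center("center-b", "GB"),
    Center("center-c", "NL"),
    Center("center-d", "FI"),
    Center("center-e", "IT"),
]
policy = Policy(copies=4, excluded_countries=frozenset({"GB"}))
targets = select_targets(centers, policy)
```

Failing loudly when the policy cannot be met, rather than silently making fewer copies, matches the slide's emphasis on reliable replication.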

Page 23

Dynamic Replication Service Case

• Move an entire data set (i.e. data collection) back and forth between an EUDAT node and a non-EUDAT node: PRACE or EGI facilities
• Keep the data replicas at the non-EUDAT nodes in sync with the EUDAT nodes
• Ingest/register relevant simulation results back at the EUDAT nodes

Candidate technologies
• iRODS
• Globus Online
• FTS
• UNICORE FTP
• gTransfer
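As a rough illustration of "keeping replicas in sync", the Python sketch below compares content hashes at a source node and a replica node and derives a transfer plan. It is independent of the candidate technologies listed above and does not model their behaviour; in-memory dictionaries stand in for the two nodes, and `plan_sync` and the object names are hypothetical.

```python
import hashlib


def digest(data: bytes) -> str:
    # SHA-256 content hash used to detect changed objects.
    return hashlib.sha256(data).hexdigest()


def plan_sync(source: dict[str, bytes], replica: dict[str, bytes]) -> dict[str, list[str]]:
    # Compare content hashes per object name and decide what to transfer.
    src = {name: digest(data) for name, data in source.items()}
    dst = {name: digest(data) for name, data in replica.items()}
    return {
        "copy": sorted(n for n in src if n not in dst),
        "update": sorted(n for n in src if n in dst and src[n] != dst[n]),
        "delete": sorted(n for n in dst if n not in src),
    }


# Hypothetical contents of an EUDAT node and an HPC workspace replica.
eudat_node = {"a.dat": b"alpha", "b.dat": b"beta v2", "c.dat": b"gamma"}
hpc_workspace = {"a.dat": b"alpha", "b.dat": b"beta v1", "old.dat": b"stale"}
plan = plan_sync(eudat_node, hpc_workspace)
```

Only objects that are missing or whose hashes differ are scheduled for transfer, which matters when whole collections are moved to and from HPC workspaces.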

Page 24

[Diagram of infrastructure layers. Labels:]
• HPC, GRID services: PRACE, EGI
• SSH community-wide: DASISH (common SSH metadata catalog)
• Community specific: CLARIN LT web service infrastructure
• Network services: GEANT
• Federated Identity Management
• Data replication & preservation, publication: EUDAT
• Communities: CLARIN, DARIAH, CESSDA, LifeWatch
• Clusters: DASISH, ENVRI