Institut für Informatik

CLARIN-D A Metadata-Based Research Infrastructure for the Humanities and Social Sciences

Gerhard Heyer, Dirk Goldhahn

Universität Leipzig

[email protected]

Gerhard Heyer, Dirk Goldhahn 30.10.18 2

Overview

• Concept of a Research Infrastructure

• CLARIN-D

• Sample Applications

Tasks of Research Infrastructures

“Providing science and research with information and related services.” [Rahmenkonzept für die Fachinformationsinfrastruktur in Deutschland, KII 2011]

Idea

Provide assistance to researchers in

– searching and finding research data

– analysing and annotating research data

– making their own research results available to the respective scientific community

Accessing, Analysing, Archiving => AAA

Implementation in CLARIN

Distributed infrastructure with

– standardized interfaces (for metadata, data and tools)

– processing layers and

– communication protocols

A proven paradigm, just like the Internet

Users are both providers and consumers

Layers and Components

CLARIN-D

8 Resource Centres (B)

3 Computing Centres (E)

Web: https://www.clarin-d.net/en/

Funded by BMBF, current phase until 10/2020

Follow-up application in preparation

- Integration with DARIAH (CLARIAH) until 12/2021

- NFDI

CLARIN: For users

Accessing, Analysing, Preparation and Depositing

DEMO

Virtual Language Observatory

Federated Content Search

WebLicht

Accessing, Analysing

FCS – Federated Content Search: Annotation layers

Accessing Data

DEMO

Use Cases in CLARIN-D

Accessing, Analysing

Abstraction of Data Management Plans in CLARIN

● Similarities between DMPs → Reuse of parts

● Allows for (partly) system-generated DMPs

● Abstraction layer: Data Management Protocol

● Following Science Europe Guidance Document “Presenting a Framework for Discipline-specific Research Data Management” of January 2018
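The reuse idea above can be illustrated with a minimal sketch: answers that are the same for every project at a centre are kept once and merged with project-specific answers to produce a partly system-generated plan. The section names, the default texts and both function names are hypothetical and are not taken from CLARIN or from the Science Europe document.

```python
# Hypothetical centre-wide answers that many DMPs could share.
CENTRE_DEFAULTS = {
    "storage": "Deposited in a repository of the centre",
    "identifiers": "Persistent identifiers assigned at ingestion",
    "metadata": "Structured metadata published by the centre",
}


def generate_dmp(project_answers: dict) -> dict:
    """Merge shared defaults with project-specific answers.

    Project answers take precedence over the shared defaults.
    """
    plan = dict(CENTRE_DEFAULTS)
    plan.update(project_answers)
    return plan


def open_questions(plan: dict, required_sections: list) -> list:
    """Return the required sections that still need a manual answer."""
    return [section for section in required_sections if section not in plan]


# Usage: only the project-specific parts have to be written by hand.
plan = generate_dmp({"licensing": "To be agreed with the rights holders"})
todo = open_questions(
    plan, ["storage", "identifiers", "metadata", "licensing", "ethics"]
)
```

In this sketch the plan is only partly generated: whatever remains in `todo` still has to be answered by the researcher.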

Use Cases in CLARIN-D

DEMO

Find a CLARIN-centre

Data Management Plans

Depositing

Ingestion and Integration of Resources

(Internal) standard workflow for ingesting a resource at the CLARIN centre Leipzig (shortened)

Ingestion and Integration of Resources

Why is this so complicated?

→ Minimal requirements:

● Resource is stored securely (consistency)

● Resource is available for a long-term period

● Resource can be referenced and fetched (e.g. via URL)

But: supporting research data successfully over its complete lifecycle requires more
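One common way to check the consistency requirement listed above is a fixity check: a checksum is recorded at ingestion and recomputed later. The slides do not say how the CLARIN centres implement this, so the following is only a sketch under that assumption; the file name and function names are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path: Path, recorded_digest: str) -> bool:
    """Compare the current digest with the one recorded at ingestion."""
    return sha256_of(path) == recorded_digest


# Usage: record a digest at ingestion, then re-check it later.
with tempfile.TemporaryDirectory() as tmp:
    resource = Path(tmp) / "corpus.txt"  # hypothetical resource file
    resource.write_bytes(b"example corpus content")
    recorded = sha256_of(resource)
    intact_before = is_unchanged(resource, recorded)

    # Simulate silent corruption or an unnoticed modification.
    resource.write_bytes(b"example corpus content, altered")
    intact_after = is_unchanged(resource, recorded)
```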

Ingestion and Integration of Resources

Long-lasting references (Persistent Identifiers)

● Link rot as a major problem for digital resources

● Need for long-lasting references that can be adapted if the need arises (e.g. when a resource or project is moved)

● Solution: use of Persistent Identifiers (DOI, Handle, ARK, ISLRN etc.)

● Automatic registration in PID registries as part of the ingestion process
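The indirection behind persistent identifiers can be sketched with a toy in-memory registry: the PID stays fixed while the location it points to can be updated when a resource moves. `PidRegistry` is a hypothetical stand-in for illustration, not the API of any real PID service, and the prefix and URLs are made up.

```python
from dataclasses import dataclass, field


@dataclass
class PidRegistry:
    """Toy in-memory PID registry (hypothetical, not a real service API)."""

    prefix: str
    _targets: dict = field(default_factory=dict, init=False)
    _counter: int = field(default=0, init=False)

    def register(self, url: str) -> str:
        """Mint a new PID for a URL, e.g. as a step of the ingestion process."""
        self._counter += 1
        pid = f"{self.prefix}/{self._counter:04d}"
        self._targets[pid] = url
        return pid

    def update(self, pid: str, new_url: str) -> None:
        """Point an existing PID at a new location."""
        if pid not in self._targets:
            raise KeyError(f"unknown PID: {pid}")
        self._targets[pid] = new_url

    def resolve(self, pid: str) -> str:
        """Return the current location for a PID."""
        return self._targets[pid]


# Usage: the citation keeps the PID, only the registry entry changes.
registry = PidRegistry(prefix="21.T99999")  # hypothetical prefix
pid = registry.register("https://old.example.org/corpus/v1")
registry.update(pid, "https://new.example.org/corpus/v1")
resolved = registry.resolve(pid)
```

Anyone who cited `pid` before the move still reaches the resource afterwards, which is what protects against link rot.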

Ingestion and Integration of Resources

Discoverability (Metadata)

● Researchers have to be able to find relevant resources

→ Creation and publication of structured metadata

● Task that is highly specific and depends on targeted communities and resource types

– What is mandatory or optional information about the resource?

– Consistency of metadata? Use of established vocabularies?

– References to authority files? Which ones are relevant?

– What metadata formats are relevant for the specific context?
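The questions above about mandatory fields and established vocabularies can be turned into an automated check at ingestion time. The sketch below assumes a deliberately small, hypothetical metadata profile; the field names and the choice of vocabulary are illustrative only and do not reproduce any CLARIN metadata profile.

```python
# Hypothetical profile: which fields are mandatory and which are optional.
MANDATORY_FIELDS = {"title", "creator", "language", "licence"}
OPTIONAL_FIELDS = {"description", "subject"}

# Small excerpt of ISO 639-3 language codes used as a controlled vocabulary.
LANGUAGE_VOCABULARY = {"deu", "eng", "fra"}


def validate(record: dict) -> list:
    """Return a list of human-readable problems found in a metadata record."""
    problems = []
    present = set(record)
    for name in sorted(MANDATORY_FIELDS - present):
        problems.append(f"missing mandatory field: {name}")
    for name in sorted(present - MANDATORY_FIELDS - OPTIONAL_FIELDS):
        problems.append(f"unknown field: {name}")
    language = record.get("language")
    if language is not None and language not in LANGUAGE_VOCABULARY:
        problems.append(f"language not in vocabulary: {language}")
    return problems


# Usage: a record with a free-text language value and no licence.
issues = validate(
    {"title": "Example Corpus", "creator": "Example Project", "language": "german"}
)
```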

Ingestion and Integration of Resources

Discoverability (Metadata)

● Publication of metadata via relevant interfaces

● Depends on the standards used in the specific community

● Popular interfaces: OAI-PMH, SPARQL, ...

● Makes the difference between "resource exists (somewhere)" and "resource can be actively found and used"

● Goal: ingestion leads automatically to visibility in relevant search and working environments
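As an illustration of what publication via OAI-PMH looks like from the harvester's side, the sketch below builds a `ListRecords` request URL. The verb and the `metadataPrefix`, `set` and `from` arguments are defined by the OAI-PMH protocol, and `oai_dc` is the format every OAI-PMH repository has to support. The base URL and the function name are hypothetical, and no request is actually sent.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None, from_date=None):
    """Build an OAI-PMH ListRecords request URL for selective harvesting."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    if from_date:
        params["from"] = from_date  # only records changed since this date
    return f"{base_url}?{urlencode(params)}"


# Hypothetical repository endpoint, used for illustration only.
BASE_URL = "https://repository.example.org/oai"
url = list_records_url(BASE_URL, from_date="2018-01-01")
```

A harvester that periodically issues such requests is what makes a newly ingested resource show up in search environments without further manual steps.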

Ingestion and Integration of Resources

Community- and resource-type-specific interfaces

● More relevant interfaces and access methods to the resource

● Integration/import in existing research and working environments, e.g.

– (Semi-)Automatic annotation (WebLicht, CLAMS, EXMARaLDA, WebMAUS, etc.)

– Manual annotation of textual resources (WebAnno, TextGrid, etc.)

– Federated resource aggregators (CLARIN FCS)

– Visualisation tools (DiaCollo, WordTies, CorpusDiff, etc.)

→ Specific to every resource type (text, audio, video, multimodal)

Ingestion and Integration of Resources

Also to think about:

● Legal and licensing issues

● Adequate support (including technical support) from experienced users of the same or a comparable community

● Storage of intermediate/private results in a safe environment, still allowing collaborative work

● ...

Depositing Data: Standards of Quality

Standards of quality are essential! Compliance with standards needs to be certified. CLARIN centres are assessed by:

CoreTrustSeal (CTS)

● Provides repositories with certification based on a universal catalogue of requirements

– Core features of trusted, persistent data repositories

– Organizational, technical, financial and legal aspects are examined

● Nonprofit organization committed to sustainable and trusted data infrastructures (under the umbrella of the Research Data Alliance)

● Advantages for users of the infrastructure:

– Quality and sustainability of data management checked externally

– Guaranteed compliance with established standards

– Transparency of the (internal) processes in the repository