DTL FOCUS MEETING ON DATA INTEGRATION, STANDARDS AND FAIR PRINCIPLES IN PROTEOMICS

Luiz Olavo Bonino - [email protected]
29 August, 2016


WHAT IS FAIR DATA?

FAIR Data aims to support existing communities in their attempts to enable valuable scientific data and knowledge to be published and utilised in a ‘FAIR’ manner.

Findable - (meta)data is uniquely and persistently identifiable. Should have basic machine readable descriptive metadata.

Accessible - data is reachable and accessible by humans and machines using standard formats and protocols.

Interoperable - (meta)data is machine readable and annotated with resolvable vocabularies/ontologies.

Reusable - (meta)data is sufficiently well-described to allow (semi)automated integration with other compatible data sources.

FAIR DATA PRINCIPLES

To be Findable:
F1. (meta)data are assigned a globally unique and persistent identifier
F2. data are described with rich metadata (defined by R1 below)
F3. metadata clearly and explicitly include the identifier of the data it describes
F4. (meta)data are registered or indexed in a searchable resource

To be Accessible:
A1. (meta)data are retrievable by their identifier using a standardized communications protocol
A1.1 the protocol is open, free, and universally implementable
A1.2 the protocol allows for an authentication and authorization procedure, where necessary
A2. metadata are accessible, even when the data are no longer available

To be Interoperable:
I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation
I2. (meta)data use vocabularies that follow FAIR principles
I3. (meta)data include qualified references to other (meta)data

To be Reusable:
R1. (meta)data are richly described with a plurality of accurate and relevant attributes
R1.1. (meta)data are released with a clear and accessible data usage license
R1.2. (meta)data are associated with detailed provenance
R1.3. (meta)data meet domain-relevant community standards

http://www.nature.com/articles/sdata201618
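The principles above can be turned into a rough self-check for a metadata record. The following is a minimal Python sketch, not an official FAIR validator: the `MetadataRecord` class, its field names, and the rule that an identifier "looks persistent" when it is an HTTP(S) IRI are all assumptions made for illustration, and only a handful of the principles are covered.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataRecord:
    """Hypothetical, simplified metadata record used only for this sketch."""
    identifier: str = ""       # F1: globally unique, persistent identifier
    describes: str = ""        # F3: identifier of the data being described
    license_url: str = ""      # R1.1: data usage license
    provenance: str = ""       # R1.2: where the data came from
    vocabularies: list = field(default_factory=list)  # I2: vocabularies used


def check_fair(record: MetadataRecord) -> dict:
    """Return a pass/fail flag for a few of the principles.

    The F1 test (identifier is an HTTP(S) IRI) is a crude proxy chosen for
    this sketch; real persistence cannot be verified from the string alone.
    """
    return {
        "F1": record.identifier.startswith(("http://", "https://")),
        "F3": bool(record.describes),
        "I2": len(record.vocabularies) > 0,
        "R1.1": bool(record.license_url),
        "R1.2": bool(record.provenance),
    }
```

A record with an identifier and a license but no provenance would pass F1 and R1.1 and fail R1.2, which points the data owner at the missing description.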

THE FAIR COMPONENTS

Components: FAIR Data Principles, FAIR Data Protocol, FAIR Data Resources, FAIR Data Core Technologies, FAIR Data Systems/Tools

Categories: Normative, Artefact, Software

Diagram labels: Raw data (many formats); FAIR download (in local format); Processed data (primary storage format); FAIR transformation; FAIR (meta)data (RDF, XML etc.); High-Performance Analysis; Provenance; Initial transformation; Analysis transformation

FAIR DATA RESOURCE

Data Creation

FAIR Data Resource

FAIR Data Creation

FAIR Data Resource

FAIR DATA ECOSYSTEM (DUTCH APPROACH)

FAIR DATA RESOURCE

Datasets expressed using one of the prescribed standards of the FAIR Data Protocol, with metadata complying with the protocol and license. The original dataset is transformed into a FAIR format and proper metadata and license are added to produce a FAIR Data Resource. The original and the FAIR version can co-exist, each one fulfilling its own purpose.

FAIR transformation

FAIR Data Resource

BRING YOUR OWN DATA - BYOD

Goals:
■ Learn how to make data linkable “hands-on” with experts
■ Create a “telling story” to demonstrate its use

Composition:
■ Data owners – specialists on given datasets
■ Data interoperability experts
■ Domain experts

Source: Marcos Roos

BYOD

Bring Your Own Data - BYOD

• Goals:
  • Learn how to make data linkable “hands-on” with experts
  • Create a “telling story” to demonstrate its use
  • Make FAIR Data at the source

• Composition:
  • Data owners – specialists on given datasets
  • Data interoperability experts
  • Domain experts

Source: Marcos Roos

Domain Expert

Data Owner

FAIR Data Expert

BYOD

BYOD Planning

Preparation, Execution, Follow-Up

BYOD Planning

Preparation

Identify Plan

Driving question

Datasets

Attendees' profile

Output data access

Tentative dates

Tentative venue

Costs

Funds

Coordination

Set date

Invite attendees

Set venue

Catering

Lodging

Financial planning

Publicity

Working document

Preparatory calls

Data hosting

Software hosting

Documentation hosting

BYOD Planning

Execution

Day One

Introduction

SW, LD, Ontology intro

Use case intro

Workgroups division

Working sessions

WWW/TTTALA

Day Two

Progress report

Working sessions

Groups reports

WWW/TTTALA

Day Three

Data integration

Answer driving question

Explore data

Demo improvement

Final report

WWW/TTTALA

MAIN TASKS

Retrieve original data

Dataset identification and analysis

Definition of the semantic model

Data transformation

License assignment

Metadata definition

FAIR Data resource (data, metadata, license) deployment
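The main tasks above can be sketched end to end for a small tabular dataset. The Python below is a hedged illustration only: the `BASE` and `VOCAB` namespaces, the function names, and the "one predicate per column" semantic model are assumptions made for this example, not the tooling used at a BYOD. It also assumes column names are safe to use inside an IRI.

```python
import csv
import io

BASE = "http://example.org/resource/"   # hypothetical namespace for subjects
VOCAB = "http://example.org/vocab/"     # hypothetical vocabulary for predicates


def row_to_ntriples(row: dict, id_column: str) -> list:
    """Data transformation: emit one N-Triples line per non-empty cell."""
    subject = f"<{BASE}{row[id_column]}>"
    triples = []
    for column, value in row.items():
        if column == id_column or not value:
            continue
        # Escape characters that are not allowed raw in an N-Triples literal.
        escaped = (value.replace("\\", "\\\\")
                        .replace('"', '\\"')
                        .replace("\n", "\\n"))
        triples.append(f'{subject} <{VOCAB}{column}> "{escaped}" .')
    return triples


def fairify(csv_text: str, id_column: str, title: str, license_url: str) -> dict:
    """Bundle transformed data, metadata and license into one resource."""
    rows = csv.DictReader(io.StringIO(csv_text))
    data = [t for row in rows for t in row_to_ntriples(row, id_column)]
    metadata = {"title": title, "license": license_url, "triples": len(data)}
    return {"data": data, "metadata": metadata, "license": license_url}
```

In practice the semantic model would map columns to terms from existing ontologies rather than to a made-up vocabulary; the sketch only shows where each task fits.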

BYOD Planning

Follow-Up

D+15

Report difficulties

Clarifications

Next steps

D+45

Report difficulties

Clarifications

Next steps

Implementation

Expand FAIRification

Implement solution

Scale-up solution

Deploy

DTL’s BYOD Roadmap

• Rare diseases biobanks companies (Sept 2016)
• Rare diseases patient registry companies (Sept 2016)
• Rare diseases + WikiPathways (Oct/Nov 2016)
• ENSEMBL
• Plants
• Metabolomics
• Human data
• Proteomics, …

https://wiki.dtls.nl/index.php/BYOD_meetings

FAIRIFIER

FAIR DATA MODEL REGISTRY

FAIRIFIER AND FAIR DATA MODEL REGISTRY

FAIR DATA POINT

A particular class of FAIR Data System that provides access to published datasets. The datasets can be external or internal to the FAIR Data Point. The source data can be either a regular (non-FAIR) dataset or a FAIR Data Resource. If the source data is non-FAIR, the FAIR Data Point needs to make the necessary FAIR transformations on the fly.

FAIR DATA POINT

Who are you? Can I trust you?

FAIR DATA POINT

Here is information about myself

FDP Metadata

FAIR DATA POINT

Ok, now that I know you, tell me what you have to offer

reads

FDP Metadata

FAIR DATA POINT

Here is information about my catalog of datasets

Catalog Metadata

FAIR DATA POINT

Tell me more about your genomic dataset

reads

Catalog Metadata

FAIR DATA POINT

This is the detailed information about the genomic dataset

Dataset & Data Record Metadata

FAIR DATA POINT

Ok, now that I know what you have, give me the data.

reads

Dataset & Data Record Metadata

FAIR DATA POINT

Here is my data.
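The conversation above is a walk from the most general metadata to the data itself. A minimal Python sketch of that walk follows. It uses a hypothetical in-memory dictionary (`FAKE_FDP`) in place of a real FAIR Data Point, and the key names (`catalogs`, `datasets`, `distributions`, `payload`) are invented for the example; a real FAIR Data Point would be reached over HTTP and would return RDF.

```python
# Hypothetical in-memory stand-in for a FAIR Data Point. Every URL and key
# name below is an assumption made for this sketch.
FAKE_FDP = {
    "http://myfdp/": {
        "title": "Example FDP",
        "catalogs": ["http://myfdp/catalog"],
    },
    "http://myfdp/catalog": {
        "title": "Example catalog",
        "datasets": ["http://myfdp/dataset-1"],
    },
    "http://myfdp/dataset-1": {
        "title": "Genomic dataset",
        "distributions": ["http://myfdp/dataset-1/data"],
    },
    "http://myfdp/dataset-1/data": {"payload": "ACGT"},
}


def read(url: str) -> dict:
    """Stand-in for fetching and parsing the metadata found at a URL."""
    return FAKE_FDP[url]


def fetch_dataset(root_url: str, wanted_title: str) -> list:
    """Follow FDP -> catalog -> dataset -> data, as in the dialogue."""
    fdp = read(root_url)                       # "Who are you?"
    for catalog_url in fdp["catalogs"]:        # "What do you have to offer?"
        catalog = read(catalog_url)
        for dataset_url in catalog["datasets"]:
            dataset = read(dataset_url)        # "Tell me more about ..."
            if dataset["title"] == wanted_title:
                # "Give me the data."
                return [read(u)["payload"] for u in dataset["distributions"]]
    return []
```

Each step reads only the metadata returned by the previous one, so a client needs to know nothing beyond the root URL in advance.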

FAIR DATA POINT - GENERAL ARCHITECTURE

EMBEDDED FAIR DATA POINT

https://www.eudat.eu

DISTRIBUTED FAIR DATA POINTS

FAIR DATA POINT METADATA PROVIDER API

METADATA LAYERS

Layer: FDP (Data repository)
Description: Information about the FDP as a data repository
URL: http://myfdp/
Example: PID, title, description, license, owner, API version, etc.
Standard: OAI-PMH (extended)

Layer: Catalog
Description: Information about the catalog of datasets offered
URL: http://myfdp/catalog
Example: PID, title, description, publisher, etc.
Standard: W3C DCAT #Catalog

Layer: Dataset
Description: Information about each of the offered datasets
URL: http://myfdp/[datasetID]/
Example: AccessURL, downloadURL, format, mediaType, etc.
Standard: W3C DCAT #Dataset, #Distribution

Layer: Data record
Description: Information about the actual data, types, identifiers, etc.
URL: http://myfdp/[datarecordID]
Example: data types, domain, range, predicates, etc.
Standard: RML-Community/domain, ex.: DICOM, VCF,
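The URL patterns of the metadata layers can be sketched as a small request builder. This is an illustrative Python sketch under stated assumptions: the function name and the layer keywords are hypothetical, and asking for `text/turtle` through the `Accept` header assumes the FAIR Data Point supports content negotiation, which the slides do not state. The function only builds the request and does not send it.

```python
from urllib.request import Request


def metadata_request(base_url: str, layer: str, identifier: str = "") -> Request:
    """Build (but do not send) a request for one metadata layer.

    URL shapes follow the patterns listed for each layer; the layer
    keywords "fdp", "catalog", "dataset" and "datarecord" are made up here.
    """
    base = base_url.rstrip("/")
    if layer == "fdp":
        url = base + "/"
    elif layer == "catalog":
        url = base + "/catalog"
    elif layer == "dataset":
        url = f"{base}/{identifier}/"
    elif layer == "datarecord":
        url = f"{base}/{identifier}"
    else:
        raise ValueError(f"unknown metadata layer: {layer}")
    # Assumption: the server can return Turtle when asked via Accept.
    return Request(url, headers={"Accept": "text/turtle"})
```

Passing the returned object to `urllib.request.urlopen` would perform the actual retrieval against a live server.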

FDP METADATA

@prefix dbp: <http://dbpedia.org/resource/> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix lang: <http://id.loc.gov/vocabulary/iso639-1/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://fdp.biotools.nl:8080/fdp> a dct:Agent ;
    rdfs:label "FAIR Data Point of the Plant Breeding Group, Wageningen UR"^^xsd:string ;
    dct:description "This FDP provides metadata on plant-specific genotype/phenotype data sets"^^xsd:string ;
    dct:hasPart "catalog-01"^^xsd:string ;
    dct:identifier "FDP-WUR-PB"^^xsd:string ;
    dct:issued "2015-11-24"^^xsd:date ;
    dct:language lang:en ;
    dct:modified "2015-11-24"^^xsd:date ;
    dct:publisher <http://orcid.org/0000-0002-4368-8058> ;
    dct:title "FAIR Data Point of the Plant Breeding Group, Wageningen UR"^^xsd:string ;
    dct:version "1.0"^^xsd:string .

CATALOG METADATA

@prefix dbp: <http://dbpedia.org/resource/> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix lang: <http://id.loc.gov/vocabulary/iso639-1/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://fdp.biotools.nl:8080/catalog/catalog-01> a dcat:Catalog ;
    rdfs:label "Plant Breeding Data Catalog"^^xsd:string ;
    dct:description "Plant Breeding Data Catalog"^^xsd:string ;
    dct:hasPart <breedb> ;
    dct:issued "2015-11-24"^^xsd:date ;
    dct:language lang:en ;
    dct:modified "2015-11-24"^^xsd:date ;
    dct:publisher <http://orcid.org/0000-0002-4368-8058> ;
    dct:title "Plant Breeding Data Catalog"^^xsd:string ;
    dct:version "1.0"^^xsd:string ;
    dcat:dataset <breedb> .

DATASET METADATA

@prefix dbp: <http://dbpedia.org/resource/> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix lang: <http://id.loc.gov/vocabulary/iso639-1/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://fdp.biotools.nl:8080/dataset/breedb> a dcat:Dataset ;
    rdfs:label "BreeDB tomato passport data"^^xsd:string ;
    dct:description "BreeDB tomato passport data"^^xsd:string ;
    dct:issued "2015-11-24"^^xsd:date ;
    dct:language lang:en ;
    dct:modified "2015-11-24"^^xsd:date ;
    dct:publisher <http://orcid.org/0000-0002-4368-8058> ;
    dct:title "BreeDB tomato passport data"^^xsd:string ;
    dct:version "1.0"^^xsd:string ;
    dcat:distribution <breedb-sparql>,
        <breedb-sqldump> .

METADATA DISTRIBUTION

<http://fdp.biotools.nl:8080/distribution/breedb-sparql> a dcat:Distribution ;
    rdfs:label "SPARQL endpoint for BreeDB tomato passport data"^^xsd:string ;
    dct:description "SPARQL endpoint for BreeDB tomato passport data"^^xsd:string ;
    dct:issued "2015-11-24"^^xsd:date ;
    dct:language lang:en ;
    dct:license <http://rdflicense.appspot.com/rdflicense/cc-by-nc-nd3.0> ;
    dct:modified "2015-11-24"^^xsd:date ;
    dct:publisher <http://orcid.org/0000-0002-4368-8058> ;
    dct:title "SPARQL endpoint for BreeDB tomato passport data"^^xsd:string ;
    dct:version "1.0"^^xsd:string ;
    dcat:accessURL <http://virtuoso.biotools.nl:8888/sparql> .

<http://fdp.biotools.nl:8080/distribution/breedb-sqldump> a dcat:Distribution ;
    rdfs:label "SQL dump of the BreeDB tomato passport data"^^xsd:string ;
    dct:description "SQL dump of the BreeDB tomato passport data"^^xsd:string ;
    dct:issued "2015-11-24"^^xsd:date ;
    dct:language lang:en ;
    dct:license <http://rdflicense.appspot.com/rdflicense/cc-by-nc-nd3.0> ;
    dct:modified "2015-11-24"^^xsd:date ;
    dct:publisher <http://orcid.org/0000-0002-4368-8058> ;
    dct:title "SQL dump of the BreeDB tomato passport data"^^xsd:string ;
    dct:version "1.0"^^xsd:string ;
    dcat:downloadURL <http://virtuoso.biotools.nl:8888/DAV/home/breedb/breedb.sql> .
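A client that has received distribution metadata like the two records above needs to find out where the data can be queried or downloaded. The Python sketch below pulls out the `dcat:accessURL` and `dcat:downloadURL` values with a regular expression. This is a deliberate simplification for illustration: it is not a Turtle parser, it assumes the `dcat:` prefix is written exactly as in the slides, and a real client would more likely use an RDF library. The function name is hypothetical.

```python
import re

# Matches e.g.  dcat:accessURL <http://example.org/sparql>
# Assumes prefixed names and IRIs in angle brackets, as in the records shown.
URL_PATTERN = re.compile(r"dcat:(accessURL|downloadURL)\s+<([^>]+)>")


def distribution_urls(turtle_text: str) -> dict:
    """Collect access and download URLs found in distribution metadata."""
    urls = {"accessURL": [], "downloadURL": []}
    for prop, url in URL_PATTERN.findall(turtle_text):
        urls[prop].append(url)
    return urls
```

Applied to the two distributions above, the access URL would be the SPARQL endpoint and the download URL would be the SQL dump, which lets the client choose between querying and bulk download.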