Metadata for Research Objects

Post on 10-May-2015

Presentation given at ISKOUK Meeting "Making Metadata Work", 23rd June, 2014. http://www.iskouk.org/events/metadata_June_2014.htm

Transcript of Metadata for Research Objects

Sean Bechhofer
sean.bechhofer@manchester.ac.uk

@seanbechhofer

Making Metadata Work, ISKO, London, 23rd June 2014

Metadata for Research Objects

Publication
• Publications are about argumentation: convince the reader of the validity of a position
  – Reproducible Results System: facilitates enactment and publication of reproducible research.
• Results are reinforced by reproducibility
  – Explicit representation of method.
• Verifiability as a key factor in scientific discovery.

J. Mesirov. Accessible Reproducible Research. Science 327(5964), pp. 415-416, 2010. doi:10.1126/science.1179653

Stodden et al. Reproducible Research: Addressing the Need for Data and Code Sharing in Computational Science. Computing in Science and Engineering 12(5), pp. 8-13, 2010. doi:10.1109/MCSE.2010.113

C. Goble et al. Accelerating Scientists’ Knowledge Turns. Communications in Computer and Information Science, Volume 348, 2013, pp. 3-25. doi:10.1007/978-3-642-37186-8_1

Reproducible Science

Goble: SSI Collaborations Workshop 2014

Scientific Workflows

» Scientific workflows are at the heart of experimental science
  › Enable automation of scientific methods
  › Support experimental reproducibility
  › Encourage best practices
» There is then a need to preserve these workflows
  › Scientific development based on method reuse and repurpose
  › Conservation is key
» Workflow preservation is a multidimensional challenge
  › Representation of complex objects
  › Decay analysis, diagnosis, and prevention
  › Social Objects that can be inspected, reused, repurposed and credited

Preservation of scientific workflows in data-intensive science

Preservation

Technical
• Multi-step computational process
• Repeatable and comparative
• Explicate computation

Social
• Virtual Witnessing
• Transparent, precise, citable documentation
• Accurate provenance logs
• Reusable protocols, know-how, best practice

Can I review / repeat your method?
Can I defend my method?
Can I reuse / reproduce this method?

Context: Semantic Web and Linked Data
• SW: Explicit machine-readable representation of information
• LD: A set of best practices for publishing and connecting data on the Web
  1. Use URIs to name things
  2. Use dereferenceable HTTP URIs
  3. Provide useful content on lookup using standards
  4. Include links to other stuff

Research Objects
• An aggregation object that bundles together experimental resources that are essential to a computational scientific study or investigation.
  – data used;
  – results produced in an experimental study;
  – (computational) methods employed to produce and analyse that data;
  – people involved in the investigation.
• Plus annotation information that provides additional information about both the bundle itself and the resources of the bundle
  – descriptions
  – provenance

ROs as a Currency

[Figure: roles around a Research Object: Creator, Contributor, Collaborator, Comparator, Re-User, Evaluator, Reviewer, Trainee, Trainer, Reader, Publisher, Curator, Librarian, Repository Manager]

Research Objects
• Three principles underlie the approach:
• Identity
  – Referring to resources (and the aggregation itself)
• Aggregation
  – Describing the aggregation structure and its constituent parts
• Annotation
  – Associating information with aggregated resources.
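The three principles can be read as a small data structure. The following is a minimal sketch only: the class names, the `example.org` URIs and the rule that an annotation must target something inside the aggregation are assumptions made for illustration, not part of the RO model itself.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Annotation:
    """Information associated with the RO or with one aggregated resource."""
    target: str  # URI of the thing being described
    body: str    # URI or text carrying the description


@dataclass
class ResearchObject:
    """Hypothetical in-memory view of a Research Object."""
    uri: str                                             # identity of the aggregation
    resources: List[str] = field(default_factory=list)   # aggregation: URIs of the parts
    annotations: List[Annotation] = field(default_factory=list)

    def aggregate(self, resource_uri: str) -> None:
        # Add a resource to the aggregation once.
        if resource_uri not in self.resources:
            self.resources.append(resource_uri)

    def annotate(self, target: str, body: str) -> Annotation:
        # Assumption: only the RO itself or an aggregated resource may be annotated.
        if target != self.uri and target not in self.resources:
            raise ValueError(f"{target} is not part of {self.uri}")
        note = Annotation(target, body)
        self.annotations.append(note)
        return note


ro = ResearchObject("http://example.org/ro/1/")
ro.aggregate("http://example.org/ro/1/workflow.t2flow")
ro.annotate("http://example.org/ro/1/workflow.t2flow", "Taverna workflow for the study")
```

Every part is named by a URI, so the same structure can later be written out with an aggregation vocabulary and an annotation vocabulary.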

Identity
• Mechanisms for referring to the resources that are aggregated within a Research Object
• URIs
  – Web Resources
• DOIs
  – Documents/papers/datasets
• ORCID IDs
  – Researchers

Identifier Issues
• HTTP URIs provide both access and identification
• PIDs: Persistent Identifiers (e.g. DOIs) tend to resolve to human-readable landing pages
  – With embedded links to further (possibly machine-readable) resources
• ROs seen as non-information resources with descriptive (RDF) metadata
  – Redirection/negotiation
  – Standard patterns for Linked Data resources
• Bidirectional mappings between URIs and PIDs
• Versioning through, e.g., Memento

H. Van de Sompel et al. Persistent Identifiers for Scholarly Assets and the Web: The Need for an Unambiguous Mapping. 9th International Digital Curation Conference
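A bidirectional mapping between PIDs and HTTP URIs can be sketched in a few lines. This is an illustration under stated assumptions: the `doi:` and `orcid:` prefixes and the function names are hypothetical conventions chosen here, and the sketch ignores percent-encoding of unusual DOI characters.

```python
# Public resolver prefixes for DOIs and ORCID iDs.
DOI_RESOLVER = "https://doi.org/"
ORCID_RESOLVER = "https://orcid.org/"

# Hypothetical "scheme:value" notation used only in this sketch.
RESOLVERS = {"doi": DOI_RESOLVER, "orcid": ORCID_RESOLVER}


def pid_to_uri(pid: str) -> str:
    """Map 'doi:10.x/y' or 'orcid:0000-...' to an HTTP URI."""
    scheme, _, value = pid.partition(":")
    resolver = RESOLVERS.get(scheme.lower())
    if resolver is None or not value:
        raise ValueError(f"unsupported identifier: {pid}")
    return resolver + value


def uri_to_pid(uri: str) -> str:
    """Inverse mapping: recover the PID from a resolver URI."""
    for scheme, resolver in RESOLVERS.items():
        if uri.startswith(resolver) and len(uri) > len(resolver):
            return f"{scheme}:{uri[len(resolver):]}"
    raise ValueError(f"not a known resolver URI: {uri}")


doi_uri = pid_to_uri("doi:10.1109/eScience.2012.6404482")
```

The mapping is unambiguous only because each resolver prefix is distinct; a real deployment would also have to decide which of several equivalent URIs is canonical.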

Aggregation
• Open Archives Initiative Object Reuse and Exchange (OAI ORE) is a standard for describing aggregations of web resources
  – http://www.openarchives.org/ore/
• Uses a Resource Map to describe the aggregated resources
• Proxies allow for statements about the resources within the aggregation
  – Capturing context and viewpoints
• Several concrete serialisations
  – RDF/XML, Atom, RDFa

Graceful Degradation
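As a rough illustration of the Resource Map and Proxy ideas, the sketch below builds the relevant statements as plain (subject, predicate, object) tuples and prints them as N-Triples. The ORE term names come from the ORE vocabulary; the helper functions and the `example.org` URIs are assumptions for this example, and a real system would more likely use an RDF library.

```python
ORE = "http://www.openarchives.org/ore/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def resource_map(rem_uri, aggregation_uri, resource_uris):
    """Statements saying that a Resource Map describes an Aggregation and its parts."""
    triples = [
        (rem_uri, RDF_TYPE, ORE + "ResourceMap"),
        (rem_uri, ORE + "describes", aggregation_uri),
        (aggregation_uri, RDF_TYPE, ORE + "Aggregation"),
    ]
    for uri in resource_uris:
        triples.append((aggregation_uri, ORE + "aggregates", uri))
    return triples


def proxy(proxy_uri, resource_uri, aggregation_uri):
    """A Proxy stands for a resource in the context of one aggregation."""
    return [
        (proxy_uri, RDF_TYPE, ORE + "Proxy"),
        (proxy_uri, ORE + "proxyFor", resource_uri),
        (proxy_uri, ORE + "proxyIn", aggregation_uri),
    ]


def to_ntriples(triples):
    # Every term here is a URI, so each line has the form <s> <p> <o> .
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


base = "http://example.org/ro/1/"
triples = resource_map(base + "manifest", base, [base + "workflow.t2flow", base + "input.csv"])
triples += proxy(base + "proxies/1", base + "input.csv", base)
print(to_ntriples(triples))
```

Statements that hold only within this aggregation, such as a role played by `input.csv`, would be attached to the proxy URI rather than to the resource itself.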

Annotation
• Open Annotation specification is a community-developed data model for annotation of web resources
  – http://www.openannotation.org/spec/core/
• Developed by the W3C Open Annotation Community Group
• Allows for “stand-off” annotations
  – Annotation as a first class citizen
• Developed to fit with Web Architecture

Graceful Degradation
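A stand-off annotation is a resource in its own right that points at a body and a target, rather than being embedded in the thing it describes. The sketch below writes one as a JSON-LD document using the `oa:Annotation`, `oa:hasBody` and `oa:hasTarget` terms; the inline context, the function name and the `example.org` URIs are assumptions made for this example.

```python
import json

OA = "http://www.w3.org/ns/oa#"


def stand_off_annotation(annotation_uri, target_uri, body_uri):
    """Build a JSON-LD description of one annotation with its own identity."""
    return {
        "@context": {
            "oa": OA,
            "hasBody": {"@id": "oa:hasBody", "@type": "@id"},
            "hasTarget": {"@id": "oa:hasTarget", "@type": "@id"},
        },
        "@id": annotation_uri,
        "@type": "oa:Annotation",
        "hasBody": body_uri,      # where the descriptive content lives
        "hasTarget": target_uri,  # the resource being described
    }


annotation = stand_off_annotation(
    "http://example.org/ro/1/annotations/1",
    "http://example.org/ro/1/workflow.t2flow",
    "http://example.org/ro/1/annotations/workflow-description.ttl",
)
print(json.dumps(annotation, indent=2))
```

Because the annotation has its own URI, it can itself be aggregated, attributed and annotated like any other resource.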

Annotation Content
• Essential to the understanding and interpretation of the scientific outcomes captured by a Research Object as well as the reuse of the resources within it.
  – Provenance information about the experiments, the study or any other experimental resources
  – Evolution information about the Research Object and its resources
  – Descriptions of computational methods or processes
  – Dependency information or settings about the experiment executions

Core & Extensions
• Core model provides support for aggregation and annotation
• Extensions provide additional vocabularies for domain-specific tasks
• Workflow Provenance
  – Information capturing workflow executions
• Workflow Description
  – Abstractions describing processes, inputs and outputs
• Research Object Evolution
  – Information describing change and “snapshots”

RO Model

Provenance
• W3C’s PROV model allows for capture of information relating to
  – Attribution: Who did it?
  – Derivation: Data sources used
  – Activities: What happened (and when)
• Significant eco-system (generators, viewers, consumers) has grown up around PROV
  – IPAW & TAPP

Copyright © 2013 W3C® (MIT, ERCIM, Keio, Beihang), All Rights Reserved. 
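The three questions on the Provenance slide map onto a handful of PROV-O properties. The following is a small sketch, not a PROV implementation: the property names are PROV-O terms, while the `ex:` identifiers, the timestamp and the `objects` helper are invented for illustration.

```python
PROV = "http://www.w3.org/ns/prov#"

# (subject, predicate, object) statements about a hypothetical workflow run.
statements = [
    ("ex:results.csv", PROV + "wasAttributedTo", "ex:alice"),         # attribution
    ("ex:results.csv", PROV + "wasDerivedFrom", "ex:raw-data.csv"),   # derivation
    ("ex:results.csv", PROV + "wasGeneratedBy", "ex:run-42"),         # activity
    ("ex:run-42", PROV + "used", "ex:raw-data.csv"),
    ("ex:run-42", PROV + "startedAtTime", "2014-06-23T10:00:00Z"),
]


def objects(subject, prov_term):
    """All values of one PROV property for one subject."""
    return [o for s, p, o in statements if s == subject and p == PROV + prov_term]


who = objects("ex:results.csv", "wasAttributedTo")      # Who did it?
sources = objects("ex:results.csv", "wasDerivedFrom")   # Data sources used
activities = objects("ex:results.csv", "wasGeneratedBy")
when = [t for run in activities for t in objects(run, "startedAtTime")]  # ...and when
```

In an RO, statements like these would typically be carried as the body of an annotation that targets the result file.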

Tooling

ROs and OAIS
• ROs as Information Packages in OAIS
• myExperiment as live/access repository
• ROHUB as archival repository

SCAPE: Planning and Watch

[Figure: Planning and Watch cycle. Components: Watch, Planning, Operations, Repository, Env & Users. Arrows: plan, deploy, monitor, access, ingest/harvest, execution]

http://www.scape-project.eu/

• SCAPE project concerned with Digital Preservation.
• Planning and Watch infrastructure to help monitor the state of a repository and co-ordinate appropriate actions
• Driven by policies.

Wf4Ever: Monitoring and Watch

[Figure: Watch / Planning / Operations cycle for Wf4Ever. Components: Watch, Planning, Operations, Repository, Env & Users. Arrows: plan, deploy, monitor, access, ingest/harvest, execution. Additional labels: myExperiment and RODL; Decay, Service Deprecation, Data source monitoring, Checklists, Minimal Models]

• Ideas applied to workflow preservation

Decay
• Survey of 92 Taverna workflows from myExperiment
• Volatile Third-Party Resources
• Missing Data
• Missing Execution Environments
• Poor descriptions

Belhajjame et al. Why workflows break: Understanding and combating decay in Taverna workflows. e-Science 2012. doi:10.1109/eScience.2012.6404482

Checklists and Validation
• Checklists widely used to support safety, quality and consistency
• Common in experimental science
  – Expressing minimum information required
  – Supporting “health” monitoring of workflow-centric ROs.
• Checklists can be defined in terms of the RO model and its annotations
  – Generic checklist service then executes against that model and the given annotations
  – Provenance
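The idea of a generic checklist service can be sketched as a list of requirements evaluated against an RO's annotations. This is a simplified illustration rather than the Minim model: the `Requirement` class, the MUST/SHOULD levels as used here and the property names are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Annotations reduced to: property name -> list of values.
Annotations = Dict[str, List[str]]


@dataclass
class Requirement:
    label: str
    level: str  # "MUST" or "SHOULD" (hypothetical levels for this sketch)
    test: Callable[[Annotations], bool]


def has_value(prop: str) -> Callable[[Annotations], bool]:
    """Requirement test: the property is present with at least one value."""
    return lambda annotations: bool(annotations.get(prop))


def evaluate(checklist: List[Requirement],
             annotations: Annotations) -> Tuple[bool, Dict[str, bool]]:
    """Run every requirement; the checklist passes if all MUST items hold."""
    results = {req.label: req.test(annotations) for req in checklist}
    passed = all(results[req.label] for req in checklist if req.level == "MUST")
    return passed, results


runnable_workflow = [
    Requirement("workflow definition present", "MUST", has_value("workflow")),
    Requirement("input data present", "MUST", has_value("input")),
    Requirement("description present", "SHOULD", has_value("description")),
]

annotations = {"workflow": ["workflow.t2flow"], "input": []}
passed, results = evaluate(runnable_workflow, annotations)
```

Because the checklist is data rather than code, the same `evaluate` function can serve different communities with different minimum information requirements.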

Minim Data Model

Zhao et al. A Checklist-Based Approach for Quality Assessment of Scientific Information. 3rd Int. Workshop on Linked Science, 2013

Checklist Evaluation

RO Bundle
• A single, transferable object encapsulating the description and resources of an RO
  – Download, transfer, publish
• ZIP-based format (resources) plus a manifest describing aggregation and annotations (description)
  – Unpack with standard tooling
• JSON-LD as a representation for manifest
  – Lightweight linked-data format
  – Compatible with existing JSON tooling and services
  – PROV-O and OAC for annotations

http://wf4ever.github.io/ro/bundle/
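The ZIP-plus-manifest layout can be sketched with the Python standard library. Treat the details as assumptions modelled loosely on the bundle specification linked above: the `mimetype` entry, the `.ro/manifest.json` location, the context URL and the `aggregates` / `annotations` keys should be checked against the specification before being relied on.

```python
import io
import json
import zipfile
from typing import Dict, List

# Assumed media type and manifest location; verify against the specification.
MIMETYPE = "application/vnd.wf4ever.robundle+zip"
MANIFEST_PATH = ".ro/manifest.json"


def build_bundle(resources: Dict[str, bytes], annotations: List[dict]) -> bytes:
    """Pack resources and a JSON-LD manifest into one ZIP archive in memory."""
    manifest = {
        "@context": ["https://w3id.org/bundle/context"],
        "id": "/",
        "aggregates": [{"uri": "/" + path} for path in sorted(resources)],
        "annotations": annotations,
    }
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as bundle:
        # Written first and uncompressed so tools can sniff the type.
        bundle.writestr("mimetype", MIMETYPE, compress_type=zipfile.ZIP_STORED)
        bundle.writestr(MANIFEST_PATH, json.dumps(manifest, indent=2))
        for path, data in sorted(resources.items()):
            bundle.writestr(path, data)
    return buffer.getvalue()


bundle_bytes = build_bundle(
    {"workflow.t2flow": b"<workflow/>", "data/input.csv": b"a,b\n1,2\n"},
    [{"about": "/workflow.t2flow", "content": "annotations/workflow.ttl"}],
)

# Any standard ZIP tool can unpack the result.
with zipfile.ZipFile(io.BytesIO(bundle_bytes)) as bundle:
    names = bundle.namelist()
    manifest = json.loads(bundle.read(MANIFEST_PATH))
```

Keeping the manifest as plain JSON means consumers that know nothing about linked data can still read the list of aggregated files.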

Bundling via git/Zenodo/figshare
• Scientist works with local folder structure.
  – Version management via GitHub.
  – Local tooling produces metadata description
  – Metadata about the aggregation (and its resources) provided by “hidden folder”
• Zenodo/figshare pull snapshot from GitHub
  – Providing DOIs for the aggregations
  – Additional release cycles can prompt new DOIs
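Local tooling of this kind might walk the working folder and write a description into a hidden folder alongside the files. The sketch below is hypothetical: the `.ro` folder name, the manifest layout and the use of SHA-256 checksums are assumptions for illustration, not the behaviour of any existing tool.

```python
import hashlib
import json
import tempfile
from pathlib import Path

HIDDEN = ".ro"               # assumed name of the hidden metadata folder
SKIPPED = {HIDDEN, ".git"}   # top-level folders that are not part of the aggregation


def describe_folder(root: Path) -> dict:
    """List every file under root and store the description in root/.ro/."""
    entries = []
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root)
        if not path.is_file() or rel.parts[0] in SKIPPED:
            continue
        entries.append({
            "path": rel.as_posix(),
            "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
        })
    description = {"aggregates": entries}
    hidden = root / HIDDEN
    hidden.mkdir(exist_ok=True)
    (hidden / "manifest.json").write_text(json.dumps(description, indent=2))
    return description


# Demonstration on a throwaway folder.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "data").mkdir()
    (root / "data" / "input.csv").write_text("a,b\n1,2\n")
    (root / "analysis.py").write_text("print('hello')\n")
    demo_description = describe_folder(root)
    demo_paths = [entry["path"] for entry in demo_description["aggregates"]]
```

Since the description lives inside the folder, it is versioned with the rest of the repository and travels with any snapshot that a service pulls from it.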

Zenodo

figshare

ROs as RDFa

http://rohub.linkeddata.es

RDFa

http://rohub.linkeddata.es

Code as a Research Object

COMBINE Archive

http://co.mbine.org/documents/archive

GigaScience/ISA

http://isa-tools.github.io/soapdenovo2/

IPython

Wrap Up
• Aggregation objects bundling together experimental resources that are essential to a computational scientific study or investigation
  – Intended to support greater transparency and reproducibility
• Annotations provide additional information about the bundle and its contents
  – Metadata is key here
• Use of existing standards, vocabularies and infrastructure
• Nascent tooling to support creation, management and publication

Thanks!
• All the members of the Wf4Ever team
  – iSOCO: Intelligent Software Components S.A., Spain
  – University of Manchester, School of Computer Science, Manchester, United Kingdom
  – University of Oxford, Department of Zoology, Oxford, UK
  – Poznan Supercomputing and Networking Center, Poznan, Poland
  – IAA: Instituto de Astrofísica de Andalucía, Granada, Spain
  – Leiden University Medical Centre, Centre for Human and Clinical Genetics, The Netherlands
• Colleagues in Manchester’s Information Management Group
• RO Advisory Board Members

http://www.researchobject.org
http://www.wf4ever-project.org