Metadata for Large Science: The ICAT Data Model Brian Matthews, Leader, Scientific Applications Group, E-Science Centre, STFC Rutherford Appleton Laboratory [email protected]

Page 1

Metadata for Large Science: The ICAT Data Model

Brian Matthews, Leader, Scientific Applications Group, E-Science Centre, STFC Rutherford Appleton Laboratory

[email protected]

Page 2

The ICAT software suite

• Catalogues all experiment-related information
• Metadata gathered via integration with existing IT systems
  – proposal systems
  – data acquisition
• Provides a well-defined API for easy embedding into any application

[Diagram: Distributed Data; Metadata Catalogue; Generic Catalogue Access Interface; Data Access and Analysis]

Page 3

Integrated e-Infrastructure

[Diagram labels: Proposal System; Proposal; Information; Metadata Catalogue; Experiment; Data Acquisition System; Secure Storage; Data Analysis; Publication; E-Pubs]

All Data and Metadata Capture is automated.

Page 4

Model Motivation

A common general format/standard for metadata on Scientific Studies and data holdings did not exist.

By proposing a Model and Implementation:
– A specification for the types of metadata which should be captured during Scientific Studies
– Cataloguing and organising data holdings: provide access for the Data Owner
– Ease citation, collaboration, and integration
– Allow easy integration of distributed heterogeneous metadata systems into a homogeneous (virtual) platform

Therefore: the Core Scientific Metadata Model (CSMD) was developed.

Page 5

Core Scientific Metadata Model

Metadata Granule

• Topic: Keywords providing an index on what the study is about.
• Study Description: Provenance about what the study is, who did it and when.
• Access Conditions: Conditions of use providing information on who can access the data and how.
• Data Description: Detailed description of the organisation of the data into datasets and files.
• Data Location: Locations providing a navigational aid to where the data on the study can be found.
• Related Material: References into the literature and community providing context about the study.
• Legal Note: Copyright, patents and conditions of use etc. relating to the study and the data in the study.
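The seven sections of the metadata granule can be pictured as one record per study. The following is a minimal sketch in Python, not part of ICAT or the CSMD specification: the class name `MetadataGranule`, the field names and types, and the helper `missing_sections` are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MetadataGranule:
    """Hypothetical record holding the seven sections named on the slide."""
    topic: List[str] = field(default_factory=list)                    # keywords indexing the study
    study_description: Dict[str, str] = field(default_factory=dict)   # what, who, when
    access_conditions: str = ""                                       # who may access the data and how
    data_description: str = ""                                        # organisation into datasets and files
    data_location: List[str] = field(default_factory=list)            # where the data can be found
    related_material: List[str] = field(default_factory=list)         # literature and community references
    legal_note: str = ""                                              # copyright, patents, conditions of use

    def missing_sections(self) -> List[str]:
        """Return the names of sections that are still empty."""
        return [name for name, value in vars(self).items() if not value]


# Illustrative values only; none of them come from a real catalogue.
granule = MetadataGranule(
    topic=["neutron scattering"],
    study_description={"title": "Example study", "investigator": "A. Researcher"},
    data_location=["/archive/example/run001.raw"],
)
print(granule.missing_sections())
```

A completeness check like this is one way a catalogue could flag records whose access conditions or legal note were never captured.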

Page 6

CSMD in more detail

[Entity-relationship diagram: entities and their attributes]

• Investigation: Reference / Proposal Id, Previous Reference, Facility, Instrument, Title, Abstract, etc.
• Investigator: User Id, Role
• Publication: Full Reference, URL, Repository
• Keyword: Name
• Topic: Name, Parent Id, Topic Level
• Sample: Name, Chemical Formula, Safety Information
• Sample Parameter: Name, Units, String Value, Numeric Value, Range Top, Range Bottom, Error
• Dataset: Name, Sample Id, Description
• Dataset Parameter: Name, Units, String Value, Numeric Value, Range Top, Range Bottom, Error
• Datafile: Name, Description, Version, Location, Format, Format Version, Create Time, Modify Time, Size, Checksum
• Datafile Parameter: Name, Units, String Value, Numeric Value, Range Top, Range Bottom, Error
• Related Datafile: Source Datafile Id, Destination Datafile Id, Relation, S/W Application, S/W Version
• Parameter: Name/Units/Value etc., Searchable, Is Sample Parameter, Is Dataset Parameter, Is Datafile Parameter, Verified
• Authorisation: User Id, Role (e.g. Admin, Deleter, Updater, Reader, Creator, Downloader etc.), Element Type, Element Id
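The containment hierarchy in this slide (an Investigation holds Datasets, which hold Datafiles, each carrying Parameters) can be sketched as below. This is an illustrative Python sketch, not the ICAT schema or API: the classes carry only a subset of the attributes listed on the slide, and the method `datafiles_with`, the instrument name `EXAMPLE` and all sample values are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Parameter:
    name: str
    units: str = ""
    string_value: Optional[str] = None
    numeric_value: Optional[float] = None
    error: Optional[float] = None


@dataclass
class Datafile:
    name: str
    location: str
    parameters: List[Parameter] = field(default_factory=list)


@dataclass
class Dataset:
    name: str
    datafiles: List[Datafile] = field(default_factory=list)


@dataclass
class Investigation:
    title: str
    facility: str
    instrument: str
    datasets: List[Dataset] = field(default_factory=list)

    def datafiles_with(self, name: str, low: float, high: float) -> List[Datafile]:
        """Datafiles that have a numeric parameter `name` with low <= value <= high."""
        return [
            f
            for ds in self.datasets
            for f in ds.datafiles
            if any(
                p.name == name
                and p.numeric_value is not None
                and low <= p.numeric_value <= high
                for p in f.parameters
            )
        ]


# Illustrative data only.
inv = Investigation(
    title="Example study", facility="ISIS", instrument="EXAMPLE",
    datasets=[Dataset(name="run-set-1", datafiles=[
        Datafile("run001.raw", "/archive/run001.raw",
                 parameters=[Parameter("temperature", "K", numeric_value=4.2)]),
        Datafile("run002.raw", "/archive/run002.raw",
                 parameters=[Parameter("temperature", "K", numeric_value=290.0)]),
    ])],
)
cold = inv.datafiles_with("temperature", 0.0, 10.0)
print([f.name for f in cold])  # ['run001.raw']
```

Keeping separate string and numeric value slots on a parameter, as the slide does, is what makes a range query like this possible without parsing text.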

Page 7

CSMD Used on DataPortal

• Implementation used as the data interface for DataPortal
• Single view of heterogeneous systems/schemas
• Acts as a stress test of the model
  – Limitations feed into model requirements
  – New requirements feed back into the implementation

Page 8

ICAT 3.3 Database Schema

Page 9

Core Scientific Metadata Model Usage

Used on many projects since 2001
– STFC ICAT
  • Serving data from STFC facilities (ISIS, DLS)
  • SNS, Canada, Australia
– e-Minerals, e-Materials: collaborations to provide access to Grid resources, with ISIS users

Outside STFC
• MyGrid bioinformatics project
  – information model based on version 1 of the CSMD
• This is being taken forward in the myIB project
  – Application to integrated biology
  – Extending to provenance tracking in computational steering
• Also influenced projects:
  – Comb-eChem, eBank, eCrystals
  – NERC DataGrid
• CCP1 (Collaborative Computational Project in Quantum Chemistry)
  – assessed CSMD for metadata needs on their Grid Data Management Middleware

Page 10

Interoperability

Sharing data across boundaries
– Across different research lifecycles
– Across institutions
– Across information objects
– Across disciplines
– Across time

Characteristics
– Loosely coupled
– Across different authorities
– Different internal models

Infrastructure to support science across disciplines, scientific institutions and research groups

Interoperable Metadata Models as a way we can enable Interoperable Data

Page 11

Publishing the CSMD

Currently published as:
• ICAT 3.3 database schema
• ICAT API
• CCLRC Scientific Metadata Model: Version 2 (Sufi & Matthews 2004)
  – UML Model
  – XML Schema
• Need to update and formally publish
• Dublin Core Application Profile
  – Sharable and collaborative
  – A “flat” model
  – An exchange model

Name: Subject
Term URI: http://www.stfc.ac.uk/csmd/elements/3.0/topic
Refines: http://purl.org/dc/elements/1.1/subject
Name: Topic
Has Encoding Scheme: DC Subject Encoding Schemes
Has Encoding Scheme: ISIS Subject Encoding Schemes
Comment: ISIS Subject Encoding Schemes are available from http://www.stfc.ac.uk/schemes/ISIS/Subject
Obligation: Recommended
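Because the Topic term is declared as refining the Dublin Core `subject` element, a consumer that only understands plain Dublin Core can still use a CSMD record by replacing each refined property with the element it refines. The sketch below illustrates that idea in Python. The two URIs for Topic and `subject` are taken from the slide; the function name `dumb_down`, the record layout (property URI mapped to a list of values) and the sample values are assumptions for illustration, not part of any published profile.

```python
from typing import Dict, List

TOPIC = "http://www.stfc.ac.uk/csmd/elements/3.0/topic"
DC_SUBJECT = "http://purl.org/dc/elements/1.1/subject"
DC_TITLE = "http://purl.org/dc/elements/1.1/title"

# Refinement table: refined property -> the Dublin Core element it refines.
REFINES: Dict[str, str] = {TOPIC: DC_SUBJECT}


def dumb_down(record: Dict[str, List[str]], refines: Dict[str, str]) -> Dict[str, List[str]]:
    """Rewrite refined properties to the element they refine, merging values."""
    flat: Dict[str, List[str]] = {}
    for prop, values in record.items():
        target = refines.get(prop, prop)  # unrefined properties pass through unchanged
        flat.setdefault(target, []).extend(values)
    return flat


# Illustrative record only.
record = {
    TOPIC: ["magnetism"],
    DC_SUBJECT: ["neutron scattering"],
    DC_TITLE: ["Example study"],
}
flat = dumb_down(record, REFINES)
print(flat[DC_SUBJECT])  # ['magnetism', 'neutron scattering']
```

This is the sense in which a flat application profile can serve as an exchange model: the richer term is kept where it is understood and degrades to a standard element elsewhere.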

Page 12

Towards a Facilities Ontology

Ontologies are used to capture knowledge about a domain of interest.
• Increased flexibility when representing frequently changing viewpoints.
• Alterations can be made in the model without changing applications.
• Allows a unified view of heterogeneous data sources.
• Remove conflicts and terminological uncertainties.
• Facilitate moderated searches, optimisation of results

ISIS Ontology

At present over 10,000 keywords describing experiments in ISIS ICAT
• Free text terms: NO structure of keywords
• Organise and formalise into an ontology
• Map familiar terms in one domain to related terms in different domains
• Search data by category across studies
• Cross-facility searching of related scientific data from the various scientific facilities, e.g. CLF and DLS, and elsewhere
• Some initial work in this area: Louisa Casely-Hayford
• Re NXDL
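Organising free-text keywords into categories, and then searching studies by category, can be sketched as follows. This is a toy Python illustration under stated assumptions: the category tree `PARENT`, the mapping `KEYWORD_TO_CATEGORY`, the study names and the keywords are all invented for the example and do not come from the ISIS keyword set or any real ontology.

```python
from typing import Dict, List, Optional, Set

# Hypothetical category tree: child category -> parent category.
PARENT: Dict[str, Optional[str]] = {
    "physics": None,
    "magnetism": "physics",
    "superconductivity": "physics",
}

# Hypothetical mapping from normalised free-text keywords to categories.
KEYWORD_TO_CATEGORY: Dict[str, str] = {
    "antiferromagnet": "magnetism",
    "spin wave": "magnetism",
    "high-tc": "superconductivity",
}


def with_ancestors(category: Optional[str]) -> List[str]:
    """Return the category followed by all of its ancestors."""
    chain: List[str] = []
    while category is not None:
        chain.append(category)
        category = PARENT.get(category)
    return chain


def search_by_category(studies: Dict[str, List[str]], category: str) -> List[str]:
    """Names of studies with a keyword mapped to `category` or one of its sub-categories."""
    hits: List[str] = []
    for study, keywords in studies.items():
        categories: Set[str] = set()
        for keyword in keywords:
            mapped = KEYWORD_TO_CATEGORY.get(keyword.strip().lower())
            if mapped is not None:
                categories.update(with_ancestors(mapped))
        if category in categories:
            hits.append(study)
    return sorted(hits)


studies = {
    "study-A": ["Spin wave", "powder"],
    "study-B": ["High-Tc"],
    "study-C": ["polymer"],
}
print(search_by_category(studies, "physics"))    # ['study-A', 'study-B']
print(search_by_category(studies, "magnetism"))  # ['study-A']
```

Unmapped keywords such as "powder" are simply ignored here; in practice they would be the candidates for curation into the ontology.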

Page 13

Page 14

Sample, Investigator and Experiment Ontologies

[Figures: Sample, Investigator and Experiment ontology diagrams]

Page 15

CSMD and Simulations

Detailed Information about the Data Analysis and underpinning Computational Simulations:

• Simulation/Analysis Code and version
• Simulation Set-up and Parameters
• Information about the Compute Resource
• Key Parameters from the Simulation Results
• Keywords and Classifications
• Links to Analysed Secondary Data

• Currently limited support in the CSMD

Could provide link to Software libraries.

Page 16

Sharing Publications

CSMD offers the potential to integrate the outputs of scientific research: data and publications.

Institutional Repository s/w now very well established
– ePrints, DSpace, Fedora, ePubs
– Large body of expertise available
– Standard metadata models and protocols:
  • DC-APs, FRBR, OAI-PMH, OAI-ORE
– Not yet embedded in science practice
  • except HEP!

Linking science data and publications
– Not yet well established
– Needs data identifiers and citation
  • DOIs, Data Journals
– Needs peer review of data
– Can (and should?) be done on a P2P basis
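As a concrete picture of one of the repository protocols named above, an OAI-PMH harvester asks a repository for records with a plain HTTP GET carrying a `verb` and a `metadataPrefix`. The sketch below only builds such a request URL in Python; it does not contact any server. The base URL `https://repository.example.org/oai` is a placeholder, and the function name `list_records_url` is an assumption for illustration.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(
    base_url: str,
    metadata_prefix: str = "oai_dc",
    set_spec: Optional[str] = None,
    from_date: Optional[str] = None,
) -> str:
    """Build an OAI-PMH ListRecords request URL (no network access)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec    # optional selective harvesting by set
    if from_date:
        params["from"] = from_date  # optional lower bound on the record datestamp
    return f"{base_url}?{urlencode(params)}"


# Placeholder repository address, for illustration only.
url = list_records_url("https://repository.example.org/oai", from_date="2009-01-01")
print(url)
```

The `oai_dc` prefix requests unqualified Dublin Core, the metadata format every OAI-PMH repository is required to support, which is why a Dublin Core view of the CSMD is a natural route to exposing data records alongside publications.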

Page 17

Issues on Metadata in EDNP

• Integration with other Projects and Facilities.
  – Interchangeable Metadata to do this
  – Providing the input tools
  – Providing the APIs
  – Providing the User Interfaces

• Data Policy and Ownership.

• Data and Metadata Curation for long-term reuse.

Page 18

Questions?

[email protected]