Metadata for Large Science: The ICAT Data Model
Brian Matthews, Leader, Scientific Applications Group, E-Science Centre, STFC Rutherford Appleton Laboratory
The ICAT software suite
• Catalogues all experiment related information
• Metadata gathered via integration with existing IT systems
  • proposal systems
  • data acquisition
• Provides a well defined API for easy embedding into any application.
[Diagram labels: Distributed Data; Metadata Catalogue; Generic Catalogue Access Interface; Data Access and Analysis]
Integrated e-Infrastructure
[Diagram labels: Proposal; Proposal System; Information; Metadata Catalogue; Experiment; Data Acquisition System; Secure Storage; Data Analysis; Publication; E-Pubs]
All Data and Metadata Capture is automated.
Model Motivation
A common general format/standard for Scientific Studies and data holdings metadata did not exist
By proposing Model and Implementation:
– A specification for the types of metadata which should be captured during Scientific Studies
– Cataloguing and organising data holdings: provide access for the Data Owner
– Ease citation, collaboration, and integration
– Allow easy Integration of distributed heterogeneous metadata systems into a homogeneous (virtual) Platform
Therefore the Core Scientific Metadata Model (CSMD) was developed.
Core Scientific Metadata Model
Metadata Granule
• Topic: Keywords providing an index on what the study is about.
• Study Description: Provenance about what the study is, who did it and when.
• Access Conditions: Conditions of use providing information on who can access the data and how.
• Data Description: Detailed description of the organisation of the data into datasets and files.
• Data Location: Locations providing a navigational aid to where the data on the study can be found.
• Related Material: References into the literature and community providing context about the study.
• Legal Note: Copyright, patents and conditions of use etc. relating to the study and the data in the study.
[Entity diagram. Entities and their attributes:]
Investigation: Reference / Proposal Id, Previous Reference, Facility, Instrument, Title, Abstract, etc.
Investigator: User Id, Role
Publication: Full Reference, URL, Repository
Keyword: Name
Topic: Name, Parent Id, Topic Level
Sample: Name, Chemical Formula, Safety Information
Dataset: Name, Sample Id, Description
Datafile: Name, Description, Version, Location, Format, Format Version, Create Time, Modify Time, Size, Checksum
Sample Parameter, Dataset Parameter, Datafile Parameter (each): Name, Units, String Value, Numeric Value, Range Top, Range Bottom, Error
Parameter: Name/Units/Value etc., Searchable, Is Sample Parameter, Is Dataset Parameter, Is Datafile Parameter, Verified
Related Datafile: Source Datafile Id, Destination Datafile Id, Relation
Authorisation: User Id, Role (e.g. Admin, Deleter, Updater, Reader, Creator, Downloader etc.), Element Type, Element Id
Other labels: S/W Application, S/W Version
CSMD in more detail
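The entity hierarchy in the diagram (Investigation, Dataset, Datafile, each level carrying parameters) can be sketched in a few lines of Python. This is only an illustration, not the ICAT schema or API: the class layout, the Python types, the helper `find_datafiles` and all example values are assumptions; only the attribute names are taken from the diagram labels.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Parameter:
    # Attribute names follow the diagram labels; the Python types are assumptions.
    name: str
    units: str
    string_value: Optional[str] = None
    numeric_value: Optional[float] = None


@dataclass
class Datafile:
    name: str
    location: str
    size: int = 0
    parameters: List[Parameter] = field(default_factory=list)


@dataclass
class Dataset:
    name: str
    datafiles: List[Datafile] = field(default_factory=list)
    parameters: List[Parameter] = field(default_factory=list)


@dataclass
class Investigation:
    title: str
    facility: str
    instrument: str
    keywords: List[str] = field(default_factory=list)
    datasets: List[Dataset] = field(default_factory=list)


def find_datafiles(inv: Investigation, param_name: str,
                   low: float, high: float) -> List[Datafile]:
    """Hypothetical helper: datafiles whose numeric parameter lies in [low, high]."""
    hits = []
    for dataset in inv.datasets:
        for datafile in dataset.datafiles:
            for p in datafile.parameters:
                if (p.name == param_name and p.numeric_value is not None
                        and low <= p.numeric_value <= high):
                    hits.append(datafile)
                    break  # one matching parameter is enough for this file
    return hits


# Made-up example values, for illustration only.
run1 = Datafile("run_0001.raw", "/data/run_0001.raw", 1024,
                [Parameter("temperature", "K", numeric_value=4.2)])
run2 = Datafile("run_0002.raw", "/data/run_0002.raw", 2048,
                [Parameter("temperature", "K", numeric_value=300.0)])
inv = Investigation("Example study", "ISIS", "EXAMPLE-INSTRUMENT",
                    ["magnetism"], [Dataset("raw", [run1, run2])])

cold = find_datafiles(inv, "temperature", 0.0, 10.0)
print([df.name for df in cold])
```

Keeping parameters as separate name/units/value records, rather than as fixed columns, is what lets a catalogue search across studies on quantities that were not known when the schema was designed.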
CSMD Used on DataPortal
• Implementation used as Data Interface for DataPortal
• Single view of heterogeneous systems/schemas
• Acts as a stress test of the model
– Limitations feed into Model Requirements
– New requirements feed back into implementation
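A "single view of heterogeneous systems/schemas" is typically built by giving each source a small adapter that maps its records onto the common model. The sketch below shows that idea only; it is not DataPortal code. The two source layouts, the field names and the adapter functions are all hypothetical.

```python
from typing import Callable, Dict, List

# Common record: a plain dict with a few CSMD-style keys (a simplification).
CommonRecord = Dict[str, str]
SourceRow = Dict[str, str]


def from_source_a(row: SourceRow) -> CommonRecord:
    # Hypothetical source A stores the title and instrument under its own names.
    return {"title": row["experiment_title"], "instrument": row["beamline"], "source": "A"}


def from_source_b(row: SourceRow) -> CommonRecord:
    # Hypothetical source B uses a different schema for the same information.
    return {"title": row["name"], "instrument": row["instr"], "source": "B"}


ADAPTERS: Dict[str, Callable[[SourceRow], CommonRecord]] = {
    "A": from_source_a,
    "B": from_source_b,
}


def single_view(sources: Dict[str, List[SourceRow]]) -> List[CommonRecord]:
    """Map every row from every source onto the common record shape."""
    records: List[CommonRecord] = []
    for source_name, rows in sources.items():
        adapter = ADAPTERS[source_name]
        records.extend(adapter(row) for row in rows)
    return records


# Made-up rows, for illustration only.
view = single_view({
    "A": [{"experiment_title": "Study one", "beamline": "I-01"}],
    "B": [{"name": "Study two", "instr": "X-02"}],
})
print([r["title"] for r in view])
```

A source field that has no home in the common record is exactly the kind of limitation that would feed back into the model requirements.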
ICAT 3.3 Database Schema
Core Scientific Metadata Model Usage
Used on many projects since 2001
– STFC ICAT
• Serving data from STFC Facilities (ISIS, DLS)
• SNS, Canada, Australia
– e-Minerals, e-Materials – collaborations to provide access to Grid resources – with ISIS users.
Outside STFC
• MyGrid BioInformatics project
– information model based on version 1 of the CSMD Model
• This is being taken forward in the myIB project
– Application to Integrated biology
– Extending to provenance tracking in computational steering
• Also influenced projects:
– Comb-eChem, eBank, eCrystals
– NERC DataGrid
• CCP1 (Collaborative Computational Project in Quantum Chemistry) – assessed CSMD for metadata needs on their Grid Data Management Middleware
Interoperability
Sharing data across boundaries
– Across different research lifecycles
– Across institutions
– Across information objects
– Across disciplines
– Across time
Characteristics
– Loosely coupled
– Across different authorities
– Different internal models
Infrastructure to support science across disciplines, scientific institutions and research groups
Interoperable Metadata Models as a way we can enable Interoperable Data
Publishing the CSMD
Currently published as:
• ICAT 3.3 database schema
• ICAT API
• CCLRC Scientific Metadata Model: Version 2 - Sufi & Matthews 2004
• UML Model
• XML Schema
• Need to update and formally publish
• Dublin Core Application Profile
• Sharable and collaborative
• A “flat” model
• An exchange model
Name: Subject
Term URI: http://www.stfc.ac.uk/csmd/elements/3.0/topic
Refines: http://purl.org/dc/elements/1.1/subject
Name: Topic
Has Encoding Scheme: DC Subject Encoding Schemes
Has Encoding Scheme: ISIS Subject Encoding Schemes
Comment: ISIS Subject Encoding Schemes are available from http://www.stfc.ac.uk/schemes/ISIS/Subject
Obligation: Recommended
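The Topic term above refines the Dublin Core subject element, which is what makes a "flat" exchange record possible: a consumer that does not know the CSMD term can fall back on the element it refines. A minimal sketch of that idea follows. The URIs, scheme names and obligation are copied from the term description; the class `TermDescription`, the function `to_flat_record` and the example value are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

DC_SUBJECT = "http://purl.org/dc/elements/1.1/subject"


@dataclass(frozen=True)
class TermDescription:
    # Field names mirror the rows of the term description; the class is hypothetical.
    name: str
    term_uri: str
    refines: str
    encoding_schemes: Tuple[str, ...]
    obligation: str


topic = TermDescription(
    name="Topic",
    term_uri="http://www.stfc.ac.uk/csmd/elements/3.0/topic",
    refines=DC_SUBJECT,
    encoding_schemes=("DC Subject Encoding Schemes", "ISIS Subject Encoding Schemes"),
    obligation="Recommended",
)


def to_flat_record(term: TermDescription, value: str) -> List[Tuple[str, str]]:
    """Emit the value under the refined term and under the element it refines."""
    return [(term.term_uri, value), (term.refines, value)]


# "magnetism" is a made-up keyword, for illustration only.
record = to_flat_record(topic, "magnetism")
for prop, value in record:
    print(prop, "=", value)
```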
Towards a Facilities Ontology
Ontologies are used to capture knowledge about a domain of interest.
• Increased flexibility when representing frequently changing viewpoints.
• Alterations can be made in the model without changing applications.
• Allows a unified view of heterogeneous data sources.
• Remove conflicts and terminological uncertainties.
• Facilitate moderated searches, optimisation of results
ISIS Ontology
At present over 10,000 keywords describing experiments in ISIS ICAT
• Free text terms - NO Structure of keywords
• Organise and formalise into an ontology
• Map familiar terms in one domain to related terms in different domains
• Search data by category across studies
• Cross facility searching of related scientific data from the various scientific facilities e.g. CLF and DLS - elsewhere
• Some initial work in this area – Louisa Casely-Hayford
• Re NXDL
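The step from unstructured free-text keywords to searching "by category across studies" can be illustrated with a toy lookup table standing in for the ontology. Everything in this sketch is an assumption: the keywords, the categories, the study identifiers and the function names are invented, and a real ontology would carry far richer relations than a flat keyword-to-category map.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Hypothetical stand-in for an ontology: free-text keyword -> category.
KEYWORD_TO_CATEGORY: Dict[str, str] = {
    "muon": "technique",
    "neutron diffraction": "technique",
    "superconductor": "material",
    "high-tc": "material",
}


def normalise(keyword: str) -> str:
    """Lower-case and collapse whitespace so free-text variants compare equal."""
    return " ".join(keyword.strip().lower().split())


def index_by_category(studies: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Build category -> set of study ids from each study's free-text keywords."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for study_id, keywords in studies.items():
        for keyword in keywords:
            category = KEYWORD_TO_CATEGORY.get(normalise(keyword))
            if category is not None:  # unmapped keywords are simply skipped here
                index[category].add(study_id)
    return dict(index)


# Made-up studies and keywords, for illustration only.
studies = {
    "study-1": ["Muon", "High-Tc "],
    "study-2": ["neutron  diffraction"],
    "study-3": ["unclassified term"],
}
index = index_by_category(studies)
print(sorted(index["technique"]))
```

Keywords that fail to map, such as the one on study-3, are the ones that still need to be organised and formalised.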
Sample, Investigator and Experiment Ontologies
[Diagrams: Sample ontology; Investigator ontology; Experiment ontology]
CSMD and Simulations
Detailed Information about the Data Analysis and underpinning Computational Simulations:
• Simulation/Analysis Code and version
• Simulation Set-up and Parameters
• Information about the Compute Resource
• Key Parameter from the Simulation Results
• Keywords and Classifications
• Links to Analysed Secondary Data
• Currently limited support in the CSMD
Could provide link to Software libraries.
Sharing Publications
CSMD offers the potential to integrate the outputs of scientific research: data and publications.
Institutional Repository s/w now very well established
– ePrints, DSpace, Fedora, ePubs
– Large body of expertise available
– Standard metadata models and protocols:
• DC-APs, FRBR, OAI-PMH, OAI-ORE
– Not yet embedded in science practice
• except HEP!
Linking science data and publications
– Not yet well established
– Needs data Identifiers and citation
• DOIs, Data Journals
– Needs peer review of data
– Can (and should?) be done on a P2P basis
Issues on Metadata in EDNP
• Integration with other Projects and Facilities.
– Interchangeable Metadata to do this
– Providing the input tools
– Providing the APIs
– Providing the User Interfaces
• Data Policy and Ownership.
• Data and Metadata Curation for long-term reuse.
Questions?