Post on 15-Apr-2017
A Generic Scientific Data Model and Ontology for Representation of Chemical Data
Stuart J. Chalk, Department of Chemistry, University of North Florida
schalk@unf.edu
CINF Paper 171 – 251st ACS Meeting Spring 2016
#ACSCINFDataSummit
Scientific Data Should be Open
Simple: Openness as the norm not the exception
Data made available, without restriction, so it's useful
Mechanisms/tools to make data available
Formats to allow others to get the data…
…but also so it's easy to use
Annotate the data to make it easy to find
Community driven promotion of and action on this issue
Research Notebook
Spectral Files (JCAMP-DX, proprietary)
Excel Spreadsheets
Personal Databases
Online Databases
PDF Files No!
RDF Yes! Resource Description Framework
Options for Storing Data?
W3C Recommendation 2015
Specification - https://www.w3.org/TR/ldp/
Primer - https://www.w3.org/TR/ldp-primer/
The Linked Data Platform
From: http://www.dataversity.net/introduction-linked-data-platform/
Use JavaScript Object Notation (JSON) as a text format for storing data and metadata so it can be converted to RDF
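The JSON-to-RDF conversion described here can be sketched in a few lines of Python. This is a toy converter under strong assumptions, not the JSON-LD processing algorithm: it handles only a single flat node whose context maps each key directly to a full IRI, and the function names `literal` and `to_ntriples` are made up for this illustration. The node is the one used in the JSON-LD example in this deck. A real project would more likely rely on an existing JSON-LD library.

```python
import json
from urllib.parse import urljoin

XSD = "http://www.w3.org/2001/XMLSchema#"


def literal(value):
    """Render a native JSON value as an N-Triples literal (simplified)."""
    # bool must be tested before int because bool is a subclass of int
    if isinstance(value, bool):
        return f'"{str(value).lower()}"^^<{XSD}boolean>'
    # Numbers without a fractional part are written as xsd:integer
    if isinstance(value, int) or (isinstance(value, float) and value.is_integer()):
        return f'"{int(value)}"^^<{XSD}integer>'
    # Other numbers become xsd:double; the lexical form here is NOT canonical
    if isinstance(value, float):
        return f'"{value!r}"^^<{XSD}double>'
    # Plain strings: json.dumps adds the quotes and escapes
    return json.dumps(value)


def to_ntriples(doc):
    """Toy conversion of one flat JSON-LD node to sorted N-Triples lines."""
    context = doc["@context"]
    # An empty "@id" resolves to the "@base" IRI itself
    subject = urljoin(context.get("@base", ""), doc.get("@id", ""))
    triples = []
    for key, value in doc.items():
        if key.startswith("@"):  # skip keywords such as @context and @id
            continue
        triples.append(f"<{subject}> <{context[key]}> {literal(value)} .")
    return sorted(triples)


doc = {
    "@context": {
        "name": "http://schema.org/name",
        "isAlive": "http://example.org/isAlive",
        "age": "http://example.org/age",
        "height": "http://schema.org/height",
        "@base": "http://www.unf.edu/chemistry/stuart_chalk.aspx",
    },
    "@id": "",
    "name": "Stuart Chalk",
    "isAlive": True,
    "age": 49,
    "height": 188.0,
}

triples = to_ntriples(doc)
for line in triples:
    print(line)
```

Each key/value pair of the node becomes one triple whose subject is the node's "@id" and whose predicate is the IRI the context assigns to the key.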
JSON for Linked Data (JSON-LD)
{
  "@context": {
    "name": "http://schema.org/name",
    "isAlive": "http://example.org/isAlive",
    "age": "http://example.org/age",
    "height": "http://schema.org/height",
    "@base": "http://www.unf.edu/chemistry/stuart_chalk.aspx"
  },
  "@id": "",
  "name": "Stuart Chalk",
  "isAlive": true,
  "age": 49,
  "height": 188.0
}
http://json-ld.org/playground/
JSON for Linked Data (JSON-LD)
<http://www.unf.edu/chemistry/stuart_chalk.aspx> <http://example.org/age> "49"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://www.unf.edu/chemistry/stuart_chalk.aspx> <http://example.org/isAlive> "true"^^<http://www.w3.org/2001/XMLSchema#boolean> .
<http://www.unf.edu/chemistry/stuart_chalk.aspx> <http://schema.org/height> "188"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://www.unf.edu/chemistry/stuart_chalk.aspx> <http://schema.org/name> "Stuart Chalk" .
Nice idea, but because anything can be linked to anything else to form a graph of variable structure…
...difficult to search, hard to maintain
OK, use a regular relational database? Rigid schema
Not good to try and make data fit the schema…
Use a hybrid approach!
Encode some structure in RDF using a framework...
...add data to the structured graph in an organized way
Store all Scientific Data in RDF?
Consider the FAIR Principles (http://www.datafairport.org)
To be Findable:
F1. (meta)data are assigned a globally unique and persistent identifier
F2. data are described with rich metadata (defined by R1 below)
F3. metadata clearly and explicitly include the identifier of the data it describes
F4. (meta)data are registered or indexed in a searchable resource
To be Accessible:
A1. (meta)data are retrievable by their identifier using a standardized communications protocol
A2. metadata are accessible, even when the data are no longer available
To be Interoperable:
I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation
I2. (meta)data use vocabularies that follow FAIR principles
I3. (meta)data include qualified references to other (meta)data
To be Reusable:
R1. (meta)data are richly described with a plurality of accurate and relevant attributes
R1.1. (meta)data are released with a clear and accessible data usage license
R1.2. (meta)data are associated with detailed provenance
R1.3. (meta)data meet domain-relevant community standards
What Metadata is Important for Data?
Define scope as data obtained from an experiment, a series of experiments, a project
Who did the work and where are they?
Metadata about the data “packet”
The raw data…
…its associated metadata (enough to properly contextualize the data)
Access rights
Published location
What Should a Data Model Represent?
General Framework
SciData – Scientific Data Model (SDM)
Overview – http://stuchalk.github.io/scidata/
GitHub Repo – https://github.com/stuchalk/scidata
General Framework - The Context
“@context” contains the context definition
Refers to other context files
Namespace abbreviations
Default vocabulary “@vocab”
“@id” links to an ontology term
“@type” states the data type
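How namespace abbreviations and the default vocabulary turn short JSON keys into full IRIs can be sketched as follows. This is a simplified illustration, not the full JSON-LD expansion algorithm; the function name `expand_term`, the `ex` prefix, and every example.org IRI are assumptions made for the example and are not taken from the SciData context files.

```python
def expand_term(term, context):
    """Resolve a key to a full IRI using a simplified JSON-LD context."""
    if term.startswith("@"):  # keywords such as @id and @type stay as-is
        return term
    # 1. Explicit term definition, given as a string or as {"@id": ...}
    definition = context.get(term)
    if isinstance(definition, dict):
        definition = definition.get("@id")
    if definition is not None:
        term = definition
    # 2. Compact IRI of the form prefix:suffix
    prefix, sep, suffix = term.partition(":")
    if sep and prefix in context and not suffix.startswith("//"):
        return context[prefix] + suffix
    if sep:  # already an absolute IRI such as http://...
        return term
    # 3. Fall back to the default vocabulary
    return context.get("@vocab", "") + term


# Hypothetical context: all IRIs below are placeholders
context = {
    "@vocab": "https://example.org/scidata/vocab#",
    "ex": "https://example.org/units#",
    "unit": {"@id": "ex:unit", "@type": "@id"},
}

expanded = {
    key: expand_term(key, context)
    for key in ["methodology", "unit", "ex:pH", "http://schema.org/name"]
}
for key, iri in expanded.items():
    print(f"{key} -> {iri}")
```

Keys without an explicit definition fall through to “@vocab”, which is what lets a document use plain names for most fields while still resolving to ontology terms.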
Methodology, System, and Dataset
Example Data - pH
Example Data - Literature Value
“scope” provides internal link to “@id” value
Each value of a name/value pair has a default data type that can be overridden by expanding the value to a JSON object and adding “@value” and “@type”
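The override mechanism can be illustrated with a small Python sketch. The default mapping below is a simplified version of how JSON-LD treats native JSON values when converting to RDF, and the helper names `default_type` and `effective_type` as well as the sample values are hypothetical.

```python
XSD = "http://www.w3.org/2001/XMLSchema#"


def default_type(value):
    """Default datatype for a native JSON value (simplified)."""
    if isinstance(value, bool):  # test bool first: it is a subclass of int
        return XSD + "boolean"
    if isinstance(value, int):
        return XSD + "integer"
    if isinstance(value, float):
        # Numbers without a fractional part are treated as integers
        return XSD + ("integer" if value.is_integer() else "double")
    return XSD + "string"


def effective_type(value):
    """Return the explicit "@type" of an expanded value, else the default."""
    if isinstance(value, dict) and "@value" in value:
        return value.get("@type", default_type(value["@value"]))
    return default_type(value)


plain = 7.41  # native number, default type applies
typed = {"@value": "7.41", "@type": XSD + "decimal"}  # explicit override

print(effective_type(plain))
print(effective_type(typed))
```

Expanding the value to an object keeps the original lexical form ("7.41") while stating exactly how it should be interpreted.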
Example Data - NMR Spectrum
“dataseries” are JSON arrays of data on one axis
Bring them together with “datagroup” and we can represent a spectrum
“parameter” is a generic container for data or metadata
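A rough Python sketch of this idea: each “dataseries” holds the values for one axis, and a “datagroup” pairs them up into spectrum points. The field names other than “dataseries”, “datagroup” and “parameter”, the helper functions, and all numeric values are invented for illustration and are not taken from the SciData specification.

```python
def make_dataseries(label, unit, values):
    """One axis of data as a plain array of values."""
    return {"label": label, "unit": unit, "values": list(values)}


def make_datagroup(title, series, parameters=None):
    """Aggregate series of equal length, plus optional parameters."""
    lengths = {len(s["values"]) for s in series}
    if len(lengths) > 1:
        raise ValueError("all dataseries in a datagroup must have equal length")
    return {
        "title": title,
        "dataseries": list(series),
        "parameter": list(parameters or []),
    }


def points(group):
    """Zip the axes together into (x, y, ...) tuples."""
    return list(zip(*(s["values"] for s in group["dataseries"])))


# Invented numbers, for illustration only
shift = make_dataseries("chemical shift", "ppm", [7.26, 3.71, 1.25])
intensity = make_dataseries("intensity", "arbitrary", [0.12, 1.0, 0.45])
spectrum = make_datagroup(
    "NMR spectrum",
    [shift, intensity],
    parameters=[{"property": "nucleus", "value": "1H"}],
)

print(points(spectrum))
```

Keeping each axis as its own array keeps the JSON compact, while the group records which arrays belong together.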
Example Data - CC Calculation
“datagroup”s are structures to aggregate data at any level
“datagroup”s can be infinitely nested
“uid” is optional and can be used to uniquely define any piece of data
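Nested groups and optional “uid” values lend themselves to a simple recursive walk. The sketch below assumes a hypothetical layout in which child groups live under a “datagroup” key and series under “dataseries”; the helper names, titles, uid strings, and numbers are all made up for the example.

```python
def iter_groups(group, depth=0):
    """Yield (depth, group) for a group and all of its nested groups."""
    yield depth, group
    for child in group.get("datagroup", []):
        yield from iter_groups(child, depth + 1)


def find_by_uid(group, uid):
    """Return the first group or series carrying the given uid, else None."""
    for _, current in iter_groups(group):
        if current.get("uid") == uid:
            return current
        for series in current.get("dataseries", []):
            if series.get("uid") == uid:
                return series
    return None


# Invented structure and values, for illustration only
calculation = {
    "title": "calculation",
    "uid": "calc:1",
    "datagroup": [
        {
            "title": "optimization step 1",
            "datagroup": [
                {
                    "title": "energies",
                    "uid": "calc:1:energies",
                    "dataseries": [
                        {"uid": "calc:1:scf", "values": [-76.02, -76.03]},
                    ],
                }
            ],
        }
    ],
}

depths = [(depth, g["title"]) for depth, g in iter_groups(calculation)]
print(depths)
print(find_by_uid(calculation, "calc:1:scf"))
```

Because “uid” is optional, the lookup uses `.get()` and simply skips items that carry no identifier.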
The SDM Ontology
SciData Ontology – Scientific Data Model Ontology (SDMO)
OWL File – https://github.com/stuchalk/scidata/blob/master/ontology/scidata.owl
Get community feedback, refine/extend/standardize
Generate large corpus of disparate data in JSON-LD, ingest into triple store and query (SPARQL)
Evaluate inferencing on the triple store data
Push adoption through collaboration
Run hackathons to build developer implementations
Develop Electronic Laboratory Notebook (ELN) to generate data in JSON-LD
Get feedback from data community, RDA - https://rd-alliance.org/
Test using the NDS - http://www.nationaldataservice.org/
Future Work
Pain Points
Challenges
Opportunities
Normalization
Tools to generate metadata automatically
User Perspective
Gaps in Data
Gaps in Ontology Coverage
Pain Points?
Gather stakeholders to work on standards
Broad knowledge domain representation
IUPAC, RDA Chemistry Research Data IG
Priorities?
Data annotation and representation
Data exchange (repo <-> repo, user <-> user)
Structure representation (chiral centers)
Curation infrastructures
Domain vocabulary translations
Units of measure
Reality Check
“to err is human; to forgive, divine” Alexander Pope
“to err is human; to really screw things up requires a computer” Paul Ehrlich
“to err is human; all hell will break loose if you don’t provide accurate semantics to a computer”
Stuart Chalk
schalk@unf.edu
Phone: 904-620-1938
Skype: stuartchalk
LinkedIn/SlideShare: https://www.linkedin.com/in/stuchalk
ORCID: http://orcid.org/0000-0002-0703-7776
ResearcherID: http://www.researcherid.com/rid/D-8577-2013
Questions?