2013 01-14 ops-dataset_descriptions
Dataset Descriptions in Open PHACTS
Alasdair J G Gray
University of Manchester
W3C HCLS Call – 14 January 2013
www.openphacts.org/specs/datadesc/
Authors: Christian Y. A. Brenninkmeijer, Chris Evelo, Carole Goble, Alasdair J. G. Gray, Andra Waagmeester and Egon L. Willighagen
Why?
Public Domain Drug Discovery Data: Pharma are accessing, processing, storing & re-processing
• Literature
• PubChem
• Genbank
• Patents
• Databases
• Downloads
• Data Integration
• Data Analysis
• Firewalled Databases
• Repeat @ each company
The Project
The Innovative Medicines Initiative
• EC funded public-private partnership for pharmaceutical research
• Focus on key problems
– Efficacy, Safety, Education & Training, Knowledge Management
The Open PHACTS Project
• Create a semantic integration hub (“Open Pharmacological Space”)…
• Delivering services to support on-going drug discovery programs in pharma and public domain
• Not just another project; leading academics in semantics, pharmacology and informatics, driven by solid industry business requirements
• 13 academic partners, 9 pharmaceutical companies, 6 SMEs
• Work split into clusters:
– Technical Build (focus here)
– Scientific Drive
– Community & Sustainability
Architecture
User Interfaces & Applications
Linked Data API
Linked Data Cache
Identity Mapping Service
Identity Resolution Service
Domain Specific Services
Data
Datasets and Links
ChemSpider
• ChemSpider aggregates data from over 400 sources
• Central integration point for chemicals in OPS
• OPS data covers
– ChEBI
– ChEMBL
– DrugBank
14 January 2013 OPS Dataset Descriptions – A. J. G. Gray 6
What version of ChEMBL? ~Jan 2012
• ChemSpider: EBI SDF file
– ChEMBL 13
• Data Cache: Chem2Bio2RDF ChEMBL RDF
– File downloaded May 2011
– Chem2Bio2RDF metadata webpages: ChEMBL 8
– File: ChEMBL 2
• Mapping Server: Kasabi ChEMBL RDF file
– ChEMBL 12
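The mismatch on this slide can be made concrete with a small check. The following is only an illustrative sketch: the version claims are transcribed from the slide, while the tuple layout and the function names are assumptions made for this example and not part of any Open PHACTS tooling.

```python
# Version claims transcribed from the slide; the (component, evidence, version)
# layout is an illustrative assumption, not an Open PHACTS data format.
claims = [
    ("ChemSpider", "EBI SDF file", "ChEMBL 13"),
    ("Data Cache", "Chem2Bio2RDF metadata webpages", "ChEMBL 8"),
    ("Data Cache", "Chem2Bio2RDF file", "ChEMBL 2"),
    ("Mapping Server", "Kasabi ChEMBL RDF file", "ChEMBL 12"),
]


def versions_by_component(claims):
    """Group the claimed versions under the component that uses them."""
    grouped = {}
    for component, _evidence, version in claims:
        grouped.setdefault(component, set()).add(version)
    return grouped


def conflicts(claims):
    """Return components whose own evidence disagrees about the version."""
    return {
        component: sorted(versions)
        for component, versions in versions_by_component(claims).items()
        if len(versions) > 1
    }


# Distinct versions in use across the whole platform.
distinct_versions = {version for _, _, version in claims}

print("Internal conflicts:", conflicts(claims))
print("Distinct versions in use:", len(distinct_versions))
```

Without metadata carried in the files themselves, a check like this depends on someone collecting the claims by hand, which is the problem the rest of the talk addresses.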
For the record
• OPS currently uses ChEMBL 13
– RDF generated from EBI database dump
– Published at linkedchemistry.info
• Credit: Egon Willighagen
• Soon moving to ChEMBL 15
– RDF published by EBI
Challenges
• Datasets available
– In many versions over time
– In different formats
– From many mirrors/registries
• Files do not carry metadata
• Registries
– Can be out-of-date
– Can contain conflicting information
VoID: Vocabulary of Interlinked Datasets
• Describes RDF datasets
– W3C Note: http://www.w3.org/TR/void/
• Metadata carried with data
– Directly embedded or linked (void:inDataset)
• Problems
– Very generic
– No checklist of requisite fields
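As a rough illustration of what "metadata carried with data" can look like, the sketch below builds a small VoID description as Turtle text and a void:inDataset link pointing back to it. It uses only string formatting, with no RDF library. Every URI under example.org, the title, and the date are hypothetical placeholders, and the choice of properties is an assumption for the example and not the Open PHACTS checklist.

```python
# Minimal sketch: emit a VoID dataset description as Turtle text.
# All example.org URIs and the title/date values are hypothetical.

PREFIXES = {
    "void": "http://rdfs.org/ns/void#",
    "dcterms": "http://purl.org/dc/terms/",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
}


def void_description(dataset_uri, title, issued, data_dump):
    """Return a Turtle string describing one void:Dataset."""
    lines = [f"@prefix {prefix}: <{ns}> ." for prefix, ns in PREFIXES.items()]
    lines += [
        "",
        f"<{dataset_uri}> a void:Dataset ;",
        f'    dcterms:title "{title}" ;',
        f'    dcterms:issued "{issued}"^^xsd:date ;',
        f"    void:dataDump <{data_dump}> .",
    ]
    return "\n".join(lines)


def in_dataset_link(document_uri, dataset_uri):
    """Link a data document back to the dataset description (linked style)."""
    return f"<{document_uri}> void:inDataset <{dataset_uri}> ."


turtle = void_description(
    dataset_uri="http://example.org/void#exampleDataset",
    title="Example dataset",
    issued="2013-01-14",
    data_dump="http://example.org/dumps/example.ttl",
)
link = in_dataset_link(
    "http://example.org/data/record1",
    "http://example.org/void#exampleDataset",
)

print(turtle)
print(link)
```

Nothing in VoID itself requires the title, date, or dump location used here, which is the "no checklist" problem named on the slide.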
Provenance Vocabularies
• Dublin Core Terms
– Widely used
– Terms too generic to give proper credit
• “Date: A point or period of time associated with an event in the lifecycle of the resource.”
• PROV
– New W3C standard: www.w3.org/2011/prov
– Generic framework for exchanging data
– Does not contain required predicates
PAV: Provenance, Authoring and Versioning Vocabulary
http://code.google.com/p/pav-ontology/wiki/Homepage
• Easy to understand predicates
– http://purl.org/pav/
• Right level of granularity
– Distinguishes: author/creator/curator
– Captures source of data:
• import/derived/accessed
• version/previousVersion
• Being aligned with PROV-O
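To show the granularity the slide refers to, here is a small sketch that writes provenance statements as N-Triples lines using the http://purl.org/pav/ namespace. The property names (authoredBy, createdBy, curatedBy, importedFrom, derivedFrom, version, previousVersion) are given as I understand the PAV vocabulary and should be checked against its documentation. All example.org URIs, the people, and the version values are hypothetical.

```python
# Sketch: provenance statements for one dataset using PAV predicates.
# Subject and object URIs under example.org are hypothetical placeholders.

PAV = "http://purl.org/pav/"


def triple(subject, predicate, obj, literal=False):
    """Format one N-Triples line; obj is a URI unless literal=True."""
    value = f'"{obj}"' if literal else f"<{obj}>"
    return f"<{subject}> <{PAV}{predicate}> {value} ."


dataset = "http://example.org/datasets/example-v2"

provenance = [
    # Who did what: three distinct roles rather than one generic "creator".
    triple(dataset, "authoredBy", "http://example.org/people/author"),
    triple(dataset, "createdBy", "http://example.org/people/converter"),
    triple(dataset, "curatedBy", "http://example.org/people/curator"),
    # Where the data came from.
    triple(dataset, "importedFrom", "http://example.org/sources/database-dump"),
    triple(dataset, "derivedFrom", "http://example.org/datasets/upstream"),
    # Which version this is, and what it supersedes.
    triple(dataset, "version", "2", literal=True),
    triple(dataset, "previousVersion", "http://example.org/datasets/example-v1"),
]

print("\n".join(provenance))
```

A generic date or creator field would collapse these statements into one, which is the contrast with Dublin Core Terms drawn on the previous slide.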
Dataset Descriptions in the Open Pharmacological Space
Related Work
• Registries: DataHub, MIRIAM
– Do not tie metadata with the data
– No checklist of attributes
• BioDBCore
– Checklist
• Similar information captured
• Includes point of contact information
– Not tied to the data
Realisation of Dataset Descriptions
• Needs to be incorporated into data publishing pipeline
• Hard for publishers to provide conformant descriptions
– Datasets are complex
– Evolve over time
– Seen as yet another burden
• Validation tool provided
– http://openphacts.cs.man.ac.uk:9090/OPS-IMS/validate
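The idea behind checklist validation can be sketched in a few lines. This is not the Open PHACTS validator, and the required predicates listed here are a hypothetical checklist chosen only for illustration; the actual requirements are defined in the specification at www.openphacts.org/specs/datadesc/.

```python
# Hypothetical checklist: these predicates are illustrative and are not
# taken from the Open PHACTS dataset description specification.
REQUIRED = ["dcterms:title", "dcterms:license", "pav:version", "void:dataDump"]


def missing_fields(description, required=REQUIRED):
    """Return the required predicates that are absent or empty."""
    return [predicate for predicate in required if not description.get(predicate)]


# A description held as predicate -> value; the values are placeholders.
description = {
    "dcterms:title": "Example dataset",
    "pav:version": "2",
    "void:dataDump": "http://example.org/dumps/example.ttl",
}

missing = missing_fields(description)
if missing:
    print("missing: " + ", ".join(missing))
else:
    print("conformant")
```

Running such a check inside the publishing pipeline, and not as a separate manual step, is one way to reduce the burden on publishers that the slide mentions.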
Future Vision
• Provide rich and accurate provenance trail of data
– Alignment with BioDBCore
• One standard to rule them all
– Automatic pipeline from VoID file to registries
• Write once, use many times
Thank you
[email protected]/~graya/
www.openphacts.org