Data integration with identifiers and ontologies



Page 1: Data integration with identifiers and ontologies

Data integration with identifiers and ontologies

Why are names and graphs not enough?

Egon Willighagen

http://chem-bla-ics.blogspot.com/

@egonwillighagen

ORCID: 0000-0001-7542-0286

Uppsala University, 2016-09-12

Page 2: Data integration with identifiers and ontologies

Acknowledgements

● WikiPathways and PathVisio projects

– Prof. Alex Pico's team, UCSF

– Current and past members of BiGCaT (Prof. Chris Evelo): Marloes Poort

– Pathway Providers: Pieter Giesbertz (TUM), Kozo Nishida (RIKEN)

● Maastricht University

– Toxicology: Rianne Fijten

– MaCSBio team

– Maastricht Science Programma (VOC project)

● Open PHACTS

– Manchester University: Prof. Carole Goble, Christian Brenninkmeijer, Stian Soiland-Reyes

– Heriot-Watt University: Alasdair Gray

– Royal Society of Chemistry: Colin Batchelor

● Others

– Bioclipse: Ola Spjuth (Uppsala University)

– MetaboLights collaboration: Reza Salek, Chandu Venkata, Garima Thakur

– ChEBI collaboration: Christoph Steinbeck, Gareth Owen

– PubChem collaboration: Evan Bolton, Gang Fu

– HMDB, Wikidata teams

Page 3: Data integration with identifiers and ontologies

Asthma: Detecting and Understanding

Smolinska et al. PLOS ONE. 2014 9:e105447. doi:10.1371/journal.pone.0105447

Page 4: Data integration with identifiers and ontologies

Systems Biology: pathways

Andón FT, Fadeel B; ''Programmed Cell Death: Molecular Mechanisms and Implications for Safety Assessment of Nanomaterials.''; Acc Chem Res, 2012

Page 5: Data integration with identifiers and ontologies

Dopamine metabolism

Marloes Poort

Page 6: Data integration with identifiers and ontologies

The effect of troglitazone on heme biosynthesis

Page 7: Data integration with identifiers and ontologies
Page 8: Data integration with identifiers and ontologies
Page 9: Data integration with identifiers and ontologies

PathVisio: pathway enrichment (etc)

Van Iersel, M.P., et al. "Presenting and exploring biological pathways with PathVisio." BMC bioinformatics 9.1 (2008): 399. http://pathvisio.org/ → Martina Kutmon

Page 10: Data integration with identifiers and ontologies

We see a lot? But what is it?

● Current techniques can see up to 1000 metabolites in one analysis

– Only part of all 40k metabolites

● Only 10% we can identify

– The other 90% is unknown

Page 11: Data integration with identifiers and ontologies

Databases & identifiers

● HMDB: Human Metabolome Database

● ChEBI: Chemical Entities of Biological Interest

● ChemSpider, PubChem

● CAS: Chemical Abstracts Service

● InChI: International Chemical Identifier

Page 12: Data integration with identifiers and ontologies

Acid/Base conjugates

CHEBI:15361 (Pyruvate) -> Ce:CHEBI:32816 (conjugate) -> Ck:C00022 -> [WP2456 HIF1A and PPARG regulation of glycolysis, WP2453 TCA Cycle and PDHc]
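
The chain on this slide can be read as three lookups in a row. A minimal sketch, assuming small in-memory tables filled only with the identifiers shown above; the table and function names are hypothetical, and this is not the BridgeDb API:

```python
# Toy lookup tables holding only the identifiers shown on the slide.
# All names here are illustrative; a real mapping database is far larger.
CONJUGATE = {"CHEBI:15361": "CHEBI:32816"}           # ChEBI -> conjugate acid/base
CHEBI_TO_KEGG = {"CHEBI:32816": "C00022"}            # ChEBI (Ce) -> KEGG Compound (Ck)
KEGG_TO_PATHWAYS = {"C00022": ["WP2456", "WP2453"]}  # KEGG Compound -> WikiPathways


def pathways_via_conjugate(chebi_id):
    """Follow ChEBI -> conjugate -> KEGG Compound -> pathways; [] if any step fails."""
    conjugate = CONJUGATE.get(chebi_id)
    kegg = CHEBI_TO_KEGG.get(conjugate)
    return list(KEGG_TO_PATHWAYS.get(kegg, []))


print(pathways_via_conjugate("CHEBI:15361"))  # ['WP2456', 'WP2453']
```

Without the conjugate step, the lookup for CHEBI:15361 in this toy table would return nothing, which is the kind of missed match the slide illustrates.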

Page 13: Data integration with identifiers and ontologies

Switching identities: Glucose

Page 14: Data integration with identifiers and ontologies

Switching identities: Warfarin

Porter, W. (2010). Warfarin: history, tautomerism and activity. Journal of Computer-Aided Molecular Design, 24 (6-7), 553-573. DOI: 10.1007/s10822-010-9335-7

Page 15: Data integration with identifiers and ontologies

Bridging: identifiers

Page 16: Data integration with identifiers and ontologies

So, what IDs are used in WikiPathways?

Curated Collection subset

Page 17: Data integration with identifiers and ontologies

BridgeDb

Van Iersel, M.P., et al. "The BridgeDb framework: standardized access to gene, protein and metabolite identifier mapping services." BMC Bioinformatics 11.1 (2010): 5.

New tools

● Open PHACTS' Identifier Mapping Service

● R package

● Bioclipse

Page 18: Data integration with identifiers and ontologies

Metabolite ID Mapping database

● HMDB, ChEBI, Wikidata

Page 19: Data integration with identifiers and ontologies

BridgeDb: scientific lenses

● Gene

– gene-protein

– gene-probe

● Metabolite

– Tautomers

– Compound class

– Charge (acid/ate)

Brenninkmeijer, CYA, et al. "Scientific Lenses over Linked Data: An approach to support task specific views of the data. A vision." Proceedings of 2nd International Workshop on Linked Science. 2012.
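
One way to picture a scientific lens is as a switch that decides which kinds of relation count as "the same thing" for a given task. The sketch below assumes a toy set of typed links, built from identifiers that appear elsewhere in these slides; the relation labels and function names are hypothetical and do not reflect how BridgeDb or Open PHACTS implement lenses:

```python
# Typed, directed links between identifiers: (source, relation, target).
# Relation labels are made up for this sketch.
LINKS = [
    ("CHEBI:15361", "charge", "CHEBI:32816"),          # pyruvate -> conjugate
    ("CHEBI:28875", "compound_class", "CHEBI:15904"),  # myristic acid -> long-chain fatty acid
]


def equivalents(identifier, lens):
    """Return identifiers reachable from `identifier` over relations the lens allows."""
    seen = {identifier}
    todo = [identifier]
    while todo:
        current = todo.pop()
        for source, relation, target in LINKS:
            if source == current and relation in lens and target not in seen:
                seen.add(target)
                todo.append(target)
    return seen - {identifier}


print(equivalents("CHEBI:15361", set()))             # no lens: no extra matches
print(equivalents("CHEBI:15361", {"charge"}))        # charge lens: the conjugate matches
print(equivalents("CHEBI:28875", {"compound_class"}))  # class lens: the super class matches
```

Keeping the lens as an explicit argument makes the trade-off visible: a stricter lens gives fewer but more exact matches, a more relaxed one gives more matches that need interpretation.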

Page 20: Data integration with identifiers and ontologies
Page 21: Data integration with identifiers and ontologies

#1: The breath data set

CAS numbers: 1843

CAS numbers (unique): 1733

CAS numbers with mappings: 718

CAS numbers matches: 54

Pathways found: 76

Matches via CAS: 9

Matches via mapping: 29

Matches via ChEBI super class: 35

Matches via ChEBI charged species: 3

Matches via ChEBI tautomers: 0

CAS: 544-63-8 (myristic acid) → Ce:28875 → Ce:15904 (long-chain fatty acid) → [WP368 Mitochondrial LC-Fatty Acid Beta-Oxidation, WP357 Fatty Acid Biosynthesis]

Page 22: Data integration with identifiers and ontologies

What if we add more CAS ID mappings? (e.g. from Wikidata)

INFO: Number of ids in Ch (HMDB): 41514 (changed +0.0%)
INFO: Number of ids in Ce (ChEBI): 64222 (changed +0.0%)
INFO: Number of ids in Kd (KEGG Drug): 2406 (changed +23960.0%)
INFO: Number of ids in Ca (CAS): 38621 (changed +30.5%)
INFO: Number of ids in Wi (Wikipedia): 3991 (changed +0.0%)
INFO: Number of ids in Ck (KEGG Compound): 15896 (changed +0.0%)
INFO: Number of ids in Cpc (PubChem-compound): 29170 (changed +72.5%)
INFO: Number of ids in Wd: 18237
INFO: Number of ids in Cs (Chemspider): 23981 (changed +49.4%)

- 30% more CAS numbers (294 unique IDs in WikiPathways)

- 73% more PubChem compound identifiers (217 unique IDs in WP)

- 50% more Chemspider identifiers (157 unique IDs in WP)

- a lot more KEGG Drug identifiers
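
The percentages in the log are relative changes in the number of identifiers per data source. As a worked check, a previous KEGG Drug count of 10 would reproduce the reported +23960.0%; that previous count is inferred from the percentage and does not appear on the slide, and the helper name is hypothetical:

```python
def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# 2406 is the KEGG Drug count from the log; 10 is an inferred previous count.
print(f"changed {percent_change(10, 2406):+.1f}%")  # changed +23960.0%
```

A very large percentage like this mostly reflects a tiny starting count, so the absolute numbers are the more informative figure for KEGG Drug.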

Page 23: Data integration with identifiers and ontologies

#1: The breath data set

CAS numbers: 1843

CAS numbers (unique): 1733

CAS numbers with mappings: 978

CAS numbers matches: 116

Pathways found: 158 (unique: 62)

Matches via CAS: 9

Matches via mapping: 28

Matches via ChEBI super class: 108

Matches via ChEBI charged species: 9

Matches via ChEBI tautomers: 0

Matches via ChEBI roles: 4

CAS: 544-63-8 (myristic acid) → Ce:28875 → Ce:15904 (long-chain fatty acid) → [WP368 Mitochondrial LC-Fatty Acid Beta-Oxidation, WP357 Fatty Acid Biosynthesis]
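
The two runs on the breath data set can be compared directly. A short sketch using only the numbers from the two breath data set slides: in both runs the per-route matches add up to the reported "Pathways found", and the share of unique CAS numbers with a mapping rises from 718 to 978 out of 1733. The variable names are illustrative:

```python
UNIQUE_CAS = 1733

# Matches per route, copied from the slides (before and after adding mappings).
before = {"CAS": 9, "mapping": 29, "super class": 35,
          "charged species": 3, "tautomers": 0}
after = {"CAS": 9, "mapping": 28, "super class": 108,
         "charged species": 9, "tautomers": 0, "roles": 4}

for label, routes, mapped in (("before", before, 718), ("after", after, 978)):
    total = sum(routes.values())            # equals "Pathways found" on the slide
    coverage = 100.0 * mapped / UNIQUE_CAS  # share of unique CAS numbers with a mapping
    print(f"{label}: {total} pathway matches, {coverage:.1f}% of CAS numbers mapped")
```

Most of the gain comes from the ChEBI super class route (35 to 108), so the extra matches are largely class-level rather than exact-compound matches.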

Page 24: Data integration with identifiers and ontologies

Wikidata

Mietchen, D. et al. Enabling open science: Wikidata for research (Wiki4R). Research Ideas and Outcomes 1, e7573 (2015)

Page 25: Data integration with identifiers and ontologies

Wikidata: identifiers
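
As an illustration of how identifier pairs could be pulled from Wikidata, the sketch below builds a SPARQL query for items that carry both a CAS Registry Number (property P231) and a ChEBI ID (property P683), and prepares a request URL for the public query endpoint. The choice of properties, the LIMIT and the variable names are assumptions made for this example, not the query used in the talk; the request itself is not sent:

```python
from urllib.parse import urlencode

# Wikidata properties: P231 = CAS Registry Number, P683 = ChEBI ID.
QUERY = """
SELECT ?compound ?cas ?chebi WHERE {
  ?compound wdt:P231 ?cas ;
            wdt:P683 ?chebi .
}
LIMIT 10
"""

# Public Wikidata SPARQL endpoint; fetching the URL is left to the reader.
ENDPOINT = "https://query.wikidata.org/sparql"
url = ENDPOINT + "?" + urlencode({"query": QUERY, "format": "json"})
print(url)
```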

Page 26: Data integration with identifiers and ontologies

Application Programming Interfaces

Page 27: Data integration with identifiers and ontologies

Application Programming Interfaces

Page 28: Data integration with identifiers and ontologies

Conclusions

● Updated metabolite ID database

– HMDB: still a major workhorse

– ChEBI: charged species, compound classes

– Wikidata: CAS numbers, other missing identifiers

● Pathway Analysis

– Mapping with Bioclipse and PathVisio

– Scientific lenses improve mappings

– Better annotation