Accurate biochemical knowledge starting with precise structure-based criteria for molecular identity
Accurate biochemical knowledge starting with precise structure-based
criteria for molecular identity
Michel Dumontier, Ph.D.
Assistant Professor of Bioinformatics
Department of Biology, School of Computer Science, Institute of Biochemistry, Ottawa Institute of Systems Biology
Carleton University
01/04/2009 NCBO Seminar Series::Michel Dumontier
Problem Statement (I)
• Although biochemical events can be described with reference to specific chemical substances, we may want to describe them at finer/grainier levels of (mereological) granularity.
– residue : post translational modification
– collection of residues : motif/domain/interaction site
– atom : atomic interactions, catalytic mechanism
– collection of atoms : binding/catalytic site, interaction
• This requires identifiers for parts, regions (contiguous and non-contiguous), aggregates/complexes.
• However, we do not (AFAIK) have a precise (reproducible) methodology to automatically generate these!
Bio2RDF: 2.3B triples of SPARQL-accessible linked biological data!
Chemical Parts!
Case Study: HIF1α
Hypoxia-Inducible Factor 1, alpha chain (uniprot:Q16665)
Master transcriptional regulator of the adaptive response to hypoxia
• Under normoxic conditions, HIF1α is hydroxylated on Pro-402 and Pro-564 in the oxygen-dependent degradation domain (ODD) by EGLN1/PHD1 and EGLN2/PHD2. EGLN3/PHD3 has also been shown to hydroxylate Pro-564. The hydroxylated prolines promote interaction with VHL, initiating rapid ubiquitination and subsequent proteasomal degradation.
Context-Dependent Behavior
a) Normoxic Conditions
b) Hypoxic Conditions
Multiple hydroxylations
Part of a domain
The part is the agent in the process
Selective interaction with parts
Are these the same?
• HIF1α – au naturel
• HIF1α – hydroxylated @P402
• HIF1α – hydroxylated @P564
• HIF1α – hydroxylated @P402 & @P564
• HIF1α – hydroxylated @P402 & (@P564), ubiquitinated @Lys-532
• HIF1α – L400A & L397A
NO!!!!
• These are structurally different
• Each exhibits distinct functionality!
• Yet most databases (Uniprot/Genbank) don’t have separate identifiers for them
• Reactome has an internal identifier for referring to different forms, but links to Uniprot entries and doesn’t provide an explicit description of the structure that it corresponds to!
So
We have a clear need for being able to refer to distinct biochemical entities, based at least on their structure.
We also need to refer to arbitrary structural parts.
Should we generate all the combinations a priori???
NO!!
Should we be able to automatically generate the identifier from the structural attributes?
-> YES!!!
Should we semantically annotate (manually or otherwise) those forms known to be involved in specific processes???
-> YES!!!
What identifiers are unique for a given structure?
InChI
• IUPAC International Chemical Identifier (InChI)
• A data string that provides
1. the structure of a chemical compound
2. the convention for drawing the structure
• Different compounds must have different identifiers. Several attributes can be used to distinguish one compound from another.
– Chemical graph (connection table)
– Formula
– Atom type (only some atoms explicit)
– Bond type
– Stereochemistry
– Mobile/fixed H atoms (tautomers)
– Isotopic composition
– Atomic charge
(S)-Glutamic Acid
InChI={version}1/{formula}C5H9NO4/c{connections}6-3(5(9)10)1-2-4(7)8/h{H_atoms}3H,1-2,6H2,(H,7,8)(H,9,10)/p{protons}+1/t{stereo:sp3}3-/m{stereo:sp3:inverted}0/s{stereo:type (1=abs, 2=rel, 3=rac)}1/i{isotopic:atoms}4+1
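The layered structure annotated above can be made concrete with a small parser. The following is only a sketch: the input is the slide's string with the {…} annotations stripped, the layer names are the informal labels used on the slide, and the function name `split_inchi_layers` is made up for illustration. It assumes each layer prefix occurs once, which does not hold for every InChI (some layers can repeat inside other layers).

```python
# Informal layer labels, copied from the annotations on the slide.
LAYER_NAMES = {
    "c": "connections",
    "h": "H_atoms",
    "p": "protons",
    "t": "stereo:sp3",
    "m": "stereo:sp3:inverted",
    "s": "stereo:type",
    "i": "isotopic:atoms",
}


def split_inchi_layers(inchi: str) -> dict:
    """Split an InChI string into a {layer_name: value} dictionary."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    parts = inchi[len(prefix):].split("/")
    if len(parts) < 2:
        raise ValueError("expected at least a version and a formula layer")
    # The first two layers carry no prefix letter: version, then formula.
    layers = {"version": parts[0], "formula": parts[1]}
    for part in parts[2:]:
        if not part:
            continue  # skip empty layers
        # Remaining layers start with a one-letter prefix.
        name = LAYER_NAMES.get(part[0], part[0])
        layers[name] = part[1:]
    return layers


glutamic_acid = (
    "InChI=1/C5H9NO4/c6-3(5(9)10)1-2-4(7)8"
    "/h3H,1-2,6H2,(H,7,8)(H,9,10)/p+1/t3-/m0/s1/i4+1"
)
layers = split_inchi_layers(glutamic_acid)
for name, value in layers.items():
    print(f"{name:22s} {value}")
```

Splitting on "/" is enough here because none of the layers in this example contain that character.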
AuxInfo={version}1/{normalization_type}1/N:{original_atom_numbers}5,6,2,7,1,4,8,9,10,11/E:{atom_equivalence}(7,8)(9,10)/it:{abs_stereo_inverted:sp3}im/I:{isotopic:original_atom_numbers}/E:{isotopic:atom_equivalence}m/rA:{reversibility:atoms}11nCCHN+CCC.i13OOOO/rB:{reversibility:bonds}s1;N2;P2;s2;s5;s6;s7;d7;d1;s1;/rC:{reversibility:xyz}6.1671,-19.3365,0;7.0125,-18.4864,0;6.4113,-17.4485,0;7.6089,-17.4485,0;7.8578,-19.3318,0;8.891,-18.7306,0;9.7363,-19.576,0;9.7316,-20.7735,0;10.8916,-19.266,0;5.0071,-19.0265,0;6.1624,-20.534,0;
AuxInfo=1/1/N:5,6,2,7,1,4,8,9,10,11/E:(7,8)(9,10)/it:im/I:/E:m/rA:11nCCHN+CCC.i13OOOO/rB:s1;N2;P2;s2;s5;s6;s7;d7;d1;s1;/rC:6.1671,-19.3365,0;7.0125,-18.4864,0;6.4113,-17.4485,0;7.6089,-17.4485,0;7.8578,-19.3318,0;8.891,-18.7306,0;9.7363,-19.576,0;9.7316,-20.7735,0;10.8916,-19.266,0;5.0071,-19.0265,0;6.1624,-20.534,0;
More non-core info captured in “AuxInfo” string...
So... InChI is really just a cryptic data identifier
Clever software is required to gradually build the chemical identifiers in a series of well-defined steps: normalization, canonicalization, then serialization.
Humans can’t (easily) generate them, nor can they easily understand them. But that’s OK.
It’s not (user) extensible. But that’s OK.
InChI for Proteins???
• Possible... but a 1000 residue protein would contain ~15,000 atoms on average....
– OpenBabel seemed to struggle with anything over 100 residues
  • Maybe needs some performance tweaking?
– Size of the string will be enormous
  • We can use InChIKeys (a fixed-length hash of the InChI string, based on SHA-256), but then we need to provide a you-submit-InChI, we-store-both and they-look-it-up service.
– Modularize InChI construction for (linear) polymers?
  • Make InChI strings for each residue, and concatenate; rename the atoms according to the residue position
– We still need to translate the InChI string ...
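The "you submit the InChI, we store both, they look it up" service can be sketched as a small in-memory registry. This is an illustration under stated assumptions only: the class `InChIRegistry` and its methods are hypothetical, and the key is a truncated SHA-256 hex digest standing in for a real InChIKey, which is produced by a different, standardized encoding.

```python
import hashlib


class InChIRegistry:
    """In-memory stand-in for a submit / store / look-up service."""

    def __init__(self):
        self._by_key = {}

    @staticmethod
    def make_key(inchi: str) -> str:
        # NOT the official InChIKey algorithm: a truncated SHA-256 hex
        # digest is used purely to show the fixed-length-hash idea.
        return hashlib.sha256(inchi.encode("utf-8")).hexdigest()[:27]

    def submit(self, inchi: str) -> str:
        """Store the InChI under its hash key and return the key."""
        key = self.make_key(inchi)
        existing = self._by_key.setdefault(key, inchi)
        if existing != inchi:
            # Two different strings hashing to the same key.
            raise ValueError(f"hash collision for key {key}")
        return key

    def lookup(self, key: str):
        """Return the stored InChI for a key, or None if unknown."""
        return self._by_key.get(key)


registry = InChIRegistry()
glucose = (
    "InChI=1/C6H12O6/c7-1-2-3(8)4(9)5(10)6(11)12-2"
    "/h2-11H,1H2/t2-,3-,4+,5-,6+/m1/s1"
)
key = registry.submit(glucose)
print(key, "->", registry.lookup(key))
```

The point of the design is that the key stays the same length however large the structure description grows, at the cost of needing the registry to get the description back.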
α-D-Glucose
SMILES: O1[C@@H]([C@@H](O)([C@H](O)([C@@H](O)([C@@H]1(O)))))(CO) 79025
InChI: InChI=1/C6H12O6/c7-1-2-3(8)4(9)5(10)6(11)12-2/h2-11H,1H2/t2-,3-,4+,5-,6+/m1/s1
IUPAC: 6-(hydroxymethyl)oxane-2,3,4,5-tetrol OR (2R,3R,4S,5R,6R)-6-(hydroxymethyl)tetrahydro-2H-pyran-2,3,4,5-tetraol
Other formats shown: CML, SDF
Tool shown: OpenBabel
OWL Has Explicit Semantics
Can therefore be used to capture knowledge in a machine understandable way
Chemical Ontology
Chemical Knowledge for the Semantic Web. Mykola Konyk, Alexander De Leon, and Michel Dumontier. LNBI. 2008. 5109:169-176. Data Integration in the Life Sciences (DILS2008). Evry, France.
http://code.google.com/p/semanticwebopenbabel/
hydroxyl group
methyl group
Knowledge of functional groups is important in chemical synthesis, pharmaceutical design and lead optimization.
Functional groups describe chemical reactivity in terms of atoms and their connectivity, and exhibit characteristic chemical behavior when present in a compound.
Describing chemical functional groups in OWL-DL for the classification of chemical compounds.
N Villanueva-Rosales, M Dumontier. 2007. OWLED, Innsbruck, Austria.
Ethanol
Describing Functional Groups in DL
HydroxylGroup: CarbonGroup that (hasSingleBondWith some (OxygenAtom that hasSingleBondWith some HydrogenAtom))
R–OH, where R is the R group
Fully Classified Ontology
35 functional groups (FG)
And, we define certain compounds
Alcohol: OrganicCompound that (hasPart some HydroxylGroup)
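To show what the HydroxylGroup and Alcohol definitions ask of a structure, here is a plain graph-matching sketch. It is not description-logic reasoning and not the authors' implementation: molecules are hand-built dictionaries, only single bonds are modelled, and the helper names (`build_molecule`, `hydroxyl_groups`, `is_alcohol`) and atom numbering are invented for the example.

```python
from collections import defaultdict


def build_molecule(atoms, single_bonds):
    """atoms: {atom_id: element}; single_bonds: iterable of (id, id) pairs."""
    neighbours = defaultdict(set)
    for a, b in single_bonds:
        neighbours[a].add(b)
        neighbours[b].add(a)
    return atoms, neighbours


def hydroxyl_groups(molecule):
    """Return (carbon, oxygen, hydrogen) triples joined as C-O-H."""
    atoms, neighbours = molecule
    matches = []
    for c in sorted(atoms):
        if atoms[c] != "C":
            continue
        # A carbon single-bonded to an oxygen ...
        for o in sorted(neighbours[c]):
            if atoms[o] != "O":
                continue
            # ... that is itself single-bonded to a hydrogen.
            for h in sorted(neighbours[o]):
                if atoms[h] == "H":
                    matches.append((c, o, h))
    return matches


def is_alcohol(molecule):
    """Alcohol: a compound that has at least one hydroxyl group as a part."""
    return len(hydroxyl_groups(molecule)) > 0


# Ethanol, CH3-CH2-OH: atoms 1-2 carbon, 3 oxygen, 4-9 hydrogen.
ethanol = build_molecule(
    {1: "C", 2: "C", 3: "O", 4: "H", 5: "H", 6: "H", 7: "H", 8: "H", 9: "H"},
    [(1, 2), (2, 3), (1, 4), (1, 5), (1, 6), (2, 7), (2, 8), (3, 9)],
)
# Ethane, CH3-CH3: no oxygen, so no hydroxyl group.
ethane = build_molecule(
    {1: "C", 2: "C", 3: "H", 4: "H", 5: "H", 6: "H", 7: "H", 8: "H"},
    [(1, 2), (1, 3), (1, 4), (1, 5), (2, 6), (2, 7), (2, 8)],
)

print(hydroxyl_groups(ethanol), is_alcohol(ethanol), is_alcohol(ethane))
```

Like the class definition it mirrors, the pattern is purely structural, so it would also fire on the C-O-H of a carboxylic acid; finer distinctions need additional definitions.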
Organic Compound Ontology
28 organic compounds (OC)
Question Answering
• Query all annotations
• Query PubChem, DrugBank and DBpedia*
* Requires import of relevant URIs
But...
• Molecules represented as individuals because OWL-DL only allows tree-like class descriptions
– No variable binding (e.g. ?x) ... no cyclic molecule/functional group descriptions at the class level
• Boris Motik et al. have a proposal for Description Graphs
– Robert Stevens & Duncan Hull trying it out for chemical representation....
Identifiers for Atoms
• Atom identifiers can be consistently retrieved from the OpenBabel model.
– Canonical numbering means we can reliably refer to a specific region rather than a (possibly degenerate) sub-graph match.
– In our plugin, URI component naming was based on the assigned molecule identifier, e.g. pubchemid#aN, where N is the number
– Use InChIKey as base? e.g. InChIKey#aN
What about identifiers for collections of atoms?
• Potentially useful in describing residues, PTMs, binding sites, etc.
– Is the lack of connectivity sufficient?
• Contiguous:
– ranges (aN-aN)
– enumerations (aN,aN,aN)
• Non-contiguous:
– Combination of ranges, enumerations?
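One way to make range and enumeration identifiers reproducible is to always emit them in a canonical form, so the same set of atoms yields the same fragment however it was written. The sketch below assumes the aN / aN-aN / comma-separated syntax from the slides; the function names and the "EXAMPLEKEY" base are placeholders, not part of any existing tool.

```python
import re

# One token is either a single atom ("a7") or a range ("a1-a3").
_TOKEN = re.compile(r"^a(\d+)(?:-a(\d+))?$")


def parse_atom_fragment(fragment: str) -> list:
    """Expand 'a1-a3,a7' into the sorted atom numbers [1, 2, 3, 7]."""
    numbers = set()
    for token in fragment.split(","):
        match = _TOKEN.match(token.strip())
        if match is None:
            raise ValueError(f"unrecognised token: {token!r}")
        start = int(match.group(1))
        end = int(match.group(2)) if match.group(2) else start
        if end < start:
            raise ValueError(f"descending range: {token!r}")
        numbers.update(range(start, end + 1))
    return sorted(numbers)


def format_atom_fragment(numbers) -> str:
    """Collapse atom numbers into a canonical list of ranges and single atoms."""
    ordered = sorted(set(numbers))
    if not ordered:
        raise ValueError("empty atom collection")
    runs = []
    start = prev = ordered[0]
    for n in ordered[1:]:
        if n == prev + 1:
            prev = n  # extend the current run of consecutive numbers
            continue
        runs.append((start, prev))
        start = prev = n
    runs.append((start, prev))
    return ",".join(f"a{s}" if s == e else f"a{s}-a{e}" for s, e in runs)


def atom_set_uri(base: str, numbers) -> str:
    """Build '<base>#<fragment>' for a collection of atoms."""
    return f"{base}#{format_atom_fragment(numbers)}"


print(parse_atom_fragment("a1-a3,a7"))
print(atom_set_uri("EXAMPLEKEY", [60, 50, 51, 52]))
```

Note that a numeric range only says the atom numbers are consecutive; whether the atoms are bonded to each other still has to come from the structure itself.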
Can we reuse our positional nomenclature for residues?
• Residues are generally referred to by their absolute position in the biopolymer sequence.
e.g. Pro @ X on Protein Y
InChiKey#a50-a65 owl:sameAs InChiKey#r5
InChiKey#r5_a1-r5_a15 owl:sameAs InChiKey#r5
• Collection of Residues might follow the same rules as a Collection of Atoms.
– Useful for defining domains, motifs, etc
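The residue/atom-range equivalence can be generated mechanically once each residue's atom count is known. The sketch below makes a strong simplifying assumption, flagged here and in the code: atoms are numbered consecutively residue by residue along the chain, which a canonical numbering would not necessarily guarantee. The atom counts, the "KEY" base and the function names are illustrative only.

```python
def residue_atom_ranges(atom_counts, first_atom=1):
    """Map residue position (1-based) to its (first_atom, last_atom) range.

    ASSUMPTION: atoms are numbered consecutively, residue by residue.
    """
    ranges = {}
    next_atom = first_atom
    for position, count in enumerate(atom_counts, start=1):
        if count <= 0:
            raise ValueError(f"residue {position} has no atoms")
        ranges[position] = (next_atom, next_atom + count - 1)
        next_atom += count
    return ranges


def same_as_statements(base, atom_counts):
    """Emit '<base>#aS-aE owl:sameAs <base>#rN' lines, one per residue."""
    ranges = residue_atom_ranges(atom_counts)
    return [
        f"{base}#a{start}-a{end} owl:sameAs {base}#r{position}"
        for position, (start, end) in sorted(ranges.items())
    ]


# Made-up atom counts for a three-residue fragment.
statements = same_as_statements("KEY", [10, 14, 7])
for line in statements:
    print(line)
```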
An Alternative Scheme
• We already have a simplified representation for biopolymers...
– Canonical sequence is represented by a string of single letter characters
  • DNA: ACGT
  • RNA: ACGU
  • Proteins: 20 amino acids (not B,J,O,U,X,Z)
– Modifications can be referred to with ChEBI/PSI-MOD ontology (e.g. Prolyl hydroxylated residue @ 402)
• Each (modified) residue must have its InChI description so as to capture explicit structural deviations (de-protonation, etc)
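A sequence-plus-modifications scheme lends itself to deterministic identifier generation: serialize the description in a fixed order, then hash it. The following is a sketch of that idea, not a proposed standard. The toy sequence, the modification label "hydroxy" (a real scheme would presumably use ChEBI or PSI-MOD term identifiers), the serialization format and the 16-character truncation are all assumptions, and at most one modification per position is allowed.

```python
import hashlib


def canonical_description(sequence: str, modifications: dict) -> str:
    """Serialise a sequence plus {position: modification} in a fixed order."""
    for position in modifications:
        if not 1 <= position <= len(sequence):
            raise ValueError(f"position {position} is outside the sequence")
    # Sorting by position makes the output independent of insertion order.
    mods = ";".join(
        f"{name}@{position}" for position, name in sorted(modifications.items())
    )
    return f"seq={sequence}|mods={mods}"


def form_identifier(sequence: str, modifications: dict) -> str:
    """Hash the canonical description into a short, fixed-length identifier."""
    description = canonical_description(sequence, modifications)
    return hashlib.sha256(description.encode("utf-8")).hexdigest()[:16]


toy_sequence = "MPLAPK"  # hypothetical six-residue peptide, not HIF1α
forms = {
    "unmodified": {},
    "hydroxy@2": {2: "hydroxy"},
    "hydroxy@5": {5: "hydroxy"},
    "hydroxy@2+5": {2: "hydroxy", 5: "hydroxy"},
}
identifiers = {
    label: form_identifier(toy_sequence, mods) for label, mods in forms.items()
}
for label, identifier in identifiers.items():
    print(f"{label:12s} {identifier}")
```

Each structurally distinct form gets its own identifier, and anyone holding the same description can regenerate it without consulting a database.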
PSI-MOD contains modified residues with links to structural descriptions
But what if we have a modification that isn’t contained in the ontology?
• No problem... define your own term, with the corresponding structural description (InChI, SMILES), and add it to an ontology document...
– If you’re using OWL, you can add the import statement and publish it.
• And, of course, you should submit it to the appropriate ontology development teams (and later make your term equivalent to theirs).
While we’re at it, we could extend our expressive capability to match that of OWL:
• Specification:
– Exactly mod1@posX
– Only mod1@posX
• Minimum:
– At least mod1@posX
• Combination:
– mod1@posX AND mod2@posY, X != Y
• Possibilities/Uncertainty:
– (mod1 OR mod2) @posX
• Exclusion:
– not mod1 @posX
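The constraint types listed above behave like composable predicates over a protein form. The sketch below is one possible reading, not the talk's syntax: a form is a dictionary from position to modification name, "only" is interpreted as "this modification and nothing else anywhere", and the combinator names and the "phospho" label are invented for the example.

```python
def has_mod(mod, pos):
    """At least: the form carries `mod` at `pos`; other positions are free."""
    return lambda form: form.get(pos) == mod


def only(mod, pos):
    """Exactly/only: `mod` at `pos` and no modification anywhere else."""
    return lambda form: form == {pos: mod}


def all_of(*patterns):
    """Combination: every pattern must hold (AND)."""
    return lambda form: all(p(form) for p in patterns)


def any_of(*patterns):
    """Possibilities/uncertainty: at least one pattern must hold (OR)."""
    return lambda form: any(p(form) for p in patterns)


def negate(pattern):
    """Exclusion: the pattern must not hold (NOT)."""
    return lambda form: not pattern(form)


# Forms are {position: modification}; an empty dict is the unmodified protein.
form_402 = {402: "hydroxyl"}
form_both = {402: "hydroxyl", 564: "hydroxyl"}

at_least_402 = has_mod("hydroxyl", 402)
only_402 = only("hydroxyl", 402)
both_sites = all_of(has_mod("hydroxyl", 402), has_mod("hydroxyl", 564))
either_at_564 = any_of(has_mod("hydroxyl", 564), has_mod("phospho", 564))
not_564 = negate(has_mod("hydroxyl", 564))

for name, form in [("P402hyd", form_402), ("P402hyd+P564hyd", form_both)]:
    print(name, at_least_402(form), only_402(form), both_sites(form),
          either_at_564(form), not_564(form))
```

The difference between "at least" and "only" is the difference between an open and a closed description of the form, which is why the doubly hydroxylated form satisfies the first but not the second.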
So what if... we describe the structural features of the molecule with OWL (sequence + PTMs), and generate an identifier from one of its serializations (RDF/XML?)
That way we have the explicit description as the identifier, in a form that is compatible with the semantic web.
Uniprot example revisited
Under normoxic conditions, HIF1α is hydroxylated on Pro-402 and Pro-564 in the oxygen-dependent degradation domain (ODD) by EGLN1/PHD1 and EGLN2/PHD2. The hydroxylated prolines promote interaction with VHL, initiating rapid ubiquitination and subsequent proteasomal degradation.
:A rdfs:subClassOf :Hydroxylation
:A hasParticipant (:0#r402 and :Substrate)
:A hasParticipant (:1#r402 and :Product)
:A hasParticipant (:5 and :Enzyme)
:B rdfs:subClassOf :Interaction
:B :hasParticipant (:2#r402 or :3#r564 or :4#r402,r564)
:B :hasParticipant (:6)
:1 (HIF1α)
:2 (HIF1α + P402hyd)
:3 (HIF1α + P564hyd)
:4 (HIF1α + P402hyd + P564hyd)
:5 (EGLN1)
:6 (VHL)
Please ignore the made up short-hand syntax!
Inferring Protein Participation
• OWL Role Chain
hasParticipant o isPartOf -> hasParticipant
If a process has the part as a participant, then the whole is also a participant.
:0#r402 :isPartOf :0
:1#r402 :isPartOf :1
:A rdfs:subClassOf :Hydroxylation
:A hasParticipant (:0#r402 and :Substrate)
:A hasParticipant (:1#r402 and :Product)
Inferred:
:A hasParticipant :0
:A hasParticipant :1
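The effect of the role chain can be reproduced with a few lines of forward chaining over plain (subject, object) pairs. This is a sketch of the inference only, not a reasoner: in OWL 2 the rule would be stated as a property chain axiom and applied by a reasoner, whereas here the function name `infer_participants` and the tuple encoding are assumptions made for illustration.

```python
def infer_participants(has_participant, is_part_of):
    """Apply hasParticipant o isPartOf -> hasParticipant to a fixed point.

    Both arguments are sets of (subject, object) pairs.
    """
    inferred = set(has_participant)
    changed = True
    while changed:
        changed = False
        for process, entity in list(inferred):
            for part, whole in is_part_of:
                # The process has the part as a participant, so the whole
                # becomes a participant as well.
                if part == entity and (process, whole) not in inferred:
                    inferred.add((process, whole))
                    changed = True
    return inferred


asserted = {(":A", ":0#r402"), (":A", ":1#r402")}
parts = {(":0#r402", ":0"), (":1#r402", ":1")}

result = infer_participants(asserted, parts)
for pair in sorted(result - asserted):
    print(pair[0], "hasParticipant", pair[1])
```

Looping until nothing changes also covers nested parts, for example an atom that is part of a residue that is part of a protein.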
Contextual, but non-structural considerations in identifier generation?
• Chemical?
– pH?
– Temperature?
– Environment (in vitro, in vivo, in silico)?
• Biological?
– Species?
– mRNA/Gene from which it was transcribed/encoded?
• Indirect Relationships?
– Point & Multiple Mutations?
– Alternative Splice Variants?
– Sequence Similarity?
Summary
• We need a precise method to generate identifiers for biopolymers and arbitrary sets of their parts.
• Consistent identifier generation will allow anybody to specify findings according to the biopolymers for which they were observed, whether or not those biopolymers exist in a database, and will allow us to link biochemical knowledge at finer levels of granularity.
• (At least) two identifier schemes were put forward to initiate discussion, with the goal of setting a standard naming convention.
dumontierlab.com
Special thanks to PhD Student Leonid Chepelev for insightful discussions
semanticscience.org