ChEBI, text mining and ontological best practice
Colin Batchelor, Royal Society of Chemistry
2008-05-19, batchelorc@rsc.org
What is text mining?
Marti Hearst, Berkeley: “Text Mining is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources.”
Can ChEBI help?
Overview
Reasoning
ChEBI as dictionary
Regular polysemy in chemistry
Some possible solutions
Reasoning
Reasoning is using the logical structure of an ontology to automatically infer facts about the world which have not been explicitly added by a human being.
Computers have no real-world knowledge beyond what we tell them.
Logical structure: properties of relations
We only have time to look at transitivity and is_a.
Smith et al., “Relations in Biomedical Ontologies”, Genome Biol., 2005, 6, R46.
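| Relation | Transitive | Symmetric | Reflexive | Anti-symmetric |
|----------|------------|-----------|-----------|----------------|
| is_a     | Yes        | No        | Yes       | Yes            |
| part_of  | Yes        | No        | Yes       | Yes            |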
ChEBI’s is_a is not transitive (1)
If a relation R is transitive, then:
If a R b and b R c, then a R c.
glutathione is_a cofactor
cofactor is_a biological role
therefore glutathione is_a biological role
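A minimal Python sketch of what a reasoner does with these two assertions, assuming is_a is treated as transitive. The function name `infer_is_a` is illustrative and not part of any ChEBI tool. It derives the conclusion above mechanically, with no way of knowing that a compound is not a role.

```python
from collections import defaultdict


def infer_is_a(asserted):
    """Return every (term, ancestor) pair implied by a transitive is_a."""
    parents = defaultdict(set)
    for child, parent in asserted:
        parents[child].add(parent)

    inferred = set()
    for start in list(parents):
        # Walk upwards from each term, collecting everything reachable.
        stack = list(parents[start])
        seen = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(parents.get(node, ()))
        inferred.update((start, ancestor) for ancestor in seen)
    return inferred


# The two assertions from the slide.
asserted = [
    ("glutathione", "cofactor"),
    ("cofactor", "biological role"),
]
closure = infer_is_a(asserted)
print(("glutathione", "biological role") in closure)
```

The point of the sketch is that the unwanted pair appears purely as a consequence of the asserted links; the only way to stop it is to change what is asserted.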
ChEBI’s is_a is not transitive (2)
water is_a amphiprotic solvent
amphiprotic solvent is_a protophilic solvent (*)
protophilic solvent is_a Bronsted base (*)
Bronsted base is_a base
base is_a biological role
therefore water is_a base
therefore water is_a biological role
* how come “protophilic solvent” and “Bronsted base” only have one child each?
ChEBI’s is_a is not transitive (3)
N-hydroxy-L-aspartic acid is_a hydroxamic acids
hydroxamic acids is_a organic functional classes
therefore N-hydroxy-L-aspartic acid is_a organic functional classes
is_a has many meanings!
1. An amount of a compound has a biological role: tris is_a buffer.*
2. An amount of a compound has an application: sodium dodecyl sulfate is_a detergent.*
3. A less-abstract type is an example of a more abstract type: propane is_a alkanes.
4. ?!: metals is_a atoms.*
* Not a property of a lone atom or molecule!
Computers need facts about the world, not about ChEBI curation
ChEBI as dictionary
Evaluating name–structure conversion with ChEBI
ChEBI release 37 (26 September 2007) contains 12688 annotated entities, of which 8486 have InChI strings.
We use OSCAR3 (oscar3-chem.sourceforge.net) for name–structure conversion.
We convert chebi.obo to an XML file, each paragraph containing either a ChEBI name or an IUPAC name.
The layered structure of the InChI lets us give partial credit for incomplete matches.
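A rough Python sketch of layer-wise InChI comparison, to illustrate how partial credit might be assigned. It assumes that the fixed-hydrogen layer is everything from `/f` onwards and that the stereo layers are those prefixed `b`, `t`, `m` and `s`. The function names and the L-alanine example strings are illustrative; this is not the evaluation code used for the results that follow.

```python
STEREO_PREFIXES = ("b", "t", "m", "s")


def drop_fixed_h(inchi):
    """Remove the fixed-hydrogen layer and everything nested after it."""
    return inchi.split("/f", 1)[0]


def drop_stereo(inchi):
    """Remove double-bond and tetrahedral stereo layers wherever they occur."""
    layers = inchi.split("/")
    kept = [layer for layer in layers if not layer.startswith(STEREO_PREFIXES)]
    return "/".join(kept)


def compare(reference, candidate):
    """Report which of three comparisons the candidate passes."""
    return {
        "exact": reference == candidate,
        "ignoring_stereo": drop_stereo(reference) == drop_stereo(candidate),
        "ignoring_fixed_h": drop_fixed_h(reference) == drop_fixed_h(candidate),
    }


# Illustrative strings for L-alanine, with and without a fixed-H layer.
reference = "InChI=1/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1/f/h5H"
candidate = "InChI=1/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
print(compare(reference, candidate))
```

Each comparison is reported independently here; how the categories should nest when counting matches is a separate design choice.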
Results: IUPAC names
Total 8447
Identified as chemical 8255 (97.73%)
With InChI (upper bound) 1810 (21.43%)
Matching InChI, disregarding fixed hydrogen layer 1734 (20.53%)
Matching InChI, disregarding stereo 1176
Matching InChI, exact (lower bound) 1174 (13.90%)
Not all of name matched 1024
Name identified as two or more separate names 974 (11.53%)
Results: ChEBI names
Total 8146
Identified as chemical 7173 (88.06%)
With InChI (upper bound) 1036 (12.72%)
Matching InChI, disregarding fixed hydrogen layer 953 (11.70%)
Matching InChI, disregarding stereo 637
Matching InChI, exact (lower bound) 628 (7.71%)
Not all of name matched 764
Name identified as two or more separate names 373 (4.58%)
Regular polysemy
… where words stand for multiple things in a consistent way.
Examples:
Brand names
Grinding
Figure–ground
Exact–class–part polysemy in chemistry
Peter Corbett, Colin Batchelor and Ann Copestake (2008), “Pyridines, pyridine and pyridine rings”, Proc. BERBMTM08 at LREC 2008, Marrakech, Morocco.
Regular polysemy
Brand names
“Learning to buy a Renault and talk to BMW”
Grinding
“The squirrel scampered down the path and kept stopping and looking at the officers to check they were behind”
vs.
“[…] the trick was to serve squirrel fresh and not to leave it hanging like other game”
Regular polysemy
Figure–ground
Audrey Hepburn painted the door (figure)
Audrey Hepburn walked through the door (ground)
The Incredible Hulk walked through the door (ambiguous)
Methyl, the radical (exact)
Methyl, the group (part)
Can ChEBI handle methyl?
methyl group (CHEBI:32875) YES
methyl radical (CHEBI:29309) YES
Imidazole (exact)
An imidazole (class)
imidazole side-chain/group/ring (part)
Can ChEBI handle imidazole?
imidazoles (CHEBI:24780) YES
imidazole (CHEBI:16069) YES
imidazole ring not yet
imidazolyl group not yet
Mapping exact, class and part to entries in ChEBI
Tests:
1. Has InChI: exact
2. Name is plural: class
3. Ends in –yl, “group” or “residue”: part
Test 2 doesn’t work for applications or roles.
Test 3 is brittle.
I would much rather use the logical structure of the ontology.
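The three tests above can be sketched in a few lines of Python, which also makes their fragility easy to see. The function name `classify_entry`, the naive trailing-"s" plural check and the `has_inchi` flags in the examples are assumptions for illustration, not values looked up in ChEBI.

```python
def classify_entry(name, has_inchi):
    """Apply the three surface tests in the order listed above."""
    lowered = name.lower()
    if has_inchi:
        return "exact"   # test 1: the entry has an InChI
    if lowered.endswith("s"):
        return "class"   # test 2: crude plural check on the name
    if lowered.endswith(("yl", "group", "residue")):
        return "part"    # test 3: suffix check
    return "unknown"


# Hypothetical inputs; the has_inchi flags are assumed, not taken from ChEBI.
for name, has_inchi in [
    ("imidazole", True),
    ("imidazoles", False),
    ("methyl group", False),
    ("buffer", False),
]:
    print(name, "->", classify_entry(name, has_inchi))
```

A role such as "buffer" falls through every test, and any singular name that happens to end in "s" would be misread as a class, which is the kind of failure that string tests cannot avoid.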
Some possible solutions
Some possible solutions (1)
ChEBI must represent facts about the world rather than about itself.
Examples: If unclassified compounds have a structure, they should be in the molecular structure tree rather than the unclassifieds tree.
“organic functional classes” is a tool for assigning nomenclature. No chemical compound is an “organic functional class”.
Some possible solutions (2)
ChEBI must distinguish between what is always true and what is only sometimes true.
Example: Replace some is_a relationships with has_biological_role and has_application.
We need ChEBI to represent parts of molecules that aren’t substituents. They should all be descendants of molecular part (a new term), as should amino acid residues and nucleoside residues.
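As a sketch of the first proposal on this slide, the Python below stores each link with an explicit relation name and follows only is_a when computing ancestors. The edge list and the helper names `is_a_ancestors` and `biological_roles` are illustrative assumptions; the relation names are those suggested above.

```python
def is_a_ancestors(term, edges):
    """Collect ancestors of a term by following is_a links only."""
    parents = {}
    for subject, relation, obj in edges:
        if relation == "is_a":
            parents.setdefault(subject, set()).add(obj)
    found, stack = set(), [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def biological_roles(term, edges):
    """Return roles asserted directly with has_biological_role."""
    return {obj for subject, relation, obj in edges
            if subject == term and relation == "has_biological_role"}


# Hypothetical re-typing of links mentioned earlier in the talk.
edges = [
    ("glutathione", "has_biological_role", "cofactor"),
    ("cofactor", "is_a", "biological role"),
    ("propane", "is_a", "alkanes"),
]
print(is_a_ancestors("glutathione", edges))
print(biological_roles("glutathione", edges))
```

With the role link typed separately, transitive reasoning over is_a no longer concludes that glutathione is a biological role, while the role itself remains queryable.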
Questions?