Evaluating Patent Full Text Documents with Chemical Ontologies
OntoChem IT Solutions GmbH, Blücherstr. 24, 06120 Halle (Saale), Germany
Tel. +49 345 4780472, Fax: +49 345 4780471, mail: info(at)ontochem.com
• spin-out from OntoChem GmbH
• started 1 July 2015
• 15 chemists, bioinformaticians, biologists, linguists, pharmacists
• extracting knowledge from documents, selling software & services
What are Ontologies?
A computer-readable, formal representation of knowledge.
Ontologies describe relationships between knowledge concepts:
aspirin „is a“ benzoic acid „is a“ carboxylic acid
acetyl salicylic acids
They can be used to infer, extract, search, sort and analyse knowledge.
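The „is a“ relationships can be followed transitively to infer facts that are never stated directly. A minimal Python sketch, assuming a tiny hand-written hierarchy built from the chain on the slide; the `IS_A` dictionary and both function names are illustrative and not part of any OntoChem software:

```python
# Toy "is a" hierarchy. Concept names follow the slide; the data
# structure and the function names are assumptions for illustration.
IS_A = {
    "aspirin": ["benzoic acid"],
    "benzoic acid": ["carboxylic acid"],
    "carboxylic acid": [],
}

def ancestors(concept, hierarchy=IS_A):
    """Return every concept reachable from `concept` via "is a" edges."""
    found = set()
    stack = list(hierarchy.get(concept, []))
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(hierarchy.get(parent, []))
    return found

def is_kind_of(concept, cls):
    """True if `concept` equals `cls` or has `cls` among its ancestors."""
    return concept == cls or cls in ancestors(concept)

print(sorted(ancestors("aspirin")))
print(is_kind_of("aspirin", "carboxylic acid"))
```

A search for "carboxylic acid" can then also return documents that only mention aspirin, which is the kind of inference the slide refers to.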
Chemical Ontologies...
ChEBI (Chemical Entities of Biological Interest), https://www.ebi.ac.uk/chebi/, has about 40,000 manually classified compounds.
MeSH (Medical Subject Headings) ... PubChem
Classifying Chemistry: SODIAC
SODIAC (Structure based Ontology Development and Individual Assignment Center): automated compound classification software
• ontology editor, OBO specification conformity
• definition of compound classes via SMARTS
• chemical structure editor
• sub-structure AND, OR and NOT logic for compound-to-class assignment
• chemistry error detection
• chemical hierarchy construction
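The AND, OR and NOT logic for class assignment can be sketched in a few lines of Python. This is only an illustration of the boolean combination: sub-structure matching is replaced by a hard-coded set of fragment labels per compound (a real system would obtain these from SMARTS matches), and the class definitions are invented, not taken from SODIAC.

```python
# Each compound is represented by the set of fragment labels found in it.
# In a real system these would come from SMARTS sub-structure matches;
# here they are hypothetical, hard-coded inputs.

def has(fragment):
    return lambda fragments: fragment in fragments

def AND(*rules):
    return lambda fragments: all(rule(fragments) for rule in rules)

def OR(*rules):
    return lambda fragments: any(rule(fragments) for rule in rules)

def NOT(rule):
    return lambda fragments: not rule(fragments)

# Hypothetical class definitions (not SODIAC's own).
CLASSES = {
    "carboxylic acids": has("carboxylic acid group"),
    "acids with benzene ring": AND(has("carboxylic acid group"),
                                   has("benzene ring")),
    "acids without benzene ring": AND(has("carboxylic acid group"),
                                      NOT(has("benzene ring"))),
    "esters or amides": OR(has("ester group"), has("amide group")),
}

def assign_classes(fragments, classes=CLASSES):
    """Return the names of all classes whose rule accepts the compound."""
    return sorted(name for name, rule in classes.items() if rule(fragments))

aspirin = {"carboxylic acid group", "benzene ring", "ester group"}
print(assign_classes(aspirin))
```

Because rules are plain functions, they nest freely, so a class such as the Vitamin C derivatives could be written as an OR over several tautomeric forms, each of which is an AND of its sub-structures.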
Classifying Chemistry: SODIAC
AND/OR logic to assign Vitamin C derivatives:
• described in different tautomeric forms in databases
• logic needed for classifying correct stereochemistry in substituted compounds
Concept: Vitamin C derivatives, defined by sub-structures combined with AND and OR
Classifying Chemistry: not straightforward...
Structural chemical ontologies are often not based on sub-structures!
• Progesterone; class: Gestagens
• 19-Norprogesterone (4-8 times more active); class: Gestagens>Progestins
• Pregnane (female hormones); class: Gonans>Pregnans
• Androstane (male hormones); class: Gonans>Estrans
• DrugBank & ChEBI: progestin, a synthetic progestogen
• ChEBI: corticosteroid hormone
Diagram labels: parent & SSS; not parent but SSS (twice); same family; different family
Chemistry Ontologies
Organic chemistry
7,586 class concepts, 29,709 class terms
3,185 concepts linked to ChEBI concepts
2,465 concepts linked to MeSH concepts
68 million concepts linked to PubChem
Inorganic materials
52.4209 concepts, 56,332 terms
Groups-substituents-fragments
4,428 concepts, 12,754 terms
Substances
989 concepts, 3,522 terms
Polymers
2,361 concepts, 7,176 terms
Classifying Chemistry: Example
Acetylsalicylic acid (SODIAC v2.5.2)
CHEBI:15365; MeSH:D001241
Direct Parents:
aromatic compounds, benzenes, carbon compounds, carboxylic acids,
ethanoic acid esters, methyl esters, monocyclic compounds, oxygen compounds,
salicylic acid derivatives
bioavailable molecules, hydrophilic molecules, lead like molecules, lipinski molecules, small molecules
Ancestors:
6-membered carbocycles, 6-membered cyclic compounds, acetic acid derivatives, acids,
carbocycles, carbon group compounds, carbonyl compounds, carboxylic acid derivatives,
carboxylic acid esters, chalcogen compounds, cyclic compounds, esters, fatty acyls,
fatty esters, lipids, monocarboxylic acid derivatives, monocyclic carbocycles, organic acids,
organic compounds, organic esters, salicylic acid derivatives, short chain fatty acid esters
Basic Biology Ontologies
Genes, Proteins & Peptides
annotation version: 708,141 concepts, 2,627,612 terms
classification version: 832,902 concepts, 3,177,057 terms
with linkouts to GO, InterPro, HomoloGene, HUGO, KEGG, UniProt ...
Diseases
SNOMED-CT, MedDRA, ICD-9, ICD-10, HDO, UMLS, LOINC, MeSH
annotation version: 105,824 concepts, 360,077 terms
Species
based on NCBI, GRIN, IPNI, Cornucopia, World Economic Plants ...
annotation version: 1,012,634 concepts, 1,664,042 terms
Anatomy
different species and stage dependent ontologies available
general anatomy: 4,773 concepts, 19,450 terms
Other Biology Ontologies
Cell lines
5,566 concepts, 13,083 terms
Cosmetology
1,187 concepts, 2,017 terms
Effects
35,477 concepts, 111,012 terms
Nutrition
19,193 concepts, 115,699 terms
Physiology
533 concepts, 619 terms
Toxicology
1,019 concepts, 2,150 terms
Other Ontologies
Countries
annotation version: 245 concepts, 85,069 terms
Companies
annotation version: 26,388 concepts, 5,757 terms
Material properties
annotation version: 1,081 concepts, 2,428 terms
Methods
annotation version: 2,502 concepts, 10,053 terms
Regions & Geopolitics
annotation version: 3,774 concepts, 13,356 terms
Relations
annotation version: 603 concepts, 2,290 syntaxes
General Ontologies
Wikipedia
annotation version: 5,200,842 concepts, 11,490,831 terms
Magnitudes & Units
annotation version: 228 concepts, 510 terms
Persons
annotation version: >1,000,000 persons
Relations
annotation version: 603 concepts, 2,290 syntaxes
Understanding Patents with Ontologies
NLP for patents poses some unique challenges:
• multilingual
• poor OCR (optical character recognition)
• multi-disciplinary
• many: >90 million full text documents from >110 patent offices
• large: up to 500 pages, with sentences spanning >20 pages
• obscure: hand drawings, unclear language
Understanding Patents
Collaboration with infoapps GmbH (Munich)
Standard full text data
US, EP, DE, WO, AT, CH, BE, CA, ES, FR, GB, MA.
Standard full text data
AR, BR, CN, DK, FI, ID, EI, EN, JP, KR, MX, MY, NL, NO, RU, SE, TH, TW, VN.
Original full text data
Machine/human translation (EN)
AR, AT, BE, BR, CA, CH, CN, DE, DK, EP, ES, FI, FR, ID, JP, KR, MX, NL, NO, RU, SE, TH, TW, VN, WO.
OCMiner® UIMA Pipeline
• identify document type
• readers: picture PDF (OCR), text PDF reader, XML doc (XML reader), Office doc (Office reader)
• document classifier, XML detagger, language detector
• normalize text, tokenize text, acronym/abbrev detector, person annotator, document structure
• domain annotators 1…n (at several points in the pipeline)
• chemistry annotator: dictionary, name-2-structure, formula & molpuzzler, class/group resolution, cleanup & rule combiner
• coordinated entity resolution, context handler, NE confidence
• relationship extraction
• consumers: BRAT, index, XML
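The pipeline idea, a document passed through a fixed sequence of stages that each add annotations, can be illustrated with a short Python sketch. It does not use the UIMA API and is not OCMiner code; the three stage functions and the `doc` dictionary layout are assumptions chosen to mirror the "normalize text", "tokenize text" and "acronym/abbrev detector" boxes.

```python
import re
import unicodedata

def normalize_text(doc):
    # Unicode normalization plus collapsing of runs of whitespace.
    text = unicodedata.normalize("NFKC", doc["text"])
    doc["text"] = re.sub(r"\s+", " ", text).strip()
    return doc

def tokenize_text(doc):
    # Very rough tokenizer: word characters, or single punctuation marks.
    doc["tokens"] = re.findall(r"\w+|[^\w\s]", doc["text"])
    return doc

def detect_acronyms(doc):
    # Records upper-case short forms given in parentheses, e.g. "(OCR)".
    doc["acronyms"] = re.findall(r"\(([A-Z]{2,})\)", doc["text"])
    return doc

# Stage order matters: later stages rely on the output of earlier ones.
PIPELINE = [normalize_text, tokenize_text, detect_acronyms]

def run_pipeline(text, stages=PIPELINE):
    doc = {"text": text}
    for stage in stages:
        doc = stage(doc)
    return doc

doc = run_pipeline("Optical  character\nrecognition (OCR) is error-prone.")
print(doc["tokens"])
print(doc["acronyms"])
```

Domain annotators would be further functions appended to `PIPELINE`, and consumers would read the finished `doc` to write BRAT files, an index or XML.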
Regular Names in Patents
BRAT (Goran Topić) file example:
Akhondi SA, Klenner AG, Tyrchan C, Manchala AK, Boppana K, Lowe D, Zimmermann M, Jagarlapudi SA, Sayle R, Kors JA, Muresan S. Annotated chemical patent corpus: a gold standard for text mining. PLoS One. 2014 Sep 30;9(9):e107477. doi: 10.1371/journal.pone.0107477. eCollection 2014.
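For readers unfamiliar with BRAT files: annotations are stored in a standoff `.ann` file, where each text-bound annotation is a tab-separated line of the form `ID<TAB>TYPE START END<TAB>TEXT`. The Python sketch below parses such lines; the sample content and the labels `Chemical`, `Disease` and `Treats` are invented for illustration and are not taken from the cited corpus.

```python
from typing import NamedTuple

class TextBound(NamedTuple):
    ann_id: str
    label: str
    spans: list   # list of (start, end) character offsets
    text: str

def parse_ann(ann_content):
    """Parse the text-bound ("T") lines of a brat standoff .ann file."""
    entities = []
    for line in ann_content.splitlines():
        if not line.startswith("T"):
            continue  # relations, events, notes etc. are skipped here
        ann_id, type_and_offsets, text = line.split("\t", 2)
        label, offsets = type_and_offsets.split(" ", 1)
        # Discontinuous annotations separate their fragments with ";".
        spans = [tuple(int(x) for x in part.split())
                 for part in offsets.split(";")]
        entities.append(TextBound(ann_id, label, spans, text))
    return entities

# Invented annotations for the sentence "aspirin relieves headache".
SAMPLE = (
    "T1\tChemical 0 7\taspirin\n"
    "T2\tDisease 17 25\theadache\n"
    "R1\tTreats Arg1:T1 Arg2:T2\n"
)

entities = parse_ann(SAMPLE)
for entity in entities:
    print(entity)
```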
Other Name Types in Patents
• Chemical compound: 5,7-bis(trifluoromethyl)-pyrazolo[1,5-a]pyrimidine-2-carbonitrile
• Chemical class: pyrazolo[1,5-a]pyrimidines
• Chemical substituent + class: 2-Bromo-, 2-fluoro-, and 2-chloro pyrazolo[1,5-a]pyrimidines
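The third name type is a coordinated expression: several substituent prefixes share one class name. A minimal Python sketch of how such a phrase could be expanded into individual names, assuming the class name is the last whitespace-separated token and prefixes are separated by a comma plus space or by "and"/"or". The function name and the regular expression are illustrative, not the OCMiner implementation:

```python
import re

def expand_coordinated(phrase):
    """Expand 'A-, B-, and C class' into ['a-class', 'b-class', 'c-class']."""
    # Assumption: the shared class name is the final token of the phrase.
    prefixes, class_name = phrase.rsplit(None, 1)
    # Split on ", " (optionally followed by and/or) or on a bare and/or.
    # A comma without a following space, as in locants like "1,5", is kept.
    parts = re.split(r",\s+(?:and\s+|or\s+)?|\s+(?:and|or)\s+", prefixes)
    names = []
    for part in parts:
        substituent = part.strip().rstrip("-").lower()
        if substituent:
            names.append(f"{substituent}-{class_name}")
    return names

names = expand_coordinated(
    "2-Bromo-, 2-fluoro-, and 2-chloro pyrazolo[1,5-a]pyrimidines"
)
for name in names:
    print(name)
```

Each expanded name can then be handled like a regular class name. Phrases whose class name itself contains spaces would need a dictionary or grammar instead of the last-token assumption.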
Named Entities in Patents
Extracting named entities (NE) from infoapps patents: from 19 million patents with chemistry, 4.7 million patents from 2001-2010 (publication year) were selected.

Ontology                Term annotation count   Unique concepts per doc   Unique concepts
Chemistry               1,465,510,682           294,771,572               ?
Proteins                204,902,329             30,167,344                67,993
Anatomy non-plants      126,856,048             21,192,154                2,378
Methods                 112,230,880             21,725,977                1,959
Species                 105,618,715             25,901,359                81,036
Diseases                82,857,385              24,592,233                21,367
Physiology              68,504,035              12,703,542                497
Nutrition               59,367,731              12,839,777                3,861
Cosmetology             23,465,151              4,883,741                 920
Anatomy plants, fungi   22,326,124              4,212,548                 802
Cell lines              9,857,621               2,325,743                 2,079
Toxicity                7,986,832               2,858,977                 423
Species plants, fungi   7,444,143               2,345,605                 7,347
Regions                 6,974,421               2,781,913                 1,040
Herbal drugs            162,729                 46,830                    131
Ontologies – Why?
3 reasons:
• patent claims are „ontological“
• background knowledge helps to extract the meaning of named entities
• end users work with knowledge classifications, e.g. which natural product compound class is useful to treat inflammation of the skin?
Chemical Ontologies – Why?
Patent claims are “ontological”
Patent classes & ad hoc classes:
• e.g. chemical: „compounds according to claim 1“, „acyl-pyrrolopyridines“, any Markush structure, patent classes etc.
• e.g. uses: „anti-infectives“ (e.g. antibacterial, antiviral, antiparasitic ...)
Chemical Ontologies – Why?
Ontology based NLP to extract the meaning of named entities
• ontology based context sensitive Named Entity resolution
...glucose... ...glucose oxidase... ...glucose oxidase activity...
finally: ...inhibitor of glucose oxidase activity...
• ontology based anaphora & cataphora resolution
Tetrahydrofuran is a commonly used solvent in organic ...
This cyclic ether has a melting point of -108.4 °C
• ontology based fingerprints
classifying documents, e.g. into patent classes
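The first two bullets can be illustrated with a small Python sketch. Context sensitive resolution is approximated here by preferring the longest dictionary term at a position, and anaphora resolution by linking a class mention such as "this cyclic ether" to the most recent earlier entity that belongs to that class. The dictionary, the ancestor table and both function names are hypothetical, and matching ignores word boundaries for brevity.

```python
# Hypothetical mini-dictionary: term -> concept type.
DICTIONARY = {
    "glucose": "compound",
    "glucose oxidase": "protein",
    "glucose oxidase activity": "effect",
    "inhibitor of glucose oxidase activity": "compound class",
}

# Hypothetical ancestor sets, as an ontology would provide them.
ANCESTORS = {
    "tetrahydrofuran": {"cyclic ether", "ether"},
    "ethanol": {"alcohol"},
}

def longest_match(text, start=0, dictionary=DICTIONARY):
    """Return the longest dictionary term beginning at `start`, or None."""
    best = None
    for term in dictionary:
        if text.startswith(term, start):
            if best is None or len(term) > len(best):
                best = term
    return best

def resolve_anaphor(class_name, earlier_mentions, ancestors=ANCESTORS):
    """Return the most recent earlier mention that is a `class_name`."""
    for mention in reversed(earlier_mentions):
        if class_name in ancestors.get(mention, set()):
            return mention
    return None

print(longest_match("glucose oxidase activity was measured"))
print(resolve_anaphor("cyclic ether", ["tetrahydrofuran", "ethanol"]))
```

With the longest match, "glucose" inside "glucose oxidase activity" is not annotated as a compound, and "this cyclic ether" is linked to tetrahydrofuran rather than to ethanol.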
Understanding Patent Claims Logic
high quality patent annotations need:
• annotated text corpus “Gold Set”
• background ontologies
Annotated between <chemistry> & <disease>: p=is_Active_Part_Of, i=is_Instance_Of.
Antje Schlaf, Claudia Bobach, Matthias Irmer: Creating a Gold Standard Corpus for the Extraction of Chemistry-Disease Relations from Patents. LREC 2014.
End User: Patent Big Data Analytics
Hot Compounds, hot targets ?
L. Weber, T. Böhme, M. Irmer, Ontology-based content analysis of US patent applications from 2001–2010, Pharm. Pat. Analyst 2013, 2.
End User: Patent Big Data Analytics
enrichment factors for chemistry related diseases...

Columns (disease classes): CV = cardiovascular system, MH = disease of mental health, Met = disease of metabolism, Resp = respiratory system, Nerv = nervous system, MSk = musculo-skeletal system, Repr = reproductive system, GI = gastro-intestinal system, Imm = immune system, Endo = endocrine system

Chemistry concept                                       CV    MH   Met  Resp  Nerv   MSk  Repr    GI   Imm  Endo
prostaglandin F2β derivatives                          557     0     0     0   607   427     0     0   375     0
hallucinogens                                          494  1922   332   449   538   364  3146   622   199  1901
cichoric acid                                          821  1662   432  1625   509   652 11623  1480   604  7239
alpha 1-adrenoceptor agonist                           821     0   267  1736   501   611  8684  1014   543  5636
pregn-4,9(11)-enes                                     398   256   231   450   491   386     0   467   317  1296
canrenoic acids                                        771  1343   425  1180   473   534  8474  1260   459  4960
aconitane derivatives                                    0  1785   205     0   458   257     0     0     0     0
pseudoalkaloid derivatives                               0  1778   204     0   456   256     0     0     0     0
diterpene alkaloid derivatives                           0  1778   204     0   456   256     0     0     0     0
13,14-dihydro-15-keto-prostaglandin D2 derivatives     651     0   213  1831   447   482     0  1188   521  3956
ripisartan derivatives                                 953     0   351     0   436   411     0     0   409     0
potassium-sparing diuretics                            896  1387   399  1156   425   496  6456  1218   501  3863
steroid acids                                          692  1193   379  1046   423   485  7578  1132   412  4418
Milfasartan                                            926     0   304     0   407   414     0   917   404     0
pyrrolizidine alkaloids                                453  1041   293  1264   407   464     0  1081   498     0
milfasartan derivatives                                930     0   303     0   406   416     0   913   402     0
Pratosartan                                            695   929   450   523   394   240  2747   794   246  2800
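The slides do not define how the enrichment factors were calculated. One common definition, used here purely as an assumption, divides the observed share of a concept's documents that also mention a disease class by the share expected if the two were independent. The Python sketch below uses that definition with invented counts; it does not reproduce the numbers in the table.

```python
def enrichment_factor(n_both, n_concept, n_disease, n_total):
    """Observed co-occurrence rate divided by the rate expected by chance.

    n_both    -- documents mentioning both the concept and the disease class
    n_concept -- documents mentioning the chemistry concept
    n_disease -- documents mentioning the disease class
    n_total   -- all documents in the corpus
    """
    if min(n_concept, n_disease, n_total) <= 0:
        raise ValueError("counts must be positive")
    observed = n_both / n_concept    # share of concept docs with the disease
    expected = n_disease / n_total   # share of all docs with the disease
    return observed / expected

# Invented counts, for illustration only.
ef = enrichment_factor(n_both=50, n_concept=200,
                       n_disease=10_000, n_total=4_000_000)
print(round(ef, 1))
```

A value above 1 means the concept appears together with the disease class more often than chance would suggest; a value of 0 means they never co-occur.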
End User: Online Database ChemAnalyser
ChemAnalyser – Structure
ChemAnalyser – Full text & ontology based semantic searching
ChemAnalyser – Organic chemistry & drug discovery
ChemAnalyser – Alloys & Inorganic Materials
ChemAnalyser – Cosmetics & Nutrition
ChemAnalyser – Polymers
ChemAnalyser – Reach Report Support
Thanks!
Please register at
www.chemanalyser.com
for more information and a free trial.