Literature-Based Knowledge Discovery using Natural Language Processing
Dimitar Hristovski,1 PhD, Carol Friedman,2 PhD, Thomas C Rindflesch,3 PhD, Borut Peterlin,4 MD PhD
1 Institute of Biomedical Informatics, Medical Faculty, University of Ljubljana, Slovenia
2 Department of Biomedical Informatics, Columbia University, New York
3 National Library of Medicine, Bethesda, Maryland
4 Division of Medical Genetics, UMC, Slajmerjeva 3, Ljubljana, Slovenia
e-mail: [email protected]
Part 1: Co-occurrence based LBD
Motivation
• Overspecialization
• Information overload
• Large databases
• Need and opportunity for computer-supported knowledge discovery
Literature-based Discovery (LBD)
• A method for automatically generating hypotheses (discoveries) from literature
• Hypotheses have the form: Concept1 –Relation– Concept2
• Example: Fish oil –Treats– Raynaud’s disease
Background
• Swanson’s LBD paradigm:
  – Concept X (Disease), e.g. Raynaud’s
  – Concepts Y (Pathological or Cell Function, …), e.g. Blood viscosity
  – Concepts Z (Drugs, …), e.g. Fish oil
  – New relation between X and Z? e.g. Treats
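
The X, Y, Z paradigm can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `open_discovery`, the pair-based input format, and the placeholder concept "Known drug" are hypothetical and do not come from the slides. Only the Raynaud's, blood viscosity and fish oil concepts are taken from the example above.

```python
from collections import defaultdict

def open_discovery(x, known_pairs):
    """Map each candidate Z to the set of intermediate concepts Y linking it to x.

    A concept Z is a candidate when it shares at least one Y with x
    but is not itself directly related to x in known_pairs.
    """
    neighbours = defaultdict(set)
    for a, b in known_pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)
    direct = set(neighbours[x])          # concepts already related to x
    candidates = defaultdict(set)
    for y in direct:
        for z in neighbours[y]:
            if z != x and z not in direct:
                candidates[z].add(y)     # z is reachable only through y
    return dict(candidates)

# Toy relations, hand-made for illustration (not extracted from Medline).
known = [
    ("Raynaud's", "Blood viscosity"),
    ("Blood viscosity", "Fish oil"),
    ("Raynaud's", "Known drug"),         # hypothetical, already related to X
    ("Known drug", "Blood viscosity"),
]
hypotheses = open_discovery("Raynaud's", known)
print(hypotheses)
```

Concepts already related to X are skipped, so only relations that are not yet in the input are proposed.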
Biomedical Discovery Support System (BITOLA)
• Goal:
  – discover potentially new relations (knowledge) between biomedical concepts
  – to be used as research idea generator and/or as
  – an alternative way to search Medline
• System user (researcher or intermediary):
  – interactively guides the discovery process
  – evaluates the proposed relations
Extending and Enhancing Literature Based Discovery
• Goal:
  – Make literature based discovery more suitable for disease candidate gene discovery
  – Decrease the number of candidate relations
• Method:
  – Integrate background knowledge:
    • Chromosomal location of diseases and genes
    • Gene expression location
    • Disease manifestation location
System Overview
[Diagram components: Databases (Medline, LocusLink, HUGO, OMIM, …); Knowledge Extraction; Knowledge Base (Concepts, Association Rules, Background Knowledge (Chromosomal Locations, …)); Discovery Algorithm; User Interface]
Terminology Problems during Knowledge Extraction
• Gene names
• Gene symbols
• MeSH and genetic diseases
Detected Gene Symbols by Frequency
• type|666548
• II|552584
• III|201776
• component|179643
• CT|175973
• AT|151337
• ATP|147357
• IV|123429
• CD4|99657
• p53|89357
• MR|88682
• SD|85889
• GH|84797
• LPS|68982
• 59|67272
• E2|64616
• 82|63521
• AMP|61862
• TNF|59343
• RA|58818
• CD8|57324
• O2|56847
• ACTH|54933
• CO2|53171
• PKC|51057
• EGF|50483
• T3|49632
• MS|46813
• A2|44896
• ER|43212
• upstream|41820
• PRL|41599
Gene Symbol Disambiguation
• Find MEDLINE docs in which we can expect to find gene symbols
• Example of false positive:
  – Ethics in a twist: "Life Support", BBC1. BMJ 1999 Aug 7;319(7206):390
  – breast basic conserved 1 (BBC1) gene vs. BBC1 television station featuring new drama series Life Support
Binary Association Rules
• X → Y (confidence, support)
• If X Then Y (confidence, support)
• Confidence = % of docs containing Y within the X docs
• Support = number (or %) of docs containing both X and Y
• The relation between X and Y is not known.
• Examples:
  – Multiple Sclerosis → Optic Neuritis (2.02, 117)
  – Multiple Sclerosis → Interferon-beta (5.17, 300)
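
The confidence and support definitions above can be made concrete with a small sketch. It assumes each document has already been reduced to a set of concepts. The function name `binary_rules` and the five toy documents are hypothetical, and the resulting numbers are not the Medline figures quoted on the slide.

```python
from collections import Counter
from itertools import permutations

def binary_rules(docs):
    """Return {(x, y): (confidence_percent, support)} for every rule x -> y.

    support    = number of docs containing both x and y
    confidence = support as a percentage of the docs containing x
    """
    concept_count = Counter()
    pair_count = Counter()
    for concepts in docs:
        concept_count.update(concepts)
        # every ordered pair (x, y) with x != y that co-occurs in this doc
        pair_count.update(permutations(sorted(concepts), 2))
    return {
        (x, y): (100.0 * n / concept_count[x], n)
        for (x, y), n in pair_count.items()
    }

# Toy corpus: each document is the set of concepts found in it (illustrative only).
docs = [
    {"Multiple Sclerosis", "Optic Neuritis"},
    {"Multiple Sclerosis", "Interferon-beta"},
    {"Multiple Sclerosis", "Interferon-beta", "Optic Neuritis"},
    {"Multiple Sclerosis"},
    {"Optic Neuritis"},
]
rules = binary_rules(docs)
for (x, y), (conf, supp) in sorted(rules.items()):
    print(f"{x} -> {y} ({conf:.2f}, {supp})")
```

Support is symmetric but confidence is not, which is why the rule X → Y and the rule Y → X are stored separately.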
Discovery Algorithm
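[Diagram: Concept X (Disease) is linked to Concepts Y (Pathological or Cell Function, …), which are linked to Concepts Z (Genes). Candidate Gene? The Chromosomal Region of the disease must match the Chromosomal Location of the gene, and the Manifestation Location of the disease must match the Expression Location of the gene.]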
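
The two "Match" conditions in the diagram can be read as filters on the candidate genes. The sketch below is one possible reading, not the BITOLA implementation: the `Gene` class, the prefix-based band comparison, the tissue sets and the placeholder genes GENE_A and GENE_B are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Gene:
    symbol: str
    location: str                                   # cytogenetic band, e.g. "Xq28"
    expressed_in: set = field(default_factory=set)  # tissues (toy values here)

def location_matches(disease_region, gene_location):
    # Simplifying assumption: a gene lies in the disease region when its band
    # string starts with the region string (e.g. region "Xq2" covers "Xq28").
    return gene_location.startswith(disease_region)

def candidate_genes(genes, disease_region, manifestation_sites):
    """Keep genes that pass both the chromosomal and the tissue match."""
    return [
        g.symbol
        for g in genes
        if location_matches(disease_region, g.location)
        and g.expressed_in & manifestation_sites
    ]

# Toy input; tissue annotations and GENE_A / GENE_B are hypothetical.
genes = [
    Gene("L1CAM", "Xq28", {"brain"}),
    Gene("FLNA", "Xq28", {"brain", "muscle"}),
    Gene("GENE_A", "Xq28", {"liver"}),
    Gene("GENE_B", "7q31", {"brain"}),
]
print(candidate_genes(genes, "Xq28", {"brain"}))
```

Each filter only removes candidates, so applying both reduces the number of relations the user has to review.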
Ranking Concepts Z
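[Diagram: X is linked to intermediate concepts Y1, Y2, Y3, …, Yi, …, Yj, which are linked to concepts Z1, Z2, Z3, …, Zk, …, Zn]

$$\mathrm{Rank}(Z_k) = \sum_{i=1}^{m} \left( S_{XY_i} \cdot S_{Y_i Z_k} \right)$$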
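
As a worked illustration, the rank of a concept Zk can be computed as the sum, over the m intermediate concepts Yi that link X to Zk, of S(X, Yi) multiplied by S(Yi, Zk). The sketch assumes S is the support of the corresponding association rule. The function name `rank_z`, the dictionary layout and the numbers are hypothetical.

```python
from collections import defaultdict

def rank_z(x, support):
    """Rank concepts Z by the sum over linking Y of S(x, Y) * S(Y, Z).

    support maps a rule (a, b) to its support value.
    """
    xy = {b: s for (a, b), s in support.items() if a == x}
    scores = defaultdict(float)
    for (y, z), s_yz in support.items():
        # skip x itself and concepts that are already directly related to x
        if y in xy and z != x and z not in xy:
            scores[z] += xy[y] * s_yz
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy support values (illustrative only).
support = {
    ("X", "Y1"): 10,
    ("X", "Y2"): 5,
    ("Y1", "Z1"): 3,
    ("Y2", "Z1"): 4,
    ("Y1", "Z2"): 2,
    ("Y1", "Y2"): 7,
}
ranking = rank_z("X", support)
print(ranking)
```

Here Z1 is reached through both Y1 and Y2 (10·3 + 5·4), while Z2 is reached only through Y1 (10·2), so Z1 ranks first.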
Problem Size
• Full Medline analyzed (ca. 15,000,000 records)
• 87,000,000 association rules between 180,000 biomedical concepts
Bilateral Perisylvian Polymicrogyria - BPP (OMIM: 300388)
• Polymicrogyria of the cerebral cortex is a developmental abnormality characterized by excessive surface convolution
• Clinical characteristics:
  – Mental retardation
  – Epilepsy
  – Pseudobulbar palsy (paralysis of the face, throat, tongue and the chewing process)
• X-linked dominant inheritance
[Diagram: candidate gene filtering for BPP. Labels: 237 genes in Xq28; relation between semantic types Cell Movement and Gene or gene products; 18 gene candidates; tissue-specific expression; 15 gene candidates; sublocalisation in Xq28; 2 gene candidates: L1CAM and FLNA]
User Interface “cgi-bin” version
Automatically search for supporting Medline Citations
Part 1: Summary and Conclusions
• Discovery support system (BITOLA) presented
• The system can be used as:
  – Research idea generator, or
  – Alternative method of searching Medline
• Genetic knowledge about the chromosomal locations of diseases and genes included to make BITOLA more suitable for disease candidate gene discovery
System Availability
• URL: www.mf.uni-lj.si/bitola/
Part 2: Exploring Semantic Relations for LBD
Current LBD Systems
• Co-occurrence based
• Concepts
  – Title/Abstract Words/Phrases
  – MeSH
  – UMLS
  – Genes ...
• UMLS Semantic types used for filtering
• Semantic relations between concepts NOT used
Drawbacks of Current LBD
• Not all co-occurrences represent a relation
• Users have to read many Medline citations when reviewing candidate relations
• Many spurious (false-positive) relations and hypotheses produced
• No explanation of proposed hypotheses
Enhancing the LBD paradigm
• Use semantic relations obtained from two NLP systems (BioMedLee and SemRep) to augment a co-occurrence based LBD system (BITOLA)
Methods
Discovery Patterns
• Discovery pattern: set of conditions to be satisfied for the generation of new hypotheses
• Conditions are combinations of semantic relations between concepts
• Maybe_Treats pattern in this research has two forms:
  – Maybe_Treats1
  – Maybe_Treats2
Maybe_Treats Discovery Pattern
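[Diagram:
• Maybe_Treats1: Disease X –Change1– Substance Y1 (or Body meas., Body funct.); Drug Z1 (or substance) –Opposite_Change1– Substance Y1; hence Drug Z1 –Maybe_Treats1– Disease X
• Maybe_Treats2: Disease X –Change2– Substance Y2 (or Body meas., Body funct.); Disease X2 –Same Change2– Substance Y2; Drug Z2 (or substance) –Treats– Disease X2; hence Drug Z2 –Maybe_Treats2– Disease X]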
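
The two forms of the pattern can be sketched as queries over extracted facts. In Maybe_Treats1 a drug changes a substance in the direction opposite to the disease. In Maybe_Treats2 a second disease shows the same change and already has a known treatment. The code is a minimal sketch, not the authors' implementation: the function names, the tuple layouts and the `drugs` set are assumptions. The facts themselves are the Huntington examples given later in this deck.

```python
OPPOSITE = {"increase": "decrease", "decrease": "increase"}

def maybe_treats1(disease, assoc, drugs, treats):
    """Drugs Z that change a substance Y in the direction opposite to the disease."""
    known = {z for z, d in treats if d == disease}   # already-known treatments
    return {
        (z, y)
        for x, y, change in assoc if x == disease
        for z, y2, change2 in assoc
        if z in drugs and z not in known
        and y2 == y and change2 == OPPOSITE.get(change)
    }

def maybe_treats2(disease, assoc, treats):
    """Treatments Z of another disease X2 that shows the same change in Y."""
    known = {z for z, d in treats if d == disease}
    return {
        (z, x2, y)
        for x, y, change in assoc if x == disease
        for x2, y2, change2 in assoc
        if x2 != disease and y2 == y and change2 == change
        for z, d in treats
        if d == x2 and z not in known
    }

# Facts as (concept, substance, change) and (drug, disease).
assoc = {
    ("Huntington", "Substance P", "decrease"),
    ("Capsaicin", "Substance P", "increase"),
    ("Huntington", "Insulin", "decrease"),
    ("Diabetes mellitus", "Insulin", "decrease"),
}
treats = {
    ("Insulin regulation therapy", "Diabetes mellitus"),
    ("Amantadine", "Huntington"),
}
drugs = {"Capsaicin", "Amantadine"}

print(maybe_treats1("Huntington", assoc, drugs, treats))
print(maybe_treats2("Huntington", assoc, treats))
```

Each hypothesis carries the linking substance (and, for Maybe_Treats2, the second disease), which is what lets the system explain why a treatment was proposed.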
Maybe_Treats1 and Maybe_Treats2
• Goal: Propose potentially new treatments
• Can work in concert:
  – Propose different treatments (complementary)
  – Propose same treatments using different discovery reasoning (reinforcing)
Multiple Usages of Maybe_Treats
• Given Disease X as input:
  – find new treatments Z
• Given Drug Z as input:
  – find diseases X that can be treated
• Given Disease X and Drug Z as input:
  – test whether Z can be used to treat X
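
The three usages differ only in which end of the hypothesis is fixed. A small sketch for the Maybe_Treats1 form, with optional `disease` and `drug` arguments, shows this. The function signature, the `diseases` and `drugs` sets and the (concept, substance, change) fact layout are assumptions, and the facts are simplified from the Huntington and Parkinson examples in this deck.

```python
OPPOSITE = {"increase": "decrease", "decrease": "increase"}

def maybe_treats1(assoc, diseases, drugs, disease=None, drug=None):
    """Return sorted (drug, disease, substance) hypotheses.

    disease only -> new treatments for that disease
    drug only    -> diseases the drug might treat
    both         -> test one specific drug/disease pair
    """
    hyps = set()
    for x, y, change in assoc:
        if x not in diseases or (disease is not None and x != disease):
            continue
        for z, y2, change2 in assoc:
            if z not in drugs or (drug is not None and z != drug):
                continue
            if y2 == y and change2 == OPPOSITE.get(change):
                hyps.add((z, x, y))
    return sorted(hyps)

# Facts as (concept, substance, change).
assoc = {
    ("Huntington", "Substance P", "decrease"),
    ("Capsaicin", "Substance P", "increase"),
    ("Parkinson", "GABA", "decrease"),
    ("Gabapentin", "GABA", "increase"),
}
diseases = {"Huntington", "Parkinson"}
drugs = {"Capsaicin", "Gabapentin"}

print(maybe_treats1(assoc, diseases, drugs, disease="Parkinson"))
print(maybe_treats1(assoc, diseases, drugs, drug="Capsaicin"))
print(maybe_treats1(assoc, diseases, drugs, disease="Huntington", drug="Gabapentin"))
```

An empty result in the third mode means only that the available facts do not support the pair, not that the drug is ineffective.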
Semantic Relations Used
• Associated_with_change and Treats used to extract known facts from the literature
• Then Maybe_Treats1 and Maybe_Treats2 predict new treatments based on the known extracted facts
Associated_with_change
• One concept associated with a change in another concept, for example:
• Associated_with(Raynaud’s, Blood viscosity, increase):
  – “Local increase of blood viscosity during cold-induced Raynaud's phenomenon.”
  – “Increased viscosity might be a causal factor in secondary forms of Raynaud's disease, …”
• BioMedLee (Friedman et al) used to extract Associated_with_change
Treats
• Used to extract drugs known to treat a disease
• Major purpose in our approach:
  – Eliminate drugs already known to be used to treat a disease
  – Find existing treatments for similar diseases
• TREATS(Amantadine,Huntington):
  – “…treatment of Huntington’s disease with amantadine…”
• Treats extracted by SemRep (Rindflesch et al)
Results
Huntington Disease
• Inherited neurodegenerative disorder
• All 5511 Huntington citations (Jan. 2006) processed with BioMedLee and SemRep
• 35 interesting concepts assoc. with change selected and corresponding citations (250,000) processed
Insulin for Huntington Disease
• Assoc_with(Huntington,Insulin,decrease):
  – “Huntington's disease transgenic mice develop an age-dependent reduction of insulin mRNA expression and diminished expression of key regulators of insulin gene transcription, …”
• Insulin also decreased in diabetes mellitus
• Therapies used to regulate insulin in diabetes might be used for Huntington
Capsaicin for Huntington
• Assoc_with(Huntington,Substance P,decrease):
  – “In Huntington's disease brains decreased Substance P staining was found in …”
• Assoc_with(Capsaicin,Substance P,increase):
  – “Capsaicin also attenuated the increase in Substance P content in sciatic nerve, …”
• Capsaicin maybe treats Huntington because Substance P is decreased in Huntington and Capsaicin increases Substance P.
Huntington Results - Summary
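[Diagram:
• Maybe_Treats1: Huntington (Disease X) –Decrease– Substance P (Substance Y1); Capsaicin (Drug Z1) –Increase– Substance P; hence Capsaicin –Maybe_Treats1– Huntington
• Maybe_Treats2: Huntington (Disease X) –Decrease– Insulin (Substance Y2); Diabetes M (Disease X2) –Decrease– Insulin; Insulin regulation ther. (Z2) –Treats– Diabetes M; hence Insulin regulation ther. –Maybe_Treats2– Huntington]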
Example: Parkinson disease as the starting concept. Shown below are some related concepts that change in association with Parkinson
Potential Treatments for Parkinson (e.g. gabapentin)
Showing Supporting Sentences with highlighted concepts and relations
Gabapentin for Parkinson
• Assoc_with(Parkinson,gamma-aminobutyric acid(GABA),decrease):
  – “…studies indicate that patients with Parkinson's disease have decreased basal ganglia gamma-aminobutyric acid function…”
• Assoc_with(GABA,Gabapentin,increase):
  – “Gabapentin, probably through the activation of glutamic acid decarboxylase, leads to the increase in synaptic GABA.”
• Explanation: Gabapentin maybe treats Parkinson because GABA is decreased in Parkinson and Gabapentin increases GABA.
Part 2: Conclusions
• A new method to improve LBD presented
• Based on discovery patterns and semantic relations extracted by BioMedLee and SemRep, coupled with BITOLA LBD
• Easier for the user to evaluate a smaller number of hypotheses
• Two potentially new therapeutic approaches for Huntington proposed and one for Parkinson
• Raynaud’s—Fish oil discovery replicated
The future of Literature-based Discovery
• Development of specific discovery patterns based on semantic relations and further integrated with co-occurrence-based LBD
Link, References and some propaganda
• http://www.mf.uni-lj.si/bitola
• Hristovski D, Peterlin B, Mitchell JA and Humphrey SM. Using literature-based discovery to identify disease candidate genes. Int. J. Med. Inform. 2005. Vol. 74(2–4), pp. 289–298. Selected for Yearbook of Medical Informatics 2006
• Hristovski D, Friedman C, Rindflesch TC, Peterlin B. Exploiting semantic relations for literature-based discovery. In Proc AMIA 2006 Symp; 2006. p. 349-53.
• Ahlers C, Hristovski D, Kilicoglu H, Rindflesch TC. Using the Literature-Based Discovery Paradigm to Investigate Drug Mechanisms. In Proc AMIA 2007 Symp; 2007. p. 6-10. “Distinguished Paper Award AMIA2007”
• Hristovski D, Friedman C, Rindflesch TC, Peterlin B. Literature-Based Knowledge Discovery using Natural Language Processing. To appear as a chapter in the first LBD book in 2008