Literature Mining BMI 730
![Page 1: Literature Mining BMI 730](https://reader037.fdocuments.us/reader037/viewer/2022103006/56812dae550346895d92d5e8/html5/thumbnails/1.jpg)
Literature Mining BMI 730
Kun Huang
Department of Biomedical Informatics
Ohio State University
![Page 2: Literature Mining BMI 730](https://reader037.fdocuments.us/reader037/viewer/2022103006/56812dae550346895d92d5e8/html5/thumbnails/2.jpg)
Announcement
• HW #3 is cancelled. The grades will be adjusted accordingly.
![Page 3: Literature Mining BMI 730](https://reader037.fdocuments.us/reader037/viewer/2022103006/56812dae550346895d92d5e8/html5/thumbnails/3.jpg)
Jensen et al. Nature Reviews Genetics 7, 119–129 (February 2006) | doi:10.1038/nrg1768
Acknowledgement
![Page 4: Literature Mining BMI 730](https://reader037.fdocuments.us/reader037/viewer/2022103006/56812dae550346895d92d5e8/html5/thumbnails/4.jpg)
Acknowledgement
• Dr. Hongyu Peng (Brandeis Univ.)
• Dr. Hagit Shatkay (http://www.shatkay.org)
provided part of the slides.
![Page 5: Literature Mining BMI 730](https://reader037.fdocuments.us/reader037/viewer/2022103006/56812dae550346895d92d5e8/html5/thumbnails/5.jpg)
Connecting the dots
• Story of Thalidomide (from sedative to birth defects to anti-cancer drug)
![Page 6: Literature Mining BMI 730](https://reader037.fdocuments.us/reader037/viewer/2022103006/56812dae550346895d92d5e8/html5/thumbnails/6.jpg)
Jensen et al. Nature Reviews Genetics 7, 119–129 (February 2006) | doi:10.1038/nrg1768
![Page 7: Literature Mining BMI 730](https://reader037.fdocuments.us/reader037/viewer/2022103006/56812dae550346895d92d5e8/html5/thumbnails/7.jpg)
Jensen et al. Nature Reviews Genetics 7, 119–129 (February 2006) | doi:10.1038/nrg1768
![Page 8: Literature Mining BMI 730](https://reader037.fdocuments.us/reader037/viewer/2022103006/56812dae550346895d92d5e8/html5/thumbnails/8.jpg)
Information Retrieval (IR)
• Finding the papers
• IR systems aim to identify the text segments (be it full articles, abstracts, paragraphs or sentences) that pertain to a certain topic (e.g., yeast cell cycle).
• E.g., PubMed, Google Scholar
• Ad hoc IR
• Text categorization (pre-defined set of papers)
• Advanced – integrate Entity Recognition
![Page 9: Literature Mining BMI 730](https://reader037.fdocuments.us/reader037/viewer/2022103006/56812dae550346895d92d5e8/html5/thumbnails/9.jpg)
Ad Hoc IR
• User provides a query
• Boolean model
• Index based (e.g. “Gene and CD”)
![Page 10: Literature Mining BMI 730](https://reader037.fdocuments.us/reader037/viewer/2022103006/56812dae550346895d92d5e8/html5/thumbnails/10.jpg)
Boolean Queries
DB: Database of documents.
Vocabulary: {t1,…,tM} (terms in DB, produced by the tokenization stage)
Index structure: a term → all the documents containing it.
[Figure: an index of terms (acquired immunodeficiency, asthma, blood, blood pressure, ...) pointing into the database of documents]
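The index structure above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation behind PubMed or any other real system: the whitespace tokenizer, the function names and the three-document toy database are all assumptions made for the example.

```python
from collections import defaultdict

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def build_index(docs):
    """Map each term to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def boolean_and(index, *terms):
    """Ids of documents that contain every query term."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()

def boolean_or(index, *terms):
    """Ids of documents that contain at least one query term."""
    result = set()
    for t in terms:
        result |= index.get(t.lower(), set())
    return result

# Hypothetical toy database, for illustration only.
DOCS = {
    1: "gene expression in asthma",
    2: "blood pressure and the CD gene",
    3: "CD and blood",
}
INDEX = build_index(DOCS)

if __name__ == "__main__":
    print(boolean_and(INDEX, "gene", "CD"))      # documents matching "Gene and CD"
    print(boolean_or(INDEX, "asthma", "blood"))
```

Note that the query “Gene and CD” matches on the string “CD” only, whatever the abbreviation means in each document.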
![Page 11: Literature Mining BMI 730](https://reader037.fdocuments.us/reader037/viewer/2022103006/56812dae550346895d92d5e8/html5/thumbnails/11.jpg)
Ad Hoc IR
• User provides a query
• Boolean model
• Challenges
  • Polysemy: “CD” (54,745 PubMed entries) can stand for Chagas' disease, cytosine deaminase, Crohn's disease, capillary density, cortical dysplasia, compact disk, ...
  • Synonymy: AGP1, a.k.a. Amino Acid Permease 1
![Page 12: Literature Mining BMI 730](https://reader037.fdocuments.us/reader037/viewer/2022103006/56812dae550346895d92d5e8/html5/thumbnails/12.jpg)
Ad Hoc IR
• User provides a query
• Vector-based model
• Similarity query, e.g., vector based. Semantic search
TIME (Sept 5, 2005): Search engines are good at matching words … The next step is semantic search – looking for meaning, not just matching key words. … Nervana, which analyzes language by linking word patterns contextually to answer questions in defined subject areas, such as medical-research literature.
![Page 13: Literature Mining BMI 730](https://reader037.fdocuments.us/reader037/viewer/2022103006/56812dae550346895d92d5e8/html5/thumbnails/13.jpg)
The Vector Model
DB: Database of documents.
Vocabulary: {v1,…,vM} (terms in DB)
Document d ∈ DB: vector $\langle w_1^d, \ldots, w_M^d \rangle$ of weights.
Weighting Principles
• Document frequency: Terms occurring in a few documents are more useful than terms occurring in many.
• Local term frequency: Terms occurring frequently within a document are likely to be significant for the document.
• Document length: A term occurring the same # of times in a long document and in a short one has less significance in the long one.
• Relevance: Terms occurring in documents judged as relevant to a query, are likely to be significant (WRT the query).
[Sparck Jones et al. 98]
![Page 14: Literature Mining BMI 730](https://reader037.fdocuments.us/reader037/viewer/2022103006/56812dae550346895d92d5e8/html5/thumbnails/14.jpg)
Some Weighting Schemes:
Binary: $W_i^d = 1$ if $t_i \in d$, $0$ otherwise
TF (considers local term frequency): $W_i^d = f_i^d$ = # of times $t_i$ occurs in $d$.
TF × IDF (one version; considers local term frequency and inverse document frequency): $W_i^d = f_i^d / f_i$ ($f_i$ = # of docs containing $t_i$)
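The three weighting schemes can be sketched as follows. The TF × IDF function uses the plain ratio version given on the slide (term count divided by document frequency). Many systems use a logarithmic IDF instead, which is not shown here. The toy documents and all function names are assumptions for illustration.

```python
from collections import Counter

def document_frequency(docs):
    """f_i: the number of documents that contain each term."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    return df

def binary_weights(tokens, vocabulary):
    """W_i^d = 1 if term i occurs in the document, else 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocabulary]

def tf_weights(tokens, vocabulary):
    """W_i^d = f_i^d, the number of times term i occurs in the document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

def tf_idf_weights(tokens, vocabulary, df):
    """W_i^d = f_i^d / f_i (the ratio version from the slide)."""
    counts = Counter(tokens)
    return [counts[term] / df[term] if df[term] else 0.0 for term in vocabulary]

# Hypothetical, already tokenized toy documents.
DOCS = [
    ["gene", "gene", "blood"],
    ["blood", "pressure"],
    ["gene", "asthma", "blood"],
]
VOCAB = sorted({term for doc in DOCS for term in doc})
DF = document_frequency(DOCS)

if __name__ == "__main__":
    print(VOCAB)
    print(binary_weights(DOCS[0], VOCAB))
    print(tf_weights(DOCS[0], VOCAB))
    print(tf_idf_weights(DOCS[0], VOCAB, DF))
```

In the first toy document, "blood" is down-weighted because it occurs in every document, while "gene" keeps a higher weight.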
![Page 15: Literature Mining BMI 730](https://reader037.fdocuments.us/reader037/viewer/2022103006/56812dae550346895d92d5e8/html5/thumbnails/15.jpg)
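Vector-Based similarity
Document $d = \langle w_1^d, \ldots, w_M^d \rangle \in$ DB
Query $q = \langle w_1^q, \ldots, w_M^q \rangle$ (q could itself be a document in DB...)
$\mathrm{Sim}(q, d) = \mathrm{cosine}(q, d) = \dfrac{q \cdot d}{|q|\,|d|}$
[Figure: the vectors q and d and the angle between them]
[Salton89, Witten et al99] Introductory IR.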
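A minimal sketch of the cosine similarity and of ranking documents against a query is given below. The weight vectors are hypothetical, and returning 0.0 for an all-zero vector is a convention chosen for this example.

```python
import math

def cosine(q, d):
    """Cosine of the angle between two weight vectors of equal length."""
    dot = sum(a * b for a, b in zip(q, d))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_d = math.sqrt(sum(b * b for b in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # convention for this sketch: an empty vector matches nothing
    return dot / (norm_q * norm_d)

def rank(query, docs):
    """Document ids sorted by decreasing similarity to the query."""
    return sorted(docs, key=lambda doc_id: cosine(query, docs[doc_id]), reverse=True)

# Hypothetical weight vectors over a four-term vocabulary.
DOC_VECTORS = {
    "d1": [0, 1, 2, 0],
    "d2": [0, 1, 0, 1],
    "d3": [1, 1, 1, 0],
}
QUERY = [0, 0, 1, 0]

if __name__ == "__main__":
    for doc_id in rank(QUERY, DOC_VECTORS):
        print(doc_id, round(cosine(QUERY, DOC_VECTORS[doc_id]), 3))
```

Because the cosine normalizes by vector length, a long document is not favored merely for containing more terms.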
![Page 16: Literature Mining BMI 730](https://reader037.fdocuments.us/reader037/viewer/2022103006/56812dae550346895d92d5e8/html5/thumbnails/16.jpg)
Probabilistic Models
Query q; Document d
• Goal: Find all d’s such that P(relevant | d, q) is high
Maximize log-odds: $\log \dfrac{P(\text{relevant} \mid d, q)}{P(\text{irrelevant} \mid d, q)}$
[Sparck Jones et al. 98, Sahami98, Ponte&Croft 98, Hoffman 99]
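As a small illustration of the log-odds criterion, the sketch below assumes that relevance probabilities have already been estimated by some model and that P(irrelevant | d, q) = 1 − P(relevant | d, q). The probabilities and function names are made up for the example.

```python
import math

def log_odds(p_relevant):
    """log( P(relevant | d, q) / P(irrelevant | d, q) ), assuming the two sum to 1."""
    if not 0.0 < p_relevant < 1.0:
        raise ValueError("probability must lie strictly between 0 and 1")
    return math.log(p_relevant / (1.0 - p_relevant))

def rank_by_log_odds(p_relevant_by_doc):
    """Document ids sorted by decreasing log-odds of relevance."""
    return sorted(p_relevant_by_doc,
                  key=lambda doc_id: log_odds(p_relevant_by_doc[doc_id]),
                  reverse=True)

# Hypothetical model outputs for one query.
ESTIMATES = {"d1": 0.9, "d2": 0.2, "d3": 0.6}

if __name__ == "__main__":
    print(rank_by_log_odds(ESTIMATES))
```

The log-odds is a monotone function of the probability, so both give the same ranking. The log form is convenient because products of probabilities in a model turn into sums.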
![Page 17: Literature Mining BMI 730](https://reader037.fdocuments.us/reader037/viewer/2022103006/56812dae550346895d92d5e8/html5/thumbnails/17.jpg)
Latent Semantic Analysis [Dumais, Deerwester et al, 1988, 1990]
Motivation: Overcoming synonymy and polysemy. Reducing dimensionality.
Idea: Project from “explicit term” space to a lower dimension, “abstract concept” space.
Methodology: Singular value decomposition (closely related to PCA) applied to the document-term matrix. The singular vectors associated with the highest singular values are used as the features for representing documents.
![Page 18: Literature Mining BMI 730](https://reader037.fdocuments.us/reader037/viewer/2022103006/56812dae550346895d92d5e8/html5/thumbnails/18.jpg)
Information Retrieval - Details (cont.)
Text Categorization (semantic)
Automatically place documents in right categories so as to make them easy-to-find.
[Figure: documents sorted into categories such as Cancer, Apoptosis, Elongation, ...]
![Page 19: Literature Mining BMI 730](https://reader037.fdocuments.us/reader037/viewer/2022103006/56812dae550346895d92d5e8/html5/thumbnails/19.jpg)
Information Retrieval - Details (cont.)
Rule-Based Text Classification
A knowledge-engineering approach. Boolean rules (DNF), based on the presence/absence of specific terms within the document, decide its membership in the class. (e.g. the CONSTRUE system [Hayes et al. 90,92])
Example: If ((<GENE_Name> ⋀ transcript) ⋁ (<GENE_Name> ⋀ Western Blot) ⋁ (<GENE_Name> ⋀ Northern Blot)) Then GeneExpressionDoc Else ¬GeneExpressionDoc
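The example rule can be written directly as a Boolean function. This is only a sketch of the idea, not the CONSTRUE system: the gene-name list is a hypothetical stand-in for the <GENE_Name> slot, and plain substring tests stand in for real term matching.

```python
import re

# Hypothetical stand-in for the <GENE_Name> slot of the rule.
GENE_NAMES = {"cdc28", "agp1"}

def mentions_gene(text):
    """True if any token of the text is a known gene name."""
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    return bool(tokens & GENE_NAMES)

def is_gene_expression_doc(text):
    """DNF rule: (gene AND transcript) OR (gene AND Western blot) OR (gene AND Northern blot)."""
    lowered = text.lower()
    has_gene = mentions_gene(text)
    # Substring tests are crude: "transcript" also matches "transcription".
    return ((has_gene and "transcript" in lowered)
            or (has_gene and "western blot" in lowered)
            or (has_gene and "northern blot" in lowered))

if __name__ == "__main__":
    print(is_gene_expression_doc("CDC28 levels were measured by Western blot"))
    print(is_gene_expression_doc("A general Western blot protocol"))
```

Such rules are transparent and easy to audit, but every new class or phrasing needs another hand-written rule.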
![Page 20: Literature Mining BMI 730](https://reader037.fdocuments.us/reader037/viewer/2022103006/56812dae550346895d92d5e8/html5/thumbnails/20.jpg)
Information Retrieval - Details (cont.)
Machine Learning for Text Classification (supervised)
• Take a training set of pre-classified documents
• Build a model for the classes from the training examples
• Assign each new document to the class that best fits it (e.g. closest or most-probable class)
Types of class assignment:
Hard: Each document belongs to exactly one class
Soft: Each document is assigned a “degree of membership” in several classes
Methods
Nearest neighbor
Summarizing document vectors
SVM, Bayesian, boosting
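One way to read “summarizing document vectors” is a centroid classifier: each class is summarized by the mean of its training vectors, and a new document goes to the closest centroid. The sketch below makes that assumption. Using the cosine to each centroid as the soft “degree of membership” is also an assumption of this example, as are the toy training vectors and class names.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def centroids(training):
    """training: list of (vector, label). Returns label -> mean vector of that class."""
    sums = {}
    counts = defaultdict(int)
    for vector, label in training:
        if label not in sums:
            sums[label] = [0.0] * len(vector)
        sums[label] = [s + x for s, x in zip(sums[label], vector)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}

def soft_assign(vector, class_centroids):
    """Soft assignment: a similarity score for every class."""
    return {label: cosine(vector, c) for label, c in class_centroids.items()}

def hard_assign(vector, class_centroids):
    """Hard assignment: the single class with the closest centroid."""
    scores = soft_assign(vector, class_centroids)
    return max(scores, key=scores.get)

# Hypothetical pre-classified training vectors.
TRAIN = [
    ([2, 0, 1], "cancer"),
    ([3, 1, 0], "cancer"),
    ([0, 2, 1], "apoptosis"),
    ([0, 3, 0], "apoptosis"),
]
MODEL = centroids(TRAIN)

if __name__ == "__main__":
    print(hard_assign([1, 0, 0], MODEL))
    print(soft_assign([0, 1, 1], MODEL))
```

A nearest-neighbor classifier would instead compare the new document with every training vector, which needs no training step but costs more at query time.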
![Page 21: Literature Mining BMI 730](https://reader037.fdocuments.us/reader037/viewer/2022103006/56812dae550346895d92d5e8/html5/thumbnails/21.jpg)
Evaluating Extraction and Retrieval
To say how good a system is we need:
1. Performance metrics (numerical measures)
2. Benchmarks, on which performance is measured (the gold-standard).
![Page 22: Literature Mining BMI 730](https://reader037.fdocuments.us/reader037/viewer/2022103006/56812dae550346895d92d5e8/html5/thumbnails/22.jpg)
Evaluating Extraction and Retrieval (cont.)
Performance Metrics
N items (e.g. documents, terms or sentences) in the collection
REL: Relevant items (documents, terms or sentences) in the collection. These SHOULD be extracted or retrieved.
RETR: Retrieved items (e.g. documents, terms or sentences), i.e. those that are actually extracted/retrieved.
Some correctly (A = |REL ∩ RETR|), some incorrectly (B = |RETR – REL|)
|RETR| = A+B
![Page 23: Literature Mining BMI 730](https://reader037.fdocuments.us/reader037/viewer/2022103006/56812dae550346895d92d5e8/html5/thumbnails/23.jpg)
Evaluating Extraction and Retrieval (cont.)
Performance Metrics (cont.)
[Venn diagram: the sets REL and RETR inside the Collection]
|Collection| = N
|REL ∩ RETR| = A
|RETR – REL| = B
|NotREL – RETR| = C
|REL – RETR| = D
![Page 24: Literature Mining BMI 730](https://reader037.fdocuments.us/reader037/viewer/2022103006/56812dae550346895d92d5e8/html5/thumbnails/24.jpg)
Performance Metrics (cont.)
Precision: P = A/(A+B)
How many of the retrieved/extracted items are correct
Recall: R = A/(A+D)
How many of the items that should be retrieved are recovered
Accuracy: (A+C)/N (ratio of correctly classified items)
Combination Scores:
F-score: 2PR / (P+R)
Harmonic mean of P and R, in the range [0,1]
$F_\beta$-score: $(1+\beta^2)PR / (\beta^2 P + R)$
β > 1: prefer recall; β < 1: prefer precision
E-measure: 1 – $F_\beta$-score
Decreases as performance improves (error measure).
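The metrics can be computed directly from the sets REL and RETR. The sketch below follows the A, B, C, D notation of the slides. The item ids in the example and the choice to return 0.0 when a denominator is zero are assumptions of this illustration.

```python
def evaluate(relevant, retrieved, n_items, beta=1.0):
    """Precision, recall, accuracy, F_beta and E-measure from two sets of item ids."""
    a = len(relevant & retrieved)   # A: relevant and retrieved
    b = len(retrieved - relevant)   # B: retrieved but not relevant
    d = len(relevant - retrieved)   # D: relevant but missed
    c = n_items - a - b - d         # C: neither relevant nor retrieved
    precision = a / (a + b) if a + b else 0.0
    recall = a / (a + d) if a + d else 0.0
    accuracy = (a + c) / n_items
    denominator = beta ** 2 * precision + recall
    f_beta = (1 + beta ** 2) * precision * recall / denominator if denominator else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f_beta": f_beta,
        "e_measure": 1.0 - f_beta,
    }

if __name__ == "__main__":
    # Hypothetical collection of 10 items: 4 relevant, 3 retrieved, 2 of them correct.
    for name, value in evaluate({1, 2, 3, 4}, {1, 2, 5}, n_items=10).items():
        print(f"{name}: {value:.3f}")
```

In this example accuracy (0.7) is higher than both recall (0.5) and the F-score, because the many correctly ignored items (C) count toward accuracy. This is one reason accuracy is a weak measure when relevant items are rare.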
![Page 25: Literature Mining BMI 730](https://reader037.fdocuments.us/reader037/viewer/2022103006/56812dae550346895d92d5e8/html5/thumbnails/25.jpg)
Performance Metrics (cont.)
Precision-Recall Curves
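4 relevant documents in the collection.
7 retrieved and ranked.
[Figure: the ranked list of the 7 retrieved documents, with recall reaching 25%, 50%, 75% and 100% as relevant documents are found, and a plot of Precision (0 to 100) against Recall (0 to 100). The precision values shown are 66, 75, 66 and 100.]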
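The points of a precision-recall curve can be computed by walking down the ranked list. The ranking used below (relevant documents at ranks 1, 3, 4 and 6) is an assumption: it is one ranking consistent with 4 relevant documents, 7 retrieved, and the precision values 100, 66, 75 and 66 that appear in the figure.

```python
def precision_recall_points(ranked_relevance, n_relevant):
    """(recall, precision) at each rank where a relevant document is retrieved."""
    points = []
    hits = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            points.append((hits / n_relevant, hits / rank))
    return points

# Assumed ranking: True marks a relevant document at that rank.
RANKED = [True, False, True, True, False, True, False]

if __name__ == "__main__":
    for recall, precision in precision_recall_points(RANKED, n_relevant=4):
        print(f"recall {recall:.0%}  precision {precision:.1%}")
```

Precision is not monotone along the curve: it drops when a non-relevant document is ranked before the next relevant one and recovers when relevant documents follow each other.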
![Page 26: Literature Mining BMI 730](https://reader037.fdocuments.us/reader037/viewer/2022103006/56812dae550346895d92d5e8/html5/thumbnails/26.jpg)
Performance Metrics (cont.)
Average Scores
Average Precision: Average the precision over all the ranks in which a relevant document is retrieved.
Mean Average Precision: Mean of the Average Precision over all the queries.
Micro-Average: Average over individual items across queries
Macro-Average: Average over queries
Accounting for Ranks
For a given rank n, $P_n$: Precision at rank n (P@n)
R-Precision: $P_R$ where R is the number of relevant documents
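The rank-based scores can be sketched as below. Average Precision follows the wording of the slide and averages over the ranks at which a relevant document is retrieved. Another common convention divides by the total number of relevant documents instead, and the two differ when some relevant documents are never retrieved. The rankings are hypothetical.

```python
def precision_at(ranked_relevance, n):
    """P@n: fraction of the top n retrieved documents that are relevant."""
    return sum(ranked_relevance[:n]) / n

def r_precision(ranked_relevance, n_relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    return precision_at(ranked_relevance, n_relevant)

def average_precision(ranked_relevance):
    """Mean of P@k over the ranks k at which a relevant document is retrieved."""
    hits = 0
    total = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def mean_average_precision(rankings):
    """Mean of the Average Precision over all queries (a macro-average)."""
    return sum(average_precision(r) for r in rankings) / len(rankings)

# Hypothetical rankings for two queries; True marks a relevant document.
QUERY_1 = [True, False, True, True, False, True, False]
QUERY_2 = [False, True]

if __name__ == "__main__":
    print(precision_at(QUERY_1, 4))
    print(r_precision(QUERY_1, n_relevant=4))
    print(average_precision(QUERY_1))
    print(mean_average_precision([QUERY_1, QUERY_2]))
```

Mean Average Precision weights every query equally, so a query with few relevant documents influences the score as much as one with many.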
![Page 27: Literature Mining BMI 730](https://reader037.fdocuments.us/reader037/viewer/2022103006/56812dae550346895d92d5e8/html5/thumbnails/27.jpg)
Jensen et al. Nature Reviews Genetics 7, 119–129 (February 2006) | doi:10.1038/nrg1768
![Page 28: Literature Mining BMI 730](https://reader037.fdocuments.us/reader037/viewer/2022103006/56812dae550346895d92d5e8/html5/thumbnails/28.jpg)
Entity Recognition (ER)
• Identifying the substance(s)
• Rule and contextual based approach (manual) – e.g., ‘-ase’ for enzyme
• Rule and contextual based approach (machine learning)
• Dictionary-based approach
  • How the names are written - CDC28, cdc28, cdc28p, cdc-28
  • Curation of the dictionary
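A dictionary-based recognizer usually normalizes spelling variants before lookup. The sketch below is a toy under several assumptions: the normalization rules (lowercasing, dropping hyphens, stripping a trailing “p” after a digit), the one-entry dictionary and the “ENZYME?” tag for the ‘-ase’ rule are all invented for the example.

```python
import re

def normalize(name):
    """Collapse spelling variants: lowercase, drop hyphens/spaces, strip a 'p' suffix after a digit."""
    key = re.sub(r"[\s\-]", "", name.lower())
    if len(key) > 1 and key.endswith("p") and key[-2].isdigit():
        key = key[:-1]
    return key

# Hypothetical one-entry dictionary: normalized key -> canonical name.
DICTIONARY = {normalize("CDC28"): "CDC28"}

def recognize(text):
    """Return (mention, label) pairs found by dictionary lookup or by the '-ase' suffix rule."""
    found = []
    for token in re.findall(r"[A-Za-z0-9\-]+", text):
        key = normalize(token)
        if key in DICTIONARY:
            found.append((token, DICTIONARY[key]))
        elif len(token) > 4 and token.lower().endswith("ase"):
            found.append((token, "ENZYME?"))  # crude manual rule, prone to false positives
    return found

if __name__ == "__main__":
    print([normalize(n) for n in ["CDC28", "cdc28", "cdc28p", "cdc-28"]])
    print(recognize("Cdc28p is a kinase linked to disease"))
```

The second call also flags “disease” as a possible enzyme, which shows why suffix rules need contextual checks or a curated dictionary.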
![Page 29: Literature Mining BMI 730](https://reader037.fdocuments.us/reader037/viewer/2022103006/56812dae550346895d92d5e8/html5/thumbnails/29.jpg)
Entity Recognition (ER)
• Major challenge: lack of standardization of names
  • ‘cdc2’ refers to two completely unrelated genes in budding and fission yeast
  • ‘SDS’ - serine dehydratase gene vs. Sodium Dodecyl Sulfate vs. Shwachman-Diamond syndrome
• Synonymy (AGP1, a.k.a. Amino Acid Permease 1)
• Polysemy
![Page 30: Literature Mining BMI 730](https://reader037.fdocuments.us/reader037/viewer/2022103006/56812dae550346895d92d5e8/html5/thumbnails/30.jpg)
Entity Recognition (ER)
• Simpler version – decide whether a symbol refers to a gene or its product
• iHOP (Information hyperlinked over proteins) http://www.pdg.cnb.uam.es/UniPub/iHOP
![Page 31: Literature Mining BMI 730](https://reader037.fdocuments.us/reader037/viewer/2022103006/56812dae550346895d92d5e8/html5/thumbnails/31.jpg)
Vocabulary
• Many, many
• SNOMED, ICD, …
• ICD (International Statistical Classification of Diseases and Related Health Problems)
![Page 32: Literature Mining BMI 730](https://reader037.fdocuments.us/reader037/viewer/2022103006/56812dae550346895d92d5e8/html5/thumbnails/32.jpg)
Vocabulary
• ICD
573.3 Hepatitis, unspecified
    Toxic (noninfectious) hepatitis
    Use additional E code to identify cause
571.4 Chronic hepatitis
    Excludes: viral hepatitis (acute) (chronic) (070.0-070.9)
571.49 Other
    Chronic hepatitis: active, aggressive
    Recurrent hepatitis
070 Viral hepatitis
    Includes: viral hepatitis (acute) (chronic)
    Excludes: cytomegalic inclusion virus hepatitis (078.5)
![Page 33: Literature Mining BMI 730](https://reader037.fdocuments.us/reader037/viewer/2022103006/56812dae550346895d92d5e8/html5/thumbnails/33.jpg)
Unified Medical Language System (UMLS)
![Page 34: Literature Mining BMI 730](https://reader037.fdocuments.us/reader037/viewer/2022103006/56812dae550346895d92d5e8/html5/thumbnails/34.jpg)
Jensen et al. Nature Reviews Genetics 7, 119–129 (February 2006) | doi:10.1038/nrg1768
![Page 35: Literature Mining BMI 730](https://reader037.fdocuments.us/reader037/viewer/2022103006/56812dae550346895d92d5e8/html5/thumbnails/35.jpg)
Information Extraction (IE)
• Extract pre-defined types of fact, in particular relationships between biological entities.
• Co-occurrence based method
• Natural language processing (NLP) based method
![Page 36: Literature Mining BMI 730](https://reader037.fdocuments.us/reader037/viewer/2022103006/56812dae550346895d92d5e8/html5/thumbnails/36.jpg)
Information Extraction
Usually it requires:
• Identify the relevant sentences
• Parse to extract specific information
• Assume “well-behaved” fact sentences
Using co-occurrence relationships alone does not require parsing or good fact-structure.
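A co-occurrence based extractor can be as simple as counting which entity pairs appear in the same sentence. The sketch below assumes a naive sentence splitter and plain substring matching in place of real entity recognition. The text and the entity list are invented for the example.

```python
import re
from collections import Counter
from itertools import combinations

def cooccurrences(text, entities):
    """Count entity pairs mentioned in the same sentence (no parsing involved)."""
    counts = Counter()
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        lowered = sentence.lower()
        present = sorted(e for e in entities if e in lowered)
        counts.update(combinations(present, 2))  # pairs are stored in alphabetical order
    return counts

# Hypothetical input text and entity list.
TEXT = ("Fish oil reduces blood viscosity. "
        "Blood viscosity is increased in Raynaud's syndrome. "
        "Fish oil was given daily.")
ENTITIES = {"fish oil", "blood viscosity", "raynaud's syndrome"}

if __name__ == "__main__":
    for pair, count in cooccurrences(TEXT, ENTITIES).items():
        print(pair, count)
```

The counts say that two entities are related but not how: “reduces” and “is increased in” are both lost, which is the information an NLP-based method tries to recover.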
![Page 37: Literature Mining BMI 730](https://reader037.fdocuments.us/reader037/viewer/2022103006/56812dae550346895d92d5e8/html5/thumbnails/37.jpg)
Jensen et al. Nature Reviews Genetics 7, 119–129 (February 2006) | doi:10.1038/nrg1768
![Page 38: Literature Mining BMI 730](https://reader037.fdocuments.us/reader037/viewer/2022103006/56812dae550346895d92d5e8/html5/thumbnails/38.jpg)
Text Mining (TM)
• The discovery by computer of new, previously unknown information, by automatically extracting information from different written records.
![Page 39: Literature Mining BMI 730](https://reader037.fdocuments.us/reader037/viewer/2022103006/56812dae550346895d92d5e8/html5/thumbnails/39.jpg)
Text Mining
[Diagram]
Fish Oil → reduces (and co-occurs with) → Blood Viscosity, Platelet aggregability, Vascular Reactivity
Blood Viscosity, Platelet aggregability, Vascular Reactivity → increased in (and co-occurs with) → Raynaud’s Syndrome
Fish Oil → can reduce → Raynaud’s Syndrome
• Based on transitivity of relationships in co-occurrence graph.
• This idea can be used to discover new facts by co-occurrence
• Web Tool: Arrowsmith
[Swanson 86, Swanson 87, Swanson 90, Swanson and Smalheiser 99, Weeber et al. 2001, Stapley & Benoit 2000, Srinivasan 2003, Srinivasan 2004]
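The transitivity idea can be sketched on a small co-occurrence graph: terms that never co-occur with a starting term A, but share intermediate terms B with it, become candidate hypotheses C. This is a toy illustration and not how the Arrowsmith tool is implemented. The edge list simply encodes the fish oil example, and the function names are assumptions.

```python
def build_graph(edges):
    """Undirected co-occurrence graph: term -> set of terms it co-occurs with."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def candidate_hypotheses(graph, start):
    """Terms linked to `start` only through intermediates, with those intermediates."""
    direct = graph.get(start, set())
    candidates = {}
    for intermediate in direct:
        for term in graph.get(intermediate, set()):
            if term != start and term not in direct:
                candidates.setdefault(term, set()).add(intermediate)
    return candidates

# Co-occurrence edges taken from the fish oil / Raynaud's syndrome example.
EDGES = [
    ("fish oil", "blood viscosity"),
    ("fish oil", "platelet aggregability"),
    ("fish oil", "vascular reactivity"),
    ("blood viscosity", "raynaud's syndrome"),
    ("platelet aggregability", "raynaud's syndrome"),
    ("vascular reactivity", "raynaud's syndrome"),
]
GRAPH = build_graph(EDGES)

if __name__ == "__main__":
    for term, via in candidate_hypotheses(GRAPH, "fish oil").items():
        print(term, "via", sorted(via))
```

A candidate supported by several independent intermediates, as here, is a stronger lead than one reached through a single shared term, but it remains a hypothesis to be tested experimentally.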
![Page 40: Literature Mining BMI 730](https://reader037.fdocuments.us/reader037/viewer/2022103006/56812dae550346895d92d5e8/html5/thumbnails/40.jpg)
Jensen et al. Nature Reviews Genetics 7, 119–129 (February 2006) | doi:10.1038/nrg1768
![Page 41: Literature Mining BMI 730](https://reader037.fdocuments.us/reader037/viewer/2022103006/56812dae550346895d92d5e8/html5/thumbnails/41.jpg)
Jensen et al. Nature Reviews Genetics 7, 119–129 (February 2006) | doi:10.1038/nrg1768
![Page 42: Literature Mining BMI 730](https://reader037.fdocuments.us/reader037/viewer/2022103006/56812dae550346895d92d5e8/html5/thumbnails/42.jpg)
Integration: combining text and biological data
Jensen et al. Nature Reviews Genetics 7, 119–129 (February 2006) | doi:10.1038/nrg1768
![Page 43: Literature Mining BMI 730](https://reader037.fdocuments.us/reader037/viewer/2022103006/56812dae550346895d92d5e8/html5/thumbnails/43.jpg)
Jensen et al. Nature Reviews Genetics 7, 119–129 (February 2006) | doi:10.1038/nrg1768