GO Tag: Assigning Gene Ontology Labels to Medline Abstracts
N. Davis, Y.K. Guo, H. Harkema
Natural Language Processing Group
Department of Computer Science
Robert Gaizauskas
M. Ghanem, Tom Barnwell, Y. Guo
Department of Computing
J. Ratcliffe
April 21, 2006 NaCTeM Seminar
Outline
Context
─ Project Background
─ The Gene Ontology
─ GO Annotation in Model Organism Databases
─ Medline
GO Tagging Tasks
─ User types/scenarios
─ Possible tasks
─ Related Work
Data sets/Gold Standards
Approaches and Results to Date
─ Lexical lookup
─ Vector Space Similarity
─ Machine Learning
Exploiting the Results in Search Tools
Project Background
Work is funded by the EPSRC as a Best Practice Project for collaboration between Discovery Net and myGrid, both E-Science Pilot Projects (2001-5)
Both projects
─ have developed text mining and data analysis components, with complementary approaches: NLP vs. data mining/statistical analysis
─ workflow models for co-ordinating distributed services
─ working on life science applications
Aim: to develop a unified real-time e-Science text-mining infrastructure that builds upon and extends the technologies and methods developed by both Discovery Net and myGrid
─ Software engineering challenge: integrate complementary service-based text mining capabilities with different metadata models into a single framework
─ Application challenge: annotate biomedical abstracts with semantic categories from the Gene Ontology
The Gene Ontology
“The Gene Ontology project provides a controlled vocabulary to describe gene and gene product attributes in any organism”
http://www.geneontology.org/
Consists of three structured, controlled vocabularies (ontologies) that describe gene products in terms of their associated
─ biological processes
─ cellular components
─ molecular functions
in a species-independent manner
E.g. gene product cytochrome c can be described by
─ the molecular function term electron transporter activity
─ the biological process terms oxidative phosphorylation and induction of cell death
─ the cellular component terms mitochondrial matrix and mitochondrial inner membrane
Gene Ontology (cont)
From: Gene Ontology: tool for the unification of biology. The Gene Ontology Consortium (2000) Nature Genet. 25: 25-29.
The Gene Ontology (cont)
Started as a joint effort between three model organism databases: FlyBase (Drosophila), the Saccharomyces Genome Database (SGD) and the Mouse Genome Database (MGD)
GO now (08/11/05) contains 19022 terms
GO Slim(s) are reduced versions of GO ontologies containing a subset of GO terms
─ Aim to give a broad overview of ontology content
─ GO Slim Generic currently contains 127 terms
A typical GO Term
Term name: isotropic cell growth
Accession: GO:0051210
Ontology: biological_process
Synonyms: related: uniform cell growth
Definition: “The process by which a cell irreversibly increases in size uniformly in all directions. In general, a rounded cell morphology reflects isotropic cell growth.”
…
GO Annotation in Model Organism DBs
Model organism DBs typically record for each entry (gene) one or more GO codes plus links to the literature supporting the assignment of each GO code
E.g. from the Saccharomyces Genome Database (SGD)
Gene: ACT1
GO annotation: structural constituent of cytoskeleton
Reference: Botstein D, et al. (1997) The yeast cytoskeleton.
Evidence code: TAS (Traceable Author Statement)
Further GO annotations shown for ACT1: histone acetyltransferase complex; exocytosis
Further references shown: Pruyne D and Bretscher A (2000) Polarization of cell growth in yeast. I. Establishment and maintenance; Galarneau L, et al. (2000) Multiple links between the NuA4 histone acetyltransferase complex and epigenetic control of transcription
Further evidence codes shown: TAS (Traceable Author Statement); IDA (Inferred from Direct Assay)
GO evidence codes:
IC: Inferred by Curator
IDA: Inferred from Direct Assay
IEA: Inferred from Electronic Annotation
IEP: Inferred from Expression Pattern
IGI: Inferred from Genetic Interaction
IMP: Inferred from Mutant Phenotype
IPI: Inferred from Physical Interaction
ISS: Inferred from Sequence or Structural Similarity
NAS: Non-traceable Author Statement
ND: No biological Data available
RCA: Inferred from Reviewed Computational Analysis
TAS: Traceable Author Statement
NR: Not Recorded
PubMed
PubMed
─ on-line bibliographic database designed to provide access to citations from biomedical literature
─ developed by the US NCBI at the NLM
─ contains Medline, OldMedline, various other sources
Medline
─ Over 12 million citations dating back to the 1960s
─ Author abstracts and citations from > 4800 biomedical journals
PubMed (cont)
Entrez is NCBI’s integrated, text-based search and retrieval system for the major databases it maintains
User Types
Research Geneticists
─ Narrow information interest
  • Particular gene
  • Particular activity/functionality
Model Organism Genome DB Curators
─ Broader information interest
─ Typically track a number of publications, seeking to enhance information stored in the model organism genome DB at the locus level
User Scenarios: Research Geneticist
Possible scenarios using GO tagging to support a research geneticist include:
─ Search result presentation
  • Tag abstracts returned from a PubMed search with GO codes
  • Use GO codes to cluster/structure search results to support more effective information access
─ Structuring of related literature as workflow side-effect
  • Many typical researcher workflows involve BLAST searches yielding BLAST/Swissprot reports
  • Workflow can automatically assemble a set of “related” papers by extracting PMIDs of homologous genes/proteins from reports and collecting these abstracts plus, optionally, others closely related by text similarity
  • Resulting abstract set can be clustered/structured by GO terms and presented to researcher
(Integrating Text Mining Services into Distributed Bioinformatics Workflows: A Web Services Implementation. Gaizauskas, Davis, Demetriou, Guo and Roberts. In Proceedings of the IEEE International Conference on Services Computing (SCC 2004), 2004.)
Search Result Presentation: Motivating Example
One of the genes involved in the cognitive/social elements of Williams Beuren syndrome is LIM Kinase 1 (LIMK1/LIMK-1)
Putting LIM Kinase into Entrez gives 146 possible papers of interest.
However, a search in the model organism corpus for LIM Kinase yields only 5 papers but a high number of associated GO codes (and this is from only partially annotated papers):
GO:0006468 : biological_process : protein amino acid phosphorylation
GO:0004674 : molecular_function : protein serine/threonine kinase activity
GO:0004672 : molecular_function : protein kinase activity
GO:0007283 : biological_process : spermatogenesis
GO:0008064 : biological_process : regulation of actin polymerization and/or depolymerization
GO:0005515 : molecular_function : protein binding
GO:0005634 : cellular_component : nucleus
GO:0005925 : cellular_component : focal adhesion
GO:0005515 : molecular_function : protein binding
This suggests that even a single gene may be involved in numerous roles, and that clustering according to GO codes may give a more focused method of searching than simply supplying more and more keywords, which may remove useful and important papers from the result set.
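The clustering idea in this example can be sketched as a simple grouping of tagged abstracts by GO code. This is only an illustrative sketch: the `cluster_by_go` helper and the `PMID-A`, `PMID-B`, `PMID-C` identifiers are hypothetical placeholders, and the assignment of codes to abstracts is invented for illustration; only the GO codes themselves come from the LIM Kinase list.

```python
from collections import defaultdict


def cluster_by_go(tagged):
    """Group PMIDs by GO code.

    `tagged` maps each PMID to the set of GO codes assigned to its abstract.
    """
    clusters = defaultdict(set)
    for pmid, codes in tagged.items():
        for code in codes:
            clusters[code].add(pmid)
    return dict(clusters)


# Hypothetical search results (placeholder PMIDs, invented assignments).
tagged = {
    "PMID-A": {"GO:0004672", "GO:0007283"},
    "PMID-B": {"GO:0004672", "GO:0005925"},
    "PMID-C": {"GO:0005925"},
}

clusters = cluster_by_go(tagged)
for code in sorted(clusters):
    print(code, sorted(clusters[code]))
```

An abstract with several GO codes appears in several clusters, which matches the observation that a single gene may be involved in numerous roles.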
User Scenarios: Model Organism DB Curator
Possible scenarios using GO tagging/text mining to support DB curators include:
─ Help assemble texts that may support GO code assignment
  • GO tag texts in curator’s watching brief
  • Automated tagging could act as prompt for/check on curator’s judgement
─ Help to determine gene-GO term pairs for annotation
  • Perform GO tagging/gene name identification at text level and suggest all pairs as candidates
  • Perform GO tagging/gene name identification at sentence level and suggest candidates
  • Attempt to assign GO evidence codes
    - To text segments providing evidence for GO code assignment, without identifying the GO code/gene pair to which the evidence pertains
    - To text segments providing evidence, plus the GO code/gene pair to which the evidence pertains
Possible Tasks (1)
Assigning GO codes to abstracts/full papers
Given: a set of texts (PubMed abstracts/full papers) and the GO/GO Slim ontology
Task: assign 0 or more GO codes to a text iff the text is “about” the function/process/component identified by the code (assume most specific code only assigned)
Note in this task there is no association of GO code with any specific gene/gene product
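The "most specific code only" assumption can be illustrated with a small filter that drops any assigned code that is an ancestor of another assigned code. This is a sketch under stated assumptions: the `parents` map is a toy hierarchy written for illustration (it is not read from GO), term names stand in for GO codes, and the helper names are hypothetical.

```python
def ancestors(term, parents):
    """Return all ancestors of `term` in a DAG given as {child: [parent, ...]}."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def most_specific(assigned, parents):
    """Keep only assigned terms that are not ancestors of another assigned term."""
    covered = set()
    for term in assigned:
        covered |= ancestors(term, parents)
    return {term for term in assigned if term not in covered}


# Toy hierarchy for illustration only.
parents = {
    "isotropic cell growth": ["cell growth"],
    "cell growth": ["growth"],
}

assigned = {"cell growth", "isotropic cell growth", "exocytosis"}
print(sorted(most_specific(assigned, parents)))
```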
Possible Tasks (2)
Assigning GO codes to genes/gene products in abstracts/full papers
Given: a set of texts (PubMed abstracts/full papers) and the GO/GO Slim ontology
Task: If the text supports the assignment of one or more GO codes to a gene/gene product, identify gene/gene product-GO code pairs and the text supporting the assignments
This capability would support additional tasks
─ Given a particular gene/gene product and a text collection, find all GO codes for the gene/gene product across the collection
─ Given a GO code and a text collection, find all genes/gene products tagged with the code across the collection
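Both additional tasks reduce to indexing the extracted pairs in two directions. A minimal sketch, assuming the pairs have already been extracted; `index_pairs` is a hypothetical helper, `GENE-X` is a placeholder gene name, and the ACT1 annotations are the ones shown in the SGD example.

```python
from collections import defaultdict


def index_pairs(pairs):
    """Build gene -> GO terms and GO term -> genes indexes from (gene, term) pairs."""
    by_gene = defaultdict(set)
    by_term = defaultdict(set)
    for gene, term in pairs:
        by_gene[gene].add(term)
        by_term[term].add(gene)
    return dict(by_gene), dict(by_term)


pairs = [
    ("ACT1", "structural constituent of cytoskeleton"),
    ("ACT1", "exocytosis"),
    ("GENE-X", "exocytosis"),  # placeholder gene, for illustration only
]

by_gene, by_term = index_pairs(pairs)
print(sorted(by_gene["ACT1"]))
print(sorted(by_term["exocytosis"]))
```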
Possible Tasks (3)
Assigning evidence codes to genes/gene products-GO code pairings in abstracts/full papers
Given: a set of texts (PubMed abstracts/full papers) and the GO/GO Slim ontology
Task: As in Task 2, but additionally supply the evidence codes
A weaker variant of this is just to suggest evidence text that may assist in the assignment of GO code
Related Work
Raychaudhuri, Chang, Sutphin & Altman (2002)
─ Task: associate GO codes with genes by
  1. Associating GO codes with papers
  2. Associating a specific GO code with a gene if a sufficient number of papers mentioning the gene have the GO code associated with them
─ Method: Treat 1. as a document classification task and evaluate maximum entropy, Naïve Bayes and Nearest Neighbours approaches
─ Evaluation: corpus of 20,000 Medline abstracts assigned one or more of 21 GO terms/categories
─ Results: maximum entropy best, with 72.8% classification accuracy over 21 categories
Related Work
Go-KDS (Smith & Cleary, 2003)
─ Product of Reel Two
─ Task: assign arbitrary GO terms to PubMed articles
─ Method:
  • Proprietary Weighted Confidence learner (similar to Naïve Bayes), using only words as features
  • Trained on gene/protein DBs which use GO codes AND have links to Medline
─ Evaluated on approx. same data/task as Raychaudhuri et al.: 70.5% accuracy
Related Work
GoPubMed
─ On-going work at Dresden University (www.gopubmed.org)
─ Task: Annotate PubMed abstracts with GO terms
─ Method: Use a local sequence alignment algorithm with weighted term matching (to overcome limits of strict matching) between GO terms and strings in texts
─ Evaluation: None reported
Kiritchenko et al. (U. of Ottawa)
─ Task: assign arbitrary GO terms to biomedical texts
─ Method: Treat task as hierarchical text classification; use AdaBoost.MH
─ Evaluation: introduce hierarchical evaluation measure; results unclear
Related Work (cont)
Biocreative challenge: task 2 contained three related subtasks
1. Given an article, a protein and a GO code, where the article justifies the assignment of the GO code to the protein, find evidence text in the article supporting the assignment
2. Given protein-article pairs plus the number of GO code assignments supported by the article, find the GO code(s) that should be assigned to the protein based on the article
3. Given a set of proteins, retrieve a set of papers relevant to assigning codes to the proteins plus the GO code annotations and the supporting passages (not evaluated)
Results indicated no systems ready for practical use
Issues: lack of training data; complexity of tasks
Related Work (cont)
TREC Genomics Track 2004: three tasks related to GO code assignment
1. Triage: given a set of articles, find those that contain some evidence for the assignment of a GO code, i.e. warrant being curated
2. Given an article and names of genes occurring in the article, assign one or more of the top three GO ontologies from which human curators had assigned codes
3. Task 2 plus provide evidence code supporting each gene-GO hierarchy label association
Results for all three tasks poor
Data Sets and Evaluation
In order to assess performance of GO tag assignment, a gold standard manually annotated/verified corpus is needed
However, no such corpus exists …
Solution 1: SGD Gold Standard
─ Derive a corpus from SGD model organism database (yeast)
─ Assemble all Medline abstracts cited as evidence supporting assignment of GO terms
─ Associate with each abstract the GO term whose assignment it is cited as supporting
─ I.e. given the annotated genes in SGD, assign a GO term T to a paper P if the paper P is referenced in support of a Gene-GO term association involving T
─ SGD Gold Standard
  • 4922 PMIDs
  • 2455 GO terms
  • 10485 PMID-GO term pairs
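The derivation rule for the SGD gold standard can be sketched as a projection of curated (gene, GO term, PMID) triples onto (PMID, GO term) pairs. This is a sketch only: the triple layout, the `derive_gold_standard` helper, and the `PMID-1`/`PMID-2`/`GENE-X` values are assumptions made for illustration and do not reflect the actual SGD export format.

```python
def derive_gold_standard(annotations):
    """Project (gene, go_term, pmid) triples onto a set of (pmid, go_term) pairs.

    A term T is attached to paper P if P is cited for any gene-T association.
    """
    return {(pmid, go_term) for _gene, go_term, pmid in annotations}


def summarize(pairs):
    """Count distinct PMIDs, distinct GO terms, and PMID-GO term pairs."""
    pmids = {pmid for pmid, _ in pairs}
    terms = {term for _, term in pairs}
    return len(pmids), len(terms), len(pairs)


# Placeholder annotations for illustration only.
annotations = [
    ("ACT1", "structural constituent of cytoskeleton", "PMID-1"),
    ("ACT1", "exocytosis", "PMID-2"),
    ("GENE-X", "exocytosis", "PMID-2"),
    ("GENE-X", "nucleus", "PMID-1"),
]

gold = derive_gold_standard(annotations)
print(summarize(gold))
```

Because the gene is dropped, two genes citing the same paper for the same term collapse into one PMID-GO term pair.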
Data Sets and Evaluation: SGD “Gold” Standard
Advantages:
─ Data already exists: no extra annotation work required
─ Can assemble similar corpora for each model organism DB
Disadvantages:
─ Each abstract has associated with it GO terms whose assignment to specific genes it supports, but may be missing other GO terms which can also be legitimately attached to it
─ Not every paper supporting a GO term assignment will be cited
Consequence:
─ SGD gold standard is “GO term incomplete”
  • Weak measure of recall
  • Precision figures difficult to interpret
Further issue:
─ SGD Gene-GO term assignments are based on full papers, whereas system only has access to abstracts
Consequence:
─ Limit on maximum Recall obtainable by system
Data Sets and Evaluation (cont)
Solution 2: IC Gold Standard
─ Manually extend the GO annotation of abstracts derivable from the SGD
  • Goal: GO term complete gold standard
─ Selected a subset (~800) for which support for all the assigned GO codes is found in the abstract (rather than the full paper)
─ Manually added additional GO annotations using a combination of fuzzy matching against GO and some manual addition of synonyms during checking
  • For included terms, include lowest within each ontology
  • ‘cell wall biosynthesis’ => ‘cell wall biosynthesis’ ‘cell wall’
─ Also applied same methodology to evidence paragraphs: brief summaries written by curators deliberately using GO vocabulary
IC Gold Standard
─ 785 PMIDs
─ 1006 GO terms
─ 5170 PMID-GO term pairs
Data Sets and Evaluation (cont)
Advantages:
─ Much closer to a GO-term complete gold standard
Disadvantages:
─ Still not GO-term complete
  • Method of creation suggests there may still be many unannotated GO terms that ought to be marked (direct mentions of GO terms vs. semantically entailed GO terms)
─ Gold Standard creation method favors lexical look-up approach to GO-tagging
─ Dataset is small
The GO Tagging Task Addressed
The approaches we investigated all considered Task 1, as defined earlier:
─ Given: a set of texts (PubMed abstracts/full papers) and the GO/GO Slim ontology
─ Task: assign 0 or more GO codes to a text iff the text is “about” the function/process/component identified by the code (assume most specific code only assigned)
Approach 1: Lexical Lookup Using Termino
Termino: a large-scale terminological resource to support term processing for information extraction, retrieval, and navigation
Termino contains a database holding large numbers of terms imported from various existing terminological resources, e.g., UMLS, GO
Efficient recognition of terms in text is achieved through use of finite state recognizers compiled from contents of database
The results of lexical look-up in Termino can feed into further term processing components, e.g., term parser
Available as a Web Service (see http://nlp.shef.ac.uk)
Termino Terminology Engine
[Figure: Termino architecture. Components shown: Text in, Term Induction, Term DB, Finite State Look-Up, Term Parser, Text out, Source-Specific Loaders. Inputs shown: raw texts (Medline abstracts) and existing terminological resources (GO, UMLS, …).]
Example entries shown:
─ Neurofibromin: GO annotations: 0008181: tumor suppressor; 0005737: cytoplasm; …
─ Peptidyl-prolyl isomerase: type: protein term; source: induced from Medline; …
─ Mastectomy: UMLS data: CUI: C0024881; semantic type: therapeutic or preventive procedure; synonyms: mammectomy; …
Lexical Look-Up for GO Tag
Termino
─ Imported names of all terms in GO, plus their GO ids and namespace attributes (18270 names in total)
─ GO term synonyms
─ SGD yeast gene names
Recognition of terms in text
─ Case-insensitive
─ “Simple” morphological variants are recognized
  • Cells mapped onto cell
  • Mitochondrial, mitochondria not mapped onto mitochondrion
Lexical Look-Up for GO Tag (cont)
GO code assignment
─ GO term T is assigned to text iff name of T is recognized in text
─ Extensions:
  • GO term T is assigned to a paper if a synonym of term T occurs in the abstract of the paper
  • GO term T is assigned to a paper if a yeast gene name associated with term T occurs in the abstract of the paper
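The assignment rule and its extensions can be sketched with a dictionary-based phrase matcher. This is not Termino: it swaps the compiled finite state recognizers for a plain n-gram lookup, and its crude plural stripping is only an assumed stand-in for the "simple" morphological variants. All function names are hypothetical. Synonyms or gene names are handled by listing them as extra names for a GO id.

```python
import re


def normalize(text):
    """Lowercase, tokenize, and strip one trailing plural 's' per token.

    The same rule is applied to lexicon entries and to text, so a crude
    stem such as 'nucleu' for 'nucleus' still matches consistently.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return tuple(
        t[:-1] if len(t) > 3 and t.endswith("s") and not t.endswith("ss") else t
        for t in tokens
    )


def build_lexicon(names_by_id):
    """Map each normalized name (term name, synonym or gene name) to GO ids."""
    lexicon = {}
    for go_id, names in names_by_id.items():
        for name in names:
            lexicon.setdefault(normalize(name), set()).add(go_id)
    return lexicon


def tag(text, lexicon):
    """Assign GO id T to the text iff some name of T occurs as a token n-gram."""
    tokens = normalize(text)
    longest = max((len(key) for key in lexicon), default=0)
    found = set()
    for start in range(len(tokens)):
        for length in range(1, longest + 1):
            found |= lexicon.get(tokens[start:start + length], set())
    return found


lexicon = build_lexicon({
    "GO:0051210": ["isotropic cell growth", "uniform cell growth"],  # name + synonym
    "GO:0005634": ["nucleus"],
})
print(sorted(tag("Uniform cell growth was observed near the nucleus.", lexicon)))
```

As on the slide, `cells` normalizes to `cell`, while `mitochondria` is not mapped onto `mitochondrion`.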
Lexical Lookup Results for GO Slim
SGD Dataset
                        P        R        F
GO Term                 35.27%   52.36%   42.15%
Yeast Term              22.87%   91.76%   36.62%
GO Synonyms             37.86%   34.20%   35.94%
GO + Yeast              21.15%   93.59%   34.50%
GO + Synonyms           32.94%   64.65%   43.65%
GO + Synonyms + Yeast   20.53%   94.17%   33.72%
IC Dataset
                        P        R        F
GO Term                 98.62%   79.33%   87.93%
Yeast Term              37.94%   75.13%   50.42%
GO Synonyms             70.49%   33.31%   45.24%
GO + Yeast              43.36%   94.76%   59.50%
GO + Synonyms           85.52%   88.35%   86.91%
GO + Synonyms + Yeast   42.42%   95.95%   58.83%
Lexical Lookup Results for GO Full
SGD Dataset
                        P        R        F
GO Term                  7.33%   15.95%   10.05%
Yeast Term               7.97%   84.42%   14.57%
GO Synonyms              6.46%    7.63%    7.00%
GO + Yeast               6.93%   85.66%   12.82%
GO + Synonyms            6.87%   22.55%   10.54%
GO + Synonyms + Yeast    6.49%   86.14%   12.08%
IC Dataset
                        P        R        F
GO Term                 90.52%   71.30%   79.77%
Yeast Term               9.26%   31.43%   14.30%
GO Synonyms             29.65%   11.53%   16.60%
GO + Yeast              21.00%   83.73%   33.58%
GO + Synonyms           69.93%   80.04%   74.65%
GO + Synonyms + Yeast   20.70%   88.38%   33.54%
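The P, R and F columns follow the usual definitions, with F the harmonic mean of precision and recall. The sketch below scores predicted PMID-GO term pairs against gold pairs pooled over the whole dataset; that pooling is an assumption, since the slides do not say how scores were averaged, and the pair values are placeholders.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f(predicted, gold):
    """Score predicted (PMID, GO term) pairs against gold-standard pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f_measure(precision, recall)


# Placeholder pairs for illustration only.
predicted = {("PMID-1", "A"), ("PMID-1", "B"), ("PMID-2", "A")}
gold = {("PMID-1", "A"), ("PMID-2", "A"), ("PMID-2", "C"), ("PMID-3", "D")}
print(precision_recall_f(predicted, gold))

# Consistency check against the GO Slim / SGD "GO Term" row (P 35.27, R 52.36).
print(round(f_measure(35.27, 52.36), 2))  # 42.15
```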
Lexical Lookup Approach: Discussion
Recall
─ Effect of curators using full text (SGD) vs. abstracts only (IC)
─ Inherent drawbacks of lexical look-up: term variation, literal mentions
─ Effects of Gold Standard creation method (IC)
Precision
─ Effects of Gold Standard creation method (IC)
GO vs. GO Slim
─ Recognizing GO Slim terms is easier than recognizing GO terms
Effects of extensions (synonyms/gene names) on performance
─ Adding synonyms: variable decrease in Precision, substantial increase in Recall
─ Adding yeast terms: substantial decrease in Precision, substantial increase in Recall
Error Analysis
False negatives for abstracts:
─ Abbreviation: mismatch repair (GO name) vs. MMR (in text)
─ Permutation, derivation: regulation of translation vs. regulated translation, sporulation vs. sporulate
─ Truncation: galactokinase activity vs. galactokinase
─ Alternative descriptions: protein catabolism vs. proteins for degradation, autophagic vacuole vs. autophagosomal
Approach 2: IR-based Vector Space Similarity
Document Collection
─ Build a collection of “GO documents” where each GO document consists of GO term, its synonyms and its definition sentence
Query
─ Treat each abstract to which GO codes are to be assigned as a query against the GO document collection
Retrieval
─ Given a query (i.e. abstract) retrieve relevant “GO documents” (i.e. GO terms)
─ Assign to an abstract the top 1, 5, 10 … GO terms which are most “similar” as measured by the Vector Space Model (VSM)
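The retrieval step can be sketched with a small TF-IDF and cosine-similarity ranker. This is a generic vector space model written for illustration, not the weighting scheme of any particular search engine; the function names are hypothetical, stop word removal and stemming are omitted, and two of the three toy GO documents contain only the term name.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(go_docs):
    """Build IDF weights and TF-IDF vectors for {go_id: document text}."""
    counts = {go_id: Counter(tokenize(text)) for go_id, text in go_docs.items()}
    doc_freq = Counter()
    for term_counts in counts.values():
        doc_freq.update(term_counts.keys())
    n_docs = len(go_docs)
    idf = {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}
    vectors = {
        go_id: {term: tf * idf[term] for term, tf in term_counts.items()}
        for go_id, term_counts in counts.items()
    }
    return idf, vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(abstract, idf, vectors, k=5):
    """Return up to k GO ids most similar to the abstract (zero scores dropped)."""
    query_counts = Counter(tokenize(abstract))
    query = {t: tf * idf[t] for t, tf in query_counts.items() if t in idf}
    scored = sorted(
        ((cosine(query, vec), go_id) for go_id, vec in vectors.items()),
        key=lambda item: (-item[0], item[1]),
    )
    return [go_id for score, go_id in scored[:k] if score > 0]


go_docs = {
    "GO:0051210": "isotropic cell growth uniform cell growth The process by which "
                  "a cell irreversibly increases in size uniformly in all directions",
    "GO:0005634": "nucleus",
    "GO:0007283": "spermatogenesis",
}
idf, vectors = build_index(go_docs)
print(top_k("The cell increases in size uniformly", idf, vectors, k=1))
```

The value of `k` plays the role of the top 1, 5, 10 … cut-off.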
IR-based Approach
Indexed the GO documents using the Lucene search engine
Standard IR preprocessing: tokenization, stop word removal, case normalization, stemming
4 indices were built, varying according to whether they used
─ Standard GO or GO Slim
─ A GO document consisting of the GO term text (name + definition) alone, or of that text plus its ancestor GO terms
Used standard weighting scheme included in Lucene
Postprocessing:
─ Re-weighting: give credit to duplicated GO documents (found on more than one path back to root)
─ Threshold: the number of relevant GOIDs to return
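The difference between the flat and the hierarchical GO documents can be sketched as follows. The hierarchy and texts are toy values written for illustration (not read from GO), the helper names are hypothetical, and the re-weighting step is not modelled.

```python
def ancestor_terms(term, parents):
    """Collect every ancestor of `term` from a {child: [parent, ...]} map."""
    found = set()
    for parent in parents.get(term, ()):
        found.add(parent)
        found |= ancestor_terms(parent, parents)
    return found


def go_document(term, texts, parents, hierarchical=False):
    """Return the text indexed for `term`: its own text, optionally plus ancestors' text."""
    parts = [texts[term]]
    if hierarchical:
        parts += [texts[a] for a in sorted(ancestor_terms(term, parents))]
    return " ".join(parts)


# Toy data for illustration only.
texts = {
    "growth": "growth",
    "cell growth": "cell growth",
    "isotropic cell growth": "isotropic cell growth uniform in all directions",
}
parents = {
    "isotropic cell growth": ["cell growth"],
    "cell growth": ["growth"],
}

print(go_document("isotropic cell growth", texts, parents))
print(go_document("isotropic cell growth", texts, parents, hierarchical=True))
```

In the hierarchical variant, general words such as "cell" and "growth" are repeated in the documents of all descendant terms.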
IR-Based Results
Better performance on IC abstracts than on SGD abstracts
Hierarchical documents do slightly worse than flat documents
─ Discriminatory effect of specific GO terms may be reduced by occurrence of general terms such as cell and protein
Approach 3: Machine Learning
Variety of text classification algorithms: Naïve Bayes, Decision Tree, SVM classifier, …
─ Naïve Bayes predicts only one GO term per abstract
─ SGD GS: 2.1 GO terms/abstract; IC GS: 6.6 GO terms/abstract
Features: words, frequent phrases
─ Preprocessing steps: tokenization, removal of stop words, stemming
Training on 66% of annotated data, evaluation on remainder of data
GO term assignments vis-à-vis generic GO Slim to mitigate data sparsity problems
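A single-label word-based classifier with a 66% training split can be sketched with a small multinomial Naïve Bayes. This is a generic textbook version with add-one smoothing, not the toolkit used in the experiments; the six training sentences and the two category labels are invented for illustration, and stop word removal and stemming are omitted.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(examples):
    """Fit a multinomial Naive Bayes model from (text, label) examples."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for token in tokenize(text):
            word_counts[label][token] += 1
            vocab.add(token)
    return label_counts, word_counts, vocab


def predict(model, text):
    """Return the single most probable label (one label per abstract)."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total)
        denominator = sum(word_counts[label].values()) + len(vocab)
        for token in tokenize(text):
            if token in vocab:  # unseen words are ignored
                score += math.log((word_counts[label][token] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy data for illustration only.
data = [
    ("actin filaments organise the cytoskeleton", "cytoskeleton"),
    ("histone acetylation regulates transcription", "transcription"),
    ("cytoskeleton polarity depends on actin cables", "cytoskeleton"),
    ("transcription factors bind histone marks", "transcription"),
    ("actin cytoskeleton remodelling", "cytoskeleton"),
    ("histone transcription control", "transcription"),
]
split = int(len(data) * 0.66)  # train on 66%, evaluate on the remainder
model = train(data[:split])
correct = sum(predict(model, text) == label for text, label in data[split:])
print(f"{correct}/{len(data) - split} held-out examples correct")
```

Because `predict` returns exactly one label, recall is capped whenever the gold standard attaches several GO terms to an abstract.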
Machine Learning Results
One GO term vs. multiple GO terms per abstract makes a difference
Higher precision scores than lexical look-up (SGD): GO terms directly mentioned in text cannot be assigned if those GO terms are not present in the training set
Oracle Text Decision Tree (IC): classifier learns systematic, strong correlation between words in text and words in GO terms
Comparison of Approaches
Best F scores for GO Slim
─ SGD Gold Standard
       R      P      F
LLU    64.6   32.9   43.6
IR     51.5   26.2   34.7
ML     36.8   51.6   43.0
─ IC Gold Standard
       R      P      F
LLU    79.3   98.6   87.9
IR     59.5   37.6   46.1
ML     76.5   83.0   79.6
Exploiting the Results in Search Tools
[Screenshots of the search tool. Input page callouts: “Input keywords here”; “Upload a file containing a list of Medline abstracts”; “Type/paste free texts to get results”. Results page callouts: “Click for the abstract details”; “Click for the GO definition”; “Search for the abstracts with similar GO annotations”. Results browser panels: GO Hierarchy; Abstract Titles; Abstract Bodies; GO Labels/Gene Names.]
Conclusions
GO tagging is an “interesting” task that offers significant potential benefits to research biologists and bioinformaticians
─ Several increasingly complex/valuable variants of the task can be identified
Simple techniques
─ Direct term matching
─ IR-type text matching
─ Machine Learning text classification methods
have been assessed for their level of performance on the simplest task: assigning GO terms to texts at the whole text level
Evaluation methods/resources are critical issues
Effectively utilising imperfect text mining results in end user applications is challenging
Future Work
Enhancements to each of the 3 simple approaches
Combining 3 simple approaches into a hybrid system
Look at other tasks:
─ Extracting GO term-gene/gene product pairs
─ Assigning evidence codes
Improving resources and methodology for evaluating the technology
End-user evaluation of search tools employing this technology
END
Reference:
Davis, N., Harkema, H., Gaizauskas, R., Guo, Y.K., Ghanem, M., Barnwell, T., Guo, Y. and Ratcliffe, J. (2006) Three Approaches to GO-Tagging Biomedical Abstracts. In Proceedings of the Second International Symposium on Semantic Mining in Biomedicine (SMBM06), Jena, April 2006.
Available from http://www.dcs.shef.ac.uk/~robertg/publications/