GO Tag: Assigning Gene Ontology Labels to Medline Abstracts

N. Davis, Y.K. Guo, H. Harkema, Robert Gaizauskas
Natural Language Processing Group
Department of Computer Science

M. Ghanem, Tom Barnwell, Y. Guo, J. Ratcliffe
Department of Computing

April 21, 2006 NaCTeM Seminar

Outline

Context
─ Project Background
─ The Gene Ontology
─ GO Annotation in Model Organism Databases
─ Medline

GO Tagging Tasks
─ User types/scenarios
─ Possible tasks
─ Related Work

Data sets/Gold Standards

Approaches and Results to Date
─ Lexical lookup
─ Vector Space Similarity
─ Machine Learning

Exploiting the Results in Search Tools


Project Background

Work is funded by the EPSRC as a Best Practice Project for collaboration between DiscoveryNet and myGrid, both E-Science Pilot Projects (2001-5)

Both projects
─ have developed text mining and data analysis components, with complementary approaches: NLP vs. data mining/statistical analysis
─ workflow models for co-ordinating distributed services
─ working on life science applications

Aim: to develop a unified real-time e-Science text-mining infrastructure that builds upon and extends the technologies and methods developed by both Discovery Net and myGrid

─ Software engineering challenge: integrate complementary service-based text mining capabilities with different metadata models into a single framework

─ Application challenge: annotate biomedical abstracts with semantic categories from the Gene Ontology


The Gene Ontology

“The Gene Ontology project provides a controlled vocabulary to describe gene and gene product attributes in any organism”

http://www.geneontology.org/

Consists of three structured, controlled vocabularies (ontologies) that describe gene products, in a species-independent manner, in terms of their associated:
─ biological processes
─ cellular components
─ molecular functions

E.g. gene product cytochrome c can be described by
─ the molecular function term electron transporter activity
─ the biological process terms oxidative phosphorylation and induction of cell death
─ the cellular component terms mitochondrial matrix and mitochondrial inner membrane


Gene Ontology (cont)

[Figure from: Gene Ontology: tool for the unification of biology. The Gene Ontology Consortium (2000) Nature Genet. 25: 25-29.]


The Gene Ontology (cont)

Started as a joint effort between three model organism databases: FlyBase (Drosophila), the Saccharomyces Genome Database (SGD) and the Mouse Genome Database (MGD)

GO now (08/11/05) contains 19022 terms

GO Slim(s) are reduced versions of GO ontologies containing a subset of GO terms
─ Aim to give a broad overview of ontology content
─ GO Slim Generic currently contains 127 terms

A typical GO Term

Term name: isotropic cell growth

Accession: GO:0051210

Ontology: biological_process

Synonyms: related: uniform cell growth

Definition: “The process by which a cell irreversibly increases in size uniformly in all directions. In general, a rounded cell morphology reflects isotropic cell growth.”

…
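As a rough illustration, the fields shown on this slide could be held in a small record type. This is only a sketch: the class name `GOTerm` and its field names are assumptions made here for illustration, not part of the GO distribution or of the GO Tag system, and the fields elided on the slide ("…") are left out.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class GOTerm:
    """Hypothetical container for the GO term fields listed on the slide."""
    accession: str                      # e.g. "GO:0051210"
    name: str                           # term name
    ontology: str                       # one of the three GO ontologies
    synonyms: Tuple[Tuple[str, str], ...] = ()  # (synonym type, synonym text)
    definition: str = ""


# The example term from the slide.
term = GOTerm(
    accession="GO:0051210",
    name="isotropic cell growth",
    ontology="biological_process",
    synonyms=(("related", "uniform cell growth"),),
    definition=(
        "The process by which a cell irreversibly increases in size "
        "uniformly in all directions. In general, a rounded cell "
        "morphology reflects isotropic cell growth."
    ),
)
```

The name, synonyms and definition are the pieces of text that the approaches described later match against abstracts.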


GO Annotation in Model Organism DBs

Model organism DBs typically record for each entry (gene) one or more GO codes plus links to the literature supporting the assignment of the GO code

E.g. from the Saccharomyces Genome Database (SGD)

Gene    GO Annotation    Reference    Evidence code

ACT1

structural constituent of cytoskeleton

Botstein D, et al. (1997) The yeast cytoskeleton.

TAS : Traceable Author Statement

Pruyne D and Bretscher A (2000) Polarization of cell growth in yeast. I. Establishment and maintenance

Botstein D, et al. (1997) The yeast cytoskeleton.

histone acetyltransferase complex

exocytosis

Galarneau L, et al. (2000) Multiple links between the NuA4 histone acetyltransferase complex and epigenetic control of transcription

TAS : Traceable Author Statement

IDA : Inferred from Direct Assay

IC: Inferred by Curator

IDA: Inferred from Direct Assay

IEA: Inferred from Electronic Annotation

IEP: Inferred from Expression Pattern

IGI: Inferred from Genetic Interaction

IMP: Inferred from Mutant Phenotype

IPI: Inferred from Physical Interaction

ISS: Inferred from Sequence or Structural Similarity

NAS: Non-traceable Author Statement

ND: No biological Data available

RCA: inferred from Reviewed Computational Analysis

TAS: Traceable Author Statement

NR: Not Recorded


PubMed

PubMed
─ on-line bibliographic database designed to provide access to citations from biomedical literature
─ developed by the US NCBI at the NLM
─ contains Medline, OldMedline, various other sources

Medline
─ Over 12 million citations dating back to the 1960s
─ Author abstracts and citations from > 4800 biomedical journals

─ Author abstracts and citations from > 4800 biomedical journals


PubMed (cont)

Entrez is NCBI’s integrated, text-based search and retrieval system for the major databases it maintains









User Types

Research Geneticists
─ Narrow information interest
    Particular gene
    Particular activity/functionality

Model Organism Genome DB Curators
─ Broader information interest
─ Typically track a number of publications, seeking to enhance information stored in the model organism genome DB at the locus level


User Scenarios: Research Geneticist

Possible scenarios using GO tagging to support a research geneticist include:

─ Search result presentation:
    Tag abstracts returned from a PubMed search with GO codes
    Use GO codes to cluster/structure search results to support more effective information access

─ Structuring of related literature as workflow side-effect
    Many typical researcher workflows involve BLAST searches yielding BLAST/Swissprot reports
    Workflow can automatically assemble a set of “related” papers by extracting PMIDs of homologous genes/proteins from reports and collecting these abstracts plus, optionally, others closely related by text similarity
    Resulting abstract set can be clustered/structured by GO terms and presented to researcher

(Integrating Text Mining Services into Distributed Bioinformatics Workflows: A Web Services Implementation. Gaizauskas, Davis, Demetriou, Guo and Roberts. In Proceedings of the IEEE International Conference on Services Computing (SCC 2004), 2004.)


Search Result Presentation: Motivating Example

One of the genes involved in the cognitive/social elements of Williams Beuren syndrome is LIM Kinase 1 (LIMK1/LIMK-1)

Putting LIM Kinase into Entrez gives 146 possible papers of interest. However, a search in the model organism corpus for LIM Kinase yields only 5 papers but a high number of associated GO codes (and this is from only partially annotated papers):

GO:0006468 : biological_process : protein amino acid phosphorylation

GO:0004674 : molecular_function : protein serine/threonine kinase activity

GO:0004672 : molecular_function : protein kinase activity

GO:0007283 : biological_process : spermatogenesis

GO:0008064 : biological_process : regulation of actin polymerization and/or depolymerization

GO:0005515 : molecular_function : protein binding

GO:0005634 : cellular_component : nucleus

GO:0005925 : cellular_component : focal adhesion

This suggests that even a single gene may be involved in numerous roles, and that clustering according to GO codes may give a more focused method of searching than simply supplying more and more keywords, which may remove useful and important papers from the result set.
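The idea of structuring a result set by GO code can be sketched as a simple inversion from abstracts to codes. The function name `cluster_by_go` and the PMID placeholders below are illustrative assumptions; the GO codes are taken from the list above, but which abstract carries which code is invented for the example.

```python
from collections import defaultdict


def cluster_by_go(tagged_abstracts):
    """Group abstract identifiers under each GO code they were tagged with.

    tagged_abstracts: dict mapping an abstract id to a set of GO codes.
    Returns a dict mapping each GO code to a sorted list of abstract ids.
    """
    clusters = defaultdict(list)
    for pmid, codes in tagged_abstracts.items():
        for code in codes:
            clusters[code].append(pmid)
    return {code: sorted(pmids) for code, pmids in clusters.items()}


# Placeholder identifiers and invented assignments, for illustration only.
tagged = {
    "PMID-A": {"GO:0006468", "GO:0005634"},
    "PMID-B": {"GO:0006468"},
    "PMID-C": {"GO:0007283"},
}
clusters = cluster_by_go(tagged)
```

An abstract tagged with several codes appears in several clusters, which matches the observation that a single gene may be involved in numerous roles.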







User Scenarios: Model Organism DB Curator

Possible scenarios using GO tagging/text mining to support DB curators include:

─ Help assemble texts that may support GO code assignment
    GO tag texts in curator’s watching brief
    Automated tagging could act as prompt for/check on curator’s judgement

─ Help to determine gene-GO term pairs for annotation
    Perform GO tagging/gene name identification at text level and suggest all pairs as candidates
    Perform GO tagging/gene name identification at sentence level and suggest candidates
    Attempt to assign GO evidence codes
        • To text segments providing evidence for GO code assignment, without identifying the GO code/gene pair to which the evidence pertains
        • To text segments providing evidence, plus the GO code/gene pair to which the evidence pertains
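The difference between text-level and sentence-level pairing can be illustrated with a small co-occurrence sketch. Everything here is an assumption made for illustration: the function name `candidate_pairs`, the naive sentence splitter, the substring matching, and the invented example text. A real system would use proper gene name recognition and GO tagging.

```python
import re
from itertools import product


def candidate_pairs(text, gene_names, go_names):
    """Suggest (gene, GO id) pairs that co-occur within one sentence.

    gene_names: iterable of gene name strings.
    go_names: dict mapping a GO term name to its GO id.
    Matching is plain case-insensitive substring search (a simplification).
    """
    pairs = set()
    # Naive split after sentence-final punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        low = sentence.lower()
        genes = {g for g in gene_names if g.lower() in low}
        terms = {go_id for name, go_id in go_names.items() if name.lower() in low}
        pairs.update(product(genes, terms))
    return pairs


# Invented example text, for illustration only.
TEXT = "LIMK1 is found at the focal adhesion. Spermatogenesis was also studied."
GO_NAMES = {"focal adhesion": "GO:0005925", "spermatogenesis": "GO:0007283"}

sentence_level = candidate_pairs(TEXT, ["LIMK1"], GO_NAMES)
# Treating the whole text as one "sentence" gives the text-level candidates.
text_level = candidate_pairs(TEXT.replace(".", ","), ["LIMK1"], GO_NAMES)
```

In this toy case the sentence-level variant proposes one pair, while the text-level variant proposes both, which is the precision/recall trade-off implied by the two scenarios.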


Possible Tasks (1)

Assigning GO codes to abstracts/full papers

Given: a set of texts (PubMed abstracts/full papers) and the GO/GO Slim ontology

Task: assign 0 or more GO codes to a text iff the text is “about” the function/process/component identified by the code (assume most specific code only assigned)

Note in this task there is no association of GO code with any specific gene/gene product



Possible Tasks (2)

Assigning GO codes to genes/gene products in abstracts/full papers

Task: If the text supports the assignment of one or more GO codes to a gene/gene product, identify gene/gene product-GO code pairs and the text supporting the assignments

This capability would support additional tasks
─ Given a particular gene/gene product and a text collection, find all GO codes for the gene/gene product across the collection
─ Given a GO code and a text collection, find all genes/gene products tagged with the code across the collection



Possible Tasks (3)

Assigning evidence codes to gene/gene product-GO code pairings in abstracts/full papers

Task: As in Task 2, but additionally supply the evidence codes

A weaker variant of this is just to suggest evidence text that may assist in the assignment of GO code


Related Work

Raychaudhuri, Chang, Sutphin & Altman (2002)
─ Task: associate GO codes with genes by
    1. Associating GO codes with papers
    2. Associating a specific GO code with a gene if a sufficient number of papers mentioning the gene have the GO code associated with them
─ Method: Treat 1. as a document classification task and evaluate maximum entropy, Naïve Bayes and Nearest Neighbours approaches
─ Evaluation: corpus of 20,000 Medline abstracts assigned one or more of 21 GO terms/categories
─ Results: maximum entropy best, with 72.8% classification accuracy over 21 categories
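Step 2 of this scheme can be read as a voting rule over papers. The sketch below uses a plain count threshold; the function name `gene_go_codes`, the `min_papers` parameter and the toy data are assumptions made here, and the cited work may use a different criterion for "sufficient".

```python
from collections import Counter


def gene_go_codes(gene_papers, paper_codes, min_papers=2):
    """Assign a GO code to a gene if enough of its papers carry that code.

    gene_papers: ids of papers mentioning the gene.
    paper_codes: dict mapping a paper id to the set of GO codes assigned to it.
    min_papers: assumed threshold for a "sufficient number" of papers.
    """
    counts = Counter()
    for pmid in gene_papers:
        # Each paper contributes at most one vote per code (codes are a set).
        counts.update(paper_codes.get(pmid, set()))
    return {code for code, n in counts.items() if n >= min_papers}


# Toy data: placeholder paper ids, invented assignments.
paper_codes = {
    "p1": {"GO:0006468", "GO:0005634"},
    "p2": {"GO:0006468"},
    "p3": {"GO:0007283"},
}
codes = gene_go_codes(["p1", "p2", "p3"], paper_codes, min_papers=2)
```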



Go-KDS (Smith & Cleary, 2003)
─ Product of Reel Two
─ Task: assign arbitrary GO terms to PubMed articles
─ Method:
    Proprietary Weighted Confidence learner (similar to Naïve Bayes), using only words as features
    Trained on gene/protein DBs which use GO codes AND have links to Medline
─ Evaluated on approx. same data/task as Raychaudhuri et al.: 70.5% accuracy



GoPubMed
─ On-going work at Dresden University (www.gopubmed.org)
─ Task: Annotate PubMed abstracts with GO terms
─ Method: Use a local sequence alignment algorithm with weighted term matching (to overcome limits of strict matching) between GO terms and strings in texts
─ Evaluation: None reported

Kiritchenko et al. (U. of Ottawa)
─ Task: assign arbitrary GO terms to biomedical texts
─ Method: Treat task as hierarchical text classification; use AdaBoost.MH
─ Evaluation: introduce hierarchical evaluation measure
─ Results unclear


Related Work (cont)

BioCreative challenge: task 2 contained three related subtasks

1. Given an article, a protein and a GO code, where the article justifies the assignment of the GO code to the protein, find evidence text in the article supporting the assignment

2. Given protein-article pairs plus the number of GO code assignments supported by the article, find the GO code(s) that should be assigned to the protein based on the article

3. Given a set of proteins, retrieve a set of papers relevant to assigning codes to the proteins plus the GO code annotations and the supporting passages (not evaluated)

Results indicated no systems ready for practical use

Issues: lack of training data; complexity of tasks


Related Work (cont)

TREC Genomics Track 2004: three tasks related to GO code assignment

1. Triage -- given a set of articles find those that contain some evidence for the assignment of a GO code, i.e. warrant being curated

2. Given an article and names of genes occurring in the article assign one or more of the top three GO ontologies from which human curators had assigned codes

3. Task 2 plus provide evidence code supporting each gene-GO hierarchy label association

Results for all three tasks poor









Data Sets and Evaluation

In order to assess performance of GO tag assignment, a manually annotated/verified gold standard corpus is needed

However, no such corpus exists …

Solution 1: SGD Gold Standard
─ Derive a corpus from the SGD model organism database (yeast)
─ Assemble all Medline abstracts cited as evidence supporting assignment of GO terms
─ Associate with each abstract the GO term whose assignment it is cited as supporting
─ I.e. given the annotated genes in SGD, assign a GO term T to a paper P if the paper P is referenced in support of a Gene-GO term association involving T
─ SGD Gold Standard
    4922 PMIDs
    2455 GO terms
    10485 PMID-GO term pairs
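The derivation rule above amounts to regrouping annotation records by paper. A minimal sketch, assuming each annotation is available as a (gene, GO id, PMID) triple; the function name `build_gold_standard`, the record layout and the toy records are assumptions, not the actual SGD file format.

```python
def build_gold_standard(annotations):
    """Map each paper to the GO terms it is cited as supporting.

    annotations: iterable of (gene, go_id, pmid) triples, one per
    gene-GO association and supporting reference.
    """
    gold = {}
    for _gene, go_id, pmid in annotations:
        # The gene is dropped: the gold standard pairs papers with GO terms only.
        gold.setdefault(pmid, set()).add(go_id)
    return gold


# Toy records with placeholder gene names and paper ids.
annotations = [
    ("GENE1", "GO:0006468", "pmid-1"),
    ("GENE2", "GO:0006468", "pmid-1"),   # same paper and term via another gene
    ("GENE1", "GO:0005634", "pmid-1"),
    ("GENE3", "GO:0007283", "pmid-2"),
]
gold = build_gold_standard(annotations)
n_pairs = sum(len(terms) for terms in gold.values())  # distinct PMID-GO term pairs
```

Counting keys, distinct values and pairs of such a mapping gives figures of the kind reported on the slide (PMIDs, GO terms, PMID-GO term pairs).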


Data Sets and Evaluation: SGD “Gold” Standard

Advantages:
─ Data already exists, so no extra annotation work is required
─ Can assemble similar corpora for each model organism DB

Disadvantages:
─ Each abstract has associated with it GO terms whose assignment to specific genes it supports, but may be missing other GO terms which can also be legitimately attached to it
─ Not every paper supporting a GO term assignment will be cited

Consequence:
─ SGD gold standard is “GO term incomplete”
    Weak measure of recall
    Precision figures difficult to interpret
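Why an incomplete gold standard makes precision hard to interpret can be shown with a toy calculation. The sketch assumes micro-averaged precision, recall and F over PMID-GO term pairs; the slides do not state how scores were averaged, so that choice, the function name `prf` and the toy data are assumptions.

```python
def prf(predicted, gold):
    """Micro-averaged precision, recall and F over document-term pairs.

    predicted, gold: dicts mapping a document id to a set of GO ids.
    Only documents present in the gold standard are scored.
    """
    tp = sum(len(predicted.get(doc, set()) & terms) for doc, terms in gold.items())
    n_pred = sum(len(predicted.get(doc, set())) for doc in gold)
    n_gold = sum(len(terms) for terms in gold.values())
    p = tp / n_pred if n_pred else 0.0
    r = tp / n_gold if n_gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# The system proposes two terms; both are legitimate for the abstract.
predicted = {"doc": {"GO:0006468", "GO:0005634"}}
incomplete_gold = {"doc": {"GO:0006468"}}               # one term never cited
complete_gold = {"doc": {"GO:0006468", "GO:0005634"}}   # what full annotation would give

p_incomplete, r_incomplete, _ = prf(predicted, incomplete_gold)
p_complete, r_complete, _ = prf(predicted, complete_gold)
```

Against the incomplete gold standard the correct but unlisted term counts as a false positive, so precision drops to 0.5 even though the system made no real error.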


Data Sets and Evaluation: SGD “Gold” Standard (cont)

Further issue:
─ SGD Gene-GO term assignments are based on full papers, whereas system only has access to abstracts

Consequence:
─ Limit on maximum Recall obtainable by system


Data Sets and Evaluation (cont)

Solution 2: IC Gold Standard
─ Manually extend the GO annotation of abstracts derivable from the SGD
    Goal: GO term complete gold standard
─ Selected a subset (~800) for which support for all the assigned GO codes is found in the abstract (rather than the full paper)
─ Manually added additional GO annotations using a combination of fuzzy matching against GO and some manual addition of synonyms during checking
    For included terms, include lowest within each ontology
    ‘cell wall biosynthesis’ => ‘cell wall biosynthesis’ ‘cell wall’
─ Also applied same methodology to evidence paragraphs: brief summaries written by curators deliberately using GO vocabulary

IC Gold Standard
─ 785 PMIDs
─ 1006 GO terms
─ 5170 PMID-GO term pairs


Data Sets and Evaluation (cont)

Advantages:
─ Much closer to a GO-term complete gold standard

Disadvantages:
─ Still not GO-term complete
    Method of creation suggests there may still be many unannotated GO terms that ought to be marked (direct mentions of GO terms vs. semantically entailed GO terms)
─ Gold Standard creation method favors lexical look-up approach to GO-tagging
─ Dataset is small




The GO Tagging Task Addressed

The approaches we investigated all considered Task 1, as defined earlier:

─ Given: a set of texts (PubMed abstracts/full papers) and the GO/GO Slim ontology

─ Task: assign 0 or more GO codes to a text iff the text is “about” the function/process/component identified by the code (assume most specific code only assigned)
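The "most specific code only" assumption can be sketched as a filter that drops any assigned code that is an ancestor of another assigned code. The function names and the one-edge `PARENTS` fragment are illustrative assumptions; a real implementation would load the parent relations from the GO distribution.

```python
def ancestors(term, parents):
    """All terms reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def most_specific(assigned, parents):
    """Keep only codes that are not ancestors of another assigned code."""
    assigned = set(assigned)
    covered = set()
    for term in assigned:
        covered |= ancestors(term, parents)
    return assigned - covered


# Toy fragment: protein serine/threonine kinase activity (GO:0004674)
# is a kind of protein kinase activity (GO:0004672).
PARENTS = {"GO:0004674": ["GO:0004672"]}

result = most_specific({"GO:0004674", "GO:0004672", "GO:0005634"}, PARENTS)
```

Here the more general kinase term is dropped because its more specific child was also assigned; the unrelated nucleus term is kept.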


Approach 1: Lexical Lookup Using Termino

Termino: a large-scale terminological resource to support term processing for information extraction, retrieval, and navigation

Termino contains a database holding large numbers of terms imported from various existing terminological resources, e.g., UMLS, GO

Efficient recognition of terms in text is achieved through use of finite state recognizers compiled from contents of database

The results of lexical look-up in Termino can feed into further term processing components, e.g., term parser

Available as a Web Service (see http://nlp.shef.ac.uk)


Termino Terminology Engine

[Figure: Termino architecture. Components: Text in, Finite State Look-Up, Term Parser, …, Text out. The Term DB is populated from existing terminological resources (GO, UMLS, …) through source-specific loaders, and from raw texts (Medline abstracts) through term induction.]

Example Term DB entries shown in the figure:
─ Neurofibromin: GO annotations: 0008181: tumor suppressor; 0005737: cytoplasm; …
─ Peptidyl-prolyl isomerase: type: protein term; source: induced from Medline; …
─ Mastectomy: UMLS data: CUI: C0024881; semantic type: therapeutic or preventive procedure; synonyms: mammectomy; …


Lexical Look-Up for GO Tag

Termino
─ Imported names of all terms in GO, plus their GO ids and namespace attributes (18270 names in total)
─ GO term synonyms
─ SGD yeast gene names

Recognition of terms in text
─ Case-insensitive
─ “Simple” morphological variants are recognized
    Cells mapped onto cell
    Mitochondrial, mitochondria not mapped onto mitochondrion


Lexical Look-Up for GO Tag (cont)

GO code assignment
─ GO term T is assigned to text iff name of T is recognized in text
─ Extensions:
    GO term T is assigned to a paper if a synonym of term T occurs in the abstract of the paper
    GO term T is assigned to a paper if a yeast gene name associated with term T occurs in the abstract of the paper
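The assignment rule can be sketched with a plain dictionary scan over token n-grams. This is not Termino: the slides describe finite state recognizers compiled from a term database, whereas the sketch below swaps in a much simpler lookup. The function names, the crude plural stripping and the example sentence are assumptions for illustration.

```python
import re


def normalize(token):
    """Lower-case and strip a simple plural 's' (a crude stand-in for
    the "simple" morphological variants mentioned on the slide)."""
    token = token.lower()
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        token = token[:-1]
    return token


def tokens(text):
    return [normalize(t) for t in re.findall(r"[A-Za-z0-9]+", text)]


def build_index(names):
    """names: dict mapping a term name (or synonym) to a GO id."""
    return {tuple(tokens(name)): go_id for name, go_id in names.items()}


def lookup(text, index):
    """Return the GO ids whose name occurs as a token sequence in the text."""
    toks = tokens(text)
    max_len = max((len(key) for key in index), default=0)
    found = set()
    for start in range(len(toks)):
        for length in range(1, max_len + 1):
            key = tuple(toks[start:start + length])
            if key in index:
                found.add(index[key])
    return found


NAMES = {
    "focal adhesion": "GO:0005925",
    "nucleus": "GO:0005634",
    "mitochondrion": "GO:0005739",
}
index = build_index(NAMES)
# Invented example sentence.
found = lookup("LIMK1 localises to Focal Adhesions and to the nucleus.", index)
missed = lookup("Mitochondrial proteins were isolated.", index)
```

As on the slide, plural and case variants are matched, while a derived form such as "mitochondrial" is not mapped onto "mitochondrion". Synonyms and gene names would be handled by adding further entries to `NAMES`.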


Lexical Lookup Results for GO Slim

SGD Dataset

                            P         R         F
GO Term                  35.27%    52.36%    42.15%
Yeast Term               22.87%    91.76%    36.62%
GO Synonyms              37.86%    34.20%    35.94%
GO + Yeast               21.15%    93.59%    34.50%
GO + Synonyms            32.94%    64.65%    43.65%
GO + Synonyms + Yeast    20.53%    94.17%    33.72%

IC Dataset

                            P         R         F
GO Term                  98.62%    79.33%    87.93%
Yeast Term               37.94%    75.13%    50.42%
GO Synonyms              70.49%    33.31%    45.24%
GO + Yeast               43.36%    94.76%    59.50%
GO + Synonyms            85.52%    88.35%    86.91%
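The F column is consistent with the balanced F-measure, the harmonic mean of precision and recall, F = 2PR / (P + R). A short check against two GO Term rows, assuming that definition (the slides do not state the formula explicitly):

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# "GO Term" rows, values in percent.
f_sgd = f_measure(35.27, 52.36)   # SGD dataset
f_ic = f_measure(98.62, 79.33)    # IC dataset
```

Rounded to two decimals these give 42.15 and 87.93, matching the reported F scores for those rows.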



Lexical Lookup Results for GO Full

SGD Dataset

                    P         R         F
GO Term           7.33%    15.95%    10.05%
Yeast Term        7.97%    84.42%    14.57%
GO Synonyms       6.46%     7.63%     7.00%
GO + Yeast        6.93%    85.66%    12.82%
GO + Synonyms     6.87%    22.55%    10.54%

IC Dataset

                    P         R         F
GO Term          90.52%    71.30%    79.77%
Yeast Term        9.26%    31.43%    14.30%
GO Synonyms      29.65%    11.53%    16.60%
GO + Yeast       21.00%    83.73%    33.58%
GO + Synonyms    69.93%    80.04%    74.65%



Lexical Lookup Approach: Discussion

Recall
─ Effect of curators using full text (SGD) vs. abstracts only (IC)
─ Inherent drawbacks of lexical look-up: term variation, literal mentions
─ Effects of Gold Standard creation method (IC)

Precision
─ Effects of Gold Standard creation method (IC)

GO vs. GO Slim
─ Recognizing GO Slim terms is easier than recognizing GO terms

Effects of extensions (synonyms/gene names) on performance
─ Adding synonyms: variable decrease in Precision, substantial increase in Recall
─ Adding yeast terms: substantial decrease in Precision, substantial increase in Recall


Error Analysis

False negatives for abstracts:

─ Abbreviation: mismatch repair (GO name) vs. MMR (in text)

─ Permutation, derivation: regulation of translation vs. regulated translation, sporulation vs. sporulate

─ Truncation: galactokinase activity vs. galactokinase

─ Alternative descriptions: protein catabolism vs. proteins for degradation, autophagic vacuole vs. autophagosomal


Approach 2: IR-based Vector Space Similarity

Document Collection
─ Build a collection of “GO documents” where each GO document consists of a GO term, its synonyms and its definition sentence

Query
─ Treat each abstract to which GO codes are to be assigned as a query against the GO document collection

Retrieval
─ Given a query (i.e. abstract) retrieve relevant “GO documents” (i.e. GO terms)
─ Assign to an abstract the top 1, 5, 10 … GO terms which are most “similar” as measured by the Vector Space Model (VSM)
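The retrieval step can be sketched with a small TF-IDF and cosine similarity ranking. This is a stand-in written for illustration, not the system's actual scoring: the function names, the IDF formula and the toy GO documents are assumptions, and no stop word removal or stemming is applied.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def build_vectors(go_docs):
    """go_docs: dict mapping a GO id to its "GO document" text."""
    tfs = {go_id: Counter(tokenize(text)) for go_id, text in go_docs.items()}
    df = Counter(word for tf in tfs.values() for word in tf)
    n_docs = len(go_docs)
    idf = {word: math.log(1 + n_docs / count) for word, count in df.items()}
    vecs = {g: {w: c * idf[w] for w, c in tf.items()} for g, tf in tfs.items()}
    return vecs, idf


def cosine(a, b):
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(abstract, vecs, idf, k=5):
    """Rank GO documents by similarity to the abstract and keep the top k."""
    counts = Counter(tokenize(abstract))
    query = {w: c * idf[w] for w, c in counts.items() if w in idf}
    ranked = sorted(vecs, key=lambda g: cosine(query, vecs[g]), reverse=True)
    return ranked[:k]


# Toy GO documents: the first uses name, synonym and definition from the
# earlier example term; the others use the term name only.
GO_DOCS = {
    "GO:0051210": "isotropic cell growth uniform cell growth The process by "
                  "which a cell irreversibly increases in size uniformly in "
                  "all directions.",
    "GO:0007283": "spermatogenesis",
    "GO:0005634": "nucleus",
}
vecs, idf = build_vectors(GO_DOCS)
best = top_k("Cells showed uniform growth in all directions", vecs, idf, k=1)
```

The cut-off `k` plays the role of the "top 1, 5, 10 …" setting: a larger `k` trades precision for recall.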


IR-based Approach

Indexed the GO documents using the Lucene search engine

Standard IR preprocessing: tokenization, stop word removal, case normalization, stemming

4 indices were built, varying as to whether they used
─ Standard GO or GO Slim
─ A GO document consisting of the GO term text (name + definition) alone, or that text plus its ancestor GO terms

Used standard weighting scheme included in Lucene

Postprocessing:
─ Re-weighting: give credit to duplicated GO documents (found on more than one path back to root)
─ Threshold: the number of relevant GOIDs to return


IR-Based Results

Better performance on IC abstracts than on SGD abstracts

Hierarchical documents do slightly worse than flat documents
─ Discriminatory effect of specific GO terms may be reduced by occurrence of general terms such as cell and protein


Approach 3: Machine Learning

Variety of text classification algorithms: Naïve Bayes, Decision Tree, SVM classifier, …
─ Naïve Bayes predicts only one GO term per abstract
─ SGD GS: 2.1 GO terms/abstract; IC GS: 6.6 GO terms/abstract

Features: words, frequent phrases
─ Preprocessing steps: tokenization, removal of stop words, stemming

Training on 66% of annotated data, evaluation on remainder of data

GO term assignments vis-à-vis generic GO Slim to mitigate data sparsity problems
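The single-prediction limitation puts a ceiling on recall that follows from the averages quoted above. A back-of-the-envelope sketch, assuming micro-averaged recall and that the quoted averages apply to the evaluation data (both are assumptions; the function name is illustrative):

```python
def max_micro_recall(avg_labels_per_abstract, predictions_per_abstract=1):
    """Upper bound on micro-averaged recall when a classifier outputs a
    fixed number of labels per abstract: at most that many of the gold
    labels of each abstract can be recovered."""
    return min(1.0, predictions_per_abstract / avg_labels_per_abstract)


sgd_ceiling = max_micro_recall(2.1)   # SGD gold standard: 2.1 GO terms/abstract
ic_ceiling = max_micro_recall(6.6)    # IC gold standard: 6.6 GO terms/abstract
```

Under these assumptions a one-label classifier cannot exceed roughly 48% recall on the SGD gold standard and roughly 15% on the IC gold standard, which is why the number of GO terms predicted per abstract matters.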


Machine Learning Results

One GO term vs. multiple GO terms per abstract makes a difference

Higher precision scores than lexical look-up (SGD): GO terms directly mentioned in text are not assigned if they are not present in the training set

Oracle Text Decision Tree (IC): classifier learns systematic, strong correlation between words in text and words in GO terms


Comparison of Approaches

Best F scores for GO Slim

─ SGD Gold Standard

         R       P       F
LLU    64.6    32.9    43.6
IR     51.5    26.2    34.7
ML     36.8    51.6    43.0

─ IC Gold Standard

         R       P       F
LLU    79.3    98.6    87.9
IR     59.5    37.6    46.1
ML     76.5    83.0    79.6




Exploiting the Results in Search Tools

[Screenshot: search tool input page. Callouts: input keywords here; upload a file containing a list of Medline abstracts; type/paste free texts to get results.]

[Screenshot: search results page. Callouts: click for the abstract details; click for the GO definition; search for the abstracts with similar GO annotations. Panels: GO hierarchy; abstract titles; abstract bodies; GO labels/gene names.]

Conclusions

GO tagging is an “interesting” task that offers significant potential benefits to research biologists and bioinformaticians
─ Several increasingly complex/valuable variants of the task can be identified

Simple techniques
─ Direct term matching
─ IR-type text matching
─ Machine Learning text classification methods
have been assessed for their level of performance on the simplest task: assigning GO terms to texts at the whole text level

Evaluation methods/resources are critical issues

Effectively utilising imperfect text mining results in end user applications is challenging


Future Work

Enhancements to each of the 3 simple approaches

Combining 3 simple approaches into a hybrid system

Look at other tasks:

─ Extracting GO term-gene/gene product pairs

─ Assigning evidence codes

Improving resources and methodology for evaluating the technology

End-user evaluation of search tools employing this technology


END

Reference:

Davis, N., Harkema, H., Gaizauskas, R., Guo, Y.K., Ghanem, M., Barnwell, T., Guo, Y. and Ratcliffe, J. (2006) Three Approaches to GO-Tagging Biomedical Abstracts. In Proceedings of the Second International Symposium on Semantic Mining in Biomedicine (SMBM06), Jena, April 2006.

Available from http://www.dcs.shef.ac.uk/~robertg/publications/