Text Mining Applications for Literature Curation

Kimberly Van Auken

WormBase Consortium, Textpresso

Gene Ontology Consortium

WormBase: A Database for C. elegans and Other Nematodes

www.wormbase.org

Curating Diverse Data Types

Which worms aggregate with other worms and what contributes to that behavior?

Aggregation Behavior

Bendesky et al., 2012, PLoS Genetics


Which worms (Strain) aggregate with other worms and what contributes to that behavior?



Strain information: August 1, 1972; pineapple field in Hawaii


Which worms aggregate with other worms (Phenotype) and what contributes to that behavior?



Worm Phenotype Ontology (WPO): Bordering (WBPhenotype:0001820)

Life stage ontology, e.g., L3 larval stage

Assay, e.g., food source



Which worms aggregate with other worms (Phenotype) and what contributes to that behavior (Molecular Basis)?


Bendesky et al., 2012, PLoS Genetics

Gene: npr-1

Variation: ad609 (T(83)->I and T(144)->A)

Gene Ontology for npr-1:

Biological Process: feeding behavior

Molecular Function: neuropeptide receptor activity

Cellular Component: integral to plasma membrane

Literature Curation Workflow

PubMed keyword search – ‘elegans’

Full text paper acquisition

Data type flagging and entity recognition

Detailed curation/Fact extraction

Finding Papers: Daily, automated PubMed searches using keyword ‘elegans’

PMID, Title, Authors, Abstract, Article type, Journal, Curator actions

Download citation XML
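A daily keyword search of this kind could be scripted against NCBI's public E-utilities ESearch endpoint. The sketch below only builds the request URL and makes no network call. The one-day look-back window, the result cap, and the function name are illustrative assumptions; the slides state only that the search is daily, automated, and uses the keyword 'elegans'.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (public API).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_daily_query(keyword="elegans", days_back=1, max_results=500):
    """Return an ESearch URL for PubMed records added in the last `days_back` days.

    days_back and max_results are assumed values, not WormBase settings.
    """
    params = {
        "db": "pubmed",
        "term": keyword,
        "reldate": days_back,   # look-back window in days
        "datetype": "edat",     # filter on the date the record entered PubMed
        "retmax": max_results,  # cap on the number of PMIDs returned
        "retmode": "json",
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = build_daily_query()
print(url)
```

The returned PMIDs would then be used to download the citation XML for each new paper.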

Literature Curation Workflow – Full Text Acquisition

Fully manual step

Done for all papers we select

Electronic copies stored in curation database

Data Type Flagging/Triage

General classification of papers

What types of experiments are in a paper? E.g., RNAi phenotypes, variation phenotypes, expression patterns, physical interactions

Main pipeline:

Support Vector Machines (SVMs)

Other methods:

Textpresso category searches

Hidden Markov models

Pattern matching scripts

Data Type Flagging Methods

Support Vector Machines: Document Classification

Machine learning models

Use positive and negative gold standard sets of papers to train (e.g., papers with/without RNAi experiments)

Positives: 100s, Negatives: 1000s

Resulting model classifies all new papers as negative or positive (high, medium, low confidence)
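To make the train-then-classify idea concrete, here is a toy linear SVM in pure Python, trained by subgradient descent on the regularized hinge loss over bag-of-words features. It is a minimal sketch, not the WormBase pipeline: the five training sentences stand in for full papers, and the feature set, hyperparameters, and confidence thresholds are all assumptions.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase bag-of-words features (token -> count)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def train_linear_svm(docs, labels, epochs=50, lam=0.01, lr=0.1):
    """Subgradient descent on the L2-regularized hinge loss.

    labels: +1 for positive papers, -1 for negative papers.
    """
    weights, bias = Counter(), 0.0
    feats = [tokenize(d) for d in docs]
    for _ in range(epochs):
        for x, y in zip(feats, labels):
            margin = y * (sum(weights[t] * c for t, c in x.items()) + bias)
            for t in list(weights):  # regularization: shrink every weight
                weights[t] *= 1 - lr * lam
            if margin < 1:  # misclassified or inside the margin
                for t, c in x.items():
                    weights[t] += lr * y * c
                bias += lr * y
    return weights, bias


def classify(text, weights, bias, high=1.0, medium=0.5):
    """Return (label, confidence). The bins on |score| are assumed thresholds."""
    score = sum(weights[t] * c for t, c in tokenize(text).items()) + bias
    label = "positive" if score > 0 else "negative"
    strength = abs(score)
    confidence = "high" if strength >= high else "medium" if strength >= medium else "low"
    return label, confidence


# Hypothetical miniature gold standard: RNAi papers (+1) vs. others (-1).
docs = [
    "RNAi knockdown of the gene by feeding caused embryonic lethality",
    "animals were injected with dsRNA and RNAi phenotypes scored",
    "the crystal structure of the protein was solved",
    "expression was observed in neurons using a GFP reporter",
    "phylogenetic analysis of nematode species",
]
labels = [1, 1, -1, -1, -1]

weights, bias = train_linear_svm(docs, labels)
print(classify("RNAi by feeding produced sterile animals", weights, bias))
print(classify("the crystal structure was solved", weights, bias))
```

A production classifier would train on hundreds of positive and thousands of negative full-text papers, as noted above, and would normally rely on an established SVM library rather than a hand-written loop.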

Data Type Flagging – Support Vector Machines

SVMs trained for ten different data types:

Antibody

Genetic Interactions

Physical Interactions

Gene Expression

Regulation of Gene Expression

Variation Phenotypes

Overexpression Phenotypes

RNAi Phenotypes

Variation Sequence Change

Gene Structure Correction

See: Fang R, et al. (2012) Automatic categorization of diverse experimental information in the bioscience literature. BMC Bioinformatics. 13(1):16

Curation from Support Vector Machine Results

SVM results lead directly to manual curation:

e.g. RNAi Phenotypes

Results from SVMs are processed further

e.g. Variation Sequence Change

Pattern Matching Script – regular expressions

New variations (entity recognition)

e.g. mg366, ju43, e1360
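A pattern-matching script for new variations might look like the sketch below. The regular expression (one to three lowercase letters, then digits, with an optional "ts" suffix) is an assumption inferred from the example names, and the list of already-known variations is hypothetical. A pattern this loose will also pick up non-allele tokens, so its output is a candidate list for curator review.

```python
import re

# Assumed allele pattern: 1-3 lowercase letters + digits, optional "ts" suffix.
VARIATION_RE = re.compile(r"\b[a-z]{1,3}\d+(?:ts)?\b")


def find_new_variations(text, known):
    """Return candidate variation names in text order, skipping known ones."""
    seen, new = set(), []
    for match in VARIATION_RE.finditer(text):
        name = match.group()
        if name not in known and name not in seen:
            seen.add(name)
            new.append(name)
    return new


sentence = (
    "Interestingly, pph-5 (tm2979) behaved similarly to pph-5 (av101) in its "
    "ability to dominantly (but weakly) suppress sep-1 (e2406ts), but "
    "recessively suppress sep-1(ax110) (supplementary material Table S1)."
)
known = {"e2406ts"}  # hypothetical set of variations already in the database

print(find_new_variations(sentence, known))
# → ['tm2979', 'av101', 'ax110']
```

Gene names such as pph-5 and sep-1 are not matched because the hyphen separates the letters from the digits.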

Data Type Flagging – Textpresso

www.textpresso.org

C. elegans, Mouse, D. melanogaster, Neuroscience, Arabidopsis, Dicty, Wnt Pathway, HIV, Nematodes, S. cerevisiae, RegulonDB, … many others

Full text of articles

Terms, phrases, entities – semantically tagged

Keyword or category search

Match within sentence or entire paper

Textpresso Categories

Pre-existing dictionaries, vocabularies:

Gene names

ChEBI (Chemical Entities of Biological Interest)

PATO

Sequence Ontology (SO)

Manually constructed by curators using language from published literature:

Sequence similarity – orthologous, conserved

Localization assays – GFP, antibody, fluorescence

Experimental verbs – required, regulates, exhibits

Data Type Flagging - Textpresso Category Searches

Data Type: C. elegans Human Disease Homologs

Three-category Textpresso search:

C. elegans gene

'Ortholog', 'Homolog', 'Similar', 'Model'

Human disease

"We map this defect in dauer response to a mutation in the scd-2 gene, which, we show, encodes the nematode anaplastic lymphoma kinase (ALK) homolog, a proto-oncogene receptor tyrosine kinase."
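The three-category search can be pictured as a sentence-level filter: a sentence is returned only if it contains at least one term from every category. The sketch below is not Textpresso's implementation; the gene-name pattern and the tiny disease lexicon are stand-in assumptions for its curated categories, and the second test sentence is invented.

```python
import re

# Stand-in category definitions (assumptions, far smaller than real lexicons).
CATEGORIES = {
    "C. elegans gene": re.compile(r"\b[a-z]{3,4}-\d+\b"),
    "homology term": re.compile(r"\b(?:ortholog|homolog|similar|model)\w*", re.IGNORECASE),
    "human disease": re.compile(r"\b(?:lymphoma|cancer|diabetes)\b", re.IGNORECASE),
}


def split_sentences(text):
    """Naive splitter; abbreviations such as 'C. elegans' would need special care."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def category_search(text):
    """Return (sentence, matches) for sentences that hit every category."""
    hits = []
    for sentence in split_sentences(text):
        matches = {name: rx.findall(sentence) for name, rx in CATEGORIES.items()}
        if all(matches.values()):  # every category has at least one match
            hits.append((sentence, matches))
    return hits


text = (
    "We map this defect in dauer response to a mutation in the scd-2 gene, "
    "which, we show, encodes the nematode anaplastic lymphoma kinase (ALK) "
    "homolog, a proto-oncogene receptor tyrosine kinase. "
    "Dauer larvae were counted after three days."
)

hits = category_search(text)
for sentence, matches in hits:
    print(matches)
```

Only the first sentence is returned, because the second contains no gene name, homology term, or disease term.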

Textpresso: Semi-Automated Fact Extraction

Genetic Interactions: "Interestingly, pph-5 (tm2979) behaved similarly to pph-5 (av101) in its ability to dominantly (but weakly) suppress sep-1 (e2406ts), but recessively suppress sep-1(ax110) (supplementary material Table S1)."

Physical Interactions – after SVM document classifier: "Remarkably, only AIN-1 coimmunoprecipitated HA-tagged Ce PAB-1 (Figure 3A and B, lane 7)."

Gene Ontology – Cellular Component Curation: "During embryogenesis, PAN-1 protein is uniformly distributed throughout the cytoplasm of the germline and somatic blastomeres, as seen for pan-1 mRNA (Fig. 2A), with no obvious concentration of PAN-1 in the P granules (Fig. 2K, N)."

Textpresso: Semi-Automated GO Cellular Component Curation

Textpresso Search Results

Suggested GO Annotations

Gene Products

Textpresso Component

See: Van Auken KM, Jaffery J, Chan J, Müller HM, Sternberg PW. (2009) Semi-automated curation of protein subcellular localization: a text mining-based approach to Gene Ontology (GO) cellular component curation. BMC Bioinformatics. 10:228.
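The step from a matched sentence to a suggested annotation can be sketched as pairing each gene product mentioned in the sentence with each component term that maps to a GO cellular component ID. This is an illustration under stated assumptions, not the published method: the protein-name pattern and the three-entry component table are simplifications, and every suggestion would still go to a curator for review.

```python
import re

# Small illustrative lookup from component terms to GO cellular component IDs.
GO_COMPONENTS = {
    "cytoplasm": "GO:0005737",
    "nucleus": "GO:0005634",
    "plasma membrane": "GO:0005886",
}

# Assumed pattern for C. elegans protein names written in capitals, e.g. PAN-1.
PROTEIN_RE = re.compile(r"\b[A-Z]{2,4}-\d+\b")


def suggest_annotations(sentence):
    """Return (gene product, component term, GO ID) suggestions for curator review."""
    proteins = sorted(set(PROTEIN_RE.findall(sentence)))
    lowered = sentence.lower()
    return [
        (protein, term, go_id)
        for protein in proteins
        for term, go_id in GO_COMPONENTS.items()
        if term in lowered
    ]


sentence = (
    "During embryogenesis, PAN-1 protein is uniformly distributed throughout "
    "the cytoplasm of the germline and somatic blastomeres, as seen for pan-1 "
    "mRNA (Fig. 2A), with no obvious concentration of PAN-1 in the P granules "
    "(Fig. 2K, N)."
)

suggestions = suggest_annotations(sentence)
print(suggestions)
# → [('PAN-1', 'cytoplasm', 'GO:0005737')]
```

The same sentence also shows why curator review matters: it mentions P granules only to say PAN-1 is not concentrated there, which simple term matching cannot distinguish from a positive localization.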

Future Directions

Textpresso, other methods (HMMs) applied to additional data types

e.g. GO Biological Process curation (Phenotypes)

Focusing triage and fact extraction on novel findings

How best to integrate existing knowledge into curation pipelines to focus curator effort on new experimental results?

e.g. Commonly used molecular markers

Literature Annotation Tool – Tracking Evidence

WB, GO Common Annotation Framework, BioCreative

Summary

Text Mining Applications for Literature Curation:

Paper approval and full text acquisition

Data type flagging and entity recognition

Fact extraction – record evidence

All steps of our pipeline incorporate some form of semi- or fully automated approach:

Scripts for downloads, pattern matching

Support Vector Machines for document classification

Textpresso for flagging and fact extraction

(Hidden Markov Models for flagging, fact extraction)

The WormBase Consortium, Textpresso

WormBase – Caltech: Paul Sternberg, Juancarlos Chan, Wen Chen, Chris Grove, Ranjana Kishore, Raymond Lee, Cecilia Nakamura, Daniela Raciti, Gary Schindelman, Kimberly Van Auken, Daniel Wang, Xiaodong Wang, Karen Yook. Former member: Ruihua Fang

Textpresso – Caltech: Hans-Michael Müller, Yuling Li, James Done. Former member: Arun Rangarajan

WormBase – OICR, Toronto: Lincoln Stein, Abigail Cabunoc, Todd Harris, JD Wong

WormBase – Washington University: John Spieth, Tamberlyn Bieri, Phil Ozersky

WormBase – EBI, Sanger, Hinxton, UK: Richard Durbin, Paul Kersey, Matt Berriman, Paul Davis, Michael Paulini, Kevin Howe, Mary Ann Tuli, Gary Williams

CGC – Oxford University, Oxford, UK: Jonathan Hodgkin

Hidden Markov Models: Semi-Automated GO Molecular Function Curation

For each sentence, HMM yields:

True positive score

False positive score

For each sentence, curator assigns:

Fully curatable (entity + indication of enzymatic activity)

Positive (experiment was performed and a result reported, but no entity)

False positive (not about enzymatic activity at all)