Text Mining Applications for Literature Curation
Kimberly Van Auken
WormBase Consortium, Textpresso
Gene Ontology Consortium
WormBase: A Database for C. elegans and Other Nematodes
www.wormbase.org
Curating Diverse Data Types
Which worms aggregate with other worms
and what contributes to that behavior?
Aggregation Behavior
Bendesky et al., 2012, PLoS Genetics
Curating Diverse Data Types
Which worms (Strain) aggregate with other worms
and what contributes to that behavior?
Aggregation Behavior
Bendesky et al., 2012, PLoS Genetics
Strain information: August 1, 1972, pineapple field in Hawaii
Curating Diverse Data Types
Which worms aggregate with
other worms (Phenotype) and what contributes to
that behavior?
Aggregation Behavior
Bendesky et al., 2012, PLoS Genetics
Worm Phenotype Ontology (WPO): Bordering (WBPhenotype:0001820)
Life stage ontology, e.g., L3 larval stage
Assay, e.g., food source
Curating Diverse Data Types
Which worms (Strain) aggregate with
other worms (Phenotype) and what contributes
to that behavior (Molecular Basis)?
Aggregation Behavior
Bendesky et al., 2012, PLoS Genetics
Gene: npr-1
Variation: ad609 (T(83)->I and T(144)->A)
Gene Ontology for npr-1:
Biological Process: feeding behavior
Molecular Function: neuropeptide receptor activity
Cellular Component: integral to plasma membrane
Literature Curation Workflow
PubMed keyword search – ‘elegans’
Full text paper acquisition
Data type flagging and entity recognition
Detailed curation/Fact extraction
Finding Papers: Daily, automated PubMed searches using keyword ‘elegans’
PMID, Title, Authors, Abstract, Article type, Journal, Curator actions
Download citation XML
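A daily keyword search of this kind could be scripted against NCBI's E-utilities. The sketch below is an assumption about how such a step might look, not the actual WormBase script: it builds an `esearch` URL for the keyword 'elegans' restricted to the last day, builds an `efetch` URL for downloading citation XML, and parses PMIDs out of an `esearch` response. The function names and the sample XML are illustrative only.

```python
import urllib.parse
import xml.etree.ElementTree as ET

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def build_search_url(term="elegans", days=1, retmax=500):
    """Build an esearch URL for PubMed records entered in the last `days` days."""
    params = {
        "db": "pubmed",
        "term": term,
        "reldate": days,     # look back this many days
        "datetype": "edat",  # filter on the Entrez entry date
        "retmax": retmax,
    }
    return f"{EUTILS}/esearch.fcgi?{urllib.parse.urlencode(params)}"

def build_fetch_url(pmids):
    """Build an efetch URL that returns citation XML for the given PMIDs."""
    params = {"db": "pubmed", "id": ",".join(pmids), "retmode": "xml"}
    return f"{EUTILS}/efetch.fcgi?{urllib.parse.urlencode(params)}"

def parse_pmids(esearch_xml):
    """Extract the PMID list from an esearch XML response."""
    root = ET.fromstring(esearch_xml)
    return [node.text for node in root.findall("./IdList/Id")]

# Made-up response used only to exercise the parser without network access.
SAMPLE_RESPONSE = (
    "<eSearchResult><Count>2</Count>"
    "<IdList><Id>111</Id><Id>222</Id></IdList></eSearchResult>"
)

if __name__ == "__main__":
    print(build_search_url())
    print(build_fetch_url(parse_pmids(SAMPLE_RESPONSE)))
```

In a real pipeline the two URLs would be requested on a schedule, and the returned citation XML would be stored for curators to approve or reject.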
Literature Curation Workflow – Full Text Acquisition
Fully manual step
Done for all papers we select
Electronic copies stored in curation database
Data Type Flagging/Triage
General classification of papers
What types of experiments are in a paper? e.g. RNAi phenotypes, Variation phenotypes,
Expression patterns, Physical interactions
Main pipeline:
Support Vector Machines (SVMs)
Other methods:
Textpresso category searches
Hidden Markov models
Pattern matching scripts
Data Type Flagging Methods
Support Vector Machines: Document Classification
Machine learning models
Use positive and negative gold standard sets of papers to train (e.g., papers with/without RNAi experiments)
Positives: 100s, Negatives: 1000s
Resulting model classifies all new papers as negative or positive (high, medium, low confidence)
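To make the training and classification steps concrete, here is a minimal, dependency-free sketch of a linear SVM trained by subgradient descent on the regularized hinge loss, using bag-of-words features. It is a toy under stated assumptions, not the WormBase classifier: the four training sentences stand in for gold standard papers, and the `high` and `medium` thresholds that turn the decision score into a confidence band are invented for illustration.

```python
import re

def tokenize(text):
    """Lowercase bag-of-words features (each distinct token counts once)."""
    return set(re.findall(r"[a-z0-9-]+", text.lower()))

def train_linear_svm(docs, labels, epochs=20, lr=0.1, lam=0.01):
    """Subgradient descent on the L2-regularized hinge loss (a linear SVM)."""
    w, b = {}, 0.0
    for _ in range(epochs):
        for text, y in zip(docs, labels):       # y is +1 or -1
            feats = tokenize(text)
            margin = y * (sum(w.get(t, 0.0) for t in feats) + b)
            for t in w:                          # regularization shrinks weights
                w[t] *= 1.0 - lr * lam
            if margin < 1.0:                     # hinge loss is active: update
                for t in feats:
                    w[t] = w.get(t, 0.0) + lr * y
                b += lr * y
    return w, b

def classify(text, w, b, high=1.0, medium=0.5):
    """Return (label, confidence band, score); thresholds are assumptions."""
    score = sum(w.get(t, 0.0) for t in tokenize(text)) + b
    label = "positive" if score > 0 else "negative"
    strength = abs(score)
    band = "high" if strength >= high else "medium" if strength >= medium else "low"
    return label, band, score

# Toy stand-ins for gold standard papers with (+1) and without (-1) RNAi experiments.
DOCS = [
    "rnai knockdown of daf-2 extended lifespan",
    "feeding rnai against unc-22 caused twitching",
    "antibody staining showed nuclear localization",
    "the crystal structure was solved at high resolution",
]
LABELS = [1, 1, -1, -1]

if __name__ == "__main__":
    w, b = train_linear_svm(DOCS, LABELS)
    print(classify("rnai knockdown caused sterility", w, b))
    print(classify("antibody staining of the embryo", w, b))
```

With hundreds of positives and thousands of negatives, as on the slide, a production system would use an established SVM library and full-text features; the point here is only the shape of the train-then-band workflow.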
Data Type Flagging – Support Vector Machines
SVMs trained for ten different data types:
Antibody
Genetic Interactions
Physical Interactions
Gene Expression
Regulation of Gene Expression
Variation Phenotypes
Overexpression Phenotypes
RNAi Phenotypes
Variation Sequence Change
Gene Structure Correction
See: Fang R, et al. (2012) Automatic categorization of diverse experimental information in the bioscience literature. BMC Bioinformatics. 13(1):16
Curation from Support Vector Machine Results
SVM results lead directly to manual curation:
e.g. RNAi Phenotypes
Results from SVMs are processed further
e.g. Variation Sequence Change
Pattern Matching Script – regular expressions
New variations (entity recognition)
e.g. mg366, ju43, e1360
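The entity recognition step for new variations can be illustrated with a short regular expression. This sketch assumes that variation names follow the pattern seen in the examples above (one to three lowercase letters followed by digits); the exact pattern and the `known` set of already-curated names are assumptions for illustration, not the script used in the pipeline.

```python
import re

# Assumed naming pattern: 1-3 lowercase letters, then digits (mg366, ju43, e1360).
# The lookarounds stop matches inside longer tokens or hyphenated gene names.
VARIATION_RE = re.compile(r"(?<![\w-])[a-z]{1,3}\d+(?![\w-])")

def find_new_variations(text, known=frozenset()):
    """Return candidate variation names in order of first appearance,
    skipping names that are already in the `known` set."""
    candidates = dict.fromkeys(VARIATION_RE.findall(text))
    return [name for name in candidates if name not in known]

if __name__ == "__main__":
    sentence = (
        "The alleles mg366, ju43 and e1360 were isolated in this screen; "
        "npr-1(ad609) was described previously."
    )
    print(find_new_variations(sentence, known={"ad609"}))
```

A pattern this loose will also match unrelated tokens of the same shape (for example, a protein named p53), which is one reason the script runs only on papers the SVM has already flagged.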
Data Type Flagging – Textpresso
www.textpresso.org
C. elegans, Mouse, D. melanogaster, Neuroscience, Arabidopsis, Dicty,
Wnt Pathway, HIV, Nematodes, S. cerevisiae, RegulonDB, … many others
Full text of articles
Terms, phrases, entities – semantically tagged
Keyword or category search
Match within sentence or entire paper
Textpresso Categories
Pre-existing dictionaries, vocabularies:
Gene names
ChEBI (Chemical Entities of Biological Interest)
PATO
Sequence Ontology (SO)
Manually constructed by curators using language from published literature:
Sequence similarity – orthologous, conserved
Localization assays – GFP, antibody, fluorescence
Experimental verbs – required, regulates, exhibits
Data Type Flagging - Textpresso Category Searches
Data Type: C. elegans Human Disease Homologs
Three-category Textpresso search:
C. elegans gene
'Ortholog', 'Homolog', 'Similar', 'Model'
Human disease
"We map this defect in dauer response to a mutation in the scd-2 gene, which, we show, encodes the nematode anaplastic lymphoma kinase (ALK) homolog, a proto-oncogene receptor tyrosine kinase."
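The three-category search can be sketched as a sentence-level filter that keeps only sentences matching all three categories. This is not Textpresso's implementation: the gene pattern, the homology word list, and the three-term disease lexicon below are small hypothetical stand-ins for Textpresso's curated categories.

```python
import re

GENE_RE = re.compile(r"\b[a-z]{3,4}-\d+\b")   # assumed gene-name pattern, e.g. scd-2
HOMOLOGY_RE = re.compile(r"\b(?:ortholog|homolog|similar|model)\w*", re.IGNORECASE)
DISEASE_TERMS = ("lymphoma", "cancer", "diabetes")   # hypothetical mini-lexicon

def split_sentences(text):
    """Naive splitter: break after sentence-ending punctuation plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def three_category_hits(text):
    """Return sentences containing a gene, a homology word and a disease term."""
    hits = []
    for sentence in split_sentences(text):
        lowered = sentence.lower()
        has_gene = GENE_RE.search(sentence) is not None
        has_homology = HOMOLOGY_RE.search(sentence) is not None
        has_disease = any(term in lowered for term in DISEASE_TERMS)
        if has_gene and has_homology and has_disease:
            hits.append(sentence)
    return hits

if __name__ == "__main__":
    text = (
        "We map this defect in dauer response to a mutation in the scd-2 gene, "
        "which, we show, encodes the nematode anaplastic lymphoma kinase (ALK) "
        "homolog, a proto-oncogene receptor tyrosine kinase. "
        "Worms were grown at 20 degrees."
    )
    for hit in three_category_hits(text):
        print(hit)
```

Requiring all three categories within one sentence is what keeps the result set small enough for a curator to review; matching across the entire paper would return far more hits.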
Textpresso: Semi-Automated Fact Extraction
Genetic Interactions
Interestingly, pph-5 (tm2979) behaved similarly to pph-5 (av101) in its ability to dominantly (but weakly) suppress sep-1 (e2406ts), but recessively suppress sep-1(ax110) (supplementary material Table S1).
Physical Interactions – after SVM document classifier
Remarkably, only AIN-1 coimmunoprecipitated HA-tagged Ce PAB-1 (Figure 3A and B, lane 7).
Gene Ontology – Cellular Component Curation
During embryogenesis, PAN-1 protein is uniformly distributed throughout the cytoplasm of the germline and somatic blastomeres, as seen for pan-1 mRNA (Fig. 2A), with no obvious concentration of PAN-1 in the P granules (Fig. 2K, N).
Textpresso: Semi-Automated GO Cellular Component Curation
Textpresso Search Results
Suggested GO Annotations
Gene Products
Textpresso Component
See: Van Auken KM, Jaffery J, Chan J, Müller HM, Sternberg PW. (2009) Semi-automated curation of protein subcellular localization: a text mining-based approach to Gene Ontology (GO) cellular component curation. BMC Bioinformatics. 10:228.
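The step from a Textpresso search result to a suggested GO annotation can be pictured as pairing each gene product in a sentence with each cellular component term found alongside it. The sketch below is a simplified assumption of that idea, not the published method: the protein-name pattern is a guess at the naming convention, and the lexicon holds only three GO cellular component terms.

```python
import re

# Small illustrative lexicon of GO cellular component terms.
COMPONENT_TERMS = {
    "cytoplasm": "GO:0005737",
    "nucleus": "GO:0005634",
    "plasma membrane": "GO:0005886",
}
PROTEIN_RE = re.compile(r"\b[A-Z]{2,4}-\d+\b")   # assumed protein naming, e.g. PAN-1

def suggest_annotations(sentence):
    """Pair every protein name in the sentence with every component term in it."""
    lowered = sentence.lower()
    proteins = list(dict.fromkeys(PROTEIN_RE.findall(sentence)))  # de-duplicate
    return [
        (protein, term, go_id)
        for protein in proteins
        for term, go_id in COMPONENT_TERMS.items()
        if term in lowered
    ]

if __name__ == "__main__":
    sentence = (
        "During embryogenesis, PAN-1 protein is uniformly distributed throughout "
        "the cytoplasm of the germline and somatic blastomeres, as seen for pan-1 "
        "mRNA (Fig. 2A), with no obvious concentration of PAN-1 in the P granules "
        "(Fig. 2K, N)."
    )
    for suggestion in suggest_annotations(sentence):
        print(suggestion)
```

Simple co-occurrence cannot tell a positive statement from a negated one (the same sentence says PAN-1 is not concentrated in P granules), so suggestions like these are meant for curator review rather than direct entry.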
Future Directions
Textpresso, other methods (HMMs) applied to additional data types
e.g. GO Biological Process curation (Phenotypes)
Focusing triage and fact extraction on novel findings
How best to integrate existing knowledge into curation pipelines to focus curator effort on new experimental results?
e.g. Commonly used molecular markers
Literature Annotation Tool – Tracking Evidence
WB, GO Common Annotation Framework, BioCreative
Summary
Text Mining Applications for Literature Curation:
Paper approval and full text acquisition
Data type flagging and entity recognition
Fact extraction – record evidence
All steps of our pipeline incorporate some form of semi- or fully automated approaches:
Scripts for downloads, pattern matching
Support Vector Machines for document classification
Textpresso for flagging and fact extraction
(Hidden Markov Models for flagging, fact extraction)
The WormBase Consortium, Textpresso
WormBase - Caltech: Paul Sternberg, Juancarlos Chan, Wen Chen, Chris Grove, Ranjana Kishore, Raymond Lee, Cecilia Nakamura, Daniela Raciti, Gary Schindelman, Kimberly Van Auken, Daniel Wang, Xiaodong Wang, Karen Yook. Former member: Ruihua Fang
Textpresso - Caltech: Hans-Michael Muller, Yuling Li, James Done. Former member: Arun Rangarajan
WormBase – OICR, Toronto: Lincoln Stein, Abigail Cabunoc, Todd Harris, JD Wong
WormBase – Washington University: John Spieth, Tamberlyn Bieri, Phil Ozersky
WormBase – EBI, Sanger, Hinxton, UK: Richard Durbin, Paul Kersey, Matt Berriman, Paul Davis, Michael Paulini, Kevin Howe, Mary Ann Tuli, Gary Williams
CGC – Oxford University, Oxford, UK: Jonathan Hodgkin
Hidden Markov Models: Semi-Automated GO Molecular Function Curation
For each sentence, HMM yields:
True positive score
False positive score
For each sentence, curator assigns:
Fully curatable (entity + indication of enzymatic activity)
Positive (experiment was performed, result but no entity)
False Positive (not about enzymatic activity at all)
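The slides do not say how the two HMM scores are computed. One possible reading, sketched here purely as an assumption, is that each sentence is scored under two models, one fitted to true positive sentences and one to false positives, with the forward algorithm giving a log-likelihood under each. Everything below is hypothetical: the three observation symbols, the word list, the protein-name pattern, and all probabilities are invented for illustration.

```python
import math
import re

ENTITY_RE = re.compile(r"[A-Z]{2,4}-\d+")   # assumed protein-name pattern
ACTIVITY_WORDS = {"kinase", "phosphatase", "activity", "catalyzes"}  # hypothetical

def symbolize(sentence):
    """Map each token to ENTITY, ACTIVITY or OTHER."""
    symbols = []
    for token in re.findall(r"[\w-]+", sentence):
        if ENTITY_RE.fullmatch(token):
            symbols.append("ENTITY")
        elif token.lower() in ACTIVITY_WORDS:
            symbols.append("ACTIVITY")
        else:
            symbols.append("OTHER")
    return symbols

TRANS = {"bg": {"bg": 0.6, "sig": 0.4}, "sig": {"bg": 0.4, "sig": 0.6}}
START = {"bg": 0.5, "sig": 0.5}
# Invented emission tables: the "true positive" model favors entity/activity tokens.
TP_EMIT = {"bg": {"OTHER": 0.8, "ENTITY": 0.1, "ACTIVITY": 0.1},
           "sig": {"OTHER": 0.2, "ENTITY": 0.4, "ACTIVITY": 0.4}}
FP_EMIT = {"bg": {"OTHER": 0.9, "ENTITY": 0.05, "ACTIVITY": 0.05},
           "sig": {"OTHER": 0.85, "ENTITY": 0.075, "ACTIVITY": 0.075}}

def forward_loglik(obs, emit, start=START, trans=TRANS):
    """Log-likelihood of a non-empty symbol sequence via the scaled forward algorithm."""
    states = list(start)
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    loglik = 0.0
    for symbol in obs[1:]:
        total = sum(alpha.values())
        loglik += math.log(total)                       # accumulate scaling factor
        alpha = {s: a / total for s, a in alpha.items()}
        alpha = {s: emit[s][symbol] * sum(alpha[r] * trans[r][s] for r in states)
                 for s in states}
    return loglik + math.log(sum(alpha.values()))

def score_symbols(obs):
    """Return (true positive score, false positive score) as log-likelihoods."""
    return forward_loglik(obs, TP_EMIT), forward_loglik(obs, FP_EMIT)

def score_sentence(sentence):
    return score_symbols(symbolize(sentence))

if __name__ == "__main__":
    print(score_sentence("PPH-5 phosphatase activity was measured"))
```

A curator's three-way judgment (fully curatable, positive, false positive) would then be recorded next to the two scores, which is the kind of feedback that could be used to refit the models.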