Text Mining for Biocuration of Bacterial Infectious Diseases
dan-sullivan -
GBCB Seminar
April 24, 2014
Biocuration in bacterial infectious diseases
Existing approaches to biocuration
Goals of current research
Sentence classification for virulence factor (VF) curation
Future Research
Large scale sequencing, transcriptomics, proteomics and metabolomics provide large volumes of data about structure and function
Valuable information about genes, proteins and other biological entities derived from interpretation of data
Publications capture information that researchers extract from data by aggregating, integrating, summarizing and analyzing experiment results and interpreting those results with respect to other published results
Gene annotation
◦ Virulence factors
◦ Antibiotic resistance
◦ Genomic metadata
Experiment metadata
◦ Transcriptomic metadata
◦ Metabolomic metadata
Literature
◦ Named entity recognition
◦ Metadata tagging
Automated annotation
◦ Example – RAST
◦ Transfer annotations based on similarity
◦ Metabolic reconstruction
Community curation
◦ Example – WikiGenes
◦ Collaborative manual curation
Model building
◦ Example – MetaFlux
◦ Predict missing components of pathways based on FBA models
Dedicated manual curation
◦ Example – PATRIC
◦ Curate entries with statements traceable to literature
◦ In 2009, half of biocurators were using text mining in support of biocuration [1]
◦ Common use cases: document prioritization; linking entities and relations to biological resources such as GO or UniProt; identification of evidence
◦ Identification of evidence: pattern recognition (genomic location information); named entity recognition (T4SS components); event extraction (positive/negative regulation)
1. PMID: 23110974
Manual procedures are time consuming and costly
Volume of literature continues to grow
Commonly used search techniques, such as keyword, similarity searching, metadata filtering, etc. can still yield volumes of literature that are difficult to analyze manually
Some success with popular tools but limitations
http://www.nature.com/nrmicro/journal/v8/n1/fig_tab/nrmicro2260_F2.html
http://stroke.nih.gov/materials/strokechallenges.htm
Potentially brittle methods, e.g. dictionary lookups
Questions of effort required to extend
Named entity recognition does not always disambiguate correctly
Prioritizing documents is still challenging
Textpresso Dictionary Entries
◦ Adhesion to host
◦ Adhesion to hosts
◦ Adhesion to other organism during symbiotic interaction
◦ Adhesion to other organism during symbiotic interactions
◦ Adhesion to symbiont
◦ Adhesion to symbionts
◦ Agglutination during conjugation with cellular fusion
◦ Agglutination during conjugation with cellular fusions
◦ Agglutination during conjugation without cellular fusion
◦ Agglutination during conjugation without cellular fusions
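The brittleness of dictionary lookups can be illustrated with a minimal sketch. The entries come from the Textpresso list above; the `find_entries` helper and both test sentences are hypothetical, and exact phrase matching is an assumption about how a simple lookup might behave, not a description of Textpresso itself.

```python
# Hypothetical exact-phrase dictionary lookup (not Textpresso's actual matcher).
DICTIONARY = [
    "adhesion to host",
    "adhesion to hosts",
    "adhesion to symbiont",
    "adhesion to symbionts",
]


def find_entries(sentence, dictionary=DICTIONARY):
    """Return the dictionary entries that occur verbatim in the sentence."""
    text = sentence.lower()
    return [entry for entry in dictionary if entry in text]


# A sentence that reuses the dictionary wording is matched.
matched = find_entries("FimH mediates adhesion to host cells.")

# A paraphrase with the same meaning is missed entirely.
missed = find_entries("FimH lets the bacterium adhere to the host.")

print(matched)
print(missed)
```

Every new phrasing needs its own entry, which is why the list carries singular and plural forms side by side.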
Generalized set of biocuration tools to:
◦ Filter and prioritize documents
◦ Identify relevant assertion sentences within documents
◦ Extract entities and events
◦ Require minimal manual intervention
Approach
◦ Address each objective separately
◦ Topic modeling and similarity measures for document classification
◦ Term Frequency-Inverse Document Frequency (TF-IDF) for sentence classification
◦ Shallow semantic parsing for entity and event extraction
Focus of this presentation is TF-IDF for sentence classification and its limitations
3 Key Components
◦ Data
◦ Representation scheme
◦ Algorithms
Data
◦ Positive examples – VF assertion sentences
◦ Negative examples – Randomly selected from same publications
Representation
◦ TF-IDF
◦ Vector space representation
◦ Cosine of vectors measure of similarity
Algorithms
◦ Supervised learning
SVMs
Ridge Classifier
Perceptrons
kNN
SGD Classifier
Naïve Bayes
Random Forest
AdaBoost
◦ Semi-supervised learning
Label Spreading
“Bacterial virulence factors enable a [pathogen] to replicate and disseminate within a host in part by subverting or eluding host defenses.” [1]
Example assertion sentences about virulence factors
Mutations in the fimH gene of Salmonella typhimurium result in a non-fimbriate, non-adhesive phenotype. [2]
Unexpectedly, here we find that nonacylated LprG retains TLR2 activity. [3]
The autolysin Ami contributes to the adhesion of Listeria. [4]
Negative examples are randomly selected non-VF assertion sentences from the same set of publications.
VF Sentence Set 1 - PATRIC team of biocurators identified 4,696 assertion sentences in 1,127 publications about virulence in 5 genera: Escherichia, Listeria, Mycobacterium, Salmonella, Shigella
VF Sentence Set 2 - Second round of curation over initial results yielded 3,716 VF assertion sentences from 787 publications across 6 genera: Bartonella, Escherichia, Listeria, Mycobacterium, Salmonella, Shigella
1. A. Cross, “What is a Virulence Factor” Crit Care. 2008; 12(6): 196.
2. Hancox, Yeh et al. 1997
3. Drage, Tsai et al. 2010
4. Milohanic, Jonquieres et al. 2001
Term Frequency (TF)
tf(t, d) = number of occurrences of t in d
where t is a term and d is a document
Inverse Document Frequency (IDF)
idf(t, D) = log(N / |{d in D : t in d}|)
where D is the set of documents and N is the number of documents
TF-IDF
tfidf(t, d, D) = tf(t, d) * idf(t, D)
TF-IDF is
◦ large when the term occurs often in the document but in few documents of the corpus
◦ small when the term appears in many documents
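The definitions above can be worked through by hand. This is a minimal sketch using the raw-count tf and unsmoothed idf given on the slide; the three tokenized toy "documents" are invented for illustration, and the helper names are hypothetical.

```python
import math
from collections import Counter

# Toy corpus: each "document" is one tokenized sentence (invented for illustration).
docs = [
    ["ami", "contributes", "to", "adhesion"],
    ["data", "were", "log", "transformed"],
    ["adhesion", "to", "host", "cells"],
]


def tf(term, doc):
    """Raw count of the term in one document."""
    return Counter(doc)[term]


def idf(term, corpus):
    """log(N / number of documents containing the term).

    Assumes the term occurs in at least one document of the corpus.
    """
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)


def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)


rare = tf_idf("ami", docs[0], docs)         # term appears in 1 of 3 documents
common = tf_idf("adhesion", docs[0], docs)  # term appears in 2 of 3 documents
print(round(rare, 3), round(common, 3))
```

Both terms occur once in the first document, so the difference in weight comes entirely from the idf factor.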
Bag-of-words model
Ignores structure (syntax) and meaning (semantics) of sentences
Representation vector length is the size of set of unique words in corpus
Stemming used to remove morphological differences
Each word is assigned an index in the representation vector, V
The value V[i] is non-zero if word appears in sentence represented by vector
The non-zero value is a function of the frequency of the word in the sentence and the frequency of the term in the corpus
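A minimal sketch of this vector space representation, assuming scikit-learn's `TfidfVectorizer`. Note that its default weighting smooths the idf term and L2-normalizes each vector, so the values differ from the plain formula, and no stemming is applied here. The second sentence is invented for illustration; the other two are quoted in this talk.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "The autolysin Ami contributes to the adhesion of Listeria.",
    # Invented paraphrase, used only to show a similar sentence.
    "Ami is required for adhesion of Listeria to host cells.",
    "Data were log-transformed to correct for heterogeneity of the variances.",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(sentences)  # sparse matrix, one row per sentence

# Vector length equals the number of unique words in the corpus.
n_sentences, vocab_size = X.shape

# Pairwise cosine similarity between the sentence vectors.
sims = cosine_similarity(X)
print(vocab_size)
print(sims.round(2))
```

The first two sentences share content words such as "Ami", "adhesion" and "Listeria", so their cosine is higher than that of the first and third, which share only function words.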
Support Vector Machine (SVM) is a large-margin classifier
Commonly used in text classification
Initial results based on VF Sentence Set 1
Image source: http://en.wikipedia.org/wiki/File:Svm_max_sep_hyperplane_with_margin.png
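A minimal sketch of TF-IDF sentence classification with a linear SVM, assuming scikit-learn's `LinearSVC` in a pipeline. The six training sentences are quoted in this talk, with labels following the curators' VF / non-VF assignment. The pipeline layout and default parameters are assumptions, not the configuration used for the reported results, and six sentences are far too few for a meaningful model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# 1 = VF assertion sentence, 0 = non-VF sentence.
train_sentences = [
    "Mutations in the fimH gene of Salmonella typhimurium result in a "
    "non-fimbriate, non-adhesive phenotype.",
    "Unexpectedly, here we find that nonacylated LprG retains TLR2 activity.",
    "The autolysin Ami contributes to the adhesion of Listeria.",
    "Collectively, these data suggest that EPEC 30-5-1(3) translocates "
    "reduced levels of EspB into the host cell.",
    "Data were log-transformed to correct for heterogeneity of the "
    "variances where necessary.",
    "Subsequently, the kanamycin resistance cassette from pVK4 was cloned "
    "into the PstI site of pMP3.",
]
labels = [1, 1, 1, 0, 0, 0]

# TF-IDF features feed directly into a linear large-margin classifier.
classifier = make_pipeline(TfidfVectorizer(), LinearSVC())
classifier.fit(train_sentences, labels)

predictions = classifier.predict(
    ["The autolysin Ami contributes to the adhesion of Listeria."]
)
print(predictions)
```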
Non-VF, Predicted VF:
◦ “Collectively, these data suggest that EPEC 30-5-1(3) translocates reduced levels of EspB into the host cell.”
◦ “Data were log-transformed to correct for heterogeneity of the variances where necessary.”
◦ “Subsequently, the kanamycin resistance cassette from pVK4 was cloned into the PstI site of pMP3, and the resulting plasmid pMP4 was used to target a disruption in the cesF region of EHEC strain 85-170.”
VF, Predicted Non-VF:
◦ “Here, it is reported that the pO157-encoded Type V-secreted serine protease EspP influences the intestinal colonization of calves.”
◦ “Here, we report that intragastric inoculation of a Shiga toxin 2 (Stx2)-producing E. coli O157:H7 clinical isolate into infant rabbits led to severe diarrhea and intestinal inflammation but no signs of HUS.”
◦ “The DsbLI system also comprises a functional redox pair”
Adding more examples is unlikely to substantially improve results, as the error curve shows
[Figure: error curve titled “All”, plotting Training Error and Validation Error (y-axis 0 to 0.5) over 0 to 10,000 examples]
8 Alternative Algorithms
VF Sentence Set 2
Select 10,000 most important features using chi-square
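A minimal sketch of chi-square feature selection, assuming scikit-learn's `SelectKBest` with the `chi2` score function. The toy sentences are invented or paraphrased for illustration, and `k` is set to 5 only because the toy vocabulary is tiny; the slide uses 10,000.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Toy data (invented or paraphrased): 1 = VF sentence, 0 = non-VF sentence.
sentences = [
    "fimH mutation abolishes adhesion",
    "Ami contributes to adhesion of Listeria",
    "EspP influences intestinal colonization",
    "data were log transformed",
    "plasmid was cloned into the PstI site",
    "cells were grown overnight",
]
labels = [1, 1, 1, 0, 0, 0]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(sentences)

# Keep the k features whose chi-square score against the labels is highest.
k = 5
selector = SelectKBest(chi2, k=k)
X_selected = selector.fit_transform(X, labels)

# Recover the words behind the selected columns.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
kept_terms = [vocab[i] for i in selector.get_support(indices=True)]
print(X.shape, X_selected.shape)
print(kept_terms)
```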
Machine learning technique that takes advantage of unlabeled data
Unlabeled data helps determine shape of underlying data distribution
Added randomly selected, unlabeled sentences from VF publications
Trained with 842 labeled and 4,346 unlabeled sentences
Label Spreading is a semi-supervised algorithm somewhat resilient to noise
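A minimal sketch of the semi-supervised setup, assuming scikit-learn's `LabelSpreading`, where unlabeled examples are marked with `-1`. The eight toy sentences, the k-nearest-neighbour kernel and `n_neighbors=3` are illustrative assumptions, not the settings used in this work.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.semi_supervised import LabelSpreading

# Toy sentences, invented or paraphrased for illustration.
sentences = [
    "Ami contributes to adhesion of Listeria",     # labeled VF
    "EspP influences colonization of calves",      # labeled VF
    "data were log transformed before analysis",   # labeled non-VF
    "the plasmid was cloned into the PstI site",   # labeled non-VF
    "FimH contributes to adhesion of Salmonella",  # unlabeled
    "Stx2 influences colonization of rabbits",     # unlabeled
    "data were normalized before analysis",        # unlabeled
    "the cassette was cloned into the plasmid",    # unlabeled
]
# -1 marks an unlabeled sentence.
labels = np.array([1, 1, 0, 0, -1, -1, -1, -1])

# Dense array keeps the sketch compatible with older scikit-learn releases.
X = TfidfVectorizer().fit_transform(sentences).toarray()

model = LabelSpreading(kernel="knn", n_neighbors=3)
model.fit(X, labels)

# transduction_ holds the label inferred for every sentence, labeled or not.
inferred = model.transduction_
print(inferred)
```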
Algorithms have parameters not learned from data
SVMs, for example:
◦ C – balances training error and over-fitting
◦ Kernel – function to map data to high-dimensional space, e.g. linear, polynomial
◦ Gamma – parameter in non-linear kernels, controls how far influence of a training instance reaches
Search combinations of parameters
Optimal results with linear kernel and slightly smaller C than default
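A minimal sketch of such a parameter search, assuming scikit-learn's `GridSearchCV` over `SVC`. The synthetic data from `make_classification` stands in for a TF-IDF feature matrix, and the grid values are illustrative assumptions rather than the values searched in this work.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for a feature matrix and VF / non-VF labels.
X, y = make_classification(n_samples=80, n_features=20, random_state=0)

# Gamma only matters for non-linear kernels, so it gets its own sub-grid.
param_grid = [
    {"kernel": ["linear"], "C": [0.1, 0.5, 1.0, 10.0]},
    {
        "kernel": ["rbf", "poly"],
        "C": [0.1, 0.5, 1.0, 10.0],
        "gamma": [0.01, 0.1, 1.0],
    },
]

# Every combination is scored with 3-fold cross-validation.
search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X, y)

best = search.best_params_
print(best, round(search.best_score_, 3))
```

The default `C` in `SVC` is 1.0, so a grid that includes smaller values such as 0.5 can surface a "slightly smaller than default" optimum if one exists for the data at hand.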
Process of explicitly modeling relations between variables or explicitly representing information not already in a representation scheme, for example:
◦ Replace all numerals with the token NUMBER
◦ Replace gene/protein names with the term GENE_Protein
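The two substitutions above can be sketched with regular expressions. The `GENE_NAMES` list and the `generalize` helper are hypothetical; a real system would need a curated lexicon or a named entity recognizer to find gene and protein names.

```python
import re

# Hypothetical list of gene/protein names, for illustration only.
GENE_NAMES = ["fimH", "LprG", "TLR2", "EspP"]

# Standalone integers or decimals, not digits embedded in a name like TLR2.
NUMBER_RE = re.compile(r"\b\d+(?:\.\d+)?\b")
GENE_RE = re.compile(r"\b(?:" + "|".join(map(re.escape, GENE_NAMES)) + r")\b")


def generalize(sentence):
    """Replace known gene/protein names and numerals with generic tokens."""
    sentence = GENE_RE.sub("GENE_Protein", sentence)
    return NUMBER_RE.sub("NUMBER", sentence)


result = generalize("Nonacylated LprG retains TLR2 activity in 3 assays.")
print(result)
# Nonacylated GENE_Protein retains GENE_Protein activity in NUMBER assays.
```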
Used in text classification problems, e.g. phrase-based learning has improved some rule-based classifiers. [1]
Rule-based learners may not be generalizable to other domains
Taxonomic structure of Unified Medical Language System (UMLS) used to create semantic similarity measures. [2]
Most informative features can be detected automatically, e.g. chi-square
Manual feature engineering is not a viable option if our goal is topic-independent support for biocuration
1. DOI:10.1.1.36.9770
2. PMID: 22580178
Improve quality of data (quantity not likely helpful)
Utilize multiple supervised algorithms, ensemble and non-ensemble
Use unlabeled data and semi-supervised techniques
Feature Selection
Parameter Tuning
Feature Engineering
Given:
◦ High quality data in sufficient quantity
◦ State of the art machine learning algorithms
How to improve results: Change Representation?
TF-IDF
◦ Loss of syntactic and semantic information
◦ No relation between term index and meaning
◦ No support for disambiguation
◦ Feature engineering extends the vector representation or substitutes more general terms for specific ones – a crude way to capture semantic properties
Ideal Representation
◦ Captures semantic similarity of words
◦ Does not require feature engineering
◦ Minimal pre-processing, e.g. no mapping to ontologies
◦ Improves precision and recall
Words represented as a set of weights in a vector
Useful properties
◦ Semantically similar words in close proximity
◦ Methods for capturing phrases, e.g. “Secretion system”
◦ Captures some semantic features
Trained with
◦ Skip-gram or CBOW algorithms
◦ Text, such as PubMed abstracts and open access papers
T. Mikolov et al. “Efficient Estimation of Word Representations in Vector Space.” 2013. http://arxiv.org/pdf/1301.3781.pdf
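The "close proximity" property can be sketched with cosine similarity over word vectors. The three-dimensional vectors below are invented by hand purely for illustration; they are not the output of skip-gram or CBOW training, and real vectors are learned from text and typically have hundreds of dimensions. The helper names are hypothetical.

```python
import numpy as np

# Invented toy vectors, NOT trained embeddings.
vectors = {
    "adhesin": np.array([0.9, 0.1, 0.0]),
    "invasin": np.array([0.8, 0.2, 0.1]),
    "plasmid": np.array([0.1, 0.9, 0.2]),
    "variance": np.array([0.0, 0.2, 0.9]),
}


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def most_similar(word, vectors):
    """Return the other word whose vector is closest by cosine similarity."""
    others = (w for w in vectors if w != word)
    return max(others, key=lambda w: cosine(vectors[word], vectors[w]))


nearest = most_similar("adhesin", vectors)
print(nearest)
```

Unlike a TF-IDF index, where position in the vector carries no meaning, proximity in this space is what encodes relatedness between words.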
Utilize distributed representations in classification algorithms
Compare SVM and multi-layered neural network for classification
Build on distributed word representation as basis for shallow semantic parsing and information extraction
Apply to other specialty gene sets
PATRIC Curators: Rebecca Wattam, Chunhong Mao, David Abraham, Meredith Wilson, Yan Zhang
Resources
◦ PATRIC www.patricbrc.org
◦ Python, NumPy, SciPy, Scikit-Learn
◦ iPython
◦ Gensim
Funding
◦ National Institute of Allergy and Infectious Diseases, National Institutes of Health