Research presentation-wd
AbstractDB & ProteinComplexDB: A database of protein complexes and their abstracts
Wagied Davids, PhD Banting & Best Dept. of Medical Research, Dept. of Medical Genetics and Microbiology, Donnelly CCBR, 160 College Street, University of Toronto
My Expertise
Comparative Evolutionary Genomics
Detection and identification of sequence homologues
Analysis of mutation rates (dN/dS) and single nucleotide polymorphisms (SNPs)
Horizontal Gene Transfer in Bacteria
Graph-theoretic analysis of biological and literature-derived gene networks
Analysis of Sequence-Structure of functional variants
Text-mining:
Construction of literature-derived pathways and networks involving disease genes.
Analysis of microarray gene expression:
Differential gene expression
Gene-Drug profiles
Gene regulation network construction.
Protein Structure - Function analysis of prioritized candidate disease genes by mapping mutation hotspots onto 3D protein structures.
Presentation Overview
AbstractDB – database of abstracts pertaining to protein complexes
Online PubMed abstract curation tool.
ProteinComplexDB – database of extracted protein complexes
Existing Protein Complex Databases
Only 2 high quality human-curated Protein Complex databases available.
Both are products from MIPS - (Munich Information Centre for Protein Sequences, Germany)
(http://mips.gsf.de/genre/proj/yeast/)
MIPS-Yeast Protein Complex catalogue
CORUM – Mammalian Protein Complex catalogue.
Importance of Network Biology, Protein Complexes and Disease
Proteins rarely function in isolation.
Instead, proteins:
participate in protein interactions, e.g. phosphorylation
form part of protein complexes, e.g. Mre11-Rad50-Nbs1
act together to form pathways, e.g. signalling cascades
From a Systems Biology perspective:
“Cancer – aberrant state of a biological network.”
Fanconi Anaemia Core Protein Complex
FA core protein complex: (FANCA, B, C, E, F, G, M and L)
● Ref: Youds et al. (2008) Mutation Research doi:10.1016/j.mrfmm.2008.11.007
Fanconi anaemia
FA is a severe human recessive disorder.
Defects in FA genes cause chromosomal aberrations and sensitivity to DNA inter-strand cross-links (ICLs).
The 13 FA proteins may constitute a pathway for the repair of DNA inter-strand cross-links.
Evolutionary conservation of FA genes from humans to worms and zebrafish.
C. elegans functional homologs:
brc-2 (FANCD1/BRCA2)
fcd-2 (FANCD2)
dog-1 (FANCJ/BRIP1)
Gene deletion in C. elegans (worm) results in lethality, ICL sensitivity, sterility.
Q. Which search engine for PROTEIN COMPLEXES?
1. Relevant for protein complexes and their interactions
2. Would be good if it could identify gene/protein names for me!
3. ...and experimental methods too!
4. ...mmh... If it could search & validate my curations... I would not do anything...!
Project Conception
Comparison criteria
Relevance:
Protein complexes and protein interactions
Named Entity Recognition (NER):
genes, proteins, cell lines, cell types, experimental methods, discriminatory words
User-interactivity (UI)
Construct curations of protein complexes
Validate by searching against known protein complex and protein interaction databases.
Q. Feasibility
Q1. How much information is contained within unstructured text from PubMed abstracts for extracting protein complexes?
Q2. In the absence of complete knowledge, is a perfect solution desired or a good starting point?
Q3. What about large-scale high-throughput studies which are not referenced in abstracts or text documents?
CORUM protein complex database
Small-scale studies (SSS) account for 76% (1024/1346) of protein complexes derived from the literature-curated CORUM database.
[Figure: bar chart of the count of PubMed identifiers per category (SSS, MSS, LSS)]
SSS: 2-5 protein complex members
MSS: 6-10 protein complex members
LSS: >= 11 protein complex members
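The size categories above can be written as a small rule. This is a minimal sketch: the function name `size_category` is hypothetical and not part of the original work, and the final line only re-derives the 76% share quoted for small-scale studies.

```python
def size_category(n_members: int) -> str:
    """Map the number of members of a protein complex to SSS, MSS or LSS.

    Thresholds follow the slide: SSS 2-5, MSS 6-10, LSS >= 11 members.
    The function name is an illustrative assumption.
    """
    if n_members < 2:
        raise ValueError("a protein complex needs at least 2 members")
    if n_members <= 5:
        return "SSS"
    if n_members <= 10:
        return "MSS"
    return "LSS"


# Share of small-scale studies quoted for CORUM: 1024 out of 1346 complexes.
sss_share = 1024 / 1346
print(f"SSS share: {sss_share:.0%}")
```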
Manual curation – Steps involved
Find all articles related to protein complexes.
Identify gene/protein names by eye.
Identify terms establishing a relationship between proteins.
Infer whether or not to add a new member to an existing protein complex.
Search using NCBI PubMed
Q. Why not use the PubMed search engine?
The PubMed search engine's retrieval model is called pmra.
pmra is a topic-based content similarity model.
The PubMed search engine focusses on “relatedness” rather than relevance, i.e. the probability that a user wants to examine a particular document given known interest in another document.
From Document clusters to Protein Clusters
Corpus of Documents -> Document Clusters -> Protein Clusters (Protein Complexes & their Interactions)
AbstractDb
User Interface - AbstractDb
Aim
Use literature-derived information to:
Rank documents according to protein complex relevance score.
Assign confidence scores to protein interactions.
Provide an updated catalogue of protein complexes.
Our initial step towards our goal is to develop a “Recommender system” for ranking abstracts with relevance to protein complexes.
Our hypothesis
Abstracts discussing protein complexes can be distinguished from non-relevant abstracts based on the frequency distribution of words in a hand-curated data set on protein complexes versus a data set of background word frequencies.
Our method
Our method is based on a Naïve Bayesian classifier using discriminatory words [5].
Discriminatory words: a selected subset of high-scoring words that characterize abstracts discussing protein complexes.
The discriminatory words include both high and low frequency words that distinguish abstracts discussing protein complexes.
Our use of a “stopword” list removes high frequency non-informative words, e.g. “the”, “a”, “of”, “for”.
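One way to pick discriminatory words is sketched below. The slides do not state the exact selection statistic, so ranking by the absolute log-ratio of smoothed word frequencies is an assumption, as are the tiny stopword list, the pseudocount and all function names.

```python
import math
import re
from collections import Counter

# Illustrative stopword list only; a real list would be much longer.
STOPWORDS = {"the", "a", "of", "for", "and", "in", "to", "is"}


def tokenize(text):
    """Lowercase, split into alphanumeric tokens and drop stopwords."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS]


def word_counts(abstracts):
    """Return per-word counts and the total token count for a set of abstracts."""
    counts = Counter()
    for abstract in abstracts:
        counts.update(tokenize(abstract))
    return counts, sum(counts.values())


def discriminatory_words(relevant, background, top_k=80, pseudocount=1.0):
    """Rank words by |log(f_relevant / f_background)| (assumed statistic)."""
    rel_counts, rel_total = word_counts(relevant)
    bg_counts, bg_total = word_counts(background)
    vocab = set(rel_counts) | set(bg_counts)
    scores = {}
    for word in vocab:
        # Add-one style smoothing keeps frequencies non-zero in both sets.
        f_rel = (rel_counts[word] + pseudocount) / (rel_total + pseudocount * len(vocab))
        f_bg = (bg_counts[word] + pseudocount) / (bg_total + pseudocount * len(vocab))
        scores[word] = math.log(f_rel / f_bg)
    ranked = sorted(vocab, key=lambda w: (-abs(scores[w]), w))
    return [(w, scores[w]) for w in ranked[:top_k]]
```

Positive scores mark words that are over-represented in the relevant abstracts and negative scores mark words typical of the background, so both kinds can serve as discriminatory words.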
Our model
Assume Poisson word model:
Probability of observing a given word in a document:
n = Count of word occurrences
N = Total number of words in a set of training abstracts
f = Dictionary word frequency
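Under the stated Poisson assumption, and taking the expected count of a word to be N·f (an assumption based on the symbol definitions above), the probability would take the standard Poisson form:

```latex
P(n \mid N, f) = \frac{e^{-Nf}\,(Nf)^{n}}{n!}
```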
Using the 500 most significant words, we constructed
a discriminatory word list of 80 words for scoring abstracts.
Does the abstract discuss protein complexes or not?
Calculate log-likelihood score for individual abstract by summing over all discriminatory words.
FN,i : dictionary frequency of discriminatory word
FI,i : frequency of discriminatory word in training abstract
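The scoring step can be sketched as a sum of per-word log-likelihood ratios. This assumes the Poisson word model with mean N·f and treats N as the number of words in the abstract being scored; the exact formula on the slide is not recoverable, and the function and parameter names are hypothetical.

```python
import math


def poisson_log_pmf(n, total_words, freq):
    """Log of the Poisson probability of seeing a word n times, with mean total_words * freq."""
    lam = total_words * freq
    return -lam + n * math.log(lam) - math.lgamma(n + 1)


def log_likelihood_score(word_counts, total_words, f_training, f_dictionary):
    """Sum log-likelihood ratios over all discriminatory words.

    f_training[w]   : frequency of word w in the training abstracts (F_I,i)
    f_dictionary[w] : dictionary (background) frequency of word w (F_N,i)
    Both frequencies must be greater than zero. Assumed form, not the authors' code.
    """
    score = 0.0
    for word, f_i in f_training.items():
        n = word_counts.get(word, 0)
        score += poisson_log_pmf(n, total_words, f_i)
        score -= poisson_log_pmf(n, total_words, f_dictionary[word])
    return score
```

A positive score means the word counts fit the protein-complex training frequencies better than the background frequencies; a negative score means the opposite.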
Our system
Our system consists of the following components:
A set of PubMed abstracts from 1965 - 2008 retrieved with the query “protein complex”;
A Bayesian probabilistic method for calculating an article's relevance in discussing protein complexes, using word occurrences found in the training set;
A method for extracting gene/protein names using a biological named entity recognizer, ABNER [6];
A Wiki resource to enable scientists to evaluate and revise the data.
Query terms used for construction of protein complex abstract data sets
Query Term                                       No. of abstracts retrieved
“protein complex”                                499918
“cell cycle” AND “protein complex”               19360
“chromatin remodeling” AND “protein complex”     238
“DNA repair” AND “protein complex”               325
(including abstracts published 1965 - 2008)
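Counts like these can be requested from the NCBI E-utilities ESearch endpoint. The slides do not say how the abstracts were actually retrieved, so the sketch below is an assumption; it only builds the request URL (no network call), and the helper name `build_pubmed_count_url` is hypothetical.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities ESearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_count_url(term, min_year=1965, max_year=2008):
    """Build an ESearch URL that asks PubMed for the number of matching records.

    The date range is applied to the publication date (datetype=pdat).
    """
    params = {
        "db": "pubmed",
        "term": term,
        "datetype": "pdat",
        "mindate": str(min_year),
        "maxdate": str(max_year),
        "rettype": "count",
    }
    return EUTILS_ESEARCH + "?" + urlencode(params)


print(build_pubmed_count_url('"DNA repair" AND "protein complex"'))
```

Counts returned today will differ from the table, since PubMed indexing has changed since the data set was assembled.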
Validation of Bayesian classification of PubMed abstracts using hand-curated data sets
Data set               Positives  Negatives  Accuracy  Precision  Recall  F-measure
Apoptosis              138        94         0.89      0.93       0.89    0.91
Cell cycle             600        702        0.96      0.97       0.94    0.96
Chromatin remodelling  155        81         0.83      0.93       0.84    0.88
DNA repair             203        22         0.91      0.96       0.88    0.92
Accuracy = (TP + TN) / (TP + FP + FN + TN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F-measure = 2 * Precision * Recall / (Precision + Recall)
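The four measures follow directly from the confusion-matrix counts. A minimal sketch, with a hypothetical function name and made-up example counts that are not taken from the validation data sets:

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F-measure from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Illustrative counts only.
metrics = classification_metrics(tp=8, fp=2, fn=2, tn=8)
print(metrics)
```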
Performance Evaluation
[Figure panels: i. Apoptosis; ii. Cell cycle; iii. Chromatin remodeling; iv. DNA repair]
A text-based Protein Assay
Named Entity Recognition for identifying gene and protein names
A challenging task due to the irregularities and
ambiguities in gene and protein nomenclature.
Synonyms and versioning of dbxref.
Online Annotation Tool for PubMed abstract
Biological entities recognised:
Protein
DNA
RNA
CELL LINE
CELL TYPE
[Figure: Cscore and ABNER, GeneTagger and KEX values plotted against Sentence Id (1-8)]
PMID:10871607
SentenceId  Cscore  ABNER  GeneTagger  KEX   Sentence
1           1.5     0      0.12        0.08  The Rad51 protein in eukaryotic cells is a structural and functional homolog of Escherichia coli RecA with a role in DNA repair and genetic recombination.
2           0.62    0.06   0.06        0.12  Several proteins showing sequence similarity to Rad51 have previously been identified in both yeast and human cells.
3           -0.31   0.05   0.1         0.15  In Saccharomyces cerevisiae, two of these proteins, Rad55p and Rad57p, form a heterodimer that can stimulate Rad51-mediated DNA strand exchange.
4           -1.11   0      0.12        0.12  Here, we report the purification of one of the representatives of the RAD51 family in human cells.
5           1.25    0      0.14        0.17  We demonstrate that the purified RAD51L3 protein possesses single-stranded DNA binding activity and DNA-stimulated ATPase activity, consistent with the presence of "Walker box" motifs in the deduced RAD51L3 sequence.
6           2.01    0.06   0.17        0.22  We have identified a protein complex in human cells containing RAD51L3 and a second RAD51 family member, XRCC2.
7           3.47    0.13   0.13        0.2   By using purified proteins, we demonstrate that the interaction between RAD51L3 and XRCC2 is direct.
8           0.66    0.06   0.06        0.06  Given the requirements for XRCC2 in genetic recombination and protection against DNA-damaging agents, we suggest that the complex of RAD51L3 and XRCC2 is likely to be important for these functions in human cells.
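The per-sentence scores can be used to surface the sentences most likely to describe a complex. The sketch below assumes that a higher Cscore indicates higher relevance, which the slide does not state explicitly; the Cscore values are copied from the table for PMID:10871607 and the function name is hypothetical.

```python
# (sentence id, Cscore) pairs taken from the table for PMID:10871607.
SENTENCE_SCORES = [
    (1, 1.5), (2, 0.62), (3, -0.31), (4, -1.11),
    (5, 1.25), (6, 2.01), (7, 3.47), (8, 0.66),
]


def top_sentences(scored, k=2):
    """Return the ids of the k sentences with the highest Cscore."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [sentence_id for sentence_id, _ in ranked[:k]]


print(top_sentences(SENTENCE_SCORES))
```

With these values the two top-ranked sentences are 7 and 6, the ones that explicitly mention the RAD51L3-XRCC2 interaction and complex.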
Syntax Parsing - semantic relations among words
Example Scenario
Q. What are the members of the FEAR complex?
1. Keyword: FEAR
2. List of abstracts relevant to the FEAR protein complex
Similar article: CONDENSIN
smc2-8 and smc4-1
FEAR complex: cdc14, esp1, cdc5 (explicit sentence)
FEAR complex: cdc14, esp1, cdc5, spo12, fob1 (explicit sentences)
Validation
ProteinComplexDB
Conclusion
We have undertaken an initial step towards developing: a “Recommender system” for ranking abstracts with relevance to protein complexes.
a Curation Tool for extracting Protein Complexes from literature
We are in the process of: Constructing a database of Protein Complexes, and
Linking Protein Complexes to Pathways and Disease phenotypes.
Ultimate aim of understanding biological mechanisms behind complex Disease phenotypes
Acknowledgements
Zhang Zhang and lab members:
• Ivan Borozan
• Dong (Derek) Dong
• Matthew Fagnani
• Yunchen Gong
• Sumedha Gunewardena
• Gabe Musso
• Renqiang Min
• Sanaa Mahmood
• Jingjing Li
• Yu Liu
• Apostolos Lydakis
• Lee Zamparo