Research presentation-wd

AbstractDB & ProteinComplexDB: A database of protein complexes and their abstracts

Wagied Davids, PhD Banting & Best Dept. of Medical Research, Dept. of Medical Genetics and Microbiology, Donnelly CCBR, 160 College Street, University of Toronto

My Expertise

Comparative Evolutionary Genomics

Detection and identification of sequence homologues

Analysis of mutation rates (dN/dS) and single nucleotide polymorphisms (SNPs)

Horizontal Gene Transfer in Bacteria

Graph-theoretic analysis of biological and literature-derived gene networks

Analysis of Sequence-Structure of functional variants

Text-mining:

Construction of literature-derived pathways and networks involving disease genes.

Analysis of microarray gene expression:

Differential gene expression

Gene-Drug profiles

Gene regulation network construction.

Protein Structure - Function analysis of prioritized candidate disease genes by mapping mutation hotspots onto 3D protein structures.

Presentation Overview

AbstractDB – database of abstracts pertaining to protein complexes

Online PubMed abstract curation tool.

ProteinComplexDB – database of extracted protein complexes

Existing Protein Complex Databases

Only 2 high quality human-curated Protein Complex databases available.

Both are products of MIPS (Munich Information Centre for Protein Sequences, Germany)

(http://mips.gsf.de/genre/proj/yeast/)

MIPS-Yeast Protein Complex catalogue

CORUM – Mammalian Protein Complex catalogue.


Importance of Network Biology, Protein Complexes and Disease

Proteins rarely function in isolation.

Instead, proteins participate in:

protein interactions, e.g. phosphorylation

protein complexes, e.g. mre11-rad50-nbs1

pathways, e.g. signalling cascades

From a Systems Biology perspective:

“Cancer – aberrant state of a biological network.”

Fanconi Anaemia Core Protein Complex

FA core protein complex: (FANCA, B, C, E, F, G, M and L)

● Ref: Youds et al. (2008) Mutation Research doi:10.1016/j.mrfmm.2008.11.007

Fanconi anaemia (FA): a severe human recessive disorder.

Defects in FA genes lead to chromosomal aberrations and sensitivity to DNA inter-strand cross-links (ICLs).

The 13 FA proteins may constitute a pathway for the repair of DNA inter-strand cross-links.

Evolutionary conservation of FA genes from humans to worms and zebrafish.

C. elegans Functional homologs:

brc-2 (FANCD1/BRCA2);

fcd-2 (FANCD2);

dog-1 (FANCJ/BRIP1);

Gene deletion in C. elegans (worm) results in lethality, ICL sensitivity, sterility.

Q. Which search engine for PROTEIN COMPLEXES?

1. Relevant for protein complexes and their interactions

2. Would be good if it could identify gene/protein names for me!

3. ...and experimental methods too!

4. ...mmh... if it could search and validate my curations, I would not have to do anything!

Project Conception

Comparison criteria

Relevance:

Protein complexes and protein interactions

Named Entity Recognition (NER):

genes, proteins, cell lines, cell types, experimental methods, discriminatory words

User-interactivity (UI)

Construct curations of protein complexes

Validate by searching against known protein complex and protein interaction databases.
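One simple way to check a curated complex against a known database entry is set overlap. The sketch below uses the Jaccard index, which is an assumption made here and not necessarily the measure used by the tool. The function name `jaccard_overlap` and the example member sets are illustrative only.

```python
def jaccard_overlap(curated, reference):
    """Share of members common to both sets, relative to their union."""
    # Compare names case-insensitively; synonym handling is out of scope here.
    curated = {name.lower() for name in curated}
    reference = {name.lower() for name in reference}
    if not curated and not reference:
        return 0.0
    return len(curated & reference) / len(curated | reference)


# Illustrative input: a curation that recovered two of three known members.
curated = {"mre11", "rad50"}
reference = {"mre11", "rad50", "nbs1"}
print(round(jaccard_overlap(curated, reference), 2))
```

A score of 1.0 would mean the curation matches the reference entry exactly; lower values flag missing or extra members for review.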

Q. Feasibility

Q1. How much information is contained within unstructured text from PubMed abstracts for extracting protein complexes?

Q2. In the absence of complete knowledge, is a perfect solution desired or a good starting point?

Q3. What about large-scale high-throughput studies which are not referenced in abstracts or text documents?

CORUM protein complex database


Small-scale studies (SSS) account for 76% (1024/1346) of protein complexes derived from the literature-curated CORUM database.

[Figure: count of PubMed identifiers (y-axis, 0 to 1200) by category (x-axis: SSS, MSS, LSS)]

SSS: 2-5 protein complex members

MSS: 6-10 protein complex members

LSS: >= 11 protein complex members
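The size categories in the figure legend can be written as a small lookup. This is a minimal sketch: the function name `size_category` is hypothetical, and rejecting complexes with fewer than 2 members is an assumption.

```python
def size_category(n_members: int) -> str:
    """Map the number of members in a protein complex to SSS, MSS or LSS."""
    if n_members < 2:
        # Assumption: a complex needs at least 2 members.
        raise ValueError("a protein complex needs at least 2 members")
    if n_members <= 5:
        return "SSS"
    if n_members <= 10:
        return "MSS"
    return "LSS"


# Share of SSS entries reported for CORUM: 1024 of 1346.
sss_share = 1024 / 1346
print(f"{sss_share:.0%}")
```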

Manual curation – Steps involved

Find all articles related to protein complexes.

Identify gene/protein names by eye.

Identify terms establishing a relationship between proteins.

Infer whether or not to add a new member to an existing protein complex.

Search using NCBI PubMed

Q. Why not use the PubMed search engine?

The PubMed search engine's retrieval model is called pmra.

pmra is a topic-based content similarity model.

The PubMed search engine focusses on “relatedness” rather than relevance, i.e. the probability that a user wants to examine a particular document given known interest in another document.

From Document clusters to Protein Clusters

Corpus of Documents

Document Clusters

Protein Clusters (Protein Complexes & their Interactions)

AbstractDb

User Interface - AbstractDb

Aim

Use literature-derived information to:

Rank documents according to protein complex relevance score.

Assign confidence scores to protein interactions.

Provide an updated catalogue of protein complexes.

Our initial step towards our goal is to develop a “Recommender system” for ranking abstracts with relevance to protein complexes.

Our hypothesis

Abstracts discussing protein complexes can be distinguished from non-relevant abstracts based on the frequency distribution of words in a hand-curated data set on protein complexes versus a data set of background word frequencies.

Our method

Our method is based on a Naïve Bayesian classifier using discriminatory words [5].

Discriminatory words - a selected subset of high scoring words that characterize abstracts discussing protein complexes.

The discriminatory words include both high and low frequency words that distinguish abstracts discussing protein complexes.

Our use of a “stopword” list removes high frequency non-informative words, e.g. “the”, “a”, “of”, “for”.
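The stopword step can be sketched as a word count that skips non-informative words. This is only an illustration: the stopword set holds just the four examples named above (a real list would be much longer), and the tokenizer and the function name `informative_word_counts` are assumptions.

```python
import re
from collections import Counter

# Only the four examples named on the slide; a real stopword list is longer.
STOPWORDS = {"the", "a", "of", "for"}


def informative_word_counts(abstract: str) -> Counter:
    """Count lower-cased words in an abstract, skipping stopwords."""
    # Keep hyphenated tokens such as "single-stranded" together.
    words = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", abstract.lower())
    return Counter(word for word in words if word not in STOPWORDS)


counts = informative_word_counts(
    "The purification of a protein complex for the complex of RAD51L3 and XRCC2"
)
print(counts.most_common(1))
```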

Our model

Assume Poisson word model:

Probability of observing a given word in a document:

n = Count of word occurrences

N = Total number of words in a set of training abstracts

f = Dictionary word frequency
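The formula on this slide did not survive in the transcript. Under a Poisson word model with the symbols defined above, the standard form would be P(n | N, f) = e^(-Nf) (Nf)^n / n!, with expected count Nf. That form is an assumption here, not a quote from the slide. A minimal sketch, with a hypothetical function name:

```python
import math


def poisson_word_probability(n: int, N: int, f: float) -> float:
    """P(n | N, f) under a Poisson model with expected count N * f."""
    if n < 0 or N < 0 or f < 0:
        raise ValueError("n, N and f must be non-negative")
    mean = N * f
    if mean == 0:
        # With an expected count of zero, only n == 0 is possible.
        return 1.0 if n == 0 else 0.0
    # Work in log space so that large n does not overflow the factorial.
    log_p = -mean + n * math.log(mean) - math.lgamma(n + 1)
    return math.exp(log_p)


# Example: a word with dictionary frequency 0.001 in 1000 words of text.
print(poisson_word_probability(0, 1000, 0.001))
```

A word seen far more often than Nf predicts gets a small probability under the background model, which is what makes it a candidate discriminatory word.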

Using the 500 most significant words, we constructed a discriminatory word list of 80 words for scoring abstracts.

Does the abstract discuss protein complexes or not?

Calculate the log-likelihood score for an individual abstract by summing over all discriminatory words.

F_{N,i}: dictionary frequency of discriminatory word i

F_{I,i}: frequency of discriminatory word i in the training abstracts
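The scoring formula itself is missing from the transcript. One common form, assumed here, sums ln(F_{I,i} / F_{N,i}) over the discriminatory words i that occur in the abstract. The function name and all frequencies below are hypothetical and serve only to show the mechanics.

```python
import math
import re


def log_likelihood_score(abstract, f_train, f_dict):
    """Sum ln(F_I / F_N) over discriminatory words present in the abstract."""
    words = set(re.findall(r"[a-z0-9]+", abstract.lower()))
    score = 0.0
    for word, f_i in f_train.items():
        if word in words:
            score += math.log(f_i / f_dict[word])
    return score


# Hypothetical frequencies, for illustration only.
F_TRAIN = {"complex": 0.02, "subunit": 0.005, "patient": 0.0005}
F_DICT = {"complex": 0.004, "subunit": 0.0005, "patient": 0.002}

relevant = log_likelihood_score(
    "We identified a protein complex with a novel subunit", F_TRAIN, F_DICT
)
background = log_likelihood_score("The patient recovered", F_TRAIN, F_DICT)
print(relevant > 0, background < 0)
```

Words that are enriched in the training abstracts push the score up, and words that are depleted push it down, which fits the use of both high and low frequency discriminatory words.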

Our system

Our system consists of the following components:

A set of PubMed abstracts from 1965 - 2008 retrieved with the query “protein complex”;

A Bayesian probabilistic method for calculating an article's relevance in discussing protein complexes, using word occurrences found in the training set;

A method for extracting gene/protein names using a biological named entity recognizer – ABNER [6];

A Wiki resource to enable scientists to evaluate and revise the data.

Query terms used for construction of protein complex abstract data sets (including abstracts published 1965 - 2008)

Query Term                                      No. of abstracts retrieved
“protein complex”                               499918
“cell cycle” AND “protein complex”              19360
“chromatin remodeling” AND “protein complex”    238
“DNA repair” AND “protein complex”              325

Validation of Bayesian classification of PubMed abstracts using hand-curated data sets

Data set               Positives  Negatives  Accuracy  Precision  Recall  F-measure
Apoptosis              138        94         0.89      0.93       0.89    0.91
Cell cycle             600        702        0.96      0.97       0.94    0.96
Chromatin remodelling  155        81         0.83      0.93       0.84    0.88
DNA repair             203        22         0.91      0.96       0.88    0.92

Accuracy = (TP + TN) / (TP + FP + FN + TN)

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

F-measure = 2 * Precision * Recall / (Precision + Recall)
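The four measures can be computed directly from confusion-matrix counts. A minimal sketch: the function name is hypothetical, the counts in the example are made up for illustration, and the code assumes none of the denominators is zero.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall and F-measure from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Made-up counts, not taken from the validation table.
metrics = classification_metrics(tp=90, fp=10, fn=30, tn=70)
print({name: round(value, 2) for name, value in metrics.items()})
```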

Performance Evaluation

[Figure panels: i. Apoptosis; ii. Cell cycle; iii. Chromatin remodeling; iv. DNA repair]

A text-based Protein Assay

Named Entity Recognition for identifying gene and protein names

A challenging task due to the irregularities and ambiguities in gene and protein nomenclature.

Synonyms and versioning of database cross-references (dbxref).

Online Annotation Tool for PubMed abstracts

Biological entities recognised:

Protein

DNA

RNA

CELL LINE

CELL TYPE

[Figure: Cscore, ABNER, GeneTagger and KEX values plotted against Sentence Id (1 to 8)]

PMID:10871607

SentenceId  Cscore  ABNER  GeneTagger  KEX   Sentence
1           1.5     0      0.12        0.08  The Rad51 protein in eukaryotic cells is a structural and functional homolog of Escherichia coli RecA with a role in DNA repair and genetic recombination.
2           0.62    0.06   0.06        0.12  Several proteins showing sequence similarity to Rad51 have previously been identified in both yeast and human cells.
3           -0.31   0.05   0.1         0.15  In Saccharomyces cerevisiae, two of these proteins, Rad55p and Rad57p, form a heterodimer that can stimulate Rad51-mediated DNA strand exchange.
4           -1.11   0      0.12        0.12  Here, we report the purification of one of the representatives of the RAD51 family in human cells.
5           1.25    0      0.14        0.17  We demonstrate that the purified RAD51L3 protein possesses single-stranded DNA binding activity and DNA-stimulated ATPase activity, consistent with the presence of "Walker box" motifs in the deduced RAD51L3 sequence.
6           2.01    0.06   0.17        0.22  We have identified a protein complex in human cells containing RAD51L3 and a second RAD51 family member, XRCC2.
7           3.47    0.13   0.13        0.2   By using purified proteins, we demonstrate that the interaction between RAD51L3 and XRCC2 is direct.
8           0.66    0.06   0.06        0.06  Given the requirements for XRCC2 in genetic recombination and protection against DNA-damaging agents, we suggest that the complex of RAD51L3 and XRCC2 is likely to be important for these functions in human cells.
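The per-sentence Cscore values listed for PMID:10871607 can be used to order sentences for a curator. The sketch below assumes that a higher Cscore marks a sentence as more relevant to protein complexes; the variable names are illustrative.

```python
# Cscore per sentence id, copied from the table for PMID:10871607.
CSCORES = {1: 1.5, 2: 0.62, 3: -0.31, 4: -1.11, 5: 1.25, 6: 2.01, 7: 3.47, 8: 0.66}

# Sort sentence ids from highest to lowest Cscore.
ranked = sorted(CSCORES, key=CSCORES.get, reverse=True)
print(ranked[:2])
```

On these values the two top-ranked sentences are 7 and 6, the ones that state the RAD51L3 and XRCC2 interaction and complex explicitly.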

Syntax Parsing - semantic relations among words

Example Scenario

Q. What are the members of the FEAR complex?

1. Keyword: FEAR

2. List of abstracts relevant to the FEAR protein complex

Similar Article

CONDENSIN

smc2-8 and smc4-1

FEAR complex: cdc14, esp1, cdc5 (explicit sentence)

FEAR complex: cdc14, esp1, cdc5, spo12, fob1 (explicit sentences)

Validation

ProteinComplexDB

Conclusion

We have undertaken an initial step towards developing: a “Recommender system” for ranking abstracts with relevance to protein complexes.

a Curation Tool for extracting Protein Complexes from literature

We are in the process of: Constructing a database of Protein Complexes, and

Linking Protein Complexes to Pathways and Disease phenotypes.

Ultimate aim of understanding biological mechanisms behind complex Disease phenotypes

Acknowledgements

Zhang Zhang and lab members:

• Ivan Borozan

• Dong (Derek) Dong

• Matthew Fagnani

• Yunchen Gong

• Sumedha Gunewardena

• Gabe Musso

• Renqiang Min

• Sanaa Mahmood

• Jingjing Li

• Yu Liu

• Apostolos Lydakis

• Lee Zamparo
