A Statistical Approach to Literature-based Gene Group Annotation
-
Upload
quintessa-duran -
Category
Documents
-
view
41 -
download
1
description
Transcript of A Statistical Approach to Literature-based Gene Group Annotation
![Page 1: A Statistical Approach to Literature-based Gene Group Annotation](https://reader035.fdocuments.us/reader035/viewer/2022062422/56813253550346895d98d644/html5/thumbnails/1.jpg)
A Statistical Approach to Literature-based Gene Group
Annotation
Xin He
01/31/2007
![Page 2: A Statistical Approach to Literature-based Gene Group Annotation](https://reader035.fdocuments.us/reader035/viewer/2022062422/56813253550346895d98d644/html5/thumbnails/2.jpg)
Annotating a Gene List
• Goal: understand the commonalities in a list of genes from:– Clustering by expression patterns– Differentially expressed genes – Genes sharing cis-regulatory elements
• How to automatically construct the annotations?
![Page 3: A Statistical Approach to Literature-based Gene Group Annotation](https://reader035.fdocuments.us/reader035/viewer/2022062422/56813253550346895d98d644/html5/thumbnails/3.jpg)
Gene Ontology-based Approach
• Each gene is annotated by a set of GO terms
• The importance of any term wrt the gene list is measured by the number of genes that are associated with this term
• Need to correct for the uneven distribution of GO terms: a hypergeometric test
![Page 4: A Statistical Approach to Literature-based Gene Group Annotation](https://reader035.fdocuments.us/reader035/viewer/2022062422/56813253550346895d98d644/html5/thumbnails/4.jpg)
Limitations of GO-based Approach
• GO annotations of all genes: may not be available
• Rapid growth of literature: constantly add new functions to existing genes
• Coverage is not even in all areas. E.g. ecology and behavior; medicine; anatomy and physiology; etc.
![Page 5: A Statistical Approach to Literature-based Gene Group Annotation](https://reader035.fdocuments.us/reader035/viewer/2022062422/56813253550346895d98d644/html5/thumbnails/5.jpg)
Literature-based Approach
• Annotate via the analysis of text extracted from literature
• Advantages:– Not dependent on manually created data– Easy to keep up with the recent discoveries– Broad coverage
• Explorative tool: suggest new hypothesis
![Page 6: A Statistical Approach to Literature-based Gene Group Annotation](https://reader035.fdocuments.us/reader035/viewer/2022062422/56813253550346895d98d644/html5/thumbnails/6.jpg)
Literature-based Approach
• Extract abstracts for each gene• Idea: If a word is overrepresented in the
abstracts for the list, then it is likely to describe the common functions of the list
• A simple measure of significance: Z-score = (observed count – expected count under background distribution) / standard deviation
![Page 7: A Statistical Approach to Literature-based Gene Group Annotation](https://reader035.fdocuments.us/reader035/viewer/2022062422/56813253550346895d98d644/html5/thumbnails/7.jpg)
Motivations for the Current Work
• Drawbacks of the existing approach– Biased textual representation: genes that are
well-studied will dominate the results
• Motivations for a new approach:– Need to capture overrepresentation of words– Favor words that are common to all or most
genes– A unified way to solve both problems?
![Page 8: A Statistical Approach to Literature-based Gene Group Annotation](https://reader035.fdocuments.us/reader035/viewer/2022062422/56813253550346895d98d644/html5/thumbnails/8.jpg)
Ideas for the Statistical Model
• Observation: typically, some genes in the list are related to a given word, but the other genes are not (Few gene clusters are perfect!)
• Assumption: the count of a term in a document follows Poisson distribution
• Idea: the count of a term in one gene is either from a background distribution (if the gene is unrelated) or from a positive distribution (if the gene is related)
![Page 9: A Statistical Approach to Literature-based Gene Group Annotation](https://reader035.fdocuments.us/reader035/viewer/2022062422/56813253550346895d98d644/html5/thumbnails/9.jpg)
Poisson Mixture Model
1 2
1 2
0
... : the size of documents of genes 1 to n
... : counts of the word in documents 1 to n
: the rate of the background and positive Poisson distributions
: the mixing weight of the
n
n
d ,d d
x ,x x
, positive Poisson
])|()1()|([
)|(|(
10
1
n
iiiii
n
i0ii0
dxPdxP
,,,dxP),,,P
dx
)|()1()|(|( 0 iiii0ii dxPdxP ),,,dxP
Each count is generated from a mixture of the background and signal Poisson distributions:
The probability of observing the data is thus:
Notations:
![Page 10: A Statistical Approach to Literature-based Gene Group Annotation](https://reader035.fdocuments.us/reader035/viewer/2022062422/56813253550346895d98d644/html5/thumbnails/10.jpg)
Parameter EstimationMaximum likelihood estimation of parameters:
),,,P, 0
dx |(maxarg)ˆˆ(0,10
EM algorithm to maximize the likelihood function. The updating formula is given by:
i(t)(t)
0ii
n
iii
(t)(t)0ii
n
ii
t
(t)(t)0ii
n
ii
t
d,,,d,xzPx,,,d,xzP
n,,,d,xzP
)|1(/)|1(
/)|1(
11
)1(
1
)1(
The posterior probability of missing label (zi) is:
)|()1()|((
)|()|1(
0)()()(
)()(
iit
it
it
it
it
(t)(t)0iii dxPdxP
dxP,,,d,xzP
![Page 11: A Statistical Approach to Literature-based Gene Group Annotation](https://reader035.fdocuments.us/reader035/viewer/2022062422/56813253550346895d98d644/html5/thumbnails/11.jpg)
Evaluating the Statistical Significance
• The candidate terms: those with a large theta (an estimate of the proportion of related genes)
• Need to assess the significance– E.g. a term from a distribution slightly different from
the background. EM may estimate lambda to be this distribution and theta close to 1
• Idea: if the counts can be explained well by the background, then there is no need to use a mixture of two distributions. This word would be insignificant regardless of the estimated theta
![Page 12: A Statistical Approach to Literature-based Gene Group Annotation](https://reader035.fdocuments.us/reader035/viewer/2022062422/56813253550346895d98d644/html5/thumbnails/12.jpg)
Likelihood Ratio Test
Hypothesis testing:
mixture thefrom generated are counts observed the0, :H
background thefrom generated are counts observed the0, :H
1
0
Generalized Likelihood Ratio test Statistic (LRS):
n
i iiii
ii
0
0
dxPdxP
dxP
,,,P
,PT
1 0
0
)|()ˆ1()ˆ|(ˆ)|(
log2
)ˆˆ|(
)|(log2
dx
dx
Reject H0 if T is greater than a certain threshold.
![Page 13: A Statistical Approach to Literature-based Gene Group Annotation](https://reader035.fdocuments.us/reader035/viewer/2022062422/56813253550346895d98d644/html5/thumbnails/13.jpg)
Asymptotic Distribution of LRS
• It is well known that the distribution of LRS converges to chi-square, with degree of freedom equal to the difference between the number of free parameters of null and alternative hypothesis
• Bonferroni correction: significance level (0.05) divided by the number of hypothesis tested (number of terms)
![Page 14: A Statistical Approach to Literature-based Gene Group Annotation](https://reader035.fdocuments.us/reader035/viewer/2022062422/56813253550346895d98d644/html5/thumbnails/14.jpg)
Implementation• Computational framework
– Medline collections: yeast and fly– Indexing, retrieval, analysis by Indri 2.2
• Background distribution of terms– Organism-specific distribution
• Phrase construction: bigrams only– NSP (Ngram statistical package) for bigram selection: chi-square
measure• Procedure
– Retrieval of articles about the given list of genes– Collection of counts of all words and significant bigrams– Statistical analysis by Poission mixture model– Presentation of results: (1) sorted by statistical significance; (2)
remove concepts whose fractions (theta) are below a threshold
![Page 15: A Statistical Approach to Literature-based Gene Group Annotation](https://reader035.fdocuments.us/reader035/viewer/2022062422/56813253550346895d98d644/html5/thumbnails/15.jpg)
Web Interface
• Web link:http://beespace.igb.uiuc.edu:8080/BeeSpace/
Annot.jsp
• User interface– Choose collection: MedYeast and MedFly– Input genes: textual names rather than
identifiers
![Page 16: A Statistical Approach to Literature-based Gene Group Annotation](https://reader035.fdocuments.us/reader035/viewer/2022062422/56813253550346895d98d644/html5/thumbnails/16.jpg)
Experimental Validation
• Methods compared– StatAnnot– vs. term overpresentation method: GEISHA– vs. GO enrichment analysis: FuncAssociate
http://llama.med.harvard.edu/cgi/func/funcassociate
• Evaluation– Concepts that are common to both methods– Concepts that are unique to the method compared– Concepts that are unique to StatAnnot– Genes that are found by StatAnnot
![Page 17: A Statistical Approach to Literature-based Gene Group Annotation](https://reader035.fdocuments.us/reader035/viewer/2022062422/56813253550346895d98d644/html5/thumbnails/17.jpg)
Comparison with other Literature-based Method
• Gene List: Eisen K cluster (15 genes)– Mainly respiratory chain complex (13), one
mitochondrial membrane pore (por1)– However, por1 matches many more articles than other
genes, thus dominates the results, eg. voltage-dependent channel
• Results:– GESHIA: voltage-dependent, outer member, pores,
channel, succinate– StatAnnot: electron_transport, electron_transfer,
respiratory, cytochrome_c1, mitochondrial_membrane, succinate_dehydrogenase
![Page 18: A Statistical Approach to Literature-based Gene Group Annotation](https://reader035.fdocuments.us/reader035/viewer/2022062422/56813253550346895d98d644/html5/thumbnails/18.jpg)
Comparison with GO-based Method
• Gene list: Eisen B cluster (11 genes)• Results
– cell_cycle, mitosis, s_phase, microtubule, cytokinesi, cytoskeletal_protein, spindle_pole, pole_body, mitotic_spindle, spindle_apparatus, spindle_checkpoint
– contractile ring, actomyosin ring, cell cortex, microtubule nucleation, microtubule organizing center, M phase, bud site selection, phosphatidylinositol binding, septin ring
– anaphase, anaphase_promote, ubiquitin, ligase, bud_neck, metaphase, proteolysis
– cyclin_b, cdc3, cdc28, cdc12, cdc15, cdc2
![Page 19: A Statistical Approach to Literature-based Gene Group Annotation](https://reader035.fdocuments.us/reader035/viewer/2022062422/56813253550346895d98d644/html5/thumbnails/19.jpg)
Comparison with GO-based Method
• Gene List: cAMPDown001 cluster– Originally 59 genes– 16 genes have fly orthologs that match at least one article
• Results– heat_shock, molecular_chaperone, stress (response to stress)– response to temperature, unfolded protein binding, response to
biotic stimulus– channel/ion_channel/channel_gate,
kinase/threonine_kinase/serine_threonine, gtpase, ethanol, immunophilin, tetratricopeptide_repeat, geldanamycin, tacrolimu_bind, co_chaperone/cochaperon, steroid, quinone, cyclophilin
– genes: hsp90, hsc70, hsp70, hsp40, cyp40, fkbp52, mok,
![Page 20: A Statistical Approach to Literature-based Gene Group Annotation](https://reader035.fdocuments.us/reader035/viewer/2022062422/56813253550346895d98d644/html5/thumbnails/20.jpg)
Conclusion from Experiments
• Solved the biased textual representation problem of the earlier literature-based method
• In general, the new method is able to cover a large proportion of terms from GO enrichment analysis
• Supplement with additional biological concepts, including many related genes
• May be particularly useful for studying aspects not focused in GO, such as medicine
![Page 21: A Statistical Approach to Literature-based Gene Group Annotation](https://reader035.fdocuments.us/reader035/viewer/2022062422/56813253550346895d98d644/html5/thumbnails/21.jpg)
Future Plan (I)
• Phrases– N-gram, N > 2– Syntactic phrase construction– Removal of redundant concepts in the results
• Retrieval for genes– Gene name mapping to database entries– Synonym expansion– Gene name disambiguition
• Entity recognition in the resulting concepts– Gene names, chemicals, organisms
![Page 22: A Statistical Approach to Literature-based Gene Group Annotation](https://reader035.fdocuments.us/reader035/viewer/2022062422/56813253550346895d98d644/html5/thumbnails/22.jpg)
Future Plan (II)
• Sentence selection– Informative sentences summarizing the concepts
(theme retrieval in Beespace)
• Clustering of concepts– Ex. Distance measure of terms based on co-occurrence
• A general tool to summarize a group of entities? For example:– Common aspects of a set of diseases– Genes commonly involved in multiple developmental
stages