A faster reliable algorithm to estimate the p-value of the multinomial llr statistic
Uri Keich and Niranjan Nagarajan (Department of Computer Science, Cornell University, Ithaca, NY, USA)
Motivation: Hunting for Motifs
Is there a motif here?
Is the alignment interesting/significant? Which of the columns form a motif?
Or for that matter here?
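Uncovering significant columns
Work with column profiles
Null hypothesis: multinomial distribution
Test statistic: log-likelihood ratio $I = \sum_{k=1}^{K} X_k \log\frac{X_k}{N\pi_k}$, where $X = (X_1, \dots, X_K) \sim \mathrm{Multinomial}(N, \pi)$ for a random sample of size $N$
Calculate the p-value of the observed score $s$: $P(I \ge s)$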
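As a minimal sketch of the test statistic, the following computes the log-likelihood ratio of one alignment column against a background distribution. The function name, the example counts and the uniform background are illustrative assumptions, and the usual convention 0·log 0 = 0 is assumed.

```python
import math

def llr_statistic(counts, null_probs):
    """Log-likelihood ratio I = sum_k x_k * log(x_k / (N * pi_k)).

    Terms with x_k = 0 contribute 0 (convention 0 * log 0 = 0).
    """
    n = sum(counts)
    total = 0.0
    for x, p in zip(counts, null_probs):
        if x > 0:
            total += x * math.log(x / (n * p))
    return total

# Hypothetical column profile: counts of A, C, G, T in one alignment column.
column = [9, 1, 0, 2]
background = [0.25, 0.25, 0.25, 0.25]
score = llr_statistic(column, background)
print(f"I = {score:.4f}")
```

A column whose counts match the background exactly scores 0, and the score grows as the column becomes more skewed.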
and more …
Bioinformatics applications:
Sequence-Profile and Profile-Profile alignment
Locating binding site correlations in DNA
Detecting compensatory mutations in proteins and RNA
Other applications:
Signal Processing
Natural Language Processing
Computational Extremes!
Asymptotic approximation
As $N \to \infty$, $2I \to \chi^2_{K-1}$ in distribution
Constant time approximation
Fails with N fixed and s approaching the tail
Direct enumeration
$\binom{N+K-1}{K-1}$ possible outcomes
Exact results
Exponential time and space requirements ($O(N^K)$) even with pruning of the search space!
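The two extremes can be compared on a small case. This is a sketch under stated assumptions: K = 3 is chosen because the chi-square tail with 2 degrees of freedom has the closed form P(χ² > 2s) = e^{-s}, and the values of N and π are made up for illustration. The enumeration visits every outcome, so it is only practical for small N and K.

```python
import math

def compositions(n, k):
    """Yield every tuple of k non-negative integers that sums to n."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest

def llr(counts, probs):
    n = sum(counts)
    return sum(x * math.log(x / (n * p)) for x, p in zip(counts, probs) if x > 0)

def multinomial_pmf(counts, probs):
    # Work in log space to avoid overflow in the factorials.
    log_p = math.lgamma(sum(counts) + 1)
    for x, p in zip(counts, probs):
        log_p += x * math.log(p) - math.lgamma(x + 1)
    return math.exp(log_p)

def exact_p_value(s, n, probs, tol=1e-12):
    """P(I >= s) by summing the probability of every qualifying outcome."""
    return sum(multinomial_pmf(c, probs)
               for c in compositions(n, len(probs))
               if llr(c, probs) >= s - tol)

probs = [0.2, 0.3, 0.5]   # illustrative null distribution
n = 30                    # illustrative sample size
for s in (1.0, 5.0, 15.0):
    exact = exact_p_value(s, n, probs)
    approx = math.exp(-s)  # chi-square tail with K - 1 = 2 degrees of freedom
    print(f"s = {s:5.1f}  exact = {exact:.3e}  chi-square = {approx:.3e}")
```

The printed values let the reader see how the two estimates drift apart as s moves into the tail with N fixed.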
The middle path
Compute $p_Q$, the p.m.f. of the integer-valued $I_Q = \sum_{k} \mathrm{round}\left(\delta^{-1} X_k \log\frac{X_k}{N\pi_k}\right)$ (Baglivo et al., 1992), where $\delta$ is the mesh size.
Runtime is polynomial in N, K and Q
Accurate up to the granularity of the lattice
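A small sketch of the lattice idea: each category's contribution to the statistic is rounded to an integer multiple of the mesh size, so the total score lives on a lattice. The rounding rule, the helper names and the mesh size 0.01 are assumptions made for illustration.

```python
import math

def lattice_term(x, n, p, delta):
    """Integer lattice score contributed by a count of x in one category."""
    if x == 0:
        return 0
    return round(x * math.log(x / (n * p)) / delta)

def lattice_score(counts, probs, delta):
    n = sum(counts)
    return sum(lattice_term(x, n, p, delta) for x, p in zip(counts, probs))

def llr(counts, probs):
    n = sum(counts)
    return sum(x * math.log(x / (n * p)) for x, p in zip(counts, probs) if x > 0)

counts = [9, 1, 0, 2]     # hypothetical column profile
probs = [0.25] * 4
delta = 0.01              # assumed mesh size
i_q = lattice_score(counts, probs, delta)
print(f"I = {llr(counts, probs):.4f}, delta * I_Q = {delta * i_q:.4f}")
```

Each rounded term is off by at most δ/2, so δ·I_Q differs from I by at most Kδ/2.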
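Baglivo et al.’s approach
Compute $p_Q$ by first computing the DFT!
Let $\omega = 2\pi/Q$ and $(Dp_Q)(l) = \sum_{j=0}^{Q-1} p_Q(j)\, e^{i\omega l j}$
Then recover $p_Q$ by the inverse DFT: $p_Q(j) = \frac{1}{Q} \sum_{l=0}^{Q-1} (Dp_Q)(l)\, e^{-i\omega l j}$
But how do we compute the DFT in the first place without knowing $p_Q$?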
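Hertz and Stormo’s Algorithm
Sample from $Y = (Y_1, \dots, Y_K)$, which are independent Poissons with mean $N\pi_k$ (instead of X)
Fact: $P(X = x) = P\left(Y = x \mid \sum_k Y_k = N\right)$
We compute $Q_{k,n}(j) = P\left(\sum_{i \le k} s_i(Y_i) = j,\ \sum_{i \le k} Y_i = n\right)$, where $s_i(x) = \mathrm{round}\left(\delta^{-1} x \log\frac{x}{N\pi_i}\right)$
Recursion: $Q_{k,n}(j) = \sum_{x=0}^{n} p_k(x)\, Q_{k-1,n-x}(j - s_k(x))$, where $p_k$ is the p.m.f. of $Y_k$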
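The following is a sketch of a lattice dynamic program in the spirit of this slide, not a transcription of Hertz and Stormo's code. It assumes the Poisson formulation (independent Poisson counts with mean Nπ_k, conditioned on summing to N) and a rounding rule for the per-category lattice score; all function names are hypothetical.

```python
import math
from collections import defaultdict

def lattice_term(x, n, p, delta):
    return 0 if x == 0 else round(x * math.log(x / (n * p)) / delta)

def poisson_pmf(x, mean):
    return math.exp(x * math.log(mean) - mean - math.lgamma(x + 1))

def lattice_pmf_dp(n, probs, delta):
    """Return {j: P(I_Q = j)} for a multinomial(n, probs) sample."""
    # table[m][j] = P(first k Poisson counts sum to m and have lattice score j)
    table = [defaultdict(float) for _ in range(n + 1)]
    table[0][0] = 1.0
    for p in probs:
        weights = [poisson_pmf(x, n * p) for x in range(n + 1)]
        terms = [lattice_term(x, n, p, delta) for x in range(n + 1)]
        new = [defaultdict(float) for _ in range(n + 1)]
        for m in range(n + 1):
            for j, q in table[m].items():
                for x in range(n - m + 1):
                    new[m + x][j + terms[x]] += q * weights[x]
        table = new
    # Condition on the Poisson counts summing to n.
    norm = poisson_pmf(n, n * sum(probs))
    return {j: q / norm for j, q in table[n].items()}

delta = 0.05
pmf = lattice_pmf_dp(20, [0.2, 0.3, 0.5], delta)
p_value = sum(q for j, q in pmf.items() if j * delta >= 3.0)
print(f"P(I_Q * delta >= 3.0) = {p_value:.4e}")
```

Because every term added is non-negative, this kind of recursion cannot produce negative probabilities; the price is a table indexed by both the partial sum and the lattice score.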
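Baglivo et al.’s Algorithm
Recurse over k to compute the DFT of $Q_{k,n}$
Let $DQ_{k,n}$ be the DFT of $Q_{k,n}$. Then, $DQ_{k,n}(l) = \sum_{x=0}^{n} p_k(x)\, e^{i\omega l s_k(x)}\, DQ_{k-1,n-x}(l)$
And $Dp_Q$ is $DQ_{K,N}$ up to a constant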
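A sketch of the DFT-domain recursion, assuming the Poisson formulation and the sign convention e^{+iωlj} for the forward transform; the function names are hypothetical and a plain O(Q²) inverse DFT is used for brevity where an FFT would normally be used. Each frequency l is processed on its own with a vector of length N + 1, which is where the O(Q+N) space comes from. The lattice size Q must exceed the range of attainable lattice scores, otherwise values wrap around.

```python
import cmath
import math

def lattice_term(x, n, p, delta):
    return 0 if x == 0 else round(x * math.log(x / (n * p)) / delta)

def poisson_pmf(x, mean):
    return math.exp(x * math.log(mean) - mean - math.lgamma(x + 1))

def lattice_pmf_via_dft(n, probs, delta, q_size):
    """Return a list p with p[j] ~ P(I_Q = j mod q_size)."""
    omega = 2 * math.pi / q_size
    weights = [[poisson_pmf(x, n * p) for x in range(n + 1)] for p in probs]
    terms = [[lattice_term(x, n, p, delta) for x in range(n + 1)] for p in probs]
    transform = []
    for freq in range(q_size):
        # d[m] = E[exp(i*omega*freq*score); first k Poisson counts sum to m]
        d = [1 + 0j] + [0j] * n
        for w, t in zip(weights, terms):
            factor = [w[x] * cmath.exp(1j * omega * freq * t[x])
                      for x in range(n + 1)]
            # The recursive step is a convolution over the partial sum m.
            d = [sum(d[m - x] * factor[x] for x in range(m + 1))
                 for m in range(n + 1)]
        transform.append(d[n])
    norm = poisson_pmf(n, n * sum(probs))  # P(sum of Poisson counts = n)
    pmf = []
    for j in range(q_size):
        z = sum(transform[f] * cmath.exp(-1j * omega * f * j)
                for f in range(q_size))
        pmf.append(z.real / (q_size * norm))
    return pmf

pmf = lattice_pmf_via_dft(12, [0.2, 0.3, 0.5], 0.2, 128)
print(f"sum = {sum(pmf):.12f}, smallest entry = {min(pmf):.3e}")
```

The smallest recovered entries are dominated by roundoff and can come out slightly negative.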
Comparison of the two algorithms
Both have $O(QKN^2)$ time complexity
Space complexity:
$O(QN)$ for Hertz and Stormo’s method
$O(Q+N)$ for Baglivo et al.’s method
Numerical errors:
Bounded for Hertz and Stormo’s method.
Baglivo et al.’s method can yield negative p-values in some cases!
What’s going wrong?
Hoeffding (1965) proved that $P(I \ge s) \le f(N,K)\, e^{-s}$
If we compare with
we would get …
And why?
Fixed precision arithmetic: in double precision arithmetic, due to roundoff errors, the smaller numbers in the summation in the DFT are overwhelmed by the largest ones!
Solution: boost the smaller values
How? And by how much?
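The roundoff effect and the idea of boosting can be seen with two numbers. This is only an illustration: the values 2^-60 and the boost factor 2^60 are made up (powers of two keep the arithmetic exact), whereas the slides boost with an exponential shift.

```python
import sys

head = 1.0
tail = 2.0 ** -60          # far below the double precision resolution near 1.0

# Adding the small value to the large one loses it completely.
lost = (head + tail) - head
print(lost)                 # the tail has been absorbed

# Boost the small value by a known factor before it meets the large one,
# then undo the boost afterwards.
boost = 2.0 ** 60           # hypothetical boost factor
recovered = ((head + tail * boost) - head) / boost
print(recovered == tail)

print(sys.float_info.epsilon)  # spacing of doubles just above 1.0
```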
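Our Solution
We shift $p_Q(s)$ with $e^{\theta s}$: $p_\theta(s) = p_Q(s)\, e^{\theta s} / M(\theta)$, where $M(\theta) = E\left[e^{\theta I_Q}\right]$ is the MGF of $I_Q$
To compute the DFT of $p_\theta$ we replace the Poisson $p_k$ with $p_{k,\theta}(x) = p_k(x)\, e^{\theta s_k(x)}$, where $s_k(x)$ is the lattice score contributed by a count of $x$ in category $k$, and compute the DFT by the same recursion over k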
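A toy demonstration of why the shift helps, under loud assumptions: the p.m.f. here is a made-up geometric-like distribution on a lattice of size 64 (not the llr distribution), the DFT is computed directly from the p.m.f. rather than by the recursion, and θ = 1 is picked by hand so that the shifted p.m.f. is flat. The point is only the roundoff mechanism: an unshifted DFT round trip destroys the tail, a shifted one preserves it.

```python
import cmath
import math

def dft(values, sign):
    """Plain O(Q^2) DFT with exponent sign * 2*pi*i*l*idx/Q."""
    q = len(values)
    return [sum(v * cmath.exp(sign * 2j * math.pi * l * idx / q)
                for idx, v in enumerate(values))
            for l in range(q)]

def round_trip(values):
    """Forward DFT followed by inverse DFT; ideally returns the input."""
    q = len(values)
    return [z.real / q for z in dft(dft(values, +1), -1)]

q_size, theta = 64, 1.0
raw = [math.exp(-idx) for idx in range(q_size)]   # toy p.m.f., decays like e^{-j}
total = sum(raw)
pmf = [r / total for r in raw]

# Exponential shift: p_theta(j) = p(j) * exp(theta * j) / M(theta).
mgf = sum(p * math.exp(theta * idx) for idx, p in enumerate(pmf))
shifted = [p * math.exp(theta * idx) / mgf for idx, p in enumerate(pmf)]

plain = round_trip(pmf)
unshifted = [s * mgf * math.exp(-theta * idx)
             for idx, s in enumerate(round_trip(shifted))]

def rel_err(estimate, truth):
    return abs(estimate - truth) / truth

for idx in (0, 30, 60):
    print(idx, f"{rel_err(plain[idx], pmf[idx]):.2e}",
          f"{rel_err(unshifted[idx], pmf[idx]):.2e}")
```

A single θ flattens this toy distribution completely; for a real p.m.f. the shift only equalizes magnitudes near the score of interest.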
Which $\theta$ to use? With $\theta$ = 1, N = 400, K = 5,
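Optimizing $\theta$
Theoretical error bound
Want to calculate $P(I > s_0)$ and so a good choice for $\theta$ seems to be $\theta_0 = \arg\min_\theta \left[\log M(\theta) - \theta s_0\right]$, where $M$ is the MGF of $I_Q$
We can solve for $\theta_0$ numerically. This only adds $O(KN^2)$ to the runtime.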
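A sketch of the numerical solve, assuming the criterion is the minimizer of log M(θ) − θ s_0, which is equivalent to asking that the mean of the shifted distribution equal s_0. For illustration the p.m.f. is a known toy distribution and s_0 is in lattice units; in the algorithm itself the p.m.f. is not available and M(θ) would be evaluated by the recursion instead. The bracket [-50, 50] and the bisection method are arbitrary choices.

```python
import math

def tilted_mean(pmf, theta):
    """Mean of the shifted p.m.f. p(j) * exp(theta * j) / M(theta)."""
    support = [(j, math.log(p) + theta * j) for j, p in enumerate(pmf) if p > 0]
    top = max(log_w for _, log_w in support)       # log-sum-exp for stability
    weights = [(j, math.exp(log_w - top)) for j, log_w in support]
    return sum(j * w for j, w in weights) / sum(w for _, w in weights)

def solve_theta(pmf, s0, lo=-50.0, hi=50.0, iterations=100):
    """Bisection on theta; the tilted mean is increasing in theta."""
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if tilted_mean(pmf, mid) < s0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Toy p.m.f. proportional to exp(-j) on j = 0..63 (an assumption, not the llr p.m.f.).
raw = [math.exp(-j) for j in range(64)]
pmf = [r / sum(raw) for r in raw]

theta0 = solve_theta(pmf, s0=31.5)
print(f"theta0 = {theta0:.6f}")
```

For this toy distribution a shift of θ = 1 makes the p.m.f. uniform with mean 31.5, which is what the solver recovers.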
So far …
We have shown how to
compensate for the errors in Baglivo et al.’s algorithm
provide error guarantees
As a bonus we also avoid the need for log-computation!
The updated algorithm has $O(Q+N)$ space complexity.
However the runtime is still $O(QKN^2)$
Can we speed it up?
A faster variant!
Note that the recursive step is a convolution
Naïve convolution takes $O(N^2)$ time
An FFT-based convolution takes $O(N \log N)$ time!
Unfortunately, the FFT-based convolution introduces more numerical errors: when $\theta$ = 1,
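A self-contained sketch comparing the two ways of convolving. The radix-2 FFT below is a textbook recursive version written out so the block needs no external libraries; it is an illustration, not the implementation used in the talk, and the input vectors are made up.

```python
import cmath

def fft(values, invert=False):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], invert)
    odd = fft(values[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def naive_convolve(a, b):
    """O(N^2) convolution by direct summation."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

def fft_convolve(a, b):
    """O(N log N) convolution: zero-pad, transform, multiply, invert."""
    length = len(a) + len(b) - 1
    size = 1
    while size < length:
        size *= 2
    fa = fft(list(a) + [0.0] * (size - len(a)))
    fb = fft(list(b) + [0.0] * (size - len(b)))
    product = [x * y for x, y in zip(fa, fb)]
    inverse = fft(product, invert=True)
    return [z.real / size for z in inverse[:length]]

a = [1.0, 2.0, 3.0]
b = [4.0, 5.0]
print(naive_convolve(a, b))
print(fft_convolve(a, b))
```

The FFT route agrees with the direct sum only up to roundoff that is relative to the largest entries, so very small entries of the result can carry a large relative error.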
The final piece
Need a shift that works well on both and
The variation over l is expected to be small. So we focus on computing the correct shift for different values of k.
We make an intuitive choice
where $M_k(\theta_2)$ is the MGF of
Experimental Results (Accuracy)
Both the shifts work well in practice!
We tested our method with K values up to 100, N up to 10,000 and various choices of $\pi$ and s
In all cases the relative error in comparison to Hertz and Stormo was less than 1e-9
Experimental Results (Runtime)
As expected, our algorithm scales as $N \log N$ as opposed to $N^2$ for Hertz and Stormo’s method!
Recovering the entire range
Imaginary part of computed p is a measure of numerical error
Ad hoc test for reliability: Real(p(j)) > $10^3$ × $\max_j$ Imag(p(j))
In practice, we can recover the entire p.m.f. using as few as 2 to 3 different s (or equivalently $\theta$) values.
For large N this is still significantly faster than Hertz and Stormo’s algorithm.
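The reliability test can be sketched as a small filter over the complex values returned by the inverse DFT. Taking the absolute value of the imaginary part is an assumption made here, as are the function name and the example values.

```python
def reliable_entries(computed, factor=1e3):
    """Flag entries whose real part clearly exceeds the imaginary noise.

    computed: complex values from an inverse DFT of a real p.m.f.
    Returns one bool per entry: Real(p(j)) > factor * max_j |Imag(p(j))|.
    """
    noise = max(abs(z.imag) for z in computed)
    return [z.real > factor * noise for z in computed]

# Hypothetical inverse-DFT output: the true values are real, so any
# imaginary part is numerical error.
computed = [0.9 + 1e-17j, 1e-3 - 2e-17j, 1e-15 + 1e-17j]
print(reliable_entries(computed))
```

Entries that fail the test would be recomputed with a different shift, which is how a few shifts can cover the whole range.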
Work in progress …
Rigorous error bounds for choice of $\theta_2$’s
Applying the methodology to compute other statistics
Extending the method to automatically recover the entire range
Exploring applications