P-value Calculating Problem
Ph. D. Thesis by Jing Zhang
Presented by Chao Wang
Problem Description
Given an independent and identically distributed (i.i.d.) model $R$ over an alphabet $\Sigma$, a pattern $m$ over the same alphabet, and an integer $k$, calculate the probability that $m$ hits a sequence generated by the model at least $k$ times.
Note that overlapping matches are not counted in this problem.
For example, the pattern “ACGACG” matches the target “TACGACGACGG” only once, because the occurrences starting at the 2nd and 5th positions overlap.
T[ACGACG]ACGG and TACG[ACGACG]G: the two occurrences share positions 5 to 7.
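The non-overlapping convention can be checked with a short sketch. The helper name `count_non_overlapping` is an assumption made for illustration; it scans from left to right and, after each hit, resumes the search past the end of that hit.

```python
def count_non_overlapping(pattern: str, target: str) -> int:
    """Count left-to-right occurrences of pattern that do not overlap."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    count = 0
    start = 0
    while True:
        pos = target.find(pattern, start)
        if pos == -1:
            return count
        count += 1
        # Resume after the end of this hit so overlapping hits are skipped.
        start = pos + len(pattern)


print(count_non_overlapping("ACGACG", "TACGACGACGG"))  # 1
```

Python's built-in `str.count` follows the same non-overlapping rule, so `"TACGACGACGG".count("ACGACG")` gives the same answer.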
An instance of the problem
Alphabet: {A, C, G, T}
Target sequence: ……
Pattern: ACTTGG
Each position has the same distribution:
A: 0.5 C: 0.3 G: 0.1 T: 0.1
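As a quick sanity check on this instance: under the i.i.d. model, the probability that the pattern occupies one fixed window of six positions is the product of its character probabilities. The variable names below are illustrative only.

```python
import math

# Per-position distribution from the instance above.
dist = {"A": 0.5, "C": 0.3, "G": 0.1, "T": 0.1}
pattern = "ACTTGG"

# Positions are independent, so the window probability is a plain product.
window_prob = math.prod(dist[c] for c in pattern)
print(window_prob)  # approximately 1.5e-05
```

This is only the probability of a hit at one given position; the P-value asks for at least k hits anywhere in the target sequence.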
Two cases of the target sequence
The length of the target sequence is infinite.
The length of the target sequence is finite.
Infinite case
The target sequence has infinite length.
Define $f(k)$ as the probability that $m$ hits the sequence at least $k$ times.
This case is easy to calculate because the following equality holds:
$$f(k) = p(m = R[0, |m|-1]) \cdot f(k-1) + p(m \ne R[0, |m|-1]) \cdot f(k)$$
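Intuitive meaning of the equation: consider whether $m$ hits the first $|m|$ positions or not, and divide the probability into two cases (illustrated on the slide with m = ACCGT).
m hits R[0, |m|-1]: at least k-1 further hits are needed in the rest of the sequence, which is still infinite.
m doesn’t hit R[0, |m|-1]: at least k hits are still needed in the rest of the sequence.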
A trivial observation: $f(0) = 1$, and
$$0 < p(R[0, |m|-1] = m) < 1$$
$$p(R[0, |m|-1] = m) + p(R[0, |m|-1] \ne m) = 1$$
From these, we can easily prove by mathematical induction that $f(k) = 1$ for every positive integer $k$.
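The induction step can be written out explicitly. Write $p = p(R[0, |m|-1] = m)$ (the shorthand $p$ is introduced here only for brevity) and assume $f(k-1) = 1$. Substituting into the infinite-case equality gives:

$$
f(k) = p \cdot 1 + (1 - p) \cdot f(k)
\;\Longrightarrow\; p \cdot f(k) = p
\;\Longrightarrow\; f(k) = 1
$$

The last step divides by $p$, which is allowed because $p > 0$.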
An interesting example: a monkey is hitting keys on a keyboard at random. If the time is sufficiently long, the typed text contains the great drama “Macbeth” with probability 1.
Does this agree with our intuition?
Finite case
The equality from the infinite case no longer holds.
Why? Because the $f(k)$ on the left-hand side is not the same quantity as the $f(k)$ on the right-hand side: they refer to target sequences of different lengths.
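Figure: the target sequence A C G G T A T T G C C A A T G, with the span covered by the $f(k)$ on the left-hand side and the span covered by the $f(k)$ on the right-hand side marked.
$$f(k) = p(m = R[0, |m|-1]) \cdot f(k-1) + p(m \ne R[0, |m|-1]) \cdot f(k)$$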
How to solve the puzzle? Dynamic Programming.
Trial 1
$\Pr(i, k)$ denotes the probability that $m$ hits $R[i, n]$ at least $k$ times.
$m = R[i]$ means that $m[t] = R[i+t]$ for every position $t$ of $m$, and $m \ne R[i]$ means the opposite.
$$\Pr(i, k) = \Pr(m = R[i]) \cdot \Pr(i+1, k-1 \mid m = R[i]) + \Pr(m \ne R[i]) \cdot \Pr(i+1, k \mid m \ne R[i])$$
Why did it fail? Because the condition $m \ne R[i]$ covers many different cases, so DP doesn’t work.
Basic Idea
We need to compare every position of $m$ with the corresponding position of $R[i, i+|m|-1]$. The number of cases is $2^{|m|}$ (each pair of positions may be equal or not).
In fact, only the prefixes of $m$ need to be considered, the number of which is $|m|$.
Thus, DP can work well.
Calculating the P-value for a word motif
A simple algorithm for the case of k = 1
Basic idea: calculate a series of conditional probabilities instead of the target probability.
For a string $w$ over the alphabet {A, C, T, G} and a position $i$ such that $w$ fits within $R[i, n]$, the conditional probability is written $f(i, w)$.
The Definition of Conditional Probability
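Figure: the region R = A C T T G G T A C C A C T C G with positions 1, i and n marked; w = GTA lies in R starting at position i, and m = ACCAC.
$f(i, w)$ is the probability that $m$ hits $R[i, n]$, given that $w$ occupies $R$ starting at position $i$.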
Then the target hit probability of $m$ in region $R$ equals $f(1, \varepsilon)$, where $\varepsilon$ is the empty string.
For any $f(i, w)$, we decompose it according to the character $c$ following $w$ in region $R[i, n]$:
$$f(i, w) = \sum_{c} f(i, wc) \cdot \Pr(R[i+|w|] = c)$$
Next we define the longest suffix:
Let $P(m)$ be the set of all prefixes of a word $m$. For any string $w$, let $S_m(w)$ denote the longest suffix of $w$ which is in $P(m)$.
For example, take m = ACCAC and w = CCAC. The prefixes of m are ε, A, AC, ACC, ACCA and ACCAC, so $S_m(w) = \text{AC}$.
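The definition of the longest suffix translates directly into a small function. This is a minimal sketch; the name `longest_prefix_suffix` is an assumption, and the quadratic scan is chosen for readability rather than speed.

```python
def longest_prefix_suffix(m: str, w: str) -> str:
    """Return S_m(w): the longest suffix of w that is also a prefix of m."""
    for length in range(min(len(m), len(w)), -1, -1):
        # Compare the suffix of w of this length with the prefix of m.
        if w[len(w) - length:] == m[:length]:
            return m[:length]
    return ""  # not reached: length 0 always matches


print(longest_prefix_suffix("ACCAC", "CCAC"))  # AC
```

The empty string is always a valid answer, so the function returns `""` when no non-empty prefix of m ends w.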
Then the following observation helps to restrict the domain of $w$ to $P(m)$. For $w$ not in $P(m)$,
$$f(i, w) = f(i', S_m(w)), \quad \text{where } i' = i + |w| - |S_m(w)|$$
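Case 1: no non-empty prefix of $m$ is a suffix of $w$.
Figure: R = A C T T G G T G C C A C T C G with positions 1, i and n marked, m = ACCAC and w = GTG starting at position i. Here $S_m(w) = \varepsilon$, so instead of $f(i, w)$ we compute $f(i', \varepsilon)$, with $i'$ just past the end of $w$.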
Case 2: some prefix of $m$ is a suffix of $w$; $S_m(w)$ is the longest such prefix, with $|S_m(w)| \ge 1$.
Figure: R = A C T T G G T A C C A C T C G with positions 1, i and n marked, m = ACCAC and w = GTAC starting at position i. Here $S_m(w) = \text{AC}$, so instead of $f(i, w)$ we compute $f(i', S_m(w))$, with $i'$ at the start of that suffix.
Algorithm 1 shows that f(i, w) can be computed by DP in polynomial time.
Algorithm 2 shows how to calculate f(i, w).
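The pseudocode of the two algorithms is not reproduced in this transcript. The following is a minimal sketch of the k = 1 computation built only from the decomposition of f(i, w) and the prefix reduction above; it is not necessarily the thesis's Algorithm 1 or 2. The names `hit_probability` and `longest_prefix_suffix`, the use of memoized recursion, and the boundary rule (f is 0 when no full occurrence fits from position i on, and 1 when w = m fits) are assumptions.

```python
from functools import lru_cache


def longest_prefix_suffix(m: str, w: str) -> str:
    """S_m(w): the longest suffix of w that is also a prefix of m."""
    for length in range(min(len(m), len(w)), -1, -1):
        if w[len(w) - length:] == m[:length]:
            return m[:length]
    return ""


def hit_probability(m: str, n: int, dist: dict) -> float:
    """Probability that m occurs at least once in an i.i.d. region of length n."""

    @lru_cache(maxsize=None)
    def f(i: int, w: str) -> float:
        if i + len(m) - 1 > n:  # no room left for a full occurrence
            return 0.0
        if w == m:              # a full occurrence starts at position i
            return 1.0
        total = 0.0
        for c, p in dist.items():
            s = longest_prefix_suffix(m, w + c)
            # Reduce f(i, wc) to f(i', S_m(wc)) with i' = i + |wc| - |S_m(wc)|.
            total += p * f(i + len(w) + 1 - len(s), s)
        return total

    return f(1, "")


print(round(hit_probability("101", 4, {"0": 0.6, "1": 0.4}), 6))  # 0.192
```

Only prefixes of m ever appear as the second argument, so the number of distinct states is at most (n + 1) times (|m| + 1). The recursion depth grows with n, so an iterative table would be preferable for long regions.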
![Page 24: P-value Calculating Problem](https://reader036.fdocuments.us/reader036/viewer/2022062718/56812bd6550346895d904145/html5/thumbnails/24.jpg)
Algorithm 1
![Page 25: P-value Calculating Problem](https://reader036.fdocuments.us/reader036/viewer/2022062718/56812bd6550346895d904145/html5/thumbnails/25.jpg)
Algorithm 2: calculate f(i,w)
Calculating the P-value for a word motif
We generalize Algorithms 1 and 2 to arbitrary $k$ by defining a series of probabilities $f^{(j)}(i, w)$ for $1 \le j \le k$, where $f^{(k)}(1, \varepsilon)$ is exactly the P-value we want to calculate.
![Page 27: P-value Calculating Problem](https://reader036.fdocuments.us/reader036/viewer/2022062718/56812bd6550346895d904145/html5/thumbnails/27.jpg)
Calculating the P-value for a word motif
Then the recursion formulae here are:
![Page 28: P-value Calculating Problem](https://reader036.fdocuments.us/reader036/viewer/2022062718/56812bd6550346895d904145/html5/thumbnails/28.jpg)
Calculating the P-value for a word motif
Algorithm 3 shows how to compute the P-value
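The recursion formulae and Algorithm 3 themselves are not reproduced in this transcript, so the following is only a sketch under stated assumptions, not the thesis's algorithm. It assumes that f^(j)(i, w) is the probability of at least j non-overlapping hits in R[i, n] given that w starts at position i, that f^(0) is 1, and that a full hit at position i reduces to f^(j-1)(i + |m|, ε) because overlapping matches are not counted. The function names are hypothetical.

```python
from functools import lru_cache


def longest_prefix_suffix(m: str, w: str) -> str:
    """S_m(w): the longest suffix of w that is also a prefix of m."""
    for length in range(min(len(m), len(w)), -1, -1):
        if w[len(w) - length:] == m[:length]:
            return m[:length]
    return ""


def p_value(m: str, n: int, dist: dict, k: int) -> float:
    """Probability of at least k non-overlapping hits of m in a region of length n."""

    @lru_cache(maxsize=None)
    def f(j: int, i: int, w: str) -> float:
        if j == 0:              # no further hits required
            return 1.0
        if i + len(m) - 1 > n:  # no room left for another full occurrence
            return 0.0
        if w == m:
            # Count this hit and restart after it (assumed non-overlap rule).
            return f(j - 1, i + len(m), "")
        total = 0.0
        for c, p in dist.items():
            s = longest_prefix_suffix(m, w + c)
            total += p * f(j, i + len(w) + 1 - len(s), s)
        return total

    return f(k, 1, "")


print(round(p_value("11", 4, {"0": 0.5, "1": 0.5}, 2), 6))  # 0.0625
```

In the printed example only the sequence 1111 contains two non-overlapping copies of 11, which has probability 1/16. With k = 1 the function reduces to the single-hit case.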
A simple example
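i.i.d. model, |R| = 4, m = 101; each position generates 1 with probability 0.4 (and 0 with probability 0.6).
Compute the map $S_m(wc)$ first, for every prefix $w$ of $m$ and every character $c$:

| c \ w | ε | 1 | 10 | 101 |
|---|---|---|---|---|
| 0 | ε | 10 | ε | 10 |
| 1 | 1 | 1 | 101 | 1 |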
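The map for m = 101 can be reproduced mechanically. This is an illustrative sketch; the helper `longest_prefix_suffix` and the dictionary layout are assumptions, with the empty Python string standing for the empty word.

```python
def longest_prefix_suffix(m: str, w: str) -> str:
    """S_m(w): the longest suffix of w that is also a prefix of m."""
    for length in range(min(len(m), len(w)), -1, -1):
        if w[len(w) - length:] == m[:length]:
            return m[:length]
    return ""


m = "101"
prefixes = [m[:j] for j in range(len(m) + 1)]  # "", "1", "10", "101"

# table[(w, c)] holds S_m(wc) for every prefix w of m and character c.
table = {(w, c): longest_prefix_suffix(m, w + c) for w in prefixes for c in "01"}

for c in "01":
    print(c, [table[(w, c)] for w in prefixes])
```

Each printed row lists S_m(wc) for w running over the prefixes of m in order of increasing length.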
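Initialize the DP table f(i, w) (rows: w, columns: i):

| w \ i | 5 | 4 | 3 | 2 | 1 |
|---|---|---|---|---|---|
| 101 | 0 | 0 | 0 | 1 | 1 |
| 10 | 0 | 0 | 0 | | |
| 1 | 0 | 0 | 0 | | |
| ε | 0 | 0 | 0 | | |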
Compute the DP entries using the recursive representation:
f(2,10) = f(2,100)*p(generate 0) + f(2,101)*p(generate 1)
= f(5,ε)*0.6 + 1*0.4 = 0.4
f(2,1) = f(2,10)*p(generate 0) + f(2,11)*p(generate 1)
= f(2,10)*0.6 + f(3,1)*0.4 = 0.24
f(2,ε) = f(2,0)*p(generate 0) + f(2,1)*p(generate 1)
= f(3,ε)*0.6 + f(2,1)*0.4 = 0.096
f(1,10) = f(1,100)*p(generate 0) + f(1,101)*p(generate 1)
= f(4,ε)*0.6 + 1*0.4 = 0.4
f(1,1) = f(1,10)*p(generate 0) + f(1,11)*p(generate 1)
= 0.4*0.6 + f(2,1)*0.4 = 0.336
f(1,ε) = f(1,0)*p(generate 0) + f(1,1)*p(generate 1)
= f(2,ε)*0.6 + 0.336*0.4 = 0.192
The final result is f(1,ε) = 0.192.
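Because the region has only four positions, the result can be cross-checked by enumerating all 16 binary sequences and summing the probabilities of those that contain 101. This brute-force check is an illustration added for verification; the function name is hypothetical.

```python
from itertools import product


def brute_force_hit_probability(m: str, n: int, dist: dict) -> float:
    """Sum the probabilities of all length-n sequences that contain m."""
    total = 0.0
    for chars in product(dist, repeat=n):
        seq = "".join(chars)
        if m in seq:
            prob = 1.0
            for c in seq:
                prob *= dist[c]  # positions are independent
            total += prob
    return total


print(round(brute_force_hit_probability("101", 4, {"0": 0.6, "1": 0.4}), 6))  # 0.192
```

The four qualifying sequences are 0101, 1010, 1011 and 1101, with probabilities 0.0576, 0.0576, 0.0384 and 0.0384, which sum to 0.192.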
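The final DP table is as follows (rows: w, columns: i):

| w \ i | 5 | 4 | 3 | 2 | 1 |
|---|---|---|---|---|---|
| 101 | 0 | 0 | 0 | 1 | 1 |
| 10 | 0 | 0 | 0 | 0.4 | 0.4 |
| 1 | 0 | 0 | 0 | 0.24 | 0.336 |
| ε | 0 | 0 | 0 | 0.096 | 0.192 |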