Correctness proof of EM Variants of HMM Sequence Alignment via HMM Lecture # 10
description
Transcript of Correctness proof of EM Variants of HMM Sequence Alignment via HMM Lecture # 10
![Page 1: Correctness proof of EM Variants of HMM Sequence Alignment via HMM Lecture # 10](https://reader035.fdocuments.us/reader035/viewer/2022081603/56813e4c550346895da83916/html5/thumbnails/1.jpg)
.
Correctness proof of EMVariants of HMM
Sequence Alignment via HMM
Lecture #10
This class has been edited from Nir Friedman’s lecture. Changes made by Dan Geiger, then Shlomo Moran.
Background Readings: chapters 11.6, 3.4, 3.5, 4, in the Durbin et al., 2001, Chapter 3.4 Setubal et al., 1997
![Page 2: Correctness proof of EM Variants of HMM Sequence Alignment via HMM Lecture # 10](https://reader035.fdocuments.us/reader035/viewer/2022081603/56813e4c550346895da83916/html5/thumbnails/2.jpg)
2
In each iteration the EM algorithm does the following. (E step): Calculate
Qθ () = ∑y p(y|x,θ)log p(x,y|)
(M step): Find * which maximizes Qθ ()
(Next iteration sets * and repeats).
The EM algorithm
Comment: At the M-step we only need that Qθ(*)>Qθ(θ). This change yields the so called Generalized EM algorithm. It is important when it is hard to find the optimal *.
![Page 3: Correctness proof of EM Variants of HMM Sequence Alignment via HMM Lecture # 10](https://reader035.fdocuments.us/reader035/viewer/2022081603/56813e4c550346895da83916/html5/thumbnails/3.jpg)
3
Correctness proof of EM
Theorem:
If λ* maximizes Q (λ) = ∑y p(y|x,θ)log p(y| λ), then P(x| λ*) P(x| θ) .
Comment: The proof remains valid if we assume only that
Q(λ*) Q(θ).
![Page 4: Correctness proof of EM Variants of HMM Sequence Alignment via HMM Lecture # 10](https://reader035.fdocuments.us/reader035/viewer/2022081603/56813e4c550346895da83916/html5/thumbnails/4.jpg)
4
By the definition of conditional probability, for each y we have,
p(x|) p(y|x,) = p(y,x|), and hence:
log p(x|) = log p(y,x|) – log p(y|x,)Hence
log p(x| λ) = ∑y p(y|x,θ) [log p(y|λ) – log p(y|x,λ)]
log p(x|λ)
Proof (cont.)
=1
(Next..)
![Page 5: Correctness proof of EM Variants of HMM Sequence Alignment via HMM Lecture # 10](https://reader035.fdocuments.us/reader035/viewer/2022081603/56813e4c550346895da83916/html5/thumbnails/5.jpg)
5
Proof (end)
log p(x|λ) = ∑y p(y|x, θ) log p(y|λ)
- ∑y p(y|x,θ) log [p(y|x,λ)]
Qθ(λ)
Substituting λ=λ* and λ=θ, and then subtracting, we get
log p(x|λ*) - log p(x|θ) =
Q(λ*) – Q(θ) + D(p(y|x,θ) || p(y|x,λ*))
≥ 0 [since λ* maximizes Q(λ)]. Relative entropy 0 ≤
≥ 0 QED
![Page 6: Correctness proof of EM Variants of HMM Sequence Alignment via HMM Lecture # 10](https://reader035.fdocuments.us/reader035/viewer/2022081603/56813e4c550346895da83916/html5/thumbnails/6.jpg)
6
EM in Practice
Initial parameters: Random parameters setting “Best” guess from other source
Stopping criteria: Small change in likelihood of data Small change in parameter values
Avoiding bad local maxima: Multiple restarts Early “pruning” of unpromising ones
![Page 7: Correctness proof of EM Variants of HMM Sequence Alignment via HMM Lecture # 10](https://reader035.fdocuments.us/reader035/viewer/2022081603/56813e4c550346895da83916/html5/thumbnails/7.jpg)
7
HMM model structure:1. Duration Modeling
Markov chains are rather limited in describing sequences of symbols with non-random structures. For instance, Markov chain forces the distribution of segments in which some state is repeated for k additional times to be (1-p)pk, for some p.
Several ways enable modeling of other distributions. One is assigning k >1 states to represent the same “real” state. This may guarantee k repetitions with any desired probability.
A1 A2 A3 A4
![Page 8: Correctness proof of EM Variants of HMM Sequence Alignment via HMM Lecture # 10](https://reader035.fdocuments.us/reader035/viewer/2022081603/56813e4c550346895da83916/html5/thumbnails/8.jpg)
8
HMM model structure:2. Silent states
States which do not emit symbols (as we saw in the abo locus). Can be used to model duration distributions. Also used to allow arbitrary jumps (may be used to model
deletions) Need to generalize the Forward and Backward algorithms for
arbitrary acyclic digraphs to count for the silent states (see next):
Silent states:
Regular states:
![Page 9: Correctness proof of EM Variants of HMM Sequence Alignment via HMM Lecture # 10](https://reader035.fdocuments.us/reader035/viewer/2022081603/56813e4c550346895da83916/html5/thumbnails/9.jpg)
9
Eg, the forwards algorithm should look:
Directed cycles of silence (or other) states complicate things, and should be avoided.
Ezu kklkl
Evu kklkll
aufzfz
aufxevf
xv
),(
),(
)()(
)()()(
: statessilent for and
: symbolemit which statesregular For
x
v
zSilent states
Regular states
symbols
![Page 10: Correctness proof of EM Variants of HMM Sequence Alignment via HMM Lecture # 10](https://reader035.fdocuments.us/reader035/viewer/2022081603/56813e4c550346895da83916/html5/thumbnails/10.jpg)
10
HMM model structure:3. High Order Markov Chains
Markov chains in which the transition probabilities depends on the last k states:
P(xi|xi-1,...,x1) = P(xi|xi-1,...,xi-k)
Can be represented by a standard Markov chain with more states. eg for k=2:
AA
BBBA
AB
![Page 11: Correctness proof of EM Variants of HMM Sequence Alignment via HMM Lecture # 10](https://reader035.fdocuments.us/reader035/viewer/2022081603/56813e4c550346895da83916/html5/thumbnails/11.jpg)
11
HMM model structure:4. Inhomogenous Markov Chains
An important task in analyzing DNA sequences is recognizing the genes which code for proteins.
A triplet of 3 nucleotides – called codon - codes for amino acids (see next slide).
It is known that in parts of DNA which code for genes, the three codons positions has different statistics.
Thus a Markov chain model for DNA should represent not only the Nucleotide (A, C, G or T), but also its position – the same nucleotide in different position will have different transition probabilities. Used in GENEMARK gene finding program (93).
![Page 12: Correctness proof of EM Variants of HMM Sequence Alignment via HMM Lecture # 10](https://reader035.fdocuments.us/reader035/viewer/2022081603/56813e4c550346895da83916/html5/thumbnails/12.jpg)
12
Genetic Code
There are 20 amino acids from which proteins are build.
![Page 13: Correctness proof of EM Variants of HMM Sequence Alignment via HMM Lecture # 10](https://reader035.fdocuments.us/reader035/viewer/2022081603/56813e4c550346895da83916/html5/thumbnails/13.jpg)
13
Sequence Comparison using HMM
HMM for sequence alignment, which incorporates affine gap scores.
“Hidden” States Match. Insertion in x. insertion in y.
Symbols emitted Match: {(a,b)| a,b in ∑ }. Insertion in x: {(a,-)| a in ∑ }. Insertion in y: {(-,a)| a in ∑ }.
![Page 14: Correctness proof of EM Variants of HMM Sequence Alignment via HMM Lecture # 10](https://reader035.fdocuments.us/reader035/viewer/2022081603/56813e4c550346895da83916/html5/thumbnails/14.jpg)
14
The Transition Probabilities
ε1- ε
ε1- ε
δδ1-2δ
M X
X
M
Y
Y
Emission Probabilities Match: (a,b) with pab – only from M states
Insertion in x: (a,-) with qa – only from X state
Insertion in y: (-,a).with qa - only from Y state.
(Note that the hidden states can be reconstructed from the alignment.)
Transitions probabilities(note the forbidden ones).δ = probability for 1st gapε = probability for tailing gap.
![Page 15: Correctness proof of EM Variants of HMM Sequence Alignment via HMM Lecture # 10](https://reader035.fdocuments.us/reader035/viewer/2022081603/56813e4c550346895da83916/html5/thumbnails/15.jpg)
15
Scoring alignments
Each aligned pair is generated by the above HMM with certain probability. For each pair of sequences x (of length m) and y (of length n), there are many alignments of x and y, each corresponds to a different state-path (the length of the paths are between max{m,n} and m+n).Our task is to score alignments by using this model. The score should reflect the probability of the alignment.
![Page 16: Correctness proof of EM Variants of HMM Sequence Alignment via HMM Lecture # 10](https://reader035.fdocuments.us/reader035/viewer/2022081603/56813e4c550346895da83916/html5/thumbnails/16.jpg)
16
Most probable alignment
Let vM(i,j) be the probability of the most probable alignment of x(1..i) and y(1..j), which ends with a match. Similarly, vX(i,j) and vY(i,j), the probabilities of the most probable alignment of x(1..i) and y(1..j), which ends with an insertion to x or y.
Then using a recursive argument, we get:
(1 2 ) ( 1, 1)
[ , ] max (1 ) ( 1, 1)
(1 ) ( 1, 1)i j
M
M Xx y
Y
v i j
v i j p v i j
v i j
![Page 17: Correctness proof of EM Variants of HMM Sequence Alignment via HMM Lecture # 10](https://reader035.fdocuments.us/reader035/viewer/2022081603/56813e4c550346895da83916/html5/thumbnails/17.jpg)
17
Most probable alignment
Similar argument for vX(i,j) and vY(i,j), the probabilities of the most probable alignment of x(1..i) and y(1..j), which ends with an insertion to x or y, are:
( 1, )[ , ] max
( 1, )i
MX
x X
v i jv i j q
v i j
( , 1)[ , ] max
( , 1)j
MY
y Y
v i jv i j q
v i j
![Page 18: Correctness proof of EM Variants of HMM Sequence Alignment via HMM Lecture # 10](https://reader035.fdocuments.us/reader035/viewer/2022081603/56813e4c550346895da83916/html5/thumbnails/18.jpg)
18
Adding termination probabilities
For this, an END state is added, with transition probability τ from any other state to END. This assumes expected sequence length of 1/ τ.
M X Y END
M 1-2δ -τ δ δ τ
X 1-ε -τ ε τ
Y1-ε -τ
ε τ
END 1
We may want a model which defines a probability distribution over all possible sequences.
The last transition in each alignment is to the END state, with probability τ
![Page 19: Correctness proof of EM Variants of HMM Sequence Alignment via HMM Lecture # 10](https://reader035.fdocuments.us/reader035/viewer/2022081603/56813e4c550346895da83916/html5/thumbnails/19.jpg)
19
The log-odds scoring function
We wish to know if the alignment score is above or below the score of random alignment.
For gapless alignments we used the log-odds ratio:
s(a,b) = log (pab / qaqb). log (pab/qaqb)>0 iff the probability that a and b are related by our model is larger than the probability that they are picked at random.
To adapt this for the HMM model, we need to model random sequence by HMM, with end state. This model assigns probability to each pair of sequences x and y of arbitrary lengths m and n.
![Page 20: Correctness proof of EM Variants of HMM Sequence Alignment via HMM Lecture # 10](https://reader035.fdocuments.us/reader035/viewer/2022081603/56813e4c550346895da83916/html5/thumbnails/20.jpg)
20
scoring function for random model
X Y END
X 1- η η
Y 1- η η
END 1
The transition probabilities for the random model, with termination probability η:
(x is the start state)
2
1 1
( , | ) (1 )i j
n mn m
x yi j
p x y Random q q
The emission probability for a is qa.
Thus the probability of x (of length n) and y (of length m) being random is:
And the corresponding log-odds score is:
m
iy
n
ix ii
qqmnRandomyxp11
12 loglog)log()(log)|,(log
![Page 21: Correctness proof of EM Variants of HMM Sequence Alignment via HMM Lecture # 10](https://reader035.fdocuments.us/reader035/viewer/2022081603/56813e4c550346895da83916/html5/thumbnails/21.jpg)
21
Markov Chains for “Random” and “Model”
X Y END
X 1- η η
Y 1- η η
END 1
M X Y END
M 1-2δ -τ δ δ τ
X 1-ε -τ ε τ
Y1-ε -τ
ε τ
END 1
“Model”
“Random”
![Page 22: Correctness proof of EM Variants of HMM Sequence Alignment via HMM Lecture # 10](https://reader035.fdocuments.us/reader035/viewer/2022081603/56813e4c550346895da83916/html5/thumbnails/22.jpg)
22
Combining models in the log-odds scoring function
In order to compare the M score to the R score of sequences x and y, we can find an optimal M score, and then subtract from it the R score.
This is insufficient when we look for local alignments, where the optimal substrings in the alignment are not known in advance. A better way:
1. Define a log-odds scoring function which keeps track of the difference Match-Random scores of the partial strings during the alignment.
2. At the end add to the score (logτ – 2logη).
We get the following:
![Page 23: Correctness proof of EM Variants of HMM Sequence Alignment via HMM Lecture # 10](https://reader035.fdocuments.us/reader035/viewer/2022081603/56813e4c550346895da83916/html5/thumbnails/23.jpg)
23
The log-odds scoring function
)log(
],[)log(
],[)log(
],[)log(
maxlog],[
12
111
111
1121
bY
X
M
yx
yxM
jiV
jiV
jiV
pjiV
ji
ji
And at the end add to the score (logτ – 2logη).
)log(],[log
],[logmax],[
11
1
jiV
jiVjiV
Y
MY
)log(],[log
],[logmax],[
1
1
1
jiV
jiVjiV
X
MX
(assuming that letters at insertions/deletions are selected by the random model)
![Page 24: Correctness proof of EM Variants of HMM Sequence Alignment via HMM Lecture # 10](https://reader035.fdocuments.us/reader035/viewer/2022081603/56813e4c550346895da83916/html5/thumbnails/24.jpg)
24
The log-odds scoring function
Another way (Durbin et. al., Chapter 4.1):Define scoring function s with penalties d and e for a first gap and a
tailing gap, resp.
Then modify the algorithm to correct for extra prepayment, as follows:
-1
states) Yor X tong when movi"prepayment(" 211
1
state)from M move(assume 1
212
log
))((
)(log
)(
)(loglog),(
-e
-d
pbas
ba
ab
![Page 25: Correctness proof of EM Variants of HMM Sequence Alignment via HMM Lecture # 10](https://reader035.fdocuments.us/reader035/viewer/2022081603/56813e4c550346895da83916/html5/thumbnails/25.jpg)
25
Log-odds alignment algorithm
( 1, 1)
[ , ] ( , ) max ( 1, 1)
( 1, 1)
M
M Xi j
Y
V i j
V i j s x y V i j
V i j
( 1, )[ , ] max
( 1, )
MX
X
V i j dV i j
V i j e
( , 1)[ , ] max
( , 1)
MY
Y
V i j dV i j
V i j e
Initialization: VM(0,0)=logτ - 2logη.
Termination:V = max{ VM(m,n), VX(m,n)+c, VY(m,n)+c}Where c= log (1-2δ-τ) – log(1-ε-τ)