Department of Mathematics & Computer Science Algorithms in Computational Biology 11
Algorithms in Computational Biology
Markov Chains and Hidden Markov Model
Department of Mathematics & Computer Science Algorithms in Computational Biology 22
Example: CpG Islands
• Dinucleotide CG (CpG to distinguish it from C-G base pair)• C within CG is typically methylated• Methyl-C is more likely to mutate to T
• CpG dinucleotides are rarer in genome than would be expected from the independent probabilities of C and G
• Methylation process is suppressed in short stretches of the genome• More CpG dinucleotides in promoter regions of genes
• CpG islands are regions with many CpGs• Typically a few hundred to a few thousand bases long
Department of Mathematics & Computer Science Algorithms in Computational Biology 33
Questions about CpG Island?
• Given a short stretch of genomic sequence, how would we decide if it comes from a CpG island or not?
• Given a long piece of sequence, how would we find the CpG islands in it if there are any?
Department of Mathematics & Computer Science Algorithms in Computational Biology 44
Markov Chains
A
G
T
C
)()...,...,|(),....,|(
),...,,()(
Llength of sequence ofy probabilit the),()|(),( Since
)|(:yprobabilit Transition
112111
11
1
xPxxxPxxxP
xxxPxP
xyPyxPyxP
sxtxPa
LLLL
LL
iist
Department of Mathematics & Computer Science Algorithms in Computational Biology 55
Key Property of a Markov Chain
• The probability of each symbol xi depends only on the value of the preceding symbol xi-1
L
ixx
LLLL
LL
iiaxP
xPxxPxxPxxP
xxxPxP
21
112211
11
1)(
)()|()...|()|(
),...,,()(
Department of Mathematics & Computer Science Algorithms in Computational Biology 66
Modeling the Beginning and End of Sequences
A
G
T
C
B E
tEL
Bs
atxEP
asxP
)|(
)( 1
Department of Mathematics & Computer Science Algorithms in Computational Biology 77
Using Markov Chains for Discrimination
+ A C G T
A 0.180 0.274 0.426 0.120
C 0.171 0.368 0.274 0.188
G 0.161 0.339 0.375 0.125
T 0.079 0.355 0.384 0.182
- A C G T
A 0.300 0.205 0.285 0.210
C 0.322 0.298 0.078 0.302
G 0.248 0.246 0.298 0.208
T 0.177 0.239 0.292 0.292
CpG island model
Non-CpG island model
''
tstc
ca st
st
Department of Mathematics & Computer Science Algorithms in Computational Biology 88
Cont’
• For discrimination, the log-odds ratio is calculated:
L
ixx
L
i xx
xx
ii
ii
ii
a
a
xP
xPxS
221
1
1log
)IslandCpG Non |(
)IslandCpG |(log)(
A C G T
A -0.740 0.419 0.580 -0.803
C -0.913 0.302 1.812 -0.685
G -0.624 0.461 0.331 -0.730
T -1.169 0.573 0.393 -0.679
Department of Mathematics & Computer Science Algorithms in Computational Biology 99
Histogram of Length-Normalized Scores
Non-CpG Islands
CpG islands
Department of Mathematics & Computer Science Algorithms in Computational Biology 1010
Locating CpG Islands in a DNA Sequence
Input: A long DNA sequence X = {x1, x2, …, xL)*
Output: CpG islands along X.• Use Markov chain models• Calculate log-odds score for a window of length k
(e.g., 100)• A total of L-k+1 scores will be computed and plotted• CpG islands will stand out with positive values
Department of Mathematics & Computer Science Algorithms in Computational Biology 1111
Problems with Markov Chain Models in Locating CpG Islands
• CpG islands have sharp boundaries• CpG islands have variable lengths
These problems can be better addressed by building a single model for the entire sequence that incorporates both Markov chains
Department of Mathematics & Computer Science Algorithms in Computational Biology 1212
Formal Definition of an HMM
• A hidden Markov model is a triplet M = (, Q, ), where is an alphabet of symbols• Q is a finite set of states, capable of emitting symbols
from the alphabet is a set of probabilities, comprised of
• State transition probabilities, denoted by akl for each k, l Q
• Emission probabilities, denoted by ek(b) for each state k Q and b
Department of Mathematics & Computer Science Algorithms in Computational Biology 1313
Cont’
(State sequence or path)• = (1, 2, …, L)
• Follows a simple Markov chain (the probability of a state depends only on the previous state)
• State transition probability• akl = P{i = l| i-1 = k}
• Emission probability• Given a sequence X = (x1, x2, … xL), emission probability ek(b) is defined as:
ek(b) = P{xi=b| i = k}
• The probability that the sequence X was generated by M given the path is:
end theis and begin, theis Where
)(),(
1L0
1,, 110
L
ii iii
axeaXP
Department of Mathematics & Computer Science Algorithms in Computational Biology 1414
An HMM for Detecting CpG Islands in a Long DNA Sequence
• Alphabet: = {A, C, G, T}
• States: Q = {A+, C+, G+, T+, A-, C-, G-, T-}
• EmissionsState: A+ C+
G+ T+ A-
C- G- T-
Emitted symbol: A C G T A C G T
The emission probability of each state x+ and x- is 1 for emitting symbol x and 0 for emitting other symbols (special feature of this HMM)
Department of Mathematics & Computer Science Algorithms in Computational Biology 1515
Transition Matrix for CpG Island HMM
P is the probability of staying in a CpG island, and q is the probability of staying in a non-CpG Island
Department of Mathematics & Computer Science Algorithms in Computational Biology 1616
Occasionally Dishonest Casino Dealer
• In casino a dealer uses a fair die most of the time, but occasionally he switch to a loaded die. The loaded die has a probability of 0.5 for a six and probability of 0.1 for the numbers one to five. The dealer switches from a fair to a loaded die with probability of 0.05 before each roll, and that the probability of switching back is 0.1.
• In each state of the Markov process the outcomes of a roll have different probabilities, thus the process can modeled using a HMM
Department of Mathematics & Computer Science Algorithms in Computational Biology 1717
HMM for the Occasionally Dishonest Casino Dealer
• Q = {F, L} = {1, 2, 3, 4, 5, 6}
What is hidden?
Department of Mathematics & Computer Science Algorithms in Computational Biology 1818
HMMs Generate Sequences
• Generate a sequence via HMM• Choose 1 according to the probabilities a0i
• An observation (x1) is emitted according to the probabilities e1
• Choose 2 according to the probabilities a1i
• An observation (x2) is emitted according to the probabilities e2
• And so forth ……
• P(x) is the probability that sequence x was generated by the model
• The joint probability of an observed sequence x and a state sequence :
0 where)(),( 11
0 11
L
L
ii iiiaxeaxP
Department of Mathematics & Computer Science Algorithms in Computational Biology 1919
Most Probable State Path
• A CpG island example• Sequence CGCG can be emitted by:
• (C+, G+, C+, G+), (C-, G-, C-, G-), (C+, G-, C+, G-)
• Which state sequence is more likely for the observation?
• Most probable path is defined as:
• The probability vk(i) of the most probable path ending in state k with observation i is known for all the states k, then vl(i+1) is defined:
),(maxarg*
xP
))((max)()1( 1 klkk
ill aivxeiv
Department of Mathematics & Computer Science Algorithms in Computational Biology 2020
Finding Most Probable Path Using Viterbi Algorithm
Initialization (i = 0): v0(0) = 1, vk(0) = 0 for k > 0
Recursion: (i = 1…L):
vl(i) = el(xi)maxk(vk(i-1)akl)
ptri(l) = argmaxk(vk(i-1)akl)
Termination:
P(x, *) = maxk(vk(L)ak0)
L*= argmaxk(vk(L)ak0)
Traceback (i=L…1): i-1*= ptri(i
*)
Department of Mathematics & Computer Science Algorithms in Computational Biology 2121
Viterbi Example
V C G C G
B 1 0 0 0 0
A+ 0 0 0 0 0
C+ 0 0.13 0 0.011 0
G+ 0 0 0.034 0 0.003
T+ 0 0 0 0 0
A- 0 0 0 0 0
C- 0 0.13 0 0.002 0
G- 0 0 0.010 0 0.0002
T- 0 0 0 0 0
Most probable path for sequence CGCG
Department of Mathematics & Computer Science Algorithms in Computational Biology 2222
Sequence of Die Rolls Predicted by Viterbi Algorithm
Department of Mathematics & Computer Science Algorithms in Computational Biology 2323
Finding the Probability of a Sequence for an HMM: the Forward Algorithm
Definitions:
Algorithm:
Initialization ( i = 0):
Recursion (i = 1…L):
Termination:
kklkill
iik
aifxeif
kxxPif
)()()1(
),,...,()(
1
1
0for 0)0(,1)0(0 kff k
k
klkill aifxeif )1()()(
k
kk aLfxP 0)()(
Department of Mathematics & Computer Science Algorithms in Computational Biology 2424
Posterior State Probability
• We want to know the most probable state for an observation xi
• We need to find out the probability that observation xi came from each state k given the observed sequence
)(
)()(
)(
)|...()(
)(
),...|...(),...(
)(
),()|(
1
111
xP
ibif
xP
kxxPif
xP
kxxxxPkxxP
xP
kxPxkP
kkiLik
iiLiii
ii
Department of Mathematics & Computer Science Algorithms in Computational Biology 2525
Finding bk(i) Using Backward Algorithm
Initialization (i = L):
Recursion (i = L-1, …, 1):
Termination:
kaLb kk allfor )( 0
l
lilklk ibxeaib )1()()( 1
l
lll bxeaxP )1()()( 10
Department of Mathematics & Computer Science Algorithms in Computational Biology 2626
Posterior Decoding
• Approach 1
• Approach 2
• E.g. Find the posterior probability according to the model that base i is in a CpG island, we can let
g(k) = 1 for k {A+, C+, G+, T+}
g(k) = 0 for k {A-, C-, G-, T-}G(i|k) is precisely the posterior probability
)|(maxargˆ xkP ik
i
k
i kgxkPxiG )()|()|(
Department of Mathematics & Computer Science Algorithms in Computational Biology 2727
Use of Posterior Decoding
Shaded areas show when the roll was generated by the
loaded die
Department of Mathematics & Computer Science Algorithms in Computational Biology 2828
Parameter Estimation for HMMs
• Model specification• Structure design
• What states there are and how they are connected
• Assignment of parameter values• Transition probabilities akl
• Emission probabilities ek(b)
• Estimation framework• Training sequences x1, …, xn
• Work in log space
n
j
jnn xPxxPxxl1
11 )|(log)|,...,(log),...,(
Department of Mathematics & Computer Science Algorithms in Computational Biology 2929
Estimation When the State Sequence is Known
goverfittinprevent
totspseudocoun nedpredetermi are and Both
data gin trainin from of emissions ofnumber )(
data gin trainin to ns transitioofnumber
)'(
)()(
'''
(b)rr
(b)rkbbE
rlkA
bE
bEbe
A
Aa
kkl
kk
klkl
bk
kk
lkl
klkl
Department of Mathematics & Computer Science Algorithms in Computational Biology 3030
Estimation When Paths Are Unknown
• Baum (1971)• Calculate Akl and Ek(b) as the expected times each transition or
emission is used given the training sequences• Subject to local maxima• Depends only the starting values of the parameters
• The probability that akl is used at position i in sequence x is:
)(
)1()()(),|,( 1
1 xP
ibxeaifxlkP lilklk
ii
Department of Mathematics & Computer Science Algorithms in Computational Biology 3131
Expected Transition and Emission Counts
• The expected number of times that akl can be obtained by summing over all positions and over all training sequences
• The expected number of times that letter b appears in state k
)1()()()(
11
1
ibxeaifxP
A jl
i
jlkl
jk
n
jjkl i
n
j bxi
jk
jkjk
ji
ibifxP
bE1 |
)()()(
1)(
Department of Mathematics & Computer Science Algorithms in Computational Biology 3232
Baum-Welch Training (EM algorithm)
Initialization: Pick arbitrary model parameters
Recurrence:Set all the A and E variables to their pseudocount values r (or to zero)For each sequence j = 1 … n
Calculate fk(i) for sequence j using forward algorithmCalculate bk(i) for sequence j using backward algorithmAdd the contribution of sequence j to A and E
Calculate the new model parametersCalculate the new log likelihood of the model
Termination:Stop if the change in log likelihood is less than some predefined threshold or the maximum number of iterations is exceeded
Department of Mathematics & Computer Science Algorithms in Computational Biology 3333
Modeling of Labeled Sequences
• HMMs can be used to predict the labeling of unannotated sequences
• Training for HMMs• Separately train the model for CpG islands and the model for non-CpG
islands
• Combine them into a larger HMM
• Tedious especially if there are more two classes involved
• It will be nice to estimate everything at once• Training set includes all classes (e.g., CpG islands and non-CpG islands)
• Each sequence is labeled with corresponding classes
• Let y = y1, …, yL be the labels on the observation x = x1, …, xL
Department of Mathematics & Computer Science Algorithms in Computational Biology 3434
Cont’
• Model can be estimated with a slight modification of Baum-Welch algorithm• Allow only valid paths through the model
• A valid path is one where the state labels and sequence labels are the same, i.e., i has label yi
• During the forward and backward algorithms this corresponds to setting fl(i) = 0 and bl(i) = 0 for all the states l with a label different from yi
Department of Mathematics & Computer Science Algorithms in Computational Biology 3535
Discriminative Estimation
• When modeling labeled sequences, the following likelihood is maximized
• Obtaining a good prediction of y is our primary interest, it is preferable to maximize the following conditional maximum likelihood
)|,(maxarg
yxPML
),|(maxarg
xyPCML
)|(
)|,(),|( asrewritten becan )
xP
yxPxyPP(y|x,θ
Probability calculated by the forward algorithm for the
labeled sequences
Probability calculated by the forward algorithm
disregarding all the labels
Department of Mathematics & Computer Science Algorithms in Computational Biology 3636
HMM Model Structure
• Choice of model topology• Fully connected model causes local maxima• In practice, successful HMMs are constructed by carefully deciding
which transitions are allowed in the model based on knowledge about the problem under investigation
• Duration modeling• Probability decays exponentially on lengths (geometric distribution)
• P(L)=(1-p)p^(L-1) (p: self-transition 1-p: probability of leaving it)• Model more complex length distribution
• Introduce several states with the same distribution over residues and transitions between each other.
• E.g. Non-negative binomial distribution
p p p p
1-p 1-p 1-p
nnl ppN
LLP )1(
1
1)(
Department of Mathematics & Computer Science Algorithms in Computational Biology 3737
Numerical Stability of HMM Algorithms
• Probability gets too low when multiplying many probabilities in the Viterbi, forward and backward algorithms
• Consequences• Underflow error• Program would crash• Program would keep running and produce arbitrary wrong
numbers
Department of Mathematics & Computer Science Algorithms in Computational Biology 3838
Improving Numerical Stability
• Log transform
• Scaling of probabilities• For each i define a scaling variable si
)~)((max)(~)1( 1 klkk
ill aiVxeiV
lillkl
ik
kklkil
il
i
j j
ll
xeibas
ib
aifxes
if
s
ifif
)()1(~1
)(~
)(~
)(1
)1(~
)()(
~
1
11
1
Top Related