Introduction to Hidden Markov Models for Gene Prediction
Transcript of Introduction to Hidden Markov Models for Gene Prediction
ECE-S690
Outline
- Markov Models
- The Hidden Part
- How can we use this for gene prediction?
Learning Models
To recognize patterns (e.g., sequence motifs), we must learn models from the data.
Markov Chains
[Diagram: states A, B, C connected by arrows labeled with transition probabilities.]
The current state depends only on the previous state, via the transition probability:
a_BA = Pr(x_i = B | x_{i-1} = A)
Example: Estimating Mood State from Grad Student Observations
Example: “Happy” Grad Student Markov Chain
[Diagram: three-state Markov chain (Lab, Coffee Shop, Bar) with transition probabilities on the arrows.]
Observations: Lab, Coffee, Lab, Coffee, Lab, Lab, Bar, Lab, Coffee, …
Example: “Depressed about Research” Grad Student Markov Chain
[Diagram: the same three states (Lab, Coffee Shop, Bar) with different transition probabilities; these are tabulated as a matrix on the slides below.]
Evaluating Observations
The probability of observing a given sequence equals the probability of the first state times the product of all observed transition probabilities:
P(Coffee → Bar → Lab) = P(Coffee) P(Bar | Coffee) P(Lab | Bar)
Equivalently, P(CBL) = P(L|B) P(B|C) P(C), where the observations are the states visited.
1st-Order Model
The probability of the next state depends only on the previous state: P(next state | previous state). To build the model, calculate all such conditional probabilities from the data.
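The "calculate all probabilities" step can be sketched as simple counting: estimate each transition probability as the fraction of times one state follows another in the observations. A minimal sketch in Python; the observation string below is illustrative, not from the slides.

```python
from collections import Counter, defaultdict

def estimate_transitions(observations):
    """Estimate a first-order transition matrix P(next | previous)
    by counting adjacent pairs in the observation sequence."""
    pair_counts = defaultdict(Counter)
    for prev, nxt in zip(observations, observations[1:]):
        pair_counts[prev][nxt] += 1
    # Normalize each state's outgoing counts into probabilities
    return {
        prev: {nxt: n / sum(counts.values()) for nxt, n in counts.items()}
        for prev, counts in pair_counts.items()
    }

# Illustrative observation string (L = Lab, C = Coffee Shop, B = Bar)
obs = "LLCLBLLCLLBCLL"
matrix = estimate_transitions(obs)
```

The outgoing probabilities from each state sum to 1 by construction, matching the property noted on the matrix slides.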
Convert “Depressed” Observations to a Matrix
[Diagram: the depressed-student chain again; its transition probabilities are tabulated on the next slide.]
Scoring Observations: Depressed Grad Student

| To \ From | Lab | Coffee Shop | Bar |
|---|---|---|---|
| Lab | 0.1 | 0.05 | 0.2 |
| Coffee Shop | 0.1 | 0.2 | 0.1 |
| Bar | 0.8 | 0.75 | 0.7 |

The probabilities out of each state (each column) sum to 1.

Student 1: LLLCBCLLBBLL
Student 2: LCBLBBCBBBBL
Student 3: CCLLLLCBCLLL
Scoring Observations: Depressed Grad Student
Multiplying the transition probabilities (the p's) along each student's sequence:
Student 1: LLLCBCLLBBLL = 4.2×10⁻⁹
Student 2: LCBLBBCBBBBL = 4.3×10⁻⁵
Student 3: CCLLLLCBCLLL = 3.8×10⁻¹¹
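These scores can be reproduced by multiplying the transition probabilities along each student's sequence. A sketch using the depressed-student matrix; as on the slides, no initial-state probability is included (only the 11 transitions are scored).

```python
# Transition probabilities P(next | previous) for the depressed model
depressed = {
    "L": {"L": 0.1,  "C": 0.1, "B": 0.8},
    "C": {"L": 0.05, "C": 0.2, "B": 0.75},
    "B": {"L": 0.2,  "C": 0.1, "B": 0.7},
}

def score(seq, model):
    """Product of transition probabilities along an observation sequence."""
    p = 1.0
    for prev, nxt in zip(seq, seq[1:]):
        p *= model[prev][nxt]
    return p

for student in ("LLLCBCLLBBLL", "LCBLBBCBBBBL", "CCLLLLCBCLLL"):
    print(student, score(student, depressed))
# Student 1 ≈ 4.2e-9, Student 2 ≈ 4.3e-5, Student 3 ≈ 3.8e-11
```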
Equilibrium State

| To \ From | Lab | Coffee Shop | Bar |
|---|---|---|---|
| Lab | 0.333 | 0.333 | 0.333 |
| Coffee Shop | 0.333 | 0.333 | 0.333 |
| Bar | 0.333 | 0.333 | 0.333 |

Under the equilibrium transition probabilities (the q's), every sequence of the same length is equally likely:
Student 1: LLLCBCLLBBLL = 5.6×10⁻⁶
Student 2: LCBLBBCBBBBL = 5.6×10⁻⁶
Student 3: CCCLCCCBCCCL = 5.6×10⁻⁶
Comparing to Equilibrium States
Evaluating Observations
Likelihood ratios (model vs. equilibrium):
Student 1 = 4.2×10⁻⁹ / 5.6×10⁻⁶ = 7.5×10⁻⁴
Student 2 = 4.3×10⁻⁵ / 5.6×10⁻⁶ = 7.7
Student 3 = 3.8×10⁻¹¹ / 5.6×10⁻⁶ = 6.8×10⁻⁶
Log-likelihood ratios:
Student 1 = -3.2
Student 2 = 0.9 (most likely sad)
Student 3 = -5.2
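The log-likelihood ratio is a one-line calculation given the sequence probabilities; the probabilities below are taken from the slides, and small differences from the slide values are rounding.

```python
import math

# Sequence probabilities under the depressed model (from the slides)
p_model = {"Student 1": 4.2e-9, "Student 2": 4.3e-5, "Student 3": 3.8e-11}
p_equilibrium = (1 / 3) ** 11  # 11 transitions at 1/3 each, ~5.6e-6

for student, p in p_model.items():
    llr = math.log10(p / p_equilibrium)
    print(f"{student}: log-likelihood ratio = {llr:.1f}")
# A positive log-likelihood ratio means the depressed model fits better
# than the equilibrium model (here, only Student 2).
```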
The model could represent a Research Breakthrough (Happy) Student: Transition Probabilities

| To \ From | Lab | Coffee Shop | Bar |
|---|---|---|---|
| Lab | 0.6 | 0.75 | 0.5 |
| Coffee Shop | 0.25 | 0.2 | 0.45 |
| Bar | 0.15 | 0.05 | 0.05 |
Combined Model
[Diagram: two chains over Lab, Coffee Shop, and Bar: one for the Happy Student and one for the Depressed Student.]
“Generalized” HMM
[Diagram: the Happy and Depressed chains as hidden states; arrows within each chain over Lab, Coffee Shop, and Bar are emissions, and arrows between the two chains are transitions.]
Generalized HMM - Combined Model
[Diagram: Start and End states added, connected to the Happy and Depressed sub-models over Lab, Coffee Shop, and Bar.]
Simplifying the Markov Chains to 0th Order to Model Hidden States
Happy: Lab 75%, Coffee 20%, Bar 5%
Sad: Lab 40%, Coffee 20%, Bar 40%
HMM - Combined Model
[Diagram: Start → {Happy, Depressed} → End, with emission probabilities:
Happy: L 0.75, C 0.2, B 0.05
Depressed: L 0.4, C 0.2, B 0.4]
Hiddenness
Evaluating Hidden State
Observations: LLLCBCLLBBLLCBLBBCBBBBLCLLLCCL
Hidden state: HHHHHHHHHHHHHDDDDDDDDDDHHHHHHH
Applications
Particulars about HMMs
Gene Prediction
Why are HMMs a good fit for DNA and Amino Acids?
HMM Caveats
- States are supposed to be independent of each other, and this isn't always true
- Need to be mindful of overfitting
  - Need a good training set
  - More training data does not always mean a better model
- HMMs can be slow (if proper decoding is not implemented)
  - Some decoding maps out all paths through the model
  - DNA sequences can be very long, so processing/annotating them can be very time-consuming
Genomic Applications
- Finding genes
- Finding pathogenicity islands
Example Bio App: Pathogenicity Islands
[Figure: GC content along the Neisseria meningitidis genome, 52% G+C overall (from Tettelin et al. 2000, Science).]
- Clusters of genes acquired by horizontal transfer
- Present in pathogenic species but not others
- Frequently encode virulence factors: toxins, secondary metabolites, adhesins
- Flanked by repeats, differently regulated, and have different codon usage
- Different GC content than the rest of the genome
Modeling Sequence Composition (Simple Probability of Sequence)
- Calculate the sequence distribution from known islands: count occurrences of A, T, G, C
- Model islands as nucleotides drawn independently from this distribution:
  A: 0.15, T: 0.13, G: 0.30, C: 0.42
- This gives P(S_i | M_P) at each position of a sequence such as
  ...CCTAAGTTAGAGGATTGAGA...
The Probability of a Sequence (Simplistic)
We can calculate the probability of a particular sequence S according to the pathogenicity island model M_P.
Example: S = AAATGCGCATTTCGAA, with A: 0.15, T: 0.13, G: 0.30, C: 0.42.
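Under this simple model the calculation is just a product over positions. A sketch using the island composition from the slide:

```python
# Island nucleotide distribution from the slide
island = {"A": 0.15, "T": 0.13, "G": 0.30, "C": 0.42}

def seq_probability(seq, model):
    """P(S | M_P): product of independent per-nucleotide probabilities."""
    p = 1.0
    for base in seq:
        p *= model[base]
    return p

print(seq_probability("AAATGCGCATTTCGAA", island))  # ≈ 6.5e-12
```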
A More Complex Model
[Diagram: two states, Background and Island. Background emits A, T, G, C uniformly (0.25 each); Island emits A: 0.15, T: 0.13, G: 0.30, C: 0.42. Transitions: Background→Background 0.85, Background→Island 0.15, Island→Island 0.75, Island→Background 0.25.]
TAAGAATTGTGTCACACACATAAAAACCCTAAGTTAGAGGATTGAGATTGGCAGACGATTGTTCGTGATAATAAACAAGGGGGGCATAGATCAGGCTCATATTGGC
A Generative Model
[Figure: a lattice of hidden states B (background) and P (island) generating S = CAAATGCG; each column is one sequence position, and a path through B and P states generates the sequence.]
Emission probabilities: P(S|B) is uniform (0.25 each); P(S|P) is the island distribution A: 0.15, T: 0.13, G: 0.30, C: 0.42 (as on the earlier slides).
Transition probabilities P(L_{i+1} | L_i):

| | B_{i+1} | P_{i+1} |
|---|---|---|
| B_i | 0.85 | 0.15 |
| P_i | 0.25 | 0.75 |
The Hidden in HMM
DNA does not come conveniently labeled (i.e., Island, Gene, Promoter); we observe only nucleotide sequences.
The "hidden" in HMM refers to the fact that the state labels, L, are not observed. We only observe the emissions (e.g., the nucleotide sequence in our example).
[Diagram: hidden states i and j emitting the sequence ...AAGTTAGAG...]
A Hidden Markov Model
- Hidden states: L = { 1, ..., K }
- Transition probabilities: a_kl = transition probability from state k to state l
- Emission probabilities: e_k(b) = P(emitting b | state = k)
- Initial state probabilities: π(k) = P(first state = k)
[Diagram: states i and j with emission probabilities e_k(b), e_l(b) and transition probabilities between them.]
HMM with Emission Parameters
- a_13: probability of a transition from State 1 to State 3
- e_2(A): probability of emitting character A in State 2
Hidden Markov Models (HMM)
- Allow you to find sub-sequences that fit your model
- Hidden states are distinct from the observed symbols
- Parameterized by emission and transition probabilities
- Must search for optimal paths
Three Basic Problems of HMMs
The Evaluation Problem: Given an HMM and a sequence of observations, what is the probability that the observations are generated by the model?
The Decoding Problem: Given a model and a sequence of observations, what is the most likely state sequence in the model that produced the observations?
The Learning Problem: Given a model and a sequence of observations, how should we adjust the model parameters to maximize the probability of the observations?
Fundamental HMM Operations

| Operation | Definition | Computational Biology Example |
|---|---|---|
| Decoding | Given an HMM and sequence S, find a corresponding sequence of labels, L | Annotate pathogenicity islands on a new sequence |
| Evaluation | Given an HMM and sequence S, find P(S \| HMM) | Score a particular sequence |
| Training | Given an HMM without parameters and a set of sequences S, find the transition and emission probabilities that maximize P(S \| params, HMM) | Learn a model for sequence composed of background DNA and pathogenicity islands |
Markov Chains and Processes
- 1st order Markov chain
- 2nd order Markov chain
- 1st order with stochastic observations: HMM
Order & Conditional Probabilities
- 0th order: P(ACTGTC) = p(A) × p(C) × p(T) × p(G) × p(T) × p(C)
- 1st order: P(ACTGTC) = p(A) × p(C|A) × p(T|C) × p(G|T) × p(T|G) × p(C|T)
- 2nd order: P(ACTGTC) = p(A) × p(C|A) × p(T|AC) × p(G|CT) × p(T|TG) × p(C|GT)
Here P(T|AC) is the probability of T given the preceding AC.
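The three expansions follow one pattern: each symbol is conditioned on its preceding min(i, k) symbols. A generic sketch; the uniform conditional below is purely illustrative, not a model from the slides.

```python
def score_kth_order(seq, k, cond_prob):
    """P(seq) under a k-th order Markov model.  cond_prob(context, base)
    returns P(base | context), where context is the preceding min(i, k)
    characters (shorter near the start of the sequence)."""
    p = 1.0
    for i, base in enumerate(seq):
        p *= cond_prob(seq[max(0, i - k):i], base)
    return p

def uniform(context, base):
    # Illustrative conditional: every base equally likely in any context
    return 0.25

print(score_kth_order("ACTGTC", 2, uniform))  # 0.25**6, for any order
```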
HMM - Combined Model for Gene Detection
[Diagram: Start → {Coding, Noncoding} → End]
1st-Order Transition Matrix (4×4)

| | A | C | G | T |
|---|---|---|---|---|
| A | 0.2 | 0.15 | 0.25 | 0.2 |
| C | 0.3 | 0.35 | 0.25 | 0.2 |
| G | 0.3 | 0.4 | 0.3 | 0.3 |
| T | 0.2 | 0.1 | 0.2 | 0.2 |
2nd-Order Model (16×4)

| | A | C | G | T |
|---|---|---|---|---|
| AA | 0.1 | 0.3 | 0.25 | 0.05 |
| AC | 0.05 | 0.25 | 0.3 | 0.1 |
| AG | 0.3 | 0.05 | 0.1 | 0.25 |
| AT | 0.25 | 0.1 | 0.05 | 0.3 |
| ... | | | | |
Three Basic Problems of HMMs
The Evaluation Problem: Given an HMM and a sequence of observations, what is the probability that the observations are generated by the model?
The Decoding Problem: Given a model and a sequence of observations, what is the most likely state sequence in the model that produced the observations?
The Learning Problem: Given a model and a sequence of observations, how should we adjust the model parameters to maximize the probability of the observations?
What Questions can an HMM Answer?
- Viterbi algorithm: What is the most probable path that generated sequence X?
- Forward algorithm: What is the likelihood of sequence X given HMM M, Pr(X|M)?
- Forward-Backward (Baum-Welch) algorithm: What is the probability that a particular state k generated symbol X_i?
“Decoding” With an HMM
Pathogenicity island example: given a nucleotide sequence, we want a labeling of each nucleotide as either “pathogenicity island” or “background DNA”.
In general: given observations, we would like to predict the sequence of hidden states that is most likely to have generated them.
The Most Likely Path
Given observations, one reasonable choice for labeling the hidden states is the sequence of hidden state labels, L* (or path), that makes the labels and sequence most likely given the model.
Probability of a Path, Seq
[Figure: lattice of B and P states over the sequence S = GCAAATGC, showing the all-background path. Each step contributes a transition probability of 0.85 (B→B) and an emission probability of 0.25.]
Probability of a Path, Seq
[Figure: the same lattice with a path that passes through island (P) states: transitions 0.85 (B→B), 0.15 (B→P), 0.75 (P→P), 0.25 (P→B), and island emissions such as 0.42, 0.42, 0.30.]
We could try to calculate the probability of every path, but...
Decoding
Viterbi algorithm:
- Finds the most likely sequence of hidden states or labels (L*, P*, or π*), given the sequence and the model
- Uses dynamic programming (the same technique used in sequence alignment)
- Much more efficient than searching every path
Finding the Best Path
Viterbi:
- Dynamic programming
- Maximizes probability
- Recovers the state labels for the observations on trace-back
Viterbi Algorithm
What is the most probable state path given the sequence (observations)?
Viterbi (in pseudocode)
v_l(i) = e_l(x_i) max_k ( v_k(i-1) a_kl )
Here k ranges over the possible previous states and l is the current state. π* is the path that maximizes the probability of the previous path times the new transition, via the max_k ( v_k(i-1) a_kl ) term. Each node picks one max, starting from the Start state.
Forward Algorithm: Probability of a Single Label (Hidden State)
- Calculate the most probable label, L*_i, at each position i
- Doing this for all N positions gives us {L*_1, L*_2, L*_3, ..., L*_N}
[Figure: the B/P lattice over S = GCAAATGC; computing P(Label_5 = B | S) requires summing over all paths through that state.]
Forward algorithm (dynamic programming):
f_l(i) = e_l(x_i) Σ_k f_k(i-1) a_kl
Forward Algorithm
f_l(i) = e_l(x_i) Σ_k f_k(i-1) a_kl
P(x) = Σ_k f_k(N) a_k0
Starting from the Start state, add the probabilities of all the different paths to get the probability of the sequence.
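A sketch of the forward algorithm on the background/island model. Two simplifications, both assumptions: the chain starts in the background state, and the end-transition term a_k0 is omitted, so P(x) is just the sum of the final forward values.

```python
def forward(seq, states, init, trans, emit):
    """Forward algorithm: f_l(i) = e_l(x_i) * sum_k f_k(i-1) * a_kl.
    Returns P(seq), summed over all hidden-state paths."""
    f = {l: init[l] * emit[l][seq[0]] for l in states}
    for x in seq[1:]:
        f = {l: emit[l][x] * sum(f[k] * trans[k][l] for k in states)
             for l in states}
    return sum(f.values())

# Background (B) / island (P) model from the slides; starting in the
# background state is an assumption.
states = ("B", "P")
init = {"B": 1.0, "P": 0.0}
trans = {"B": {"B": 0.85, "P": 0.15}, "P": {"B": 0.25, "P": 0.75}}
emit = {"B": {b: 0.25 for b in "ACGT"},
        "P": {"A": 0.15, "T": 0.13, "G": 0.30, "C": 0.42}}

print(forward("GCAAATGC", states, init, trans, emit))
```

For a sequence this short, the result can be checked by brute-force enumeration of all 2^8 hidden paths, which is exactly the sum the recursion computes efficiently.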
Two Decoding Options
- Viterbi algorithm: finds the most likely sequence of hidden states, L* (or P* or π*), given the sequence and model
- Posterior decoding: finds the most likely label at each position, for all positions, given the sequence and model: {L*_1, L*_2, L*_3, ..., L*_N}; uses the forward and backward equations
Relation between Viterbi and Forward

VITERBI
V_j(i) = P(most probable path ending in state j with observation i)
Initialization: V_0(0) = 1; V_k(0) = 0 for all k > 0
Iteration: V_l(i) = e_l(x_i) max_k V_k(i-1) a_kl
Termination: P(x, π*) = max_k V_k(N)

FORWARD
f_l(i) = P(x_1 ... x_i, state_i = l)
Initialization: f_0(0) = 1; f_k(0) = 0 for all k > 0
Iteration: f_l(i) = e_l(x_i) Σ_k f_k(i-1) a_kl
Termination: P(x) = Σ_k f_k(N) a_k0
Forward/Backward Algorithms
The forward and backward algorithms can be combined to find the probability that emission x_i came from state k, given the whole sequence x: P(π_i = k | x).
This is called posterior decoding:
P(π_i = k | x) = f_k(i) b_k(i) / P(x)
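Posterior decoding can be sketched by running the forward and backward passes and combining them position by position. This reuses the background/island model, with the same assumptions as before: start in background, no explicit end state.

```python
def posterior(seq, states, init, trans, emit):
    """Posterior decoding: P(pi_i = k | x) = f_k(i) * b_k(i) / P(x)."""
    n = len(seq)
    # Forward pass: f[i][l] = P(x_1..x_i, state_i = l)
    f = [{l: init[l] * emit[l][seq[0]] for l in states}]
    for x in seq[1:]:
        f.append({l: emit[l][x] * sum(f[-1][k] * trans[k][l] for k in states)
                  for l in states})
    # Backward pass: b[i][k] = P(x_{i+1}..x_n | state_i = k)
    b = [{k: 1.0 for k in states} for _ in range(n)]
    for i in range(n - 2, -1, -1):
        b[i] = {k: sum(trans[k][l] * emit[l][seq[i + 1]] * b[i + 1][l]
                       for l in states) for k in states}
    px = sum(f[n - 1][k] for k in states)  # total sequence probability
    return [{k: f[i][k] * b[i][k] / px for k in states} for i in range(n)]

# Background/island model from the slides (start in background assumed)
states = ("B", "P")
init = {"B": 1.0, "P": 0.0}
trans = {"B": {"B": 0.85, "P": 0.15}, "P": {"B": 0.25, "P": 0.75}}
emit = {"B": {b: 0.25 for b in "ACGT"},
        "P": {"A": 0.15, "T": 0.13, "G": 0.30, "C": 0.42}}

for i, dist in enumerate(posterior("GCAAATGC", states, init, trans, emit)):
    print(i + 1, dist)
```

At every position the posteriors over states sum to 1, which is a useful sanity check on an implementation.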
Example Application: Bacillus subtilis
Method
Nicolas et al. (2002) NAR
Three-state model: Gene+, Gene-, and AT-rich.
Second-order emissions: P(S_i) = P(S_i | State, S_{i-1}, S_{i-2}) (capturing trinucleotide frequencies)
Train using EM; predict with posterior decoding.
Results
Nicolas et al. (2002) NAR
[Figure: each line is P(label | S, model), color-coded by label: gene on the positive strand, gene on the negative strand, and A/T-rich (intergenic regions, islands).]
Training an HMM
We must learn:
- Transition probabilities P(L_{i+1}|L_i), e.g. P(P_{i+1} | B_i), the probability of entering a pathogenicity island from background DNA
- Emission probabilities, i.e. the nucleotide frequencies for background DNA (P(S|B)) and pathogenicity islands (P(S|P))
Learning From Labelled Data
[Figure: the B/P lattice over S = GCAAATGC with the path through the labeled states marked.]
If we have a sequence that has islands marked, we can simply count. Transition probabilities P(L_{i+1} | L_i):

| | B_{i+1} | P_{i+1} | End |
|---|---|---|---|
| B_i | 3/5 | 1/5 | 1/5 |
| P_i | 1/3 | 2/3 | 0 |
| Start | 1 | 0 | 0 |

Emission counts work the same way, e.g. for one state: A: 1/5, T: 0, G: 2/5, C: 2/5 (etc.)
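The counting in the figure can be sketched directly. The label string below is hypothetical (chosen so the transition counts match the table above); the slide's own labeling is only partially legible in the transcript.

```python
from collections import Counter, defaultdict

def count_parameters(seq, labels):
    """Maximum-likelihood estimates from a labeled sequence:
    normalized transition counts (including Start/End) and
    normalized emission counts per label."""
    trans_counts = defaultdict(Counter)
    emit_counts = defaultdict(Counter)
    trans_counts["Start"][labels[0]] += 1
    for prev, nxt in zip(labels, labels[1:]):
        trans_counts[prev][nxt] += 1
    trans_counts[labels[-1]]["End"] += 1
    for label, base in zip(labels, seq):
        emit_counts[label][base] += 1

    def normalize(table):
        return {k: {s: n / sum(c.values()) for s, n in c.items()}
                for k, c in table.items()}

    return normalize(trans_counts), normalize(emit_counts)

# Hypothetical labeling of S = GCAAATGC (B = background, P = island)
trans, emit = count_parameters("GCAAATGC", "BBPPPBBB")
print(trans["B"])  # from B: B 3/5, P 1/5, End 1/5
```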
Unlabelled Data
[Figure: the same lattice over S = GCAAATGC, but with no labels on the path.]
How do we know how to count? The transition probabilities P(L_{i+1} | L_i) (Start, B_i, P_i, End) and the emission distributions P(S|P) and P(S|B) are all unknown.
Unlabeled Data
An idea:
1. Imagine we start with some parameters (e.g. an initial or bad model)
2. We could calculate the most likely path, P*, given those parameters and S
3. We could then use P* to recalculate our parameters by maximum likelihood
4. And iterate (to convergence)
This produces a sequence of parameter estimates: {P(S|P)_0, P(S|B)_0, P(L_{i+1}|L_i)_0} → {P(S|P)_1, P(S|B)_1, P(L_{i+1}|L_i)_1} → ... → {P(S|P)_K, P(S|B)_K, P(L_{i+1}|L_i)_K}
Training Models for Classification
Choosing the correct order for the model: higher-order models remember more “history”, and additional history can have predictive value.
Example: predict the next word in this sentence fragment: “...finish __” (up, it, first, last, ...?). Now predict it given more history: “Fast guys finish __”.
Model Order
However, the number of parameters to estimate grows exponentially with the order; for modeling DNA, an nth-order model needs on the order of 4^(n+1) parameters, with n >= 5 normally.
The higher the order, the less reliable we can expect our parameter estimates to be:
- Estimating the parameters of a 2nd-order Markov chain from the complete genome of E. coli, each word occurs > 72,000 times on average
- Estimating the parameters of an 8th-order chain, each word occurs about 5 times on average
HMMs in Context
- HMMs: sequence alignment, gene prediction
- Generalized HMMs: variable-length states, complex emission models (e.g. Genscan)
- Bayesian networks: general graphical model, arbitrary graph structure (e.g. regulatory network analysis)
HMMs can model different regions
Example Model for Gene Recognition
[Diagram: gene model with Promoter, Transcription Factor, Start, Exon, Splice, Intron, Repeat, and End states.]
Another Example
CpG Islands: Another Application
CG dinucleotides are rarer in eukaryotic genomes than expected given the independent probabilities of C and G.
In particular, the regions upstream of genes are richer in CG dinucleotides than elsewhere: these are CpG islands.
CpG Islands
[Diagram: two sub-models connected between begin (B) and end (E) states; most transitions omitted for clarity.
CpG island sub-model: states A+, C+, G+, T+ with large C, G transition probabilities (+a_GC, +a_CG).
“Normal DNA” sub-model: states A-, C-, G-, T- with small C, G transition probabilities (-a_GC, -a_CG).]
CpG Islands
- In the human genome, CG dinucleotides are relatively rare
- CG pairs undergo a process called methylation that modifies the C nucleotide
- A methylated C mutates (with relatively high chance) to a T
- Promoter regions are CG-rich
- These regions are not methylated, and thus mutate less often
- These are called CG (aka CpG) islands
CpG Island Prediction
- In a CpG island, the probability of a “C” following a “G” is much higher than in “normal” intragenic DNA sequence.
- We can construct an HMM to model this by combining two HMMs: one for normal sequence and one for CpG island sequence.
- Transitions between the two sub-models allow the model to switch between CpG island and normal DNA.
- Because there is more than one state that can generate a given character, the states are “hidden” when you just see the sequence. For example, a “C” can be generated by either the C+ or C- state in the model above.
Inhomogeneous Markov Chains
Borodovsky’s Lab: http://exon.gatech.edu/GeneMark/
Variable-Length Models
[Diagram: comparison of a full model and a variable-length model.]
Interpolated HMMs
Manage the model-order trade-off by interpolating between HMMs of various orders (e.g. GlimmerHMM).
The Three Basic HMM Problems
Problem 1 (Evaluation): Given the observation sequence O = o_1, ..., o_T and an HMM model, how do we compute the probability of O given the model?
Problem 2 (Decoding): Given the observation sequence O = o_1, ..., o_T and an HMM model, how do we find the state sequence that best explains the observations?
Problem 3 (Learning): How do we adjust the model parameters to maximize the probability of the observations given the model?
Conclusions
- Markov Models
- HMMs
- Issues
- Applications
Example of Viterbi, Forward, Backward, and Posterior Algorithms
Real DNA sequences are inhomogeneous and can be described by a hidden Markov model with hidden states representing different types of nucleotide composition. Consider an HMM that includes two hidden states, H and L, for higher and lower C+G content, respectively. Initial probabilities for both H and L are equal to 0.5, while the transition probabilities are: a_HH = 0.5, a_HL = 0.5, a_LL = 0.6, a_LH = 0.4. Nucleotides T, C, A, G are emitted from state H with probabilities 0.2, 0.3, 0.2, 0.3, and from state L with probabilities 0.3, 0.2, 0.3, 0.2, respectively. Use the Viterbi algorithm to find the most likely sequence of hidden states for the sequence X = TGC.
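A sketch of the Viterbi recursion applied to this exercise; the model numbers are exactly those stated above, and the traceback recovers the state path.

```python
def viterbi(seq, states, init, trans, emit):
    """Viterbi: v_l(i) = e_l(x_i) * max_k v_k(i-1) * a_kl, with traceback.
    Each cell stores (best probability, backpointer to previous state)."""
    v = [{l: (init[l] * emit[l][seq[0]], None) for l in states}]
    for x in seq[1:]:
        row = {}
        for l in states:
            best = max(states, key=lambda k: v[-1][k][0] * trans[k][l])
            row[l] = (v[-1][best][0] * trans[best][l] * emit[l][x], best)
        v.append(row)
    last = max(states, key=lambda l: v[-1][l][0])
    path, prob = [last], v[-1][last][0]
    for row in reversed(v[1:]):          # follow backpointers
        path.append(row[path[-1]][1])
    return "".join(reversed(path)), prob

states = ("H", "L")
init = {"H": 0.5, "L": 0.5}
trans = {"H": {"H": 0.5, "L": 0.5}, "L": {"H": 0.4, "L": 0.6}}
emit = {"H": {"T": 0.2, "C": 0.3, "A": 0.2, "G": 0.3},
        "L": {"T": 0.3, "C": 0.2, "A": 0.3, "G": 0.2}}

path, prob = viterbi("TGC", states, init, trans, emit)
print(path, prob)  # → LHH, ≈ 0.0027
```

Working the recursion by hand gives the same answer: V_L(1) = 0.15 beats V_H(1) = 0.1; at position 2 both states are reached best from L (0.018 each); at position 3, H wins with 0.018 × 0.5 × 0.3 = 0.0027, tracing back through H and then L.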