Genome 559 - homes.cs.washington.eduruzzo/courses/gs... · The Viterbi Algorithm!!probability of...
Transcript of Genome 559 - homes.cs.washington.eduruzzo/courses/gs... · The Viterbi Algorithm!!probability of...
![Page 1: Genome 559 - homes.cs.washington.eduruzzo/courses/gs... · The Viterbi Algorithm!!probability of the most probable path !!emitting and ending in state l Initialize:! General case:!](https://reader033.fdocuments.us/reader033/viewer/2022041923/5e6cca012fc49425e44e5a7c/html5/thumbnails/1.jpg)
Hidden Markov Models
Genome 559���
![Page 2: Genome 559 - homes.cs.washington.eduruzzo/courses/gs... · The Viterbi Algorithm!!probability of the most probable path !!emitting and ending in state l Initialize:! General case:!](https://reader033.fdocuments.us/reader033/viewer/2022041923/5e6cca012fc49425e44e5a7c/html5/thumbnails/2.jpg)
Eddy, Nat. Biotech, 2004
A simple ���HMM
![Page 3: Genome 559 - homes.cs.washington.eduruzzo/courses/gs... · The Viterbi Algorithm!!probability of the most probable path !!emitting and ending in state l Initialize:! General case:!](https://reader033.fdocuments.us/reader033/viewer/2022041923/5e6cca012fc49425e44e5a7c/html5/thumbnails/3.jpg)
Notes Probability of a given a state path and output sequence is just product of emission/transition probabilities
If state path is hidden, you need to consider all possible paths (usually exponentially many). E.g., find:
Total probability of a given seq (sum over all paths)
Probability of the most probable single path
“Dynamic programming” algorithms similar to seq alignment can solve these problems relatively quickly
![Page 4: Genome 559 - homes.cs.washington.eduruzzo/courses/gs... · The Viterbi Algorithm!!probability of the most probable path !!emitting and ending in state l Initialize:! General case:!](https://reader033.fdocuments.us/reader033/viewer/2022041923/5e6cca012fc49425e44e5a7c/html5/thumbnails/4.jpg)
Viterbi: Most probable path���
C A G A T
.25 .252*.9 .253*.92 .254*.93 .255*.94
0 .25*.1*.05 .252*.9*.1*.95 .253*.92*.1*.05 0
0 0 .25*.1*.05*.1 .252*.9*.1*.95*.9*.4
.25*.1*.05*.1*.9*.4 …
![Page 5: Genome 559 - homes.cs.washington.eduruzzo/courses/gs... · The Viterbi Algorithm!!probability of the most probable path !!emitting and ending in state l Initialize:! General case:!](https://reader033.fdocuments.us/reader033/viewer/2022041923/5e6cca012fc49425e44e5a7c/html5/thumbnails/5.jpg)
The Viterbi Algorithm probability of the most probable path emitting and ending in state l
Initialize:
General case:
emission transition
![Page 6: Genome 559 - homes.cs.washington.eduruzzo/courses/gs... · The Viterbi Algorithm!!probability of the most probable path !!emitting and ending in state l Initialize:! General case:!](https://reader033.fdocuments.us/reader033/viewer/2022041923/5e6cca012fc49425e44e5a7c/html5/thumbnails/6.jpg)
Viterbi Traceback
Above finds probability of best path To find the path itself, trace backward to the state k attaining the max at each stage
![Page 7: Genome 559 - homes.cs.washington.eduruzzo/courses/gs... · The Viterbi Algorithm!!probability of the most probable path !!emitting and ending in state l Initialize:! General case:!](https://reader033.fdocuments.us/reader033/viewer/2022041923/5e6cca012fc49425e44e5a7c/html5/thumbnails/7.jpg)
x1 x2 x3 x4
Viterbi Traceback
Viterbi score:
Viterbi pathR:
States
Emissions/sequence positions
![Page 8: Genome 559 - homes.cs.washington.eduruzzo/courses/gs... · The Viterbi Algorithm!!probability of the most probable path !!emitting and ending in state l Initialize:! General case:!](https://reader033.fdocuments.us/reader033/viewer/2022041923/5e6cca012fc49425e44e5a7c/html5/thumbnails/8.jpg)
An Application: Protein Alignments
WMM might be a good model of the marked ���blocks, but what about the gaps??
![Page 9: Genome 559 - homes.cs.washington.eduruzzo/courses/gs... · The Viterbi Algorithm!!probability of the most probable path !!emitting and ending in state l Initialize:! General case:!](https://reader033.fdocuments.us/reader033/viewer/2022041923/5e6cca012fc49425e44e5a7c/html5/thumbnails/9.jpg)
Mj: Match states (20 emission probabilities) Ij: Insert states (Background emission probabilities) Dj: Delete states (silent - no emission)
Profile Hmm Structure
From DEKM
![Page 10: Genome 559 - homes.cs.washington.eduruzzo/courses/gs... · The Viterbi Algorithm!!probability of the most probable path !!emitting and ending in state l Initialize:! General case:!](https://reader033.fdocuments.us/reader033/viewer/2022041923/5e6cca012fc49425e44e5a7c/html5/thumbnails/10.jpg)
Odds Scores
From DEKM
Length-normalized log odds scores, globin model
![Page 11: Genome 559 - homes.cs.washington.eduruzzo/courses/gs... · The Viterbi Algorithm!!probability of the most probable path !!emitting and ending in state l Initialize:! General case:!](https://reader033.fdocuments.us/reader033/viewer/2022041923/5e6cca012fc49425e44e5a7c/html5/thumbnails/11.jpg)
HMMs in Action: Pfam���http://pfam.sanger.ac.uk/
Hand-curated “seed” multiple alignments (domains, not full-length proteins)
Train profile HMM from seed alignment Hand-chosen score threshold(s) Automatic classification/alignment of all other
protein sequences 11912 families in Pfam 24.0, 10/2009���
(covers ~75% of proteins)
![Page 12: Genome 559 - homes.cs.washington.eduruzzo/courses/gs... · The Viterbi Algorithm!!probability of the most probable path !!emitting and ending in state l Initialize:! General case:!](https://reader033.fdocuments.us/reader033/viewer/2022041923/5e6cca012fc49425e44e5a7c/html5/thumbnails/12.jpg)
12
HMM Gene Finder
![Page 13: Genome 559 - homes.cs.washington.eduruzzo/courses/gs... · The Viterbi Algorithm!!probability of the most probable path !!emitting and ending in state l Initialize:! General case:!](https://reader033.fdocuments.us/reader033/viewer/2022041923/5e6cca012fc49425e44e5a7c/html5/thumbnails/13.jpg)
HMM Summary Search Viterbi – best single path (max of products) Forward – sum over all paths (sum of products) Posterior decoding
Model building Typically fix architecture (e.g. profile HMM), then ���Learn parameters – the Baum-Welch Algorithm
Scoring Odds ratio to background
Excellent tools available (SAM, HMMer, Pfam, …)
A very widely used tool for biosequence analysis
Man
y va
rian
ts o
n al
l of t
hese
poi
nts
![Page 14: Genome 559 - homes.cs.washington.eduruzzo/courses/gs... · The Viterbi Algorithm!!probability of the most probable path !!emitting and ending in state l Initialize:! General case:!](https://reader033.fdocuments.us/reader033/viewer/2022041923/5e6cca012fc49425e44e5a7c/html5/thumbnails/14.jpg)
![Page 15: Genome 559 - homes.cs.washington.eduruzzo/courses/gs... · The Viterbi Algorithm!!probability of the most probable path !!emitting and ending in state l Initialize:! General case:!](https://reader033.fdocuments.us/reader033/viewer/2022041923/5e6cca012fc49425e44e5a7c/html5/thumbnails/15.jpg)
Hidden Markov Models���(HMMs; Claude Shannon, 1948)
![Page 16: Genome 559 - homes.cs.washington.eduruzzo/courses/gs... · The Viterbi Algorithm!!probability of the most probable path !!emitting and ending in state l Initialize:! General case:!](https://reader033.fdocuments.us/reader033/viewer/2022041923/5e6cca012fc49425e44e5a7c/html5/thumbnails/16.jpg)
1 fair die, 1 “loaded” die, occasionally swapped
The Occasionally Dishonest Casino
![Page 17: Genome 559 - homes.cs.washington.eduruzzo/courses/gs... · The Viterbi Algorithm!!probability of the most probable path !!emitting and ending in state l Initialize:! General case:!](https://reader033.fdocuments.us/reader033/viewer/2022041923/5e6cca012fc49425e44e5a7c/html5/thumbnails/17.jpg)
Rolls !315116246446644245311321631164152133625144543631656626566666!Die !FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFLLLLLLLLLLLLLLL!Viterbi !FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFLLLLLLLLLLLL!
Rolls !651166453132651245636664631636663162326455236266666625151631!Die !LLLLLLFFFFFFFFFFFFLLLLLLLLLLLLLLLLFFFLLLLLLLLLLLLLLFFFFFFFFF!Viterbi !LLLLLLFFFFFFFFFFFFLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLFFFFFFFF!
Rolls !222555441666566563564324364131513465146353411126414626253356!Die !FFFFFFFFLLLLLLLLLLLLLFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFLL!Viterbi !FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFL!
Rolls !366163666466232534413661661163252562462255265252266435353336!Die !LLLLLLLLFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF!Viterbi !LLLLLLLLLLLLFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF!
Rolls !233121625364414432335163243633665562466662632666612355245242!Die !FFFFFFFFFFFFFFFFFFFFFFFFFFFLLLLLLLLLLLLLLLLLLLLLLFFFFFFFFFFF!Viterbi !FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFLLLLLLLLLLLLLLLLLLLFFFFFFFFFFF!
Figure 3.5 Rolls: Visible data–300 rolls of a die as described above. Die: Hidden data–which die was actually used for that roll (F = fair, L = loaded). Viterbi: the prediction by the Viterbi algorithm is shown.
From DEKM
![Page 18: Genome 559 - homes.cs.washington.eduruzzo/courses/gs... · The Viterbi Algorithm!!probability of the most probable path !!emitting and ending in state l Initialize:! General case:!](https://reader033.fdocuments.us/reader033/viewer/2022041923/5e6cca012fc49425e44e5a7c/html5/thumbnails/18.jpg)
Joint probability of a given path π & emission sequence x:���
But π is hidden; what to do? Some alternatives:
Most probable single path
Sequence of most probable states
Inferring hidden stuff
![Page 19: Genome 559 - homes.cs.washington.eduruzzo/courses/gs... · The Viterbi Algorithm!!probability of the most probable path !!emitting and ending in state l Initialize:! General case:!](https://reader033.fdocuments.us/reader033/viewer/2022041923/5e6cca012fc49425e44e5a7c/html5/thumbnails/19.jpg)
Viterbi finds:
Possibly there are 1099 paths of prob 10-99
More commonly, one path (+ slight variants) dominate others. ���(If not, other approaches may be preferable.)
Key problem: exponentially many paths π
The Viterbi Algorithm: ���The most probable path
![Page 20: Genome 559 - homes.cs.washington.eduruzzo/courses/gs... · The Viterbi Algorithm!!probability of the most probable path !!emitting and ending in state l Initialize:! General case:!](https://reader033.fdocuments.us/reader033/viewer/2022041923/5e6cca012fc49425e44e5a7c/html5/thumbnails/20.jpg)
L
F
L
F
L
F
L
F
t=0 t=1 t=2 t=3
...
...
3 6 6 2 ...
Unrolling an HMM
Conceptually, sometimes convenient Note exponentially many paths
![Page 21: Genome 559 - homes.cs.washington.eduruzzo/courses/gs... · The Viterbi Algorithm!!probability of the most probable path !!emitting and ending in state l Initialize:! General case:!](https://reader033.fdocuments.us/reader033/viewer/2022041923/5e6cca012fc49425e44e5a7c/html5/thumbnails/21.jpg)
Rolls !315116246446644245311321631164152133625144543631656626566666!Die !FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFLLLLLLLLLLLLLLL!Viterbi !FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFLLLLLLLLLLLL!
Rolls !651166453132651245636664631636663162326455236266666625151631!Die !LLLLLLFFFFFFFFFFFFLLLLLLLLLLLLLLLLFFFLLLLLLLLLLLLLLFFFFFFFFF!Viterbi !LLLLLLFFFFFFFFFFFFLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLFFFFFFFF!
Rolls !222555441666566563564324364131513465146353411126414626253356!Die !FFFFFFFFLLLLLLLLLLLLLFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFLL!Viterbi !FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFL!
Rolls !366163666466232534413661661163252562462255265252266435353336!Die !LLLLLLLLFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF!Viterbi !LLLLLLLLLLLLFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF!
Rolls !233121625364414432335163243633665562466662632666612355245242!Die !FFFFFFFFFFFFFFFFFFFFFFFFFFFLLLLLLLLLLLLLLLLLLLLLLFFFFFFFFFFF!Viterbi !FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFLLLLLLLLLLLLLLLLLLLFFFFFFFFFFF!
Figure 3.5 Rolls: Visible data–300 rolls of a die as described above. Die: Hidden data–which die was actually used for that roll (F = fair, L = loaded). Viterbi: the prediction by the Viterbi algorithm is shown.
From DEKM
![Page 22: Genome 559 - homes.cs.washington.eduruzzo/courses/gs... · The Viterbi Algorithm!!probability of the most probable path !!emitting and ending in state l Initialize:! General case:!](https://reader033.fdocuments.us/reader033/viewer/2022041923/5e6cca012fc49425e44e5a7c/html5/thumbnails/22.jpg)
Most probable path ≠ Sequence of most probable states
Another example, based on casino dice again
Suppose p(fair↔loaded) transitions are 10-99 and roll sequence is 11111…66666; then fair state is more likely all through 1’s & well into the run of 6’s, but eventually loaded wins, and the improbable F→L transitions make Viterbi = all L.
* = Viterbi
1 1 1 1 1 6 6 6 6 6 * = max prob
* * * * * * * *
* *
L
F
![Page 23: Genome 559 - homes.cs.washington.eduruzzo/courses/gs... · The Viterbi Algorithm!!probability of the most probable path !!emitting and ending in state l Initialize:! General case:!](https://reader033.fdocuments.us/reader033/viewer/2022041923/5e6cca012fc49425e44e5a7c/html5/thumbnails/23.jpg)
x1 x2 x3 x4
The Forward Algorithm For each state/time, want total probability of all paths leading to it, with given emissions
![Page 24: Genome 559 - homes.cs.washington.eduruzzo/courses/gs... · The Viterbi Algorithm!!probability of the most probable path !!emitting and ending in state l Initialize:! General case:!](https://reader033.fdocuments.us/reader033/viewer/2022041923/5e6cca012fc49425e44e5a7c/html5/thumbnails/24.jpg)
x1 x2 x3 x4
The Backward Algorithm Similar: ���for each state/time, want total probability of all paths from it, with given emissions, conditional on that state.
![Page 25: Genome 559 - homes.cs.washington.eduruzzo/courses/gs... · The Viterbi Algorithm!!probability of the most probable path !!emitting and ending in state l Initialize:! General case:!](https://reader033.fdocuments.us/reader033/viewer/2022041923/5e6cca012fc49425e44e5a7c/html5/thumbnails/25.jpg)
In state k at step i ?
![Page 26: Genome 559 - homes.cs.washington.eduruzzo/courses/gs... · The Viterbi Algorithm!!probability of the most probable path !!emitting and ending in state l Initialize:! General case:!](https://reader033.fdocuments.us/reader033/viewer/2022041923/5e6cca012fc49425e44e5a7c/html5/thumbnails/26.jpg)
Posterior Decoding, I Alternative 1: what’s the most likely state at step i?
Note: the sequence of most likely states ≠ the most likely sequence of states. May not even be legal!
![Page 27: Genome 559 - homes.cs.washington.eduruzzo/courses/gs... · The Viterbi Algorithm!!probability of the most probable path !!emitting and ending in state l Initialize:! General case:!](https://reader033.fdocuments.us/reader033/viewer/2022041923/5e6cca012fc49425e44e5a7c/html5/thumbnails/27.jpg)
1 fair die, 1 “loaded” die, occasionally swapped
The Occasionally Dishonest Casino
![Page 28: Genome 559 - homes.cs.washington.eduruzzo/courses/gs... · The Viterbi Algorithm!!probability of the most probable path !!emitting and ending in state l Initialize:! General case:!](https://reader033.fdocuments.us/reader033/viewer/2022041923/5e6cca012fc49425e44e5a7c/html5/thumbnails/28.jpg)
Rolls !315116246446644245311321631164152133625144543631656626566666!Die !FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFLLLLLLLLLLLLLLL!Viterbi !FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFLLLLLLLLLLLL!
Rolls !651166453132651245636664631636663162326455236266666625151631!Die !LLLLLLFFFFFFFFFFFFLLLLLLLLLLLLLLLLFFFLLLLLLLLLLLLLLFFFFFFFFF!Viterbi !LLLLLLFFFFFFFFFFFFLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLFFFFFFFF!
Rolls !222555441666566563564324364131513465146353411126414626253356!Die !FFFFFFFFLLLLLLLLLLLLLFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFLL!Viterbi !FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFL!
Rolls !366163666466232534413661661163252562462255265252266435353336!Die !LLLLLLLLFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF!Viterbi !LLLLLLLLLLLLFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF!
Rolls !233121625364414432335163243633665562466662632666612355245242!Die !FFFFFFFFFFFFFFFFFFFFFFFFFFFLLLLLLLLLLLLLLLLLLLLLLFFFFFFFFFFF!Viterbi !FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFLLLLLLLLLLLLLLLLLLLFFFFFFFFFFF!
Figure 3.5 Rolls: Visible data–300 rolls of a die as described above. Die: Hidden data–which die was actually used for that roll (F = fair, L = loaded). Viterbi: the prediction by the Viterbi algorithm is shown.
From DEKM
![Page 29: Genome 559 - homes.cs.washington.eduruzzo/courses/gs... · The Viterbi Algorithm!!probability of the most probable path !!emitting and ending in state l Initialize:! General case:!](https://reader033.fdocuments.us/reader033/viewer/2022041923/5e6cca012fc49425e44e5a7c/html5/thumbnails/29.jpg)
Posterior Decoding
From DEKM
![Page 30: Genome 559 - homes.cs.washington.eduruzzo/courses/gs... · The Viterbi Algorithm!!probability of the most probable path !!emitting and ending in state l Initialize:! General case:!](https://reader033.fdocuments.us/reader033/viewer/2022041923/5e6cca012fc49425e44e5a7c/html5/thumbnails/30.jpg)
Posterior Decoding, II Alternative 1: what’s most likely state at step i ?
Alternative 2: given some function g(k) on states, what’s its expectation. E.g., what’s probability of “+” model in CpG HMM (g(k)=1 iff k is “+” state)?
![Page 31: Genome 559 - homes.cs.washington.eduruzzo/courses/gs... · The Viterbi Algorithm!!probability of the most probable path !!emitting and ending in state l Initialize:! General case:!](https://reader033.fdocuments.us/reader033/viewer/2022041923/5e6cca012fc49425e44e5a7c/html5/thumbnails/31.jpg)
Post-process: merge within 500; discard < 500
CpG Islands again
Data: 41 human sequences, totaling 60kbp, including 48 CpG islands of about 1kbp each
Viterbi: Post-process: ���Found 46 of 48 46/48���plus 121 “false positives” 67 false pos
Posterior Decoding: ���same 2 false negatives 46/48���plus 236 false positives 83 false pos
![Page 32: Genome 559 - homes.cs.washington.eduruzzo/courses/gs... · The Viterbi Algorithm!!probability of the most probable path !!emitting and ending in state l Initialize:! General case:!](https://reader033.fdocuments.us/reader033/viewer/2022041923/5e6cca012fc49425e44e5a7c/html5/thumbnails/32.jpg)
Z-Scores
From DEKM
![Page 33: Genome 559 - homes.cs.washington.eduruzzo/courses/gs... · The Viterbi Algorithm!!probability of the most probable path !!emitting and ending in state l Initialize:! General case:!](https://reader033.fdocuments.us/reader033/viewer/2022041923/5e6cca012fc49425e44e5a7c/html5/thumbnails/33.jpg)
HMM Casino Example
(Excel spreadsheet on web; download & play…)
![Page 34: Genome 559 - homes.cs.washington.eduruzzo/courses/gs... · The Viterbi Algorithm!!probability of the most probable path !!emitting and ending in state l Initialize:! General case:!](https://reader033.fdocuments.us/reader033/viewer/2022041923/5e6cca012fc49425e44e5a7c/html5/thumbnails/34.jpg)
HMM Casino Example
(Excel spreadsheet on web; download & play…)
![Page 35: Genome 559 - homes.cs.washington.eduruzzo/courses/gs... · The Viterbi Algorithm!!probability of the most probable path !!emitting and ending in state l Initialize:! General case:!](https://reader033.fdocuments.us/reader033/viewer/2022041923/5e6cca012fc49425e44e5a7c/html5/thumbnails/35.jpg)
x1 x2 x3 x4
An HMM (unrolled) States
Emissions/sequence positions
![Page 36: Genome 559 - homes.cs.washington.eduruzzo/courses/gs... · The Viterbi Algorithm!!probability of the most probable path !!emitting and ending in state l Initialize:! General case:!](https://reader033.fdocuments.us/reader033/viewer/2022041923/5e6cca012fc49425e44e5a7c/html5/thumbnails/36.jpg)
HMMs in Action: Pfam���http://pfam.sanger.ac.uk/
Proteins fall into families, both across & within species Ex: Globins, GPCRs, Zinc fingers, Leucine zippers,...
Identifying family very useful: suggests function, etc.
So, search & alignment are both important One very successful approach: profile HMMs