Profile Hidden Markov Models
Transcript of Profile Hidden Markov Models
1
Profile Hidden Markov Models
PHMM
Mark Stamp
2
Hidden Markov Models
Here, we assume you know about HMMs
o If not, see “A revealing introduction to hidden Markov models”
Executive summary of HMMs
o HMM is a machine learning technique
o Also, a discrete hill climb technique
o Train model based on observation sequence
o Score given sequence to see how closely it matches the model
o Efficient algorithms, many useful applications
PHMM
3
HMM Notation
Recall, HMM model denoted λ = (A,B,π)
Observation sequence is O
Notation:
4
Hidden Markov Models
Among the many uses for HMMs…
Speech analysis
Music search engine
Malware detection
Intrusion detection systems (IDS)
Many more, and more all the time
5
Limitations of HMMs
Positional information not considered
o HMM has no “memory”
o Higher order models have some memory
o But no explicit use of positional information
Does not handle insertions or deletions
These limitations are serious problems in some applications
o In bioinformatics string comparison, sequence alignment is critical
o Also, insertions and deletions occur
6
Profile HMM
Profile HMM (PHMM) designed to overcome limitations on previous slide
o In some ways, PHMM easier than HMM
o In some ways, PHMM more complex
The basic idea of PHMM
o Define multiple B matrices
o Almost like having an HMM for each position in sequence
7
PHMM
In bioinformatics, begin by aligning multiple related sequences
o Multiple sequence alignment (MSA)
o This is like training phase for HMM
Generate PHMM based on given MSA
o Easy, once MSA is known
o Hard part is generating MSA
Then can score sequences using PHMM
o Use forward algorithm, like HMM
8
Training: PHMM vs HMM
Training PHMM
o Determine MSA: nontrivial
o Determine PHMM matrices: trivial
Training HMM
o Append training sequences: trivial
o Determine HMM matrices: nontrivial
These are opposites…
o In some sense
9
Generic View of PHMM
Have delete, insert, and match states
o Match states correspond to HMM states
Arrows are possible transitions
o Each transition has a probability
Transition probabilities are A matrix
Emission probabilities are B matrices
o In PHMM, observations are emissions
o Match and insert states have emissions
10
Generic View of PHMM
Circles are delete states, diamonds are insert states, squares are match states
Also, begin and end states
11
PHMM Notation
Notation:
12
PHMM
Match state probabilities easily determined from MSA
$a_{M_i,M_{i+1}}$: transitions between match states
$e_{M_i}(k)$: emission probability at match state
Many other transition probabilities
o For example, $a_{M_i,I_i}$ and $a_{M_i,D_{i+1}}$
Emissions at all match & insert states
o Remember, emission == observation
13
Multiple Sequence Alignment
First we show MSA construction
o This is the difficult part
o Lots of ways to do this
o “Best” way depends on specific problem
Then construct PHMM from MSA
o This is the easy part
o Standard algorithm for this
How to score a sequence?
o Forward algorithm, similar to HMM
14
MSA
How to construct MSA?
o Construct pairwise alignments
o Combine pairwise alignments for MSA
Allow gaps to be inserted
o To make better matches
Gaps tend to weaken PHMM scoring
o A tradeoff between gaps and scoring
15
Global vs Local Alignment
In these pairwise alignment examples
o “-” is gap
o “|” means elements aligned
o “*” for omitted beginning/ending symbols
16
Global vs Local Alignment
Global alignment is lossless
o But gaps tend to proliferate
o And gaps increase when we do MSA
o More gaps, more random sequences match…
o …and result is less useful for scoring
We usually only consider local alignment
o That is, omit ends for better alignment
For simplicity, assume global alignment in examples presented here
17
Pairwise Alignment
Allow gaps when aligning
How to score an alignment?
o Based on n x n substitution matrix S
o Where n is number of symbols
What algorithm(s) to align sequences?
o Usually, dynamic programming
o Sometimes, HMM is used
o Other?
Local alignment creates more issues
18
Pairwise Alignment Example
Tradeoff gaps vs misaligned elements
o Depends on matrix S and gap penalty
19
Substitution Matrix
Masquerade detection
o Detect imposter using an account
Consider 4 different operations
o E == send email
o G == play games
o C == C programming
o J == Java programming
How similar are these to each other?
20
Substitution Matrix
Consider 4 different operations:
o E, G, C, J
Possible substitution matrix:
Diagonal matches
o High positive scores
Which others most similar?
o J and C, so substituting C for J is a high score
Game playing/programming, very different
o So substituting G for C is a negative score
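As a rough illustration of the properties listed above, a substitution matrix for these four operations might be sketched as follows. The numeric values are hypothetical and chosen only to respect the stated ordering (high diagonal, positive C/J score, negative G/C score); they are not taken from the slides, and `substitution_score` is an illustrative helper name.

```python
# Hypothetical substitution matrix for the four operations E, G, C, J.
# All numeric values are illustrative assumptions, not from the source.
SYMBOLS = ["E", "G", "C", "J"]

# Only one ordering of each pair is stored; the matrix is symmetric.
_SCORES = {
    ("E", "E"): 9, ("G", "G"): 9, ("C", "C"): 9, ("J", "J"): 9,  # diagonal matches
    ("C", "J"): 5,                    # C and Java programming are similar
    ("E", "G"): -2, ("E", "C"): -3, ("E", "J"): -3,
    ("G", "C"): -6, ("G", "J"): -6,   # games vs programming: very different
}


def substitution_score(a: str, b: str) -> int:
    """Return S[a][b], looking the pair up in either order."""
    if (a, b) in _SCORES:
        return _SCORES[(a, b)]
    return _SCORES[(b, a)]
```

Storing one triangle and looking up both orderings keeps the matrix symmetric by construction, which is a design choice of this sketch rather than a requirement of the method.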
21
Substitution Matrix
Depending on problem, might be easy or very difficult to find useful S matrix
Consider masquerade detection based on UNIX commands
o Sometimes difficult to say how “close” 2 commands are
Suppose aligning DNA sequences
o Biological rationale for closeness of symbols
22
Gap Penalty
Generally must allow gaps to be inserted
But gaps make alignment more generic
o Less useful for scoring, so we penalize gaps
How to penalize gaps?
Linear gap penalty function:
g(x) = ax (constant penalty for every gap)
Affine gap penalty function:
g(x) = a + b(x - 1)
o Gap opening penalty a and constant penalty of b for each extension of existing gap
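The two gap penalty functions can be written directly in code. This is a minimal sketch; the function names are illustrative, and treating a gap of length zero as costing nothing is an assumption of the sketch.

```python
def linear_gap_penalty(x: int, a: float) -> float:
    """g(x) = a*x: the same penalty a for each of the x gap positions."""
    return a * x


def affine_gap_penalty(x: int, a: float, b: float) -> float:
    """g(x) = a + b*(x - 1): a to open the gap, b for each extension."""
    if x <= 0:
        return 0.0  # assumption: no gap, no penalty
    return a + b * (x - 1)
```

For instance, with a = 3 and b = 1, a gap of length 4 costs 12 under the linear function but only 6 under the affine one, so the affine form favors one long gap over several short ones.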
23
Pairwise Alignment Algorithm
We use dynamic programming
o Based on S matrix, gap penalty function
Notation:
24
Pairwise Alignment DP
Initialization:
Recursion:
where
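The equations for this slide did not survive in the transcript. As a stand-in, here is a minimal sketch of the standard global alignment dynamic program (Needleman-Wunsch style) with a linear gap penalty. It is not necessarily the exact recursion the slide showed, which may use a general gap penalty function. The names `global_align`, `score`, and `gap` are illustrative.

```python
def global_align(x, y, score, gap):
    """Globally align sequences x and y by dynamic programming.

    score(a, b) plays the role of the substitution matrix S;
    gap is a linear per-symbol gap penalty (a positive number).
    Returns (best score, aligned x, aligned y), with "-" for gaps.
    """
    n, m = len(x), len(y)
    # F[i][j] = best score for aligning x[:i] with y[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -gap * i  # x[:i] aligned against gaps only
    for j in range(1, m + 1):
        F[0][j] = -gap * j  # y[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),  # align x_i with y_j
                F[i - 1][j] - gap,                            # x_i against a gap
                F[i][j - 1] - gap,                            # y_j against a gap
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))
```

Calling `global_align("EGCJ", "EGJ", lambda a, b: 1 if a == b else -1, 1)` aligns the shorter sequence by inserting a single gap opposite C. Integer scores are assumed here so that the equality tests in the traceback are exact.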
25
MSA from Pairwise Alignments
Given pairwise alignments…
How to construct MSA?
Generally use “progressive alignment”
o Select one pairwise alignment
o Select another and combine with first
o Continue to add more until all are combined
Relatively easy (good)
Gaps proliferate, and it’s unstable (bad)
26
MSA from Pairwise Alignments
Lots of ways to improve on generic progressive alignment
o Here, we mention one such approach
o Not necessarily “best” or most popular
Feng-Doolittle progressive alignment
o Compute scores for all pairs of n sequences
o Select n-1 alignments that a) “connect” all sequences and b) maximize pairwise scores
o Then generate a minimum spanning tree
o For MSA, add sequences in the order that they appear in the spanning tree
27
MSA Construction
Create pairwise alignments
o Generate substitution matrix
o Dynamic program for pairwise alignments
Use pairwise alignments to make MSA
o Use pairwise alignments to construct spanning tree (e.g., Prim’s Algorithm)
o Add sequences to MSA in spanning tree order (from highest score, insert gaps as needed)
o Note: gap penalty is used
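The spanning tree step can be sketched with a Prim-style loop. This sketch assumes the tree is grown from the highest-scoring pair and that each step adds the highest-scoring edge reaching a new sequence. That is one reading of "from highest score", not a confirmed detail of the method. The function name and the input format (a dict keyed by pairs) are illustrative.

```python
def spanning_tree_order(scores):
    """Order in which pairwise alignments are merged into the MSA.

    scores: dict mapping (u, v) pairs to pairwise alignment scores.
    Returns spanning tree edges in the order they are added, each written
    as (sequence already in the tree, newly added sequence).
    """
    nodes = {n for pair in scores for n in pair}
    first = max(scores, key=scores.get)  # start from the best-scoring pair
    in_tree = set(first)
    order = [first]
    while in_tree != nodes:
        # Candidate edges have exactly one endpoint already in the tree.
        candidates = [
            (s, (u, v)) for (u, v), s in scores.items()
            if (u in in_tree) != (v in in_tree)
        ]
        if not candidates:
            raise ValueError("scores do not connect all sequences")
        _, (u, v) = max(candidates)
        order.append((u, v) if u in in_tree else (v, u))
        in_tree |= {u, v}
    return order
```

Each returned pair names the sequence already placed and the one being added, which is the form in which a progressive alignment would consume them.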
28
MSA Example
Suppose 10 sequences, with the following pairwise alignment scores
29
MSA Example: Spanning Tree
Spanning tree based on scores
So process pairs in following order: (5,4), (5,8), (8,3), (3,2), (2,7), (2,1), (1,6), (6,10), (10,9)
30
MSA Snapshot
Intermediate step and final
o Use “+” for neutral symbol
o Then “-” for gaps in MSA
Note increase in gaps
31
PHMM from MSA
In PHMM, determine match and insert states & probabilities from MSA
“Conservative” columns are match states
o Half or less of symbols are gaps
Other columns are insert states
o Majority of symbols are gaps
Delete states are a separate issue
32
PHMM States from MSA
Consider a simpler MSA…
Columns 1,2,6 are match states 1,2,3, respectively
o Since less than half gaps
Columns 3,4,5 are combined to form insert state 2
o Since more than half gaps
o Match states between insert
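The column rule (half or fewer gaps means match, a majority of gaps means insert) can be sketched as below. The MSA on the slide is not preserved in the transcript, so the example alignment in the usage note is hypothetical, built only to reproduce the stated pattern of match columns 1, 2, 6 and insert columns 3, 4, 5.

```python
def classify_columns(msa, gap="-"):
    """Label each MSA column 'M' (match) or 'I' (insert).

    msa: list of equal-length aligned sequences (strings).
    A column is a match column when half or fewer of its symbols are
    gaps, and an insert column when the majority are gaps.
    """
    n_rows = len(msa)
    labels = []
    for column in zip(*msa):  # iterate over columns, not rows
        gaps = sum(1 for symbol in column if symbol == gap)
        labels.append("M" if 2 * gaps <= n_rows else "I")
    return labels
```

With the hypothetical alignment `["EC---J", "EG-C-J", "-GE--C", "EC--GJ", "EG---C"]`, columns 1, 2, and 6 come out as match columns and columns 3 to 5 as insert columns. Consecutive insert columns would then be merged into a single insert state.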
33
Probabilities from MSA
Emission probabilities
o Based on symbol distribution in match and insert states
State transition probs
o Based on transitions in the MSA
34
Probabilities from MSA
Emission probabilities:
But 0 probabilities are bad
o Model “overfits” the data
o So, use “add one” rule
o Add one to each numerator, add total to denominators
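A minimal sketch of the add-one rule for one state's emissions follows. It assumes that "total" means the number of symbols in the alphabet, so the probabilities still sum to 1. The function name and the example column in the usage note are illustrative, not from the slides.

```python
from collections import Counter


def emission_probs(column, alphabet, gap="-"):
    """Add-one smoothed emission probabilities for one state.

    column: symbols observed for the state in the MSA (gaps are skipped).
    Adds 1 to each symbol count (numerator) and the alphabet size to the
    total count (denominator), so no symbol gets probability 0.
    """
    counts = Counter(symbol for symbol in column if symbol != gap)
    total = sum(counts.values()) + len(alphabet)
    return {symbol: (counts[symbol] + 1) / total for symbol in alphabet}
```

For a column containing E four times and one gap, over the alphabet E, G, C, J, the unsmoothed estimate would give E probability 1 and the rest 0. The add-one rule gives 5/8 for E and 1/8 for each of the others.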
35
Probabilities from MSA
More emission probabilities:
But 0 probabilities still bad
o Model “overfits” the data
o Again, use “add one” rule
o Add one to each numerator, add total to denominators
36
Probabilities from MSA
Transition probabilities:
We look at some examples
o Note that “-” is delete state
First, consider begin state:
Again, use add one rule
37
Probabilities from MSA
Transition probabilities
When no information in MSA, set probs to uniform
For example I1 does not appear in MSA, so
38
Probabilities from MSA
Transition probabilities, another example
What about transitions from state D1?
Can only go to M2, so
Again, use add one rule:
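The same add-one idea applies to transitions out of a single state. The sketch below assumes each state has a fixed set of allowed next states. The counts in the usage note are hypothetical, since the MSA and the worked values from the slide are not preserved in the transcript.

```python
def transition_probs(counts):
    """Add-one smoothed transition probabilities out of one state.

    counts: dict mapping each allowed next state to the number of times
    that transition is observed in the MSA.
    """
    total = sum(counts.values()) + len(counts)
    return {state: (count + 1) / total for state, count in counts.items()}
```

If a state never appears in the MSA, all its counts are zero and the rule yields the uniform distribution, for example 1/3 each over three allowed next states. If a delete state is observed once and is followed by the next match state, `transition_probs({"M2": 1, "I1": 0, "D2": 0})` gives 2/4 for M2 and 1/4 for each of the other two.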
39
PHMM Emission Probabilities
Emission probabilities for the given MSA
o Using add-one rule
40
PHMM Transition Probabilities
Transition probabilities for the given MSA
o Using add-one rule
41
PHMM Summary
Construct pairwise alignments
o Usually, use dynamic programming
Use these to construct MSA
o Lots of ways to do this
Using MSA, determine probabilities
o Emission probabilities
o State transition probabilities
Then we have trained a PHMM
o Now what???
42
PHMM Scoring
Want to score sequences to see how closely they match PHMM
How did we score using HMM?
o Forward algorithm
How to score sequences with PHMM?
o Forward algorithm (surprised?)
But, algorithm is a little more complex
o Due to complex state transitions
43
Forward Algorithm
Notation
o Indices i and j are columns in MSA
o $x_i$ is ith observation symbol
o $q_{x_i}$ is distribution of $x_i$ in “random model”
o Base case is $F^M_0(0) = 0$
o $F^M_j(i)$ is score of $x_1,…,x_i$ up to state j (note that in PHMM, i and j may not agree)
o Some states undefined
o Undefined states ignored in calculation
44
Forward Algorithm
Compute P(X|λ) recursively
Note that $F^M_j(i)$ depends on $F^M_{j-1}(i-1)$, $F^I_{j-1}(i-1)$, and $F^D_{j-1}(i-1)$
o And corresponding state transition probs
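The recursion itself is not preserved in the transcript. For reference, the log-odds forward recursion for profile HMMs, in the form given by Durbin et al., can be written as below. It is assumed, not confirmed, that the slide showed this form. Here $F^M_j(i)$, $F^I_j(i)$, and $F^D_j(i)$ are the forward scores for match, insert, and delete state j after observing $x_1,…,x_i$, and terms involving undefined states are dropped.

```latex
\begin{aligned}
F^M_j(i) &= \log\frac{e_{M_j}(x_i)}{q_{x_i}}
  + \log\Big( a_{M_{j-1},M_j}\, e^{F^M_{j-1}(i-1)}
            + a_{I_{j-1},M_j}\, e^{F^I_{j-1}(i-1)}
            + a_{D_{j-1},M_j}\, e^{F^D_{j-1}(i-1)} \Big) \\
F^I_j(i) &= \log\frac{e_{I_j}(x_i)}{q_{x_i}}
  + \log\Big( a_{M_j,I_j}\, e^{F^M_j(i-1)}
            + a_{I_j,I_j}\, e^{F^I_j(i-1)}
            + a_{D_j,I_j}\, e^{F^D_j(i-1)} \Big) \\
F^D_j(i) &= \log\Big( a_{M_{j-1},D_j}\, e^{F^M_{j-1}(i)}
            + a_{I_{j-1},D_j}\, e^{F^I_{j-1}(i)}
            + a_{D_{j-1},D_j}\, e^{F^D_{j-1}(i)} \Big)
\end{aligned}
```

Match and insert states consume a symbol, so they look back to position i-1. Delete states emit nothing, so $F^D_j(i)$ stays at position i and has no emission term.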
45
PHMM
We will see examples of PHMM later
In particular,
o Malware detection based on opcodes
o Masquerade detection based on UNIX commands
46
References
Durbin, et al., Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids
L. Huang and M. Stamp, Masquerade detection using profile hidden Markov models, Computers & Security, 30(8):732-747, 2011
S. Attaluri, S. McGhee, and M. Stamp, Profile hidden Markov models for metamorphic virus detection, Journal in Computer Virology, 5(2):151-169, 2009