
Introduction to Profile Hidden Markov Models

Mark Stamp


Hidden Markov Models

Here, we assume you know about HMMs; if not, see "A Revealing Introduction to Hidden Markov Models"

Executive summary of HMMs:
- HMM is a machine learning technique
- Also a discrete hill climb technique
- Train a model based on an observation sequence
- Score a given sequence to see how closely it matches the model
- Efficient algorithms, many useful applications


HMM Notation

Recall that an HMM is denoted λ = (A, B, π), and the observation sequence is O. Standard notation (as in the HMM introduction cited above):
- T = length of the observation sequence
- N = number of states in the model
- M = number of observation symbols
- A = state transition probability matrix
- B = observation probability matrix
- π = initial state distribution
- O = (O₀, O₁, …, O_{T−1}) = observation sequence


Hidden Markov Models

Among the many uses for HMMs:
- Speech analysis
- Music search engines
- Malware detection
- Intrusion detection systems (IDS)
- Many more, and more all the time


Limitations of HMMs

- Positional information is not considered
  - An HMM has no "memory"
  - Higher order models have some memory, but make no explicit use of positional information
- Insertions and deletions are not handled
- These limitations are serious problems in some applications
  - In bioinformatics string comparison, sequence alignment is critical
  - Also, insertions and deletions occur


Profile HMM

- The profile HMM (PHMM) is designed to overcome the limitations on the previous slide
- In some ways, a PHMM is simpler than an HMM; in other ways, it is more complex
- The basic idea of a PHMM: define multiple B matrices
  - Almost like having an HMM for each position in the sequence


PHMM

- In bioinformatics, we begin by aligning multiple related sequences
  - Multiple sequence alignment (MSA)
  - This is like the training phase for an HMM
- Generate the PHMM based on the given MSA
  - Easy, once the MSA is known; the hard part is generating the MSA
- Then we can score sequences using the PHMM
  - Use the forward algorithm, as with an HMM


Generic View of PHMM

- Circles are delete states, diamonds are insert states, rectangles are match states
  - Match states correspond to HMM states
- There are also begin and end states
- Arrows are possible transitions
  - Each transition has an associated probability
- Transition probabilities form the A matrix
- Emission probabilities form the B matrices
- In a PHMM, observations are emissions
  - Match and insert states have emissions



PHMM Notation

Notation:
- X = (x₁, x₂, …) is the observed (emitted) sequence
- Mᵢ, Iᵢ, Dᵢ denote the i-th match, insert, and delete states
- a_{P,Q} is the transition probability from state P to state Q
- e_P(k) is the probability that state P emits symbol k
- λ denotes the PHMM itself


PHMM

- Match state probabilities are easily determined from the MSA, that is
  - a_{Mᵢ,Mᵢ₊₁}, the transitions between match states
  - e_{Mᵢ}(k), the emission probability of symbol k at match state Mᵢ
- Note: there are other transition probabilities, for example a_{Mᵢ,Iᵢ} and a_{Mᵢ,Dᵢ₊₁}
- Emissions occur at all match and insert states
  - Remember, emission == observation


MSA

- First we show MSA construction
  - This is the difficult part
  - There are lots of ways to do it, and the "best" way depends on the specific problem
- Then we construct the PHMM from the MSA
  - The easy part; there is a standard algorithm for this
- How to score a sequence? The forward algorithm, similar to an HMM


MSA

- How to construct an MSA?
  - Construct pairwise alignments
  - Combine the pairwise alignments to obtain the MSA
- Gaps are allowed to be inserted
  - Gaps make for better matches, but tend to weaken scoring
  - So there is a tradeoff


Global vs Local Alignment

In these pairwise alignment examples (alignment figures not reproduced here):
- "-" is a gap
- "|" marks aligned symbols
- "*" marks omitted beginning and ending symbols


Global vs Local Alignment

- Global alignment is lossless
  - But gaps tend to proliferate, and gaps increase when we do the MSA
  - More gaps implies more sequences match, so the result is less useful for scoring
- We usually only consider local alignment
  - That is, we omit the ends for a better alignment
- For simplicity, we assume global alignment here


Pairwise Alignment

- We allow gaps when aligning
- How do we score an alignment?
  - Based on an n × n substitution matrix S, where n is the number of symbols
- What algorithm(s) can align sequences?
  - Usually, dynamic programming
  - Sometimes, an HMM is used
  - Others?
- Local alignment raises additional issues


Pairwise Alignment

Example (alignment figure not reproduced here):
- Note the tradeoff between gaps and misaligned elements
- It depends on S and the gap penalty


Substitution Matrix

- Masquerade detection: detect an imposter using someone else's account
- Consider 4 different operations:
  - E == send email
  - G == play games
  - C == C programming
  - J == Java programming
- How similar are these operations to each other?


Substitution Matrix

- Consider the 4 operations: E, G, C, J
- A possible substitution matrix (sketched below):
  - The diagonal entries represent matches, and get high positive scores
  - Which other pairs are most similar? J and C, so substituting C for J gets a high score
  - Game playing and programming are very different, so substituting G for C gets a negative score
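To make this concrete, here is a minimal sketch of such a substitution matrix in Python. The numeric scores are hypothetical, chosen only to reflect the qualitative relationships above (high diagonal, C and J similar, G dissimilar to the programming operations):

```python
# Hypothetical substitution matrix for operations E, G, C, J.
# Scores are illustrative only: high positive on the diagonal
# (exact matches), positive for similar operations (C and J are
# both programming), negative for dissimilar ones (G vs. C or J).
S = {
    ('E', 'E'): 9, ('E', 'G'): -2, ('E', 'C'): -3, ('E', 'J'): -3,
    ('G', 'G'): 9, ('G', 'C'): -4, ('G', 'J'): -4,
    ('C', 'C'): 9, ('C', 'J'): 5,
    ('J', 'J'): 9,
}

def score(a, b):
    """Symmetric lookup into the substitution matrix."""
    return S[(a, b)] if (a, b) in S else S[(b, a)]
```

The `score` helper is reused in the alignment sketch later in these notes.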


Substitution Matrix

- Depending on the problem, it might be easy or very difficult to obtain a useful S matrix
- Consider masquerade detection based on UNIX commands
  - It is sometimes difficult to say how "close" two commands are
- Suppose instead that we are aligning DNA sequences
  - Then there is a biological rationale for the closeness of symbols


Gap Penalty

- Generally, we must allow gaps to be inserted
  - But gaps make an alignment more generic, and hence less useful for scoring
  - Therefore, we penalize gaps
- How to penalize gaps? Two common choices (sketched below):
  - Linear gap penalty function: f(g) = dg (i.e., a constant penalty d per gap)
  - Affine gap penalty function: f(g) = a + e(g − 1), with a gap-opening penalty a, then a constant factor of e per additional gap
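Both penalty functions are simple to state in code. In this sketch the default values of d, a, and e are hypothetical; in practice they are tuned to the problem:

```python
def linear_gap_penalty(g, d=3):
    """Linear penalty: a constant cost d per gap position, f(g) = d*g."""
    return d * g

def affine_gap_penalty(g, a=5, e=1):
    """Affine penalty: gap-opening cost a, then e per additional gap,
    f(g) = a + e*(g - 1), for g >= 1."""
    return a + e * (g - 1)
```

The affine form penalizes opening a gap more than extending one, so a single long gap costs less than many scattered short gaps.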


Pairwise Alignment Algorithm

- We use dynamic programming, based on the S matrix and the gap penalty function
- Notation (the standard global alignment setup, as in Durbin et al.):
  - x = (x₁, …, x_n) and y = (y₁, …, y_m) are the sequences to be aligned
  - F(i, j) is the score of the best alignment of x₁…xᵢ against y₁…y_j
  - d is the linear gap penalty


Pairwise Alignment DP

Initialization (assuming a linear gap penalty d, as in the standard formulation):

F(0, 0) = 0,  F(i, 0) = −id,  F(0, j) = −jd

Recursion:

F(i, j) = max{ F(i−1, j−1) + S(xᵢ, y_j),  F(i−1, j) − d,  F(i, j−1) − d }
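A minimal sketch of this dynamic program in Python, assuming the `score` lookup from the substitution matrix sketch above and a linear gap penalty (the affine case needs three DP matrices):

```python
def global_alignment_score(x, y, score, d=3):
    """Global (Needleman-Wunsch style) alignment score of sequences
    x and y, with substitution lookup `score` and linear gap penalty d."""
    n, m = len(x), len(y)
    # F[i][j] = best score aligning x[:i] against y[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -i * d          # x[:i] aligned entirely against gaps
    for j in range(1, m + 1):
        F[0][j] = -j * d          # y[:j] aligned entirely against gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i-1][j-1] + score(x[i-1], y[j-1]),  # align x_i with y_j
                F[i-1][j] - d,                        # gap in y
                F[i][j-1] - d,                        # gap in x
            )
    return F[n][m]
```

For example, `global_alignment_score("EJCC", "EJJC", score)` returns the best global alignment score of those two hypothetical operation sequences.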


MSA from Pairwise Alignments

- Given pairwise alignments, how do we construct an MSA?
- The generic approach is "progressive alignment":
  - Select one pairwise alignment
  - Select another and combine it with the first
  - Continue adding more until all are combined
- This is relatively easy (good)
- But gaps may proliferate, and the process is unstable (bad)


MSA from Pairwise Alignments

- There are lots of ways to improve on generic progressive alignment
  - Here, we mention one such approach, which is not necessarily the "best" or most popular
- Feng-Doolittle progressive alignment:
  - Compute scores for all pairs of the n sequences
  - Select n − 1 alignments that (a) "connect" all sequences and (b) maximize the pairwise scores
  - Then generate a minimum spanning tree
  - For the MSA, add sequences in the order that they appear in the spanning tree


MSA Construction

- Create pairwise alignments
  - Generate a substitution matrix
  - Use dynamic programming for the pairwise alignments
- Use the pairwise alignments to make the MSA (see the sketch below)
  - Use the pairwise alignments to construct a spanning tree (e.g., via Prim's algorithm)
  - Add sequences to the MSA in spanning tree order (starting from the highest score, inserting gaps as needed)
- Note: the gap penalty is used here
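A minimal sketch of the spanning tree step, assuming a hypothetical `scores` dict mapping sequence-index pairs to pairwise alignment scores. Since we want to maximize scores, this is Prim's algorithm run for a maximum spanning tree:

```python
def prim_order(scores, start):
    """Order sequence pairs by a maximum spanning tree over pairwise
    alignment scores (Prim's algorithm). Returns the (parent, child)
    edges in the order sequences would be added to the MSA."""
    def s(u, v):
        # Symmetric score lookup; missing pairs score -infinity.
        return scores.get((u, v), scores.get((v, u), float('-inf')))
    nodes = {i for pair in scores for i in pair}
    in_tree, order = {start}, []
    while in_tree != nodes:
        # Pick the highest-scoring edge leaving the current tree.
        u, v = max(((u, v) for u in in_tree for v in nodes - in_tree),
                   key=lambda e: s(*e))
        in_tree.add(v)
        order.append((u, v))
    return order
```

Processing edges in this order yields a sequence of pairs like the (5,4), (5,8), … ordering in the example that follows.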


MSA Example

Suppose we have 10 sequences, with pairwise alignment scores as given in the following table (not reproduced here):


MSA Example: Spanning Tree

- The spanning tree is based on the scores (tree figure not reproduced here)
- So we process the pairs in the following order: (5,4), (5,8), (8,3), (3,2), (2,7), (2,1), (1,6), (6,10), (10,9)


MSA Snapshot

- An intermediate step and the final result (figures not reproduced here)
- Use "+" as a neutral symbol while combining alignments
- Then "-" for gaps in the final MSA
- Note the increase in gaps


PHMM from MSA

- For a PHMM, we must determine the match and insert states, and their probabilities, from the MSA
- "Conservative" columns are match states
  - Half or fewer of the symbols are gaps
- Other columns are insert states
  - The majority of the symbols are gaps
- Delete states are a separate issue


PHMM States from MSA

- Consider a simpler MSA (shown as a figure; see the column classifier sketched below)
- Columns 1, 2, and 6 are match states 1, 2, and 3, respectively
  - Since half or fewer of their symbols are gaps
- Columns 3, 4, and 5 are combined to form insert state 2
  - Since more than half of their symbols are gaps
- Insert states fall between the surrounding match states
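A minimal sketch of this column classification, assuming the MSA is given as a list of equal-length strings with '-' for gaps:

```python
def classify_columns(msa):
    """Label each MSA column: 'match' if half or fewer of its symbols
    are gaps, otherwise part of an insert region."""
    n_rows = len(msa)
    labels = []
    for col in zip(*msa):          # iterate over columns
        gaps = col.count('-')
        labels.append('match' if gaps <= n_rows / 2 else 'insert')
    return labels
```

Consecutive 'insert' columns are then merged into a single insert state between the neighboring match states.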


PHMM Probabilities from MSA

- Emission probabilities are based on the symbol distribution in the match and insert states
- State transition probabilities are based on the transitions in the MSA


PHMM Probabilities from MSA

Emission probabilities (in general, the count of a symbol in the state's column, divided by the total number of non-gap symbols in that column):

e_{Mᵢ}(k) = (occurrences of symbol k in the column for Mᵢ) / (total non-gap symbols in that column)

- But 0 probabilities are bad: the model "overfits" the data
- So, use the "add one" rule
  - Add one to each numerator, and add the total number of distinct symbols to each denominator
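A minimal sketch of add-one (Laplace) smoothing for one state's emission distribution; the column "EECE" used in the example is hypothetical:

```python
from collections import Counter

def emission_probs(column_symbols, alphabet):
    """Add-one smoothed emission probabilities for one state.
    `column_symbols` holds the non-gap symbols from the state's MSA
    column(s); each count gets +1, and the denominator grows by the
    alphabet size, so no probability is ever zero."""
    counts = Counter(column_symbols)
    total = len(column_symbols) + len(alphabet)
    return {k: (counts[k] + 1) / total for k in alphabet}

# Hypothetical match-state column with symbols E, E, C, E:
# E -> (3+1)/8 = 0.5, C -> (1+1)/8 = 0.25, G and J -> 1/8 each.
print(emission_probs("EECE", "EGCJ"))
```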


PHMM Probabilities from MSA

More emission probabilities (computed the same way):
- Again, 0 probabilities are bad: the model "overfits" the data
- Again, use the "add one" rule
  - Add one to each numerator, and add the total to each denominator


PHMM Probabilities from MSA

Transition probabilities (in general, the number of observed transitions from state P to state Q, divided by the total number of transitions out of P):

a_{P,Q} = (transitions from P to Q in the MSA) / (total transitions out of P)

- We look at some examples; note that "-" in the MSA corresponds to a delete state
- First, consider the begin state, and again use the add-one rule


PHMM Probabilities from MSA

- Transition probabilities: when there is no information in the MSA, set the probabilities to uniform
- For example, I1 does not appear in the MSA, so

a_{I1,M2} = a_{I1,I1} = a_{I1,D2} = 1/3


PHMM Probabilities from MSA

- Transition probabilities, another example: what about transitions from state D1?
- In the MSA, D1 can only go to M2, so without smoothing a_{D1,M2} = 1 and a_{D1,I1} = a_{D1,D2} = 0
- Again, use the add-one rule: add one to each numerator, and add three (the number of possible transitions out of D1) to the denominator


PHMM Emission Probabilities

The complete emission probabilities for the given MSA, computed using the add-one rule (table not reproduced here).


PHMM Transition Probabilities

The complete transition probabilities for the given MSA, computed using the add-one rule (table not reproduced here).


PHMM Summary

- Construct pairwise alignments
  - Usually using dynamic programming
- Use these to construct the MSA
  - There are lots of ways to do this
- Using the MSA, determine the probabilities
  - Emission probabilities
  - State transition probabilities
- In effect, we have trained a PHMM
- Now what?


PHMM Scoring

- We want to score sequences to see how closely they match the PHMM
- How did we score sequences with an HMM? The forward algorithm
- How do we score sequences with a PHMM? Also the forward algorithm
  - But here the algorithm is a little more complex, due to the more complex state transitions


Forward Algorithm

Notation:
- Indices i and j index the observation and the state (column in the MSA), respectively
- xᵢ is the i-th observation symbol
- q_{xᵢ} is the distribution of xᵢ in the "random model"
- The base case is F^M_0(0) = 0
- F^M_j(i) is the score of x₁, …, xᵢ up to state j (note that in a PHMM, i and j may not agree)
- Some states are undefined; undefined states are ignored in the calculation


Forward Algorithm

- Compute P(X|λ) recursively
- Note that F^M_j(i) depends on F^M_{j−1}(i−1), F^I_{j−1}(i−1), and F^D_{j−1}(i−1)
  - And on the corresponding state transition probabilities
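For reference, here is the match-state recursion in the log-odds form given by Durbin et al. (cited in the references); the insert and delete recursions follow the same pattern, with delete states consuming no observation symbol:

```latex
% Match-state forward recursion (Durbin et al.):
F^M_j(i) = \log\frac{e_{M_j}(x_i)}{q_{x_i}}
         + \log\Big( a_{M_{j-1},M_j}\, e^{F^M_{j-1}(i-1)}
                   + a_{I_{j-1},M_j}\, e^{F^I_{j-1}(i-1)}
                   + a_{D_{j-1},M_j}\, e^{F^D_{j-1}(i-1)} \Big)
```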


PHMM

- We will see examples of PHMMs later, in particular:
  - Malware detection based on opcodes
  - Masquerade detection based on UNIX commands


References

- Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, R. Durbin, et al., Cambridge University Press, 1998

- Masquerade detection using profile hidden Markov models, L. Huang and M. Stamp, to appear in Computers & Security

- Profile hidden Markov models for metamorphic virus detection, S. Attaluri, S. McGhee, and M. Stamp, Journal in Computer Virology, Vol. 5, No. 2, May 2009, pp. 151-169