
Introduction to Profile Hidden Markov Models

Mark Stamp


Hidden Markov Models

Here, we assume you know about HMMs; if not, see "A Revealing Introduction to Hidden Markov Models"

Executive summary of HMMs:
- HMM is a machine learning technique
- Also a discrete hill climb technique
- Train a model based on an observation sequence
- Score a given sequence to see how closely it matches the model
- Efficient algorithms, many useful applications


HMM Notation

Recall that an HMM is denoted λ = (A, B, π), and the observation sequence is O. Standard notation (as in the HMM introduction cited above):
- T = length of the observation sequence
- N = number of states in the model
- M = number of observation symbols
- A = state transition probability matrix
- B = observation probability matrix
- π = initial state distribution
- O = (O₀, O₁, …, O_{T−1}) = observation sequence


Hidden Markov Models

Among the many uses for HMMs:
- Speech analysis
- Music search engines
- Malware detection
- Intrusion detection systems (IDS)
- Many more, and more all the time


Limitations of HMMs

- Positional information is not considered
  - An HMM has no "memory"
  - Higher order models have some memory, but make no explicit use of positional information
- Insertions and deletions are not handled
- These limitations are serious problems in some applications
  - In bioinformatics string comparison, sequence alignment is critical
  - Also, insertions and deletions occur


Profile HMM

- The profile HMM (PHMM) is designed to overcome the limitations on the previous slide
- In some ways, a PHMM is simpler than an HMM; in other ways, it is more complex
- The basic idea of a PHMM: define multiple B matrices
  - Almost like having an HMM for each position in the sequence


PHMM

- In bioinformatics, we begin by aligning multiple related sequences
  - Multiple sequence alignment (MSA)
  - This is like the training phase for an HMM
- Generate the PHMM based on the given MSA
  - Easy, once the MSA is known; the hard part is generating the MSA
- Then we can score sequences using the PHMM
  - Use the forward algorithm, as with an HMM


Generic View of PHMM

- Circles are delete states, diamonds are insert states, rectangles are match states
  - Match states correspond to HMM states
- There are also begin and end states
- Arrows are possible transitions
  - Each transition has an associated probability
- Transition probabilities form the A matrix
- Emission probabilities form the B matrices
- In a PHMM, observations are emissions
  - Match and insert states have emissions



PHMM Notation

Notation:
- X = (x₁, x₂, …) is the observed (emitted) sequence
- Mᵢ, Iᵢ, Dᵢ denote the i-th match, insert, and delete states
- a_{P,Q} is the transition probability from state P to state Q
- e_P(k) is the probability that state P emits symbol k
- λ denotes the PHMM itself


PHMM

- Match state probabilities are easily determined from the MSA, that is
  - a_{Mᵢ,Mᵢ₊₁}, the transitions between match states
  - e_{Mᵢ}(k), the emission probability of symbol k at match state Mᵢ
- Note: there are other transition probabilities, for example a_{Mᵢ,Iᵢ} and a_{Mᵢ,Dᵢ₊₁}
- Emissions occur at all match and insert states
  - Remember, emission == observation


MSA

- First we show MSA construction
  - This is the difficult part
  - There are lots of ways to do it, and the "best" way depends on the specific problem
- Then we construct the PHMM from the MSA
  - The easy part; there is a standard algorithm for this
- How to score a sequence? The forward algorithm, similar to an HMM


MSA

- How to construct an MSA?
  - Construct pairwise alignments
  - Combine the pairwise alignments to obtain the MSA
- Gaps are allowed to be inserted
  - Gaps make for better matches, but tend to weaken scoring
  - So there is a tradeoff


Global vs Local Alignment

In these pairwise alignment examples (alignment figures not reproduced here):
- "-" is a gap
- "|" marks aligned symbols
- "*" marks omitted beginning and ending symbols


Global vs Local Alignment

- Global alignment is lossless
  - But gaps tend to proliferate, and gaps increase when we do the MSA
  - More gaps implies more sequences match, so the result is less useful for scoring
- We usually only consider local alignment
  - That is, we omit the ends for a better alignment
- For simplicity, we assume global alignment here


Pairwise Alignment

- We allow gaps when aligning
- How do we score an alignment?
  - Based on an n × n substitution matrix S, where n is the number of symbols
- What algorithm(s) can align sequences?
  - Usually, dynamic programming
  - Sometimes, an HMM is used
  - Others?
- Local alignment raises additional issues


Pairwise Alignment

Example (alignment figure not reproduced here):
- Note the tradeoff between gaps and misaligned elements
- It depends on S and the gap penalty


Substitution Matrix

- Masquerade detection: detect an imposter using someone else's account
- Consider 4 different operations:
  - E == send email
  - G == play games
  - C == C programming
  - J == Java programming
- How similar are these operations to each other?


Substitution Matrix

- Consider the 4 operations: E, G, C, J
- A possible substitution matrix (sketched below):
  - The diagonal entries represent matches, and get high positive scores
  - Which other pairs are most similar? J and C, so substituting C for J gets a high score
  - Game playing and programming are very different, so substituting G for C gets a negative score
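To make this concrete, here is a minimal sketch of such a substitution matrix in Python. The numeric scores are hypothetical, chosen only to reflect the qualitative relationships above (high diagonal, C and J similar, G dissimilar to the programming operations):

```python
# Hypothetical substitution matrix for operations E, G, C, J.
# Scores are illustrative only: high positive on the diagonal
# (exact matches), positive for similar operations (C and J are
# both programming), negative for dissimilar ones (G vs. C or J).
S = {
    ('E', 'E'): 9, ('E', 'G'): -2, ('E', 'C'): -3, ('E', 'J'): -3,
    ('G', 'G'): 9, ('G', 'C'): -4, ('G', 'J'): -4,
    ('C', 'C'): 9, ('C', 'J'): 5,
    ('J', 'J'): 9,
}

def score(a, b):
    """Symmetric lookup into the substitution matrix."""
    return S[(a, b)] if (a, b) in S else S[(b, a)]
```

The `score` helper is reused in the alignment sketch later in these notes.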


Substitution Matrix

- Depending on the problem, it might be easy or very difficult to obtain a useful S matrix
- Consider masquerade detection based on UNIX commands
  - It is sometimes difficult to say how "close" two commands are
- Suppose instead that we are aligning DNA sequences
  - Then there is a biological rationale for the closeness of symbols


Gap Penalty

- Generally, we must allow gaps to be inserted
  - But gaps make an alignment more generic, and hence less useful for scoring
  - Therefore, we penalize gaps
- How to penalize gaps? Two common choices (sketched below):
  - Linear gap penalty function: f(g) = dg (i.e., a constant penalty d per gap)
  - Affine gap penalty function: f(g) = a + e(g − 1), with a gap-opening penalty a, then a constant factor of e per additional gap
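Both penalty functions are simple to state in code. In this sketch the default values of d, a, and e are hypothetical; in practice they are tuned to the problem:

```python
def linear_gap_penalty(g, d=3):
    """Linear penalty: a constant cost d per gap position, f(g) = d*g."""
    return d * g

def affine_gap_penalty(g, a=5, e=1):
    """Affine penalty: gap-opening cost a, then e per additional gap,
    f(g) = a + e*(g - 1), for g >= 1."""
    return a + e * (g - 1)
```

The affine form penalizes opening a gap more than extending one, so a single long gap costs less than many scattered short gaps.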


Pairwise Alignment Algorithm

- We use dynamic programming, based on the S matrix and the gap penalty function
- Notation (the standard global alignment setup, as in Durbin et al.):
  - x = (x₁, …, x_n) and y = (y₁, …, y_m) are the sequences to be aligned
  - F(i, j) is the score of the best alignment of x₁…xᵢ against y₁…y_j
  - d is the linear gap penalty


Pairwise Alignment DP

Initialization (assuming a linear gap penalty d, as in the standard formulation):

F(0, 0) = 0,  F(i, 0) = −id,  F(0, j) = −jd

Recursion:

F(i, j) = max{ F(i−1, j−1) + S(xᵢ, y_j),  F(i−1, j) − d,  F(i, j−1) − d }
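A minimal sketch of this dynamic program in Python, assuming the `score` lookup from the substitution matrix sketch above and a linear gap penalty (the affine case needs three DP matrices):

```python
def global_alignment_score(x, y, score, d=3):
    """Global (Needleman-Wunsch style) alignment score of sequences
    x and y, with substitution lookup `score` and linear gap penalty d."""
    n, m = len(x), len(y)
    # F[i][j] = best score aligning x[:i] against y[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -i * d          # x[:i] aligned entirely against gaps
    for j in range(1, m + 1):
        F[0][j] = -j * d          # y[:j] aligned entirely against gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i-1][j-1] + score(x[i-1], y[j-1]),  # align x_i with y_j
                F[i-1][j] - d,                        # gap in y
                F[i][j-1] - d,                        # gap in x
            )
    return F[n][m]
```

For example, `global_alignment_score("EJCC", "EJJC", score)` returns the best global alignment score of those two hypothetical operation sequences.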


MSA from Pairwise Alignments

- Given pairwise alignments, how do we construct an MSA?
- The generic approach is "progressive alignment":
  - Select one pairwise alignment
  - Select another and combine it with the first
  - Continue adding more until all are combined
- This is relatively easy (good)
- But gaps may proliferate, and the process is unstable (bad)


MSA from Pairwise Alignments

- There are lots of ways to improve on generic progressive alignment
  - Here, we mention one such approach, which is not necessarily the "best" or most popular
- Feng-Doolittle progressive alignment:
  - Compute scores for all pairs of the n sequences
  - Select n − 1 alignments that (a) "connect" all sequences and (b) maximize the pairwise scores
  - Then generate a minimum spanning tree
  - For the MSA, add sequences in the order that they appear in the spanning tree


MSA Construction

- Create pairwise alignments
  - Generate a substitution matrix
  - Use dynamic programming for the pairwise alignments
- Use the pairwise alignments to make the MSA (see the sketch below)
  - Use the pairwise alignments to construct a spanning tree (e.g., via Prim's algorithm)
  - Add sequences to the MSA in spanning tree order (starting from the highest score, inserting gaps as needed)
- Note: the gap penalty is used here
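A minimal sketch of the spanning tree step, assuming a hypothetical `scores` dict mapping sequence-index pairs to pairwise alignment scores. Since we want to maximize scores, this is Prim's algorithm run for a maximum spanning tree:

```python
def prim_order(scores, start):
    """Order sequence pairs by a maximum spanning tree over pairwise
    alignment scores (Prim's algorithm). Returns the (parent, child)
    edges in the order sequences would be added to the MSA."""
    def s(u, v):
        # Symmetric score lookup; missing pairs score -infinity.
        return scores.get((u, v), scores.get((v, u), float('-inf')))
    nodes = {i for pair in scores for i in pair}
    in_tree, order = {start}, []
    while in_tree != nodes:
        # Pick the highest-scoring edge leaving the current tree.
        u, v = max(((u, v) for u in in_tree for v in nodes - in_tree),
                   key=lambda e: s(*e))
        in_tree.add(v)
        order.append((u, v))
    return order
```

Processing edges in this order yields a sequence of pairs like the (5,4), (5,8), … ordering in the example that follows.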


MSA Example

Suppose we have 10 sequences, with pairwise alignment scores as given in the following table (not reproduced here):


MSA Example: Spanning Tree

- The spanning tree is based on the scores (tree figure not reproduced here)
- So we process the pairs in the following order: (5,4), (5,8), (8,3), (3,2), (2,7), (2,1), (1,6), (6,10), (10,9)


MSA Snapshot

- An intermediate step and the final result (figures not reproduced here)
- Use "+" as a neutral symbol while combining alignments
- Then "-" for gaps in the final MSA
- Note the increase in gaps


PHMM from MSA

- For a PHMM, we must determine the match and insert states, and their probabilities, from the MSA
- "Conservative" columns are match states
  - Half or fewer of the symbols are gaps
- Other columns are insert states
  - The majority of the symbols are gaps
- Delete states are a separate issue


PHMM States from MSA

- Consider a simpler MSA (shown as a figure; see the column classifier sketched below)
- Columns 1, 2, and 6 are match states 1, 2, and 3, respectively
  - Since half or fewer of their symbols are gaps
- Columns 3, 4, and 5 are combined to form insert state 2
  - Since more than half of their symbols are gaps
- Insert states fall between the surrounding match states
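A minimal sketch of this column classification, assuming the MSA is given as a list of equal-length strings with '-' for gaps:

```python
def classify_columns(msa):
    """Label each MSA column: 'match' if half or fewer of its symbols
    are gaps, otherwise part of an insert region."""
    n_rows = len(msa)
    labels = []
    for col in zip(*msa):          # iterate over columns
        gaps = col.count('-')
        labels.append('match' if gaps <= n_rows / 2 else 'insert')
    return labels
```

Consecutive 'insert' columns are then merged into a single insert state between the neighboring match states.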


PHMM Probabilities from MSA

- Emission probabilities are based on the symbol distribution in the match and insert states
- State transition probabilities are based on the transitions in the MSA


PHMM Probabilities from MSA

Emission probabilities (in general, the count of a symbol in the state's column, divided by the total number of non-gap symbols in that column):

e_{Mᵢ}(k) = (occurrences of symbol k in the column for Mᵢ) / (total non-gap symbols in that column)

- But 0 probabilities are bad: the model "overfits" the data
- So, use the "add one" rule
  - Add one to each numerator, and add the total number of distinct symbols to each denominator
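A minimal sketch of add-one (Laplace) smoothing for one state's emission distribution; the column "EECE" used in the example is hypothetical:

```python
from collections import Counter

def emission_probs(column_symbols, alphabet):
    """Add-one smoothed emission probabilities for one state.
    `column_symbols` holds the non-gap symbols from the state's MSA
    column(s); each count gets +1, and the denominator grows by the
    alphabet size, so no probability is ever zero."""
    counts = Counter(column_symbols)
    total = len(column_symbols) + len(alphabet)
    return {k: (counts[k] + 1) / total for k in alphabet}

# Hypothetical match-state column with symbols E, E, C, E:
# E -> (3+1)/8 = 0.5, C -> (1+1)/8 = 0.25, G and J -> 1/8 each.
print(emission_probs("EECE", "EGCJ"))
```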


PHMM Probabilities from MSA

More emission probabilities (computed the same way):
- Again, 0 probabilities are bad: the model "overfits" the data
- Again, use the "add one" rule
  - Add one to each numerator, and add the total to each denominator


PHMM Probabilities from MSA

Transition probabilities (in general, the number of observed transitions from state P to state Q, divided by the total number of transitions out of P):

a_{P,Q} = (transitions from P to Q in the MSA) / (total transitions out of P)

- We look at some examples; note that "-" in the MSA corresponds to a delete state
- First, consider the begin state, and again use the add-one rule


PHMM Probabilities from MSA

- Transition probabilities: when there is no information in the MSA, set the probabilities to uniform
- For example, I1 does not appear in the MSA, so

a_{I1,M2} = a_{I1,I1} = a_{I1,D2} = 1/3


PHMM Probabilities from MSA

- Transition probabilities, another example: what about transitions from state D1?
- In the MSA, D1 can only go to M2, so without smoothing a_{D1,M2} = 1 and a_{D1,I1} = a_{D1,D2} = 0
- Again, use the add-one rule: add one to each numerator, and add three (the number of possible transitions out of D1) to the denominator


PHMM Emission Probabilities

The complete emission probabilities for the given MSA, computed using the add-one rule (table not reproduced here).


PHMM Transition Probabilities

The complete transition probabilities for the given MSA, computed using the add-one rule (table not reproduced here).


PHMM Summary

- Construct pairwise alignments
  - Usually using dynamic programming
- Use these to construct the MSA
  - There are lots of ways to do this
- Using the MSA, determine the probabilities
  - Emission probabilities
  - State transition probabilities
- In effect, we have trained a PHMM
- Now what?


PHMM Scoring

- We want to score sequences to see how closely they match the PHMM
- How did we score sequences with an HMM? The forward algorithm
- How do we score sequences with a PHMM? Also the forward algorithm
  - But here the algorithm is a little more complex, due to the more complex state transitions


Forward Algorithm

Notation:
- Indices i and j index the observation and the state (column in the MSA), respectively
- xᵢ is the i-th observation symbol
- q_{xᵢ} is the distribution of xᵢ in the "random model"
- The base case is F^M_0(0) = 0
- F^M_j(i) is the score of x₁, …, xᵢ up to state j (note that in a PHMM, i and j may not agree)
- Some states are undefined; undefined states are ignored in the calculation


Forward Algorithm

- Compute P(X|λ) recursively
- Note that F^M_j(i) depends on F^M_{j−1}(i−1), F^I_{j−1}(i−1), and F^D_{j−1}(i−1)
  - And on the corresponding state transition probabilities
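For reference, here is the match-state recursion in the log-odds form given by Durbin et al. (cited in the references); the insert and delete recursions follow the same pattern, with delete states consuming no observation symbol:

```latex
% Match-state forward recursion (Durbin et al.):
F^M_j(i) = \log\frac{e_{M_j}(x_i)}{q_{x_i}}
         + \log\Big( a_{M_{j-1},M_j}\, e^{F^M_{j-1}(i-1)}
                   + a_{I_{j-1},M_j}\, e^{F^I_{j-1}(i-1)}
                   + a_{D_{j-1},M_j}\, e^{F^D_{j-1}(i-1)} \Big)
```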


PHMM

- We will see examples of PHMMs later, in particular:
  - Malware detection based on opcodes
  - Masquerade detection based on UNIX commands


References

- Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, R. Durbin, et al., Cambridge University Press, 1998

- Masquerade detection using profile hidden Markov models, L. Huang and M. Stamp, to appear in Computers & Security

- Profile hidden Markov models for metamorphic virus detection, S. Attaluri, S. McGhee, and M. Stamp, Journal in Computer Virology, Vol. 5, No. 2, May 2009, pp. 151-169