Profile Hidden Markov Models


Transcript of Profile Hidden Markov Models

Page 1: Profile Hidden Markov Models


Profile Hidden Markov Models

PHMM

Mark Stamp

Page 2: Profile Hidden Markov Models

Hidden Markov Models

Here, we assume you know about HMMs
o If not, see “A revealing introduction to hidden Markov models”

Executive summary of HMMs
o HMM is a machine learning technique
o Also, a discrete hill climb technique
o Train a model based on an observation sequence
o Score a given sequence to see how closely it matches the model (see the sketch below)
o Efficient algorithms, many useful applications
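Since scoring with the forward algorithm comes up repeatedly below, here is a minimal sketch of it for a standard HMM, working directly from λ = (A, B, π). The toy model at the bottom is made-up illustrative data, not anything from the slides.

```python
# Forward algorithm for a standard HMM: score an observation sequence
# O (a list of symbol indices) against the model lambda = (A, B, pi).

def hmm_score(O, A, B, pi):
    """Return P(O | lambda) via the forward algorithm, O(N^2 * T) time."""
    N = len(pi)
    # alpha[i] = P(O_0..O_t, state i at time t)
    alpha = [pi[i] * B[i][O[0]] for i in range(N)]
    for t in range(1, len(O)):
        alpha = [sum(alpha[j] * A[j][i] for j in range(N)) * B[i][O[t]]
                 for i in range(N)]
    return sum(alpha)

# Toy example: 2 hidden states, 3 observation symbols (numbers made up).
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]
pi = [0.6, 0.4]
print(hmm_score([0, 1, 2], A, B, pi))
```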

Page 3: Profile Hidden Markov Models

HMM Notation

Recall, the HMM model is denoted λ = (A, B, π)
Observation sequence is O
Notation (as in the tutorial cited earlier):
o T = length of the observation sequence
o N = number of states in the model
o M = number of observation symbols
o A = N x N state transition probability matrix
o B = N x M observation probability matrix
o π = initial state distribution
o O = (O_0, O_1, …, O_{T-1}) = observation sequence

Page 4: Profile Hidden Markov Models

Hidden Markov Models

Among the many uses for HMMs…
o Speech analysis
o Music search engine
o Malware detection
o Intrusion detection systems (IDS)
o Many more, and more all the time

Page 5: Profile Hidden Markov Models

Limitations of HMMs

Positional information is not considered
o HMM has no “memory”
o Higher-order models have some memory
o But no explicit use of positional information

Does not handle insertions or deletions
These limitations are serious problems in some applications
o In bioinformatics string comparison, sequence alignment is critical
o Also, insertions and deletions occur

Page 6: Profile Hidden Markov Models

Profile HMM

Profile HMM (PHMM) is designed to overcome the limitations on the previous slide
o In some ways, PHMM is easier than HMM
o In some ways, PHMM is more complex

The basic idea of PHMM
o Define multiple B matrices
o Almost like having an HMM for each position in the sequence

Page 7: Profile Hidden Markov Models

PHMM

In bioinformatics, begin by aligning multiple related sequences
o Multiple sequence alignment (MSA)
o This is like the training phase for an HMM

Generate the PHMM based on the given MSA
o Easy, once the MSA is known
o The hard part is generating the MSA

Then we can score sequences using the PHMM
o Use the forward algorithm, as with an HMM

Page 8: Profile Hidden Markov Models

Training: PHMM vs HMM

Training a PHMM
o Determining the MSA: nontrivial
o Determining the PHMM matrices: trivial

Training an HMM
o Appending the training sequences: trivial
o Determining the HMM matrices: nontrivial

These are opposites…
o In some sense

Page 9: Profile Hidden Markov Models

Generic View of PHMM

Have delete, insert, and match states
o Match states correspond to HMM states

Arrows are possible transitions
o Each transition has a probability

Transition probabilities form the A matrix
Emission probabilities form the B matrices
o In PHMM, observations are emissions
o Match and insert states have emissions

Page 10: Profile Hidden Markov Models

Generic View of PHMM

Circles are delete states, diamonds are insert states, squares are match states
Also, begin and end states

Page 11: Profile Hidden Markov Models

PHMM Notation

Notation:
o X = (x_1, x_2, …, x_n) = emitted (observation) sequence
o M_1, M_2, …, M_N = match states
o I_0, I_1, …, I_N = insert states
o D_1, D_2, …, D_N = delete states
o a_{S,T} = transition probability from state S to state T
o e_S(k) = probability of emitting symbol k at match or insert state S
o λ = the PHMM

Page 12: Profile Hidden Markov Models

PHMM

Match state probabilities are easily determined from the MSA
o a_{M_i,M_{i+1}} = transition probability between match states
o e_{M_i}(k) = emission probability of symbol k at match state M_i

Many other transition probabilities
o For example, a_{M_i,I_i} and a_{M_i,D_{i+1}}

Emissions at all match and insert states
o Remember, emission == observation

Page 13: Profile Hidden Markov Models

Multiple Sequence Alignment

First we show MSA construction
o This is the difficult part
o Lots of ways to do this
o The “best” way depends on the specific problem

Then construct the PHMM from the MSA
o This is the easy part
o There is a standard algorithm for this

How to score a sequence?
o Forward algorithm, similar to the HMM case

Page 14: Profile Hidden Markov Models

MSA

How to construct an MSA?
o Construct pairwise alignments
o Combine the pairwise alignments into an MSA

Allow gaps to be inserted
o To make better matches

Gaps tend to weaken PHMM scoring
o So there is a tradeoff between gaps and scoring

Page 15: Profile Hidden Markov Models

Global vs Local Alignment

In these pairwise alignment examples
o “-” is a gap
o “|” means the elements are aligned
o “*” is used for omitted beginning/ending symbols

Page 16: Profile Hidden Markov Models

Global vs Local Alignment

Global alignment is lossless
o But gaps tend to proliferate
o And gaps increase when we do the MSA
o More gaps, more random sequences match…
o …and the result is less useful for scoring

We usually only consider local alignment
o That is, omit the ends for a better alignment

For simplicity, we assume global alignment in the examples presented here

Page 17: Profile Hidden Markov Models

Pairwise Alignment

Allow gaps when aligning
How to score an alignment?
o Based on an n x n substitution matrix S
o Where n is the number of symbols

What algorithm(s) to align sequences?
o Usually, dynamic programming
o Sometimes, an HMM is used
o Other?

Local alignment creates more issues

Page 18: Profile Hidden Markov Models

Pairwise Alignment Example

Tradeoff: gaps vs misaligned elements
o Depends on the matrix S and the gap penalty

Page 19: Profile Hidden Markov Models

Substitution Matrix

Masquerade detection
o Detect an imposter using an account

Consider 4 different operations
o E == send email
o G == play games
o C == C programming
o J == Java programming

How similar are these to each other?

Page 20: Profile Hidden Markov Models

Substitution Matrix

Consider 4 different operations:
o E, G, C, J

Possible substitution matrix (a sketch follows below)
Diagonal matches
o High positive scores

Which others are most similar?
o J and C, so substituting C for J gets a high score

Game playing vs programming, very different
o So substituting G for C gets a negative score
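Here is one way such an S matrix might look as a Python dict. The numeric scores are illustrative assumptions chosen to match the qualitative description above, not values from the slide.

```python
# Illustrative substitution matrix for the four operations E, G, C, J.
S = {
    ('E', 'E'): 9, ('G', 'G'): 9, ('C', 'C'): 9, ('J', 'J'): 9,  # diagonal: high positive
    ('C', 'J'): 5, ('J', 'C'): 5,    # C and J are both programming: high score
    ('E', 'G'): -2, ('G', 'E'): -2,
    ('E', 'C'): -3, ('C', 'E'): -3,
    ('E', 'J'): -3, ('J', 'E'): -3,
    ('G', 'C'): -4, ('C', 'G'): -4,  # games vs programming: negative score
    ('G', 'J'): -4, ('J', 'G'): -4,
}

def score(x, y):
    """Look up the substitution score for aligning symbol x with symbol y."""
    return S[(x, y)]
```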

Page 21: Profile Hidden Markov Models

Substitution Matrix

Depending on the problem, it might be easy or very difficult to find a useful S matrix

Consider masquerade detection based on UNIX commands
o Sometimes difficult to say how “close” 2 commands are

Suppose we are aligning DNA sequences
o There is a biological rationale for the closeness of symbols

Page 22: Profile Hidden Markov Models

Gap Penalty

Generally, we must allow gaps to be inserted
But gaps make an alignment more generic
o Less useful for scoring, so we penalize gaps

How to penalize gaps?
Linear gap penalty function:
o g(x) = ax (a constant penalty for every gap)
Affine gap penalty function:
o g(x) = a + b(x - 1)
o Gap-opening penalty a, and a constant penalty of b for each extension of an existing gap (see the sketch below)
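A minimal sketch of the two penalty functions; the parameter values are assumptions for illustration.

```python
def linear_gap(x, a=3):
    """Linear gap penalty: each of the x gap symbols costs a."""
    return a * x

def affine_gap(x, a=4, b=1):
    """Affine gap penalty: pay a to open the gap, then b per extension."""
    return a + b * (x - 1)

# With these (assumed) parameters, a gap of length 3 costs
# linear_gap(3) == 9 but affine_gap(3) == 6, so the affine version
# favors one long gap over several short ones.
```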

Page 23: Profile Hidden Markov Models

Pairwise Alignment Algorithm

We use dynamic programming
o Based on the S matrix and a gap penalty function

Notation:
o Sequences x = (x_1, x_2, …, x_n) and y = (y_1, y_2, …, y_m)
o F(i, j) = score of the best alignment of x_1, …, x_i with y_1, …, y_j
o S(x_i, y_j) = substitution score for aligning x_i with y_j
o g = gap penalty function

Page 24: Profile Hidden Markov Models

Pairwise Alignment DP

With a linear gap penalty g(x) = ax, the standard dynamic program is as follows

Initialization:
o F(0, 0) = 0
o F(i, 0) = -g(i), for i = 1, 2, …, n
o F(0, j) = -g(j), for j = 1, 2, …, m

Recursion:
o F(i, j) = max{ F(i-1, j-1) + S(x_i, y_j), F(i-1, j) - a, F(i, j-1) - a }

where S(x_i, y_j) is the substitution score for x_i and y_j, and the -a terms charge for a gap in one sequence or the other (a sketch follows below)
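A minimal sketch of this DP (global alignment with a linear gap penalty). Here `sub` is a substitution-score function such as the `score` helper sketched earlier, and the default gap penalty `a` is an assumption.

```python
def align_score(x, y, sub, a=2):
    """Fill the F matrix and return the optimal global alignment score."""
    n, m = len(x), len(y)
    # F[i][j] = best score aligning x[:i] with y[:j]
    F = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -a * i          # x[:i] aligned against all gaps
    for j in range(1, m + 1):
        F[0][j] = -a * j          # y[:j] aligned against all gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + sub(x[i - 1], y[j - 1]),  # align x_i, y_j
                F[i - 1][j] - a,                            # gap in y
                F[i][j - 1] - a,                            # gap in x
            )
    return F[n][m]

# E.g., with the E/G/C/J matrix sketched earlier:
# align_score("EEGC", "EEC", score)
```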

Page 25: Profile Hidden Markov Models

MSA from Pairwise Alignments

Given pairwise alignments…
How to construct the MSA?
Generally, use “progressive alignment”
o Select one pairwise alignment
o Select another and combine it with the first
o Continue to add more until all are combined

Relatively easy (good)
Gaps proliferate, and it’s unstable (bad)

Page 26: Profile Hidden Markov Models

MSA from Pairwise Alignments

Lots of ways to improve on generic progressive alignment
o Here, we mention one such approach
o Not necessarily the “best” or most popular

Feng-Doolittle progressive alignment
o Compute scores for all pairs of the n sequences
o Select n-1 alignments that a) “connect” all sequences and b) maximize the pairwise scores
o Then generate a minimum spanning tree
o For the MSA, add sequences in the order that they appear in the spanning tree

Page 27: Profile Hidden Markov Models

MSA Construction

Create pairwise alignments
o Generate a substitution matrix
o Dynamic programming for the pairwise alignments

Use the pairwise alignments to make the MSA
o Use the pairwise alignments to construct a spanning tree (e.g., Prim’s algorithm); see the sketch below
o Add sequences to the MSA in spanning tree order (from highest score, inserting gaps as needed)
o Note: the gap penalty is used
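A minimal sketch of the spanning-tree step, assuming the pairwise alignment scores are stored in a dict keyed by sequence pairs. Since we want to join high-scoring pairs first, Prim’s algorithm is run so as to maximize the score at each step.

```python
def msa_order(n_seqs, scores):
    """Return the order in which to add sequences to the MSA.

    scores[(i, j)] holds the pairwise alignment score of sequences
    i and j (0-based). The tree is grown greedily from the single
    highest-scoring pair, always attaching the out-of-tree sequence
    with the best score to any in-tree sequence.
    """
    def score(i, j):
        if (i, j) in scores:
            return scores[(i, j)]
        return scores.get((j, i), float("-inf"))

    first = max(scores, key=scores.get)     # highest-scoring pair
    in_tree, order = {first[0]}, [first[0]]
    while len(in_tree) < n_seqs:
        u, v = max(((u, v) for u in in_tree for v in range(n_seqs)
                    if v not in in_tree), key=lambda p: score(*p))
        in_tree.add(v)
        order.append(v)
    return order
```

This greedy, highest-score-first processing is the kind of order shown in the ten-sequence example on the next two slides.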

Page 28: Profile Hidden Markov Models

MSA Example

Suppose we have 10 sequences, with the following pairwise alignment scores:

Page 29: Profile Hidden Markov Models

MSA Example: Spanning Tree

Spanning tree based on the scores

So, process the pairs in the following order: (5,4), (5,8), (8,3), (3,2), (2,7), (2,1), (1,6), (6,10), (10,9)

Page 30: Profile Hidden Markov Models

MSA Snapshot

Intermediate step and final
o Use “+” for a neutral symbol
o Then “-” for gaps in the MSA

Note the increase in gaps

Page 31: Profile Hidden Markov Models

PHMM from MSA

In a PHMM, the match and insert states & probabilities are determined from the MSA

“Conservative” columns become match states
o Half or fewer of the symbols are gaps

Other columns are insert states
o The majority of symbols are gaps

Delete states are a separate issue

Page 32: Profile Hidden Markov Models

PHMM States from MSA

Consider a simpler MSA…

Columns 1, 2, and 6 are match states 1, 2, and 3, respectively
o Since each has less than half gaps

Columns 3, 4, and 5 are combined to form insert state 2
o Since each has more than half gaps
o This insert state falls between match states

A sketch of the column rule follows below.
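A minimal sketch of the column rule, assuming the MSA is stored as a list of equal-length strings with “-” marking gaps.

```python
def classify_columns(msa):
    """Classify each MSA column as a match or insert state.

    A column is 'conservative' (a match state) if half or fewer
    of its symbols are gaps; otherwise it is an insert state.
    """
    n_rows = len(msa)
    kinds = []
    for col in zip(*msa):
        gaps = col.count('-')
        kinds.append('match' if gaps <= n_rows / 2 else 'insert')
    return kinds

# Runs of consecutive 'insert' columns are then merged into a
# single insert state, as in the example above.
```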

Page 33: Profile Hidden Markov Models

Probabilities from MSA

Emission probabilities
o Based on the symbol distribution in the match and insert states

State transition probabilities
o Based on the transitions in the MSA

Page 34: Profile Hidden Markov Models

Probabilities from MSA

Emission probabilities:

But 0 probabilities are bad
o The model “overfits” the data
o So, use the “add-one” rule
o Add one to each numerator, and add the total to the denominators (see the sketch below)
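A minimal sketch of the add-one rule for one state’s emission probabilities. The column data at the bottom is a placeholder, not the MSA from the slides.

```python
from collections import Counter

def emission_probs(column_symbols, alphabet):
    """Add-one smoothed emission probabilities for one state.

    column_symbols: the non-gap symbols observed in this column of
    the MSA. Add one to each count, and add |alphabet| to the
    denominator, so no symbol gets probability 0.
    """
    counts = Counter(column_symbols)
    total = len(column_symbols) + len(alphabet)
    return {s: (counts[s] + 1) / total for s in alphabet}

# Placeholder example: 4 symbols observed in a column.
print(emission_probs(list("AACG"), alphabet="ACGT"))
# -> A: 0.375, C: 0.25, G: 0.25, T: 0.125 (i.e., 3/8, 2/8, 2/8, 1/8)
```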

Page 35: Profile Hidden Markov Models

Probabilities from MSA

More emission probabilities:

But 0 probabilities are still bad
o The model “overfits” the data
o Again, use the “add-one” rule
o Add one to each numerator, and add the total to the denominators

Page 36: Profile Hidden Markov Models

Probabilities from MSA

Transition probabilities:
We look at some examples
o Note that “-” is the delete state

First, consider the begin state:

Again, use the add-one rule

Page 37: Profile Hidden Markov Models

Probabilities from MSA

Transition probabilities
When there is no information in the MSA, set the probabilities to uniform

For example, I_1 does not appear in the MSA, so
o a_{I_1,M_2} = a_{I_1,I_1} = a_{I_1,D_2} = 1/3

Page 38: Profile Hidden Markov Models

Probabilities from MSA

Transition probabilities, another example

What about transitions from state D_1?
In this MSA, D_1 can only go to M_2, so without smoothing a_{D_1,M_2} = 1 and a_{D_1,I_1} = a_{D_1,D_2} = 0

Again, use the add-one rule:
o Add one to each numerator, and add the number of possible transitions (here, 3) to the denominator, so none of the probabilities is 0 (see the sketch below)
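The same add-one rule for transitions, as a minimal sketch; the counts in the example are placeholders, not counts from the slides’ MSA.

```python
def transition_probs(counts):
    """Add-one smoothed transition probabilities out of one PHMM state.

    counts: dict mapping each possible successor state (match, insert,
    delete) to its observed count in the MSA. Add one per successor,
    and add the number of successors to the denominator.
    """
    total = sum(counts.values()) + len(counts)
    return {state: (c + 1) / total for state, c in counts.items()}

# Placeholder example: D1 observed going to M2 once and nowhere else.
print(transition_probs({'M2': 1, 'I1': 0, 'D2': 0}))
# -> M2: 0.5, I1: 0.25, D2: 0.25
```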

Page 39: Profile Hidden Markov Models

PHMM Emission Probabilities

Emission probabilities for the given MSA
o Using the add-one rule

Page 40: Profile Hidden Markov Models

PHMM Transition Probabilities

Transition probabilities for the given MSA
o Using the add-one rule

Page 41: Profile Hidden Markov Models

PHMM Summary

Construct pairwise alignments
o Usually, use dynamic programming

Use these to construct the MSA
o Lots of ways to do this

Using the MSA, determine the probabilities
o Emission probabilities
o State transition probabilities

Then we have a trained PHMM
o Now what???

Page 42: Profile Hidden Markov Models

PHMM Scoring

We want to score sequences to see how closely they match the PHMM

How did we score using an HMM?
o Forward algorithm

How to score sequences with a PHMM?
o Forward algorithm (surprised?)

But the algorithm is a little more complex
o Due to the more complex state transitions

Page 43: Profile Hidden Markov Models

Forward Algorithm

Notation
o Indices i and j are columns in the MSA
o x_i is the ith observation symbol
o q_{x_i} is the distribution of x_i in the “random model”
o The base case is F^M_0(0) = 0
o F^M_j(i) is the score of x_1, …, x_i up to state j (note that in a PHMM, i and j may not agree)
o Some states are undefined
o Undefined states are ignored in the calculation

Page 44: Profile Hidden Markov Models

Forward Algorithm

Compute P(X|λ) recursively; in log-odds form, the match-state recursion is

F^M_j(i) = log(e_{M_j}(x_i) / q_{x_i})
           + log[ a_{M_{j-1},M_j} exp(F^M_{j-1}(i-1))
                + a_{I_{j-1},M_j} exp(F^I_{j-1}(i-1))
                + a_{D_{j-1},M_j} exp(F^D_{j-1}(i-1)) ]

with analogous recursions for the insert and delete scores F^I_j(i) and F^D_j(i)

Note that F^M_j(i) depends on F^M_{j-1}(i-1), F^I_{j-1}(i-1), and F^D_{j-1}(i-1)
o And the corresponding state transition probabilities (a sketch follows below)
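The recursion above is in log-odds form; for readability, here is a minimal sketch in plain probability space, computing P(X|λ) directly. The state names (‘M0’ for the begin state, ‘End’ for the end state) and the dict layout of the model are storage assumptions, not from the slides.

```python
def phmm_forward(x, N, a, e):
    """Score sequence x against a PHMM with N match states.

    a[(s, t)] : transition probability from state s to state t
    e[s][sym] : emission probability of sym at match/insert state s
    States are named 'M0' (begin), 'M1'..'MN', 'I0'..'IN',
    'D1'..'DN', and 'End'. Missing entries are treated as 0,
    so undefined states are simply ignored.
    """
    n = len(x)
    f = {('M0', 0): 1.0}  # f[(state, i)] = P(emit x[:i], end in state)

    def F(s, i):
        return f.get((s, i), 0.0)

    def A(s, t):
        return a.get((s, t), 0.0)

    for j in range(N + 1):
        for i in range(n + 1):
            if (j, i) == (0, 0):
                continue
            if j >= 1 and i >= 1:  # match: consumes a symbol, advances j
                p = sum(A(s + str(j - 1), 'M' + str(j)) *
                        F(s + str(j - 1), i - 1) for s in 'MID')
                f[('M' + str(j), i)] = \
                    e.get('M' + str(j), {}).get(x[i - 1], 0.0) * p
            if i >= 1:             # insert: consumes a symbol, same j
                p = sum(A(s + str(j), 'I' + str(j)) *
                        F(s + str(j), i - 1) for s in 'MID')
                f[('I' + str(j), i)] = \
                    e.get('I' + str(j), {}).get(x[i - 1], 0.0) * p
            if j >= 1:             # delete: consumes no symbol, advances j
                f[('D' + str(j), i)] = sum(
                    A(s + str(j - 1), 'D' + str(j)) *
                    F(s + str(j - 1), i) for s in 'MID')

    # P(X | lambda): sum over transitions into the end state
    return sum(A(s + str(N), 'End') * F(s + str(N), n) for s in 'MID')
```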

Page 45: Profile Hidden Markov Models

PHMM

We will see examples of PHMMs later

In particular,
o Malware detection based on opcodes
o Masquerade detection based on UNIX commands