Hidden Markov Model and Conditional Random Fields






School of Computer Science

Hidden Markov Model and Conditional Random Fields

Probabilistic Graphical Models (10-708)

Lecture 12, Oct 29, 2007

Eric Xing

[Figure: two example graphical models used throughout the course: a cellular signaling network (Receptor A/B, Kinase C/D/E, TF F, Gene G/H) and its abstraction as a graph over variables X1, ..., X8]

Reading: J-Chap. 12, and additional papers

Eric Xing 2

Feedback

Today: office hour, recitation

Project milestone: this Wednesday


Eric Xing 3

Hidden Markov Model revisited

[Figure: the HMM chain; hidden states y1, y2, y3, ..., yT emitting observations x1, x2, x3, ..., xT]

Transition probabilities between any two states:
$$p(y_t^j = 1 \mid y_{t-1}^i = 1) = a_{i,j},$$
or, in general,
$$p(y_t \mid y_{t-1}^i = 1) \sim \text{Multinomial}(a_{i,1}, a_{i,2}, \ldots, a_{i,M}), \quad \forall i \in I.$$

Start probabilities:
$$p(y_1) \sim \text{Multinomial}(\pi_1, \pi_2, \ldots, \pi_M).$$

Emission probabilities associated with each state:
$$p(x_t \mid y_t^i = 1) \sim \text{Multinomial}(b_{i,1}, b_{i,2}, \ldots, b_{i,K}), \quad \forall i \in I,$$
or, in general,
$$p(x_t \mid y_t^i = 1) \sim f(\cdot \mid \theta_i), \quad \forall i \in I.$$
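As a concrete illustration of this parameterization, here is a minimal sketch with made-up numbers for a toy 2-state, 3-symbol HMM: a transition matrix A, a start distribution pi, and a discrete emission matrix B, plus sampling from the generative process they define (the numbers and function name are illustrative assumptions, not from the slides).

```python
import numpy as np

# Made-up parameters for a toy 2-state, 3-symbol HMM (each row sums to 1).
A = np.array([[0.9, 0.1],            # A[i, j] = p(y_t = j | y_{t-1} = i)
              [0.2, 0.8]])
pi = np.array([0.6, 0.4])            # pi[i]   = p(y_1 = i)
B = np.array([[0.5, 0.4, 0.1],       # B[i, k] = p(x_t = k | y_t = i)
              [0.1, 0.3, 0.6]])

def sample_hmm(A, B, pi, T, seed=0):
    """Sample a state path y and an observation sequence x of length T."""
    rng = np.random.default_rng(seed)
    y = np.empty(T, dtype=int)
    x = np.empty(T, dtype=int)
    y[0] = rng.choice(len(pi), p=pi)
    x[0] = rng.choice(B.shape[1], p=B[y[0]])
    for t in range(1, T):
        y[t] = rng.choice(A.shape[1], p=A[y[t - 1]])   # transition
        x[t] = rng.choice(B.shape[1], p=B[y[t]])       # emission
    return y, x

y, x = sample_hmm(A, B, pi, T=10)
print(y, x)
```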

Eric Xing 4

Applications of HMMs

Some early applications of HMMs:
finance (but we never saw them)
speech recognition
modelling ion channels

In the mid-late 1980s HMMs entered genetics and molecular biology, and they are now firmly entrenched.

Some current applications of HMMs to biology:
mapping chromosomes
aligning biological sequences
predicting sequence structure
inferring evolutionary relationships
finding genes in DNA sequences


Eric Xing 5

Typical structure of a gene

Eric Xing 6

[Figure: the GENSCAN (Burge & Karlin) HMM for gene finding: exon states (E0, E1, E2, Es, Et, Ei), intron states (I0, I1, I2), 5'UTR, 3'UTR, promoter, poly-A, and intergenic region, mirrored on the forward (+) and reverse (-) strands, illustrated over an example genomic DNA sequence with its segments labeled 1-4]

Transition probabilities: $p(y_t^j = 1 \mid y_{t-1}^i = 1) = a_{i,j}$

Emission model for each state: $p(\cdot \mid y^i = 1) = \theta_i$


Eric Xing 7

The Forward Algorithm

We want to calculate P(x), the likelihood of x, given the HMM.

Sum over all possible ways of generating x:
$$P(\mathbf{x}) = \sum_{\mathbf{y}} P(\mathbf{x}, \mathbf{y}) = \sum_{y_1} \sum_{y_2} \cdots \sum_{y_T} \pi_{y_1} \prod_{t=2}^{T} a_{y_{t-1}, y_t} \prod_{t=1}^{T} p(x_t \mid y_t)$$

To avoid summing over an exponential number of paths y, define
$$\alpha_t^k \overset{\text{def}}{=} P(x_1, \ldots, x_t, y_t^k = 1)$$
(the forward probability).

The recursion:
$$\alpha_t^k = p(x_t \mid y_t^k = 1) \sum_i \alpha_{t-1}^i\, a_{i,k}$$
$$P(\mathbf{x}) = \sum_k \alpha_T^k$$
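A minimal NumPy sketch of this recursion, assuming discrete emissions given as a matrix B, a transition matrix A, and a start distribution pi (as in the toy parameterization sketched earlier). No rescaling is done here, so it is only meant for short sequences.

```python
import numpy as np

def forward(x, A, B, pi):
    """alpha[t, k] = P(x_1..x_t, y_t = k); returns (alpha, P(x))."""
    T, K = len(x), len(pi)
    alpha = np.zeros((T, K))
    alpha[0] = pi * B[:, x[0]]                      # base case
    for t in range(1, T):
        # p(x_t | y_t = k) * sum_i alpha[t-1, i] * a[i, k]
        alpha[t] = B[:, x[t]] * (alpha[t - 1] @ A)
    return alpha, alpha[-1].sum()                   # P(x) = sum_k alpha[T-1, k]
```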

Eric Xing 8

The Backward Algorithm

We want to compute P(y_t^k = 1 | x), the posterior probability distribution of the t-th position, given x.

We start by computing
$$P(y_t^k = 1, \mathbf{x}) = P(x_1, \ldots, x_t, y_t^k = 1, x_{t+1}, \ldots, x_T)$$
$$= P(x_1, \ldots, x_t, y_t^k = 1)\, P(x_{t+1}, \ldots, x_T \mid x_1, \ldots, x_t, y_t^k = 1)$$
$$= P(x_1, \ldots, x_t, y_t^k = 1)\, P(x_{t+1}, \ldots, x_T \mid y_t^k = 1)$$

Forward: $\alpha_t^k$. Backward: $\beta_t^k \overset{\text{def}}{=} P(x_{t+1}, \ldots, x_T \mid y_t^k = 1)$.

The recursion:
$$\beta_t^k = \sum_i a_{k,i}\, p(x_{t+1} \mid y_{t+1}^i = 1)\, \beta_{t+1}^i$$

[Figure: the backward recursion on the HMM chain, combining the evidence from positions t+1, ..., T into beta at position t]
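A matching sketch of the backward recursion under the same assumptions (discrete emission matrix B, transition matrix A, no rescaling):

```python
import numpy as np

def backward(x, A, B):
    """beta[t, k] = P(x_{t+1}..x_T | y_t = k)."""
    T, K = len(x), A.shape[0]
    beta = np.zeros((T, K))
    beta[-1] = 1.0                                   # base case: empty future
    for t in range(T - 2, -1, -1):
        # beta[t, k] = sum_i a[k, i] * p(x_{t+1} | y_{t+1} = i) * beta[t+1, i]
        beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])
    return beta
```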


Eric Xing 9

The Junction Tree Algorithm

A junction tree for the HMM:

[Figure: the HMM chain y1, ..., yT with observations x1, ..., xT, and its junction tree with evidence cliques ψ(y1, x1), ψ(y2, x2), ..., ψ(yT, xT), chain cliques ψ(y1, y2), ψ(y2, y3), ..., ψ(y_{T-1}, y_T), and separator potentials φ(y1), ζ(y2), ..., ζ(yT)]

Rightward pass (here $\mu_{\uparrow t}(y_{t+1})$ is the message sent up from the evidence clique $\psi(y_{t+1}, x_{t+1})$):
$$\mu_{t \to t+1}(y_{t+1}) = \sum_{y_t} \psi(y_t, y_{t+1})\, \mu_{t-1 \to t}(y_t)\, \mu_{\uparrow t}(y_{t+1})$$
$$= \sum_{y_t} p(y_{t+1} \mid y_t)\, \mu_{t-1 \to t}(y_t)\, p(x_{t+1} \mid y_{t+1})$$
$$= p(x_{t+1} \mid y_{t+1}) \sum_{y_t} a_{y_t, y_{t+1}}\, \mu_{t-1 \to t}(y_t)$$

This is exactly the forward algorithm!

Leftward pass:
$$\mu_{t-1 \leftarrow t}(y_t) = \sum_{y_{t+1}} \psi(y_t, y_{t+1})\, \mu_{t \leftarrow t+1}(y_{t+1})\, \mu_{\uparrow t}(y_{t+1})$$
$$= \sum_{y_{t+1}} p(y_{t+1} \mid y_t)\, \mu_{t \leftarrow t+1}(y_{t+1})\, p(x_{t+1} \mid y_{t+1})$$

This is exactly the backward algorithm!

Eric Xing 10

Summary of the F-B algorithm

$$\alpha_t^k \overset{\text{def}}{=} \mu_{t-1 \to t}(k) = P(x_1, \ldots, x_t, y_t^k = 1)$$
$$\beta_t^k \overset{\text{def}}{=} \mu_{t-1 \leftarrow t}(k) = P(x_{t+1}, \ldots, x_T \mid y_t^k = 1)$$
$$\xi_t^{i,j} \overset{\text{def}}{=} p(y_t^i = 1, y_{t+1}^j = 1 \mid x_{1:T}) \propto \alpha_t^i\, \beta_{t+1}^j\, p(x_{t+1} \mid y_{t+1}^j = 1)\, p(y_{t+1}^j = 1 \mid y_t^i = 1)$$

The recursions:
$$\alpha_t^k = p(x_t \mid y_t^k = 1) \sum_i \alpha_{t-1}^i\, a_{i,k}, \qquad \beta_t^k = \sum_i a_{k,i}\, p(x_{t+1} \mid y_{t+1}^i = 1)\, \beta_{t+1}^i$$
$$\xi_t^{i,j} = \alpha_t^i\, \beta_{t+1}^j\, a_{i,j}\, p(x_{t+1} \mid y_{t+1}^j = 1)$$
$$\gamma_t^i \overset{\text{def}}{=} p(y_t^i = 1 \mid x_{1:T}) \propto \alpha_t^i \beta_t^i = \sum_j \xi_t^{i,j}$$

The matrix-vector form, with $B_t(i) \overset{\text{def}}{=} p(x_t \mid y_t^i = 1)$, $A(i,j) \overset{\text{def}}{=} p(y_{t+1}^j = 1 \mid y_t^i = 1)$, and $\ast$ denoting elementwise multiplication (up to normalization by $P(\mathbf{x})$ for $\xi_t$ and $\gamma_t$):
$$\alpha_{t+1} = B_{t+1} \ast (A^{\top} \alpha_t), \qquad \beta_t = A\,(B_{t+1} \ast \beta_{t+1})$$
$$\xi_t = A \ast \big(\alpha_t\, (B_{t+1} \ast \beta_{t+1})^{\top}\big), \qquad \gamma_t = \alpha_t \ast \beta_t$$

The matrix-vector form:
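A NumPy sketch of this matrix-vector form, combining unscaled forward and backward passes into the posterior quantities gamma and xi (discrete emissions assumed; the elementwise products above become array multiplication and broadcasting):

```python
import numpy as np

def forward_backward(x, A, B, pi):
    """Return gamma[t, i] = p(y_t=i | x) and xi[t, i, j] = p(y_t=i, y_{t+1}=j | x)."""
    T, K = len(x), len(pi)
    alpha = np.zeros((T, K)); beta = np.zeros((T, K))
    alpha[0] = pi * B[:, x[0]]
    for t in range(1, T):
        alpha[t] = B[:, x[t]] * (alpha[t - 1] @ A)      # alpha_{t+1} = B_{t+1} * (A^T alpha_t)
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])    # beta_t = A (B_{t+1} * beta_{t+1})
    px = alpha[-1].sum()                                # P(x)
    gamma = alpha * beta / px                           # gamma_t = alpha_t * beta_t / P(x)
    xi = np.zeros((T - 1, K, K))
    for t in range(T - 1):
        # xi_t(i, j) = alpha_t(i) a_{ij} b_j(x_{t+1}) beta_{t+1}(j) / P(x)
        xi[t] = A * np.outer(alpha[t], B[:, x[t + 1]] * beta[t + 1]) / px
    return gamma, xi
```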


Eric Xing 11

Posterior decoding

We can now calculate
$$P(y_t^k = 1 \mid \mathbf{x}) = \frac{P(y_t^k = 1, \mathbf{x})}{P(\mathbf{x})} = \frac{\alpha_t^k\, \beta_t^k}{P(\mathbf{x})}$$

Then, we can ask: what is the most likely state at position t of sequence x?
$$k_t^* = \arg\max_k P(y_t^k = 1 \mid \mathbf{x})$$

Note that this is an MPA (most probable assignment) of a single hidden state; what if we want an MPA of the whole hidden state sequence?

Posterior decoding: $\{\, y_t^{k_t^*} = 1 : t = 1 \ldots T \,\}$.

This is different from the MPA of a whole sequence of hidden states. This can be understood as bit error rate vs. word error rate.

Example: MPA of X? MPA of (X, Y)?

x | y | P(x,y)
0 | 0 | 0.35
0 | 1 | 0.05
1 | 0 | 0.30
1 | 1 | 0.30
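The toy table can be checked directly; this small sketch computes the marginal MPA of X and the joint MPA of (X, Y), showing that they disagree:

```python
# Joint distribution from the example table.
P = {(0, 0): 0.35, (0, 1): 0.05, (1, 0): 0.30, (1, 1): 0.30}

# Marginal MPA of X: P(X=0) = 0.40 vs. P(X=1) = 0.60  ->  x* = 1
px = {x: sum(p for (xx, _), p in P.items() if xx == x) for x in (0, 1)}
x_star = max(px, key=px.get)

# Joint MPA of (X, Y): the single most probable cell  ->  (0, 0) with 0.35
xy_star = max(P, key=P.get)

print(px, x_star, xy_star)   # {0: 0.4, 1: 0.6} 1 (0, 0)
```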

Eric Xing 12

Viterbi decoding

GIVEN x = x1, …, xT, we want to find y = y1, …, yT such that P(y|x) is maximized:
$$\mathbf{y}^* = \arg\max_{\mathbf{y}} P(\mathbf{y} \mid \mathbf{x}) = \arg\max_{\mathbf{y}} P(\mathbf{y}, \mathbf{x})$$

Let
$$V_t^k = \max_{\{y_1, \ldots, y_{t-1}\}} P(x_1, \ldots, x_{t-1}, y_1, \ldots, y_{t-1}, x_t, y_t^k = 1)$$
= the probability of the most likely sequence of states ending at state $y_t = k$.

The recursion:
$$V_t^k = p(x_t \mid y_t^k = 1)\, \max_i a_{i,k}\, V_{t-1}^i$$

[Figure: the Viterbi trellis, states 1, 2, ..., K against positions x1, x2, x3, ..., xT, with $V_t^k$ filled in column by column]

Underflows are a significant problem:
$$p(x_1, \ldots, x_t, y_1, \ldots, y_t) = \pi_{y_1}\, a_{y_1, y_2} \cdots a_{y_{t-1}, y_t}\, b_{y_1, x_1} \cdots b_{y_t, x_t}$$
These numbers become extremely small (underflow). Solution: take the logs of all values:
$$V_t^k = \log p(x_t \mid y_t^k = 1) + \max_i \big( \log a_{i,k} + V_{t-1}^i \big)$$
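A sketch of Viterbi decoding in log space (the "take the logs" fix above), assuming discrete emissions B, transitions A, and start distribution pi as before:

```python
import numpy as np

def viterbi(x, A, B, pi):
    """Most likely state path via max-sum in log space (avoids underflow)."""
    T, K = len(x), len(pi)
    logA, logB, logpi = np.log(A), np.log(B), np.log(pi)
    V = np.zeros((T, K))              # V[t, k] = log-score of best path ending in state k
    ptr = np.zeros((T, K), dtype=int) # back-pointers
    V[0] = logpi + logB[:, x[0]]
    for t in range(1, T):
        scores = V[t - 1][:, None] + logA        # scores[i, k] = V[t-1, i] + log a_{i,k}
        ptr[t] = scores.argmax(axis=0)
        V[t] = logB[:, x[t]] + scores.max(axis=0)
    # Backtrack the argmaxes to recover the path.
    path = np.zeros(T, dtype=int)
    path[-1] = V[-1].argmax()
    for t in range(T - 1, 0, -1):
        path[t - 1] = ptr[t, path[t]]
    return path, V[-1].max()          # best path and its log probability
```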


Eric Xing 13

The Viterbi Algorithm – derivation

Define the Viterbi probability:
$$V_{t+1}^k = \max_{\{y_1, \ldots, y_t\}} P(x_1, \ldots, x_t, y_1, \ldots, y_t, x_{t+1}, y_{t+1}^k = 1)$$
$$= \max_{\{y_1, \ldots, y_t\}} P(x_{t+1}, y_{t+1}^k = 1 \mid x_1, \ldots, x_t, y_1, \ldots, y_t)\, P(x_1, \ldots, x_t, y_1, \ldots, y_t)$$
$$= \max_{\{y_1, \ldots, y_t\}} P(x_{t+1}, y_{t+1}^k = 1 \mid y_t)\, P(x_1, \ldots, x_{t-1}, y_1, \ldots, y_{t-1}, x_t, y_t)$$
$$= \max_i P(x_{t+1}, y_{t+1}^k = 1 \mid y_t^i = 1)\, \max_{\{y_1, \ldots, y_{t-1}\}} P(x_1, \ldots, x_{t-1}, y_1, \ldots, y_{t-1}, x_t, y_t^i = 1)$$
$$= \max_i P(x_{t+1} \mid y_{t+1}^k = 1)\, a_{i,k}\, V_t^i$$
$$= P(x_{t+1} \mid y_{t+1}^k = 1)\, \max_i a_{i,k}\, V_t^i$$

Eric Xing 14

Computational Complexity and implementation details

What is the running time, and space required, for Forward and Backward?

Time: $O(K^2 N)$; Space: $O(KN)$.

Useful implementation techniques to avoid underflows:
Viterbi: sum of logs
Forward/Backward: rescaling at each position by multiplying by a constant

$$\alpha_t^k = p(x_t \mid y_t^k = 1) \sum_i \alpha_{t-1}^i\, a_{i,k}, \qquad \beta_t^k = \sum_i a_{k,i}\, p(x_{t+1} \mid y_{t+1}^i = 1)\, \beta_{t+1}^i, \qquad V_t^k = p(x_t \mid y_t^k = 1)\, \max_i a_{i,k}\, V_{t-1}^i$$
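A sketch of the rescaling technique for the forward pass: normalize alpha at every position and accumulate the logs of the scaling constants, which also yields log P(x) without underflow (discrete emissions assumed):

```python
import numpy as np

def forward_scaled(x, A, B, pi):
    """Scaled forward pass: returns normalized alpha-hat and log P(x)."""
    T, K = len(x), len(pi)
    alpha_hat = np.zeros((T, K))
    log_px = 0.0
    a = pi * B[:, x[0]]
    for t in range(T):
        if t > 0:
            a = B[:, x[t]] * (alpha_hat[t - 1] @ A)
        c = a.sum()                 # scaling constant c_t
        alpha_hat[t] = a / c        # alpha-hat_t sums to one
        log_px += np.log(c)         # log P(x) = sum_t log c_t
    return alpha_hat, log_px
```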


Eric Xing 15

Learning HMM

Supervised learning: estimation when the "right answer" is known

Examples:
GIVEN: a genomic region x = x1…x1,000,000 where we have good (experimental) annotations of the CpG islands
GIVEN: the casino player allows us to observe him one evening, as he changes dice and produces 10,000 rolls

Unsupervised learning: estimation when the "right answer" is unknown

Examples:
GIVEN: the porcupine genome; we don't know how frequent the CpG islands are there, nor do we know their composition
GIVEN: 10,000 rolls of the casino player, but we don't see when he changes dice

QUESTION: Update the parameters θ of the model to maximize P(x|θ): maximum likelihood (ML) estimation

Eric Xing 16

Learning HMM: two scenarios

Supervised learning: if only we knew the true state path, then ML parameter estimation would be trivial.

E.g., recall that for a completely observed tabular BN:
$$\theta_{ijk}^{ML} = \frac{n_{ijk}}{\sum_{k'} n_{ijk'}}$$

For the HMM, with fully observed training data $\{(x_{n,t}, y_{n,t}) : t = 1{:}T,\; n = 1{:}N\}$:
$$a_{ij}^{ML} = \frac{\#(i \to j)}{\#(i \to \bullet)} = \frac{\sum_n \sum_{t=2}^{T} y_{n,t-1}^i\, y_{n,t}^j}{\sum_n \sum_{t=2}^{T} y_{n,t-1}^i}, \qquad
b_{ik}^{ML} = \frac{\#(i \to k)}{\#(i \to \bullet)} = \frac{\sum_n \sum_{t=1}^{T} y_{n,t}^i\, x_{n,t}^k}{\sum_n \sum_{t=1}^{T} y_{n,t}^i}$$

What if y is continuous? We can treat $\{(x_{n,t}, y_{n,t}) : t = 1{:}T,\; n = 1{:}N\}$ as N×T observations of, e.g., a GLIM, and apply the learning rules for GLIMs.

Unsupervised learning: when the true state path is unknown, we can fill in the missing values using the inference recursions.

The Baum-Welch algorithm (i.e., EM):
Guaranteed to increase the log likelihood of the model after each iteration
Converges to a local optimum, depending on initial conditions
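A sketch of these counting estimators for fully observed discrete sequences; the small pseudocount eps is an assumption beyond the slide, added only to avoid zero probabilities:

```python
import numpy as np

def supervised_mle(seqs, K, M, eps=1e-6):
    """seqs: list of (x, y) pairs of equal-length int arrays; K states, M symbols."""
    A = np.full((K, K), eps)      # transition counts  #(i -> j)
    B = np.full((K, M), eps)      # emission counts    #(i -> k)
    pi = np.full(K, eps)          # initial-state counts
    for x, y in seqs:
        pi[y[0]] += 1
        for t in range(1, len(y)):
            A[y[t - 1], y[t]] += 1
        for t in range(len(y)):
            B[y[t], x[t]] += 1
    # Normalize counts into conditional distributions.
    return A / A.sum(1, keepdims=True), B / B.sum(1, keepdims=True), pi / pi.sum()
```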


Eric Xing 17

The Baum-Welch algorithm

The complete log likelihood:
$$\ell_c(\theta; \mathbf{x}, \mathbf{y}) = \log p(\mathbf{x}, \mathbf{y}) = \log \prod_n \Big( p(y_{n,1}) \prod_{t=2}^{T} p(y_{n,t} \mid y_{n,t-1}) \prod_{t=1}^{T} p(x_{n,t} \mid y_{n,t}) \Big)$$

The expected complete log likelihood:
$$\langle \ell_c(\theta; \mathbf{x}, \mathbf{y}) \rangle = \sum_n \Big( \langle y_{n,1}^i \rangle_{p(y_{n,1} \mid \mathbf{x}_n)} \log \pi_i \Big) + \sum_n \sum_{t=2}^{T} \Big( \langle y_{n,t-1}^i\, y_{n,t}^j \rangle_{p(y_{n,t-1}, y_{n,t} \mid \mathbf{x}_n)} \log a_{i,j} \Big) + \sum_n \sum_{t=1}^{T} \Big( x_{n,t}^k \langle y_{n,t}^i \rangle_{p(y_{n,t} \mid \mathbf{x}_n)} \log b_{i,k} \Big)$$

EM

The E step:
$$\gamma_{n,t}^i = \langle y_{n,t}^i \rangle = p(y_{n,t}^i = 1 \mid \mathbf{x}_n)$$
$$\xi_{n,t}^{i,j} = \langle y_{n,t-1}^i\, y_{n,t}^j \rangle = p(y_{n,t-1}^i = 1, y_{n,t}^j = 1 \mid \mathbf{x}_n)$$

The M step ("symbolically" identical to MLE):
$$a_{ij}^{ML} = \frac{\sum_n \sum_{t=2}^{T} \xi_{n,t}^{i,j}}{\sum_n \sum_{t=1}^{T-1} \gamma_{n,t}^i}, \qquad
b_{ik}^{ML} = \frac{\sum_n \sum_{t=1}^{T} \gamma_{n,t}^i\, x_{n,t}^k}{\sum_n \sum_{t=1}^{T} \gamma_{n,t}^i}, \qquad
\pi_i^{ML} = \frac{\sum_n \gamma_{n,1}^i}{N}$$
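A compact sketch of Baum-Welch for a single discrete observation sequence. It uses an unscaled forward-backward pass for brevity (so it is only suitable for short toy sequences), and the random Dirichlet initialization is an assumption, not from the slide.

```python
import numpy as np

def baum_welch(x, K, M, n_iter=50, seed=0):
    """EM for a discrete HMM on one observation sequence x (values in 0..M-1)."""
    x = np.asarray(x)
    T = len(x)
    rng = np.random.default_rng(seed)
    A = rng.dirichlet(np.ones(K), size=K)      # random initialization (an assumption)
    B = rng.dirichlet(np.ones(M), size=K)
    pi = rng.dirichlet(np.ones(K))
    for _ in range(n_iter):
        # E step: forward-backward to get gamma and xi.
        alpha = np.zeros((T, K)); beta = np.zeros((T, K))
        alpha[0] = pi * B[:, x[0]]
        for t in range(1, T):
            alpha[t] = B[:, x[t]] * (alpha[t - 1] @ A)
        beta[-1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])
        px = alpha[-1].sum()
        gamma = alpha * beta / px
        xi = np.array([A * np.outer(alpha[t], B[:, x[t + 1]] * beta[t + 1]) / px
                       for t in range(T - 1)])
        # M step: "symbolically" identical to the supervised MLE, with expected counts.
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        Bnew = np.zeros((K, M))
        for t in range(T):
            Bnew[:, x[t]] += gamma[t]
        B = Bnew / gamma.sum(axis=0)[:, None]
        pi = gamma[0]
    return A, B, pi
```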

Eric Xing 18

Shortcomings of Hidden Markov Model

HMMs capture dependencies between each state and only its corresponding observation.

NLP example: in a sentence segmentation task, each segmental state may depend not just on a single word (and the adjacent segmental states), but also on (non-local) features of the whole line such as line length, indentation, amount of white space, etc.

Mismatch between learning objective function and prediction objective function

HMM learns a joint distribution of states and observations P(Y, X), but in a prediction task, we need the conditional probability P(Y|X)

Y1 Y2 … … … Yn

X1 X2 … … … Xn


Eric Xing 19

Solution: Maximum Entropy Markov Model (MEMM)

Models dependence between each state and the full observation sequence explicitly

More expressive than HMMs

Discriminative model:
Completely ignores modeling P(X): saves modeling effort
Learning objective function consistent with predictive function: P(Y|X)

Y1 Y2 … … … Yn

X1:n
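The factorization this graph encodes was not recovered by the extraction; in its standard form (the feature functions f_k and weights w_k below are notation assumed here, not taken from the slide), an MEMM is a product of locally normalized per-state conditionals:

$$P(y_{1:n} \mid x_{1:n}) = \prod_{t=1}^{n} P(y_t \mid y_{t-1}, x_{1:n}) = \prod_{t=1}^{n} \frac{\exp\big(\sum_k w_k f_k(y_t, y_{t-1}, x_{1:n})\big)}{Z(y_{t-1}, x_{1:n})}$$

Each factor is normalized locally over $y_t$; this local normalization is what gives rise to the label bias problem discussed next.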

Eric Xing 20

MEMM: the Label bias problem

[Figure: an MEMM over five states and four observations, with local transition probabilities written on the arcs, e.g., from state 1 at observation 1 the probabilities are 0.4 (stay in state 1) and 0.6 (go to state 2), while state 2 spreads 0.2 over each of its five outgoing transitions]

What the local transition probabilities say:

• State 1 almost always prefers to go to state 2

• State 2 almost always prefers to stay in state 2


Eric Xing 21

MEMM: the Label bias problem


Probability of path 1-> 1-> 1-> 1:

• 0.4 x 0.45 x 0.5 = 0.09

Eric Xing 22

MEMM: the Label bias problem


Probability of path 2->2->2->2:

• 0.2 x 0.3 x 0.3 = 0.018. Other paths: 1->1->1->1: 0.09


Eric Xing 23

MEMM: the Label bias problem


Probability of path 1->2->1->2:

• 0.6 x 0.2 x 0.5 = 0.06. Other paths: 1->1->1->1: 0.09; 2->2->2->2: 0.018

Eric Xing 24

MEMM: the Label bias problem


Probability of path 1->1->2->2:

• 0.4 x 0.55 x 0.3 = 0.066. Other paths: 1->1->1->1: 0.09; 2->2->2->2: 0.018; 1->2->1->2: 0.06
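A quick sketch that recomputes these path scores from the local transition probabilities read off the slides' path calculations (the dictionary below encodes only the values that appear in those calculations; the first state of each path is taken as given, so each score is a product of three local factors):

```python
# Local transition probabilities, keyed by (step, from_state, to_state).
p = {
    (1, 1, 1): 0.4,  (1, 1, 2): 0.6,     # step 1, out of state 1
    (1, 2, 2): 0.2,                       # step 1, state 2 -> 2
    (2, 1, 1): 0.45, (2, 1, 2): 0.55,    # step 2, out of state 1
    (2, 2, 1): 0.2,  (2, 2, 2): 0.3,     # step 2, out of state 2
    (3, 1, 1): 0.5,  (3, 1, 2): 0.5,     # step 3, out of state 1
    (3, 2, 2): 0.3,                       # step 3, state 2 -> 2
}

def path_prob(s):                         # s = (s1, s2, s3, s4)
    return p[(1, s[0], s[1])] * p[(2, s[1], s[2])] * p[(3, s[2], s[3])]

for path in [(1, 1, 1, 1), (2, 2, 2, 2), (1, 2, 1, 2), (1, 1, 2, 2)]:
    print(path, path_prob(path))
# (1,1,1,1) 0.09   (2,2,2,2) 0.018   (1,2,1,2) 0.06   (1,1,2,2) 0.066
```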


Eric Xing 25

MEMM: the Label bias problem


Most Likely Path: 1-> 1-> 1-> 1

• Although locally it seems state 1 wants to go to state 2 and state 2 wants to remain in state 2.

• Why?

Eric Xing 26

MEMM: the Label bias problem


Most Likely Path: 1-> 1-> 1-> 1

• State 1 has only two outgoing transitions, but state 2 has five

• The average transition probability out of state 2 is therefore lower


Eric Xing 27

MEMM: the Label bias problem


Label bias problem in MEMM:

• Preference for states with fewer outgoing transitions over others

Eric Xing 28

Solution: Do not normalize probabilities locally


From local probabilities ….


Eric Xing 29

Solution: Do not normalize probabilities locally

[Figure: the same diagram, but with unnormalized local potentials (values such as 20, 30, 10, 5) on the arcs in place of locally normalized probabilities]

From local probabilities to local potentials

• States with fewer outgoing transitions do not have an unfair advantage!

Eric Xing 30

Office hour

Mid-term project milestone due today

Next week


Eric Xing 31

From MEMM ….

Y1 Y2 … … … Yn

X1:n

Eric Xing 32

From MEMM to CRF

CRF is a partially directed model:
Discriminative model, like MEMM
Use of a global normalizer Z(x) overcomes the label bias problem of MEMM
Models the dependence between each state and the entire observation sequence (like MEMM)

Y1 Y2 … … … Yn

x1:n


Eric Xing 33

Conditional Random Fields

General parametric form:

Y1 Y2 … … … Yn

x1:n
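The formula on this slide did not survive extraction; the standard linear-chain CRF form it refers to, written with transition features f_k (weights λ_k) and state-observation features g_l (weights µ_l), is:

$$P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x}, \lambda, \mu)} \exp\Big( \sum_{t=1}^{n} \sum_k \lambda_k f_k(y_t, y_{t-1}, \mathbf{x}) + \sum_{t=1}^{n} \sum_l \mu_l g_l(y_t, \mathbf{x}) \Big)$$

where $Z(\mathbf{x}, \lambda, \mu)$ sums the exponential over all label sequences $\mathbf{y}$; this is the global normalizer that removes the label bias problem.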

Eric Xing 34

CRFs: Inference

Given CRF parameters λ and µ, find the y* that maximizes P(y|x):

Can ignore Z(x) because it is not a function of y

Run the max-product algorithm on the junction-tree of CRF:

Y1 Y2 … … … Yn

x1:n

Junction tree cliques: (Y1, Y2), (Y2, Y3), ..., (Yn-1, Yn); separators: Y2, Y3, ..., Yn-1

Same as Viterbi decoding used in HMMs!


Eric Xing 35

CRF learning

Given $\{(\mathbf{x}_d, \mathbf{y}_d)\}_{d=1}^{N}$, find $\lambda^*, \mu^*$ such that the conditional likelihood of the data is maximized:

Computing the gradient w.r.t λ:

Gradient of the log-partition function in an exponential family is the expectation of the

sufficient statistics.
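The objective and its λ-gradient, reconstructed here in standard form since the slide's equations were lost in extraction; the second term of the gradient is the model expectation of the features, because the gradient of the log-partition function is the expected sufficient statistics:

$$L(\lambda, \mu) = \sum_{d=1}^{N} \Big( \sum_t \sum_k \lambda_k f_k(y_{d,t}, y_{d,t-1}, \mathbf{x}_d) + \sum_t \sum_l \mu_l g_l(y_{d,t}, \mathbf{x}_d) - \log Z(\mathbf{x}_d, \lambda, \mu) \Big)$$
$$\frac{\partial L}{\partial \lambda_k} = \sum_{d=1}^{N} \Big( \sum_t f_k(y_{d,t}, y_{d,t-1}, \mathbf{x}_d) - \sum_t \mathbb{E}_{P(y_t, y_{t-1} \mid \mathbf{x}_d)}\big[ f_k(y_t, y_{t-1}, \mathbf{x}_d) \big] \Big)$$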

Eric Xing 36

CRF learning

Computing the model expectations:

Requires an exponentially large number of summations: is it intractable?

Tractable! We can compute the marginals using the sum-product algorithm on the chain.

Expectation of f over the corresponding marginal probability of neighboring nodes!


Eric Xing 37

CRF learning

Computing marginals using junction-tree calibration:

Junction Tree Initialization:

After calibration:

Y1 Y2 … … … Yn

x1:n

Junction tree cliques: (Y1, Y2), (Y2, Y3), ..., (Yn-1, Yn); separators: Y2, Y3, ..., Yn-1

Also called forward-backward algorithm

Eric Xing 38

CRF learning

Computing feature expectations using calibrated potentials:

Now we know how to compute $\nabla_\lambda L(\lambda, \mu)$:

Learning can now be done using gradient ascent:
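A schematic sketch of the resulting update. The functions empirical_counts and expected_counts are hypothetical placeholders for the feature sums above (the latter would be computed from the calibrated forward-backward marginals), and sigma2 is the variance of the Gaussian regularizer discussed on the next slide.

```python
import numpy as np

def fit_crf(data, n_features, empirical_counts, expected_counts,
            lr=0.1, sigma2=10.0, n_iter=100):
    """Gradient ascent on the regularized conditional log-likelihood.

    empirical_counts(data)     -> feature sums over the observed (x, y) pairs
    expected_counts(data, w)   -> model-expected feature sums under weights w
    (Both are hypothetical placeholders, not a real library API.)
    """
    w = np.zeros(n_features)
    emp = empirical_counts(data)            # fixed: does not depend on w
    for _ in range(n_iter):
        grad = emp - expected_counts(data, w) - w / sigma2   # gradient + Gaussian prior term
        w += lr * grad                      # ascent step
    return w
```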


Eric Xing 39

CRF learning

In practice, we use a Gaussian regularizer on the parameter vector to improve generalizability.

In practice, gradient ascent has very slow convergence. Alternatives:
Conjugate gradient method
Limited-memory quasi-Newton methods

Eric Xing 40

CRFs: some empirical results

Comparison of error rates on synthetic data:

[Figure: three scatter plots comparing per-dataset error rates: CRF error vs. HMM error, MEMM error vs. HMM error, and CRF error vs. MEMM error]

The data is increasingly higher-order in the direction of the arrow. CRFs achieve the lowest error rate on higher-order data.


Eric Xing 41

CRFs: some empirical results

Parts-of-speech tagging

Using the same set of features: HMM >=< CRF > MEMM
Using additional overlapping features: CRF+ > MEMM+ >> HMM

Eric Xing 42

Other CRFs

So far we have discussed only 1-dimensional chain CRFs.

Inference and learning: exact.

We could also have CRFs for arbitrary graph structures, e.g., grid CRFs. Inference and learning are then no longer tractable, and approximate techniques are used:
MCMC sampling
Variational inference
Loopy belief propagation

We will discuss these techniques in the future.


Eric Xing 43

Summary

Conditional Random Fields are partially directed discriminative models.

They overcome the label bias problem of MEMMs by using a global normalizer.

Inference for 1-D chain CRFs is exact: the same as max-product or Viterbi decoding.

Learning is also exact: globally optimal parameters can be learned. It requires the sum-product or forward-backward algorithm.

CRFs involving arbitrary graph structures are intractable in general, e.g., grid CRFs. Inference and learning require approximation techniques:
MCMC sampling
Variational methods
Loopy BP