HMM - Part 2

1

HMM - Part 2
– Review of the last lecture
– The EM algorithm
– Continuous density HMM

2

Three Basic Problems for HMMs

Given an observation sequence O=(o1,o2,…,oT) and an HMM λ=(A,B,π)
– Problem 1: How to compute P(O|λ) efficiently?
  The forward algorithm
  Examples: P(up, up, up, up, up|λ)? Choosing among several models: $i^* = \arg\max_i P(O \mid \lambda_i)$
– Problem 2: How to choose an optimal state sequence Q=(q1,q2,…,qT) which best explains the observations?
  The Viterbi algorithm
  $$Q^* = \arg\max_{Q} P(O, Q \mid \lambda)$$
– Problem 3: How to adjust the model parameters λ=(A,B,π) to maximize P(O|λ)?
  The Baum-Welch (forward-backward) algorithm
  cf. The segmental K-means algorithm maximizes P(O, Q*|λ)

3

The Forward Algorithm

The forward variable:
$$\alpha_t(i) = P(o_1, o_2, \ldots, o_t, q_t = i \mid \lambda)$$
– Probability of o1,o2,…,ot being observed and the state at time t being i, given model λ

The forward algorithm
1. Initialization: $\alpha_1(i) = \pi_i\, b_i(o_1)$, $1 \le i \le N$
2. Induction: $\alpha_{t+1}(j) = \left[\sum_{i=1}^{N} \alpha_t(i)\, a_{ij}\right] b_j(o_{t+1})$, $1 \le t \le T-1$, $1 \le j \le N$
3. Termination: $P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$

4
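The three steps of the forward algorithm can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: states and symbols are 0-indexed, and the two-state model `pi`, `A`, `B` is a hypothetical toy example chosen only so the result can be checked by hand.

```python
def forward(pi, A, B, obs):
    """Return (alpha, P(O|lambda)) for a discrete HMM.

    pi[i]   : initial probability of state i
    A[i][j] : transition probability from state i to state j
    B[j][k] : probability that state j emits symbol index k
    obs     : list of symbol indices o_1..o_T
    """
    N = len(pi)
    # 1. Initialization: alpha_1(i) = pi_i * b_i(o_1)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    # 2. Induction: alpha_{t+1}(j) = [sum_i alpha_t(i) a_ij] * b_j(o_{t+1})
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(N)) * B[j][o]
                      for j in range(N)])
    # 3. Termination: P(O|lambda) = sum_i alpha_T(i)
    return alpha, sum(alpha[-1])


# Hypothetical toy model (an assumption for illustration only)
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
alpha, likelihood = forward(pi, A, B, [0, 1, 0])
print(likelihood)
```

The cost is O(N²T), compared with O(T·N^T) for summing over every state sequence explicitly.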

The Viterbi Algorithm

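Here $\delta_t(i)$ is the highest probability of a single state path that accounts for o1,…,ot and ends in state i at time t, and $\psi_t(j)$ is the back-pointer to the best predecessor state.

1. Initialization: $\delta_1(i) = \pi_i\, b_i(o_1)$, $\psi_1(i) = 0$, $1 \le i \le N$
2. Induction: $\delta_{t+1}(j) = \max_{1 \le i \le N} [\delta_t(i)\, a_{ij}]\, b_j(o_{t+1})$, $\psi_{t+1}(j) = \arg\max_{1 \le i \le N} [\delta_t(i)\, a_{ij}]$, $1 \le t \le T-1$, $1 \le j \le N$
3. Termination: $P(O, Q^* \mid \lambda) = \max_{1 \le i \le N} \delta_T(i)$, $q_T^* = \arg\max_{1 \le i \le N} \delta_T(i)$
4. Backtracking: $q_t^* = \psi_{t+1}(q_{t+1}^*)$, $t = T-1, \ldots, 2, 1$
   $Q^* = (q_1^*, q_2^*, \ldots, q_T^*)$ is the best state sequence

cf. the forward algorithm: $\alpha_{t+1}(j) = \left[\sum_{i=1}^{N} \alpha_t(i)\, a_{ij}\right] b_j(o_{t+1})$ and $P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$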

5
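The Viterbi recursion differs from the forward recursion only in replacing the sum by a max and remembering the argmax. A minimal sketch follows, with 0-indexed states; the toy model is hypothetical and not taken from the lecture.

```python
def viterbi(pi, A, B, obs):
    """Return (best state path, P(O, Q*|lambda)) for a discrete HMM."""
    N, T = len(pi), len(obs)
    # 1. Initialization: delta_1(i) = pi_i * b_i(o_1), psi_1(i) = 0
    delta = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    psi = [[0] * N]
    # 2. Induction: keep the best predecessor for every state j
    for t in range(1, T):
        d_row, p_row = [], []
        for j in range(N):
            best_i = max(range(N), key=lambda i: delta[t - 1][i] * A[i][j])
            p_row.append(best_i)
            d_row.append(delta[t - 1][best_i] * A[best_i][j] * B[j][obs[t]])
        delta.append(d_row)
        psi.append(p_row)
    # 3. Termination: pick the best final state
    last = max(range(N), key=lambda i: delta[-1][i])
    # 4. Backtracking: follow the back-pointers from T down to 1
    path = [last]
    for t in range(T - 1, 0, -1):
        path.append(psi[t][path[-1]])
    path.reverse()
    return path, delta[-1][last]


# Hypothetical toy model (an assumption for illustration only)
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
best_path, best_prob = viterbi(pi, A, B, [0, 1, 0])
print(best_path, best_prob)
```

In practice the products are usually replaced by sums of log probabilities to avoid underflow on long sequences; the sketch keeps plain probabilities to mirror the formulas.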

The Segmental K-means Algorithm

Assume that we have a training set of observations and an initial estimate of model parameters
– Step 1: Segment the training data
  The set of training observation sequences is segmented into states, based on the current model, by the Viterbi algorithm
– Step 2: Re-estimate the model parameters
  $$\hat{\pi}_i = \frac{\text{Number of times } q_1 = i}{\text{Number of training sequences}}$$
  $$\hat{a}_{ij} = \frac{\text{Number of transitions from state } i \text{ to state } j}{\text{Number of transitions from state } i}$$
  $$\hat{b}_j(k) = \frac{\text{Number of "} k \text{" in state } j}{\text{Number of times in state } j}$$
  The estimates satisfy $\sum_{i=1}^{N} \hat{\pi}_i = 1$, $\sum_{j=1}^{N} \hat{a}_{ij} = 1$, $\sum_{k=1}^{M} \hat{b}_j(k) = 1$
– Step 3: Evaluate the model
  If the difference between the new and current model scores exceeds a threshold, go back to Step 1; otherwise, return

6
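Step 2 of the segmental K-means algorithm is plain counting along the Viterbi state paths. The sketch below assumes the segmentation of Step 1 is already available as `paths`; the data are hypothetical, and the uniform fallback for a state that is never visited is an assumption of this sketch, not part of the lecture.

```python
def reestimate_from_paths(paths, obs_seqs, N, M):
    """Count-based estimates of (pi, A, B) from aligned state paths."""
    pi_c = [0] * N
    a_c = [[0] * N for _ in range(N)]
    b_c = [[0] * M for _ in range(N)]
    for q, o in zip(paths, obs_seqs):
        pi_c[q[0]] += 1                      # times q_1 = i
        for t in range(len(q)):
            b_c[q[t]][o[t]] += 1             # symbol k observed in state j
            if t + 1 < len(q):
                a_c[q[t]][q[t + 1]] += 1     # transition i -> j

    def normalize(row):
        total = sum(row)
        # Assumption: fall back to a uniform row if a state was never seen
        if total == 0:
            return [1.0 / len(row)] * len(row)
        return [c / total for c in row]

    return (normalize(pi_c),
            [normalize(r) for r in a_c],
            [normalize(r) for r in b_c])


# Hypothetical Viterbi segmentations of two training sequences
paths = [[0, 0, 1], [0, 1, 1]]
obs_seqs = [[0, 0, 1], [0, 1, 1]]
pi_hat, A_hat, B_hat = reestimate_from_paths(paths, obs_seqs, N=2, M=2)
print(pi_hat, A_hat, B_hat)
```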

Segmental K-means vs. Baum-Welch

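Segmental K-means (counts along the single best state sequence):
$$\hat{\pi}_i = \frac{\text{Number of times } q_1 = i}{\text{Number of training sequences}}$$
$$\hat{a}_{ij} = \frac{\text{Number of transitions from state } i \text{ to state } j}{\text{Number of transitions from state } i}$$
$$\hat{b}_j(k) = \frac{\text{Number of "} k \text{" in state } j}{\text{Number of times in state } j}$$

Baum-Welch (expected counts over all state sequences):
$$\hat{\pi}_i = \frac{\text{Expected number of times } q_1 = i}{\text{Number of training sequences}}$$
$$\hat{a}_{ij} = \frac{\text{Expected number of transitions from state } i \text{ to state } j}{\text{Expected number of transitions from state } i}$$
$$\hat{b}_j(k) = \frac{\text{Expected number of "} k \text{" in state } j}{\text{Expected number of times in state } j}$$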

7

The Backward Algorithm

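The backward variable:
$$\beta_t(i) = P(o_{t+1}, o_{t+2}, \ldots, o_T \mid q_t = i, \lambda)$$
– Probability of ot+1,ot+2,…,oT being observed, given the state at time t being i and model λ

The backward algorithm
1. Initialization: $\beta_T(i) = 1$, $1 \le i \le N$
2. Induction: $\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)$, $t = T-1, \ldots, 1$, $1 \le i \le N$
3. Termination: $P(O \mid \lambda) = \sum_{i=1}^{N} \pi_i\, b_i(o_1)\, \beta_1(i)$

cf. the forward algorithm: $P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$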

8
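The backward recursion runs from t = T down to t = 1 and must give the same P(O|λ) as the forward pass. A minimal sketch with 0-indexed states and a hypothetical toy model (not from the lecture):

```python
def backward(pi, A, B, obs):
    """Return (beta, P(O|lambda)) for a discrete HMM."""
    N, T = len(pi), len(obs)
    # 1. Initialization: beta_T(i) = 1
    beta = [[1.0] * N for _ in range(T)]
    # 2. Induction: beta_t(i) = sum_j a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(N))
    # 3. Termination: P(O|lambda) = sum_i pi_i * b_i(o_1) * beta_1(i)
    likelihood = sum(pi[i] * B[i][obs[0]] * beta[0][i] for i in range(N))
    return beta, likelihood


# Hypothetical toy model (an assumption for illustration only)
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
beta, likelihood = backward(pi, A, B, [0, 1, 0])
print(likelihood)
```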


The Forward-Backward Algorithm

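Relation between the forward and backward variables

$$\alpha_t(i) = P(o_1, o_2, \ldots, o_t, q_t = i \mid \lambda) = \left[\sum_{j=1}^{N} \alpha_{t-1}(j)\, a_{ji}\right] b_i(o_t)$$
$$\beta_t(i) = P(o_{t+1}, o_{t+2}, \ldots, o_T \mid q_t = i, \lambda) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)$$
$$\alpha_t(i)\, \beta_t(i) = P(O, q_t = i \mid \lambda)$$
$$P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_t(i)\, \beta_t(i) \quad \text{for any } t$$

(Huang et al., 2001)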

9

The Baum-Welch Algorithm (1/3)

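Define two new variables:
$\gamma_t(i) = P(q_t = i \mid O, \lambda)$
– Probability of being in state i at time t, given O and λ
$$\gamma_t(i) = \frac{P(O, q_t = i \mid \lambda)}{P(O \mid \lambda)} = \frac{\alpha_t(i)\, \beta_t(i)}{\sum_{i=1}^{N} \alpha_t(i)\, \beta_t(i)}$$
$\xi_t(i, j) = P(q_t = i, q_{t+1} = j \mid O, \lambda)$
– Probability of being in state i at time t and state j at time t+1, given O and λ
$$\xi_t(i, j) = \frac{P(q_t = i, q_{t+1} = j, O \mid \lambda)}{P(O \mid \lambda)} = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\sum_{m=1}^{N} \sum_{n=1}^{N} \alpha_t(m)\, a_{mn}\, b_n(o_{t+1})\, \beta_{t+1}(n)}$$
The two variables are related by
$$\gamma_t(i) = \sum_{j=1}^{N} \xi_t(i, j)$$

10

The Baum-Welch Algorithm (2/3)

With L training sequences, where sequence l has length $T_l$ and variables $\gamma_t^l$, $\xi_t^l$:
$$\sum_{l=1}^{L} \sum_{t=1}^{T_l - 1} \gamma_t^l(i) = \text{expected number of transitions from state } i$$
$$\sum_{l=1}^{L} \gamma_1^l(i) = \text{expected number of times } q_1 = i$$
$$\sum_{l=1}^{L} \sum_{t=1}^{T_l - 1} \xi_t^l(i, j) = \text{expected number of transitions from state } i \text{ to state } j$$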

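11

The Baum-Welch Algorithm (3/3)

Re-estimation formulae for π, A, and B are
$$\hat{\pi}_i = \frac{\text{Expected number of times } q_1 = i}{\text{Number of training sequences}} = \frac{1}{L} \sum_{l=1}^{L} \gamma_1^l(i)$$
$$\hat{a}_{ij} = \frac{\text{Expected number of transitions from state } i \text{ to state } j}{\text{Expected number of transitions from state } i} = \frac{\sum_{l=1}^{L} \sum_{t=1}^{T_l - 1} \xi_t^l(i, j)}{\sum_{l=1}^{L} \sum_{t=1}^{T_l - 1} \gamma_t^l(i)}$$
$$\hat{b}_j(k) = \frac{\text{Expected number of "} k \text{" in state } j}{\text{Expected number of times in state } j} = \frac{\sum_{l=1}^{L} \sum_{t=1,\ \text{s.t. } o_t = v_k}^{T_l} \gamma_t^l(j)}{\sum_{l=1}^{L} \sum_{t=1}^{T_l} \gamma_t^l(j)}$$

How do you know $P(O \mid \hat{\lambda}) \ge P(O \mid \lambda)$?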

12
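One Baum-Welch iteration can be sketched by combining the forward and backward passes with the γ and ξ definitions. The sketch assumes a single training sequence (L = 1), 0-indexed states and symbols, and a hypothetical toy model; none of the names come from the lecture.

```python
def forward(pi, A, B, obs):
    N = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(N)) * B[j][o]
                      for j in range(N)])
    return alpha


def backward(A, B, obs):
    N, T = len(A), len(obs)
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(N))
    return beta


def baum_welch_step(pi, A, B, obs):
    """Return (new_pi, new_A, new_B, P(O|lambda)) after one re-estimation."""
    N, M, T = len(pi), len(B[0]), len(obs)
    alpha, beta = forward(pi, A, B, obs), backward(A, B, obs)
    p_obs = sum(alpha[-1])
    # gamma_t(i) = alpha_t(i) beta_t(i) / P(O|lambda)
    gamma = [[alpha[t][i] * beta[t][i] / p_obs for i in range(N)]
             for t in range(T)]
    # xi_t(i,j) = alpha_t(i) a_ij b_j(o_{t+1}) beta_{t+1}(j) / P(O|lambda)
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / p_obs
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    new_pi = list(gamma[0])
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(N)] for i in range(N)]
    new_B = [[sum(gamma[t][j] for t in range(T) if obs[t] == k) /
              sum(gamma[t][j] for t in range(T))
              for k in range(M)] for j in range(N)]
    return new_pi, new_A, new_B, p_obs


# Hypothetical toy model and observation sequence
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
obs = [0, 1, 0, 0, 1, 1, 0]
pi1, A1, B1, p0 = baum_welch_step(pi, A, B, obs)   # p0 = P(O | old model)
_, _, _, p1 = baum_welch_step(pi1, A1, B1, obs)    # p1 = P(O | new model)
print(p0, p1)
```

Comparing `p0` and `p1` gives an empirical answer to the question on the slide: the likelihood under the re-estimated model is not smaller than under the old one.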

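Maximum Likelihood Estimation for HMM

$$\begin{aligned}
\lambda_{ML} &= \arg\max_{\lambda} L(\lambda) && (L(\lambda) = P(O \mid \lambda)) \\
&= \arg\max_{\lambda} l(\lambda) && (l(\lambda) = \log P(O \mid \lambda)) \\
&= \arg\max_{\lambda} \log \sum_{Q} P(O, Q \mid \lambda)
\end{aligned}$$

However, we cannot find the solution directly.

An alternative way is to find a sequence $\lambda^0, \lambda^1, \ldots, \lambda^t, \ldots$ s.t.
$$l(\lambda^0) \le l(\lambda^1) \le \ldots \le l(\lambda^t) \le \ldots$$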

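13

$$\begin{aligned}
l(\lambda) - l(\lambda^t) &= \log P(O \mid \lambda) - \log P(O \mid \lambda^t) \\
&= \log \frac{P(O \mid \lambda)}{P(O \mid \lambda^t)} \\
&= \log \sum_{Q} \frac{P(O, Q \mid \lambda)}{P(O \mid \lambda^t)} \\
&= \log \sum_{Q} \frac{P(O, Q \mid \lambda^t)}{P(O \mid \lambda^t)} \cdot \frac{P(O, Q \mid \lambda)}{P(O, Q \mid \lambda^t)} \\
&= \log \sum_{Q} P(Q \mid O, \lambda^t)\, \frac{P(O, Q \mid \lambda)}{P(O, Q \mid \lambda^t)} \\
&= \log E_{P(Q \mid O, \lambda^t)}\!\left[\frac{P(O, Q \mid \lambda)}{P(O, Q \mid \lambda^t)}\right] \\
&\ge E_{P(Q \mid O, \lambda^t)}\!\left[\log \frac{P(O, Q \mid \lambda)}{P(O, Q \mid \lambda^t)}\right] \quad \text{(Jensen's inequality)}
\end{aligned}$$

Jensen's inequality: If f is a concave function, and X is a r.v., then E[f(X)] ≤ f(E[X])

$$\begin{aligned}
\lambda^{t+1} &= \arg\max_{\lambda} E_{P(Q \mid O, \lambda^t)}\!\left[\log \frac{P(O, Q \mid \lambda)}{P(O, Q \mid \lambda^t)}\right] \\
&= \arg\max_{\lambda} \sum_{Q} P(Q \mid O, \lambda^t) \log \frac{P(O, Q \mid \lambda)}{P(O, Q \mid \lambda^t)} \\
&= \arg\max_{\lambda} \sum_{Q} P(Q \mid O, \lambda^t) \log P(O, Q \mid \lambda) \\
&= \arg\max_{\lambda} E_{P(Q \mid O, \lambda^t)}\left[\log P(O, Q \mid \lambda)\right] \quad \text{(the Q function)}
\end{aligned}$$

This maximization is solvable, and it can be proved that $l(\lambda^{t+1}) \ge l(\lambda^t)$, i.e., $P(O \mid \lambda^{t+1}) \ge P(O \mid \lambda^t)$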

14

The EM Algorithm

EM: Expectation Maximization
– Why EM?
  • Simple optimization algorithms for likelihood functions rely on intermediate variables, called latent data
    For HMM, the state sequence is the latent data
  • Direct access to the data necessary to estimate the parameters is impossible or difficult
    For HMM, it is almost impossible to estimate (A, B, π) without considering the state sequence
– Two Major Steps:
  • E step: compute the expectation of the log-likelihood by including the latent variables as if they were observed
    $$\sum_{Q} P(Q \mid O, \lambda) \log P(O, Q \mid \hat{\lambda})$$
  • M step: compute the maximum likelihood estimates of the parameters by maximizing the expected log-likelihood found in the E step

15

Three Steps for EM

Step 1. Draw a lower bound
– Use Jensen’s inequality

Step 2. Find the best lower bound (the auxiliary function)
– Let the lower bound touch the objective function at the current guess

Step 3. Maximize the auxiliary function
– Obtain the new guess
– Go to Step 2 until convergence

[Minka 1998]

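16

Form an Initial Guess of λ=(A,B,π)

[Figure: the objective function F(λ) plotted against λ, with the current guess marked.]
Given the current guess λ, the goal is to find a new guess $\lambda^{NEW}$ such that $F(\lambda^{NEW}) \ge F(\lambda)$. The final target is $\lambda^* = \arg\max_{\lambda} P(O \mid \lambda)$.

17

Step 1. Draw a Lower Bound

[Figure: a lower bound function g(λ) drawn beneath the objective function, so that g(λ) ≤ F(λ).]

18

Step 2. Find the Best Lower Bound

[Figure: among the lower bound functions g(λ), the auxiliary function is the one that touches the objective function F(λ) at the current guess.]

19

Step 3. Maximize the Auxiliary Function

[Figure: $\lambda^{NEW}$ is the maximizer of the auxiliary function; at that point $F(\lambda^{NEW}) \ge F(\lambda)$.]

20

Update the Model

[Figure: $\lambda^{NEW}$ becomes the current guess on the objective function F(λ).]

21

Step 2. Find the Best Lower Bound

[Figure: a new auxiliary function touches the objective function F(λ) at the updated guess.]

22

Step 3. Maximize the Auxiliary Function

[Figure: maximizing the new auxiliary function gives the next $\lambda^{NEW}$, again with $F(\lambda^{NEW}) \ge F(\lambda)$.]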

23

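Step 1. Draw a Lower Bound (cont’d)

Objective function, with p(Q) an arbitrary probability distribution:
$$\log P(O \mid \lambda) = \log \sum_{Q} P(O, Q \mid \lambda) = \log \sum_{Q} p(Q)\, \frac{P(O, Q \mid \lambda)}{p(Q)}$$

Apply Jensen’s Inequality:
$$\log \sum_{Q} p(Q)\, \frac{P(O, Q \mid \lambda)}{p(Q)} \ge \sum_{Q} p(Q) \log \frac{P(O, Q \mid \lambda)}{p(Q)}$$
The right-hand side is a lower bound function of F(λ).

If f is a concave function, and X is a r.v., then E[f(X)] ≤ f(E[X])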

24

Step 2. Find the Best Lower Bound (cont’d)

– Find p(Q) that makes the lower bound function touch the objective function at the current guess $\lambda^t$

We want to maximize $\sum_{Q} p(Q) \log \frac{P(O, Q \mid \lambda)}{p(Q)}$ w.r.t. p(Q) at $\lambda = \lambda^t$.

The best p(Q) is
$$p^*(Q) = \arg\max_{p(Q)} \sum_{Q} p(Q) \log \frac{P(O, Q \mid \lambda^t)}{p(Q)}$$

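25

Step 2. Find the Best Lower Bound (cont’d)

Since $\sum_{Q} p(Q) = 1$, we introduce a Lagrange multiplier η here (with $\lambda^t$ the current guess):
$$\eta\left(1 - \sum_{Q} p(Q)\right) + \sum_{Q} p(Q) \log P(O, Q \mid \lambda^t) - \sum_{Q} p(Q) \log p(Q)$$

Take the derivative w.r.t. p(Q) and set it to zero:
$$\begin{aligned}
& -\eta + \log P(O, Q \mid \lambda^t) - \log p(Q) - 1 = 0 \\
\Rightarrow\ & \log p(Q) = \log P(O, Q \mid \lambda^t) - \eta - 1 \\
\Rightarrow\ & p(Q) = P(O, Q \mid \lambda^t)\, e^{-\eta - 1} \\
\Rightarrow\ & \sum_{Q} p(Q) = e^{-\eta - 1} \sum_{Q} P(O, Q \mid \lambda^t) = 1 \\
\Rightarrow\ & e^{-\eta - 1} = \frac{1}{\sum_{Q} P(O, Q \mid \lambda^t)} = \frac{1}{P(O \mid \lambda^t)} \\
\Rightarrow\ & p(Q) = \frac{P(O, Q \mid \lambda^t)}{P(O \mid \lambda^t)} = P(Q \mid O, \lambda^t)
\end{aligned}$$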

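26

Define the auxiliary function (with $\lambda^t$ the current guess)
$$g(\lambda, \lambda^t) = \sum_{Q} P(Q \mid O, \lambda^t) \log \frac{P(O, Q \mid \lambda)}{P(Q \mid O, \lambda^t)}$$

We can check that it touches the objective function at the current guess:
$$\begin{aligned}
g(\lambda^t, \lambda^t) &= \sum_{Q} P(Q \mid O, \lambda^t) \log \frac{P(O, Q \mid \lambda^t)}{P(Q \mid O, \lambda^t)} \\
&= \sum_{Q} P(Q \mid O, \lambda^t) \log \frac{P(Q \mid O, \lambda^t)\, P(O \mid \lambda^t)}{P(Q \mid O, \lambda^t)} \\
&= \sum_{Q} P(Q \mid O, \lambda^t) \log P(O \mid \lambda^t) \\
&= \log P(O \mid \lambda^t)
\end{aligned}$$

The auxiliary function splits into two terms, of which only the first (the Q function) depends on λ:
$$g(\lambda, \lambda^t) = \sum_{Q} P(Q \mid O, \lambda^t) \log P(O, Q \mid \lambda) - \sum_{Q} P(Q \mid O, \lambda^t) \log P(Q \mid O, \lambda^t)$$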

27
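Both properties of the auxiliary function, being a lower bound everywhere and touching the objective at the current guess, can be checked numerically on a model small enough to enumerate every state sequence. The sketch below is a brute-force check under stated assumptions: the two toy models `current` and `other` are hypothetical, and the enumeration is exponential in T, so it is only meant for tiny examples.

```python
import itertools
import math


def joint(model, obs, path):
    """P(O, Q | model) for one state path Q."""
    pi, A, B = model
    p = pi[path[0]] * B[path[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= A[path[t - 1]][path[t]] * B[path[t]][obs[t]]
    return p


def all_paths(model, obs):
    return itertools.product(range(len(model[0])), repeat=len(obs))


def log_likelihood(model, obs):
    """Objective F = log P(O | model), summing over every path."""
    return math.log(sum(joint(model, obs, q) for q in all_paths(model, obs)))


def auxiliary(model, current, obs):
    """g(model, current) = sum_Q P(Q|O,current) log[P(O,Q|model) / P(Q|O,current)]."""
    p_obs = math.exp(log_likelihood(current, obs))
    g = 0.0
    for q in all_paths(current, obs):
        post = joint(current, obs, q) / p_obs      # P(Q | O, current)
        g += post * math.log(joint(model, obs, q) / post)
    return g


# Hypothetical toy models: (pi, A, B)
current = ([0.6, 0.4], [[0.7, 0.3], [0.4, 0.6]], [[0.9, 0.1], [0.2, 0.8]])
other = ([0.5, 0.5], [[0.5, 0.5], [0.5, 0.5]], [[0.6, 0.4], [0.3, 0.7]])
obs = [0, 1, 0]

print(auxiliary(current, current, obs), log_likelihood(current, obs))  # equal
print(auxiliary(other, current, obs), log_likelihood(other, obs))      # g <= F
```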

EM for HMM Training

Basic idea
– Assume we have λ and the probability that each Q occurred in the generation of O, i.e., we have in fact observed a complete data pair (O,Q) with frequency proportional to the probability P(O,Q|λ)
– We then find a new $\hat{\lambda}$ that maximizes the expectation
  $$\sum_{Q} P(Q \mid O, \lambda) \log P(O, Q \mid \hat{\lambda})$$
– It can be guaranteed that $P(O \mid \hat{\lambda}) \ge P(O \mid \lambda)$

EM can discover parameters of model λ to maximize the log-likelihood of the incomplete data, log P(O|λ), by iteratively maximizing the expectation of the log-likelihood of the complete data, log P(O,Q|λ)

28

Solution to Problem 3 - The EM Algorithm

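The auxiliary function
$$Q(\lambda, \hat{\lambda}) = \sum_{Q} P(Q \mid O, \lambda) \log P(O, Q \mid \hat{\lambda}) = \sum_{Q} \frac{P(O, Q \mid \lambda)}{P(O \mid \lambda)} \log P(O, Q \mid \hat{\lambda})$$

where $P(O, Q \mid \lambda)$ and $\log P(O, Q \mid \hat{\lambda})$ can be expressed as
$$P(O, Q \mid \lambda) = \pi_{q_1} \prod_{t=1}^{T-1} a_{q_t q_{t+1}} \prod_{t=1}^{T} b_{q_t}(o_t)$$
$$\log P(O, Q \mid \hat{\lambda}) = \log \hat{\pi}_{q_1} + \sum_{t=1}^{T-1} \log \hat{a}_{q_t q_{t+1}} + \sum_{t=1}^{T} \log \hat{b}_{q_t}(o_t)$$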

29

Solution to Problem 3 - The EM Algorithm (cont’d)

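The auxiliary function can be rewritten as
$$\begin{aligned}
Q(\lambda, \hat{\lambda}) &= \sum_{\text{all } Q} \frac{P(O, Q \mid \lambda)}{P(O \mid \lambda)} \left[\log \hat{\pi}_{q_1} + \sum_{t=1}^{T-1} \log \hat{a}_{q_t q_{t+1}} + \sum_{t=1}^{T} \log \hat{b}_{q_t}(o_t)\right] \\
&= Q_{\pi}(\lambda, \hat{\pi}) + Q_{a}(\lambda, \hat{a}) + Q_{b}(\lambda, \hat{b})
\end{aligned}$$

$$Q_{\pi}(\lambda, \hat{\pi}) = \sum_{i=1}^{N} \frac{P(O, q_1 = i \mid \lambda)}{P(O \mid \lambda)} \log \hat{\pi}_i$$
Here the weight is $w_i = P(O, q_1 = i \mid \lambda) / P(O \mid \lambda) = \gamma_1(i)$ and the variable is $y_i = \hat{\pi}_i$.

$$Q_{a}(\lambda, \hat{a}) = \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{t=1}^{T-1} \frac{P(O, q_t = i, q_{t+1} = j \mid \lambda)}{P(O \mid \lambda)} \log \hat{a}_{ij}$$
Here, for each i, the weight is $w_j = \sum_{t=1}^{T-1} \xi_t(i, j)$ and the variable is $y_j = \hat{a}_{ij}$.

$$Q_{b}(\lambda, \hat{b}) = \sum_{j=1}^{N} \sum_{k=1}^{M} \sum_{t:\, o_t = v_k} \frac{P(O, q_t = j \mid \lambda)}{P(O \mid \lambda)} \log \hat{b}_j(k)$$
Here, for each j, the weight is $w_k = \sum_{t:\, o_t = v_k} \gamma_t(j)$ and the variable is $y_k = \hat{b}_j(k)$.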

30


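The auxiliary function is separated into three independent terms, each respectively corresponds to $\pi_i$, $a_{ij}$, and $b_j(k)$
– Maximization procedure on $Q(\lambda, \hat{\lambda})$ can be done by maximizing the individual terms separately subject to probability constraints
  $$\sum_{i=1}^{N} \pi_i = 1, \quad \sum_{j=1}^{N} a_{ij} = 1\ \ \forall i, \quad \sum_{k=1}^{M} b_j(k) = 1\ \ \forall j$$
– All these terms have the following form
  $$F(\mathbf{y}) = g(y_1, y_2, \ldots, y_N) = \sum_{j=1}^{N} w_j \log y_j, \quad \text{where } \sum_{j=1}^{N} y_j = 1 \text{ and } y_j \ge 0$$
  $F(\mathbf{y})$ has a maximum value when
  $$y_j = \frac{w_j}{\sum_{n=1}^{N} w_n}$$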

31
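Before the proof, the claim that the normalized weights maximize $\sum_j w_j \log y_j$ on the probability simplex can be checked numerically. The sketch below compares the claimed maximizer with randomly drawn points on the simplex; the weights `w` and the random search are illustrative assumptions, not part of the lecture.

```python
import math
import random


def objective(w, y):
    """F(y) = sum_j w_j log y_j."""
    return sum(wj * math.log(yj) for wj, yj in zip(w, y))


# Hypothetical non-negative weights (e.g. expected counts)
w = [2.0, 5.0, 3.0]

# Claimed maximizer: y_j = w_j / sum_n w_n
y_star = [wj / sum(w) for wj in w]

# Compare against random points that satisfy sum_j y_j = 1, y_j > 0
random.seed(0)
best_random = -math.inf
for _ in range(1000):
    raw = [random.random() + 1e-9 for _ in w]
    y = [r / sum(raw) for r in raw]
    best_random = max(best_random, objective(w, y))

print(y_star, objective(w, y_star), best_random)
```

No random point scores higher than `y_star`, which is what the Lagrange multiplier argument establishes in general.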


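Proof: Apply Lagrange Multiplier

By applying a Lagrange multiplier $\ell$ for the constraint $\sum_{j=1}^{N} y_j = 1$, suppose that
$$F = \sum_{j=1}^{N} w_j \log y_j + \ell \left(\sum_{j=1}^{N} y_j - 1\right)$$
Letting
$$\frac{\partial F}{\partial y_j} = \frac{w_j}{y_j} + \ell = 0 \quad \forall j$$
gives $w_j = -\ell\, y_j$. Then
$$\sum_{j=1}^{N} w_j = -\ell \sum_{j=1}^{N} y_j = -\ell \quad \Rightarrow \quad y_j = \frac{w_j}{\sum_{n=1}^{N} w_n}$$

Side note, the derivative of ln x, using $e = \lim_{h \to 0} (1 + h)^{1/h} = 2.71828\ldots$:
$$\begin{aligned}
\frac{d \ln x}{dx} &= \lim_{h \to 0} \frac{\ln(x + h) - \ln(x)}{h} = \lim_{h \to 0} \frac{1}{h} \ln \frac{x + h}{x} \\
&= \lim_{h \to 0} \frac{1}{x} \cdot \frac{x}{h} \ln\left(1 + \frac{h}{x}\right) = \frac{1}{x} \lim_{h \to 0} \ln\left(1 + \frac{h}{x}\right)^{x/h} \\
&= \frac{1}{x} \ln e = \frac{1}{x}
\end{aligned}$$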

32


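$$Q_{\pi}(\lambda, \hat{\pi}) = \sum_{i=1}^{N} \frac{P(O, q_1 = i \mid \lambda)}{P(O \mid \lambda)} \log \hat{\pi}_i$$
with $w_i = P(O, q_1 = i \mid \lambda) / P(O \mid \lambda)$ and $y_i = \hat{\pi}_i$.

Applying $y_i = w_i / \sum_{n=1}^{N} w_n$, where
$$\sum_{n=1}^{N} w_n = \sum_{n=1}^{N} \frac{P(O, q_1 = n \mid \lambda)}{P(O \mid \lambda)} = 1,$$
gives
$$\hat{\pi}_i = \frac{P(O, q_1 = i \mid \lambda)}{P(O \mid \lambda)} = \gamma_1(i), \quad \text{where } \gamma_t(i) = P(q_t = i \mid O, \lambda)$$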

33


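$$Q_{a}(\lambda, \hat{a}) = \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{t=1}^{T-1} \frac{P(O, q_t = i, q_{t+1} = j \mid \lambda)}{P(O \mid \lambda)} \log \hat{a}_{ij}$$
with, for each i, $w_j = \sum_{t=1}^{T-1} P(O, q_t = i, q_{t+1} = j \mid \lambda) / P(O \mid \lambda)$ and $y_j = \hat{a}_{ij}$.

Applying $y_j = w_j / \sum_{n=1}^{N} w_n$ gives
$$\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} P(O, q_t = i, q_{t+1} = j \mid \lambda)}{\sum_{t=1}^{T-1} P(O, q_t = i \mid \lambda)} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$$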

34


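$$Q_{b}(\lambda, \hat{b}) = \sum_{j=1}^{N} \sum_{k=1}^{M} \sum_{t:\, o_t = v_k} \frac{P(O, q_t = j \mid \lambda)}{P(O \mid \lambda)} \log \hat{b}_j(k)$$
with, for each j, $w_k = \sum_{t:\, o_t = v_k} P(O, q_t = j \mid \lambda) / P(O \mid \lambda)$ and $y_k = \hat{b}_j(k)$.

Applying $y_k = w_k / \sum_{n=1}^{M} w_n$ gives
$$\hat{b}_j(k) = \frac{\sum_{t=1,\ \text{s.t. } o_t = v_k}^{T} P(O, q_t = j \mid \lambda)}{\sum_{t=1}^{T} P(O, q_t = j \mid \lambda)} = \frac{\sum_{t=1,\ \text{s.t. } o_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}$$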

35


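The new model parameter set $\hat{\lambda} = (\hat{A}, \hat{B}, \hat{\pi})$ can be expressed as:
$$\hat{\pi}_i = \frac{P(O, q_1 = i \mid \lambda)}{P(O \mid \lambda)} = \gamma_1(i)$$
$$\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} P(O, q_t = i, q_{t+1} = j \mid \lambda)}{\sum_{t=1}^{T-1} P(O, q_t = i \mid \lambda)} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$$
$$\hat{b}_j(k) = \frac{\sum_{t=1,\ \text{s.t. } o_t = v_k}^{T} P(O, q_t = j \mid \lambda)}{\sum_{t=1}^{T} P(O, q_t = j \mid \lambda)} = \frac{\sum_{t=1,\ \text{s.t. } o_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}$$
where
$$\gamma_t(i) = P(q_t = i \mid O, \lambda), \qquad \xi_t(i, j) = P(q_t = i, q_{t+1} = j \mid O, \lambda)$$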

O

36

Discrete vs. Continuous Density HMMs

Two major types of HMMs according to the observations
– Discrete and finite observation:
  • The observations that all distinct states generate are finite in number, i.e., V={v1, v2, v3, …, vM}, vk∈R^L
  • In this case, the observation probability distribution in state j, B={bj(k)}, is defined as bj(k)=P(ot=vk|qt=j), 1≤k≤M, 1≤j≤N
    ot: observation at time t, qt: state at time t
    bj(k) consists of only M probability values
– Continuous and infinite observation:
  • The observations that all distinct states generate are infinite and continuous, i.e., V={v| v∈R^L}
  • In this case, the observation probability distribution in state j, B={bj(v)}, is defined as bj(v)=f(ot=v|qt=j), 1≤j≤N
    ot: observation at time t, qt: state at time t
    bj(v) is a continuous probability density function (pdf) and is often a mixture of multivariate Gaussian (normal) distributions

37

Gaussian Distribution

A continuous random variable X is said to have a Gaussian distribution with mean μ and variance σ² (σ>0) if X has a continuous pdf in the following form:
$$f(X = x \mid \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

38

Multivariate Gaussian Distribution

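If X=(X1,X2,X3,…,Xd) is a d-dimensional random vector with a multivariate Gaussian distribution with mean vector μ and covariance matrix Σ, then the pdf can be expressed as
$$f(\mathbf{X} = \mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = N(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)$$
where
$$\boldsymbol{\mu} = E[\mathbf{x}], \quad \boldsymbol{\Sigma} = E[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T] = E[\mathbf{x}\mathbf{x}^T] - \boldsymbol{\mu}\boldsymbol{\mu}^T, \quad |\boldsymbol{\Sigma}|: \text{determinant of } \boldsymbol{\Sigma}$$
$$\sigma_{ij}^2 = E[(x_i - \mu_i)(x_j - \mu_j)]$$

If X1,X2,X3,…,Xd are independent random variables, the covariance matrix is reduced to diagonal, i.e., $\sigma_{ij}^2 = 0,\ \forall i \ne j$, and
$$f(\mathbf{X} = \mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \prod_{i=1}^{d} \frac{1}{(2\pi\sigma_{ii}^2)^{1/2}} \exp\left(-\frac{(x_i - \mu_i)^2}{2\sigma_{ii}^2}\right)$$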

39

Multivariate Mixture Gaussian Distribution

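A d-dimensional random vector X=(X1,X2,X3,…,Xd) has a multivariate mixture Gaussian distribution if
$$f(\mathbf{x}) = \sum_{k=1}^{M} w_k\, N(\mathbf{x}; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k), \quad \sum_{k=1}^{M} w_k = 1$$

In CDHMM, bj(v) is a continuous probability density function (pdf) and is often a mixture of multivariate Gaussian distributions
$$b_j(\mathbf{v}) = \sum_{k=1}^{M} c_{jk}\, \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}_{jk}|^{1/2}} \exp\left(-\frac{1}{2} (\mathbf{v} - \boldsymbol{\mu}_{jk})^T \boldsymbol{\Sigma}_{jk}^{-1} (\mathbf{v} - \boldsymbol{\mu}_{jk})\right), \quad c_{jk} \ge 0 \text{ and } \sum_{k=1}^{M} c_{jk} = 1$$
– $\mathbf{v}$: observation vector
– $\boldsymbol{\mu}_{jk}$: mean vector of the k-th mixture of the j-th state
– $\boldsymbol{\Sigma}_{jk}$: covariance matrix of the k-th mixture of the j-th state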

40
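The state output density bj(v) can be sketched as follows. To stay dependency-free, the sketch assumes diagonal covariance matrices, so each component is a product of one-dimensional Gaussians as in the independent case; the function names and numbers are hypothetical.

```python
import math


def diag_gaussian(x, mean, var):
    """N(x; mean, diag(var)) for a d-dimensional vector x."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        # log of the 1-D Gaussian pdf, accumulated over the dimensions
        log_p += -0.5 * math.log(2 * math.pi * vi) - (xi - mi) ** 2 / (2 * vi)
    return math.exp(log_p)


def mixture_density(x, weights, means, variances):
    """b_j(x) = sum_k c_jk * N(x; mu_jk, Sigma_jk) with diagonal Sigma_jk."""
    return sum(c * diag_gaussian(x, m, v)
               for c, m, v in zip(weights, means, variances))


# Hypothetical 2-component mixture in 2 dimensions
weights = [0.3, 0.7]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [2.0, 0.5]]
print(mixture_density([0.5, -0.2], weights, means, variances))
```

Full covariance matrices need a determinant and an inverse in place of the per-dimension loop; the mixture sum itself stays the same.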

Solution to Problem 3 – The Segmental K-means Algorithm for CDHMM

Assume that we have a training set of observations and an initial estimate of model parameters
– Step 1: Segment the training data
  The set of training observation sequences is segmented into states, based on the current model, by the Viterbi algorithm
– Step 2: Re-estimate the model parameters
  $$\hat{\pi}_i = \frac{\text{Number of times } q_1 = i}{\text{Number of training sequences}}$$
  $$\hat{a}_{ij} = \frac{\text{Number of transitions from state } i \text{ to state } j}{\text{Number of transitions from state } i}$$
  By partitioning the observation vectors within each state j into M clusters:
  $$\hat{c}_{jm} = \frac{\text{number of vectors classified into cluster } m \text{ of state } j}{\text{number of vectors in state } j}$$
  $$\hat{\boldsymbol{\mu}}_{jm} = \text{sample mean of vectors classified into cluster } m \text{ of state } j$$
  $$\hat{\boldsymbol{\Sigma}}_{jm} = \text{sample covariance matrix of vectors classified into cluster } m \text{ of state } j$$
– Step 3: Evaluate the model
  If the difference between the new and current model scores exceeds a threshold, go back to Step 1; otherwise, return

41

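Solution to Problem 3 – The Segmental K-means Algorithm for CDHMM (cont’d)

Example: 3 states and 4 Gaussian mixtures per state

[Figure: state-time trellis (states s1, s2, s3 against time 1, 2, …, t) with the Viterbi segmentation of the training sequences O1 and O2. The vectors assigned to state 1 are clustered by K-means, starting from the global mean and splitting into cluster means, which yields the four mixture components {μ11, Σ11, c11}, {μ12, Σ12, c12}, {μ13, Σ13, c13}, {μ14, Σ14, c14}.]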

42

Solution to Problem 3 – The Baum-Welch Algorithm for CDHMM

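Define a new variable $\gamma_t(j, k)$
– Probability of being in state j at time t with the k-th mixture component accounting for ot, given O and λ

$$\begin{aligned}
\gamma_t(j, k) &= P(q_t = j, m_t = k \mid O, \lambda) \\
&= P(q_t = j \mid O, \lambda)\, P(m_t = k \mid q_t = j, O, \lambda) \\
&= \gamma_t(j)\, P(m_t = k \mid q_t = j, O, \lambda) \\
&= \gamma_t(j)\, \frac{P(m_t = k, O \mid q_t = j, \lambda)}{P(O \mid q_t = j, \lambda)} \\
&= \gamma_t(j)\, \frac{P(m_t = k, o_t \mid q_t = j, \lambda)}{P(o_t \mid q_t = j, \lambda)} \quad \text{(observation-independent assumption)} \\
&= \frac{\alpha_t(j)\, \beta_t(j)}{\sum_{s=1}^{N} \alpha_t(s)\, \beta_t(s)} \cdot \frac{c_{jk}\, N(o_t; \boldsymbol{\mu}_{jk}, \boldsymbol{\Sigma}_{jk})}{\sum_{m=1}^{M} c_{jm}\, N(o_t; \boldsymbol{\mu}_{jm}, \boldsymbol{\Sigma}_{jm})}
\end{aligned}$$

The observation-independent assumption is applied as
$$P(m_t = k \mid q_t = j, o_1, \ldots, o_T, \lambda) = \frac{P(m_t = k, o_1, \ldots, o_t, \ldots, o_T \mid q_t = j, \lambda)}{P(o_1, \ldots, o_t, \ldots, o_T \mid q_t = j, \lambda)} = \frac{P(m_t = k, o_t \mid q_t = j, \lambda)}{P(o_t \mid q_t = j, \lambda)}$$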

43

Solution to Problem 3 – The Baum-Welch Algorithm for CDHMM (cont’d)

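Re-estimation formulae for $c_{jk}$, $\boldsymbol{\mu}_{jk}$, $\boldsymbol{\Sigma}_{jk}$ are
$$\hat{c}_{jk} = \frac{\text{Expected number of times in state } j \text{ and mixture } k}{\text{Expected number of times in state } j} = \frac{\sum_{t=1}^{T} \gamma_t(j, k)}{\sum_{t=1}^{T} \sum_{m=1}^{M} \gamma_t(j, m)} = \frac{\sum_{t=1}^{T} \gamma_t(j, k)}{\sum_{t=1}^{T} \gamma_t(j)}$$
$$\hat{\boldsymbol{\mu}}_{jk} = \text{Weighted average (mean) of observations in state } j \text{ and mixture } k = \frac{\sum_{t=1}^{T} \gamma_t(j, k)\, o_t}{\sum_{t=1}^{T} \gamma_t(j, k)}$$
$$\hat{\boldsymbol{\Sigma}}_{jk} = \text{Weighted covariance of observations in state } j \text{ and mixture } k = \frac{\sum_{t=1}^{T} \gamma_t(j, k)\, (o_t - \hat{\boldsymbol{\mu}}_{jk})(o_t - \hat{\boldsymbol{\mu}}_{jk})^T}{\sum_{t=1}^{T} \gamma_t(j, k)}$$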

44
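The mixture re-estimation formulae can be sketched for one state j with scalar observations. The sketch assumes the state occupancies γt(j) have already been computed by the forward-backward pass and are passed in as `gamma_j`; the one-dimensional setting, the function names, and the data are all hypothetical simplifications.

```python
import math


def gauss(x, mean, var):
    """1-D Gaussian pdf N(x; mean, var)."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def reestimate_state_mixture(obs, gamma_j, c, mu, var):
    """Re-estimate (c_jk, mu_jk, sigma^2_jk) of one state from gamma_t(j)."""
    M, T = len(c), len(obs)
    # gamma_t(j,k) = gamma_t(j) * c_jk N(o_t) / sum_m c_jm N(o_t)
    g = []
    for t in range(T):
        comp = [c[k] * gauss(obs[t], mu[k], var[k]) for k in range(M)]
        total = sum(comp)
        g.append([gamma_j[t] * p / total for p in comp])
    # Expected number of times in state j and mixture k
    occ = [sum(g[t][k] for t in range(T)) for k in range(M)]
    new_c = [occ[k] / sum(gamma_j) for k in range(M)]
    new_mu = [sum(g[t][k] * obs[t] for t in range(T)) / occ[k]
              for k in range(M)]
    new_var = [sum(g[t][k] * (obs[t] - new_mu[k]) ** 2 for t in range(T)) / occ[k]
               for k in range(M)]
    return new_c, new_mu, new_var


# Hypothetical data: four frames, all fully assigned to state j
obs = [-2.1, -1.9, 2.0, 2.2]
gamma_j = [1.0, 1.0, 1.0, 1.0]
new_c, new_mu, new_var = reestimate_state_mixture(
    obs, gamma_j, c=[0.5, 0.5], mu=[-1.0, 1.0], var=[1.0, 1.0])
print(new_c, new_mu, new_var)
```

Real systems usually floor the variances so that a component which captures very few frames does not collapse; that safeguard is omitted here.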

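A Simple Example

[Figure: trellis with two states (S1, S2) over times 1, 2, 3 and observations o1, o2, o3; each node (state i, time t) is labelled with its forward and backward variables $\alpha_t(i)$ and $\beta_t(i)$.]

The Forward/Backward Procedure

$$\gamma_t(i) = P(q_t = i \mid O, \lambda) = \frac{P(O, q_t = i \mid \lambda)}{P(O \mid \lambda)} = \frac{P(O, q_t = i \mid \lambda)}{\sum_{j=1}^{N} P(O, q_t = j \mid \lambda)} = \frac{\alpha_t(i)\, \beta_t(i)}{\sum_{j=1}^{N} \alpha_t(j)\, \beta_t(j)}$$

$$\xi_t(i, j) = P(q_t = i, q_{t+1} = j \mid O, \lambda) = \frac{P(O, q_t = i, q_{t+1} = j \mid \lambda)}{P(O \mid \lambda)} = \frac{P(O, q_t = i, q_{t+1} = j \mid \lambda)}{\sum_{i=1}^{N} \sum_{j=1}^{N} P(O, q_t = i, q_{t+1} = j \mid \lambda)} = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}$$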

45

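A Simple Example (cont’d)

[Figure: trellis for two states (1, 2) over three time steps with observations v4, v7, v4; from the start node the paths enter state 1 or 2 with π1 or π2 and move between states with a11, a12, a21, a22.]

Total 8 paths. For each path q the slide lists p(O,q|λ) and log p(O,q|λ):

q: 1 1 1
$$p(O, q \mid \lambda) = \pi_1\, b_1(v_4)\, a_{11}\, b_1(v_7)\, a_{11}\, b_1(v_4)$$
$$\log p(O, q \mid \lambda) = \log \pi_1 + \log b_1(v_4) + \log a_{11} + \log b_1(v_7) + \log a_{11} + \log b_1(v_4)$$

q: 1 1 2
$$p(O, q \mid \lambda) = \pi_1\, b_1(v_4)\, a_{11}\, b_1(v_7)\, a_{12}\, b_2(v_4)$$
$$\log p(O, q \mid \lambda) = \log \pi_1 + \log b_1(v_4) + \log a_{11} + \log b_1(v_7) + \log a_{12} + \log b_2(v_4)$$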

46

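A Simple Example (cont’d)

Number the 8 paths 1 to 8 in the order 111, 112, 121, 122, 211, 212, 221, 222, let $p_n = P(O, \text{path } n \mid \lambda)$, and let $p_{all} = p_1 + p_2 + \ldots + p_8 = P(O \mid \lambda)$.

$$Q(\lambda, \hat{\lambda}) = \sum_{\text{all } Q} \frac{P(O, Q \mid \lambda)}{P(O \mid \lambda)} \left[\log \hat{\pi}_{q_1} + \sum_{t=1}^{T-1} \log \hat{a}_{q_t q_{t+1}} + \sum_{t=1}^{T} \log \hat{b}_{q_t}(o_t)\right] = Q_{\pi}(\lambda, \hat{\pi}) + Q_{a}(\lambda, \hat{a}) + Q_{b}(\lambda, \hat{b})$$

The π term:
$$\frac{p_1 + p_2 + p_3 + p_4}{p_{all}} \log \hat{\pi}_1 + \frac{p_5 + p_6 + p_7 + p_8}{p_{all}} \log \hat{\pi}_2 = \gamma_1(1) \log \hat{\pi}_1 + \gamma_1(2) \log \hat{\pi}_2$$
since $P(O, q_1 = i \mid \lambda) / P(O \mid \lambda) = \gamma_1(i)$.

The a term:
$$\begin{aligned}
&\left[\frac{p_1 + p_2}{p_{all}} + \frac{p_1 + p_5}{p_{all}}\right] \log \hat{a}_{11} + \left[\frac{p_3 + p_4}{p_{all}} + \frac{p_2 + p_6}{p_{all}}\right] \log \hat{a}_{12} \\
+ &\left[\frac{p_5 + p_6}{p_{all}} + \frac{p_3 + p_7}{p_{all}}\right] \log \hat{a}_{21} + \left[\frac{p_7 + p_8}{p_{all}} + \frac{p_4 + p_8}{p_{all}}\right] \log \hat{a}_{22}
\end{aligned}$$
For example, the coefficient of $\log \hat{a}_{11}$ is
$$\frac{P(O, q_1 = 1, q_2 = 1 \mid \lambda)}{P(O \mid \lambda)} + \frac{P(O, q_2 = 1, q_3 = 1 \mid \lambda)}{P(O \mid \lambda)} = \xi_1(1, 1) + \xi_2(1, 1)$$
so the term is $\sum_t \xi_t(1, 1) \log \hat{a}_{11}$, and in general $\sum_t \xi_t(i, j) \log \hat{a}_{ij}$.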
