HMM - Part 2
– Review of the last lecture
– The EM algorithm
– Continuous density HMM
Three Basic Problems for HMMs
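Given an observation sequence $\mathbf{O}=(o_1,o_2,\ldots,o_T)$ and an HMM $\lambda=(A,B,\pi)$
– Problem 1: How to compute $P(\mathbf{O}\mid\lambda)$ efficiently? → The forward algorithm
  e.g., $P(\text{up, up, up, up, up}\mid\lambda)$? Among several models: $i^* = \arg\max_i P(\mathbf{O}\mid\lambda_i)$
– Problem 2: How to choose an optimal state sequence $\mathbf{Q}=(q_1,q_2,\ldots,q_T)$ which best explains the observations? → The Viterbi algorithm
  $\mathbf{Q}^* = \arg\max_{\mathbf{Q}} P(\mathbf{O},\mathbf{Q}\mid\lambda)$
– Problem 3: How to adjust the model parameters $\lambda=(A,B,\pi)$ to maximize $P(\mathbf{O}\mid\lambda)$? → The Baum-Welch (forward-backward) algorithm
  cf. The segmental K-means algorithm maximizes $P(\mathbf{O},\mathbf{Q}^*\mid\lambda)$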
The Forward Algorithm
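The forward variable:
– Probability of $o_1,o_2,\ldots,o_t$ being observed and the state at time $t$ being $i$, given model $\lambda$
$$\alpha_t(i) = P(o_1 o_2 \ldots o_t,\ q_t=i \mid \lambda)$$
The forward algorithm
1. Initialization: $\alpha_1(i) = P(o_1, q_1=i \mid \lambda) = \pi_i\, b_i(o_1),\quad 1\le i\le N$
2. Induction: $\alpha_{t+1}(j) = \left[\sum_{i=1}^{N}\alpha_t(i)\,a_{ij}\right] b_j(o_{t+1}),\quad 1\le t\le T-1,\ 1\le j\le N$
3. Termination: $P(\mathbf{O}\mid\lambda) = \sum_{i=1}^{N}\alpha_T(i)$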
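The three steps of the forward algorithm can be sketched in a few lines of plain Python. This is a minimal illustration, not the lecture's own code: the function name `forward`, the 0-based state and symbol indices, and the toy two-state model are all assumptions made for the example, and no scaling is applied, so it is only suitable for short sequences.

```python
def forward(pi, A, B, obs):
    """Return (alpha, P(O | lambda)) for a discrete HMM.

    pi[i]: initial probability of state i
    A[i][j]: transition probability from state i to state j
    B[j][k]: probability of emitting symbol k in state j
    obs: list of symbol indices (0-based, an assumption of this sketch)
    """
    N = len(pi)
    # 1. Initialization: alpha_1(i) = pi_i * b_i(o_1)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    # 2. Induction: alpha_{t+1}(j) = [sum_i alpha_t(i) a_ij] * b_j(o_{t+1})
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(N)) * B[j][o]
                      for j in range(N)])
    # 3. Termination: P(O | lambda) = sum_i alpha_T(i)
    return alpha, sum(alpha[-1])


# Toy two-state, two-symbol model (made-up numbers, for illustration only).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
alpha, likelihood = forward(pi, A, B, [0, 1, 0])
```

The cost is on the order of N²T multiplications, compared with the N^T paths that a direct sum over all state sequences would visit.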
The Viterbi Algorithm
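1. Initialization
$$\delta_1(i) = \pi_i\, b_i(o_1),\quad 1\le i\le N$$
$$\psi_1(i) = 0,\quad 1\le i\le N$$
2. Induction
$$\delta_{t+1}(j) = \max_{1\le i\le N}\left[\delta_t(i)\,a_{ij}\right] b_j(o_{t+1}),\quad 1\le t\le T-1,\ 1\le j\le N$$
$$\psi_{t+1}(j) = \arg\max_{1\le i\le N}\left[\delta_t(i)\,a_{ij}\right],\quad 1\le t\le T-1,\ 1\le j\le N$$
3. Termination
$$P(\mathbf{O},\mathbf{Q}^*\mid\lambda) = \max_{1\le i\le N}\delta_T(i),\qquad q_T^* = \arg\max_{1\le i\le N}\delta_T(i)$$
4. Backtracking
$$q_t^* = \psi_{t+1}(q_{t+1}^*),\quad t = T-1, T-2, \ldots, 1$$
$\mathbf{Q}^* = (q_1^*, q_2^*, \ldots, q_T^*)$ is the best state sequence
cf. the forward algorithm:
$$\alpha_{t+1}(j) = \left[\sum_{i=1}^{N}\alpha_t(i)\,a_{ij}\right] b_j(o_{t+1}),\qquad P(\mathbf{O}\mid\lambda) = \sum_{i=1}^{N}\alpha_T(i)$$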
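The four Viterbi steps map directly onto code. The sketch below is an illustrative plain-Python version under stated assumptions: the name `viterbi`, the 0-based indices, and the toy model are not from the lecture, and probabilities are multiplied directly rather than summed in the log domain, which a practical implementation would normally prefer.

```python
def viterbi(pi, A, B, obs):
    """Return (best_path, P(O, Q* | lambda)) for a discrete HMM."""
    N, T = len(pi), len(obs)
    # 1. Initialization: delta_1(i) = pi_i * b_i(o_1), psi_1(i) = 0
    delta = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    psi = [[0] * N]
    # 2. Induction: keep the best predecessor for every state j
    for t in range(1, T):
        d_row, p_row = [], []
        for j in range(N):
            scores = [delta[t - 1][i] * A[i][j] for i in range(N)]
            best_i = max(range(N), key=scores.__getitem__)
            p_row.append(best_i)
            d_row.append(scores[best_i] * B[j][obs[t]])
        delta.append(d_row)
        psi.append(p_row)
    # 3. Termination: best final state and its score
    last = max(range(N), key=delta[-1].__getitem__)
    best_prob = delta[-1][last]
    # 4. Backtracking: follow the stored predecessors from T back to 1
    path = [last]
    for t in range(T - 1, 0, -1):
        path.append(psi[t][path[-1]])
    path.reverse()
    return path, best_prob


# Toy two-state, two-symbol model (made-up numbers, for illustration only).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
path, best_prob = viterbi(pi, A, B, [0, 1, 0])
```

The only difference from the forward recursion is that the sum over predecessor states is replaced by a max, plus the bookkeeping needed to recover the argmax path.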
The Segmental K-means Algorithm
Assume that we have a training set of observations and an initial estimate of model parameters
– Step 1: Segment the training data
  The set of training observation sequences is segmented into states, based on the current model, by the Viterbi algorithm
– Step 2: Re-estimate the model parameters
$$\hat\pi_i = \frac{\text{Number of times } q_1 = i}{\text{Number of training sequences}}$$
$$\hat a_{ij} = \frac{\text{Number of transitions from state } i \text{ to state } j}{\text{Number of transitions from state } i}$$
$$\hat b_j(k) = \frac{\text{Number of "}k\text{" in state } j}{\text{Number of times in state } j}$$
  subject to $\sum_{i=1}^{N}\hat\pi_i = 1$, $\sum_{j=1}^{N}\hat a_{ij} = 1$, $\sum_{k=1}^{M}\hat b_j(k) = 1$
– Step 3: Evaluate the model
  If the difference between the new and current model scores exceeds a threshold, go back to Step 1; otherwise, return
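Step 2 of the segmental K-means procedure is plain counting once Step 1 has produced a state alignment. The sketch below assumes the Viterbi alignments are already available as lists of state indices; the function names and the tiny data set are hypothetical and only illustrate the three ratio formulas.

```python
def normalize(row):
    """Turn a row of counts into relative frequencies."""
    total = sum(row)
    # A state that was never visited keeps a zero row here; a practical
    # system would smooth or floor the counts instead.
    return [c / total if total else 0.0 for c in row]


def reestimate_from_alignments(obs_seqs, state_seqs, N, M):
    """Count-based re-estimates of (pi, A, B) from state alignments.

    obs_seqs[l][t]: symbol index at time t of training sequence l
    state_seqs[l][t]: state assigned to that frame (e.g. by Viterbi)
    N: number of states, M: number of distinct symbols
    """
    pi = [0.0] * N
    trans = [[0.0] * N for _ in range(N)]
    emit = [[0.0] * M for _ in range(N)]
    for obs, states in zip(obs_seqs, state_seqs):
        pi[states[0]] += 1                      # times q_1 = i
        for t, (o, q) in enumerate(zip(obs, states)):
            emit[q][o] += 1                     # symbol k seen in state j
            if t + 1 < len(states):
                trans[q][states[t + 1]] += 1    # transition i -> j
    pi = [c / len(obs_seqs) for c in pi]
    return pi, [normalize(r) for r in trans], [normalize(r) for r in emit]


# Two short training sequences with hypothetical alignments.
obs_seqs = [[0, 1, 1, 1], [0, 0, 1]]
state_seqs = [[0, 0, 1, 1], [0, 1, 1]]
pi_hat, A_hat, B_hat = reestimate_from_alignments(obs_seqs, state_seqs, N=2, M=2)
```

In a full training loop this function would alternate with Viterbi segmentation until the model score stops changing by more than the chosen threshold.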
Segmental K-means vs. Baum-Welch
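Segmental K-means (counts along the Viterbi segmentation):
$$\hat\pi_i = \frac{\text{Number of times } q_1 = i}{\text{Number of training sequences}}$$
$$\hat a_{ij} = \frac{\text{Number of transitions from state } i \text{ to state } j}{\text{Number of transitions from state } i}$$
$$\hat b_j(k) = \frac{\text{Number of "}k\text{" in state } j}{\text{Number of times in state } j}$$
Baum-Welch (the same ratios, with expected counts):
$$\hat\pi_i = \frac{\text{Expected number of times } q_1 = i}{\text{Number of training sequences}}$$
$$\hat a_{ij} = \frac{\text{Expected number of transitions from state } i \text{ to state } j}{\text{Expected number of transitions from state } i}$$
$$\hat b_j(k) = \frac{\text{Expected number of "}k\text{" in state } j}{\text{Expected number of times in state } j}$$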
The Backward Algorithm
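The backward variable:
– Probability of $o_{t+1},o_{t+2},\ldots,o_T$ being observed, given the state at time $t$ being $i$ and model $\lambda$
$$\beta_t(i) = P(o_{t+1}, o_{t+2}, \ldots, o_T \mid q_t=i, \lambda)$$
The backward algorithm
1. Initialization: $\beta_T(i) = 1,\quad 1\le i\le N$
2. Induction: $\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j),\quad t = T-1, \ldots, 1,\ 1\le i\le N$
3. Termination: $P(\mathbf{O}\mid\lambda) = \sum_{i=1}^{N} \pi_i\, b_i(o_1)\, \beta_1(i)$
cf. the forward algorithm: $P(\mathbf{O}\mid\lambda) = \sum_{i=1}^{N}\alpha_T(i)$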
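The backward recursion mirrors the forward one but runs from the last frame to the first. A minimal plain-Python sketch follows; the name `backward`, the 0-based indices, and the toy model are assumptions of the example, and no scaling is applied.

```python
def backward(pi, A, B, obs):
    """Return (beta, P(O | lambda)) for a discrete HMM."""
    N, T = len(pi), len(obs)
    # 1. Initialization: beta_T(i) = 1
    beta = [[1.0] * N for _ in range(T)]
    # 2. Induction, for t = T-1, ..., 1 (0-based: T-2, ..., 0):
    #    beta_t(i) = sum_j a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(N))
    # 3. Termination: P(O | lambda) = sum_i pi_i * b_i(o_1) * beta_1(i)
    likelihood = sum(pi[i] * B[i][obs[0]] * beta[0][i] for i in range(N))
    return beta, likelihood


# Toy two-state, two-symbol model (made-up numbers, for illustration only).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
beta, likelihood = backward(pi, A, B, [0, 1, 0])
```

For the same model and observations, the termination value agrees with the one obtained from the forward variables.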
The Forward-Backward Algorithm
Relation between the forward and backward variables
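$$\alpha_t(i) = P(o_1 o_2 \ldots o_t,\ q_t=i \mid \lambda) = \left[\sum_{j=1}^{N}\alpha_{t-1}(j)\,a_{ji}\right] b_i(o_t)$$
$$\beta_t(i) = P(o_{t+1}, o_{t+2}, \ldots, o_T \mid q_t=i, \lambda) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)$$
$$\alpha_t(i)\,\beta_t(i) = P(\mathbf{O},\ q_t=i \mid \lambda)$$
$$P(\mathbf{O}\mid\lambda) = \sum_{i=1}^{N}\alpha_t(i)\,\beta_t(i)$$
(Huang et al., 2001)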
The Baum-Welch Algorithm (1/3)
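Define two new variables:
– $\gamma_t(i) = P(q_t=i \mid \mathbf{O}, \lambda)$: probability of being in state $i$ at time $t$, given $\mathbf{O}$ and $\lambda$
$$\gamma_t(i) = \frac{P(q_t=i, \mathbf{O}\mid\lambda)}{P(\mathbf{O}\mid\lambda)} = \frac{\alpha_t(i)\,\beta_t(i)}{P(\mathbf{O}\mid\lambda)} = \frac{\alpha_t(i)\,\beta_t(i)}{\sum_{i=1}^{N}\alpha_t(i)\,\beta_t(i)}$$
– $\xi_t(i,j) = P(q_t=i, q_{t+1}=j \mid \mathbf{O}, \lambda)$: probability of being in state $i$ at time $t$ and state $j$ at time $t+1$, given $\mathbf{O}$ and $\lambda$
$$\xi_t(i,j) = \frac{P(q_t=i, q_{t+1}=j, \mathbf{O}\mid\lambda)}{P(\mathbf{O}\mid\lambda)} = \frac{\alpha_t(i)\,a_{ij}\,b_j(o_{t+1})\,\beta_{t+1}(j)}{\sum_{m=1}^{N}\sum_{n=1}^{N}\alpha_t(m)\,a_{mn}\,b_n(o_{t+1})\,\beta_{t+1}(n)}$$
The two are related by
$$\gamma_t(i) = \sum_{j=1}^{N}\xi_t(i,j)$$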
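Given the forward and backward variables, both posteriors are simple normalized products. The following plain-Python sketch is illustrative only: the function names, the 0-based indexing (so `xi[t]` covers the transition from frame t to t+1), and the toy model are assumptions, and the unscaled recursions limit it to short sequences.

```python
def forward_backward(pi, A, B, obs):
    """Unscaled forward (alpha) and backward (beta) variables."""
    N, T = len(pi), len(obs)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for t in range(1, T):
        alpha.append([sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                      for j in range(N)])
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(N))
    return alpha, beta


def state_posteriors(pi, A, B, obs):
    """Return (gamma, xi) as nested lists."""
    N, T = len(pi), len(obs)
    alpha, beta = forward_backward(pi, A, B, obs)
    like = sum(alpha[-1])                       # P(O | lambda)
    # gamma_t(i) = alpha_t(i) * beta_t(i) / P(O | lambda)
    gamma = [[alpha[t][i] * beta[t][i] / like for i in range(N)]
             for t in range(T)]
    # xi_t(i, j) = alpha_t(i) a_ij b_j(o_{t+1}) beta_{t+1}(j) / P(O | lambda)
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / like
            for j in range(N)] for i in range(N)]
          for t in range(T - 1)]
    return gamma, xi


# Toy two-state, two-symbol model (made-up numbers, for illustration only).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
gamma, xi = state_posteriors(pi, A, B, [0, 1, 0])
```

Each `gamma[t]` sums to one over states, and summing `xi[t][i]` over the destination state recovers `gamma[t][i]`, which is a convenient sanity check on an implementation.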
The Baum-Welch Algorithm (2/3)
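$\gamma_t(i) = P(q_t=i \mid \mathbf{O}, \lambda)$: probability of being in state $i$ at time $t$, given $\mathbf{O}$ and $\lambda$
$\xi_t(i,j) = P(q_t=i, q_{t+1}=j \mid \mathbf{O}, \lambda)$: probability of being in state $i$ at time $t$ and state $j$ at time $t+1$, given $\mathbf{O}$ and $\lambda$
With $L$ training sequences, the $l$-th of length $T_l$:
$$\sum_{l=1}^{L}\gamma_1^{l}(i) = \text{expected number of times } q_1 = i$$
$$\sum_{l=1}^{L}\sum_{t=1}^{T_l-1}\gamma_t^{l}(i) = \text{expected number of transitions from state } i$$
$$\sum_{l=1}^{L}\sum_{t=1}^{T_l-1}\xi_t^{l}(i,j) = \text{expected number of transitions from state } i \text{ to state } j$$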
The Baum-Welch Algorithm (3/3)
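Re-estimation formulae for $\pi$, $A$, and $B$ are
$$\hat\pi_i = \frac{\text{Expected number of times } q_1 = i}{\text{Number of training sequences}} = \frac{\sum_{l=1}^{L}\gamma_1^{l}(i)}{L}$$
$$\hat a_{ij} = \frac{\text{Expected number of transitions from state } i \text{ to state } j}{\text{Expected number of transitions from state } i} = \frac{\sum_{l=1}^{L}\sum_{t=1}^{T_l-1}\xi_t^{l}(i,j)}{\sum_{l=1}^{L}\sum_{t=1}^{T_l-1}\gamma_t^{l}(i)}$$
$$\hat b_j(k) = \frac{\text{Expected number of "}v_k\text{" in state } j}{\text{Expected number of times in state } j} = \frac{\sum_{l=1}^{L}\sum_{t=1,\ \text{s.t. } o_t = v_k}^{T_l}\gamma_t^{l}(j)}{\sum_{l=1}^{L}\sum_{t=1}^{T_l}\gamma_t^{l}(j)}$$
How do you know that $P(\mathbf{O}\mid\hat\lambda) \ge P(\mathbf{O}\mid\lambda)$?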
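One pass of these re-estimation formulae over a small training set can be sketched as follows. This is an illustrative plain-Python version, not the lecture's implementation: all names, the 0-based indices, and the toy model are assumptions, the recursions are unscaled, and every training sequence is assumed to have at least two frames so that the transition denominators are non-zero.

```python
import math


def forward_backward(pi, A, B, obs):
    """Unscaled forward and backward variables for one sequence."""
    N, T = len(pi), len(obs)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for t in range(1, T):
        alpha.append([sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                      for j in range(N)])
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(N))
    return alpha, beta


def baum_welch_step(pi, A, B, obs_seqs):
    """Accumulate expected counts over all sequences, then normalize."""
    N, M = len(pi), len(B[0])
    pi_num = [0.0] * N
    a_num = [[0.0] * N for _ in range(N)]
    a_den = [0.0] * N
    b_num = [[0.0] * M for _ in range(N)]
    b_den = [0.0] * N
    for obs in obs_seqs:
        alpha, beta = forward_backward(pi, A, B, obs)
        like = sum(alpha[-1])                    # P(O | lambda)
        for t in range(len(obs)):
            for i in range(N):
                gamma = alpha[t][i] * beta[t][i] / like
                if t == 0:
                    pi_num[i] += gamma           # expected times q_1 = i
                b_num[i][obs[t]] += gamma        # expected "v_k" in state i
                b_den[i] += gamma                # expected times in state i
                if t < len(obs) - 1:
                    a_den[i] += gamma            # expected transitions from i
                    for j in range(N):
                        a_num[i][j] += (alpha[t][i] * A[i][j] * B[j][obs[t + 1]]
                                        * beta[t + 1][j] / like)   # xi_t(i, j)
    new_pi = [x / len(obs_seqs) for x in pi_num]
    new_A = [[a_num[i][j] / a_den[i] for j in range(N)] for i in range(N)]
    new_B = [[b_num[j][k] / b_den[j] for k in range(M)] for j in range(N)]
    return new_pi, new_A, new_B


def log_likelihood(pi, A, B, obs_seqs):
    """Sum of log P(O | lambda) over the training sequences."""
    return sum(math.log(sum(forward_backward(pi, A, B, obs)[0][-1]))
               for obs in obs_seqs)


# Toy model and training data (made-up numbers, for illustration only).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
train = [[0, 1, 0], [1, 1, 0, 0]]
before = log_likelihood(pi, A, B, train)
new_pi, new_A, new_B = baum_welch_step(pi, A, B, train)
after = log_likelihood(new_pi, new_A, new_B, train)
```

Comparing `before` and `after` gives a numerical view of the question raised on the slide: the total log-likelihood of the training data should not decrease after a re-estimation pass.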
Maximum Likelihood Estimation for HMM
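$$\lambda_{ML} = \arg\max_{\lambda} L(\lambda) \qquad (L(\lambda) = P(\mathbf{O}\mid\lambda))$$
$$= \arg\max_{\lambda} l(\lambda) \qquad (l(\lambda) = \log P(\mathbf{O}\mid\lambda))$$
$$= \arg\max_{\lambda} \log \sum_{\mathbf{Q}} P(\mathbf{O},\mathbf{Q}\mid\lambda)$$
However, we cannot find the solution directly.
An alternative way is to find a sequence $\lambda^0, \lambda^1, \ldots, \lambda^t, \ldots$ s.t.
$$l(\lambda^0) \le l(\lambda^1) \le \ldots \le l(\lambda^t) \le \ldots$$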
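$$\begin{aligned}
l(\lambda^{t+1}) - l(\lambda^{t}) &= \log P(\mathbf{O}\mid\lambda^{t+1}) - \log P(\mathbf{O}\mid\lambda^{t}) \\
&= \log \sum_{\mathbf{Q}} P(\mathbf{O},\mathbf{Q}\mid\lambda^{t+1}) - \log P(\mathbf{O}\mid\lambda^{t}) \\
&= \log \sum_{\mathbf{Q}} \frac{P(\mathbf{O},\mathbf{Q}\mid\lambda^{t+1})}{P(\mathbf{O}\mid\lambda^{t})} \\
&= \log \sum_{\mathbf{Q}} \frac{P(\mathbf{O},\mathbf{Q}\mid\lambda^{t})}{P(\mathbf{O}\mid\lambda^{t})} \cdot \frac{P(\mathbf{O},\mathbf{Q}\mid\lambda^{t+1})}{P(\mathbf{O},\mathbf{Q}\mid\lambda^{t})} \\
&= \log \sum_{\mathbf{Q}} P(\mathbf{Q}\mid\mathbf{O},\lambda^{t}) \frac{P(\mathbf{O},\mathbf{Q}\mid\lambda^{t+1})}{P(\mathbf{O},\mathbf{Q}\mid\lambda^{t})} \\
&= \log E_{P(\mathbf{Q}\mid\mathbf{O},\lambda^{t})}\left[\frac{P(\mathbf{O},\mathbf{Q}\mid\lambda^{t+1})}{P(\mathbf{O},\mathbf{Q}\mid\lambda^{t})}\right] \\
&\ge E_{P(\mathbf{Q}\mid\mathbf{O},\lambda^{t})}\left[\log \frac{P(\mathbf{O},\mathbf{Q}\mid\lambda^{t+1})}{P(\mathbf{O},\mathbf{Q}\mid\lambda^{t})}\right] \qquad \text{(Jensen's inequality)}
\end{aligned}$$
Jensen's inequality: if $f$ is a concave function and $X$ is a r.v., then $E[f(X)] \le f(E[X])$
Choose the next model to maximize the lower bound:
$$\begin{aligned}
\lambda^{t+1} &= \arg\max_{\lambda} E_{P(\mathbf{Q}\mid\mathbf{O},\lambda^{t})}\left[\log \frac{P(\mathbf{O},\mathbf{Q}\mid\lambda)}{P(\mathbf{O},\mathbf{Q}\mid\lambda^{t})}\right] \\
&= \arg\max_{\lambda} \sum_{\mathbf{Q}} P(\mathbf{Q}\mid\mathbf{O},\lambda^{t}) \log \frac{P(\mathbf{O},\mathbf{Q}\mid\lambda)}{P(\mathbf{O},\mathbf{Q}\mid\lambda^{t})} \\
&= \arg\max_{\lambda} \sum_{\mathbf{Q}} P(\mathbf{Q}\mid\mathbf{O},\lambda^{t}) \log P(\mathbf{O},\mathbf{Q}\mid\lambda) \\
&= \arg\max_{\lambda} E_{P(\mathbf{Q}\mid\mathbf{O},\lambda^{t})}\left[\log P(\mathbf{O},\mathbf{Q}\mid\lambda)\right] \qquad \text{(the Q function)}
\end{aligned}$$
This maximization is solvable, and it can be proved that
$$l(\lambda^{t+1}) \ge l(\lambda^{t}), \quad \text{i.e.,} \quad P(\mathbf{O}\mid\lambda^{t+1}) \ge P(\mathbf{O}\mid\lambda^{t})$$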
The EM Algorithm
EM: Expectation Maximization
– Why EM?
  • Simple optimization algorithms for likelihood functions rely on the intermediate variables, called latent data
    For HMM, the state sequence is the latent data
  • Direct access to the data necessary to estimate the parameters is impossible or difficult
    For HMM, it is almost impossible to estimate $(A, B, \pi)$ without considering the state sequence
– Two Major Steps:
  • E step: compute the expectation of the likelihood by including the latent variables as if they were observed
$$\sum_{\mathbf{Q}} P(\mathbf{Q}\mid\mathbf{O},\lambda) \log P(\mathbf{O},\mathbf{Q}\mid\hat\lambda)$$
  • M step: compute the maximum likelihood estimates of the parameters by maximizing the expected likelihood found in the E step
Three Steps for EM
Step 1. Draw a lower bound
– Use Jensen's inequality
Step 2. Find the best lower bound → auxiliary function
– Let the lower bound touch the objective function at the current guess
Step 3. Maximize the auxiliary function
– Obtain the new guess
– Go to Step 2 until convergence
[Minka 1998]
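Form an Initial Guess of $\lambda=(A,B,\pi)$
[Figure: the objective function $F(\cdot)$ plotted over the model parameters, with the current guess $\lambda$ marked]
Given the current guess $\lambda$, the goal is to find a new guess $\lambda^{NEW}$ such that $F(\lambda^{NEW}) \ge F(\lambda)$; the overall target is $\lambda^* = \arg\max_{\lambda} P(\mathbf{O}\mid\lambda)$
Step 1. Draw a Lower Bound
[Figure: a lower bound function $g(\cdot)$ lying below the objective function, $g(\cdot) \le F(\cdot)$]
Step 2. Find the Best Lower Bound
[Figure: the best lower bound, the auxiliary function $g(\lambda,\cdot)$, touching the objective function at the current guess $\lambda$]
Step 3. Maximize the Auxiliary Function
[Figure: the maximizer of the auxiliary function gives $\lambda^{NEW}$, with $F(\lambda^{NEW}) \ge F(\lambda)$]
Update the Model
[Figure: $\lambda^{NEW}$ becomes the current guess on the objective function]
Steps 2 and 3 are then repeated from the updated guess: find the best lower bound (auxiliary function) at the new current guess, maximize it to obtain the next $\lambda^{NEW}$, and again $F(\lambda^{NEW}) \ge F(\lambda)$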
Step 1. Draw a Lower Bound (cont’d)
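Objective function:
$$\log P(\mathbf{O}\mid\hat\lambda) = \log \sum_{\mathbf{Q}} P(\mathbf{O},\mathbf{Q}\mid\hat\lambda) = \log \sum_{\mathbf{Q}} p(\mathbf{Q}) \frac{P(\mathbf{O},\mathbf{Q}\mid\hat\lambda)}{p(\mathbf{Q})}$$
where $p(\mathbf{Q})$ is an arbitrary probability distribution
Apply Jensen's Inequality (if $f$ is a concave function and $X$ is a r.v., then $E[f(X)] \le f(E[X])$):
$$\log \sum_{\mathbf{Q}} p(\mathbf{Q}) \frac{P(\mathbf{O},\mathbf{Q}\mid\hat\lambda)}{p(\mathbf{Q})} \ge \sum_{\mathbf{Q}} p(\mathbf{Q}) \log \frac{P(\mathbf{O},\mathbf{Q}\mid\hat\lambda)}{p(\mathbf{Q})}$$
The right-hand side is a lower bound function of $F(\hat\lambda) = \log P(\mathbf{O}\mid\hat\lambda)$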
Step 2. Find the Best Lower Bound (cont’d)
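– Find the $p(\mathbf{Q})$ that makes the lower bound function touch the objective function at the current guess $\lambda$
We want to maximize
$$\sum_{\mathbf{Q}} p(\mathbf{Q}) \log \frac{P(\mathbf{O},\mathbf{Q}\mid\lambda)}{p(\mathbf{Q})}$$
w.r.t. $p(\mathbf{Q})$ at $\lambda$
The best
$$p^*(\mathbf{Q}) = \arg\max_{p(\mathbf{Q})} \sum_{\mathbf{Q}} p(\mathbf{Q}) \log \frac{P(\mathbf{O},\mathbf{Q}\mid\lambda)}{p(\mathbf{Q})}$$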
Step 2. Find the Best Lower Bound (cont’d)
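Since $\sum_{\mathbf{Q}} p(\mathbf{Q}) = 1$, we introduce a Lagrange multiplier $\eta$ here:
$$J = \eta\left(1 - \sum_{\mathbf{Q}} p(\mathbf{Q})\right) + \sum_{\mathbf{Q}} p(\mathbf{Q}) \log P(\mathbf{O},\mathbf{Q}\mid\lambda) - \sum_{\mathbf{Q}} p(\mathbf{Q}) \log p(\mathbf{Q})$$
Take the derivative w.r.t. $p(\mathbf{Q})$ and set it to zero:
$$\log P(\mathbf{O},\mathbf{Q}\mid\lambda) - \log p(\mathbf{Q}) - 1 - \eta = 0$$
$$\log p(\mathbf{Q}) = \log P(\mathbf{O},\mathbf{Q}\mid\lambda) - 1 - \eta$$
$$p(\mathbf{Q}) = P(\mathbf{O},\mathbf{Q}\mid\lambda)\, e^{-1-\eta}$$
Summing over $\mathbf{Q}$ and using the constraint:
$$\sum_{\mathbf{Q}} p(\mathbf{Q}) = e^{-1-\eta} \sum_{\mathbf{Q}} P(\mathbf{O},\mathbf{Q}\mid\lambda) = 1 \ \Rightarrow\ e^{-1-\eta} = \frac{1}{\sum_{\mathbf{Q}} P(\mathbf{O},\mathbf{Q}\mid\lambda)}$$
$$p(\mathbf{Q}) = \frac{P(\mathbf{O},\mathbf{Q}\mid\lambda)}{\sum_{\mathbf{Q}} P(\mathbf{O},\mathbf{Q}\mid\lambda)} = \frac{P(\mathbf{O},\mathbf{Q}\mid\lambda)}{P(\mathbf{O}\mid\lambda)} = P(\mathbf{Q}\mid\mathbf{O},\lambda)$$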
Step 2. Find the Best Lower Bound (cont’d)
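Define
$$g(\lambda,\hat\lambda) = \sum_{\mathbf{Q}} P(\mathbf{Q}\mid\mathbf{O},\lambda) \log \frac{P(\mathbf{O},\mathbf{Q}\mid\hat\lambda)}{P(\mathbf{Q}\mid\mathbf{O},\lambda)}$$
We can check
$$\begin{aligned}
g(\lambda,\lambda) &= \sum_{\mathbf{Q}} P(\mathbf{Q}\mid\mathbf{O},\lambda) \log \frac{P(\mathbf{O},\mathbf{Q}\mid\lambda)}{P(\mathbf{Q}\mid\mathbf{O},\lambda)} \\
&= \sum_{\mathbf{Q}} P(\mathbf{Q}\mid\mathbf{O},\lambda) \log \frac{P(\mathbf{Q}\mid\mathbf{O},\lambda)\,P(\mathbf{O}\mid\lambda)}{P(\mathbf{Q}\mid\mathbf{O},\lambda)} \\
&= \sum_{\mathbf{Q}} P(\mathbf{Q}\mid\mathbf{O},\lambda) \log P(\mathbf{O}\mid\lambda) = \log P(\mathbf{O}\mid\lambda)
\end{aligned}$$
$$g(\lambda,\hat\lambda) = \underbrace{\sum_{\mathbf{Q}} P(\mathbf{Q}\mid\mathbf{O},\lambda) \log P(\mathbf{O},\mathbf{Q}\mid\hat\lambda)}_{\text{Q function}} - \sum_{\mathbf{Q}} P(\mathbf{Q}\mid\mathbf{O},\lambda) \log P(\mathbf{Q}\mid\mathbf{O},\lambda)$$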
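The claim that the posterior is the distribution for which the lower bound touches the objective can be checked numerically on a tiny example. The sketch below assumes a latent variable with only four possible "paths" and made-up joint probabilities; the function name `lower_bound` is hypothetical.

```python
import math


def lower_bound(joint, p):
    """sum_Q p(Q) * log(P(O, Q | lambda) / p(Q)), skipping terms with p(Q) = 0."""
    return sum(pq * math.log(j / pq) for j, pq in zip(joint, p) if pq > 0)


# Hypothetical values of P(O, Q | lambda) for four state sequences Q.
joint = [0.10, 0.05, 0.02, 0.03]
objective = math.log(sum(joint))              # log P(O | lambda)

posterior = [j / sum(joint) for j in joint]   # P(Q | O, lambda)
uniform = [0.25] * len(joint)                 # some other valid p(Q)

tight = lower_bound(joint, posterior)         # bound built from the posterior
loose = lower_bound(joint, uniform)           # bound built from another p(Q)
```

With the posterior the bound equals the objective up to rounding, while any other choice of p(Q), such as the uniform one here, stays strictly below it, as Jensen's inequality requires.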
EM for HMM Training
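Basic idea
– Assume we have $\lambda$ and the probability that each $\mathbf{Q}$ occurred in the generation of $\mathbf{O}$, i.e., we have in fact observed a complete data pair $(\mathbf{O},\mathbf{Q})$ with frequency proportional to the probability $P(\mathbf{O},\mathbf{Q}\mid\lambda)$
– We then find a new $\hat\lambda$ that maximizes the expectation
$$\sum_{\mathbf{Q}} P(\mathbf{Q}\mid\mathbf{O},\lambda) \log P(\mathbf{O},\mathbf{Q}\mid\hat\lambda)$$
– It can be guaranteed that
$$P(\mathbf{O}\mid\hat\lambda) \ge P(\mathbf{O}\mid\lambda)$$
EM can discover parameters of model $\lambda$ to maximize the log-likelihood of the incomplete data, $\log P(\mathbf{O}\mid\lambda)$, by iteratively maximizing the expectation of the log-likelihood of the complete data, $\log P(\mathbf{O},\mathbf{Q}\mid\lambda)$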
Solution to Problem 3 - The EM Algorithm
The auxiliary function
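$$Q(\lambda,\hat\lambda) = \sum_{\mathbf{Q}} P(\mathbf{Q}\mid\mathbf{O},\lambda) \log P(\mathbf{O},\mathbf{Q}\mid\hat\lambda) = \sum_{\mathbf{Q}} \frac{P(\mathbf{O},\mathbf{Q}\mid\lambda)}{P(\mathbf{O}\mid\lambda)} \log P(\mathbf{O},\mathbf{Q}\mid\hat\lambda)$$
where $P(\mathbf{O},\mathbf{Q}\mid\lambda)$ and $\log P(\mathbf{O},\mathbf{Q}\mid\hat\lambda)$ can be expressed as
$$P(\mathbf{O},\mathbf{Q}\mid\lambda) = \pi_{q_1} \left[\prod_{t=1}^{T-1} a_{q_t q_{t+1}}\right] \left[\prod_{t=1}^{T} b_{q_t}(o_t)\right]$$
$$\log P(\mathbf{O},\mathbf{Q}\mid\hat\lambda) = \log \hat\pi_{q_1} + \sum_{t=1}^{T-1} \log \hat a_{q_t q_{t+1}} + \sum_{t=1}^{T} \log \hat b_{q_t}(o_t)$$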
Solution to Problem 3 - The EM Algorithm (cont’d)
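The auxiliary function can be rewritten as
$$Q(\lambda,\hat\lambda) = \sum_{\text{all } \mathbf{Q}} \frac{P(\mathbf{O},\mathbf{Q}\mid\lambda)}{P(\mathbf{O}\mid\lambda)} \left[\log \hat\pi_{q_1} + \sum_{t=1}^{T-1} \log \hat a_{q_t q_{t+1}} + \sum_{t=1}^{T} \log \hat b_{q_t}(o_t)\right] = Q_{\pi}(\lambda,\hat{\boldsymbol\pi}) + Q_{a}(\lambda,\hat{\mathbf a}) + Q_{b}(\lambda,\hat{\mathbf b})$$
(see "A Simple Example" for a worked case) where
$$Q_{\pi}(\lambda,\hat{\boldsymbol\pi}) = \sum_{i=1}^{N} \underbrace{\frac{P(\mathbf{O}, q_1=i\mid\lambda)}{P(\mathbf{O}\mid\lambda)}}_{w_i\ =\ \gamma_1(i)} \log \underbrace{\hat\pi_i}_{y_i}$$
$$Q_{a}(\lambda,\hat{\mathbf a}) = \sum_{i=1}^{N}\sum_{j=1}^{N}\sum_{t=1}^{T-1} \underbrace{\frac{P(\mathbf{O}, q_t=i, q_{t+1}=j\mid\lambda)}{P(\mathbf{O}\mid\lambda)}}_{w_j\ =\ \xi_t(i,j)} \log \underbrace{\hat a_{ij}}_{y_j}$$
$$Q_{b}(\lambda,\hat{\mathbf b}) = \sum_{j=1}^{N}\sum_{k=1}^{M}\sum_{t:\ o_t = v_k} \underbrace{\frac{P(\mathbf{O}, q_t=j\mid\lambda)}{P(\mathbf{O}\mid\lambda)}}_{w_k\ =\ \gamma_t(j)} \log \underbrace{\hat b_j(k)}_{y_k}$$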
Solution to Problem 3 - The EM Algorithm (cont’d)
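The auxiliary function is separated into three independent terms, corresponding respectively to $\hat\pi_i$, $\hat a_{ij}$, and $\hat b_j(k)$
– Maximization of $Q(\lambda,\hat\lambda)$ can be done by maximizing the individual terms separately, subject to the probability constraints
$$\sum_{i=1}^{N}\hat\pi_i = 1,\qquad \sum_{j=1}^{N}\hat a_{ij} = 1\ \ \forall i,\qquad \sum_{k=1}^{M}\hat b_j(k) = 1\ \ \forall j$$
– All these terms have the following form
$$F(\mathbf{y}) = g(y_1, y_2, \ldots, y_N) = \sum_{j=1}^{N} w_j \log y_j,\quad \text{where } \sum_{j=1}^{N} y_j = 1 \text{ and } y_j \ge 0$$
$F(\mathbf{y})$ has a maximum value when
$$y_j = \frac{w_j}{\sum_{n=1}^{N} w_n}$$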
Solution to Problem 3 - The EM Algorithm (cont’d)
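Proof: Apply Lagrange Multiplier
By applying a Lagrange multiplier $\ell$, suppose that
$$F = \sum_{j=1}^{N} w_j \log y_j + \ell\left(\sum_{j=1}^{N} y_j - 1\right) \qquad \text{(the second term enforces the constraint)}$$
Letting
$$\frac{\partial F}{\partial y_j} = \frac{w_j}{y_j} + \ell = 0 \ \ \forall j \quad\Rightarrow\quad w_j = -\ell\, y_j$$
Then
$$\sum_{j=1}^{N} w_j = -\ell \sum_{j=1}^{N} y_j = -\ell \quad\Rightarrow\quad y_j = \frac{w_j}{\sum_{n=1}^{N} w_n}$$
Side note on the derivative of the logarithm, using $e = \lim_{h\to 0}(1+h)^{1/h} = 2.71828\ldots$:
$$\begin{aligned}
\frac{d \ln x}{dx} &= \lim_{h\to 0} \frac{\ln(x+h) - \ln(x)}{h} = \lim_{h\to 0} \frac{1}{h}\ln\frac{x+h}{x} = \lim_{h\to 0} \ln\left(1+\frac{h}{x}\right)^{1/h} \\
&= \frac{1}{x}\lim_{h\to 0} \ln\left(1+\frac{h}{x}\right)^{x/h} = \frac{1}{x}\ln e = \frac{1}{x}
\end{aligned}$$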
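The closed-form maximizer can also be checked numerically. The sketch below compares the value of the weighted log sum at y_j = w_j / Σ w_n against many random points on the probability simplex; the weights, the helper names, and the random sampling scheme are all assumptions made for the illustration, and a random search is of course not a proof.

```python
import math
import random


def weighted_log_sum(w, y):
    """F(y) = sum_j w_j * log(y_j)."""
    return sum(wj * math.log(yj) for wj, yj in zip(w, y))


def random_simplex_point(n):
    """A random vector with positive entries that sum to one."""
    raw = [random.random() + 1e-9 for _ in range(n)]   # avoid exact zeros
    total = sum(raw)
    return [r / total for r in raw]


w = [0.5, 1.5, 3.0]                                    # arbitrary positive weights
y_star = [wj / sum(w) for wj in w]                     # claimed maximizer
best = weighted_log_sum(w, y_star)

random.seed(0)                                         # reproducible sampling
others = [weighted_log_sum(w, random_simplex_point(len(w)))
          for _ in range(1000)]
```

None of the sampled values exceeds `best`, which is consistent with the Lagrange-multiplier result.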
Solution to Problem 3 - The EM Algorithm (cont’d)
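$$Q_{\pi}(\lambda,\hat{\boldsymbol\pi}) = \sum_{i=1}^{N} \underbrace{\frac{P(\mathbf{O}, q_1=i\mid\lambda)}{P(\mathbf{O}\mid\lambda)}}_{w_i} \log \underbrace{\hat\pi_i}_{y_i}$$
Applying $y_i = w_i / \sum_{n=1}^{N} w_n$, and noting that
$$\sum_{n=1}^{N} w_n = \sum_{n=1}^{N} \frac{P(\mathbf{O}, q_1=n\mid\lambda)}{P(\mathbf{O}\mid\lambda)} = 1,$$
we obtain
$$\hat\pi_i = \frac{P(\mathbf{O}, q_1=i\mid\lambda)}{P(\mathbf{O}\mid\lambda)} = \gamma_1(i)$$
where $\gamma_t(i) = P(q_t=i\mid\mathbf{O},\lambda)$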
Solution to Problem 3 - The EM Algorithm (cont’d)
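$$Q_{a}(\lambda,\hat{\mathbf a}) = \sum_{i=1}^{N}\sum_{j=1}^{N}\sum_{t=1}^{T-1} \underbrace{\frac{P(\mathbf{O}, q_t=i, q_{t+1}=j\mid\lambda)}{P(\mathbf{O}\mid\lambda)}}_{w_j} \log \underbrace{\hat a_{ij}}_{y_j}$$
Applying $y_j = w_j / \sum_{n=1}^{N} w_n$ for each state $i$:
$$\hat a_{ij} = \frac{\sum_{t=1}^{T-1} P(\mathbf{O}, q_t=i, q_{t+1}=j\mid\lambda)}{\sum_{t=1}^{T-1} P(\mathbf{O}, q_t=i\mid\lambda)} = \frac{\sum_{t=1}^{T-1}\xi_t(i,j)}{\sum_{t=1}^{T-1}\gamma_t(i)}$$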
Solution to Problem 3 - The EM Algorithm (cont’d)
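$$Q_{b}(\lambda,\hat{\mathbf b}) = \sum_{j=1}^{N}\sum_{k=1}^{M}\sum_{t:\ o_t = v_k} \underbrace{\frac{P(\mathbf{O}, q_t=j\mid\lambda)}{P(\mathbf{O}\mid\lambda)}}_{w_k} \log \underbrace{\hat b_j(k)}_{y_k}$$
Applying $y_k = w_k / \sum_{n=1}^{M} w_n$ for each state $j$:
$$\hat b_j(k) = \frac{\sum_{t=1,\ \text{s.t. } o_t = v_k}^{T} P(\mathbf{O}, q_t=j\mid\lambda)}{\sum_{t=1}^{T} P(\mathbf{O}, q_t=j\mid\lambda)} = \frac{\sum_{t=1,\ \text{s.t. } o_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}$$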
Solution to Problem 3 - The EM Algorithm (cont’d)
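The new model parameter set $\hat\lambda = (\hat{\boldsymbol\pi}, \hat{\mathbf A}, \hat{\mathbf B})$ can be expressed as:
$$\hat\pi_i = \frac{P(\mathbf{O}, q_1=i\mid\lambda)}{P(\mathbf{O}\mid\lambda)} = \gamma_1(i)$$
$$\hat a_{ij} = \frac{\sum_{t=1}^{T-1} P(\mathbf{O}, q_t=i, q_{t+1}=j\mid\lambda)}{\sum_{t=1}^{T-1} P(\mathbf{O}, q_t=i\mid\lambda)} = \frac{\sum_{t=1}^{T-1}\xi_t(i,j)}{\sum_{t=1}^{T-1}\gamma_t(i)}$$
$$\hat b_j(k) = \frac{\sum_{t=1,\ \text{s.t. } o_t = v_k}^{T} P(\mathbf{O}, q_t=j\mid\lambda)}{\sum_{t=1}^{T} P(\mathbf{O}, q_t=j\mid\lambda)} = \frac{\sum_{t=1,\ \text{s.t. } o_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}$$
where
$$\gamma_t(i) = P(q_t=i\mid\mathbf{O},\lambda),\qquad \xi_t(i,j) = P(q_t=i, q_{t+1}=j\mid\mathbf{O},\lambda)$$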
Discrete vs. Continuous Density HMMs
Two major types of HMMs according to the observations
– Discrete and finite observation:
  • The observations that all distinct states generate are finite in number, i.e., $V=\{v_1, v_2, v_3, \ldots, v_M\}$, $v_k \in R^L$
  • In this case, the observation probability distribution in state $j$, $B=\{b_j(k)\}$, is defined as $b_j(k)=P(o_t=v_k\mid q_t=j)$, $1\le k\le M$, $1\le j\le N$
    ($o_t$: observation at time $t$, $q_t$: state at time $t$)
    $b_j(k)$ consists of only $M$ probability values
– Continuous and infinite observation:
  • The observations that all distinct states generate are infinite and continuous, i.e., $V=\{v \mid v \in R^L\}$
  • In this case, the observation probability distribution in state $j$, $B=\{b_j(v)\}$, is defined as $b_j(v)=f(o_t=v\mid q_t=j)$, $1\le j\le N$
    ($o_t$: observation at time $t$, $q_t$: state at time $t$)
    $b_j(v)$ is a continuous probability density function (pdf) and is often a mixture of multivariate Gaussian (normal) distributions
Gaussian Distribution
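A continuous random variable $X$ is said to have a Gaussian distribution with mean $\mu$ and variance $\sigma^2$ ($\sigma>0$) if $X$ has a continuous pdf in the following form:
$$f(X=x\mid\mu,\sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$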
Multivariate Gaussian Distribution
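If $\mathbf{X}=(X_1,X_2,X_3,\ldots,X_d)$ is a $d$-dimensional random vector with a multivariate Gaussian distribution with mean vector $\boldsymbol\mu$ and covariance matrix $\boldsymbol\Sigma$, then the pdf can be expressed as
$$f(\mathbf{X}=\mathbf{x}\mid\boldsymbol\mu,\boldsymbol\Sigma) = N(\mathbf{x};\boldsymbol\mu,\boldsymbol\Sigma) = \frac{1}{(2\pi)^{d/2}\,|\boldsymbol\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol\mu)^{T}\boldsymbol\Sigma^{-1}(\mathbf{x}-\boldsymbol\mu)\right)$$
where
$$\boldsymbol\mu = E[\mathbf{x}],\qquad \boldsymbol\Sigma = E\left[(\mathbf{x}-\boldsymbol\mu)(\mathbf{x}-\boldsymbol\mu)^{T}\right] = E[\mathbf{x}\mathbf{x}^{T}] - \boldsymbol\mu\boldsymbol\mu^{T},\qquad |\boldsymbol\Sigma|: \text{determinant of } \boldsymbol\Sigma$$
$$\sigma_{ij}^2 = E\left[(x_i-\mu_i)(x_j-\mu_j)\right]$$
If $X_1,X_2,X_3,\ldots,X_d$ are independent random variables, the covariance matrix reduces to a diagonal matrix, i.e., $\sigma_{ij}^2 = 0,\ \forall i\ne j$, and
$$f(\mathbf{X}=\mathbf{x}\mid\boldsymbol\mu,\boldsymbol\Sigma) = \prod_{i=1}^{d} \frac{1}{(2\pi\sigma_{ii}^2)^{1/2}} \exp\left(-\frac{(x_i-\mu_i)^2}{2\sigma_{ii}^2}\right)$$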
Multivariate Mixture Gaussian Distribution
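A $d$-dimensional random vector $\mathbf{X}=(X_1,X_2,X_3,\ldots,X_d)$ has a multivariate mixture Gaussian distribution if
$$f(\mathbf{x}) = \sum_{k=1}^{M} w_k\, N(\mathbf{x};\boldsymbol\mu_k,\boldsymbol\Sigma_k),\qquad \sum_{k=1}^{M} w_k = 1$$
In CDHMM, $b_j(\mathbf{v})$ is a continuous probability density function (pdf) and is often a mixture of multivariate Gaussian distributions
$$b_j(\mathbf{v}) = \sum_{k=1}^{M} c_{jk}\, N(\mathbf{v};\boldsymbol\mu_{jk},\boldsymbol\Sigma_{jk}) = \sum_{k=1}^{M} c_{jk} \frac{1}{(2\pi)^{d/2}\,|\boldsymbol\Sigma_{jk}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{v}-\boldsymbol\mu_{jk})^{T}\boldsymbol\Sigma_{jk}^{-1}(\mathbf{v}-\boldsymbol\mu_{jk})\right)$$
$$c_{jk} \ge 0 \quad\text{and}\quad \sum_{k=1}^{M} c_{jk} = 1$$
– $\mathbf{v}$: observation vector
– $\boldsymbol\mu_{jk}$: mean vector of the $k$-th mixture of the $j$-th state
– $\boldsymbol\Sigma_{jk}$: covariance matrix of the $k$-th mixture of the $j$-th state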
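For the common special case of diagonal covariance matrices, the state output density can be evaluated with nothing more than the standard library. The sketch below is an illustration under that diagonal assumption; the function names and the two-component example are made up, and a full-covariance version would need a matrix inverse and determinant instead.

```python
import math


def log_gaussian_diag(x, mean, var):
    """log N(x; mean, diag(var)) for a d-dimensional vector x."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def mixture_density(x, weights, means, variances):
    """b_j(x) = sum_k c_jk * N(x; mu_jk, Sigma_jk), with diagonal Sigma_jk."""
    return sum(c * math.exp(log_gaussian_diag(x, m, v))
               for c, m, v in zip(weights, means, variances))


# Hypothetical state with two 2-dimensional mixture components.
weights = [0.3, 0.7]                      # c_j1, c_j2 (sum to one)
means = [[0.0, 0.0], [2.0, 2.0]]          # mu_j1, mu_j2
variances = [[1.0, 1.0], [1.0, 1.0]]      # diagonals of Sigma_j1, Sigma_j2
density_at_origin = mixture_density([0.0, 0.0], weights, means, variances)
```

Working with the log of each component and exponentiating at the end keeps the per-dimension product numerically tidy; for long observation sequences the whole mixture is usually kept in the log domain as well.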
Solution to Problem 3 – The Segmental K-means Algorithm for CDHMM
Assume that we have a training set of observations and an initial estimate of model parameters
– Step 1: Segment the training data
  The set of training observation sequences is segmented into states, based on the current model, by the Viterbi algorithm
– Step 2: Re-estimate the model parameters
$$\hat\pi_i = \frac{\text{Number of times } q_1 = i}{\text{Number of training sequences}},\qquad \hat a_{ij} = \frac{\text{Number of transitions from state } i \text{ to state } j}{\text{Number of transitions from state } i}$$
  By partitioning the observation vectors within each state $j$ into $M$ clusters:
$$\hat c_{jm} = \frac{\text{number of vectors classified into cluster } m \text{ of state } j}{\text{number of vectors in state } j}$$
$$\hat{\boldsymbol\mu}_{jm} = \text{sample mean of vectors classified into cluster } m \text{ of state } j$$
$$\hat{\boldsymbol\Sigma}_{jm} = \text{sample covariance matrix of vectors classified into cluster } m \text{ of state } j$$
– Step 3: Evaluate the model
  If the difference between the new and current model scores exceeds a threshold, go back to Step 1; otherwise, return
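Solution to Problem 3 – The Segmental K-means Algorithm for CDHMM (cont’d)
Example: 3 states and 4 Gaussian mixtures per state
[Figure: trellis of states s1, s2, s3 against time 1, 2, ..., t for observations o1, o2, ..., ot. The vectors assigned to a state are split by K-means, starting from the global mean, into clusters (Cluster 1 mean, Cluster 2 mean, ...), giving the mixture parameters $\{\boldsymbol\mu_{11},\boldsymbol\Sigma_{11},c_{11}\}$, $\{\boldsymbol\mu_{12},\boldsymbol\Sigma_{12},c_{12}\}$, $\{\boldsymbol\mu_{13},\boldsymbol\Sigma_{13},c_{13}\}$, $\{\boldsymbol\mu_{14},\boldsymbol\Sigma_{14},c_{14}\}$]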
Solution to Problem 3 – The Baum-Welch Algorithm for CDHMM
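Define a new variable $\gamma_t(j,k)$
– Probability of being in state $j$ at time $t$ with the $k$-th mixture component accounting for $\mathbf{o}_t$, given $\mathbf{O}$ and $\lambda$
$$\begin{aligned}
\gamma_t(j,k) &= P(q_t=j, m_t=k\mid\mathbf{O},\lambda) \\
&= P(q_t=j\mid\mathbf{O},\lambda)\, P(m_t=k\mid q_t=j,\mathbf{O},\lambda) \\
&= \gamma_t(j)\, P(m_t=k\mid q_t=j,\mathbf{O},\lambda) \\
&= \gamma_t(j)\, P(m_t=k\mid q_t=j,\mathbf{o}_t,\lambda) \qquad \text{(observation-independent assumption)} \\
&= \gamma_t(j)\, \frac{P(m_t=k,\mathbf{o}_t\mid q_t=j,\lambda)}{P(\mathbf{o}_t\mid q_t=j,\lambda)} \\
&= \frac{\alpha_t(j)\,\beta_t(j)}{\sum_{s=1}^{N}\alpha_t(s)\,\beta_t(s)} \cdot \frac{c_{jk}\, N(\mathbf{o}_t;\boldsymbol\mu_{jk},\boldsymbol\Sigma_{jk})}{\sum_{m=1}^{M} c_{jm}\, N(\mathbf{o}_t;\boldsymbol\mu_{jm},\boldsymbol\Sigma_{jm})}
\end{aligned}$$
Note on the observation-independent assumption: given $q_t=j$, the other observations do not depend on the mixture component, so
$$P(m_t=k\mid q_t=j,\mathbf{O},\lambda) = \frac{P(m_t=k,\mathbf{o}_1,\ldots,\mathbf{o}_t,\ldots,\mathbf{o}_T\mid q_t=j,\lambda)}{P(\mathbf{o}_1,\ldots,\mathbf{o}_t,\ldots,\mathbf{o}_T\mid q_t=j,\lambda)} = \frac{P(m_t=k,\mathbf{o}_t\mid q_t=j,\lambda)}{P(\mathbf{o}_t\mid q_t=j,\lambda)} = P(m_t=k\mid q_t=j,\mathbf{o}_t,\lambda)$$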
Solution to Problem 3 – The Baum-Welch Algorithm for CDHMM (cont’d)
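Re-estimation formulae for $\hat c_{jk}$, $\hat{\boldsymbol\mu}_{jk}$, $\hat{\boldsymbol\Sigma}_{jk}$ are
$$\hat c_{jk} = \frac{\text{Expected number of times in state } j \text{ and mixture } k}{\text{Expected number of times in state } j} = \frac{\sum_{t=1}^{T}\gamma_t(j,k)}{\sum_{t=1}^{T}\gamma_t(j)} = \frac{\sum_{t=1}^{T}\gamma_t(j,k)}{\sum_{t=1}^{T}\sum_{m=1}^{M}\gamma_t(j,m)}$$
$$\hat{\boldsymbol\mu}_{jk} = \text{Weighted average (mean) of observations in state } j \text{ and mixture } k = \frac{\sum_{t=1}^{T}\gamma_t(j,k)\,\mathbf{o}_t}{\sum_{t=1}^{T}\gamma_t(j,k)}$$
$$\hat{\boldsymbol\Sigma}_{jk} = \text{Weighted covariance of observations in state } j \text{ and mixture } k = \frac{\sum_{t=1}^{T}\gamma_t(j,k)\,(\mathbf{o}_t-\hat{\boldsymbol\mu}_{jk})(\mathbf{o}_t-\hat{\boldsymbol\mu}_{jk})^{T}}{\sum_{t=1}^{T}\gamma_t(j,k)}$$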
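Once the mixture-level posteriors are available, the three re-estimation formulae are weighted averages. The sketch below works on scalar observations for a single state j so that it stays within the standard library; the function name, the input layout `gamma_jk[t][k]`, and the numbers are assumptions made for the example, and vector observations would replace the squared difference by an outer product.

```python
def reestimate_mixture(obs, gamma_jk):
    """Re-estimate (c_jk, mu_jk, var_jk) for one state j from scalar data.

    obs[t]: scalar observation o_t
    gamma_jk[t][k]: gamma_t(j, k), posterior of state j and mixture k at time t
    """
    K = len(gamma_jk[0])
    # Expected number of times in state j and mixture k.
    occ = [sum(g[k] for g in gamma_jk) for k in range(K)]
    # Expected number of times in state j (sum over mixtures).
    total = sum(occ)
    c = [occ[k] / total for k in range(K)]
    mu = [sum(g[k] * o for g, o in zip(gamma_jk, obs)) / occ[k]
          for k in range(K)]
    var = [sum(g[k] * (o - mu[k]) ** 2 for g, o in zip(gamma_jk, obs)) / occ[k]
           for k in range(K)]
    return c, mu, var


# Hypothetical posteriors: state j is occupied with probability 0.5 at every
# frame, with the first two frames explained by mixture 0 and the last two
# by mixture 1.
obs = [1.0, 3.0, 10.0, 14.0]
gamma_jk = [[0.5, 0.0], [0.5, 0.0], [0.0, 0.5], [0.0, 0.5]]
c_hat, mu_hat, var_hat = reestimate_mixture(obs, gamma_jk)
```

Because the weights are posteriors rather than hard assignments, every frame can contribute fractionally to every mixture, which is the main difference from the cluster-based estimates of the segmental K-means procedure.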
A Simple Example
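[Figure: trellis of states S1, S2 against time 1, 2, 3 for observations o1, o2, o3; each node (state $i$, time $t$) is labelled with $\alpha_t(i)\,\beta_t(i)$]
The Forward/Backward Procedure
$$\gamma_t(i) = P(q_t=i\mid\mathbf{O},\lambda) = \frac{P(q_t=i,\mathbf{O}\mid\lambda)}{P(\mathbf{O}\mid\lambda)} = \frac{P(q_t=i,\mathbf{O}\mid\lambda)}{\sum_{j=1}^{N} P(q_t=j,\mathbf{O}\mid\lambda)} = \frac{\alpha_t(i)\,\beta_t(i)}{\sum_{j=1}^{N}\alpha_t(j)\,\beta_t(j)}$$
$$\xi_t(i,j) = P(q_t=i, q_{t+1}=j\mid\mathbf{O},\lambda) = \frac{P(q_t=i, q_{t+1}=j,\mathbf{O}\mid\lambda)}{P(\mathbf{O}\mid\lambda)} = \frac{P(q_t=i, q_{t+1}=j,\mathbf{O}\mid\lambda)}{\sum_{i=1}^{N}\sum_{j=1}^{N} P(q_t=i, q_{t+1}=j,\mathbf{O}\mid\lambda)} = \frac{\alpha_t(i)\,a_{ij}\,b_j(o_{t+1})\,\beta_{t+1}(j)}{\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_t(i)\,a_{ij}\,b_j(o_{t+1})\,\beta_{t+1}(j)}$$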
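A Simple Example (cont’d)
[Figure: a two-state HMM with start probabilities $\pi_1$, $\pi_2$ and transitions $a_{11}$, $a_{12}$, $a_{21}$, $a_{22}$, unrolled over three time steps with observations $v_4$, $v_7$, $v_4$]
Total 8 paths. Writing $b_{j,k}$ for $b_j(v_k)$, each path $\mathbf{q}$ has probability $p(\mathbf{O},\mathbf{q}\mid\lambda)$ and log-probability $\log p(\mathbf{O},\mathbf{q}\mid\hat\lambda)$ as follows:
Path 1, q: 1 1 1: $\pi_1 b_{1,4}\, a_{11} b_{1,7}\, a_{11} b_{1,4}$; $\log\hat\pi_1 + \log\hat b_{1,4} + \log\hat a_{11} + \log\hat b_{1,7} + \log\hat a_{11} + \log\hat b_{1,4}$
Path 2, q: 1 1 2: $\pi_1 b_{1,4}\, a_{11} b_{1,7}\, a_{12} b_{2,4}$; $\log\hat\pi_1 + \log\hat b_{1,4} + \log\hat a_{11} + \log\hat b_{1,7} + \log\hat a_{12} + \log\hat b_{2,4}$
Path 3, q: 1 2 1: $\pi_1 b_{1,4}\, a_{12} b_{2,7}\, a_{21} b_{1,4}$; $\log\hat\pi_1 + \log\hat b_{1,4} + \log\hat a_{12} + \log\hat b_{2,7} + \log\hat a_{21} + \log\hat b_{1,4}$
Path 4, q: 1 2 2: $\pi_1 b_{1,4}\, a_{12} b_{2,7}\, a_{22} b_{2,4}$; $\log\hat\pi_1 + \log\hat b_{1,4} + \log\hat a_{12} + \log\hat b_{2,7} + \log\hat a_{22} + \log\hat b_{2,4}$
Path 5, q: 2 1 1: $\pi_2 b_{2,4}\, a_{21} b_{1,7}\, a_{11} b_{1,4}$; $\log\hat\pi_2 + \log\hat b_{2,4} + \log\hat a_{21} + \log\hat b_{1,7} + \log\hat a_{11} + \log\hat b_{1,4}$
Path 6, q: 2 1 2: $\pi_2 b_{2,4}\, a_{21} b_{1,7}\, a_{12} b_{2,4}$; $\log\hat\pi_2 + \log\hat b_{2,4} + \log\hat a_{21} + \log\hat b_{1,7} + \log\hat a_{12} + \log\hat b_{2,4}$
Path 7, q: 2 2 1: $\pi_2 b_{2,4}\, a_{22} b_{2,7}\, a_{21} b_{1,4}$; $\log\hat\pi_2 + \log\hat b_{2,4} + \log\hat a_{22} + \log\hat b_{2,7} + \log\hat a_{21} + \log\hat b_{1,4}$
Path 8, q: 2 2 2: $\pi_2 b_{2,4}\, a_{22} b_{2,7}\, a_{22} b_{2,4}$; $\log\hat\pi_2 + \log\hat b_{2,4} + \log\hat a_{22} + \log\hat b_{2,7} + \log\hat a_{22} + \log\hat b_{2,4}$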
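The eight-path table can be reproduced by brute-force enumeration, which also shows that the path probabilities add up to the value computed by the forward algorithm. The sketch below uses 0-based states (so path 1 of the slide is `(0, 0, 0)`) and made-up parameter values; only the emission probabilities of v4 and v7 are specified, since no other symbol is observed.

```python
from itertools import product


def path_probability(pi, A, B, obs, path):
    """p(O, q | lambda) for one state path q."""
    p = pi[path[0]] * B[path[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= A[path[t - 1]][path[t]] * B[path[t]][obs[t]]
    return p


def forward_likelihood(pi, A, B, obs):
    """P(O | lambda) by the forward algorithm."""
    N = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][o]
                 for j in range(N)]
    return sum(alpha)


# Hypothetical two-state model; B[j][k] stands for b_j(v_k).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [{4: 0.5, 7: 0.2}, {4: 0.1, 7: 0.6}]
obs = [4, 7, 4]                                     # v4, v7, v4

paths = list(product(range(2), repeat=len(obs)))   # same order as the table
probs = [path_probability(pi, A, B, obs, q) for q in paths]
total = sum(probs)                                  # sum over all 8 paths
gamma_1_state1 = sum(probs[:4]) / total             # paths that start in state 1
```

The last line groups the paths that start in the first state, which is the path-sum view of the posterior of the initial state.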
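A Simple Example (cont’d)
Write $p_n$ for $p(\mathbf{O},\mathbf{q}\mid\lambda)$ of path $n$, and $p_{all} = p_1 + p_2 + \cdots + p_8 = P(\mathbf{O}\mid\lambda)$ for the sum over all paths. The auxiliary function
$$Q(\lambda,\hat\lambda) = \sum_{\text{all } \mathbf{Q}} \frac{P(\mathbf{O},\mathbf{Q}\mid\lambda)}{P(\mathbf{O}\mid\lambda)} \left[\log \hat\pi_{q_1} + \sum_{t=1}^{T-1} \log \hat a_{q_t q_{t+1}} + \sum_{t=1}^{T} \log \hat b_{q_t}(o_t)\right] = Q_{\pi}(\lambda,\hat{\boldsymbol\pi}) + Q_{a}(\lambda,\hat{\mathbf a}) + Q_{b}(\lambda,\hat{\mathbf b})$$
then collects the following terms.
Initial-state terms:
$$\frac{p_1+p_2+p_3+p_4}{p_{all}} \log\hat\pi_1 + \frac{p_5+p_6+p_7+p_8}{p_{all}} \log\hat\pi_2 = \gamma_1(1)\log\hat\pi_1 + \gamma_1(2)\log\hat\pi_2$$
since $(p_1+p_2+p_3+p_4)/p_{all} = P(\mathbf{O}, q_1=1\mid\lambda)/P(\mathbf{O}\mid\lambda) = \gamma_1(1)$, and likewise for $\gamma_1(2)$
Transition terms:
$$\left[\frac{p_1+p_2}{p_{all}} + \frac{p_1+p_5}{p_{all}}\right]\log\hat a_{11} + \left[\frac{p_3+p_4}{p_{all}} + \frac{p_2+p_6}{p_{all}}\right]\log\hat a_{12} + \left[\frac{p_5+p_6}{p_{all}} + \frac{p_3+p_7}{p_{all}}\right]\log\hat a_{21} + \left[\frac{p_7+p_8}{p_{all}} + \frac{p_4+p_8}{p_{all}}\right]\log\hat a_{22}$$
For example,
$$\frac{p_1+p_2}{p_{all}} + \frac{p_1+p_5}{p_{all}} = \frac{P(\mathbf{O}, q_1=1, q_2=1\mid\lambda)}{P(\mathbf{O}\mid\lambda)} + \frac{P(\mathbf{O}, q_2=1, q_3=1\mid\lambda)}{P(\mathbf{O}\mid\lambda)} = \xi_1(1,1) + \xi_2(1,1)$$
so the coefficient of $\log\hat a_{11}$ is $\sum_{t}\xi_t(1,1)$, and in general the coefficient of $\log\hat a_{ij}$ is $\sum_{t}\xi_t(i,j)$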