Extended Baum-Welch algorithm
Presented by Shih-Hung Liu, 2006/01/21
NTNU Speech Lab. 2
References
• A generalization of the Baum algorithm to rational objective functions - [Gopalakrishnan et al.] IEEE ICASSP 1989
• An inequality for rational functions with applications to some statistical estimation problems - [Gopalakrishnan et al.] IEEE Transactions on Information Theory 1991
• HMMs, MMIE, and the Speech Recognition problem - [Normandin 1991] PhD dissertation
• Function maximization - [Povey 2004] PhD thesis chapter 4.5
Outline
• Introduction
• Extended Baum-Welch algorithm [Gopalakrishnan et al.]
• EBW from discrete to continuous [Normandin]
• EBW for discrete [Povey]
• Example of function optimization [Gopalakrishnan et al.]
• Conclusion
Introduction
• The well-known Baum-Eagon inequality provides an effective iterative scheme for finding a local maximum of homogeneous polynomials with positive coefficients over a domain of probability values
• However, we are interested in maximizing a general rational function, so we extend the Baum-Eagon inequality to rational functions
Extended Baum-Welch algorithm (1/6)
• $P(X) = P(\{X_{ij}\})$ is an arbitrary homogeneous polynomial of degree $d$ with nonnegative coefficients in variables $X_{ij}$, $i = 1,\dots,p$, $j = 1,\dots,q_i$
Assuming that this polynomial is defined over a domain of probability values
$$D:\ x_{ij} \ge 0,\quad \sum_{j=1}^{q_i} x_{ij} = 1$$
Baum and Eagon show how to construct a transformation $T: U \to D$ for some $U \subseteq D$ such that the following property holds:
property A: for any $x \in U$ and $y = T(x)$, $P(y) > P(x)$ unless $y = x$
[Gopalakrishnan 1989]
Extended Baum-Welch algorithm (2/6)
• $R(X) = S_1(X)/S_2(X)$ is a ratio of two polynomials $S_1(X), S_2(X) > 0$ in variables $X = \{X_{ij}\}$, $i = 1,\dots,p$, $j = 1,\dots,q_i$, defined over a domain
$$D:\ x_{ij} \ge 0,\quad \sum_{j=1}^{q_i} x_{ij} = 1$$
We are looking for a growth transformation $T: D \to D$ such that for any $x \in D$ and $y = T(x)$, $R(y) > R(x)$ unless $y = x$
• A reduction of the case of a rational function to a polynomial: we reduce the problem of finding a growth transformation for a rational function to that of finding one for a specially formed polynomial
• Reduce to a non-homogeneous polynomial with nonnegative coefficients
• Extend the Baum-Eagon inequality to non-homogeneous polynomials with nonnegative coefficients
[Gopalakrishnan 1989]
Extended Baum-Welch algorithm (3/6)
• Step 1:
For any $x \in D$ there exists a polynomial $P_x(X)$ such that if $P_x(y) > P_x(x)$, $y \in D$, then $R(y) > R(x)$.
For this it is enough to set
$$P_x(X) = S_1(X) - R(x)\,S_2(X)$$
Indeed, it is easy to see that $P_x(x) = 0$, and therefore if $P_x(y) > P_x(x) = 0$ then $R(y) > R(x)$.
Now suppose that for each polynomial $P_x(X)$, $x \in D$, we could construct a growth transformation $T_x$ of $D$ such that $P_x(T_x(y)) > P_x(y)$ for any $y \in D$ unless $T_x(y) = y$. Then we could define a growth transformation $T$ of $D$ as follows:
$$T(y) = T_y(y)$$
[Gopalakrishnan 1989]
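The implication in Step 1 depends on $S_2$ being positive on $D$. Written out as a short derivation, using only the definitions on this slide (a sketch of the missing step, not a quotation from the paper):

```latex
\begin{align*}
P_x(x) &= S_1(x) - R(x)\,S_2(x)
        = S_1(x) - \frac{S_1(x)}{S_2(x)}\,S_2(x) = 0, \\
P_x(y) > 0
  &\iff S_1(y) > R(x)\,S_2(y)
   \iff R(y) = \frac{S_1(y)}{S_2(y)} > R(x)
   \qquad \text{since } S_2(y) > 0 .
\end{align*}
```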
Extended Baum-Welch algorithm (4/6)
• Step 2:
Lemma: Let $P(X) = P(\{X_{ij}\})$ be a polynomial with real coefficients in variables $X_{ij}$, $i = 1,\dots,p$, $j = 1,\dots,q_i$. Let $D$ be the domain
$$D:\ x_{ij} \ge 0,\quad \sum_{j=1}^{q_i} x_{ij} = 1$$
(a) There exists a polynomial $C(X)$ such that the polynomial $P'(X) = P(X) + C(X)$ has only nonnegative coefficients and such that the value $C(x)$ at any $x \in D$ is a constant
(b) The set of growth transformations of $D$ for $P'(X)$ coincides with the set of growth transformations of $D$ for $P(X)$
[Gopalakrishnan 1989]
Extended Baum-Welch algorithm (5/6)
• Step 3: finding a growth transformation for a polynomial with nonnegative coefficients can be reduced to the same problem for a homogeneous polynomial with nonnegative coefficients
Consider the homogeneous polynomial
$$P''(Y) = P''(\{Y_{lm}\}) = Y_{p+1,1}^{\,d}\; P'(\{Y_{ij}/Y_{p+1,1}\})$$
in variables $Y_{lm}$, $l = 1,\dots,p+1$, $m = 1,\dots,q_l$, where $q_{p+1} = 1$, over the domain
$$D':\ y_{ij} \ge 0,\quad \sum_{j=1}^{q_i} y_{ij} = 1,\quad i = 1,\dots,p+1,\ j = 1,\dots,q_i$$
$(P'(X), D) \to (P''(Y), D')$: there is a bijection $f: D \to D'$ mapping $\{x_{ij}\}$ into $\{y_{lm}\}$ such that $y_{ij} = x_{ij}$ for $(i,j) \ne (p+1,1)$, and such that $P'(x) = P''(f(x))$ for any $x \in D$
[Gopalakrishnan 1989]
Extended Baum-Welch algorithm (6/6)
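• Baum-Eagon inequality: for $x \in D$ with
$$\sum_{j=1}^{q_i} x_{ij}\,\frac{\partial P}{\partial x_{ij}}(x) \ne 0 \quad \text{for all } i$$
the growth transformation $y = T(x)$ is
$$y_{ij} = \frac{x_{ij}\,\dfrac{\partial P}{\partial x_{ij}}(x)}{\sum_{j=1}^{q_i} x_{ij}\,\dfrac{\partial P}{\partial x_{ij}}(x)}$$
With the added constant $C$, the transformation becomes
$$\big(T_C(x)\big)_{ij} = \frac{x_{ij}\left(\dfrac{\partial P}{\partial x_{ij}}(x) + C\right)}{\sum_{j=1}^{q_i} x_{ij}\left(\dfrac{\partial P}{\partial x_{ij}}(x) + C\right)}$$
[Gopalakrishnan 1989]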
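As a minimal sketch of the transformation above, the following Python applies one growth step to a toy homogeneous polynomial. The function name `growth_transform` and the test polynomial $P(x) = x_{11}^2 x_{12}$ are illustrative assumptions, not taken from the slides.

```python
def growth_transform(x, grad, C=0.0):
    """One Baum-Eagon style step.

    x    : list of rows; each row is nonnegative and sums to 1
    grad : function returning dP/dx_ij in the same shape as x
    C    : constant added to every partial derivative (C = 0 gives
           the plain Baum-Eagon update)
    """
    g = grad(x)
    y = []
    for row, grow in zip(x, g):
        num = [xij * (gij + C) for xij, gij in zip(row, grow)]
        total = sum(num)  # assumed nonzero, as required on the slide
        y.append([n / total for n in num])
    return y


# Hypothetical test polynomial: P(x) = x11^2 * x12 (homogeneous, degree 3)
def P(x):
    a, b = x[0]
    return a * a * b


def grad_P(x):
    a, b = x[0]
    return [[2 * a * b, a * a]]


x0 = [[0.5, 0.5]]
x1 = growth_transform(x0, grad_P)
x2 = growth_transform(x1, grad_P)
```

For a single monomial the first step already lands on the maximizer of $P$ over the simplex, so the second step leaves the point unchanged; for general polynomials the update is iterated.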
EBW for CDHMM – from discrete to continuous (1/3)
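• Discrete case for emission probability update
EBW for $b_j(k)$:
$$\hat b_j(k) = \frac{\Gamma(j,k) + b_j(k)\,C}{\sum_{k=1}^{K}\big[\Gamma(j,k) + b_j(k)\,C\big]}$$
$\Gamma(j,k)$: $\sum_{t=1}^{T} \gamma_j(t)$ taken over $t$ such that $o_t = v_k$
$K$: the number of symbols in the codebook
[ Normandin 1991 ]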
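The discrete update above can be sketched in a few lines of Python. The function name `ebw_discrete_update` and the numbers in the example are hypothetical; `Gamma_j` stands for the per-symbol statistics of state $j$, which may be negative in discriminative training, and the check on the numerators is an added safeguard rather than something stated on the slide.

```python
def ebw_discrete_update(b_j, Gamma_j, C):
    """Return the re-estimated emission distribution for one state.

    b_j     : current emission probabilities b_j(k), k = 1..K
    Gamma_j : accumulated statistics Gamma(j, k), same length as b_j
    C       : smoothing constant; must be large enough to keep every
              numerator positive
    """
    num = [g + C * b for g, b in zip(Gamma_j, b_j)]
    if min(num) <= 0.0:
        raise ValueError("C is too small: a numerator is not positive")
    total = sum(num)
    return [n / total for n in num]


# Hypothetical statistics for a 3-symbol codebook
b_old = [0.5, 0.3, 0.2]
Gamma = [1.0, -0.5, 0.2]
b_new = ebw_discrete_update(b_old, Gamma, C=5.0)
b_slow = ebw_discrete_update(b_old, Gamma, C=1e6)
```

A larger `C` keeps the new distribution closer to the old one, which is the usual trade-off between stability and step size in this update.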
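EBW for CDHMM – from discrete to continuous (2/3)[ Normandin 1991 ]
$M$ subintervals $I_k$ of width $2\sigma_j/M$
$$b_j(k) = \frac{N(x_k \mid \mu_j, \sigma_j)}{\sum_{k=1}^{K} N(x_k \mid \mu_j, \sigma_j)}$$
[Figure: Gaussian density $N(x_k \mid \mu_j, \sigma_j)$ with parameters $\mu_j$, $\sigma_j$, sampled at $x_k$ over subintervals $I_1, I_2, I_3, \dots$]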
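EBW for CDHMM – from discrete to continuous (3/3)[ Normandin 1991 ]
EBW:
$$\hat b_j(k) = \frac{\Gamma(j,k) + b_j(k)\,C}{\sum_{k=1}^{K}\big[\Gamma(j,k) + b_j(k)\,C\big]}, \qquad \mu_j = \lim_{\Delta \to 0} \sum_{k=1}^{K} b_j(k)\,x_k$$
where $\Delta$ denotes the subinterval width.
Mean:
$$\hat\mu_j = \lim_{\Delta \to 0} \sum_{k=1}^{K} \hat b_j(k)\,x_k
= \lim_{\Delta \to 0} \frac{\sum_{k=1}^{K}\big[\Gamma(j,k) + b_j(k)\,C\big]\,x_k}{\sum_{k=1}^{K}\big[\Gamma(j,k) + b_j(k)\,C\big]}
= \lim_{\Delta \to 0} \frac{\sum_{k=1}^{K}\Gamma(j,k)\,x_k + C\sum_{k=1}^{K} b_j(k)\,x_k}{\sum_{k=1}^{K}\Gamma(j,k) + C}
= \frac{\sum_{k=1}^{K}\Gamma(j,k)\,x_k + C\,\mu_j}{\sum_{k=1}^{K}\Gamma(j,k) + C}$$
Variance:
$$\hat\sigma_j^2 = \lim_{\Delta \to 0} \sum_{k=1}^{K} \hat b_j(k)\,(x_k - \hat\mu_j)^2
= \lim_{\Delta \to 0} \frac{\sum_{k=1}^{K}\big[\Gamma(j,k) + b_j(k)\,C\big]\,x_k^2}{\sum_{k=1}^{K}\big[\Gamma(j,k) + b_j(k)\,C\big]} - \hat\mu_j^2
= \frac{\sum_{k=1}^{K}\Gamma(j,k)\,x_k^2 + C\,(\sigma_j^2 + \mu_j^2)}{\sum_{k=1}^{K}\Gamma(j,k) + C} - \hat\mu_j^2$$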
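The limiting mean and variance formulas can be sketched as a small Python function that works from accumulated statistics. The function name, the argument names and the example numbers are all hypothetical; `gamma_sum`, `gamma_x` and `gamma_x2` stand for $\sum_k \Gamma(j,k)$, $\sum_k \Gamma(j,k)\,x_k$ and $\sum_k \Gamma(j,k)\,x_k^2$.

```python
def ebw_gaussian_update(gamma_sum, gamma_x, gamma_x2, mu, var, C):
    """EBW-style re-estimation of a scalar Gaussian mean and variance.

    gamma_sum : sum of Gamma(j, k)
    gamma_x   : sum of Gamma(j, k) * x_k
    gamma_x2  : sum of Gamma(j, k) * x_k**2
    mu, var   : current mean and variance of state j
    C         : smoothing constant (assumed large enough that the
                denominator and the new variance stay positive)
    """
    denom = gamma_sum + C
    mu_new = (gamma_x + C * mu) / denom
    var_new = (gamma_x2 + C * (var + mu * mu)) / denom - mu_new * mu_new
    return mu_new, var_new


# Hypothetical statistics for one state
mu_new, var_new = ebw_gaussian_update(
    gamma_sum=2.0, gamma_x=3.0, gamma_x2=6.0, mu=1.0, var=1.0, C=10.0
)
mu_big, var_big = ebw_gaussian_update(2.0, 3.0, 6.0, 1.0, 1.0, C=1e8)
```

As with the discrete case, a very large `C` leaves the parameters almost unchanged.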
EBW for discrete HMMs (1/6)
• The Baum-Eagon inequality is formulated for the case where there are variables $x_{ij}$ in a matrix $X$ containing rows with a sum-to-one constraint $\sum_j x_{ij} = 1$, and we are maximizing a sum of polynomial terms in $x_{ij}$ with nonnegative coefficients
• For ML training, we can find an auxiliary function and optimize it
• Finding the maximum of the auxiliary function (e.g. using Lagrange multipliers) leads to the following update, which is a growth transformation for the polynomial:
[Povey 2004]
EBW for discrete HMMs (2/6)
$$\hat x_{ij} = \frac{x_{ij}\,\left.\dfrac{\partial F}{\partial x_{ij}}\right|_{X = X'}}{\sum_k x_{ik}\,\left.\dfrac{\partial F}{\partial x_{ik}}\right|_{X = X'}}$$
• The Baum-Welch update is an update procedure for HMMs which uses this growth transformation together with the forward-backward algorithm, which finds the relevant differentials efficiently
[Povey 2004]
EBW for discrete HMMs (3/6)
• An update rule as convenient and provably correct as the Baum-Welch update is not available for discriminative training of HMMs, which is a harder optimization problem
• The Extended Baum-Welch update equation as originally derived is applicable to rational functions of parameters which are subject to sum-to-one constraints
• The MMI objective function for discrete-probability HMMs is an example of such a function:
$$F_{MMI} = \log \frac{p(O \mid w)}{p(O)}$$
[Povey 2004]
EBW for discrete HMMs (4/6)
Two essential points used to derive the EBW update for MMI:
1. Instead of maximizing $f(x) = a(x)/b(x)$ for positive $a(x)$ and $b(x)$, we can instead maximize $g(x) = a(x) - k\,b(x)$, where $k = a(x')/b(x')$ and $x'$ are the values from the previous iteration; increasing $g(x)$ will cause $f(x)$ to increase. This is because $g(x)$ is a strong-sense auxiliary function for $f(x)$ around $x'$
2. If some terms in the resulting polynomial are negative, we can add to the expression a constant $C$ times a further polynomial which is constrained to be a constant (e.g. $C \sum_i \sum_j x_{ij}$), so as to ensure that no product of terms in the final expression has a negative coefficient
[Povey 2004]
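EBW for discrete HMMs (5/6)[Povey 2004]
The Baum-Welch update written with derivatives taken with respect to $\log x_{ij}$:
$$\hat x_{ij} = \frac{\left.\dfrac{\partial F}{\partial \log x_{ij}}\right|_{X = X'}}{\sum_k \left.\dfrac{\partial F}{\partial \log x_{ik}}\right|_{X = X'}}$$
since
$$\frac{\partial F}{\partial x_{ij}} = \frac{\partial F}{\partial \log x_{ij}}\,\frac{\partial \log x_{ij}}{\partial x_{ij}} = \frac{1}{x_{ij}}\,\frac{\partial F}{\partial \log x_{ij}}$$
By applying these two ideas:
$$\hat x_{ij} = \frac{\left.\dfrac{\partial F}{\partial \log x_{ij}}\right|_{X = X'} + C\,x_{ij}}{\sum_k \left[\left.\dfrac{\partial F}{\partial \log x_{ik}}\right|_{X = X'} + C\,x_{ik}\right]}$$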
EBW equivalent smooth function (6/6)
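EBW can be regarded as adding a smooth function
$$g_{sm}(\lambda, \bar\lambda) = -\frac{1}{2}\left[D_j \log(2\pi\sigma_j^2) + \frac{D_j(\bar\sigma_j^2 + \bar\mu_j^2)}{\sigma_j^2} - \frac{2 D_j \bar\mu_j \mu_j}{\sigma_j^2} + \frac{D_j \mu_j^2}{\sigma_j^2}\right]$$
into the objective function.
We can check that, at $\lambda = \bar\lambda$,
$$\frac{\partial g_{sm}(\lambda, \bar\lambda)}{\partial \mu_j} = -\frac{1}{2}\left[-\frac{2 D_j \bar\mu_j}{\sigma_j^2} + \frac{2 D_j \mu_j}{\sigma_j^2}\right] = 0$$
$$\frac{\partial g_{sm}(\lambda, \bar\lambda)}{\partial \sigma_j^2} = -\frac{1}{2}\left[\frac{D_j}{\sigma_j^2} - \frac{D_j(\bar\sigma_j^2 + \bar\mu_j^2) - 2 D_j \bar\mu_j \mu_j + D_j \mu_j^2}{\sigma_j^4}\right] = 0$$
[Povey 2004]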
Example
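• consider $R(x,y,z) = \dfrac{x^2}{x^2 + y^2 + z^2}$, $x, y, z \ge 0$, $x + y + z = 1$
1. start from some $x_0, y_0, z_0$ such that $x_0 \ge 0$, $y_0 \ge 0$, $z_0 \ge 0$, $x_0 + y_0 + z_0 = 1$; iteration index $i = 0$
2. $k = R(x_i, y_i, z_i)$
3. obtain a polynomial with nonnegative coefficients: $P(x,y,z) = x^2 - k\,(x^2 + y^2 + z^2) + k\,(x + y + z)^2$
4. using the update formula, let
$$D = x\,\frac{\partial P(x,y,z)}{\partial x} + y\,\frac{\partial P(x,y,z)}{\partial y} + z\,\frac{\partial P(x,y,z)}{\partial z} = 2x^2 + 4kxy + 4kxz + 4kyz$$
$$x_{i+1} = \left.\frac{x\,\partial P(x,y,z)/\partial x}{D}\right|_{x_i, y_i, z_i},\quad
y_{i+1} = \left.\frac{y\,\partial P(x,y,z)/\partial y}{D}\right|_{x_i, y_i, z_i},\quad
z_{i+1} = \left.\frac{z\,\partial P(x,y,z)/\partial z}{D}\right|_{x_i, y_i, z_i}$$
5. increment iteration index $i$ by 1 and go to 2.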
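The five steps of the example translate directly into a short loop. The sketch below assumes the starting point $(1/3, 1/3, 1/3)$ and 50 iterations, neither of which comes from the slides; the function names are illustrative.

```python
def R(x, y, z):
    """Objective from the example: x^2 / (x^2 + y^2 + z^2)."""
    return x * x / (x * x + y * y + z * z)


def ebw_step(x, y, z):
    """One iteration of steps 2 to 4 of the example."""
    k = R(x, y, z)                      # step 2
    # step 3: P = x^2 - k(x^2+y^2+z^2) + k(x+y+z)^2
    #           = x^2 + 2k(xy + xz + yz), all coefficients nonnegative
    dPdx = 2 * x + 2 * k * (y + z)
    dPdy = 2 * k * (x + z)
    dPdz = 2 * k * (x + y)
    D = x * dPdx + y * dPdy + z * dPdz  # step 4 normalizer
    return x * dPdx / D, y * dPdy / D, z * dPdz / D


def run(x, y, z, iters=50):
    """Iterate the update and record R after every step."""
    history = [R(x, y, z)]
    for _ in range(iters):              # step 5: increment i, go to 2
        x, y, z = ebw_step(x, y, z)
        history.append(R(x, y, z))
    return (x, y, z), history


point, history = run(1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0)
```

The recorded values of `R` never decrease and the iterates stay on the simplex, which is the growth property the algorithm is built to guarantee. The approach to the maximizer $(1, 0, 0)$ slows down as $y$ and $z$ get small.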
Conclusion
• Presented an algorithm for maximizing certain rational functions defined over a domain of probability values
• This algorithm is very useful in practice for training HMM parameters
MPE: Final Auxiliary Function
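Weak-sense auxiliary function:
$$H_{MPE}(\lambda, \bar\lambda) = \sum_r \sum_{q \in W^r_{lat}} \frac{\partial F_{MPE}(\bar\lambda)}{\partial \log p(O_r \mid q)}\,\log p(O_r \mid q)$$
Strong-sense auxiliary function:
$$g_{MPE}(\lambda, \bar\lambda) = \sum_r \sum_{q \in W^r_{lat}} \sum_{t = s_q}^{e_q} \sum_m \gamma_q^{r,MPE}\,\gamma_{qm}^{r}(t)\,\log N(o_r(t), \mu_m, \Sigma_m)$$
With the smoothing function involved (weak-sense auxiliary function):
$$g_{MPE}(\lambda, \bar\lambda) = \sum_r \sum_{q \in W^r_{lat}} \sum_{t = s_q}^{e_q} \sum_m \gamma_q^{r,MPE}\,\gamma_{qm}^{r}(t)\,\log N(o_r(t), \mu_m, \Sigma_m)
+ \sum_m -\frac{D_m}{2}\left[\log(|\Sigma_m|) + (\mu_m - \bar\mu_m)^T \Sigma_m^{-1} (\mu_m - \bar\mu_m) + \mathrm{tr}(\bar\Sigma_m \Sigma_m^{-1})\right]$$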
EBW derived from auxiliary function
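Smoothing function, scalar form:
$$g_{sm}(\lambda, \bar\lambda) = \sum_m -\frac{D_m}{2}\left[\log(2\pi\sigma_m^2) + \frac{(\mu_m - \bar\mu_m)^2}{\sigma_m^2} + 1\right]$$
Smoothing function, matrix form:
$$g_{sm}(\lambda, \bar\lambda) = \sum_m -\frac{D_m}{2}\left[\log(|\Sigma_m|) + (\mu_m - \bar\mu_m)^T \Sigma_m^{-1} (\mu_m - \bar\mu_m) + \mathrm{tr}(\bar\Sigma_m \Sigma_m^{-1})\right]$$
Derivative of the smoothing function with respect to the mean:
$$\frac{\partial g_{sm}(\lambda, \bar\lambda)}{\partial \mu_m} = -\frac{D_m}{2}\cdot\frac{2(\mu_m - \bar\mu_m)}{\sigma_m^2}$$
Auxiliary function with the scalar smoothing function:
$$g_{MPE}(\lambda, \bar\lambda) = \sum_r \sum_{q \in W^r_{lat}} \sum_{t = s_q}^{e_q} \sum_m \gamma_q^{r,MPE}\,\gamma_{qm}^{r}(t)\,\log N(o_r(t), \mu_m, \sigma_m^2)
+ \sum_m -\frac{D_m}{2}\left[\log(2\pi\sigma_m^2) + \frac{(\mu_m - \bar\mu_m)^2}{\sigma_m^2} + 1\right]$$
where
$$N(o_r(t), \mu_m, \sigma_m^2) = \frac{1}{\sqrt{2\pi\sigma_m^2}}\; e^{-\frac{(o_r(t) - \mu_m)^2}{2\sigma_m^2}}$$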
EBW derived from auxiliary function
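$$g_{MPE}(\lambda, \bar\lambda) = \sum_r \sum_{q \in W^r_{lat}} \sum_{t = s_q}^{e_q} \sum_m \gamma_q^{r,MPE}\,\gamma_{qm}^{r}(t)\,\log N(o_r(t), \mu_m, \sigma_m^2)
+ \sum_m -\frac{D_m}{2}\left[\log(2\pi\sigma_m^2) + \frac{(\mu_m - \bar\mu_m)^2}{\sigma_m^2} + 1\right]$$
Expanding the Gaussian term and differentiating it with respect to $\mu_m$:
$$\sum_r \sum_{q \in W^r_{lat}} \sum_{t = s_q}^{e_q} \sum_m \gamma_q^{r,MPE}\,\gamma_{qm}^{r}(t)\left[-\frac{1}{2}\log(2\pi\sigma_m^2) - \frac{(o_r(t) - \mu_m)^2}{2\sigma_m^2}\right]$$
$$\frac{\partial}{\partial \mu_m}:\quad \sum_r \sum_{q \in W^r_{lat}} \sum_{t = s_q}^{e_q} \gamma_q^{r,MPE}\,\gamma_{qm}^{r}(t)\cdot\frac{1}{2}\cdot\frac{2\,(o_r(t) - \mu_m)}{\sigma_m^2}$$
The smoothing function contributes
$$\frac{\partial g_{sm}(\lambda, \bar\lambda)}{\partial \mu_m} = -\frac{D_m}{2}\cdot\frac{2(\mu_m - \bar\mu_m)}{\sigma_m^2}$$
Setting the derivative of the auxiliary function to zero:
$$\frac{\partial g_{MPE}(\lambda, \bar\lambda)}{\partial \mu_m} = 0$$
$$\Rightarrow\ \sum_r \sum_{q \in W^r_{lat}} \sum_{t = s_q}^{e_q} \gamma_q^{r,MPE}\,\gamma_{qm}^{r}(t)\cdot\frac{1}{2}\cdot\frac{2\,(o_r(t) - \mu_m)}{\sigma_m^2} - \frac{D_m}{2}\cdot\frac{2(\mu_m - \bar\mu_m)}{\sigma_m^2} = 0$$
$$\Rightarrow\ \sum_r \sum_{q \in W^r_{lat}} \sum_{t = s_q}^{e_q} \gamma_q^{r,MPE}\,\gamma_{qm}^{r}(t)\,(o_r(t) - \mu_m) - D_m(\mu_m - \bar\mu_m) = 0$$
$$\Rightarrow\ \mu_m = \frac{\sum_r \sum_{q \in W^r_{lat}} \sum_{t = s_q}^{e_q} \gamma_q^{r,MPE}\,\gamma_{qm}^{r}(t)\,o_r(t) + D_m \bar\mu_m}{\sum_r \sum_{q \in W^r_{lat}} \sum_{t = s_q}^{e_q} \gamma_q^{r,MPE}\,\gamma_{qm}^{r}(t) + D_m}$$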