Prof. Hui Jiang
Department of Computer Science and Engineering
York University, Toronto, Ont. M3J 1P3, CANADA
Email: [email protected]
Large Margin HMMs for Speech Recognition
(This is joint work with Xinwei Li and Chao-Jun Liu.)
Outline
• Background
  – Discriminative Training for ASR
  – Large Margin Classifiers: concept & theory
• Large Margin HMMs for ASR
  – A new estimation criterion for HMMs
  – Analysis of margin in CDHMMs
  – Large Margin Estimation (LME): constrained minimax optimization
• Optimization Methods
  – Gradient Descent (GD) search
  – Semi-definite Programming (SDP)
• Experiments
  – The ISOLET recognition task
  – The TIDIGITS connected digit string recognition
• Summary
Automatic Speech Recognition
• Statistical Speech Model
[Figure: 3-state left-to-right HMM with self-loop probabilities a11, a22, a33 and transition probabilities a12, a23, emitting the observation sequence X = x1 x2 x3 x4 x5]
[Figure: noisy-channel view of ASR — a word sequence W passes through a noisy channel to produce the speech signal X; channel decoding recovers the word sequence from the speech signal]
• Bayesian Decision Rule:

$$\hat{W} = \arg\max_{W} p(W|X) = \arg\max_{W} P(W)\,p(X|W) = \arg\max_{W} F(X|\Lambda_W)$$

where $P(W)$ is the language model, $p(X|W)$ is the acoustic model, and $F(X|\Lambda_W)$ is the discriminant function.
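As a concrete reading of the rule, here is a minimal Python sketch that picks the word sequence maximizing the combined log-score; the hypothesis list and all scores below are invented for illustration, not output of any real recognizer.

```python
# Toy illustration of the Bayesian decision rule W_hat = argmax_W P(W) p(X|W),
# done in the log domain.  All numbers are made up.
def decode(hypotheses, lm_logprob, am_logprob):
    """Return the word sequence maximizing log P(W) + log p(X|W)."""
    return max(hypotheses, key=lambda W: lm_logprob[W] + am_logprob[W])

lm_logprob = {"nine one": -2.1, "five one": -2.8}      # language model P(W)
am_logprob = {"nine one": -310.4, "five one": -309.9}  # acoustic model p(X|W)
print(decode(["nine one", "five one"], lm_logprob, am_logprob))  # "nine one"
```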
HMM Estimation Methods
• Maximum Likelihood Estimation (MLE)
  – The Baum-Welch algorithm: the EM algorithm for HMMs
• Discriminative Training (DT)
  – Maximum Mutual Information Estimation (MMIE)
    • MPE, MWE, etc.
  – Minimum Classification Error (MCE)
• Discriminative training can improve over the standard ML training.
• Can we do better than DT?
Large Margin Classifier: Support Vector Machine (SVM)
[Figure: two-class SVM decision boundaries; the boundary with the larger margin is preferred]
Large Margin Classifiers
• Why do large margin classifiers yield better generalization performance?
• Conceptually, a large margin gives:
  – robustness w.r.t. data patterns
  – robustness w.r.t. classifier parameters
• Theoretically …
Statistical Learning Theory
• In pattern classification, the following generalization upper bound holds with probability 1 − δ (Vapnik et al.):

$$R(\theta) \;\le\; R_{\mathrm{emp}}(\theta) + \sqrt{\frac{V\left(\log\frac{2N}{V}+1\right) + \log\frac{4}{\delta}}{N}}$$

Test error rate ≤ training error rate + VC confidence, where N is the size of the training set and V is the VC dimension.
Statistical Learning Theory
• For a large margin classifier, a margin-based generalization bound holds with probability 1 − δ (Vapnik et al.):

$$R(\theta) \;\le\; R_{d}(\theta) + \sqrt{\frac{C}{N}\left(\frac{V\log^{2}(N/V)}{d^{2}} + \log\frac{1}{\delta}\right)}$$

Test error ≤ margin error + VC confidence, where V is the VC dimension, d is the margin, N is the size of the training set, and C is a universal constant.
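To get a numeric feel for the confidence term, the following snippet (with invented values of N, V, and δ, not numbers from the talk) evaluates the square-root term of the first bound:

```python
import math

# Evaluate the VC confidence term for made-up N, V, delta.
def vc_confidence(N, V, delta=0.05):
    return math.sqrt((V * (math.log(2 * N / V) + 1) + math.log(4 / delta)) / N)

for N in (1_000, 10_000, 100_000):
    print(f"N={N:>6}  confidence={vc_confidence(N, V=50):.3f}")
# More data (larger N) or lower capacity (smaller V; in the margin bound,
# effectively V/d^2) brings the test-error bound closer to the training error.
```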
How about using SVM for Speech Recognition?
• Done in some simple ASR tasks:
  – phoneme recognition; speaker recognition
  – small-vocabulary isolated speech recognition
• Hard to extend to large-scale continuous speech recognition.
• No significant improvement has been reported.
  – still not a mainstream method
• Why?
  – SVM is a binary, static classifier.
  – There is no proper kernel function to map speech samples from one dynamic high-dimensional space to another high-dimensional space that suits linear classifiers.
Large Margin HMM-based Classifier
[Figure: model λ1 vs. model λ2 with separation boundary F(X|λ1) − F(X|λ2) = 0]
Large Margin HMM-based Classifier
[Figure: original separation boundary F(X|λ1) − F(X|λ2) = 0 and, after re-estimating λ1 → λ'1 and λ2 → λ'2, the new separation boundary F(X|λ'1) − F(X|λ'2) = 0]
How to define separation margin? (1)
• In a 2-class separable problem:
  – For a data token $x_1$ of class 1: $d_1(x_1) = F(x_1|\lambda_1) - F(x_1|\lambda_2) > 0$
  – For a data token $x_2$ of class 2: $d_2(x_2) = F(x_2|\lambda_2) - F(x_2|\lambda_1) > 0$
How to define separation margin? (2)
• Extend to the multiple-class problem:
  – N classes $\lambda_1, \lambda_2, \dots, \lambda_N$
  – For a data token $x_i$ of class $i$:

$$d_i(x_i) = \min_{j\neq i}\big[F(x_i|\lambda_i) - F(x_i|\lambda_j)\big] = F(x_i|\lambda_i) - \max_{j\neq i}F(x_i|\lambda_j)$$
Large Margin Estimation of HMMs
• An N-class problem: each class is represented by an HMM, $\Lambda = \{\lambda_1, \lambda_2, \dots, \lambda_N\}$.
• Given a training set $D$, define a subset called the support token set $S$ (see the sketch below):

$$S = \{\, X_i \mid X_i \in D \ \text{and}\ 0 \le d_i(X_i) \le \varepsilon \,\}$$

• Large Margin Estimation (LME) of HMMs:

$$\tilde{\Lambda} = \arg\max_{\Lambda}\ \min_{X_i\in S} d_i(X_i) \quad \text{subject to } d_i(X_i) > 0 \text{ for all } i$$
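A minimal NumPy sketch of these two definitions, assuming the discriminant scores F(X_i|λ_j) have already been computed into a matrix (all names and numbers are illustrative, not from the talk's tools):

```python
import numpy as np

def margins(scores, labels):
    """d_i(X_i) = F(X_i|lam_true) - max over rivals of F(X_i|lam_j)."""
    d = np.empty(len(labels))
    for i, yi in enumerate(labels):
        rivals = np.delete(scores[i], yi)
        d[i] = scores[i, yi] - rivals.max()
    return d

def support_set(d, eps):
    """S: tokens that are correctly classified but within eps of the boundary."""
    return np.where((d >= 0) & (d <= eps))[0]

# scores[i, j] = F(X_i | lambda_j); rows are tokens, columns are class HMMs
scores = np.array([[2.0, 1.5, 0.3],
                   [0.2, 1.1, 1.0],
                   [0.9, 0.8, 2.5]])
labels = np.array([0, 1, 2])
d = margins(scores, labels)          # [0.5, 0.1, 1.6]
print(support_set(d, eps=0.5))       # [0, 1]: the near-boundary tokens
# LME then maximizes min(d over S), keeping every margin positive.
```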
Large Margin Estimation of HMMs
• Convert to a minimax optimization problem.
• Assume $X_i$ belongs to class $i$:

$$\tilde{\Lambda} = \arg\min_{\Lambda}\ \max_{X_i\in S,\, j\neq i}\big[F(X_i|\lambda_j) - F(X_i|\lambda_i)\big]$$

subject to the constraints: $F(X_i|\lambda_i) - F(X_i|\lambda_j) > 0$ for all $X_i \in S$ and $j \neq i$.
Analysis of Margins in CDHMM
• The margin in a CDHMM is unbounded without additional constraints:
  – one can adjust the CDHMM parameters so as to increase the margin without limit.
• Adopt the Viterbi approximation.
Analysis of Margins in CDHMM
• Each dimension is treated independently. Per dimension, the discriminant is:
  – Linear, when the two classes share the same variance;
  – Quadratic, when the variances differ.
Margin: Linear Dimensions
[Figure: two class-conditional densities C1 and C2 on one dimension, with tokens of each class (x, o) and the linear discriminant]

$$d(x) = \frac{m_1 - m_2}{s^2}\left(x - \frac{m_1 + m_2}{2}\right)$$
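The cancellation behind this formula is easy to verify numerically; a small sketch with invented means and a shared variance (my own toy values):

```python
import numpy as np

# When both classes share variance s^2, the log-likelihood difference is linear:
# the x^2 terms cancel, leaving d(x) = (m1 - m2)/s^2 * (x - (m1 + m2)/2).
m1, m2, s = 4.0, 8.0, 2.0

def d_exact(x):   # log N(x; m1, s^2) - log N(x; m2, s^2)
    return ((x - m2) ** 2 - (x - m1) ** 2) / (2 * s ** 2)

def d_linear(x):  # the closed form from the slide
    return (m1 - m2) / s ** 2 * (x - (m1 + m2) / 2)

x = np.linspace(0, 15, 7)
print(np.allclose(d_exact(x), d_linear(x)))  # True
```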
Margin: Quadratic Dimensions
[Figure: quadratic discriminant between classes C1 and C2 on one dimension; tokens X1, X2, X4 are marked, together with the margins d2(X1) and −d1(X2)]
Constraints in LME of CDHMM
• Impose constraints to make LME solvable:
  – Linear part: fix the norm of the slope to a constant,

$$R_1(X_i, W_j \mid \Lambda) = \sum_{t}\sum_{d\in D_1} r_{td}^{2}$$

  – Quadratic part: constrain the vertex to a range,

$$R_2(X_i, W_j \mid \Lambda) = \sum_{t}\sum_{d\in D_2} \big(x_{td}^{(0)} - x_{td}\big)^{2}$$

(here $D_1$/$D_2$ are the linear/quadratic dimensions, $r_{td}$ the per-frame slope, and $x_{td}^{(0)}$ the vertex of the quadratic discriminant).
LME: constrained minimax optimization
• Large Margin Estimation (LME) of CDHMMs ⇒ a constrained minimax optimization problem:

$$\tilde{\Lambda} = \arg\min_{\Lambda}\ \max_{X_i\in S,\,j\neq i}\big[F(X_i|\lambda_j) - F(X_i|\lambda_i)\big]$$

subject to the constraints:

$$R_1(X_i, W_j \mid \Lambda) = g_{ij}^{2}, \qquad R_2(X_i, W_j \mid \Lambda) \le G_{ij}^{2}$$

for all $X_i \in S$ and $j \neq i$.
Optimization Methods
• Gradient Descent (GD) search
  – approximate the objective function with a differentiable one
  – cast constraints as penalty terms
• Semi-definite Programming (SDP)
  – mathematical manipulation
  – relaxation
LME Optimization: Gradient Descent
• Approximate the minimum with a summation of exponentials (a smoothed minimum; see the sketch below):

$$Q(\Lambda) = \min_{X_i\in S} d_i(X_i) \;\approx\; Q_\eta(\Lambda) = -\frac{1}{\eta}\log\!\left[\sum_{X_i\in S}\sum_{j\neq i}\exp\big(-\eta\, d_{ij}(X_i)\big)\right]$$

with $\lim_{\eta\to\infty} Q_\eta(\Lambda) = Q(\Lambda)$ for $\eta > 0$.

• Constraints ⇒ penalty terms:

$$O(\Lambda) = Q_\eta(\Lambda) - \theta_1 P_1(\Lambda) - \theta_2 P_2(\Lambda)$$

$$P_1(\Lambda) = \sum_{i}\sum_{j\neq i}\big(R_1(X_i, W_j\mid\Lambda) - g_{ij}^{2}\big)^{2}$$

$$P_2(\Lambda) = \sum_{i}\sum_{j\neq i}\Big[\max\big(0,\ R_2(X_i, W_j\mid\Lambda) - G_{ij}^{2}\big)\Big]^{2}$$
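A small sketch of the smoothing and the penalties (variable names are mine; `d` holds the pairwise margins over support tokens):

```python
import numpy as np

def Q_eta(d, eta):
    """Smoothed minimum: -(1/eta) log sum exp(-eta d) -> min(d) as eta grows."""
    return -np.log(np.exp(-eta * d).sum()) / eta

def penalty_terms(R1, g2, R2, G2):
    P1 = ((R1 - g2) ** 2).sum()                  # equality:   R1 = g^2
    P2 = (np.maximum(0.0, R2 - G2) ** 2).sum()   # inequality: R2 <= G^2
    return P1, P2

d = np.array([0.5, 0.1, 1.6])
for eta in (1.0, 10.0, 100.0):
    print(eta, round(Q_eta(d, eta), 4))  # tends to min(d) = 0.1 as eta grows
# O(Lambda) = Q_eta - theta1*P1 - theta2*P2 is then improved by gradient steps.
```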
LME Optimization: Gradient Descent
• The gradient-based update:

$$\hat{\Lambda}^{(n+1)} = \hat{\Lambda}^{(n)} + \varepsilon\,\frac{\partial O(\Lambda)}{\partial \Lambda}\bigg|_{\Lambda = \hat{\Lambda}^{(n)}}$$

• Drawbacks of the gradient-based optimization:
  – many parameters to be tuned experimentally: step size $\varepsilon$, penalty coefficients $\theta_1, \theta_2$, the smoothing constant $\eta$, etc.
  – slow convergence speed
  – local optima
How to calculate the gradient for continuous density HMMs? (1)

$$\frac{\partial Q_\eta(\Lambda)}{\partial \Lambda} = \sum_{X_i\in S,\,j\neq i} \frac{\exp\big[-\eta\, d_{ij}(X_i)\big]}{\displaystyle\sum_{X_i\in S,\,j\neq i}\exp\big[-\eta\, d_{ij}(X_i)\big]}\cdot \frac{\partial d_{ij}(X_i)}{\partial \Lambda}$$

with

$$\frac{\partial d_{ij}(X_i)}{\partial \lambda_i} = \frac{\partial F(X_i|\lambda_i)}{\partial \lambda_i}, \qquad \frac{\partial d_{ij}(X_i)}{\partial \lambda_j} = -\frac{\partial F(X_i|\lambda_j)}{\partial \lambda_j}$$
How to calculate the gradient for continuous density HMMs? (2)
• Assumption 1: adjust CDHMM mean vectors only
• Assumption 2: diagonal precision matrices
• Assumption 3: use the Viterbi approximation

Under these assumptions (writing $s_t, l_t$ for the Viterbi state and mixture at frame $t$, and $r$, $m$ for precisions and means):

$$F(X|\lambda_i) \approx C_i' - \frac{1}{2}\sum_{t=1}^{T}\sum_{d=1}^{D} r^{(i)}_{s_t l_t d}\big(x_{td} - m^{(i)}_{s_t l_t d}\big)^{2}$$

$$F(X|\lambda_j) \approx C_j'' - \frac{1}{2}\sum_{t=1}^{T}\sum_{d=1}^{D} r^{(j)}_{s'_t l'_t d}\big(x_{td} - m^{(j)}_{s'_t l'_t d}\big)^{2}$$
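Under the three assumptions, both the discriminant and its gradient with respect to the means reduce to sums over the aligned frames. A minimal sketch (array shapes and names are mine, not the talk's code):

```python
import numpy as np

def F_viterbi(X, m, r, align, logC=0.0):
    """F(X|lambda) ~ C - 1/2 sum_t sum_d r[s,l,d] (x[t,d] - m[s,l,d])^2,
    where align[t] = (s_t, l_t) is the Viterbi state/mixture for frame t."""
    acc = logC
    for t, (s, l) in enumerate(align):
        acc -= 0.5 * np.sum(r[s, l] * (X[t] - m[s, l]) ** 2)
    return acc

def dF_dmean(X, m, r, align):
    """dF/dm[s,l,d] = sum over frames aligned to (s,l) of r[s,l,d](x[t,d]-m[s,l,d])."""
    g = np.zeros_like(m)
    for t, (s, l) in enumerate(align):
        g[s, l] += r[s, l] * (X[t] - m[s, l])
    return g

X = np.random.randn(5, 3)       # 5 frames, 3 feature dimensions
m = np.zeros((2, 1, 3))         # 2 states, 1 mixture per state
r = np.ones((2, 1, 3))          # diagonal precisions
align = [(0, 0), (0, 0), (1, 0), (1, 0), (1, 0)]
print(F_viterbi(X, m, r, align), dF_dmean(X, m, r, align)[0, 0])
```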
Semi-definite Programming (SDP)
• An active area in the optimization community nowadays.
• The standard SDP form:

$$\min_{X_1,\dots,X_N}\ \sum_{j=1}^{N} C_j \bullet X_j \quad \text{subject to}\quad \sum_{j=1}^{N} A_{ij}\bullet X_j = b_i \ \ (1\le i\le M),\qquad X_j \succeq 0$$

i.e., a linear function of symmetric matrices constrained to the semi-definite matrix cone.
• SDP can solve nonlinear optimization problems if configured properly.
• Convex conic optimization ⇒ global optimal solution.
• New efficient algorithms have been developed (see the sketch below).
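For a concrete instance of the standard form, here is a tiny SDP solved with cvxpy (my choice of tool for the sketch; the talk's experiments used the DSDP package instead):

```python
import numpy as np
import cvxpy as cp

# min <C, X>  s.t.  <A1, X> = b1,  X positive semi-definite (one block, N = 1).
n = 3
C = np.diag([1.0, 2.0, 3.0])
A1, b1 = np.eye(n), 1.0                      # trace(X) = 1

X = cp.Variable((n, n), PSD=True)
prob = cp.Problem(cp.Minimize(cp.trace(C @ X)), [cp.trace(A1 @ X) == b1])
prob.solve()
print(round(prob.value, 3))  # 1.0: puts all mass on C's smallest eigenvalue
```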
LME Optimization: SDP
• LME: convert the constrained minimax optimization ⇒ semi-definite programming (SDP).
• Introduce a new constraint to the minimax optimization problem.
LME-SDP: Minimization
• Minimax optimization ⇒ minimization:
  – replace the max with a common upper bound.
LME-SDP: Matrix Form
• Transform into matrix form.
LME-SDP: Relaxation
• Matrix relaxation: equality ⇒ inequality.
• Relax to an SDP problem.
LME-SDP: Relaxation Analysis
• Relaxation: a geometric explanation (see the sketch below)
  – augment x and u to a higher dimension
  – solve the problem in the augmented space
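A hedged cvxpy illustration of the usual lifting trick behind such relaxations (my sketch of the standard technique, not the talk's exact formulation): a nonconvex product constraint Y = y yᵀ is relaxed to Y ⪰ y yᵀ, which the Schur complement turns into a single PSD constraint on an augmented matrix.

```python
import numpy as np
import cvxpy as cp

# Lift y into Z = [[1, y^T], [y, Y]] >= 0; by the Schur complement this is
# exactly Y >= y y^T, a convex relaxation of the nonconvex Y = y y^T.
n = 2
Z = cp.Variable((n + 1, n + 1), PSD=True)
y, Y = Z[1:, 0], Z[1:, 1:]

prob = cp.Problem(cp.Minimize(cp.trace(Y)),
                  [Z[0, 0] == 1, y == np.ones(n)])
prob.solve()
print(np.round(Y.value, 2))  # [[1, 1], [1, 1]] = y y^T: the relaxation is tight here
```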
LME-SDP: Training Procedure
Experiments: Overview
• Implemented under the HTK framework.
• Added more training tools:
  – MCE training tool: HMce.c
  – LME-GD training tool: HCLme.c
  – LME-SDP training tool: C programs + a Matlab program + the DSDP package
• ASR tasks:
  – OGI ISOLET E-set recognition
  – TIDIGITS
LME Training System
Experimental Result: ISOLET
• The feature vector has 39 dimensions: (12 MFCC + E) + Δ + ΔΔ.
• MLE models (14 states per HMM) are trained by HTK.
• MLE ⇒ MCE; MCE ⇒ LME (each stage initializes the next).
• Alphabet (26-letter) recognition (training: 3120 utterances; test: 1560 utterances):
  – OGI (96%), Cambridge (96.73%).
  – Ours: MLE (95.4%), MCE (96.1%), LME (96.92%).
Experimental Result: ISOLET E-set
• ISOLET E-set: {B, C, D, E, G, P, T, V, Z}
• Training: 1080 utterances; test: 540 utterances
• Word accuracy (in %) on the ISOLET E-set test data:

Criterion   mix-1   mix-2   mix-4
ML          85.56   90.56   91.48
MCE         91.48   94.07   93.89
LME-GD      92.96   95.00   94.44
LME-SDP     92.96   95.19   95.00
Experimental Result: TIDIGITS
• Connected digit strings: '1' to '9' plus 'oh' and 'zero'.
• Training with 8623 sentences; testing with 8700 sentences.
• The feature vector has 39 dimensions: (12 MFCC + E) + Δ + ΔΔ.
• Unknown-length digit string recognition.
• Context-independent whole-word HMM models.
• MLE models (12 states per HMM) are trained by HTK.
• MLE ⇒ MCE; MCE ⇒ LME.
• MCE/LME training: N-best (N = 5) based, at the string level.
String-Level LME Training
1. Identify support tokens.
2. Run the LME optimization.
3. Check convergence; if not converged, go back to step 1.

A toy version of this loop is sketched below.
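Everything in this runnable sketch is illustrative: it uses a 2-D linear stand-in for the HMM scores, whereas the real tools operate on N-best string hypotheses.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 0], 0.5, (20, 2)),   # class 0 tokens
               rng.normal([0, 2], 0.5, (20, 2))])  # class 1 tokens
y = np.repeat([0, 1], 20)
W = np.eye(2)                                      # one weight vector per class

for it in range(10):
    scores = X @ W.T
    d = scores[np.arange(40), y] - scores[np.arange(40), 1 - y]
    S = (d >= 0) & (d <= 1.0)                      # step 1: support tokens
    if not S.any():
        break
    i = int(np.argmin(np.where(S, d, np.inf)))     # worst support token
    W[y[i]] += 0.1 * X[i]                          # step 2: enlarge its margin
    W[1 - y[i]] -= 0.1 * X[i]
    W /= np.linalg.norm(W)                         # keep parameters bounded
print("min margin:", round(float(d.min()), 3))     # step 3: track convergence
```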
Experimental Result: TIDIGITS
• String accuracy (in %) on test data
Experimental Result: TIDIGITS
• WER (in %) on the TIDIGITS test set
TIDIGITS Results: LME-GD vs. LME-SDP
[Figure: two panels, Gradient Descent (0–80 iterations) and SDP (0–12 iterations), plotting string accuracy (%) on training and test sets against the number of iterations; the accuracy axes span roughly 98.8–99.8% (GD) and 98.9–99.9% (SDP)]
TIDIGITS Results: LME vs. Others

Criterion                Model                                string error rate (%)   WER (%)
MLE (with HTK)           context-indep. whole-word model      1.34                    0.45
MCE (Juang et al. '97)   context-indep. whole-word model      0.95                    n/a
MCE (Juang et al. '97)   context-dep. head-body-tail model    0.72                    0.24
MMIE (Normandin '94)     context-indep., two models per word  0.89                    0.29
LME-SDP                  context-indep. whole-word model      0.53                    0.18
LME-GD                   context-indep. whole-word model      0.66                    0.22
Summary
• HMM estimation methods for ASR:

Criterion                                       Optimization methods
Maximum Likelihood Estimation (MLE)             EM or Baum-Welch (BW)
Maximum Mutual Information Estimation (MMIE)    growth transformation or extended BW (EBW)
  (Maximum Conditional Likelihood)
Minimum Classification Error (MCE)              gradient descent, GPD, Quickprop, etc.
Large Margin Estimation (LME)                   gradient descent, semi-definite programming (SDP)
Maximum Relative Margin Estimation (MRME)       gradient descent
Large Margin Estimation (LME) vs. Discriminative Training (DT)
• The MCE or MMIE criterion bounds the Bayes error only asymptotically:

$$R(\theta) \le \lim_{N\to\infty} Q_{\mathrm{MCE}}(\theta, N), \qquad R(\theta) \le \lim_{N\to\infty} Q_{\mathrm{MMI}}(\theta, N)$$

• But Vapnik's generalization bound holds for a finite body of training data:

$$R(\theta) \;\le\; R_{\mathrm{emp}}(\theta) + \sqrt{\frac{V\left(\log\frac{2N}{V}+1\right) + \log\frac{4}{\delta}}{N}}$$

Discriminative Training corresponds to the asymptotic bounds; Large Margin Estimation corresponds to the finite-sample bound.
Ongoing Works
• How to handle training errors?
  – a combined objective function: margin + training errors (ICASSP'06)
• Extend to large-scale subword-based speech recognition:
  – WSJ-5K
  – SPINE