Joint Discriminative Front-End and Back-End Training for Improved Speech Recognition Accuracy
Jasha Droppo and Alex Acero
Speech Technology Group
Microsoft Research, Redmond, WA, USA
ICASSP 2006 Discriminative Training Session
Presented by Shih-Hung Liu 2006/07/20
Outline
• Introduction
• MMI Training with Rprop
  – MMI
  – Rprop
• Back-end Acoustic Model Training
• Front-end Parameters Training
  – SPLICE Transform
• Experimental Results
• Conclusions
Introduction
• The acoustic processing of standard ASR systems can be roughly divided into two parts
  – A front-end, which extracts the acoustic features
  – A back-end acoustic model, which scores transcription hypotheses for sequences of those acoustic features
• This division can lead to a suboptimal design
  – If useful information is lost in the front end, it cannot be recovered in the back end
  – The acoustic model is constrained to work with whatever fixed feature set is chosen
Introduction
• In theory, a system that could jointly optimize the front-end and back-end acoustic model parameters would overcome this problem
• One problem with traditional front-end design is that the parameters are chosen manually, based on a combination of trial and error and reliance on historical values
• Even though much effort has been applied to training the parameters of the back-end to improve recognition accuracy, the front-end still contains fixed filterbanks, cosine transforms, cepstral lifters, and delta/acceleration regression matrices
Introduction
• Many existing front-end designs are also uniform, in the sense that they extract the same features regardless of absolute location in the acoustic space
• This is not a desirable quality, as the information needed to discriminate /f/ from /s/ is quite different from that needed to tell the difference between /aa/ and /ae/
• A front-end with uniform feature extraction must make compromises to get good coverage over the entire range of speech sounds
MMI objective function
• The MMI objective function is the sum of the log conditional probabilities of the correct transcription w_r of each utterance r, given its corresponding acoustics Y_r:

    F = \sum_r F_r = \sum_r \ln p(w_r | Y_r)
      = \sum_r \ln \frac{ p(f(Y_r; \lambda), w_r; \theta) \, J_f(Y_r) }{ \sum_w p(f(Y_r; \lambda), w; \theta) \, J_f(Y_r) }

• The front-end feature transformation: X_r = f(Y_r; \lambda)
• The back-end acoustic score: p(X_r, w; \theta)
• J_f(Y_r) is the Jacobian of the transformation f(Y_r; \lambda)
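As a toy illustration, the per-utterance MMI term is the log of the correct hypothesis's joint score minus the log-sum of all hypotheses' joint scores; a Jacobian factor shared by every hypothesis cancels in the ratio. The function name and the scores below are invented for illustration:

```python
import math

def mmi_utterance_term(joint_log_scores, correct_idx):
    # F_r = ln p(w_r | Y_r) = (log numerator) - (log-sum-exp of all hypotheses).
    # Shift by the max score for numerical stability.
    m = max(joint_log_scores)
    log_den = m + math.log(sum(math.exp(s - m) for s in joint_log_scores))
    return joint_log_scores[correct_idx] - log_den

# Invented joint log-scores ln p(X_r, w; theta) for three hypotheses;
# index 0 is the correct transcription w_r.
scores = [-10.0, -12.0, -13.5]
F_r = mmi_utterance_term(scores, 0)  # approaches 0 as the model grows confident
```

Maximizing F thus pushes the correct transcription's score up relative to all competitors, not just up in absolute terms.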
Rprop
• Rprop, which stands for “Resilient back-propagation”, is a well-known algorithm originally developed to train neural networks
• In this paper, Rprop is employed to find parameter values that increase the MMI objective function
• By design, Rprop only needs the sign of the gradient of the objective function w.r.t. each parameter
Rprop

    for each parameter λ_i {
        d_i(t) = ∂F/∂λ_i
        if ( d_i(t) · d_i(t−1) > 0 ) then {
            Δ_i(t) = min( 1.2 · Δ_i(t−1), Δ_max )
        } else if ( d_i(t) · d_i(t−1) < 0 ) then {
            Δ_i(t) = max( 0.5 · Δ_i(t−1), Δ_min )
        }
        λ_i(t+1) = λ_i(t) + sign( d_i(t) ) · Δ_i(t)
    }

    Δ_i(0) = 0.01,  Δ_min = 0.00001,  Δ_max = 0.1
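The Rprop update transliterates directly into Python. This is a sketch of the simplified sign-based variant described on this slide (function and variable names are mine), maximizing the objective with the slide's step-size constants:

```python
def rprop_step(params, grads, prev_grads, deltas,
               delta_min=0.00001, delta_max=0.1):
    """One step of simplified Rprop (maximization). Only the sign of
    each partial derivative is used, with a per-parameter step size."""
    for i in range(len(params)):
        if grads[i] * prev_grads[i] > 0:     # sign repeated: accelerate
            deltas[i] = min(1.2 * deltas[i], delta_max)
        elif grads[i] * prev_grads[i] < 0:   # sign flipped: back off
            deltas[i] = max(0.5 * deltas[i], delta_min)
        # move each parameter uphill by its own adaptive step
        if grads[i] > 0:
            params[i] += deltas[i]
        elif grads[i] < 0:
            params[i] -= deltas[i]
        prev_grads[i] = grads[i]

# Toy usage: maximize F(x) = -(x - 3)^2, whose gradient is -2(x - 3).
params, prev_grads, deltas = [0.0], [0.0], [0.01]
for _ in range(500):
    grads = [-2.0 * (params[0] - 3.0)]
    rprop_step(params, grads, prev_grads, deltas)
```

Because only gradient signs are used, the update is insensitive to the very different gradient magnitudes of front-end and back-end parameters, which is what makes it attractive for joint training.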
Back-end Acoustic Model Training

    \frac{\partial F}{\partial \theta} = \sum_r \sum_t \sum_s \frac{\partial F}{\partial \ln p(x_t^r | s)} \cdot \frac{\partial \ln p(x_t^r | s)}{\partial \theta}

    \frac{\partial F}{\partial \ln p(x_t^r | s)} = \gamma_{rts}^{num} - \gamma_{rts}^{den}

where the numerator and denominator state posteriors are

    \gamma_{rts}^{num} = p(s_t = s | X_r, w_r), \qquad \gamma_{rts}^{den} = \sum_w p(w | X_r) \, p(s_t = s | X_r, w)

For the Gaussian means, \partial \ln p(x_t^r | s) / \partial \mu_s = \Sigma_s^{-1} (x_t^r - \mu_s), so

    \frac{\partial F}{\partial \mu_s} = \sum_{r,t} \left( \gamma_{rts}^{num} - \gamma_{rts}^{den} \right) \Sigma_s^{-1} (x_t^r - \mu_s)
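The mean gradient can be sketched in NumPy for diagonal-covariance Gaussians; array shapes and names here are my own assumptions, with state posteriors taken as given:

```python
import numpy as np

def mean_gradient(x, gamma_num, gamma_den, means, inv_vars):
    """dF/dmu_s = sum_t (gamma_num[t,s] - gamma_den[t,s]) Sigma_s^-1 (x_t - mu_s).
    x: (T, D) features; gamma_*: (T, S) state posteriors;
    means, inv_vars: (S, D) per-state means and inverse variances."""
    grad = np.zeros_like(means)
    for s in range(means.shape[0]):
        weight = (gamma_num[:, s] - gamma_den[:, s])[:, None]  # (T, 1)
        grad[s] = (weight * inv_vars[s] * (x - means[s])).sum(axis=0)
    return grad

# Tiny example: one state, 1-D features, two frames
x = np.array([[1.0], [3.0]])
gamma_num = np.array([[1.0], [1.0]])
gamma_den = np.array([[0.0], [0.0]])
grad = mean_gradient(x, gamma_num, gamma_den,
                     means=np.array([[0.0]]), inv_vars=np.array([[1.0]]))
```

Where the numerator and denominator posteriors agree, the gradient vanishes; the means only move where the correct transcription and its competitors disagree about state occupancy.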
SPLICE Transform
• The SPLICE transform was first introduced as a method for compensating for noisy speech
• It models the relationship between feature vectors y and x as a constrained Gaussian mixture model (GMM), and then uses this relationship to construct estimates of x given observations of y
• One way of parameterizing the joint GMM on x and y is as a GMM on y, plus a conditional expectation of x given y and the model state m:

    p(y, m) = N(y; \mu_m, \sigma_m) \, \pi_m
    E[x | y, m] = A_m y + b_m
SPLICE Transform
• The SPLICE transform is defined as the minimum mean squared error estimate of x, given y and the model parameters λ
• In effect, the GMM induces a piecewise linear mapping from y to x:

    \hat{x} = f(y; \lambda) = E[x | y] = \sum_m (A_m y + b_m) \, p(m | y)

• Simplified form of the SPLICE transform:

    f(y; \lambda) = y + \sum_m b_m \, p(m | y)
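A minimal NumPy sketch of the simplified transform, assuming a diagonal-covariance GMM on y; the function name and all parameter values below are invented for illustration:

```python
import numpy as np

def splice_enhance(y, means, variances, priors, offsets):
    """Simplified SPLICE: x_hat = y + sum_m p(m|y) b_m,
    with p(m|y) computed from a diagonal-covariance GMM on y."""
    # component log-likelihoods ln N(y; mu_m, sigma_m) + ln pi_m
    ll = -0.5 * (((y - means) ** 2) / variances
                 + np.log(2.0 * np.pi * variances)).sum(axis=1)
    ll += np.log(priors)
    post = np.exp(ll - ll.max())
    post /= post.sum()                  # posteriors p(m | y)
    return y + post @ offsets           # add posterior-weighted offsets b_m

# Invented 2-component model in 2 dimensions
means = np.array([[0.0, 0.0], [10.0, 10.0]])
variances = np.ones((2, 2))
priors = np.array([0.5, 0.5])
offsets = np.array([[1.0, 1.0], [-1.0, -1.0]])
x_hat = splice_enhance(np.array([0.0, 0.0]), means, variances, priors, offsets)
```

Because the posteriors depend on where y falls in acoustic space, each region of the GMM applies its own correction — exactly the non-uniform feature extraction the introduction argues for.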
Front-end Parameters Training

    \frac{\partial F}{\partial \lambda} = \sum_r \sum_t \frac{\partial F}{\partial x_t^r} \cdot \frac{\partial x_t^r}{\partial \lambda}

The gradient w.r.t. the transformed features uses the same posteriors as back-end training:

    \frac{\partial F}{\partial x_{ti}^r} = \sum_s \left( \gamma_{rts}^{num} - \gamma_{rts}^{den} \right) \left[ \Sigma_s^{-1} (\mu_s - x_t^r) \right]_i

For the simplified SPLICE transform x_t^r = y_t^r + \sum_m b_m \, p(m | y_t^r), the posteriors p(m | y_t^r) depend only on y, not on the offsets, so

    \frac{\partial x_{ti}^r}{\partial b_{mu}} = p(m | y_t^r) \, \delta_{iu}

and therefore

    \frac{\partial F}{\partial b_{mi}} = \sum_{r,t} p(m | y_t^r) \sum_s \left( \gamma_{rts}^{num} - \gamma_{rts}^{den} \right) \left[ \Sigma_s^{-1} (\mu_s - x_t^r) \right]_i
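Chaining the feature gradient through the simplified SPLICE transform gives the offset gradient. A NumPy sketch with my own names and shapes, for diagonal-covariance Gaussians:

```python
import numpy as np

def feature_gradient(x, gamma_num, gamma_den, means, inv_vars):
    """dF/dx_t = sum_s (gamma_num[t,s] - gamma_den[t,s]) Sigma_s^-1 (mu_s - x_t).
    x: (T, D); gamma_*: (T, S); means, inv_vars: (S, D)."""
    grad = np.zeros_like(x)
    for s in range(means.shape[0]):
        weight = (gamma_num[:, s] - gamma_den[:, s])[:, None]
        grad += weight * inv_vars[s] * (means[s] - x)
    return grad

def offset_gradient(gmm_posteriors, dF_dx):
    """dF/db_m = sum_t p(m|y_t) dF/dx_t, valid because the GMM
    posteriors p(m|y_t) do not depend on the offsets b_m."""
    return gmm_posteriors.T @ dF_dx      # (M, D)

# Tiny example: two frames, each fully assigned to one of two components
post = np.array([[1.0, 0.0], [0.0, 1.0]])
dF_dx = np.array([[1.0, 2.0], [3.0, 4.0]])
grad_b = offset_gradient(post, dF_dx)
```

Each offset b_m thus accumulates the feature-space gradient of exactly those frames its GMM component is responsible for, which is how the front-end learns region-specific corrections.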
Aurora 2 Baseline
• The task consists of recognizing strings of English digits embedded in a range of artificial noise conditions
• The acoustic model (AM) used for recognition was trained with the standard complex back-end Aurora 2 scripts on the multi-condition training data
• The AM contains eleven whole-word models, plus sil and sp, and consists of a total of 3628 diagonal Gaussian mixture components, each with 39 dimensions
Aurora 2 Baseline
• Each utterance in the training and testing sets was normalized using whole-utterance Gaussianization
• This simple cepstral histogram normalization method provides a very strong baseline
• WER results are reported on test set A, averaged across 0 dB to 20 dB SNRs; the baseline system achieves a WER of 6.38%
Experimental Results
Conclusions
• We have presented a framework for jointly training the front-end and back-end of a speech recognition system against the same discriminative objective function
• Experiments indicate that joint training achieves lower word error rates in fewer training iterations