Joint Discriminative Front-End and Back-End Training for Improved Speech Recognition Accuracy
Jasha Droppo and Alex Acero
Speech Technology Group
Microsoft Research, Redmond, WA, USA
ICASSP 2006 Discriminative Training Session
Presented by Shih-Hung Liu 2006/07/20
Outline
• Introduction
• MMI Training with Rprop
  – MMI
  – Rprop
• Back-end Acoustic Model Training
• Front-end Parameters Training
  – SPLICE Transform
• Experimental Results
• Conclusions
Introduction
• The acoustic processing of standard ASR systems can be roughly divided into two parts
  – A front-end, which extracts the acoustic features
  – A back-end acoustic model, which scores transcription hypotheses for sequences of those acoustic features
• This division can lead to a suboptimal design
  – If useful information is lost in the front end, it cannot be recovered in the back end
  – The acoustic model is constrained to work with whatever fixed feature set is chosen
Introduction
• In theory, a system that could jointly optimize the front-end and back-end acoustic model parameters would overcome this problem
• One problem with traditional front-end design is that the parameters are chosen manually, based on a combination of trial and error and reliance on historical values
• Even though much effort has been applied to training the parameters of the back-end to improve recognition accuracy, the front-end still contains fixed filterbanks, cosine transforms, cepstral lifters, and delta/acceleration regression matrices
Introduction
• Many existing front-end designs are also uniform, in the sense that they extract the same features regardless of absolute location in the acoustic space
• This is not a desirable quality, as the information needed to discriminate /f/ from /s/ is quite different from that needed to tell the difference between /aa/ and /ae/
• A front-end with uniform feature extraction must make compromises to get good coverage over the entire range of speech sounds
MMI objective function
• The MMI objective function is the sum of the log conditional probabilities of the correct transcription w_r of each utterance r, given its corresponding acoustics Y_r:

    F = \sum_r F_r = \sum_r \ln p(w_r | Y_r)
      = \sum_r \ln \frac{ p(f(Y_r; \lambda), w_r; \theta) \, J_f(Y_r) }{ \sum_w p(f(Y_r; \lambda), w; \theta) \, J_f(Y_r) }

• The front-end feature transformation: X_r = f(Y_r; \lambda)
• The back-end acoustic score: p(X_r, w; \theta)
• J_f(Y_r) is the Jacobian of the transformation f(Y_r; \lambda)
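As a toy illustration, the per-utterance MMI term is the log of the correct hypothesis's joint score minus the log-sum of all hypotheses' joint scores; a Jacobian factor shared by every hypothesis cancels in the ratio. The function name and the scores below are invented for illustration:

```python
import math

def mmi_utterance_term(joint_log_scores, correct_idx):
    # F_r = ln p(w_r | Y_r) = (log numerator) - (log-sum-exp of all hypotheses).
    # Shift by the max score for numerical stability.
    m = max(joint_log_scores)
    log_den = m + math.log(sum(math.exp(s - m) for s in joint_log_scores))
    return joint_log_scores[correct_idx] - log_den

# Invented joint log-scores ln p(X_r, w; theta) for three hypotheses;
# index 0 is the correct transcription w_r.
scores = [-10.0, -12.0, -13.5]
F_r = mmi_utterance_term(scores, 0)  # approaches 0 as the model grows confident
```

Maximizing F thus pushes the correct transcription's score up relative to all competitors, not just up in absolute terms.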
Rprop
• Rprop, which stands for “Resilient back-propagation”, is a well-known algorithm originally developed to train neural networks
• In this paper, Rprop is employed to find parameter values that increase the MMI objective function
• By design, Rprop only needs the sign of the gradient of the objective function w.r.t. each parameter
Rprop

    for each parameter λ_i {
        d_i(t) = ∂F/∂λ_i
        if ( d_i(t) · d_i(t−1) > 0 ) then {
            Δ_i(t) = min( 1.2 · Δ_i(t−1), Δ_max )
        } else if ( d_i(t) · d_i(t−1) < 0 ) then {
            Δ_i(t) = max( 0.5 · Δ_i(t−1), Δ_min )
        }
        λ_i(t+1) = λ_i(t) + sign( d_i(t) ) · Δ_i(t)
    }

    Δ_i(0) = 0.01,  Δ_min = 0.00001,  Δ_max = 0.1
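The Rprop update transliterates directly into Python. This is a sketch of the simplified sign-based variant described on this slide (function and variable names are mine), maximizing the objective with the slide's step-size constants:

```python
def rprop_step(params, grads, prev_grads, deltas,
               delta_min=0.00001, delta_max=0.1):
    """One step of simplified Rprop (maximization). Only the sign of
    each partial derivative is used, with a per-parameter step size."""
    for i in range(len(params)):
        if grads[i] * prev_grads[i] > 0:     # sign repeated: accelerate
            deltas[i] = min(1.2 * deltas[i], delta_max)
        elif grads[i] * prev_grads[i] < 0:   # sign flipped: back off
            deltas[i] = max(0.5 * deltas[i], delta_min)
        # move each parameter uphill by its own adaptive step
        if grads[i] > 0:
            params[i] += deltas[i]
        elif grads[i] < 0:
            params[i] -= deltas[i]
        prev_grads[i] = grads[i]

# Toy usage: maximize F(x) = -(x - 3)^2, whose gradient is -2(x - 3).
params, prev_grads, deltas = [0.0], [0.0], [0.01]
for _ in range(500):
    grads = [-2.0 * (params[0] - 3.0)]
    rprop_step(params, grads, prev_grads, deltas)
```

Because only gradient signs are used, the update is insensitive to the very different gradient magnitudes of front-end and back-end parameters, which is what makes it attractive for joint training.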
Back-end Acoustic Model Training

    \frac{\partial F}{\partial \theta} = \sum_r \sum_t \sum_s \frac{\partial F}{\partial \ln p(x_t^r | s)} \cdot \frac{\partial \ln p(x_t^r | s)}{\partial \theta}

    \frac{\partial F}{\partial \ln p(x_t^r | s)} = \gamma_{rts}^{num} - \gamma_{rts}^{den}

where the numerator and denominator state posteriors are

    \gamma_{rts}^{num} = p(s_t = s | X_r, w_r), \qquad \gamma_{rts}^{den} = \sum_w p(w | X_r) \, p(s_t = s | X_r, w)

For the Gaussian means, \partial \ln p(x_t^r | s) / \partial \mu_s = \Sigma_s^{-1} (x_t^r - \mu_s), so

    \frac{\partial F}{\partial \mu_s} = \sum_{r,t} \left( \gamma_{rts}^{num} - \gamma_{rts}^{den} \right) \Sigma_s^{-1} (x_t^r - \mu_s)
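The mean gradient can be sketched in NumPy for diagonal-covariance Gaussians; array shapes and names here are my own assumptions, with state posteriors taken as given:

```python
import numpy as np

def mean_gradient(x, gamma_num, gamma_den, means, inv_vars):
    """dF/dmu_s = sum_t (gamma_num[t,s] - gamma_den[t,s]) Sigma_s^-1 (x_t - mu_s).
    x: (T, D) features; gamma_*: (T, S) state posteriors;
    means, inv_vars: (S, D) per-state means and inverse variances."""
    grad = np.zeros_like(means)
    for s in range(means.shape[0]):
        weight = (gamma_num[:, s] - gamma_den[:, s])[:, None]  # (T, 1)
        grad[s] = (weight * inv_vars[s] * (x - means[s])).sum(axis=0)
    return grad

# Tiny example: one state, 1-D features, two frames
x = np.array([[1.0], [3.0]])
gamma_num = np.array([[1.0], [1.0]])
gamma_den = np.array([[0.0], [0.0]])
grad = mean_gradient(x, gamma_num, gamma_den,
                     means=np.array([[0.0]]), inv_vars=np.array([[1.0]]))
```

Where the numerator and denominator posteriors agree, the gradient vanishes; the means only move where the correct transcription and its competitors disagree about state occupancy.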
SPLICE Transform
• The SPLICE transform was first introduced as a method for compensating for noisy speech
• It models the relationship between feature vectors y and x as a constrained Gaussian mixture model (GMM), and then uses this relationship to construct estimates of x given observations of y
• One way of parameterizing the joint GMM on x and y is as a GMM on y, plus a conditional expectation of x given y and the model state m:

    p(y, m) = N(y; \mu_m, \sigma_m) \, \pi_m
    E[x | y, m] = A_m y + b_m
SPLICE Transform
• The SPLICE transform is defined as the minimum mean squared error estimate of x, given y and the model parameters λ
• In effect, the GMM induces a piecewise linear mapping from y to x:

    \hat{x} = f(y; \lambda) = E[x | y] = \sum_m (A_m y + b_m) \, p(m | y)

• Simplified form of the SPLICE transform:

    f(y; \lambda) = y + \sum_m b_m \, p(m | y)
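A minimal NumPy sketch of the simplified transform, assuming a diagonal-covariance GMM on y; the function name and all parameter values below are invented for illustration:

```python
import numpy as np

def splice_enhance(y, means, variances, priors, offsets):
    """Simplified SPLICE: x_hat = y + sum_m p(m|y) b_m,
    with p(m|y) computed from a diagonal-covariance GMM on y."""
    # component log-likelihoods ln N(y; mu_m, sigma_m) + ln pi_m
    ll = -0.5 * (((y - means) ** 2) / variances
                 + np.log(2.0 * np.pi * variances)).sum(axis=1)
    ll += np.log(priors)
    post = np.exp(ll - ll.max())
    post /= post.sum()                  # posteriors p(m | y)
    return y + post @ offsets           # add posterior-weighted offsets b_m

# Invented 2-component model in 2 dimensions
means = np.array([[0.0, 0.0], [10.0, 10.0]])
variances = np.ones((2, 2))
priors = np.array([0.5, 0.5])
offsets = np.array([[1.0, 1.0], [-1.0, -1.0]])
x_hat = splice_enhance(np.array([0.0, 0.0]), means, variances, priors, offsets)
```

Because the posteriors depend on where y falls in acoustic space, each region of the GMM applies its own correction — exactly the non-uniform feature extraction the introduction argues for.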
Front-end Parameters Training

    \frac{\partial F}{\partial \lambda} = \sum_r \sum_t \frac{\partial F}{\partial x_t^r} \cdot \frac{\partial x_t^r}{\partial \lambda}

The gradient w.r.t. the transformed features uses the same posteriors as back-end training:

    \frac{\partial F}{\partial x_{ti}^r} = \sum_s \left( \gamma_{rts}^{num} - \gamma_{rts}^{den} \right) \left[ \Sigma_s^{-1} (\mu_s - x_t^r) \right]_i

For the simplified SPLICE transform x_t^r = y_t^r + \sum_m b_m \, p(m | y_t^r), the posteriors p(m | y_t^r) depend only on y, not on the offsets, so

    \frac{\partial x_{ti}^r}{\partial b_{mu}} = p(m | y_t^r) \, \delta_{iu}

and therefore

    \frac{\partial F}{\partial b_{mi}} = \sum_{r,t} p(m | y_t^r) \sum_s \left( \gamma_{rts}^{num} - \gamma_{rts}^{den} \right) \left[ \Sigma_s^{-1} (\mu_s - x_t^r) \right]_i
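Chaining the feature gradient through the simplified SPLICE transform gives the offset gradient. A NumPy sketch with my own names and shapes, for diagonal-covariance Gaussians:

```python
import numpy as np

def feature_gradient(x, gamma_num, gamma_den, means, inv_vars):
    """dF/dx_t = sum_s (gamma_num[t,s] - gamma_den[t,s]) Sigma_s^-1 (mu_s - x_t).
    x: (T, D); gamma_*: (T, S); means, inv_vars: (S, D)."""
    grad = np.zeros_like(x)
    for s in range(means.shape[0]):
        weight = (gamma_num[:, s] - gamma_den[:, s])[:, None]
        grad += weight * inv_vars[s] * (means[s] - x)
    return grad

def offset_gradient(gmm_posteriors, dF_dx):
    """dF/db_m = sum_t p(m|y_t) dF/dx_t, valid because the GMM
    posteriors p(m|y_t) do not depend on the offsets b_m."""
    return gmm_posteriors.T @ dF_dx      # (M, D)

# Tiny example: two frames, each fully assigned to one of two components
post = np.array([[1.0, 0.0], [0.0, 1.0]])
dF_dx = np.array([[1.0, 2.0], [3.0, 4.0]])
grad_b = offset_gradient(post, dF_dx)
```

Each offset b_m thus accumulates the feature-space gradient of exactly those frames its GMM component is responsible for, which is how the front-end learns region-specific corrections.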
Aurora 2 Baseline
• The task consists of recognizing strings of English digits embedded in a range of artificial noise conditions
• The acoustic model (AM) used for recognition was trained with the standard complex back-end Aurora 2 scripts on the multi-condition training data
• The AM contains eleven whole-word models, plus sil and sp, and consists of a total of 3628 diagonal Gaussian mixture components, each with 39 dimensions
Aurora 2 Baseline
• Each utterance in the training and testing sets was normalized using whole-utterance Gaussianization
• This simple cepstral histogram normalization method provides a very strong baseline
• WER results are reported on test set A, averaged across 0 dB to 20 dB SNRs; the baseline system achieves a WER of 6.38%
Experimental Results
Conclusions
• We have presented a framework for jointly training the front-end and back-end of a speech recognition system against the same discriminative objective function
• Experiments indicate that joint training achieves lower word error rates in fewer training iterations