Joint Discriminative Front-End and Back-End Training for Improved Speech Recognition Accuracy


Page 1

Joint Discriminative Front-End and Back-End Training for Improved Speech Recognition Accuracy

Jasha Droppo and Alex Acero

Speech Technology Group

Microsoft Research, Redmond, WA, USA

ICASSP 2006 Discriminative Training Session

Presented by Shih-Hung Liu 2006/07/20

Page 2

Outline

• Introduction

• MMI Training with Rprop
  – MMI
  – Rprop

• Back-end Acoustic Model Training

• Front-end Parameter Training
  – SPLICE Transform

• Experimental Results

• Conclusions

Page 3

Introduction

• The acoustic processing of standard ASR systems can be roughly divided into two parts
  – A front-end, which extracts the acoustic features
  – A back-end acoustic model, which scores transcription hypotheses for sequences of those acoustic features

• This division might lead to a suboptimal design
  – If useful information is lost in the front-end, it cannot be recovered in the back-end
  – The acoustic model is constrained to work with whatever fixed feature set is chosen

Page 4

Introduction

• In theory, a system that could jointly optimize the front-end and back-end acoustic model parameters would overcome this problem

• One problem with traditional front-end design is that the parameters are chosen manually, through a combination of trial and error and reliance on historical values

• Even though much effort has been applied to training the back-end parameters to improve recognition accuracy, the front-end still contains fixed filterbanks, cosine transforms, cepstral lifters, and delta/acceleration regression matrices

Page 5

Introduction

• Many existing front-end designs are also uniform, in the sense that they extract the same features regardless of absolute location in the acoustic space

• This is not a desirable quality, as the information needed to discriminate /f/ from /s/ is quite different from that needed to tell the difference between /aa/ and /ae/

• A front-end with uniform feature extraction must make compromises to get good coverage over the entire range of speech sounds

Page 6

MMI objective function

• The MMI objective function is the sum of the log conditional probabilities of the correct transcription $w_r$ of each utterance $r$, given its corresponding acoustics $Y_r$

• The front-end feature transformation: $X_r = f(Y_r; \lambda)$

• The back-end acoustic score: $p(X_r, w_r; \theta)$

• $J_f(Y_r)$ is the Jacobian of the transformation $f(Y_r; \lambda)$

$$F(\lambda, \theta) = \sum_r \ln p(w_r \mid Y_r) = \sum_r \ln \frac{p(f(Y_r; \lambda), w_r; \theta)\, J_f(Y_r)}{\sum_w p(f(Y_r; \lambda), w; \theta)\, J_f(Y_r)}$$
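To make the computation concrete, here is a minimal Python sketch of evaluating this objective; the function name and inputs are hypothetical (precomputed numerator and denominator log-scores), not part of the paper. Note that the Jacobian $J_f(Y_r)$ multiplies numerator and denominator alike, so it cancels out of the objective.

```python
import numpy as np
from scipy.special import logsumexp

def mmi_objective(num_scores, den_scores):
    """MMI objective F = sum_r [ ln p(X_r, w_r) - ln sum_w p(X_r, w) ].

    num_scores: (R,) log p(X_r, w_r; theta) for each utterance's
        correct transcription.
    den_scores: list of R arrays of log p(X_r, w; theta) over
        competing hypotheses w (the correct one included).
    The Jacobian J_f(Y_r) appears in both terms and drops out.
    """
    return float(sum(n - logsumexp(d)
                     for n, d in zip(np.asarray(num_scores), den_scores)))
```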

Page 7

Rprop

• Rprop, which stands for “resilient backpropagation”, is a well-known algorithm originally developed to train neural networks

• In this paper, Rprop is employed to find parameter values that increase the MMI objective function

• By design, Rprop needs only the sign of the gradient of the objective function w.r.t. each parameter

Page 8

Rprop

With $\Delta_0 = 0.01$, $\Delta_{\min} = 0.00001$, $\Delta_{\max} = 0.1$:

for each parameter $\lambda_i$ {
  if $\left(\frac{\partial F}{\partial \lambda_i}(t) \cdot \frac{\partial F}{\partial \lambda_i}(t-1) > 0\right)$ then
    $\Delta_i(t) = \min\left(1.2\, \Delta_i(t-1),\ \Delta_{\max}\right)$
  else if $\left(\frac{\partial F}{\partial \lambda_i}(t) \cdot \frac{\partial F}{\partial \lambda_i}(t-1) < 0\right)$ then
    $\Delta_i(t) = \max\left(0.5\, \Delta_i(t-1),\ \Delta_{\min}\right)$
  $\lambda_i(t+1) = \lambda_i(t) + \Delta_i(t)\, \operatorname{sign}\!\left(\frac{\partial F}{\partial \lambda_i}(t)\right)$
}
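A minimal NumPy sketch of this update rule, assuming per-parameter gradients are available as arrays; the function name and signature are illustrative, with the constants taken from the slide.

```python
import numpy as np

def rprop_step(params, grad, prev_grad, step,
               step_min=1e-5, step_max=0.1, inc=1.2, dec=0.5):
    """One Rprop ascent step on the objective.

    Only the sign of the gradient is used; each parameter keeps its
    own step size, grown when the gradient sign persists and shrunk
    when it flips.
    """
    sign_change = grad * prev_grad
    step = np.where(sign_change > 0, np.minimum(step * inc, step_max), step)
    step = np.where(sign_change < 0, np.maximum(step * dec, step_min), step)
    params = params + step * np.sign(grad)
    return params, step

# Usage: initialize step = np.full_like(params, 0.01) and
# prev_grad = np.zeros_like(params), then call once per iteration,
# saving grad as prev_grad for the next call.
```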

Page 9

Back-end Acoustic Model Training

Differentiating $F$ with respect to the back-end acoustic model parameters $\theta$:

$$\frac{\partial F}{\partial \theta} = \sum_r \sum_t \sum_s \left(\gamma^{\mathrm{num}}_{r,t,s} - \gamma^{\mathrm{den}}_{r,t,s}\right) \frac{\partial \ln p(x_{r,t} \mid s)}{\partial \theta}$$

where the numerator and denominator state occupancies are

$$\gamma^{\mathrm{num}}_{r,t,s} = p(s_{r,t} = s \mid X_r, w_r), \qquad \gamma^{\mathrm{den}}_{r,t,s} = p(s_{r,t} = s \mid X_r)$$

For the Gaussian means, $\partial \ln p(x_{r,t} \mid s) / \partial \mu_s = \Sigma_s^{-1}(x_{r,t} - \mu_s)$, giving

$$\frac{\partial F}{\partial \mu_s} = \sum_r \sum_t \left(\gamma^{\mathrm{num}}_{r,t,s} - \gamma^{\mathrm{den}}_{r,t,s}\right) \Sigma_s^{-1} \left(x_{r,t} - \mu_s\right)$$
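As a sketch of how this mean gradient might be accumulated for one utterance with diagonal-covariance Gaussians (all names here are illustrative, not from the paper):

```python
import numpy as np

def mean_gradient(x, gamma_num, gamma_den, mu, var):
    """dF/d(mu_s) for diagonal-covariance state Gaussians.

    x:         (T, D) front-end features for one utterance
    gamma_num: (T, S) numerator (correct-transcription) occupancies
    gamma_den: (T, S) denominator (all-hypotheses) occupancies
    mu, var:   (S, D) state means and diagonal variances
    Returns (S, D): sum_t (g_num - g_den) * (x_t - mu_s) / var_s.
    """
    diff = gamma_num - gamma_den            # (T, S)
    weighted_x = diff.T @ x                 # (S, D) sums of diff * x_t
    occ = diff.sum(axis=0)[:, None]         # (S, 1) net occupancies
    return (weighted_x - occ * mu) / var
```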

Page 10

SPLICE Transform

• The SPLICE transform was first introduced as a method for compensating noisy speech.

• It models the relationship between feature vectors y and x as a constrained Gaussian mixture model (GMM), and then uses this relationship to construct estimates of x given observations of y

• One way of parameterizing the joint GMM on x and y is as a GMM on y, and a conditional expectation of x given y and the model state m

$$p(y, m) = N(y; \mu_m, \sigma_m)\, \pi_m$$

$$E[x \mid y, m] = A_m y + b_m$$

Page 11

SPLICE Transform

• The SPLICE transform is defined as the minimum mean squared error estimate of x, given y and the model parameters $\lambda$:

$$\hat{x} = f(y; \lambda) = E[x \mid y] = \sum_m \left(A_m y + b_m\right) p(m \mid y)$$

• In effect, the GMM induces a piecewise linear mapping from y to x

• Simplified form of the SPLICE transform (with $A_m = I$):

$$f(y; \lambda) = y + \sum_m b_m\, p(m \mid y)$$
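A minimal sketch of the simplified transform, assuming a diagonal-covariance GMM on y; the function name and argument layout are illustrative:

```python
import numpy as np

def splice_transform(y, means, variances, priors, offsets):
    """Simplified SPLICE: x_hat = y + sum_m b_m p(m | y).

    y:         (D,)  observed feature vector
    means:     (M, D) GMM means on y
    variances: (M, D) diagonal GMM variances on y
    priors:    (M,)  mixture weights pi_m
    offsets:   (M, D) correction vectors b_m
    """
    # log p(y, m) = log N(y; mu_m, sigma_m) + log pi_m
    log_joint = (-0.5 * (((y - means) ** 2) / variances
                         + np.log(2 * np.pi * variances)).sum(axis=1)
                 + np.log(priors))
    # p(m | y) via a numerically stable softmax over components.
    post = np.exp(log_joint - np.logaddexp.reduce(log_joint))
    return y + post @ offsets
```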

Page 12

Front-end Parameter Training

Differentiating $F$ with respect to the front-end parameters $\lambda$ (the SPLICE offsets $b_m$) via the chain rule:

$$\frac{\partial F}{\partial b_{m,i}} = \sum_r \sum_t \frac{\partial F}{\partial x_{r,t,i}} \cdot \frac{\partial x_{r,t,i}}{\partial b_{m,i}}, \qquad \frac{\partial x_{r,t,i}}{\partial b_{m,i}} = p(m \mid y_{r,t})$$

The derivative of the objective with respect to the transformed features has the same numerator-minus-denominator form as in the back-end case, with $\gamma^{\mathrm{num}}_{r,t,s} = p(s_{r,t} = s \mid X_r, w_r)$ and $\gamma^{\mathrm{den}}_{r,t,s} = p(s_{r,t} = s \mid X_r)$:

$$\frac{\partial F}{\partial x_{r,t,i}} = \sum_s \left(\gamma^{\mathrm{num}}_{r,t,s} - \gamma^{\mathrm{den}}_{r,t,s}\right) \frac{\mu_{s,i} - x_{r,t,i}}{\sigma^2_{s,i}}$$

so that

$$\frac{\partial F}{\partial b_{m,i}} = \sum_r \sum_t p(m \mid y_{r,t}) \sum_s \left(\gamma^{\mathrm{num}}_{r,t,s} - \gamma^{\mathrm{den}}_{r,t,s}\right) \frac{\mu_{s,i} - x_{r,t,i}}{\sigma^2_{s,i}}$$
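Combining the two factors above, a per-utterance accumulation of the offset gradient could look like this sketch (illustrative names, diagonal covariances):

```python
import numpy as np

def offset_gradient(x, post_m, gamma_num, gamma_den, mu, var):
    """dF/d(b_m) for the simplified SPLICE front-end.

    x:         (T, D) transformed features x_t = f(y_t; lambda)
    post_m:    (T, M) SPLICE posteriors p(m | y_t)
    gamma_num: (T, S) numerator state occupancies
    gamma_den: (T, S) denominator state occupancies
    mu, var:   (S, D) back-end means and diagonal variances
    Returns (M, D): sum_t p(m | y_t) * dF/dx_t.
    """
    diff = gamma_num - gamma_den                           # (T, S)
    # dF/dx_t = sum_s diff[t, s] * (mu_s - x_t) / var_s
    dF_dx = diff @ (mu / var) - x * (diff @ (1.0 / var))   # (T, D)
    return post_m.T @ dF_dx                                # (M, D)
```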

Page 13

Aurora 2 Baseline

• The task consists of recognizing strings of English digits embedded in a range of artificial noise conditions

• The acoustic model (AM) used for recognition was trained with the standard complex back-end Aurora 2 scripts on the multi-condition training data

• The AM contains eleven whole-word models, plus sil and sp, and consists of a total of 3628 diagonal Gaussian mixture components, each with 39 dimensions

Page 14

Aurora 2 Baseline

• Each utterance in the training and testing sets was normalized using whole-utterance Gaussianization (sketched after this list)

• This simple cepstral histogram normalization method provides a very strong baseline

• WER results are presented on test set A, averaged across 0 dB to 20 dB SNRs; the baseline system achieves a WER of 6.38%
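For reference, whole-utterance Gaussianization can be sketched as a rank-based mapping of each cepstral dimension onto standard normal quantiles; this is one common realization of cepstral histogram normalization, and the paper's exact recipe may differ.

```python
import numpy as np
from scipy.stats import norm

def gaussianize(features):
    """Map each feature dimension's empirical CDF to a standard normal.

    features: (T, D) cepstral features for one utterance.
    """
    T, _ = features.shape
    ranks = features.argsort(axis=0).argsort(axis=0)  # per-dimension ranks 0..T-1
    # Shift ranks into (0, 1) before applying the normal quantile function.
    return norm.ppf((ranks + 0.5) / T)
```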

Page 15

Experimental Results

Page 16

Conclusions

• We have presented a framework for jointly training the front-end and back-end of a speech recognition system against the same discriminative objective function

• Experiments indicate that joint training achieves better word error rates with fewer iterations