Joint Discriminative Front-End and Back-End Training for Improved Speech Recognition Accuracy



Jasha Droppo and Alex Acero

Speech Technology Group

Microsoft Research, Redmond, WA, USA

ICASSP 2006 Discriminative Training Session

Presented by Shih-Hung Liu 2006/07/20


Outline

• Introduction

• MMI Training with Rprop
– MMI
– Rprop

• Back-end Acoustic Model Training

• Front-end Parameter Training
– SPLICE Transform

• Experimental Results

• Conclusions


Introduction

• The acoustic processing of standard ASR systems can be roughly divided into two parts
– A front end, which extracts the acoustic features
– A back-end acoustic model, which scores transcription hypotheses for sequences of those acoustic features

• This division can lead to a suboptimal design
– If useful information is lost in the front end, it cannot be recovered in the back end
– The acoustic model is constrained to work with whatever fixed feature set is chosen


Introduction

• In theory, a system that could jointly optimize the front-end and back-end acoustic model parameters would overcome this problem

• One problem with traditional front-end design is that the parameters are chosen manually, through a combination of trial and error and reliance on historical values

• Even though much effort has been applied to training the parameters of the back end to improve recognition accuracy, the front end still contains fixed filterbanks, cosine transforms, cepstral lifters, and delta/acceleration regression matrices


Introduction

• Many existing front-end designs are also uniform, in the sense that they extract the same features regardless of absolute location in the acoustic space

• This is not a desirable quality, as the information needed to discriminate /f/ from /s/ is quite different from that needed to tell the difference between /aa/ and /ae/

• A front-end with uniform feature extraction must make compromises to get good coverage over the entire range of speech sounds


MMI objective function

• The MMI objective function is the sum, over all training utterances $r$, of the log conditional probabilities of the correct transcriptions $w_r$, given their corresponding acoustics $\mathbf{Y}_r$:

$$F(\lambda, \theta) = \sum_r \ln p(w_r \mid \mathbf{Y}_r) = \sum_r \ln \frac{p(f(\mathbf{Y}_r; \lambda), w_r; \theta)\, J_f(\mathbf{Y}_r)}{\sum_w p(f(\mathbf{Y}_r; \lambda), w; \theta)\, J_f(\mathbf{Y}_r)}$$

• The front end is the feature transformation $\mathbf{X}_r = f(\mathbf{Y}_r; \lambda)$

• The back end provides the acoustic score $p(\mathbf{X}_r, w; \theta)$

• $J_f(\mathbf{Y}_r)$ is the Jacobian of the transformation $f(\mathbf{Y}_r; \lambda)$
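As a concrete illustration, here is a minimal Python sketch of the per-utterance MMI criterion, assuming the joint log-scores $\ln p(f(\mathbf{Y}_r; \lambda), w; \theta)$ have already been computed (in practice the denominator sum runs over a recognition lattice). Since the Jacobian $J_f(\mathbf{Y}_r)$ is common to the numerator and denominator, it cancels and can be ignored here. The function name and inputs are illustrative, not from the paper.

```python
import numpy as np

def mmi_utterance_score(num_logscore, den_logscores):
    """Per-utterance MMI objective ln p(w_r | Y_r).

    num_logscore:  ln p(f(Y_r), w_r; theta) for the correct transcription
    den_logscores: ln p(f(Y_r), w; theta) for the competing hypotheses w
                   (in practice read off a recognition lattice)
    The Jacobian J_f(Y_r) cancels between numerator and denominator.
    """
    # Stable log-sum-exp over the denominator hypotheses
    return num_logscore - np.logaddexp.reduce(den_logscores)

# Toy check: the better the correct hypothesis scores relative to its
# competitors, the closer ln p(w_r | Y_r) is to zero (probability one)
print(mmi_utterance_score(-100.0, np.array([-100.0, -108.0, -112.0])))
```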


Rprop

• Rprop, which stands for "Resilient back-propagation", is a well-known algorithm that was originally developed to train neural networks

• For this paper, Rprop is employed to find parameter values that increase the MMI objective function

• By design, Rprop only needs the sign of the gradient of the objective function w.r.t. each parameter


Rprop

The Rprop update rule, with the step sizes initialized and bounded as $\Delta_0 = 0.01$, $\Delta_{\min} = 0.00001$, $\Delta_{\max} = 0.1$:

for each parameter λ_i {
    d_i(t) = ∂F/∂λ_i
    if (d_i(t) · d_i(t−1) > 0) then
        Δ_i(t) = min(1.2 · Δ_i(t−1), Δ_max)
    else if (d_i(t) · d_i(t−1) < 0) then
        Δ_i(t) = max(0.5 · Δ_i(t−1), Δ_min)
    λ_i(t+1) = λ_i(t) + sign(d_i(t)) · Δ_i(t)
}
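For reference, a minimal NumPy sketch of one Rprop step matching the pseudocode above; the vectorized form and the function signature are my own, not from the paper.

```python
import numpy as np

def rprop_step(lam, grad, prev_grad, step,
               d_min=1e-5, d_max=0.1, inc=1.2, dec=0.5):
    """One Rprop update for maximizing the MMI objective F.

    lam:       current parameter vector lambda
    grad:      dF/dlambda at this iteration
    prev_grad: dF/dlambda from the previous iteration
    step:      per-parameter step sizes Delta_i (initialized to 0.01)
    Only the sign of the gradient is used, never its magnitude.
    """
    agree = grad * prev_grad
    # Same sign as last time: grow the step; sign flipped: shrink it
    step = np.where(agree > 0, np.minimum(step * inc, d_max), step)
    step = np.where(agree < 0, np.maximum(step * dec, d_min), step)
    # Move each parameter uphill by its own adaptive step size
    return lam + np.sign(grad) * step, step
```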


Back-end Acoustic Model Training

Differentiating the MMI objective with respect to the back-end HMM parameters $\theta$:

$$\frac{\partial F}{\partial \theta} = \sum_r \frac{\partial}{\partial \theta} \ln p(w_r \mid \mathbf{X}_r)$$

The gradient with respect to the log-likelihood of state $s$ at time $t$ reduces to a difference of posteriors:

$$\frac{\partial F}{\partial \ln p(x_{rt} \mid s)} = \gamma^{\mathrm{num}}_{rt}(s) - \gamma^{\mathrm{den}}_{rt}(s)$$

so that, for the Gaussian means,

$$\frac{\partial F}{\partial \mu_s} = \sum_{r,t} \left[ \gamma^{\mathrm{num}}_{rt}(s) - \gamma^{\mathrm{den}}_{rt}(s) \right] \Sigma_s^{-1} (x_{rt} - \mu_s)$$

where $\gamma^{\mathrm{num}}_{rt}(s)$ and $\gamma^{\mathrm{den}}_{rt}(s)$ are the posteriors of state $s$ at time $t$ under the correct transcription (numerator) and under all competing hypotheses (denominator), respectively.
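As an illustration, a NumPy sketch of the mean gradient for a single utterance, assuming diagonal covariances (as in the Aurora 2 back end); the array shapes and names are assumptions, not the paper's code.

```python
import numpy as np

def mean_gradient(x, gamma_num, gamma_den, mu, var):
    """dF/dmu_s for diagonal-covariance Gaussian states.

    x:         (T, D) feature vectors x_t for one utterance
    gamma_num: (T, S) numerator (correct-transcription) posteriors
    gamma_den: (T, S) denominator (recognition-lattice) posteriors
    mu, var:   (S, D) state means and diagonal variances
    Returns the (S, D) gradient; training sums it over all utterances.
    """
    diff = gamma_num - gamma_den  # (T, S)
    # sum_t [gamma_num - gamma_den] * inv(Sigma_s) (x_t - mu_s)
    grad = np.einsum('ts,td->sd', diff, x) - diff.sum(0)[:, None] * mu
    return grad / var
```

In joint training, only the sign of this gradient is passed to the Rprop update sketched earlier.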


SPLICE Transform

• The SPLICE transform was first introduced as a method for compensating for noisy speech

• It models the relationship between feature vectors y and x as a constrained Gaussian mixture model (GMM), and then uses this relationship to construct estimates of x given observations of y

• One way of parameterizing the joint GMM on x and y is as a GMM on y, and a conditional expectation of x given y and the model state m

$$p(y, m) = \mathcal{N}(y; \mu_m, \sigma_m)\, \pi_m$$

$$E[x \mid y, m] = A_m y + b_m$$


SPLICE Transform

• The SPLICE transform is defined as the minimum mean squared error estimate of $x$, given $y$ and the model parameters $\lambda$:

$$\hat{x} = f(y; \lambda) = E[x \mid y] = \sum_m (A_m y + b_m)\, p(m \mid y)$$

• In effect, the GMM induces a piecewise linear mapping from $y$ to $x$

• Simplified form of the SPLICE transform, with each rotation $A_m$ fixed to the identity matrix:

$$f(y; \lambda) = y + \sum_m b_m\, p(m \mid y)$$
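A small sketch of the simplified transform, assuming a diagonal-covariance GMM on y; the names and shapes are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def splice(y, weights, means, variances, offsets):
    """Simplified SPLICE transform: x = y + sum_m b_m p(m | y).

    y:         (D,) observed (noisy) feature vector
    weights:   (M,) GMM mixture weights pi_m
    means:     (M, D) GMM means mu_m
    variances: (M, D) diagonal covariances sigma_m
    offsets:   (M, D) correction vectors b_m
    """
    # Posterior p(m | y) under the GMM on y
    loglik = np.log(weights) + np.array([
        multivariate_normal.logpdf(y, means[m], np.diag(variances[m]))
        for m in range(len(weights))
    ])
    post = np.exp(loglik - np.logaddexp.reduce(loglik))
    # Posterior-weighted piecewise correction of y
    return y + post @ offsets
```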


Front-end Parameter Training

Differentiating the MMI objective with respect to the SPLICE offsets $b_m$, using the chain rule through the transformed features $x_{rt} = f(y_{rt}; \lambda)$:

$$\frac{\partial F}{\partial \lambda} = \sum_r \frac{\partial}{\partial \lambda} \ln p(w_r \mid \mathbf{X}_r) = \sum_{r,t} \left( \frac{\partial F}{\partial x_{rt}} \right)^{\!\top} \frac{\partial x_{rt}}{\partial \lambda}$$

For the simplified transform, $\partial x_{rt} / \partial b_m = p(m \mid y_{rt})\, I$, and

$$\frac{\partial F}{\partial x_{rt}} = \sum_s \left[ \gamma^{\mathrm{num}}_{rt}(s) - \gamma^{\mathrm{den}}_{rt}(s) \right] \Sigma_s^{-1} (\mu_s - x_{rt})$$

so that

$$\frac{\partial F}{\partial b_m} = \sum_{r,t} p(m \mid y_{rt}) \sum_s \left[ \gamma^{\mathrm{num}}_{rt}(s) - \gamma^{\mathrm{den}}_{rt}(s) \right] \Sigma_s^{-1} (\mu_s - x_{rt})$$
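Putting the pieces together, a sketch of the offset gradient for one utterance, again assuming diagonal covariances; shapes and names are illustrative.

```python
import numpy as np

def offset_gradient(x, post, gamma_num, gamma_den, mu, var):
    """dF/db_m for the simplified SPLICE front end.

    x:         (T, D) transformed features x_t = f(y_t; lambda)
    post:      (T, M) front-end GMM posteriors p(m | y_t)
    gamma_num: (T, S) numerator state posteriors
    gamma_den: (T, S) denominator state posteriors
    mu, var:   (S, D) back-end means and diagonal variances
    Returns the (M, D) gradient over the SPLICE offsets b_m.
    """
    diff = gamma_num - gamma_den  # (T, S)
    # dF/dx_t = sum_s [gamma_num - gamma_den] inv(Sigma_s) (mu_s - x_t)
    dF_dx = diff @ (mu / var) - (diff @ (1.0 / var)) * x  # (T, D)
    # Chain rule through x_t = y_t + sum_m b_m p(m | y_t)
    return post.T @ dF_dx  # (M, D)
```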


Aurora 2 Baseline

• The task consists of recognizing strings of English digits embedded in a range of artificial noise conditions

• The acoustic model (AM) used for recognition was trained with the standard complex back-end Aurora 2 scripts on the multi-condition training data

• The AM contains eleven whole-word models, plus sil and sp, and consists of a total of 3628 diagonal Gaussian mixture components, each with 39 dimensions


Aurora 2 Baseline

• Each utterance in the training and testing sets was normalized using whole-utterance Gaussianization

• This simple cepstral histogram normalization method provides a very strong baseline

• WER results are presented on test set A, averaged across SNRs from 0 dB to 20 dB; the baseline system achieves a WER of 6.38%
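One common way to implement whole-utterance Gaussianization is a rank-based mapping of each feature dimension onto a standard normal; the sketch below shows that approach and is not necessarily the exact method used in the paper.

```python
import numpy as np
from scipy.stats import norm

def gaussianize(feats):
    """Whole-utterance cepstral histogram normalization.

    feats: (T, D) cepstral features for one utterance
    Maps each dimension's empirical CDF onto a standard normal.
    """
    T = feats.shape[0]
    out = np.empty_like(feats, dtype=float)
    for d in range(feats.shape[1]):
        ranks = np.argsort(np.argsort(feats[:, d]))  # 0 .. T-1
        # Shift ranks by 0.5 so probabilities stay strictly in (0, 1)
        out[:, d] = norm.ppf((ranks + 0.5) / T)
    return out
```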


Experimental Results


Conclusions

• We have presented a framework for jointly training the front-end and back-end of a speech recognition system against the same discriminative objective function

• Experiments indicate that joint training achieves lower word error rates with fewer training iterations