Dynamic Tuning Of Language Model Score In Speech Recognition Using A Confidence Measure Sherif...

Dynamic Tuning Of Language Model Score In Speech Recognition Using A Confidence Measure

Sherif Abdou, Michael Scordilis

Department of Electrical and Computer Engineering, University of Miami, Coral Gables, Florida 33124, U.S.A.

DSAP

Abstract

Speech recognition errors limit the capability of language models to predict subsequent words correctly

Error analysis on Switchboard data shows that:

87% of words preceded by a correct word were correctly decoded

47% of words preceded by an incorrect word were correctly decoded

An effective way to enhance the function of the language model is by using confidence measures

Most current efforts to develop confidence measures for speech recognition focus on verifying the final result, but make no attempt to correct recognition errors

In this work, we use confidence measures early during the search process.

A word-based acoustic confidence metric is used to define a dynamic language weight.

Using Confidence To Guide The Search

The search score is changed from

$$\mathrm{Score}(W/A) = P(A/W) \cdot P(W)^{LW}$$

to the confidence-based score

$$\mathrm{Score}(W/A) = P(A/W) \cdot P(W)^{LW(C(W))}$$

where

A: the acoustic input
W: the hypothesized word sequence
P(A/W): the acoustic model score
P(W): the language model score
LW: the language weight
C(W): the confidence of word sequence W
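Decoders usually combine the two scores in the log domain, where the exponent on P(W) becomes a multiplier. The following is a minimal sketch under that assumption; the function name `combined_log_score` and the sample probabilities are illustrative and do not come from the slides.

```python
import math

def combined_log_score(log_p_acoustic, log_p_lm, language_weight):
    """log Score(W/A) = log P(A/W) + LW * log P(W)."""
    return log_p_acoustic + language_weight * log_p_lm

# Hypothetical values for a single hypothesis (not from the slides).
log_p_a = -120.0           # log P(A/W)
log_p_w = math.log(1e-3)   # log P(W)

# Static language weight versus a smaller, confidence-driven weight.
static_score = combined_log_score(log_p_a, log_p_w, 6.5)
dynamic_score = combined_log_score(log_p_a, log_p_w, 3.0)
```

With a smaller weight, the language model term contributes less to the total, which is the effect the confidence-based score relies on when the acoustic evidence is trusted.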

We used the functional form

$$LW(C(W)) = \frac{2\,LW_0}{1 + \exp\big(r\,(C(W) - C_0)\big)}$$

$$C(W) = \frac{1}{N}\sum_{j=1}^{N} C(w_j)$$

The word sequence confidence is estimated by the average of its words’ confidence.

where

N: the number of words in sequence W
C(w_j): the confidence of word w_j
C_0: the operation point threshold
LW_0: the static language weight
r: a smoothing parameter

For bigram models, we approximate C(W) by the average of the current and previous words' confidences:

$$C(W) \approx \frac{C(w_N) + C(w_{N-1})}{2}$$

LW as a function of C(W), LW0=6.5, C0=0.65
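The dynamic weight can be sketched in a few lines. This assumes the functional form reads LW = 2·LW0 / (1 + exp(r·(C(W) − C0))), with the sign chosen so that higher confidence lowers the weight, as the conclusions describe. The function names are illustrative; LW0 = 6.5 and C0 = 0.65 are the values quoted for the plot, and r = 2 is one of the values tried in the experiments.

```python
import math

def sequence_confidence(word_confidences):
    """C(W): average of the per-word confidences."""
    return sum(word_confidences) / len(word_confidences)

def bigram_confidence(conf_current, conf_previous):
    """Bigram approximation: mean of current and previous word confidences."""
    return (conf_current + conf_previous) / 2.0

def dynamic_language_weight(confidence, lw0=6.5, c0=0.65, r=2.0):
    """LW(C(W)) = 2*LW0 / (1 + exp(r * (C(W) - C0)))  (assumed sign convention)."""
    return 2.0 * lw0 / (1.0 + math.exp(r * (confidence - c0)))

for c in (0.3, 0.65, 0.9):
    print(f"C(W)={c:.2f} -> LW={dynamic_language_weight(c):.3f}")
```

Two properties follow from this form: at C(W) = C0 the weight equals the static weight LW0, and with r = 0 the weight is LW0 for every confidence value, which matches the constant baseline row in the results table.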

Constraints On The Measures Used For Confidence-Based Language Model (CBLM)

Efficiency: Has to be computationally inexpensive

Synchronization: Can be extracted from on-line information

Source of information: Extracted only from acoustic data

Word Posterior As a Confidence Measure

$$\hat{W} = \arg\max_{W} p(W/X) = \arg\max_{W} \frac{p(X/W)\,p(W)}{p(X)} = \arg\max_{W} p(X/W)\,p(W)$$

p(X) does not depend on W, so it is ignored in all ASR systems
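A small numeric sketch shows why dropping p(X) leaves the decision unchanged while keeping it yields a normalized posterior that can serve as a confidence value. The candidate list and its scores are hypothetical, and p(X) is approximated here by summing over that short list only.

```python
import math

# Hypothetical candidates: (log p(X/W), log p(W)). Not from the slides.
hypotheses = {
    "recognize speech": (-10.0, math.log(0.6)),
    "wreck a nice beach": (-9.5, math.log(0.1)),
    "recognise peach": (-12.0, math.log(0.3)),
}

# Unnormalized joint score p(X/W) * p(W) for each candidate.
joint = {w: math.exp(la + ll) for w, (la, ll) in hypotheses.items()}

# p(X), approximated by summing over the candidate list.
p_x = sum(joint.values())
posterior = {w: j / p_x for w, j in joint.items()}

best_by_joint = max(joint, key=joint.get)
best_by_posterior = max(posterior, key=posterior.get)
```

Both selections pick the same hypothesis, but only the posterior values lie in [0, 1] and sum to one, which is what makes them comparable across utterances.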

Observation Probability Estimation

Theoretically:

$$p(x) = \sum_{q} p(q)\,p(x/q)$$

Discrete HMM:

$$p(x) = \sum_{q} p(q)\,p(m(x)/q) = p(m(x))$$

Semi-Continuous HMM:

$$p(x) = \sum_{i=1}^{C} \sum_{\text{all } q} p(q)\,w_{iq}\,g_i(x)$$

q: model states
m(x): vector quantization of x
C: number of mixtures
w_iq: mixture weights
g_i(x): mixtures
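For the semi-continuous case, the sum over states can be folded into one pooled weight per shared mixture, so each Gaussian is evaluated only once per frame. The sketch below assumes diagonal covariances and uses toy numbers; the function names, array layout (rows are states q, columns are mixtures i), and data are illustrative and not from the slides.

```python
import numpy as np

def gaussian_pdf_diag(x, mean, var):
    """Density of a diagonal-covariance Gaussian evaluated at x."""
    norm = 1.0 / np.sqrt(np.prod(2.0 * np.pi * var))
    return norm * np.exp(-0.5 * np.sum((x - mean) ** 2 / var))

def observation_probability(x, state_priors, mixture_weights, means, variances):
    """p(x) = sum_i [sum_q p(q) * w_iq] * g_i(x) over a shared codebook."""
    pooled = state_priors @ mixture_weights   # one pooled weight per mixture
    densities = np.array([gaussian_pdf_diag(x, m, v)
                          for m, v in zip(means, variances)])
    return float(pooled @ densities)

# Toy model: 2 states, 3 shared mixtures, 2-dimensional features.
state_priors = np.array([0.7, 0.3])
mixture_weights = np.array([[0.5, 0.3, 0.2],
                            [0.1, 0.1, 0.8]])
means = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]])
variances = np.ones((3, 2))

x = np.array([0.5, 0.5])
p_x = observation_probability(x, state_priors, mixture_weights, means, variances)
```

Because the pooled weights do not depend on x, they can be computed once offline, which keeps the per-frame cost at C Gaussian evaluations.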

Continuous HMM:

Building a catch-all model

[Block diagram with components: original acoustic model, clustering technique, vector quantization, mapping information, catch-all model]

Mixtures Clustering Technique

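$$B_{distance} = -\log \int \sqrt{p_1(x)\,p_2(x)}\;dx$$

For two Gaussian mixtures:

$$B_{distance} = \frac{1}{8}(\mu_1 - \mu_2)^T \left[\frac{\Sigma_1 + \Sigma_2}{2}\right]^{-1} (\mu_1 - \mu_2) + \frac{1}{2}\ln\frac{\left|(\Sigma_1 + \Sigma_2)/2\right|}{|\Sigma_1|^{1/2}\,|\Sigma_2|^{1/2}}$$

Merged mixture parameters:

$$w_{new} = w_1 + w_2$$

$$\mu_{new} = \frac{w_1\mu_1 + w_2\mu_2}{w_1 + w_2}$$

$$\sigma_{new}^2 = \frac{w_1\big(\sigma_1^2 + (\mu_1 - \mu_{new})^2\big) + w_2\big(\sigma_2^2 + (\mu_2 - \mu_{new})^2\big)}{w_1 + w_2}$$

B_distance: Bhattacharyya distance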
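The distance and merge steps can be sketched as follows. This assumes the standard closed-form Bhattacharyya distance between two Gaussians and a moment-preserving merge for diagonal-covariance components; the function names are illustrative and the slides do not state how merge candidates are selected.

```python
import numpy as np

def bhattacharyya_distance(mu1, cov1, mu2, cov2):
    """Closed-form Bhattacharyya distance between two Gaussians."""
    cov = 0.5 * (cov1 + cov2)
    diff = mu1 - mu2
    # Mean term: (1/8) * diff^T * cov^-1 * diff
    term_mean = 0.125 * diff @ np.linalg.solve(cov, diff)
    # Covariance term: (1/2) * ln(|cov| / sqrt(|cov1| * |cov2|))
    _, logdet = np.linalg.slogdet(cov)
    _, logdet1 = np.linalg.slogdet(cov1)
    _, logdet2 = np.linalg.slogdet(cov2)
    term_cov = 0.5 * (logdet - 0.5 * (logdet1 + logdet2))
    return float(term_mean + term_cov)

def merge_diag_gaussians(w1, mu1, var1, w2, mu2, var2):
    """Merge two weighted diagonal Gaussians, preserving mean and variance."""
    w_new = w1 + w2
    mu_new = (w1 * mu1 + w2 * mu2) / w_new
    var_new = (w1 * (var1 + (mu1 - mu_new) ** 2)
               + w2 * (var2 + (mu2 - mu_new) ** 2)) / w_new
    return w_new, mu_new, var_new

# Toy 1-D example: two unit-variance components with means 0 and 2.
d = bhattacharyya_distance(np.array([0.0]), np.eye(1), np.array([2.0]), np.eye(1))
w, mu, var = merge_diag_gaussians(0.5, np.array([0.0]), np.array([1.0]),
                                  0.5, np.array([2.0]), np.array([1.0]))
```

In the toy example the covariance term vanishes because both components share the same variance, and the merged variance grows to account for the spread between the two original means.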

Vector Quantization

OV: observation vector

CVi: code vector

(unlabeled marker): Gaussian mixture mean

[Figure: Computation reduction using VQ, showing code vectors CVi, CVj, CVk, CVm and the observation vector OV]
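One common way VQ reduces computation is to find the code vector nearest to the observation and evaluate only the Gaussians mapped to it. The sketch below assumes that scheme; the slides do not spell out the exact procedure, and the names `nearest_code_vector`, `shortlist_for`, and the toy mapping are hypothetical.

```python
import numpy as np

def nearest_code_vector(ov, code_vectors):
    """Index of the code vector closest to the observation (Euclidean)."""
    distances = np.sum((code_vectors - ov) ** 2, axis=1)
    return int(np.argmin(distances))

def shortlist_for(ov, code_vectors, shortlists):
    """Gaussian indices to evaluate for this observation (assumed mapping)."""
    return shortlists[nearest_code_vector(ov, code_vectors)]

# Toy codebook with two code vectors and a hypothetical mapping to mixtures.
code_vectors = np.array([[0.0, 0.0], [5.0, 5.0]])
shortlists = {0: [0, 1], 1: [2, 3]}

ov = np.array([4.2, 5.1])
active = shortlist_for(ov, code_vectors, shortlists)
```

Only the mixtures in `active` would then be scored for this frame, and the rest would be skipped or given a floor value.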

The Catch-all Model Performance

Relative ROC performance of reduced catch-all models

Word Level Confidence Measures

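Arithmetic Mean

$$CM_{am}(w) = \frac{1}{N}\sum_{i=1}^{N} CM(ph_i)$$

Geometric Mean

$$CM_{gm}(w) = \exp\Big(\frac{1}{N}\sum_{i=1}^{N} \log CM(ph_i)\Big)$$

Weighted Mean

$$CM_{wm}(w) = \frac{1}{N}\sum_{i=1}^{N} \big(a_i\,CM(ph_i) + b_i\big)$$

CM(ph_i): confidence score of phoneme ph_i

N: number of phonemes in word w

a_i, b_i: linear model parameters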
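The three word-level measures can be sketched directly from per-phoneme scores. This assumes the scores are positive so the logarithm is defined, and that the weighted mean uses one (a, b) pair per phoneme position; the function names and sample scores are illustrative.

```python
import math

def cm_arithmetic(phone_scores):
    """Arithmetic mean of the phoneme confidence scores."""
    return sum(phone_scores) / len(phone_scores)

def cm_geometric(phone_scores):
    """Geometric mean, computed as exp of the mean log score."""
    mean_log = sum(math.log(s) for s in phone_scores) / len(phone_scores)
    return math.exp(mean_log)

def cm_weighted(phone_scores, a, b):
    """Mean of linearly transformed scores a_i * CM(ph_i) + b_i."""
    total = sum(ai * s + bi for ai, s, bi in zip(a, phone_scores, b))
    return total / len(phone_scores)

# Hypothetical phoneme scores for one word; one phoneme is poorly matched.
scores = [0.9, 0.4, 0.9]
am = cm_arithmetic(scores)
gm = cm_geometric(scores)
wm = cm_weighted(scores, a=[1.0, 1.0, 1.0], b=[0.0, 0.0, 0.0])
```

The geometric mean is never larger than the arithmetic mean, so a single poorly matched phoneme pulls it down more strongly. With a_i = 1 and b_i = 0 the weighted mean reduces to the arithmetic mean.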

Word Level Confidence Measures Performance

ROC curves indicating the relative performance of CMam , CMgm and CMwm

Performance Evaluation Compared With Other Approaches

Comparison of the catch-all model measure, the likelihood ratio (LR) measure and the word lattice based measure

Experimental Results

Smoothing parameter (r) | Threshold 0.5 | 0.6 | 0.7 | 0.8 | 0.9
0 | 19.3% | 19.3% | 19.3% | 19.3% | 19.3%
1 | 18.6% | 18.43% | 18.41% | 18.31% | 24%
2 | 18.9% | 18.42% | 18.41% | 18.30% | 22%
3 | 18.9% | 18.47% | 18.63% | 18.43% | 25%

WER for different threshold and r values
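The table can be scanned for its best operating point and the corresponding relative WER reduction over the r = 0 baseline. The values are copied from the table, assuming the cells without a percent sign are also percentages; the variable names are illustrative.

```python
# WER (%) indexed by smoothing parameter r, one entry per threshold.
thresholds = [0.5, 0.6, 0.7, 0.8, 0.9]
wer = {
    0: [19.3, 19.3, 19.3, 19.3, 19.3],
    1: [18.6, 18.43, 18.41, 18.31, 24.0],
    2: [18.9, 18.42, 18.41, 18.30, 22.0],
    3: [18.9, 18.47, 18.63, 18.43, 25.0],
}

baseline = wer[0][0]  # r = 0 gives the static language weight

# Find the (r, threshold index) pair with the lowest WER.
best_r, best_idx = min(
    ((r, i) for r in wer for i in range(len(thresholds))),
    key=lambda ri: wer[ri[0]][ri[1]],
)
best_wer = wer[best_r][best_idx]
relative_reduction = (baseline - best_wer) / baseline

print(f"best: r={best_r}, threshold={thresholds[best_idx]}, WER={best_wer}%")
print(f"relative WER reduction: {100 * relative_reduction:.1f}%")
```

The lowest WER in the table is 18.30% at r = 2 and threshold 0.8, a relative reduction of about 5.2% from the 19.3% baseline. The sharp degradation at threshold 0.9 shows that the operating point matters more than the exact value of r.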

Recognition accuracy for words following correctly decoded and incorrectly decoded words

CONCLUSION AND FUTURE WORK

We used a confidence metric to improve the integration of system models and guide the search towards the most promising paths

Dynamic tuning of the language model weight parameter proved to be effective for performance improvement

Word-posterior-based confidence measures are efficient and can be extracted from the side information of the online search. They do not require the training of anti-models

With CBLM, the language model score is favored in regions of ambiguous acoustics, but plays second fiddle when the acoustics are well matched.

Future work: We plan to extend this work to cases where we have high confidence in only one of the words. In such cases we should back off to the unigram language model score rather than completely reduce the language model score.
