Dynamic Tuning Of Language Model Score In Speech Recognition Using A Confidence Measure
Sherif Abdou, Michael Scordilis
Department of Electrical and Computer Engineering, University of Miami
Coral Gables, Florida 33124, U.S.A.
DSAP
Abstract
Speech recognition errors limit the capability of language models to predict subsequent words correctly
Error analysis on Switchboard data shows that 87% of words preceded by a correctly decoded word were themselves correctly decoded, while only 47% of words preceded by an incorrectly decoded word were correctly decoded.
An effective way to improve the contribution of the language model is to use confidence measures.
Most current efforts to develop confidence measures for speech recognition focus on verifying the final result and make no attempt to correct recognition errors.
In this work, we use confidence measures early during the search process.
A word-based acoustic confidence metric is used to define a dynamic language weight.
Using Confidence To Guide The Search
The search score is changed from

$$\mathrm{Score}(W|A) = P(A|W)\, P(W)^{LW}$$

to the confidence-based score

$$\mathrm{Score}(W|A) = P(A|W)\, P(W)^{LW(C(W))}$$

where:
A: the acoustic input
W: the hypothesized word sequence
P(A|W): the acoustic model score
P(W): the language model score
LW: the language weight
C(W): the confidence of word sequence W
We used the functional form

$$LW(C(W)) = \frac{2\, LW_0}{1 + \exp\left(r\, C(W)\right)}$$

The word sequence confidence is estimated by the average of its words' confidences, centered at the operating point:

$$C(W) = \frac{1}{N}\sum_{j=1}^{N} C(w_j) - C_0$$

where:
N: the number of words in sequence W
C(w_j): the confidence of word w_j
C_0: the operating point threshold
LW_0: the static language weight
r: a smoothing parameter
For bigram models we approximate C(W) by the confidences of the current and previous words:

$$C(W) = \frac{C(w_N) + C(w_{N-1})}{2} - C_0$$
Figure: LW as a function of C(W), with LW_0 = 6.5 and C_0 = 0.65.
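As a concrete illustration, here is a minimal Python sketch of the confidence-based score in the log domain; the function names are ours, LW_0 = 6.5 and C_0 = 0.65 follow the figure above, and r = 2.0 is an assumed value taken from the range explored in the experiments below.

```python
import math

def dynamic_lw(avg_word_conf, lw0=6.5, c0=0.65, r=2.0):
    """Sigmoid mapping from hypothesis confidence to language weight.

    At avg_word_conf == c0 the weight equals the static weight lw0;
    it rises toward 2*lw0 when the acoustics are ambiguous (low
    confidence) and falls toward 0 when they are well matched.
    """
    c_w = avg_word_conf - c0                 # operating-point-centered C(W)
    return 2.0 * lw0 / (1.0 + math.exp(r * c_w))

def hypothesis_score(log_p_acoustic, log_p_lm, avg_word_conf):
    """Confidence-based search score in the log domain:
    log Score(W|A) = log P(A|W) + LW(C(W)) * log P(W)."""
    return log_p_acoustic + dynamic_lw(avg_word_conf) * log_p_lm
```

The sigmoid keeps the weight at the static LW_0 when the confidence sits exactly at the operating point, and smoothly trades the language model against the acoustics on either side of it.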
Constraints On The Measures Used For Confidence-Based Language Model (CBLM)
Efficiency: has to be computationally inexpensive.
Synchronization: can be extracted from on-line information.
Source of information: extracted only from acoustic data.
Word Posterior As a Confidence Measure
$$\hat{W} = \arg\max_{W} p(W|X) = \arg\max_{W} \frac{p(X|W)\, p(W)}{p(X)} = \arg\max_{W} p(X|W)\, p(W)$$

The denominator p(X) is ignored in all ASR systems, since it does not change the maximization; estimating it is exactly what is needed to turn the decoder score into a posterior-based confidence.
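To illustrate how recovering that denominator yields a confidence score, here is a minimal Python sketch that normalizes the per-frame acoustic likelihood of a hypothesized word by a per-frame estimate of p(x), such as the catch-all model developed below; the array names and the frame-averaged (geometric-mean) normalization are our assumptions.

```python
import numpy as np

def word_posterior_confidence(word_loglik, frame_loglik):
    """Posterior-style confidence for one hypothesized word.

    word_loglik:  per-frame log p(x_t | w) along the decoder's best path.
    frame_loglik: per-frame log p(x_t), e.g. from a catch-all model.
    Returns exp of the frame-averaged log ratio, clipped to (0, 1].
    """
    word_loglik = np.asarray(word_loglik, dtype=float)
    frame_loglik = np.asarray(frame_loglik, dtype=float)
    # Length-normalized log posterior (language model prior omitted).
    log_post = np.mean(word_loglik - frame_loglik)
    return float(np.exp(min(log_post, 0.0)))  # keep the score in (0, 1]
```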
Observation Probability Estimation
Theoretically:

$$p(x) = \sum_{q} p(q)\, p(x|q)$$

Discrete HMM:

$$p(x) = \sum_{q} p(q)\, p(m(x)|q)$$

Semi-continuous HMM:

$$p(x) = \sum_{\text{all } q} p(q) \sum_{i=1}^{C} w_{iq}\, g_i(x)$$

where:
q: the model states
m(x): the vector quantization of x
C: the number of mixtures
w_iq: the mixture weights
g_i(x): the Gaussian mixtures
Continuous HMM: the summation over all state mixtures is expensive, so a reduced catch-all model is built.

Building a catch-all model

Diagram: the original acoustic model is reduced to a catch-all model by a mixtures clustering technique, with vector quantization supplying the mapping information.
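As a sketch of what the catch-all model computes at run time, the following Python evaluates log p(x) under a pooled diagonal-covariance Gaussian mixture; the flattened weight/mean/variance arrays are our assumed representation of the clustered model.

```python
import numpy as np

def catchall_logprob(x, weights, means, variances):
    """log p(x) under a diagonal-covariance Gaussian mixture.

    weights:   (M,)   mixture weights pooled over all states (sum to 1)
    means:     (M, D) component means
    variances: (M, D) component diagonal variances
    """
    x = np.asarray(x, dtype=float)
    # Per-component Gaussian log densities with diagonal covariance.
    log_comp = -0.5 * np.sum(
        np.log(2.0 * np.pi * variances) + (x - means) ** 2 / variances,
        axis=1)
    # Stable log-sum-exp over the weighted components.
    m = np.max(log_comp)
    return float(m + np.log(np.sum(weights * np.exp(log_comp - m))))
```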
Mixtures Clustering Technique
Mixtures are compared by the Bhattacharyya distance:

$$B_{distance} = -\log \int \sqrt{p_1(x)\, p_2(x)}\, dx$$

which for Gaussian densities has the closed form

$$B_{distance} = \frac{1}{8} (\mu_1 - \mu_2)^T \left[ \frac{\Sigma_1 + \Sigma_2}{2} \right]^{-1} (\mu_1 - \mu_2) + \frac{1}{2} \ln \frac{\left| \frac{\Sigma_1 + \Sigma_2}{2} \right|}{|\Sigma_1|^{1/2}\, |\Sigma_2|^{1/2}}$$

The closest pair of mixtures is merged into a single Gaussian:

$$w_{new} = w_1 + w_2$$

$$\mu_{new} = \frac{w_1 \mu_1 + w_2 \mu_2}{w_1 + w_2}$$

$$\sigma^2_{new} = \frac{w_1 \left( \sigma_1^2 + (\mu_1 - \mu_{new})^2 \right) + w_2 \left( \sigma_2^2 + (\mu_2 - \mu_{new})^2 \right)}{w_1 + w_2}$$

where B_distance is the Bhattacharyya distance.
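A minimal sketch of one clustering step for diagonal-covariance Gaussians, matching the merge formulas above; the function names are ours, and a full clustering loop would repeatedly merge the pair with the smallest distance.

```python
import numpy as np

def bhattacharyya(mu1, var1, mu2, var2):
    """Bhattacharyya distance between two diagonal-covariance Gaussians."""
    mu1, var1 = np.asarray(mu1, float), np.asarray(var1, float)
    mu2, var2 = np.asarray(mu2, float), np.asarray(var2, float)
    avg = (var1 + var2) / 2.0
    term1 = 0.125 * np.sum((mu1 - mu2) ** 2 / avg)
    term2 = 0.5 * np.sum(np.log(avg / np.sqrt(var1 * var2)))
    return term1 + term2

def merge(w1, mu1, var1, w2, mu2, var2):
    """Moment-matching merge of two weighted Gaussians into one."""
    w = w1 + w2
    mu = (w1 * mu1 + w2 * mu2) / w
    var = (w1 * (var1 + (mu1 - mu) ** 2)
           + w2 * (var2 + (mu2 - mu) ** 2)) / w
    return w, mu, var
```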
Vector Quantization
Diagram: computation reduction using VQ. OV: observation vector; CV_i: code vector; μ: Gaussian mixture mean.
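A sketch of the VQ shortcut: the observation vector is quantized to its nearest code vector, and only the Gaussians whose means were assigned to that cell offline are evaluated exactly; the mapping table vq_to_mixtures is an assumed name.

```python
import numpy as np

def nearest_code_vector(ov, codebook):
    """Index of the code vector closest (squared Euclidean distance) to ov."""
    return int(np.argmin(np.sum((codebook - ov) ** 2, axis=1)))

def shortlist(ov, codebook, vq_to_mixtures):
    """Mixture indices worth evaluating exactly for observation vector ov.

    vq_to_mixtures[k] holds the indices of the Gaussians whose means were
    assigned to codebook cell k offline; all other components are skipped
    or approximated, which is the computation reduction in the diagram.
    """
    return vq_to_mixtures[nearest_code_vector(ov, codebook)]
```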
The Catch-all Model Performance
Figure: relative ROC performance of reduced catch-all models.
Word Level Confidence Measures
Arithmetic mean:

$$CM_{am}(w) = \frac{1}{N} \sum_{i=1}^{N} CM(ph_i)$$

Geometric mean:

$$CM_{gm}(w) = \exp\left( \frac{1}{N} \sum_{i=1}^{N} \log CM(ph_i) \right)$$

Weighted mean:

$$CM_{wm}(w) = \frac{1}{N} \sum_{i=1}^{N} \left( a_i\, CM(ph_i) + b_i \right)$$

where:
CM(ph_i): the confidence score of phoneme ph_i
a_i, b_i: linear model parameters
N: the number of phonemes in word w
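A small Python sketch of the three combiners; phone_scores is our assumed name for the list of per-phoneme confidence scores.

```python
import math

def cm_arithmetic(phone_scores):
    """Arithmetic mean of the per-phoneme confidence scores."""
    return sum(phone_scores) / len(phone_scores)

def cm_geometric(phone_scores):
    """Geometric mean, computed in the log domain for stability."""
    return math.exp(sum(math.log(c) for c in phone_scores)
                    / len(phone_scores))

def cm_weighted(phone_scores, a, b):
    """Weighted mean with per-phoneme linear parameters a[i], b[i]."""
    n = len(phone_scores)
    return sum(a[i] * phone_scores[i] + b[i] for i in range(n)) / n
```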
Word Level Confidence Measures Performance
Figure: ROC curves indicating the relative performance of CM_am, CM_gm, and CM_wm.
Performance Evaluation Compared With Other Approaches
Figure: comparison of the catch-all model measure, the likelihood ratio (LR) measure, and the word-lattice-based measure.
Experimental Results
WER for different threshold (C_0) and smoothing parameter (r) values:

r \ Threshold    0.5      0.6       0.7       0.8       0.9
0                19.3%    19.3%     19.3%     19.3%     19.3%
1                18.6%    18.43%    18.41%    18.31%    18.24%
2                18.9%    18.42%    18.41%    18.30%    18.22%
3                18.9%    18.47%    18.63%    18.43%    18.25%
Figure: recognition accuracy for words following correctly decoded and incorrectly decoded words.
CONCLUSION AND FUTURE WORK
We used a confidence metric to improve the integration of system models and guide the search towards the most promising paths
Dynamic tuning of the language model weight parameter proved to be effective for performance improvement
Word posterior based confidence measures are efficient and can be extracted from the on-line search side information; they do not require the training of anti-models.
With CBLM the language model score is favored in regions of ambiguous acoustics, but plays second fiddle when the acoustics are well matched.
Future work: we plan to extend this approach to the case where only one of the two words has high confidence; there, the system should back off to the unigram language model score rather than reduce the language model score entirely.