Linear Dynamic Model (LDM) for Automatic Speech Recognition
PhD Candidate: Tao Ma
Advised by: Dr. Joseph Picone
Institute for Signal and Information Processing (ISIP)
Mississippi State University
An Example of a Kalman Filter (another name for the LDM)
[Figure: noisy observations of a position; a Kalman filter models the evolution of the position]
• In control systems engineering, the Kalman filter has been used successfully to model systems with noisy observations
Filtering: Position at present time (remove noise effect)
Predicting: Position at a future time
Smoothing: Position at a time in the past
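The filtering and predicting tasks listed above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the example from the slide: the constant-velocity model, the noise covariances `Q` and `R`, and the simulated trajectory are all assumptions made for the demonstration.

```python
import numpy as np

def kalman_step(x, P, y, F, H, Q, R):
    """One predict/update cycle; returns the filtered mean and covariance."""
    # Predict: propagate the previous estimate through the dynamics.
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update: correct the prediction with the new noisy observation.
    S = H @ P_pred @ H.T + R               # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)    # Kalman gain
    x_new = x_pred + K @ (y - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new

# Hypothetical constant-velocity model: state = [position, velocity].
dt = 1.0
F = np.array([[1.0, dt], [0.0, 1.0]])
H = np.array([[1.0, 0.0]])          # only the position is observed
Q = 1e-4 * np.eye(2)                # process noise covariance (assumed)
R = np.array([[4.0]])               # observation noise covariance (assumed)

rng = np.random.default_rng(0)
true_pos = 0.5 * np.arange(50)      # object moving at 0.5 units per step
obs = true_pos + rng.normal(0.0, 2.0, size=50)

# Filtering: estimate the present position from all observations so far.
x, P = np.zeros(2), 10.0 * np.eye(2)
filtered = []
for y in obs:
    x, P = kalman_step(x, P, np.array([y]), F, H, Q, R)
    filtered.append(x[0])
filtered = np.array(filtered)

# Predicting: extrapolate the last filtered state 5 steps into the future.
predicted_pos = (np.linalg.matrix_power(F, 5) @ x)[0]

# Compare errors once the filter has had a few frames to converge.
rmse_obs = np.sqrt(np.mean((obs[10:] - true_pos[10:]) ** 2))
rmse_filt = np.sqrt(np.mean((filtered[10:] - true_pos[10:]) ** 2))
print(f"RMSE raw: {rmse_obs:.2f}, RMSE filtered: {rmse_filt:.2f}")
```

Smoothing, the third task, needs a backward pass over the stored filter outputs and is not shown here.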
Outline
• Why Linear Dynamic Model (LDM)?
• Linear Dynamic Model
• Pilot experiment: LDM phone classification on Aurora 4
• Hybrid HMM/LDM decoder architecture for LVCSR
• Future work
HMM & Speech Recognition System
Hidden Markov Models
Is HMM a perfect model for ASR?
• Progress on improving the accuracy of HMM-based systems has slowed in the past decade
• Theoretical drawbacks of HMM
– False assumption that frames are independent and stationary
– Spatial correlation is ignored (diagonal covariance matrix)
– Limited discrete state space
[Figure: accuracy vs. time, with curves for clean and noisy conditions]
Motivation of Linear Dynamic Model (LDM) Research
• Motivation
– A model that reflects the characteristics of speech signals will ultimately lead to large improvements in ASR performance
– LDM incorporates frame correlation information of speech signals, which has the potential to increase recognition accuracy
– The “filter” characteristic of LDM has the potential to improve the noise robustness of speech recognition
– Fast-growing computational capacity (thanks to Intel) makes it realistic to build a two-way HMM/LDM hybrid speech engine
State Space Model
• The Linear Dynamic Model (LDM) is derived from the state space model
• Equations of State Space Model:
y: observation feature vector
x: corresponding internal state vector
h(): relationship function between y and x at current time
f(): relationship function between current state and all previous states
epsilon: noise component
eta: noise component
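The equations themselves were an image on the slide and did not survive in this transcript. A form consistent with the symbol definitions above, assuming a time index t and additive noise (both assumptions, not taken from the slide), would be:

$$
\begin{aligned}
y_t &= h(x_t) + \epsilon_t \\
x_t &= f(x_{t-1}, x_{t-2}, \ldots, x_1) + \eta_t
\end{aligned}
$$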
Linear Dynamic Model
• Equations of Linear Dynamic Model (LDM)
– Current state is only determined by previous state
– H, F are linear transform matrices
– Epsilon and Eta are driving components
y: observation feature vector
x: corresponding internal state vector
H: linear transform matrix between y and x
F: linear transform matrix between current state and previous state
epsilon: driving component
eta: driving component
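The equation image for this slide is also missing from the transcript. Under the three properties listed above (first-order dependence, linear transforms H and F, and driving terms epsilon and eta), the standard linear form would be the following. The Gaussian, zero-mean noise model and the covariance names R and Q are assumptions introduced here for illustration.

$$
\begin{aligned}
y_t &= H x_t + \epsilon_t, & \epsilon_t &\sim \mathcal{N}(0, R) \\
x_t &= F x_{t-1} + \eta_t, & \eta_t &\sim \mathcal{N}(0, Q)
\end{aligned}
$$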
Kalman filtering for state inference (E-Step of EM training)
[Figure: for a speech sound, a block diagram with panels "Human Being Sound System" and "Kalman Filtering Estimation"]
RTS smoother for better inference
[Figure: standard Kalman filter vs. Kalman filter with RTS smoother]
• Rauch-Tung-Striebel (RTS) smoother
– Additional backward pass to minimize inference error
– During EM training, computes the expectations of state statistics
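The backward pass can be sketched as follows. This is a minimal NumPy sketch of the textbook RTS recursion, not the ISIP implementation; the toy model matrices, the zero-mean Gaussian noise with covariances `Q` and `R`, and the prior `x0`, `P0` are assumptions chosen for the demonstration.

```python
import numpy as np

def kalman_filter(ys, F, H, Q, R, x0, P0):
    """Forward pass; x0, P0 are the prior mean and covariance of the first state."""
    xp, Pp, xf, Pf = [], [], [], []
    x, P = x0, P0
    for t, y in enumerate(ys):
        if t > 0:                        # time update (skipped for the first frame)
            x, P = F @ x, F @ P @ F.T + Q
        xp.append(x)
        Pp.append(P)
        K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)   # Kalman gain
        x = x + K @ (y - H @ x)          # measurement update
        P = P - K @ H @ P
        xf.append(x)
        Pf.append(P)
    return np.array(xp), np.array(Pp), np.array(xf), np.array(Pf)

def rts_smoother(xp, Pp, xf, Pf, F):
    """Backward pass: refine each filtered estimate using the later frames."""
    xs, Ps = xf.copy(), Pf.copy()        # the last frame is already optimal
    for t in range(len(xf) - 2, -1, -1):
        J = Pf[t] @ F.T @ np.linalg.inv(Pp[t + 1])     # smoother gain
        xs[t] = xf[t] + J @ (xs[t + 1] - xp[t + 1])
        Ps[t] = Pf[t] + J @ (Ps[t + 1] - Pp[t + 1]) @ J.T
    return xs, Ps

# Toy model (hypothetical values): a damped rotation observed in one dimension.
rng = np.random.default_rng(1)
F = np.array([[0.9, 0.2], [-0.2, 0.9]])
H = np.array([[1.0, 0.0]])
Q = 0.01 * np.eye(2)
R = np.array([[0.5]])

x = np.array([1.0, 0.0])
ys = []
for _ in range(100):
    ys.append(H @ x + rng.normal(0.0, np.sqrt(0.5), size=1))
    x = F @ x + rng.multivariate_normal(np.zeros(2), Q)
ys = np.array(ys)

xp, Pp, xf, Pf = kalman_filter(ys, F, H, Q, R, np.zeros(2), np.eye(2))
xs, Ps = rts_smoother(xp, Pp, xf, Pf, F)
```

The smoothed means `xs` and covariances `Ps` are the quantities from which the state expectations used in EM training are built; the smoothed uncertainty at each frame is never larger than the filtered one.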
Maximum Likelihood Parameter Estimation (M-Step of EM training)
Nothing but matrix multiplication!
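The update equations on this slide were an image and are not recoverable from the transcript. For reference, the standard closed-form M-step for a linear-Gaussian state space model with zero-mean noise looks like the following; the notation is an assumption introduced here (R and Q for the observation and state noise covariances, and smoothed expectations taken over a segment of T frames).

$$
\begin{aligned}
\hat{x}_t &= E[x_t \mid y_{1:T}], \qquad P_t = E[x_t x_t^\top \mid y_{1:T}], \qquad P_{t,t-1} = E[x_t x_{t-1}^\top \mid y_{1:T}] \\
H &= \Big(\sum_{t=1}^{T} y_t \hat{x}_t^\top\Big)\Big(\sum_{t=1}^{T} P_t\Big)^{-1} \\
R &= \frac{1}{T}\sum_{t=1}^{T}\big(y_t y_t^\top - H \hat{x}_t y_t^\top\big) \\
F &= \Big(\sum_{t=2}^{T} P_{t,t-1}\Big)\Big(\sum_{t=2}^{T} P_{t-1}\Big)^{-1} \\
Q &= \frac{1}{T-1}\Big(\sum_{t=2}^{T} P_t - F \sum_{t=2}^{T} P_{t-1,t}\Big)
\end{aligned}
$$

Each update is a product of accumulated statistics and one matrix inverse, which is the sense in which the step reduces to matrix arithmetic.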
[Figure: one set of LDM parameters is estimated per phone: aa, ae, ah, ao, aw, ay, b, ch, d, dh, eh, er, …]
LDM for Speech Classification
[Figure: two panels, "HMM-Based Recognition" and "LDM-Based Recognition". In each, MFCC features are scored against per-phone models (aa, ch, eh, …) to produce a hypothesis; the labels x, y, and x^ (state estimate) mark the model variables.]
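One common way to classify a segment with per-phone LDMs is to score it under every model and pick the most likely one, using the Kalman filter innovations to compute the likelihood. The sketch below assumes that scheme; it is not taken from the slides, and the two toy "phone" models, their 2-dimensional features (standing in for MFCC vectors), and all parameter values are hypothetical.

```python
import numpy as np

def ldm_log_likelihood(ys, F, H, Q, R, x0, P0):
    """Log-likelihood of a segment under one LDM, via Kalman filter innovations."""
    x, P, ll = x0, P0, 0.0
    for t, y in enumerate(ys):
        if t > 0:                           # time update
            x, P = F @ x, F @ P @ F.T + Q
        e = y - H @ x                       # innovation (prediction error)
        S = H @ P @ H.T + R                 # innovation covariance
        _, logdet = np.linalg.slogdet(S)
        ll += -0.5 * (len(y) * np.log(2 * np.pi) + logdet
                      + e @ np.linalg.solve(S, e))
        K = P @ H.T @ np.linalg.inv(S)      # measurement update
        x, P = x + K @ e, P - K @ H @ P
    return ll

def classify(ys, models):
    """Return the phone whose LDM gives the segment the highest likelihood."""
    scores = {phone: ldm_log_likelihood(ys, **p) for phone, p in models.items()}
    return max(scores, key=scores.get), scores

def simulate(F, H, Q, R, x0, P0, n, rng):
    """Draw an n-frame observation sequence from an LDM (for the demo only)."""
    x = rng.multivariate_normal(x0, P0)
    ys = []
    for _ in range(n):
        ys.append(H @ x + rng.multivariate_normal(np.zeros(len(R)), R))
        x = F @ x + rng.multivariate_normal(np.zeros(len(Q)), Q)
    return np.array(ys)

# Two hypothetical phone models that differ only in their state dynamics.
I2 = np.eye(2)
common = dict(H=I2, Q=0.1 * I2, R=0.1 * I2, x0=np.zeros(2), P0=I2)
models = {
    "aa": dict(F=0.95 * I2, **common),      # smooth trajectories
    "ch": dict(F=-0.95 * I2, **common),     # sign-alternating trajectories
}

rng = np.random.default_rng(2)
segment = simulate(**models["aa"], n=40, rng=rng)
label, scores = classify(segment, models)
print(label, scores)
```

Because the score is computed over a whole segment, the segment boundaries must be known before classification.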
Challenges of Applying LDM to ASR
• Segment-based model
– Frame-to-phoneme alignment information is needed before classification
• EM training is sensitive to state initialization
– Each phoneme is modeled by an LDM; EM training finds the set of parameters for that specific LDM
– No good mechanism for state initialization yet
• More parameters than HMM (2~3x)
– Currently a monophone model; building a triphone model for LVCSR would need more training data
Pilot experiment: phone classification on Aurora 4
• Aurora 4: Wall Street Journal + six kinds of noise
– Airport, Babble, Car, Restaurant, Street, and Train
• Frame-to-phone alignment is generated by the ISIP decoder (forced-alignment mode)
– Adding a language model gives 93% accuracy on clean data
• 40 phones, one-vs-all classifier
Model   Clean dataset (Acc)   Noisy dataset (Acc)
HMM     46.9%                 36.8%
LDM     49.2%                 39.2%
Hybrid HMM/LDM decoder architecture for LVCSR
[Figure: hybrid HMM/LDM decoder block diagram; labeled stages include "Confidence Measurement" and "Best Hypothesis"]
Status and future work
• Development of the HMM/LDM hybrid decoder is still in progress
– The HMM/LDM hybrid decoder is expected to be done in 2009
– The ISIP HMM/SVM hybrid decoder acts as the reference for implementation
• Future work
– Research has demonstrated nonlinear effects in speech signals
– Investigate the possibility of replacing Kalman filtering with nonlinear filtering (such as the Unscented Kalman Filter or the Extended Kalman Filter)
Thank you!
Questions?
References
• Digalakis, V., “Segment-based Stochastic Models of Spectral Dynamics for Continuous Speech Recognition,” Ph.D. Dissertation, Boston University, Boston, Massachusetts, USA, 1992.
• Digalakis, V., Rohlicek, J. and Ostendorf, M., “ML Estimation of a Stochastic Linear System with the EM Algorithm and Its Application to Speech Recognition,” IEEE Transactions on Speech and Audio Processing, vol. 1, no. 4, pp. 431–442, October 1993.
• Frankel, J., “Linear Dynamic Models for Automatic Speech Recognition,” Ph.D. Dissertation, The Centre for Speech Technology Research, University of Edinburgh, Edinburgh, UK, 2003.