Introduction to Automatic Speech Recognition
Outline
Define the problem
What is speech?
Feature Selection
Models: early methods, modern statistical models
Current State of ASR
Future Work
The ASR Problem
There is no single ASR problem
The problem depends on many factors:
Microphone: Close-mic, throat-mic, microphone array, audio-visual
Sources: band-limited, background noise, reverberation
Speaker: speaker dependent, speaker independent
Language: open/closed vocabulary, vocabulary size, read/spontaneous speech
Output: Transcription, speaker id, keywords
Performance Evaluation
Accuracy: percentage of tokens correctly recognized
Error rate: complement of accuracy (100% minus accuracy)
Token type: phones, words*, sentences, semantics?
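A minimal sketch of how an error rate over word tokens might be computed, assuming the common alignment-based scoring in which substitutions, deletions, and insertions are counted with an edit distance. The function names `edit_distance` and `word_error_rate` are illustrative and not from the slides. When the hypothesis contains no insertions, the result equals 100% minus the accuracy.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, deletions, and insertions turning ref into hyp."""
    prev = list(range(len(hyp) + 1))  # distance from an empty reference
    for i, ref_token in enumerate(ref, start=1):
        curr = [i]  # distance to an empty hypothesis: i deletions
        for j, hyp_token in enumerate(hyp, start=1):
            cost = 0 if ref_token == hyp_token else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Edit distance between the word sequences, divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)
```

For example, `word_error_rate("the cat sat on the mat", "the cat sat on a mat")` counts one substitution against six reference words. The same routine works for phone or sentence tokens if the inputs are split accordingly.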
What is Speech?
Analog signal produced by humans
The speech signal can be thought of as decomposed into a source and a filter
The source is the vocal folds in voiced speech
The filter is the vocal tract and articulators
Speech Production
Speech Visualization
Feature Selection
As in any data-driven task, the data must be represented in some format
Cepstral features have been found to perform well
They represent the "frequency of the frequencies": the spectrum of the log spectrum
Mel-frequency cepstral coefficients (MFCC) are the most common variety
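The "mel" in MFCC refers to a perceptual frequency scale on which the analysis filters are spaced. The sketch below shows only that one step, assuming the widely used formula mel = 2595 log10(1 + f/700); other variants of the mel scale exist, and the function names are illustrative. A full MFCC front end would also need framing, a spectrum, log filterbank energies, and a cosine transform, none of which are shown here.

```python
import math


def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mels (one common variant of the scale)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(low_hz, high_hz, n_filters):
    """Center frequencies (Hz) of n_filters filters equally spaced on the mel scale."""
    low, high = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high - low) / (n_filters + 1)
    return [mel_to_hz(low + step * (i + 1)) for i in range(n_filters)]
```

Because the spacing is uniform in mels, the resulting filters are packed densely at low frequencies and spread out at high frequencies.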
Where do we stand?
Defined the multiple problems associated with ASR
Described how speech is produced
Illustrated how speech can be represented in an ASR system
Now that we have the data, how do we recognize the speech?
Radio Rex
First known attempt at speech recognition
A toy from 1922
Worked by analyzing the signal strength at 500 Hz
Actual speech recognition systems
Originally thought to be a relatively simple task requiring a few years of concerted effort
1969: “Whither Speech Recognition?” by J. R. Pierce is published
A DARPA project ran from 1971-1976 in response to the statements in the Pierce article
We can examine a few general systems
Template-Based ASR
Originally only worked for isolated words
Performs best when training and testing conditions match
For each word we want to recognize, we store a template or example based on actual data
Each test utterance is checked against the templates to find the best match
Uses the Dynamic Time Warping (DTW) algorithm
Dynamic Time Warping
Create a frame-by-frame distance matrix for the two utterances
Use dynamic programming to find the lowest cost path
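The two steps above can be sketched as follows. This is a minimal illustration, assuming each frame is a single number and the local distance is an absolute difference; a real system would compare feature vectors such as MFCCs. The names `dtw_cost` and `recognize` are hypothetical.

```python
def dtw_cost(a, b, dist=lambda x, y: abs(x - y)):
    """Cost of the cheapest monotonic alignment between sequences a and b."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i frames of a with the first j of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = dist(a[i - 1], b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # a advances, b repeats a frame
                                     cost[i][j - 1],      # b advances, a repeats a frame
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]


def recognize(utterance, templates):
    """Return the word whose stored template has the lowest DTW cost."""
    return min(templates, key=lambda word: dtw_cost(utterance, templates[word]))
```

Because the path may repeat frames in either sequence, an utterance spoken more slowly than its template can still align at zero or low cost, which is the property that makes DTW suitable for template matching.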
Hearsay-II
One of the systems developed during the DARPA program
A blackboard-based system utilizing symbolic problem solvers
Each problem solver was called a knowledge source (KS)
A complex scheduler was used to decide when each KS should be called
DARPA Results
The Hearsay-II system performed much better than the two other similar competing systems
However, only one system met the performance goals of the project: Harpy
The Harpy system was also built at CMU
In many ways it was a predecessor to the modern statistical systems
Modern Statistical ASR
Acoustic Model
For each frame of data, we need some way of describing the likelihood of it belonging to any of our classes
Two methods are commonly used:
Multilayer perceptron (MLP) gives the posterior probability of a class given the data
Gaussian Mixture Model (GMM) gives the likelihood of the data given a class
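As a rough illustration of the GMM case, the sketch below scores a single one-dimensional frame against per-class mixtures and picks the class under which the frame is most likely. It assumes scalar features and hand-set parameters; real acoustic models use multivariate features and parameters estimated from training data. All names are illustrative.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def gmm_likelihood(x, weights, means, variances):
    """Likelihood of x under a mixture: a weighted sum of Gaussian densities."""
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances))


def classify_frame(x, class_gmms):
    """Return the class whose GMM assigns the highest likelihood to x.

    class_gmms maps a class label to a (weights, means, variances) tuple.
    """
    return max(class_gmms, key=lambda c: gmm_likelihood(x, *class_gmms[c]))
```

Note that this compares likelihoods of the data given each class. Turning them into class posteriors, which is what an MLP outputs directly, would additionally require class priors via Bayes' rule.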
Gaussian Distribution
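The figure for this slide is not reproduced in the transcript. For reference, the standard univariate Gaussian density and the mixture form used by a GMM can be written as below. The notation (x for a feature value, c for a class, w_k for mixture weights) is chosen here and is not taken from the slides.

$$
\mathcal{N}(x;\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),
\qquad
p(x \mid c) = \sum_{k=1}^{K} w_k\,\mathcal{N}(x;\mu_k,\sigma_k^2),
\quad \sum_{k=1}^{K} w_k = 1
$$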
Pronunciation Model
While the pronunciation model can be very complex, it is typically just a dictionary
The dictionary contains the valid pronunciations for each word
Examples:
Cat: k ae t
Dog: d ao g
Fox: f aa k s
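A dictionary of this kind maps directly onto a lookup table. The sketch below is one possible representation, assuming each word may have several pronunciations and that the first one is used by default; "fox" is written with the phones k and s. The names `LEXICON` and `words_to_phones` are hypothetical.

```python
# Each word maps to a list of pronunciations; each pronunciation is a list of phones.
LEXICON = {
    "cat": [["k", "ae", "t"]],
    "dog": [["d", "ao", "g"]],
    "fox": [["f", "aa", "k", "s"]],
}


def words_to_phones(words, lexicon=LEXICON):
    """Expand a word sequence into phones using each word's first pronunciation."""
    phones = []
    for word in words:
        variants = lexicon.get(word.lower())
        if not variants:
            # Out-of-vocabulary words cannot be expanded with a plain dictionary.
            raise KeyError(f"no pronunciation for {word!r}")
        phones.extend(variants[0])
    return phones
```

Words missing from the dictionary raise an error here, which mirrors the closed-vocabulary limitation of a purely dictionary-based pronunciation model.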
Language Model
Now we need some way of representing the likelihood of any given word sequence
Many methods exist, but ngrams are the most common
Ngram models are trained by simply counting the occurrences of word sequences in a training set
Ngrams
A unigram is the probability of any word in isolation
A bigram is the probability of a given word given the previous word
Higher order ngrams continue in a similar fashion
A backoff probability is used for any unseen data
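The counting and backoff ideas above can be sketched as follows. This is a deliberately simplified scheme in the style of "stupid backoff": a seen bigram uses its relative frequency, and an unseen one falls back to the unigram frequency scaled by a fixed weight `alpha`. The scores are therefore not properly normalized probabilities, unlike Katz backoff or similar methods. The function names, the sentence-boundary markers, and the value 0.4 are assumptions made for illustration.

```python
from collections import Counter


def train_ngrams(sentences):
    """Count unigrams and bigrams, padding each sentence with boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def bigram_score(prev, word, unigrams, bigrams, alpha=0.4):
    """Relative frequency of (prev, word) if seen, else a scaled unigram score."""
    if bigrams[(prev, word)] > 0:
        return bigrams[(prev, word)] / unigrams[prev]
    total = sum(unigrams.values())
    return alpha * unigrams[word] / total  # back off to the unigram estimate
```

With the training set `["the cat sat", "the dog sat"]`, the seen bigram ("the", "cat") scores 1/2, while the unseen bigram ("cat", "dog") backs off to 0.4 times the unigram frequency of "dog".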
How do we put it together?
We now have models to represent the three parts of our equation
We need a framework to join these models together
The standard framework used is the Hidden Markov Model (HMM)
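The equation referred to here appeared on a slide that is not reproduced in this transcript. It is presumably the standard decomposition of the recognition problem, sketched below with notation chosen here: X is the sequence of acoustic observations, W a word sequence, and Q a phone (or state) sequence. It assumes the acoustics depend on the words only through Q.

$$
\hat{W} = \arg\max_{W} P(W \mid X)
        = \arg\max_{W} P(X \mid W)\,P(W)
        = \arg\max_{W} \sum_{Q} \underbrace{P(X \mid Q)}_{\text{acoustic model}}\;
          \underbrace{P(Q \mid W)}_{\text{pronunciation model}}\;
          \underbrace{P(W)}_{\text{language model}}
$$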
Markov Model
A state model using the Markov property
The Markov property states that the future depends only on the present state
Models the likelihood of transitions between states in a model
Given the model, we can determine the likelihood of any sequence of states
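As a small illustration of the last point, the sketch below computes the probability of a state sequence as the initial-state probability times the product of the transition probabilities along the sequence. The function name and the dictionary-based representation are assumptions made for the example.

```python
def sequence_probability(states, initial, transitions):
    """Probability of a state sequence under a first-order Markov model.

    initial[s] is the probability of starting in state s;
    transitions[a][b] is the probability of moving from state a to state b.
    """
    prob = initial[states[0]]
    for prev, nxt in zip(states, states[1:]):
        prob *= transitions[prev][nxt]  # the next state depends only on the current one
    return prob
```

For instance, with a start probability of 0.6 for "sunny" and transition probabilities of 0.8 for sunny to sunny and 0.2 for sunny to rainy, the sequence sunny, sunny, rainy has probability 0.6 × 0.8 × 0.2.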
Hidden Markov Model
Similar to a Markov model except the states are hidden
We now have observations tied to the individual states
We no longer know the exact state sequence given the data
Allows for the modeling of an underlying unobservable process
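Because the state sequence is no longer known, the likelihood of the observations has to account for every state sequence that could have produced them. A minimal sketch of the forward algorithm, which does this efficiently, is given below. It assumes discrete observations and dictionary-based parameters; the names are illustrative, and a practical implementation would work with log probabilities or scaling to avoid underflow.

```python
def forward_likelihood(observations, states, initial, transitions, emissions):
    """Likelihood of an observation sequence, summed over all hidden state sequences.

    emissions[s][o] is the probability of observing o while in state s.
    """
    # alpha[s] = probability of the observations so far, ending in state s
    alpha = {s: initial[s] * emissions[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            s: emissions[s][obs] * sum(alpha[p] * transitions[p][s] for p in states)
            for s in states
        }
    return sum(alpha.values())
```

The cost grows linearly with the number of frames and quadratically with the number of states, rather than exponentially as it would if every state sequence were enumerated.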
HMMs for ASR
First we build an HMM for each phone
Next we combine the phone models based on the pronunciation model to create word level models
Finally, the word level models are combined based on the language model
We now have a giant network with potentially thousands or even millions of states
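The first two steps can be pictured as simple concatenation. The sketch below builds the left-to-right state sequence for a word by chaining per-phone states according to a pronunciation dictionary. It assumes three states per phone, a common convention that the slides do not specify, and it leaves out transition probabilities and the language-model connections between words. The names are hypothetical.

```python
def word_state_sequence(word, lexicon, states_per_phone=3):
    """Left-to-right HMM state labels for a word, built by chaining phone models.

    lexicon maps a word to its list of phones.
    """
    phones = lexicon[word]
    return [f"{phone}_{i}" for phone in phones for i in range(states_per_phone)]
```

Under these assumptions a three-phone word such as "cat" already expands to nine states, which gives a sense of how the full network over a large vocabulary reaches the sizes mentioned above.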
Decoding
Decoding happens in the same way as the previous example
For each time frame we need to maintain two pieces of information: the likelihood of being at any state, and the previous state for every state
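This bookkeeping corresponds to the Viterbi algorithm. The sketch below keeps, for each frame, the score of the best path ending in each state and a backpointer to that path's previous state, then traces the backpointers to recover the most likely state sequence. It assumes discrete observations and dictionary-based parameters with illustrative names; a real decoder would use log probabilities and prune unlikely states.

```python
def viterbi(observations, states, initial, transitions, emissions):
    """Most likely hidden state sequence and its probability."""
    # score[s] = probability of the best path so far that ends in state s
    score = {s: initial[s] * emissions[s][observations[0]] for s in states}
    backpointers = []  # one dict per frame after the first: state -> best previous state
    for obs in observations[1:]:
        new_score, pointers = {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: score[p] * transitions[p][s])
            new_score[s] = score[best_prev] * transitions[best_prev][s] * emissions[s][obs]
            pointers[s] = best_prev
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    last = max(states, key=lambda s: score[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]
```

Only the per-state scores for the current frame need to be kept in memory, while the backpointers must be stored for every frame so that the path can be recovered at the end.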
State of the Art
What works well: constrained vocabulary systems, systems adapted to a given speaker, systems in anechoic environments without background noise, systems expecting read speech
What doesn't work: large unconstrained vocabulary, noisy environments, conversational speech
Future Work
Better representations of audio based on human hearing
Better representation of acoustic elements based on articulatory phonology
Segmental models that do not rely on the simple frame-based approach
Resources
Hidden Markov Model Toolkit (HTK): http://htk.eng.cam.ac.uk/
CHIME (a freely available dataset): http://spandh.dcs.shef.ac.uk/projects/chime/PCC/datasets.html
Machine Learning Lectures: http://www.stanford.edu/class/cs229/ and http://www.youtube.com/watch?v=UzxYlbK2c7E