Speech Recognition and Processing
Lecture 5
Yossi Keshet
Agenda
• ASR (Automatic Speech Recognition) Structure
• Phonemes
• Acoustic Model, Language Model, Pronunciation Model
• Language Model
• Acoustic Model
Recall
• Up until now we have used the following decoding process:
  c* = arg min_{c ∈ Vocab} distance(A_test, A_c)
• where we set distance to be the DTW alignment score, built from local frame distances such as d([0.0, 0.03], [−0.58, −0.63])
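The decoding rule above can be sketched as follows; a minimal illustration only (the function and template names are hypothetical, and the local frame distance is assumed to be Euclidean):

```python
import numpy as np

def dtw_distance(A, B):
    """DTW alignment score between two feature sequences (rows = frames),
    using a Euclidean local frame distance (an assumption for illustration)."""
    n, m = len(A), len(B)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(A[i - 1] - B[j - 1])  # d(a_i, b_j)
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def decode(A_test, templates):
    """c* = arg min_{c in Vocab} distance(A_test, A_c)."""
    return min(templates, key=lambda c: dtw_distance(A_test, templates[c]))
```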
Limitations
• The above approach has many drawbacks:
• Classifying whole words
• Limited dictionary size
• Lazy method
• Non-parametric, Non-probabilistic
• Etc.
Can we do better?
Classifying whole words
• Cons:
• Words usually span over many frames
• Need to split audio files to words, a hard task
• Classifying words will force us to use multi-class classification
• Extreme classification (~200K+ words)
• Limited dictionary
• Spoken language tends to change; language keeps evolving
Can we do better?
• Can we break words into smaller sub-units?
• Option 1: Characters
• Characters often do not represent the acoustic signal (e.g. know, laugh, musically)
• Different pronunciations for the same word (accents, slang, etc.)
• Option 2: Phonemes
Phonemes
Phonetics
• Phonetics: Study of the physical properties of speech-sounds
• Articulatory Phonetics: how speech-sounds are made
• Auditory Phonetics: how speech-sounds are heard
• Acoustic Phonetics: how speech-sounds are transmitted
Phonetics
• Phone: a distinct speech sound or gesture, regardless of whether the exact sound is critical to the meaning of words
• Phoneme: a sound or a group of different sounds perceived to have the same function by speakers of the language
• For example: phoneme /k/ -> cat, kit, scat, skit
• Different standards: IPA, SAMPA
• Allophone: The different phonetic realizations of a phoneme
Phonetics
(Figure: the /k/ in "kit" (aspirated) vs "skill" (unaspirated), two allophones of the same phoneme)
Limitations
• The proposed fix for the drawbacks above: classify phonemes instead of whole words
Dictionary

| Word   | Transcription |
|--------|---------------|
| odd    | AA D          |
| at     | AE T          |
| cow    | K AW          |
| be     | B IY          |
| cheese | CH IY Z       |
| eat    | IY T          |
| read   | R IY D        |
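The dictionary above is just a lookup table from words to phone sequences; a minimal sketch (the `LEXICON` name is illustrative, and the entries simply mirror the table):

```python
# A toy pronunciation lexicon mirroring the table above (ARPAbet symbols).
LEXICON = {
    "odd":    ["AA", "D"],
    "at":     ["AE", "T"],
    "cow":    ["K", "AW"],
    "be":     ["B", "IY"],
    "cheese": ["CH", "IY", "Z"],
    "eat":    ["IY", "T"],
    "read":   ["R", "IY", "D"],
}

def words_to_phones(words):
    """Concatenate lexicon entries to get the phone sequence of an utterance."""
    return [ph for w in words for ph in LEXICON[w]]
```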
How can we handle wrong phonetic sequences?
• Wrong pronunciation
• Wrong prediction
ASR Structure
ASR Structure
• Automatic Speech Recognition
• Acoustic Model: represents the relationship between the audio signal (features) and phonemes
• Language Model: computes the probability of observing a given sequence of words
• Pronunciation Model: Builds a lexicon for how people pronounce words
ASR Structure
• How can we combine all these components?
• Let's be more formal:
• Given a sequence of acoustic features o, we would like to find:
  w* = arg max_w P(w | o)
ASR Structure
w* = arg max_w P(w | o)
   = arg max_w P(o | w) · P(w) / P(o)
   = arg max_w P(o | w) · P(w)
ASR Structure
P(o | w) = Σ_p P(o | p, w) · P(p | w)
         ≈ max_p P(o | p, w) · P(p | w)
         ≈ max_p P(o | p) · P(p | w)
ASR Structure
w* = arg max_w max_p P(o | p) · P(p | w) · P(w)
Language Model
• In language modeling we would like to predict the probability P(w) of a given sequence of tokens
• This probability can be factored using the chain rule of probability:
  P(w) = Π_{i=1}^M P(w_i | w_1, …, w_{i−1})
• We usually assume that, given the previous n−1 tokens, each token is independent of the remaining history:
  P(w) = Π_{i=1}^M P(w_i | w_{i−n+1}, …, w_{i−1})
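A minimal illustration of the factorization above with n = 2 (a maximum-likelihood bigram model; the function names and the `<s>` start token are illustrative choices, not prescribed by the slides):

```python
from collections import Counter

def train_bigram_lm(sentences):
    """Estimate P(w_i | w_{i-1}) by maximum likelihood from tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        toks = ["<s>"] + s                      # sentence-start token
        unigrams.update(toks[:-1])              # count history tokens
        bigrams.update(zip(toks[:-1], toks[1:]))
    return lambda prev, w: bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

def sentence_prob(p, s):
    """P(w) = product of conditional bigram probabilities."""
    prob, prev = 1.0, "<s>"
    for w in s:
        prob *= p(prev, w)
        prev = w
    return prob
```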
Language Model
• This can be implemented using different ML models:
• N-grams
• RNN
• CNN
• Etc.
Acoustic Model
• We would like to compute the probability P(o | p)
• Consider the phone sequence [p r aa b l iy]
• It corresponds to some spoken conversational speech segment of 320 ms
• There are 32 observations in such a segment (each frame is 10 ms long)
• The phone sequence can be written as an expanded sequence of 32 phone symbols:
  q = [p, p, r, r, …, r, aa, aa, …, b, b, b, l, …, iy, iy, iy]
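Expanding a phone sequence into a frame-level state sequence q can be sketched as follows (the per-phone durations here are hypothetical frame counts, chosen only so they sum to 32):

```python
def expand_phones(phones, durations):
    """Expand a phone sequence into a frame-level state sequence q,
    repeating each phone for its duration in frames (10 ms each)."""
    return [ph for ph, d in zip(phones, durations) for _ in range(d)]
```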
Markov Model
• Given N states S_1, S_2, …, S_N
• At each time instant t = 1, 2, …, T, the system changes (makes a transition) to state q_t
• For the special case of a first-order Markov chain:
  P(q_t = S_j | q_{t−1} = S_i, q_{t−2} = S_k, …) = P(q_t = S_j | q_{t−1} = S_i)
• This is usually referred to as the state transition probability
Acoustic Model
• We can use these Markovian assumptions to reduce the complexity of our problem:
  P(o | q) P(q) = Π_{t=1}^T P(o_t | q_t) · P(q_t | q_{t−1})
• Here P(o_t | q_t) is the probability of observing the acoustic features given a phoneme, and P(q_t | q_{t−1}) is the transition probability
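The factorization above can be evaluated directly in log space; a toy sketch with discrete observations (the dictionary-based model and names are purely illustrative):

```python
import math

def log_joint(o, q, log_init, log_trans, log_emit):
    """log [ P(o | q) P(q) ] = log pi(q_1) + sum_t log P(o_t | q_t)
    + sum_{t>1} log P(q_t | q_{t-1}), per the factorization above.
    Probabilities live in plain dicts: a toy discrete model."""
    lp = log_init[q[0]] + log_emit[(q[0], o[0])]
    for t in range(1, len(o)):
        lp += log_trans[(q[t - 1], q[t])] + log_emit[(q[t], o[t])]
    return lp
```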
Acoustic Model
• We can compute P(o_t | q_t) using different models:
• Gaussian Mixture Models (GMMs)
• Support Vector Machines
• Deep Neural Networks (DNNs)
Deep Neural Networks
• We would like to find the parameters θ that minimize the following Negative Log-Likelihood loss function:
  L(θ) = Σ_t −log p(y = y_t | x_t)
• Notice, we now have several weight matrices and biases, and we need the gradients with respect to all of them
• In order to do so, we calculate ∂L/∂θ and use GD/SGD
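A minimal sketch of this loss and one gradient step, simplified to a single linear layer followed by softmax (not the full multi-layer network; for softmax + NLL the analytic gradient with respect to the logits is P − Y, with Y one-hot):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # numerically stable
    return e / e.sum(axis=-1, keepdims=True)

def nll(W, b, X, y):
    """L(theta) = sum_t -log p(y = y_t | x_t) for a linear-softmax classifier."""
    P = softmax(X @ W + b)
    return -np.log(P[np.arange(len(y)), y]).sum()

def sgd_step(W, b, X, y, lr=0.1):
    """One gradient-descent step; dL/dz = P - Y for softmax + NLL."""
    P = softmax(X @ W + b)
    P[np.arange(len(y)), y] -= 1.0          # P - Y, Y one-hot
    return W - lr * (X.T @ P), b - lr * P.sum(axis=0)
```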
Deep Neural Networks
f(x) = g_n(g_{n−1}(…g_2(g_1(x))))
g_1(x) = σ(W_1 x + b_1) = h_1
g_2(h_1) = σ(W_2 h_1 + b_2) = h_2
…
g_n(h_{n−1}) = σ(W_n h_{n−1} + b_n) = h_n
p(y | x) = softmax(h_n)
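The layered forward pass above can be sketched as follows (the weight shapes in the usage check are hypothetical; the last hidden layer is passed through softmax, mirroring the equations):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """h_k = sigma(W_k h_{k-1} + b_k) for each layer,
    then p(y | x) = softmax(h_n)."""
    h = x
    for W, b in zip(weights, biases):
        h = sigmoid(W @ h + b)
    e = np.exp(h - h.max())     # stable softmax over the last layer
    return e / e.sum()
```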
Deep Neural Networks
f(x) = g_n(g_{n−1}(…g_2(g_1(x))))
• Each layer learns a different representation
Deep Neural Networks
• We use the Spectrogram / MFCC features as input
• Problem: a single frame does not carry much information about the pronounced phoneme
• Solution: concatenate, for each frame, the k frames before and after it (we usually set k = 5, 9, 11)
• Note: we will still classify the phoneme in the center frame
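The frame-splicing trick can be sketched as follows, assuming features arrive as a (frames × dims) array; edge frames are padded by repeating the first/last frame, one common choice among several:

```python
import numpy as np

def splice(frames, k):
    """Concatenate each frame with its k left and k right neighbours,
    so the classifier sees 2k+1 frames of context around the centre frame.
    Edges are padded by repeating the first/last frame."""
    padded = np.concatenate([np.repeat(frames[:1], k, axis=0),
                             frames,
                             np.repeat(frames[-1:], k, axis=0)])
    return np.stack([padded[t:t + 2 * k + 1].ravel() for t in range(len(frames))])
```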
Deep Neural Networks
• DNN optimization is often hard
• In practice, researchers often use unsupervised techniques as a pre-training step:
• Denoising Auto-Encoders
• Restricted Boltzmann Machines
• Deep Belief Networks
• Etc.
Deep Neural Networks
• After training, the network outputs, for each frame, probabilities of the form P(q_t | o_t)
• Recall, we would like to use the DNN to output P(o_t | q_t)
• We can use Bayes' theorem to obtain:
  P(o_t | q_t) = P(q_t | o_t) · P(o_t) / P(q_t)
• P(o_t) is unknown, but it scales all alignments by the same factor, so it has no effect
• For P(q_t), we can assume a uniform prior over the phonemes or compute this probability from the training data
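Converting DNN posteriors to scaled likelihoods can be sketched as follows (the prior here is estimated from frame-level label counts, one of the two options above; function names are illustrative):

```python
import numpy as np

def estimate_priors(frame_labels, n_states):
    """P(q): relative frequency of each state in the training alignment."""
    counts = np.bincount(frame_labels, minlength=n_states).astype(float)
    return counts / counts.sum()

def scaled_likelihoods(posteriors, priors):
    """P(o_t | q_t) / P(o_t) = P(q_t | o_t) / P(q_t); the unknown P(o_t)
    scales every alignment equally, so it is dropped."""
    return posteriors / priors
```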
Acoustic Model
• Notice, we assumed we have an alignment between the phonemes and the speech signal
• In practice, to obtain P(o | p) we must sum over all possible alignments A:
  P(o | p) = Σ_{a∈A} Π_{t=1}^T P(o_t | q_t^a) · P(q_t^a | q_{t−1}^a)
• To compute this we need a DP algorithm called the Forward-Backward algorithm (next lesson)
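As a small preview of the next lesson, the sum over alignments can be computed in O(T·N²) by the forward recursion; a minimal sketch assuming per-frame emission probabilities emit[t, j] = P(o_t | state j) are precomputed:

```python
import numpy as np

def forward_prob(emit, trans, init):
    """P(o | p) via the forward recursion (sums over all alignments):
    alpha[t, j] = emit[t, j] * sum_i alpha[t-1, i] * trans[i, j]."""
    T, _ = emit.shape
    alpha = init * emit[0]              # alpha[0, j] = pi(j) * P(o_0 | j)
    for t in range(1, T):
        alpha = emit[t] * (alpha @ trans)
    return alpha.sum()
```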
Questions?