2.0 Fundamentals of Speech Recognition
References for 2.0
Sections 1.3, 3.3, 3.4, 4.2, 4.3, 6.4, 7.2, 7.3 of Becchetti
Hidden Markov Models (HMM)
˙Formulation
  Ot = [x1, x2, … xD]^T : feature vector for a frame at time t
  qt ∈ {1, 2, 3 … N} : state number for feature vector Ot
  A = [aij] , aij = Prob[qt = j | qt-1 = i]
    state transition probabilities
  B = [bj(o), j = 1, 2, … N] : observation (emission) probabilities
    bj(o) = Σ_{k=1}^{M} cjk bjk(o) : Gaussian Mixture Model (GMM)
    bjk(o) : multivariate Gaussian distribution
      for the k-th mixture (Gaussian) of the j-th state
    M : total number of mixtures
    Σ_{k=1}^{M} cjk = 1
  π = [π1, π2, … πN] : initial probabilities
    πi = Prob[q1 = i]
  HMM : λ = (A, B, π)
[Figure: a left-to-right HMM with states 1, 2, 3 carrying emission distributions b1(o), b2(o), b3(o) and transition probabilities a11, a22, a12, a23, a13, a24; an observation sequence o1 o2 o3 o4 … is generated along a state sequence q1 q2 q3 q4 …]
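The GMM observation probability bj(o) defined above can be computed directly from its definition. Below is a minimal Python sketch (an illustration added here, not from the course material); all parameter values are made-up examples, and scipy's multivariate normal density stands in for bjk(o).

```python
import numpy as np
from scipy.stats import multivariate_normal

def emission_prob(o, weights, means, covs):
    """b_j(o) = sum_k c_jk * N(o; mu_jk, Sigma_jk) for one state j.

    weights : (M,)   mixture weights c_jk, summing to 1
    means   : (M, D) mixture means
    covs    : (M, D, D) mixture covariance matrices
    """
    assert np.isclose(weights.sum(), 1.0)          # sum_k c_jk = 1
    return sum(c * multivariate_normal.pdf(o, mean=mu, cov=S)
               for c, mu, S in zip(weights, means, covs))

# Toy example: D = 2 features, M = 2 mixtures (made-up numbers).
weights = np.array([0.6, 0.4])
means = np.array([[0.0, 0.0], [2.0, 1.0]])
covs = np.array([np.eye(2), 0.5 * np.eye(2)])
print(emission_prob(np.array([0.5, 0.2]), weights, means, covs))
```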
Observation Sequences
[Figure: a continuous speech waveform x(t) is sampled to x[n] and divided into frames along the time axis t, each frame yielding one feature vector]
1-dim Gaussian Mixtures
State Transition Probabilities
[Figure: example 1-dim Gaussian mixture densities Px(x) for each state, and a 2-state transition diagram with transition probabilities 0.9/0.1 and 0.5/0.5; a sample state sequence 1 2 2 ⋯ 1 1 1 1 1 1 1 1 2 2 ⋯]
˙Gaussian Random Variable X
  fX(x) = (1 / (2πσ²)^{1/2}) e^{−(x−m)²/2σ²}
˙Multivariate Gaussian Distribution for n Random Variables
  X = [X1, X2, ……, Xn]^t
  fX(x) = (1 / ((2π)^{n/2} |Σ|^{1/2})) e^{−(1/2)(x−μ)^t Σ^{−1}(x−μ)}
  μ = [μX1, μX2, ……, μXn]^t
  Σ = [σij] , covariance matrix
  σij = E[(Xi − μXi)(Xj − μXj)]
  |Σ| : determinant of Σ
[Figure: a 1-dim Gaussian density fx(x) with mean m and variance σ², and the covariance matrix Σ laid out with entry σij at row i, column j]
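The multivariate density above translates directly into code. A minimal numpy sketch (added for illustration), using made-up values for μ and Σ:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """f_X(x) = (2*pi)^(-n/2) |Sigma|^(-1/2) exp(-(1/2)(x-mu)^t Sigma^-1 (x-mu))."""
    n = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(sigma, diff)   # (x-mu)^t Sigma^-1 (x-mu)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** n * np.linalg.det(sigma))

# Made-up example: n = 2, correlated components.
mu = np.array([1.0, -1.0])
sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])   # sigma_ij = E[(Xi - mu_Xi)(Xj - mu_Xj)]
print(gaussian_pdf(np.array([0.8, -0.5]), mu, sigma))
```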
Multivariate Gaussian Distribution
[Figure: repeats the covariance-matrix layout and density sketch above]
2-dim Gaussian
[Figure: contours of 2-dim Gaussian densities over (x1, x2) for three covariance cases: Σ = diag(σ11², σ22²) with σ11² = σ22² gives circular contours; Σ = diag(σ11², σ22²) with σ11² ≠ σ22² gives axis-aligned elliptic contours; a full Σ with off-diagonal terms σ12, σ21 gives tilted elliptic contours]
N-dim Gaussian Mixtures
Hidden Markov Models (HMM)
˙Double Layers of Stochastic Processes
  - hidden states with random transitions for time warping
  - random output given state for random acoustic characteristics
˙Three Basic Problems
  (1) Evaluation Problem:
      Given O = (o1, o2, … ot … oT) and λ = (A, B, π),
      find Prob[O | λ]
  (2) Decoding Problem:
      Given O = (o1, o2, … ot … oT) and λ = (A, B, π),
      find the best state sequence q = (q1, q2, … qt, … qT)
  (3) Learning Problem:
      Given O, find the best values for the parameters in λ
      such that Prob[O | λ] = max
Simplified HMM
[Figure: a discrete HMM with 3 states, each emitting colored balls (R, G, B) with state-dependent probabilities; example observation sequence RGBGGBBGRRR……]
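For the evaluation problem, Prob[O | λ] is usually computed with the forward algorithm rather than by summing over all state sequences. A minimal sketch on the simplified discrete HMM above (3 states emitting R/G/B balls); all probability values are made up for illustration:

```python
import numpy as np

# Made-up parameters for a 3-state discrete HMM emitting R, G, B.
A = np.array([[0.6, 0.3, 0.1],     # a_ij = Prob[q_t = j | q_t-1 = i]
              [0.1, 0.6, 0.3],
              [0.1, 0.2, 0.7]])
B = np.array([[0.7, 0.2, 0.1],     # b_j(o) for o in {R, G, B}
              [0.2, 0.6, 0.2],
              [0.1, 0.2, 0.7]])
pi = np.array([0.8, 0.1, 0.1])     # pi_i = Prob[q_1 = i]
sym = {"R": 0, "G": 1, "B": 2}

def forward_prob(obs):
    """Evaluation problem: Prob[O | lambda] via the forward algorithm."""
    alpha = pi * B[:, sym[obs[0]]]           # alpha_1(i) = pi_i * b_i(o_1)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, sym[o]]   # alpha_t(j) = (sum_i alpha_{t-1}(i) a_ij) b_j(o_t)
    return alpha.sum()                       # Prob[O|lambda] = sum_i alpha_T(i)

print(forward_prob("RGBGGBBGRRR"))
```

Each time step costs O(N²), so the whole computation is O(N²T) instead of exponential in T.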
Feature Extraction (Front-end Signal Processing)
˙Pre-emphasis
  H(z) = 1 − az^{−1} , 0 ≪ a < 1
  x′[n] = x[n] − a x[n−1]
  - pre-emphasis of the spectrum at higher frequencies
˙Endpoint Detection (Speech/Silence Discrimination)
  - short-time energy
    En = Σ_{m=−∞}^{∞} (x[m])² w[m−n]
  - adaptive thresholds
˙Windowing
  Qn = Σ_{m=−∞}^{∞} T{x[m]} w[m−n]
    T{ • } : some operator
    w[m] : window shape
  - Rectangular window
    w[m] = 1 , 0 ≤ m ≤ L−1
         = 0 , else
  - Hamming window
    w[m] = 0.54 − 0.46 cos(2πm/L) , 0 ≤ m ≤ L−1
         = 0 , else
  - window length/shift/shape
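Each of the three front-end steps above is a few lines of code. A minimal sketch, assuming 16 kHz sampling with 25 ms frames and 10 ms shifts (typical values, not specified in the slides), and a deliberately crude fixed threshold where a real system would adapt it:

```python
import numpy as np

def pre_emphasis(x, a=0.95):
    """x'[n] = x[n] - a*x[n-1]; a = 0.95 is a typical (assumed) value."""
    return np.append(x[0], x[1:] - a * x[:-1])

def short_time_energy(x, frame_len=400, shift=160):
    """E_n = sum_m (x[m])^2 w[m-n] with a sliding rectangular window."""
    return np.array([np.sum(x[n:n + frame_len] ** 2)
                     for n in range(0, len(x) - frame_len + 1, shift)])

def hamming(L):
    """w[m] = 0.54 - 0.46 cos(2*pi*m/L), 0 <= m <= L-1 (as on the slide)."""
    return 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(L) / L)

# Toy usage on a made-up signal.
x = np.random.randn(16000)
xp = pre_emphasis(x)
e = short_time_energy(xp)
speech = e > 2.0 * e.min()          # crude fixed threshold; real systems adapt it
frame = xp[:400] * hamming(400)     # one windowed frame, ready for the DFT
print(e[:5], speech[:5], frame.shape)
```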
Time and Frequency Domains
[Figure: the Fourier Transform is a 1-1 mapping between the time domain x[n] and the frequency domain X[k], computed efficiently by the Fast Fourier Transform (FFT)]
  Re{e^{jω1t}} = cos(ω1t)
  Re{(A1 e^{jφ1})(e^{jω1t})} = A1 cos(ω1t + φ1)
  a signal can be expanded over basis components, just as a vector can:
  X = a1 i + a2 j + a3 k  ⟶  X = Σ_i ai xi  ⟶  x(t) = Σ_i ai xi(t)
Pre-emphasis
• Pre-emphasis
  H(z) = 1 − az^{−1} , 0 ≪ a < 1
  x′[n] = x[n] − a x[n−1]
  pre-emphasis of the spectrum at higher frequencies
[Figure: spectrum X(ω) before and after pre-emphasis, with the high-frequency region boosted]
Endpoint Detection
[Figure: waveform and short-time energy En plotted against time t (frame index n), with a threshold marking the detected endpoints of speech]
Endpoint Detection
  En = Σ_{m=−∞}^{∞} (x[m])² w[m−n]
  x[m] : signal samples
  Hamming window
    w[m] = 0.54 − 0.46 cos(2πm/L) , 0 ≤ m ≤ L−1
         = 0 , else
  Qn = Σ_{m=−∞}^{∞} T{x[m]} w[m−n]
    T{ • } : some operator
    w[m] : window shape
[Figure: Hamming vs. rectangular window shapes w[m] over 0 ≤ m ≤ L−1, and the window w[m−n] sliding along the samples from n to n + L−1]
Feature Extraction (Front-end Signal Processing)
˙Mel Frequency Cepstral Coefficients (MFCC)
- Mel-scale Filter Bank
triangular shape in frequency/overlapped
uniformly spaced below 1 kHz
logarithmic scale above 1 kHz
˙Delta Coefficients
- 1st/2nd order differences
  windowed speech samples → Discrete Fourier Transform → Mel-scale Filter Bank → log( | • |² ) → Inverse Discrete Fourier Transform → MFCC
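A minimal sketch of this pipeline, following the block diagram: DFT → Mel-scale filter bank → log(| • |²) → inverse transform (implemented, as is common in practice, with a DCT). The filter-bank construction below uses the standard Mel formula (roughly linear below 1 kHz, logarithmic above); the filter count, FFT size, and sampling rate are assumed values, not from the slides:

```python
import numpy as np
from scipy.fftpack import dct

def mel(f):     return 2595.0 * np.log10(1.0 + f / 700.0)
def inv_mel(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(n_filters=26, n_fft=512, fs=16000):
    """Triangular, overlapping filters uniformly spaced on the Mel scale."""
    edges = inv_mel(np.linspace(mel(0), mel(fs / 2), n_filters + 2))
    bins = np.floor((n_fft + 1) * edges / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for k in range(n_filters):
        l, c, r = bins[k], bins[k + 1], bins[k + 2]
        fbank[k, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fbank[k, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fbank

def mfcc(frame, fbank, n_ceps=13):
    """windowed frame -> DFT -> Mel filter bank -> log(| |^2) -> DCT -> MFCC."""
    spec = np.abs(np.fft.rfft(frame, n=512)) ** 2         # |DFT|^2
    log_e = np.log(fbank @ spec + 1e-10)                  # log filter-bank energies
    return dct(log_e, type=2, norm="ortho")[:n_ceps]      # the "inverse DFT" step

frame = np.random.randn(400) * (0.54 - 0.46 * np.cos(2 * np.pi * np.arange(400) / 400))
print(mfcc(frame, mel_filter_bank()))
```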
Peripheral Processing for Human Perception (P.34 of 7.0)
Mel-scale Filter Bank
[Figure: triangular, overlapping filters along the frequency axis ω applied to the spectrum X(ω); the filters are uniformly spaced below 1 kHz and logarithmically spaced (uniform in log ω) above 1 kHz]
Delta Coefficients
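The slides define delta coefficients as 1st/2nd order differences. One common realization is the simple difference Δc[t] = c[t+1] − c[t−1], with 2nd-order deltas computed as differences of the deltas; a minimal sketch (regression-based deltas are another standard choice):

```python
import numpy as np

def delta(c):
    """1st-order difference: delta_c[t] = c[t+1] - c[t-1], edges padded by repetition."""
    padded = np.pad(c, ((1, 1), (0, 0)), mode="edge")
    return padded[2:] - padded[:-2]

# c: (T frames, n_ceps) MFCC matrix; 2nd-order deltas are deltas of the deltas.
c = np.random.randn(100, 13)
d1 = delta(c)
d2 = delta(d1)
features = np.hstack([c, d1, d2])
print(features.shape)   # (100, 39)
```

Stacking c, Δc, ΔΔc this way gives the familiar 39-dimensional feature vector when 13 MFCCs are used.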
Language Modeling: N-gram
  W = (w1, w2, w3, …, wi, … wR) : a word sequence
  - Evaluation of P(W)
    P(W) = P(w1) Π_{i=2}^{R} P(wi | w1, w2, … wi-1)
  - Assumption:
    P(wi | w1, w2, … wi-1) = P(wi | wi-N+1, wi-N+2, … wi-1)
    the occurrence of a word depends on the previous N−1 words only
    N-gram language models
      N = 2 : bigram P(wi | wi-1)
      N = 3 : tri-gram P(wi | wi-2, wi-1)
      N = 4 : four-gram P(wi | wi-3, wi-2, wi-1)
      N = 1 : unigram P(wi)
    probabilities estimated from a training text database
  - example : tri-gram model
    P(W) = P(w1) P(w2|w1) Π_{i=3}^{R} P(wi | wi-2, wi-1)
tri-gram
[Figure: a window of three words sliding along the word sequence W1 W2 W3 W4 W5 W6 …… WR, predicting each word from the two preceding words; for a general N-gram the window covers N words]
  P(W) = P(w1) Π_{i=2}^{R} P(wi | w1, w2, ⋯, wi-1)
  P(W) = P(w1) P(w2|w1) Π_{i=3}^{R} P(wi | wi-2, wi-1)   (tri-gram)
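Evaluating P(W) under a tri-gram model is then a product of table lookups, normally done in the log domain to avoid underflow. A minimal sketch with made-up probability tables (real tables come from the training text database, as described next):

```python
import math

# Made-up tri-gram model probabilities, for illustration only.
unigram = {"this": 0.05}
bigram = {("this", "is"): 0.01}
trigram = {("this", "is", "a"): 0.2, ("is", "a", "test"): 0.1}

def log_prob_trigram(words):
    """log P(W) = log P(w1) + log P(w2|w1) + sum_{i=3}^{R} log P(wi|wi-2, wi-1)."""
    lp = math.log(unigram[words[0]]) + math.log(bigram[(words[0], words[1])])
    for i in range(2, len(words)):
        lp += math.log(trigram[(words[i - 2], words[i - 1], words[i])])
    return lp

print(log_prob_trigram(["this", "is", "a", "test"]))
```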
Language Modeling
  - Evaluation of N-gram model parameters
    unigram
      P(wi) = N(wi) / Σ_{j=1}^{V} N(wj)
      wi : a word in the vocabulary
      V : total number of different words in the vocabulary
      N( • ) : number of counts in the training text database
    bigram
      P(wj|wk) = N(<wk, wj>) / N(wk)
      <wk, wj> : a word pair
    trigram
      P(wj|wk, wm) = N(<wk, wm, wj>) / N(<wk, wm>)
    smoothing − estimation of probabilities of rare events by statistical approaches
example counts in a training text database:
  … this ………  50000
  … this is ……  500
  … this is a …  5
  Prob[ is | this ] = 500 / 50000
  Prob[ a | this is ] = 5 / 500
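These conditional probabilities are exactly ratios of n-gram counts, so training amounts to counting. A minimal sketch on a toy three-sentence corpus (made up here; real systems count over large databases and then smooth):

```python
from collections import Counter

def train_ngrams(corpus):
    """Count unigrams, bigrams and trigrams in a training text."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for sentence in corpus:
        w = sentence.split()
        uni.update(w)
        bi.update(zip(w, w[1:]))
        tri.update(zip(w, w[1:], w[2:]))
    return uni, bi, tri

corpus = ["this is a test", "this is fine", "this was a test"]
uni, bi, tri = train_ngrams(corpus)

# P(wj|wk) = N(<wk,wj>) / N(wk)
print(bi[("this", "is")] / uni["this"])               # Prob[is | this] = 2/3
# P(wj|wk,wm) = N(<wk,wm,wj>) / N(<wk,wm>)
print(tri[("this", "is", "a")] / bi[("this", "is")])  # Prob[a | this is] = 1/2
```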
Large Vocabulary Continuous Speech Recognition
  W = (w1, w2, … wR) : a word sequence
  O = (o1, o2, … oT) : feature vectors for a speech utterance
˙Maximum A Posteriori (MAP) Principle
  W* = arg max_W Prob(W|O)
  Prob(W|O) = Prob(O|W) • P(W) / P(O) = max   (a posteriori probability)
  ⇒ Prob(O|W) • P(W) = max , since P(O) does not depend on W
    Prob(O|W) : by HMM    P(W) : by language model
˙A Search Process Based on Three Knowledge Sources
  - Acoustic Models : HMMs for basic voice units (e.g. phonemes)
  - Lexicon : a database of all possible words in the vocabulary, each word including its
    pronunciation in terms of component basic voice units
  - Language Models : based on words in the lexicon
[Figure: the input O enters a Search block computing W* = Arg Max_W Prob(O|W) • P(W), drawing on the Acoustic Models, Lexicon, and Language Models]
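The search therefore compares candidate word sequences by Prob(O|W) • P(W), usually as a sum of log scores. A schematic sketch with made-up log scores standing in for the HMM acoustic score and the language-model score (the values are placeholders, not a real decoder):

```python
# Made-up log scores for two candidate sentences (illustration only).
acoustic = {"this is a test": -120.0, "this is attest": -118.0}  # log Prob(O|W) from HMMs
lm = {"this is a test": -9.2, "this is attest": -14.5}           # log P(W) from the LM

def map_decode(candidates):
    """W* = arg max_W [log Prob(O|W) + log P(W)]; P(O) is common to all W."""
    return max(candidates, key=lambda W: acoustic[W] + lm[W])

print(map_decode(list(acoustic)))   # "this is a test": -129.2 > -132.5
```

Here the language model overrides a slightly better acoustic score, which is exactly how P(W) helps disambiguate acoustically similar hypotheses.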
Maximum A Posteriori Principle (MAP)
Problem
  W : { w1 , w2 , w3 } = { sunny, rainy, cloudy }
  P(w1) + P(w2) + P(w3) = 1.0
  O = (o1, o2, o3, ⋯) : weather parameters observed today
  given O today, predict W for tomorrow
Maximum A Posteriori Principle (MAP)
  Approach 1 : compare the prior probabilities P(w1), P(w2), P(w3)
    − but then the observation O is not used
  Approach 2 : compute and compare the a posteriori probabilities
    P(wi | O) = P(O|wi) P(wi) / P(O) , i = 1, 2, 3
    since P(O) is common to all wi, it suffices to compare
    P(O|wi) • P(wi) , i = 1, 2, 3
  P(wi | O) : a posteriori (posterior) probability, for the unknown observation O
  P(O|wi) : likelihood function
  P(wi) : prior probability
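A worked numeric instance of the two approaches, with made-up priors and likelihoods (illustration only):

```python
# Made-up priors and likelihoods for {sunny, rainy, cloudy}.
prior = {"sunny": 0.5, "rainy": 0.2, "cloudy": 0.3}        # P(wi), sums to 1.0
likelihood = {"sunny": 0.1, "rainy": 0.6, "cloudy": 0.3}   # P(O|wi) for today's O

# Approach 1: compare priors only -> "sunny" (O is not used).
print(max(prior, key=prior.get))

# Approach 2: compare P(O|wi) * P(wi); P(O) cancels -> "rainy" (0.12 > 0.09 > 0.05).
posterior_score = {w: likelihood[w] * prior[w] for w in prior}
print(max(posterior_score, key=posterior_score.get))
```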
Syllable-based One-pass Search
• Finding the Optimal Sentence from an Unknown Utterance Using 3
  Knowledge Sources: Acoustic Models, Lexicon and Language Model
• Based on a Lattice of Syllable Candidates
[Figure: the acoustic models produce a lattice of syllable candidates along the time axis t; the lexicon expands syllable candidates into word hypotheses (w1, w2, ……) forming a word graph; the language models score the paths with P(w1) P(w2|w1) ……]