2.0 Fundamentals of Speech Recognition


References for 2.0

1.3, 3.3, 3.4, 4.2, 4.3, 6.4, 7.2, 7.3 of Becchetti


Hidden Markov Models (HMM)

˙Formulation

Ot = [x1, x2, …, xD]^T : feature vector for the frame at time t

qt ∈ {1, 2, …, N} : state number for feature vector Ot

A = [ aij ] , aij = Prob[ qt = j | qt-1 = i ] : state transition probabilities

B = [ bj(o), j = 1, 2, …, N ] : observation (emission) probabilities

bj(o) = Σ_{k=1}^{M} cjk bjk(o) : Gaussian Mixture Model (GMM)

bjk(o) : multivariate Gaussian distribution for the k-th mixture (Gaussian) of the j-th state

M : total number of mixtures, Σ_{k=1}^{M} cjk = 1

π = [ π1, π2, …, πN ] : initial state probabilities, πi = Prob[ q1 = i ]

HMM : λ = ( A, B, π )

[Figure: a left-to-right HMM with states 1, 2, 3, self-loop probabilities a11, a22, transitions a12, a13, a23, a24, and emission distributions b1(o), b2(o), b3(o); an observation sequence o1 o2 o3 o4 o5 … is aligned with a hidden state sequence q1 q2 q3 q4 q5 …]
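To make the GMM emission probability concrete, here is a minimal numpy sketch that evaluates bj(o) = Σk cjk bjk(o) for a single state, assuming diagonal-covariance Gaussians; the function names and all numeric values are illustrative only, not taken from the slides.

```python
import numpy as np

def log_gaussian_diag(o, mean, var):
    """Log of a multivariate Gaussian with diagonal covariance, evaluated at o."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (o - mean) ** 2 / var)

def gmm_emission(o, weights, means, variances):
    """b_j(o) = sum_k c_jk * b_jk(o) for one state j.
    weights: (M,) mixture weights summing to 1; means, variances: (M, D)."""
    log_comp = np.array([log_gaussian_diag(o, m, v)
                         for m, v in zip(means, variances)])
    return np.sum(weights * np.exp(log_comp))

# toy example: D = 2 features, M = 2 mixtures for one state
o = np.array([0.3, -1.2])
weights = np.array([0.6, 0.4])                      # c_j1, c_j2
means = np.array([[0.0, 0.0], [1.0, -1.0]])
variances = np.array([[1.0, 1.0], [0.5, 2.0]])
print(gmm_emission(o, weights, means, variances))   # b_j(o)
```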

Observation Sequences

[Figure: a speech signal x(t) and its samples x[n] plotted against time t; the signal is divided into successive frames, each yielding one feature vector, giving the observation sequence]

1-dim Gaussian Mixtures

State Transition Probabilities

[Figure: a two-state example with 1-dim Gaussian mixture output densities Px(x) for each state and state transition probabilities 0.9, 0.1, 0.5, 0.5; a sample state sequence 1 2 2 ⋯ 1 1 1 1 1 1 1 1 2 2 ⋯ generated by these transition probabilities]


˙Gaussian Random Variable X

fX(x) = [ 1 / (2πσ²)^{1/2} ] exp[ −(x − m)² / 2σ² ]

˙Multivariate Gaussian Distribution for n Random Variables

X = [ X1 , X2 ,……, Xn ]^t

fX(x) = [ 1 / ( (2π)^{n/2} |Λ|^{1/2} ) ] exp[ −(1/2) (x − μ)^t Λ^{-1} (x − μ) ]

μ = [ μX1 , μX2 ,……, μXn ]^t

Λ = [ σij ] , covariance matrix

σij = E[ (Xi − μXi)(Xj − μXj) ]

|Λ| : determinant of Λ

[Figure: the 1-dim Gaussian density fX(x) centered at mean m with variance σ², and the n×n covariance matrix Λ with entry σij in row i, column j]
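A direct numerical instance of the multivariate Gaussian density above, assuming a small 2-dimensional example; the values of x, μ and Λ are made up for illustration.

```python
import numpy as np

def multivariate_gaussian_pdf(x, mu, cov):
    """f_X(x) = exp(-0.5 (x-mu)^t cov^{-1} (x-mu)) / ((2*pi)^(n/2) |cov|^(1/2))"""
    n = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / norm

# 2-dim example with correlated components (sigma_12 = sigma_21 = 0.3)
x   = np.array([0.5, -0.2])
mu  = np.array([0.0, 0.0])
cov = np.array([[1.0, 0.3],
                [0.3, 0.5]])
print(multivariate_gaussian_pdf(x, mu, cov))
```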

Multivariate Gaussian Distribution

[Figure: the covariance matrix Λ = [σij] with σij = E[(Xi − μXi)(Xj − μXj)], and the 1-dim Gaussian density fX(x) with mean m and variance σ²]

2-dim Gaussian

[Figure: contours of 2-dim Gaussian densities over (x1, x2) for three covariance matrices:
Λ = diag(σ11², σ22²) with σ11² = σ22² : circular contours
Λ = diag(σ11², σ22²) with σ11² ≠ σ22² : axis-aligned elliptical contours
Λ = [[σ11², σ12], [σ21, σ22²]] : rotated (tilted) elliptical contours]

N-dim Gaussian Mixtures



Hidden Markov Models (HMM)

˙Double Layers of Stochastic Processes

- hidden states with random transitions for time warping

- random output given state for random acoustic characteristics

˙Three Basic Problems

(1) Evaluation Problem:

Given O = (o1, o2, …ot…oT) and λ = (A, B, π),

find Prob[ O | λ ]

(2) Decoding Problem:

Given O = (o1, o2, …ot…oT) and λ = (A, B, π),

find the best state sequence q = (q1, q2, …qt, …qT)

(3) Learning Problem:

Given O, find the best values for the parameters in λ

such that Prob[ O | λ ] = max
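A minimal sketch of the evaluation problem (1) using the forward algorithm, assuming a discrete-output HMM for brevity (the slides use GMM outputs); the toy model parameters are invented. The decoding problem (2) is solved the same way with a max over the previous states in place of the sum (the Viterbi algorithm).

```python
import numpy as np

def forward_prob(A, B, pi, obs):
    """Evaluation problem: Prob[O | lambda] via the forward algorithm.
    A: (N, N) transition matrix, B: (N, K) discrete emission matrix,
    pi: (N,) initial probabilities, obs: list of observation symbol indices."""
    alpha = pi * B[:, obs[0]]              # alpha_1(i) = pi_i * b_i(o_1)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]      # alpha_t(j) = [sum_i alpha_{t-1}(i) a_ij] * b_j(o_t)
        # In practice alpha is rescaled each frame (or kept in the log domain)
        # to avoid underflow on long utterances.
    return alpha.sum()                     # Prob[O | lambda] = sum_i alpha_T(i)

# toy 2-state model with 3 discrete observation symbols
A  = np.array([[0.7, 0.3],
               [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1],
               [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
print(forward_prob(A, B, pi, obs=[0, 1, 2, 1]))
```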

Simplified HMM

[Figure: a simplified 3-state HMM (states 1, 2, 3) whose outputs are colored balls R, G, B; an example observation sequence is R G B G G B B G R R R ……]


Feature Extraction (Front-end Signal Processing)

˙Pre-emphasis

H(z) = 1 − a z^{-1} , 0 << a < 1

x'[n] = x[n] − a x[n−1]

- pre-emphasis of the spectrum at higher frequencies

˙Endpoint Detection (Speech/Silence Discrimination)

- short-time energy

En = Σ_{m=−∞}^{∞} (x[m])² w[m−n]

- adaptive thresholds

˙Windowing

Qn = Σ_{m=−∞}^{∞} T{ x[m] } w[m−n]

T{ • } : some operator

w[m] : window shape

- Rectangular window

w[m] = 1 , 0 ≤ m ≤ L−1 ; 0 , else

- Hamming window

w[m] = 0.54 − 0.46 cos( 2πm / L ) , 0 ≤ m ≤ L−1 ; 0 , else

- window length/shift/shape
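A rough numpy sketch of the three front-end steps above (pre-emphasis, Hamming windowing and short-time energy). The pre-emphasis coefficient a = 0.95, window length L = 400 samples and shift of 160 samples (25 ms / 10 ms at 16 kHz) are illustrative choices, not values given on the slides.

```python
import numpy as np

def preemphasis(x, a=0.95):
    """x'[n] = x[n] - a*x[n-1]   (H(z) = 1 - a z^-1)"""
    return np.append(x[0], x[1:] - a * x[:-1])

def short_time_energy(x, L=400, shift=160):
    """E_n = sum_m (x[m])^2 w[m - n], with a Hamming window of length L
    sliding over the signal in steps of `shift` samples."""
    w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(L) / L)
    energies = []
    for n in range(0, len(x) - L + 1, shift):
        energies.append(np.sum(x[n:n + L] ** 2 * w))
    return np.array(energies)

# toy signal: 1 s of a 440 Hz tone sampled at 16 kHz
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)
E = short_time_energy(preemphasis(x))
print(E[:5])
```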

Time and Frequency Domains

x[n] (time domain) ↔ X[k] (frequency domain) : a 1-1 mapping

Fourier Transform / Fast Fourier Transform (FFT)

Re{ e^{jω1t} } = cos(ω1t)

Re{ (A1 e^{jφ1}) (e^{jω1t}) } = A1 cos(ω1t + φ1)

X = a1 i + a2 j + a3 k , X = Σi ai xi (a vector expressed as a weighted sum of basis vectors)

x(t) = Σi ai xi(t) (a signal expressed as a weighted sum of basis signals)
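A tiny numpy check of the 1-1 mapping: the FFT of a sampled cosine A1 cos(ω1 t + φ1), and the inverse FFT back to the original samples; the amplitude, frequency and phase are arbitrary example values.

```python
import numpy as np

fs, N = 8000, 256
n = np.arange(N)
x = 1.5 * np.cos(2 * np.pi * 1000 * n / fs + 0.3)   # A1 = 1.5, f1 = 1 kHz, phi1 = 0.3

X = np.fft.fft(x)                     # x[n] -> X[k]
x_back = np.fft.ifft(X).real          # X[k] -> x[n] : 1-1 mapping

print(np.allclose(x, x_back))         # True
print(np.argmax(np.abs(X[:N // 2])))  # peak at bin 1000 / (fs/N) = 32
```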

Pre-emphasis

H(z) = 1 − a z^{-1} , 0 << a < 1

x'[n] = x[n] − a x[n−1]

- pre-emphasis of the spectrum at higher frequencies

[Figure: the spectrum X(ω) before and after pre-emphasis, showing the relative boost at higher frequencies ω]

Endpoint Detection

[Figure: the short-time energy En plotted against time t (frame index n), with an adaptive threshold; the threshold crossings mark the endpoints of the speech segment]

Endpoint Detection

En = Σ_{m=−∞}^{∞} (x[m])² w[m−n]

Qn = Σ_{m=−∞}^{∞} T{ x[m] } w[m−n]

T{ • } : some operator

w[m] : window shape

Hamming window

w[m] = 0.54 − 0.46 cos( 2πm / L ) , 0 ≤ m ≤ L−1 ; 0 , else

[Figure: the signal x[m] with the window w[m−n] covering samples n to n+L−1; the rectangular and Hamming window shapes w[m] over 0 ≤ m ≤ L−1, with values between 0 and 1]


Feature Extraction (Front-end Signal Processing)

˙Mel Frequency Cepstral Coefficients (MFCC)

- Mel-scale Filter Bank

triangular shape in frequency/overlapped

uniformly spaced below 1 kHz

logarithmic scale above 1 kHz

˙Delta Coefficients

- 1st/2nd order differences

[Block diagram: windowed speech samples → Discrete Fourier Transform → Mel-scale Filter Bank → log( | · |² ) → Inverse Discrete Fourier Transform → MFCC]
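A compact sketch of the block diagram above, assuming a 16 kHz sampling rate, 26 mel filters and 13 cepstral coefficients (common but arbitrary choices). The final Inverse DFT stage is written here as the cosine transform that is typically used to realize it; this is a simplified illustration, not the slides' exact recipe.

```python
import numpy as np

def hz_to_mel(f):  return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_to_hz(m):  return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(n_filters, n_fft, fs):
    """Triangular, overlapping filters equally spaced on the mel scale
    (roughly linear below 1 kHz, logarithmic above)."""
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def mfcc(frame, fs, n_filters=26, n_ceps=13):
    """windowed frame -> DFT -> mel filter bank -> log(| |^2) -> cosine transform -> MFCC"""
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame)) ** 2                  # |DFT|^2
    energies = mel_filter_bank(n_filters, n_fft, fs) @ power  # mel filter bank outputs
    log_e = np.log(energies + 1e-10)
    k = np.arange(n_filters)                                  # inverse-transform stage (DCT)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (k + 0.5)) / n_filters)
    return dct @ log_e

fs = 16000
frame = np.random.randn(512) * (0.54 - 0.46 * np.cos(2 * np.pi * np.arange(512) / 512))
print(mfcc(frame, fs)[:5])
```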

Peripheral Processing for Human Perception (P.34 of 7.0)

Mel-scale Filter Bank

[Figure: the Mel-scale filter bank applied to the spectrum X(ω): triangular, overlapping filters along the frequency axis ω, uniformly spaced below 1 kHz and spaced on a logarithmic scale (log ω) above 1 kHz]

Delta Coefficients



Language Modeling: N-gram

W = (w1, w2, w3,…,wi,…wR) a word sequence

- Evaluation of P(W)

P(W) = P(w1) Π_{i=2}^{R} P(wi | w1, w2, …, wi-1)

- Assumption:

P(wi | w1, w2, …, wi-1) = P(wi | wi-N+1, wi-N+2, …, wi-1)

Occurrence of a word depends on the previous N−1 words only

N-gram language models

N = 2 : bigram P(wi | wi-1)

N = 3 : tri-gram P(wi | wi-2, wi-1)

N = 4 : four-gram P(wi | wi-3, wi-2, wi-1)

N = 1 : unigram P(wi)

probabilities estimated from a training text database

example : tri-gram model

P(W) = P(w1) P(w2 | w1) Π_{i=3}^{R} P(wi | wi-2, wi-1)
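A minimal sketch of scoring a word sequence with a tri-gram model following the last formula; the probability tables are invented toy numbers, and no smoothing or backoff is applied.

```python
import math

# made-up illustrative probabilities, not estimated values
unigram = {"this": 0.01}
bigram  = {("this", "is"): 0.2}
trigram = {("this", "is", "a"): 0.1, ("is", "a", "test"): 0.05}

def log_p_trigram(words):
    """log P(W) = log P(w1) + log P(w2|w1) + sum_{i>=3} log P(wi | wi-2, wi-1)"""
    logp = math.log(unigram[words[0]]) + math.log(bigram[(words[0], words[1])])
    for i in range(2, len(words)):
        logp += math.log(trigram[(words[i - 2], words[i - 1], words[i])])
    return logp

print(log_p_trigram(["this", "is", "a", "test"]))
```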

tri-gram

[Figure: the word sequence W1 W2 W3 W4 W5 W6 …… WR with a three-word window sliding along it, indicating which preceding words each tri-gram probability P(wi | wi-2, wi-1) conditions on]

P(W) = P(w1) Π_{i=2}^{R} P(wi | w1, w2, …, wi-1)

P(W) = P(w1) P(w2 | w1) Π_{i=3}^{R} P(wi | wi-2, wi-1)

N-gram


Language Modeling

- Evaluation of N-gram model parameters

unigram

P(wi) = N(wi) / Σ_{j=1}^{V} N(wj)

wi : a word in the vocabulary

V : total number of different words in the vocabulary

N( • ) : number of counts in the training text database

bigram

P(wj | wk) = N(<wk, wj>) / N(wk)

<wk, wj> : a word pair

trigram

P(wj | wk, wm) = N(<wk, wm, wj>) / N(<wk, wm>)

smoothing : estimation of probabilities of rare events by statistical approaches

… this ……… 50000

… this is …… 500

… this is a … 5

Prob[ is | this ] = 500 / 50000

Prob[ a | this is ] = 5 / 500
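A small sketch reproducing the counting example above; the counts are the ones given on the slide, and in practice they would come from a large training text database and then be smoothed.

```python
# Turning counts into bigram / trigram probability estimates.
counts = {
    ("this",):            50000,
    ("this", "is"):         500,
    ("this", "is", "a"):      5,
}

p_is_given_this   = counts[("this", "is")]      / counts[("this",)]        # 500 / 50000
p_a_given_this_is = counts[("this", "is", "a")] / counts[("this", "is")]   # 5 / 500
print(p_is_given_this, p_a_given_this_is)       # 0.01 0.01
```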


Large Vocabulary Continuous Speech Recognition

W = (w1, w2,…wR) a word sequence

O = (o1, o2,…oT) feature vectors for a speech utterance

W* = arg max_W Prob(W | O)   Maximum A Posteriori (MAP) Principle

Prob(W | O) = Prob(O | W) • P(W) / P(O) = max   (a posteriori probability)

P(O) does not depend on W, so this is equivalent to

Prob(O | W) • P(W) = max

Prob(O | W) : by HMM (acoustic models)   P(W) : by language model

˙A Search Process Based on Three Knowledge Sources

- Acoustic Models : HMMs for basic voice units (e.g. phonemes)

- Lexicon : a database of all possible words in the vocabulary, each word including its pronunciation in terms of component basic voice units

- Language Models : based on words in the lexicon

[Block diagram: the feature vectors O enter a Search block that computes Arg Max over W of Prob(O | W) • P(W) and outputs W*, drawing on the Acoustic Models, the Lexicon and the Language Models]
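A toy sketch of the MAP decision combining the two knowledge sources in the log domain, assuming two hypothetical candidate word sequences with made-up acoustic log-likelihoods log P(O|W) and language-model probabilities P(W); P(O) is omitted since it is common to all candidates.

```python
import math

# hypothetical candidates with assumed scores, for illustration only
candidates = {
    ("recognize", "speech"):         {"log_p_o_given_w": -120.3, "log_p_w": math.log(1e-4)},
    ("wreck", "a", "nice", "beach"): {"log_p_o_given_w": -118.9, "log_p_w": math.log(1e-7)},
}

def map_decode(cands):
    """W* = arg max_W  [ log P(O|W) + log P(W) ]"""
    return max(cands, key=lambda w: cands[w]["log_p_o_given_w"] + cands[w]["log_p_w"])

print(map_decode(candidates))   # the language model favors the first candidate
```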

Maximum A Posteriori Principle (MAP)

Problem

W : { w1 , w2 , w3 } (sunny, rainy, cloudy)

P(w1) + P(w2) + P(w3) = 1.0

O = (o1, o2, o3, ⋯) : weather parameters observed today

given O today, predict W for tomorrow

Maximum A Posteriori Principle (MAP)

Approach 1 : compare the prior probabilities P(w1), P(w2), P(w3) → the observation O is not used?

Approach 2 : compute and compare the a posteriori probabilities for the unknown observation O

P(wi | O) = P(O | wi) • P(wi) / P(O) , i = 1, 2, 3

since P(O) is the same for every wi, it is enough to compare P(O | wi) • P(wi) , i = 1, 2, 3

P(wi | O) : a posteriori probability

P(O | wi) : likelihood function

P(wi) : prior probability
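A numeric toy instance of Approach 2 for the weather problem, with assumed priors P(wi) and likelihoods P(O | wi); all numbers are made up for illustration.

```python
priors      = {"sunny": 0.5, "rainy": 0.2, "cloudy": 0.3}      # P(wi), sums to 1.0
likelihoods = {"sunny": 0.02, "rainy": 0.15, "cloudy": 0.08}   # P(O | wi) for today's O

scores = {w: likelihoods[w] * priors[w] for w in priors}       # P(O|wi) * P(wi)
p_o = sum(scores.values())                                     # P(O) = sum_i P(O|wi) P(wi)
posteriors = {w: scores[w] / p_o for w in scores}              # P(wi | O)

print(max(scores, key=scores.get))   # MAP prediction for tomorrow
print(posteriors)
```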

Syllable-based One-pass Search

• Finding the Optimal Sentence from an Unknown Utterance Using 3 Knowledge Sources: Acoustic Models, Lexicon and Language Model

• Based on a Lattice of Syllable Candidates

[Figure: the unknown utterance is first decoded into a syllable lattice over time t using the Acoustic Models; the Lexicon expands the lattice into a word graph (w1, w2, …), and the Language Models supply P(w1) P(w2 | w1) …… to score paths through the graph]