2.0 Fundamentals of Speech Recognition
References for 2.0
Sections 1.3, 3.3, 3.4, 4.2, 4.3, 6.4, 7.2, 7.3 of Becchetti
Hidden Markov Models (HMM)
˙Formulation
  Ot = [x1, x2, … xD]^T : feature vector for a frame at time t
  qt ∈ {1, 2, 3 … N} : state number for feature vector Ot
  A = [aij] , aij = Prob[qt = j | qt-1 = i]
    state transition probabilities
  B = [bj(o), j = 1, 2, … N] : observation (emission) probabilities
    bj(o) = Σ_{k=1}^{M} cjk bjk(o) : Gaussian Mixture Model (GMM)
    bjk(o) : multivariate Gaussian distribution
      for the k-th mixture (Gaussian) of the j-th state
    M : total number of mixtures
    Σ_{k=1}^{M} cjk = 1
  π = [π1, π2, … πN] : initial probabilities
    πi = Prob[q1 = i]
  HMM : λ = (A, B, π)
[Figure: a left-to-right HMM with states 1, 2, 3 carrying emission distributions b1(o), b2(o), b3(o) and transition probabilities a11, a22, a12, a23, a13, a24; an observation sequence o1 o2 o3 o4 … is generated along a state sequence q1 q2 q3 q4 …]
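The GMM observation probability bj(o) defined above can be computed directly from its definition. Below is a minimal Python sketch (an illustration added here, not from the course material); all parameter values are made-up examples, and scipy's multivariate normal density stands in for bjk(o).

```python
import numpy as np
from scipy.stats import multivariate_normal

def emission_prob(o, weights, means, covs):
    """b_j(o) = sum_k c_jk * N(o; mu_jk, Sigma_jk) for one state j.

    weights : (M,)   mixture weights c_jk, summing to 1
    means   : (M, D) mixture means
    covs    : (M, D, D) mixture covariance matrices
    """
    assert np.isclose(weights.sum(), 1.0)          # sum_k c_jk = 1
    return sum(c * multivariate_normal.pdf(o, mean=mu, cov=S)
               for c, mu, S in zip(weights, means, covs))

# Toy example: D = 2 features, M = 2 mixtures (made-up numbers).
weights = np.array([0.6, 0.4])
means = np.array([[0.0, 0.0], [2.0, 1.0]])
covs = np.array([np.eye(2), 0.5 * np.eye(2)])
print(emission_prob(np.array([0.5, 0.2]), weights, means, covs))
```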
Observation Sequences
[Figure: a continuous speech waveform x(t) is sampled to x[n] and divided into frames along the time axis t, each frame yielding one feature vector]
1-dim Gaussian Mixtures
State Transition Probabilities
[Figure: example 1-dim Gaussian mixture densities Px(x) for each state, and a 2-state transition diagram with transition probabilities 0.9/0.1 and 0.5/0.5; a sample state sequence 1 2 2 ⋯ 1 1 1 1 1 1 1 1 2 2 ⋯]
˙Gaussian Random Variable X
  fX(x) = (1 / (2πσ²)^{1/2}) e^{−(x−m)²/2σ²}
˙Multivariate Gaussian Distribution for n Random Variables
  X = [X1, X2, ……, Xn]^t
  fX(x) = (1 / ((2π)^{n/2} |Σ|^{1/2})) e^{−(1/2)(x−μ)^t Σ^{−1}(x−μ)}
  μ = [μX1, μX2, ……, μXn]^t
  Σ = [σij] , covariance matrix
  σij = E[(Xi − μXi)(Xj − μXj)]
  |Σ| : determinant of Σ
[Figure: a 1-dim Gaussian density fx(x) with mean m and variance σ², and the covariance matrix Σ laid out with entry σij at row i, column j]
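The multivariate density above translates directly into code. A minimal numpy sketch (added for illustration), using made-up values for μ and Σ:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """f_X(x) = (2*pi)^(-n/2) |Sigma|^(-1/2) exp(-(1/2)(x-mu)^t Sigma^-1 (x-mu))."""
    n = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(sigma, diff)   # (x-mu)^t Sigma^-1 (x-mu)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** n * np.linalg.det(sigma))

# Made-up example: n = 2, correlated components.
mu = np.array([1.0, -1.0])
sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])   # sigma_ij = E[(Xi - mu_Xi)(Xj - mu_Xj)]
print(gaussian_pdf(np.array([0.8, -0.5]), mu, sigma))
```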
Multivariate Gaussian Distribution
[Figure: repeats the covariance-matrix layout and density sketch above]
2-dim Gaussian
[Figure: contours of 2-dim Gaussian densities over (x1, x2) for three covariance cases: Σ = diag(σ11², σ22²) with σ11² = σ22² gives circular contours; Σ = diag(σ11², σ22²) with σ11² ≠ σ22² gives axis-aligned elliptic contours; a full Σ with off-diagonal terms σ12, σ21 gives tilted elliptic contours]
N-dim Gaussian Mixtures
Hidden Markov Models (HMM)
˙Double Layers of Stochastic Processes
  - hidden states with random transitions for time warping
  - random output given state for random acoustic characteristics
˙Three Basic Problems
  (1) Evaluation Problem:
      Given O = (o1, o2, … ot … oT) and λ = (A, B, π),
      find Prob[O | λ]
  (2) Decoding Problem:
      Given O = (o1, o2, … ot … oT) and λ = (A, B, π),
      find the best state sequence q = (q1, q2, … qt, … qT)
  (3) Learning Problem:
      Given O, find the best values for the parameters in λ
      such that Prob[O | λ] = max
Simplified HMM
[Figure: a discrete HMM with 3 states, each emitting colored balls (R, G, B) with state-dependent probabilities; example observation sequence RGBGGBBGRRR……]
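For the evaluation problem, Prob[O | λ] is usually computed with the forward algorithm rather than by summing over all state sequences. A minimal sketch on the simplified discrete HMM above (3 states emitting R/G/B balls); all probability values are made up for illustration:

```python
import numpy as np

# Made-up parameters for a 3-state discrete HMM emitting R, G, B.
A = np.array([[0.6, 0.3, 0.1],     # a_ij = Prob[q_t = j | q_t-1 = i]
              [0.1, 0.6, 0.3],
              [0.1, 0.2, 0.7]])
B = np.array([[0.7, 0.2, 0.1],     # b_j(o) for o in {R, G, B}
              [0.2, 0.6, 0.2],
              [0.1, 0.2, 0.7]])
pi = np.array([0.8, 0.1, 0.1])     # pi_i = Prob[q_1 = i]
sym = {"R": 0, "G": 1, "B": 2}

def forward_prob(obs):
    """Evaluation problem: Prob[O | lambda] via the forward algorithm."""
    alpha = pi * B[:, sym[obs[0]]]           # alpha_1(i) = pi_i * b_i(o_1)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, sym[o]]   # alpha_t(j) = (sum_i alpha_{t-1}(i) a_ij) b_j(o_t)
    return alpha.sum()                       # Prob[O|lambda] = sum_i alpha_T(i)

print(forward_prob("RGBGGBBGRRR"))
```

Each time step costs O(N²), so the whole computation is O(N²T) instead of exponential in T.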
Feature Extraction (Front-end Signal Processing)
˙Pre-emphasis
  H(z) = 1 − az^{−1} , 0 ≪ a < 1
  x′[n] = x[n] − a x[n−1]
  - pre-emphasis of the spectrum at higher frequencies
˙Endpoint Detection (Speech/Silence Discrimination)
  - short-time energy
    En = Σ_{m=−∞}^{∞} (x[m])² w[m−n]
  - adaptive thresholds
˙Windowing
  Qn = Σ_{m=−∞}^{∞} T{x[m]} w[m−n]
    T{ • } : some operator
    w[m] : window shape
  - Rectangular window
    w[m] = 1 , 0 ≤ m ≤ L−1
         = 0 , else
  - Hamming window
    w[m] = 0.54 − 0.46 cos(2πm/L) , 0 ≤ m ≤ L−1
         = 0 , else
  - window length/shift/shape
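Each of the three front-end steps above is a few lines of code. A minimal sketch, assuming 16 kHz sampling with 25 ms frames and 10 ms shifts (typical values, not specified in the slides), and a deliberately crude fixed threshold where a real system would adapt it:

```python
import numpy as np

def pre_emphasis(x, a=0.95):
    """x'[n] = x[n] - a*x[n-1]; a = 0.95 is a typical (assumed) value."""
    return np.append(x[0], x[1:] - a * x[:-1])

def short_time_energy(x, frame_len=400, shift=160):
    """E_n = sum_m (x[m])^2 w[m-n] with a sliding rectangular window."""
    return np.array([np.sum(x[n:n + frame_len] ** 2)
                     for n in range(0, len(x) - frame_len + 1, shift)])

def hamming(L):
    """w[m] = 0.54 - 0.46 cos(2*pi*m/L), 0 <= m <= L-1 (as on the slide)."""
    return 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(L) / L)

# Toy usage on a made-up signal.
x = np.random.randn(16000)
xp = pre_emphasis(x)
e = short_time_energy(xp)
speech = e > 2.0 * e.min()          # crude fixed threshold; real systems adapt it
frame = xp[:400] * hamming(400)     # one windowed frame, ready for the DFT
print(e[:5], speech[:5], frame.shape)
```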
Time and Frequency Domains
[Figure: the Fourier Transform is a 1-1 mapping between the time domain x[n] and the frequency domain X[k], computed efficiently by the Fast Fourier Transform (FFT)]
  Re{e^{jω1t}} = cos(ω1t)
  Re{(A1 e^{jφ1})(e^{jω1t})} = A1 cos(ω1t + φ1)
  a signal can be expanded over basis components, just as a vector can:
  X = a1 i + a2 j + a3 k  ⟶  X = Σ_i ai xi  ⟶  x(t) = Σ_i ai xi(t)
Pre-emphasis
• Pre-emphasis
  H(z) = 1 − az^{−1} , 0 ≪ a < 1
  x′[n] = x[n] − a x[n−1]
  pre-emphasis of the spectrum at higher frequencies
[Figure: spectrum X(ω) before and after pre-emphasis, with the high-frequency region boosted]
Endpoint Detection
[Figure: waveform and short-time energy En plotted against time t (frame index n), with a threshold marking the detected endpoints of speech]
Endpoint Detection
  En = Σ_{m=−∞}^{∞} (x[m])² w[m−n]
  x[m] : signal samples
  Hamming window
    w[m] = 0.54 − 0.46 cos(2πm/L) , 0 ≤ m ≤ L−1
         = 0 , else
  Qn = Σ_{m=−∞}^{∞} T{x[m]} w[m−n]
    T{ • } : some operator
    w[m] : window shape
[Figure: Hamming vs. rectangular window shapes w[m] over 0 ≤ m ≤ L−1, and the window w[m−n] sliding along the samples from n to n + L−1]
Feature Extraction (Front-end Signal Processing)
˙Mel Frequency Cepstral Coefficients (MFCC)
- Mel-scale Filter Bank
triangular shape in frequency/overlapped
uniformly spaced below 1 kHz
logarithmic scale above 1 kHz
˙Delta Coefficients
- 1st/2nd order differences
  windowed speech samples → Discrete Fourier Transform → Mel-scale Filter Bank → log( | • |² ) → Inverse Discrete Fourier Transform → MFCC
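A minimal sketch of this pipeline, following the block diagram: DFT → Mel-scale filter bank → log(| • |²) → inverse transform (implemented, as is common in practice, with a DCT). The filter-bank construction below uses the standard Mel formula (roughly linear below 1 kHz, logarithmic above); the filter count, FFT size, and sampling rate are assumed values, not from the slides:

```python
import numpy as np
from scipy.fftpack import dct

def mel(f):     return 2595.0 * np.log10(1.0 + f / 700.0)
def inv_mel(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(n_filters=26, n_fft=512, fs=16000):
    """Triangular, overlapping filters uniformly spaced on the Mel scale."""
    edges = inv_mel(np.linspace(mel(0), mel(fs / 2), n_filters + 2))
    bins = np.floor((n_fft + 1) * edges / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for k in range(n_filters):
        l, c, r = bins[k], bins[k + 1], bins[k + 2]
        fbank[k, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fbank[k, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fbank

def mfcc(frame, fbank, n_ceps=13):
    """windowed frame -> DFT -> Mel filter bank -> log(| |^2) -> DCT -> MFCC."""
    spec = np.abs(np.fft.rfft(frame, n=512)) ** 2         # |DFT|^2
    log_e = np.log(fbank @ spec + 1e-10)                  # log filter-bank energies
    return dct(log_e, type=2, norm="ortho")[:n_ceps]      # the "inverse DFT" step

frame = np.random.randn(400) * (0.54 - 0.46 * np.cos(2 * np.pi * np.arange(400) / 400))
print(mfcc(frame, mel_filter_bank()))
```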
Peripheral Processing for Human Perception (P.34 of 7.0)
Mel-scale Filter Bank
[Figure: triangular, overlapping filters along the frequency axis ω applied to the spectrum X(ω); the filters are uniformly spaced below 1 kHz and logarithmically spaced (uniform in log ω) above 1 kHz]
Delta Coefficients
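The slides define delta coefficients as 1st/2nd order differences. One common realization is the simple difference Δc[t] = c[t+1] − c[t−1], with 2nd-order deltas computed as differences of the deltas; a minimal sketch (regression-based deltas are another standard choice):

```python
import numpy as np

def delta(c):
    """1st-order difference: delta_c[t] = c[t+1] - c[t-1], edges padded by repetition."""
    padded = np.pad(c, ((1, 1), (0, 0)), mode="edge")
    return padded[2:] - padded[:-2]

# c: (T frames, n_ceps) MFCC matrix; 2nd-order deltas are deltas of the deltas.
c = np.random.randn(100, 13)
d1 = delta(c)
d2 = delta(d1)
features = np.hstack([c, d1, d2])
print(features.shape)   # (100, 39)
```

Stacking c, Δc, ΔΔc this way gives the familiar 39-dimensional feature vector when 13 MFCCs are used.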
Language Modeling: N-gram
  W = (w1, w2, w3, …, wi, … wR) : a word sequence
  - Evaluation of P(W)
    P(W) = P(w1) Π_{i=2}^{R} P(wi | w1, w2, … wi-1)
  - Assumption:
    P(wi | w1, w2, … wi-1) = P(wi | wi-N+1, wi-N+2, … wi-1)
    the occurrence of a word depends on the previous N−1 words only
    N-gram language models
      N = 2 : bigram P(wi | wi-1)
      N = 3 : tri-gram P(wi | wi-2, wi-1)
      N = 4 : four-gram P(wi | wi-3, wi-2, wi-1)
      N = 1 : unigram P(wi)
    probabilities estimated from a training text database
  - example : tri-gram model
    P(W) = P(w1) P(w2|w1) Π_{i=3}^{R} P(wi | wi-2, wi-1)
tri-gram
[Figure: a window of three words sliding along the word sequence W1 W2 W3 W4 W5 W6 …… WR, predicting each word from the two preceding words; for a general N-gram the window covers N words]
  P(W) = P(w1) Π_{i=2}^{R} P(wi | w1, w2, ⋯, wi-1)
  P(W) = P(w1) P(w2|w1) Π_{i=3}^{R} P(wi | wi-2, wi-1)   (tri-gram)
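Evaluating P(W) under a tri-gram model is then a product of table lookups, normally done in the log domain to avoid underflow. A minimal sketch with made-up probability tables (real tables come from the training text database, as described next):

```python
import math

# Made-up tri-gram model probabilities, for illustration only.
unigram = {"this": 0.05}
bigram = {("this", "is"): 0.01}
trigram = {("this", "is", "a"): 0.2, ("is", "a", "test"): 0.1}

def log_prob_trigram(words):
    """log P(W) = log P(w1) + log P(w2|w1) + sum_{i=3}^{R} log P(wi|wi-2, wi-1)."""
    lp = math.log(unigram[words[0]]) + math.log(bigram[(words[0], words[1])])
    for i in range(2, len(words)):
        lp += math.log(trigram[(words[i - 2], words[i - 1], words[i])])
    return lp

print(log_prob_trigram(["this", "is", "a", "test"]))
```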
Language Modeling
  - Evaluation of N-gram model parameters
    unigram
      P(wi) = N(wi) / Σ_{j=1}^{V} N(wj)
      wi : a word in the vocabulary
      V : total number of different words in the vocabulary
      N( • ) : number of counts in the training text database
    bigram
      P(wj|wk) = N(<wk, wj>) / N(wk)
      <wk, wj> : a word pair
    trigram
      P(wj|wk, wm) = N(<wk, wm, wj>) / N(<wk, wm>)
    smoothing − estimation of probabilities of rare events by statistical approaches
example counts in a training text database:
  … this ………  50000
  … this is ……  500
  … this is a …  5
  Prob[ is | this ] = 500 / 50000
  Prob[ a | this is ] = 5 / 500
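These conditional probabilities are exactly ratios of n-gram counts, so training amounts to counting. A minimal sketch on a toy three-sentence corpus (made up here; real systems count over large databases and then smooth):

```python
from collections import Counter

def train_ngrams(corpus):
    """Count unigrams, bigrams and trigrams in a training text."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for sentence in corpus:
        w = sentence.split()
        uni.update(w)
        bi.update(zip(w, w[1:]))
        tri.update(zip(w, w[1:], w[2:]))
    return uni, bi, tri

corpus = ["this is a test", "this is fine", "this was a test"]
uni, bi, tri = train_ngrams(corpus)

# P(wj|wk) = N(<wk,wj>) / N(wk)
print(bi[("this", "is")] / uni["this"])               # Prob[is | this] = 2/3
# P(wj|wk,wm) = N(<wk,wm,wj>) / N(<wk,wm>)
print(tri[("this", "is", "a")] / bi[("this", "is")])  # Prob[a | this is] = 1/2
```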
Large Vocabulary Continuous Speech Recognition
  W = (w1, w2, … wR) : a word sequence
  O = (o1, o2, … oT) : feature vectors for a speech utterance
˙Maximum A Posteriori (MAP) Principle
  W* = arg max_W Prob(W|O)
  Prob(W|O) = Prob(O|W) • P(W) / P(O) = max   (a posteriori probability)
  ⇒ Prob(O|W) • P(W) = max , since P(O) does not depend on W
    Prob(O|W) : by HMM    P(W) : by language model
˙A Search Process Based on Three Knowledge Sources
  - Acoustic Models : HMMs for basic voice units (e.g. phonemes)
  - Lexicon : a database of all possible words in the vocabulary, each word including its
    pronunciation in terms of component basic voice units
  - Language Models : based on words in the lexicon
[Figure: the input O enters a Search block computing W* = Arg Max_W Prob(O|W) • P(W), drawing on the Acoustic Models, Lexicon, and Language Models]
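The search therefore compares candidate word sequences by Prob(O|W) • P(W), usually as a sum of log scores. A schematic sketch with made-up log scores standing in for the HMM acoustic score and the language-model score (the values are placeholders, not a real decoder):

```python
# Made-up log scores for two candidate sentences (illustration only).
acoustic = {"this is a test": -120.0, "this is attest": -118.0}  # log Prob(O|W) from HMMs
lm = {"this is a test": -9.2, "this is attest": -14.5}           # log P(W) from the LM

def map_decode(candidates):
    """W* = arg max_W [log Prob(O|W) + log P(W)]; P(O) is common to all W."""
    return max(candidates, key=lambda W: acoustic[W] + lm[W])

print(map_decode(list(acoustic)))   # "this is a test": -129.2 > -132.5
```

Here the language model overrides a slightly better acoustic score, which is exactly how P(W) helps disambiguate acoustically similar hypotheses.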
Maximum A Posteriori Principle (MAP)
Problem
  W : { w1 , w2 , w3 } = { sunny, rainy, cloudy }
  P(w1) + P(w2) + P(w3) = 1.0
  O = (o1, o2, o3, ⋯) : weather parameters observed today
  given O today, predict W for tomorrow
Maximum A Posteriori Principle (MAP)
  Approach 1 : compare the prior probabilities P(w1), P(w2), P(w3)
    − but then the observation O is not used
  Approach 2 : compute and compare the a posteriori probabilities
    P(wi | O) = P(O|wi) P(wi) / P(O) , i = 1, 2, 3
    since P(O) is common to all wi, it suffices to compare
    P(O|wi) • P(wi) , i = 1, 2, 3
  P(wi | O) : a posteriori (posterior) probability, for the unknown observation O
  P(O|wi) : likelihood function
  P(wi) : prior probability
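A worked numeric instance of the two approaches, with made-up priors and likelihoods (illustration only):

```python
# Made-up priors and likelihoods for {sunny, rainy, cloudy}.
prior = {"sunny": 0.5, "rainy": 0.2, "cloudy": 0.3}        # P(wi), sums to 1.0
likelihood = {"sunny": 0.1, "rainy": 0.6, "cloudy": 0.3}   # P(O|wi) for today's O

# Approach 1: compare priors only -> "sunny" (O is not used).
print(max(prior, key=prior.get))

# Approach 2: compare P(O|wi) * P(wi); P(O) cancels -> "rainy" (0.12 > 0.09 > 0.05).
posterior_score = {w: likelihood[w] * prior[w] for w in prior}
print(max(posterior_score, key=posterior_score.get))
```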
Syllable-based One-pass Search
• Finding the Optimal Sentence from an Unknown Utterance Using 3
  Knowledge Sources: Acoustic Models, Lexicon and Language Model
• Based on a Lattice of Syllable Candidates
[Figure: the acoustic models produce a lattice of syllable candidates along the time axis t; the lexicon expands syllable candidates into word hypotheses (w1, w2, ……) forming a word graph; the language models score the paths with P(w1) P(w2|w1) ……]