2018/9/17 2
Outline
1. Introduction
2. Pattern Recognition Approach
3. Two Key Technologies for ASR
4. ASR Systems and Application
5. ASR System Design Issues
6. Summary
Introduction
Spoken language processing relies heavily on pattern recognition (PR).
If we could give machines the ability to recognize patterns as humans do, machines would become much easier to use; however, the process of human pattern recognition is not well understood.
In practice, the recognition decision is based on appropriate probabilistic models of the patterns.
Introduction
Problems of recognition
Signal variability
Environmental effects
Distinctive features are difficult to identify in continuous speech
A more successful approach
Treat the speech signal as a stochastic pattern and adopt a statistical pattern recognition approach
Introduction
A source-channel speech generation model
Speech recognition is formulated as a maximum a posteriori decoding problem
[Diagram: word sequence → noisy channel → speech signal → channel decoding → word sequence]
Introduction
Bayes rule
Assume the speech signal S is parametrically represented as a sequence of acoustic vectors A. The recognizer chooses

  Ŵ = argmax_W P(W|A) = argmax_W P(A|W) P(W)

where W ranges over the set of all possible sequences of words, P(A|W) is the conditional probability of the acoustic vector sequence given the words, and P(W) is the a priori probability of generating the sequence of words W.
Pattern Recognition Approach
Block diagram of a typical integrated continuous speech recognizer
[Diagram: input speech → feature analysis → word match (acoustic word model: unit model + lexicon) → sentence match (language model: syntax + semantics) → recognized sentence]
Speech Analysis and Feature Extraction
Parameterize speech into feature vectors
Desirable properties of spectral features
• Good discrimination
• The capability of creating statistical models without the need for an excessive amount of training data
• Statistical properties that are invariant across speakers and over a wide range of speaking environments
Feature Extraction
Speech is a continuous evolution of the vocal tract
Need to extract a time series of spectra
Use a sliding window - a 16 ms window with an 8 ms shift
Speech Analysis and Feature Extraction
Implementations of feature extraction
Short-Time Spectral Features
• Discrete Fourier transform (DFT)
• Linear predictive coding (LPC)
• Cepstral features with their first and second time derivatives
Frequency-Warped Spectral Features
• Non-uniform frequency scales are used in spectral analysis to provide the so-called mel-frequency spectral feature sets
Speech Analysis and Feature Extraction
Segment Analysis and Segment Features
Long-term analysis
Temporal decomposition
Auditory Features
Articulatory Features
Discriminative Features
Linear discriminant analysis
Selection of Fundamental Speech Units
Phoneme-like units (PLUs)
The simplest set of fundamental speech units, e.g. "segment" → /s/ /ɛ/ /g/ /m/ /ə/ /n/ /t/
Units other than phones
Syllables, /seg/ /men/ /t/
Demisyllables, /sɛ/ /ɛg/ /mə/ /ən/ /t/
Units with linguistic context dependency
Triphones
Acoustic Modeling of Speech Units
The size of the training set
There is a tradeoff between using fewer subword units (good coverage of individual units, but poor resolution of linguistic context) and more subword units (poor coverage of the infrequently occurring units, but good resolution of linguistic context).
Start with some initial set of subword unit models and adapt the models over time to the task
Hidden Markov models (HMMs)
Lexical Modeling and Word-Level Match
Lexicon
Provides a description of the words in terms of the basic set of subword units
Data-independent
• Extracted from a standard dictionary; each word in the vocabulary is represented by a single lexical entry
Data-dependent
• Multiple pronunciations
Language Modeling and Sentence Match
The sentence-level matching
Grammar
• Consists of a set of syntactic and semantic rules
• Can be represented as a finite-state network
Advances in language modeling
Improve the efficiency and effectiveness of large-vocabulary speech recognition
Adaptive language modeling
A large body of domain-specific training data cannot be applied directly to a different task
Search and Decision Strategies
Finite-state representation of all knowledge sources
Grammar for word sequences
The network representation of lexical variability for words and phrases
Syllabic and phonemic knowledge used to form fundamental linguistic units
Dynamic programming
Two Key Technologies for ASR
The use of HMM modeling techniques to characterize and model the spectral and temporal variation of basic subword units
The use of dynamic programming search techniques to perform network search
Hidden Markov Modeling of Speech
Estimation of HMMs
Maximum likelihood
• EM algorithm
• Often requires a large training set
Maximum mutual information
• Maximizes the mutual information between the given acoustic data and the corresponding transcription
Maximum a posteriori
• Adapts models to the task and to the speaker
• Bayesian learning
Minimum classification error
• Minimizes the error rate on task-specific training data
• Generalized probabilistic descent (GPD)
LVCSR
[Diagram: speech input → front-end processing (feature extraction → feature vectors) → linguistic decoding (word-level match using subword models and lexicon to compose word models, then sentence-level search with language models) → recognized sentence; acoustic modeling from speech corpora yields HMMs, language modeling from text corpora yields N-grams]
Front-end processing is a spectral analysis that derives feature vectors to capture salient spectral characteristics of the speech input.
Linguistic decoding combines word-level matching and sentence-level search to perform an inverse operation that decodes the message from the speech waveform.
ASR Capabilities
Use powerful data-driven modeling tools: HMM, ANN, GM
• Rely little on detailed speech & language knowledge sources
• Give high performance in matched conditions
• Develop and deploy many applications and services
– but robustness and rigid constraints are two major limiting
factors
Shannon's Channel Modeling Paradigm
The channel input is hidden (unobserved) while the output is observed and used to infer the input; the channel is often approximated by a structural Markov model
Channel modeling with (I, O) pairs in large training sets
Modeling Input-Output Associations
Hidden Markov Model (HMM)
Artificial Neural Network (ANN)
Classification and Regression Tree (CART)
Support Vector Machine (SVM) and LVQ
Kernel-based models, mixture of experts, Bayesian networks, graphical models, Markov random fields
Many New Applications
– Rule induction, statistical parsing, machine translation
– Information retrieval, text categorization, call routing, transliteration, pronunciation modeling, etc.
HMM and Speech Recognition
The Markov chain
Let X = X1, X2, ..., Xn be a sequence of random variables from a finite discrete alphabet O = {o1, o2, ..., oM}. Based on Bayes' rule,

  P(X1, X2, ..., Xn) = P(X1) ∏_{i=2}^{n} P(Xi | X1, ..., X_{i-1})   (8.1)

X is said to form a first-order Markov chain, provided that

  P(Xi | X1, ..., X_{i-1}) = P(Xi | X_{i-1})   (8.2)
The Markov chain
Eq. (8.1) becomes

  P(X1, X2, ..., Xn) = P(X1) ∏_{i=2}^{n} P(Xi | X_{i-1})   (8.3)

Eq. (8.2) is known as the Markov assumption. The Markov chain models time-invariant (stationary) events if we discard the time index i:

  P(Xi = s | X_{i-1} = s') = P(s | s')   (8.4)
The Markov chain
Consider a Markov chain with N distinct states {1, ..., N}, with the state at time t denoted as s_t. The Markov chain can be described as follows:

  a_ij = P(s_t = j | s_{t-1} = i),  1 ≤ i, j ≤ N   (8.5)
  π_i = P(s_1 = i),  1 ≤ i ≤ N   (8.6)
  ∑_{j=1}^{N} a_ij = 1 for 1 ≤ i ≤ N;  ∑_{j=1}^{N} π_j = 1   (8.7)
The Markov chain
A state-transition probability matrix of this Dow Jones Markov chain:

              | 0.6  0.2  0.2 |
  A = {a_ij} = | 0.5  0.3  0.2 |
              | 0.4  0.1  0.5 |

An initial state probability vector:

  π = {π_i} = (0.5, 0.2, 0.3)ᵀ
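With A and π above, the probability of any state sequence follows from Eq. (8.3). A minimal sketch, assuming the usual labeling of the Dow Jones states as 0 = up, 1 = down, 2 = unchanged:

```python
# Probability of a state sequence under the Dow Jones Markov chain,
# using P(X1..Xn) = pi(X1) * product of a[X_{i-1}][X_i]  (Eq. 8.3).
A = [[0.6, 0.2, 0.2],
     [0.5, 0.3, 0.2],
     [0.4, 0.1, 0.5]]
pi = [0.5, 0.2, 0.3]

def seq_prob(states):
    p = pi[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= A[prev][cur]          # one transition per step
    return p

# P(up, up, unchanged) = 0.5 * 0.6 * 0.2 = 0.06
```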
What is an HMM (Why Hidden?)
Coin toss example
O = H  T  H  H ... T
    o1 o2 o3 o4 ... oT
H stands for heads, T stands for tails
1-FAIR-COIN MODEL
2 states (each state represents H or T)
2 observation symbols (each symbol represents H or T)
2-FAIR-COINS MODEL
2 states (each state represents one coin)
2 observation symbols (each symbol represents H or T)
Extension to the Hidden Markov Model
2-BIASED COINS MODEL
2 states
2 observation symbols
3-BIASED COINS MODEL
3 states
2 observation symbols
HMM for Coin Tossing
How to model the outputs of the coin tossing experiment via an HMM:
1. The size (the number of states) of the model
2. The size of the observation sequence
3. Model parameters (state transition probabilities, probabilities of heads and tails in each state)
T = length of the observation sequence (total number of clock times)
N = number of states (urns) in the model
M = number of observation symbols (colors)
Q = {q1, q2, ..., qN}, states (urns)
V = {v1, v2, ..., vM}, discrete set of possible symbol observations (colors)
A = {aij}, aij = Pr(qj at t+1 | qi at t), state-transition probability distribution
B = {bj(k)}, bj(k) = Pr(vk at t | qj at t), observation symbol probability distribution in state j
π = {πi}, πi = Pr(qi at t=1), initial state distribution
We use λ = (A, B, π) to represent an HMM.

  clock time:          1   2   3   4  ...  T
  urn (hidden) state:  q3  q1  q1  q2 ...  qN-2
  color (observation): R   B   Y   Y  ...  R
HMM for Speech Recognition
N = 4, Q = {1,2,3,4} → number of states
M = 8, V = {a,b,c,d,e,f,g,h} → number of observation symbols
T = 24 → length
I = I1 I2 ... IT → state sequence
O = O1 O2 ... OT → observation sequence
Example utterance: ㄊㄞˊ ㄅㄟˇ ("Taipei"), states 1 2 3 4
  I: 111222222223344444444444
  O: aabcddedddcggffhffefhfff
      | 0.4  0.5  0.1  0   |
  A = | 0    0.6  0.3  0.1 |
      | 0    0    0.6  0.4 |
      | 0    0    0    1.0 |

        a    b    c    d    e    f    g    h
  B = | 0.7  0.2  0.1  0    0    0    0    0   |
      | 0    0    0.2  0.6  0.2  0    0    0   |
      | 0    0    0    0    0.1  0.6  0.1  0.2 |

  π = (1, 0, 0, 0)
Elements of an HMM
1. There are a finite number of states in the model, say N.
2. A new state is entered based upon a transition probability distribution which depends on the previous state (the Markov property).
3. After each transition is made, an observation output symbol is produced according to a probability distribution which depends on the current state.
Basic principle
Each HMM is modeled by a set (π, A, B), where π is a vector and A and B are matrices.
π = {πi} is the initial state probability vector, with i the state index; i = 1, 2, ..., p for an HMM with p states.
A = {Aij} is the state-transition probability matrix: Aij is the probability of moving from state i to state j.
B = {Bjk} is the state observation probability matrix: Bjk is the probability of observing pattern k in state j.
Using the model, an observation sequence O = O1 O2 ... OT is generated as follows:
1. Choose an initial state i1 according to the initial state distribution π;
2. Set t = 1;
3. Choose Ot according to b_it(k), the symbol probability distribution in state it;
4. Choose it+1 according to {a_it,it+1}, it+1 = 1, 2, ..., N, the state-transition probability distribution for state it;
5. Set t = t + 1; return to step 3 if t < T; otherwise terminate the procedure.
Three Problems for HMMs
Problem 1 - Given the observation sequence O = O1, O2, ..., OT and the model λ = (A, B, π), how do we compute Pr(O|λ), the probability of the observation sequence?
Problem 2 - Given the observation sequence O = O1, O2, ..., OT, how do we choose a state sequence I = i1, i2, ..., iT which is optimal in some meaningful sense?
Problem 3 - How do we adjust the model parameters λ = (A, B, π) to maximize Pr(O|λ)?
Solutions to the Three HMM Problems
Problem 1:
Given the model λ and the observation sequence O.
For every fixed state sequence I = i1 i2 ... iT, the probability of the observation sequence O is
  Pr(O|I,λ) = b_i1(O1) b_i2(O2) ... b_iT(OT)
The probability of such a state sequence I is
  Pr(I|λ) = π_i1 a_i1i2 a_i2i3 ... a_iT-1,iT
The joint probability of O and I is
  Pr(O,I|λ) = Pr(O|I,λ) Pr(I|λ)
The probability of O is obtained by summing the joint probability over all possible state sequences I = i1, i2, ..., iT:
  Pr(O|λ) = Σ_{all I} Pr(O,I|λ)
          = Σ_{i1,...,iT} π_i1 b_i1(O1) a_i1i2 b_i2(O2) ... a_iT-1,iT b_iT(OT)
Drawback:
This direct method needs (2T-1)·N^T multiplications and (N^T - 1) additions; the calculation is on the order of 2T·N^T.
If N = 5 and T = 100, it needs about 2·100·5^100 ≈ 10^72 computations!
Clearly a more efficient procedure is required.
The Forward-Backward Procedure
Def: the forward variable
  αt(i) = Pr(O1, O2, ..., Ot, it = qi | λ)
1. Initialization: α1(i) = πi bi(O1), 1 ≤ i ≤ N;
2. Recursion: for t = 1, 2, ..., T-1, 1 ≤ j ≤ N:
   αt+1(j) = [Σ_{i=1}^{N} αt(i) aij] bj(Ot+1)
3. Termination: Pr(O|λ) = Σ_{i=1}^{N} αT(i).
This method needs N(N+1)(T-1)+N multiplications and N(N-1)(T-1)+(N-1) additions; it is on the order of N²T calculations.
If N = 5 and T = 100, it needs about 3000 computations.
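The forward recursion above can be sketched directly; the 2-state model parameters below are made-up illustrations, not from the slides.

```python
# Forward procedure on a toy 2-state discrete HMM.
A  = [[0.7, 0.3], [0.4, 0.6]]   # a[i][j], state transitions
B  = [[0.9, 0.1], [0.2, 0.8]]   # b[j][k], symbols 0 and 1
pi = [0.5, 0.5]
N  = 2

def forward(obs):
    # Initialization: alpha_1(i) = pi_i * b_i(O_1)
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]
    # Recursion: alpha_{t+1}(j) = [sum_i alpha_t(i) a_ij] * b_j(O_{t+1})
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][o]
                 for j in range(N)]
    # Termination: P(O|lambda) = sum_i alpha_T(i)
    return sum(alpha)
```

The loop touches each of the N² transitions once per frame, matching the order-N²T count above.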
Def: the backward variable
  βt(i) = Pr(Ot+1, Ot+2, ..., OT | it = qi, λ)
1. Initialization: βT(i) = 1, 1 ≤ i ≤ N;
2. Recursion: for t = T-1, T-2, ..., 1, 1 ≤ i ≤ N:
   βt(i) = Σ_{j=1}^{N} aij bj(Ot+1) βt+1(j)
For example, βT-1(i) = Pr(OT | iT-1 = qi, λ).
It also requires on the order of N²T calculations.
Problem 2: the Viterbi Algorithm
Step 1 - Initialization
Step 2 - Recursion
Step 3 - Termination
Step 4 - Path (state sequence) backtracking
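The four Viterbi steps can be sketched on a toy 2-state discrete HMM; the parameters are illustrative assumptions, not from the slides.

```python
# Viterbi algorithm: initialization, recursion, termination, backtracking.
A  = [[0.7, 0.3], [0.4, 0.6]]   # a[i][j]
B  = [[0.9, 0.1], [0.2, 0.8]]   # b[j][k]
pi = [0.5, 0.5]
N  = 2

def viterbi(obs):
    # Step 1: delta_1(i) = pi_i * b_i(O_1)
    delta = [pi[i] * B[i][obs[0]] for i in range(N)]
    back = []                              # psi: best predecessors
    # Step 2: delta_t(j) = max_i [delta_{t-1}(i) a_ij] * b_j(O_t)
    for o in obs[1:]:
        psi = [max(range(N), key=lambda i: delta[i] * A[i][j])
               for j in range(N)]
        delta = [delta[psi[j]] * A[psi[j]][j] * B[j][o] for j in range(N)]
        back.append(psi)
    # Step 3: termination - best final state
    state = max(range(N), key=lambda i: delta[i])
    # Step 4: backtracking through psi
    path = [state]
    for psi in reversed(back):
        state = psi[state]
        path.append(state)
    return path[::-1]
```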
Problem 3:
Def: γt(i) = Pr(it = qi | O, λ)
Def: ξt(i,j) = Pr(it = qi, it+1 = qj | O, λ)
  γt(i) = αt(i) βt(i) / Pr(O|λ)
  ξt(i,j) = αt(i) aij bj(Ot+1) βt+1(j) / Pr(O|λ)
Summing ξt(i,j) over j gives
  γt(i) = Σ_{j=1}^{N} ξt(i,j)
Sum γt(i) and ξt(i,j) over the time index t:
  Σt γt(j) = expected number of times that state qj is visited
  Σ_{t=1}^{T-1} γt(i) = expected number of transitions made from qi
  Σ_{t=1}^{T-1} ξt(i,j) = expected number of transitions from state qi to state qj
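The γ quantity above can be computed from the forward and backward variables; a minimal sketch on a toy 2-state model (illustrative parameters, not from the slides):

```python
# gamma_t(i) = alpha_t(i) * beta_t(i) / P(O|lambda) on a toy 2-state HMM.
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.5, 0.5]
N  = 2

def forward_vars(obs):
    alphas = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for o in obs[1:]:
        prev = alphas[-1]
        alphas.append([sum(prev[i] * A[i][j] for i in range(N)) * B[j][o]
                       for j in range(N)])
    return alphas

def backward_vars(obs):
    betas = [[1.0] * N]                    # beta_T(i) = 1
    for o in reversed(obs[1:]):
        nxt = betas[0]
        betas.insert(0, [sum(A[i][j] * B[j][o] * nxt[j] for j in range(N))
                         for i in range(N)])
    return betas

def gamma(obs):
    alphas, betas = forward_vars(obs), backward_vars(obs)
    p = sum(alphas[-1])                    # P(O | lambda)
    return [[a * b / p for a, b in zip(al, be)]
            for al, be in zip(alphas, betas)]
```

Each row γt(·) is a distribution over states, so it sums to 1 for every t.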
The Baum-Welch Reestimation Formulas
If λ̄ is the reestimated model, then either:
1. the initial model λ defines a critical point of the likelihood function, in which case λ̄ = λ, or
2. model λ̄ is more likely in the sense that Pr(O|λ̄) > Pr(O|λ), i.e. we have found another model λ̄ from which the observation sequence is more likely to be produced.
An example of HMM calculation
A simple 3-colored-balls HMM (please go through this sample)
Steps of ASR using HMM
• Pre-emphasis
• Frame blocking
• Windowing
• Auto-correlation analysis
• LPC analysis
• CEP analysis
• Codebook preparation
• HMM training
• HMM recognition
• Recognition of continuous speech
• Post-processing
Pre-emphasis
The digitized signal s(n) is put through a digital filter to spectrally flatten the signal and make it less susceptible to finite-precision effects later in the signal processing:
  s̃(n) = s(n) - a·s(n-1)
A common value for a is 0.95.
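The pre-emphasis filter is a one-liner; the sample values below are arbitrary illustrations.

```python
# Pre-emphasis: s~(n) = s(n) - a*s(n-1), a = 0.95.
# The first sample has no predecessor, so it is passed through unchanged.
def pre_emphasize(s, a=0.95):
    return [s[0]] + [s[n] - a * s[n - 1] for n in range(1, len(s))]

# A constant (DC-like) input is flattened toward zero after the first sample.
```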
Frame Blocking
The pre-emphasized signal s̃(n) is blocked into frames of N samples, with adjacent frames separated by M samples. In the example below, the signals are blocked into frames for the case in which M = (1/3)N.
Typical values for N and M are 300 and 100 when the speech sampling rate is 6.67 kHz. These correspond to 45 ms frames, separated by 15 ms (a 66.7 Hz frame rate).
Blocking of speech into overlapping frames.
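Frame blocking with N-sample frames and an M-sample shift can be sketched as follows; the tiny N and M values are for illustration only (the slide's typical values are N = 300, M = 100).

```python
# Block a signal into overlapping frames of N samples, shifted by M.
def block_frames(signal, N, M):
    return [signal[i:i + N] for i in range(0, len(signal) - N + 1, M)]

# With M = (1/3)N, each sample appears in three consecutive frames.
frames = block_frames(list(range(10)), N=6, M=2)
```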
Windowing
Window each frame so as to minimize the discontinuities at the beginning and end of the frame.
A typical window is the Hamming window:
  w(n) = 0.54 - 0.46·cos(2πn/(N-1)),  0 ≤ n ≤ N-1
  x̃(n) = x(n)·w(n)
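The Hamming window formula above translates directly to code:

```python
import math

# Hamming window: w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1)), 0 <= n <= N-1.
def hamming(N):
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1))
            for n in range(N)]

# Apply the window pointwise to one frame: x~(n) = x(n)*w(n).
def window(frame):
    return [x * w for x, w in zip(frame, hamming(len(frame)))]
```

The window tapers the frame edges (w = 0.08 at the endpoints) while leaving the center nearly untouched (w = 1.0 at the midpoint for odd N).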
Auto-correlation analysis
Each frame of the windowed signal is auto-correlated:
  R(k) = Σ_{n=0}^{N-1-k} x̃(n)·x̃(n+k),  k = 0, 1, ..., p
Typically p is from 8 to 16.
The zeroth auto-correlation, R(0), is the energy of the frame.
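The autocorrelation formula above in code, on a tiny illustrative frame:

```python
# R(k) = sum_{n=0}^{N-1-k} x(n)*x(n+k), for k = 0..p.
# R(0) is the frame energy.
def autocorrelate(x, p):
    N = len(x)
    return [sum(x[n] * x[n + k] for n in range(N - k))
            for k in range(p + 1)]

R = autocorrelate([1.0, 2.0, 3.0], p=2)
# R[0] = 1+4+9 = 14 (energy), R[1] = 1*2+2*3 = 8, R[2] = 1*3 = 3
```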
LPC and CEP analysis
Convert each frame of p+1 auto-correlations into an LPC parameter set:
• LPC coefficients
• reflection coefficients
• log area ratio coefficients
Read previous lectures for details.
CEP (cepstral) coefficients are derived directly from the LPC coefficients. The CEP coefficients c_m are the coefficients of the Fourier transform of the log magnitude spectrum; they are more robust and reliable features for speech recognition than the LPC coefficients, the reflection coefficients, and the log area ratio coefficients.
Generally, a CEP with Q > p coefficients is used, with Q ≈ (3/2)p (refer to the previous lecture).
Parameter weighting
Because the low-order CEP coefficients are sensitive to the overall spectral slope and the high-order CEP coefficients are sensitive to noise, it is a standard technique to weight the CEP by a tapered window to minimize these sensitivities.
A general weighting function:
  ĉ_k = w_k·c_k,  w_k = 1 + (Q/2)·sin(πk/Q),  1 ≤ k ≤ Q
This function truncates the computation and de-emphasizes c_k around k = 1 and k = Q.
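The cepstral weighting (liftering) function above, applied to a coefficient vector:

```python
import math

# Cepstral liftering: c^_k = w_k * c_k with
# w_k = 1 + (Q/2)*sin(pi*k/Q), 1 <= k <= Q.
def lifter(ceps):
    Q = len(ceps)
    return [(1 + (Q / 2) * math.sin(math.pi * k / Q)) * c
            for k, c in enumerate(ceps, start=1)]
```

On a vector of ones with Q = 4, the weights peak at 3.0 mid-range (k = 2) and fall back to 1.0 at k = Q, matching the de-emphasis at the ends described above.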
VQ = Vector Quantization
Advantages:
• Reduced storage for signal analysis
• Reduced computation for determining the similarity of speech analysis vectors
• Discretized representation of speech sounds: by associating a phonetic label with each codebook vector, the process of choosing the best codebook vector to represent a given speech vector becomes equivalent to assigning a phonetic label to each speech frame
Disadvantages:
• An inherent signal distortion in representing the actual speech vectors
• The storage required for the codebook vectors is nontrivial
There is a trade-off among quantization error, processing time, and storage.
Recognition using HMM
The last phase of speech recognition is to recognize the uttered speech using HMMs.
There are two major steps, training and recognition; both are quite similar.
Training
• Record the speech
• Segment the speech into suitable units
• Pre-process the speech
• Extract LPC/CEP coefficients
• Train and obtain the codebook
• Train and obtain the HMM model for each speech unit
Recognition
• Repeat all the above, except that we do not train the codebook and HMMs; we quantize the LPC/CEP parameters with the codebook and send the resulting symbols to the HMMs for recognition.
Recognition of Continuous Words
Three methods:
• Two-level dynamic programming
• Level building algorithm
• One-pass algorithm
(details in L. Rabiner and B. H. Juang, Chapter 7)
Beam Search (one-stage or Bridle algorithm)
Each word pattern is matched against a first portion of the input pattern using DTW, and a few of the best-scoring patterns, together with their ending positions in the input pattern, are recorded.
Each word pattern is then matched against the second portion of the input pattern, starting at the points where the previous word matches ended.
The process is repeated until the end of the pattern is reached, generating a word-decision tree.
For example, after the first match, vocabulary words W4, W6 and W3, ending at frames 10, 32 and 22 respectively, are selected as possible ways of explaining the first part of the unknown input pattern.
Some paths are eventually terminated as better-scoring paths are found.
When all of the input pattern has been processed, the unknown word sequence is obtained by tracing back the tree.
Post-processing
Post-processing increases the recognition rate significantly (reducing the error rate to 1/2 or 1/3).
Post-processing techniques:
• N-grams
• Grammars, syntactic parsers such as PCFG, Link Grammar
• Semantic parsers (logic parsers?)
• Trigger pairs - long-distance dependency
• Many others...
HMM - Hidden Markov Model
Three types:
• discrete (DHMM) - poor performance
• continuous (CHMM) - requires a large amount of speech data for training
• semi-continuous (SCHMM) - best
Example: Isolated Word Recognition
Assume we have a vocabulary of V words to be recognized, and a training set of L tokens of each word (spoken by one or more talkers).
1. Build an HMM for each word in the vocabulary: λk, 1 ≤ k ≤ V
2. For each unknown word, characterized by observation sequence O = O1, O2, O3, ..., OT
3. Scoring procedure: for each word model λk, calculate Pk = Pr(O|λk)
4. Choose the word whose model probability is highest.
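The scoring loop above can be sketched with a stand-in for the forward-algorithm evaluation; the vocabulary, the per-symbol probability tables, and the `score` function are illustrative assumptions, not the real Pr(O|λk) computation.

```python
# Isolated-word recognition: pick the word model with highest P(O|lambda_k).
# Here score() is a toy stand-in that treats each "model" as an
# independent per-symbol probability table.
def score(obs, model):
    p = 1.0
    for o in obs:
        p *= model[o]
    return p

models = {"yes": {0: 0.8, 1: 0.2},
          "no":  {0: 0.3, 1: 0.7}}

def recognize(obs):
    # Step 3 (score every model) and step 4 (choose the best) combined.
    return max(models, key=lambda w: score(obs, models[w]))
```

In a real system, `score` would be the forward procedure run on word model λk.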
4. Speech Recognition Applications
Telecommunications
• Providing information or access to data or services over telephone lines
Office / desktop
• Voice control of the PC
• Voice control of telephony functionality
Manufacturing and business
Medical / legal
• Creating various reports and forms
Other applications
• Using speech recognition in toys and games
5. ASR System Design Issues
Signal capturing and noise reduction
Microphone array
Acoustic variability and robustness issues
Database collection limitation
Static versus dynamic system design
Spontaneous speech and keyword spotting
Human factors issues
6. Summary
Research challenges:
• Robust speech recognition, to improve the usability of a system in a wide variety of speaking conditions for a large population of speakers
• Robust utterance verification, to relax the rigid speaking format, extract relevant partial information from spontaneous speech, and attach a recognition confidence to it
• High-performance speech recognition through adaptive system design, to quickly meet changing tasks, speakers and speaking environments