Chapter 7 Speech Recognition Framework

7.1 The main form and application of speech recognition
7.2 The main factors of speech recognition
7.3 The active topics of speech recognition
7.4 The basic framework of speech recognition system

7.1 The main form and application of speech recognition (1)

Speech Recognition -- Takes a speech string as input and generates the corresponding word, word string, or transcription.

Speech Understanding -- Takes a speech string as input and generates the corresponding response or action.

Speaker Recognition -- Takes a speech string as input and identifies or verifies the speaker.

Language Identification -- Takes a speech string as input and identifies which language it belongs to.

The main form and application of speech recognition (2)

Speech Recognition
  Speech Navigation -- Computer operation, speech control, intelligent toys, and parcel dispatch
  Speech Dictation -- Dictation machines, speech dialing, and broadcast recording


Speech Understanding
  Speech Service -- Services for disabled users, banking, travel, and transportation, wherever dialog is needed
  Speech Communication -- Bilingual speech communication and multilingual simultaneous interpretation


Speaker Recognition
  Speaker Verification -- Access to a secure department or program, banking and other services
  Speaker Identification -- User recognition, voice checking when searching for criminals

7.2 The main factors of speech recognition (1)

Speech Style
  Isolated Words (IWR) -- There are obvious pauses (or silence) between words, for example names of people or places, or commands.
  Connected Word Speech (CWR) -- For example, continuous digit strings (telephone numbers or data).
  Continuous Speech (CSR) -- Natural spoken language in sentences (or utterances).
  In order of ease: CSR << CWR << IWR (isolated words are the easiest to recognize, continuous speech the hardest).

The main factors of speech recognition (2)

Speaker Dependent (SD)
  A speaker-dependent recognition system can only recognize the speech of one or a few speakers. The speech model is trained only on those speakers' speech samples (corpus).

Speaker Independent (SI)
  A speaker-independent recognition system can recognize speech from any speaker. In this case, the speech model is trained on the corpora of many speakers, and speaker adaptation at recognition time can improve performance. SI is much harder than SD.


Vocabulary Size
  Small Vocabulary -- containing several hundred words
  Middle Vocabulary -- containing a thousand to several thousand words
  Large Vocabulary -- more than 10 thousand words


Other factors:
  Speech Quality -- Microphone speech or telephone speech, recording environment, speaker's cooperation
  Task -- Word recognition, transcription, word spotting, dialog, and translation are very different tasks
  Domain (specific or generic) and Syntax Constraints (less or more)

7.3 The active topics of speech recognition

Broadcasting Recording Systems
Telephone Dialog Systems
Speaker Adaptation
Noise Reduction
Word Spotting
Language Models Based on Classes

7.4 The basic framework of speech recognition system (1)

Input -- Speech string (utterance) through a microphone or telephone: { x'(n) }

Preprocessing -- Windowing, framing, and pre-emphasis: { x_i(n) }

Feature Extraction -- Feature vector calculation frame by frame: { o_i }

Decision Making -- Ranges from simple algorithms such as a minimum-distance classifier to complex ones such as HMMs (statistical acoustic and language models).
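As an illustration of the simplest decision rule mentioned above, the following is a minimal sketch of a minimum-distance classifier. It assumes each word is represented by a single reference feature vector and that Euclidean distance is the metric; the function names and the toy templates are hypothetical and not taken from the slides.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify_min_distance(feature, templates):
    """Return the label whose reference vector is closest to `feature`.

    `templates` maps a label to one reference feature vector.
    """
    return min(templates, key=lambda label: euclidean(feature, templates[label]))

# Toy 2-dimensional reference vectors, invented for illustration only.
templates = {"yes": [1.0, 0.0], "no": [0.0, 1.0]}
print(classify_min_distance([0.9, 0.2], templates))  # prints "yes"
```

A real system would compare sequences of feature vectors rather than single vectors, which is where template matching and statistical models come in.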

The basic framework of speech recognition system (2)

Input
  Anti-aliasing filter, 300 Hz to 4 kHz
  Sampling rate: 8 kHz (telephone speech) to 16 kHz (microphone speech)
  Sampling precision: 8 bits (telephone speech) to 16 bits (microphone speech)
  Determination of where sampling starts and ends (silence detection and the memory buffer to use)
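The start and end determination could be sketched with a short-time energy threshold. This is only one possible approach; the slides do not specify a method. The frame length of 80 samples (10 ms at 8 kHz), the threshold value, and the function names are assumptions chosen for illustration.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of consecutive non-overlapping frames."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def find_endpoints(samples, frame_len=80, threshold=0.01):
    """Return (start, end) sample indices spanning all frames above threshold.

    Returns None when every frame is at or below the threshold (all silence).
    """
    energies = frame_energies(samples, frame_len)
    active = [i for i, e in enumerate(energies) if e > threshold]
    if not active:
        return None
    return active[0] * frame_len, (active[-1] + 1) * frame_len

# Synthetic signal: 160 silent samples, 240 samples at amplitude 0.5, 160 silent.
signal = [0.0] * 160 + [0.5, -0.5] * 120 + [0.0] * 160
print(find_endpoints(signal))  # (160, 400)
```

In practice the threshold would have to be tuned to the recording environment, and a buffer of samples before the detected start is usually kept so that weak word onsets are not cut off.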

The basic framework of speech recognition system (3)

Preprocessing
  Window selection and windowing
  Framing -- frame length and frame shift selection (typically 25 ms and 10 ms)
  Pre-emphasis -- $y(n) = x(n) - \alpha x(n-1)$, where $\alpha$ is close to 1.0 (0.95 or 0.97); for simplicity it could be 15/16 = 0.9375. The goal is high-frequency enhancement.
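The preprocessing steps can be sketched as follows. This is a minimal illustration, assuming an 8 kHz sampling rate, a Hamming window (the slides do not name a specific window), and pre-emphasis applied before framing (the order is also an assumption). All function names are hypothetical.

```python
import math

def pre_emphasize(x, alpha=0.97):
    """y(n) = x(n) - alpha * x(n-1); the first sample is passed through unchanged."""
    if not x:
        return []
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]

def split_frames(x, frame_len, frame_shift):
    """Cut x into overlapping frames; a trailing partial frame is dropped."""
    return [x[i:i + frame_len]
            for i in range(0, len(x) - frame_len + 1, frame_shift)]

def hamming(frame_len):
    """Hamming window coefficients (one common window choice)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
            for n in range(frame_len)]

def preprocess(x, sample_rate=8000, frame_ms=25, shift_ms=10, alpha=0.97):
    """Pre-emphasize, frame, and window a signal; returns a list of frames."""
    frame_len = sample_rate * frame_ms // 1000    # 200 samples at 8 kHz
    frame_shift = sample_rate * shift_ms // 1000  # 80 samples at 8 kHz
    window = hamming(frame_len)
    emphasized = pre_emphasize(x, alpha)
    return [[s * w for s, w in zip(frame, window)]
            for frame in split_frames(emphasized, frame_len, frame_shift)]

# One second of a 440 Hz test tone sampled at 8 kHz.
signal = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(8000)]
frames = preprocess(signal)
print(len(frames), len(frames[0]))  # 98 200
```

With a 25 ms frame and a 10 ms shift, consecutive frames overlap by 15 ms, so each sample contributes to two or three frames.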

The basic framework of speech recognition system (4)

Feature Extraction
There are several ways to obtain a feature vector; only one is given here: MFCC (mel-frequency cepstral coefficients).

The steps to get the MFCC for one frame:

1. Compute the FFT of the frame (zero-padded to length N) to get X[k]:

$$X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j 2\pi n k / N}, \quad k = 0, \ldots, N-1$$

2. Using M filters, compute the log-energy S[m] of filter m by weighting the power spectrum $|X[k]|^2$ with the filter $H_m[k]$ and summing over k:

$$S[m] = \log\left[\sum_{k=0}^{N-1} |X[k]|^2\, H_m[k]\right], \quad m = 0, \ldots, M-1$$

where $H_m[k] \ge 0$ and $\sum_{k=0}^{N-1} H_m[k] = 1$.

Typically the $H_m[k]$ are chosen as triangular filters:

$$H_m[k] = \begin{cases}
0, & k < f[m-1] \\
\dfrac{2\,(k - f[m-1])}{(f[m+1] - f[m-1])\,(f[m] - f[m-1])}, & f[m-1] \le k < f[m] \\
\dfrac{2\,(f[m+1] - k)}{(f[m+1] - f[m-1])\,(f[m+1] - f[m])}, & f[m] \le k \le f[m+1] \\
0, & k > f[m+1]
\end{cases}$$

so that $\sum_{k=0}^{N-1} H_m[k] = 1$ for all m, where the boundary points f[m] are uniformly spaced on the mel scale.

3. The mel-frequency cepstrum is the DCT of the M filter outputs:

$$c[n] = \sum_{m=0}^{M-1} S[m] \cos\left(\frac{\pi n\,(m + 1/2)}{M}\right), \quad n = 0, \ldots, M-1$$

M is typically 24 to 40, but n is truncated to about 12. Besides the 12 coefficients, their first- and second-order differences are often used as feature vector components as well, so the total number of components is about 36 to 39.
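The MFCC steps for one frame can be sketched in plain Python as follows. Several details are assumptions not given in the slides: the mel mapping mel(f) = 1125 ln(1 + f/700), the filter range from 0 Hz to half the sampling rate, a small floor inside the logarithm, and the use of only the non-redundant half of the spectrum. The code uses zero-based filters, so filter m spans edges[m] to edges[m+2]. A direct DFT stands in for the FFT to keep the example dependency-free.

```python
import cmath
import math

def hz_to_mel(f):
    return 1125.0 * math.log(1.0 + f / 700.0)   # assumed mel mapping

def mel_to_hz(m):
    return 700.0 * (math.exp(m / 1125.0) - 1.0)

def power_spectrum(frame, n_fft):
    """|X[k]|^2 for k = 0..n_fft/2 via a direct DFT of the zero-padded frame."""
    x = list(frame) + [0.0] * (n_fft - len(frame))
    return [abs(sum(x[n] * cmath.exp(-2j * math.pi * n * k / n_fft)
                    for n in range(n_fft))) ** 2
            for k in range(n_fft // 2 + 1)]

def mel_edges(n_filters, n_fft, sample_rate):
    """Integer FFT-bin boundary points, uniformly spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    edges = [int(n_fft * mel_to_hz(top * i / (n_filters + 1)) / sample_rate)
             for i in range(n_filters + 2)]
    if any(b <= a for a, b in zip(edges, edges[1:])):
        raise ValueError("filter edges collide; use a larger FFT or fewer filters")
    return edges

def triangular_filter(m, edges, n_bins):
    """Triangular filter m over bins 0..n_bins-1, normalized to sum to 1."""
    lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
    h = [0.0] * n_bins
    for k in range(lo, hi + 1):
        if k < mid:
            h[k] = 2.0 * (k - lo) / ((hi - lo) * (mid - lo))
        else:
            h[k] = 2.0 * (hi - k) / ((hi - lo) * (hi - mid))
    return h

def dct_cepstrum(log_energies, n_ceps):
    """c[n] = sum_m S[m] cos(pi * n * (m + 1/2) / M) for n = 0..n_ceps-1."""
    M = len(log_energies)
    return [sum(log_energies[m] * math.cos(math.pi * n * (m + 0.5) / M)
                for m in range(M))
            for n in range(n_ceps)]

def mfcc_frame(frame, sample_rate=8000, n_fft=256, n_filters=24, n_ceps=12):
    spec = power_spectrum(frame, n_fft)
    edges = mel_edges(n_filters, n_fft, sample_rate)
    log_e = []
    for m in range(n_filters):
        h = triangular_filter(m, edges, len(spec))
        energy = sum(p * w for p, w in zip(spec, h))
        log_e.append(math.log(max(energy, 1e-12)))  # floor avoids log(0)
    return dct_cepstrum(log_e, n_ceps)

# A 25 ms frame (200 samples at 8 kHz) of a 440 Hz tone.
frame = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(200)]
coeffs = mfcc_frame(frame)
print(len(coeffs))  # 12
```

Whether the 12 retained coefficients start at n = 0 or n = 1 varies between systems; this sketch simply keeps the first 12 values of the formula.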


Decision Making
This is the last step, which determines what is in the input speech string. For isolated word systems, template matching is still used. For connected or continuous speech, statistical models (HMM and others) are used. We will discuss them in detail later.
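Template matching for isolated words could be sketched with dynamic time warping (DTW). The slides do not say which matching algorithm is meant; DTW is assumed here as one classic choice because it tolerates differences in speaking rate. The one-dimensional toy "feature vectors" and the function names are invented for illustration.

```python
import math

def dtw_distance(a, b):
    """Dynamic time warping cost between two sequences of feature vectors."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the best accumulated cost aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # local Euclidean distance
            cost[i][j] = d + min(cost[i - 1][j],      # stretch b
                                 cost[i][j - 1],      # stretch a
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]

def recognize(utterance, templates):
    """Return the word whose template has the lowest DTW cost."""
    return min(templates, key=lambda w: dtw_distance(utterance, templates[w]))

# Toy templates: one reference sequence per word.
templates = {
    "up": [[0.0], [1.0], [2.0], [3.0]],
    "down": [[3.0], [2.0], [1.0], [0.0]],
}
utterance = [[0.0], [0.0], [1.0], [2.0], [2.0], [3.0]]  # a slower "up"
print(recognize(utterance, templates))  # up
```

The utterance has six frames and the template four, yet the warping path aligns them at zero cost, which a frame-by-frame comparison could not do.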
