Chapter 7 Speech Recognition Framework

7.1 The main form and application of speech recognition
7.2 The main factors of speech recognition
7.3 The active topics of speech recognition
7.4 The basic framework of speech recognition system

7.1 The main form and application of speech recognition (1)

Speech Recognition -- Takes a speech string as input and generates the corresponding word, word string, or transcription

Speech Understanding -- Takes a speech string as input and generates the corresponding response or action

Speaker Recognition -- Takes a speech string as input and identifies or verifies the speaker

Language Identification -- Takes a speech string as input and identifies which language it belongs to

The main form and application of speech recognition (2)

Speech Recognition
  Speech Navigation -- Computer operation, speech control, intelligent toys, and parcel dispatch
  Speech Dictation -- Dictation machines, speech dialing, and broadcast recording

The main form and application of speech recognition (3)

Speech Understanding
  Speech Service -- For people with disabilities, banking, travel, and transportation, wherever dialog is needed
  Speech Communication -- Bilingual speech communication and multilingual simultaneous interpretation

The main form and application of speech recognition (4)

Speaker Recognition
  Speaker Verification -- Access to secure departments or programs, banking, and other services
  Speaker Identification -- User recognition, and voice checking when searching for criminals

7.2 The main factors of speech recognition (1)

Speech Style
  Isolated Words (IWR) -- There are obvious pauses (or silence) between words, for example names of people or places, or commands
  Connected Word Speech (CWR) -- For example, continuous digit strings (telephone numbers or data)
  Continuous Speech (CSR) -- Natural spoken language in sentences (or utterances)
  In order of ease: CSR << CWR << IWR. Continuous speech is much harder to recognize than connected words, which are in turn much harder than isolated words.

The main factors of speech recognition (2)

Speaker Dependent (SD)
  A speaker-dependent recognition system can only recognize speech from one speaker or a few speakers. The speech model is trained only on those speakers' speech samples (corpus).

Speaker Independent (SI)
  A speaker-independent recognition system can recognize speech from any speaker. In this case the speech model is trained on the corpora of many speakers, and speaker adaptation during recognition improves performance. SI is much harder than SD.

The main factors of speech recognition (3)

Vocabulary Size
  Small Vocabulary -- containing several hundred words
  Middle Vocabulary -- containing a thousand to several thousand words
  Large Vocabulary -- more than 10 thousand words

The main factors of speech recognition (4)

Other factors:
  Speech Quality -- Microphone speech or telephone speech, recording environment, speaker's cooperation
  Task -- Word recognition, transcription, word spotting, dialog, and translation are very different tasks
  Domain (specific or generic) and Syntax Constraints (less or more)

7.3 The active topics of speech recognition

  Broadcasting Recording Systems
  Telephone Dialog Systems
  Speaker Adaptation
  Noise Reduction
  Word Spotting
  Language Models Based on Classes

7.4 The basic framework of speech recognition system (1)

Input -- Speech string (utterance) through microphone or telephone { x'(n) }

Preprocessing -- Windowing, framing, and pre-emphasis { x_i(n) }

Feature Extraction -- Feature vectors calculated frame by frame { o_i }

Decision Making -- Ranges from simple algorithms such as a minimum-distance classifier to complex ones such as HMMs (statistical acoustic and language models).

The basic framework of speech recognition system (2)

Input
  Anti-aliasing filter, 300 Hz to 4 kHz
  Sampling rate: 8 kHz (telephone speech) to 16 kHz (microphone speech)
  Sampling precision: 8 bits (telephone speech) to 16 bits (microphone speech)
  Determination of where sampling starts and ends (silence detection and the memory buffer to use)
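The start and end determination mentioned above can be illustrated with a short-time energy threshold. This is only a minimal sketch of one possible approach, not necessarily the method the slides have in mind: the function name `find_endpoints`, the 10 ms analysis frame, and the `threshold_ratio` parameter are assumptions introduced here for illustration.

```python
import numpy as np

def find_endpoints(signal, sample_rate, frame_ms=10, threshold_ratio=0.1):
    """Return (start, end) sample indices of the active region, or None.

    A frame counts as speech when its energy exceeds
    threshold_ratio * (maximum frame energy). Both the frame size and the
    threshold rule are illustrative assumptions.
    """
    signal = np.asarray(signal, dtype=np.float64)
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    if n_frames == 0:
        return None
    # Cut the signal into non-overlapping frames and compute their energy.
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.sum(frames ** 2, axis=1)
    if energy.max() == 0.0:
        return None  # the whole buffer is silent
    active = np.nonzero(energy > threshold_ratio * energy.max())[0]
    start = int(active[0] * frame_len)
    end = int((active[-1] + 1) * frame_len)
    return start, end
```

A real system would usually add hangover frames and a noise-floor estimate; the sketch only shows where the buffer boundaries come from.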

The basic framework of speech recognition system (3)

Preprocessing
  Window selection and windowing
  Framing -- Selection of frame length and frame shift (typically 25 ms and 10 ms)
  Pre-emphasis -- y(n) = x(n) - αx(n-1), where α is close to 1.0 (0.95 or 0.97); for simplicity it can be 15/16 = 0.9375. The goal is to enhance the high frequencies.
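The preprocessing steps above can be sketched in a few lines of NumPy. The function names are hypothetical, and the Hamming window is an assumption: the slides only say that a window has to be selected.

```python
import numpy as np

def pre_emphasize(x, alpha=0.97):
    """Apply y(n) = x(n) - alpha * x(n-1); the first sample is passed through."""
    x = np.asarray(x, dtype=np.float64)
    y = x.copy()
    y[1:] = x[1:] - alpha * x[:-1]
    return y

def frame_signal(x, sample_rate, frame_ms=25, shift_ms=10):
    """Split x into overlapping frames and apply a Hamming window to each.

    Returns an array of shape (n_frames, frame_len). Trailing samples that
    do not fill a complete frame are dropped.
    """
    x = np.asarray(x, dtype=np.float64)
    frame_len = int(round(sample_rate * frame_ms / 1000))
    shift = int(round(sample_rate * shift_ms / 1000))
    if len(x) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(x) - frame_len) // shift
    # Row i holds the sample indices of frame i.
    idx = np.arange(frame_len)[None, :] + shift * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(frame_len)
```

At 16 kHz, a 25 ms frame is 400 samples and a 10 ms shift is 160 samples, so one second of speech yields 1 + (16000 - 400) // 160 = 98 frames.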

The basic framework of speech recognition system (4)

Feature Extraction
  There are several ways to obtain a feature vector; only one is given here: MFCC (mel-frequency cepstral coefficients).

  The steps to get the MFCC for one frame:
  1. FFT (with zero padding) to get X[k]:

     $$X[k] = \sum_{n=0}^{N-1} x[n] \exp(-j 2\pi n k / N), \quad k = 0, \dots, N-1$$

  2. Using M filters, with the log-energy S[m] of filter m computed by weighting the power spectrum S[k] = |X[k]|^2 with a filter H_m[k] and summing over k:

The basic framework of speech recognition system (5)

$$S[m] = \log\left( \sum_{k=0}^{N-1} |X[k]|^2 \, H_m[k] \right), \quad m = 0, \dots, M-1$$

where $H_m[k] \ge 0$ and $\sum_{k=0}^{N-1} H_m[k] = 1$.

Typically the $H_m[k]$ are chosen as triangular filters:

$$H_m[k] = \begin{cases}
0, & k < f[m-1] \\
\dfrac{2\,(k - f[m-1])}{(f[m+1] - f[m-1])\,(f[m] - f[m-1])}, & f[m-1] \le k < f[m] \\
\dfrac{2\,(f[m+1] - k)}{(f[m+1] - f[m-1])\,(f[m+1] - f[m])}, & f[m] \le k \le f[m+1] \\
0, & k > f[m+1]
\end{cases}$$

so that $\sum_{k=0}^{N-1} H_m[k] = 1$ for all m, where the boundary points f[m] are uniformly spaced on the mel scale.
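The FFT and filterbank steps can be sketched as follows. Several details are assumptions not given in the slides: the mel mapping 2595 log10(1 + f/700), the use of fractional FFT-bin positions for the boundary points f[m], the one-sided spectrum from `np.fft.rfft`, and the small constant added before the logarithm. The function names are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    # A commonly used mel mapping (assumed; the slides do not give one).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate, f_low=0.0, f_high=None):
    """Triangular filters H_m[k] for k = 0 .. n_fft//2, one filter per row."""
    if f_high is None:
        f_high = sample_rate / 2.0
    # n_filters + 2 boundary points, uniformly spaced on the mel scale.
    mel_points = np.linspace(hz_to_mel(f_low), hz_to_mel(f_high), n_filters + 2)
    f = mel_to_hz(mel_points) * n_fft / sample_rate  # in FFT-bin units
    k = np.arange(n_fft // 2 + 1, dtype=np.float64)
    H = np.zeros((n_filters, k.size))
    for m in range(1, n_filters + 1):
        lo, mid, hi = f[m - 1], f[m], f[m + 1]
        rising = (k >= lo) & (k < mid)
        falling = (k >= mid) & (k <= hi)
        H[m - 1, rising] = 2.0 * (k[rising] - lo) / ((hi - lo) * (mid - lo))
        H[m - 1, falling] = 2.0 * (hi - k[falling]) / ((hi - lo) * (hi - mid))
    return H

def log_mel_energies(frame, H, n_fft):
    """S[m] = log(sum_k |X[k]|^2 H_m[k]) for one windowed frame."""
    spectrum = np.fft.rfft(frame, n=n_fft)  # zero-pads the frame to n_fft
    power = np.abs(spectrum) ** 2
    return np.log(H @ power + 1e-10)  # small floor avoids log(0)
```

Because the filters are sampled at integer bins, each row of `H` sums to approximately 1 rather than exactly 1.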

The basic framework of speech recognition system (6)

The mel-frequency cepstrum is the DCT of the M filter outputs:

$$c[n] = \sum_{m=0}^{M-1} S[m] \cos\left( \frac{\pi n (m + 1/2)}{M} \right), \quad n = 0, \dots, M-1$$

M = 24 to 40, but only about the first 12 coefficients c[n] are kept.
Besides these 12 coefficients, their first- and second-order differences are often used as feature vector components too. The total number of components is about 36 to 39.
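The DCT and difference steps can be sketched as below. Two points are assumptions: the sketch keeps the coefficients n = 0 .. 11 (the slides do not say whether c[0] is among the 12), and it uses plain frame-to-frame differences, whereas practical systems often use a regression over several frames. The function names are hypothetical.

```python
import numpy as np

def dct_cepstrum(S, n_ceps=12):
    """c[n] = sum_m S[m] cos(pi * n * (m + 1/2) / M) for n = 0 .. n_ceps-1.

    S has shape (..., M), one row of log filterbank energies per frame.
    """
    S = np.asarray(S, dtype=np.float64)
    M = S.shape[-1]
    n = np.arange(n_ceps)[:, None]
    m = np.arange(M)[None, :]
    basis = np.cos(np.pi * n * (m + 0.5) / M)  # shape (n_ceps, M)
    return S @ basis.T

def add_differences(c):
    """Append first- and second-order differences along the frame axis.

    c has shape (n_frames, n_ceps); the result has shape (n_frames, 3 * n_ceps).
    The first frame's differences are zero.
    """
    d1 = np.diff(c, axis=0, prepend=c[:1])
    d2 = np.diff(d1, axis=0, prepend=d1[:1])
    return np.concatenate([c, d1, d2], axis=1)
```

With 12 cepstral coefficients this gives 36 components per frame; adding an energy term and its differences would give 39.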

The basic framework of speech recognition system (7)

Decision Making
  This is the last step, which determines what is in the input speech string. For isolated-word systems, template matching is still used. For connected or continuous speech, statistical models (HMMs and others) are used. We will discuss them in detail later.
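As an illustration of template matching for isolated words, the sketch below compares an utterance with stored templates using dynamic time warping (DTW) and picks the closest one. DTW is an assumption here: it is a classic choice for this task, but the slides do not name a specific matching algorithm. The function names and the `templates` dictionary are hypothetical.

```python
import numpy as np

def dtw_distance(a, b):
    """Cumulative Euclidean cost of the best warping path between a and b.

    a has shape (Ta, D) and b has shape (Tb, D): one feature vector per frame.
    """
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    Ta, Tb = len(a), len(b)
    cost = np.full((Ta + 1, Tb + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            # Allowed moves: advance in a, advance in b, or advance in both.
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[Ta, Tb]

def recognize(features, templates):
    """Return the word whose template has the smallest DTW distance."""
    return min(templates, key=lambda word: dtw_distance(features, templates[word]))
```

The warping lets a slowly spoken word match a faster template, which is why a plain frame-by-frame distance is usually not enough for this task.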