Chapter 7 Speech Recognition Framework 7.1 The main form and application of speech recognition 7.2...
Uploaded by molly-powell
Chapter 7 Speech Recognition Framework
7.1 The main form and application of speech recognition
7.2 The main factors of speech recognition
7.3 The active topics of speech recognition
7.4 The basic framework of speech recognition system
7.1 The main form and application of speech recognition (1)
Speech Recognition -- Inputs a speech string and generates the corresponding word, word string, or transcription
Speech Understanding -- Inputs a speech string and generates the corresponding response or action
Speaker Recognition -- Inputs a speech string and identifies or verifies the speaker
Language Identification -- Inputs a speech string and identifies which language the input belongs to
The main form and application of speech recognition (2)
Speech Recognition
Speech Navigation -- Computer operation, speech control, intelligent toys, and parcel dispatch
Speech Dictation -- Dictation machine, speech dialing, and broadcast recording
The main form and application of speech recognition (3)
Speech Understanding
Speech Service -- For people with disabilities, banking, travel, and transportation, wherever a dialog is needed
Speech Communication -- Bilingual speech communication and multilingual simultaneous interpretation
The main form and application of speech recognition (4)
Speaker Recognition
Speaker Verification -- Access to a security department or program, banking, and other services
Speaker Identification -- User recognition, voice checking when searching for criminals
7.2 The main factors of speech recognition (1)
Speech Style
Isolated Words (IWR) -- There are obvious pauses (or silence) between words, for example names of people or places, or commands
Connected Word Speech (CWR) -- For example continuous digit strings (telephone numbers or data)
Continuous Speech (CSR) -- Natural spoken language in sentences (or utterances)
Ordered by ease of recognition: CSR << CWR << IWR (continuous speech is the hardest, isolated words the easiest)
The main factors of speech recognition (2)
Speaker Dependent (SD) -- A speaker-dependent recognition system can only recognize speech from one or a few speakers. The speech model is trained only on those speakers' speech samples (corpus).
Speaker Independent (SI) -- A speaker-independent recognition system can recognize speech from any speaker. In this case the speech model is trained on corpora from many speakers, and speaker adaptation during recognition will improve performance. SI is much harder than SD.
The main factors of speech recognition (3)
Vocabulary Size
Small Vocabulary -- containing several hundred words
Middle Vocabulary -- containing a thousand to several thousand words
Large Vocabulary -- more than 10 thousand words
The main factors of speech recognition (4)
Other factors:
Speech Quality -- Microphone speech or telephone speech, recording environment, speaker's cooperation
Task -- Word recognition, transcription, word spotting, dialog, and translation are very different tasks
Domain (specific or generic) and Syntax Constraints (less or more)
7.3 The active topics of speech recognition
Broadcasting Recording Systems
Telephone Dialog Systems
Speaker Adaptation
Noise Reduction
Word Spotting
Language Models Based on Classes
7.4 The basic framework of speech recognition system (1)
Input --Speech string (utterance) through microphone or telephone { x’(n) }
Preprocessing --Windowing, Framing and Pre-emphasizing { xi(n) }
Feature Extraction --Feature vector calculation frame by frame { oi }
Decision Making -- Ranges from simple algorithms such as a minimal distance classifier to complex ones such as HMMs (statistical acoustic and language models).
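The simplest decision rule mentioned above, a minimal distance classifier, can be sketched as follows. This is only an illustration under stated assumptions: the Euclidean distance, the function names, and the toy two-dimensional templates are all hypothetical and do not come from the slides.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify_min_distance(feature, templates):
    """Return the label whose reference vector is closest to `feature`.

    `templates` maps each label to one reference feature vector
    (hypothetical data; a real system would use trained references).
    """
    return min(templates, key=lambda label: euclidean(feature, templates[label]))

# Toy 2-dimensional "features" for two command words (made-up numbers).
templates = {"yes": [1.0, 0.0], "no": [0.0, 1.0]}
result = classify_min_distance([0.9, 0.2], templates)
```

Here `result` is the label of the nearest template. Statistical models such as HMMs replace this single-vector comparison with probabilities over whole sequences of feature vectors.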
The basic framework of speech recognition system (2)
Input
Anti-aliasing filter with a passband of 300 Hz to 4 kHz
Sampling rate: 8 kHz (telephone speech) to 16 kHz (microphone speech)
Sampling precision: 8 bits (telephone speech) to 16 bits (microphone speech)
Determination of where sampling starts and ends (silence detection and the memory buffer to use)
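One simple way to decide where sampling should start and end is to threshold the short-time energy. The sketch below assumes that approach; the slides do not say which silence detector is used, and the function name, the 80-sample frame (10 ms at 8 kHz), and the threshold value are illustrative assumptions.

```python
def detect_endpoints(samples, frame_len=80, threshold=0.01):
    """Return (start, end) sample indices spanning the first through last
    frame whose mean energy exceeds `threshold`, or None if all is silence.

    Assumption: a fixed energy threshold; practical systems usually adapt
    the threshold to the background noise level.
    """
    active_starts = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        if energy > threshold:
            active_starts.append(start)
    if not active_starts:
        return None
    return active_starts[0], active_starts[-1] + frame_len

# Synthetic signal: 160 silent samples, 240 "speech" samples, 160 silent samples.
signal = [0.0] * 160 + [0.5, -0.5] * 120 + [0.0] * 160
endpoints = detect_endpoints(signal)
```

In a live system the samples would arrive through a memory buffer, so a few frames before the detected start are normally kept to avoid clipping the onset of the utterance.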
The basic framework of speech recognition system (3)
Preprocessing
Window selection and windowing
Framing -- frame length and frame shift selection (typically 25 ms and 10 ms)
Pre-emphasis -- y(n) = x(n) - αx(n-1), where α is close to 1.0 (0.95 or 0.97); for simplicity it could be 15/16 = 0.9375. The goal is high-frequency enhancement.
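The preprocessing steps can be sketched in a few lines. The pre-emphasis formula and the 25 ms / 10 ms framing come from the slide; the Hamming window, the handling of the first sample, and all function names are assumptions made for illustration.

```python
import math

def pre_emphasize(x, alpha=0.97):
    """Apply y(n) = x(n) - alpha * x(n-1).

    Assumption: the first sample is passed through unchanged.
    """
    if not x:
        return []
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]

def frame_signal(x, sample_rate, frame_ms=25, shift_ms=10):
    """Cut x into overlapping frames and apply a Hamming window to each.

    Assumption: the slides only say "window selection"; Hamming is one
    common choice, not necessarily the one intended.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(x) - frame_len + 1, shift):
        frames.append([x[start + n] * window[n] for n in range(frame_len)])
    return frames

# 100 ms of a 440 Hz tone sampled at 8 kHz (telephone-speech rate).
sample_rate = 8000
signal = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(800)]
frames = frame_signal(pre_emphasize(signal), sample_rate)
```

At 8 kHz a 25 ms frame holds 200 samples and a 10 ms shift is 80 samples, so consecutive frames overlap by 120 samples.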
The basic framework of speech recognition system (4)
Feature Extraction
There are several ways to get a feature vector; only one is given here -- MFCC (mel-scale frequency cepstrum coefficients).
The steps to get the MFCC for one frame:
1. FFT (with zero padding) to get X[k]: $X[k] = \sum_{n=0}^{N-1} x[n] \exp(-j 2\pi n k / N)$, $k = 0, \dots, N-1$
2. Using M filters, with the log-energy S[m] of filter m computed by weighting the power spectrum $S[k] = |X[k]|^2$ with a filter $H_m[k]$ and summing over k:
The basic framework of speech recognition system (5)
$S[m] = \log\left[ \sum_{k=0}^{N-1} |X[k]|^2 H_m[k] \right]$, $m = 0, \dots, M-1$
where $H_m[k] \ge 0$ and $\sum_{k=0}^{N-1} H_m[k] = 1$.
Typically the $H_m[k]$ are chosen as triangular filters:
$$
H_m[k] =
\begin{cases}
0, & k < f[m-1] \\
\dfrac{2\,(k - f[m-1])}{(f[m+1] - f[m-1])\,(f[m] - f[m-1])}, & f[m-1] \le k < f[m] \\
\dfrac{2\,(f[m+1] - k)}{(f[m+1] - f[m-1])\,(f[m+1] - f[m])}, & f[m] \le k \le f[m+1] \\
0, & k > f[m+1]
\end{cases}
$$
so that $\sum_{k=0}^{N-1} H_m[k] = 1$ for all m, where the boundary points f[m] are uniformly spaced on the mel scale.
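The spectrum and filterbank steps can be sketched as below, under several assumptions that the slides do not specify: the mel mapping 1125 ln(1 + f/700), boundary points kept as fractional FFT-bin positions, filters numbered 1 to M over boundary points f[0] to f[M+1], a naive O(N^2) DFT in place of a real FFT, and a small floor to avoid log(0). All function names are hypothetical.

```python
import cmath
import math

def hz_to_mel(f_hz):
    # One common mel-scale formula (an assumption; the slides give none).
    return 1125.0 * math.log(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (math.exp(mel / 1125.0) - 1.0)

def power_spectrum(frame, n_fft):
    """Zero-pad the frame to n_fft samples and return |X[k]|^2, k = 0..n_fft-1."""
    padded = list(frame) + [0.0] * (n_fft - len(frame))
    spectrum = []
    for k in range(n_fft):
        x_k = sum(padded[n] * cmath.exp(-2j * math.pi * n * k / n_fft)
                  for n in range(n_fft))
        spectrum.append(abs(x_k) ** 2)
    return spectrum

def mel_filterbank(num_filters, n_fft, sample_rate):
    """Build triangular filters H_m[k] whose boundary points are uniform in mel."""
    mel_hi = hz_to_mel(sample_rate / 2.0)
    # Boundary points f[0..M+1], expressed as (fractional) FFT-bin positions.
    f = [mel_to_hz(i * mel_hi / (num_filters + 1)) * n_fft / sample_rate
         for i in range(num_filters + 2)]
    bank = []
    for m in range(1, num_filters + 1):
        width = f[m + 1] - f[m - 1]
        h = []
        for k in range(n_fft):
            if f[m - 1] <= k < f[m]:
                h.append(2.0 * (k - f[m - 1]) / (width * (f[m] - f[m - 1])))
            elif f[m] <= k <= f[m + 1]:
                h.append(2.0 * (f[m + 1] - k) / (width * (f[m + 1] - f[m])))
            else:
                h.append(0.0)
        bank.append(h)
    return bank

def log_filterbank_energies(power, bank, floor=1e-10):
    """S[m] = log(sum_k |X[k]|^2 * H_m[k]), floored to avoid log(0)."""
    return [math.log(max(sum(p * h for p, h in zip(power, hm)), floor))
            for hm in bank]

# A 1 kHz tone at 8 kHz sampling falls exactly on bin 8 of a 64-point DFT.
frame = [math.cos(2 * math.pi * 8 * n / 64) for n in range(64)]
bank = mel_filterbank(num_filters=8, n_fft=64, sample_rate=8000)
energies = log_filterbank_energies(power_spectrum(frame, 64), bank)
```

Because the filters are sampled at integer bins, each discrete sum of H_m[k] only approximates 1, and very narrow low-frequency filters can deviate noticeably. A practical implementation would use a library FFT and typically M = 24 to 40 filters on a longer frame.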
The basic framework of speech recognition system (6)
The mel frequency cepstrum is the DCT of the M filter outputs:
$c[n] = \sum_{m=0}^{M-1} S[m] \cos\left( \pi n (m + 1/2) / M \right)$, $n = 0, \dots, M-1$
M = 24 to 40, but n is truncated to about 12.
Besides the 12 coefficients, their first- and second-order differences are often used as feature vector components too. The total number of components is about 36 to 39.
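The DCT step and the appended differences can be sketched as follows. The slides do not say which 12 coefficients are kept or how the differences are computed, so two assumptions are made here: the coefficients n = 0 to 11 are retained, and each difference is a symmetric difference between neighbouring frames with the edge frames repeated. Function names are hypothetical.

```python
import math

def mfcc_from_log_energies(S, num_ceps=12):
    """c[n] = sum_m S[m] * cos(pi * n * (m + 1/2) / M), kept for n < num_ceps."""
    M = len(S)
    return [sum(S[m] * math.cos(math.pi * n * (m + 0.5) / M) for m in range(M))
            for n in range(num_ceps)]

def differences(seq):
    """Frame-to-frame differences (seq[t+1] - seq[t-1]) / 2, edges repeated."""
    T = len(seq)
    out = []
    for t in range(T):
        prev_frame = seq[max(t - 1, 0)]
        next_frame = seq[min(t + 1, T - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev_frame, next_frame)])
    return out

def add_differences(cepstra):
    """Concatenate each cepstral vector with its first and second differences."""
    d1 = differences(cepstra)
    d2 = differences(d1)
    return [c + a + b for c, a, b in zip(cepstra, d1, d2)]

# Three frames of flat log-energies from a hypothetical 24-filter bank.
log_energies = [[1.0] * 24 for _ in range(3)]
cepstra = [mfcc_from_log_energies(S) for S in log_energies]
features = add_differences(cepstra)
```

With 12 coefficients per frame this gives 12 + 12 + 12 = 36 components; adding an energy term and its two differences is one way to reach the figure of 39 quoted above.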
The basic framework of speech recognition system (7)
Decision Making
This is the last step, which determines what is in the input speech string. At present, template matching is still used for isolated word systems. For connected or continuous speech, statistical models (HMMs and others) are used. We will discuss them in detail later.
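Template matching for isolated words is often implemented with dynamic time warping (DTW), which aligns two feature sequences of different lengths before comparing them. The slides do not name a specific matching method, so DTW is an assumption here, as are the function names and the toy one-dimensional templates.

```python
import math

def frame_distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_distance(seq_a, seq_b):
    """Accumulated cost of the best monotonic alignment of two sequences."""
    inf = float("inf")
    rows, cols = len(seq_a), len(seq_b)
    cost = [[inf] * (cols + 1) for _ in range(rows + 1)]
    cost[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            d = frame_distance(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch seq_b
                                 cost[i][j - 1],      # stretch seq_a
                                 cost[i - 1][j - 1])  # advance both
    return cost[rows][cols]

def recognize(utterance, templates):
    """Return the label of the template with the smallest DTW distance."""
    return min(templates, key=lambda label: dtw_distance(utterance, templates[label]))

# Toy templates: each word is a short sequence of 1-dimensional features.
templates = {
    "one": [[0.0], [1.0], [2.0]],
    "two": [[2.0], [1.0], [0.0]],
}
# A slower rendition of "one": same shape, but some frames are repeated.
utterance = [[0.0], [0.0], [1.0], [2.0], [2.0]]
word = recognize(utterance, templates)
```

The repeated frames in `utterance` are absorbed by the warping path, so the slower rendition still matches the "one" template with zero accumulated cost.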