Introduction to Speech Signal Processing
Dr. Zhang Sen
Chinese Academy of Sciences, Beijing, China
23/04/24
• Introduction
  – Sampling and quantization
  – Speech coding
• Features and Analysis
  – Main features
  – Some transformations
• Text-to-Speech
  – State of the art
  – Main approaches
• Speech-to-Text
  – State of the art
  – Main approaches
• Applications
  – Human-machine dialogue systems
• The speech signal viewed mathematically
  – Can be described by a continuous function, but
  – Hard to find an explicit analytical form
  – Non-linear
  – Non-stationary, time-varying
  – Some parts resemble noise
  – Some parts resemble a pseudo-periodic signal
• The speech signal viewed physically
  – A wave generated by vibration
  – Transmitted through air or another medium
• Analysis approaches
  – Divide-and-conquer
  – Approximation and simplification
  – Transformation (time domain to frequency domain, TD-FD)
• Analysis purpose
  – To find speech features
  – Which are important, which are trivial
  – Correlation between features
  – How features change
  – How to change the original signal
• Features can be classified as
  – Time-domain features
  – Frequency-domain features
• Or
  – Short-term features
  – Long-term features
• Feature representation
  – Numerical: vector or distribution
  – Diagram: curve or image
• Windowing (framing)
  – Within a short frame (10 ms to 25 ms), the non-stationary signal can be treated as approximately stationary
  – and the non-linear behaviour as approximately linear
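The framing idea can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `frame_signal`, the 25 ms frame length, the 10 ms hop, and the choice of a Hamming window are picked for the example and are not taken from the slides.

```python
import math


def frame_signal(samples, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """Split a signal into overlapping frames, each weighted by a Hamming window.

    frame_ms and hop_ms are illustrative defaults (assumptions).
    """
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    if frame_len < 2 or hop_len < 1:
        raise ValueError("frame and hop must span at least 2 and 1 samples")
    # Hamming window: 0.54 - 0.46 * cos(2*pi*n / (N - 1))
    window = [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    # Only complete frames are kept; a trailing partial frame is dropped.
    for start in range(0, len(samples) - frame_len + 1, hop_len):
        frames.append([samples[start + n] * window[n] for n in range(frame_len)])
    return frames
```

Each returned frame is short enough to be analysed as if it were stationary; the tapering window reduces the edge discontinuities introduced by cutting the signal.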
• Window types
• Window shapes
• A few words on window functions
• Commonly used speech features
  – Zero-crossing-rate (ZCR)
  – Peaks
  – Power and energy
  – Correlation, auto-correlation, AMDF
  – Formant
  – Pitch
  – Frequency spectrum
  – Cepstrum and MFCC
  – Linear Predictive Coefficients (LPC), LPCC
• ZCR
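A minimal sketch of the zero-crossing rate of one frame, assuming it is defined as the fraction of adjacent sample pairs whose signs differ (other normalizations are in use). The name `zero_crossing_rate` is illustrative.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs with different signs.

    A sample equal to zero is treated as non-negative (an assumption).
    """
    if len(frame) < 2:
        return 0.0
    crossings = 0
    for prev, curr in zip(frame, frame[1:]):
        if (prev >= 0) != (curr >= 0):
            crossings += 1
    return crossings / (len(frame) - 1)
```

Noise-like (unvoiced) frames tend to give a high value and voiced frames a low one, which is why the measure is often paired with energy.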
• Level-crossing-rate
• Peaks
• Power and energy
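A short sketch of the two quantities for a single frame, assuming the usual definitions: energy as the sum of squared samples and power as energy divided by the frame length. The function names are illustrative.

```python
def short_time_energy(frame):
    """Sum of squared samples in the frame."""
    return sum(x * x for x in frame)


def short_time_power(frame):
    """Energy averaged over the frame length (0.0 for an empty frame)."""
    return short_time_energy(frame) / len(frame) if frame else 0.0
```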
• Correlation, auto-correlation, AMDF
  – To measure the similarity of two signals or to detect the periodicity of a signal
  – Sum x(k+i)*x(k+m+i) over i in a range, where k is the reference point and m is the lag
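The sum above translates directly into code. The sketch below keeps the slide's symbols `k` (reference point) and `m` (lag) and adds a parameter `n` for the number of terms, which is an assumption of this example. The AMDF is written with its usual definition, the mean of |x(k+i) - x(k+m+i)|, which the slide does not spell out.

```python
def autocorrelation(x, k, m, n):
    """Sum of x[k+i] * x[k+m+i] for i = 0 .. n-1."""
    return sum(x[k + i] * x[k + m + i] for i in range(n))


def amdf(x, k, m, n):
    """Average magnitude difference: mean of |x[k+i] - x[k+m+i]|, i = 0 .. n-1."""
    return sum(abs(x[k + i] - x[k + m + i]) for i in range(n)) / n
```

For a periodic signal, the auto-correlation peaks and the AMDF dips when the lag `m` equals the period, so either can be scanned over `m` to detect periodicity.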
• Center-clipping technique
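A minimal sketch of center clipping as it is commonly described: samples inside a band around zero are set to zero and the rest are shifted toward zero by the clipping level. Setting the level to a fraction of the frame's peak, and the default `ratio=0.3`, are assumptions of this example.

```python
def center_clip(frame, ratio=0.3):
    """Zero out samples with |x| <= c and shift the others toward zero by c.

    c = ratio * (peak absolute value of the frame); ratio is an assumed default.
    """
    peak = max((abs(x) for x in frame), default=0.0)
    c = ratio * peak
    clipped = []
    for x in frame:
        if x > c:
            clipped.append(x - c)
        elif x < -c:
            clipped.append(x + c)
        else:
            clipped.append(0.0)
    return clipped
```

Removing the low-amplitude part of the waveform tends to suppress formant structure, which makes the pitch peaks in the auto-correlation easier to pick.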
• Auto-correlation peaks
• Auto-correlation display
• Formant
  – LPC->FFT
• Formant displays
• Some typical formant values
• Pitch, fundamental frequency
  – Referred to as F0; determines tone and prosody
  – Pitch estimation methods
    • Auto-correlation and AMDF
    • Cepstrum
    • LPC
    • Peak detection
  – Pitch smoothing methods
    • Dynamic programming
    • N-point smoothing filter
    • HMM
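As one possible illustration of the auto-correlation method and of an N-point smoothing filter, the sketch below picks the lag with the largest auto-correlation inside a plausible F0 range and then smooths a pitch track with a running median. The search range of 60 Hz to 400 Hz, the use of a median as the smoothing filter, and the function names are all assumptions of this example.

```python
def estimate_pitch(frame, sample_rate, f0_min=60.0, f0_max=400.0):
    """Return an F0 estimate in Hz, or 0.0 if no positive peak is found.

    f0_min and f0_max bound the lag search and are assumed values.
    """
    lag_min = max(1, int(sample_rate / f0_max))
    lag_max = min(int(sample_rate / f0_min), len(frame) - 1)
    best_lag, best_value = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        value = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag if best_lag else 0.0


def median_smooth(track, n=3):
    """N-point running median over a pitch track (window shrinks at the edges)."""
    half = n // 2
    smoothed = []
    for i in range(len(track)):
        window = sorted(track[max(0, i - half): i + half + 1])
        smoothed.append(window[len(window) // 2])
    return smoothed
```

The median removes isolated outliers such as octave errors in a single frame without blurring genuine pitch changes as much as an averaging filter would.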
• Pitch display
  – The pitch of a3 estimated by the auto-correlation method
• Spectrogram
  – Representation of a signal highlighting several of its properties based on short-time Fourier analysis
  – Two dimensional: time horizontal and frequency vertical
  – Third ‘dimension’: gray or color level indicating energy
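A spectrogram along these lines can be sketched with a plain DFT per frame. The code below is an illustration only: the function names, the Hamming window, and the use of squared magnitude as the energy measure are assumptions, and the direct DFT is chosen for readability rather than speed.

```python
import cmath
import math


def power_spectrum(frame):
    """Squared DFT magnitude for bins 0 .. N/2 of a real-valued frame."""
    n = len(frame)
    bins = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        bins.append(abs(acc) ** 2)
    return bins


def spectrogram(samples, frame_len, hop_len):
    """Return columns[t][k]: energy at frame t (time axis) and bin k (frequency axis).

    frame_len must be at least 2; only complete frames are analysed.
    """
    window = [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (frame_len - 1))
              for i in range(frame_len)]
    columns = []
    for start in range(0, len(samples) - frame_len + 1, hop_len):
        frame = [samples[start + i] * window[i] for i in range(frame_len)]
        columns.append(power_spectrum(frame))
    return columns
```

Bin `k` corresponds to the frequency `k * sample_rate / frame_len`; mapping each value (usually on a log scale) to a gray or color level gives the familiar image.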
• Spectrum of a frame (vowel)
• Spectrum of a frame (consonant)
• Cepstrum analysis
• Cepstrum and MFCC computation
  – Cepstrum chain: s(n) -> DFT -> log|DFT| -> IDFT -> cepstrum
  – MFCC branch: Filter-bank -> DCT -> MFCC
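The cepstrum chain on this slide (DFT, log magnitude, inverse DFT) can be sketched as below. The function name `real_cepstrum` and the small `floor` that guards against log(0) are assumptions of this example; a direct DFT is used for clarity.

```python
import cmath
import math


def real_cepstrum(frame, floor=1e-12):
    """Real cepstrum: IDFT of the log magnitude of the DFT of the frame.

    floor is an assumed guard value that avoids taking log(0).
    """
    n = len(frame)
    log_mag = []
    for k in range(n):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        log_mag.append(math.log(max(abs(acc), floor)))
    cepstrum = []
    for q in range(n):
        acc = sum(log_mag[k] * cmath.exp(2j * math.pi * k * q / n) for k in range(n))
        cepstrum.append(acc.real / n)  # imaginary part is ~0 for a real frame
    return cepstrum
```

Low-index cepstral values describe the smooth spectral envelope, while a peak at a higher index points to the pitch period, which is why the cepstrum also appears among the pitch estimation methods.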
• Filter-bank
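Assuming the filter-bank meant here is the triangular mel-scale bank commonly used for MFCC, the sketch below builds such a bank and applies log and a DCT-II to obtain the coefficients. The mel formula 2595 * log10(1 + f/700) is one widely used variant; the function names and parameters are illustrative, not taken from the slides.

```python
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filter_bank(num_filters, num_bins, sample_rate):
    """Triangular filters over num_bins spectrum bins spanning 0 .. sample_rate/2."""
    nyquist = sample_rate / 2.0
    mel_max = hz_to_mel(nyquist)
    # num_filters + 2 edge frequencies, equally spaced on the mel scale
    edges = [mel_to_hz(mel_max * i / (num_filters + 1)) for i in range(num_filters + 2)]
    bin_hz = nyquist / (num_bins - 1)
    bank = []
    for j in range(1, num_filters + 1):
        lo, mid, hi = edges[j - 1], edges[j], edges[j + 1]
        weights = []
        for b in range(num_bins):
            f = b * bin_hz
            if lo < f <= mid:
                weights.append((f - lo) / (mid - lo))   # rising slope
            elif mid < f < hi:
                weights.append((hi - f) / (hi - mid))   # falling slope
            else:
                weights.append(0.0)
        bank.append(weights)
    return bank


def mfcc_from_power_spectrum(power, bank, num_coeffs, floor=1e-12):
    """Log filter-bank energies followed by an unnormalized DCT-II."""
    log_e = [math.log(max(sum(w * p for w, p in zip(weights, power)), floor))
             for weights in bank]
    m = len(log_e)
    return [sum(e * math.cos(math.pi * c * (j + 0.5) / m) for j, e in enumerate(log_e))
            for c in range(num_coeffs)]
```

The `power` argument is expected to hold one frame's power spectrum with `num_bins` values, for example bins 0 to N/2 of a DFT.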
• Perceptual measures
• Linear predictive analysis
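One common way to carry out linear predictive analysis is the autocorrelation method solved with the Levinson-Durbin recursion. The sketch below assumes that method and the sign convention x[n] ≈ a1*x[n-1] + ... + ap*x[n-p]; the slides do not say which variant the lecture uses, and the name `lpc` is illustrative.

```python
def lpc(frame, order):
    """Autocorrelation-method LPC via the Levinson-Durbin recursion.

    Returns (a, err): a[j-1] weights x[n-j] in the prediction of x[n],
    and err is the remaining prediction-error energy.
    """
    n = len(frame)
    r = [sum(frame[i] * frame[i + lag] for i in range(n - lag))
         for lag in range(order + 1)]
    a = [0.0] * (order + 1)  # a[0] is unused padding so indices match the math
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # silent or degenerate frame: stop early
        # Reflection coefficient for order i
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        err *= 1.0 - k * k
    return a[1:], err
```

The returned `err` is the energy of the prediction error for the chosen order; it shrinks or stays equal as the order grows.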
• Prediction errors
• LP coefficients to cepstral coefficients
  – The computation of LPCC
  – LPCC is often used in ASR as a feature vector
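The LPCC computation is usually done with a recursion rather than through a spectrum. The sketch below assumes the all-pole model H(z) = 1 / (1 - a1*z^-1 - ... - ap*z^-p) and omits the gain term c(0); with the opposite sign convention for the coefficients the signs change accordingly. The name `lpc_to_lpcc` is illustrative.

```python
def lpc_to_lpcc(a, num_ceps):
    """Cepstral coefficients c(1) .. c(num_ceps) from LP coefficients a(1) .. a(p).

    Recursion: c(n) = a(n) + sum_{k} (k/n) * c(k) * a(n-k),
    where a(n) is taken as 0 for n > p and k runs over max(1, n-p) .. n-1.
    """
    p = len(a)
    c = [0.0] * (num_ceps + 1)  # c[0] is unused (gain term omitted)
    for n in range(1, num_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c[1:]
```

For a single coefficient a1 the result is a1, a1^2/2, a1^3/3, and so on, which matches the series expansion of -ln(1 - a1*z^-1).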
• Some transformations in SSP
  – DFT, FFT, DCT and their inverses
    • Frequency analysis
    • TD-FD conversion
  – Z transformation
    • LPC analysis
    • Filter design
  – Wavelet transformation
    • Frequency analysis
    • Compression
• Fourier Transform
• Discrete Fourier Transform
The computational load of the DFT is O(N^2); the Fast Fourier Transform (FFT) reduces it to O(N log N) based on the divide-and-conquer principle.
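The difference can be seen by writing both side by side. The sketch below is illustrative: `dft` evaluates the definition directly with N multiplications for each of N outputs, and `fft` is a recursive radix-2 variant that splits the input into even and odd samples. Restricting `fft` to power-of-two lengths is an assumption of this example.

```python
import cmath


def dft(x):
    """Direct DFT: N outputs, each a sum over N inputs, so O(N^2)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def fft(x):
    """Recursive radix-2 FFT (divide and conquer), O(N log N).

    The length must be a power of two.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])  # DFT of even-indexed samples
    odd = fft(x[1::2])   # DFT of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out
```

Both functions return the same values up to rounding; the FFT reuses the two half-length transforms for both halves of the output, which is where the saving comes from.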
• Basic phonetic knowledge
  – Consonant/unvoiced
  – Vowel/voiced
  – Co-articulation
  – Phone and phoneme
  – Uni-, bi-, tri-phone
  – Canonical form, surface form, reduced form
  – Tone and prosody
• Co-articulation
  – Very common in English; it causes many difficulties in ASR
  – In Mandarin, not very serious
  – The use of bi-phones and tri-phones is intended to cope with this issue
  – Some examples:
    • Mandarin: A yi, yi yi, wu yun, …
    • English: this issue, in a box, …
• Some research topics
  – Speech signal detection, endpoint detection
  – Consonant/vowel separation
  – Pitch estimation
  – Echo cancellation
  – De-noising and filter design
  – Multi-signal separation
  – Robust features
  – Perceptual features
  – Re-sampling and re-construction
  – etc.
Test
• Mode
  – A final 4-page report or
  – A 30-min presentation
• Content
  – Review of speech processing
  – Speech features and processing approaches
  – Review of TTS or ASR
  – Audio in computer engineering
THANKS