Introduction to Speech Signal Processing

Introduction to Speech Signal Processing. Dr. Zhang Sen, [email protected], Chinese Academy of Sciences, Beijing, China, 22/06/12


Introduction to Speech Signal Processing. Dr. Zhang Sen, [email protected], Chinese Academy of Sciences, Beijing, China, 2014/9/10. Topics: introduction (sampling and quantization, speech coding); features and analysis (main features, some transformations); text-to-speech (state of the art).



Introduction to Speech Signal Processing

Dr. Zhang Sen

[email protected]

Chinese Academy of Sciences, Beijing, China

23/04/24


• Introduction
  – Sampling and quantization
  – Speech coding

• Features and Analysis
  – Main features
  – Some transformations

• Text-to-Speech
  – State of the art
  – Main approaches

• Speech-to-Text
  – State of the art
  – Main approaches

• Applications
  – Human-machine dialogue systems


• Viewing the speech signal mathematically
  – It can be described by a continuous function, but
  – it is hard to find an explicit analytical form
  – Non-linear
  – Non-stationary (time-varying)
  – Some parts are noise-like
  – Some parts are pseudo-periodic

• Viewing the speech signal physically
  – A wave generated by vibration
  – Transmitted through air or other media


• Analysis approaches
  – Divide-and-conquer
  – Approximation and simplification
  – Transformation between time domain and frequency domain (TD-FD)

• Analysis purposes
  – To find speech features
  – To decide which features are important and which are trivial
  – To find correlations between features
  – To understand how features change
  – To learn how to modify the original signal


• Features can be classified as
  – Time-domain features
  – Frequency-domain features

• Or as
  – Short-term features
  – Long-term features

• Feature representation
  – Numerical: vector or distribution
  – Diagram: curve or image


• Windowing (framing)
  – Over a short interval (10 ms to 25 ms), the non-stationary signal can be treated as stationary
  – and the non-linear signal as approximately linear
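The framing step can be sketched as follows. This is a minimal illustration assuming NumPy, a 25 ms frame, a 10 ms hop and a Hamming window; the frame and hop lengths are assumptions chosen within the 10 ms to 25 ms range above, and `frame_signal` is a hypothetical helper name.

```python
import numpy as np


def frame_signal(x, fs, frame_ms=25.0, hop_ms=10.0):
    """Split x into overlapping frames and taper each with a Hamming window."""
    x = np.asarray(x, dtype=float)
    frame_len = int(round(fs * frame_ms / 1000.0))
    hop = int(round(fs * hop_ms / 1000.0))
    if len(x) < frame_len:
        # Zero-pad short inputs so that at least one full frame exists.
        x = np.pad(x, (0, frame_len - len(x)))
    n_frames = 1 + (len(x) - frame_len) // hop
    # Row i of idx holds the sample indices i*hop ... i*hop + frame_len - 1.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(frame_len)


fs = 16000
x = np.random.default_rng(0).standard_normal(fs)  # 1 s of noise standing in for speech
frames = frame_signal(x, fs)
print(frames.shape)  # (98, 400): 98 frames of 400 samples each
```

Each row can then be analysed as if the signal were stationary over that frame.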


• Window types


• Window shapes


• A few words on window functions
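The three windows most often met in speech analysis are the rectangular, Hanning (Hann) and Hamming windows. The sketch below writes out their standard symmetric definitions with NumPy; the function names are illustrative, and the frame length N = 400 is an assumption. Tapered windows trade a wider main lobe for lower spectral leakage than the rectangular window.

```python
import numpy as np


def rectangular(N):
    """Rectangular window: every sample weighted equally."""
    return np.ones(N)


def hanning(N):
    """Hann(ing) window: raised cosine that falls to zero at both ends."""
    n = np.arange(N)
    return 0.5 - 0.5 * np.cos(2 * np.pi * n / (N - 1))


def hamming(N):
    """Hamming window: raised cosine that stops at 0.08 at both ends."""
    n = np.arange(N)
    return 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))


N = 400  # 25 ms at 16 kHz (assumed frame length)
for name, w in [("rectangular", rectangular(N)),
                ("hanning", hanning(N)),
                ("hamming", hamming(N))]:
    print(f"{name:12s} w[0]={w[0]:.2f} w[N//2]={w[N // 2]:.2f}")
```

The hand-written Hanning and Hamming definitions match NumPy's `np.hanning` and `np.hamming`.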


• Commonly used speech features
  – Zero-crossing rate (ZCR)
  – Peaks
  – Power and energy
  – Correlation, auto-correlation, AMDF
  – Formant
  – Pitch
  – Frequency spectrum
  – Cepstrum and MFCC
  – Linear Predictive Coefficients (LPC), LPCC


• ZCR
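The zero-crossing rate is typically low for voiced, low-frequency-dominated segments and high for unvoiced, noise-like segments. A minimal NumPy sketch, assuming ZCR is normalised by the number of adjacent sample pairs (definitions differ; some count crossings per second), with `zero_crossing_rate` as an illustrative name:

```python
import numpy as np


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.sign(np.asarray(frame, dtype=float))
    signs[signs == 0] = 1  # count exact zeros as positive
    return float(np.mean(signs[1:] != signs[:-1]))


fs = 8000
n = np.arange(800)
tone = np.sin(2 * np.pi * 1000 * n / fs + 0.1)         # 1 kHz tone
noise = np.random.default_rng(0).standard_normal(800)  # white noise
print(zero_crossing_rate(tone))   # about 0.25 (2 crossings per 8-sample period)
print(zero_crossing_rate(noise))
```

For white noise the value comes out close to 0.5, since each pair of independent zero-mean samples has opposite signs half the time.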


• Level-crossing rate
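The level-crossing rate generalises ZCR from level 0 to an arbitrary level L. The sketch below assumes the same pair-normalised definition as for ZCR; `level_crossing_rate` is an illustrative name, not a library function.

```python
import numpy as np


def level_crossing_rate(frame, level):
    """Fraction of adjacent sample pairs that lie on opposite sides of `level`."""
    above = np.asarray(frame) >= level
    return float(np.mean(above[1:] != above[:-1]))


n = np.arange(800)
tone = np.sin(2 * np.pi * n / 8 + 0.1)  # period of 8 samples, amplitude 1
print(level_crossing_rate(tone, 0.5))   # about 0.25: the tone crosses 0.5 twice per period
print(level_crossing_rate(tone, 2.0))   # 0.0: the level is above the peak amplitude
```

With `level=0` this reduces to the zero-crossing rate, apart from how exact zeros are assigned.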


• Peaks


• Power and energy
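Short-time energy and power are the simplest frame-level features; comparing frame energy with a threshold is a common first step in separating speech from silence. This sketch assumes energy is the sum of squared samples and power is energy divided by the frame length (some texts use the sum of magnitudes instead); the helper names are illustrative.

```python
import numpy as np


def short_time_energy(frame):
    """E = sum of squared samples in the frame."""
    frame = np.asarray(frame, dtype=float)
    return float(np.sum(frame ** 2))


def short_time_power(frame):
    """P = E / N, the mean squared amplitude of the frame."""
    return short_time_energy(frame) / len(frame)


def log_energy_db(frame, eps=1e-12):
    """Energy in decibels; eps avoids log(0) on silent frames."""
    return 10.0 * np.log10(short_time_energy(frame) + eps)


frame = np.full(400, 0.5)
print(short_time_energy(frame))  # 100.0
print(short_time_power(frame))   # 0.25
```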


• Correlation, auto-correlation, AMDF (average magnitude difference function)
  – Used to measure the similarity of two signals or to detect the periodicity of a signal
  – Auto-correlation: sum x(k+i)*x(k+m+i) over a range of i, where k is the reference point and m is the lag
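The auto-correlation sum above, and the AMDF as its difference-based counterpart, can be sketched directly in NumPy. The summation length N, the AMDF normalisation by N and the function names are assumptions for illustration. For a perfectly periodic signal with period P, R_k(P) equals R_k(0) and the AMDF drops to zero at lag P.

```python
import numpy as np


def autocorrelation(x, k, N, max_lag):
    """R_k(m) = sum_{i=0}^{N-1} x[k+i] * x[k+m+i] for lags m = 0..max_lag."""
    seg = np.asarray(x[k:k + N + max_lag], dtype=float)
    return np.array([np.dot(seg[:N], seg[m:m + N]) for m in range(max_lag + 1)])


def amdf(x, k, N, max_lag):
    """D_k(m) = (1/N) * sum_{i=0}^{N-1} |x[k+i] - x[k+m+i]|."""
    seg = np.asarray(x[k:k + N + max_lag], dtype=float)
    return np.array([np.mean(np.abs(seg[:N] - seg[m:m + N])) for m in range(max_lag + 1)])


fs = 8000
n = np.arange(2000)
# 100 Hz periodic test signal (period P = 80 samples) with one harmonic.
x = np.sin(2 * np.pi * 100 * n / fs) + 0.5 * np.sin(2 * np.pi * 200 * n / fs)
R = autocorrelation(x, k=400, N=320, max_lag=160)
D = amdf(x, k=400, N=320, max_lag=160)
print(R[80] / R[0])  # 1 up to rounding: the signal repeats after 80 samples
print(D[80])         # about 0: the AMDF dips at the period
```

At half the period (lag 40) the auto-correlation is negative and the AMDF is large, which is why peaks of R (or valleys of D) mark the period.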


• Center-clipping technique
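Center clipping suppresses low-amplitude detail before auto-correlation so that periodicity peaks stand out more clearly. A minimal sketch, assuming the clipping level is 30% of the frame's peak magnitude (a commonly quoted choice, treated here as an assumption); `center_clip` is an illustrative name.

```python
import numpy as np


def center_clip(frame, ratio=0.3):
    """Zero samples inside [-C, C] and shift the rest toward zero by C."""
    frame = np.asarray(frame, dtype=float)
    c = ratio * np.max(np.abs(frame))  # clipping level C, assumed 30% of the peak
    y = np.zeros_like(frame)
    y[frame > c] = frame[frame > c] - c
    y[frame < -c] = frame[frame < -c] + c
    return y


print(center_clip([0.1, -0.2, 0.5, -1.0, 0.35]))
```

Here the peak is 1.0, so C = 0.3: the two small samples become 0 and the others are pulled toward zero by 0.3.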


• Auto-correlation peaks


• Auto-correlation display


• Formant
  – Estimated by LPC analysis followed by an FFT of the LPC model (LPC->FFT)
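The LPC->FFT route can be sketched as: fit an all-pole model to a windowed frame, take the FFT of the predictor polynomial A(z), and read formant candidates off the peaks of the envelope 1/|A|. This is a toy sketch assuming NumPy, an LPC order of 10 and a 512-point FFT; the helper names are hypothetical, and practical formant trackers usually add pre-emphasis and discard wide-bandwidth poles. The test signal (two sinusoids) only checks that the peaks land where expected.

```python
import numpy as np


def lpc(frame, order):
    """Autocorrelation-method LPC: solve R a = r so that x[n] ~ sum_k a[k] x[n-k]."""
    n = len(frame)
    r = np.array([np.dot(frame[:n - m], frame[m:]) for m in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:])


def formant_candidates(frame, fs, order=10, nfft=512):
    """Frequencies (Hz) of local maxima of the LPC envelope, strongest first."""
    a = lpc(frame * np.hamming(len(frame)), order)
    A = np.fft.rfft(np.concatenate(([1.0], -a)), nfft)  # A(e^jw) = 1 - sum a_k e^-jwk
    envelope = 1.0 / np.abs(A)                         # all-pole spectral envelope
    freqs = np.arange(len(envelope)) * fs / nfft
    peaks = [i for i in range(1, len(envelope) - 1)
             if envelope[i] > envelope[i - 1] and envelope[i] > envelope[i + 1]]
    peaks.sort(key=lambda i: envelope[i], reverse=True)
    return freqs[peaks]


fs = 8000
n = np.arange(256)
rng = np.random.default_rng(0)
frame = (np.sin(2 * np.pi * 700 * n / fs) + np.sin(2 * np.pi * 1200 * n / fs)
         + 0.01 * rng.standard_normal(256))
print(formant_candidates(frame, fs)[:2])  # strongest envelope peaks (Hz)
```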


• Formant displays


• Some typical formant values


• Pitch (fundamental frequency)
  – Referred to as F0; it determines tone and prosody
  – Pitch estimation methods
    • Auto-correlation and AMDF
    • Cepstrum
    • LPC
    • Peak detection
  – Pitch smoothing methods
    • Dynamic programming
    • N-point smoothing filter
    • HMM


• Pitch display
  – The pitch of a3 estimated with the auto-correlation method


• Spectrogram
  – A representation of a signal, based on short-time Fourier analysis, that highlights several of its properties
  – Two dimensions: time on the horizontal axis, frequency on the vertical axis
  – A third 'dimension': gray or color level indicating energy
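A spectrogram as described above can be sketched with NumPy's FFT: window each frame, take the power spectrum, convert to dB, and transpose so time runs horizontally. The 25 ms frame, 10 ms hop, Hamming window and 512-point FFT are assumptions; `spectrogram_db` is an illustrative name.

```python
import numpy as np


def spectrogram_db(x, fs, frame_ms=25.0, hop_ms=10.0, nfft=512):
    """Short-time Fourier power in dB: rows = frequency, columns = time."""
    x = np.asarray(x, dtype=float)
    frame_len = int(round(fs * frame_ms / 1000.0))
    hop = int(round(fs * hop_ms / 1000.0))
    n_frames = 1 + (len(x) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([x[i * hop:i * hop + frame_len] * window for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=nfft, axis=1)) ** 2
    S = 10.0 * np.log10(power + 1e-12).T          # transpose: time runs horizontally
    freqs = np.arange(nfft // 2 + 1) * fs / nfft  # vertical axis (Hz)
    times = np.arange(n_frames) * hop / fs        # horizontal axis (s)
    return S, freqs, times


fs = 8000
t = np.arange(fs) / fs
S, freqs, times = spectrogram_db(np.sin(2 * np.pi * 1000 * t), fs)
print(S.shape)                    # (257, 98)
print(freqs[np.argmax(S[:, 0])])  # 1000.0
```

Displaying `S` as an image (gray or color level per cell) gives the usual spectrogram picture.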


• Spectrum of a frame (vowel)


• Spectrum of a frame (consonant)


• Cepstrum analysis
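Because the logarithm turns the product of excitation and vocal-tract spectra into a sum, the cepstrum separates them: low quefrencies describe the spectral envelope, and voiced speech shows a peak at the pitch period. A minimal real-cepstrum sketch, assuming NumPy and a small `eps` to avoid log(0); the demo checks the simplest separable case, a pure gain, which only changes c[0].

```python
import numpy as np


def real_cepstrum(frame, eps=1e-12):
    """c[n] = IDFT(log|DFT(s)|); eps guards against log(0)."""
    spectrum = np.fft.fft(frame)
    log_mag = np.log(np.abs(spectrum) + eps)
    return np.real(np.fft.ifft(log_mag))


rng = np.random.default_rng(0)
s = rng.standard_normal(512)
c1 = real_cepstrum(s)
c2 = real_cepstrum(3.0 * s)
print(c2[0] - c1[0])  # log(3), about 1.0986: a pure gain only changes c[0]
```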


• Cepstrum and MFCC computation

  – Cepstrum: s(n) -> DFT -> log|DFT| -> IDFT -> cepstrum
  – MFCC: s(n) -> DFT -> filter-bank -> log -> DCT -> MFCC
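The MFCC chain can be sketched end to end in NumPy. Assumptions: 26 triangular mel filters, 13 coefficients, a 512-point FFT, the mel formula 2595·log10(1 + f/700) (one of several variants in use) and an unnormalised DCT-II; all function names are illustrative rather than any library's API.

```python
import numpy as np


def hz_to_mel(f):
    """One common mel-scale formula (an assumption; variants exist)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_filters, nfft, fs):
    """Triangular filters equally spaced on the mel scale; rows are filters."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_points) / fs).astype(int)
    fb = np.zeros((n_filters, nfft // 2 + 1))
    for j in range(1, n_filters + 1):
        left, center, right = bins[j - 1], bins[j], bins[j + 1]
        for k in range(left, center):   # rising edge
            fb[j - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):  # falling edge
            fb[j - 1, k] = (right - k) / max(right - center, 1)
    return fb


def dct_ii(x, n_out):
    """Unnormalised DCT-II, keeping the first n_out coefficients."""
    N = len(x)
    n = np.arange(N)
    return np.array([np.sum(x * np.cos(np.pi * k * (2 * n + 1) / (2 * N)))
                     for k in range(n_out)])


def mfcc(frame, fs, n_filters=26, n_ceps=13, nfft=512):
    """s(n) -> DFT -> mel filter-bank -> log -> DCT -> MFCC."""
    power = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), nfft)) ** 2
    energies = mel_filterbank(n_filters, nfft, fs) @ power
    return dct_ii(np.log(energies + 1e-12), n_ceps)


fs = 16000
frame = np.random.default_rng(0).standard_normal(400)  # one 25 ms frame
coeffs = mfcc(frame, fs)
print(coeffs.shape)  # (13,)
```

The filter bank mimics the ear's coarser frequency resolution at high frequencies, and the DCT decorrelates the log filter-bank energies.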


• Filter-bank


• Perceptual measures


• Linear predictive analysis


• Prediction errors
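Linear predictive analysis models each sample as a weighted sum of the previous p samples, x̂(n) = Σ a_k x(n−k), and chooses the a_k that minimise the prediction error e(n) = x(n) − x̂(n). The sketch below assumes the autocorrelation method solved with the Levinson-Durbin recursion; the helper names are illustrative, and the synthetic AR(2) test signal is an assumption used only to check that the known coefficients are recovered.

```python
import numpy as np


def autocorr(x, max_lag):
    n = len(x)
    return np.array([np.dot(x[:n - m], x[m:]) for m in range(max_lag + 1)])


def levinson_durbin(r, order):
    """Solve the LPC normal equations; returns (a_1..a_p, final prediction error)."""
    a = np.zeros(order + 1)  # a[0] unused; x_hat[n] = sum_{k=1}^{p} a[k] x[n-k]
    err = r[0]
    for i in range(1, order + 1):
        k = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / err  # reflection coefficient
        prev = a.copy()
        a[i] = k
        a[1:i] = prev[1:i] - k * prev[i - 1:0:-1]
        err *= 1.0 - k * k
    return a[1:], err


def prediction_error(x, a):
    """e[n] = x[n] - sum_k a_k x[n-k] (samples before n = p use fewer terms)."""
    e = np.array(x, dtype=float)
    for k, ak in enumerate(a, start=1):
        e[k:] -= ak * x[:-k]
    return e


# Synthetic AR(2) process: x[n] = 1.3 x[n-1] - 0.6 x[n-2] + w[n]
rng = np.random.default_rng(0)
w = rng.standard_normal(10000)
x = np.zeros_like(w)
for n in range(len(w)):
    x[n] = w[n] + (1.3 * x[n - 1] if n >= 1 else 0.0) - (0.6 * x[n - 2] if n >= 2 else 0.0)

a, err = levinson_durbin(autocorr(x, 2), 2)
print(a)                                   # close to [1.3, -0.6]
print(np.var(prediction_error(x, a)[2:]))  # close to 1, the variance of w
```

When the model fits, the prediction error is close to the white driving noise; for speech, the residual approximates the excitation.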


• LP coefficients to cepstral coefficients
  – The computation of LPCC
  – LPCC is often used as the feature vector in ASR
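For an all-pole model 1/A(z) with predictor coefficients a_1..a_p, the cepstral coefficients follow from the recursion c_n = a_n + Σ_{k=1}^{n−1} (k/n) c_k a_{n−k} for n ≤ p, and c_n = Σ_{k=n−p}^{n−1} (k/n) c_k a_{n−k} for n > p. A minimal sketch, assuming the predictor sign convention x̂(n) = Σ a_k x(n−k) and omitting the gain term c_0; `lpc_to_cepstrum` is an illustrative name.

```python
import numpy as np


def lpc_to_cepstrum(a, n_ceps):
    """LPCC c_1..c_n from predictor coefficients a_1..a_p (gain term c_0 omitted)."""
    p = len(a)
    c = np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c[1:]


# Check against a case with a closed form: poles at 0.5 and 0.3 give
# 1/A(z) = 1/((1 - 0.5 z^-1)(1 - 0.3 z^-1)), so c_n = (0.5**n + 0.3**n) / n.
a = [0.5 + 0.3, -0.5 * 0.3]
c = lpc_to_cepstrum(a, 6)
expected = [(0.5 ** n + 0.3 ** n) / n for n in range(1, 7)]
print(np.allclose(c, expected))  # True
```

The recursion avoids computing an FFT and a logarithm, which is one reason LPCC was cheap to use as an ASR feature.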


• Some transformations in SSP
  – DFT, FFT, DCT and their inverses
    • Frequency analysis
    • TD-FD conversion
  – Z-transform
    • LPC analysis
    • Filter design
  – Wavelet transform
    • Frequency analysis
    • Compression


• Fourier Transform


• Discrete Fourier Transform

The computational cost of the DFT is O(N^2); the Fast Fourier Transform (FFT) reduces it to O(N log N) using the divide-and-conquer principle.
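The contrast can be sketched by writing the DFT as an N x N matrix-vector product and the FFT as the recursive radix-2 (Cooley-Tukey) split into even- and odd-indexed halves. This is an illustration assuming the length is a power of two, with `numpy.fft.fft` used only as a reference; `dft` and `fft_radix2` are illustrative names. For N = 1024, N^2 is about 10^6 while N log2 N is about 10^4.

```python
import numpy as np


def dft(x):
    """Direct DFT: an N x N matrix-vector product, O(N^2) operations."""
    N = len(x)
    n = np.arange(N)
    W = np.exp(-2j * np.pi * np.outer(n, n) / N)
    return W @ np.asarray(x, dtype=complex)


def fft_radix2(x):
    """Recursive radix-2 FFT (length must be a power of two), O(N log N)."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    if N <= 1:
        return x
    if N % 2:
        raise ValueError("length must be a power of two")
    even = fft_radix2(x[0::2])  # DFT of even-indexed samples
    odd = fft_radix2(x[1::2])   # DFT of odd-indexed samples
    twiddle = np.exp(-2j * np.pi * np.arange(N // 2) / N) * odd
    return np.concatenate([even + twiddle, even - twiddle])


x = np.random.default_rng(0).standard_normal(64)
print(np.allclose(dft(x), np.fft.fft(x)), np.allclose(fft_radix2(x), np.fft.fft(x)))  # True True
```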


• Basic phonetic knowledge
  – Consonant/unvoiced
  – Vowel/voiced
  – Co-articulation
  – Phone and phoneme
  – Uni-, bi-, tri-phone
  – Canonical form, surface form, reduced form
  – Tone and prosody


• Co-articulation
  – Very common in English; it causes many difficulties in ASR
  – In Mandarin, it is not very serious
  – Bi-phones and tri-phones are used to cope with this issue
  – Some examples:
    • Mandarin: A yi, yi yi, wu yun, …
    • English: this issue, in a box, …


• Some research topics
  – Speech signal detection, endpoint detection
  – Consonant/vowel separation
  – Pitch estimation
  – Echo cancellation
  – De-noising and filter design
  – Multi-signal separation
  – Robust features
  – Perceptual features
  – Re-sampling and reconstruction
  – etc.



Test

• Mode
  – A final 4-page report, or
  – A 30-minute presentation

• Content
  – Review of speech processing
  – Speech features and processing approaches
  – Review of TTS or ASR
  – Audio in computer engineering


THANKS