Extraction of features, MM5.ppt - Aalborg...


Page 1: Extraction of features, MM5.ppt - Aalborg Universitetkom.aau.dk/~zt/courses/Feature_extraction/Extraction of features... · Extraction of Features, V, Zheng-Hua Tan 2. 2 Extraction

Extraction of Features, V, Zheng-Hua Tan 1

Extraction and Representation of Features, Spring 2011

Zheng-Hua Tan

Multimedia Information and Signal Processing, Department of Electronic Systems

Aalborg University, Denmark

Lecture 5: Speech and Audio Analysis

Feature extraction

A special form of dimensionality reduction, used when the input data is too large to be stored or processed, or redundant (much data, but not much information).

Data is transformed into a compact representation – a set of features.

Speech and audio signals have a lot in common.

Speech analysis

Most applications of speech processing must exploit the properties of speech signals.

Speech analysis: the process of extracting such properties from a speech signal.

[Diagram: speech → speech analysis (DSP) → representation of speech → applications, e.g. pitch, formants, boundary detection, speech and speaker recognition]

Speech analysis: Short-time analysis

Short-time speech analysis

Time-domain processing

Frequency-domain (spectral) processing

Linear predictive coding (LPC) analysis

Cepstral analysis

Filter bank analysis

Properties of speech signals

Speech is a time-varying signal, varying in excitation, pitch, and amplitude.

Short-time processing solution

Assuming that speech has non-time-varying properties (fixed excitation and vocal tract) within short intervals

Processing short segments (frames) of the speech signal each time

$$f_x(n, m) = x(m)\, w(n - m)$$

Frame-by-frame processing

Frames often overlap one another

The frame-based analysis yields a time-varying sequence as a new representation of the speech signal: samples at 8000/sec become feature vectors at 100/sec.
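As a minimal sketch of frame-by-frame processing, assuming NumPy, the signal can be cut into overlapping frames by index arithmetic. The function name `frame_signal`, the 25 ms frame length, and the 10 ms hop are illustrative choices, not values fixed by the lecture; the hop of 80 samples at 8000 samples/sec is what gives roughly 100 frames per second.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split x into overlapping frames; returns an array (num_frames, frame_len)."""
    x = np.asarray(x, dtype=float)
    if len(x) < frame_len:
        return np.empty((0, frame_len))
    num_frames = 1 + (len(x) - frame_len) // hop
    # Row i holds samples i*hop .. i*hop + frame_len - 1.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(num_frames)[:, None]
    return x[idx]

fs = 8000                                   # samples per second
x = np.arange(fs, dtype=float)              # one second of placeholder samples
frames = frame_signal(x, frame_len=200, hop=80)   # 25 ms frames every 10 ms
```

Each row of `frames` would then be multiplied by a window and reduced to one feature vector.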

Windows

Rectangular window

$$w[n] = 1, \quad 0 \le n \le N-1$$

Hamming window

$$w[n] = 0.54 - 0.46 \cos\!\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1$$
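The two window definitions can be written out directly. This is a small sketch assuming NumPy; the function names are illustrative. NumPy's built-in `np.hamming` uses the same 0.54/0.46 formula, so it can serve as a cross-check.

```python
import numpy as np

def rectangular_window(N):
    """w[n] = 1 for 0 <= n <= N-1."""
    return np.ones(N)

def hamming_window(N):
    """w[n] = 0.54 - 0.46 cos(2 pi n / (N-1)) for 0 <= n <= N-1."""
    n = np.arange(N)
    return 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))

w_rect = rectangular_window(200)
w_ham = hamming_window(200)
```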

Choice of window

Window type: the bandwidth of the Hamming window is about twice that of the rectangular window.

Outside the passband, the Hamming window attenuates by more than 40 dB, compared with 14 dB for the rectangular window.

Window duration N: increasing N decreases the window bandwidth.

N should be longer than a pitch period, but shorter than the duration of a sound.

Speech analysis: Time-domain

Short-time speech analysis

Time-domain speech processing

Frequency-domain (spectral) processing

Linear predictive coding (LPC) analysis

Cepstral analysis

Filter bank analysis

Time-domain parameters

Short-time energy

Short-time average magnitude

Short-time zero crossing rate

Short-time autocorrelation

Short-time average magnitude difference

Short-time energy

The long-term energy definition is not useful for time-varying signals:

$$E = \sum_{m=-\infty}^{\infty} x^2(m)$$

The short-time energy of the weighted signal around n is defined as

$$E_n = \sum_{m=-\infty}^{\infty} [x(m)\, w(n-m)]^2$$
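Because the short-time energy is a sum of x²(m) weighted by w²(n−m), it can be sketched as a convolution of the squared signal with the squared window. This assumes NumPy and a causal window defined on 0..N−1; the name `short_time_energy` is illustrative.

```python
import numpy as np

def short_time_energy(x, w):
    """E_n = sum_m [x(m) w(n-m)]^2, evaluated for n = 0 .. len(x)-1."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    # Convolving x^2 with w^2 gives the weighted sum for every n at once.
    return np.convolve(x ** 2, w ** 2)[:len(x)]

E = short_time_energy(np.ones(10), np.ones(4))   # rectangular window, N = 4
```

For the constant test signal, the energy ramps up over the first N samples and then stays at N.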

Examples of short-time energy

It can be used to detect voiced, unvoiced, and silent segments.

What are the effects of window type and duration N (bandwidth), and why?

Uttered by a male speaker. Two plots converge as N increases.

Short-time magnitude

Less sensitive to large signal levels than the short-time energy, which uses x²(n) terms.

$$M_n = \sum_{m=-\infty}^{\infty} |x(m)|\, w(n-m)$$
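The difference in sensitivity can be illustrated with a small sketch, assuming NumPy and a causal window; both function names are illustrative. Scaling the input by 10 scales the magnitude by 10 but the energy by 100.

```python
import numpy as np

def short_time_magnitude(x, w):
    """M_n = sum_m |x(m)| w(n-m), for n = 0 .. len(x)-1."""
    x = np.asarray(x, dtype=float)
    return np.convolve(np.abs(x), np.asarray(w, dtype=float))[:len(x)]

def short_time_energy(x, w):
    """E_n = sum_m [x(m) w(n-m)]^2, for n = 0 .. len(x)-1."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    return np.convolve(x ** 2, w ** 2)[:len(x)]

x = np.array([1.0, -2.0, 3.0, -4.0])
w = np.ones(2)                         # rectangular window, N = 2
M = short_time_magnitude(x, w)
E = short_time_energy(x, w)
M_loud = short_time_magnitude(10 * x, w)
E_loud = short_time_energy(10 * x, w)
```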

Short-time average zero-crossing rate

A zero-crossing occurs if successive samples have different algebraic signs.

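It is a measure of the frequency. Definition:

$$Z_n = \sum_{m=-\infty}^{\infty} \big|\,\mathrm{sgn}[x(m)] - \mathrm{sgn}[x(m-1)]\,\big|\; w(n-m)$$

where

$$\mathrm{sgn}[x(n)] = \begin{cases} 1, & x(n) \ge 0 \\ -1, & x(n) < 0 \end{cases}$$

and

$$w(n) = \begin{cases} \dfrac{1}{2N}, & 0 \le n \le N-1 \\ 0, & \text{otherwise} \end{cases}$$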
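As a minimal sketch of this definition, assuming NumPy, the rate can be computed by convolving the sign-change indicator with the window of height 1/(2N). The function name `zero_crossing_rate` and the 1 kHz test tone are illustrative, not from the lecture.

```python
import numpy as np

def zero_crossing_rate(x, N):
    """Z_n = sum_m |sgn x(m) - sgn x(m-1)| w(n-m), with w = 1/(2N) on 0..N-1."""
    x = np.asarray(x, dtype=float)
    sgn = np.where(x >= 0, 1.0, -1.0)
    # |sgn x(m) - sgn x(m-1)| is 2 at a crossing and 0 elsewhere (0 at m = 0).
    d = np.abs(np.diff(sgn, prepend=sgn[0]))
    w = np.full(N, 1.0 / (2 * N))
    return np.convolve(d, w)[:len(x)]

fs = 8000
n = np.arange(400)
x = np.sin(2 * np.pi * 1000 * n / fs + 0.1)    # 1 kHz tone, small phase offset
Z = zero_crossing_rate(x, N=80)                # 10 ms window
f_est = Z[200] * fs / 2                        # two crossings per cycle
```

With this window Z_n is the number of crossings per sample, so Z_n · fs / 2 gives a rough frequency estimate for a narrowband signal.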

Zero-crossing rate distributions

A histogram of average zero-crossing rates (averaged over 10 msec) for both voiced and unvoiced speech

In different frequency bands

Scale: 80 zero crossings per 10 ms correspond to (80/2)/10 ms = 4 kHz, since each cycle contributes two crossings.

Example of zero-crossing rate

Although the zero-crossing rate varies considerably, the voiced and unvoiced regions are quite prominent.

Short-time autocorrelation function

The autocorrelation function

$$\phi(k) = \sum_{m=-\infty}^{\infty} x(m)\, x(m+k)$$

The short-time autocorrelation function

$$R_n(k) = \sum_{m=-\infty}^{\infty} x(m)\, w(n-m)\; x(m+k)\, w(n-k-m)$$

Applications

Boundary detection: short-time energy, zero-crossing rate

Pitch estimation: short-time autocorrelation function
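A hedged sketch of pitch estimation from the short-time autocorrelation, assuming NumPy: the autocorrelation of one windowed frame is computed for a range of lags, and the lag with the largest value inside a plausible pitch range is taken as the pitch period. The function names, the 50 to 400 Hz search range, and the synthetic pulse train are assumptions made for illustration.

```python
import numpy as np

def short_time_autocorr(frame, max_lag):
    """R(k) = sum_m f(m) f(m+k) for an already windowed frame f, k = 0..max_lag."""
    frame = np.asarray(frame, dtype=float)
    N = len(frame)
    return np.array([np.dot(frame[:N - k], frame[k:]) for k in range(max_lag + 1)])

def estimate_pitch(frame, fs, fmin=50.0, fmax=400.0):
    """Return the pitch in Hz from the autocorrelation peak between fmin and fmax."""
    lag_min = int(fs / fmax)
    lag_max = int(fs / fmin)          # the frame must be longer than lag_max
    R = short_time_autocorr(frame, lag_max)
    lag = lag_min + int(np.argmax(R[lag_min:lag_max + 1]))
    return fs / lag

fs = 8000
frame = np.zeros(400)                 # 50 ms frame, rectangular window
frame[::80] = 1.0                     # pulse every 80 samples, i.e. 100 Hz
pitch = estimate_pitch(frame, fs)
```

Real speech needs a voicing decision first, since the peak is meaningful only in voiced frames.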

Speech analysis: Frequency-domain

Short-time speech analysis

Time-domain speech processing

Frequency-domain (spectral) processing

Linear predictive coding (LPC) analysis

Cepstral analysis

Filter bank analysis

Discrete-time Fourier transform

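$$X(e^{j\omega}) = \sum_{n=-\infty}^{\infty} x[n]\, e^{-j\omega n}$$

$$x[n] = \frac{1}{2\pi} \int_{-\pi}^{\pi} X(e^{j\omega})\, e^{j\omega n}\, d\omega$$

Convolution and multiplication duality:

$$y[n] = x[n] * h[n] \;\Longleftrightarrow\; Y(e^{j\omega}) = X(e^{j\omega})\, H(e^{j\omega})$$

$$y[n] = x[n]\, w[n] \;\Longleftrightarrow\; Y(e^{j\omega}) = \frac{1}{2\pi} \int_{-\pi}^{\pi} W(e^{j\theta})\, X(e^{j(\omega-\theta)})\, d\theta$$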

Short-time Fourier transform

It is motivated by the need for a spectral representation to reflect the time-varying properties of the speech waveform

$$X_n(e^{j\omega}) = \sum_{m=-\infty}^{\infty} w[n-m]\, x[m]\, e^{-j\omega m}$$
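For a finite window of length N, the short-time Fourier transform at time n reduces to the DFT of one windowed segment. The sketch below assumes NumPy and a causal window on 0..N−1; the name `stft_frame` is illustrative. It returns the DFT of the segment, which differs from X_n(e^{jω}) only by a linear phase factor, so the magnitudes agree.

```python
import numpy as np

def stft_frame(x, n, w, nfft):
    """DFT of the segment x[n-N+1 .. n] weighted by w[n-m]; requires n >= N-1."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    N = len(w)
    # m runs from n-N+1 to n, so the weights w[n-m] run from w[N-1] down to w[0].
    seg = x[n - N + 1:n + 1] * w[::-1]
    return np.fft.rfft(seg, n=nfft)

fs = 8000
t = np.arange(1000)
x = np.cos(2 * np.pi * 1000 * t / fs)           # 1 kHz test tone
X = stft_frame(x, n=499, w=np.hamming(200), nfft=256)
peak_hz = np.argmax(np.abs(X)) * fs / 256
```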

Spectra

Spectra of voiced/unvoiced sounds

Spectrogram

Spectrogram: a two-dimensional waveform (amplitude/time) is converted into a three-dimensional pattern (amplitude/frequency/time).

Wideband spectrogram: analyzed on 15 ms sections of the waveform with a step of 1 ms. Voiced regions show vertical striations due to the periodicity of the time waveform (each vertical line represents a pulse of the vocal folds), while unvoiced regions are solid/random, or 'snowy'.

Narrowband spectrogram: analyzed on 50 ms sections. The pitch of voiced intervals appears as horizontal lines.
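The two analysis settings can be sketched with one function that only changes the window length. This assumes NumPy, a Hamming window, an FFT size of 512, and a signal at least one window long; the name `spectrogram` and those defaults are illustrative choices, not taken from the lecture.

```python
import numpy as np

def spectrogram(x, fs, win_ms, step_ms, nfft=512):
    """Log-magnitude spectrogram in dB; rows are frames, columns are frequency bins."""
    x = np.asarray(x, dtype=float)
    N = int(round(fs * win_ms / 1000))      # window length in samples (N <= nfft)
    hop = int(round(fs * step_ms / 1000))   # step between frames in samples
    w = np.hamming(N)
    num_frames = 1 + (len(x) - N) // hop
    frames = np.stack([x[i * hop:i * hop + N] * w for i in range(num_frames)])
    mag = np.abs(np.fft.rfft(frames, n=nfft, axis=1))
    return 20 * np.log10(mag + 1e-12)       # small offset avoids log(0)

fs = 8000
t = np.arange(fs)
x = np.cos(2 * np.pi * 1000 * t / fs)                    # one second, 1 kHz tone
wide = spectrogram(x, fs, win_ms=15, step_ms=1)          # short window
narrow = spectrogram(x, fs, win_ms=50, step_ms=1)        # long window
```

The short window gives fine time resolution and coarse frequency resolution; the long window gives the opposite trade-off.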

Wide- and narrow-band spectrograms

[Figure: waveform with its wideband and narrowband spectrograms; formants F1, F2, and F3 are marked]

Speech analysis: LPC analysis

Short-time speech analysis

Time-domain speech processing

Frequency-domain (spectral) processing

Linear predictive coding (LPC) analysis

Cepstral analysis

Filter bank analysis

Discrete-time filter model for speech

The philosophy of LPC is related to the speech model in which speech is modelled as the output of a linear, time-varying system excited by either quasi-periodic pulses or random noise.

LPC provides a robust and accurate method for estimating the parameters of the time-varying system.

H(z)

LPC analysis

For efficient coding, speech signals are often modelled using parameters of the vocal tract shape that generates them.

Pole-zero model (ideal during a stationary frame):

$$H(z) = \frac{S(z)}{U(z)} = G\, \frac{1 + \sum_{l=1}^{q} b_l z^{-l}}{1 - \sum_{k=1}^{p} a_k z^{-k}}$$

All-pole model (simple): a matter of analytical necessity

$$H(z) = \frac{S(z)}{U(z)} = \frac{G}{1 - \sum_{k=1}^{p} a_k z^{-k}}$$

One zero can be represented by multiple poles:

$$1 - a z^{-1} = \frac{1}{\sum_{n=0}^{\infty} a^n z^{-n}}$$

All-pole model – the LPC model

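$$H(z) = \frac{S(z)}{U(z)} = \frac{G}{1 - \sum_{k=1}^{p} a_k z^{-k}}$$

$$S(z) = \frac{G}{1 - \sum_{k=1}^{p} a_k z^{-k}}\, U(z)$$

$$S(z) = \sum_{k=1}^{p} a_k z^{-k} S(z) + G\, U(z)$$

$$s(n) = \sum_{k=1}^{p} a_k s(n-k) + G\, u(n)$$

where u(n) is a normalised excitation and G is the gain of the excitation.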
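The time-domain form of the all-pole model is a recursion that can be run directly. The following is a minimal sketch assuming NumPy and zero initial conditions; the name `all_pole_synthesis` and the single-pole example values are illustrative.

```python
import numpy as np

def all_pole_synthesis(a, G, u):
    """s(n) = sum_{k=1..p} a_k s(n-k) + G u(n), with s(n) = 0 for n < 0."""
    p = len(a)
    s = np.zeros(len(u))
    for n in range(len(u)):
        acc = G * u[n]
        for k in range(1, p + 1):
            if n - k >= 0:
                acc += a[k - 1] * s[n - k]   # a[k-1] holds a_k
        s[n] = acc
    return s

u = np.zeros(5)
u[0] = 1.0                                   # unit impulse as excitation
s = all_pole_synthesis(a=[0.5], G=2.0, u=u)  # one pole at z = 0.5
```

For the single pole, the impulse response is the decaying sequence G · 0.5^n.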

The LPC model

After excluding the excitation term, a given speech sample at time n, s(n), can be approximated as a linear combination of the past p speech samples:

$$\hat{s}(n) = \alpha_1 s(n-1) + \alpha_2 s(n-2) + \cdots + \alpha_p s(n-p)$$

where the coefficients $\alpha_1, \alpha_2, \ldots, \alpha_p$ are assumed constant over the speech frame.

LPC analysis: to determine a set of predictor coefficients $\{\alpha_k\}$ directly from the speech signal.

LPC analysis equations

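Windowed speech:

$$x(n) = s(n)\, w(n)$$

Error of linear predictor:

$$e(n) = s(n) - \hat{s}(n) = s(n) - \sum_{k=1}^{p} \alpha_k s(n-k)$$

Method: minimise mean-squared prediction error

Short-time average prediction error, where $s_n(m)$ denotes the speech segment selected around sample n:

$$E_n = \sum_m e_n^2(m) = \sum_m \Big[ s_n(m) - \sum_{k=1}^{p} \alpha_k s_n(m-k) \Big]^2$$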

LPC analysis equations (cont’d)

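Find $\alpha_k$ such that $E_n$ is minimal, where

$$E_n = \sum_m e_n^2(m) = \sum_m \Big[ s_n(m) - \sum_{k=1}^{p} \alpha_k s_n(m-k) \Big]^2$$

$$\frac{\partial E_n}{\partial \alpha_i} = 0 \quad \text{for } i = 1, 2, \ldots, p$$

Resulting in

$$\sum_m s_n(m-i)\, s_n(m) = \sum_{k=1}^{p} \hat{\alpha}_k \sum_m s_n(m-i)\, s_n(m-k)$$

Define covariance

$$\phi_n(i,k) = \sum_m s_n(m-i)\, s_n(m-k)$$

Then

$$\sum_{k=1}^{p} \hat{\alpha}_k\, \phi_n(i,k) = \phi_n(i,0), \quad i = 1, 2, \ldots, p$$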

Short-time LP analysis

To solve the following equation for the optimum predictor coefficients (the $\hat{\alpha}_k$s)

$$\sum_{k=1}^{p} \hat{\alpha}_k\, \phi_n(i,k) = \phi_n(i,0), \quad i = 1, 2, \ldots, p$$

we have to compute $\phi_n(i,k)$ and then solve the resulting set of p equations.
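The two steps, computing the covariance values and solving the p equations, can be sketched as follows. This assumes NumPy, sums over m = p .. N−1 within one frame (one common choice of summation limits, which the slides do not specify), and a general linear solver rather than a specialised recursion. The name `lpc_covariance` and the second-order test signal are illustrative.

```python
import numpy as np

def lpc_covariance(s, p):
    """Solve sum_k alpha_k phi(i,k) = phi(i,0), i = 1..p, for one frame s."""
    s = np.asarray(s, dtype=float)
    m = np.arange(p, len(s))                 # summation range m = p .. N-1
    # phi[i, k] = sum_m s(m - i) s(m - k) for i, k = 0 .. p
    phi = np.array([[np.dot(s[m - i], s[m - k]) for k in range(p + 1)]
                    for i in range(p + 1)])
    return np.linalg.solve(phi[1:, 1:], phi[1:, 0])

# Test signal: impulse response of s(n) = 1.3 s(n-1) - 0.6 s(n-2) + u(n).
true_a = np.array([1.3, -0.6])
s = np.zeros(50)
s[0] = 1.0
s[1] = true_a[0] * s[0]
for n in range(2, len(s)):
    s[n] = true_a[0] * s[n - 1] + true_a[1] * s[n - 2]

alpha = lpc_covariance(s, p=2)
```

Because the test signal follows the recursion exactly for n ≥ 2, the estimated coefficients should match the true ones up to rounding.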

Speech analysis: Cepstral analysis

Short-time speech analysis

Time-domain speech processing

Frequency-domain (spectral) processing

Linear predictive coding (LPC) analysis

Cepstral analysis

Filter bank analysis

Homomorphic speech processing

Again, speech is modelled as the output of a linear, time-varying system (linear time-invariant (LTI) over short segments) excited by either quasi-periodic pulses or random noise.

The problem of speech analysis is to estimate the parameters of the speech model and to measure their variations with time.

Since the excitation and impulse response of an LTI system are combined in a convolutional manner, the problem of speech analysis can also be viewed as a problem in separating the components of a convolution, called "deconvolution".

$$y[n] = x[n] * h[n]$$

Homomorphic deconvolution

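Converts a convolution into a sum:

$$y(n) = x(n) * h(n) \;\longrightarrow\; \hat{y}(n) = \hat{x}(n) + \hat{h}(n)$$

Canonic form for system for homomorphic deconvolution:

$$x(n) \;\to\; D_*[\,\cdot\,] \;\to\; \hat{x}(n) \;\to\; L[\,\cdot\,] \;\to\; \hat{y}(n) \;\to\; D_*^{-1}[\,\cdot\,] \;\to\; y(n)$$

The input and output operations are convolutions (*); the operations between the blocks are additions (+):

$$x_1(n) * x_2(n) \;\to\; \hat{x}_1(n) + \hat{x}_2(n) \;\to\; \hat{y}_1(n) + \hat{y}_2(n) \;\to\; y_1(n) * y_2(n)$$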

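The characteristic system for homomorphic deconvolution

The characteristic system $D_*[\,\cdot\,]$:

$$x(n) \;\to\; Z[\,\cdot\,] \;\to\; X(z) \;\to\; \log[\,\cdot\,] \;\to\; \hat{X}(z) \;\to\; Z^{-1}[\,\cdot\,] \;\to\; \hat{x}(n)$$

Signals combine by convolution (*) at the input, by multiplication (·) after the z-transform, and by addition (+) after the logarithm and at the output.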

Cepstral analysis

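Observation:

$$x[n] = x_1[n] * x_2[n] \;\Longleftrightarrow\; X(z) = X_1(z)\, X_2(z)$$

taking logarithm of X(z), then

$$\log\{X(z)\} = \log\{X_1(z)\} + \log\{X_2(z)\}, \quad \text{i.e.,} \quad \hat{X}(z) = \hat{X}_1(z) + \hat{X}_2(z)$$

So, the two convolved signals are additive:

$$\hat{x}[n] = \hat{x}_1[n] + \hat{x}_2[n]$$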

Complex cepstrum and real cepstrum

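Complex cepstrum:

$$\hat{x}[n] = \frac{1}{2\pi} \int_{-\pi}^{\pi} \hat{X}(e^{j\omega})\, e^{j\omega n}\, d\omega = \frac{1}{2\pi} \int_{-\pi}^{\pi} \log\{X(e^{j\omega})\}\, e^{j\omega n}\, d\omega$$

Real cepstrum:

$$c[n] = \frac{1}{2\pi} \int_{-\pi}^{\pi} \log|X(e^{j\omega})|\, e^{j\omega n}\, d\omega$$

The real cepstrum $c[n]$ is the even part of $\hat{x}[n]$.

The word cepstrum was coined by reversing the first syllable in the word spectrum.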
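The additivity of the cepstrum can be checked numerically with a short sketch. It assumes NumPy and replaces the integral over ω with an N-point DFT, which is only an approximation of the definition (the result is a time-aliased version of c[n]). The name `real_cepstrum` and the two short test sequences are illustrative.

```python
import numpy as np

def real_cepstrum(x, nfft):
    """DFT approximation of c[n]: inverse DFT of log|X| on nfft frequency points."""
    X = np.fft.fft(x, n=nfft)
    return np.real(np.fft.ifft(np.log(np.abs(X))))

x1 = np.array([1.0, 0.5])
x2 = np.array([1.0, -0.3, 0.2])
y = np.convolve(x1, x2)            # convolution in the time domain

nfft = 64                          # long enough to hold the full convolution
c1 = real_cepstrum(x1, nfft)
c2 = real_cepstrum(x2, nfft)
cy = real_cepstrum(y, nfft)        # should equal c1 + c2
```

The test sequences were chosen to have no spectral zeros on the unit circle, so the logarithm stays finite.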

Mel-frequency cepstral coefficients (MFCC)

Mel-frequency filter bank

Speech analysis: Cepstral analysis

Short-time speech analysis

Time-domain speech processing

Frequency-domain (spectral) processing

Linear predictive coding (LPC) analysis

Cepstral analysis

Filter bank analysis

Gammatone filters

Gammatone is a widely used model of auditory filters.

Impulse response (the product of a gamma distribution and sinusoidal tone):

$$g(t) = a\, t^{n-1} e^{-2\pi b t} \cos(2\pi f t + \phi)$$

where f is the frequency, φ is the phase of the carrier (tone), a is the amplitude, n is the filter's order, b is the filter's bandwidth, and t is time.
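A gammatone impulse response can be sampled directly from its standard form g(t) = a t^(n−1) e^(−2πbt) cos(2πft + φ). The sketch below assumes NumPy; the name `gammatone_ir`, the order n = 4, and the numeric values of f and b are illustrative assumptions rather than values from the lecture.

```python
import numpy as np

def gammatone_ir(t, f, b, n=4, a=1.0, phi=0.0):
    """g(t) = a t^(n-1) exp(-2 pi b t) cos(2 pi f t + phi), for t >= 0."""
    t = np.asarray(t, dtype=float)
    envelope = a * t ** (n - 1) * np.exp(-2 * np.pi * b * t)   # gamma-shaped envelope
    return envelope * np.cos(2 * np.pi * f * t + phi)

fs = 16000
t = np.arange(800) / fs                      # 50 ms of time samples
g = gammatone_ir(t, f=1000.0, b=125.0)       # 1 kHz carrier
env = gammatone_ir(t, f=0.0, b=125.0)        # f = 0 leaves only the envelope
t_peak = t[np.argmax(env)]                   # near (n-1) / (2 pi b)
```

A filter bank would use a set of such responses with centre frequencies and bandwidths spaced along an auditory scale.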

Gammatone filters

Summary

Short-time speech analysis

Time-domain processing

Frequency-domain (spectral) processing

Linear predictive coding (LPC) analysis

Cepstral analysis

Filter bank analysis