Speech signal processing lizy


Based on the Kerala University M.Tech 1st semester Speech Signal Processing course (Signal Processing branch).


SPEECH SIGNAL PROCESSING
KERALA UNIVERSITY M-TECH 1ST SEMESTER

Lizy Abraham (lizytvm@yahoo.com)

Assistant Professor, +91 9495123331

Department of ECE

LBS Institute of Technology for Women

(A Govt. of Kerala Undertaking)

Poojappura

Trivandrum -695012

Kerala, India


SYLLABUS: TSC 1004 SPEECH SIGNAL PROCESSING 3-0-0-3

Speech Production: Acoustic theory of speech production (Excitation, Vocal tract model for speech analysis, Formant structure, Pitch). Articulatory Phonetics (Articulation, Voicing, Articulatory model). Acoustic Phonetics (Basic speech units and their classification).

Speech Analysis: Short-time speech analysis. Time domain analysis (Short-time energy, Short-time zero-crossing rate, ACF). Frequency domain analysis (Filter banks, STFT, Spectrogram, Formant estimation & analysis). Cepstral analysis.

Parametric representation of speech: AR model, ARMA model. LPC analysis (LPC model, Autocorrelation method, Covariance method, Levinson-Durbin algorithm, Lattice form). LSF, LAR, MFCC, Sinusoidal model, GMM, HMM.

Speech coding: Phase vocoder, LPC, Sub-band coding, Adaptive transform coding, Harmonic coding, Vector quantization based coders, CELP.

Speech processing: Fundamentals of speech recognition, Speech segmentation, Text-to-speech conversion, Speech enhancement, Speaker verification, Language identification, Issues of voice transmission over the Internet.


REFERENCES

1. Douglas O'Shaughnessy, Speech Communications: Human & Machine, IEEE Press, 2nd edition, 1999. ISBN 0780334493.

2. Nelson Morgan and Ben Gold, Speech and Audio Signal Processing: Processing and Perception of Speech and Music, John Wiley & Sons, July 1999. ISBN 0471351547.

3. Rabiner and Schafer, Digital Processing of Speech Signals, Prentice Hall, 1978.

4. Rabiner and Juang, Fundamentals of Speech Recognition, Prentice Hall, 1994.

5. Thomas F. Quatieri, Discrete-Time Speech Signal Processing: Principles and Practice, Prentice Hall, 1st edition. ISBN 013242942X.

6. Donald G. Childers, Speech Processing and Synthesis Toolboxes, John Wiley & Sons, September 1999. ISBN 0471349593.

For the end-semester exam (100 marks), the question paper shall have six questions of 20 marks each covering the entire syllabus, out of which any five shall be answered. It shall have 75% problems and 25% theory. For the internal marks of 50: two tests of 20 marks each, and 10 marks for assignments (minimum two) / term project.


Speech processing means the processing of discrete-time speech signals.


Speech Processing

Speech processing draws on several neighboring fields:

• Signal processing: Fourier transforms, discrete-time filters, AR(MA) models, statistical signal processing, stochastic models
• Information theory: entropy, communication theory, rate-distortion theory
• Phonetics
• Acoustics: psychoacoustics, room acoustics, speech production
• Algorithms (programming)


HOW IS SPEECH PRODUCED?

Speech can be defined as "a pressure acoustic signal that is articulated in the vocal tract."

Speech is produced when air is forced from the lungs through the vocal cords and along the vocal tract.


This air flow is referred to as the "excitation signal". The excitation signal causes the vocal cords to vibrate and propagates energy to excite the oral and nasal openings, which play a major role in shaping the sound produced.

Vocal tract components:
– Oral tract (from the lips to the vocal cords).
– Nasal tract (from the velum to the nostrils).


• Larynx: the source of speech.

• Vocal cords (folds): the two folds of tissue in the larynx. They can open and shut like a pair of fans.

• Glottis: the gap between the vocal cords. As air is forced through the glottis, the vocal cords start to vibrate and modulate the air flow.

• The frequency of vibration determines the pitch of the voice (for a male, 50–200 Hz; for a female, up to 500 Hz).


SPEECH PRODUCTION MODEL


Places of articulation: labial, dental, alveolar, post-alveolar/palatal, velar, uvular, pharyngeal, laryngeal/glottal.

Classes of speech sounds

Voiced sounds: the vocal cords vibrate open and closed, producing quasi-periodic pulses of air. The rate of the opening and closing gives the pitch.

Unvoiced sounds: produced by forcing air at high velocities through a constriction, giving noise-like turbulence. They show little long-term periodicity, though short-term correlations are still present. E.g. "S", "F".

Plosive sounds: a complete closure in the vocal tract; air pressure is built up behind it and released suddenly. E.g. "B", "P".

Speech Model


SPEECH SOUNDS

Coarse classification is done with phonemes. A phone is the acoustic realization of a phoneme. Allophones are context-dependent realizations of a phoneme.

PHONEME HIERARCHY

Speech sounds are language dependent; there are about 50 phonemes in English:

• Vowels: iy, ih, eh, ae, aa, ah, ao, ax, er, ow, uh, uw
• Diphthongs: ay, ey, oy, aw
• Consonants:
  – Plosives: p, b, t, d, k, g
  – Nasals: m, n, ng
  – Fricatives: f, v, th, dh, s, z, sh, zh, h
  – Retroflex liquid: r
  – Lateral liquid: l
  – Glides: w, y

Sounds like /SH/ and /S/ look like (spectrally shaped) random noise, while the vowel sounds /UH/, /IY/, and /EY/ are highly structured and quasi-periodic. These differences result from the distinctively different ways that these sounds are produced.


Vowel Chart

(Vowel chart: vowels arranged by tongue height — high, mid, low — and tongue position — front, center, back. E.g. front: /i/, /ɪ/, /e/, /ɛ/, /æ/; center: /ə/, /ʌ/, /a/; back: /u/, /ʊ/, /o/.)

SPEECH WAVEFORM CHARACTERISTICS

• Loudness
• Voiced/unvoiced
• Pitch
• Fundamental frequency
• Spectral envelope
• Formants

ACOUSTIC CHARACTERISTICS OF SPEECH

Pitch: the signal within each voiced interval is periodic. The period T is called the pitch period. The pitch depends on the vowel being spoken and changes over time (T ≈ 70 samples in this example).

f0 = 1/T is the fundamental frequency (also known as the pitch frequency).

FORMANTS

Formants can be recognized in the frequency content of the signal segment. Formants are best described as high-energy peaks in the frequency spectrum of a speech sound.

The resonant frequencies of the vocal tract are called formant frequencies or simply formants. The peaks of the spectrum of the vocal tract response correspond approximately to its formants. Under the linear time-invariant all-pole assumption, each vocal tract shape is characterized by a collection of formants.

Because the vocal tract is assumed stable with poles inside the unit circle, the vocal tract transfer function can be expressed either in product or partial fraction expansion form.
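The slide's equation was an image; a standard all-pole form consistent with the text, with gain G, poles c_k, and residues A_k (notation assumed here), is

$$V(z) = \frac{G}{\prod_{k=1}^{N}\bigl(1 - c_k z^{-1}\bigr)} = \sum_{k=1}^{N} \frac{A_k}{1 - c_k z^{-1}}$$

where the partial fraction form assumes distinct poles.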


A detailed acoustic theory must consider the effects of the following:

• Time variation of the vocal tract shape
• Losses due to heat conduction and viscous friction at the vocal tract walls
• Softness of the vocal tract walls
• Radiation of sound at the lips
• Nasal coupling
• Excitation of sound in the vocal tract

Let us begin by considering the simple case of a lossless tube.


MULTI-TUBE APPROXIMATION OF THE VOCAL TRACT

We can represent the vocal tract as a concatenation of N lossless tubes with areas A_k and equal lengths Δx = l/N. The wave propagation time through each tube is τ = Δx/c = l/(Nc).

Consider an N-tube model as in the previous figure. Each tube has length l_k and cross-sectional area A_k. Assume:

• No losses
• Planar wave propagation

The wave equations for section k hold for 0 ≤ x ≤ l_k.
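The equations themselves appeared as slide images; the standard lossless-tube relations (cf. Rabiner & Schafer), with p_k the pressure, u_k the volume velocity, ρ the air density, and c the speed of sound, are

$$-\frac{\partial p_k}{\partial x} = \frac{\rho}{A_k}\frac{\partial u_k}{\partial t}, \qquad -\frac{\partial u_k}{\partial x} = \frac{A_k}{\rho c^2}\frac{\partial p_k}{\partial t}$$

whose solutions are forward and backward traveling waves:

$$u_k(x,t) = u_k^{+}\Bigl(t - \frac{x}{c}\Bigr) - u_k^{-}\Bigl(t + \frac{x}{c}\Bigr), \qquad p_k(x,t) = \frac{\rho c}{A_k}\Bigl[u_k^{+}\Bigl(t - \frac{x}{c}\Bigr) + u_k^{-}\Bigl(t + \frac{x}{c}\Bigr)\Bigr]$$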


SOUND PROPAGATION IN THE CONCATENATED TUBE MODEL

Boundary conditions follow from the physical principle of continuity: pressure and volume velocity must be continuous, both in time and in space, everywhere in the system. At the kth/(k+1)st junction we have the conditions below.
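The junction conditions were shown as an image; in the notation above they read

$$p_k(l_k, t) = p_{k+1}(0, t), \qquad u_k(l_k, t) = u_{k+1}(0, t)$$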


ANALOGY WITH ELECTRICAL CIRCUIT TRANSMISSION LINE
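The correspondence table was an image on the slide; the standard acoustic-electrical analogy (cf. Rabiner & Schafer) pairs:

• sound pressure p ↔ voltage v
• volume velocity u ↔ current i
• acoustic inductance ρ/A ↔ inductance per unit length
• acoustic capacitance A/(ρc²) ↔ capacitance per unit length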


PROPAGATION OF SOUND IN A UNIFORM TUBE

The vocal tract transfer function of volume velocities is derived as follows.

Using the boundary conditions U(0,s) = U_G(s) and P(−l,s) = 0 (derivation in the Quatieri text, pages 122–125), the poles of the transfer function T(jΩ) occur where cos(Ωl/c) = 0. (See Quatieri, pages 119–124; the derivation of Eqn. 4.18 is important.)
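The resulting formula appeared as an image; the standard result for a lossless uniform tube of length l, closed at the glottis and open at the lips (cf. Quatieri), is

$$T(\Omega) = \frac{U(l,\Omega)}{U_G(\Omega)} = \frac{1}{\cos(\Omega l / c)}$$

so the resonances (formants) fall at

$$\Omega_k = \frac{(2k-1)\pi c}{2l}, \qquad F_k = \frac{(2k-1)c}{4l}, \quad k = 1, 2, 3, \ldots$$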


PROPAGATION OF SOUND IN A UNIFORM TUBE (CONT'D)

For c = 34,000 cm/sec and l = 17 cm, the natural frequencies (also called the formants) are at 500 Hz, 1500 Hz, 2500 Hz, …

The transfer function of a tube with no side branches, excited at one end with the response measured at the other, only has poles. The formant frequencies will have finite bandwidth when vocal tract losses are considered (e.g., radiation, walls, viscosity, heat). The length of the vocal tract, l, corresponds to λ₁/4, 3λ₂/4, 5λ₃/4, …, where λᵢ is the wavelength of the ith natural frequency.


UNIFORM TUBE MODEL

Example

Consider a uniform tube of length l = 35 cm. If the speed of sound is 350 m/s, calculate its resonances in Hz, and compare them with those of a tube of length l = 17.5 cm.

$$\Omega_k = \frac{k\pi c}{2l}, \quad k = 1, 3, 5, \ldots \qquad\Rightarrow\qquad f_k = \frac{\Omega_k}{2\pi} = \frac{kc}{4l} = \frac{k \times 350}{4 \times 0.35} = 250k \ \text{Hz}$$

f = 250, 750, 1250, … Hz

UNIFORM TUBE MODEL

For the 17.5 cm tube:

$$f_k = \frac{kc}{4l} = \frac{k \times 350}{4 \times 0.175} = 500k \ \text{Hz}, \quad k = 1, 3, 5, \ldots$$

f = 500, 1500, 2500, … Hz

Halving the tube length doubles each resonance frequency.
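As a quick check of this arithmetic, here is a minimal Python sketch (an illustration, not from the slides; the function name and defaults are placeholders) evaluating F_k = (2k − 1)c/(4l):

```python
def tube_resonances(length_m, c=350.0, num=3):
    """Resonances (Hz) of a lossless uniform tube closed at one end:
    F_k = (2k - 1) * c / (4 * l), k = 1, 2, 3, ...
    c defaults to 350 m/s as in the worked example above."""
    return [(2 * k - 1) * c / (4 * length_m) for k in range(1, num + 1)]

print(tube_resonances(0.35))    # [250.0, 750.0, 1250.0]
print(tube_resonances(0.175))   # [500.0, 1500.0, 2500.0]
```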


APPROXIMATING VOCAL TRACT SHAPES


VOWELS

Modeled as a tube closed at one end and open at the other:

• the closure is a membrane with a slit in it
• the tube has uniform cross-sectional area
• the membrane represents the source of energy (the vocal folds)
• the energy travels through the tube; the tube generates no energy on its own
• the tube represents an important class of resonators, with the odd quarter-wavelength relationship F_n = (2n − 1)c/(4l)

VOWELS

Filter characteristics for vowels:

• the vocal tract is a dynamic filter
• it is frequency dependent
• it has, theoretically, an infinite number of resonances
• each resonance has a center frequency, an amplitude, and a bandwidth
• for speech, these resonances are called formants
• formants are numbered in succession from the lowest: F1, F2, F3, etc.

FRICATIVES

Modeled as a tube with a very severe constriction:

• the air exiting the constriction is turbulent
• because of the turbulence, there is no periodicity unless accompanied by voicing
• when a fricative constriction is tapered, the back cavity is involved; this resembles a tube closed at both ends, with resonances F_n = nc/(2l)
• such a situation occurs primarily in articulation disorders

Introduction to Digital Speech Processing (Rabiner & Schafer): pages 20–23

Rabiner & Schafer: pages 98–105


SOUND SOURCE: VOCAL FOLD VIBRATION

Modeled as a volume velocity source at the glottis, U_G(jΩ).

SHORT-TIME SPEECH ANALYSIS

Segments (or frames, or vectors) are typically of length 20 ms. Over such short segments the speech characteristics are approximately constant, which allows for relatively simple modeling. Often overlapping segments are extracted.

SHORT-TIME ANALYSIS OF SPEECH

The system is an all-pole system; its system function, and the difference equation relating the input and output of such all-pole linear systems, take the forms given below.
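Both expressions were slide images; in standard notation (gain G, excitation e[n], output s[n], assumed here) they are

$$H(z) = \frac{G}{1 - \sum_{k=1}^{p} a_k z^{-k}}, \qquad s[n] = \sum_{k=1}^{p} a_k\, s[n-k] + G\, e[n]$$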


The operator T defines the nature of the short-time analysis function, and w[n̂ − m] represents a time-shifted window sequence.
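The general short-time analysis equation (an image on the slide; standard form from Rabiner & Schafer) is

$$Q_{\hat n} = \sum_{m=-\infty}^{\infty} T\{x[m]\}\; w[\hat n - m]$$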


SHORT-TIME ENERGY

Simple to compute, and useful for estimating properties of the excitation function in the model. In this case the operator T is simply squaring of the windowed samples.
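That is (standard form; the slide showed the equation as an image):

$$E_{\hat n} = \sum_{m=-\infty}^{\infty} \bigl(x[m]\, w[\hat n - m]\bigr)^2$$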


SHORT-TIME ZERO-CROSSING RATE

A weighted average of the number of times the speech signal changes sign within the time window. Representing this operator in terms of linear filtering leads to the following.
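(standard form, cf. Rabiner & Schafer; the slide's equation was an image)

$$Z_{\hat n} = \sum_{m=-\infty}^{\infty} \frac{1}{2}\bigl|\operatorname{sgn}(x[m]) - \operatorname{sgn}(x[m-1])\bigr|\; w[\hat n - m]$$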


Since ½|sgn(x[m]) − sgn(x[m − 1])| is equal to 1 if x[m] and x[m − 1] have different algebraic signs and 0 if they have the same sign, it follows that the short-time zero-crossing rate is a weighted sum of all the instances of alternating sign (zero-crossings) that fall within the support region of the shifted window w[n̂ − m].


The figure shows an example of the short-time energy and zero-crossing rate for a segment of speech with a transition from unvoiced to voiced speech. In both cases, the window is a Hamming window of duration 25 ms (equivalent to 401 samples at a 16 kHz sampling rate). Thus, both the short-time energy and the short-time zero-crossing rate are outputs of a lowpass filter whose frequency response is as shown.

Short-time energy and zero-crossing rate functions are slowly varying compared to the time variations of the speech signal, and therefore they can be sampled at a much lower rate than that of the original speech signal. For finite-length windows like the Hamming window, this reduction of the sampling rate is accomplished by moving the window position n̂ in jumps of more than one sample.


During the unvoiced interval, the zero-crossing rate is relatively high compared to the zero-crossing rate in the voiced interval. Conversely, the energy is relatively low in the unvoiced region compared to the energy in the voiced region.


SHORT-TIME AUTOCORRELATION FUNCTION (STACF)

The autocorrelation function is often used as a means of detecting periodicity in signals, and it is also the basis for many spectrum analysis methods. The STACF is defined as the deterministic autocorrelation function of the sequence x_n̂[m] = x[m]w[n̂ − m] that is selected by the window shifted to time n̂, i.e.:
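(standard definition; the slide's equation was an image)

$$\phi_{\hat n}[\ell] = \sum_{m=-\infty}^{\infty} x_{\hat n}[m]\, x_{\hat n}[m + \ell], \qquad x_{\hat n}[m] = x[m]\, w[\hat n - m]$$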


e[n] is the excitation to the linear system with impulse response h[n]. A well-known, and easily proved, property of the autocorrelation function is that the autocorrelation function of s[n] = e[n] ∗ h[n] is the convolution of the autocorrelation functions of e[n] and h[n].
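In symbols, with φ denoting the deterministic autocorrelation (notation assumed):

$$\phi_s[\ell] = \phi_e[\ell] * \phi_h[\ell]$$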


SHORT-TIME FOURIER TRANSFORM (STFT)

The expression for the discrete-time STFT at time n is given below, where w[n] is assumed to be non-zero only in the interval [0, N_w − 1] and is referred to as the analysis window or, sometimes, the analysis filter.
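(Quatieri's definition; the equation itself was a slide image)

$$X(n, \omega) = \sum_{m=-\infty}^{\infty} x[m]\, w[n - m]\, e^{-j\omega m}$$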


FILTERING VIEW


SHORT-TIME SYNTHESIS

The problem is to obtain a sequence back from its discrete-time STFT. The equation below represents a synthesis equation for the discrete-time STFT.
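One standard form, assuming w[0] ≠ 0 and a DFT length N at least the window length (cf. Quatieri):

$$x[n] = \frac{1}{N\, w[0]} \sum_{k=0}^{N-1} X(n, k)\, e^{\,j\frac{2\pi}{N}kn}$$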


FILTER BANK SUMMATION (FBS) METHOD

The discrete STFT is considered to be the set of outputs of a bank of filters. The output of each filter is modulated with a complex exponential, and these modulated filter outputs are summed at each instant of time to obtain the corresponding time sample of the original sequence. That is, given a discrete STFT X(n, k), the FBS method synthesizes a sequence y[n] satisfying the following equation.
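(the standard FBS synthesis formula, identical in form to the synthesis equation above, again assuming w[0] ≠ 0)

$$y[n] = \frac{1}{N\, w[0]} \sum_{k=0}^{N-1} X(n, k)\, e^{\,j\frac{2\pi}{N}kn}$$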


OVERLAP-ADD METHOD

Just as the FBS method was motivated from the filtering view of the STFT, the OLA method is motivated from the Fourier transform view of the STFT. In this method, for each fixed time, we take the inverse DFT of the corresponding frequency function and divide the result by the analysis window. However, instead of dividing out the analysis window from each of the resulting short-time sections, we perform an overlap-and-add operation between the short-time sections.


Given a discrete STFT X(n, k), the OLA method synthesizes a sequence y[n] given by:
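(standard OLA synthesis with frame shift L, cf. Quatieri; here W(e^{j0}) = Σ_n w[n])

$$y[n] = \frac{L}{W(e^{j0})} \sum_{p=-\infty}^{\infty} \frac{1}{N} \sum_{k=0}^{N-1} X(pL, k)\, e^{\,j\frac{2\pi}{N}kn}$$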


Furthermore, if the discrete STFT has been decimated in time by a factor L, it can similarly be shown that y[n] = x[n], provided the analysis window satisfies the condition below.
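(the condition, reconstructed in the same notation)

$$\sum_{p=-\infty}^{\infty} w[pL - n] = \frac{W(e^{j0})}{L} \quad \text{for all } n$$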


DESIGN OF DIGITAL FILTER BANKS

Rabiner & Schafer: pages 282–297


USING IIR FILTER


USING FIR FILTER


FILTER BANK ANALYSIS AND SYNTHESIS


FBS synthesis results in multiple copies of the input.


PHASE VOCODER

The Fourier series is computed over a sliding window of a single pitch period duration and provides a measure of the amplitude and frequency trajectories of the musical tones.


The kth filter-bank output can be interpreted as a real sinewave that is amplitude- and phase-modulated by the STFT, the "carrier" being the kth filter's center frequency. We write the STFT of a continuous-time signal as follows.
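(the standard continuous-time STFT definition)

$$X(t, \Omega) = \int_{-\infty}^{\infty} x(\tau)\, w(t - \tau)\, e^{-j\Omega\tau}\, d\tau$$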


where … is an initial condition. The signal is likewise referred to as the instantaneous amplitude for each channel. The resulting filter-bank output is a sinewave with, in general, a time-varying amplitude and frequency modulation. An alternative expression (given on the slide) is the time-domain counterpart to the frequency-domain phase derivative.


We can sample the continuous-time STFT, with sampling interval T, to obtain the discrete-time STFT.


SPEECH MODIFICATION


HOMOMORPHIC (CEPSTRAL) SPEECH ANALYSIS

Use of the short-time cepstrum as a representation of speech and as a basis for estimating the parameters of the speech generation model. The cepstrum of a discrete-time signal is defined below.
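The definitions were slide images; in standard form, the real cepstrum c[n] uses the log magnitude and the complex cepstrum x̂[n] uses the complex logarithm:

$$c[n] = \frac{1}{2\pi}\int_{-\pi}^{\pi} \log\bigl|X(e^{j\omega})\bigr|\, e^{j\omega n}\, d\omega, \qquad \hat x[n] = \frac{1}{2\pi}\int_{-\pi}^{\pi} \log X(e^{j\omega})\, e^{j\omega n}\, d\omega$$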


That is, the complex cepstrum operator transforms convolution into addition. This property is what makes the cepstrum useful for speech analysis, since the model for speech production involves convolution of the excitation with the vocal tract impulse response, and our goal is often to separate the excitation signal from the vocal tract signal.
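In symbols:

$$x[n] = e[n] * h[n] \;\Longrightarrow\; \hat x[n] = \hat e[n] + \hat h[n]$$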


The key issue in the definition and computation of the complex cepstrum is the computation of the complex logarithm, i.e., the computation of the phase angle arg[X(e^jω)], which must be done so as to preserve an additive combination of phases for two signals combined by convolution.


THE SHORT-TIME CEPSTRUM

The short-time cepstrum is a sequence of cepstra of windowed finite-duration segments of the speech waveform.


RECURSIVE COMPUTATION OF THE COMPLEX CEPSTRUM

Another approach to computing the complex cepstrum applies only to minimum-phase signals, i.e., signals having a z-transform whose poles and zeros are inside the unit circle. An example would be the impulse response of an all-pole vocal tract model with a system function of the all-pole form given earlier (poles c_k).


In this case, all the poles c_k must be inside the unit circle for stability of the system.
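The recursion itself was a slide image; the standard minimum-phase recursion (cf. Oppenheim & Schafer) computes the complex cepstrum directly from h[n]:

$$\hat h[0] = \log h[0], \qquad \hat h[n] = \frac{h[n]}{h[0]} - \sum_{k=1}^{n-1} \Bigl(\frac{k}{n}\Bigr)\, \hat h[k]\, \frac{h[n-k]}{h[0]}, \quad n \ge 1$$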


SHORT-TIME HOMOMORPHIC FILTERING OF SPEECH – PAGE NO: 63, RABINER & SCHAFER


The low-quefrency part of the cepstrum is expected to be representative of the slow variations (with frequency) in the log spectrum, while the high-quefrency components correspond to the more rapid fluctuations of the log spectrum.


The spectrum for the voiced segment has a structure of periodic ripples due to the harmonic structure of the quasi-periodic segment of voiced speech. This periodic structure in the log spectrum manifests itself in the cepstrum as a peak at a quefrency of about 9 ms. The existence of this peak in the quefrency range of expected pitch periods strongly signals voiced speech. Furthermore, the quefrency of the peak is an accurate estimate of the pitch period during the corresponding speech interval. The autocorrelation function also displays an indication of periodicity, but not nearly as unambiguously as does the cepstrum. The rapid variations of the unvoiced spectra, by contrast, appear random with no periodic structure. As a result, there is no strong peak indicating periodicity as in the voiced case.


These slowly varying log spectra clearly retain the general spectral shape, with peaks corresponding to the formant resonance structure for the segment of speech under analysis.


APPLICATION TO PITCH DETECTION

The cepstrum was first applied in speech processing to determine the excitation parameters for the discrete-time speech model. The successive spectra and cepstra are for 50 ms segments obtained by moving the window in steps of 12.5 ms (100 samples at a sampling rate of 8000 samples/sec).


For positions 1 through 5, the window includes only unvoiced speech; for positions 6 and 7, the signal within the window is partly voiced and partly unvoiced; for positions 8 through 15, the window includes only voiced speech. The rapid variations of the unvoiced spectra appear random with no periodic structure, while the spectra for voiced segments have a structure of periodic ripples due to the harmonic structure of the quasi-periodic segment of voiced speech.


The cepstrum peak at a quefrency of about 11–12 ms strongly signals voiced speech, and the quefrency of the peak is an accurate estimate of the pitch period during the corresponding speech interval. The presence of a strong peak implies voiced speech, and the quefrency location of the peak gives the estimate of the pitch period.


MEL-FREQUENCY CEPSTRUM COEFFICIENTS (MFCC)

The idea is to compute a frequency analysis based upon a filter bank with approximately critical-band spacing of the filters and bandwidths. For a 4 kHz bandwidth, approximately 20 filters are used. A short-time Fourier analysis is done first, resulting in a DFT X_n̂[k] for analysis time n̂. The DFT values are then grouped together in critical bands and weighted by a triangular weighting function.


The bandwidths are constant for center frequencies below 1 kHz and then increase exponentially up to half the sampling rate of 4 kHz, resulting in a total of 22 filters. The mel-frequency spectrum at analysis time n̂ is defined for r = 1, 2, …, R as follows.
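(Rabiner & Schafer's definition, with V_r[k] the rth triangular weighting function spanning DFT bins L_r to U_r; the notation is assumed, since the slide showed only an image)

$$MF_{\hat n}[r] = \frac{1}{A_r} \sum_{k=L_r}^{U_r} \bigl|V_r[k]\, X_{\hat n}[k]\bigr|^{2}$$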


Here A_r is a normalizing factor for the rth mel-filter. For each frame, a discrete cosine transform of the log of the magnitude of the filter outputs is computed to form the function mfcc_n̂[m].
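I.e., in Rabiner & Schafer's notation (the slide showed the formula as an image):

$$\mathrm{mfcc}_{\hat n}[m] = \frac{1}{R} \sum_{r=1}^{R} \log\bigl(MF_{\hat n}[r]\bigr)\, \cos\Bigl[\frac{2\pi}{R}\Bigl(r + \frac{1}{2}\Bigr) m\Bigr]$$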


The figure shows the result of MFCC analysis of a frame of voiced speech in comparison with the short-time Fourier spectrum, the LPC spectrum, and a homomorphically smoothed spectrum. All these spectra are different, but they have in common that they have peaks at the formant resonances. At higher frequencies, the reconstructed mel-spectrum has more smoothing due to the structure of the filter bank.

THE SPEECH SPECTROGRAM

The spectrogram is simply a display of the magnitude of the STFT. Specifically, the images in the figure are plots of the STFT magnitude, where the plot axes are labeled in terms of analog time and frequency through the relations t_r = rRT and f_k = k/(NT), where T is the sampling period of the discrete-time signal x[n] = x_a(nT).


In order to make the image smooth, R is usually quite small compared to both the window length L and the number of samples in the frequency dimension, N, which may be much larger than the window length L. Such a function of two variables can be plotted on a two-dimensional surface as either a gray-scale or a color-mapped image. The bars on the right calibrate the color map (in dB).


If the analysis window is short, the spectrogram is called a wide-band spectrogram, which is characterized by good time resolution and poor frequency resolution. When the window length is long, the spectrogram is a narrow-band spectrogram, which is characterized by good frequency resolution and poor time resolution.


THE SPECTROGRAM

• A classic analysis tool.
  – Consists of DFTs of overlapping, windowed frames.
• Displays the distribution of energy in time and frequency.
  – 10 log₁₀ |X_m(f)|² is typically displayed.
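As an illustration (not from the slides), a minimal numpy sketch of exactly this computation — overlapping Hamming-windowed frames, one DFT per frame, and 10·log10 of the squared magnitude; the function name and parameter defaults are placeholders:

```python
import numpy as np

def spectrogram(x, fs, win_len=400, hop=100, nfft=512):
    """Return times (s), freqs (Hz), and 10*log10 |X_m(f)|^2 for
    overlapping, Hamming-windowed frames of the signal x."""
    win = np.hamming(win_len)
    frames = np.array([x[i:i + win_len] * win
                       for i in range(0, len(x) - win_len, hop)])
    power = np.abs(np.fft.rfft(frames, n=nfft, axis=1)) ** 2
    times = np.arange(len(frames)) * hop / fs
    freqs = np.arange(nfft // 2 + 1) * fs / nfft
    return times, freqs, 10 * np.log10(power + 1e-12)
```

A short window (relative to the pitch period) gives a wide-band spectrogram; a long window gives a narrow-band one.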


THE SPECTROGRAM (CONT'D)


Note the three broad peaks in the spectrum slice at time t_r = 430 ms, and observe that similar slices would be obtained at other times around t_r = 430 ms. These large peaks are representative of the underlying resonances of the vocal tract at the corresponding time in the production of the speech signal.


The lower spectrogram is not as sensitive to rapid time variations, but the resolution in the frequency dimension is much better. This window length is on the order of several pitch periods of the waveform during voiced intervals. As a result, the spectrogram no longer displays vertically oriented striations, since several periods are included in the window.


SHORT-TIME ACF

(Figure: short-time ACF for the sounds /m/, /ow/, /s/.)

CEPSTRUM

Speech wave (S) = excitation (E) · filter (H): the glottal excitation (E), coming from the vocal cords (glottis), drives the vocal tract filter (H) to produce the speech signal (S).

(Diagram: http://home.hib.no/al/engelsk/seksjon/SOFF-MASTER/ill061.gif)

CEPSTRAL ANALYSIS

The signal (s) is the convolution (∗) of the glottal excitation (e) and the vocal tract filter (h):

s(n) = e(n) ∗ h(n), where n is the time index.

After the Fourier transform, FT{s(n)} = FT{e(n) ∗ h(n)}: convolution (∗) becomes multiplication (·).

In going from n (time) to ω (frequency):

S(ω) = E(ω) · H(ω)

Taking the magnitude of the spectrum, |S(ω)| = |E(ω)| · |H(ω)|, so

log10 |S(ω)| = log10 |E(ω)| + log10 |H(ω)|

Ref: http://iitg.vlab.co.in/?sub=59&brch=164&sim=615&cnt=1

CEPSTRUM

C(n) = IDFT[ log10 |S(ω)| ] = IDFT[ log10 |E(ω)| + log10 |H(ω)| ]

Pipeline: s(n) → windowing → DFT → X(ω) → log|X(ω)| → IDFT → c(n)

In c(n), the contributions of E and H appear at two different quefrency positions. Application: useful for (i) glottal excitation analysis and (ii) vocal tract filter analysis. (n = time index, ω = frequency, IDFT = inverse discrete Fourier transform.)
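A minimal numpy sketch of this pipeline (an illustration, not from the slides; names and the 2–20 ms search range are placeholders), including the cepstral pitch estimate described earlier:

```python
import numpy as np

def real_cepstrum_pitch(frame, fs):
    """Cepstrum pipeline: window -> DFT -> log magnitude -> IDFT.
    The low-quefrency part of c reflects the vocal tract filter H;
    a strong peak in the 2-20 ms quefrency range gives the pitch period."""
    windowed = frame * np.hamming(len(frame))
    log_mag = np.log10(np.abs(np.fft.fft(windowed)) + 1e-12)
    c = np.fft.ifft(log_mag).real
    lo, hi = int(0.002 * fs), int(0.020 * fs)   # 2-20 ms quefrencies
    period = lo + np.argmax(c[lo:hi])           # peak -> pitch period
    return c, fs / period                       # cepstrum, pitch in Hz
```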

EXAMPLE OF CEPSTRUM (sampling frequency 22.05 kHz)

SUB-BAND CODING

The time-decimated subband outputs are quantized and encoded, then decoded at the receiver. In subband coding, a small number of filters with wide and overlapping bandwidths are chosen, and each bandpass filter output is quantized individually. Although the bandpass filters are wide and overlapping, careful design of the filters results in a cancellation of the quantization noise that leaks across bands.


Quadrature mirror filters (QMFs) are one such filter class; the figure shows an example of a two-band subband coder using two overlapping quadrature mirror filters. Quadrature mirror filters can be further subdivided from high to low frequencies by splitting the full band into two, then the resulting lower band into two, and so on.
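As background (the slide gave only a figure), the classic two-band QMF relation makes the high-band filter a frequency-shifted mirror of the low-band prototype h_0:

$$h_1[n] = (-1)^n\, h_0[n] \;\Longleftrightarrow\; H_1(e^{j\omega}) = H_0\bigl(e^{j(\omega - \pi)}\bigr)$$

With synthesis filters g_0[n] = h_0[n] and g_1[n] = −h_1[n], the aliasing introduced by decimating the two channels cancels.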


This octave-band splitting, together with the iterative decimation, can be shown to yield a perfect reconstruction filter bank. Such octave-band filter banks, and their conditions for perfect reconstruction, are closely related to wavelet analysis/synthesis structures.


LINEAR PREDICTION (INTRODUCTION):

The object of linear prediction is to estimate the output sequence from a linear combination of input samples, past output samples, or both:

$$\hat y(n) = \sum_{j=0}^{q} b(j)\, x(n-j) - \sum_{i=1}^{p} a(i)\, y(n-i)$$

The factors a(i) and b(j) are called predictor coefficients.


LINEAR PREDICTION (INTRODUCTION):

Many systems of interest to us are describable by a linear, constant-coefficient difference equation:

$$\sum_{i=0}^{p} a(i)\, y(n-i) = \sum_{j=0}^{q} b(j)\, x(n-j)$$

If Y(z)/X(z) = H(z), where H(z) is a ratio of polynomials N(z)/D(z), then

$$N(z) = \sum_{j=0}^{q} b(j)\, z^{-j} \quad\text{and}\quad D(z) = \sum_{i=0}^{p} a(i)\, z^{-i}$$

Thus the predictor coefficients give us immediate access to the poles and zeros of H(z).


LINEAR PREDICTION (TYPES OF SYSTEM MODEL):

There are two important variants:

All-pole model (in statistics, the autoregressive (AR) model): the numerator N(z) is a constant.

All-zero model (in statistics, the moving-average (MA) model): the denominator D(z) is equal to unity.

The mixed pole-zero model is called the autoregressive moving-average (ARMA) model.


LINEAR PREDICTION (DERIVATION OF LP EQUATIONS):

Given a zero-mean signal y(n), in the AR model:

$$\hat y(n) = -\sum_{i=1}^{p} a(i)\, y(n-i)$$

The error is:

$$e(n) = y(n) - \hat y(n) = \sum_{i=0}^{p} a(i)\, y(n-i), \qquad a(0) = 1$$

To derive the predictor we use the orthogonality principle, which states that the desired coefficients are those which make the error orthogonal to the samples y(n−1), y(n−2), …, y(n−p).


LINEAR PREDICTION (DERIVATION OF LP EQUATIONS):

Thus we require that

$$\langle y(n-j)\, e(n) \rangle = 0 \quad \text{for } j = 1, 2, \ldots, p$$

or,

$$\Bigl\langle y(n-j) \sum_{i=0}^{p} a(i)\, y(n-i) \Bigr\rangle = 0$$

Interchanging the operations of averaging and summing, and representing ⟨·⟩ by summing over n, we have

$$\sum_{i=0}^{p} a(i) \sum_{n} y(n-i)\, y(n-j) = 0, \qquad j = 1, \ldots, p$$

The required predictors are found by solving these equations.


LINEAR PREDICTION (DERIVATION OF LP EQUATIONS):

The orthogonality principle also states that the resulting minimum error is given by

$$E = \langle e^2(n) \rangle = \langle e(n)\, y(n) \rangle$$

or,

$$\sum_{i=0}^{p} a(i) \sum_{n} y(n)\, y(n-i) = E$$

We can minimize the error over all time:

$$\sum_{i=0}^{p} a(i)\, r_{i-j} = 0, \quad j = 1, 2, \ldots, p, \qquad \sum_{i=0}^{p} a(i)\, r_i = E$$

where (with r symmetric, r_{−i} = r_i)

$$r_i = \sum_{n=-\infty}^{\infty} y(n)\, y(n-i)$$


LINEAR PREDICTION (APPLICATIONS):

Autocorrelation matching: we have a signal y(n) with known autocorrelation r_yy(n). We model this with the AR system shown below:

$$H(z) = \frac{\sigma}{A(z)} = \frac{\sigma}{1 - \sum_{i=1}^{p} a_i z^{-i}}$$

(Diagram: white noise e(n) drives the filter σ/A(z), producing y(n) with autocorrelation r_yy(n).)


LINEAR PREDICTION (ORDER OF LINEAR PREDICTION):

The choice of predictor order depends on the analysis bandwidth. The rule of thumb is:

$$p = \frac{2\, BW}{1000} + c$$

where BW is the bandwidth in Hz and c is a small constant. For a normal vocal tract, there is an average of about one formant per kilohertz of BW. One formant requires two complex-conjugate poles. Hence for every formant we require two predictor coefficients, i.e. two coefficients per kilohertz of bandwidth.


LINEAR PREDICTION (AR MODELING OF SPEECH SIGNAL):

True model (block diagram): for voiced speech, a DT impulse generator, driven at the pitch period, feeds a glottal filter G(z); for unvoiced speech, an uncorrelated noise generator is used instead. A voiced/unvoiced (V/U) switch selects the excitation u(n) (a volume velocity in the voiced case), which is scaled by a gain and passed through the vocal tract filter H(z) and the lip radiation filter R(z) to produce the speech signal s(n).


LINEAR PREDICTION (AR MODELING OF SPEECH SIGNAL):

Using LP analysis (block diagram): a DT impulse generator (at the pitch estimate, for voiced speech) or a white noise generator (for unvoiced speech) is selected by the V/U switch, scaled by a gain estimate, and passed through a single all-pole (AR) filter H(z), which produces the estimate of the speech signal s(n).