Robustness Techniques for Speech Recognition


Transcript of Robustness Techniques for Speech Recognition

Page 1:

Robustness Techniques for Speech Recognition

Berlin Chen
Department of Computer Science & Information Engineering
National Taiwan Normal University

References:
1. X. Huang et al., Spoken Language Processing (2001), Chapter 10.
2. J. C. Junqua and J. P. Haton, Robustness in Automatic Speech Recognition (1996), Chapters 5, 8-9.
3. T. F. Quatieri, Discrete-Time Speech Signal Processing (2002), Chapter 13.
4. J. Droppo and A. Acero, “Environmental robustness,” in Springer Handbook of Speech Processing, Springer, 2008, ch. 33, pp. 653–679.

Page 2:

Introduction

• Classification of Speech Variability in Five Categories
(Diagram: the five categories and the corresponding countermeasures)
– Linguistic variability → pronunciation variation modeling
– Inter-speaker variability → speaker independence / speaker adaptation / speaker dependence
– Intra-speaker variability
– Variability caused by the context → context-dependent acoustic modeling
– Variability caused by the environment → robustness enhancement

Page 3:

Introduction (cont.)

• The Diagram for Speech Recognition
– Acoustic processing: feature extraction followed by likelihood computation (using the acoustic model)
– Linguistic processing: linguistic network decoding (using the language model and the lexicon)
– Input: the speech signal; output: the recognition results

• Importance of robustness in speech recognition
– Speech recognition systems have to operate in situations with uncontrollable acoustic environments
– The recognition performance is often degraded due to the mismatch between the training and testing conditions
• Varying environmental noises, different speaker characteristics (sex, age, dialects), different speaking modes (stylistic, Lombard effect), etc.

Page 4:

Introduction (cont.)

• If a speech recognition system’s accuracy does not degrade very much under mismatched conditions, the system is called robust
– ASR performance is rather uniform for SNRs greater than 25 dB, but degrades steeply as the noise level increases

SNR(dB) = 10·log₁₀( E_s / E_N ) > 25  ⟹  E_s / E_N > 10^2.5 ≈ 316

• Signal energy measured in the time domain (a small computational sketch follows this list), e.g.:

E_s = (1/T)·Σ_{n=0}^{T−1} s[n]·s*[n]

• Various noises exist in varying real-world environments

– Periodic, impulsive, or wide/narrow band
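As a concrete illustration of the SNR definition above, here is a minimal Python sketch (assuming NumPy; the arrays `s` and `n` and the 16 kHz toy signals are hypothetical examples, not data from the slides):

```python
import numpy as np

def signal_energy(x):
    """Average time-domain energy: E = (1/T) * sum_n x[n] * conj(x[n])."""
    return np.mean(np.abs(x) ** 2)

def snr_db(s, n):
    """SNR in dB between a clean signal s and a noise signal n."""
    return 10.0 * np.log10(signal_energy(s) / signal_energy(n))

# Toy example: a sinusoid with noise scaled to sit 25 dB below it
rng = np.random.default_rng(0)
s = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
n = rng.standard_normal(16000) * np.sqrt(signal_energy(s) / 10 ** 2.5)
print(snr_db(s, n))  # ≈ 25 dB, i.e., E_s / E_N ≈ 316
```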

Page 5:

Introduction (cont.)

• Therefore, several robustness approaches have been developed to enhance the speech signal, its spectrum, and the acoustic models as well
– Environmental compensation processing (feature-based)
– Acoustic model adaptation (model-based)
– Robust acoustic features (both model- and feature-based)
• Or, inherently discriminative acoustic features

Page 6:

The Noise Types (1/2)

A model of the environment: the clean speech s[m] passes through a channel h[m] and is corrupted by additive noise n[m], yielding the observed signal x[m].

x[m] = s[m] ∗ h[m] + n[m]  ⟺  X(ω) = S(ω)H(ω) + N(ω)

|X(ω)|² = |S(ω)H(ω) + N(ω)|²
        = |S(ω)|²|H(ω)|² + |N(ω)|² + 2·Re{ S(ω)H(ω)N*(ω) }
        = |S(ω)|²|H(ω)|² + |N(ω)|² + 2·|S(ω)||H(ω)||N(ω)|·cos θ(ω)

Neglecting the cross term,

|X(ω)|² ≈ |S(ω)|²|H(ω)|² + |N(ω)|²

or, in power-spectrum notation, P_xx(ω) = P_ss(ω)·P_hh(ω) + P_nn(ω) (equivalently written S_xx(ω) = S_ss(ω)·S_hh(ω) + S_nn(ω) for the power spectra).
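A minimal simulation sketch of this environment model, assuming NumPy; the arrays `s` (clean speech), `h` (channel impulse response), and `n` (noise) are hypothetical placeholders:

```python
import numpy as np

def corrupt(s, h, n):
    """Environment model x[m] = s[m] * h[m] + n[m]: channel convolution plus additive noise."""
    filtered = np.convolve(s, h)[: len(s)]  # convolutional (channel) distortion
    return filtered + n[: len(filtered)]    # additive noise

# Toy example: a random "clean" signal, a 2-tap channel, and weak white noise
rng = np.random.default_rng(0)
s = rng.standard_normal(1000)
h = np.array([1.0, 0.4])
n = 0.1 * rng.standard_normal(1000)
x = corrupt(s, h, n)
```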

Page 7:

The Noise Types (2/2)

Viewed at a single DFT bin k in the complex plane, with the noise N[k] taken along the real axis and the speech S[k] at angle θ_k:

|X[k]|² = |S[k] + N[k]|²
        = ( |S[k]|·sin θ_k )² + ( |N[k]| + |S[k]|·cos θ_k )²
        = |S[k]|² + |N[k]|² + 2·|S[k]||N[k]|·cos θ_k

Page 8:

Additive Noises

• Additive noises can be stationary or non-stationary
– Stationary noises
• Such as computer fan, air conditioning, or car noise: the power spectral density does not change over time (the above noises are also narrow-band noises)
– Non-stationary noises
• Machine gun, door slams, keyboard clicks, radio/TV, and other speakers’ voices (babble noise, wide-band noise, the most difficult case): the statistical properties change over time

power spectrum S(ω)  ⟹  log power spectrum S^l(ω) = log S(ω)  ⟹  cepstrum S^c = C·S^l  (C: the cepstral transform matrix)

Page 9:

Additive Noises (cont.)

Page 10:

Convolutional Noises

• Convolutional noises mainly result from channel distortion (they are sometimes called “channel noises”) and are stationary in most cases
– Reverberation, the frequency response of the microphone, transmission lines, etc.

power spectrum S(ω)  ⟹  log power spectrum S^l(ω) = log S(ω)  ⟹  cepstrum S^c = C·S^l

Page 11:

Noise Characteristics

• White Noise
– The power spectrum is flat, S_nn(ω) = q, a condition equivalent to different samples being uncorrelated, R_nn[m] = q·δ[m]
– White noise has a zero mean, but can have different distributions
– We are often interested in white Gaussian noise, as it better resembles the noise that tends to occur in practice

• Colored Noise
– The spectrum is not flat (like the noise captured by a microphone)
– Pink noise
• A particular type of colored noise that has a low-pass nature, as it has more energy at the low frequencies and rolls off at high frequencies
• E.g., the noise generated by a computer fan, an air conditioner, or an automobile

Page 12:

Noise Characteristics (cont.)

• Musical Noise
– Musical noise consists of short sinusoids (tones) randomly distributed over time and frequency
• It occurs due to, e.g., the drawbacks of the original spectral subtraction technique and statistical inaccuracy in estimating the noise magnitude spectrum

• Lombard Effect
– A phenomenon by which a speaker increases his vocal effort in the presence of background noise (the additive noise)
– When a large amount of noise is present, the speaker tends to shout, which entails not only a higher amplitude, but also often a higher pitch, slightly different formants, and a different coloring (shape) of the spectrum
– The vowel portions of the words will be overemphasized by the speakers

Page 13:

A Few Robustness Approaches

Page 14:

Three Basic Categories of Approaches

• Speech Enhancement Techniques
– Eliminate or reduce the noise effect on the speech signals, thus achieving better accuracy with the originally trained models (restore the clean speech signals or compensate for the distortions)
– The feature part is modified while the model part remains unchanged

• Model-based Noise Compensation Techniques
– Adjust (change) the recognition model parameters (means and variances) to better match the noisy testing conditions
– The model part is modified while the feature part remains unchanged

• Robust Parameters for Speech
– Find robust representations of speech signals that are less influenced by additive or channel noise
– Both the feature and model parts are changed

Page 15:

Assumptions & Evaluations

• General Assumptions for the Noise
– The noise is uncorrelated with the speech signal
– The noise characteristics are fixed during the speech utterance or vary very slowly (the noise is said to be stationary)
• The estimates of the noise characteristics can be obtained during non-speech activity
– The noise is supposed to be additive or convolutional

• Performance Evaluations
– Intelligibility, quality (subjective assessment)
– Distortion between clean and recovered speech (objective assessment)
– Speech recognition accuracy

Page 16:

Spectral Subtraction (SS)  (S. F. Boll, 1979)

• A Speech Enhancement Technique
• Estimate the magnitude (or the power) of the clean speech by explicitly subtracting the noise magnitude (or power) spectrum from the noisy magnitude (or power) spectrum

• Basic Assumptions of Spectral Subtraction
– The clean speech s[m] is corrupted by additive noise n[m]
– Different frequencies are uncorrelated with each other
– s[m] and n[m] are statistically independent, so that the power spectrum of the noisy speech x[m] can be expressed as:

P_X(ω) = P_S(ω) + P_N(ω)

– To eliminate the additive noise:

P_S(ω) = P_X(ω) − P_N(ω)

– We can obtain an estimate of P_N(ω) by averaging over M frames that are known to be just noise:

P_N(ω) = (1/M)·Σ_{i=0}^{M−1} P_{N,i}(ω)
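A minimal power-spectral-subtraction sketch along these lines, assuming NumPy/SciPy and a hypothetical single-channel signal `x` whose first few frames are noise only; the frame length, the number of noise frames, and the flooring constant are illustrative choices rather than values from the slides:

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(x, fs, n_noise_frames=10, floor=1e-3):
    """Subtract an averaged noise power spectrum from each frame's power spectrum."""
    f, t, X = stft(x, fs=fs, nperseg=256)
    power = np.abs(X) ** 2
    noise_power = power[:, :n_noise_frames].mean(axis=1, keepdims=True)  # estimate of P_N(w)
    clean_power = np.maximum(power - noise_power, floor * noise_power)   # avoid negative estimates
    # Rebuild the waveform with the noisy phase and the enhanced magnitude
    X_hat = np.sqrt(clean_power) * np.exp(1j * np.angle(X))
    _, s_hat = istft(X_hat, fs=fs, nperseg=256)
    return s_hat
```

The `np.maximum(..., floor * noise_power)` flooring is exactly the kind of thresholding that, as the next pages note, can give rise to musical noise.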

Page 17:

Spectral Subtraction (cont.)

• Problems of Spectral Subtraction
– s[m] and n[m] are not truly statistically independent, so the cross term in the power spectrum cannot be eliminated
– The estimate P_S(ω) = P_X(ω) − P_N(ω) can be negative
– It introduces “musical noise” when P_X(ω) ≈ P_N(ω)
– It needs a robust endpoint (speech/noise/silence) detector

Page 18:

Spectral Subtraction (cont.)

• Modification: Nonlinear Spectral Subtraction (NSS)

P̂_S(ω) = P̄_X(ω) − φ(ω),   if P̄_X(ω) − φ(ω) > β·P̄_N(ω)
        = β·P̄_N(ω),         otherwise

or

P̂_S(ω) = P_X(ω) − P_N(ω),   if P_X(ω) ≥ P_N(ω)
        = P_N(ω),            otherwise

where P̄_X(ω) and P̄_N(ω) are the smoothed noisy and noise spectra, and φ(ω) is a non-linear function of the SNR.

Page 19:

Spectral Subtraction (cont.)

• Spectral subtraction can be viewed as a filtering operation

Power-spectrum domain:

P̂_S(ω) = P_X(ω) − P_N(ω) = P_X(ω)·[ 1 − P_N(ω)/P_X(ω) ]
        = P_X(ω)·[ 1 + 1/R(ω) ]⁻¹   (supposing P_X(ω) ≈ P_S(ω) + P_N(ω))

where R(ω) = P_S(ω)/P_N(ω) is the instantaneous SNR.

Spectrum (magnitude) domain: the time-varying suppression filter is given approximately by

H(ω) = [ 1 + 1/R(ω) ]^(−1/2)

Page 20:

Wiener Filtering

• A Speech Enhancement Technique
• From the Statistical Point of View
– The process x[m] is the sum of the clean-speech random process s[m] and the additive noise process n[m]:

x[m] = s[m] + n[m]

– Find a linear estimate ŝ[m] of s[m] in terms of the process x[m]
• That is, find a linear filter h[m] such that the sequence ŝ[m] = x[m] ∗ h[m] minimizes the expected value of ( ŝ[m] − s[m] )²

ŝ[m] = x[m] ∗ h[m] = Σ_{l=−∞}^{∞} h[l]·x[m − l]

Noisy speech x[m]  →  linear filter h[m]  →  clean-speech estimate ŝ[m]

Page 21:

Wiener Filtering (cont.)

• Minimize the expectation of the squared error (MMSE estimate)

Minimize  F = E{ ( s[m] − Σ_{l=−∞}^{∞} h[l]·x[m − l] )² }

Setting ∂F/∂h[k] = 0 for every k:

∀k:  E{ s[m]·x[m − k] } = Σ_l h[l]·E{ x[m − l]·x[m − k] }

Summing over m, and using x[m] = s[m] + n[m] with s[m] and n[m] statistically independent (so the cross terms vanish):

Σ_m s[m]·( s[m − k] + n[m − k] ) = Σ_l h[l]·Σ_m x[m − l]·x[m − k]
⇒  R_s[k] = Σ_{l=−∞}^{∞} h[l]·R_x[k − l] = h[k] ∗ R_x[k]

where R_s[n] and R_x[n] are the autocorrelation sequences of s[n] and x[n], respectively. Taking the Fourier transform:

S_ss(ω) = H(ω)·S_xx(ω)

Page 22:

Wiener Filtering (cont.)

• Minimize the expectation of the squared error (MMSE estimate)

Since S_ss(ω) = H(ω)·S_xx(ω),

H(ω) = S_ss(ω)/S_xx(ω) = S_ss(ω) / ( S_ss(ω) + S_nn(ω) )

(where S_xx(ω) = S_ss(ω) + S_nn(ω)); this is called the noncausal Wiener filter.
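A minimal sketch of the noncausal Wiener gain applied per frequency bin, assuming NumPy; the power-spectrum arrays `P_ss` and `P_nn` are hypothetical inputs estimated elsewhere (e.g., from non-speech frames):

```python
import numpy as np

def wiener_gain(P_ss, P_nn, eps=1e-12):
    """Noncausal Wiener filter H(w) = S_ss / (S_ss + S_nn), evaluated per frequency bin."""
    return P_ss / (P_ss + P_nn + eps)

def apply_wiener(X, P_ss, P_nn):
    """Apply the gain to one frame's complex spectrum X (same length as the power spectra)."""
    return wiener_gain(P_ss, P_nn) * X
```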

Page 23:

Wiener Filtering (cont.)

• The time-varying Wiener filter can also be expressed in a form similar to spectral subtraction:

H(ω) = S_ss(ω) / ( S_ss(ω) + S_nn(ω) ) = P_S(ω) / ( P_S(ω) + P_N(ω) ) = [ 1 + 1/R(ω) ]⁻¹   (R(ω) = P_S(ω)/P_N(ω): instantaneous SNR)

SS vs. Wiener filter:
1. The Wiener filter has stronger attenuation in the low-SNR region
2. The Wiener filter does not invoke an absolute thresholding

(Figure: suppression gain plotted against 10·log₁₀( P_S(ω)/P_N(ω) ).)

Page 24:

Wiener Filtering (cont.)

• Wiener filtering can be realized only if we know the power spectra of both the noise and the signal
– A chicken-and-egg problem

• Approach I: Ephraim (1992) proposed the use of an HMM where, if we know which state the current frame falls under, we can use that state’s mean spectrum as S_ss(ω) (or P_S(ω))
– In practice, we do not know what state each frame falls into either
• Weight the filters for each state by the posterior probability that the frame falls into that state

Page 25:

Wiener Filtering (cont.)

• Approach II:
– The background/noise is stationary and its power spectrum can be estimated by averaging spectra over a known background region
– For the non-stationary speech signal, its time-varying power spectrum can be estimated using the past Wiener filter (of the previous frame):

P̂_S(t, ω) = P_X(t − 1, ω)·H(t − 1, ω)   (t: frame index, H: Wiener filter)

∴  H(t, ω) = P̂_S(t, ω) / ( P̂_S(t, ω) + P_N(ω) ),   and the enhanced spectrum is  P̃_S(t, ω) = P_X(t, ω)·H(t, ω)

• The initial estimate of the speech spectrum can be derived from spectral subtraction
– This sometimes introduces musical noise

Page 26:

Wiener Filtering (cont.)

• Approach III:
– Slow down the rapid frame-to-frame movement of the estimated speech power spectrum by applying temporal smoothing:

P̆_S(t, ω) = α·P̃_S(t − 1, ω) + (1 − α)·P̂_S(t, ω)

Then use P̆_S(t, ω) in place of P̂_S(t, ω), i.e.

H(t, ω) = P̂_S(t, ω) / ( P̂_S(t, ω) + P_N(ω) )   ⇒   H(t, ω) = P̆_S(t, ω) / ( P̆_S(t, ω) + P_N(ω) )
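A minimal frame-by-frame sketch of Approaches II and III combined, assuming NumPy, a hypothetical power spectrogram `P_X` of shape (frames, bins), a noise power spectrum `P_N` averaged from background frames, and an initial estimate from spectral subtraction; the smoothing constant follows the α = 0.85 shown on the next page:

```python
import numpy as np

def iterative_wiener_gains(P_X, P_N, alpha=0.85, eps=1e-12):
    """Per-frame Wiener gains using the previous frame's filter plus temporal smoothing."""
    n_frames, n_bins = P_X.shape
    H = np.ones((n_frames, n_bins))
    P_tilde = np.maximum(P_X[0] - P_N, eps)                  # initial estimate via spectral subtraction
    for t in range(n_frames):
        P_hat = P_X[t - 1] * H[t - 1] if t > 0 else P_tilde  # speech estimate from the previous filter
        P_breve = alpha * P_tilde + (1 - alpha) * P_hat      # temporal smoothing (Approach III)
        H[t] = P_breve / (P_breve + P_N + eps)               # Wiener gain for frame t
        P_tilde = P_X[t] * H[t]                              # enhanced spectrum for the next step
    return H
```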

Page 27:

Wiener Filtering (cont.)

(Figure: clean speech, noisy speech, and enhanced noisy speech using Approach III with α = 0.85. Other, more complicated Wiener filters also exist.)

Page 28:

The Effects of Active Noise


Page 29:

Cepstral Mean Normalization (CMN)

• A Speech Enhancement Technique, sometimes called Cepstral Mean Subtraction (CMS)
• CMN is a powerful and simple technique designed to handle convolutional (time-invariant linear filtering) distortions

Time domain:                x[n] = s[n] ∗ h[n]
Spectral domain:            X(ω) = S(ω)·H(ω)
Log power spectral domain:  log|X(ω)|² = log|S(ω)|² + log|H(ω)|²,  i.e.  X^l = S^l + H^l
Cepstral domain:            X^c = C·X^l = C·S^l + C·H^l = S^c + H^c

The cepstral mean over an utterance of T frames:

X̄^c = (1/T)·Σ_{t=0}^{T−1} X^c_t = (1/T)·Σ_{t=0}^{T−1} S^c_t + H^c = S̄^c + H^c

If the training and testing speech materials were recorded over two different channels:

Training:  X^c(1) = C( S^l + H^l(1) ) = S^c + H^c(1)
Testing:   X^c(2) = C( S^l + H^l(2) ) = S^c + H^c(2)

X^c(1) − X̄^c(1) = S^c − S̄^c
X^c(2) − X̄^c(2) = S^c − S̄^c

The spectral characteristics of the microphone and room acoustics can thus be removed; the residual term S̄^c can be eliminated under the assumption of a zero-mean speech contribution.
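A minimal per-utterance CMN sketch, assuming NumPy and a hypothetical cepstral feature matrix `X` of shape (frames, coefficients):

```python
import numpy as np

def cmn(X):
    """Cepstral Mean Normalization: subtract the per-utterance mean of each cepstral coefficient."""
    return X - X.mean(axis=0, keepdims=True)
```

After this subtraction, features recorded over different channels share the same channel-independent representation S^c − S̄^c, as derived above.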

Page 30:

Cepstral Mean Normalization (cont.)

• Some Findings
– Interestingly, CMN has been found effective even when the testing and training utterances are recorded with the same microphone and environment
• It compensates for variations in the distance between the mouth and the microphone across different utterances and speakers
– Be careful about the duration/period used to estimate the mean of the noisy speech
• Why?
– It is problematic when the acoustic feature vectors are almost identical within the selected time period

Page 31:

Cepstral Mean Normalization (cont.)

• Performance
– For telephone recordings, where each call has a different frequency response, the use of CMN has been shown to provide as much as a 30% relative decrease in error rate
– When a system is trained on one microphone and tested on another, CMN can provide significant robustness

Page 32:

Cepstral Mean Normalization (cont.)

• CMN has been shown to improve robustness not only to varying channels but also to noise
– White noise added at different SNRs
– System trained with speech at the same SNR (matched condition)

Note: cepstral delta and delta-delta features are computed prior to the CMN operation, so they are unaffected.

Page 33:

Cepstral Mean Normalization (cont.)

• From another perspective
– We can interpret CMN as subtracting the output of a low-pass temporal filter d[n], whose coefficients are all identical and equal to 1/T; the overall operation is therefore a high-pass temporal filter (in the temporal/modulation-frequency domain)
– This alleviates the effect of the convolutional noise introduced by the channel

Page 34:

Cepstral Mean Normalization (cont.)

• Real-time Cepstral Normalization
– CMN requires the complete utterance to compute the cepstral mean; thus, it cannot be used in a real-time system, and an approximation needs to be used
– Based on the above (high-pass filter) perspective, we can implement other types of high-pass filters, e.g., a running estimate of the cepstral mean:

X̄^c_t = α·X^c_t + (1 − α)·X̄^c_{t−1}   (X̄^c_t: cepstral mean at frame t)
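A minimal streaming variant along these lines, assuming NumPy and a hypothetical array `frames` of shape (T, D); the value of α is an illustrative choice, not one from the slides:

```python
import numpy as np

def streaming_cmn(frames, alpha=0.02):
    """Online CMN: subtract an exponentially updated cepstral mean from each incoming frame."""
    mean = np.asarray(frames[0], dtype=float).copy()  # initialize with the first frame
    out = []
    for x in frames:
        mean = alpha * x + (1 - alpha) * mean         # running cepstral-mean estimate
        out.append(x - mean)
    return np.array(out)
```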

Page 35:

Histogram Equalization (HEQ)

• HEQ has its roots in the assumption that the transformed speech feature distributions of the test (or noisy) data should be identical to those of the training (or reference) data
– Find a transformation y = F(x) that converts the test feature x into y so that

p_Train(y) = p_Test(x)·dx/dy = p_Test( F⁻¹(y) )·dF⁻¹(y)/dy,   x = F⁻¹(y)

– Therefore, the relationship between the cumulative distribution functions (CDFs) associated with the test and training speech is

C_Test(x) = ∫_{−∞}^{x} p_Test(x′) dx′ = ∫_{−∞}^{y=F(x)} p_Train(y′) dy′ = C_Train(y)

so that F(x) = C_Train⁻¹( C_Test(x) ).

Example of paired clean and noisy feature values:

Clean    Noisy
0.50     0.53
2.10     1.95
1.20     1.40
−3.50    −3.20
4.31     4.47

Page 36:

Histogram Equalization (cont.)

• Convert the distribution of each feature vector component of the test speech into a predefined target distribution corresponding to that of the training speech
– Attempt not only to match the speech feature means/variances, but to completely match the feature distributions of the training and test data

(Figure: the test-utterance CDF C_Test(x) is mapped onto the reference/training CDF C_Train(y) to obtain the transformed feature value.)
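A minimal per-dimension HEQ sketch using empirical CDFs, assuming NumPy and hypothetical feature matrices `ref` (training/reference) and `test`, both of shape (frames, dims):

```python
import numpy as np

def heq(test, ref):
    """Map each test feature dimension through C_train^{-1}(C_test(x)) using empirical CDFs."""
    out = np.empty_like(test, dtype=float)
    n_test = len(test)
    for d in range(test.shape[1]):
        # Empirical CDF value of each test sample within its own distribution
        ranks = np.argsort(np.argsort(test[:, d]))
        cdf = (ranks + 0.5) / n_test
        # Inverse reference CDF via interpolation over the sorted reference samples
        ref_sorted = np.sort(ref[:, d])
        out[:, d] = np.quantile(ref_sorted, cdf)
    return out
```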

Page 37:

RASTA Temporal Filter  (Hynek Hermansky, 1991)

• A Speech Enhancement Technique
• RASTA (Relative Spectral)

Assumptions
– The linguistic message is coded into movements of the vocal tract (i.e., the change of spectral characteristics)
– The rate of change of non-linguistic components in speech often lies outside the typical rate of change of the vocal tract shape
• E.g., fixed or slowly time-varying linear communication channels
– Human hearing is more sensitive to modulation frequencies around 4 Hz than to lower or higher modulation frequencies

Effect
– RASTA suppresses the spectral components that change more slowly or more quickly than the typical rate of change of speech

Page 38:

RASTA Temporal Filter (cont.)

• The IIR transfer function

H(z) = 0.1·z⁴ · ( 2 + z⁻¹ − z⁻³ − 2z⁻⁴ ) / ( 1 − 0.98z⁻¹ )

Each MFCC coefficient stream c[t] (along the frame index) is passed through H(z) to give the new MFCC stream c̃[t].

• Another version

H(z) = 0.1 · ( 2 + z⁻¹ − z⁻³ − 2z⁻⁴ ) / ( 1 − 0.98z⁻¹ )

c̃[t] = 0.98·c̃[t − 1] + 0.2·c[t] + 0.1·c[t − 1] − 0.1·c[t − 3] − 0.2·c[t − 4]

(Figure: the RASTA filter’s magnitude response has a peak at about 4 Hz modulation frequency.)
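A minimal sketch of the second (causal) form applied along the frame axis of a cepstral feature matrix, assuming NumPy/SciPy and a hypothetical array `C` of shape (frames, coefficients):

```python
import numpy as np
from scipy.signal import lfilter

def rasta_filter(C):
    """Apply H(z) = 0.1*(2 + z^-1 - z^-3 - 2z^-4) / (1 - 0.98z^-1) to each cepstral trajectory."""
    b = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])  # numerator coefficients
    a = np.array([1.0, -0.98])                       # denominator coefficients
    return lfilter(b, a, C, axis=0)                  # filter along the frame (time) axis
```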

Page 39:

Retraining on Corrupted Speech

• A Model-based Noise Compensation Technique
• Matched-Conditions Training
– Take a noise waveform from the new environment, add it to all the utterances in the training database, and retrain the system
– If the noise characteristics are known ahead of time, this method allows us to adapt the models to the new environment with a relatively small amount of data from the new environment, yet use a large amount of training data

Page 40:

Retraining on Corrupted Speech (cont.)

• Multi-style Training
– Create a number of artificial acoustic environments by corrupting the clean training database with noise samples of varying levels (30 dB, 20 dB, etc.) and types (white, babble, etc.), as well as varying the channels; see the sketch below
– All those waveforms (copies of the training database) from multiple acoustic environments can then be used in training
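A minimal sketch of corrupting a clean waveform at a target SNR for matched-conditions or multi-style training, assuming NumPy and hypothetical arrays `clean` and `noise` (the noise at least as long as the clean signal):

```python
import numpy as np

def add_noise_at_snr(clean, noise, snr_db):
    """Scale a noise segment so that the mixture has the requested SNR, then add it."""
    noise = noise[: len(clean)]
    e_clean = np.mean(clean ** 2)
    e_noise = np.mean(noise ** 2)
    scale = np.sqrt(e_clean / (e_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

# Multi-style training data: several SNR levels per clean utterance
# corrupted = [add_noise_at_snr(clean, noise, snr) for snr in (30, 20, 10)]
```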

Page 41:

Model Adaptation

• A Model-based Noise Compensation Technique
• The standard methods for speaker adaptation can be used to adapt speech recognizers to noisy environments
– MAP (Maximum A Posteriori) adaptation can offer results similar to those of matched conditions, but it requires a significant amount of adaptation data
– MLLR (Maximum Likelihood Linear Regression) can achieve reasonable performance with about a minute of speech for minor mismatches. For severe mismatches, MLLR also requires a larger amount of adaptation data

Page 42:

Signal Decomposition Using HMMs

• A Model-based Noise Compensation Technique
• Recognize concurrent signals (speech and noise) simultaneously
– Parallel HMMs are used to model the concurrent signals, and the composite signal is modeled as a function of their combined outputs
• Three-dimensional Viterbi search over the clean-speech HMM and the noise HMM
– Computationally expensive for both training and decoding (especially for non-stationary noise)

Page 43:

Parallel Model Combination (PMC)

• A Model-based Noise Compensation Technique
• By using the clean-speech models and a noise model, we can approximate the distributions that would be obtained by training an HMM on corrupted speech

Page 44:

Parallel Model Combination (cont.)

• The steps of standard Parallel Model Combination (log-normal approximation), moving between the cepstral, log-spectral, and linear spectral domains:

Cepstral → log-spectral (for both the clean-speech HMMs and the noise HMM):

μ^l = C⁻¹·μ^c,    Σ^l = C⁻¹·Σ^c·(C⁻¹)ᵀ

Log-spectral → linear spectral (in the linear spectral domain the distribution is lognormal):

μ_i = exp( μ^l_i + Σ^l_ii / 2 ),    Σ_ij = μ_i·μ_j·[ exp(Σ^l_ij) − 1 ]

Combination (because speech and noise are independent and additive in the linear spectral domain; g is a gain term, and μ̃, Σ̃ are the noise parameters mapped through the same two steps):

μ̂ = g·μ + μ̃,    Σ̂ = g²·Σ + Σ̃

Linear spectral → log-spectral (log-normal approximation: assume the new distribution is also lognormal):

μ̂^l_i = log μ̂_i − (1/2)·log( Σ̂_ii / μ̂_i² + 1 ),    Σ̂^l_ij = log( Σ̂_ij / ( μ̂_i·μ̂_j ) + 1 )

Log-spectral → cepstral (the noisy-speech HMM parameters):

μ̂^c = C·μ̂^l,    Σ̂^c = C·Σ̂^l·Cᵀ

Constraint: the estimate of the variance must be positive.

Page 45:

Parallel Model Combination (cont.)

• Modification I: perform the model combination in the log-spectral domain (the simplest approximation)
– Log-Add Approximation (without compensation of the variances); see the sketch below:

μ̂^l = log( exp(μ^l) + exp(μ̃^l) )

• The variances are assumed to be small
– A simplified version of the log-normal approximation
• Reduction in computational load

• Modification II: perform the model combination in the linear spectral domain (Data-Driven PMC, DPMC, or iterative PMC)
– Use the speech models to generate noisy samples (corrupted-speech observations) and then compute a maximum-likelihood estimate from these noisy samples
– This method is less computationally expensive than standard PMC, with comparable performance
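A minimal sketch of the log-add combination of one clean-speech mean and one noise mean in the log-spectral domain, assuming NumPy and hypothetical log-spectral mean vectors `mu_s` and `mu_n`; the gain term `g`, if used, scales the speech contribution:

```python
import numpy as np

def log_add_means(mu_s, mu_n, g=1.0):
    """Log-Add approximation: mu_hat = log(g*exp(mu_s) + exp(mu_n)), element-wise."""
    # np.logaddexp computes log(exp(a) + exp(b)) in a numerically stable way
    return np.logaddexp(np.log(g) + mu_s, mu_n)
```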

Page 46:

Parallel Model Combination (cont.)

• Modification II: perform the model combination in the linear spectral domain (Data-Driven PMC, DPMC)
– Starting from the clean-speech HMM and the noise HMM in the cepstral domain, generate samples (apply Monte Carlo simulation to draw random cepstral vectors, for example at least 100 for each distribution), transform them to the linear spectral domain, combine them, and transform back to obtain the noisy-speech HMM

Page 47:

Parallel Model Combination (cont.)

• Data-Driven PMC (figure)

Page 48:

Vector Taylor Series (VTS)  (P. J. Moreno, 1995)

• A Model-based Noise Compensation Technique
• VTS Approach
– Similar to PMC, the noisy-speech-like models are generated by combining the clean-speech HMMs and the noise HMM
– Unlike PMC, the VTS approach combines the parameters of the clean-speech HMMs and the noise HMM linearly in the log-spectral domain

Power spectrum:

P_X(ω) = P_S(ω)·P_H(ω) + P_N(ω)

Log power spectrum:

X^l = log( P_S(ω)·P_H(ω) + P_N(ω) )
    = log{ P_S(ω)·P_H(ω)·[ 1 + P_N(ω) / ( P_S(ω)·P_H(ω) ) ] }
    = S^l + H^l + log( 1 + e^(N^l − S^l − H^l) )
    = S^l + H^l + f( S^l, H^l, N^l ),   where  f( S^l, H^l, N^l ) = log( 1 + e^(N^l − S^l − H^l) )

This non-linear function f is a vector function.

Page 49:

Vector Taylor Series (cont.)

• The Taylor series provides a polynomial representation of a function in terms of the function and its derivatives at a point
– Applications often arise when non-linear functions are employed and we desire to obtain a linear approximation
– The function is represented as an offset and a linear term

For f: R → R,

f(x) = f(x₀) + f′(x₀)·(x − x₀) + (1/2)·f″(x₀)·(x − x₀)² + … + (1/n!)·f⁽ⁿ⁾(x₀)·(x − x₀)ⁿ + o( (x − x₀)ⁿ )

Page 50:

Vector Taylor Series (cont.)

• Apply the Taylor series approximation to f( S^l, H^l, N^l ):

f( S^l, H^l, N^l ) ≅ f( S^l₀, H^l₀, N^l₀ )
  + ( df( S^l₀, H^l₀, N^l₀ )/dS^l )·( S^l − S^l₀ )
  + ( df( S^l₀, H^l₀, N^l₀ )/dH^l )·( H^l − H^l₀ )
  + ( df( S^l₀, H^l₀, N^l₀ )/dN^l )·( N^l − N^l₀ ) + …

– VTS-0: use only the 0th-order term of the Taylor series
– VTS-1: use the 0th- and 1st-order terms of the Taylor series
– f( S^l₀, H^l₀, N^l₀ ) is the vector function evaluated at a particular vector point

• If VTS-0 is used (see the sketch below):

E[ X^l ] = E[ S^l + H^l + f( S^l, H^l, N^l ) ]
μ_x^l = μ_s^l + μ_h^l + E[ f( S^l, H^l, N^l ) ] ≅ μ_s^l + μ_h^l + f( μ_s^l, μ_h^l, μ_n^l )
(X^l is also Gaussian in the log power spectrum domain)

If the channel filter is linear and time-invariant, we can regard it as a constant bias g:

μ_x^l ≅ μ_s^l + g + f( μ_s^l, g, μ_n^l )

Σ_x^l ≅ Σ_s^l + Σ_h^l  (if S^l and H^l are independent),  and with a constant channel  Σ_x^l ≅ Σ_s^l   (0th-order VTS)

To get the clean speech statistics, these relations are then inverted.
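A minimal sketch of the 0th-order VTS mean compensation in the log-spectral domain, assuming NumPy and hypothetical log-spectral mean vectors `mu_s` (clean speech) and `mu_n` (noise), plus a channel bias `g`:

```python
import numpy as np

def vts0_mean(mu_s, mu_n, g=0.0):
    """0th-order VTS: mu_x = mu_s + g + log(1 + exp(mu_n - mu_s - g))."""
    # Algebraically this equals log(exp(mu_s + g) + exp(mu_n)), the same
    # combination form as the log-add approximation in PMC.
    return mu_s + g + np.log1p(np.exp(mu_n - mu_s - g))
```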

Page 51:

Vector Taylor Series (cont.)

Page 52:

Retraining on Compensated Features

• A Model-based Noise Compensation Technique that also uses enhanced features (processed by SS, CMN, etc.)
– Combine speech enhancement and model compensation
– SPLICE: Stereo-based Piecewise Linear Compensation

Page 53:

More on SPLICE

(Note: this slide was adapted from Dr. Droppo’s presentation slides.)

Page 54:

Principal Component Analysis

• Principal Component Analysis (PCA):
– Widely applied for data analysis and dimensionality reduction in order to derive the most “expressive” features
– Criterion: for a zero-mean random vector x ∈ R^N, find k (k ≤ N) orthonormal vectors {e₁, e₂, …, e_k} so that
(1) var(e₁ᵀx) is maximized
(2) var(e_iᵀx) is maximized subject to e_i ⊥ e_{i−1} ⊥ … ⊥ e₁,  1 ≤ i ≤ k
– {e₁, e₂, …, e_k} are in fact the eigenvectors of the covariance matrix Σ_x of x corresponding to the largest k eigenvalues
– The final random vector y ∈ R^k is the linear transform (projection) of the original random vector: y = Aᵀx, with A = [e₁ e₂ … e_k]

(Figure: data projected onto the principal axis.)
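A minimal PCA sketch via eigendecomposition of the sample covariance, assuming NumPy and a hypothetical data matrix `X` of shape (samples, N):

```python
import numpy as np

def pca(X, k):
    """Return the projection matrix A (N x k) and the projected data Y = X A."""
    X = X - X.mean(axis=0)                   # enforce the zero-mean assumption
    cov = np.cov(X, rowvar=False)            # sample covariance Sigma_x
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: the covariance is symmetric
    order = np.argsort(eigvals)[::-1][:k]    # indices of the largest k eigenvalues
    A = eigvecs[:, order]
    return A, X @ A
```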

Page 55:

Principal Component Analysis (cont.)

Page 56:

Principal Component Analysis (cont.)

• Properties of PCA
– The components of y are mutually uncorrelated:

E{ y_i·y_j } = E{ (e_iᵀx)(e_jᵀx)ᵀ } = E{ (e_iᵀx)(xᵀe_j) } = e_iᵀ·E{ xxᵀ }·e_j = e_iᵀ·Σ_x·e_j = λ_j·e_iᵀe_j = 0,  if i ≠ j

∴ the covariance of y is diagonal

– The error power (mean-squared error) between the original vector x and the projected x′ is minimal:

x  = (e₁ᵀx)e₁ + (e₂ᵀx)e₂ + … + (e_kᵀx)e_k + … + (e_Nᵀx)e_N
x′ = (e₁ᵀx)e₁ + (e₂ᵀx)e₂ + … + (e_kᵀx)e_k    (note: x′ ∈ R^N)

error:  x − x′ = (e_{k+1}ᵀx)e_{k+1} + (e_{k+2}ᵀx)e_{k+2} + … + (e_Nᵀx)e_N

E{ (x − x′)ᵀ(x − x′) } = E{ (e_{k+1}ᵀx)·e_{k+1}ᵀe_{k+1}·(e_{k+1}ᵀx) } + … + E{ (e_Nᵀx)·e_Nᵀe_N·(e_Nᵀx) }
 = var(e_{k+1}ᵀx) + var(e_{k+2}ᵀx) + … + var(e_Nᵀx) = λ_{k+1} + λ_{k+2} + … + λ_N,  which is minimal

Page 57:

PCA Applied in Inherently Robust Features

• Application 1: a linear transform of the original features (in the spatial domain)
– Each frame x_t of the original feature stream is transformed as z_t = Aᵀx_t, where the columns of A are the “first k” eigenvectors of Σ_x, yielding the transformed feature stream z_t

Page 58:

PCA Applied in Inherently Robust Features (cont.)

• Application 2: PCA-derived temporal filter (in the temporal domain)
– The effect of the temporal filter is equivalent to a weighted sum over a length-L window of the time sequence of a specific MFCC coefficient, slid along the frame index
– (Figure: the feature matrix x(n, k), n = 1…N frames, k = 1…K coefficients; the trajectory y_k(n) = x(n, k) of each coefficient is filtered by its own filter B_k(z).)
– For the k-th coefficient, collect the windowed vectors z_k(n) = [ y_k(n) y_k(n+1) y_k(n+2) … y_k(n+L−1) ]ᵀ and compute

μ_{z_k} = ( 1/(N−L+1) )·Σ_{n=1}^{N−L+1} z_k(n)
Σ_{z_k} = ( 1/(N−L+1) )·Σ_{n=1}^{N−L+1} ( z_k(n) − μ_{z_k} )·( z_k(n) − μ_{z_k} )ᵀ

– The impulse response of B_k(z) is one of the eigenvectors of the covariance Σ_{z_k}, and the element in the new feature vector is x̂(n, k) = e_{k,1}ᵀ·z_k(n)

(From Dr. Jei-wei Hung)

Page 59:

PCA Applied in Inherently Robust Features (cont.)

(Figure: the frequency responses of the 15 PCA-derived temporal filters. From Dr. Jei-wei Hung)

Page 60:

PCA Applied in Inherently Robust Features (cont.)

• Application 2: PCA-derived temporal filter

(Figure: recognition results under matched and mismatched conditions, with filter length L = 10. From Dr. Jei-wei Hung)

Page 61:

PCA Applied in Inherently Robust Features (cont.)

• Application 3: PCA-derived filter bank
– The power spectrum obtained by the DFT is divided into bands x₁, x₂, x₃, …; the filter h_k applied to band k is one of the eigenvectors of the covariance of x_k

(From Dr. Jei-wei Hung)

Page 62:

PCA Applied in Inherently Robust Features (cont.)

• Application 3: PCA-derived filter bank (figure; from Dr. Jei-wei Hung)

Page 63:

Linear Discriminative Analysis

• Linear Discriminative Analysis (LDA)
– Widely applied for pattern classification
– Used to derive the most “discriminative” features
– Criterion: assume w_j, μ_j, and Σ_j are the weight, mean, and covariance of class j, j = 1, …, N. Two matrices are defined:

Between-class covariance:  S_b = Σ_{j=1}^{N} w_j·( μ_j − μ )( μ_j − μ )ᵀ
Within-class covariance:   S_w = Σ_{j=1}^{N} w_j·Σ_j

Find W = [w₁ w₂ … w_k] such that

Ŵ = argmax_W  |WᵀS_bW| / |WᵀS_wW|

– The columns w_j of W are the eigenvectors of S_w⁻¹S_b having the largest eigenvalues
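A minimal LDA sketch solving the S_w⁻¹S_b eigenproblem, assuming NumPy, a hypothetical data matrix `X` of shape (samples, dims), and integer class labels `y`:

```python
import numpy as np

def lda(X, y, k):
    """Return the k most discriminative projection directions (columns of W)."""
    classes, counts = np.unique(y, return_counts=True)
    mu = X.mean(axis=0)
    dims = X.shape[1]
    S_b = np.zeros((dims, dims))
    S_w = np.zeros((dims, dims))
    for c, n_c in zip(classes, counts):
        Xc = X[y == c]
        w_c = n_c / len(X)                      # class weight
        diff = (Xc.mean(axis=0) - mu)[:, None]
        S_b += w_c * diff @ diff.T              # between-class covariance
        S_w += w_c * np.cov(Xc, rowvar=False)   # within-class covariance
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_w, S_b))
    order = np.argsort(eigvals.real)[::-1][:k]  # largest eigenvalues of S_w^-1 S_b
    return eigvecs[:, order].real
```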

Page 64:

Linear Discriminative Analysis (cont.)

(Figure: the frequency responses of the 15 LDA-derived temporal filters. From Dr. Jei-wei Hung)

Page 65:

Minimum Classification Error

• Minimum Classification Error (MCE):
– General objective: find an optimal feature representation or optimal recognition models to minimize the expected classification error
– The recognizer is often operated under the following decision rule:
C(X) = C_i  if  g_i(X, Λ) = max_j g_j(X, Λ)
Λ = {λ^(i)}, i = 1, …, M (M models/classes); X: observations;
g_i(X, Λ): class-conditioned likelihood function, for example g_i(X, Λ) = P(X | λ^(i))
– Traditional training criterion:
find λ^(i) such that P(X | λ^(i)) is maximum (Maximum Likelihood) if X ∈ C_i
• This criterion does not always lead to minimum classification error, since it doesn’t consider the mutual relationship between different classes
• For example, it is possible that P(X | λ^(i)) is maximum but X ∉ C_i

Page 66:

Minimum Classification Error (cont.)

(Figure: example histograms of the likelihood ratio LR(X) when the observation X ∈ C_k and when X ∉ C_k, with a decision threshold τ_k. Type I error: false rejection; Type II error: false alarm/false acceptance.)

Page 67:

Minimum Classification Error (cont.)

• Minimum Classification Error (MCE) (cont.):
– One form of the class misclassification measure:

d_i(X) = −g_i(X, λ^(i)) + log[ ( 1/(M − 1) )·Σ_{j, j≠i} exp( α·g_j(X, λ^(j)) ) ]^(1/α),   X ∈ C_i

d_i(X) ≥ 0 implies a misclassification (error = 1); d_i(X) < 0 implies a correct classification (error = 0)

– A continuous loss function is defined as follows:

l_i(X, Λ) = l( d_i(X) ),  X ∈ C_i,   where the sigmoid function is  l(d) = 1 / ( 1 + exp(−γ·d + θ) )

– Classifier performance measure:

L(Λ) = E_X[ L(X, Λ) ] = Σ_X Σ_{i=1}^{M} l_i(X, Λ)·δ( X ∈ C_i )

Page 68:

Minimum Classification Error (cont.)

• Using MCE in model training:
– Find Λ such that

Λ̂ = argmin_Λ E_X[ L(X, Λ) ] = argmin_Λ L(Λ)

The above objective function in general cannot be minimized directly, but a local minimum can be reached using the gradient descent algorithm:

w_{t+1} = w_t − ε·∂L(Λ)/∂w    (w: an arbitrary parameter of Λ)

• Using MCE in robust feature representation:

f̂ = argmin_f E_X[ L( f(X), Λ_f ) ]    (f: a transform of the original features)

Note: while the feature representation is changed, the model is also changed accordingly.