
Hynek Bořil

Introduction Data Acquisition Neutral/LE Speech Analysis Equalization of LE in ASR Conclusions

Attributes and Recognition of Lombard Speech

Center for Robust Speech Systems, Erik Jonsson School of Engineering and Computer Science

University of Texas at Dallas

© Hynek Bořil, Workshop on Speech in Adverse Conditions, S2S, September 8-12, 2008

Contents

Introduction
  What is Lombard Effect?
  Why is Lombard Effect Interesting?
  Goals and Motivation of the Study

Data Acquisition

Neutral/LE Speech Analysis

Equalization of LE in ASR
  Acoustic Model Adaptation
  Voice Conversion
  Data-Driven Design of Robust Features
  Frequency Warping
  Two-Stage Recognition System

Summary

Objective

What is Lombard Effect?

When exposed to a noisy adverse environment, speakers modify the way they speak in an effort to maintain intelligible communication (Lombard Effect, LE)

Why is Lombard Effect Interesting?

Better understanding of the mechanisms of human speech communication (Can we intentionally change particular parameters of speech production to improve intelligibility, or is LE an automatic process learned through the public loop? How do the type of noise and the communication scenario affect LE?)

Mathematical modeling of LE: classification of LE level, speech synthesis in noisy environments, increasing robustness of automatic speech recognition and speaker identification systems

Study Objective and Motivation – LE Analysis

Ambiguity in Past LE Investigations

LE has been studied since 1911; however, many investigations disagree on the observed impacts of LE on speech production

Analyses were typically conducted on very limited data: a couple of utterances from a few subjects (1–10)

Lack of communication factor: many studies ignore the importance of communication for evoking LE (an effort to convey a message over noise), so the occurrence and level of LE in the speech recordings is 'random', which leads to contradicting analysis results

LE has been studied only for several world languages (English, Spanish, French, Japanese, Korean, Mandarin Chinese); there is no comprehensive study for any of the Slavic languages

1st Goal

Design of a Czech Lombard Speech Database addressing the need for a communication factor and well-defined simulated noisy conditions

Systematic analysis of LE in spoken Czech

Study Objective and Motivation – ASR under LE

ASR under LE

Mismatch between LE speech accompanied by noise and acoustic models trained on clean neutral speech

The strong impact of noise on ASR is well known, and a vast number of noise suppression/speech emphasis algorithms have been proposed in the last decades (yet no ultimate solution has been reached)

The negative impact of LE on ASR often exceeds that of noise; recent state-of-the-art ASR systems mostly ignore this issue

LE-Equalization Methods

LE-equalization algorithms typically operate in the following domains: robust features, LE-transformation towards neutral, model adjustments, improved training of acoustic models

The algorithms display various degrees of efficiency and are often bound by strong assumptions that prevent their real-world application (applying fixed transformations to phonetic groups, known level of LE, etc.)

2nd Goal

Proposal of novel LE-equalization techniques with a focus on both the level of LE suppression and the extent of bounding assumptions

Data Acquisition - Motivation

LE Corpora – Issues (1)

Tradeoff between realism and control of the phenomena of interest; (Murray and Arnott, 1993) on elicited emotional speech: “A trade-off exists between realism and measurement accuracy of emotions generated by speakers in the laboratory (questionable realism, but verbal content and recording conditions controllable) and field recordings (real emotions, but content and recording conditions less controllable).”

Databases recorded in real adverse conditions (e.g., car environment):

Limited or no control over the level and characteristics of background noise

Low SNRs of the recordings make reliable speech analysis difficult

Importance of the communication factor often completely ignored, e.g., SPEECON (Iskra et al., 2002)

Data Acquisition - Motivation

LE Corpora – Issues (2)

Special LE databases – simulated noisy conditions:

Successfully address the control of noise and SNR

In studies on speech production, authors sometimes employ a communication factor: (Korn, 1954), (Webster and Klumpp, 1962) – repeating words; (Patel and Schell, 2008) – interactive game

Studies on ASR/Speaker ID under LE: the importance of the communication factor is largely ignored – (Junqua, 1993), SUSAS (Hansen and Ghazale, 1997); exception: (Junqua et al., 1998) – communication with a dialing machine

Limited number of subjects and utterances, ranging typically from ten speakers (Webster and Klumpp, 1962), (Lane et al., 1970), (Junqua, 1993) to one or two speakers (Summers et al., 1988), (Pisoni et al., 1985), (Bond et al., 1989), (Tian et al., 2003), (Garnier et al., 2006)

Data Acquisition - Motivation

Available Czech Corpora

Czech SPEECON – speech recordings from various environments including office and car

CZKCC – car recordings, including parked car with engine off and moving car scenarios

Both databases contain speech produced in quiet and in noise, which makes them candidates for the study of LE; however, not good ones, as shown later

Design/acquisition of an LE-oriented database – Czech Lombard Speech Database '05 (CLSD'05)

Goals:

-Communication in simulated noisy background, high SNR

-Phonetically rich data/extensive small vocabulary material

-Parallel utterances in neutral and LE conditions

Introducing Communication in Recording

Simulated Noisy Conditions

Noise samples are mixed with speech feedback and reproduced to the speaker and the operator through headphones

The operator assesses the intelligibility of speech in noise: if the utterance is not intelligible, the operator asks the subject to repeat it. Speakers are thus required to convey the message over noise, which introduces communication and evokes LE

Noises: mostly car noises from the Car2E database, normalized to 90 dB SPL

Speaker Sessions

14 male/12 female speakers

Each subject was recorded in both neutral and simulated noisy conditions

[Figure: recording setup with the labels SPEAKER (close-talk and middle-talk microphones; noise + speech feedback), H&T RECORDER, and SMOOTH OPERATOR (noise + speech monitor; "OK – next" / "BAD - again")]

Speaker Session Contents

Corpus contents | Corpus/item id. | Number
Phonetically rich sentences | S01 – 30 | 30
Phonetically rich words | W01 – 05 | 5
Isolated digits | CI1 – I4, 30 – 69 | 44
Isolated digit sequences (8 digits) | CB1 – B2, 00 – 29 | 32
Connected digit sequences (5 digits) | CC1 – 4, C70 – 99 | 34
Natural numbers | CN1 – N3 | 3
Money amount | CM1 | 1
Time phrases; T1 – analogue, T2 – digital | CT1 – T2 | 2
Dates: D1 – analogue, D2 – relative and general date, D3 – digital | CD1 – D3 | 3
Proper name | CP1 | 1
City or street names | CO1 – O2 | 2
Questions | CQ1 – Q2 | 2
Special keyboard characters | CK1 – K2 | 2
Core word synonyms | Y01 – 95 |
Basic IVR commands | 101 – 85 |
Directory navigation | 201 – 40 |
Editing | 301 – 22 |
Output control | 401 – 57 |
Messaging & Internet browsing | 501 – 70 |
Organizer functions | 601 – 33 |
Routing | 701 – 39 |
Automotive | 801 – 12 |
Audio & Video | 901 – 95 |
 | | 89

IVR – interactive voice response

Sound Attenuation by Headphones

Environmental Sound Attenuation by Headphones

Attenuation characteristics measured on a dummy head

Source of wide-band noise; measurement of sound transfer to the dummy head's auditory canals when not wearing/wearing headphones

Attenuation characteristics – subtraction of the transfers
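The subtraction of the two transfers can be sketched as follows. This is a minimal illustration, assuming the transfers are available as per-band power values and that the subtraction is carried out on levels in dB; the function names and the example band values are hypothetical and are not measurements from the study.

```python
import math


def level_db(power):
    """Convert a (positive) band power to a level in dB."""
    return 10.0 * math.log10(power)


def attenuation_db(power_without, power_with):
    """Per-band attenuation: level without headphones minus level with headphones."""
    return [level_db(p_open) - level_db(p_covered)
            for p_open, p_covered in zip(power_without, power_with)]


# Hypothetical band powers at the dummy head's ear canal (arbitrary units).
without_headphones = [1.0, 1.0, 1.0]
with_headphones = [0.8, 0.1, 0.01]
print([round(a, 1) for a in attenuation_db(without_headphones, with_headphones)])
```

A positive value means the headphones attenuate the band; a negative value would indicate that the band is amplified at the ear canal.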

Sound Attenuation by Headphones

Environmental Sound Attenuation by Headphones

Directional attenuation – reflectionless sound booth

Real attenuation in recording room

[Figure: attenuation (dB) as a function of frequency (Hz) and angle (°)]

[Figure: Attenuation by headphones; attenuation (dB) vs. frequency (Hz), 100 Hz to 10 kHz; curves for 0°, 90°, 180° and the recording room]

[Figure: directional attenuation (dB) vs. angle (°) at 1 kHz, 2 kHz, 4 kHz and 8 kHz]

Closed vs. Open-Air Headphones

Open-Air Headphones

+ Easier to reach a flat frequency response than in closed headphones

+ Lower attenuation of sound coming from outside the headset

- High level of cross-talk from the headphones to the close-talk microphone: contamination of recorded speech by noise reproduced in the headphones

[Figure: CLSD'05 - SNR distributions; number of utterances vs. SNR (dB); curves: Close-talk N, Hands-free N, Close-talk LE, Hands-free LE]

Parameters of Neutral and LE Speech

Speech Features affected by LE

Vocal tract excitation: glottal pulse shape changes, fundamental frequency rises

Vocal tract transfer function: center frequencies of low formants increase, formant bandwidths reduce

Vocal effort (intensity) increases

Other: voiced phonemes prolonged, energy ratio in voiced/unvoiced increases, …

$$V(z) = \frac{G}{1 - \sum_{k=1}^{N} \alpha_k z^{-k}}$$

[Figure: source-filter model of speech production. Voiced branch: IMPULSE TRAIN GENERATOR I(z), controlled by the pitch period, followed by the GLOTTAL PULSE MODEL G(z) and gain AV. Unvoiced branch: RANDOM NOISE GENERATOR N(z) and gain AN. A voiced/unvoiced switch selects the excitation uG(n), which passes through the VOCAL TRACT MODEL V(z), controlled by the vocal tract parameters, and the RADIATION MODEL R(z) to give pL(n)]
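The voiced branch of this production model can be sketched in a few lines. The sketch assumes the standard all-pole form of the vocal tract model, V(z) = G / (1 − Σ α_k z^(−k)), and drives it directly with an impulse train; the glottal pulse and radiation models are left out, and the function names and coefficient values are illustrative only.

```python
def impulse_train(n_samples, pitch_period):
    """Unit impulses spaced pitch_period samples apart (voiced excitation)."""
    return [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]


def all_pole_filter(excitation, gain, alphas):
    """Apply V(z) = G / (1 - sum_k alpha_k z^-k) as a difference equation:
    s(n) = G * u(n) + sum_k alpha_k * s(n - k)."""
    output = []
    for n, u in enumerate(excitation):
        s = gain * u
        for k, alpha in enumerate(alphas, start=1):
            if n - k >= 0:
                s += alpha * output[n - k]
        output.append(s)
    return output


# Hypothetical settings: pitch period of 80 samples, two-pole resonator.
excitation = impulse_train(n_samples=160, pitch_period=80)
speech = all_pole_filter(excitation, gain=1.0, alphas=[1.3, -0.8])
print(len(speech), speech[:3])
```

In this picture a shorter pitch period corresponds to the rise in fundamental frequency under LE, while changes in the α_k coefficients would correspond to shifted formants and narrower bandwidths.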

Analysis: Fundamental Frequency

[Figure: Distribution of fundamental frequency, Czech SPEECON; number of samples (x 10,000) vs. fundamental frequency (Hz), 70 to 570 Hz; curves: Office F, Car F, Office M, Car M]

[Figure: Distribution of fundamental frequency, CZKCC; number of samples (x 1000) vs. fundamental frequency (Hz), 70 to 570 Hz; curves: Eng off F, Eng on F, Eng off M, Eng on M]

[Figure: Distribution of fundamental frequency, CLSD'05; number of samples (x 10,000) vs. fundamental frequency (Hz), 70 to 570 Hz; curves: Neutral F, LE F, Neutral M, LE M]

Analysis: Formant Center Frequencies

[Figure: four panels: Formants - CZKCC, female digits; Formants - CZKCC, male digits; Formants - CLSD'05, female digits; Formants - CLSD'05, male digits. Each panel plots F2 (Hz) against F1 (Hz) for the vowels /a/, /e/, /i/, /o/, /u/, with unprimed and primed labels (e.g., /a/ and /a'/) and the legend entries Female_N/Female_LE or Male_N/Male_LE]

Analysis: Formant Bandwidths

CZKCC

Vowel | B1M OFF (Hz) | σ1M OFF (Hz) | B1M ON (Hz) | σ1M ON (Hz) | B1F OFF (Hz) | σ1F OFF (Hz) | B1F ON (Hz) | σ1F ON (Hz)
/a/ | 207* | 74 | 210* | 84 | 275 | 97 | 299 | 78
/e/ | 125* | 70 | 130* | 78 | 156 | 68 | 186 | 79
/i/ | 124* | 49 | 127* | 44 | 105 | 44 | 136 | 53
/o/ | 275 | 87 | 222 | 67 | 263* | 85 | 269* | 73
/u/ | 187 | 100 | 170 | 89 | 174* | 96 | 187* | 101

CLSD'05

Vowel | B1M N (Hz) | σ1M N (Hz) | B1M LE (Hz) | σ1M LE (Hz) | B1F N (Hz) | σ1F N (Hz) | B1F LE (Hz) | σ1F LE (Hz)
/a/ | 269 | 88 | 152 | 59 | 232 | 85 | 171 | 68
/e/ | 168 | 94 | 99 | 44 | 169 | 73 | 130 | 49
/i/ | 125 | 53 | 108 | 52 | 132* | 52 | 133* | 58
/o/ | 239 | 88 | 157 | 81 | 246 | 91 | 158 | 62
/u/ | 134* | 67 | 142* | 81 | 209 | 95 | 148 | 66

SPEECON, CZKCC: no consistent BW changes

CLSD'05: significant BW reduction in many voiced phonemes


Analysis: Phoneme Durations

CZKCC

Word | Phoneme | # OFF | T_OFF (s) | σ_TOFF (s) | # ON | T_ON (s) | σ_TON (s) | Δ (%)
Nula | /a/ | 349 | 0.147 | 0.079 | 326 | 0.259 | 0.289 | 48.50
Jedna | /a/ | 269 | 0.173 | 0.076 | 251 | 0.241 | 0.238 | 39.36
Dva | /a/ | 245 | 0.228 | 0.075 | 255 | 0.314 | 0.311 | 38.04
Štiri | /r/ | 16 | 0.045 | 0.027 | 68 | 0.080 | 0.014 | 78.72
Sedm | /e/ | 78 | 0.099 | 0.038 | 66 | 0.172 | 0.142 | 72.58

CLSD'05

Word | Phoneme | # N | T_N (s) | σ_TN (s) | # LE | T_LE (s) | σ_TLE (s) | Δ (%)
Jedna | /e/ | 583 | 0.031 | 0.014 | 939 | 0.082 | 0.086 | 161.35
Dvje | /e/ | 586 | 0.087 | 0.055 | 976 | 0.196 | 0.120 | 126.98
Čtiri | /r/ | 35 | 0.041 | 0.020 | 241 | 0.089 | 0.079 | 115.92
Pjet | /e/ | 555 | 0.056 | 0.033 | 909 | 0.154 | 0.089 | 173.71
Sedm | /e/ | 358 | 0.080 | 0.038 | 583 | 0.179 | 0.136 | 122.46
Osm | /o/ | 310 | 0.086 | 0.027 | 305 | 0.203 | 0.159 | 135.25
Devjet | /e/ | 609 | 0.043 | 0.022 | 932 | 0.120 | 0.088 | 177.20

Significant increase in duration in some phonemes, especially voiced phonemes

Some unvoiced consonants show a reduction in duration

Duration changes in CLSD'05 considerably exceed those in SPEECON and CZKCC


Word Durations

CZKCC

Word | # OFF | T_OFF (s) | σ_TOFF (s) | # ON | T_ON (s) | σ_TON (s) | Δ (%)
Nula | 349 | 0.475 | 0.117 | 326 | 0.560 | 0.345 | 17.82
Jedna | 269 | 0.559 | 0.136 | 251 | 0.607 | 0.263 | 8.58
Dva | 245 | 0.426 | 0.106 | 255 | 0.483 | 0.325 | 13.57

CLSD'05

Word | # N | T_N (s) | σ_TN (s) | # LE | T_LE (s) | σ_TLE (s) | Δ (%)
Nula | 497 | 0.397 | 0.109 | 802 | 0.476 | 0.157 | 19.87
Jedna | 583 | 0.441 | 0.128 | 939 | 0.527 | 0.165 | 19.56
Dvje | 586 | 0.365 | 0.114 | 976 | 0.423 | 0.138 | 15.87

Word duration variations typically did not exceed 20 %
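The percentage column of the duration tables can be read as the relative change of the mean duration between the two conditions. The sketch below assumes exactly that reading (change relative to the neutral or engine-off mean); the function name is illustrative.

```python
def relative_change_percent(t_neutral, t_le):
    """Relative change of the mean duration, in percent of the neutral value."""
    return 100.0 * (t_le - t_neutral) / t_neutral


# Mean durations (s) of the word 'Jedna' in CLSD'05, neutral vs. LE.
print(round(relative_change_percent(0.441, 0.527), 2))
```

Because the tabulated means are rounded to three decimals, values recomputed this way can differ slightly from the tabulated percentages (here the result lands close to the listed 19.56 %).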

HMM-Based Automatic Speech Recognition (ASR)

Typical HMM Recognizer

[Figure: block diagram of a typical HMM recognizer with the blocks SPEECH SIGNAL, FEATURE EXTRACTION (MFCC/PLP), ACOUSTIC MODEL with SUB-WORD LIKELIHOODS (GMM/MLP), LEXICON (HMM), LANGUAGE MODEL (BIGRAMS), DECODER (VITERBI), ESTIMATED WORD SEQUENCE]

Feature extraction – transformation of the time-domain acoustic signal into a representation more convenient for the ASR engine: data dimensionality reduction, suppression of irrelevant (disturbing) signal components (speaker/environment/recording chain-dependent characteristics), preserving phonetic content

Sub-word models – Gaussian Mixture Models (GMMs): a mixture of Gaussians is used to model the distribution of feature vector parameters; Multi-Layer Perceptrons (MLPs): artificial neural networks
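How a GMM turns a feature vector into a sub-word likelihood can be sketched as below. The sketch assumes diagonal covariance matrices, a common simplification that the slides do not specify, and all names and parameter values are illustrative.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance at feature vector x."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def gmm_log_likelihood(x, weights, means, variances):
    """Log of the weighted sum of component densities (log-sum-exp for stability)."""
    logs = [math.log(w) + log_gaussian_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    peak = max(logs)
    return peak + math.log(sum(math.exp(value - peak) for value in logs))


# Hypothetical two-component model over 2-dimensional feature vectors.
weights = [0.6, 0.4]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [2.0, 2.0]]
print(gmm_log_likelihood([0.5, -0.2], weights, means, variances))
```

In an HMM recognizer one such mixture would be attached to each model state, and the decoder would compare these log-likelihoods across states for every frame.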

HMM-Based Automatic Speech Recognition (ASR)

Mel Frequency Cepstral Coefficients (MFCC)

Davis & Mermelstein, IEEE Trans. Acoustics, Speech, and Signal Processing, 1980

Mermelstein was born in Czechoslovakia

MFCC is the first choice in current commercial ASR

When used in HMM ASR, MFCC may incorporate several redundant stages for historical reasons (in the past, distance-based measures were used in speech decoding, which placed different requirements on cepstral coefficients than HMM systems do)

Perceptual Linear Predictive Coefficients (PLP)

Hermansky, Journal of the Acoustical Society of America, 1990

Hermansky was born in Czechoslovakia

Linear prediction – smoothing of the spectral envelope (may improve robustness)

PLP is a frequent choice in research labs – IDIAP, ICSI Berkeley, LIMSI…

[Figure: MFCC block diagram: s(n), PREEMPHASIS, WINDOW (HAMMING), |FFT|^2, FILTER BANK (MEL), Log(.), IDCT, c(n)]

[Figure: PLP block diagram: s(n), WINDOW (HAMMING), |FFT|^2, FILTER BANK (BARK), EQUAL LOUDNESS PREEMPHASIS, INTENSITY to LOUDNESS (cube root), LINEAR PREDICTION, RECURSION CEPSTRUM, c(n)]
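The MFCC chain in the block diagram can be sketched for a single frame as follows. This is a minimal pure-Python illustration, not the front end used in the study: the pre-emphasis coefficient, the number of filters, the mel formula, and the use of a DCT-II as the final transform are common textbook choices assumed here, and the naive DFT stands in for an FFT.

```python
import cmath
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mfcc_frame(frame, sample_rate, n_filters=20, n_ceps=13, preemph=0.97):
    n = len(frame)
    # 1) Pre-emphasis: first-order high-pass filter.
    emphasized = [frame[0]] + [frame[i] - preemph * frame[i - 1] for i in range(1, n)]
    # 2) Hamming window.
    windowed = [s * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
                for i, s in enumerate(emphasized)]
    # 3) Power spectrum |DFT|^2 for bins 0..n/2 (naive DFT for clarity).
    power = []
    for k in range(n // 2 + 1):
        acc = sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        power.append(abs(acc) ** 2)
    # 4) Triangular filters spaced uniformly on the mel scale.
    mel_max = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(mel_max * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bin_hz = sample_rate / n
    energies = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = edges[j - 1], edges[j], edges[j + 1]
        energy = 0.0
        for k, p in enumerate(power):
            f = k * bin_hz
            if lo < f <= mid:
                energy += p * (f - lo) / (mid - lo)
            elif mid < f < hi:
                energy += p * (hi - f) / (hi - mid)
        energies.append(energy)
    # 5) Logarithm of the filter-bank energies (small floor avoids log(0)).
    log_energies = [math.log(e + 1e-10) for e in energies]
    # 6) DCT-II decorrelates the log energies and yields the cepstral coefficients.
    return [sum(le * math.cos(math.pi * c * (j + 0.5) / n_filters)
                for j, le in enumerate(log_energies))
            for c in range(n_ceps)]


# Hypothetical test frame: 256 samples of a 440 Hz tone sampled at 8 kHz.
frame = [math.sin(2.0 * math.pi * 440.0 * t / 8000.0) for t in range(256)]
coefficients = mfcc_frame(frame, sample_rate=8000)
print(len(coefficients))
```

The 13 coefficients returned per frame match the "13 MFCC" static features of a typical digit recognizer; delta and delta-delta features would be computed across consecutive frames.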

Initial ASR Experiment

Equalization of LE in ASR

ASR Evaluation – WER (Word Error Rate)

$$\mathrm{WER} = \frac{D + S + I}{N} \times 100\,\%$$

S – word substitutions

I – word insertions

D – word deletions

N – number of words in the reference transcription

Digit Recognizer

Monophone HMM models

13 MFCC + ∆ + ∆∆

32 Gaussian mixtures per model state

Corpus | Set | # Spkrs | # Digits | WER (%)
Czech SPEECON | Office F | 22 | 880 | 5.5 (4.0–7.0)
Czech SPEECON | Office M | 31 | 1219 | 4.3 (3.1–5.4)
Czech SPEECON | Car F | 28 | 1101 | 4.6 (3.4–5.9)
Czech SPEECON | Car M | 42 | 1657 | 10.5 (9.0–12.0)
CZKCC | OFF F | 30 | 1480 | 3.0 (2.1–3.8)
CZKCC | OFF M | 30 | 1323 | 2.3 (1.5–3.1)
CZKCC | ON F | 18 | 1439 | 13.5 (11.7–15.2)
CZKCC | ON M | 21 | 1450 | 10.4 (8.8–12.0)
CLSD'05 | N F | 12 | 4930 | 7.3 (6.6–8.0)
CLSD'05 | N M | 14 | 1423 | 3.8 (2.8–4.8)
CLSD'05 | LE F | 12 | 5360 | 42.8 (41.5–44.1)
CLSD'05 | LE M | 14 | 6303 | 16.3 (15.4–17.2)

Noisy car recordings (SPEECON/CZKCC – 10.7/12.6 dB SNR)

Clean recordings (LE – 40.9 dB SNR)
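The WER computation can be sketched as follows: substitutions, deletions and insertions are counted along a minimum-edit-distance alignment and divided by the number of reference words. The interval helper is an assumption: the slides do not say how the ranges in parentheses were obtained, but they appear consistent with a 95 % normal-approximation binomial interval (5.5 % on 880 digits gives roughly 4.0 to 7.0). All function names are illustrative.

```python
import math


def count_errors(reference, hypothesis):
    """Return (S, D, I) for a minimum-edit-distance alignment of two word lists."""
    rows, cols = len(reference) + 1, len(hypothesis) + 1
    # Each cell holds (total_errors, S, D, I) for the prefixes aligned so far.
    table = [[None] * cols for _ in range(rows)]
    table[0][0] = (0, 0, 0, 0)
    for i in range(1, rows):
        table[i][0] = (i, 0, i, 0)  # only deletions
    for j in range(1, cols):
        table[0][j] = (j, 0, 0, j)  # only insertions
    for i in range(1, rows):
        for j in range(1, cols):
            t, s, d, ins = table[i - 1][j - 1]
            if reference[i - 1] == hypothesis[j - 1]:
                diagonal = (t, s, d, ins)          # match, no error
            else:
                diagonal = (t + 1, s + 1, d, ins)  # substitution
            t, s, d, ins = table[i - 1][j]
            deletion = (t + 1, s, d + 1, ins)
            t, s, d, ins = table[i][j - 1]
            insertion = (t + 1, s, d, ins + 1)
            table[i][j] = min(diagonal, deletion, insertion, key=lambda cell: cell[0])
    _, s, d, ins = table[-1][-1]
    return s, d, ins


def wer_percent(s, d, ins, n_reference_words):
    """WER = (D + S + I) / N * 100 %."""
    return 100.0 * (d + s + ins) / n_reference_words


def wer_interval(wer, n_words, z=1.96):
    """Normal-approximation interval around a WER given in percent (assumed method)."""
    p = wer / 100.0
    half_width = 100.0 * z * math.sqrt(p * (1.0 - p) / n_words)
    return wer - half_width, wer + half_width


reference = "one two three four".split()
hypothesis = "one three three".split()
s, d, ins = count_errors(reference, hypothesis)
print(s, d, ins, wer_percent(s, d, ins, len(reference)))
print(wer_interval(5.5, 880))
```

Because insertions are counted as errors, WER can exceed 100 % when the hypothesis contains many spurious words.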

Page 30: Hynek Bořil IntroductionData AcquisitionNeutral/LE Speech AnalysisEqualization of LE in ASRConclusions Attributes and Recognition of Lombard Speech Center.

Initial ASR ExperimentEqualization of LE in ASR

ASR Evaluation – WER (Word Error Rate)S – word substitutions

I – word insertions

D – word deletions

Digit RecognizerMonophone HMM models

13 MFCC + ∆ + ∆∆

32 Gaussian mixtures per model state


Page 36: Hynek Bořil IntroductionData AcquisitionNeutral/LE Speech AnalysisEqualization of LE in ASRConclusions Attributes and Recognition of Lombard Speech Center.

Acoustic Model Adaptation

Model Adaptation

Often effective when only limited data from the given conditions are available

Maximum Likelihood Linear Regression (MLLR) – if only a limited amount of data per class is available, acoustically close classes are grouped and transformed together

$\mu'_{\mathrm{MLLR}} = A\mu + b$

Maximum a posteriori approach (MAP) – initial models are used as informative priors for the adaptation

$\mu'_{\mathrm{MAP}} = \frac{N}{N + \tau}\,\bar{\mu} + \frac{\tau}{N + \tau}\,\mu$

where $\mu$ is the original model mean, $\bar{\mu}$ the mean of the adaptation data, $N$ the amount of adaptation data, and $\tau$ the weight of the prior

Adaptation Procedure

Framework provided by Technical University of Liberec

First, neutral speaker-independent (SI) models are transformed by MLLR, employing clustering (binary regression tree)

Second, MAP adaptation – only for nodes with a sufficient amount of adaptation data
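The two mean-update rules named on this slide can be sketched numerically. This is a simplified illustration, not the Technical University of Liberec framework: it assumes the usual MLLR form (an affine transform `A`, `b` shared by a class of Gaussians) and the usual MAP interpolation between the prior mean and the adaptation-data mean, with hard frame assignment and a hypothetical prior weight `tau`.

```python
import numpy as np


def mllr_mean(mu, A, b):
    """MLLR-style mean update: affine transform shared by a regression class."""
    return A @ mu + b


def map_mean(mu_prior, frames, tau):
    """MAP-style mean update: interpolate prior mean and adaptation-data mean.

    frames : (N, dim) array of adaptation vectors assigned to this Gaussian
             (hard assignment is an assumption made for brevity)
    tau    : weight of the prior; larger values keep the result near mu_prior
    """
    n = frames.shape[0]
    mu_bar = frames.mean(axis=0)
    return (n / (n + tau)) * mu_bar + (tau / (n + tau)) * mu_prior


# Two-stage use, mirroring the order on the slide: MLLR first, then MAP.
mu_si = np.array([1.0, 2.0])                    # speaker-independent mean
mu_mllr = mllr_mean(mu_si, 2.0 * np.eye(2), np.array([0.5, -0.5]))
mu_adapted = map_mean(mu_mllr, np.ones((10, 2)), tau=10.0)
```

With few adaptation frames (small `N` relative to `tau`) the MAP result stays close to the prior, which is consistent with applying MAP only where enough adaptation data is available.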

Page 37: Hynek Bořil IntroductionData AcquisitionNeutral/LE Speech AnalysisEqualization of LE in ASRConclusions Attributes and Recognition of Lombard Speech Center.

Acoustic Model Adaptation

Adaptation Schemes

Speaker-independent adaptation (SI) – group dependent/independent

Speaker-dependent adaptation (SD) – to neutral/LE

[Figure: Model adaptation to conditions and speakers. WER (%) for baseline digits LE, adapted digits LE, baseline sentences LE, and adapted sentences LE; series: SI adapt to LE (same spkrs), SI adapt to LE (disjunct spkrs), SD adapt to neutral, SD adapt to LE.]

Page 38: Hynek Bořil IntroductionData AcquisitionNeutral/LE Speech AnalysisEqualization of LE in ASRConclusions Attributes and Recognition of Lombard Speech Center.

Voice Conversion

Technique transforming speech from source speaker towards target speaker

Voice conversion typically transforms both excitation and vocal tract parameters → promising for LE → neutral transformation

Idea: $x_n$ – speech samples from source speaker, $y_n$ – target speaker; the goal is to find the conversion function $F$ minimizing the mean square error

$\varepsilon_{\mathrm{mse}} = E\left[\lVert y_n - F(x_n) \rVert^2\right]$

Voice conversion framework provided by Siemens AG

GMM-based text-dependent voice conversion

Parallel utterances required for transformation model training

Fundamental frequency transform:

$G_{F_0}(x_{F_0}) = \mu^{y}_{F_0} + \frac{\sigma^{y}_{F_0}}{\sigma^{x}_{F_0}}\left(x_{F_0} - \mu^{x}_{F_0}\right)$

Vocal tract transfer function transform:

$F_V(x) = \sum_{m=1}^{M} p_m(x)\left[\mu_m^{y} + \Sigma_m^{yx}\left(\Sigma_m^{xx}\right)^{-1}\left(x - \mu_m^{x}\right)\right]$
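A GMM-based conversion function of the kind named on this slide mixes per-component linear regressions, each weighted by the posterior probability of its component given the source vector. The sketch below assumes that standard form; it is not the Siemens framework, and all parameter names (`weights`, `mu_x`, `mu_y`, `sigma_xx`, `sigma_yx`) are hypothetical stand-ins for a joint source/target model trained on parallel utterances.

```python
import numpy as np


def gaussian_pdf(x, mean, cov):
    """Multivariate normal density evaluated at x."""
    dim = x.shape[0]
    diff = x - mean
    norm = np.sqrt(((2.0 * np.pi) ** dim) * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm


def convert(x, weights, mu_x, mu_y, sigma_xx, sigma_yx):
    """Map a source feature vector x towards the target speaker.

    weights  : (M,) mixture weights
    mu_x     : (M, dim) source means;  mu_y : (M, dim) target means
    sigma_xx : (M, dim, dim) source covariances
    sigma_yx : (M, dim, dim) target/source cross-covariances
    """
    likes = np.array([w * gaussian_pdf(x, mx, sxx)
                      for w, mx, sxx in zip(weights, mu_x, sigma_xx)])
    post = likes / likes.sum()            # p_m(x): component posteriors
    y = np.zeros(mu_y.shape[1])
    for p, mx, my, sxx, syx in zip(post, mu_x, mu_y, sigma_xx, sigma_yx):
        # Component-wise linear regression from source to target space
        y += p * (my + syx @ np.linalg.solve(sxx, x - mx))
    return y
```

With a single component the function reduces to one linear regression, which makes it easy to check by hand.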

Page 39: Hynek Bořil IntroductionData AcquisitionNeutral/LE Speech AnalysisEqualization of LE in ASRConclusions Attributes and Recognition of Lombard Speech Center.

Transformation of Fundamental Frequency

Voice Conversion

[Figure: Fundamental frequency distributions, two panels: CLSD'05 female sentences and CLSD'05 male sentences. x-axis: fundamental frequency (Hz), 0–600; y-axis: number of samples; series: N, LE, CLE, CLEF0.]

CLE – conversion of both excitation and vocal tract parameters

CLEF0 – only excitation converted, vocal tract parameters preserved

Page 40: Hynek Bořil IntroductionData AcquisitionNeutral/LE Speech AnalysisEqualization of LE in ASRConclusions Attributes and Recognition of Lombard Speech Center.

Transformation of Formants

Voice Conversion

[Figure: Formants – CLSD'05, two panels: female digits and male digits. x-axis: F1 (Hz), 300–900; y-axis: F2 (Hz), 500–2500; vowel positions /a/, /e/, /i/, /o/, /u/ for N, LE, CLE, and CLEF0 speech.]

Page 41: Hynek Bořil IntroductionData AcquisitionNeutral/LE Speech AnalysisEqualization of LE in ASRConclusions Attributes and Recognition of Lombard Speech Center.

Voice Conversion in ASR Front-End

Voice Conversion

[Figure: Sentences – LVCSR. WER (%) for Males – no LM, Males – LM, Females – no LM, Females – LM; series: Neutral, LE, CLE, CLEF0.]

[Figure: Digits. WER (%) for Males and Females; series: Neutral, LE, CLE, CLEF0.]

Effectiveness of Voice Conversion in ASR Task

Partially successful in digits task

Fails in LVCSR task – classes to be recognized are too close in acoustic space; any inaccuracy of VC results in ASR deterioration

Listening tests – converted speech samples contain strong artifacts; at times the speech becomes unintelligible for human listeners

Page 42: Hynek Bořil IntroductionData AcquisitionNeutral/LE Speech AnalysisEqualization of LE in ASRConclusions Attributes and Recognition of Lombard Speech Center.

Data-Driven Design of Robust Features

Filter Bank Approach

Analysis of importance of frequency components for ASR

Repartitioning the filter bank (FB) to emphasize components carrying phonetic information and suppress disturbing components

Initial FB uniformly distributed on linear scale – equal attention to all components

Consecutively, a single FB band is omitted → impact on WER?

Omitting bands carrying more information will result in a considerable WER increase

Implementation

MFCC front-end, MEL scale replaced by linear scale, triangular filters replaced by rectangular filters without overlap
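The band-omission analysis relies on a uniform, non-overlapping rectangular filter bank from which a single band can be dropped. A minimal sketch of such a bank follows; the bin count of 129 (a 256-point FFT covering 0–4 kHz) and the function name are assumptions made for illustration, not details from the original front-end.

```python
import numpy as np


def rectangular_filter_bank(n_bands=20, n_fft_bins=129, omit=None):
    """Uniform rectangular filters on a linear frequency axis, no overlap.

    Returns an (n_bands, n_fft_bins) matrix of 0/1 weights. If `omit` is a
    band index, that row is removed, so the corresponding spectral region
    no longer contributes to the features.
    """
    edges = np.linspace(0, n_fft_bins, n_bands + 1).round().astype(int)
    fb = np.zeros((n_bands, n_fft_bins))
    for k in range(n_bands):
        fb[k, edges[k]:edges[k + 1]] = 1.0
    if omit is not None:
        fb = np.delete(fb, omit, axis=0)
    return fb


# Band energies for one frame: full bank vs. bank with the first band omitted.
power_spectrum = np.ones(129)
full_energies = rectangular_filter_bank() @ power_spectrum
reduced_energies = rectangular_filter_bank(omit=0) @ power_spectrum
```

Repeating the recognition experiment once per omitted band would then give one WER value per band, which is the kind of curve the analysis describes.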

Page 43: Hynek Bořil IntroductionData AcquisitionNeutral/LE Speech AnalysisEqualization of LE in ASRConclusions Attributes and Recognition of Lombard Speech Center.

Optimizing Filter Banks – Baseline Performance

Data-Driven Design of Robust Features

[Figure: Filter bank of 20 bands (1–20) with cut-off frequencies up to 4 kHz (amplitude vs. filterbank cut-off frequencies in Hz), producing features c0, c1, …, c12; ∆c0, ∆c1, …, ∆c12; ∆∆c0, ∆∆c1, …, ∆∆c12. Two panels, neutral speech and LE speech: WER (%) vs. omitted band.]


Page 49: Hynek Bořil IntroductionData AcquisitionNeutral/LE Speech AnalysisEqualization of LE in ASRConclusions Attributes and Recognition of Lombard Speech Center.

Optimizing Filter Banks – Omitting Bands

Data-Driven Design of Robust Features

[Figure: Two panels, neutral speech and LE speech: WER (%) vs. omitted band (0–20), for features c0, c1, …, c12; ∆c0, ∆c1, …, ∆c12; ∆∆c0, ∆∆c1, …, ∆∆c12.]

Area of 1st and 2nd formant occurrence – highest portion of phonetic information; F1 is more important for neutral speech, F1–F2 for LE speech recognition

Omitting the 1st band considerably improves LE ASR while reducing performance on neutral speech → tradeoff

Next step – how much of the low frequency content should be omitted for LE ASR?

Page 50: Hynek Bořil IntroductionData AcquisitionNeutral/LE Speech AnalysisEqualization of LE in ASRConclusions Attributes and Recognition of Lombard Speech Center.

Optimizing Filter Banks – Omitting Low Frequencies

Data-Driven Design of Robust Features

[Figure: Filter bank of 19 bands (1–19) with cut-off frequencies up to 4 kHz (amplitude vs. filterbank cut-off frequencies in Hz), producing features c0, c1, …, c12; ∆c0, ∆c1, …, ∆c12; ∆∆c0, ∆∆c1, …, ∆∆c12. Two panels, neutral speech and LE speech: WER (%) vs. bandwidth (Hz), 0–1200.]


Page 56: Hynek Bořil IntroductionData AcquisitionNeutral/LE Speech AnalysisEqualization of LE in ASRConclusions Attributes and Recognition of Lombard Speech Center.

Reduced Filter Bank vs. Standard Features

Data-Driven Design of Robust Features

WER (%), devel set   Neutral         LE
LFCC, full band      4.8 (4.1–5.5)   29.0 (27.5–30.5)
LFCC, 625 Hz         6.6 (5.8–7.4)   15.6 (14.4–16.8)

Effect of Omitting Low Spectral Components

Increasing the FB low cut-off results in an almost linear increase of WER on neutral speech while considerably enhancing ASR performance on LE speech

Optimal low cut-off found at 625 Hz

Page 57: Hynek Bořil IntroductionData AcquisitionNeutral/LE Speech AnalysisEqualization of LE in ASRConclusions Attributes and Recognition of Lombard Speech Center.

Increasing Filter Bank Resolution

Data-Driven Design of Robust Features

[Figure: LE speech, WER (%) vs. omitted band (0–13), filter bank starting at 625 Hz.]

Increasing Frequency Resolution

Idea – emphasize the high-information portion of the spectrum by increasing FB resolution

Experiment – FB decimation from 19 → 12 bands (decreasing computational costs)

Increasing the number of filters at the peak of the information distribution curve → deterioration of LE ASR (17.2 % → 26.9 %)

Slight F1–F2 shifts due to LE affect cepstral features

No simple recipe on how to derive an efficient FB from the information distribution curves

Page 58: Hynek Bořil IntroductionData AcquisitionNeutral/LE Speech AnalysisEqualization of LE in ASRConclusions Attributes and Recognition of Lombard Speech Center.

Filter Bank Repartitioning

Data-Driven Design of Robust Features

[Figure: LE speech, WER (%) vs. critical frequency (Hz), 500–4000; one curve per band, Band 1 to Band 6.]

Consecutive Filter Bank Repartitioning

Consecutively, from lowest to highest, each FB high cut-off is varied, while the upper rest of the FB is redistributed uniformly across the remaining frequency band

The cut-off resulting in a local WER minimum is fixed and the procedure is repeated for the adjacent higher cut-off → WER reduction by 2.3 % for LE, by 1 % on neutral (Example – 6 bands FB)
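The consecutive repartitioning procedure can be sketched as a greedy search over band edges. This is an illustration under stated assumptions rather than the original implementation: `evaluate_wer` is a hypothetical callback that would train and test a recognizer for a given list of band edges, the search grid `step` is arbitrary, and the sketch simply takes the lowest WER on the scanned grid where the slide speaks of a local minimum.

```python
import numpy as np


def repartition(f_low, f_high, n_bands, evaluate_wer, step=50.0):
    """Greedy, lowest-to-highest repartitioning of filter bank band edges.

    evaluate_wer(edges) -> WER for a filter bank with the given band edges
    (hypothetical callback standing in for a full recognition experiment).
    """
    edges = list(np.linspace(f_low, f_high, n_bands + 1))
    for k in range(1, n_bands):                 # inner cut-offs, lowest first
        best_edges, best_wer = None, float("inf")
        lo = edges[k - 1] + step                # stay above the fixed lower edge
        hi = f_high - step * (n_bands - k)      # leave room for the upper bands
        for cut in np.arange(lo, hi + 1e-9, step):
            # Redistribute the remaining upper bands uniformly above the cut-off
            upper = np.linspace(cut, f_high, n_bands - k + 1)
            candidate = edges[:k] + list(upper)
            w = evaluate_wer(candidate)
            if w < best_wer:
                best_edges, best_wer = candidate, w
        if best_edges is not None:
            edges = best_edges                  # fix this cut-off, move upward
    return [float(e) for e in edges]
```

Each call to `evaluate_wer` stands for a complete train-and-test cycle, so in practice the grid step controls the cost of the whole search.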

Page 59: Hynek Bořil IntroductionData AcquisitionNeutral/LE Speech AnalysisEqualization of LE in ASRConclusions Attributes and Recognition of Lombard Speech Center.

Evaluation - Standard vs. Novel Features

Data-Driven Design of Robust Features

$\mathrm{Expolog}(f) = \begin{cases} 700\left(10^{f/3988} - 1\right), & 0 \le f \le 2000\ \mathrm{Hz} \\ 2595\,\log_{10}\left(1 + \frac{f}{700}\right), & 2000 < f \le 4000\ \mathrm{Hz} \end{cases}$

[Figure: Expolog frequency (Hz) vs. linear frequency (Hz), two panels.]

State-of-the-Art LE Front-End – Expolog (Bou-Ghazale & Hansen, 2000)

FB redistributed to improve stressed speech recognition (including loud and Lombard speech)

Increased resolution in the area of F2 occurrence
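The Expolog scale is a piecewise frequency mapping: exponential up to 2000 Hz and logarithmic (mel-like) from 2000 to 4000 Hz. A minimal sketch of the mapping, using the constants 700, 3988, and 2595 as they appear on the slide; the function name and the range check are illustrative additions.

```python
import math


def expolog(f_hz):
    """Map a linear frequency in Hz (0–4000) onto the Expolog scale."""
    if not 0.0 <= f_hz <= 4000.0:
        raise ValueError("Expolog is defined here for 0-4000 Hz only")
    if f_hz <= 2000.0:
        # Exponential branch below 2 kHz
        return 700.0 * (10.0 ** (f_hz / 3988.0) - 1.0)
    # Logarithmic branch between 2 and 4 kHz
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)
```

The two branches nearly coincide at 2000 Hz, so the mapping is close to continuous at the switch-over point.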


Page 61: Hynek Bořil IntroductionData AcquisitionNeutral/LE Speech AnalysisEqualization of LE in ASRConclusions Attributes and Recognition of Lombard Speech Center.

Evaluation - Standard vs. Novel Features

Data-Driven Design of Robust Features

Evaluation in ASR Task

Standard MFCC, PLP; variations MFCC-LPC, PLP-DCT – altered cepstrum extraction schemes

Expolog – Expolog FB replacing trapezoid FB in PLP

20Bands-LPC – uniform rectangular FB employed in PLP front-end

Big1-LPC – derived from 20Bands-LPC, first 3 bands merged – decreased resolution at frequencies disturbing for LE ASR

RFCC-DCT – repartitioned FB, 19 bands, starting at 625 Hz, employed in MFCC

RFCC-LPC – RFCC employed in PLP

Page 62: Hynek Bořil IntroductionData AcquisitionNeutral/LE Speech AnalysisEqualization of LE in ASRConclusions Attributes and Recognition of Lombard Speech Center.

Evaluation - Standard vs. Novel Features

Data-Driven Design of Robust Features

[Figure: Features – performance on female digits. WER (%) for MFCC, MFCC-LPC, PLP, PLP-DCT, Expolog, 20Bands-LPC, Big1-LPC, RFCC-DCT, RFCC-LPC; series: Neutral, LE, CLE, CLEF0.]

Page 63: Hynek Bořil IntroductionData AcquisitionNeutral/LE Speech AnalysisEqualization of LE in ASRConclusions Attributes and Recognition of Lombard Speech Center.

Performance as a Function of Utterances’ Mean F0

Data-Driven Design of Robust Features

Mean fundamental frequency correlates with LE level → evaluation of selected front-ends on data subsets with changing mean fundamental frequency

[Figure: Fundamental frequency, neutral + LE female digits. x-axis: fundamental frequency (Hz), 0–600; y-axis: number of samples; Fc marked.]

[Figure: WER (%) vs. center frequency (Hz), 0–600; series: MFCC, RFCC-LPC, PLP, Expolog, 20Bands-LPC, BL MFCC, BL RFCC-LPC, BL PLP, BL Expolog, BL 20Bands-LPC.]

Page 64: Hynek Bořil IntroductionData AcquisitionNeutral/LE Speech AnalysisEqualization of LE in ASRConclusions Attributes and Recognition of Lombard Speech Center.

Performance as a Function of Utterances’ Mean F0

Data-Driven Design of Robust Features

[Figure: WER (%) vs. center frequency (Hz), 150–450; series: MFCC, RFCC-LPC, PLP, Expolog, 20Bands-LPC, BL MFCC, BL RFCC-LPC, BL PLP, BL Expolog, BL 20Bands-LPC.]

RFCC-LPC outperforms the other front-ends as the fundamental frequency increases

Page 65: Hynek Bořil IntroductionData AcquisitionNeutral/LE Speech AnalysisEqualization of LE in ASRConclusions Attributes and Recognition of Lombard Speech Center.

Performance in Noise

Data-Driven Design of Robust Features

Set             Neutral                              LE                                        NSeff (dB)
Airport         MFCC, 20Bands–LPC, PLP               Big1–LPC, RFCC–LPC, Expolog               None
Babble          MFCC, MFCC–LPC, PLP–DCT              RFCC–LPC, Expolog (Big1–LPC), RFCC–DCT    10
Car2E           Expolog, 20Bands–LPC, Big1–LPC       RFCC–LPC, Big1–LPC, Expolog               -5
Restaurant      MFCC, 20Bands–LPC, MFCC–LPC          RFCC–LPC, Big1–LPC, RFCC–DCT              -5
Street          20Bands–LPC, MFCC, Expolog           RFCC–LPC, Big1–LPC, 20Bands–LPC           0
Train station   20Bands–LPC, MFCC, Expolog           RFCC–LPC, Big1–LPC, 20Bands–LPC           -5

Front-End Comparison on Noisy Speech

A set of noises added to clean speech recordings at SNR levels of -5, 0, 5, …, 20 dB

Simple full-wave rectification noise subtraction applied

Neutral noisy speech: both DCT and LPC-based features perform best, depending on the noise type

LE noisy speech: LPC-based features perform best for all noise types – spectral smoothing introduced by LP modeling is more robust to glottal variations due to LE
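Mixing noise into clean recordings at a prescribed SNR amounts to scaling the noise so that the speech-to-noise power ratio hits the target. The sketch below shows one common way to do this; it is an assumption about the general technique, not the procedure actually used for these experiments, and it measures power over the whole signal rather than over speech-active segments only.

```python
import numpy as np


def add_noise_at_snr(speech, noise, snr_db):
    """Return speech + scaled noise so that the global SNR equals snr_db."""
    # Repeat or truncate the noise to the length of the speech signal
    noise = np.resize(noise, speech.shape)
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Gain that brings the noise power to p_speech / 10^(snr_db / 10)
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise


# Example: one clean signal mixed at the SNR levels listed on the slide.
rng = np.random.default_rng(0)
clean = np.sin(2.0 * np.pi * 200.0 * np.arange(8000) / 8000.0)
babble_like = rng.standard_normal(3000)   # placeholder for a real noise recording
noisy_versions = {snr: add_noise_at_snr(clean, babble_like, snr)
                  for snr in range(-5, 25, 5)}
```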

Page 66: Hynek Bořil IntroductionData AcquisitionNeutral/LE Speech AnalysisEqualization of LE in ASRConclusions Attributes and Recognition of Lombard Speech Center.

Frequency Warping

Vocal tract length normalization (VTLN): mean formant locations are inversely proportional to vocal tract length (VTL) → compensation for inter-speaker VTL variations by frequency transformation (warping):

$F_W = \alpha F$

Formant-Driven (FD) Approach

Warping factor determined from estimated mean formant locations

Maximum Likelihood (ML) Approach

Warping factor searched to maximize the likelihood of the observations given the acoustic models:

$\hat{\alpha} = \arg\max_{\alpha} \Pr\left(O^{\alpha} \mid W, \Theta\right)$

Factor typically searched in the interval 0.8–1.2 (corresponds to the ratio of VTL differences in males and females)
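The maximum likelihood search for a warping factor is usually run as a grid search: warp the features with each candidate factor, score them against the acoustic models, and keep the best-scoring factor. The sketch below assumes a linear warping of the frequency axis and a grid over 0.8–1.2 in steps of 0.02; `log_likelihood` is a hypothetical callback standing in for feature extraction with the warped filter bank followed by forced alignment against the transcription.

```python
import numpy as np


def warp(freqs_hz, alpha):
    """Linear frequency warping: scale every frequency by the factor alpha."""
    return alpha * np.asarray(freqs_hz, dtype=float)


def search_warping_factor(log_likelihood, alphas=None):
    """Return the candidate factor with the highest log-likelihood.

    log_likelihood(alpha) -> score of the utterance warped by alpha
    (hypothetical callback; a real system would decode or align here).
    """
    if alphas is None:
        alphas = np.round(np.arange(0.80, 1.2001, 0.02), 2)
    scores = [log_likelihood(a) for a in alphas]
    return float(alphas[int(np.argmax(scores))])
```

A toy score such as `lambda a: -(a - 1.06) ** 2` makes the search return the grid point nearest 1.06, which is a quick way to check the plumbing before plugging in a recognizer.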

Page 67: Hynek Bořil IntroductionData AcquisitionNeutral/LE Speech AnalysisEqualization of LE in ASRConclusions Attributes and Recognition of Lombard Speech Center.

Vocal Tract Length Normalization (ML Approach)

Introduction Data Acquisition Neutral/LE Speech Analysis Equalization of LE in ASR Conclusions

Frequency Warping

0

500

1000

1500

2000

2500

3000

3500

4000

4500

0 500 1000 1500 2000 2500 3000 3500 4000 4500

VTLN - PrincipleCase: VTL1 VTLNORM

Formant frequencies (Hz)Speaker1 (VTL1)

No

rma

lize

d S

pe

ake

r NO

RM (

VT

LN

OR

M)

Fo

rma

nt f

req

ue

nci

es

(Hz)

F2

F1

F3

F4

=

© Hynek Bořil, [email protected] Workshop on Speech in Adverse Conditions, S2S, September 8-12, 2008

Page 68: Hynek Bořil IntroductionData AcquisitionNeutral/LE Speech AnalysisEqualization of LE in ASRConclusions Attributes and Recognition of Lombard Speech Center.

Vocal Tract Length Normalization (ML Approach)

Introduction Data Acquisition Neutral/LE Speech Analysis Equalization of LE in ASR Conclusions

Frequency Warping

0

500

1000

1500

2000

2500

3000

3500

4000

4500

0 500 1000 1500 2000 2500 3000 3500 4000 4500

VTLN - PrincipleCase: VTL1 VTLNORM

Formant frequencies (Hz)Speaker1 (VTL1)

No

rma

lize

d S

pe

ake

r NO

RM (

VT

LN

OR

M)

Fo

rma

nt f

req

ue

nci

es

(Hz)

F2

F1

F3

F4

>

© Hynek Bořil, [email protected] Workshop on Speech in Adverse Conditions, S2S, September 8-12, 2008

Page 69: Hynek Bořil IntroductionData AcquisitionNeutral/LE Speech AnalysisEqualization of LE in ASRConclusions Attributes and Recognition of Lombard Speech Center.

Vocal Tract Length Normalization (ML Approach)

Introduction Data Acquisition Neutral/LE Speech Analysis Equalization of LE in ASR Conclusions

Frequency Warping

0

500

1000

1500

2000

2500

3000

3500

4000

4500

0 500 1000 1500 2000 2500 3000 3500 4000 4500

VTLN - PrincipleCase: VTL1 VTLNORM

Formant frequencies (Hz)Speaker1 (VTL1)

No

rma

lize

d S

pe

ake

r NO

RM (

VT

LN

OR

M)

Fo

rma

nt f

req

ue

nci

es

(Hz)

F2

F1

F3

F4

>

© Hynek Bořil, [email protected] Workshop on Speech in Adverse Conditions, S2S, September 8-12, 2008

Page 70: Hynek Bořil IntroductionData AcquisitionNeutral/LE Speech AnalysisEqualization of LE in ASRConclusions Attributes and Recognition of Lombard Speech Center.

Vocal Tract Length Normalization (ML Approach)

Introduction Data Acquisition Neutral/LE Speech Analysis Equalization of LE in ASR Conclusions

Frequency Warping

0

500

1000

1500

2000

2500

3000

3500

4000

4500

0 500 1000 1500 2000 2500 3000 3500 4000 4500

VTLN - PrincipleCase: VTL1 VTLNORM

Formant frequencies (Hz)Speaker1 (VTL1)

No

rma

lize

d S

pe

ake

r NO

RM (

VT

LN

OR

M)

Fo

rma

nt f

req

ue

nci

es

(Hz)

F2

F1

F3

F4

>

© Hynek Bořil, [email protected] Workshop on Speech in Adverse Conditions, S2S, September 8-12, 2008

Page 71: Hynek Bořil IntroductionData AcquisitionNeutral/LE Speech AnalysisEqualization of LE in ASRConclusions Attributes and Recognition of Lombard Speech Center.

Vocal Tract Length Normalization (ML Approach)

Introduction Data Acquisition Neutral/LE Speech Analysis Equalization of LE in ASR Conclusions

Frequency Warping

0

500

1000

1500

2000

2500

3000

3500

4000

4500

0 500 1000 1500 2000 2500 3000 3500 4000 4500

VTLN - PrincipleCase: VTL1 VTLNORM

Formant frequencies (Hz)Speaker1 (VTL1)

No

rma

lize

d S

pe

ake

r NO

RM (

VT

LN

OR

M)

Fo

rma

nt f

req

ue

nci

es

(Hz)

F2

F1

F3

F4

<

© Hynek Bořil, [email protected] Workshop on Speech in Adverse Conditions, S2S, September 8-12, 2008

Page 72: Hynek Bořil IntroductionData AcquisitionNeutral/LE Speech AnalysisEqualization of LE in ASRConclusions Attributes and Recognition of Lombard Speech Center.

Vocal Tract Length Normalization (ML Approach)

Introduction Data Acquisition Neutral/LE Speech Analysis Equalization of LE in ASR Conclusions

Frequency Warping

0

500

1000

1500

2000

2500

3000

3500

4000

4500

0 500 1000 1500 2000 2500 3000 3500 4000 4500

VTLN - PrincipleCase: VTL1 VTLNORM

Formant frequencies (Hz)Speaker1 (VTL1)

No

rma

lize

d S

pe

ake

r NO

RM (

VT

LN

OR

M)

Fo

rma

nt f

req

ue

nci

es

(Hz)

F2

F1

F3

F4

<

© Hynek Bořil, [email protected] Workshop on Speech in Adverse Conditions, S2S, September 8-12, 2008

Page 73: Hynek Bořil IntroductionData AcquisitionNeutral/LE Speech AnalysisEqualization of LE in ASRConclusions Attributes and Recognition of Lombard Speech Center.

Vocal Tract Length Normalization (ML Approach)

Introduction Data Acquisition Neutral/LE Speech Analysis Equalization of LE in ASR Conclusions

Frequency Warping

0

500

1000

1500

2000

2500

3000

3500

4000

4500

0 500 1000 1500 2000 2500 3000 3500 4000 4500

VTLN - PrincipleCase: VTL1 VTLNORM

Formant frequencies (Hz)Speaker1 (VTL1)

No

rma

lize

d S

pe

ake

r NO

RM (

VT

LN

OR

M)

Fo

rma

nt f

req

ue

nci

es

(Hz)

F2

F1

F3

F4

<

© Hynek Bořil, [email protected] Workshop on Speech in Adverse Conditions, S2S, September 8-12, 2008

Page 74: Hynek Bořil IntroductionData AcquisitionNeutral/LE Speech AnalysisEqualization of LE in ASRConclusions Attributes and Recognition of Lombard Speech Center.

VTLN vs. Lombard Effect

[Figure: formant frequencies F1–F4 of Speaker 1 (VTL1; horizontal axis, 0–4500 Hz) plotted against formant frequencies of the normalized speaker NORM (VTLNORM; vertical axis, 0–4500 Hz), with candidate warping lines marked "?".]

What to choose? A good approximation of the low formants, or a good approximation of the higher formants?

Generalized Frequency Transform

[Figure: generalized transform. Formant frequencies F1–F4 of Speaker 1 (VTL1; horizontal axis, 0–4500 Hz) plotted against formant frequencies of the normalized speaker NORM (VTLNORM; vertical axis, 0–4500 Hz). Case: VTL1 ? VTLNORM.]

Frequency Warping (Formant Driven Approach)

[Figure: frequency warping functions for females and males. LE domain frequency (Hz; horizontal axis, 0–4500) plotted against neutral domain frequency (Hz; vertical axis, 0–4500), with fitted regression lines.]

Females: y = 1.0045x - 76.745, R² = 0.9979

Males: y = 1.0217x - 50.311, R² = 0.9941
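The regression lines reported for the formant-driven warping (y = 1.0045x - 76.745 for females, y = 1.0217x - 50.311 for males, with x in the LE domain and y in the neutral domain) can be written as a pair of mapping functions. The sketch below only evaluates those lines and their inverses; the dictionary `WARP` and the function names are hypothetical, and how the mapping is applied to the filter bank in the actual front-end is not specified here.

```python
# Slope and offset of the fitted lines: neutral_hz = slope * le_hz + offset.
WARP = {
    "female": (1.0045, -76.745),
    "male": (1.0217, -50.311),
}

def le_to_neutral(le_hz, gender):
    """Map an LE-domain frequency (Hz) to the neutral domain."""
    slope, offset = WARP[gender]
    return slope * le_hz + offset

def neutral_to_le(neutral_hz, gender):
    """Inverse mapping: neutral-domain frequency (Hz) back to the LE domain."""
    slope, offset = WARP[gender]
    return (neutral_hz - offset) / slope

if __name__ == "__main__":
    for f in (500.0, 1500.0, 2500.0):
        print(f, round(le_to_neutral(f, "female"), 1), round(le_to_neutral(f, "male"), 1))
```

Unlike a pure VTLN scaling, the lines have a non-zero offset, so the shift is not proportional to frequency.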

Evaluation – ML VTLN vs. FD Generalized Transform

ML VTLN, WER (%):

Set                        Females Neutral   Females LE         Males Neutral   Males LE
# Digits                   2560              2560               1423            6303
Baseline                   4.3 (3.5–5.0)     33.6 (31.8–35.5)   2.2 (1.4–2.9)   22.9 (21.8–23.9)
Utterance-dependent VTLN   3.6 (2.9–4.3)     28.2 (26.4–29.9)   1.8 (1.1–2.4)   16.6 (15.7–17.6)
Speaker-dependent VTLN     4.0 (3.2–4.7)     27.7 (26.0–29.5)   1.8 (1.1–2.4)   17.4 (16.5–18.3)

FD generalized transform, WER (%):

Set             Females Neutral   Females LE         Males Neutral   Males LE
# Digits        2560              2560               1423            6303
Baseline bank   4.2 (3.4–5.0)     35.1 (33.3–37.0)   2.2 (1.4–2.9)   23.2 (22.1–24.2)
Warped bank     4.4 (3.6–5.2)     23.4 (21.8–25.0)   1.8 (1.1–2.4)   15.7 (14.8–16.6)

Generalized transform better addresses LE-induced formant shifts

Formant-driven approach is less computationally demanding (no need for multiple alignment passes as in VTLN), but requires reliable formant tracking, which is a problem at low SNRs; there the ML approach is more stable

Two-Stage Recognition System

[Diagram: Speech Signal → Neutral/LE Classifier → Neutral Recognizer or LE Recognizer → Estimated Word Sequence]

Tandem Neutral/LE Classifier – Neutral/LE Dedicated Recognizers

Improving ASR features for LE often results in a performance tradeoff on neutral speech

Idea: combine separate systems ‘tuned’ for neutral and LE speech, directed by a neutral/LE classifier
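The routing idea can be sketched in a few lines. Everything here is a stand-in: `two_stage_recognize`, the 0.5 decision threshold, and the toy classifier and recognizers are hypothetical placeholders for the real neutral/LE classifier and the two dedicated recognizers.

```python
from typing import Callable, Sequence

Features = Sequence[float]

def two_stage_recognize(
    features: Features,
    prob_le: Callable[[Features], float],
    neutral_recognizer: Callable[[Features], str],
    le_recognizer: Callable[[Features], str],
    threshold: float = 0.5,
) -> str:
    """Stage 1: classify the utterance as neutral or LE.
    Stage 2: decode it with the recognizer dedicated to that class."""
    is_le = prob_le(features) >= threshold
    recognizer = le_recognizer if is_le else neutral_recognizer
    return recognizer(features)

if __name__ == "__main__":
    # Toy stand-ins: the first feature is treated as a spectral slope in dB/oct.
    toy_prob_le = lambda f: 1.0 if f[0] > -5.0 else 0.0
    toy_neutral = lambda f: "decoded by neutral recognizer"
    toy_le = lambda f: "decoded by LE recognizer"
    print(two_stage_recognize([-7.4], toy_prob_le, toy_neutral, toy_le))
    print(two_stage_recognize([-3.9], toy_prob_le, toy_neutral, toy_le))
```

Only the selected recognizer runs for each utterance, so the decoding cost stays close to that of a single recognizer plus the classifier.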

Features for Neutral/LE Classification – Spectral Slope

[Figure: spectral slope, female vowel /a/ (two panels). Magnitude (dB, -80 to 60) plotted against log frequency (Hz, 10^0 to 10^4).]

Proposal of Neutral/LE Classifier

Search for a set of features providing good discriminability between neutral and LE speech

Requirements: speaker/gender/phonetic content independent classification

Extension of the set of analyzed features by the slope of short-term spectra
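One common way to obtain a spectral slope in dB/oct is a least-squares line fit of the magnitude spectrum (in dB) against log2 of frequency within a chosen band. The sketch below follows that approach; it is an assumption that this matches the estimator used in the study, and the function name and default band are illustrative.

```python
import numpy as np

def spectral_slope_db_per_oct(freqs_hz, mag_db, band=(60.0, 1000.0)):
    """Fit mag_db ~ slope * log2(f) + intercept over `band`; return slope in dB/oct."""
    freqs_hz = np.asarray(freqs_hz, dtype=float)
    mag_db = np.asarray(mag_db, dtype=float)
    lo, hi = band
    # Keep bins inside the band; exclude 0 Hz, where log2 is undefined.
    mask = (freqs_hz >= lo) & (freqs_hz <= hi) & (freqs_hz > 0.0)
    slope, _intercept = np.polyfit(np.log2(freqs_hz[mask]), mag_db[mask], 1)
    return float(slope)

if __name__ == "__main__":
    freqs = np.linspace(0.0, 8000.0, 513)
    # Synthetic spectrum falling by exactly 6 dB per octave.
    mag = 20.0 - 6.0 * np.log2(np.maximum(freqs, 1.0))
    print(spectral_slope_db_per_oct(freqs, mag))
```

Changing `band` allows the same function to be evaluated over other frequency ranges, such as 0–8000 Hz or 60–5000 Hz.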

Mean Spectral Slopes in Voiced Male/Female Speech (0–8000 Hz)

Set   # N    T (s)   Neutral slope (dB/oct)   σ (dB/oct)   # LE   T (s)   LE slope (dB/oct)      σ (dB/oct)
M     2587   618     -7.42 (-7.48; -7.36)     1.53         3532   1114    -5.32 (-5.37; -5.27)   1.55
F     5558   1544    -6.15 (-6.18; -6.12)     1.30         5030   1926    -3.91 (-3.96; -3.86)   1.77

Overlap of Neutral/LE Spectral Slope Distributions (%)

Set   0–8000 Hz   60–8000 Hz   60–5000 Hz   1k–5k Hz   0–1000 Hz   60–1000 Hz
M     26.00       28.13        29.47        100.00     27.81       27.96
F     26.20       28.95        16.76        100.00     25.75       22.18
M+F   28.06       30.48        29.49        100.00     27.54       26.00

Classification Feature Set

A feature set providing superior classification performance on the development data set was found: SNR, spectral slope (60–1000 Hz), F0, σF0

GMM and multi-layer perceptron (MLP) classifiers were trained

Classifier Implementation

Binary Classification Task

Decision between hypotheses $H_0$ and $H_1$ based on the likelihoods $P(\mathbf{o}_i \mid H_0)$ and $P(\mathbf{o}_i \mid H_1)$ of the acoustic observation (classification feature vector) $\mathbf{o}_i$

GMM Classifier

Acoustic observation (classification feature vector) → GMM_N, GMM_LE → Pr(N), Pr(LE)

$P(\mathbf{o}_i) = \frac{1}{\sqrt{(2\pi)^n \left|\boldsymbol{\Sigma}\right|}} \exp\left(-\frac{1}{2} (\mathbf{o}_i - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{o}_i - \boldsymbol{\mu})\right)$

MLP Classifier

Classification feature vector → hidden layer (sigmoid) → output layer (softmax) → Pr(N), Pr(LE)

Sigmoid: $f_1(q_j) = \frac{1}{1 + e^{-q_j}}$

Softmax: $f_2(q_j) = \frac{e^{q_j}}{\sum_{i=1}^{M} e^{q_i}}$
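The building blocks named on this slide (a multivariate Gaussian density for the GMM classifier, sigmoid and softmax activations for the MLP) can be sketched as follows. This is a simplified illustration: each class is modeled by a single Gaussian rather than a mixture, equal class priors are assumed, and the model parameters in the example are made-up numbers, not values from the study.

```python
import numpy as np

def gaussian_log_pdf(o, mu, sigma):
    """Log of the multivariate Gaussian density N(o; mu, sigma)."""
    o = np.asarray(o, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    n = mu.shape[0]
    diff = o - mu
    _sign, logdet = np.linalg.slogdet(sigma)
    maha = diff @ np.linalg.solve(sigma, diff)  # (o - mu)^T sigma^-1 (o - mu)
    return float(-0.5 * (n * np.log(2.0 * np.pi) + logdet + maha))

def sigmoid(q):
    """Hidden-layer activation: 1 / (1 + exp(-q))."""
    return 1.0 / (1.0 + np.exp(-np.asarray(q, dtype=float)))

def softmax(q):
    """Output-layer activation; subtracting the max keeps exp() stable."""
    q = np.asarray(q, dtype=float)
    e = np.exp(q - np.max(q))
    return e / np.sum(e)

def classify(o, model_n, model_le):
    """Pick the class whose Gaussian gives the higher likelihood (equal priors)."""
    return "LE" if gaussian_log_pdf(o, *model_le) > gaussian_log_pdf(o, *model_n) else "N"

if __name__ == "__main__":
    # Hypothetical 2-D models over (spectral slope in dB/oct, F0 in Hz).
    cov = [[1.0, 0.0], [0.0, 900.0]]
    model_n = ([-6.5, 150.0], cov)
    model_le = ([-4.5, 200.0], cov)
    print(classify([-4.0, 210.0], model_n, model_le))
    print(softmax([1.0, 2.0]))
```

A full GMM would sum several weighted Gaussian components per class; the single-component version keeps the example short.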

Classification Feature Set

[Figures: normalized sample histograms of the development sets (Dev_N_M+F, Dev_LE_M+F) overlaid with the GMM PDFs (PDF_N, PDF_LE) and with the ANN posteriors (Pr(N), Pr(LE)), plotted against SNR (dB), spectral slope (dB/oct), and F0 (Hz).]

Classifier Evaluation

Classification Data Sets

Set     # Utterances   Mean T_Utter (s)   σ T_Utter (s)
Devel   2472           4.10               1.60
Open    1371           4.01               1.50

Classification Performance

UER (Utterance Error Rate): ratio of incorrectly classified utterances to all utterances

MLP

Set            Train            CV              Open
# Utterances   2202             270             1371
UER (%)        9.9 (8.7–11.1)   5.6 (2.8–8.3)   1.6 (0.9–2.3)

GMM

Set            Devel FM        Open FM         Devel DM        Open DM
# Utterances   2472            1371            2472            1371
UER (%)        6.6 (5.6–7.6)   2.5 (1.7–3.3)   8.1 (7.0–9.2)   2.8 (1.9–3.6)
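The intervals in parentheses are numerically consistent with a 95% normal-approximation (Wald) interval for a binomial proportion, although the slides do not state how they were computed. The sketch below shows that calculation under this assumption; the function name is hypothetical, and the error count of 218 is back-computed from the reported 9.9% of 2202 utterances rather than taken from the source.

```python
import math

def error_rate_ci(n_errors, n_total, z=1.96):
    """Error rate with a normal-approximation confidence interval.

    Returns (rate, lower, upper) as fractions; z = 1.96 gives a 95% interval.
    """
    p = n_errors / n_total
    half_width = z * math.sqrt(p * (1.0 - p) / n_total)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)

if __name__ == "__main__":
    # 218 errors out of 2202 utterances is about 9.9% (count back-computed).
    rate, lower, upper = error_rate_ci(218, 2202)
    print(f"UER = {100 * rate:.1f}% ({100 * lower:.1f}-{100 * upper:.1f})")
```

The interval narrows with the square root of the number of utterances, which is why the 270-utterance CV set has the widest interval in the table.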

Two-Stage Recognizer (TSR)

[Diagram: Speech Signal → Neutral/LE Classifier → Neutral Recognizer or LE Recognizer → Estimated Word Sequence]

WER (%):

Set               Real – neutral   Real – LE
# Female digits   1439             1837
PLP               4.3 (3.3–5.4)    48.1 (45.8–50.4)
RFCC–LPC          6.5 (5.2–7.7)    28.3 (26.2–30.4)
MLP TSR           4.2 (3.2–5.3)    28.4 (26.4–30.5)
FM–GMLC TSR       4.4 (3.3–5.4)    28.4 (26.4–30.5)
DM–GMLC TSR       4.4 (3.3–5.4)    28.4 (26.3–30.4)

Discrete recognizers: good on either neutral or LE speech

TSR: when exposed to a mixture of neutral and LE speech, provides the best of both discrete recognizers

Only neutral speech data required for training the acoustic models

Proposed Equalization Techniques

Acoustic Model Adaptation

Adaptation of neutral acoustic models to LE: proposal of speaker/group-dependent adaptation approaches

Assumptions: LE-level dependent; adaptation data for a given speaker/LE level are available, together with their transcriptions

Voice Conversion

Excitation and vocal tract components of LE speech are transformed towards neutral in the ASR front-end

Assumptions: LE-level dependent; parallel training data for each speaker available; speaker identification system choosing from the codebook of speaker-dependent transforms available; increased conversion accuracy required

Data-Driven Design of Robust Features

Contribution of frequency sub-bands to speech recognition performance is studied; novel filter banks for MFCC- and PLP-based front-ends are designed

Assumptions: LE-level dependent; gender classification required to pick gender-dependent features

Frequency Warping

Modified vocal tract length normalization (VTLN) and generalized formant-driven frequency warping are proposed

Assumptions: LE-level independent; transform parameters adapt on the fly!

Two-Stage Recognition System

Neutral/LE classifier is proposed and used to direct incoming speech to matching neutral/LE dedicated recognizers

Assumptions: LE-level independent; increasing the codebook of LE-level dependent recognizers would further improve performance in changing LE levels

Proposed Techniques – Performance Comparison

[Figure: Comparison of proposed techniques for LE-robust ASR. Bar chart of WER (%) (axis 0–60) with series Baseline Neutral, Baseline LE, and LE Suppression for: Model Adapt to LE – SI, Model Adapt to LE – SD, Voice Conversion – CLE, Modified FB – RFCC–LPC, VTLN Recognition – Utt. Dep. Warp, Formant Warping, MLP TSR.]

Thank You for Your Attention!

References

PhD Thesis: Hynek Bořil – Robust Speech Recognition: Analysis and Equalization of Lombard Effect in Czech Corpora. Czech Technical University in Prague, 2008.

http://www.utdallas.edu/~hxb076000