Email: [email protected] Slide 1 Speech and Speaker Recognition SLIDES by John H.L. Hansen,...
-
Upload
leonard-mason -
Category
Documents
-
view
214 -
download
0
Transcript of Email: [email protected] Slide 1 Speech and Speaker Recognition SLIDES by John H.L. Hansen,...
Email: [email protected] Slide 1 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Lombard Speech Recognition
Hynek Bořil ([email protected])
Email: [email protected] Slide 2 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Overview
- Model of Speech Production
- Automatic Speech Recognition (ASR): Feature Extraction, Acoustic Models
- Lombard Effect (LE): Definition & Motivation; Acquisition of a Corpus Capturing Lombard Effect; Analysis of Speech under LE; Methods Increasing ASR Robustness to LE
Email: [email protected] Slide 3 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Speech Production
A model of speech production aids the understanding of speech signal structure and the design of speech processing algorithms.
Email: [email protected] Slide 4 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Speech Production – Linear Model
[Figure: linear (source-filter) model of speech production. Voiced branch: impulse train generator I(z), driven by the pitch period, followed by the glottal pulse model G(z) and gain AV. Unvoiced branch: random noise generator N(z) with gain AN. A voiced/unvoiced switch selects the excitation uG(n), which passes through the vocal tract model V(z) (controlled by the vocal tract parameters) and the radiation model R(z) to produce the speech signal pL(n).]
- Voiced excitation
- Unvoiced excitation
Email: [email protected] Slide 5 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Speech Production – Linear Model
[Figure: production model as above, voiced excitation path highlighted. Time domain: impulse train with period 1/F0. Frequency domain: harmonics at F0, 2F0, …; the combined excitation spectrum |I(F)G(F)| falls off at about -12 dB/oct due to the glottal pulse shape.]
Email: [email protected] Slide 6 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Speech Production – Linear Model
[Figure: production model as above, unvoiced excitation path highlighted. Time domain: random noise. Frequency domain: approximately flat noise spectrum |N(F)|.]
Email: [email protected] Slide 7 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Speech Production – Linear Model
[Figure: production model as above, vocal tract and radiation blocks highlighted.]
The vocal tract is modeled as an all-pole filter:
V(z) = G / (1 - Σ_{k=1..N} a_k z^{-k})
[Figure: magnitude responses over 0–4000 Hz. |V(F)| shows the formant resonances; the radiation response |R(F)| rises at about +6 dB/oct.]
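The voiced branch of the model above can be sketched in a few lines of plain Python: a pitch-periodic impulse train is fed through the all-pole filter V(z) = G / (1 - Σ a_k z^{-k}), realized as the difference equation s[n] = G·u[n] + Σ_k a_k·s[n-k]. The resonance frequency, bandwidth, and sampling rate below are illustrative values, not numbers from the slides.

```python
# Minimal sketch of the linear (source-filter) production model:
# impulse-train excitation through a two-pole "formant" resonator.
import math

def impulse_train(n_samples, pitch_period):
    """Voiced excitation: unit impulses every pitch_period samples."""
    return [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]

def resonator_coeffs(f_res, bw, fs):
    """Two-pole resonator: poles at radius r, angle 2*pi*f_res/fs."""
    r = math.exp(-math.pi * bw / fs)
    theta = 2.0 * math.pi * f_res / fs
    # Coefficients of s[n] = u[n] + a1*s[n-1] + a2*s[n-2]
    return 2.0 * r * math.cos(theta), -r * r

def all_pole_filter(u, a_coeffs, gain=1.0):
    """Apply V(z) = gain / (1 - sum_k a_k z^-k) sample by sample."""
    s = []
    for n in range(len(u)):
        y = gain * u[n]
        for k, a in enumerate(a_coeffs, start=1):
            if n - k >= 0:
                y += a * s[n - k]
        s.append(y)
    return s

fs = 8000                                            # sampling rate (Hz)
excitation = impulse_train(400, pitch_period=80)     # F0 = 100 Hz
a1, a2 = resonator_coeffs(f_res=500.0, bw=100.0, fs=fs)
speech = all_pole_filter(excitation, [a1, a2])
```

A real vocal tract model would use several pole pairs (one per formant); chaining more `resonator_coeffs` pairs into `a_coeffs` extends the sketch in the obvious way.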
Email: [email protected] Slide 8 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Speech Production – Linear Model
[Figure: the complete model combining the previous slides. The overall speech spectrum is the product of the excitation spectrum (harmonic |I(F)G(F)| with a -12 dB/oct tilt for voiced sounds, flat |N(F)| for unvoiced), the vocal tract response |V(F)| with V(z) = G / (1 - Σ_{k=1..N} a_k z^{-k}), and the radiation response |R(F)| (+6 dB/oct).]
Email: [email protected] Slide 9 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Speech Production – Linguistic/Speaker Information in the Speech Signal
How is linguistic information coded in the speech signal?
Phonetic contents:
- Energy: voiced phones (v) have higher energy than unvoiced phones (uv)
- Low formants: locations and bandwidths (reflect changes in the configuration of the vocal tract during speech production)
- Spectral tilt: differs across phones, generally flatter for uv (due to changes in excitation and formant locations)
Other cues:
- Pitch contour: important for distinguishing words in tonal languages (e.g., Chinese dialects)
How is speaker identity coded in the speech signal?
- Glottal waveform
- Vocal tract parameters
- Prosody (intonation, rhythm, stress, …)
Email: [email protected] Slide 10 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Speech Production – Phonetic Contents in Features
Example 1 – First two formants in US vowels (Bond et al., 1989)
[Figure: F1–F2 plane (F1: 200–800 Hz, F2: 800–2200 Hz) showing neutral-speech clusters for the vowels /u/, /i/, /ae/, /a/.]
Example 2 – Spectral slopes in Czech vowels
(# N – number of vowel instances, T – total duration; the slope is given with interval bounds, σ – standard deviation)

Vowel | # N | T (s) | Slope (dB/oct) | σ (dB/oct)
/a/ | 454 | 69.03 | -6.8 (-6.9; -6.7) | 1.13
/e/ | 1064 | 69.33 | -5.6 (-5.7; -5.6) | 1.06
/i/ | 509 | 58.92 | -5.0 (-5.1; -4.9) | 1.15
/o/ | 120 | 9.14 | -8.0 (-8.1; -7.8) | 0.91
/u/ | 102 | 5.73 | -6.1 (-6.3; -6.0) | 0.77
Email: [email protected] Slide 11 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Automatic Speech Recognition (ASR) – Architecture of an HMM Recognizer
[Figure: block diagram – speech signal → feature extraction (MFCC/PLP) → acoustic model producing sub-word likelihoods (GMM/MLP) → decoder (Viterbi), which combines the lexicon (HMM word models) and the language model (bigrams) to output the estimated word sequence.]
Feature extraction – transformation of the time-domain acoustic signal into a representation more effective for the ASR engine: dimensionality reduction and suppression of irrelevant (disturbing) signal components (speaker-, environment-, and recording-chain-dependent characteristics), while preserving phonetic content.
Sub-word models – Gaussian Mixture Models (GMMs): mixtures of Gaussians used to model the distribution of feature vector parameters; Multi-Layer Perceptrons (MLPs): neural networks, much less common than GMMs.
Email: [email protected] Slide 12 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Automatic Speech Recognition (ASR) – HMM-Based Recognition Stages
[Figure (HTK book, 2006): speech signal → feature extraction (windowing, …, cepstrum) → observation sequence o1 o2 o3 … → acoustic models (HMMs of word sequences) combined with the language model → speech transcription.]
Email: [email protected] Slide 13 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Automatic Speech Recognition (ASR) – Feature Extraction – MFCC
Mel Frequency Cepstral Coefficients (MFCC) – Davis & Mermelstein, IEEE Trans. Acoustics, Speech, and Signal Processing, 1980
MFCC is the first choice in current commercial ASR.
Pipeline: s(n) → Preemphasis → Window (Hamming) → |FFT|² → Mel filter bank → Log(·) → IDCT → c(n)
- Preemphasis: compensates for spectral tilt (speech production/microphone channel)
- Windowing: suppression of transient effects in short-term segments of the signal
- |FFT|²: energy spectrum (phase is discarded)
- Mel filter bank: the mel scale models the logarithmic perception of frequency in humans; triangular filters provide dimensionality reduction
- Log + IDCT: extraction of the cepstrum – deconvolution of glottal waveform, vocal tract function, and channel characteristics
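The MFCC pipeline above can be sketched end-to-end for a single frame in plain Python. This is a simplified illustration, not a production front-end: it uses a naive DFT instead of an FFT, and the frame length, filter count, and coefficient count are illustrative choices, not values from the slides.

```python
# Sketch of the MFCC pipeline: preemphasis -> Hamming window ->
# power spectrum -> triangular mel filter bank -> log -> DCT-II.
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, fs, n_filters=12, n_ceps=6, preemph=0.97):
    n = len(frame)
    # Preemphasis: flattens the spectral tilt.
    x = [frame[0]] + [frame[i] - preemph * frame[i - 1] for i in range(1, n)]
    # Hamming window: suppresses transient (edge) effects.
    x = [x[i] * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1))) for i in range(n)]
    # Power spectrum via a naive DFT (phase is discarded).
    n_bins = n // 2 + 1
    power = []
    for k in range(n_bins):
        re = sum(x[i] * math.cos(2 * math.pi * k * i / n) for i in range(n))
        im = -sum(x[i] * math.sin(2 * math.pi * k * i / n) for i in range(n))
        power.append(re * re + im * im)
    # Triangular filters spaced uniformly on the mel scale.
    mel_pts = [i * (hz_to_mel(fs / 2.0) / (n_filters + 1)) for i in range(n_filters + 2)]
    bin_pts = [int(round(mel_to_hz(m) * n / fs)) for m in mel_pts]
    log_energies = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = bin_pts[j - 1], bin_pts[j], bin_pts[j + 1]
        e = 0.0
        for k in range(lo, min(hi, n_bins - 1) + 1):
            if lo <= k <= mid and mid > lo:
                w = (k - lo) / (mid - lo)       # rising edge of triangle
            elif mid < k <= hi and hi > mid:
                w = (hi - k) / (hi - mid)       # falling edge of triangle
            else:
                w = 0.0
            e += w * power[k]
        log_energies.append(math.log(e + 1e-12))
    # DCT-II of the log filter-bank energies -> cepstral coefficients c(q).
    return [sum(log_energies[j] * math.cos(math.pi * q * (j + 0.5) / n_filters)
                for j in range(n_filters)) for q in range(n_ceps)]

fs = 8000
frame = [math.sin(2 * math.pi * 440.0 * i / fs) for i in range(64)]
ceps = mfcc_frame(frame, fs)
```

Replacing `hz_to_mel`/`mel_to_hz` with identity functions and the triangles with non-overlapping rectangles gives the linear-frequency variant (LFCC) used later in the filter-bank experiments.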
Email: [email protected] Slide 14 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Automatic Speech Recognition (ASR) – Feature Extraction – MFCC & PLP
Perceptual Linear Predictive coefficients (PLP) – Hermansky, Journal of the Acoustical Society of America, 1990
- An alternative to MFCC, used less frequently
- Many stages are similar to MFCC
- Linear prediction – smoothing of the spectral envelope (may improve robustness)
MFCC: s(n) → Preemphasis → Window (Hamming) → |FFT|² → Mel filter bank → Log(·) → IDCT → c(n)
PLP: s(n) → Window (Hamming) → |FFT|² → Bark filter bank → Equal-loudness preemphasis → Intensity-to-loudness conversion (cube root) → Linear prediction → Cepstrum recursion → c(n)
Email: [email protected] Slide 15 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Automatic Speech Recognition (ASR)Feature Extraction – MFCC & PLP
Automatic Speech Recognition (ASR)Feature Extraction – MFCC & PLP
Perceptual Linear Predictive Coefficients (PLP)Hermansky, Journal of Acoustical Society of America, 1990
An alternative to MFCC, used less frequently
Many stages similar to MFCC
Linear prediction – smoothing of the spectral envelope (may improve robustness)
WINDOW
(HAMMING)
|FFT|2
EQUAL LOUDNESS
PREEMPHASIS
LINEAR PREDICTION
c(n)
s(n)
PLP INTENSITY
LOUDNESS 3
RECURSION
CEPSTRUM
FILTER BANK
(BARK)
WINDOW
(HAMMING)
|FFT|2
c(n)
s(n)
PREEMPHASIS
Log( )
.
IDCT
MFCC FILTER
BANK (MEL)
MFCC
PLP
Email: [email protected] Slide 16 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Automatic Speech Recognition (ASR) – Acoustic Models – GMM-HMM
Gaussian Mixture Models (GMMs)
- Motivation: distributions of cepstral coefficients can be modeled well by a mixture (weighted sum) of Gaussian functions
- Example: the distribution of c0 in a certain phone and the corresponding Gaussian, defined uniquely by its mean, variance, and weight
[Figure: histogram of c0 (# samples vs. c0) next to the fitted probability density function Pr(c0), with the mixture weight indicated.]
- Multidimensional observations (c0, …, c12) → multidimensional Gaussians, defined uniquely by means, covariance matrices, and weights
- GMMs are typically used to model parts of phones
Hidden Markov Models (HMMs)
- States (GMMs) + transition probabilities between states
- Models of whole phones; lexicon word models are built of phone models
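Decoding with such state-based models can be sketched with a log-domain Viterbi search. As a simplification of the GMM states described above, each state here emits through a single 1-D Gaussian; the two-state model and all its parameters are invented for illustration.

```python
# Viterbi decoding over a small HMM with 1-D Gaussian emissions
# (a one-component stand-in for GMM states).
import math

def gauss_pdf(x, mean, var):
    """Likelihood of observation x under a 1-D Gaussian."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def viterbi(observations, init, trans, means, variances):
    """Return the most likely state sequence (log-domain Viterbi)."""
    n_states = len(init)
    # delta[s] = best log-probability of any path ending in state s.
    delta = [math.log(init[s]) + math.log(gauss_pdf(observations[0], means[s], variances[s]))
             for s in range(n_states)]
    backptr = []
    for obs in observations[1:]:
        new_delta, column = [], []
        for s in range(n_states):
            best_prev = max(range(n_states), key=lambda p: delta[p] + math.log(trans[p][s]))
            column.append(best_prev)
            new_delta.append(delta[best_prev] + math.log(trans[best_prev][s])
                             + math.log(gauss_pdf(obs, means[s], variances[s])))
        delta, backptr = new_delta, backptr + [column]
    # Backtrack from the best final state.
    state = max(range(n_states), key=lambda s: delta[s])
    path = [state]
    for column in reversed(backptr):
        state = column[state]
        path.append(state)
    return list(reversed(path))

# Two states: state 0 emits around 0.0, state 1 around 5.0.
path = viterbi(observations=[0.1, -0.2, 4.9, 5.2],
               init=[0.5, 0.5],
               trans=[[0.9, 0.1], [0.1, 0.9]],
               means=[0.0, 5.0],
               variances=[1.0, 1.0])
```

A full recognizer replaces the scalar Gaussians with GMMs over feature vectors and runs the same search over concatenated phone models constrained by the lexicon and language model.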
Email: [email protected] Slide 17 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Lombard Effect – Definition & Motivation
What is the Lombard effect?
- When exposed to a noisy adverse environment, speakers modify the way they speak in an effort to maintain intelligible communication (Lombard effect – LE).
Why is the Lombard effect interesting?
- Better understanding of the mechanisms of human speech communication (Can we intentionally change particular parameters of speech production to improve intelligibility, or is LE an automatic process learned through the public loop? How do the type of noise and the communication scenario affect LE?)
- Mathematical modeling of LE → classification of LE level, speech synthesis in noisy environments, and increasing the robustness of automatic speech recognition and speaker identification systems
Email: [email protected] Slide 18 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Lombard Effect – Motivation & Goals
Ambiguity in past LE investigations
- LE has been studied since 1911; however, many investigations disagree in the observed impacts of LE on speech production.
- Analyses were typically conducted on very limited data – a couple of utterances from a few subjects (1–10).
- Lack of a communication factor – a majority of studies ignore the importance of communication (an effort to convey a message over noise) for evoking LE → the occurrence and level of LE in the recordings is 'random' → contradicting analysis results.
- LE has been studied for only several world languages (English, Spanish, French, Japanese, Korean, Mandarin Chinese); there is no comprehensive study for any Slavic language.
1st goal
- Design of a Czech Lombard speech database addressing the need for a communication factor and well-defined simulated noisy conditions
- Systematic analysis of LE in spoken Czech
Email: [email protected] Slide 19 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Lombard Effect – Motivation & Goals
ASR under LE
- Mismatch between LE speech corrupted by noise and acoustic models trained on clean neutral speech
- The strong impact of noise on ASR is well known, and a vast number of noise suppression/speech emphasis algorithms have been proposed in recent decades (yet no ultimate solution has been reached)
- The negative impact of LE on ASR often exceeds that of noise; recent state-of-the-art ASR systems mostly ignore this issue
LE-equalization methods
- LE-equalization algorithms typically operate in the following domains: robust features, LE-to-neutral transformation, model adjustments, improved training of acoustic models
- The algorithms display various degrees of efficiency and are often bound by strong assumptions that prevent real-world application (fixed transformations applied to phonetic groups, known level of LE, etc.)
2nd goal
- Proposal of novel LE-equalization techniques with a focus on both the level of LE suppression and the extent of bounding assumptions
Email: [email protected] Slide 20 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
LE Corpora
Available Czech corpora
- Czech SPEECON – speech recordings from various environments, including office and car
- CZKCC – car recordings, covering parked-car (engine off) and moving-car scenarios
- Both databases contain speech produced in quiet and in noise → candidates for a study of LE; however, not good ones, as shown later
Design/acquisition of an LE-oriented database – Czech Lombard Speech Database '05 (CLSD'05)
- Goal – communication in a simulated noisy background → recordings with high SNR
- Phonetically rich data and extensive small-vocabulary material
- Parallel utterances in neutral and LE conditions
Email: [email protected] Slide 21 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Data Acquisition – Recording Setup
Simulated noisy conditions
- Noise samples mixed with the speech feedback and played to the speaker and operator through headphones
- The operator judges the intelligibility of speech in noise – if an utterance is not intelligible, the operator asks the subject to repeat it → speakers are required to convey the message over the noise → communication → LE
- Noises: mostly car noises from the Car2E database, normalized to 90 dB SPL
Speaker sessions
- 14 male/12 female speakers
- Each subject was recorded both in neutral and in simulated noisy conditions
[Figure: recording setup – speaker with close-talk and middle-talk microphones and headphones carrying noise + speech feedback; H&T recorder; operator monitoring noise + speech and prompting "OK – next" / "BAD – again".]
Email: [email protected] Slide 22 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Data Acquisition – Recording Setup
Conveying the message over the noise evokes real LE.
[Figure: the recording hardware – NB2 and ME-104 units.]
Email: [email protected] Slide 23 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Data Acquisition – Impact of Headphones
Environmental sound attenuation by headphones
- Attenuation characteristics measured on a dummy head
- A source of wide-band noise; measurement of the sound transfer to the dummy head's auditory canals with and without headphones
- Attenuation characteristics – subtraction of the two transfers
Email: [email protected] Slide 24 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Data Acquisition – Impact of Headphones
Environmental sound attenuation by headphones
- Directional attenuation measured in a reflectionless sound booth
- Real attenuation in the recording room
[Figure: headphone attenuation (dB) vs. frequency (100 Hz–10 kHz) for source angles 0°, 90°, 180° and for the recording room, roughly -10 to 30 dB; polar plots of directional attenuation at 1, 2, 4, and 8 kHz.]
Email: [email protected] Slide 25 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Speech Production under Lombard Effect
Speech features affected by LE:
- Vocal tract excitation: glottal pulse shape changes, fundamental frequency rises
- Vocal tract transfer function: center frequencies of low formants increase, formant bandwidths reduce
- Vocal effort (intensity) increases
- Other: voiced phonemes are prolonged, the energy ratio of voiced to unvoiced speech increases, …
[Figure: linear production model with the affected blocks highlighted; V(z) = G / (1 - Σ_{k=1..N} a_k z^{-k}).]
Email: [email protected] Slide 29 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Analysis of Speech Features under LE – Fundamental Frequency
[Figure: distributions of fundamental frequency (70–570 Hz), number of samples vs. F0, for three corpora – Czech SPEECON (office vs. car, female/male), CZKCC (engine off vs. engine on, female/male), and CLSD'05 (neutral vs. LE, female/male).]
Email: [email protected] Slide 30 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Analysis of Speech Features under LE – Formant Locations
[Figure: F2 vs. F1 plots of the Czech vowels /i/, /i'/, /e/, /e'/, /a/, /a'/, /o/, /o'/, /u/, /u'/ for neutral vs. LE speech – CZKCC female digits, CZKCC male digits, CLSD'05 female digits, and CLSD'05 male digits (F1 roughly 200–1000 Hz, F2 roughly 500–2500 Hz).]
Email: [email protected] Slide 31 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Analysis of Speech Features under LE – Formant Bandwidths
(B1 – first-formant bandwidth, σ1 – standard deviation; M/F – male/female; columns presumably give neutral values followed by LE values; asterisks as in the source data)

CZKCC
Vowel | B1M (Hz) | σ1M (Hz) | B1M,LE (Hz) | σ1M,LE (Hz) | B1F (Hz) | σ1F (Hz) | B1F,LE (Hz) | σ1F,LE (Hz)
/a/ | 207* | 74 | 210* | 84 | 275 | 97 | 299 | 78
/e/ | 125* | 70 | 130* | 78 | 156 | 68 | 186 | 79
/i/ | 124* | 49 | 127* | 44 | 105 | 44 | 136 | 53
/o/ | 275 | 87 | 222 | 67 | 263* | 85 | 269* | 73
/u/ | 187 | 100 | 170 | 89 | 174* | 96 | 187* | 101

CLSD'05
Vowel | B1M (Hz) | σ1M (Hz) | B1M,LE (Hz) | σ1M,LE (Hz) | B1F (Hz) | σ1F (Hz) | B1F,LE (Hz) | σ1F,LE (Hz)
/a/ | 269 | 88 | 152 | 59 | 232 | 85 | 171 | 68
/e/ | 168 | 94 | 99 | 44 | 169 | 73 | 130 | 49
/i/ | 125 | 53 | 108 | 52 | 132* | 52 | 133* | 58
/o/ | 239 | 88 | 157 | 81 | 246 | 91 | 158 | 62
/u/ | 134* | 67 | 142* | 81 | 209 | 95 | 148 | 66

- SPEECON, CZKCC: no consistent bandwidth changes
- CLSD'05: significant bandwidth reduction in many voiced phonemes
Email: [email protected] Slide 32 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Analysis of Speech Features under LE – Phoneme Durations
(# – number of instances, T – mean duration, σT – standard deviation of duration, Δ – relative change of mean duration)

CZKCC (engine off vs. engine on)
Word | Phoneme | # OFF | TOFF (s) | σTOFF (s) | # ON | TON (s) | σTON (s) | Δ (%)
Nula | /a/ | 349 | 0.147 | 0.079 | 326 | 0.259 | 0.289 | 48.50
Jedna | /a/ | 269 | 0.173 | 0.076 | 251 | 0.241 | 0.238 | 39.36
Dva | /a/ | 245 | 0.228 | 0.075 | 255 | 0.314 | 0.311 | 38.04
Štiri | /r/ | 16 | 0.045 | 0.027 | 68 | 0.080 | 0.014 | 78.72
Sedm | /e/ | 78 | 0.099 | 0.038 | 66 | 0.172 | 0.142 | 72.58

CLSD'05 (neutral vs. LE)
Word | Phoneme | # N | TN (s) | σTN (s) | # LE | TLE (s) | σTLE (s) | Δ (%)
Jedna | /e/ | 583 | 0.031 | 0.014 | 939 | 0.082 | 0.086 | 161.35
Dvje | /e/ | 586 | 0.087 | 0.055 | 976 | 0.196 | 0.120 | 126.98
Čtiri | /r/ | 35 | 0.041 | 0.020 | 241 | 0.089 | 0.079 | 115.92
Pjet | /e/ | 555 | 0.056 | 0.033 | 909 | 0.154 | 0.089 | 173.71
Sedm | /e/ | 358 | 0.080 | 0.038 | 583 | 0.179 | 0.136 | 122.46
Osm | /o/ | 310 | 0.086 | 0.027 | 305 | 0.203 | 0.159 | 135.25
Devjet | /e/ | 609 | 0.043 | 0.022 | 932 | 0.120 | 0.088 | 177.20

- Significant increase in duration in some phonemes, especially voiced ones
- Some unvoiced consonants show a duration reduction
- Duration changes in CLSD'05 considerably exceed those in SPEECON and CZKCC
Email: [email protected] Slide 33 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Lombard Effect – Initial ASR Experiments
ASR evaluation – WER (word error rate):
WER = (D + S + I) / N × 100 %
where S – word substitutions, I – word insertions, D – word deletions, N – number of words in the reference.
Digit recognizer:
- Monophone HMM models
- 13 MFCC + ∆ + ∆∆
- 32 Gaussian mixtures per model state

Corpus | Set | # Spkrs | # Digits | WER (%)
Czech SPEECON | Office F | 22 | 880 | 5.5 (4.0–7.0)
Czech SPEECON | Office M | 31 | 1219 | 4.3 (3.1–5.4)
Czech SPEECON | Car F | 28 | 1101 | 4.6 (3.4–5.9)
Czech SPEECON | Car M | 42 | 1657 | 10.5 (9.0–12.0)
CZKCC | OFF F | 30 | 1480 | 3.0 (2.1–3.8)
CZKCC | OFF M | 30 | 1323 | 2.3 (1.5–3.1)
CZKCC | ON F | 18 | 1439 | 13.5 (11.7–15.2)
CZKCC | ON M | 21 | 1450 | 10.4 (8.8–12.0)
CLSD'05 | N F | 12 | 4930 | 7.3 (6.6–8.0)
CLSD'05 | N M | 14 | 1423 | 3.8 (2.8–4.8)
CLSD'05 | LE F | 12 | 5360 | 42.8 (41.5–44.1)
CLSD'05 | LE M | 14 | 6303 | 16.3 (15.4–17.2)

Car/ON sets – noisy car recordings (SPEECON/CZKCC: 10.7/12.6 dB SNR); CLSD'05 LE sets – clean recordings (40.9 dB SNR).
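The WER definition above can be computed directly with an edit-distance alignment between the reference and the recognizer output, where the aligned substitution, deletion, and insertion counts are exactly the S, D, and I of the formula. A minimal sketch in plain Python:

```python
# WER = (D + S + I) / N * 100 %, via Levenshtein alignment on word lists.
def wer(reference, hypothesis):
    """Word error rate in percent; inputs are lists of words."""
    n, m = len(reference), len(hypothesis)
    # d[i][j] = minimum edit cost aligning reference[:i] to hypothesis[:j].
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i              # i deletions
    for j in range(m + 1):
        d[0][j] = j              # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            d[i][j] = min(sub,               # substitution (or match)
                          d[i - 1][j] + 1,   # deletion
                          d[i][j - 1] + 1)   # insertion
    return 100.0 * d[n][m] / n

ref = "one two three four".split()
hyp = "one too three four five".split()   # 1 substitution + 1 insertion
rate = wer(ref, hyp)                      # -> 50.0
```

Note that WER can exceed 100 % when insertions are numerous, which is why it is preferred over accuracy for reporting degraded-condition results like the LE sets above.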
Email: [email protected] Slide 34 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Lombard EffectInitial ASR Experiments
Lombard EffectInitial ASR Experiments
ASR Evaluation – WER (Word Error Rate)S – word substitutions
I – word insertions
D – word deletions
Digit RecognizerMonophone HMM models
13 MFCC + ∆ + ∆∆
32 Gaussian mixtures per model state
D – word deletions
Czech SPEECON CZKCC CLSD‘05 Set
Office F Office M Car F Car M OFF F OFF M ON F ON M N F N M LE F LE M
# Spkrs 22 31 28 42 30 30 18 21 12 14 12 14
# Digits 880 1219 1101 1657 1480 1323 1439 1450 4930 1423 5360 6303
WER
(%) 5.5
(4.0–7.0)
4.3 (3.1–5.4)
4.6 (3.4–5.9)
10.5 (9.0–12.0)
3.0 (2.1–3.8)
2.3 (1.5–3.1)
13.5 (11.7–15.2)
10.4 (8.8–12.0)
7.3 (6.6–8.0)
3.8 (2.8–4.8)
42.8 (41.5–44.1)
16.3 (15.4–17.2)
100 %D S I
WERN
Noisy car recordings (SPEE/CZKCC – 10.7/12.6 dB
SNR)
Clean recordings (LE - 40.9 dB SNR)
Email: [email protected] Slide 35 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Lombard EffectInitial ASR Experiments
Lombard EffectInitial ASR Experiments
ASR Evaluation – WER (Word Error Rate)S – word substitutions
I – word insertions
D – word deletions
Digit RecognizerMonophone HMM models
13 MFCC + ∆ + ∆∆
32 Gaussian mixtures per model state
D – word deletions
Czech SPEECON CZKCC CLSD‘05 Set
Office F Office M Car F Car M OFF F OFF M ON F ON M N F N M LE F LE M
# Spkrs 22 31 28 42 30 30 18 21 12 14 12 14
# Digits 880 1219 1101 1657 1480 1323 1439 1450 4930 1423 5360 6303
WER
(%) 5.5
(4.0–7.0)
4.3 (3.1–5.4)
4.6 (3.4–5.9)
10.5 (9.0–12.0)
3.0 (2.1–3.8)
2.3 (1.5–3.1)
13.5 (11.7–15.2)
10.4 (8.8–12.0)
7.3 (6.6–8.0)
3.8 (2.8–4.8)
42.8 (41.5–44.1)
16.3 (15.4–17.2)
100 %D S I
WERN
Noisy car recordings (SPEE/CZKCC – 10.7/12.6 dB
SNR)
Clean recordings (LE - 40.9 dB SNR)
Email: [email protected] Slide 36 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Lombard EffectInitial ASR Experiments
Lombard EffectInitial ASR Experiments
ASR Evaluation – WER (Word Error Rate)S – word substitutions
I – word insertions
D – word deletions
Digit RecognizerMonophone HMM models
13 MFCC + ∆ + ∆∆
32 Gaussian mixtures per model state
D – word deletions
Czech SPEECON CZKCC CLSD‘05 Set
Office F Office M Car F Car M OFF F OFF M ON F ON M N F N M LE F LE M
# Spkrs 22 31 28 42 30 30 18 21 12 14 12 14
# Digits 880 1219 1101 1657 1480 1323 1439 1450 4930 1423 5360 6303
WER
(%) 5.5
(4.0–7.0)
4.3 (3.1–5.4)
4.6 (3.4–5.9)
10.5 (9.0–12.0)
3.0 (2.1–3.8)
2.3 (1.5–3.1)
13.5 (11.7–15.2)
10.4 (8.8–12.0)
7.3 (6.6–8.0)
3.8 (2.8–4.8)
42.8 (41.5–44.1)
16.3 (15.4–17.2)
100 %D S I
WERN
Noisy car recordings (SPEE/CZKCC – 10.7/12.6 dB
SNR)
Clean recordings (LE - 40.9 dB SNR)
Email: [email protected] Slide 37 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
LE Suppression in ASR – Model Adaptation
Model adaptation
- Often effective when only limited data from the given conditions are available
- Maximum Likelihood Linear Regression (MLLR) – if the amount of data per class is limited, acoustically close classes are grouped and transformed together:
  μ'_MLLR = A μ + b
- Maximum a posteriori approach (MAP) – the initial models are used as informative priors for the adaptation; for the Gaussian means (N – amount of adaptation data, μ̄ – adaptation-data mean, τ – prior weighting factor):
  μ'_MAP = (N μ̄ + τ μ) / (N + τ)
Adaptation procedure
- First, the neutral speaker-independent (SI) models are transformed by MLLR, employing clustering (binary regression tree)
- Second, MAP adaptation – only for nodes with a sufficient amount of adaptation data
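The two mean-update equations above can be sketched in a few lines. The 2-D vectors, the identity transform, and the value of the prior weight `tau` below are invented for illustration; real MLLR estimates A and b from the adaptation data per regression class.

```python
# MLLR affine transform of a Gaussian mean, and MAP interpolation
# between the prior mean and the adaptation-data mean.
def mllr_mean(A, b, mu):
    """mu' = A*mu + b, with A given as a list of rows."""
    return [sum(A[i][j] * mu[j] for j in range(len(mu))) + b[i]
            for i in range(len(A))]

def map_mean(mu_prior, data, tau):
    """mu' = (N*mu_bar + tau*mu_prior) / (N + tau)."""
    n = len(data)
    dim = len(mu_prior)
    mu_bar = [sum(x[d] for x in data) / n for d in range(dim)]
    return [(n * mu_bar[d] + tau * mu_prior[d]) / (n + tau) for d in range(dim)]

mu = [1.0, 2.0]
# Identity A -> the MLLR transform reduces to a pure bias shift.
shifted = mllr_mean(A=[[1.0, 0.0], [0.0, 1.0]], b=[0.5, -0.5], mu=mu)
# Two observations -> mu_bar = [3, 3], N = 2; tau controls the pull
# toward the prior: large tau keeps mu' near mu_prior.
adapted = map_mean(mu_prior=[0.0, 0.0],
                   data=[[2.0, 2.0], [4.0, 4.0]],
                   tau=2.0)
```

This also illustrates why the slide's procedure applies MLLR first: with little data the grouped affine transform moves many Gaussians at once, and MAP then refines only the nodes where N is large enough for the data term to dominate.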
Email: [email protected] Slide 38 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
LE Suppression in ASR – Model Adaptation
Adaptation schemes
- Speaker-independent (SI) adaptation – group dependent/independent
- Speaker-dependent (SD) adaptation – to neutral/LE
[Figure: WER (%) bar chart (0–90 %) comparing model adaptation to conditions and speakers – baseline vs. adapted digits and sentences under LE, for SI adaptation to LE (same speakers), SI adaptation to LE (disjunct speakers), SD adaptation to neutral, and SD adaptation to LE.]
Email: [email protected] Slide 39 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
LE Suppression in ASR – Data-Driven Design of Robust Features
Filter bank approach
- Analysis of the importance of frequency components for ASR
- Repartitioning the filter bank (FB) to emphasize components carrying phonetic information and to suppress disturbing components
- Initial FB uniformly distributed on a linear scale – equal attention to all components
- Consecutively, a single FB band is omitted → impact on WER?
- Omitting bands carrying more information will result in a considerable WER increase
Implementation
- MFCC front-end; the mel scale is replaced by a linear scale, and the triangular filters are replaced by rectangular filters without overlap
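The band-omission experiment can be sketched as follows: rectangular, non-overlapping filters on a linear scale, with one band excluded before the log stage. The band count and toy spectrum are illustrative; in the experiments each choice of the omitted band would feed a full recognizer training/test cycle.

```python
# Rectangular linear-scale filter bank with an optional omitted band.
import math

def band_energies(power_spectrum, n_bands, omit=None):
    """Sum power into equal-width rectangular bands; skip band `omit`."""
    n_bins = len(power_spectrum)
    width = n_bins // n_bands
    energies = []
    for b in range(n_bands):
        if b == omit:
            continue          # the omitted band contributes no feature
        start = b * width
        energies.append(sum(power_spectrum[start:start + width]))
    return [math.log(e + 1e-12) for e in energies]

# Toy power spectrum: 100 bins with all energy in bins 0-9.
spectrum = [1.0] * 10 + [0.0] * 90
full = band_energies(spectrum, n_bands=20)
no_band0 = band_energies(spectrum, n_bands=20, omit=0)
```

Sweeping `omit` over all 20 bands and retraining/rescoring the recognizer for each sweep position produces the WER-vs-omitted-band curves discussed on the following slides.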
Email: [email protected] Slide 40 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Data-Driven Design of Robust Features – Importance of Frequency Components
Features: c0, c1, …, c12; ∆c0, ∆c1, …, ∆c12; ∆∆c0, ∆∆c1, …, ∆∆c12
[Figure: WER (%) as a function of the omitted band (1–20) for neutral speech (roughly 3–5 %) and LE speech (roughly 20–40 %); below, the 20-band linear filter bank spanning 0–4 kHz with its cut-off frequencies.]
Email: [email protected] Slide 41 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Data-Driven Design of Robust FeaturesImportance of Frequency Components
Data-Driven Design of Robust FeaturesImportance of Frequency Components
3
4
5
0 5 10 15 20
Omitted band
Neutral speech
WE
R (
%)
20
30
40
0 5 10 15 20
Omitted band
LE speech
WE
R (
%)
Filterbank Cut-Off Frequencies (Hz)
Am
plitu
de
c0, c1,…, c12; ∆c0, ∆c1,…, ∆c12; ∆∆c0, ∆∆c1,…, ∆∆c12
4K
1 20
Email: [email protected] Slide 42 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Data-Driven Design of Robust FeaturesImportance of Frequency Components
Data-Driven Design of Robust FeaturesImportance of Frequency Components
3
4
5
0 5 10 15 20
Omitted band
Neutral speech
WE
R (
%)
20
30
40
0 5 10 15 20
Omitted band
LE speech
WE
R (
%)
Filterbank Cut-Off Frequencies (Hz)
Am
plitu
de
c0, c1,…, c12; ∆c0, ∆c1,…, ∆c12; ∆∆c0, ∆∆c1,…, ∆∆c12
4K
1 20
Email: [email protected] Slide 43 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Data-Driven Design of Robust FeaturesImportance of Frequency Components
Data-Driven Design of Robust FeaturesImportance of Frequency Components
3
4
5
0 5 10 15 20
Omitted band
Neutral speech
WE
R (
%)
20
30
40
0 5 10 15 20
Omitted band
LE speech
WE
R (
%)
Filterbank Cut-Off Frequencies (Hz)
Am
plitu
de
c0, c1,…, c12; ∆c0, ∆c1,…, ∆c12; ∆∆c0, ∆∆c1,…, ∆∆c12
4K
1 20
Email: [email protected] Slide 44 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Data-Driven Design of Robust Features – Importance of Frequency Components
[Figure: WER (%) vs. omitted band (1–20) for neutral and LE speech; features c0–c12 with ∆ and ∆∆.]
- The area of 1st and 2nd formant occurrence carries the highest portion of phonetic information; F1 is more important for neutral speech, F1–F2 for LE speech recognition
- Omitting the 1st band considerably improves LE ASR while reducing performance on neutral speech → tradeoff
- Next step – how much of the low-frequency content should be omitted for LE ASR?
Email: [email protected] Slide 45 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Data-Driven Design of Robust Features – Omitting Low Frequencies
Optimizing filter banks – omitting low frequencies
[Figure: 19-band linear filter bank with a variable low cut-off, spanning up to 4 kHz; WER (%) as a function of the omitted low-frequency bandwidth (0–1200 Hz) for neutral speech (roughly 0–12 %) and LE speech (roughly 0–30 %). Features: c0–c12 with ∆ and ∆∆.]
Email: [email protected] Slide 46 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Data-Driven Design of Robust FeaturesOmitting Low Frequencies
Data-Driven Design of Robust FeaturesOmitting Low Frequencies
1 19
Filterbank Cut-Off Frequencies (Hz)
Am
plitu
de
c0, c1,…, c12; ∆c0, ∆c1,…, ∆c12; ∆∆c0, ∆∆c1,…, ∆∆c12
4K
0
2
4
6
8
10
12
0 200 400 600 800 1000 1200
Bandwidth (Hz)
Neutral speech
WE
R (
%)
0
10
20
30
0 200 400 600 800 1000 1200
Bandwidth (Hz)
LE speech
WE
R (%
)
Email: [email protected] Slide 47 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Data-Driven Design of Robust FeaturesOmitting Low Frequencies
Data-Driven Design of Robust FeaturesOmitting Low Frequencies
1 19
Filterbank Cut-Off Frequencies (Hz)
Am
plitu
de
c0, c1,…, c12; ∆c0, ∆c1,…, ∆c12; ∆∆c0, ∆∆c1,…, ∆∆c12
4K
0
2
4
6
8
10
12
0 200 400 600 800 1000 1200
Bandwidth (Hz)
Neutral speech
WE
R (
%)
0
10
20
30
0 200 400 600 800 1000 1200
Bandwidth (Hz)
LE speech
WE
R (%
)
Email: [email protected] Slide 48 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Data-Driven Design of Robust FeaturesOmitting Low Frequencies
Data-Driven Design of Robust FeaturesOmitting Low Frequencies
1 19
Filterbank Cut-Off Frequencies (Hz)
Am
plitu
de
c0, c1,…, c12; ∆c0, ∆c1,…, ∆c12; ∆∆c0, ∆∆c1,…, ∆∆c12
4K
0
2
4
6
8
10
12
0 200 400 600 800 1000 1200
Bandwidth (Hz)
Neutral speech
WE
R (
%)
0
10
20
30
0 200 400 600 800 1000 1200
Bandwidth (Hz)
LE speech
WE
R (%
)
Slide 49

Data-Driven Design of Robust Features – Omitting Low Frequencies

Effect of Omitting Low Spectral Components

WER (%)            Neutral          LE
LFCC, full band    4.8 (4.1–5.5)    29.0 (27.5–30.5)
LFCC, 625 Hz       6.6 (5.8–7.4)    15.6 (14.4–16.8)
(development set; confidence intervals in parentheses)

Increasing the FB low cut-off results in an almost linear increase of WER on neutral speech while considerably enhancing ASR performance on LE speech
Optimal low cut-off found at 625 Hz
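As a sketch of the band-omission idea, one can build a triangular filterbank whose lowest edge starts at 625 Hz. This is an illustrative numpy construction; the filter count, linear spacing, FFT size, and sampling rate are assumptions for the example, not the thesis' exact design:

```python
import numpy as np

def triangular_filterbank(n_filters=19, f_low=625.0, f_high=4000.0,
                          n_fft=512, sr=8000):
    """Linear-frequency triangular filterbank starting at f_low.

    Setting f_low = 625 Hz discards the low-frequency components that
    hurt LE ASR; all parameter values here are illustrative.
    """
    # n_filters + 2 uniformly spaced edge frequencies (LFCC-style)
    edges = np.linspace(f_low, f_high, n_filters + 2)
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):                 # rising slope
            fbank[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                 # falling slope
            fbank[m - 1, k] = (r - k) / max(r - c, 1)
    return fbank

fb = triangular_filterbank()
print(fb.shape)  # (19, 257)
```

Bins below the 625 Hz edge receive zero weight, so the low band is omitted before the cepstrum is computed.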
Slide 50

Data-Driven Design of Robust Features – Increasing Filter Bank Resolution

[Figure: LE speech WER (%) vs. omitted band (0–13); minimum at the 625 Hz cut-off]

Increasing Frequency Resolution
Idea – emphasize the high-information portion of the spectrum by increasing FB resolution
Experiment – FB decimation from 19 to 12 bands (decreasing computational costs)
Increasing the number of filters at the peak of the information distribution curve → deterioration of LE ASR (17.2 % → 26.9 %)
Slight F1–F2 shifts due to LE affect the cepstral features
No simple recipe on how to derive an efficient FB from the information distribution curves
Slide 51

Data-Driven Design of Robust Features – Increasing Filter Bank Resolution

[Figure: LE speech WER (%) as a function of each band's critical (cut-off) frequency; bands 1–6, 500–4000 Hz]

Consecutive Filter Bank Repartitioning
Consecutively, from lowest to highest, each FB high cut-off is varied while the part of the FB above it is redistributed uniformly across the remaining frequency band
The cut-off yielding a local WER minimum is fixed and the procedure is repeated for the adjacent higher cut-off
WER reduction by 2.3 % for LE and by 1 % on neutral speech (example – 6-band FB)
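The repartitioning procedure above can be sketched as a greedy search. `evaluate_wer` is a hypothetical callback standing in for a full train/test cycle; only the control flow is illustrated:

```python
import numpy as np

def repartition_filterbank(edges, candidate_cutoffs, evaluate_wer):
    """Greedy consecutive filterbank repartitioning (sketch).

    `edges` holds each band's high cut-off, lowest to highest; the last
    edge (the upper band limit) stays fixed. `evaluate_wer(edges)` is a
    hypothetical callback that scores a filterbank and returns WER (%).
    """
    edges = list(edges)
    for i in range(len(edges) - 1):           # last edge is never moved
        best_wer, best_cut = np.inf, edges[i]
        for cut in candidate_cutoffs:
            if not (edges[i - 1] if i else 0) < cut < edges[-1]:
                continue                      # keep edges strictly ordered
            # redistribute the remaining bands uniformly above the trial cut
            trial = edges[:i] + [cut] + list(
                np.linspace(cut, edges[-1], len(edges) - i)[1:])
            wer = evaluate_wer(trial)
            if wer < best_wer:
                best_wer, best_cut = wer, cut
        # fix the winning cut-off, then move on to the next higher band
        edges = edges[:i] + [best_cut] + list(
            np.linspace(best_cut, edges[-1], len(edges) - i)[1:])
    return edges
```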
Slide 52

Data-Driven Design of Robust Features – Standard vs. Novel Features

State-of-the-Art LE Front-End – Expolog (Bou-Ghazale & Hansen, 2000)
FB redistributed to improve stressed speech recognition (including loud and Lombard speech)
Increased resolution in the area of F2 occurrence

  Expolog(f) = 700·(10^(f/3988) − 1),     0 Hz ≤ f ≤ 2000 Hz
  Expolog(f) = 2595·log10(1 + f/700),     2000 Hz < f ≤ 4000 Hz

[Figure: Expolog frequency (Hz) vs. linear frequency (Hz) warping curve]
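The Expolog mapping can be written directly from the two branches above; a small numpy sketch (the 3988, 2595, and 700 constants are as given on the slide):

```python
import numpy as np

def expolog(f):
    """Expolog frequency warping (Bou-Ghazale & Hansen, 2000):
    exponential below 2 kHz, Mel-style logarithmic above."""
    f = np.asarray(f, dtype=float)
    low = 700.0 * (10.0 ** (f / 3988.0) - 1.0)     # 0 <= f <= 2000 Hz
    high = 2595.0 * np.log10(1.0 + f / 700.0)      # 2000 < f <= 4000 Hz
    return np.where(f <= 2000.0, low, high)

# The two branches meet continuously at the 2 kHz boundary (~1521 Hz)
print(round(float(expolog(2000.0))))  # 1521
```

Note that both branches evaluate to about 1521 Hz at f = 2000 Hz, so the warped scale is continuous across the boundary.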
Slide 53

Data-Driven Design of Robust Features – Standard vs. Novel Features

Evaluation in ASR Task
Standard MFCC, PLP; variations MFCC-LPC, PLP-DCT – altered cepstrum extraction schemes
Expolog – Expolog FB replacing the trapezoid FB in PLP
20Bands-LPC – uniform rectangular FB employed in the PLP front-end
Big1-LPC – derived from 20Bands-LPC, first 3 bands merged – decreased resolution at frequencies disturbing for LE ASR
RFCC-DCT – repartitioned FB, 19 bands, starting at 625 Hz, employed in MFCC
RFCC-LPC – RFCC employed in PLP
Slide 54

Data-Driven Design of Robust Features – Standard vs. Novel Features

[Figure: WER (%) bar chart – feature performance on female digits for MFCC, MFCC-LPC, PLP, PLP-DCT, Expolog, 20Bands-LPC, Big1-LPC, RFCC-DCT, and RFCC-LPC under Neutral, LE, CLE, and CLEF0 conditions]
Slide 55

LE Suppression in ASR – Frequency Warping

Maximum Likelihood (ML) Approach
Vocal tract length normalization (VTLN): mean formant locations are inversely proportional to vocal tract length (VTL); compensation for inter-speaker VTL variations by frequency transformation (warping): F_W = α·F
Warping factor searched to maximize the likelihood of the observations given the transcription and acoustic models:
  α̂ = argmax_α Pr(O^(α) | W, Θ)
The factor is typically searched in the interval 0.8–1.2 (corresponds to the ratio of VTL differences between males and females)

Formant-Driven (FD) Approach
Warping factor determined from estimated mean formant locations
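The ML search is in practice a grid search over candidate warping factors. In this sketch, `log_likelihood` is a hypothetical callback standing in for the warp-reextract-and-score pass against the acoustic models:

```python
import numpy as np

def estimate_warp_factor(log_likelihood,
                         alphas=np.arange(0.80, 1.22, 0.02)):
    """ML warping-factor estimation (VTLN sketch).

    `log_likelihood(alpha)` is a hypothetical callback that warps the
    utterance spectrum by `alpha`, re-extracts features, and returns the
    alignment log-likelihood; the grid 0.80–1.20 follows the slide.
    """
    scores = [log_likelihood(a) for a in alphas]
    return float(alphas[int(np.argmax(scores))])

# Toy likelihood peaking at alpha = 0.9 (illustration only)
alpha_hat = estimate_warp_factor(lambda a: -(a - 0.9) ** 2)
print(round(alpha_hat, 2))  # 0.9
```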
Slide 56

Frequency Warping – VTLN Principle

[Figure: formant frequencies F1–F4 of Speaker1 (VTL1) plotted against those of a normalized speaker (VTL_NORM), 0–4500 Hz; case VTL1 = VTL_NORM – the formants lie on the diagonal]
Slide 57

Frequency Warping – VTLN Principle

[Figure: same axes; case VTL1 > VTL_NORM – Speaker1's formants F1–F4 deviate from the diagonal and are compensated by the warping]
Slide 60

Frequency Warping – VTLN vs. Lombard Effect

[Figure: formant frequencies F1–F4 of Speaker1 vs. the normalized speaker under LE – which warping to choose?]
Slide 61

Frequency Warping – VTLN vs. Lombard Effect

[Figure: same axes; a single warping factor giving a good approximation of the low formants (F1–F2) misses F4 – what to choose?]
Slide 62

Frequency Warping – Generalized Transform

[Figure: same axes; a warping giving a good approximation of the higher formants (F2–F4) misses F1 – what to choose?]
Slide 63

Frequency Warping – Generalized Transform

[Figure: generalized (non-linear) transform mapping all of Speaker1's formants F1–F4 to the normalized speaker; case VTL1 → VTL_NORM]
Slide 64

Frequency Warping – Evaluation: VTLN vs. Generalized Transform

VTLN:
                              Females                              Males
Set                        Neutral          LE                 Neutral        LE
# Digits                   2560             2560               1423           6303
Baseline                   4.3 (3.5–5.0)    33.6 (31.8–35.5)   2.2 (1.4–2.9)  22.9 (21.8–23.9)
Utterance-dependent VTLN   3.6 (2.9–4.3)    28.2 (26.4–29.9)   1.8 (1.1–2.4)  16.6 (15.7–17.6)
Speaker-dependent VTLN     4.0 (3.2–4.7)    27.7 (26.0–29.5)   1.8 (1.1–2.4)  17.4 (16.5–18.3)

Generalized transform:
                              Females                              Males
Set                        Neutral          LE                 Neutral        LE
# Digits                   2560             2560               1423           6303
Baseline bank              4.2 (3.4–5.0)    35.1 (33.3–37.0)   2.2 (1.4–2.9)  23.2 (22.1–24.2)
Warped bank                4.4 (3.6–5.2)    23.4 (21.8–25.0)   1.8 (1.1–2.4)  15.7 (14.8–16.6)
(WER in %; confidence intervals in parentheses)

The generalized transform better addresses LE-induced formant shifts
The formant-driven approach is less computationally demanding (no need for multiple alignment passes as in ML VTLN), but requires reliable formant tracking – a problem at low SNRs; the ML approach is more stable
Slide 65

LE Suppression in ASR – Two-Stage Recognizer (TSR)

[Diagram: speech signal → Neutral/LE classifier → routed to either a Neutral recognizer or an LE recognizer → estimated word sequence]

Tandem Neutral/LE Classifier – Neutral/LE Dedicated Recognizers
Improving ASR features for LE often results in a performance tradeoff on neutral speech
Idea – combine separate systems 'tuned' for neutral and LE speech, directed by a neutral/LE classifier
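The routing logic of the two-stage recognizer reduces to a few lines. All components in this sketch are hypothetical stubs; in the real system each would be a trained model:

```python
def two_stage_recognize(signal, classifier, neutral_asr, le_asr,
                        threshold=0.5):
    """Two-stage recognizer sketch: route the utterance to whichever
    recognizer matches the detected talking style.

    `classifier(signal)` returns Pr(LE); the recognizers return word
    sequences. All callables here are placeholders, not real models.
    """
    p_le = classifier(signal)
    recognizer = le_asr if p_le > threshold else neutral_asr
    return recognizer(signal)

# Toy illustration with stub components
words = two_stage_recognize(
    signal=[0.1, -0.2, 0.3],
    classifier=lambda s: 0.9,          # pretend the classifier says "LE"
    neutral_asr=lambda s: "neutral-path",
    le_asr=lambda s: "le-path",
)
print(words)  # le-path
```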
Slide 66

Two-Stage Recognizer (TSR) – Neutral/LE Classification

[Figure: short-term spectra of a female vowel /a/ in neutral and LE speech – magnitude (dB) vs. log frequency (Hz); the spectral slope flattens under LE]

Proposal of Neutral/LE Classifier
Search for a set of features providing good discriminability between neutral and LE speech
Requirements – speaker-, gender-, and phonetic-content-independent classification
The set of analyzed features is extended with the slope of the short-term spectrum
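The spectral-slope feature can be estimated, for instance, by a least-squares line fit of the dB magnitude spectrum against log2(frequency), restricted to the 60–1000 Hz band used by the classifier. A numpy sketch with assumed framing details (window type, frame length, sampling rate):

```python
import numpy as np

def spectral_slope_db_per_oct(frame, sr=8000, f_lo=60.0, f_hi=1000.0):
    """Spectral slope in dB/octave over [f_lo, f_hi] (illustrative)."""
    # Windowed magnitude spectrum of one analysis frame
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    band = (freqs >= f_lo) & (freqs <= f_hi)
    mag_db = 20.0 * np.log10(spectrum[band] + 1e-12)
    octaves = np.log2(freqs[band])           # octaves = log2 of frequency
    slope, _ = np.polyfit(octaves, mag_db, 1)
    return float(slope)                      # dB per octave
```

A spectrum falling at roughly −6 dB/octave (neutral-like) then yields a clearly more negative value than a flat (LE-like) spectrum.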
Slide 67

Two-Stage Recognizer (TSR) – Neutral/LE Classification

Mean Spectral Slopes in Voiced Male/Female Speech (0–8000 Hz band)

                 Neutral                                        LE
Set   # N    T (s)   Slope (dB/oct)         σ (dB/oct)   # LE   T (s)   Slope (dB/oct)         σ (dB/oct)
M     2587   618     -7.42 (-7.48; -7.36)   1.53         3532   1114    -5.32 (-5.37; -5.27)   1.55
F     5558   1544    -6.15 (-6.18; -6.12)   1.30         5030   1926    -3.91 (-3.96; -3.86)   1.77

Overlap of Neutral/LE Spectral Slope Distributions (%)

Set   0–8000 Hz   60–8000 Hz   60–5000 Hz   1k–5k Hz   0–1000 Hz   60–1000 Hz
M     26.00       28.13        29.47        100.00     27.81       27.96
F     26.20       28.95        16.76        100.00     25.75       22.18
M+F   28.06       30.48        29.49        100.00     27.54       26.00

Classification Feature Set
A feature set providing superior classification performance on the development data set was found: SNR, spectral slope (60–1000 Hz), F0, ∆F0
Training GMM and multi-layer perceptron (MLP) classifiers
Slide 68

Two-Stage Recognizer (TSR) – Neutral/LE Classification

Binary Classification Task
Decide H1 (LE) if Pr(H1|o) > Pr(H0|o), otherwise H0 (neutral)

GMM Classifier – two models, GMM_N and GMM_LE, scored on the acoustic observation o (classification feature vector), yielding Pr(N) and Pr(LE):
  p(o|λ) = Σ_{q=1..M} w_q · N(o; μ_q, Σ_q)
  N(o; μ_i, Σ_i) = (2π)^(−n/2) |Σ_i|^(−1/2) · exp(−½ (o − μ_i)ᵀ Σ_i⁻¹ (o − μ_i))

MLP Classifier – multi-layer perceptron with sigmoid hidden units and a softmax output layer mapping the classification feature vector to the posteriors Pr(N), Pr(LE)
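A minimal numpy sketch of the GMM route, using diagonal covariances and toy untrained parameters (all values illustrative, not the thesis' trained models):

```python
import numpy as np

def gaussian_logpdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var)
                         + (x - mean) ** 2 / var, axis=-1)

def gmm_logpdf(x, weights, means, vars_):
    """log p(x | GMM) = log sum_q w_q N(x; mu_q, Sigma_q)."""
    comp = np.log(weights) + np.array(
        [gaussian_logpdf(x, m, v) for m, v in zip(means, vars_)])
    m = comp.max()
    return m + np.log(np.exp(comp - m).sum())   # log-sum-exp

def classify(x, gmm_n, gmm_le, prior_le=0.5):
    """Return Pr(LE | x) by the Bayes rule over the two GMMs."""
    ll_n = gmm_logpdf(x, *gmm_n) + np.log(1 - prior_le)
    ll_le = gmm_logpdf(x, *gmm_le) + np.log(prior_le)
    return 1.0 / (1.0 + np.exp(ll_n - ll_le))

# Toy 1-D models: neutral slope near -7 dB/oct, LE near -4 dB/oct
gmm_n = ([1.0], np.array([[-7.0]]), np.array([[1.5]]))
gmm_le = ([1.0], np.array([[-4.0]]), np.array([[1.5]]))
print(classify(np.array([-4.5]), gmm_n, gmm_le) > 0.5)  # True
```

An observation with a flat slope (−4.5 dB/oct) scores higher under the LE model, so the utterance would be routed to the LE recognizer.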
Slide 69

Two-Stage Recognizer (TSR) – Neutral/LE Classification – Feature Set

[Figure: histograms of development neutral/LE samples (M+F) with GMM PDFs and ANN (MLP) posteriors Pr(N), Pr(LE) overlaid, for SNR (dB) and spectral slope (dB/oct)]
Slide 70

Two-Stage Recognizer (TSR) – Neutral/LE Classification – Feature Set

[Figure: the same histograms, GMM PDFs, and ANN posteriors for the two F0-derived features (Hz)]
Slide 71

Two-Stage Recognizer (TSR) – Neutral/LE Classification – Performance

Classification Data Sets

Set             Devel   Open
# Utterances    2472    1371
T_Utter (s)     4.10    4.01
σ_T (s)         1.60    1.50

Classification Performance
UER – Utterance Error Rate – ratio of incorrectly classified utterances to all utterances

GMM:
Set            Train             CV               Open
# Utterances   2202              270              1371
UER (%)        9.9 (8.7–11.1)    5.6 (2.8–8.3)    1.6 (0.9–2.3)

MLP:
Set            Devel FM         Open FM          Devel DM         Open DM
# Utterances   2472             1371             2472             1371
UER (%)        6.6 (5.6–7.6)    2.5 (1.7–3.3)    8.1 (7.0–9.2)    2.8 (1.9–3.6)
Slide 72

Two-Stage Recognizer (TSR) – Overall Performance

Set               Real – neutral    Real – LE
# Female digits   1439              1837
PLP               4.3 (3.3–5.4)     48.1 (45.8–50.4)
RFCC–LPC          6.5 (5.2–7.7)     28.3 (26.2–30.4)
MLP TSR           4.2 (3.2–5.3)     28.4 (26.4–30.5)
FM–GMLC TSR       4.4 (3.3–5.4)     28.4 (26.4–30.5)
DM–GMLC TSR       4.4 (3.3–5.4)     28.4 (26.3–30.4)
(WER in %; confidence intervals in parentheses)

Discrete recognizers are good on either neutral or LE speech; the TSR (speech signal → neutral/LE classifier → neutral/LE recognizer → estimated word sequence) retains near-baseline performance on neutral speech while matching the LE-robust front-end on LE speech
Slide 73

LE Suppression in ASR – Comparison of Proposed Methods

[Figure: WER (%) bar chart comparing the proposed techniques for LE-robust ASR – model adaptation to LE (SI and SD), voice conversion (CLE), modified FB (RFCC-LPC), VTLN recognition (utterance-dependent warp), formant warping, and MLP TSR – against the neutral and LE baselines]
Slide 74

Thank You for Your Attention!