Email: [email protected] Slide 1 Speech and Speaker Recognition SLIDES by John H.L. Hansen,...
-
Upload
leonard-mason -
Category
Documents
-
view
214 -
download
0
Transcript of Email: [email protected] Slide 1 Speech and Speaker Recognition SLIDES by John H.L. Hansen,...
Email: [email protected] Slide 1 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Lombard Speech Recognition
Hynek Bořil ([email protected])
Email: [email protected] Slide 2 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Overview
- Model of Speech Production
- Automatic Speech Recognition (ASR): Feature Extraction, Acoustic Models
- Lombard Effect (LE): Definition & Motivation; Acquisition of a Corpus Capturing Lombard Effect; Analysis of Speech under LE; Methods Increasing ASR Robustness to LE
Email: [email protected] Slide 3 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Speech Production
A model of speech production aids the understanding of speech signal structure and the design of speech processing algorithms.
Email: [email protected] Slide 4 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Speech Production – Linear Model
[Figure: linear (source-filter) model of speech production. Voiced branch: impulse train generator I(z), driven by the pitch period, followed by the glottal pulse model G(z) and gain AV. Unvoiced branch: random noise generator N(z) with gain AN. A voiced/unvoiced switch selects the excitation uG(n), which passes through the vocal tract model V(z) (controlled by the vocal tract parameters) and the radiation model R(z) to produce the speech signal pL(n).]
- Voiced excitation
- Unvoiced excitation
Email: [email protected] Slide 5 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Speech Production – Linear Model
[Figure: production model as above, voiced excitation path highlighted. Time domain: impulse train with period 1/F0. Frequency domain: harmonics at F0, 2F0, …; the combined excitation spectrum |I(F)G(F)| falls off at about -12 dB/oct due to the glottal pulse shape.]
Email: [email protected] Slide 6 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Speech Production – Linear Model
[Figure: production model as above, unvoiced excitation path highlighted. Time domain: random noise. Frequency domain: approximately flat noise spectrum |N(F)|.]
Email: [email protected] Slide 7 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Speech Production – Linear Model
[Figure: production model as above, vocal tract and radiation blocks highlighted.]
The vocal tract is modeled as an all-pole filter:
V(z) = G / (1 - Σ_{k=1..N} a_k z^{-k})
[Figure: magnitude responses over 0–4000 Hz. |V(F)| shows the formant resonances; the radiation response |R(F)| rises at about +6 dB/oct.]
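The voiced branch of the model above can be sketched in a few lines of plain Python: a pitch-periodic impulse train is fed through the all-pole filter V(z) = G / (1 - Σ a_k z^{-k}), realized as the difference equation s[n] = G·u[n] + Σ_k a_k·s[n-k]. The resonance frequency, bandwidth, and sampling rate below are illustrative values, not numbers from the slides.

```python
# Minimal sketch of the linear (source-filter) production model:
# impulse-train excitation through a two-pole "formant" resonator.
import math

def impulse_train(n_samples, pitch_period):
    """Voiced excitation: unit impulses every pitch_period samples."""
    return [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]

def resonator_coeffs(f_res, bw, fs):
    """Two-pole resonator: poles at radius r, angle 2*pi*f_res/fs."""
    r = math.exp(-math.pi * bw / fs)
    theta = 2.0 * math.pi * f_res / fs
    # Coefficients of s[n] = u[n] + a1*s[n-1] + a2*s[n-2]
    return 2.0 * r * math.cos(theta), -r * r

def all_pole_filter(u, a_coeffs, gain=1.0):
    """Apply V(z) = gain / (1 - sum_k a_k z^-k) sample by sample."""
    s = []
    for n in range(len(u)):
        y = gain * u[n]
        for k, a in enumerate(a_coeffs, start=1):
            if n - k >= 0:
                y += a * s[n - k]
        s.append(y)
    return s

fs = 8000                                            # sampling rate (Hz)
excitation = impulse_train(400, pitch_period=80)     # F0 = 100 Hz
a1, a2 = resonator_coeffs(f_res=500.0, bw=100.0, fs=fs)
speech = all_pole_filter(excitation, [a1, a2])
```

A real vocal tract model would use several pole pairs (one per formant); chaining more `resonator_coeffs` pairs into `a_coeffs` extends the sketch in the obvious way.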
Email: [email protected] Slide 8 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Speech Production – Linear Model
[Figure: the complete model combining the previous slides. The overall speech spectrum is the product of the excitation spectrum (harmonic |I(F)G(F)| with a -12 dB/oct tilt for voiced sounds, flat |N(F)| for unvoiced), the vocal tract response |V(F)| with V(z) = G / (1 - Σ_{k=1..N} a_k z^{-k}), and the radiation response |R(F)| (+6 dB/oct).]
Email: [email protected] Slide 9 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Speech Production – Linguistic/Speaker Information in the Speech Signal
How is linguistic information coded in the speech signal?
Phonetic contents:
- Energy: voiced phones (v) have higher energy than unvoiced phones (uv)
- Low formants: locations and bandwidths (reflect changes in the configuration of the vocal tract during speech production)
- Spectral tilt: differs across phones, generally flatter for uv (due to changes in excitation and formant locations)
Other cues:
- Pitch contour: important for distinguishing words in tonal languages (e.g., Chinese dialects)
How is speaker identity coded in the speech signal?
- Glottal waveform
- Vocal tract parameters
- Prosody (intonation, rhythm, stress, …)
Email: [email protected] Slide 10 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Speech Production – Phonetic Contents in Features
Example 1 – First two formants in US vowels (Bond et al., 1989)
[Figure: F1–F2 plane (F1: 200–800 Hz, F2: 800–2200 Hz) showing neutral-speech clusters for the vowels /u/, /i/, /ae/, /a/.]
Example 2 – Spectral slopes in Czech vowels
(# N – number of vowel instances, T – total duration; the slope is given with interval bounds, σ – standard deviation)

Vowel | # N | T (s) | Slope (dB/oct) | σ (dB/oct)
/a/ | 454 | 69.03 | -6.8 (-6.9; -6.7) | 1.13
/e/ | 1064 | 69.33 | -5.6 (-5.7; -5.6) | 1.06
/i/ | 509 | 58.92 | -5.0 (-5.1; -4.9) | 1.15
/o/ | 120 | 9.14 | -8.0 (-8.1; -7.8) | 0.91
/u/ | 102 | 5.73 | -6.1 (-6.3; -6.0) | 0.77
Email: [email protected] Slide 11 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Automatic Speech Recognition (ASR) – Architecture of an HMM Recognizer
[Figure: block diagram – speech signal → feature extraction (MFCC/PLP) → acoustic model producing sub-word likelihoods (GMM/MLP) → decoder (Viterbi), which combines the lexicon (HMM word models) and the language model (bigrams) to output the estimated word sequence.]
Feature extraction – transformation of the time-domain acoustic signal into a representation more effective for the ASR engine: dimensionality reduction and suppression of irrelevant (disturbing) signal components (speaker-, environment-, and recording-chain-dependent characteristics), while preserving phonetic content.
Sub-word models – Gaussian Mixture Models (GMMs): mixtures of Gaussians used to model the distribution of feature vector parameters; Multi-Layer Perceptrons (MLPs): neural networks, much less common than GMMs.
Email: [email protected] Slide 12 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Automatic Speech Recognition (ASR) – HMM-Based Recognition Stages
[Figure (HTK book, 2006): speech signal → feature extraction (windowing, …, cepstrum) → observation sequence o1 o2 o3 … → acoustic models (HMMs of word sequences) combined with the language model → speech transcription.]
Email: [email protected] Slide 13 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Automatic Speech Recognition (ASR) – Feature Extraction – MFCC
Mel Frequency Cepstral Coefficients (MFCC) – Davis & Mermelstein, IEEE Trans. Acoustics, Speech, and Signal Processing, 1980
MFCC is the first choice in current commercial ASR.
Pipeline: s(n) → Preemphasis → Window (Hamming) → |FFT|² → Mel filter bank → Log(·) → IDCT → c(n)
- Preemphasis: compensates for spectral tilt (speech production/microphone channel)
- Windowing: suppression of transient effects in short-term segments of the signal
- |FFT|²: energy spectrum (phase is discarded)
- Mel filter bank: the mel scale models the logarithmic perception of frequency in humans; triangular filters provide dimensionality reduction
- Log + IDCT: extraction of the cepstrum – deconvolution of glottal waveform, vocal tract function, and channel characteristics
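The MFCC pipeline above can be sketched end-to-end for a single frame in plain Python. This is a simplified illustration, not a production front-end: it uses a naive DFT instead of an FFT, and the frame length, filter count, and coefficient count are illustrative choices, not values from the slides.

```python
# Sketch of the MFCC pipeline: preemphasis -> Hamming window ->
# power spectrum -> triangular mel filter bank -> log -> DCT-II.
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, fs, n_filters=12, n_ceps=6, preemph=0.97):
    n = len(frame)
    # Preemphasis: flattens the spectral tilt.
    x = [frame[0]] + [frame[i] - preemph * frame[i - 1] for i in range(1, n)]
    # Hamming window: suppresses transient (edge) effects.
    x = [x[i] * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1))) for i in range(n)]
    # Power spectrum via a naive DFT (phase is discarded).
    n_bins = n // 2 + 1
    power = []
    for k in range(n_bins):
        re = sum(x[i] * math.cos(2 * math.pi * k * i / n) for i in range(n))
        im = -sum(x[i] * math.sin(2 * math.pi * k * i / n) for i in range(n))
        power.append(re * re + im * im)
    # Triangular filters spaced uniformly on the mel scale.
    mel_pts = [i * (hz_to_mel(fs / 2.0) / (n_filters + 1)) for i in range(n_filters + 2)]
    bin_pts = [int(round(mel_to_hz(m) * n / fs)) for m in mel_pts]
    log_energies = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = bin_pts[j - 1], bin_pts[j], bin_pts[j + 1]
        e = 0.0
        for k in range(lo, min(hi, n_bins - 1) + 1):
            if lo <= k <= mid and mid > lo:
                w = (k - lo) / (mid - lo)       # rising edge of triangle
            elif mid < k <= hi and hi > mid:
                w = (hi - k) / (hi - mid)       # falling edge of triangle
            else:
                w = 0.0
            e += w * power[k]
        log_energies.append(math.log(e + 1e-12))
    # DCT-II of the log filter-bank energies -> cepstral coefficients c(q).
    return [sum(log_energies[j] * math.cos(math.pi * q * (j + 0.5) / n_filters)
                for j in range(n_filters)) for q in range(n_ceps)]

fs = 8000
frame = [math.sin(2 * math.pi * 440.0 * i / fs) for i in range(64)]
ceps = mfcc_frame(frame, fs)
```

Replacing `hz_to_mel`/`mel_to_hz` with identity functions and the triangles with non-overlapping rectangles gives the linear-frequency variant (LFCC) used later in the filter-bank experiments.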
Email: [email protected] Slide 14 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Automatic Speech Recognition (ASR) – Feature Extraction – MFCC & PLP
Perceptual Linear Predictive coefficients (PLP) – Hermansky, Journal of the Acoustical Society of America, 1990
- An alternative to MFCC, used less frequently
- Many stages are similar to MFCC
- Linear prediction – smoothing of the spectral envelope (may improve robustness)
MFCC: s(n) → Preemphasis → Window (Hamming) → |FFT|² → Mel filter bank → Log(·) → IDCT → c(n)
PLP: s(n) → Window (Hamming) → |FFT|² → Bark filter bank → Equal-loudness preemphasis → Intensity-to-loudness conversion (cube root) → Linear prediction → Cepstrum recursion → c(n)
Email: [email protected] Slide 15 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Automatic Speech Recognition (ASR)Feature Extraction – MFCC & PLP
Automatic Speech Recognition (ASR)Feature Extraction – MFCC & PLP
Perceptual Linear Predictive Coefficients (PLP)Hermansky, Journal of Acoustical Society of America, 1990
An alternative to MFCC, used less frequently
Many stages similar to MFCC
Linear prediction – smoothing of the spectral envelope (may improve robustness)
WINDOW
(HAMMING)
|FFT|2
EQUAL LOUDNESS
PREEMPHASIS
LINEAR PREDICTION
c(n)
s(n)
PLP INTENSITY
LOUDNESS 3
RECURSION
CEPSTRUM
FILTER BANK
(BARK)
WINDOW
(HAMMING)
|FFT|2
c(n)
s(n)
PREEMPHASIS
Log( )
.
IDCT
MFCC FILTER
BANK (MEL)
MFCC
PLP
Email: [email protected] Slide 16 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Automatic Speech Recognition (ASR) – Acoustic Models – GMM-HMM
Gaussian Mixture Models (GMMs)
- Motivation: distributions of cepstral coefficients can be modeled well by a mixture (weighted sum) of Gaussian functions
- Example: the distribution of c0 in a certain phone and the corresponding Gaussian, defined uniquely by its mean, variance, and weight
[Figure: histogram of c0 (# samples vs. c0) next to the fitted probability density function Pr(c0), with the mixture weight indicated.]
- Multidimensional observations (c0, …, c12) → multidimensional Gaussians, defined uniquely by means, covariance matrices, and weights
- GMMs are typically used to model parts of phones
Hidden Markov Models (HMMs)
- States (GMMs) + transition probabilities between states
- Models of whole phones; lexicon word models are built of phone models
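Decoding with such state-based models can be sketched with a log-domain Viterbi search. As a simplification of the GMM states described above, each state here emits through a single 1-D Gaussian; the two-state model and all its parameters are invented for illustration.

```python
# Viterbi decoding over a small HMM with 1-D Gaussian emissions
# (a one-component stand-in for GMM states).
import math

def gauss_pdf(x, mean, var):
    """Likelihood of observation x under a 1-D Gaussian."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def viterbi(observations, init, trans, means, variances):
    """Return the most likely state sequence (log-domain Viterbi)."""
    n_states = len(init)
    # delta[s] = best log-probability of any path ending in state s.
    delta = [math.log(init[s]) + math.log(gauss_pdf(observations[0], means[s], variances[s]))
             for s in range(n_states)]
    backptr = []
    for obs in observations[1:]:
        new_delta, column = [], []
        for s in range(n_states):
            best_prev = max(range(n_states), key=lambda p: delta[p] + math.log(trans[p][s]))
            column.append(best_prev)
            new_delta.append(delta[best_prev] + math.log(trans[best_prev][s])
                             + math.log(gauss_pdf(obs, means[s], variances[s])))
        delta, backptr = new_delta, backptr + [column]
    # Backtrack from the best final state.
    state = max(range(n_states), key=lambda s: delta[s])
    path = [state]
    for column in reversed(backptr):
        state = column[state]
        path.append(state)
    return list(reversed(path))

# Two states: state 0 emits around 0.0, state 1 around 5.0.
path = viterbi(observations=[0.1, -0.2, 4.9, 5.2],
               init=[0.5, 0.5],
               trans=[[0.9, 0.1], [0.1, 0.9]],
               means=[0.0, 5.0],
               variances=[1.0, 1.0])
```

A full recognizer replaces the scalar Gaussians with GMMs over feature vectors and runs the same search over concatenated phone models constrained by the lexicon and language model.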
Email: [email protected] Slide 17 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Lombard Effect – Definition & Motivation
What is the Lombard effect?
- When exposed to a noisy adverse environment, speakers modify the way they speak in an effort to maintain intelligible communication (Lombard effect – LE).
Why is the Lombard effect interesting?
- Better understanding of the mechanisms of human speech communication (Can we intentionally change particular parameters of speech production to improve intelligibility, or is LE an automatic process learned through the public loop? How do the type of noise and the communication scenario affect LE?)
- Mathematical modeling of LE → classification of LE level, speech synthesis in noisy environments, and increasing the robustness of automatic speech recognition and speaker identification systems
Email: [email protected] Slide 18 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Lombard Effect – Motivation & Goals
Ambiguity in past LE investigations
- LE has been studied since 1911; however, many investigations disagree in the observed impacts of LE on speech production.
- Analyses were typically conducted on very limited data – a couple of utterances from a few subjects (1–10).
- Lack of a communication factor – a majority of studies ignore the importance of communication (an effort to convey a message over noise) for evoking LE → the occurrence and level of LE in the recordings is 'random' → contradicting analysis results.
- LE has been studied for only several world languages (English, Spanish, French, Japanese, Korean, Mandarin Chinese); there is no comprehensive study for any Slavic language.
1st goal
- Design of a Czech Lombard speech database addressing the need for a communication factor and well-defined simulated noisy conditions
- Systematic analysis of LE in spoken Czech
Email: [email protected] Slide 19 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Lombard Effect – Motivation & Goals
ASR under LE
- Mismatch between LE speech corrupted by noise and acoustic models trained on clean neutral speech
- The strong impact of noise on ASR is well known, and a vast number of noise suppression/speech emphasis algorithms have been proposed in recent decades (yet no ultimate solution has been reached)
- The negative impact of LE on ASR often exceeds that of noise; recent state-of-the-art ASR systems mostly ignore this issue
LE-equalization methods
- LE-equalization algorithms typically operate in the following domains: robust features, LE-to-neutral transformation, model adjustments, improved training of acoustic models
- The algorithms display various degrees of efficiency and are often bound by strong assumptions that prevent real-world application (fixed transformations applied to phonetic groups, known level of LE, etc.)
2nd goal
- Proposal of novel LE-equalization techniques with a focus on both the level of LE suppression and the extent of bounding assumptions
Email: [email protected] Slide 20 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
LE Corpora
Available Czech corpora
- Czech SPEECON – speech recordings from various environments, including office and car
- CZKCC – car recordings, covering parked-car (engine off) and moving-car scenarios
- Both databases contain speech produced in quiet and in noise → candidates for a study of LE; however, not good ones, as shown later
Design/acquisition of an LE-oriented database – Czech Lombard Speech Database '05 (CLSD'05)
- Goal – communication in a simulated noisy background → recordings with high SNR
- Phonetically rich data and extensive small-vocabulary material
- Parallel utterances in neutral and LE conditions
Email: [email protected] Slide 21 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Data Acquisition – Recording Setup
Simulated noisy conditions
- Noise samples mixed with the speech feedback and played to the speaker and operator through headphones
- The operator judges the intelligibility of speech in noise – if an utterance is not intelligible, the operator asks the subject to repeat it → speakers are required to convey the message over the noise → communication → LE
- Noises: mostly car noises from the Car2E database, normalized to 90 dB SPL
Speaker sessions
- 14 male/12 female speakers
- Each subject was recorded both in neutral and in simulated noisy conditions
[Figure: recording setup – speaker with close-talk and middle-talk microphones and headphones carrying noise + speech feedback; H&T recorder; operator monitoring noise + speech and prompting "OK – next" / "BAD – again".]
Email: [email protected] Slide 22 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Data Acquisition – Recording Setup
Conveying the message over the noise evokes real LE.
[Figure: the recording hardware – NB2 and ME-104 units.]
Email: [email protected] Slide 23 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Data Acquisition – Impact of Headphones
Environmental sound attenuation by headphones
- Attenuation characteristics measured on a dummy head
- A source of wide-band noise; measurement of the sound transfer to the dummy head's auditory canals with and without headphones
- Attenuation characteristics – subtraction of the two transfers
Email: [email protected] Slide 24 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Data Acquisition – Impact of Headphones
Environmental sound attenuation by headphones
- Directional attenuation measured in a reflectionless sound booth
- Real attenuation in the recording room
[Figure: headphone attenuation (dB) vs. frequency (100 Hz–10 kHz) for source angles 0°, 90°, 180° and for the recording room, roughly -10 to 30 dB; polar plots of directional attenuation at 1, 2, 4, and 8 kHz.]
Email: [email protected] Slide 25 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Speech Production under Lombard Effect
Speech features affected by LE:
- Vocal tract excitation: glottal pulse shape changes, fundamental frequency rises
- Vocal tract transfer function: center frequencies of low formants increase, formant bandwidths reduce
- Vocal effort (intensity) increases
- Other: voiced phonemes are prolonged, the energy ratio of voiced to unvoiced speech increases, …
[Figure: linear production model with the affected blocks highlighted; V(z) = G / (1 - Σ_{k=1..N} a_k z^{-k}).]
Email: [email protected] Slide 29 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Analysis of Speech Features under LE – Fundamental Frequency
[Figure: distributions of fundamental frequency (70–570 Hz), number of samples vs. F0, for three corpora – Czech SPEECON (office vs. car, female/male), CZKCC (engine off vs. engine on, female/male), and CLSD'05 (neutral vs. LE, female/male).]
Email: [email protected] Slide 30 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Analysis of Speech Features under LE – Formant Locations
[Figure: F2 vs. F1 plots of the Czech vowels /i/, /i'/, /e/, /e'/, /a/, /a'/, /o/, /o'/, /u/, /u'/ for neutral vs. LE speech – CZKCC female digits, CZKCC male digits, CLSD'05 female digits, and CLSD'05 male digits (F1 roughly 200–1000 Hz, F2 roughly 500–2500 Hz).]
Email: [email protected] Slide 31 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Analysis of Speech Features under LE – Formant Bandwidths
(B1 – first-formant bandwidth, σ1 – standard deviation; M/F – male/female; columns presumably give neutral values followed by LE values; asterisks as in the source data)

CZKCC
Vowel | B1M (Hz) | σ1M (Hz) | B1M,LE (Hz) | σ1M,LE (Hz) | B1F (Hz) | σ1F (Hz) | B1F,LE (Hz) | σ1F,LE (Hz)
/a/ | 207* | 74 | 210* | 84 | 275 | 97 | 299 | 78
/e/ | 125* | 70 | 130* | 78 | 156 | 68 | 186 | 79
/i/ | 124* | 49 | 127* | 44 | 105 | 44 | 136 | 53
/o/ | 275 | 87 | 222 | 67 | 263* | 85 | 269* | 73
/u/ | 187 | 100 | 170 | 89 | 174* | 96 | 187* | 101

CLSD'05
Vowel | B1M (Hz) | σ1M (Hz) | B1M,LE (Hz) | σ1M,LE (Hz) | B1F (Hz) | σ1F (Hz) | B1F,LE (Hz) | σ1F,LE (Hz)
/a/ | 269 | 88 | 152 | 59 | 232 | 85 | 171 | 68
/e/ | 168 | 94 | 99 | 44 | 169 | 73 | 130 | 49
/i/ | 125 | 53 | 108 | 52 | 132* | 52 | 133* | 58
/o/ | 239 | 88 | 157 | 81 | 246 | 91 | 158 | 62
/u/ | 134* | 67 | 142* | 81 | 209 | 95 | 148 | 66

- SPEECON, CZKCC: no consistent bandwidth changes
- CLSD'05: significant bandwidth reduction in many voiced phonemes
Email: [email protected] Slide 32 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Analysis of Speech Features under LE – Phoneme Durations
(# – number of instances, T – mean duration, σT – standard deviation of duration, Δ – relative change of mean duration)

CZKCC (engine off vs. engine on)
Word | Phoneme | # OFF | TOFF (s) | σTOFF (s) | # ON | TON (s) | σTON (s) | Δ (%)
Nula | /a/ | 349 | 0.147 | 0.079 | 326 | 0.259 | 0.289 | 48.50
Jedna | /a/ | 269 | 0.173 | 0.076 | 251 | 0.241 | 0.238 | 39.36
Dva | /a/ | 245 | 0.228 | 0.075 | 255 | 0.314 | 0.311 | 38.04
Štiri | /r/ | 16 | 0.045 | 0.027 | 68 | 0.080 | 0.014 | 78.72
Sedm | /e/ | 78 | 0.099 | 0.038 | 66 | 0.172 | 0.142 | 72.58

CLSD'05 (neutral vs. LE)
Word | Phoneme | # N | TN (s) | σTN (s) | # LE | TLE (s) | σTLE (s) | Δ (%)
Jedna | /e/ | 583 | 0.031 | 0.014 | 939 | 0.082 | 0.086 | 161.35
Dvje | /e/ | 586 | 0.087 | 0.055 | 976 | 0.196 | 0.120 | 126.98
Čtiri | /r/ | 35 | 0.041 | 0.020 | 241 | 0.089 | 0.079 | 115.92
Pjet | /e/ | 555 | 0.056 | 0.033 | 909 | 0.154 | 0.089 | 173.71
Sedm | /e/ | 358 | 0.080 | 0.038 | 583 | 0.179 | 0.136 | 122.46
Osm | /o/ | 310 | 0.086 | 0.027 | 305 | 0.203 | 0.159 | 135.25
Devjet | /e/ | 609 | 0.043 | 0.022 | 932 | 0.120 | 0.088 | 177.20

- Significant increase in duration in some phonemes, especially voiced ones
- Some unvoiced consonants show a duration reduction
- Duration changes in CLSD'05 considerably exceed those in SPEECON and CZKCC
Email: [email protected] Slide 33 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Lombard Effect – Initial ASR Experiments
ASR evaluation – WER (word error rate):
WER = (D + S + I) / N × 100 %
where S – word substitutions, I – word insertions, D – word deletions, N – number of words in the reference.
Digit recognizer:
- Monophone HMM models
- 13 MFCC + ∆ + ∆∆
- 32 Gaussian mixtures per model state

Corpus | Set | # Spkrs | # Digits | WER (%)
Czech SPEECON | Office F | 22 | 880 | 5.5 (4.0–7.0)
Czech SPEECON | Office M | 31 | 1219 | 4.3 (3.1–5.4)
Czech SPEECON | Car F | 28 | 1101 | 4.6 (3.4–5.9)
Czech SPEECON | Car M | 42 | 1657 | 10.5 (9.0–12.0)
CZKCC | OFF F | 30 | 1480 | 3.0 (2.1–3.8)
CZKCC | OFF M | 30 | 1323 | 2.3 (1.5–3.1)
CZKCC | ON F | 18 | 1439 | 13.5 (11.7–15.2)
CZKCC | ON M | 21 | 1450 | 10.4 (8.8–12.0)
CLSD'05 | N F | 12 | 4930 | 7.3 (6.6–8.0)
CLSD'05 | N M | 14 | 1423 | 3.8 (2.8–4.8)
CLSD'05 | LE F | 12 | 5360 | 42.8 (41.5–44.1)
CLSD'05 | LE M | 14 | 6303 | 16.3 (15.4–17.2)

Car/ON sets – noisy car recordings (SPEECON/CZKCC: 10.7/12.6 dB SNR); CLSD'05 LE sets – clean recordings (40.9 dB SNR).
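The WER definition above can be computed directly with an edit-distance alignment between the reference and the recognizer output, where the aligned substitution, deletion, and insertion counts are exactly the S, D, and I of the formula. A minimal sketch in plain Python:

```python
# WER = (D + S + I) / N * 100 %, via Levenshtein alignment on word lists.
def wer(reference, hypothesis):
    """Word error rate in percent; inputs are lists of words."""
    n, m = len(reference), len(hypothesis)
    # d[i][j] = minimum edit cost aligning reference[:i] to hypothesis[:j].
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i              # i deletions
    for j in range(m + 1):
        d[0][j] = j              # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            d[i][j] = min(sub,               # substitution (or match)
                          d[i - 1][j] + 1,   # deletion
                          d[i][j - 1] + 1)   # insertion
    return 100.0 * d[n][m] / n

ref = "one two three four".split()
hyp = "one too three four five".split()   # 1 substitution + 1 insertion
rate = wer(ref, hyp)                      # -> 50.0
```

Note that WER can exceed 100 % when insertions are numerous, which is why it is preferred over accuracy for reporting degraded-condition results like the LE sets above.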
Email: [email protected] Slide 34 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Lombard EffectInitial ASR Experiments
Lombard EffectInitial ASR Experiments
ASR Evaluation – WER (Word Error Rate)S – word substitutions
I – word insertions
D – word deletions
Digit RecognizerMonophone HMM models
13 MFCC + ∆ + ∆∆
32 Gaussian mixtures per model state
D – word deletions
Czech SPEECON CZKCC CLSD‘05 Set
Office F Office M Car F Car M OFF F OFF M ON F ON M N F N M LE F LE M
# Spkrs 22 31 28 42 30 30 18 21 12 14 12 14
# Digits 880 1219 1101 1657 1480 1323 1439 1450 4930 1423 5360 6303
WER
(%) 5.5
(4.0–7.0)
4.3 (3.1–5.4)
4.6 (3.4–5.9)
10.5 (9.0–12.0)
3.0 (2.1–3.8)
2.3 (1.5–3.1)
13.5 (11.7–15.2)
10.4 (8.8–12.0)
7.3 (6.6–8.0)
3.8 (2.8–4.8)
42.8 (41.5–44.1)
16.3 (15.4–17.2)
100 %D S I
WERN
Noisy car recordings (SPEE/CZKCC – 10.7/12.6 dB
SNR)
Clean recordings (LE - 40.9 dB SNR)
Email: [email protected] Slide 35 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Lombard EffectInitial ASR Experiments
Lombard EffectInitial ASR Experiments
ASR Evaluation – WER (Word Error Rate)S – word substitutions
I – word insertions
D – word deletions
Digit RecognizerMonophone HMM models
13 MFCC + ∆ + ∆∆
32 Gaussian mixtures per model state
D – word deletions
Czech SPEECON CZKCC CLSD‘05 Set
Office F Office M Car F Car M OFF F OFF M ON F ON M N F N M LE F LE M
# Spkrs 22 31 28 42 30 30 18 21 12 14 12 14
# Digits 880 1219 1101 1657 1480 1323 1439 1450 4930 1423 5360 6303
WER
(%) 5.5
(4.0–7.0)
4.3 (3.1–5.4)
4.6 (3.4–5.9)
10.5 (9.0–12.0)
3.0 (2.1–3.8)
2.3 (1.5–3.1)
13.5 (11.7–15.2)
10.4 (8.8–12.0)
7.3 (6.6–8.0)
3.8 (2.8–4.8)
42.8 (41.5–44.1)
16.3 (15.4–17.2)
100 %D S I
WERN
Noisy car recordings (SPEE/CZKCC – 10.7/12.6 dB
SNR)
Clean recordings (LE - 40.9 dB SNR)
Email: [email protected] Slide 36 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Lombard EffectInitial ASR Experiments
Lombard EffectInitial ASR Experiments
ASR Evaluation – WER (Word Error Rate)S – word substitutions
I – word insertions
D – word deletions
Digit RecognizerMonophone HMM models
13 MFCC + ∆ + ∆∆
32 Gaussian mixtures per model state
D – word deletions
Czech SPEECON CZKCC CLSD‘05 Set
Office F Office M Car F Car M OFF F OFF M ON F ON M N F N M LE F LE M
# Spkrs 22 31 28 42 30 30 18 21 12 14 12 14
# Digits 880 1219 1101 1657 1480 1323 1439 1450 4930 1423 5360 6303
WER
(%) 5.5
(4.0–7.0)
4.3 (3.1–5.4)
4.6 (3.4–5.9)
10.5 (9.0–12.0)
3.0 (2.1–3.8)
2.3 (1.5–3.1)
13.5 (11.7–15.2)
10.4 (8.8–12.0)
7.3 (6.6–8.0)
3.8 (2.8–4.8)
42.8 (41.5–44.1)
16.3 (15.4–17.2)
100 %D S I
WERN
Noisy car recordings (SPEE/CZKCC – 10.7/12.6 dB
SNR)
Clean recordings (LE - 40.9 dB SNR)
Email: [email protected] Slide 37 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
LE Suppression in ASR – Model Adaptation
Model adaptation
- Often effective when only limited data from the given conditions are available
- Maximum Likelihood Linear Regression (MLLR) – if the amount of data per class is limited, acoustically close classes are grouped and transformed together:
  μ'_MLLR = A μ + b
- Maximum a posteriori approach (MAP) – the initial models are used as informative priors for the adaptation; for the Gaussian means (N – amount of adaptation data, μ̄ – adaptation-data mean, τ – prior weighting factor):
  μ'_MAP = (N μ̄ + τ μ) / (N + τ)
Adaptation procedure
- First, the neutral speaker-independent (SI) models are transformed by MLLR, employing clustering (binary regression tree)
- Second, MAP adaptation – only for nodes with a sufficient amount of adaptation data
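The two mean-update equations above can be sketched in a few lines. The 2-D vectors, the identity transform, and the value of the prior weight `tau` below are invented for illustration; real MLLR estimates A and b from the adaptation data per regression class.

```python
# MLLR affine transform of a Gaussian mean, and MAP interpolation
# between the prior mean and the adaptation-data mean.
def mllr_mean(A, b, mu):
    """mu' = A*mu + b, with A given as a list of rows."""
    return [sum(A[i][j] * mu[j] for j in range(len(mu))) + b[i]
            for i in range(len(A))]

def map_mean(mu_prior, data, tau):
    """mu' = (N*mu_bar + tau*mu_prior) / (N + tau)."""
    n = len(data)
    dim = len(mu_prior)
    mu_bar = [sum(x[d] for x in data) / n for d in range(dim)]
    return [(n * mu_bar[d] + tau * mu_prior[d]) / (n + tau) for d in range(dim)]

mu = [1.0, 2.0]
# Identity A -> the MLLR transform reduces to a pure bias shift.
shifted = mllr_mean(A=[[1.0, 0.0], [0.0, 1.0]], b=[0.5, -0.5], mu=mu)
# Two observations -> mu_bar = [3, 3], N = 2; tau controls the pull
# toward the prior: large tau keeps mu' near mu_prior.
adapted = map_mean(mu_prior=[0.0, 0.0],
                   data=[[2.0, 2.0], [4.0, 4.0]],
                   tau=2.0)
```

This also illustrates why the slide's procedure applies MLLR first: with little data the grouped affine transform moves many Gaussians at once, and MAP then refines only the nodes where N is large enough for the data term to dominate.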
Email: [email protected] Slide 38 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
LE Suppression in ASR – Model Adaptation
Adaptation schemes
- Speaker-independent (SI) adaptation – group dependent/independent
- Speaker-dependent (SD) adaptation – to neutral/LE
[Figure: WER (%) bar chart (0–90 %) comparing model adaptation to conditions and speakers – baseline vs. adapted digits and sentences under LE, for SI adaptation to LE (same speakers), SI adaptation to LE (disjunct speakers), SD adaptation to neutral, and SD adaptation to LE.]
Email: [email protected] Slide 39 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
LE Suppression in ASR – Data-Driven Design of Robust Features
Filter bank approach
- Analysis of the importance of frequency components for ASR
- Repartitioning the filter bank (FB) to emphasize components carrying phonetic information and to suppress disturbing components
- Initial FB uniformly distributed on a linear scale – equal attention to all components
- Consecutively, a single FB band is omitted → impact on WER?
- Omitting bands carrying more information will result in a considerable WER increase
Implementation
- MFCC front-end; the mel scale is replaced by a linear scale, and the triangular filters are replaced by rectangular filters without overlap
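The band-omission experiment can be sketched as follows: rectangular, non-overlapping filters on a linear scale, with one band excluded before the log stage. The band count and toy spectrum are illustrative; in the experiments each choice of the omitted band would feed a full recognizer training/test cycle.

```python
# Rectangular linear-scale filter bank with an optional omitted band.
import math

def band_energies(power_spectrum, n_bands, omit=None):
    """Sum power into equal-width rectangular bands; skip band `omit`."""
    n_bins = len(power_spectrum)
    width = n_bins // n_bands
    energies = []
    for b in range(n_bands):
        if b == omit:
            continue          # the omitted band contributes no feature
        start = b * width
        energies.append(sum(power_spectrum[start:start + width]))
    return [math.log(e + 1e-12) for e in energies]

# Toy power spectrum: 100 bins with all energy in bins 0-9.
spectrum = [1.0] * 10 + [0.0] * 90
full = band_energies(spectrum, n_bands=20)
no_band0 = band_energies(spectrum, n_bands=20, omit=0)
```

Sweeping `omit` over all 20 bands and retraining/rescoring the recognizer for each sweep position produces the WER-vs-omitted-band curves discussed on the following slides.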
Email: [email protected] Slide 40 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Data-Driven Design of Robust Features – Importance of Frequency Components
Features: c0, c1, …, c12; ∆c0, ∆c1, …, ∆c12; ∆∆c0, ∆∆c1, …, ∆∆c12
[Figure: WER (%) as a function of the omitted band (1–20) for neutral speech (roughly 3–5 %) and LE speech (roughly 20–40 %); below, the 20-band linear filter bank spanning 0–4 kHz with its cut-off frequencies.]
Email: [email protected] Slide 41 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Data-Driven Design of Robust FeaturesImportance of Frequency Components
Data-Driven Design of Robust FeaturesImportance of Frequency Components
3
4
5
0 5 10 15 20
Omitted band
Neutral speech
WE
R (
%)
20
30
40
0 5 10 15 20
Omitted band
LE speech
WE
R (
%)
Filterbank Cut-Off Frequencies (Hz)
Am
plitu
de
c0, c1,…, c12; ∆c0, ∆c1,…, ∆c12; ∆∆c0, ∆∆c1,…, ∆∆c12
4K
1 20
Email: [email protected] Slide 42 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Data-Driven Design of Robust FeaturesImportance of Frequency Components
Data-Driven Design of Robust FeaturesImportance of Frequency Components
3
4
5
0 5 10 15 20
Omitted band
Neutral speech
WE
R (
%)
20
30
40
0 5 10 15 20
Omitted band
LE speech
WE
R (
%)
Filterbank Cut-Off Frequencies (Hz)
Am
plitu
de
c0, c1,…, c12; ∆c0, ∆c1,…, ∆c12; ∆∆c0, ∆∆c1,…, ∆∆c12
4K
1 20
Email: [email protected] Slide 43 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Data-Driven Design of Robust FeaturesImportance of Frequency Components
Data-Driven Design of Robust FeaturesImportance of Frequency Components
3
4
5
0 5 10 15 20
Omitted band
Neutral speech
WE
R (
%)
20
30
40
0 5 10 15 20
Omitted band
LE speech
WE
R (
%)
Filterbank Cut-Off Frequencies (Hz)
Am
plitu
de
c0, c1,…, c12; ∆c0, ∆c1,…, ∆c12; ∆∆c0, ∆∆c1,…, ∆∆c12
4K
1 20
Email: [email protected] Slide 44 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Data-Driven Design of Robust Features – Importance of Frequency Components
[Figure: WER (%) vs. omitted band (1–20) for neutral and LE speech; features c0–c12 with ∆ and ∆∆.]
- The area of 1st and 2nd formant occurrence carries the highest portion of phonetic information; F1 is more important for neutral speech, F1–F2 for LE speech recognition
- Omitting the 1st band considerably improves LE ASR while reducing performance on neutral speech → tradeoff
- Next step – how much of the low-frequency content should be omitted for LE ASR?
Email: [email protected] Slide 45 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Data-Driven Design of Robust Features – Omitting Low Frequencies
Optimizing filter banks – omitting low frequencies
[Figure: 19-band linear filter bank with a variable low cut-off, spanning up to 4 kHz; WER (%) as a function of the omitted low-frequency bandwidth (0–1200 Hz) for neutral speech (roughly 0–12 %) and LE speech (roughly 0–30 %). Features: c0–c12 with ∆ and ∆∆.]
Email: [email protected] Slide 46 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Data-Driven Design of Robust FeaturesOmitting Low Frequencies
Data-Driven Design of Robust FeaturesOmitting Low Frequencies
1 19
Filterbank Cut-Off Frequencies (Hz)
Am
plitu
de
c0, c1,…, c12; ∆c0, ∆c1,…, ∆c12; ∆∆c0, ∆∆c1,…, ∆∆c12
4K
0
2
4
6
8
10
12
0 200 400 600 800 1000 1200
Bandwidth (Hz)
Neutral speech
WE
R (
%)
0
10
20
30
0 200 400 600 800 1000 1200
Bandwidth (Hz)
LE speech
WE
R (%
)
Email: [email protected] Slide 47 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Data-Driven Design of Robust FeaturesOmitting Low Frequencies
Data-Driven Design of Robust FeaturesOmitting Low Frequencies
1 19
Filterbank Cut-Off Frequencies (Hz)
Am
plitu
de
c0, c1,…, c12; ∆c0, ∆c1,…, ∆c12; ∆∆c0, ∆∆c1,…, ∆∆c12
4K
0
2
4
6
8
10
12
0 200 400 600 800 1000 1200
Bandwidth (Hz)
Neutral speech
WE
R (
%)
0
10
20
30
0 200 400 600 800 1000 1200
Bandwidth (Hz)
LE speech
WE
R (%
)
Email: [email protected] Slide 48 Speech and Speaker Recognition SLIDES by John H.L. Hansen, 2007
Data-Driven Design of Robust FeaturesOmitting Low Frequencies
Data-Driven Design of Robust FeaturesOmitting Low Frequencies
1 19
Filterbank Cut-Off Frequencies (Hz)
Am
plitu
de
c0, c1,…, c12; ∆c0, ∆c1,…, ∆c12; ∆∆c0, ∆∆c1,…, ∆∆c12
4K
0
2
4
6
8
10
12
0 200 400 600 800 1000 1200
Bandwidth (Hz)
Neutral speech
WE
R (
%)
0
10
20
30
0 200 400 600 800 1000 1200
Bandwidth (Hz)
LE speech
WE
R (%
)
Slide 49

Data-Driven Design of Robust Features – Omitting Low Frequencies

Effect of Omitting Low Spectral Components

WER (%)            Neutral          LE
LFCC, full band    4.8 (4.1–5.5)    29.0 (27.5–30.5)
LFCC, 625 Hz       6.6 (5.8–7.4)    15.6 (14.4–16.8)
(development set; confidence intervals in parentheses)

Increasing the FB low cut-off results in an almost linear increase of WER on neutral speech while considerably enhancing ASR performance on LE speech
Optimal low cut-off found at 625 Hz
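As a sketch of the band-omission idea, one can build a triangular filterbank whose lowest edge starts at 625 Hz. This is an illustrative numpy construction; the filter count, linear spacing, FFT size, and sampling rate are assumptions for the example, not the thesis' exact design:

```python
import numpy as np

def triangular_filterbank(n_filters=19, f_low=625.0, f_high=4000.0,
                          n_fft=512, sr=8000):
    """Linear-frequency triangular filterbank starting at f_low.

    Setting f_low = 625 Hz discards the low-frequency components that
    hurt LE ASR; all parameter values here are illustrative.
    """
    # n_filters + 2 uniformly spaced edge frequencies (LFCC-style)
    edges = np.linspace(f_low, f_high, n_filters + 2)
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):                 # rising slope
            fbank[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                 # falling slope
            fbank[m - 1, k] = (r - k) / max(r - c, 1)
    return fbank

fb = triangular_filterbank()
print(fb.shape)  # (19, 257)
```

Bins below the 625 Hz edge receive zero weight, so the low band is omitted before the cepstrum is computed.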
Slide 50

Data-Driven Design of Robust Features – Increasing Filter Bank Resolution

[Figure: LE speech WER (%) vs. omitted band (0–13); minimum at the 625 Hz cut-off]

Increasing Frequency Resolution
Idea – emphasize the high-information portion of the spectrum by increasing FB resolution
Experiment – FB decimation from 19 to 12 bands (decreasing computational costs)
Increasing the number of filters at the peak of the information distribution curve → deterioration of LE ASR (17.2 % → 26.9 %)
Slight F1–F2 shifts due to LE affect the cepstral features
No simple recipe on how to derive an efficient FB from the information distribution curves
Slide 51

Data-Driven Design of Robust Features – Increasing Filter Bank Resolution

[Figure: LE speech WER (%) as a function of each band's critical (cut-off) frequency; bands 1–6, 500–4000 Hz]

Consecutive Filter Bank Repartitioning
Consecutively, from lowest to highest, each FB high cut-off is varied while the part of the FB above it is redistributed uniformly across the remaining frequency band
The cut-off yielding a local WER minimum is fixed and the procedure is repeated for the adjacent higher cut-off
WER reduction by 2.3 % for LE and by 1 % on neutral speech (example – 6-band FB)
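The repartitioning procedure above can be sketched as a greedy search. `evaluate_wer` is a hypothetical callback standing in for a full train/test cycle; only the control flow is illustrated:

```python
import numpy as np

def repartition_filterbank(edges, candidate_cutoffs, evaluate_wer):
    """Greedy consecutive filterbank repartitioning (sketch).

    `edges` holds each band's high cut-off, lowest to highest; the last
    edge (the upper band limit) stays fixed. `evaluate_wer(edges)` is a
    hypothetical callback that scores a filterbank and returns WER (%).
    """
    edges = list(edges)
    for i in range(len(edges) - 1):           # last edge is never moved
        best_wer, best_cut = np.inf, edges[i]
        for cut in candidate_cutoffs:
            if not (edges[i - 1] if i else 0) < cut < edges[-1]:
                continue                      # keep edges strictly ordered
            # redistribute the remaining bands uniformly above the trial cut
            trial = edges[:i] + [cut] + list(
                np.linspace(cut, edges[-1], len(edges) - i)[1:])
            wer = evaluate_wer(trial)
            if wer < best_wer:
                best_wer, best_cut = wer, cut
        # fix the winning cut-off, then move on to the next higher band
        edges = edges[:i] + [best_cut] + list(
            np.linspace(best_cut, edges[-1], len(edges) - i)[1:])
    return edges
```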
Slide 52

Data-Driven Design of Robust Features – Standard vs. Novel Features

State-of-the-Art LE Front-End – Expolog (Bou-Ghazale & Hansen, 2000)
FB redistributed to improve stressed speech recognition (including loud and Lombard speech)
Increased resolution in the area of F2 occurrence

  Expolog(f) = 700·(10^(f/3988) − 1),     0 Hz ≤ f ≤ 2000 Hz
  Expolog(f) = 2595·log10(1 + f/700),     2000 Hz < f ≤ 4000 Hz

[Figure: Expolog frequency (Hz) vs. linear frequency (Hz) warping curve]
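The Expolog mapping can be written directly from the two branches above; a small numpy sketch (the 3988, 2595, and 700 constants are as given on the slide):

```python
import numpy as np

def expolog(f):
    """Expolog frequency warping (Bou-Ghazale & Hansen, 2000):
    exponential below 2 kHz, Mel-style logarithmic above."""
    f = np.asarray(f, dtype=float)
    low = 700.0 * (10.0 ** (f / 3988.0) - 1.0)     # 0 <= f <= 2000 Hz
    high = 2595.0 * np.log10(1.0 + f / 700.0)      # 2000 < f <= 4000 Hz
    return np.where(f <= 2000.0, low, high)

# The two branches meet continuously at the 2 kHz boundary (~1521 Hz)
print(round(float(expolog(2000.0))))  # 1521
```

Note that both branches evaluate to about 1521 Hz at f = 2000 Hz, so the warped scale is continuous across the boundary.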
Slide 53

Data-Driven Design of Robust Features – Standard vs. Novel Features

Evaluation in ASR Task
Standard MFCC, PLP; variations MFCC-LPC, PLP-DCT – altered cepstrum extraction schemes
Expolog – Expolog FB replacing the trapezoid FB in PLP
20Bands-LPC – uniform rectangular FB employed in the PLP front-end
Big1-LPC – derived from 20Bands-LPC, first 3 bands merged – decreased resolution at frequencies disturbing for LE ASR
RFCC-DCT – repartitioned FB, 19 bands, starting at 625 Hz, employed in MFCC
RFCC-LPC – RFCC employed in PLP
Slide 54

Data-Driven Design of Robust Features – Standard vs. Novel Features

[Figure: WER (%) bar chart – feature performance on female digits for MFCC, MFCC-LPC, PLP, PLP-DCT, Expolog, 20Bands-LPC, Big1-LPC, RFCC-DCT, and RFCC-LPC under Neutral, LE, CLE, and CLEF0 conditions]
Slide 55

LE Suppression in ASR – Frequency Warping

Maximum Likelihood (ML) Approach
Vocal tract length normalization (VTLN): mean formant locations are inversely proportional to vocal tract length (VTL); compensation for inter-speaker VTL variations by frequency transformation (warping): F_W = α·F
Warping factor searched to maximize the likelihood of the observations given the transcription and acoustic models:
  α̂ = argmax_α Pr(O^(α) | W, Θ)
The factor is typically searched in the interval 0.8–1.2 (corresponds to the ratio of VTL differences between males and females)

Formant-Driven (FD) Approach
Warping factor determined from estimated mean formant locations
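The ML search is in practice a grid search over candidate warping factors. In this sketch, `log_likelihood` is a hypothetical callback standing in for the warp-reextract-and-score pass against the acoustic models:

```python
import numpy as np

def estimate_warp_factor(log_likelihood,
                         alphas=np.arange(0.80, 1.22, 0.02)):
    """ML warping-factor estimation (VTLN sketch).

    `log_likelihood(alpha)` is a hypothetical callback that warps the
    utterance spectrum by `alpha`, re-extracts features, and returns the
    alignment log-likelihood; the grid 0.80–1.20 follows the slide.
    """
    scores = [log_likelihood(a) for a in alphas]
    return float(alphas[int(np.argmax(scores))])

# Toy likelihood peaking at alpha = 0.9 (illustration only)
alpha_hat = estimate_warp_factor(lambda a: -(a - 0.9) ** 2)
print(round(alpha_hat, 2))  # 0.9
```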
Slide 56

Frequency Warping – VTLN Principle

[Figure: formant frequencies F1–F4 of Speaker1 (VTL1) plotted against those of a normalized speaker (VTL_NORM), 0–4500 Hz; case VTL1 = VTL_NORM – the formants lie on the diagonal]
Slide 57

Frequency Warping – VTLN Principle

[Figure: same axes; case VTL1 > VTL_NORM – Speaker1's formants F1–F4 deviate from the diagonal and are compensated by the warping]
Slide 60

Frequency Warping – VTLN vs. Lombard Effect

[Figure: formant frequencies F1–F4 of Speaker1 vs. the normalized speaker under LE – which warping to choose?]
Slide 61

Frequency Warping – VTLN vs. Lombard Effect

[Figure: same axes; a single warping factor giving a good approximation of the low formants (F1–F2) misses F4 – what to choose?]
Slide 62

Frequency Warping – Generalized Transform

[Figure: same axes; a warping giving a good approximation of the higher formants (F2–F4) misses F1 – what to choose?]
Slide 63

Frequency Warping – Generalized Transform

[Figure: generalized (non-linear) transform mapping all of Speaker1's formants F1–F4 to the normalized speaker; case VTL1 → VTL_NORM]
Slide 64

Frequency Warping – Evaluation: VTLN vs. Generalized Transform

VTLN:
                              Females                              Males
Set                        Neutral          LE                 Neutral        LE
# Digits                   2560             2560               1423           6303
Baseline                   4.3 (3.5–5.0)    33.6 (31.8–35.5)   2.2 (1.4–2.9)  22.9 (21.8–23.9)
Utterance-dependent VTLN   3.6 (2.9–4.3)    28.2 (26.4–29.9)   1.8 (1.1–2.4)  16.6 (15.7–17.6)
Speaker-dependent VTLN     4.0 (3.2–4.7)    27.7 (26.0–29.5)   1.8 (1.1–2.4)  17.4 (16.5–18.3)

Generalized transform:
                              Females                              Males
Set                        Neutral          LE                 Neutral        LE
# Digits                   2560             2560               1423           6303
Baseline bank              4.2 (3.4–5.0)    35.1 (33.3–37.0)   2.2 (1.4–2.9)  23.2 (22.1–24.2)
Warped bank                4.4 (3.6–5.2)    23.4 (21.8–25.0)   1.8 (1.1–2.4)  15.7 (14.8–16.6)
(WER in %; confidence intervals in parentheses)

The generalized transform better addresses LE-induced formant shifts
The formant-driven approach is less computationally demanding (no need for multiple alignment passes as in ML VTLN), but requires reliable formant tracking – a problem at low SNRs; the ML approach is more stable
Slide 65

LE Suppression in ASR – Two-Stage Recognizer (TSR)

[Diagram: speech signal → Neutral/LE classifier → routed to either a Neutral recognizer or an LE recognizer → estimated word sequence]

Tandem Neutral/LE Classifier – Neutral/LE Dedicated Recognizers
Improving ASR features for LE often results in a performance tradeoff on neutral speech
Idea – combine separate systems 'tuned' for neutral and LE speech, directed by a neutral/LE classifier
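The routing logic of the two-stage recognizer reduces to a few lines. All components in this sketch are hypothetical stubs; in the real system each would be a trained model:

```python
def two_stage_recognize(signal, classifier, neutral_asr, le_asr,
                        threshold=0.5):
    """Two-stage recognizer sketch: route the utterance to whichever
    recognizer matches the detected talking style.

    `classifier(signal)` returns Pr(LE); the recognizers return word
    sequences. All callables here are placeholders, not real models.
    """
    p_le = classifier(signal)
    recognizer = le_asr if p_le > threshold else neutral_asr
    return recognizer(signal)

# Toy illustration with stub components
words = two_stage_recognize(
    signal=[0.1, -0.2, 0.3],
    classifier=lambda s: 0.9,          # pretend the classifier says "LE"
    neutral_asr=lambda s: "neutral-path",
    le_asr=lambda s: "le-path",
)
print(words)  # le-path
```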
Slide 66

Two-Stage Recognizer (TSR) – Neutral/LE Classification

[Figure: short-term spectra of a female vowel /a/ in neutral and LE speech – magnitude (dB) vs. log frequency (Hz); the spectral slope flattens under LE]

Proposal of Neutral/LE Classifier
Search for a set of features providing good discriminability between neutral and LE speech
Requirements – speaker-, gender-, and phonetic-content-independent classification
The set of analyzed features is extended with the slope of the short-term spectrum
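The spectral-slope feature can be estimated, for instance, by a least-squares line fit of the dB magnitude spectrum against log2(frequency), restricted to the 60–1000 Hz band used by the classifier. A numpy sketch with assumed framing details (window type, frame length, sampling rate):

```python
import numpy as np

def spectral_slope_db_per_oct(frame, sr=8000, f_lo=60.0, f_hi=1000.0):
    """Spectral slope in dB/octave over [f_lo, f_hi] (illustrative)."""
    # Windowed magnitude spectrum of one analysis frame
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    band = (freqs >= f_lo) & (freqs <= f_hi)
    mag_db = 20.0 * np.log10(spectrum[band] + 1e-12)
    octaves = np.log2(freqs[band])           # octaves = log2 of frequency
    slope, _ = np.polyfit(octaves, mag_db, 1)
    return float(slope)                      # dB per octave
```

A spectrum falling at roughly −6 dB/octave (neutral-like) then yields a clearly more negative value than a flat (LE-like) spectrum.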
Slide 67

Two-Stage Recognizer (TSR) – Neutral/LE Classification

Mean Spectral Slopes in Voiced Male/Female Speech (0–8000 Hz band)

                 Neutral                                        LE
Set   # N    T (s)   Slope (dB/oct)         σ (dB/oct)   # LE   T (s)   Slope (dB/oct)         σ (dB/oct)
M     2587   618     -7.42 (-7.48; -7.36)   1.53         3532   1114    -5.32 (-5.37; -5.27)   1.55
F     5558   1544    -6.15 (-6.18; -6.12)   1.30         5030   1926    -3.91 (-3.96; -3.86)   1.77

Overlap of Neutral/LE Spectral Slope Distributions (%)

Set   0–8000 Hz   60–8000 Hz   60–5000 Hz   1k–5k Hz   0–1000 Hz   60–1000 Hz
M     26.00       28.13        29.47        100.00     27.81       27.96
F     26.20       28.95        16.76        100.00     25.75       22.18
M+F   28.06       30.48        29.49        100.00     27.54       26.00

Classification Feature Set
A feature set providing superior classification performance on the development data set was found: SNR, spectral slope (60–1000 Hz), F0, ∆F0
Training GMM and multi-layer perceptron (MLP) classifiers
Slide 68

Two-Stage Recognizer (TSR) – Neutral/LE Classification

Binary Classification Task
Decide H1 (LE) if Pr(H1|o) > Pr(H0|o), otherwise H0 (neutral)

GMM Classifier – two models, GMM_N and GMM_LE, scored on the acoustic observation o (classification feature vector), yielding Pr(N) and Pr(LE):
  p(o|λ) = Σ_{q=1..M} w_q · N(o; μ_q, Σ_q)
  N(o; μ_i, Σ_i) = (2π)^(−n/2) |Σ_i|^(−1/2) · exp(−½ (o − μ_i)ᵀ Σ_i⁻¹ (o − μ_i))

MLP Classifier – multi-layer perceptron with sigmoid hidden units and a softmax output layer mapping the classification feature vector to the posteriors Pr(N), Pr(LE)
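A minimal numpy sketch of the GMM route, using diagonal covariances and toy untrained parameters (all values illustrative, not the thesis' trained models):

```python
import numpy as np

def gaussian_logpdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var)
                         + (x - mean) ** 2 / var, axis=-1)

def gmm_logpdf(x, weights, means, vars_):
    """log p(x | GMM) = log sum_q w_q N(x; mu_q, Sigma_q)."""
    comp = np.log(weights) + np.array(
        [gaussian_logpdf(x, m, v) for m, v in zip(means, vars_)])
    m = comp.max()
    return m + np.log(np.exp(comp - m).sum())   # log-sum-exp

def classify(x, gmm_n, gmm_le, prior_le=0.5):
    """Return Pr(LE | x) by the Bayes rule over the two GMMs."""
    ll_n = gmm_logpdf(x, *gmm_n) + np.log(1 - prior_le)
    ll_le = gmm_logpdf(x, *gmm_le) + np.log(prior_le)
    return 1.0 / (1.0 + np.exp(ll_n - ll_le))

# Toy 1-D models: neutral slope near -7 dB/oct, LE near -4 dB/oct
gmm_n = ([1.0], np.array([[-7.0]]), np.array([[1.5]]))
gmm_le = ([1.0], np.array([[-4.0]]), np.array([[1.5]]))
print(classify(np.array([-4.5]), gmm_n, gmm_le) > 0.5)  # True
```

An observation with a flat slope (−4.5 dB/oct) scores higher under the LE model, so the utterance would be routed to the LE recognizer.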
Slide 69

Two-Stage Recognizer (TSR) – Neutral/LE Classification – Feature Set

[Figure: histograms of development neutral/LE samples (M+F) with GMM PDFs and ANN (MLP) posteriors Pr(N), Pr(LE) overlaid, for SNR (dB) and spectral slope (dB/oct)]
Slide 70

Two-Stage Recognizer (TSR) – Neutral/LE Classification – Feature Set

[Figure: the same histograms, GMM PDFs, and ANN posteriors for the two F0-derived features (Hz)]
Slide 71

Two-Stage Recognizer (TSR) – Neutral/LE Classification – Performance

Classification Data Sets

Set             Devel   Open
# Utterances    2472    1371
T_Utter (s)     4.10    4.01
σ_T (s)         1.60    1.50

Classification Performance
UER – Utterance Error Rate – ratio of incorrectly classified utterances to all utterances

GMM:
Set            Train             CV               Open
# Utterances   2202              270              1371
UER (%)        9.9 (8.7–11.1)    5.6 (2.8–8.3)    1.6 (0.9–2.3)

MLP:
Set            Devel FM         Open FM          Devel DM         Open DM
# Utterances   2472             1371             2472             1371
UER (%)        6.6 (5.6–7.6)    2.5 (1.7–3.3)    8.1 (7.0–9.2)    2.8 (1.9–3.6)
Slide 72

Two-Stage Recognizer (TSR) – Overall Performance

Set               Real – neutral    Real – LE
# Female digits   1439              1837
PLP               4.3 (3.3–5.4)     48.1 (45.8–50.4)
RFCC–LPC          6.5 (5.2–7.7)     28.3 (26.2–30.4)
MLP TSR           4.2 (3.2–5.3)     28.4 (26.4–30.5)
FM–GMLC TSR       4.4 (3.3–5.4)     28.4 (26.4–30.5)
DM–GMLC TSR       4.4 (3.3–5.4)     28.4 (26.3–30.4)
(WER in %; confidence intervals in parentheses)

Discrete recognizers are good on either neutral or LE speech; the TSR (speech signal → neutral/LE classifier → neutral/LE recognizer → estimated word sequence) retains near-baseline performance on neutral speech while matching the LE-robust front-end on LE speech
Slide 73

LE Suppression in ASR – Comparison of Proposed Methods

[Figure: WER (%) bar chart comparing the proposed techniques for LE-robust ASR – model adaptation to LE (SI and SD), voice conversion (CLE), modified FB (RFCC-LPC), VTLN recognition (utterance-dependent warp), formant warping, and MLP TSR – against the neutral and LE baselines]
Slide 74

Thank You for Your Attention!