Speech Analysis and Cognition Using Category-Dependent Features in a Model of the Central Auditory System
Woojay Jeon
Research Advisor: Fred Juang
School of Electrical and Computer Engineering
Georgia Institute of Technology
October 8, 2006
Synopsis of Project
One of the very few attempts, if any, to address auditory modeling beyond the periphery (ear, cochlea, even auditory nerve fibers) for ASR;
Implemented a model (periphery + 3D cortical model) to calculate cortical response to stimuli;
Investigated cortical representations in ASR; conducted a comprehensive comparative study to understand robustness in auditory representations;
Developed a methodology to analyze robustness based on matched filter theory;
Spawned a new development based on category dependent feature selection and hierarchical pattern recognition.
Matched Filtering
Cortical response:
  r(x, s, φ) = ∫_{R(x,s)} p(y) · w(y; x, s, φ) dy
where p(y) is the (auditory) power spectrum, w(y; x, s, φ) is the response area at tonotopic location x, scale s, and phase φ, r(·) is the cortical response, and R(x, s) is the non-zero frequency range of w(y; x, s, φ).
The Cauchy–Schwarz inequality tells us that r(·)² will be maximum when p(y) is proportional to w(y; x, s, φ) over R(x, s).
If R(x, s) includes enough spectral peaks, we can also use the spectral envelope v(y) in place of p(y).
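The matched-filter view above can be illustrated numerically. The sketch below uses a toy spectrum and a hypothetical Gaussian response area (not the dissertation's cortical model) to check that a spectrum proportional to w(y) maximizes the squared response, exactly as the Cauchy–Schwarz bound predicts:

```python
import numpy as np

# Toy sketch: a "neuron" with response area w(y) acts as a matched filter
# on the auditory power spectrum p(y). By Cauchy-Schwarz,
# r^2 = (sum p*w)^2 <= (sum p^2)(sum w^2), with equality when p is
# proportional to w over the neuron's frequency range.

y = np.linspace(0.0, 1.0, 256)               # normalized frequency axis
w = np.exp(-((y - 0.5) / 0.05) ** 2)         # hypothetical response area

def response(p, w):
    """Cortical response r = integral of p(y) * w(y) dy (discretized)."""
    return np.sum(p * w)

# Matched spectrum: proportional to w. Mismatched: same shape, shifted peak.
p_matched = 2.0 * w
p_shifted = 2.0 * np.exp(-((y - 0.3) / 0.05) ** 2)

r_matched = response(p_matched, w)
r_shifted = response(p_shifted, w)
print(r_matched > r_shifted)  # True: the matched spectrum maximizes r^2
```

Here `r_matched` squared attains the Cauchy–Schwarz upper bound exactly, while the shifted spectrum of equal energy falls short of it.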
Signal-Respondent Neurons
[Figure: response areas of four signal-respondent neurons, marked (a)–(d), on a frequency (0.25–4.0 kHz) vs. scale (0.5–4.0 cyc/oct) plane, with the corresponding spectra over 0.25–4.0 kHz shown in panels (a)–(d); all points differ in phase.]
Noise-Respondent Neurons
[Figure: response areas of two noise-respondent neurons, marked (a) and (b), on the same frequency (0.25–4.0 kHz) vs. scale (0.5–4.0 cyc/oct) plane, with their spectra shown in panels (a) and (b); all points differ in phase.]
Noise Robustness
Assuming a conventional Fourier power spectrum with stationary white noise as the distortion, it can be shown mathematically that
  S_r,signal ≥ S_p ≥ S_r,noise
where S_r,signal is the SNR of a signal-respondent neuron, S_p is the SNR of the auditory spectrum in R(x, s), and S_r,noise is the SNR of a noise-respondent neuron (whose response area lies outside R(x, s)). Modeling inhibition can increase S_r,signal even more.
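The SNR ordering can be illustrated with a toy calculation (invented spectrum, binary response areas, and a fixed per-bin noise power standing in for the full derivation): a neuron weighted onto the spectral peaks beats the plain spectrum averaged over the same range, which beats a neuron weighted onto the valleys.

```python
import numpy as np

# Illustrative check with made-up numbers, not the dissertation's derivation.
# Under additive white noise of per-bin power n0, define the output SNR of a
# weighting w(y) as the signal power it passes over the noise power it passes.

n0 = 1.0                                        # white-noise power per bin
p = np.array([9.0, 1.0, 9.0, 1.0, 9.0, 1.0])    # peaky "auditory spectrum"

def output_snr(p, w, n0):
    """Signal power passed by w divided by noise power passed by w."""
    return np.sum(p * w) / (n0 * np.sum(w))

w_signal = np.array([1.0, 0.0, 1.0, 0.0, 1.0, 0.0])  # peaks only
w_noise  = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0])  # valleys only
w_flat   = np.ones(6)                                # whole range R

s_signal = output_snr(p, w_signal, n0)   # signal-respondent neuron
s_spec   = output_snr(p, w_flat, n0)     # plain spectrum over R
s_noise  = output_snr(p, w_noise, n0)    # noise-respondent neuron
print(s_signal > s_spec > s_noise)       # True
```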
Noise Robustness Experiments
S_r(A_i): overall SNR of the signal-respondent neurons of phoneme w_i
S_r(U): overall SNR of the entire cortical response
S_p: overall SNR of the auditory spectrum
[Figure: four panels plotting SNR ratios against noise level (20, 15, 10, 5, 0 dB) for the vowel /iy/, fricative /dh/, affricate /jh/, and plosive /p/.]
Category-Dependent Feature Selection
LVF: Low-Variance Filter; HAF: High-Activation Filter; NR: Neuron Reduction (via clustering and remapping); PCA: Principal Component Analysis
[Diagram: the speech signal passes through the early auditory model to give the auditory spectrum, then through the central auditory model (A1) to give the cortical response; M parallel chains LVF_m → HAF_m → NR_m → PCA_m (m = 1, …, M) each produce a category-dependent feature vector x_m.]
Hierarchical Classification
Single-Layer Classifier: uses standard Bayesian decision theory to classify a test observation into one of N classes, with class-wise discriminants that estimate the a posteriori probabilities.
Hierarchical Classifier (Two-Layer Classifier): a two-stage process that first classifies a test observation into one of M categories, then into one of |C_n| classes within the chosen category C_n.
Single layer:
  ŵ = w_j, j = arg max_j g_j(x), where g_j(x) ≈ p(w_j | x)
Two layer:
  Ĉ = C_m, m = arg max_m f_m(x), where f_m(x) ≈ p(C_m | x)
  ŵ = w_j, j = arg max_{j : w_j ∈ C_n} h_{j,n}(x), where h_{j,n}(x) ≈ p(w_j | x, C_n)
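A minimal sketch of the two decision rules, with invented class posteriors in place of the HMM-based discriminants the real system uses:

```python
# Toy sketch (names and numbers are invented).
# Single layer: w_hat = argmax_j g_j(x), g_j(x) ~ p(w_j | x).
# Two layer:    C_hat = argmax_m f_m(x), then w_hat over classes in C_hat.

classes = ["iy", "eh", "p", "t"]
categories = {"vowel": ["iy", "eh"], "plosive": ["p", "t"]}

# Hypothetical class posteriors g_j(x) for one test observation x:
g = {"iy": 0.30, "eh": 0.25, "p": 0.35, "t": 0.10}

# Single-layer decision.
w_single = max(classes, key=lambda j: g[j])

# Two-layer decision: category posterior as the sum of its class posteriors,
# then a renormalized within-category posterior h_jn(x) ~ p(w_j | x, C_n).
f = {m: sum(g[j] for j in members) for m, members in categories.items()}
c_hat = max(f, key=f.get)
h = {j: g[j] / f[c_hat] for j in categories[c_hat]}
w_two = max(h, key=h.get)

print(w_single, c_hat, w_two)  # p vowel iy
```

Note how the two rules can disagree: the single layer picks the highest raw posterior, while the hierarchy first commits to the stronger category.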
Searching for a Categorization
[Diagram: the phoneme class-wise variances produce N ordered sets of phonemes; CART-style splitting turns each ordered set (1, …, N) into a phoneme tree, from which a list of candidate categorizations is obtained; a performance-based search over the combined lists selects the final categorization.]
The phoneme-wise variances are arranged into N orderings (each ordering with a different “seed” phoneme).
For each ordering, a CART-style splitting routine is applied to create a “phoneme tree,” from which a list of candidate categorizations is obtained.
We search for the categorization with the best hierarchical classification performance over the training data (using initial models).
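The splitting step can be sketched as follows, with invented variances and a simplified largest-gap criterion standing in for the actual CART-style routine; each level of the resulting tree is one candidate categorization:

```python
# Simplified sketch (invented variances; the dissertation's actual splitting
# criterion is not reproduced here). Phonemes are ordered by class-wise
# variance, the ordering is split at the largest adjacent gap, and each
# level of the resulting tree is one candidate categorization.

variances = {"iy": 0.9, "eh": 0.8, "p": 0.2, "t": 0.3, "s": 0.5}

def split(ordering):
    """Split an ordered phoneme list at the largest variance gap."""
    if len(ordering) < 2:
        return [ordering]
    gaps = [variances[ordering[i + 1]] - variances[ordering[i]]
            for i in range(len(ordering) - 1)]
    cut = gaps.index(max(gaps)) + 1
    return [ordering[:cut], ordering[cut:]]

ordering = sorted(variances, key=variances.get)  # one "seed" ordering
level1 = split(ordering)                         # first candidate categorization
level2 = [part for piece in level1 for part in split(piece)]
candidates = [level1, level2]
print(level1)
```

In the full method, each candidate categorization would then be scored by its hierarchical classification performance on the training data.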
Model Training
[Diagram: training data and category-independent features give initial class models (HMM) via ML estimation (Baum–Welch); applying uniform weights yields initial category models (mixed HMM), which MCE training refines into refined category models (mixed HMM); under the resulting categorization, category-dependent features train category-dependent class models (HMM) via ML estimation (Baum–Welch), which within-category MCE training refines into refined category-dependent class models (HMM).]
CI features are used to construct category models, which are refined with MCE training:
  f_m(x) = (1/|C_m|) Σ_{i : w_i ∈ C_m} g_i(x) ≈ p(C_m | x)
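The "apply uniform weights" step can be sketched with toy likelihoods in place of real HMM scores: a category model is simply a uniformly weighted mixture of the class models it contains.

```python
# Sketch with invented numbers: f_m(x) = (1/|C_m|) * sum_{w_i in C_m} g_i(x).

category = {"vowel": ["iy", "eh"], "plosive": ["p", "t"]}
g = {"iy": 0.6, "eh": 0.2, "p": 0.9, "t": 0.1}   # class likelihoods g_i(x)

f = {m: sum(g[i] for i in members) / len(members)
     for m, members in category.items()}
print(f)  # {'vowel': 0.4, 'plosive': 0.5}
```

MCE training then replaces these uniform weights with discriminatively refined category models.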
Hierarchical Classification
[Diagram: test data undergo category-independent feature selection; the category models (mixed HMM) produce the category decision, then category-dependent feature selection and the within-category class models (HMM) produce the within-category class decision.]
Category decision:
  Ĉ_n = arg max_{C_m} f_m(X; Θ, α)
Within-category class decision:
  ŵ_i = arg max_{j : w_j ∈ C_n} h_{j,n}(X_n; Θ)
Phoneme Categorization
[Figure: the phoneme categorization obtained by the search.]
Phoneme Classification Results
Classification rates (%) for clean speech in the TIMIT database (48 phonemes)
Classification rates (%) for varying SNR, features, and classifier configurations (*74.51 when the 48 phonemes are mapped down to 39 according to convention)
SL: Single-Layer Classifier; CI: Category-Independent Features; CD: Category-Dependent Features; TL: Two-Layer (Hierarchical) Classifier (results produced after MCE training)
Generalization of the MCE Method
Qiang Fu
Research Advisor: Fred Juang
School of Electrical and Computer Engineering
Georgia Institute of Technology
October 8, 2006
Synopsis
Excellent detector results (6-class, 14-class, 44-class) reported; use of detector results as "independent" information for rescoring.
Generalization of the minimum error principle to large-vocabulary continuous speech recognition:
– Definition of competing events
– Selection of training units (state, phone, …)
– Use of word graphs
– Unequal error weights
Rescoring Using MVE Detectors
We investigate the effect of combining the conventional ASR paradigm with the phonetic class detectors using MVE training.
We keep the segmentation information from the Viterbi decoder, which may affect the final improvement.
The rescoring algorithm can be flexible in order to fit different tasks.
Minimum Verification Error
Assume there are M classes and K training tokens. A token labeled in the ith class may generate one type I (miss) error and M−1 type II (false alarm) errors. Hence, the token loss combining these two types of error is

  l_total^i(o_k) = k_I · l(d_I^i(o_k)) + Σ_{j≠i} k_II^j · l(d_II^{i,j}(o_k))

with misverification measures

  d_I^i(o) = −log g(o | λ_tgt^i) + log g(o | λ_anti^i)
  d_II^{i,j}(o) = log g(o | λ_tgt^j) − log g(o | λ_anti^j), j ≠ i,

and the overall performance objective becomes

  L = (1/K) Σ_{k=1}^{K} Σ_{i=1}^{M} 1(o_k ∈ class i) · l_total^i(o_k).

In the above, 1 is the indicator function, l is a sigmoid function, and k_I and k_II are penalty weights for miss and false-alarm errors. A descent algorithm is then applied to minimize the overall error objective.
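A toy evaluation of the MVE token loss (the misverification scores, sigmoid slope, and penalty weights below are all invented): one type I term for the true-class detector plus M−1 type II terms for the competing detectors.

```python
import math

# Sketch with invented numbers, not real detector scores.

def sigmoid(d, alpha=1.0):
    """Smooth 0/1 error count l(d)."""
    return 1.0 / (1.0 + math.exp(-alpha * d))

# Hypothetical misverification measures: d_I > 0 means the true detector
# rejected its own token; d_II > 0 means a wrong detector accepted it.
d_I = -0.8                       # true-class detector comfortably accepts
d_II = [-1.5, 0.4]               # two competitors; one soft false alarm
k_I, k_II = 1.0, 0.5             # penalty weights for the two error types

loss = k_I * sigmoid(d_I) + sum(k_II * sigmoid(d) for d in d_II)
print(round(loss, 3))  # 0.701
```

A descent step would then push `d_I` further negative and the positive `d_II` entries negative, reducing this loss.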
Rescoring Paradigm
[Diagram: speech signals feed a conventional decoder and M MVE detectors (MVE Detector 1, 2, …, M); the decoder's decoding scores and rescoring candidates, together with the detector scores, enter the rescoring algorithm, which applies Neyman–Pearson decision criteria and thresholds.]
Rescoring Methods (I)
Suppose there are M classes of sub-word units. Hence there are M sets of detectors accordingly, each of which consists of a target model and an anti-model. For a segment that is decoded as the ith class with log likelihood S_decode^(i), its jth (j = 1, 2, …, M) detector scores are S_tgt^(j) and S_anti^(j), respectively. The likelihood ratio for the jth detector is ratio^(j) = S_tgt^(j) − S_anti^(j). We call the score for the test segment belonging to class i after combination S_new^(i).

Method 1: Naive-Adding (NA)
We simply add the decoder score and the detector score together:

  S_new^(i) = S_decode^(i) − S_anti^(i) + ratio^(i)

The reason for subtracting the anti-model score is to scale the decoding score into a dynamic range relatively close to that of the likelihood ratio. This procedure is also applied in the following two methods.
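A minimal sketch of Method 1 with invented log scores:

```python
# Naive-adding: S_new^(i) = S_decode^(i) - S_anti^(i) + ratio^(i), where
# ratio^(i) = S_tgt^(i) - S_anti^(i). Subtracting the anti-model score
# brings the decoder score into a dynamic range comparable with the
# detector likelihood ratio before the two are added.

def naive_add(s_decode, s_tgt, s_anti):
    ratio = s_tgt - s_anti              # detector log-likelihood ratio
    return s_decode - s_anti + ratio

print(naive_add(s_decode=-120.0, s_tgt=-40.0, s_anti=-55.0))  # -50.0
```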
Rescoring Methods (II)
Method 2: Competitive Rescoring (CR)
We add the decoder score and a "competitive" score, which is a "distance measure" between the claimed class and its competitors:

  S_C^(i) = ratio^(i) − (1/η) log{ (1/(M−1)) Σ_{j≠i} exp(η · ratio^(j)) }
  S_new^(i) = S_decode^(i) − S_anti^(i) + S_C^(i)

Method 3: Remodeled Posterior Probability (RPP)
We compute the "remodeled posterior probability"

  S_new^(i) = exp(S_decode^(i) − S_anti^(i) + ratio^(i)) / Σ_{j=1}^{M} exp(S_decode^(i) − S_anti^(j) + ratio^(j))
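Both methods can be sketched with invented detector scores. For simplicity this sketch uses a single anti-model score for all classes in the RPP denominator, so it reduces to a softmax over the per-class combined scores; the η exponent and all numbers are illustrative.

```python
import math

# Toy sketch; ratio[j] is the jth detector's log-likelihood ratio, and
# s_decode / s_anti refer to the claimed class i.

ratio = [2.0, 0.5, -1.0]        # detectors for classes 0..2
i, M, eta = 0, 3, 1.0
s_decode, s_anti = -120.0, -55.0

# Method 2 (CR): distance between the claimed class and its competitors.
competitor = math.log(sum(math.exp(eta * ratio[j])
                          for j in range(M) if j != i) / (M - 1)) / eta
s_c = ratio[i] - competitor
s_new_cr = s_decode - s_anti + s_c

# Method 3 (RPP), simplified: softmax of the per-class combined scores.
combined = [s_decode - s_anti + r for r in ratio]
z = [math.exp(c - max(combined)) for c in combined]   # stable softmax
s_new_rpp = z[i] / sum(z)
print(round(s_new_rpp, 3))  # 0.786
```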
Experiment Setup
• Experiments are conducted on the TIMIT database (3696 training utterances and 1344 test utterances; there are 119,580 training tokens for the MVE detectors) using three-state HMMs.
• Rescoring candidates are generated using HVite. The decoder model is trained by the Maximum Likelihood (ML) method, and the detectors are trained by MVE. Performance is examined on 6-class (Rabiner and Juang, 1993), 14-class (Deller et al., 1999), and 48-class (Lee and Hon, 1989) broad phonetic categories, respectively.
• The models for both the decoder and the detectors are trained on 39-dimensional MFCC features (12 MFCC + 12 delta + 12 acceleration + 3 log energy).
Rescoring Performance

Phoneme class | Acc (%)     | Method 1 (NA) | Method 2 (CR) | Method 3 (RPP)
6-class       | Baseline    | 75.44         | 75.44         | 75.44
              | Upper bound | 80.84         | 80.84         | 80.84
              | Rescored    | 76.36         | 76.38         | 80.00
              | Relative    | 17.04         | 17.40         | 80.44
14-class      | Baseline    | 63.61         | 63.61         | 63.61
              | Upper bound | 70.85         | 70.85         | 70.85
              | Rescored    | 65.38         | 67.27         | 68.45
              | Relative    | 24.45         | 50.55         | 61.88
48-class      | Baseline    | 55.33         | 55.33         | 55.33
              | Upper bound | 62.08         | 62.08         | 62.08
              | Rescored    | 55.04         | 55.61         | 55.91
              | Relative    | -4.30         | 4.15          | 8.59

(48-class: need to perform phone- or word-level rescoring)
Conclusions and Future Work
Three different rescoring methods are introduced, and the experimental results show that creating a pseudo-phone graph and re-computing the posterior probability achieves the best performance enhancement;
MVE-trained detectors show promising results in helping conventional ASR techniques. The detectors can be optimized with respect to features or attributes (e.g., features representing articulatory knowledge), and used for re-ranking the decoded candidates;
Bottom-up event detection and information fusion will be conducted on continuous speech signals in the future.
MCE Generalization
MCE criterion formulation:
1. Define the performance objective and the corresponding task evaluation measure;
2. Specify the target event (i.e., the correct label), competing events (i.e., the incorrect hypotheses from the recognizer), and the corresponding models;
3. Construct the objective function and set its hyper-parameters;
4. Choose a suitable optimization method to update the parameters.
In this presentation, only the first step, which is also the most fundamental, is discussed due to limited space. This work is the first part of an extensive generalization of the MCE training criterion.
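The ingredients these four steps refer to build on the classic MCE loss, sketched here with toy discriminant scores (the scores, η, and α are invented): a misclassification measure comparing the target score with a soft maximum over competitors, embedded in a sigmoid loss.

```python
import math

# Toy sketch of the classic MCE objective, not recognizer output.

g = [3.0, 2.5, 0.5]            # discriminant scores; class 0 is the label
i, eta, alpha = 0, 2.0, 1.0

competitors = [s for j, s in enumerate(g) if j != i]
soft_max = math.log(sum(math.exp(eta * s) for s in competitors)
                    / len(competitors)) / eta
d = -g[i] + soft_max           # > 0 would indicate a likely error
loss = 1.0 / (1.0 + math.exp(-alpha * d))
print(d < 0)  # True: the target event outscores its competitors
```

Generalizing this criterion amounts to re-choosing the target event, the competing events, and the weighting of errors for each training level (state, phone, word).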
Strict Boundary and Relaxed Boundary
[Figure: two word-graph panels between "start" and "end", each showing the target words (A, A) against competing words (A, B, A, …) for a labeled word A; the panels contrast strict and relaxed boundary definitions for selecting competing events.]
Experimental Setup
Experiments are conducted on the WSJ0 database (7077 training utterances and 330 test utterances);
All models are three-state HMMs with 8 Gaussian mixtures per state. There are 7385 physical models, 19075 logical models, and 2329 tied states in total;
The models are constructed on 39-dimensional MFCC (12 MFCC + 12 delta + 12 acceleration + 3 log energy) feature vectors;
The baseline recognizer basically follows the large-vocabulary continuous speech recognition recipe using HTK;
We investigated three cases of maximizing the GPP at different training levels (word, phone, state).
Results
Table 1: Word Error Rate (WER) and Sentence Error Rate (SER) for WSJ0-eval using different training levels

Training level | WER (%) | SER (%)
Baseline       | 8.41    | 57.88
Word-level     | 8.05    | 56.97
Phone-level    | 7.96    | 56.67
State-level    | 8.02    | 56.97
Conclusion & Future Work
We generalize the criterion for minimum classification error (MCE) training and investigate its impact on recognition performance. This work is the first part of an extensive generalization of MCE training.
The experiments are conducted within the framework of "maximizing posterior probability". The impact of different training levels is investigated, and the phone level yields the best performance;
Further investigation of various tasks based on this generalized framework is in progress.