Pattern Recognition in Acoustic Signal Processing
Transcript of “Pattern Recognition in Acoustic Signal Processing”
Mark Hasegawa-Johnson
These slides: http://www.isle.uiuc.edu/slides/2009/Hasegawa-Johnson09mlss.pdf
Machine Learning Summer School, June 11, 2009
Outline
1 Why Use Pattern Recognition in Acoustic Signal Processing?
2 Four Criteria for Choosing a Pattern Recognizer
3 Discriminative Methods
  Hypothesis Space: Universal Approximators
  Training Criteria: Differentiable Error Metric
  Complications: Label Dynamics, Data Sparsity
4 Bayesian Methods
  Hypothesis Space: Latent Variables
  Training Criteria: Maximum Likelihood, MAP, MaxEnt
  Complications: HMM Regression, Switching Kalman Smoothers
5 Hybrid Methods
  Discriminative Hidden Nodes, Bayesian Inference
  Discriminative Feature Selection for Acoustic Event Detection
  Bayesian Re-Estimation of the “Discriminative” Hidden Nodes
6 Conclusions
Why Use Pattern Recognition in Acoustic Signal Processing?
The Scientific Method
y = h(x)
Hypothesize-Measure-Test
1 Based on knowledge of the physical situation, form:
  1 a hypothesis
  2 a null hypothesis
2 Collect data: $(x_t, y_t)$, $1 \le t \le T$.
3 Test the hypothesis: measure P(data | null hypothesis)
The Pattern Recognition Method
y = h(x)
Hypothesize-Measure-Learn-Test
1 Form an infinite set of hypotheses (called the “hypothesis space”), usually a parameterized universal approximator.
2 Collect:
  1 training data: $(x_t, y_t)$, $1 \le t \le T$
  2 testing data: $(x_t, y_t)$, $T + 1 \le t \le T + N$
3 Train the hypothesis: maximize P(hypothesis | training data)
4 Test the hypothesis: measure P(testing data | hypothesis)
Example: PR in the Scientific Method
Four Criteria for Choosing a Pattern Recognizer
Criteria for Choosing a Pattern Recognizer
1 Structure of the Model
  1 Discriminative Training: all parameters in the model can be simultaneously adjusted to minimize a global error metric
  2 Bayesian Training: components must be separately trained, then combined without blowing up
2 Size of the Training Database
  1 Empirical Risk Minimization: training database includes 10,000 independent trials; train the model to minimize training-database error
  2 Structural Risk Minimization: training database smaller than 1000 trials; train the model to minimize
$$P(\text{Error}) \le (\text{Training Corpus Error}) + \lambda\,\frac{(\text{Model Complexity})}{(\text{Training Corpus Size})}$$
3 Dynamic State
  1 $y = h(x)$ has no hidden state (classification, regression)
  2 $y = h(x)$ has hidden state (recognition, tracking)
4 Function Range
  1 $y = h(x)$ is an integer (classification, recognition)
  2 $y = h(x)$ is a real-valued vector (regression, tracking)
Discriminative Methods
Discriminative Training—Gradient Descent Methods
1 Choose a hypothesis space (a universal approximator)
2 Choose a differentiable error metric
3 Apply the Chain Rule
Hypothesis Space: Universal Approximators
Universal Approximators
Universal Approximator: Definition
A parameterized function space $h_\Theta(x)$, with parameter vector $\Theta \in \Re^{\alpha K}$, is called a universal approximator if, for any bounded $h(x)$ with finite domain,
$$\lim_{K\to\infty}\,\min_\Theta \|h_\Theta(x) - h(x)\| = 0$$
Example: Sigmoidal Neural Network
1 Sigmoidal Neural Network ($\Theta = \{c_1, \ldots, c_K, w_1, \ldots, w_K\}$)
$$h_\Theta(x) = \sum_{k=1}^{K} c_k \frac{1}{1 + e^{-x^T w_k}}$$
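To make the hypothesis space concrete, here is a minimal NumPy sketch (not from the original slides) of this sigmoidal network: the output is a weighted sum of K sigmoid units. The shapes, random parameters, and toy inputs are invented for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def h_theta(X, W, c):
    """Sigmoidal network: h(x) = sum_k c_k * sigmoid(x^T w_k).

    X: (T, D) inputs; W: (D, K) unit weights; c: (K,) output weights.
    Returns (T,) outputs.
    """
    return sigmoid(X @ W) @ c

# Toy demonstration (hypothetical data, not from the talk):
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # T=5 samples, D=3 features
W = rng.normal(size=(3, 10))  # K=10 sigmoid units
c = rng.normal(size=10)
print(h_theta(X, W, c))       # one real-valued output per sample
```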
More Examples: Universal Approximators
1 Sigmoidal Neural Network ($\Theta = \{c_1, \ldots, c_K, w_1, \ldots, w_K\}$)
2 Mixture Gaussian ($\Theta = \{c_1, \ldots, c_K, \mu_1, \ldots, \mu_K, \Sigma_1, \ldots, \Sigma_K\}$)
$$h_\Theta(x) = \sum_{k=1}^{K} c_k e^{-\frac{1}{2}(x-\mu_k)^T \Sigma_k^{-1}(x-\mu_k)}$$
3 Classification and Regression Tree (CART), K Nearest Neighbors (KNN), etc. ($\Theta = [b_1, \ldots, b_K, w_1, \ldots, w_K, R_1, \ldots, R_K]$)
$$h_\Theta(x) = x^T w_k + b_k \quad \text{if } x \in R_k$$
$R_k$ is a region with piece-wise linear boundaries.
Training Criteria: Differentiable Error Metric
Differentiable Error Metric
Minkowski Norm Error Metrics
$$E_L = \frac{1}{T}\sum_{t=1}^{T} \|h_\Theta(x_t) - y_t\|_L^L$$
The “Best” Metric: The Zero Norm
$$\|h_\Theta(x_t) - y_t\|_0 \triangleq \begin{cases} 0 & y_t = h_\Theta(x_t) \\ 1 & \text{otherwise} \end{cases}$$
Problem: if $E_0 \ne 0$, what do we do next?
Differentiable Error Metrics: L = 1, L = 2
1 The One Norm (Manhattan Distance, L = 1):
$$\frac{\partial E_1}{\partial h_\Theta(x_t)} = \mathrm{sign}\left(h_\Theta(x_t) - y_t\right)$$
2 The Two Norm (Euclidean Distance, L = 2):
$$\frac{\partial E_2}{\partial h_\Theta(x_t)} = h_\Theta(x_t) - y_t$$
Apply the Chain Rule
The Error Back-Propagation Algorithm
1 Initialize: Choose some initial parameter set $\Theta^{(0)}$
2 Iterate: For $i = 1, \ldots$ until $E(\Theta)$ stops changing:
$$\Theta^{(i)} = \Theta^{(i-1)} - \eta \nabla_\Theta E$$
$$\nabla_\Theta E \triangleq \sum_{t=1}^{T} \left(\frac{\partial E}{\partial h_\Theta(x_t)}\right)\left(\nabla_\Theta h_\Theta(x_t)\right)$$
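A minimal sketch (not from the slides) of this recipe applied to the sigmoidal network above, using the two-norm error and analytic chain-rule gradients; the learning rate, iteration count, and toy data are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_backprop(X, y, K=10, eta=0.1, iters=2000, seed=0):
    """Gradient descent on E_2 for h(x) = sum_k c_k * sigmoid(x^T w_k)."""
    rng = np.random.default_rng(seed)
    T, D = X.shape
    W = rng.normal(size=(D, K))
    c = rng.normal(size=K)
    for _ in range(iters):
        A = sigmoid(X @ W)          # (T, K) hidden activations
        h = A @ c                   # (T,) network outputs
        dE_dh = (h - y) / T         # dE2/dh per sample (up to a constant)
        # Chain rule: gradients w.r.t. output weights c and unit weights W
        grad_c = A.T @ dE_dh
        grad_W = X.T @ (np.outer(dE_dh, c) * A * (1 - A))
        c -= eta * grad_c
        W -= eta * grad_W
    return W, c

# Toy usage (hypothetical data): fit a 1-D Gaussian bump
X = np.linspace(-3, 3, 50).reshape(-1, 1)
y = np.exp(-X[:, 0] ** 2)
W, c = train_backprop(X, y)
```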
Complications: Label Dynamics, Data Sparsity
Wrinkle #1: Recognition, Tracking
A Recursive Neural Net is a neural net with one or more hidden state variables:
$$h_\Theta(x_t) = \sum_{k=1}^{K} c_k \frac{1}{1 + e^{-[x_t^T,\, h_\Theta(x_{t-1})]\, w_k}}$$
Training is performed using Back-Propagation Through Time:
$$\frac{\partial E_2}{\partial h_\Theta(x_t)} = (h_\Theta(x_t) - y_t) + (h_\Theta(x_{t+1}) - y_{t+1})\left(\frac{\partial h_\Theta(x_{t+1})}{\partial h_\Theta(x_t)}\right) + \ldots$$
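A minimal sketch of the forward pass of such a recursive net, assuming a single scalar output that is appended to the next input frame; all shapes and parameters are invented, and back-propagation through time is omitted for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_forward(X, W, c, h0=0.0):
    """Forward pass: the previous output h(x_{t-1}) is appended to x_t.

    X: (T, D) input sequence; W: (D+1, K); c: (K,). Returns (T,) outputs.
    """
    T = X.shape[0]
    h = np.zeros(T)
    prev = h0
    for t in range(T):
        z = np.concatenate([X[t], [prev]])  # [x_t^T, h(x_{t-1})]
        h[t] = sigmoid(z @ W) @ c
        prev = h[t]
    return h

# Hypothetical usage on a random sequence:
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 4))
W = rng.normal(size=(5, 8)) * 0.5
c = rng.normal(size=8) * 0.5
print(rnn_forward(X, W, c))
```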
Recursive Neural Nets: Example Application
Task Description
Goal of the RNN: Detect pitch accents (syllables that the talker has marked as “prominent” by some type of F0 movement)
In the example figure:
Blue = $x_t$ (F0 = RNN input)
Yellow = $y_t$ (pitch accent = target RNN output)
Pink = $f(x_t)$ (RNN-estimated pitch accent probability)
Example Results
Wrinkle #2: Small Training Corpus
Test Corpus Error Bounds based on the Central Limit Theorem
$$P(\text{Error}) \le E(X, Y, \Theta) + G(\Theta)$$
Example Bounds
Minimum Description Length: $K(\Theta)$ describes $h_\Theta(x)$ as a binary program written in signed bits; $G(\Theta)$ is its Kolmogorov complexity:
$$G(\Theta) \propto \|K(\Theta)\|_0$$
Support Vector Machines: $\psi(\Theta)$ describes $h_\Theta(x)$ as a linear classifier in an augmented feature space, and
$$G(\Theta) \propto \|\psi(\Theta)\|_2^2$$
Support Vector Machines: Example Application
Consonant-vowel transitions, and similar manner-change landmarks, are good places to look for information about speech [Stevens, Interspeech 2000]
Some landmarks occur relatively infrequently—it’s hard to learn what they sound like.
SVMs do the job [Niyogi, Burges, and Ramesh, 1999; Borys and Hasegawa-Johnson, 2005; chance = 50%]:
SVM            %ACC    SVM            %ACC
-+Silence      92.1    +-Silence      91.6
-+Continuant   79.7    +-Continuant   81.1
-+Sonorant     86.4    +-Sonorant     91.1
-+Syllabic     88.6    +-Syllabic     78.5
-+Consonantal  78.1    +-Consonantal  73.1
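For flavor, here is a hedged sketch of a binary landmark detector in the spirit of these experiments, using scikit-learn’s SVC; the synthetic features, labels, and hyperparameters are invented stand-ins, not the published setup.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical stand-in for acoustic observations around candidate
# landmarks: x_t in R^D, label +1 (landmark present) or -1 (absent).
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=+0.5, size=(100, 12))
X_neg = rng.normal(loc=-0.5, size=(100, 12))
X = np.vstack([X_pos, X_neg])
y = np.array([+1] * 100 + [-1] * 100)

# An RBF-kernel SVM; margin maximization controls ||psi(Theta)||^2,
# which is what makes SVMs suited to small training corpora.
svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
print("training accuracy:", svm.score(X, y))
```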
Wrinkle #3: Label Dynamics AND Small Dataset
Landmark detectors ($\delta_{it} = \mathrm{sign}(g_i(\vec{x}_t))$) must be trained using a small dataset, so we use SVMs
Phonetic landmarks tell us about the phonemes that were produced ($Q = [q_1, \ldots, q_T]$, $q_t \in$ set of phoneme labels)
. . . but each phoneme also depends on the previous phoneme!
$$p(q_t = j) = f_j\left(\vec{\delta}_t, q_{t-1}\right)$$
Functions like $f_j(\cdot)$ are hard to train using purely discriminative methods, but relatively easy using Bayesian methods
Bayesian Methods
Bayesian Classification and Recognition:
$$y^* = \arg\max_y p_\Theta(y|x)$$
Bayesian Regression and Tracking:
$$y^* = E\{y|x\}$$
Advantages and Disadvantages
Disadvantage: One must learn $p(x, y)$; this usually requires more data, and is subject to more error, than learning $y = h(x)$ directly
Advantage: Bayesian inference allows modeling of latent variables, state dynamics, and extra sources of information in a principled manner
Hypothesis Space: Latent Variables
Hypothesis Space Example: Hidden Markov Model
Let $X = [x_1, \ldots, x_T]$ be the observations, and let $Y = [y_1, \ldots, y_T]$ be the labels. A Hidden Markov Model posits the existence of some latent variables $Q = [q_1, \ldots, q_T]$ such that
$$h_\Theta(X) = \arg\max_Y p_\Theta(X, Y)$$
$$p_\Theta(X, Y) = \sum_Q \prod_{t=1}^{T} p_\Theta(y_t|y_{t-1})\, p_\Theta(q_t|q_{t-1}, y_t)\, p_\Theta(x_t|q_t)$$
$p_\Theta(y_t|y_{t-1})$ (the “language model”) is a lookup table
$p_\Theta(q_t|q_{t-1}, y_t)$ (the “pronunciation model”) is a lookup table
$p_\Theta(x_t|q_t)$ (the “acoustic model”) is a Gaussian, with mean vector $\mu_q$ and covariance matrix $\Sigma_q$
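A minimal sketch of the sum over Q, computed efficiently by the scaled forward recursion. For brevity it drops the label sequence Y (so it evaluates a plain HMM likelihood p(X) rather than p(X, Y)); the two-state model and data are invented.

```python
import numpy as np
from scipy.stats import multivariate_normal

def forward_log_likelihood(X, pi, A, means, covs):
    """log p(X) = log sum_Q prod_t p(q_t|q_{t-1}) p(x_t|q_t),
    via the scaled forward recursion."""
    T, S = X.shape[0], len(pi)
    # Emission likelihoods p(x_t | q_t = s) from the Gaussian acoustic model
    B = np.array([[multivariate_normal.pdf(X[t], means[s], covs[s])
                   for s in range(S)] for t in range(T)])  # (T, S)
    alpha = pi * B[0]
    log_p = np.log(alpha.sum())
    alpha /= alpha.sum()
    for t in range(1, T):
        alpha = (alpha @ A) * B[t]   # propagate, then absorb emission
        log_p += np.log(alpha.sum())
        alpha /= alpha.sum()         # rescale for numerical stability
    return log_p

# Hypothetical 2-state model on random data:
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.2, 0.8]])
means = np.array([[-1.0, 0.0], [1.0, 0.0]])
covs = np.stack([np.eye(2), np.eye(2)])
print(forward_log_likelihood(X, pi, A, means, covs))
```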
Training Criteria: Maximum Likelihood, MAP, MaxEnt
Learn the Distributions: Maximum Likelihood
Maximum Likelihood Parameter Estimation
$$\Theta = \arg\max_\Theta \log p_\Theta(X, Y)$$
What About Small Training Corpora?
Maximum Likelihood is a form of empirical risk minimization
Related forms of structural risk minimization include:
MAP (maximum a posteriori probability):
$$\Theta = \arg\max_\Theta \left(\log p(X, Y|\Theta) + \log p(\Theta)\right)$$
MaxEnt (maximum entropy):
$$\Theta = \arg\max_\Theta \left(\log p_\Theta(X, Y) + H(p_\Theta)\right)$$
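A toy worked example (not from the slides) contrasting ML and MAP for a Gaussian mean with known variance: the log-prior acts as the structural-risk penalty and shrinks the MAP estimate toward the prior mean, most strongly when the corpus is small. All numbers are invented.

```python
import numpy as np

# Toy data: T samples from a Gaussian with unknown mean, known variance.
rng = np.random.default_rng(0)
sigma2 = 1.0                       # known observation variance
x = rng.normal(loc=2.0, scale=np.sqrt(sigma2), size=5)

# Maximum Likelihood: Theta = argmax log p(X | Theta) -> sample mean
theta_ml = x.mean()

# MAP with Gaussian prior p(Theta) = N(mu0, tau2): the log-prior
# penalty shrinks the estimate toward mu0 (more when T is small).
mu0, tau2 = 0.0, 1.0
T = len(x)
theta_map = (T * x.mean() / sigma2 + mu0 / tau2) / (T / sigma2 + 1 / tau2)

print(f"ML: {theta_ml:.3f}   MAP: {theta_map:.3f}")
```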
Apply Bayes’ Rule
Bayes’ Rule (a.k.a. the Definition of Conditional Probability)
$$p_\Theta(X, Y) = \sum_Q \prod_{t=1}^{T} p_\Theta(y_t|y_{t-1})\, p_\Theta(q_t|q_{t-1}, y_t)\, p_\Theta(x_t|q_t)$$
Training a Bayesian Classifier: Maximum Likelihood
$$\Theta = \arg\max_\Theta p_\Theta(X, Y)$$
Testing a Bayesian Classifier: Minimum Probability of Error
$$Y = \arg\max_Y p_\Theta(X, Y)$$
Complications: HMM Regression, Switching Kalman Smoothers
Bayesian Regression and Tracking
Hidden Markov Regression
Suppose $y_t \in \Re^D$ is a real-valued vector, and $(x_t, y_t)$ are jointly Gaussian:
$$p(x, y|q) \propto \exp\left(-\frac{1}{2}\begin{bmatrix} x_t - x_q \\ y_t - y_q \end{bmatrix}^T \begin{bmatrix} A_q & B_q \\ B_q^T & C_q \end{bmatrix}^{-1} \begin{bmatrix} x_t - x_q \\ y_t - y_q \end{bmatrix}\right)$$
Then the $h(x_t)$ that minimizes $E_2$ is
$$E\{y_t|X\} = \sum_{q_t} P(q_t|X)\left(y_q + B_q^T A_q^{-1}(x - x_q)\right)$$
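A minimal sketch of this posterior-mean regression, assuming the state posteriors $P(q_t|X)$ are already available (e.g., from the forward-backward algorithm); the model parameters and dimensions are invented.

```python
import numpy as np

def hmm_regression_mean(x, post, x_bar, y_bar, A, B):
    """E{y_t|X} = sum_q P(q_t=q|X) * (y_q + B_q^T A_q^{-1} (x - x_q)).

    x: (Dx,) current observation; post: (S,) state posterior P(q_t|X);
    x_bar: (S, Dx); y_bar: (S, Dy); A: (S, Dx, Dx); B: (S, Dx, Dy).
    """
    y_hat = np.zeros(y_bar.shape[1])
    for q in range(len(post)):
        cond_mean = y_bar[q] + B[q].T @ np.linalg.solve(A[q], x - x_bar[q])
        y_hat += post[q] * cond_mean
    return y_hat

# Hypothetical 2-state example mapping a 3-D x to a 2-D y:
rng = np.random.default_rng(0)
S, Dx, Dy = 2, 3, 2
x = rng.normal(size=Dx)
post = np.array([0.7, 0.3])
x_bar = rng.normal(size=(S, Dx))
y_bar = rng.normal(size=(S, Dy))
A = np.stack([np.eye(Dx)] * S)           # within-state covariance of x
B = rng.normal(size=(S, Dx, Dy)) * 0.1   # cross-covariance of (x, y)
print(hmm_regression_mean(x, post, x_bar, y_bar, A, B))
```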
Switching Kalman Smoother
(Figure: graphical model with state chain $s_{t-1} \to s_t \to s_{t+1}$, observations $x_{t-1}, x_t, x_{t+1}$, and targets $y_{t-1}, y_t, y_{t+1}$)
Setup: exactly like HMM regression, except that $(x_t, y_t, y_{t-1})$ are jointly Gaussian
Result: exactly like HMM regression, except that $y_{t|q,X}$, $A_{t|q,X}$, and $B_{t|q,X}$ must be updated using interacting multiple Kalman filters:
$$E\{y_t|X\} = \sum_{q_t} P(q_t|X)\left(y_{t|q,X} + B_{t|q,X}^T A_{t|q,X}^{-1}(x - x_q)\right)$$
Switching Kalman Smoother: Example
Task Description
$x_t$ is an acoustic spectrum; $y_t$ is the corresponding vector of speech articulator positions
Results (unpublished):
$E_2$(Switching Kalman Smoother) < $E_2$(HMM Regression)
The difference is consistent but very small
Hybrid Methods
Hybrid Discriminative-Bayesian Systems
Task Scenario
$y_t$ is very difficult to classify without dynamic information (e.g., speech recognition: $y_t$ = words, $x_t$ = short-time spectrum)
An auxiliary variable $f_t$ can be inferred very accurately using a local classifier (e.g., $f_t$ = phonological distinctive features):
$$f_t = h_\Theta(x_t)$$
Training database includes $X = [x_1, \ldots, x_N]$, $Y = [y_1, \ldots, y_N]$, and $F = [f_1, \ldots, f_N]$
Testing database includes only $X = [x_{N+1}, \ldots, x_{N+M}]$
Discriminative Hidden Nodes, Bayesian Inference
Hybrid Training Methods
Training Algorithm
Train $h_\Theta(x)$ using discriminative methods
Minimize
$$E_L = \frac{1}{T}\sum_{t=1}^{T} \|f_t - h_\Theta(x_t)\|_L^L$$
$h_\Theta(x_t)$ is a real-valued vector that approximates $f_t$
Train a probability model $p_\Theta(F, X, Y)$ using Bayesian methods
Using $p_\Theta(F, X, Y)$, Bayesian inference can model dynamics of hidden states, incorporate multiple knowledge sources, etc.
Example: Landmark-Based Speech Recognizer
SVM-HMM Hybrid Landmark-Based Speech Recognizer
SVM computes a real-valued distinctive feature that optimally discriminates between the case $f_t = 1$ (landmark of a specified type is present) and $f_t = -1$ (landmark absent)
HMM computes p(phoneme sequence | landmark sequence)
Phone Recognition Accuracy vs. Mixture Size, Telephone Speech
MFCC = mel frequency cepstral coefficients
Landmark = detect manner-to-manner landmarks, e.g., obstruent-to-sonorant
Manner = detect manner-onset landmarks, e.g., onset of sonorant region
Example: RNN with Kalman Smoothing
Mitra et al., in review
(Figure: synthetic utterance “ground”; top panel shows the spectrogram (frequency vs. time, 0-4000 Hz), lower panels show the tract variables VEL, TBCL, TBCD, TTCL, and TTCD over 0.05-0.5 s, comparing ANN+Kalman (AP) estimates against the true data)
Discriminative Feature Selection for Acoustic Event Detection
Example: Non-Speech Acoustic Event Detection
Motivation
“Activity detection and description is a key functionality of perceptually aware interfaces working in collaborative human communication environments. . . detection and classification of acoustic events may help to detect and describe human activity. . . ” (CLEAR-AED Task Brief)
Difficulties
Negative SNR (speech is “background noise”)
Unknown spectral structure
Different spectral structure for each event type
Difficulty #1: Negative SNR
Difficulty #2: Unknown Spectral Structure
(Figure panels: Key Jingle, Footsteps, Speech)
Discriminative Feature Selection for AED
Zhuang et al., ICASSP 2008
Problem: what acoustic features are relevant for detecting non-speech acoustic events?
Input: ($x_t \in \Re^D$) includes many acoustic features invented for speech processing (MFCC, PLP, energy, ZCR)
Output: ($f_t \in \Re^d$) selects the most useful features:
$$f_t = W x_t$$
where $W^T = [w_1, \ldots, w_K]$, and $w_k$ is an indicator vector (only one non-zero element)
Hidden Markov Modeling: the label sequence $Y^* = [y_1^*, \ldots, y_N^*]$, $y_t \in \{\text{keyjingle}, \text{footstep}, \ldots\}$ is chosen by a hidden Markov model observing $F = [f_1, \ldots, f_N]$:
$$Y^* = \arg\max_Y p(F|Y)\, p(Y)$$
Bayes Error Rate
Zhuang et al., ICASSP 2008
Let $w_k$ be an indicator vector (all zeros except for one element). The Bayes-optimal error rate of a classifier observing feature $w_k^T x$ is
$$P(\text{error}) = \int\!\!\int P\left(y \ne \arg\max p(w_k^T x, y)\right) dy\, dx$$
Bayes Error Rate Approximated on a Database
$$F(w_k) = \frac{1}{T}\sum_{t=1}^{T} \delta\left(y_t \ne \arg\max p(w_k^T x_t, y_t)\right)$$
Feature Selection Algorithms
Hard-Bayes-Error Feature Selection
For $k = 1, \ldots, K$, choose the indicator vector $w_k$ ($w_k$ is all zeros except for one nonzero element) to minimize
$$F(w_k) = \frac{1}{T}\sum_{t=1}^{T} \delta\left(y_t \ne \arg\max p(w_k^T x_t, y_t)\right)$$
Soft-Bayes-Error Feature Selection
For $k = 1, \ldots, K$, choose the indicator vector $w_k$ to minimize
$$F_S(w_k) = \frac{1}{T}\sum_{t=1}^{T} \mathrm{rank}\left(y_t \,\middle|\, w_k^T x_t\right)$$
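A minimal sketch of hard-Bayes-error selection under simplifying assumptions: each candidate scalar feature is scored by the empirical error of a one-feature Gaussian classifier, and the K best-scoring columns are kept. The class-conditional Gaussian model and toy data are invented.

```python
import numpy as np
from scipy.stats import norm

def hard_bayes_error(X, y, d):
    """Empirical error of a one-feature Gaussian classifier on column d."""
    f = X[:, d]
    classes = np.unique(y)
    # Fit one Gaussian per class to the scalar feature (assumption).
    params = {c: (f[y == c].mean(), f[y == c].std() + 1e-6) for c in classes}
    priors = {c: np.mean(y == c) for c in classes}
    scores = np.stack([priors[c] * norm.pdf(f, *params[c]) for c in classes])
    y_hat = classes[np.argmax(scores, axis=0)]
    return np.mean(y_hat != y)

def select_features(X, y, K):
    """Keep the K columns with the lowest single-feature error."""
    errs = [hard_bayes_error(X, y, d) for d in range(X.shape[1])]
    return np.argsort(errs)[:K]

# Hypothetical data: 2 informative columns out of 10.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=300)
X = rng.normal(size=(300, 10))
X[:, 3] += 2.0 * y
X[:, 7] -= 1.5 * y
print(select_features(X, y, K=2))  # expect columns 3 and 7
```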
Acoustic Event Detection Results
Zhuang et al., ICASSP 2008
MFCC26DAZ = 26 Mel-frequency cepstral coefficients + deltas + acceleration
DERIVE26DAZ = 26 derived features + deltas + acceleration
DERIVE78 = 78 derived features
Bayesian Re-Estimation of the “Discriminative” Hidden Nodes
Bayesian Re-Estimation of Discriminative Nodes
Problem: what projection of the acoustic spectrogram is relevant for recognizing acoustic events?
Discriminative Transform: ($f_t \in \Re^d$) selects the most useful features:
$$f_t = \sum_{k=1}^{K} c_k \sigma(w_k^T x_t)$$
where $c_k \in \Re^d$ and $w_k \in \Re^D$ are arbitrary real-valued weight vectors, and $\sigma(z) = 1/(1 + e^{-z})$.
Bayesian Inference: the label sequence $Y^* = [y_1^*, \ldots, y_N^*]$, $y_t \in \{\text{keyjingle}, \text{footstep}, \ldots\}$ is chosen by a hidden Markov model observing $F = [f_1, \ldots, f_N]$:
$$Y^* = \arg\max_Y p(F|Y)\, p(Y)$$
The Baum-Welch Algorithm
Hidden Markov model parameters are trained to maximize the expected log likelihood, with expectation over the unknown state sequence $Q = [q_1, \ldots, q_N]$:
$$\mathcal{F} = E_Q\{\log p(F, Q)\}$$
$$\mathcal{F} = -\frac{1}{2}\sum_{t=1}^{T}\sum_q p(q_t = q|F, Y)\,(f_t - \mu_q)^T \Sigma_q^{-1}(f_t - \mu_q) - \ldots$$
Baum-Welch Back-Propagation
The neural network can be trained, using standard gradient descent methods, in order to maximize $\mathcal{F}$. For example, with
$$f_t = \sum_{k=1}^{K} c_k \sigma(w_k^T x_t)$$
$$\frac{\partial \mathcal{F}}{\partial c_k} = \sum_{t=1}^{T}\sum_q p(q_t = q|F)\left(\frac{\partial \mathcal{F}}{\partial f_t}\bigg|_{q_t = q}\right)\left(\frac{\partial f_t}{\partial c_k}\right) = \sum_{t=1}^{T}\sum_q p(q_t = q|F)\,\Sigma_q^{-1}(\mu_q - f_t)\,\sigma(w_k^T x_t)$$
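A minimal sketch of the gradient above, assuming the state posteriors $p(q_t = q|F)$ have already been computed by the forward-backward algorithm; all shapes and parameters are invented.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_F_wrt_c(X, post, mu, Sigma_inv, W, c):
    """dF/dc_k = sum_t sum_q p(q_t=q|F) Sigma_q^{-1}(mu_q - f_t) sigma(w_k^T x_t).

    X: (T, D); post: (T, S) state posteriors; mu: (S, d);
    Sigma_inv: (S, d, d); W: (D, K); c: (K, d). Returns (K, d).
    """
    A = sigmoid(X @ W)              # (T, K): sigma(w_k^T x_t)
    F = A @ c                       # (T, d): f_t
    grad = np.zeros_like(c)
    T, S = post.shape
    for t in range(T):
        for q in range(S):
            resid = Sigma_inv[q] @ (mu[q] - F[t])        # (d,)
            grad += post[t, q] * np.outer(A[t], resid)   # (K, d)
    return grad

# Hypothetical shapes: T=20 frames, S=3 states, D=6 inputs, K=4 units, d=2
rng = np.random.default_rng(0)
T, S, D, K, d = 20, 3, 6, 4, 2
X = rng.normal(size=(T, D))
post = rng.dirichlet(np.ones(S), size=T)
mu = rng.normal(size=(S, d))
Sigma_inv = np.stack([np.eye(d)] * S)
W = rng.normal(size=(D, K))
c = rng.normal(size=(K, d))
print(grad_F_wrt_c(X, post, mu, Sigma_inv, W, c))
```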
The Problem of Spurious Maxima
It is always possible to train a mixture Gaussian so that $\mathcal{F} = \infty$:
Give one of the Gaussians a nonzero variance ($|\Sigma_1| > 0$), and one a zero variance ($|\Sigma_2| = 0$)
Set the mean of the second Gaussian equal to any training token ($\mu_2 = f_t$ for any particular $t$)
Result: $\mathcal{F} = \infty + \text{finite} = \infty$
This is called “over-training”
In Baum-Welch Back-Propagation, the same result is obtained for $\|c_k\| \to 0$ or $\|w_k\| \to 0$
Solution: require $\|c_k\| = 1$, or more generally, $\|\partial f_t / \partial x_t\| = 1$
Methods for Avoiding Spurious Maxima
Constrained optimization: maximize
$$L = \mathcal{F} + \sum_k \left[\alpha_k(\|c_k\| - 1) + \beta_k(\|w_k\| - 1)\right]$$
with Lagrange multipliers $\alpha_k$ and $\beta_k$ chosen so that $\|c_k\| = 1$ and $\|w_k\| = 1$
Symplectic Maximum Likelihood Transform (SMLT; Omar and Hasegawa-Johnson, 2004): replace the neural network with one that computes a volume-preserving transform:
$$\left|\det J_f(x)\right| = 1$$
where $J_f(x) = \partial f/\partial x$ is the Jacobian of the transform
The Reflecting Symplectic Transform
Omar and Hasegawa-Johnson, 2004
Divide $x$ and $y$ arbitrarily into equal-length sub-vectors, $x^T = [x_1^T, x_2^T]$, $y^T = [y_1^T, y_2^T]$. Interpret as follows:
$x_1$ is a vector of object positions
$x_2$ is a vector of velocities
$V(x_2)$ is a scalar called the “kinetic energy”
$T(y_1)$ is a scalar called the “potential energy”
Then the following transform is volume-preserving (a numerical check appears in the sketch below):
$$\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} x_1 - \nabla_{x_2} V \\ x_2 - \nabla_{y_1} T \end{bmatrix} = \begin{bmatrix} x_1 - g_1(x_2) \\ x_2 - g_2(x_1 - g_1(x_2)) \end{bmatrix}$$
$g_1(x_2)$ and $g_2(y_1)$ must be irrotational. Easiest way to guarantee this: train $V(x_2)$ and $T(y_1)$ directly, using Baum-Welch back-propagation
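A numerical sanity check (not from the slides), assuming quadratic potentials so that $g_1$ and $g_2$ are linear and hence irrotational: the transform’s Jacobian determinant comes out to 1, confirming volume preservation.

```python
import numpy as np

# Hypothetical quadratic potentials V(x2) and T(y1); their gradients
# g1, g2 are linear (hence irrotational, as the slide requires).
M1 = np.array([[1.0, 0.3], [0.3, 0.5]])    # V(x2) = 0.5 * x2^T M1 x2
M2 = np.array([[0.8, -0.2], [-0.2, 1.2]])  # T(y1) = 0.5 * y1^T M2 y1
g1 = lambda x2: M1 @ x2                    # gradient of V
g2 = lambda y1: M2 @ y1                    # gradient of T

def symplectic(x):
    """y1 = x1 - g1(x2); y2 = x2 - g2(y1)."""
    x1, x2 = x[:2], x[2:]
    y1 = x1 - g1(x2)
    y2 = x2 - g2(y1)
    return np.concatenate([y1, y2])

# Numerical Jacobian at a random point: |det J| should equal 1.
rng = np.random.default_rng(0)
x0 = rng.normal(size=4)
eps = 1e-6
J = np.column_stack(
    [(symplectic(x0 + eps * e) - symplectic(x0 - eps * e)) / (2 * eps)
     for e in np.eye(4)])
print("det J =", np.linalg.det(J))  # ~1.0 up to numerical error
```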
SMLT+GMM for Phone Classification
Omar and Hasegawa-Johnson, 2004
Compute phone label $y_t$ given MFCC cepstrum $x_t$
Symplectic maximum likelihood transform (SMLT) computes $f_t(x_t)$
Maximum likelihood linear transform (MLLT) computes $f_t = W x_t$
Gaussian mixture model (GMM) computes $p(f_t|y_t)$
Database: TIMIT

Features  Classifier  Accuracy
MFCC      GMM         73.7%
MLLT      GMM         74.6%
SMLT      GMM         75.6%
Conclusions
Pattern recognition for acoustic applications:
Choose a hypothesis space (a family of universal approximators)
Gradient descent to minimize error
Bayesian learning simplifies the use of structured models:
Hidden-state dynamics
External sources of information
Hybrid discriminative-Bayesian methods sometimes give the best of both worlds:
Discriminative training = minimum error locally
Bayesian inference = principled integration of disparate information sources
“Discriminative” hidden nodes can be re-trained to optimize a global criterion if the training is constrained to avoid spurious maxima (e.g., volume preserving)
Thank You!
http://www.isle.uiuc.edu/slides/2009/Hasegawa-Johnson09mlss.pdf