EE E6820: Speech & Audio Processing & Recognition
Lecture 3: Machine learning, classification, and generative models

Michael Mandel <[email protected]>
Columbia University Dept. of Electrical Engineering
http://www.ee.columbia.edu/~dpwe/e6820

February 7, 2008

Outline
1 Classification
2 Generative models
3 Gaussian models
4 Hidden Markov models
Classification and generative models
[Figure: spectrogram (f/Hz vs. time/s) of two tokens, and their positions in the formant plane (F2/Hz vs. F1/Hz) for the vowels 'ay' and 'ao']
Classification
- discriminative models
- discrete, categorical random variable of interest
- fixed set of categories

Generative models
- descriptive models
- continuous or discrete random variable(s) of interest
- can estimate parameters
- Bayes' rule makes them useful for classification
Building a classifier
Define classes/attributes
- could state explicit rules
- better to define through 'training' examples

Define feature space

Define decision algorithm
- set parameters from examples

Measure performance
- calculate (weighted) error rate
[Figure: Pols vowel formants, F2/Hz vs. F1/Hz: "u" (x) vs. "o" (o)]
Classification system parts
[Diagram: processing chain]
Sensor → signal → Pre-processing/segmentation → segment → Feature extraction → feature vector → Classification → class → Post-processing

- Pre-processing/segmentation: STFT, locate vowels
- Feature extraction: formant extraction
- Post-processing: context constraints, costs/risk
Feature extraction
Right features are critical
- waveform vs. formants vs. cepstra
- invariance under irrelevant modifications

Theoretically equivalent features may act very differently in a particular classifier
- representations make important aspects explicit
- remove irrelevant information

Feature design incorporates 'domain knowledge'
- although more data ⇒ less need for 'cleverness'

Smaller 'feature space' (fewer dimensions)
→ simpler models (fewer parameters)
→ less training data needed
→ faster training
[inverting MFCCs]
Optimal classification
Minimize the probability of error with the Bayes optimal decision

\hat{\theta} = \arg\max_{\theta_i} p(\theta_i \mid x)

p(\text{error}) = \int p(\text{error} \mid x)\, p(x)\, dx = \sum_i \int_{\Lambda_i} \left(1 - p(\theta_i \mid x)\right) p(x)\, dx

- where Λ_i is the region of x where θ_i is chosen
- ...but p(θ_i | x) is largest in that region, so p(error) is minimized
Sources of error
[Figure: overlapping scaled densities p(x|ω1)·Pr(ω1) and p(x|ω2)·Pr(ω2) along x, with decision regions X̂_ω1 and X̂_ω2; the mass of the other class within each region gives Pr(ω1 | X̂_ω2) = Pr(err | X̂_ω2) and Pr(ω2 | X̂_ω1) = Pr(err | X̂_ω1)]

Suboptimal threshold / regions (bias error)
- use a Bayes classifier

Incorrect distributions (model error)
- better distribution models / more training data

Misleading features ('Bayes error')
- irreducible for a given feature set
- regardless of classification scheme
Two roads to classification
The optimal classifier is

\hat{\theta} = \arg\max_{\theta_i} p(\theta_i \mid x)

but we don't know p(θ_i | x)

Can model the posterior directly
- e.g. nearest neighbor, SVM, AdaBoost, neural net
- maps from inputs x to outputs θ_i
- a discriminative model

Often easier to model the data likelihood p(x | θ_i)
- use Bayes' rule to convert to p(θ_i | x)
- a generative (descriptive) model
Nearest neighbor classification
[Figure: Pols vowel formants, F2/Hz vs. F1/Hz: "u" (x), "o" (o), "a" (+), with a new observation * to be classified]
Find the closest match (nearest neighbor)

- a naive implementation takes O(N) time for N training points
- as N → ∞, the error rate approaches twice the Bayes error rate
- with K summarized classes, takes O(K) time
- locality sensitive hashing gives approximate nearest neighbors in O(dn^{1/c²}) time (Andoni and Indyk, 2006)
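As a concrete illustration, here is a minimal nearest-neighbor classifier in Python (my sketch of the naive O(N)-per-query approach described above; the data are made up):

```python
import numpy as np

def nearest_neighbor_classify(X_train, y_train, X_query):
    """Label each query point with the class of its closest training point.

    Naive implementation: O(N) distance computations per query.
    """
    X_train = np.asarray(X_train, dtype=float)
    labels = []
    for x in np.asarray(X_query, dtype=float):
        # squared Euclidean distance to every training point
        d2 = np.sum((X_train - x) ** 2, axis=1)
        labels.append(y_train[np.argmin(d2)])
    return np.array(labels)

# Toy (F1, F2)-style points in Hz for two made-up vowel classes
X_train = np.array([[300, 700], [350, 750], [500, 1000], [550, 1050]])
y_train = np.array(["u", "u", "o", "o"])
print(nearest_neighbor_classify(X_train, y_train, [[340, 720], [520, 990]]))
# -> ['u' 'o']
```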
Support vector machines
A "large margin" linear classifier for separable data
- regularization of the margin avoids over-fitting
- can be adapted to non-separable data (C parameter)
- made nonlinear using kernels k(x1, x2) = Φ(x1) · Φ(x2)

[Figure: separating hyperplane (w·x) + b = 0 with margins (w·x) + b = ±1; points labeled y_i = ±1, with the support vectors lying on the margins]

Depends only on training points near the decision boundary, the support vectors

Unique, optimal solution for a given Φ and C
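For reference, a hedged sketch of fitting such a classifier with scikit-learn's SVC (an implementation choice of this note, not part of the lecture; toy data):

```python
import numpy as np
from sklearn.svm import SVC  # assumes scikit-learn is installed

# Toy two-class data (not the Pols vowel set)
X = np.array([[0, 0], [1, 1], [1, 0], [0, 1],
              [3, 3], [4, 4], [3, 4], [4, 3]], dtype=float)
y = np.array([-1, -1, -1, -1, 1, 1, 1, 1])

# C trades margin width against training errors; the RBF kernel plays
# the role of k(x1, x2) = Phi(x1) . Phi(x2) for an implicit nonlinear Phi
clf = SVC(kernel="rbf", C=1.0).fit(X, y)

print(clf.support_)           # indices of the support vectors
print(clf.predict([[2, 2]]))  # classify a new point
```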
Generative models
Describe the data using structured probabilistic models

Observations are random variables whose distribution depends on model parameters

Source distributions p(x | θ_i)
- reflect variability in features
- reflect noise in observation
- generally have to be estimated from data (rather than known in advance)

[Figure: class-conditional densities p(x | ω_i) along x for classes ω1 ... ω4]
Generative models (2)
Three things to do with generative models

Evaluate the probability of an observation, possibly under multiple parameter settings

p(x),\; p(x \mid \theta_1),\; p(x \mid \theta_2),\; \ldots

Estimate model parameters from observed data

\hat{\theta} = \arg\min_{\theta} C(\theta^*, \theta \mid x)

Run the model forward to generate new data

x \sim p(x \mid \theta)
Random variables review
Random variables have joint distributions, p(x, y)

Marginal distribution of y

p(y) = \int p(x, y)\, dx

Knowing one value in a joint distribution constrains the remainder

Conditional distribution of x given y

p(x \mid y) \equiv \frac{p(x, y)}{p(y)} = \frac{p(x, y)}{\int p(x, y)\, dx}

[Figure: joint density p(x, y) with means µx, µy, spreads σx, σy, and covariance σxy; its marginals p(x) and p(y); and the slice at y = Y, which after normalization gives the conditional p(x | y = Y)]
Bayes’ rule
p(x \mid y)\, p(y) = p(x, y) = p(y \mid x)\, p(x)

\therefore\; p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)}

⇒ can reverse conditioning given priors/marginals
- terms can be discrete or continuous
- generalizes to more variables:
  p(x, y, z) = p(x | y, z) p(y, z) = p(x | y, z) p(y | z) p(z)
- allows conversion between joint, marginals, and conditionals
Bayes’ rule for generative models
Run generative models backwards to compare them

p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{\int p(x \mid \theta)\, p(\theta)\, d\theta}

i.e. posterior = likelihood × prior / evidence, where the evidence is p(x)

The posterior is the classification we're looking for
- a combination of prior belief in each class
- with the likelihood under our model
- normalized by the evidence (so the posteriors integrate to 1)

Objection: priors are often unknown
- ...but omitting them amounts to assuming they are all equal
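A minimal numeric sketch of this conversion for a discrete set of classes (my example, with made-up likelihoods and priors):

```python
import numpy as np

def posteriors(likelihoods, priors):
    """Bayes' rule: p(theta_i | x) is the likelihood-prior product,
    normalized by the evidence p(x) = sum_j p(x | theta_j) p(theta_j)."""
    joint = np.asarray(likelihoods) * np.asarray(priors)
    return joint / joint.sum()

# made-up likelihoods p(x | theta_i) for three classes, with unequal priors
post = posteriors([0.02, 0.05, 0.01], [0.5, 0.3, 0.2])
print(post)            # -> [0.370 0.556 0.074] (approximately)
print(post.argmax())   # -> 1: the Bayes optimal decision picks class 2
```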
Computing probabilities and estimating parameters
Want the probability of the observation under a model, p(x)
- regardless of parameter settings

Full Bayesian integral

p(x) = \int p(x \mid \theta)\, p(\theta)\, d\theta

Difficult to compute in general; approximate as p(x | θ̂)
- Maximum likelihood (ML)

  \hat{\theta} = \arg\max_{\theta} p(x \mid \theta)

- Maximum a posteriori (MAP): ML + prior

  \hat{\theta} = \arg\max_{\theta} p(\theta \mid x) = \arg\max_{\theta} p(x \mid \theta)\, p(\theta)
Model checking
After estimating parameters, run the model forward

Check that
- the model is rich enough to capture variation in the data
- the parameters are estimated correctly
- there aren't any bugs in your code

Generate data from the model and compare it to the observations

x \sim p(x \mid \hat{\theta})

- are they similar under some statistics T(x): ℝ^d → ℝ?
- can you find the real data set in a group of synthetic data sets?

Then go back and update your model accordingly

Gelman et al. (2003, ch. 6)
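A sketch of such a check in Python (my example, in the spirit of the posterior predictive checks of Gelman et al.: fit a model, generate replicate data sets, and compare a statistic T(x)):

```python
import numpy as np

rng = np.random.default_rng(0)
x_obs = rng.normal(loc=2.0, scale=1.5, size=200)   # stand-in 'observed' data

# ML-fit a 1D Gaussian to the observations
mu_hat, sigma_hat = x_obs.mean(), x_obs.std()

# Generate replicate data sets from the fitted model and compare a
# statistic T(x) -- here the sample maximum -- against the observed value
T = np.max
T_reps = np.array([T(rng.normal(mu_hat, sigma_hat, size=x_obs.size))
                   for _ in range(1000)])
p_val = np.mean(T_reps >= T(x_obs))
print(f"fraction of replicates with T at least the observed value: {p_val:.2f}")
# values near 0 or 1 suggest the model misses this aspect of the data
```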
Gaussian models
The easiest way to model distributions is via a parametric model
- assume a known form, estimate a few parameters

The Gaussian model is simple and useful. In 1D:

p(x \mid \theta_i) = \frac{1}{\sigma_i \sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{x - \mu_i}{\sigma_i}\right)^2\right]

Parameters: mean µ_i and variance σ_i² → fit to the data

[Figure: 1D Gaussian density p(x | ω_i) centered at µ_i, with peak height P_M falling to P_M e^{-1/2} at µ_i ± σ_i]
Gaussians in d dimensions
p(\mathbf{x} \mid \theta_i) = \frac{1}{(2\pi)^{d/2} |\Sigma_i|^{1/2}} \exp\left[-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_i)^T \Sigma_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i)\right]

Described by a d-dimensional mean µ_i and a d × d covariance matrix Σ_i

[Figure: a 2D Gaussian density surface and its elliptical contours]
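A short sketch of evaluating this density in Python, directly from the formula above (my code; the solve/slogdet calls avoid forming Σ⁻¹ explicitly):

```python
import numpy as np

def gaussian_logpdf(x, mu, Sigma):
    """Log of the d-dimensional Gaussian density defined above."""
    x, mu = np.asarray(x, float), np.asarray(mu, float)
    d = mu.size
    diff = x - mu
    maha = diff @ np.linalg.solve(Sigma, diff)   # (x-mu)^T Sigma^{-1} (x-mu)
    _, logdet = np.linalg.slogdet(Sigma)         # log |Sigma|
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

mu = np.array([1.0, 2.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
print(np.exp(gaussian_logpdf([1.5, 1.0], mu, Sigma)))  # the density p(x | theta)
```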
Gaussian mixture models
Single Gaussians cannot model
- distributions with multiple modes
- distributions with nonlinear correlations

What about a weighted sum?

p(x) \approx \sum_k c_k\, p(x \mid \theta_k)

- where {c_k} is a set of weights and {p(x | θ_k)} is a set of Gaussian components
- can fit anything given enough components

Interpretation: each observation is generated by one of the Gaussians, chosen with probability c_k = p(θ_k), as in the sampling sketch below
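A minimal sketch of that generative story for a made-up 1D mixture:

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up 1D mixture: weights c_k, component means and standard deviations
c = np.array([0.5, 0.3, 0.2])
mus = np.array([-2.0, 0.0, 3.0])
sigmas = np.array([0.5, 1.0, 0.8])

def sample_gmm(n):
    """Pick a component k with probability c_k, then draw from that Gaussian."""
    ks = rng.choice(len(c), size=n, p=c)
    return rng.normal(mus[ks], sigmas[ks])

print(sample_gmm(5))
```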
Gaussian mixtures (2)
e.g. nonlinear correlation

[Figure: a curved cloud of original data, the individual Gaussian components fit to it, and the resulting mixture density surface]

Problem: finding the c_k and θ_k parameters
- easy if we knew which θ_k generated each x
Expectation-maximization (EM)
A general procedure for estimating model parameters when some are unknown
- e.g. which GMM component generated a point

Iteratively update the model parameters θ to maximize Q, the expected log-probability of the observed data x and hidden data z:

Q(\theta, \theta^t) = \int_z p(z \mid x, \theta^t) \log p(z, x \mid \theta)

- E step: calculate p(z | x, θ^t) using θ^t
- M step: find the θ that maximizes Q using p(z | x, θ^t)
- can prove p(x | θ) is non-decreasing, hence a maximum likelihood model
- finds a local optimum only: depends on initialization
Fitting GMMs with EM
Want to find
- the parameters of the Gaussians θ_k = {µ_k, Σ_k}
- the weights/priors on the Gaussians c_k = p(θ_k)

...that maximize the likelihood of the training data x

If we could assign each x to a particular θ_k, estimation would be direct

Hence treat the mixture indices z as hidden
- form Q = E[log p(x, z | θ)]
- differentiate with respect to the model parameters
→ equations for µ_k, Σ_k, c_k that maximize Q
GMM EM update equations
Parameters that maximize Q, in terms of ν_{nk}, the 'fuzzy membership' of x_n in Gaussian k:

\nu_{nk} \equiv p(z_k \mid x_n, \theta^t)

\mu_k = \frac{\sum_n \nu_{nk}\, x_n}{\sum_n \nu_{nk}} \qquad \Sigma_k = \frac{\sum_n \nu_{nk}\, (x_n - \mu_k)(x_n - \mu_k)^T}{\sum_n \nu_{nk}} \qquad c_k = \frac{1}{N} \sum_n \nu_{nk}

Each updated parameter is just a sample average, weighted by the fuzzy membership
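A compact sketch of one EM iteration for a 1D GMM, following these update equations (my implementation; the max-shift in the E step guards against underflow):

```python
import numpy as np

def em_step(x, c, mus, sigma2s):
    """One EM iteration for a 1D GMM, per the update equations above."""
    x = np.asarray(x, float)[:, None]                          # (N, 1)
    # E step: nu_{nk} = p(z_k | x_n, theta^t), via log densities
    logp = (-0.5 * (x - mus) ** 2 / sigma2s
            - 0.5 * np.log(2 * np.pi * sigma2s) + np.log(c))   # (N, K)
    nu = np.exp(logp - logp.max(axis=1, keepdims=True))
    nu /= nu.sum(axis=1, keepdims=True)
    # M step: membership-weighted sample averages
    Nk = nu.sum(axis=0)
    mus = (nu * x).sum(axis=0) / Nk
    sigma2s = (nu * (x - mus) ** 2).sum(axis=0) / Nk
    return Nk / len(x), mus, sigma2s

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(-2, 0.5, 100), rng.normal(3, 1.0, 100)])
c, mus, s2 = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(25):
    c, mus, s2 = em_step(x, c, mus, s2)
print(c, mus, s2)   # should approach the true weights, means, and variances
```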
GMM examples

Vowel data fit with different numbers of mixture components:

[Figure: four fits to the vowel formant data (F2/Hz vs. F1/Hz); 1 Gaussian: log p(x) = -1911; 2 Gaussians: log p(x) = -1864; 3 Gaussians: log p(x) = -1849; 4 Gaussians: log p(x) = -1840]

[Example...]
Markov models
A (first-order) Markov model is a finite-state system whose behavior depends only on the current state

"The future is independent of the past, conditioned on the present"

e.g. a generative Markov model over states S, A, B, C, E with transition probabilities p(q_{n+1} | q_n) (rows q_n, columns q_{n+1}):

         S    A    B    C    E
    S    0    1    0    0    0
    A    0   .8   .1   .1    0
    B    0   .1   .8   .1    0
    C    0   .1   .1   .7   .1
    E    0    0    0    0    1

A sample state sequence:
S A A A A A A A A B B B B B B B B B C C C C B B B B B B C E
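To run this model forward, here is a short sketch (my code) that samples state sequences from the transition matrix above until the end state E is reached:

```python
import numpy as np

rng = np.random.default_rng(3)

states = ["S", "A", "B", "C", "E"]
# p(q_{n+1} | q_n): rows are the current state, from the table above
P = np.array([
    [0.0, 1.0, 0.0, 0.0, 0.0],   # S
    [0.0, 0.8, 0.1, 0.1, 0.0],   # A
    [0.0, 0.1, 0.8, 0.1, 0.0],   # B
    [0.0, 0.1, 0.1, 0.7, 0.1],   # C
    [0.0, 0.0, 0.0, 0.0, 1.0],   # E (absorbing end state)
])

def sample_chain():
    """Run the generative Markov model from S until it reaches E."""
    q, seq = 0, ["S"]
    while states[q] != "E":
        q = rng.choice(len(states), p=P[q])
        seq.append(states[q])
    return " ".join(seq)

print(sample_chain())   # e.g. 'S A A A B B C C E' (random)
```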
Hidden Markov models
A Markov model where the state sequence Q = {q_n} is not directly observable ('hidden')

But the observations X do depend on Q
- x_n is a random variable that depends only on the current state: p(x_n | q_n)

[Figure: emission distributions p(x | q) for states q = A, B, C; the state sequence AAAAAAAABBBBBBBBBBBCCCCBBBBBBBC; and the resulting observation sequence x_n over time steps n]

We can still tell something about the state sequence...
(Generative) Markov models
An HMM is specified by parameters Θ:
- states q^i
- transition probabilities a_ij
- emission distributions b_i(x)
- (+ initial state probabilities π_i)

a_{ij} \equiv p(q^j_n \mid q^i_{n-1}) \qquad b_i(x) \equiv p(x \mid q^i) \qquad \pi_i \equiv p(q^i_1)

[Figure: example left-to-right word models for 'k a t' with their transition matrices and emission densities p(x | q)]
Markov models for sequence recognition

Independence of observations
- observation x_n depends only on the current state q_n

p(X \mid Q) = p(x_1, x_2, \ldots, x_N \mid q_1, q_2, \ldots, q_N)
            = p(x_1 \mid q_1)\, p(x_2 \mid q_2) \cdots p(x_N \mid q_N)
            = \prod_{n=1}^{N} p(x_n \mid q_n) = \prod_{n=1}^{N} b_{q_n}(x_n)

Markov transitions
- the transition to the next state q_{n+1} depends only on q_n

p(Q \mid M) = p(q_1, q_2, \ldots \mid M)
            = p(q_N \mid q_{N-1}, \ldots, q_1)\, p(q_{N-1} \mid q_{N-2}, \ldots, q_1) \cdots p(q_2 \mid q_1)\, p(q_1)
            = p(q_N \mid q_{N-1})\, p(q_{N-1} \mid q_{N-2}) \cdots p(q_2 \mid q_1)\, p(q_1)
            = p(q_1) \prod_{n=2}^{N} p(q_n \mid q_{n-1}) = \pi_{q_1} \prod_{n=2}^{N} a_{q_{n-1} q_n}
Model-fit calculations
From 'state-based modeling':

p(X \mid \Theta_j) = \sum_{\text{all } Q} p(X \mid Q, \Theta_j)\, p(Q \mid \Theta_j)

For HMMs

p(X \mid Q) = \prod_{n=1}^{N} b_{q_n}(x_n) \qquad p(Q \mid M) = \pi_{q_1} \prod_{n=2}^{N} a_{q_{n-1} q_n}

Hence, solve for \hat{\Theta} = \arg\max_{\Theta_j} p(\Theta_j \mid X)
- using Bayes' rule to convert from p(X | Θ_j)

Sum over all Q???
Summing over all paths
Model M1 has states S, A, B, E with transition probabilities p(q_n | q_{n-1}) (rows q_{n-1}, columns q_n; • = 0):

         S    A    B    E
    S    •   0.9  0.1   •
    A    •   0.7  0.2  0.1
    B    •    •   0.8  0.2
    E    •    •    •    1

Observation likelihoods p(x | q) for observations x1, x2, x3:

         x1   x2   x3
    A   2.5  0.2  0.1
    B   0.1  2.2  2.3

All possible 3-emission paths Q_k from S to E, with p(Q | M) = Π_n p(q_n | q_{n-1}) and p(X | Q, M) = Π_n p(x_n | q_n):

    q0 q1 q2 q3 q4    p(Q | M)                       p(X | Q, M)                 p(X, Q | M)
    S  A  A  A  E     .9 × .7 × .7 × .1 = 0.0441     2.5 × 0.2 × 0.1 = 0.05      0.0022
    S  A  A  B  E     .9 × .7 × .2 × .2 = 0.0252     2.5 × 0.2 × 2.3 = 1.15      0.0290
    S  A  B  B  E     .9 × .2 × .8 × .2 = 0.0288     2.5 × 2.2 × 2.3 = 12.65     0.3643
    S  B  B  B  E     .1 × .8 × .8 × .2 = 0.0128     0.1 × 2.2 × 2.3 = 0.506     0.0065
                      Σ = 0.1109                                                 Σ = p(X | M) = 0.4020

[Figure: trellis of states S, A, B, E against time n = 0 ... 4, showing the four paths]
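These numbers can be reproduced by brute force; here is a sketch in Python that enumerates the emitting-state sequences of model M1 and sums p(X, Q | M) over them:

```python
import numpy as np
from itertools import product

# Transition probabilities of model M1 (absent entries are 0)
a = {("S", "A"): .9, ("S", "B"): .1, ("A", "A"): .7, ("A", "B"): .2,
     ("A", "E"): .1, ("B", "B"): .8, ("B", "E"): .2}
# Observation likelihoods p(x_n | q) for x1, x2, x3
b = {"A": [2.5, 0.2, 0.1], "B": [0.1, 2.2, 2.3]}

total = 0.0
for q in product("AB", repeat=3):          # all 3-emission state sequences
    path = ("S",) + q + ("E",)
    p_Q = np.prod([a.get(t, 0.0) for t in zip(path, path[1:])])
    p_X_given_Q = np.prod([b[s][n] for n, s in enumerate(q)])
    if p_Q > 0:                            # print the four reachable paths
        print("".join(path), f"p(Q|M)={p_Q:.4f}", f"p(X|Q,M)={p_X_given_Q:.4f}")
    total += p_Q * p_X_given_Q
print(f"p(X|M) = {total:.4f}")             # -> 0.4020, as in the table
```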
The ‘forward recursion’
A dynamic-programming-like technique to sum over all Q

Define α_n(i) as the probability of getting to state q^i at time step n (by any path):

\alpha_n(i) = p(x_1, x_2, \ldots, x_n, q_n = q^i) \equiv p(X_1^n, q_n^i)

α_{n+1}(j) can be calculated recursively:

\alpha_{n+1}(j) = \left[\sum_{i=1}^{S} \alpha_n(i)\, a_{ij}\right] b_j(x_{n+1})

[Figure: trellis of model states q^i against time steps n; each α_{n+1}(j) combines the incoming α_n(i) via the transitions a_ij, then multiplies by the emission b_j(x_{n+1})]
Forward recursion (2)
Initialize with  \alpha_1(i) = \pi_i\, b_i(x_1)

Then the total probability is  p(X_1^N \mid \Theta) = \sum_{i=1}^{S} \alpha_N(i)

→ a practical way to solve for p(X | Θ_j) and hence select the most probable model (recognition)

[Figure: observations X scored as p(X | M1)·p(M1) and p(X | M2)·p(M2); choose the best]
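Here is a sketch of the forward recursion applied to model M1 from the path-summing example (my code; the slide's formula has no explicit end state, so the transitions into E are handled as separate exit probabilities):

```python
import numpy as np

# Emitting states A, B of model M1
pi = np.array([0.9, 0.1])            # p(q_1): transitions out of S
A = np.array([[0.7, 0.2],            # a_ij among the emitting states
              [0.0, 0.8]])
exit_p = np.array([0.1, 0.2])        # transitions into the end state E
B = np.array([[2.5, 0.2, 0.1],       # b_i(x_n): rows A, B; columns x1..x3
              [0.1, 2.2, 2.3]])

def forward(pi, A, exit_p, B):
    """alpha_{n+1}(j) = [sum_i alpha_n(i) a_ij] b_j(x_{n+1})"""
    alpha = pi * B[:, 0]             # alpha_1(i) = pi_i b_i(x_1)
    for n in range(1, B.shape[1]):
        alpha = (alpha @ A) * B[:, n]
    return alpha @ exit_p            # absorb into E

print(f"p(X|M) = {forward(pi, A, exit_p, B):.4f}")  # -> 0.4020, matching the path sum
```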
Optimal path
May be interested in the actual q_n assignments
- which state was 'active' at each time frame
- e.g. phone labeling (for training?)

The total probability sums over all paths, but we can also solve for the single best path, the "Viterbi" state sequence

Probability along the best path to state q^j at time n + 1:

\hat{\alpha}_{n+1}(j) = \left[\max_i \left\{\hat{\alpha}_n(i)\, a_{ij}\right\}\right] b_j(x_{n+1})

- backtrack from the final state to get the best path
- the final probability is a product only (no sum)
→ the log-domain calculation is just a summation

The best path often dominates the total probability:

p(X \mid \Theta) \approx p(X, \hat{Q} \mid \Theta)
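A sketch of the Viterbi recursion on the same M1 model (my code, reusing the matrices from the forward-recursion sketch):

```python
import numpy as np

pi = np.array([0.9, 0.1])                    # model M1, as before
A = np.array([[0.7, 0.2], [0.0, 0.8]])
exit_p = np.array([0.1, 0.2])
B = np.array([[2.5, 0.2, 0.1], [0.1, 2.2, 2.3]])
names = ["A", "B"]

def viterbi(pi, A, exit_p, B):
    """Max-product recursion with backtracking (the log domain would use sums)."""
    N = B.shape[1]
    alpha = pi * B[:, 0]                     # best-path probability so far
    back = np.zeros((N, len(pi)), dtype=int) # best predecessor per state
    for n in range(1, N):
        scores = alpha[:, None] * A          # scores[i, j] = alpha(i) a_ij
        back[n] = scores.argmax(axis=0)
        alpha = scores.max(axis=0) * B[:, n]
    last = int(np.argmax(alpha * exit_p))    # include the transition into E
    path = [last]
    for n in range(N - 1, 0, -1):            # backtrack
        path.append(back[n][path[-1]])
    return [names[i] for i in reversed(path)], (alpha * exit_p).max()

path, p = viterbi(pi, A, exit_p, B)
print(path, f"p(X, Q*|M) = {p:.4f}")   # -> ['A', 'B', 'B'], 0.3643: the dominant path
```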
Interpreting the Viterbi path
The Viterbi path assigns each x_n to a state q^i
- performing classification based on b_i(x)
- ...while at the same time applying the transition constraints a_ij

[Figure: observations x_n over time steps n with Viterbi labels AAAAAAAABBBBBBBBBBBCCCCBBBBBBBC: the inferred classification]

Can be used for segmentation
- train an HMM with 'garbage' and 'target' states
- decode on new data to find the 'targets' and their boundaries

Can be used for (heuristic) training
- e.g. forced alignment to bootstrap a speech recognizer
- e.g. train classifiers based on the labels...
Aside: Training and test data
A rich model can learn every training example (overtraining)

But the goal is to classify new, unseen data
- sometimes use a 'cross-validation' set to decide when to stop training

For evaluation results to be meaningful:
- don't test with training data!
- don't train on test data (even indirectly...)
Aside (2): Model complexity
More training data allows the use of larger models

More model parameters create a better fit to the training data
- more Gaussian mixture components
- more HMM states

For a fixed training set size, there will be some optimal model size that avoids overtraining
Summary
Classification is making discrete (hard) decisions

The basis is comparison with known examples
- explicitly or via a model

Classification models
- discriminative models, like SVMs, neural nets, and boosters, directly learn the posteriors p(θ_i | x)
- generative models, like Gaussians, GMMs, and HMMs, model the likelihoods p(x | θ)
- Bayes' rule lets us use generative models for classification

EM allows parameter estimation even with some data missing

Parting thought: is it wise to use generative models for discrimination, or vice versa?
References
Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In Proc. IEEE Symposium on Foundations of Computer Science, pages 459–468, Washington, DC, USA, 2006. IEEE Computer Society.

Andrew Gelman, John B. Carlin, Hal S. Stern, and Donald B. Rubin. Bayesian Data Analysis. Chapman & Hall/CRC, second edition, July 2003. ISBN 158488388X.

Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.

Jeff A. Bilmes. A gentle tutorial on the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models, 1997.

Christopher J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121–167, June 1998.