LING 696B: Mixture model and linear dimension reduction
Statistical estimation
Basic setup:
The world: distributions p(x; θ), θ -- parameters ("all models may be wrong, but some are useful")
Given parameter θ, p(x; θ) tells us how to calculate the probability of x (also referred to as the "likelihood" p(x|θ))
Observations: X = {x1, x2, …, xN} generated from some p(x; θ). N is the number of observations
Model-fitting: based on some examples X, make guesses (learning, inference) about θ
Statistical estimation
Example:
Assuming people's height follows a normal distribution N(μ, σ²)
p(x; θ) = the probability density function of the normal distribution
Observation: measurements of people's height
Goal: estimate the parameters of the normal distribution
Maximum likelihood estimate (MLE)
Likelihood function: examples xi are independent of one another, so L(θ) = ∏i p(xi; θ)
Among all the possible values of θ, choose the θ̂ that makes L(θ) the biggest: θ̂ = argmax_{θ∈H} L(θ)
Consistency: as N grows, θ̂ converges to the true θ -- provided the true θ lies in the hypothesis space H
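A minimal sketch of MLE for the height example (my own illustration, not from the slides): for a normal distribution the maximizer of L(θ) has a closed form, the sample mean and the sample variance. The data are simulated and all variable names are hypothetical.

```python
# Maximum likelihood estimation of a Gaussian's parameters for the height example.
import numpy as np

rng = np.random.default_rng(0)
heights = rng.normal(loc=170.0, scale=8.0, size=500)  # simulated observations x1..xN

mu_hat = heights.mean()                      # MLE of the mean
var_hat = ((heights - mu_hat) ** 2).mean()   # MLE of the variance (divides by N, not N-1)

# Log likelihood of the data under the fitted model
log_lik = np.sum(-0.5 * np.log(2 * np.pi * var_hat)
                 - (heights - mu_hat) ** 2 / (2 * var_hat))
print(mu_hat, var_hat, log_lik)
```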
H matters a lot! Example: curve fitting with polynomials
Clustering
Need to divide x1, x2, …, xN into clusters, without a priori knowledge of where the clusters are
An unsupervised learning problem: fitting a mixture model to x1, x2, …, xN
Example: the heights of males and females follow two different distributions, but we don't know the gender behind each xi
The K-means algorithm: Start with a random assignment, calculate the means
The K-means algorithm: Re-assign members to the closest cluster according to the means
The K-means algorithm: Update the means based on the new assignments, and iterate
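The three steps above fit in a few lines of NumPy. The sketch below is a generic K-means loop under the assumptions just described (random initial assignment, Euclidean distance); it is an illustration, not the course's own code.

```python
# A minimal K-means sketch: alternate "update means" / "re-assign" until stable.
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.integers(K, size=len(X))          # random initial assignment
    for _ in range(n_iter):
        # Update step: mean of each cluster (re-seed a cluster if it empties)
        means = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                          else X[rng.integers(len(X))] for k in range(K)])
        # Re-assignment step: each point goes to the closest mean
        dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):     # converged
            break
        labels = new_labels
    return means, labels
```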
Why does K-means work?
In the beginning, the centers are poorly chosen, so the clusters overlap a lot.
But if the centers move away from each other, the clusters tend to separate better.
Vice versa, if the clusters are well separated, then the centers will stay away from each other.
Intuitively, these two steps "help each other".
Interpreting K-means as statistical estimation
Equivalent to fitting a mixture of Gaussians with spherical covariance and a uniform prior (weights on each Gaussian)
Problems:
Ambiguous data should have gradient membership
The shape of the clusters may not be spherical
The size of the cluster should play a role
Multivariate Gaussian
1-D: N(μ, σ²)
N-D: N(μ, Σ), μ ~ N×1 vector, Σ ~ N×N matrix with Σ(i,j) = σij ~ correlation
Probability calculation: p(x; μ, Σ) = C |Σ|^(-1/2) exp{ -(1/2)(x-μ)^T Σ^(-1) (x-μ) }  (^T: transpose, Σ^(-1): inverse)
Intuitive meaning of Σ^(-1): how to calculate the distance from x to μ
Multivariate Gaussian: log likelihood and distance
Spherical covariance matrix: Σ^(-1) = (1/σ²) I -- the distance is plain Euclidean distance
Diagonal covariance matrix: Σ^(-1) weights each dimension separately
Full covariance matrix: Σ^(-1) also brings the correlations between dimensions into the distance
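As a small illustration of the point about Σ^(-1) (my own sketch, not from the slides): the log density is a constant minus half the quadratic form (x-μ)^T Σ^(-1) (x-μ), so the choice of covariance type determines what "distance from x to μ" means.

```python
# Log density of a multivariate Gaussian; the quadratic term is the "distance".
import numpy as np

def gaussian_log_density(x, mu, Sigma):
    d = len(mu)
    diff = x - mu
    maha = diff @ np.linalg.inv(Sigma) @ diff          # Mahalanobis distance squared
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

x, mu = np.array([1.0, 2.0]), np.zeros(2)
spherical = 0.5 * np.eye(2)                            # sigma^2 I: rescaled Euclidean distance
diagonal  = np.diag([0.5, 2.0])                        # each dimension weighted separately
full      = np.array([[1.0, 0.8], [0.8, 1.0]])         # correlations enter the distance
for S in (spherical, diagonal, full):
    print(gaussian_log_density(x, mu, S))
```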
Learning mixture of Gaussians: EM algorithm
Expectation: putting "soft" labels on data -- a pair (γ, 1-γ) for each point, e.g. (0.5, 0.5), (0.05, 0.95), (0.8, 0.2)
Learning mixture of Gaussians: EM algorithm
Maximization: doing maximum likelihood with weighted data
Notice everyone is wearing a hat! (the parameters are all estimates, hence the hats)
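Putting the E-step and M-step together for a 1-D, two-component mixture gives the sketch below (an illustration under simple assumptions, not the course implementation); note how every parameter is re-estimated from the weighted data in the M-step.

```python
# A compact EM sketch for a 1-D, two-component Gaussian mixture.
import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, n_iter=50):
    mu = np.array([x.min(), x.max()], dtype=float)    # crude initialization
    sigma = np.array([x.std(), x.std()])
    w = np.array([0.5, 0.5])                          # mixture weights
    for _ in range(n_iter):
        # E-step: soft labels gamma[i, k] = p(component k | x_i)
        dens = np.stack([w[k] * norm.pdf(x, mu[k], sigma[k]) for k in range(2)], axis=1)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted maximum likelihood ("everyone wears a hat")
        Nk = gamma.sum(axis=0)
        mu = (gamma * x[:, None]).sum(axis=0) / Nk
        sigma = np.sqrt((gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)
        w = Nk / len(x)
    return w, mu, sigma

# e.g. heights of a mixed male/female sample, genders unobserved
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(178, 7, 300), rng.normal(165, 6, 300)])
print(em_gmm_1d(x))
```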
EM vs. K-means
Same: iterative optimization, provably converges (see demo)
EM better captures the intuition:
Ambiguous data are assigned gradient membership
Clusters can be arbitrarily shaped pancakes
The size of the cluster is a parameter
Allows for flexible control based on prior knowledge (see demo)
EM is everywhere
Our problem: the labels are important, yet not observable -- "hidden variables"
This situation is common for complex models, where maximum likelihood --> EM:
Bayesian networks, Hidden Markov models, probabilistic context-free grammars, linear dynamical systems
Beyond maximum likelihood? Statistical parsing
Interesting remark from Mark Johnson:
Initialize a PCFG with treebank counts, then train the PCFG on the treebank with EM
A large amount of NLP research tries to dump the first, and improve the second
[Figure: log likelihood vs. measure of success]
What's wrong with this? Mark Johnson's ideas:
Wrong data: humans don't just learn from strings
Wrong model: human syntax isn’t context-free
Wrong way of calculating likelihood: p(sentence | PCFG) isn’t informative
(Maybe) wrong measure of success?
End of excursion: Mixture of many things
Any generative model can be combined with a mixture model to deal with categorical data
Examples: mixture of Gaussians, mixture of HMMs, mixture of factor analyzers, mixture of experts networks
It all depends on what you are modeling
Applying to the speech domain
Speech signals have high dimensions: use the front-end acoustic modeling from speech recognition -- Mel-Frequency Cepstral Coefficients (MFCC)
Speech sounds are dynamic: dynamic acoustic modeling (MFCC + delta); mixture components are Hidden Markov Models (HMMs)
Clustering speech with K-means: phones from TIMIT
Clustering speech with K-means: diphones and words
What's wrong here?
Longer sound sequences are more distinguishable for people, yet doing K-means on static feature vectors misses the change over time
Mixture components must be able to capture dynamic data
Solution: mixture of HMMs
Mixture of HMMs
Learning: EM for HMM + EM for mixture
[Figure: HMMs with silence, burst and transition states, combined into a mixture]
Mixture of HMMs: model-based clustering
Front-end: MFCC + delta
Algorithm: initial guess by K-means, then EM
Gaussian mixture for single frames; HMM mixture for whole sequences
Mixture of HMMs vs. K-means
Phone clustering: 7 phones from 22 speakers
(1-5: cluster index)
Mixture of HMMs vs. K-means
Diphone clustering: 6 diphones from 300+ speakers
Mixture of HMMs vs. K-means
Word clustering: 3 words from 300+ speakers
Growing the model
Guessing 6 clusters at once is hard, but 2 is easy
Hill-climbing strategy: start with 2, then 3, 4, ...
Implementation: split the cluster with the maximum gain in likelihood (a sketch follows below)
Intuition: discriminate within the biggest pile
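Here is a rough sketch of the splitting heuristic only. The slides apply it to a mixture of HMMs; for brevity this sketch uses plain Gaussian mixtures via scikit-learn as a stand-in, and all function names are hypothetical.

```python
# Split the cluster whose 1-component -> 2-component refit gains the most likelihood.
import numpy as np
from sklearn.mixture import GaussianMixture

def split_best_cluster(X, labels):
    """Try splitting each current cluster into 2; keep the split with the largest
    gain in log likelihood over modeling that cluster with a single Gaussian."""
    best_gain, best_k, best_sub = -np.inf, None, None
    for k in np.unique(labels):
        Xk = X[labels == k]
        if len(Xk) < 4:                                 # too small to split
            continue
        ll1 = GaussianMixture(1).fit(Xk).score(Xk) * len(Xk)
        gm2 = GaussianMixture(2, n_init=3).fit(Xk)
        ll2 = gm2.score(Xk) * len(Xk)
        if ll2 - ll1 > best_gain:
            best_gain, best_k, best_sub = ll2 - ll1, k, gm2.predict(Xk)
    if best_k is None:                                  # nothing worth splitting
        return labels, 0.0
    # Relabel: members of the chosen cluster assigned to sub-cluster 1 get a new index
    new_labels = labels.copy()
    new_index = labels.max() + 1
    new_labels[np.where(labels == best_k)[0][best_sub == 1]] = new_index
    return new_labels, best_gain
```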
Learning categories and features with a mixture model
Procedure: apply the mixture model and EM algorithm, inductively finding clusters
Each split is followed by a retraining step using all data
[Figure: binary split tree -- the full data set splits into clusters 1 and 2, which are further split into 11, 12, 21, 22]
[Figure: first split of all data -- % of each phone (IPA/TIMIT labels) classified as Cluster 1 (obstruent) vs. Cluster 2 (sonorant)]
[Figure: splitting Cluster 1 -- % of each phone classified as Cluster 11 (fricative) vs. Cluster 12]
[Figure: splitting Cluster 2 -- % of each phone classified as Cluster 21 (back sonorant) vs. Cluster 22]
[Figure: splitting Cluster 12 -- % of each phone classified as Cluster 121 (oral stop) vs. Cluster 122 (nasal stop)]
[Figure: splitting Cluster 22 -- % of each phone classified as Cluster 221 vs. Cluster 222 (front low sonorant and front high sonorant); the leaves so far: fricative, oral stop, nasal stop, back sonorant, front low sonorant, front high sonorant]
Summary: learning features
Discovered features: distinctions between natural classes based on spectral properties
[Feature tree: all data splits into [-sonorant] vs. [+sonorant]; [-sonorant] splits into [+fricative] vs. [-fricative], with [-fricative] splitting into [-nasal] vs. [+nasal]; [+sonorant] splits into [+back] vs. [-back], with [-back] splitting into [+high] vs. [-high]]
For individual sounds, the feature values are gradient rather than binary (Ladefoged, 2001)
Evaluation: phone classification
How do the "soft" classes fit into "hard" ones?
[Tables: classification results on the training set and on the test set]
Are "errors" really errors?
Level 2: Learning segments + phonotactics
Segmentation is a kind of hidden structure; the iterative strategy works here too
Optimization -- the augmented model: p(words | units, phonotactics, segmentation)
Units: argmax_U p({wi} | U, P, {si}); clustering = argmax p(segments | units) -- Level 1
Phonotactics: argmax_P p({wi} | U, P, {si}); estimating the transitions of a Markov chain
Segmentation: argmax_{si} p({wi} | U, P, {si}); Viterbi decoding
Iterative learning as coordinate-wise ascent
Each step increases the likelihood score and eventually reaches a local maximum
[Figure: level curves of the likelihood score over segmentation (one axis) and units + phonotactics (other axis); the initial value comes from Level-1 learning]
Level 3: Lexicon can be mixtures too
Re-clustering of words using the mixture-based lexical model
Initial values (mixture components, weights): from bottom-up learning (Stage 2)
Iterating steps:
Classify each word as the best exemplar of the given lexical item (also infer segmentation)
Update lexical weights + units + phonotactics
Big question: How to choose K?
Basic problem: nested hypothesis spaces H_{K-1} ⊂ H_K ⊂ H_{K+1} ⊂ …
As K goes up, the likelihood always goes up
Recall the polynomial curve fitting; the same holds for mixture models too (see demo)
Big question: How to choose K?
Idea #1: don't just look at the likelihood; look at a combination of the likelihood and something else (d = number of free parameters)
Bayesian Information Criterion: -2 log L(θ̂) + d log N
Minimum Description Length: -log L(θ̂) + description length(θ)
Akaike Information Criterion: -2 log L(θ̂) + 2d
In practice, often need magical "weights" in front of the something else (a small sketch of BIC-based selection follows)
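A minimal sketch of idea #1 using BIC (my own illustration; it relies on scikit-learn's GaussianMixture and its built-in bic() score, not anything from the course):

```python
# Score K = 1..8 with BIC (-2 log L + d log N) and pick the smallest value.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 1.0, size=(200, 2)) for m in (0, 5, 10)])  # true K = 3

bics = {k: GaussianMixture(k, n_init=5, random_state=0).fit(X).bic(X)
        for k in range(1, 9)}
print(min(bics, key=bics.get), bics)   # the chosen K and all BIC values
```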
Big question: How to choose K?
Idea #2: use one set of data for learning, and another for testing generalization
Cross-validation: run EM until the likelihood starts to hurt on the test set (see demo)
What if you have a bad test set? Jack-knife procedure: cut the data into 10 parts, and do 10 rounds of training and testing
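A sketch of idea #2 (hypothetical names, using scikit-learn for the mixture and the 10-part split): pick the K with the best held-out log likelihood averaged over the folds.

```python
# Held-out likelihood over 10 folds ("cut the data into 10 parts").
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import KFold

def heldout_score(X, K, n_splits=10):
    scores = []
    for train_idx, test_idx in KFold(n_splits, shuffle=True, random_state=0).split(X):
        gm = GaussianMixture(K, n_init=3, random_state=0).fit(X[train_idx])
        scores.append(gm.score(X[test_idx]))   # mean test log likelihood per point
    return np.mean(scores)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 1.0, size=(150, 2)) for m in (0, 4)])   # true K = 2
print({k: round(heldout_score(X, k), 3) for k in range(1, 6)})
```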
Big question: How to choose K?
Idea #3: treat K as a "hyper" parameter, and do Bayesian learning on K
More flexible: K can grow up and down depending on the amount of data
Allow K to grow to infinity: Dirichlet process / Chinese restaurant process mixture
Need "hyper-hyper" parameters to control how likely K is to grow
Also computationally intensive
Big question: How to choose K?
There is really no elegant universal solution
One view: statistical learning looks within H_K, but does not come up with H_K itself
How do people choose K? (also see later reading)
Dimension reduction
Why dimension reduction? Example: estimate a continuous probability distribution by counting histograms on samples
[Figure: histogram estimates with 10, 20, and 30 bins]
Dimension reduction
Now think about 2D, 3D … how many bins do you need?
Estimate the density of the distribution with a Parzen window:
p(x) ≈ (number of data points in the window) / (N × window volume)
How big (r) does the window need to grow?
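A small sketch of the Parzen-window estimate just written down (my own illustration, using a simple box window); it also shows why the window has to grow with the dimension, which is the next slide's point.

```python
# Parzen window density estimate: count points in a box of half-width r around x,
# divide by N times the box volume.
import numpy as np

def parzen_estimate(x, data, r):
    N, d = data.shape
    inside = np.all(np.abs(data - x) <= r, axis=1)   # points falling inside the window
    volume = (2 * r) ** d                            # window volume grows like r^d
    return inside.sum() / (N * volume)

rng = np.random.default_rng(0)
for d in (1, 2, 5, 10):
    data = rng.standard_normal((5000, d))
    # in high dimensions the same r catches almost no points: curse of dimensionality
    print(d, parzen_estimate(np.zeros(d), data, r=0.5))
```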
Curse of dimensionality
Discrete distributions: phonetics experiment with M speakers × N sentences × P stresses × Q segments × …
Decision rules: (K) nearest-neighbor -- how big a K is safe? How long do you have to wait until you are really sure they are your nearest neighbors?
One obvious solution
Assume we know something about the distribution
This translates to a parametric approach
Example: counting histograms for 10-D data needs lots of bins, but knowing it's a pancake allows us to fit a Gaussian -- 10^d parameters vs. how many?
Linear dimension reduction
Principal Component Analysis, Multidimensional Scaling, Factor Analysis, Independent Component Analysis
As we will see, we still need to assume we know something…
Principal Component Analysis
Many names (eigenmodes, KL transform, etc.) and relatives
The key is to understand how to make a pancake: centering, rotating and smashing
Step 1: moving the dough to the center, X <-- X - μ
Principal Component Analysis
Step 2: finding a direction of projection that has the maximal "stretch"
Linear projection of X onto vector w: Proj_w(X) = X_{N×d} * w_{d×1} (X centered)
Now measure the stretch: this is the sample variance = Var(X*w)
Principal Component Analysis
Step 3: formulate this as a constrained optimization problem
Objective of optimization: Var(X*w)
Need a constraint on w (otherwise it can explode); only consider the direction
So formally: argmax_{||w||=1} Var(X*w)
Principal Component Analysis
Some algebra (homework):
Var(x) = E[(x - E[x])²] = E[x²] - (E[x])²
Apply to matrices (homework): Var(X*w) = (1/N) w^T X^T X w = w^T Cov(X) w (why?)
Cov(X) is a d×d matrix (homework): symmetric (easy); for any y, y^T Cov(X) y >= 0 (tricky)
Principal Component Analysis
Going back to the optimization problem:
w1 = argmax_{||w||=1} Var(X*w) = argmax_{||w||=1} w^T Cov(X) w
The solution is the eigenvector of Cov(X) with the largest eigenvalue
w1 is the first principal component!
More principal components
We keep looking for w2 in all the directions perpendicular to w1
Formally: w2 = argmax_{||w2||=1, w2 ⊥ w1} w2^T Cov(X) w2
This turns out to be another eigenvector, the one corresponding to the 2nd largest eigenvalue
New coordinates!
Rotation
We can keep going until we pick up all d eigenvectors, perpendicular to each other
Putting these eigenvectors together, we have a big matrix W = (w1, w2, …, wd)
W is called an orthogonal matrix
This corresponds to a rotation of the pancake; the rotated pancake has no correlation between dimensions
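The whole recipe (center, form Cov(X), eigendecompose, rotate) fits in a short sketch. This is my own illustration with hypothetical names, not code from the course; numpy.linalg.eigh is used because Cov(X) is symmetric.

```python
# PCA: center, form the covariance, take its eigenvectors, rotate into new coordinates.
import numpy as np

def pca(X):
    Xc = X - X.mean(axis=0)                 # Step 1: move the dough to the center
    C = Xc.T @ Xc / len(Xc)                 # Cov(X), a d x d symmetric matrix
    eigvals, W = np.linalg.eigh(C)          # eigenvectors of a symmetric matrix
    order = np.argsort(eigvals)[::-1]       # sort by decreasing "stretch"
    eigvals, W = eigvals[order], W[:, order]
    Z = Xc @ W                              # rotation: new, uncorrelated coordinates
    return Z, W, eigvals

# Quick check on a correlated 2-D "pancake": after rotation, Cov(Z) is (nearly) diagonal
rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[3.0, 2.0], [2.0, 2.0]], size=1000)
Z, W, eigvals = pca(X)
print(np.round(np.cov(Z.T), 3), eigvals)
```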