
Clustering / Unsupervised Methods

Jason Corso, Albert Chen

SUNY at Buffalo

J. Corso (SUNY at Buffalo) Clustering / Unsupervised Methods 1 / 41


Clustering

Introduction

Until now, we’ve assumed our training samples are “labeled” by their category membership.

Methods that use labeled samples are said to be supervised; otherwise, they’re said to be unsupervised.

However:

Why would one even be interested in learning with unlabeled samples? Is it even possible in principle to learn anything of value from unlabeled samples?

J. Corso (SUNY at Buffalo) Clustering / Unsupervised Methods 2 / 41


Introduction

Why Unsupervised Learning?

1 Collecting and labeling a large set of sample patterns can be surprisingly costly.

E.g., videos are virtually free, but accurately labeling the video pixels is expensive and time consuming.

2 Extend to a larger training set by using semi-supervised learning.

Train a classifier on a small set of samples, then tune it up to make it run without supervision on a large, unlabeled set.
Or, in the reverse direction, let a large set of unlabeled data group automatically, then label the groupings found.

3 To detect the gradual change of pattern over time.

4 To find features that will then be useful for categorization.

5 To gain insight into the nature or structure of the data during the early stages of an investigation.

J. Corso (SUNY at Buffalo) Clustering / Unsupervised Methods 3 / 41


Data Clustering

Data Clustering
Source: A. K. Jain and R. C. Dubes. Algorithms for Clustering Data, Prentice Hall, 1988.

What is data clustering?

Grouping of objects into meaningful categories.
Given a representation of N objects, find k clusters based on a measure of similarity.

Why data clustering?

Natural Classification: degree of similarity among forms.
Data exploration: discover underlying structure, generate hypotheses, detect anomalies.
Compression: for organizing data.
Applications: can be used by any scientific field that collects data!

Google Scholar: 1500 clustering papers in 2007 alone!

J. Corso (SUNY at Buffalo) Clustering / Unsupervised Methods 4 / 41


Data Clustering

E.g.: Structure Discovering via Clustering
Source: http://clusty.com

J. Corso (SUNY at Buffalo) Clustering / Unsupervised Methods 5 / 41


Data Clustering

E.g.: Topic Discovery
Source: Map of Science, Nature, 2006

800,000 scientific papers clustered into 776 topics based on how often the papers were cited together by authors of other papers.

J. Corso (SUNY at Buffalo) Clustering / Unsupervised Methods 6 / 41


Data Clustering

Data Clustering - Formal Definition

Given a set of N unlabeled examples D = {x_1, x_2, ..., x_N} in a d-dimensional feature space, D is partitioned into a number of disjoint subsets D_j:

D = ∪_{j=1}^{k} D_j where D_i ∩ D_j = ∅ for i ≠ j , (1)

where the points in each subset are similar to each other according to a given criterion φ.

A partition is denoted by

π = (D_1, D_2, ..., D_k) (2)

and the problem of data clustering is thus formulated as

π* = argmin_π f(π) , (3)

where f(·) is formulated according to φ.

J. Corso (SUNY at Buffalo) Clustering / Unsupervised Methods 7 / 41


Data Clustering

k-Means Clustering
Source: D. Arthur and S. Vassilvitskii. k-means++: The Advantages of Careful Seeding

Randomly initialize µ_1, µ_2, ..., µ_c.
Repeat until no change in µ_i:

Classify the N samples according to the nearest µ_i

Recompute µ_i

J. Corso (SUNY at Buffalo) Clustering / Unsupervised Methods 8 / 41
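
As a concrete illustration of the two alternating steps above, here is a minimal NumPy sketch (not from the slides; the function name, initialization by sampling data points, and the convergence test are our own illustrative choices):

import numpy as np

def kmeans(X, c, max_iter=100, rng=np.random.default_rng(0)):
    # Randomly initialize mu_1, ..., mu_c by picking c distinct data points.
    mu = X[rng.choice(len(X), size=c, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # Classify the N samples according to the nearest mu_i.
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each mu_i as the mean of the samples assigned to it.
        new_mu = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else mu[i]
                           for i in range(c)])
        if np.allclose(new_mu, mu):          # stop when there is no change in mu_i
            break
        mu = new_mu
    return mu, labels

# Example: cluster 300 random 2-D points into 3 groups.
X = np.random.default_rng(1).normal(size=(300, 2))
centers, assignments = kmeans(X, c=3)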


Data Clustering

k-Means++ Clustering
Source: D. Arthur and S. Vassilvitskii. k-means++: The Advantages of Careful Seeding

Choose starting centers iteratively.

Let D(x) be the distance from x to the nearest existing center; take x as a new center with probability ∝ D(x)².

Repeat until no change in µ_i:

Classify the N samples according to the nearest µ_i

Recompute µ_i

(refer to the slides by D. Arthur and S. Vassilvitskii for details)

J. Corso (SUNY at Buffalo) Clustering / Unsupervised Methods 10 / 41
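
A minimal sketch of the seeding rule just described, assuming NumPy and an illustrative helper name (the refinement afterwards is ordinary k-means):

import numpy as np

def kmeanspp_init(X, c, rng=np.random.default_rng(0)):
    # First center: a data point chosen uniformly at random.
    centers = [X[rng.integers(len(X))]]
    for _ in range(c - 1):
        # D(x): distance from each point to its nearest existing center.
        D = np.min(np.linalg.norm(X[:, None, :] - np.asarray(centers)[None, :, :], axis=2), axis=1)
        # Take x as the next center with probability proportional to D(x)^2.
        probs = D ** 2 / np.sum(D ** 2)
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.asarray(centers)

# The returned centers seed the usual k-means iterations (classify / recompute).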


User’s Dilemma

User’s Dilemma
Source: R. Dubes and A. K. Jain, Clustering Techniques: User’s Dilemma, PR 1976

1 What is a cluster?

2 How to define pair-wise similarity?

3 Which features and normalization scheme?

4 How many clusters?

5 Which clustering method?

6 Are the discovered clusters and partition valid?

7 Does the data have any clustering tendency?

J. Corso (SUNY at Buffalo) Clustering / Unsupervised Methods 11 / 41


User’s Dilemma

Cluster Similarity?
Source: R. Dubes and A. K. Jain, Clustering Techniques: User’s Dilemma, PR 1976

Compact Clusters: within-cluster distance < between-cluster connectivity

Connected Clusters: within-cluster connectivity > between-cluster connectivity

Ideal cluster: compact and isolated.

J. Corso (SUNY at Buffalo) Clustering / Unsupervised Methods 12 / 41


User’s Dilemma

Representation (features)?
Source: R. Dubes and A. K. Jain, Clustering Techniques: User’s Dilemma, PR 1976

There’s no universal representation; representations are domain dependent.

J. Corso (SUNY at Buffalo) Clustering / Unsupervised Methods 13 / 41


User’s Dilemma

Good Representation
Source: R. Dubes and A. K. Jain, Clustering Techniques: User’s Dilemma, PR 1976

A good representation leads to compact and isolated clusters.

J. Corso (SUNY at Buffalo) Clustering / Unsupervised Methods 14 / 41


User’s Dilemma

How do we weigh the features?
Source: R. Dubes and A. K. Jain, Clustering Techniques: User’s Dilemma, PR 1976

Two different meaningful groupings produced by different weighting schemes.

J. Corso (SUNY at Buffalo) Clustering / Unsupervised Methods 15 / 41


User’s Dilemma

How do we decide the Number of Clusters?
Source: R. Dubes and A. K. Jain, Clustering Techniques: User’s Dilemma, PR 1976

The samples are generated by 6 independent classes, yet:

[Figure panels: the ground truth grouping and clusterings with k = 2, k = 5, and k = 6]

J. Corso (SUNY at Buffalo) Clustering / Unsupervised Methods 16 / 41


User’s Dilemma

Cluster Validity
Source: R. Dubes and A. K. Jain, Clustering Techniques: User’s Dilemma, PR 1976

Clustering algorithms find clusters, even if there are no natural clusters in the data.

[Figure panels: 100 2D uniform data points; k-means with k = 3]

J. Corso (SUNY at Buffalo) Clustering / Unsupervised Methods 17 / 41


User’s Dilemma

Comparing Clustering Methods
Source: R. Dubes and A. K. Jain, Clustering Techniques: User’s Dilemma, PR 1976

Which clustering algorithm is the best?

J. Corso (SUNY at Buffalo) Clustering / Unsupervised Methods 18 / 41


User’s Dilemma

There’s no best Clustering Algorithm!
Source: R. Dubes and A. K. Jain, Clustering Techniques: User’s Dilemma, PR 1976

Each algorithm imposes a structure on data.
Good fit between model and data ⇒ success.

[Figure panels: GMM with k = 3; GMM with k = 2; Spectral with k = 3; Spectral with k = 2]

J. Corso (SUNY at Buffalo) Clustering / Unsupervised Methods 19 / 41


Gaussian Mixture Models

Gaussian Mixture Models

Recall the Gaussian distribution:

N(x|µ, Σ) = 1 / ((2π)^{d/2} |Σ|^{1/2}) · exp[ −(1/2)(x − µ)^T Σ^{−1} (x − µ) ] (4)

It forms the basis for the important Mixture of Gaussians density.

The Gaussian mixture is a linear superposition of Gaussians in the form:

p(x) = ∑_{k=1}^{K} π_k N(x|µ_k, Σ_k) . (5)

The π_k are non-negative scalars called mixing coefficients and they govern the relative importance between the various Gaussians in the mixture density.

∑_k π_k = 1.

J. Corso (SUNY at Buffalo) Clustering / Unsupervised Methods 20 / 41
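
Equation (5) translates almost literally into code. A small sketch, assuming SciPy’s multivariate_normal for the component densities (the helper name gmm_pdf and the example parameters are ours):

import numpy as np
from scipy.stats import multivariate_normal

def gmm_pdf(x, pis, mus, Sigmas):
    # p(x) = sum_k pi_k N(x | mu_k, Sigma_k), Eq. (5)
    return sum(pi_k * multivariate_normal.pdf(x, mean=mu_k, cov=Sigma_k)
               for pi_k, mu_k, Sigma_k in zip(pis, mus, Sigmas))

# Example: a two-component mixture in 2-D (parameters are illustrative).
pis    = [0.7, 0.3]                          # mixing coefficients, sum to 1
mus    = [np.zeros(2), 3.0 * np.ones(2)]
Sigmas = [np.eye(2), 0.5 * np.eye(2)]
print(gmm_pdf(np.array([0.5, 0.5]), pis, mus, Sigmas))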


Gaussian Mixture Models

[Figure: a three-component Gaussian mixture with mixing coefficients 0.5, 0.3, and 0.2 — (a) the individual components and (b) the resulting mixture density p(x).]

J. Corso (SUNY at Buffalo) Clustering / Unsupervised Methods 21 / 41


Gaussian Mixture Models

Introducing Latent Variables

Define a K-dimensional binary random variable z.

z has a 1-of-K representation such that a particular element z_k is 1 and all of the others are zero. Hence:

z_k ∈ {0, 1} (6)

∑_k z_k = 1 (7)

The marginal distribution over z is specified in terms of the mixing coefficients:

p(z_k = 1) = π_k (8)

And, recall, 0 ≤ π_k ≤ 1 and ∑_k π_k = 1.

J. Corso (SUNY at Buffalo) Clustering / Unsupervised Methods 22 / 41


Gaussian Mixture Models

Since z has a 1-of-K representation, we can also write this distribution as

p(z) = ∏_{k=1}^{K} π_k^{z_k} (9)

The conditional distribution of x given z is a Gaussian:

p(x|z_k = 1) = N(x|µ_k, Σ_k) (10)

or

p(x|z) = ∏_{k=1}^{K} N(x|µ_k, Σ_k)^{z_k} (11)

J. Corso (SUNY at Buffalo) Clustering / Unsupervised Methods 23 / 41


Gaussian Mixture Models

We are interested in the marginal distribution of x:

p(x) = ∑_z p(x, z) (12)

     = ∑_z p(z) p(x|z) (13)

     = ∑_z ∏_{k=1}^{K} π_k^{z_k} N(x|µ_k, Σ_k)^{z_k} (14)

     = ∑_{k=1}^{K} π_k N(x|µ_k, Σ_k) (15)

So, given our latent variable z, the marginal distribution of x is a Gaussian mixture.

If we have N observations x_1, . . . , x_N, then because of our chosen representation, it follows that we have a latent variable z_n for each observed data point x_n.

J. Corso (SUNY at Buffalo) Clustering / Unsupervised Methods 24 / 41


Gaussian Mixture Models

Component Responsibility Term

We need to also express the conditional probability of z given x.

Denote this conditional p(z_k = 1|x) as γ(z_k).

We can derive this value with Bayes’ theorem:

γ(z_k) := p(z_k = 1|x) = p(z_k = 1) p(x|z_k = 1) / ∑_{j=1}^{K} p(z_j = 1) p(x|z_j = 1) (16)

        = π_k N(x|µ_k, Σ_k) / ∑_{j=1}^{K} π_j N(x|µ_j, Σ_j) (17)

View π_k as the prior probability of z_k = 1 and the quantity γ(z_k) as the corresponding posterior probability once we have observed x.

γ(z_k) can also be viewed as the responsibility that component k takes for explaining the observation x.

J. Corso (SUNY at Buffalo) Clustering / Unsupervised Methods 25 / 41
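
Equation (17) is what one actually computes in the E-step. A hedged NumPy/SciPy sketch that evaluates γ(z_nk) for all points and components at once (the function name is illustrative):

import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pis, mus, Sigmas):
    # gamma[n, k] = pi_k N(x_n | mu_k, Sigma_k) / sum_j pi_j N(x_n | mu_j, Sigma_j), Eq. (17)
    num = np.column_stack([pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
                           for k in range(len(pis))])
    return num / num.sum(axis=1, keepdims=True)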


Gaussian Mixture Models Sampling

Sampling from the GMM

To sample from the GMM, we can first generate a value for z from the marginal distribution p(z). Denote this sample z.

Then, sample from the conditional distribution p(x|z).

The figure below-left shows samples from a three-component mixture, colored based on their z. The figure below-middle shows samples from the marginal p(x), ignoring z. On the right, we show the γ(z_k) for each sampled point, colored accordingly.

[Figure panels: (a) samples colored by component z; (b) samples from the marginal p(x); (c) samples colored by responsibilities γ(z_k)]

J. Corso (SUNY at Buffalo) Clustering / Unsupervised Methods 26 / 41
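
The two-stage (ancestral) sampling procedure just described might be sketched as follows; the mixture parameters below are illustrative, not the ones used in the figure:

import numpy as np

def sample_gmm(n, pis, mus, Sigmas, rng=np.random.default_rng(0)):
    # First draw the component z ~ p(z), then x ~ p(x | z).
    zs = rng.choice(len(pis), size=n, p=pis)
    xs = np.array([rng.multivariate_normal(mus[z], Sigmas[z]) for z in zs])
    return xs, zs          # zs gives the "colored by z" view of the samples

# Example: 500 samples from a three-component 2-D mixture (illustrative parameters).
pis    = np.array([0.5, 0.3, 0.2])
mus    = [np.array([0.0, 0.0]), np.array([3.0, 3.0]), np.array([0.0, 3.0])]
Sigmas = [np.eye(2), 0.5 * np.eye(2), np.diag([0.3, 1.0])]
X, Z = sample_gmm(500, pis, mus, Sigmas)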


Gaussian Mixture Models Maximum-Likelihood

Maximum-Likelihood

Suppose we have a set of N i.i.d. observations {x_1, . . . , x_N} that we wish to model with a GMM.

Consider this data set as an N × d matrix X in which the nth row is given by x_n^T.

Similarly, the corresponding latent variables define an N × K matrix Z with rows z_n^T.

The log-likelihood of the corresponding GMM is given by

ln p(X|π, µ, Σ) = ∑_{n=1}^{N} ln [ ∑_{k=1}^{K} π_k N(x_n|µ_k, Σ_k) ] . (18)

Ultimately, we want to find the values of the parameters π, µ, Σ that maximize this function.

J. Corso (SUNY at Buffalo) Clustering / Unsupervised Methods 27 / 41
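
Equation (18) can be evaluated directly; for numerical stability one typically works with log-densities and a log-sum-exp over the components. A minimal sketch, assuming SciPy helpers (the function name is ours):

import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_log_likelihood(X, pis, mus, Sigmas):
    # ln p(X | pi, mu, Sigma) = sum_n ln [ sum_k pi_k N(x_n | mu_k, Sigma_k) ], Eq. (18)
    log_terms = np.column_stack([np.log(pis[k]) + multivariate_normal.logpdf(X, mean=mus[k], cov=Sigmas[k])
                                 for k in range(len(pis))])
    return logsumexp(log_terms, axis=1).sum()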


Gaussian Mixture Models Maximum-Likelihood

However, maximizing the log-likelihood terms for GMMs is much more complicated than for the case of a single Gaussian. Why?

The difficulty arises from the sum over k inside of the log-term. The log function no longer acts directly on the Gaussian, and no closed-form solution is available.

J. Corso (SUNY at Buffalo) Clustering / Unsupervised Methods 28 / 41


Gaussian Mixture Models Maximum-Likelihood

Singularities

There is a significant problem when we apply MLE to estimate GMM parameters.

Consider, for simplicity, covariances defined by Σ_k = σ_k² I.

Suppose that one of the components of the mixture model, j, has its mean µ_j exactly equal to one of the data points so that µ_j = x_n for some n.

This term contributes

N(x_n|x_n, σ_j² I) = 1 / ((2π)^{1/2} σ_j) (19)

Consider the limit σ_j → 0 to see that this term goes to infinity and hence the log-likelihood will also go to infinity.

Thus, the maximization of the log-likelihood function is not a well-posed problem because such a singularity will occur whenever one of the components collapses to a single, specific data point.

J. Corso (SUNY at Buffalo) Clustering / Unsupervised Methods 29 / 41
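
A tiny numerical illustration of the collapse described above (the values are only for demonstration): with µ_j fixed at a data point, the contribution in Eq. (19) grows without bound as σ_j shrinks.

import numpy as np

x_n = 0.0                                   # a data point, with mu_j placed exactly on it
for sigma_j in [1.0, 0.1, 0.01, 1e-4]:
    # Eq. (19): N(x_n | x_n, sigma_j^2 I) = 1 / ((2*pi)^(1/2) * sigma_j)
    density = 1.0 / (np.sqrt(2 * np.pi) * sigma_j)
    print(f"sigma_j = {sigma_j:g}  ->  component density at x_n = {density:.3e}")
# The densities, and hence the log-likelihood, diverge as sigma_j -> 0.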


Gaussian Mixture Models Maximum-Likelihood

[Figure: a mixture density p(x) in which one component has collapsed onto a single data point, producing an arbitrarily tall, narrow spike.]

J. Corso (SUNY at Buffalo) Clustering / Unsupervised Methods 30 / 41


Expectation-Maximization for GMMs

Expectation-Maximization for GMMs

Expectation-Maximization or EM is an elegant and powerful method for finding MLE solutions in the case of missing data, such as the latent variables z indicating the mixture component.

Recall the conditions that must be satisfied at a maximum of the likelihood function.

For the mean µ_k, setting the derivatives of ln p(X|π, µ, Σ) w.r.t. µ_k to zero yields

0 = − ∑_{n=1}^{N} [ π_k N(x_n|µ_k, Σ_k) / ∑_{j=1}^{K} π_j N(x_n|µ_j, Σ_j) ] Σ_k (x_n − µ_k) (20)

  = − ∑_{n=1}^{N} γ(z_nk) Σ_k (x_n − µ_k) (21)

Note the natural appearance of the responsibility terms on the RHS.

J. Corso (SUNY at Buffalo) Clustering / Unsupervised Methods 31 / 41


Expectation-Maximization for GMMs

Multiplying by Σ_k^{−1}, which we assume is non-singular, gives

µ_k = (1/N_k) ∑_{n=1}^{N} γ(z_nk) x_n (22)

where

N_k = ∑_{n=1}^{N} γ(z_nk) (23)

We see the kth mean is the weighted mean over all of the points in the dataset.

Interpret N_k as the number of points assigned to component k.

We find a similar result for the covariance matrix:

Σ_k = (1/N_k) ∑_{n=1}^{N} γ(z_nk)(x_n − µ_k)(x_n − µ_k)^T . (24)

J. Corso (SUNY at Buffalo) Clustering / Unsupervised Methods 32 / 41
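
Given the responsibilities γ(z_nk), Eqs. (22)–(24) are only a few lines of NumPy. A hedged sketch (the function name is ours):

import numpy as np

def m_step_means_covs(X, gamma):
    # X is (N, d); gamma is (N, K) with gamma[n, k] = gamma(z_nk).
    Nk = gamma.sum(axis=0)                       # N_k = sum_n gamma(z_nk), Eq. (23)
    mus = (gamma.T @ X) / Nk[:, None]            # mu_k, Eq. (22): responsibility-weighted mean
    Sigmas = []
    for k in range(gamma.shape[1]):
        diff = X - mus[k]                        # rows are (x_n - mu_k)
        Sigmas.append((gamma[:, k, None] * diff).T @ diff / Nk[k])   # Eq. (24)
    return Nk, mus, np.array(Sigmas)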


Expectation-Maximization for GMMs

We also need to maximize ln p(X|π,µ,Σ) with respect to the mixing coefficients πk.

Introduce a Lagrange multiplier to enforce the constraint \sum_k \pi_k = 1:

\ln p(X \mid \pi, \mu, \Sigma) + \lambda \left( \sum_{k=1}^{K} \pi_k - 1 \right)    (25)

Maximizing it yields:

0 = \frac{1}{\pi_k} \sum_{n=1}^{N} \gamma(z_{nk}) + \lambda    (26)

After multiplying both sides by πk and summing over k, we get

\lambda = -N    (27)

Eliminate λ and rearrange to obtain:

\pi_k = \frac{N_k}{N}    (28)
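As a one-line continuation of the M-step sketch above (gamma again assumed in scope from the earlier sketches), the mixing-coefficient update of Eq. (28) is just a normalized column sum:

```python
Nk = gamma.sum(axis=0)        # Eq. (23): effective counts per component
pis = Nk / gamma.shape[0]     # Eq. (28): pi_k = N_k / N
```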


Solved...right?

So, we’re done, right? We’ve computed the maximum likelihood solutions for each of the unknown parameters.

Wrong!

The responsibility terms depend on these parameters in an intricate way:

\gamma(z_k) \doteq p(z_k = 1 \mid x) = \frac{\pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x \mid \mu_j, \Sigma_j)}

But, these results do suggest an iterative scheme for finding a solution to the maximum likelihood problem.

1 Choose some initial values for the parameters π,µ,Σ.
2 Use the current parameter estimates to compute the posteriors on the latent terms, i.e., the responsibilities.
3 Use the responsibilities to update the estimates of the parameters.
4 Repeat steps 2 and 3 until convergence.

A minimal sketch of this loop appears below.
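The sketch assumes the responsibilities and m_step_means_covs helpers defined earlier (illustrative names, not from the slides) and runs for a fixed number of iterations; a proper convergence test is added later with the full algorithm.

```python
def em_gmm_skeleton(X, pis, mus, covs, n_iters=100):
    """Alternate E- and M-steps for a fixed number of iterations (convergence test omitted)."""
    for _ in range(n_iters):
        gamma = responsibilities(X, pis, mus, covs)       # step 2: E-step
        Nk, mus, covs = m_step_means_covs(X, gamma)       # step 3: M-step (means, covariances)
        pis = Nk / X.shape[0]                             # step 3: mixing coefficients
    return pis, mus, covs
```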


[Figure, six panels (a)–(f): EM on a two-dimensional dataset, showing the evolving Gaussian mixture fit; panels (c)–(f) are labeled L = 1, 2, 5, and 20 complete EM cycles, with both axes spanning −2 to 2.]


Some Quick, Early Notes on EM

EM generally tends to take more steps than the K-Means clustering algorithm.

Each step is also more computationally intensive than in K-Means.

So, one commonly computes K-Means first and then initializes EM from the resulting clusters (a sketch of this appears after these notes).

Care must be taken to avoid singularities in the MLE solution, e.g., a component collapsing onto a single data point, which drives its covariance toward zero and the likelihood toward infinity.

There will generally be multiple local maxima of the likelihood function, and EM is not guaranteed to find the largest of these.
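One hedged illustration of the K-Means initialization mentioned above, using scikit-learn's KMeans (an assumption about tooling, not something prescribed by the slides); the hard cluster assignments seed the initial means, covariances, and mixing coefficients, and a small ridge on each covariance guards against the singularities noted above:

```python
import numpy as np
from sklearn.cluster import KMeans

def init_from_kmeans(X, K, seed=0):
    """Initialize GMM parameters from a hard K-Means clustering of X."""
    labels = KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(X)
    pis, mus, covs = [], [], []
    for k in range(K):
        Xk = X[labels == k]
        pis.append(len(Xk) / len(X))                       # cluster fraction -> pi_k
        mus.append(Xk.mean(axis=0))                        # cluster centroid -> mu_k
        covs.append(np.cov(Xk, rowvar=False) + 1e-6 * np.eye(X.shape[1]))  # regularized covariance
    return np.array(pis), np.array(mus), covs
```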


Given a GMM, the goal is to maximize the likelihood function with respect to the parameters (the means, the covariances, and the mixing coefficients).

1 Initialize the means, µk, the covariances, Σk, and mixing coefficients, πk. Evaluate the initial value of the log-likelihood.

2 E-Step Evaluate the responsibilities using the current parameter values:

\gamma(z_{nk}) = \frac{\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}

3 M-Step Update the parameters using the current responsibilities:

\mu_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n    (29)

\Sigma_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k^{\text{new}})(x_n - \mu_k^{\text{new}})^T    (30)

\pi_k^{\text{new}} = \frac{N_k}{N}    (31)

where

N_k = \sum_{n=1}^{N} \gamma(z_{nk})    (32)


4 Evaluate the log-likelihood

\ln p(X \mid \mu^{\text{new}}, \Sigma^{\text{new}}, \pi^{\text{new}}) = \sum_{n=1}^{N} \ln \left[ \sum_{k=1}^{K} \pi_k^{\text{new}} \, \mathcal{N}(x_n \mid \mu_k^{\text{new}}, \Sigma_k^{\text{new}}) \right]    (33)

5 Check for convergence of either the parameters or the log-likelihood. If the convergence criterion is not satisfied, set the parameters

\mu = \mu^{\text{new}}    (34)

\Sigma = \Sigma^{\text{new}}    (35)

\pi = \pi^{\text{new}}    (36)

and go to step 2. An end-to-end sketch combining steps 1–5 follows below.
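Putting the pieces together, a minimal end-to-end sketch (assuming the responsibilities, m_step_means_covs, and init_from_kmeans helpers from the earlier sketches; the function names and tolerance are illustrative, not from the slides):

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(X, pis, mus, covs):
    """Eq. (33): sum_n ln sum_k pi_k N(x_n | mu_k, Sigma_k)."""
    dens = np.column_stack([
        pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=covs[k])
        for k in range(len(pis))
    ])
    return np.log(dens.sum(axis=1)).sum()

def em_gmm(X, K, max_iters=200, tol=1e-6):
    pis, mus, covs = init_from_kmeans(X, K)          # step 1: K-Means initialization
    ll_old = log_likelihood(X, pis, mus, covs)       # step 1: initial log-likelihood
    for _ in range(max_iters):
        gamma = responsibilities(X, pis, mus, covs)  # step 2: E-step
        Nk, mus, covs = m_step_means_covs(X, gamma)  # step 3: Eqs. (29), (30), (32)
        pis = Nk / X.shape[0]                        # step 3: Eq. (31)
        ll_new = log_likelihood(X, pis, mus, covs)   # step 4: Eq. (33)
        if abs(ll_new - ll_old) < tol:               # step 5: convergence check
            break
        ll_old = ll_new
    return pis, mus, covs
```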


A More General View of EM

The goal of EM is to find maximum likelihood solutions for models having latent variables.

Denote the set of all model parameters as θ, and so the log-likelihood function is

\ln p(X \mid \theta) = \ln \left[ \sum_{Z} p(X, Z \mid \theta) \right]    (37)

Note how the summation over the latent variables appears inside of the log.

Even if the joint distribution p(X,Z|θ) belongs to the exponential family, the marginal p(X|θ) typically does not.

If, for each sample xn, we were given the value of the latent variable zn, then we would have a complete data set, {X,Z}, with which maximizing this likelihood term would be straightforward.


However, in practice, we are not given the values of the latent variables.

So, instead, we focus on the expectation of the log-likelihood under the posterior distribution of the latent variables.

In the E-Step, we use the current parameter values θold to find the posterior distribution of the latent variables, given by p(Z|X,θold).

This posterior is used to define the expectation of the complete-data log-likelihood, denoted Q(θ,θold), which is given by

Q(\theta, \theta^{\text{old}}) = \sum_{Z} p(Z \mid X, \theta^{\text{old}}) \, \ln p(X, Z \mid \theta)    (38)

Then, in the M-step, we revise the parameters to θnew by maximizing this function:

\theta^{\text{new}} = \arg\max_{\theta} Q(\theta, \theta^{\text{old}})    (39)

Note that the log acts directly on the joint distribution p(X,Z|θ), and so the M-step maximization will likely be tractable.
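For the GMM case, Q(θ,θold) specializes to \sum_n \sum_k \gamma(z_{nk}) \left[ \ln \pi_k + \ln \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right], with the responsibilities computed at θold. A small hedged sketch (reusing the earlier helper-style names) makes the point concrete that the log splits over the joint rather than over a sum of components:

```python
import numpy as np
from scipy.stats import multivariate_normal

def q_function(X, gamma, pis, mus, covs):
    """Expected complete-data log-likelihood for a GMM:
    Q = sum_n sum_k gamma[n, k] * (ln pi_k + ln N(x_n | mu_k, Sigma_k)).
    gamma comes from the *old* parameters; (pis, mus, covs) are the candidate *new* ones."""
    log_joint = np.column_stack([
        np.log(pis[k]) + multivariate_normal.logpdf(X, mean=mus[k], cov=covs[k])
        for k in range(len(pis))
    ])
    return (gamma * log_joint).sum()
```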
