Page 1

Clustering

Machine Learning Workshop

UC Berkeley

Junming Yin

August 24, 2007

Page 2

Outline

• Introduction

• Unsupervised learning

• What is clustering? Applications

• Dissimilarity (similarity) of samples

• Clustering algorithms

• K-means

• Gaussian mixture model (GMM), EM

• Hierarchical clustering

• Spectral clustering

Page 3

Unsupervised Learning

• Recall that in the setting of classification and regression, the training data are represented as pairs $\{(x_i, y_i)\}_{i=1}^n$ and the goal is to predict the output $y$ given a new point $x$; these settings are called supervised learning

• In the unsupervised setting, we only have unlabelled data $\{x_i\}_{i=1}^n$

• Why bother with unsupervised learning? Can we learn anything of value from unlabelled samples?

Page 4

Unsupervised Learning

• Most of the time, collecting raw data is cheap, but labeling it can be very costly

• The samples might lie in a high-dimensional space; we might find some low-dimensional features that are sufficient to describe the samples

• In the early stages of an investigation, it may be valuable to perform exploratory data analysis and thus gain some insight into the nature or structure of the data

• Clustering is one such unsupervised learning method

Page 5

What is Clustering?

• Roughly speaking, cluster analysis aims to discover distinct clusters or groups of samples such that samples within the same group are more similar to each other than they are to the members of other groups

• Cluster analysis involves three ingredients:

• a dissimilarity (similarity) function between samples

• a criterion to evaluate a grouping of samples into clusters

• an algorithm that optimizes this criterion function

Page 6

Application of Clustering

• Image segmentation: decompose the image into regions with

coherent color and texture inside them

• Search result clustering: group the search result set and

provide a better user interface

• Computational biology: group homologous protein sequences

into families; gene expression data analysis

• Signal processing: compress the signal using a codebook derived from vector quantization (VQ)

Page 7

Outline

• Introduction

• Unsupervised learning

• What is clustering? Applications

• Dissimilarity (similarity) of samples

• Clustering algorithms

• K-means

• Gaussian mixture model (GMM), EM

• Hierarchical clustering

• Spectral clustering

Page 8

Dissimilarity of Samples

• The natural question now is: how should we measure the dissimilarity between samples?

• the results depend on the choice of dissimilarity

• usually chosen from subject-matter considerations

• not necessarily a metric (i.e., the triangle inequality need not hold)

• possible to learn the dissimilarity from data for a particular application (later)

Page 9

Dissimilarity Based on Features

• Most of the time, data have measurements on features $x_i \in \mathbb{R}^p$

• A common choice of dissimilarity function between samples is the Euclidean distance $d(x_i, x_j) = \|x_i - x_j\|_2$

• Clusters defined by Euclidean distance are invariant to translations and rotations in feature space, but not invariant to scaling of the features

• You might consider standardizing the data: translate and scale the features so that all of them have zero mean and unit variance

• BE CAREFUL! This is not always desirable
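For reference, a minimal NumPy sketch of the standardization step itself (the function name and shapes are illustrative, not from the slides):

```python
import numpy as np

def standardize(X):
    """Translate and scale each feature (column) of X to zero mean, unit variance."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0  # leave constant features unchanged
    return (X - mu) / sigma
```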

Page 10

Case Studies

(Figure, left: simulated data, 2-means without standardization.)

(Figure, right: simulated data, 2-means with standardization.)

Page 11

Learning Dissimilarity

• Suppose a user indicates that certain pairs of objects are considered by him to be "similar": $S = \{(x_i, x_j) : x_i \text{ and } x_j \text{ are similar}\}$

• Consider learning a dissimilarity that respects this subjectivity: $d_A(x_i, x_j) = \sqrt{(x_i - x_j)^T A (x_i - x_j)}$

• If A is the identity matrix, this corresponds to the Euclidean distance

• Generally, A parameterizes a family of Mahalanobis distances

• Learning such a dissimilarity is equivalent to finding a rescaling of the data by replacing $x$ with $A^{1/2} x$, and then applying the standard Euclidean distance

Page 12

Learning Dissimilarity

• A simple way to define a criterion for the desired dissimilarity:

$$\min_A \sum_{(x_i, x_j) \in S} \|x_i - x_j\|_A^2 \quad \text{s.t.} \quad \sum_{(x_i, x_j) \in D} \|x_i - x_j\|_A \ge 1, \quad A \succeq 0$$

where D is a set of pairs known to be dissimilar

• A convex optimization problem; it can be solved by gradient descent and iterative projection

• For details, see [Xing, Ng, Jordan, Russell '03]
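A small sketch of the two equivalent views of $d_A$, assuming a PSD matrix A is already given (e.g., produced by the optimization above); names are illustrative:

```python
import numpy as np
from scipy.linalg import sqrtm

def mahalanobis(x, y, A):
    """d_A(x, y) = sqrt((x - y)^T A (x - y)) for a PSD matrix A."""
    d = x - y
    return np.sqrt(d @ A @ d)

def rescale(X, A):
    """Equivalent view: map each row x to A^{1/2} x, then use Euclidean distance."""
    return X @ np.real(sqrtm(A)).T
```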

Page 13

Learning Dissimilarity

Page 14

Outline

• Introduction

• Unsupervised learning

• What is clustering? Applications

• Dissimilarity (similarity) of samples

• Clustering algorithms

• K-means

• Gaussian mixture model (GMM), EM

• Hierarchical clustering

• Spectral clustering

Page 15

K-means: Idea

• Represent the data set in terms of K clusters, each of which is summarized by a prototype $\mu_k$

• Each data point is assigned to one of the K clusters

• Assignments are represented by responsibilities $r_{ik} \in \{0, 1\}$ such that $\sum_{k=1}^K r_{ik} = 1$ for all data indices i

• Example: 4 data points and 3 clusters

Page 16

K-means: Idea

• Criterion function: the sum of squared distances from each data point to its assigned prototype,

$$J = \sum_{i=1}^{n} \sum_{k=1}^{K} r_{ik} \, \|x_i - \mu_k\|^2$$

where the $\mu_k$ are the prototypes, the $r_{ik}$ the responsibilities, and the $x_i$ the data

Page 17

Minimizing the Cost Function

• A chicken-and-egg problem; we have to resort to an iterative method

• E-step: fix the $\mu_k$ and minimize J w.r.t. the $r_{ik}$

• assigns each data point to its nearest prototype

• M-step: fix the $r_{ik}$ and minimize J w.r.t. the $\mu_k$

• this gives $\mu_k = \sum_i r_{ik} x_i \,/\, \sum_i r_{ik}$

• each prototype is set to the mean of the points in that cluster

• Convergence is guaranteed since there are only a finite number of possible settings for the responsibilities

• It may only find a local minimum, so the algorithm should be started with many different initial settings
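To make the alternation concrete, a minimal NumPy sketch of the K-means loop described above; the initialization and stopping rule are illustrative choices, not from the slides:

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Minimal K-means: alternate the E-step (assign) and M-step (re-fit means)."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)]  # init prototypes from the data
    for _ in range(n_iter):
        # E-step: assign each point to its nearest prototype
        d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        z = d.argmin(axis=1)
        # M-step: set each prototype to the mean of its assigned points
        new_mu = np.array([X[z == k].mean(axis=0) if np.any(z == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return mu, z
```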

(Pages 18–26: figure-only slides, presumably illustrating successive K-means iterations.)

Page 27

How to Choose K?

• In some cases it is known a priori from the problem domain

• Generally, it has to be estimated from the data, and in practice it is usually selected by some heuristic

• The cost function J generally decreases with increasing K

• Idea: assume that K* is the right number

• We assume that for K < K* each estimated cluster contains a subset of the true underlying groups

• For K > K* some natural groups must be split

• Thus we expect the cost function to fall substantially up to K*, and only a little thereafter (an "elbow" in the curve of J against K)
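A small sketch of this elbow heuristic, reusing the `kmeans` function from the earlier sketch; the range of K is an illustrative choice:

```python
import numpy as np

def cost(X, mu, z):
    """Cost J: sum of squared distances to the assigned prototypes."""
    return sum(((X[z == k] - mu[k]) ** 2).sum() for k in range(len(mu)))

def elbow_curve(X, K_max=10):
    """J as a function of K; look for the K where the curve stops falling sharply."""
    return [cost(X, *kmeans(X, K)) for K in range(1, K_max + 1)]
```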

Page 28

Limitations of K-means

• Hard assignments of data points to clusters

• Small shift of a data point can flip it to a different cluster

• Solution: replace the hard clustering of K-means with soft probabilistic assignments (GMM)

• Hard to choose the value of K

• As K is increased, the cluster memberships can change in an arbitrary way; the resulting clusterings are not necessarily nested

• Solution: hierarchical clustering

Page 29

The Gaussian Distribution

• Multivariate Gaussian:

$$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$$

with mean $\mu$ and covariance $\Sigma$

• Maximum likelihood estimation:

$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^n x_i, \qquad \hat{\Sigma} = \frac{1}{n}\sum_{i=1}^n (x_i - \hat{\mu})(x_i - \hat{\mu})^T$$

Page 30

Gaussian Mixture

• Linear combination of Gaussians:

$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k), \quad \text{where} \quad \sum_{k=1}^K \pi_k = 1, \quad 0 \le \pi_k \le 1$$

• parameters to be estimated: $\{\pi_k, \mu_k, \Sigma_k\}_{k=1}^K$

(Figure (a))

Page 31

Gaussian Mixture

(Figure (a))

• To generate a data point:

• first pick one of the components with probability $\pi_k$

• then draw a sample from that component distribution

• Each data point is generated by one of the K components; a latent variable $z_i$ is associated with each $x_i$
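A minimal sketch of this generative (ancestral sampling) process; the function and parameter names are illustrative:

```python
import numpy as np

def sample_gmm(pi, mus, Sigmas, n, seed=0):
    """Draw n points: pick component z ~ pi, then sample x ~ N(mu_z, Sigma_z)."""
    rng = np.random.default_rng(seed)
    z = rng.choice(len(pi), size=n, p=pi)  # latent component labels
    X = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])
    return X, z
```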

Page 32

Synthetic Data Set Without Colours

(Figure (b))

Page 33

Fitting the Gaussian Mixture

• Given the complete data set $\{(x_i, z_i)\}_{i=1}^n$

• maximize the complete log likelihood $\sum_{i=1}^n \log \left[ \pi_{z_i} \, \mathcal{N}(x_i \mid \mu_{z_i}, \Sigma_{z_i}) \right]$

• trivial closed-form solution: fit each component to the corresponding set of data points

• Without knowing the values of the latent variables, we have to maximize the incomplete log likelihood:

$$\ell(\theta) = \sum_{i=1}^n \log \sum_{k=1}^K \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)$$

• The sum over components appears inside the logarithm, so there is no closed-form solution

Page 34

EM Algorithm

• E-step: for given parameter values we can compute the expected values of the latent variables (the responsibilities of the data points), by Bayes' rule:

$$\gamma(z_{ik}) = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^K \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}$$

• Note that $\gamma(z_{ik}) \in [0, 1]$ instead of $r_{ik} \in \{0, 1\}$, but we still have $\sum_{k=1}^K \gamma(z_{ik}) = 1$

Page 35

EM Algorithm

• M-step: maximize the expected complete log likelihood

• update parameters (with $N_k = \sum_{i=1}^n \gamma(z_{ik})$):

$$\mu_k = \frac{1}{N_k}\sum_{i=1}^n \gamma(z_{ik})\, x_i, \qquad \Sigma_k = \frac{1}{N_k}\sum_{i=1}^n \gamma(z_{ik})\,(x_i - \mu_k)(x_i - \mu_k)^T, \qquad \pi_k = \frac{N_k}{n}$$

Page 36

EM Algorithm

• Iterate the E-step and M-step until the log likelihood of the data no longer increases

• converges to a local optimum

• need to restart the algorithm with different initial guesses of the parameters (as in K-means)

• Relation to K-means

• Consider a GMM with common covariance $\Sigma_k = \sigma^2 I$

• As $\sigma^2 \to 0$, the two methods coincide
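Putting the E-step and M-step together, a minimal NumPy/SciPy sketch of EM for a GMM; the initialization and the small ridge added to the covariances are illustrative choices, not from the slides:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, seed=0):
    """Minimal EM for a Gaussian mixture; returns parameters and responsibilities."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)
    mus = X[rng.choice(n, K, replace=False)]
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    for _ in range(n_iter):
        # E-step: responsibilities gamma_ik via Bayes' rule
        G = np.column_stack([pi[k] * multivariate_normal.pdf(X, mus[k], Sigmas[k])
                             for k in range(K)])
        G /= G.sum(axis=1, keepdims=True)
        # M-step: weighted maximum-likelihood updates for pi, mu, Sigma
        Nk = G.sum(axis=0)
        pi = Nk / n
        mus = (G.T @ X) / Nk[:, None]
        for k in range(K):
            Xc = X - mus[k]
            Sigmas[k] = (G[:, k, None] * Xc).T @ Xc / Nk[k] + 1e-6 * np.eye(d)
    return pi, mus, Sigmas, G
```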

(Pages 37–42: figure-only slides, presumably illustrating successive EM iterations on the synthetic data set.)

Page 43

Hierarchical Clustering

• Organize the clusters in a hierarchical way

• Produces a rooted (binary) tree (dendrogram)

(Figure: dendrogram over points a, b, c, d, e; agglomerative clustering runs from Step 0 to Step 4, divisive from Step 4 to Step 0.)

Page 44

Hierarchical Clustering

• Two kinds of strategy:

• Bottom-up (agglomerative): recursively merge the two groups with the smallest between-cluster dissimilarity (defined later on)

• Top-down (divisive): in each step, split the least coherent cluster (e.g., the one with the largest diameter); splitting a cluster is itself a clustering problem (usually done in a greedy way); less popular than the bottom-up approach

(Figure: the same dendrogram as on the previous slide.)

Page 45

Hierarchical Clustering

• The user can choose a cut through the hierarchy to represent the most natural division into clusters

• e.g., choose the cut where the intergroup dissimilarity exceeds some threshold

(Figure: the same dendrogram, with example cuts yielding 3 and 2 clusters.)

Page 46

Hierarchical Clustering

• We have to measure the dissimilarity $d(G, H)$ between two disjoint groups G and H; it is computed from the pairwise dissimilarities $d_{ij}$ with $i \in G$, $j \in H$

• Single linkage: $d_{SL}(G, H) = \min_{i \in G,\, j \in H} d_{ij}$; tends to yield extended clusters

• Complete linkage: $d_{CL}(G, H) = \max_{i \in G,\, j \in H} d_{ij}$; tends to yield round clusters

• Group average: $d_{GA}(G, H) = \frac{1}{|G||H|} \sum_{i \in G} \sum_{j \in H} d_{ij}$; a tradeoff between the two
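All three linkages are available off the shelf; a short sketch using SciPy, where the toy data and the cut threshold are illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

X = np.random.default_rng(0).normal(size=(30, 2))  # toy data
D = pdist(X)                                       # pairwise dissimilarities d_ij
# 'single', 'complete', or 'average' correspond to the three linkages above
Z = linkage(D, method='average')
labels = fcluster(Z, t=1.5, criterion='distance')  # cut where dissimilarity > 1.5
```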

Page 47

Example: Human Tumor Microarray Data

• A 6830 × 64 matrix of real numbers

• Rows correspond to genes, columns to tissue samples

• Clustering the rows (genes): one can deduce the functions of unknown genes from known genes with similar expression profiles

• Clustering the columns (samples): one can identify disease profiles; tissues with a similar disease should yield similar expression profiles

(Figure: gene expression matrix.)

Page 48

Example: Human Tumor Microarray Data

• A 6830 × 64 matrix of real numbers

• Group-average (GA) clustering of the microarray data

• applied separately to rows and columns

• subtrees with tighter clusters placed on the left

• produces a more informative picture of genes and samples than the randomly ordered rows and columns

Page 49

Example: Two circles

Page 50

Spectral Clustering [Ng, Jordan, and Weiss '01]

• Given a set of points $S = \{s_1, \dots, s_n\}$, we'd like to cluster them into k clusters

• Form an affinity matrix A where $A_{ij} = \exp(-\|s_i - s_j\|^2 / 2\sigma^2)$ for $i \ne j$ and $A_{ii} = 0$

• Define $L = D^{-1/2} A D^{-1/2}$, where D is the diagonal matrix with $D_{ii} = \sum_j A_{ij}$

• Find the k largest eigenvectors of L and concatenate them columnwise to obtain $X \in \mathbb{R}^{n \times k}$

• Form the matrix Y by normalizing each row of X to have unit length

• Think of the n rows of Y as a new representation of the original n data points; cluster them into k clusters using K-means
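A minimal NumPy sketch of this procedure, reusing the `kmeans` function sketched earlier; the scale parameter `sigma` is an assumption the user must choose:

```python
import numpy as np

def spectral_clustering(S, k, sigma=1.0):
    """Spectral clustering in the style of Ng, Jordan, and Weiss '01 (sketch)."""
    sq = ((S[:, None, :] - S[None, :, :]) ** 2).sum(-1)
    A = np.exp(-sq / (2 * sigma ** 2))            # Gaussian affinities
    np.fill_diagonal(A, 0.0)                      # A_ii = 0
    Dm = 1.0 / np.sqrt(A.sum(axis=1))             # D^{-1/2} as a vector
    L = Dm[:, None] * A * Dm[None, :]             # L = D^{-1/2} A D^{-1/2}
    w, V = np.linalg.eigh(L)                      # eigenvalues in ascending order
    X = V[:, -k:]                                 # k largest eigenvectors
    Y = X / np.linalg.norm(X, axis=1, keepdims=True)
    _, labels = kmeans(Y, k)                      # cluster the rows of Y
    return labels
```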

Page 51

Example: Two circles

Page 52

Had Fun?

Hope you enjoyed…