Page 1

Clustering

Machine Learning Workshop

UC Berkeley

Junming Yin

August 24, 2007

Page 2

Outline

• Introduction

• Unsupervised learning

• What is clustering? Applications

• Dissimilarity (similarity) of samples

• Clustering algorithms

• K-means

• Gaussian mixture model (GMM), EM

• Hierarchical clustering

• Spectral clustering

Page 3

Unsupervised Learning

• Recall that in the setting of classification and regression, the training data are represented as pairs $\{(x_i, y_i)\}_{i=1}^n$ and the goal is to predict the output $y$ given a new point $x$; these settings are called supervised learning

• In the unsupervised setting, we only have unlabelled data $\{x_i\}_{i=1}^n$

• Why bother with unsupervised learning? Can we learn anything of value from unlabelled samples?

Page 4

Unsupervised Learning

• Most of the time, collecting raw data is cheap, but labeling it can be very costly

• The samples might lie in a high-dimensional space; we might find some low-dimensional features that are sufficient to describe the samples

• In the early stages of an investigation, it may be valuable to perform exploratory data analysis and thus gain some insight into the nature or structure of the data

• Clustering is one such unsupervised learning method

Page 5

What is Clustering?

• Roughly speaking, cluster analysis aims to discover distinct clusters or groups of samples such that samples within the same group are more similar to each other than they are to the members of other groups

• Cluster analysis involves three ingredients:

• a dissimilarity (similarity) function between samples

• a criterion to evaluate a grouping of samples into clusters

• an algorithm that optimizes this criterion function

Page 6

Application of Clustering

• Image segmentation: decompose the image into regions with

coherent color and texture inside them

• Search result clustering: group the search result set and

provide a better user interface

• Computational biology: group homologous protein sequences

into families; gene expression data analysis

• Signal processing: compress the signal using a codebook derived from vector quantization (VQ)

Page 7

Outline

• Introduction

• Unsupervised learning

• What is clustering? Applications

• Dissimilarity (similarity) of samples

• Clustering algorithms

• K-means

• Gaussian mixture model (GMM), EM

• Hierarchical clustering

• Spectral clustering

Page 8

Dissimilarity of Samples

• The natural question now is: how should we measure the dissimilarity between samples?

• the results depend on the choice of dissimilarity

• usually chosen from subject-matter considerations

• not necessarily a metric (i.e., the triangle inequality need not hold)

• possible to learn the dissimilarity from data for a particular application (later)

Page 9

Dissimilarity Based on Features

• Most of the time, data have measurements on features $x_i \in \mathbb{R}^p$

• A common choice of dissimilarity function between samples is the Euclidean distance $d(x_i, x_j) = \|x_i - x_j\|_2$

• Clusters defined by Euclidean distance are invariant to translations and rotations in feature space, but not invariant to scaling of the features

• You might consider standardizing the data: translate and scale the features so that all of them have zero mean and unit variance

• BE CAREFUL! This is not always desirable
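For reference, a minimal NumPy sketch of the standardization step itself (the function name and shapes are illustrative, not from the slides):

```python
import numpy as np

def standardize(X):
    """Translate and scale each feature (column) of X to zero mean, unit variance."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0  # leave constant features unchanged
    return (X - mu) / sigma
```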

Page 10

Case Studies

(Figure, left: simulated data, 2-means without standardization.)

(Figure, right: simulated data, 2-means with standardization.)

Page 11

Learning Dissimilarity

• Suppose a user indicates that certain pairs of objects are considered by him to be "similar": $S = \{(x_i, x_j) : x_i \text{ and } x_j \text{ are similar}\}$

• Consider learning a dissimilarity that respects this subjectivity: $d_A(x_i, x_j) = \sqrt{(x_i - x_j)^T A (x_i - x_j)}$

• If A is the identity matrix, this corresponds to the Euclidean distance

• Generally, A parameterizes a family of Mahalanobis distances

• Learning such a dissimilarity is equivalent to finding a rescaling of the data by replacing $x$ with $A^{1/2} x$, and then applying the standard Euclidean distance

Page 12

Learning Dissimilarity

• A simple way to define a criterion for the desired dissimilarity:

$$\min_A \sum_{(x_i, x_j) \in S} \|x_i - x_j\|_A^2 \quad \text{s.t.} \quad \sum_{(x_i, x_j) \in D} \|x_i - x_j\|_A \ge 1, \quad A \succeq 0$$

where D is a set of pairs known to be dissimilar

• A convex optimization problem; it can be solved by gradient descent and iterative projection

• For details, see [Xing, Ng, Jordan, Russell '03]
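A small sketch of the two equivalent views of $d_A$, assuming a PSD matrix A is already given (e.g., produced by the optimization above); names are illustrative:

```python
import numpy as np
from scipy.linalg import sqrtm

def mahalanobis(x, y, A):
    """d_A(x, y) = sqrt((x - y)^T A (x - y)) for a PSD matrix A."""
    d = x - y
    return np.sqrt(d @ A @ d)

def rescale(X, A):
    """Equivalent view: map each row x to A^{1/2} x, then use Euclidean distance."""
    return X @ np.real(sqrtm(A)).T
```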

Page 13

Learning Dissimilarity

Page 14

Outline

• Introduction

• Unsupervised learning

• What is clustering? Applications

• Dissimilarity (similarity) of samples

• Clustering algorithms

• K-means

• Gaussian mixture model (GMM), EM

• Hierarchical clustering

• Spectral clustering

Page 15

K-means: Idea

• Represent the data set in terms of K clusters, each of which is summarized by a prototype $\mu_k$

• Each data point is assigned to one of the K clusters

• Assignments are represented by responsibilities $r_{ik} \in \{0, 1\}$ such that $\sum_{k=1}^K r_{ik} = 1$ for all data indices i

• Example: 4 data points and 3 clusters

Page 16

K-means: Idea

• Criterion function: the sum of squared distances from each data point to its assigned prototype,

$$J = \sum_{i=1}^{n} \sum_{k=1}^{K} r_{ik} \, \|x_i - \mu_k\|^2$$

where the $\mu_k$ are the prototypes, the $r_{ik}$ the responsibilities, and the $x_i$ the data

Page 17

Minimizing the Cost Function

• A chicken-and-egg problem; we have to resort to an iterative method

• E-step: fix the $\mu_k$ and minimize J w.r.t. the $r_{ik}$

• assigns each data point to its nearest prototype

• M-step: fix the $r_{ik}$ and minimize J w.r.t. the $\mu_k$

• this gives $\mu_k = \sum_i r_{ik} x_i \,/\, \sum_i r_{ik}$

• each prototype is set to the mean of the points in that cluster

• Convergence is guaranteed since there are only a finite number of possible settings for the responsibilities

• It may only find a local minimum, so the algorithm should be started with many different initial settings
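To make the alternation concrete, a minimal NumPy sketch of the K-means loop described above; the initialization and stopping rule are illustrative choices, not from the slides:

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Minimal K-means: alternate the E-step (assign) and M-step (re-fit means)."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)]  # init prototypes from the data
    for _ in range(n_iter):
        # E-step: assign each point to its nearest prototype
        d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        z = d.argmin(axis=1)
        # M-step: set each prototype to the mean of its assigned points
        new_mu = np.array([X[z == k].mean(axis=0) if np.any(z == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return mu, z
```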

(Pages 18–26: figure-only slides, presumably illustrating successive K-means iterations.)

Page 27

How to Choose K?

• In some cases it is known a priori from the problem domain

• Generally, it has to be estimated from the data, and in practice it is usually selected by some heuristic

• The cost function J generally decreases with increasing K

• Idea: assume that K* is the right number

• We assume that for K < K* each estimated cluster contains a subset of the true underlying groups

• For K > K* some natural groups must be split

• Thus we expect the cost function to fall substantially up to K*, and only a little thereafter (an "elbow" in the curve of J against K)
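A small sketch of this elbow heuristic, reusing the `kmeans` function from the earlier sketch; the range of K is an illustrative choice:

```python
import numpy as np

def cost(X, mu, z):
    """Cost J: sum of squared distances to the assigned prototypes."""
    return sum(((X[z == k] - mu[k]) ** 2).sum() for k in range(len(mu)))

def elbow_curve(X, K_max=10):
    """J as a function of K; look for the K where the curve stops falling sharply."""
    return [cost(X, *kmeans(X, K)) for K in range(1, K_max + 1)]
```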

Page 28

Limitations of K-means

• Hard assignments of data points to clusters

• Small shift of a data point can flip it to a different cluster

• Solution: replace the hard clustering of K-means with soft probabilistic assignments (GMM)

• Hard to choose the value of K

• As K is increased, the cluster memberships can change in an arbitrary way; the resulting clusterings are not necessarily nested

• Solution: hierarchical clustering

Page 29

The Gaussian Distribution

• Multivariate Gaussian:

$$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$$

with mean $\mu$ and covariance $\Sigma$

• Maximum likelihood estimation:

$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^n x_i, \qquad \hat{\Sigma} = \frac{1}{n}\sum_{i=1}^n (x_i - \hat{\mu})(x_i - \hat{\mu})^T$$

Page 30

Gaussian Mixture

• Linear combination of Gaussians:

$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k), \quad \text{where} \quad \sum_{k=1}^K \pi_k = 1, \quad 0 \le \pi_k \le 1$$

• parameters to be estimated: $\{\pi_k, \mu_k, \Sigma_k\}_{k=1}^K$

(Figure (a))

Page 31

Gaussian Mixture

(Figure (a))

• To generate a data point:

• first pick one of the components with probability $\pi_k$

• then draw a sample from that component distribution

• Each data point is generated by one of the K components; a latent variable $z_i$ is associated with each $x_i$
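A minimal sketch of this generative (ancestral sampling) process; the function and parameter names are illustrative:

```python
import numpy as np

def sample_gmm(pi, mus, Sigmas, n, seed=0):
    """Draw n points: pick component z ~ pi, then sample x ~ N(mu_z, Sigma_z)."""
    rng = np.random.default_rng(seed)
    z = rng.choice(len(pi), size=n, p=pi)  # latent component labels
    X = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])
    return X, z
```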

Page 32

Synthetic Data Set Without Colours

(Figure (b))

Page 33

Fitting the Gaussian Mixture

• Given the complete data set $\{(x_i, z_i)\}_{i=1}^n$

• maximize the complete log likelihood $\sum_{i=1}^n \log \left[ \pi_{z_i} \, \mathcal{N}(x_i \mid \mu_{z_i}, \Sigma_{z_i}) \right]$

• trivial closed-form solution: fit each component to the corresponding set of data points

• Without knowing the values of the latent variables, we have to maximize the incomplete log likelihood:

$$\ell(\theta) = \sum_{i=1}^n \log \sum_{k=1}^K \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)$$

• The sum over components appears inside the logarithm, so there is no closed-form solution

Page 34

EM Algorithm

• E-step: for given parameter values we can compute the expected values of the latent variables (the responsibilities of the data points), by Bayes' rule:

$$\gamma(z_{ik}) = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^K \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}$$

• Note that $\gamma(z_{ik}) \in [0, 1]$ instead of $r_{ik} \in \{0, 1\}$, but we still have $\sum_{k=1}^K \gamma(z_{ik}) = 1$

Page 35

EM Algorithm

• M-step: maximize the expected complete log likelihood

• update parameters (with $N_k = \sum_{i=1}^n \gamma(z_{ik})$):

$$\mu_k = \frac{1}{N_k}\sum_{i=1}^n \gamma(z_{ik})\, x_i, \qquad \Sigma_k = \frac{1}{N_k}\sum_{i=1}^n \gamma(z_{ik})\,(x_i - \mu_k)(x_i - \mu_k)^T, \qquad \pi_k = \frac{N_k}{n}$$

Page 36

EM Algorithm

• Iterate the E-step and M-step until the log likelihood of the data no longer increases

• converges to a local optimum

• need to restart the algorithm with different initial guesses of the parameters (as in K-means)

• Relation to K-means

• Consider a GMM with common covariance $\Sigma_k = \sigma^2 I$

• As $\sigma^2 \to 0$, the two methods coincide
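Putting the E-step and M-step together, a minimal NumPy/SciPy sketch of EM for a GMM; the initialization and the small ridge added to the covariances are illustrative choices, not from the slides:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, seed=0):
    """Minimal EM for a Gaussian mixture; returns parameters and responsibilities."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)
    mus = X[rng.choice(n, K, replace=False)]
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    for _ in range(n_iter):
        # E-step: responsibilities gamma_ik via Bayes' rule
        G = np.column_stack([pi[k] * multivariate_normal.pdf(X, mus[k], Sigmas[k])
                             for k in range(K)])
        G /= G.sum(axis=1, keepdims=True)
        # M-step: weighted maximum-likelihood updates for pi, mu, Sigma
        Nk = G.sum(axis=0)
        pi = Nk / n
        mus = (G.T @ X) / Nk[:, None]
        for k in range(K):
            Xc = X - mus[k]
            Sigmas[k] = (G[:, k, None] * Xc).T @ Xc / Nk[k] + 1e-6 * np.eye(d)
    return pi, mus, Sigmas, G
```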

(Pages 37–42: figure-only slides, presumably illustrating successive EM iterations on the synthetic data set.)

Page 43

Hierarchical Clustering

• Organize the clusters in a hierarchical way

• Produces a rooted (binary) tree (dendrogram)

(Figure: dendrogram over points a, b, c, d, e; agglomerative clustering runs from Step 0 to Step 4, divisive from Step 4 to Step 0.)

Page 44

Hierarchical Clustering

• Two kinds of strategy:

• Bottom-up (agglomerative): recursively merge the two groups with the smallest between-cluster dissimilarity (defined later on)

• Top-down (divisive): in each step, split the least coherent cluster (e.g., the one with the largest diameter); splitting a cluster is itself a clustering problem (usually done in a greedy way); less popular than the bottom-up approach

(Figure: the same dendrogram as on the previous slide.)

Page 45

Hierarchical Clustering

• The user can choose a cut through the hierarchy to represent the most natural division into clusters

• e.g., choose the cut where the intergroup dissimilarity exceeds some threshold

(Figure: the same dendrogram, with example cuts yielding 3 and 2 clusters.)

Page 46

Hierarchical Clustering

• We have to measure the dissimilarity $d(G, H)$ between two disjoint groups G and H; it is computed from the pairwise dissimilarities $d_{ij}$ with $i \in G$, $j \in H$

• Single linkage: $d_{SL}(G, H) = \min_{i \in G,\, j \in H} d_{ij}$; tends to yield extended clusters

• Complete linkage: $d_{CL}(G, H) = \max_{i \in G,\, j \in H} d_{ij}$; tends to yield round clusters

• Group average: $d_{GA}(G, H) = \frac{1}{|G||H|} \sum_{i \in G} \sum_{j \in H} d_{ij}$; a tradeoff between the two
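All three linkages are available off the shelf; a short sketch using SciPy, where the toy data and the cut threshold are illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

X = np.random.default_rng(0).normal(size=(30, 2))  # toy data
D = pdist(X)                                       # pairwise dissimilarities d_ij
# 'single', 'complete', or 'average' correspond to the three linkages above
Z = linkage(D, method='average')
labels = fcluster(Z, t=1.5, criterion='distance')  # cut where dissimilarity > 1.5
```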

Page 47

Example: Human Tumor Microarray Data

• A 6830 × 64 matrix of real numbers

• Rows correspond to genes, columns to tissue samples

• Clustering the rows (genes): one can deduce the functions of unknown genes from known genes with similar expression profiles

• Clustering the columns (samples): one can identify disease profiles; tissues with a similar disease should yield similar expression profiles

(Figure: gene expression matrix.)

Page 48

Example: Human Tumor Microarray Data

• A 6830 × 64 matrix of real numbers

• Group-average (GA) clustering of the microarray data

• applied separately to rows and columns

• subtrees with tighter clusters placed on the left

• produces a more informative picture of genes and samples than the randomly ordered rows and columns

Page 49

Example: Two circles

Page 50

Spectral Clustering [Ng, Jordan, and Weiss '01]

• Given a set of points $S = \{s_1, \dots, s_n\}$, we'd like to cluster them into k clusters

• Form an affinity matrix A where $A_{ij} = \exp(-\|s_i - s_j\|^2 / 2\sigma^2)$ for $i \ne j$ and $A_{ii} = 0$

• Define $L = D^{-1/2} A D^{-1/2}$, where D is the diagonal matrix with $D_{ii} = \sum_j A_{ij}$

• Find the k largest eigenvectors of L and concatenate them columnwise to obtain $X \in \mathbb{R}^{n \times k}$

• Form the matrix Y by normalizing each row of X to have unit length

• Think of the n rows of Y as a new representation of the original n data points; cluster them into k clusters using K-means
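A minimal NumPy sketch of this procedure, reusing the `kmeans` function sketched earlier; the scale parameter `sigma` is an assumption the user must choose:

```python
import numpy as np

def spectral_clustering(S, k, sigma=1.0):
    """Spectral clustering in the style of Ng, Jordan, and Weiss '01 (sketch)."""
    sq = ((S[:, None, :] - S[None, :, :]) ** 2).sum(-1)
    A = np.exp(-sq / (2 * sigma ** 2))            # Gaussian affinities
    np.fill_diagonal(A, 0.0)                      # A_ii = 0
    Dm = 1.0 / np.sqrt(A.sum(axis=1))             # D^{-1/2} as a vector
    L = Dm[:, None] * A * Dm[None, :]             # L = D^{-1/2} A D^{-1/2}
    w, V = np.linalg.eigh(L)                      # eigenvalues in ascending order
    X = V[:, -k:]                                 # k largest eigenvectors
    Y = X / np.linalg.norm(X, axis=1, keepdims=True)
    _, labels = kmeans(Y, k)                      # cluster the rows of Y
    return labels
```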

Page 51

Example: Two circles

Page 52

Had Fun?

Hope you enjoyed…