AMCS/CS229: Machine Learning

Clustering 2

Xiangliang Zhang

King Abdullah University of Science and Technology

Cluster Analysis

1. Partitioning Methods + EM algorithm

2. Hierarchical Methods

3. Density-Based Methods

4. Clustering quality evaluation

5. How to decide the number of clusters?

6. Summary

Xiangliang Zhang, KAUST AMCS/CS229: Machine Learning

The quality of Clustering

• For supervised classification we have a variety of measures to evaluate how good our model is
  – Accuracy, precision, recall

• For cluster analysis, the analogous question is how to evaluate the “goodness” of the resulting clusters.

• But “clusters are in the eye of the beholder”!

• Then why do we want to evaluate them?
  – To avoid finding patterns in noise
  – To compare clustering algorithms
  – To compare two sets of clusters
  – To compare two clusters

Measures of Cluster Validity

Numerical measures that are applied to judge various aspects of cluster validity are classified into the following two types:

External Index: Used to measure the extent to which cluster labels match externally supplied class labels.

• Purity, Normalized Mutual Information

Internal Index: Used to measure the goodness of a clustering structure without respect to external information.

• Sum of Squared Error (SSE)

• Cophenetic correlation coefficient, silhouette

http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html

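Cluster Validity: External Index

The class labels are externally supplied (q classes)

Purity: Larger purity values indicate better clustering solutions.

• Purity of each cluster $C_r$ of size $n_r$, where $n_r^i$ is the number of points of class $i$ in $C_r$:

  $P(C_r) = \frac{1}{n_r} \max_{i} n_r^i$

• Purity of the entire clustering of $n$ points into $k$ clusters:

  $\mathrm{Purity} = \sum_{r=1}^{k} \frac{n_r}{n} P(C_r)$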
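As a rough illustration of purity, the sketch below counts the majority class in each cluster and weights each cluster by its size. The function name `purity` and the toy labels are made up for this example and do not come from the lecture.

```python
from collections import Counter

def purity(cluster_labels, class_labels):
    """Return (per-cluster purity dict, overall purity)."""
    n = len(cluster_labels)
    # Group the external class labels by the cluster each point was assigned to.
    members = {}
    for c, y in zip(cluster_labels, class_labels):
        members.setdefault(c, []).append(y)
    per_cluster = {}
    overall = 0.0
    for c, ys in members.items():
        majority = Counter(ys).most_common(1)[0][1]  # size of the largest class in the cluster
        per_cluster[c] = majority / len(ys)          # purity of this cluster
        overall += majority / n                      # size-weighted contribution to overall purity
    return per_cluster, overall

# Hypothetical toy data: 10 points, 3 clusters, 3 classes.
clusters = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
classes = ["a", "a", "b", "b", "b", "b", "c", "c", "c", "a"]
per_cluster, overall = purity(clusters, classes)
print(per_cluster, overall)
```

Note that purity tends to grow with the number of clusters (one point per cluster gives purity 1), so it is usually read together with another index.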

NMI (Normalized Mutual Information): Larger NMI values indicate better clustering solutions.

NMI is computed from I, the mutual information between the cluster labels and the class labels, and H, the entropy of each labeling.
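The slide's NMI formula did not survive in this transcript. The sketch below assumes one common normalization, the one used in the IR-book chapter linked above: mutual information divided by the arithmetic mean of the two entropies. Other normalizations exist (for example, the geometric mean of the entropies), so treat this as an illustration rather than the lecture's exact definition. All function names are made up for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    """Mutual information between two labelings of the same points."""
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((nij / n) * math.log(n * nij / (count_a[i] * count_b[j]))
               for (i, j), nij in count_ab.items())

def nmi(a, b):
    """Assumed normalization: I(a; b) / ((H(a) + H(b)) / 2)."""
    h = (entropy(a) + entropy(b)) / 2
    if h == 0:
        return 1.0  # convention chosen here: both labelings are a single group
    return mutual_information(a, b) / h

# Same partition under different label names, then an unrelated partition.
nmi_same = nmi([0, 0, 1, 1], ["x", "x", "y", "y"])
nmi_unrelated = nmi([0, 0, 1, 1], ["x", "y", "x", "y"])
print(nmi_same, nmi_unrelated)
```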

Internal Measures: SSE

Internal Index: Used to measure the goodness of a clustering structure without respect to external information.

• SSE is good for comparing two clustering results
  – average SSE
  – SSE curves w.r.t. various K

• Can also be used to estimate the number of clusters

[Figure: SSE curve for varying K]
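A minimal sketch of SSE, assuming Euclidean distance and cluster centroids taken as coordinate-wise means. The function name `sse` and the four toy points are hypothetical. The example also shows why SSE alone cannot pick K: it keeps decreasing as K grows and reaches 0 when every point is its own cluster.

```python
def sse(points, labels):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    groups = {}
    for p, c in zip(points, labels):
        groups.setdefault(c, []).append(p)
    total = 0.0
    for members in groups.values():
        dim = len(members[0])
        # The centroid is the coordinate-wise mean of the cluster members.
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        total += sum((p[d] - centroid[d]) ** 2 for p in members for d in range(dim))
    return total

# Two tight pairs of points, far apart from each other.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
sse_k1 = sse(points, [0, 0, 0, 0])  # everything in one cluster
sse_k2 = sse(points, [0, 0, 1, 1])  # the natural two clusters
sse_k4 = sse(points, [0, 1, 2, 3])  # one point per cluster
print(sse_k1, sse_k2, sse_k4)
```

The large drop from K = 1 to K = 2, followed by a small drop afterwards, is the kind of "elbow" one looks for in an SSE curve.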

Internal Measures: Cophenetic correlation coefficient

Cophenetic correlation coefficient: a measure of how faithfully a dendrogram preserves the pairwise distances between the original data points.

• Compare two hierarchical clusterings of the data

• Compute the correlation coefficient between Dist (the original pairwise distances) and CP (the cophenetic distances, i.e. the dendrogram height at which two points are first merged)

Matlab functions: cophenet
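To make the computation concrete, here is a small pure-Python sketch. It assumes single-linkage agglomerative clustering (the lecture does not fix the linkage here) and uses the Pearson correlation. The function names and the four toy points are made up for the example; this is not a reimplementation of Matlab's `cophenet`.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def cophenetic_correlation(dist):
    """dist: symmetric n x n distance matrix. Returns (coefficient, cophenetic matrix)."""
    n = len(dist)
    clusters = [[i] for i in range(n)]
    coph = [[0.0] * n for _ in range(n)]
    while len(clusters) > 1:
        # Single linkage (assumed): merge the two clusters whose closest members are nearest.
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        height, a, b = best
        # Points joined by this merge get the merge height as their cophenetic distance.
        for i in clusters[a]:
            for j in clusters[b]:
                coph[i][j] = coph[j][i] = height
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b
    pairs = list(combinations(range(n), 2))
    original = [dist[i][j] for i, j in pairs]
    cophenetic = [coph[i][j] for i, j in pairs]
    return pearson(original, cophenetic), coph

# Four hypothetical points on a line.
xs = [0.0, 1.0, 5.0, 7.0]
dist = [[abs(p - q) for q in xs] for p in xs]
c, coph = cophenetic_correlation(dist)
print(round(c, 3))
```

A coefficient close to 1 means the dendrogram heights track the original distances well; comparing the coefficient across linkages is one way to compare two hierarchical clusterings of the same data.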

Internal Measures: Cohesion and Separation

• Cluster cohesion measures how closely related the objects in a cluster are
  = SSE, or the sum of the weights of all links within a cluster.

• Cluster separation measures how distinct or well-separated a cluster is from other clusters
  = sum of the weights of the links between nodes in the cluster and nodes outside the cluster.

[Figure: cohesion and separation]
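The graph-based definitions can be sketched as follows, assuming the data are given as a symmetric matrix of link weights (for example, similarities). The function name and the 4-node weight matrix are hypothetical.

```python
def cohesion_separation(weights, labels, cluster):
    """Graph-based cohesion and separation of one cluster.

    weights: symmetric matrix of link weights between nodes.
    labels:  cluster label of each node.
    """
    n = len(labels)
    inside = [i for i in range(n) if labels[i] == cluster]
    outside = [i for i in range(n) if labels[i] != cluster]
    # Cohesion: each within-cluster link counted once (i < j).
    cohesion = sum(weights[i][j] for i in inside for j in inside if i < j)
    # Separation: links that cross the cluster boundary.
    separation = sum(weights[i][j] for i in inside for j in outside)
    return cohesion, separation

# Hypothetical 4-node graph with two clusters {0, 1} and {2, 3}.
w = [
    [0, 5, 1, 0],
    [5, 0, 2, 1],
    [1, 2, 0, 4],
    [0, 1, 4, 0],
]
labels = [0, 0, 1, 1]
coh0, sep0 = cohesion_separation(w, labels, 0)
coh1, sep1 = cohesion_separation(w, labels, 1)
print(coh0, sep0, coh1, sep1)
```

With similarity weights, a good cluster has a large within-cluster sum and a small sum across the boundary.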

Internal Measures: Silhouette Coefficient

• Silhouette Coefficient combines ideas of both cohesion and separation

• For an individual point i:
  – Calculate a = average distance of i to the other points in its cluster
  – Calculate b = min (average distance of i to the points in another cluster)
  – The silhouette coefficient for the point is then given by

    $s = \frac{b - a}{\max(a, b)}$

    o The value lies in [-1, 1]; it is typically between 0 and 1.
    o The closer to 1 the better.

• Can calculate the Average Silhouette width for a cluster or a clustering

Matlab functions: silhouette
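A minimal pure-Python sketch of the per-point silhouette computation, assuming Euclidean distance, at least two clusters, and the common convention that a point alone in its cluster gets the value 0. The function name and toy points are made up for the example; it is not Matlab's `silhouette`.

```python
import math

def silhouette_values(points, labels):
    """Silhouette value s = (b - a) / max(a, b) for every point."""
    n = len(points)
    values = []
    for i in range(n):
        sums, counts = {}, {}
        # Accumulate distances from point i to the points of each cluster.
        for j in range(n):
            if j == i:
                continue
            c = labels[j]
            sums[c] = sums.get(c, 0.0) + math.dist(points[i], points[j])
            counts[c] = counts.get(c, 0) + 1
        own = labels[i]
        if own not in counts:
            values.append(0.0)  # assumed convention for a singleton cluster
            continue
        a = sums[own] / counts[own]                             # mean distance within own cluster
        b = min(sums[c] / counts[c] for c in sums if c != own)  # nearest other cluster
        values.append((b - a) / max(a, b))
    return values

# Two well-separated pairs of points.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
labels = [0, 0, 1, 1]
values = silhouette_values(points, labels)
mean_silhouette = sum(values) / len(values)
print(values, mean_silhouette)
```

Averaging the values over one cluster gives that cluster's silhouette width; averaging over all points gives a single score for the clustering.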

Determine number of clusters by Silhouette Coefficient

Compare different clusterings by their average silhouette values.

[Figure: silhouette plot for K = 4, mean(silh) = 0.640]

Determine the number of clusters

1. Select the number K of clusters as the one maximizing the average silhouette value over all points

2. Optimize an objective criterion
   – Gap statistic of the decrease of SSE w.r.t. K

3. Model-based method:
   • optimize a global criterion (e.g. the maximum likelihood of the data)

4. Use clustering methods that do not need K to be set, e.g., DBSCAN

5. Prior knowledge
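Option 1 in the list above can be sketched as follows. The example assumes the candidate clusterings for each K have already been produced by some algorithm; here they are written by hand for six hypothetical 1-D points. Absolute difference is used as the distance, and points in singleton clusters contribute 0.

```python
def mean_silhouette(xs, labels):
    """Average silhouette value for 1-D points xs under the given labels."""
    total = 0.0
    for i, xi in enumerate(xs):
        # Distances from point i to the other points, grouped by cluster.
        dists = {}
        for j, xj in enumerate(xs):
            if j != i:
                dists.setdefault(labels[j], []).append(abs(xi - xj))
        own = dists.pop(labels[i], None)
        if own is None or not dists:
            continue  # singleton cluster, or only one cluster: contributes 0
        a = sum(own) / len(own)
        b = min(sum(d) / len(d) for d in dists.values())
        total += (b - a) / max(a, b)
    return total / len(xs)

# Three natural groups; candidate labelings for K = 2, 3, 4 (hand-written).
xs = [0, 1, 10, 11, 20, 21]
candidates = {
    2: [0, 0, 1, 1, 1, 1],
    3: [0, 0, 1, 1, 2, 2],
    4: [0, 0, 1, 1, 2, 3],
}
scores = {k: mean_silhouette(xs, labels) for k, labels in candidates.items()}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Unlike SSE, the average silhouette does not improve automatically as K grows, which is why it can be maximized directly.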

Cluster Analysis

6. Summary

Clustering VS Classification

Problems and Challenges

• Considerable progress has been made in scalable clustering methods
  – Partitioning: k-means, k-medoids, CLARANS
  – Hierarchical: BIRCH, ROCK, CHAMELEON
  – Density-based: DBSCAN, OPTICS, DenClue
  – Grid-based: STING, WaveCluster, CLIQUE
  – Model-based: EM, SOM
  – Spectral clustering
  – Affinity Propagation
  – Frequent pattern-based: Bi-clustering, pCluster

• Current clustering techniques do not address all the requirements adequately; clustering is still an active area of research

Cluster Analysis

Open issues in clustering

1. Clustering quality evaluation

2. How to decide the number of clusters?

What you should know

• What is clustering?

• How does k-means work?

• What is the difference between k-means and k-medoids?

• What is the EM algorithm? How does it work?

• What is the relationship between k-means and EM?

• How is inter-cluster similarity defined in hierarchical clustering? What options do you have?

• How does DBSCAN work?

• What are the advantages and disadvantages of DBSCAN?

• How can clustering results be evaluated?

• How is the number of clusters usually decided?

• What are the main differences between clustering and classification?
