Cluster Analysis

Mark Stamp

Grouping objects in a meaningful way
  o Clustered data fits together in some way
  o Can help to make sense of (big) data
  o Finds application in many fields
Many different clustering strategies
Overview, then details on 2 methods
  o K-means is simple and can be effective
  o EM clustering is not as simple

Intrinsic vs Extrinsic

Intrinsic clustering relies on unsupervised learning
  o No predetermined labels on objects
  o Apply analysis directly to data
Extrinsic clustering requires category labels
  o Requires pre-processing of data
  o Can be viewed as a form of supervised learning

Agglomerative vs Divisive

Agglomerative
  o Each object starts in its own cluster
  o Clustering merges existing clusters
  o A “bottom up” approach
Divisive
  o All objects start in one cluster
  o Clustering process splits existing clusters
  o A “top down” approach

Hierarchical vs Partitional

Hierarchical clustering
  o “Child” and “parent” clusters
  o Can be viewed as dendrograms
Partitional clustering
  o Partition objects into disjoint clusters
  o No hierarchical relationship
We consider K-means and EM in detail
  o These are both partitional

Hierarchical Clustering

Example of a hierarchical approach...
  1. start: Every point is its own cluster
  2. while number of clusters exceeds 1
       o Find 2 nearest clusters and merge
  3. end while
OK, but no real theoretical basis
  o And some find that “disconcerting”
  o Even K-means has some theory behind it
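The merge loop above can be sketched in Python. This is a minimal illustration, not an optimized implementation. It assumes Euclidean distance between points and single linkage (the distance between two clusters is the smallest distance between any pair of their points); the slide specifies neither. The `num_clusters` parameter is a hypothetical addition that lets the loop stop before everything merges into one cluster.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cluster_distance(c1, c2):
    """Single linkage (an assumption): closest pair of points across clusters."""
    return min(euclidean(p, q) for p in c1 for q in c2)


def agglomerate(points, num_clusters=1):
    """Merge the two nearest clusters until num_clusters remain."""
    clusters = [[p] for p in points]  # start: every point is its own cluster
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters
```

Recording each merge (and its distance) inside the loop would give the information needed to draw a dendrogram.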

Distance

Distance between data points?
Suppose x = (x1,x2,…,xn) and y = (y1,y2,…,yn), where each xi and yi is a real number
Euclidean distance is
  d(x,y) = sqrt((x1-y1)^2 + (x2-y2)^2 + … + (xn-yn)^2)
Manhattan (taxicab) distance is
  d(x,y) = |x1-y1| + |x2-y2| + … + |xn-yn|
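The two distance formulas translate directly into code. The function names here are chosen for illustration only.

```python
import math


def euclidean(x, y):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def manhattan(x, y):
    """Sum of absolute coordinate differences (taxicab distance)."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))
```

For x = (0, 0) and y = (3, 4), the Euclidean distance is 5 and the Manhattan distance is 7.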

Distance

Euclidean distance is the red line
Manhattan distance is the blue or yellow line
  o Or any similar right-angle only path
[Figure: paths between points a and b]

Distance

Lots and lots more distance measures

Other examples include
  o Mahalanobis distance: takes mean and covariance into account
  o Simple substitution distance: measure of “decryption” distance
  o Chi-squared distance: statistical
  o Or just about anything you can think of…

One Clustering Approach

Given data points x1,x2,x3,…,xm
Want to partition into K clusters
  o I.e., each point in exactly one cluster
A centroid is specified for each cluster
  o Let c1,c2,…,cK denote current centroids
Each xi is associated with one centroid
  o Let centroid(xi) be centroid for xi
  o If cj = centroid(xi), then xi is in cluster j

Clustering

Two crucial questions
  1. How to determine centroids, cj?
  2. How to determine clusters, that is, how to assign xi to centroids?
But first, what makes a cluster good?
  o For now, focus on one individual cluster
  o Relationship between clusters later…
What do you think?

Distortion

Intuitively, “compact” clusters are good
  o Depends on data and K, which are given
  o And depends on centroids and assignment of xi to clusters (which we can control)
How to measure this “goodness”?
Define distortion = Σ d(xi,centroid(xi))
  o Where the sum is over all data points xi and d(x,y) is a distance measure
Given K, let’s try to minimize distortion
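Distortion as defined above is a single sum over the data points. The sketch below assumes Euclidean distance for d and represents the assignment as a list of centroid indices; both are illustrative choices, not something the slides fix.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def distortion(points, centroids, assignment, d=euclidean):
    """Sum of d(xi, centroid(xi)) over all points.

    assignment[i] is the index (into centroids) of the centroid for points[i].
    """
    return sum(d(p, centroids[j]) for p, j in zip(points, assignment))
```

Passing a different function for `d` (for example, a squared distance) changes the measure without changing the structure of the sum.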

Distortion

Consider this 2-d data
  o Choose K = 3 clusters
Same data for both
  o Which has smaller distortion?
How to minimize distortion?
  o Good question…

Distortion

Note, distortion depends on K
  o So, should probably write distortionK
Typically, larger K, smaller distortionK
  o Want to minimize distortionK for fixed K
Best choice of K is a different issue
  o Briefly considered later
  o Also consider other measures of goodness
For now, assume K is given and fixed

How to Minimize Distortion?

Given m data points and K …
Min distortion via exhaustive search?
  o Try all possible assignments of m points to K clusters?
  o Too much work for realistic size data set
An approximate solution will have to do
  o Finding the exact solution is an NP-hard problem
Important Note: For minimum distortion…
  o Each xi grouped with nearest centroid
  o Centroid must be center of its group

K-Means

Previous slide implies that we can improve a suboptimal clustering by either…
  1. Re-assign each xi to nearest centroid
  2. Re-compute centroids so they’re centered
No improvement from applying either 1 or 2 more than once in succession
But alternating might be useful
  o In fact, that is the K-means algorithm

K-Means Algorithm

Given dataset…
  1. Select a value for K (how?)
  2. Select initial centroids (how?)
  3. Group data by nearest centroid
  4. Recompute centroids (cluster centers)
  5. If significant change, goto 3; else stop
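Steps 2 through 5 can be sketched as follows. This is one possible reading of the algorithm under stated assumptions: initial centroids are K data points chosen at random, distance is Euclidean, and "no significant change" is taken to mean that no point changed cluster. The `seed` and `max_iters` parameters are hypothetical conveniences.

```python
import math
import random


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def kmeans(points, k, max_iters=100, seed=0):
    """Return (centroids, assignment) after alternating steps 3 and 4."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 2: random data points (assumption)
    assignment = None
    for _ in range(max_iters):
        # Step 3: group each point with its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda j: euclidean(p, centroids[j]))
            for p in points
        ]
        # Step 5: stop when no point changed cluster (assumed stopping rule).
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 4: move each centroid to the mean of its cluster.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # leave the centroid of an empty cluster where it is
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignment
```

Running this with several different seeds and keeping the result with the smallest distortion is a simple way to reduce the dependence on the initial centroids.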

K-Means Animation

Very good animation here
  http://shabal.in/visuals/kmeans/2.html
Nice animations of movement of centroids in different cases here (near bottom of web page)
  http://www.ccs.neu.edu/home/kenb/db/examples/059.html
Other?

K-Means

Are we assured of optimal solution?
  o Definitely not
Why not?
  o For one thing, initial centroid locations are critical
  o There is a (sensitive) dependence on initial conditions
  o This is a common issue in iterative processes (HMM training is an example)

K-Means Initialization

Recall, K is the number of clusters
How to choose K?
No obvious “best” way to do so
But K-means is fast
  o So trial and error may be OK
  o That is, experiment with different K
  o Similar to choosing N in HMM
Is there a better way to choose K?

Optimal K?

Even for trial and error, need a way to measure “goodness” of results

Choosing optimal K is tricky
Most intuitive measures will tend to improve for larger K
But K “too big” may overfit data
So, when is K “big enough”?
  o But not too big…

Schwarz Criterion

Choose K that minimizes
  f(K) = distortionK + λdK log m
  o Where d is the dimension, m is the number of data points, and λ is ???
Recall that distortion depends on K
  o Tends to decrease as K increases
  o Essentially, adding a penalty as K increases
Related to Bayesian Information Criterion (BIC)
  o And some other similar things
Consider choice of K in more detail later…
[Figure: plot of f(K) versus K]
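The criterion can be evaluated once distortionK is known for each candidate K. In the sketch below, the value `lam=1.0` and the use of the natural logarithm are assumptions, since the slide leaves λ and the log base unspecified; the function names are illustrative.

```python
import math


def schwarz_score(distortion_k, k, dim, m, lam=1.0):
    """f(K) = distortion_K + lam * d * K * log(m), with lam and log base assumed."""
    return distortion_k + lam * dim * k * math.log(m)


def choose_k(distortions, dim, m, lam=1.0):
    """Pick the K with the smallest score; distortions maps K -> distortion_K."""
    return min(distortions, key=lambda k: schwarz_score(distortions[k], k, dim, m, lam))
```

As a made-up example, with distortions {1: 100.0, 2: 20.0, 3: 18.0, 4: 17.5} for 100 points in 2 dimensions, the penalty outweighs the small gains beyond K = 2, so `choose_k` selects 2.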


How to choose initial centroids?
Again, no best way to do this
  o Counterexamples to any “best” approach
Often just choose at random
Or uniform/maximum spacing
  o Or some variation on this idea
Other?


In practice, often…
Try several different choices of K
  o For each K, test several initial centroids
Select the result that is best
  o How to measure “best”?
  o We’ll look at that next
May not be very scientific
  o But often works well

K-Means Variations

K-medoids
  o Centroid must be an actual data point
Fuzzy K-means
  o In K-means, any data point is in one cluster and not in any other
  o In fuzzy case, data point can be partly in several different clusters
  o “Degree of membership” vs distance
Many other variations…

Measuring Cluster Quality

How can we judge clustering results?
  o In general, that is, not just for K-means
Compare to typical training/scoring…
  o Suppose we test new scoring method
  o E.g., score malware and benign files
  o Compute ROC curves, AUC, etc.
  o Many tools to measure success/accuracy
Clustering is different (Why? How?)

Clustering Quality

Clustering is a fishing expedition
  o Not sure what we are looking for
  o Hoping to find structure, data discovery
  o If we know answer, no point to clustering
Might find something that’s not there
  o Even random data can be clustered
Some things to consider on next slides
  o Relative to the data to be clustered

Cluster-ability?

Clustering tendency
  o How suitable is dataset for clustering?
  o Which dataset below is cluster-friendly?
  o We can always apply clustering…
  o …but expect better results in some cases

Validation

External validation
  o Compare clusters based on data labels
  o Similar to usual training/scoring scenario
  o Good idea if we know something about data
Internal validation
  o Determine quality based only on clusters
  o E.g., spacing between and within clusters
  o Generally applicable

It’s All Relative

Comparing clustering results
  o That is, compare one clustering result with others for same dataset
  o Would be very useful in practice
  o Often, lots of trial and error
  o Could enable us to “hill climb” to better clustering results…
  o …if we have a way to quantify things

How Many Clusters?

Optimal number of clusters?
  o Already mentioned this wrt K-means
  o But what about the general case?
  o I.e., no reference to clustering technique
  o Can the data tell us how many clusters?
  o Or the topology of the clusters?
Next, we consider several relevant measures

Internal Validation

Direct measurement of clusters
  o Might call it “topological” validation
We’ll consider the following
  o Cluster correlation
  o Similarity matrix
  o Sum of squares error
  o Cohesion and separation
  o Silhouette coefficient

Cluster Correlation

Given data x1,x2,…,xm, and clusters, define 2 matrices

Distance matrix D = {dij}
  o Where dij is distance between xi and xj
Adjacency matrix A = {aij}
  o Where aij is 1 if xi and xj are in same cluster
  o And aij is 0 otherwise
Now what?

Cluster Correlation

Compute correlation between D and A
  rAD = Corr(A,D) = cov(A,D) / (σA σD)
      = Σ(aij-μA)(dij-μD) / sqrt(Σ(aij-μA)^2 Σ(dij-μD)^2)
Can show that r is between -1 and 1
  o If r > 0 then positive correlation (and vice versa)
  o Magnitude is strength of correlation
Strong inverse (negative) correlation implies nearby things are clustered together
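The correlation between A and D can be computed directly from the definitions. This sketch assumes Euclidean distance and, because both matrices are symmetric, sums only over pairs with i < j (leaving out the diagonal); that choice is an assumption, as the slide does not say which entries are included.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cluster_correlation(points, labels):
    """Pearson correlation between adjacency entries aij and distances dij."""
    a_vals, d_vals = [], []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            a_vals.append(1.0 if labels[i] == labels[j] else 0.0)
            d_vals.append(euclidean(points[i], points[j]))
    mu_a = sum(a_vals) / len(a_vals)
    mu_d = sum(d_vals) / len(d_vals)
    num = sum((a - mu_a) * (d - mu_d) for a, d in zip(a_vals, d_vals))
    den = math.sqrt(
        sum((a - mu_a) ** 2 for a in a_vals) * sum((d - mu_d) ** 2 for d in d_vals)
    )
    return num / den
```

For two tight, well-separated groups labeled correctly, the result is close to -1; swapping labels so that distant points share a cluster pushes it positive.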

Correlation

Correlation examples


Similarity Matrix

Form “similarity matrix”
  o Could be based on just about anything
  o Typically, distance matrix D = {dij}, where dij = d(xi,xj)
Group rows and columns by cluster
Heat map for resulting matrix
  o Provides visual representation of similarity within clusters (so look at it…)

Similarity Matrix

Examples
Better than just looking at clusters?
Good for higher dimensions

Residual Sum of Squares

Residual Sum of Squares (RSS)
  o Aka Sum of Squared Errors (SSE)
  o RSS is sum of squared “error” terms
  o Definition of error depends on problem
What is “error” when clustering?
  o Distance from centroid?
  o Then same as distortion (with squared distance as d)
  o But, could use other measures instead

Cohesion and Separation

Cluster cohesion
  o How tightly packed is a cluster
  o The more cohesive a cluster, the better
Cluster separation
  o Distance between clusters
  o The more separation, the better
Can we measure these things?
  o Yes, easily

Notation

Same notation as K-means
  o Let ci, i=1,2,…,K, be cluster centroids
  o Let x1,x2,…,xm be data points
  o Let centroid(xi) be centroid of xi
  o Clusters determined by centroids
Following results apply generally
  o Not just for K-means…

Cohesion

Lots of measures of cohesion
  o Previously defined distortion is useful
  o Recall, distortion = Σ d(xi,centroid(xi))
Can also use distance between all pairs

Separation

Again, many ways to measure this
  o Here, using distances to other centroids
Or distances between all points in clusters
Or distance from centroids to a “midpoint”
Or distance between centroids, or…

Silhouette Coefficient

Essentially, combines cohesion and separation into a single number

Let Ci be cluster of point xi
  o Let a be average of d(xi,y) for all y in Ci
  o For Cj ≠ Ci, let bj be avg d(xi,y) for y in Cj
  o Let b be minimum of bj
Then let S(xi) = (b - a) / max(a,b)
  o What the … ?
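The definition of S(xi) can be computed as below. The sketch makes three assumptions that the slide does not state: distance is Euclidean, xi itself is excluded when averaging over its own cluster, and a point that is alone in its cluster (or has no other cluster to compare against) gets 0.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def silhouette(i, points, labels):
    """S(xi) = (b - a) / max(a, b) for the point at index i."""
    own = labels[i]
    dists = {}  # cluster label -> distances from xi to that cluster's points
    for k, (p, lab) in enumerate(zip(points, labels)):
        if k != i:  # exclude xi itself (assumption)
            dists.setdefault(lab, []).append(euclidean(points[i], p))
    if own not in dists or len(dists) < 2:
        return 0.0  # singleton cluster or only one cluster (assumed convention)
    a = sum(dists[own]) / len(dists[own])
    b = min(sum(d) / len(d) for lab, d in dists.items() if lab != own)
    if max(a, b) == 0:
        return 0.0  # all compared points coincide with xi
    return (b - a) / max(a, b)


def average_silhouette(points, labels):
    """Mean of S(xi) over all points."""
    return sum(silhouette(i, points, labels) for i in range(len(points))) / len(points)
```

When a < b, the value returned equals 1 - a/b.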


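The idea...
[Figure: for point xi, a is the average distance to points in its own cluster, and b is the minimum of the average distances to points in each other cluster]
Usually, S(xi) = 1 - a/b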

Silhouette Coefficient

For given point xi
  o Let a be avg distance to points in its cluster
  o Let b be dist to nearest other cluster (in a sense)
Usually, a < b and hence S(xi) = 1 - a/b
If a is a lot less than b, then S(xi) ≈ 1
  o Points inside cluster much closer together than nearest other cluster (this is good)
If a is almost same as b, then S(xi) ≈ 0
  o Some other cluster is almost as close as things inside cluster (this is bad)


Silhouette coefficient is defined for each point

Avg silhouette coefficient for a cluster
  o Measure of how good a cluster is
Avg silhouette coefficient for all points
  o Measure of clustering “goodness”
What is a good number for coefficient?
  o Rule of thumb on next slide


Average coefficient (to 2 decimal places)
  o 0.71 to 1.00: strong structure found
  o 0.51 to 0.70: reasonable structure found
  o 0.26 to 0.50: weak or artificial structure
  o 0.25 or less: no significant structure
Bottom line on silhouette coefficient
  o Combine cohesion, separation in one number
  o One of most useful measures of quality

External Validation

“External” implies that we measure quality based on data in clusters
  o Not relying on cluster topology (“shape”)
Suppose clustering data is of several different types
  o Say, different malware families
We can compute statistics on clusters
  o We only consider 2 stats here

Entropy and Purity

Entropy
  o Standard measure of uncertainty or randomness
  o High entropy implies clusters less uniform
Purity
  o Another measure of uniformity
  o Ideally, cluster should be more “pure”, that is, more uniform

Entropy

Suppose total of m data elements
  o As usual, x1,x2,…,xm
Denote cluster j as Cj
  o Let mj be number of elements in Cj
  o Let mij be count of type i in cluster Cj
Compute probabilities based on relative frequencies
  o That is, pij = mij / mj

Entropy

Then entropy of cluster Cj is
  Ej = − Σ pij log pij, where sum is over i
Compute entropy Ej for each cluster Cj
Overall (weighted) entropy is then
  E = Σ (mj/m) Ej, where sum is over j from 1 to K and K is number of clusters
Smaller E is better
  o Implies clusters less uncertain/random

Purity

Ideally, each cluster is all one type
Using same notation as in entropy…
  o Purity of Cj defined as Uj = max pij
  o Where max is over i (i.e., different types)
If Uj is 1, then Cj is all one type of data
  o If Uj is near 0, no dominant type
Overall (weighted) purity is
  U = Σ (mj/m) Uj, where sum is over j from 1 to K
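Weighted entropy E and weighted purity U can be computed together from the type labels in each cluster. The sketch below assumes base-2 logarithms (the slides write only "log") and represents each cluster as a list of type labels; both are illustrative choices.

```python
import math
from collections import Counter


def entropy_and_purity(clusters):
    """Return (E, U) where clusters is a list of lists of type labels."""
    m = sum(len(c) for c in clusters)
    total_entropy = 0.0
    total_purity = 0.0
    for c in clusters:
        mj = len(c)
        if mj == 0:
            continue  # an empty cluster contributes nothing
        probs = [mij / mj for mij in Counter(c).values()]  # pij = mij / mj
        ej = -sum(p * math.log2(p) for p in probs)         # Ej (log base 2 assumed)
        uj = max(probs)                                    # Uj
        total_entropy += (mj / m) * ej
        total_purity += (mj / m) * uj
    return total_entropy, total_purity
```

For example, with one cluster of four "a" items and another holding one "a" and three "b" items, the purity is 0.5 x 1 + 0.5 x 0.75 = 0.875.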

Entropy and Purity

Example
  o Based on K-means clustering

EM Clustering

Data might be from different probability distributions
  o If so, “distance” might be poor measure
  o Maybe better to use mean and variance
Cluster on probability distributions?
  o But distributions are unknown…
Expectation maximization (EM)
  o Technique to determine unknown parameters of probability distributions

EM Clustering Animation

Good animation on Wikipedia page
  http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm
Another animation here
  http://www.cs.cmu.edu/~alad/em/
Probably others too…

Coin Experiment

Given 2 biased coins, A and B
  o Randomly select coin
  o Flip selected coin 10 times
  o Repeat 5 times, so 50 total coin flips
Can we determine P(H) for each coin?
Easy, if you know which coin was selected
  o For each coin, just divide number of heads by number of flips of that coin

Coin Example

For example, suppose
  Coin B: HTTTHHTHTH   5 H and 5 T
  Coin A: HHHHTHHHHH   9 H and 1 T
  Coin A: HTHHHHHTHH   8 H and 2 T
  Coin B: HTHTTTHHTT   4 H and 6 T
  Coin A: THHHTHHHTH   7 H and 3 T
Then maximum likelihood estimate is
  PA(H) = 24/30 = 0.80 and PB(H) = 9/20 = 0.45
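When the coin labels are known, the estimate is just a ratio of counts. The short sketch below reproduces the arithmetic for the five sequences listed; the variable and function names are illustrative.

```python
experiments = [
    ("B", "HTTTHHTHTH"),
    ("A", "HHHHTHHHHH"),
    ("A", "HTHHHHHTHH"),
    ("B", "HTHTTTHHTT"),
    ("A", "THHHTHHHTH"),
]


def estimate_heads(experiments):
    """Heads divided by total flips, computed separately for each coin."""
    heads = {}
    flips = {}
    for coin, seq in experiments:
        heads[coin] = heads.get(coin, 0) + seq.count("H")
        flips[coin] = flips.get(coin, 0) + len(seq)
    return {coin: heads[coin] / flips[coin] for coin in heads}


p = estimate_heads(experiments)  # p["A"] is 24/30 and p["B"] is 9/20
```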

Coin Example

Suppose we have same data, but we do not know which coin was selected
  Coin ?: 5 H and 5 T
  Coin ?: 9 H and 1 T
  Coin ?: 8 H and 2 T
  Coin ?: 4 H and 6 T
  Coin ?: 7 H and 3 T
Can we estimate PA(H) and PB(H)?

Coin Example

We do not know which coin was flipped

So, there is “hidden” information
  o This should sound familiar…
Train HMM on sequence of H and T ??
  o Using 2 hidden states
  o Use resulting model to find most likely state sequence (recall, problem 2)
  o Use sequence to estimate PA(H) and PB(H)

Coin Example

HMM is very “heavy artillery”
  o And HMM needs lot of data to converge (or lots of different initializations)
  o No need to work so hard here
EM algorithm
  o Alternate between following 2 steps…
  o Expectation: Recompute expected values
  o Maximization: Recompute max likelihood

EM for Coin Example

Start with a guess (initialization)
  o Say, PA(H) = 0.6 and PB(H) = 0.5
Compute expectations (E-step)
First, from current PA(H) and PB(H), find probability of each coin for each sequence
  5 H, 5 T   P(A) = .45, P(B) = .55
  9 H, 1 T   P(A) = .80, P(B) = .20
  8 H, 2 T   P(A) = .73, P(B) = .27
  4 H, 6 T   P(A) = .35, P(B) = .65
  7 H, 3 T   P(A) = .65, P(B) = .35

E-step for Coin Example

So far, we have
  5 H, 5 T   P(A) = .45, P(B) = .55
  9 H, 1 T   P(A) = .80, P(B) = .20
  8 H, 2 T   P(A) = .73, P(B) = .27
  4 H, 6 T   P(A) = .35, P(B) = .65
  7 H, 3 T   P(A) = .65, P(B) = .35
Next, compute expected (weighted) H and T
For example, in 1st line
  o For A we have 5 x .45 = 2.25 H and T
  o For B we have 5 x .55 = 2.75 H and T


So far, we have
  5 H, 5 T   P(A) = .45, P(B) = .55
  9 H, 1 T   P(A) = .80, P(B) = .20
  8 H, 2 T   P(A) = .73, P(B) = .27
  4 H, 6 T   P(A) = .35, P(B) = .65
  7 H, 3 T   P(A) = .65, P(B) = .35
Compute expected (weighted) H and T
For example, in 2nd line
  o For A, we have 9 x .8 = 7.2 H and 1 x .8 = .8 T
  o For B, we have 9 x .2 = 1.8 H and 1 x .2 = .2 T


Rounded to nearest 0.1:
                                        Coin A         Coin B
  5 H, 5 T   P(A) = .45, P(B) = .55     2.2H 2.2T      2.8H 2.8T
  9 H, 1 T   P(A) = .80, P(B) = .20     7.2H 0.8T      1.8H 0.2T
  8 H, 2 T   P(A) = .73, P(B) = .27     5.9H 1.5T      2.1H 0.5T
  4 H, 6 T   P(A) = .35, P(B) = .65     1.4H 2.1T      2.6H 3.9T
  7 H, 3 T   P(A) = .65, P(B) = .35     4.5H 1.9T      2.5H 1.1T
  totals                                21.2H 8.5T     11.8H 8.5T
This completes E-step
Note: We computed these expected numbers based on current PA(H) and PB(H)

M-step for Coin Example

M-step
Re-estimate PA(H) and PB(H) using results from E-step:
  PA(H) = 21.2/(21.2+8.5) ≈ 0.71
  PB(H) = 11.8/(11.8+8.5) ≈ 0.58
Next? E-step using these probabilities
  o Then M-step, then E-step, then…
  o …until convergence (or we get tired)
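The E-step and M-step for the coin example can be sketched in a few lines. This sketch assumes each coin is equally likely to be selected, so the weight for coin A on a sequence is its likelihood divided by the sum of both likelihoods (the binomial coefficient cancels). The function names and the fixed iteration count are illustrative. Intermediate values are not rounded here, so totals differ slightly from the rounded table while the re-estimates agree to two decimal places.

```python
def e_step(counts, pa, pb):
    """Expected [heads, tails] credited to coin A and to coin B."""
    exp_a = [0.0, 0.0]
    exp_b = [0.0, 0.0]
    for heads, tails in counts:
        like_a = pa ** heads * (1 - pa) ** tails
        like_b = pb ** heads * (1 - pb) ** tails
        w_a = like_a / (like_a + like_b)  # P(A) for this sequence (equal priors assumed)
        w_b = 1.0 - w_a                   # P(B) for this sequence
        exp_a[0] += w_a * heads
        exp_a[1] += w_a * tails
        exp_b[0] += w_b * heads
        exp_b[1] += w_b * tails
    return exp_a, exp_b


def m_step(exp_a, exp_b):
    """Re-estimate PA(H) and PB(H) from the expected counts."""
    return exp_a[0] / sum(exp_a), exp_b[0] / sum(exp_b)


def em(counts, pa, pb, iters=20):
    """Alternate E-step and M-step a fixed number of times."""
    for _ in range(iters):
        pa, pb = m_step(*e_step(counts, pa, pb))
    return pa, pb


counts = [(5, 5), (9, 1), (8, 2), (4, 6), (7, 3)]  # (H, T) per sequence
pa1, pb1 = m_step(*e_step(counts, 0.6, 0.5))       # one iteration from the initial guess
```

Calling `em(counts, 0.6, 0.5)` continues the alternation from the same starting guess.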

EM for Clustering

How is EM relevant to clustering?
Can use EM to obtain parameters of K “hidden” distributions
  o That is, means and variances, μi and σi^2
Then, use μi as centers of clusters
  o And σi (standard deviations) as “radii”
  o Assume Gaussian (normal) distributions
Is this better than K-means?
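As a rough illustration of EM for clustering, the sketch below fits K one-dimensional Gaussians. It is a simplified sketch under several assumptions not taken from the slides: the data are 1-d, each distribution also gets a mixing weight, the caller supplies the initial parameters, a small floor keeps variances away from zero, and the number of iterations is fixed.

```python
import math


def gaussian_pdf(x, mu, var):
    """Density of a normal distribution with mean mu and variance var."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(data, mus, variances, weights, iters=50):
    """Return estimated (means, variances, weights) for K 1-d Gaussians."""
    mus, variances, weights = list(mus), list(variances), list(weights)
    k = len(mus)
    for _ in range(iters):
        # E-step: probability that each point belongs to each distribution.
        resp = []
        for x in data:
            p = [weights[j] * gaussian_pdf(x, mus[j], variances[j]) for j in range(k)]
            total = sum(p)
            resp.append([pj / total for pj in p])
        # M-step: re-estimate parameters from the weighted points.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            mus[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor is an assumption, avoids zero variance
            weights[j] = nj / len(data)
    return mus, variances, weights
```

The resulting means play the role of cluster centers, and the square roots of the variances play the role of "radii".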

EM vs K-Means

Whether it is better or not, EM is obviously different than K-means…
  o …or is it?
Actually, K-means is a special case of EM
  o Using distance instead of probabilities
E-step? Re-assign points to centroids
  o Like “E” in EM, this “re-shapes” clusters
M-step? Recompute centroids

Conclusion

Clustering is fun, entertaining, very useful
  o Can explore mysterious data, and more…
And K-means is really simple
  o EM is powerful and not too difficult either
Measuring success is not so easy
  o Good clusters? And useful information?
  o Or just random noise? Can cluster anything…
Clustering is often a good starting point
  o Help us decide whether any “there” is there
