Clustering
Panagiotis Papapetrou, PhD
Associate Professor, Stockholm University
Adjunct Professor, Aalto University
TODAY…
Nov 4 Introduction to data mining
Nov 5 Association Rules
Nov 10, 14 Clustering and Data Representation
Nov 17 Exercise session 1 (Homework 1 due)
Nov 19 Classification
Nov 24, 26 Similarity Matching and Model Evaluation
Dec 1 Exercise session 2 (Homework 2 due)
Dec 3 Combining Models
Dec 8, 10 Time Series Analysis
Dec 15 Exercise session 3 (Homework 3 due)
Dec 17 Ranking
Jan 13 Review
Jan 14 EXAM
Feb 23 Re-EXAM
What is clustering?
• A grouping of data objects such that the objects within a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups
• Inter-cluster distances are maximized; intra-cluster distances are minimized
Outliers
• Outliers are objects that do not belong to any cluster or form clusters of very small cardinality
• In some applications we are interested in discovering outliers, not clusters (outlier analysis)

[Figure: a cluster and a few outliers]
Applications of clustering?
• Image Processing
– cluster images based on their visual content
• Web
– Cluster groups of users based on their access patterns on webpages
– Cluster webpages based on their content
• Bioinformatics
– Cluster similar proteins together (similarity w.r.t. chemical structure and/or functionality, etc.)
• Many more…
The clustering task
• Group observations into groups so that observations belonging to the same group are similar, whereas observations in different groups are different
• Basic questions:
– What does "similar" mean?
– What is a good partition of the objects? I.e., how is the quality of a solution measured?
– How to find a good partition of the observations?
Observations to cluster
• Usually data objects consist of a set of attributes (also known as dimensions)
• J. Smith, 20, 200K
• If all d dimensions are real-valued then we can visualize the data as points in a d-dimensional space
• If all d dimensions are binary then we can think of each data point as a binary vector
Distance functions for binary vectors

    Q1 Q2 Q3 Q4 Q5 Q6
X    1  0  0  1  1  0
Y    0  1  1  0  1  0

• Jaccard similarity between binary vectors X and Y:

    JSim(X, Y) = |X ∩ Y| / |X ∪ Y|

• Jaccard distance between binary vectors X and Y:

    Jdist(X, Y) = 1 − JSim(X, Y)

• Example (vectors above): |X ∩ Y| = 1 (only Q5) and |X ∪ Y| = 5, so JSim = 1/5 and Jdist = 4/5
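The Jaccard computations can be sketched in Python (a minimal illustration; the function names are mine, not from the lecture):

```python
def jaccard_sim(x, y):
    """Jaccard similarity of two equal-length binary vectors."""
    both = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)   # |X ∩ Y|
    either = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)  # |X ∪ Y|
    return both / either

def jaccard_dist(x, y):
    """Jaccard distance: 1 - Jaccard similarity."""
    return 1 - jaccard_sim(x, y)

# the X and Y from the slide
X = [1, 0, 0, 1, 1, 0]
Y = [0, 1, 1, 0, 1, 0]
print(jaccard_sim(X, Y))   # 0.2  (1/5)
print(jaccard_dist(X, Y))  # 0.8  (4/5)
```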
Distance functions for real-valued vectors
• Lp norms or Minkowski distance:

    Lp(x, y) = ( Σ_{i=1}^{d} |x_i − y_i|^p )^(1/p)

where p is a positive integer
Distance functions for real-valued vectors
• If p = 1, L1 is the Manhattan (or city block) distance:

    L1(x, y) = Σ_{i=1}^{d} |x_i − y_i|

• If p = 2, L2 is the Euclidean distance:

    L2(x, y) = ( |x_1 − y_1|² + |x_2 − y_2|² + … + |x_d − y_d|² )^(1/2)
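All of these distances can be sketched with one Python helper (a minimal illustration):

```python
def minkowski(x, y, p):
    """Lp (Minkowski) distance between two real-valued d-dimensional vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

x, y = (0.0, 0.0), (3.0, 4.0)
print(minkowski(x, y, 1))  # p = 1, Manhattan: 3 + 4 = 7.0
print(minkowski(x, y, 2))  # p = 2, Euclidean: sqrt(9 + 16) = 5.0
```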
Data Structures
• Data matrix (n tuples/objects × d attributes/dimensions):

    [ x_11  …  x_1ℓ  …  x_1d ]
    [  …    …   …    …   …   ]
    [ x_i1  …  x_iℓ  …  x_id ]
    [  …    …   …    …   …   ]
    [ x_n1  …  x_nℓ  …  x_nd ]

• Distance matrix (n objects × n objects; symmetric, so only the lower triangle is stored):

    [   0                          ]
    [ d(2,1)    0                  ]
    [ d(3,1)  d(3,2)    0          ]
    [   …       …       …          ]
    [ d(n,1)  d(n,2)    …      0   ]
Partitioning algorithms
• Basic concept:
– Construct a partition of a set of n objects into a set of k clusters
– Each object belongs to exactly one cluster
– The number of clusters k is given in advance
The k-means problem
• Given a set X of n points in a d-dimensional space and an integer k
• Task: choose a set of k points {c1, c2, …, ck} in the d-dimensional space to form clusters C = {C1, C2, …, Ck} such that

    Cost(C) = Σ_{i=1}^{k} Σ_{x∈Ci} L2²(x, ci) = Σ_{i=1}^{k} Σ_{x∈Ci} (x − ci)²

is minimized
• Some special cases: k = 1, k = n
Algorithmic properties of the k-means problem
• NP-hard if the dimensionality of the data is at least 2 (d ≥ 2)
• Finding the best solution in polynomial time is infeasible
• For d = 1 the problem is solvable in polynomial time
• A simple iterative algorithm works quite well in practice
The k-means algorithm
• Randomly pick k cluster centers {c1, …, ck}
• For each i, set the cluster Ci to be the set of points in X that are closer to ci than they are to cj for all j ≠ i
• For each i, let ci be the center of cluster Ci (mean of the vectors in Ci)
• Repeat until convergence
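The steps above can be sketched in Python (a minimal illustration of this iterative algorithm; keeping a stale center when a cluster becomes empty is just one possible choice):

```python
import random

def kmeans(X, k, iters=100, seed=0):
    """Plain iterative k-means on a list of d-dimensional points (tuples)."""
    rnd = random.Random(seed)
    centers = rnd.sample(X, k)                 # randomly pick k cluster centers
    for _ in range(iters):
        # assignment step: each point joins the cluster of its closest center
        clusters = [[] for _ in range(k)]
        for x in X:
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centers[j])))
            clusters[i].append(x)
        # update step: each center becomes the mean of the vectors in its cluster
        new_centers = [
            tuple(sum(coords) / len(c) for coords in zip(*c)) if c else centers[j]
            for j, c in enumerate(clusters)
        ]
        if new_centers == centers:             # convergence: nothing moved
            break
        centers = new_centers
    return centers, clusters

X = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, clusters = kmeans(X, 2)
print(sorted(centers))
```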
[Figures: K-means Clustering, Steps 1–4. Algorithm: k-means; Distance Metric: Euclidean Distance. Points on a 0–5 × 0–5 grid with three centers k1, k2, k3; at each step points are reassigned to their closest center and the centers are recomputed, until the clusters no longer change.]
Properties of the k-means algorithm
• Finds a local optimum
• Often converges quickly (but not always)
• It is guaranteed to converge after at most k^N iterations (N is the number of objects)
• The choice of initial points can have a large influence on the result
Discussion of the k-means algorithm

[Figures: four points x1–x4 with K = 2 and a bad initial choice of centers c1 and c2. K-means converges immediately, but to a bad result, which gets even worse as the horizontal lines between the points stretch.]
K-means++
• Choose one center uniformly at random from X
• For each data point x, compute d(x)
– the distance between x and the nearest center that has already been chosen
• Choose "the farthest" data point as a new center
– using a weighted probability distribution where a point x is chosen with probability proportional to d(x)²
• Repeat Steps 2 and 3 until k centers have been chosen
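The seeding procedure can be sketched in Python (a minimal illustration, assuming the squared-distance weighting d(x)² of the original k-means++; the function name is mine):

```python
import random

def kmeanspp_centers(X, k, seed=0):
    """k-means++ seeding: pick k initial centers from the points X (tuples)."""
    rnd = random.Random(seed)
    centers = [rnd.choice(X)]                  # step 1: one center uniformly at random
    while len(centers) < k:
        # d(x)^2: squared distance to the nearest already-chosen center
        d2 = [min(sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers)
              for x in X]
        # step 3: sample a point with probability proportional to d(x)^2
        r = rnd.random() * sum(d2)
        acc = 0.0
        for x, w in zip(X, d2):
            acc += w
            if acc >= r:
                centers.append(x)
                break
    return centers

X = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
print(kmeanspp_centers(X, 2))
```

The sampling makes far-away points likely to become the next center, which tends to spread the initial centers across the data.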
K-means
• Limitations?
• What if the space is not known?
• Example: images…

[Figure: the 0–5 × 0–5 grid of points with centers k1, k2, k3]
Clustering images
Clustering images: scientist or not
What are the centroids?
The k-median problem
• Given a set X of n points in a d-dimensional space and an integer k
• Task: choose a set of k points {c1, c2, …, ck} from X and form clusters {C1, C2, …, Ck} such that

    Cost(C) = Σ_{i=1}^{k} Σ_{x∈Ci} L1(x, ci)

is minimized
The k-medoids algorithm
• Or … PAM (Partitioning Around Medoids, 1987)
– Choose randomly k medoids from the original dataset X
– Assign each of the n − k remaining points in X to their closest medoid
– Iteratively replace one of the medoids by one of the non-medoids if it improves the total clustering cost
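A minimal sketch of this swap loop in Python (illustrative only; real PAM implementations compute swap costs incrementally rather than from scratch):

```python
from itertools import product

def clustering_cost(X, medoids, dist):
    """Total cost: each point contributes its distance to its closest medoid."""
    return sum(min(dist(x, m) for m in medoids) for x in X)

def pam(X, k, dist):
    """Minimal PAM: swap a medoid for a non-medoid while the total cost improves."""
    medoids = list(X[:k])                      # crude initialization for the sketch
    improved = True
    while improved:
        improved = False
        for m, x in product(list(medoids), X):
            if x in medoids:
                continue
            candidate = [x if c == m else c for c in medoids]
            if clustering_cost(X, candidate, dist) < clustering_cost(X, medoids, dist):
                medoids = candidate
                improved = True
    return medoids

# 1-D example with an extreme outlier; Manhattan distance
print(pam([1, 2, 3, 4, 100000], 1, lambda a, b: abs(a - b)))  # [3]
```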
Discussion of PAM algorithm
• The algorithm is very similar to the k-means algorithm
• Advantages and disadvantages?
K-means vs. K-medoids
• K-medoids is more flexible:
– Can be used with any distance/similarity measure
– K-means works only with measures that are consistent with the mean (e.g., Euclidean Distance)
• K-medoids is more robust to outliers:
– Example: [1 2 3 4 100000]
– mean = 20002 vs. medoid = 3
• K-medoids is more expensive:
– All pairwise distances are computed at each iteration
CLARA (Clustering Large Applications)
• It draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output
• Strength: deals with larger data sets than PAM
• Weakness:
– Efficiency depends on the sample size
– A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased
What is the right number of clusters?
• …or who sets the value of k?
• For n points to be clustered, consider the case where k = n. What is the value of the error function?
• What happens when k = 1?
• Since we want to minimize the error, why don't we always select k = n?
• X-means:
– based on the Bayesian Information Criterion (BIC)
– Detects the optimal number of clusters
Hierarchical Clustering
• Produces a set of nested clusters organized as a hierarchical tree
• Can be visualized as a dendrogram
– A tree-like diagram that records the sequences of merges or splits

[Figure: a dendrogram over six points, with merge heights between 0.05 and 0.2]
Strengths of Hierarchical Clustering
• No assumptions on the number of clusters
– Any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level
• Hierarchical clusterings may correspond to meaningful taxonomies
– Examples on the web (e.g., product catalogs), etc.
Hierarchical Clustering
• Two main types of hierarchical clustering
– Agglomerative:
• Start with the points as individual clusters
• At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left
– Divisive:
• Start with one, all-inclusive cluster
• At each step, split a cluster until each cluster contains a point (or there are k clusters)
• Traditional hierarchical algorithms use a similarity or distance matrix
– Merge or split one cluster at a time
Complexity of hierarchical clustering
• The distance matrix is used for deciding which clusters to merge/split
• At least quadratic in the number of data points
• Not usable for large datasets
Agglomerative clustering algorithm
• Most popular hierarchical clustering technique
• Basic algorithm:
1. Compute the distance matrix between the input data points
2. Let each data point be a cluster
3. Repeat
4.   Merge the two closest clusters
5.   Update the distance matrix
6. Until only a single cluster remains
• Key operation is the computation of the distance between two clusters
– Different definitions of the distance between clusters lead to different algorithms
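The basic algorithm can be sketched in Python (a deliberately simple, cubic-time illustration; the `linkage` argument is my own device for switching between cluster-distance definitions, e.g. `min` for single-link and `max` for complete-link):

```python
def agglomerative(X, dist, linkage=min):
    """Agglomerative clustering: return the sequence of merges performed."""
    clusters = [[x] for x in X]        # step 2: every point starts as its own cluster
    merges = []
    while len(clusters) > 1:           # steps 3-6: merge until one cluster remains
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # cluster distance under the chosen linkage
                d = linkage(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for t, c in enumerate(clusters) if t not in (i, j)] + [merged]
    return merges

# 1-D toy example: 0 and 1 merge first, then the merged pair meets 5
merges = agglomerative([0.0, 1.0, 5.0], lambda a, b: abs(a - b))
print(merges)
```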
Input / Initial setting
• Start with clusters of individual points and a distance/proximity matrix

[Figure: points p1–p5 (among p1, …, p12) and the distance/proximity matrix indexed by p1, p2, p3, p4, p5, …]
Intermediate State
• After some merging steps, we have some clusters

[Figure: clusters C1–C5 and the distance/proximity matrix indexed by C1, …, C5]
Intermediate State
• Merge the two closest clusters (C2 and C5) and update the distance matrix

[Figure: clusters C1–C5, with C2 and C5 marked for merging, and the distance/proximity matrix indexed by C1, …, C5]
After Merging
• "How do we update the distance matrix?"

[Figure: clusters C1, C3, C4 and the merged cluster C2 ∪ C5; the matrix entries involving C2 ∪ C5 are marked with "?"]
Distance between two clusters
• Each cluster is a set of points
• How do we define the distance between two sets of points?
– Lots of alternatives
– Not an easy task
Distance between two clusters
• Single-link distance between clusters Ci and Cj: the minimum distance (or highest similarity) between any object in Ci and any object in Cj
• The distance is defined by the two most similar objects

    D_sl(Ci, Cj) = min{ d(x, y) | x ∈ Ci, y ∈ Cj }
Single-link clustering: example
• Determined by one pair of points, i.e., by one link in the proximity graph, using the similarity matrix below:

         I1    I2    I3    I4    I5
    I1  1.00  0.90  0.10  0.65  0.20
    I2  0.90  1.00  0.70  0.60  0.50
    I3  0.10  0.70  1.00  0.40  0.30
    I4  0.65  0.60  0.40  1.00  0.80
    I5  0.20  0.50  0.30  0.80  1.00
Strengths of single-link clustering

[Figure: original points vs. the two clusters found]

• Can handle non-elliptical shapes
Limitations of single-link clustering

[Figure: original points vs. the two clusters found]

• Sensitive to noise and outliers
• It produces long, elongated clusters
Distance between two clusters
• Complete-link distance between clusters Ci and Cj: the maximum distance (minimum similarity) between any object in Ci and any object in Cj
• The distance is defined by the two most dissimilar objects

    D_cl(Ci, Cj) = max{ d(x, y) | x ∈ Ci, y ∈ Cj }
Complete-link clustering: example
• Distance between clusters is determined by the two most distant points in the different clusters, using the similarity matrix below:

         I1    I2    I3    I4    I5
    I1  1.00  0.90  0.10  0.65  0.20
    I2  0.90  1.00  0.70  0.60  0.50
    I3  0.10  0.70  1.00  0.40  0.30
    I4  0.65  0.60  0.40  1.00  0.80
    I5  0.20  0.50  0.30  0.80  1.00
Strengths of complete-link clustering

[Figure: original points vs. the two clusters found]

• More balanced clusters (with equal diameter)
• Less susceptible to noise
Limitations of complete-link clustering

[Figure: original points vs. the two clusters found]

• Tends to break large clusters
• All clusters tend to have the same diameter – small clusters are merged with larger ones
Distance between two clusters
• Average-link distance between clusters Ci and Cj: the average distance between any object in Ci and any object in Cj

    D_avg(Ci, Cj) = (1 / (|Ci| × |Cj|)) Σ_{x∈Ci, y∈Cj} d(x, y)
Average-link clustering: example
• Proximity of two clusters is the average of the pairwise proximities between points in the two clusters, using the similarity matrix below:

         I1    I2    I3    I4    I5
    I1  1.00  0.90  0.10  0.65  0.20
    I2  0.90  1.00  0.70  0.60  0.50
    I3  0.10  0.70  1.00  0.40  0.30
    I4  0.65  0.60  0.40  1.00  0.80
    I5  0.20  0.50  0.30  0.80  1.00
Average-link clustering: discussion
• Compromise between Single and Complete Link
• Strengths
– Less susceptible to noise and outliers
• Limitations
– Biased towards globular clusters
Distance between two clusters
• Ward's distance between clusters Ci and Cj: the difference between
– the within-cluster sum of squares resulting from merging the two clusters into cluster Cij, and
– the total within-cluster sum of squares for the two clusters separately
• ri: centroid (mean) of Ci
• rj: centroid (mean) of Cj
• rij: centroid (mean) of Cij

    D_w(Ci, Cj) = Σ_{x∈Cij} (x − rij)² − Σ_{x∈Ci} (x − ri)² − Σ_{x∈Cj} (x − rj)²
Ward’s distance for clusters
• Similar to average-‐link
• Less suscep4ble to noise and outliers
• Biased towards globular clusters
• Hierarchical analogue of k-‐means – Can be used to ini4alize k-‐means
59
Hierarchical Clustering: Time and Space requirements
• For a dataset X consisting of n points
• O(n²) space; it requires storing the distance matrix
• O(n³) time in most of the cases
– There are n steps, and at each step the size-n² distance matrix must be updated and searched
– Complexity can be reduced to O(n² log n) time by using appropriate data structures
How good is a clustering?
• Cluster purity
– Each cluster is assigned to the class label that is most frequent in the cluster
– Purity is measured by counting the number of correctly assigned objects and dividing by the total number of objects
• Example: Purity = (5 + 4 + 3) / 17 = 12 / 17
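Purity can be computed as below (a minimal sketch; the class labels are hypothetical, chosen only to reproduce the slide's majority counts of 5, 4 and 3 out of 17 objects):

```python
from collections import Counter

def purity(clusters):
    """clusters: list of lists of class labels, one inner list per cluster."""
    # each cluster contributes the count of its most frequent (majority) label
    correct = sum(Counter(c).most_common(1)[0][1] for c in clusters)
    total = sum(len(c) for c in clusters)
    return correct / total

# hypothetical label assignment with majorities 5, 4 and 3 out of 17 objects
clusters = [
    ['x'] * 5 + ['o'] * 1,               # majority label 'x', count 5
    ['o'] * 4 + ['x'] * 1 + ['d'] * 1,   # majority label 'o', count 4
    ['d'] * 3 + ['x'] * 2,               # majority label 'd', count 3
]
print(purity(clusters))  # (5 + 4 + 3) / 17
```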
How good is purity?
• Cluster purity
– Simple!
– Intuitive
• Trade-off between the number of clusters and purity?
– If each object belongs to its own cluster, then purity is 1.0
– Is that desirable?
• Purity cannot trade off the quality of the clusters against the number of clusters
• Alternatives:
– Normalized Mutual Information
– Rand Index
What if we don't have class labels?
• a(i): average dissimilarity (distance) of i to all other data within the same cluster
• b(i): the lowest average dissimilarity (distance) of i to any other cluster of which i is not a member
• The Silhouette of object i:

    s(i) = (b(i) − a(i)) / max{a(i), b(i)}

• Values between −1 and 1
– if b(i) >> a(i) then s(i) ≈ 1
– if b(i) << a(i) then s(i) ≈ −1
• Total Silhouette:
– average silhouette value of all objects
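A minimal sketch of the silhouette of one object (assumes each cluster has at least two objects and `dist` is any dissimilarity function; names are mine):

```python
def silhouette(i, clusters, dist):
    """Silhouette of object i, given a list of clusters (lists of points)."""
    own = next(c for c in clusters if i in c)
    # a(i): average distance to the other members of i's own cluster
    a = sum(dist(i, x) for x in own if x != i) / (len(own) - 1)
    # b(i): lowest average distance to any cluster i does not belong to
    b = min(sum(dist(i, x) for x in c) / len(c)
            for c in clusters if c is not own)
    return (b - a) / max(a, b)

clusters = [[0.0, 1.0], [10.0, 11.0]]
d = lambda p, q: abs(p - q)
print(silhouette(0.0, clusters, d))  # a = 1, b = 10.5 -> (10.5 - 1) / 10.5
```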
Correct number of clusters
• The silhouette can be used to identify the correct number of clusters
Clustering aggregation
• Many different clusterings for the same dataset!
– Different objective functions
– Different algorithms
– Different numbers of clusters
• Which clustering is the best?
– Aggregation: we do not need to decide, but rather find a reconciliation between the different outputs
The clustering-aggregation problem
• Input
– n objects X = {x1, x2, …, xn}
– m clusterings of the objects {C1, …, Cm}
– clustering: a collection of disjoint sets that cover X
• Output
– a single clustering C that is as close as possible to all input clusterings
• How do we measure the closeness of clusterings?
– disagreement distance
Disagreement distance
• For two partitions C and P, and objects x, y in X, define

    I_C,P(x, y) = 1 if (C(x) = C(y) and P(x) ≠ P(y)) or (C(x) ≠ C(y) and P(x) = P(y))
    I_C,P(x, y) = 0 otherwise

• if I_C,P(x, y) = 1 we say that x, y create a disagreement between clusterings C and P
• The disagreement distance between C and P is:

    D(C, P) = Σ_{(x,y)} I_C,P(x, y)

• Example:

     U   C  P
    x1   1  1
    x2   1  2
    x3   2  1
    x4   3  3
    x5   3  4

    D(C, P) = 3
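The definition can be checked directly in Python, using the example table above:

```python
from itertools import combinations

def disagreement(C, P):
    """Disagreement distance between two clusterings, given as dicts object -> cluster id."""
    D = 0
    for x, y in combinations(C.keys(), 2):
        same_C = C[x] == C[y]
        same_P = P[x] == P[y]
        if same_C != same_P:   # exactly one of the two clusterings separates x and y
            D += 1
    return D

# the example from the slide
C = {'x1': 1, 'x2': 1, 'x3': 2, 'x4': 3, 'x5': 3}
P = {'x1': 1, 'x2': 2, 'x3': 1, 'x4': 3, 'x5': 4}
print(disagreement(C, P))  # 3
```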
Clustering aggregation
• Given m clusterings C1, …, Cm, find C such that the aggregation cost

    D(C) = Σ_{i=1}^{m} D(C, Ci)

is minimized
• Example:

     U   C1  C2  C3 | C
    x1   1   1   1  | 1
    x2   1   2   2  | 2
    x3   2   1   1  | 1
    x4   2   2   2  | 2
    x5   3   3   3  | 3
    x6   3   4   3  | 3
Why clustering aggregation?
• Clustering categorical data

     U   City         Profession  Nationality
    x1   New York     Doctor      U.S.
    x2   New York     Teacher     Canada
    x3   Boston       Doctor      U.S.
    x4   Boston       Teacher     Canada
    x5   Los Angeles  Lawyer      Mexican
    x6   Los Angeles  Actor       Mexican
Why clustering aggregation?
• Detect outliers
– outliers are defined as points for which there is no consensus among the clusterings
• Improve the robustness of clustering algorithms
– different algorithms have different weaknesses
– combining them can produce a better result
• Privacy-preserving clustering
– different companies have data for the same users
– compute an aggregate clustering without sharing the actual data
Complexity of Clustering Aggregation
• The clustering aggregation problem is NP-hard
– the median partition problem [Barthelemy and LeClerc 1995]
• Look for heuristics and approximate solutions with a guarantee of the form

    ALG(I) ≤ c · OPT(I)
A simple 2-approximation algorithm
• The disagreement distance D(C, P) is a metric
– What does that mean? We will revisit this in a few lectures…
• The algorithm BEST:
– select among the input clusterings the clustering C* that minimizes D(C*)
– this gives a 2-approximate solution
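BEST can be sketched in a few lines of Python (a minimal illustration, reusing the disagreement distance and the three input clusterings from the aggregation example; function names are mine):

```python
def disagreement(C, P):
    """Disagreement distance between two clusterings (dicts: object -> cluster id)."""
    keys = list(C)
    return sum(
        1
        for i in range(len(keys))
        for j in range(i + 1, len(keys))
        if (C[keys[i]] == C[keys[j]]) != (P[keys[i]] == P[keys[j]])
    )

def best_aggregation(clusterings):
    """BEST: return the input clustering with the smallest total disagreement to all inputs."""
    return min(clusterings, key=lambda C: sum(disagreement(C, Ci) for Ci in clusterings))

# the three input clusterings from the aggregation example
C1 = {'x1': 1, 'x2': 1, 'x3': 2, 'x4': 2, 'x5': 3, 'x6': 3}
C2 = {'x1': 1, 'x2': 2, 'x3': 1, 'x4': 2, 'x5': 3, 'x6': 4}
C3 = {'x1': 1, 'x2': 2, 'x3': 1, 'x4': 2, 'x5': 3, 'x6': 3}
print(best_aggregation([C1, C2, C3]))  # C3, which matches the aggregate clustering C
```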
TODOs
• Assignment 1:
– Due next week [Nov. 17 at 11:00am]
– Discussion forum on this assignment now available on iLearn2
• Exercise Session 1:
– Nov. 17 at 11:00am