Page 1

Clustering CE-324: Modern Information Retrieval Sharif University of Technology

M. Soleymani, Fall 2018

Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)

Page 2

What is clustering?

- Clustering: grouping a set of objects into similar ones
  - Docs within a cluster should be similar.
  - Docs from different clusters should be dissimilar.
- The commonest form of unsupervised learning
  - Unsupervised learning: learning from raw data, as opposed to supervised learning, where a classification of examples is given
- A common and important task that finds many applications in IR and other places

Ch. 16

Page 3

A data set with clear cluster structure

- How would you design an algorithm for finding the three clusters in this case?

Ch. 16

Page 4

Applications of clustering in IR

- For better navigation of search results
  - Effective "user recall" will be higher
- Whole corpus analysis/navigation
  - Better user interface: search without typing
- For improving recall in search applications
  - Better search results (like pseudo RF)
- For speeding up vector space retrieval
  - Cluster-based retrieval gives faster search

Sec. 16.1

Page 5

Applications of clustering in IR

Page 6

Search result clustering

Page 7

yippy.com – grouping search results

Page 8

Clustering the collection


- Cluster-based navigation is an interesting alternative to keyword searching (i.e., the standard IR paradigm)
- Users may prefer browsing over searching when they are unsure about which terms to use
- Well suited to a collection of news stories
  - News reading is not really search, but rather a process of selecting a subset of stories about recent events

Page 9

Yahoo! Hierarchy isn't clustering but is the kind of output you want from clustering

[Figure: a slice of the www.yahoo.com/Science directory. Top-level categories such as agriculture, biology, physics, CS, and space (about 30 in all) branch into subcategories, e.g. dairy, crops, agronomy, and forestry under agriculture; botany, cell, and evolution under biology; magnetism and relativity under physics; AI, HCI, and courses under CS; craft and missions under space.]

Page 10

Google News: automatic clustering gives an effective news presentation metaphor

Page 11

Page 12

To improve efficiency and effectiveness of the search system

- Improve language modeling: replace the collection model used for smoothing by a model derived from the doc's cluster
- Clustering can speed up search (via an inexact algorithm)
- Clustering can improve recall

Page 13

For improving search recall

- Cluster hypothesis: docs in the same cluster behave similarly with respect to relevance to information needs
- Therefore, to improve search recall:
  - Cluster the docs in the corpus a priori
  - When a query matches a doc 𝑑, also return other docs in the cluster containing 𝑑
- Query "car": also return docs containing "automobile"
  - Because clustering grouped together docs containing "car" with those containing "automobile". Why might this happen?

Sec. 16.1

Page 14

Issues for clustering

- Representation for clustering
  - Doc representation: vector space? Normalization?
    - Centroids aren't length normalized
  - Need a notion of similarity/distance
- How many clusters?
  - Fixed a priori?
  - Completely data driven?
  - Avoid "trivial" clusters, too large or too small
    - Too large: for navigation purposes you've wasted an extra user click without whittling down the set of docs much

Sec. 16.2

Page 15

Notion of similarity/distance

- Ideal: semantic similarity
- Practical: term-statistical similarity
  - We will use cosine similarity.
- For many algorithms, it is easier to think in terms of a distance (rather than a similarity)
  - We will mostly speak of Euclidean distance, but real implementations use cosine similarity

Page 16

Clustering algorithms categorization

- Flat algorithms (e.g., k-means)
  - Usually start with a random (partial) partitioning
  - Refine it iteratively
- Hierarchical algorithms
  - Bottom-up, agglomerative
  - Top-down, divisive

Page 17

Hard vs. soft clustering

- Hard clustering: each doc belongs to exactly one cluster
  - More common and easier to do
- Soft clustering: a doc can belong to more than one cluster

Page 18

Partitioning algorithms

- Construct a partition of 𝑁 docs into 𝐾 clusters
- Given: a set of docs and the number 𝐾
- Find: a partition of the docs into 𝐾 clusters that optimizes the chosen partitioning criterion
  - Finding a global optimum is intractable for many clustering objective functions
  - Effective heuristic methods: K-means and K-medoids algorithms

Page 19

K-means

- Assumes docs are real-valued vectors $\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(N)}$.
- Clusters based on centroids (aka the center of gravity or mean) of the points in a cluster:

$$\boldsymbol{\mu}_j = \frac{1}{|\mathcal{C}_j|} \sum_{\mathbf{x}^{(n)} \in \mathcal{C}_j} \mathbf{x}^{(n)}$$

- K-means cost function:

$$J(\mathcal{C}) = \sum_{j=1}^{K} \sum_{\mathbf{x}^{(n)} \in \mathcal{C}_j} \bigl\lVert \mathbf{x}^{(n)} - \boldsymbol{\mu}_j \bigr\rVert^2$$

where $\mathcal{C} = \{\mathcal{C}_1, \mathcal{C}_2, \dots, \mathcal{C}_K\}$ and $\mathcal{C}_j$ is the set of data points assigned to the $j$-th cluster.

Sec. 16.4

Page 20

K-means algorithm

Select K random points $\{\boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \dots, \boldsymbol{\mu}_K\}$ as the clusters' initial centroids.
Until the clustering converges (or another stopping criterion holds):
    For each doc $\mathbf{x}^{(n)}$: assign $\mathbf{x}^{(n)}$ to the cluster $\mathcal{C}_j$ whose centroid minimizes $\mathrm{dist}(\mathbf{x}^{(n)}, \boldsymbol{\mu}_j)$.
    For each cluster $\mathcal{C}_j$: $\boldsymbol{\mu}_j = \frac{1}{|\mathcal{C}_j|} \sum_{\mathbf{x}^{(n)} \in \mathcal{C}_j} \mathbf{x}^{(n)}$

Sec. 16.4

Reassignment of instances to clusters is based on distance to the current cluster centroids (it can equivalently be expressed in terms of similarities).
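The loop above translates almost line-for-line into NumPy. A minimal sketch (my own illustration, not the course's reference code): ties break toward the lower cluster index via argmin, and the stopping criterion is "doc partition unchanged".

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Lloyd's K-means on the rows of X (an N x M array).
    Returns (assignments, centroids)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids with K distinct random documents.
    centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    assign = np.full(len(X), -1)
    for _ in range(max_iter):
        # Reassignment: each doc goes to its nearest centroid;
        # argmin breaks ties in favor of the lower cluster index.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        if np.array_equal(new_assign, assign):  # partition unchanged: converged
            break
        assign = new_assign
        # Recomputation: each centroid becomes the mean of its members.
        for j in range(K):
            members = X[assign == j]
            if len(members):                    # keep the old centroid if empty
                centroids[j] = members.mean(axis=0)
    return assign, centroids
```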

Page 21

[Figure from Bishop: successive K-means iterations on a sample data set.]

Page 22

Page 23

Termination conditions

Several possibilities for the termination condition, e.g.:

- A fixed number of iterations
- Doc partition unchanged
- $J < \theta$: the cost function falls below a threshold
- $\Delta J < \theta$: the decrease in the cost function (between two successive iterations) falls below a threshold

Sec. 16.4

Page 24

Convergence of K-means

- The K-means algorithm always reaches a fixed point in which the clusters don't change.
- We must use tie-breaking when a sample has the same distance from two or more clusters (e.g., by assigning it to the cluster with the lowest index)

Sec. 16.4

Page 25

K-means decreases $J(\mathcal{C})$ in each iteration (before convergence)

- First, reassignment monotonically decreases $J(\mathcal{C})$, since each vector is assigned to the closest centroid.
- Second, recomputation monotonically decreases each $\sum_{n \in \mathcal{C}_k} \lVert \mathbf{x}^{(n)} - \boldsymbol{\mu}_k \rVert^2$:
  - this sum reaches its minimum for $\boldsymbol{\mu}_k = \frac{1}{|\mathcal{C}_k|} \sum_{n \in \mathcal{C}_k} \mathbf{x}^{(n)}$
- K-means typically converges quickly
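The recomputation claim is one line of calculus: setting the gradient of the per-cluster cost to zero recovers exactly the centroid update.

$$\frac{\partial}{\partial \boldsymbol{\mu}_k} \sum_{n \in \mathcal{C}_k} \bigl\lVert \mathbf{x}^{(n)} - \boldsymbol{\mu}_k \bigr\rVert^2 = -2 \sum_{n \in \mathcal{C}_k} \bigl(\mathbf{x}^{(n)} - \boldsymbol{\mu}_k\bigr) = 0 \;\Longrightarrow\; \boldsymbol{\mu}_k = \frac{1}{|\mathcal{C}_k|} \sum_{n \in \mathcal{C}_k} \mathbf{x}^{(n)}$$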

Sec. 16.4

Page 26

Time complexity of K-means

- Computing the distance between two docs: $O(M)$
  - $M$ is the dimensionality of the vectors.
- Reassigning clusters: $O(KN)$ distance computations $\Rightarrow O(KNM)$
- Computing centroids: each doc gets added once to some centroid: $O(NM)$
- Assume these two steps are each done once in each of $I$ iterations: $O(IKNM)$

Sec. 16.4

Page 27

Seed choice

- Results can vary based on the random selection of initial centroids.
- Some initializations give a poor convergence rate, or convergence to a sub-optimal clustering. Remedies (see the seeding sketch below):
  - Exclude outliers from the seed set
  - Try out multiple starting points and choose the clustering with the lowest cost
  - Select good seeds using a heuristic (e.g., the doc least similar to any existing mean)
  - Obtain seeds from another method, such as hierarchical clustering

Example showing sensitivity to seeds: if you start with B and E as centroids, you converge to {A,B,C} and {D,E,F}; if you start with D and F, you converge to {A,B,D,E} and {C,F}.
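One concrete reading of the "least similar to any existing mean" heuristic is furthest-first seeding (a deterministic relative of k-means++, which instead samples seeds with probability proportional to squared distance). A sketch, under the same NumPy setup as the earlier kmeans function:

```python
import numpy as np

def furthest_first_seeds(X, K, rng=None):
    """Pick K seeds: start from a random doc, then repeatedly take the
    doc farthest from its nearest already-chosen seed."""
    rng = rng or np.random.default_rng(0)
    seeds = [X[rng.integers(len(X))]]
    for _ in range(K - 1):
        # Distance of every doc to its closest current seed.
        d = np.min([np.linalg.norm(X - s, axis=1) for s in seeds], axis=0)
        seeds.append(X[d.argmax()])
    return np.stack(seeds)
```

These seeds could replace the random draw inside the kmeans sketch above; note this heuristic happily picks outliers, which is why excluding outliers first is listed as its own remedy.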

Sec. 16.4

Page 28

K-means issues, variations, etc.

- Computes the centroids only after all points are reassigned
  - Instead, we can recompute the centroid after every assignment
  - This can improve the speed of convergence of K-means
- Assumes clusters are spherical in vector space
  - Sensitive to coordinate changes, weighting, etc.
- Disjoint and exhaustive
  - Doesn't have a notion of "outliers" by default, but outlier filtering can be added

Dhillon et al. (ICDM 2002): a variation that fixes some issues with small document clusters.

Sec. 16.4

Page 29

How many clusters?

- The number of clusters 𝐾 is given
  - Partition 𝑁 docs into a predetermined number of clusters
- Finding the "right" number is part of the problem
  - Given docs, partition them into an "appropriate" number of subsets.
  - E.g., for query results, the ideal value of K is not known up front, though the UI may impose limits.

Page 30

How many clusters?

[Figure: the same data set plausibly partitioned into two, four, or six clusters.]

Page 31

Selecting k

Page 32

K not specified in advance

- Tradeoff between having better focus within each cluster and having too many clusters
- Solve an optimization problem: penalize having lots of clusters
  - Application dependent: e.g., a compressed summary of a search results list.

$$k^* = \arg\min_k \left[ J_{\min}(k) + \lambda k \right]$$

where $J_{\min}(k)$ is the minimum value of $J(\{\mathcal{C}_1, \mathcal{C}_2, \dots, \mathcal{C}_k\})$ obtained over, e.g., 100 runs of k-means (with different initializations).
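A sketch of this selection rule, reusing the kmeans function from earlier; the cost helper and the value of λ are my own illustrations, not prescribed by the slides:

```python
import numpy as np

def cost(X, assign, centroids):
    """J(C): total squared distance of docs to their assigned centroids."""
    return sum(np.sum((X[assign == j] - centroids[j]) ** 2)
               for j in range(len(centroids)))

def select_k(X, k_max, lam, restarts=100):
    """k* = argmin_k [ J_min(k) + lam * k ], with J_min(k) taken over
    several random restarts of k-means (100 in the slide)."""
    best_k, best_val = None, np.inf
    for k in range(1, k_max + 1):
        j_min = min(cost(X, *kmeans(X, k, seed=r)) for r in range(restarts))
        if j_min + lam * k < best_val:
            best_k, best_val = k, j_min + lam * k
    return best_k
```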

Page 33

Penalize lots of clusters

- Benefit for a doc: cosine similarity to its centroid
- Total Benefit: sum of the individual doc benefits
  - Why is there always a clustering with Total Benefit N? (Make each doc its own cluster: it coincides with its centroid, so every benefit is 1.)
- For each cluster, we have a Cost C; for K clusters, the Total Cost is KC.
- Value of a clustering = Total Benefit − Total Cost
- Find the clustering of highest value, over all choices of K
  - Total Benefit increases with increasing K, but we can stop when it doesn't increase by "much"; the Cost term enforces this.

Page 34

What is a good clustering?

- Internal criterion:
  - intra-class (that is, intra-cluster) similarity is high
  - inter-class similarity is low
- The measured quality of a clustering depends on both the doc representation and the similarity measure

Sec. 16.3

Page 35

External criteria for clustering quality

- Quality: the ability to discover some or all of the patterns in gold-standard data
- Assesses a clustering with respect to ground truth; requires labeled data
- $J$ gold-standard classes $C_1, \dots, C_J$, while the clustering produces $K$ clusters $\omega_1, \omega_2, \dots, \omega_K$

Sec. 16.3

Page 36

External criteria

- Purity
- Rand Index
- F measure

Page 37

Rand Index (RI)

RI measures agreement between pair decisions. Number of point pairs:

                                     Same cluster in clustering   Different clusters in clustering
Same class in ground truth           20 (TP)                      24 (FN)
Different classes in ground truth    20 (FP)                      72 (TN)

RI = (20 + 72) / (20 + 24 + 20 + 72) = 0.68

Sec. 16.3

Page 38

Rand index and F-measure

$$RI = \frac{TP + TN}{TP + FP + FN + TN}$$

Compare with standard precision and recall:

$$P = \frac{TP}{TP + FP} \qquad R = \frac{TP}{TP + FN}$$

People also define and use a cluster F-measure, which is probably a better measure:

$$F_\beta = \frac{(\beta^2 + 1)\,P\,R}{\beta^2 P + R}$$

Sec. 16.3
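Plugging the pair counts from the previous slide (TP = 20, FN = 24, FP = 20, TN = 72) into these formulas is a quick sanity check; a few lines suffice:

```python
def pair_metrics(tp, fp, fn, tn, beta=1.0):
    p = tp / (tp + fp)                          # precision over pairs
    r = tp / (tp + fn)                          # recall over pairs
    ri = (tp + tn) / (tp + fp + fn + tn)        # Rand index
    f = (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)
    return p, r, ri, f

print(pair_metrics(tp=20, fp=20, fn=24, tn=72))
# P = 0.5, R ~ 0.455, RI ~ 0.676 (the 0.68 above), F1 ~ 0.476
```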

Page 39

External evaluation of cluster quality

- Purity: the ratio between the count of the dominant class in cluster $\omega_i$ and the size of cluster $\omega_i$:

$$\mathrm{purity}(\omega_i) = \frac{1}{|\omega_i|} \max_{j=1,\dots,J} |\omega_i \cap C_j|$$

$$\mathrm{TotalPurity}(\omega_1, \dots, \omega_K) = \frac{1}{N} \sum_{i=1}^{K} \max_{j=1,\dots,J} |\omega_i \cap C_j| = \sum_{i=1}^{K} \mathrm{purity}(\omega_i) \times \frac{|\omega_i|}{N}$$

- Biased, because having $N$ singleton clusters maximizes purity

Sec. 16.3

Page 40

Purity example

[Figure: 17 points in three clusters, with three gold classes (say x, o, d).]

- Cluster I: purity = max(5, 1, 0)/6 = 5/6
- Cluster II: purity = max(1, 4, 1)/6 = 4/6
- Cluster III: purity = max(2, 0, 3)/5 = 3/5

Total purity = (5 + 4 + 3)/17 = 12/17

Sec. 16.3
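The arithmetic above can be checked mechanically by listing each cluster's class labels and counting the majority class. A sketch (the labels x, o, d are stand-ins for the three classes in the figure):

```python
from collections import Counter

def total_purity(clusters):
    """clusters: one list of gold-class labels per cluster."""
    n = sum(len(c) for c in clusters)
    majority = sum(Counter(c).most_common(1)[0][1] for c in clusters)
    return majority / n

clusters = [["x"] * 5 + ["o"],          # Cluster I:   purity 5/6
            ["x"] + ["o"] * 4 + ["d"],  # Cluster II:  purity 4/6
            ["x"] * 2 + ["d"] * 3]      # Cluster III: purity 3/5
print(total_purity(clusters))           # 12/17 ~ 0.706
```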

Page 41

Hierarchical clustering

- Build a tree-based hierarchical taxonomy (dendrogram) from a set of docs.
- Outputs a hierarchy, a structure that is more informative
- Does not require a pre-specified number of clusters
- But hierarchical algorithms have lower efficiency.

[Figure: a taxonomy rooted at "animal", splitting into "vertebrate" (fish, reptile, amphibian, mammal) and "invertebrate" (worm, insect, crustacean).]

Ch. 17

Page 42

Hierarchical clustering: approaches

- Divisive approach: recursive application of a partitional clustering algorithm.
- Agglomerative approach:
  - treat each doc as a singleton cluster at the outset
  - then successively merge (agglomerate) pairs of clusters until all clusters have been merged into a single cluster

Page 43

Hierarchical clustering

Page 44

Dendrogram: hierarchical clustering

- Each merge is represented by a horizontal line
- The y-coordinate of the horizontal line is the similarity of the two clusters that were merged
- A clustering is obtained by cutting the dendrogram at a desired level:
  - Each connected component forms a cluster.

Page 45

Hierarchical Agglomerative Clustering (HAC)

- Starts with each doc in a separate cluster
- Then repeatedly joins the closest pair of clusters, until there is only one cluster.
- The history of merging forms a binary tree or hierarchy.

Note: the resulting clusters are still "hard" and induce a partition.

Sec. 17.1
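In practice the whole merge loop is a single library call. A sketch using SciPy (my choice of tooling, not the lecture's); the method argument selects the cluster-distance definition discussed on the following slides:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.random.rand(20, 5)           # 20 toy docs as 5-dim vectors
Z = linkage(X, method="single")     # merge history; each row is one merge
# method="complete", "average", or "centroid" give the other variants
```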

Page 46

Example: Hierarchical Agglomerative Clustering (HAC)

[Figure: a dendrogram over seven points showing the order of merges.]

Page 47

Page 48

How to cut the hierarchy?

- Cut at a prespecified level of similarity
- Cut where the gap between two successive combination similarities is largest
- Prespecify the number of clusters K and select the cutting point that produces K clusters
- Find the number of clusters as $k^* = \arg\min_k \left[ J_{\min}(k) + \lambda k \right]$
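With SciPy (again my assumption of tooling), the first and third strategies map directly onto fcluster over the linkage matrix Z from the earlier sketch. Note SciPy works with distances rather than similarities, so a "level of similarity" becomes a distance threshold:

```python
from scipy.cluster.hierarchy import fcluster

# Cut at a prespecified distance level:
labels = fcluster(Z, t=0.7, criterion="distance")
# ...or prespecify K and take the cut that produces exactly K clusters:
labels = fcluster(Z, t=3, criterion="maxclust")
```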

Page 49

Closest pair of clusters

Many variants of defining the closest pair of clusters:

- Single-link: similarity of the most similar pair
- Complete-link: similarity of the "furthest" points, i.e., the least similar pair
- Centroid: clusters whose centroids (centers of gravity) are the most similar
- Average-link: average similarity between pairs of elements

Sec. 17.2

Page 50

Distances between cluster pairs

[Figure: the four cluster-distance definitions illustrated: single-link, complete-link, centroid, average-link.]

Page 51

Single link

- Use the maximum similarity of pairs:

$$sim(C_i, C_j) = \max_{x \in C_i,\, y \in C_j} sim(x, y)$$

- Can result in "straggly" (long and thin) clusters, due to the chaining effect.
- After merging $C_i$ and $C_j$, the similarity of the resulting cluster to another cluster $C_k$ is:

$$sim(C_i \cup C_j,\, C_k) = \max\bigl(sim(C_i, C_k),\, sim(C_j, C_k)\bigr)$$

Sec. 17.2

Page 52

Single link: example

Sec. 17.2

Page 53

Complete link

- Use the minimum similarity of pairs:

$$sim(C_i, C_j) = \min_{x \in C_i,\, y \in C_j} sim(x, y)$$

- Makes "tighter," spherical clusters that are typically preferable.
- After merging $C_i$ and $C_j$, the similarity of the resulting cluster to another cluster $C_k$ is:

$$sim(C_i \cup C_j,\, C_k) = \min\bigl(sim(C_i, C_k),\, sim(C_j, C_k)\bigr)$$

Sec. 17.2

Page 54

Complete link: example

Sec. 17.2

Page 55

Graph-theoretic interpretations

- $s_k$: the similarity of the two clusters merged in step $k$
- $G(s_k)$: the graph that links all data points with a similarity of at least $s_k$
- The clusters after step $k$ in:
  - single-link clustering are the connected components of $G(s_k)$
  - complete-link clustering are maximal cliques of $G(s_k)$

Page 56

Noise

Page 57

Group average

- Similarity of two clusters = average similarity of all pairs within the merged cluster:

$$sim(C_i, C_j) = \frac{1}{|C_i \cup C_j|\,(|C_i \cup C_j| - 1)} \sum_{\vec{x} \in C_i \cup C_j} \; \sum_{\substack{\vec{y} \in C_i \cup C_j \\ \vec{y} \neq \vec{x}}} sim(\vec{x}, \vec{y})$$

- A compromise between single- and complete-link.

Sec. 17.3

Page 58

Computing group average similarity

- When we use the dot product as the similarity measure, we can compute the similarity of two clusters in constant time if we maintain the sum of vectors in each cluster:

$$\vec{s}_j = \sum_{\vec{x} \in C_j} \vec{x}$$

- Compute the similarity of clusters (if the doc vectors are assumed to have length one, so self-similarities are 1.0):

$$sim(C_i, C_j) = \frac{(\vec{s}_i + \vec{s}_j) \cdot (\vec{s}_i + \vec{s}_j) - (|C_i| + |C_j|)}{(|C_i| + |C_j|)\,(|C_i| + |C_j| - 1)}$$

GAAC requires: (i) documents represented as vectors, (ii) length normalization of the vectors, so that self-similarities are 1.0, and (iii) the dot product as the similarity measure between vectors and sums of vectors.

Sec. 17.3
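The constant-time similarity is then a three-line function: each cluster carries only its vector sum and its size. A sketch, assuming length-normalized doc vectors as required above:

```python
import numpy as np

def gaac_sim(s_i, n_i, s_j, n_j):
    """Group-average similarity of two clusters from their vector sums
    (s_i, s_j, NumPy arrays) and sizes (n_i, n_j). The n self-similarities
    subtracted in the numerator are each 1 because vectors are unit length."""
    s = s_i + s_j
    n = n_i + n_j
    return (s @ s - n) / (n * (n - 1))

# Merging two clusters just adds their sums and sizes, so each
# similarity update after a merge stays O(M).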

Page 59

Centroid Clustering

Page 60

Inversions in centroid clustering

Page 61

Computational complexity

- First iteration: all HAC methods need to compute the similarity of all pairs of instances: $O(N^2)$ similarity computations.
- $N-1$ merging iterations: compute the distance between the most recently created cluster and all other existing ones.
- In order to maintain overall $O(N^2)$ performance, computing the similarity to each other cluster must be done in constant time.
- We can find the closest pair in $O(N \log N)$.
- Thus, the overall complexity is $O(N^3)$ if done naively, or $O(N^2 \log N)$ if done more cleverly using priority queues.

Sec. 17.2.1

Page 62

Page 63

Cluster labeling

- Cluster-internal labeling: depends on the cluster itself
  - Title of the doc closest to the centroid
    - Titles are easier to read than a list of terms
    - However, a single doc is unlikely to be representative of all docs in a cluster
  - A list of terms with high weights in the cluster centroid
- Differential cluster labeling: compare the distribution of terms in one cluster with that of other clusters
  - We can use measures such as mutual information:

$$I(C_k; X_i) = \sum_{x_i \in \{0,1\}} \sum_{c_k \in \{0,1\}} P(x_i, c_k)\, \log \frac{P(x_i, c_k)}{P(x_i)\,P(c_k)}$$
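For a single candidate term, the double sum has four cells, one per combination of "doc in/out of the cluster" and "term present/absent". A sketch computing it from raw counts (the count names are my own):

```python
import math

def term_cluster_mi(n11, n10, n01, n00):
    """I(C; X) for one term/cluster pair. n11: docs in the cluster with
    the term; n10: with the term, outside the cluster; n01: in the
    cluster, without the term; n00: neither. The log base only rescales
    scores, so it does not affect the ranking of candidate labels."""
    n = n11 + n10 + n01 + n00
    px1 = (n11 + n10) / n            # P(term present)
    pc1 = (n11 + n01) / n            # P(doc in cluster)
    mi = 0.0
    for count, px, pc in [(n11, px1, pc1), (n10, px1, 1 - pc1),
                          (n01, 1 - px1, pc1), (n00, 1 - px1, 1 - pc1)]:
        if count:                    # skip empty cells (0 log 0 = 0)
            mi += (count / n) * math.log((count / n) / (px * pc))
    return mi
```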

Page 64

Cluster Labeling

Page 65

Final word and resources

- In clustering, clusters are inferred from the data without human input (unsupervised learning)
- However, in practice it's a bit less clear
  - There are many ways of influencing the outcome of clustering: number of clusters, similarity measure, representation of docs, ...
- Resources
  - IIR 16 except 16.5
  - IIR 17 except 17.5