Page 1

Clustering CE-324: Modern Information Retrieval Sharif University of Technology

M. Soleymani, Fall 2018

Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)

Page 2

What is clustering?

- Clustering: grouping a set of objects into similar ones
  - Docs within a cluster should be similar.
  - Docs from different clusters should be dissimilar.
- The commonest form of unsupervised learning
  - Unsupervised learning: learning from raw data, as opposed to supervised learning, where a classification of examples is given
- A common and important task that finds many applications in IR and other places

Ch. 16

Page 3

A data set with clear cluster structure

- How would you design an algorithm for finding the three clusters in this case?

Ch. 16

Page 4

Applications of clustering in IR

- For better navigation of search results
  - Effective "user recall" will be higher
- Whole corpus analysis/navigation
  - Better user interface: search without typing
- For improving recall in search applications
  - Better search results (like pseudo RF)
- For speeding up vector space retrieval
  - Cluster-based retrieval gives faster search

Sec. 16.1

Page 5

Applications of clustering in IR

Page 6

Search result clustering

Page 7

yippy.com – grouping search results

Page 8

Clustering the collection


- Cluster-based navigation is an interesting alternative to keyword searching (i.e., the standard IR paradigm)
- Users may prefer browsing over searching when they are unsure about which terms to use
- Well suited to a collection of news stories
  - News reading is not really search, but rather a process of selecting a subset of stories about recent events

Page 9

Yahoo! Hierarchy isn't clustering but is the kind of output you want from clustering

[Figure: a slice of the www.yahoo.com/Science directory. Top-level categories such as agriculture, biology, physics, CS, and space (about 30 in all) branch into subcategories, e.g. dairy, crops, agronomy, and forestry under agriculture; botany, cell, and evolution under biology; magnetism and relativity under physics; AI, HCI, and courses under CS; craft and missions under space.]

Page 10

Google News: automatic clustering gives an effective news presentation metaphor

Page 11

Page 12

To improve efficiency and effectiveness of the search system

- Improve language modeling: replace the collection model used for smoothing by a model derived from the doc's cluster
- Clustering can speed up search (via an inexact algorithm)
- Clustering can improve recall

Page 13

For improving search recall

- Cluster hypothesis: docs in the same cluster behave similarly with respect to relevance to information needs
- Therefore, to improve search recall:
  - Cluster the docs in the corpus a priori
  - When a query matches a doc 𝑑, also return other docs in the cluster containing 𝑑
- Query "car": also return docs containing "automobile"
  - Because clustering grouped together docs containing "car" with those containing "automobile". Why might this happen?

Sec. 16.1

Page 14

Issues for clustering

- Representation for clustering
  - Doc representation: vector space? Normalization?
    - Centroids aren't length normalized
  - Need a notion of similarity/distance
- How many clusters?
  - Fixed a priori?
  - Completely data driven?
  - Avoid "trivial" clusters, too large or too small
    - Too large: for navigation purposes you've wasted an extra user click without whittling down the set of docs much

Sec. 16.2

Page 15

Notion of similarity/distance

- Ideal: semantic similarity
- Practical: term-statistical similarity
  - We will use cosine similarity.
- For many algorithms, it is easier to think in terms of a distance (rather than a similarity)
  - We will mostly speak of Euclidean distance, but real implementations use cosine similarity

Page 16

Clustering algorithms categorization

- Flat algorithms (e.g., k-means)
  - Usually start with a random (partial) partitioning
  - Refine it iteratively
- Hierarchical algorithms
  - Bottom-up, agglomerative
  - Top-down, divisive

Page 17

Hard vs. soft clustering

- Hard clustering: each doc belongs to exactly one cluster
  - More common and easier to do
- Soft clustering: a doc can belong to more than one cluster

Page 18

Partitioning algorithms

- Construct a partition of 𝑁 docs into 𝐾 clusters
- Given: a set of docs and the number 𝐾
- Find: a partition of the docs into 𝐾 clusters that optimizes the chosen partitioning criterion
  - Finding a global optimum is intractable for many clustering objective functions
  - Effective heuristic methods: K-means and K-medoids algorithms

Page 19

K-means

- Assumes docs are real-valued vectors $\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(N)}$.
- Clusters based on centroids (aka the center of gravity or mean) of the points in a cluster:

$$\boldsymbol{\mu}_j = \frac{1}{|\mathcal{C}_j|} \sum_{\mathbf{x}^{(n)} \in \mathcal{C}_j} \mathbf{x}^{(n)}$$

- K-means cost function:

$$J(\mathcal{C}) = \sum_{j=1}^{K} \sum_{\mathbf{x}^{(n)} \in \mathcal{C}_j} \bigl\lVert \mathbf{x}^{(n)} - \boldsymbol{\mu}_j \bigr\rVert^2$$

where $\mathcal{C} = \{\mathcal{C}_1, \mathcal{C}_2, \dots, \mathcal{C}_K\}$ and $\mathcal{C}_j$ is the set of data points assigned to the $j$-th cluster.

Sec. 16.4

Page 20

K-means algorithm

Select K random points $\{\boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \dots, \boldsymbol{\mu}_K\}$ as the clusters' initial centroids.
Until the clustering converges (or another stopping criterion holds):
    For each doc $\mathbf{x}^{(n)}$: assign $\mathbf{x}^{(n)}$ to the cluster $\mathcal{C}_j$ whose centroid minimizes $\mathrm{dist}(\mathbf{x}^{(n)}, \boldsymbol{\mu}_j)$.
    For each cluster $\mathcal{C}_j$: $\boldsymbol{\mu}_j = \frac{1}{|\mathcal{C}_j|} \sum_{\mathbf{x}^{(n)} \in \mathcal{C}_j} \mathbf{x}^{(n)}$

Sec. 16.4

Reassignment of instances to clusters is based on distance to the current cluster centroids (it can equivalently be expressed in terms of similarities).
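The loop above translates almost line-for-line into NumPy. A minimal sketch (my own illustration, not the course's reference code): ties break toward the lower cluster index via argmin, and the stopping criterion is "doc partition unchanged".

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Lloyd's K-means on the rows of X (an N x M array).
    Returns (assignments, centroids)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids with K distinct random documents.
    centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    assign = np.full(len(X), -1)
    for _ in range(max_iter):
        # Reassignment: each doc goes to its nearest centroid;
        # argmin breaks ties in favor of the lower cluster index.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        if np.array_equal(new_assign, assign):  # partition unchanged: converged
            break
        assign = new_assign
        # Recomputation: each centroid becomes the mean of its members.
        for j in range(K):
            members = X[assign == j]
            if len(members):                    # keep the old centroid if empty
                centroids[j] = members.mean(axis=0)
    return assign, centroids
```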

Page 21

[Figure from Bishop: successive K-means iterations on a sample data set.]

Page 22

Page 23

Termination conditions

Several possibilities for the termination condition, e.g.:

- A fixed number of iterations
- Doc partition unchanged
- $J < \theta$: the cost function falls below a threshold
- $\Delta J < \theta$: the decrease in the cost function (between two successive iterations) falls below a threshold

Sec. 16.4

Page 24

Convergence of K-means

- The K-means algorithm always reaches a fixed point in which the clusters don't change.
- We must use tie-breaking when a sample has the same distance from two or more clusters (e.g., by assigning it to the cluster with the lowest index)

Sec. 16.4

Page 25

K-means decreases $J(\mathcal{C})$ in each iteration (before convergence)

- First, reassignment monotonically decreases $J(\mathcal{C})$, since each vector is assigned to the closest centroid.
- Second, recomputation monotonically decreases each $\sum_{n \in \mathcal{C}_k} \lVert \mathbf{x}^{(n)} - \boldsymbol{\mu}_k \rVert^2$:
  - this sum reaches its minimum for $\boldsymbol{\mu}_k = \frac{1}{|\mathcal{C}_k|} \sum_{n \in \mathcal{C}_k} \mathbf{x}^{(n)}$
- K-means typically converges quickly
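The recomputation claim is one line of calculus: setting the gradient of the per-cluster cost to zero recovers exactly the centroid update.

$$\frac{\partial}{\partial \boldsymbol{\mu}_k} \sum_{n \in \mathcal{C}_k} \bigl\lVert \mathbf{x}^{(n)} - \boldsymbol{\mu}_k \bigr\rVert^2 = -2 \sum_{n \in \mathcal{C}_k} \bigl(\mathbf{x}^{(n)} - \boldsymbol{\mu}_k\bigr) = 0 \;\Longrightarrow\; \boldsymbol{\mu}_k = \frac{1}{|\mathcal{C}_k|} \sum_{n \in \mathcal{C}_k} \mathbf{x}^{(n)}$$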

Sec. 16.4

Page 26

Time complexity of K-means

- Computing the distance between two docs: $O(M)$
  - $M$ is the dimensionality of the vectors.
- Reassigning clusters: $O(KN)$ distance computations $\Rightarrow O(KNM)$
- Computing centroids: each doc gets added once to some centroid: $O(NM)$
- Assume these two steps are each done once in each of $I$ iterations: $O(IKNM)$

Sec. 16.4

Page 27

Seed choice

- Results can vary based on the random selection of initial centroids.
- Some initializations give a poor convergence rate, or convergence to a sub-optimal clustering. Remedies (see the seeding sketch below):
  - Exclude outliers from the seed set
  - Try out multiple starting points and choose the clustering with the lowest cost
  - Select good seeds using a heuristic (e.g., the doc least similar to any existing mean)
  - Obtain seeds from another method, such as hierarchical clustering

Example showing sensitivity to seeds: if you start with B and E as centroids, you converge to {A,B,C} and {D,E,F}; if you start with D and F, you converge to {A,B,D,E} and {C,F}.
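One concrete reading of the "least similar to any existing mean" heuristic is furthest-first seeding (a deterministic relative of k-means++, which instead samples seeds with probability proportional to squared distance). A sketch, under the same NumPy setup as the earlier kmeans function:

```python
import numpy as np

def furthest_first_seeds(X, K, rng=None):
    """Pick K seeds: start from a random doc, then repeatedly take the
    doc farthest from its nearest already-chosen seed."""
    rng = rng or np.random.default_rng(0)
    seeds = [X[rng.integers(len(X))]]
    for _ in range(K - 1):
        # Distance of every doc to its closest current seed.
        d = np.min([np.linalg.norm(X - s, axis=1) for s in seeds], axis=0)
        seeds.append(X[d.argmax()])
    return np.stack(seeds)
```

These seeds could replace the random draw inside the kmeans sketch above; note this heuristic happily picks outliers, which is why excluding outliers first is listed as its own remedy.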

Sec. 16.4

Page 28

K-means issues, variations, etc.

- Computes the centroids only after all points are reassigned
  - Instead, we can recompute the centroid after every assignment
  - This can improve the speed of convergence of K-means
- Assumes clusters are spherical in vector space
  - Sensitive to coordinate changes, weighting, etc.
- Disjoint and exhaustive
  - Doesn't have a notion of "outliers" by default, but outlier filtering can be added

Dhillon et al. (ICDM 2002): a variation that fixes some issues with small document clusters.

Sec. 16.4

Page 29

How many clusters?

- The number of clusters 𝐾 is given
  - Partition 𝑁 docs into a predetermined number of clusters
- Finding the "right" number is part of the problem
  - Given docs, partition them into an "appropriate" number of subsets.
  - E.g., for query results, the ideal value of K is not known up front, though the UI may impose limits.

Page 30

How many clusters?

[Figure: the same data set plausibly partitioned into two, four, or six clusters.]

Page 31

Selecting k

Page 32

K not specified in advance

- Tradeoff between having better focus within each cluster and having too many clusters
- Solve an optimization problem: penalize having lots of clusters
  - Application dependent: e.g., a compressed summary of a search results list.

$$k^* = \arg\min_k \left[ J_{\min}(k) + \lambda k \right]$$

where $J_{\min}(k)$ is the minimum value of $J(\{\mathcal{C}_1, \mathcal{C}_2, \dots, \mathcal{C}_k\})$ obtained over, e.g., 100 runs of k-means (with different initializations).
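A sketch of this selection rule, reusing the kmeans function from earlier; the cost helper and the value of λ are my own illustrations, not prescribed by the slides:

```python
import numpy as np

def cost(X, assign, centroids):
    """J(C): total squared distance of docs to their assigned centroids."""
    return sum(np.sum((X[assign == j] - centroids[j]) ** 2)
               for j in range(len(centroids)))

def select_k(X, k_max, lam, restarts=100):
    """k* = argmin_k [ J_min(k) + lam * k ], with J_min(k) taken over
    several random restarts of k-means (100 in the slide)."""
    best_k, best_val = None, np.inf
    for k in range(1, k_max + 1):
        j_min = min(cost(X, *kmeans(X, k, seed=r)) for r in range(restarts))
        if j_min + lam * k < best_val:
            best_k, best_val = k, j_min + lam * k
    return best_k
```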

Page 33

Penalize lots of clusters

- Benefit for a doc: cosine similarity to its centroid
- Total Benefit: sum of the individual doc benefits
  - Why is there always a clustering with Total Benefit N? (Make each doc its own cluster: it coincides with its centroid, so every benefit is 1.)
- For each cluster, we have a Cost C; for K clusters, the Total Cost is KC.
- Value of a clustering = Total Benefit − Total Cost
- Find the clustering of highest value, over all choices of K
  - Total Benefit increases with increasing K, but we can stop when it doesn't increase by "much"; the Cost term enforces this.

Page 34

What is a good clustering?

- Internal criterion:
  - intra-class (that is, intra-cluster) similarity is high
  - inter-class similarity is low
- The measured quality of a clustering depends on both the doc representation and the similarity measure

Sec. 16.3

Page 35

External criteria for clustering quality

- Quality: the ability to discover some or all of the patterns in gold-standard data
- Assesses a clustering with respect to ground truth; requires labeled data
- $J$ gold-standard classes $C_1, \dots, C_J$, while the clustering produces $K$ clusters $\omega_1, \omega_2, \dots, \omega_K$

Sec. 16.3

Page 36

External criteria

- Purity
- Rand Index
- F measure

Page 37

Rand Index (RI)

RI measures agreement between pair decisions. Number of point pairs:

                                     Same cluster in clustering   Different clusters in clustering
Same class in ground truth           20 (TP)                      24 (FN)
Different classes in ground truth    20 (FP)                      72 (TN)

RI = (20 + 72) / (20 + 24 + 20 + 72) = 0.68

Sec. 16.3

Page 38

Rand index and F-measure

$$RI = \frac{TP + TN}{TP + FP + FN + TN}$$

Compare with standard precision and recall:

$$P = \frac{TP}{TP + FP} \qquad R = \frac{TP}{TP + FN}$$

People also define and use a cluster F-measure, which is probably a better measure:

$$F_\beta = \frac{(\beta^2 + 1)\,P\,R}{\beta^2 P + R}$$

Sec. 16.3
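Plugging the pair counts from the previous slide (TP = 20, FN = 24, FP = 20, TN = 72) into these formulas is a quick sanity check; a few lines suffice:

```python
def pair_metrics(tp, fp, fn, tn, beta=1.0):
    p = tp / (tp + fp)                          # precision over pairs
    r = tp / (tp + fn)                          # recall over pairs
    ri = (tp + tn) / (tp + fp + fn + tn)        # Rand index
    f = (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)
    return p, r, ri, f

print(pair_metrics(tp=20, fp=20, fn=24, tn=72))
# P = 0.5, R ~ 0.455, RI ~ 0.676 (the 0.68 above), F1 ~ 0.476
```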

Page 39

External evaluation of cluster quality

- Purity: the ratio between the count of the dominant class in cluster $\omega_i$ and the size of cluster $\omega_i$:

$$\mathrm{purity}(\omega_i) = \frac{1}{|\omega_i|} \max_{j=1,\dots,J} |\omega_i \cap C_j|$$

$$\mathrm{TotalPurity}(\omega_1, \dots, \omega_K) = \frac{1}{N} \sum_{i=1}^{K} \max_{j=1,\dots,J} |\omega_i \cap C_j| = \sum_{i=1}^{K} \mathrm{purity}(\omega_i) \times \frac{|\omega_i|}{N}$$

- Biased, because having $N$ singleton clusters maximizes purity

Sec. 16.3

Page 40

Purity example

[Figure: 17 points in three clusters, with three gold classes (say x, o, d).]

- Cluster I: purity = max(5, 1, 0)/6 = 5/6
- Cluster II: purity = max(1, 4, 1)/6 = 4/6
- Cluster III: purity = max(2, 0, 3)/5 = 3/5

Total purity = (5 + 4 + 3)/17 = 12/17

Sec. 16.3
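The arithmetic above can be checked mechanically by listing each cluster's class labels and counting the majority class. A sketch (the labels x, o, d are stand-ins for the three classes in the figure):

```python
from collections import Counter

def total_purity(clusters):
    """clusters: one list of gold-class labels per cluster."""
    n = sum(len(c) for c in clusters)
    majority = sum(Counter(c).most_common(1)[0][1] for c in clusters)
    return majority / n

clusters = [["x"] * 5 + ["o"],          # Cluster I:   purity 5/6
            ["x"] + ["o"] * 4 + ["d"],  # Cluster II:  purity 4/6
            ["x"] * 2 + ["d"] * 3]      # Cluster III: purity 3/5
print(total_purity(clusters))           # 12/17 ~ 0.706
```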

Page 41

Hierarchical clustering

- Build a tree-based hierarchical taxonomy (dendrogram) from a set of docs.
- Outputs a hierarchy, a structure that is more informative
- Does not require a pre-specified number of clusters
- But hierarchical algorithms have lower efficiency.

[Figure: a taxonomy rooted at "animal", splitting into "vertebrate" (fish, reptile, amphibian, mammal) and "invertebrate" (worm, insect, crustacean).]

Ch. 17

Page 42

Hierarchical clustering: approaches

- Divisive approach: recursive application of a partitional clustering algorithm.
- Agglomerative approach:
  - treat each doc as a singleton cluster at the outset
  - then successively merge (agglomerate) pairs of clusters until all clusters have been merged into a single cluster

Page 43

Hierarchical clustering

Page 44

Dendrogram: hierarchical clustering

- Each merge is represented by a horizontal line
- The y-coordinate of the horizontal line is the similarity of the two clusters that were merged
- A clustering is obtained by cutting the dendrogram at a desired level:
  - Each connected component forms a cluster.

Page 45

Hierarchical Agglomerative Clustering (HAC)

- Starts with each doc in a separate cluster
- Then repeatedly joins the closest pair of clusters, until there is only one cluster.
- The history of merging forms a binary tree or hierarchy.

Note: the resulting clusters are still "hard" and induce a partition.

Sec. 17.1
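In practice the whole merge loop is a single library call. A sketch using SciPy (my choice of tooling, not the lecture's); the method argument selects the cluster-distance definition discussed on the following slides:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.random.rand(20, 5)           # 20 toy docs as 5-dim vectors
Z = linkage(X, method="single")     # merge history; each row is one merge
# method="complete", "average", or "centroid" give the other variants
```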

Page 46

Example: Hierarchical Agglomerative Clustering (HAC)

[Figure: a dendrogram over seven points showing the order of merges.]

Page 47

Page 48

How to cut the hierarchy?

- Cut at a prespecified level of similarity
- Cut where the gap between two successive combination similarities is largest
- Prespecify the number of clusters K and select the cutting point that produces K clusters
- Find the number of clusters as $k^* = \arg\min_k \left[ J_{\min}(k) + \lambda k \right]$
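With SciPy (again my assumption of tooling), the first and third strategies map directly onto fcluster over the linkage matrix Z from the earlier sketch. Note SciPy works with distances rather than similarities, so a "level of similarity" becomes a distance threshold:

```python
from scipy.cluster.hierarchy import fcluster

# Cut at a prespecified distance level:
labels = fcluster(Z, t=0.7, criterion="distance")
# ...or prespecify K and take the cut that produces exactly K clusters:
labels = fcluster(Z, t=3, criterion="maxclust")
```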

Page 49

Closest pair of clusters

Many variants of defining the closest pair of clusters:

- Single-link: similarity of the most similar pair
- Complete-link: similarity of the "furthest" points, i.e., the least similar pair
- Centroid: clusters whose centroids (centers of gravity) are the most similar
- Average-link: average similarity between pairs of elements

Sec. 17.2

Page 50

Distances between cluster pairs

[Figure: the four cluster-distance definitions illustrated: single-link, complete-link, centroid, average-link.]

Page 51

Single link

- Use the maximum similarity of pairs:

$$sim(C_i, C_j) = \max_{x \in C_i,\, y \in C_j} sim(x, y)$$

- Can result in "straggly" (long and thin) clusters, due to the chaining effect.
- After merging $C_i$ and $C_j$, the similarity of the resulting cluster to another cluster $C_k$ is:

$$sim(C_i \cup C_j,\, C_k) = \max\bigl(sim(C_i, C_k),\, sim(C_j, C_k)\bigr)$$

Sec. 17.2

Page 52

Single link: example

Sec. 17.2

Page 53

Complete link

- Use the minimum similarity of pairs:

$$sim(C_i, C_j) = \min_{x \in C_i,\, y \in C_j} sim(x, y)$$

- Makes "tighter," spherical clusters that are typically preferable.
- After merging $C_i$ and $C_j$, the similarity of the resulting cluster to another cluster $C_k$ is:

$$sim(C_i \cup C_j,\, C_k) = \min\bigl(sim(C_i, C_k),\, sim(C_j, C_k)\bigr)$$

Sec. 17.2

Page 54

Complete link: example

Sec. 17.2

Page 55

Graph-theoretic interpretations

- $s_k$: the similarity of the two clusters merged in step $k$
- $G(s_k)$: the graph that links all data points with a similarity of at least $s_k$
- The clusters after step $k$ in:
  - single-link clustering are the connected components of $G(s_k)$
  - complete-link clustering are maximal cliques of $G(s_k)$

Page 56

Noise

Page 57

Group average

- Similarity of two clusters = average similarity of all pairs within the merged cluster:

$$sim(C_i, C_j) = \frac{1}{|C_i \cup C_j|\,(|C_i \cup C_j| - 1)} \sum_{\vec{x} \in C_i \cup C_j} \; \sum_{\substack{\vec{y} \in C_i \cup C_j \\ \vec{y} \neq \vec{x}}} sim(\vec{x}, \vec{y})$$

- A compromise between single- and complete-link.

Sec. 17.3

Page 58

Computing group average similarity

- When we use the dot product as the similarity measure, we can compute the similarity of two clusters in constant time if we maintain the sum of vectors in each cluster:

$$\vec{s}_j = \sum_{\vec{x} \in C_j} \vec{x}$$

- Compute the similarity of clusters (if the doc vectors are assumed to have length one, so self-similarities are 1.0):

$$sim(C_i, C_j) = \frac{(\vec{s}_i + \vec{s}_j) \cdot (\vec{s}_i + \vec{s}_j) - (|C_i| + |C_j|)}{(|C_i| + |C_j|)\,(|C_i| + |C_j| - 1)}$$

GAAC requires: (i) documents represented as vectors, (ii) length normalization of the vectors, so that self-similarities are 1.0, and (iii) the dot product as the similarity measure between vectors and sums of vectors.

Sec. 17.3
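The constant-time similarity is then a three-line function: each cluster carries only its vector sum and its size. A sketch, assuming length-normalized doc vectors as required above:

```python
import numpy as np

def gaac_sim(s_i, n_i, s_j, n_j):
    """Group-average similarity of two clusters from their vector sums
    (s_i, s_j, NumPy arrays) and sizes (n_i, n_j). The n self-similarities
    subtracted in the numerator are each 1 because vectors are unit length."""
    s = s_i + s_j
    n = n_i + n_j
    return (s @ s - n) / (n * (n - 1))

# Merging two clusters just adds their sums and sizes, so each
# similarity update after a merge stays O(M).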

Page 59

Centroid Clustering

Page 60

Inversions in centroid clustering

Page 61

Computational complexity

- First iteration: all HAC methods need to compute the similarity of all pairs of instances: $O(N^2)$ similarity computations.
- $N-1$ merging iterations: compute the distance between the most recently created cluster and all other existing ones.
- In order to maintain overall $O(N^2)$ performance, computing the similarity to each other cluster must be done in constant time.
- We can find the closest pair in $O(N \log N)$.
- Thus, the overall complexity is $O(N^3)$ if done naively, or $O(N^2 \log N)$ if done more cleverly using priority queues.

Sec. 17.2.1

Page 62

Page 63

Cluster labeling

- Cluster-internal labeling: depends on the cluster itself
  - Title of the doc closest to the centroid
    - Titles are easier to read than a list of terms
    - However, a single doc is unlikely to be representative of all docs in a cluster
  - A list of terms with high weights in the cluster centroid
- Differential cluster labeling: compare the distribution of terms in one cluster with that of other clusters
  - We can use measures such as mutual information:

$$I(C_k; X_i) = \sum_{x_i \in \{0,1\}} \sum_{c_k \in \{0,1\}} P(x_i, c_k)\, \log \frac{P(x_i, c_k)}{P(x_i)\,P(c_k)}$$
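For a single candidate term, the double sum has four cells, one per combination of "doc in/out of the cluster" and "term present/absent". A sketch computing it from raw counts (the count names are my own):

```python
import math

def term_cluster_mi(n11, n10, n01, n00):
    """I(C; X) for one term/cluster pair. n11: docs in the cluster with
    the term; n10: with the term, outside the cluster; n01: in the
    cluster, without the term; n00: neither. The log base only rescales
    scores, so it does not affect the ranking of candidate labels."""
    n = n11 + n10 + n01 + n00
    px1 = (n11 + n10) / n            # P(term present)
    pc1 = (n11 + n01) / n            # P(doc in cluster)
    mi = 0.0
    for count, px, pc in [(n11, px1, pc1), (n10, px1, 1 - pc1),
                          (n01, 1 - px1, pc1), (n00, 1 - px1, 1 - pc1)]:
        if count:                    # skip empty cells (0 log 0 = 0)
            mi += (count / n) * math.log((count / n) / (px * pc))
    return mi
```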

Page 64

Cluster Labeling

Page 65

Final word and resources

- In clustering, clusters are inferred from the data without human input (unsupervised learning)
- However, in practice it's a bit less clear
  - There are many ways of influencing the outcome of clustering: number of clusters, similarity measure, representation of docs, ...
- Resources
  - IIR 16 except 16.5
  - IIR 17 except 17.5