CS728 Clustering the Web Lecture 13

Transcript of CS728 Clustering the Web Lecture 13

Page 1: CS728 Clustering the Web Lecture 13

CS728

Clustering the Web

Lecture 13

Page 2: CS728 Clustering the Web Lecture 13

What is clustering?

• Clustering is the process of grouping a set of physical or abstract objects into classes of similar objects.
– It is the commonest form of unsupervised learning.

• Unsupervised learning = learning from raw data, as opposed to supervised learning, where the correct classification of examples is given

– It is a common and important task that finds many applications in Web Science, IR and other places

Page 3: CS728 Clustering the Web Lecture 13

Why cluster web documents?

• Whole web navigation
– Better user interface

• Improving recall in web search
– Better search results

• For better navigation of search results
– Effective “user recall” will be higher

• For speeding up retrieval
– Faster search

Page 4: CS728 Clustering the Web Lecture 13

Yahoo! Tree Hierarchy

[Figure: part of the Yahoo! directory tree rooted at www.yahoo.com/Science, with top-level categories such as agriculture, biology, physics, CS, and space, and subcategories such as dairy, crops, agronomy, forestry, botany, evolution, cell, magnetism, relativity, AI, HCI, craft, missions, and courses.]

CS Research Question: Given a set of related objects (e.g., webpages), find the best tree decomposition.

Best: helps a user find/retrieve object of interest

Page 5: CS728 Clustering the Web Lecture 13

Scatter/Gather Method for Browsing a Large Collection (SIGIR ’92), by Cutting, Karger, Pedersen, and Tukey. Users browse a document collection interactively by selecting subsets of documents that are re-clustered on-the-fly.

Page 6: CS728 Clustering the Web Lecture 13

For improving search recall

• Cluster hypothesis: documents with similar text are related.
• Therefore, to improve search recall:
– Cluster docs in corpus a priori.
– When a query matches a doc D, also return other docs in the cluster containing D.
• Hope if we do this: the query “car” will also return docs containing automobile.
– Because clustering grouped together docs containing car with those containing automobile.

Page 7: CS728 Clustering the Web Lecture 13

For better navigation of search results

• For grouping search results thematically
– clusty.com / Vivisimo

Page 8: CS728 Clustering the Web Lecture 13

For better navigation of search results

• And more visually: Kartoo.com

Page 9: CS728 Clustering the Web Lecture 13

Defining What Is Good Clustering

• Internal criterion: a good clustering will produce high-quality clusters in which:
– the intra-class (that is, intra-cluster) similarity is high
– the inter-class similarity is low
– the measured quality of a clustering depends on both the document representation and the similarity measure used

• External criterion: the quality of a clustering is also measured by its ability to discover some or all of the hidden patterns or latent classes
– Assessable with gold-standard data

Page 10: CS728 Clustering the Web Lecture 13

External Evaluation of Cluster Quality

• Assesses clustering with respect to ground truth.
• Assume that there are C gold-standard classes, while our clustering algorithm produces k clusters, π1, π2, …, πk, where πi has ni members.
• Simple measure: purity, the ratio between the size of the dominant class in cluster πi and the size of πi.
• Others are entropy of classes in clusters (or mutual information between classes and clusters).

$\mathrm{Purity}(\pi_i) = \frac{1}{n_i}\,\max_{j}\,\lvert \pi_i \cap C_j \rvert$, where $C_j$ is the set of documents in gold-standard class $j$.

Page 11: CS728 Clustering the Web Lecture 13

Purity

[Figure: three example clusters whose members come from three gold-standard classes, with class counts (5, 1, 0), (1, 4, 1), and (2, 0, 3) respectively.]

Cluster I: Purity = (1/6) · max(5, 1, 0) = 5/6

Cluster II: Purity = (1/6) · max(1, 4, 1) = 4/6

Cluster III: Purity = (1/5) · max(2, 0, 3) = 3/5
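As a quick aside (not from the slides), here is a minimal Python sketch of the per-cluster purity computation above; the function name cluster_purity and the class labels x, o, d are invented to mirror the worked example.

    from collections import Counter

    def cluster_purity(member_classes):
        """Purity of one cluster: fraction of its members in the dominant class."""
        counts = Counter(member_classes)
        return max(counts.values()) / len(member_classes)

    # Gold-standard class labels (x, o, d) of the members of the three example clusters.
    clusters = {
        "I":   ["x"] * 5 + ["o"] * 1,
        "II":  ["x"] * 1 + ["o"] * 4 + ["d"] * 1,
        "III": ["x"] * 2 + ["d"] * 3,
    }

    for name, members in clusters.items():
        print(name, cluster_purity(members))   # 5/6, 4/6, 3/5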

Page 12: CS728 Clustering the Web Lecture 13

Issues for clustering

• Representation for clustering
– Document representation
• Vector space? Normalization?
– Need a notion of similarity/distance

• How many clusters?
– Fixed a priori?
– Completely data-driven?

• Avoid “trivial” clusters - too large or small
– In an application, if a cluster's too large, then for navigation purposes you've wasted an extra user click without whittling down the set of documents much.

Page 13: CS728 Clustering the Web Lecture 13

What makes docs “related”?

• Ideal: semantic similarity.
• Practical: statistical similarity
– We will typically use cosine similarity (Pearson correlation).
– Docs as vectors.
– For many algorithms, easier to think in terms of a distance (rather than similarity) between docs.
– We will describe algorithms in terms of cosine similarity.

Cosine similarity of normalized $D_j$, $D_k$ (aka the normalized inner product):

$\mathrm{sim}(D_j, D_k) = \sum_{i=1}^{m} w_{ij}\, w_{ik}$
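A minimal Python/numpy sketch of this formula (not from the slides): for unit-length vectors, cosine similarity is just the inner product. The vectors below are made-up toy data.

    import numpy as np

    def normalize(v):
        """Scale a term-weight vector to unit (L2) length."""
        norm = np.linalg.norm(v)
        return v / norm if norm > 0 else v

    def cosine_sim(d_j, d_k):
        """sim(D_j, D_k) = sum_i w_ij * w_ik for already-normalized vectors."""
        return float(np.dot(d_j, d_k))

    # Toy tf-idf-like weight vectors over a 4-term vocabulary.
    d1 = normalize(np.array([0.0, 2.0, 1.0, 0.0]))
    d2 = normalize(np.array([0.0, 1.0, 1.0, 1.0]))
    print(cosine_sim(d1, d2))   # a value in [0, 1] for non-negative weights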

Page 14: CS728 Clustering the Web Lecture 13

Recall doc as vector

• Each doc j is a vector of tfidf values, one component for each term.

• Can normalize to unit length.

• So we have a vector space
– terms are axes - aka features
– n docs live in this space
– even with stemming, may have 20,000+ dimensions
– do we really want to use all terms?

• Different from using vector space for search. Why?
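For concreteness, a small sketch (not from the slides) of building such normalized tf-idf document vectors, assuming scikit-learn is available; the toy documents are invented.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = [
        "clustering groups similar web documents",
        "search engines retrieve web documents",
        "cars and automobiles on the road",
    ]

    vectorizer = TfidfVectorizer(norm="l2")     # unit-length rows
    X = vectorizer.fit_transform(docs)          # sparse n_docs x n_terms matrix

    print(X.shape)                              # (3, vocabulary size): terms are the axes
    print(np.linalg.norm(X.toarray(), axis=1))  # every document vector has unit length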

Page 15: CS728 Clustering the Web Lecture 13

Intuition

Postulate: Documents that are “close together” in vector space talk about the same things.

[Figure: documents D1-D4 plotted as vectors in a space whose axes are terms t1, t2, t3.]

Page 16: CS728 Clustering the Web Lecture 13

Clustering Algorithms

• Partitioning “flat” algorithms
– Usually start with a random (partial) partitioning
– Refine it iteratively
• k-means/medoids clustering
• Model-based clustering

• Hierarchical algorithms
– Bottom-up, agglomerative
– Top-down, divisive

Page 17: CS728 Clustering the Web Lecture 13

k-Clustering Algorithms

• Given: a set of documents and the number k

• Find: a partition into k clusters that optimizes the chosen partitioning criterion
– Globally optimal: exhaustively enumerate all partitions
– Effective heuristic methods:
• Iterative k-means and k-medoids algorithms
• Hierarchical methods – stop at level with k parts

Page 18: CS728 Clustering the Web Lecture 13

How hard is clustering?

• One idea is to consider all possible clusterings, and pick the one that has the best inter- and intra-cluster distance properties

• Suppose we are given n points, and would like to cluster them into k clusters
– How many possible clusterings? Roughly

$\dfrac{k^{n}}{k!}$
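As a back-of-the-envelope aside (standard combinatorics, not from the slides), the count of partitions of n points into k non-empty clusters is the Stirling number of the second kind, which grows roughly like the expression above:

    % Number of partitions of n points into k non-empty clusters:
    S(n,k) = \frac{1}{k!}\sum_{j=0}^{k}(-1)^{j}\binom{k}{j}(k-j)^{n} \;\approx\; \frac{k^{n}}{k!},
    \qquad\text{e.g. } S(20,4) \approx \frac{4^{20}}{4!}
           = \frac{1{,}099{,}511{,}627{,}776}{24} \approx 4.6\times 10^{10}.

So even for modest n and k, exhaustive enumeration is hopeless.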

Page 19: CS728 Clustering the Web Lecture 13

Clustering Criteria: Maximum Spacing

• Spacing between clusters is defined as the minimum distance between any pair of points in different clusters.

• Clustering of maximum spacing. Given an integer k, find a k-clustering of maximum spacing.

[Figure: a point set partitioned into k = 4 clusters; the spacing is the smallest distance between points in different clusters.]

Page 20: CS728 Clustering the Web Lecture 13

Greedy Clustering Algorithm

• Single-link k-clustering algorithm.
– Form a graph on the vertex set U, corresponding to n clusters.
– Find the closest pair of objects such that each object is in a different cluster, and add an edge between them.
– Repeat n-k times, until there are exactly k clusters.

• Key observation. This procedure is precisely Kruskal's algorithm for Minimum-Cost Spanning Tree (except we stop when there are k connected components).

• Remark. Equivalent to finding an MST and deleting the k-1 most expensive edges.
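A minimal Python sketch of this greedy, Kruskal-style procedure under toy assumptions (Euclidean points, a simple union-find); it is an illustration, not a reference implementation from the course.

    from itertools import combinations

    def single_link_k_clustering(points, k):
        """Repeatedly merge the closest pair of points lying in different
        clusters (Kruskal-style), stopping when k clusters remain."""
        parent = list(range(len(points)))

        def find(i):                       # union-find with path halving
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i

        # All pairwise squared Euclidean distances, sorted ascending.
        edges = sorted(
            (sum((a - b) ** 2 for a, b in zip(points[i], points[j])), i, j)
            for i, j in combinations(range(len(points)), 2)
        )

        clusters = len(points)
        for _, i, j in edges:
            if clusters == k:
                break
            ri, rj = find(i), find(j)
            if ri != rj:                   # closest cross-cluster pair: merge
                parent[ri] = rj
                clusters -= 1

        groups = {}                        # group point indices by representative
        for i in range(len(points)):
            groups.setdefault(find(i), []).append(i)
        return list(groups.values())

    pts = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
    print(single_link_k_clustering(pts, k=3))   # [[0, 1], [2, 3], [4]]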

Page 21: CS728 Clustering the Web Lecture 13

Greedy Clustering Analysis

• Theorem. Let C* denote the clustering C*1, …, C*k formed by deleting the k-1 most expensive edges of an MST. C* is a k-clustering of max spacing.

• Pf. Let C denote some other clustering C1, …, Ck.
– The spacing of C* is the length d* of the (k-1)st most expensive edge.
– Since C is not C*, there is a pair pi, pj in the same cluster in C*, say C*r, but in different clusters in C, say Cs and Ct.
– Some edge (p, q) on the pi-pj path in C*r spans two different clusters in C.
– All edges on the pi-pj path have length ≤ d*, since Kruskal chose them.
– Spacing of C is ≤ d*, since p and q are in different clusters. ▪

[Figure: the pi-pj path inside C*r, with edge (p, q) crossing between clusters Cs and Ct of C.]

Page 22: CS728 Clustering the Web Lecture 13

Hierarchical Agglomerative Clustering (HAC)

• Greedy is one example of HAC.
• Assumes a similarity function for determining the similarity of two instances.
• Starts with all instances in a separate cluster and then repeatedly joins the two clusters that are most similar until there is only one cluster.

• The history of merging forms a binary tree or hierarchy.

Page 23: CS728 Clustering the Web Lecture 13

A Dendrogram: Hierarchical Clustering

• Dendrogram: decomposes data objects into several levels of nested partitioning (tree of clusters).

• Clustering of the data objects is obtained by cutting the dendrogram at the desired level; then each connected component forms a cluster.

Page 24: CS728 Clustering the Web Lecture 13

HAC Algorithm

Start with all instances in their own cluster.
Until there is only one cluster:
  Among the current clusters, determine the two clusters, ci and cj, that are most similar.
  Replace ci and cj with a single cluster ci ∪ cj.
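A naive Python sketch of this loop (not from the slides): similarities are recomputed from scratch each iteration, so it is O(n³); single-link similarity of unit vectors is used purely for concreteness.

    import numpy as np

    def naive_hac(vectors):
        """Agglomerative clustering; returns the merge history (the binary tree)."""
        clusters = [[i] for i in range(len(vectors))]   # each point starts alone
        merges = []

        def sim(a, b):          # single-link: most similar pair across clusters
            return max(float(np.dot(vectors[i], vectors[j])) for i in a for j in b)

        while len(clusters) > 1:
            # Determine the two most similar current clusters ci and cj.
            i, j = max(
                ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
                key=lambda ij: sim(clusters[ij[0]], clusters[ij[1]]),
            )
            merges.append((clusters[i], clusters[j]))
            merged = clusters[i] + clusters[j]          # ci ∪ cj
            clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        return merges

    V = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
    V = V / np.linalg.norm(V, axis=1, keepdims=True)    # unit-length rows
    print(naive_hac(V))   # merges [0] with [1] first, then [2] with [0, 1]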

Page 25: CS728 Clustering the Web Lecture 13

Hierarchical Clustering algorithms

• Agglomerative (bottom-up):
– Start with each document being a single cluster.
– Eventually all documents belong to the same cluster.

• Divisive (top-down):
– Start with all documents belonging to the same cluster.
– Eventually each node forms a cluster on its own.

• Does not require the number of clusters k in advance

• Needs a termination/readout condition

– The final mode in both Agglomerative and Divisive is of no use.


Page 26: CS728 Clustering the Web Lecture 13

Dendrogram: Document Example

• As clusters agglomerate, docs likely to fall into a hierarchy of “topics” or concepts.

[Figure: dendrogram over documents d1-d5: d1 and d2 merge, d4 and d5 merge, then d3 joins {d4, d5}.]

Page 27: CS728 Clustering the Web Lecture 13

Hierarchical Clustering

• Key problem: as you build clusters, how do you represent the location of each cluster, to tell which pair of clusters is closest?

• Max spacing
– Measure intercluster distances by distances of nearest pairs.

• Euclidean spacing
– Each cluster has a centroid = average of its points.
– Measure intercluster distances by distances of centroids.

Page 28: CS728 Clustering the Web Lecture 13

“Closest pair” of clusters

• Many variants to defining closest pair of clusters
• Single-link or max spacing
– Similarity of the most similar pair (same as Kruskal’s MST algorithm)
• “Center of gravity”
– Clusters whose centroids (centers of gravity) are the most similar
• Average-link
– Average similarity between pairs of elements
• Complete-link
– Similarity of the “furthest” points, the least similar

Page 29: CS728 Clustering the Web Lecture 13

Impact of Choice of Similarity Measure

• Single Link Clustering
– Can result in “straggly” (long and thin) clusters due to the chaining effect.
– Appropriate in some domains, such as clustering islands: “Hawaii clusters”

• Uses min distance / max similarity update
– After merging ci and cj, the similarity of the resulting cluster to another cluster, ck, is:

$\mathrm{sim}((c_i \cup c_j), c_k) = \max(\mathrm{sim}(c_i, c_k),\, \mathrm{sim}(c_j, c_k))$

Page 30: CS728 Clustering the Web Lecture 13

Single Link Example

Page 31: CS728 Clustering the Web Lecture 13

Complete-Link Clustering

• Use minimum similarity (max distance) of pairs:

$\mathrm{sim}(c_i, c_j) = \min_{x \in c_i,\, y \in c_j} \mathrm{sim}(x, y)$

• Makes “tighter,” spherical clusters that are sometimes preferable.

• After merging ci and cj, the similarity of the resulting cluster to another cluster, ck, is:

$\mathrm{sim}((c_i \cup c_j), c_k) = \min(\mathrm{sim}(c_i, c_k),\, \mathrm{sim}(c_j, c_k))$
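A small sketch (my own illustration, not from the slides) of how these merge rules let the cluster-similarity matrix be updated in place instead of recomputed; the matrix entries are toy numbers.

    import numpy as np

    def merge_update(S, i, j, linkage="single"):
        """Merge clusters i and j in similarity matrix S; row/column i then
        represents ci ∪ cj.  single-link uses max, complete-link uses min."""
        combine = np.maximum if linkage == "single" else np.minimum
        new_row = combine(S[i], S[j])          # sim(ci ∪ cj, ck) for every ck
        S = S.copy()
        S[i, :] = new_row
        S[:, i] = new_row
        S[i, i] = 1.0
        return np.delete(np.delete(S, j, axis=0), j, axis=1)   # drop cluster j

    S = np.array([[1.0, 0.8, 0.2],
                  [0.8, 1.0, 0.4],
                  [0.2, 0.4, 1.0]])
    print(merge_update(S, 0, 1, linkage="complete"))   # sim(c0 ∪ c1, c2) = min(0.2, 0.4)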

Page 32: CS728 Clustering the Web Lecture 13

Complete Link Example

Page 33: CS728 Clustering the Web Lecture 13

Computational Complexity

• In the first iteration, all HAC methods need to compute the similarity of all pairs of n individual instances, which is O(n²).

• In each of the subsequent n-2 merging iterations, it must compute the distance between the most recently created cluster and all other existing clusters.
– Since we can just store unchanged similarities

• In order to maintain an overall O(n²) performance, computing similarity to each other cluster must be done in constant time.
– Else O(n² log n), or O(n³) if done naively

Page 34: CS728 Clustering the Web Lecture 13

Key notion: cluster representative

• We want a notion of a representative point in a cluster

• Representative should be some sort of “typical” or central point in the cluster, e.g.,
– point inducing smallest radii to docs in cluster
– smallest squared distances, etc.
– point that is the “average” of all docs in the cluster
• Centroid or center of gravity

Page 35: CS728 Clustering the Web Lecture 13

Example: n=6, k=3, closest pair of centroids

[Figure: six documents d1-d6 being clustered into k = 3 clusters by repeatedly merging the closest pair of centroids; the centroids after the first and second merge steps are shown.]

Page 36: CS728 Clustering the Web Lecture 13

Outliers in centroid computation

• Can ignore outliers when computing centroid.

• What is an outlier?
– Lots of statistical definitions, e.g., moment of point to centroid > M × some cluster moment.

[Figure: a cluster with its centroid and a single distant outlier.]

Page 37: CS728 Clustering the Web Lecture 13

Group-Average Clustering

• Uses average similarity across all pairs within the merged cluster to measure the similarity of two clusters.

• Compromise between single and complete link.
• Two options:
– Averaged across all ordered pairs in the merged cluster
– Averaged over all pairs between the two original clusters

• Some previous work has used one of these options; some the other. No clear difference in efficacy

$\mathrm{sim}(c_i, c_j) = \dfrac{1}{\lvert c_i \cup c_j\rvert\,(\lvert c_i \cup c_j\rvert - 1)} \sum_{x \in c_i \cup c_j}\;\sum_{\substack{y \in c_i \cup c_j \\ y \neq x}} \mathrm{sim}(x, y)$

Page 38: CS728 Clustering the Web Lecture 13

Computing Group Average Similarity

• Assume cosine similarity and normalized vectors with unit length.

• Always maintain sum of vectors in each cluster.

• Compute similarity of clusters in constant time:

$s(c_j) = \displaystyle\sum_{x \in c_j} x$

$\mathrm{sim}(c_i, c_j) = \dfrac{\bigl(s(c_i) + s(c_j)\bigr)\cdot\bigl(s(c_i) + s(c_j)\bigr) - \bigl(\lvert c_i\rvert + \lvert c_j\rvert\bigr)}{\bigl(\lvert c_i\rvert + \lvert c_j\rvert\bigr)\bigl(\lvert c_i\rvert + \lvert c_j\rvert - 1\bigr)}$
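A short Python check (not from the slides) that the vector-sum formula above matches the direct average over all ordered pairs, for toy unit-length vectors.

    import numpy as np

    def group_avg_fast(ci, cj):
        """Group-average similarity of ci ∪ cj from the cluster vector sums only."""
        s = ci.sum(axis=0) + cj.sum(axis=0)     # s(ci) + s(cj)
        n = len(ci) + len(cj)
        # For unit vectors each self-similarity x·x = 1, hence the "- n" correction.
        return (float(np.dot(s, s)) - n) / (n * (n - 1))

    def group_avg_naive(ci, cj):
        """Direct average of sim(x, y) over all ordered pairs x != y in ci ∪ cj."""
        merged = np.vstack([ci, cj])
        n = len(merged)
        total = sum(float(np.dot(merged[a], merged[b]))
                    for a in range(n) for b in range(n) if a != b)
        return total / (n * (n - 1))

    rng = np.random.default_rng(0)
    ci = rng.random((3, 5)); ci /= np.linalg.norm(ci, axis=1, keepdims=True)
    cj = rng.random((2, 5)); cj /= np.linalg.norm(cj, axis=1, keepdims=True)
    print(group_avg_fast(ci, cj), group_avg_naive(ci, cj))   # the two values agree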

Page 39: CS728 Clustering the Web Lecture 13

Medoid As Cluster Representative

• The centroid does not have to be a document.
• Medoid: a cluster representative that is one of the documents
• For example: the document closest to the centroid
• One reason this is useful:
– Consider the representative of a large cluster (>1000 documents)
– The centroid of this cluster will be a dense vector
– The medoid of this cluster will be a sparse vector

• Compare: mean/centroid vs. median/medoid
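A minimal sketch (not from the slides) of choosing a medoid as the member closest to the centroid, assuming unit-length document vectors; the data is toy.

    import numpy as np

    def medoid(cluster_vectors):
        """Index of the cluster member with highest cosine similarity to the centroid."""
        centroid = cluster_vectors.mean(axis=0)      # typically a dense vector
        centroid /= np.linalg.norm(centroid)
        sims = cluster_vectors @ centroid            # cosine with each member
        return int(np.argmax(sims))                  # the chosen member stays sparse

    docs = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    print(medoid(docs))   # index of the document nearest the cluster centroid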

Page 40: CS728 Clustering the Web Lecture 13

Homework Exercise

• Consider different agglomerative clustering methods for n points on a line. Explain how you could avoid n³ distance computations - how many will your scheme use?

Page 41: CS728 Clustering the Web Lecture 13

Efficiency: “Using approximations”

• In standard algorithm, must find closest pair of centroids at each step

• Approximation: instead, find a nearly closest pair
– use some data structure that makes this approximation easier to maintain
– simplistic example: maintain the closest pair based on distances in the projection onto a random line

[Figure: cluster centroids projected onto a random line.]
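A simplistic Python sketch of the projection idea (an approximation only, on invented data): project the centroids onto a random direction, sort, and compare only neighbors along that line.

    import numpy as np

    def approx_closest_pair(centroids, rng):
        """Approximate closest pair of centroids via projection onto a random line."""
        direction = rng.standard_normal(centroids.shape[1])
        direction /= np.linalg.norm(direction)
        proj = centroids @ direction                 # positions along the random line
        order = np.argsort(proj)
        best_pair, best_dist = None, np.inf
        for a, b in zip(order[:-1], order[1:]):      # only neighbors on the line
            d = np.linalg.norm(centroids[a] - centroids[b])
            if d < best_dist:
                best_pair, best_dist = (int(a), int(b)), d
        return best_pair, best_dist

    rng = np.random.default_rng(1)
    C = rng.random((6, 3))                           # toy centroid coordinates
    print(approx_closest_pair(C, rng))               # a near-closest pair, found cheaply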

Page 42: CS728 Clustering the Web Lecture 13

The dual space

• So far, we clustered docs based on their similarities in term space

• For some applications, e.g., topic analysis for inducing navigation structures, can “dualize”:
– use docs as axes
– represent users or terms as vectors
– proximity based on co-occurrence of usage
– now clustering users or terms, not docs

Page 43: CS728 Clustering the Web Lecture 13

Next time

• Iterative clustering using K-means

• Spectral clustering using eigenvalues