“A Comparison of Document Clustering Techniques”

Michael Steinbach, George Karypis and Vipin Kumar

(Technical Report, CSE, UMN, 2000)

Mahashweta Das (mahashweta.das@mavs.uta.edu)

CSE 6339 (Dr. Chengkai Li)

Feb 9, 2010


Document Clustering

• Clustering - act of grouping similar objects into sets

• Document Clustering - act of collecting similar documents into bins, where similarity is some function on a pair of documents

• Uses of Document Clustering

  • Browsing a large collection of documents (document organization, automatic topic extraction, fast information retrieval)

  • Organizing results returned by a search engine (efficient web search, automatic generation of taxonomy of web documents, effective document classifier)

  • Improves precision and recall in information retrieval systems


Types of Clustering

• Agglomerative Hierarchical Clustering

  • Begin with as many clusters as objects; the most similar clusters are successively merged until only one cluster remains

  • Superior cluster quality, but O(n²) complexity

• Partitional Clustering

  • Begin with k initial centroids and assign all n objects to the closest centroid; recompute the centroid of each cluster and repeat until the centroids don’t change

  • Efficient O(knt) complexity (t = number of iterations), but often poor cluster quality


Agglomerative Hierarchical Clustering

Euclidean distance is the similarity/distance metric
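
As a rough illustration of the merge loop, here is a minimal Python sketch. It assumes small numeric points, Euclidean distance between cluster centroids as the closeness criterion, and a hypothetical `target_clusters` stopping parameter; none of these names come from the report.

```python
import math
from itertools import combinations


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length points."""
    dims = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dims)]


def agglomerative(points, target_clusters=1):
    """Merge the two closest clusters until `target_clusters` remain.

    Closeness is the Euclidean distance between cluster centroids,
    which is an assumption made for this sketch.
    """
    clusters = [[p] for p in points]  # start with one cluster per object
    while len(clusters) > target_clusters:
        # Find the pair of clusters whose centroids are closest.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: math.dist(
                centroid(clusters[ij[0]]), centroid(clusters[ij[1]])
            ),
        )
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return clusters


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
result = agglomerative(points, target_clusters=3)
```

Every merge rescans all cluster pairs, which is where the at-least-quadratic cost of agglomerative methods comes from.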


Comparison: Agglomerative Hierarchical Clustering

• Intra-Cluster Similarity Technique (IST)

  • looks at the similarity of all documents in a cluster to their cluster centroid, to find which pair of clusters can be merged with the smallest decrease in similarity

• Centroid Similarity Technique (CST)

  • looks at the cosine similarity between the centroids of the two clusters

• UPGMA (performs best)

  • looks at cluster similarity as follows:
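
As usually defined for documents, UPGMA takes the similarity of two clusters to be the average cosine similarity over all cross-cluster document pairs: the sum of cosine(d1, d2) for d1 in cluster1 and d2 in cluster2, divided by size(cluster1) * size(cluster2). A minimal Python sketch under that assumption, with illustrative function names:

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def upgma_similarity(cluster1, cluster2):
    """Average cosine similarity over all cross-cluster document pairs."""
    total = sum(cosine(d1, d2) for d1 in cluster1 for d2 in cluster2)
    return total / (len(cluster1) * len(cluster2))


# Toy term vectors, invented for illustration.
c1 = [[1.0, 0.0], [1.0, 1.0]]
c2 = [[0.0, 1.0]]
sim = upgma_similarity(c1, c2)
```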


Partitional Clustering (K-Means)

Euclidean distance is the similarity/distance metric
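
The assign-and-recompute loop of basic K-Means can be sketched in a few lines of Python. This is a simplified sketch, assuming Euclidean distance, caller-supplied initial centroids, and a hypothetical `max_iter` cap; it is not the report's implementation.

```python
import math


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    dims = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(dims))


def kmeans(points, centroids, max_iter=100):
    """Basic K-Means with Euclidean distance."""
    centroids = [tuple(c) for c in centroids]
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)), key=lambda j: math.dist(p, centroids[j])
            )
            clusters[nearest].append(p)
        # Update step: recompute centroids (an empty cluster keeps its old one).
        new_centroids = [
            mean(c) if c else centroids[j] for j, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # centroids did not change: converged
            break
        centroids = new_centroids
    return centroids, clusters


points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centroids, clusters = kmeans(points, [(1.0, 1.0), (9.0, 8.0)])
```

Each iteration costs one distance computation per point per centroid, which matches the O(knt) figure quoted earlier.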


Vector Space Model and Document Clustering

• Cosine Similarity between documents d1 and d2

• Cluster Centroid Vector for a set of S documents in a cluster

• Cosine Similarity between a document and centroid vector

• Cosine Similarity between centroid vectors c1 and c2
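
The quantities listed above follow the standard vector space model definitions: cosine(d1, d2) = (d1 · d2) / (‖d1‖ ‖d2‖), and the centroid of a set S of documents is c = (1/|S|) Σ d over d in S. The same cosine formula applies between a document and a centroid, or between two centroids. A small Python sketch, where the term-frequency vectors are invented for illustration:

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    return dot(u, v) / (norm(u) * norm(v))


def centroid(docs):
    """Component-wise average of the document vectors in a cluster."""
    return [sum(col) / len(docs) for col in zip(*docs)]


d1 = [2.0, 1.0, 0.0]  # toy term-frequency vectors (assumed data)
d2 = [1.0, 1.0, 1.0]
c = centroid([d1, d2])

sim_docs = cosine(d1, d2)         # document vs. document
sim_doc_centroid = cosine(d1, c)  # document vs. centroid
```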


Cluster Quality Evaluation Measures

• Internal Quality Measure

  • Cohesiveness of cluster as measure of cluster quality

  • OVERALL SIMILARITY

    • Based on pairwise similarity of documents in a cluster

    • For a set of S documents in a cluster

• External Quality Measure

  • Compares the groups produced by clustering techniques to known classes

  • ENTROPY

  • F-MEASURE

The Higher, The Better
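
Overall similarity is commonly computed as the average pairwise cosine similarity within a cluster, (1/|S|²) Σ cosine(d', d) over all d, d' in S. For unit-length document vectors this equals the squared length of the cluster centroid. The Python sketch below checks that identity numerically; the vectors and function names are assumptions made for illustration.

```python
import math


def normalize(v):
    """Scale a non-zero vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def overall_similarity(docs):
    """Mean cosine over all ordered pairs, each document with itself included."""
    unit = [normalize(d) for d in docs]
    total = sum(sum(a * b for a, b in zip(u, v)) for u in unit for v in unit)
    return total / len(unit) ** 2


def centroid_norm_squared(docs):
    """Squared length of the centroid of the unit-length documents."""
    unit = [normalize(d) for d in docs]
    c = [sum(col) / len(unit) for col in zip(*unit)]
    return sum(x * x for x in c)


cluster = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pairwise = overall_similarity(cluster)
via_centroid = centroid_norm_squared(cluster)
```

The centroid form needs one pass over the cluster rather than a pass over all pairs.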


ENTROPY: External Cluster Quality Measure

• ENTROPY

  • Calculate class distribution of data

  • p_ij: the “probability” that a member of cluster j belongs to class i

  • Entropy of cluster j

  • Total entropy

The Lower, The Better
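
In the usual formulation, the entropy of cluster j is E_j = -Σ_i p_ij log(p_ij), and the total entropy is the size-weighted sum Σ_j (n_j / n) E_j. A minimal Python sketch, assuming natural logarithms and clusters given as lists of known class labels (both are choices made for this example):

```python
import math
from collections import Counter


def cluster_entropy(labels):
    """Entropy of the class distribution inside one cluster."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def total_entropy(clusters):
    """Sum of cluster entropies weighted by cluster size."""
    n = sum(len(c) for c in clusters)
    return sum(len(c) / n * cluster_entropy(c) for c in clusters)


# Invented labels: one pure cluster and one evenly mixed cluster.
clusters = [["sports", "sports", "sports"], ["politics", "sports"]]
entropy = total_entropy(clusters)
```

A clustering in which every cluster holds a single class has total entropy 0.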


F-MEASURE: External Cluster Quality Measure

• F-MEASURE

  • Combines precision and recall ideas from information retrieval

  • For cluster j and class i: Recall(i, j) = nij / ni and Precision(i, j) = nij / nj, where nij: number of members of cluster j in class i; nj: number of members of cluster j; ni: number of members of class i

  • Entire F-Measure

The Higher, The Better
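
With Recall(i, j) = nij / ni and Precision(i, j) = nij / nj, the per-pair score is usually F(i, j) = 2 · Recall · Precision / (Recall + Precision), and the overall F-measure is Σ_i (ni / n) · max_j F(i, j). A hedged Python sketch, assuming clusters are given as lists of known class labels:

```python
from collections import Counter


def f_measure(clusters):
    """Class-size-weighted best F score over all clusters."""
    n = sum(len(c) for c in clusters)
    class_sizes = Counter(label for c in clusters for label in c)
    total = 0.0
    for cls, n_i in class_sizes.items():
        best = 0.0
        for c in clusters:
            n_ij = c.count(cls)  # members of class `cls` in this cluster
            if n_ij == 0:
                continue
            recall = n_ij / n_i
            precision = n_ij / len(c)
            best = max(best, 2 * recall * precision / (recall + precision))
        total += n_i / n * best
    return total


# Invented labels for illustration.
clusters = [["a", "a", "a"], ["b", "a"]]
score = f_measure(clusters)
```

Each class is matched to the single cluster that represents it best, so a perfect clustering scores 1.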


Bisecting K-Means Clustering

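The algorithm starts with a single cluster of all documents and repeatedly splits one cluster in two with basic K-Means until the desired number of clusters is reached

Cluster chosen for splitting: the largest cluster, the one with the least overall similarity, or a criterion based on both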
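
The splitting loop can be sketched as follows. This simplified Python version always splits the largest cluster, uses Euclidean distance, and seeds each 2-means split with the first point and the point farthest from it. Those choices are assumptions for the example: the report works with cosine similarity and keeps the best of several trial bisections.

```python
import math


def mean(points):
    dims = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(dims))


def two_means(points, max_iter=50):
    """Split `points` into two clusters with basic 2-means."""
    c0 = points[0]  # assumed deterministic seeding
    c1 = max(points, key=lambda p: math.dist(p, c0))
    left, right = list(points), []
    for _ in range(max_iter):
        left = [p for p in points if math.dist(p, c0) <= math.dist(p, c1)]
        right = [p for p in points if math.dist(p, c0) > math.dist(p, c1)]
        if not left or not right:
            break
        new0, new1 = mean(left), mean(right)
        if (new0, new1) == (c0, c1):  # centroids stable: converged
            break
        c0, c1 = new0, new1
    return left, right


def bisecting_kmeans(points, k):
    """Start from one cluster and bisect the largest one until k clusters exist."""
    clusters = [list(points)]
    while len(clusters) < k:
        largest = max(clusters, key=len)
        if len(largest) < 2:
            break
        clusters.remove(largest)
        left, right = two_means(largest)
        if not left or not right:  # could not split (e.g. identical points)
            clusters.append(largest)
            break
        clusters.extend([left, right])
    return clusters


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0),
          (10.0, 11.0), (20.0, 0.0), (21.0, 0.0)]
clusters = bisecting_kmeans(points, k=3)
```

Recording which cluster each split came from yields a hierarchy as a by-product.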


Bisecting K-Means Example


Bisecting K-Means Example

[Figure: Bisecting K-Means Clustering Document Cluster Hierarchy]


Observations

• Bisecting K-Means is actually divisive hierarchical clustering

• Bisecting K-Means has a time complexity linear in number of documents

• Multiple runs of Bisecting K-Means do not improve results

• Bisecting K-Means (with or without refinement) is better than regular K-Means and UPGMA (with or without refinement) quite consistently (Overall Similarity and Entropy)

• Bisecting K-means produces better document hierarchies

Refinement: the Bisecting K-Means and UPGMA algorithms are followed by the basic K-Means clustering algorithm, which uses the centroids of the clusters they produce as its initial centroids


Agglomerative Hierarchical Clustering vs. K-Means/Bisecting K-Means

• Documents share “core” vocabularies

• Two documents can often be nearest neighbors without belonging to the same class, so agglomerative algorithms make mistakes

• “Global properties” help overcome local minima

  • Global property: computing the dot product of a unit-length document vector with a cluster centroid is the same as computing the average cosine similarity of the document to all the cluster’s documents

• K-means better suited to document clustering

  • However, UPGMA outperforms a single run of K-Means

• Incremental update of centroid version of K-Means has been used

• Hybrid Hierarchical K-Means performs better than Hierarchical
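
The global property mentioned in this list can be checked numerically: for unit-length document vectors, the dot product of a document with the (unnormalized) cluster centroid equals the document's average cosine similarity to every document in the cluster. A small Python sketch with invented vectors:

```python
import math


def normalize(v):
    """Scale a non-zero vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Toy cluster of unit-length document vectors (assumed data).
docs = [normalize(d) for d in ([3.0, 1.0, 0.0], [1.0, 2.0, 1.0], [0.0, 1.0, 4.0])]
c = [sum(col) / len(docs) for col in zip(*docs)]  # cluster centroid

d = docs[0]
via_centroid = dot(d, c)
# For unit vectors the dot product is the cosine similarity.
avg_similarity = sum(dot(d, other) for other in docs) / len(docs)
```

A single comparison with the centroid therefore summarizes the document's relation to the whole cluster, rather than to one nearest neighbor.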


Bisecting K-Means vs. K-Means

• Bisecting K-means tends to produce clusters of relatively uniform size

• Regular K-means tends to produce clusters of widely different sizes, which affects the overall cluster quality measure

• Bisecting K-means beats Regular K-means in Entropy measurement

• Is this explanation/intuition sufficient? What is the scope of the algorithm outside document clustering?


Thank You !!

??

