Information Retrieval (8) Prof. Dragomir R. Radev radev@umich.edu.


Information Retrieval (8)

Prof. Dragomir R. Radev

radev@umich.edu

IR Winter 2010

13. Clustering

Clustering

• Exclusive/overlapping clusters

• Hierarchical/flat clusters

• The cluster hypothesis
  – Documents in the same cluster are relevant to the same query
  – How do we use it in practice?

Representations for document clustering

• Typically: vector-based
  – Words: “cat”, “dog”, etc.
  – Features: document length, author name, etc.

• Each document is represented as a vector in an n-dimensional space

• Similar documents appear nearby in the vector space (distance measures are needed)
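As a rough illustration (not from the slides), here is a minimal Python sketch of comparing document vectors with cosine similarity; the toy vocabulary and weights are invented:

from math import sqrt

def cosine(u, v):
    # cosine similarity between two equal-length term-weight vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# toy document vectors over the (hypothetical) vocabulary: cat, dog, doc-length
d1 = [2, 0, 1]
d2 = [1, 0, 1]
d3 = [0, 3, 2]
print(cosine(d1, d2))  # ~0.95: d1 and d2 are nearby in the vector space
print(cosine(d1, d3))  # ~0.25: d1 and d3 are far apart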

Scatter-gather

• Introduced by Cutting, Karger, and Pedersen

• Iterative process
  – Show terms for each cluster
  – User picks some of them
  – System produces new clusters

• Example: http://www.ischool.berkeley.edu/~hearst/images/sg-example1.html

k-means

• Iteratively determine which cluster a point belongs to, then adjust the cluster centroids, then repeat

• Needed: a small number k of desired clusters

• Hard decisions (each point is assigned to exactly one cluster)

• Example: Weka

k-means

1  initialize cluster centroids to arbitrary vectors
2  while further improvement is possible do
3    for each document d do
4      find the cluster c whose centroid is closest to d
5      assign d to cluster c
6    end for
7    for each cluster c do
8      recompute the centroid of cluster c based on its documents
9    end for
10 end while
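A minimal Python sketch of the pseudocode above (plain Euclidean-distance k-means; not the course's reference implementation, and the helper names are invented):

import random
from math import dist  # Euclidean distance, Python 3.8+

def kmeans(docs, k, max_iters=100):
    # cluster 'docs' (a list of equal-length numeric vectors) into k groups
    centroids = random.sample(docs, k)          # step 1: arbitrary initial centroids
    assignment = [None] * len(docs)
    for _ in range(max_iters):                  # step 2: repeat while assignments change
        changed = False
        for i, d in enumerate(docs):            # steps 3-6: assign each doc to nearest centroid
            c = min(range(k), key=lambda j: dist(d, centroids[j]))
            if assignment[i] != c:
                assignment[i], changed = c, True
        for j in range(k):                      # steps 7-9: recompute each centroid
            members = [d for d, a in zip(docs, assignment) if a == j]
            if members:
                centroids[j] = [sum(xs) / len(members) for xs in zip(*members)]
        if not changed:                         # step 10: stop when nothing moved
            break
    return assignment, centroids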

K-means (cont’d)

• In practice (to avoid suboptimal clusters), run hierarchical agglomerative clustering on a sample of size sqrt(N) and then use the resulting cluster centroids as seeds for k-means.

Example

• Cluster the following vectors into two groups:
  – A = <1,6>
  – B = <2,2>
  – C = <4,0>
  – D = <3,3>
  – E = <2,5>
  – F = <2,1>
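As an illustration, the six points can be fed to the kmeans sketch shown earlier; the exact split depends on the random initialization, so this is a usage sketch rather than an official solution:

# the six example points, using the kmeans sketch defined above
points = {"A": [1, 6], "B": [2, 2], "C": [4, 0],
          "D": [3, 3], "E": [2, 5], "F": [2, 1]}

names, vectors = list(points.keys()), list(points.values())
assignment, centroids = kmeans(vectors, k=2)
for name, cluster in zip(names, assignment):
    print(name, "-> cluster", cluster)
# depending on initialization, a stable split typically separates the
# high-y points {A, E} (sometimes with D) from the low-y points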

Weka

• A general environment for machine learning (e.g. for classification and clustering)

• Book by Witten and Frank
• www.cs.waikato.ac.nz/ml/weka
• cd /data2/tools/weka-3-4-7
• export CLASSPATH=$CLASSPATH:./weka.jar
• java weka.clusterers.SimpleKMeans -t ~/e.arff
• java weka.clusterers.SimpleKMeans -p 1-2 -t ~/e.arff

Demos

• http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
• http://cgm.cs.mcgill.ca/~godfried/student_projects/bonnef_k-means
• http://www.cs.washington.edu/research/imagedatabase/demo/kmcluster
• http://www.cc.gatech.edu/~dellaert/html/software.html
• http://www-2.cs.cmu.edu/~awm/tutorials/kmeans11.pdf
• http://www.ece.neu.edu/groups/rpl/projects/kmeans/

Probability and likelihood

$L(\theta) = p(X \mid \theta) = \prod_i p(x_i \mid \theta)$

Example: what is the likelihood in this case?

Bayesian formulation

Posterior ∝ likelihood × prior
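Written out in standard notation (this symbolic form is not reproduced from the slide itself):

$p(\theta \mid X) \;\propto\; p(X \mid \theta)\, p(\theta)$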

E-M algorithms

[Dempster et al. 77]

• Class of iterative algorithms for maximum likelihood estimation in problems with incomplete data. Given a model of data generation and data with some missing values, EM alternately uses the current model to estimate the missing values, and then uses the missing value estimates to improve the model. Using all the available data, EM will locally maximize the likelihood of the generative parameters giving estimates for the missing values.

[McCallum & Nigam 98]

E-M algorithm

• Initialize probability model

• Repeat
  – E-step: use the best available current classifier to classify some datapoints
  – M-step: modify the classifier based on the classes produced by the E-step

• Until convergence

Soft clustering method
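A minimal sketch of EM as a soft clustering method, here for a two-component 1-D Gaussian mixture (an illustrative choice of model; the slides do not fix a particular one):

import math
import random

def em_gmm_1d(xs, iters=50):
    # EM for a two-component 1-D Gaussian mixture (soft assignments)
    mu = random.sample(xs, 2)     # initialize the probability model
    var = [1.0, 1.0]
    prior = [0.5, 0.5]
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each point
        resp = []
        for x in xs:
            p = [prior[k] * math.exp(-(x - mu[k]) ** 2 / (2 * var[k])) /
                 math.sqrt(2 * math.pi * var[k]) for k in range(2)]
            z = sum(p)
            resp.append([pk / z for pk in p])
        # M-step: re-estimate parameters from the soft assignments
        for k in range(2):
            nk = max(sum(r[k] for r in resp), 1e-12)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk or 1e-6
            prior[k] = nk / len(xs)
    return mu, var, prior, resp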

EM example

(Figures from Chris Bishop, shown over a sequence of slides.)

Demos

http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/mixture.html

http://lcn.epfl.ch/tutorial/english/gaussian/html/

http://www.cs.cmu.edu/~alad/em/

http://www.nature.com/nbt/journal/v26/n8/full/nbt1406.html

http://people.csail.mit.edu/mcollins/papers/wpeII.4.ps

“Online” centroid method

Centroid method

$\text{centroid}(c) = \frac{1}{|c|} \sum_{\vec{x} \in c} \vec{x}$

Online centroid-based clustering

(Diagram: each incoming document is compared against the existing centroids; if sim ≥ T it is added to the closest cluster, if sim < T it starts a new cluster.)
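A minimal Python sketch of this single-pass scheme, assuming cosine similarity and a fixed threshold T (both assumptions; the slides do not specify the similarity measure):

from math import sqrt

def cos_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)) or 1.0)

def online_centroid_clustering(docs, T=0.5):
    # single pass: join the closest cluster if sim >= T, otherwise start a new one
    clusters = []  # each cluster: {"centroid": vector, "members": [vectors]}
    for d in docs:
        best, best_sim = None, -1.0
        for c in clusters:
            s = cos_sim(d, c["centroid"])
            if s > best_sim:
                best, best_sim = c, s
        if best is not None and best_sim >= T:
            best["members"].append(list(d))
            n = len(best["members"])
            best["centroid"] = [sum(xs) / n for xs in zip(*best["members"])]
        else:
            clusters.append({"centroid": list(d), "members": [list(d)]})
    return clusters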

Sample centroids

C 00022 (N=44) (10000): diana 1.93, princess 1.52

C 00025 (N=19) (10000): albanians 3.00

C 00026 (N=10) (10000): universe 1.50, expansion 1.00, bang 0.90

C 10007 (N=11) (10000): crashes 1.00, safety 0.55, transportation 0.55, drivers 0.45, board 0.36, flight 0.27, buckle 0.27, pittsburgh 0.18, graduating 0.18, automobile 0.18

C 00035 (N=22) (10000): airlines 1.45, finnair 0.45

C 00031 (N=34) (10000): el 1.85, nino 1.56

C 00008 (N=113) (10000): space 1.98, shuttle 1.17, station 0.75, nasa 0.51, columbia 0.37, mission 0.33, mir 0.30, astronauts 0.14, steering 0.11, safely 0.07

C 10062 (N=161): microsoft 3.24, justice 0.93, department 0.88, windows 0.98, corp 0.61, software 0.57, ellison 0.07, hatch 0.06, netscape 0.04, metcalfe 0.02

Evaluation of clustering

• Formal definition

• Objective function

• Purity (considering the majority class in each cluster)
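A small Python sketch of purity (illustrative; the function and label names are invented):

from collections import Counter

def purity(clusters, classes):
    # purity: for each cluster take the count of its majority gold class,
    # sum over clusters, and divide by the total number of objects
    by_cluster = {}
    for c, g in zip(clusters, classes):
        by_cluster.setdefault(c, []).append(g)
    majority = sum(Counter(golds).most_common(1)[0][1] for golds in by_cluster.values())
    return majority / len(classes)

# e.g. purity([1, 1, 1, 2, 2, 2], ["x", "x", "y", "y", "y", "y"]) -> (2 + 3) / 6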

RAND index

• Accuracy over pairwise object-object decisions (is each pair placed in the same cluster or not)

• RI = (TP + TN) / (TP + FP + FN + TN)

• In the example:

$TP + FP = \binom{6}{2} + \binom{6}{2} + \binom{5}{2} = 40$

$TP = \binom{5}{2} + \binom{4}{2} + \binom{3}{2} + \binom{2}{2} = 20$

so $FP = 40 - 20 = 20$

RAND index

                    Same cluster    Different cluster
Same class          TP = 20         FN = 24
Different class     FP = 20         TN = 72

RI = (20 + 72) / (20 + 20 + 24 + 72) = 92/136 ≈ 0.68
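A small Python sketch that computes the Rand index from pairwise decisions (illustrative; with assignments that produce the counts above it yields RI ≈ 0.68):

from itertools import combinations

def rand_index(clusters, classes):
    # Rand index: fraction of object pairs on which the clustering and the
    # gold classes agree (both say "same" or both say "different")
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(classes)), 2):
        same_cluster = clusters[i] == clusters[j]
        same_class = classes[i] == classes[j]
        if same_cluster and same_class:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_class:
            fn += 1
        else:
            tn += 1
    return (tp + tn) / (tp + fp + fn + tn)

# with TP=20, FP=20, FN=24, TN=72 as in the example: (20 + 72) / 136 ≈ 0.68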

Hierarchical clustering methods

• Single-linkage
  – One common pair is sufficient
  – Disadvantage: long chains
• Complete-linkage
  – All pairs have to match
  – Disadvantage: too conservative
• Average-linkage
• Demo

Non-hierarchical methods

• Also known as flat clustering

• Centroid method (online)

• K-means

• Expectation maximization

Hierarchical clustering

(Figure: example dendrogram over points 1-8.)

Single link produces straggly clusters (e.g., ((1 2)(5 6)))

Hierarchical agglomerative clustering: dendrograms

• http://odur.let.rug.nl/~kleiweg/clustering/clustering.html
• /data2/tools/clustering

E.g., language similarity:

Clustering using dendrograms

REPEAT
  Compute pairwise similarities
  Identify closest pair
  Merge pair into single node
UNTIL only one node left

Q: what is the equivalent Venn diagram representation?
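A minimal Python sketch of this agglomerative loop, assuming the objects are numeric vectors and single-linkage similarity (an assumption; any pairwise similarity could be plugged in):

from math import dist

def single_link_sim(a, b):
    # similarity of two clusters = similarity of their closest pair of points
    return -min(dist(x, y) for x in a for y in b)

def hac(points):
    # repeatedly merge the closest pair of clusters until only one node is left;
    # the merge history corresponds to the dendrogram
    clusters = [[tuple(p)] for p in points]
    history = []
    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = max(pairs, key=lambda ij: single_link_sim(clusters[ij[0]], clusters[ij[1]]))
        history.append((clusters[i], clusters[j]))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [clusters[i] + clusters[j]]
    return history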

Example: cluster the following sentences:

A B C B A
A D C C A D E
C D E F C D A
E F G F D A
A C D A B A