Information Retrieval (8)

Prof. Dragomir R. Radev

[email protected]


IR Winter 2010

…13. Clustering…


Clustering

• Exclusive/overlapping clusters

• Hierarchical/flat clusters

• The cluster hypothesis
  – Documents in the same cluster are relevant to the same query
  – How do we use it in practice?


Representations for document clustering

• Typically: vector-based
  – Words: “cat”, “dog”, etc.
  – Features: document length, author name, etc.

• Each document is represented as a vector in an n-dimensional space

• Similar documents appear nearby in the vector space (distance measures are needed)
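A minimal sketch of this representation in Python (the two toy documents and the raw bag-of-words weighting are illustrative assumptions, not from the slides):

import math
from collections import Counter

def to_vector(text):
    """Represent a document as a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(v1, v2):
    """Cosine similarity: documents pointing the same way score near 1."""
    dot = sum(v1[w] * v2[w] for w in v1)
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (norm1 * norm2)

d1 = to_vector("the cat sat on the mat")
d2 = to_vector("the dog sat on the log")
print(cosine(d1, d2))  # higher = more similar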


Scatter-gather

• Introduced by Cutting, Karger, and Pedersen

• Iterative process
  – Show terms for each cluster
  – User picks some of them
  – System produces new clusters
• Example:
  – http://www.ischool.berkeley.edu/~hearst/images/sg-example1.html


k-means

• Iteratively determine which cluster a point belongs to, then adjust the cluster centroid, then repeat

• Needed: a small number k of desired clusters

• Hard decisions (each point is assigned to exactly one cluster)

• Example: Weka


k-means

initialize cluster centroids to arbitrary vectors
while further improvement is possible do
    for each document d do
        find the cluster c whose centroid is closest to d
        assign d to cluster c
    end for
    for each cluster c do
        recompute the centroid of cluster c based on its documents
    end for
end while
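A runnable sketch of this loop in Python with NumPy (the random initialization and the stopping test are one simple choice among many; none of this is course-specific code):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: assign each point to the nearest centroid, recompute, repeat."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # arbitrary initial centroids
    for _ in range(n_iter):
        # assignment step: each document goes to the closest centroid (hard decision)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: recompute each centroid from its assigned documents
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):  # no further improvement
            break
        centroids = new_centroids
    return labels, centroids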


K-means (cont’d)

• In practice (to avoid suboptimal clusters), run hierarchical agglomerative clustering on a sample of size √N and then use the resulting clusters as seeds for k-means.


Example

• Cluster the following vectors into two groups:
  – A = <1,6>
  – B = <2,2>
  – C = <4,0>
  – D = <3,3>
  – E = <2,5>
  – F = <2,1>
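One way to check the answer, assuming scikit-learn is available: with k = 2, a stable solution puts the high-y points A and E in one cluster and B, C, D, F in the other.

from sklearn.cluster import KMeans
import numpy as np

X = np.array([[1, 6], [2, 2], [4, 0], [3, 3], [2, 5], [2, 1]])  # A..F
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # A and E land in one cluster; B, C, D, F in the other
print(km.cluster_centers_)  # approx (1.5, 5.5) and (2.75, 1.5)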


Weka

• A general environment for machine learning (e.g. for classification and clustering)

• Book by Witten and Frank
• www.cs.waikato.ac.nz/ml/weka
• cd /data2/tools/weka-3-4-7
• export CLASSPATH=$CLASSPATH:./weka.jar
• java weka.clusterers.SimpleKMeans -t ~/e.arff
• java weka.clusterers.SimpleKMeans -p 1-2 -t ~/e.arff
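The contents of ~/e.arff are not shown on the slides; a minimal file in the standard Weka ARFF format, here filled with the two-dimensional example vectors from the earlier slide (the relation and attribute names are invented), might look like:

@relation points
@attribute x numeric
@attribute y numeric
@data
1,6
2,2
4,0
3,3
2,5
2,1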


Demos

• http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
• http://cgm.cs.mcgill.ca/~godfried/student_projects/bonnef_k-means
• http://www.cs.washington.edu/research/imagedatabase/demo/kmcluster
• http://www.cc.gatech.edu/~dellaert/html/software.html
• http://www-2.cs.cmu.edu/~awm/tutorials/kmeans11.pdf
• http://www.ece.neu.edu/groups/rpl/projects/kmeans/


Probability and likelihood

The likelihood of the parameters θ given the observed data X:

$L(\theta) = p(X \mid \theta) = \prod_i p(x_i \mid \theta)$

Example: what is $L(\theta)$ in this case?
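As a concrete instance (a standard textbook example, not from the slides): for n independent coin flips with h heads and bias θ,

$L(\theta) = \theta^{h}\,(1-\theta)^{\,n-h}$

which is maximized at $\hat{\theta} = h/n$; e.g., 7 heads in 10 flips gives $\hat{\theta} = 0.7$.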


Bayesian formulation

Posterior ∝ likelihood × prior
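In symbols (standard Bayes' rule, with θ the parameters and X the data):

$p(\theta \mid X) = \dfrac{p(X \mid \theta)\, p(\theta)}{p(X)} \propto p(X \mid \theta)\, p(\theta)$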


E-M algorithms

[Dempster et al. 77]

• Class of iterative algorithms for maximum likelihood estimation in problems with incomplete data. Given a model of data generation and data with some missing values, EM alternately uses the current model to estimate the missing values, and then uses the missing value estimates to improve the model. Using all the available data, EM will locally maximize the likelihood of the generative parameters giving estimates for the missing values.

[McCallum & Nigam 98]


E-M algorithm

• Initialize probability model

• Repeat
  – E-step: use the best available current classifier to classify some datapoints
  – M-step: modify the classifier based on the classes produced by the E-step

• Until convergence

Soft clustering method
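A compact sketch of EM for a two-component 1-D Gaussian mixture (the initialization and the synthetic data are invented for illustration; this shows the generic algorithm, not code from the course):

import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, n_iter=50):
    """EM for a 2-component 1-D Gaussian mixture: soft clustering."""
    mu = np.array([x.min(), x.max()])      # crude initial means
    sigma = np.array([x.std(), x.std()])   # initial standard deviations
    pi = np.array([0.5, 0.5])              # mixing weights
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point
        dens = np.stack([pi[j] * norm.pdf(x, mu[j], sigma[j]) for j in range(2)])
        resp = dens / dens.sum(axis=0)
        # M-step: re-estimate parameters from the soft assignments
        nk = resp.sum(axis=1)
        mu = (resp * x).sum(axis=1) / nk
        sigma = np.sqrt((resp * (x - mu[:, None]) ** 2).sum(axis=1) / nk)
        pi = nk / len(x)
    return resp  # resp[j, i] = P(point i came from component j)

x = np.concatenate([np.random.normal(0, 1, 100), np.random.normal(5, 1, 100)])
resp = em_gmm_1d(x)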

EM example

Figures from Chris Bishop (a sequence of EM iterations, shown over four slides)

Demos

• http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/mixture.html
• http://lcn.epfl.ch/tutorial/english/gaussian/html/
• http://www.cs.cmu.edu/~alad/em/
• http://www.nature.com/nbt/journal/v26/n8/full/nbt1406.html
• http://people.csail.mit.edu/mcollins/papers/wpeII.4.ps


“Online” centroid method


Centroid method

The centroid of a cluster c is the average of its member vectors:

$\vec{\mu}(c) = \frac{1}{|c|} \sum_{\vec{x} \in c} \vec{x}$


Online centroid-based clustering

[Flowchart: compare each incoming document to the existing centroids; if sim ≥ T, add it to the closest cluster; if sim < T, start a new cluster]
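A sketch of this single-pass scheme in Python (cosine similarity and the threshold value are my choices, not taken from the slides):

import numpy as np

def online_cluster(docs, T=0.5):
    """Single-pass clustering: a document joins the most similar
    existing centroid if sim >= T, otherwise it starts a new cluster."""
    clusters = []  # list of (centroid, members)
    for d in docs:
        sims = [np.dot(d, c) / (np.linalg.norm(d) * np.linalg.norm(c))
                for c, _ in clusters]
        if sims and max(sims) >= T:
            i = int(np.argmax(sims))
            _, members = clusters[i]
            members.append(d)
            clusters[i] = (np.mean(members, axis=0), members)  # update centroid
        else:
            clusters.append((d.copy(), [d]))  # start a new cluster
    return clusters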


Sample centroids

C 00022 (N=44) (10000): diana 1.93, princess 1.52
C 00025 (N=19) (10000): albanians 3.00
C 00026 (N=10) (10000): universe 1.50, expansion 1.00, bang 0.90
C 10007 (N=11) (10000): crashes 1.00, safety 0.55, transportation 0.55, drivers 0.45, board 0.36, flight 0.27, buckle 0.27, pittsburgh 0.18, graduating 0.18, automobile 0.18
C 00035 (N=22) (10000): airlines 1.45, finnair 0.45
C 00031 (N=34) (10000): el 1.85, nino 1.56
C 00008 (N=113) (10000): space 1.98, shuttle 1.17, station 0.75, nasa 0.51, columbia 0.37, mission 0.33, mir 0.30, astronauts 0.14, steering 0.11, safely 0.07
C 10062 (N=161): microsoft 3.24, justice 0.93, department 0.88, windows 0.98, corp 0.61, software 0.57, ellison 0.07, hatch 0.06, netscape 0.04, metcalfe 0.02


Evaluation of clustering

• Formal definition

• Objective function

• Purity (considering the majority class in each cluster)
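For reference, the standard purity formula (with N points, clusters $\omega_k$, and gold classes $c_j$):

$\text{purity} = \frac{1}{N} \sum_{k} \max_{j} \left| \omega_k \cap c_j \right|$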


RAND index

• Accuracy in preserving object–object relationships (decisions over pairs of points)

• RI = (TP + TN) / (TP + FP + FN + TN)

• In the example:

$TP + FP = \binom{6}{2} + \binom{6}{2} + \binom{5}{2} = 40$

$TP = \binom{5}{2} + \binom{4}{2} + \binom{3}{2} + \binom{2}{2} = 20$

$FP = 40 - 20 = 20$


RAND index

                    Same cluster    Different clusters
Same class          TP = 20         FN = 24
Different classes   FP = 20         TN = 72

RI = (20 + 72) / (20 + 20 + 24 + 72) ≈ 0.68
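A small sketch that computes RI directly from two labelings by counting pairs (the label encodings are arbitrary placeholders):

from itertools import combinations

def rand_index(classes, clusters):
    """RI = (TP + TN) / (TP + FP + FN + TN), counting every pair of points."""
    tp = tn = fp = fn = 0
    for i, j in combinations(range(len(classes)), 2):
        same_class = classes[i] == classes[j]
        same_cluster = clusters[i] == clusters[j]
        if same_class and same_cluster:
            tp += 1
        elif not same_class and not same_cluster:
            tn += 1
        elif same_cluster:   # same cluster, different class
            fp += 1
        else:                # same class, different cluster
            fn += 1
    return (tp + tn) / (tp + tn + fp + fn)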


Hierarchical clustering methods

• Single-linkage
  – One common pair is sufficient
  – Disadvantage: long chains
• Complete-linkage
  – All pairs have to match
  – Disadvantage: too conservative
• Average-linkage
• Demo (see the sketch below)
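The three linkage criteria can be compared directly with SciPy (the random 2-D points stand in for real document vectors):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 2)                 # 20 random 2-D "documents"
for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)         # agglomerative merge tree
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut into 3 clusters
    print(method, labels)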


Non-hierarchical methods

• Also known as flat clustering

• Centroid method (online)

• K-means

• Expectation maximization


Hierarchical clustering

[Figure: eight numbered points, 1–8]

Single link produces straggly clusters (e.g., ((1 2) (5 6)))


Hierarchical agglomerative clustering: dendrograms

http://odur.let.rug.nl/~kleiweg/clustering/clustering.html
/data2/tools/clustering

E.g., language similarity: [dendrogram figure]


Clustering using dendrograms

REPEAT
    Compute pairwise similarities
    Identify closest pair
    Merge pair into single node
UNTIL only one node left

Q: what is the equivalent Venn diagram representation?

Example: cluster the following sentences (a sketch follows below):
A B C B A
A D C C A D E
C D E F C D A
E F G F D A
A C D A B A
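One way to work the exercise in Python with SciPy, treating each sentence as a bag-of-tokens count vector (the vectorization and linkage choices are mine, not prescribed by the slides):

import numpy as np
from scipy.cluster.hierarchy import linkage

sentences = ["A B C B A", "A D C C A D E", "C D E F C D A",
             "E F G F D A", "A C D A B A"]
vocab = sorted({w for s in sentences for w in s.split()})
X = np.array([[s.split().count(w) for w in vocab] for s in sentences], dtype=float)

# each step: compute pairwise similarities, merge the closest pair
Z = linkage(X, method="average", metric="cosine")
print(Z)  # each row: the two nodes merged and the distance at which they merged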