Information Retrieval Search Engine Technology (8) Prof. Dragomir R. Radev.
![Page 1: Information Retrieval Search Engine Technology (8) Prof. Dragomir R. Radev.](https://reader035.fdocuments.us/reader035/viewer/2022081514/5a4d1b697f8b9ab0599b2a00/html5/thumbnails/1.jpg)
Information Retrieval
Search Engine Technology (8)
http://tangra.si.umich.edu/clair/ir09
Prof. Dragomir R. Radev
[email protected]
![Page 2: Information Retrieval Search Engine Technology (8) Prof. Dragomir R. Radev.](https://reader035.fdocuments.us/reader035/viewer/2022081514/5a4d1b697f8b9ab0599b2a00/html5/thumbnails/2.jpg)
SET/IR – W/S 2009
…13. Clustering…
![Page 3: Information Retrieval Search Engine Technology (8) Prof. Dragomir R. Radev.](https://reader035.fdocuments.us/reader035/viewer/2022081514/5a4d1b697f8b9ab0599b2a00/html5/thumbnails/3.jpg)
![Page 4: Information Retrieval Search Engine Technology (8) Prof. Dragomir R. Radev.](https://reader035.fdocuments.us/reader035/viewer/2022081514/5a4d1b697f8b9ab0599b2a00/html5/thumbnails/4.jpg)
![Page 5: Information Retrieval Search Engine Technology (8) Prof. Dragomir R. Radev.](https://reader035.fdocuments.us/reader035/viewer/2022081514/5a4d1b697f8b9ab0599b2a00/html5/thumbnails/5.jpg)
Clustering
• Exclusive/overlapping clusters
• Hierarchical/flat clusters
• The cluster hypothesis
– Documents in the same cluster are relevant to the same query
– How do we use it in practice?
![Page 6: Information Retrieval Search Engine Technology (8) Prof. Dragomir R. Radev.](https://reader035.fdocuments.us/reader035/viewer/2022081514/5a4d1b697f8b9ab0599b2a00/html5/thumbnails/6.jpg)
Representations for document clustering
• Typically: vector-based
– Words: “cat”, “dog”, etc.
– Features: document length, author name, etc.
• Each document is represented as a vector in an n-dimensional space
• Similar documents appear nearby in the vector space (distance measures are needed)
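A minimal sketch of the vector representation, assuming raw term counts as the vector components and cosine similarity as the closeness measure. The slide fixes neither choice, and the helper names `to_vector` and `cosine` are illustrative only.

```python
import math
from collections import Counter

def to_vector(text):
    """Represent a document as a sparse term-count vector (word -> count)."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

d1 = to_vector("the cat chased the dog")
d2 = to_vector("the dog chased the cat")
d3 = to_vector("stock prices fell sharply")

print(cosine(d1, d2))  # same bag of words, so the similarity is 1 up to rounding
print(cosine(d1, d3))  # no shared terms, so the similarity is 0.0
```

Word order is lost in this representation, which is why the first two documents look identical to the clustering algorithm.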
![Page 7: Information Retrieval Search Engine Technology (8) Prof. Dragomir R. Radev.](https://reader035.fdocuments.us/reader035/viewer/2022081514/5a4d1b697f8b9ab0599b2a00/html5/thumbnails/7.jpg)
Scatter-gather
• Introduced by Cutting, Karger, and Pedersen
• Iterative process
– Show terms for each cluster
– User picks some of them
– System produces new clusters
• Example:
– http://www.ischool.berkeley.edu/~hearst/images/sg-example1.html
![Page 8: Information Retrieval Search Engine Technology (8) Prof. Dragomir R. Radev.](https://reader035.fdocuments.us/reader035/viewer/2022081514/5a4d1b697f8b9ab0599b2a00/html5/thumbnails/8.jpg)
k-means
• Iteratively determine which cluster a point belongs to, then adjust the cluster centroid, then repeat
• Needed: small number k of desired clusters
• Hard decisions
• Example: Weka
![Page 9: Information Retrieval Search Engine Technology (8) Prof. Dragomir R. Radev.](https://reader035.fdocuments.us/reader035/viewer/2022081514/5a4d1b697f8b9ab0599b2a00/html5/thumbnails/9.jpg)
k-means
initialize cluster centroids to arbitrary vectors
while further improvement is possible do
    for each document d do
        find the cluster c whose centroid is closest to d
        assign d to cluster c
    end for
    for each cluster c do
        recompute the centroid of cluster c based on its documents
    end for
end while
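The loop above can be sketched in Python roughly as follows. This is one possible reading, assuming Euclidean distance, caller-supplied initial centroids, and "no further improvement" meaning that no assignment changed. The function name and the sample points are hypothetical.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """k-means over equal-length tuples; returns (assignment, centroids)."""
    centroids = [tuple(c) for c in centroids]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each document goes to the closest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:  # nothing moved, so stop
            break
        assignment = new_assignment
        # Update step: recompute each centroid as the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return assignment, centroids

docs = [(0, 0), (0, 1), (10, 10), (10, 11)]   # hypothetical 2-D document vectors
seeds = [(0, 0), (10, 10)]                    # hypothetical initial centroids
labels, centers = kmeans(docs, seeds)
print(labels)   # [0, 0, 1, 1]
print(centers)  # [(0.0, 0.5), (10.0, 10.5)]
```

Each point ends up in exactly one cluster, which is the "hard decisions" property of k-means.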
![Page 10: Information Retrieval Search Engine Technology (8) Prof. Dragomir R. Radev.](https://reader035.fdocuments.us/reader035/viewer/2022081514/5a4d1b697f8b9ab0599b2a00/html5/thumbnails/10.jpg)
K-means (cont’d)
• In practice (to avoid suboptimal clusters), run hierarchical agglomerative clustering on a sample of size sqrt(N) and then use the resulting clusters as seeds for k-means.
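A rough sketch of this seeding step, assuming Euclidean distance and centroid linkage for the agglomerative pass (the slide does not say which linkage is used). The name `hac_seeds` and the sample data are hypothetical; the returned seeds would then be handed to a k-means routine.

```python
import math
import random

def centroid(cluster):
    """Component-wise mean of a list of equal-length tuples."""
    return tuple(sum(xs) / len(cluster) for xs in zip(*cluster))

def hac_seeds(points, k, rng=random):
    """Pick k seed centroids by agglomerating a random sample of about sqrt(N) points."""
    sample_size = max(k, math.isqrt(len(points)))
    sample = rng.sample(points, min(sample_size, len(points)))
    clusters = [[p] for p in sample]
    while len(clusters) > k:
        # Merge the two clusters whose centroids are closest (assumed linkage).
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: math.dist(centroid(clusters[ab[0]]), centroid(clusters[ab[1]])),
        )
        clusters[i].extend(clusters.pop(j))  # j > i, so index i stays valid
    return [centroid(c) for c in clusters]

# Hypothetical data: two well-separated clumps of 25 points each.
points = [(x, y) for x in range(5) for y in range(5)]
points += [(x + 20, y + 20) for x in range(5) for y in range(5)]
seeds = hac_seeds(points, k=2, rng=random.Random(0))
print(seeds)
```

Clustering only about sqrt(N) points keeps the quadratic cost of the agglomerative pass roughly linear in N.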
![Page 11: Information Retrieval Search Engine Technology (8) Prof. Dragomir R. Radev.](https://reader035.fdocuments.us/reader035/viewer/2022081514/5a4d1b697f8b9ab0599b2a00/html5/thumbnails/11.jpg)
Example
• Cluster the following vectors into two groups:
– A = <1,6>
– B = <2,2>
– C = <4,0>
– D = <3,3>
– E = <2,5>
– F = <2,1>
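One possible trace of k-means on these six points, assuming squared Euclidean distance $d^2$ and A and C as the initial centroids. The slide fixes neither choice, and other seeds can give a different partition.

```latex
\begin{align*}
\text{Seeds: }    & \mu_1 = A = (1,6), \quad \mu_2 = C = (4,0) \\
\text{Assign: }   & d^2(B,\mu_1) = 17 > d^2(B,\mu_2) = 8, \quad
                    d^2(D,\mu_1) = 13 > d^2(D,\mu_2) = 10, \\
                  & d^2(E,\mu_1) = 2 < d^2(E,\mu_2) = 29, \quad
                    d^2(F,\mu_1) = 26 > d^2(F,\mu_2) = 5 \\
\text{Clusters: } & \{A, E\}, \quad \{B, C, D, F\} \\
\text{Update: }   & \mu_1 = \tfrac{1}{2}\big((1,6) + (2,5)\big) = (1.5,\ 5.5), \quad
                    \mu_2 = \tfrac{1}{4}\big((2,2) + (4,0) + (3,3) + (2,1)\big) = (2.75,\ 1.5)
\end{align*}
```

Reassigning with the updated centroids leaves every point where it is, so under these assumptions the procedure stops with the clusters {A, E} and {B, C, D, F}.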
![Page 12: Information Retrieval Search Engine Technology (8) Prof. Dragomir R. Radev.](https://reader035.fdocuments.us/reader035/viewer/2022081514/5a4d1b697f8b9ab0599b2a00/html5/thumbnails/12.jpg)
Weka
• A general environment for machine learning (e.g. for classification and clustering)
• Book by Witten and Frank
• www.cs.waikato.ac.nz/ml/weka
• cd /data2/tools/weka-3-4-7
• export CLASSPATH=$CLASSPATH:./weka.jar
• java weka.clusterers.SimpleKMeans -t ~/e.arff
• java weka.clusterers.SimpleKMeans -p 1-2 -t ~/e.arff
![Page 13: Information Retrieval Search Engine Technology (8) Prof. Dragomir R. Radev.](https://reader035.fdocuments.us/reader035/viewer/2022081514/5a4d1b697f8b9ab0599b2a00/html5/thumbnails/13.jpg)
Demos
• http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
• http://cgm.cs.mcgill.ca/~godfried/student_projects/bonnef_k-means
• http://www.cs.washington.edu/research/imagedatabase/demo/kmcluster
• http://www.cc.gatech.edu/~dellaert/html/software.html
• http://www-2.cs.cmu.edu/~awm/tutorials/kmeans11.pdf
• http://www.ece.neu.edu/groups/rpl/projects/kmeans/
![Page 14: Information Retrieval Search Engine Technology (8) Prof. Dragomir R. Radev.](https://reader035.fdocuments.us/reader035/viewer/2022081514/5a4d1b697f8b9ab0599b2a00/html5/thumbnails/14.jpg)
Probability and likelihood
$$L(\theta) = p(X \mid \theta) = \prod_i p(x_i \mid \theta)$$
Example:
What is $\theta$ in this case?
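The slide's own example did not survive in this transcript, so the following is a stand-in rather than the original: a sketch that assumes independent coin flips, with the single parameter θ being the probability of heads. The data and the grid search are hypothetical.

```python
import math

def likelihood(theta, flips):
    """Product over i of p(x_i | theta) for independent coin flips (1 = heads)."""
    return math.prod(theta if x == 1 else 1 - theta for x in flips)

flips = [1, 1, 0, 1]  # hypothetical data: three heads, one tail

# Evaluate the likelihood on a grid of candidate parameters and keep the best one.
grid = [i / 100 for i in range(101)]
theta_hat = max(grid, key=lambda t: likelihood(t, flips))

print(likelihood(0.5, flips))  # 0.0625
print(theta_hat)               # 0.75, the observed fraction of heads
```

The same data are more likely under θ = 0.75 than under a fair coin, which is what maximum likelihood estimation exploits.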
![Page 15: Information Retrieval Search Engine Technology (8) Prof. Dragomir R. Radev.](https://reader035.fdocuments.us/reader035/viewer/2022081514/5a4d1b697f8b9ab0599b2a00/html5/thumbnails/15.jpg)
Bayesian formulation
Posterior ∝ likelihood × prior
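Spelled out with Bayes' rule, writing $\theta$ for the model parameters and $X$ for the observed data (both symbols are assumed here for illustration):

```latex
p(\theta \mid X) \;=\; \frac{p(X \mid \theta)\, p(\theta)}{p(X)}
\;\propto\; p(X \mid \theta)\, p(\theta),
\qquad
p(X) = \int p(X \mid \theta)\, p(\theta)\, d\theta
```

The denominator $p(X)$ does not depend on $\theta$, which is why it can be dropped when only the shape of the posterior matters.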
![Page 16: Information Retrieval Search Engine Technology (8) Prof. Dragomir R. Radev.](https://reader035.fdocuments.us/reader035/viewer/2022081514/5a4d1b697f8b9ab0599b2a00/html5/thumbnails/16.jpg)
E-M algorithms
[Dempster et al. 77]
• Class of iterative algorithms for maximum likelihood estimation in problems with incomplete data. Given a model of data generation and data with some missing values, EM alternately uses the current model to estimate the missing values, and then uses the missing value estimates to improve the model. Using all the available data, EM will locally maximize the likelihood of the generative parameters giving estimates for the missing values.
[McCallum & Nigam 98]
![Page 17: Information Retrieval Search Engine Technology (8) Prof. Dragomir R. Radev.](https://reader035.fdocuments.us/reader035/viewer/2022081514/5a4d1b697f8b9ab0599b2a00/html5/thumbnails/17.jpg)
E-M algorithm
• Initialize probability model
• Repeat
– E-step: use the best available current classifier to classify some datapoints
– M-step: modify the classifier based on the classes produced by the E-step.
• Until convergence
Soft clustering method
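A compact sketch of these steps for one concrete model, assuming a one-dimensional mixture of Gaussians. The model choice, the fixed iteration count, the variance floor, and the data are assumptions made for illustration and are not taken from the slides.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, means, n_iter=50):
    """EM for a 1-D Gaussian mixture; returns (weights, means, variances, responsibilities)."""
    k = len(means)
    means = list(means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: soft-assign each point to each component (each row sums to 1).
        resp = []
        for x in data:
            joint = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(joint)
            resp.append([p / total for p in joint])
        # M-step: re-estimate the parameters from the soft assignments.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor keeps a component from collapsing
    return weights, means, variances, resp

data = [1.0, 1.2, 0.8, 1.1, 5.0, 5.3, 4.9, 5.1]  # hypothetical 1-D observations
weights, means, variances, resp = em_gmm_1d(data, means=[0.0, 6.0])
print(means)  # close to the averages of the two visible groups
```

Unlike k-means, every point keeps a fractional membership in every component, which is what makes this a soft clustering method.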
![Page 18: Information Retrieval Search Engine Technology (8) Prof. Dragomir R. Radev.](https://reader035.fdocuments.us/reader035/viewer/2022081514/5a4d1b697f8b9ab0599b2a00/html5/thumbnails/18.jpg)
EM example
Figure from Chris Bishop
![Page 19: Information Retrieval Search Engine Technology (8) Prof. Dragomir R. Radev.](https://reader035.fdocuments.us/reader035/viewer/2022081514/5a4d1b697f8b9ab0599b2a00/html5/thumbnails/19.jpg)
EM example
Figure from Chris Bishop
![Page 20: Information Retrieval Search Engine Technology (8) Prof. Dragomir R. Radev.](https://reader035.fdocuments.us/reader035/viewer/2022081514/5a4d1b697f8b9ab0599b2a00/html5/thumbnails/20.jpg)
EM example
Figure from Chris Bishop
![Page 21: Information Retrieval Search Engine Technology (8) Prof. Dragomir R. Radev.](https://reader035.fdocuments.us/reader035/viewer/2022081514/5a4d1b697f8b9ab0599b2a00/html5/thumbnails/21.jpg)
EM example
Figure from Chris Bishop
![Page 22: Information Retrieval Search Engine Technology (8) Prof. Dragomir R. Radev.](https://reader035.fdocuments.us/reader035/viewer/2022081514/5a4d1b697f8b9ab0599b2a00/html5/thumbnails/22.jpg)
Demos
http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/mixture.html
http://lcn.epfl.ch/tutorial/english/gaussian/html/
http://www.cs.cmu.edu/~alad/em/
http://www.nature.com/nbt/journal/v26/n8/full/nbt1406.html
http://people.csail.mit.edu/mcollins/papers/wpeII.4.ps
![Page 23: Information Retrieval Search Engine Technology (8) Prof. Dragomir R. Radev.](https://reader035.fdocuments.us/reader035/viewer/2022081514/5a4d1b697f8b9ab0599b2a00/html5/thumbnails/23.jpg)
“Online” centroid method
![Page 24: Information Retrieval Search Engine Technology (8) Prof. Dragomir R. Radev.](https://reader035.fdocuments.us/reader035/viewer/2022081514/5a4d1b697f8b9ab0599b2a00/html5/thumbnails/24.jpg)
Centroid method
$$\vec{\mu}(c) = \frac{1}{|c|} \sum_{\vec{x} \in c} \vec{x}$$
![Page 25: Information Retrieval Search Engine Technology (8) Prof. Dragomir R. Radev.](https://reader035.fdocuments.us/reader035/viewer/2022081514/5a4d1b697f8b9ab0599b2a00/html5/thumbnails/25.jpg)
Online centroid-based clustering
sim ≥ T sim < T
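A sketch of how the threshold test might drive a single pass over the documents. It assumes cosine similarity against each cluster's centroid, with a document joining the closest cluster when sim ≥ T and starting a new cluster when sim < T. This reading of the diagram labels, the data structures, and the sample documents are all assumptions.

```python
import math

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def online_cluster(docs, threshold):
    """Single-pass clustering over sparse term-weight dicts."""
    # Each cluster keeps the sum of its members' vectors; cosine ignores scale,
    # so comparing against the sum is the same as comparing against the mean.
    clusters = []  # list of {"sum": term -> weight, "members": [doc indices]}
    for i, doc in enumerate(docs):
        best = max(clusters, key=lambda c: cosine(doc, c["sum"]), default=None)
        if best is not None and cosine(doc, best["sum"]) >= threshold:
            # sim >= T: join the closest cluster and update its centroid.
            best["members"].append(i)
            for t, w in doc.items():
                best["sum"][t] = best["sum"].get(t, 0.0) + w
        else:
            # sim < T: start a new cluster seeded by this document.
            clusters.append({"sum": dict(doc), "members": [i]})
    return clusters

docs = [  # hypothetical term-weight vectors
    {"space": 1, "shuttle": 1},
    {"space": 1, "nasa": 1},
    {"microsoft": 1, "windows": 1},
    {"shuttle": 1, "nasa": 1, "space": 1},
]
clusters = online_cluster(docs, threshold=0.4)
print([c["members"] for c in clusters])  # [[0, 1, 3], [2]]
```

Because documents are processed in arrival order and never reassigned, the result can depend on that order.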
![Page 26: Information Retrieval Search Engine Technology (8) Prof. Dragomir R. Radev.](https://reader035.fdocuments.us/reader035/viewer/2022081514/5a4d1b697f8b9ab0599b2a00/html5/thumbnails/26.jpg)
Sample centroids
C 00022 (N=44) (10000): diana 1.93, princess 1.52
C 00025 (N=19) (10000): albanians 3.00
C 00026 (N=10) (10000): universe 1.50, expansion 1.00, bang 0.90
C 10007 (N=11) (10000): crashes 1.00, safety 0.55, transportation 0.55, drivers 0.45, board 0.36, flight 0.27, buckle 0.27, pittsburgh 0.18, graduating 0.18, automobile 0.18
C 00035 (N=22) (10000): airlines 1.45, finnair 0.45
C 00031 (N=34) (10000): el 1.85, nino 1.56
C 00008 (N=113) (10000): space 1.98, shuttle 1.17, station 0.75, nasa 0.51, columbia 0.37, mission 0.33, mir 0.30, astronauts 0.14, steering 0.11, safely 0.07
C 10062 (N=161): microsoft 3.24, justice 0.93, department 0.88, windows 0.98, corp 0.61, software 0.57, ellison 0.07, hatch 0.06, netscape 0.04, metcalfe 0.02
![Page 27: Information Retrieval Search Engine Technology (8) Prof. Dragomir R. Radev.](https://reader035.fdocuments.us/reader035/viewer/2022081514/5a4d1b697f8b9ab0599b2a00/html5/thumbnails/27.jpg)
Evaluation of clustering
• Formal definition
• Objective function
• Purity (considering the majority class in each cluster)
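Purity can be computed as sketched below, assuming gold class labels are available for every clustered document. The function name and the labels are hypothetical.

```python
from collections import Counter

def purity(clusters):
    """clusters: list of lists of gold class labels, one inner list per cluster."""
    total = sum(len(c) for c in clusters)
    # For each cluster, count how many members carry its majority class.
    majority = sum(Counter(c).most_common(1)[0][1] for c in clusters if c)
    return majority / total

# Hypothetical gold labels of the documents placed in each of two clusters.
clusters = [
    ["sports", "sports", "politics"],
    ["politics", "politics", "politics", "sports"],
]
print(purity(clusters))  # (2 + 3) / 7
```

Purity rises trivially as the number of clusters grows (one document per cluster gives 1.0), so it is best compared across clusterings of the same size.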
![Page 28: Information Retrieval Search Engine Technology (8) Prof. Dragomir R. Radev.](https://reader035.fdocuments.us/reader035/viewer/2022081514/5a4d1b697f8b9ab0599b2a00/html5/thumbnails/28.jpg)
RAND index
• Accuracy when preserving object-object relationships.
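• RI = (TP+TN)/(TP+FP+FN+TN)
• In the example:
$$TP + FP = \binom{6}{2} + \binom{6}{2} + \binom{5}{2} = 40$$
$$TP = \binom{5}{2} + \binom{4}{2} + \binom{3}{2} + \binom{2}{2} = 20$$
$$FP = 40 - 20 = 20$$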
![Page 29: Information Retrieval Search Engine Technology (8) Prof. Dragomir R. Radev.](https://reader035.fdocuments.us/reader035/viewer/2022081514/5a4d1b697f8b9ab0599b2a00/html5/thumbnails/29.jpg)
RAND index
                     Same cluster    Different clusters
Same class           TP=20           FN=24
Different classes    FP=20           TN=72
RI = 0.68
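The pair counting behind these numbers can be sketched as follows. The cluster and class labels are an assumption, chosen so that they reproduce the counts on the slide (17 objects in clusters of size 6, 6, and 5). The class names x, o, and d are placeholders.

```python
from itertools import combinations

def pair_counts(clusters, classes):
    """Count TP, FP, FN, TN over all unordered pairs of objects."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(clusters)), 2):
        same_cluster = clusters[i] == clusters[j]
        same_class = classes[i] == classes[j]
        if same_cluster and same_class:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_class:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def rand_index(clusters, classes):
    tp, fp, fn, tn = pair_counts(clusters, classes)
    return (tp + tn) / (tp + fp + fn + tn)

# Assumed data: cluster id and gold class for each of the 17 objects.
clusters = [1] * 6 + [2] * 6 + [3] * 5
classes = list("xxxxxo") + list("xooood") + list("xxddd")

print(pair_counts(clusters, classes))           # (20, 20, 24, 72)
print(round(rand_index(clusters, classes), 2))  # 0.68
```

The loop visits all 136 pairs, which is fine for small examples; larger collections would count pairs from a contingency table instead.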
![Page 30: Information Retrieval Search Engine Technology (8) Prof. Dragomir R. Radev.](https://reader035.fdocuments.us/reader035/viewer/2022081514/5a4d1b697f8b9ab0599b2a00/html5/thumbnails/30.jpg)
Hierarchical clustering methods
• Single-linkage
– One common pair is sufficient
– Disadvantages: long chains
• Complete-linkage
– All pairs have to match
– Disadvantages: too conservative
• Average-linkage
• Demo
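The three criteria differ only in how the distance between two clusters is derived from the distances between their members. A small sketch, assuming Euclidean distance between points (the slides speak of matching pairs, so this distance-based formulation is an assumption); the function names and points are hypothetical.

```python
import math
from itertools import product

def single_link(a, b):
    """Distance of the closest cross-cluster pair: one close pair is enough."""
    return min(math.dist(p, q) for p, q in product(a, b))

def complete_link(a, b):
    """Distance of the farthest cross-cluster pair: all pairs must be close."""
    return max(math.dist(p, q) for p, q in product(a, b))

def average_link(a, b):
    """Mean distance over all cross-cluster pairs."""
    return sum(math.dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))

left = [(0, 0), (1, 0)]
right = [(2, 0), (5, 0)]
print(single_link(left, right))    # 1.0
print(complete_link(left, right))  # 5.0
print(average_link(left, right))   # 3.0
```

Single linkage sees these clusters as close because of one nearby pair, which is how long chains form; complete linkage is dominated by the outlying point at (5, 0).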
![Page 31: Information Retrieval Search Engine Technology (8) Prof. Dragomir R. Radev.](https://reader035.fdocuments.us/reader035/viewer/2022081514/5a4d1b697f8b9ab0599b2a00/html5/thumbnails/31.jpg)
Non-hierarchical methods
• Also known as flat clustering
• Centroid method (online)
• K-means
• Expectation maximization
![Page 32: Information Retrieval Search Engine Technology (8) Prof. Dragomir R. Radev.](https://reader035.fdocuments.us/reader035/viewer/2022081514/5a4d1b697f8b9ab0599b2a00/html5/thumbnails/32.jpg)
Hierarchical clustering
[Figure: eight numbered points (1–8) laid out in two rows of four]
Single link produces straggly clusters (e.g., ((12)(56)))
![Page 33: Information Retrieval Search Engine Technology (8) Prof. Dragomir R. Radev.](https://reader035.fdocuments.us/reader035/viewer/2022081514/5a4d1b697f8b9ab0599b2a00/html5/thumbnails/33.jpg)
![Page 34: Information Retrieval Search Engine Technology (8) Prof. Dragomir R. Radev.](https://reader035.fdocuments.us/reader035/viewer/2022081514/5a4d1b697f8b9ab0599b2a00/html5/thumbnails/34.jpg)
Hierarchical agglomerative clustering
Dendrograms
http://odur.let.rug.nl/~kleiweg/clustering/clustering.html
/data2/tools/clustering
E.g., language similarity:
![Page 35: Information Retrieval Search Engine Technology (8) Prof. Dragomir R. Radev.](https://reader035.fdocuments.us/reader035/viewer/2022081514/5a4d1b697f8b9ab0599b2a00/html5/thumbnails/35.jpg)
Clustering using dendrograms
REPEAT
    Compute pairwise similarities
    Identify closest pair
    Merge pair into single node
UNTIL only one node left
Q: what is the equivalent Venn diagram representation?
Example: cluster the following sentences:
A B C B A
A D C C A D E
C D E F C D A
E F G F D A
A C D A B A
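The REPEAT/UNTIL loop can be sketched on these five sentences as follows. The sketch assumes term-count vectors, cosine similarity, and a merged node represented by the sum of its children's vectors; the slide specifies none of these, so the resulting tree is only one possible answer.

```python
import math
from collections import Counter

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def agglomerate(sentences):
    """Return the dendrogram as nested pairs of sentences."""
    # Each node is (tree, vector); leaves carry the sentence's term counts.
    nodes = [(s, Counter(s.split())) for s in sentences]
    while len(nodes) > 1:
        # Compute pairwise similarities and identify the closest pair.
        i, j = max(
            ((a, b) for a in range(len(nodes)) for b in range(a + 1, len(nodes))),
            key=lambda ab: cosine(nodes[ab[0]][1], nodes[ab[1]][1]),
        )
        (tree_i, vec_i), (tree_j, vec_j) = nodes[i], nodes[j]
        # Merge the pair into a single node (assumed: summed term counts).
        merged = ((tree_i, tree_j), vec_i + vec_j)
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [merged]
    return nodes[0][0]

sentences = ["A B C B A", "A D C C A D E", "C D E F C D A", "E F G F D A", "A C D A B A"]
tree = agglomerate(sentences)
print(tree)
```

Each pair of parentheses in the printed tree corresponds to one merge, so reading it from the innermost pair outward replays the order in which the dendrogram was built.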
![Page 36: Information Retrieval Search Engine Technology (8) Prof. Dragomir R. Radev.](https://reader035.fdocuments.us/reader035/viewer/2022081514/5a4d1b697f8b9ab0599b2a00/html5/thumbnails/36.jpg)
Paper reading
• Mark Newman paper “The structure and function of complex networks” (sections I, II, III, IV, VI, VII, and VIIIa)