Lecture 17 Networks in Web & IR Slides are modified from Lada Adamic and Dragomir Radev.
Information Retrieval (8) Prof. Dragomir R. Radev [email protected].
IR Winter 2010
13. Clustering
Clustering
• Exclusive/overlapping clusters
• Hierarchical/flat clusters
• The cluster hypothesis
– Documents in the same cluster are relevant to the same query
– How do we use it in practice?
Representations for document clustering
• Typically: vector-based
– Words: “cat”, “dog”, etc.
– Features: document length, author name, etc.
• Each document is represented as a vector in an n-dimensional space
• Similar documents appear nearby in the vector space (distance measures are needed)
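As a minimal sketch of the vector-space view (the feature vectors here are hypothetical), cosine similarity is one common distance measure between document vectors:

```python
import math

def cosine(u, v):
    """Cosine similarity between two document vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

# Hypothetical term-count vectors over features ("cat", "dog", doc length)
d1 = [2, 0, 5]
d2 = [1, 1, 4]
print(round(cosine(d1, d2), 3))  # 0.963 -- nearby in the vector space
```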
Scatter-gather
• Introduced by Cutting, Karger, and Pedersen
• Iterative process
– Show terms for each cluster
– User picks some of them
– System produces new clusters
• Example:
– http://www.ischool.berkeley.edu/~hearst/images/sg-example1.html
k-means
• Iteratively determine which cluster a point belongs to, then adjust the cluster centroid, then repeat
• Needed: small number k of desired clusters
• Hard decisions: each document belongs to exactly one cluster
• Example: Weka
k-means
initialize cluster centroids to arbitrary vectors
while further improvement is possible do
    for each document d do
        find the cluster c whose centroid is closest to d
        assign d to cluster c
    end for
    for each cluster c do
        recompute the centroid of cluster c based on its documents
    end for
end while
K-means (cont’d)
• In practice (to avoid suboptimal clusters), run hierarchical agglomerative clustering on a sample of size sqrt(N) and then use the resulting clusters as seeds for k-means.
Example
• Cluster the following vectors into two groups:
– A = <1,6>
– B = <2,2>
– C = <4,0>
– D = <3,3>
– E = <2,5>
– F = <2,1>
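The pseudocode above can be sketched directly in Python and run on this example. The seeding below (starting the two centroids at A and C) is a hypothetical choice, since k-means results depend on initialization:

```python
import math

def kmeans(points, centroids, iters=20):
    """Plain k-means: assign every point to the nearest centroid,
    then recompute each centroid as the mean of its assigned points."""
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            groups[nearest].append(p)
        centroids = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return groups, centroids

vectors = {'A': (1, 6), 'B': (2, 2), 'C': (4, 0),
           'D': (3, 3), 'E': (2, 5), 'F': (2, 1)}
# Hypothetical seeding: centroids start at A and C
clusters, cents = kmeans(list(vectors.values()), centroids=[(1, 6), (4, 0)])
print(clusters)  # [[(1, 6), (2, 5)], [(2, 2), (4, 0), (3, 3), (2, 1)]]
print(cents)     # [(1.5, 5.5), (2.75, 1.5)]
```

With this seeding the algorithm converges to {A, E} and {B, C, D, F} after two passes.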
Weka
• A general environment for machine learning (e.g. for classification and clustering)
• Book by Witten and Frank
• www.cs.waikato.ac.nz/ml/weka
• cd /data2/tools/weka-3-4-7
• export CLASSPATH=$CLASSPATH:./weka.jar
• java weka.clusterers.SimpleKMeans -t ~/e.arff
• java weka.clusterers.SimpleKMeans -p 1-2 -t ~/e.arff
Demos
• http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
• http://cgm.cs.mcgill.ca/~godfried/student_projects/bonnef_k-means
• http://www.cs.washington.edu/research/imagedatabase/demo/kmcluster
• http://www.cc.gatech.edu/~dellaert/html/software.html
• http://www-2.cs.cmu.edu/~awm/tutorials/kmeans11.pdf
• http://www.ece.neu.edu/groups/rpl/projects/kmeans/
Probability and likelihood
Likelihood of data X = {x_i} under parameters θ:

L(θ) = p(X|θ) = ∏_i p(x_i|θ)

Example: What is L(θ) in this case?
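As a tiny numeric illustration of the likelihood L(θ) = ∏_i p(x_i|θ), consider i.i.d. Bernoulli (coin-flip) observations; the data below are a hypothetical sample:

```python
def likelihood(theta, xs):
    """L(theta) = product over i of p(x_i | theta), for i.i.d.
    Bernoulli observations (x_i = 1 with probability theta)."""
    L = 1.0
    for x in xs:
        L *= theta if x == 1 else (1 - theta)
    return L

data = [1, 1, 0, 1]  # three heads, one tail (hypothetical sample)
print(likelihood(0.5, data))   # 0.0625
print(likelihood(0.75, data))  # 0.10546875 -- the MLE theta = 3/4 wins
```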
Bayesian formulation
Posterior ∝ likelihood × prior
E-M algorithms
[Dempster et al. 77]
• Class of iterative algorithms for maximum likelihood estimation in problems with incomplete data. Given a model of data generation and data with some missing values, EM alternately uses the current model to estimate the missing values, and then uses the missing value estimates to improve the model. Using all the available data, EM will locally maximize the likelihood of the generative parameters giving estimates for the missing values.
[McCallum & Nigam 98]
E-M algorithm
• Initialize probability model
• Repeat
– E-step: use the best available current classifier to classify some datapoints
– M-step: modify the classifier based on the classes produced by the E-step
• Until convergence
Soft clustering method
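The E-step/M-step alternation can be sketched for a two-component 1-D Gaussian mixture. This is a deliberately minimal version (fixed equal variances and mixing weights, only the means re-estimated; the data and initial means are hypothetical):

```python
import math

def em_gmm(xs, mu, iters=50, sigma=1.0):
    """EM for a two-component 1-D Gaussian mixture with fixed, equal
    variances and equal mixing weights; only the means are re-estimated."""
    def pdf(x, m):
        # Unnormalized Gaussian density (the shared constant cancels out)
        return math.exp(-0.5 * ((x - m) / sigma) ** 2)

    for _ in range(iters):
        # E-step: responsibility of component 0 for each point (soft labels)
        r0 = [pdf(x, mu[0]) / (pdf(x, mu[0]) + pdf(x, mu[1])) for x in xs]
        # M-step: re-estimate means as responsibility-weighted averages
        mu = [
            sum(r * x for r, x in zip(r0, xs)) / sum(r0),
            sum((1 - r) * x for r, x in zip(r0, xs)) / sum(1 - r for r in r0),
        ]
    return mu

xs = [0.0, 0.5, 1.0, 9.0, 9.5, 10.0]
mus = em_gmm(xs, mu=[2.0, 8.0])
print(mus)  # means converge near 0.5 and 9.5
```

The soft assignments (responsibilities) are what make this a soft clustering method: every point belongs to every component, with a weight.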
EM example
[Four slides of figures from Chris Bishop, showing successive EM iterations]
Demos
http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/mixture.html
http://lcn.epfl.ch/tutorial/english/gaussian/html/
http://www.cs.cmu.edu/~alad/em/
http://www.nature.com/nbt/journal/v26/n8/full/nbt1406.html
http://people.csail.mit.edu/mcollins/papers/wpeII.4.ps
“Online” centroid method
Centroid method

centroid(c) = (1/|c|) ∑_{x∈c} x
Online centroid-based clustering
[Figure: each incoming document joins the closest cluster if sim ≥ T, or starts a new cluster if sim < T]
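A single-pass sketch of this scheme, with the slide's sim ≥ T test expressed as a distance threshold (the stream and threshold below are hypothetical):

```python
import math

def online_cluster(points, T):
    """Online centroid clustering: each incoming point joins the closest
    existing cluster when it is similar enough (here: distance <= T),
    and otherwise starts a new cluster of its own."""
    clusters = []  # each entry: [centroid, members]
    for p in points:
        if clusters:
            i = min(range(len(clusters)),
                    key=lambda i: math.dist(p, clusters[i][0]))
            if math.dist(p, clusters[i][0]) <= T:
                clusters[i][1].append(p)
                members = clusters[i][1]
                # Recompute the centroid over the updated member list
                clusters[i][0] = tuple(sum(col) / len(members)
                                       for col in zip(*members))
                continue
        clusters.append([p, [p]])
    return clusters

# Hypothetical document stream (2-D points) with distance threshold T = 3
stream = [(0, 0), (1, 0), (10, 10), (0, 1), (11, 10)]
result = online_cluster(stream, T=3)
print([c[0] for c in result])
```

Note the order-dependence: unlike k-means, a single pass over a differently ordered stream can produce different clusters.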
Sample centroids
C 00022 (N=44): diana 1.93, princess 1.52
C 00025 (N=19): albanians 3.00
C 00026 (N=10): universe 1.50, expansion 1.00, bang 0.90
C 10007 (N=11): crashes 1.00, safety 0.55, transportation 0.55, drivers 0.45, board 0.36, flight 0.27, buckle 0.27, pittsburgh 0.18, graduating 0.18, automobile 0.18
C 00035 (N=22): airlines 1.45, finnair 0.45
C 00031 (N=34): el 1.85, nino 1.56
C 00008 (N=113): space 1.98, shuttle 1.17, station 0.75, nasa 0.51, columbia 0.37, mission 0.33, mir 0.30, astronauts 0.14, steering 0.11, safely 0.07
C 10062 (N=161): microsoft 3.24, justice 0.93, department 0.88, windows 0.98, corp 0.61, software 0.57, ellison 0.07, hatch 0.06, netscape 0.04, metcalfe 0.02
Evaluation of clustering
• Formal definition
• Objective function
• Purity (considering the majority class in each cluster)
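Purity can be computed in a few lines; the clustering below is a hypothetical example of 17 labeled objects in three clusters:

```python
from collections import Counter

def purity(clusters):
    """Purity: each cluster contributes the count of its majority class;
    the total is divided by the number of clustered objects."""
    total = sum(len(c) for c in clusters)
    majority = sum(Counter(c).most_common(1)[0][1] for c in clusters)
    return majority / total

# Hypothetical clustering of 17 labeled objects (class labels x, o, d)
clusters = [list("xxxxxo"), list("xooood"), list("xxddd")]
print(round(purity(clusters), 2))  # (5 + 4 + 3) / 17 = 0.71
```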
RAND index
• Accuracy over pairs of objects: does the clustering preserve object-object relationships (same class vs. different class)?
• RI = (TP+TN)/(TP+FP+FN+TN)
• In the example:

TP + FP = C(6,2) + C(6,2) + C(5,2) = 15 + 15 + 10 = 40
TP = C(5,2) + C(4,2) + C(3,2) + C(2,2) = 10 + 6 + 3 + 1 = 20
FP = 40 − 20 = 20
RAND index

                 Same cluster   Different clusters
Same class       TP = 20        FN = 24
Different class  FP = 20        TN = 72

RI = (20 + 72) / 136 ≈ 0.68
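The counts can be checked by brute force over all object pairs. The labeling below is a hypothetical assignment of 17 objects (classes named x, o, d here) chosen to reproduce the counts on this slide:

```python
from itertools import combinations

# Hypothetical labeling: three clusters of sizes 6, 6, 5 over 17 objects
clusters = [1] * 6 + [2] * 6 + [3] * 5
classes = list("xxxxxo" + "xooood" + "xxddd")

TP = FP = FN = TN = 0
for i, j in combinations(range(len(classes)), 2):
    same_cluster = clusters[i] == clusters[j]
    same_class = classes[i] == classes[j]
    if same_cluster and same_class:
        TP += 1  # pair correctly kept together
    elif same_cluster:
        FP += 1  # pair wrongly kept together
    elif same_class:
        FN += 1  # pair wrongly separated
    else:
        TN += 1  # pair correctly separated

RI = (TP + TN) / (TP + FP + FN + TN)
print(TP, FP, FN, TN)  # 20 20 24 72
print(round(RI, 2))    # 0.68
```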
Hierarchical clustering methods
• Single-linkage
– One close pair is sufficient to merge two clusters
– Disadvantage: long chains
• Complete-linkage
– All pairs have to match (merge on the most distant pair)
– Disadvantage: too conservative
• Average-linkage
• Demo
Non-hierarchical methods
• Also known as flat clustering
• Centroid method (online)
• K-means
• Expectation maximization
Hierarchical clustering
[Figure: eight numbered points, 1–4 grouped on the left and 5–8 on the right]

Single link produces straggly clusters (e.g., ((1 2)(5 6)))
Hierarchical agglomerative clustering
Dendrograms
• http://odur.let.rug.nl/~kleiweg/clustering/clustering.html
• /data2/tools/clustering
• E.g., language similarity:
Clustering using dendrograms
REPEAT
    Compute pairwise similarities
    Identify closest pair
    Merge pair into single node
UNTIL only one node left

Q: what is the equivalent Venn diagram representation?
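The loop above can be sketched as single-link agglomerative clustering (a naive O(n^3) version; the points below are hypothetical):

```python
import math

def single_link_hac(points):
    """Agglomerative clustering following the loop above: repeatedly find
    the closest pair of clusters (single link: distance of their closest
    members) and merge them, until one cluster is left.  Returns the
    sequence of merges as (cluster_a, cluster_b, distance)."""
    clusters = [frozenset([i]) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        merged = clusters[a] | clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges

# Two well-separated squares of points (hypothetical data)
pts = [(0, 0), (1, 0), (0, 1), (1, 1), (5, 0), (6, 0), (5, 1), (6, 1)]
merges = single_link_hac(pts)
print(merges[-1])  # the final merge joins the two squares at distance 4.0
```

The recorded merge sequence is exactly what a dendrogram draws: each merge is a node whose height is the merge distance.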
Example: cluster the following sentences:
A B C B A
A D C C A D E
C D E F C D A
E F G F D A
A C D A B A
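One way to start on this exercise (a sketch; the choice of word-set Jaccard similarity is an assumption, since any pairwise similarity would do) is to compute the similarity matrix over the five sentences:

```python
from itertools import combinations

sentences = [
    "A B C B A",
    "A D C C A D E",
    "C D E F C D A",
    "E F G F D A",
    "A C D A B A",
]

def jaccard(s1, s2):
    """Word-set overlap (Jaccard) similarity between two sentences."""
    a, b = set(s1.split()), set(s2.split())
    return len(a & b) / len(a | b)

# Print all pairwise similarities; feed these into any clustering method
for i, j in combinations(range(len(sentences)), 2):
    print(i, j, round(jaccard(sentences[i], sentences[j]), 2))
```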