The Cluster Hypothesis in Information Retrieval
SIGIR 2013 tutorial
Oren Kurland
Technion --- Israel Institute of Technology
Email: [email protected]: http://iew3.technion.ac.il/~kurland
Slides are available at: http://iew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdf
Tutorial overview
• The cluster hypothesis
• Historical view of the effect of the hypothesis on work on ad hoc information retrieval
• Testing the cluster hypothesis
• Cluster-based document retrieval
• Using topic models for ad hoc information retrieval
• Graph-based methods for ad hoc retrieval that utilize inter-document similarities
• Additional tasks/applications
  • Search results visualization, query-performance prediction, fusion, federated search, query expansion, microblog retrieval, relevance feedback, adversarial search
• Concluding notes
The ad hoc retrieval task
• Ranking the documents in a corpus by their relevance to the information need expressed by a query
• Vector space model
• Probabilistic approaches
• Language modeling framework
• Divergence from randomness framework
• Learning to rank
The cluster hypothesis
Closely associated documents tend to be relevant to the same requests
(Jardine&van Rijsbergen ’71, van Rijsbergen ’79)
A quick historical tour
• Mid-to-late 60’s
  • Using document clusters to improve search efficiency (Salton ’68)
• 70’s-80’s
  • Using document clusters to improve search effectiveness (Jardine&van Rijsbergen ’71)
• ~2004-today
  • Using document clusters to improve search effectiveness (Azzopardi et al. ’04, Kurland&Lee ’04, Liu&Croft ’04)
• 90’s-00’s
  • Using document clusters to improve results browsing (Preece ’73)
• 90’s-today
  • Using topic models to improve search effectiveness (Deerwester et al. ’90)
• ~00’s-today
  • Using graph-based approaches for ad hoc retrieval that utilize inter-document similarities (Salton&Buckley ’88)
Ph.D. dissertations
• Ivie, E. L. Search procedures based on measures of relatedness between documents. PhD thesis, Massachusetts Institute of Technology, 1966.
• Marcia Davis Kerchner. Dynamic document processing in clustered collections. PhD thesis, Cornell University, 1971.
• Daniel McClure Murray. Document retrieval based on clustered files. PhD thesis, Cornell University, 1972.
• Ellen Voorhees. The effectiveness and efficiency of agglomerative hierarchic clustering in document retrieval. PhD thesis, Cornell University, 1985.
• Anton Leuski. Interactive information organization: Techniques and evaluation. PhD thesis, University of Massachusetts Amherst, 2001.
• Anastasios Tombros. The effectiveness of hierarchic query-based clustering of documents for information retrieval. PhD thesis, Department of Computing Science, University of Glasgow, 2002.
• Leif Azzopardi. Incorporating context within the language modeling approach for ad hoc information retrieval. PhD thesis, University of Paisley, 2005.
• Oren Kurland. Inter-document similarities, language models, and ad hoc retrieval. PhD thesis, Cornell University, 2006.
• Xiaoyong Liu. Cluster-based retrieval from a language modeling perspective. PhD thesis, University of Massachusetts Amherst, 2006.
• Xing Wei. Topic models in information retrieval. PhD thesis, University of Massachusetts Amherst, 2007.
• Fernando Diaz. Autocorrelation and regularization of query-based retrieval scores. PhD thesis, University of Massachusetts Amherst, 2008.
• Mark Smucker. Evaluation of find-similar with simulation and network analysis. PhD thesis, University of Massachusetts Amherst, 2008.
Improving search efficiency
• Cluster the corpus offline
• Represent each cluster by its centroid
• At query time, compare the centroids with the query and select the clusters to present
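The three steps above can be sketched as follows. This is a minimal illustration, assuming sparse term-weight dictionaries as document vectors, cosine as the query-centroid similarity, and a toy corpus that is already clustered; the function names and data are hypothetical and not taken from the tutorial.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(docs):
    """Average the term weights of the documents in one cluster."""
    total = Counter()
    for doc in docs:
        total.update(doc)  # adds weights term by term
    return {t: w / len(docs) for t, w in total.items()}


def select_clusters(query, centroids, top=1):
    """Indices of the `top` clusters whose centroids are most similar to the query."""
    order = sorted(range(len(centroids)),
                   key=lambda i: cosine(query, centroids[i]), reverse=True)
    return order[:top]


# Hypothetical toy corpus, already clustered offline.
clusters = [
    [{"school": 1, "bus": 1}, {"school": 1, "teachers": 1}],
    [{"taxi": 1, "truck": 1}, {"bus": 1, "taxi": 1, "boat": 1}],
]
centroids = [centroid(c) for c in clusters]   # computed once, offline
query = {"truck": 1, "bus": 1}
selected = select_clusters(query, centroids)  # done at query time
```

Only the centroids are compared with the query, so the query-time cost grows with the number of clusters rather than the number of documents.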
The cluster hypothesis
Closely associated documents tend to be relevant to the same requests
(Jardine&van Rijsbergen ’71, van Rijsbergen ’79)
Does the cluster hypothesis hold?
• Depends on the inter-document similarity used?
• Maybe we should assume that the hypothesis holds, and accordingly devise inter-document similarity measures?
• More details later
Jardine&van Rijsbergen’s (’71) overlap test
• The similarity between two relevant documents vs. the similarity between a relevant and a non-relevant document
• Measuring the overlap between the similarity distributions
Jardine&van Rijsbergen’s cluster hypothesis test
(Figure is taken from Voorhees ‘85)
Voorhees’ (’85) nearest-neighbor test
• The percentage of relevant documents among the 5 nearest neighbors of a relevant document
• The cosine similarity between tf.idf vectors is used
Voorhees’ nearest-neighbor test (Kurland ’06)
• The KL divergence between language models of documents is used for the similarity measure
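One way a KL-based inter-document similarity could look is sketched below. The linear smoothing with the collection model, the smoothing weight, and the `exp(-KL)` mapping from divergence to similarity are all assumptions made for illustration; the slide only states that the KL divergence between document language models is used.

```python
import math
from collections import Counter


def language_model(doc, collection, lam=0.5):
    """Unigram model of `doc`, linearly smoothed with the collection model."""
    tf, cf = Counter(doc), Counter(collection)
    return {t: (1 - lam) * tf[t] / len(doc) + lam * cf[t] / len(collection)
            for t in cf}


def kl_divergence(p, q):
    """KL(p || q) for two smoothed models over the same vocabulary."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p)


def kl_similarity(p, q):
    """Map the divergence to a similarity in (0, 1]; the exp(-KL) form is an assumption."""
    return math.exp(-kl_divergence(p, q))


# Hypothetical tokenized documents.
docs = {
    "d1": ["bus", "taxi"],
    "d2": ["bus", "taxi", "boat"],
    "d3": ["home", "kids"],
}
collection = [t for doc in docs.values() for t in doc]
models = {name: language_model(doc, collection) for name, doc in docs.items()}
sim_12 = kl_similarity(models["d1"], models["d2"])
sim_13 = kl_similarity(models["d1"], models["d3"])
```

Smoothing keeps every probability positive, so the divergence is always finite. Note that this similarity is asymmetric, because KL divergence is.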
Voorhees’ nearest-neighbor test applied to the result list of the n highest ranked documents (Raiber&Kurland ’12)
• The KL divergence between language models of documents is used for the similarity measure
The connection between the cluster hypothesis and cluster-based retrieval effectiveness
• “The extent to which the cluster hypothesis characterized a collection seemed to have little effect on how well cluster searching performed as compared to a sequential search of the collection.” (Voorhees ’85)
• There is (high) correlation between the extent to which the nearest-neighbor cluster hypothesis holds and the effectiveness of cluster-based document retrieval. (Na et al. ’08)
• One potential reason for the contradicting findings: completely different cluster-based retrieval methods have been used
The density-based cluster hypothesis test (El-Hamdouchi and Willett ’87)
• The test value is the ratio between the number of postings in the index (i.e., the number of different terms used in each document, summed over all documents) and the size of the vocabulary
  • There is also a weighted version
• The test was empirically shown to be more correlated than the overlap and nearest-neighbor tests with the relative improvement posted by cluster-based retrieval over document-based retrieval
  • Nearest-neighbor clusters (Griffiths et al. ’85) were used
  • Retrieval performance was measured by recall at some cutoff
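The unweighted test value described above reduces to a short computation. The sketch below follows the slide's wording (postings divided by vocabulary size) and assumes documents are given as token lists; the toy documents are hypothetical and the weighted version is not shown.

```python
def density(docs):
    """Number of postings divided by vocabulary size, as worded on the slide."""
    postings = sum(len(set(doc)) for doc in docs)  # distinct terms per document, summed
    vocabulary = set().union(*docs)                # distinct terms in the collection
    return postings / len(vocabulary)


# Hypothetical tokenized documents; the repeated "kids" counts once as a posting.
docs = [["bus", "taxi"], ["bus", "taxi", "boat"], ["home", "kids", "kids"]]
value = density(docs)
```

Here there are 7 postings over a vocabulary of 5 terms, so terms are shared across documents more often than in a corpus where every document uses its own terms (where the ratio would be 1).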
Query-sensitive similarity measures (Tombros&van Rijsbergen ’01)
• Claim: the cluster hypothesis should hold for every collection; it is the inter-document similarity measure that needs to be adjusted so that the hypothesis holds
• Heretofore, all inter-document similarities were query-independent
• Idea: bias the inter-document similarity measure to emphasize relations between the documents and the query
  $sim(d_1, d_2 \mid q) \triangleq \cos(\vec{d_1}, \vec{d_2}) \cdot \cos(\vec{c}, \vec{q})$; $c_i = \frac{1}{2}(d_{1;i} + d_{2;i})$, where the i’th term in the vocabulary is common to $d_1$ and $d_2$, and $d_{x;i}$ is its weight in $d_x$
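The measure above can be sketched directly from its definition. The code assumes sparse term-weight dictionaries for documents and queries; the example vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def query_sensitive_sim(d1, d2, q):
    """cos(d1, d2) * cos(c, q), where c averages the weights of terms common to d1 and d2."""
    common = {t: 0.5 * (d1[t] + d2[t]) for t in d1.keys() & d2.keys()}
    return cosine(d1, d2) * cosine(common, q)


# Hypothetical vectors: the two documents share only the term "bus".
d1 = {"bus": 1.0, "school": 1.0}
d2 = {"bus": 1.0, "taxi": 1.0}
on_topic = query_sensitive_sim(d1, d2, {"bus": 1.0, "truck": 1.0})
off_topic = query_sensitive_sim(d1, d2, {"home": 1.0})
```

The same document pair is considered similar for a query that touches their shared terms and dissimilar for a query that does not, which is the intended query bias.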
Query-sensitive similarity measures (Tombros&van Rijsbergen ’01)
Nearest neighbor test with 5 nearest neighbors
Using the cluster hypothesis to induce clustering
• The optimum clustering framework (Fuhr et al. 2012)
  • Basic principle: documents that are co-relevant to many queries should be clustered together
  • Definitions for the expected recall and precision of a clustering based on co-relevance
  • Well known clustering methods can be viewed as based on principles of the framework
  • The framework was shown to provide a more effective internal clustering quality criterion than commonly used alternatives
    • In terms of correlation to ground truth
Alternative cluster hypothesis test (Smucker&Allan ’09)
• Claim: The nearest neighbor test is insufficient for query-biased similarities
• The nearest-neighbor test is a good measure of local clustering
• A graph-based normalized mean reciprocal distance measure
The cluster hypothesis test for entity retrieval (Raviv et al. SIGIR 2013)
• Check out the poster of Hadas Raviv
• The main challenge: defining similarities between entities
Using clusters of similar documents for document retrieval
• Visualizing results
• Using clusters to select documents
• Using clusters to enrich document representations
• Using topic models to enrich document representations
Types of document clusters
• Hard vs. soft
  • Hard clustering: A document belongs to a single cluster
    • Partitioning (e.g., K-Means) vs. hierarchical agglomerative clustering (single link, complete link, average link, Ward’s criterion)
  • Soft clustering: A document can belong to, or be associated with, several clusters
    • Overlapping nearest-neighbor clusters: for each document we construct a cluster that contains the document and its k nearest neighbors
    • Topic models (more details later)
• Offline (query-independent)
  • Created from all documents in the corpus
  • Help to address recall issues with the initial search (?)
  • Efficiency issues
    • Large scale and dynamic corpora
• Query specific (Preece ’73, Willett ’85)
  • Created from the documents most highly ranked by an initial search
  • Used either for visualization of results or for automatic re-ranking of the initial result list
  • Drawback(?): dependence on the effectiveness of the initial search
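The overlapping nearest-neighbor clusters mentioned above can be built as in the sketch below. It assumes term-set documents and Jaccard similarity as a stand-in for whatever inter-document measure is in use; the helper names are hypothetical. Passing the whole corpus gives offline clusters, while passing only the top-ranked documents of an initial search gives query-specific ones.

```python
def jaccard(a, b):
    """Jaccard similarity between two term sets (stand-in similarity measure)."""
    return len(a & b) / len(a | b) if a | b else 0.0


def nearest_neighbor_clusters(docs, k, sim=jaccard):
    """One cluster per document: the document itself plus its k nearest neighbors."""
    clusters = {}
    for d in docs:
        others = sorted((o for o in docs if o != d),
                        key=lambda o: sim(docs[d], docs[o]), reverse=True)
        clusters[d] = [d] + others[:k]
    return clusters


docs = {
    "d3": {"bus", "taxi", "boat", "bike"},
    "d4": {"taxi", "boat", "truck", "scooter"},
    "d5": {"boat", "horse", "taxi", "bike", "scooter"},
    "d6": {"home", "house", "kids", "floor"},
}
clusters = nearest_neighbor_clusters(docs, k=2)
memberships = sum("d3" in members for members in clusters.values())
```

Because every document seeds its own cluster, a document can appear in several clusters, which is what makes this a soft clustering.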
Cluster-based search results visualization
• The scatter-gather system (Cutting et al. ’92, Hearst&Pedersen ’96)
• Browsing strategies for cluster-based result interfaces (Leuski ’01)
• Interactive retrieval using a cluster-based interface (Leuski&Allan ’04)
• Interactive exploration of corpora based on inter-document similarities (Smucker ’08)
Challenges to address
• Fast online creation of clusters
  • Cutting et al. ’92, Zamir&Etzioni ’98
• Automatic labeling of clusters
  • Treeratpituk&Callan ’06 (agglomerative clusters)
  • Mei et al. ’07 (topic models)
Using document clusters for ad hoc retrieval
• The user needn’t be aware of the fact that clustering was performed
• Clusters often serve one of two roles (or both) (Kurland&Lee ’04)
  • Document selection
  • Enriching (“smoothing”) document representations
Using offline-created clusters for document selection (Kurland ’06)
Query = {truck, bus}
d1=school bus, classes, teachers
d2=school , classes, teachers, class
d3=bus, taxi, boat, bike
d4=taxi, boat, truck, scooter
d5=boat, horse, taxi, bike, scooter
d6=home, house, kids, floor
Clusters: c1, c2, c3
$sim(x_i, x_j) = |x_i \cap x_j|$
Rank using documents:
$sim(q, d_1) > sim(q, d_5) = sim(q, d_6)$
Rank using clusters and docs:
$sim(q, c_2) > sim(q, c_1) > sim(q, c_3)$
$sim(d_5, c_2) > sim(d_4, c_2) = sim(d_3, c_2)$
[Figure: the document rankings that result from each of the two approaches]
Using clusters for document selection
• Given a query $q$ and a list of document clusters $Cl$
  • $Cl$ can be a set of offline-created or query-specific clusters
• Rank the clusters $c$ in $Cl$ using a query-cluster similarity measure, or any other approach:
  • $score(c; q) \triangleq sim(q, c)$
  • A key estimation issue which will be further discussed
• Transform the cluster ranking to document ranking
Transforming cluster ranking to document ranking
• Strategy #1 (originally termed “cluster-based retrieval” by Jardine&van Rijsbergen (’71))
  • Replace each cluster with its constituent documents (omitting repeats)
  • Within-cluster document ranking is based either on the initial document scores assigned in response to the query or on the similarity between the document and the cluster centroid
  • Jardine&van Rijsbergen ’71, Croft ’80, Voorhees ’85, Willett ’85, Liu&Croft ’04, Kurland&Lee ’06, Kurland ’08, Kurland&Domshlak ’08, Liu&Croft ’08
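Strategy #1 can be sketched in a few lines. The version below orders documents within each cluster by their initial query scores (one of the two options listed above); the function name, cluster contents, and scores are hypothetical.

```python
def clusters_to_documents(ranked_clusters, doc_scores):
    """Expand a cluster ranking into a document ranking, omitting repeats."""
    seen, ranking = set(), []
    for cluster in ranked_clusters:
        # Within a cluster, order documents by their initial retrieval scores.
        for d in sorted(cluster, key=lambda d: doc_scores[d], reverse=True):
            if d not in seen:
                seen.add(d)
                ranking.append(d)
    return ranking


# Hypothetical cluster ranking (best cluster first) and initial document scores.
ranked_clusters = [["d3", "d4", "d5"], ["d1", "d3"], ["d6"]]
doc_scores = {"d1": 1.0, "d3": 0.8, "d4": 0.9, "d5": 0.2, "d6": 0.0}
ranking = clusters_to_documents(ranked_clusters, doc_scores)
```

Note that d1 has the highest initial score but is ranked fourth, because its cluster is ranked below the cluster containing d3, d4, and d5.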
The CQL method (Liu&Croft ’04)
(Example for strategy #1 using query-specific clusters)
$score(c; q) \triangleq sim(q, c)$; similarity is measured using language models
A cluster is represented by the concatenation of its constituent documents
Transforming cluster ranking to document ranking (cont.)
• Strategy #2
  • Rank all the documents in the top-retrieved clusters using some criterion (Voorhees ’85)
  • Examples to follow
• Strategy #3
  • Traverse the clustering dendrogram until finding the cluster with the best match to the query or using any other stopping criterion (Jardine&van Rijsbergen ’71, Croft ’80, Voorhees ’85, Griffiths et al. ’86)
  • Mixed results with respect to whether cluster-based retrieval is consistently more effective than document retrieval
  • Cluster-based retrieval was shown to be more effective in terms of precision (Jardine&van Rijsbergen ’71, Croft ’80)
  • Bottom up search was shown to be more effective than top-down (Croft ’80)
Algorithmic framework (Kurland&Lee ’04, ’09)
(Example for strategy #2)
$Cl$ is the set of nearest-neighbor clusters created from all documents in the corpus (cf., Griffiths et al. ’86)
Given query $q$ and $N$ (the number of docs to retrieve):
1. For each document $d$,
   - Choose $Facets(q, d) \subseteq Cl$
   - Score $d$ by a weighted combination of $sim(q, d)$ and the $sim(q, c)$’s for all $c \in Facets(q, d)$
2. Set $TopDocs(N)$ to the rank-ordered list of the $N$ top-scoring documents
3. Optional: re-rank $d \in TopDocs(N)$ by $sim(q, d)$
4. Return $TopDocs(N)$
The Set-Select algorithm (Kurland&Lee ’04, ’09; cf. Voorhees ’85)
• Instantiation of the framework:
  $Facets(q, d) = \{c : d \in c\} \cap TopClusters_q(m)$
  $Score(d) = sim(q, d) \cdot \delta[\,|Facets(q, d)| > 0\,]$
  $TopClusters_q(m)$ is the set of $m$ clusters that are the most similar to the query
• The procedure
  Rank only documents in top retrieved clusters using $sim(q, d)$
The Bag-Select Algorithm (Kurland&Lee ’04, ’09)
• Instantiation of the framework:
  $Facets(q, d) = \{c : d \in c\} \cap TopClusters_q(m)$
  $Score(d) = sim(q, d) \cdot |Facets(q, d)|$
• The procedure
  Rank only documents in top retrieved clusters using $sim(q, d)$ × # of top clusters $d$ belongs to
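The two instantiations differ only in how the number of top clusters containing a document enters its score, which the sketch below makes explicit. The cluster memberships and similarity values are made-up illustrations (not the example from the slides), and `select_scores` and `rank` are hypothetical helper names.

```python
def select_scores(doc_sim, clusters, cluster_sim, m, bag=False):
    """Set-Select (bag=False) or Bag-Select (bag=True) document scores."""
    # TopClusters_q(m): the m clusters most similar to the query.
    top = sorted(clusters, key=lambda c: cluster_sim[c], reverse=True)[:m]
    scores = {}
    for d, s in doc_sim.items():
        n_facets = sum(d in clusters[c] for c in top)      # |Facets(q, d)|
        weight = n_facets if bag else int(n_facets > 0)    # count vs. indicator
        scores[d] = s * weight
    return scores


def rank(scores):
    """Rank only documents that belong to at least one top cluster."""
    kept = [d for d in scores if scores[d] > 0]
    return sorted(kept, key=lambda d: scores[d], reverse=True)


# Hypothetical clusters, query-cluster similarities, and query-document similarities.
clusters = {"c1": {"d1", "d2", "d40"}, "c2": {"d1", "d40"}, "c3": {"d40"}, "c4": {"d7"}}
cluster_sim = {"c1": 0.9, "c2": 0.8, "c3": 0.7, "c4": 0.1}
doc_sim = {"d1": 0.5, "d2": 0.75, "d40": 0.375, "d7": 1.0}

set_ranking = rank(select_scores(doc_sim, clusters, cluster_sim, m=3))
bag_ranking = rank(select_scores(doc_sim, clusters, cluster_sim, m=3, bag=True))
```

In this toy setting d7 has the highest query similarity but is dropped by both methods because it belongs to no top cluster, and Bag-Select promotes d40 because it belongs to all three top clusters.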
Set-Select, Bag-Select (Kurland&Lee ’04, ’09)
[Figure: a query q, top clusters c1, c2, c3, and documents d1, d2, d40 with their $Facets(q, d)$ sets. Under Set-Select, $Score(d) = sim(q, d)$ for each of the three documents. Under Bag-Select, $sim(q, d)$ is multiplied by the number of top clusters the document belongs to (factors of 1, 2, and 3 in the example).]
Empirical Results (Kurland&Lee ’09)
[Figure: bar chart of MAP (axis from 15% to 29%) on AP89, AP88+89, and LA+FR for Baseline (doc-based), Set-s, and Bag-s; ‘*’ marks a statistically significant difference with the baseline]
The cluster ranking challenge
• $score(c; q) \triangleq sim(q, c)$
• How do we represent cluster c?
  • Binary term vector (Jardine&van Rijsbergen ’71, van Rijsbergen ’74)
  • A centroid of the vectors representing c’s constituent documents (Jardine&van Rijsbergen ’71, Croft ’80, Voorhees ’85, El-Hamdouchi&Willett ’87, Liu&Croft ’08)
    • Cosine, for example, serves for the similarity measure in the vector space
  • The big document that results from concatenating c’s constituent documents (Kurland&Lee ’04, Liu&Croft ’04)
    • Language-model-based similarity estimates
Cluster representations (Liu&Croft ’08)
• The best representation for a cluster, among those studied, was the geometric mean of the language models of its constituent documents
  • Seo&Croft ’10 provide arguments based on information geometry for the effectiveness of the geometric-mean-based representation
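A geometric-mean cluster model can be sketched as below. It assumes the document language models are already smoothed (so no probability is zero) and defined over the same vocabulary; renormalizing the result so that it sums to one is also an assumption made here, and the toy models are hypothetical.

```python
import math


def geometric_mean_model(doc_models):
    """Per-term geometric mean of the document models, renormalized to sum to 1."""
    n = len(doc_models)
    vocabulary = doc_models[0].keys()
    raw = {w: math.exp(sum(math.log(m[w]) for m in doc_models) / n)
           for w in vocabulary}
    total = sum(raw.values())
    return {w: v / total for w, v in raw.items()}


# Hypothetical smoothed unigram models of two documents in one cluster.
models = [
    {"bus": 0.8, "taxi": 0.1, "home": 0.1},
    {"bus": 0.8, "taxi": 0.1, "home": 0.1},
    {"bus": 0.1, "taxi": 0.8, "home": 0.1},
]
cluster_model = geometric_mean_model(models)
```

Unlike an arithmetic mean or a concatenation, the geometric mean rewards terms that are probable in all of the cluster's documents, since a low probability in any one document pulls the product down.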
Some more cluster ranking methods
• Using the min/max query-similarity score of a document in the cluster (Leuski ’01, Shanahan et al. ’03, Liu&Croft ’08)
• Document-cluster graphs (Kurland&Lee ’06; more details later)
• Variance of document retrieval scores in the cluster (Liu&Croft ’06; more details later)
• Aggregating measures of properties of a cluster (Kurland&Domshlak ’08)
• Using the similarity between the cluster and an expanded query form (Liu&Croft ’04, Wei&Croft ’06; more details later)
The optimal cluster
• The percentage of relevant documents in the cluster $c$ of $k$ documents for which $score(c; q)$ is the highest is the precision@k attained by cluster-based retrieval with strategy #1.
Using nearest neighbor query-specific clusters of 5 documents: (Kurland&Domshlak ’08)
The optimal cluster (Liu&Croft’06)
The optimal cluster (cont.)
• Jardine and van Rijsbergen ’71 were the first to report the existence of an optimal (offline created) cluster
• This was later re-asserted by, for example, Hearst&Pedersen (’96), Tombros et al. (’02), Kurland (’06) and Liu&Croft (’06)
• Offline-created optimal clusters contain a smaller percentage of relevant documents than optimal query-specific clusters (Tombros et al. ’02)
• There are several approaches to estimating the potential retrieval merits of finding optimal clusters (Tombros et al. ’02)
A probabilistic graph-based approach for ranking clusters (Kurland ’08, Kurland&Krikon ’11)
• What is the probability that this cluster is relevant to this query? (cf., Croft ’80)
• The ClustRanker method:
  $score(c; q) \triangleq \lambda\, p(q \mid c)\, p(c) + (1 - \lambda) \sum_{d \in c} p(q \mid d)\, p(c \mid d)\, p(d)$
• p(d) and p(c) are estimated based on graphs where documents (clusters) are vertices, edge-weights represent inter-item similarities, and the PageRank score of item x serves as an estimate for p(x)
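The scoring formula above can be evaluated as in the following sketch. It takes all probability estimates as given inputs; how they are estimated (language models, PageRank over similarity graphs) is outside the sketch, and the numbers and the function name are placeholders chosen for illustration.

```python
def clustranker_score(lam, p_q_c, p_c, members):
    """lam * p(q|c) * p(c) + (1 - lam) * sum over d in c of p(q|d) * p(c|d) * p(d).

    `members` holds one (p(q|d), p(c|d), p(d)) triple per document d in the cluster.
    """
    cluster_part = p_q_c * p_c
    document_part = sum(p_q_d * p_c_d * p_d for p_q_d, p_c_d, p_d in members)
    return lam * cluster_part + (1 - lam) * document_part


# Placeholder estimates for a cluster with two documents.
score = clustranker_score(
    lam=0.5,
    p_q_c=0.5,
    p_c=0.5,
    members=[(0.5, 0.5, 0.5), (0.25, 1.0, 0.5)],
)
```

Setting `lam=1` ranks clusters using only cluster-level evidence, while `lam=0` relies entirely on the evidence from the cluster's constituent documents.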
ClustRanker (Kurland ’08, Kurland&Krikon ’11)
(Using nearest-neighbor query-specific clusters for re-ranking)
A (much) more effective cluster ranking method (Raiber&Kurland SIGIR 2013)
• Uses the Markov Random Field framework
• Attend Fiana Raiber’s talk: Ranking Document Clusters using Markov Random Fields
Selective cluster-based retrieval
• Griffiths et al. (’86) observed that cluster-based retrieval and document-based retrieval can be of the same effectiveness
• But, different relevant documents were retrieved in the two cases
• Some previous work on selecting a retrieval strategy per query
  • Croft&Thompson ’84
  • Amati et al. ’04
  • Balasubramanian&Allan ’10
Selective cluster-based retrieval (Liu&Croft ’06)
• A “good” cluster is one which (1) exhibits high similarity to the query and (2) contains documents with query-similarity values that do not deviate much from that of the cluster
• Kurland et al. (’12) found some contrasting evidence with respect to the deviation
• For queries with “good” clusters, perform cluster-based retrieval; for the other queries, perform document-based retrieval
![Page 50: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/50.jpg)
Selective cluster-based retrieval (Liu&Croft ’06)
![Page 51: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/51.jpg)
Intermediate summary
• We ranked document clusters
• We transformed the cluster ranking to document ranking
• Some observations:
• There are clusters that contain a very high percentage of relevant documents (whether static offline-created clusters or query-specific clusters); the optimal cluster
• Optimal query-specific clusters contain a higher percentage of relevant documents than static clusters (Tombros et al. ’02)
• Using small clusters results in more effective retrieval (Griffiths et al. ’86, Tombros et al. ’02, Kurland&Lee ‘04)
• Cluster representation is crucial
• A geometric-mean-based representation seems to be the most effective among those proposed (Liu&Croft ’08, Seo&Croft ’10, Kurland&Krikon ’11)
• The performance of cluster-based retrieval can be much better than that of document-based retrieval and that of using query expansion
![Page 52: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/52.jpg)
Using clusters to enrich (smooth) document representations
• Clusters provide (corpus) context for documents
• Enrich the document representation using information induced from similar documents
• Example: use Rocchio’s method to smooth the vector representing the document with those representing similar documents (Singhal&Pereira ’99)
$$d_{new} = d_{old} + \frac{1}{k}\sum_{i=1}^{k} d_i$$
$d_1, \ldots, d_k$ are d’s nearest neighbors in the vector space
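The smoothing formula above can be sketched in a few lines of Python. This is a minimal illustration, not the setup of Singhal&Pereira: the sparse-dictionary vector format, the use of cosine similarity to pick the neighbors, and the names `cosine` and `smooth_with_neighbors` are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def smooth_with_neighbors(doc_id, vectors, k):
    """Return d_new = d_old + (1/k) * sum of the k nearest neighbors' vectors."""
    d_old = vectors[doc_id]
    # Assumption: neighbors are the k other documents most cosine-similar to d.
    neighbors = sorted(
        (other for other in vectors if other != doc_id),
        key=lambda other: cosine(d_old, vectors[other]),
        reverse=True,
    )[:k]
    d_new = dict(d_old)
    for neighbor in neighbors:
        for term, weight in vectors[neighbor].items():
            d_new[term] = d_new.get(term, 0.0) + weight / len(neighbors)
    return d_new
```

The smoothed vector gains terms that appear only in similar documents, which is the "corpus context" the slide refers to.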
![Page 53: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/53.jpg)
Similar document-expansion methods
• Lavrenko (’00) employed a nearest-neighbor smoothing method for the query model
• Ogilvie (’00) and Kurland&Lee (’04) smoothed a document language model with language models of its nearest neighbors
• Tao et al. (’06) created pseudo counts for terms in a document that are smoothed using the counts of terms in similar documents
• Yi&Allan (’09) and Efron et al. (’12) use the (weighted) language models of nearest-neighbors of a document to smooth the document language model
• Efron et al. (’12) use this document model for Twitter search
![Page 54: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/54.jpg)
A quick recap of the language modeling approach (Ponte&Croft ’98)
$q$: query, $d$: document, $C$: corpus of documents, $w$: term
A maximum likelihood estimate: $p_{MLE}(w|d) = \frac{tf(w \in d)}{\sum_{w' \in d} tf(w' \in d)}$; $tf(w \in d)$ is the number of times $w$ appears in $d$
Jelinek-Mercer smoothing: $p_{JM}(w|d) = (1-\lambda)\, p_{MLE}(w|d) + \lambda\, p_{MLE}(w|C)$
Dirichlet smoothing: $\lambda = \frac{\mu}{|d| + \mu}$
(1) The query likelihood model (Song&Croft '99): $score(d;q) \triangleq p(q|d) = \prod_{q_i \in q} p(q_i|d)$
(2) The KL retrieval method (Lafferty&Zhai '01): $score(d;q) \triangleq -\sum_{q_i \in q} p(q_i|q) \log \frac{p(q_i|q)}{p(q_i|d)}$
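To make the recap concrete, here is a small Python sketch of the maximum likelihood estimate, the two smoothing methods, and query-likelihood scoring in log space. The function names and the use of `collections.Counter` for term counts are choices made for this example; it also assumes every query term occurs somewhere in the corpus, so that the smoothed probabilities are never zero.

```python
import math
from collections import Counter


def p_mle(term, counts):
    """Maximum likelihood estimate: tf(term) / total number of term occurrences."""
    total = sum(counts.values())
    return counts.get(term, 0) / total if total else 0.0


def p_jm(term, doc, corpus, lam):
    """Jelinek-Mercer smoothing: (1 - lam) * p_MLE(w|d) + lam * p_MLE(w|C)."""
    return (1 - lam) * p_mle(term, doc) + lam * p_mle(term, corpus)


def p_dirichlet(term, doc, corpus, mu):
    """Dirichlet smoothing: the same interpolation with lam = mu / (|d| + mu)."""
    lam = mu / (sum(doc.values()) + mu)
    return p_jm(term, doc, corpus, lam)


def query_log_likelihood(query_terms, doc, corpus, mu):
    """log p(q|d) = sum over query terms of log p(q_i|d), Dirichlet-smoothed."""
    return sum(math.log(p_dirichlet(t, doc, corpus, mu)) for t in query_terms)
```

For example, with `doc = Counter("a a b".split())` and a corpus that also contains documents without the term `a`, `query_log_likelihood(["a"], doc, corpus, mu)` is higher for `doc` than for a document that never mentions `a`.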
![Page 55: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/55.jpg)
Using pLSA for retrieval (Hofmann ’99)
• pLSA (probabilistic latent semantic analysis) is a “probabilistic successor” of LSA (Deerwester et al. ’90), and an implementation of the aspect model (Hofmann et al. ’97)
• Additional topic models (LDA Blei et al. ’03, Pachinko Allocation Model Li&McCallum ’06)
• The generative story of pLSA:
• Select a document $d$ with probability $P(d)$
• Pick a latent class (topic) $z$ with probability $P(z|d)$
• Generate a word $w$ with probability $P(w|z)$
![Page 56: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/56.jpg)
Using pLSA for retrieval (Hofmann ’99)
• $P(d,w) = P(d)\,P(w|d)$
• $P(w|d) = \sum_{z \in Z} P(w|z)\,P(z|d)$
• Data likelihood: $L = \sum_{d \in D} \sum_{w \in W} tf(w \in d) \log P(d,w)$
• Maximizing data likelihood using tempered EM
• Note the potential metric divergence problem (Azzopardi et al. ’03)
• Using the topic model for retrieval
• Smoothing a document language model (more details later)
• Folding the query into the lower dimensional space
• Vector-based representation, cosine measure
• Retrieval performance is better than that attained by LSA and the cosine method; very small collections are used
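The mixture $P(w|d) = \sum_z P(w|z)P(z|d)$ and its fitting procedure can be sketched as follows. Note the swap: this is plain EM, not the tempered EM named on the slide, and the random initialization, the function names, and the list-of-dicts input format are assumptions of the example. It also assumes every document contains at least one term. Because $P(d)$ does not change during EM, the sketch tracks only the $\sum_d \sum_w tf(w \in d) \log P(w|d)$ part of the data likelihood.

```python
import math
import random


def normalized(values):
    """Scale non-negative numbers so that they sum to one."""
    total = sum(values)
    return [v / total for v in values]


def plsa_em(tf, num_topics, iterations=50, seed=0):
    """Fit P(w|z) and P(z|d) with plain EM; tf is a list of {term: count} dicts."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in tf for w in doc})
    topics = range(num_topics)
    # Random strictly positive starting points (hypothetical initialization).
    p_w_z = [dict(zip(vocab, normalized([rng.random() + 0.1 for _ in vocab])))
             for _ in topics]
    p_z_d = [normalized([rng.random() + 0.1 for _ in topics]) for _ in tf]
    for _ in range(iterations):
        acc_w_z = [dict.fromkeys(vocab, 0.0) for _ in topics]
        acc_z_d = [[0.0] * num_topics for _ in tf]
        for d, doc in enumerate(tf):
            for w, count in doc.items():
                # E-step: P(z|d,w) is proportional to P(w|z) P(z|d).
                post = normalized([p_w_z[z][w] * p_z_d[d][z] for z in topics])
                for z in topics:
                    acc_w_z[z][w] += count * post[z]
                    acc_z_d[d][z] += count * post[z]
        # M-step: renormalize the expected counts.
        p_w_z = [dict(zip(vocab, normalized([acc[w] for w in vocab])))
                 for acc in acc_w_z]
        p_z_d = [normalized(row) for row in acc_z_d]
    return p_w_z, p_z_d


def p_w_given_d(w, d, p_w_z, p_z_d):
    """P(w|d) = sum over topics z of P(w|z) P(z|d)."""
    return sum(p_w_z[z][w] * p_z_d[d][z] for z in range(len(p_w_z)))


def log_likelihood(tf, p_w_z, p_z_d):
    """sum_d sum_w tf(w in d) log P(w|d); the P(d) factor is constant in EM."""
    return sum(count * math.log(p_w_given_d(w, d, p_w_z, p_z_d))
               for d, doc in enumerate(tf) for w, count in doc.items())
```

The resulting `p_w_given_d` values are what a retrieval system would interpolate with the document and corpus language models when smoothing.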
![Page 57: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/57.jpg)
Cluster-based document language models (Liu&Croft ’04)
The CBDM model: $p(w|d) = \lambda_1\, p_{MLE}(w|d) + \lambda_2\, p_{MLE}(w|c) + \lambda_3\, p_{MLE}(w|C)$; $\lambda_1 + \lambda_2 + \lambda_3 = 1$
Let $c$ be the single hard cluster to which $d$ belongs; $c$ is represented by the concatenation of its constituent documents
Using offline K-MEANS clustering (K is the number of clusters)
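A minimal sketch of the three-way CBDM interpolation follows. The clustering itself is assumed to have been done already (the slide uses offline K-means); the function names and the `Counter`-based document format are illustrative choices, and summing term counts stands in for "concatenating" the cluster's documents.

```python
from collections import Counter


def p_mle(term, counts):
    """Maximum likelihood estimate of a term from a bag of term counts."""
    total = sum(counts.values())
    return counts.get(term, 0) / total if total else 0.0


def cbdm(term, doc, cluster_docs, corpus_docs, lam1, lam2, lam3):
    """p(w|d) = lam1 p_MLE(w|d) + lam2 p_MLE(w|c) + lam3 p_MLE(w|C)."""
    if abs(lam1 + lam2 + lam3 - 1.0) > 1e-9:
        raise ValueError("the interpolation weights must sum to 1")
    # The cluster (and the corpus) is the concatenation of its documents.
    cluster = sum(cluster_docs, Counter())
    corpus = sum(corpus_docs, Counter())
    return (lam1 * p_mle(term, doc)
            + lam2 * p_mle(term, cluster)
            + lam3 * p_mle(term, corpus))
```

Since each component is a probability distribution over terms and the weights sum to one, the interpolated model is itself a distribution over the corpus vocabulary.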
![Page 58: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/58.jpg)
Topic-based document language models (Wei&Croft ’06)
• Apply Latent Dirichlet Allocation (LDA; Blei et al. ’03) to induce topics from the corpus
• Use the resultant topics to smooth document language models
• Using the KL divergence between a document LM and that of the query for ranking
• A generalization of the CBDM model
• Earlier work by Azzopardi et al. (’04) used LDA and pLSA to induce document prior distributions
• Lu et al. ’11 found that the retrieval performance of using LDA and pLSA (Hofmann ’99) was comparable
$$p(w|d) = \lambda_1\, p_{MLE}(w|d) + \lambda_2\, p_{LDA}(w|d) + \lambda_3\, p_{MLE}(w|C); \quad \lambda_1 + \lambda_2 + \lambda_3 = 1$$
$$p_{LDA}(w|d) = \sum_{i=1}^{k} p(w|z_i, \hat{\phi})\, p(z_i|\hat{\theta}, d)$$
![Page 59: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/59.jpg)
Topic-based document language models (Wei&Croft ’06)
![Page 60: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/60.jpg)
A study of using topic-based document language models (Yi&Allan ’09)
• Using more sophisticated topic models (e.g., Pachinko Allocation Model Li&McCallum ’06) doesn’t yield better retrieval performance than that attained by simpler ones (e.g., LDA)
• Using nearest-neighbor smoothing results in performance that is as good as that of using topic models
• Pseudo-feedback-based query expansion is more effective than using topic models (either in an offline fashion or in a query-specific fashion)
$$p_{TM}(w|d) \triangleq \sum_{t_i \in T} p_{TM}(w|t_i)\, p_{TM}(t_i|d)$$
$T$ is the set of topics
$$p'(w|d) \triangleq \lambda_1\, p_{MLE}(w|d) + \lambda_2\, p_{TM}(w|d) + \lambda_3\, p_{MLE}(w|C)$$
![Page 61: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/61.jpg)
Score-based smoothing (Kurland&Lee ’04 ‘10, Kurland ’09)
$$score(d;q) \triangleq p(q|d) = \sum_{c \in Cl} p(q|d,c)\, p(c|d)$$
Estimate $p(q|d,c)$ using $\lambda\, p(q|d) + (1-\lambda)\, p(q|c)$ =>
The interpolation method: $score(d;q) = \lambda\, p(q|d) + (1-\lambda) \sum_{c \in Cl} p(q|c)\, p(c|d)$
Using a single term $w$ for $q$ we get a cluster/topic-based document language model
Use $\exp(-KL(p(\cdot|x)\,||\,p(\cdot|y)))$, where $p(\cdot|z)$ is the unigram language model induced from $z$, as an estimate for $p(x|y)$
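The interpolation score, with the $\exp(-KL)$ estimates plugged in, can be sketched as below. Several details are assumptions of this example rather than the published configuration: the generated item $x$ uses an unsmoothed language model, the generating item $y$ uses a Jelinek-Mercer-smoothed one (with a hypothetical default of 0.1), clusters are represented as the summed term counts of their documents, and $p(c|d)$ is left unnormalized. All terms are assumed to occur in the corpus.

```python
import math
from collections import Counter


def mle(counts):
    """Unsmoothed unigram language model induced from term counts."""
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}


def smoothed(counts, corpus, lam=0.1):
    """Jelinek-Mercer-smoothed language model over the corpus vocabulary."""
    doc_lm, corpus_lm = mle(counts), mle(corpus)
    return {t: (1 - lam) * doc_lm.get(t, 0.0) + lam * p
            for t, p in corpus_lm.items()}


def p_est(x, y, corpus):
    """exp(-KL(p(.|x) || p(.|y))), used as an estimate for p(x|y)."""
    px, py = mle(x), smoothed(y, corpus)
    kl = sum(p * math.log(p / py[t]) for t, p in px.items())
    return math.exp(-kl)


def interpolation_score(query, doc, clusters, corpus, lam):
    """score(d;q) = lam p(q|d) + (1 - lam) sum_c p(q|c) p(c|d)."""
    cluster_part = sum(p_est(query, c, corpus) * p_est(c, doc, corpus)
                       for c in clusters)
    return lam * p_est(query, doc, corpus) + (1 - lam) * cluster_part
```

With `lam = 1` the cluster term vanishes and the score reduces to the document-only estimate; lowering `lam` lets a document that lacks the query terms benefit from belonging to a cluster whose other members contain them.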
![Page 62: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/62.jpg)
The interpolation method (Kurland&Lee ’04, ‘09)
• Ranking the corpus using nearest-neighbor offline-created clusters
‘b’ and ‘I’ mark statistically significant differences with the baseline and interpolation, respectively
![Page 63: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/63.jpg)
Interpolation method (Kurland&Lee ’04, ‘09) –comparison between using nearest neighbor (NN) clusters and K-Means clusters (Clusters are created offline)
El-Hamdouchi and Willett (’89) found that when using cluster ranking with offline created clusters
(i) Using nearest neighbor clusters (of two documents) resulted in better performance than that of using various hard clustering methods; (the same as Griffiths et al. (’86) findings)
(ii) Using small agglomerative clusters yielded better performance than using larger clusters; and,
(iii) Complete-link agglomerative clustering was more effective than single-link and Ward’s method for cluster-based retrieval; (the same as Voorhees’ (’85) findings)
![Page 64: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/64.jpg)
The interpolation method (Kurland ’09)
• Using nearest-neighbor query-specific clusters to re-rank an initially retrieved list

| | AP p@5 | AP p@10 | TREC8 p@5 | TREC8 p@10 | WSJ p@5 | WSJ p@10 |
|---|---|---|---|---|---|---|
| Init. Rank. | .457 | .432 | .500 | .456 | .536 | .484 |
| CQL (sim(q,c)) | .448 | .418 | .500 | .432 | .504 | .454 |
| Bag-Select | .507 | .494 | .532 | .514 | .548 | .488 |
| Interpolation | .537 | .498 | .576 | .496 | .592 | .508 |

‘*’ marks a statistically significant difference with the initial ranking (Init. Rank.)
![Page 65: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/65.jpg)
The interpolation method (Kurland ’09)
• Comparison with pseudo-feedback-based query expansion
• Using nearest-neighbor query-specific clusters
![Page 66: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/66.jpg)
The interpolation method with different query-specific clustering algorithms (Kurland ’09)
• The findings with respect to (i) nn-LM and nn-VS being superior to hard clustering schemes, and (ii) agg-comp’s relative effectiveness, are reminiscent of those of El-Hamdouchi and Willett (’89) who used cluster ranking with clusters created offline
• Tombros et al. (’02) found that average link was better than complete link in terms of query-specific optimal cluster search
![Page 67: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/67.jpg)
Comparison of cluster-based retrieval methods (Raiber&Kurland ’12)
![Page 68: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/68.jpg)
Cluster types and their effectiveness for smoothing document language models (or scores)
• Nearest-neighbor clusters (with a small number of nearest neighbors) >= topic models > hard clusters
• Note: The first to suggest the use of nearest-neighbor overlapping clusters were Griffiths et al. (’86)
• Used offline-created clusters (of two documents)
• There is much recent evidence for the effectiveness of using nearest-neighbor query-specific clusters (Kurland&Lee ’06, Liu&Croft ’08, Kurland ’09)
![Page 69: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/69.jpg)
Integrating query-specific and offline-created clusters
• Meister et al. (’09) used the interpolation algorithm (Kurland&Lee ’04) with both offline and query-specific clusters
• Small (but consistent) performance improvements over using only offline-created or query-specific clusters
• Lee et al. (’01) use a “query-specific” view of static (offline-created) clusters
![Page 70: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/70.jpg)
Cluster-based fusion
![Page 71: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/71.jpg)
Fusion of retrieved lists
The task: Given query $q$ and document lists $L_1, \ldots, L_m$ that were retrieved in response to $q$ from corpus $D$, produce a single list of results
Motivation: Integrating various information sources for retrieval (Croft '00) (e.g., document representations, query representations, retrieval models)
[Figure: three retrieved lists $L_1$, $L_2$, $L_3$ over documents $d_1, \ldots, d_4$ are fused into a single list $d_1, d_2, d_3$]
![Page 72: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/72.jpg)
A common fusion principle
Documents that are highly ranked in many of the lists are rewarded
(1) overlap of relevant documents in the lists is higher than that of non-relevant documents (Lee ’97)
(2) chorus and skimming effects (Saracevic&Kantor ’88, Vogt&Cottrell ’99)
$d$: document; $d \in L_i$: $d$ is a member of list $L_i$
$S_{L_i}(d)$: the positive retrieval score of $d$ in $L_i$ if $d \in L_i$; 0 otherwise
$F(d)$: "standard" fusion score of $d$ that depends only on retrieval scores or ranks
Examples:
$F_{CombMNZ}(d) \triangleq \#\{L_i : d \in L_i\} \cdot \sum_{L_i} S_{L_i}(d)$ (Fox&Shaw '94, Lee '97)
$F_{Borda}(d) \triangleq \sum_{L_i} \#\{d' \in L_i : S_{L_i}(d') \le S_{L_i}(d)\}$ (Borda 1781, Aslam&Montague '01)
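The two example fusion functions translate directly into code. In this sketch each retrieved list is a dictionary mapping a document id to its positive retrieval score in that list; that representation and the function names are assumptions of the example.

```python
def comb_mnz(lists):
    """CombMNZ: (number of lists containing d) * (sum of d's scores)."""
    docs = set().union(*lists)
    return {
        d: sum(d in lst for lst in lists) * sum(lst.get(d, 0.0) for lst in lists)
        for d in docs
    }


def borda(lists):
    """Borda: over the lists containing d, count documents not scored above d."""
    docs = set().union(*lists)
    return {
        d: sum(
            sum(1 for other in lst if lst[other] <= lst[d])
            for lst in lists
            if d in lst
        )
        for d in docs
    }
```

Lists that do not contain `d` contribute nothing in either function, which matches $S_{L_i}(d) = 0$ for $d \notin L_i$ given that all real scores are positive.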
![Page 73: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/73.jpg)
But …
Different retrieved lists might contain different relevant documents
e.g., Das-Gupta and Katzer ’83, Griffiths et al. ’86, Soboroff et al. ’01, Beitzel et al. ’03
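[Figure: three retrieved lists $L_1$, $L_2$, $L_3$, each holding a different mix of relevant documents ($r_1$, $r_2$, $r_3$) and non-relevant documents ($nr$), are fused]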
![Page 74: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/74.jpg)
A cluster-based fusion approach (Kozorovitsky&Kurland ’11)
Let similar documents across the lists provide relevance-status support to each other
• cluster hypothesis
• utilize information induced from clusters of similar documents that are created across the lists
[Figure: the lists of relevant ($r_1$, $r_2$, $r_3$) and non-relevant ($nr$) documents are fused into a single list $r_1$, $r_2$, $r_3$]
![Page 75: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/75.jpg)
Fusion model (Kozorovitsky&Kurland ’11)
$D_{init}$ - the set of documents in the lists; $Cl(D_{init})$ - their clusters
Starting point: use clusters as proxies for documents
$$p(d|q) = \sum_{c \in Cl(D_{init})} p(d|c,q)\, p(c|q) \quad \text{(cf., Kurland\&Lee '04)}$$
Estimate: $\hat{p}(d|c,q) = \lambda\, p(d|c) + (1-\lambda)\, p(d|q)$
Resultant fusion function:
$$F_{ClustFuse}(d) \triangleq (1-\lambda)\, p(d|q) + \lambda \sum_{c \in Cl(D_{init})} p(c|q)\, p(d|c)$$
![Page 76: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/76.jpg)
ClustFuse (Kozorovitsky&Kurland ’11)
Document d is rewarded based on its
• standard fusion score, which reflects the extent to which d is highly ranked in many of the lists
• similarity to clusters that contain documents that are highly ranked in many of the lists
• $\lambda = 0$ amounts to the standard fusion method (F) that ClustFuse incorporates
$$F_{ClustFuse}(d) \triangleq (1-\lambda)\, p(d|q) + \lambda \sum_{c \in Cl(D_{init})} p(c|q)\, p(d|c)$$
Estimates:
$$p(d|q) = \frac{F(d;q)}{\sum_{d' \in D_{init}} F(d';q)}$$
$$p(c|q) = \frac{\prod_{d \in c} F(d;q)}{\sum_{c' \in Cl(D_{init})} \prod_{d' \in c'} F(d';q)}$$
$$p(d|c) = \frac{\sum_{d' \in c} sim(d',d)}{\sum_{d_i \in D_{init}} \sum_{d' \in c} sim(d',d_i)}$$
cf., the interpolation model (Kurland&Lee ’04, ‘09)
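The ClustFuse score and its three estimates can be sketched as follows. The input format is an assumption of this example: `fusion_scores` maps each document to a positive standard fusion score $F(d;q)$ (for instance CombMNZ), `clusters` is a list of document-id lists, and `sim` is any non-negative inter-document similarity function supplied by the caller.

```python
import math


def clust_fuse(fusion_scores, clusters, sim, lam):
    """F_ClustFuse(d) = (1 - lam) p(d|q) + lam * sum_c p(c|q) p(d|c)."""
    docs = list(fusion_scores)
    # p(d|q): the standard fusion score, normalized over all documents.
    f_total = sum(fusion_scores.values())
    p_d_q = {d: fusion_scores[d] / f_total for d in docs}
    # p(c|q): product of member fusion scores, normalized over clusters.
    products = [math.prod(fusion_scores[d] for d in c) for c in clusters]
    prod_total = sum(products)
    p_c_q = [p / prod_total for p in products]

    def p_d_c(d, cluster):
        # p(d|c): similarity of d to the cluster's members, normalized over docs.
        num = sum(sim(member, d) for member in cluster)
        den = sum(sim(member, other) for other in docs for member in cluster)
        return num / den if den else 0.0

    return {
        d: (1 - lam) * p_d_q[d]
        + lam * sum(p_c_q[i] * p_d_c(d, c) for i, c in enumerate(clusters))
        for d in docs
    }
```

Setting `lam=0` returns the normalized standard fusion scores, while larger values let a document that is ranked low in the lists move up when it is similar to a cluster of highly ranked documents.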
![Page 77: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/77.jpg)
MAP performance of fusing TREC runs (Kozorovitsky&Kurland ’11)
[Figure: bar chart of MAP (axis range 4 to 10) on trec3 for run1, CombMNZ, ClustFuseCombMNZ, Borda, and ClustFuseBorda]
‘r’, ‘f’ – statistically significant differences with run1 and the standard fusion method, respectively
Fusing 3 randomly selected TREC runs (run1 is the best performing among the three)
![Page 78: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/78.jpg)
Optimal clusters in the fusion setting (Kozorovitsky&Kurland ’11)
• OptCluster is the optimal cluster among all clusters created from all the documents in the 3 runs that are fused (run1, run2, run3)
• OptCluster(run_i) is the optimal cluster among clusters created from run_i
‘a’, ‘b’ and ‘c’ mark statistically significant differences with run1, run2, and run3, respectively
![Page 79: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/79.jpg)
Cluster-based fusion (Kozorovitsky&Kurland ’11)
• Re-ranking each run using query-specific clusters created from the run, and then fusing the runs (cf. Zhang et al. ’01), yields performance that is inferior to that of using clusters created over all the runs
• The lower the overlap between relevant documents in the runs the more benefit we gain from applying cluster-based fusion
![Page 80: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/80.jpg)
Cluster-based federated search (Khalman&Kurland ’12)
• Retrieving the lists from disjoint corpora (federated/distributed search)
• Crestani and Wu (’06) showed that in the federated search setting there exist clusters that contain a high percentage of relevant documents
| | | p@5 | p@10 |
|---|---|---|---|
| CORI | init | 46.0 | 42.0 |
| CORI | ClustFuse | 53.6 | 49.8 |
| SSL | init | 43.2 | 41.0 |
| SSL | ClustFuse | 50.4 | 49.0 |
![Page 81: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/81.jpg)
Cluster-based query expansion
• Treating clusters as pseudo queries (Kurland et al. ’05)
• Using cluster-based (or topic-based) smoothed document language models for both constructing an expanded query form and for ranking (Liu&Croft ’04, Tao et al. ’06, Wei&Croft ’06)
• Constructing a query-expanded form by rewarding top-retrieved documents that are members of many query-specific overlapping clusters (Lee et al. ’08)
• Using top-retrieved clusters instead (or in addition to) top-retrieved documents for constructing an expanded query form (Na et al. ’07, Gelfer Kalmanovich&Kurland ’09)
• Cluster-based query expansion for federated search (Shokouhi et al. ’09)
![Page 82: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/82.jpg)
Cluster-based results diversification
• e.g., Maximal Marginal Relevance (Carbonell&Goldstein’98)
• Let R be the result list of the documents most highly ranked using $Sim_1(q,d)$
• Let S be the new list we create from R (i.e., re-ranking); the order of inserting documents to S is the induced ranking
• The next document inserted into S is $\arg\max_{d \in R \setminus S} \left[\lambda\, Sim_1(q,d) - (1-\lambda) \max_{d_i \in S} Sim_2(d,d_i)\right]$
• A cluster-based approach: estimate $Sim_1(q,d)$ using a cluster ranking approach (He et al. ’11, Raiber&Kurland ’13)
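The greedy construction of S under Maximal Marginal Relevance can be sketched as below. The inputs are assumptions of this example: `sim1` is a dictionary of query-document similarities (which, in the cluster-based variant, would come from a cluster ranking method) and `sim2` is a caller-supplied inter-document similarity function.

```python
def mmr_rerank(candidates, sim1, sim2, lam):
    """Greedily move documents from R to S by maximal marginal relevance."""
    remaining = list(candidates)
    selected = []
    while remaining:
        def marginal_relevance(d):
            # Redundancy is the highest similarity to an already selected doc.
            redundancy = max((sim2(d, s) for s in selected), default=0.0)
            return lam * sim1[d] - (1 - lam) * redundancy

        best = max(remaining, key=marginal_relevance)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With `lam=1` the result is the original relevance ranking; lower values push near-duplicates of already selected documents further down the list.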
![Page 83: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/83.jpg)
Cluster-based results diversification (cont.)
• A second cluster-based approach: Cluster R and pick documents from the clusters (e.g., in a round robin fashion) which are viewed as potential aspects
• e.g., Leelanupab et al. ’10
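A round-robin pass over clusters is simple to sketch. This illustrates only the general idea, not the method of Leelanupab et al.; it assumes the result list has already been clustered, that documents inside each cluster are ordered by relevance, and that the cluster order reflects some priority chosen by the caller.

```python
def round_robin_diversify(clusters):
    """Interleave documents from clusters, one per cluster in each round."""
    result, seen = [], set()
    depth = 0
    while any(depth < len(cluster) for cluster in clusters):
        for cluster in clusters:
            # Skip exhausted clusters and documents shared by several clusters.
            if depth < len(cluster) and cluster[depth] not in seen:
                seen.add(cluster[depth])
                result.append(cluster[depth])
        depth += 1
    return result
```

Each cluster is treated as a potential aspect of the query, so the top of the resulting list covers as many aspects as there are clusters.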
![Page 84: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/84.jpg)
Utilizing relevance feedback using document clusters
• Shanahan et al. ’03 found that the optimal cluster is a very good basis for relevance feedback
• Re-emphasizing claims in older literature (e.g., Jardine&Rijsbergen ’71, Croft ’80) about the motivation to find good clusters
• Active relevance feedback (Shen&Zhai ’05)
• Diversify the feedback set by picking documents from query-specific clusters of top-retrieved results
• Baseline: asking for feedback for the top-k retrieved documents
• Interactive retrieval
• Ivie ’66, Hearst&Pedersen ’95, Leuski ’01
![Page 85: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/85.jpg)
Using clusters for query-performance prediction (QPP)
• The query-performance prediction task: estimating the effectiveness of a search performed in response to a query in the absence of relevance judgments (Carmel&Yom Tov ’10)
• The clustering tendency of the results is an indicator for search effectiveness (Vinay et al. ’06)
• The extent to which the retrieval scores of documents “respect” the cluster hypothesis is an effective query-performance predictor (Diaz ’07)
![Page 86: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/86.jpg)
On the connection between cluster ranking and query-performance prediction (Kurland et al. ’12)
• Cluster ranking: estimate the probability that a cluster (set of documents) is relevant to a query
• Query performance prediction (QPP): estimate the probability that a result list (ranked list of documents) is relevant to a query
• As it turns out, quite a few QPP and cluster ranking methods are based on the exact same principles
• The geometric mean of retrieval scores in a result list is a high quality performance predictor (Zhou&Croft ’07)
• The geometric mean of retrieval scores in a cluster is an effective criterion for ranking clusters (Liu&Croft ’08, Seo&Croft ’10, Kurland&Krikon ’11)
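The shared principle, the geometric mean of retrieval scores over a set of documents, can be sketched in a few lines. The same `geometric_mean` function could score a whole result list (as a performance predictor) or a cluster (as a ranking criterion); the function names and the dictionary inputs are assumptions of this example, and all scores are assumed to be positive.

```python
import math


def geometric_mean(scores):
    """Geometric mean of positive retrieval scores, computed in log space."""
    return math.exp(sum(math.log(s) for s in scores) / len(scores))


def rank_clusters(clusters, retrieval_scores):
    """Order cluster ids by the geometric mean of their members' scores."""
    def cluster_score(cluster_id):
        members = clusters[cluster_id]
        return geometric_mean([retrieval_scores[d] for d in members])

    return sorted(clusters, key=cluster_score, reverse=True)
```

Unlike the arithmetic mean, the geometric mean penalizes a cluster that mixes one very high score with very low ones, so it favors clusters whose members are uniformly well matched to the query.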
![Page 87: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/87.jpg)
Cluster-based retrieval – intermediate summary
• Using clusters to select documents
• Using clusters to enrich (smooth) document representations
• or topic models
• Offline vs. query-specific clustering
• Soft vs. hard clustering
• The optimal cluster
• Cluster-based fusion and federated search
• Cluster-based query expansion
• Cluster-based results diversification
• Using document clusters to utilize relevance feedback
• The connection between query-performance prediction and cluster ranking
![Page 88: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/88.jpg)
Graph-based methods utilizing inter-document similarities
![Page 89: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/89.jpg)
Graph-based framework for re-ranking (Kurland&Lee '05, ‘10)
Inspiration: Web Retrieval
Common approach to web retrieval:
• Re-rank an initial retrieved list $D_{init}$ of documents by the degree of centrality (Brin&Page '98, Kleinberg '99)
• Centrality of a document is estimated using explicit hyperlink structure (PageRank, HITS)
Can we use the scoring by centrality approach for ranking non-hypertext documents?
[Figure: link graph over nodes A, B, C, X, Y]
![Page 90: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/90.jpg)
A possible strategy: Structural re-ranking
• Use inter-document similarities to infer links between documents in $D_{init}$
• On the resultant graph (of documents and induced links) define centrality measures and use them as criteria for ranking
![Page 91: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/91.jpg)
How to induce links?
One might suggest:
Vector Space Model (VSM) for information representation and cosine for similarity metric
Erkan&Radev '04: text summarization
• Cosine similarity between sentences
• See the book: Rada Mihalcea and Dragomir Radev. Graph-based natural language processing and information retrieval. Cambridge University Press, 2011.
but …
![Page 92: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/92.jpg)
Inducing Links
$d_1$: “Dublin Portland Beijing” (“flat” distribution)
$d_2$: “Portland Portland Portland” (“spiky” distribution)
[Figure: relevance of one document suggests relevance of the other in one direction only (Relevant ⇒ Relevant vs. Relevant ⇏ Relevant)]
LMs: $p(d_2|d_1) > p(d_1|d_2)$
VSM: $\cos(d_1,d_2) = \cos(d_2,d_1)$
Zhang et al. (’05) used asymmetric cosine-based edge weights, but these were found by Kurland&Lee (’10) to be somewhat less effective than the language-model-based weights
![Page 93: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/93.jpg)
Generation graphs
For document $o \in D_{init}$:
$TopGen(o)$: the $k$ documents $g \in D_{init}$ that yield the highest $p(o|g)$
$g \in TopGen(o)$: $g$ is a “generator” of $o$ ($o$ is an offspring of $g$)
The complete graph $G = (D_{init}, D_{init} \times D_{init})$ with edge weights:
$$wt(o \to g) = p(o|g)\,\delta[g \in TopGen(o)]$$
The smoothed (Brin&Page '98) complete graph $G^{[\lambda]} = (D_{init}, D_{init} \times D_{init})$ with edge weights:
$$wt^{[\lambda]}(o \to g) = \frac{1-\lambda}{|D_{init}|} + \lambda \cdot \frac{wt(o \to g)}{\sum_{g' \in D_{init}} wt(o \to g')}$$
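Building the smoothed generation graph can be sketched as follows. The generation probability $p(o|g)$ is passed in as a function `p_gen(o, g)`, which is a placeholder for whatever language-model estimate is used. Two choices here are assumptions of the example rather than statements about the original method: a document is not allowed to be its own generator, and a document with no positive outgoing weight falls back to uniform weights.

```python
def build_smoothed_graph(docs, p_gen, k, lam):
    """Return wt[o][g], the smoothed weight of the edge from offspring o to g."""
    n = len(docs)
    graph = {}
    for o in docs:
        # TopGen(o): the k documents that yield the highest p(o|g).
        others = [g for g in docs if g != o]
        top_gen = sorted(others, key=lambda g: p_gen(o, g), reverse=True)[:k]
        wt = {g: p_gen(o, g) for g in top_gen}
        out_weight = sum(wt.values())
        if out_weight > 0:
            # (1 - lam) / |D_init| + lam * normalized edge weight.
            graph[o] = {g: (1 - lam) / n + lam * wt.get(g, 0.0) / out_weight
                        for g in docs}
        else:
            graph[o] = {g: 1.0 / n for g in docs}
    return graph
```

Every row of the returned structure sums to one, so it can be read as the transition matrix of a Markov chain over the documents.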
![Page 94: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/94.jpg)
Inducing centrality: Recursive Weighted Influx Algorithm
• Smoothed graph $G^{[\lambda]}$: ergodic Markov chain, power method converges
• The Recursive Weighted Influx algorithm is a weighted analog of PageRank
$$Cen_{RWI}(d; G^{[\lambda]}) = \sum_{o \in D_{init}} wt^{[\lambda]}(o \to d) \, Cen_{RWI}(o; G^{[\lambda]})$$
$$\sum_{d \in D_{init}} Cen_{RWI}(d; G^{[\lambda]}) = 1$$
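The recursive definition can be solved with the power method. Below is a minimal sketch, assuming the smoothed edge weights are given as a row-stochastic nested dictionary `wt_lam[o][d]`; the toy weights, iteration cap, and tolerance are illustrative assumptions.

```python
def rwi_centrality(wt_lam, iters=200, tol=1e-12):
    """Power method: Cen(d) = sum over o of wt_lam(o->d) * Cen(o), normalized to sum to 1."""
    docs = list(wt_lam)
    cen = {d: 1.0 / len(docs) for d in docs}
    for _ in range(iters):
        new = {d: sum(wt_lam[o][d] * cen[o] for o in docs) for d in docs}
        z = sum(new.values())
        new = {d: v / z for d, v in new.items()}
        delta = max(abs(new[d] - cen[d]) for d in docs)
        cen = new
        if delta < tol:
            break
    return cen

# Hypothetical smoothed weights; every row sums to 1.
wt_lam = {
    "d1": {"d1": 0.1, "d2": 0.8, "d3": 0.1},
    "d2": {"d1": 0.8, "d2": 0.1, "d3": 0.1},
    "d3": {"d1": 0.8, "d2": 0.1, "d3": 0.1},
}
cen = rwi_centrality(wt_lam)
print(cen)
```

In this toy graph d1 receives most of the weighted influx and therefore ends up with the highest centrality.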
![Page 95: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/95.jpg)
The language modeling framework
Algorithm: Recursive Weighted Influx+LM
Score by: $Cen(d; G) \cdot p(q \mid d)$, cf. $p(d) \cdot p(q \mid d)$
• $Cen(d; G)$: doc “prior”
• $p(q \mid d)$: initial ranking
Lafferty&Zhai '01: “with hypertext, [a document prior] might be the distribution calculated using the ‘PageRank’ scheme”
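As a small illustration of using centrality as a document “prior”, the sketch below re-ranks a list by the product of a centrality value and a query-likelihood value. All numbers are hypothetical.

```python
def rerank_with_prior(centrality, query_likelihood):
    """Rank documents by Cen(d; G) * p(q | d), highest first."""
    scores = {d: centrality[d] * query_likelihood[d] for d in query_likelihood}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical values for three documents.
centrality = {"d1": 0.5, "d2": 0.3, "d3": 0.2}
query_likelihood = {"d1": 0.010, "d2": 0.020, "d3": 0.012}

initial = sorted(query_likelihood, key=query_likelihood.get, reverse=True)
reranked = rerank_with_prior(centrality, query_likelihood)
print(initial)
print(reranked)
```

Here d1 moves above d3 because its higher centrality outweighs its slightly lower query likelihood.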
![Page 96: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/96.jpg)
Evaluation
• $D_{init}$: 50 documents retrieved by an optimized language-model-based retrieval method
• Evaluation measure: precision@5
• Reference comparison (initial ranking)
• Can we push relevant documents to the top 5 ranks and move non-relevant documents out of them?
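For reference, precision@5 is the fraction of the top 5 ranked documents that are relevant. A minimal sketch with hypothetical document identifiers and relevance judgments:

```python
def precision_at_k(ranking, relevant, k=5):
    """Fraction of the top-k documents in `ranking` that are in the `relevant` set."""
    top = ranking[:k]
    return sum(1 for d in top if d in relevant) / k

relevant = {"a", "f", "g"}
initial_ranking = ["a", "b", "c", "d", "e", "f", "g"]
reranked = ["a", "f", "g", "b", "c", "d", "e"]

p5_initial = precision_at_k(initial_ranking, relevant)
p5_reranked = precision_at_k(reranked, relevant)
print(p5_initial, p5_reranked)
```

Re-ranking within the same list changes precision@5 only when relevant documents cross the rank-5 boundary, which is the effect the evaluation is designed to measure.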
![Page 97: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/97.jpg)
LM framework with centrality scores as “priors”
[Figure: bar chart of prec @ 5 (y-axis ticks 26 to 61) for init rank vs. RW-Influx+LM on AP, TREC8, WSJ, and AP89. * marks a statistically significant difference with init rank.]
![Page 98: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/98.jpg)
Comparing centrality measures
[Figure: bar chart of precision@5 (y-axis ticks 25 to 60) for uniform, log(length) (e.g., Miller et al. '99; doc length as "prior"), and RW-Influx on AP, TREC8, WSJ, and AP89. * marks a statistically significant difference with init rank.]
![Page 99: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/99.jpg)
Cosine vs. LM probabilities
[Figure: bar chart of precision@5 (y-axis ticks 25 to 60) for init rank, RW-In+LM(COS), and RW-In+LM(LM) on AP, TREC8, WSJ, and AP89. * marks a statistically significant difference with init rank.]
![Page 100: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/100.jpg)
Relevance score propagation (cf. Otterbacher et al. ’05)
For document $o \in D_{init}$:
• $TopGen(o)$: the $k$ documents $g \in D_{init}$ that yield the highest $p(o \mid g)$
The complete graph $G = (D_{init}, D_{init} \times D_{init}, wt)$ with edge weights:
$$wt(o \to g) = \delta[g \in TopGen(o)] \, p(o \mid g)$$
The smoothed (Brin&Page '98) complete graph $G^{[\lambda]} = (D_{init}, D_{init} \times D_{init}, wt^{[\lambda]})$ with edge weights:
$$wt^{[\lambda]}(o \to g) = (1 - \lambda) \, \frac{sim(q, g)}{\sum_{g' \in D_{init}} sim(q, g')} + \lambda \, \frac{wt(o \to g)}{\sum_{g' \in D_{init}} wt(o \to g')}$$
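A minimal sketch of query-biased smoothing, in which the uniform jump of the PageRank-style graph is replaced by a jump proportional to sim(q, g). The interpolation form (weight 1 − λ on the query-biased jump, λ on the normalized edge weight), the toy values, and the fallback for rows without outgoing weight are assumptions made for illustration.

```python
def smooth_with_query_bias(wt, sim_q, lam):
    """wt_lam(o->g) = (1-lam) * sim(q,g)/sum sim(q,g') + lam * wt(o->g)/sum wt(o->g')."""
    docs = list(wt)
    sim_total = sum(sim_q[g] for g in docs)
    out = {}
    for o in docs:
        wt_total = sum(wt[o].values())
        out[o] = {}
        for g in docs:
            jump = sim_q[g] / sim_total
            # Assumption: a row with no outgoing weight follows the query-biased jump only.
            walk = wt[o][g] / wt_total if wt_total > 0 else jump
            out[o][g] = (1 - lam) * jump + lam * walk
    return out

# Hypothetical unsmoothed weights and query similarities.
wt = {
    "d1": {"d1": 0.0, "d2": 0.02, "d3": 0.0},
    "d2": {"d1": 0.04, "d2": 0.0, "d3": 0.0},
    "d3": {"d1": 0.05, "d2": 0.0, "d3": 0.0},
}
sim_q = {"d1": 0.5, "d2": 0.3, "d3": 0.2}

biased = smooth_with_query_bias(wt, sim_q, lam=0.8)
jump_only = smooth_with_query_bias(wt, sim_q, lam=0.0)
print(biased["d1"])
```

With λ = 0 every row reduces to the normalized query similarities, so the stationary distribution would simply reproduce the initial ranking scores.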
![Page 101: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/101.jpg)
Label propagation (Yang et al. ’06)
• Treat the query and documents in the highest ranked (query-specific) cluster as relevant
• Treat the documents at the tail of the result list as non-relevant
• Apply Zhu&Ghahramani’s (’02) label propagation algorithm:
$Y$: the vector of documents' labels (relevant/not-relevant/unlabeled)
$d_{ij}$: the distance between docs $i$ and $j$
$\sigma$: regularization factor
$$w_{ij} = \exp\left(-\frac{d_{ij}^2}{\sigma^2}\right), \qquad T_{ij} = P(j \to i) = \frac{w_{ij}}{\sum_k w_{kj}}$$
The algorithm:
1. Propagate $Y \leftarrow TY$
2. Row-normalize $Y$
3. Clamp the labeled data and go to step 1 until convergence
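The three steps can be sketched in plain Python. This is an illustrative sketch rather than the cited implementation: labels are represented as two-column rows (relevant, non-relevant), unlabeled documents start at (0.5, 0.5), and a fixed number of iterations stands in for a convergence test; these choices and the toy distances are assumptions.

```python
import math

def label_propagation(dist, labels, sigma=1.0, iters=200):
    """dist[i][j]: distance between docs i and j.
    labels[i]: 1 (pseudo relevant), 0 (pseudo non-relevant), or None (unlabeled).
    Returns, per document, the propagated weight of the 'relevant' label."""
    n = len(dist)
    w = [[math.exp(-dist[i][j] ** 2 / sigma ** 2) for j in range(n)] for i in range(n)]
    col_sum = [sum(w[k][j] for k in range(n)) for j in range(n)]
    # T[i][j] = P(j -> i) = w_ij / sum_k w_kj (column-normalized).
    T = [[w[i][j] / col_sum[j] for j in range(n)] for i in range(n)]

    def clamp(Y):
        for i, lab in enumerate(labels):
            if lab is not None:
                Y[i] = [1.0, 0.0] if lab == 1 else [0.0, 1.0]
        return Y

    Y = clamp([[0.5, 0.5] for _ in range(n)])
    for _ in range(iters):
        # Step 1: propagate Y <- TY.
        Y = [[sum(T[i][j] * Y[j][c] for j in range(n)) for c in range(2)] for i in range(n)]
        # Step 2: row-normalize Y.
        Y = [[v / sum(row) for v in row] for row in Y]
        # Step 3: clamp the labeled data.
        Y = clamp(Y)
    return [row[0] for row in Y]

# Four hypothetical documents placed on a line at positions 0, 1, 3, 4.
pos = [0.0, 1.0, 3.0, 4.0]
dist = [[abs(a - b) for b in pos] for a in pos]
labels = [1, None, None, 0]   # first doc pseudo relevant, last doc pseudo non-relevant

scores = label_propagation(dist, labels)
print(scores)
```

The unlabeled document closer to the pseudo-relevant one ends up with a relevance weight above 0.5, and the one closer to the pseudo-non-relevant document ends up below 0.5.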
![Page 102: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/102.jpg)
Label propagation (Yang et al. ’06)
Experiments with TREC8; Okapi BM25 is the initial ranking method
M: size of the initial list; K: size of the base set from which pseudo relevant documents are selected; N: # of pseudo non relevant documents
![Page 103: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/103.jpg)
Document-cluster graphs (Kurland&Lee ’06)
• The cluster-document duality (for query-specific clusters)
• Clusters that are most representative of the information need contain (are associated with) many relevant documents
• Clusters that contain (are associated with) many relevant documents are most representative of the information need
![Page 104: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/104.jpg)
Hub/Authority Cluster/Document
[Figure: three graph types over documents ($d_i$, $d_j$, $d_k$, $d_l$) and clusters ($c_l$, $c_m$, $c_n$): document as authority ($c \to d$), document as hub ($d \to c$), and doc-only graph ($d \to d$).]
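The mutually reinforcing duality between clusters and documents is what the HITS iteration captures. Below is a minimal sketch of HITS on a bipartite graph with clusters as hubs and documents as authorities ($c \to d$). The edge weights are hypothetical (the slides do not specify them here), and a fixed number of iterations stands in for a convergence test.

```python
import math

def hits_cluster_doc(wt, iters=50):
    """wt[c][d]: weight of edge c -> d (cluster as hub, document as authority)."""
    clusters = list(wt)
    docs = sorted({d for c in clusters for d in wt[c]})
    hub = {c: 1.0 for c in clusters}
    auth = {d: 1.0 for d in docs}
    for _ in range(iters):
        # A document is a good authority if good hubs (clusters) point to it.
        auth = {d: sum(wt[c].get(d, 0.0) * hub[c] for c in clusters) for d in docs}
        norm = math.sqrt(sum(v * v for v in auth.values()))
        auth = {d: v / norm for d, v in auth.items()}
        # A cluster is a good hub if it points to good authorities (documents).
        hub = {c: sum(w * auth[d] for d, w in wt[c].items()) for c in clusters}
        norm = math.sqrt(sum(v * v for v in hub.values()))
        hub = {c: v / norm for c, v in hub.items()}
    return auth, hub

# Hypothetical cluster -> document edges with unit weights.
wt = {
    "c1": {"d1": 1.0, "d2": 1.0},
    "c2": {"d2": 1.0, "d3": 1.0},
    "c3": {"d2": 1.0},
}
auth, hub = hits_cluster_doc(wt)
print(auth)
print(hub)
```

Document d2, which belongs to all three clusters, receives the highest authority score in this toy graph.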
![Page 105: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/105.jpg)
Re-ranking using document centrality (query-specific clusters)
using nearest-neighbor clusters
[Figure: bar chart of prec @ 5 (y-axis ticks 30% to 65%) on AP, TREC8, and WSJ for init. rank, auth[d->d] (authority scores, doc-only graph $d \to d$), PR[d->d] (PageRank scores, doc-only graph $d \to d$), and auth[c->d] (authority scores, doc as authority graph $c \to d$). * marks a significant difference between auth[c->d] and auth[d->d].]
![Page 106: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/106.jpg)
Cluster centrality
The percentage of relevant documents in the highest ranked cluster of 5 documents
[Figure: results on AP, TREC8, and WSJ for influx[d → c], auth[d → c], and query likelihood (sim(q,c)). Extracted value rows, in AP / TREC8 / WSJ order: 39.2% / 39.6% / 44%; 48.7% / 48% / 51.2%; 49.5% / 50.8% / 53.6%. 'q' marks a significant difference with query likelihood.]
![Page 107: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/107.jpg)
Passage-based document retrieval
• Motivation for using passage-based information: long and/or topically heterogeneous documents that are relevant to a query might contain a single (short) passage that contains query-pertaining information
The InterPsgDoc method (Callan ’94):
$$Score(d; q) \triangleq \lambda \, sim(q, d) + (1 - \lambda) \max_{g_i \in d} sim(q, g_i)$$
$g_i$ ($\in d$) is a passage in document $d$
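A minimal sketch of this interpolation, with hypothetical similarity values. It illustrates the motivation above: a long document with one strongly matching passage can outscore a short document with a higher whole-document similarity.

```python
def inter_psg_doc(sim_q_doc, sim_q_passages, lam=0.5):
    """Score(d; q) = lam * sim(q, d) + (1 - lam) * max over passages g_i of sim(q, g_i)."""
    return lam * sim_q_doc + (1 - lam) * max(sim_q_passages)

# Hypothetical similarities: a long heterogeneous document vs. a short focused one.
long_doc_score = inter_psg_doc(0.2, [0.1, 0.9, 0.05])
short_doc_score = inter_psg_doc(0.4, [0.4])
print(long_doc_score, short_doc_score)
```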
![Page 108: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/108.jpg)
Passage-document graphs (Bendersky&Kurland ’08)
• Re-ranking an initially retrieved list
• Documents in the list are hubs, passages of documents in the list are authorities
• $Score(d; q) \triangleq sim(q, d) \cdot \max_{g_i \in d} Centrality(g_i)$
• Centrality(g) is g’s influx (weighted in-degree) or authority value (induced by the HITS algorithm)
[Figure: graph with documents $d_i$, $d_j$ and passages $g_l$, $g_m$]
See Krikon et al. (’10) for using simultaneously doc-only and passage-only graphs
See Krikon&Kurland (’11) for integrating documents, passages and clusters
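A minimal sketch of the influx variant of this scoring: each passage's centrality is its weighted in-degree from documents, and a document is scored by its query similarity times the highest centrality among its own passages. The document-to-passage edge weights and all values are hypothetical.

```python
def passage_influx(edge_wt):
    """Weighted in-degree of each passage; edge_wt maps (doc, passage) -> weight."""
    influx = {}
    for (_doc, psg), w in edge_wt.items():
        influx[psg] = influx.get(psg, 0.0) + w
    return influx

def score_docs(sim_q_doc, passages_of, influx):
    """Score(d; q) = sim(q, d) * max over passages g_i in d of Centrality(g_i)."""
    return {
        d: sim_q_doc[d] * max(influx.get(g, 0.0) for g in passages_of[d])
        for d in sim_q_doc
    }

# Hypothetical graph: documents (hubs) point to passages (authorities).
edge_wt = {
    ("d2", "g1"): 0.6,
    ("d1", "g1"): 0.3,
    ("d2", "g2"): 0.1,
    ("d1", "g3"): 0.2,
}
passages_of = {"d1": ["g1", "g2"], "d2": ["g3"]}
sim_q_doc = {"d1": 0.4, "d2": 0.5}

influx = passage_influx(edge_wt)
scores = score_docs(sim_q_doc, passages_of, influx)
print(influx)
print(scores)
```

Although d2 has the higher query similarity, d1 is ranked first because it contains the most central passage.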
![Page 109: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/109.jpg)
Score regularization (Diaz ’05, ’07)
• The idea: similar documents should be assigned similar retrieval scores
• But, maintain some “consistency” with the initial retrieval scores
• Otherwise, a flat score distribution would be the best
• Could be viewed as “iterative score smoothing”
• $f$: vector of regularized scores
• $S(f)$: score function associated with inter-document consistency of scores; penalizes large differences between scores of similar documents
• $\Upsilon(f)$: consistency with original scores (L2 distance)
• Objective function:
$$f^* = \arg\min_{f \in R^n} Q(f) \triangleq S(f) + \mu \Upsilon(f)$$
![Page 110: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/110.jpg)
$y = (y_1, \ldots, y_n)$ is the vector of original retrieval scores of the documents to be re-ranked (i.e., the $n$ highest ranked documents by the initial search)
$f = (f_1, \ldots, f_n)$ is the vector of regularized (new) scores
$W$: affinity matrix ($W_{ij}$ - affinity between doc $i$ and doc $j$; $W_{ii} = 0$)
$D$: diagonal matrix where $D_{ii} = \sum_{j=1}^{n} W_{ij}$
$$S(f) = \sum_{i,j=1}^{n} W_{ij} \left( \frac{f_i}{\sqrt{D_{ii}}} - \frac{f_j}{\sqrt{D_{jj}}} \right)^2$$
$$\Upsilon(f) = \sum_{i=1}^{n} (f_i - y_i)^2$$
$$f^* = \arg\min_{f \in R^n} Q(f) \triangleq S(f) + \mu \Upsilon(f)$$
There is a closed form solution
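One way to see why a closed form exists is the short derivation below. It assumes a symmetric affinity matrix $W$ with $D_{ii} > 0$ for all $i$, $\mu > 0$, and the forms $S(f) = \sum_{i,j} W_{ij} (f_i/\sqrt{D_{ii}} - f_j/\sqrt{D_{jj}})^2$ and $\Upsilon(f) = \sum_i (f_i - y_i)^2$; the constants follow from these forms and may differ from the presentation in the cited papers.

```latex
% Assumption: W symmetric, D_ii > 0, mu > 0.
\tilde{W} \triangleq D^{-1/2} W D^{-1/2}
\\[4pt]
S(f) = \sum_{i,j=1}^{n} W_{ij}\left(\frac{f_i}{\sqrt{D_{ii}}} - \frac{f_j}{\sqrt{D_{jj}}}\right)^2
     = 2 f^{\top} (I - \tilde{W}) f,
\qquad
\Upsilon(f) = \lVert f - y \rVert^2
\\[4pt]
\nabla Q(f) = 4 (I - \tilde{W}) f + 2 \mu (f - y) = 0
\\[4pt]
f^{*} = \mu \left( 2 (I - \tilde{W}) + \mu I \right)^{-1} y
```

Since $I - \tilde{W}$ is positive semidefinite, the matrix $2(I - \tilde{W}) + \mu I$ is positive definite for $\mu > 0$, so the inverse exists and the minimizer is unique.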
![Page 111: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/111.jpg)
Score regularization – empirical results (Diaz ’07)
![Page 112: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/112.jpg)
Additional approaches (1)
• Daniłowicz and Baliński (’01) define a Markov chain over a graph where documents are vertices and edge-weights are based on query-sensitive inter-document similarities
• The more similar two documents, the higher the edge weight
• The smaller the difference between the original retrieval scores of two documents, the higher the edge weight
• Matveeva ’04
![Page 113: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/113.jpg)
Additional approaches (2)
• Spreading activation networks (Salton&Buckley ’88)
• Markov chains (Baliński&Daniłowicz ’05)
• Hyperlinks and inter-document similarities
• Biasing PageRank using query-similarity values (Richardson&Domingos ’04)
• Both the random jump and the jump via hyperlinks
• The bias can be based on inter-document similarities
• Are semantically related (hyper) links more effective for retrieval? (Koolen&Kamps ’11)
• Graph-based fusion (Kozorovitzky&Kurland ’09, ’11)
• Using random walks with absorbing states to diversify search results (Zhu et al. ’07)
• Term-based graphs (out of the scope of this tutorial)
![Page 114: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/114.jpg)
Adversarial search
• Mishne et al. ’05
• Finding spam comments in blogs by comparing the language models of the (i) comment, (ii) pages to which the comment has outgoing links, and (iii) the blog post
• Benczúr et al. ’06
• Finding nepotistic links by comparing a language model induced from the anchor text and that induced from the target page
• A similar approach was used by Martinez-Romo&Araujo ’09
• Raiber et al. ’12
• Using inter-document similarities to address keyword stuffing
![Page 115: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/115.jpg)
Concluding notes
• The cluster hypothesis gave rise to much work in the IR field. We surveyed:
• Cluster hypothesis tests
• Cluster-based document retrieval
• Using clusters to select documents
• Using clusters (or topic models) to enrich (smooth) document representations
• Using graph-based methods that utilize inter-document similarities for ad hoc retrieval
• Applications/tasks for which cluster-based or graph-based methods have been used
• Query-performance prediction, fusion, federated search, microblog retrieval, results diversification, query expansion, using relevance feedback, adversarial search
![Page 116: The Cluster Hypothesis in Information Retrievaliew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdfTutorial overview • The cluster hypothesis • Historical view of the effect](https://reader034.fdocuments.us/reader034/viewer/2022051606/60186a1d87adad733f39b8bf/html5/thumbnails/116.jpg)
Some open challenges
• The optimal cluster
• The cluster ranking challenge
• Selective application of cluster-based and document-based retrieval
• Devising query-sensitive inter-document measures that will result in the cluster hypothesis holding to a larger extent
• Devising additional graph-based centrality measures that are correlated with relevance