The Cluster Hypothesis in Information Retrieval

SIGIR 2013 tutorial

Oren Kurland
Technion --- Israel Institute of Technology

Email: [email protected]
http://iew3.technion.ac.il/~kurland

Slides are available at: http://iew3.technion.ac.il/~kurland/clustHypothesisTutorial.pdf

Tutorial overview

• The cluster hypothesis
• Historical view of the effect of the hypothesis on work on ad hoc information retrieval
• Testing the cluster hypothesis
• Cluster-based document retrieval
• Using topic models for ad hoc information retrieval
• Graph-based methods for ad hoc retrieval that utilize inter-document similarities
• Additional tasks/applications
  • Search results visualization, query-performance prediction, fusion, federated search, query expansion, microblog retrieval, relevance feedback, adversarial search
• Concluding notes

The ad hoc retrieval task

• Ranking the documents in a corpus by their relevance to the information need expressed by a query

  • Vector space model
  • Probabilistic approaches
  • Language modeling framework
  • Divergence from randomness framework
  • Learning to rank

The cluster hypothesis

Closely associated documents tend to be relevant to the same requests

(Jardine&van Rijsbergen ’71, van Rijsbergen ’79)

A quick historical tour

• Mid-to-late 60’s
  • Using document clusters to improve search efficiency (Salton ’68)
• 70’s-80’s
  • Using document clusters to improve search effectiveness (Jardine&van Rijsbergen ’71)
• ~2004-today
  • Using document clusters to improve search effectiveness (Azzopardi et al. ’04, Kurland&Lee ’04, Liu&Croft ’04)
• 90’s-00’s
  • Using document clusters to improve results browsing (Preece ’73)
• 90-today
  • Using topic models to improve search effectiveness (Deerwester et al. ’90)
• ~00’s-today
  • Using graph-based approaches for ad hoc retrieval that utilize inter-document similarities (Salton&Buckley ’88)

Ph.D. dissertations

• E. L. Ivie. Search procedures based on measures of relatedness between documents. PhD thesis, Massachusetts Institute of Technology, 1966.
• Marcia Davis Kerchner. Dynamic document processing in clustered collections. PhD thesis, Cornell University, 1971.
• Daniel McClure Murray. Document retrieval based on clustered files. PhD thesis, Cornell University, 1972.
• Ellen Voorhees. The effectiveness and efficiency of agglomerative hierarchic clustering in document retrieval. PhD thesis, Cornell University, 1985.
• Anton Leuski. Interactive information organization: Techniques and evaluation. PhD thesis, University of Massachusetts Amherst, 2001.
• Anastasios Tombros. The effectiveness of hierarchic query-based clustering of documents for information retrieval. PhD thesis, Department of Computing Science, University of Glasgow, 2002.
• Leif Azzopardi. Incorporating context within the language modeling approach for ad hoc information retrieval. PhD thesis, University of Paisley, 2005.
• Oren Kurland. Inter-document similarities, language models, and ad hoc retrieval. PhD thesis, Cornell University, 2006.
• Xiaoyong Liu. Cluster-based retrieval from a language modeling perspective. PhD thesis, University of Massachusetts Amherst, 2006.
• Xing Wei. Topic models in information retrieval. PhD thesis, University of Massachusetts Amherst, 2007.
• Fernando Diaz. Autocorrelation and regularization of query-based retrieval scores. PhD thesis, University of Massachusetts Amherst, 2008.
• Mark Smucker. Evaluation of find-similar with simulation and network analysis. PhD thesis, University of Massachusetts Amherst, 2008.

Improving search efficiency

• Cluster the corpus offline
• Represent each cluster by its centroid
• At query time, compare the centroids with the query and select the clusters to present

Does the cluster hypothesis hold?

• Depends on the inter-document similarity used?
• Maybe we should assume that the hypothesis holds, and accordingly devise inter-document similarity measures?

• More details later

Jardine&van Rijsbergen’s (’71) overlap test

• The similarity between two relevant documents vs. the similarity between a relevant and a non-relevant document
• Measuring the overlap between the similarity distributions

Jardine&van Rijsbergen’s cluster hypothesis test

(Figure is taken from Voorhees ‘85)

Voorhees’ (’85) nearest-neighbor test

• The percentage of relevant documents among the 5 nearest neighbors of a relevant document
• The cosine similarity between tf.idf vectors is used
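The nearest-neighbor test can be sketched in a few lines. This is a minimal illustration, not Voorhees' original procedure: it assumes documents are already given as sparse term-weight dictionaries (for example tf.idf weights), and the function names `cosine` and `nearest_neighbor_test` are hypothetical. It reports the average fraction of relevant documents among the k nearest neighbors of each relevant document.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest_neighbor_test(vectors, relevant, k=5):
    """Average fraction of relevant documents among the k nearest
    neighbors of each relevant document (assumes at least 2 documents)."""
    fractions = []
    for r in relevant:
        others = [i for i in range(len(vectors)) if i != r]
        # Sort the other documents by decreasing similarity to document r.
        others.sort(key=lambda i: cosine(vectors[r], vectors[i]), reverse=True)
        top = others[:k]
        fractions.append(sum(1 for i in top if i in relevant) / len(top))
    return sum(fractions) / len(fractions)


# Toy data (hypothetical): documents 0 and 1 share terms, 2 and 3 share terms.
vectors = [
    {"a": 1.0, "b": 1.0},
    {"a": 1.0, "b": 1.0, "c": 1.0},
    {"c": 1.0, "d": 1.0},
    {"d": 1.0, "e": 1.0},
]
print(nearest_neighbor_test(vectors, relevant={0, 1}, k=1))
```

A value close to 1 suggests that relevant documents tend to have relevant neighbors under the chosen similarity measure.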

Voorhees’ nearest-neighbor test (Kurland ’06)

• The KL divergence between language models of documents is used for the similarity measure

Voorhees’ nearest-neighbor test applied to the result list of the n highest ranked documents (Raiber&Kurland ’12)

• The KL divergence between language models of documents is used for the similarity measure

The connection between the cluster hypothesis and cluster-based retrieval effectiveness

• “The extent to which the cluster hypothesis characterized a collection seemed to have little effect on how well cluster searching performed as compared to a sequential search of the collection.” (Voorhees ’85)

• There is (high) correlation between the extent to which the nearest-neighbor cluster hypothesis holds and the effectiveness of cluster-based document retrieval. (Na et al. ’08)

• One potential reason for the contradicting findings: completely different cluster-based retrieval methods have been used

The density-based cluster hypothesis test (El-Hamdouchi and Willett ’87)

• The test value is the ratio between the number of postings in the index (i.e., the total number of different terms used in documents) and the size of the vocabulary
• There is also a weighted version
• The test was empirically shown to be more correlated than the overlap and nearest-neighbor tests with the relative improvement posted by cluster-based retrieval over document-based retrieval
  • Nearest-neighbor clusters (Griffiths et al. ’85) were used
  • Retrieval performance was measured by recall at some cutoff

Query-sensitive similarity measures (Tombros&van Rijsbergen ’01)

• Claim: the cluster hypothesis should hold for every collection; it is the inter-document similarity measure that needs to be adjusted so that the hypothesis holds
• Heretofore, all inter-document similarities were query-independent
• Idea: bias the inter-document similarity measure to emphasize relations between the documents and the query
  $sim(d_1, d_2 \mid q) \triangleq \cos(d_1, d_2) \cdot \cos(c, \vec{q})$; $c_i = \frac{1}{2}(d_{1;i} + d_{2;i})$, where the i’th term in the vocabulary is common to $d_1$ and $d_2$, and $d_{x;i}$ is its weight in $d_x$
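The query-sensitive measure, the product of the document-document cosine and the cosine between the query and the vector of terms common to both documents, can be illustrated as follows. This is a sketch under assumptions: vectors are sparse dictionaries of term weights, and the names `cosine` and `query_sensitive_sim` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def query_sensitive_sim(d1, d2, q):
    """cos(d1, d2) * cos(c, q), where c averages the weights of the
    terms that d1 and d2 have in common."""
    common = {t: 0.5 * (d1[t] + d2[t]) for t in d1.keys() & d2.keys()}
    return cosine(d1, d2) * cosine(common, q)


# Toy vectors (hypothetical): the two documents share only the term "bus".
d1 = {"bus": 1.0, "school": 1.0}
d2 = {"bus": 1.0, "taxi": 1.0}
print(query_sensitive_sim(d1, d2, {"bus": 1.0}))    # shared term matches the query
print(query_sensitive_sim(d1, d2, {"truck": 1.0}))  # shared term is unrelated to the query
```

The same document pair is considered similar for one query and dissimilar for another, which is the intended bias.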

Query-sensitive similarity measures (Tombros&van Rijsbergen ’01)

Nearest neighbor test with 5 nearest neighbors

Using the cluster hypothesis to induce clustering

• The optimum clustering framework (Fuhr et al. 2012)
• Basic principle: documents that are co-relevant to many queries should be clustered together
• Definitions for the expected recall and precision of a clustering based on co-relevance
• Well known clustering methods can be viewed as based on principles of the framework
• The framework was shown to provide a more effective internal clustering quality criterion than commonly used alternatives
  • In terms of correlation to ground truth

Alternative cluster hypothesis test (Smucker&Allan ’09)

• Claim: The nearest neighbor test is insufficient for query-biased similarities
• The nearest-neighbor test is a good measure of local clustering
• A graph-based normalized mean reciprocal distance measure

The cluster hypothesis test for entity retrieval (Raviv et al. SIGIR 2013)

• Check out the poster of Hadas Raviv
• The main challenge: defining similarities between entities

Using clusters of similar documents for document retrieval

• Visualizing results
• Using clusters to select documents
• Using clusters to enrich document representations
  • Using topic models to enrich document representations

Types of document clusters

• Hard vs. soft
  • Hard clustering: A document belongs to a single cluster
    • Partitioning (e.g., K-Means) vs. hierarchical agglomerative clustering (single link, complete link, average link, Ward’s criterion)
  • Soft clustering: A document can belong to, or be associated with, several clusters
    • Overlapping nearest-neighbor clusters: for each document we construct a cluster that contains the document and its k nearest neighbors
    • Topic models (more details later)
• Offline (query-independent)
  • Created from all documents in the corpus
  • Help to address recall issues with the initial search (?)
  • Efficiency issues
    • Large scale and dynamic corpora
• Query specific (Preece ’73, Willett ’85)
  • Created from the documents most highly ranked by an initial search
  • Used either for visualization of results or for automatic re-ranking of the initial result list
  • Drawback(?): dependence on the effectiveness of the initial search

Cluster-based search results visualization

• The scatter-gather system (Cutting et al. ’92, Hearst&Pedersen ’96)

• Browsing strategies for cluster-based result interfaces (Leuski ’01)

• Interactive retrieval using a cluster-based interface (Leuski&Allan ’04)

• Interactive exploration of corpora based on inter-document similarities (Smucker ’08)

Challenges to address

• Fast online creation of clusters
  • Cutting et al. ’92, Zamir&Etzioni ’98
• Automatic labeling of clusters
  • Treeratpituk&Callan ’06 (agglomerative clusters)
  • Mei et al. ’07 (topic models)

Using document clusters for ad hoc retrieval

• The user needn’t be aware of the fact that clustering was performed
• Clusters often serve one of two roles (or both) (Kurland&Lee ’04)
  • Document selection
  • Enriching (“smoothing”) document representations

Using offline-created clusters for document selection (Kurland ’06)

Query = {truck, bus}

d1=school bus, classes, teachers

d2=school , classes, teachers, class

d3=bus, taxi, boat, bike

d4=taxi, boat, truck, scooter

d5=boat, horse, taxi, bike, scooter

d6=home, house, kids, floor

$sim(x_i, x_j) = |x_i \cap x_j|$

[Figure: documents d1 to d6 grouped into clusters c1, c2, and c3. The figure contrasts two rankings of the documents, "Rank using documents" (based on sim(q, d)) and "Rank using clusters and docs" (based on sim(q, C) and sim(d, C)).]

Using clusters for document selection

• Given a query $q$ and a list of document clusters $Cl$
  • $Cl$ can be a set of offline-created or query-specific clusters
• Rank the clusters $c$ in $Cl$ using a query-cluster similarity measure, or any other approach:
  • $score(c; q) \triangleq sim(q, c)$
  • A key estimation issue which will be further discussed
• Transform the cluster ranking to document ranking

Transforming cluster ranking to document ranking

• Strategy #1 (originally termed “cluster-based retrieval” by Jardine&Rijsbergen (’71))

• Replace each cluster with its constituent documents (omitting repeats)

• Within-cluster document ranking is based on the initial document scores which were assigned in response to the query or on the similarity between the document and the cluster centroid

• Jardine&Rijsbergen ’71, Croft ’80, Voorhees ’85, Willett ’85, Liu&Croft ’04, Kurland&Lee ’06, Kurland ’08, Kurland&Domshlak ’08, Liu&Croft ’08

The CQL method (Liu&Croft ’04)
(Example for strategy #1 using query-specific clusters)

$score(c; q) \triangleq sim(q, c)$; similarity is measured using language models
A cluster is represented by the concatenation of its constituent documents

Transforming cluster ranking to document ranking (cont.)

• Strategy #2
  • Rank all the documents in the top-retrieved clusters using some criterion (Voorhees ’85)
  • Examples to follow
• Strategy #3
  • Traverse the clustering dendrogram until finding the cluster with the best match to the query or using any other stopping criterion (Jardine&Rijsbergen ’71, Croft ’80, Voorhees ’85, Griffiths et al. ’86)
  • Mixed results with respect to whether cluster-based retrieval is consistently more effective than document retrieval
  • Cluster-based retrieval was shown to be more effective in terms of precision (Jardine&Rijsbergen ’71, Croft ’80)
  • Bottom up search was shown to be more effective than top-down (Croft ’80)

Algorithmic framework (Kurland&Lee’04, ‘09)

(Example for strategy #2)

$Cl$ is the set of nearest-neighbor clusters created from all documents in the corpus (cf., Griffiths et al. ’86)

Given query $q$ and $N$ (the number of docs to retrieve):
1. For each document $d$,
   - Choose $Facets(q, d) \subseteq Cl$
   - Score $d$ by a weighted combination of $sim(q, d)$ and the $sim(q, c)$’s for all $c \in Facets(q, d)$
2. Set $TopDocs(N)$ to the rank-ordered list of $N$ top scoring documents
3. Optional: re-rank $d \in TopDocs(N)$ by $sim(q, d)$
4. Return $TopDocs(N)$

The Set-Select algorithm (Kurland&Lee ’04, ’09; cf. Voorhees ‘85)

• Instantiation of the framework:
  $Facets(q, d) = \{c : d \in c,\ c \in TopClusters_q(m)\}$
  $Score(d) = sim(q, d) \cdot \delta[\,|Facets(q, d)| > 0\,]$
  $TopClusters_q(m)$ is the set of $m$ clusters that are the most similar to the query

• The procedure
  Rank only documents in top retrieved clusters using $sim(q, d)$

The Bag-Select Algorithm (Kurland&Lee ’04, ‘09)

• Instantiation of the framework:
  $Facets(q, d) = \{c : d \in c,\ c \in TopClusters_q(m)\}$
  $Score(d) = sim(q, d) \cdot |Facets(q, d)|$

• The procedure
  Rank only documents in top retrieved clusters using $sim(q, d)$ × (# of top clusters $d$ belongs to)
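The two instantiations differ only in how the number of top-retrieved clusters containing a document enters its score. The sketch below illustrates this; it assumes that query-document and query-cluster similarities have already been computed, and all function names and the toy data are hypothetical rather than taken from the original papers.

```python
def top_clusters(cluster_sim, m):
    """The m clusters that are the most similar to the query."""
    ranked = sorted(cluster_sim, key=cluster_sim.get, reverse=True)
    return set(ranked[:m])


def facets(doc, clusters, top):
    """Top-retrieved clusters that contain the document."""
    return {c for c in top if doc in clusters[c]}


def set_select(doc_sim, clusters, cluster_sim, m):
    """Score(d) = sim(q, d) if d is in at least one top cluster, else 0."""
    top = top_clusters(cluster_sim, m)
    return {d: s * (1 if facets(d, clusters, top) else 0)
            for d, s in doc_sim.items()}


def bag_select(doc_sim, clusters, cluster_sim, m):
    """Score(d) = sim(q, d) * (number of top clusters containing d)."""
    top = top_clusters(cluster_sim, m)
    return {d: s * len(facets(d, clusters, top))
            for d, s in doc_sim.items()}


# Toy data (hypothetical): overlapping clusters and precomputed similarities.
clusters = {
    "c1": {"d1", "d2", "d40"},
    "c2": {"d1", "d2"},
    "c3": {"d2"},
    "c4": {"d7"},
}
cluster_sim = {"c1": 0.9, "c2": 0.8, "c3": 0.7, "c4": 0.1}
doc_sim = {"d1": 0.5, "d2": 0.4, "d40": 0.6, "d7": 0.9}

print(set_select(doc_sim, clusters, cluster_sim, m=3))
print(bag_select(doc_sim, clusters, cluster_sim, m=3))
```

In this toy setting d7 has the highest query similarity but is filtered out by both methods because its only cluster is not among the top three, while Bag-Select promotes d2 because it belongs to all three top clusters.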

Set-Select, Bag-Select (Kurland&Lee ’04, ‘09)

[Figure: a query q, the top-retrieved clusters c1, c2, and c3, and the documents d1, d2, and d40. Facets(q, d) for each document is the set of top-retrieved clusters that contain it.]

Set-Select: Score(d) = sim(q, d) for each of d1, d2, and d40.

Bag-Select: Score(d) = |Facets(q, d)| * sim(q, d); in the example the multipliers are 1, 2, and 3, according to the number of top-retrieved clusters that contain the document.

Empirical Results (Kurland&Lee‘09)

[Figure: MAP (y-axis, 15% to 29%) of the document-based baseline, Set-Select ("Set-s"), and Bag-Select ("Bag-s") on AP89, AP88+89, and LA+FR. ‘*’ marks a statistically significant difference with the baseline.]

The cluster ranking challenge

• $score(c; q) \triangleq sim(q, c)$
• How do we represent cluster c?
  • Binary term vector (Jardine&van Rijsbergen ’71, van Rijsbergen ’74)
  • A centroid of the vectors representing c’s constituent documents (Jardine&Rijsbergen ’71, Croft ’80, Voorhees ’85, El-Hamdouchi&Willett ’87, Liu&Croft ’08)
    • Cosine, for example, serves for the similarity measure in the vector space
  • The big document that results from concatenating c’s constituent documents (Kurland&Lee ’04, Liu&Croft ’04)
    • Language-model-based similarity estimates

Cluster representations (Liu&Croft ’08)

• The best representation for a cluster, among those studied, was the geometric mean of the language models of its constituent documents
• Seo&Croft ’10 provide arguments based on information geometry for the effectiveness of the geometric-mean-based representation

Some more cluster ranking methods

• Using the min/max query-similarity score of a document in the cluster (Leuski ’01, Shanahan et al. ’03, Liu&Croft ’08)

• Document-cluster graphs (Kurland&Lee ’06; more details later)

• Variance of document retrieval scores in the cluster (Liu&Croft ’06; more details later)

• Aggregating measures of properties of a cluster (Kurland&Domshlak ’08)

• Using the similarity between the cluster and an expanded query form (Liu&Croft ’04, Wei&Croft ’06; more details later)

The optimal cluster

• The percentage of relevant documents in cluster c of k documents for which $score(c; q)$ is the highest is the precision@k attained by cluster-based retrieval with strategy #1.

Using nearest neighbor query-specific clusters of 5 documents: (Kurland&Domshlak ’08)

The optimal cluster (Liu&Croft’06)

The optimal cluster (cont.)

• Jardine and van Rijsbergen ’71 were the first to report the existence of an optimal (offline created) cluster

• This was later re-asserted by, for example, Hearst&Pedersen (’96), Tombros et al. (’02), Kurland (’06) and Liu&Croft (’06)

• Offline-created optimal clusters contain a smaller percentage of relevant documents than optimal query-specific clusters (Tombros et al. ’02)

• There are several approaches to estimating the potential retrieval merits of finding optimal clusters (Tombros et al. ’02)

A probabilistic graph-based approach for ranking clusters (Kurland ’08, Kurland&Krikon ’11)

• What is the probability that this cluster is relevant to this query? (cf., Croft ’80)

• The ClustRanker method: $score(c; q) \triangleq \lambda\, p(q \mid c)\, p(c) + (1 - \lambda) \sum_{d \in c} p(q \mid d)\, p(c \mid d)\, p(d)$

• p(d) and p(c) are estimated based on graphs where documents (clusters) are vertices, edge-weights represent inter-item similarities, and the PageRank score of item x serves as an estimate for p(x)

ClustRanker (Kurland ’08, Kurland&Krikon ’11)
(Using nearest-neighbor query-specific clusters for re-ranking)

A (much) more effective cluster ranking method (Raiber&Kurland SIGIR 2013)

• Uses the Markov Random Field framework
• Attend Fiana Raiber’s talk: Ranking Document Clusters using Markov Random Fields

Selective cluster-based retrieval

• Griffiths et al. (’86) observed that cluster-based retrieval and document-based retrieval can be of the same effectiveness

• But, different relevant documents were retrieved in the two cases

• Some previous work on selecting a retrieval strategy per query

  • Croft&Thompson ’84
  • Amati et al. ’04
  • Balasubramanian&Allan ’10

Selective cluster-based retrieval (Liu&Croft ’06)

• A “good” cluster is one which (1) exhibits high similarity to the query and (2) contains documents with query-similarity values that do not deviate much from that of the cluster
  • Kurland et al. (’12) found some contrasting evidence with respect to the deviation
• For queries with “good” clusters, perform cluster-based retrieval; for the other queries, perform document-based retrieval

Selective cluster-based retrieval (Liu&Croft ’06)

Intermediate summary

• We ranked document clusters
• We transformed the cluster ranking to document ranking
• Some observations:
  • There are clusters that contain a very high percentage of relevant documents (whether static offline-created clusters or query-specific clusters); the optimal cluster
  • Optimal query-specific clusters contain a higher percentage of relevant documents than static clusters (Tombros et al. ’02)
  • Using small clusters results in more effective retrieval (Griffiths et al. ’86, Tombros et al. ’02, Kurland&Lee ’04)
  • Cluster representation is crucial
    • A geometric-mean-based representation seems to be the most effective among those proposed (Liu&Croft ’08, Seo&Croft ’10, Kurland&Krikon ’11)
  • The performance of cluster-based retrieval can be much better than that of document-based retrieval and that of using query expansion

Using clusters to enrich (smooth) document representations

• Clusters provide (corpus) context for documents
• Enrich the document representation using information induced from similar documents
• Example: use Rocchio’s method to smooth the vector representing the document with those representing similar documents (Singhal&Pereira ’99)

$d_{new} = d_{old} + \frac{1}{k} \sum_{i=1}^{k} d_i$; $d_1, \ldots, d_k$ are d’s nearest neighbors in the vector space
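The smoothing rule, adding to a document vector the average of the vectors of its k nearest neighbors, can be sketched as follows. This is a minimal illustration that assumes sparse term-weight dictionaries and cosine similarity for finding neighbors; the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def smooth_with_neighbors(index, vectors, k):
    """Return d_old + (1/k) * (sum of the vectors of the k nearest neighbors)."""
    d = vectors[index]
    others = [i for i in range(len(vectors)) if i != index]
    others.sort(key=lambda i: cosine(d, vectors[i]), reverse=True)
    smoothed = dict(d)
    for i in others[:k]:
        for term, weight in vectors[i].items():
            smoothed[term] = smoothed.get(term, 0.0) + weight / k
    return smoothed


# Toy vectors (hypothetical): document 0 is closest to document 1.
vectors = [{"a": 1.0}, {"a": 1.0, "b": 1.0}, {"c": 1.0}]
print(smooth_with_neighbors(0, vectors, k=1))
```

After smoothing, document 0 gains weight for term "b", which it never contained, so it can now match queries that use that term.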

Similar document-expansion methods

• Lavrenko (’00) employed a nearest-neighbor smoothing method for the query model
• Ogilvie (’00) and Kurland&Lee (’04) smoothed a document language model with language models of its nearest neighbors
• Tao et al. (’06) created pseudo counts for terms in a document that are smoothed using the counts of terms in similar documents
• Yi&Allan (’09) and Efron et al. (’12) use the (weighted) language models of nearest-neighbors of a document to smooth the document language model
  • Efron et al. (’12) use this document model for Twitter search

A quick recap of the language modeling approach (Ponte&Croft ’98)

$q$: query, $d$: document, $C$: corpus of documents, $w$: term

$tf(w \in d)$ is the number of times $w$ appears in $d$

A maximum likelihood estimate: $p_{MLE}(w \mid d) = \frac{tf(w \in d)}{\sum_{w' \in d} tf(w' \in d)}$

Jelinek-Mercer smoothing: $p_{JM}(w \mid d) = (1 - \lambda)\, p_{MLE}(w \mid d) + \lambda\, p_{MLE}(w \mid C)$

Dirichlet smoothing: $\lambda = \frac{\mu}{|d| + \mu}$

(1) The query likelihood model (Song&Croft ’99): $score(d; q) \triangleq p(q \mid d) = \prod_{q_i \in q} p(q_i \mid d)$

(2) The KL retrieval method (Lafferty&Zhai ’01): $score(d; q) \triangleq -\sum_{q_i \in q} p(q_i \mid q) \log \frac{p(q_i \mid q)}{p(q_i \mid d)}$
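The estimates in this recap can be written as a short sketch. It assumes that documents and the corpus are given as term-count tables, that lambda weights the corpus model (so Dirichlet smoothing corresponds to lambda = mu / (|d| + mu)), and that every query term occurs somewhere in the corpus; the function names are hypothetical.

```python
import math
from collections import Counter


def p_mle(w, counts):
    """Maximum likelihood estimate of term w from a table of term counts."""
    total = sum(counts.values())
    return counts.get(w, 0) / total if total else 0.0


def p_smoothed(w, doc, corpus, lam):
    """Jelinek-Mercer smoothing: (1 - lam) * p_MLE(w|d) + lam * p_MLE(w|C)."""
    return (1 - lam) * p_mle(w, doc) + lam * p_mle(w, corpus)


def dirichlet_lambda(doc, mu):
    """Document-dependent lambda that yields Dirichlet smoothing."""
    return mu / (sum(doc.values()) + mu)


def log_query_likelihood(query, doc, corpus, lam):
    """log p(q|d) as the sum of the log smoothed term probabilities.
    Assumes every query term occurs in the corpus (otherwise log(0))."""
    return sum(math.log(p_smoothed(w, doc, corpus, lam)) for w in query)


# Toy counts (hypothetical).
doc = Counter("a a b".split())
corpus = Counter("a a b c".split())
lam = dirichlet_lambda(doc, mu=3)
print(lam, log_query_likelihood(["a", "c"], doc, corpus, lam))
```

Working in log space avoids numeric underflow for long queries while preserving the ranking induced by the product.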

Using pLSA for retrieval (Hofmann ’99)

• pLSA (probabilistic latent semantic analysis) is a “probabilistic successor” of LSA (Deerwester et al. ’90), and an implementation of the aspect model (Hofmann et al. ’97)
  • Additional topic models (LDA Blei et al. ’03, Pachinko Allocation Model Li&McCallum ’06)
• The generative story of pLSA:
  • Select a document $d$ with probability $P(d)$
  • Pick a latent class (topic) $z$ with probability $P(z \mid d)$
  • Generate a word $w$ with probability $P(w \mid z)$

Using pLSA for retrieval (Hofmann ’99)

• $P(d, w) = P(d)\, P(w \mid d)$
• $P(w \mid d) = \sum_{z \in Z} P(w \mid z)\, P(z \mid d)$
• Data likelihood: $L = \sum_{d \in D} \sum_{w \in W} tf(w \in d) \log P(d, w)$
• Maximizing data likelihood using tempered EM
  • Note the potential metric divergence problem (Azzopardi et al. ’03)
• Using the topic model for retrieval
  • Smoothing a document language model (more details later)
  • Folding the query into the lower dimensional space
    • Vector-based representation, cosine measure
• Retrieval performance is better than that attained by LSA and the cosine method; very small collections are used

Cluster-based document language models (Liu&Croft ’04)

The CBDM model: $p(w \mid d) = \lambda_1\, p_{MLE}(w \mid d) + \lambda_2\, p_{MLE}(w \mid c) + \lambda_3\, p_{MLE}(w \mid C)$; $\lambda_1 + \lambda_2 + \lambda_3 = 1$

Let $c$ be the single hard cluster to which $d$ belongs; $c$ is represented by the concatenation of its constituent documents

Using offline K-Means clustering (K is the number of clusters)
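The three-way interpolation of the CBDM model can be illustrated with a short sketch. It assumes that the document, its single hard cluster (the concatenation of the cluster's documents), and the corpus are all given as term-count tables; the function names and the toy counts are hypothetical.

```python
def p_mle(w, counts):
    """Maximum likelihood estimate of term w from a table of term counts."""
    total = sum(counts.values())
    return counts.get(w, 0) / total if total else 0.0


def cbdm(w, doc, cluster, corpus, lambdas):
    """lambda1 * p(w|d) + lambda2 * p(w|c) + lambda3 * p(w|C);
    the three weights are assumed to sum to 1."""
    l1, l2, l3 = lambdas
    return l1 * p_mle(w, doc) + l2 * p_mle(w, cluster) + l3 * p_mle(w, corpus)


# Toy counts (hypothetical): the cluster contains the document plus others.
doc = {"a": 1, "b": 1}
cluster = {"a": 1, "b": 1, "c": 2}
corpus = {"a": 2, "b": 2, "c": 4, "d": 2}
lambdas = (0.5, 0.3, 0.2)

# Term "c" does not occur in the document but receives probability mass
# from the cluster and the corpus.
print(cbdm("c", doc, cluster, corpus, lambdas))
```

Because each component is a probability distribution over terms and the weights sum to 1, the smoothed model is itself a distribution over the vocabulary.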

Topic-based document language models (Wei&Croft ’06)

• Apply Latent Dirichlet Allocation (LDA; Blei et al. ’03) to induce topics from the corpus
• Use the resultant topics to smooth document language models
  • Using the KL divergence between a document LM and that of the query for ranking
• A generalization of the CBDM model
• Earlier work by Azzopardi et al. (’04) used LDA and pLSA to induce document prior distributions
• Lu et al. ’11 found that the retrieval performance of using LDA and pLSA (Hofmann ’99) was comparable

$p(w \mid d) = \lambda_1\, p_{MLE}(w \mid d) + \lambda_2\, p_{LDA}(w \mid d) + \lambda_3\, p_{MLE}(w \mid C)$; $\lambda_1 + \lambda_2 + \lambda_3 = 1$

$p_{LDA}(w \mid d) = \sum_{i=1}^{k} p(w \mid z_i, \hat{\phi})\, p(z_i \mid \hat{\theta}, d)$

Topic-based document language models (Wei&Croft ’06)

A study of using topic-based document language models (Yi&Allan ’09)

• Using more sophisticated topic models (e.g, Pachinko Allocation Model Li&McCallum ’06) doesn’t yield better retrieval performance (e.g., than that attained by LDA)

• Using nearest-neighbor smoothing results in performance that is as good as that of using topic models

• Pseudo-feedback-based query expansion is more effective than using topic models (either in an offline fashion or in a query-specific fashion)

$p_{TM}(w \mid d) \triangleq \sum_{t_i \in T} p_{TM}(w \mid t_i)\, p_{TM}(t_i \mid d)$

$T$ is the set of topics

$p'(w \mid d) \triangleq \lambda_1\, p_{MLE}(w \mid d) + \lambda_2\, p_{TM}(w \mid d) + \lambda_3\, p_{MLE}(w \mid C)$

Score-based smoothing (Kurland&Lee ’04 ‘10, Kurland ’09)

$score(d; q) \triangleq p(q \mid d) = \sum_{c \in Cl} p(q \mid d, c)\, p(c \mid d)$

Estimate $p(q \mid d, c)$ using $\lambda\, p(q \mid d) + (1 - \lambda)\, p(q \mid c)$ =>

The interpolation method: $score(d; q) \triangleq \lambda\, p(q \mid d) + (1 - \lambda) \sum_{c \in Cl} p(q \mid c)\, p(c \mid d)$

Using a single term $w$ for $q$ we get a cluster/topic-based document language model

Use $\exp(-KL(p(\cdot \mid x)\ ||\ p(\cdot \mid y)))$, where $p(\cdot \mid z)$ is the unigram language model induced from $z$, as an estimate for $p(x \mid y)$
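The interpolation method combines a document's own query likelihood with the query likelihoods of the clusters it is associated with. A minimal sketch follows; it assumes the three estimates p(q|d), p(q|c), and p(c|d) have already been computed (for example via the exponent of a negative KL divergence), and the function name and toy values are hypothetical.

```python
def interpolation_scores(p_q_d, p_q_c, p_c_d, lam):
    """score(d; q) = lam * p(q|d) + (1 - lam) * sum_c p(q|c) * p(c|d).

    p_q_d: dict doc -> p(q|d)
    p_q_c: dict cluster -> p(q|c)
    p_c_d: dict doc -> dict cluster -> p(c|d); clusters not listed for a
           document contribute nothing to its score
    """
    scores = {}
    for d, own in p_q_d.items():
        cluster_part = sum(p_q_c[c] * weight
                           for c, weight in p_c_d.get(d, {}).items())
        scores[d] = lam * own + (1 - lam) * cluster_part
    return scores


# Toy estimates (hypothetical).
p_q_d = {"d1": 0.2, "d2": 0.1}
p_q_c = {"c1": 0.4, "c2": 0.0}
p_c_d = {"d1": {"c1": 1.0}, "d2": {"c1": 0.5, "c2": 0.5}}
print(interpolation_scores(p_q_d, p_q_c, p_c_d, lam=0.5))
```

Setting lam to 1 recovers plain document-based scoring, while lam equal to 0 scores a document only through its clusters.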

The interpolation method (Kurland&Lee ’04, ‘09)

• Ranking the corpus using nearest-neighbor offline-created clusters

‘b’ and ‘I’ mark statistically significant differences with the baseline and interpolation, respectively

Interpolation method (Kurland&Lee ’04, ‘09) –comparison between using nearest neighbor (NN) clusters and K-Means clusters (Clusters are created offline)

El-Hamdouchi and Willett (’89) found that when using cluster ranking with offline created clusters
(i) Using nearest neighbor clusters (of two documents) resulted in better performance than that of using various hard clustering methods (the same as Griffiths et al.’s (’86) findings);
(ii) Using small agglomerative clusters yielded better performance than using larger clusters; and,
(iii) Complete-link agglomerative clustering was more effective than single-link and Ward’s method for cluster-based retrieval (the same as Voorhees’ (’85) findings)

The interpolation method (Kurland ’09)

• Using nearest-neighbor query-specific clusters to re-rank an initially retrieved list

                  AP            TREC8         WSJ
                  p@5   p@10    p@5   p@10    p@5   p@10
Init. Rank.       .457  .432    .500  .456    .536  .484
CQL (sim(q,c))    .448  .418    .500  .432    .504  .454
Bag-Select        .507  .494    .532  .514    .548  .488
Interpolation     .537  .498    .576  .496    .592  .508

‘*’ marks a statistically significant difference with the initial ranking (Init. Rank.)

The interpolation method (Kurland ’09)

• Comparison with pseudo-feedback-based query expansion
• Using nearest-neighbor query-specific clusters

The interpolation method with different query-specific clustering algorithms (Kurland ’09)

• The findings with respect to (i) nn-LM and nn-VS being superior to hard clustering schemes, and (ii) agg-comp’s relative effectiveness, are reminiscent of those of El-Hamdouchi and Willett (’89) who used cluster ranking with clusters created offline

• Tombros et al. (’02) found that average link was better than complete link in terms of query-specific optimal cluster search

Comparison of cluster-based retrieval methods (Raiber&Kurland ’12)

Cluster types and their effectiveness for smoothing document language models (or scores)

• Nearest-neighbor clusters (with a small number of nearest neighbors) >= topic models > hard clusters
• Note: The first to suggest the use of nearest-neighbor overlapping clusters were Griffiths et al. (’86)
  • Used offline-created clusters (of two documents)
  • There is much recent evidence for the effectiveness of using nearest-neighbor query-specific clusters (Kurland&Lee ’06, Liu&Croft ’08, Kurland ’09)

Integrating query-specific and offline-created clusters

• Meister et al. (’09) used the interpolation algorithm (Kurland&Lee ’04) with both offline and query-specific clusters
  • Small (but consistent) performance improvements over using only offline-created or query-specific clusters
• Lee et al. (’01) use a “query-specific” view of static (offline-created) clusters

Cluster-based fusion

Fusion of retrieved lists

The task
Given query $q$ and document lists $L_1, \ldots, L_m$ that were retrieved in response to $q$ from corpus $D$, produce a single list of results

Motivation
Integrating various information sources for retrieval (Croft ’00)
(e.g., document representations, query representations, retrieval models)

[Figure: three retrieved lists L1, L2, and L3 over the documents d1 to d4 are fused into a single list d1, d2, d3.]

A common fusion principle
Documents that are highly ranked in many of the lists are rewarded

(1) overlap of relevant documents in the lists is higher than that of non-relevant documents (Lee ’97)

(2) chorus and skimming effects (Saracevic&Kantor ’88, Vogt&Cottrell ’99)

$d$: document; $d \in L_i$: $d$ is a member of list $L_i$

$S_{L_i}(d)$: the positive retrieval score of $d$ in $L_i$ if $d \in L_i$; 0 otherwise

$F(d)$: "standard" fusion score of $d$ that depends only on retrieval scores or ranks

Examples:

$F_{CombMNZ}(d) \triangleq \#\{L_i : d \in L_i\} \cdot \sum_{L_i} S_{L_i}(d)$ (Fox&Shaw ’94, Lee ’97)

$F_{Borda}(d) \triangleq \sum_{L_i} \#\{d' \in L_i : S_{L_i}(d') \leq S_{L_i}(d)\}$ (Borda 1781, Aslam&Montague ’01)
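The two standard fusion functions can be sketched directly from their definitions. The code assumes each retrieved list is a dictionary that maps a document to its positive retrieval score in that list; the function names and the toy lists are hypothetical.

```python
def comb_mnz(doc, lists):
    """(# of lists containing doc) * (sum of doc's scores over those lists)."""
    scores = [lst[doc] for lst in lists if doc in lst]
    return len(scores) * sum(scores)


def borda(doc, lists):
    """For each list containing doc, count the documents in that list whose
    score does not exceed doc's score, and sum the counts over the lists."""
    total = 0
    for lst in lists:
        if doc in lst:
            total += sum(1 for s in lst.values() if s <= lst[doc])
    return total


# Toy retrieved lists (hypothetical scores).
L1 = {"d1": 3.0, "d2": 2.0, "d3": 1.0}
L2 = {"d2": 2.0, "d4": 1.0}
L3 = {"d3": 5.0, "d1": 1.0}
lists = [L1, L2, L3]

for d in ["d1", "d2", "d3", "d4"]:
    print(d, comb_mnz(d, lists), borda(d, lists))
```

Note that CombMNZ is sensitive to the scale of the retrieval scores, so in practice scores are usually normalized per list before fusion; the sketch omits this step.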

But …

Different retrieved lists might contain different relevant documents

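e.g., Das-Gupta and Katzer ’83, Griffiths et al. ’86, Soboroff et al. ’01, Beitzel et al. ’03

[Figure: three lists L1, L2, and L3 of relevant (r) and non-relevant (nr) documents, in which different lists contain different relevant documents, are fused into a single list.]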

A cluster-based fusion approach (Kozorovitsky&Kurland ’11)

Let similar documents across the lists provide relevance-status support to each other

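• cluster hypothesis
• utilize information induced from clusters of similar documents that are created across the lists

[Figure: relevant (r) and non-relevant (nr) documents from the lists L1, L2, and L3 are fused; similar relevant documents across the lists support each other, and the fused list places the relevant documents at the top.]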

Fusion model (Kozorovitsky&Kurland ’11)

$D_{init}$ - the set of documents in the lists; $Cl(D_{init})$ - their clusters

Starting point: use clusters as proxies for documents
$p(d \mid q) = \sum_{c \in Cl(D_{init})} p(d \mid c, q)\, p(c \mid q)$ (cf., Kurland&Lee ’04)

Estimate: $\hat{p}(d \mid c, q) = \lambda\, p(d \mid c) + (1 - \lambda)\, p(d \mid q)$

Resultant fusion function: $F_{ClustFuse}(d) \triangleq (1 - \lambda)\, p(d \mid q) + \lambda \sum_{c \in Cl(D_{init})} p(c \mid q)\, p(d \mid c)$

ClustFuse (Kozorovitsky&Kurland ’11)

Document d is rewarded based on its
• standard fusion score
  • reflects the extent to which d is highly ranked in many of the lists
• similarity to clusters that contain documents that are highly ranked in many of the lists

• $\lambda = 0$ amounts to the standard fusion method (F) that ClustFuse incorporates

$F_{ClustFuse}(d) \triangleq (1 - \lambda)\, p(d \mid q) + \lambda \sum_{c \in Cl(D_{init})} p(c \mid q)\, p(d \mid c)$

$p(d \mid q) = \frac{F(d; q)}{\sum_{d' \in D_{init}} F(d'; q)}$

$p(c \mid q) = \frac{\prod_{d \in c} F(d; q)}{\sum_{c' \in Cl(D_{init})} \prod_{d' \in c'} F(d'; q)}$

$p(d \mid c) = \frac{\sum_{d' \in c} sim(d', d)}{\sum_{d_i \in D_{init}} \sum_{d' \in c} sim(d', d_i)}$

cf., the interpolation model (Kurland&Lee ’04, ‘09)
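The ClustFuse score, (1 - lambda) p(d|q) + lambda * sum over clusters of p(c|q) p(d|c), can be sketched with the three estimates derived from standard fusion scores and inter-document similarities. The code assumes positive fusion scores F(d; q), clusters given as sets of documents, and a positive similarity function; the function name and the toy data are hypothetical.

```python
import math


def clust_fuse(fusion_scores, clusters, sim, lam):
    """(1 - lam) * p(d|q) + lam * sum_c p(c|q) * p(d|c), with
    p(d|q): normalized standard fusion score of d,
    p(c|q): normalized product of the fusion scores of c's documents,
    p(d|c): normalized sum of similarities between d and c's documents.
    Assumes positive fusion scores and similarities (no zero normalizers)."""
    docs = list(fusion_scores)
    total = sum(fusion_scores.values())
    p_d_q = {d: fusion_scores[d] / total for d in docs}

    products = [math.prod(fusion_scores[d] for d in c) for c in clusters]
    p_c_q = [p / sum(products) for p in products]

    # support[j][d] = sum of similarities between d and the members of cluster j
    support = [{d: sum(sim(member, d) for member in c) for d in docs}
               for c in clusters]

    fused = {}
    for d in docs:
        cluster_part = sum(
            p_c_q[j] * support[j][d] / sum(support[j].values())
            for j in range(len(clusters))
        )
        fused[d] = (1 - lam) * p_d_q[d] + lam * cluster_part
    return fused


# Toy data (hypothetical): fusion scores, two overlapping clusters, and a
# similarity that is 1 for identical documents and 0.5 otherwise.
F = {"d1": 3.0, "d2": 2.0, "d3": 1.0}
clusters = [{"d1", "d2"}, {"d2", "d3"}]
sim = lambda a, b: 1.0 if a == b else 0.5
print(clust_fuse(F, clusters, sim, lam=0.5))
```

With lam set to 0 the function returns the normalized standard fusion scores, matching the note that this setting amounts to the standard fusion method.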

MAP performance of fusing TREC runs (Kozorovitsky&Kurland ’11)

[Figure: MAP (y-axis, 4 to 10) on trec3 of run1, CombMNZ, ClustFuseCombMNZ, Borda, and ClustFuseBorda, when fusing 3 randomly selected TREC runs (run1 is the best performing among the three). ‘r’ and ‘f’ mark statistically significant differences with run1 and the standard fusion method, respectively.]

Optimal clusters in the fusion setting (Kozorovitsky&Kurland ’11)

• OptCluster is the optimal cluster among all clusters created from all the documents in the 3 runs that are fused (run1, run2, run3)

• OptCluster(runi) is the optimal cluster among clusters created from runi

‘a’, ‘b’ and ‘c’ mark statistically significant differences with run1, run2, and run3, respectively

Cluster-based fusion (Kozorovitsky&Kurland ’11)

• Re-ranking each run using query-specific clusters created from the run, and then fusing the runs (cf. Zhang et al. ’01), yields performance that is inferior to that of using clusters created over all the runs
• The lower the overlap between relevant documents in the runs, the more benefit we gain from applying cluster-based fusion

Cluster-based federated search (Khalman&Kurland ’12)

• Retrieving the lists from disjoint corpora (federated/distributed search)

• Crestani and Wu (’06) showed that in the federated search setting there exist clusters that contain a high percentage of relevant documents

                   p@5    p@10
CORI  init         46.0   42.0
      ClustFuse    53.6   49.8
SSL   init         43.2   41.0
      ClustFuse    50.4   49.0

Cluster-based query expansion

• Treating clusters as pseudo queries (Kurland et al. ’05)
• Using cluster-based (or topic-based) smoothed document language models for both constructing an expanded query form and for ranking (Liu&Croft ’04, Tao et al. ’06, Wei&Croft ’06)

• Constructing a query-expanded form by rewarding top-retrieved documents that are members of many query-specific overlapping clusters (Lee et al. ’08)

• Using top-retrieved clusters instead (or in addition to) top-retrieved documents for constructing an expanded query form (Na et al. ’07, Gelfer Kalmanovich&Kurland ’09)

• Cluster-based query expansion for federated search (Shokouhi et al. ’09)

Cluster-based results diversification

• e.g., Maximal Marginal Relevance (Carbonell&Goldstein ’98)

• Let R be the result list of the documents most highly ranked using $Sim_1(q, d)$

• Let S be the new list we create from R (i.e., re-ranking); the order in which documents are inserted into S is the induced ranking

• Next document to insert into S: $d^{*} \triangleq \arg\max_{d \in R \setminus S} \left[ \lambda\, Sim_1(q, d) - (1 - \lambda) \max_{d_i \in S} Sim_2(d, d_i) \right]$

• A cluster-based approach: estimate $Sim_1(q, d)$ using a cluster ranking approach (He et al. ’11, Raiber&Kurland ’13)
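The greedy MMR selection can be sketched in a few lines of Python. This is a minimal illustration and not the implementation of any cited paper: the function name `mmr_rerank`, the dictionary of query similarities, the pairwise similarity callback, and the toy scores are all assumptions made for the example. When S is empty the redundancy term is taken to be 0, which is also an assumption.

```python
from typing import Callable, Dict, List, Sequence


def mmr_rerank(
    candidates: Sequence[str],
    sim_query: Dict[str, float],
    sim_docs: Callable[[str, str], float],
    lam: float = 0.5,
) -> List[str]:
    """Greedy MMR: repeatedly move the best remaining document from R to S."""
    remaining = list(candidates)  # R \ S
    selected: List[str] = []      # S, in insertion (= ranking) order
    while remaining:
        def mmr_score(d: str) -> float:
            # Redundancy w.r.t. documents already selected (0 when S is empty).
            redundancy = max((sim_docs(d, s) for s in selected), default=0.0)
            return lam * sim_query[d] - (1.0 - lam) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (made up): "a" and "b" are near-duplicates, "c" covers another aspect.
sim_q = {"a": 0.9, "b": 0.85, "c": 0.5}
pair_sim = {frozenset("ab"): 0.95, frozenset("ac"): 0.1, frozenset("bc"): 0.1}
ranking = mmr_rerank(["a", "b", "c"], sim_q, lambda x, y: pair_sim[frozenset((x, y))])
print(ranking)  # ['a', 'c', 'b']
```

With λ = 1 the redundancy term vanishes and the original ranking by $Sim_1$ is returned; in the cluster-based variant only the values in `sim_query` would change.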

Cluster-based results diversification (cont.)

• A second cluster-based approach: cluster R, view the clusters as potential aspects, and pick documents from the clusters (e.g., in a round-robin fashion)

• e.g., Leelanupab et al. ’10
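A round-robin pass over clusters could look like the sketch below. It assumes (this is not stated on the slide) that clusters are already ordered, that documents within each cluster are ordered by retrieval score, and that clusters may overlap, in which case a document is emitted only the first time it is seen. The name `round_robin_diversify` is hypothetical.

```python
from itertools import zip_longest
from typing import List, Sequence

_MISSING = object()  # filler for clusters that run out of documents


def round_robin_diversify(clusters: Sequence[Sequence[str]]) -> List[str]:
    """Take the 1st document of every cluster, then the 2nd of every cluster, etc."""
    ranking: List[str] = []
    seen = set()
    for layer in zip_longest(*clusters, fillvalue=_MISSING):
        for doc in layer:
            if doc is not _MISSING and doc not in seen:
                seen.add(doc)
                ranking.append(doc)
    return ranking


# Three (possibly overlapping) clusters of a result list; data is made up.
print(round_robin_diversify([["d1", "d2", "d5"], ["d3"], ["d4", "d2"]]))
# ['d1', 'd3', 'd4', 'd2', 'd5']
```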

Utilizing relevance feedback using document clusters

• Shanahan et al. ’03 found that the optimal cluster is a very good basis for relevance feedback

• Re-emphasizing claims in older literature (e.g., Jardine&van Rijsbergen ’71, Croft ’80) about the motivation to find good clusters

• Active relevance feedback (Shen&Zhai ’05)

• Diversify the feedback set by picking documents from query-specific clusters of top-retrieved results

• Baseline: asking for feedback for the top-k retrieved documents

• Interactive retrieval

• Ivie ’66, Hearst&Pedersen ’95, Leuski ’01

Using clusters for query-performance prediction (QPP)

• The query-performance prediction task: estimating the effectiveness of a search performed in response to a query in the absence of relevance judgments (Carmel&Yom Tov ’10)

• The clustering tendency of the results is an indicator for search effectiveness (Vinay et al. ’06)

• The extent to which the retrieval scores of documents “respect” the cluster hypothesis is an effective query-performance predictor (Diaz ’07)

On the connection between cluster ranking and query-performance prediction (Kurland et al. ’12)

• Cluster ranking: estimate the probability that a cluster (set of documents) is relevant to a query

• Query performance prediction (QPP): estimate the probability that a result list (ranked list of documents) is relevant to a query

• As it turns out, quite a few QPP and cluster ranking methods are based on the exact same principles

• The geometric mean of retrieval scores in a result list is a high quality performance predictor (Zhou&Croft ’07)

• The geometric mean of retrieval scores in a cluster is an effective criterion for ranking clusters (Liu&Croft ’08, Seo&Croft ’10, Kurland&Krikon ’11)
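The shared principle can be made concrete with a small sketch: the same geometric-mean statistic is applied once to the top of a result list (as a performance predictor) and once to each cluster (as a ranking criterion). The function names, the cutoff `k`, and the toy scores are assumptions for illustration; retrieval scores are assumed to be strictly positive so that the logarithm is defined.

```python
import math
from typing import Dict, List, Sequence


def geometric_mean(scores: Sequence[float]) -> float:
    """Geometric mean of strictly positive scores, computed in log space."""
    if not scores:
        raise ValueError("need at least one score")
    return math.exp(sum(math.log(s) for s in scores) / len(scores))


def predict_performance(result_list: Sequence[str], score: Dict[str, float], k: int) -> float:
    """QPP view: geometric mean of the retrieval scores of the top-k documents."""
    return geometric_mean([score[d] for d in result_list[:k]])


def rank_clusters(clusters: Dict[str, List[str]], score: Dict[str, float]) -> List[str]:
    """Cluster-ranking view: order clusters by the geometric mean of member scores."""
    return sorted(
        clusters,
        key=lambda c: geometric_mean([score[d] for d in clusters[c]]),
        reverse=True,
    )


# Made-up scores: c1 has one strong and one weak member, c2 is uniformly decent.
score = {"d1": 0.8, "d2": 0.1, "d3": 0.4, "d4": 0.4}
print(rank_clusters({"c1": ["d1", "d2"], "c2": ["d3", "d4"]}, score))  # ['c2', 'c1']
```

In this toy example the arithmetic mean would prefer c1 (0.45 vs. 0.4), whereas the geometric mean prefers the cluster whose members all score reasonably well.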

Cluster-based retrieval – intermediate summary

• Using clusters to select documents

• Using clusters to enrich (smooth) document representations

• or topic models

• Offline vs. query-specific clustering

• Soft vs. hard clustering

• The optimal cluster

• Cluster-based fusion and federated search

• Cluster-based query expansion

• Cluster-based results diversification

• Using document clusters to utilize relevance feedback

• The connection between query-performance prediction and cluster ranking

Graph-based methods utilizing inter-document similarities

Graph-based framework for re-ranking (Kurland&Lee '05, '10)

Inspiration: Web Retrieval

Common approach to web retrieval:

• Re-rank an initial retrieved list $D_{init}$ of documents by the degree of centrality (Brin&Page '98, Kleinberg '99)

• Centrality of a document is estimated using explicit hyperlink structure (PageRank, HITS)

Can we use the scoring by centrality approach for ranking non-hypertext documents?

A possible strategy: Structural re-ranking

• Use inter-document similarities to infer links between documents in $D_{init}$

• On the resultant graph (of documents and induced links) define centrality measures and use them as criteria for ranking

How to induce links?

One might suggest: Vector Space Model (VSM) for information representation and cosine as the similarity metric

Erkan&Radev '04: text summarization

• Cosine similarity between sentences

• See the book: Rada Mihalcea and Dragomir Radev. Graph-based natural language processing and information retrieval. Cambridge University Press, 2011.

but …

Inducing Links

[Figure: two example documents. $d_1$ = “Dublin Portland Beijing” has a “flat” term distribution; $d_2$ = “Portland Portland Portland” has a “spiky” term distribution. $d_2$ relevant ⇒ $d_1$ relevant, but $d_1$ relevant ⇏ $d_2$ relevant.]

LMs: $p(d_2 \mid d_1) > p(d_1 \mid d_2)$

VSM: $\cos(d_1, d_2) = \cos(d_2, d_1)$

Zhang et al. (’05) used asymmetric cosine-based edge weights, but these were found by Kurland&Lee (’10) to be somewhat less effective than the language-model-based weights

Generation graphs

For document $o \in D_{init}$:

$TopGen(o) \triangleq$ the $k$ documents $g \in D_{init}$ that yield the highest $p(o \mid g)$

$g \in TopGen(o)$: $g$ is a “generator” of $o$ ($o$ is an offspring of $g$)

The complete graph $G = (D_{init}, D_{init} \times D_{init})$ with edge weights:

$$wt(o \to g) = p(o \mid g)\, \delta[g \in TopGen(o)]$$

The smoothed (Brin&Page '98) complete graph $G^{[\lambda]} = (D_{init}, D_{init} \times D_{init})$ with edge weights:

$$wt^{[\lambda]}(o \to g) = \frac{1 - \lambda}{|D_{init}|} + \lambda\, \frac{wt(o \to g)}{\sum_{g' \in D_{init}} wt(o \to g')}$$
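A possible way to build such a graph from a precomputed matrix of generation probabilities is sketched below with NumPy. The matrix layout `p_gen[o, g]` = $p(o \mid g)$, the function names, and the toy numbers are assumptions. The sketch also assumes that a document is never its own generator and that smoothing mixes a uniform term weighted by $1 - \lambda$ with the row-normalized weights weighted by $\lambda$.

```python
import numpy as np


def generation_graph(p_gen: np.ndarray, k: int) -> np.ndarray:
    """wt[o, g] = p(o | g) if g is among the k top generators of o, else 0."""
    n = p_gen.shape[0]
    candidates = p_gen.astype(float)       # astype returns a copy
    np.fill_diagonal(candidates, -np.inf)  # assumption: o is not its own generator
    wt = np.zeros((n, n))
    for o in range(n):
        top_gen = np.argsort(-candidates[o], kind="stable")[:k]
        wt[o, top_gen] = p_gen[o, top_gen]
    return wt


def smooth_graph(wt: np.ndarray, lam: float) -> np.ndarray:
    """Mix row-normalized edge weights with a uniform jump over all documents.

    Assumes every row of wt has a positive sum (k >= 1 and positive probabilities).
    """
    n = wt.shape[0]
    row_sums = wt.sum(axis=1, keepdims=True)
    return (1.0 - lam) / n + lam * wt / row_sums


# Toy generation probabilities for three documents (made-up numbers).
p = np.array([[0.0, 0.3, 0.1],
              [0.2, 0.0, 0.4],
              [0.5, 0.1, 0.0]])
wt_smoothed = smooth_graph(generation_graph(p, k=1), lam=0.8)
print(wt_smoothed.sum(axis=1))  # every row sums to 1
```

Each row of the smoothed matrix is a probability distribution over outgoing edges, which is what makes the graph usable as a Markov chain.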

Inducing centrality: Recursive Weighted Influx Algorithm

• Smoothed graph $G^{[\lambda]}$: ergodic Markov chain, power method converges

• The Recursive Weighted Influx algorithm is a weighted analog of PageRank

$$Cen_{RWI}(d; G^{[\lambda]}) = \sum_{o \in D_{init}} wt^{[\lambda]}(o \to d)\, Cen_{RWI}(o; G^{[\lambda]})$$

$$\sum_{d \in D_{init}} Cen_{RWI}(d; G^{[\lambda]}) = 1$$
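The recursive definition can be solved with the power method, as in the sketch below. It assumes the smoothed edge weights are given as a row-stochastic NumPy matrix with strictly positive entries, where entry `[o, d]` is the weight of the edge from o to d; the function name, the tolerance, and the toy matrix are illustrative choices.

```python
import numpy as np


def recursive_weighted_influx(wt_smoothed: np.ndarray, tol: float = 1e-10,
                              max_iter: int = 1000) -> np.ndarray:
    """Power iteration for cen[d] = sum_o wt[o, d] * cen[o], with sum(cen) = 1."""
    n = wt_smoothed.shape[0]
    cen = np.full(n, 1.0 / n)         # start from the uniform distribution
    for _ in range(max_iter):
        new = wt_smoothed.T @ cen     # influx: weighted centrality of in-neighbors
        new /= new.sum()              # keep the scores summing to 1
        if np.abs(new - cen).sum() < tol:
            return new
        cen = new
    return cen


# Toy smoothed graph (made-up weights): most edge weight points at document 1.
w = np.array([[0.2, 0.6, 0.2],
              [0.3, 0.4, 0.3],
              [0.1, 0.8, 0.1]])
print(recursive_weighted_influx(w))  # document 1 gets the highest centrality
```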

The language modeling framework

Score by: $Cen(d; G) \cdot p(q \mid d)$ (cf. $p(d) \cdot p(q \mid d)$)

$Cen(d; G)$: doc “prior”; $p(q \mid d)$: initial ranking

Lafferty&Zhai '01: “with hypertext, [a document prior] might be the distribution calculated using the ‘PageRank’ scheme”

Algorithm: Recursive Weighted Influx+LM

Evaluation

• $D_{init}$: 50 documents retrieved by an optimized language-model-based retrieval method

• Evaluation measure: precision@5

• Reference comparison (initial ranking)

• Can we push relevant documents to the top 5 ranks and move non-relevant documents out of them?

LM framework with centrality scores as “priors”

[Bar chart: prec@5 of init rank vs. RW-Influx+LM on AP, TREC8, WSJ, and AP89]

* Statistically significant difference with init rank

Comparing centrality measures

[Bar chart: precision@5 on AP, TREC8, WSJ, and AP89 for three centrality measures: uniform, log(length) (e.g., Miller et al. '99; doc length as "prior"), and RW-Influx]

Cosine vs. LM probabilities

[Bar chart: precision@5 on AP, TREC8, WSJ, and AP89 for init rank, RW-In+LM(COS), and RW-In+LM(LM)]

Relevance score propagation (cf. Otterbacher et al. ’05)

For document $o \in D_{init}$:

$TopGen(o) \triangleq$ the $k$ documents $g \in D_{init}$ that yield the highest $p(o \mid g)$

The complete graph $G = (D_{init}, D_{init} \times D_{init})$ with edge weights:

$$wt(o \to g) = p(o \mid g)\, \delta[g \in TopGen(o)]$$

The smoothed (Brin&Page '98) complete graph $G^{[\lambda]} = (D_{init}, D_{init} \times D_{init})$ with edge weights:

$$wt^{[\lambda]}(o \to g) = (1 - \lambda)\, \frac{sim(q, g)}{\sum_{g' \in D_{init}} sim(q, g')} + \lambda\, \frac{wt(o \to g)}{\sum_{g' \in D_{init}} wt(o \to g')}$$

Label propagation (Yang et al. ’06)

• Treat the query and documents in the highest ranked (query-specific) cluster as relevant

• Treat the documents at the tail of the result list as non-relevant

• Apply Zhu&Ghahramani’s (’02) label propagation algorithm:

$Y$: the vector of documents' labels (relevant/not-relevant/unlabeled)

$d_{ij}$: the distance between docs $i$ and $j$

$\sigma$: regularization factor

$$w_{ij} = \exp\left(-\frac{d_{ij}^2}{\sigma^2}\right), \qquad T_{ij} = P(j \to i) = \frac{w_{ij}}{\sum_{k} w_{kj}}$$

The algorithm:

1. Propagate $Y \leftarrow TY$

2. Row-normalize $Y$

3. Clamp the labeled data and go to step 1 until convergence
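The three steps (propagate $Y \leftarrow TY$, row-normalize, clamp the labeled rows) can be sketched with NumPy as follows. This is an illustrative sketch and not the code of Yang et al. or Zhu&Ghahramani. It represents Y as an n×2 matrix of label distributions (column 0 = non-relevant, column 1 = relevant), initializes unlabeled rows to (0.5, 0.5), and uses $w_{ij} = \exp(-d_{ij}^2/\sigma^2)$ with column-normalized T; the representation, the initialization, and the toy distances are assumptions.

```python
import numpy as np


def label_propagation(dist, labels, sigma=1.0, tol=1e-9, max_iter=1000):
    """dist: n x n distance matrix; labels: 1 (relevant), 0 (non-relevant) or None.

    Returns an n x 2 matrix whose rows are (P(non-relevant), P(relevant)).
    """
    dist = np.asarray(dist, dtype=float)
    n = len(labels)
    w = np.exp(-(dist ** 2) / sigma ** 2)
    t = w / w.sum(axis=0, keepdims=True)  # T[i, j] = w_ij / sum_k w_kj
    labeled = np.array([lab is not None for lab in labels])
    clamp = np.zeros((n, 2))
    for i, lab in enumerate(labels):
        if lab is not None:
            clamp[i, lab] = 1.0
    y = np.full((n, 2), 0.5)              # unlabeled rows start undecided
    y[labeled] = clamp[labeled]
    for _ in range(max_iter):
        new = t @ y                                # 1. propagate
        new /= new.sum(axis=1, keepdims=True)      # 2. row-normalize
        new[labeled] = clamp[labeled]              # 3. clamp labeled data
        if np.abs(new - y).max() < tol:
            return new
        y = new
    return y


# Four documents on a line (made-up positions): doc 0 is pseudo-relevant,
# doc 3 is pseudo-non-relevant, docs 1 and 2 are unlabeled.
x = np.array([0.0, 1.0, 9.0, 10.0])
dist = np.abs(x[:, None] - x[None, :])
print(label_propagation(dist, [1, None, None, 0], sigma=2.0).round(3))
```

In the toy run the unlabeled document next to the pseudo-relevant one ends up with a relevance probability close to 1, and the one next to the pseudo-non-relevant document ends up close to 0.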

Label propagation (Yang et al. ’06)

Experiments with TREC8; Okapi BM25 is the initial ranking method

M: size of the initial list; K: size of the base set from which pseudo relevant documents are selected; N: # of pseudo non relevant documents

Document-cluster graphs (Kurland&Lee ’06)

• The cluster-document duality (for query-specific clusters)

• Clusters that are most representative of the information need contain (are associated with) many relevant documents

• Clusters that contain (are associated with) many relevant documents are most representative of the information need

Hub/Authority Cluster/Document

[Figure: three graph types over documents d and clusters c. Document as authority: edges c → d. Document as hub: edges d → c. Doc-only graph: edges d → d.]
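For the “document as authority” graph, hub and authority scores can be computed with the usual HITS iteration, sketched below. The edge-weight matrix is assumed to be given (here a plain 0/1 cluster-membership matrix, which is a simplification chosen for the example and not necessarily the weighting used by Kurland&Lee), and the function name is hypothetical.

```python
import numpy as np


def hits_bipartite(edge_wt: np.ndarray, iters: int = 100):
    """HITS on a bipartite graph with edges from clusters (hubs) to documents (authorities).

    edge_wt[c, d] is the weight of the edge c -> d; every document is assumed
    to belong to at least one cluster.
    """
    n_clusters, _ = edge_wt.shape
    hub = np.ones(n_clusters)
    for _ in range(iters):
        auth = edge_wt.T @ hub          # a document is authoritative if good hubs point to it
        auth /= np.linalg.norm(auth)
        hub = edge_wt @ auth            # a cluster is a good hub if it contains authorities
        hub /= np.linalg.norm(hub)
    return hub, auth


# Made-up membership: document 1 belongs to all three clusters.
membership = np.array([[1, 1, 0, 0],
                       [0, 1, 1, 0],
                       [0, 1, 0, 1]], dtype=float)
hub, auth = hits_bipartite(membership)
print(auth.round(3))  # document 1 has the highest authority score
```

Swapping the roles (documents as hubs, clusters as authorities) only requires transposing the matrix, which yields cluster centrality scores instead.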

Re-ranking using document centrality (query-specific clusters)

[Bar chart: prec@5 on AP, TREC8, and WSJ for init. rank, auth[d→d] (authority scores, doc-only graph), PR[d→d] (PageRank scores, doc-only graph), and auth[c→d] (authority scores, doc-as-authority graph), using nearest-neighbor clusters]

* significant difference between auth[c→d] and auth[d→d]

Cluster centrality

The percentage of relevant documents in the highest ranked cluster of 5 documents

AP       TREC8    WSJ
39.2%    39.6%    44%
48.7%    48%      51.2%
49.5%    50.8%    53.6%

Methods compared: influx[d→c], auth[d→c], query likelihood (sim(q,c))

'q': significant difference with query likelihood

Passage-based document retrieval

• Motivation for using passage-based information: long and/or topically heterogeneous documents that are relevant to a query might contain a single (short) passage that contains query-pertaining information

The InterPsgDoc method (Callan ’94):

$$Score(d; q) \triangleq \lambda\, sim(q, d) + (1 - \lambda) \max_{g_i \in d} sim(q, g_i)$$

$g_i$ ($\in d$) is a passage in document $d$
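The interpolation of the document score with the score of its best passage, $\lambda\, sim(q,d) + (1-\lambda)\max_{g_i \in d} sim(q,g_i)$, is easy to sketch. The function name and the similarity values below are made up; how $sim$ is estimated is left open.

```python
from typing import Sequence


def inter_psg_doc_score(sim_q_doc: float, sim_q_passages: Sequence[float],
                        lam: float = 0.5) -> float:
    """Interpolate the document-query similarity with that of the best passage."""
    if not sim_q_passages:
        raise ValueError("a document needs at least one passage")
    return lam * sim_q_doc + (1.0 - lam) * max(sim_q_passages)


# A long heterogeneous document with one highly relevant passage (made-up values)
# versus a short document whose single passage is the document itself.
long_doc = inter_psg_doc_score(0.2, [0.1, 0.9, 0.05])
short_doc = inter_psg_doc_score(0.4, [0.4])
print(long_doc > short_doc)  # True
```

The toy values illustrate the motivation above: the long document loses on whole-document similarity but wins once its best passage is taken into account.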

Passage-document graphs (Bendersky&Kurland ’08)

• Re-ranking an initially retrieved list

• Documents in the list are hubs, passages of documents in the list are authorities

• $Score(d; q) \triangleq sim(q, d) \cdot \max_{g_i \in d} Centrality(g_i)$

Centrality(g) is g’s influx (weighted in-degree) or authority value (induced by the HITS algorithm)

[Figure: bipartite graph with edges from documents $d_i$, $d_j$ to passages $g_l$, $g_m$]

See Krikon et al. (’10) for using doc-only and passage-only graphs simultaneously

See Krikon&Kurland (’11) for integrating documents, passages and clusters

Score regularization (Diaz ’05, ’07)

• The idea: similar documents should be assigned similar retrieval scores

• But, maintain some “consistency” with the initial retrieval scores

• Otherwise, a flat score distribution would be the best

• Could be viewed as “iterative score smoothing”

• $f$: vector of regularized scores

• $S(f)$: score function associated with inter-document consistency of scores; penalizes large differences between scores of similar documents

• $\Upsilon(f)$: consistency with original scores (L2 distance)

• Objective function: $f^{*} = \arg\min_{f \in \mathbb{R}^n} Q(f) \triangleq S(f) + \mu \Upsilon(f)$

$y = (y_1, \ldots, y_n)$ is the vector of original retrieval scores of the documents to be re-ranked (i.e., the highest ranked documents by the initial search)

$f = (f_1, \ldots, f_n)$ is the vector of regularized (new) scores

$W$: affinity matrix ($W_{ij}$ is the affinity between doc $i$ and doc $j$; $W_{ii} = 0$)

$D$: diagonal matrix where $D_{ii} = \sum_{j=1}^{n} W_{ij}$

$$S(f) = \sum_{i,j=1}^{n} \left( \sqrt{\frac{W_{ij}}{D_{ii}}}\, f_i - \sqrt{\frac{W_{ij}}{D_{jj}}}\, f_j \right)^2$$

$$\Upsilon(f) = \sum_{i=1}^{n} (f_i - y_i)^2$$

There is a closed form solution
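One way to obtain a closed form, derived here rather than quoted from the slides: write $S(f) = \sum_{i,j} W_{ij}\,(f_i/\sqrt{D_{ii}} - f_j/\sqrt{D_{jj}})^2$ and $\Upsilon(f) = \sum_i (f_i - y_i)^2$, and assume $W$ is symmetric with positive row sums. Then $S(f) = 2 f^{T}(I - D^{-1/2} W D^{-1/2}) f$, and setting the gradient of $Q$ to zero gives the linear system $(2(I - D^{-1/2} W D^{-1/2}) + \mu I) f^{*} = \mu y$. The NumPy sketch below solves that system; function names and toy values are assumptions.

```python
import numpy as np


def regularize_scores(w: np.ndarray, y: np.ndarray, mu: float) -> np.ndarray:
    """Minimize S(f) + mu * sum_i (f_i - y_i)^2 for a symmetric affinity matrix w."""
    w = np.asarray(w, dtype=float)
    y = np.asarray(y, dtype=float)
    inv_sqrt_d = 1.0 / np.sqrt(w.sum(axis=1))               # D_ii^(-1/2)
    s = w * inv_sqrt_d[:, None] * inv_sqrt_d[None, :]        # D^(-1/2) W D^(-1/2)
    n = len(y)
    system = 2.0 * (np.eye(n) - s) + mu * np.eye(n)
    return np.linalg.solve(system, mu * y)


def objective(f: np.ndarray, w: np.ndarray, y: np.ndarray, mu: float) -> float:
    """Q(f) = S(f) + mu * Upsilon(f), evaluated directly from the definitions."""
    g = f / np.sqrt(w.sum(axis=1))
    smoothness = (w * (g[:, None] - g[None, :]) ** 2).sum()
    fit = ((f - y) ** 2).sum()
    return float(smoothness + mu * fit)


# Three documents; docs 0 and 1 are very similar (made-up affinities and scores).
w = np.array([[0.0, 1.0, 0.5],
              [1.0, 0.0, 0.2],
              [0.5, 0.2, 0.0]])
y = np.array([3.0, 1.0, 0.0])
f_star = regularize_scores(w, y, mu=1.0)
print(f_star.round(3), objective(f_star, w, y, 1.0) < objective(y, w, y, 1.0))
```

A larger μ keeps the regularized scores closer to the original ones; as μ approaches 0 the inter-document consistency term dominates.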

Score regularization – empirical results (Diaz ’07)

Additional approaches (1)

• Daniłowicz and Baliński (’01) define a Markov chain over a graph where documents are vertices and edge-weights are based on query-sensitive inter-document similarities

• The more similar two documents, the higher the edge weight

• The smaller the difference between the original retrieval scores of two documents, the higher the edge weight

• Matveeva ’04

Additional approaches (2)

• Spreading activation networks (Salton&Buckley ’88)

• Markov chains (Baliński&Daniłowicz ’05)

• Hyperlinks and inter-document similarities

• Biasing PageRank using query-similarity values (Richardson&Domingos ’04)

• Both the random jump and the jump via hyperlinks

• The bias can be based on inter-document similarities

• Are semantically related (hyper) links more effective for retrieval? (Koolen&Kamps ’11)

• Graph-based fusion (Kozorovitzky&Kurland ’09, ’11)

• Using random walks with absorbing states to diversify search results (Zhu et al. ’07)

• Term-based graphs (out of the scope of this tutorial)

Adversarial search

• Mishne et al. ’05

• Finding spam comments in blogs by comparing the language models of (i) the comment, (ii) the pages to which the comment has outgoing links, and (iii) the blog post

• Benczúr et al. ’06

• Finding nepotistic links by comparing a language model induced from the anchor text with that induced from the target page

• A similar approach was used by Martinez-Romo&Araujo ’09

• Raiber et al. ’12

• Using inter-document similarities to address keyword stuffing

Concluding notes

• The cluster hypothesis gave rise to much work in the IR field. We surveyed:

• Cluster hypothesis tests

• Cluster-based document retrieval

• Using clusters to select documents

• Using clusters (or topic models) to enrich (smooth) document representations

• Using graph-based methods that utilize inter-document similarities for ad hoc retrieval

• Applications/tasks for which cluster-based or graph-based methods have been used

• Query-performance prediction, fusion, federated search, microblog retrieval, results diversification, query expansion, using relevance feedback, adversarial search

Some open challenges

• The optimal cluster

• The cluster ranking challenge

• Selective application of cluster-based and document-based retrieval

• Devising query-sensitive inter-document measures that will result in the cluster hypothesis holding to a larger extent

• Devising additional graph-based centrality measures that are correlated with relevance
