Tanvi Motwani- A Few Examples Go A Long Way
Transcript of Tanvi Motwani- A Few Examples Go A Long Way
Constructing Query Models from Elaborate Query Formulations
A Few Examples Go A Long Way
Krisztian Balog, Wouter Weerkamp, Maarten de Rijke
ISLA, University of Amsterdam
Presented by Tanvi Motwani
Aim
• This paper introduces and compares several methods for sampling query expansion terms, using both query-independent and query-dependent techniques.
• In addition to the query, the methods take sample documents as input. Sample documents are extra information provided by the user: a small number of “key references” (pages that a good overview page on the topic should link to).
• The aim is to increase “aspect recall” by uncovering aspects of the information need that are not captured by the query but are present in the sample documents.
Aspect Retrieval
Query: What are current applications of robotics?
Find as many different applications as possible.
Example Aspects
A1: spot-welding robotics
A2: controlling inventory
A3: pipe-laying robots
A4: talking robot
A5: robots for loading & unloading memory tapes
A6: robot telephone operators
A7: robot cranes …

Aspect judgments (1 = document covers the aspect):

|    | A1 | A2 | A3 | A4 | … | A(k−1) | Ak |
|----|----|----|----|----|---|--------|----|
| d1 | 1  | 1  | 0  | 0  | … | 0      | 0  |
| d2 | 0  | 1  | 1  | 1  | … | 0      | 0  |
| d3 | 0  | 0  | 0  | 0  | … | 1      | 0  |
| …  |    |    |    |    |   |        |    |
| dk | 1  | 0  | 1  | 0  | … | 0      | 1  |
Overview
• Retrieval Model
• Experimental Setup
• Query Representation
• Baseline Parameters
• Experimental Evaluation
Overview
• Retrieval Model
  – Query Likelihood
  – Document Modeling
  – Query Modeling
• Experimental Setup
• Query Representation
• Baseline Parameters
• Experimental Evaluation
What is a Rainforest?
P(D1|Q) = 0.32
P(D2|Q) = 0.26
P(D3|Q) = 0.19
P(D4|Q) = 0.12
P(D5|Q) = 0.09
(Figure: the query Q scored against documents D1–D5.)
Query Likelihood
• Bayes’ Rule: P(D|Q) = P(Q|D) · P(D) / P(Q)
• Ignoring P(Q): P(D|Q) ∝ P(Q|D) · P(D)
• Assuming independence of query terms: P(Q|D) = ∏i P(qi|D)
• Taking logs: log P(D|Q) ∝ Σi log P(qi|D) + log P(D)
• Using query and document models: Score(Q, D) = Σt P(t|θQ) · log P(t|θD)
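The derivation above can be sketched as a toy scoring function. This is an illustration, not the authors’ code; the Jelinek-Mercer smoothing and the λ = 0.5 weight are assumptions:

```python
import math

def query_likelihood(query, doc, collection, lam=0.5):
    """log P(Q|D), smoothing P(t|D) with Jelinek-Mercer mixing:
    P(t|D) = lam * ML(t|D) + (1 - lam) * ML(t|collection)."""
    coll_len = sum(len(d) for d in collection)
    score = 0.0
    for t in query:
        p_doc = doc.count(t) / len(doc)
        p_coll = sum(d.count(t) for d in collection) / coll_len
        score += math.log(lam * p_doc + (1 - lam) * p_coll)
    return score

docs = [["rain", "forest", "wild"], ["robot", "crane", "rain"]]
q = ["rain", "forest"]
scores = [query_likelihood(q, d, docs) for d in docs]
# the document containing both query terms scores higher
```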
What is a Rainforest?
(Figure: the query and the documents as samples from an underlying Relevance Model.)
Underlying Relevance Model
The query and the relevant documents are random samples from an underlying relevance model R.
Documents are ranked by the similarity of their models to the query model; the Kullback-Leibler divergence between the query and document models can be used to provide this ranking.
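Since the query-model entropy is constant per query, ranking by negative KL divergence reduces to ranking by cross entropy, Σt P(t|Q) · log P(t|D). A minimal sketch with made-up, already-smoothed toy distributions:

```python
import math

def kl_rank_score(query_model, doc_model):
    """Negative KL(query || doc) up to a query-only constant:
    ranking reduces to sum_t P(t|Q) * log P(t|D)."""
    return sum(p * math.log(doc_model[t]) for t, p in query_model.items())

q_model = {"rain": 0.5, "forest": 0.5}
d1 = {"rain": 0.4, "forest": 0.4, "wild": 0.2}   # toy smoothed models
d2 = {"rain": 0.1, "forest": 0.1, "robot": 0.8}
s1 = kl_rank_score(q_model, d1)
s2 = kl_rank_score(q_model, d2)
# d1, whose model is closer to the query model, ranks first
```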
Document Modeling
• Maximum likelihood estimate: P(t|D) = n(t, D) / |D|
• Smoothed ML estimate: P(t|D) = λ · P_ML(t|D) + (1 − λ) · P(t|collection)
• A document that never mentions “Rain” has P(“Rain”|D) = 0 under the ML estimate, thus smoothing is required.
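The zero-probability problem and its fix can be shown in a few lines. Jelinek-Mercer mixing is one common smoothing choice, and λ = 0.8 is an illustrative value, not taken from the slides:

```python
def ml_estimate(term, doc):
    """Maximum-likelihood estimate P(t|D): raw count over document length."""
    return doc.count(term) / len(doc)

def jm_smoothed(term, doc, collection, lam=0.8):
    """Jelinek-Mercer smoothing: mix the document ML estimate with a
    background collection model so unseen terms keep nonzero mass."""
    coll = [t for d in collection for t in d]
    return lam * ml_estimate(term, doc) + (1 - lam) * coll.count(term) / len(coll)

doc = ["wild", "life", "tropical"]
collection = [doc, ["rain", "forest", "rain"]]
p_ml = ml_estimate("rain", doc)              # 0.0: the zero-probability problem
p_sm = jm_smoothed("rain", doc, collection)  # positive after smoothing
```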
Query Modeling
• P(t|Q) is extremely sparse, thus query expansion is necessary.
• A relevant document may not contain the words “Rain” and “Forest” but may contain related words such as “Wildlife”. Expanding the query brings in different “aspects” of the topic.
Experimental Setup
• CSIRO Enterprise Research Collection (CERC): a crawl of the *.csiro.au web site conducted in March 2007
• 370,715 documents
• Size: 4.2 gigabytes
• 50 topics
• Judgments on a 3-point scale:
  – 2: highly relevant (“key reference”)
  – 1: candidate key page
  – 0: not a “key reference”
Parameter Estimation
• Maximizing Average Precision (MAX_AP)
• Maximizing Query Log-Likelihood (MAX_QLL)
• Best Empirical Estimate (EMP_BEST)
Evaluation
• The maximum AP score is reached when the weight is 0.6.
• MAX_QLL performs slightly better than MAX_AP.
Overview
• Retrieval Model
• Experimental Setup
• Query Representation
  – Feedback Using Relevance Models
  – Relevance Models from Sample Documents
  – Query Model from Sample Documents
• Baseline Parameters
• Experimental Evaluation
Query Representation
• The expanded query terms are combined with the original query terms.
• This prevents the topic from drifting away from the original user information need.
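This combination can be sketched as a simple interpolation of the two term distributions; the λ = 0.5 weight is illustrative, not taken from the slides:

```python
def combine_query_models(original, expanded, lam=0.5):
    """Interpolate the expanded query model with the original one:
    P(t|Q) = lam * P_orig(t) + (1 - lam) * P_exp(t).
    Keeping probability mass on the original terms limits topic drift."""
    terms = set(original) | set(expanded)
    return {t: lam * original.get(t, 0.0) + (1 - lam) * expanded.get(t, 0.0)
            for t in terms}

orig = {"rainforest": 1.0}
expanded = {"wildlife": 0.6, "tropical": 0.4}
combined = combine_query_models(orig, expanded)
# the original term retains half the mass; expansion terms share the rest
```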
Feedback Using Relevance Models
Estimate P(t|R) as the joint probability of observing t together with the query terms q1, …, qk, divided by the joint probability of the query terms.
• RM1: t and the qi are assumed to be sampled independently and identically from the same distribution.
• RM2: the qi are sampled dependent on t, but independently of each other.
RM1
Worked example (assume the smoothing weight is 0, and M is just this single document):
• “wild” appears 5 times, “rain” 20 times, and “forest” 30 times in the document.
• The document has 150 unique terms.
• P(D1) = 1/5
• P(“wild”, “rain”, “forest”) = 1/5 · 5/150 · 20/150 · 30/150
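The slide’s RM1 arithmetic can be checked directly (toy numbers from the slide, no smoothing):

```python
# RM1 worked example: M contains a single document D1.
counts = {"wild": 5, "rain": 20, "forest": 30}
vocab = 150      # unique terms in the document, per the slide
p_d1 = 1 / 5     # document prior P(D1)

p_joint = p_d1
for term in ("wild", "rain", "forest"):
    p_joint *= counts[term] / vocab   # no smoothing (weight = 0)
# p_joint equals 1/5 * 5/150 * 20/150 * 30/150 = 2/11250
```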
RM2
Given the term “wild” we first pick a document from M set with probability P(D|t) and then sample query words from the document.
Worked example (assume M is just this document):
• P(D | “wild”) = 0.7 and P(“wild”) = 0.2
• The document contains 10 occurrences of “rain” and 20 of “forest”, and has 200 unique words.
• P(“wild”, “rain”, “forest”) = 0.2 · 0.7 · 20/200 · 10/200
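As with RM1, the RM2 arithmetic on the slide can be verified in a few lines (toy numbers from the slide):

```python
# RM2 worked example: given "wild", pick a document with P(D|"wild"),
# then sample the remaining query words from that document.
p_wild = 0.2             # P("wild")
p_doc_given_wild = 0.7   # P(D | "wild")
unique_words = 200
p_forest = 20 / unique_words
p_rain = 10 / unique_words

p_joint = p_wild * p_doc_given_wild * p_forest * p_rain
# 0.2 * 0.7 * 0.1 * 0.05 = 0.0007
```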
Relevance Models from Sample Documents
• Apply the relevance models to the sample documents instead of the feedback documents, i.e. set M = S.
• For RM1, assume P(D) = 1/|S|.
Query Model from Sample Documents
The top K terms with the highest probability P(t|S) are used to formulate the expanded query:
1. Start from the sample document set S.
2. Select a document D from S with probability P(D|S).
3. From this document, generate term t with probability P(t|D).
4. Sum over all sample documents to obtain P(t|S).
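The four steps above amount to a mixture over the sample documents. A minimal sketch, assuming uniform P(D|S) and maximum-likelihood P(t|D) (both choices are assumptions for illustration):

```python
from collections import Counter

def query_model_from_samples(sample_docs, k):
    """P(t|S) = sum over D of P(D|S) * P(t|D);
    uniform P(D|S) = 1/|S| and ML estimates of P(t|D) assumed."""
    p_d = 1 / len(sample_docs)
    p_t = Counter()
    for doc in sample_docs:
        for term, count in Counter(doc).items():
            p_t[term] += p_d * count / len(doc)
    return [t for t, _ in p_t.most_common(k)]

samples = [["rain", "forest", "rain"], ["wild", "life", "rain"]]
top_terms = query_model_from_samples(samples, k=2)
# "rain" gets the most mass, appearing in both sample documents
```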
Query Model from Sample Documents
• Maximum Likelihood Estimate of a term (EX-QM-ML)
• Smoothed Estimate of a term (EX-QM-SM)
• Ranking Function proposed by Ponte and Croft for unsupervised query expansion (EX-QM-EXP)
Query Model from Sample Documents
Three options for estimating P(D|S):
• Uniform
• Query-biased
• Inverse query-biased
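The slide’s formulas for these options are not preserved in the transcript. The sketch below encodes one plausible reading (query-biased proportional to P(Q|D), inverse proportional to 1 − P(Q|D), both normalized) and should be treated as an assumption:

```python
def doc_weights(query_likelihoods, mode="uniform"):
    """Three ways to set P(D|S) over the sample documents.
    The query-biased and inverse forms are assumed readings, not
    the slide's exact formulas."""
    if mode == "uniform":
        raw = [1.0] * len(query_likelihoods)
    elif mode == "query-biased":
        raw = list(query_likelihoods)               # favor docs matching Q
    elif mode == "inverse":
        raw = [1.0 - p for p in query_likelihoods]  # favor aspects Q misses
    else:
        raise ValueError(mode)
    total = sum(raw)
    return [r / total for r in raw]

likes = [0.6, 0.3, 0.1]   # toy P(Q|D) values for three sample documents
w_uniform = doc_weights(likes, "uniform")
w_biased = doc_weights(likes, "query-biased")
w_inverse = doc_weights(likes, "inverse")
```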
Expanded Query Models
Combination with Original Query
Importance of Sample Document
Topic Level Comparison
Topic Level Comparison
Sampling conditioned on query
Conclusion
• Introduced a method for sampling query expansion terms in a query-independent way, based on sample documents that reflect “aspects” of the user’s information need not captured by the query.
• Introduced several variants of the expansion-term selection method, based on different term-selection and document-importance weighting schemes, and compared them against more traditional query expansion performed in a query-biased manner.
Questions/Discussion
• Every topic needs a sample document set; is this method feasible in a real-world setting where there are countless topics?
• Aspect recall is obtained from the sample documents; aren’t we dependent on the “goodness” of the sample documents, i.e. the number of different aspects they cover, for obtaining high aspect recall?
• The increase in MAP over BFB-RM2 is slight (around 0.07); for an end user, will it make any noticeable difference in experience? Is such a small gain in MAP worth the high cost of obtaining sample documents?