Relative Query Performance Prediction using Ideal Expanded Query

Relative Query Performance Predictionusing Ideal Expanded Query

Snehasish Mukherjee

M.Tech (CS) Year IIRoll No. mtc1010

Indian Statistical Institute, Kolkata.

July 19, 2012

Snehasish Mukherjee Relative Query Performance Prediction using Ideal Expanded Query

Overview

Introduction.

Query Expansion

Problem Definition.

Related Work.

Ideal Expanded Query.

Formulating the Ideal Expanded Query.

Prediction using the Ideal Expanded Query.

Conclusion.


Introduction

Information Retrieval: The science of retrieving relevantdocuments from large collections.

Query: An expression of the users information need.

Vector Space Model: Queries and documents are vectors.Similarity between documents is measured by the cosine of theangle between the vectors.

Evaluating IR systems: Precision, Recall and MAP.

Query Expansion: Reformulating the user query to retrieve morerelevant documents.

Query Expansion Algorithms: Which terms to add to the userquery and what should be their importance ?


Different Approaches to Query Expansion

Global vs Local Methods.

Query Expansion using lexical resources.

(Pseudo)Relevance Feedback and Rocchio Algorithm.

Local Context Analysis(LCA): A co-occurrence based approach.

Information theoretic approach with Kullback-LeiblerDivergence(KLD).

Relevance Based Language Modelling(RBLM).


So What’s The Problem ?

Effectiveness of QE algorithms depend on the input query.

Query Expansion is not always successful.

Given a query, which of the several QE algorithms, if any at all,should be used ?

Need large amount of training data on relative queryperformances to answer such questions.

How to build the training data ? Retrieve with the expandedtraining queries. But requirement of repeated retrieval makes thiscostly.

Can we predict relative performances without repeatedretrieval ? This is precisely the question that we try toanswer.


Overview of our solution

For each training query we know the complete set of documentsthat are relevant to the information need. We utilize this fullknowledge of relevance to formulate the ideal expanded query forthe input {query, collection} pair. For every candidate queryexpansion algorithm, find the similarity between the expandedquery produced by it and the ideal query. Predict the relativeperformances of the QE algorithms based on this similarityscores.

Consists of 2 problems. First is the formulation of the idealexpanded query. Second is choosing the appropriate similaritymetric to measure the similarity between a candidate query andthe ideal query.


Related Work: Comparing QE algorithms.

Not much work has been reported in IR literature that addressthe problems we are trying to solve.

“Comparing and Combining methods for automatic queryexpansion” [7] reports comparison between QE methodsbelonging probabilistic and co-occurrence based approaches.

In [4] Carpenito et al reports adhoc experimental comparisonbetween an information theoretic approach to QE and otherapproaches.

But all of the above lack the development of an algorithmicframework for comparison on per query basis.


Related Work: Formulating the Ideal Expanded Query

IR literature is deficient in studies on methods to compute a setof good expansion terms for a given query by utilizing fullknowledge of relevance.

Araujo et al [1] discusses training of a classifier to select goodexpansion terms. Training is done with good quality expansionterms selected by a Genetic Algorithm from the users relevancejudgements on a set of documents.

Document Routing [6] and Filtering [2] provide a goodapproximation to our original problem. The problem of learninguser profile from a set of documents marked as relevant by theuser in the TREC Routing or Filtering tracks are conceptuallysimilar to the problem of selecting a set of good expansion termsby utilizing full knowledge of relevance.


Ideal Expanded Query

Definition: Ideal Expanded Query(IEQ)

For a given query qorig, the Ideal Expanded Query is the query Qidealobtained by adding appropriately weighted terms to qorig such thatthe query so formed ranks all relevant documents above all nonrelevant documents and hence has AP = 1.0, irrespective of theretrieval or document weighting model.

The above definition of the IEQ is too strict to be feasible. For agiven query-collection pair an IEQ might not even exist. So we lowerour aim and accept any query with a reasonably high AP, say > 0.8,as our IEQ.


Predicting Relative query performance with IEQ

Hypothesis

For a particular information need, if I be the IEQ, qi be the ith

candidate query, sim() be the similarity measure between two vectorsthen, sim(I, qi) values exhibit high positive correlation with the MAPvalues of retrieval done using the candidate queries.

Therefore ranking candidate queries based on sim(I, qi), should be areasonably accurate approximation of the actual relative performancesof the candidate queries. Hence this solves our problem. However theabove hypothesis does not explicitly require that query performancebe monotonically increasing with its similarity to the IEQ. This isway more stricter and difficult to achieve. Instead our hypothesisallows few cases of MAP(q1) > MAP(q2) even while sim(IEQ,q1) <sim(IEQ,q2).


Problem Definition

Formulating the Ideal Expanded Query(IEQ)

Given the information need expressed as the input query ~qorig, thecollection C to be searched, and the set R,R ⊂ C of all documentsthat are relevant to the information need, expand ~qorig to form the

Ideal Expanded Query ~Qideal such that ~Qideal has very high APirrespective of the document weighting and retrieval model.

It is interesting to note that while the problem of informationretrieval is to find the relevant documents for a given query ,the problem of formulating the Ideal expanded Query is to find thequery given all the relevant documents.


How to formulate the IEQ? Part1:The Rocchio Algorithm

Modified Rocchio query

~Qrocchio = α ~Qorig + β1

|R|∑~d ∈ R

~d − γ1

|N |∑~d /∈ R

~d (1)

where ~Qorig is the original query vector, C is the set of all documentsto be searched, R ⊆ C is the set of all documents that are relevant tothe query, N ⊆ C is the set of all documents that are not-relevant tothe original query, ~d is the Ltu weighted document vector and α, βand γ are the Rocchio parameters.

The Rocchio query maximizes similarity with the relevant set whilemaximizing the dissimilarity with the non relevant set. it however isnot the ideal query that we are looking for. In the following slides wewill look at techniques that take this rocchio query near the idealquery.


Part2: Query Zoning [Singhal, 97] [9]

Central theme of Query Zoning

Instead of taking into account all known non-relevant documents inRocchio’s formulation of feedback query, we should consider onlynon-relevant docs that are in the query domain.

Query Domain

A set of documents that are more similar to the query than manyothers. For example the query domain of an user query “The ChromeOS platform” may be the set of all documents about computers asopposed to other domains like Geography, History, Literature etc. Adomain refers to a large or general topic and there may bemany documents in the query domain itself that are nonrelevant.


Query Zoning continued...

Query Zone

We approximate a query domain by a Query Zone. In the vectorspace model, a Query Zone, henceforth abbreviated QZ, can bevisualized as a volume or cloud near the query vector. It representsset of documents that are similar to the query vector. Resultsreported in [9] show that using Query Zoning results in 9% to 12%improvement over the case where we use all non-relevant documents,in the TREC Routing task. Moreover for Query Zoning α = 8 andβ = γ = 64 gives better results.

Final prescription

In the modified Rocchio’s formulae, replace “all” nonrelevant documents by the “top few most similar” nonrelevant documents.


Part 3 - Dynamic Feedback Optimization[Buckley,95] [3]

DFO overview

Optimize the query term weights by making small changes in theircurrent weights, and observing its effect on the retrieval of the trainingset. Favourable changes stay while unfavourable ones are undone.

DFO Steps as in [3]

1. Rank terms occurring in the training doc set by the number oftraining relevant documents in which they occur. From that list addthe top x terms to the original query. Re-weigh all terms in queryusing modified Rocchio eqn. on training set.

2. (Optional) For each expansion term determine whether includingthe term at all improves retrieval over the training set.


DFO contd..

3. Perform 3 passes over all query terms. During each pass increase(by 50%, 25%, 12.5% in 1st , 2nd and 3rd passes respectively – theseare known as the pass ratios) the weight of the term encountered andsee if it improves retrieval. Only if it does, note down its new weightbut do not change it in the query immediately. Move on to the nextterm and so on. When the pass ends, make all these weight changesand repeat the process until 3 complete passes are done.

4. The reformulated query form the step 3 above is run against thetest set and evaluated using average recall precision for all documents.


Part 4: Augmenting Rocchio with QZ and DFO

Algorithm 1 [Schapire, 98] [8]

1. Find nw, the average number of distinct words per document.

2. ~Qrel: Create the centroid vector for the set R of Ltu weightedrelevant documents. Denote this query by ~Qrel.

3. ~Q′

rel: Create a copy ~Q′

rel of ~Qrel. Remove all rare words, i.e. words

that appear in less than 5% of all relevant documents from ~Q′

rel. Thisprevents possibly random terms from influencing the query. Truncate~Q

′

rel to contain only the highest weighted nw words. This, together

with ~Qorig is the initial query for Query Zoning.

4. Query Zoning: Using ~Qorig + ~Q′

rel and Lnu weighted documents in

C form the Query Zone of ~Qorig by selecting the most similar

max{ |C|100 , |R|} non-relevant documents. For measuring the documentsimilarity use the inner product similarity between the document andquery vectors.


Algorithm 1: contd...

5. ~Qnrel: Form the centroid vector of the Ltu weighted documents inthe Query Zone computed in the previous step. Denote this query by~Qnrel. 6. ~Qrocchio: Obtain the Rocchio query as follows

~Qrocchio = α ~Qorig + β ~Qrel − γ ~Qnrel (2)

7. ~Qfinal: Remove rare terms, i.e terms that occur in less than 5% of

all relevant documents, from ~Qrocchio. Select the top, i.e highestweighted, nw terms from ~Qrocchio into ~Qfinal which is our final query.

8. Term weights of the nw terms of ~Qfinal are further optimizedusing a 5 pass DFO with pass ratios 4.00, 2.00, 1.00, 0.50, 0.25.

9. Output ~Qfinal as the IEQ.


Part 5: Boosting for term Selection

AdaBoost

AdaBoost was introduced by Freund et al in [5] and was adapted forapplication to text filtering by Schapire et al in [8]. The main idea ofboosting, in the context of text filtering, is to combine several simpleclassification rules to build an accurate classifier.

Definition

Weak Hypothesis: A weak hypothesis is a classification rule like “Ifa term t is present in a document d, then the document d is relevant,else irrelevant.”. A weak hypothesis for the sth round is denoted byhs. If the ith document is relevant hs(i) = +1, else hs(i) = −1


Boosting contd...

Definition

ths : The term corresponding to the weak hypothesis hs.

Weak Learner: A weak learner or weaklearn() is a subroutine thatproduces a weak hypothesis.

T,T0: T0 is the number of rounds after which classification errorreduces nearly to 0. T = 1.1 ∗ T0, i.e T is 10% more than T0.

Ds(i): D is a distribution that assigns importance weights todocuments. Ds is the distribution during the sth round. Ds(i) is theimportance weight assigned to the ith document at the start of the sth

round.


Boosting contd...

Overview of the Algorithm

The Algorithm assumes access to a weak learner. At the beginningof a round s; 1 ≤ s ≤ T ; all documents are assigned importanceweights. Then the weak learner is called to produce a weak hypothesishs, such that it properly classifies as many heavy documents aspossible. Before the beginning of the next round the documents arereweighed; documents correctly classified previously get lower weightswhile those incorrectly classified previously get higher weights. Thusdocuments that are difficult to classify, progressively gets higherweights until the appropriate weak hypothesis is obtained and thedocument is correctly classified.


Boosting contd...

Algorithm 2

Input: An integer T specifying the number of iterations. Ndocuments and labels < (di, yi), ...(dN , yN) > whereyi ∈ {+1,−1} is the relevant/non-relevant label.

1. Initialize D1(i) =1

N, ∀i.

2. Initialize Q = φ

3. Do for s = 1..T

a. Call weaklearn() and get a weak hypothesis hs.

b. Calculate error of hs : εs =∑

i:hs(i) 6=yi

Ds(i).

c. Set αs =1

2ln

(1− εsεs

).

d. If αs > 0, Q = Q ∪ {(ths, αs)}


Algorithm 2 contd...

e. Update distribution: Ds+1(i) =Ds(i)

Zs

{e−αs if hs(di) = yi,

e+αs if hs(di) 6= yi,

where Zs is the normalization factor.

4. Return Q, the set of (term,weight) pairs as the expansion query.

Choice of αs: αs =1

2ln

(1− εsεs

)Lower the value of εs higher the value of αs. So more accuratethe weak hypothesis hs, higher is the the value of αs.

we allow αs to become negative. This happens when εs >1

2. It

signifies harmful terms, i.e terms whose presence are betterindicators of irrelevance than that of relevance.

As εs →1

2, αs → 0. This implies that terms that misclassify

nearly half of the documents are neutral, i.e they are neithergood expansion terms, nor bad.



Generating the weak hypothesis: weaklearn()

1. Consider all terms t.

2. For each term t design a hypothesis h(t) such thatt ∈ d =⇒ d is relevant.t /∈ d =⇒ d is not relevant.

3. Calculate error for the weak hypothesis h(t) as

εs(t) =∑

i:t∈di;di /∈R

Ds(i) +∑

i:t/∈di;di∈R

Ds(i)

4. Finally, choose that term t for forming the weak hypothesis whichminimizes min{εs, 1− εs}

Relevant documents that are difficult to retrieve gets progressivelyhigher weights in this method. Eventually terms from such documentsget selected as weak hypothesis and are hence included in our query.


How to formulate the IEQ? Part 6:Algorithm 3: Combining Boosting and Rocchio

Some observations leading to Algorithm 3

1. Algorithm 2 considers all terms in the lexicon, and for each term itconsiders all the documents in the collection. However it is wastefulto consider all terms in the lexicon, since it is good expansion termswill come from the subset LR of the lexicon, which is the union of allthe terms in the set of relevant documents R.

2. We can further restrict the set of candidate terms which theroutine weaklearn() should consider by selecting the top few termsform LR on the basis of their weights as given by the Rocchio query.

3. We assume that for two different input queries to the DFOprocess, better results will be obtained for the better input query.Therefore, if we further optimize the weights of the query terms of thequery Q from Algorithm 2 using DFO, the results obtained should bebetter than that obtained from Algorithm 1.


Part 6 contd...

Algorithm 3

1. Formulate ~Qfinal as given in step 7 of Algorithm 1. Only, do notremove the rare terms in this case.

2. Let n be the number of expansion query terms needed. Letn = 1.2 ∗ nw, where nw is the average number of unique terms perdocument. This maintains a balance between the need to have querysize equal to the average document size and and the need to havelarge number of expansion terms.

3. Select the top n terms from ~Qfinal on basis of their term weights.Denote this set by LB , the boosting lexicon. Notice that we use theRocchio weights only for selection and not term weighting.

4. Run Algorithm 2. Only, now weaklearn() does not consider allterms in lexicon, but considers LB , a much smaller subset. Noticethat we do not resort to the sub-sampling of the document collectionas reported in the implementation note in [8].



5. Feed the output query Q from the above step to step 8 of theAlgorithm 1 for optimizing its weights using DFO.

6. Output the final query ~Qultimate

Ranked retrieval performed using queries formulated by Algorithm 3should have very high MAP. In the following slides we presentretrieval results obtained using queries produced by Algorithms 1, 2and 3 on the standard TREC collections.


Results for Ideal Expansion Query: page I

Test Collection: Disks 4 & 5 of the TREC Collection excluding theCongressional Records subcollection on Disk 4.

Experiment: We run Algorithms 1, 2 and 3 on topics 401 to 450 ofthe TREC8 adhoc track. We perform retrieval with the queriesproduced by each of these 3 algorithms using 4 different documentterm weighting/retrieval models implemented in the Terrier IRplatform. Higher the MAP better is the algorithm at formulating theIEQ. Low variance across different document term weighting/retrievalmodels will imply that the queries perform well irrespective of thedocument weighting model.

Baseline: The baseline retrieval system against which we compareour algorithm, uses Terrier’s implementation of the BM25 modeltogether with Bo1 query expansion.


Results for Ideal Expansion Query: page II

Table Algo1 presents the retrieval performance of the expandedqueries produced by Algorithm 1 for 4 different documentweighting/retrieval schemes. The MAP hovers around 0.77 for all 4different document weighting models and hence results obtained areindependent of the retrieval model. Though Algorithm 1 performs alot better than the baseline, the MAP is not high enough to term thequery produced by Algorithm 1 as the IEQ.

Algorithm 1: Query Zoned Rocchio Query with DFO

Weighting Model Baseline MAP IEQ MAP percentage increaseBM25 0.2610 0.7710 195%DLH13 0.2747 0.7724 181%InL2 0.2672 0.7760 190%

TF-IDF 0.2637 0.7730 193%

Table: Algo-1 MAP values obtained for Algorithm 1


Results for Ideal Expansion Query: page III

Similarly, Table Algo-2 presents the results obtained using Algorihm2 with all column and row definitions similar to that of Table 4.2.

Algorithm 2: Boosting with Query Zoned Rocchio query


TF-IDF 0.2637 0.6213 135%



Results for Ideal Expansion Query: page IV

Observations for Table Algo-2

1. The results are yet again independent of the document weightingschemes.

2. Retrieval performance is about 18% lower than that of Algorithm1. Hence Boosting terms selected using Rocchio query is not a goodchoice for formulating the IEQ.

3. However, it should be noted that Algorithm 2 does not employDFO which is known to increase query term weighting qualityconsiderably. Still it manages to achieve a average MAP ≈ 0.63.Hence the quality of terms selected by this algorithm is better thanthe quality of terms selected at the end of step 7 in Algorithm 1 andretrieval performance should improve considerably on application ofDFO to the query terms selected by Algorithm 2.


Results for Ideal Expansion Query: page V

Table Algo-3 presents the retrieval performance obtained using theAlgorithm 3. The MAPs obtained are around 0.81 for all the 4document weighting schemes used. It brings out the effectiveness ofthe 5 pass DFO used by us, since it effected a 28.5% improvementover Algorithm 2. The MAP is high enough for the queriesproduced by Algorithm 3 to be considered as the IEQ.

Algorithm 3: DFO applied to Boosting with Query Zoned Rocchio query


TF-IDF 0.2637 0.8132 208%



Results for Ideal Expansion Query: page VI

Table Comp compares the relative performances of Algorithms 1, 2and 3 on 10 difficult-to-retrieve queries. Compared to Algorithm 1,Algorithm 3 performs better on most of the queries while beingmarginally worse on a few.

QNo. Baseline AP Algorithm-1 AP Algorithm-2 AP Algorithm-3 AP

401 0.0279 0.6725 0.5820 0.6656

413 0.0460 0.7920 0.6264 0.8346

421 0.0220 0.3804 0.1631 0.5288

426 0.0543 0.6578 0.5815 0.6943

432 0.0011 0.8006 0.5674 0.9621

433 0.0043 1.0000 1.0000 1.0000

437 0.0176 0.5737 0.2870 0.7252

439 0.0592 0.5153 0.3659 0.5548

442 0.0119 0.6943 0.5692 0.8117

448 0.0070 0.7115 0.3482 0.7023

Table: Comp Performance on difficult-to-retrieve queries


Results of Prediction: page I

Candidate Queries: Queries produced by different parametervariations of KLD, LCA and RBLM query expansion algorithms.Each parameter variation is treated as a different algorithm,henceforth denoted as runs, and has a suite of 50 queriescorresponding to the 50 topics(401-450) in the TREC8 adhoc track.

Experiment: Select 31 runs where queries have different number ofexpansion terms. For the ith topic, compute the similarity betweenthe ideal query and the ith query in each of the 31 runs. Rank the 31queries according to this similarity. Compute the correlation of thisranking with the actual ranking of the 31 runs. Higher thecorrelation, better the prediction.


Results of Prediction: page II

Similarity Measures:

Jaccard Index(j): sim (I,Q) =| I ∩ Q || I ∪ Q |

Cosine similarity(c): sim (I,Q) =~I · ~Q

||~I|| · || ~Q||

Kendall’s rank correlation(τ):

sim (I,Q) =no of concordant pairs− no of discordant pairs

12n(n− 1)

Spearman’s rank correlation(ρ):

sim (I,Q) =

∑i (xi − x) (yi − y)√∑i (xi − x)

2(yi − y)

2

Pearson’s correlation(r): sim (I,Q) =

∑i

(Ii − I

) (Qi − Q

)√∑i

(Ii − I

)2 (Qi − Q

)2Snehasish Mukherjee Relative Query Performance Prediction using Ideal Expanded Query

Results of Prediction: page III

Table Description: In all the following tables, the rows representthe type of similarity measure between the candidate and idealqueries, while the columns represent the kind of correlation betweenthe predicted and actual ranking. For example TablePred-1[cosine,pearson] = 0.642448 means that the average pearsoncorrelation coefficient between the actual ranking of the runs andranking the ranking of runs based on the cosine similarity betweenthe queries and the IEQ, is 0.642448. The average is taken over allthe 50 queries.

Pearson Kendall SpearmanJaccard index 0.350437 0.130963 0.183748

Pearson 0.578621 0.286329 0.389483Kendall −0.002708 −0.072504 −0.113027

Spearman −0.003438 −0.058723 −0.097622cosine 0.642448 0.289576 0.388241

Table: Pred-1 31-candidates having different number of expansion terms


Results of Prediction: page IV

The following 2 tables investigates the dependence of the predictionquality on the similarity of the number of expansion terms as well ason the number of intersections with the IEQ. Table Pred-2 is for 20runs with each query having 5 expansion terms and Table Pred-3 isfor 21 runs with each query having 50 expansion terms.


Pearson 0.502687 0.231794 0.293571Kendall −0.283837 −0.097530 −0.164953

Spearman −0.277611 −0.108700 −0.183090cosine 0.515466 0.201074 0.254903

Table: Pred-2 20-candidates, each has 5 terms.


Results of Prediction: page V


Pearson 0.611369 0.265306 0.349775Kendall −0.162379 −0.054227 −0.083143

Spearman −0.263758 −0.104762 −0.166578cosine 0.673141 0.296599 0.380705

Table: Pred-3 21-candidates, each has 50 terms.


Results of Prediction: page VI

Discussions1. Cosine similarity is clearly a better similarity metric. Pearsoncorrelation between the term weights in the ideal query and in thecandidate queries, is a close second followed by Jaccard index.

2. Rank correlations between the candidate and reference queryterms are clearly very poor choices of similarity.

3. The cosine similarity displays strong positive correlation withactual query performance(MAP).

4. Rank correlation between the actual and predicted ranks is notas high as expected. Spearman rank correlation is considerablybetter than Kendall’s rank correlation.

5. The similarity metrics are not affected too much by thedifference in the no of expansion terms in the candidates beingranked. Prediction quality is better when queries have largenumber of expansion terms. This is possibly due to the largernumber of intersection with the IEQ.


Conclusion

With MAP > 0.8 across different weighting schemes, Algorithm 3is indeed an effective formulator of the IEQ. Suitability of morerecent algorithms for formulating IEQ can be investigated infuture.

cosine similarity between the input query and the IEQ is a verygood similarity metric since it exhibits high correlation(pearson)with the MAP values of the actual retrieval runs.

Ranking runs based on their cosine similarity with the IEQ showpositive rank correlation, with the actual ranking of the runs.

Therefore a ranking of queries based on their cosine similaritywith the Ideal Expanded Query produced by Algorithm 3 is areasonable approximation of their actual relative retrievalperformances.


Thank You.


References I

Lourdes Araujo and Joaquın Perez-Iglesias.Training a classifier for the selection of good query expansionterms with a genetic algorithm.In IEEE Congress on Evolutionary Computation, pages 1–8.IEEE, 2010.

Nicholas J. Belkin and W. Bruce Croft.Information filtering and information retrieval: Two sides of thesame coin?Commun. ACM, 35(12):29–38, 1992.

Chris Buckley and Gerard Salton.Optimization of relevance feedback weights.In Edward A. Fox, Peter Ingwersen, and Raya Fidel, editors,SIGIR, pages 351–357. ACM Press, 1995.


References II

Claudio Carpineto, Renato de Mori, Giovanni Romano, andBrigitte Bigi.An information-theoretic approach to automatic query expansion.

ACM Trans. Inf. Syst., 19(1):1–27, 2001.

Yoav Freund and Robert E. Schapire.A decision-theoretic generalization of on-line learning and anapplication to boosting.J. Comput. Syst. Sci., 55(1):119–139, 1997.

Donna Harman.Overview of the fourth text retrieval conference (trec-4).In TREC, 1995.

Jose R. Perez-Aguera and Lourdes Araujo.Comparing and combining methods for automatic queryexpansion.CoRR, abs/0804.2057, 2008.


References III

Robert E. Schapire, Yoram Singer, and Amit Singhal.Boosting and rocchio applied to text filtering.In SIGIR, pages 215–223. ACM, 1998.

Amit Singhal, Mandar Mitra, and Chris Buckley.Learning routing queries in a query zone.In SIGIR, pages 25–32. ACM, 1997.


Relative Query Performance Prediction using Ideal Expanded Query

Technology

Transcript of Relative Query Performance Prediction using Ideal Expanded Query