Extending BM25 with multiple query operators


Description

Traditional probabilistic relevance frameworks for information retrieval refrain from taking positional information into account, due to the hurdles of developing a sound model while avoiding an explosion in the number of parameters. Nonetheless, the well-known BM25F extension of the successful Okapi ranking function can be seen as an embryonic attempt in that direction. In this paper, we proceed along the same line, defining the notion of virtual region: a virtual region is a part of the document that, like a BM25F field, can provide (larger or smaller, depending on a tunable weighting parameter) evidence of relevance of the document; unlike BM25F fields, though, virtual regions are generated implicitly by applying suitable (usually, but not necessarily, position-aware) operators to the query. This technique fits nicely in the eliteness model behind BM25 and provides a principled explanation of BM25F; it specializes to BM25(F) for some trivial operators, but has a much more general appeal. Our experiments (both on standard collections, such as TREC, and on Web-like repertoires) show that the use of virtual regions is beneficial for retrieval effectiveness.

Transcript of Extending BM25 with multiple query operators

Page 1: Extending BM25 with multiple query operators

Roi Blanco, Yahoo! Research

Paolo Boldi, Università degli Studi di Milano

Extending BM25 with multiple query operators

Page 2: Extending BM25 with multiple query operators

Overview

• Motivation

• Virtual regions and BM25’s event space

• Experiments

• Conclusions

Page 3: Extending BM25 with multiple query operators

Motivation

• From document understanding to query understanding

• Query rewrites (gazetteers, spell correction), named entity recognition, query suggestions, query categories, query segmentation ...

• Detecting query intent, triggering verticals

• targeting answers directly

• richer interfaces

Page 4: Extending BM25 with multiple query operators

Signals for Ranking

• Signals for ranking: matches of query terms in documents, query-independent quality measures, CTR, among others

• Probabilistic IR models are all about counting

• occurrences of terms in documents, in sets of documents, etc.

• How to efficiently aggregate a large number of “different” counts

• coming from the same terms

• no double counts!

Page 5: Extending BM25 with multiple query operators

Searching for food

• New York’s greatest pizza

‣ New OR York’s OR greatest OR pizza

‣ New AND York’s AND greatest AND pizza

‣ New OR York OR great OR pizza

‣ “New York” OR “great pizza”

‣ “New York” AND “great pizza”

‣ York < New AND great OR pizza

• among many more.
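To make these rewrites concrete, here is a minimal Python sketch (illustrative only; `rewrite_query` and the Lucene-like syntax are assumptions, not the paper's code) that generates some of the variants above:

```python
from itertools import combinations

def rewrite_query(terms):
    """Generate operator-based rewrites of a bag-of-words query.

    Returns a dict mapping operator names to rewritten queries, in a
    Lucene-like syntax used here purely for illustration.
    """
    return {
        "or":  " OR ".join(terms),
        "and": " AND ".join(terms),
        # 2-grams: quoted pairs of consecutive terms
        "2-grams": " OR ".join(f'"{a} {b}"' for a, b in zip(terms, terms[1:])),
        # 2-and: conjunctions of arbitrary term pairs
        "2-and": " OR ".join(f"({a} AND {b})" for a, b in combinations(terms, 2)),
    }

for name, q in rewrite_query(["new", "york", "greatest", "pizza"]).items():
    print(f"{name:>8}: {q}")
```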

Page 6: Extending BM25 with multiple query operators

“Refined” matching

• Extract a number of virtual regions in the document that match some version of the query (operators)

• Each region provides different evidence of relevance (i.e., a signal)

• Aggregate the scores over the different regions

• Ex.: “at least any two words in the query appear either consecutively or with an extra word between them” (see the sketch below)
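As an illustration of the example operator just quoted, the following sketch (a hypothetical helper, not the paper's code) extracts the spans where at least two distinct query words occur consecutively or with one extra word between them:

```python
def two_word_windows(doc_tokens, query_terms, gap=1):
    """Find virtual regions where two distinct query words occur either
    consecutively or with at most `gap` extra words between them."""
    qset = set(query_terms)
    regions = []
    for i, tok in enumerate(doc_tokens):
        if tok not in qset:
            continue
        for j in range(i + 1, min(i + 2 + gap, len(doc_tokens))):
            if doc_tokens[j] in qset and doc_tokens[j] != tok:
                regions.append((i, j))  # span [i, j] is one virtual region
    return regions

doc = "the greatest pizza in new york".split()
print(two_word_windows(doc, ["new", "york", "greatest", "pizza"]))
# [(1, 2), (2, 4), (4, 5)]: 'greatest pizza', 'pizza in new', 'new york'
```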

Page 7: Extending BM25 with multiple query operators

Probability of Relevance

[Plot: probability of relevance P[relevant] (0.0000–0.0015) as a function of term frequency tf (0–50), with separate point series for the operators: or, 2-gram, 2-and.]

Page 8: Extending BM25 with multiple query operators

[Figure 2: BM25 plate diagram representing the assumptions over variable independence; nodes t ∈ q, R, tf_t, E_t.]

according to $p(r \mid Q = q, D = d)$, or equivalently

$$\frac{p(r \mid Q = q, D = d)}{p(\bar{r} \mid Q = q, D = d)} \propto_q \frac{p(D = d \mid r, Q = q)}{p(D = d \mid \bar{r}, Q = q)} \propto_q \prod_{t \in q} \frac{p(D_t = d_t \mid r)}{p(D_t = d_t \mid \bar{r})} \propto_q \sum_{t \in q,\, d_t > 0} \log \frac{p(D_t = d_t \mid r) \cdot p(D_t = 0 \mid \bar{r})}{p(D_t = d_t \mid \bar{r}) \cdot p(D_t = 0 \mid r)} \propto_q \sum_{t \in q,\, d_t > 0} w_{t,d},$$

where $w_{t,d}$ is the weight assigned to term $t$ in document $d$; one can think of $d_t$ as the term frequency of $t$ in $d$, later denoted by $\mathrm{tf}_{t,d}$.

In the derivation above, we assumed the terms to be conditionally independent, that is, $p(D_t = x, D_{t'} = y \mid r, Q) = p(D_t = x \mid r, Q) \cdot p(D_{t'} = y \mid r, Q)$ and $p(D_t = x, D_{t'} = y \mid \bar{r}, Q) = p(D_t = x \mid \bar{r}, Q) \cdot p(D_{t'} = y \mid \bar{r}, Q)$ for any pair of terms $t$ and $t'$; this is a weaker assumption than term independence, in that we only require terms to be independent for each fixed query and relevance value. This assumption is fundamental in practice, because it leads to tractable models, but it also has a deeper justification: as [16] proved, conditional term independence can be obtained when, for any given query, terms are statistically correlated but the correlation is the same on relevant and on non-relevant documents. Empirical evidence on the retrieval performance of BM25 suggests that this is indeed often the case. For example, the query terms New and York are correlated in relevant documents for the query New York pizza, but they are also correlated in the whole collection.

Further, the derivation imposes a vague prior assumption over terms not appearing in the query ($p(D_t = x \mid r) = p(D_t = x \mid \bar{r})$ if $t \notin Q$). This can be weakened in the case of query expansion by explicitly linking unseen query terms to relevance.

The final arithmetic trick in the above derivation, known as removing the zeroes, is used to eliminate from the final score calculation terms that are not present in the document.

The determination of term weights in BM25 is based on the assumption that there is an Elite random variable, which can be cast as a simple topical model that perturbs the distribution of words over the text. That is [30], the author is assumed first to choose which topics to cover, i.e., which terms are elite and which are not. Furthermore, it is assumed that the frequencies of terms in both the elite and the non-elite set follow a Poisson distribution, with two different means; in other words, for a given term $t$ and for $e \in \{0, 1\}$ (denoting whether we consider the term to be elite or not), there is a random variable $E_{t,e}$ that expresses the distribution of the frequencies of the (elite or non-elite) term $t$ in a document, and $E_{t,e} \sim \mathrm{Poisson}(\lambda_{t,e})$; clearly, we expect $\lambda_{t,1} > \lambda_{t,0}$ (i.e., a single term will appear more frequently if it is elite than if it is not). Plugging this assumption into the general formula derived above determines a monotonic weighting function with a horizontal asymptote that may be interpreted as a form of saturation: the probability of relevance increases with term frequency, but the amount of increase ultimately approaches zero as the frequency becomes large. In practice, this is well approximated by

$$w^{\mathrm{BM25}}_{t,d} = \frac{\overline{\mathrm{tf}}_{t,d}}{\overline{\mathrm{tf}}_{t,d} + k_1} \cdot w^{\mathrm{idf}}_t$$

where $w^{\mathrm{idf}}_t$ is the inverse document frequency for term $t$ (which determines the asymptotic behavior when the frequency goes to infinity), $k_1$ is a parameter, and $\overline{\mathrm{tf}}_{t,d}$ is the term frequency normalized with respect to the document length, i.e.,

$$\overline{\mathrm{tf}}_{t,d} = \frac{\mathrm{tf}_{t,d}}{(1 - b) + b \cdot |d| / \mathrm{avdl}}, \qquad (1)$$

where $|d|$ is the length of document $d$, $\mathrm{avdl}$ the average length of documents in the collection, and $b \in [0, 1]$ is a tunable parameter. Putting things together, the weight derived from the 2-Poisson eliteness assumption is $w^{\mathrm{BM25}}_{t,d} = U^{\mathrm{BM25}}(\overline{\mathrm{tf}}_{t,d})$, where

$$U^{\mathrm{BM25}}(x) = \frac{x}{x + k_1}.$$
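A minimal Python sketch of this weighting follows; the paper only specifies $w^{\mathrm{idf}}_t$ abstractly, so the RSJ-style idf used here is an assumption, shown because it is a common choice:

```python
import math

def bm25_weight(tf, dl, avdl, df, N, k1=1.2, b=0.75):
    """BM25 weight of one term in one document.

    tf: raw term frequency, dl: document length, avdl: average document
    length, df: document frequency of the term, N: number of documents.
    """
    tf_norm = tf / ((1.0 - b) + b * dl / avdl)   # length normalization, Eq. (1)
    idf = math.log((N - df + 0.5) / (df + 0.5))  # one common idf estimate (assumption)
    return tf_norm / (tf_norm + k1) * idf        # saturation U(x) = x / (x + k1)
```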

Regions and virtual regions.

In the following derivation we start again from the above-mentioned estimation of

$$w^{\mathrm{BM25}}_{t,d} = \log \frac{p(\mathrm{tf}_{t,d} = x \mid r) \cdot p(\mathrm{tf}_{t,d} = 0 \mid \bar{r})}{p(\mathrm{tf}_{t,d} = x \mid \bar{r}) \cdot p(\mathrm{tf}_{t,d} = 0 \mid r)}.$$

Here, the only features that we need to observe are term frequencies; this is quite a mild assumption and stems from the idea that documents are a single body of text with no structure whatsoever. Some earlier refinements of the probabilistic model, however, already introduced the idea that documents may have some structure. In BM25F [33], for example, a notion of region was introduced: each document is made up of regions, or fields (e.g., title, body, footnotes, etc.), and it is possible to observe term frequencies separately in each region. This extension accounts for the fact that some fields may be more predictive of relevance than others (for example, the title will be more predictive than footnotes), and fits well in the eliteness model. The idea is that the eliteness of terms is decided beforehand, for every given document, and is the same across fields; term frequency, instead, is again modeled as in standard BM25, although it is influenced by the length of each field: of course, shorter fields (such as the title) are expected to contain more elite terms than longer ones.
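A minimal sketch of the BM25F combination [33], assuming per-field weights and length normalization (the dictionary-based interface is illustrative, not the paper's code):

```python
def bm25f_tf(tf_by_field, len_by_field, avdl_by_field, weight, b):
    """Combine per-field term frequencies into one BM25F pseudo-frequency.

    Each field is length-normalized with its own b and scaled by its field
    weight; the saturation U(x) is then applied once to the combined value,
    so evidence from several fields is not double-counted.
    """
    combined = 0.0
    for f, tf in tf_by_field.items():
        norm = (1.0 - b[f]) + b[f] * len_by_field[f] / avdl_by_field[f]
        combined += weight[f] * tf / norm
    return combined
```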

BM25

• Term (tf) independence

• Vague prior over terms not appearing in the query

• Eliteness: a topical model that perturbs the word distribution

• 2-Poisson distribution of term frequencies over relevant and non-relevant documents

$$w^{\mathrm{BM25}}_{t,d} = \frac{\overline{\mathrm{tf}}_{t,d}}{\overline{\mathrm{tf}}_{t,d} + k_1} \cdot w^{\mathrm{idf}}_t$$

Page 9: Extending BM25 with multiple query operators

Virtual regions

• Different parts of the document provide different evidence of relevance

• Create a (finite) set of (latent) artificial regions and re-weight

$$p(\mathrm{tf}^j_{t,d} = x \mid r) = \sum_{e \in \{0,1\}} p(\mathrm{tf}^j_{t,d} = x \mid e, r)\, p(e \mid r) = \sum_{q \in Q} \sum_{e,\varphi \in \{0,1\}} p(\mathrm{tf}^j_{t,d} = x \mid \varphi_j = \varphi, e, r)\, p(\varphi_j = \varphi \mid e, r)\, p(e \mid r)$$

[Plate diagram: for t ∈ q and j ∈ M, variables R, tf^j_t, E_t, φ_j.]

Page 10: Extending BM25 with multiple query operators

Implementation

• An operator maps a query to a set of queries, which could match a document

• Each operator has a weight

• The average term frequency in a document is

$$\overline{\mathrm{tf}}_{t,d} = \sum_{j \in M} \frac{w_j \cdot \mathrm{tf}^j_{t,d}}{(1 - b_j) + b_j \cdot |d| / \mathrm{avdl}}.$$
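Putting the pieces together, a minimal sketch of the scoring loop; function names and the precomputed `idf` table are assumptions, and the per-region match counts $\mathrm{tf}^j_{t,d}$ are taken as given:

```python
def combined_tf(tf_by_region, w, b, dl, avdl):
    """Aggregate per-operator (virtual region) match counts tf^j_{t,d}
    into one pseudo term frequency, following the formula above."""
    return sum(w[j] * tf_j / ((1.0 - b[j]) + b[j] * dl / avdl)
               for j, tf_j in tf_by_region.items())

def score(doc_tfs, idf, dl, avdl, w, b, k1=1.2):
    """BM25 score of a document: saturate the aggregated pseudo-frequency
    once per term, then sum the idf-weighted contributions."""
    total = 0.0
    for t, tf_by_region in doc_tfs.items():  # doc_tfs: term -> {operator: count}
        tf_bar = combined_tf(tf_by_region, w, b, dl, avdl)
        total += tf_bar / (tf_bar + k1) * idf[t]
    return total
```

Because saturation is applied once, on the aggregate, overlapping operators reinforce the same evidence rather than double-counting it, in line with the “no double counts!” requirement on slide 4.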

Page 11: Extending BM25 with multiple query operators

Remarks

• Different saturation (eliteness) function?

• learn the real functional shape!

• log-logistic is good if the class-conditional distributions are drawn from an exponential family

• Positions as variables?

• kernel-like method or an exponential number of parameters

• Apply operators on a per-query or per-query-class basis?

Page 12: Extending BM25 with multiple query operators

Operator examples

• BOW: maps a raw query to the set of queries whose elements are the single terms

• p-grams: the set of all p-grams of consecutive terms

• p-and: all conjunctions of p arbitrary terms

• segments: match only the “concepts”

• Enlargement: some words may sneak in between the phrases/segments

(a sketch of the first three operators follows below)
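Since an operator maps a query to a set of queries (slide 10), the first three examples can be sketched as plain functions; the helper names are hypothetical:

```python
from itertools import combinations

def bow(q):
    """BOW: one single-term query per term."""
    return [(t,) for t in q]

def p_grams(q, p):
    """All p-grams of consecutive terms."""
    return [tuple(q[i:i + p]) for i in range(len(q) - p + 1)]

def p_and(q, p):
    """All conjunctions of p arbitrary terms."""
    return list(combinations(q, p))

q = ["new", "york", "greatest", "pizza"]
print(p_grams(q, 2))  # [('new', 'york'), ('york', 'greatest'), ('greatest', 'pizza')]
print(p_and(q, 2))    # all 6 unordered term pairs
```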

Page 13: Extending BM25 with multiple query operators

Experimental set-up

• TREC GOV2

• 150 topics (3 sets)

• Web sample (~100 GB)

• 1000 + 400 topics (2 sets)

• Weight learning only (coordinate ascent; a sketch follows below)

• Train/test split, 10-fold CV
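The slides do not detail the learning procedure beyond naming coordinate ascent; here is a minimal sketch of that idea, assuming an `evaluate` callback that computes the target metric (e.g. MAP) on held-out queries for a given weight assignment:

```python
def coordinate_ascent(weights, evaluate,
                      grid=(0.0, 0.1, 0.25, 0.5, 1.0, 2.0), sweeps=5):
    """Greedy coordinate ascent over operator weights.

    weights : dict mapping operator name -> weight w_j
    evaluate: callable(weights) -> retrieval metric to maximize (e.g. MAP)
    """
    best = evaluate(weights)
    for _ in range(sweeps):
        for j in list(weights):           # optimize one coordinate at a time
            for v in grid:
                trial = {**weights, j: v}
                m = evaluate(trial)
                if m > best:              # keep the best value for coordinate j
                    best, weights = m, trial
    return weights, best
```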

Page 14: Extending BM25 with multiple query operators

Research questions

• R1: How do the operators work individually, combined with BM25(F)?

• R2: How does the combination of operators work?

• R3: Is there a difference in difficult queries?

Page 15: Extending BM25 with multiple query operators

Single operators results

                    TERA04                   TERA05                   TERA06
                    MAP     P@10   P@20     MAP     P@10   P@20     MAP     P@10   P@20
BM25                0.2648  0.5327 0.5071   0.3228  0.6140 0.5600   0.2928  0.5380 0.5140
BM25F               0.2697  0.5510 0.5143   0.3284  0.6200 0.5570   0.2935  0.5460 0.5160
p-AND p=2           0.2679  0.5429 0.5286   0.3396* 0.6260 0.5920   0.3069  0.5900 0.5300
phrasal µ=3         0.2673  0.5306 0.4939   0.3369* 0.6060 0.5790   0.3082* 0.5720 0.5290
segment µ=1         0.2685* 0.5143 0.4970   0.3274  0.6080 0.5580   0.3180* 0.5860 0.5350
segment µ=1.5       0.2695* 0.5347 0.5010   0.3272  0.6140 0.5620   0.3122* 0.5940 0.5460
segment µ=3         0.2690* 0.5143 0.5204   0.3295* 0.6300 0.5840   0.3277* 0.5940 0.5460
p-grams p=2, µ=1    0.2786* 0.5530 0.5120   0.3294  0.5860 0.5470   0.3137* 0.5860 0.5470
p-grams p=3, µ=1    0.2670  0.5060 0.4950   0.3272  0.5980 0.5560   0.3198* 0.6160 0.5600

Table 2: Performance of single operators (* = statistical significance at p < 0.05 using a one-sided t-test with respect to BM25, MAP only).

operator. The purpose of this experiment is to determine whether the sigmoid-like term-frequency normalizing function is able to accommodate different features that stem from the same source of evidence (matches of the query terms in the document). Results are significantly better than the baseline and outperform state-of-the-art ranking functions that use only matching of query terms (note that we are not adding query-independent evidence, like link information, click-through data, etc.). For instance, the best MAP value for TREC 2004 at TERA04 [14] was 0.2844, the sequential dependence model of the Markov random field for IR peaks at MAP 0.2832 [26], and the two-stage segmentation model of Bendersky et al. had an average MAP of 0.2711 over the 150 topics [5].

In order to further explore the usefulness of query segmentation and p-grams for difficult queries, we selected a sub-sample of 1000 queries taken from Yahoo! Search; the data corpus was a 100GB sub-sample of the Web. The queries had been evaluated by a trained editorial team, with about 130 judged documents per query on average; relevance was assigned on a 4-level scale, from Bad to Excellent. Given that we had graded relevance available, we report on NDCG (gain values at the relevance level, from 0 to 4) and MAP. The average query length was 3.14. We report the performance of BM25 and BM25F as baselines. In addition, we selected a sub-sample of 400 queries where the user herself had employed query segments (i.e., phrasal queries), and removed the segmentation information. This experiment was aimed at establishing whether a more elaborate query interpretation mechanism can help in these cases.

Table 4 shows that segmentation and p-grams, when mixed with the operator combination, are able to improve the performance of a large number of Web queries. When looking at the differences between the regular and phrasal queries, we observe that the gains, even if consistent over the two different sets, are slightly higher in the second group (≈ 12.3% vs. ≈ 7.1% in MAP for p-grams vs. BM25F). This fact indicates that the method is helpful for queries that contain difficult concepts, as these have been expressed manually by users, while maintaining the performance of queries for which identifying concepts is not so critical and standard bag-of-words approaches perform just as well.

Anecdotal study. It is useful to look into the reasons behind the increased retrieval performance we observed in our experiments; with this goal, we considered separately the mean average precision on each of the TREC topics 701-850 and examined the ten queries that produced the largest difference in precision between our system and the BM25 baseline. Table 5 shows the queries that obtained the greatest benefit from the use of segmentation and/or p-gram operators: in the rightmost column of the table, we identify the segments (or the p-grams) that contributed most to the increased precision, allowing the system to retrieve relevant documents that BM25 missed (because they were ranked too low).

7. CONCLUSIONS AND FUTURE WORK

In this paper we proposed a way to extend the probabilistic relevance framework with a notion of virtual region based on the use of operators applied to the query. Our experiments (both on standard collections, such as TREC, and on other Web-like repertoires) show that the use of virtual regions is especially beneficial for hard queries where positional information is actually precious. The method has room for improvement, and further study should be undertaken to understand which operators are more useful and under which circumstances. Even if we explored a reasonable number of combinations, we have not made any systematic attempt to develop a method to select the individual best operators; we just limited ourselves to handpicking a few, since it was out of the scope of this paper to analyze automated operator selection. In contrast, a machine learning approach [3] would derive features for as many operators as possible and try to combine them by optimizing a loss function of MAP or NDCG [38]. One might even envisage the adoption of a query-classification tool to decide which operators should be used, based on the presumed nature of the query. In any case, our experiments show the practical usefulness of the non-linear operator score combination for retrieval [33]. As a final remark, it is interesting to observe that most of the studied operators (actually, all of them except for segments) do not employ any source of external information, and still produce a significant performance improvement.

8. REFERENCES

[1] E. Agichtein, E. Brill, S. Dumais, and R. Ragno. Learning user interaction models for predicting web search result preferences. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 3–10. ACM, 2006.

[2] R. Baeza-Yates. Query intent prediction and recommendation. In Proceedings of the fourth ACM conference on Recommender systems, pages 5–6. ACM, 2010.

Page 16: Extending BM25 with multiple query operators

Combination of operators

                    TERA04                   TERA05                   TERA06
                    MAP     P@10   P@20     MAP     P@10   P@20     MAP     P@10   P@20
BM25                0.2648  0.5327 0.5071   0.3228  0.6140 0.5600   0.2928  0.5380 0.5140
BM25F               0.2697  0.5510 0.5143   0.3284  0.6200 0.5570   0.2935  0.5460 0.5160
BM25 + p-grams      0.2898* 0.5939 0.5531   0.3368* 0.6260 0.5800   0.3361* 0.6000 0.5048
BM25 + segment      0.2866* 0.5653 0.5306   0.3285  0.5980 0.5710   0.3188  0.5720 0.5100
BM25F + p-grams     0.2908* 0.5959 0.5541   0.3398* 0.6280 0.5830   0.3319* 0.5940 0.5350
BM25F + segment     0.2869* 0.5653 0.5306   0.3282* 0.5920 0.5510   0.3315* 0.5940 0.5350

Table 3: Performance of a combination of operators (* = statistical significance at p < 0.05 using a one-sided t-test with respect to BM25, MAP only). Configuration for p-grams is p = 2, µ = {1, 2, 3}, and for segment is µ = {1, 2, 3}.

                    WEB                       WEB-phrasal
                    MAP     NDCG    P@10     MAP     NDCG    P@10
BM25                0.1553  0.2728  0.3892   0.1350  0.2345  0.3133
BM25F               0.1722* 0.3008  0.4285   0.1575* 0.2842  0.3747
BM25 + p-grams      0.1822* 0.3126  0.4390   0.1750* 0.3078  0.4010
BM25 + segment      0.1810* 0.3126  0.4344   0.1725* 0.3029  0.4040
BM25F + p-grams     0.1854* 0.3216  0.4400   0.1769* 0.3123  0.4101
BM25F + segment     0.1842* 0.3190  0.4392   0.1749* 0.3165  0.4005

Table 4: Performance of a combination of operators on the WEB collection (* = statistical significance at p < 0.05 using a one-sided t-test with respect to BM25, MAP only). Configuration for p-grams is p = 2, µ = {1, 2, 3}, and for segment is µ = {1, 2, 3}.

[3] B. Bai, J. Weston, D. Grangier, R. Collobert, K. Sadamasa, Y. Qi, O. Chapelle, and K. Weinberger. Learning to rank with (a lot of) word features. Information Retrieval, 13:291–314, 2010.

[4] A. Banerjee. An analysis of logistic models: exponential family connections and online performance. In SIAM International Conference on Data Mining, 2007.

[5] M. Bendersky, B. Croft, and D. A. Smith. Two-stage query segmentation for information retrieval. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, SIGIR '09, page 810, 2009.

[6] M. Bendersky, D. Metzler, and W. Croft. Learning concept importance using a weighted dependence model. In Proceedings of the third ACM international conference on Web search and data mining, pages 31–40. ACM, 2010.

[7] S. Bergsma and Q. Wang. Learning noun phrase query segmentation. In Proc. of EMNLP-CoNLL, pages 819–826, 2007.

[8] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, Mar. 2003.

[9] P. Boldi and S. Vigna. MG4J at TREC 2005. In E. M. Voorhees and L. P. Buckland, editors, The Fourteenth Text REtrieval Conference (TREC 2005) Proceedings, number SP 500-266 in Special Publications. NIST, 2005. http://mg4j.dsi.unimi.it/.

[10] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1-7):107–117, Apr. 1998.

[11] S. Büttcher, C. L. A. Clarke, and B. Lushman. Term proximity scoring for ad-hoc retrieval on very large text collections. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '06, pages 621–622, New York, NY, USA, 2006. ACM.

[12] S. Büttcher, C. L. A. Clarke, and I. Soboroff. The TREC 2006 terabyte track. In TREC, 2006.

[13] C. L. A. Clarke, G. V. Cormack, and F. J. Burkowski. An algebra for structured text search and a framework for its implementation. The Computer Journal, 38:43–56, 1995.

[14] C. L. A. Clarke, N. Craswell, and I. Soboroff. Overview of the TREC 2004 terabyte track. In TREC, 2004.

[15] C. L. A. Clarke, F. Scholer, and I. Soboroff. The TREC 2005 terabyte track. In TREC, 2005.

[16] W. S. Cooper. Some inconsistencies and misnomers in probabilistic information retrieval. In Proceedings of the 14th annual international ACM SIGIR conference on Research and development in information retrieval, pages 57–61, 1991.

[17] N. Craswell, S. Robertson, H. Zaragoza, and M. Taylor. Relevance weighting for query independent evidence. In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 416–423. ACM, 2005.

[18] P. Domingos, S. Kok, H. Poon, M. Richardson, and P. Singla. Unifying logical and statistical AI. In Proceedings of the Twenty-First National Conference on Artificial Intelligence, pages 2–7, 2006.

[19] M. Hagen, M. Potthast, B. Stein, and C. Braeutigam. The power of naive query segmentation. In Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval, pages 797–798. ACM, 2010.

Page 17: Extending BM25 with multiple query operators

Phrases vs regular queries (see Table 4 above: WEB vs. WEB-phrasal)


Page 18: Extending BM25 with multiple query operators

Examples (that worked)

Topic no.  BM25 MAP  Operators MAP  Query                                   Which segment / p-gram helped?
810        0.2759    0.5243         Timeshare resales                       “Timeshare resales”
750        0.0903    0.3251         John Edwards womens’ issues             “John Edwards” and “womens’ issue”
806        0.1279    0.3446         Doctors Without Borders                 “Doctors Without Borders”
834        0.2819    0.4859         Global positioning system earthquakes   “Global positioning”
745        0.1801    0.3828         Doomsday cults                          “Doomsday cult”
835        0.1306    0.3222         Big Dig pork                            “Big Dig”
732        0.1438    0.3309         U.S. cheese production                  “U.S.” and “cheese production”
723        0.1187    0.3030         Executive privilege                     “Executive privilege”
843        0.2420    0.4228         Pol Pot                                 “Pol Pot”
785        0.2717    0.4323         Ivory billed woodpecker                 “Ivory billed”

Table 5: The ten queries that produced the largest gap between standard BM25 and our operator-based scoring. The methods in this table use a combination of BM25, segments, and 2-grams.

[20] Y. Li, B.-J. P. Hsu, C. Zhai, and K. Wang. Unsupervised query segmentation using clickthrough for information retrieval. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, pages 285–294, 2011.

[21] T.-Y. Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3):225–331, 2009.

[22] Y. Lu, F. Peng, G. Mishne, X. Wei, and B. Dumoulin. Improving web search relevance with semantic features. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Volume 2, EMNLP '09, page 648, Morristown, NJ, USA, 2009. Association for Computational Linguistics.

[23] Y. Lv and C. Zhai. Positional language models for information retrieval. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, pages 299–306. ACM, 2009.

[24] Y. Lv and C. Zhai. Positional relevance model for pseudo-relevance feedback. In Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval, pages 579–586. ACM, 2010.

[25] Y. Lv and C. Zhai. A log-logistic model-based interpretation of TF normalization of BM25. In Proceedings of the 34th European Conference on Information Retrieval, ECIR '12, 2012.

[26] D. Metzler and B. Croft. A Markov random field model for term dependencies. In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '05, page 472, 2005.

[27] J. Ponte and W. B. Croft. A language modeling approach to information retrieval. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, pages 275–281, 1998.

[28] L. Ramshaw and M. Marcus. Text chunking using transformation-based learning. In Proceedings of the Third ACL Workshop on Very Large Corpora, pages 82–94, 1995.

[29] M. Richardson and P. Domingos. Markov logic networks. Machine Learning, 62(1):107–136, 2006.

[30] S. Robertson. The probability ranking principle in IR. Journal of Documentation, 1977.

[31] S. Robertson. On event spaces and probabilistic models in information retrieval. Information Retrieval, 8(2):319–329, 2005.

[32] S. Robertson and S. Walker. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In Proceedings of the SIGIR conference on Research and development in information retrieval, 1994.

[33] S. Robertson, H. Zaragoza, and M. Taylor. Simple BM25 extension to multiple weighted fields. In Proceedings of the thirteenth ACM international conference on Information and knowledge management, CIKM '04, pages 42–49, New York, NY, USA, 2004. ACM.

[34] S. E. Robertson and H. Zaragoza. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4):333–389, 2009.

[35] K. Svore, P. Kanani, and N. Khan. How good is a span of terms?: exploiting proximity to improve web retrieval. In Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval, pages 154–161. ACM, 2010.

[36] M. Taghavi, A. Patel, N. Schmidt, C. Wills, and Y. Tew. An analysis of web proxy logs with query distribution pattern approach for search engines. Computer Standards and Interfaces, 34(1):162–170, 2012.

[37] B. Tan and F. Peng. Unsupervised query segmentation using generative language models and wikipedia. In Proceedings of the 17th international conference on World Wide Web, pages 347–356. ACM, 2008.

[38] M. Taylor, J. Guiver, S. Robertson, and T. Minka. SoftRank: optimizing non-smooth rank metrics. In Proceedings of the international conference on Web search and web data mining, WSDM '08, pages 77–86, New York, NY, USA, 2008. ACM.

[39] C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems, 22(2):179–214, Apr. 2004.

Page 19: Extending BM25 with multiple query operators

Conclusions

• Developed an extension for BM25 that accounts for different query operators (versions of the query)

• The main idea is to score individually the virtual regions that match each operator

• aggregate the evidence per query term

• squash the different signals

• Future work

• apply other query-dependent signals

• select the right operators for a query

Page 20: Extending BM25 with multiple query operators

BM25

$$\frac{p(r \mid Q = q, D = d)}{p(\bar{r} \mid Q = q, D = d)} \propto_q \frac{p(D = d \mid r, Q = q)}{p(D = d \mid \bar{r}, Q = q)} \propto_q \prod_{t \in q} \frac{p(D_t = d_t \mid r)}{p(D_t = d_t \mid \bar{r})} \propto_q \sum_{t \in q,\, d_t > 0} \log \frac{p(D_t = d_t \mid r) \cdot p(D_t = 0 \mid \bar{r})}{p(D_t = d_t \mid \bar{r}) \cdot p(D_t = 0 \mid r)} \propto_q \sum_{t \in q,\, d_t > 0} w_{t,d}$$

Page 21: Extending BM25 with multiple query operators

Feature dependencies

• Class-linearly dependent (or affine) features

• add no extra evidence/signal

• model overfitting (vs. capacity)

• Still, it is desirable to enrich the model with more involved features

• Some features are surprisingly correlated

• Positional information requires a large number of parameters to estimate: $O(|D|^{|q|})$

Page 22: Extending BM25 with multiple query operators

Query concept segmentation

• Queries are made up of basic conceptual units, comprising many words

• “Indian summer victor herbert”

• Spurious matches: “san jose airport” → “san jose city airport”

• Model to detect segments based on generative language models and Wikipedia

• Relax matches using factors of the max ratio between span length and segment length (a sketch of this relaxed matching follows below)
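A minimal sketch of this relaxed segment matching; the helper is hypothetical, and only the span-length bound µ · |segment| from the slide is modeled (the paper additionally weights relaxed matches by the enlargement factor):

```python
def segment_matches(doc_tokens, segment, mu=1.5):
    """Count relaxed occurrences of a query segment in a document.

    A window of consecutive tokens counts as a match if it contains all
    segment words in order and spans at most mu * len(segment) tokens,
    so a few extra words may sneak in between (the 'enlargement' idea).
    """
    max_span = int(mu * len(segment))
    hits = 0
    for i, tok in enumerate(doc_tokens):
        if tok != segment[0]:
            continue
        j, k = i, 0
        while j < len(doc_tokens) and j - i < max_span and k < len(segment):
            if doc_tokens[j] == segment[k]:
                k += 1
            j += 1
        if k == len(segment):
            hits += 1
    return hits

doc = "the san jose city airport is near san jose".split()
print(segment_matches(doc, ["san", "jose", "airport"], mu=2))
# 1: the span 'san jose city airport' matches; the trailing 'san jose' does not
```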