Lower-Bounding Term Frequency Normalization
Yuanhua Lv and ChengXiang Zhai, University of Illinois at Urbana-Champaign
CIKM 2011 Best Student Award Paper
Speaker: Tom
Nov 8th, 2011
It is very difficult to improve retrieval models
• BM25 [Robertson et al. 1994] (17 years)
• Pivoted length normalization (PIV) [Singhal et al. 1996] (15 years)
• Query likelihood with Dirichlet prior (DIR) [Ponte & Croft 1998; Zhai & Lafferty 2001] (10 years)
• PL2 [Amati & Rijsbergen 2002] (9 years)
All these models remain strong baselines today after so many years!
1. Why does it seem to be so hard to beat these state-of-the-art retrieval models {BM25, PIV, DIR, PL2 …}?
2. Are they hitting the ceiling?
Key heuristic in all effective retrieval models: term frequency (TF) normalization by document length [Singhal et al. 96; Fang et al. 04]
• BM25:
  Score(Q,D) = \sum_{q \in Q \cap D} \frac{(k_3+1)\,c(q,Q)}{k_3 + c(q,Q)} \cdot \frac{(k_1+1)\,c(q,D)}{k_1\left((1-b) + b\,\frac{|D|}{avdl}\right) + c(q,D)} \cdot \log\frac{N+1}{df(q)}
• DIR (query likelihood with Dirichlet prior):
  Score(Q,D) = \sum_{q \in Q \cap D} c(q,Q)\,\log\left(1 + \frac{c(q,D)}{\mu\,p(q|C)}\right) + |Q|\,\log\frac{\mu}{\mu + |D|}
Each formula combines three components: term frequency c(q,D), document length |D| (normalized by avdl), and term discrimination (\log\frac{N+1}{df(q)} in BM25, p(q|C) in DIR).
PIV and PL2 implement similar retrieval heuristics.
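To make the TF-normalization component concrete, here is a minimal Python sketch of BM25's per-term score (the function name, corpus statistics, and parameter values are illustrative assumptions, not from the slides):

```python
import math

def bm25_term_score(tf, doc_len, avdl, df, N, k1=1.2, b=0.75):
    """Score contribution of one matched query term under BM25.

    tf = c(t, D), doc_len = |D|, avdl = average document length,
    df = document frequency of the term, N = number of documents.
    """
    idf = math.log((N + 1) / df)                  # term discrimination
    norm = k1 * ((1 - b) + b * doc_len / avdl)    # TF normalization by length
    return (k1 + 1) * tf / (norm + tf) * idf      # saturating TF times idf

# The same single match is worth less and less as the document grows,
# with no lower bound: the score slides toward 0.
short_doc = bm25_term_score(tf=1, doc_len=100, avdl=500, df=10, N=1000)
long_doc = bm25_term_score(tf=1, doc_len=5000, avdl=500, df=10, N=1000)
assert long_doc < short_doc
```

This is exactly the behavior the next slides analyze: the TF component has no positive lower bound as |D| grows.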
However, the component of TF normalization by document length is NOT lower-bounded properly
• BM25:
  Score(Q,D) = \sum_{q \in Q \cap D} \frac{(k_3+1)\,c(q,Q)}{k_3 + c(q,Q)} \cdot \frac{(k_1+1)\,c(q,D)}{k_1\left((1-b) + b\,\frac{|D|}{avdl}\right) + c(q,D)} \cdot \log\frac{N+1}{df(q)}
• DIR (query likelihood with Dirichlet prior):
  Score(Q,D) = \sum_{q \in Q \cap D} c(q,Q)\,\log\left(1 + \frac{c(q,D)}{\mu\,p(q|C)}\right) + |Q|\,\log\frac{\mu}{\mu + |D|}
As |D| → ∞, BM25's TF component tends to 0, and DIR's length penalty |Q| log(µ/(µ+|D|)) grows without bound.
When a document is very long, its score from matching a query term could be too small!
As a result, long documents could be overly penalized
D2 matches the query term, while D1 does not
[Figure: score as a function of document length for PL2 (left) and DIR (right); in both panels, beyond some document length, S(D2) < S(D1).]
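This behavior is easy to reproduce numerically. The sketch below (with my own toy numbers; µ = 2000 and the collection probability are illustrative assumptions) scores two such documents with the Dirichlet-smoothed query likelihood formula:

```python
import math

def dir_score(query_tf, doc_tf, doc_len, p_coll, mu=2000.0):
    """Query likelihood with Dirichlet prior smoothing, in log scale.

    query_tf: {term: c(t,Q)}, doc_tf: {term: c(t,D)},
    p_coll: {term: p(t|C)}, the collection language model.
    """
    q_len = sum(query_tf.values())
    score = q_len * math.log(mu / (mu + doc_len))          # length penalty
    for t, ctq in query_tf.items():
        ctd = doc_tf.get(t, 0)
        if ctd:
            score += ctq * math.log(1 + ctd / (mu * p_coll[t]))
    return score

query = {"w": 1}
p_coll = {"w": 0.001}   # a fairly common, non-discriminative term
s1 = dir_score(query, {}, doc_len=100, p_coll=p_coll)            # D1: short, no match
s2 = dir_score(query, {"w": 1}, doc_len=10000, p_coll=p_coll)    # D2: very long, one match
assert s2 < s1   # S(D2) < S(D1): the matching document is ranked lower
```

Here the length penalty on the very long D2 outweighs its reward for actually containing the query term.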
Empirical evidence: long documents indeed overly penalized
Prob. of relevance/retrieval: the probability of a randomly selected relevant/retrieved document having a certain document length [Singhal et al. 96]
[Figure: probability of relevance and probability of retrieval plotted as functions of document length.]
White-box testing: functionality analysis of retrieval models
Bug: TF normalization is not lower-bounded properly, and long documents are overly penalized.
Are these retrieval models sharing this similar bug because they all violate some necessary retrieval heuristics? Can we formally capture these necessary heuristics?
Two novel heuristics for regulating the interactions between TF and doc. length
• (LB1) There should be a sufficiently large gap between the presence and absence of a query term: document length normalization should not cause a very long document with a non-zero TF to receive a score too close to, or even lower than, a short document with a zero TF.
• (LB2) A short document that only covers a very small subset of the query terms should not easily dominate a very long document that contains many distinct query terms.
Lower-bounding constraint 1 (LB1):Occurrence > Non-Occurrence
Q = {w};  Q' = {w, q}
D1 contains w but not q; D2 contains both w and q
Premise: Score(Q, D1) = Score(Q, D2)
Requirement: Score(Q', D1) < Score(Q', D2)
Q’:w q
Lower-bounding constraint 2 (LB2):First Occurrence > Repeated Occurrence
Q = {q1, q2}
D1 contains q1; D2 contains q1, with Score(Q, D1) = Score(Q, D2)
D1' adds another occurrence of q1 to D1 (q1 q1); D2' adds one occurrence of q2 to D2 (q1 q2)
Requirement: Score(Q, D1') < Score(Q, D2')
BM25 satisfies LB1 but violates LB2
• LB1 is satisfied unconditionally (parameters: k1 > 0 and 0 < b < 1).
• LB2 is equivalent to an upper bound on the document length |D|, proportional to avdl, whose coefficient decreases monotonically in both b and k1.
Long documents tend to violate LB2, and large b or k1 makes LB2 easy to violate.
DIR satisfies LB2 but violates LB1
• LB2 is equivalent to an inequality involving only p(t|C), which is satisfied unconditionally!
• LB1 is equivalent to \frac{\mu + |D|}{\mu + avdl} < 1 + \frac{1}{\mu\,p(t|C)} (comparing a long document against one of average length).
Long documents tend to violate LB1, and large µ or non-discriminative terms (large p(t|C)) violate LB1 easily.
No retrieval model satisfies both constraints
Model LB1 LB2 Parameter and/or query restrictions
BM25 Yes No b and k1 should not be too large
PIV Yes No s should not be too large
PL2 No No c should not be too small
DIR No Yes µ should not be too large; query terms should be discriminative
Can we "fix" this problem for all the models in a general way?
Solution: a general approach to lower-bounding TF normalization
• The score of a document D from matching a query term t:
F(c(t,D), |D|, td(t)), where c(t,D) is the term frequency, |D| the document length, and td(t) the term-discrimination value.
• BM25: F(c(t,D), |D|, td(t)) = \frac{(k_1+1)\,c(t,D)}{k_1\left((1-b) + b\,\frac{|D|}{avdl}\right) + c(t,D)} \cdot \log\frac{N+1}{df(t)}
• DIR: F(c(t,D), |D|, td(t)) = \log\left(1 + \frac{c(t,D)}{\mu\,p(t|C)}\right) + \log\frac{\mu}{\mu + |D|} (query-side weight c(t,Q) omitted)
PIV and PL2 also have their corresponding components
Solution: a general approach to lower-bounding TF normalization (Cont.)
• Objective: an improved version F'(c(t,D), |D|, td(t)) that does not hurt other retrieval heuristics, but guarantees a gap between presence and absence: F'(1, |D|, td(t)) − F'(0, |D|, td(t)) ≥ l·td(t) for some constant l > 0.
• A heuristic solution: F'(c(t,D), |D|, td(t)) = F(c(t,D), |D|, td(t)) + δ·td(t) whenever c(t,D) > 0 (the constant l can be absorbed into δ), which satisfies all retrieval heuristics that are satisfied by F(c(t,D), |D|, td(t)).
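The general fix can be phrased as a wrapper around any per-term scoring function F. A minimal sketch (the functional interface and all parameter values are my own illustrative choices):

```python
import math

def make_bm25_F(k1=1.2, b=0.75, avdl=500.0):
    """BM25's per-term component F(c(t,D), |D|, td(t))."""
    def F(tf, doc_len, td):
        if tf == 0:
            return 0.0
        norm = k1 * ((1 - b) + b * doc_len / avdl)
        return (k1 + 1) * tf / (norm + tf) * td
    return F

def lower_bounded(F, delta):
    """F'(c, |D|, td) = F(c, |D|, td) + delta * td for c > 0, so a matched
    term always beats a non-match by at least delta * td, whatever |D| is."""
    def F_plus(tf, doc_len, td):
        return F(tf, doc_len, td) + (delta * td if tf > 0 else 0.0)
    return F_plus

F = make_bm25_F()
F_plus = lower_bounded(F, delta=1.0)
td = math.log(1001 / 10)   # an illustrative term-discrimination value
# The presence/absence gap stays lower-bounded no matter how long the document:
assert all(F_plus(1, L, td) - F_plus(0, L, td) >= 1.0 * td
           for L in (100, 10**4, 10**6))
```

The same wrapper idea applies to the per-term components of PIV, DIR, and PL2.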
Example: BM25+, a lower-bounded version of BM25
BM25:
  Score(Q,D) = \sum_{t \in Q \cap D} \frac{(k_3+1)\,c(t,Q)}{k_3 + c(t,Q)} \cdot \frac{(k_1+1)\,c(t,D)}{k_1\left((1-b) + b\,\frac{|D|}{avdl}\right) + c(t,D)} \cdot \log\frac{N+1}{df(t)}
BM25+:
  Score(Q,D) = \sum_{t \in Q \cap D} \frac{(k_3+1)\,c(t,Q)}{k_3 + c(t,Q)} \cdot \left[\frac{(k_1+1)\,c(t,D)}{k_1\left((1-b) + b\,\frac{|D|}{avdl}\right) + c(t,D)} + \delta\right] \cdot \log\frac{N+1}{df(t)}
BM25+ incurs almost no additional computational cost
Similarly, we can also improve PIV, DIR, and PL2, leading to PIV+, DIR+, and PL2+ respectively
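As a sketch of how little BM25+ changes the computation, here is its per-term score with the δ shift (query-side saturation omitted for brevity; the function name and parameter defaults are illustrative):

```python
import math

def bm25_plus_term(tf, doc_len, avdl, df, N, k1=1.2, b=0.75, delta=1.0):
    """BM25+ contribution of one matched term: BM25's TF-normalization
    component shifted by delta, then weighted by the idf factor."""
    if tf == 0:
        return 0.0
    idf = math.log((N + 1) / df)
    norm = k1 * ((1 - b) + b * doc_len / avdl)
    return ((k1 + 1) * tf / (norm + tf) + delta) * idf

# However long the document, one match is now worth at least delta * idf:
idf = math.log((1000 + 1) / 10)
assert bm25_plus_term(1, 10**6, 500, 10, 1000) >= 1.0 * idf
```

The only extra work per matched term is one addition, which is why the overhead is negligible.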
BM25+ can satisfy both LB1 and LB2
• Similarly to BM25, BM25+ satisfies LB1 unconditionally.
• LB2 is also satisfied unconditionally if δ is at least a threshold that depends only on k1; the threshold is below 1 for any k1 > 0.
Experiments below show that setting δ = 1.0 works very well.
The proposed approach can fix or alleviate the problem of all these retrieval models
Current retrieval models        Improved retrieval models
Model  LB1  LB2                 Model   LB1         LB2
BM25   Yes  No                  BM25+   Yes         Yes
PIV    Yes  No                  PIV+    Yes         Yes
PL2    No   No                  PL2+    Yes         Yes
DIR    No   Yes                 DIR+    Alleviated  Yes
Experiment Setup
• Standard TREC document collections:
– Web: WT2G, WT10G, and Terabyte
– News: Robust04
• Standard TREC query sets:
– Short (the title field): e.g., “Iraq foreign debt reduction”
– Verbose (the description field): e.g., “Identify any efforts, proposed or undertaken, by world governments to seek reduction of Iraq's foreign debt”
• 2-fold cross validation for parameter tuning
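The 2-fold tuning protocol can be sketched as follows (the fold construction, parameter grid, and the stand-in evaluator are illustrative assumptions; a real run would compute MAP with an actual retrieval system over TREC judgments):

```python
from itertools import product

def two_fold_cv(queries, evaluate, grid):
    """Split queries into two folds; pick the best parameters on one fold,
    score them on the other, swap, and average the two test scores."""
    half = len(queries) // 2
    folds = [queries[:half], queries[half:]]
    scores = []
    for train, test in [(folds[0], folds[1]), (folds[1], folds[0])]:
        best = max(grid, key=lambda params: evaluate(params, train))
        scores.append(evaluate(best, test))
    return sum(scores) / 2

# Illustrative use with a stand-in evaluator that peaks at (k1, b) = (1.2, 0.5):
grid = list(product([0.8, 1.2, 2.0], [0.3, 0.5, 0.75]))   # (k1, b) pairs
fake_eval = lambda params, qs: -abs(params[0] - 1.2) - abs(params[1] - 0.5)
assert two_fold_cv(list(range(10)), fake_eval, grid) == 0.0
```

Reporting the average of the two held-out scores keeps the tuned parameters from being evaluated on the queries they were fit to.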
BM25+ improves over BM25 significantly
BM25+ performs better on Web data than on News data
[Table: BM25 vs. BM25+ on the Web collections and the News collection]
Superscripts 1/2/3/4 indicate significance at the 0.05/0.02/0.01/0.001 levels.
δ = 1.0 works well, confirming the constraint analysis: the lower bound on δ required for LB2 is below 1 for any k1.
BM25+ performs better on verbose queries?
[Figure: results for short vs. verbose queries; σ = 2.31, σ = 2.63, σ = 1.19]
BM25 overly penalizes long documents more seriously for verbose queries
The “condition” under which BM25 violates LB2 is a lower bound on |D|, proportional to avdl, that decreases monotonically with b and k1.
The optimal settings of b & k1 are larger for verbose queries
The improvement indeed comes from alleviating the problem of overly-penalizing long docs
[Figure: probability curves for BM25 (short), BM25 (verbose), BM25+ (short), and BM25+ (verbose)]
DIR+ improves over DIR significantly
Fixing δ = 0.05 works very well.
DIR+ performs better on verbose than on short queries. Why?
Superscripts 1/2/3/4 indicate significance at the 0.05/0.02/0.01/0.001 levels.
[Table: DIR vs. DIR+ for short and verbose queries]
DIR can only satisfy LB1 if \frac{\mu + |D|}{\mu + avdl} < 1 + \frac{1}{\mu\,p(t|C)}.
[Table: optimal µ settings]
PL2+ improves over PL2 significantly
Fixing δ = 0.8 works very well.
PL2+ performs better on verbose than on short queries.
Superscripts 1/2/3/4 indicate significance at the 0.05/0.02/0.01/0.001 levels.
[Table: PL2 vs. PL2+ for short and verbose queries]
Optimal settings of c: the smaller, the more dangerous.
PIV+ works as we expected
PIV+ does not consistently outperform PIV, as we expected.
Superscript 1 indicates significance at the 0.05 level.
PIV can satisfy LB2 if s\left(\frac{|D|}{avdl} - 1\right) \le 0.899; this is fine, as the optimal settings of s are very small.
1. Why does it seem to be so hard to beat these state-of-the-art retrieval models {BM25, PIV, DIR, PL2 …}?
2. Are they hitting the ceiling?
We weren’t able to figure out their deficiency analytically.
No, they haven’t hit the ceiling yet!
Conclusions
• Reveal a common deficiency of current retrieval models
• Propose two novel formal constraints
• Show that current retrieval models do not satisfy both constraints, and that retrieval performance tends to be poor if either constraint is violated
• Develop a general and efficient solution, which has been shown analytically to fix/alleviate the problem of current retrieval models
• Demonstrate the effectiveness of the proposed algorithms across different collections for different types of queries
Our models {BM25+, DIR+, PL2+} can potentially replace current state-of-the-art retrieval models {BM25, DIR, PL2}
BM25:
  Score(Q,D) = \sum_{t \in Q \cap D} \frac{(k_3+1)\,c(t,Q)}{k_3 + c(t,Q)} \cdot \frac{(k_1+1)\,c(t,D)}{k_1\left((1-b) + b\,\frac{|D|}{avdl}\right) + c(t,D)} \cdot \log\frac{N+1}{df(t)}
BM25+ (with δ = 1.0):
  Score(Q,D) = \sum_{t \in Q \cap D} \frac{(k_3+1)\,c(t,Q)}{k_3 + c(t,Q)} \cdot \left[\frac{(k_1+1)\,c(t,D)}{k_1\left((1-b) + b\,\frac{|D|}{avdl}\right) + c(t,D)} + 1.0\right] \cdot \log\frac{N+1}{df(t)}
Future work
• This work has demonstrated the power of axiomatic analysis for fixing deficiencies of retrieval models. Are there any other deficiencies of current retrieval models? If so, can we solve them with axiomatic analysis?
• Can we go beyond bag-of-words with constraint analysis?
• Can we find a comprehensive set of constraints that are sufficient for deriving a unique (optimal) retrieval function?