Lower-Bounding Term Frequency Normalization
Yuanhua Lv and ChengXiang Zhai, University of Illinois at Urbana-Champaign
CIKM 2011 Best Student Award Paper
Speaker: Tom
Nov 8th, 2011
It is very difficult to improve retrieval models
• BM25 [Robertson et al. 1994] (17 years)
• Pivoted length normalization (PIV) [Singhal et al. 1996] (15 years)
• Query likelihood with Dirichlet prior (DIR) [Ponte & Croft 1998; Zhai & Lafferty 2001] (10 years)
• PL2 [Amati & Rijsbergen 2002] (9 years)
All these models remain strong baselines today after so many years!
1. Why does it seem to be so hard to beat these state-of-the-art retrieval models {BM25, PIV, DIR, PL2 …}?
2. Are they hitting the ceiling?
Key heuristic in all effective retrieval models: term frequency (TF) normalization by document length [Singhal et al. 96; Fang et al. 04]
• BM25:
  Score(Q,D) = \sum_{q \in Q \cap D} \frac{(k_3+1)\,c(q,Q)}{k_3 + c(q,Q)} \cdot \frac{(k_1+1)\,c(q,D)}{k_1\left((1-b) + b\,\frac{|D|}{avdl}\right) + c(q,D)} \cdot \log\frac{N+1}{df(q)}
• DIR (query likelihood with Dirichlet prior):
  Score(Q,D) = \sum_{q \in Q \cap D} c(q,Q)\,\log\left(1 + \frac{c(q,D)}{\mu\,p(q|C)}\right) + |Q|\,\log\frac{\mu}{\mu + |D|}
Each formula combines three components: term frequency c(q,D), document length |D| (normalized by avdl), and term discrimination (\log\frac{N+1}{df(q)} in BM25, p(q|C) in DIR).
PIV and PL2 implement similar retrieval heuristics.
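To make the TF-normalization component concrete, here is a minimal Python sketch of BM25's per-term score (the function name, corpus statistics, and parameter values are illustrative assumptions, not from the slides):

```python
import math

def bm25_term_score(tf, doc_len, avdl, df, N, k1=1.2, b=0.75):
    """Score contribution of one matched query term under BM25.

    tf = c(t, D), doc_len = |D|, avdl = average document length,
    df = document frequency of the term, N = number of documents.
    """
    idf = math.log((N + 1) / df)                  # term discrimination
    norm = k1 * ((1 - b) + b * doc_len / avdl)    # TF normalization by length
    return (k1 + 1) * tf / (norm + tf) * idf      # saturating TF times idf

# The same single match is worth less and less as the document grows,
# with no lower bound: the score slides toward 0.
short_doc = bm25_term_score(tf=1, doc_len=100, avdl=500, df=10, N=1000)
long_doc = bm25_term_score(tf=1, doc_len=5000, avdl=500, df=10, N=1000)
assert long_doc < short_doc
```

This is exactly the behavior the next slides analyze: the TF component has no positive lower bound as |D| grows.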
However, the component of TF normalization by document length is NOT lower-bounded properly
• BM25:
  Score(Q,D) = \sum_{q \in Q \cap D} \frac{(k_3+1)\,c(q,Q)}{k_3 + c(q,Q)} \cdot \frac{(k_1+1)\,c(q,D)}{k_1\left((1-b) + b\,\frac{|D|}{avdl}\right) + c(q,D)} \cdot \log\frac{N+1}{df(q)}
• DIR (query likelihood with Dirichlet prior):
  Score(Q,D) = \sum_{q \in Q \cap D} c(q,Q)\,\log\left(1 + \frac{c(q,D)}{\mu\,p(q|C)}\right) + |Q|\,\log\frac{\mu}{\mu + |D|}
As |D| → ∞, BM25's TF component tends to 0, and DIR's length penalty |Q| log(µ/(µ+|D|)) grows without bound.
When a document is very long, its score from matching a query term could be too small!
As a result, long documents could be overly penalized
D2 matches the query term, while D1 does not
[Figure: score as a function of document length for PL2 (left) and DIR (right); in both panels, beyond some document length, S(D2) < S(D1).]
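This behavior is easy to reproduce numerically. The sketch below (with my own toy numbers; µ = 2000 and the collection probability are illustrative assumptions) scores two such documents with the Dirichlet-smoothed query likelihood formula:

```python
import math

def dir_score(query_tf, doc_tf, doc_len, p_coll, mu=2000.0):
    """Query likelihood with Dirichlet prior smoothing, in log scale.

    query_tf: {term: c(t,Q)}, doc_tf: {term: c(t,D)},
    p_coll: {term: p(t|C)}, the collection language model.
    """
    q_len = sum(query_tf.values())
    score = q_len * math.log(mu / (mu + doc_len))          # length penalty
    for t, ctq in query_tf.items():
        ctd = doc_tf.get(t, 0)
        if ctd:
            score += ctq * math.log(1 + ctd / (mu * p_coll[t]))
    return score

query = {"w": 1}
p_coll = {"w": 0.001}   # a fairly common, non-discriminative term
s1 = dir_score(query, {}, doc_len=100, p_coll=p_coll)            # D1: short, no match
s2 = dir_score(query, {"w": 1}, doc_len=10000, p_coll=p_coll)    # D2: very long, one match
assert s2 < s1   # S(D2) < S(D1): the matching document is ranked lower
```

Here the length penalty on the very long D2 outweighs its reward for actually containing the query term.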
Empirical evidence: long documents indeed overly penalized
Prob. of relevance/retrieval: the probability of a randomly selected relevant/retrieved document having a certain document length [Singhal et al. 96]
[Figure: probability of relevance and probability of retrieval plotted as functions of document length.]
White-box testing: functionality analysis of retrieval models
Bug: TF normalization is not lower-bounded properly, and long documents are overly penalized.
Are these retrieval models sharing this similar bug because they all violate some necessary retrieval heuristics? Can we formally capture these necessary heuristics?
Two novel heuristics for regulating the interactions between TF and doc. length
• (LB1) There should be a sufficiently large gap between the presence and absence of a query term: document length normalization should not cause a very long document with a non-zero TF to receive a score too close to, or even lower than, a short document with a zero TF.
• (LB2) A short document that only covers a very small subset of the query terms should not easily dominate a very long document that contains many distinct query terms.
Lower-bounding constraint 1 (LB1):Occurrence > Non-Occurrence
Q = {w};  Q' = {w, q}
D1 contains w but not q; D2 contains both w and q
Premise: Score(Q, D1) = Score(Q, D2)
Requirement: Score(Q', D1) < Score(Q', D2)
Q’:w q
Lower-bounding constraint 2 (LB2):First Occurrence > Repeated Occurrence
Q = {q1, q2}
D1 contains q1; D2 contains q1, with Score(Q, D1) = Score(Q, D2)
D1' adds another occurrence of q1 to D1 (q1 q1); D2' adds one occurrence of q2 to D2 (q1 q2)
Requirement: Score(Q, D1') < Score(Q, D2')
BM25 satisfies LB1 but violates LB2
• LB1 is satisfied unconditionally (parameters: k1 > 0 and 0 < b < 1).
• LB2 is equivalent to an upper bound on the document length |D|, proportional to avdl, whose coefficient decreases monotonically in both b and k1.
Long documents tend to violate LB2, and large b or k1 makes LB2 easy to violate.
DIR satisfies LB2 but violates LB1
• LB2 is equivalent to an inequality involving only p(t|C), which is satisfied unconditionally!
• LB1 is equivalent to \frac{\mu + |D|}{\mu + avdl} < 1 + \frac{1}{\mu\,p(t|C)} (comparing a long document against one of average length).
Long documents tend to violate LB1, and large µ or non-discriminative terms (large p(t|C)) violate LB1 easily.
No retrieval model satisfies both constraints
Model LB1 LB2 Parameter and/or query restrictions
BM25 Yes No b and k1 should not be too large
PIV Yes No s should not be too large
PL2 No No c should not be too small
DIR No Yes µ should not be too large; query terms should be discriminative
Can we "fix" this problem for all the models in a general way?
Solution: a general approach to lower-bounding TF normalization
• The score of a document D from matching a query term t:
F(c(t,D), |D|, td(t)), where c(t,D) is the term frequency, |D| the document length, and td(t) the term-discrimination value.
• BM25: F(c(t,D), |D|, td(t)) = \frac{(k_1+1)\,c(t,D)}{k_1\left((1-b) + b\,\frac{|D|}{avdl}\right) + c(t,D)} \cdot \log\frac{N+1}{df(t)}
• DIR: F(c(t,D), |D|, td(t)) = \log\left(1 + \frac{c(t,D)}{\mu\,p(t|C)}\right) + \log\frac{\mu}{\mu + |D|} (query-side weight c(t,Q) omitted)
PIV and PL2 also have their corresponding components
Solution: a general approach to lower-bounding TF normalization (Cont.)
• Objective: an improved version F'(c(t,D), |D|, td(t)) that does not hurt other retrieval heuristics, but guarantees a gap between presence and absence: F'(1, |D|, td(t)) − F'(0, |D|, td(t)) ≥ l·td(t) for some constant l > 0.
• A heuristic solution: F'(c(t,D), |D|, td(t)) = F(c(t,D), |D|, td(t)) + δ·td(t) whenever c(t,D) > 0 (the constant l can be absorbed into δ), which satisfies all retrieval heuristics that are satisfied by F(c(t,D), |D|, td(t)).
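The general fix can be phrased as a wrapper around any per-term scoring function F. A minimal sketch (the functional interface and all parameter values are my own illustrative choices):

```python
import math

def make_bm25_F(k1=1.2, b=0.75, avdl=500.0):
    """BM25's per-term component F(c(t,D), |D|, td(t))."""
    def F(tf, doc_len, td):
        if tf == 0:
            return 0.0
        norm = k1 * ((1 - b) + b * doc_len / avdl)
        return (k1 + 1) * tf / (norm + tf) * td
    return F

def lower_bounded(F, delta):
    """F'(c, |D|, td) = F(c, |D|, td) + delta * td for c > 0, so a matched
    term always beats a non-match by at least delta * td, whatever |D| is."""
    def F_plus(tf, doc_len, td):
        return F(tf, doc_len, td) + (delta * td if tf > 0 else 0.0)
    return F_plus

F = make_bm25_F()
F_plus = lower_bounded(F, delta=1.0)
td = math.log(1001 / 10)   # an illustrative term-discrimination value
# The presence/absence gap stays lower-bounded no matter how long the document:
assert all(F_plus(1, L, td) - F_plus(0, L, td) >= 1.0 * td
           for L in (100, 10**4, 10**6))
```

The same wrapper idea applies to the per-term components of PIV, DIR, and PL2.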
Example: BM25+, a lower-bounded version of BM25
BM25:
  Score(Q,D) = \sum_{t \in Q \cap D} \frac{(k_3+1)\,c(t,Q)}{k_3 + c(t,Q)} \cdot \frac{(k_1+1)\,c(t,D)}{k_1\left((1-b) + b\,\frac{|D|}{avdl}\right) + c(t,D)} \cdot \log\frac{N+1}{df(t)}
BM25+:
  Score(Q,D) = \sum_{t \in Q \cap D} \frac{(k_3+1)\,c(t,Q)}{k_3 + c(t,Q)} \cdot \left[\frac{(k_1+1)\,c(t,D)}{k_1\left((1-b) + b\,\frac{|D|}{avdl}\right) + c(t,D)} + \delta\right] \cdot \log\frac{N+1}{df(t)}
BM25+ incurs almost no additional computational cost
Similarly, we can also improve PIV, DIR, and PL2, leading to PIV+, DIR+, and PL2+ respectively
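As a sketch of how little BM25+ changes the computation, here is its per-term score with the δ shift (query-side saturation omitted for brevity; the function name and parameter defaults are illustrative):

```python
import math

def bm25_plus_term(tf, doc_len, avdl, df, N, k1=1.2, b=0.75, delta=1.0):
    """BM25+ contribution of one matched term: BM25's TF-normalization
    component shifted by delta, then weighted by the idf factor."""
    if tf == 0:
        return 0.0
    idf = math.log((N + 1) / df)
    norm = k1 * ((1 - b) + b * doc_len / avdl)
    return ((k1 + 1) * tf / (norm + tf) + delta) * idf

# However long the document, one match is now worth at least delta * idf:
idf = math.log((1000 + 1) / 10)
assert bm25_plus_term(1, 10**6, 500, 10, 1000) >= 1.0 * idf
```

The only extra work per matched term is one addition, which is why the overhead is negligible.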
BM25+ can satisfy both LB1 and LB2
• Similarly to BM25, BM25+ satisfies LB1 unconditionally.
• LB2 is also satisfied unconditionally if δ is at least a threshold that depends only on k1; the threshold is below 1 for any k1 > 0.
Experiments below show that setting δ = 1.0 works very well.
The proposed approach can fix or alleviate the problem of all these retrieval models
Current retrieval models        Improved retrieval models
Model  LB1  LB2                 Model   LB1         LB2
BM25   Yes  No                  BM25+   Yes         Yes
PIV    Yes  No                  PIV+    Yes         Yes
PL2    No   No                  PL2+    Yes         Yes
DIR    No   Yes                 DIR+    Alleviated  Yes
Experiment Setup
• Standard TREC document collections:
– Web: WT2G, WT10G, and Terabyte
– News: Robust04
• Standard TREC query sets:
– Short (the title field): e.g., “Iraq foreign debt reduction”
– Verbose (the description field): e.g., “Identify any efforts, proposed or undertaken, by world governments to seek reduction of Iraq's foreign debt”
• 2-fold cross validation for parameter tuning
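The 2-fold tuning protocol can be sketched as follows (the fold construction, parameter grid, and the stand-in evaluator are illustrative assumptions; a real run would compute MAP with an actual retrieval system over TREC judgments):

```python
from itertools import product

def two_fold_cv(queries, evaluate, grid):
    """Split queries into two folds; pick the best parameters on one fold,
    score them on the other, swap, and average the two test scores."""
    half = len(queries) // 2
    folds = [queries[:half], queries[half:]]
    scores = []
    for train, test in [(folds[0], folds[1]), (folds[1], folds[0])]:
        best = max(grid, key=lambda params: evaluate(params, train))
        scores.append(evaluate(best, test))
    return sum(scores) / 2

# Illustrative use with a stand-in evaluator that peaks at (k1, b) = (1.2, 0.5):
grid = list(product([0.8, 1.2, 2.0], [0.3, 0.5, 0.75]))   # (k1, b) pairs
fake_eval = lambda params, qs: -abs(params[0] - 1.2) - abs(params[1] - 0.5)
assert two_fold_cv(list(range(10)), fake_eval, grid) == 0.0
```

Reporting the average of the two held-out scores keeps the tuned parameters from being evaluated on the queries they were fit to.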
BM25+ improves over BM25 significantly
BM25+ performs better on Web data than on News data
[Table: BM25 vs. BM25+ on the Web collections and the News collection]
Superscripts 1/2/3/4 indicate significance at the 0.05/0.02/0.01/0.001 levels.
δ = 1.0 works well, confirming the constraint analysis: the lower bound on δ required for LB2 is below 1 for any k1.
BM25+ performs better on verbose queries?
[Figure: results for short vs. verbose queries; σ = 2.31, σ = 2.63, σ = 1.19]
BM25 overly penalizes long documents more seriously for verbose queries
The “condition” under which BM25 violates LB2 is a lower bound on |D|, proportional to avdl, that decreases monotonically with b and k1.
The optimal settings of b & k1 are larger for verbose queries
The improvement indeed comes from alleviating the problem of overly-penalizing long docs
[Figure: probability curves for BM25 (short), BM25 (verbose), BM25+ (short), and BM25+ (verbose)]
DIR+ improves over DIR significantly
Fixing δ = 0.05 works very well.
DIR+ performs better on verbose than on short queries. Why?
Superscripts 1/2/3/4 indicate significance at the 0.05/0.02/0.01/0.001 levels.
[Table: DIR vs. DIR+ for short and verbose queries]
DIR can only satisfy LB1 if \frac{\mu + |D|}{\mu + avdl} < 1 + \frac{1}{\mu\,p(t|C)}.
[Table: optimal µ settings]
PL2+ improves over PL2 significantly
Fixing δ = 0.8 works very well.
PL2+ performs better on verbose than on short queries.
Superscripts 1/2/3/4 indicate significance at the 0.05/0.02/0.01/0.001 levels.
[Table: PL2 vs. PL2+ for short and verbose queries]
Optimal settings of c: the smaller, the more dangerous.
PIV+ works as we expected
PIV+ does not consistently outperform PIV, as we expected.
Superscript 1 indicates significance at the 0.05 level.
PIV can satisfy LB2 if s\left(\frac{|D|}{avdl} - 1\right) \le 0.899; this is fine, as the optimal settings of s are very small.
1. Why does it seem to be so hard to beat these state-of-the-art retrieval models {BM25, PIV, DIR, PL2 …}?
2. Are they hitting the ceiling?
We weren’t able to figure out their deficiency analytically.
No, they haven’t hit the ceiling yet!
Conclusions
• Reveal a common deficiency of current retrieval models
• Propose two novel formal constraints
• Show that current retrieval models do not satisfy both constraints, and that retrieval performance tends to be poor if either constraint is violated
• Develop a general and efficient solution, which has been shown analytically to fix/alleviate the problem of current retrieval models
• Demonstrate the effectiveness of the proposed algorithms across different collections for different types of queries
Our models {BM25+, DIR+, PL2+} can potentially replace current state-of-the-art retrieval models {BM25, DIR, PL2}
BM25:
  Score(Q,D) = \sum_{t \in Q \cap D} \frac{(k_3+1)\,c(t,Q)}{k_3 + c(t,Q)} \cdot \frac{(k_1+1)\,c(t,D)}{k_1\left((1-b) + b\,\frac{|D|}{avdl}\right) + c(t,D)} \cdot \log\frac{N+1}{df(t)}
BM25+ (with δ = 1.0):
  Score(Q,D) = \sum_{t \in Q \cap D} \frac{(k_3+1)\,c(t,Q)}{k_3 + c(t,Q)} \cdot \left[\frac{(k_1+1)\,c(t,D)}{k_1\left((1-b) + b\,\frac{|D|}{avdl}\right) + c(t,D)} + 1.0\right] \cdot \log\frac{N+1}{df(t)}
Future work
• This work has demonstrated the power of axiomatic analysis for fixing deficiencies of retrieval models. Are there any other deficiencies of current retrieval models? If so, can we solve them with axiomatic analysis?
• Can we go beyond bag-of-words with constraint analysis?
• Can we find a comprehensive set of constraints that are sufficient for deriving a unique (optimal) retrieval function?