Term-Specific Smoothing
On a paper by D. Hiemstra
Alexandru A. Chitea, [email protected]
Universität des Saarlandes, March 10, 2006
Seminar CS 555 – Language Model based Information Retrieval
March 10, 2006 Term-Specific Smoothing 2
Introduction
Experimental approach to Information Retrieval
– A formal model specifies an exact formula, which is then tested empirically
– Formulae are tried empirically because they seem plausible
Modeling approach to Information Retrieval
– A formal model specifies an exact formula that is used to prove some simple mathematical properties of the model
Information Retrieval – Overview
A system answers a query with a ranked result list
– Statistical ranking based on term frequencies is still standard practice
Search engines provide means to override the default ranking mechanisms
– Users can specify mandatory query terms (e.g. +term or “term” in Google)
Information Retrieval – Practice (1)
Query: Star Wars Episode I (I is not treated as a mandatory term)
Information Retrieval – Practice (2)
Query: Star Wars Episode +I (I is treated as a mandatory term)
Motivation
Performance limitations in statistical ranking
– Statistics-based IR models do not capture the specification of term importance
– The user or the system should be able to override the default ranking mechanism
Objective
– A mathematical model that supports the concept of query term importance
Language Models
A statistical model for generating text
– A probability distribution over strings of a given language
Given a model $M$ and a term sequence $t_1, \dots, t_n$:
$$P(t_1,\dots,t_n \mid M) = P(t_1 \mid M)\, P(t_2 \mid M, t_1) \cdots P(t_n \mid M, t_1,\dots,t_{n-1})$$
Consider the unigram language model (LM), which drops the conditioning on previous terms:
$$P(t_1,\dots,t_n) = P(t_1) \cdots P(t_n)$$
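The unigram independence assumption is easy to sketch in code. Below is a minimal illustration; the `unigram_prob` helper and the toy probabilities are hypothetical, not taken from the paper:

```python
def unigram_prob(terms, model):
    """P(t1,...,tn | M) = P(t1|M) * ... * P(tn|M): under the unigram
    assumption each term is generated independently of the others."""
    p = 1.0
    for t in terms:
        p *= model.get(t, 0.0)  # terms outside the model get probability 0
    return p

# Toy unigram model over a tiny vocabulary (hypothetical numbers).
model = {"text": 0.2, "search": 0.1, "mining": 0.1}
p = unigram_prob(["text", "search"], model)  # 0.2 * 0.1
```

Note that any term missing from the model zeroes out the whole product, which is exactly the problem the smoothing slides address later.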
Example – Language Models
IR sample: text … 0.2, search … 0.1, …, mining … 0.1, food … 0.0001, …
– build model: $P(\text{text}, \dots, \text{food}, \dots) = P(\text{text}) \cdots P(\text{food}) \cdots$
Health sample: food … 0.25, nutrition … 0.1, …, healthy … 0.05, diet … 0.02, …
– build model: $P(\text{food}, \dots, \text{diet}, \dots) = P(\text{food}) \cdots P(\text{diet}) \cdots$
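The "build model" step can be sketched as maximum-likelihood estimation over a text sample. The sample strings and the `mle_unigram` helper below are made up for illustration:

```python
from collections import Counter

def mle_unigram(text):
    """Maximum-likelihood unigram model: P(t) = count(t) / number of terms."""
    counts = Counter(text.split())
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

# Two hypothetical samples; term probabilities reflect each domain.
ir_model = mle_unigram("text search text mining text search food text")
health_model = mle_unigram("food nutrition food healthy food diet food food")
# "text" dominates the IR sample, "food" dominates the health sample.
```

As in the slide's example, the same term ("food") gets a very different probability under the two models.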
Language Models in IR
Estimate a LM for each document D
Estimate the probability of generating a query Q with terms (t1, …, tn) using the document's model: $P(t_1,\dots,t_n \mid D)$
Rank documents by the probability of generating Q:
$$P(D \mid t_1,\dots,t_n) = \frac{P(t_1,\dots,t_n \mid D)\, P(D)}{P(t_1,\dots,t_n)}$$
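Since $P(t_1,\dots,t_n)$ is the same for every document, ranking by the posterior with a uniform prior $P(D)$ reduces to ranking by the query likelihood. A sketch (all names hypothetical):

```python
def query_likelihood(query_terms, doc_model):
    """P(t1,...,tn | D) under the unigram assumption."""
    p = 1.0
    for t in query_terms:
        p *= doc_model.get(t, 0.0)
    return p

def rank(query_terms, doc_models):
    """Sort document ids by P(Q|D), highest first. With a uniform prior
    P(D) and a fixed denominator P(Q), this equals ranking by P(D|Q)."""
    return sorted(doc_models,
                  key=lambda d: query_likelihood(query_terms, doc_models[d]),
                  reverse=True)

# Toy document models (hypothetical probabilities).
docs = {
    "d1": {"revenue": 0.2, "down": 0.1},
    "d2": {"revenue": 0.1, "down": 0.05},
}
order = rank(["revenue", "down"], docs)  # ["d1", "d2"]
```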
Insufficient Data
If a query term does not occur in the document, the query cannot be generated:
$$P(t_1,\dots,t_n \mid D) = \prod_{i=1}^{n} P(t_i \mid D) = 0$$
Remedy: smooth the probabilities
– Probabilities of observed events are decreased by a certain amount, which is credited to unobserved events
Smoothing
Roles
– Estimation >> re-evaluation of probabilities
– Query modeling >> "explaining" the common and non-informative terms in a query
Linear interpolation smoothing
– Defines a smoothing parameter $\lambda$ necessary for query modeling
– Can be defined as a two-state Hidden Markov Model
Smoothing Models
Mixture Model smoothing
– Define a hidden event for all query terms
Term-specific smoothing
– Define a hidden event for each query term
Smoothing – Mixture Model
Mixes the probability from the document model with the general collection probability of the term:
$$P(t_1,\dots,t_n \mid D) = \prod_{i=1}^{n} [\,\lambda\, P(t_i \mid D) + (1-\lambda)\, P(t_i \mid C)\,]$$
$\lambda$ can be tuned to adjust performance:
– High value >> "conjunctive-like" search, i.e. suitable for short queries
– Low value >> suitable for long queries
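The mixture formula transcribes directly into code. A minimal sketch; the function and variable names are mine, not from the paper:

```python
def mixture_score(query_terms, P_D, P_C, lam):
    """Linear-interpolation (mixture) smoothing:
    P(t1..tn | D) = prod_i [ lam * P(ti|D) + (1 - lam) * P(ti|C) ].
    A term missing from the document no longer zeroes out the score,
    because the collection probability P(ti|C) backs it off."""
    p = 1.0
    for t in query_terms:
        p *= lam * P_D.get(t, 0.0) + (1 - lam) * P_C.get(t, 0.0)
    return p

# Hypothetical models: "down" is unseen in this document.
P_D = {"revenue": 0.125}
P_C = {"revenue": 0.125, "down": 0.0625}
score = mixture_score(["revenue", "down"], P_D, P_C, lam=0.5)
# score stays positive even though "down" is absent from the document
```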
Bayesian Networks (1)
A Bayesian Network (BN) is a directed, acyclic graph G(V, E) where:
– Nodes >> random variables (RVs)
– Edges >> dependencies
Properties:
– Given a root $R \in V$, the BN captures the prior probability $P(R = r)$
– Given a node $X \in V$ with $parents(X) = \{P_1,\dots,P_k\}$, the BN captures the conditional probability $P(X = x \mid P_1,\dots,P_k)$
– Node $X$ is conditionally independent of a non-parent node $Y$ given its parents: $P(X \mid Y, P_1,\dots,P_k) = P(X \mid P_1,\dots,P_k)$
Bayesian Networks (2)
From the properties it holds that:
$$P(X_1,\dots,X_n) = P(X_1 \mid X_2,\dots,X_n)\, P(X_2,\dots,X_n)$$
By the chain rule:
$$P(X_1,\dots,X_n) = \prod_{i=1}^{n} P(X_i \mid X_{i+1},\dots,X_n)$$
By conditional independence:
$$P(X_1,\dots,X_n) = \prod_{i=1}^{n} P(X_i \mid parents(X_i), \text{other nodes}) = \prod_{i=1}^{n} P(X_i \mid parents(X_i))$$
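The factorization can be checked numerically on a toy chain-structured network; all numbers below are hypothetical:

```python
# Chain BN: X1 -> X2 -> X3, all variables binary.
P_X1 = {0: 0.6, 1: 0.4}
P_X2_given_X1 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
P_X3_given_X2 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}

def joint(x1, x2, x3):
    """P(X1,X2,X3) = P(X1) P(X2|X1) P(X3|X2): each factor conditions
    only on the node's parents, as the BN factorization prescribes."""
    return P_X1[x1] * P_X2_given_X1[x1][x2] * P_X3_given_X2[x2][x3]

# The local factors define a proper joint distribution: it sums to 1.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
```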
LM as a Bayesian Network
Nodes >> random variables
Edges >> the model's conditional dependencies
Clear nodes >> hidden random variables
Shaded nodes >> observed random variables
Figure 1: The language modeling approach as a Bayesian network (document node D generates query term nodes t1, …, tn)
Example – Mixture Model (1)
Collection (2 documents)
– d1: IBM reports a profit but revenue is down
– d2: Siemens narrows quarter loss but revenue decreases further
Model: MLE unigram from documents; $\lambda = 1/2$
Query: revenue down ($t_1$ = revenue, $t_2$ = down)
$$P(t_1, t_2 \mid d_1) = \left[\tfrac{1}{2}\left(\tfrac{1}{8} + \tfrac{2}{16}\right)\right] \cdot \left[\tfrac{1}{2}\left(\tfrac{1}{8} + \tfrac{1}{16}\right)\right] = \tfrac{3}{256}$$
$$P(t_1, t_2 \mid d_2) = \left[\tfrac{1}{2}\left(\tfrac{1}{8} + \tfrac{2}{16}\right)\right] \cdot \left[\tfrac{1}{2}\left(0 + \tfrac{1}{16}\right)\right] = \tfrac{1}{256}$$
Ranking: d1 > d2
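This arithmetic can be replayed end to end: the sketch below rebuilds the MLE models from the two documents and reproduces the scores 3/256 and 1/256 (helper names are mine):

```python
from collections import Counter

d1 = "IBM reports a profit but revenue is down".split()
d2 = "Siemens narrows quarter loss but revenue decreases further".split()
collection = d1 + d2  # 16 terms in total

def mle(terms):
    """MLE unigram model: P(t) = count(t) / number of terms."""
    counts = Counter(terms)
    n = len(terms)
    return {t: counts[t] / n for t in counts}

P_d1, P_d2, P_C = mle(d1), mle(d2), mle(collection)

def score(query, P_D, P_C, lam=0.5):
    """Mixture-model query likelihood with interpolation weight lam."""
    p = 1.0
    for t in query:
        p *= lam * P_D.get(t, 0.0) + (1 - lam) * P_C.get(t, 0.0)
    return p

query = ["revenue", "down"]
s1 = score(query, P_d1, P_C)  # 3/256: d1 contains both query terms
s2 = score(query, P_d2, P_C)  # 1/256: d2 lacks "down"
# Ranking: d1 > d2
```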
Example – Mixture Model (2)
Figure 2: Bayesian network for the collection C = (d1, d2) language model; the collection node C and the document node D both generate the query term nodes t1: revenue and t2: down
Term-Specific Smoothing
Figure: two Bayesian networks over D and t1, t2, t3; the term-specific network adds one importance node $\lambda_i$ per query term
Mixture model:
$$P(t_1,\dots,t_n \mid D) = \prod_{i=1}^{n} [\,\lambda\, P(t_i \mid D) + (1-\lambda)\, P(t_i \mid C)\,]$$
Term-specific smoothing:
$$P(t_1,\dots,t_n \mid D) = \prod_{i=1}^{n} [\,\lambda_i\, P(t_i \mid D) + (1-\lambda_i)\, P(t_i \mid C)\,]$$
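The only change from the mixture model is that the single $\lambda$ becomes a per-term $\lambda_i$. A sketch with hypothetical names and numbers:

```python
def term_specific_score(query_terms, lambdas, P_D, P_C):
    """Term-specific smoothing: each query term ti carries its own
    importance weight lambda_i = P(I_i = 1)."""
    p = 1.0
    for t, lam in zip(query_terms, lambdas):
        p *= lam * P_D.get(t, 0.0) + (1 - lam) * P_C.get(t, 0.0)
    return p

# Toy models (hypothetical probabilities).
P_D = {"star": 0.1, "wars": 0.1}
P_C = {"star": 0.01, "wars": 0.01}

# Per-term weights: trust the document model more for "star" than "wars".
s = term_specific_score(["star", "wars"], [0.8, 0.3], P_D, P_C)
```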
Term-Specific Smoothing – Derivation
Step 1: Assume query term independence:
$$P(t_1,\dots,t_n \mid D) = \prod_{i=1}^{n} P(t_i \mid D)$$
Step 2: For each $t_i$ introduce a binary RV $I_i$ (the importance of the query term):
$$I_i = \begin{cases} 1 & \text{important} \\ 0 & \text{otherwise} \end{cases}$$
$$P(t_1,\dots,t_n \mid D) = \prod_{i=1}^{n} \Big[ \sum_{k \in \{0,1\}} P(t_i, I_i = k \mid D) \Big]$$
Term-Specific Smoothing – Derivation
Step 3: Assume query term importance does not depend on D:
$$P(t_1,\dots,t_n \mid D) = \prod_{i=1}^{n} \Big[ \sum_{k \in \{0,1\}} P(I_i = k)\, P(t_i \mid I_i = k, D) \Big]$$
Step 4: Writing out the full sum over the importance values yields:
$$P(t_1,\dots,t_n \mid D) = \prod_{i=1}^{n} [\, P(I_i = 0)\, P(t_i \mid I_i = 0, D) + P(I_i = 1)\, P(t_i \mid I_i = 1, D)\,]$$
Term-Specific Smoothing – Derivation
Step 4 (contd.):
– Let $\lambda_i = P(I_i = 1)$ and $1 - \lambda_i = P(I_i = 0)$
– Assume $P(t_i \mid D) = P(t_i \mid I_i = 1, D)$ and $P(t_i \mid C) = P(t_i \mid I_i = 0, D)$
$$P(t_1,\dots,t_n \mid D) = \prod_{i=1}^{n} [\,(1-\lambda_i)\, P(t_i \mid C) + \lambda_i\, P(t_i \mid D)\,]$$
Term-Specific Smoothing – Properties
Case 1: Stop Words (‘–’)
– $\lambda_i = 0$ >> query term is not important
– the factor reduces to $P(t_i \mid C)$, which is the same for every document, so query term $t_i$ is ignored in the ranking
Case 2: Mandatory Terms (‘+’)
– $\lambda_i = 1$ >> relevant documents must contain the query term
– no smoothing by the collection model is performed: the factor reduces to $P(t_i \mid D)$, which is 0 for documents lacking $t_i$
Case 3: Coordination level ranking
– $\lambda_i \to 1$ for all $i$
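The first two cases can be verified directly on a single smoothed factor; the toy probabilities and names below are mine:

```python
P_C = {"the": 0.1, "grail": 0.001}

def factor(t, lam, P_D):
    """One factor of the term-specific product:
    lam * P(t|D) + (1 - lam) * P(t|C)."""
    return lam * P_D.get(t, 0.0) + (1 - lam) * P_C.get(t, 0.0)

doc_a = {"the": 0.2, "grail": 0.05}
doc_b = {"the": 0.3}                 # no "grail"

# Case 1, stop word (lambda = 0): the factor collapses to P(t|C), which
# is identical for every document, so the term cannot affect the ranking.
assert factor("the", 0.0, doc_a) == factor("the", 0.0, doc_b)

# Case 2, mandatory term (lambda = 1): no collection smoothing, so a
# document without the term is assigned probability zero.
assert factor("grail", 1.0, doc_b) == 0.0
assert factor("grail", 1.0, doc_a) > 0.0
```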
Stop Words
Query terms that are ignored during the search ($\lambda_i = 0$)
Reasons:
– Frequent words (e.g. the, it, a, …) might not contribute significantly to the final document score, but they do require processing power
– Words are stopped if they carry little meaning (e.g. hereupon, whereafter)
Mandatory Terms
A query term that should occur in every retrieved document ($\lambda_i = 1$)
The collection model can be dropped from the calculation of the document score
Documents that do not match the query term are assigned probability zero
Users specify mandatory terms (e.g. by a + prefix)
Coordination Level Ranking
$\lambda_i \to 1$ for all $i$
A document containing n query terms will always rank higher than one containing n-1 query terms
Most tf.idf ranking methods do not behave like coordination level ranking
Term-Specific Smoothing – Review
The term importance probability $\lambda_i$ makes it possible to:
– ignore query terms that statistics alone cannot always identify as unimportant
– restrict the retrieved list to documents that match specific terms, regardless of their frequency distributions
– enforce a coordination level ranking of the documents, regardless of the terms' frequency distributions
Relevance Feedback
Predict optimal values for $\lambda_i$
Train on relevant documents and predict, for each term, the probability of term importance that maximizes retrieval performance
Use the Expectation-Maximization (EM) algorithm
– Maximizes the probability of the observed data given some training data
EM Algorithm
The algorithm iteratively maximizes the probability of the query $t_1,\dots,t_n$ given $r$ relevant documents $D_1,\dots,D_r$
E-step:
$$m_i = \sum_{j=1}^{r} \frac{\lambda_i^{(p)}\, P(t_i \mid D_j)}{(1-\lambda_i^{(p)})\, P(t_i \mid C) + \lambda_i^{(p)}\, P(t_i \mid D_j)}$$
M-step:
$$\lambda_i^{(p+1)} = \frac{m_i}{r}$$
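A compact implementation of this EM loop for a single query term; the function name and toy numbers are mine, while the update rules follow the E- and M-steps above:

```python
def em_lambda(p_term_docs, p_term_coll, lam=0.5, iterations=20):
    """Estimate lambda_i for one query term from r relevant documents.
    p_term_docs: [P(ti|D1), ..., P(ti|Dr)]; p_term_coll: P(ti|C).
    E-step: m = sum_j lam*P(ti|Dj) / ((1-lam)*P(ti|C) + lam*P(ti|Dj))
    M-step: lam <- m / r"""
    r = len(p_term_docs)
    for _ in range(iterations):
        m = sum(lam * p / ((1 - lam) * p_term_coll + lam * p)
                for p in p_term_docs)
        lam = m / r
    return lam

# Term that beats its collection probability in every relevant document:
# lambda is driven up toward 1 (the term behaves as mandatory).
lam_hi = em_lambda([0.2], 0.1)
# Term missing from half of the relevant documents: lambda is driven down.
lam_lo = em_lambda([0.2, 0.0], 0.1)
```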
Generalization of Term Importance
Allow the RV $I_i$ to have more than 2 realizations:
– Combine the unigram document model with the bigram document model
$$P(t_1,\dots,t_n \mid D) = [\,(1-\lambda_1)\, P(t_1 \mid C) + \lambda_1\, P(t_1 \mid D)\,] \cdot \prod_{i=2}^{n} [\,(1-\lambda_i-\mu_i)\, P(t_i \mid C) + \lambda_i\, P(t_i \mid D) + \mu_i\, P(t_i \mid t_{i-1}, D)\,]$$
where $I_i \in \{0,1,2\}$, $\mu_i = P(I_i = 2)$, and $P(t_i \mid t_{i-1}, D) \equiv P(t_i \mid t_{i-1}, I_i = 2, D)$
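A sketch of the generalized score with three importance levels per term (collection, unigram document, bigram document); all names and probabilities below are hypothetical:

```python
def general_score(query, P_D, P_C, P_bi_D, lams, mus):
    """Generalized term-specific smoothing with I_i in {0, 1, 2}.
    First term (no predecessor): (1-lam_1)*P(t1|C) + lam_1*P(t1|D).
    Terms i >= 2: (1-lam_i-mu_i)*P(ti|C) + lam_i*P(ti|D)
                  + mu_i*P(ti|t_{i-1}, D)."""
    t0 = query[0]
    p = (1 - lams[0]) * P_C.get(t0, 0.0) + lams[0] * P_D.get(t0, 0.0)
    for i in range(1, len(query)):
        t, prev = query[i], query[i - 1]
        p *= ((1 - lams[i] - mus[i]) * P_C.get(t, 0.0)
              + lams[i] * P_D.get(t, 0.0)
              + mus[i] * P_bi_D.get((prev, t), 0.0))
    return p

P_C = {"last": 0.01, "will": 0.02}
P_D = {"last": 0.1, "will": 0.1}
P_bi_D = {("last", "will"): 0.5}  # P(will | last, D), hypothetical

# mu_2 = 1 scores the second term purely by the bigram document model,
# so a document without the bigram "last will" receives probability 0.
s = general_score(["last", "will"], P_D, P_C, P_bi_D,
                  lams=[0.5, 0.0], mus=[0.0, 1.0])
```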
Example – General Model
Query: “last will” of Alfred Nobel
– the stop word "of" is ignored ($\lambda_i = 0$)
Query: +“last will” of Alfred Nobel
– the phrase “last will” is made mandatory through the bigram model ($\mu_i = 1$, $\lambda_i = 0$)
Figure 3: Graphical model of dependence relations between query terms (D generates t1, t2, t3; a bigram dependency links t1 to t2)
Future Research
Define a unigram LM for a topic-specific space
Extend beyond term-matching
– Use syntax (bag of words vs. structured text) and
semantics (exact terms vs. “equivalent” terms)
Conclusions
Extension to the LM approach to IR: model the
importance of a query term
– Stop Words/Phrases: trade-off between search quality
and search speed
– Mandatory Terms: the user overrides the default
ranking algorithm
Statistical ranking algorithms motivated by the LM
approach perform well in an empirical setting
Discussion
Is this a valid approach?
How does it differ from term weighting?
Why do we want coordination level ranking?
Is the bigram generalization valid and/or useful?
References
1. D. Hiemstra. Term-Specific Smoothing for the Language Modeling Approach to Information Retrieval: The Importance of a Query Term. In Proceedings of SIGIR’02, August 11-15, 2002.
2. G. Weikum. Information Retrieval and Data Mining. Course slides, Universität des Saarlandes (retrieved February 15, 2006).