Modern Information Retrieval
Chapter 2
Modeling
Modeling, Modern Information Retrieval, Addison Wesley, 2006 – p. 1/37
Introduction
IR systems usually adopt index terms to process queries
Index term:
  a keyword or group of selected words
  any word (more general)
Stemming might be used:
  connect: connecting, connection, connections
An inverted file is built for the chosen index terms
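As a rough illustration of the last two points, an inverted file can be sketched as a mapping from each (stemmed) index term to the documents that contain it. The toy stemmer, the sample documents, and the function names below are assumptions made for this sketch; they are not taken from the slides, and a real system would use a proper stemming algorithm.

```python
from collections import defaultdict

def toy_stem(word):
    """Very crude suffix stripping; an illustrative stand-in for a real stemmer."""
    for suffix in ("ions", "ion", "ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_inverted_file(docs):
    """Map each stemmed index term to the sorted ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[toy_stem(word)].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical three-document collection.
docs = {
    1: "connecting networks",
    2: "network connections",
    3: "retrieval models",
}
inverted = build_inverted_file(docs)
print(inverted["connect"])  # documents whose words stem to "connect"
```

Here "connecting" and "connections" both reduce to the term "connect", so a query on either form reaches the same posting list.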
Introduction
Matching at the index term level is quite imprecise
No surprise that users are frequently dissatisfied
Since most users have no training in query formation, the problem is even worse
  Frequent dissatisfaction of Web users
Deciding relevance is a critical issue for IR systems: ranking
Introduction
A ranking is an ordering of the documents retrieved that (hopefully) reflects the relevance of the documents to the user query
A ranking is based on fundamental premises regarding the notion of relevance, such as:
  common sets of index terms
  sharing of weighted terms
  likelihood of relevance
Each set of premises leads to a distinct IR model
IR Models
The IR model, the logical view of the docs, and the retrieval task are distinct aspects of the system
Retrieval: Ad Hoc vs. Filtering
[Figure: Ad Hoc Retrieval]
[Figure: Filtering]
Classic IR Models - Basic Concepts
Each document is represented by a set of representative keywords or index terms
An index term is a document word useful for remembering the document's main themes
Usually, index terms are nouns because nouns have meaning by themselves
However, search engines assume that all words are index terms (full text representation)
Classic IR Models - Basic Concepts
Not all terms are equally useful for representing the document contents
  less frequent terms allow identifying a narrower set of documents
To quantify the importance of an index term, we associate a weight with it
Let
  $k_i$ be an index term
  $d_j$ be a document
  $w_{i,j}$ be a weight associated with $(k_i, d_j)$, which quantifies the importance of $k_i$ for describing the contents of $d_j$
Classic IR Models - Basic Concepts
Let
  $k_i$ : $i$th index term
  $d_j$ : $j$th document
  $t$ : total number of terms in the vocabulary
  $K = \{k_1, k_2, \ldots, k_t\}$ : the set of all index terms
  $w_{i,j} \geq 0$ : weight associated with $(k_i, d_j)$
    if $w_{i,j} = 0$ then term $k_i$ does not occur within $d_j$
  $\vec{d}_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$ : weighted vector associated with $d_j$
  $g_i(\vec{d}_j)$ : a reference to the weight $w_{i,j}$, i.e., $g_i(\vec{d}_j) = w_{i,j}$
The Boolean Model
Simple model based on set theory
Queries specified as Boolean expressions
  precise semantics
  neat formalism
  $q = k_a \wedge (k_b \vee \neg k_c)$
Let
  $w_{i,q}$ : weight associated with the pair $(k_i, q)$
  $w_{i,q} \in \{0, 1\}$ : terms either present or absent (Boolean)
  $\vec{d}_q = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})$ : weighted vector associated with $q$
The Boolean Model
Let
  $q = k_a \wedge (k_b \vee \neg k_c)$
  $\vec{d}_q$ : weighted vector associated with $q$
  $dnf(\vec{d}_q)$ : disjunctive normal form for vector $\vec{d}_q$
Then
  $dnf(\vec{d}_q) = (1, 1, 1) \vee (1, 1, 0) \vee (1, 0, 0)$
  $(1,1,1)$ : conjunctive component for $(k_a, k_b, k_c)$
  $(1,1,0)$ : conjunctive component for $(k_a, k_b, \neg k_c)$
  $(1,0,0)$ : conjunctive component for $(k_a, \neg k_b, \neg k_c)$
  $cc_q$ : a conjunctive component for $q$, $cc_q \in dnf(\vec{d}_q)$
The Boolean Model
$q = k_a \wedge (k_b \vee \neg k_c)$
$$sim(q, d_j) = \begin{cases} 1 & \text{if } \exists\, cc_q \mid \forall k_i,\ g_i(\vec{d}_j) = g_i(cc_q) \\ 0 & \text{otherwise} \end{cases}$$
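The matching rule can be sketched in a few lines of Python. This is a minimal illustration that assumes the vocabulary consists only of the three query terms (ka, kb, kc), so a document is a binary 3-tuple. The helper names `dnf` and `boolean_sim` are hypothetical, and enumerating all 2^t assignments is only practical for tiny examples.

```python
from itertools import product

def dnf(query, num_terms):
    """Enumerate the conjunctive components (binary tuples) that satisfy the query."""
    return {bits for bits in product((0, 1), repeat=num_terms) if query(*bits)}

def boolean_sim(doc_vector, components):
    """Return 1 if the document's binary vector equals some conjunctive component, else 0."""
    return 1 if tuple(doc_vector) in components else 0

# q = ka AND (kb OR NOT kc), over the terms (ka, kb, kc)
q = lambda ka, kb, kc: bool(ka and (kb or not kc))
components = dnf(q, 3)

print(sorted(components, reverse=True))    # the three conjunctive components
print(boolean_sim((1, 1, 0), components))  # document contains ka and kb, not kc
print(boolean_sim((0, 1, 1), components))  # document lacks ka
```

The second document is rejected outright even though it contains two of the three terms, which is the all-or-nothing behavior of the model.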
Drawbacks of the Boolean Model
Retrieval based on binary decision criteria with no notion of partial matching
No ranking of the documents is provided (absence of a grading scale)
Information need has to be translated into a Boolean expression, which most users find awkward
The Boolean queries formulated by the users are most often too simplistic
As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query
The Vector Model
Use of binary weights is too limiting
Non-binary weights allow partial matches to be taken into account
These term weights are used to compute a degree of similarity between a query and each document
A ranked set of documents provides better matching
The Vector Model
Define:
  $w_{i,j} > 0$ whenever $k_i \in d_j$
  $w_{i,q} \geq 0$ associated with the pair $(k_i, q)$
  $\vec{d}_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$
  $\vec{q} = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})$
To each term $k_i$ is associated a unit vector $\vec{i}$
The unit vectors $\vec{i}$ and $\vec{j}$ are assumed to be orthonormal (i.e., index terms are assumed to occur independently within the documents)
The $t$ unit vectors $\vec{i}$ form an orthonormal basis for a $t$-dimensional space
In this space, queries and documents are represented as weighted vectors
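The Vector Model
Similarity
$$sim(d_j, q) = \cos(\theta) = \frac{\vec{d}_j \cdot \vec{q}}{|\vec{d}_j| \times |\vec{q}|} = \frac{\sum_{i=1}^{t} w_{i,j} \times w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2} \times \sqrt{\sum_{i=1}^{t} w_{i,q}^2}}$$
Since $w_{i,j} \geq 0$ and $w_{i,q} \geq 0$, then $0 \leq sim(d_j, q) \leq 1$
A document is retrieved even if it matches the query terms only partially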
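A minimal sketch of the cosine formula, assuming documents and queries are given as equal-length lists of non-negative weights. The function name, the sample vectors, and the choice to return 0.0 for an all-zero vector are assumptions of this sketch.

```python
import math

def cosine_sim(d, q):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_d * norm_q)

# Hypothetical weights over a three-term vocabulary.
d1 = [0.5, 0.8, 0.0]   # shares the first term with the query
d2 = [0.0, 0.0, 0.7]   # shares no term with the query
q = [1.0, 0.0, 0.0]

print(cosine_sim(d1, q))  # strictly between 0 and 1: partial match
print(cosine_sim(d2, q))  # 0.0: no term in common
```

Because the score is graded, d1 can be ranked above d2 instead of both being simply accepted or rejected.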
The Vector Model
Similarity
$$sim(d_j, q) = \frac{\sum_{i=1}^{t} w_{i,j} \times w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2} \times \sqrt{\sum_{i=1}^{t} w_{i,q}^2}}$$
How to compute the weights $w_{i,j}$ and $w_{i,q}$?
A good weight must take into account two effects:
  quantification of intra-document contents (similarity)
    tf factor, the term frequency within a document
  quantification of inter-document separation (dissimilarity)
    idf factor, the inverse document frequency
  $w_{i,j} = tf_{i,j} \times idf_i$
The Vector Model
Let $k_i \in d_j$
  $N$ be the total number of docs in the collection
  $n_i$ be the number of docs which contain $k_i$
  $freq_{i,j}$ be the raw frequency of $k_i$ within $d_j$
A normalized tf factor is given by
$$f_{i,j} = \frac{freq_{i,j}}{\max_l freq_{l,j}}$$
  where the maximum is computed over all terms which occur within the document $d_j$
The idf factor is computed as
$$idf_i = \log \frac{N}{n_i}$$
  the log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term $k_i$.
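The two factors can be computed directly from token lists. The sketch below assumes each document is a non-empty list of already-extracted index terms and uses the natural logarithm (the slides do not fix a base); the toy collection and function names are illustrative only.

```python
import math
from collections import Counter

def normalized_tf(doc_terms):
    """f_ij = freq_ij / max_l freq_lj for every term of one (non-empty) document."""
    freq = Counter(doc_terms)
    max_freq = max(freq.values())
    return {term: count / max_freq for term, count in freq.items()}

def idf(docs):
    """idf_i = log(N / n_i), with n_i the number of documents containing term i."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / n) for term, n in doc_freq.items()}

# Hypothetical four-document collection.
docs = [
    ["information", "retrieval", "retrieval", "model"],
    ["vector", "model"],
    ["boolean", "model", "query"],
    ["query", "ranking"],
]
tf_doc0 = normalized_tf(docs[0])
idf_all = idf(docs)

print(tf_doc0)            # "retrieval" is the most frequent term, so its tf is 1.0
print(idf_all["model"])   # frequent across documents, low idf
print(idf_all["vector"])  # rare across documents, high idf
```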
The Vector Model
The best term-weighting schemes use weights which are given by
$$w_{i,j} = f_{i,j} \times \log \frac{N}{n_i}$$
  this strategy is called a tf-idf weighting scheme
For the query term weights, a suggestion is
$$w_{i,q} = \left( 0.5 + \frac{0.5 \, freq_{i,q}}{\max_l freq_{l,q}} \right) \times \log \frac{N}{n_i}$$
The vector model with tf-idf weights is a good ranking strategy with general collections
The vector model is usually as good as the known ranking alternatives. It is also simple and fast to compute.
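The document and query weighting formulas differ only in the tf part, which the following sketch makes explicit. The collection statistics (N and the per-term document frequencies) are hypothetical, the natural logarithm is assumed, and skipping query terms that never occur in the collection is a choice made here, not something the slides specify.

```python
import math
from collections import Counter

def doc_weights(doc_terms, n_docs, doc_freq):
    """w_ij = (freq_ij / max_l freq_lj) * log(N / n_i)."""
    freq = Counter(doc_terms)
    max_freq = max(freq.values())
    return {t: (c / max_freq) * math.log(n_docs / doc_freq[t]) for t, c in freq.items()}

def query_weights(query_terms, n_docs, doc_freq):
    """w_iq = (0.5 + 0.5 * freq_iq / max_l freq_lq) * log(N / n_i)."""
    freq = Counter(query_terms)
    max_freq = max(freq.values())
    return {
        t: (0.5 + 0.5 * c / max_freq) * math.log(n_docs / doc_freq[t])
        for t, c in freq.items()
        if t in doc_freq  # assumption: terms unseen in the collection are skipped
    }

# Hypothetical collection statistics: N = 8 documents, n_i per term.
N = 8
doc_freq = {"vector": 2, "model": 8, "ranking": 4}

w_d = doc_weights(["vector", "vector", "model", "ranking"], N, doc_freq)
w_q = query_weights(["vector", "ranking"], N, doc_freq)
print(w_d)  # "model" occurs in every document, so its weight is 0
print(w_q)
```

The 0.5 offset in the query formula keeps every query term at no less than half of its idf, so short queries are not dominated by a single repeated term.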
The Vector Model
Advantages:
  term-weighting improves quality of the answer set
  partial matching allows retrieval of docs that approximate the query conditions
  cosine ranking formula sorts documents according to degree of similarity to the query
Disadvantages:
  assumes independence of index terms (??); not clear that this is bad though
The Vector Model
[Figures: Example 1, Example 2, Example 3]
Probabilistic Model
Objective: to capture the IR problem using a probabilistic framework
Given a user query, there is an ideal answer set
Querying as specification of the properties of this ideal answer set (clustering)
But, what are these properties?
Guess at the beginning what they could be (i.e., guess initial description of ideal answer set)
Improve by iteration
Probabilistic Model
An initial set of documents is retrieved somehow
User inspects these docs looking for the relevant ones (in truth, only top 10-20 need to be inspected)
IR system uses this information to refine description of ideal answer set
By repeating this process, it is expected that the description of the ideal answer set will improve
Always keep in mind the need to guess at the very beginning the description of the ideal answer set
Description of ideal answer set is modeled in probabilistic terms
Probabilistic Ranking Principle
Given a user query $q$ and a document $d_j$, the probabilistic model tries to estimate the probability that the user will find the document $d_j$ interesting (i.e., relevant). The model assumes that this probability of relevance depends on the query and the document representations only. Ideal answer set is referred to as $R$ and should maximize the probability of relevance. Documents in the set $R$ are predicted to be relevant.
But,
  how to compute probabilities?
  what is the sample space?
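The Ranking
Similarity
$$sim(d_j, q) = \frac{P(R|\vec{d}_j)}{P(\bar{R}|\vec{d}_j)} = \frac{P(\vec{d}_j|R) \times P(R)}{P(\vec{d}_j|\bar{R}) \times P(\bar{R})} \sim \frac{P(\vec{d}_j|R)}{P(\vec{d}_j|\bar{R})}$$
$\bar{R}$ : complement of $R$ (the set of non-relevant documents)
$P(\vec{d}_j|R)$ : probability of randomly selecting the document $d_j$ from the set $R$ of relevant documents
$P(R)$ : probability that a document randomly selected from the entire collection is relevant
$P(\vec{d}_j|\bar{R})$ and $P(\bar{R})$ : analogous and complementary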
The Ranking
Assuming independence of index terms
$$sim(d_j, q) \sim \frac{\left( \prod_{g_i(\vec{d}_j)=1} P(k_i|R) \right) \times \left( \prod_{g_i(\vec{d}_j)=0} P(\bar{k}_i|R) \right)}{\left( \prod_{g_i(\vec{d}_j)=1} P(k_i|\bar{R}) \right) \times \left( \prod_{g_i(\vec{d}_j)=0} P(\bar{k}_i|\bar{R}) \right)}$$
$P(k_i|R)$ : probability that the index term $k_i$ is present in a document randomly selected from the set $R$ of relevant documents
$P(\bar{k}_i|R)$ : probability that the index term $k_i$ is not present in a document randomly selected from the set $R$
The probabilities associated with $\bar{R}$ have meanings which are analogous to the ones just described
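The Ranking
Finally
$$sim(d_j, q) \sim \sum_{i=1}^{t} w_{i,q} \times w_{i,j} \times \left( \log \frac{P(k_i|R)}{1 - P(k_i|R)} + \log \frac{1 - P(k_i|\bar{R})}{P(k_i|\bar{R})} \right)$$
Where
  $P(\bar{k}_i|R) = 1 - P(k_i|R)$
  $P(\bar{k}_i|\bar{R}) = 1 - P(k_i|\bar{R})$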
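One way to see how this sum arises from the product form is sketched below. It is a reconstruction of the usual argument, not a step shown on the slides; it assumes binary weights, writes $\bar{R}$ for the set of non-relevant documents, and uses $P(\bar{k}_i|\cdot) = 1 - P(k_i|\cdot)$.

```latex
\begin{align*}
\log \frac{P(\vec{d}_j|R)}{P(\vec{d}_j|\bar{R})}
 &= \sum_{g_i(\vec{d}_j)=1} \log \frac{P(k_i|R)}{P(k_i|\bar{R})}
  + \sum_{g_i(\vec{d}_j)=0} \log \frac{1 - P(k_i|R)}{1 - P(k_i|\bar{R})} \\
 &= \sum_{g_i(\vec{d}_j)=1} \left( \log \frac{P(k_i|R)}{1 - P(k_i|R)}
  + \log \frac{1 - P(k_i|\bar{R})}{P(k_i|\bar{R})} \right)
  + \sum_{i=1}^{t} \log \frac{1 - P(k_i|R)}{1 - P(k_i|\bar{R})}
\end{align*}
```

The last sum runs over all terms and does not depend on $d_j$, so it is the same for every document and can be dropped when ranking for a given query. Restricting the remaining sum to terms that appear in both the query and the document, which is what the binary product $w_{i,q} \times w_{i,j}$ selects, gives the ranking formula.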
The Initial Ranking
Similarity
$$sim(d_j, q) \sim \sum_{i=1}^{t} w_{i,q} \times w_{i,j} \times \left( \log \frac{P(k_i|R)}{1 - P(k_i|R)} + \log \frac{1 - P(k_i|\bar{R})}{P(k_i|\bar{R})} \right)$$
Probabilities $P(k_i|R)$ and $P(k_i|\bar{R})$?
Estimates based on assumptions:
  $P(k_i|R) = 0.5$
  $P(k_i|\bar{R}) = \frac{n_i}{N}$, where $n_i$ is the number of docs that contain $k_i$
Use this initial guess to retrieve an initial ranking
Improve upon this initial ranking
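With these initial estimates the first log term vanishes (0.5 / 0.5 = 1) and each matching term contributes log((N - n_i) / n_i). The sketch below assumes binary term weights, natural logarithms, and hypothetical collection statistics; note that it would fail for a term occurring in every document (n_i = N), and that terms occurring in more than half of the documents receive a negative weight.

```python
import math

def term_weight(p_rel, p_nonrel):
    """log(p / (1 - p)) + log((1 - u) / u) for one index term."""
    return math.log(p_rel / (1 - p_rel)) + math.log((1 - p_nonrel) / p_nonrel)

def initial_sim(doc_terms, query_terms, n_docs, doc_freq):
    """Initial score with P(ki|R) = 0.5 and P(ki|not R) = n_i / N, binary weights."""
    score = 0.0
    for term in set(query_terms) & set(doc_terms):
        score += term_weight(0.5, doc_freq[term] / n_docs)
    return score

# Hypothetical statistics: N = 100 documents, n_i per term.
N = 100
doc_freq = {"probabilistic": 5, "model": 50, "retrieval": 20}
query = ["probabilistic", "retrieval"]

s1 = initial_sim(["probabilistic", "model"], query, N, doc_freq)
s2 = initial_sim(["retrieval", "model"], query, N, doc_freq)
print(s1, s2)  # the document matching the rarer query term scores higher
```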
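Improving the Initial Ranking
$$sim(d_j, q) \sim \sum_{i=1}^{t} w_{i,q} \times w_{i,j} \times \left( \log \frac{P(k_i|R)}{1 - P(k_i|R)} + \log \frac{1 - P(k_i|\bar{R})}{P(k_i|\bar{R})} \right)$$
Let
  $V$ : set of docs initially retrieved
  $V_i$ : subset of docs retrieved that contain $k_i$
Reevaluate estimates ($V$ and $V_i$ also denote the number of elements in these sets):
  $P(k_i|R) = \frac{V_i}{V}$
  $P(k_i|\bar{R}) = \frac{n_i - V_i}{N - V}$
Repeat recursively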
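One re-estimation step can be sketched as follows, treating the top-ranked documents as the set V. The toy collection, the choice of the top 2 documents, and the function name are assumptions made for illustration.

```python
def reestimate(top_docs, all_docs, term):
    """Return (P(ki|R), P(ki|not R)) estimated from the top-ranked documents V.

    P(ki|R) = Vi / V and P(ki|not R) = (ni - Vi) / (N - V), using set sizes.
    """
    n_docs = len(all_docs)
    n_i = sum(1 for doc in all_docs if term in doc)
    v = len(top_docs)
    v_i = sum(1 for doc in top_docs if term in doc)
    return v_i / v, (n_i - v_i) / (n_docs - v)

# Hypothetical collection; each document is a set of index terms.
all_docs = [
    {"probabilistic", "model"},
    {"probabilistic", "ranking"},
    {"vector", "model"},
    {"boolean", "model"},
    {"vector", "ranking"},
    {"boolean", "query"},
]
top_docs = all_docs[:2]  # assumption: V is the top 2 of some initial ranking

print(reestimate(top_docs, all_docs, "model"))          # (0.5, 0.5)
print(reestimate(top_docs, all_docs, "probabilistic"))  # (1.0, 0.0)
```

The second result shows a practical difficulty with the raw ratios: estimates of exactly 0 or 1 make the log terms of the ranking formula undefined.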
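Improving the Initial Ranking
$$sim(d_j, q) \sim \sum_{i=1}^{t} w_{i,q} \times w_{i,j} \times \left( \log \frac{P(k_i|R)}{1 - P(k_i|R)} + \log \frac{1 - P(k_i|\bar{R})}{P(k_i|\bar{R})} \right)$$
To avoid problems with $V = 1$ and $V_i = 0$:
  $P(k_i|R) = \frac{V_i + 0.5}{V + 1}$
  $P(k_i|\bar{R}) = \frac{n_i - V_i + 0.5}{N - V + 1}$
Also,
  $P(k_i|R) = \frac{V_i + \frac{n_i}{N}}{V + 1}$
  $P(k_i|\bar{R}) = \frac{n_i - V_i + \frac{n_i}{N}}{N - V + 1}$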
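Both adjusted variants fit in one small function, which makes it easy to check that the estimates stay strictly between 0 and 1 even when no retrieved document contains the term. The counts below are hypothetical, the flag name `use_ni_over_n` is an invention of this sketch, and natural logarithms are assumed.

```python
import math

def smoothed_estimates(v, v_i, n_docs, n_i, use_ni_over_n=False):
    """Return (P(ki|R), P(ki|not R)) with an adjustment factor added.

    The factor is 0.5 by default, or n_i / N when use_ni_over_n is True.
    """
    adjust = n_i / n_docs if use_ni_over_n else 0.5
    p_rel = (v_i + adjust) / (v + 1)
    p_nonrel = (n_i - v_i + adjust) / (n_docs - v + 1)
    return p_rel, p_nonrel

def term_weight(p_rel, p_nonrel):
    """log(p / (1 - p)) + log((1 - u) / u), the per-term factor of the ranking sum."""
    return math.log(p_rel / (1 - p_rel)) + math.log((1 - p_nonrel) / p_nonrel)

# Hypothetical counts: N = 100, n_i = 10, V = 10 retrieved, none containing ki.
p_rel, p_nonrel = smoothed_estimates(v=10, v_i=0, n_docs=100, n_i=10)
weight = term_weight(p_rel, p_nonrel)
print(p_rel, p_nonrel, weight)  # finite weight despite Vi = 0

# Same counts with the n_i / N adjustment factor.
p_rel_alt, p_nonrel_alt = smoothed_estimates(10, 0, 100, 10, use_ni_over_n=True)
print(p_rel_alt, p_nonrel_alt)
```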
Pluses and Minuses
Advantages:
  Docs ranked in decreasing order of probability of relevance
Disadvantages:
  need to guess initial estimates for $P(k_i|R)$
  method does not take into account tf and idf factors
Brief Comparison of Classic Models
The Boolean model does not provide for partial matches and is considered to be the weakest classic model
Salton and Buckley did a series of experiments that indicate that, in general, the vector model outperforms the probabilistic model with general collections
This also seems to be the view of the research community