Basic IR Models / Vector Space Model / Probabilistic models / Solutions
Chap 2: Classical models for information retrieval
Jean-Pierre Chevallet & Philippe Mulhem
LIG-MRIM
Sept 2016
Jean-Pierre Chevallet & Philippe Mulhem Models of IR 1 / 81
Outline
1 Basic IR Models: Set models, Boolean Model, Beyond boolean logic, Weighted Boolean Model, Exercises
2 Vector Space Model: Weighting, Exercises, Implementing VSM, Learning from user
3 Probabilistic models
4 Solutions
IR System
[Diagram: an Information Retrieval System (IRS). A request is turned into a query through a representation of queries (query language); documents are turned into surrogates through a representation of their content (indexing language). A matching function, implemented over a knowledge base, compares the two. The Information Retrieval Model consists of the query model, the document model (content), the matching function, and the knowledge model.]
IR Models
Classical Models
Set
Boolean: strict, weighted
Vector Space
Probabilistic: classical, inference network, language model
Set Model
Membership of a set
Requests are single descriptors
Indexing assignments: a set of descriptors
In reply to a request, documents are either retrieved or not (no ordering)
Retrieval rule: if the descriptor in the request is a member of the descriptors assigned to a document, then the document is retrieved.
This model is the simplest one and describes the retrieval characteristics of a typical library where books are retrieved by looking up a single author, title or subject descriptor in a catalog.
Example
Request
"information retrieval"
Doc1
{"information retrieval", "database", "salton"} --> RETRIEVED <--
Doc2
{"database", "SQL"} --> NOT RETRIEVED <--
Set inclusion
Request: a set of descriptors
Indexing assignments: a set of descriptors
Documents are either retrieved or not (no ordering)
Retrieval rule: a document is retrieved if ALL the descriptors in the request are in the indexing set of the document.
This model uses the notion of inclusion of the descriptor set of the request in the descriptor set of the document.
Set intersection
Request: a set of descriptors PLUS a cut-off value
Indexing assignments: a set of descriptors
Documents are either retrieved or not (no ordering)
Retrieval rule: a document is retrieved if it shares a number of descriptors with the request that exceeds the cut-off value.
This model uses the notion of set intersection between the descriptor set of the request and the descriptor set of the document.
Set intersection plus ranking
Request: a set of descriptors PLUS a cut-off value
Indexing assignments: a set of descriptors
Retrieved documents are ranked
Retrieval rule: documents sharing with the request more than the specified number of descriptors are ranked in order of decreasing overlap.
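The four set-model variants above can be sketched in a few lines of Python (a toy illustration; the function names are mine, and the documents come from the earlier Doc1/Doc2 example):

```python
def member(doc, descriptor):
    """Set model: retrieve iff the single request descriptor indexes the document."""
    return descriptor in doc

def included(doc, request):
    """Set inclusion: retrieve iff ALL request descriptors index the document."""
    return request <= doc

def overlap(doc, request, cutoff):
    """Set intersection: retrieve iff the shared descriptors exceed the cut-off."""
    return len(doc & request) > cutoff

def ranked(docs, request, cutoff):
    """Set intersection plus ranking: order retrieved docs by decreasing overlap."""
    hits = [(name, len(d & request)) for name, d in docs.items()]
    return sorted([h for h in hits if h[1] > cutoff], key=lambda h: -h[1])

docs = {"Doc1": {"information retrieval", "database", "salton"},
        "Doc2": {"database", "SQL"}}
print(member(docs["Doc1"], "information retrieval"))  # True  -> RETRIEVED
print(member(docs["Doc2"], "information retrieval"))  # False -> NOT RETRIEVED
print(ranked(docs, {"information retrieval", "database"}, 0))  # [('Doc1', 2), ('Doc2', 1)]
```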
Logical Models
Based on a given formalized logic:
Propositional Logic: boolean model
First Order Logic : Conceptual Graph Matching
Modal Logic
Description Logic: matching with knowledge
Concept Analysis
Fuzzy logic
...
Matching: a deduction of the query Q from the document D.
Boolean Model
Request is any boolean combination of descriptors using the operators AND, OR and NOT
Indexing assignments: a set of descriptors
Documents are either retrieved or not
Retrieval rules:
if Request = ta ∧ tb then retrieve only documents with both ta and tb
if Request = ta ∨ tb then retrieve only documents with either ta or tb
if Request = ¬ta then retrieve only documents without ta.
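The three retrieval rules can be illustrated with a small recursive evaluator (a minimal sketch; the tuple-based query encoding is my own convention, not from the slides):

```python
def matches(doc, query):
    """Evaluate a boolean request against a document's descriptor set.
    A query is a descriptor, or a tuple ('and'|'or', q1, q2) / ('not', q1)."""
    if isinstance(query, str):
        return query in doc
    op = query[0]
    if op == "and":
        return matches(doc, query[1]) and matches(doc, query[2])
    if op == "or":
        return matches(doc, query[1]) or matches(doc, query[2])
    if op == "not":
        return not matches(doc, query[1])
    raise ValueError("unknown operator: " + op)

doc = {"t1", "t3"}
print(matches(doc, ("and", "t1", "t3")))  # True
print(matches(doc, ("and", "t1", "t2")))  # False
print(matches(doc, ("not", "t2")))        # True
```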
Boolean Model
Knowledge Model: T = {ti}, i ∈ [1..N]
Terms ti that index the documents
The document model (content) is a Boolean expression in propositional logic, with the ti considered as propositions:
A document D1 = {t1, t3} is represented by the logic formula built as a conjunction of all terms, direct (in the set) or negated: t1 ∧ ¬t2 ∧ t3 ∧ ¬t4 ∧ ... ∧ ¬tN−1 ∧ ¬tN
A query Q is represented by any logic formula
The matching function is the logical implication: D |= Q
Boolean Model: relevance value
No distinction between relevant documents:
Q = t1 ∧ t2 over the vocabulary {t1, t2, t3, t4, t5, t6}
D1 = {t1, t2} ≡ t1 ∧ t2 ∧ ¬t3 ∧ ¬t4 ∧ ¬t5 ∧ ¬t6
D2 = {t1, t2, t3, t4, t5} ≡ t1 ∧ t2 ∧ t3 ∧ t4 ∧ t5 ∧ ¬t6
Both documents are relevant because D1 ⊃ Q and D2 ⊃ Q. We "feel" that D1 is a better response because it is "closer" to the query.
Possible solution: D ⊃ Q and Q ⊃ D. (See chapter on Logical Models)
Boolean Model: complexity of queries
Q = ((t1 ∧ t2) ∨ t3) ∧ (t4 ∨ ¬(¬t5 ∧ t6))
The meaning of the logical ∨ (inclusive) differs from the usual "or" of natural language (often exclusive)
Conclusion for boolean model
Model used until the 1990s.
Very simple to implement
Still used in some web search engines as a fast basic document filter, with an extra step for document ordering.
Beyond boolean logic: choice of a logic
A lot of possible choices to go beyond the boolean model. The choice of the underlying logic L to model the IR matching process:
Propositional Logic
First order logic
Modal Logic
Fuzzy Logic
Description logic
....
Beyond boolean logic: matching interpretation
For a logic L, different possible interpretations of matching:
Deduction theory
D ⊢L Q
⊢L D ⊃ Q
Model theory
D |=L Q
|=L D ⊃ Q
Interpretations
[Figure: the 16 interpretations over {t1, t2, t3, t4}, shown as binary vectors 0000 to 1111. The document d = {t1, t2, t3} corresponds to interpretation 1110; the interpretations that answer the query q = t1 ∧ t3 are those with both t1 and t3 set (1010, 1011, 1110, 1111); all others are other interpretations / other documents.]
Inverted File
[Figure: an inverted file answering the query a ∧ b: each term a, b, c, d points to its binary posting list over the documents d1, d2, d3, ...; the answer is obtained by intersecting the lists of a and b.]
Implementing the boolean model
Posting list:
List of unique doc id associates with an indexing term
If no ordering, this list is in fact a set
If ordering, nice algorithm to fuse lists
Main set (list) fusion algorithms:
Intersection, usion and complement
Algorithm depend on order.
If no order complexity is O(n ∗m)
if order, complexity is O(n + m) ! So maintain posting listordered.
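The ordered case can be sketched as follows (a minimal illustration of the O(n + m) intersection; the union and complement variants follow the same two-pointer pattern):

```python
def intersect(xs, ys):
    """Intersect two sorted posting lists in O(n + m): advance the pointer
    sitting on the smaller doc id; emit the id when both pointers agree."""
    out, i, j = [], 0, 0
    while i < len(xs) and j < len(ys):
        if xs[i] == ys[j]:
            out.append(xs[i])
            i += 1
            j += 1
        elif xs[i] < ys[j]:
            i += 1
        else:
            j += 1
    return out

print(intersect([1, 3, 5, 8, 9], [2, 3, 8, 10]))  # [3, 8]
```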
Implementing the boolean model
Left as an exercise: produce the 3 algorithms (union, intersection and complement) with the following constraints:
The two lists are read only
The produced list is a new one and is write only
Using sorting algorithms is not allowed
Deduce the complexity from the algorithms.
Weighted Boolean Model
Extension of the Boolean Model with weights, a weight denoting the representativity of a term for a document.
Knowledge Model: T = {ti}, i ∈ [1..N]: terms ti that index the documents.
A document D is represented by:
A logical formula D (similar to the Boolean Model)
A function WD : T → [0, 1], which gives, for each term in T, the weight of the term in D. The weight is 0 for a term not present in the document.
Weighted Boolean Model: matching
Non binary matching function based on Fuzzy logic
RSV (D, a ∨ b) = Max [WD(a),WD(b)]
RSV (D, a ∧ b) = Min[WD(a),WD(b)]
RSV (D,¬a) = 1−WD(a)
Limitation: this matching does not take all the query terms into account.
Weighted Boolean Model: matching
Non-binary matching function based on a similarity that takes all the query terms better into account:
RSV(D, a ∨ b) = √( (WD(a)² + WD(b)²) / 2 )
RSV(D, a ∧ b) = 1 − √( ((1 − WD(a))² + (1 − WD(b))²) / 2 )
RSV(D, ¬a) = 1 − WD(a)
Limitation: query expression for complex needs.
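Both weighted matching variants can be sketched together (a toy implementation; the tuple-based query encoding and the fuzzy flag are my own conventions):

```python
from math import sqrt

def rsv(w, q, fuzzy=True):
    """RSV of a document for a query tree.
    w: dict term -> weight WD(t); q: a term, or ('not', q1) / ('and'|'or', q1, q2).
    fuzzy=True: Min/Max fuzzy-logic variant; fuzzy=False: distance-based variant."""
    if isinstance(q, str):
        return w.get(q, 0.0)
    if q[0] == "not":
        return 1.0 - rsv(w, q[1], fuzzy)
    x, y = rsv(w, q[1], fuzzy), rsv(w, q[2], fuzzy)
    if fuzzy:
        return max(x, y) if q[0] == "or" else min(x, y)
    if q[0] == "or":                                     # reward distance from (0, 0)
        return sqrt((x * x + y * y) / 2)
    return 1 - sqrt(((1 - x) ** 2 + (1 - y) ** 2) / 2)  # reward closeness to (1, 1)

w = {"a": 1.0, "b": 0.0}  # binary weights, as in the matching example
print(round(rsv(w, ("or", "a", "b"), fuzzy=False), 2))   # 0.71
print(round(rsv(w, ("and", "a", "b"), fuzzy=False), 2))  # 0.29
```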
Weighted Boolean Model: matching example
Document | a | b | Boolean a ∧ b | Boolean a ∨ b | Weighted a ∧ b | Weighted a ∨ b
D1 | 1 | 1 | 1 | 1 | 1 | 1
D2 | 1 | 0 | 0 | 1 | 1 − 1/√2 ≈ 0.29 | 1/√2 ≈ 0.71
D3 | 0 | 1 | 0 | 1 | 1 − 1/√2 ≈ 0.29 | 1/√2 ≈ 0.71
D4 | 0 | 0 | 0 | 0 | 0 | 0
Weighted Boolean Model: matching example
d1 = (0.2, 0.4), d2 = (0.6, 0.9), d3 = (0, 1)
[Figure: the documents plotted in the (a, b) plane, with weights in [0, 1] on each axis. For a ∨ b the score grows with the distance from (0, 0) (to avoid (0, 0)); for a ∧ b it grows as the document gets closer to (1, 1). For d3 = (0, 1): RSV(d3, a ∨ b) = 1/√2 ≈ 0.71 and RSV(d3, a ∧ b) = 1 − 1/√2 ≈ 0.29.]
Upper bounds for results
Let ti be an indexing term, #ti the size of its posting list, and |Q| the number of documents that answer the boolean query Q.
1 Propose an upper bound value, in terms of posting list sizes, for the queries t1 ∨ t2 and t1 ∧ t2
2 If we suppose that #t1 > #t2 > #t3, compute the upper bound value of |t1 ∧ t2 ∧ t3|, then propose a way to reduce the global computation of the posting list merge¹.
3 If we suppose that #t1 > #t2 > #t3, compute the upper bound value of |t1 ∨ t2| and of |(t1 ∨ t2) ∧ t3|, then propose a way to reduce the global computation of the posting list merge of query (t1 ∨ t2) ∧ t3
¹ Use the complexity value of merge depending on the list sizes.
Vector Space Model
Request: a set of descriptors, each of which has a positive number associated with it
Indexing assignments: a set of descriptors, each of which has a positive number associated with it
Retrieved documents are ranked
Retrieval rule: the weights of the descriptors common to the request and to indexing records are treated as vectors. The value of a retrieved document is the cosine of the angle between the document vector and the query vector.
Vector Space Model
Knowledge model: T = {ti}, i ∈ [1..n]. All documents are described using this vocabulary.
A document Di is represented by a vector di in the Rⁿ vector space defined on T: di = (wi,1, wi,2, ..., wi,j, ..., wi,n), with wi,j the weight of term tj for document Di.
A query Q is represented by a vector q in the same vector space: q = (wQ,1, wQ,2, ..., wQ,j, ..., wQ,n)
Vector Space Model
The more two vectors that represent documents are "near", the more the documents are similar:
[Figure: two document vectors Di and Dj in a 3-dimensional term space (Term 1, Term 2, Term 3).]
Vector Space Model: matching
Relevance is related to a vector similarity: RSV(D, Q) =def SIM(D⃗, Q⃗)
Symmetry: SIM(D⃗, Q⃗) = SIM(Q⃗, D⃗)
Normalization: SIM : V → [min, max]
Reflexivity: SIM(X⃗, X⃗) = max
Vector Space Model: weighting
Based on counting the more frequent words, which are also among the more significant ones.
From Luhn 58
Zipf Law
Rank r and frequency f:
r × f = const
[Figure: term frequency fr decreasing hyperbolically with term rank, from f1 at t1 down to fr at tr.]
Heap Law
Estimation of the vocabulary size |V | according to the size |C | ofthe corpus:
|V | = K × |C |β
for English: K ∈ [10, 100] and β ∈ [0.4, 0.6] (e.g., |V| ≈ 39 000 for |C| = 600 000, K = 50, β = 0.5)
[Figure: vocabulary size V growing sublinearly with corpus size T.]
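The estimate above can be sketched directly (a minimal sketch; the constants are the slide's example values, not universal):

```python
def heaps_vocab(corpus_size, K=50, beta=0.5):
    """Heaps' law estimate of vocabulary size: |V| = K * |C|**beta."""
    return K * corpus_size ** beta

# The slide's example: |C| = 600 000 with K = 50 and beta = 0.5
print(round(heaps_vocab(600_000)))  # 38730, i.e. |V| ≈ 39 000
```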
Term Frequency
Term Frequency
The frequency tfi,j of the term tj in the document Di equals the number of occurrences of tj in Di.
Taking the whole corpus (document database) into account, a term that occurs a lot does not discriminate documents:
Term frequent in the whole corpus
Term frequent in only one document of the corpus
Document Frequency: tf.idf
Document Frequency
The document frequency dfj of a term tj is the number of documents in which tj occurs.
The larger the df, the worse the term from an IR point of view... so we very often use the inverse document frequency idfj:
Inverse Document Frequency
idfj = 1 / dfj
idfj = log( |C| / dfj ), with |C| the size of the corpus, i.e. the number of documents.
The classical combination: wi,j = tfi,j × idfj.
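The tf.idf combination can be sketched end to end on a toy corpus (the three documents are invented for illustration):

```python
import math
from collections import Counter

# Toy corpus, invented for illustration
docs = {
    "d1": "information retrieval models information",
    "d2": "database query language",
    "d3": "information extraction database",
}
N = len(docs)

tf = {d: Counter(text.split()) for d, text in docs.items()}   # tf_ij
df = Counter(t for terms in tf.values() for t in terms)       # df_j
idf = {t: math.log(N / df[t]) for t in df}                    # idf_j = log(|C| / df_j)
w = {d: {t: tf[d][t] * idf[t] for t in tf[d]} for d in tf}    # w_ij = tf_ij * idf_j

# "information" occurs twice in d1 and appears in 2 of the 3 documents
print(round(w["d1"]["information"], 3))  # 0.811 (= 2 * log(3/2))
```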
matching
Matching function: based on the angle between the query vector Q⃗ and the document vector D⃗i. The smaller the angle, the more the document matches the query.
[Figure: the query vector Q and a document vector Di in a 3-dimensional term space (Term 1, Term 2, Term 3).]
matching: cosine
One solution is to use the cosine of the angle between the query vector and the document vector.
Cosine
SIMcos(D⃗i, Q⃗) = ( Σ_{k=1..n} wi,k × wQ,k ) / √( Σ_{k=1..n} (wi,k)² × Σ_{k=1..n} (wQ,k)² )
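The cosine can be sketched over sparse weight vectors, i.e. dicts mapping terms to weights (the example weights are invented):

```python
import math

def cosine(d, q):
    """Cosine between two sparse weight vectors (dicts term -> weight)."""
    dot = sum(wt * q[t] for t, wt in d.items() if t in q)  # shared terms only
    nd = math.sqrt(sum(wt * wt for wt in d.values()))
    nq = math.sqrt(sum(wt * wt for wt in q.values()))
    return dot / (nd * nq) if nd and nq else 0.0

d = {"information": 0.81, "retrieval": 1.10}   # invented tf.idf weights
q = {"information": 1.0}
print(round(cosine(d, q), 3))  # 0.593
```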
matching: set interpretation
With binary weights, one has a set interpretation D{}i and Q{}:
SIMcos(D⃗i, Q⃗) = |D{}i ∩ Q{}| / √( |D{}i| × |Q{}| )
SIMcos(D⃗i, Q⃗) = 0 : D{}i and Q{} are disjoint
SIMcos(D⃗i, Q⃗) = 1 : D{}i and Q{} are equal
Other matching functions
Dice coefficient
SIMdice(D⃗i, Q⃗) = ( 2 Σ_{k=1..N} wi,k × wQ,k ) / ( Σ_{k=1..N} (wi,k + wQ,k) )
Discrete Dice coefficient
SIMdice(D⃗i, Q⃗) = 2 |D{}i ∩ Q{}| / ( |D{}i| + |Q{}| )
Matching cosine simplification
Question
Transform the cosine formula so as to simplify the matching formula.
Cosine
SIMcos(D⃗i, Q⃗) = ( Σ_{k=1..n} wi,k × wQ,k ) / √( Σ_{k=1..n} (wi,k)² × Σ_{k=1..n} (wQ,k)² )
Exercise: Similarity, dissimilarity and distance
The Euclidean distance between points x and y is expressed by:
L2(x, y) = Σ_i (xi − yi)²
Question
Show the potential link between L2 and the cosine.
Tips:
consider points as vectors
consider document vectors as normalized, i.e. with Σ_i yi² constant.
Exercise: dot product and list intersection
The index is usually very sparse: a (useful) term appears in less than 1% of documents. A term t has a non-null weight in D⃗ iff t appears in document D.
Question
Show that the algorithm for computing the dot product is equivalent to a list intersection.
Exercise: efficiency of list intersection
Let two lists X and Y, of sizes |X| and |Y|:
Question
Compare the efficiency of a non-sorted list intersection algorithm and a sorted list intersection.
Link between dot product and inverted files
If RSV(x, y) ∝ Σ_i xi·yi after some normalizations of y, then:
RSV(x, y) ∝ Σ_{i, xi≠0, yi≠0} xi·yi
Query: only process the terms that are in the query, i.e. whose weights are not null.
Documents: store only non-null terms.
Inverted file: access to the non-null document weights for each term id.
The index
Use an inverted file, like in the Boolean model
Store term frequencies in the posting list
Do not pre-compute the weights, but keep raw integer values in the index
Compute the matching value on the fly, i.e. during posting list intersection
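These four points can be sketched with a toy index (the postings and corpus size are invented; weights are derived on the fly from the raw frequencies stored in the index, as the slide suggests):

```python
import math

# Toy inverted index, invented for illustration:
# term -> posting list of (doc id, raw term frequency), doc ids sorted
postings = {
    "information": [(1, 2), (3, 1)],
    "database":    [(2, 1), (3, 1)],
}
N = 3  # corpus size

def score(query_terms):
    """Accumulate tf.idf scores document by document: each weight is
    computed on the fly from the raw frequency stored in the index."""
    acc = {}
    for t in query_terms:
        plist = postings.get(t, [])
        if not plist:
            continue
        idf = math.log(N / len(plist))   # df = posting list length
        for doc, tf in plist:
            acc[doc] = acc.get(doc, 0.0) + tf * idf
    return sorted(acc.items(), key=lambda pair: -pair[1])

print(score(["information", "database"]))  # docs 1 and 3 tie, doc 2 last
```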
Matching: matrix product
With an inverted file, in practice, the matching computation is a matrix product:
[Figure: the query vector (terms a and b set to 1, c and d to 0) is multiplied by the term-document matrix over d1, d2, d3, ..., giving one score per document.]
Relevance feedback
To learn system relevance from user relevance:
[Flowchart: Start → first query → Matching → answer presentation → user evaluation → satisfaction? If yes: End; if no: query reformulation → Matching → ...]
Rocchio Formula
Q⃗i+1 = α Q⃗i + β R⃗eli − γ n⃗Reli
With:
R⃗eli : the cluster center of relevant documents, i.e., the positive feedback
n⃗Reli : the cluster center of non-relevant documents, i.e., the negative feedback
Note: when using this formula, the generated query vector may contain negative values.
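The formula can be sketched over sparse vectors (a toy implementation; the α, β, γ values shown are common defaults, not values from the slides):

```python
def rocchio(q, rel, nonrel, alpha=1.0, beta=0.75, gamma=0.15):
    """Q_{i+1} = alpha*Qi + beta*centroid(Rel_i) - gamma*centroid(nRel_i).
    Vectors are sparse dicts term -> weight."""
    def centroid(vecs):
        c = {}
        for v in vecs:                  # empty feedback -> empty centroid
            for t, wt in v.items():
                c[t] = c.get(t, 0.0) + wt / len(vecs)
        return c

    new_q = {t: alpha * wt for t, wt in q.items()}
    for t, wt in centroid(rel).items():
        new_q[t] = new_q.get(t, 0.0) + beta * wt
    for t, wt in centroid(nonrel).items():
        new_q[t] = new_q.get(t, 0.0) - gamma * wt   # may become negative
    return new_q

q = {"information": 1.0}
rel = [{"information": 1.0, "retrieval": 1.0}]
nonrel = [{"database": 1.0}]
print(rocchio(q, rel, nonrel))
# {'information': 1.75, 'retrieval': 0.75, 'database': -0.15}
```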
Probabilistic Models
Capture the IR problem in a probabilistic framework.
First probabilistic model (Binary Independence Retrieval Model) by Robertson and Spärck Jones in 1976...
Late 90s: emergence of language models, still a hot topic in IR
Overall question: "what is the probability for a document to be relevant to a query?" There are several interpretations of this sentence.
Models
Classical Models: Boolean, Vector Space, Probabilistic
Probabilistic family: classical [Robertson & Spärck Jones], inference network [Turtle & Croft], language models [Croft], [Hiemstra], [Nie]
Probabilistic Model of IR
Different ways of seeing a probabilistic approach to information retrieval:
Classical approach: probability of the event Relevant, knowing one document and one query.
Inference Networks approach: probability that the query is true after inference from the content of a document.
Language Models approach: probability that a query is generated from a document.
Binary Independence Retrieval Model
Robertson & Spärck Jones 1976
computes the relevance of a document from the relevance known a priori from other documents.
achieved by estimating, for each indexing term of the document, its probability of occurrence in relevant documents, and by using Bayes' theorem and a decision rule.
Binary Independence Retrieval Model
R: binary random variable
R = r : relevant; R = r̄ : non-relevant
P(R = r|d, q): probability that R is relevant for the document and the query considered (noted P(r|d, q))
The probability of relevance depends only on the document and the query
Term weights are binary: d = (11...100...), wdt = 0 or 1
Each term t is characterized by a binary variable wt, indicating the probability that the term occurs:
P(wt = 1|q, r): probability that t occurs in a relevant document.
P(wt = 0|q, r) = 1 − P(wt = 1|q, r)
The terms are conditionally independent given R
Binary Independence Retrieval Model
[Figure: for a query Q, the corpus is split into the Relevant Documents (R = r) and the Non-Relevant Documents (R = r̄), with Corpus = Rel ∪ NonRel and Rel ∩ NonRel = ∅.]
Binary Independence Retrieval Model
Matching function: use of Bayes' theorem:
P(r|d, q) = P(d|r, q) · P(r, q) / P(d, q)
P(r|d, q): probability that the document d belongs to the set of relevant documents of the query q.
P(d|r, q): probability to obtain the description d from the observed relevances.
P(d, q): probability that the document d is picked for q.
P(r, q): the relevance probability; it is the chance of randomly taking one document from the corpus which is relevant for the query q.
BIR: Matching function
Decision: a document is retrieved if:
P(r|d, q) / P(r̄|d, q) = [ P(d|r, q) · P(r, q) ] / [ P(d|r̄, q) · P(r̄, q) ] > 1
IR looks for a ranking, so we eliminate P(r, q)/P(r̄, q) because it is constant for a given query.
BIR: Matching function
In IR, it is simpler to use logs:
rsv(d) =rank log( P(d|r, q) / P(d|r̄, q) )
BIR: Matching function
Simplification using the hypothesis of independence between terms (Binary Independence), with weight wdt for term t in d:
P(d|r, q) = P(d = (10...110...)|r, q) = Π_{wdt=1} P(wdt = 1|r, q) × Π_{wdt=0} P(wdt = 0|r, q)
P(d|r̄, q) = P(d = (10...110...)|r̄, q) = Π_{wdt=1} P(wdt = 1|r̄, q) × Π_{wdt=0} P(wdt = 0|r̄, q)
BIR: Matching function
Let's simplify the notations:
P(wdt = 1|r, q) = pt
P(wdt = 0|r, q) = 1 − pt
P(wdt = 1|r̄, q) = ut
P(wdt = 0|r̄, q) = 1 − ut
Hence:
P(d|r, q) = Π_{wdt=1} pt × Π_{wdt=0} (1 − pt)
P(d|r̄, q) = Π_{wdt=1} ut × Π_{wdt=0} (1 − ut)
BIR: Matching function
Now let's compute the Relevance Status Value:
rsv(d) =rank log( P(d|r, q) / P(d|r̄, q) ) = log( [ Π_{wdt=1} pt × Π_{wdt=0} (1 − pt) ] / [ Π_{wdt=1} ut × Π_{wdt=0} (1 − ut) ] )
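The final rsv can be sketched directly from this derivation (a toy computation; the pt and ut values below are invented, and in practice they must be estimated):

```python
import math

def rsv_bir(doc_terms, vocab, p, u):
    """rsv(d) = log(P(d|r,q) / P(d|r̄,q)) under binary independence:
    sum log(pt/ut) over present terms, log((1-pt)/(1-ut)) over absent ones."""
    s = 0.0
    for t in vocab:
        if t in doc_terms:
            s += math.log(p[t] / u[t])
        else:
            s += math.log((1 - p[t]) / (1 - u[t]))
    return s

# Invented estimates: t1 indicates relevance, t2 is neutral
p = {"t1": 0.8, "t2": 0.3}   # pt = P(w_t = 1 | r, q)
u = {"t1": 0.2, "t2": 0.3}   # ut = P(w_t = 1 | r̄, q)
print(round(rsv_bir({"t1"}, ["t1", "t2"], p, u), 3))  # 1.386 (= log 4)
```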
Conclusion
Common steps for indexing:
Define the index set
Automate index construction from documents
Select a model for index representation and weighting
Define a matching process and an associated ranking
This is used for textual processing but also for other media.
Solution: matching cosine simplification
Because Q is constant, the norm of vector Q is a constant that can be removed:
K = 1 / √( Σ_{k=1..n} (wQ,k)² )
SIMcos(D⃗i, Q⃗) = K × ( Σ_{k=1..n} wi,k × wQ,k ) / √( Σ_{k=1..n} (wi,k)² )
Solution: matching cosine simplification
The division of the document vector by its norm can be precomputed:
w′i,k = wi,k / √( Σ_{k=1..n} (wi,k)² )
Solution: matching cosine simplification
Hence, one just has to compute a vector dot product:
SIMcos(D⃗i, Q⃗) = K Σ_{k=1..n} w′i,k × wQ,k
The constant does not influence the order:
RSVcos(D⃗i, Q⃗) ∝ Σ_{k=1..n} w′i,k × wQ,k
Solution: Similarity, dissimilarity and distance
To transform a distance or dissimilarity into a similarity, simply negate the value. Ex. with the squared Euclidean distance:
RSV_D2(x, y) = −D2(x, y)² = −Σ_i (xi − yi)²
= −Σ_i (xi² + yi² − 2·xi·yi) = −Σ_i xi² − Σ_i yi² + 2 Σ_i xi·yi
If x is the query, then Σ_i xi² is constant for matching with a document set:
RSV_D2(x, y) ∝ −Σ_i yi² + 2 Σ_i xi·yi
Solution: Similarity, dissimilarity and distance
RSV_D2(x, y) ∝ −Σ_i yi² + 2 Σ_i xi·yi
If Σ_i yi² is constant over the corpus, then:
RSV_D2(x, y) ∝ 2 Σ_i xi·yi
RSV_D2(x, y) ∝ Σ_i xi·yi
Hence, if we normalize the corpus so that each document vector length is a constant, then using the Euclidean distance as a similarity provides the same results as the cosine of the vector space model!
Solution: dot product and list intersection
In the dot product Σ_i xi·yi, if the term at rank i is not in x or not in y, then the product xi·yi is null and does not contribute to the value.
Hence, if we compute Σ_i xi·yi directly, using a list of terms for each document, then we sum the xi·yi only for terms that are in the intersection.
Hence, it is equivalent to a list intersection.
Solution: efficiency of list intersection
If X and Y are not sorted, for each xi one must find the term in Y by sequential search, so the complexity is O(|X| × |Y|).
If X and Y are sorted, the algorithm uses two pointers:
Before the two pointers, the lists are already merged
Compare the terms under the two pointers:
If the terms are equal: we can compute the merge (or the product), then move the two pointers
If the terms are different: because of the order, we are sure the smaller term cannot be present in the other list, so we can move its pointer
Since each step moves at least one pointer, the complexity is O(|X| + |Y|)