Some Information Retrieval Models and Our Experiments for TREC KBA
INFORMATION RETRIEVAL MODELS / TREC KBA
Patrice Bellot, Aix-Marseille Université - CNRS (LSIS UMR 7296; OpenEdition)
patrice.bellot@univ-amu.fr
LSIS - DIMAG team: http://www.lsis.org/spip.php?id_rubrique=291
OpenEdition Lab: http://lab.hypotheses.org
P. Bellot (AMU-CNRS, LSIS-OpenEdition)
— What Web search engines can do and what they still can't
— The Main Statistical Information Retrieval Models for Texts
— Entity linking and Entity oriented Document Retrieval
Mining large text collections
Robustness (documents, queries, information needs, languages…)
Be fast, be relevant
Do we really need (formal) semantics? Do we need deep (symbolic) language analysis?
Vertical vs horizontal search vs … ?
Horizontal search (Google search, Bing…)
Vertical search (e.g. Health search engines)
Future ?
What models? What NLP? What resources should be used? What (and how) can be learned?
INFORMATION RETRIEVAL MODELS
Information Retrieval / Document Retrieval
• Objective: finding the "documents" that best correspond to the user's request
• Problems: — Interpreting the query — Interpreting the documents (indexing) — Defining a relevance score (a ranking function)
• Solutions: — Distributional hypothesis = statistical and probabilistic approaches (+ linear algebra) — Natural Language Processing — Knowledge Engineering
• Indexing: — Assigning terms to documents (number of terms = exhaustivity vs. specificity) — Index term weighting based on the occurrence frequency of terms in documents and on the number of documents in which a term occurs (document frequency)
Evaluation
• The aim is to retrieve as many relevant documents as possible and as few non-relevant documents as possible
• Relevance is not truth
• Precision and Recall
• Precision and recall can be estimated at different cut-off ranks (P@n)
• Other measures : (mean) average precision (MAP), Discounted Cumulative Gain, Mean Reciprocal Rank…
• International Challenges : TREC, CLEF, INEX, NTCIR…
Evaluation: Precision and Recall

In the ideal case, the set of retrieved documents is equal to the set of relevant documents. However, in most cases, the two sets will be different. This difference is formally measured with precision and recall.

Precision = (number of relevant documents retrieved) / (number of documents retrieved)

Recall = (number of relevant documents retrieved) / (number of relevant documents)
Mounia Lalmas (Yahoo! Research) 20-21 June 2011 59 / 171
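As a concrete illustration, precision, recall, and P@n can be computed in a few lines. This is a minimal sketch; the function name and the document ids are illustrative, not from the slides:

```python
def precision_recall(retrieved, relevant, k=None):
    """Precision and recall for a ranked list, optionally cut off at rank k (P@k, R@k).

    `retrieved` is a ranked list of document ids; `relevant` is the set of
    relevant ids (made-up names for illustration).
    """
    if k is not None:
        retrieved = retrieved[:k]
    hits = sum(1 for d in retrieved if d in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

ranked = ["d4", "d2", "d5", "d7", "d1"]
relevant = {"d1", "d3", "d4"}
print(precision_recall(ranked, relevant))        # over the whole ranked list
print(precision_recall(ranked, relevant, k=3))   # P@3 and R@3
```

On the whole list, 2 of the 5 retrieved documents are relevant (precision 0.4) and 2 of the 3 relevant documents are found (recall 2/3); at cut-off 3 both drop to 1/3.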
Document retrieval: the Vector Space Model

• Classical solution: the Vector Space Model
• In the index, a (non-binary) weight is associated with every word in each document that contains it
• Every document d is represented as a vector
• The query q is represented as a vector in the document space
• The degree of similarity between a document and the query is computed from the weights w of the words m
\vec{d} = \begin{pmatrix} w_{m_1,d} \\ w_{m_2,d} \\ \vdots \\ w_{m_n,d} \end{pmatrix}
\qquad
\vec{q} = \begin{pmatrix} w_{m_1,q} \\ w_{m_2,q} \\ \vdots \\ w_{m_n,q} \end{pmatrix}

where w_{m_i,d} is the weight of term m_i in document d.

s(\vec{d}, \vec{q}) = \sum_{i=1}^{n} w_{m_i,d} \cdot w_{m_i,q}    (1)
Ranking function: e.g. dot product / cosine

• Similarity function: dot product
• Normalization?
• Cosine similarity function
s(\vec{d}, \vec{q}) = \sum_{i=1}^{n} w_{m_i,d} \cdot w_{m_i,q}    (1)

w_{i,d} = \frac{w_{i,d}}{\sqrt{\sum_{j=1}^{n} w_{j,d}^2}}    (2)

s(\vec{d}, \vec{q}) = \sum_{i=1}^{n} \frac{w_{i,d}}{\sqrt{\sum_{j=1}^{n} w_{j,d}^2}} \cdot \frac{w_{i,q}}{\sqrt{\sum_{j=1}^{n} w_{j,q}^2}} = \frac{\vec{d} \cdot \vec{q}}{\|\vec{d}\|_2 \, \|\vec{q}\|_2} = \cos(\vec{d}, \vec{q})    (3)
[Figure: the cosine of the angle between the document and query vectors]
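The dot product (1) and the cosine (3) can be sketched directly. A minimal illustration, assuming documents and queries are represented as term-to-weight dictionaries (a representation chosen for the sketch, not prescribed by the slides):

```python
import math

def dot(d, q):
    """Equation (1): s(d, q) = sum over shared terms of w_{m,d} * w_{m,q}."""
    return sum(w * q[t] for t, w in d.items() if t in q)

def cosine(d, q):
    """Equation (3): the dot product of the L2-normalized vectors."""
    nd = math.sqrt(sum(w * w for w in d.values()))
    nq = math.sqrt(sum(w * w for w in q.values()))
    return dot(d, q) / (nd * nq) if nd and nq else 0.0

doc = {"baby": 2.0, "health": 1.0, "safety": 1.0}   # made-up weights
query = {"baby": 1.0, "health": 1.0}
print(dot(doc, query), round(cosine(doc, query), 5))
```

The normalization in (2)-(3) keeps long documents from dominating the ranking simply because they contain more (or more heavily weighted) terms.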
TAL et RI - Rech. doc - Classif. et catégorisation - Q&A - Campagnes
Example
Terms:
T1: Bab(y,ies,y's)
T2: Child(ren's)
T3: Guide
T4: Health
T5: Home
T6: Infant
T7: Proofing
T8: Safety
T9: Toddler

Documents:
D1: Infant & Toddler First Aid
D2: Babies and Children's Room (For Your Home)
D3: Child Safety at Home
D4: Your Baby's Health and Safety: From Infant to Toddler
D5: Baby Proofing Basics
D6: Your Guide to Easy Rust Proofing
D7: Beanie Babies Collector's Guide
The indexed terms are italicized in the titles. Also, the stems [BB05] of the terms for baby (and its variants) and child (and its variants) are used to save storage and improve performance. The term-by-document matrix for this document collection is
A = \begin{pmatrix}
0 & 1 & 0 & 1 & 1 & 0 & 1 \\
0 & 1 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 1 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 1 & 1 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 1 & 0 \\
0 & 0 & 1 & 1 & 0 & 0 & 0 \\
1 & 0 & 0 & 1 & 0 & 0 & 0
\end{pmatrix}
For a query on baby health, the query vector is

q = [ 1 0 0 1 0 0 0 0 0 ]^T.

To process the user's query, the cosines

\delta_i = \cos \theta_i = \frac{q^T d_i}{\|q\|_2 \, \|d_i\|_2}

are computed. The documents corresponding to the largest elements of \delta are most relevant to the user's query. For our example,

\delta \approx [ 0 \quad 0.40824 \quad 0 \quad 0.63245 \quad 0.5 \quad 0 \quad 0.5 ],

so document vector 4 is scored most relevant to the query on baby health. To calculate the recall and precision scores, one needs to be working with a small, well-studied document collection. In this example, documents d4, d1, and d3 are the three documents in the collection relevant to baby health. Consequently, with τ = .1, the recall score is 1/3 and the precision is 1/4.
63.2 Latent Semantic Indexing

In the 1990s, an improved information retrieval system replaced the vector space model. This system is called Latent Semantic Indexing (LSI) [Dum91] and was the product of Susan Dumais, then at Bell Labs. LSI simply creates a low rank approximation Ak to the term-by-document matrix A from the vector space model.
from Langville & Meyer, 2006 Handbook of Linear Algebra
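The δ values of this example can be reproduced numerically. A minimal sketch in plain Python, with the matrix A and query q copied from the example above (the rounded values differ from the excerpt's truncated figures in the last digit):

```python
import math

# Term-by-document matrix A (9 terms x 7 documents) from the example above.
A = [
    [0, 1, 0, 1, 1, 0, 1],  # T1 baby
    [0, 1, 1, 0, 0, 0, 0],  # T2 child
    [0, 0, 0, 0, 0, 1, 1],  # T3 guide
    [0, 0, 0, 1, 0, 0, 0],  # T4 health
    [0, 1, 1, 0, 0, 0, 0],  # T5 home
    [1, 0, 0, 1, 0, 0, 0],  # T6 infant
    [0, 0, 0, 0, 1, 1, 0],  # T7 proofing
    [0, 0, 1, 1, 0, 0, 0],  # T8 safety
    [1, 0, 0, 1, 0, 0, 0],  # T9 toddler
]
q = [1, 0, 0, 1, 0, 0, 0, 0, 0]  # query "baby health"

def column(M, j):
    return [row[j] for row in M]

def cos(u, v):
    """delta_i = q^T d_i / (||q||_2 ||d_i||_2)"""
    num = sum(a * b for a, b in zip(u, v))
    return num / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

delta = [round(cos(q, column(A, j)), 5) for j in range(7)]
print(delta)  # document 4 (index 3) scores highest
```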
Term Weighting

• Zipf's law (1949): the distribution of word frequencies is similar across (large) texts
• Luhn's hypothesis (1957): the frequency of a word is a measurement of its significance… and thus a criterion measuring the capacity of a word to discriminate documents by their content
Indexing and TF-IDF: Index Term Weighting

Zipf's law [1949]: the distribution of word frequencies is similar for different texts (natural language) of significantly large size.

[Figure: frequency of words f plotted against words by rank order r]

Zipf's law holds even for different languages!

Luhn's analysis (observation):

[Figure: the same frequency-vs-rank curve with an upper and a lower cut-off; words between the cut-offs are the significant words, words above the upper cut-off are common words, words below the lower cut-off are rare words; the resolving power of words peaks between the two cut-offs]

Mounia Lalmas (Yahoo! Research), 20-21 June 2011
from M. Lalmas, 2012
Rank   Word   Frequency
1      the    200
2      a      150
…      …      …

Hapax legomena (frequency 1): ~50% of the vocabulary
rank × frequency ≈ constant
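The rank × frequency ≈ constant observation can be checked on any word-count list. A toy sketch (the law only emerges clearly on large corpora, so on this tiny made-up text the products are only roughly comparable):

```python
from collections import Counter

def rank_frequency(text):
    """Count word occurrences and return (rank, word, frequency) tuples,
    most frequent first (ties keep first-seen order)."""
    counts = Counter(text.lower().split())
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return [(r + 1, w, f) for r, (w, f) in enumerate(ranked)]

sample = ("the cat sat on the mat the dog saw the cat "
          "a cat and a dog on a mat")
for rank, word, freq in rank_frequency(sample)[:4]:
    print(rank, word, freq, rank * freq)  # last column: rank x frequency
```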
Term weighting

• In a given document, a word is important (discriminant) if it occurs often in the document and is rare in the collection
• TF.IDF weighting schemes
QteInfo(m_i) = -\log_2 P(m_i) \;\longrightarrow\; IDF(m_i) = -\log \frac{n_i}{N}
Document weighting and query weighting (N: number of documents; n(m_i): number of documents containing m_i; tf(m_i, D): frequency of m_i in document D; R denotes the query):

(a) w_{i,D} = \frac{tf(m_i, D) \cdot \log \frac{N}{n(m_i)}}{\sqrt{\sum_{j / m_j \in D} \left( tf(m_j, D) \cdot \log \frac{N}{n(m_j)} \right)^2}}
    w_{i,R} = \left( 0.5 + 0.5 \frac{tf(m_i, R)}{\max_{j / m_j \in R} tf(m_j, R)} \right) \cdot \log \frac{N}{n(m_i)}

(b) w_{i,D} = 0.5 + 0.5 \frac{tf(m_i, D)}{\max_{j / m_j \in D} tf(m_j, D)}
    w_{i,R} = \log \frac{N - n(m_i)}{n(m_i)}

(c) w_{i,D} = \log \frac{N}{n(m_i)}
    w_{i,R} = \log \frac{N}{n(m_i)}

(d) w_{i,D} = 1
    w_{i,R} = \log \frac{N - n(m_i)}{n(m_i)}

(e) w_{i,D} = \frac{tf(m_i, D)}{\sqrt{\sum_{j / m_j \in D} tf(m_j, D)^2}}
    w_{i,R} = tf(m_i, R)

(f) w_{i,D} = 1
    w_{i,R} = 1

Table 1 - Weighting schemes cited and evaluated in [Salton & Buckley, 1988]
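The common core of these schemes, tf × log(N / n(m_i)), can be illustrated in a few lines. A minimal sketch of scheme (a)'s numerator before normalization; the counts below are made up for illustration:

```python
import math

def tf_idf(tf, n_docs_with_term, n_docs):
    """Classic TF.IDF weight: term frequency times log(N / n(m_i))."""
    return tf * math.log(n_docs / n_docs_with_term)

# A word occurring 3 times but present in only 10 of 1000 documents
# outweighs a word occurring 5 times but present in 900 of 1000 documents.
rare = tf_idf(3, 10, 1000)
common = tf_idf(5, 900, 1000)
print(round(rare, 3), round(common, 3))
```

This is exactly the behavior the slide asks for: frequent-in-the-document and rare-in-the-collection words get the highest weights.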
Vector Space Model: some drawbacks

• The dimensions are orthogonal
– "automobile" and "car" are as distant as "car" and "apricot tree"…
—> the user query must contain the same words as the documents the user wishes to find…
• The word order and the syntax are not used
– "the cat drove out the dog of the neighbor" ≈ "the dog drove out the cat of the neighbor" ≈ "the cat close to the dog drives out"
• It assumes words are statistically independent; it takes into account neither the syntax of the sentences nor the negations…
– "this paper is about politics" vs. "this paper is not about politics": very similar sentences…
Probabilistic model (1)

• 1976: Robertson and Sparck Jones
• Query → {relevant documents} → {features}
• Problem: guessing the characteristics (features) of the relevant documents (Binary Independence Retrieval model: based on the presence or absence of terms)
• Solutions:
  • an iterative and interactive process {user, selection of relevant documents = relevance feedback}
  • selection of the documents according to a cost function
2 The probabilistic model

The probabilistic model represents the document retrieval process as a decision process: the cost, for the user, associated with retrieving a document must be minimized. In other words, a document is proposed to the user only if the cost of proposing it is lower than the cost of not retrieving it (see [Losee, Kluwer, BU 006.35, p. 62]):

EC_{retr}(d) < EC_{\overline{retr}}(d)    (4)

with:

EC_{retr}(d) = P(rel \mid \vec{d}) \, C_{retr,rel} + P(\overline{rel} \mid \vec{d}) \, C_{retr,\overline{rel}}    (5)

where P(rel | \vec{d}) is the probability that document d is relevant given its features \vec{d}, P(\overline{rel} | \vec{d}) the probability that it is not, C_{retr,rel} the cost of retrieving a relevant document, and C_{retr,\overline{rel}} that of retrieving a non-relevant one.

The decision rule then becomes: retrieve a document d only if:

P(rel \mid \vec{d}) C_{retr,rel} + P(\overline{rel} \mid \vec{d}) C_{retr,\overline{rel}} < P(rel \mid \vec{d}) C_{\overline{retr},rel} + P(\overline{rel} \mid \vec{d}) C_{\overline{retr},\overline{rel}}    (6)

that is:

\frac{P(rel \mid \vec{d})}{P(\overline{rel} \mid \vec{d})} > \frac{C_{retr,\overline{rel}} - C_{\overline{retr},\overline{rel}}}{C_{\overline{retr},rel} - C_{retr,rel}} = \text{constant} = \beta    (7)

The value of the constant β depends on the type of search being carried out: do we want to favor recall or precision, etc.

Another way to view the probabilistic model is to consider that it tries to model the set of relevant documents, in other words to estimate the probability that a given word appears in such documents.

2.1 Binary Relevance Model

Let q be a query and d_j a document. The probabilistic model tries to estimate the probability that the user finds document d_j interesting given the query q. We assume there exists a set R of interesting documents (the ideal set) and that these are the relevant documents. Let \overline{R} be the complement of R. The model assigns each document d_j its probability of relevance as follows:

d_j \mapsto \frac{P(d_j \text{ is relevant})}{P(d_j \text{ is not relevant})}    (8)

sim(d_j, q) = \frac{P(R \mid \vec{d_j})}{P(\overline{R} \mid \vec{d_j})}    (9)
Probabilistic model (2)

• Estimating the probability that a document d is relevant (or not relevant) for the query q
• Bayes' theorem: using the probability of observing the document given relevance, the prior probability of relevance, and the probability of observing the document at random
• The Retrieval Status Value
Thus, if the probability that d_j is relevant is high but the probability that it is not relevant is also high, the similarity sim(d_j, q) will be low. Since this quantity can only be computed if the relevance of a document can be defined as a function of q (which we cannot do), it must be estimated from examples of relevant documents. By Bayes' rule, P(R | \vec{d_j}) = P(R) \cdot P(\vec{d_j} | R) / P(\vec{d_j}), so the similarity equals:

sim(d_j, q) = \frac{P(\vec{d_j} \mid R) \cdot P(R)}{P(\vec{d_j} \mid \overline{R}) \cdot P(\overline{R})} \propto \frac{P(\vec{d_j} \mid R)}{P(\vec{d_j} \mid \overline{R})}    (10)

P(\vec{d_j} | R) is the probability of randomly selecting d_j from the set of relevant documents, and P(R) the probability that a document chosen at random from the collection is relevant. P(R) and P(\overline{R}) are independent of d_j, so they are not needed to rank the sim(d_j, q). A threshold β can then be set, below which documents are no longer considered relevant.

Assuming that words occur independently of one another in texts (an assumption that is obviously false… but workable in practice!), the probabilities reduce to bag-of-words probabilities:

P(\vec{d_j} \mid R) = \prod_{i=1}^{n} P(d_{j,i} \mid R) = \prod_{i=1}^{n} P(w_{m_i,d_j} \mid R)    (11)

P(\vec{d_j} \mid \overline{R}) = \prod_{i=1}^{n} P(d_{j,i} \mid \overline{R}) = \prod_{i=1}^{n} P(w_{m_i,d_j} \mid \overline{R})    (12)

In the probabilistic model, the weights of the index entries m_i are binary:

w_{m_i,d_j} \in \{0, 1\}    (13)

The probability of randomly selecting d_j from the set of relevant documents equals the product of the probabilities that the words of d_j belong to a (randomly chosen) document of R, times the product of the probabilities that the words absent from d_j do not belong to a (randomly chosen) document of R:

sim(d_j, q) \propto \frac{\prod_{m_i \in d_j} P(m_i \mid R) \times \prod_{m_i \notin d_j} P(\overline{m_i} \mid R)}{\prod_{m_i \in d_j} P(m_i \mid \overline{R}) \times \prod_{m_i \notin d_j} P(\overline{m_i} \mid \overline{R})}    (14)

where P(m_i | R) is the probability that word m_i is present in a document randomly selected from R and P(\overline{m_i} | R) the probability that it is absent.

This equation can be split into two parts according to whether or not the word belongs to document d_j:

sim(d_j, q) \propto \prod_{m_i \in d_j} \frac{P(m_i \mid R)}{P(m_i \mid \overline{R})} \times \prod_{m_i \notin d_j} \frac{P(\overline{m_i} \mid R)}{P(\overline{m_i} \mid \overline{R})}    (15)
Probabilistic model (3)

• Hypothesis: bag of words = words occur independently
• The Retrieval Status Value
Let p_i = P(m_i \in d_j | R) be the probability that the i-th word of d_j appears in a relevant document, and let q_i = P(m_i \in d_j | \overline{R}) be the probability that it appears in a non-relevant document. Clearly 1 - p_i = P(m_i \notin d_j | R) and 1 - q_i = P(m_i \notin d_j | \overline{R}). Finally, it is generally assumed that p_i = q_i for words not appearing in the query ([Fuhr, 1992, "Probabilistic Models in IR"]). Under these conditions:

sim(d_j, q) \propto \prod_{m_i \in d_j} \frac{p_i}{q_i} \times \prod_{m_i \notin d_j} \frac{1 - p_i}{1 - q_i}    (16)

\propto \prod_{m_i \in d_j \cap q} \frac{p_i}{q_i} \times \prod_{m_i \in d_j, m_i \notin q} \frac{p_i}{q_i} \times \prod_{m_i \notin d_j, m_i \in q} \frac{1 - p_i}{1 - q_i} \times \prod_{m_i \notin d_j, m_i \notin q} \frac{1 - p_i}{1 - q_i}    (17)

\propto \prod_{m_i \in d_j \cap q} \frac{p_i}{q_i} \times \prod_{m_i \notin d_j, m_i \in q} \frac{1 - p_i}{1 - q_i}    (18)

= \prod_{m_i \in d_j \cap q} \frac{p_i}{q_i} \times \frac{\prod_{m_i \in q} \frac{1 - p_i}{1 - q_i}}{\prod_{m_i \in d_j \cap q} \frac{1 - p_i}{1 - q_i}}    (19)

= \prod_{m_i \in d_j \cap q} \frac{p_i (1 - q_i)}{q_i (1 - p_i)} \times \prod_{m_i \in q} \frac{1 - p_i}{1 - q_i}    (20)

The second factor of this product is independent of the document (all the query words are taken into account, regardless of d_j). Since we are only interested in ranking the documents, it can be ignored.

Taking logarithms¹:

sim(d_j, q) \propto \sum_{m_i \in d_j \cap q} \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)} = RSV(d_j, q)    (22)

sim(d_j, q) is often called the RSV (Retrieval Status Value) of d_j for the query q.

Keeping the same notation:

sim(d_j, q) \propto \sum_{m_i \in q \cap d_j} \left( \log \frac{P(m_i \mid R)}{1 - P(m_i \mid R)} + \log \frac{1 - P(m_i \mid \overline{R})}{P(m_i \mid \overline{R})} \right)    (23)

¹ Other derivations [Losee, Kluwer, BU 006.35, p. 65] compute the probabilities with a binary (also called Bernoulli) distribution. Such a distribution describes the probability of a binary event (the word is present or absent) as a function of the variable's value and the probability of that value:

\phi(x; p) = p^x (1 - p)^{1 - x}    (21)

which gives the probability that x equals 1 or 0 as a function of p. The parameter p can be interpreted as the probability that x = 1, or as the proportion of cases in which x = 1.
• Let p_i (resp. q_i) be the probability that a relevant (resp. non-relevant) document contains m_i
• RSV = Retrieval Status Value
• A non-binary model? = using term frequency and document length
2.4 Learning the parameters automatically

Bayesian methods make it possible to estimate the parameters from relevance feedback provided by a user [Bookstein, 1983, "Information retrieval: A sequential learning process", JASIS].

2.5 Integrating non-binary distributions

Starting from the original probabilistic model, Robertson and the team of the Centre for Interactive Systems Research at City University (London) extended it to take into account the frequency of words in the documents and in the query, as well as document length. This extension originally corresponded to integrating Harter's 2-Poisson model (which Harter used to select good index terms, not to weight them) into the probabilistic model. From the 2-Poisson model and the notion of the elite set E of a word (for Harter, the set of documents most representative of the use of the word; more generally, the set of documents that contain the word), the conditional probabilities p(E|R), p(E|\bar{R}), p(\bar{E}|R) and p(\bar{E}|\bar{R}) are derived, yielding a new probabilistic model depending on E and \bar{E}. Together with other variables such as document length and the number of occurrences of the word within the document, this model gave rise to a family of weighting schemes called BM (Best Match).

In general, taking into account the weights w of the words in the documents and in the query is expressed as:

sim(d_j, q) = \sum_{m_i \in d_j \cap q} w_{m_i,d_j} \cdot w_{m_i,q} \cdot \log \frac{p_i(1-q_i)}{q_i(1-p_i)}   (33)

When no information about the set R of relevant documents is available, it is customary to turn this equality into a classical dot product:

sim(d_j, q) = \sum_{m_i \in d_j \cap q} w_{m_i,d_j} \cdot w_{m_i,q}   (34)

2.5.1 Integrating Harter's 2-Poisson model

A Poisson law is obtained when the number of events is very large and the elementary probability is very small (examples: defects on a production line, typing errors on a page). Some experiments have shown that only the distribution of 50% of the words (up to 70% depending on the experiment) resembles a Poisson model [Margulis 91; Fuhr 92].

Definition 1 (Poisson distribution) Given \mu_i, the average number of occurrences of the word m_i per document in a set of documents R, the probability that the number of occurrences f(m_i, d) equals k is:

P(f(m_i, d) = k) = \frac{\mu_i^k}{k!} e^{-\mu_i}
P. Bellot (AMU-CNRS, LSIS-OpenEdition)
Eliteness• « We hypothesize that occurrences of a term in a document have a random or
stochastic element, which nevertheless reflects a real but hidden distinction between those documents which are “about” the concept represented by the term and those which are not. Those documents which are “about” this concept are described as “elite” for the term. »
• The assumption is that the distribution of within-document frequencies is Poisson for the elite documents, and also (but with a different mean) for the non-elite documents.
• Modeling within-document term frequencies by means of a mixture of two Poisson distributions
It would be possible to derive this model from a more basic one, under which a document was a random stream of term occurrences, each one having a fixed, small probability of being the term in question, this probability being constant over all elite documents, and also constant (but smaller) over all non-elite documents. Such a model would require that all documents were the same length. Thus the 2-Poisson model is usually said to assume that document length is constant: although technically it does not require that assumption, it makes little sense without it. Document length is discussed further below (section 5).

The approach taken in [6] was to estimate the parameters of the two Poisson distributions for each term directly from the distribution of within-document frequencies. These parameters were then used in various weighting functions. However, little performance benefit was gained. This was seen essentially as a result of estimation problems: partly that the estimation method for the Poisson parameters was probably not very good, and partly because the model is complex in the sense of requiring a large number of different parameters to be estimated. Subsequent work on mixed-Poisson models has suggested that alternative estimation methods may be preferable [9].
Combining the 2-Poisson model with formula 4, under the various assumptions given about dependencies, we obtain [6] the following weight for a term t:

w = \log \frac{(p' \lambda^{tf} e^{-\lambda} + (1-p') \mu^{tf} e^{-\mu}) (q' e^{-\lambda} + (1-q') e^{-\mu})}{(q' \lambda^{tf} e^{-\lambda} + (1-q') \mu^{tf} e^{-\mu}) (p' e^{-\lambda} + (1-p') e^{-\mu})},   (5)

where \lambda and \mu are the Poisson means for tf in the elite and non-elite sets for t respectively, p' = P(document elite for t | R), and q' is the corresponding probability for \bar{R}.

The estimation problem is very apparent from equation 5, in that there are four parameters for each term, for none of which are we likely to have direct evidence (because of eliteness being a hidden variable). It is precisely this estimation problem which makes the weighting function intractable. This consideration leads directly to the approach taken in the next section.
4 A Rough Model for Term Frequency
4.1 The Shape of the tf Effect

Many different functions have been used to allow within-document term frequency tf to influence the weight given to the particular document on account of the term in question. In some cases a linear function has been used; in others, the effect has been dampened by using a suitable transformation such as log tf.

Even if we do not use the full equation 5, we may allow it to suggest the shape of an appropriate, but simpler, function. In fact, equation 5 has the following characteristics: (a) it is zero for tf = 0; (b) it increases monotonically with tf; (c) but to an asymptotic maximum; (d) which approximates to the Robertson/Sparck Jones weight that would be given to a direct indicator of eliteness.
Only in an extreme case, where eliteness is identical to relevance, is the function linear in tf. These points can be seen from the following re-arrangement of equation 5:

w = \log \frac{(p' + (1-p')(\mu/\lambda)^{tf} e^{\lambda-\mu}) (q' e^{\mu-\lambda} + (1-q'))}{(q' + (1-q')(\mu/\lambda)^{tf} e^{\lambda-\mu}) (p' e^{\mu-\lambda} + (1-p'))}.   (6)

\mu is smaller than \lambda. As tf \to \infty (to give us the asymptotic maximum), (\mu/\lambda)^{tf} goes to zero, so those components drop out. e^{\mu-\lambda} will be small, so the approximation is:

w = \log \frac{p'(1-q')}{q'(1-p')}.   (7)

(The last approximation may not be a good one: for a poor and/or infrequent term, e^{\mu-\lambda} will not be very small. Although this should not affect the component in the numerator, because q' is likely to be small, it will affect the component in the denominator.)
4.2 A Simple Formulation
What is required, therefore, is a simple tf-related weight that has something like the characteristics (a)-(d) listed in the previous section. Such a function can be constructed as follows. The function tf/(constant + tf) increases from zero to an asymptotic maximum in approximately the right fashion. The constant determines the rate at which the increase drops off: with a large constant, the function
Robertson & Walker, 1994, ACM SIGIR
p(k) = \frac{\lambda^k}{k!} e^{-\lambda}
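The mixture just described can be written down directly from the Poisson probability above; a minimal sketch, where the names `p_elite`, `lam` and `mu` are illustrative:

```python
import math

def poisson(k, lam):
    """Poisson probability p(k) = lam^k * e^(-lam) / k!."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

def two_poisson(tf, p_elite, lam, mu):
    """2-Poisson mixture for a within-document term frequency tf:
    a Poisson with mean lam over the elite set, mixed with weight
    p_elite against a Poisson with mean mu over the non-elite set."""
    return p_elite * poisson(tf, lam) + (1 - p_elite) * poisson(tf, mu)
```

Since both components are proper distributions, the mixture still sums to one over tf = 0, 1, 2, …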
Divergence From Randomness (DFR) models
• The 2-Poisson model: in an elite set of documents, informative words occur to a greater extent than in the rest of the documents of the collection; other words do not possess elite documents and their frequencies follow a random distribution.
• Divergence from randomness (DFR) : — selecting a basic randomness model — applying normalisations
• « The more the divergence of the within-document term-frequency from its frequency within the collection, the more the information carried by the word t in the document d »
• « if a rare term has many occurrences in a document then it has a very high probability (almost the certainty) to be informative for the topic described by the document »
• By using a binomial distribution or a geometric distribution
score(d, Q) = \sum_{t \in Q} qtw \cdot w(t, d)

http://ir.dcs.gla.ac.uk/wiki/FormulasOfDFRModels

For example, the I(n)L2 model:

I(n)L2:  w(t, d) = \frac{tfn}{tfn + 1} \cdot \log_2 \frac{N + 1}{n_t + 0.5}
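A minimal sketch of the I(n)L2 weight above, assuming the usual DFR term-frequency "normalisation 2" with a free parameter c (the function name and signature are illustrative):

```python
import math

def inl2_weight(tf, dl, avg_dl, N, n_t, c=1.0):
    """DFR I(n)L2 weight for one term: Laplace normalisation L
    (tfn / (tfn + 1)) times the inverse-document-frequency
    randomness model I(n) = log2((N+1)/(n_t+0.5))."""
    tfn = tf * math.log2(1 + c * avg_dl / dl)  # normalisation 2 (assumed)
    return (tfn / (tfn + 1)) * math.log2((N + 1) / (n_t + 0.5))
```

The weight grows with tf but saturates, and is larger for rare terms, which matches the "divergence from randomness" intuition quoted above.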
Probabilistic model (4)

• Estimating p and q? = better estimating term weights according to the number of documents n_i that contain the word m_i and N, the total number of documents

• Iterative process (relevance feedback): the user selects the relevant documents from a first list of retrieved documents

• If no sample is available = pseudo-relevance feedback (and the 2-Poisson model)

• With no relevance information, it approximates TF.IDF:
If we incorporate the number of occurrences f(m_i, d_j) of the m_i in d_j, we obtain:

sim(d_j, q) \propto \sum_{m_i \in d_j \cap q} f(m_i, d_j) \cdot \log \frac{p_i(1-q_i)}{q_i(1-p_i)}   (24)

2.2 Estimating the parameters

2.3 The original method, without relevance feedback

At the first iteration no relevant document has been found yet, so the values of P(m_i|R) and P(m_i|\bar{R}) must be set a priori. It is thus assumed that any word of the index has one chance in two of being present in a relevant document, and that the probability of a word being present in a non-relevant document is proportional to its distribution in the collection (given that the number of non-relevant documents is generally much larger than the number of relevant ones):

P(m_i|R) = 0.5   (25)

P(m_i|\bar{R}) = \frac{n_i}{N}   (26)

where n_i is the number of documents of the collection that contain m_i and N the total number of documents in the collection. These values must be re-estimated at each iteration according to the documents they retrieve (and, possibly, to the user's selection of the relevant ones).

From these initial values, sim(d_j, q) can be computed for every document of the collection, keeping only those whose similarity exceeds a threshold \gamma. Choosing \gamma amounts to choosing a rank r beyond which documents are discarded. Let V_i be the number of documents in the retained subset that contain m_i (V then denotes the number of retained documents). P(m_i|R) and P(m_i|\bar{R}) are then computed recursively:

P(m_i|R) = \frac{V_i}{V}   (27)

P(m_i|\bar{R}) = \frac{n_i - V_i}{N - V}   (28)

or (to avoid a problem with the values V = 1 and V_i = 0):

P(m_i|R) = \frac{V_i + 0.5}{V + 1}   (29)

P(m_i|\bar{R}) = \frac{n_i - V_i + 0.5}{N - V + 1}   (30)

and, more often:

P(m_i|R) = \frac{V_i + n_i/N}{V + 1}   (31)

P(m_i|\bar{R}) = \frac{n_i - V_i + n_i/N}{N - V + 1}   (32)
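The smoothed estimates of equations (29)-(30) plug directly into the RSV term weight; a minimal sketch (names and defaults are illustrative; the defaults V = V_i = 0 reproduce the first-iteration behaviour, p = 0.5 and q ≈ n_i/N):

```python
import math

def rsj_weight(n_i, N, V_i=0, V=0):
    """Robertson/Sparck Jones weight log(p(1-q) / (q(1-p))) with
    smoothed estimates: V retained (pseudo-)relevant documents,
    V_i of which contain the term; n_i is its document frequency."""
    p = (V_i + 0.5) / (V + 1)            # P(m_i | R), eq. (29)
    q = (n_i - V_i + 0.5) / (N - V + 1)  # P(m_i | not R), eq. (30)
    return math.log(p * (1 - q) / (q * (1 - p)))
```

With no feedback the weight behaves like an IDF (rare terms score higher); feedback that concentrates the term in the retained set increases it further.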
V <=> threshold (cost)
1st estimation
Among the drawbacks of the model is its systematic prediction that two occurrences of a word in a document are less probable than three or four occurrences. Since the 2-Poisson model did not yield particularly satisfactory results, other authors proposed a mixture of n Poisson distributions [Margulis, cited by Ponte & Croft, ACM SIGIR 1998]. Another possibility is to use Katz's K-mixture, which gives results as good as a negative binomial distribution while being much simpler to use; see [Manning & Schutze, "Foundations of...", p. 549].

2.5.2 Integrating a Gaussian model

If words are assumed to be normally distributed, the similarity proposed in 1982 by Bookstein is:

RSV(d_j, q) = \sum_{m_i \in q \cap d_j} \left[ f(m_i, d_j) \left( \frac{\mu_{m_i}}{\sigma^2_{m_i}} - \frac{\bar{\mu}_{m_i}}{\bar{\sigma}^2_{m_i}} \right) - \frac{f(m_i, d_j)^2}{2} \left( \frac{1}{\sigma^2_{m_i}} - \frac{1}{\bar{\sigma}^2_{m_i}} \right) \right]   (41)

with \mu and \sigma the means and standard deviations in R and (barred) in \bar{R}.

2.5.3 The Okapi weighting schemes

A common way of defining the IDF (Inverse Document Frequency) component, with N the number of documents in the collection and n(m_i) the number of documents of the collection containing m_i, is²:

IDF(m_i) = \log \frac{N - n(m_i) + 0.5}{n(m_i) + 0.5}   (43)

The number of occurrences f(m_i, d_j) is generally normalised by the average length \bar{l} of the documents of the collection and the size l(d_j) (in word occurrences) of d_j. With K a constant, usually chosen between 1.0 and 2.0, one possibility is to define the TF component so as to favour short documents:

TF(m_i, d_j) = \frac{(K + 1) \cdot f(m_i, d_j)}{f(m_i, d_j) + K \cdot (l(d_j)/\bar{l})}   (44)

A large number of weighting schemes have been tested. The first results were published during the TREC-2 and TREC-3 campaigns. The definition of these new weightings was concurrent with the generalisation of automatic query expansion from the first retrieved documents.

Let:

² Since n(m_i) is generally small compared with N, this definition can sometimes be simplified to:

IDF(m_i) = \log \frac{N + 0.5}{n(m_i) + 0.5}   (42)
Probabilistic model (5)
• "OKAPI" (BM25) with tuning constants = a (very) good baseline

– N: the number of documents in the collection;
– n(m_i): the number of documents containing the word m_i;
– R: the number of documents known to be relevant to the query q;
– r(m_i): the number of documents of R containing the word m_i;
– tf(m_i, d_j): the number of occurrences of m_i in d_j;
– tf(m_i, q): the number of occurrences of m_i in q;
– l(d_j): the size (in number of words) of d_j;
– \bar{l}: the average size of the documents of the collection;
– k_i and b: parameters depending on the query and, if possible, on the collection.

The weight w of a word m_i is defined by:

w(m_i) = \log \frac{(r(m_i) + 0.5)/(R - r(m_i) + 0.5)}{(n(m_i) - r(m_i) + 0.5)/(N - n(m_i) - R + r(m_i) + 0.5)}   (45)

Definition 3 (BM25) The BM25 weighting is defined as follows:

sim(d_j, q) = \sum_{m_i \in q} w(m_i) \times \frac{(k_1 + 1) \cdot tf(m_i, d_j)}{K + tf(m_i, d_j)} \times \frac{(k_3 + 1) \cdot tf(m_i, q)}{k_3 + tf(m_i, q)}   (46)

with:

K = k_1 \cdot \left( (1 - b) + b \cdot \frac{l(d_j)}{\bar{l}} \right)   (47)

When no information about R and r(m_i) is available, this definition reduces to (the weighting used in the Okapi system during TREC-1):

w(m_i) = \log \frac{N - n(m_i) + 0.5}{n(m_i) + 0.5}   (48)

with R = r(m_i) = 0. These are the values used in the two examples below.

During the TREC-8 campaign, the Okapi system was used with the values k_1 = 1.2, b = 0.75 (smaller values of b are sometimes worthwhile) and, for long queries, k_3 set either to 7 or to 1000:

sim(d_j, q) = \sum_{m_i \in q} \frac{2.2 \cdot tf(m_i, d_j)}{0.3 + 0.9 \cdot \frac{l(d_j)}{\bar{l}} + tf(m_i, d_j)} \times \frac{1001 \cdot tf(m_i, q)}{1000 + tf(m_i, q)} \times \log_2 \frac{N - n(m_i) + 0.5}{n(m_i) + 0.5}   (49)

The Inquery system [Allan, 1996] uses BM25 with k_1 = 2, b = 0.75 and tf(m_i, q) = 1 for all i:

sim(d_j, q) = \sum_{m_i \in q} \frac{tf(m_i, d_j)}{0.5 + 1.5 \cdot \frac{l(d_j)}{\bar{l}} + tf(m_i, d_j)} \times \frac{\log_2 \frac{N + 0.5}{n(m_i)}}{\log_2(N + 1)}   (50)

3 Language models for document retrieval

Unlike the probabilistic model, which tries to represent the set of relevant documents, language-model-based document retrieval sets out to model the process
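Equations (46)-(48) fit in a few lines; a minimal, illustrative implementation without relevance information (the function name and data layout are assumptions):

```python
import math

def bm25(query_tf, doc_tf, dl, avg_dl, N, df, k1=1.2, b=0.75, k3=1000):
    """BM25 of equations (46)-(48) with R = r(m_i) = 0:
    query_tf and doc_tf map terms to occurrence counts,
    df maps terms to their document frequency n(m_i)."""
    score = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue
        w = math.log((N - df[term] + 0.5) / (df[term] + 0.5))  # eq. (48)
        K = k1 * ((1 - b) + b * dl / avg_dl)                   # eq. (47)
        score += w * (k1 + 1) * tf / (K + tf) * (k3 + 1) * qtf / (k3 + qtf)
    return score
```

All else being equal, a document shorter than average gets a smaller K and therefore a higher score, which is the length normalisation effect discussed above.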
7 Experiments
7.1 TREC
The TREC (Text REtrieval Conference) conferences, of which there have been two, with the third due to start early 1994, are concerned with controlled comparisons of different methods of retrieving documents from large collections of assorted textual material. They are funded by the US Advanced Research Projects Agency (ARPA) and organised by Donna Harman of NIST (National Institute of Standards and Technology). There were about 31 participants, academic and commercial, in the TREC-2 conference which took place at Gaithersburg, MD in September 1993 [2]. Information needs are presented in the form of highly structured "topics" from which queries are to be derived automatically and/or manually by participants. Documents include newspaper articles, entries from the Federal Register, patents and technical abstracts, varying in length from a line or two to several hundred thousand words.

A large number of relevance judgments have been made at NIST by a panel of experts assessing the top-ranked documents retrieved by some of the participants in TREC-1 and TREC-2. The number of known relevant documents for the 150 topics varies between 1 and more than 1000, with a mean of 281.
7.2 Experiments Conducted
Some of the experiments reported here were also reported at TREC–2 [1].
Database and Queries
The experiments reported here involved searches of one of the TREC collections, described as disks 1 & 2 (TREC raw data has been distributed on three CD-ROMs). It contains about 743,000 documents. It was indexed by keyword stems, using a modified Porter stemming procedure [13], spelling normalisation designed to conflate British and American spellings, a moderate stoplist of about 250 words and a small cross-reference table and "go" list. Topics 101-150 of the 150 TREC-1 and -2 topic statements were used. The mean length (number of unstopped tokens) of the queries derived from title and concepts fields only was 30.3; for those using additionally the narrative and description fields the mean length was 81.
Search Procedure
Searches were carried out automatically by means of City University's Okapi text retrieval software. The weighting functions described in Sections 4-6 were implemented as BM15² (the model using equation 8 for the document term frequency component) and BM11 (using equation 10). Both functions incorporated the document length correction factor of equation 13. These were compared with BM1 (w(1) weights, approximately ICF, since no relevance information was used in these experiments) and with a simple coordination-level model BM0 in which terms are given equal weights. Note that BM11 and BM15 both reduce to BM1 when k_1 and k_2 are zero. The within-query term frequency component (equation 15) could be used with any of these functions.

To summarize, the following functions were used:

w = 1   (BM0)

w = \log \frac{N - n + 0.5}{n + 0.5} \times \frac{qtf}{k_3 + qtf}   (BM1)

w = \frac{tf}{k_1 + tf} \times \log \frac{N - n + 0.5}{n + 0.5} \times \frac{qtf}{k_3 + qtf} + k_2 \times nq \frac{\Delta - d}{\Delta + d}   (BM15)

w = \frac{tf}{(k_1 d / \Delta) + tf} \times \log \frac{N - n + 0.5}{n + 0.5} \times \frac{qtf}{k_3 + qtf} + k_2 \times nq \frac{\Delta - d}{\Delta + d}   (BM11)

(d is the document length, \Delta the average document length and nq the number of query terms). In the experiments reported below where k_3 is given as \infty, the factor qtf/(k_3 + qtf) is implemented as qtf on its own (equation 16).

² BM = Best Match
Generative models - e.g. language models

• A model that "generates" phrases

• A probability distribution (unigrams, bigrams, n-grams) over samples

• For IR: what is the probability that a document produces a given query? = the query likelihood = the probability that the document is relevant

• IR = finding the document that is the most likely to generate the query

• Different types of language models: unigram models assume word independence

• Estimating P(t|d) with Maximum Likelihood (the number of times the query word t occurs in the document d divided by the total number of word occurrences in d)

• Problem: estimating the "zero frequency" probability (t may not occur in d) → smoothing (Laplace, Jelinek-Mercer, Dirichlet…)
Retrieval Models II: Probabilities, Language models and DFR

Standard LM Approach

Assume that query terms are drawn identically and independently from a document (unigram models):

P(q|d) = \prod_{t \in q} P(t|d)^{n(t,q)}

(where n(t, q) is the number of occurrences of term t in query q)

Maximum Likelihood Estimate of P(t|d): simply use the number of times the query term occurs in the document divided by the total number of term occurrences.

Problem: the zero probability (frequency) problem

Mounia Lalmas (Yahoo! Research) 20-21 June 2011
Document Priors

Remember P(d|q) = P(q|d)P(d)/P(q) \propto P(q|d)P(d). P(d) is typically assumed to be uniform, so it is usually ignored, leading to P(d|q) \propto P(q|d). P(d) provides an interesting avenue for encoding a priori knowledge about the document:

- Document length (longer doc → more relevant)
- Average word length (bigger words → more relevant)
- Time of publication (newer doc → more relevant)
- Number of web links (more in-links → more relevant)
- PageRank (more popular → more relevant)
Estimating Document Models

Examples of smoothing methods:

Laplace: P(t|\theta_d) = \frac{n(t,d) + \alpha}{\sum_{t'} n(t',d) + \alpha |T|}   (|T| is the number of terms in the vocabulary)

Jelinek-Mercer: P(t|\theta_d) = \lambda \cdot P(t|d) + (1 - \lambda) \cdot P(t)

Dirichlet: P(t|\theta_d) = \frac{|d|}{|d| + \mu} \cdot P(t|d) + \frac{\mu}{|d| + \mu} \cdot P(t)
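The Dirichlet estimate can be used directly to score documents by query likelihood; a minimal sketch (names and the default mu = 2000 are illustrative):

```python
import math

def query_log_likelihood(query, doc_tf, dl, coll_tf, coll_len, mu=2000):
    """log P(q|d) under a unigram model with Dirichlet smoothing:
    P(t|theta_d) = (n(t,d) + mu * P(t)) / (|d| + mu), which is the
    |d|/(|d|+mu) interpolation above written in one step."""
    score = 0.0
    for t in query:
        p_coll = coll_tf.get(t, 0) / coll_len          # background P(t)
        p = (doc_tf.get(t, 0) + mu * p_coll) / (dl + mu)
        score += math.log(p)  # assumes t occurs somewhere in the collection
    return score
```

Smoothing is what makes the comparison possible at all: a document missing a query term gets a small background probability instead of a zero that would annihilate the whole product.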
Classification and enrichment

12.3.2. Naive Bayes classification (language models)

A language model [DEM 98] is a set of properties and constraints on word sequences, obtained from examples. These examples may represent, more or less faithfully, a language or a topic. Estimating probabilities from examples makes it possible, by extension, to determine the probability that any given sentence could have been generated by the model. Categorizing a new text amounts to computing the probability of its word sequence under the language model of each category. The new text is labeled with the topic whose language model yields the maximal probability.

Let W be a sequence of words w1, w2, …, wn. We assume that word occurrences are independent of one another (an obviously false assumption that nevertheless works quite well). With a trigram language model (history of length 2), the probability of this sequence can be computed as follows:

P(W) = ∏_{i=1}^{n} P(w_i | w_{i−2}, w_{i−1})   [12.7]

The representativeness of the training corpus with respect to the data to be processed is crucial⁸. Nigam et al. [NIG 00] showed, however, that using an EM algorithm can partly compensate for a shortage of such data.
Example. Bayes' rule can be used to solve categorization problems. Suppose, for instance, that we want to determine the language predominantly used in a text. We then compute the probability of each language L given the text S. Bayes' formula lets us "invert" this probability into factors computed with the language models of the different languages. Comparing:

P(L = English | S) = P(S | L = English) · P(L = English) / P(S)

and

P(L = Spanish | S) = P(S | L = Spanish) · P(L = Spanish) / P(S)   [12.8]

amounts to comparing only (since P(S) is identical in both cases):

⁸ The computation will very likely involve trigrams never seen in training. The simplest technique to address this problem is to systematically add a small number k of occurrences to every word and to normalize the counts.
P. Bellot (AMU-‐CNRS, LSIS-‐OpenEdition)
Language models (2)
• Priors make it possible to take into account diverse evidence about the documents / the collection / the query:

• the document length (the longer a document, the more relevant it is?)

• the time of publication

• the number of links / citations

• the PageRank of the document (Web)

• the language…
• Sequential Dependence Model
21
SDM(Q,D) = λ_T · Σ_{q∈Q} f_T(q,D)
         + λ_O · Σ_{i=1}^{|Q|−1} f_O(q_i, q_{i+1}, D)
         + λ_U · Σ_{i=1}^{|Q|−1} f_U(q_i, q_{i+1}, D)

with λ_T = 0.85, λ_O = 0.1, λ_U = 0.05, where f_T scores single query terms, f_O ordered query bigrams (exact phrases) and f_U unordered query bigrams (co-occurrences within a window).
http://www.lemurproject.org

#weight( 0.75 #combine( hubble telescope achievements ) 0.25 #combine( universe system mission search galaxies ) )
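The SDM scoring above can be sketched as follows. This is a simplification under stated assumptions: the real SDM potentials f_T, f_O, f_U are log-smoothed language-model probabilities of the term / window events, whereas here raw match counts stand in for them, and the window size is an illustrative choice:

```python
def sdm_score(query_terms, doc_tokens, lt=0.85, lo=0.10, lu=0.05, window=8):
    # f_T: unigram matches; f_O: exact ordered bigram matches;
    # f_U: the two terms co-occur within an unordered window
    # (counted once per occurrence of the first term)
    def f_t(q):
        return doc_tokens.count(q)

    def f_o(q1, q2):
        return sum(1 for i in range(len(doc_tokens) - 1)
                   if doc_tokens[i] == q1 and doc_tokens[i + 1] == q2)

    def f_u(q1, q2):
        n = 0
        for i, w in enumerate(doc_tokens):
            if w == q1 and q2 in doc_tokens[max(0, i - window):i + window]:
                n += 1
        return n

    score = lt * sum(f_t(q) for q in query_terms)
    score += lo * sum(f_o(query_terms[i], query_terms[i + 1])
                      for i in range(len(query_terms) - 1))
    score += lu * sum(f_u(query_terms[i], query_terms[i + 1])
                      for i in range(len(query_terms) - 1))
    return score
```

With the default λ values, the unigram component dominates and the ordered/unordered bigram matches act as small boosts, which is the intent of the 0.85/0.1/0.05 weighting.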
Some other models
• Inference networks (Bayesian networks): combining distinct evidence sources, modeling causal relationships - e.g. probabilistic inference networks (InQuery) —> cf. learning to rank from multiple and diverse features
• Fuzzy models
• (Extended) Boolean Model / Inference logical models
• Information-based models
• Algebraic models (Latent Semantic Indexing…)
• Semantic IR models based on ontologies and conceptualization
• and … Web-based models (PageRank…) / XML-based models…
22
Web Page Retrieval
IR Systems on the web
Use many scores (> 300)
• Similarity between the query and the docs
• Location of the keywords in the pages
• Structure of the pages
• Page Authority (Google’s PageRank)
• Domain Authority
23
— Hyperlink matrix (the link structure of the Web): a_{i,j} = 1/|O_i| if there is a link from page i to page j, and a_{i,j} = 0 otherwise, where |O_i| is the number of outgoing links of page i.
PageRank
The authority of a Web page? / The authority of a Web site or domain?
24
Random Walk : the PageRank of a page is the probability of arriving at that page after a large number of clicks
http://en.wikipedia.org/wiki/PageRank
Fast, Scalable Graph Processing: Apache Giraph on YARN
1. All vertices start with same PageRank
1.0
1.0
1.0
2. Each vertex distributes an equal portion of its PageRank to all neighbors:
0.5 0.5
1
1
3. Each vertex sums its incoming values times a weight factor and adds in a small adjustment:
1/(# vertices in graph)
(.5*.85) + (.15/3)
(1.5*.85) + (.15/3)
(1*.85) + (.15/3)
4. This value becomes the vertex's PageRank for the next iteration
.43
.21
.64
5. Repeat until convergence:
(change in PR per-iteration < epsilon)
From: Fast, Scalable Graph Processing: Apache Giraph on YARN
http://fr.slideshare.net/Hadoop_Summit/fast-scalable-graph-processing-apache-giraph-on-yarn
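The iteration walked through in the slides above can be sketched as a few lines of Python (a minimal sketch: like the slides, it uses damping d = 0.85 and a per-vertex adjustment (1 − d)/n, and ignores dangling nodes for simplicity; the graph below is an illustrative example, not the one on the slides):

```python
def pagerank(links, d=0.85, iters=50):
    # links: dict mapping each vertex to its list of out-neighbors
    n = len(links)
    pr = {v: 1.0 for v in links}          # step 1: all vertices start equal
    for _ in range(iters):
        incoming = {v: 0.0 for v in links}
        for v, outs in links.items():     # step 2: distribute PR equally to neighbors
            for u in outs:
                incoming[u] += pr[v] / len(outs)
        # steps 3-4: damped sum of incoming values plus the (1-d)/n adjustment
        pr = {v: d * incoming[v] + (1 - d) / n for v in links}
    return pr                              # step 5: repeat until convergence

g = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
pr = pagerank(g)
```

After convergence the scores sum to 1 and vertex "c", which receives two in-links, outranks "b", which receives only half of "a"'s mass.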
Entity oriented IR on the Web
Example: LSIS / KWare @ TREC KBA
28
http://trec-‐kba.org/ Knowledge Base Acceleration
2014 : 1.2B documents (Web, social…), 11 TB http://s3.amazonaws.com/aws-‐publicdatasets/trec/kba/index.html
Some Challenges
- Queries focused on specific entities

- Key issues:

- Ambiguity in names ⇒ need for disambiguation

- Profile definition

- Novelty detection / event detection / event attribution

- Dynamic models (outdated information, new information, new aspects/properties)

- Time-oriented IR models
30
Evaluation using TREC KBA Framework
Run F-Measure
1 vs All .361
1 vs All Top10 Features .355
Cross10 .355
Cross 5 .350
Cross 3 .354
Cross 2 .339
Table 2: Robustness evaluation results
Figure 2: Variable Importance for Classification
• is based on two classifiers in cascade: one for filtering out non-mentioning documents and the other to dissociate poorly relevant documents from centrally relevant ones;

• does not require new training data when processing a new entity;

• deals with three different types of features based on: the entity, the time and the found documents;

• has been evaluated using the Knowledge Base Acceleration framework provided for the Text REtrieval Conference 2012 (TREC KBA).
Our Approach
Figure 1: Time lag between the publication date of cited news articles and the date of an edit to WP creating the citation (Frank et al 2012)
Today
Run F-Measure
Our Approach .382
Best KBA .359
Median KBA .289
Mean KBA .220
Table 1: KBA 2012 results
About KBA
first session in: 2012 — participants: 11 teams — our rank: 3rd (before enhancement)
number of submissions: 43
Approach for Documents Filtering on a Content Stream by Vincent Bouvier, Ludovic Bonnefoy, Patrice Bellot, Michel Benoit
KBA is about retrieving and filtering information from a content stream in order to expand knowledge bases like Wikipedia by recommending edits.
Topic Preprocessing:
Variants extraction using:

- Bold text identified in the topic's Wikipedia page;

- Text from links that point to the topic's Wikipedia page in the whole Wikipedia corpus.
Information Retrieval:
We adopted a recall-oriented approach: we wanted to retrieve all documents containing at least one of the previously found variants. We used the IR system provided by Terrier with TF-IDF term weighting.
count               % of KBA   % of LSIS
total LSIS  44,351
total KBA   52,244
inter.      23,245  44.49%     52.41%
comp.       50,105  55.41%     47.59%
Results
Top 10 Features from Gini Score:
Process description:
Text REtrieval Conference: Knowledge Base Acceleration
Classification:
To classify documents when dealing with a content stream, we decided to use a decision-tree classifier. The classifier relies on several kinds of features:
time-related features: statistics on found documents; presence/absence of known relations concerning the current topic during a week, on a day scale;
common IR features: TF-IDF; mention distribution every 10% of the document; cosine similarity (uni- and bi-grams) with the topic's Wikipedia page.
Boris_Berezovsky_(business-man): boris berezovsky; boris abramovich berezovsky

Boris_Berezovsky_(pianist): boris berezovsky; boris vadimovich berezovsky
Relation extraction is also performed using link titles from and to the topic's Wikipedia page.
OVERALL
COUNT_RELATED_ENTITIES_MENTION    100.00000
COSINE_SIMLARITY_1G               100.00000
COSINE_SIMLARITY_2G               100.00000
COUNT_MENTION_IN_60_70%_DOCUMENT   88.33264
STAT_MENTION                       73.70804
COUNT_RELATED_CITED                72.44485
COUNT_MENTION_IN_10_20%_DOCUMENT   70.76657
AVG_DOCUMENTS_IN_QUEUE             66.24520
COUNT_SENTENCE_WITH_MENTION        65.93995
COUNT_RELATED_LINKED               65.86051
                 Central        Rel./Cent.
Run       #      F1     SU      F1     SU
(best)    —      .359   .410    .639   .635
RF-Yes    4      .342   —       .617   .600
RF-All    3      .330   .279    .614   .601
SRF-All   5      —      —       .603   —
SRF-Yes   6      —      —       —      —
All-All   1      —      —       .553   .554
All-Yes   —      .306   .193    —      —
median           —      —       .543   .549
means            —      .311    .405   —

(cells marked — were unrecoverable in the source)

RF: Weka Random Forest; SRF: Salford Random Forest; All: Weka Random Committee of Random Forests

Yes: includes only central judgements

All: includes central and relevant judgments
score(d_i) = s(d_i, c_1) · s(d_i, c_2)

score(d_i) = s(d_i, c_1) if s(d_i, c_1) < 0.5, else 0.5 + ( s(d_i, c_1) + s(d_i, c_2) ) / 2
LIA: [email protected]; LSIS:{vincent.bouvier, patrice.bellot}@lsis.org; Kware: [email protected]
Numerical and Temporal Meta-Features for Entity Document Filtering and Ranking
— Entity related features
— Document related meta-features
— Time related meta-features
33
Evaluation using TREC KBA Framework
Run F-Measure
1 vs All .361
1 vs All Top10 Features .355
Cross10 .355
Cross 5 .350
Cross 3 .354
Cross 2 .339
Table 2: Robustness evaluation resultsFigure 2:�=HYPHISL�0TWVY[HUJL�MVY�*SHZZPÄJH[PVU
�� PZ� IHZLK� VU� [^V� JSHZZPÄLYZ� PU� JHZJHKL� !� VUL� MVY�ÄS[LYPUN� V\[� UVU� TLU[PVUPUN� KVJ\TLU[Z� HUK� [OL� V[OLY�[V� KPZZVJPH[L� WVVYS`� YLSL]HU[� KVJ\TLU[Z� MYVT� JLU[YHSS`�YLSL]HU[�VULZ�
��KVLZ�UV[�YLX\PYL�UL^�[YHPUPUN�KH[H�^OLU�WYVJLZZPUN�H�UL^�LU[P[`
��KLHSZ�^P[O�[OYLL�KPMMLYLU[�[`WLZ�VM�MLH[\YL�IHZLK�VU!�[OL�LU[P[ ��[OL�[PTL�HUK�[OL�MV\UK�KVJ\TLU[Z
�� OHZ� ILLU� L]HS\H[LK� \ZPUN� [OL� 2UV^SLKNL� )HZL�(JJLSLYH[PVU�-YHTL^VYR�WYV]PKLK� MVY� [OL�;L_[�9,[YPL]HS�*VUMLYLUJL�������;9,*�2)(�
Our Approach
Figure 1: Time lag between the publication date of cited news articles and the date of an edit to WP creating the citation (Frank et al 2012)
Today
Run F-Measure
Our Approach .382
Best KBA .359
Median KBA .289
Mean KBA .220
Table 1: KBA 2012 results
About KBAÄYZ[�ZLZZPVU�PU!������WHY[PJPWHU[Z!����[LHTZV\Y�YHUR!��YK��ILMVYL�LUOHUJLTLU[�
U\TILY�VM�Z\ITPZZPVUZ!����
recall = #{documents found ∈ corpus} / #{documents found ∈ train ∪ test}   (1)
                 With Variants   Without Variants
KBA12  Train     .862            .772
       Test      .819            .726
       Overall   .835            .743
KBA13  Train     .877            .831
       Test      .611            .534
       Overall   .646            .573

Table 1. Recall depending on whether variant names are used, on the train and test subsets of both the KBA12 and KBA13 collections
3.2 The Ranking Method
The ranking method comes right after the document pre-selection filter and thus takes as input a document mentioning an entity. The method ranks documents into four classes: garbage or neutral (no information, or not informative), useful, or vital. It has been shown in [9] that Naive Bayes, decision trees and SVM classifiers perform similarly on several test collections. For the ranking method, we use a Random Forest classifier (an ensemble of decision trees) which, in addition to good performance, is really useful for post-hoc analysis.

We want our method to be adaptive and therefore not dependent on the entity on which the classifier is trained. So we designed a series of meta-features that strive to depict evidence regarding an entity so that they can be applied to other entities. The remainder of this section details the three types of meta-features: document-, entity- and time-related.
3.2.1 Entity related meta-features
The entity-related meta-features are used to determine how a document concerns the target entity it has been extracted for. In order to structure all the information we have for an entity, we build an entity profile that contains:

- a variant collection V_e: the different variant names found for an entity e (cf. section 3.1);

- a relation collection R_{e,relType}: the different types relType of relations an entity e has with other entities;

- an entity language model θ_e: a textual representation of the entity e as a bag of n-grams;

- an entity Stream Information Language Model eSilm_e: a textual representation of one or more documents selected by our system, as a bag of n-grams, for the entity e. The eSilm_e is used to evaluate the divergence with upcoming documents, in order to try to separate novelty from already-known information.
A system may have no information at all (besides the name) concerning an entity; in that case the entity language model θ_e remains empty. The Wikipedia page can be used, though, when it is known.

However, for entities where no information at all is available, we thought it could be useful to use well-ranked documents that mention the entities. With the aim of keeping the entity background separate from new entity information, we build another model eSilm_e to store the information that comes from the stream about an entity e. We will see in section 4.2 the different ways we experiment with to update the model.

The relation collection can be obtained in different manners depending on the prior information on the entity. When the entity's Wikipedia page is available, it is possible, while extracting variant names, to gather the pages that contain hyperlinks pointing to the entity page. It is also possible to gather all hyperlinks from the entity page that point to another page. So it is possible to define three types of relations: incoming (from a page to the entity page), outgoing (from the entity page to another page) and mutual (both incoming and outgoing).

When using social networks, those relations are explicitly defined. On Twitter for instance, an incoming relation is when a user is followed, an outgoing relation is when a user is following, and mutual is when both users follow each other.
Some meta-features require a term frequency (TF) to be computed. To compute the TF of an entity e, we sum the frequencies of all mentions of variant names v_i from the collection V_e in a document D, and normalize by the number of words |D| in D (cf. equation 2). We also compute meta-features for each type of relation (incoming, outgoing, mutual) using equation 2 where, instead of variants, all relations sharing the same type are used.

tf(e, D) = ( Σ_{i=1}^{|V_e|} f(v_i, D) ) / |D|   (2)
A snippet is computed from a document and the different mentions of an entity: it contains the set of paragraphs where the mentions of the entity occur. The coverage cov(D_snippet, D) of the snippet for the document D is computed from the length |D_snippet| of the snippet and the length |D| of the document (cf. equation 3).

cov(D_snippet, D) = |D_snippet| / |D|   (3)
The following table summarizes all entity-related meta-features:

tf_title                       tf(e, D_title)
tf_document                    tf(e, D)
length_θe                      |θ_e|
length_eSilme                  |eSilm_e|
cov_snippet                    equation 3
tf_relationType                tf(rel_type, D)
cosine(θ_e, D)                 similarity between θ_e and D
jensenShannon(θ_e, D)          divergence between θ_e and D
jensenShannon(eSilm_e, D)      divergence between eSilm_e and D
jensenShannon(θ_e, eSilm_e)    divergence between θ_e and eSilm_e

Table 2. Entity related features
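The two measures of equations 2 and 3 can be sketched as follows (a minimal sketch; the variant matching below is naive substring matching over the joined token text, which the paper does not specify, so treat it as an illustrative assumption):

```python
def entity_tf(variants, doc_tokens):
    # equation (2): sum of variant-mention frequencies, normalized by |D|
    text = " ".join(doc_tokens)
    hits = sum(text.count(v) for v in variants)   # naive substring matching
    return hits / len(doc_tokens)

def snippet_coverage(snippet_tokens, doc_tokens):
    # equation (3): |D_snippet| / |D|
    return len(snippet_tokens) / len(doc_tokens)

doc = "boris berezovsky the pianist boris vadimovich berezovsky plays".split()
variants = ["boris berezovsky", "boris vadimovich berezovsky"]
tf_e = entity_tf(variants, doc)
```

The same `entity_tf` can be reused for the relation meta-features by passing the names of related entities instead of variants, as the text describes.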
3.2.2 Document related meta-features

Documents can give much information regardless of the entity. For instance, it is possible to compute the amount of information carried by a document using the entropy of the document D. In addition, the length (number of words) of a document also gives information on whether a document is rather short or long: a document considered long (compared to others) may be more likely vital than short ones; this is at least the kind of behavior we could expect. Since we want to be able to distinguish documents not mentioning the entity in the document title (entity meta-feature tf(e, D_title)) from those that simply don't have a title, we add a meta-feature has_title(D). Table 3 gathers all document-related meta-features.
3.2.3 Time related meta-features
Let's consider a stream of documents where each document has a publication date and time. It is therefore possible to make use of this
has_title(D)        ∈ {0, 1}
length_document     |D|
entropy(D)          −Σ_i p(w_i, D) · log2 p(w_i, D)

Table 3. Document related Meta-Features
information to detect, for instance, abnormal activity on an entity, which might mean that something really important to that entity is happening.

As shown in figure 3, drawn from the KBA13 stream-corpus, a burst does not always indicate vital documents, although it may still be relevant information for classification.

Figure 3. Burst on different entities does not always imply vital documents.

To depict the burst effect, we used an implementation of the Kleinberg algorithm [11]. Given a time series, it captures bursts and measures their strength as well as their direction (up or down). We decided to scale the time series on an hourly basis. In order not to confuse the classifiers with too much information, we decided not to use the direction as a separate feature but to merge direction and strength by applying a coefficient of −1 when the direction is down and 1 otherwise.

In addition to burst detection, we also consider the number of documents containing a mention in the last 24 hours.

We noticed from our last year's experiments on KBA12 that time features were actually degrading final results, since our scores were better when ignoring them. So we decided to focus only on features (cf. table 4) that can really bring useful time information.
kleinberg1h    burst strength and direction
match24h       # documents found in the last 24h

Table 4. Time related features used for classification
3.2.4 Classification

To perform the classification we decided not to rely on only one method. Instead, we designed different ways to classify the information given the meta-features described in the previous section.

For the first method, TwoSteps, we consider the problem as a binary classification problem where we use two classifiers in cascade. The first one, C_GN/UV, classifies between two classes: Garbage/Neutral and Useful/Vital. For documents classified as Useful/Vital, a second classifier C_U/V is used to determine the final output class, Useful or Vital.

The second method, Single, directly performs a classification between the four classes.

The third method, VitalVSOthers, trains a classifier on all documents considering only two classes: vital and others (all classes but vital). When this classifier outputs the non-vital class, the Single method is used to determine a class from Garbage to Useful.

The last but not least method, CombineScores, uses the scores emitted by all previous classifiers and tries to learn the best output class considering all classifiers' scores for every class.
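The TwoSteps cascade can be sketched as follows. The two classifiers are hypothetical stand-ins passed in as probability functions (the actual system trains Random Forests); combining the two confidences as a product mirrors the score(d_i) = s(d_i,c_1) · s(d_i,c_2) combination shown on the poster:

```python
def two_steps(doc_features, c_gn_uv, c_u_v):
    # TwoSteps cascade: first separate Garbage/Neutral from Useful/Vital,
    # then refine Useful vs Vital for documents passing the first stage.
    s1 = c_gn_uv(doc_features)            # P(Useful/Vital) from classifier 1
    if s1 < 0.5:
        return "garbage/neutral", s1
    s2 = c_u_v(doc_features)              # P(Vital | Useful/Vital) from classifier 2
    label = "vital" if s2 >= 0.5 else "useful"
    return label, s1 * s2                 # combined confidence of the cascade
```

For example, with stand-in classifiers returning fixed confidences, `two_steps({}, lambda f: 0.8, lambda f: 0.6)` yields the vital class with a combined score of 0.48.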
4 Experiments on the KBA Framework

4.1 Setup

The KBA organizers have built a stream-corpus, a huge corpus of dated web documents that can be processed chronologically. Hence it is possible to simulate a real-time system. The documents come from newswires, blogs, forums, reviews and memetracker. In addition, a set of target entities, coming from Wikipedia or from Twitter, has been selected for their ambiguity or unpopularity. Last but not least, more than 60,000 documents have been annotated so that systems can train on them. The train period covers documents published from October 2011 until February 2012, and the test period runs from February 2012 to February 2013.

The KBA track is divided into two tasks: CCR (Cumulative Citation Recommendation) and SSF (Streaming Slot Filling). The CCR task is to filter documents worth citing in a profile of an entity (e.g., a Wikipedia or Freebase article). The SSF task is to detect changes on given slots for each of the target entities. We focus only on the CCR task.

The KBA task in 2013 is more challenging than the one from KBA12 since the entities are more diverse (29 Wikipedia entities in 2012 vs 141 entities from Wikipedia and Twitter in 2013), the amount of annotated data per entity is much lower, and the ranking is more difficult with vital classes. Table 5 shows the differences in the training data between KBA13 and KBA12.
                   #Docs            #Docs/Entity
Classes            2012     2013    2012    2013
Garbage            8467     2176    284     20
Neutral            1584     1152    73      11
Relevant/Useful    5186     2293    181     20
Central/Vital      2671     1718    92      19
Total              17482    7222

Table 5. Number of documents per class, and per class and per entity, for both the KBA12 and KBA13 evaluations.
4.2 System Output

We detailed in section 3.2.1 how an entity profile is built and how this profile is dynamically altered by updates to the entity Stream Information Language Model (eSilm_e). We ran different experiments to understand how this kind of model evolves using trivial update methods based on two parameters:

- UPDATE WITH: Full-Documents, Snippet, or No Update;
- UPDATE THRESHOLD: Useful/Vital or Vital documents only.

Given those parameters, the system produces 5 different outputs depending on how the eSilm is updated: NO_UPDT, V_UPDT_DOC, V_UPDT_SNPT, VU_UPDT_DOC, VU_UPDT_SNPT.

In addition, as said in section 3.2.4, four classification methods are used to experiment with different types of classification.

To summarize, 20 outputs are expected at the end of the whole process.
Bouvier & Bellot, TREC 2013
Temporal Features
Burstiness: some words tend to appear in bursts
Hypothesis : Entity name bursts are related to important news about the entity (social Web; News…)
34
tf_title            tf(e, D_title)
tf_document         tf(e, D)
voc_size_document   |D|
cov_snippet         equation 3
tf_relationType     tf(rel_type, D)

Table 5: Entity related features
When building the profile, we said that we extract relations an entity may have with other entities from WP, using three different kinds of relations: incoming, outgoing and mutual. For each kind of relation and for each entity in this relation group, we compute the average tf over the whole document.

Time related features: the corpus offers the advantage of allowing work with time information. We designed the time-related features so that the classifiers are able to work with information concerning previous documents. Such information may help detect that something is going on about an entity, using different clues such as the burst effect. As shown in figure 2, a burst does not always indicate vital documents, although it may still be relevant information for classification.

Figure 2: Burst on different entities does not always imply vital documents.

To depict the burst effect we used an implementation of the Kleinberg algorithm (Kleinberg, 2003). Given a time series, it captures bursts and measures their strength as well as their direction (up or down). We decided to scale the time series on an hourly basis. In order not to confuse the classifiers with too much information, we decided not to use the direction as a separate feature but to merge direction and strength by applying a coefficient of −1 when the direction is down and 1 otherwise.

In addition to burst detection, we also consider the number of documents containing a mention in the last 24 hours.

We noticed from our last year's experiments on KBA12 that time features were actually degrading final results, since our scores were better when ignoring them. So we decided to focus only on features (cf. table 6) that can really bring useful time information.

kleinberg1h    burst strength and direction
match24h       # documents found in the last 24h

Table 6: Time related features used for classification
4.1 Classification
As a reminder of section 3.1.2, we implemented different ways to update (or not) a dynamic language model:

- No Update: NO_UPDT

- Update with Snippet: UPDT_SNPT

- Update with Document: UPDT_DOC

When we update the dynamic model, we can choose to update with either Vital or Vital-and-Useful documents, which adds 2 different outputs. In total, 5 outputs are computed.

To classify documents based on the computed features, we designed several methods. The first method, "TwoStep", considers the problem as a binary classification problem where we use two classifiers in cascade. The first one, C_GN/UV, classifies between two classes: "Garbage/Neutral" and "Useful/Vital". For documents classified as "Useful/Vital", the second classifier C_U/V is used to determine the final output class, "Useful" or "Vital".

The second method, "Single", directly performs a classification between the four classes.

The third method, "VitalVSOthers", trains a classifier to recognize vital documents amongst all other classes. When this classifier outputs a non-vital class, the "Single" method is used to determine a class from "Garbage" to "Useful".

The last but not least method, "CombineScores", uses the scores emitted by all previous classifiers and tries to learn the best output class considering all classifiers' scores for every class.

4.2 System Outputs
To summarize, we have 5 possible outputs with 4 different methods, which makes 20 different runs. For the official run submission, we had issues with our system that made our runs not consistent enough. In addition, we also had issues extracting documents from the stream-corpus, which made our system miss a lot of documents. The result of those
has title(D) 2 {0, 1}length
document
|D|
entropy(D)P
D
i=0p(w
i
, D)log2(p(wi
, D))
Table 3. Document related Meta-Features
information to detect for instance an anormal activity on an entitywhich might mean that something really important to that entity ishappening.
As shown on the figure 3 drew from the KBA13 stream-corpus, theburst does not always depict vital documents, although it still mightbe a relevant information for classification.
Figure 3. Burst on different entities does not always imply vital documents.
To depict the burst effect we used an implementation of the Klein-berg Algorithm [11]. Given a time series, it captures burst and mea-sure the strength of it as well as the direction (up or down). We de-cided to scale the time series on an hour basis. In order not to messthe classifiers with too many information we decided not to use thedirection as a feature but to merge the direction with the strength byapplying a coefficient of -1 when direction is down and 1 otherwise.
In addition to burst detection, we also consider the number of doc-uments having a mention the last 24hours.
We noticed from our last year experiments on KBA12 that timefeatures were actually degrading final results since when ignoringthem our scores was better. So we decided to focus only on features(cf table 4) that can really bring useful time information.
kleinberg1h burst strength and directionmatch24h # documents found last 24h
Table 4. Time related features used for classification
3.2.4 Classification
To perform the classification we decided not to rely only on onemethod. Instead we designed different ways to classify the informa-tion given the meta-features described in the previous section.
For the first method TwoSteps, we consider the problem as a bi-nary classification problem where we use two classifiers in cas-cade. The first one C
GN/UV
is to classify between two classes:Garbage/Neutral and Useful/Vital. For documents being classified asUseful/Vital a second classifier C
U/V
is used to determine the finaloutput class between Useful and Vital.
The second method Single performs directly a classification be-tween the four classes.
The third method VitalVSOthers trains a classifier on all docu-ments considering only two classes vital and others (all classes but
vital). When this classifier gives a non-vital class, the Single methodis used to determine another class from Garbage to Useful.
The last but not least method CombineScores uses scores emittedby all previous classifiers and try to learn the best output class con-sidering all classifiers scores for every classes.
4 Experiments on KBA Framework
4.1 Setup
The KBA organizers have built up a stream-corpus which is a hugecorpus of dated web documents that can be processed chronologi-cally. Hence it is possible to simulate a real time system. The doc-uments come from newswires, blogs, forums, review, memetracker.In addition, a set of target entities, coming from wikipedia or fromtwitter, has been selected for their ambiguity or unpopularity. Andlast but not least, more than 60,000 documents have been annotatedso that systems can train on it. The train period starts on documentspublished from october 2011 until february 2012, and the test periodstarts from february 2012 to february 2013.
The KBA track is divided into two tasks: CCR (Cumulative Citation Recommendation) and SSF (Streaming Slot Filling). The CCR task is to filter out documents worth citing in a profile of an entity (e.g., a Wikipedia or Freebase article). The SSF task is to detect changes on given slots for each of the target entities. We focus only on the CCR task.
The KBA task in 2013 is more challenging than in KBA12 since the entities are more diversified (29 Wikipedia entities in 2012 vs. 141 entities from Wikipedia and Twitter in 2013), the amount of annotated data per entity is much lower, and ranking the Vital class is more difficult. Table 5 shows the differences in the training data between KBA13 and KBA12.
Classes           #Docs (2012 / 2013)    #Docs/Entity (2012 / 2013)
Garbage            8467 / 2176            284 / 20
Neutral            1584 / 1152             73 / 11
Relevant/Useful    5186 / 2293            181 / 20
Central/Vital      2671 / 1718             92 / 19
Total             17482 / 7222
Table 5. Number of documents per class, and number of documents per class and per entity, for both the KBA12 and KBA13 evaluations.
4.2 System Output
We detailed in Section 3.2.1 how an entity profile is built and how this profile is dynamically altered by the updates on the entity Stream Information Language Model (eSilm_e). We ran different experiments to understand how this kind of model evolves using trivial update methods based on two parameters:
- UPDATE WITH: Full-Documents, Snippet, or No Update;
- UPDATE THRESHOLD: Useful/Vital documents, or Vital documents only.
Given those parameters, the system produces 5 different outputs depending on how the eSilm is updated: NO UPDT, V UPDT DOC, V UPDT SNPT, VU UPDT DOC, VU UPDT SNPT.
In addition, as described in Section 3.2.4, four classification methods are used to experiment with different types of classification.
To summarize, 20 outputs (5 update strategies × 4 classification methods) are expected at the end of the whole process.
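The 20 runs are simply the cross-product of the five eSilm update strategies and the four classification methods; a one-liner makes the combinatorics explicit (run-name formatting is ours, for illustration):

```python
from itertools import product

# Five eSilm update strategies crossed with four classification methods.
updates = ['NO_UPDT', 'V_UPDT_DOC', 'V_UPDT_SNPT', 'VU_UPDT_DOC', 'VU_UPDT_SNPT']
methods = ['TwoSteps', 'Single', 'VitalVSOthers', 'CombineScores']

runs = [f'{u}-{m}' for u, m in product(updates, methods)]
print(len(runs))  # → 20
```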
Jon Kleinberg, ‘Bursty and hierarchical structure in streams’, Data Mining and Knowledge Discovery, 7(4), 373–397, (2003)
Bouvier & Bellot, DN,2014
P. Bellot (AMU-CNRS, LSIS-OpenEdition)
V. Bouvier & P. Bellot (TREC 2014, to appear)
http://docreader:4444/data/index.html
DEMO: IR KBA platform software (Kware Company / LSIS), V. Bouvier, P. Bellot, M. Benoit
Some Interesting Perspectives
— More features, more (linguistic/semantic) resources, more data…
— Deeper linguistic/semantic analysis
= Machine Learning Approaches (Learning to rank) + Natural Language Processing + Knowledge Management
Interdisciplinarity:

— Neurolinguistics (which models could be adapted to Information Retrieval / Text Mining / Knowledge Retrieval)

— Psycholinguistics (psychological/neurobiological models and features)
One example ?
Recent publications
Scientific publications: h-index = 15; i10 = 22 (Google Scholar); 375 citations since 2009.
Edited book
1. P. Bellot, "Recherche d’information contextuelle, assistée et personnalisée", Hermès (collection Recherche d’Information et Web), 306 pages, Paris, ISBN 978-2746225831, December 2011.

Edited special issues
1. P. Bellot, C. Cauvet, G. Pasi, N. Valles, "Approches pour la recherche d’information en contexte", Document numérique RSTI série DN, vol. 15, num. 1/2012.

Edited conference proceedings
1. G. Pasi, P. Bellot, "COnférence en Recherche d’Infomations et Applications - CORIA 2011, 8th French Information Retrieval Conference", Avignon, France, Editions Universitaires d’Avignon, 2011.
2. F. Béchet, J.-F. Bonastre, P. Bellot, "Actes de JEP-TALN 2008 - Journées d’Etudes sur la Parole 2008, Traitement Automatique des Langues Naturelles 2008", Avignon, France, 2008.

Indexed journals
1. Romain Deveaud, Eric SanJuan, Patrice Bellot, "Accurate and Effective Latent Concept Modeling", Document Numérique RSTI, vol. 17-1, 2014.
2. L. Bonnefoy, V. Bouvier, P. Bellot, "Approches de classification pour le filtrage de documents importants au sujet d’une entité nommée", Document Numérique RSTI, vol. 17-1, 2014.
3. P. Bellot, B. Grau, "Recherche et Extraction d’Information", L’information Grammaticale, p. 37-45, 2014 (indexed by Persée), rank B AERES.
4. P. Bellot, A. Doucet, S. Geva, S. Gurajada, J. Kamps, G. Kazai, M. Koolen, V. Moriceau, J. Mothe, M. Sanderson, E. Sanjuan, F. Scholer, A. Schuh, X. Tannier, "Report on INEX 2013", ACM SIGIR Forum, vol. 47-2, p. 21-32, 2013.
5. P. Bellot, T. Chappell, A. Doucet, S. Geva, S. Gurajada, J. Kamps, G. Kazai, M. Koolen, M. Landoni, M. Marx, A. Mishra, V. Moriceau, J. Mothe, M. Preminger, G. Ramírez, M. Sanderson, E. Sanjuan, F. Scholer, A. Schuh, X. Tannier, M. Theobald, M. Trappett, A. Trotman, Q. Wang, "Report on INEX 2012", ACM SIGIR Forum, vol. 46-2, p. 50-59, 2012.
6. Patrice Bellot, Timothy Chappell, Antoine Doucet, Shlomo Geva, Jaap Kamps, Gabriella Kazai, Marijn Koolen, Monica Landoni, Maarten Marx, Véronique Moriceau, Josiane Mothe, G. Ramírez, Mark Sanderson, Eric SanJuan, Falk Scholer, Xavier Tannier, Martin Theobald, Matthew Trappett, Andrew Trotman, Qiuyue Wang, "Report on INEX 2011", ACM SIGIR Forum, vol. 46-1, p. 33-42, 2012.
7. D. Alexander, P. Arvola, T. Beckers, P. Bellot, T. Chappell, C.M. De Vries, A. Doucet, N. Fuhr, S. Geva, J. Kamps, G. Kazai, M. Koolen, S. Kutty, M. Landoni, V. Moriceau, R. Nayak, R. Nordlie, N. Pharo, E. SanJuan, R. Schenkel, A. Tagarelli, X. Tannier, J.A. Thom, A. Trotman, J. Vainio, Q. Wang, C. Wu, "Report on INEX 2010", ACM SIGIR Forum, vol. 45-1, p. 2-17, 2011.
8. R. Lavalley, C. Clavel, P. Bellot, "Extraction probabiliste de chaînes de mots relatives à une opinion", Traitement Automatique des Langues (TAL), vol. 50-3, p. 101-130, 2011, rank A AERES.
9. L. Sitbon, P. Bellot, P. Blache, "Vers une recherche d’informations adaptée aux capacités de lecture des utilisateurs – Recherche d’informations et résumé automatique pour des personnes dyslexiques", Revue des Sciences et Technologies de l’Information, série Document numérique, vol. 13-1, p. 161-186, 2010.
10. T. Beckers, P. Bellot, G. Demartini, L. Denoyer, C. M. De Vries, A. Doucet, K. N. Fachry, N. Fuhr, P. Gallinari, S. Geva, W.-C. Huang, T. Iofciu, J. Kamps, G. Kazai, M. Koolen, S. Kutty, M. Landoni, M. Lehtonen, V. Moriceau, R. Nayak, R. Nordlie, N. Pharo, E. SanJuan, R. Schenkel, X. Tannier, M. Theobald, J. A. Thom, A. Trotman, A. P. de Vries, "Report on INEX 2009", ACM SIGIR Forum, vol. 44-1, p. 38-57, August 2010. DOI=10.1145/1842890.1842897, http://doi.acm.org/10.1145/1842890.1842897
11. Juan-Manuel Torres-Moreno, Pier-Luc St-Onge, Michel Gagnon, Marc El-Bèze, Patrice Bellot, "Automatic Summarization System coupled with a Question-Answering System (QAAS)", CoRR, arXiv:0905.2990v1, 2009.
12. P. Zweigenbaum, B. Grau, A.-L. Ligozat, I. Robba, S. Rosset, X. Tannier, A. Vilnat (LIMSI) & P. Bellot (Univ. Avignon), "Apports de la linguistique dans les systèmes de recherche d’informations précises", RFLA (Revue Française de Linguistique Appliquée), XIII (1), p. 41-62, 2008. Special issue on the contribution of linguistics to information extraction, with contributions by C.J. Van Rijsbergen (Glasgow), H. Saggion (Sheffield), P. Vossen (Amsterdam) and M.C. L’Homme (Montréal); http://www.rfla-journal.org/som_2008-1.html
13. L. Sitbon, P. Bellot, P. Blache, "Éléments pour adapter les systèmes de recherche d’information aux dyslexiques", Traitement Automatique des Langues (TAL), vol. 48-2, p. 123-147, 2007, rank A AERES.
14. Laurent Gillard, Laurianne Sitbon, Patrice Bellot, Marc El-Bèze, "Dernières évolutions de SQuALIA, le système de Questions/Réponses du LIA", Traitement Automatique des Langues (TAL), vol. 46-3, p. 41-70, Hermès, 2006.
15. P. Bellot, M. El-Bèze, « Classification locale non supervisée pour la recherche documentaire », Traitement Automatique des Langues (TAL), vol. 42-2, Hermès, p. 335-366, 2001.
16. P. Bellot, M. El-Bèze, « Classification et segmentation de textes par arbres de décision », Technique et Science Informatiques (TSI), Editions Hermès, vol. 20-3, p. 397-424, 2001.
17. P.-F. Marteau, C. De Loupy, P. Bellot, M. El-Bèze, « Le Traitement Automatique du Langage Naturel, Outil d’Assistance à la Fonction d’Intelligence Economique », Systèmes et Sécurité, vol. 5, num. 4, p. 8-41, 1999.

Book chapters
1. P. Bellot, L. Bonnefoy, V. Bouvier, F. Duvert, Young-Min Kim, "Large Scale Text Mining Approaches for Information Retrieval and Extraction", in Innovations in Intelligent Machines-4, chapter 1, Springer International Publishing Switzerland, editors: Lakhmi C., Colette Faucher, pp. 1-43, ISBN 978-3-319-01865-2, 2013.
2. J.M. Torres-Moreno, M. El-Bèze, P. Bellot, F. Béchet, "Opinion Detection as a Topic Classification Problem", in "Textual Information Access: Statistical Models", E. Gaussier & F. Yvon Eds., J. Wiley-ISTE, chapter 9, ISBN 978-1-84821-322-7, 2012.
3. P. Bellot, "Vers une prise en compte de certains handicaps langagiers dans les processus de recherche d’information", in "Recherche d’information contextuelle, assistée et personnalisée", edited by P. Bellot, chapter 7, p. 191-226, collection Recherche d’information et Web, Hermès, 2011.
4. J.M. Torres-Moreno, M. El-Bèze, P. Bellot, F. Béchet, "Peut-on voir la détection d’opinions comme un problème de classification thématique ?", in "Modèles statistiques pour l’accès à l’information textuelle", edited by E. Gaussier and F. Yvon, Hermès, chapter 9, p. 389-422, 2011.
5. P. Bellot, M. Boughanem, "Recherche d’information et systèmes de questions-réponses", in "La recherche d’informations précises : traitement automatique de la langue, apprentissage et connaissances pour les systèmes de question-réponse" (Traité IC2, série Informatique et systèmes d’information), edited by B. Grau, Hermès-Lavoisier, chapter 1, p. 5-35, 2008.
6. Patrice Bellot, "Classification de documents et enrichissement de requêtes", in "Méthodes avancées pour les systèmes de recherche d’informations" (Traité des sciences et techniques de l’information), edited by M. Ihadjadene, chapter 4, p. 73-96, Hermès, 2004.
7. J.-C. Meilland, P. Bellot, "Extraction automatique de terminologie à partir de libellés textuels courts", in "La Linguistique de corpus", edited by G. Williams, Presses Universitaires de Rennes, p. 357-370, 2005.

Peer-reviewed international conferences (ACTI)
1. H. Hamdan, P. Bellot, F. Béchet, "The Impact of Z score on Twitter Sentiment Analysis", Int. Workshop on Semantic Evaluation (SEMEVAL 2014), COLING 2014, Dublin, Ireland.
2. Chahinez Benkoussas, Hussam Hamdan, Patrice Bellot, Frédéric Béchet, Elodie Faath, "A Collection of Scholarly Book Reviews from the Platforms of electronic sources in Humanities and Social Sciences OpenEdition.org", 9th International Conference on Language Resources and Evaluation (LREC 2014), Reykjavik, Iceland, May 2014.
3. Romain Deveaud, Eric SanJuan, Patrice Bellot, "Are Semantically Coherent Topic Models Useful for Ad Hoc Information Retrieval?", 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013), Sofia, Bulgaria, August 2013.
4. L. Bonnefoy, V. Bouvier, P. Bellot, "A weakly-supervised detection of entity central documents in a stream", The 36th Annual ACM SIGIR Conference SIGIR’13, Dublin, Ireland, July 2013.
5. Romain Deveaud, Eric SanJuan, Patrice Bellot, "Estimating Topical Context by Diverging from External Resources", The 36th Annual ACM SIGIR Conference SIGIR’13, Dublin, Ireland, July 2013.
LSIS - DIMAG team: http://www.lsis.org/spip.php?id_rubrique=291 | OpenEdition Lab: http://lab.hypotheses.org