3.3 Probabilistic IR - htw saar · 2019-02-28 · 28 3.3 Probabilistic IR § Vector space model is...


3.3 Probabilistic IR

§ Vector space model is commonly criticized for being heuristic and lacking a clear model of when a document should be considered relevant

§ Probabilistic IR relies on probability theory to model the event that a document d is relevant to a query q

§ This probability is then estimated based on the terms contained in the document and the query

Information Retrieval / Chapter 3: Retrieval Models


Events and Probabilities

§ Let’s consider two events A and B
  § A is the event that an object is a circle
  § B is the event that an object is green

§ We refer to A ∧ B as the joint event that an object is a green circle

$$P[A] = \frac{5}{9} \qquad P[B] = \frac{4}{9} \qquad P[A \wedge B] = P[A, B] = \frac{3}{9}$$


Conditional Probabilities

§ The conditional probability P[B|A] (B given A) is the probability that the event B occurs if we already know that the event A has occurred

$$P[B \mid A] = \frac{P[A \wedge B]}{P[A]}$$

§ In our example

$$P[B \mid A] = \frac{3}{5} \qquad P[A \mid B] = \frac{3}{4}$$
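The conditional probabilities above can be checked with exact fractions. This is a minimal sketch that only uses the values stated on the slides; the helper name `conditional` is an assumption made for illustration.

```python
from fractions import Fraction

# Probabilities as stated in the example.
p_a = Fraction(5, 9)   # P[A]: object is a circle
p_b = Fraction(4, 9)   # P[B]: object is green
p_ab = Fraction(3, 9)  # P[A ∧ B]: object is a green circle


def conditional(p_joint, p_condition):
    """Return P[X | Y] = P[X ∧ Y] / P[Y] (hypothetical helper)."""
    return p_joint / p_condition


p_b_given_a = conditional(p_ab, p_a)
p_a_given_b = conditional(p_ab, p_b)
print(p_b_given_a, p_a_given_b)  # 3/5 3/4
```

Using `Fraction` instead of floats keeps the results exact, so they can be compared directly with the fractions on the slide.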


Independence

§ Two events A and B are called (stochastically) independent, if the following holds for their joint probability

$$P[A \wedge B] = P[A] \, P[B]$$

§ In our example, the events A and B are not independent

$$\frac{3}{9} \neq \frac{5}{9} \cdot \frac{4}{9}$$
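The independence check can be sketched in a few lines. The function name `independent` and the second set of values are assumptions chosen only to show a case where the condition holds.

```python
from fractions import Fraction


def independent(p_a, p_b, p_joint):
    """True if P[A ∧ B] = P[A] * P[B] (hypothetical helper)."""
    return p_joint == p_a * p_b


# Values from the circle/green example: 3/9 differs from 5/9 * 4/9 = 20/81.
example = independent(Fraction(5, 9), Fraction(4, 9), Fraction(3, 9))

# Made-up values for which the condition holds: 1/2 * 1/3 = 1/6.
made_up = independent(Fraction(1, 2), Fraction(1, 3), Fraction(1, 6))

print(example, made_up)  # False True
```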


Bayes’ Theorem

§ Thomas Bayes (1701-1761) famously observed the following theorem regarding the conditional probabilities of events

$$P[A \mid B] = \frac{P[B \mid A] \, P[A]}{P[B]}$$

§ Bayes’ theorem is particularly useful when, for two events A and B, one of the conditional probabilities is easy to estimate, but the other is hard to estimate

Image source: en.wikipedia.org


Bayes’ Theorem in Action

§ Example: Examining animals in the wild
  § A is the event that the animal is a fox
  § B is the event that the animal has rabies (“Tollwut”)

§ Assume that we know the following probabilities
  § P[A] = 0.1 (e.g., estimated based on video surveillance)
  § P[B] = 0.05 (e.g., estimated based on hunted animals)
  § P[A|B] = 0.25 (e.g., estimated based on deceased animals)

§ We can now estimate the probability that a fox has rabies

$$P[B \mid A] = \frac{P[A \mid B] \, P[B]}{P[A]} = \frac{0.25 \cdot 0.05}{0.1} = 0.125$$
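The fox example can be reproduced as a small calculation. This is a sketch only; the function name `bayes` and the variable names are assumptions, and the numbers are the ones given in the example.

```python
import math


def bayes(p_a_given_b, p_b, p_a):
    """Return P[B | A] = P[A | B] * P[B] / P[A] (hypothetical helper)."""
    return p_a_given_b * p_b / p_a


p_fox = 0.1                # P[A]
p_rabies = 0.05            # P[B]
p_fox_given_rabies = 0.25  # P[A | B]

p_rabies_given_fox = bayes(p_fox_given_rabies, p_rabies, p_fox)
print(round(p_rabies_given_fox, 3))  # 0.125
```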


Probabilistic Ranking Principle (PRP)

§ Probabilistic Ranking Principle (PRP) suggests that documents should be ranked in descending order of their probability

$$P[R = 1 \mid d, q]$$

of being relevant to the query (R = 1 indicates the event of observing a relevant document)

§ PRP maximizes precision under the assumptions that the probabilities can be determined exactly and that they are independent (both questionable assumptions)


Binary Independence Model

§ Binary Independence Model (BIM) considers documents and queries as sets of terms, i.e., a term either occurs in a document or it doesn’t

§ BIM assumes that terms occur independently from each other in documents (a questionable assumption)

§ Documents are ranked, following the PRP, according to their probability P[R = 1 | d, q] with

$$P[R = 1 \mid d, q] + P[R = 0 \mid d, q] = 1$$


Binary Independence Model

§ We obtain the same ranking of documents, if we consider their so-called odds ratios

$$O[R \mid d, q] = \frac{P[R = 1 \mid d, q]}{P[R = 0 \mid d, q]}$$

§ Applying Bayes’ theorem we obtain

$$O[R \mid d, q] = \underbrace{\frac{P[R = 1 \mid q]}{P[R = 0 \mid q]}}_{\text{constant (depends only on } q\text{)}} \cdot \frac{P[d \mid R = 1, q]}{P[d \mid R = 0, q]} \propto \frac{P[d \mid R = 1, q]}{P[d \mid R = 0, q]}$$
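Ranking by odds gives the same order as ranking by probability because p / (1 - p) grows monotonically with p. The sketch below illustrates this with made-up relevance probabilities; the document names and values are assumptions, not data from the lecture.

```python
def odds(p):
    """Odds of relevance for a probability p with 0 <= p < 1."""
    return p / (1.0 - p)


# Hypothetical values of P[R = 1 | d, q] for three documents.
probs = {"d1": 0.2, "d2": 0.7, "d3": 0.5}

by_prob = sorted(probs, key=probs.get, reverse=True)
by_odds = sorted(probs, key=lambda d: odds(probs[d]), reverse=True)
print(by_prob, by_odds)  # ['d2', 'd3', 'd1'] ['d2', 'd3', 'd1']
```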


Binary Independence Model

§ Assuming that terms occur independently

$$\frac{P[d \mid R = 1, q]}{P[d \mid R = 0, q]} = \prod_{v \in V} \frac{P[v \mid R = 1, q]}{P[v \mid R = 0, q]}$$

with V as the vocabulary of all known terms

§ Assuming that only terms from the query play a role

$$\frac{P[d \mid R = 1, q]}{P[d \mid R = 0, q]} \approx \prod_{v \in q} \frac{P[v \mid R = 1, q]}{P[v \mid R = 0, q]}$$


Binary Independence Model

§ We can distinguish between terms that occur in a document and terms that don’t

$$\frac{P[d \mid R = 1, q]}{P[d \mid R = 0, q]} \approx \prod_{\substack{v \in q \\ v \in d}} \frac{P[v \mid R = 1, q]}{P[v \mid R = 0, q]} \cdot \prod_{\substack{v \in q \\ v \notin d}} \frac{P[v \mid R = 1, q]}{P[v \mid R = 0, q]}$$

§ Let p_v and u_v denote the probabilities that a term v occurs in a relevant and irrelevant document, respectively

$$\frac{P[d \mid R = 1, q]}{P[d \mid R = 0, q]} \approx \prod_{\substack{v \in q \\ v \in d}} \frac{p_v}{u_v} \cdot \prod_{\substack{v \in q \\ v \notin d}} \frac{1 - p_v}{1 - u_v}$$


Binary Independence Model

§ This can be rewritten as

$$\frac{P[d \mid R = 1, q]}{P[d \mid R = 0, q]} \approx \prod_{\substack{v \in q \\ v \in d}} \frac{p_v (1 - u_v)}{u_v (1 - p_v)} \cdot \underbrace{\prod_{v \in q} \frac{1 - p_v}{1 - u_v}}_{\text{constant (depends only on } q\text{)}} \propto \prod_{\substack{v \in q \\ v \in d}} \frac{p_v (1 - u_v)}{u_v (1 - p_v)}$$
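The rewriting step can be verified numerically: splitting the query terms into those present in and absent from the document gives the same value as the document-dependent product times the query-only constant. The term probabilities, query, and document below are made up for illustration.

```python
import math

# Hypothetical estimates of p_v and u_v for three query terms.
p = {"probabilistic": 0.6, "retrieval": 0.5, "model": 0.4}
u = {"probabilistic": 0.05, "retrieval": 0.2, "model": 0.3}
query = set(p)
doc = {"probabilistic", "model", "ranking"}

# Form before rewriting: present terms times absent terms.
present = math.prod(p[v] / u[v] for v in query if v in doc)
absent = math.prod((1 - p[v]) / (1 - u[v]) for v in query if v not in doc)
before = present * absent

# Form after rewriting: document-dependent part times query-only constant.
doc_part = math.prod(
    p[v] * (1 - u[v]) / (u[v] * (1 - p[v])) for v in query if v in doc
)
constant = math.prod((1 - p[v]) / (1 - u[v]) for v in query)
after = doc_part * constant

print(math.isclose(before, after))  # True
```

Since `constant` is the same for every document, only `doc_part` matters for the ranking.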


Computing with Probabilities

§ When representing probabilities as floating point numbers (e.g., double in Java) we have to worry about numerical imprecision

§ We can mitigate the problem of numerical imprecision by applying a logarithmic transformation, thus turning products into sums and operating with logarithms of probabilities
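The effect is easy to provoke. The sketch below multiplies one hundred small probabilities, which underflows to zero in double precision (Python floats use the same 64-bit format as a Java double), while the sum of their logarithms stays representable. The probability values are made up for illustration.

```python
import math

# One hundred hypothetical term probabilities of 1e-5 each.
probs = [1e-5] * 100

product = 1.0
for p in probs:
    product *= p  # the true value 1e-500 is far below the smallest double

log_sum = sum(math.log(p) for p in probs)

print(product)  # 0.0
print(log_sum)  # roughly -1151.29
```

Because the logarithm is monotonic, comparing `log_sum` values orders documents the same way as comparing the products would.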


Binary Independence Model

§ Applying a logarithmic transformation to the binary independence model, we obtain

$$\log \left( \prod_{\substack{v \in q \\ v \in d}} \frac{p_v (1 - u_v)}{u_v (1 - p_v)} \right) = \sum_{\substack{v \in q \\ v \in d}} \log \frac{p_v (1 - u_v)}{u_v (1 - p_v)} = RSV_d$$

§ We can return documents in descending order of their retrieval status value (RSV_d) and obtain the same ranking that we would have obtained when computing with the actual probabilities

§ How can we estimate the probabilities p_v and u_v?
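A minimal sketch of ranking by this log-transformed score follows. The function name `rsv`, the term probabilities, and the three toy documents are all assumptions made for illustration.

```python
import math


def rsv(doc_terms, query_terms, p, u):
    """Sum log(p_v (1 - u_v) / (u_v (1 - p_v))) over query terms in the document."""
    score = 0.0
    for v in query_terms:
        if v in doc_terms:
            score += math.log(p[v] * (1 - u[v]) / (u[v] * (1 - p[v])))
    return score


# Hypothetical estimates of p_v and u_v.
p = {"fox": 0.5, "rabies": 0.5}
u = {"fox": 0.1, "rabies": 0.01}
query = ["fox", "rabies"]
docs = {"d1": {"fox", "forest"}, "d2": {"fox", "rabies"}, "d3": {"forest"}}

ranking = sorted(docs, key=lambda d: rsv(docs[d], query, p, u), reverse=True)
print(ranking)  # ['d2', 'd1', 'd3']
```

A document without any query term gets a score of 0, which corresponds to an odds factor of 1.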


Binary Independence Model

§ Assuming that most documents in the document collection are irrelevant to any query, we estimate

$$u_v = \frac{df(v)}{|D|}$$

as the probability that the term v occurs in a document that is irrelevant to the query


Binary Independence Model

§ We have no information about which documents are relevant to the query and thus estimate

$$p_v = (1 - p_v) = 0.5$$

as the probability that the term v occurs in a document that is relevant to the query


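Binary Independence Model

§ Retrieval status value RSV_d can thus be rewritten as the following variant of tf.idf

$$
\begin{aligned}
RSV_d &= \sum_{\substack{v \in q \\ v \in d}} \log \frac{p_v (1 - u_v)}{u_v (1 - p_v)} = \sum_{\substack{v \in q \\ v \in d}} \log \frac{1 - u_v}{u_v} \\
&= \sum_{\substack{v \in q \\ v \in d}} \log \frac{1 - \frac{df(v)}{|D|}}{\frac{df(v)}{|D|}} = \sum_{\substack{v \in q \\ v \in d}} \log \frac{|D| - df(v)}{df(v)} \\
&\approx \sum_{\substack{v \in q \\ v \in d}} \log \frac{|D|}{df(v)}
\end{aligned}
$$

under the assumption that most terms occur rarely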
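The quality of the final approximation can be checked numerically. In the sketch below, the collection size, document frequencies, and function names are assumptions; for rare terms the exact and the idf-style scores differ only slightly.

```python
import math


def rsv_exact(doc_terms, query_terms, df, n_docs):
    """Sum of log((|D| - df(v)) / df(v)) over query terms in the document."""
    return sum(
        math.log((n_docs - df[v]) / df[v]) for v in query_terms if v in doc_terms
    )


def rsv_idf(doc_terms, query_terms, df, n_docs):
    """Sum of log(|D| / df(v)), the approximation for rare terms."""
    return sum(math.log(n_docs / df[v]) for v in query_terms if v in doc_terms)


# Hypothetical collection statistics.
n_docs = 1_000_000
df = {"okapi": 120, "retrieval": 8_500, "model": 40_000}
query = ["okapi", "retrieval"]
doc = {"okapi", "retrieval", "ranking"}

exact = rsv_exact(doc, query, df, n_docs)
approx = rsv_idf(doc, query, df, n_docs)
print(abs(exact - approx) < 0.01)  # True
```

Note that the exact form is undefined for a term that occurs in every document (df(v) = |D|), whereas the approximation yields 0 for it.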


Binary Independence Model

§ Binary Independence Model has been shown to obtain good results on collections with documents having homogeneous lengths; it does not work well when document lengths differ a lot (e.g., on the Web)

§ Relevance feedback by a user can be incorporated when estimating the probabilities p_v and u_v

§ While more principled than the vector space model, many of the assumptions made are questionable in practice (e.g., independence of terms)


Okapi BM25

§ Okapi BM25 is a probabilistic retrieval model that builds on the binary independence model but takes term frequencies into account

§ It assumes that terms in relevant and irrelevant documents are distributed according to a Poisson distribution

$$P[tf(v, d) = k] = \frac{\lambda^k}{k!} \, e^{-\lambda}$$

§ Derivation of the retrieval status value is beyond the scope of this lecture
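The Poisson probability mass function can be written directly from the formula. This is a sketch only; the function name `poisson_pmf` and the rate of 2.0 are assumptions chosen for illustration.

```python
import math


def poisson_pmf(k, lam):
    """P[tf = k] = lam^k / k! * exp(-lam) for a non-negative integer k."""
    return lam**k / math.factorial(k) * math.exp(-lam)


lam = 2.0  # hypothetical expected term frequency
pmf = [poisson_pmf(k, lam) for k in range(60)]

# The probabilities over all k sum to 1 (up to rounding and truncation).
print(math.isclose(sum(pmf), 1.0))  # True
```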


Okapi BM25

$$RSV_d = \sum_{v \in q} \frac{(k_1 + 1) \, tf(v, d)}{k_1 \left( (1 - b) + b \, (|d| / avdl) \right) + tf(v, d)} \cdot \log \frac{|D| - df(v) + 0.5}{df(v) + 0.5}$$

§ Parameter k_1 controls influence of term frequencies
  § k_1 = 0.0 yields a binary model similar to the BIM
  § k_1 = 1.2 is a common choice in practice

§ Parameter b controls the normalization of term frequencies based on the document length |d| and the average document length avdl
  § b = 0.0 ignores document lengths
  § b = 0.75 is a common choice in practice
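The scoring formula translates into a short function. This is a minimal sketch of the formula as given here, not the implementation of any particular search library; the function name `bm25_rsv`, its signature, and the collection statistics are assumptions.

```python
import math
from collections import Counter


def bm25_rsv(query_terms, doc_tokens, df, n_docs, avdl, k1=1.2, b=0.75):
    """Score one tokenized document against a query with the BM25 formula."""
    tf = Counter(doc_tokens)
    # Length normalization: (1 - b) + b * (|d| / avdl)
    length_norm = (1 - b) + b * (len(doc_tokens) / avdl)
    score = 0.0
    for v in query_terms:
        if tf[v] == 0:
            continue  # absent terms contribute nothing to the sum
        idf = math.log((n_docs - df[v] + 0.5) / (df[v] + 0.5))
        score += (k1 + 1) * tf[v] / (k1 * length_norm + tf[v]) * idf
    return score


# Hypothetical collection statistics and document.
n_docs = 1000
avdl = 4.0
df = {"okapi": 10, "model": 200}
doc = ["okapi", "okapi", "model", "ranking"]

score = bm25_rsv(["okapi", "model"], doc, df, n_docs, avdl)
binary = bm25_rsv(["okapi", "model"], doc, df, n_docs, avdl, k1=0.0)
print(score > 0, binary > 0)  # True True
```

With `k1=0.0` the term-frequency factor becomes 1 for every matching term, so the score reduces to a sum of the logarithmic weights, in line with the remark that this setting yields a binary model. With `b=0.0` the length normalization is 1 regardless of |d|.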


Okapi BM25

§ Okapi BM25F is an extension that can deal with fielded documents (e.g., title, abstract, body)

§ Okapi BM25 has been shown to yield excellent results in different settings and is considered one of the state-of-the-art retrieval models (e.g., available in Apache Lucene)

§ While more principled than the vector space model, many of the assumptions made are questionable in practice (e.g., independence of terms)


Summary

§ Probabilistic IR relies on probability theory to model the event that a document is relevant to a query

§ Probabilistic Ranking Principle suggests ranking documents according to their probability of being relevant

§ Binary Independence Model considers whether a term occurs in a document or not and assumes independence

§ Okapi BM25 is a more sophisticated model that yields good results and is considered state of the art

