Probabilistic Information Retrieval [ ProbIR ] Suman K Mitra DAIICT, Gandhinagar...
Probabilistic Information Retrieval
[ ProbIR ]
Suman K Mitra
DAIICT, Gandhinagar
Acknowledgment
Alexander Dekhtyar, University of Maryland
Mandar Mitra, ISI, Kolkata
Prasenjit Majumder, DAIICT, Gandhinagar
Why use Probabilities?
• Information Retrieval deals with uncertain information
• Probability is a measure of uncertainty
• Probabilistic Ranking Principle
  – provable
  – minimization of risk
• Probabilistic Inference
  – to justify your decision
Basic IR System

Document Collection → Document Representation
Query → Query Representation

1. How good is the representation?
2. How exact is the representation?
3. How well is the query matched?
4. How relevant is the result to the query?
Approaches and main Contributors
Probability Ranking Principle – Robertson 1970 onwards
Information Retrieval as Probabilistic Inference – Van Rijsbergen et al. 1970 onwards
Probabilistic Indexing – Fuhr et al. 1980 onwards
Bayesian Nets in Information Retrieval – Turtle, Croft 1990 onwards
Probabilistic Logic Programming in Information Retrieval – Fuhr et al. 1990 onwards
Probability Ranking Principle

1. Collection of documents
2. Representation of documents
3. User issues a query
4. Representation of the query
5. A set of documents to return

Question: In what order should documents be presented to the user?
Logically: best document first, then the next best, and so on.
Requirement: a formal way to judge the goodness of documents with respect to the query.
Possibility: the probability of relevance of the document with respect to the query.
Probability Ranking Principle
If a retrieval system’s response to each request is a ranking of the documents in the collections in order of decreasing probability of goodness to the user who submitted the request ...
… where the probabilities are estimated as accurately as possible on the basis of whatever data made available to the system for this purpose ...
… then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data.
— W. S. Cooper
Probability Basics: Bayes' Rule

Let a and b be two events. From $p(a \mid b)\,p(b) = p(a, b) = p(b \mid a)\,p(a)$:

$$p(a \mid b) = \frac{p(b \mid a)\,p(a)}{p(b)}, \qquad p(b \mid a) = \frac{p(a \mid b)\,p(b)}{p(a)}$$

The odds of an event a are defined as

$$O(a) = \frac{p(a)}{p(\bar{a})} = \frac{p(a)}{1 - p(a)}$$
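The rule and the odds transform can be checked with a couple of lines of Python (all numbers below are made up for illustration):

```python
# Toy illustration of Bayes' rule and odds; the probabilities are invented.
p_a = 0.3          # p(a): prior probability of event a
p_b_given_a = 0.8  # p(b|a)
p_b = 0.5          # p(b)

# Bayes' rule: p(a|b) = p(b|a) p(a) / p(b)
p_a_given_b = p_b_given_a * p_a / p_b

# Odds: O(a) = p(a) / (1 - p(a))
odds_a = p_a / (1 - p_a)

print(p_a_given_b)       # 0.48
print(round(odds_a, 4))  # 0.4286
```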
Conditional probability satisfies all axioms of probability:

(i) $0 \le p(a \mid b) \le 1$
(ii) $p(S \mid b) = 1$
(iii) if the $a_i$ are mutually exclusive events, then $p\big(\bigcup_i a_i \mid b\big) = \sum_i p(a_i \mid b)$

For (i): $p(a \mid b) = \frac{p(ab)}{p(b)}$, with $p(ab) \ge 0$ and $p(b) > 0$, so $p(a \mid b) \ge 0$; and since $ab \subseteq b$, $p(ab) \le p(b)$, hence $p(a \mid b) \le 1$. Hence (i).

For (ii): $p(S \mid b) = \frac{p(Sb)}{p(b)} = \frac{p(b)}{p(b)} = 1$. Hence (ii).
[Figure: an event b intersecting mutually exclusive events a1, a2, a3, a4]

For (iii):

$$p\Big(\bigcup_i a_i \,\Big|\, b\Big) = \frac{p\big((\bigcup_i a_i)\,b\big)}{p(b)} = \frac{p\big(\bigcup_i (a_i b)\big)}{p(b)} = \frac{\sum_i p(a_i b)}{p(b)} = \sum_i p(a_i \mid b)$$

[the events $a_i b$ are all mutually exclusive]. Hence (iii).
Probability Ranking Principle

Let x be a document in the collection. Let R represent relevance of a document w.r.t. a given (fixed) query, and let NR represent non-relevance.

p(R|x): probability that a retrieved document x is relevant.
p(NR|x): probability that a retrieved document x is non-relevant.

$$p(R \mid x) = \frac{p(x \mid R)\,p(R)}{p(x)}, \qquad p(NR \mid x) = \frac{p(x \mid NR)\,p(NR)}{p(x)}$$

p(R), p(NR): prior probabilities of retrieving a relevant and a non-relevant document, respectively.
p(x|R), p(x|NR): probability that if a relevant (non-relevant) document is retrieved, it is x.
Probability Ranking Principle

$$p(R \mid x) = \frac{p(x \mid R)\,p(R)}{p(x)}, \qquad p(NR \mid x) = \frac{p(x \mid NR)\,p(NR)}{p(x)}$$

Ranking Principle (Bayes' Decision Rule): if p(R|x) > p(NR|x), then x is relevant; otherwise x is not relevant.
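A minimal sketch of this decision rule in Python (the probabilities are hypothetical; since p(x) divides both posteriors, it cancels out of the comparison):

```python
# Bayes' decision rule for relevance; all input probabilities are invented.
def is_relevant(p_x_given_R, p_x_given_NR, p_R, p_NR):
    """Decide relevance of document x: relevant iff p(R|x) > p(NR|x).

    p(x) divides both posteriors, so comparing the unnormalised
    products p(x|R) p(R) and p(x|NR) p(NR) gives the same decision."""
    return p_x_given_R * p_R > p_x_given_NR * p_NR

# A rare relevant class, but a strong likelihood in its favour:
print(is_relevant(0.09, 0.01, 0.2, 0.8))  # True: 0.018 > 0.008
```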
Probability Ranking Principle

Does the PRP minimize the average probability of error?

$$p(\text{error} \mid x) = \begin{cases} p(R \mid x) & \text{if we decide } NR \\ p(NR \mid x) & \text{if we decide } R \end{cases}$$

$$p(\text{error}) = \sum_x p(\text{error} \mid x)\,p(x)$$

p(error) is minimal when all p(error|x) are minimal. Bayes' decision rule minimizes each p(error|x).

Decision vs. actual outcome:
• decide x is R when x is actually NR → error (case 2)
• decide x is NR when x is actually R → error (case 1)
$$p(\text{error}) = \sum_x p(\text{error} \mid x)\,p(x)$$

Since x is decided to be either in R or in NR:

$$p(\text{error}) = \sum_{x \in R} p(NR \mid x)\,p(x) + \sum_{x \in NR} p(R \mid x)\,p(x)$$

Writing $\sum_{x \in R} p(NR \mid x)\,p(x) = \sum_x p(NR \mid x)\,p(x) - \sum_{x \in NR} p(NR \mid x)\,p(x)$:

$$p(\text{error}) = \sum_{x \in NR} \{p(R \mid x) - p(NR \mid x)\}\,p(x) + \underbrace{\sum_x p(NR \mid x)\,p(x)}_{\text{Constant}}$$

Symmetrically,

$$p(\text{error}) = \sum_{x \in R} \{p(NR \mid x) - p(R \mid x)\}\,p(x) + \text{Constant}$$
Minimization of p(error) ⇔ minimization of 2p(error). Adding the two forms above (the constants drop out of the minimization):

$$\text{Min} \left[ \sum_{x \in R} \{p(NR \mid x) - p(R \mid x)\}\,p(x) + \sum_{x \in NR} \{p(R \mid x) - p(NR \mid x)\}\,p(x) \right]$$

p(x) is the same for any x. Define

$$S_1 = \{x : p(R \mid x) - p(NR \mid x) \ge 0\}, \qquad S_2 = \{x : p(R \mid x) - p(NR \mid x) < 0\}$$

$S_1$ and $S_2$ are nothing but the decisions for x to be in R and NR respectively. Each summand is then non-positive exactly when the decision matches $S_1$, $S_2$.

Hence the decision $S_1$ and $S_2$ minimizes p(error).
Probability Ranking Principle: Issues

• How do we compute all those probabilities?
  – We cannot compute exact probabilities; we have to use estimates from the (ground truth) data.
  – (Binary Independence Retrieval)
  – (Bayesian Networks??)
• Assumptions:
  – "Relevance" of each document is independent of the relevance of other documents.
  – Most applications are for the Boolean model.
Probability Ranking Principle

• Simple case: no selection costs.
• x is relevant iff p(R|x) > p(NR|x) (Bayes' Decision Rule).
• PRP: rank all documents by p(R|x).
Probability Ranking Principle

• More complex case: retrieval costs.
  – C: cost of retrieval of a relevant document
  – C′: cost of retrieval of a non-relevant document
  – let d be a document
• Probability Ranking Principle: if

$$C \cdot p(R \mid d) + C' \cdot (1 - p(R \mid d)) \;\le\; C \cdot p(R \mid d') + C' \cdot (1 - p(R \mid d'))$$

for all d′ not yet retrieved, then d is the next document to be retrieved.
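One way to read this criterion: retrieve documents in order of increasing expected cost. A small Python sketch with hypothetical costs and relevance probabilities:

```python
# Cost-based PRP sketch; the probabilities and costs are invented.
def expected_cost(p_rel, C, C_prime):
    """Expected cost of retrieving a document with relevance probability
    p_rel: cost C if it is relevant, C' if it is non-relevant."""
    return C * p_rel + C_prime * (1 - p_rel)

docs = {"d1": 0.9, "d2": 0.4, "d3": 0.7}  # hypothetical p(R|d) estimates
C, C_prime = 0.0, 1.0                      # retrieving a relevant doc is free

# Retrieve in order of increasing expected cost.
ranking = sorted(docs, key=lambda d: expected_cost(docs[d], C, C_prime))
print(ranking)  # ['d1', 'd3', 'd2']
```

With C = 0 and C′ = 1 this reduces to ranking by decreasing p(R|d), i.e. the plain PRP.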
Binary Independence Model
Binary Independence Model

• Traditionally used in conjunction with the PRP.
• "Binary" = Boolean: documents are represented as binary vectors of terms, $x = (x_1, \ldots, x_n)$, where $x_i = 1$ iff term i is present in document x.
• "Independence": terms occur in documents independently.
• Different documents can be modeled as the same vector.
Binary Independence Model

• Queries: binary vectors of terms.
• Given query q:
  – for each document d, we need to compute p(R|q,d);
  – replace this with computing p(R|q,x), where x is the vector representing d.
• We are interested only in the ranking.
• Use odds:

$$O(R \mid q, x) = \frac{p(R \mid q, x)}{p(NR \mid q, x)} = \frac{p(R \mid q)}{p(NR \mid q)} \cdot \frac{p(x \mid R, q)}{p(x \mid NR, q)}$$
Binary Independence Model

• Using the independence assumption:

$$\frac{p(x \mid R, q)}{p(x \mid NR, q)} = \prod_{i=1}^{n} \frac{p(x_i \mid R, q)}{p(x_i \mid NR, q)}$$

• So:

$$O(R \mid q, x) = O(R \mid q) \cdot \prod_{i=1}^{n} \frac{p(x_i \mid R, q)}{p(x_i \mid NR, q)}$$

The first factor is constant for each query; the product needs estimation.
Binary Independence Model

$$O(R \mid q, x) = O(R \mid q) \cdot \prod_{i=1}^{n} \frac{p(x_i \mid R, q)}{p(x_i \mid NR, q)}$$

• Since each $x_i$ is either 0 or 1:

$$O(R \mid q, x) = O(R \mid q) \cdot \prod_{x_i = 1} \frac{p(x_i = 1 \mid R, q)}{p(x_i = 1 \mid NR, q)} \cdot \prod_{x_i = 0} \frac{p(x_i = 0 \mid R, q)}{p(x_i = 0 \mid NR, q)}$$

• Let $p_i = p(x_i = 1 \mid R, q)$ and $r_i = p(x_i = 1 \mid NR, q)$.
• Assume $p_i = r_i$ for all terms not occurring in the query ($q_i = 0$).
Binary Independence Model

$$O(R \mid q, x) = O(R \mid q) \cdot \prod_{x_i = q_i = 1} \frac{p_i}{r_i} \cdot \prod_{x_i = 0,\; q_i = 1} \frac{1 - p_i}{1 - r_i}$$

(first product: all matching terms; second product: non-matching query terms). Multiplying and dividing by $\prod_{x_i = q_i = 1} \frac{1 - p_i}{1 - r_i}$:

$$O(R \mid q, x) = O(R \mid q) \cdot \prod_{x_i = q_i = 1} \frac{p_i (1 - r_i)}{r_i (1 - p_i)} \cdot \prod_{q_i = 1} \frac{1 - p_i}{1 - r_i}$$

(first product: all matching terms; second product: all query terms).
Binary Independence Model

$$O(R \mid q, x) = O(R \mid q) \cdot \prod_{x_i = q_i = 1} \frac{p_i (1 - r_i)}{r_i (1 - p_i)} \cdot \prod_{q_i = 1} \frac{1 - p_i}{1 - r_i}$$

The last product is constant for each query; the middle product is the only quantity to be estimated for ranking.

• Retrieval Status Value:

$$RSV = \log \prod_{x_i = q_i = 1} \frac{p_i (1 - r_i)}{r_i (1 - p_i)} = \sum_{x_i = q_i = 1} \log \frac{p_i (1 - r_i)}{r_i (1 - p_i)}$$
Binary Independence Model

• Everything boils down to computing the RSV:

$$RSV = \log \prod_{x_i = q_i = 1} \frac{p_i (1 - r_i)}{r_i (1 - p_i)} = \sum_{x_i = q_i = 1} \log \frac{p_i (1 - r_i)}{r_i (1 - p_i)}$$

$$RSV = \sum_{x_i = q_i = 1} c_i, \qquad c_i = \log \frac{p_i (1 - r_i)}{r_i (1 - p_i)}$$

So, how do we compute the $c_i$'s from our data?
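A direct translation of the RSV sum into Python, with toy values for the $p_i$ and $r_i$ (in practice these have to be estimated from data):

```python
import math

# RSV sketch for the Binary Independence Model; p_i, r_i values are invented.
def rsv(x, q, p, r):
    """RSV = sum over matching terms (x_i = q_i = 1) of
    log [ p_i (1 - r_i) / ( r_i (1 - p_i) ) ]."""
    score = 0.0
    for xi, qi, pi, ri in zip(x, q, p, r):
        if xi == 1 and qi == 1:
            score += math.log(pi * (1 - ri) / (ri * (1 - pi)))
    return score

x = [1, 0, 1, 1]           # document term vector
q = [1, 1, 1, 0]           # query term vector
p = [0.8, 0.6, 0.5, 0.3]   # p_i = p(x_i = 1 | R, q)
r = [0.3, 0.4, 0.2, 0.3]   # r_i = p(x_i = 1 | NR, q)
print(round(rsv(x, q, p, r), 4))
```

Only terms 1 and 3 contribute here: term 2 is absent from the document, and term 4 is absent from the query.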
Binary Independence Model

• Estimating the RSV coefficients: for each term i, look at the following table:

Documents | Relevant | Non-Relevant  | Total
x_i = 1   | s        | n − s         | n
x_i = 0   | S − s    | N − n − S + s | N − n
Total     | S        | N − S         | N

• Estimates:

$$p_i \approx \frac{s}{S}, \qquad r_i \approx \frac{n - s}{N - S}$$

$$c_i \approx K(N, n, S, s) = \log \frac{s\,(N - n - S + s)}{(S - s)(n - s)}$$
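These estimates are straightforward to compute; a small Python sketch with invented counts (no smoothing, so it assumes s < S, s < n and a non-empty fourth cell):

```python
import math

# Estimating p_i, r_i and c_i from the contingency table; counts are invented.
def term_weight(s, S, n, N):
    """s: relevant docs containing the term, S: relevant docs,
    n: docs containing the term, N: all docs in the collection."""
    p_i = s / S                # p(x_i = 1 | R)
    r_i = (n - s) / (N - S)    # p(x_i = 1 | NR)
    c_i = math.log((s * (N - n - S + s)) / ((S - s) * (n - s)))
    return p_i, r_i, c_i

p_i, r_i, c_i = term_weight(s=8, S=10, n=20, N=100)
print(round(p_i, 2), round(r_i, 4), round(c_i, 4))
```

With zero counts in any cell the log blows up, which is exactly the problem the 0.5 smoothing on the next slides addresses.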
PRP and BIR: The lessons

• Getting reasonable approximations of probabilities is possible.
• Simple methods work only with restrictive assumptions:
  – term independence
  – terms not in the query do not affect the outcome
  – Boolean representation of documents/queries
  – document relevance values are independent
• Some of these assumptions can be removed.
Probabilistic weighting scheme

$$\log \frac{s\,(N - n - S + s)}{(n - s)(S - s)}$$

The log of a ratio of probabilities may go to positive or negative infinity when a count is zero. Adding 0.5 to each term avoids this:

$$\log \frac{(s + 0.5)(N - n - S + s + 0.5)}{(n - s + 0.5)(S - s + 0.5)}$$
Probabilistic weighting scheme [S. Robertson]

In general form, the weighting function is

$$w = \frac{(k_3 + 1)\,q}{k_3 + q} \cdot \frac{(k_1 + 1)\,f}{k_1 L + f} \cdot \log \frac{(s + 0.5)(N - n - S + s + 0.5)}{(n - s + 0.5)(S - s + 0.5)}$$

where
$k_1$, $k_3$: constants,
q: within-query frequency (wqf),
f: within-document frequency (wdf),
n: number of documents in the collection indexed by this term,
N: total number of documents in the collection,
s: number of relevant documents indexed by this term,
S: total number of relevant documents,
L: normalised document length (i.e. the length of this document divided by the average length of documents in the collection).
Probabilistic weighting scheme [S. Robertson]

BM11

Stephen Robertson's BM11 uses the general form for the weights

$$w = \frac{(k_3 + 1)\,q}{k_3 + q} \cdot \frac{(k_1 + 1)\,f}{k_1 L + f} \cdot \log \frac{(s + 0.5)(N - n - S + s + 0.5)}{(n - s + 0.5)(S - s + 0.5)}$$

but adds an extra item to the sum of term weights to give the overall document score:

$$k_2 \cdot n_q \cdot \frac{1 - L}{1 + L}$$

where
$n_q$: number of terms in the query (the query length),
$k_2$: another constant.

This extra term is 0 when L = 1.
Probabilistic weighting scheme

BM15

$$w = \frac{(k_3 + 1)\,q}{k_3 + q} \cdot \frac{(k_1 + 1)\,f}{k_1 + f} \cdot \log \frac{(s + 0.5)(N - n - S + s + 0.5)}{(n - s + 0.5)(S - s + 0.5)}$$

plus the same extra query-length term $k_2 \cdot n_q \cdot \frac{1 - L}{1 + L}$.

BM15 is the same as BM11 with the term $k_1 L + f$ replaced by $k_1 + f$.
Probabilistic weighting scheme

BM25

BM25 combines BM11 and BM15 with a scaling factor, b, which turns BM15 into BM11 as it moves from 0 to 1: $k_1$ in the denominator is replaced by

$$K = k_1 (bL + (1 - b))$$

giving

$$w = \frac{(k_3 + 1)\,q}{k_3 + q} \cdot \frac{(k_1 + 1)\,f}{K + f} \cdot \log \frac{(s + 0.5)(N - n - S + s + 0.5)}{(n - s + 0.5)(S - s + 0.5)}$$

b = 1 → BM11; b = 0 → BM15.
Default values used: $k_1 = 1$, $k_2 = 0$, $k_3 = 1$, $b = 0.5$.
General form: $k_2 = 0$, $b = 1$.
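Putting the pieces together, a sketch of the BM25-style term weight in the notation of these slides (the extra $k_2$ query-length term is omitted since the default is $k_2 = 0$; all input values are invented):

```python
import math

# BM25-style term weight in the notation of the slides; inputs are invented.
def bm25_weight(q, f, n, N, s, S, L, k1=1.0, k3=1.0, b=0.5):
    """q: within-query freq, f: within-document freq, L: normalised length."""
    K = k1 * (b * L + (1 - b))        # b=1 -> BM11, b=0 -> BM15
    qtf_part = (k3 + 1) * q / (k3 + q)
    tf_part = (k1 + 1) * f / (K + f)
    idf_part = math.log(((s + 0.5) * (N - n - S + s + 0.5)) /
                        ((n - s + 0.5) * (S - s + 0.5)))
    return qtf_part * tf_part * idf_part

w = bm25_weight(q=1, f=3, n=20, N=100, s=8, S=10, L=1.2)
print(round(w, 4))
```

Note that for a document of average length (L = 1) the value of b makes no difference, which is consistent with the BM11 extra term vanishing at L = 1.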
Bayesian Networks
1. Independence assumption: $x_1, x_2, \ldots, x_n$ are independent. Can they be dependent?

2. Binary assumption: $x_i$ = 0 or 1. Can it be 0, 1, 2, ….., n?

A possible way forward could be Bayesian Networks.
Bayesian Network (BN) Basics

A Bayesian Network (BN) is a hybrid system of probability theory and graph theory. Graph theory lets the user build an interface to model highly interacting variables; probability theory ensures that the system as a whole is consistent, and provides ways to interface models to data.

BN: modeling of a joint probability distribution (JPD) in a compact way, using graph(s) to reflect conditional independence relationships.

BN: Representation
• Nodes ↔ random variables
• Arcs ↔ conditional independence (or causality)
• Undirected arcs → MRF; directed arcs → BNs

BN: Advantages
• An arc from node A to node B implies: A causes B
• Compact representation of the JPD of the nodes
• Easier to learn (fit to data)
G = global structure (DAG, Directed Acyclic Graph) that contains
• a node for each variable $x_i \in U$, i = 1, 2, ….., n
• edges representing the probabilistic dependencies between the nodes of the variables $x_i \in U$

M = set of local structures {M1, M2, …., Mn}: n mappings for n variables. $M_i$ maps each value of $\{x_i, Par(x_i)\}$ to a parameter $\theta$. Here $Par(x_i)$ denotes the set of parent nodes of $x_i$ in G.

The joint probability distribution over U can be decomposed by the global structure G as

$$P(U \mid B) = \prod_i P(x_i \mid Par(x_i), \theta, M_i, G)$$

[Figure: example DAG over nodes x1, x2, x3, x4]
BN: An Example (* by Kevin Murphy)

Nodes: Cloud (True/False) → Sprinkler (On/Off) and Rain (Yes/No) → Wet Grass (Yes/No).

p(C=T) | p(C=F)
1/2    | 1/2

C | p(S=On) | p(S=Off)
T | 0.1     | 0.9
F | 0.5     | 0.5

C | p(R=Y) | p(R=N)
T | 0.8    | 0.2
F | 0.2    | 0.8

S   | R | p(W=Y) | p(W=N)
On  | Y | 0.99   | 0.01
On  | N | 0.9    | 0.1
Off | Y | 0.9    | 0.1
Off | N | 0.0    | 1.0

By the chain rule: p(C,S,R,W) = p(C) p(S|C) p(R|C,S) p(W|C,S,R)
Using the conditional independences encoded in the graph: p(C,S,R,W) = p(C) p(S|C) p(R|C) p(W|S,R)
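The factorised joint can be evaluated directly from the tables; a short Python sketch:

```python
# Joint probability of the sprinkler network via its factorisation:
# p(C,S,R,W) = p(C) p(S|C) p(R|C) p(W|S,R)   (tables from the slide)
p_C = {True: 0.5, False: 0.5}
p_S_given_C = {True: {"On": 0.1, "Off": 0.9}, False: {"On": 0.5, "Off": 0.5}}
p_R_given_C = {True: {"Y": 0.8, "N": 0.2}, False: {"Y": 0.2, "N": 0.8}}
p_W_given_SR = {("On", "Y"): {"Y": 0.99, "N": 0.01},
                ("On", "N"): {"Y": 0.9, "N": 0.1},
                ("Off", "Y"): {"Y": 0.9, "N": 0.1},
                ("Off", "N"): {"Y": 0.0, "N": 1.0}}

def joint(c, s, r, w):
    return p_C[c] * p_S_given_C[c][s] * p_R_given_C[c][r] * p_W_given_SR[(s, r)][w]

# p(C=T, S=Off, R=Y, W=Y) = 0.5 * 0.9 * 0.8 * 0.9
print(round(joint(True, "Off", "Y", "Y"), 6))  # 0.324
```

Four small tables (1 + 2 + 2 + 4 = 9 free parameters) replace the 15 free parameters of the full joint over four binary variables, which is the compactness the slides refer to.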
BN: As Model
• Hybrid system of probability theory and graph theory
• Graph theory lets the user build an interface to model the highly interacting variables
• Probability theory provides ways to interface the model to data

BN: Notations
$B = (B_s, \theta)$
$B_s$: structure of the network
$\theta$: set of parameters that encode the local probability distributions

$B_s$ again has two parts, G and M:
G is the global structure (DAG);
M is the mapping for the n variables (arcs), $\{x_i, Par(x_i)\}$.
BN: Learning

$B_s$ — Structure: DAG; Parameter: CPD

Structure: known or unknown
Parameter: known or unknown

Parameters can only be learnt if the structure is either known or has been learnt earlier.
BN: Learning

Structure is known and full data is observed (nothing missing):
parameter learning → MLE, MAP

Structure is known and full data is NOT observed (data missing):
parameter learning → EM
BN: Learning

Learning Structure
• To find the best DAG that fits the data

Objective function:

$$p(B_s \mid D) = \frac{p(D \mid B_s)\,p(B_s)}{p(D)}$$

$$\log(p(B_s \mid D)) = \log(p(D \mid B_s)) + \log(p(B_s)) - \log(p(D))$$

The last term is constant and independent of $B_s$.

Search algorithm: NP-hard.
Performance criteria used: AIC, BIC.
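Since log p(D) is constant across structures, candidate DAGs can be compared by log-likelihood plus log-prior alone. A toy sketch (the likelihoods and priors below are invented, not computed from data):

```python
import math

# Comparing two candidate structures by log-posterior (up to the constant
# log p(D)): log p(Bs|D) = log p(D|Bs) + log p(Bs) - log p(D).
candidates = {
    "G1": {"log_likelihood": -120.0, "prior": 0.6},  # hypothetical scores
    "G2": {"log_likelihood": -118.0, "prior": 0.4},
}

def score(c):
    """log p(D|Bs) + log p(Bs); the shared -log p(D) term is dropped."""
    return c["log_likelihood"] + math.log(c["prior"])

best = max(candidates, key=lambda name: score(candidates[name]))
print(best)  # G2
```

Here the better data fit of G2 outweighs its lower prior. In practice a penalised criterion such as BIC plays the role of this score, since exhaustive search over DAGs is intractable.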
References (basics)

1. S. E. Robertson, "The Probability Ranking Principle in IR", Journal of Documentation, 33, 294-304, 1977.
2. K. Spärck Jones, S. Walker and S. E. Robertson, "A probabilistic model of information retrieval: development and comparative experiments - Part 1", Information Processing and Management, 36, 779-808, 2000.
3. K. Spärck Jones, S. Walker and S. E. Robertson, "A probabilistic model of information retrieval: development and comparative experiments - Part 2", Information Processing and Management, 36, 809-840, 2000.
4. S. E. Robertson and H. Zaragoza, "The Probabilistic Relevance Framework: BM25 and Beyond", Foundations and Trends in Information Retrieval, 3, 333-389, 2009.
5. S. E. Robertson, C. J. Van Rijsbergen and M. F. Porter, "Probabilistic models of indexing and searching", Information Retrieval Research, Oddy et al. (Eds.), 35-56, 1981.
6. N. Fuhr and C. Buckley, "A probabilistic learning approach for document indexing", ACM Transactions on Information Systems, 9, 223-248, 1991.
7. H. R. Turtle and W. B. Croft, "Evaluation of an inference network-based retrieval model", ACM Transactions on Information Systems, 9, 187-222, 1991.
Thank You
Discussions