Page 1

Language Models

LBSC 796/CMSC 828o

Session 4, February 16, 2004

Douglas W. Oard

Page 2

Agenda

• Questions
• The meaning of “maybe”
• Probabilistic retrieval
• Comparison with vector space model

Page 3

Muddiest Points

• Why distinguish utility and relevance?
• How the coordination measure is a ranked form of Boolean retrieval
• Why use term weights?
• The meaning of DF
• The problem with log(1) = 0
• How the vectors are built
• How to do cosine normalization (5)
• Why to do cosine normalization (2)
• Okapi graphs

Page 4

Supporting the Search Process

[Diagram: supporting the search process. On the user side, Source Selection leads to Query Formulation, producing a Query; Search returns a Ranked List; Selection yields a Document for Examination; Document Delivery completes the loop. Inside the IR System, Acquisition builds the Collection and Indexing builds the Index that Search runs against.]

Page 5

Looking Ahead

• We ask “is this document relevant?”
– Vector space: we answer “somewhat”
– Probabilistic: we answer “probably”

• The key is to know what “probably” means
– First, we’ll formalize that notion
– Then we’ll apply it to retrieval

Page 6

Probability

• What is probability?
– Statistical: relative frequency as n → ∞
– Subjective: degree of belief

• Thinking statistically
– Imagine a finite amount of “stuff”
– Associate the number 1 with the total amount
– Distribute that “mass” over the possible events

Page 7

Statistical Independence

• A and B are independent if and only if:
P(A and B) = P(A) × P(B)

• Independence formalizes “unrelated”
– P(“being brown eyed”) = 85/100
– P(“being a doctor”) = 1/1000
– P(“being a brown eyed doctor”) = 85/100,000
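To see the arithmetic, a minimal check in Python (the variable names are just labels for the slide’s numbers):

    # Independence: P(A and B) = P(A) * P(B)
    p_brown_eyed = 85 / 100   # P("being brown eyed")
    p_doctor = 1 / 1000       # P("being a doctor")

    # Under independence, the joint probability is just the product.
    print(p_brown_eyed * p_doctor)   # 0.00085 = 85/100,000, as on the slide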

Page 8

Dependent Events

• Suppose:
– P(“having a B.S. degree”) = 2/10
– P(“being a doctor”) = 1/1000

• Would you expect
P(“having a B.S. degree and being a doctor”) = 2/10,000 ???

• Extreme example:
– P(“being a doctor”) = 1/1000
– P(“having studied anatomy”) = 12/1000

Page 9

Conditional Probability

• P(A | B) = P(A and B) / P(B)

[Venn diagram: overlapping circles A and B; the intersection is labeled “A and B”]

• P(A) = prob of A relative to the whole space

• P(A|B) = prob of A considering only the cases where B is known to be true

Page 10

More on Conditional Probability

• Suppose
– P(“having studied anatomy”) = 12/1000
– P(“being a doctor and having studied anatomy”) = 1/1000

• Consider
– P(“being a doctor” | “having studied anatomy”) = 1/12

• But if you assume all doctors have studied anatomy
– P(“having studied anatomy” | “being a doctor”) = 1

Useful restatement of definition: P(A and B) = P(A|B) x P(B)
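A minimal numeric check of both forms, using the doctor/anatomy numbers above:

    # P(A | B) = P(A and B) / P(B)
    p_anatomy = 12 / 1000   # P("having studied anatomy")
    p_both = 1 / 1000       # P("being a doctor and having studied anatomy")

    p_doctor_given_anatomy = p_both / p_anatomy
    print(p_doctor_given_anatomy)   # 0.0833... = 1/12, as on the slide

    # Restatement: P(A and B) = P(A | B) * P(B)
    assert abs(p_doctor_given_anatomy * p_anatomy - p_both) < 1e-12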

Page 11

Some Notation

• Consider
– A set of hypotheses: H1, H2, H3
– Some observable evidence O

• P(O|H1) = probability of O being observed if we knew H1 were true

• P(O|H2) = probability of O being observed if we knew H2 were true

• P(O|H3) = probability of O being observed if we knew H3 were true

Page 12

An Example

• Let
– O = “Joe earns more than $80,000/year”

– H1 = “Joe is a doctor”

– H2 = “Joe is a college professor”

– H3 = “Joe works in food services”

• Suppose we do a survey and we find out
– P(O|H1) = 0.6

– P(O|H2) = 0.07

– P(O|H3) = 0.001

• What should be our guess about Joe’s profession?

Page 13

Bayes’ Rule

• What’s P(H1|O)? P(H2|O)? P(H3|O)?

• Theorem:

P(H | O) = P(O | H) × P(H) / P(O)

– P(H | O) is the posterior probability
– P(H) is the prior probability

• Notice that the prior is very important!

Page 14

Back to the Example

• Suppose we also have good data about priors:
– P(O|H1) = 0.6     P(H1) = 0.0001   (doctor)
– P(O|H2) = 0.07    P(H2) = 0.001    (prof)
– P(O|H3) = 0.001   P(H3) = 0.2      (food)

• We can calculate
– P(H1|O) = 0.00006 / P(“earning > $80K/year”)
– P(H2|O) = 0.00007 / P(“earning > $80K/year”)
– P(H3|O) = 0.0002 / P(“earning > $80K/year”)
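A small sketch of the full calculation; dividing by the sum of the unnormalized values stands in for P(O), assuming the three hypotheses cover all cases:

    # Bayes' rule: P(H|O) is proportional to P(O|H) * P(H)
    likelihoods = {"doctor": 0.6, "prof": 0.07, "food": 0.001}    # P(O|H)
    priors      = {"doctor": 0.0001, "prof": 0.001, "food": 0.2}  # P(H)

    unnormalized = {h: likelihoods[h] * priors[h] for h in priors}
    p_o = sum(unnormalized.values())   # P(O), if the hypotheses are exhaustive

    posteriors = {h: v / p_o for h, v in unnormalized.items()}
    print(posteriors)   # "food" wins: the large prior outweighs the small likelihood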

Page 15

Key Ideas

• Defining probability using frequency

• Statistical independence

• Conditional probability

• Bayes’ rule

Page 16

Agenda

• Questions
• Defining probability
• Using probability for retrieval
– Language modeling
– Inference networks
• Comparison with vector space model

Page 17

Probability Ranking Principle

• Assume binary relevance/document independence
– Each document is either relevant or it is not
– Relevance of one doc reveals nothing about another

• Assume the searcher works down a ranked list
– Seeking some number of relevant documents

• Theorem (provable from assumptions):
– Documents should be ranked in order of decreasing probability of relevance to the query, P(d relevant-to q)

Page 18

Probabilistic Retrieval Strategy

• Estimate how terms contribute to relevance
– How do TF, DF, and length influence your judgments about document relevance? (e.g., Okapi)

• Combine to find document relevance probability

• Order documents by decreasing probability

Page 19

Where do the probabilities fit?

[Diagram: system architecture. On one side, an Information Need passes through Query Formulation to a Query, then through a Representation Function (query processing) to a Query Representation. On the other, a Document passes through a Representation Function (document processing) to a Document Representation. A Comparison Function matches the two representations to produce a Retrieval Status Value, which is where P(d is Rel | q) fits; Human Judgment of the retrieved documents yields Utility.]

Page 20

Binary Independence Model

• Binary refers again to binary relevance

• Assume “term independence”
– Presence of one term tells nothing about another

• Assume “uniform priors”
– P(d) is the same for all d

Page 21

“Okapi” Term Weights

w_ij = [ TF_ij / (1.5 · (L_j / L̄) + TF_ij) ] · log[ (N − DF_i + 0.5) / (DF_i + 0.5) ]

where TF_ij is the frequency of term i in document j, L_j is the length of document j, L̄ is the average document length, DF_i is the number of documents containing term i, and N is the total number of documents. The left factor is the TF component; the right factor is the IDF component.

[Graphs: (left) the TF component plotted against raw TF from 0 to 25, rising steeply and saturating toward 1.0, with curves for length ratios L_j / L̄ of 0.5, 1.0, and 2.0; (right) the IDF component plotted against raw DF from 0 to 25, comparing classic IDF with the Okapi variant.]
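A minimal sketch of this weight in Python; the function and variable names are mine, but the 1.5 and 0.5 constants come from the formula above:

    import math

    def okapi_weight(tf, df, doc_len, avg_len, n_docs):
        # TF component: saturates as raw TF grows, damped for long documents
        tf_part = tf / (1.5 * (doc_len / avg_len) + tf)
        # IDF component: smoothed log odds of the term's document frequency
        idf_part = math.log((n_docs - df + 0.5) / (df + 0.5))
        return tf_part * idf_part

    # A term occurring 5 times in an average-length document,
    # found in 100 of 10,000 documents:
    print(okapi_weight(tf=5, df=100, doc_len=300, avg_len=300, n_docs=10000))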

Page 22

Stochastic Language Models

• Models probability of generating any string

Model M:
  0.2    the
  0.1    a
  0.01   man
  0.01   woman
  0.03   said
  0.02   likes

Sample string s = “the man likes the woman”
  P(s | M) = 0.2 × 0.01 × 0.02 × 0.2 × 0.01   (multiply the per-word probabilities)

Page 23

Language Models, cont’d

• Models probability of generating any string

Model M1:            Model M2:
  0.2    the           0.2     the
  0.1    a             0.1     yon
  0.01   man           0.001   class
  0.01   woman         0.01    maiden
  0.03   said          0.03    sayst
  0.02   likes         0.02    pleaseth

Sample string s = “the class pleaseth yon maiden”

             the    class    pleaseth   yon      maiden
  P(w|M1):   0.2    0.01     0.0001     0.0001   0.0005
  P(w|M2):   0.2    0.0001   0.02       0.1      0.01

P(s|M2) > P(s|M1)
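A minimal sketch of this comparison, with the slide’s illustrative numbers as unigram models:

    # P(s|M) is the product of per-word probabilities under model M.
    m1 = {"the": 0.2, "class": 0.01, "pleaseth": 0.0001,
          "yon": 0.0001, "maiden": 0.0005}
    m2 = {"the": 0.2, "class": 0.0001, "pleaseth": 0.02,
          "yon": 0.1, "maiden": 0.01}

    def string_prob(s, model):
        prob = 1.0
        for word in s.split():
            prob *= model[word]   # unseen words would need smoothing
        return prob

    s = "the class pleaseth yon maiden"
    print(string_prob(s, m1), string_prob(s, m2))   # P(s|M2) > P(s|M1)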

Page 24

Retrieval with Language Models

• Treat each document as the basis for a model

• Rank document d based on P(d | q)
– P(d | q) = P(q | d) × P(d) / P(q)

• P(q) is the same for all documents, so it can’t change ranks

• P(d) [the prior] is often treated as the same for all d
– But we could use criteria like authority, length, genre

• P(q | d) is the probability of q given d’s model
– With constant P(q) and uniform P(d), ranking by P(d | q) is the same as ranking by P(q | d)

Page 25

Computing P(q | d)

• Build a smoothed language model for d
– Count the frequency of each term in d
– Count the frequency of each term in the collection
– Combine the two in some way
– Redistribute probabilities to unobserved events

• Example: add 1 to every count

• Combine the probability for the full query
– Summing over the terms in q is a soft “OR”
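A minimal sketch of one such estimator, using the slide’s add-one example for the smoothing step and the standard per-term product for combination (other combinations, such as a weighted soft “OR”, are possible):

    from collections import Counter

    def query_likelihood(query, doc_terms, vocab_size):
        # Unigram model of the document with add-one smoothing:
        # every term, seen or not, gets a nonzero probability.
        counts = Counter(doc_terms)
        doc_len = len(doc_terms)
        prob = 1.0
        for term in query.split():
            prob *= (counts[term] + 1) / (doc_len + vocab_size)
        return prob

    doc = "the man likes the woman".split()
    print(query_likelihood("man likes", doc, vocab_size=1000))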

Page 26

Key Ideas

• Probabilistic methods formalize assumptions
– Binary relevance
– Document independence
– Term independence
– Uniform priors
– Top-down scan

• Natural framework for combining evidence
– e.g., non-uniform priors

Page 27

Inference Networks

• A flexible way of combining term weights
– Boolean model
– Binary independence model
– Probabilistic models with weaker assumptions

• Key concept: rank based on P(d | q)
– P(d | q) = P(q | d) × P(d) / P(q)

• Efficient large-scale implementation
– InQuery text retrieval system from U Mass

Page 28

A Boolean Inference Net

[Diagram: a Boolean inference network. Document nodes d1–d4 feed term nodes (bat, cat, fat, hat, mat, pat, rat, sat, vat); AND and OR operator nodes combine the terms, and a final AND node represents the information need I.]

Page 29

A Binary Independence Network

[Diagram: a binary independence network. The same document nodes d1–d4 and term nodes (bat, cat, fat, hat, mat, pat, rat, sat, vat), with the terms linked directly to a single query node by weighted edges.]

Page 30

Probability Computation

• Turn on exactly one document at a time
– Boolean: every connected term turns on
– Binary Ind: connected terms gain their weight

• Compute the query value
– Boolean: AND and OR nodes use truth tables
– Binary Ind: fraction of the possible weight,

  ( Σ_{t in d} w_{t,d} ) / ( Σ_t w_{t,d} ),  summing over the query’s terms
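A minimal sketch of that scoring rule; the term weights are illustrative:

    def fraction_of_possible_weight(query_weights, doc_terms):
        # Weight actually gained by the terms present in the document...
        matched = sum(w for t, w in query_weights.items() if t in doc_terms)
        # ...divided by the total weight the query could confer.
        possible = sum(query_weights.values())
        return matched / possible

    query_weights = {"cat": 1.2, "hat": 0.8, "rat": 0.5}
    print(fraction_of_possible_weight(query_weights, {"cat", "mat", "rat"}))
    # (1.2 + 0.5) / (1.2 + 0.8 + 0.5) = 0.68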

Page 31

A Critique

• Most of the assumptions are not satisfied!
– Searchers want utility, not relevance
– Relevance is not binary
– Terms are clearly not independent
– Documents are often not independent

• The best known term weights are quite ad hoc
– Unless some relevant documents are known

Page 32

But It Works!

• Ranked retrieval paradigm is powerful
– Well suited to human search strategies

• Probability theory has explanatory power
– At least we know where the weak spots are
– Probabilities are good for combining evidence

• Good implementations exist (InQuery, Lemur)
– Effective, efficient, and large-scale

Page 33

Comparison With Vector Space

• Similar in some ways
– Term weights can be based on frequency
– Terms often used as if they were independent

• Different in others
– Based on probability rather than similarity
– Intuitions are probabilistic rather than geometric

Page 34

One Minute Paper

• Which assumption underlying the probabilistic retrieval model causes you the most concern, and why?