
1

25 August, 2005 – NLP-AI Seminar

Language Modeling for Information Retrieval

Ananthakrishnan R (anand@cse)

2

Outline of the Seminar

• What is Language Modeling?

• Language Modeling in NLP
  – Speech Recognition, Statistical MT, Classification, IR …

• The Language Modeling approach to IR
  – the conceptual model
  – smoothing
  – semantic smoothing

• IR as Statistical Machine Translation

3

Language Modeling

• A language model is a probabilistic mechanism for generating text

• Language models estimate the probability distribution of various natural language phenomena
  – sentences, utterances, queries …

4

Language Modeling Techniques

• N-grams
• Class-based N-grams
• Probabilistic CFGs
• Decision Trees

Claude Shannon was probably the first language modeler: “How well can we predict or compress natural language text through simple n-gram models?”

5

Language Modeling for Other NLP Tasks

6

Applications

• Speech Recognition
• Statistical Machine Translation
• Classification
• Spell Checking

In each of these applications, the language model plays the role of either the prior or the likelihood.

7

Language Models in Speech Recognition

Noisy channel: source word w → noisy channel → noisy word O → decoder → ŵ

$$\hat{w} \;=\; \underset{w \in \text{Vocabulary}}{\arg\max}\; p(w \mid O) \;=\; \underset{w \in \text{Vocabulary}}{\arg\max}\; p(O \mid w)\, p(w)$$

where $p(w)$ is the source model (the language model) and $p(O \mid w)$ is the channel model.

8

Language Models in Statistical Machine Translation

Noisy channel: original sentence e → noisy channel → noisy sentence f → decoder → ê

$$\hat{e} \;=\; \underset{e \in E}{\arg\max}\; p(e \mid f) \;=\; \underset{e \in E}{\arg\max}\; p(f \mid e)\, p(e)$$

where $p(e)$ is the source model (the language model) and $p(f \mid e)$ is the translation model.
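To make the decoding rule concrete, here is a minimal sketch of noisy-channel decoding over a tiny hand-written candidate set. The sentences and log-probability tables are invented for illustration; they stand in for a real translation model $p(f \mid e)$ and source language model $p(e)$.

```python
# Toy noisy-channel decoder: pick the candidate source sentence e that
# maximises log p(f|e) + log p(e), i.e. translation-model score plus
# language-model (source-model) score. All numbers are invented.
translation_logprob = {            # log p(f | e) for one fixed observed f
    "the house is small": -2.1,
    "the house is little": -1.8,
    "house small the is": -1.5,
}
language_logprob = {               # log p(e) from the source language model
    "the house is small": -4.0,
    "the house is little": -5.2,
    "house small the is": -9.7,
}

def decode(candidates):
    # argmax_e [ log p(f|e) + log p(e) ]
    return max(candidates, key=lambda e: translation_logprob[e] + language_logprob[e])

print(decode(list(translation_logprob)))   # -> "the house is small"
```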

9

Language Models in Text Classification

$$\hat{c} \;=\; \underset{c \in C}{\arg\max}\; p(c \mid d) \;=\; \underset{c \in C}{\arg\max}\; p(d \mid c)\, p(c)$$

A Bayesian classifier: estimate a language model $p(d \mid c)$ for each class, together with the class prior $p(c)$.
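To illustrate the “one language model per class” idea, here is a minimal sketch of a Bayesian text classifier with a unigram model per class. The two-class training set, the add-one smoothing, and all names are assumptions made for the example, not part of the slides.

```python
import math
from collections import Counter

# Tiny invented training set: a list of documents per class.
train = {
    "cricket": ["australia won the ashes test", "england bowling was poor"],
    "politics": ["the election results were announced", "parliament passed the bill"],
}

class_counts = {c: Counter(w for doc in docs for w in doc.split()) for c, docs in train.items()}
vocab = {w for counts in class_counts.values() for w in counts}
prior = {c: len(docs) / sum(len(d) for d in train.values()) for c, docs in train.items()}

def log_posterior(text, c):
    # log p(c) + sum_i log p(w_i | c), with add-one smoothing so that
    # unseen words do not zero out the score.
    counts, total = class_counts[c], sum(class_counts[c].values())
    score = math.log(prior[c])
    for w in text.split():
        score += math.log((counts[w] + 1) / (total + len(vocab)))
    return score

def classify(text):
    # argmax_c p(d | c) p(c)
    return max(train, key=lambda c: log_posterior(text, c))

print(classify("australia ashes defeat analysis"))   # -> "cricket"
```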

10

The Language Modeling Approach to Information Retrieval

11

The Conceptual Model of IR

• The user has an information need $\mathcal{I}$

• From this need he generates an ideal document fragment $d_\mathcal{I}$

• He selects a set of key terms from $d_\mathcal{I}$, and generates a query q from this set

12

Example:

Information need $\mathcal{I}$: “Analysis of Australia’s defeat to England in the Ashes. Reason for defeat: overconfidence?”

Ideal document $d_\mathcal{I}$ (headline “Eng bt. Aus: Analysis”): “Australia’s defeat to England in the Ashes is largely due to their overconfidence. They don’t seem to have figured out that they are facing a far superior team …”

A candidate document d (headline “The Ashes Review”): “Brash batting by Australia has lost them a test match. Their opposition is not Zimbabwe, and not many of the Aussies seem to remember that! …”

Query q: australia ashes defeat analysis

13

Document generation model: information need $\mathcal{I}$ → ideal document fragment $d_\mathcal{I}$

Document-query translation model: $d_\mathcal{I}$ → query q

Retrieval engine: given q, return the retrieved documents {d}, ranked by $p(d \mid q, \mathcal{I})$

14

The Language Modeling Approach

Using Bayes’ law,

$$p(d \mid q, \mathcal{I}) \;=\; \frac{p(q \mid d, \mathcal{I})\, p(d \mid \mathcal{I})}{p(q \mid \mathcal{I})}$$

For the purpose of ranking,

$$p_q(d) \;\propto\; p(q \mid d, \mathcal{I})\, p(d \mid \mathcal{I}) \;\approx\; p(q \mid d)\, p(d) \quad \text{(user independent)}$$

15

$$p_q(d) \;\propto\; p(q \mid d)\, p(d)$$

The query-likelihood $p(q \mid d)$ comes from the document language model; the document prior $p(d)$ is query-independent (e.g., PageRank).

Language Modeling for IR: using document language models to assign likelihood scores to queries (Ponte and Croft, 1998).
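A small sketch of this ranking rule, assuming the log query-likelihoods $\log p(q \mid d)$ and the document priors $p(d)$ have already been computed; the numbers are invented, and a PageRank-style score could play the role of $p(d)$.

```python
import math

# Rank documents by log p(q|d) + log p(d): query-likelihood from the
# document language model plus a query-independent prior. Both score
# tables are invented for illustration.
query_loglik = {"doc1": -7.4, "doc2": -6.9, "doc3": -8.1}   # log p(q | d)
doc_prior    = {"doc1": 0.50, "doc2": 0.20, "doc3": 0.30}   # p(d)

ranked = sorted(query_loglik,
                key=lambda d: query_loglik[d] + math.log(doc_prior[d]),
                reverse=True)
print(ranked)   # doc1 ranks first: its higher prior outweighs doc2's better query-likelihood
```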

16

The running example again: information need $\mathcal{I}$ (analysis of Australia’s defeat to England in the Ashes), ideal document $d_\mathcal{I}$ (“Eng bt. Aus: Analysis …”), candidate document d (“The Ashes Review …”), and query q: australia ashes defeat analysis.

How do we estimate $p(q \mid d, \mathcal{I})$?

17

The Language Model

• For each document d in the collection C, for each term w, estimate $p(w \mid d)$

• To find the query likelihood, assume query terms to be independent:

$$p(q \mid d) \;=\; \prod_{i=1}^{m} p(q_i \mid d)$$
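A minimal sketch of the unsmoothed query-likelihood computation: $p(w \mid d)$ is the maximum-likelihood estimate count(w, d) / |d|, and $p(q \mid d)$ is the product over query terms. The abbreviated document text is taken loosely from the running example.

```python
from collections import Counter

# Maximum-likelihood unigram model for one document, then the
# query-likelihood as a product over query terms (term independence).
doc = ("brash batting by australia has lost them a test match their "
       "opposition is not zimbabwe").split()
counts, total = Counter(doc), len(doc)

def p_w_given_d(w):
    # p(w | d) = count(w, d) / |d|
    return counts[w] / total

def query_likelihood(query):
    p = 1.0
    for w in query.split():
        p *= p_w_given_d(w)
    return p

print(query_likelihood("australia test"))                     # small but non-zero
print(query_likelihood("australia ashes defeat analysis"))    # 0.0: motivates smoothing
```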

18

The Need for Smoothing

• To eliminate “zero” probabilities
  – Correct the error in the MLE due to the problem of data sparsity

Example: for the query q = australia ashes defeat analysis and the candidate document d (“The Ashes Review …”), the term “defeat” never occurs in d, so $p(\text{defeat} \mid d) = 0$ and hence $p(q \mid d) = 0$.

19

Smoothing

• Jelinek-Mercer Method: Linear interpolation of the maximum likelihood model with the collection model

• (Zhai & Lafferty, 2003) compares Jelinek-Mercer, Dirichlet and the absolute discounting methods for Language Models applied to IR.

$$p_\lambda(w \mid d) \;=\; (1 - \lambda)\, p(w \mid d) \;+\; \lambda\, p(w \mid C)$$
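A sketch of Jelinek-Mercer smoothing as in the formula above; the two-document “collection” and the value of $\lambda$ are arbitrary choices made for illustration.

```python
from collections import Counter

# Jelinek-Mercer smoothing: interpolate the document model with the
# collection model, p_lambda(w|d) = (1 - lambda) p(w|d) + lambda p(w|C).
docs = {
    "d1": "brash batting by australia has lost them a test match".split(),
    "d2": "analysis of australia defeat to england in the ashes".split(),
}
coll_counts = Counter(w for d in docs.values() for w in d)
coll_total = sum(coll_counts.values())
LAMBDA = 0.5

def p_smoothed(w, d):
    counts, total = Counter(docs[d]), len(docs[d])
    return (1 - LAMBDA) * counts[w] / total + LAMBDA * coll_counts[w] / coll_total

def query_likelihood(query, d):
    p = 1.0
    for w in query.split():
        p *= p_smoothed(w, d)
    return p

for d in docs:
    # Non-zero for both documents, even though d1 lacks several query terms.
    print(d, query_likelihood("australia ashes defeat analysis", d))
```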

20

The Need for Semantic Smoothing

Same example, but with the query q: australia ashes defeat overconfidence. The query term “overconfidence” appears in the information need $\mathcal{I}$ and in the ideal document $d_\mathcal{I}$ (“Eng bt. Aus: Analysis …”), but not in the candidate document d (“The Ashes Review …”).

21

Document-Query Translation Model

Ideal document $d_\mathcal{I}$ (“Eng bt. Aus: Analysis …”) → document-query translation model → query q (australia ashes defeat overconfidence)

How do we estimate $p(q \mid d)$? The model must be sophisticated enough to handle polysemy and synonymy.

22

Semantic Smoothing

$$p(q \mid d) \;=\; \prod_{i=1}^{m} \sum_{w} t(q_i \mid w)\, p(w \mid d)$$

Estimate translation probabilities $t(w_q \mid w_d)$ for mapping a document term $w_d$ to a query term $w_q$. For example, $p(\text{analysis} \mid d)$ can receive a contribution from $p(\text{analysis} \mid \text{review})\, p(\text{review} \mid d)$.

(Berger and Lafferty, 1999) (Lafferty and Zhai, 2001)
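A sketch of the translation-model formula, assuming a small hand-written translation table $t(q_i \mid w)$; in (Berger and Lafferty, 1999) these probabilities are learned from data, not specified by hand.

```python
from collections import Counter

# Translation-based semantic smoothing: the probability of a query term
# q_i is a sum over document terms w of t(q_i | w) * p(w | d), so a
# document containing "review" can still contribute mass to "analysis".
doc = "the ashes review brash batting by australia has lost them a test match".split()
counts, total = Counter(doc), len(doc)

translation = {                    # t(query_term | doc_term), invented values
    ("analysis", "review"): 0.3,
    ("analysis", "analysis"): 0.7,
    ("defeat", "lost"): 0.4,
    ("defeat", "defeat"): 0.6,
}

def p_translated(q_term):
    # sum_w t(q_term | w) p(w | d); identity translation for unlisted pairs
    return sum(translation.get((q_term, w), 1.0 if q_term == w else 0.0) * counts[w] / total
               for w in counts)

def query_likelihood(query):
    p = 1.0
    for q_term in query.split():
        p *= p_translated(q_term)
    return p

print(query_likelihood("australia ashes defeat analysis"))   # non-zero despite missing terms
```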

23

Compare with the Statistical MT Model

Noisy channel: original sentence e → noisy channel → noisy sentence f → decoder → ê

$$\hat{e} \;=\; \underset{e \in E}{\arg\max}\; p(e \mid f) \;=\; \underset{e \in E}{\arg\max}\; p(f \mid e)\, p(e)$$

Here the source model $p(e)$ corresponds to the language model, and the translation model $p(f \mid e)$ corresponds to semantic smoothing.

24

Language Modeling vis-à-vis the ‘traditional probabilistic model’

The traditional probabilistic approach asks: how do we estimate $p(r \mid q, d)$?

The Probability Ranking Principle (Robertson, 1977); the Robertson-Sparck Jones model (Sparck Jones et al., 2000) ranks documents by the log-odds

$$\log \frac{p(d \mid q, r)}{p(d \mid q, \bar{r})}$$

Language Modeling approach:
– Where’s the relevance?
– Is the ranking optimal?

(Lafferty & Zhai, 2002) (Lavrenko and Croft, 2001)

25

Discussion: Advantages of the Language Modeling Approach

• Conditioning on d provides a larger ‘foothold’ for estimation

• p(d): an explicit notion of the importance of a document

• Document normalization is not an issue

“When designing a statistical model … the most natural route is to apply a generative model which builds up the output step-by-step. The source-channel perspective suggests a different approach: turn the search problem around to predict the input. In speech recognition, NLP, and MT, researchers have time and again found that predicting what is known from competing hypotheses can be easier than directly predicting all of the hypotheses.”

26

Discussion: Disadvantage of the Language Modeling Approach

• The Robertson-Sparck Jones model can directly use relevance judgements to improve the estimation of $p(A_i \mid q, r)$

This is not possible in the Language Modeling approach.

27

References

• (Robertson, 1977) The probability ranking principle in IR.

• (Sparck Jones et al., 2000) A probabilistic model of information retrieval: development and comparative experiments.

• (Ponte and Croft, 1998) A language modeling approach to information retrieval.

• (Zhai and Lafferty, 2001) A study of smoothing methods for language models applied to ad hoc information retrieval.

28

References …

• (Berger and Lafferty, 1999) Information retrieval as statistical translation.

• (Lafferty and Zhai, 2001) Risk minimization and language modeling in information retrieval.

• (Lavrenko and Croft, 2001) Relevance-based language models.

• (Lafferty and Zhai, 2002) Probabilistic IR models based on document and query generation.