Relevance Feedback Prof. Marti Hearst SIMS 202, Lecture 24.


Page 1: Relevance Feedback Prof. Marti Hearst SIMS 202, Lecture 24.

Relevance Feedback

Prof. Marti Hearst
SIMS 202, Lecture 24

Page 2: Relevance Feedback Prof. Marti Hearst SIMS 202, Lecture 24.

Marti A. Hearst, SIMS 202, Fall 1997

Today

- Review Inverted Indexes
- Relevance Feedback
  - aka query modification
  - aka “more like this”
  - Begin considering the role of the user

Page 3: Relevance Feedback Prof. Marti Hearst SIMS 202, Lecture 24.

[Diagram: retrieval system overview. Components: information need, text input, parse, pre-process, index, collections, query, rank. Highlighted question: How is the index constructed?]

Page 4: Relevance Feedback Prof. Marti Hearst SIMS 202, Lecture 24.

Inverted Files

- Primary data structure for text indexes
- Invert documents into a big index
- Basic idea:
  - list all the tokens in the collection
  - for each token, list all the docs it occurs in
  - do a few things to reduce redundancy in the data structure

Read “Inverted Files” by Harman et al., Chapter 3, sections 3.1 through 3.3 (the rest of the chapter is optional).

Page 5: Relevance Feedback Prof. Marti Hearst SIMS 202, Lecture 24.

Inverted Files

We have seen “Vector files” conceptually. An Inverted File is a vector file “inverted” so that rows become columns and columns become rows.

docs  t1  t2  t3
D1    1   0   1
D2    1   0   0
D3    0   1   1
D4    1   0   0
D5    1   1   1
D6    1   1   0
D7    0   1   0
D8    0   1   0
D9    0   0   1
D10   0   1   1

Terms  D1  D2  D3  D4  D5  D6  D7  …
t1     1   1   0   1   1   1   0
t2     0   0   1   0   1   1   1
t3     1   0   1   0   1   0   0
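The "rows become columns" idea can be sketched in a few lines of Python. This is only an illustration using the ten-document table from this page; the variable names (`vector_file`, `inverted`, `docs_for_term`) are made up for the sketch and are not from the lecture.

```python
# Document-by-term "vector file": one row per document, one column per term.
terms = ["t1", "t2", "t3"]
vector_file = {
    "D1": [1, 0, 1], "D2": [1, 0, 0], "D3": [0, 1, 1], "D4": [1, 0, 0],
    "D5": [1, 1, 1], "D6": [1, 1, 0], "D7": [0, 1, 0], "D8": [0, 1, 0],
    "D9": [0, 0, 1], "D10": [0, 1, 1],
}
doc_ids = list(vector_file)

# zip(*rows) transposes the matrix: each term now owns one row of 0/1 flags.
inverted = {term: list(column)
            for term, column in zip(terms, zip(*vector_file.values()))}

# A sparser view of the same information: for each term, only the
# documents it occurs in (hypothetical helper structure).
docs_for_term = {term: [d for d, flag in zip(doc_ids, row) if flag]
                 for term, row in inverted.items()}

print(inverted["t1"])
print(docs_for_term["t2"])
```

Keeping only the documents with a 1 is what makes the inverted form compact when most entries of the matrix are 0.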

Page 6: Relevance Feedback Prof. Marti Hearst SIMS 202, Lecture 24.

How Are Inverted Files Created

Documents are parsed to extract tokens. These are saved with the Document ID.

Doc 1: Now is the time for all good men to come to the aid of their country

Doc 2: It was a dark and stormy night in the country manor. The time was past midnight

Term      Doc #
now       1
is        1
the       1
time      1
for       1
all       1
good      1
men       1
to        1
come      1
to        1
the       1
aid       1
of        1
their     1
country   1
it        2
was       2
a         2
dark      2
and       2
stormy    2
night     2
in        2
the       2
country   2
manor     2
the       2
time      2
was       2
past      2
midnight  2

Page 7: Relevance Feedback Prof. Marti Hearst SIMS 202, Lecture 24.

How Inverted Files are Created

After all documents have been parsed the inverted file is sorted

Term      Doc #
a         2
aid       1
all       1
and       2
come      1
country   1
country   2
dark      2
for       1
good      1
in        2
is        1
it        2
manor     2
men       1
midnight  2
night     2
now       1
of        1
past      2
stormy    2
the       1
the       1
the       2
the       2
their     1
time      1
time      2
to        1
to        1
was       2
was       2

Page 8: Relevance Feedback Prof. Marti Hearst SIMS 202, Lecture 24.

How Inverted Files are Created

- Multiple term entries for a single document are merged
- Within-document term frequency information is compiled

Term      Doc #  Freq
a         2      1
aid       1      1
all       1      1
and       2      1
come      1      1
country   1      1
country   2      1
dark      2      1
for       1      1
good      1      1
in        2      1
is        1      1
it        2      1
manor     2      1
men       1      1
midnight  2      1
night     2      1
now       1      1
of        1      1
past      2      1
stormy    2      1
the       1      2
the       2      2
their     1      1
time      1      1
time      2      1
to        1      2
was       2      2

Page 9: Relevance Feedback Prof. Marti Hearst SIMS 202, Lecture 24.

How Inverted Files are Created

Then the file is split into a Dictionary and a Postings file

Dictionary:

Term      N docs  Tot Freq
a         1       1
aid       1       1
all       1       1
and       1       1
come      1       1
country   2       2
dark      1       1
for       1       1
good      1       1
in        1       1
is        1       1
it        1       1
manor     1       1
men       1       1
midnight  1       1
night     1       1
now       1       1
of        1       1
past      1       1
stormy    1       1
the       2       4
their     1       1
time      2       2
to        1       2
was       1       2

Postings (one row per dictionary entry per document, in dictionary order):

Doc #  Freq
2      1
1      1
1      1
2      1
1      1
1      1
2      1
2      1
1      1
1      1
2      1
1      1
2      1
2      1
1      1
2      1
2      1
1      1
1      1
2      1
2      1
1      2
2      2
1      1
1      1
2      1
1      2
2      2
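The four construction steps (parse, sort, merge, split) can be traced end to end with a short Python sketch over the two sample documents. This is an illustration, not the implementation described in Harman's chapter: the tokenizer (lowercase, letters only) and the `offset` field that links a dictionary entry to its first postings row are assumptions made for the example.

```python
import re
from collections import Counter

docs = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. "
       "The time was past midnight",
}

# Step 1: parse each document into (term, doc_id) pairs.
# Assumed tokenizer: lowercase the text and keep runs of letters.
pairs = [(token, doc_id)
         for doc_id, text in docs.items()
         for token in re.findall(r"[a-z]+", text.lower())]

# Step 2: sort by term, then by document number.
pairs.sort()

# Step 3: merge repeated (term, doc_id) entries and record the
# within-document frequency of each term.
merged = sorted(Counter(pairs).items())  # [((term, doc_id), freq), ...]

# Step 4: split into a dictionary and a postings file.
postings = []    # rows of (doc_id, freq), in dictionary order
dictionary = {}  # term -> (n_docs, total_freq, offset of first postings row)
for (term, doc_id), freq in merged:
    n_docs, total, offset = dictionary.get(term, (0, 0, len(postings)))
    dictionary[term] = (n_docs + 1, total + freq, offset)
    postings.append((doc_id, freq))

n_docs, total, offset = dictionary["the"]
print(n_docs, total, postings[offset:offset + n_docs])
```

Looking up a term then costs one dictionary access plus a read of `n_docs` consecutive postings rows, which is the reason for splitting the file in two.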

Page 10: Relevance Feedback Prof. Marti Hearst SIMS 202, Lecture 24.

Inverted files

- Permit fast search for individual terms
- For each term, you get a list consisting of:
  - document ID
  - frequency of term in doc (optional)
  - position of term in doc (optional)
- These lists can be used to solve Boolean queries:
  - country -> d1, d2
  - manor -> d2
  - country AND manor -> d2
- Also used for statistical ranking algorithms
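As a small illustration of the Boolean case, an AND query can be answered by intersecting two postings lists. The sketch below assumes each list holds document IDs in ascending order; the function name `intersect` and the `postings` dictionary are made up for the example.

```python
def intersect(p1, p2):
    """Return the document IDs present in both sorted postings lists."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])  # document contains both terms
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                # advance the list with the smaller ID
        else:
            j += 1
    return result

# Postings for the two sample documents on the earlier pages.
postings = {"country": [1, 2], "manor": [2]}
print(intersect(postings["country"], postings["manor"]))
```

Because both lists are walked once from left to right, the cost is proportional to their combined length rather than to the size of the collection.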

Page 11: Relevance Feedback Prof. Marti Hearst SIMS 202, Lecture 24.

Finding Out About

- Three phases:
  - Asking of a question
  - Construction of an answer
  - Assessment of the answer
- Part of an iterative process

[Diagram label: Query modification]

Page 12: Relevance Feedback Prof. Marti Hearst SIMS 202, Lecture 24.

[Diagram: retrieval system overview. Components: information need, text input, parse, pre-process, index, collections, query, rank, query modification.]

Page 13: Relevance Feedback Prof. Marti Hearst SIMS 202, Lecture 24.

Relevance Feedback

- Problem: how to reformulate the query?
- Relevance Feedback:
  - Modify existing query based on relevance judgements
  - Extract terms from relevant documents and add them to the query
  - and/or re-weight the terms already in the query
  - Either automatic, or let users select the terms from an automatically-generated list
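One way to picture the "automatically-generated list" is a function that collects candidate terms from the documents judged relevant and ranks them. The sketch below is an assumption-laden illustration: ranking candidates by the number of relevant documents that contain them is a choice made here for simplicity, not a method given in the lecture, and the query and documents are invented sample data.

```python
from collections import Counter

def suggest_terms(query_terms, relevant_docs, top_n=5):
    """Rank terms from relevant documents that are not already in the query.

    Assumed scoring: number of relevant documents containing the term,
    ties broken alphabetically.
    """
    counts = Counter()
    for doc in relevant_docs:
        counts.update(set(doc.lower().split()) - set(query_terms))
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:top_n]]

# Invented sample data for illustration only.
query = ["automobile", "recalls"]
relevant_docs = [
    "automobile maker recalls sedans over brake defect",
    "brake defect prompts recalls",
]

suggestions = suggest_terms(query, relevant_docs, top_n=2)
expanded_query = query + suggestions  # automatic expansion
print(suggestions, expanded_query)
```

In the automatic variant the suggestions are appended directly, as in `expanded_query`; in the user-driven variant the same list would be shown so the user can pick which terms to add.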

Page 14: Relevance Feedback Prof. Marti Hearst SIMS 202, Lecture 24.

Relevance Feedback

- Usually do both:
  - expand query with new terms
  - re-weight terms in query
- There are many variations
  - usually positive weights for terms from relevant docs
  - sometimes negative weights for terms from non-relevant docs

Page 15: Relevance Feedback Prof. Marti Hearst SIMS 202, Lecture 24.

Rocchio Method
(See Harman Chapter 11)

$$Q_1 = Q_0 + \beta \sum_{i=1}^{n_1} \frac{R_i}{n_1} - \gamma \sum_{i=1}^{n_2} \frac{S_i}{n_2}$$

where
$Q_0$ = the vector for the initial query
$R_i$ = the vector for the relevant document $i$
$S_i$ = the vector for the non-relevant document $i$
$n_1$ = the number of relevant documents chosen
$n_2$ = the number of non-relevant documents chosen
$\beta$ and $\gamma$ tune the importance of relevant and nonrelevant terms (in some studies best to set $\beta$ to 0.75 and $\gamma$ to 0.25)
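A minimal sketch of the Rocchio update in Python, reading the formula as: new query = initial query + beta times the average relevant-document vector, minus gamma times the average non-relevant-document vector. The defaults 0.75 and 0.25 are the settings the slide mentions; the function name, the list-of-floats vector representation, and the three-term sample vectors are assumptions for illustration.

```python
def rocchio(q0, relevant, nonrelevant, beta=0.75, gamma=0.25):
    """Return the modified query vector Q1 as a list of term weights."""
    q1 = list(q0)
    for docs, weight in ((relevant, beta), (nonrelevant, -gamma)):
        if not docs:
            continue  # no judged documents of this kind: leave Q1 unchanged
        for k in range(len(q1)):
            # Add the weighted average of component k over the judged docs.
            q1[k] += weight * sum(d[k] for d in docs) / len(docs)
    return q1

# Invented three-term example: the query mentions only the first term.
q0 = [1.0, 0.0, 0.0]
relevant = [[1.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
nonrelevant = [[0.0, 0.0, 1.0]]

q1 = rocchio(q0, relevant, nonrelevant)
print(q1)
```

In this example the first term is re-weighted upward, the second term enters the query even though it was absent from the initial query, and the third term, seen only in the non-relevant document, ends up with a negative weight.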

Page 16: Relevance Feedback Prof. Marti Hearst SIMS 202, Lecture 24.

Rocchio Method

- Rocchio automatically
  - re-weights terms
  - adds in new terms (from relevant docs)
  - Have to be careful when using negative terms
- Most methods perform similarly
  - results heavily dependent on test collection
- Machine learning methods are proving to work better than standard IR approaches like Rocchio

Page 17: Relevance Feedback Prof. Marti Hearst SIMS 202, Lecture 24.

Using Relevance Feedback

- Known to improve results
  - in TREC-like conditions (no user involved)
- What about with a user in the loop?
  - Let’s examine a user study of relevance feedback by Koenemann & Belkin 1996.

Page 18: Relevance Feedback Prof. Marti Hearst SIMS 202, Lecture 24.

Questions being Investigated
Koenemann & Belkin 96

- How well do users work with statistical ranking on full text?
- Does relevance feedback improve results?
- Is user control over operation of relevance feedback helpful?
- How do different levels of user control affect results?

Page 19: Relevance Feedback Prof. Marti Hearst SIMS 202, Lecture 24.

How much of the guts should the user see?

- Opaque (black box)
  - (like web search engines)
- Transparent
  - (see available terms after the r.f.)
- Penetrable
  - (see suggested terms before the r.f.)
- Which do you think worked best?

Page 20: Relevance Feedback Prof. Marti Hearst SIMS 202, Lecture 24.


Page 21: Relevance Feedback Prof. Marti Hearst SIMS 202, Lecture 24.

[Figure: Terms available for relevance feedback made visible (from Koenemann & Belkin)]

Page 22: Relevance Feedback Prof. Marti Hearst SIMS 202, Lecture 24.

Details on User Study
Koenemann & Belkin 96

- Subjects have a tutorial session to learn the system
- Their goal is to keep modifying the query until they’ve developed one that gets high precision
- This is an example of a routing query (as opposed to ad hoc)

Page 23: Relevance Feedback Prof. Marti Hearst SIMS 202, Lecture 24.

Details on User Study
Koenemann & Belkin 96

- 64 novice searchers
  - 43 female, 21 male, native English speakers
- TREC test bed
  - Wall Street Journal subset
- Two search topics
  - Automobile Recalls
  - Tobacco Advertising and the Young
- Relevance judgements from TREC and experimenter
- System was INQUERY (vector space with some bells and whistles)

Page 24: Relevance Feedback Prof. Marti Hearst SIMS 202, Lecture 24.

[Figure: Sample TREC query]

Page 25: Relevance Feedback Prof. Marti Hearst SIMS 202, Lecture 24.

Evaluation

- Precision at 30 documents
- Baseline: (Trial 1)
  - How well does initial search go?
  - One topic has more relevant docs than the other
- Experimental condition (Trial 2)
  - Subjects get tutorial on relevance feedback
  - Modify query in one of four modes
    - no r.f., opaque, transparent, penetration
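Precision at 30 documents is the fraction of the top 30 ranked documents that are relevant. A small sketch, with a made-up ranking and made-up relevance judgements; the function name `precision_at_k` is chosen for this example.

```python
def precision_at_k(ranked_doc_ids, relevant_ids, k=30):
    """Fraction of the top-k ranked documents that are judged relevant.

    Divides by k even when fewer than k documents were retrieved,
    so missing results count against the score.
    """
    top_k = ranked_doc_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k

# Invented data: 40 ranked documents, every even-numbered one is relevant.
ranking = list(range(1, 41))
relevant = {doc_id for doc_id in ranking if doc_id % 2 == 0}

print(precision_at_k(ranking, relevant, k=30))
```

Only the first 30 positions matter, so relevant documents ranked 31st or lower do not change the score.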

Page 26: Relevance Feedback Prof. Marti Hearst SIMS 202, Lecture 24.

[Figure: Precision vs. RF condition (from Koenemann & Belkin 96)]

Page 27: Relevance Feedback Prof. Marti Hearst SIMS 202, Lecture 24.

Effectiveness Results

- Subjects with r.f. performed 17-34% better than those with no r.f.
- Subjects in the penetration case did 15% better as a group than those in the opaque and transparent cases.

Page 28: Relevance Feedback Prof. Marti Hearst SIMS 202, Lecture 24.

[Figure: Number of iterations in formulating queries (from Koenemann & Belkin 96)]

Page 29: Relevance Feedback Prof. Marti Hearst SIMS 202, Lecture 24.

[Figure: Number of terms in created queries (from Koenemann & Belkin 96)]

Page 30: Relevance Feedback Prof. Marti Hearst SIMS 202, Lecture 24.

Behavior Results

- Search times approximately equal
- Precision increased in first few iterations
- Penetration case required fewer iterations to make a good query than transparent and opaque
- R.F. queries much longer
  - but fewer terms in penetrable case -- users were more selective about which terms were added in.

Page 31: Relevance Feedback Prof. Marti Hearst SIMS 202, Lecture 24.

Relevance Feedback Summary

- Iterative query modification can improve precision and recall for a standing query
- In at least one study, users were able to make good choices by seeing which terms were suggested for r.f. and selecting among them
- So … “more like this” can be useful!