Relevance Feedback Prof. Marti Hearst SIMS 202, Lecture 24.
Marti A. Hearst, SIMS 202, Fall 1997

Today
Review Inverted Indexes
Relevance Feedback
    aka query modification
    aka “more like this”
Begin considering the role of the user
[Figure: retrieval system diagram with boxes labeled Information need, text input, Parse, Query, Rank, Collections, Pre-process, Index]
How is the index constructed?

Inverted Files
Primary data structure for text indexes
Invert documents into a big index
Basic idea:
    list all the tokens in the collection
    for each token, list all the docs it occurs in
    do a few things to reduce redundancy in the data structure
Read “Inverted Files” by Harman et al., Chapter 3, sections 3.1 through 3.3 (the rest of the chapter is optional).

Inverted Files
We have seen “Vector files” conceptually. An Inverted File is a vector file “inverted” so that rows become columns and columns become rows.

docs  t1  t2  t3
D1    1   0   1
D2    1   0   0
D3    0   1   1
D4    1   0   0
D5    1   1   1
D6    1   1   0
D7    0   1   0
D8    0   1   0
D9    0   0   1
D10   0   1   1

Terms  D1  D2  D3  D4  D5  D6  D7  …
t1     1   1   0   1   1   1   0
t2     0   0   1   0   1   1   1
t3     1   0   1   0   1   0   0

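The inversion of a vector file is just a transpose. The sketch below reproduces it for the ten example documents; the names `doc_term`, `TERMS`, and `invert` are made up for illustration and are not from the lecture.

```python
# Document-by-term matrix from the slide: one row per document,
# one column per term (t1, t2, t3).
TERMS = ["t1", "t2", "t3"]
doc_term = {
    "D1": [1, 0, 1], "D2": [1, 0, 0], "D3": [0, 1, 1], "D4": [1, 0, 0],
    "D5": [1, 1, 1], "D6": [1, 1, 0], "D7": [0, 1, 0], "D8": [0, 1, 0],
    "D9": [0, 0, 1], "D10": [0, 1, 1],
}


def invert(doc_term, terms):
    """Turn rows into columns: return a term-by-document matrix."""
    docs = list(doc_term)  # keeps the D1..D10 insertion order
    return {term: [doc_term[d][j] for d in docs] for j, term in enumerate(terms)}


term_doc = invert(doc_term, TERMS)
for term, row in term_doc.items():
    print(term, row)
```

Each row of `term_doc` now answers "which documents contain this term?" directly, which is the question a search has to answer.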
How Are Inverted Files Created
Documents are parsed to extract tokens. These are saved with the Document ID.

Doc 1: Now is the time for all good men to come to the aid of their country
Doc 2: It was a dark and stormy night in the country manor. The time was past midnight

Term      Doc #
now       1
is        1
the       1
time      1
for       1
all       1
good      1
men       1
to        1
come      1
to        1
the       1
aid       1
of        1
their     1
country   1
it        2
was       2
a         2
dark      2
and       2
stormy    2
night     2
in        2
the       2
country   2
manor     2
the       2
time      2
was       2
past      2
midnight  2

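A minimal sketch of this parsing step, assuming a tokenizer that lowercases the text and keeps runs of letters only (the slides do not say how tokens are delimited). `DOCS` and `parse` are illustrative names.

```python
import re

# The two example documents from the slide, keyed by document ID.
DOCS = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. "
       "The time was past midnight",
}


def parse(docs):
    """Return (term, doc_id) pairs in the order the tokens are read."""
    pairs = []
    for doc_id, text in docs.items():
        # Assumption: a token is a maximal run of letters, case-folded.
        for token in re.findall(r"[a-z]+", text.lower()):
            pairs.append((token, doc_id))
    return pairs


pairs = parse(DOCS)
for term, doc_id in pairs:
    print(term, doc_id)
```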
How Inverted Files are Created
After all documents have been parsed the inverted file is sorted

Term      Doc #
a         2
aid       1
all       1
and       2
come      1
country   1
country   2
dark      2
for       1
good      1
in        2
is        1
it        2
manor     2
men       1
midnight  2
night     2
now       1
of        1
past      2
stormy    2
the       1
the       1
the       2
the       2
their     1
time      1
time      2
to        1
to        1
was       2
was       2

How Inverted Files are Created
Multiple term entries for a single document are merged
Within-document term frequency information is compiled

Term      Doc #  Freq
a         2      1
aid       1      1
all       1      1
and       2      1
come      1      1
country   1      1
country   2      1
dark      2      1
for       1      1
good      1      1
in        2      1
is        1      1
it        2      1
manor     2      1
men       1      1
midnight  2      1
night     2      1
now       1      1
of        1      1
past      2      1
stormy    2      1
the       1      2
the       2      2
their     1      1
time      1      1
time      2      1
to        1      2
was       2      2

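The sort and merge steps can be sketched in a few lines. This is one possible rendering, not the procedure from Harman's chapter: it sorts the (term, doc ID) pairs, then collapses runs of identical pairs into one entry with a count. The variable names are illustrative.

```python
from itertools import groupby

# Tokens of the two example documents, already lowercased and stripped
# of punctuation so the block stands on its own.
doc1 = "now is the time for all good men to come to the aid of their country".split()
doc2 = "it was a dark and stormy night in the country manor the time was past midnight".split()
pairs = [(t, 1) for t in doc1] + [(t, 2) for t in doc2]

# Step 1: sort by term, then by document ID (tuple ordering does both).
sorted_pairs = sorted(pairs)

# Step 2: merge identical (term, doc_id) entries; the run length is the
# within-document term frequency.
merged = [
    (term, doc_id, len(list(run)))
    for (term, doc_id), run in groupby(sorted_pairs)
]

for term, doc_id, freq in merged:
    print(term, doc_id, freq)
```

Because `groupby` only merges adjacent items, the sort has to come first; that is the same reason the slides sort before merging.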
How Inverted Files are Created
Then the file is split into a Dictionary and a Postings file
Postings file:

Doc #  Freq
2      1
1      1
1      1
2      1
1      1
1      1
2      1
2      1
1      1
1      1
2      1
1      1
2      1
2      1
1      1
2      1
2      1
1      1
1      1
2      1
2      1
1      2
2      2
1      1
1      1
2      1
1      2
2      2

Dictionary:

Term      N docs  Tot Freq
a         1       1
aid       1       1
all       1       1
and       1       1
come      1       1
country   2       2
dark      1       1
for       1       1
good      1       1
in        1       1
is        1       1
it        1       1
manor     1       1
men       1       1
midnight  1       1
night     1       1
now       1       1
of        1       1
past      1       1
stormy    1       1
the       2       4
their     1       1
time      2       2
to        1       2
was       1       2

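A sketch of the split, under one assumption that is not spelled out in the transcript: each dictionary entry also stores the offset of the term's first record in the postings file, so the two files can be joined again at lookup time. The field names `n_docs`, `tot_freq`, and `offset` are made up for this example.

```python
from collections import Counter

DOCS = {
    1: "now is the time for all good men to come to the aid of their country",
    2: "it was a dark and stormy night in the country manor the time was past midnight",
}

# (term, doc_id) -> within-document frequency, i.e. the merged file.
freq = Counter((term, doc_id) for doc_id, text in DOCS.items() for term in text.split())

dictionary = {}  # term -> summary counts plus a pointer into postings
postings = []    # flat list of (doc_id, freq) records, in term order
for (term, doc_id), f in sorted(freq.items()):
    entry = dictionary.setdefault(
        term, {"n_docs": 0, "tot_freq": 0, "offset": len(postings)}
    )
    entry["n_docs"] += 1
    entry["tot_freq"] += f
    postings.append((doc_id, f))


def lookup(term):
    """Return the (doc_id, freq) postings for one term."""
    entry = dictionary[term]
    return postings[entry["offset"]:entry["offset"] + entry["n_docs"]]


print(dictionary["the"], lookup("the"))
```

Keeping the dictionary small and separate is what lets it stay in memory while the much larger postings file is read only for the terms a query actually mentions.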
Inverted files
Permit fast search for individual terms
For each term, you get a list consisting of:
    document ID
    frequency of term in doc (optional)
    position of term in doc (optional)
These lists can be used to solve Boolean queries:
    country -> d1, d2
    manor -> d2
    country AND manor -> d2
Also used for statistical ranking algorithms

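The Boolean AND above amounts to intersecting two sorted lists of document IDs. A minimal sketch, using a hand-built `index` with just the two terms from the slide (the function name `intersect` is illustrative):

```python
# Postings lists (sorted document IDs) for the two example terms.
index = {
    "country": [1, 2],
    "manor": [2],
}


def intersect(a, b):
    """Merge-style intersection of two sorted lists of document IDs."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


print(intersect(index["country"], index["manor"]))
```

The walk touches each posting at most once, which is why keeping the lists sorted by document ID pays off.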
Finding Out About
Three phases:
    Asking of a question
    Construction of an answer
    Assessment of the answer
Part of an iterative process
[Figure: retrieval system diagram with boxes labeled Information need, text input, Parse, Query, Rank, Collections, Pre-process, Index, and Query Modification]

Relevance Feedback
Problem: how to reformulate the query?
Relevance Feedback:
    Modify existing query based on relevance judgements
    Extract terms from relevant documents and add them to the query
    and/or re-weight the terms already in the query
    Either automatic, or let users select the terms from an automatically-generated list

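One way to picture the "automatically-generated list" is below. It is only a sketch: it assumes candidate terms are ranked by how many of the judged-relevant documents contain them, which is a simplification chosen for this example and not a method named in the lecture. `suggest_terms` and its parameters are hypothetical.

```python
from collections import Counter


def suggest_terms(query_terms, relevant_docs, k=5):
    """Suggest up to k expansion terms drawn from the relevant documents.

    Assumption: a term is a better suggestion the more relevant
    documents it appears in; ties are broken alphabetically.
    """
    doc_freq = Counter()
    for text in relevant_docs:
        doc_freq.update(set(text.lower().split()))  # count each doc once
    candidates = [(t, n) for t, n in doc_freq.items() if t not in query_terms]
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return [term for term, _ in candidates[:k]]


suggestions = suggest_terms(
    {"tobacco"},
    ["tobacco advertising young", "tobacco advertising ban"],
    k=2,
)
print(suggestions)
```

In the automatic variant the suggestions would be added to the query directly; in the user-controlled variant they would be shown as a list to pick from.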
Relevance Feedback
Usually do both:
    expand query with new terms
    re-weight terms in query
There are many variations
    usually positive weights for terms from relevant docs
    sometimes negative weights for terms from non-relevant docs

Rocchio Method
(See Harman Chapter 11)

$$Q_1 = Q_0 + \frac{\beta}{n_1}\sum_{i=1}^{n_1} R_i - \frac{\gamma}{n_2}\sum_{i=1}^{n_2} S_i$$

where
    $Q_0$ = the vector for the initial query
    $R_i$ = the vector for the relevant document $i$
    $S_i$ = the vector for the non-relevant document $i$
    $n_1$ = the number of relevant documents chosen
    $n_2$ = the number of non-relevant documents chosen
    $\beta$ and $\gamma$ tune the importance of relevant and nonrelevant terms (in some studies best to set $\beta$ to 0.75 and $\gamma$ to 0.25)

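A small sketch of the Rocchio update in words: the new query is the initial query, plus beta times the average of the relevant document vectors, minus gamma times the average of the non-relevant ones. Vectors are represented here as sparse dictionaries from term to weight, which is an implementation choice for this example; the defaults 0.75 and 0.25 are the values quoted on the slide.

```python
from collections import defaultdict


def rocchio(q0, relevant, nonrelevant, beta=0.75, gamma=0.25):
    """Return the modified query vector as a {term: weight} dict.

    q0, and every document in relevant / nonrelevant, is a sparse
    vector stored as {term: weight}.
    """
    q1 = defaultdict(float, q0)
    for docs, coeff in ((relevant, beta), (nonrelevant, -gamma)):
        for doc in docs:
            for term, weight in doc.items():
                # Each document contributes 1/len(docs) of the average.
                q1[term] += coeff * weight / len(docs)
    return dict(q1)


q0 = {"country": 1.0}
relevant = [{"country": 1.0, "manor": 1.0}, {"manor": 1.0}]
nonrelevant = [{"time": 1.0}]
q1 = rocchio(q0, relevant, nonrelevant)
print(q1)
```

Note that "manor" enters the query although it was not in `q0` (query expansion), "country" is re-weighted, and "time" ends up with a negative weight, which is the case the slides warn needs care.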
Rocchio Method
Rocchio automatically
    re-weights terms
    adds in new terms (from relevant docs)
    Have to be careful when using negative terms
Most methods perform similarly
    results heavily dependent on test collection
Machine learning methods are proving to work better than standard IR approaches like Rocchio

Using Relevance Feedback
Known to improve results
    in TREC-like conditions (no user involved)
What about with a user in the loop?
    Let’s examine a user study of relevance feedback by Koenemann & Belkin 1996.

Questions being Investigated
Koenemann & Belkin 96
How well do users work with statistical ranking on full text?
Does relevance feedback improve results?
Is user control over operation of relevance feedback helpful?
How do different levels of user control affect results?

How much of the guts should the user see?
Opaque (black box)
    (like web search engines)
Transparent
    (see available terms after the r.f.)
Penetrable
    (see suggested terms before the r.f.)
Which do you think worked best?

[Figure: terms available for relevance feedback made visible (from Koenemann & Belkin)]

Details on User Study
Koenemann & Belkin 96
Subjects have a tutorial session to learn the system
Their goal is to keep modifying the query until they’ve developed one that gets high precision
This is an example of a routing query (as opposed to ad hoc)

Details on User Study
Koenemann & Belkin 96
64 novice searchers
    43 female, 21 male, native English speakers
TREC test bed
    Wall Street Journal subset
Two search topics
    Automobile Recalls
    Tobacco Advertising and the Young
Relevance judgements from TREC and experimenter
System was INQUERY (vector space with some bells and whistles)

Sample TREC querySample TREC query
Evaluation
Precision at 30 documents
Baseline: (Trial 1)
    How well does initial search go?
    One topic has more relevant docs than the other
Experimental condition (Trial 2)
    Subjects get tutorial on relevance feedback
    Modify query in one of four modes
        no r.f., opaque, transparent, penetration

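Precision at 30 documents is the fraction of the top 30 ranked documents that are relevant. A minimal sketch follows; it assumes the denominator stays at the cutoff even if fewer documents were retrieved, which is one common convention and not something the slides specify. The function and argument names are illustrative.

```python
def precision_at_k(ranked_ids, relevant_ids, k=30):
    """Fraction of the top-k ranked documents that are judged relevant."""
    top = ranked_ids[:k]
    hits = sum(1 for doc_id in top if doc_id in relevant_ids)
    # Assumption: divide by k, not by len(top), so short result
    # lists are penalized.
    return hits / k


# Toy ranking of 40 documents in which every even-numbered one is relevant.
ranked = list(range(1, 41))
relevant = set(range(2, 41, 2))
p30 = precision_at_k(ranked, relevant)
print(p30)
```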
[Figure: Precision vs. RF condition (from Koenemann & Belkin 96)]

Effectiveness Results
Subjects with r.f. performed 17-34% better than those with no r.f.
Subjects in the penetration case did 15% better as a group than those in the opaque and transparent cases.

[Figure: Number of iterations in formulating queries (from Koenemann & Belkin 96)]

[Figure: Number of terms in created queries (from Koenemann & Belkin 96)]

Behavior Results
Search times approximately equal
Precision increased in first few iterations
Penetration case required fewer iterations to make a good query than transparent and opaque
R.F. queries much longer
    but fewer terms in penetrable case -- users were more selective about which terms were added in.

Relevance Feedback Summary
Iterative query modification can improve precision and recall for a standing query
In at least one study, users were able to make good choices by seeing which terms were suggested for r.f. and selecting among them
So … “more like this” can be useful!