
Page 1: Ranking and Relevance Feedback

Ray Larson & Marti Hearst

University of California, Berkeley

School of Information Management and Systems

SIMS 202: Information Organization and Retrieval

9/21/2000

Page 2: Review

• Inverted files

• The Vector Space Model

• Term weighting

Page 3: tf x idf

$$w_{ik} = tf_{ik} \cdot \log(N/n_k)$$

where:
• $T_k$ = term k in document $D_i$
• $tf_{ik}$ = frequency of term $T_k$ in document $D_i$
• $idf_k$ = inverse document frequency of term $T_k$ in collection C
• $N$ = total number of documents in the collection C
• $n_k$ = the number of documents in C that contain $T_k$
• $idf_k = \log(N/n_k)$

Page 4: Inverse Document Frequency

• IDF provides high values for rare words and low values for common words

For a collection of 10000 documents (logs base 10):

• log(10000/1) = 4
• log(10000/20) = 2.698
• log(10000/5000) = 0.301
• log(10000/10000) = 0
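A minimal sketch of this computation in Python (the collection size and document frequencies mirror the example above; the function names are illustrative, not from any particular IR system):

```python
import math

def idf(total_docs, docs_with_term):
    """Inverse document frequency: log10(N / n_k)."""
    return math.log10(total_docs / docs_with_term)

def tf_idf(term_freq, total_docs, docs_with_term):
    """Simple tf x idf weight: w_ik = tf_ik * log(N / n_k)."""
    return term_freq * idf(total_docs, docs_with_term)

# Reproduce the 10,000-document example: rare terms score high, common terms low.
N = 10000
for n_k in (1, 20, 5000, 10000):
    print(f"n_k = {n_k:5d}   idf = {idf(N, n_k):.3f}")
```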

Page 5: Similarity Measures

• Simple matching (coordination level match): $|Q \cap D|$
• Dice's Coefficient: $\dfrac{2\,|Q \cap D|}{|Q| + |D|}$
• Jaccard's Coefficient: $\dfrac{|Q \cap D|}{|Q \cup D|}$
• Cosine Coefficient: $\dfrac{|Q \cap D|}{|Q|^{1/2}\,|D|^{1/2}}$
• Overlap Coefficient: $\dfrac{|Q \cap D|}{\min(|Q|, |D|)}$
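These coefficients are easy to express over sets of terms. A short illustrative sketch (treating Q and D as Python sets of terms; the example query and document are made up, not from the slides):

```python
def simple_match(q, d):
    return len(q & d)

def dice(q, d):
    return 2 * len(q & d) / (len(q) + len(d))

def jaccard(q, d):
    return len(q & d) / len(q | d)

def cosine(q, d):
    return len(q & d) / (len(q) ** 0.5 * len(d) ** 0.5)

def overlap(q, d):
    return len(q & d) / min(len(q), len(d))

q = {"information", "retrieval", "ranking"}
d = {"ranking", "relevance", "feedback", "retrieval"}
print(jaccard(q, d), cosine(q, d))  # 0.4  0.577...
```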

Page 6: Vector Space Visualization

Page 7: Text Clustering

Clustering is "the art of finding groups in data." -- Kaufman and Rousseeuw

(Figure: points plotted along axes Term 1 and Term 2.)

Page 8: (figure)

Page 9: tf x idf Normalization

• Normalize the term weights (so longer documents are not unfairly given more weight)
  – normalize usually means force all values to fall within a certain range, usually between 0 and 1, inclusive

$$w_{ik} = \frac{tf_{ik}\,\log(N/n_k)}{\sqrt{\sum_{k=1}^{t}(tf_{ik})^2\,[\log(N/n_k)]^2}}$$
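A sketch of this normalization over one document's term-frequency vector (base-10 logs, matching the idf example earlier; names are illustrative):

```python
import math

def normalized_tf_idf(tfs, dfs, total_docs):
    """Cosine-normalized tf x idf weights for one document.

    tfs: term frequencies tf_ik for document i
    dfs: document frequencies n_k for the same terms (same order)
    """
    raw = [tf * math.log10(total_docs / df) for tf, df in zip(tfs, dfs)]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw] if norm else raw

weights = normalized_tf_idf(tfs=[3, 1, 0, 2], dfs=[50, 1300, 250, 4], total_docs=10000)
print(weights, sum(w * w for w in weights))  # squared weights sum to ~1.0
```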

Page 10: Vector Space Similarity (use the weights to compare the documents)

Now, the similarity of two documents is:

$$sim(D_i, D_j) = \sum_{k=1}^{t} w_{ik} \cdot w_{jk}$$

This is also called the cosine, or normalized inner product. (Normalization was done when weighting the terms.)

Page 11: Vector Space Similarity Measure (combine tf x idf into a similarity measure)

$$D_i = (w_{d_{i1}}, w_{d_{i2}}, \ldots, w_{d_{it}}) \qquad Q = (w_{q_1}, w_{q_2}, \ldots, w_{q_t}) \qquad w = 0 \text{ if a term is absent}$$

If term weights are normalized:

$$sim(Q, D_i) = \sum_{j=1}^{t} w_{q_j} \cdot w_{d_{ij}}$$

Otherwise, normalize in the similarity comparison:

$$sim(Q, D_i) = \frac{\sum_{j=1}^{t} w_{q_j} \cdot w_{d_{ij}}}{\sqrt{\sum_{j=1}^{t} (w_{q_j})^2}\;\sqrt{\sum_{j=1}^{t} (w_{d_{ij}})^2}}$$

Page 12: Vector Space with Term Weights and Cosine Matching

(Figure: Q, D1 and D2 plotted against Term A and Term B, with both axes running from 0 to 1.0.)

$$D_i = (d_{i1}, w_{d_{i1}}; d_{i2}, w_{d_{i2}}; \ldots; d_{it}, w_{d_{it}}) \qquad Q = (q_{i1}, w_{q_{i1}}; q_{i2}, w_{q_{i2}}; \ldots; q_{it}, w_{q_{it}})$$

$$sim(Q, D_i) = \frac{\sum_{j=1}^{t} w_{q_j} \cdot w_{d_{ij}}}{\sqrt{\sum_{j=1}^{t}(w_{q_j})^2}\;\sqrt{\sum_{j=1}^{t}(w_{d_{ij}})^2}}$$

Q = (0.4, 0.8)   D1 = (0.8, 0.3)   D2 = (0.2, 0.7)

$$sim(Q, D_2) = \frac{(0.4 \cdot 0.2) + (0.8 \cdot 0.7)}{\sqrt{[(0.4)^2 + (0.8)^2] \cdot [(0.2)^2 + (0.7)^2]}} = \frac{0.64}{\sqrt{0.42}} = 0.98$$

$$sim(Q, D_1) = \frac{0.56}{\sqrt{0.58}} = 0.74$$
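The same calculation as a quick sketch, reproducing the two cosine scores above:

```python
import math

def cosine_sim(q, d):
    """Cosine similarity between two term-weight vectors."""
    dot = sum(qi * di for qi, di in zip(q, d))
    return dot / (math.sqrt(sum(qi * qi for qi in q)) *
                  math.sqrt(sum(di * di for di in d)))

Q  = (0.4, 0.8)
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)
print(round(cosine_sim(Q, D1), 3))  # 0.733 (the slide's 0.74 comes from rounding 0.584 to 0.58 first)
print(round(cosine_sim(Q, D2), 3))  # 0.983 (0.98 on the slide)
```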

Page 13: Term Weights in SMART

• In SMART weights are decomposed into three factors:

$$w_{dk} = \frac{freq_{dk} \cdot collect_k}{norm_d}$$

Page 14: SMART Freq Components

• Binary: $freq_{dk} \in \{0, 1\}$
• maxnorm: $\dfrac{freq_{dk}}{\max(freq_{d})}$
• augmented: $\dfrac{1}{2} + \dfrac{freq_{dk}}{2\max(freq_{d})}$
• log: $\ln(freq_{dk}) + 1$

Page 15: Collection Weighting in SMART

• Inverse: $collect_k = \log\dfrac{NDoc}{Doc_k}$
• Squared: $collect_k = \left[\log\dfrac{NDoc}{Doc_k}\right]^2$
• Probabilistic: $collect_k = \log\dfrac{NDoc - Doc_k}{Doc_k}$
• Frequency: $collect_k = \dfrac{1}{Doc_k}$

where NDoc is the total number of documents in the collection and $Doc_k$ is the number of documents containing term k.

Page 16: Term Normalization in SMART

• sum: $norm = \sum_{j \in vector} w_j$
• cosine: $norm = \sqrt{\sum_{j \in vector} w_j^2}$
• fourth: $norm = \sqrt[4]{\sum_{j \in vector} w_j^4}$
• max: $norm = \max_{j \in vector} w_j$
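One way to read the SMART slides together: a final weight multiplies a frequency component by a collection component and divides by a normalization taken over the whole document vector. A hedged sketch using the "augmented" frequency, "inverse" collection weight, and "cosine" normalization (the particular combination and the function names are illustrative choices, not the SMART source):

```python
import math

def augmented_freq(freq, max_freq):
    """SMART 'augmented' frequency component: 1/2 + freq / (2 * max freq)."""
    return 0.5 + 0.5 * freq / max_freq if freq > 0 else 0.0

def inverse_collect(n_docs, doc_freq):
    """SMART 'inverse' collection component: log(NDoc / Doc_k)."""
    return math.log10(n_docs / doc_freq)

def smart_weights(freqs, doc_freqs, n_docs):
    """Combine freq * collect per term, then apply cosine normalization."""
    max_freq = max(freqs)
    raw = [augmented_freq(f, max_freq) * inverse_collect(n_docs, df)
           for f, df in zip(freqs, doc_freqs)]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw] if norm else raw

print(smart_weights(freqs=[5, 2, 1], doc_freqs=[10, 400, 9000], n_docs=10000))
```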

Page 17: Probabilistic Models: Logistic Regression Attributes

$X_1 = \frac{1}{M}\sum_{j=1}^{M}\log QAF_{t_j}$ -- Average Absolute Query Frequency
$X_2 = QL$ -- Query Length
$X_3 = \frac{1}{M}\sum_{j=1}^{M}\log DAF_{t_j}$ -- Average Absolute Document Frequency
$X_4 = DL$ -- Document Length
$X_5 = \frac{1}{M}\sum_{j=1}^{M}\log IDF_{t_j}$ -- Average Inverse Document Frequency
$IDF_t = N/n_t$ -- Inverse Document Frequency (N documents in the collection, $n_t$ containing term t)
$X_6 = \log M$ -- Number of terms in common between query and document -- logged

Page 18: Probabilistic Models: Logistic Regression

Probability of relevance is based on logistic regression from a sample set of documents to determine values of the coefficients. At retrieval the probability estimate is obtained by:

$$P(R \mid Q, D) = c_0 + \sum_{i=1}^{6} c_i X_i$$

for the 6 attribute measures X shown previously.
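A sketch of how such a score could be computed at retrieval time once coefficients have been fitted offline. The coefficient values below are made up for illustration; mapping the linear combination to a probability with the logistic function is one common reading of "logistic regression", not something the slide spells out:

```python
import math

# Hypothetical fitted coefficients c0..c6 (illustrative only; real values would
# come from fitting logistic regression on TREC relevance judgements).
COEFFS = [-3.5, 0.4, -0.01, 0.5, -0.002, 0.6, 1.2]

def relevance_score(x):
    """Linear combination c0 + sum(ci * Xi) over the six attribute measures."""
    assert len(x) == 6
    return COEFFS[0] + sum(c * xi for c, xi in zip(COEFFS[1:], x))

def relevance_probability(x):
    """Treat the score as log-odds and apply the logistic transform."""
    return 1.0 / (1.0 + math.exp(-relevance_score(x)))

# X1..X6 for one query/document pair (made-up values).
x = [1.1, 3.0, 1.8, 95.0, 2.4, 1.6]
print(relevance_probability(x))
```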

Page 19: Probabilistic Models

Advantages:
• Strong theoretical basis
• In principle, should supply the best predictions of relevance given the available information
• Can be implemented similarly to the Vector model

Disadvantages:
• Relevance information is required -- or is "guestimated"
• Important indicators of relevance may not be terms -- though usually only terms are used
• Optimally requires on-going collection of relevance information

Page 20: Vector and Probabilistic Models

• Support "natural language" queries
• Treat documents and queries the same
• Support relevance feedback searching
• Support ranked retrieval
• Differ primarily in theoretical basis and in how the ranking is calculated
  – Vector assumes relevance
  – Probabilistic relies on relevance judgments or estimates

Page 21: Current Use of Probabilistic Models

• Virtually all the major systems in TREC now use the "Okapi BM25 formula", which incorporates the Robertson-Sparck Jones weights:

$$w^{(1)} = \log \frac{(r + 0.5)/(R - r + 0.5)}{(n - r + 0.5)/(N - n - R + r + 0.5)}$$

Page 22: Okapi BM25

$$\sum_{T \in Q} w^{(1)} \, \frac{(k_1 + 1)\,tf}{K + tf} \, \frac{(k_3 + 1)\,qtf}{k_3 + qtf}$$

• Where:
  – Q is a query containing terms T
  – K is $k_1((1-b) + b \cdot dl/avdl)$
  – $k_1$, b and $k_3$ are parameters, usually 1.2, 0.75 and 7-1000
  – tf is the frequency of the term in a specific document
  – qtf is the frequency of the term in a topic from which Q was derived
  – dl and avdl are the document length and the average document length measured in some convenient unit
  – $w^{(1)}$ is the Robertson-Sparck Jones weight
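A hedged sketch of scoring one document against a query with this formula. The counts and parameters below are made up, and with no relevance information available the RSJ weight is computed with r = R = 0, where it behaves like an idf:

```python
import math

def rsj_weight(N, n, R=0, r=0):
    """Robertson-Sparck Jones w(1) with the 0.5 smoothing terms."""
    return math.log(((r + 0.5) / (R - r + 0.5)) /
                    ((n - r + 0.5) / (N - n - R + r + 0.5)))

def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len, df, N,
               k1=1.2, b=0.75, k3=7.0):
    """Sum over query terms of w(1) * ((k1+1)tf)/(K+tf) * ((k3+1)qtf)/(k3+qtf)."""
    K = k1 * ((1 - b) + b * doc_len / avg_doc_len)
    score = 0.0
    for term, qtf in query_terms.items():
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue
        w1 = rsj_weight(N, df[term])
        score += w1 * ((k1 + 1) * tf) / (K + tf) * ((k3 + 1) * qtf) / (k3 + qtf)
    return score

query = {"relevance": 1, "feedback": 1}
doc = {"relevance": 3, "ranking": 2, "feedback": 1}
print(bm25_score(query, doc, doc_len=120, avg_doc_len=100,
                 df={"relevance": 40, "feedback": 25}, N=10000))
```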

Page 23: Today

• Logistic regression and Cheshire

• Relevance Feedback

Page 24: Logistic Regression and Cheshire II

• The Cheshire II system uses Logistic Regression equations estimated from TREC full-text data.

• Demo (?)

Page 25: Querying in IR System

(Diagram of an Information Storage and Retrieval System: interest profiles & queries are formulated in terms of descriptors and stored as Store 1: Profiles/Search requests; documents & data are indexed (descriptive and subject) and stored as Store 2: Document representations; the "rules of the game" = rules for subject indexing + a thesaurus, which consists of a lead-in vocabulary and an indexing language; comparison/matching of the two stores yields potentially relevant documents.)

Page 26: Relevance Feedback in an IR System

(The same Information Storage and Retrieval System diagram as the previous slide, with one added element: selected relevant docs feeding back into the process.)

Page 27: Query Modification

• Problem: how to reformulate the query?
  – Thesaurus expansion: suggest terms similar to query terms
  – Relevance feedback: suggest terms (and documents) similar to retrieved documents that have been judged to be relevant

Page 28: Relevance Feedback

• Main idea:
  – Modify existing query based on relevance judgements
    • Extract terms from relevant documents and add them to the query
    • and/or re-weight the terms already in the query
  – Two main approaches:
    • Automatic (pseudo-relevance feedback)
    • Users select relevant documents
      – Users/system select terms from an automatically-generated list

Page 29: Relevance Feedback

• Usually do both:
  – expand query with new terms
  – re-weight terms in query
• There are many variations:
  – usually positive weights for terms from relevant docs
  – sometimes negative weights for terms from non-relevant docs
  – remove terms that appear ONLY in non-relevant documents

Page 30: Rocchio Method

$$Q_1 = Q_0 + \frac{\beta}{n_1}\sum_{i=1}^{n_1} R_i - \frac{\gamma}{n_2}\sum_{i=1}^{n_2} S_i$$

where:
• $Q_0$ = the vector for the initial query
• $R_i$ = the vector for the relevant document i
• $S_i$ = the vector for the non-relevant document i
• $n_1$ = the number of relevant documents chosen
• $n_2$ = the number of non-relevant documents chosen

$\beta$ and $\gamma$ tune the importance of relevant and nonrelevant terms (in some studies best to set $\beta$ to 0.75 and $\gamma$ to 0.25).

Page 31: Rocchio Method

$$Q_1 = \alpha Q_0 + \frac{\beta}{n_1}\sum_{i=1}^{n_1} R_i - \frac{\gamma}{n_2}\sum_{i=1}^{n_2} S_i$$

where $Q_0$, $R_i$, $S_i$, $n_1$ and $n_2$ are defined as on the previous slide, and $\alpha$, $\beta$ and $\gamma$ tune the importance of the original query and of relevant and nonrelevant terms (in some studies best to set $\beta$ to 0.75 and $\gamma$ to 0.25).

Page 32: Rocchio/Vector Illustration

(Figure: Q0, Q', Q", D1 and D2 plotted in a space whose axes are labeled Information and Retrieval, both running from 0 to 1.0.)

Q0 = retrieval of information = (0.7, 0.3)
D1 = information science = (0.2, 0.8)
D2 = retrieval systems = (0.9, 0.1)

Q' = ½ Q0 + ½ D1 = (0.45, 0.55)
Q" = ½ Q0 + ½ D2 = (0.80, 0.20)

Page 33: Example Rocchio Calculation

Relevant docs:
R1 = (.020, .009, .020, .002, .050, .025, .100, .100, .120)
R2 = (.030, .000, .000, .025, .025, .050, .000, .000, .120)

Non-rel doc:
S1 = (.030, .010, .020, .000, .005, .025, .000, .020, .000)

Original query:
Q = (.000, .000, .000, .000, .500, .000, .450, .000, .950)

Constants: α = 1, β = 0.75, γ = 0.25

Rocchio calculation:
$$Q_{new} = \alpha Q + \frac{\beta}{2}(R_1 + R_2) - \frac{\gamma}{1}S_1$$

Resulting feedback query:
Q_new = (.011, .000875, .002, .01, .527, .022, .488, .033, 1.04)
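The same calculation as a short sketch, which reproduces the feedback query above:

```python
def rocchio(q, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.25):
    """Q_new = alpha*Q + (beta/n1)*sum(R_i) - (gamma/n2)*sum(S_i)."""
    n1, n2 = len(relevant), len(nonrelevant)
    new_q = []
    for k in range(len(q)):
        r_avg = sum(r[k] for r in relevant) / n1 if n1 else 0.0
        s_avg = sum(s[k] for s in nonrelevant) / n2 if n2 else 0.0
        new_q.append(alpha * q[k] + beta * r_avg - gamma * s_avg)
    return new_q

Q  = [0.000, 0.000, 0.000, 0.000, 0.500, 0.000, 0.450, 0.000, 0.950]
R1 = [0.020, 0.009, 0.020, 0.002, 0.050, 0.025, 0.100, 0.100, 0.120]
R2 = [0.030, 0.000, 0.000, 0.025, 0.025, 0.050, 0.000, 0.000, 0.120]
S1 = [0.030, 0.010, 0.020, 0.000, 0.005, 0.025, 0.000, 0.020, 0.000]

print([round(w, 6) for w in rocchio(Q, [R1, R2], [S1])])
# [0.01125, 0.000875, 0.0025, 0.010125, 0.526875, 0.021875, 0.4875, 0.0325, 1.04]
```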

Page 34: Rocchio Method

• Rocchio automatically
  – re-weights terms
  – adds in new terms (from relevant docs)
    • have to be careful when using negative terms
• Rocchio is not a machine learning algorithm
• Most methods perform similarly
  – results heavily dependent on test collection
• Machine learning methods are proving to work better than standard IR approaches like Rocchio

Page 35: Probabilistic Relevance Feedback (Robertson & Sparck Jones)

Given a query term t:

                          Document Relevance
                             +           -
Document      +              r          n-r          n
indexing      -             R-r      N-n-R+r        N-n
                             R          N-R          N

Where N is the number of documents seen

Page 36: Robertson-Sparck Jones Weights

• Retrospective formulation --

$$w_t = \log \frac{r/(R - r)}{(n - r)/(N - n - R + r)}$$

Page 37: Robertson-Sparck Jones Weights

• Predictive formulation --

$$w^{(1)} = \log \frac{(r + 0.5)/(R - r + 0.5)}{(n - r + 0.5)/(N - n - R + r + 0.5)}$$
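A sketch computing both formulations from the contingency counts on the preceding table (the counts below are hypothetical):

```python
import math

def rsj_retrospective(N, n, R, r):
    """w_t = log[(r/(R-r)) / ((n-r)/(N-n-R+r))] -- needs full relevance information."""
    return math.log((r / (R - r)) / ((n - r) / (N - n - R + r)))

def rsj_predictive(N, n, R, r):
    """w(1): the same ratio with 0.5 added to each cell, so it is defined
    even when some cells of the contingency table are zero."""
    return math.log(((r + 0.5) / (R - r + 0.5)) /
                    ((n - r + 0.5) / (N - n - R + r + 0.5)))

# Hypothetical counts for one query term: N documents seen, n indexed with
# the term, R judged relevant, r both relevant and indexed with the term.
N, n, R, r = 1000, 40, 20, 8
print(rsj_retrospective(N, n, R, r), rsj_predictive(N, n, R, r))
```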

Page 38: Using Relevance Feedback

• Known to improve results
  – in TREC-like conditions (no user involved)
• What about with a user in the loop?
  – How might you measure this?

Page 39: Relevance Feedback Summary

• Iterative query modification can improve precision and recall for a standing query

• In at least one study, users were able to make good choices by seeing which terms were suggested for R.F. and selecting among them

Page 40: Alternative Notions of Relevance Feedback

Page 41: Alternative Notions of Relevance Feedback

• Find people whose taste is “similar” to yours. Will you like what they like?

• Follow a user's actions in the background. Can this be used to predict what the user will want to see next?

• Track what lots of people are doing. Does this implicitly indicate what they think is good and not good?

Page 42: Alternative Notions of Relevance Feedback

• Several different criteria to consider:
  – Implicit vs. Explicit judgements
  – Individual vs. Group judgements
  – Standing vs. Dynamic topics
  – Similarity of the items being judged vs. similarity of the judges themselves

Page 43: Collaborative Filtering (social filtering)

• If Pam liked the paper, I'll like the paper
• If you liked Star Wars, you'll like Independence Day
• Rating based on ratings of similar people
  – Ignores the text, so works on text, sound, pictures etc.
  – But: initial users can bias ratings of future users

                   Sally   Bob   Chris   Lynn   Karen
Star Wars            7      7      3      4      7
Jurassic Park        6      4      7      4      4
Terminator II        3      4      7      6      3
Independence Day     7      7      2      2      ?

Page 44: Ringo Collaborative Filtering (Shardanand & Maes 95)

• Users rate musical artists from like to dislike
  – 1 = detest, 7 = can't live without, 4 = ambivalent
  – There is a normal distribution around 4
  – However, what matters are the extremes
• Nearest Neighbors strategy: find similar users and predict a (weighted) average of their ratings
• Pearson r algorithm: weight by degree of correlation between user U and user J
  – 1 means very similar, 0 means no correlation, -1 dissimilar
  – Works better to compare against the ambivalent rating (4), rather than the individual's average score

$$r_{UJ} = \frac{\sum (U - \bar{U})(J - \bar{J})}{\sqrt{\sum (U - \bar{U})^2 \, \sum (J - \bar{J})^2}}$$
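A sketch applying this Pearson-style weighting to the movie-rating table from the collaborative filtering slide: each user is compared against Karen, and her missing Independence Day rating is predicted as a correlation-weighted average. Per the slide's note, deviations are taken from the ambivalent rating (4) rather than each user's mean; restricting the prediction to positively correlated users is one simple illustrative choice, not a specific published algorithm:

```python
def pearson_vs_ambivalent(u, j, midpoint=4.0):
    """Correlation of two rating vectors, with deviations taken from the
    ambivalent rating (4) instead of each user's own mean."""
    du = [x - midpoint for x in u]
    dj = [x - midpoint for x in j]
    num = sum(a * b for a, b in zip(du, dj))
    den = (sum(a * a for a in du) * sum(b * b for b in dj)) ** 0.5
    return num / den if den else 0.0

# Ratings for Star Wars, Jurassic Park, Terminator II (the films Karen has rated).
ratings = {"Sally": [7, 6, 3], "Bob": [7, 4, 4], "Chris": [3, 7, 7], "Lynn": [4, 4, 6]}
karen = [7, 4, 3]
independence_day = {"Sally": 7, "Bob": 7, "Chris": 2, "Lynn": 2}

weights = {u: pearson_vs_ambivalent(r, karen) for u, r in ratings.items()}
pred = (sum(w * independence_day[u] for u, w in weights.items() if w > 0)
        / sum(w for w in weights.values() if w > 0))
print(weights)
print(round(pred, 1))  # predicted Independence Day rating for Karen
```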

Page 45: Social Filtering

• Ignores the content, only looks at who judges things similarly
• Works well on data relating to "taste"
  – something that people are good at predicting about each other too
• Does it work for topic?
  – GroupLens results suggest otherwise (preliminary)
  – Perhaps for quality assessments
  – What about for assessing if a document is about a topic?

Page 46: Learning Interface Agents

• Add agents in the UI, delegate tasks to them
• Use machine learning to improve performance
  – learn user behavior, preferences
• Useful when:
  – 1) past behavior is a useful predictor of the future
  – 2) there is a wide variety of behaviors amongst users
• Examples:
  – mail clerk: sort incoming messages into the right mailboxes
  – calendar manager: automatically schedule meeting times?

Page 47: Summary

• Relevance feedback is an effective means for user-directed query modification.

• Modification can be done with either direct or indirect user input

• Modification can be done based on an individual’s or a group’s past input.
