A Field Relevance Model


Transcript of A Field Relevance Model

Page 1: A Field Relevance Model


A Field Relevance Model for Structured Document Retrieval

JIN YOUNG KIM @ ECIR 2012

Page 2: A Field Relevance Model

Three Themes
• The Concept of Field Relevance
• Using Field Relevance for Retrieval
• The Estimation of Field Relevance

[Diagram: Relevance + Field Weighting → Field Relevance]

Page 3: A Field Relevance Model

THE FIELD RELEVANCE


Page 4: A Field Relevance Model

IR: The Quest for Relevance
• The Role of Relevance
  • Core component of retrieval models
  • Basis of (pseudo) relevance feedback
• Retrieval Models based on Relevance
  • Binary Independence Model (BM25) [Robertson76]
  • Relevance-based Language Model [Lavrenko01]

Notation: V = (w1 w2 ... wm); term-level relevance model P(w|R)

Page 5: A Field Relevance Model

Structured Document Retrieval
• Documents have multiple fields
  • Emails, products (entities), and so on
• Retrieval models exploit the structure
  • Field weighting is common (see the sketch after the diagram below)

[Diagram: each query term q1 ... qm is scored against the document fields f1 ... fn; the field-level scores are weighted by w1 ... wn and summed per term, and the per-term scores are multiplied across the query]
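To make the sum/multiply combination concrete, here is a minimal Python sketch of fixed-weight field scoring. It is not from the talk: the field language models (callables returning a smoothed, nonzero P(q|fj)) and the weight vector are assumptions for illustration.

    import math

    def score_fixed_weights(query_terms, field_lms, field_weights):
        # field_lms[j](term): smoothed P(term | field j of the document), > 0
        # field_weights[j]:   fixed weight w_j for field j (weights sum to 1)
        log_score = 0.0
        for q in query_terms:
            # Weighted sum of field-level scores for this term ...
            term_score = sum(w * lm(q) for lm, w in zip(field_lms, field_weights))
            # ... multiplied across query terms (accumulated as a sum of logs).
            log_score += math.log(term_score)
        return log_score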

Page 6: A Field Relevance Model

Relevance for Structured Document Retrieval
• Term-level Relevance: which term is important for the user's information need?
  • P(w|R) over the terms V = (w1 w2 ... wm)
• Field-level Relevance: which field is important for the user's information need?
  • P(F|R) over the fields F = (F1 F2 … Fn)

Page 7: A Field Relevance Model

Defining the Field Relevance

Field Relevance: the distribution of per-term relevance over document fields, P(F|w,R)
• Query: m words, Q = (q1 q2 ... qm)
• Collection: n fields for each document, F = (F1 F2 … Fn)
• For each query term qi, the field relevance is the per-term distribution P(F|qi,R) over F1 ... Fn

Page 8: A Field Relevance Model

Why P(F|w,R) instead of P(F|R)?
• Different fields are relevant for different query terms

Query: 'james registration'
• 'james' is relevant when it occurs in <to>
• 'registration' is relevant when it occurs in <subject>

Page 9: A Field Relevance Model

More Evidence for the Field Relevance
• Field operators / advanced search interfaces
• Users' search terms are found in multiple fields

Understanding Re-finding Behavior in Naturalistic Email Interaction Logs. Elsweiler, D., Harvey, M., Hacker, M. [SIGIR'11]
Evaluating Search in Personal Social Media Collections. Lee, C.-J., Croft, W.B., Kim, J. [WSDM'12]

Page 10: A Field Relevance Model

THE FIELD RELEVANCE MODEL


Page 11: A Field Relevance Model

Retrieval over Structured Documents
• Field-based Retrieval Models
  • Score each field against each query term
  • Combine field-level scores using field weights

Fixed field weights wj can be too restrictive.

Page 12: A Field Relevance Model

Using the Field Relevance for Retrieval
• Field Relevance Model: per-term field weights P(Fj|qi) replace the fixed field weights wj (see the sketch below)
• Comparison with the Mixture of Field Language Models: MFLM applies the same weights w1 ... wn to every query term, whereas the Field Relevance Model weights each query term's field scores by that term's own field relevance

[Diagram: the earlier field-weighting diagram, with the fixed weights w1 ... wn replaced by per-term weights P(F1|q1) ... P(Fn|q1) through P(F1|qm) ... P(Fn|qm); the weighted field scores are summed per term and multiplied across query terms]
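Under the same assumptions as the earlier sketch, the Field Relevance Model changes one line: a hypothetical field_relevance(q) returning the estimated distribution [P(F1|q), ..., P(Fn|q)] replaces the fixed weight vector.

    import math

    def score_field_relevance(query_terms, field_lms, field_relevance):
        # field_relevance(q): estimated [P(F_1|q), ..., P(F_n|q)] for term q
        log_score = 0.0
        for q in query_terms:
            weights = field_relevance(q)  # per-term field weights
            term_score = sum(w * lm(q) for w, lm in zip(weights, field_lms))
            log_score += math.log(term_score)
        return log_score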

Page 13: A Field Relevance Model

Structured Document Retrieval: PRM-S [Kim, Xue, Croft 09]
• Probabilistic Retrieval Model for Semi-structured data
• Estimate the mapping between query terms and document fields
• Use the mapping probability as per-term field weights (a sketch follows below)

Estimation is based on limited sources.
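A minimal sketch of a PRM-S style mapping estimate, assuming hypothetical smoothed collection-level field language models (lm(term) ≈ P(term|Fj), nonzero) and field priors P(Fj); the mapping probability follows from Bayes' rule, normalized over fields.

    def mapping_probability(term, collection_lms, field_priors):
        # P(F_j | w) ∝ P(w | F_j) * P(F_j), normalized over all fields.
        joint = [lm(term) * prior for lm, prior in zip(collection_lms, field_priors)]
        total = sum(joint)
        return [p / total for p in joint]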

Page 14: A Field Relevance Model

Using the Field Relevance for Retrieval
• Field Relevance Model
• Comparison with PRM-S
  • FRM has the same functional form as PRM-S: per-term field weights times per-term field scores
  • FRM differs in how the per-term field weights are estimated

Page 15: A Field Relevance Model

ESTIMATING FIELD RELEVANCE


Page 16: A Field Relevance Model

Estimating Field Relevance: in a Nutshell
• If the user provides feedback
  • A relevant document provides sufficient information
• If no feedback is available
  • Combine field-level term statistics from multiple sources

[Diagram: the field relevance over content / title / from-to estimated from relevant docs ≅ collection statistics + top-k docs statistics]

Page 17: A Field Relevance Model

Estimating Field Relevance using Feedback
• Assume a user who marked DR as relevant
  • Estimate field relevance from the field-level term distribution of DR
• We can personalize the results accordingly
  • Rank higher the docs with a similar field-level term distribution

Field Relevance for DR (a sketch follows below):
• <to> is relevant for 'james'
• <content> is relevant for 'registration'
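A minimal sketch of this feedback-based estimate, assuming DR is given as a mapping from field name to token list; the additive smoothing constant is an assumption, not from the talk.

    def field_relevance_from_feedback(term, relevant_doc, alpha=0.1):
        # relevant_doc: dict mapping field name -> list of tokens in D_R.
        # Returns the smoothed distribution of the term over D_R's fields.
        counts = {f: toks.count(term) + alpha for f, toks in relevant_doc.items()}
        total = sum(counts.values())
        return {f: c / total for f, c in counts.items()}

    # For the example above, the mass concentrates on <to>:
    d_r = {"to": ["james"], "subject": ["meeting"], "content": ["registration", "form"]}
    print(field_relevance_from_feedback("james", d_r))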

Page 18: A Field Relevance Model

Estimating Field Relevance without Feedback
• Method
  • Linear combination of multiple sources (a sketch follows below)
  • Combination weights estimated using training queries
• Features
  • Field-level term distribution of the collection: unigram and bigram LM (the unigram feature is the same as PRM-S)
  • Field-level term distribution of the top-k docs: unigram and bigram LM (pseudo-relevance feedback)
  • A priori importance of each field (wj), similar to the fixed weights of MFLM and BM25F; estimated using held-out training queries
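A minimal sketch of the linear combination, assuming each source (collection unigram/bigram LM, top-k unigram/bigram LM, field prior) has already been turned into a distribution over the same fields for the given term, and that the mixture weights were learned on training queries; all names are illustrative.

    def combine_sources(sources, mix_weights):
        # sources:      list of dicts, each mapping field -> P(field | term)
        # mix_weights:  one weight per source, learned on training queries
        fields = sources[0].keys()
        combined = {f: sum(w * src[f] for src, w in zip(sources, mix_weights))
                    for f in fields}
        total = sum(combined.values())
        return {f: p / total for f, p in combined.items()}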

Page 19: A Field Relevance Model

EXPERIMENTS


Page 20: A Field Relevance Model

Experimental Setup
• Collections
  • TREC Emails
  • IMDB Movies
  • Monster Resumes
• Distribution of the Most Relevant Field

           #Documents   #Queries   #RelDocs / Query
TREC       198,394      125        1
IMDB       437,281      50         2
Monster    1,034,795    60         15

Page 21: A Field Relevance Model

Query Examples (Indri)
• Oracle Estimates of Field Relevance

[Examples shown for the TREC, IMDB, and Monster collections]

Page 22: A Field Relevance Model

Retrieval Methods Compared
• Baselines
  • DQL / BM25F
  • MFLM: field weights fixed regardless of terms
  • PRM-S: field weights estimated using the collection
• Field Relevance Models
  • FRM-C: estimated using the combination of sources
  • FRM-O: estimated using relevant documents

The methods differ only in the field weighting!

Page 23: A Field Relevance Model

Retrieval Effectiveness (Metric: Mean Reciprocal Rank)

           DQL     BM25F   MFLM    PRM-S   FRM-C   FRM-O
TREC       54.2%   59.7%   60.1%   62.4%   66.8%   79.4%
IMDB       40.8%   52.4%   61.2%   63.7%   65.7%   70.4%
Monster    42.9%   27.9%   46.0%   54.2%   55.8%   71.6%

DQL, BM25F, and MFLM use fixed field weights; PRM-S, FRM-C, and FRM-O use per-term field weights.

Page 24: A Field Relevance Model

Quality of Field Relevance Estimation
• Aggregated KL-Divergence from Oracle Estimates (lower is better)
• Aggregated Cosine Similarity with Oracle Estimates (higher is better)

[Charts: both measures for MFLM, PRM-S, and FRM-C on the TREC, Monster, and IMDB collections]
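For reference, a sketch of the two comparison measures between an estimated field distribution and the oracle one; the epsilon floor is an assumption to keep the logarithm finite.

    import math

    def kl_divergence(oracle, estimate, eps=1e-10):
        # KL(oracle || estimate) over document fields; lower is better.
        return sum(p * math.log(p / max(estimate[f], eps))
                   for f, p in oracle.items() if p > 0)

    def cosine_similarity(p, q):
        # Cosine similarity between two field distributions; higher is better.
        dot = sum(p[f] * q.get(f, 0.0) for f in p)
        norm_p = math.sqrt(sum(v * v for v in p.values()))
        norm_q = math.sqrt(sum(v * v for v in q.values()))
        return dot / (norm_p * norm_q)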

Page 25: A Field Relevance Model

Feature Ablation Results
• Features Revisited
  • Field-level term distribution of the collection (PRM-S)
  • Field-level term distribution of the top-k documents
  • A priori importance of each field (prior)
• Results for the TREC Collection

Feature Set   All     -rug/rbg   -cbg/rbg   -cbg/cug   -prior
MAP           0.668   0.662      0.651      0.648      0.644
%Reduction    0%      -0.9%      -2.5%      -3%        -3.6%

Notation:       Unigram   Bigram
Collection LM   cug       cbg
Top-k Docs LM   rug       rbg

Page 26: A Field Relevance Model

CONCLUSIONS


Page 27: A Field Relevance Model

Summary
• Field relevance as a generalization of field weighting
  • Relevance modeling for structured document retrieval
• Field relevance model for structured document retrieval
  • Uses field relevance to combine per-field LM scores
• Estimating the field relevance using relevant docs
  • Provides a natural way to incorporate relevance feedback
• Estimating the field relevance by combining sources
  • Improved performance over MFLM and PRM-S

Page 28: A Field Relevance Model

Ongoing Work
• Large-scale batch evaluation on a book collection
  • Test collections built using OpenLibrary.org query logs
• Evaluation of relevance feedback on FRM
  • Does relevance feedback improve subsequent results?
• Integrating term relevance and field relevance
  • Further improvement is expected when the two are combined

[Diagram: Field Relevance + Term Relevance]

Page 29: A Field Relevance Model

I'm on the job market!
• Structured Document Retrieval
  • A Probabilistic Retrieval Model for Semi-structured Data [ECIR09]
  • A Field Relevance Model for Structured Document Retrieval [ECIR12]
• Personal Search
  • Retrieval Experiments using Pseudo-Desktop Collections [CIKM09]
  • Ranking using Multiple Document Types in Desktop Search [SIGIR10]
  • Evaluating an Associative Browsing Model for Personal Info. [CIKM11]
  • Evaluating Search in Personal Social Media Collections [WSDM12]
• Web Search
  • An Analysis of Instability for Web Search Results [ECIR10]
  • Characterizing Web Content, User Interests, and Search Behavior by Reading Level and Topic [WSDM12]

More at @jin4ir, or cs.umass.edu/~jykim

Page 30: A Field Relevance Model

OPTIONAL SLIDES


Page 31: A Field Relevance Model

Optimality of Field Relevance Estimation
• The feedback-based estimate yields the optimal field weighting
  • Scores DR as highly as possible against the other docs
  • Under the language modeling framework for IR

Proof in the extended version.

Page 32: A Field Relevance Model

Features based on Field-level Term Distributions
• Summary and estimation:

                Unigram          Bigram
Collection LM   cug (= PRM-S)    cbg
Top-k Docs LM   rug              rbg