Kazai cikm2013-intent

User Intent and Assessor Disagreement in Web Search Evaluation
Gabriella Kazai, Emine Yilmaz, Nick Craswell, S.M.M. Tahaghoghi

Description

Gabriella Kazai, Emine Yilmaz, Nick Craswell, and S.M.M. Tahaghoghi. 2013. User intent and assessor disagreement in web search evaluation. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management (CIKM '13). ACM, New York, NY, USA, 699-708. DOI=10.1145/2505515.2505716 http://doi.acm.org/10.1145/2505515.2505716

Abstract: Preference-based methods for collecting relevance data for information retrieval (IR) evaluation have been shown to lead to better inter-assessor agreement than the traditional method of judging individual documents. However, little is known about why preference judging reduces assessor disagreement, and whether better agreement among assessors also means better agreement with user satisfaction, as signalled by user clicks. In this paper, we examine the relationship between assessor disagreement and various click-based measures, such as click preference strength and user intent similarity, for judgments collected from editorial judges and crowd workers using single absolute, pairwise absolute and pairwise preference-based judging methods. We find that trained judges are significantly more likely to agree with each other and with users than crowd workers, but inter-assessor agreement does not mean agreement with users. Switching to a pairwise judging mode improves crowdsourcing quality to close to that of trained judges. We also find a relationship between intent similarity and assessor-user agreement, where the nature of the relationship changes across judging modes. Overall, our findings suggest that the awareness of different possible intents, enabled by pairwise judging, is a key reason for the improved agreement, and a crucial requirement when crowdsourcing relevance data.

Transcript of Kazai cikm2013-intent

Page 1: Kazai cikm2013-intent

User Intent and Assessor Disagreement in

Web Search Evaluation

Gabriella Kazai, Emine Yilmaz, Nick Craswell, S.M.M. Tahaghoghi

Page 2: Kazai cikm2013-intent

Information Retrieval Evaluation

IR System Evaluation

Page 3: Kazai cikm2013-intent

Main Tests and Analysis

1. Test crowd judges and trained judges on inter-assessor agreement and user (i.e. click) agreement
– Single judging UI
– Pairwise judging UI

2. When clicks show a strong preference, analyse judge quality

3. When clicks indicate substitutability, analyse judge quality

Page 4: Kazai cikm2013-intent

Relevance judgments

Judge groups

Evaluation measures

Click-based properties of web pages

Click Preference Strength*

*Called click-agreement in the paper

Page 5: Kazai cikm2013-intent

Click-based Properties of Web Pages

• Sample (q, u, v) where URLs u and v are adjacent, one or both are clicked, and we have seen both orders (uv and vu)

• Click Preference Strength (a hat marks the clicked URL and the subscript order gives the display order, so c_{\hat{u}v} counts impressions where u is shown above v and only u is clicked):

P_{uv} = \frac{(c_{\hat{u}v} + c_{v\hat{u}}) - (c_{u\hat{v}} + c_{\hat{v}u})}{c_{\hat{u}v} + c_{v\hat{u}} + c_{u\hat{v}} + c_{\hat{v}u} + c_{\hat{u}\hat{v}} + c_{\hat{v}\hat{u}}}

• Intent Similarity (Dupe) score* (Radlinski et al. WSDM 2011):

R_{uv} = \min\left( \frac{c_{\hat{u}v}}{c_{\hat{u}v} + c_{u\hat{v}} + c_{\hat{u}\hat{v}}},\ \frac{c_{\hat{v}u}}{c_{\hat{v}u} + c_{v\hat{u}} + c_{\hat{v}\hat{u}}} \right)

*Paper has two other intent similarity measures

(A computational sketch of both measures follows.)
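As a concrete illustration, here is a minimal Python sketch, not the authors' code, of how P_uv and R_uv can be computed from the six pooled click counts of one (q, u, v) sample; the class and field names are mine.

```python
# Minimal sketch of the two click-based measures above; field names are
# illustrative. 'uv' / 'vu' is the display order of the adjacent pair.
from dataclasses import dataclass


@dataclass
class PairClicks:
    """Pooled click counts for one (q, u, v) sample."""
    u_only_uv: int  # only u clicked, u shown above v
    u_only_vu: int  # only u clicked, v shown above u
    v_only_uv: int  # only v clicked, u shown above v
    v_only_vu: int  # only v clicked, v shown above u
    both_uv: int    # both clicked, u shown above v
    both_vu: int    # both clicked, v shown above u


def click_preference_strength(c: PairClicks) -> float:
    """P_uv in [-1, 1]; positive when clicks favour u over v."""
    u_only = c.u_only_uv + c.u_only_vu
    v_only = c.v_only_uv + c.v_only_vu
    total = u_only + v_only + c.both_uv + c.both_vu
    return (u_only - v_only) / total if total else 0.0


def dupe_score(c: PairClicks) -> float:
    """R_uv in [0, 1]; high when, in either display order, the top result
    alone tends to get the click, i.e. the pages look interchangeable."""
    denom_uv = c.u_only_uv + c.v_only_uv + c.both_uv  # clicked impressions, order uv
    denom_vu = c.v_only_vu + c.u_only_vu + c.both_vu  # clicked impressions, order vu
    if denom_uv == 0 or denom_vu == 0:
        return 0.0  # sketch choice: undefined score treated as no evidence
    return min(c.u_only_uv / denom_uv, c.v_only_vu / denom_vu)
```

For example, if clicks land almost exclusively on u in both orders, P_uv approaches 1; if whichever URL is shown on top absorbs the click, R_uv is high.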

Page 6: Kazai cikm2013-intent

Experiment Setup

[2×2 design: judging UI (Single vs. Pairwise) × judge group (Trained judges vs. Crowd workers)]

Page 7: Kazai cikm2013-intent

Method of Analysis

• Inter-assessor agreement
– Fleiss' kappa

• User-assessor agreement
– Based on directional agreement between the judgment-based preference and the click-based preference over pairs of URLs

Def         Example case
Agree       J(URL1) > J(URL2)  &  C(URL1) > C(URL2)
Disagree    J(URL1) > J(URL2)  &  C(URL1) < C(URL2)
Undetected  J(URL1) = J(URL2)  &  C(URL1) < C(URL2)

(A sketch of this classification follows.)
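A minimal sketch, assuming signed preference scores per URL pair (e.g. a judged-relevance difference and P_uv), of the Agree / Disagree / Undetected classification above; the function names and inputs are illustrative, not the paper's code.

```python
# Sketch of the user-assessor agreement bookkeeping defined above.
# judge_pref = J(URL1) - J(URL2); click_pref = signed click preference, e.g. P_uv.
# Pairs are assumed sampled so that clicks show some preference (click_pref != 0),
# matching the Undetected definition.
from collections import Counter
from typing import Dict, Iterable, Tuple


def classify_pair(judge_pref: float, click_pref: float) -> str:
    if judge_pref == 0:
        return "undetected"   # judges tie, clicks prefer one URL
    if (judge_pref > 0) == (click_pref > 0):
        return "agree"        # judges and clicks prefer the same URL
    return "disagree"         # judges and clicks prefer opposite URLs


def agreement_breakdown(pairs: Iterable[Tuple[float, float]]) -> Dict[str, float]:
    """Percentage of pairs in each class, as reported in the Results 1 tables."""
    counts = Counter(classify_pair(j, c) for j, c in pairs)
    total = sum(counts.values())
    return {k: 100.0 * v / total for k, v in counts.items()}
```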

Page 8: Kazai cikm2013-intent

RESULTS 1

What is the relationship between inter-assessor agreement and agreement with web users (click-agreement) for crowd and editorial judges in different judging modes?

Page 9: Kazai cikm2013-intent

Results 1

• Trained judges agree better with each other and with users than crowd

• Pairwise UI leads to better agreement than single UI

• Inter-assessor agreement does NOT mean user-assessor agreement

Inter-assessor Agreement (Fleiss' kappa)

                  Single UI   Pairwise UI
Crowd workers     0.24        0.29
Editorial judges  0.51        0.57

User-assessor Agreement (%), reported as Agree – Undetected – Disagree

                  Single UI       Pairwise UI
Crowd workers     45 – 27 – 28    56 – 24 – 20
Editorial judges  58 – 21 – 21    66 – 18 – 16

Page 10: Kazai cikm2013-intent

RESULTS 2

When web users show a strong preference for a result, do we see a change in inter-assessor agreement or in click-agreement for editorial or crowd judges?

Page 11: Kazai cikm2013-intent

(Recap from Page 5) Click-based Properties of Web Pages: sampling of adjacent, clicked URL pairs (q, u, v) seen in both orders, Click Preference Strength P_{uv}, and the Intent Similarity (Dupe) score R_{uv} of Radlinski et al. (WSDM 2011), as defined earlier.

Page 12: Kazai cikm2013-intent

Results 2a

[Figure: inter-assessor agreement (Y axis) vs. click preference strength (X axis); panels: Editorial vs. Crowd judges, Single vs. Pairwise UI]

• No relationship for crowd workers

• Positive trend for trained judges: they agree more with each other as P_uv increases, especially for high click-volume URL pairs (the 50k red line)

(A sketch of this binning analysis follows.)
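The trend curves in Results 2a/2b plot an agreement rate against binned click preference strength, with separate curves for different minimum click volumes (e.g. the 50k line). Below is a minimal sketch of that binning, assuming per-pair records of (P_uv, click volume, agreed-or-not); the function and its parameters are illustrative, not the paper's code.

```python
# A sketch (my framing, not the authors' code) of the binned trend
# analysis behind Results 2a/2b: group URL pairs by click preference
# strength and report the agreement rate per bin, optionally keeping
# only pairs above a minimum click volume (e.g. the 50k curve).
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def agreement_by_puv_bin(
    pairs: Iterable[Tuple[float, int, bool]],  # (P_uv, click volume, agreed?)
    bin_width: float = 0.2,
    min_clicks: int = 0,
) -> Dict[float, float]:
    """Map each bin's lower edge to the fraction of pairs in agreement
    (inter-assessor agreement for 2a, user-assessor agreement for 2b)."""
    hits: Dict[int, int] = defaultdict(int)
    totals: Dict[int, int] = defaultdict(int)
    for puv, volume, agreed in pairs:
        if volume < min_clicks:
            continue
        idx = int(abs(puv) // bin_width)  # bin by preference strength |P_uv|
        totals[idx] += 1
        hits[idx] += int(agreed)
    return {round(i * bin_width, 3): hits[i] / totals[i] for i in sorted(totals)}
```

Calling it with min_clicks=50_000 would correspond to the high click-volume (red) curve in the slides.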

Page 13: Kazai cikm2013-intent

Results 2b

• With higher P_uv, all judge groups agree better with web users (positive trends)

• Pairwise judging induces judging patterns for the crowd that are more similar to the editorial judges'

[Figure: user-assessor agreement (Y axis) vs. click preference strength (X axis); panels: Editorial vs. Crowd judges, Single vs. Pairwise UI]

Page 14: Kazai cikm2013-intent

RESULTS 3

When two documents are detected as satisfying similar intents, do we see a change in inter-assessor agreement or click-agreement for editorial or crowd judges?

Page 15: Kazai cikm2013-intent

Results 3a

• Positive trend, except PC (pairwise UI, crowd judges): judges agree with each other more on more redundant (dupe) pages

• Crowd judges' inter-assessor agreement has no clear relationship with the Dupe score

[Figure: inter-assessor agreement (Y axis) vs. intent similarity / Dupe score (X axis); panels: Editorial vs. Crowd judges, Single vs. Pairwise UI]

Page 16: Kazai cikm2013-intent

Results 3b

• Positive trend, except SC (single UI, crowd judges)

• Pairwise UI exposes properties of web pages that can improve judging quality when judges face more interchangeable documents, leading to better agreement with web users (even if not with other judges)

[Figure: user-assessor agreement (Y axis) vs. intent similarity / Dupe score (X axis); panels: Editorial vs. Crowd judges, Single vs. Pairwise UI]

Page 17: Kazai cikm2013-intent

Conclusions

[Summary diagram: inter-assessor agreement, user-assessor agreement, click preference strength and intent similarity across Single/Pairwise UI and Trained judges/Crowd]

• Different assessment procedure → different properties

• Trained judges beat crowd judges

• Pairwise UI beats single UI on both inter-assessor and user-assessor agreement

• Note: specific to our method of sampling adjacent URLs?

• Open issue: optimizing your assessment procedure