Kazai cikm2013-intent

User Intent and Assessor Disagreement in Web Search Evaluation
Gabriella Kazai, Emine Yilmaz, Nick Craswell, S.M.M. Tahaghoghi

Description

Gabriella Kazai, Emine Yilmaz, Nick Craswell, and S.M.M. Tahaghoghi. 2013. User intent and assessor disagreement in web search evaluation. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management (CIKM '13). ACM, New York, NY, USA, 699-708. DOI=10.1145/2505515.2505716 http://doi.acm.org/10.1145/2505515.2505716

Abstract: Preference-based methods for collecting relevance data for information retrieval (IR) evaluation have been shown to lead to better inter-assessor agreement than the traditional method of judging individual documents. However, little is known about why preference judging reduces assessor disagreement, and whether better agreement among assessors also means better agreement with user satisfaction, as signalled by user clicks. In this paper, we examine the relationship between assessor disagreement and various click-based measures, such as click preference strength and user intent similarity, for judgments collected from editorial judges and crowd workers using single absolute, pairwise absolute and pairwise preference-based judging methods. We find that trained judges are significantly more likely to agree with each other and with users than crowd workers, but inter-assessor agreement does not mean agreement with users. Switching to a pairwise judging mode improves crowdsourcing quality to close to that of trained judges. We also find a relationship between intent similarity and assessor-user agreement, where the nature of the relationship changes across judging modes. Overall, our findings suggest that the awareness of different possible intents, enabled by pairwise judging, is a key reason for the improved agreement, and a crucial requirement when crowdsourcing relevance data.

Transcript of Kazai cikm2013-intent

Page 1: Kazai cikm2013-intent

User Intent and Assessor Disagreement in

Web Search Evaluation

Gabriella Kazai, Emine Yilmaz, Nick Craswell, S.M.M. Tahaghoghi

Page 2: Kazai cikm2013-intent

Information Retrieval Evaluation

IR System Evaluation

Page 3: Kazai cikm2013-intent

Main Tests and Analysis

1. Test crowd judges and trained judges on inter-assessor agreement and user (i.e. click) agreement
– Single judging UI
– Pairwise judging UI

2. When clicks show a strong preference, analyse judge quality

3. When clicks indicate substitutability, analyse judge quality

Page 4: Kazai cikm2013-intent

Relevance judgments

Judge groups

Evaluation measures

Click-based properties of web pages

Click Preference Strength*

*Called click-agreement in the paper

Page 5: Kazai cikm2013-intent

Click-based Properties of Web Pages

• Sample (q, u, v) where URLs u and v are adjacent, one or both are clicked, and we have seen both orders (uv and vu)

• Click Preference Strength (a hat marks the clicked URL and the subscript order gives the display order, so c_{\hat{u}v} counts impressions where u is shown above v and only u is clicked):

P_{uv} = \frac{(c_{\hat{u}v} + c_{v\hat{u}}) - (c_{u\hat{v}} + c_{\hat{v}u})}{c_{\hat{u}v} + c_{v\hat{u}} + c_{u\hat{v}} + c_{\hat{v}u} + c_{\hat{u}\hat{v}} + c_{\hat{v}\hat{u}}}

• Intent Similarity (Dupe) score* (Radlinski et al. WSDM 2011):

R_{uv} = \min\left( \frac{c_{\hat{u}v}}{c_{\hat{u}v} + c_{u\hat{v}} + c_{\hat{u}\hat{v}}},\ \frac{c_{\hat{v}u}}{c_{\hat{v}u} + c_{v\hat{u}} + c_{\hat{v}\hat{u}}} \right)

*Paper has two other intent similarity measures

(A computational sketch of both measures follows.)
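As a concrete illustration, here is a minimal Python sketch, not the authors' code, of how P_uv and R_uv can be computed from the six pooled click counts of one (q, u, v) sample; the class and field names are mine.

```python
# Minimal sketch of the two click-based measures above; field names are
# illustrative. 'uv' / 'vu' is the display order of the adjacent pair.
from dataclasses import dataclass


@dataclass
class PairClicks:
    """Pooled click counts for one (q, u, v) sample."""
    u_only_uv: int  # only u clicked, u shown above v
    u_only_vu: int  # only u clicked, v shown above u
    v_only_uv: int  # only v clicked, u shown above v
    v_only_vu: int  # only v clicked, v shown above u
    both_uv: int    # both clicked, u shown above v
    both_vu: int    # both clicked, v shown above u


def click_preference_strength(c: PairClicks) -> float:
    """P_uv in [-1, 1]; positive when clicks favour u over v."""
    u_only = c.u_only_uv + c.u_only_vu
    v_only = c.v_only_uv + c.v_only_vu
    total = u_only + v_only + c.both_uv + c.both_vu
    return (u_only - v_only) / total if total else 0.0


def dupe_score(c: PairClicks) -> float:
    """R_uv in [0, 1]; high when, in either display order, the top result
    alone tends to get the click, i.e. the pages look interchangeable."""
    denom_uv = c.u_only_uv + c.v_only_uv + c.both_uv  # clicked impressions, order uv
    denom_vu = c.v_only_vu + c.u_only_vu + c.both_vu  # clicked impressions, order vu
    if denom_uv == 0 or denom_vu == 0:
        return 0.0  # sketch choice: undefined score treated as no evidence
    return min(c.u_only_uv / denom_uv, c.v_only_vu / denom_vu)
```

For example, if clicks land almost exclusively on u in both orders, P_uv approaches 1; if whichever URL is shown on top absorbs the click, R_uv is high.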

Page 6: Kazai cikm2013-intent

Experiment Setup

[2×2 design: judging UI (Single vs. Pairwise) × judge group (Trained judges vs. Crowd workers)]

Page 7: Kazai cikm2013-intent

Method of Analysis

• Inter-assessor agreement
– Fleiss' kappa

• User-assessor agreement
– Based on directional agreement between the judgment-based preference and the click-based preference over pairs of URLs

Def         Example case
Agree       J(URL1) > J(URL2)  &  C(URL1) > C(URL2)
Disagree    J(URL1) > J(URL2)  &  C(URL1) < C(URL2)
Undetected  J(URL1) = J(URL2)  &  C(URL1) < C(URL2)

(A sketch of this classification follows.)
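A minimal sketch, assuming signed preference scores per URL pair (e.g. a judged-relevance difference and P_uv), of the Agree / Disagree / Undetected classification above; the function names and inputs are illustrative, not the paper's code.

```python
# Sketch of the user-assessor agreement bookkeeping defined above.
# judge_pref = J(URL1) - J(URL2); click_pref = signed click preference, e.g. P_uv.
# Pairs are assumed sampled so that clicks show some preference (click_pref != 0),
# matching the Undetected definition.
from collections import Counter
from typing import Dict, Iterable, Tuple


def classify_pair(judge_pref: float, click_pref: float) -> str:
    if judge_pref == 0:
        return "undetected"   # judges tie, clicks prefer one URL
    if (judge_pref > 0) == (click_pref > 0):
        return "agree"        # judges and clicks prefer the same URL
    return "disagree"         # judges and clicks prefer opposite URLs


def agreement_breakdown(pairs: Iterable[Tuple[float, float]]) -> Dict[str, float]:
    """Percentage of pairs in each class, as reported in the Results 1 tables."""
    counts = Counter(classify_pair(j, c) for j, c in pairs)
    total = sum(counts.values())
    return {k: 100.0 * v / total for k, v in counts.items()}
```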

Page 8: Kazai cikm2013-intent

RESULTS 1

What is the relationship between inter-assessor agreement and agreement with web users (click-agreement) for crowd and editorial judges in different judging modes?

Page 9: Kazai cikm2013-intent

Results 1

• Trained judges agree better with each other and with users than crowd

• Pairwise UI leads to better agreement than single UI

• Inter-assessor agreement does NOT mean user-assessor agreement

Inter-assessor Agreement (Fleiss' kappa)

                  Single UI   Pairwise UI
Crowd workers     0.24        0.29
Editorial judges  0.51        0.57

User-assessor Agreement (%), reported as Agree – Undetected – Disagree

                  Single UI       Pairwise UI
Crowd workers     45 – 27 – 28    56 – 24 – 20
Editorial judges  58 – 21 – 21    66 – 18 – 16

Page 10: Kazai cikm2013-intent

RESULTS 2

When web users show a strong preference for a result, do we see a change in inter-assessor agreement or in click-agreement for editorial or crowd judges?

Page 11: Kazai cikm2013-intent

(Recap from Page 5) Click-based Properties of Web Pages: sampling of adjacent, clicked URL pairs (q, u, v) seen in both orders, Click Preference Strength P_{uv}, and the Intent Similarity (Dupe) score R_{uv} of Radlinski et al. (WSDM 2011), as defined earlier.

Page 12: Kazai cikm2013-intent

Results 2a

[Figure: inter-assessor agreement (Y axis) vs. click preference strength (X axis); panels: Editorial vs. Crowd judges, Single vs. Pairwise UI]

• No relationship for crowd workers

• Positive trend for trained judges: they agree more with each other as P_uv increases, especially for high click-volume URL pairs (the 50k red line)

(A sketch of this binning analysis follows.)
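The trend curves in Results 2a/2b plot an agreement rate against binned click preference strength, with separate curves for different minimum click volumes (e.g. the 50k line). Below is a minimal sketch of that binning, assuming per-pair records of (P_uv, click volume, agreed-or-not); the function and its parameters are illustrative, not the paper's code.

```python
# A sketch (my framing, not the authors' code) of the binned trend
# analysis behind Results 2a/2b: group URL pairs by click preference
# strength and report the agreement rate per bin, optionally keeping
# only pairs above a minimum click volume (e.g. the 50k curve).
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def agreement_by_puv_bin(
    pairs: Iterable[Tuple[float, int, bool]],  # (P_uv, click volume, agreed?)
    bin_width: float = 0.2,
    min_clicks: int = 0,
) -> Dict[float, float]:
    """Map each bin's lower edge to the fraction of pairs in agreement
    (inter-assessor agreement for 2a, user-assessor agreement for 2b)."""
    hits: Dict[int, int] = defaultdict(int)
    totals: Dict[int, int] = defaultdict(int)
    for puv, volume, agreed in pairs:
        if volume < min_clicks:
            continue
        idx = int(abs(puv) // bin_width)  # bin by preference strength |P_uv|
        totals[idx] += 1
        hits[idx] += int(agreed)
    return {round(i * bin_width, 3): hits[i] / totals[i] for i in sorted(totals)}
```

Calling it with min_clicks=50_000 would correspond to the high click-volume (red) curve in the slides.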

Page 13: Kazai cikm2013-intent

Results 2b

• With higher P_uv, all judge groups agree better with web users (positive trends)

• Pairwise judging induces judging patterns for the crowd that are more similar to the editorial judges'

[Figure: user-assessor agreement (Y axis) vs. click preference strength (X axis); panels: Editorial vs. Crowd judges, Single vs. Pairwise UI]

Page 14: Kazai cikm2013-intent

RESULTS 3

When two documents are detected as satisfying similar intents, do we see a change in inter-assessor agreement or click-agreement for editorial or crowd judges?

Page 15: Kazai cikm2013-intent

Results 3a

• Positive trend, except PC (pairwise UI, crowd judges): judges agree with each other more on more redundant (dupe) pages

• Crowd judges' inter-assessor agreement has no clear relationship with the Dupe score

[Figure: inter-assessor agreement (Y axis) vs. intent similarity / Dupe score (X axis); panels: Editorial vs. Crowd judges, Single vs. Pairwise UI]

Page 16: Kazai cikm2013-intent

Results 3b

• Positive trend, except SC (single UI, crowd judges)

• Pairwise UI exposes properties of web pages that can improve judging quality when judges face more interchangeable documents, leading to better agreement with web users (even if not with other judges)

[Figure: user-assessor agreement (Y axis) vs. intent similarity / Dupe score (X axis); panels: Editorial vs. Crowd judges, Single vs. Pairwise UI]

Page 17: Kazai cikm2013-intent

Conclusions

[Summary diagram: inter-assessor agreement, user-assessor agreement, click preference strength and intent similarity across Single/Pairwise UI and Trained judges/Crowd]

• Different assessment procedure → different properties

• Trained judges beat crowd judges

• Pairwise UI beats single UI on both inter-assessor and user-assessor agreement

• Note: specific to our method of sampling adjacent URLs?

• Open issue: optimizing your assessment procedure