User Intent and Assessor Disagreement in
Web Search Evaluation
Gabriella Kazai, Emine Yilmaz, Nick Craswell, S.M.M. Tahaghoghi
Information Retrieval Evaluation
IR system Evaluation
Main Tests and Analysis
1. Test crowd judges and trained judges on inter-assessor agreement and user (i.e. click) agreement
   – Single judging UI
   – Pairwise judging UI
2. When clicks show a strong preference, analyse judge quality
3. When clicks indicate substitutability, analyse judge quality
Relevance judgments
Judge groups
Evaluation measures
Click-based properties of web pages
Click Preference Strength
*Click-agreement in paper
• Sample (q,u,v) where URLs u and v are adjacent, one or both are clicked, and we have seen both orders (uv and vu)
• Click Preference Strength
• Dupe score (Radlinski et al. WSDM 2011)
*Paper has two other intent similarity measures
Click-based Properties of Web Pages
Click Preference Strength:
$$P_{uv} = \frac{(c_{\hat{u}v} + c_{v\hat{u}}) - (c_{u\hat{v}} + c_{\hat{v}u})}{c_{\hat{u}v} + c_{v\hat{u}} + c_{u\hat{v}} + c_{\hat{v}u} + c_{\hat{u}\hat{v}} + c_{\hat{v}\hat{u}}}$$
Intent Similarity (Dupe)*:
$$R_{uv} = \min\left(\frac{c_{\hat{u}\hat{v}}}{c_{\hat{u}v} + c_{u\hat{v}} + c_{\hat{u}\hat{v}}},\; \frac{c_{\hat{v}\hat{u}}}{c_{\hat{v}u} + c_{v\hat{u}} + c_{\hat{v}\hat{u}}}\right)$$
where $c_{xy}$ counts impressions in which $x$ was shown directly above $y$, and a hat marks a clicked result.
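Both click-based measures can be computed from the six per-order click counts. A minimal sketch; the argument names are hypothetical, the formulas follow the slide's definitions:

```python
def preference_strength(u_top_u, u_top_v, u_top_both,
                        v_top_u, v_top_v, v_top_both):
    """Click Preference Strength P_uv for an adjacent URL pair (u, v).

    Each argument counts impressions, e.g. u_top_v = u shown above v
    and v clicked; 'both' = both results clicked (hypothetical names).
    """
    total = u_top_u + u_top_v + u_top_both + v_top_u + v_top_v + v_top_both
    # net clicks on u minus net clicks on v, over all clicked impressions
    return ((u_top_u + v_top_u) - (u_top_v + v_top_v)) / total

def dupe_score(u_top_u, u_top_v, u_top_both,
               v_top_u, v_top_v, v_top_both):
    """Intent similarity (Dupe, Radlinski et al. WSDM 2011): the smaller,
    over the two display orders, of the fraction of clicked impressions
    in which both results were clicked."""
    r_u_top = u_top_both / (u_top_u + u_top_v + u_top_both)
    r_v_top = v_top_both / (v_top_v + v_top_u + v_top_both)
    return min(r_u_top, r_v_top)
```

When only u is ever clicked, P_uv is 1; when every clicked impression has both results clicked, the Dupe score is 1.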
Experiment Setup
• Judging UIs: Single vs. Pairwise
• Judge groups: Trained (editorial) judges vs. Crowd workers
Method of Analysis
• Inter-assessor agreement: Fleiss' kappa
• User-assessor agreement: based on directional agreement between judgment-based preference and click-based preference over pairs of URLs
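Fleiss' kappa for the inter-assessor analysis can be sketched generically (this is a standard implementation, not the authors' code):

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for N items, each labelled by the same number of raters.

    ratings: list of per-item lists of category labels, one label per rater.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({c for item in ratings for c in item})
    # n_ij: number of raters assigning item i to category j
    counts = [[Counter(item)[c] for c in categories] for item in ratings]
    # mean per-item agreement P_bar
    p_bar = sum((sum(n * n for n in row) - n_raters)
                / (n_raters * (n_raters - 1))
                for row in counts) / n_items
    # chance agreement P_e from marginal category proportions
    p_j = [sum(row[j] for row in counts) / (n_items * n_raters)
           for j in range(len(categories))]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)
```

Perfect agreement yields kappa = 1; the reported values (0.24–0.57 below) sit in the usual "fair" to "moderate" range.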
Inter-assessor Agreement
User-Assessor Agreement
Case        Definition (example)
Agree       J(URL1) > J(URL2) & C(URL1) > C(URL2)
Disagree    J(URL1) > J(URL2) & C(URL1) < C(URL2)
Undetected  J(URL1) = J(URL2) & C(URL1) < C(URL2)
where J is the judgment-based and C the click-based preference.
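The cases in the table can be sketched as a small classifier (function and argument names are hypothetical; combinations the table does not list return None):

```python
def agreement_label(j_u, j_v, c_u, c_v):
    """Classify a URL pair by judgment-based vs. click-based preference.

    j_u, j_v: relevance judgments; c_u, c_v: click-based preference scores.
    Follows the Agree / Disagree / Undetected cases from the slide.
    """
    j = (j_u > j_v) - (j_u < j_v)   # sign of judgment preference
    c = (c_u > c_v) - (c_u < c_v)   # sign of click preference
    if j != 0 and c != 0:
        return "agree" if j == c else "disagree"
    if j == 0 and c != 0:
        return "undetected"         # judges tie where users differ
    return None                     # no click preference: not covered above
```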
RESULTS 1
What is the relationship between inter-assessor agreement and agreement with web users (click-agreement) for crowd and editorial judges in different judging modes?
Results 1
• Trained judges agree better with each other and with users than crowd workers do
• Pairwise UI leads to better agreement than single UI
• Inter-assessor agreement does NOT mean user-assessor agreement
Inter-assessor Agreement (Fleiss' kappa)
                   Single UI   Pairwise UI
Crowd workers        0.24        0.29
Editorial judges     0.51        0.57

User-assessor Agreement (%), as Agree – Undetected – Disagree
                   Single UI       Pairwise UI
Crowd workers      45 – 27 – 28    56 – 24 – 20
Editorial judges   58 – 21 – 21    66 – 18 – 16
RESULTS 2
When web users show a strong preference for a result, do we see a change in inter-assessor agreement or in click-agreement for editorial or crowd judges?
[Figure: inter-assessor agreement (y-axis) vs. Click Preference Strength (x-axis)]
Results 2a
• No relationship for crowd
• Positive trend for trained judges: They agree more with each other as Puv increases, esp. for high click volume URL pairs (50k, red line)
[Panels: Single vs. Pairwise UI; rows: Editorial, Crowd]
Results 2b
• With higher Puv, all judges agree better with web users (positive trends)
• Pairwise judging induces judging patterns for crowd workers that are more similar to those of editorial judges
[Figure: user-assessor agreement (y-axis) vs. Click Preference Strength (x-axis)]
[Panels: Single vs. Pairwise UI; rows: Editorial, Crowd]
RESULTS 3
When two documents are detected as satisfying similar intents, do we see a change in inter-assessor agreement or click-agreement for editorial or crowd judges?
Results 3a
• Positive trend, except PC (pairwise UI, crowd): judges agree with each other more on more redundant (dupe) pages
• Crowd judges’ inter-assessor agreement has no clear relationship with Dupe score
[Figure: inter-assessor agreement (y-axis) vs. Intent Similarity / Dupe (x-axis)]
[Panels: Single vs. Pairwise UI; rows: Editorial, Crowd]
[Figure: user-assessor agreement (y-axis) vs. Intent Similarity / Dupe (x-axis)]
Results 3b
• Positive trend, except SC (single UI, crowd)
• The pairwise UI exposes properties of web pages that can improve judging quality when judges face more interchangeable documents, leading to better agreement with web users (even if not with other judges)
[Panels: Single vs. Pairwise UI; rows: Editorial, Crowd]
[Diagram: inter-assessor and user-assessor agreement against click preference and intent similarity]
Conclusions
• Different assessment procedures lead to different properties
• Trained judges beat crowd judges
• Pairwise UI beats single UI on both inter-assessor and user-assessor agreement
• Note: Specific to our method of sampling adjacent URLs?
• Open issue: Optimizing your assessment procedure