Ensuring quality in crowdsourced search relevance evaluation:
The effects of training question distribution
John Le - CrowdFlower
Andy Edmonds - eBay
Vaughn Hester - CrowdFlower
Lukas Biewald - CrowdFlower
Background/Motivation
• Human judgments for search relevance evaluation/training
• Quality control in crowdsourcing
• Observed worker regression to the mean over previous months
Our Techniques for Quality Control
• Training data = training questions
  – Questions to which we know the answer
• Dynamic learning for quality control (sketched in the code below)
  – An initial training period
  – Per-HIT screening questions
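Below is a minimal sketch of this kind of gold-question screening: an initial training period followed by one hidden screening question per HIT. The class, threshold value, and data structures are our illustration, not CrowdFlower's implementation.

```python
import random

class GoldScreener:
    """Toy quality control via gold (training) questions.

    Workers first answer an initial block of gold questions; afterwards,
    a hidden gold question is mixed into each HIT and a running accuracy
    decides whether their judgments are trusted.
    """

    def __init__(self, gold, initial_n=4, min_accuracy=0.7):
        self.gold = gold                  # {question_id: known_answer}
        self.initial_n = initial_n        # size of the training period
        self.min_accuracy = min_accuracy  # trust threshold (our assumption)
        self.stats = {}                   # worker_id -> (correct, seen)

    def record(self, worker_id, question_id, answer):
        correct, seen = self.stats.get(worker_id, (0, 0))
        is_right = answer == self.gold[question_id]
        self.stats[worker_id] = (correct + is_right, seen + 1)

    def passed_training(self, worker_id):
        correct, seen = self.stats.get(worker_id, (0, 0))
        return seen >= self.initial_n and correct / seen >= self.min_accuracy

    def screening_question(self):
        # One hidden gold question to embed in each subsequent HIT.
        return random.choice(list(self.gold))
```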
Contributions
• Questions explored
  – Does training data setup and distribution affect worker output and final results?
• Why important?
  – Quality control is paramount
  – Quantifying and understanding the effect of training data
The Experiment: AMT
• Using Mechanical Turk and the CrowdFlower platform
• 25 results per HIT
• 20 cents per HIT
• No Turk qualifications
• Title: “Judge approximately 25 search results for relevance”
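For concreteness, here is roughly how those HIT parameters would look with today's boto3 MTurk client. The 2010 experiment predates this API, and the judgment-interface URL and MaxAssignments value are placeholders.

```python
import boto3

mturk = boto3.client("mturk", region_name="us-east-1")

# ExternalQuestion pointing at the judgment interface (URL is a placeholder).
question_xml = """<ExternalQuestion
  xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">
  <ExternalURL>https://example.com/judge</ExternalURL>
  <FrameHeight>600</FrameHeight>
</ExternalQuestion>"""

hit = mturk.create_hit(
    Title="Judge approximately 25 search results for relevance",
    Description="Rate query/product pairs as Matching, Not Matching, Off Topic, or Spam.",
    Reward="0.20",                       # 20 cents per HIT
    MaxAssignments=5,                    # illustrative; not stated on the slide
    AssignmentDurationInSeconds=3600,
    LifetimeInSeconds=7 * 24 * 3600,
    Question=question_xml,
    # No QualificationRequirements: the task was open to all workers.
)
```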
Judgment Dataset
• Dataset: major online retailer’s internal product search projects
• 256 queries with 5 product pairs associated with each query = 1280 search results
• Examples: “epiphone guitar”, “sofa”, and “yamaha a100”
Experimental Manipulation

Judge training question answer distribution skews:

Experiment       1       2       3       4       5
Matching       72.7%   58.0%   45.3%   34.7%   12.7%
Not Matching    8.0%   23.3%   47.3%   56.0%   84.0%
Off Topic      19.3%   18.0%    7.3%    9.3%    3.3%
Spam            0.0%    0.7%    0.0%    0.7%    0.0%

Underlying distribution skew:

Matching   Not Matching   Off Topic   Spam
14.5%      82.67%         2.5%        0.33%
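One way to realize these manipulations is to sample gold questions by their known answers until the target proportions are met. A sketch under that assumption (the pools, names, and sampling scheme are ours, not the authors' procedure):

```python
import random

# Target answer distributions for the five experiments (from the table above).
SKEWS = {
    1: {"Matching": 0.727, "Not Matching": 0.080, "Off Topic": 0.193, "Spam": 0.000},
    2: {"Matching": 0.580, "Not Matching": 0.233, "Off Topic": 0.180, "Spam": 0.007},
    3: {"Matching": 0.453, "Not Matching": 0.473, "Off Topic": 0.073, "Spam": 0.000},
    4: {"Matching": 0.347, "Not Matching": 0.560, "Off Topic": 0.093, "Spam": 0.007},
    5: {"Matching": 0.127, "Not Matching": 0.840, "Off Topic": 0.033, "Spam": 0.000},
}

def sample_training_questions(pools, experiment, n):
    """Draw n gold questions whose known answers follow the target skew.

    pools maps each label to a list of gold questions with that answer.
    """
    labels = list(SKEWS[experiment])
    weights = [SKEWS[experiment][label] for label in labels]
    chosen_labels = random.choices(labels, weights=weights, k=n)
    return [random.choice(pools[label]) for label in chosen_labels]
```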
Experimental Control
• Round-robin workers into the simultaneously running experiments
• Note: only one HIT showed up on Turk
• Workers were sent to the same experiment if they left and returned
Results
1. Worker participation
2. Mean worker performance
3. Aggregate majority vote
   – Accuracy
   – Performance measures: precision and recall
(aggregation and scoring are sketched in the code below)
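A minimal sketch of the aggregation and scoring behind results 2 and 3: per-item majority vote over workers' judgments, then accuracy plus precision/recall on the Not Matching class. Function and variable names are ours.

```python
from collections import Counter

def majority_vote(judgments):
    """judgments: {item_id: [label, ...]} -> {item_id: winning label}."""
    return {item: Counter(labels).most_common(1)[0][0]
            for item, labels in judgments.items()}

def accuracy(pred, gold):
    return sum(pred[i] == gold[i] for i in gold) / len(gold)

def precision_recall(pred, gold, positive="Not Matching"):
    tp = sum(pred[i] == positive and gold[i] == positive for i in gold)
    fp = sum(pred[i] == positive and gold[i] != positive for i in gold)
    fn = sum(pred[i] != positive and gold[i] == positive for i in gold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```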
Worker Participation

Experiment           1     2       3     4       5
Came to the Task    43    42      42    87      41
Did Training        26    25      27    50      21
Passed Training     19    18      25    37      17
Failed Training      7     7       2    13       4
Percent Passed     73%   72%   92.6%   74%   80.9%

(Experiments 1–5 range from a Matching skew to a Not Matching skew.)
Mean Worker Performance

Metric \ Experiment          1       2       3       4       5
Accuracy (Overall)         0.690   0.708   0.749   0.763   0.790
Precision (Not Matching)   0.909   0.895   0.930   0.917   0.915
Recall (Not Matching)      0.704   0.714   0.774   0.800   0.828

(Experiments 1–5 range from a Matching skew to a Not Matching skew.)
Aggregate Majority Vote Accuracy: Trusted Workers

[Figure: aggregate majority vote accuracy for Experiments 1–5, plotted against the underlying distribution skew.]
Aggregate Majority Vote Performance Measures

Experiment     1       2       3       4       5
Precision    0.921   0.932   0.936   0.932   0.912
Recall       0.865   0.917   0.919   0.863   0.921

(Experiments 1–5 range from a Matching skew to a Not Matching skew.)
Discussion and Limitations
• Maximizing the entropy of the training question distribution minimizes the signal workers can perceive about a skewed underlying distribution (worked example below)
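As a worked illustration of that point, using the distributions from the tables above (the comparison and numbers in comments are ours):

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a label distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Underlying answer skew: Matching, Not Matching, Off Topic, Spam.
underlying = [0.145, 0.8267, 0.025, 0.0033]
# Experiment 3's near-balanced Matching / Not Matching training skew.
experiment_3 = [0.453, 0.473, 0.073, 0.000]

print(entropy(underlying))    # ~0.79 bits: strongly skewed, a signal workers can exploit
print(entropy(experiment_3))  # ~1.30 bits: closer to the 2-bit maximum for four labels
```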
Future Work
• Optimal judgment task design and metrics
• Quality control enhancements
• Separate validation and ongoing training
• Long-term worker performance optimizations
• Incorporation of active learning
• IR performance metric analysis
Acknowledgements
We thank Riddick Jiang for compiling the dataset for this project. We thank Brian Johnson (eBay), James Rubinstein (eBay), Aaron Shaw (Berkeley), Alex Sorokin (CrowdFlower), Chris Van Pelt (CrowdFlower) and Meili Zhong (PayPal) for their assistance with the paper.
QUESTIONS?
john@crowdflower.com
aedmonds@ebay.com
vaughn@crowdflower.com
lukas@crowdflower.com
Thanks!