Crowdsourcing Research Opportunities:
Lessons from Natural Language Processing
Marta Sabou, Kalina Bontcheva, Arno Scharl
Crowdsourcing
Crowdsourcing: outsourcing a task to an undefined and generally large group of people in the form of an open call (Howe, 2006)
Crowdsourcing in Science
Crowdsourcing for NLP
Challenges
Crowdsourcing in science is not new
Citizen science dates from the early 19th century, with 60,000–80,000 volunteers yearly
Sir Francis Galton, "Vox Populi" (1907): the crowd's collective estimate proved remarkably accurate
Genre 1: Mechanised Labour
Participants (workers) are paid a small amount of money to complete simple tasks (HIT = Human Intelligence Task)
Genre 2: Games with a Purpose
From 2008; 240K players
Crowdsourcing via Facebook
Genre 3: Altruistic Crowdsourcing
Example projects: >250K players; >670K players
Crowdsourcing in Science - Typical Use
Along the research pipeline: Input → Process/Algorithm → Output Evaluation
• Input: form-based data collection, labeling, classification, surveys
• Process/Algorithm: harness human intuition to prune the solution space
Crowdsourcing in Science
Crowdsourcing for NLP
Challenges
Crowdsourcing in NLP
Papers relying on crowdsourcing in major NLP venues
Crowdsourcing Genres in NLP
Benefit 1: Affordable, Large-Scale Resources
A variety of small-to-medium-sized resources can be obtained for as little as $100 using AMT
Crowdsourcing is also cost effective for large resources (Poesio, 2012)
Cost comparison for acquiring 1M labels:

Method                     $/label   1M labels ($)
Traditional high-quality    1.00     1,000,000
Mechanical Turk             0.38       380,000 (<40%)
Game                        0.19       217,000 (20%)
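The table's arithmetic can be sketched as follows. This is an illustrative calculation only; the per-label rates follow the slide, and note that the slide's game total ($217K) differs slightly from 0.19 × 1M, presumably due to rounding or overheads in the original source (Poesio, 2012).

```python
# Sketch: comparing annotation costs per method, using the per-label
# rates from the table above (illustrative figures only).

COST_PER_LABEL = {
    "traditional_expert": 1.00,
    "mechanical_turk": 0.38,
    "game_with_a_purpose": 0.19,
}

def total_cost(method: str, n_labels: int) -> float:
    """Total cost in dollars for n_labels under the given method."""
    return COST_PER_LABEL[method] * n_labels

baseline = total_cost("traditional_expert", 1_000_000)
for method in COST_PER_LABEL:
    cost = total_cost(method, 1_000_000)
    print(f"{method}: ${cost:,.0f} ({cost / baseline:.0%} of expert cost)")
```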
Benefit 2: Diversification of research
Challenge 1: Contributor Selection and Training
Shift from selecting and training contributors prior to resource creation to doing so during resource creation
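In practice, selecting contributors during resource creation is often done by embedding known-answer ("gold") items among the real tasks. A minimal sketch, assuming a simple accuracy threshold; the function name, data shapes, and the 0.8 cutoff are illustrative assumptions, not from the talk:

```python
# Sketch: filtering contributors during collection using embedded gold
# (known-answer) items. Thresholds and labels are illustrative.

def is_reliable(answers: dict, gold: dict, min_accuracy: float = 0.8) -> bool:
    """Accept a worker only if they score >= min_accuracy on gold items."""
    gold_items = [item for item in answers if item in gold]
    if not gold_items:
        return False  # no evidence yet: withhold judgement
    correct = sum(answers[item] == gold[item] for item in gold_items)
    return correct / len(gold_items) >= min_accuracy

gold = {"q1": "POS", "q2": "NEG"}
worker = {"q1": "POS", "q2": "NEG", "q3": "POS"}  # q3 is a real (non-gold) item
print(is_reliable(worker, gold))  # True: 2/2 gold items answered correctly
```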
Challenge 2: Aggregation and Quality Control
Shift from a few experts' annotations to multiple, noisy annotations from non-experts
Approach 1: Statistical techniques
• Simplest (and most popular): majority voting
• More complex: machine-learning models trained on various features
Approach 2: Crowdsourcing the QC process itself
• HIT1 (Create): "Translate the following sentence"
• HIT2 (Verify): "Which of these 5 sentences is the best translation?"
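The simplest aggregation strategy above, majority voting, can be sketched in a few lines. The labels and items are illustrative:

```python
# Sketch: aggregating multiple noisy annotations by majority vote
# (Approach 1 above, in its simplest form). Data is illustrative.

from collections import Counter

def majority_vote(labels: list) -> str:
    """Return the most frequent label; ties are broken arbitrarily."""
    return Counter(labels).most_common(1)[0][0]

# Three workers label the same two items:
annotations = {
    "item1": ["POS", "POS", "NEG"],
    "item2": ["NEG", "NEG", "NEG"],
}
aggregated = {item: majority_vote(votes) for item, votes in annotations.items()}
print(aggregated)  # {'item1': 'POS', 'item2': 'NEG'}
```

More sophisticated models (e.g., learning per-worker reliability) refine exactly this step, down-weighting workers who disagree with the consensus too often.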
Conclusions (What have we learned from NLP?)
Crowdsourcing is revolutionising NLP research:
• Cheaper resource acquisition
• Diversification of the research agenda
But it requires more complex methodologies:
• For contributor management
• For quality control and data aggregation
Other findings (most popular):
• Genre: mechanised labour
• Task: acquiring input data
• Problem: solving subjective tasks
Crowdsourcing in Science
Crowdsourcing for NLP
Challenges
User Motivation
Motivating users
• Motivations for scientific projects might differ
• Task granularity might impact motivation
Promoting learning and science
• Advertise STEM research to young people
• Support learning and self-improvement through participation in crowdsourcing
Legal and Ethical Issues
Acknowledging the crowd's contribution
• S. Cooper, [other authors], and Foldit players: Predicting protein structures with a multiplayer online game. Nature, 466(7307):756-760, 2010.
Ensuring privacy and wellbeing
• Mechanised labour criticised for low wages (~$2/hour) and lack of worker rights
• Prevent addiction, prolonged use, and user exploitation
Licensing and consent
• Some projects clearly state the use of Creative Commons licenses
• General failure to provide informed-consent information
Technical Issues
• Scaling up to large resources
• Preventing bias
• Increasing repeatability through reuse of crowdsourcing elements (e.g., HIT templates)
uComp - Embedded Human Computation for Knowledge Extraction and Evaluation
• 3-year project, starting November 2012
• Develops a scalable and generic human computation (HC) framework for knowledge creation
• Provides reusable HC elements
Thank you!