CSCW, San Antonio, TX, Feb 26, 2013
Derek Hansen, Patrick Schone, Douglas Corey, Matthew Reid, & Jake Gehring
QUALITY CONTROL MECHANISMS FOR CROWDSOURCING: PEER REVIEW, ARBITRATION, & EXPERTISE AT FAMILYSEARCH INDEXING
FamilySearch.org
FamilySearch Indexing (FSI)
FSI in the Broader Landscape
• Crowdsourcing Project: aggregates discrete tasks completed by volunteers who replace professionals (Howe, 2006; Doan et al., 2011)
• Human Computation System: humans use a computational system to work on a problem that may someday be solvable by computers (Quinn & Bederson, 2011)
• Lightweight Peer Production: largely anonymous contributors independently completing discrete, repetitive tasks provided by authorities (Haythornthwaite, 2009)
Design Challenge: Improve efficiency without sacrificing quality
[Figure: amount of scanned documents over time]
Quality Control Mechanisms
• 9 types of quality control for human computation systems (Quinn & Bederson, 2011)
• Redundancy
• Multi-level review
• Find-Fix-Verify pattern (Bernstein et al., 2010)
• Weight proposed solutions by reputation of contributor (McCann et al., 2003)
• Peer or expert oversight (Cosley et al., 2005)
• Tournament selection approach (Sun et al., 2011)
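The redundancy mechanism listed above can be illustrated with a minimal sketch: collect several independent transcriptions of a field and accept the majority value, falling back to arbitration when no value reaches the agreement threshold. Function and parameter names here are illustrative, not FSI's actual implementation.

```python
from collections import Counter

def majority_vote(transcriptions, min_agreement=2):
    """Return the value most transcribers agree on, or None if no
    value reaches the agreement threshold (i.e., send to arbitration)."""
    if not transcriptions:
        return None
    value, count = Counter(transcriptions).most_common(1)[0]
    return value if count >= min_agreement else None

# Two of three volunteers agree, so the field is accepted.
print(majority_vote(["Smith", "Smith", "Smyth"]))  # Smith
# No agreement: route the field to an arbitrator.
print(majority_vote(["Smith", "Smyth"]))  # None
```

The threshold trades cost for quality: higher `min_agreement` catches more errors but sends more fields to arbitration.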
A-B-Arbitrate process (A-B-ARB)
[Diagram: A and B transcribe independently; ARB resolves their disagreements]
Currently Used Mechanism
Peer review process (A-R-RARB)
[Diagram: A transcribes; R reviews the already-filled-in fields; RARB (optional) arbitrates the review]
Proposed Mechanism
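The two workflows differ in where human effort goes: A-B-ARB pays for a full second transcription and arbitrates disagreements, while A-R peer review shows the reviewer pre-filled fields to accept or correct. A minimal sketch of the routing logic, with placeholder resolvers standing in for the human arbitrator and reviewer:

```python
def a_b_arbitrate(a_value, b_value, arbitrate):
    """A-B-ARB: two independent transcriptions; an arbitrator
    resolves only the fields where A and B disagree."""
    return a_value if a_value == b_value else arbitrate(a_value, b_value)

def a_r_review(a_value, review):
    """A-R: a single transcription that the reviewer sees already
    filled in and either accepts or corrects."""
    return review(a_value)

# Illustrative resolvers; in FSI these steps are done by people.
arb = lambda a, b: b            # this arbitrator happens to side with B
rev = lambda v: v.capitalize()  # this reviewer fixes capitalization

print(a_b_arbitrate("smith", "smith", arb))  # smith (agreement, no arbitration needed)
print(a_r_review("smith", rev))              # Smith
```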
Two Act Play
Act I: Experience
What is the role of experience in quality and efficiency?
Historical data analysis using full US and Canadian Census records from 1920 and earlier
Act II: Quality Control
Is peer review or arbitration better in terms of quality and efficiency?
Field experiment using 2,000 images from the 1930 US Census Data & corresponding truth set
Act I: Experience
Quality is estimated based on A-B agreement (no truth set)
Efficiency calculated using keystroke-logging data with idle time and outliers removed
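Both measures above are straightforward to compute; a sketch under simple assumptions (records as field dictionaries, keystrokes as timestamps in seconds, and an illustrative 60-second idle cutoff not taken from the paper):

```python
def ab_agreement(a_records, b_records, field):
    """Fraction of records where transcribers A and B entered the same
    value for `field` (a quality proxy when no truth set exists)."""
    pairs = list(zip(a_records, b_records))
    matches = sum(1 for a, b in pairs if a[field] == b[field])
    return matches / len(pairs)

def active_time(keystroke_times, idle_cutoff=60.0):
    """Sum the gaps between consecutive keystrokes, dropping idle
    gaps longer than `idle_cutoff` seconds."""
    gaps = [t2 - t1 for t1, t2 in zip(keystroke_times, keystroke_times[1:])]
    return sum(g for g in gaps if g <= idle_cutoff)

a = [{"surname": "Smith"}, {"surname": "Jones"}]
b = [{"surname": "Smith"}, {"surname": "Johns"}]
print(ab_agreement(a, b, "surname"))        # 0.5
print(active_time([0.0, 2.0, 5.0, 300.0]))  # 5.0 (the 295 s idle gap is removed)
```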
A-B agreement by field
A-B agreement by language (1871 Canadian Census)
English: Given Name 79.8%, Surname 66.4%
French:  Given Name 62.7%, Surname 48.8%
A-B agreement by experience
[Figures: A-B agreement plotted by experience, axes A (novice ↔ experienced) × B (novice ↔ experienced), for:
- Birth Place: All U.S. Censuses
- Given Name: All U.S. Censuses
- Surname: All U.S. Censuses
- Gender: All U.S. Censuses
- Birthplace: English-speaking Canadian Census]
Time & keystroke by experience
Summary & Implications of Act I
Experienced workers are faster and more accurate, and these gains continue even at high experience levels
- Focus on retention
- Encourage both novices & experts to do more
- Develop interventions to speed up experience gains (e.g., send users common mistakes made by people at their experience level)
Summary & Implications of Act I
Contextual knowledge (e.g., Canadian placenames) and specialized skills (e.g., French language fluency) are needed for some tasks
- Recruit people with existing knowledge & skills
- Provide contextual information when possible (e.g., Canadian placename prompts)
- Don’t remove context (e.g., captcha)
- Allow users to specialize?
Act II: Quality Control
A-B-ARB data from original transcribers (Feb 2011)
A-R-RARB data includes original A data and newly collected R and RARB data from people new to this method (Jan-Feb of 2012)
Truth Set data from company with independent audit by FSI experts
Statistical Test: mixed-model logistic regression (accurate or not) with random effects, controlling for expertise
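The paper's test is a mixed-model logistic regression with random effects, controlling for expertise; as a much-simplified, stdlib-only stand-in, a two-proportion z-test conveys the core comparison of error rates between two conditions (the counts below are hypothetical, not the study's data):

```python
import math

def two_proportion_ztest(err1, n1, err2, n2):
    """Two-sided z-test comparing two error proportions. A simplified
    stand-in for the paper's mixed-model logistic regression, which
    additionally models transcriber expertise via random effects."""
    p1, p2 = err1 / n1, err2 / n2
    p = (err1 + err2) / (n1 + n2)                    # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))  # pooled standard error
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical error counts for two quality-control conditions.
z, p = two_proportion_ztest(err1=120, n1=2000, err2=160, n2=2000)
print(z, p)
```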
Limitations
• Experience levels of R and RARB were lower than expected, though we did statistically control for this
• Original B data used in A-B-ARB for certain fields was transcribed in a non-standard manner requiring adjustment
No Need for RARB
• No gains in quality from extra arbitration of peer-reviewed data (A-R = A-R-RARB)
• RARB takes some time, so it is better omitted
Quality Comparison
• Both methods were statistically better than A alone
• A-B-ARB had slightly lower error rates than A-R
• R “missed” more errors, but also introduced fewer errors
Time Comparison
Summary & Implications of Act II
Peer review shows considerable efficiency gains with nearly as good quality as arbitration
- Prime reviewers to find errors (e.g., prompt them with expected # of errors on a page)
- Highlight potential problems (e.g., let A flag tough fields)
- Route difficult pages to experts
- Consider an A-R1-R2 process when high quality is critical
Summary & Implications of Act II
Reviewing reviewers isn't always worth the time
- At least in some contexts, Find-Fix may not need Verify
Quality of different fields varies dramatically
- Use different quality control mechanisms for harder or easier fields
Integrate human and algorithmic transcription
- Use algorithms on easy fields & integrate into review process so machine learning can occur
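The hybrid-routing idea above can be sketched as a simple dispatcher: accept an algorithmic transcription for easy fields when the model is confident, and queue everything else (pre-filled) for human review. The field names and confidence threshold are illustrative assumptions, not FSI's design.

```python
def route_field(field_name, ocr_value, ocr_confidence,
                easy_fields=frozenset({"gender"}), threshold=0.95):
    """Accept algorithmic output for easy, high-confidence fields;
    otherwise pre-fill the value and send it to a human reviewer
    (whose corrections can then train the model)."""
    if field_name in easy_fields and ocr_confidence >= threshold:
        return ("accept", ocr_value)
    return ("human_review", ocr_value)

print(route_field("gender", "M", 0.99))       # accepted automatically
print(route_field("surname", "Smyth", 0.99))  # hard field: sent to a reviewer
```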
Questions
• Derek Hansen ([email protected])
• Patrick Schone ([email protected])
• Douglas Corey ([email protected])
• Matthew Reid ([email protected])
• Jake Gehring ([email protected])