2010-02-22 Wikipedia MTurk Research talk given at Taiwan's Academia Sinica
Ed H. Chi
Area Manager and Principal Scientist, Augmented Social Cognition Area, Palo Alto Research Center
Cognition: the ability to remember, think, and reason; the faculty of knowing.
Social Cognition: the ability of a group to remember, think, and reason; the construction of knowledge structures by a group. (Not quite the same as the branch of psychology that studies the cognitive processes involved in social interaction, though that is included.)
Augmented Social Cognition: supported by systems, the enhancement of the ability of a group to remember, think, and reason; the system-supported construction of knowledge structures by a group.
Citation: Chi, IEEE Computer, Sept 2008
2010-02-22 Ed H. Chi ASC Overview 2
• Characterize activity on social systems with analytics
• Model social interaction and community dynamics and variables
• Prototype tools to increase benefits or reduce cost
• Evaluate prototypes via Living Laboratories with real users
[Diagram: Characterization, Models, Prototypes, Evaluations]
Characterization and Modeling: – Community Analytics and Wikipedia Dynamics
Prototyping: – Social Transparency thru WikiDashboard
Evaluation: – Evaluations using Amazon Mechanical Turk
Conflict/Coordination Effects in Wikipedia
Mediator Pattern - Terri Schiavo
Mediators
Sympathetic to parents
Sympathetic to husband
Anonymous (vandals/spammers)
Measure of controversy
• “Controversial” tag
• Use # revisions tagged controversial
Page metrics
• Possible metrics for identifying conflict in articles

Metric type                       Page type
Revisions (#)                     Article, talk, article/talk
Page length                       Article, talk, article/talk
Unique editors                    Article, talk, article/talk
Unique editors / revisions        Article, talk
Links from other articles         Article, talk
Links to other articles           Article, talk
Anonymous edits (#, %)            Article, talk
Administrator edits (#, %)        Article, talk
Minor edits (#, %)                Article, talk
Reverts (#, by unique editors)    Article
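A minimal sketch of how a few of these page metrics might be computed from a revision history. The `Revision` record and its field names are hypothetical, chosen only for illustration; they are not the data format used in the study.

```python
from dataclasses import dataclass


@dataclass
class Revision:
    """One edit to a page (hypothetical record layout)."""
    editor: str
    is_anonymous: bool = False
    is_minor: bool = False
    is_revert: bool = False


def page_metrics(revisions):
    """Compute a subset of the conflict metrics listed on the slide."""
    n = len(revisions)
    editors = {r.editor for r in revisions}
    anonymous = sum(r.is_anonymous for r in revisions)
    minor = sum(r.is_minor for r in revisions)
    reverts = sum(r.is_revert for r in revisions)
    return {
        "revisions": n,
        "unique_editors": len(editors),
        "unique_editors_per_revision": len(editors) / n if n else 0.0,
        "anonymous_edits": anonymous,
        "anonymous_pct": 100.0 * anonymous / n if n else 0.0,
        "minor_edits": minor,
        "reverts": reverts,
    }


# Made-up history for one page, for illustration only.
history = [
    Revision("alice"),
    Revision("10.0.0.1", is_anonymous=True),
    Revision("alice", is_revert=True),
    Revision("bob", is_minor=True),
]
metrics = page_metrics(history)
print(metrics)
```

The same function could be run separately on an article's history and on its talk page's history to obtain the "article" and "talk" variants of each metric.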
Performance: Cross-validation
• 5x cross-validation, R² = 0.897
Determinants of conflict
Highly weighted features of conflict model:
• Revisions (talk)
• Minor edits (talk)
• Unique editors (talk)
• Revisions (article)
• Unique editors (article)
• Anonymous edits (talk)
• Anonymous edits (article)
Number of Articles (Log Scale)
http://en.wikipedia.org/wiki/Wikipedia:Modelling_Wikipedia’s_growth
Monthly Edits
Monthly Active Editors
Edits beget edits: the more previous edits, the more new edits
$$\frac{dN}{dt} = r \cdot N \qquad N(t) = N_0 \cdot e^{rt}$$
– dN/dt: growth rate of the population
– N: current population
– r: growth rate of the population (per-capita rate constant)
Growth rate depends on current population N
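For reference, the exponential form follows from the differential equation by separation of variables. Here N_0 denotes the population at t = 0 and C is a constant of integration (introduced only for this derivation).

```latex
\begin{aligned}
\frac{dN}{dt} &= r N \\
\frac{dN}{N} &= r\,dt \\
\ln N &= r t + C \\
N(t) &= e^{C} e^{rt} = N_0\, e^{rt}, \qquad N_0 = N(0)
\end{aligned}
```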
Ecological population growth model
– r, growth rate of the population
– K, carrying capacity (due to resource limitation)
$$\frac{dN}{dt} = r \cdot N \cdot \left(1 - \frac{N}{K}\right)$$
[Plot: Population vs. Year (2000 to 2010), y-axis 0 to 4,000,000, with carrying capacity K marked]
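A small sketch comparing the two growth models numerically, using the standard closed-form solution of the logistic equation. The parameter values below are hypothetical and are not fitted to Wikipedia data; they only illustrate how the logistic curve tracks exponential growth early on and then levels off near K.

```python
import math


def exponential(t, n0, r):
    """Closed-form solution of dN/dt = r*N with N(0) = n0."""
    return n0 * math.exp(r * t)


def logistic(t, n0, r, k):
    """Closed-form solution of dN/dt = r*N*(1 - N/K) with N(0) = n0."""
    return k / (1.0 + (k - n0) / n0 * math.exp(-r * t))


# Hypothetical parameters, for illustration only.
N0, R, K = 1_000.0, 1.0, 3_500_000.0

for year in range(0, 21, 5):
    print(year, round(exponential(year, N0, R)), round(logistic(year, N0, R, K)))
```

While N is much smaller than K the two curves are nearly identical; as N approaches K the factor (1 - N/K) shrinks toward zero and growth stalls.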
http://en.wikipedia.org/wiki/Wikipedia:Modelling_Wikipedia’s_growth
Follows a logistic growth curve
New Article
Carrying Capacity as a function of time.
[Plot: Population vs. Year (2000 to 2010), with time-varying carrying capacity K(t)]
Biological system
– Competition increases as the population hits the limits of the ecology
– Advantage goes to members of the population that have competitive dominance over others
Analogy
– Limited opportunities to make novel contributions
– Increased patterns of conflict and dominance
Highly skewed contribution pattern
– Top 3% users contribute 50%+ edits
– A lot of single-edit users
Five Editor Classes
– Monthly edit count
– No bot, vandalism included in the analysis
– 1000+: editors who made more than 1000 edits in that month
– 100-999
– 10-99
– 2-9
– 1
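The five editor classes can be sketched as a simple bucketing function over an editor's monthly edit count. One assumption to note: the slide describes the top class as "more than 1000 edits", but since the next class ends at 999, this sketch places exactly 1000 edits in "1000+". The sample counts are made up.

```python
from collections import Counter


def editor_class(monthly_edits):
    """Map an editor's edit count for one month to one of the five classes."""
    if monthly_edits < 1:
        raise ValueError("an editor must have at least one edit in the month")
    if monthly_edits >= 1000:
        return "1000+"
    if monthly_edits >= 100:
        return "100-999"
    if monthly_edits >= 10:
        return "10-99"
    if monthly_edits >= 2:
        return "2-9"
    return "1"


def edits_by_class(monthly_counts):
    """Total edits contributed by each class, given per-editor monthly counts."""
    totals = Counter()
    for count in monthly_counts:
        totals[editor_class(count)] += count
    return totals


# Made-up per-editor edit counts for one month.
sample_month = [1, 1, 1, 3, 12, 150, 2400]
print(edits_by_class(sample_month))
```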
Monthly Edits by Editor Class (in thousands)
Monthly Ratio of Reverted Edits
Two interpretations:
– Overall increased resistance from the Wikipedia community to changing content
– Disparity of treatment of edits
  » Occasional editors have been reverted at a higher rate
Example of increased patterns of conflict and dominance
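As an illustration of the "disparity of treatment" reading, the ratio of reverted edits can be computed separately for each editor class. This is a sketch under assumptions: the input format, a list of (editor class, was reverted) pairs, is hypothetical, and the sample data is made up.

```python
from collections import defaultdict


def revert_ratio_by_class(edits):
    """Fraction of edits that were reverted, per editor class.

    `edits` is an iterable of (editor_class, was_reverted) pairs.
    """
    totals = defaultdict(int)
    reverted = defaultdict(int)
    for cls, was_reverted in edits:
        totals[cls] += 1
        if was_reverted:
            reverted[cls] += 1
    return {cls: reverted[cls] / totals[cls] for cls in totals}


# Made-up edits for one month.
sample_edits = [
    ("1", True), ("1", False),
    ("1000+", False), ("1000+", False), ("1000+", False), ("1000+", True),
]
ratios = revert_ratio_by_class(sample_edits)
print(ratios)
```

Computing this ratio month by month would give one curve per class, which is the kind of comparison the second interpretation relies on.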
Photo: http://www.flickr.com/photos/efan78/3619921561/
Bongwon Suh, Gregorio Convertino, Ed H. Chi, Peter Pirolli. WikiSym 2009
“Wikipedia is the best thing ever. Anyone in the world can write anything they want about any subject, so you know you’re getting the best possible information.” – Steve Carell, The Office
Content in Wikipedia can be added or changed by anyone
Because of this, WP has become one of the most important resources on the web
– Hundreds of thousands of contributors
– Over 2 million articles
– 5th most used website (Alexa.com)
Also because of this, it is viewed with skepticism by readers, press, and researchers
Nothing
“Wikipedia, just by its nature, is impossible to trust completely. I don't think this can necessarily be changed.”
Risks with using Wikipedia
– Accuracy of content
– Motives of editors
– Expertise of editors
– Stability of article
– Coverage of topics
– Quality of cited information
Insufficient information to evaluate trustworthiness
Transparency of social dynamics can reduce conflict and coordination issues
Attribution encourages contribution
– WikiDashboard: Social dashboard for wikis
– Prototype system: http://wikidashboard.parc.com
Visualization for every wiki page showing edit history timeline and top individual editors
Can drill down into activity history for specific editors and view edits to see changes side-by-side
Citation: Suh et al. CHI 2008 Proceedings
Surfacing information
• Numerous studies mining Wikipedia revision history to surface trust-relevant information
  – Adler & Alfaro, 2007; Dondio et al., 2006; Kittur et al., 2007; Viegas et al., 2004; Zeng et al., 2006
• But how much impact can this have on user perceptions in a system which is inherently mutable?
Suh, Chi, Kittur, & Pendleton, CHI2008
Hypotheses
1. Visualization will impact perceptions of trust
2. Compared to baseline, visualization will impact trust both positively and negatively
3. Visualization should have most impact when there is high uncertainty about the article
   • Low quality
   • High controversy
Design
• 3 x 2 x 2 design

                Controversial                                     Uncontroversial
High quality    Abortion; George Bush                             Volcano; Shark
Low quality     Pro-life feminism; Scientology and celebrities    Disk defragmenter; Beeswax

Visualization
• High stability
• Low stability
• Baseline (none)
Example: High trust visualization
Example: Low trust visualization
Summary info
• % from anonymous users
• Last change by anonymous or established user
• Stability of words
Graph
• Instability
• Revert activity
Method
• Users recruited via Amazon’s Mechanical Turk
  – 253 participants
  – 673 ratings
  – 7 cents per rating
  – Kittur, Chi, & Suh, CHI 2008: Crowdsourcing user studies
• To ensure salience and valid answers, participants answered:
  – In what time period was this article the least stable?
  – How stable has this article been for the last month?
  – Who was the last editor?
  – How trustworthy do you consider the above editor?
Results
Main effects of quality and controversy:
• High-quality articles > low-quality articles (F(1, 425) = 25.37, p < .001)
• Uncontroversial articles > controversial articles (F(1, 425) = 4.69, p = .031)
Results
Interaction effects of quality and controversy:
• High-quality articles were rated equally trustworthy whether controversial or not, while
• Low-quality articles were rated lower when they were controversial than when they were uncontroversial.
Results
1. Significant effect of visualization
   – High > low, p < .001
2. Viz has both positive and negative effects
   – High > baseline, p < .001
   – Low < baseline, p < .01
3. No interaction effect of visualization with either quality or controversy
   – Robust across conditions
Methodology
User studies
• Getting input from users is important in HCI
  – Surveys
  – Rapid prototyping
  – Usability tests
  – Cognitive walkthroughs
  – Performance measures
  – Quantitative ratings
User studies
• Getting input from users is expensive
  – Time costs
  – Monetary costs
• Often have to trade off costs with sample size
Online solutions
• Online user surveys
• Remote usability testing
• Online experiments
• But still have difficulties
  – Rely on practitioner for recruiting participants
  – Limited pool of participants
Crowdsourcing
• Make tasks available for anyone online to complete
• Quickly access a large user pool, collect data, and compensate users
• Experiences at PARC:
  – CSL UbiComp group
  – ISL’s NLTT group
• Example: NASA Clickworkers
  – 100k+ volunteers identified Mars craters from space photographs
  – Aggregate results “virtually indistinguishable” from expert geologists
[Figure: experts vs. crowds]
http://clickworkers.arc.nasa.gov
Amazon’s Mechanical Turk
• Market for “human intelligence tasks”
• Typically short, objective tasks
  – Tag an image
  – Find a webpage
  – Evaluate relevance of search results
• Users complete for a few pennies each
Example task
Using Mechanical Turk for user studies

                     Traditional user studies                      Mechanical Turk
Task complexity      Complex, long                                 Simple, short
Task subjectivity    Subjective, opinions                          Objective, verifiable
User information     Targeted demographics, high interactivity     Unknown demographics, limited interactivity
Can Mechanical Turk be usefully used for user studies?
Task
• Assess quality of Wikipedia articles
• Started with ratings from expert Wikipedians
  – 14 articles (e.g., “Germany”, “Noam Chomsky”)
  – 7-point scale
• Can we get matching ratings with Mechanical Turk?
Experiment 1
• Rate articles on 7-point scales:
  – Well written
  – Factually accurate
  – Overall quality
• Free-text input:
  – What improvements does the article need?
• Paid $0.05 each
Experiment 1: Good news
• 58 users made 210 ratings (15 per article)
  – $10.50 total
• Fast results
  – 44% within a day, 100% within two days
  – Many completed within minutes
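The totals for Experiment 1 follow directly from figures given in the talk (14 articles, 15 ratings per article, $0.05 per rating). A short check, working in integer cents to avoid floating-point rounding; the variable names are illustrative:

```python
ARTICLES = 14              # articles rated by expert Wikipedians
RATINGS_PER_ARTICLE = 15   # ratings collected per article
PAY_CENTS_PER_RATING = 5   # $0.05 per rating

total_ratings = ARTICLES * RATINGS_PER_ARTICLE
total_cost_cents = total_ratings * PAY_CENTS_PER_RATING

print(f"{total_ratings} ratings cost ${total_cost_cents / 100:.2f}")
# prints "210 ratings cost $10.50"
```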
Experiment 1: Bad news
• Correlation between turkers and Wikipedians only marginally significant (r=.50, p=.07)
• Worse, 59% potentially invalid responses
• Nearly 75% of these done by only 8 users
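The agreement between turker and Wikipedian ratings is reported as a correlation coefficient r. A minimal sketch of a Pearson correlation over per-article mean ratings, assuming that is the statistic intended; the rating values below are invented for illustration and are not the study's data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Invented per-article mean ratings on a 7-point scale.
expert_ratings = [6.5, 5.0, 3.0, 4.5, 2.0]
turker_ratings = [5.5, 5.2, 4.1, 4.0, 3.9]
print(round(pearson_r(expert_ratings, turker_ratings), 2))
```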
Experiment 1
• Invalid comments: 49%
• <1 min responses: 31%
Not a good start
• Summary so far:
  – Only marginal correlation with experts
  – Heavy gaming of the system by a minority
• Possible responses:
  – Make sure these gamers are not rewarded
  – Ban them from doing your HITs in the future
  – Create a reputation system [Dolores Labs]
• Can we change how we collect user input?
Design changes
• Use verifiable questions to signal monitoring
  – “How many sections does the article have?”
  – “How many images does the article have?”
  – “How many references does the article have?”
• Make malicious answers as high cost as good-faith answers
  – “Provide 4-6 keywords that would give someone a good summary of the contents of the article”
• Make verifiable answers useful for completing the task
  – Used tasks similar to how Wikipedians described evaluating quality (organization, presentation, references)
• Put verifiable tasks before subjective responses
  – First do objective tasks and summarization
  – Only then evaluate subjective quality
  – Ecological validity?
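One way the verifiable questions could be used after data collection is to flag suspect responses. The sketch below is an assumption about how such screening might look, not the procedure used in the study: the `Response` fields, the one-minute threshold (borrowed from the "<1 min responses" measure), and the tolerance of one on the section count are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Response:
    """One worker's answers for one article (hypothetical layout)."""
    worker_id: str
    seconds_on_task: float
    reported_sections: int
    keywords: list = field(default_factory=list)


def is_suspect(resp, true_sections, min_seconds=60.0, tolerance=1):
    """Flag a response that fails any of the verifiable checks."""
    too_fast = resp.seconds_on_task < min_seconds
    wrong_count = abs(resp.reported_sections - true_sections) > tolerance
    distinct = {k.strip().lower() for k in resp.keywords if k.strip()}
    bad_keywords = not (4 <= len(distinct) <= 6)  # task asked for 4-6 keywords
    return too_fast or wrong_count or bad_keywords


good = Response("w1", 240.0, 9, ["geography", "history", "economy", "politics"])
rushed = Response("w2", 20.0, 9, ["a", "b", "c", "d"])
guessed = Response("w3", 180.0, 2, ["geography", "history", "economy", "politics"])

for r in (good, rushed, guessed):
    print(r.worker_id, is_suspect(r, true_sections=10))
```

Flagged responses would still need human review before any payment decision, since an honest worker can miscount sections.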
Experiment 2: Results
• 124 users provided 277 ratings (~20 per article)
• Significant positive correlation with Wikipedians (r=.66, p=.01)
• Smaller proportion of malicious responses
• Increased time on task

                    Experiment 1    Experiment 2
Invalid comments    49%             3%
<1 min responses    31%             7%
Median time         1:30            4:06
Generalizing to other user studies
• Combine objective and subjective questions
  – Rapid prototyping: ask verifiable questions about content/design of prototype before subjective evaluation
  – User surveys: ask common-knowledge questions before asking for opinions
Limitations of Mechanical Turk
• No control of users’ environment
  – Potential for different browsers, physical distractions
  – General problem with online experimentation
• Not designed for user studies
  – Difficult to do between-subjects design
  – Involves some programming
• Users
  – Uncertainty about user demographics, expertise
Conclusion
1. Use verifiable questions to signal monitoring
2. Make malicious answers as high cost as good-faith answers
3. Make verifiable answers useful for completing the task
4. Put verifiable tasks before subjective responses
• Mechanical Turk offers the practitioner a way to access a large user pool and quickly collect data at low cost
• Good results require careful task design
Ed H. Chi (manager, PS), Peter Pirolli (RF), Lichan Hong, Bongwon Suh, Les Nelson, Rowan Nairn, Gregorio Convertino
Interns/Collaborators: Sanjay Kairam, Jilin Chen (UMinn), Michael Bernstein (MIT)
http://asc-parc.blogspot.com
r, growth rate
K, carrying capacity
$$\frac{dN}{dt} = rN\left(1 - \frac{N}{K}\right)$$
[Plot: Population vs. Year (2000 to 2010), y-axis 0 to 4,000,000]
r dominates when N is small: $\left(1 - \frac{N}{K}\right) \approx 1$
K dominates when N approaches K: $\left(1 - \frac{N}{K}\right) \approx 0$
r-Strategist
– Growth or exploitation
– Less-crowded niches / produce many offspring
K-Strategist
– Conservation
– Strong competitors in crowded niches / invest more heavily in fewer offspring
Evolution cycle
– Resilience of an ecological system
– Gunderson & Holling 2001
Exponential growth model
– Growth rate depends on the current N
$$\frac{dN}{dt} = r \cdot N$$
Ecological population growth model
– r, growth rate of the population
– K, carrying capacity (due to resource limitation)
$$\frac{dN}{dt} = rN\left(1 - \frac{N}{K}\right)$$
People-ware
– Growing resistance to changing content
– Coordination cost and bureaucracy
Knowledge-ware: Availability of easy topics to write about
Tool-ware: Quality of tools used by editors and admins
http://www.aerostich.com/ http://www.mikestreetmedia.co.uk/blog/wp-content/uploads/2009/01/knowledge.jpg http://youropenbook.agitprop.co.uk/growing.php?p=2