2010-02-22 Wikipedia MTurk Research talk given in Taiwan's Academia Sinica

Ed H. Chi

Area Manager and Principal Scientist, Augmented Social Cognition Area, Palo Alto Research Center

•  Cognition: the ability to remember, think, and reason; the faculty of knowing.

•  Social Cognition: the ability of a group to remember, think, and reason; the construction of knowledge structures by a group.
    –  (not quite the same as in the branch of psychology that studies the cognitive processes involved in social interaction, though included)

•  Augmented Social Cognition: Supported by systems, the enhancement of the ability of a group to remember, think, and reason; the system-supported construction of knowledge structures by a group.

Citation: Chi, IEEE Computer, Sept 2008

2010-02-22 Ed H. Chi ASC Overview 2


•  Characterize activity on social systems with analytics
•  Model interaction, social, and community dynamics and variables
•  Prototype tools to increase benefits or reduce cost
•  Evaluate prototypes via Living Laboratories with real users

[Figure labels: Characterization, Models, Prototypes, Evaluations]

•  Characterization and Modeling:
    –  Community Analytics and Wikipedia Dynamics
•  Prototyping:
    –  Social Transparency thru WikiDashboard
•  Evaluation:
    –  Evaluations using Amazon Mechanical Turk


Conflict/Coordination Effects in Wikipedia


Mediator Pattern - Terri Schiavo

Mediators

Sympathetic to parents

Sympathetic to husband

Anonymous (vandals/spammers)

Measure of controversy
•  “Controversial” tag
•  Use # revisions tagged controversial

Page metrics
•  Possible metrics for identifying conflict in articles

Metric type                       Page type
Revisions (#)                     Article, talk, article/talk
Page length                       Article, talk, article/talk
Unique editors                    Article, talk, article/talk
Unique editors / revisions        Article, talk
Links from other articles         Article, talk
Links to other articles           Article, talk
Anonymous edits (#, %)            Article, talk
Administrator edits (#, %)        Article, talk
Minor edits (#, %)                Article, talk
Reverts (#, by unique editors)    Article


Performance: Cross-validation
•  5x cross-validation, R² = 0.897
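To make the cross-validated R² figure concrete, here is a minimal sketch of how such a score can be computed. It assumes a single predictor and an ordinary least-squares line, which is a stand-in and not the model used in the talk; the function names and the toy numbers are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def cross_validated_r2(xs, ys, k=5):
    """R^2 of held-out predictions, pooled over k folds."""
    n = len(xs)
    predictions = [0.0] * n
    for fold in range(k):
        # Every k-th item is held out; the rest is used for fitting.
        train = [i for i in range(n) if i % k != fold]
        test = [i for i in range(n) if i % k == fold]
        slope, intercept = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        for i in test:
            predictions[i] = slope * xs[i] + intercept
    mean_y = sum(ys) / n
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Hypothetical toy data: talk-page revisions vs. revisions tagged controversial.
talk_revisions = [5, 12, 20, 33, 41, 58, 64, 70, 85, 99]
controversial_revisions = [1, 3, 4, 8, 9, 13, 15, 15, 20, 22]
print(round(cross_validated_r2(talk_revisions, controversial_revisions), 3))
```

Each article is predicted only by a model that never saw it, so the score reflects how well the page metrics generalize rather than how well they fit.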


Determinants of conflict

Highly weighted features of conflict model:
•  Revisions (talk)
•  Minor edits (talk)
•  Unique editors (talk)
•  Revisions (article)
•  Unique editors (article)
•  Anonymous edits (talk)
•  Anonymous edits (article)


Number of Articles (Log Scale)

http://en.wikipedia.org/wiki/Wikipedia:Modelling_Wikipedia’s_growth


Monthly Edits


Monthly Active Editors


•  Edits beget edits
    –  the more previous edits, the more new edits

$$N(t) = N_0 \cdot e^{rt}$$

$$\frac{dN}{dt} = r \cdot N$$

(dN/dt: growth rate of population; N: current population)

Growth rate depends on current population N
r = growth rate of the population
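The exponential model can be sketched in a few lines. The parameter values below are hypothetical and chosen only for illustration; the doubling time ln(2)/r follows directly from N(t) = N0 · e^(rt).

```python
import math


def exponential_population(n0, r, t):
    """N(t) = N0 * exp(r * t), the solution of dN/dt = r * N."""
    return n0 * math.exp(r * t)


def doubling_time(r):
    """Time for N to double under dN/dt = r * N (requires r > 0)."""
    return math.log(2) / r


# Hypothetical values, not fitted to Wikipedia data.
n0, r = 1000.0, 0.5
t_double = doubling_time(r)
print(t_double)
print(exponential_population(n0, r, t_double))  # approximately 2 * n0
```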


•  Ecological population growth model
    –  r, growth rate of the population
    –  K, carrying capacity (due to resource limitation)

$$\frac{dN}{dt} = r \cdot N \cdot \left(1 - \frac{N}{K}\right)$$

[Figure: Population (0 to 4,000,000) by Year (2000 to 2010), with carrying capacity K marked]
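For a constant carrying capacity, the logistic equation dN/dt = rN(1 − N/K) has a well-known closed-form solution, sketched below. The parameter values are hypothetical and are not fitted to the Wikipedia data shown in the figure.

```python
import math


def logistic_population(n0, r, k, t):
    """Closed-form solution of dN/dt = r * N * (1 - N / K) with N(0) = n0."""
    return k / (1.0 + ((k - n0) / n0) * math.exp(-r * t))


def inflection_time(n0, r, k):
    """Time at which N reaches K / 2 and growth is fastest (requires n0 < k)."""
    return math.log((k - n0) / n0) / r


# Hypothetical parameters chosen only for illustration.
n0, r, k = 1000.0, 1.0, 3_500_000.0
for t in (0, 4, 8, 12, 20):
    print(t, round(logistic_population(n0, r, k, t)))
```

Early on the curve is nearly exponential; after the inflection point growth slows and N levels off at K.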


http://en.wikipedia.org/wiki/Wikipedia:Modelling_Wikipedia’s_growth

  Follows a logistic growth curve

New Article


  Carrying Capacity as a function of time.

[Figure: Population by Year (2000 to 2010), with time-varying carrying capacity K(t)]
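When the carrying capacity changes over time there is generally no simple closed form, so a numerical sketch is useful. The code below integrates dN/dt = rN(1 − N/K(t)) with forward Euler steps. The linear K(t) is purely an assumption for illustration; the slide does not specify a functional form.

```python
def simulate_logistic(n0, r, capacity, t_end, dt=0.01):
    """Forward-Euler integration of dN/dt = r * N * (1 - N / K(t))."""
    n, t = n0, 0.0
    steps = int(round(t_end / dt))
    for _ in range(steps):
        n += dt * r * n * (1.0 - n / capacity(t))
        t += dt
    return n


def linear_capacity(t):
    """Hypothetical K(t): starts at 1000 and grows by 100 per time unit."""
    return 1000.0 + 100.0 * t


print(simulate_logistic(n0=10.0, r=1.0, capacity=linear_capacity, t_end=20.0))
```

With a growing K(t) the population keeps rising after it would otherwise have saturated, trailing slightly behind the moving limit.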


•  Biological system
    –  Competition increases as the population hits the limits of the ecology
    –  Advantage goes to members of the population that have competitive dominance over others
•  Analogy
    –  Limited opportunities to make novel contributions
    –  Increased patterns of conflict and dominance


•  Highly skewed contribution pattern
    –  Top 3% of users contribute 50%+ of edits
    –  A lot of single-edit users
•  Five Editor Classes
    –  Monthly edit count
    –  No bot, vandalism included in the analysis
    –  1000+: editors who made 1000 or more edits in that month
    –  100-999
    –  10-99
    –  2-9
    –  1
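The five classes can be expressed as a small binning function. This is a sketch: the function name and sample counts are hypothetical, and it assumes that exactly 1000 edits falls in the top class, since the next class ends at 999.

```python
from collections import Counter


def editor_class(monthly_edits):
    """Map an editor's edit count for one month to one of the five classes."""
    if monthly_edits < 1:
        raise ValueError("an active editor has at least one edit in the month")
    if monthly_edits >= 1000:
        return "1000+"
    if monthly_edits >= 100:
        return "100-999"
    if monthly_edits >= 10:
        return "10-99"
    if monthly_edits >= 2:
        return "2-9"
    return "1"


# Hypothetical monthly edit counts for a handful of editors.
monthly_counts = [1, 1, 1, 3, 12, 150, 2400]
print(Counter(editor_class(c) for c in monthly_counts))
```

Because the class is assigned per month, the same editor can move between classes from one month to the next.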


Monthly Edits by Editor Class (in thousands)


Monthly Ratio of Reverted Edits



•  Two interpretations:
    –  Overall increased resistance from the Wikipedia community to changing content
    –  Disparity of treatment of edits
        »  Occasional editors have been reverted at a higher rate
•  Example of increased patterns of conflict and dominance

Photo: http://www.flickr.com/photos/efan78/3619921561/


Bongwon Suh, Gregorio Convertino, Ed H. Chi, Peter Pirolli. WikiSym 2009


“Wikipedia is the best thing ever. Anyone in the world can write anything they want about any subject, so you know you’re getting the best possible information.” – Steve Carell, The Office


•  Content in Wikipedia can be added or changed by anyone
•  Because of this, WP has become one of the most important resources on the web
    –  Hundreds of thousands of contributors
    –  Over 2 million articles
    –  5th most used website (Alexa.com)
•  Also because of this, it is viewed with skepticism by readers, press, researchers


Nothing


“Wikipedia, just by its nature, is impossible to trust completely. I don't think this can necessarily be changed.”


•  Risks with using Wikipedia
    –  Accuracy of content
    –  Motives of editors
    –  Expertise of editors
    –  Stability of article
    –  Coverage of topics
    –  Quality of cited information

Insufficient information to evaluate trustworthiness


•  Transparency of social dynamics can reduce conflict and coordination issues
•  Attribution encourages contribution
    –  WikiDashboard: Social dashboard for wikis
    –  Prototype system: http://wikidashboard.parc.com
•  Visualization for every wiki page showing edit history timeline and top individual editors
•  Can drill down into activity history for specific editors and view edits to see changes side-by-side


Citation: Suh et al. CHI 2008 Proceedings


Surfacing information

•  Numerous studies mining Wikipedia revision history to surface trust-relevant information
    –  Adler & Alfaro, 2007; Dondio et al., 2006; Kittur et al., 2007; Viegas et al., 2004; Zeng et al., 2006

•  But how much impact can this have on user perceptions in a system which is inherently mutable?

Suh, Chi, Kittur, & Pendleton, CHI2008

Hypotheses

1.  Visualization will impact perceptions of trust
2.  Compared to baseline, visualization will impact trust both positively and negatively
3.  Visualization should have most impact when there is high uncertainty about the article
    •  Low quality
    •  High controversy

Design

•  3 x 2 x 2 design

                 Controversial                                     Uncontroversial
High quality     Abortion, George Bush                             Volcano, Shark
Low quality      Pro-life feminism, Scientology and celebrities    Disk defragmenter, Beeswax

Visualization
•  High stability
•  Low stability
•  Baseline (none)
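The 3 x 2 x 2 design crosses three visualization conditions with two quality levels and two controversy levels. A minimal sketch of enumerating the resulting cells follows; the variable names are illustrative only.

```python
from itertools import product

visualizations = ["high stability", "low stability", "baseline (none)"]
qualities = ["high quality", "low quality"]
controversies = ["controversial", "uncontroversial"]

# Full factorial crossing: every visualization with every article type.
conditions = list(product(visualizations, qualities, controversies))

print(len(conditions))  # 12
for condition in conditions:
    print(condition)
```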


Example: High trust visualization


Example: Low trust visualization


Summary info

•  % from anonymous users
•  Last change by anonymous or established user
•  Stability of words

Graph

•  Instability
•  Revert activity

Method

•  Users recruited via Amazon’s Mechanical Turk
    –  253 participants
    –  673 ratings
    –  7 cents per rating
    –  Kittur, Chi, & Suh, CHI 2008: Crowdsourcing user studies
•  To ensure salience and valid answers, participants answered:
    –  In what time period was this article the least stable?
    –  How stable has this article been for the last month?
    –  Who was the last editor?
    –  How trustworthy do you consider the above editor?

Results

Main effects of quality and controversy:
•  High-quality articles > low-quality articles (F(1, 425) = 25.37, p < .001)
•  Uncontroversial articles > controversial articles (F(1, 425) = 4.69, p = .031)

Results

Interaction effects of quality and controversy:
•  High-quality articles were rated equally trustworthy whether controversial or not, while
•  Low-quality articles were rated lower when they were controversial than when they were uncontroversial.

Results

1.  Significant effect of visualization
    –  High > low, p < .001
2.  Viz has both positive and negative effects
    –  High > baseline, p < .001
    –  Low < baseline, p < .01
3.  No interaction of visualization with either quality or controversy
    –  Robust across conditions


Methodology



User studies

•  Getting input from users is important in HCI
    –  surveys
    –  rapid prototyping
    –  usability tests
    –  cognitive walkthroughs
    –  performance measures
    –  quantitative ratings

User studies

•  Getting input from users is expensive
    –  Time costs
    –  Monetary costs
•  Often have to trade off costs with sample size

Online solutions

•  Online user surveys
•  Remote usability testing
•  Online experiments
•  But still have difficulties
    –  Rely on practitioner for recruiting participants
    –  Limited pool of participants

Crowdsourcing

•  Make tasks available for anyone online to complete
•  Quickly access a large user pool, collect data, and compensate users
•  Experiences at PARC:
    –  CSL UbiComp group
    –  ISL’s NLTT group
•  Example: NASA Clickworkers
    –  100k+ volunteers identified Mars craters from space photographs
    –  Aggregate results “virtually indistinguishable” from expert geologists
    –  [Figure labels: experts, crowds]
    –  http://clickworkers.arc.nasa.gov

Amazon’s Mechanical Turk

•  Market for “human intelligence tasks”
•  Typically short, objective tasks
    –  Tag an image
    –  Find a webpage
    –  Evaluate relevance of search results
•  Users complete tasks for a few pennies each

Example task

Using Mechanical Turk for user studies

                     Traditional user studies                     Mechanical Turk
Task complexity      Complex, long                                Simple, short
Task subjectivity    Subjective, opinions                         Objective, verifiable
User information     Targeted demographics, high interactivity    Unknown demographics, limited interactivity

Can Mechanical Turk be usefully used for user studies?

Task

•  Assess quality of Wikipedia articles
•  Started with ratings from expert Wikipedians
    –  14 articles (e.g., “Germany”, “Noam Chomsky”)
    –  7-point scale
•  Can we get matching ratings with Mechanical Turk?

Experiment 1

•  Rate articles on 7-point scales:
    –  Well written
    –  Factually accurate
    –  Overall quality
•  Free-text input:
    –  What improvements does the article need?
•  Paid $0.05 each

Experiment 1: Good news

•  58 users made 210 ratings (15 per article)
    –  $10.50 total
•  Fast results
    –  44% within a day, 100% within two days
    –  Many completed within minutes
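The cost figures follow from simple arithmetic, sketched here with exact decimal math. The helper name is hypothetical, and the totals cover payments to workers only; any platform fee is ignored, which is an assumption.

```python
from decimal import Decimal


def total_cost(ratings, price_per_rating):
    """Total paid to workers: number of ratings times the price per rating."""
    return ratings * Decimal(price_per_rating)


print(total_cost(210, "0.05"))  # 10.50, matching the slide
print(210 // 14)                # 15 ratings for each of the 14 articles
print(total_cost(673, "0.07"))  # 47.11 for the trust study at 7 cents per rating
```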

Experiment 1: Bad news

•  Correlation between turkers and Wikipedians only marginally significant (r=.50, p=.07)

•  Worse, 59% potentially invalid responses

•  Nearly 75% of these done by only 8 users
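The agreement between turkers and Wikipedians is reported as a Pearson correlation. A minimal sketch of that statistic follows; the per-article ratings are hypothetical and are not the study's data, and no significance test is included.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical mean ratings per article on a 7-point scale.
expert_ratings = [6.5, 5.0, 3.0, 4.5, 2.0, 6.0, 3.5]
turker_ratings = [5.5, 5.2, 4.8, 4.9, 4.0, 5.6, 5.1]
print(round(pearson_r(expert_ratings, turker_ratings), 2))
```

In the study each article would contribute one pair: the expert rating and the mean of the turker ratings for that article.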

Experiment 1

Invalid comments: 49%
<1 min responses: 31%

Not a good start

•  Summary so far:
    –  Only marginal correlation with experts
    –  Heavy gaming of the system by a minority
•  Possible response:
    –  Can make sure these gamers are not rewarded
    –  Ban them from doing your HITs in the future
    –  Create a reputation system [Dolores Labs]
•  Can we change how we collect user input?

Design changes

•  Use verifiable questions to signal monitoring
    –  “How many sections does the article have?”
    –  “How many images does the article have?”
    –  “How many references does the article have?”
•  Make malicious answers as high cost as good-faith answers
    –  “Provide 4-6 keywords that would give someone a good summary of the contents of the article”
•  Make verifiable answers useful for completing task
    –  Used tasks similar to how Wikipedians described evaluating quality (organization, presentation, references)
•  Put verifiable tasks before subjective responses
    –  First do objective tasks and summarization
    –  Only then evaluate subjective quality
    –  Ecological validity?
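Verifiable questions also make it possible to screen responses after the fact. The sketch below compares a worker's counts against known values for the article. The field names, the tolerance of 1, and the sample responses are all hypothetical; the talk describes verifiable questions as a signal of monitoring and does not prescribe this particular filter.

```python
def passes_verifiable_checks(response, ground_truth, tolerance=1):
    """True if every verifiable count is within `tolerance` of the known value."""
    return all(
        abs(response[key] - expected) <= tolerance
        for key, expected in ground_truth.items()
    )


# Hypothetical ground truth for one article and two hypothetical responses.
ground_truth = {"sections": 12, "images": 5, "references": 40}
careful = {"sections": 12, "images": 4, "references": 40, "quality": 6}
careless = {"sections": 1, "images": 1, "references": 1, "quality": 7}

print(passes_verifiable_checks(careful, ground_truth))   # True
print(passes_verifiable_checks(careless, ground_truth))  # False
```

A small tolerance allows for honest miscounts while still catching answers that were clearly entered without looking at the article.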

Experiment 2: Results

•  124 users provided 277 ratings (~20 per article)
•  Significant positive correlation with Wikipedians (r=.66, p=.01)
•  Smaller proportion malicious responses
•  Increased time on task

                     Experiment 1    Experiment 2
Invalid comments     49%             3%
<1 min responses     31%             7%
Median time          1:30            4:06
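The time-on-task measures in the table can be computed from per-response completion times. This is a sketch with hypothetical timings; the 60-second threshold mirrors the "<1 min" row, and the function names are illustrative.

```python
from statistics import median


def summarize_times(seconds, fast_threshold=60):
    """Share of responses faster than the threshold, plus the median time."""
    fast = sum(1 for s in seconds if s < fast_threshold)
    return {
        "share_under_threshold": fast / len(seconds),
        "median_seconds": median(seconds),
    }


def as_min_sec(seconds):
    """Format a duration in seconds as M:SS, as in the table."""
    minutes, secs = divmod(int(round(seconds)), 60)
    return f"{minutes}:{secs:02d}"


# Hypothetical completion times in seconds.
times = [30, 90, 246, 300]
summary = summarize_times(times)
print(summary["share_under_threshold"])       # 0.25
print(as_min_sec(summary["median_seconds"]))  # 2:48
```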

Generalizing to other user studies

•  Combine objective and subjective questions
    –  Rapid prototyping: ask verifiable questions about content/design of prototype before subjective evaluation
    –  User surveys: ask common-knowledge questions before asking for opinions

Limitations of Mechanical Turk

•  No control of users’ environment
    –  Potential for different browsers, physical distractions
    –  General problem with online experimentation
•  Not designed for user studies
    –  Difficult to do between-subjects design
    –  Involves some programming
•  Users
    –  Uncertainty about user demographics, expertise

Conclusion

1.  Use verifiable questions to signal monitoring
2.  Make malicious answers as high cost as good-faith answers
3.  Make verifiable answers useful for completing task
4.  Put verifiable tasks before subjective responses

•  Mechanical Turk offers the practitioner a way to access a large user pool and quickly collect data at low cost
•  Good results require careful task design

Ed H. Chi (manager, PS) Peter Pirolli (RF) Lichan Hong Bongwon Suh Les Nelson Rowan Nairn Gregorio Convertino

Interns/Collaborators: Sanjay Kairam, Jilin Chen (UMinn), Michael Bernstein (MIT)

http://asc-parc.blogspot.com


•  r, growth rate
•  K, carrying capacity

$$\frac{dN}{dt} = rN\left(1 - \frac{N}{K}\right)$$

[Figure: Population (0 to 4,000,000) by Year (2000 to 2010)]

•  r dominates when N is small: $\left(1 - \frac{N}{K}\right) \approx 1$
•  K dominates when N → K: $\left(1 - \frac{N}{K}\right) \approx 0$


•  r-Strategist
    –  Growth or exploitation
    –  Less-crowded niches / produce many offspring
•  K-Strategist
    –  Conservation
    –  Strong competitors in crowded niches / invest more heavily in fewer offspring
•  Evolution cycle
    –  Resilience of an ecological system
    –  Gunderson & Holling 2001


•  Exponential growth model
    –  Growth rate depends on the current N

$$\frac{dN}{dt} = r \cdot N$$

•  Ecological population growth model
    –  r, growth rate of the population
    –  K, carrying capacity (due to resource limitation)

$$\frac{dN}{dt} = rN\left(1 - \frac{N}{K}\right)$$


•  People-ware
    –  Growing resistance to changing content
    –  Coordination cost and bureaucracy
•  Knowledge-ware: Availability of easy topics to write about
•  Tool-ware: Quality of tools used by editors and admins

http://www.aerostich.com/
http://www.mikestreetmedia.co.uk/blog/wp-content/uploads/2009/01/knowledge.jpg
http://youropenbook.agitprop.co.uk/growing.php?p=2
