Running Experiments with Amazon Mechanical-Turk. Gabriele Paolacci, Jesse Chandler, Panagiotis G. Ipeirotis. Judgment and Decision Making, Vol. 5, No. 5, August 2010. KSE 801: Human Computation and Crowdsourcing



Practical Advantages of M-Turk

• Supportive infrastructure:
  – Fast recruiting
  – Convenient to run experiments
  – An external site can be used (e.g., with a validation code)
• Subject identifiability and prescreening:
  – M-Turk workers can be required to earn “qualifications” (or answer prescreening questions) prior to completing a HIT
• Subject identifiability and longitudinal studies:
  – Worker IDs can be used to explicitly re-contact former subjects, or code can be written that restricts the availability of a HIT to a predetermined list of workers
• Cultural diversity:
  – Cross-cultural comparisons are feasible (e.g., country, language, currency)
• Subject anonymity (not easy though)
  – Ensuring the worker’s anonymity (if an external site is used)
  – M-Turk studies can be exempted from review by IRBs (Institutional Review Boards) if anonymity is guaranteed
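
The validation-code and worker-list ideas above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `SECRET_KEY`, `completion_code`, `is_valid`, and `may_take_followup` are hypothetical names, and nothing here calls the M-Turk API. The external site would show the code at the end of the survey, and the worker would paste it back into the HIT.

```python
import hashlib
import hmac

# Hypothetical secret held by the researcher; not something M-Turk provides.
SECRET_KEY = b"replace-with-a-private-random-key"


def completion_code(worker_id, length=8):
    """Derive a short code to display on the external site after the survey."""
    digest = hmac.new(SECRET_KEY, worker_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:length].upper()


def is_valid(worker_id, submitted):
    """Check the code a worker pasted back into the HIT."""
    expected = completion_code(worker_id).encode("utf-8")
    received = submitted.strip().upper().encode("utf-8")
    return hmac.compare_digest(expected, received)


def may_take_followup(worker_id, allowed_ids):
    """Restrict a follow-up HIT to a predetermined list of worker IDs."""
    return worker_id in allowed_ids
```

Because the code is derived from the worker ID, a code copied from another worker is very unlikely to validate. Keying on the worker ID means the researcher holds that ID, so a study that relies on anonymity would need a different scheme (for example, a random per-session code).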

Tradeoffs of Different Recruiting Methods

A Comparative Study

• Tested various Judgment and Decision Making (JDM) findings
  – M-Turk, a traditional subject pool at a large Midwestern US university, and visitors of online discussion boards
  – During April to May 2010
• Survey:
  – Asian disease problem
  – Linda problem
  – Physician problem

Survey (Asian Disease Problem)

• Asian disease problem (a framing effect; Tversky and Kahneman, 1981)
• Subjects read one of two hypothetical scenarios
  – Imagine that the United States is preparing for the outbreak of an unusual Asian disease, which is expected to kill 600 people. Two alternative programs to combat the disease have been proposed. Assume that the exact scientific estimates of the consequences of the programs are as follows:
  – Problem 1: If Program A is adopted, 200 people will be saved. If Program B is adopted, there is 1/3 probability that 600 people will be saved and 2/3 probability that no people will be saved. Which of the two programs would you favor?
  – Problem 2: If Program A is adopted, 400 people will die. If Program B is adopted, there is 1/3 probability that nobody will die, and 2/3 probability that 600 people will die.
• The two scenarios are numerically identical, but subjects responded very differently
• In the scenario framed in terms of gains, subjects were risk-averse (72% chose Program A); in the scenario framed in terms of losses, 78% of subjects preferred Program B (Tversky and Kahneman, 1981)
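
To see why the two scenarios are numerically identical, the expected number of survivors can be computed for each program. The sketch below uses only the numbers in the problem text; the function and variable names are illustrative.

```python
from fractions import Fraction

AT_RISK = 600  # people expected to die if nothing is done


def expected_survivors(outcomes):
    """outcomes: list of (probability, survivors) pairs for one program."""
    return sum(p * s for p, s in outcomes)


# Problem 1 (gain frame): outcomes stated as lives saved.
gain_a = [(Fraction(1), 200)]
gain_b = [(Fraction(1, 3), 600), (Fraction(2, 3), 0)]

# Problem 2 (loss frame): outcomes stated as deaths, converted to survivors.
loss_a = [(Fraction(1), AT_RISK - 400)]
loss_b = [(Fraction(1, 3), AT_RISK - 0), (Fraction(2, 3), AT_RISK - 600)]

for name, program in [("gain A", gain_a), ("gain B", gain_b),
                      ("loss A", loss_a), ("loss B", loss_b)]:
    print(name, expected_survivors(program))  # each line prints 200
```

All four programs have the same expected outcome, so the shift in preferences between the two problems comes from the wording (saved vs. die) and from risk attitude, not from the numbers.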

Survey (Linda Problem)

• Example: “Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations.”
• Which is more probable?
  – Linda is a bank teller
  – Linda is a bank teller and is active in the feminist movement
• Linda problem (Tversky & Kahneman, 1983)
  – Demonstrates the conjunction fallacy
  – People often fail to regard a combination of events as less probable than a single event in the combination
    • Probability of two events occurring together (in “conjunction”) is always less than or equal to the probability of either one occurring alone
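
The conjunction rule stated above follows directly from the definition of conditional probability. As a short worked derivation, let A be "Linda is a bank teller" and B be "Linda is active in the feminist movement" (the symbols A and B are introduced here for illustration):

```latex
P(A \cap B) = P(A)\,P(B \mid A) \le P(A)
\qquad\text{and}\qquad
P(A \cap B) = P(B)\,P(A \mid B) \le P(B),
\qquad\text{since } 0 \le P(B \mid A),\ P(A \mid B) \le 1.

\text{Hence}\quad P(A \cap B) \le \min\{P(A),\, P(B)\}.
```

So "bank teller and feminist" can never be more probable than "bank teller" alone, however representative the description makes the second option seem.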

Survey (Physician Problem)

• The physician problem demonstrates the outcome bias: subjects evaluate a surgeon's decision about whether or not to do a risky surgery on a patient.
  – The surgery had a known probability of success (e.g., 8%)
  – Subjects were presented with either a good or bad outcome (in this case living or dying), and asked to rate the quality of the surgeon's pre-operation decision.
• Judgment of the quality of a decision is often dependent on the valence of the outcome (Baron and Hershey, 1988)
• Subjects rated the quality of a physician’s decision to perform an operation on a patient (on a 7-point scale)
  – 1: incorrect and inexcusable, 7: clearly correct, and the opposite decision would be inexcusable
  – Those presented with bad outcomes rated the decision worse than those who had good outcomes.

After Survey

• After the survey, subjects completed the subjective numeracy scale (SNS, 2007), which yields an SNS score
  – An eight-item self-report measure of perceived ability to perform various mathematical tasks and preference for the use of numerical vs. prose information
  – Used as a parsimonious measurement of an individual’s quantitative abilities
• Additional “catch trial” question: to test whether subjects were attending to the questions (by requiring precise and obvious answers)
  – E.g., “while watching the television, have you ever had a fatal heart attack?” (w/ six-point scale anchored on “Never” and “Often”)

Configuration

• M-Turk:
  – Pay: $0.10 (N=318 participated)
  – Title: “Answer a short decision survey”
  – Description: “Make some choices and judgments in this 5-minute survey”
    • Estimated completion time is included to provide workers with a rough assessment of the reward/effort ratio (e.g., $1.71/hour)
• Lab subject pool:
  – N=141 students from an introductory subject pool at a large university
• Internet discussion board:
  – Posted a link to the survey to several online discussion boards that host online experiments in psychology
  – Online for 2 weeks; and N=137 visitors took part in the survey
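
The reward/effort ratio mentioned for the M-Turk condition is simple arithmetic on the pay and the time spent. The sketch below uses hypothetical helper names. The slide does not say which completion time underlies the $1.71/hour figure, so the second call only shows what time that rate would imply for a $0.10 reward.

```python
def hourly_rate(reward_usd, minutes):
    """Effective hourly wage for a task paying reward_usd that takes `minutes`."""
    return reward_usd * 60.0 / minutes


def implied_minutes(reward_usd, rate_per_hour):
    """Completion time (in minutes) implied by a reward and an hourly rate."""
    return reward_usd * 60.0 / rate_per_hour


# Advertised estimate: $0.10 for a 5-minute survey.
print(round(hourly_rate(0.10, 5), 2))          # 1.2 dollars per hour

# Time that would correspond to the quoted $1.71/hour at $0.10 per HIT.
print(round(implied_minutes(0.10, 1.71), 1))   # 3.5 minutes
```

By this calculation the advertised 5 minutes works out to $1.20/hour, and the quoted $1.71/hour would match a completion time of roughly 3.5 minutes.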

Subject Pools: Characteristics

• Subjects recruited from online discussion forums were significantly less likely to complete the survey than the subjects on M-Turk (69.3% vs. 91.6%, χ² = 20.915, p < .001)

• The number of respondents who failed the catch trial was low, and did not differ significantly across subject pools (χ²(2, N = 301) = 0.187, p = 0.91)

• Subjects in the three subject pools did not differ significantly in the SNS score: F(2, 299) = 1.193, p = 0.30
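
The p-values attached to the two chi-square statistics can be checked with closed-form tail formulas, using only the standard library. This is a sketch: the catch-trial statistic is reported with 2 degrees of freedom, while 1 degree of freedom for the completion-rate comparison is an assumption (two pools by completed/not completed), since the slide does not state it.

```python
import math


def chi2_sf_df1(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def chi2_sf_df2(x):
    """Upper-tail probability of a chi-square variable with 2 degrees of freedom."""
    return math.exp(-x / 2.0)


# Completion rates, forums vs. M-Turk (df = 1 assumed).
print(chi2_sf_df1(20.915) < 0.001)     # True

# Catch-trial failures across the three pools (df = 2 as reported).
print(round(chi2_sf_df2(0.187), 2))    # 0.91
```

The second result is consistent with a non-significant difference in catch-trial failures, with p of about 0.91.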

Results on Experimental Tasks

• M-Turk is a reliable source of experimental data in JDM

Labor Supply

• Economic theory predicts that increasing the price paid for labor will increase the supply of labor in most cases

• M-Turk experiment: after completing the demographic survey and the first task (transcription), subjects were randomly assigned to one of four treatment groups and offered the chance to perform another transcription for p cents, where p was 1, 5, 15, or 25

• Workers receiving high offers were more likely to accept
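
The random assignment described above can be sketched as follows. The four offer levels come from the slide; the function names, worker IDs, and acceptance data are hypothetical placeholders, and this is not the original study's code.

```python
import random
from collections import Counter

OFFERS_CENTS = (1, 5, 15, 25)  # the four treatment groups


def assign_offers(worker_ids, seed=None):
    """Randomly assign each worker one of the four offers (in cents)."""
    rng = random.Random(seed)
    return {w: rng.choice(OFFERS_CENTS) for w in worker_ids}


def acceptance_rate_by_offer(assignments, accepted_ids):
    """Share of workers in each offer group who accepted the second task."""
    offered = Counter(assignments.values())
    took = Counter(assignments[w] for w in accepted_ids)
    return {p: took[p] / offered[p] for p in sorted(offered)}


# Hypothetical usage: 200 placeholder worker IDs, fixed seed for repeatability.
workers = ["worker_%03d" % i for i in range(200)]
assignments = assign_offers(workers, seed=42)
print(Counter(assignments.values()))
```

Comparing `acceptance_rate_by_offer` across the four groups is one way to check the prediction that higher offers draw more labor. Independent draws give unequal group sizes, so blocked randomization may be preferable when balance matters.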