
Cicero: Multi-Turn, Contextual Argumentation for Accurate Crowdsourcing

Quanze Chen1, Jonathan Bragg1, Lydia B. Chilton2, Daniel S. Weld1

1 University of Washington, Seattle, WA, USA   2 Columbia University, New York, NY, USA

{cqz, jbragg, weld}@cs.washington.edu, [email protected]

ABSTRACT
Traditional approaches for ensuring high quality crowdwork have failed to achieve high accuracy on difficult problems. Aggregating redundant answers often fails on the hardest problems when the majority is confused. Argumentation has been shown to be effective in mitigating these drawbacks. However, existing argumentation systems only support limited interactions and show workers general justifications, not context-specific arguments targeted to their reasoning.

This paper presents CICERO, a new workflow that improves crowd accuracy on difficult tasks by engaging workers in multi-turn, contextual discussions through real-time, synchronous argumentation. Our experiments show that compared to previous argumentation systems (e.g., MICROTALK [10]), which only improve the average individual worker accuracy by 6.8 percentage points on the Relation Extraction domain, our workflow achieves a 16.7 percentage point improvement. Furthermore, previous argumentation approaches don't apply to tasks with many possible answers; in contrast, CICERO works well in these cases, raising accuracy from 66.7% to 98.8% on the Codenames domain.

ACM Classification Keywords
H.5.m. Information Interfaces and Presentation (e.g. HCI): Group and Organization Interfaces

Author Keywords
Crowdsourcing; argumentation; dialog

INTRODUCTION
Crowdsourcing has been used for a wide variety of tasks, from image labeling to language transcription and translation. Many complex jobs can be decomposed into small micro-tasks [22, 2, 28, 4]. After such decomposition, the primary challenge becomes ensuring that independent individual judgments result in accurate global answers. Approaches ranging from aggregation via majority vote [33] to programmatic filtering via gold-standard questions [29] have all been created to achieve



Figure 1. Discussion interface for use in CICERO, inspired by instant-messaging clients, showing a fragment of an actual discussion in the Relation Extraction domain. (1) Presents the question (sentence + claim) and both sides' beliefs. (2) Initial discussion is seeded with the workers' justifications. (3) Options added to facilitate termination of a discussion once it has reached the end of its usefulness.

this goal. Further improvements have led to more intelligent aggregation such as expectation maximization (EM) [6, 37, 36]. However, EM may still fall short, especially on hard problems where individual judgments are unreliable. Indeed, some researchers have concluded that crowdsourcing is incapable of achieving perfect accuracy [7].

Yet recently, argumentation has been shown to be an effective way to improve the accuracy of both individual and aggregate judgments. For example, Drapeau et al.'s MicroTalk [10] used a pipelined approach of: 1) asking crowdworkers to assess a question's answer, 2) prompting them to justify their reasoning, 3) showing them counterarguments written by other workers, and 4) allowing them to reconsider their original answers to improve individual judgments. In principle, this simplified form of argumentation allows a single dissident worker, through force of reason, to steer others to the right answer. Furthermore, the authors showed that argumentation was compatible with EM; combining the two methods resulted in substantial gains in accuracy.


However, while asynchronous argumentation systems like MicroTalk attempt to resolve disagreement, the steering power of a one-round debate is limited. Workers are only shown a pre-collected justification for an opposing answer; they aren't challenged by a specific and personalized argument against the flaws in their original reasoning. There is also no back-and-forth interaction that could illuminate subtle aspects of a problem or resolve a worker's misconceptions — something which may only become apparent after several turns of discussion. Furthermore, since justifications are pre-collected, workers need to write a generic counter-argument; while this works for binary answer tasks, it is completely impractical for tasks with many answers; such a counter-argument would typically be prohibitively long, refuting n − 1 alternatives.

This paper presents CICERO, a new workflow that engages workers in multi-turn and contextual argumentation to improve crowd accuracy on difficult tasks. CICERO selects workers with opposing answers to questions and pairs them into a discussion session using a chat-style interface, in which they can respond to each other's reasoning and debate the best answer. During these exchanges, workers are able to write context-dependent counter-arguments addressing their partner's specific claims, cite rules from the training materials to support their answers, point out oversights of other workers, and resolve misconceptions about the rules and task which can impact their future performance on the task. As a result of these effects, workers are more likely to converge to correct answers, improving individual accuracy. Our experiments on two difficult text-based task domains, relation extraction and a word association task, show that contextual multi-turn discussion yields vastly improved worker accuracy compared to traditional argumentation workflows.

In summary, we make the following contributions:

• We propose CICERO, a novel workflow that induces multi-turn and contextual argumentation, facilitating focused discussions about the answers to objective questions.

• We introduce a new type of worker training to ensure that workers understand the process of argumentation (in addition to the task itself) and produce high quality arguments.

• We develop CICERO-SYNC, a synchronous implementation of our workflow using real-time crowdsourcing, and apply it to conduct the following experiments:

– In the Relation Extraction domain introduced by MICROTALK [10], we show that contextual, multi-turn argumentation results in significantly higher improvement in individual worker accuracy over existing one-shot argumentation: a 16.7 percentage point improvement vs. 6.8.

– Using a version of the Codenames domain [41] that has many answer choices (making MICROTALK's non-contextual argumentation unworkable), we show that CICERO is quite effective, improving individual worker accuracy from 66.7% to a near-perfect 98.8%.

– We qualitatively analyze the discussion transcripts produced from our experiments with CICERO-SYNC, identifying several characteristics present in contextual, multi-turn argumentation.

PREVIOUS WORK
Quality control has been a central concern in crowdsourcing, and space precludes a complete discussion of post-hoc methods such as majority vote [33], gated instruction [23], and programmatic generation of gold-standard questions [29]. Similarly, many different approaches have been suggested to improve accuracy by assigning certain questions to specific workers [16, 5, 15].

Expectation maximization [6, 37, 36] is especially popular, but these methods all embody greedy optimization and hence are prone to local maxima. If the data set includes challenging problems, where a significant majority of workers gets the answer wrong, EM will likely converge to the incorrect answer.

Other researchers have investigated methods to handle cases where the majority may be wrong, e.g., tournament voting [34] and Bayesian truth serum [31]. Unfortunately, these methods are unlikely to work for difficult questions where individual answers require detailed analysis.

Rationales & Feedback Can Improve Accuracy
Some researchers have demonstrated that requiring annotators to submit "rationales" for their answers by highlighting portions of text [39, 26] or an image [8] can improve machine learning classifiers. In contrast, we focus not just on providing justifications, but also on back-and-forth dialog between workers.

Dow et al. [9] conduct experiments demonstrating that timely, task-specific feedback helps crowd workers learn, persevere, and produce better results. Wiebe et al. [38] go a step further (with small-group, in-person studies), showing the benefits of getting annotators to reconsider their positions and discuss them with other workers. ConsiderIt [18] takes this kind of principled debate online and into the political spectrum, using pro/con points and encouraging participants to restate alternative positions to help voters make informed choices. Recently, Kobayashi et al. [17] have shown that self-correction in the form of reviewing other workers' answers is effective in getting workers to reconsider and correct their own.

Many of these papers inspired the MICROTALK system, which is the primary inspiration for this work [10]. MICROTALK combines three microtasks — assess, justify & reconsider — asking crowd workers to assess a question, prompting them to justify their reasoning, confronting them with another worker's counterargument, and then encouraging them to reconsider their original decision. Our work builds on MICROTALK in several ways: 1) we support contextual communication that allows participants to respond to specific points made by their partner, and 2) we support multi-turn dialog that allows workers to "dive deep" if necessary to resolve the disagreement.

Liu et al. [42] also noted that workers who review others' work perform better on subsequent tasks, which inspired us to also examine effects of multi-turn argumentation on workers' future tasks.


Figure 2. CICERO System Diagram. Solid arrows indicate paths for workers through the system. Dotted arrows indicate how questions are allocated through the system.

Real-Time Crowdsourcing
In order to facilitate synchronous and real-time dialog, our implementation, CICERO-SYNC, has to connect workers in real time. There are significant logistical challenges with recruiting a real-time crowd [13], especially in the context of our experiments, where we wished to have several dozen workers working simultaneously. Fortunately, there have been enough real-time crowd deployments (e.g., [2, 3]) that many useful lessons have been distilled [12].

Instructions for the Task & the Meta-Task
It's generally agreed that good instructions are essential for high inter-annotator agreement [23]. We were frequently reminded of this as we iterated on our task design. Perhaps we should not have been surprised to discover that good instructions were also necessary for our 'meta-task' — arguing about the task. Argument forms and norms that contribute to positive discussion have long been studied in the education community, termed 'accountable talk' [27]. Our training for each experiment domain instructs workers on some of these guidelines, e.g., asking workers to rate justifications to ensure that participants understand accepted standards of reasoning, arguments that emphasize logical connections, and the ability to draw a reasonable conclusion.

CICERO DESIGN
We designed the CICERO workflow to address the drawbacks of existing one-shot reconsider argumentation systems by using contextual, multi-turn discussions.

In this section, we present the CICERO workflow as well as design considerations in a synchronous implementation of the workflow used for our experiments. We first explain the rationale for contextual, multi-turn discussions and give an overview of our CICERO workflow. We then talk about the

decision to implement our workflow in a synchronous system — CICERO-SYNC. Finally, we discuss the design choices we made to (1) create an interface for effective real-time discussion, as well as (2) improve instructions and training for the domains we examined.

Contextual and Multi-Turn Discussion
In natural forms of debate, participants who disagree take turns presenting arguments which can refute or supplement prior arguments. Our CICERO workflow is designed around the concept of emulating this process in a crowd work setting by using paired discussions facilitated by a dynamic matching system. Participants are matched with partners based on their current beliefs and are encouraged to present their arguments over multiple turns.

While real-life debates may include multiple participants, each responsible for addressing arguments on different aspects of a problem, in the crowd setting we can rely on the diversity of workers to cover a broad set of views and reasoning; thus, to simplify the process, we focus on a two-participant discussion model.

Workflow Overview
Since argumentation happens on an ad-hoc basis, it's much more flexible to have our workflow focus on managing transitions between the different states a worker may be in instead of defining a single pipeline. Accordingly, our design of the CICERO workflow follows an event-based definition model where the task assigner allocates tasks as workers' states change. Figure 2 summarizes how our workflow allocates worker resources and questions in a dynamic way.

Initially, workers are recruited from a crowd work platform (such as Amazon Mechanical Turk) and are immediately assigned to a training task. Workers who pass training and the associated gating tests [24] enter the worker pool and wait to


be assigned work. Then, instead of a fixed workflow, our event-based task allocator decides which type of task and question to assign to a worker, subject to a set of constraints. As workers complete their tasks and update the beliefs of questions in the working set, new candidate tasks are dynamically selected and allocated.

In CICERO, there are two main types of tasks that the assigner may assign to an idle worker: assess and discussion.

• The assess task acquires one worker (w) from the worker pool, who is then presented with one question (q) — in our case a single question in the domain — that asks for an answer to a multiple choice question (Belief_w(q)) and optionally a free-form justification for their position (Justification_w(q)). This task combines the assess and justify microtasks of MicroTalk [10] into a single task.

• The discussion task acquires two workers (w1, w2) from the worker pool, who are both shown a discussion interface for a question (q). At the end of a discussion, Belief_w(q) and Justification_w(q) may be updated for w ∈ {w1, w2}. We cover details of the design of the discussion task in later sections.

The task allocator manages which type of task can be allocated when a worker changes their state and, depending on the domain, can be adjusted to prioritize specific kinds of tasks, particular questions, or qualities such as minimizing worker wait time and increasing concurrent work.

In general, the task assigned can be adapted to the goals of the requester. However, there are a few general constraints that the task assigner must follow:

• Incompatible beliefs: A discussion may only be assigned to workers (w1, w2) if they have incompatible beliefs (Belief_w1(q) ≠ Belief_w2(q)). Implicitly, this also requires that both beliefs exist, implying they must already have been collected.

• No repeated discussions: Two workers (w1, w2) may only discuss question q if they have never discussed question q with each other before (¬∃ Discuss_q(w1, w2)).

These constraints guarantee that the workflow will eventually terminate when there are no more workers who disagree and have never been paired with each other. It's important to note that requesters can set up the task allocator to terminate the workflow sooner if, for example, thresholds for agreement on questions are reached.
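To make these allocation constraints concrete, the following Python sketch shows the pairing check a task assigner could perform; the function name and data structures (eligible_discussion_pairs, beliefs, past_discussions) are illustrative assumptions rather than CICERO's actual implementation.

```python
from itertools import combinations

def eligible_discussion_pairs(question, idle_workers, beliefs, past_discussions):
    """Return worker pairs that may be assigned a discussion on `question`.

    beliefs[(worker, question)] holds a worker's current answer, if collected;
    past_discussions is a set of frozensets {w1, w2, question} already held.
    (Illustrative data structures, not CICERO's actual implementation.)
    """
    pairs = []
    for w1, w2 in combinations(idle_workers, 2):
        b1 = beliefs.get((w1, question))
        b2 = beliefs.get((w2, question))
        # Incompatible beliefs: both answers must exist and must differ.
        if b1 is None or b2 is None or b1 == b2:
            continue
        # No repeated discussions: the pair may not have discussed this question before.
        if frozenset((w1, w2, question)) in past_discussions:
            continue
        pairs.append((w1, w2))
    return pairs
```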

CICERO-SYNC: A Real-Time Implementation
While the CICERO workflow does not constrain the type of interaction during a discussion task, we decided to test the effectiveness of our workflow using synchronous discussions where both workers are simultaneously online and engaged in a chat-like discussion environment.

In this implementation of CICERO, CICERO-SYNC, once workers are matched into a discussion, they will not be assigned other tasks for the duration of that discussion and are expected to give each other their undivided attention. A synchronous and real-time discussion task allows us to maintain

a continuous dialogue spanning many turns while preserving discussion context in a simple and natural way. However, we came to learn that systems relying on synchronous, real-time worker interactions have some disadvantages: the synchronous nature of discussions means that some workers will have to wait for a partner to become available, and workers need to be online and active within the same time window, both of which imply a higher cost to the requester.

Additionally, there are many practical challenges to implementing and setting up synchronous real-time experiments with crowd workers, including implementing real-time client-server communication and working with APIs for worker recruitment and payment. We elected to build CICERO-SYNC on top of the TurkServer [25] toolkit. Various tools built into TurkServer simplify interfacing with Amazon Mechanical Turk for worker recruitment and task management and allow us to automatically track worker state as well as to build a "worker pool" through the lobby. These allowed us to quickly design and prototype CICERO-SYNC, which builds upon TurkServer's lobby-assigner-experiment model. Our training, assess, and discussion task types in CICERO-SYNC map to experiment instances in TurkServer.

Discussion Interface
The discussion task is the most important and defining task of the CICERO workflow. We considered multiple different options for the discussion interface, focusing on ways to organize discussion structure and facilitate discoverability.

Early proposals included designs inspired by the posts-and-replies interfaces in social network timelines and the split-view pros-and-cons interfaces used in ConsiderIt, a political argumentation system [18]. Our pilot studies showed that these methods were cumbersome and non-intuitive, so we decided on a free-form instant messaging (chat) metaphor for the discussion task, shown in Figure 1.

When a pair of workers enter a discussion, they are placed into a familiar instant messaging setting, where they can freely send and receive messages. An additional "exit" section below the chat interface allows either participant to terminate the discussion if they feel that it is no longer useful. Workers can utilize this exit mechanism to indicate that a consensus was reached or that no agreement is possible between the pair. The discussion interface can be easily adapted to specific needs of each experiment domain: In the Relation Extraction domain, the justifications collected from earlier assess or discussion tasks are used to seed the system, which we found to be beneficial in starting a conversation. In the Codenames domain, a drop-down menu below the text input field accommodates switching to alternate answers during the discussion, addressing the non-binary nature of the questions.

We found that workers required minimal training to understand the discussion interface and were quickly able to effectively participate in argumentation with their partners in both domains.


Figure 3. Screenshot of our LivedIn task (Relation Extraction domain) instructions containing 5 citable rules including the definition. Shorthands (in bold) allow for efficient citation of rules during discussion and within justifications (as shown in the example's justification).

Optimizing Task Instructions
We observed in early pilot experiments that arguments which refer explicitly to parts of the task guidelines were more effective at convincing a partner. However, our original task guidelines and training did nothing to encourage this practice. Workers came up with different ways to refer to parts of the instructions or training examples, but this was inconsistent and frequently caused confusion. References to the guidelines were hard to identify, making it harder for workers to determine correct invocations of rules in the Relation Extraction domain pilots. Since arguing in synchronous discussion sessions is time-sensitive, creating rules and shorthands that are easy to cite is important for discussion efficiency.

We adjusted the task guidelines for the Relation Extraction task from those in MICROTALK, re-organizing them into five concrete and easy-to-cite rules as shown in Figure 3. Each rule was given a shorthand so that workers can unambiguously refer to a specific rule, aiding identification of proper or improper rule usage during the discussions. We observed that citing behavior became more consistent within discussions, with workers frequently utilizing our shorthands.

In the Codenames domain, which has simple instructions but many example cases, we designed the instructions to both show the general guidelines and provide a way for workers to review examples from training if they decide to reference them.

Selecting and Training Effective Workers
In initial pilots with CICERO-SYNC, we provided minimal training to workers. New workers were presented with task descriptions, instructions, and one sample problem which could be attempted multiple times. After several pilot experiments, we noticed that workers were performing inconsistently. Many workers' discussions were ineffective, with one or both participants showing a lack of basic understanding of the instructions. Inspection of timing data also showed that many workers were spending significantly less time in training and assess tasks without putting the necessary thought into learning the rules.

One method for improving worker quality, suggested by Drapeau et al. [10], is to select for "discerning workers" by finding workers who write justifications that are more complex than those of the median worker, e.g., using Flesch-Kincaid [14] readability grade levels on gold standard questions. Following Drapeau et al., we tried filtering for "discerning workers" using the Flesch-Kincaid score, but our pilot experiments on the Relation Extraction domain showed limited effect and significant reductions in worker pool size. Filtering workers based only on gold standard question performance was also ineffective, as it did not train workers to understand the rules required for our complex tasks.
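For reference, a minimal sketch of such a "discerning worker" filter is shown below; it assumes the third-party textstat package for Flesch-Kincaid grading and a median-based threshold, and is not the exact implementation used in our pilots or in MicroTalk.

```python
import statistics
import textstat  # third-party package providing Flesch-Kincaid grade scores

def discerning_workers(justifications):
    """Select workers whose gold-question justifications read above the median grade level.

    justifications: dict mapping worker id -> justification text on gold questions.
    (Illustrative sketch; the threshold choice is an assumption.)
    """
    grades = {w: textstat.flesch_kincaid_grade(text) for w, text in justifications.items()}
    median_grade = statistics.median(grades.values())
    return {w for w, g in grades.items() if g > median_grade}
```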

Since our tasks require worker training, we opted to implement a gating process [23] that can both train and select workers at the same time. Workers are presented with questions laid out in a quiz-like format. Each training question is provided alongside an introduction of related concepts from the task instructions. The questions are interleaved with the instructions so that new questions are presented as new concepts are introduced, reinforcing understanding. Feedback is given when a worker selects an answer. At the end, workers' performance on a set of quiz questions is recorded. If a worker's accuracy on the quiz falls below a certain threshold, the worker is asked to retry the training section (a limited number of times) with the order of the quiz questions randomized. Workers are dismissed if they exceed the retry limit and still do not meet the passing threshold.
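The gating logic can be summarized by the following sketch; the helper names and the grading callback are illustrative, and the pass threshold and retry limit are domain-specific parameters (e.g., a 100% threshold and 2 attempts for Relation Extraction, 66.7% for Codenames).

```python
import random

def run_gating(worker, quiz_questions, grade, pass_threshold, max_attempts=2):
    """Administer the gating quiz, allowing a limited number of retries.

    grade(worker, questions) -> fraction of quiz questions answered correctly.
    Returns True if the worker passes and may enter the worker pool.
    (Illustrative sketch; parameter values depend on the domain.)
    """
    questions = list(quiz_questions)
    for attempt in range(max_attempts):
        accuracy = grade(worker, questions)
        if accuracy >= pass_threshold:
            return True            # worker enters the pool
        random.shuffle(questions)  # retry with question order randomized
    return False                   # dismissed after exceeding the retry limit
```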

After the improved training and gating, we observed no significant correlation between workers' initial accuracy and the "discerning worker" measure in the Relation Extraction domain.

Selecting and Training Effective Argue-ers
In addition to gating for the task instructions, we also designed a novel justification training task, incorporated as a part of the training process, with the goal of training workers to recognize good arguments and justifications. This justification training task is presented in the form of an assess task followed by feedback to workers explaining good arguments and justifications based on their assessment.

In the Relation Extraction domain, workers are asked to identify the best justification in a list after providing an assessment. Incorrect options aim to address potential mistakes a worker may make in writing a good justification, such as: failure to cite rules, incomplete or incorrect references to the rules, or making extended and inappropriate inferences. Workers are given feedback explaining why their selected justification was correct or incorrect, with both the good and bad parts of the justification explained.

Since questions in the Codenames domain have ten or more possible answers, it's not practical to collect justifications (which would have to rule out n − 1 alternatives) at assess time. Therefore, justification training is adjusted to instead show a context-dependent counter-argument when a worker selects an incorrect answer, refuting the incorrect choice and supporting the correct one. By training workers to recognize


and analyze arguments, the goal of justification training is to promote more critical discussion.

Worker Retention and Real-Time Quality Control
Due to the synchronous nature of discussions in CICERO-SYNC, workers may become idle for short periods of time when they are waiting to be matched to a partner. In these circumstances, workers are kept in the worker pool in the form of a lobby. Our lobby design was mainly inspired by the default lobby provided in TurkServer [25] and by a worker-progress feedback design developed by Huang et al. [12] for low-latency crowdsourcing. While in the lobby, workers are presented with information on their peers' current status, such as how many workers are currently online and which workers may become available soon. Workers also see statistics on their work, which is tied to bonus payments, and are encouraged to wait. In CICERO-SYNC, the task assigner is configured to immediately assign work as it becomes available, but we also allow a worker to voluntarily exit after they have waited for a period without matches or completed a sufficiently large amount of work.

In addition, while our gating process is designed to select workers serious about the task, we incorporate several techniques to ensure that workers stay active when a task gets assigned to them. Individual tasks, such as assess tasks, impose anti-cheating mechanisms to discourage spammers from quickly progressing. These mechanisms include character and word count minimums and disabling of copy-paste for free-form entries. Workers are also encouraged to peer-regulate during discussion — participants can indicate a partner's inactivity upon ending a discussion with no agreement. Paired with corresponding payout incentives, these methods ensure that most workers stay active throughout the duration of an experiment.
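A minimal sketch of the free-form entry check is shown below; the specific thresholds are illustrative placeholders, and copy-paste is disabled separately in the interface itself rather than checked here.

```python
def justification_is_acceptable(text, min_chars=30, min_words=6):
    """Reject free-form entries that are too short to be a genuine justification.

    The character and word count thresholds are illustrative, not the values
    used in our deployment.
    """
    stripped = text.strip()
    return len(stripped) >= min_chars and len(stripped.split()) >= min_words
```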

EXPERIMENTS
We deployed our experiments on our synchronous implementation, CICERO-SYNC, to address the following questions: 1) Does multi-turn discussion improve individual accuracy more than existing one-shot reconsider-based workflows? 2) Is multi-turn discussion effective in cases where acquiring justifications to implement one-shot argumentation (reconsider) is impractical? 3) Do discussions exhibit multi-turn and contextual properties?

We selected two domains to evaluate the research questions above: a traditional NLP binary answer task, Relation Extraction, for comparing against one-shot argumentation, and a multiple choice answer task, inspired by the word relation game Codenames, to evaluate CICERO in a non-binary choice domain.

In the following sections, we first introduce the experiment setup and configuration; then we introduce each domain and present our results for experiments on that domain. At the end, we present a qualitative analysis of discussion characteristics and explore whether discussions can improve future accuracy.

Experiment Setup
CICERO's design enables interleaved assignment of different task types (assessments or discussions) for individual workers. This can be beneficial in reducing worker waiting overhead by assigning individual tasks when paired tasks are not available. However, in order to evaluate the effects of contextual, multi-turn argumentation under a controlled setting, we need to isolate the process of assessment and argumentation. For our experiments, we implemented a "blocking" task assigner that avoids interleaved concurrent tasks and is designed to assign the same type of task to a worker until they have answered all questions of that type.

The blocking assigner includes a few extra constraints in addition to those required by the workflow:

• Gold Standard Assessments: The task assigner assigns assess tasks for gold standard questions to evaluate the quality of workers who passed the training and gating quiz phase. Workers are assigned these questions before any other questions. No discussions are ever initiated for these questions; they let us control for worker quality and filter out workers that do not pass the gating threshold.

• Greedy Matching: The task assigner tries to assign a discussion as soon as such a task is available. In the case of multiple candidates, the task assigner picks one randomly.

Additionally, the blocking assigner doesn't allocate any discussions until a worker has finished assessing all questions. This allows us to collect the initial answers of a worker before they participate in any argumentation.
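The blocking assigner's decision rule can be sketched as follows; the state object and its accessor methods are illustrative assumptions about how worker and question state might be tracked, not part of our actual codebase.

```python
import random

def next_task(worker, state):
    """Pick the next task for `worker` under the blocking assigner's rules.

    Assumed accessors (illustrative):
      state.gold_remaining(w): gold-standard questions w has not yet assessed
      state.assess_remaining(w): experiment questions w has not yet assessed
      state.candidate_discussions(w): eligible (partner, question) pairs for w
    """
    if state.gold_remaining(worker):
        # Gold standard assessments come first and are never discussed.
        return ("assess", state.gold_remaining(worker)[0])
    if state.assess_remaining(worker):
        # No discussions until the worker has assessed every question.
        return ("assess", state.assess_remaining(worker)[0])
    candidates = state.candidate_discussions(worker)
    if candidates:
        # Greedy matching: start a discussion as soon as one is available,
        # picking randomly among multiple candidates.
        return ("discussion", random.choice(candidates))
    return ("wait", None)  # stay in the lobby
```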

We adjusted CICERO-SYNC to include these experimental constraints and collected multi-turn, contextual arguments. The system used in experiments consists of three distinct stages: Training, Assess, and Discussion. Workers progress through each stage sequentially. Within the same stage, workers are allocated tasks on demand by the blocking assigner. This system represents the discussion condition.

In addition to CICERO-SYNC, we also implemented the adaptive workflow described in MicroTalk [10] using our platform to compare against one-shot argumentation. We reproduced the reconsider task interface from MicroTalk, which replaces our discussion task, and created a specialized assigner that allocates tasks following the adaptive algorithm described in MicroTalk. In this system, workers are adaptively asked to justify or complete reconsider tasks depending on their initial answer. When a worker is the only worker with a particular answer for a question, they are asked to provide a justification for their answer. Reconsider tasks are only assigned to a worker if there is a previously justified answer that their current answer disagrees with; that justification is the one shown to the worker. This system represents the reconsider condition.
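The assignment rule for the reconsider condition can be sketched as follows; this captures only the justify/reconsider triggers described above, not every detail of MicroTalk's adaptive algorithm, and the data structures are illustrative.

```python
def reconsider_condition_task(worker, question, answers, justifications):
    """Decide between justify and reconsider for one worker on one question.

    Assumed (illustrative) structures:
      answers[q][w]: worker w's current answer to question q
      justifications[q][a]: a previously collected justification for answer a, if any
    """
    answer = answers[question][worker]
    others = [a for w, a in answers[question].items() if w != worker]
    if answer not in others:
        # The worker is alone in holding this answer: ask them to justify it.
        return ("justify", answer)
    opposing = [a for a in justifications.get(question, {}) if a != answer]
    if opposing:
        # A justified opposing answer exists: show it and ask them to reconsider.
        return ("reconsider", justifications[question][opposing[0]])
    return ("none", None)
```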

Recruiting and Incentives
We ran experiments on Amazon Mechanical Turk, using workers who had completed at least 100 tasks with a 95% acceptance rate, for both of our experiment domains.


Figure 4. Comparison of improvement in average worker accuracy (Relation Extraction domain) for each batch (subset) of questions as well as on the entire set of questions.

Within each domain, we calibrated our subtask payments by observing the average worker time for that subtask from a pilot run and allocating an approximately $7 hourly wage. Our training bonus of $1.00 for successfully completing training and the gating quiz is also calibrated using the average time it takes workers to complete the training session.
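The calibration itself is a simple back-of-the-envelope computation, sketched below; the example duration is illustrative rather than a measured pilot value.

```python
TARGET_HOURLY_WAGE = 7.00  # dollars

def calibrated_bonus(avg_subtask_seconds):
    """Bonus (in dollars) that pays roughly $7/hour for a subtask's average duration."""
    return round(TARGET_HOURLY_WAGE * avg_subtask_seconds / 3600.0, 2)

# Example (illustrative duration): a subtask averaging 250 seconds of worker time
# would be paid about calibrated_bonus(250) == 0.49 dollars.
```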

For the Relation Extraction domain, workers are paid $0.10 as base payment and $1.00 for training. Workers are paid a per-question bonus of $0.05 for an assessment, $0.05 for a justification, and — depending on their condition — a bonus of $0.50 for a discussion or $0.05 for a reconsider task. Since we always collect a justification for each question in the discussion condition, workers in that condition are always given the full $0.10 per-question bonus. Per-question incentives are chosen to match those used in MicroTalk [10].

For the Codenames domain, workers are paid $0.20 as base payment and $1.00 for training. Workers are paid a per-question bonus of $0.20 for each correct answer and a per-discussion bonus of $0.50 for participating in a discussion, with an extra $0.25 for holding the correct answer at the end of the discussion.

While it is possible to design a more complex incentive structure, our main goal for this set of incentives is to discourage cheating behavior and to align with the incentives of MicroTalk. We think these incentives are consistent with those used in other recent crowdsourcing research [23].

Relation Extraction Domain: Binary Answer
In the interest of comparing to previous work, we evaluated our method on a traditional NLP annotation task of information extraction (IE) — identifying structured, semantic information (relational tuples, such as would be found in a SQL database) from natural language text [11]. The task is of considerable interest in the NLP community, since most IE approaches use machine learning and many exploit crowdsourced training data [40, 30, 1, 23].

Specifically, we consider the problem of annotating a sentence to indicate whether it encodes the TAC KBP LivedIn relation — does a sentence support the conclusion that a person lived in a location? While such a judgment may seem simple, the official LDC annotation guidelines are deceptively complex [35]. For

example, one can conclude that a national official lives in her country, but not that a city official lives in her city. Figure 3 defines the task, showing the instructions given to our workers.

We created a set of 23 challenging TAC KBP questions, drawing from the 20 used in MicroTalk [10] and adding 3 additional questions from Liu et al. [23]. This set was then divided into 3 batches of size 7, 8, and 8 for our discussion experiments. For gold standard questions, we selected 3 simple questions from the TAC KBP set, each of which can be resolved with an invocation of one rule. Upon recruitment, each worker is also presented with a 6-question gating quiz and is allowed 2 attempts to pass the gating threshold. Gating questions were written to be simple and unambiguous, testing whether the worker was diligent and had absorbed the guidelines.

Multi-turn vs. One-shot Workflows
Our first experiment compares worker accuracy for the multi-turn, contextual discussion workflow against that of a one-shot (non-contextual) reconsider workflow on the binary answer Relation Extraction domain (i.e., CICERO vs. MICROTALK). We deployed both conditions with the systems described in the experiment setup. Since workers need to complete all assessments before starting discussions, which would cause increased waiting time on a large set of questions, we deployed the discussion condition experiments in 3 batches (N = 9, 16, 13) corresponding to the 3 batches the experiment questions were divided into. In the reconsider condition (N = 12), workers were put through our implementation of the adaptive workflow from MICROTALK on all questions.

In both conditions, the gating threshold was set at 100% — workers needed to answer all gold standard questions correctly to be included. From the plot shown in Figure 4 we can see that the discussion condition (CICERO) improves average worker accuracy by 16.7 percentage points over the initial accuracy, compared to 6.8 for the reconsider condition (statistically significant at p = 0.0143).

We performed a t-test on the initial accuracy of workers across both conditions for each batch and found no statistically significant difference (p = 0.77, 0.78, 0.67), indicating that workers of similar quality were recruited for each of our batches. On average, workers participated in 7.7 discussions (σ = 4.75) in the discussion condition and were presented with 16.8 reconsider prompts (σ = 3.83) in the one-shot workflow.
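This check corresponds to an independent two-sample t-test on per-worker initial accuracies, which could be computed with SciPy as sketched below; the variable names are illustrative.

```python
from scipy import stats

def compare_initial_accuracy(discussion_acc, reconsider_acc):
    """Two-sample t-test on per-worker initial accuracies for one batch.

    discussion_acc, reconsider_acc: lists of initial accuracies (one per worker).
    A large p-value suggests the two conditions recruited workers of similar quality.
    """
    t_stat, p_value = stats.ttest_ind(discussion_acc, reconsider_acc)
    return t_stat, p_value
```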

We do note that discussions are more costly, largely due to paying workers for time spent waiting for their partner to respond. Each CICERO-SYNC discussion took an average of 225.3 seconds (σ = 234.8) of worker time, compared to a one-shot reconsider task averaging 13.6 seconds (σ = 15.0). We believe that an asynchronous implementation of CICERO could reduce overhead and dramatically lower costs.

Codenames Domain: Multiple Choice with Many Answers
Previous work using one-shot argumentation focused mainly on evaluating argumentation in domains that only acquired binary answers, such as Relation Extraction. However, we observed that this is not sufficient to represent a wide variety of real-world tasks. In a binary answer setting, workers are able to write justifications arguing against the single opposing belief. As the number of answer options grows, it becomes increasingly inefficient and even infeasible to collect justifications as a part of each worker's assessment.

Candidates: business, card, knot
Positive Clues: suit, tie
Negative Clues: corporation, speed
Explanation: Workers must find the single best candidate word that is related in meaning to some positive clue word, but none of the negative clues. In this example, all three candidates are related to some positive clue: a suit for business, a suit of cards, and to tie a knot. However, business relates to corporation and knot is a unit of speed. Card is the best answer: it's related to a positive clue, while being largely unrelated to any negative clues.
Best Answer: card
Table 1. Example of a simple question used for training from the Codenames domain. Real questions have around 7-10 candidate words.

Effective justifications for multiple choice answers need to address not only the selected answer, but also argue against the remaining options, making them long and difficult to understand. Multi-turn discussion can address these scaling issues through back-and-forth dialog in which workers can argue against their partner's specific choice.

We created a synthetic task inspired by the popular word association board game Codenames as a multiple choice answer domain. The Codenames game has been adopted as the main task in other cooperative work designs such as DreamTeam [41], which utilized a cooperative version of the game. The objective in the game is for each team to identify the tiles assigned to them from a shared list of word tiles. Clue words are given by one team member (the "spymaster") who can see the assignment of word tiles (which ones belong to which team), while other teammates have to find the correct word tiles for their team while avoiding the tiles assigned to the other team.

Our Codenames task domain draws inspiration from the competitive aspect of the game. We observe that late in the game, good word guesses are often informed by both the teammate clues (which should be matched) and opponent clues (which should be avoided). With this observation, we created tasks which consist of a list of candidate words plus several positive and several negative clue words. Workers, in the role of a team member, are instructed to find the single best candidate word that is related in meaning to some positive clue word but none of the negative clues. An example of this task can be seen in Table 1. Each question contains around 2 positive clues, 2-3 negative clues, and 7-10 candidate words. We created 3 gating questions, 7 experiment questions, and 1 individual assessment question for this task. We used a gating threshold of 66.7%. While Codenames is not a typical task for crowd work, as also noted in DreamTeam, we think its aspect of multiple choice answers is representative of a whole class of similar tasks that lack effective one-shot argumentation strategies.

Figure 5. Initial and final accuracy of multi-turn argumentation on the Codenames domain with 95% confidence intervals.

The loose definition of words being "related" in the Codenames domain reduces the amount of worker training required for participation, since it relies on common knowledge of language. However, this may lead to ambiguity in reference answers, which would be undesirable. We elected to manually create a set of questions which were validated to have only one objectively best answer. The distractors for each question and our reference argument were evaluated with a group of expert pilot testers. We confirmed that all participants agreed with our reference counter-arguments against the distractors and also with our reference answer. In the pilot test, we also noted that this task can be very challenging even for experts, as multiple word senses are involved in the distractors.

Evaluating on Multiple Choice Tasks with Many Answers
Our second experiment examines the performance of CICERO-SYNC on multiple choice answer tasks from the Codenames domain, a domain that would be very inefficient for one-shot argumentation (justifications would need to address up to 9 alternatives). While initial worker accuracy is only 66.7%, CICERO-SYNC achieves a final average worker accuracy of 98.8%, a significant improvement (N = 12).

We performed an ANOVA omnibus test with a mixed effects model using worker initial accuracy as a random effect and found that the experiment stage (initial vs. final) is statistically significant (F(1, 57.586) = 85.608, p = 5.445 × 10^-13 < 0.001). The average duration of each discussion was 123.56 seconds (σ = 64.79) and workers had an average of 6.3 discussions (σ = 3.89) in the Codenames domain.
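One way to approximate this kind of analysis in Python is a linear mixed-effects model with a per-worker random intercept, sketched below with statsmodels; this is an illustration of the described analysis, not necessarily the exact statistical pipeline we used, and the column names are assumptions.

```python
import pandas as pd
import statsmodels.formula.api as smf

def stage_effect(records):
    """Fit correctness ~ stage with a per-worker random intercept.

    records: iterable of dicts like
        {"worker": id, "stage": "initial" or "final", "correct": 0 or 1}.
    (Illustrative sketch; a binary outcome could also be modeled differently.)
    """
    df = pd.DataFrame(records)
    model = smf.mixedlm("correct ~ stage", data=df, groups=df["worker"])
    result = model.fit()
    return result.summary()
```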

Discussion Characteristics
We can see from the previous experiments that multi-turn, contextual argumentation is effective at improving worker accuracy across a variety of tasks, but are the discussions actually taking advantage of multi-turn arguments and the available context? To answer this question, we collected and analyzed the transcripts recorded for each domain: Relation Extraction and Codenames.

We computed statistics on multi-turn engagement by analyzing the number of worker-initiated messages, each of which is considered a turn. We found that in the Relation Extraction domain, discussions averaged 7.5 turns (σ = 6.1, median of 5), while in Codenames discussions averaged 8.3 turns (σ = 4.23,


median of 7). We also found that in Codenames, the number of turns correlates with convergence on the correct answer (F(1, 31) = 7.2509, p < 0.05), while we found no significant relation between turns and convergence (p > 0.1) in the Relation Extraction domain. We note that in Relation Extraction, discussions are seeded with workers' justifications from the assess task (equivalent to 2 non-contextual turns, not counted in the average), whereas discussions in the Codenames domain use actual contextual turns to communicate this information. Codenames discussions also utilized extra turns to argue about alternative choices.

Additionally, we noticed several patterns in the discussion text that appeared in both domains. We further examined these patterns by coding the discussion transcripts (147 from Relation Extraction and 38 from Codenames). We surveyed the discussions looking only at patterns specific to argumentation and came up with 8 patterns related to argumentation techniques and 6 reasons workers changed their answer.

We then narrowed down the argumentation patterns by removing any that were highly correlated or any that had just 1-2 examples, and finalized the following 4 prominent patterns as codes:

• Refute: Argue by directly giving a reason for why the partner's specific answer is believed to be incorrect. Examples: "Small [partner choice] is the opposite of large [negative clue] and will not work"; "Louisana isn't a country, therefore NonCountry applies."

• Query: Ask the partner to explain their answer, a part of their answer, or ask for a clarification in their explanation. Examples: "Why do you think it should be bill?"; "How would bridge work?"

• Counter: Pose a counter-argument to a partner in response to their explanation. Example: A: "Erdogan's government is nationally affiliated with Turkey." B: "[...] The sentence could be interpreted as one of Turkey's allies is helping them with the EU thing."

• Previous: Explicitly state that knowledge or a line of reasoning acquired from a previous discussion is being used. Example: "I had window at first too, but someone else had bridge, but they thought bridge because of the card game bridge, and that made sense to me"

We found that workers used these contextual patterns frequently during their discussions in both domains, with 77.6% and 86.8% of all discussions utilizing at least one pattern in the Relation Extraction and Codenames domains respectively. We can also see that the distribution of patterns across the two domains (Table 2), on discussions converging to the correct answer, indicates that the utility of each pattern may differ across domains. We hypothesize that the higher frequency of Counter and lower frequency of Query in Relation Extraction is likely due to the justification seeding, which reduced the need for workers to ask for explanations but encouraged more counter-arguments.

Pattern     Relation Extraction   Codenames
Refute      42%                   59%
Query       25%                   35%
Counter     34%                   14%
Previous    16%                   10%
Table 2. Proportion of each pattern appearing in discussions that converged to the correct answer for each domain.

We also condensed the reasons for workers changing their answer down to 3 basic categories: learning about the task (rules), agreeing on the meaning of concepts in a question, and being convinced by an argument. After coding the discussions, we found that the distribution of reasons for changing answers was 18%, 3%, 79% for the Relation Extraction domain and 17%, 28%, 55% for Codenames, across each category (task, question, convinced) respectively, showing that discussions can help workers understand the task.

We also observed that 70% of all discussions and 75% of discussions converging to the right answer used our rule shorthands when referring to the rules instead of describing them. However, we note that simply citing shorthands doesn't correlate with convergence of a discussion (p > 0.1).
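Counting shorthand citations in transcripts amounts to simple pattern matching, as sketched below; the function and the example shorthand ("NonCountry", which appears in our transcripts) are illustrative, and the full shorthand list is not reproduced here.

```python
import re

def discussions_citing_shorthands(transcripts, shorthands):
    """Fraction of discussions whose messages mention at least one rule shorthand.

    transcripts: list of discussions, each a list of message strings.
    shorthands: list of shorthand names, e.g. ["NonCountry", ...] (illustrative).
    """
    pattern = re.compile("|".join(re.escape(s) for s in shorthands), re.IGNORECASE)
    citing = sum(1 for messages in transcripts if any(pattern.search(m) for m in messages))
    return citing / len(transcripts) if transcripts else 0.0
```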

Do Workers Learn Through Discussion?
While we didn't design discussions to be used as a way of training workers, many workers reported in pilot experiment feedback that they "understood the task much better" after discussions, so we explored the effects of discussions on workers' future accuracy. We tested a worker's performance by adding post-test questions after they finished their corresponding experiment condition. We selected 4 questions for the Relation Extraction domain and 1 for the Codenames domain, all of comparable difficulty to the main questions, to be individually evaluated.

Average accuracy on the individual evaluation sections trended higher for the discussion condition: accuracies were 66.7%, 69.3%, and 73.9% for the baseline (no argumentation), reconsider, and discussion conditions respectively in the Relation Extraction domain, and 46.7% and 52.0% for the baseline and discussion conditions in the Codenames domain. However, ANOVA on all conditions for each domain shows no statistically significant interaction (F(1, 49.1) = 0.013, p > 0.1 and F(2, 58.3) = 1.03, p > 0.1 for Codenames and Relation Extraction respectively) between the experiment condition and accuracy on the individual evaluation questions. We conjecture that the need for argumentation may be reduced as workers better learn the guidelines through peer interaction [20, 9], but the most difficult questions will likely always warrant some debate.

DISCUSSION
In this paper we explored the potential of multi-turn, contextual argumentation as a next step for improving crowdsourcing accuracy. Multi-turn argumentation systems provide some distinct benefits over one-shot reconsider systems. Since workers can respond directly to each other's arguments, it is much more likely that a correct minority opinion can convince the majority without the need to recruit a huge number of workers. This power does come with caveats, noted earlier, such as higher costs and the need to train workers to recognize and make good arguments (in addition to training them on the base task). Our results suggest that workers can be trained to discuss successfully.


Figure 6. Scaling of majority vote (green) and EM-aggregated performance (blue) for one-shot argumentation (MicroTalk) on the Relation Extraction domain, computed by simulation (100 simulations per budget) excluding training cost. Confirming previous reports [7, 10], we observe that accuracy plateaus. While expensive due to the use of real-time crowdsourcing, EM-aggregated performance of CICERO-SYNC (shown as a red dot) is higher.

While each discussion task in CICERO-SYNC required more worker time, we found significantly higher gains in individual worker accuracy compared to the reconsider condition from MICROTALK. We believe that much of the increase in work time stems from our decision to use synchronous, real-time crowdsourcing in CICERO-SYNC, leading to higher per-argument-task costs. In a synchronous environment, workers must wait for other workers' actions during and in between discussions. Since our experiments focused on evaluating the multi-turn argumentation workflow, synchronized discussions allowed us to collect data in a controlled way. Many efficiency optimizations that we did not explore could be implemented to run the CICERO workflow at scale in a more cost-effective way. Specifically, an asynchronous implementation of CICERO would eliminate the need for workers to wait for each other, reducing costs. Alternatively, if a synchronous implementation were run at larger scale on a much larger set of problems, there would be proportionately less overhead.

Argumentation (both contextual and one-shot) is likely unnecessary for many crowdsourcing tasks. For example, if one is merely labeling training data for supervised machine learning (a common application), then it may be more cost effective to eschew most forms of quality control (majority vote, EM, or argumentation) and instead collect a larger amount of noisy data [21]. However, if one needs data of the highest possible accuracy, then argumentation — specifically contextual, multi-turn argumentation — is the best option. CICERO may cost more than other approaches, but as shown in Figure 6, it achieves higher accuracy than any other approach, regardless of cost. Furthermore, the CICERO workflow can handle questions with many possible answers, such as those in Codenames. Single-shot argumentation systems, such as MICROTALK, aren't practical in these situations because of the need to pre-collect arguments against so many alternative answers.

In the end, the most cost-effective crowd technique depends on both problem difficulty and quality requirements. High-cost methods, like argumentation, should be reserved for the most difficult tasks, such as developing challenging machine learning test sets, or tasks comprising a high-stakes decision, where a corresponding explanation is desirable.

CONCLUSION & FUTURE WORK
In this paper, we presented CICERO, a novel workflow that engages workers in multi-turn, contextual argumentation (discussion) to improve crowd accuracy on difficult tasks. We implemented this workflow using a synchronous, real-time design for discussion tasks and created the CICERO-SYNC system. Since the quality of a discussion depends on its participants, we also designed and implemented gated instructions and a novel justification training task for CICERO-SYNC to ensure competent discussions by improving workers' ability to recognize and synthesize good arguments.

We demonstrate that our implementation of CICERO-SYNC, the synchronous version of the CICERO workflow, is able to achieve:

• Higher improvement of individual worker accuracy compared to a state-of-the-art, one-shot argumentation system on a difficult NLP annotation task: 16.7 vs. 6.8 percentage points of improvement, at a higher cost, and

• Very high accuracy on a task with non-binary answer choices that would be impractical with one-shot argumentation: 98.8% accuracy (a 32.1 percentage point improvement over the initial accuracy).

Both of these accuracies are much higher than can be achieved without argumentation: traditional approaches without argumentation, such as majority vote and expectation maximization, plateau at 65% on similar questions [10]. Additionally, we observed several interesting patterns of discourse that are enabled by multi-turn, contextual argumentation and note that many successful discussions utilize these patterns.

There are many future directions for improving the argumentation workflow and its system implementation. The cost of argumentation is still relatively high, but it may be reduced further. It is common for people to context switch between several ongoing discussions as well as to hold group discussions with many people; both would be interesting modifications to the workflow. We also envision that better models of discussions could allow a system to only pair arguments where the outcome reduces uncertainty, as sketched below.
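As an illustration of the last point, one hypothetical heuristic, assuming the system tracks the current vote distribution per question, is to spend discussion budget on the highest-entropy questions and to pair two workers who disagree there. This sketch is not part of CICERO; the data and names are made up.

# Hypothetical heuristic (not part of CICERO): prefer discussions on the
# questions whose current vote distribution has the highest entropy, and
# pair two workers who gave different answers to such a question.
import math
from collections import Counter

def answer_entropy(votes):
    # Shannon entropy of the empirical answer distribution for one question.
    counts = Counter(votes.values())
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def pick_discussion(all_votes):
    # Return (question, worker_a, worker_b) for the most uncertain question on
    # which two workers disagree, or None if there is no disagreement at all.
    for q, votes in sorted(all_votes.items(), key=lambda kv: -answer_entropy(kv[1])):
        workers = list(votes)
        for i, a in enumerate(workers):
            for b in workers[i + 1:]:
                if votes[a] != votes[b]:
                    return q, a, b
    return None

# Hypothetical votes: worker -> chosen answer, per question.
all_votes = {
    "q1": {"w1": "spouse", "w2": "spouse", "w3": "spouse"},   # consensus
    "q2": {"w1": "spouse", "w2": "sibling", "w3": "spouse"},  # disagreement
}
print(pick_discussion(all_votes))   # expected: ('q2', 'w1', 'w2')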

Furthermore, there is potential in incorporating natural language processing techniques to identify and support positive behavior patterns during argumentation, and in learning from misconceptions surfaced during discussion to improve training and task instructions.

Finally, we hope to apply argumentation techniques to a wider range of tasks and meta-tasks, including the issues studied in Turkomatic [19] and flash teams [32]: the process of defining a problem and refining a workflow to achieve it.

ACKNOWLEDGEMENTS
We would like to thank Eunice Jun, Gagan Bansal, and Tongshuang Wu for their helpful feedback and participation in pilot experiments. This work was supported in part by NSF grant IIS-1420667, ONR grants N00014-15-1-2774 and N00014-18-1-2193, the WRF/Cable Professorship, and support from Google.

REFERENCES

1. Gabor Angeli, Julie Tibshirani, Jean Wu, and Christopher D. Manning. 2014. Combining Distant and Partial Supervision for Relation Extraction. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, Alessandro Moschitti, Bo Pang, and Walter Daelemans (Eds.). ACL, 1556–1567. http://aclweb.org/anthology/D/D14/D14-1164.pdf

2. Michael S. Bernstein, Greg Little, Robert C. Miller, Björn Hartmann, Mark S. Ackerman, David R. Karger, David Crowell, and Katrina Panovich. 2010. Soylent: A Word Processor with a Crowd Inside. In UIST '10: Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology. ACM Press, 313–322.

3. J. Bigham, C. Jayant, H. Ji, G. Little, A. Miller, R. Miller, A. Tatarowicz, B. White, S. White, and T. Yeh. 2010. VizWiz: Nearly Real-Time Answers to Visual Questions. In UIST.

4. L. Chilton, G. Little, D. Edge, D. Weld, and J. Landay. 2013. Cascade: Crowdsourcing Taxonomy Creation. In CHI '13. ACM Press, New York, NY, USA.

5. Peng Dai, Christopher H. Lin, Mausam, and Daniel S. Weld. 2013. POMDP-based Control of Workflows for Crowdsourcing. Artificial Intelligence 202 (2013), 52–85.

6. A. P. Dawid and A. M. Skene. 1979. Maximum Likelihood Estimation of Observer Error-rates Using the EM Algorithm. Applied Statistics 28, 1 (1979), 20–28.

7. Gianluca Demartini, Djellel Eddine Difallah, and Philippe Cudré-Mauroux. 2012. ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for Large-scale Entity Linking. In Proceedings of the 21st International Conference on World Wide Web (WWW '12). ACM, New York, NY, USA, 469–478. DOI: http://dx.doi.org/10.1145/2187836.2187900

8. Jeff Donahue and Kristen Grauman. 2011. Annotator Rationales for Visual Recognition. In ICCV 2011. DOI: http://dx.doi.org/10.1109/ICCV.2011.6126394

9. Steven Dow, Anand Pramod Kulkarni, Scott R. Klemmer, and Björn Hartmann. 2012. Shepherding the Crowd Yields Better Work. In CSCW.

10. Ryan Drapeau, Lydia B. Chilton, Jonathan Bragg, and Daniel S. Weld. 2016. MicroTalk: Using Argumentation to Improve Crowdsourcing Accuracy.

11. Ralph Grishman. 1997. Information Extraction: Techniques and Challenges. In Information Extraction: A Multidisciplinary Approach to an Emerging Information Technology. Springer, 10–27.

12. Ting-Hao Kenneth Huang and Jeffrey P. Bigham. 2017. A 10-Month-Long Deployment Study of On-Demand Recruiting for Low-Latency Crowdsourcing. In Proceedings of the Fifth AAAI Conference on Human Computation and Crowdsourcing (HCOMP 2017).

13. Ting-Hao Kenneth Huang, Walter S. Lasecki, Amos Azaria, and Jeffrey P. Bigham. 2016. "Is There Anything Else I Can Help You With?": Challenges in Deploying an On-Demand Crowd-Powered Conversational Agent. In Fourth AAAI Conference on Human Computation and Crowdsourcing.

14. J. P. Kincaid, R. P. Fishburne Jr., R. L. Rogers, and B. S. Chissom. 1975. Derivation of New Readability Formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy Enlisted Personnel. Technical Report, DTIC Document.

15. Ece Kamar, Severin Hacker, and Eric Horvitz. 2012. Combining Human and Machine Intelligence in Large-scale Crowdsourcing. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems - Volume 1. International Foundation for Autonomous Agents and Multiagent Systems, 467–474.

16. David R. Karger, Sewoong Oh, and Devavrat Shah. 2011. Budget-optimal Crowdsourcing Using Low-rank Matrix Approximations. In Conference on Communication, Control, and Computing.

17. Masaki Kobayashi, Hiromi Morita, Masaki Matsubara, Nobuyuki Shimizu, and Atsuyuki Morishima. 2018. An Empirical Study on Short- and Long-Term Effects of Self-Correction in Crowdsourced Microtasks. In HCOMP.

18. Travis Kriplean, Jonathan T. Morgan, Deen Freelon, Alan Borning, and Lance Bennett. 2011. ConsiderIt: Improving Structured Public Deliberation. In CHI Extended Abstracts.

19. Anand Kulkarni, Matthew Can, and Björn Hartmann. 2012. Collaboratively Crowdsourcing Workflows with Turkomatic. In CSCW. ACM Press, New York, NY, USA. DOI: http://dx.doi.org/10.1145/2145204.2145354

20. Chinmay Kulkarni, Koh Pang Wei, Huy Le, Daniel Chia, Kathryn Papadopoulos, Justin Cheng, Daphne Koller, and Scott R. Klemmer. 2015. Peer and Self Assessment in Massive Online Classes. In Design Thinking Research. Springer, 131–168.

21. Christopher H. Lin, Mausam, and Daniel S. Weld. 2014. To Re(label), or Not To Re(label). In HCOMP.

22. Greg Little, Lydia B. Chilton, Max Goldman, and Robert C. Miller. 2009. TurKit: Tools for Iterative Tasks on Mechanical Turk. In Human Computation Workshop (HComp 2009).

23. Angli Liu, Stephen Soderland, Jonathan Bragg, Christopher H. Lin, Xiao Ling, and Daniel S. Weld. 2016a. Effective Crowd Annotation for Relation Extraction. In Proceedings of NAACL and HLT 2016.

24. Angli Liu, Stephen Soderland, Jonathan Bragg, Christopher H. Lin, Xiao Ling, and Daniel S. Weld. 2016b. Effective Crowd Annotation for Relation Extraction. In Proceedings of NAACL and HLT 2016.

25. Andrew Mao, Yiling Chen, Krzysztof Z. Gajos, David C. Parkes, Ariel D. Procaccia, and Haoqi Zhang. 2012. TurkServer: Enabling Synchronous and Longitudinal Online Experiments. In The Fourth Workshop on Human Computation (HCOMP 2012).

26. Tyler McDonnell, Matthew Lease, Tamer Elsayed, and Mucahid Kutlu. 2016. Why Is That Relevant? Collecting Annotator Rationales for Relevance Judgments. In Proceedings of the 4th AAAI Conference on Human Computation and Crowdsourcing (HCOMP). 10.

27. Sarah Michaels, Catherine O'Connor, and Lauren B. Resnick. 2008. Deliberative Discourse Idealized and Realized: Accountable Talk in the Classroom and in Civic Life. Studies in Philosophy and Education 27, 4 (2008), 283–297.

28. Jon Noronha, Eric Hysen, Haoqi Zhang, and Krzysztof Z. Gajos. 2011. PlateMate: Crowdsourcing Nutritional Analysis from Food Photographs. In Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology. ACM, 1–12.

29. David Oleson, Alexander Sorokin, Greg P. Laughlin, Vaughn Hester, John Le, and Lukas Biewald. 2011. Programmatic Gold: Targeted and Scalable Quality Assurance in Crowdsourcing. In Human Computation Workshop. 11.

30. Maria Pershina, Bonan Min, Wei Xu, and Ralph Grishman. 2014. Infusion of Labeled Data into Distant Supervision for Relation Extraction. In Proceedings of ACL.

31. Drazen Prelec and H. Sebastian Seung. 2007. An Algorithm That Finds Truth Even If Most People Are Wrong. (2007).

32. Daniela Retelny, Sébastien Robaszkiewicz, Alexandra To, Walter S. Lasecki, Jay Patel, Negar Rahmati, Tulsee Doshi, Melissa Valentine, and Michael S. Bernstein. 2014. Expert Crowdsourcing with Flash Teams. In Proceedings of the 27th Annual ACM Symposium on User Interface Software and Technology. ACM, 75–85.

33. Rion Snow, Brendan O'Connor, Daniel Jurafsky, and A. Ng. 2008. Cheap and Fast — But Is It Good? Evaluating Non-Expert Annotations for Natural Language Tasks. In EMNLP '08.

34. Yu-An Sun, Christopher R. Dance, Shourya Roy, and Greg Little. 2011. How to Assure the Quality of Human Computation Tasks When Majority Voting Fails. In Workshop on Computational Social Science and the Wisdom of Crowds, NIPS.

35. Mihai Surdeanu. 2013. Overview of the TAC2013 Knowledge Base Population Evaluation: English Slot Filling and Temporal Slot Filling. In TAC 2013.

36. Peter Welinder, Steve Branson, Serge Belongie, and Pietro Perona. 2010. The Multidimensional Wisdom of Crowds. In NIPS.

37. Jacob Whitehill, Paul Ruvolo, Tingfan Wu, Jacob Bergsma, and Javier Movellan. 2009. Whose Vote Should Count More: Optimal Integration of Labels from Labelers of Unknown Expertise. In Proc. of NIPS. 2035–2043.

38. J. Wiebe, R. Bruce, and T. O'Hara. 1999. Development and Use of a Gold-Standard Data Set for Subjectivity Classifications. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics. 246–253. http://dl.acm.org/citation.cfm?id=1034678.1034721

39. Omar F. Zaidan, Jason Eisner, and Christine D. Piatko. 2007. Using "Annotator Rationales" to Improve Machine Learning for Text Categorization. In Proceedings of NAACL and HLT 2007.

40. Ce Zhang, Feng Niu, Christopher Ré, and Jude Shavlik. 2012. Big Data versus the Crowd: Looking for Relationships in All the Right Places. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1. Association for Computational Linguistics, 825–834.

41. Sharon Zhou, Melissa Valentine, and Michael S. Bernstein. 2018. In Search of the Dream Team: Temporally Constrained Multi-Armed Bandits for Identifying Effective Team Structures.

42. Haiyi Zhu, Steven P. Dow, Robert E. Kraut, and Aniket Kittur. 2014. Reviewing Versus Doing: Learning and Performance in Crowd Assessment. In Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW '14). ACM, New York, NY, USA, 1445–1455. DOI: http://dx.doi.org/10.1145/2531602.2531718