
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 175–180, Sofia, Bulgaria, August 4–9, 2013. © 2013 Association for Computational Linguistics

TransDoop: A Map-Reduce based Crowdsourced Translation for Complex Domains

Anoop Kunchukuttan∗, Rajen Chatterjee∗, Shourya Roy†, Abhijit Mishra∗, Pushpak Bhattacharyya∗

∗ Department of Computer Science and Engineering, IIT Bombay, {anoopk,abhijitmishra,pb}@cse.iitb.ac.in, [email protected]

† Xerox India Research Centre, [email protected]

Abstract

Large amounts of parallel corpora are required for building Statistical Machine Translation (SMT) systems. We describe the TransDoop system for gathering translations to create parallel corpora from an online crowd workforce that is familiar with multiple languages but not composed of expert translators. Our system uses a Map-Reduce-like approach to translation crowdsourcing in which sentence translation is decomposed into the following smaller tasks: (a) translation of the constituent phrases of a sentence; (b) validation of the quality of the phrase translations; and (c) composition of complete sentence translations from the phrase translations. TransDoop incorporates quality control mechanisms and easy-to-use worker interfaces designed to address the issues that arise in translation crowdsourcing. We have evaluated the crowd's output using the METEOR metric. For a complex domain like judicial proceedings, the higher scores obtained by the map-reduce based approach compared to complete sentence translation establish the efficacy of our work.

1 Introduction

Crowdsourcing is no longer a new term in Computational Linguistics and Machine Translation research (Callison-Burch and Dredze, 2010; Snow et al., 2008; Callison-Burch, 2009). Crowdsourcing, in which tasks are outsourced to a largely unknown Internet audience, is emerging as a new paradigm of human-in-the-loop approaches for developing sophisticated techniques for understanding and generating natural language content. Amazon Mechanical Turk (AMT) and CrowdFlower¹ are representative general-purpose crowdsourcing platforms, whereas Lingotek and Gengo² are companies targeted at localization and translation of content, typically leveraging freelancers.

Our interest is in developing a crowdsourcing-based system that enables general, non-expert crowd workers to generate natural language content equivalent in quality to that of expert linguists. Realizing the potential scalability and cost benefits of crowdsourcing for natural language tasks is limited by the ability of novice multilingual workers to generate high-quality translations. We have a specific interest in Indian languages, due to their large linguistic diversity as well as the scarcity of linguistic resources in these languages compared to European languages. Crowdsourcing is a promising approach since many Indian languages are spoken by hundreds of millions of people (approximately 500M for Hindi-Urdu, 200M for Bangla and over 100M for Punjabi³), and the representation of Indian workers on online crowdsourcing platforms is very high (close to 40% on Amazon Mechanical Turk (AMT)).

However, this is a non-trivial task owing to novice crowd workers' lack of expertise in translation. It is well understood that familiarity with multiple languages might not be sufficient for people to generate high-quality translations. This is compounded by a lack of sincerity and, in certain cases, the dishonest intention of earning rewards disproportionate to the effort and time spent on online tasks. Common techniques for quality control, like gold-data-based validation and worker reputation, are not effective for a subjective task like translation, which lacks task-specific correctness measures.

¹ http://www.mturk.com, http://www.crowdflower.com

² http://www.lingotek.com, http://www.gengo.com

³ http://en.wikipedia.org/wiki/List_of_languages_by_total_number_of_speakers


Having expert linguists manually validate crowd-generated content defeats the purpose of deploying crowdsourcing on a large scale.

In this work, we propose a technique based on the divide-and-conquer principle. The technique can be considered similar to a Map-Reduce task run on crowd processors, where the translation task is split into simpler tasks distributed to the crowd (the map stage) and the results are later combined in a reduce stage to generate complete translations. The attempt is to make translation tasks easy and intuitive for novice crowd workers by providing translation aids that help them generate high-quality translations. Our contribution in this work is an end-to-end, crowdsourcing-platform-independent translation crowdsourcing system that completely automates the translation crowdsourcing task by (i) managing the translation pipeline through software components and the crowd; (ii) performing quality control on workers' output; and (iii) interfacing with crowdsourcing service providers. The multi-stage, map-reduce approach simplifies the translation task for crowd workers, while the novel design of the user interface makes the task convenient for the worker and discourages spamming. The system thus offers the potential to generate high-quality parallel corpora on a large scale.

We discuss related work in Section 2 and the multi-stage approach which is central to our system in Section 3. Section 4 describes the system architecture and workflow, while Section 5 presents important aspects of the user interfaces in the system. We present our preliminary experiments and observations in Section 6. Section 7 concludes the paper, pointing to future directions.

2 Related Work

Lately, crowdsourcing has been explored as a source for generating data for NLP tasks (Snow et al., 2008; Callison-Burch and Dredze, 2010). Specifically, it has been explored as a channel for collecting different resources for SMT: evaluations of MT output (Callison-Burch, 2009), word alignments in parallel sentences (Gao et al., 2010) and post-edited versions of MT output (Aikawa et al., 2012). Ambati and Vogel (2010) and Kunchukuttan et al. (2012) have shown the feasibility of crowdsourcing for collecting parallel corpora and pointed out that quality assurance is a major issue for successful translation crowdsourcing.

The most popular methods for quality control of crowdsourced tasks are based on sampling and redundancy. For translation crowdsourcing, Ambati et al. (2010) use inter-translator agreement to select a good translation from multiple, redundant worker translations. Zaidan and Callison-Burch (2011) score translations using a feature-based model comprising sentence-level, worker-level and crowd-ranking-based features. However, automatic evaluation of translation quality is difficult, as such automatic methods are either inaccurate or expensive. Post et al. (2012) have collected Indic language corpora by using the crowd for collecting translations as well as validations; the quality of the validations is ensured using gold-standard sentence translations. Our approach to quality control is similar to that of Post et al. (2012), but we work at the level of phrases.

While most crowdsourcing activities for data gathering have been concerned with collecting simple annotations like relevance judgments, there has been work exploring the use of crowdsourcing for more complex tasks, of which translation is a good example. Little et al. (2010) propose that many complex tasks can be modeled either as iterative workflows (where workers iteratively build on each other's work) or as parallel workflows (where workers solve the tasks in parallel, with the best result voted upon later). Kittur et al. (2011) suggest a map-and-reduce approach to solving complex problems, where a problem is decomposed into smaller problems, which are solved in the map stage, and the results are combined in the reduce stage. Our method can be seen as an instance of the map-reduce approach applied to translation crowdsourcing, with two map stages (phrase translation and translation validation) and one reduce stage (sentence composition).

3 Multi-Stage Crowdsourcing Pipeline

Our system is based on a multi-stage pipeline whose central idea is to simplify the translation task by breaking it into smaller tasks. The high-level block diagram of the system is shown in Figure 1. Source language documents are split into sentences using standard NLP tokenizers and sentence splitters. The extracted sentences are then split into phrases using a standard chunker together with rule-based merging of small chunks. This step creates small phrases from complex sentences that can be easily and independently translated.


Figure 1: Multistage crowdsourced translation

This leads to a crowdsourcing pipeline with three stages of tasks for the crowd: Phrase Translation (PT), Phrase Translation Validation (PV) and Sentence Composition (SC). A group of crowd workers translates source language phrases, the translations are validated by a different group of workers, and finally a third group of workers puts the phrase translations together to create target language sentences. Validation is done by workers providing ratings on a k-point scale. This kind of divide-and-conquer approach helps to tackle the complexity of crowdsourcing translations since: (1) the tasks are simpler for workers; (2) the uniformity of smaller tasks brings about efficiency, as in any industrial assembly line; (3) pricing can be controlled for each stage depending on its complexity; and (4) quality control can be performed better for smaller tasks.
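To make the decomposition concrete, the following is a minimal sketch of the map and reduce stages under stated assumptions: the `Task` record, the `merge_small_chunks` heuristic and all other names are illustrative, since the paper describes the design but not its code, and the real system uses a standard chunker rather than the word-count heuristic shown here. The PV stage sits between the two functions and is sketched separately in Section 4.

```python
from dataclasses import dataclass

@dataclass
class Task:
    """One unit of crowd work in the pipeline (hypothetical record)."""
    kind: str       # "PT" = phrase translation, "PV" = validation, "SC" = composition
    payload: dict

def merge_small_chunks(chunks, min_words=3):
    """Rule-based merging of short chunks into translatable phrases,
    approximating the 'chunker + merging of small chunks' step."""
    phrases, buf = [], []
    for chunk in chunks:
        buf.append(chunk)
        if sum(len(c.split()) for c in buf) >= min_words:
            phrases.append(" ".join(buf))
            buf = []
    if buf:  # attach any leftover words as a final phrase
        phrases.append(" ".join(buf))
    return phrases

def map_stage(sentence_id, chunks):
    """Map: one PT task per phrase of the source sentence."""
    return [Task("PT", {"sentence": sentence_id, "phrase": p})
            for p in merge_small_chunks(chunks)]

def reduce_stage(sentence_id, validated_translations):
    """Reduce: a single SC task asking a worker to compose the full
    sentence from the validated phrase translations."""
    return Task("SC", {"sentence": sentence_id, "phrases": validated_translations})

# Example: chunker output merged into phrases and mapped to PT tasks.
tasks = map_stage(42, ["the penalty", "imposed", "by AO", "is not justified"])
```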

4 System Architecture

Figure 2 shows the architecture of TransDoop, which implements the three-stage pipeline. The major design considerations were that the system should: (i) keep the translation crowdsourcing pipeline independent of specific crowdsourcing platforms; (ii) support multiple crowdsourcing platforms; (iii) allow customization of job parameters like pricing, quality control method and task design; and (iv) support multiple languages and domains.

The core component of the system is the Crowdsourcing Engine. The engine manages the execution of the crowdsourcing pipeline, the lifecycle of jobs and the quality control of submitted tasks. The Engine exposes its capabilities through the Requester API, which clients can use to set up, customize and monitor translation crowdsourcing jobs and to control their execution. These capabilities are made available to requesters via the Requester Portal. To make the crowdsourcing engine independent of any specific crowdsourcing platform, platform-specific Connectors are developed. The crowdsourcing system makes the tasks to be crowdsourced available through the Connector API. The connectors are responsible for polling the engine for tasks to be crowdsourced, pushing the tasks to crowdsourcing platforms, hosting worker interfaces for the tasks and pushing the results back to the engine after they have been completed by workers on the crowdsourcing platform. Currently the system supports the AMT crowdsourcing platform.

Figure 3 depicts the lifecycle of a translation crowdsourcing job. The requester initiates a translation job for a document (a set of sentences). The Crowdsourcing Engine schedules the job for execution. It first splits each sentence into phrases. PT tasks are then created for the job and made available through the Connector API. The connector for the specified platform periodically polls the Crowdsourcing Engine via the Connector API. Once the connector has new PT tasks for crowdsourcing, it interacts with the crowdsourcing platform to request crowdsourcing services. The connector monitors the progress of the tasks and, on completion, provides the results and execution status to the Crowdsourcing Engine. Once all the PT tasks for the job are completed, the Crowdsourcing Engine initiates the PV tasks to obtain validations for the translations. The quality control system kicks in when all the PV tasks for the job have been completed.
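The polling handshake between a connector and the engine could look like the sketch below. The REST endpoints, payload shapes and the `platform_client` wrapper are assumptions for illustration; the paper does not specify the Connector API's wire format.

```python
import time
import requests  # third-party HTTP client; endpoints below are illustrative only

ENGINE_URL = "http://localhost:8080/connector-api"  # hypothetical Connector API root

def connector_loop(platform_client, interval_sec=60):
    """Poll the Crowdsourcing Engine for pending tasks, publish them to the
    crowdsourcing platform, and push completed results back to the engine.
    `platform_client` stands in for a platform SDK wrapper (e.g. around AMT)."""
    while True:
        # 1. Fetch tasks the engine has queued for crowdsourcing.
        pending = requests.get(f"{ENGINE_URL}/tasks",
                               params={"status": "pending"}).json()
        for task in pending:
            platform_client.publish(task)        # e.g. create an AMT HIT
        # 2. Collect results workers have submitted on the platform.
        for result in platform_client.fetch_completed():
            requests.post(f"{ENGINE_URL}/results", json=result)
        time.sleep(interval_sec)
```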

Quality control (QC) relies on a combination of sampling and redundancy. Each PV task contains a few gold-standard phrase translation pairs, which are used to ensure that validators are doing their tasks honestly.


Figure 2: Architecture of TransDoop

Figure 3: Lifecycle of a Translation Job

The judgments from the good validators are used to determine the quality of each phrase translation, based on majority voting, average rating, etc., over the multiple judgments collected for each phrase translation. If any phrase translations or validations are found to be incorrect, the corresponding phrases or translations are sent back to the PT or PV stage, as the case may be. This continues until all phrase translations in the job are correct or a pre-configured number of iterations has been reached.
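A minimal sketch of this QC logic, assuming a 5-point rating scale, a majority-vote acceptance rule and invented thresholds (the paper names the mechanisms but not the exact parameters):

```python
from collections import defaultdict

GOLD_ACCURACY_THRESHOLD = 0.8  # assumed cutoff; the paper does not publish one
GOOD_RATING = 3                # assumed: ratings >= 3 on a 5-point scale mean "good"

def honest_validators(validations, gold_answers):
    """Keep validators whose judgments on the embedded gold-standard phrase
    pairs (phrase_id -> is-this-a-good-translation) agree with the known
    answers often enough."""
    correct, total = defaultdict(int), defaultdict(int)
    for v in validations:
        if v["phrase_id"] in gold_answers:
            total[v["worker"]] += 1
            correct[v["worker"]] += int((v["rating"] >= GOOD_RATING)
                                        == gold_answers[v["phrase_id"]])
    return {w for w in total if correct[w] / total[w] >= GOLD_ACCURACY_THRESHOLD}

def accept_translation(ratings):
    """Majority vote over the k-point ratings from honest validators."""
    return sum(r >= GOOD_RATING for r in ratings) > len(ratings) / 2

def route(ratings, iteration, max_iterations=3):
    """Accepted phrase translations move on to sentence composition (SC);
    rejected ones are re-crowdsourced (PT) until the iteration budget runs out.
    Behaviour past the cap is not specified in the paper; flagging the phrase
    for manual handling is one option."""
    if accept_translation(ratings):
        return "SC"
    return "PT" if iteration < max_iterations else "FLAGGED"
```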

Once translations are obtained for all phrases in a sentence, the Crowdsourcing Engine creates SC tasks, in which workers are asked to compose a single correct, coherent sentence translation from the phrase translations obtained in the previous stages.

5 User Interfaces

5.1 Worker User Interfaces

This section describes the worker user interfaces for each stage in the pipeline. These are managed by the Connector and have been designed to make the task convenient for the worker and to prevent spam submissions. In the rest of this section, we describe the salient features of the PT and SC UIs. The PV UI is similar to the k-point-scale voting tasks commonly found on crowdsourcing platforms.

• Translation UI: Figure 4a shows the translation UI for the PT stage. The user interface discourages spamming by: (a) displaying the source text as images; and (b) alerting workers if they do not provide a translation or spend very little time on a task. The UI also provides transliteration support for non-Latin scripts (especially helpful for Indic scripts). Vocabulary support, which shows translation suggestions for word sequences appearing in the source phrase, is also available. Suggested translations can be copied to the input area quickly and easily.

• Sentence Translation Composition UI: The sentence translation composition UI (shown in Figure 4b) facilitates the composition of sentence translations from phrase translations. First, the worker can drag and rearrange the translated phrases into the right order, followed by reordering of individual words. This is important because many Indian languages have a different constituent order (S-O-V) from English (S-V-O). Finally, the synthesized target language sentence can be post-edited to correct spelling, case marking, inflectional errors, etc. The system also captures the reordering performed by the worker, an important byproduct that can be used for training reordering models for SMT (see the sketch after this list).
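As a sketch of how that byproduct could be recorded, the snippet below derives the permutation from the worker's drag order; the phrase-ID representation is an assumption, as the paper only states that reordering is captured.

```python
def reordering_permutation(source_order, worker_order):
    """Recover the permutation a worker applied when dragging phrase
    translations into target-language order."""
    position = {phrase_id: i for i, phrase_id in enumerate(source_order)}
    return [position[phrase_id] for phrase_id in worker_order]

# Example: English S-V-O phrases dragged into Hindi-like S-O-V order.
source = ["p1_subject", "p2_verb", "p3_object"]
worker = ["p1_subject", "p3_object", "p2_verb"]
print(reordering_permutation(source, worker))  # -> [0, 2, 1]
```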

5.2 Requester UI

The system provides a Requester Portal through which the requester can create, control and monitor jobs and retrieve results. The portal allows the requester to customize a job during creation by configuring various parameters: (a) domain and language pair; (b) entire-sentence vs. multi-stage translation; (c) price for the task at each stage; (d) task design (number of tasks in a task group, etc.); (e) translation redundancy; and (f) validation quality parameters. Translation redundancy refers to the number of translations requested for a source phrase. Validation quality parameters include the validation redundancy, i.e., the number of validations collected for each phrase translation pair, and the redundancy-based acceptance criteria for phrase translations (majority, consensus, threshold, etc.).
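The parameters above could be bundled as in the following sketch; the field names and default values are assumptions, not the system's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class TranslationJobConfig:
    """Illustrative bundle of the requester-tunable parameters listed above."""
    domain: str                          # (a) e.g. "Judicial", "Tourism"
    language_pair: str                   # (a) e.g. "en-hi"
    multi_stage: bool = True             # (b) False = entire-sentence translation
    price_per_stage: dict = field(       # (c) per-task price for each stage (USD)
        default_factory=lambda: {"PT": 0.02, "PV": 0.01, "SC": 0.03})
    tasks_per_group: int = 10            # (d) task design
    translation_redundancy: int = 3      # (e) translations per source phrase
    validation_redundancy: int = 3       # (f) validations per phrase translation pair
    acceptance_rule: str = "majority"    # (f) or "consensus", "threshold"

job = TranslationJobConfig(domain="Judicial", language_pair="en-hi")
```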


Figure 4: Worker User Interfaces: (a) Phrase Translation UI; (b) Sentence Composition UI

6 Experiments and Observations

Using TransDoop, we conducted a set of small-scale, preliminary translation experiments. We obtained translations for the English-Hindi and English-Marathi language pairs in the Judicial and Tourism domains. For each experiment, 15 sentences were given as input to the pipeline. For evaluation, we chose METEOR, a well-known translation evaluation metric (Banerjee and Lavie, 2005). We compared the results obtained from the crowdsourcing system with an expert human translation and with the output of Google Translate. We also compared two expert translations using METEOR to establish a skyline for translation accuracy. Table 1 summarizes the results of our experiments.
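As an illustration of the scoring step, the sketch below computes average sentence-level METEOR using NLTK's implementation as a stand-in for the original METEOR tool of Banerjee and Lavie (2005); the toy strings are invented, and scoring Hindi or Marathi output in earnest would need language-appropriate tokenization and resources.

```python
# Requires: pip install nltk, then nltk.download('wordnet') once.
from nltk.translate.meteor_score import meteor_score

def average_meteor(hypotheses, references):
    """Average sentence-level METEOR over a test set; recent NLTK versions
    expect pre-tokenized input, so we split on whitespace here."""
    scores = [meteor_score([ref.split()], hyp.split())
              for hyp, ref in zip(hypotheses, references)]
    return sum(scores) / len(scores)

# Toy English example only; not the paper's actual test data.
print(average_meteor(
    ["the penalty imposed is not justified and is cancelled"],
    ["accordingly the penalty imposed by AO is not justified and the same is cancelled"]))
```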

As evaluated by METEOR, the translations produced with quality control and the multi-stage pipeline are better than Google translations and than translations obtained from the crowd without any quality control. Multi-stage translation yields better results than complete sentence translation. Moreover, the translation quality is comparable to that of expert human translation. This behavior is observed across both language pairs and domains, and can be seen in the examples of crowdsourced translations obtained through the system shown in Table 2.

Incorrect splitting of sentences can cause difficulties in translation for the worker. For instance, discontinuous phrases will not be available to the worker as a single translation unit. In English interrogative sentences, the noun phrase splits the verb phrase, so the auxiliary and the main verb could end up in different translation units, e.g.,

Why did you buy the book?

In addition, the phrase structures of the source and target languages may not map onto each other, making translation difficult. For instance, the vaala modifier in Hindi translates to a clause in English. It does not contain any tense information, so the tense of the English clause cannot be determined by the worker, e.g.,

Lucknow vaalaa ladkaa

could translate to any one of:

the boy who lives/lived/is living in Lucknow

We rely on the worker in the sentence composition stage to correct mistakes due to these inadequacies and compose a good translation. In addition, the worker in the PT stage could be provided with the sentence context for translation. However, there is a tradeoff between the cognitive load of context processing and the uncertainty in translation: to what extent can the cognitive load be reduced before uncertainty of translation sets in, and how much context can be shown before the cognitive load becomes pressing?

7 Conclusions

In this system demonstration, we present TransDoop, a translation crowdsourcing system that has the potential to harness the strength of the crowd to collect high-quality human translations on a large scale. It simplifies tedious translation tasks by decomposing them into several "easy-to-solve" subtasks while ensuring quality. Our evaluation on small-scale data shows that the multi-stage approach performs better than complete sentence translation. We would like to use this platform extensively for large-scale experiments on more language pairs and on complex domains like health, parliamentary proceedings, and technical and scientific literature, to establish the utility of the method for the collection of parallel corpora on a large scale.


Language Pair | Domain   | Google Translate | No QC | With QC (single stage) | With QC (multi stage) | Reference Human
en-mr         | Tourism  | 0.227*           | 0.30  | 0.368                  | 0.372                 | 0.48
en-hi         | Tourism  | 0.292            | 0.363 | 0.387                  | 0.422                 | 0.51
en-hi         | Judicial | 0.252            | 0.30  | 0.388                  | 0.436                 | 0.49

Table 1: Experimental results: comparison of METEOR scores for different techniques, language pairs and domains.
* Translated by an internal Moses-based SMT system.

(a) English-Hindi Judicial translation
Source: Accordingly the penalty imposed by AO is not justified and the same is cancelled.
Google Translate (Hindi; word-by-word gloss): Accordingly A O by imposed penalty justified not is and one also cancel did
Three-stage pipeline (Hindi; word-by-word gloss): Accordingly A O by imposed penalty justified not is and that cancel did

(b) English-Hindi Tourism translation
Source: A crowd of devotees engulf Haridwar during the time of daily prayer in the evening
Google Translate (Hindi; word-by-word gloss): evening in daily prayer of time during devotees its engulf in take Haridwar of crowd
Three-stage pipeline (Hindi; word-by-word gloss): devotees of crowd evening in daily prayer of time Haridwar its engulf in take

Table 2: Examples of translations from Google Translate and from the three-stage pipeline for two source sentences. The Hindi translations are shown by their English glosses; the original Devanagari text did not survive PDF extraction.


References

Takako Aikawa, Kentaro Yamamoto, and Hitoshi Isahara. 2012. The impact of crowdsourcing post-editing with the collaborative translation framework. In Advances in Natural Language Processing. Springer Berlin Heidelberg.

Vamshi Ambati and Stephan Vogel. 2010. Can crowds build parallel corpora for machine translation systems? In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk.

Vamshi Ambati, Stephan Vogel, and Jaime Carbonell. 2010. Active learning and crowd-sourcing for machine translation. In Proceedings of the Language Resources and Evaluation Conference (LREC).

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization.

Chris Callison-Burch and Mark Dredze. 2010. Creating speech and language data with Amazon's Mechanical Turk. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk.

Chris Callison-Burch. 2009. Fast, cheap, and creative: Evaluating translation quality using Amazon's Mechanical Turk. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing.

Qin Gao, Nguyen Bach, and Stephan Vogel. 2010. A semi-supervised word alignment algorithm with partial manual alignments. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR.

Aniket Kittur, Boris Smus, Susheel Khamkar, and Robert E. Kraut. 2011. CrowdForge: Crowdsourcing complex work. In Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology.

Anoop Kunchukuttan, Shourya Roy, Pratik Patel, Kushal Ladha, Somya Gupta, Mitesh Khapra, and Pushpak Bhattacharyya. 2012. Experiences in resource generation for machine translation through crowdsourcing. In Proceedings of the Language Resources and Evaluation Conference (LREC).

Greg Little, Lydia B. Chilton, Max Goldman, and Robert C. Miller. 2010. Exploring iterative and parallel human computation processes. In Proceedings of the ACM SIGKDD Workshop on Human Computation.

Matt Post, Chris Callison-Burch, and Miles Osborne. 2012. Constructing parallel corpora for six Indian languages via crowdsourcing. In Proceedings of the Seventh Workshop on Statistical Machine Translation.

Rion Snow, Brendan O'Connor, Daniel Jurafsky, and Andrew Y. Ng. 2008. Cheap and fast—but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Omar Zaidan and Chris Callison-Burch. 2011. Crowdsourcing translation: Professional quality from non-professionals. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies.
