
Overview of DETOXIS at IberLEF 2021: DEtection of TOXicity in comments In Spanish

Visión general de DETOXIS en IberLEF 2021: DEtección de TOXicidad en comentarios En Español

Mariona Taulé (1), Alejandro Ariza (1), Montserrat Nofre (1), Enrique Amigó (2), Paolo Rosso (3)

(1) CLiC, Universitat de Barcelona, Spain
(2) Research Group in NLP and IR, Universidad Nacional de Educación a Distancia, Spain
(3) PRHLT Research Center, Universitat Politècnica de València, Spain

{mtaule, alejandro.ariza14, montsenofre}@ub.edu, [email protected], [email protected]

Abstract: In this paper we present the DETOXIS task, DEtection of TOxicity in comments In Spanish, which took place as part of the IberLEF 2021 Workshop on Iberian Languages Evaluation Forum at the SEPLN 2021 Conference. We describe the NewsCom-TOX dataset used for training and testing the systems, the metrics applied for their evaluation and the results obtained by the submitted approaches. We also provide an error analysis of the results of these systems.
Keywords: Toxicity detection, Rank Biased Precision, Closeness Evaluation Measure, NewsCom-TOX corpus.

Resumen: En este artículo se presenta la tarea DETOXIS, DEtección de TOxicidad en comentarios en español, que tuvo lugar en el Iberian Languages Evaluation Forum workshop (IberLEF 2021) en el congreso de la SEPLN 2021. Se describe el corpus NewsCom-TOX utilizado para entrenar y evaluar los sistemas, las métricas para evaluarlos y los resultados obtenidos por las distintas aproximaciones utilizadas. Se proporciona también un análisis de los resultados obtenidos por estos sistemas.
Palabras clave: Detección de toxicidad, Rank Biased Precision, Closeness Evaluation Measure, NewsCom-TOX corpus.

1 Introduction

The aim of the DETOXIS task is the detection of toxicity in comments posted in Spanish in response to different online news articles related to immigration. The DETOXIS task is divided into two related classification subtasks: the Toxicity detection task and the Toxicity level detection task, which are described in Section 2. The presence of toxic messages on social media and the need to identify and mitigate them have led to the development of systems for their automatic detection. The automatic detection of toxic language, especially in tweets and comments, is a task that has attracted growing interest from the Natural Language Processing (NLP) community in recent years. This interest is reflected in the diversity of the shared tasks that have been organized recently, among which we highlight those held over the last two years: HateEval-2019 [1] (Basile et al., 2019) on hate speech against immigrants and women in English and Spanish tweets; the TRAC-2 task on Aggression Identification [2] (Kumar et al., 2020) for English, Bengali and Hindi in comments extracted from YouTube; OffensEval-2020 [3] on offensive language identification (Zampieri et al., 2020) in Arabic, Danish, English, Greek and Turkish tweets; the GermEval-2019 shared task [4] on the Identification of Offensive Language for German on Twitter (Struß et al., 2019); and the Jigsaw Multilingual Toxic Comment Classification Challenge [5], in which the task is focused on building multilingual models (English, French, German, Italian, Portuguese, Russian and Spanish) with English-only training data from Wikipedia comments.

[1] https://competitions.codalab.org/competitions/19935
[2] https://sites.google.com/view/trac2/shared-task
[3] https://sites.google.com/site/offensevalsharedtask/
[4] https://projects.fzai.h-da.de/iggsa/the
[5] https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification

DETOXIS is the first task that focuses on the detection of different levels of toxicity in comments posted in response to news articles written in Spanish. The main novelty of the task lies, on the one hand, in the methodology applied to the annotation of the dataset used for training and testing the participating models and, on the other hand, in the evaluation metrics applied to assess the participating models in terms of their system use profile, using four different metrics: F-measure, Rank Biased Precision (Moffat and Zobel, 2008), the Closeness Evaluation Measure (Amigó et al., 2020) and Pearson's correlation coefficient. The proposed methodology aims to reduce the subjectivity of the annotation of toxicity by taking into account the contextual information, i.e. the conversational thread, and by annotating different linguistic features, such as argumentation, constructiveness, stance, target, stereotype, sarcasm, mockery, insult, improper language, aggressiveness and intolerance, which allowed us to discriminate the different levels of toxicity.

The rest of the overview is structured as follows. In Section 2 we present the two subtasks of DETOXIS. In Section 3 the NewsCom-TOX corpus used as the dataset is described, together with the way it was gathered and annotated. In Section 4 the different metrics used for the evaluation of the systems and the results they obtained are presented, as well as a description of the techniques and models used by the systems. Finally, in Section 5 the conclusions are drawn.

2 Task Description

The aim of the DETOXIS task is the detection of toxicity in comments posted in Spanish in response to different online news articles related to immigration. The DETOXIS task is divided into two related classification subtasks:

• Subtask 1: Toxicity detection task is a binary classification task that consists of classifying the content of a comment as toxic (toxic=yes) or not toxic (toxic=no).

• Subtask 2: Toxicity level detection task is a more fine-grained classification task in which the aim is to identify the level of toxicity of a comment (0=not toxic; 1=mildly toxic; 2=toxic and 3=very toxic).

Although we recommended participating in both subtasks, participants were also allowed to take part in just one of them. Teams were also encouraged to submit multiple runs (maximum 5).

3 Dataset: The NewsCom-TOX corpus

We used as the dataset the NewsCom-TOX corpus, which consists of 4,359 comments posted in response to 21 different articles extracted from Spanish online newspapers (ABC, elDiario.es, El Mundo, NIUS, etc.) and discussion forums (such as Meneame [6] and ForoCoches [7]) from August 2017 to July 2020. These articles were manually selected taking into account their controversial subject matter, their potential toxicity, and the number of comments posted (minimum 50 comments). We used a keyword-based approach to search for articles related mainly to immigration. The comments were selected in the same order in which they appear in the time thread on the web. The author (anonymized), the date and the time when the comment was posted were also retrieved. The number of comments ranges from 65 to 359 comments per article. On average, 31.16% of the comments are toxic.

[6] https://www.meneame.net/
[7] https://www.forocoches.com/

3.1 Annotation Scheme

We considered that a comment is toxic when it attacks, threatens, insults, offends, denigrates or disqualifies a person or group of people on the basis of characteristics such as race, ethnicity, nationality, political ideology, religion, gender and sexual orientation, among others. This attack can be expressed in different ways, explicitly (through insult, mockery and inappropriate humor) or implicitly (for instance, through sarcasm), and at different levels of intensity, that is, at different levels of toxicity (from impolite and offensive comments to the most aggressive, the latter being those comments that incite hate or even physical violence). We use toxicity as an umbrella term under which we include different definitions used in the literature to describe hate speech (Nockleby, 2000; Waseem and Hovy, 2016; Schmidt and Wiegand, 2017; Davidson et al., 2017) and abusive (Nobata et al., 2016), aggressive (Kumar et al., 2018), toxic (Kolhatkar et al., 2020) or offensive (?) language. In fact, these different terms address different aspects of toxic language (Poletto et al., 2020).

We annotated each comment in two categories, 'toxic' and 'not toxic', and then we assigned different levels of toxicity: 'toxicity level 0=not toxic' (example 1), 'toxicity level 1=mildly toxic' (example 2), 'toxicity level 2=toxic' (example 3) or 'toxicity level 3=very toxic' (example 4) to those that were first annotated as toxic. [8]

[8] The examples contain text that may be considered offensive.

(1a) Contra la desinformación datos. En el INE tenéis los datos de los inmigrantes. Delitos, cotizaciones, prestaciones, etc. <toxicity=not> <toxicity level=0>

(1b) Against misleading data. In the INE you have data for immigrants. Crimes, contributions, subsidies, etc. <toxicity=not> <toxicity level=0>

(2a) Esta gentuza se carga al país en dos telediarios <toxicity=yes> <toxicity level=1>

(2b) This rabble could destroy the country with two TV news programs <toxicity=yes> <toxicity level=1>

(3a) Lo que importa en realidad es sentirse mejor con uno mismo. Sumar karma. A los putos negros les pueden joder bien. <toxicity=yes> <toxicity level=2>

(3b) What really matters is feeling better about yourself. Accumulating karma. Fucking blacks can get fucked up pretty good. <toxicity=yes> <toxicity level=2>

(4a) A estos putos animales sarnosos que los encierren y tiren la llave. <toxicity=yes> <toxicity level=3>

(4b) With these mangy fucking animals, lock them up and throw away the key. <toxicity=yes> <toxicity level=3>

In addition to annotating whether or not a comment was toxic and its level of toxicity, we also annotated the following features: argumentation, constructiveness, stance, target, stereotype, sarcasm, mockery, insult, improper language, aggressiveness and intolerance. All these features have binary values except the toxicity level (the value '0' is assigned to indicate the absence of the corresponding feature in the comment, and the value '1' is assigned when the feature is present). We first annotated these features and then used them to establish the toxicity of comments and to determine their level of toxicity. It is worth noting that the level of toxicity is especially determined by the type of features combined. All this information was included only in the training dataset that was used for the task. In order to assign each of these features and be able to interpret the global meaning of comments, it was crucial to take the context into account, that is, the conversational thread. For instance, the conversational thread was very useful for interpreting and annotating sarcasm, mockery and stance. The identifier of each conversation thread ('thread id' feature) was also provided for the participants, as well as the identifier of the previous comment in the thread ('reply to' feature) and the 'comment level' feature. The latter is a categorical attribute with two possible values: '1' to indicate that the comment is a direct comment on an article and '2' to indicate that the comment is a reply to another comment. The 'topic' feature was also provided in the training dataset. This feature has three possible values (CR=crime, MI=migration, SO=social) to distinguish the topic of the news article.
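To make the structure of a training instance concrete, the following is a minimal sketch of one annotated record as a Python dictionary. The key names are illustrative (the paper prints feature names like 'thread id' and 'comment level'; the underscore forms and the exact set of columns in the released csv files are assumptions):

```python
# A hypothetical annotated training record, assembled from the features
# described above; key names are illustrative, not the official csv headers.
example_comment = {
    "comment_id": 1234,       # assumed identifier field
    "thread_id": 56,          # identifier of the conversation thread
    "reply_to": 1230,         # id of the previous comment in the thread
    "comment_level": 2,       # 1 = direct comment on the article, 2 = reply
    "topic": "MI",            # CR = crime, MI = migration, SO = social
    "comment": "...",         # raw comment text
    # Binary linguistic features (0 = absent, 1 = present)
    "argumentation": 1,
    "constructiveness": 0,
    "stance": 1,
    "target": 1,
    "stereotype": 1,
    "sarcasm": 0,
    "mockery": 0,
    "insult": 0,
    "improper_language": 0,
    "aggressiveness": 0,
    "intolerance": 0,
    # Target labels
    "toxicity": 1,            # toxic=yes / toxic=no in Subtask 1
    "toxicity_level": 1,      # 0-3 scale described in Section 2
}
```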

3.2 Annotation Process

Each comment was annotated in parallel by three annotators, and an inter-annotator agreement test was carried out once all the comments on each article had been annotated. Then, disagreements were discussed by the annotators and a senior annotator until agreement was reached. The team of annotators involved in the task consisted of two expert linguists and two trained annotators, who were students of linguistics. Table 1 shows the results obtained in the inter-annotator agreement test: the average observed percentage of agreement between the three annotators, 81.99% for toxicity and 73.95% for toxicity level, and Krippendorff's alpha (0.59 and 0.62 respectively), which ensures annotation reliability. The results obtained are quite acceptable considering the difficulty of both tasks.


Feature          Average observed agreement   Krippendorff's α
Toxicity         81.99%                       0.59
Toxicity level   73.95%                       0.62

Table 1: Inter-Annotator Agreement Test.

3.3 Training and Test Dataset

We provided participants with 80% of the NewsCom-TOX corpus for training their models (3,463 comments), and the remaining 20% of the corpus (896 comments) was used for testing their models. [9] Both training and test files were provided in csv format. The training dataset contains all the features included in the annotation of the NewsCom-TOX corpus (see Subsection 3.1), whereas the test dataset only contains the following features: 'thread id', 'comment id', 'reply to', 'comment level' and 'comment'.

[9] In order to avoid any conflict with the sources of comments regarding their Intellectual Property Rights (IPR), the data was privately sent to each participant that was interested in the task. The corpus will only be available for research purposes.
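As a rough illustration of how a participant might load the files, here is a hedged pandas sketch; the file names are placeholders (the data was sent privately) and the underscore-style column names are assumptions based on the feature list above:

```python
import pandas as pd

# Placeholder file names; the corpus was distributed privately, so these
# are not official release names.
TRAIN_PATH = "detoxis_train.csv"
TEST_PATH = "detoxis_test.csv"

# Assumed column names for the test file; the training file additionally
# carries all annotation features from Subsection 3.1.
TEST_COLUMNS = ["thread_id", "comment_id", "reply_to", "comment_level", "comment"]

def load_split(path: str, expected_columns):
    """Read one csv split and check it has the expected columns."""
    df = pd.read_csv(path)
    missing = set(expected_columns) - set(df.columns)
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    return df

# test = load_split(TEST_PATH, TEST_COLUMNS)   # 896 comments
# train = pd.read_csv(TRAIN_PATH)              # 3,463 comments
```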

Table 2 shows the distribution of toxic comments by toxicity level. The dataset consists of 4,359 comments, of which 1,385 are toxic (31.16%): 996 mildly toxic (21.90%), 310 toxic (7.36%) and 79 very toxic (1.81%).

4 Systems and Results

This section contains a brief description of the baselines provided as a benchmark, followed by an overview of both subtasks (toxicity detection and toxicity level detection), explaining the reasons behind the selection of the evaluation metrics and their implications when models are compared. Finally, some interesting system insights and a brief error analysis of the submitted approaches are presented.

4.1 Baselines

A benchmark was set with the introduction of three different baselines: RandomClassifier, BOWClassifier and ChainBOW. First, RandomClassifier assigns a random toxicity level from four possible values {0, 1, 2, 3} to each comment in the test set without any kind of weighting strategy. Second, BOWClassifier consists of a simple Support Vector Classifier (SVC) that receives the features extracted by a TF-IDF vectorizer (an advanced version of the classical Bag-Of-Words technique). In particular, the SVC model uses a linear kernel with a regularization parameter equal to 1.0 and a "one-versus-rest" decision function strategy for the toxicity level detection subtask. Furthermore, the TF-IDF vectorizer, which is responsible for feature extraction, constructs a vocabulary of 5,000 entries including unigrams and bigrams after performing a simple preprocessing stage (lowercasing and removal of accents). Finally, given the unbalanced nature of the dataset, we decided to include a baseline that processes both subtasks sequentially. ChainBOW therefore contains two BOWClassifiers, one for each subtask (toxicity detection and toxicity level detection), connected in a hierarchical fashion with the same configuration as mentioned before. It is worth mentioning that, given their baseline nature, no hyperparameter optimization was performed on any of the models. In addition, all baselines were implemented using the Python scikit-learn library.
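The following is a minimal reconstruction of the BOWClassifier baseline from the description above (TF-IDF with 5,000 unigram and bigram entries, lowercasing and accent removal, linear SVC with C=1.0 and a one-versus-rest strategy); it is a sketch, not the organisers' exact code:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Sketch of the BOWClassifier baseline described above.
bow_classifier = make_pipeline(
    TfidfVectorizer(
        max_features=5000,        # vocabulary of 5,000 entries
        ngram_range=(1, 2),       # unigrams and bigrams
        lowercase=True,           # simple preprocessing: lowercasing...
        strip_accents="unicode",  # ...and removal of accents
    ),
    SVC(
        kernel="linear",
        C=1.0,                            # regularization parameter
        decision_function_shape="ovr",    # "one-versus-rest" strategy
    ),
)

# Hypothetical training data variables:
# bow_classifier.fit(train_comments, train_labels)
# predictions = bow_classifier.predict(test_comments)
```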

4.2 Subtask 1: Toxicity Detection Task

Subtask 1 consists of a binary classification (toxic vs. non-toxic). In this subtask, precision, recall and their combination by means of the F-measure were used as metrics. Table 5 ranks the best run of each of the participating teams according to F-measure. In general, the SINAI system outperforms the rest of the systems. SINAI is outperformed by some of the other approaches in terms of recall, although at the cost of a significant precision loss. In relation to the baselines, RandomClassifier achieves a mid-ranking position, with low precision but medium-high recall. In a similar position is ChainBOW, with low recall and medium-high precision. In general, there is significant room for improvement between the participants' runs and the Gold Standard.
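For reference, the Subtask 1 metrics have standard scikit-learn implementations; a minimal sketch with made-up labels, treating the toxic class as the positive label:

```python
from sklearn.metrics import precision_recall_fscore_support

# Made-up gold labels and predictions (1 = toxic, 0 = not toxic).
y_true = [1, 0, 1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", pos_label=1
)
print(f"P={precision:.4f} R={recall:.4f} F={f1:.4f}")
```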

Figure 1 illustrates the precision and recall scores achieved by the systems. The precision of all systems is between 0.25 and 0.75.


Feature                  Comments   Percentage
mildly toxic (level 1)   996        21.90%
toxic (level 2)          310        7.36%
very toxic (level 3)     79         1.81%
Total                    1,385      31.16%

Table 2: Distribution of toxic comments.

Figure 1: Precision and Recall for the best run of each team in Subtask 1.

4.3 Subtask 2: Toxicity Level Detection Task

Regarding Subtask 2, the toxicity level detection task is a more fine-grained classification task in which the aim is to identify the level of toxicity of a comment (0=not toxic; 1=mildly toxic; 2=toxic and 3=very toxic). For this task we considered four evaluation metrics.

Closeness Evaluation Measure (CEM): The CEM metric (Amigó et al., 2020) considers the proximity between predicted and real categories. Unlike measures based on absolute error, CEM has two particularities. First, it does not assume equidistance between categories. That is, an error between categories 0 and 2 is not necessarily twice as serious as an error between categories 0 and 1. In addition, it assigns more weight to infrequent categories. This avoids over-weighting systems that tend to classify items in majority classes. CEM is appropriate for unbalanced data sets, and when the relative distance between categories, beyond the way they are ordered, is unknown.

Rank Biased Precision (RBP): RBP is a ranking metric (Moffat and Zobel, 2008). We rank the output of the system and the Gold Standard on the basis of toxicity levels (from highest to lowest) and compare the two rankings. Basically, RBP is computed as the sum of the actual values (Gold Standard) along the ranking generated by the system, with a weight function that decreases as we descend in the ranking of the system output. Here d ranges over the categorized items in the system output s, with pos(d) being the ranking position of d in the system output s and g(d) being the real category value in the Gold Standard for this item. RBP is computed as:

$$RBP = (1 - p) \sum_{d} f(d)\, g(d)$$

where f is a decay function that decreases with the ranking position, concretely $f(d) = p^{pos(d)-1}$. The value p is a parameter which is typically fixed at X. Originally, RBP applies to rankings with one item in each ranking position. However, in our scenario, the items are ordered in four levels in the ranking generated by the system (level 4=not toxic; level 3=mildly toxic; level 2=toxic and level 1=very toxic). Therefore, we modify RBP by considering the average position that each item would occupy in the system output if we were to randomly order the ties:

$$f(d) = \frac{1}{n_{s(d)}} \sum_{i=MinPos(d)}^{MaxPos(d)} p^{i-1}$$

where $n_{s(d)}$ represents the number of items at level s(d) in the system output, and MinPos(d) and MaxPos(d) represent the minimum and maximum position that item d could occupy in the system output s according to its level. This metric is appropriate when sorting items according to their toxicity; for instance, the scenario in which the user needs to find the most toxic comments fits this metric.
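Under these definitions, the tie-aware variant can be implemented directly; the sketch below assumes a persistence value of p=0.8, since the parameter value is left unspecified above:

```python
import numpy as np

def tie_aware_rbp(system_levels, gold_values, p=0.8):
    """Tie-aware RBP following the formulas above.

    system_levels: predicted toxicity level per item (higher = more toxic).
    gold_values:   gold category value g(d) per item.
    p:             persistence parameter; 0.8 is an assumption here, as the
                   official value is not stated in the task description.
    """
    system_levels = np.asarray(system_levels)
    gold_values = np.asarray(gold_values, dtype=float)
    n = len(system_levels)
    weights = p ** np.arange(n)   # p^(i-1) for 1-based positions i = 1..n
    score = 0.0
    pos = 0                       # first 0-based position of the current tie block
    # Traverse tie blocks from the most to the least toxic predicted level.
    for level in sorted(set(system_levels.tolist()), reverse=True):
        mask = system_levels == level
        n_level = int(mask.sum())           # n_{s(d)}: items at this level
        # f(d): average of p^(i-1) over positions MinPos(d)..MaxPos(d).
        f = weights[pos:pos + n_level].mean()
        score += f * gold_values[mask].sum()
        pos += n_level
    return (1 - p) * score

# Example with made-up levels (0-3 scale):
print(tie_aware_rbp([3, 3, 1, 0, 2], [3, 1, 1, 0, 2]))
```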

Accuracy: Accuracy is the most popular classification metric. This metric does not consider the order between categories. That is, an error is penalised regardless of the distance to the actual category. Furthermore, it does not compensate for the effect of imbalance in the data set. This metric is not appropriate for most possible scenarios, although it has the advantage of being very easy to interpret in terms of the percentage of hits.

Mean Absolute Error (MAE): MAE averages the absolute difference between predicted and actual categories. This metric assumes equidistance between categories and does not compensate for the effect of imbalance between categories in the data set. MAE is appropriate, for example, when predicting the average toxicity value in a comment set. Note that this requires assuming numerical intervals between categories.

Pearson Coefficient: Like MAE, using the Pearson coefficient requires assuming equidistance between categories. The difference with respect to MAE is that it does not require the system to predict the category of each item, but to generate a scale that is linearly correlated with the actual scale. In addition, it compensates for the effect of imbalance between categories by giving more weight to those categories that are more infrequent. Obtaining a high Pearson value is interesting, for example, when predicting the evolution of toxicity over time in a comment stream or comparing the average toxicity of two streams.
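Of these metrics, accuracy, MAE and the Pearson coefficient have standard off-the-shelf implementations (CEM and the modified RBP require the custom definitions discussed above); a minimal sketch with made-up toxicity levels:

```python
from scipy.stats import pearsonr
from sklearn.metrics import accuracy_score, mean_absolute_error

# Made-up gold and predicted toxicity levels on the 0-3 scale.
y_true = [0, 0, 1, 2, 3, 0, 1]
y_pred = [0, 1, 1, 1, 3, 0, 2]

print("Accuracy:", accuracy_score(y_true, y_pred))
print("MAE:     ", mean_absolute_error(y_true, y_pred))
print("Pearson: ", pearsonr(y_true, y_pred)[0])  # pearsonr returns (r, p-value)
```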

Table 6 shows the results obtained by the best run of each team in terms of CEM. The baselines are in the middle positions of the system ranking. In this case, the BOWClassifier approach performs better than the other baselines developed by the task organisers.

Figure 2 illustrates the CEM results vs. other evaluation metrics. In this subtask, the systems SINAI and Team Sabari outperform most of the other approaches for all metrics. There is only one exception: the system output DCG outperforms SINAI and Sabari by a wide margin in terms of ranking (RBP metric). This means that, in a scenario where the objective is to prioritise particularly toxic comments, DCG is more effective than SINAI.

In Figure 2, we find a high correspondence between CEM and the Pearson correlation coefficient. CEM does not consider numeric intervals. The Pearson coefficient does not take into account the proximity of the estimated categories to the actual ones but the correlation between the values (numeric category labels) in the system output and the gold standard. This means that the effect of 'scale shifts' and the effect of assuming numeric intervals between categories is not particularly relevant in this evaluation benchmark.

However, we did not find as much correspondence between the CEM metric and the MAE and Accuracy metrics. The main difference is that CEM, like Pearson, compensates for the imbalance between categories in the data set, giving more weight to errors or hits in infrequent categories. As Figure 2 shows, there is a set of runs that obtain similar Accuracy and MAE values, but nevertheless present important differences in terms of CEM. From a usage scenario perspective, this result indicates that these runs are comparable if we are interested in predicting or approximating the actual category in as many cases as possible. For example, this would be the case if we wanted to calculate the average toxicity of a set of comments. However, if we are interested in detecting particular cases of toxicity (low, medium or high), then we can assert that some of these runs will be more effective than others.

Figure 2: Correspondence between CEM and other metrics in Subtask 2.

4.4 Systems Insights

A total of 31 teams (Subtask 1) and 24 teams (Subtask 2) sent a maximum of five submissions per subtask to be evaluated. Not surprisingly, and based on the recent success of transfer learning with pre-trained Transformer-like language models, the top five teams in both subtasks achieved their best scores using BETO [10] (the Spanish version of the BERT model). The difference in performance between their respective submissions thus lies in the fine-tuning techniques, data augmentation strategy and/or preprocessing steps. Although the performance of classical machine learning models such as TF-IDF with Random Forest/Support Vector Machine/Logistic Regression classifiers was not as high as that achieved by BETO, they were also used as part of ensemble architectures and analyzed by multiple participants. In fact, several teams provided valuable insights and comparisons of models for a task that is incredibly challenging, even for trained expert annotators. The most interesting conclusions are summed up below.

[10] https://github.com/dccuchile/beto

This competition introduced several challenges in addition to the toxicity detection problem. First, the language of the dataset (comments in Spanish) differs from the usual English language present in most of the general benchmark NLP datasets. Therefore, the models needed to account for the different directionality of the text, as well as for the selection of other pre-processing steps. For instance, the GTH-UPM team performed an analysis of the multiple pre-processing steps applied prior to a pre-trained Transformer-like model (BETO). Apparently, a basic text normalization (the replacement of special Named Entities such as MAIL, DATE, URL... with their respective shared token) improves BETO's performance, as opposed to other steps, e.g. the removal of stopwords and punctuation, or lemmatization.

GTH-UPM also validated the extension of pre-trained models' vocabulary with domain-specific terms as a useful technique to improve task-oriented results.
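As an illustration of this kind of basic text normalization, the sketch below replaces some special Named Entities with shared placeholder tokens; the exact patterns and token names used by GTH-UPM are assumptions:

```python
import re

# Illustrative reconstruction of the normalization described above: special
# Named Entities are replaced with a shared token. The concrete patterns and
# token names used by GTH-UPM are not reproduced here and are assumed.
NORMALIZATION_PATTERNS = [
    (re.compile(r"https?://\S+"), "URL"),
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "MAIL"),
    (re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b"), "DATE"),
]

def normalize(text: str) -> str:
    for pattern, token in NORMALIZATION_PATTERNS:
        text = pattern.sub(token, text)
    return text

print(normalize("Escribe a [email protected] antes del 01/09/2021: https://example.com"))
# -> "Escribe a MAIL antes del DATE: URL"
```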

Another analysis that is worth paying attention to regarding the language of the texts is a performance comparison among pretrained multilingual models, English models with a proper Spanish-to-English translator, and Spanish models. Alejandro Mosquera built a stacked model composed of at least one model for each variant in conjunction with additional features, and extracted the feature importance of the individual models via cross-validation on the training dataset. The results show that pretrained multilingual models such as XLM do not provide as much predictive power in this toxicity context as a neural network with a capsule network architecture using SBWC i25 GloVe Spanish embeddings. However, the difference is not so pronounced when we compare it with the translator followed by the model trained on English embeddings. A similar comparison was performed by the AI-UPV team, in which classical machine learning models (both generative and discriminative), multilingual BERT and BETO performed cross-validation with a hyperparameter setup. Not only did they conclude that Transformer-like architectures outperform classical statistical models in this complex task, but they also found that BETO achieved substantially better results than multilingual BERT for all of their top five best configurations and in both subtasks.

Furthermore, Alejandro Mosquera's was the only team to introduce side information related to the topic of the articles each comment refers to and the thread to which they belong. According to a preliminary analysis using cross-validation, the use of information related to the comments present in the thread/conversation seems to enrich overall model performance. Further research on how to better exploit thread and topic information may be an interesting and promising direction.

Last but not least, the winners of both DETOXIS subtasks (the SINAI team) proved the importance of enriching the model by fine-tuning it on similar tasks related to sentiment and emotion analysis, thus increasing the available training data and making more precise predictions thanks to this Multi-Task Learning strategy. In fact, SINAI was the only team in the competition that took the provided extra features (see Section 3.1) into account, thereby giving their systems an important boost in comparison to the other participants. However, a study of the influence of each extra feature on the model's predictive power is yet to be performed.
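As a rough sketch of the common starting point shared by the top teams, the snippet below loads BETO for sequence classification with the Hugging Face transformers library; the multi-task fine-tuning, data augmentation and extra-feature handling that actually differentiated the submissions are not reproduced here:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# BETO, the Spanish BERT model referenced above.
MODEL_NAME = "dccuchile/bert-base-spanish-wwm-cased"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=4,  # toxicity levels 0-3 for Subtask 2 (use 2 for Subtask 1)
)

inputs = tokenizer("Esta gentuza se carga al país en dos telediarios",
                   return_tensors="pt", truncation=True)
logits = model(**inputs).logits  # fine-tuning on NewsCom-TOX is omitted here
```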

4.5 Error Analysis

The tasks of toxicity and toxicity level detection are particularly difficult for machine learning models. In fact, even the most recent Transformer models struggle with linguistic devices that users sometimes employ, such as sarcasm or irony. However, thanks to the fact that the dataset includes 13 additional features that classify the text according to other helpful semantic dimensions beyond the raw text, we can easily identify the most challenging comments and the possible reasons behind the difficulty of their detection. Table 3 contains the average performance of the top five best submissions per task (based on the F1-score and CEM metrics respectively), considering the subset of samples where each feature is present and a single (best) run per team.

Feature                 Size (Toxic)   Subtask 1   Subtask 2
1. argumentation        459 (113)      0.6006      0.7591
2. constructiveness     282 (0)        -           0.8929
3. positive stance      22 (7)         0.6545      0.8465
4. negative stance      252 (125)      0.6919      0.6764
5. target person        108 (90)       0.7243      0.5672
6. target group         67 (62)        0.7945      0.5444
7. stereotype           41 (38)        0.6938      0.5110
8. sarcasm              52 (42)        0.6653      0.5497
9. mockery              80 (77)        0.7443      0.4824
10. insult              85 (83)        0.8541      0.5202
11. improper language   74 (51)        0.7985      0.6542
12. aggressiveness      15 (15)        0.7091      0.3892
13. intolerance         15 (13)        0.8516      0.5655

Table 3: Average score (F1-score and CEM respectively) of the five best performing teams in each task for the subset of comments where the corresponding feature is positive, together with the size of each subset and the number of toxic instances in them.

Regarding the first task (toxicity detection), the official metric focuses on the ability of the models to detect all toxic comments and pays no attention to the negative class. Therefore, comments with argumentative cues (generally linked to non-toxic comments) that are marked as toxic are usually the most difficult ones to detect (see examples 5, 8, 10, 13 and 14 in Table 4). A similar situation appears in comments with a positive stance that are actually toxic. Other dimensions that clearly increase the difficulty of this task are sarcasm and stereotype. The existence in hate speech posts of a greater number of implicit stereotypes rather than explicit ones, which ultimately require models to learn related world knowledge beforehand, is well known. Furthermore, this dataset contains comments that are replies to other users' comments. Consequently, a broader context than just the current text is sometimes required to make an informed decision on whether the comment is toxic or not.

On the other hand, the difficulty of linguistic cues related to aggressiveness, mockery, sarcasm and stereotype is even clearer in the task of toxicity level detection when CEM scores are considered. Even though a comment may not apparently be very toxic according to the explicit words in the text, implicit messages and the emitter's intention can be really harmful to the receiver. Consequently, an effort to build automatic systems able to capture toxicity levels in comments in which such subtle hidden messages are included is of utmost importance. For instance, example 12 is usually marked as non-toxic by the top performing submitted systems, although the actual toxicity is at its maximum level due to the implicit aggressiveness within the message. This difference between the predicted and ground truth labels is highly penalized by the official CEM metric.

5 Conclusions and Future Work

This paper has described the DETOXIS challenge at IberLEF 2021 and summarized the participation of several teams in both subtasks, which evidenced interesting approaches and conclusions. Although all of the top-performing systems made use of the Spanish version of the BERT model (a Transformer encoder), a variety of insights into the importance of different strategies became clearer for researchers to build upon or explore further in their respective studies on hate speech and toxicity detection. As opposed to previous challenges on hate speech, we have provided 13 additional features and a fine-grained toxicity degree target with four possible labels that go beyond the classical binary toxicity classification. Furthermore, the collection of comments includes the thread to which they belong and the topic of the article they are posted in, thus allowing a broader context to be used by innovative solutions.

Unsurprisingly, multilingual models are outperformed by their Spanish counterparts in this challenge, most probably due to the specific embedding space fully optimized in the given language and the use of a more language-oriented token vocabulary. Techniques such as data augmentation with Masked Language Modeling and the addition of Multi-Task Learning with datasets belonging to similar tasks, in order to increase the available data used at the fine-tuning step, have turned out to be beneficial in a scenario in which the number of comments is small and classes are unbalanced. Moreover, despite the fact that further analysis and more elaborate ways to extract information from the conversation and the topic are needed, the only team that took advantage of such information achieved positive results in the final outcome by combining them with other common strategies.

Finally, it is important to mention that, depending on the final application and its specific requirements, the selection of the model may require a different evaluation metric. Thanks to our selection of metrics, we were able to provide a simple visualization of the different trade-offs involved when opting for a certain system, and of how models perfectly suitable for a particular goal may not perform at the same level when those requisites change. A clear example we have presented is the comparison between the SINAI and DCG systems, in which the former was better at detecting all toxicity levels according to an ordinal classification metric such as CEM but was outperformed by the latter in scenarios in which prioritizing the most toxic comments is the priority (according to the RBP ranking metric).

Regarding future work, systems are as yet far from the Gold Standard, and this is mainly due to the difficulty these models encounter when detecting implicit features such as aggressiveness, mockery and sarcasm. Moreover, the lack of examples corresponding to each individual feature makes this an even more challenging task. Additional research on the influence of these implicit features on the final toxicity of the comment and the creation of systems able to detect and mitigate such subtle cues is highly necessary in a society in which messages and ideas are spread faster than ever. Some of the paths that were not explored by participants and could be worth looking into are: larger language models or possible distillations of them, improvements to the semantic contextualization of language models by the use of techniques that reduce their anisotropy problem, and the incorporation of sentence encoders or compositional representations into the ensemble architectures.


ID | Features | FN Ratio | Toxicity | Comment
1 | 8 | 1.0000 | 1 | Como se echa de menos la opinión de los guardianes de la moral en la noticia de la ejecución en Portland. Bueno, en muchos casos están sus negativos pero no sus siempre dignos comentarios.
2 | 10, 11 | 1.0000 | 1 | Directamente :peineta:
3 | 4 | 1.0000 | 1 | Los andaluces, no los españoles
4 | - | 0.9677 | 1 | Y 600000 votos más, para el pandemias.
5 | 1, 4, 12 | 0.9677 | 1 | No intentaba poner ni a favor ni en contra. Ni intentaba realizar una crítica en el origen de los EEUU. Mi comentario solo trataba de dar perspectiva del problema en USA. En mi opinión personal todos los usanos con armas podrían salir a la calle hasta matarse los unos a los otros que el mundo en un plazo más corto que largo iría a mejor. Pero es mi opinión.
6 | 3, 4, 5, 10 | 0.9677 | 1 | Que fácil es hablar bocachancla
7 | 5 | 0.9677 | 1 | Expulsión de echenique de españa ya
8 | 1, 4, 6, 9, 10 | 0.9355 | 1 | En Madrid no hay campo, no hablo de la purria de ciudad. Hablo de la gente que todo los años entra en España para hacer estos trabajos estacionales + los que están siempre aquí y que se han ido
9 | 4, 11 | 0.9355 | 1 | Como conseguir 600.000 votos. Vomitivo
10 | 1, 5 | 0.9355 | 1 | El segundo es el que le tira al suelo, pero mira el primero también, figura. youtu.be/neUnhYO2Ehc
11 | 6, 7 | 0.9355 | 1 | Pues más razón para hacer como ellos contra ellos. La diferencia es que nosotros sabemos que está mal, ellos creen que es lo correcto. Nosotros podemos parar cuando lo estimemos, ellos no.
12 | 12 | 0.9355 | 3 | Tengo esperanzas en que legalicen la noche de 'La Purga'. La peli cuenta como las clases acomodadas impulsan esta celebración para que el lumpen se autorregule. Es sencilla, pero a mí me entretuvo mucho : )
13 | 1, 4, 5, 9 | 0.9355 | 1 | Echenique miente más que corre....aun regularizando estos irregulares, no serían de hecho "ciudadanos españoles" y por lo tanto SIN derecho a voto. Pura propaganda bolchevique.
14 | 1, 4, 5 | 0.9355 | 1 | ss. 29.304 simplemente exige que tengas licencia de caza (que no se sabe públicamente si el implicado la tiene o no) [...] Ergo, podemos deducir que el acusado no estaba violando la 948.60(2)(a) al usar un arma larga con 17 años. Hay que hacer los deberes antes de soltar afirmaciones tan categóricas. O sea que necesita una licencia de caza y como no sabemos si la tiene o no vamos a suponer que SÍ la tiene pero el que tiene que hacer los deberas antes de hacer afirmaciones categóricas soy yo. Pos fueno, pos fale, pos malegro, campeón. al parecer al final no va a haber acusación por la tenencia de armas, ¿Fuente?
15 | 5, 8 | 0.9355 | 2 | Si le hacen tantas pregunta a "NUMERO 3", es posible que se cortocircuite?
16 | 4, 13 | 0.8710 | 1 | Deportaciones masivas ya!

Table 4: List of the top 16 comments in the test set that are misclassified as non-toxic by the majority of the systems. The Features column shows the feature identifiers from Table 3 marked as positive for the given comment, FN Ratio is the ratio of systems that marked the comment as non-toxic, and Toxicity refers to the four-level annotation for that specific comment.

Acknowledgments

The work has been carried out in the framework of the following projects: MISMIS (PGC2018-096212-B), funded by the Ministerio de Ciencia, Innovación y Universidades (Spain); CLiC SGR (2017SGR341), funded by AGAUR (Generalitat de Catalunya); and STERHEOTYPES (Challenges for Europe), funded by Fondazione Compagnia di San Paolo.

References

Amigó, E., J. Gonzalo, S. Mizzaro, and J. Carrillo-de Albornoz. 2020. Effectiveness metric for ordinal classification: Formal properties and experimental results. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.

Basile, V., C. Bosco, E. Fersini, N. Debora, V. Patti, F. M. R. Pardo, P. Rosso, M. Sanguinetti, et al. 2019. SemEval-2019 task 5: Multilingual detection of hate speech against immigrants and women in Twitter. In 13th International Workshop on Semantic Evaluation, pages 54-63. Association for Computational Linguistics.

Davidson, T., D. Warmsley, M. Macy, and I. Weber. 2017. Automated hate speech detection and the problem of offensive language. In Proceedings of the International AAAI Conference on Web and Social Media, volume 11.

Kolhatkar, V., H. Wu, L. Cavasso, E. Francis, K. Shukla, and M. Taboada. 2020. The SFU opinion and comments corpus: A corpus for the analysis of online news comments. Corpus Pragmatics, 4(2):155-190.

Kumar, R., A. K. Ojha, S. Malmasi, and M. Zampieri. 2018. Benchmarking aggression identification in social media. In Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018), pages 1-11.

Kumar, R., A. K. Ojha, S. Malmasi, and M. Zampieri. 2020. Evaluating aggression identification in social media. In Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying, pages 1-5.

Moffat, A. and J. Zobel. 2008. Rank-biased precision for measurement of retrieval effectiveness. ACM Transactions on Information Systems (TOIS), 27(1):1-27.

Nobata, C., J. Tetreault, A. Thomas, Y. Mehdad, and Y. Chang. 2016. Abusive language detection in online user content. In Proceedings of the 25th International Conference on World Wide Web, pages 145-153.

Nockleby, J. T. 2000. Hate speech. Encyclopedia of the American Constitution, 3(2):1277-1279.

Poletto, F., V. Basile, M. Sanguinetti, C. Bosco, and V. Patti. 2020. Resources and benchmark corpora for hate speech detection: a systematic review. Language Resources and Evaluation, pages 1-47.

Schmidt, A. and M. Wiegand. 2017. A survey on hate speech detection using natural language processing. In Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, pages 1-10.

Struß, J. M., M. Siegel, J. Ruppenhofer, M. Wiegand, M. Klenner, et al. 2019. Overview of GermEval task 2, 2019 shared task on the identification of offensive language. In Preliminary Proceedings of the 15th Conference on Natural Language Processing (KONVENS 2019).

Waseem, Z. and D. Hovy. 2016. Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter. In Proceedings of the NAACL Student Research Workshop, pages 88-93.

Zampieri, M., P. Nakov, S. Rosenthal, P. Atanasova, G. Karadzhov, H. Mubarak, L. Derczynski, Z. Pitenis, and Ç. Çöltekin. 2020. SemEval-2020 task 12: Multilingual offensive language identification in social media (OffensEval 2020). In Proceedings of the Fourteenth Workshop on Semantic Evaluation, pages 1425-1447. International Committee for Computational Linguistics.

A Appendix: Results for both Subtasks

This appendix provides the results obtained by each team, selecting their best scoring output, for both DETOXIS subtasks. Table 5 ranks the best run of each of the participating teams according to F-measure. Table 6 shows the results obtained by the best run of each team in terms of CEM (column 3). The rest of the columns show the results in terms of the MAE, RBP, Pearson and Accuracy metrics. The baselines are in the middle positions of the system ranking.


Ranking   Team                              F-measure   Precision   Recall
1         SINAI                             0.6461      0.6356      0.6569
2         GuillemGSubies                    0.6000      0.5234      0.7029
3         AI-UPV                            0.5996      0.5672      0.6360
4         DCG                               0.5734      0.6225      0.5314
5         GTH-UPM                           0.5726      0.5852      0.5607
6         alejandro mosquera                0.5691      0.4688      0.7238
7         Team Sabari                       0.5646      0.5379      0.5941
8         maia                              0.5570      0.4904      0.6444
9         address                           0.4930      0.4697      0.5188
10        Javier García Gilabert            0.4911      0.3644      0.7531
11        Dembo                             0.4632      0.4798      0.4477
12        ToxicityAnalizers                 0.4562      0.4444      0.4686
13        DetoxisLuciaYNora                 0.4529      0.4346      0.4728
14        YaizaMinana                       0.4449      0.4077      0.4895
15        VG UPV                            0.4377      0.3348      0.6318
16        Datacism                          0.4235      0.4839      0.3766
17        Calabacines cosmicos              0.4065      0.4603      0.3640
18        Ivan i Jaume                      0.3991      0.4194      0.3808
19        CarlesyJorge                      0.3844      0.3973      0.3724
20        NEON2                             0.3798      0.4463      0.3305
-         RandomClassifier                  0.3761      0.2540      0.7238
-         ChainBOW                          0.3747      0.5071      0.2971
21        JOSAND10                          0.3636      0.5323      0.2762
22        GCP                               0.3535      0.4459      0.2929
23        LlunaPerez-JuliaGregori           0.3017      0.2806      0.3264
24        GalAgo                            0.2982      0.2841      0.3138
25        Just Do It                        0.2060      0.5000      0.1297
-         BOWClassifier                     0.1837      0.5909      0.1088
26        Benlloch                          0.1828      0.2705      0.1381
27        LNR SaraSoto ClaraSalelles        0.1637      0.5476      0.0962
28        LNR InigoPicasarri JoanCastillo   0.1637      0.5476      0.0962
29        ElenaLopez MartaGarcia LNR        0.1605      0.4000      0.1004
-         Word2VecSpacy                     0.1523      0.3651      0.0962
30        Iker&Miguel                       0.0405      0.6250      0.0209
31        JOREST                            0.0246      0.6000      0.0126

Table 5: Results of Subtask 1.


Ranking   Team                             CEM      MAE      RBP      Pearson   Accuracy
-         Gold Standard                    1        0        0.8213   1         1
1         SINAI                            0.7495   0.2626   0.2612   0.4957    0.7654
2         Team Sabari                      0.7428   0.2795   0.2670   0.5014    0.7464
3         DCG                              0.7300   0.3019   0.3925   0.4544    0.7329
4         GTH-UPM                          0.7256   0.3019   0.1545   0.4298    0.7318
5         GuillemGSubies                   0.7189   0.3490   0.2449   0.4451    0.6835
6         AI-UPV                           0.7142   0.3143   0.2101   0.4734    0.7127
7         address                          0.6915   0.3165   0.1136   0.3215    0.7228
8         Dembo                            0.6703   0.3468   0.1037   0.2677    0.6936
-         ChainBOW                         0.6535   0.3389   0.0787   0.2077    0.7127
9         Javier García Gilabert           0.6514   0.3345   0.2773   0.2158    0.7250
10        JOSAND10                         0.6424   0.3367   0.1306   0.2046    0.7239
11        ToxicityAnalizers                0.6332   0.4871   0.0709   0.1805    0.6139
12        NEON2                            0.6324   0.3434   0.1339   0.1632    0.7116
-         BOWClassifier                    0.6318   0.3266   0.1657   0.1688    0.7329
13        JOREST                           0.6250   0.3389   0.0592   0.0972    0.7273
14        Iker&Miguel                      0.6250   0.3389   0.0592   0.0972    0.7273
15        Ivan i Jaume                     0.6201   0.4366   0.0844   0.1604    0.6251
16        DetoxisLuciaYNora                0.6200   0.3816   0.0903   0.1248    0.6869
-         Word2VecSpacy                    0.6116   0.3928   0.0855   0.0566    0.7015
-         GloVeSBWC                        0.6111   0.3356   0.1085   0.0623    0.7318
17        CarlesyJorge                     0.6094   0.3367   0.0535   NaN       0.7318
18        ElenaLopez MartaGarcia LNR       0.6064   0.3490   0.0294   -0.0153   0.7217
19        VG UPV                           0.6041   0.6498   0.1224   0.1896    0.5589
20        Just Do It                       0.5928   0.4579   0.0320   0.0195    0.6207
21        Benlloch                         0.5913   0.4545   0.0711   0.0237    0.6330
22        GalAgo                           0.5876   0.4759   0.0936   0.0372    0.6285
23        LlunaPerez-JuliaGregori          0.5498   0.5724   0.0369   0.0004    0.5365
24        JosepCarles CandidogGarcia LNR   0.5376   0.6061   0.0705   0.0072    0.4949
-         RandomClassifier                 0.4382   1.4287   0.0390   -0.0455   0.2278

Table 6: Results of Subtask 2.
