
GSCL 2017
Language Technologies for the Challenges of the Digital Age

Proceedings of the GermEval 2017 – Shared Task on Aspect-based Sentiment in Social Media Customer Feedback

September 2017
Berlin, Germany


Foreword

In the connected, modern world, customer feedback is a valuable source for insights on the quality of products or services. This feedback allows other customers to benefit from the experiences of others and enables businesses to react to requests, complaints or recommendations. However, the more people use a product or service, the more feedback is generated, which results in the major challenge of analyzing huge amounts of feedback in an efficient, but still meaningful way.

Aspect-based Sentiment Analysis is an important task for analyzing customer feedback, and a growing number of Shared Tasks exists for various languages. However, these tasks lack large-scale German datasets. Thus, we present a shared task on automatically analyzing customer reviews about “Deutsche Bahn” – the German public train operator with about two billion passengers each year. We have annotated more than 26,000 documents and present four sub-tasks that represent a complete classification pipeline (relevance, sentiment, aspect classification, opinion target extraction).

The results indicate that the public transport domain offers challenging tasks. For example, the large number of aspects – in combination with an almost Zipfian label distribution of real user feedback – leads to label bias problems and creates strong baselines. We observe that the usage of extensive preprocessing, large sentiment lexicons, and the connection of neural and more traditional classifiers are advantageous strategies for the formulated tasks. The Shared Task is a first step in sentiment analysis for this domain.

For the GermEval 2017 Shared Task, we received 8 submissions. One submission was withdrawn from the proceedings. The dataset and the proceedings are available from the task website (https://sites.google.com/view/germeval2017-absa/). It also contains the presentation slides from the participants and the individual prediction labels from the participating systems.

Berlin, September 2017

The organizing committee

Organizers:

Michael Wojatzki (Universität Duisburg-Essen)
Eugen Ruppert (Universität Hamburg)
Torsten Zesch (Universität Duisburg-Essen)
Chris Biemann (Universität Hamburg)



Table of Contents

GermEval 2017: Shared Task on Aspect-based Sentiment in Social Media Customer Feedback
Michael Wojatzki, Eugen Ruppert, Sarah Holschneider, Torsten Zesch and Chris Biemann . . . . . 1

h-da Participation at Germeval Subtask B: Document-level Polarity
Karen Schulz, Margot Mieskes and Christoph Becker . . . . . 13

HU-HHU at GermEval-2017 Sub-task B: Lexicon-Based Deep Learning for Contextual Sentiment Analysis
Behzad Naderalvojoud, Behrang Qasemizadeh and Laura Kallmeyer . . . . . 18

UKP TU-DA at GermEval 2017: Deep Learning for Aspect Based Sentiment Detection
Ji-Ung Lee, Steffen Eger, Johannes Daxenberger and Iryna Gurevych . . . . . 22

Fasttext and Gradient Boosted Trees at GermEval-2017 on Relevance Classification and Document-level Polarity
Leonard Hövelmann and Christoph M. Friedrich . . . . . 30

GermEval 2017: Sequence based Models for Customer Feedback Analysis
Pruthwik Mishra, Vandan Mujadia and Soujanya Lanka . . . . . 36

DIDS IUCL: Investigating Feature Selection and Oversampling for GermEval2017
Zeeshan Ali Sayyed, Daniel Dakota and Sandra Kübler . . . . . 43

PotTS at GermEval-2017 Task B: Document-Level Polarity Detection Using Hand-Crafted SVM and Deep Bidirectional LSTM Network
Uladzimir Sidarenka . . . . . 49

LT-ABSA: An Extensible Open-Source System for Document-Level and Aspect-Based Sentiment Analysis
Eugen Ruppert, Abhishek Kumar and Chris Biemann . . . . . 55



Proceedings of the GermEval 2017 – Shared Task on Aspect-based Sentiment in Social Media Customer Feedback, pages 1–12, Berlin, Germany, September 12, 2017.

GermEval 2017: Shared Task on Aspect-based Sentiment in Social Media Customer Feedback

Michael Wojatzki† and Eugen Ruppert‡ and Sarah Holschneider⋄ and Torsten Zesch† and Chris Biemann‡

†Language Technology Lab, CompSci and Appl. CogSci, Universität Duisburg-Essen, http://www.ltl.uni-due.de/
⋄Technische Universität Darmstadt, http://www.tu-darmstadt.de
‡Language Technology Group, Computer Science Department, Universität Hamburg, http://lt.informatik.uni-hamburg.de

Abstract

This paper describes the GermEval 2017 shared task on Aspect-Based Sentiment Analysis that consists of four subtasks: relevance, document-level sentiment polarity, aspect-level polarity and opinion target extraction. System performance is measured on two evaluation sets – one from the same time period as the training and development set, and a second one, which contains data from a later time period. We describe the subtasks and the data in detail and provide the shared task results. Overall, the shared task attracted over 50 system runs from 8 teams.

1 Introduction

In a connected, modern world, customer feedback is a valuable source for insights on the quality of products or services. This feedback allows other customers to benefit from the experiences of others and enables businesses to react to requests, complaints or recommendations. However, the more people use a product or service, the more feedback is generated, which results in the major challenge of analyzing huge amounts of feedback in an efficient, but still meaningful way.

Recently, shared tasks on Sentiment Analysis have been organized regularly, the most popular being the shared tasks in the SemEval framework (Pontiki et al., 2015; Pontiki et al., 2016). And even though the number of domains and languages is growing with each iteration, there has not existed a large public sentiment analysis dataset for German until now.

To fill this gap, we conducted a shared task¹ on automatically analyzing customer reviews and news about “Deutsche Bahn” – the major German public train operator, with about two billion passengers each year. This is the first shared task on German sentiment analysis that provides a large annotated dataset for training and evaluating machine learning approaches. Furthermore, it features one of the largest datasets for sentiment analysis overall, containing annotations on almost 28,000 short documents, more than 10 times the number of training instances in the largest set to date (from SemEval-2016, task 5 ‘Arabic Hotels’).

¹Documents and description of the GermEval 2017 shared task are available on the task website: https://sites.google.com/view/germeval2017-absa/

2 Task Description

The shared task features four subtasks, which can be tackled individually. They are aimed at realizing a full classification pipeline when dealing with web data from various heterogeneous sources. First, in Subtask A, the goal is to determine whether a review is relevant to our topic. In real life scenarios this task is necessary to filter irrelevant documents that are a by-catch of the method of collecting the data. Second, Subtask B is about inferring a customer’s overall evaluation of the Deutsche Bahn based on the given document. Here, we support a use-case in which e.g. a manager is interested in how well or badly the offered services are perceived overall. Third, Subtask C addresses a more fine-grained level and aims at finding the particular kind of service, called aspect, which is referred to positively or negatively. Finally, in Subtask D the task is to identify the actual expressions that verbalize the evaluations covered in Subtask C, commonly known as opinion target expression (OTE) identification.

2.1 Subtask A: Relevance Classification

The first subtask is used to filter incoming documents so that only the relevant and interesting ones are processed further. The term Bahn can refer to many different things in German: the rails, the train, any track, or anything that can be laid in straight lines. Therefore, it is important to remove documents about e.g. the Autobahn (highway). This is similar for other query terms that are used to monitor web sites and microblogging services.

In Subtask A, the documents have to be labeled in a binary classification task as relevant (true) or irrelevant (false) for Deutsche Bahn. Below is a relevant document about bad behavior in a train, and an irrelevant document about stock exchange developments.

true Ehrlich die männer in Der Bahn haben keine manieren? (Seriously, the men in those trains have no manners!)

false Aus der Presseschau: Japanische S-Bahn wird mit Spiegelwaggons ‘unsichtbar’ (Review: Japanese urban railway becomes ‘invisible’ thanks to reflecting wagons)

2.2 Subtask B: Document-level Polarity

In Subtask B, systems have to identify whether the customer evaluates the service of the railway company – be it e.g. travel experience, timetables or customer communication – as positive, negative, or neutral. During data acquisition, annotators provided more complex aspect-level annotations as used in Subtasks C and D. Document-level sentiment polarity in Subtask B is computed from the individual aspect polarities in the document: If there is a mixture between neutral and positive/negative, the document is classified as positive/negative. If there are two opposing polarities (positive and negative), the overall sentiment is set to neutral.
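This aggregation rule can be written down directly; below is a minimal sketch in Python (the helper name document_polarity is hypothetical, and the organizers' actual aggregation code is not shown in this paper):

def document_polarity(aspect_polarities):
    # Derive the document-level label from aspect-level labels,
    # following the rule above: opposing polarities yield neutral,
    # otherwise a positive/negative aspect outweighs neutral ones.
    labels = set(aspect_polarities)
    if "positive" in labels and "negative" in labels:
        return "neutral"
    if "positive" in labels:
        return "positive"
    if "negative" in labels:
        return "negative"
    return "neutral"

# A mixture of neutral and negative aspects yields a negative document:
assert document_polarity(["neutral", "negative"]) == "negative"
# Opposing polarities cancel out to neutral:
assert document_polarity(["positive", "negative"]) == "neutral"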

2.3 Subtask C: Aspect-level Polarity

For Subtask C, participants are asked to identify all aspects in the document. Each aspect should be labeled with the appropriate polarity label. Since, in the annotations, it was possible to label multiple tokens with the same aspect, multiple mentions of the same aspect are possible. The example below shows a mixed sentiment in a document that is presented as a dialogue.

The positive aspect is the end of a strike – Streik beendet. The negative aspect in this document concerns the tickets, which are getting more expensive – die Tickets teurer. Thus, in the given example, the task is to identify the aspects (and their polarity) in the following way: Ticketkauf#Haupt:negative, Allgemein#Haupt:positive.

Sentiment   Example

German
negative    Re: Ingo Lenßen Guten morgen Ingo...bei mir kein regen aber bahn fehrt wieder nicht....liebe grusse ....
positive    Re: DB Bahn Danke, hat sich gerade erledigt. Das Team hat mich per E-Mail kontaktiert. Danke trotzdem für die prompte Antwort:-)
neutral     Kann man beim DB Navigator (APP) auch Jugend/Kinder Karten buchen?

English
negative    Re: Ingo Lenßen Good morning Ingo...No rain where I am but no trains again. Best wishes ....
positive    Re: DB Bahn Thanks, sorted. I was contacted by the team. Anyways, thanks for replying so fast :-)
neutral     Can you book concessions/child tickets using the DB Navigator (App)?

Table 1: Example for Document Sentiment

Sentiment   Example

German
positive    Alle so ‘Yeah, Streik beendet’
negative    Bahn so ‘Okay, dafür werden dann natürlich die Tickets teurer’ Alle so ‘Können wir wieder Streik haben?’

English
positive    Everybody’s like ‘Yeah, strike’s over’
negative    Bahn goes ‘Okay, but therefore we’re going to raise the prices’ Everybody’s like ‘Can we have the strike back?’

Table 2: Example for Document Sentiment

The aspect classification was provided by the data analysis from Deutsche Bahn and was refined during the annotation process.

2.4 Subtask D: Opinion Target Extraction

For the last subtask, participants should identify the linguistic expressions that are used to express the aspect-based sentiment (Subtask C). The opinion target expression is defined by its starting and ending offsets. For human readability, the target terms are also included in the data.

An example is given in Listing 1. In this document, the task is to identify the target expression fährt nicht (does not drive/go), which is an indication of an irregularity in the operating schedule.

While the data set is available in both TSV and XML formats (see Section 3.4), Subtask D can only be done using the XML format, as the spans of the opinion target expressions are not available in the document-based TSV format. For more detail, see the next section.



<Document>
  <text>@m_wabersich IC 2151? Der fährt nicht. Ich habe Ihnen die Alternative bereits genannt. /je</text>
  <Opinions>
    <Opinion aspect="Sonstige_Unregelmässigkeiten#Haupt" from="26" to="37" polarity="negative" target="fährt nicht"/>
  </Opinions>
</Document>

Listing 1: Example document for Subtask D. Translation: @m_wabersich IC 2151? It does not run. I have already told you about an alternative. aspect = miscellaneous irregularities#Main, target = "does not run".


3 Dataset

3.1 Data Collection

The data was crawled from the Internet on a daily basis with a list of query terms. We filtered for German documents and focused on social media, microblogs, news, and Q&A sites. Besides the document text, meta information like URL, date, and language was collected as well.

In the project context, we received more than 2.5 million documents overall, spanning a whole year (May 2015–June 2016), so that we could capture all possible seasonal problems (holidays, heat, cold, strikes) as well as daily problems such as delays, canceled trains, or unclear information at the train stations. From this large amount of documents, we sampled approximately 1,500 documents from each month for annotation. Since the word-list-based relevance filtering is very coarse and a lot of irrelevant documents were present in the initial samples, e.g. questions about the orbit of the moon (Mondumlaufbahn, lunar orbit) or mentions of air draft (zugig, drafty), we trained a baseline SVM classifier to perform pre-filtering and increase the number of relevant and interesting documents per split. The annotated data is used for the training and development sets, as well as for a synchronic test set.
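Such an SVM pre-filter can be approximated with off-the-shelf components; the following is a minimal sketch assuming scikit-learn and TF-IDF features (the actual feature set and training data of the pre-filter are not specified in the paper, and the toy documents and labels below are invented for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented toy sample: True = relevant to Deutsche Bahn, False = by-catch.
docs = ["Die Bahn hat schon wieder Verspätung",
        "Stau auf der Autobahn bei Köln",
        "Zug nach Berlin fällt aus",
        "Die Mondumlaufbahn dauert 27 Tage"]
labels = [True, False, True, False]

# TF-IDF features with a linear SVM, in the spirit of a baseline classifier.
prefilter = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
prefilter.fit(docs, labels)

# Keep only documents predicted relevant before handing them to annotators.
candidates = ["Bahn streikt erneut", "Autobahn A3 gesperrt"]
print([d for d, keep in zip(candidates, prefilter.predict(candidates)) if keep])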

Additionally, to test the robustness of the participating systems, we created a diachronic test set, which was (pre-)processed and annotated in the same manner, using data from November 2016 to January 2017.

3.2 Annotation

For annotation, we used WebAnno (de Castilho et al., 2016). Annotators were asked to perform the full annotation of every document assigned to them. To keep the individual tasks manageable, we split the annotation tasks into chunks of 100 short documents, which could be completed in 1–2 hours by an annotator.

The annotation task consisted of first labeling the document relevance. For relevant documents, the annotators had to identify the aspect targets (spans of single or multiple tokens) and label them with one of 19 aspects and, if identifiable, with one of overall 43 sub-aspects.

Relevant documents that did not contain a clear aspect target expression could also be assigned a document-level aspect annotation. The polarity words for each aspect target were annotated as well. If they were not part of the OTE (as e.g. Verspätung – delay, which is inherently negative), they were connected with the aspect-bearing word using a relation arc. These annotations have not been distributed as part of this challenge, but will be made available afterwards. Expressions of the same aspect were also connected from left to right.

The annotation team consisted of six trained student annotators and a supervisor/curator. Every document was annotated by two annotators in differing pairings. The curator checked the documents for diverging annotations and decided on the correct one using WebAnno's curation interface. Furthermore, she was also able to add new annotations, in case the others missed some expressions. In weekly feedback sessions, the team talked about new problems and added the results to the annotation guidelines.² This led to consistent improvements of inter-annotator agreement over time, see Table 3 and Figure 1. The overall lower agreements for the Relevance classifications are due to the difficulty of deciding between irrelevant documents and documents without explicit sentiments.

²The German annotation guidelines are available at: http://ltdata1.informatik.uni-hamburg.de/germeval2017/Guidelines_DB_v4.pdf



[Figure 1: Development of the inter-annotator agreement over time. Line plot of Cohen's Kappa (y-axis, 20–100) at the dates 07.2016, 10.2016, and 01.2017 for the Relevance, Polarity, Aspect, and Sub-Aspect annotation layers.]

Date         07.2016     10.2016     01.2017
Relevance    0.26–0.51   0.39–0.74   0.51–0.76
Polarity     0.35–0.79   0.45–0.97   0.90–1.00
Aspect       0.42–0.70   0.44–0.93   0.79–1.00
Sub-Aspect   0.35–0.65   0.37–0.87   0.63–1.00

Table 3: Development of the inter-annotator agreement ranges (Cohen's kappa)

3.3 Splits

We obtained about 26,000 annotated documents for the main dataset and about 1,800 documents for the diachronic dataset. We split the main dataset into a training, development and test set using a random 80%/10%/10% split. The number of documents for each split is shown in Table 4. The dataset can be downloaded from: http://ltdata1.informatik.uni-hamburg.de/germeval2017/
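A random 80%/10%/10% split of this kind can be reproduced along the following lines (a sketch assuming scikit-learn; the actual seed and tooling used for the official splits are not stated):

from sklearn.model_selection import train_test_split

documents = list(range(26000))  # stand-ins for the annotated documents

# 80% training; the remaining 20% is halved into development and
# synchronic test (10% each).
train, rest = train_test_split(documents, train_size=0.8, random_state=0)
dev, test_syn = train_test_split(rest, train_size=0.5, random_state=0)
print(len(train), len(dev), len(test_syn))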

Tables 5–7 show the label distributions for the subtasks. There is always a clear majority class, which leads to strong baselines. This is especially apparent for Subtask C (Table 7), where the most frequent label Allgemein (General) is almost 10 times as frequent as the second most frequent label.

train     dev     test_syn   test_dia
19,432    2,369   2,566      1,842

Table 4: Number of documents in data splits

Dataset       true     false
Training      16,201   3,231
Development   1,931    438
Test_syn      2,095    471
Test_dia      1,547    295

Table 5: Relevance Distribution in Subtask A data

Dataset       negative   neutral   positive
Training      5,045      13,208    1,179
Development   589        1,632     148
Test_syn      780        1,681     105
Test_dia      497        1,237     108

Table 6: Sentiment Distribution in Subtask B data

3.4 Formats

We utilize an XML format that is similar to the format used in the SemEval-2016 task on ABSA (Task 5) (Pontiki et al., 2016). Each Document element contains the original URL as the document id and the extracted untokenized document text.



Top 10 aspects per data split:

Training                              Development                 Test_syn                 Test_dia
11,191 Allgemein                      1,363 Allgemein             1,351 Allgemein          1,008 Allgemein
1,240 Zugfahrt                        140 Zugfahrt                178 Sonstige...          144 Zugfahrt
1,007 Sonstige_Unregelmässigkeiten    108 Atmosphäre              160 Zugfahrt             138 Sonstige...
819 Atmosphäre                        102 Sonstige...             112 Atmosphäre           72 Connectivity
417 Ticketkauf                        51 Ticketkauf               75 Ticketkauf            42 Atmosphäre
296 Service_und_Kundenbetreuung       37 Sicherheit               51 Sicherheit            29 Ticketkauf
278 Sicherheit                        29 Service...               42 Service...            27 Sicherheit
224 Connectivity                      22 Connectivity             31 Informat...           21 Informat...
193 Informationen                     19 Auslastung...            27 Connectivity          18 Service...
158 Auslastung_und_Platzangebot       14 DB_App_und_Website       22 Auslastung...         15 Auslastung...

∑ Rest             377       45       46       33
# Aspects          16,200    1,930    2,095    1,547
# non-Null Asp.    12,139    1,380    2,162    1,163

Table 7: Distribution of top-frequent aspects (aspects partly shortened) in Subtask C data

Furthermore, the relevance and the document polarity are annotated as well. For relevant documents, the opinion target expressions (OTE) are grouped as Opinions. Each Opinion contains the token offsets for the OTE, its aspect and the sentiment polarity. Two examples are given in Listing 2. The first one has identifiable OTEs, while the second one – although relevant – does not provide an explicit opinion target expression.

To increase participation and lower the entry boundary, we also provide a TSV format for document-level annotation in order to enable straightforward use with any document classifier. The TSV format contains the following tab-separated fields:

• document id (URL)

• document text

• relevance (true or false)

• document-level polarity, neutral for irrelevant documents

• aspects with polarities; several mentions are possible, empty for irrelevant documents. Example: Atmosphäre#Haupt:neutral Atmosphäre#Lautstärke:negative

Since there are only document-level labels for the TSV format, Subtask D is not evaluated for TSV submissions.
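Reading this TSV format requires nothing beyond a standard TSV reader; the sketch below uses Python's csv module (the file name is hypothetical):

import csv

# Field order: id (URL), text, relevance, document polarity, aspects.
with open("train.tsv", encoding="utf-8") as f:  # file name hypothetical
    for row in csv.reader(f, delimiter="\t"):
        if len(row) < 4:
            continue  # skip malformed or empty lines
        url, text, relevance, polarity = row[:4]
        # Fifth column: space-separated "Aspect#Sub:polarity" pairs,
        # empty for irrelevant documents.
        aspects = row[4].split() if len(row) > 4 and row[4] else []
        for item in aspects:
            aspect, _, sentiment = item.rpartition(":")
            print(url, aspect, sentiment)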

4 Evaluation Measures and Baselines

We evaluate the system predictions using a micro-averaged F1 score. This metric is well-suited for datasets with a clear majority class because each instance is weighted the same as every other one.
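Micro-averaged F1 pools the decisions over all instances; it can be computed, for example, with scikit-learn (a sketch with invented toy labels):

from sklearn.metrics import f1_score

gold = ["true", "true", "false", "true"]
pred = ["true", "false", "false", "true"]

# Micro-averaging pools all decisions, so every instance carries the
# same weight regardless of how rare its class is.
print(f1_score(gold, pred, average="micro"))  # 0.75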

For Subtasks A and B (relevance and document-level polarity), we only report the F1 score. For the aspect identification (Subtask C), we report scores for the aspect identification itself, as well as the combination of aspect and sentiment, as it can differ between several aspects in a document. Opinion target expression matching is evaluated in an exact setting, where the token offsets have to match exactly, and a less strict setting, which considers overlaps and partial matches as correct. In detail, we consider an expression a match if the span is +/- one token of the gold data.
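Our reading of the two matching settings can be expressed as follows (a sketch over token offsets; the function names and the slack parameter are illustrative, not the official evaluation code):

def exact_match(pred_span, gold_span):
    # Exact setting: start and end token offsets must be identical.
    return pred_span == gold_span

def overlap_match(pred_span, gold_span, slack=1):
    # Relaxed setting: each boundary may deviate by at most `slack`
    # tokens from the gold annotation (+/- one token).
    (ps, pe), (gs, ge) = pred_span, gold_span
    return abs(ps - gs) <= slack and abs(pe - ge) <= slack

# A prediction one token short of the gold span still counts in the
# relaxed setting, but not in the exact one:
assert not exact_match((3, 5), (3, 6))
assert overlap_match((3, 5), (3, 6))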

Majority Class Baseline The majority class baseline (MCB) already yields quite good performance, since Subtasks A, B and C have a clear majority class. Thus, the majority class is a strong prior or fallback alternative for instances without much evidence for the other classes. For Subtask C we assign exactly one opinion with the aspect Allgemein and the sentiment neutral. Since Subtask D is a sequence tagging task, there is no meaningful majority class baseline.

Baseline System We provided a baseline system that uses machine learning with a basic feature set to show the improvements put forth by the participating systems. It uses a linear SVM classifier for Subtasks A, B and C and a CRF classifier for the OTEs in Subtask D, both with a minimal set of standard features. The baseline system is available to the participants for initial evaluation and as a possible weakly-informed classifier in an ensemble learning setting.³ Furthermore, it is open-source, so that participants could use parts – like the document readers or the feature extractors – in their own systems.

³The system is available under the Apache Software License 2.0 at: https://github.com/uhh-lt/GermEval2017-Baseline



<Document id="http://www.neckar-chronik.de/Home/nachrichten/ueberregional/baden-wuerttemberg\_artikel,-Bald-schneller-mit-der-Bahn-von-Deutschland-nach-Paris-\_arid,319757.html">

<text>Bald schneller mit der Bahn von Deutschland nach Paris 5 Stunden 40 Minuten,statt wie bisher 6 Stunden 20 Minuten. Stra ß burg. Man kann auch öfter fahren

. Den neuen grenzüberschreitenden Fahrplan stellte die Regionaldirektion derfranzösischen Bahn SNCF am</text>

<relevance>true</relevance><polarity>positive</polarity><Opinions><Opinion aspect="Zugfahrt#Fahrtzeit_und_Schnelligkeit" from="5" to="14" polarity

="positive" target="schneller"/><Opinion category="Zugfahrt#Streckennetz" from="141" to="153" polarity="positive

" target="öfter fahren"/></Opinions>

</Document>

<Document id="http://twitter.com/majc14055/statuses/649275540877254656"><text>@Cmbln Sollte die S- Bahn Berlin nicht einheitlich 80 fahren, wegen

Konzernvorgabe? Da soll noch Einer durchblicken. ;-)</text><relevance>true</relevance><polarity>neutral</polarity><Opinions>

<Opinion aspect="Allgemein#Haupt" from="0" to="0" polarity="neutral" target="NULL"/>

</Opinions></Document>

Listing 2: Example documents in XML format
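Documents in this XML format can be processed with a standard XML parser; below is a sketch using Python's ElementTree (the file name is hypothetical):

import xml.etree.ElementTree as ET

root = ET.parse("train.xml").getroot()  # file name hypothetical
for doc in root.iter("Document"):
    relevance = doc.findtext("relevance")
    polarity = doc.findtext("polarity")
    for opinion in doc.iter("Opinion"):
        # Some documents use "category" instead of "aspect" (see Listing 2).
        aspect = opinion.get("aspect") or opinion.get("category")
        print(relevance, polarity, aspect, opinion.get("from"),
              opinion.get("to"), opinion.get("polarity"),
              opinion.get("target"))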


The SVM classifiers use the term frequencies of document terms and a sentiment lexicon (Waltinger, 2010) for prediction. The CRF classifier uses the surface token without processing (lemmatization, standardization, lowercasing) and the POS tag. For tokenization and POS tagging, we use the DKPro tools (Eckart de Castilho and Gurevych, 2014) in the UIMA framework (Ferrucci and Lally, 2004). We have also developed a full system in the course of the same project where the data was annotated, described in Ruppert et al. (2017). While the full organizer's system did not compete in the shared task as it was developed over a much longer time, it would have been positioned second and third in Subtask A, first and third in Subtask B and first in Subtasks C and D.
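The baseline CRF itself is implemented with DKPro/UIMA, but its minimal feature set (surface token plus POS tag) can be illustrated with, for example, sklearn-crfsuite (a sketch with an invented toy sentence and BIO labels; not the actual baseline code):

import sklearn_crfsuite

def token_features(tokens, pos_tags, i):
    # Minimal feature set mirroring the baseline description:
    # the unprocessed surface token and its POS tag.
    return {"token": tokens[i], "pos": pos_tags[i]}

# Invented toy sentence with BIO labels for the opinion target expression.
tokens = ["Der", "fährt", "nicht", "."]
pos = ["ART", "VVFIN", "PTKNEG", "$."]
X = [[token_features(tokens, pos, i) for i in range(len(tokens))]]
y = [["O", "B-OTE", "I-OTE", "O"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict(X))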

5 Participation

Overall, 8 teams participated in the shared task. All of them participated in Subtask B and 5 of them in Subtask A. Only Lee et al. (2017) and Mishra et al. (2017) have participated in Subtasks C and D. Table 8 gives an overview of which team participated in which subtask.

5.1 Participants' Approaches

Across all subtasks, the participants have applied a large variety of approaches. However, we can identify trends and commonalities between the teams, which will be discussed in more detail below. For a detailed description of the approaches, we refer to the referenced papers.

Preprocessing Although some teams have used off-the-shelf tokenizers, such as Schulz et al. (2017) who used the opennlp maxent tokenizer, most of the teams relied on their own implementations. These tokenizers were often combined with large sets of rules that cover social media specific language phenomena such as emoticons, URLs, or repeated punctuation (Sayyed et al., 2017; Sidarenka, 2017; Mishra et al., 2017; Hövelmann and Friedrich, 2017). It would have been possible to use tokenizers from the 2016 EMPIRIST task, e.g. (Remus et al., 2016). Moreover, one team (Hövelmann and Friedrich, 2017) further normalized the data by using an off-the-shelf spell checker and rules to replace e.g. numbers, dates, and URLs with a special token.

Besides a tokenizer, many recent neural classifiers do not require deeper preprocessing.



Team reference                    Team name       A  B  C  D
Schulz et al. (2017)              hda                X
UH-HHU-G⁴                         UH-HHU-G        X  X
Lee et al. (2017)                 UKP_Lab_TUDA    X  X  X  X
Mishra et al. (2017)              im+sing         X  X  X  X
Sayyed et al. (2017)              IDS_IULC        X  X
Sidarenka (2017)                  PotTS              X
Naderalvojoud et al. (2017)       HU-HHU             X
Hövelmann and Friedrich (2017)    fhdo            X  X

Table 8: Teams and subtask participation

Nevertheless, some of the participants used lemmatizers, chunkers, and part-of-speech taggers (Sidarenka, 2017; Schulz et al., 2017; Naderalvojoud et al., 2017), relying on the TreeTagger by Schmid (1994) or on the Stanford CoreNLP library (Manning et al., 2014).

To compensate for imbalances in the class distribution, two teams have used sampling techniques (Sayyed et al. (2017) and UH-HHU-G) – namely adaptive synthetic sampling (He et al., 2008) and synthetic minority over-sampling (Chawla et al., 2002).
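Both sampling techniques are available, for example, in the imbalanced-learn package; the sketch below applies them to invented toy data (the teams' actual feature matrices are not shown here):

from collections import Counter
from imblearn.over_sampling import ADASYN, SMOTE
from sklearn.datasets import make_classification

# Invented imbalanced toy data standing in for vectorized documents.
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
print(Counter(y))  # heavily skewed towards the majority class

# SMOTE interpolates new minority samples between existing neighbours;
# ADASYN additionally focuses on harder-to-learn minority regions.
X_smote, y_smote = SMOTE(random_state=0).fit_resample(X, y)
X_ada, y_ada = ADASYN(random_state=0).fit_resample(X, y)
print(Counter(y_smote), Counter(y_ada))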

Sentiment Lexicons Most teams used or experimented with some form of word polarity resources. Two teams (Schulz et al., 2017; Sidarenka, 2017) relied on SentiWS (Remus et al., 2010). The resource was also considered but not included in the actual submissions of Hövelmann and Friedrich (2017). Two teams (Schulz et al., 2017; Mishra et al., 2017) have used the lexicon created by Waltinger (2010). Other similarly used resources include the Zurich Polarity List (Clematide and Klenner, 2010) or the LIWC tool (Tausczik and Pennebaker, 2010).

In addition to the use of pre-calculated or manual resources, some teams also created their own lexicons. For instance, Naderalvojoud et al. (2017) created a sense-based sentiment lexicon from a large subtitle corpus. Sidarenka (2017) created several lexicons, e.g. based on other pre-existing dictionaries and using a German Twitter snapshot.

Dense Word Vectors In addition to word polarity, several teams made use of dense word vectors (also known as word embeddings) and thus integrated distributional semantic word information in their systems. Mishra et al. (2017) trained dense word vectors on a large corpus of parliament speeches using GloVe (Pennington et al., 2014).

⁴Submission withdrawn after reviewing.

Lee et al. (2017) used word2vec (Mikolov et al., 2013) trained word embeddings on Wikipedia. They also trained sentence vectors on the same data and experimented with German-English bilingual embeddings. Finally, some of the teams relied on fastText (Bojanowski et al., 2017) that makes use of sub-word information to create word vectors, addressing phenomena such as German single-token compounding.
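Subword-based vectors of this kind can be trained, for example, with gensim's FastText implementation (a sketch assuming gensim 4.x parameter names; the corpus and hyperparameters are invented for illustration):

from gensim.models import FastText

# Invented mini-corpus of tokenized German sentences.
sentences = [["die", "bahn", "fährt", "heute", "nicht"],
             ["zugverspätung", "am", "hauptbahnhof"]]

# Character n-grams (min_n..max_n) let the model compose vectors for
# unseen German compounds from their subword parts.
model = FastText(sentences, vector_size=50, window=3, min_count=1,
                 min_n=3, max_n=6, epochs=10)

# Even an out-of-vocabulary compound receives a vector from its n-grams.
print(model.wv["bahnverspätung"][:5])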

Classifiers When analyzing the classification algorithms utilized by the participants, we identify three major strands. The first strand comprises approaches that use engineered features to represent the data together with more traditional classification algorithms. The second strand translates the training data into sequences of vectors and feeds them into neural networks. Third, there are ensemble approaches that orchestrate several neural and/or non-neural approaches.

Within the non-neural strand we observe the usage of SVMs (Sidarenka, 2017), CRFs (Mishra et al., 2017; Lee et al., 2017), and threshold-based classification (Schulz et al., 2017). Approaches of the neural strand used several different neural network architectures. Most dominant is the usage of recurrent neural networks that contain long short-term memory (LSTM) units (UH-HHU-G). In particular, many teams used biLSTMs, a variant of LSTMs in which both preceding and following context is considered (Sidarenka, 2017; Mishra et al., 2017; Naderalvojoud et al., 2017; Lee et al., 2017). Other architectures used include convolution layers (UH-HHU-G) and other forms of structured or multi-layered perceptrons (Mishra et al., 2017). Within the ensemble approaches there is an approach of orchestrating several neural networks (Lee et al., 2017), one that combines LSTM and SVM (Sidarenka, 2017), one that uses fastText (Hövelmann and Friedrich, 2017) and two approaches that rely on gradient boosted trees (Hövelmann and Friedrich, 2017; Sayyed et al., 2017).

6 Evaluation Results

As expected, we observe an increasing difficulty between the subtasks in alphabetical order. This means Subtask A is solved better than B, B better than C, and C better than D. Interestingly, for all tasks we only see small differences between the synchronic and diachronic test sets. From this, we can conclude that either all models are robust against temporal fluctuations, or the distributions in this data do not change at a very high speed. Furthermore, both the majority class baseline and our simple baseline system are quite competitive in all tasks. The detailed results of the subtasks are discussed below.

6.1 Subtask A - Relevance

Most of the teams that participated in Subtask A have beaten the majority class baseline and the baseline system. Table 9 gives an overview of the results. Note that the majority class baseline (0.816) and baseline system (0.852) in this subtask are quite strong. The best system by Sayyed et al. (2017) surpassed our baseline system by 0.05 points by using gradient boosted trees and feature selection to obtain the predictions. The second-best team (Hövelmann and Friedrich, 2017) used fastText and applied extensive preprocessing. In future research, it seems worthwhile to examine how these strategies contribute to each system. In addition, we note that the neural approaches of Mishra et al. (2017) and Lee et al. (2017) are almost on par (∼−0.02).

6.2 Subtask B - Document-level Polarity

Similar to Subtask A, we also observe strong baselines in Subtask B, yet most participants surpass them. Table 10 shows the results. Performance among the top three teams is highly similar. This is particularly interesting as the top three teams Naderalvojoud et al. (2017) [0.749], Hövelmann and Friedrich (2017) [0.748] and Sidarenka (2017) [0.745] have all followed completely different approaches. Naderalvojoud et al. (2017) made use of a large lexicon that was combined with a neural network. As already described above, Hövelmann and Friedrich (2017) used fastText and extensive preprocessing of the data, whereas Sidarenka (2017) relied on a biLSTM/SVM ensemble. The more or less pure neural approaches of Lee et al. (2017), Sidarenka (2017), and UH-HHU-G yield a slightly worse performance, but still outperform our simple baseline system. Overall, we do not observe large differences between the synchronic and the diachronic test set; however, most systems marginally lose performance on the diachronic data.

6.3 Subtask C - Aspect-level Polarity

Table 11 shows the performance of the two teams that participated in the aspect-based subtask. Only Lee et al. (2017) could outperform both provided baselines on the synchronic data. However, the improvements of 0.001 for aspect classification and 0.03 for aspect and sentiment classification are only slight. Surprisingly, on the diachronic data both teams could neither significantly outperform the baseline system nor the majority class baseline (Allgemein:neutral). Interestingly, we observe an increased performance for all submitted runs on the diachronic data.

6.4 Subtask D - Target Extraction

The same teams that worked on Subtask C also participated in Subtask D. Both teams relied on neural network approaches and outperformed both baselines. While the structured perceptron of Mishra et al. (2017) achieved the best results for the exact metric, the combination of LSTM and CRF by Lee et al. (2017) gained the – by far – best results for the overlap metric. In Table 12 we report the results. As expected, the results of the overlap metric are better than those of the exact metric, as the exact metric is more strict. Similar to Subtask C, we can conclude that the diachronic data can be classified more easily in both metrics.

7 Related Work

First of all, our shared task is related to shared tasks on aspect-based sentiment analysis that were conducted within the International Workshop on Semantic Evaluation (SemEval) (Pontiki et al., 2014; Pontiki et al., 2015; Pontiki et al., 2016). However, we focus exclusively on German but target a larger, monolingual data set. We also relate to previous German shared tasks on aspect-based sentiment analysis such as Ruppenhofer et al. (2016). In contrast to this work, we are pursuing an annotation scheme that is inspired by the needs of an industrial customer as opposed to linguistic considerations.



Team                              Run                                  synchronic  diachronic
Sayyed et al. (2017)              xgboost                              .903        .906
Hövelmann and Friedrich (2017)    fasttext                             .899        .897
Mishra et al. (2017)              biLSTM structured perceptron         .879        .870
Lee et al. (2017)                 stacked learner CCA SIF embedding    .873        .881
Hövelmann and Friedrich (2017)    gbt_bow                              .863        .856
organizers                        baseline system                      .852        .868
UH-HHU-G                          ridge classifier char fourgram       .835        .849
UH-HHU-G                          linear SVC l2 char fivegram          .834        .859
UH-HHU-G                          passive-aggressive char fivegram     .827        .850
UH-HHU-G                          linear SVC l2 trigram                .824        .837
organizers                        majority class baseline              .816        .839
UH-HHU-G                          gru mt                               .816        .840
UH-HHU-G                          cnn gru sent mt                      .810        .839
Hövelmann and Friedrich (2017)    ensemble                             .734        .160

Table 9: Results for Subtask A on relevance detection.

Team                              Run                                  synchronic  diachronic
Naderalvojoud et al. (2017)       SWN2-RNN                             .749        .736
Hövelmann and Friedrich (2017)    fasttext                             .748        .742
Sidarenka (2017)                  bilstm-svm                           .745        .718
Naderalvojoud et al. (2017)       SWN1-RNN                             .737        .736
Sayyed et al. (2017)              xgboost                              .733        .750
Sidarenka (2017)                  bilstm                               .727        .704
Lee et al. (2017)                 stacked learner CCA SIF embedding    .722        .724
Hövelmann and Friedrich (2017)    gbt_bow                              .714        .714
Hövelmann and Friedrich (2017)    ensemble                             .710        .725
UH-HHU-G                          ridge classifier char fourgram       .692        .691
Mishra et al. (2017)              biLSTM structured perceptron         .685        .675
UH-HHU-G                          linear SVC l2 char fivegram          .680        .692
organizers                        baseline system                      .667        .694
UH-HHU-G                          linear SVC l2 trigram                .663        .702
organizers                        majority class baseline              .656        .672
UH-HHU-G                          gru mt                               .656        .672
UH-HHU-G                          cnn gru sent mt                      .644        .668
Schulz et al. (2017)                                                   .612        .616
UH-HHU-G                          passive-aggressive char fivegram     .575        .676

Table 10: Results for Subtask B on sentiment detection.

As we are examining directed opinions, we also relate to shared tasks that were conducted on automatically detecting stance from social media data. Stance is defined as being in favor of or against a given target, which can be a politician, a political assertion or any controversial issue. Stance detection has been addressed by a couple of recent shared tasks – namely SemEval 2016 Task 6 (Mohammad et al., 2016), NLPCC Task 4 (Xu et al., 2016) or IBEREVAL 2017 (Taulé et al., 2017). Similar to them, we find that state-of-the-art methods still have a long way to go to solve the problem and that, in contrast to other domains and tasks, neural networks are not clearly superior and often inferior to more traditional rule-based or feature engineering approaches.

8 Conclusions

In this paper, we describe a shared task on aspect-based sentiment analysis in social media customer feedback. Our shared task includes four subtasks, in which the participants had to detect A) whether feedback is relevant to the given topic Deutsche Bahn, B) which overall sentiment is expressed by a review, C) which aspects are evaluated, and D) which linguistic expressions are used to express these aspects. We provide an annotated data set of almost 28,000 messages from several social media sources. Our dataset thereby represents the largest set of sentiment-annotated German reviews.

The shared task attracted a high variance of approaches from 8 different teams. We observe that the usage of gradient boosted trees, large sentiment lexicons, and the connection of neural and more traditional classifiers are advantageous strategies for the formulated tasks.



                                                                        synchronic             diachronic
Team                    Run                                         aspect  aspect+sent    aspect  aspect+sent
Lee et al. (2017)       LSTM CRF stacked learner correct offsets    .482    .354           -       -
organizers              baseline system                             .481    .322           .495    .389
organizers              majority class baseline                     .442    .315           .456    .384
Mishra et al. (2017)    biLSTM structured perceptron                .421    .349           .460    .401
Lee et al. (2017)       LSTM CRF stacked learner correct offsets 2  .358    .308           -       -
Lee et al. (2017)       LSTM-CRF only correct offsets               .095    .081           -       -

Table 11: Results for Subtask C on aspect-based sentiment detection.

                                                                                  synchronic       diachronic
Team                    Run                                                   exact  overlap    exact  overlap
Mishra et al. (2017)    biLSTM structured perceptron                          .220   .221       .281   .282
Lee et al. (2017)       LSTM CRF stacked learner correct offsets              .203   .348       -      -
Lee et al. (2017)       LSTM CRF stacked learner correct offsets 2            .186   .267       -      -
organizers              baseline system                                       .170   .237       .216   .271
Lee et al. (2017)       LSTM-CRF only correct offsets                         .089   .089       -      -
Lee et al. (2017)       LSTM-CRF stacked learner 4 polarity correct offsets   .024   .183       -      -

Table 12: Results for Subtask D on opinion target expression identification

Nevertheless, our simple baseline classifier is highly competitive across all tasks. We will release the annotated dataset as part of this task. This will hopefully strengthen the research on German sentiment and social media analysis.

Acknowledgments

We would like to thank Axel Schulz and Maria Pelevina from the Deutsche Bahn Fernverkehr AG for their collaboration in the project ABSA-DB: Aspect-based Sentiment Analysis for DB Products and Services as part of the Innovation Alliance DB & TU Darmstadt. We would also like to thank Ji-Ung Lee for his helpful feedback. In addition, this work was supported by the Deutsche Forschungsgemeinschaft (DFG) under grant No. GRK 2167, Research Training Group "User-Centred Social Media". Finally, we thank the Interest Group on German Sentiment Analysis (IGGSA) for their endorsement.

References

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. 2002. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357.

Simon Clematide and Manfred Klenner. 2010. Evaluation and extension of a polarity lexicon for German. In Proceedings of the First Workshop on Computational Approaches to Subjectivity and Sentiment Analysis, pages 7–13, Lisbon, Portugal.

Richard Eckart de Castilho, Eva Mujdricza-Maydt, Seid Muhie Yimam, Silvana Hartmann, Iryna Gurevych, Anette Frank, and Chris Biemann. 2016. A Web-based Tool for the Integrated Annotation of Semantic and Syntactic Structures. In Proceedings of the LT4DH workshop at COLING 2016, pages 76–84, Osaka, Japan.

Richard Eckart de Castilho and Iryna Gurevych. 2014. A broad-coverage collection of portable NLP components for building shareable analysis pipelines. In Proceedings of the Workshop on Open Infrastructures and Analysis Frameworks for HLT (OIAF4HLT) at COLING 2014, pages 1–11, Dublin, Ireland.

David Ferrucci and Adam Lally. 2004. UIMA: An Architectural Approach to Unstructured Information Processing in the Corporate Research Environment. Natural Language Engineering, 10(3-4):327–348.

Haibo He, Yang Bai, Edwardo A. Garcia, and Shutao Li. 2008. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In IEEE International Joint Conference on Neural Networks (IJCNN 2008, IEEE World Congress on Computational Intelligence), pages 1322–1328, Hong Kong, China. IEEE.

Leonard Hövelmann and Christoph M. Friedrich. 2017. Fasttext and Gradient Boosted Trees at GermEval-2017 Tasks on Relevance Classification and Document-level Polarity. In Proceedings of the GSCL GermEval Shared Task on Aspect-based Sentiment in Social Media Customer Feedback, pages 30–35, Berlin, Germany.

Ji-Ung Lee, Steffen Eger, Johannes Daxenberger, and Iryna Gurevych. 2017. UKP TU-DA at GermEval 2017: Deep Learning for Aspect Based Sentiment Detection. In Proceedings of the GSCL GermEval Shared Task on Aspect-based Sentiment in Social Media Customer Feedback, pages 22–29, Berlin, Germany.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60, Baltimore, MD, USA.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In Workshop at the International Conference on Learning Representations (ICLR), pages 1310–1318, Scottsdale, AZ, USA.

Pruthwik Mishra, Vandan Mujadia, and Soujanya Lanka. 2017. GermEval 2017: Sequence based Models for Customer Feedback Analysis. In Proceedings of the GSCL GermEval Shared Task on Aspect-based Sentiment in Social Media Customer Feedback, pages 36–42, Berlin, Germany.

Saif M. Mohammad, Svetlana Kiritchenko, Parinaz Sobhani, Xiaodan Zhu, and Colin Cherry. 2016. SemEval-2016 Task 6: Detecting Stance in Tweets. In Proceedings of the International Workshop on Semantic Evaluation 2016, San Diego, USA.

Behzad Naderalvojoud, Behrang Qasemizadeh, and Laura Kallmeyer. 2017. HU-HHU at GermEval-2017 Sub-task B: Lexicon-Based Deep Learning for Contextual Sentiment Analysis. In Proceedings of the GSCL GermEval Shared Task on Aspect-based Sentiment in Social Media Customer Feedback, pages 18–21, Berlin, Germany.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar.

Maria Pontiki, Dimitris Galanis, John Pavlopoulos, Harris Papageorgiou, Ion Androutsopoulos, and Suresh Manandhar. 2014. SemEval-2014 Task 4: Aspect based sentiment analysis. In Proceedings of SemEval 2014, pages 27–35, Dublin, Ireland.

Maria Pontiki, Dimitris Galanis, Haris Papageorgiou, Suresh Manandhar, and Ion Androutsopoulos. 2015. SemEval-2015 Task 12: Aspect based sentiment analysis. In Proceedings of the 9th SemEval, pages 486–495, Denver, Colorado.

Maria Pontiki, Dimitris Galanis, Haris Papageorgiou, Ion Androutsopoulos, Suresh Manandhar, Mohammad AL-Smadi, Mahmoud Al-Ayyoub, Yanyan Zhao, Bing Qin, Orphee De Clercq, Veronique Hoste, Marianna Apidianaki, Xavier Tannier, Natalia Loukachevitch, Evgeniy Kotelnikov, Núria Bel, Salud María Jiménez-Zafra, and Gülsen Eryigit. 2016. SemEval-2016 Task 5: Aspect Based Sentiment Analysis. In Proceedings of the 10th SemEval, pages 19–30, San Diego, California.

Robert Remus, Uwe Quasthoff, and Gerhard Heyer. 2010. SentiWS: A Publicly Available German-language Resource for Sentiment Analysis. In Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC'10), pages 1168–1171, Valletta, Malta.

Steffen Remus, Gerold Hintz, Chris Biemann, Christian M. Meyer, Darina Benikova, Judith Eckle-Kohler, Margot Mieskes, and Thomas Arnold. 2016. EmpiriST: AIPHES-Robust Tokenization and POS-Tagging for Different Genres. In Proceedings of the 10th Web as Corpus Workshop (WAC-X), pages 106–114, Berlin, Germany.

Josef Ruppenhofer, Julia Maria Struß, and Michael Wiegand. 2016. Overview of the IGGSA 2016 Shared Task on Source and Target Extraction from Political Speeches. In Bochumer Linguistische Arbeitsberichte, pages 1–9, Bochum, Germany.

Eugen Ruppert, Abhishek Kumar, and Chris Biemann. 2017. LT-ABSA: An extensible open-source system for document-level and aspect-based sentiment analysis. In Proceedings of the GSCL GermEval Shared Task on Aspect-based Sentiment in Social Media Customer Feedback, pages 55–60, Berlin, Germany.

Zeeshan Ali Sayyed, Daniel Dakota, and Sandra Kübler. 2017. IDS-IUCL Contribution to GermEval 2017. In Proceedings of the GSCL GermEval Shared Task on Aspect-based Sentiment in Social Media Customer Feedback, pages 43–48, Berlin, Germany.

Helmut Schmid. 1994. Probabilistic part-of-speech tagging using decision trees. In Proceedings of the International Conference on New Methods in Language Processing, pages 44–49, Manchester, UK.

Karen Schulz, Margot Mieskes, and Christoph Becker. 2017. h-da Participation at GermEval Subtask B: Document-level Polarity. In Proceedings of the GSCL GermEval Shared Task on Aspect-based Sentiment in Social Media Customer Feedback, pages 13–17, Berlin, Germany.

Uladzimir Sidarenka. 2017. PotTS at GermEval-2017 Task B: Document-Level Polarity Detection Using Hand-Crafted SVM and Deep Bidirectional LSTM Network. In Proceedings of the GSCL GermEval Shared Task on Aspect-based Sentiment in Social Media Customer Feedback, pages 49–54, Berlin, Germany.

Mariona Taulé, M. Antonia Martí, Francisco Rangel, Paolo Rosso, Cristina Bosco, and Viviana Patti. 2017. Overview of the task of Stance and Gender Detection in Tweets on Catalan Independence at IBEREVAL 2017. In Notebook Papers of the 2nd SEPLN Workshop on Evaluation of Human Language Technologies for Iberian Languages (IBEREVAL), volume 19, Murcia, Spain.

Yla R. Tausczik and James W. Pennebaker. 2010. The psychological meaning of words: LIWC and computerized text analysis methods. Journal of Language and Social Psychology, 29(1):24–54.

Ulli Waltinger. 2010. Sentiment Analysis Reloaded: A Comparative Study On Sentiment Polarity Identification Combining Machine Learning And Subjectivity Features. In Proceedings of the 6th International Conference on Web Information Systems and Technologies (WEBIST '10), pages 203–210, Valencia, Spain.

Ruifeng Xu, Yu Zhou, Dongyin Wu, Lin Gui, Jiachen Du, and Yun Xue. 2016. Overview of NLPCC Shared Task 4: Stance Detection in Chinese Microblogs. In International Conference on Computer Processing of Oriental Languages, pages 907–916, Kunming, China. Springer.



Proceedings of the GermEval 2017 – Shared Task on Aspect-based Sentiment in Social Media Customer Feedback, pages 13–17, Berlin, Germany, September 12, 2017.

h-da Participation at Germeval Subtask B: Document-level Polarity

Karen Schulz, Margot Mieskes∗, Christoph Becker∗
University of Applied Sciences Darmstadt
[email protected], ∗[email protected]

Abstract

This paper describes our participation in subtask B of the Germeval 2017 shared task on Sentiment Polarity Detection in Social Media Customer Feedback. The task asks to classify the provided comments with respect to "Deutsche Bahn" and "travelling". Our system is based on lexical resources, which are combined to determine the polarity for each sentence. The polarity of the individual sentences is used to classify customer reviews.

1 Introduction

This paper describes our participation in the Germeval 2017 subtask B, which deals with document-level polarity on customer reviews in the context of the German railway company (Deutsche Bahn). Originally, our sentiment analysis method was created in R for English Tweets from the financial domain. Therefore, in order to participate in the task at hand, we had to move our current system from one domain (financial), one genre (Twitter) and one language (English) to another domain (travelling by train), another genre (customer reviews) and another language (German).

2 Related Work

Reviewing the available literature on sentiment analysis is beyond the scope of this paper. For an extensive overview of the topic we refer to Liu and Zhang (2012) or Cambria et al. (2017). An overview of various applications that have been studied in the domain of sentiment analysis can be found for example in Lak and Turetken (2017). Taboada et al. (2011) pointed out that rule-based systems are more robust to domain changes, which is one of the key issues in our setup.

The majority of work on sentiment analysis has been done on English, but very little on other languages such as German (see for example Denecke (2008)). In some cases, where other languages were analysed, the authors made use of automatic translation tools. One example is Tumasjan et al. (2010), who analysed tweets. As we deal with German data in this case, we focus on previous work using German data directly.

Waltinger (2010) presented a lexicon for German sentiment analysis, which was compiled by automatically translating existing English dictionaries and manually post-processing the results. The final lexicon has over 3,000 positive features, almost 6,000 negative features and over 1,000 neutral features. This lexicon was tested on 5-star customer reviews, where 1 and 2 star reviews were collapsed to negative reviews, 4 and 5 star reviews were grouped to positive reviews, and 3 star reviews were considered neutral. Comparing positive vs. negative reviews or 1 vs. 5 star reviews using an SVM, the authors achieved an F1-average score of .803 to .876.

Remus et al. (2010) also presented a sentiment lexicon, which contains almost 2,000 positive and negative words each, excluding inflections. It is also based on automatic translation and manual revision. Additionally, the authors used co-occurrence analysis and a collocation lexicon. The evaluation is based on individual words and achieves an overall F1-measure of .84. However, the authors observed that positive words achieved better results than negative words.

Momtazi (2012) worked on sentiment analysis of German sentences from various social media channels. Her rule-based approach used the counts of positive and negative words. She distinguished between binary classifications, where the majority of words determined the final result, and a more fine-grained scenario in which the frequency of the sentiment-bearing words is taken into account as well. Additionally, so-called booster words and negations are considered. Her results, using the rule-based approach, achieved 69.6% accuracy for positive sentences and 71.0% for negative sentences. She attributes this to the observation that "negative opinions (. . .) are more transparent than the positive opinions".

Wiegand et al. (2010) took a closer look at the issue of negation in sentiment analysis. The authors observed that bag-of-words models, which do not explicitly model negation and which lack linguistic analysis, perform reasonably well, especially in document-level analysis. They also point to work using a parser to determine the scope of negation, but do not report on the final results of this. The authors observed that negation words are more important for the classification than diminishers. They also point to words containing negations, which can only be determined using a morphological analysis; however, they cite only few examples where this has been carried out, and the impact on the polarity classification is unclear. While their analysis primarily focuses on English, they point to language-specific phenomena which make the treatment of negations more difficult in languages such as German, where the negation can occur before or after the word(s) it refers to.

3 Experimental Setup

The major functionality of our processing pipeline (see Figure 1) is the computation of sentiment for comments related to "Deutsche Bahn" and "travelling" based on the Germeval 2017 dataset (Wojatzki et al., 2017). We determine the sentiment as "negative", "neutral" or "positive". In order to classify the comments, we split them into sentences and chunks first. Then, each chunk gets lemmatized. For measuring the sentiment of each chunk, we take both the term frequency as well as the polarity of each word into account. If a negation occurs before a sentiment-bearing word, we reverse the sentiment of the respective word. Chunks are then recombined to sentences and comments, which we classify. Details of the pipeline are shown in Figure 1.

4 Resources

Our pipeline is written in R. Through an interface to openNLP1 we use the maxent sentence annotator for tokenization. For lemmatization and chunking we use the TreeTagger (Schmid, 1995). We use the dictionaries provided by the organizers2. Additionally, we use the SentiWS lexicon (Remus et al., 2010), which includes polarities in the interval [-1,1]. We manually created a list of negations, as well as a list of synonyms for “Deutsche Bahn” and “travelling”, based on OpenThesaurus3. The resources we created manually are provided to the community4.

1 https://opennlp.apache.org
2 https://github.com/uhh-lt/GermEval2017-Baseline
3 https://www.openthesaurus.de
4 https://b2drop.eudat.eu/s/IzC5A756GSCofCB

5 Measuring Polarity

We determine the polarity of each document in a three-step pipeline. During preprocessing, we split the comments into individual sentences and chunks. We analyse them individually and combine these analyses in order to measure the sentiment of each sentence. The sentiment of the whole document is then based on the accumulated sentiment of individual words and sentences.

5.1 Preprocessing

Each comment is split into sentences, chunks and tokens. For sentence tokenization we use the Maxent annotator; for lemmatization and chunking we use the TreeTagger.

5.2 Measuring Sentiment

We compute a term-document matrix (TDM) to determine the term frequency of each word in the chunks. Then, we match the words with the entries in the sentiment dictionary. For each chunk, we extract the individual polarities with regard to the applicable word frequencies by multiplying the two values. Negations reverse the polarity of the following sentiment-bearing words in each chunk. We combine the polarities of each chunk, sentence and comment to determine the final sentiment score. More precisely, we combine the polarities of a chunk by taking the arithmetic mean of its measured values. Applicable sentiment scores of chunks are used as polarities to compose σi, which represents the sentiment score of sentence i. We combine the polarities in σi by taking the arithmetic mean of the polarities (xi) multiplied by a weighting factor. This factor corresponds to half of the percentage of applicable chunks in which sentiment can be detected (ρi). We generate a random variable Y that follows a uniform distribution with parameters zero and the maximal number of chunks that are combined per sentence (max(y)). P(Y ≤ yi) represents the probability that Y takes on a value less than or equal to yi.


Figure 1: Pipeline of sentiment detection

Therefore, the weighting factor also corresponds to half of the probability that Y takes on a value less than or equal to the number of chunks that are combined in sentence i (yi). Then, we locate topic-related sentiment in each sentence through the dictionary of topic words and their synonyms. If we are not able to locate a word from the dictionary in sentence i, we set the respective polarity to zero. The final sentiment score of the whole comment is obtained in a similar fashion, by applying the polarity calculation to sentences.

σi = xi · (ρi/2 + P(Y ≤ yi)/2),   Y ∼ U(0, max(y))    (1)
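Under this reading of the definitions, Eq. 1 can be sketched in Python (an illustration with our own argument names, not the authors' code):

```python
def sentence_score(chunk_polarities, n_sentiment_chunks, max_chunks):
    """Sentiment score of sentence i following Eq. 1 (a sketch).

    chunk_polarities   -- mean polarity of every chunk combined in the sentence
    n_sentiment_chunks -- chunks in which sentiment-bearing words were found
    max_chunks         -- max(y): maximal number of chunks combined per sentence
    """
    y_i = len(chunk_polarities)
    if y_i == 0:
        return 0.0
    x_i = sum(chunk_polarities) / y_i    # arithmetic mean of the polarities
    rho_i = n_sentiment_chunks / y_i     # share of chunks with detected sentiment
    p_y = y_i / max_chunks               # P(Y <= y_i) for Y ~ U(0, max(y))
    return x_i * (0.5 * rho_i + 0.5 * p_y)
```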

5.3 Classification

For the final classification of each comment into “negative”, “neutral” and “positive” we use the results of Formula 1. Intuitively, values above 0 would indicate positive comments and values below 0 would be classified as negative. Nevertheless, we experimented with various thresholds for σi and experimentally determined the best set of thresholds for “negative”, “positive” and “neutral” using the development set. These are also used for the test data. Scores for the best set of thresholds on the development set are presented in Table 1 below. As can be seen, the intuitive thresholds achieve worse results than thresholds which are slightly larger and smaller than 0.
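The resulting decision rule is a simple thresholding of the document score (a sketch using the best thresholds from Table 1):

```python
def classify(sigma, th_pos=0.01, th_neg=-0.01):
    # thresholds tuned on the development set (best row of Table 1)
    if sigma > th_pos:
        return "positive"
    if sigma < th_neg:
        return "negative"
    return "neutral"
```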

6 Results

In the following, we present results on the development set and discuss some of the observed errors.

6.1 Evaluation

Table 1 shows results on the provided development data with various thresholds (th), using micro and macro F1 for evaluation. The first line shows results in which comments that have a polarity of zero are classified as neutral. The F1 measures for the positive and negative classes are best in line one. An overall high F1 score is shown in line two, but the F1 measures for the positive and negative classes are lower than 0.2 in this line. This derives from the unbalanced class sizes.

th pos   th neg   micro F1   macro F1
0        0        0.544      0.544
0.01     -0.01    0.666      0.666
0.001    -0.001   0.632      0.632

Table 1: Comparison between thresholds on the dev set

6.2 Error Analysis

Table 2 shows the confusion matrix for the development set. When looking at the wrongly classified instances, we observe several problems that would need further work. One issue lies with the resources we employed in combination with the data and the domain. For example, if “Bahn” (railway) is written in lowercase letters (“bahn”), the TreeTagger sets the verb “bahnen” as lemma instead of the noun “Bahn”. These instances are not regarded as related to travelling by train. The word “Streik” (strike) is marked as only minimally negative in the SentiWS dictionary. Considering the context of Deutsche Bahn or travelling, the polarity of “Streik” should be far more negative. The word “Störung” (malfunction) is not included in SentiWS, but it is sentiment-bearing in this context. The phrase “Streik beenden” (ending a strike) consists of words with negative polarities in SentiWS, but connecting them to “Streik beenden” gives them a positive polarity. This is not recognized by the algorithm.

predicted \ actual   positive   negative   neutral
positive             32         49         145
negative             33         114        137
neutral              84         426        1355

Table 2: Confusion matrix on the dev set using the best thresholds.

Also, the topic dictionary is not comprehensive enough. Hashtags associated with “Deutsche Bahn” or travelling (e.g. #Bahn) are missing from the dictionary, as are Twitter accounts like @DB Bahn or @Regio NRW.

When looking at the confusion matrix, we observe that both positive and negative texts have often been confused with neutral posts by our algorithm. A closer look at the instances reveals that some of them are indeed neutral. One example is: Bahnhof in Wittenberg: Info-Umzug zum Streikende — Wittenberg/Gräfenhainichen - Mitteldeutsche Zeitung (BILD: Baumbach) Alexander Baumbach Am Wittenberger Bahnhof gibt es jetzt Reiseauskünfte nur noch im Container. Bis zum Bezug des neuen Bahnhofgebäudes im nächsten Jahr wird dies auch so bleiben. Wittenberg Pünktlich zum Ende des Lokführer-Streik. (roughly: at the Wittenberg station, travel information is now only available in a container, and this will remain so until the new station building is occupied next year, just in time for the end of the train drivers' strike). This has been marked as neutral by our algorithm, but the gold standard says it is positive, probably due to the end of the strike. However, we tend to agree that the post itself is primarily neutral.

Another instance, such as RT @nhitastic: Wisst ihr was geil ist? WLAN in der deutschen Bahn! @DB Bahn haha (roughly: you know what's awesome? Wi-Fi on the Deutsche Bahn! haha), is also marked as neutral, whereas it is supposed to be positive. Looking at the details, it is obvious that “sounds” such as haha could either point to a positive sentiment or be regarded as ironic, which would make this post negative rather than positive.

An example of a negative post that was classified as neutral by our algorithm is: Bericht über hunderte Funklöcher bei der Bahn https://t.co/cMoUIaSHOk (report about hundreds of cellular dead spots on the railway).

Instances where our algorithm classified a post as negative, whereas it was positive, can be attributed to our lexicon-based approach. For example, in GDL-Streik beendet: Warum Bahnkunden dem Frieden nicht trauen Pfingsten mit der Bahn kann kommen und Millionen Bahnkunden atmen auf. Die Lokführer haben ihren Streik abgeblasen. Reisende bleiben trotzdem skeptisch (roughly: GDL strike over: why rail customers do not trust the peace; Whitsun travel by rail can come and millions of rail customers breathe a sigh of relief; the train drivers have called off their strike, yet travellers remain sceptical), phrases such as “nicht trauen” (not trusting) and words such as “skeptisch” (sceptical) tend to be more negative than positive. Even though this post is about the end of the strike, customers are still wary, which is the major part of the post; therefore, it is classified as negative, whereas the gold standard marks it as positive.

The neutral class was reliably classified. There were similar amounts of instances misclassified as positive or negative.

One example is: Der Schweizer Bahn-Vierer musste sich beim Weltcup im kolumbianischen Cali nur den Russen geschlagen geben. https://t.co/fQ5VUos57w #srfra (roughly: the Swiss track cycling team pursuit squad was beaten only by the Russians at the World Cup in Cali, Colombia), which actually has no relation to travelling by train and is therefore not relevant for the task at hand. Another example is: Verärgerung bei Pendlern in London: Tiefstehende Sonne sorgt bei Bahn für Verspätungen https://t.co/0PgTRO6YS2 (roughly: annoyance among commuters in London: low sun causes train delays). While “sun” would normally be related to a positive sentiment, in this case it caused delays and should therefore be negative. In combination, our lexicon-based approach marked the post as neutral.

7 Conclusion and Future Work

In this paper, we presented our contribution to the GermEval 2017 document-level polarity task using R. Our pipeline is based on various lexical resources, which determine positive and negative words. Additionally, we consider negations in order to switch the sentiment of the respective phrase. We also used various linguistic preprocessing methods such as a chunker to allow for a better treatment of the scope of negations. Lemmatization reduced the search space for sentiment-bearing words.

We aim to continuously improve this pipeline. A major improvement for the future would be to create a domain-dependent sentiment lexicon in order to capture specific words and phrases which, in this particular context, have a stronger polarity than in others. Also, instead of just switching the polarity of a word based on the existence of negation words, a more fine-grained approach would be meaningful. Taking intensifiers into account would also be beneficial. Additionally, the linguistic preprocessing tools we used were not adapted to social media text. Finally, moving from the rule-based method to machine-learning methods would greatly improve the performance of our pipeline.

Acknowledgements

We would like to thank Magdalena Bergold, Cheuk-Hei Chin, Martin Czernin, Viktoria Gaus, Benjamin Lossau, David Michalski, Yeliz Ovic, Björn Severitt, Daniela Di Schiena and Nikolai Spuck, who attended Christoph Becker's course “Textmining with R” and created the very first version on which these results are based.


References

Erik Cambria, Dipankar Das, Sivaji Bandyopadhyay, and Antonio Feraco, editors. 2017. A Practical Guide to Sentiment Analysis. Springer International Publishing.

Kerstin Denecke. 2008. Using SentiWordNet for Multilingual Sentiment Analysis. In IEEE Data Engineering Workshop (ICDEW 2008), pages 507–512.

Parsa Lak and Ozgur Turetken. 2017. The Impact of Sentiment Analysis Output on Decision Outcomes: An Empirical Study. Transactions on Human-Computer Interaction, 9(1):1–22, March.

Bing Liu and Lei Zhang. 2012. A Survey of Opinion Mining and Sentiment Analysis, chapter 13, pages 415–463. Springer Science+Business Media.

Saeedeh Momtazi. 2012. Fine-grained German Sentiment Analysis on Social Media. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC'12), pages 1215–1220, Istanbul, Turkey.

Robert Remus, Uwe Quasthoff, and Gerhard Heyer. 2010. SentiWS – a Publicly Available German-language Resource for Sentiment Analysis. In Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC'10), pages 1168–1171, Valletta, Malta.

Helmut Schmid. 1995. Improvements in Part-of-Speech Tagging with an Application to German. In Proceedings of the ACL SIGDAT Workshop, pages 47–50.

Maite Taboada, Julian Brooke, Milan Tofiloski, Kimberly Voll, and Manfred Stede. 2011. Lexicon-based Methods for Sentiment Analysis. Computational Linguistics, 37(2):267–307.

Andranik Tumasjan, Timm O. Sprenger, Philipp G. Sandner, and Isabell M. Welpe. 2010. Predicting Elections with Twitter: What 140 Characters Reveal about Political Sentiment. In Fourth International AAAI Conference on Weblogs and Social Media, pages 178–185. AAAI.

Ulli Waltinger. 2010. GermanPolarityClues: A Lexical Resource for German Sentiment Analysis. In Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC'10), pages 1638–1642, Valletta, Malta.

Michael Wiegand, Alexandra Balahur, Benjamin Roth, Dietrich Klakow, and Andres Montoyo. 2010. A Survey on the Role of Negation in Sentiment Analysis. In Proceedings of the Workshop on Negation and Speculation in Natural Language Processing, pages 60–68, Uppsala, Sweden.

Michael Wojatzki, Eugen Ruppert, Sarah Holschneider, Torsten Zesch, and Chris Biemann. 2017. GermEval 2017: Shared Task on Aspect-based Sentiment in Social Media Customer Feedback. In Proceedings of the GSCL GermEval Shared Task on Aspect-based Sentiment in Social Media Customer Feedback, Berlin, Germany.


HU-HHU at GermEval-2017 Sub-task B: Lexicon-Based Deep Learning for Contextual Sentiment Analysis

Behzad Naderalvojoud
DFG SFB 991
Hacettepe University, Ankara, Turkey
[email protected]

Behrang Qasemizadeh
DFG SFB 991
Universität Düsseldorf, Düsseldorf, Germany
[email protected]

Laura Kallmeyer
DFG SFB 991
Universität Düsseldorf, Düsseldorf, Germany
[email protected]

Abstract

This paper describes the HU-HHU system that participated in Sub-task B of GermEval 2017, Document-level Polarity. The system uses 3 kinds of German sentiment lexicons that are built using translations of English lexicons, and it employs them in a neural-network based context-dependent sentiment classification method. This method uses a deep recurrent neural network to learn context-dependent sentiment weights that change the lexicon polarity of terms depending on the context of their usage. The performance of the system is evaluated using the benchmarks provided by the task's organizers.

1 Introduction

Sentiment lexicons are known as useful language resources in sentiment analysis systems that determine the sentiment orientation of terms out of context. As manually creating such lexicons is expensive and time-consuming, one possible solution is to translate English resources into other languages (Waltinger, 2010; Ucan et al., 2016). However, natural language is ambiguous and word-by-word translation cannot achieve satisfying results; terms can possess more than one sentiment and can express different sentiments with respect to the context. In this case, sense-based sentiment lexicons, like SentiWordNet (SWN) (Esuli and Sebastiani, 2006), assign polarities to word senses instead of words. However, the number of word senses is not necessarily the same for two terms that are translations of each other. Moreover, not all word senses of a term may express subjectivity in a language (Akkaya et al., 2009). Therefore, we generated a sentiment lexicon for German that takes SWN synsets into account. To map subjective synsets to German terms, we extended the approach used in the HUMIR1 project (Naderalvojoud et al., 2017), applying it to the German language. This approach creates a cross-lingual sense mapping between the SWN synsets and German terms and produces a single polarity value for each term. This value indicates the strength of subjectivity according to the number of mapped synsets. In fact, the polarity and the number of English synsets associated with each term constitute the domain of German terms' sentiments.

1 http://humir.cs.hacettepe.edu.tr/projects/tsa.html

Besides this lexicon, we also employ two other German sentiment lexicons proposed in (Waltinger, 2010) that are translations of the English Subjectivity Clues (Wilson et al., 2005) and SentiSpin (Takamura et al., 2005) lexicons. Three online English-to-German translation systems were used in constructing the German lexicons; a German term that appears in most of the translation results is selected as the translation of the given English term. While the polarity values of the German Subjectivity Clues are assigned manually, they are inherited from the corresponding English resource for the German SentiSpin.

The generated sentiment lexicons are employed in a context-dependent sentiment analysis system that uses a deep recurrent neural network (RNN) to capture the contextual sentiment modifications that can change the prior known sentiments of terms with respect to the context. After describing the construction of the proposed German SentiWordNet lexicon in Section 2, we explain the sentiment analysis system we used in Section 3. Section 4 reports the evaluation results. Finally, Section 5 presents the conclusion.

2 German SentiWordNet Lexicon

The proposed German SentiWordNet lexicon is constructed in the following three main steps, using the Open Subtitle Corpus (Tiedemann, 2009) for the German–English language pair: (1) We first generate a cross-lingual distributional/statistical model to represent subjective English terms2 in the German language vector space. In this model, each English term is represented by a distributional semantic vector whose elements are German terms (Naderalvojoud et al., 2017). (2) The generated model is used to represent the SWN synsets. To represent synsets, the semantic vectors of the synset terms are summed up. (3) Synset mapping is applied to the reduced semantic vectors3 in order to map German terms to subjective SWN synsets. We assume that German terms with a high frequency in the semantic vector are more likely associated with the given synset.
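For concreteness, steps (2) and (3) can be sketched as follows; this is our Python illustration, not the authors' code, and cl_vectors (English term → cross-lingual vector over German dimensions) and german_vocab are assumed data structures:

```python
import numpy as np

def map_synset(synset_terms, cl_vectors, german_vocab, k=10):
    """Sketch of steps (2)-(3): sum the cross-lingual vectors of a synset's
    English terms, then keep the k most frequent German dimensions."""
    vecs = [cl_vectors[t] for t in synset_terms if t in cl_vectors]
    if not vecs:
        return []
    synset_vec = np.sum(vecs, axis=0)        # step (2): synset representation
    top = np.argsort(synset_vec)[::-1][:k]   # step (3): reduce to top-k dimensions
    return [german_vocab[i] for i in top if synset_vec[i] > 0]
```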

As a result of this approach, we achieved two lexicons of 14,309 (named SWN1) and 43,790 (named SWN2) German subjective terms by using two different corpora containing 70,534 and 13,883,398 movie subtitles, respectively.

3 Context-Dependent Sentiment Analysis Using Deep Learning

We use deep learning to capture the implicit sentiment knowledge contained in the semantic/syntactic structure of a sentence. In fact, the prior sentiment of terms in the lexicon can be changed based on the negation, intensification, or semantic structure of terms in the context. For example, the positive sentiment of “good” is shifted to negative in the sentence “Nobody gives a good performance in the team” by the word “nobody”. A similar situation can be seen for the word “great” in the sentence “He was a great liar”, whose positive sentiment (based on the lexicon) is shifted to negative. Thus, the use of sentiment lexicons without consideration of the context cannot achieve satisfying results.

In the sentiment analysis method we use, the contextual sentiment knowledge is combined with the terms' prior sentiments. To this end, we employ a context-sensitive lexicon-based method proposed in (Teng et al., 2016). In this approach, the sentiment score of a sentence is computed as the weighted sum of the polarity values of the subjective terms obtained from the lexicon. The learned weights are considered as context-dependent features that modify the prior polarity values of terms with respect to the context. Therefore, the effects of negation, shifting and intensification can be taken into account in the sentiment classification task. The overall structure of the model is shown in Eq. 1.

2 A term is subjective if it has at least one synset with a non-zero positive or negative polarity value. SWN includes 29,095 subjective synsets; the number of synset terms belonging to these subjective synsets is 39,746.
3 For synset mapping, we reduced the dimension of the vectors to 10 by selecting the most frequent terms.

Score = ∑k=1..N γk × LexScore(tk) + b    (1)

In Eq. 1, N denotes the number of subjective terms in the sentence, LexScore(tk) is the polarity value of term tk in the lexicon, γk is the context-dependent weight of term tk, and b is the sentence bias score.

To learn γ and b, following (Teng et al., 2016), we employ a recurrent neural network (RNN) model with bidirectional long short-term memory (BiLSTM) cells (Graves et al., 2013; Sak et al., 2014) to extract the semantic composition features.
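A minimal PyTorch sketch of this idea, assuming our own layer sizes and names (an illustration of the Teng et al. (2016) scheme, not the authors' implementation):

```python
import torch
import torch.nn as nn

class ContextSentimentScorer(nn.Module):
    """BiLSTM that learns per-token weights gamma_k and a bias b for Eq. 1."""
    def __init__(self, vocab_size, emb_dim=300, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                              bidirectional=True)
        self.gamma = nn.Linear(2 * hidden, 1)   # context-dependent weight per token
        self.b = nn.Parameter(torch.zeros(1))   # sentence bias score

    def forward(self, token_ids, lex_scores):
        # token_ids: (batch, T); lex_scores: (batch, T) prior lexicon polarities,
        # zero for terms that are not in the sentiment lexicon
        h, _ = self.bilstm(self.emb(token_ids))
        gammas = self.gamma(h).squeeze(-1)      # (batch, T)
        return (gammas * lex_scores).sum(dim=1) + self.b   # Eq. 1
```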

We use the sentiment-annotated data provided by the task organizers (Wojatzki et al., 2017) for training the model. The German FastText pre-trained word embeddings4 are used in generating the model in combination with the German sentiment lexicons. We tune the hyper-parameters of our model using the classification results obtained on the development set.

4 Evaluation Report

We evaluate the proposed sentiment model using customer reviews about “Deutsche Bahn” provided by the task organizers5. Table 1 shows the statistics of the data in three categories, “positive”, “negative” and “neutral”. In order to show the significance of contextual sentiment analysis, we also indicate the contextual ambiguity of the training set, relying on the occurrence of subjective terms in irrelevant reviews. For example, in Table 1, while column “NegInPos” shows the percentage of positive reviews with negative clue terms, column “PosInNeg” indicates the occurrence of positive clue terms in negative reviews. The most important point is the occurrence of subjective terms in the neutral reviews (column “SubInNeu”), which can make it hard to distinguish neutral reviews from subjective ones. The other observation is that the neutral reviews outnumber the positive and negative ones. For example, only 6% of the training reviews are positive, whereas 68% are neutral.

4 https://github.com/Kyubyong/wordvectors
5 https://sites.google.com/view/germeval2017-absa/data

Data Split   All     Pos (6%)   Neg (26%)   Neu (68%)
train        19432   1179       5045        13208
dev          2369    148        589         1632
test-syn     2566    105        780         1681
test-dia     1842    108        497         1237
sum          26209   1540       6911        17758

Contextual ambiguity on the train set (%)
German Lexicon     NegInPos   PosInNeg   SubInNeu
SWN1               100.00     93.18      100.00
SentiSpin          93.64      96.13      98.37
SubjectivityClue   98.05      72.80      98.73

Table 1: Statistics of the dataset

As the training set is imbalanced, the performance of the classification model can tend towards the majority class (Naderalvojoud et al., 2015). As a result, the micro F-measure value (which is the shared task evaluation metric) is dominated by the majority class. Hence, we use the macro F-measure along with the micro score in the evaluation.
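The gap between the two averages on an imbalanced label set is easy to reproduce (a toy scikit-learn example with made-up labels):

```python
from sklearn.metrics import f1_score

y_true = ["neu", "neu", "neg", "pos"]   # toy labels for illustration only
y_pred = ["neu", "neu", "neu", "neu"]   # a majority-class predictor
print(f1_score(y_true, y_pred, average="micro"))  # 0.5, inflated by the majority class
print(f1_score(y_true, y_pred, average="macro"))  # ~0.22, penalizes missed minority classes
```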

Table 2 shows the best results on the development set achieved by the shared task baseline system (SVM) and our context-dependent system (RNN). We select the model that produces the best results on the development set for application to the test set. As the two lexicons generated from SWN achieve the best results in comparison to the two other lexicons, we constructed two models according to these lexicons for testing.

From the results shown in Table 2, the RNN model outperforms the baseline system in all three classes. While the baseline system yields weak F-measure values for the positive and negative classes, the RNN-based system achieves F-measure values of 0.4533 and 0.6254, respectively. Despite the lack of positive instances in the training set (6%), the RNN model achieves much better results than the SVM in combination with the proposed German SWN lexicons. This can also be observed in the negative class, in which the F-measure value increases from 0.2212 for the SVM to 0.6254 for the RNN. Overall, the change in the positive–negative macro F-measure value from 0.1173 (Baseline-SVM) to 0.5394 (SWN1-RNN) clearly shows the effect of the proposed lexicon-based context-dependent sentiment analysis method. It is worth noting that a German lexicon has also been used in the baseline system; however, it has not made as much of an impact on the sentiment classification result as the proposed RNN model. Furthermore, while the German SentiSpin lexicon does not improve the performance of the RNN model in the positive class, the proposed German SWN lexicons significantly improve its performance. Although the German Subjectivity Clues lexicon performs better than SentiSpin, the proposed German SWN1 and SWN2 outperform it by 7% and 4% in MacroF1(PN), respectively.

As the neutral class is the majority class in the training set, all systems yield high F-measure values for this class. Nevertheless, the RNN model in combination with the two German SWN lexicons achieves the best results in terms of macro F-measure (0.6452, SWN1-RNN) and micro F-measure (0.7873, SWN2-RNN) over all classes.

Tables 3 and 4 show the results of the baseline and proposed systems on the synchronic and diachronic test sets, respectively. From these results, we observe that the proposed system outperforms the baseline method with all German sentiment lexicons. On the synchronic test set, SWN1-RNN achieves the best macro F-measure value (0.4907), while SWN2-RNN yields the best micro F-measure value (0.7494). On the diachronic test set, the RNN model with both German SWN lexicons achieves the best micro F-measure values. However, they do not maintain this superiority, and the RNN model with the German Subjectivity Clues lexicon gives the best macro F-measure value (0.5211). This may arise from the fact that the polarity values of the German Subjectivity Clues are manually assigned. In summary, the proposed context-dependent sentiment analysis system performs well in combination with the German SWN lexicons and remarkably outperforms the baseline SVM model.

5 Conclusion

This paper presented the sentiment analysis approach of the HU-HHU system in the GermEval 2017 shared task. In this approach, an RNN model is used to learn context-dependent sentiment weights that can change the lexicon polarity of terms depending on the context. As shown in the empirical evaluations, compared to the baseline system, this approach significantly improves the performance of the sentiment classification task.

Lexicon-Model           F1pos    F1neg    F1neu    MacroF1(PN)   MacroF1(all)   MicroF1
Baseline-SVM            0.0134   0.2212   0.8244   0.1173        0.3530         0.7092
SWN1-RNN                0.4533   0.6254   0.8568   0.5394        0.6452         0.7847
SWN2-RNN                0.4381   0.6086   0.8602   0.5234        0.6356         0.7873
SentiSpin-RNN           0.0127   0.6075   0.8525   0.3101        0.4909         0.7716
SubjectivityClues-RNN   0.4112   0.5932   0.8591   0.5022        0.6212         0.7838

Table 2: The best results on the development set

Lexicon-Model           MacroF1   MicroF1
Baseline-SVM            0.3325    0.6730
SWN1-RNN                0.4907    0.7366
SWN2-RNN                0.4806    0.7494
SentiSpin-RNN           0.4764    0.7159
SubjectivityClues-RNN   0.4718    0.7357

Table 3: Results on the synchronic test set

Lexicon-Model           MacroF1   MicroF1
Baseline-SVM            0.3539    0.6894
SWN1-RNN                0.5165    0.7362
SWN2-RNN                0.5036    0.7362
SentiSpin-RNN           0.4482    0.7176
SubjectivityClues-RNN   0.5211    0.7323

Table 4: Results on the diachronic test set

References

Cem Akkaya, Janyce Wiebe, and Rada Mihalcea. 2009. Subjectivity Word Sense Disambiguation. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1, pages 190–199. Association for Computational Linguistics.

Andrea Esuli and Fabrizio Sebastiani. 2006. SENTIWORDNET: A High-Coverage Lexical Resource for Opinion Mining. Institute of Information Science and Technologies (ISTI) of the Italian National Research Council (CNR).

Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. 2013. Speech Recognition with Deep Recurrent Neural Networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6645–6649. IEEE.

Behzad Naderalvojoud, Ebru Akcapinar Sezer, and Alaettin Ucan. 2015. Imbalanced Text Categorization Based on Positive and Negative Term Weighting Approach. In International Conference on Text, Speech, and Dialogue, pages 325–333. Springer.

Behzad Naderalvojoud, Alaettin Ucan, and Ebru Akcapinar Sezer. 2017. A Novel Approach to Rule-Based Turkish Sentiment Analysis Using Sentiment Lexicon. The Scientific and Technological Research Council of Turkey (TUBITAK), 115E440.

Hasim Sak, Andrew Senior, and Françoise Beaufays. 2014. Long Short-Term Memory Recurrent Neural Network Architectures for Large Scale Acoustic Modeling. In Fifteenth Annual Conference of the International Speech Communication Association.

Hiroya Takamura, Takashi Inui, and Manabu Okumura. 2005. Extracting Semantic Orientations of Words Using Spin Model. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 133–140. Association for Computational Linguistics.

Zhiyang Teng, Duy-Tin Vo, and Yue Zhang. 2016. Context-Sensitive Lexicon Features for Neural Sentiment Analysis. In EMNLP, pages 1629–1638.

Jörg Tiedemann. 2009. News from OPUS – A Collection of Multilingual Parallel Corpora with Tools and Interfaces. In Recent Advances in Natural Language Processing, volume 5, pages 237–248.

Alaettin Ucan, Behzad Naderalvojoud, Ebru Akcapinar Sezer, and Hayri Sever. 2016. SentiWordNet for New Language: Automatic Translation Approach. In 2016 12th International Conference on Signal-Image Technology & Internet-Based Systems (SITIS), pages 308–315. IEEE.

Ulli Waltinger. 2010. GERMANPOLARITYCLUES: A Lexical Resource for German Sentiment Analysis. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC), Valletta, Malta, May.

Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. 2005. Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 347–354. Association for Computational Linguistics.

Michael Wojatzki, Eugen Ruppert, Sarah Holschneider, Torsten Zesch, and Chris Biemann. 2017. GermEval 2017: Shared Task on Aspect-based Sentiment in Social Media Customer Feedback. In Proceedings of the GSCL GermEval Shared Task on Aspect-based Sentiment in Social Media Customer Feedback.


UKP TU-DA at GermEval 2017: Deep Learning for Aspect Based Sentiment Detection

Ji-Ung Lee, Steffen Eger, Johannes Daxenberger, Iryna Gurevych
Ubiquitous Knowledge Processing Lab (UKP-TUDA)
Department of Computer Science, Technische Universität Darmstadt
Ubiquitous Knowledge Processing Lab (UKP-DIPF)
German Institute for Educational Research and Educational Information
http://www.ukp.tu-darmstadt.de

Abstract

This paper describes our submissions to the GermEval 2017 Shared Task, which focused on the analysis of customer feedback about the Deutsche Bahn AG. We used sentence embeddings and an ensemble of classifiers for two sub-tasks, as well as state-of-the-art sequence taggers for two other sub-tasks. Relevant aspects to reproduce our experiments are available from https://github.com/UKPLab/germeval2017-sentiment-detection.

1 Introduction

For many companies, customer feedback is an important source for identifying problems affecting their services. Although customer feedback may be obtained by interviewing individual customers or conducting larger studies using questionnaires, these are often cost-intensive. Instead, it is much cheaper to crawl customer feedback from the web, for example from social media platforms like Twitter, Facebook, or even news pages. In contrast to interviews or questionnaires, crawled data is often noisy and does not necessarily cover specific company-related topics. Due to the vast amount of available data on the web, it is crucial to identify relevant documents and extract the feedback automatically.

The GermEval 2017 Shared Task (Wojatzki et al., 2017) focuses on the automated analysis of customer feedback about the Deutsche Bahn AG (DB) in four subtasks, namely (A) relevance classification of documents, (B) identification of the document-level polarity, (C) identification of certain aspects in a single document as well as predicting their category and polarity, and (D) extraction of the exact phrase of a single aspect.

For example, the tweet

@RMVdialog hey, wann fährt denn nach der Störung jetzt die nächste Bahn von Glauberg nach Ffm? (hey, when does the next train from Glauberg to Ffm leave now, after the disruption?)

has the following gold-standard annotations for Tasks A–D, respectively:

(A) true
(B) neutral
(C) Sonstige Unregelmäßigkeiten – negative
(D) Störung

We participated in all subtasks of the shared task. For Tasks A, B, and C we trained models on document-level representations using a classifier ensemble. As Task D can be modeled as a sequence tagging task, we used a state-of-the-art deep neural network tagger with a conditional random field at the output layer.

This work is structured as follows. Section 2 gives an overview of the data. Section 3 details the two main modeling approaches we made use of. Section 4 describes our experimental set-ups, and presents and discusses our results for a selection of well-performing models. We conclude in Section 5.

2 Data

The data provided for this shared task contains ≈ 22,000 German messages from various social media and web sources and has been annotated in a joint project between Technische Universität Darmstadt and DB. In addition to the provided data, we used several external resources for training word and sentence embeddings and computing task-specific features.

2.1 Task Specific Data

The shared task data contains annotations about the relevance R of a message (Task A) and its sentiment polarity (B), either positive P, negative NG, or neutral NT. Relevant messages contain further annotations about their aspects, with the aspect category and sentiment polarity (C) and the exact phrase (target) identified by character offsets (D). Table 1 shows the distribution of classes in the train and dev sets for Tasks A and B.

        R (A)   P (B)   NG (B)   NT (B)
Total   83      6       26       68

Table 1: Class distributions for Tasks A and B in %

Category                        #
Allgemein                       13892
Zugfahrt                        2421
Sonstige Unregelmäßigkeiten     2112
Atmosphäre                      1576
Sicherheit                      962
Ticketkauf                      741
Service und Kundenbetreuung     551
Connectivity                    390
Informationen                   388
Auslastung und Platzangebot     304
DB App und Website              252
Komfort und Ausstattung         166
Barrierefreiheit                89
Toiletten                       54
Image                           54
Gastronomisches Angebot         47
Reisen mit Kindern              46
Design                          37
Gepäck                          15
QR-Code                         1
Total                           24098

Table 2: The number of aspects per category.

For Tasks C and D, the data contains 24,098 aspects in total, which are classified into 20 different categories. Table 2 shows the number of aspects for each category. We observe that the data is highly skewed here, with more than 57% of all aspects being of category “Allgemein”. Table 3 shows the distribution of positive, negative, and neutral aspects in the data.

Furthermore, not every aspect can be matched to an exact phrase (target). In 44% of the cases, a category and polarity are assigned to a message without a target. For these cases, the target is annotated as NULL.

        P (C)   NG (C)   NT (C)
Total   10      42       48

Table 3: Polarity distribution of aspects in %

2.2 External Sources

We use several external data sources for training various word and sentence embeddings, namely a German Wikipedia corpus (Al-Rfou et al., 2013) and a German Twitter corpus (Cieliebak et al., 2017). The Wikipedia corpus is publicly available and contains already-tokenized data. We use a crawler published along with the Twitter corpus to obtain the actual texts of the tweets. This results in a corpus containing 7,464 tweets, which we then tokenized using the Tweet Tokenizer from NLTK (Bird et al., 2009).

We also made use of an English Twitter sentiment corpus of around 40K tweets (Rosenthal et al., 2017), each annotated with positive, negative, or neutral stance, just as the German data. Our hope was that this would provide a strong additional signal from which our learners could induce the sentiment of a tweet, be it English or German. To make use of this additional data, we projected our word and sentence embeddings (see below) into a bilingual German–English embedding space so that they are comparable.1 We used CCA (Faruqui and Dyer, 2014) for this, which requires independently constructed language-specific embeddings and word translation pairs (such as (Katze, cat)) to allow projecting vectors into a joint space. The word translation pairs were induced from the Europarl corpus (Koehn, 2005).
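A sketch of such a projection with scikit-learn's CCA, assuming de_vecs and en_vecs are row-aligned embedding matrices of the induced translation pairs (both are placeholders here):

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# de_vecs, en_vecs: (n_pairs, dim) embeddings of word translation pairs
# such as (Katze, cat); random placeholder data for illustration
de_vecs = np.random.rand(1000, 100)
en_vecs = np.random.rand(1000, 100)

cca = CCA(n_components=50)
cca.fit(de_vecs, en_vecs)                             # learn the two projections
de_joint, en_joint = cca.transform(de_vecs, en_vecs)  # joint bilingual space
```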

3 Methods

In what follows, we describe, on a general level, our approaches to Tasks A and B (Section 3.1) and Task D (Section 3.2), respectively. For Task C, we mixed the approaches outlined in Sections 3.1 and 3.2 in our experiments. We relegate the corresponding model description to Section 4.

3.1 Sentence Embeddings and Classifier Ensemble

We used a unified and minimally expensive (in terms of feature engineering) approach to tackle Tasks A and B, which both concern the classification of documents into categories. We tokenized each document and converted it to an embedding via the tools Sent2Vec (Pagliardini et al., 2017) and SIF (Arora et al., 2017). Both of these tools aspire to improve upon the simple average word embedding baseline for sentence embeddings, but are conceptually simple. We trained Sent2Vec on the union of German Wikipedia data, a Twitter corpus, and the task-specific data of the Shared Task. For SIF, we first created word embeddings with the standard skip-gram model of Word2Vec (Mikolov et al., 2013), and then generated sentence embeddings from these via the specific SIF parametrizations outlined below. We train Word2Vec on the same data sources as Sent2Vec.

1 Besides using the English Twitter sentiment corpus for computing word embeddings, we had hoped that the annotated English data would improve our classification results in German, but initial experiments in which we (naively) merged both annotated datasets led to performance deteriorations, so we abandoned the idea.
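A compact sketch of the SIF construction under the parametrization used later (a = 0.01, r = 2): weighted word averaging, then removal of the first principal components. The lookup tables vecs and p_word are assumptions, and every token is assumed to be covered:

```python
import numpy as np

def sif_embeddings(docs, vecs, p_word, a=0.01, r=2):
    """SIF sentence embeddings (Arora et al., 2017), a minimal sketch.
    docs: list of token lists; vecs: token -> word vector (np.ndarray);
    p_word: token -> unigram probability estimated on the training corpus."""
    X = np.array([np.mean([vecs[w] * a / (a + p_word[w]) for w in doc], axis=0)
                  for doc in docs])
    U, _, _ = np.linalg.svd(X.T, full_matrices=False)  # principal directions
    for u in U[:, :r].T:                               # remove first r components
        X = X - np.outer(X @ u, u)
    return X
```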

After converting documents to embeddings of a particular size d, we train a classifier that maps representations in Rd to one of N classes, where N = 2 for Task A and N = 3 for Task B. We use the stacked learner from Eger et al. (2017) as a classifier. This is an ensemble-based system that uses several base classifiers from scikit-learn and a multilayer perceptron as a meta-classifier to combine the predictions of the base classifiers.
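The exact base classifiers are not listed here, so the following scikit-learn sketch only illustrates the stacking pattern (base classifiers whose probability outputs feed a multilayer perceptron meta-classifier):

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# illustrative base classifiers; the actual ensemble follows Eger et al. (2017)
stacked = StackingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("svc", SVC(probability=True)),
                ("rf", RandomForestClassifier(n_estimators=100))],
    final_estimator=MLPClassifier(hidden_layer_sizes=(100,)),  # meta-classifier
    stack_method="predict_proba",
)
# stacked.fit(X_train, y_train); y_pred = stacked.predict(X_dev)
```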

3.2 (MTL) Sequence Tagging

Task D is naturally modeled as a sequence tagging task; that is, it can be framed as the problem of tagging each element in a sequence of tokens x1, . . . , xT with a label y1, . . . , yT. We used the most recent state-of-the-art sequence tagging frameworks (Lample et al., 2016; Ma and Hovy, 2016), which consist of a (bidirectional) LSTM neural network tagger that uses word- and character-level information, as well as a CRF layer on top that accounts for dependencies between successive output predictions. Moreover, since multi-task learning (MTL) settings in which several tasks are learned jointly have been reported to sometimes outperform single-task learning (STL) scenarios, we directly allow for the inclusion of several tasks during training and prediction time. Our approach builds upon the architecture of Søgaard and Goldberg (2016), in which different tasks feed from particular levels of hidden layers in a deep LSTM tagger. Our employed framework (Kahse, 2017) extends Søgaard and Goldberg (2016) in that we include both character- and word-level information as well as implement CRF layers for each task, as mentioned already. Note that we could in principle train all four Shared Task tasks in a single architecture, possibly with Tasks A and B feeding from lower layers of the deep LSTM, because the tasks satisfy some of the requirements that have often been attributed to successful MTL, such as relatedness of tasks and a natural task hierarchy.

To illustrate, for Task D, the goal is to extract the relevant phrase to be classified in Task C. We frame this as a token-level BIO tagging problem in which each token is labeled with one of the three classes {I, O, B}. That is,

Notrufsystem : 250 Funklöcher bei . . .
B            I I   I          O   . . .

retrieves the target phrase Notrufsystem : 250 Funklöcher from the document.
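Decoding predicted tags back into target phrases is then a small routine (a sketch; the tagger itself is not shown):

```python
def bio_spans(tokens, tags):
    """Recover target phrases from token-level BIO tags (a sketch)."""
    spans, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":                      # a new span starts
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:        # span continues
            current.append(token)
        else:                               # "O" or a stray "I" closes the span
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans

# bio_spans("Notrufsystem : 250 Funklöcher bei".split(), list("BIIIO"))
# -> ["Notrufsystem : 250 Funklöcher"]
```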

4 Experiments

Baseline: The organizers of the shared task provided baselines, consisting of an SVM with unigram word features for Tasks A, B, and C, and a CRF for Task D.2

4.1 Tasks A and B

Approach: We train models with document-level features using the stacked learner. We focus on the comparison of different word and sentence embeddings, and, for Task B, additional polarity features computed using a lexical resource described in Waltinger (2010). For document embeddings, we evaluate average word vectors besides the approaches mentioned above. Furthermore, we ran experiments with combinations of different word and sentence embeddings. For these, we compute average word vectors for a single document and concatenate them with the respective sentence embedding.

Hyperparameters: We compare Word2Vec and 100-dimensional Komninos word embeddings (Komninos and Manandhar, 2016), and 500-dimensional Sent2Vec and SIF sentence embeddings, as described before.3 Word2Vec skip-gram embeddings are computed for dimensions d = 50, 100, 500. In addition, we compare two SIF embeddings computed with different input word embeddings. One was computed from the German data directly, and another one by projecting the German data into a shared embedding space with English embeddings, as described before.

2 An updated data set was released on August 10th. For our submissions, we retrained all models on the new data.
3 For computing the SIF embeddings, we use the word weighting parameter a = 0.01 and subtract the first r = 2 principal components. See the original paper for details.

                         Micro F1
Baseline                 0.882
W2V (d = 50)             0.883
W2V (d = 500)            0.897
S2V                      0.885
S2V + W2V (d = 50)       0.891
S2V + K + W2V (d = 50)   0.890
SIF (DE)                 0.895
SIF (DE-EN)              0.892

Table 4: Task A results

                         Micro F1
Baseline                 0.709
W2V (d = 50)             0.736
W2V (d = 500)            0.753
S2V                      0.748
S2V + W2V (d = 50)       0.744
S2V + K + W2V (d = 50)   0.749
SIF (DE)                 0.759
SIF (DE-EN)              0.765

Table 5: Task B results

Results: The results of the models better than the baseline are reported in Tables 4 and 5. As can be seen, all models only slightly outperform the baseline in Task A. For Task B, all models trained with the stacked learner beat the baseline substantially, even when using only plain averaged word embeddings. We furthermore trained models with additional polarity features for Task B, as mentioned before. For this, we look up all positive, negative, and neutral words in a document and compute a three-dimensional polarity vector using the total counts of found words. These are concatenated to the respective document representation. Adding the polarity features improved the results for all models except those using SIF embeddings (Table 6).
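The polarity feature itself is a three-dimensional count vector (a sketch; the positive/negative/neutral word lists would come from the Waltinger (2010) resource):

```python
import numpy as np

def polarity_features(tokens, pos_words, neg_words, neu_words):
    """Counts of positive, negative, and neutral lexicon words in a document."""
    return np.array([sum(t in pos_words for t in tokens),
                     sum(t in neg_words for t in tokens),
                     sum(t in neu_words for t in tokens)], dtype=float)

# document representation: np.concatenate([doc_embedding, polarity_features(...)])
```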

                         Micro F1
Baseline                 0.709
W2V (d = 50)             0.748
W2V (d = 500)            0.756
S2V                      0.748
S2V + W2V (d = 50)       0.755
S2V + K + W2V (d = 50)   0.751
SIF (DE)                 0.748
SIF (DE-EN)              0.757

Table 6: Task B results with polarity features

                                    Macro F1
Baseline                            0.478
MTL-Adam (d = 50)                   0.438
STL-Adam (d = 50)                   0.458
STL-Adam (d = 100)                  0.488
STL-Adam (d = 100) + POS tags       0.494
STL-AdaDelta (d = 100)              0.543
STL-AdaDelta (d = 100) + POS tags   0.554

Table 7: Task D results

Discussion: Unexpectedly, the model using the averaged Word2Vec embedding performs best for Task A, even though the other embeddings created by Sent2Vec or SIF have the same dimension (500). A reason for this may be chance or the Twitter data. As the experiments of Pagliardini et al. (2017) confirm, averaged Word2Vec embeddings perform rather well for a similarity task on Twitter data. However, we observe that particularly SIF outperforms average word embeddings for Task B. We also observe that the joint EN-DE embeddings improve results for Task B (+0.6% and +0.9%, respectively) but lead to a drop in performance for Task A (-0.3%). This is in line with the common observation that the bilingual signal may provide an additional source of both useful and noisy, irrelevant, or even hurtful information (Faruqui and Dyer, 2014; Eger et al., 2016).

4.2 Task D

Approach: We tackle this task with our sequence tagging framework and evaluate on the dev set using the macro F1 score.

Hyperparameters: We use Word2Vec embeddings of d = 50, 100 trained on German Wikipedia, Twitter, and the shared task data. We also incorporate 20-dimensional skip-gram embeddings for POS tags, trained on the data of the shared task, and concatenate them with the corresponding word vectors. The German STTS POS tags were computed with the Marmot POS tagger (Müller et al., 2013). We furthermore compute 30-dimensional character embeddings on the shared task data (i.e., not pre-trained), using an LSTM with 50 hidden units. Dropout is set to 0.2 for the BLSTM, and the batch size is set to 50 for all experiments. All models were trained with 100 hidden units.

Results: Since the evaluation tool provided by the task organizers always requires a category for computing the scores on Task D, we evaluated our systems using the macro F1 score on the BIO tags. For comparison with the baseline, we compute the score by converting the predictions into BIO format. Table 7 contains our results for Task D.

We trained different set-ups with STL and MTL models. First of all, we evaluated STL against MTL by training two models on 50-dimensional Word2Vec embeddings. For the MTL set-up, we defined the BIO tagging (D) as the main task and added Tasks A, B, and C as auxiliary tasks. For document-level annotations (Tasks A and B), each token of the document is tagged with the respective class of the document. As the results show, the MTL set-up did not improve the macro F1 score in this setting. Thus, we tried to improve the predictions of the STL model in our follow-up experiments. The best results were achieved by using 100-dimensional Word2Vec embeddings with additional POS-tag embeddings. Furthermore, using AdaDelta as the optimizer yielded better results than using Adam.

Further results, using the organizers’ evaluationtool, can be found below.

4.3 Task C

Approach: There are several difficulties in this task. First, documents may contain several aspects of different categories, making this at least a multi-class classification problem for document-level approaches. Furthermore, in some cases one document contains several aspects of the same category. On the document level, one either has to give up on predicting multiple aspects of one class, or add classes for each possible combination of categories, leading to a huge number of classes which do not scale well to new data. Second, there exist aspects with NULL targets which are not assigned to any tokens in the text, but still belong to a category and have a polarity. They cannot be expressed properly on the token level, as they were not annotated with this intention. One solution may be assigning all tokens of a document to a NULL target category, but this leads to overlapping categories on the token level, adding more difficulty to the task itself.

To obtain aspect category and polarity predictions, we evaluate various combinations of the stacked learner and the sequence tagger. We report the results for three approaches. (1) We use independent STL sequence taggers to predict the BIO labeling as well as the category and polarity of each token in a document (INDEP). (2) We predict the BIO labeling first, and feed each identified entity to our described ensemble model to predict the category and polarity of the identified targets (PIPE). (3) We use the sequence tagger for BIO tagging and category prediction (the label set is {B, I, O} × {Allgemein, Sicherheit, . . .}) and the stacked learner for polarity prediction (JOINT).

[Figure 1: Combination of predictions from several independent sequence tagging models (Task C).]

INDEP: We train a separate model for five subtasks, namely the prediction of the BIO labeling, category (cat), polarity (pol), NULL category (n-cat), and NULL polarity (n-pol). If the BIO model predicts B or I for a given token, we look up the cat and pol predictions and obtain the final prediction via a majority vote over the span of B/I tokens. Since O tokens are mapped to the none class, we only predict category and polarity if both are present. As the n-cat and n-pol predictions do not depend on the BIO prediction, we perform a majority vote over the whole document. Figure 1 shows an example of how we combine the predictions for the individual subtasks from different STL models.
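The span-level majority vote can be sketched as follows (function and variable names are ours):

```python
from collections import Counter

def span_label(token_predictions):
    """Majority vote over the per-token predictions of one B/I span."""
    return Counter(token_predictions).most_common(1)[0][0]

# span_label(["Sicherheit", "Sicherheit", "Allgemein"])  ->  "Sicherheit"
```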

PIPE: We train models with the stacked learner for aspect categories and their polarity. As the BIO predictions do not include NULL targets, we train a separate model on the binary task of whether or not a document contains a NULL target. Instead, one could also add another class for documents without any aspects; however, we decided not to further increase the difficulty of Task C, as it already contains 20 classes. If a document is predicted to contain a NULL target, it is added as input for category and polarity prediction. Figure 2 shows the interaction of all models and how the predictions are forwarded to the next model for prediction.

[Figure 2: Prediction of category and polarity using a pipeline of stacked learner models (Task C).]

[Figure 3: Computing predictions using STL and stacked learner (SL) predictions (Task C).]

JOINT: Here, we use the BIO and category predictions of the sequence tagger, while using the stacked learner for polarity prediction. For NULL targets, we train a separate model for category prediction with the stacked learner, similar to the PIPE approach. The model is illustrated in Figure 3.

Results: We train the stacked learner using 500-dimensional Word2Vec embeddings, which showed good performance for Tasks A and B. We do not use Sent2Vec or SIF, since many targets for category and polarity prediction consist of only one word. For targets with longer sequences, we average the word vectors over all tokens.

We used the same hyper-parameters as in Tasks A, B, and D. Table 8 shows the results for our three final systems (INDEP, PIPE, and JOINT), evaluated with the tool provided by the organizers. It calculates the micro F1 scores of only the categories (C-1) and of the categories along with their sentiment (C-2) for Task C, and, for Task D, the micro F1 scores based on exact (D-1) and overlapping (D-2) matching of the offsets. We obtain BIO predictions for Task D using the best model, namely STL-AdaDelta (d = 100) + POS tags, and trained the additional models for the INDEP and JOINT approaches with the same parameters.

As can be seen, INDEP consistently outperforms the baseline except for C-1. Further, PIPE outperforms INDEP except for D-1, where it performs even worse than the baseline. The JOINT approach lies between INDEP and PIPE on average.

Strangely, the organizers' evaluation tool includes the category prediction from Task C when calculating the scores of Task D. The reason for this may be a different point of view on Tasks C and D. If one first identifies the targets and predicts the category and sentiment accordingly, the score for Task D should not be affected by the results for Task C. However, if one first predicts all categories and their sentiment in a document and identifies the targets afterwards, it is important to map the targets to their appropriate categories. Then the correct mapping of category and target may be seen as an additional task which has to be considered when calculating the score for Task D. So even if INDEP, PIPE and JOINT have the same BIO output for Task D, their scores differ due to different predictions of category and sentiment. For example, if the sequence tagging model for categories predicts none for a given chunk, the JOINT and INDEP models discard it for the final results, leading to a different score with the provided evaluation tool. While we tried to model the tasks as they were introduced, our approach of first identifying the targets and then predicting category and sentiment seems more intuitive. This way, we do not have the problem of dealing with multiple assignments of one category to a document, as the task is solved on the token level with distinct labels.

           C-1     C-2     D-1     D-2
Baseline   0.477   0.334   0.244   0.329
INDEP      0.429   0.377   0.253   0.364
PIPE       0.476   0.381   0.233   0.386
JOINT      0.443   0.367   0.250   0.377

Table 8: Task C and D results calculated with the provided evaluation tool

Discussion: All three approaches fail to predict any of the categories Design, Image, and QR-Code. In addition, the JOINT model did not predict any of the categories Gastronomisches Angebot, Toiletten, Reisen mit Kindern, and Gepäck. The INDEP model predicted the smallest number of different categories, adding Informationen, Barrierefreiheit, and Auslastung und Platzangebot to those mentioned before. This is unsurprising given that some categories occur very infrequently in the data (cf. Table 2) and the general skewness of the data distribution.

5 Conclusion

We presented our submissions to the GermEval 2017 Shared Task, which focused on the analysis of customer feedback about the Deutsche Bahn AG. We used neural sentence embeddings and an ensemble of classifiers for two sub-tasks, as well as state-of-the-art sequence taggers for the two other sub-tasks. We substantially outperformed the baseline, particularly for Task B, the detection of sentiment in customer feedback, as well as for Task D, the extraction of phrases which carry the category and polarity of a meaningful aspect in customer feedback.

Acknowledgments

This work has been supported by the German Federal Ministry of Education and Research (BMBF) under the promotional references 03VP02540 (ArgumenText) and 01UG1416B (CEDIFOR).



Fasttext and Gradient Boosted Trees at GermEval-2017 on Relevance Classification and Document-level Polarity ∗

Leonard Hövelmann
University of Applied Sciences and Arts Dortmund (FHDO)
Department of Computer Science
Emil-Figge-Straße 42, 44227 Dortmund, Germany
adesso AG, Stockholmer Allee 20, 44269 Dortmund, Germany
[email protected]

Christoph M. Friedrich
University of Applied Sciences and Arts Dortmund (FHDO)
Department of Computer Science
Emil-Figge-Straße 42, 44227 Dortmund, Germany
[email protected]

Abstract

This paper describes the submissions to the Shared Task on Aspect-based Sentiment in Social Media Customer Feedback of the GermEval 2017 workshop for the two subtasks Relevance Classification (task A) and Document-level Polarity (task B). For each subtask, the results of the same three systems were submitted: a fastText classifier enhanced with pretrained vectors, gradient boosted trees (GBTs) trained on bag-of-words (BOWs), and an ensemble of GBTs trained on word embeddings and on BOWs, respectively. For the subtask Relevance Classification, the best system yields a micro-averaged F1-score of 0.895 on the dev set. For the subtask Document-level Polarity, the best system achieves 0.782 on the dev set. The proposed system achieved second place out of twelve systems submitted by seven teams for task A on both test sets. For task B, the proposed system achieved first place for test set one and second place for test set two out of 17 systems submitted by eight different teams.

1 Introduction

Customer feedback in social networks is a valuable resource for improving the services of companies. Customers often propose improvements and point out criticism that companies were unaware of. It is important to get an impression of the opinions customers hold with regard to companies.

∗ The sources are available under http://bit.ly/2wKZSdE [github.com], last access: 2017-09-05, license: MIT

Separating relevant and irrelevant feedback requires expensive manual work. The GermEval 2017 Shared Task on Aspect-based Sentiment in Social Media Customer Feedback workshop (Wojatzki et al., 2017) addresses the automatic processing of German-language customer feedback with regard to its different characteristics. To this end, four subtasks were defined: binary classification of whether a feedback is relevant to a given instance (e.g., a company) (task A), categorization of the feedback into sentiment classes (positive, neutral, and negative) (task B), binary classification of a specific aspect into sentiment classes (positive and negative) (task C), and opinion target extraction (task D). In order to work on the different subtasks, a training set of customer feedback on the German railroad company Deutsche Bahn was provided. For each subtask, the corresponding class labels for the feedback texts were provided. The goal was to develop a classifier with a high micro-averaged F1-score on the class prediction.

Word embeddings trained with the objective to predict sentiment polarity were successfully applied at the SemEval 2014 workshop (Tang et al., 2014a; Tang et al., 2014b). Ensembles of convolutional neural networks (CNNs) and long short-term memories (LSTMs), trained on pretrained word embeddings, achieved state-of-the-art performance at the SemEval 2017 workshop (Cliche, 2017). Zhang et al. presented character-level CNNs, trained on one-hot encoded character vectors, for text classification tasks (Zhang et al., 2015). Gradient boosted trees (GBTs) (Friedman, 2001) have shown good results in a variety of classification tasks: 17 out of 29 systems that have been published on Kaggle's blog during 2015 used XGBoost, a framework that implements GBTs (Chen and Guestrin, 2016).


The following sections describe the work on two of the four given subtasks (task A and task B).

2 Dataset

The dataset for the classification task consists of customer feedback texts collected from various social media platforms, the hyperlinks to the reviews, and the annotations of class labels per subtask. The class labels contain information about the document-level sentiment polarity (positive, neutral, or negative), the binary relevance label (denoting whether the text contains feedback about the Deutsche Bahn), as well as annotations for other subtasks which will not be considered in this paper. The dataset was split into a training set (train) and a validation set (dev). Two test sets (test) without class labels were provided at a later stage.

Dataset   # Reviews   # true   # false
train     19449       16217    3232
dev       2375        1937     438
test      4408        -        -

Table 1: Number of Reviews in the Data Sets and their Respective Relevance Classes

Dataset   # Reviews   # positive   # neutral   # negative
train     19449       1179         13222       5048
dev       2375        149          1637        589
test      4408        -            -           -

Table 2: Number of Reviews in the Different Document-level Polarity Classes

The distribution of the feedbacks over the different classes is shown in Tables 1 and 2. The classes are not equally distributed, neither in the relevance classification subtask nor in the document-level sentiment polarity classification subtask.

3 Text Preprocessing

The review texts were adopted as the feature representation and the hyperlinks were not considered. Each review text was tokenized into single terms using whitespaces, percent signs, forward slashes, and plus signs as delimiters. The largest source of training data are tweets (≈ 48%), followed by Facebook posts (≈ 22%). Therefore, special attention was paid to Twitter-specific text preprocessing.

Frequently occurring Twitter usernames related to the Deutsche Bahn, like @DB_Bahn, @Bahnansagen, or @DB_Info, are pooled by replacing them with 〈〈〈tokendbusername〉〉〉. Other terms containing an "@" are replaced with the token 〈〈〈tokentwitterusername〉〉〉. The terms S-Bahn and S Bahn are replaced with sbahn. As the fastText classifier removes punctuation marks, the emoticons ":-)", ":)", and ":-))" are replaced by the token 〈〈〈tokenhappysmiley〉〉〉. The emoticons ":-D" and "xD" are replaced by 〈〈〈tokenlaughingsmiley〉〉〉. The emoticons ":-(" and ":(" are replaced by 〈〈〈tokensadsmiley〉〉〉. Punctuation characters are removed. An exception is two or more repetitions of question marks, exclamation points, and periods. These are replaced with the tokens 〈〈〈tokenstrongquestion〉〉〉, 〈〈〈tokenstrongexclamation〉〉〉, and 〈〈〈tokenannoyeddots〉〉〉 in order to retain the emotion expressed by such a writing style. Many of the feedbacks contain time specifications, for example in the following tweet:

Nach 25 Minuten ist mein Nebenschwitzer in der Bahn ausgestiegen. (Roughly: "After 25 minutes, the person sweating next to me got off the train.")

These time specifications are replaced by 〈〈〈tokentimeexpression〉〉〉 using a regular expression. Money amounts are replaced by 〈〈〈tokenmoneyamount〉〉〉. Other numbers are replaced by 〈〈〈tokennumber〉〉〉. Furthermore, hyperlinks occurring within the text are also replaced by the token 〈〈〈tokenhyperlink〉〉〉 in order to prevent overfitting. Finally, quotation marks are replaced by 〈〈〈tokenquotation〉〉〉. After grouping related terms, the remaining text is transformed to lower case and stemmed using the German stemmer in Snowball1. The stemmer also replaces special German characters: the character ß is replaced by ss, and ä, ö, and ü are replaced by a, o, and u. In the next step, the feedbacks are vectorized using the hashing trick (Weinberger et al., 2009). For computational efficiency, the resulting vectors are reduced to a length of 16,384 features by applying a modulo operation. The vector entries are the TF-IDF (term frequency – inverse document frequency) values of the terms (Spärck Jones, 1972).

1 available from: http://snowballstem.org/, last access: 2017-07-19, license: 3-clause BSD
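As an illustration, the preprocessing and the hashed TF-IDF vectorization described above can be sketched as follows in Python. This is a minimal sketch, not the authors' released code: the regular expressions, the example documents, and the use of scikit-learn stand-ins are assumptions, and the Snowball stemming step is omitted for brevity.

    import re
    from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer

    def preprocess(text):
        # Token replacements as described above; all regexes are assumed.
        text = re.sub(r"@(DB_Bahn|Bahnansagen|DB_Info)\b", "<tokendbusername>", text)
        text = re.sub(r"\S*@\S+", "<tokentwitterusername>", text)
        text = re.sub(r"\bS[- ]Bahn\b", "sbahn", text)
        text = re.sub(r":-?\)+", "<tokenhappysmiley>", text)
        text = re.sub(r":-?D|xD", "<tokenlaughingsmiley>", text)
        text = re.sub(r":-?\(", "<tokensadsmiley>", text)
        text = re.sub(r"https?://\S+|www\.\S+", "<tokenhyperlink>", text)
        text = re.sub(r"\b\d{1,2}[:.]\d{2}\b|\b\d+\s*(?:minuten|stunden|uhr)\b",
                      "<tokentimeexpression>", text, flags=re.IGNORECASE)
        text = re.sub(r"\d+(?:,\d+)?\s*(?:€|euro)", "<tokenmoneyamount>", text,
                      flags=re.IGNORECASE)
        text = re.sub(r"\d+", "<tokennumber>", text)
        text = re.sub(r"\?{2,}", "<tokenstrongquestion>", text)
        text = re.sub(r"!{2,}", "<tokenstrongexclamation>", text)
        text = re.sub(r"\.{2,}", "<tokenannoyeddots>", text)
        return text.lower()  # Snowball stemming would follow here

    # Hashing trick: 16,384 hashed term counts, re-weighted with TF-IDF.
    docs = [preprocess(t) for t in ["Nach 25 Minuten kam die S-Bahn!!",
                                    "@DB_Bahn 5 Euro für :-("]]
    counts = HashingVectorizer(n_features=16384, alternate_sign=False,
                               norm=None).transform(docs)
    X_bow = TfidfTransformer().fit_transform(counts)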


4 Additional Features

The TF-IDF vectors are enriched with additional information from three feature sources: LIWC features, word defectiveness, and a sentiment lexicon. The LIWC features (Linguistic Inquiry and Word Count) are determined using the German version of the LIWC computer program (Tausczik and Pennebaker, 2010; Wolf et al., 2008), which "counts words in psychologically meaningful categories" (Tausczik and Pennebaker, 2010). The LIWC tool creates 93 decimal values that can directly be used as features. In addition to the LIWC features, the vector is augmented with a feature set expressing the membership of a review in a certain cluster of reviews. To generate these features, the vector space of the fastText word representations was clustered into 100 clusters. In order to determine the cluster centers, a large vector space model was trained on a snapshot of all German Wikipedia articles2, a monolingual news corpus,3 and the feedback texts themselves (train+dev+test). The k-means algorithm was trained on this Euclidean vector space with the objective to determine 100 cluster centers. Binary features resulting from SentiWS (Goldhahn et al., 2012) were also used. To this end, an n-dimensional vector was created, where n is the sum of positive and negative words. Each index in the vector is connected to one word in SentiWS. During transformation, the respective cell in this vector is set to 1 if the word is contained in the review and to 0 otherwise. The list of words considered positive has a length of 17,626 words and the list of negative words has a length of 19,961 words, resulting in 37,587 additional features. Finally, a feature expressing word defectiveness was created. For this purpose, the German version of LanguageTool4 was used. LanguageTool is able to detect a variety of faulty language, including grammatical errors, missing punctuation, or wrong capitalization. Since the text provided is lower cased and free from punctuation, only errors that match the GermanSpellerRule, which detects spelling errors in the German language, were considered. The feature was created by dividing the number of times the GermanSpellerRule matches in a feedback by the number of words in the respective feedback.

2 https://dumps.wikimedia.org/dewiki/, database dump created on 2017-07-30

3 http://bit.ly/2eJrp6Z [statmt.org], last access: 2017-08-15

4 available from https://languagetool.org, last access: 2017-07-19, license: LGPL 2.1

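As an illustration of the binary SentiWS features described above, a minimal sketch in which a tiny word list stands in for the full 37,587-entry lexicon:

    import numpy as np

    # One dimension per SentiWS entry; set to 1 if the word occurs in the review.
    sentiws_words = ["gut", "schlecht", "verspat"]  # stand-in for the full lexicon
    index = {w: i for i, w in enumerate(sentiws_words)}

    def sentiws_features(tokens):
        vec = np.zeros(len(sentiws_words), dtype=np.int8)
        for tok in tokens:
            i = index.get(tok)
            if i is not None:
                vec[i] = 1
        return vec

    print(sentiws_features(["die", "verspat", "war", "schlecht"]))  # [0 1 1]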

5 Feature Selection

The top-1000 features are chosen from the 16,384 hashed BOW features and the additional features, except for the SentiWS features. The features were selected by performing a χ2-test (Liu and Setiono, 1995, pp. 36–37). Tables 3 and 4 show the top 20 features for the sentiment class and the relevance class, applying χ2 feature selection and mutual information feature selection to the BOW features. The additional features were not considered for the creation of these tables. The features in Tables 3 and 4 give an impression of how differently χ2 selection and mutual information selection choose top features. Since using the hashing trick causes hash collisions, the top features in the tables were determined without using the hashing trick. The mutual information analysis results in many stop words like articles or pronouns, whereas the χ2 analysis yields more meaningful terms like streik (strike) or verspat, which is the stem of Verspätung (delay).
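In scikit-learn terms, this selection step can be sketched as follows; X (the matrix of hashed TF-IDF values plus the additional features) and the labels y are assumed to exist already:

    from sklearn.feature_selection import SelectKBest, chi2

    # chi2 requires non-negative feature values, which holds for TF-IDF weights.
    selector = SelectKBest(chi2, k=1000)
    X_top = selector.fit_transform(X, y)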

χ2 (relevance)          χ2 (document-level polarity)
gestartet               franzos
kehrt                   streikend
asteroid                notrufsyst
germanwing              schnell
schweiz                 aufatm
barbi                   grenzuberschreit
weltmeist               bahnstreik
stellenangebot          lokfuhr
bahn                    storung
job                     sncf
cavendish               regionaldirektion
lik                     tarifkonflikt
osterreich              schlichtung
〈〈〈db username〉〉〉        funkloch
ad                      beend
anna                    paris
schwebebahn             verspat
ors                     gdl
bigg                    streik
lufthansa               beendet

Table 3: Top Features Using χ2-Selection

6 Classifiers

The first system (FHDO GBT BOW) consists of GBTs (Friedman, 2002) trained on BOW vectors. For this system, hashed TF-IDF weights of individual terms are merged with LIWC features.


MI (relevance)       MI (document-level polarity)
es                   fur
nicht                den
zu                   im
re                   es
im                   auf
von                  ich
den                  re
fur                  nicht
ist                  das
auf                  mit
das                  ist
mit                  〈〈〈db username〉〉〉
〈〈〈hyperlink〉〉〉       ein
ein                  in
in                   〈〈〈hyperlink〉〉〉
und                  〈〈〈number〉〉〉
〈〈〈number〉〉〉          und
der                  die
die                  der
bahn                 bahn

Table 4: Top Features Using Mutual Information Selection

Using the feature selection algorithm, the top-1000 features were extracted. For document-level sentiment polarity classification, the top-1000 features include six LIWC features: body, netspeak, other punctuation, positive emotion, auxiliary verbs, and comparisons. For relevance classification, no LIWC features are among the top-1000 features. GBTs are trained on these top-1000 features. The maximal depth of the trees was set to 10 and the number of iterations to 30. For document-level sentiment polarity classification, a one-vs-rest strategy (Hülsmann and Friedrich, 2007) was used, as the implementation of this system only supports GBTs for binary classification. The Gini coefficient is used for impurity calculation and logistic loss as the loss function. The step size was initialized with 0.1.

The second system (FHDO FT) is a fastText classifier (Joulin et al., 2017) trained on the preprocessed text. Word stemming was omitted for fastText classification. Generally, the following default configuration was used: the dimensionality of the word vectors was set to 100, the size of the context window was set to five, negative sampling was used for loss computation, and the learning rate was initialized with 0.05. Deviating from the default configuration, the word vectors were obtained using the word vectors from the large unsupervised corpus (see Section 4).

true or false in the sense of document relevance.Similar to the continuous bag-of-words CBOWmodel (Mikolov et al., 2013), fastText works with(language-dependent) word embeddings. In addi-tion to the CBOW model, fastText word embed-dings also make use of character-level n-grams. Inorder to represent a document, word embeddingsare averaged and word order is discarded. Themodel is capable of handling sentences with a vary-ing number of words. FastText also uses word-levelbigram features in order to retain some informa-tion about the word order. The use of bigrams isbased on the impact of bigrams on classificationaccuracy in sentiment analysis (Wang et al., 2012).The model is trained by minimizing the negativelog-likelihood over classes

$$ -\frac{1}{N} \cdot \sum_{n=1}^{N} y_n \cdot \log\big( f(B \cdot A \cdot x_n) \big) \qquad (1) $$

where $x_n$ "is the normalized bag of features of the n-th document" (Joulin et al., 2017), $y_n$ the label, $A$ and $B$ are weight matrices, and $f$ is the softmax function.

The third model is an ensemble of two GBT classifiers, where one classifier is trained on the BOW vectors and the other on the word embedding vectors, merged with the one-hot encoded cluster membership and the LIWC features (FHDO GBT NSMBL).

The baseline results in Table 5 are provided by the organizers of the GermEval 2017 workshop5. The baseline is a Support Vector Machine (SVM) trained with term frequency and a sentiment lexicon as features.

7 Results

Table 5 shows the results of all three systems. To avoid overfitting on the dev set, 5-fold cross-validation was performed for model selection on both the train and the dev set. The final submission was created by training on both sets and applying the trained model to the two test sets. The first column gives the name of the system. The second column shows the average of the 5-fold cross-validation and the respective standard deviation, while the third column gives the results of the classifier trained on the given training set and validated on the given dev set.

5 http://bit.ly/2xR0zCi [google.com], last access: 2017-07-24


System                   5-fold CV         Train / Dev

Relevance
Baseline                 0.881 (±0.010)    0.882
FHDO FT                  0.907 (±0.007)    0.895
FHDO GBT BOW             0.893 (±0.006)    0.878
FHDO GBT NSMBL           0.887 (±0.004)    0.866
FT UNPROCESSED           0.891 (±0.006)    0.896
FT NO PRETRAIN           0.900 (±0.005)    0.894
GBT TOP 1000             0.885 (±0.005)    0.878
GBT LT TOP 1000          0.886 (±0.006)    0.878
GBT W2V ONLY             0.862 (±0.007)    0.852
GBT W2V LIWC             0.872 (±0.008)    0.858
GBT W2V SENLEX           0.862 (±0.004)    0.847
MLP TOP 1000             0.864 (±0.002)    0.858
MLP W2V ONLY             0.872 (±0.005)    0.872
MLP W2V LIWC             0.107 (±0.005)    0.106
MLP W2V LIWC SENLEX      0.810 (±0.006)    0.796
MLP W2V SENLEX           0.870 (±0.005)    0.858

Document-level Polarity
Baseline                 0.700 (±0.008)    0.710
FHDO FT                  0.775 (±0.007)    0.782
FHDO GBT BOW             0.728 (±0.006)    0.729
FHDO GBT NSMBL           0.751 (±0.004)    0.754
FT UNPROCESSED           0.757 (±0.006)    0.760
FT NO PRETRAIN           0.756 (±0.006)    0.764
GBT TOP 1000             0.728 (±0.007)    0.728
GBT LT TOP 1000          0.728 (±0.007)    0.728
GBT W2V ONLY             0.720 (±0.009)    0.727
GBT W2V LIWC             0.718 (±0.008)    0.725
GBT W2V SENLEX           0.734 (±0.002)    0.731
MLP TOP 1000             0.734 (±0.005)    0.744
MLP W2V ONLY             0.719 (±0.011)    0.721
MLP W2V LIWC             0.585 (±0.008)    0.587
MLP W2V LIWC SENLEX      0.646 (±0.007)    0.653
MLP W2V SENLEX           0.731 (±0.006)    0.742

Table 5: Results of submitted systems and baseline. First column: name of the respective system. Second column: average micro-averaged F1-score and the respective standard deviation, computed using 5-fold cross-validation. Third column: results on the official dev set.

The models whose names start with "FHDO" are the submitted models described in the previous section. The other models were developed for comparison. FT UNPROCESSED is a model applying the same fastText classifier used for FHDO FT to an unprocessed version of the dataset, where the tokens were only split using whitespaces as delimiters. This model is 1.8 percentage points worse for document-level polarity classification and 1.6 percentage points worse for relevance classification. FT NO PRETRAIN is trained like FHDO FT, preprocessing the text, but without incorporating the pretrained vectors. This model is 1.9 percentage points worse for document-level polarity and 0.7 percentage points worse for relevance classification. GBT TOP 1000 are GBTs with the same configuration as the ones in the submissions, which have been trained only on the top-1000 TF-IDF features from the feature selection.

GBT W2V ONLY are GBTs trained on the fastText word embeddings only, whereby the word embeddings were again computed using the pretrained vectors. GBT W2V LIWC is the same setup, but taking into account the LIWC features as additional features. GBT W2V SENLEX is the same setup as GBT W2V ONLY, but considering the sentiment lexicon features as additional features. The influence of these features is very small; the LIWC features did not bring the anticipated results. In addition to the GBT classifier, a feedforward multilayer perceptron (MLP) with one hidden layer of 500 hidden units was used. The output layer was trained using the softmax function, while the hidden layer uses the sigmoid function. The number of output units varies depending on the number of classes. For the MLP, the same experiments were performed as for the GBT. Accordingly, the models starting with MLP have the same setup as the GBT models. No one-vs-rest strategy was used for the MLP. The results of the MLP models are worse compared to the GBT and the fastText results.
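For reference, the GBT configuration described in Section 6 can be approximated with scikit-learn as a stand-in for the authors' implementation (note that scikit-learn uses its own split criterion rather than the Gini coefficient mentioned above); X_top and y are assumed to exist:

    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.multiclass import OneVsRestClassifier

    # Depth 10, 30 boosting iterations, step size 0.1, default logistic loss.
    # The one-vs-rest wrapper mirrors the binary-only GBT implementation used
    # for the three-class polarity task.
    gbt = GradientBoostingClassifier(max_depth=10, n_estimators=30,
                                     learning_rate=0.1)
    clf = OneVsRestClassifier(gbt).fit(X_top, y)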

8 Conclusion

The best results for both subtasks have been achieved by applying the fastText classifier with the pretrained vectors to the preprocessed dataset. Preprocessing the text improved the micro-averaged F1-score by approximately two percentage points. The result of a fastText classifier applied to the unprocessed text was still better than the other tested classifiers. Using fastText without the pretrained vectors causes a drop of approximately two percentage points for document-level sentiment polarity classification and of 0.7 percentage points for relevance classification. Applying GBT classifiers to BOW representations instead of fastText word embedding representations results in a higher micro-averaged F1-score for both subtasks. Using an ensemble of two different GBT classifiers on two different variants of the dataset yields a model with low variance for both subtasks. Of all classifiers, the MLPs showed the poorest performance. Future work should analyze the impact of the different features and perform an error analysis. FastText brought results on a par with other state-of-the-art participating systems. The effect of incorporating subword information for German, with its compound nouns, should be further examined.


References

Tianqi Chen and Carlos Guestrin. 2016. XGBoost. In Balaji Krishnapuram, Mohak Shah, Alex Smola, Charu Aggarwal, Dou Shen, and Rajeev Rastogi, editors, Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD '16, pages 785–794. ACM Press.

Mathieu Cliche. 2017. BB twtr at SemEval-2017 Task 4: Twitter sentiment analysis with CNNs and LSTMs. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 573–580. Association for Computational Linguistics.

Jerome H. Friedman. 2001. Greedy function approximation: A gradient boosting machine. Annals of Statistics, pages 1189–1232.

Jerome H. Friedman. 2002. Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4):367–378.

Dirk Goldhahn, Thomas Eckart, and Uwe Quasthoff. 2012. Building large monolingual dictionaries at the Leipzig Corpora Collection: From 100 to 200 languages. In Proceedings of the 8th International Conference on Language Resources and Evaluation, pages 759–765.

Marco Hülsmann and Christoph M. Friedrich. 2007. Comparison of a novel combined ECOC strategy with different multiclass algorithms together with parameter optimization methods. In Petra Perner, editor, Proceedings of the 5th International Conference on Machine Learning and Data Mining in Pattern Recognition, MLDM 2007, Leipzig, Germany, July 18-20, 2007, pages 17–31. Springer Berlin Heidelberg.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427–431. Association for Computational Linguistics.

Huan Liu and Rudy Setiono. 1995. Chi2: Feature selection and discretization of numeric attributes. In Proceedings of the Seventh International Conference on Tools with Artificial Intelligence, pages 388–391.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. International Conference on Learning Representations (ICLR) Workshop Papers.

Karen Spärck Jones. 1972. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1):11–21.

Duyu Tang, Furu Wei, Bing Qin, Ting Liu, and Ming Zhou. 2014a. Coooolll: A deep learning system for Twitter sentiment classification. Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 208–212.

Duyu Tang, Furu Wei, Nan Yang, Ming Zhou, Ting Liu, and Bing Qin. 2014b. Learning sentiment-specific word embedding for Twitter sentiment classification. In Kristina Toutanova and Hua Wu, editors, Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1555–1565. Association for Computational Linguistics.

Yla R. Tausczik and James W. Pennebaker. 2010. The psychological meaning of words: LIWC and computerized text analysis methods. Journal of Language and Social Psychology, 29(1):24–54.

Hao Wang, Dogan Can, Abe Kazemzadeh, François Bar, and Shrikanth Narayanan. 2012. A system for real-time Twitter sentiment analysis of 2012 U.S. presidential election cycle. In Proceedings of the ACL 2012 System Demonstrations, pages 115–120. Association for Computational Linguistics.

Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola, and Josh Attenberg. 2009. Feature hashing for large scale multitask learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1113–1120.

Michael Wojatzki, Eugen Ruppert, Sarah Holschneider, Torsten Zesch, and Chris Biemann. 2017. GermEval 2017: Shared Task on Aspect-based Sentiment in Social Media Customer Feedback. In Proceedings of the GSCL GermEval Shared Task on Aspect-based Sentiment in Social Media Customer Feedback, Berlin, Germany, September 12th, 2017.

Markus Wolf, Andrea B. Horn, Matthias R. Mehl, Severin Haug, James W. Pennebaker, and Hans Kordy. 2008. Computergestützte quantitative Textanalyse: Äquivalenz und Robustheit der deutschen Version des Linguistic Inquiry and Word Count. Diagnostica, 54(2):85–98.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649–657.


GermEval 2017: Sequence based Models for Customer Feedback Analysis

Pruthwik Mishra∗
IIIT-Hyderabad, Hyderabad, India 500032

Vandan Mujadia†
i.am+ LLC., Bangalore, India 560071

Soujanya Lanka‡
i.am+ LLC., 37 Mapex building, Singapore 577177

Abstract

In this era of information explosion, customer feedback is an important source of published content, especially on social media. Accordingly, analysis of customer feedback is a necessity for companies and enterprises to understand it and adapt based on it. Huge volumes of feedback are tough to peruse manually and hence require automated feedback analysis along with associated polarity detection. Customer feedback analysis forms the theme of the GermEval Task 2017 (Wojatzki et al., 2017)1.

In this paper, we describe our approaches for the different subtasks of the GermEval Task 2017. The major part of our method is based on a bi-LSTM neural network with word embeddings as features. We observe that our approaches outperform the baseline system for three of the four tasks. The rationale behind using a bi-LSTM solution is presented in subsequent sections.

1 Introduction

Customer feedback analysis plays a very important role for an organization to get a clear overview of its products and offered services. Many users express their opinions on social media, which amounts to a huge influx of data. Therefore, extracting relevant information from the gathered data is highly essential. This shared task intends to achieve the same from customer reviews about Deutsche Bahn, a German public train operator.

∗ [email protected]
† [email protected]
‡ [email protected]

1 https://sites.google.com/view/germeval2017-absa/home

The GermEval shared task is divided into four subtasks.

A. Relevance Classification - This subtask deals with identifying whether a feedback is about Deutsche Bahn. Hence, we design this as a binary classification problem.

B. Document-Level Polarity - This subtask is a multi-class classification problem where one needs to identify whether the feedback about Deutsche Bahn is positive, negative, or neutral.

C. Aspect-Level Polarity - This subtask involves the identification of all the aspects present in a review. The goal is to identify the category and the sentiment of an aspect. This is a multi-label classification problem where multiple classes can be active at the same time.

D. Opinion Target Extraction - Finally, this subtask identifies and extracts all the opinion target expressions.

Sentences containing feedback require positional intelligence based on linguistics for two of the four tasks: (a) identification of the aspect, (b) polarity associated with the aspect. Positional intelligence can only be derived by learning the sequential structure of the language or the regular patterns associated with feedback. Bi-LSTMs are known for their sequential learning capacity, and hence we base our solutions on bi-LSTM-driven training and prediction for three of the tasks.

We use bidirectional LSTMs (Graves and Schmidhuber, 2005) for the first three tasks and a structured perceptron (Collins, 2002) for the fourth task. The paper is organized as follows: Section 2 lists the related work and Section 3 describes our approach. Section 4 presents the experiments and results on the development set. Section 5 discusses the performance numbers and Section 6 details the error analysis.


Section 7 concludes the paper with possible future work.

2 Related Work

Sentiment polarity identification has always been an active research area. The polarity label can range from coarse to fine. The coarse polarities can be positive and negative, while much finer labels may refer to highly positive, positive, neutral, negative, and highly negative. So far, many learning approaches like Naive Bayes, SVM, and Maximum Entropy classifiers have been employed to tackle it (Pang et al., 2002; Agarwal et al., 2011). But these approaches rely heavily on custom word lists, part-of-speech tags, fixed syntactic patterns (Turney and Littman, 2003), and other sentential information. Socher et al. (2013) used deep recursive models to capture the semantics of longer phrases, which eventually improved sentiment classification. Qian et al. (2016) used linguistically regularized LSTMs (Hochreiter and Schmidhuber, 1997) for sentiment classification and studied the effects of intensifiers and negations on the polarity of sentences.

But most of this work has been done on the resource-rich English language. For German, Waltinger (2010) translated English dictionaries to put together a consolidated German lexicon. This German lexicon achieved good results on sentiment classification. Momtazi (2012) created a German opinion dictionary with substantial coverage. They built a rule-based sentiment identification system for automatic sentiment detection on social media data.

Aspect or opinion target identification deals with identifying the target words for which a sentiment is expressed. Previous approaches (Jakob and Gurevych, 2010; Hamdan et al., 2015; Kessler and Nicolov, 2009) depend on part-of-speech tags, dependency relations between tokens, and the distance from the nearest noun phrase for finding the target words. SVMs and CRFs have been used extensively for this task.

3 Approach

Most machine learning algorithms require task-specific feature engineering to achieve dependable results.

When it comes to natural language processing (NLP), these features depend on knowledge of the domain, the language, or both, which is a limitation. Hence, for this shared task, we refrained from using language-specific features. We describe our approaches for each of the subtasks in the following subsections. For the first three subtasks, we make use of the Keras (Chollet and others, 2015) deep learning library in order to employ a multi-layer perceptron (MLP) and a bidirectional LSTM (bi-LSTM) (Graves and Schmidhuber, 2005). We used the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.001.

In sequence encoding/tagging, we have access to both past and future words for a given time stamp. The bidirectional LSTM network utilizes this contextual information in both directions (Graves and others, 2012). In doing so, we can efficiently make use of past words (and their features) and future words (and their features) for a given time stamp. For the classification tasks (Tasks A and B), we train bidirectional LSTM networks using backpropagation to learn and remember the forward and backward context, and to derive categorical information we use a softmax layer over the output of the bi-LSTM network.
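To make this concrete, a minimal Keras sketch of such a bi-LSTM classifier is given below. The input shape (200 tokens of 100-dimensional GloVe vectors, cf. Section 4.3) and the Adam learning rate follow the paper; the number of LSTM units is an assumption, as it is not reported:

    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        keras.Input(shape=(200, 100)),          # 200 tokens x 100-dim GloVe vectors
        layers.Bidirectional(layers.LSTM(64)),  # forward and backward context
        layers.Dense(2, activation="softmax"),  # e.g. relevant vs. not relevant
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
                  loss="categorical_crossentropy", metrics=["accuracy"])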

3.1 Relevance Classification

In our MLP architecture, we use an additional sentiment feature: we look up each word in a review in a sentiment lexicon (Waltinger, 2010). If a given word is present, it is assigned the appropriate sentiment, and neutral otherwise. Each word is represented by its distributional vector, which we obtain from a pretrained GloVe word embedding model (Pennington et al., 2014). In the bidirectional LSTM architecture, the only features we use to train the system are a sequence of word embeddings trained with GloVe. We use the terms GloVe vectors and word vectors interchangeably.

3.2 Document-Level Polarity Identification

The architecture and parameters for this subtask are similar to those of the previous one. We also experimented with part-of-speech tag features in addition to the GloVe vectors. The spaCy POS tagger2 is used for part-of-speech tagging each feedback sentence. For this experiment, only the GloVe vectors of adjectives and adverbs, as tagged by the spaCy POS tagger, are chosen.

2 https://spacy.io/docs/usage/pos-tagging


3.3 Aspect Identification

We modeled this problem as a multi-label classification task where multiple labels can be active at a given time. Mean squared error is used as the loss function. Consider the example "Streik endet vorzeitig Der Streik der Gewerkschaft Deutscher Lokomotivführer (GDL) endet am 21." ("Strike ends early. The strike of the German train drivers' union (GDL) ends on the 21st.") For this sentence, there are two aspects, Allgemein:neutral, for the two occurrences of the word "Streik". We combined an aspect and its corresponding polarity to represent a label. The same kind of architecture as in the above two tasks is also employed here. In this model, multiple occurrences of the same aspect in a single document cannot be captured.
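A hedged sketch of this multi-label setup: one sigmoid output per combined aspect+polarity label, so several labels can be active at once, trained with mean squared error as stated above. The number of labels and the LSTM size are placeholders:

    from tensorflow import keras
    from tensorflow.keras import layers

    num_labels = 40  # placeholder for the number of aspect+polarity combinations
    model = keras.Sequential([
        keras.Input(shape=(200, 100)),
        layers.Bidirectional(layers.LSTM(64)),
        layers.Dense(num_labels, activation="sigmoid"),  # multi-label output
    ])
    model.compile(optimizer="adam", loss="mse")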

3.4 Opinion-Target Identification

Given that opinion target identification is a sequence labeling task, we used Conditional Random Fields (Lafferty et al., 2001)3 and structured perceptron (Collins, 2002)4 algorithms for this task. Each word in a sample is represented by its lowercase form, a combination of prefix and suffix characters, the length of the word, the length of the sequence, and its immediate context. Morphological features of a word are important cues for the identification of its lexical category. As the part-of-speech tags were noisy, we use these features instead. Different combinations of the features have been tested, and the best possible combination for this task has been chosen. The target words are annotated in BIO (begin, inside, outside) format as a preprocessing step before training and testing.5
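The token representation described above might look as follows in the dictionary format used by python-crfsuite; the exact affix lengths and the context encoding are assumptions, since the paper does not spell them out, and `sentences` is an assumed list of token lists:

    def token_features(tokens, i):
        w = tokens[i]
        return {
            "lower": w.lower(),      # lowercased form
            "prefix3": w[:3],        # prefix characters
            "suffix3": w[-3:],       # suffix characters
            "word_len": len(w),      # length of the word
            "seq_len": len(tokens),  # length of the sequence
            "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
            "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
        }

    X = [[token_features(s, i) for i in range(len(s))] for s in sentences]
    # Targets are tagged in BIO format, e.g. ["O", "B-TARGET", "I-TARGET", "O"].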

4 Experiment Setup

The corpus size provided in the shared task is detailed below:

4.1 Corpus Details

Type               No of Samples
Train              19449
Dev                2375
Test TimeStamp1    2566
Test TimeStamp2    1842

Table 1: Corpus Size

3 http://python-crfsuite.readthedocs.io/en/latest/
4 https://github.com/larsmans/seqlearn
5 cw → context window

4.2 Preprocessing

As the data provided for this task consisted of social media text, we tokenized the data as a preprocessing step. As part of the tokenization, we considered punctuation marks, except @ and #, as individual tokens. @ and # are associated with Twitter handles and user ids, so we did not tinker with them. The impact of tokenization is detailed below; the train and dev accuracies are reported as percentages.

Model                        Train Acc   Dev Acc
TaskA with punctuation       99.67       86.90
TaskA with no punctuation    99.70       87.32
TaskB with punctuation       98.57       69.81
TaskB with no punctuation    98.42       69.85

Table 2: Experiments with and without punctuation

We can observe that there is only a very minor improvement in polarity and relevance classification when punctuation and whitespace are removed from the documents.

4.3 Model Description

We created the GloVe word embedding model by collecting a large number of crawled German newspapers, the news commentary corpus (Stede, 2004), and the Europarl corpus (Koehn, 2005), totaling 1.9M unique tokens. The size of the GloVe word vectors has been fixed at 100. We experimented only with a single hidden layer with 100 hidden units for the MLP architecture. For the bidirectional LSTM model, we only use GloVe vectors as features. The experiment results are shown in the tables below. The maximum length of a document has been fixed at 200. Each sample is represented as a vector of size 200 ∗ 100 = 20000 for the bi-LSTM and MLP experiments when only word vectors are used as features.


For the combination of the GloVe vectors and sentiment vectors as features, the size of a sample is 200 ∗ (100 + 3) = 20600. For the MLP, all the vectors are concatenated and presented as one input sample, whereas they are given as a sequence of vectors in the case of a bidirectional LSTM. If a document contains more than 200 words, only the first 200 words are used in the experiments and the rest are ignored. When a document has fewer than 200 words, it is padded with zero vectors. We used post-padding for our experiments. Evaluation code has been provided by the organizers6. For the structured perceptron, the number of iterations was fixed at 10 with Viterbi decoding (Viterbi, 1998).
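The post-padding step can be expressed with the Keras preprocessing utilities; the sketch below assumes `docs_as_vectors` holds one array of GloVe word vectors per document:

    from tensorflow.keras.preprocessing.sequence import pad_sequences

    # Keep the first 200 words (truncating="post") and pad shorter documents
    # with zero vectors at the end (padding="post").
    X = pad_sequences(docs_as_vectors, maxlen=200, dtype="float32",
                      padding="post", truncating="post")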

4.4 Experimental Results on the Development Set

The results of the experiments on the development set are detailed in this section. The baseline system was provided by the organizers7. For Task B, we also performed experiments where one network learns with GloVe vectors and the other learns with sentiment vectors. The individual networks are fully-connected dense networks. A hidden layer merges the outputs of these two networks, and the final layer is a softmax layer that predicts the polarity of the document. The results of this experiment are tabulated below.

Model     Feature                     Train Acc   Dev Acc
MLP       glove vectors               99.64       86.82
MLP       sentiment vectors           83.69       81.83
MLP       glove & sentiment vectors   99.62       88.0
BiLSTM    glove vectors               99.34       90.78

Table 3: TaskA Experiments

4.5 GermEval Results

As per the latest test results announced, the standings of our approaches for the various tasks are as follows: (a) Relevance Classification - 3rd place (0.8787), (b) Document-Level Polarity - 11th place (0.6851),

6 https://github.com/muchafel/GermEval2017
7 Baseline system: https://github.com/uhh-lt/GermEval2017-Baseline

Model            Feature                                                 Train Acc   Dev Acc
MLP              glove vectors                                           96.54       69.22
MLP              sentiment vectors                                       82.42       67.83
MLP              glove and sentiment vectors                             97.81       70.11
MLP              glove vectors of Adjectives and Adverbs                 82.3        67.43
Merging 2 MLPs   one with glove vector and other with sentiment vector   95.59       70.57
BiLSTM           glove vectors                                           97.22       72.17

Table 4: TaskB Experiments

Model     Feature                     F1 Score
MLP       glove vectors               0.44
MLP       glove & sentiment vectors   0.46
BiLSTM    glove vectors               0.47

Table 5: TaskC Experiments for Category + Sentiment

(c) Aspect-Level Polarity - 2nd place (category - 0.4208, category+sentiment - 0.3485), and (d) Opinion Target Extraction - 1st place (exact matching - 0.2202, overlap matching - 0.2208). For the final submitted test results, we used a BiLSTM for TaskA, an MLP for TaskB, a BiLSTM for TaskC, and a structured perceptron for TaskD. All results are reported in terms of micro F1-scores.8

5 Performance Analysis

Training an MLP model is much faster than training a bi-LSTM model. The reason could be the memory component associated with LSTMs. The performance evaluation metrics are shown in Table 8. Due to the memory limitations of our system, we could not test bigger batch sizes.

8 All F1-scores refer to the micro F1-scores.


Model                   Feature                                    F1 Score
CRF                     word features − length features            0.35
Structured Perceptron   word features − length features            0.38
CRF                     word features + length features            0.42
Structured Perceptron   word features + length features            0.43
CRF                     word features + length features + cw=3     0.46
Structured Perceptron   word features + length features + cw=3     0.47
CRF                     word features + length features + cw=5     0.41
Structured Perceptron   word features + length features + cw=5     0.42

Table 6: TaskD Experiments

6 Error Analysis

We could observe from the results that adding sentiment features to the network did not improve the dev-set accuracy. We therefore did not consider this feature in our final model. Furthermore, the POS tag of a word does not help in the identification of its polarity. In addition, there was error propagation due to errors caused by the spaCy POS tagger on social media text. Hence, POS tags as features have been ruled out in this scenario.

Bidirectional LSTMs are able to model the input sequences better and therefore resulted in higher classification accuracy than MLPs. The MLPs' performance is relatively low due to the fact that, unlike for LSTMs (Hochreiter and Schmidhuber, 1997), there is no positional information for an input sequence. The structured perceptron and CRFs perform similarly for TaskD. POS tags have been ignored as features, as mentioned above.

SubTask                          BaseLine   Our System
Task1                            0.88       0.91
Task2                            0.71       0.71
Task3 - only categories          0.61       0.56
Task3 - categories & sentiment   0.49       0.47
Task4 - exact matching           0.2        0.46
Task4 - overlapping matching     0.28       0.49

Table 7: Comparison of the baseline system and our system (micro F1-scores)

We could observe that a word along with its immediate neighbors plays an important role in the identification of the target words. A target can consist of a single word or multiple words. The number of words in a multi-word target is either three or four. Therefore, a context window of length 3 performs better than a context window of length 5. The errors in the MLP and bi-LSTM models are due to data sparsity. Better vector representations can be learnt with more data.

In the example sentence "Von den horrenden Preisen und verwirrenden Zonen mal ganz zu schweigen." ("Not to mention the horrendous prices and confusing zones."), there are two aspects, "Preisen" and "Zonen". The system was unable to predict them as such. The training data has several instances of "Zonen", but none of them are target words, nor do they have any sentiment words in their context. Hence, the system failed to identify this instance. "Preisen" similarly has several instances in the training data, some of which are aspects and some of which are not. The system incorrectly tags it as a non-aspect. Even in the presence of a sentiment word as its neighbor, a word may not be an aspect in the training data. These are difficult cases for a system to identify and require a better semantic representation of the text.

Model     Avg Training Time Per Epoch (sec)   # Epochs   Batch Size
MLP       3                                   50         128
Bi-LSTM   1800                                20         4

Table 8: Performance Evaluation for Models


7 Conclusion & Future Work

In this paper, we described our systems for all the subtasks. We show that systems can achieve a reasonable accuracy with distributional GloVe vectors, without hand-crafted features. We also explore target word identification with the structured perceptron and CRF using just basic prefix and suffix features. We could achieve comparable accuracies with these minimal features.

It is intuitive that specific words in a text influence its overall sentiment. In the future, we plan to model this by incorporating an attention mechanism coupled with a bi-LSTM model. We plan to use character embeddings for the words whose word embeddings could not be learnt. This might help us improve the overall models and make them robust to minute errors/typos. We also plan to add some linguistic features which are observed in social media text, like the presence of symbols and numbers, in order to improve target word identification. A German lemma dictionary can be used to replace every word in the vocabulary and its morphological variants with its lemma. A sentiment word is known to be present in the neighborhood of a target word. Hence, a sentiment lexicon can also be used to identify the target words.

Acknowledgements

We thank Silpa Kanneganti for her valuable review of and feedback on the paper.

References

Apoorv Agarwal, Boyi Xie, Ilia Vovsha, Owen Rambow, and Rebecca Passonneau. 2011. Sentiment analysis of Twitter data. In Proceedings of the Workshop on Languages in Social Media, pages 30–38. Association for Computational Linguistics.

François Chollet et al. 2015. Keras. https://github.com/fchollet/keras.

Michael Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing - Volume 10, pages 1–8. Association for Computational Linguistics.

Alex Graves et al. 2012. Supervised Sequence Labelling with Recurrent Neural Networks, volume 385. Springer.

Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5):602–610.

Hussam Hamdan, Patrice Bellot, and Frédéric Béchet. 2015. Lsislif: CRF and logistic regression for opinion target extraction and sentiment polarity analysis. In SemEval@NAACL-HLT, pages 753–758.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Niklas Jakob and Iryna Gurevych. 2010. Extracting opinion targets in a single- and cross-domain setting with conditional random fields. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1035–1045. Association for Computational Linguistics.

Jason S. Kessler and Nicolas Nicolov. 2009. Targeting sentiment expressions through supervised ranking of linguistic configurations. In Third International AAAI Conference on Weblogs and Social Media, ICWSM, pages 90–97.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. Proceedings of the 10th Machine Translation Summit (MT Summit), pages 79–86.

John Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. Proceedings of the 18th International Conference on Machine Learning, ICML-2001, pages 282–289.

Saeedeh Momtazi. 2012. Fine-grained German sentiment analysis on social media. In LREC, pages 1215–1220.

Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up?: Sentiment classification using machine learning techniques. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing - Volume 10, pages 79–86. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP, volume 14, pages 1532–1543.

Qiao Qian, Minlie Huang, and Xiaoyan Zhu. 2016. Linguistically regularized LSTMs for sentiment classification. arXiv preprint arXiv:1611.03949.

Richard Socher, Alex Perelygin, Jean Y. Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, Christopher Potts, et al. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), volume 1631, page 1642.

Manfred Stede. 2004. The Potsdam commentary corpus. In Proceedings of the 2004 ACL Workshop on Discourse Annotation, pages 96–102. Association for Computational Linguistics.

Peter D. Turney and Michael L. Littman. 2003. Measuring praise and criticism: Inference of semantic orientation from association. ACM Transactions on Information Systems (TOIS), 21(4):315–346.

Andrew J. Viterbi. 1998. An intuitive justification and a simplified implementation of the MAP decoder for convolutional codes. IEEE Journal on Selected Areas in Communications, 16(2):260–264.

Ulli Waltinger. 2010. GermanPolarityClues: A lexical resource for German sentiment analysis. In LREC.

Michael Wojatzki, Eugen Ruppert, Sarah Holschneider, Torsten Zesch, and Chris Biemann. 2017. GermEval 2017: Shared Task on Aspect-based Sentiment in Social Media Customer Feedback. In Proceedings of the GSCL GermEval Shared Task on Aspect-based Sentiment in Social Media Customer Feedback.


IDS IUCL: Investigating Feature Selection and Oversampling for GermEval2017

Zeeshan Ali Sayyed†, Daniel Dakota†‡, Sandra Kübler†

†Indiana University, ‡Institut für Deutsche Sprache
[email protected], [email protected], [email protected]

Abstract

We present the IDS IUCL contribution to the GermEval 2017 shared task on "Aspect-based Sentiment in Social Media Customer Feedback". We chose to compete in both subtasks A & B. Our system focuses on handling the imbalance in the data sets by means of feature selection and oversampling of the minority classes. We achieve 0.916 micro-F for relevance (task A) and 0.781 for polarity (task B) on the development set. For task A, we reach the best scores among the submitted systems; for task B, the 5th best result for the synchronic test set and the best result for the diachronic test set.

1 Introduction

Sentiment analysis is a field that has seen much attention in recent years, but most of the research concentrates on English. In comparison to English, there exists little work that has focused on sentiment-related problems for German. One of the primary issues has been a lack of resources. However, with the creation of German-specific resources, such as the Multi-Layered Reference Corpus for German Sentiment Analysis (Clematide et al., 2012), there has been a growing interest in various areas related to semantics and sentiment analysis with a German-specific focus. This has resulted in shared tasks on Named Entity Recognition (Benikova et al., 2014), lexical substitution (Miller et al., 2015), and two tasks on extracting subjective expressions, sources, and targets in parliamentary speeches (Ruppenhofer et al., 2014; Ruppenhofer et al., 2016).

In the current paper, we describe the IDS IUCL system that participated in the GermEval 2017 shared task on "Aspect-based Sentiment in Social Media Customer Feedback" (Wojatzki et al., 2017). We chose to compete in subtasks A & B. Our approach uses both word and character n-grams as features, in combination with feature selection based on Information Gain (IG) and L1 regularization (Lasso). Sampling is then used with Gradient Boosting to create a final classifier. We achieve 0.916 micro-F for relevance (task A) and 0.781 for polarity (task B) on the development set. On the official test sets, we reach micro-F-scores of 0.903 and 0.906 on the synchronic and diachronic test sets, respectively, for task A, and 0.734 and 0.750 on the synchronic and diachronic test sets for task B. For task A, we reach the best scores among the submitted systems; for task B, the 5th best result for the synchronic test set and the best result for the diachronic test set.

2 Related Work

Perhaps the shared task most closely related to the current one is the sentiment analysis task on Twitter for English (Nakov et al., 2013). Submitted classifiers were predominantly either SVMs or Naive Bayes using a wide range of features. Work on polarity in German sentiment is still a relatively new area, with most of the current approaches profiting from multilingual resources: Denecke (2008) shows how German documents can be translated into English, with positive or negative sentiment analysis performed using English resources on the translated texts, achieving reliable performance. This strategy is also seen in work by Momtazi (2012), who automatically translated an English opinion lexicon into German and assigned fine-grained scores to specific words. The resulting approach is rule-based; it outperforms both an SVM and a Naive Bayes approach at detecting positive or negative sentiment in social media posts about German celebrities. In both of the above cases, however, the systems utilize existing resources for English and attempt to transfer this knowledge to German.


          true    false    total
train   16 200    3 231   19 431
dev      1 931      437    2 368
total   18 131    3 668   21 799

Table 1: Distribution of class labels in the relevance task

        negative  neutral  positive   total
train      5 045   13 208     1 178  19 431
dev          589    1 631       148   2 368
total      5 634   14 838     1 326  21 799

Table 2: Distribution of class labels in the polarity task

Although this approach has been shown to be feasible, such strategies require many assumptions to be made about the 1:1 correspondence of meaning between languages, which is often not the case. As a consequence, language-specific nuances cannot be captured. For this reason, we have decided to work with a knowledge-poor approach, focusing on automatically extracting reliable features directly from the training data.

3 Methodology

3.1 Dataset

There are 21 799 samples in the dataset provided by the shared task. For system development, we divide the dataset into a training and a development set for each subtask, using stratified sampling. The training set consists of 19 431 samples (≈ 89%), whereas the development set consists of 2 368 samples (≈ 11%).

The distributions of class labels in the training and development set for the relevance task are shown in Table 1, the ones for the polarity task in Table 2. The distribution of class labels clearly shows that the classes are imbalanced. For this reason, we use feature selection and sampling methods that help alleviate the problems for machine learning associated with imbalanced data (see Sections 3.4 and 3.6), following Kübler et al. (2017), who show the effectiveness of feature selection with information gain for multiclass sentiment analysis in English.

3.2 Data Preprocessing

Since the dataset consists of social media data, the text shows many typical characteristics of this genre. In order to extract maximal information from the comments, we process hashtags and handles, as well as URLs. We remove all #s and @s from the tweets, leaving just the attached characters (e.g., #Bahn becomes Bahn).

We also replace all identifiable URLs (e.g., http or www) by a single URL token. We do not normalize capitalization due to the role it plays in German orthography. We also do not remove punctuation, based on the assumption that punctuation in social media data often carries sentiment. Repeated punctuation, for example, serves as an intensifier. We also assume that if punctuation is not relevant, feature selection will delete those features.
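
As an illustration, the preprocessing described above can be sketched in a few lines of Python; the function name and the URL placeholder token are our own choices, not prescribed by the paper.

import re

URL_RE = re.compile(r'(https?://\S+|www\.\S+)')

def normalize(text):
    text = URL_RE.sub('URL', text)             # replace identifiable URLs by a single token
    text = re.sub(r'[#@](\w+)', r'\1', text)   # strip # and @, keep the attached characters
    return text                                # capitalization and punctuation are kept

print(normalize('@db_bahn Der #Zug ist zu spät!!! http://t.co/xyz'))
# -> 'db_bahn Der Zug ist zu spät!!! URL'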

3.3 Feature representation

We extract features from the training set using the standard bag-of-words as well as bag-of-characters approach. We use word unigrams, bigrams, and trigrams, as well as character n-grams ranging from three to eight characters in length. This results in a total of 1 099 714 features, clearly showing the necessity for feature selection.
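
A minimal sketch of this feature extraction with scikit-learn, assuming docs holds the training documents; the exact vectorizer settings are our reconstruction of the description above.

from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

docs = ['Die Bahn ist heute wieder zu spät', 'Danke für die Info']
word_ngrams = CountVectorizer(analyzer='word', ngram_range=(1, 3))
char_ngrams = CountVectorizer(analyzer='char', ngram_range=(3, 8))

# bag-of-words and bag-of-characters matrices, concatenated column-wise
X = hstack([word_ngrams.fit_transform(docs), char_ngrams.fit_transform(docs)])
print(X.shape)  # on the full training set this grows to roughly 1.1M columns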

3.4 Feature selection

Before any sophisticated feature selection step, frequency thresholding is necessary on this feature set to remove infrequent features. Such features are potentially noisy and misleading. Thresholding also prevents the classifier from overfitting. We test two different thresholds: The first threshold removes all features occurring ≤ 50 times in the entire training set, the second threshold is set to 100. The choice of threshold has consequences for the size of the feature set: The threshold of 50 reduces the number of features to 18 722, whereas a threshold of 100 reduces it to 11 292.

Kübler et al. (2017) show that feature selection using multiclass IG is effective for multiclass sentiment analysis in English when there exists a high imbalance in the classes. Hence, we use multiclass IG for reducing the number of features. Moreover, for large datasets and exponentially many irrelevant features, logistic regression with L1 regularization (Lasso) has been shown to perform successfully (Ng, 2004). Hence, we perform both feature selection techniques on our dataset and then combine the feature sets contributed by the two techniques. We use the implementation available from scikit-learn1 for logistic regression with L1 regularization. We use Lasso with default parameters.

We choose the top 2 000 features from multiclass IG, along with the features with non-zero weights from Lasso. When combining the two feature sets, we eliminate duplicates.

1http://scikit-learn.org/stable/


Task        Thresh.    IG     Lasso  Combined
relevance     50      2 000    434    2 162
(task A)     100      2 000    366    2 154
polarity      50      2 000    736    2 400
(task B)     100      2 000    652    2 372

Table 3: Numbers of features chosen by the different feature selection methods using different thresholds

The distributions of the resulting feature sets for the two tasks are shown in Table 3. Obviously, Lasso adds considerably fewer features than IG. However, these features have a positive effect on system performance (see below). Additionally, we see that there are between 200 and 350 features that are selected by both algorithms. The fact that this number is low shows that the feature selection methods have different biases, and a combination can thus be useful.
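
The combination step can be sketched as follows; scikit-learn has no multiclass IG scorer built in, so mutual_info_classif serves as a stand-in for IG here, and the variable names are ours.

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression

def select_features(X, y, k=2000):
    # top-k features according to (an approximation of) information gain
    ig = SelectKBest(mutual_info_classif, k=k).fit(X, y)
    ig_idx = set(np.where(ig.get_support())[0])
    # L1-regularized logistic regression (Lasso) with default parameters
    lasso = LogisticRegression(penalty='l1', solver='liblinear').fit(X, y)
    lasso_idx = set(np.nonzero(np.any(lasso.coef_ != 0, axis=0))[0])
    # the union of the two index sets eliminates duplicates
    return sorted(ig_idx | lasso_idx)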

3.5 Classifier

The official baseline for subtasks A and B consists of an SVM classifier using term frequency features and an external German sentiment lexicon, which results in 38 241 features. We performed initial experiments using an SVM and a random forest classifier with the top 2 000 features returned by IG, reaching suboptimal results below the official baseline. In contrast, regularized gradient boosted trees (Chen and Guestrin, 2016) perform well on the problem. We use the implementation available in the XGboost library.2 This implementation is known for its scalability and robustness given sparse data. The dataset created by the n-gram models is large but very sparse, and hence shows the characteristics for which extreme gradient boosted machines are well suited. The parameters of the algorithm are tuned to combat the imbalance in the dataset. We performed a non-exhaustive search for parameter settings, optimizing performance on the development set.
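
A rough sketch of this training step with the XGBoost library; the parameter values below are illustrative placeholders, not the tuned settings found by the non-exhaustive search, and toy random data stands in for the n-gram matrix.

import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X_train, y_train = rng.random((200, 30)), rng.integers(0, 2, 200)

clf = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1,
                    scale_pos_weight=5)  # one knob for combating class imbalance
clf.fit(X_train, y_train)
print(clf.predict(X_train[:5]))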

3.6 Sampling

In order to handle the problem of class imbalance, we perform oversampling of the minority classes, using adaptive synthetic sampling (He et al., 2008) in the implementation provided by the imbalanced-learn toolkit (Lemaître et al., 2016).

2https://github.com/dmlc/xgboost

Task           System     synchronic  diachronic
relevance (A)  IDS IUCL      0.903       0.906
               fhdo          0.899       0.897
polarity (B)   IDS IUCL      0.734       0.750
               fhdo          0.748       0.742

Table 4: Official results (micro-averaged F1) on the test sets

The standard method for oversampling minority examples is the synthetic minority oversampling technique (SMOTE), which generates artificial samples based on feature space similarities between the existing samples of the minority class. This method can create good synthetic samples if the features are meaningful and densely populated in the search space. However, our feature matrix is extremely sparse. Additionally, SMOTE runs the risk of overgeneralization because it generates the same number of synthetic samples for every minority class sample without regard to its neighboring samples. Thus, we chose adaptive synthetic sampling, which mitigates these problems by using a weighted distribution over the minority samples: more data is generated for minority samples that are harder to classify and less for those that are easy.
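
A small sketch of this step with imbalanced-learn, on toy data mimicking the roughly 70/24/6 split of the polarity task; all names are ours.

from collections import Counter
from imblearn.over_sampling import ADASYN
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, n_classes=3, n_informative=5,
                           weights=[0.70, 0.24, 0.06], random_state=42)
X_res, y_res = ADASYN(random_state=42).fit_resample(X, y)
print(Counter(y), Counter(y_res))  # the minority classes are synthetically grown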

4 Results

4.1 Official Results

Table 4 shows the official results of our system compared to the next-best and best performing systems respectively. Note that we optimized the settings for task A but, because of time constraints, used the same settings for task B without further optimization.

Our system reaches the highest results on both test sets for task A. For the synchronic test set, the results are very close to the next system; for the diachronic test set, the difference is more pronounced. For task B, we reach the highest results for the diachronic test set, but only the fifth best results on the synchronic test set. The results show that the combination of feature selection and minority oversampling provides us with a robust system that can handle shifts in topics over time.

4.2 Results on the Development Sets

We show the results of the IDS IUCL system on the development set in Table 5. We report the official baseline of the shared task as well as our own baseline, established using the 2 000 features obtained from IG and the SVM classifier.


Task           off. baseline  SVM baseline  XGboost-50  XGboost-100  XGboost-50+sampling
relevance (A)      0.882          0.815        0.916       0.915           0.916
polarity (B)       0.709          0.689        0.781       0.777           0.763

Table 5: Results (micro-averaged F1) on the development set

Task          No. of Feats.
relevance         2 208
polarity          2 210
intersection        191

Table 6: Number of features per subtask (threshold=50)

Feat. Selection    Features
IG+Lasso Overlap   Bahn, ahn, Zug, Ticket, deshalb, öfter, morgens, hass, mfra, Stre, reik, verk
IG Relevance       offenbar, Rückfahrt, Fenster, zumindest, Erg, Ank, tark, ohnt
IG Polarity        Jahren, Hauptbahnhof, Tarifkonflikt, Klimaanlage, lich, chen, unde, ahme
Lasso Relevance    Bar, Spaß, Fahrer, Bank, ßen, ler, orse, letz
Lasso Polarity     steigen, Kunden, Danke, Stunden, eei, bequ, onnt, tehe

Table 7: Selected overlap and distinct features for each feature selection method (threshold=50)

Our baseline corresponds to a majority classification baseline since the SVM classifier chooses the majority label in all cases, thus showing how important the handling of imbalanced data is. The remaining results are based on XGboost with a threshold of 50 or 100. We also combine the threshold of 50 with the sampling described in Section 3.6.

The results show that for relevance as well as for polarity, the XGboost-50 system outperforms both baselines, by around 3.6 percent absolute and by 7.1 percent absolute respectively. Using the higher threshold results in a minor loss in performance. Sampling, in contrast, shows very different effects in the two tasks: For relevance, it does not change the results, but for polarity, sampling results in a loss of almost 2 percent absolute. The reason for this behavior requires further analysis.

5 Further Analysis

5.1 Feature Analysis

We manually examine the selected features to see what types of unique features are returned by IG and Lasso for both relevance and polarity (threshold=50, no sampling).

First, we examine the overlapping features between the relevance and the polarity tasks. In total, there are 191 overlapping features between the two tasks. The first row of Table 7 shows examples of such overlapping features. Many of these features are to be expected, e.g., Bahn (Eng. train) and substrings of words such as Stre, most likely deriving from Streik (Eng. strike) and/or Strecke (Eng. route). The minimal overlap is interesting in and of itself, given that the base feature set is the same for both tasks. Additionally, there is a sizable number of long n-grams with overlapping substrings between the tasks. I.e., the tasks select similar information, but with a slightly different emphasis. For example, die Strecke (Eng. the route [nom/acc]) is selected for relevance but der Strecke (Eng. the route [gen/dat]) is selected for polarity. Such differences may be caused by data sparsity or by minimally different feature selection values. However, in our case, those features are not close to the IG cut-off. Thus, they must be indicative of certain structures or contexts that are associated more strongly with one specific task but not the other. This can be seen when examining some of the selected n-grams, which seem to benefit a particular subtask. For example, schneller (Eng. faster) is only selected for the polarity task, but not for the relevance task, while Rückfahrt (Eng. return trip) is selected for relevance. This can be interpreted assuming that being fast is good for a train, thus evoking sentiment, but it may not help with relevance. A return trip, in contrast, is a more general term that may be more helpful in determining relevance, but it does not evoke sentiment.

We also examine the features chosen only by IG or only by Lasso in Table 7. For the polarity n-grams, there are recognizable features, such as Tarifkonflikt (Eng. pay dispute). For all methods, one can identify possible words underlying the returned character-level n-grams, although they may not be initially intuitive. However, the main impression is that IG selects common features while Lasso seems to prefer less common features that are highly indicative of a class.


                     micro-F  macro-F  Precision        Recall
Classifier                             false    true    false    true
XGboost-50            0.916    0.847   0.852   0.927    0.659   0.974
XGboost-100           0.914    0.838   0.890   0.918    0.613   0.983
XGboost-50+sampling   0.916    0.847   0.850   0.927    0.661   0.974

Table 8: Additional results for relevance task

                     micro-F  macro-F  Precision              Recall
Classifier                             Neg.   Neutral  Pos.   Neg.   Neutral  Pos.
XGboost-50            0.781    0.625   0.774  0.784    0.732  0.413  0.953    0.351
XGboost-100           0.777    0.615   0.767  0.779    0.758  0.392  0.956    0.338
XGboost-50+sampling   0.777    0.620   0.760  0.780    0.785  0.397  0.954    0.345

Table 9: Additional results for the polarity task

5.2 Alternative Evaluation Metrics

We also have a closer look at the performance of our classifiers using macro-F along with micro-F. Macro-F first calculates the F-score per class and then averages over the classes, which gives equal weight to each class rather than to each example, thus making minority classes more prominent in the evaluation. We also report precision and recall for the individual classes. Table 8 shows the results for the relevance task, Table 9 for polarity. These results show that the higher threshold results in higher precision for the minority classes (false for relevance and positive for polarity). The oversampling method reaches the highest precision for the minority class for polarity. Further investigation is required to determine why this effect is not evident in the relevance task.
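
The difference between the two averages is easy to see on a toy example where the single positive instance is missed; the numbers below are computed with scikit-learn's f1_score.

from sklearn.metrics import f1_score

gold = ['neu'] * 8 + ['neg', 'pos']
pred = ['neu'] * 8 + ['neg', 'neu']   # the one positive instance is missed

print(f1_score(gold, pred, average='micro'))  # 0.90  - equal weight per example
print(f1_score(gold, pred, average='macro'))  # ~0.65 - equal weight per class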

6 Conclusion

The IDS IUCL system reaches good results when using XGboost, while first experiments with SVM or random forest classifiers resulted in majority classification. We show that feature selection is extremely important for sentiment analysis tasks with an imbalanced class distribution. We reach our best results with a feature threshold of 50; a higher threshold results in higher precision for the minority class. Oversampling of minority class examples increases precision in the polarity task.

References

Darina Benikova, Chris Biemann, Max Kisselew, and Sebastian Padó. 2014. GermEval 2014 Named Entity Recognition shared task: Companion paper. In KONVENS 2014, pages 104–113, Hildesheim, Germany.

Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794.

Simon Clematide, Stefan Gindl, Manfred Klenner, Stefanos Petrakis, Robert Remus, Josef Ruppenhofer, Ulli Waltinger, and Michael Wiegand. 2012. MLSA – a multi-layered reference corpus for German sentiment analysis. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC), pages 3551–3556, Marrakech, Morocco.

Kerstin Denecke. 2008. Using SentiWordNet for multilingual sentiment analysis. In Data Engineering Workshop 2008. ICDE 2008 IEEE 24th International Conference on Data Engineering Workshops, pages 507–512, Cancun, Mexico.

Haibo He, Yang Bai, Edwardo A. Garcia, and Shutao Li. 2008. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In Neural Networks, 2008: IEEE World Congress on Computational Intelligence (IJCNN), pages 1322–1328.

Sandra Kübler, Can Liu, and Zeeshan Ali Sayyed. 2017. To use or not to use: Feature selection for sentiment analysis of highly imbalanced data. Natural Language Engineering.

Guillaume Lemaître, Fernando Nogueira, and Christos K. Aridas. 2016. Imbalanced-learn: A Python toolbox to tackle the curse of imbalanced datasets in machine learning. CoRR, abs/1609.06570.

Tristan Miller, Darina Benikova, and Sallam Abualhaija. 2015. GermEval 2015: LexSub – A shared task for German-language lexical substitution. In Proceedings of GermEval 2015: LexSub, pages 1–9, Duisburg, Germany.

Saeedeh Momtazi. 2012. Fine-grained German sentiment analysis on social media. In Proceedings of LREC’12, pages 1215–1220, Istanbul, Turkey.


Preslav Nakov, Sara Rosenthal, Zornitsa Kozareva, Veselin Stoyanov, Alan Ritter, and Theresa Wilson. 2013. SemEval-2013 Task 2: Sentiment analysis in Twitter. In Proceedings of the Seventh International Workshop on Semantic Evaluation, pages 312–320, Atlanta, Georgia, USA.

Andrew Ng. 2004. Feature selection, L1 vs. L2 regularization, and rotational invariance. In Proceedings of the Twenty-First International Conference on Machine Learning, page 78.

Josef Ruppenhofer, Roman Klinger, Julia Maria Struß, Jonathan Sonntag, and Michael Wiegand. 2014. IGGSA shared tasks on German sentiment analysis (GESTALT). In Workshop Proceedings of the 12th Edition of the KONVENS Conference, pages 164–173, Hildesheim, Germany.

Josef Ruppenhofer, Julia Maria Struß, and Michael Wiegand. 2016. Overview of the IGGSA 2016 shared task on source and target extraction from political speeches. In Proceedings of the IGGSA Shared Task 2016 Workshop, pages 1–9, Bochum, Germany.

Michael Wojatzki, Eugen Ruppert, Sarah Holschneider, Torsten Zesch, and Chris Biemann. 2017. GermEval 2017: Shared Task on Aspect-based Sentiment in Social Media Customer Feedback. In Proceedings of the GSCL GermEval Shared Task on Aspect-based Sentiment in Social Media Customer Feedback, Berlin, Germany.


Proceedings of the GermEval 2017 – Shared Task on Aspect-based Sentiment in Social Media Customer Feedback, pages 49–54, Berlin, Germany, September 12, 2017.

PotTS at GermEval-2017 Task B: Document-Level Polarity Detection Using Hand-Crafted SVM and Deep Bidirectional LSTM Network

Uladzimir Sidarenka
Applied Computational Linguistics
UFS Cognitive Science
University of Potsdam / Germany
[email protected]

Abstract

This paper presents a hybrid approach to document-level polarity classification of social media (SM) posts. In contrast to previous related work, which relied on sentiment classifiers with either manually designed or automatically induced features, our system simultaneously harnesses both of these options, comprising an SVM module with user-specified attributes and a bidirectional LSTM network whose input embeddings are learned completely automatically. For prediction, we unite the decisions of both of these classifiers into a single score vector by taking the sum of their probability estimates and choosing the class with the highest joint value. We evaluate our ensemble on subtask B of the GermEval-2017 shared task competition (Wojatzki et al., 2017), obtaining third place among all competing systems and reaching 0.745 and 0.718 micro-averaged F1 on timestamps one and two of this dataset respectively.1

1 Introduction

With the ever-increasing role of social media in people's everyday life, web forums, Facebook walls, and Twitter threads gradually become our primary channels for staying in touch with friends, sharing important information, or just expressing our personal opinions and beliefs. The last purpose is of particular interest to many private companies and organizations, as it turns social networks into an invaluable source of critical market information, providing deeper insights into the general sales atmosphere and revealing wishes, complaints, and preferences of particular customer groups.

1The source code of our implementation is available online at https://github.com/WladimirSidorenko/CGSA

Unfortunately, manually analyzing the avalanche of users' posts can hardly be done rapidly, let alone in a flash. Consequently, any serious economic endeavor nowadays depends on the existence of a reliable automatic sentiment analysis tool. To meet this need, a plethora of different rule-based, machine- and deep-learning methods have been proposed in the literature over the last few decades (Pang and Lee, 2008; Liu, 2012), with each of them claiming superior results in comparison to its predecessors.

In this paper, we investigate whether the recent “tornado”-like popularity of deep neural networks for coarse-grained sentiment analysis (Manning, 2015) is indeed backed by empirical results and whether these systems can actually outperform traditional machine-learning methods, which are commonly considered the previous state of the art for this task. Furthermore, since these techniques do not necessarily need to be in contradiction, we also check whether uniting both approaches into a single ensemble can further improve the quality of the analysis.

We perform our experiments on Subtask B (document-level polarity prediction) of GermEval-2017 (Wojatzki et al., 2017), presenting its dataset in Section 2 of this paper. After briefly describing the initial preparation steps in Section 3, we present both methods (SVM and bidirectional LSTM), sketching their features, training mode, and architecture. In the final step, we estimate the effects of different feature groups and hyperparameters on the net results of these systems, concluding and summarizing our findings in the last part of this submission.

2 Data

To train and evaluate our approaches, we downloaded the training and development sets of the GermEval-2017 shared task competition (Wojatzki et al., 2017), obtaining a total of 23,525 messages.


A detailed breakdown of the number of posts and class distributions in each of these sets is provided in Table 1.

Dataset           Total    Neutral  Negative  Positive
Training Set      20,941   14,497   5,228     1,216
Development Set    2,584    1,812     617       155

Table 1: Class distribution in the GermEval dataset.

As we can see from the statistics, the neutral class distinctly dominates the corpus, making up 70% of the training instances. Negative posts form the second most frequent group, accounting for ≈ 24% of the data, while the positive class represents an absolute minority of the dataset, constituting only 6% of all cases.

Besides the conspicuous imbalance of the polarity classes, an additional challenge of this shared task is posed by the heterogeneity of the provided content: Instead of concentrating on just one popular social channel, as was done, for example, in similar shared tasks (Nakov et al., 2013; Rosenthal et al., 2014; Rosenthal et al., 2015), the organizers covered a wide range of possible domains, providing data from 2,291 different sources.

3 Preprocessing

We preprocessed all retrieved messages using the rule-based normalization procedure of Sidarenka et al. (2013). In particular, during this step, we split each user post into sentences and tokens, restored some common colloquial spelling variants (e.g., “kannste” → “kannst du”, “laaaangen” → “langen”, “vlt” → “vielleicht”), and replaced frequent SM-specific entities (@-mentions, links, e-mail addresses, and emoticons) with unique artificial tokens representing their lexical class (“%User”, “%Link”, “%Mail”, “%PosSmiley”, “%NegSmiley”, etc.). A sample output of this normalization is shown in Example 3.1.

Example 3.1
Original: Wochenende! Und der Telekom-Hotspot performt wieder wie tote Füße . . . (@DBLounge - @db bahn) https://t.co/dlIMnKsnLh https://t.co/JYoy3OPbeU

Normalized: Wochenende ! Und der Telekom - Hotspot performt wieder wie tote Füsse ... ( %User - %User ) %Link

In the final step of this procedure, we obtained lemmas and part-of-speech tags of the analyzed tokens using the TreeTagger of Schmid (1995).

4 Method

Once the data were preprocessed, we trained two independent classification systems and united their output at the end into a single ensemble.

4.1 SVM

The first of these systems, a multi-class SVM classifier, was largely inspired by the work of Mohammad et al. (2013). Similarly to these authors, we used a wide variety of different features, which, for simplicity, can be grouped as follows:

• punctuation features, which included the number of repeated exclamation and question marks as well as the count of contiguous sequences of exclamation and question characters;

• character-level features, which comprised character n-grams of length three to five as well as the number of capitalized words and words with elongated vowels (e.g., “sooooo”, “guuuut”, etc.);

• word-level features, which included contiguous and non-contiguous (i.e., with one of the tokens replaced by a wildcard) sequences of one to four tokens as well as the counts of user mentions, emoticons, and hashtags found in the message;

• part-of-speech features, which reflected the number of occurrences of each particular part-of-speech tag in the analyzed instance;

• lexicon features, which, similarly to Mohammad et al. (2013), were subdivided into manual and automatic ones;

– manual lexicon features were estimated using the SentiWS lexicon of Remus et al. (2010) and the Zurich Polarity List of Clematide and Klenner (2010). For each of these resources and for each of the non-neutral polarity classes, we computed the total sum of the lexicon scores for all message tokens and also separately calculated these statistics for each particular part-of-speech tag, considering them as additional attributes;


– automatic lexicon features were computed using several automatically induced polarity lists. In particular, we re-implemented the dictionary-based approaches of Blair-Goldensohn et al. (2008), Hu and Liu (2004), and Kim and Hovy (2004), applying these systems to GermaNet 9.0 (Hamp and Feldweg, 1997). Another sentiment lexicon was produced using the Ising-spin method of Takamura et al. (2005), for which we again used GermaNet 9.0, extending its data with corpus collocations gathered from the German Twitter snapshot (Scheffler, 2014). Moreover, we also came up with two additional solutions of our own, in which we derived new polarity lists by performing k-NN and projection-based clustering of word2vec embeddings that had been pretrained on the aforementioned snapshot corpus. For each of these resources and for each of the two polarity classes (positive and negative), we produced four features representing the number of tokens with non-zero scores, the sum and the maximum of all respective lexicon values for all message tokens, and the score of the last term.

To overcome the rigidness of linear decisions in SVM (i.e., the inability of a standard SVM to draw a boundary between linearly inseparable classes), we split all automatic lexicon features into deciles based on the total range of their observed values and replaced these attributes with binary features reflecting the respective decile of their original scores. For example, if a training instance had a feature called hu-liu-positive-sum with the value 15.7, which belonged to the third decile of all seen scores for this attribute, we replaced this feature with the binary attribute hu-liu-positive-sum-3, setting its value to 1. This way, we allowed the separating hyperplane of the SVM to be more flexible by having different slopes in different value regions of basically the same attributes.
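
A minimal sketch of this decile binarization; the feature name follows the example above, but the implementation details are our own.

import numpy as np

def decile_featurizer(name, observed_train_values):
    # decile boundaries estimated from all scores seen in training
    edges = np.quantile(observed_train_values, np.linspace(0.1, 0.9, 9))
    def featurize(value):
        decile = int(np.searchsorted(edges, value)) + 1   # 1..10
        return {f'{name}-{decile}': 1}                    # binary indicator feature
    return featurize

feat = decile_featurizer('hu-liu-positive-sum', np.random.rand(1000) * 30)
print(feat(15.7))  # e.g. {'hu-liu-positive-sum-6': 1}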

In the last step, we determined the optimal hyperparameter settings (slack variable C) for the classifier using grid search with five-fold cross-validation over both training and development data. Afterwards, while preparing the final submission for GermEval, we retrained the system once again on the complete dataset, keeping the slack constant C fixed to its best-performing value.

4.2 Bidirectional LSTM

In addition to the SVM system, we also trained a deep multi-layer bidirectional recurrent neural network. In particular, we used the usual LSTM recurrence (Hochreiter and Schmidhuber, 1997), which took word embeddings ($[\vec{w}_1; \ldots; \vec{w}_L] \in \mathbb{R}^{L \times 300}$, where $L$ is the length of the training instance) as input. Following the usual practices for defining such networks, we ran one of the loops from left to right, obtaining an output vector $\vec{o}_t^{\,\rightarrow} \in \mathbb{R}^{16}$ for each sequence position $t$, and let another loop work from right to left, getting a vector $\vec{o}_t^{\,\leftarrow} \in \mathbb{R}^{16}$ for each position. We concatenated the outputs of both recurrences into a single intermediate tensor $O \in \mathbb{R}^{L \times 16 \times 2}$ and ran a third LSTM loop over its leading dimension ($L$). Eventually, we multiplied the output $\vec{o}_L$ of this third recurrence at the final step $L$ with a linear transform matrix $W \in \mathbb{R}^{3 \times 16}$, adding a bias term $\vec{b} \in \mathbb{R}^3$ and applying the softmax non-linearity $\sigma$ to this product:

$$\vec{y} = \sigma(W \cdot \vec{o}_L + \vec{b}). \qquad (1)$$

We set the initial values of all trained parameters (word embeddings, recurrences, bias terms, and transform matrices) using normal He initialization (He et al., 2015), training the network for five epochs and picking the model that maximized the accuracy on a randomly chosen held-out set. Due to the high imbalance of the target classes (with neutral instances accounting for almost 70% of the training data), we used the categorical hinge loss as the primary objective function and optimized it using RmsProp (Tieleman and Hinton, 2012) to speed up convergence. Last but not least, to prevent overfitting to the training set, we applied Bayesian dropout (Gal and Ghahramani, 2015) to the recurrence loops, setting the binomial probability of randomly dropping a neuron to 0.2.
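
The described architecture can be sketched in Keras as follows; the layer sizes follow the paper, but this is our reconstruction under stated assumptions, and the concatenated per-position outputs are represented as a flat (L, 32) tensor rather than an explicit L x 16 x 2 one.

from tensorflow.keras import layers, models

VOCAB, EMB_DIM, HIDDEN, N_CLASSES, MAX_LEN = 20000, 300, 16, 3, 80  # assumed sizes

inp = layers.Input(shape=(MAX_LEN,))
emb = layers.Embedding(VOCAB, EMB_DIM)(inp)        # [w_1; ...; w_L]
# left-to-right and right-to-left recurrences, concatenated per position;
# recurrent_dropout approximates the variational (Bayesian) dropout with p=0.2
bi = layers.Bidirectional(
    layers.LSTM(HIDDEN, return_sequences=True, recurrent_dropout=0.2))(emb)
o_L = layers.LSTM(HIDDEN, recurrent_dropout=0.2)(bi)      # third LSTM, final state o_L
out = layers.Dense(N_CLASSES, activation='softmax')(o_L)  # y = softmax(W o_L + b)

model = models.Model(inp, out)
model.compile(optimizer='rmsprop', loss='categorical_hinge')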

4.3 Ensemble

To unite the results of both classifiers, we obtained the decision scores from the SVM system. Since these scores, however, reflected distances to the separating hyperplanes and could easily outweigh the results of the LSTM module (whose output was guaranteed to be in the range $[0, 1]$), we also applied the softmax function to these values.


Afterwards, we summed both vectors (the modified SVM output and the $\vec{y}$ vector from Equation 1) and chose the class with the maximum total value as the final decision.
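
A compact sketch of this fusion step; the function names are ours, and the inputs are assumed to be the SVM decision scores and the BiLSTM softmax outputs, both of shape (n_documents, 3).

import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ensemble_predict(svm_scores, lstm_probs):
    # squash the unbounded SVM margins so they cannot outweigh the LSTM
    joint = softmax(svm_scores) + lstm_probs
    return joint.argmax(axis=1)             # class with the highest joint value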

5 Evaluation

The results of these systems are shown in Table 2. As we can see from the figures, the SVM approach clearly outperforms the other two options, achieving the best results on both timestamps: one, where it attains 0.752 points, and two, where it gets 0.737 F1. The second-best system is, not surprisingly, the ensemble, whose micro-averaged results, however, are 0.007 and 0.02 points worse than the respective SVM scores. This degradation is mainly due to the bidirectional LSTM component, which, unexpectedly, yields the worst overall performance (getting 0.727 and 0.704 micro-averaged F1).

System         Timestamp 1  Timestamp 2
SVM               0.752        0.737
BiLSTM            0.727        0.704
SVM + BiLSTM      0.745        0.717
Majority          0.655        0.672
Random            0.417        0.403

Table 2: Micro-averaged F1-scores on timestamps 1 and 2 of the GermEval test set for Subtask B.

5.1 Qualitative Analysis

Besides calculating these statistics, we also had a closer look at the particular errors made by these classifiers. As it turned out, most of the mistakes on Timestamp 1 were made in cases where all three classifiers confused negative instances with the neutral class (185 errors). On Timestamp 2, the most frequent type of error (124 mistakes) occurred when the SVM classifier correctly predicted the neutral class, but the BiLSTM method assigned the negative label with a very high probability, outweighing the former approach in the ensemble decision.

In general, the prevalence of the BiLSTM classifier accounted for the vast majority of the ensemble's errors, with a few examples of this behavior provided below.

Example 5.1
Message: Zeugen gesucht: 39-Jähriger wurde in S-Bahn von dunkelhäutigen Männern betäubt und überfallen - B.Z. Berlin...
(Looking for witnesses: a 39-year-old was drugged and attacked in the S-Bahn by dark-skinned men - B.Z. Berlin...)

Gold: negative
SVM: negative
BiLSTM: neutral
SVM + BiLSTM: neutral

Example 5.2
Message: Der Bahn-Betriebsrat fordert eine schnelle Einigung im Tarifstreit: %Link gdl-streik bahnstreik
(The Bahn works council demands a quick agreement in the pay dispute: %Link gdl-strike bahn-strike)

Gold: neutral
SVM: positive
BiLSTM: negative
SVM + BiLSTM: negative

5.2 Comparison with Subtask A

At the request of one of the reviewers, we also retrained our system on Subtask A of GermEval-2017. In this task, we had to predict whether particular posts contained feedback about “Deutsche Bahn” or not. This way, we hoped to check the generalizability of the proposed methods (i.e., to see whether the same techniques could be applied equally well to other objectives).

System         Timestamp 1  Timestamp 2
SVM               0.816        0.840
BiLSTM            0.865        0.857
SVM + BiLSTM      0.873        0.869
Majority          0.816        0.840
Random            0.496        0.491

Table 3: Micro-averaged F1-scores on timestamps 1 and 2 of the GermEval test set for Subtask A.

As we can see from the results in Table 3, the answer to this question is negative for the SVM classifier, which performs on a par with the majority class baseline. Nevertheless, the BiLSTM approach is quite promising in this regard, yielding much better figures than for the previous subtask. This time, the ensemble approach outperforms all of its single components, mitigating the lower accuracy of the SVM.


6 Summary and Conclusions

Summarizing the above findings, we would like to recap that, in this work, we compared three different approaches to coarse-grained sentiment analysis of social media posts: an SVM with manually specified features, a bidirectional LSTM with automatically induced input representations, and a combination of both. Two of these systems, unfortunately the weaker ones, were part of the official submission of the PotTS team to the GermEval-2017 competition.

We showed that traditional machine-learning techniques still achieve state-of-the-art results, outperforming modern deep-learning methods. Therefore, we would recommend taking newer DL approaches with a grain of salt, always keeping in mind that the results of machine-learning methods might crucially depend on the specifics of the analyzed data and the peculiarities of the task at hand (as we also confirmed in the evaluation section). One potential weakness of these approaches, though, is their weak generalizability, which prevents us from reusing them for other (related) objectives.

Acknowledgments

We thank the anonymous reviewers for their helpful suggestions and comments.

References

Sasha Blair-Goldensohn, Tyler Neylon, Kerry Hannan, George A. Reis, Ryan McDonald, and Jeff Reynar. 2008. Building a sentiment summarizer for local service reviews. In NLP in the Information Explosion Era.

Simon Clematide and Manfred Klenner. 2010. Evaluation and extension of a polarity lexicon for German. In Proceedings of the First Workshop on Computational Approaches to Subjectivity and Sentiment Analysis, pages 7–13.

Yarin Gal and Zoubin Ghahramani. 2015. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. CoRR, abs/1506.02142.

Birgit Hamp and Helmut Feldweg. 1997. GermaNet – a lexical-semantic net for German. In Proceedings of the ACL Workshop on Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications, pages 9–15.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. CoRR, abs/1502.01852.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Won Kim, Ron Kohavi, Johannes Gehrke, and William DuMouchel, editors, KDD, pages 168–177. ACM.

Soo-Min Kim and Eduard H. Hovy. 2004. Determining the sentiment of opinions. In COLING 2004, 20th International Conference on Computational Linguistics, Proceedings of the Conference, 23–27 August 2004, Geneva, Switzerland.

Bing Liu. 2012. Sentiment Analysis and Opinion Mining. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers.

Christopher D. Manning. 2015. Computational linguistics and deep learning. Computational Linguistics, 41(4):701–707.

Saif M. Mohammad, Svetlana Kiritchenko, and Xiaodan Zhu. 2013. NRC-Canada: Building the state-of-the-art in sentiment analysis of tweets. CoRR, abs/1308.6242.

Preslav Nakov, Sara Rosenthal, Zornitsa Kozareva, Veselin Stoyanov, Alan Ritter, and Theresa Wilson. 2013. SemEval-2013 Task 2: Sentiment analysis in Twitter. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 312–320, Atlanta, Georgia, USA. Association for Computational Linguistics.

Bo Pang and Lillian Lee. 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1–135.

Robert Remus, Uwe Quasthoff, and Gerhard Heyer. 2010. SentiWS – A publicly available German-language resource for sentiment analysis. In Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, and Daniel Tapias, editors, Proceedings of the International Conference on Language Resources and Evaluation, LREC 2010, 17–23 May 2010, Valletta, Malta. European Language Resources Association.

Sara Rosenthal, Alan Ritter, Preslav Nakov, and Veselin Stoyanov. 2014. SemEval-2014 Task 9: Sentiment analysis in Twitter. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 73–80, Dublin, Ireland. Association for Computational Linguistics and Dublin City University.

Sara Rosenthal, Preslav Nakov, Svetlana Kiritchenko, Saif Mohammad, Alan Ritter, and Veselin Stoyanov. 2015. SemEval-2015 Task 10: Sentiment analysis in Twitter. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 451–463, Denver, Colorado. Association for Computational Linguistics.

Tatjana Scheffler. 2014. A German Twitter snapshot. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014), Reykjavik, Iceland, May 26–31, 2014, pages 2284–2289. European Language Resources Association (ELRA).

Helmut Schmid. 1995. Probabilistic part-of-speech tagging using decision trees. In Proceedings of the ACL SIGDAT Workshop.

Uladzimir Sidarenka, Tatjana Scheffler, and Manfred Stede. 2013. Rule-based normalization of German Twitter messages. In Language Processing and Knowledge in the Web – 25th International Conference, GSCL 2013: Proceedings of the Workshop Verarbeitung und Annotation von Sprachdaten aus Genres internetbasierter Kommunikation, Darmstadt, Germany.

Hiroya Takamura, Takashi Inui, and Manabu Okumura. 2005. Extracting semantic orientations of words using spin model. In Kevin Knight, Hwee Tou Ng, and Kemal Oflazer, editors, ACL 2005, 43rd Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, 25–30 June 2005, University of Michigan, USA. The Association for Computer Linguistics.

T. Tieleman and G. Hinton. 2012. Lecture 6.5—RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning.

Michael Wojatzki, Eugen Ruppert, Sarah Holschneider, Torsten Zesch, and Chris Biemann. 2017. GermEval 2017: Shared Task on Aspect-based Sentiment in Social Media Customer Feedback. In Proceedings of the GSCL GermEval Shared Task on Aspect-based Sentiment in Social Media Customer Feedback, Berlin, Germany.


Proceedings of the GermEval 2017 – Shared Task on Aspect-based Sentiment in Social Media Customer Feedback, pages 55–60, Berlin, Germany, September 12, 2017.

LT-ABSA: An Extensible Open-Source System for Document-Level and Aspect-Based Sentiment Analysis

Eugen Ruppert‡ and Abhishek Kumar† and Chris Biemann‡

‡Language Technology Group
Computer Science Dept.
Universität Hamburg
http://lt.informatik.uni-hamburg.de

†Indian Institute of Technology Patna
AI-NLP-ML group
Patna, India
http://www.iitp.ac.in

Abstract

This paper presents a system for document-level and aspect-based sentiment analysis, developed during the inception of the GermEval 2017 Shared Task on aspect-based sentiment analysis (ABSA) (Wojatzki et al., 2017). It is a fully featured open-source solution that offers competitive performance on previous tasks as well as a strong performance on the GermEval 2017 Shared Task. We describe the architecture of the system in detail and show competitive evaluation results on ABSA datasets in four languages. The software is distributed under a lenient license, allowing royalty-free use in academia and industry.

1 Introduction

Sentiment analysis has gained a lot of attention in recent years in the CL/NLP community. Aggregating over the sentiment in a large amount of textual material helps governments and companies to deal with the large increase of user-generated content due to the popularity of social media. Companies can react to upcoming problems and prepare strategies to help users navigate reviews and to improve their reputation.

While determining document-level sentiment can be framed as a classification task with two or three classes (positive, negative, possibly neutral), identifying and evaluating aspect-based sentiment is more challenging: here, we are not only interested in the polarity of the sentiment, but also in which particular aspect the sentiment refers to – for example, people might express in the same product review that they like the high-resolution screen of a phone while complaining about its poor battery life. Aspects are typically classified into a flat taxonomy and are lexicalized in opinion target expressions (OTEs), which shall be identified by ABSA systems.

Even though a steady number of sentiment analysis tasks have been conducted in the past years on aspect-based as well as other flavors of sentiment analysis, e.g. (Pontiki et al., 2015; Pontiki et al., 2016; Wojatzki et al., 2017), participants mostly do not share their systems so that others could use or extend them. Even if systems are shared, they are usually not easy to operate, since they typically stay at the level of research software prototypes. A notable exception is Stanford's CoreNLP project, which however only performs document-level sentiment analysis on English (Socher et al., 2013).

In this paper, we present a fully featured open-source1 system for ABSA. Configurations regarding the use of features or the choice of training data can be shared, enabling reproducible results. Our system is flexible enough to support document-level and aspect-based sentiment analysis in multiple languages. Since we also provide feature induction on background corpora as part of the system, it can be applied out of the box.

We focus on engineering aspects. For related work regarding aspect-based sentiment analysis, we refer to the task description papers cited above, as well as to recent surveys, e.g. (Medhat et al., 2014).

2 Architecture

The system is designed as an extensible framework that can be adapted to many different datasets. It is able to perform document-level classification as well as the identification of opinion target expressions (OTEs). NLP pre-processing is engineered in the UIMA framework (Ferrucci and Lally, 2004), which contributes to adaptability and modularity. It is a full-fledged system that contains all stages of preprocessing, from reading in different data formats over tokenization to various target outputs, and is aimed at productive use.

1The system is available under the permissive Apache Software License 2.0, http://apache.org/licenses/LICENSE-2.0.html


[Figure 1: The system workflow of the LT-ABSA system: 1. Feature Induction (domain corpus and resources yield induced features); 2. Training (training data plus induced features yield models); 3. Classification (test data and models yield classified data).]

2.1 Execution and Workflow

The general workflow consists of three major steps (see Figure 1). To prepare model creation, we perform feature induction (1). This step has to be conducted only once when creating a model for a new language or domain. The operator can provide an in-domain corpus to induce features derived from whole-corpus statistics, like tf-idf scores. Furthermore, we support the corpus-informed extension of word lists, such as augmenting a list of positive words with similar words from a background corpus, as described in more detail below. While our system also uses word embeddings as features, their training is not part of our system but needs to be done externally.

In the training step (2), models are trained using labeled training data. The processing pipeline includes readers for several formats to create a document representation, language-specific NLP tools, and feature extractors to create feature vectors. We train machine learning models on these feature representations in order to support two general setups: document-level classification into an arbitrary number of classes, and sequence tagging for extracting spans, such as OTEs.

Finally, the models are used for the classification of new documents (3). This step supports the same file formats and conducts the same feature extraction as the training step. Additionally, we have included a small web server with a RESTful API with HTML and JSON output (see Listing 1 for an example).

The NLP pipeline includes the rule-based segmenter described in Remus et al. (2016), which allows adapting the tokenization to the target domain, e.g. to handle hashtags, cashtags, and other types of tokens in social media content. For POS tagging, we rely on OpenNLP2 for reasons of license compatibility.

3 Features

In this section, we describe our feature induction on background corpora and list the features for document-level classification with support vector machines (SVMs) and sequence tagging with a conditional random field (CRF).

2http://opennlp.apache.org/


{
  "aspect": {
    "label": "DB_App_und_Website#Haupt",
    "score": 0.21274166800759153
  },
  "aspect_coarse": {
    "label": "DB_App_und_Website",
    "score": 0.228312850597364
  },
  "input": "Die App funktioniert nicht, nichts geht mehr",
  "relevance": {
    "label": "true",
    "score": 0.8396798158862353
  },
  "sentiment": {
    "label": "negative",
    "score": 0.46157282933962135
  },
  "targets": ["App"]
}

Listing 1: Example response from the web API


3.1 Feature Induction

Background Corpus We use an in-domain corpus to induce features and semantic models. E.g., for the background corpus on the GermEval 2017 dataset, we used a web crawl obtained by the language-model-based crawler of Remus and Biemann (2016). If in-domain data is not available, we still recommend performing feature induction with a background corpus from the same language. On the background corpus, we compute a distributional thesaurus (DT) (Biemann and Riedl, 2013) and a word2vec model (Mikolov et al., 2013) using the respective software packages, which are not part of the distribution. However, we provide the models as well as usage instructions on how to compute them. Further, we compute inverse document frequencies (IDF) of words (Sparck Jones, 1973).

Training Data Using the training data and the idf scores, we determine the tokens with the highest tf-idf scores for each document-level class. The top 30 tokens for each class are used as binary features.
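
A small sketch of this induction step; in the real system the IDF values come from the background corpus, whereas the toy example below computes everything from three sample documents, and all names are ours.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs   = ['die bahn ist spät', 'danke für die info', 'zug fällt aus']
labels = ['negative', 'neutral', 'negative']

vec = TfidfVectorizer()
X = vec.fit_transform(docs)
vocab = np.array(vec.get_feature_names_out())

top_tokens = {}
for c in set(labels):
    mask = np.array([l == c for l in labels])
    scores = np.asarray(X[mask].sum(axis=0)).ravel()
    top_tokens[c] = vocab[np.argsort(scores)[::-1][:30]]  # top 30 tokens per class
print(top_tokens)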

Polarity Lexicon Expansion Assuming the existence of a polarity lexicon (e.g. Waltinger (2010) for German), we automatically expand such a lexicon for a language using the method described in our previous work (Kumar et al., 2016): First, we collect the top 10 distributionally most similar words for each entry in each polarity class (positive, negative, sometimes also neutral). Then, we filter these expansions by a minimum corpus frequency threshold of 50 in the background corpus. Next, we only keep the expansions that were present for at least 10 of the seed terms. While distributional similarity does not preserve polarity, the described aggregation strategy results in a high-precision, high-coverage domain-specific polarity lexicon.

For all expansion terms, we calculate the normalized scores for each polarity, resulting in a real-valued weight per polarity.
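
Under the stated assumptions, the expansion can be sketched as follows; most_similar and freq stand in for the DT neighbour lookup and the background-corpus frequency counts, which the system computes externally.

from collections import Counter, defaultdict

def expand_lexicon(seeds_by_polarity, most_similar, freq):
    votes = defaultdict(Counter)                   # candidate -> polarity -> seed votes
    for polarity, seeds in seeds_by_polarity.items():
        for seed in seeds:
            for cand in most_similar(seed)[:10]:   # top 10 most similar words
                if freq(cand) >= 50:               # minimum corpus frequency threshold
                    votes[cand][polarity] += 1
    expanded = {}
    for cand, counts in votes.items():
        if sum(counts.values()) >= 10:             # kept for at least 10 seed terms
            total = sum(counts.values())
            expanded[cand] = {p: n / total for p, n in counts.items()}  # normalized scores
    return expanded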

3.2 Document-Based Classifier

We use a linear SVM classifier (Fan et al., 2008) for document-based classification. As the feature space is fairly large and sparse (100+K features for GermEval 2017), we can resort to a linear kernel and do not require more CPU-intensive kernel methods.

• TF-IDF: We calculate the tf-idf weights for each token using the IDF from the background corpus and the frequency of the token in the current document, using the token weights as features. The overall TF-IDF feature vector is normalized with the L2 norm.

• Word Embeddings: We use word embeddings of 300 dimensions trained with word2vec (Mikolov et al., 2013) on background corpora. For the document representation, the word representation for each word is obtained and then averaged to yield a 300-dimensional feature vector. Word embedding averaging is done unweighted as well as weighted by the token's tf-idf score (see the sketch after this list). Finally, the averaged feature vector is normalized using the L2 norm.



• Lexicon: This feature class allows supplying word lists, recording their presence or absence in a sparse feature vector. We use this feature class for supplying polarity lexicons to our classifier.

• Aggregated Lexicon: This feature class also relies on word lists with labels, but aggregates over words from the same class: we supply the relative amount of positive, negative, and neutral words in the document, normalized by document length.

• Expanded Polarity Lexicon: We use the induced expanded polarity lexicon to generate a low-dimensional feature vector (2–3 features). The expanded polarity lexicon provides a polarity distribution for each term, e.g., schnell (fast) – 0.32 (neg-value) – 0.68 (pos-value). We use this feature by summing up the distributions of the tokens that appear in the expanded lexicon and averaging them.
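
The tf-idf-weighted embedding averaging mentioned in the Word Embeddings item can be sketched as follows; emb and idf are assumed lookups for the word2vec vectors and the background-corpus IDF values, and the names are ours.

import numpy as np

def doc_embedding(tokens, emb, idf, weighted=True):
    vecs, weights = [], []
    for tok in tokens:                       # repeated tokens contribute their tf implicitly
        if tok in emb:
            vecs.append(emb[tok])
            weights.append(idf.get(tok, 1.0) if weighted else 1.0)
    if not vecs:
        return np.zeros(300)
    v = np.average(np.array(vecs), axis=0, weights=weights)
    norm = np.linalg.norm(v)
    return v / norm if norm else v           # L2 normalization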

3.3 CRF

The CRF classifier (Okazaki, 2007) is used for the annotation of opinion target expressions, cast in a sequence tagging setup. It uses the following symbolic features in the ClearTK3 framework:

• current token (surface form + lowercased)

• POS tag

• lemma (not available for all languages)

• character prefixes (2–5 characters)

• suffixes (2–5 characters)

• capitalization

• numeric type (identifies types when numbers are present; e.g. digits, alphanumeric, year)

• character categories (patterns based on Unicode categories)

• hyphenation

These features are computed in a window of ±2 tokens around the target token.

3http://cleartk.github.io/cleartk/

4 Results

In the experimental results reported below, we have used the following background corpora for feature induction: For German, we have compiled a corpus from a focused web crawl (Remus et al., 2016). For the SemEval tasks, we employ COW (Schäfer, 2016) web corpora4 for English, Spanish, and Dutch.

4.1 GermEval 2017 Shared Task

The GermEval 2017 Shared Task on ABSA (Wojatzki et al., 2017) features a large German dataset consisting of user-generated content from the railway transportation domain. There are four subtasks that cover document-based and aspect-based sentiment analysis. Participants should classify the binary relevance and the document-level sentiment in Subtasks A and B. Next, they should identify aspects in the document and their corresponding sentiment (Subtask C). Finally, OTEs are identified by span and labeled with an aspect and a sentiment polarity in Subtask D. The task features two test sets: documents from the same period as the training data (synchronic) and documents from a later point in time (diachronic). For evaluation, micro-averaged F1 scores are used.

Our system has been developed in the same project that funded the creation of the dataset used in GermEval 2017. Naturally, as the organizers' entry, it did not compete in the shared task. Nevertheless, we report the ranks our system would have obtained in this task.

Table 1 presents the results on the synchronic dataset and Table 2 on the diachronic dataset. Our system outperforms all baselines and would have ranked highly in the competition, outperforming most submissions on almost every task. On Subtasks A and B, our system is outperformed by a small margin; on Subtasks C and D, we show the best performance overall. We conclude that LT-ABSA is a highly competitive system for sentiment classification on German.

4.2 SemEval-2016 Task 5: Aspect-Based Sentiment Analysis

The SemEval-2016 task on aspect-based sentiment analysis (Task 5; Pontiki et al. (2016)) is comparable in structure to Subtasks B, C, and D of the GermEval 2017 evaluation.

4http://corporafromtheweb.org/


Table 1: GermEval 2017 results, synchronic test set (F1 score)

System           Relevance  Sentiment  Aspect  Aspect+Sentiment  OTE (exact)  OTE (overlap)
MCB                0.816      0.656     0.442       0.315             –            –
Baseline system    0.852      0.667     0.481       0.322           0.170        0.237
Best contender     0.903      0.749     0.482       0.354           0.220        0.348
Our system         0.895      0.767     0.537       0.396           0.229        0.306
Rank                   3          1         1           1               1            2

Table 2: GermEval 2017 results, diachronic test set (F1 score)

System           Relevance  Sentiment  Aspect  Aspect + Sentiment  OTE (exact)  OTE (overlap)
MCB              0.839      0.672      0.465   0.384               –            –
Baseline system  0.868      0.694      0.495   0.389               0.216        0.271
Best contender   0.906      0.750      0.460   0.401               0.281        0.282
Our system       0.894      0.744      0.556   0.424               0.301        0.365
Rank             3          2          1       1                   1            1

Table 3: Results on SemEval-2016, Task 5

Dataset               System      SB1, Slot 1 (F)  SB1, Slot 3 (Acc)  SB2, Slot 2 (Acc)
English Restaurants   Baseline    0.599            0.765              0.743
                      Top system  0.730            0.881              0.819
                      LT-ABSA     0.651            0.782              0.731
                      Rank        16               19                 5
English Laptops       Baseline    0.375            0.700              0.730
                      Top system  0.519            0.828              0.750
                      LT-ABSA     0.412            0.736              0.675
                      Rank        17               12                 5
Dutch Restaurants     Baseline    0.428            0.693              0.732
                      Top system  0.602            0.778              –
                      LT-ABSA     0.578            0.824              0.863
                      Rank        2                1                  –
Spanish Restaurants   Baseline    0.547            0.778              0.745
                      Top system  0.706            0.836              0.772
                      LT-ABSA     0.586            0.821              0.797
                      Rank        9                2                  1

While the overall task was conducted on datasets in eight languages and multiple domains, we have only experimented with the English, Spanish and Dutch datasets. Table 3 presents the results, again with the ranks that our system would have obtained in the task. We report scores on Subtask 1, Slots 1 (Sentence-level Aspect Identification) and 3 (Sentiment Polarity), and on Subtask 2, Slot 2 (Document-level Sentiment Polarity).5

5 We used our system out-of-the-box, without adaptation to the tasks. E.g., in Subtask 2, the entities are already given and need to be classified; we also identify the aspects.

Overall, LT-ABSA is able to beat all baselines for the reported slots, with the exception of SB2, Slot 2 on English, where the baselines rank in the middle of the field and outperform our system. The performance varies across tasks. For the highly contested English datasets, we rank in the lower midfield for SB1 and in the top five for SB2. For the less contested Spanish and Dutch datasets, we show a competitive performance.


5 Conclusion

We present a flexible, extensible open-source system for document-level and aspect-based sentiment analysis and have reported state-of-the-art results on two shared tasks in four different languages. Code and documentation are available on GitHub.6 We also provide complete feature sets and trained models for all experiments reported in this paper.7

6 Code: https://github.com/uhh-lt/LT-ABSA
7 Data and Models: http://ltdata1.informatik.uni-hamburg.de/sentiment/

Acknowledgements

This work has been supported by the Innovation Alliance between the Deutsche Bahn and TU Darmstadt, Germany, as well as a DAAD WISE internship grant.

References

Chris Biemann and Martin Riedl. 2013. Text: Now in 2D! A framework for lexical expansion with contextual similarity. Journal of Language Modelling, 1(1):55–95.

Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874.

David Ferrucci and Adam Lally. 2004. UIMA: An architectural approach to unstructured information processing in the corporate research environment. Natural Language Engineering, 10(3–4):327–348.

Ayush Kumar, Sarah Kohail, Amit Kumar, Asif Ekbal, and Chris Biemann. 2016. IIT-TUDA at SemEval-2016 Task 5: Beyond sentiment lexicon: Combining domain dependency and distributional semantics features for aspect based sentiment analysis. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 1129–1135.

Walaa Medhat, Ahmed Hassan, and Hoda Korashy. 2014. Sentiment analysis algorithms and applications: A survey. Ain Shams Engineering Journal, 5(4):1093–1113.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In Workshop at the International Conference on Learning Representations (ICLR), pages 1310–1318, Scottsdale, AZ, USA.

Naoaki Okazaki. 2007. CRFsuite: A fast implementation of Conditional Random Fields (CRFs).


Maria Pontiki, Dimitris Galanis, Haris Papageorgiou, Suresh Manandhar, and Ion Androutsopoulos. 2015. SemEval-2015 Task 12: Aspect based sentiment analysis. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 486–495, Denver, CO, USA.

Maria Pontiki, Dimitris Galanis, Haris Papageorgiou, Ion Androutsopoulos, Suresh Manandhar, Mohammed AL-Smadi, Mahmoud Al-Ayyoub, Yanyan Zhao, Bing Qin, Orphée De Clercq, Véronique Hoste, Marianna Apidianaki, Xavier Tannier, Natalia Loukachevitch, Evgeniy Kotelnikov, Núria Bel, Salud María Jiménez-Zafra, and Gülşen Eryiğit. 2016. SemEval-2016 Task 5: Aspect based sentiment analysis. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 19–30, San Diego, CA, USA.

Steffen Remus and Chris Biemann. 2016. Domain-specific corpus expansion with focused webcrawling. In Proceedings of the 10th Web as Corpus Workshop, pages 106–114, Berlin, Germany.

Steffen Remus, Gerold Hintz, Darina Benikova, Thomas Arnold, Judith Eckle-Kohler, Christian M. Meyer, Margot Mieskes, and Chris Biemann. 2016. EmpiriST: AIPHES. Robust tokenization and POS-tagging for different genres. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pages 106–114, Portorož, Slovenia.

Roland Schäfer. 2016. On bias-free crawling and representative web corpora. In Proceedings of the 10th Web as Corpus Workshop, pages 99–105, Berlin, Germany.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, WA, USA.

Karen Spärck Jones. 1973. Index term weighting. Information Storage and Retrieval, 9(11):619–633.

Ulli Waltinger. 2010. Sentiment analysis reloaded: A comparative study on sentiment polarity identification combining machine learning and subjectivity features. In Proceedings of the 6th International Conference on Web Information Systems and Technologies (WEBIST '10), Valencia, Spain.

Michael Wojatzki, Eugen Ruppert, Sarah Holschneider, Torsten Zesch, and Chris Biemann. 2017. GermEval 2017: Shared Task on Aspect-based Sentiment in Social Media Customer Feedback. In Proceedings of the GSCL GermEval Shared Task on Aspect-based Sentiment in Social Media Customer Feedback, Berlin, Germany.


Author Index

Becker, Christoph, 13
Biemann, Chris, 1, 55

Dakota, Daniel, 43
Daxenberger, Johannes, 22

Eger, Steffen, 22

Friedrich, Christoph M., 30

Gurevych, Iryna, 22

Hövelmann, Leonard, 30
Holschneider, Sarah, 1

Kübler, Sandra, 43
Kallmeyer, Laura, 18
Kumar, Abhishek, 55

Lanka, Soujanya, 36
Lee, Ji-Ung, 22

Mieskes, Margot, 13
Mishra, Pruthwik, 36
Mujadia, Vandan, 36

Naderalvojoud, Behzad, 18

Qasemizadeh, Behrang, 18

Ruppert, Eugen, 1, 55

Sayyed, Zeeshan Ali, 43
Schulz, Karen, 13
Sidarenka, Uladzimir, 49

Wojatzki, Michael, 1

Zesch, Torsten, 1
