
SentiRuEval: Testing Object-Oriented Sentiment Analysis Systems in Russian

N. Loukachevitch (Moscow), P. Blinov (Kirov), E. Kotelnikov (Kirov), Y. Rubtsova (Novosibirsk), V. Ivanov (Kazan), E. Tutubalina (Kazan)

Entity-oriented sentiment analysis

• Sentiment analysis
  – In general: sentiment of the whole document, fragment, or sentence
  – Entity-oriented
• Sentiment about a specific entity
  – Politician, political party
  – Company, etc.
• Sentiment about specific parts or properties of an entity (aspects)
  – Переходи в Билайн. «Все за 300» — отличный тариф! (Switch to Beeline. "Everything for 300" is a great tariff!)
• Previous evaluations of Russian sentiment analysis (2011-2013) concerned the general sentiment of a review or a news quotation

SentiRuEval 2014-2015

• Testing of sentiment analysis systems for Russian texts
• Aspect-oriented analysis of reviews
  – Restaurants
  – Cars
• Entity-oriented analysis of tweets: reputation monitoring
  – Banks
  – Telecom companies

SentiRuEval: Analysis of reviews

• Tasks
  – Aspect term extraction
  – Sentiment towards aspect terms
  – Determining categories of aspect terms
    • Restaurants: food, interior, service, price, restaurant as a whole
    • Cars: comfort, reliability, appearance, price, driveability, car as a whole
  – Determining sentiment of aspect categories for the whole review
• Data sets in each domain
  – Training collection: 200 reviews
  – Test collection: 200 reviews
• Participants: 12 participants with 21 runs

Reviews: different types of lexical units

Друзья, давайте перестанем покорно принимать курицу с сухарями, залитую майонезом, за салат Цезарь. Я заказала Цезарь во "Временах Года" и опять, как и во многих других ресторанах, мне принесли залитую майонезом кислятину под названием "Салат Цезарь с куриным филе" за 280 руб. Уважаемые повара, вам не стыдно??? Цезарь - это салат с особым соусом с анчоусами, очень вкусный. В вашем заведении - столовская кухня по ценам хорошего ресторана.

(Translation: Friends, let's stop meekly accepting chicken with croutons drowned in mayonnaise as a Caesar salad. I ordered a Caesar at "Vremena Goda" and once again, as in many other restaurants, I was brought a mayonnaise-drenched sour mess called "Caesar Salad with Chicken Fillet" for 280 rubles. Dear chefs, aren't you ashamed??? A Caesar is a salad with a special anchovy dressing, very tasty. What your establishment serves is canteen food at the prices of a good restaurant.)

Aspects labeling

• Types of aspects
  – Explicit aspects denote some part or characteristic of the described object:
    • staff, pasta, music in restaurant reviews
    • usually nouns or noun groups
  – Implicit aspects are single words, possibly combined with sentiment operators, that convey both a specific sentiment and a clear indication of the aspect category:
    • tasty (positive + food), comfortable (positive + interior), not comfortable (negative + interior)
  – Sentiment facts do not mention the user's sentiment directly; formally they report only a real fact, but this fact conveys the user's sentiment as well as the aspect category it relates to:
    • отвечала на все вопросы (answered all questions)
    • долго ждали (long waiting)
    • знала меню (knew the menu)
    • человеческий волос (human hair)
  – Sentiment facts may or may not contain explicit aspects within themselves

Relevance of the term to the review

• Relevance labels:
  – Rel: relevant (to the current review)
  – Cmpr: comparison, i.e., the term concerns another entity
    • We decided not to have dessert and coffee there, but instead went to another restaurant where we enjoyed a wonderful end to our evening.
  – Prev: previous, i.e., the term relates to previously formed opinions
    • Приехали в новый ресторан Тао с мужем, в предвкушении чего-то необыкновенного, ожидания были таковы из-за прочитанных ранее отзывов, место описывалось как магическое, а еда феерично-космическая (My husband and I arrived at the new restaurant Tao anticipating something extraordinary; the expectations came from previously read reviews that described the place as magical and the food as fabulously cosmic)
  – Irr: irrealis, i.e., the term is part of a recommendation or a description of a desirable situation
  – Irn: irony

Instrument for annotation: Brat

Aspect-oriented tasks

• Tasks:
  – A: automatic extraction of explicit aspects
  – B: automatic extraction of all aspects, including sentiment facts
  – C: extraction of sentiments towards explicit aspects
  – D: automatic categorization of explicit aspects into aspect categories
  – E: sentiment analysis of the whole review by aspect categories
• Test data in XML
  – Several thousand automatically labeled reviews conceal the manually annotated reviews with correct aspects (Aspects block)
  – Participants should write the extracted aspects to the Aspect1 block
  – Participants should categorize the aspects in the Aspects block

Format of test collection

Problems with annotation

• Aspect labeling
  – Mentions in neutral contexts, especially mentions of the entities themselves (restaurant, car)
  – Usually maximal noun groups should be labeled, but: внешний вид автомобиля (the car's exterior appearance)
• Aspect categorization
  – по ходовой слабенькая машина (the car is weak in its running gear) - drivability
  – красивая машина (a beautiful car) - appearance
  – машина просто классная (the car is just great) - as a whole

Extraction of explicit aspects: F1-measure

Domain        Baseline   Best result
Restaurants   0.608      0.632
Cars          0.594      0.676

Problems of automatic approaches

• Long noun groups with low frequencies:
  – "сытая хавронья" из свинины ("well-fed sow" made of pork)
  – баклажаны, запеченные с сыром (eggplant baked with cheese)
  – бекон на хрустящем тосте с помидором черри (bacon on crispy toast with a cherry tomato)
• Ambiguous verbs:
  – ели (they ate; also "fir trees"), поели (had a meal)

Best approaches

• Sequence labeling (SVM), distributional approaches, recurrent neural networks (a toy sequence-labeling sketch follows below)
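For illustration only, here is a minimal BIO-style aspect extraction sketch: a per-token linear SVM with simple contextual features stands in for the structured sequence models the slide mentions. The features, tags, and toy data are my assumptions, not any participant's actual system.

```python
# Toy BIO aspect-term extraction with a per-token linear SVM (illustrative sketch).
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def token_features(tokens, i):
    """Simple contextual features for token i (an assumed feature set)."""
    return {
        "word": tokens[i].lower(),
        "suffix3": tokens[i][-3:].lower(),
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

# Toy training data: B/I mark aspect terms, O marks other tokens.
train = [
    (["the", "pasta", "was", "great"], ["O", "B", "O", "O"]),
    (["the", "caesar", "salad", "was", "sour"], ["O", "B", "I", "O", "O"]),
    (["friendly", "staff", "and", "nice", "music"], ["O", "B", "O", "O", "B"]),
]
X = [token_features(toks, i) for toks, tags in train for i in range(len(toks))]
y = [tag for _, tags in train for tag in tags]

model = make_pipeline(DictVectorizer(), LinearSVC())
model.fit(X, y)

test = ["the", "music", "was", "great"]
print(model.predict([token_features(test, i) for i in range(len(test))]))
```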

Sentiments towards aspects: macroF1

Domain        Baseline (most frequent class)   Best result
Restaurants   0.267                            0.554
Cars          0.264                            0.568

Leader: Gradient Boosting Classifier
Features: a skip-gram model exploiting word contexts to learn better vector representations, plus pointwise mutual information (a sketch of these ingredients follows below)
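A hedged sketch of the winning recipe's ingredients: skip-gram word vectors (here via gensim's 4.x API) averaged over an aspect's context and fed to a gradient boosting classifier. The corpus, hyperparameters, and labels are toy placeholders, not the participant's actual configuration.

```python
# Illustrative sketch: skip-gram vectors + gradient boosting for aspect sentiment.
import numpy as np
from gensim.models import Word2Vec
from sklearn.ensemble import GradientBoostingClassifier

corpus = [
    ["the", "pasta", "was", "delicious"],
    ["the", "staff", "was", "rude"],
    ["great", "service", "and", "tasty", "food"],
]
# sg=1 selects the skip-gram architecture (vs. CBOW).
w2v = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

def context_vector(tokens):
    """Average skip-gram vectors over an aspect's context window."""
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.wv.vector_size)

# Toy (context, sentiment) pairs; 1 = positive, 0 = negative.
X = np.array([context_vector(s) for s in corpus])
y = np.array([1, 0, 1])

clf = GradientBoostingClassifier(n_estimators=50).fit(X, y)
print(clf.predict([context_vector(["tasty", "pasta"])]))
```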

Aspects categorization

Domain        Baseline (most frequent class)   Best result
Restaurants   0.800                            0.865
Cars          0.564                            0.652

The best result:
• SVM with features based on pointwise mutual information
The second-place result:
• a method relying on term similarity in the space of distributed word representations (a minimal PMI sketch follows below)
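Since both top systems lean on pointwise mutual information, here is a minimal sketch of PMI between an aspect term and a category, computed from co-occurrence counts. The counts are invented for illustration; in practice they would come from the annotated training reviews.

```python
# Pointwise mutual information between aspect terms and categories (sketch).
import math
from collections import Counter

# (term, category) observations, e.g. from annotated aspects in training data.
pairs = [("pasta", "food"), ("pasta", "food"), ("waiter", "service"),
         ("music", "interior"), ("pasta", "price"), ("waiter", "service")]

joint = Counter(pairs)
terms = Counter(t for t, _ in pairs)
cats = Counter(c for _, c in pairs)
n = len(pairs)

def pmi(term, cat):
    """PMI(t, c) = log( P(t, c) / (P(t) * P(c)) ); -inf if never co-occurring."""
    if joint[(term, cat)] == 0:
        return float("-inf")
    return math.log((joint[(term, cat)] / n) / ((terms[term] / n) * (cats[cat] / n)))

print(pmi("pasta", "food"))     # high: "pasta" is strongly associated with food
print(pmi("pasta", "service"))  # -inf: never observed together
```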

SentiRuEval: Reputation monitoring

• A reputation-oriented tweet may express
  – a positive or negative attitude towards a company
  – a positive or negative fact concerning a company
• Training collection
  – 5000 banking tweets and 5000 telecom tweets
• Participation
  – 10 participants
  – 33 runs

Example of a tweet and the data format for labeling

<table name="bank">
  <column name="id">71</column>
  <column name="twitid">492547326574360000</column>
  <column name="text">Сбербанк России не будет работать в Крыму и Севастополе</column>
  <column name="sberbank">0</column>
  <column name="vtb">NULL</column>
  <column name="gazprom">NULL</column>
  <column name="alfabank">NULL</column>
  <column name="bankmoskvy">NULL</column>
  <column name="raiffeisen">NULL</column>
  <column name="uralsib">NULL</column>
  <column name="rshb">NULL</column>
</table>

(Tweet text: Sberbank of Russia will not operate in Crimea and Sevastopol. A parsing sketch for this format follows below.)

Expert annotation

• Annotators should:
  – leave the "0" label of a mentioned entity unchanged if the tweet was considered neutral,
  – or replace the value with "1" (positive): a positive fact or opinion,
  – or with "-1" (negative): a negative fact or opinion.
• Annotators could also:
  – label a tweet with "--", meaning the tweet is meaningless,
  – or with "+-", meaning positive and negative sentiments in the same tweet.
  – Both latter cases were excluded from the evaluation.

Annotation problems

• Problems
  – Disagreement in sentiment labeling
    • я сегодня ходил в сбербанк за картой, там оч милая девушка работала (I went to Sberbank today to get a card; a very nice girl was working there)
  – Multiple mistakes
• Test data were annotated using a voting scheme with 3 annotators
  – A label was kept on agreement between 2 or 3 annotators (a minimal voting sketch follows below)
• Size of test collections
  – Banks: 4549 of 5000 labeled tweets
  – Telecom: 3845 of 5000 labeled tweets
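A minimal sketch of the 3-annotator voting scheme, assuming (as described above) that a tweet is kept when at least two annotators agree and discarded otherwise:

```python
# Voting over three annotators (sketch): keep the majority label,
# discard tweets where all three annotators disagree.
from collections import Counter

def vote(labels):
    """Return the label chosen by >= 2 of 3 annotators, or None to discard."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= 2 else None

print(vote(["1", "1", "0"]))    # "1": two annotators agree
print(vote(["1", "0", "-1"]))   # None: full disagreement, tweet is dropped
```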

Performance measures

• Three-way classification of tweets: positive, negative, or neutral.
• Main quality measure: macro-averaged F-measure
  – the average of
    • the F-measure of the positive class and
    • the F-measure of the negative class
  – the F-measure of the neutral class is ignored
  – this does not reduce the task to two-class prediction: neutral tweets misclassified as positive or negative still lower the precision of the sentiment classes
• Additionally, micro-averaged F-measures were calculated over the two sentiment classes (a computation sketch follows below)
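A sketch of the measure with scikit-learn: restricting f1_score to the positive and negative labels averages F over those two classes only, while neutral predictions still affect their precision and recall. The toy labels are invented.

```python
# Macro- and micro-averaged F over positive (1) and negative (-1) classes only;
# neutral (0) has no F-score of its own, but neutral tweets misclassified as
# positive/negative still hurt the precision of the sentiment classes.
from sklearn.metrics import f1_score

y_true = [1, 1, -1, 0, 0, -1, 1, 0]
y_pred = [1, 0, -1, 0, 1, 1, 1, 0]

macro_f = f1_score(y_true, y_pred, labels=[1, -1], average="macro")
micro_f = f1_score(y_true, y_pred, labels=[1, -1], average="micro")
print(f"Macro-F = {macro_f:.3f}, Micro-F = {micro_f:.3f}")
```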

Results

• Manual labeling by one participant in the telecom domain
  – Macro-F: 0.703, Micro-F: 0.7487
  – The absolute maximum for automatic approaches
• The best results of automatic systems are far from the manual results
• Best approaches:
  – SVM + syntactic relations
  – linguistic syntax-based patterns (without machine learning)
  – MaxEnt and SVM using various features

Domain    Macro-F   Micro-F
Banking   0.3598    0.3656
Telecom   0.4882    0.5362

Why the best results in the two domains are so different

• The best results in the banking and telecom domains differ substantially: 0.36 vs. 0.488
• Explanation: the difference between the training and test collections, measured with the Kullback-Leibler divergence (a computation sketch follows below)
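A sketch of the kind of comparison the slide refers to: KL divergence between smoothed unigram word distributions of the training and test collections. Whitespace tokenization, add-one smoothing, and the toy texts are my assumptions.

```python
# KL divergence between word distributions of two collections (sketch).
import math
from collections import Counter

def unigram_dist(texts, vocab):
    """Add-one-smoothed unigram distribution over a shared vocabulary."""
    counts = Counter(w for t in texts for w in t.lower().split())
    total = sum(counts.values()) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_w P(w) * log(P(w) / Q(w))."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

train = ["bank raised rates", "new mobile tariff"]
test = ["sanctions hit the bank", "bank branch closed"]
vocab = set(w for t in train + test for w in t.lower().split())

p, q = unigram_dist(train, vocab), unigram_dist(test, vocab)
print(f"D_KL(train || test) = {kl_divergence(p, q):.3f}")
```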

Problems of reputation analysis of tweets

• At any moment, events influencing reputation can occur => such events are absent from the training data
• In our case:
  – Training collections in both domains were gathered during July-August 2014, after the Ukraine events of 2013-2014
    • Sanctions against banks
    • Problems with communication in Crimea (to a lesser extent)
  – Test collections cover December 2013 - February 2014
    • The Ukraine events did not influence the target entities

Most difficult tweets: almost all systems made mistakes

1. The training collection does not contain the relevant words from the test collection:
  – Самый безалаберный банк по отношению к клиентам - Сбербанк (The most careless bank towards its clients is Sberbank)
  – В столице произошло дерзкое ограбление Сбербанка (A daring robbery of a Sberbank branch took place in the capital)
  – Гребаный сбербанк (Damn Sberbank)
2. Genuinely difficult tweets: irony, sarcasm, and comparisons:
  – Сбербанк России - лучший в мире производитель пластиковых карточек для отскабливания льда от автомобиля (Sberbank of Russia is the world's best producer of plastic cards for scraping ice off a car)
  – Нормально @sberbank зарабатывает - размен 5% от суммы (@sberbank earns nicely: 5% of the amount just for making change)
• Because of the great difference between the training and test collections in the banking domain, about 30% of tweets could have been classified better if the approaches had used general sentiment dictionaries

If systems were really entity-oriented

• Test tweets mentioning two or more entities:
  – 58 tweets in the banking domain (15 tweets with different polarity labels),
  – 232 tweets in the telecom domain (71 tweets with different polarity labels).
• Only three of nine participants treated the task as an entity-oriented one
  – The other participants always assigned the same polarity class to all entities mentioned in a tweet.
• Performance on these tweets
  – Worse than the average over all tweets
  – Entity-oriented approaches did not achieve better results

Conclusion

• We described the tasks, approaches, and results of the SentiRuEval evaluation
  – Aspect-oriented analysis of reviews in two domains
  – Reputation-oriented analysis of tweets
• All prepared materials are accessible for research purposes (see the hyperlinks in the paper)
• Review task conclusions
  – Most effort was directed at aspect extraction
  – Other tasks received less attention
• Tweet task conclusions
  – High dependence on the training collections
  – The capability to perform entity-oriented analysis is quite limited
• Should both tasks (or some variants of them) be repeated?