LINGUISTICA GENERALE E COMPUTAZIONALE


Page 1: LINGUISTICA GENERALE E COMPUTAZIONALE

LINGUISTICA GENERALE E COMPUTAZIONALE

SENTIMENT ANALYSIS

Page 2: LINGUISTICA GENERALE E COMPUTAZIONALE

FACTS AND OPINIONS

• Two main types of textual information on the Web: FACTS and OPINIONS

• Current search engines search for facts (and assume they are true)
– Facts can be expressed with topic keywords.

Page 3: LINGUISTICA GENERALE E COMPUTAZIONALE

THERE ARE PLENTY OF OPINIONS ON THE WEB

Page 4: LINGUISTICA GENERALE E COMPUTAZIONALE

SENTIMENT ANALYSIS

• Also known as opinion mining
• Attempts to identify the opinion/sentiment that a person may hold towards an object

[Diagram: Sentiment Analysis → Positive, Negative, Neutral]

Page 5: LINGUISTICA GENERALE E COMPUTAZIONALE

Components of an opinion

• Basic components of an opinion:
– Opinion holder: the person or organization that holds a specific opinion on a particular object.
– Object: the entity on which an opinion is expressed.
– Opinion: a view, attitude, or appraisal of an object from an opinion holder.
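
As a rough illustration, the three components above can be held in a small record type. The class and field names (`Opinion`, `holder`, `target`, `polarity`) are assumptions made for this sketch and do not come from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Opinion:
    """One opinion, following the three components listed on the slide."""
    holder: str    # person or organization holding the opinion
    target: str    # object on which the opinion is expressed
    polarity: str  # "positive", "negative", or "neutral"


# Hypothetical instance based on the iPhone review used in this lecture.
example = Opinion(holder="Abc123", target="iPhone", polarity="positive")
print(example)
```

Freezing the dataclass makes instances hashable, so opinions could be counted or deduplicated in a set; that is a design choice of the sketch, not a requirement.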

Page 6: LINGUISTICA GENERALE E COMPUTAZIONALE

SENTIMENT ANALYSIS GRANULARITY

• At the document (or review) level:
– Task: sentiment classification of reviews
– Classes: positive, negative, and neutral
– Assumption: each document (or review) focuses on a single object (not true in many discussion posts) and contains opinions from a single opinion holder.

Page 7: LINGUISTICA GENERALE E COMPUTAZIONALE

DOCUMENT-LEVEL SENTIMENT ANALYSIS EXAMPLE

Page 8: LINGUISTICA GENERALE E COMPUTAZIONALE

SENTIMENT ANALYSIS GRANULARITY

• At the document (or review) level:
– Task: sentiment classification of reviews
– Classes: positive, negative, and neutral
– Assumption: each document (or review) focuses on a single object (not true in many discussion posts) and contains opinions from a single opinion holder.

• At the sentence level:
– Task 1: identifying subjective/opinionated sentences
  • Classes: objective and subjective (opinionated)
– Task 2: sentiment classification of sentences
  • Classes: positive, negative, and neutral.
  • Assumption: a sentence contains only one opinion; not true in many cases.
– Then we can also consider clauses or phrases.

Page 9: LINGUISTICA GENERALE E COMPUTAZIONALE

SENTENCE-LEVEL SENTIMENT ANALYSIS EXAMPLE

Id: Abc123 on 5-1-2008 “I bought an iPhone a few days ago. It is such a nice phone. The touch screen is really cool. The voice quality is clear too.

It is much better than my old Blackberry, which was a terrible phone and so difficult to type with its tiny keys. However, my mother was mad with me as I did not tell her before I bought the phone. She also thought the phone was too expensive, …”

Page 10: LINGUISTICA GENERALE E COMPUTAZIONALE

SENTENCE-LEVEL SENTIMENT ANALYSIS

Page 11: LINGUISTICA GENERALE E COMPUTAZIONALE

SENTENCE-LEVEL SENTIMENT ANALYSIS

Page 12: LINGUISTICA GENERALE E COMPUTAZIONALE

SENTIMENT ANALYSIS GRANULARITY

• At the feature level:
– Task 1: Identify and extract object features that have been commented on by an opinion holder (e.g., a reviewer).
– Task 2: Determine whether the opinions on the features are positive, negative or neutral.
– Task 3: Group feature synonyms.
  • Produce a feature-based opinion summary of multiple reviews.

Page 13: LINGUISTICA GENERALE E COMPUTAZIONALE

SENTIMENT ANALYSIS GRANULARITY

• At the feature level:
– Task 1: Identify and extract object features that have been commented on by an opinion holder (e.g., a reviewer).
– Task 2: Determine whether the opinions on the features are positive, negative or neutral.
– Task 3: Group feature synonyms.
  • Produce a feature-based opinion summary of multiple reviews.

• Opinion holders: identifying holders is also useful, e.g., in news articles, but they are usually known in user-generated content, i.e., they are the authors of the posts.
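
A minimal sketch of how the output of the three feature-level tasks could be combined into a feature-based summary. It assumes Tasks 1 and 2 have already produced (feature, polarity) pairs; the sample pairs, the synonym table, and the function name `summarize` are all invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical output of Tasks 1 and 2: (feature mention, polarity) pairs.
extracted = [
    ("touch screen", "positive"),
    ("screen", "positive"),
    ("voice quality", "positive"),
    ("price", "negative"),
    ("cost", "negative"),
]

# Hypothetical synonym groups for Task 3 (hand-written for this sketch).
canonical = {"screen": "touch screen", "cost": "price"}


def summarize(pairs, synonyms):
    """Count polarities per canonical feature name."""
    summary = defaultdict(Counter)
    for feature, polarity in pairs:
        summary[synonyms.get(feature, feature)][polarity] += 1
    return {feature: dict(counts) for feature, counts in summary.items()}


summary = summarize(extracted, canonical)
for feature, counts in summary.items():
    print(feature, counts)
```

In practice the synonym table is the hard part (Task 3); here it is simply given.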

Page 14: LINGUISTICA GENERALE E COMPUTAZIONALE

FEATURE-LEVEL SENTIMENT ANALYSIS

Page 15: LINGUISTICA GENERALE E COMPUTAZIONALE

ENTITY AND ASPECT (Hu and Liu, 2004; Liu, 2006)

Page 16: LINGUISTICA GENERALE E COMPUTAZIONALE

OPINION TARGET

Page 17: LINGUISTICA GENERALE E COMPUTAZIONALE

A DEFINITION OF OPINION (Liu, Ch. in NLP handbook, 2010)

Page 18: LINGUISTICA GENERALE E COMPUTAZIONALE

SENTIMENT ANALYSIS: THE TASK

Page 19: LINGUISTICA GENERALE E COMPUTAZIONALE

Applications

• Businesses and organizations:
– product and service benchmarking.
– market intelligence.
– Businesses spend a huge amount of money to find consumer sentiments and opinions.
  • Consultants, surveys, focus groups, etc.

• Individuals: interested in others’ opinions when
– purchasing a product or using a service,
– finding opinions on political topics

• Ad placement: placing ads in user-generated content
– Place an ad when one praises a product.
– Place an ad from a competitor if one criticizes a product.

• Opinion retrieval/search: providing general search for opinions.

Page 20: LINGUISTICA GENERALE E COMPUTAZIONALE

DOCUMENT-LEVEL SENTIMENT ANALYSIS

Page 21: LINGUISTICA GENERALE E COMPUTAZIONALE

DOCUMENT-LEVEL SENTIMENT ANALYSIS

Page 22: LINGUISTICA GENERALE E COMPUTAZIONALE

DOCUMENT-LEVEL SENTIMENT ANALYSIS = TEXT CLASSIFICATION

Page 23: LINGUISTICA GENERALE E COMPUTAZIONALE

ASSUMPTIONS AND GOALS

Page 24: LINGUISTICA GENERALE E COMPUTAZIONALE

LEXICON-BASED APPROACHES

• Use sentiment and subjectivity lexicons
• Rule-based classifier
– A sentence is subjective if it has at least two words in the lexicon
– A sentence is objective otherwise
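
The rule above can be sketched in a few lines. The lexicon here is a tiny hand-picked word set used only for illustration, and the tokenizer is a simple regular expression; both are assumptions of this sketch rather than part of any real lexicon or tool.

```python
import re

# Tiny hand-picked lexicon, for illustration only; a real system would load
# one of the published subjectivity lexicons instead.
LEXICON = {"nice", "cool", "clear", "better", "terrible",
           "difficult", "mad", "expensive"}


def is_subjective(sentence, lexicon=LEXICON, threshold=2):
    """Return True if the sentence has at least `threshold` lexicon words."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    hits = sum(1 for token in tokens if token in lexicon)
    return hits >= threshold


print(is_subjective("It is much better than my old Blackberry, "
                    "which was a terrible phone."))
print(is_subjective("I bought an iPhone a few days ago."))
```

Note that "It is such a nice phone." contains only one lexicon word, so this rule labels it objective, which shows how coarse the two-word threshold is.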

Page 25: LINGUISTICA GENERALE E COMPUTAZIONALE

SUPERVISED CLASSIFICATION

• Treat sentiment analysis as a type of classification
• Use corpora annotated for subjectivity and/or sentiment
• Train machine learning algorithms:
– Naïve Bayes
– Decision trees
– SVM
– …
• Learn to automatically annotate new text
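
To make the supervised recipe concrete, here is a minimal multinomial Naïve Bayes classifier with add-one smoothing, written in plain Python. The four training sentences are invented for this sketch, and whitespace tokenization is a simplifying assumption; a real experiment would use an annotated corpus and a library implementation.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


def train_naive_bayes(labelled_docs):
    """Collect per-class document counts and per-class word counts."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labelled_docs:
        class_counts[label] += 1
        for token in tokenize(text):
            word_counts[label][token] += 1
            vocabulary.add(token)
    return class_counts, word_counts, vocabulary


def classify(text, model):
    """Pick the class with the highest log-probability (add-one smoothing)."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for token in tokenize(text):
            if token in vocabulary:  # ignore words never seen in training
                count = word_counts[label][token]
                score += math.log((count + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training set invented for this sketch.
training = [
    ("such a nice phone", "positive"),
    ("the touch screen is really cool", "positive"),
    ("a terrible phone", "negative"),
    ("too expensive and difficult to type", "negative"),
]
model = train_naive_bayes(training)
print(classify("really nice screen", model))
print(classify("terrible and expensive", model))
```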

Page 26: LINGUISTICA GENERALE E COMPUTAZIONALE

TYPICAL SUPERVISED APPROACH

Page 27: LINGUISTICA GENERALE E COMPUTAZIONALE

FEATURES FOR SUPERVISED DOCUMENT-LEVEL SENTIMENT ANALYSIS

• A large set of features has been tried by researchers
– Term frequency and different IR weighting schemes, as in other work on classification
– Part of speech (POS) tags
– Opinion words and phrases
– Negations
– Syntactic dependency
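
Two of the feature types listed above, term frequency and negation, can be combined in a small extractor. The sketch below uses one common convention, prefixing every word after a negation with `NOT_` until the next punctuation mark; the negation list and the prefix are assumptions of this example, not something the slides prescribe.

```python
import re
from collections import Counter

NEGATIONS = {"not", "no", "never", "cannot"}  # small illustrative list
PUNCTUATION = set(".,!?;")


def extract_features(text):
    """Term-frequency features; words that follow a negation get a NOT_
    prefix until the next punctuation mark."""
    tokens = re.findall(r"[a-z']+|[.,!?;]", text.lower())
    features = Counter()
    negated = False
    for token in tokens:
        if token in PUNCTUATION:
            negated = False          # negation scope ends at punctuation
        elif token in NEGATIONS or token.endswith("n't"):
            negated = True
            features[token] += 1
        elif negated:
            features["NOT_" + token] += 1
        else:
            features[token] += 1
    return features


features = extract_features("I did not like the screen, but the sound is good.")
print(features)
```

The resulting counts can be fed to any classifier that accepts bag-of-words features.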

Page 28: LINGUISTICA GENERALE E COMPUTAZIONALE

SENTIMENT ANALYSIS IN PYTHON

Page 29: LINGUISTICA GENERALE E COMPUTAZIONALE

EASIER AND HARDER PROBLEMS

• Tweets from Twitter are probably the easiest
– short and thus usually straight to the point
• Reviews are next
– entities are given (almost) and there is little noise
• Discussions, comments, and blogs are hard.
– Multiple entities, comparisons, noise, sarcasm, etc.

Page 30: LINGUISTICA GENERALE E COMPUTAZIONALE

ASPECT-BASED SENTIMENT ANALYSIS

• Sentiment classifications at the document and sentence (or clause) levels are useful, but they do not find what people liked and disliked.

• They do not identify the targets of opinions, i.e., ENTITIES and their ASPECTS

• Without knowing targets, opinions are of limited use.

Page 31: LINGUISTICA GENERALE E COMPUTAZIONALE

ASPECT-BASED SENTIMENT ANALYSIS

• Much of the research is based on online reviews
• For reviews, aspect-based sentiment analysis is easier because the entity (i.e., product name) is usually known
– Reviewers simply express positive and negative opinions on different aspects of the entity.
• For blogs, forum discussions, etc., it is harder:
– both the entity and its aspects are unknown
– there may also be many comparisons
– and there is also a lot of irrelevant information.

Page 32: LINGUISTICA GENERALE E COMPUTAZIONALE

BRIEF DIGRESSION

• Regular opinions: sentiment/opinion expressions on some target entities
– Direct opinions: “The touch screen is really cool”
– Indirect opinions: “After taking the drug, my pain has gone”
• COMPARATIVE opinions: comparisons of more than one entity.
– “iPhone is better than Blackberry”

Page 33: LINGUISTICA GENERALE E COMPUTAZIONALE

Find entities (entity set expansion)

• Although similar, it is somewhat different from the traditional named entity recognition (NER). (See next lectures)

• E.g., one wants to study opinions on phones
– given Motorola and Nokia, find all phone brands and models in a corpus, e.g., Samsung, Moto,

Page 34: LINGUISTICA GENERALE E COMPUTAZIONALE

Feature/Aspect extraction

• May extract frequent nouns and noun phrases
– Sometimes limited to a set known to be related to the entity of interest or using part discriminators
– e.g., for a scanner entity “scanner”, “scanner has”
• Opinion and target relations
– Proximity or syntactic dependency
• Standard IE methods
– Rule-based or supervised learning
– Often HMMs or CRFs (like standard IE)
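
The first idea above, extracting frequent nouns and noun phrases, can be sketched as follows. The input is assumed to be already POS-tagged with Penn Treebank-style tags; the tagged sentences are hand-made stand-ins for tagger output, and a "noun phrase" is crudely approximated as a maximal run of consecutive nouns.

```python
from collections import Counter

# Hypothetical pre-tagged review sentences, standing in for tagger output.
tagged_reviews = [
    [("The", "DT"), ("screen", "NN"), ("is", "VBZ"), ("cool", "JJ")],
    [("The", "DT"), ("voice", "NN"), ("quality", "NN"), ("is", "VBZ"),
     ("clear", "JJ")],
    [("Great", "JJ"), ("screen", "NN"), ("and", "CC"), ("battery", "NN")],
    [("The", "DT"), ("screen", "NN"), ("scratches", "VBZ"), ("easily", "RB")],
]


def frequent_noun_phrases(sentences, min_count=2):
    """Count maximal runs of consecutive nouns and keep the frequent ones."""
    counts = Counter()
    for sentence in sentences:
        run = []
        # The sentinel pair at the end flushes a run that closes the sentence.
        for word, tag in sentence + [("", "END")]:
            if tag.startswith("NN"):
                run.append(word.lower())
            elif run:
                counts[" ".join(run)] += 1
                run = []
    return [phrase for phrase, n in counts.most_common() if n >= min_count]


aspects = frequent_noun_phrases(tagged_reviews)
print(aspects)
```

Lowering `min_count` to 1 also returns rarer candidates such as "voice quality" and "battery"; the threshold trades recall for precision.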

Page 35: LINGUISTICA GENERALE E COMPUTAZIONALE

Aspect extraction using dependency grammar

Page 36: LINGUISTICA GENERALE E COMPUTAZIONALE

RESOURCES FOR SENTIMENT ANALYSIS

• Lexicons
– General Inquirer (Stone et al., 1966)
– OpinionFinder lexicon (Wiebe & Riloff, 2005)
– SentiWordNet (Esuli & Sebastiani, 2006)

• Annotated corpora
– Used in statistical approaches (Hu & Liu 2004, Pang & Lee 2004)
– MPQA corpus (Wiebe et al., 2005)

• Tools
– Algorithm based on minimum cuts (Pang & Lee, 2004)
– OpinionFinder (Wiebe et al., 2005)

Page 37: LINGUISTICA GENERALE E COMPUTAZIONALE

Lexical resources for Sentiment and Subjectivity Analysis

Overview

Page 38: LINGUISTICA GENERALE E COMPUTAZIONALE

Sentiment (or opinion) lexica

Page 39: LINGUISTICA GENERALE E COMPUTAZIONALE

Sentiment lexica

Page 40: LINGUISTICA GENERALE E COMPUTAZIONALE

Sentiment-bearing words

ICWSM 2008

• Adjectives (Hatzivassiloglou & McKeown 1997, Wiebe 2000, Kamps & Marx 2002, Andreevskaia & Bergler 2006)
– positive: honest, important, mature, large, patient
  • Ron Paul is the only honest man in Washington.
  • Kitchell’s writing is unbelievably mature and is only likely to get better.
  • To humour me my patient father agrees yet again to my choice of film

Page 41: LINGUISTICA GENERALE E COMPUTAZIONALE

Negative adjectives

• Adjectives
– negative: harmful, hypocritical, inefficient, insecure
  • It was a macabre and hypocritical circus.
  • Why are they being so inefficient?

Page 42: LINGUISTICA GENERALE E COMPUTAZIONALE

Subjective adjectives

• Adjectives
– Subjective (but not positive or negative sentiment): curious, peculiar, odd, likely, probable
  • He spoke of Sue as his probable successor.
  • The two species are likely to flower at different times.

Page 43: LINGUISTICA GENERALE E COMPUTAZIONALE

Other words

• Other parts of speech (Turney & Littman 2003; Riloff, Wiebe & Wilson 2003; Esuli & Sebastiani 2006)
– Verbs
  • positive: praise, love
  • negative: blame, criticize
  • subjective: predict
– Nouns
  • positive: pleasure, enjoyment
  • negative: pain, criticism
  • subjective: prediction, feeling

Page 44: LINGUISTICA GENERALE E COMPUTAZIONALE

Phrases

• Phrases containing adjectives and adverbs (Turney 2002; Takamura, Inui & Okumura 2007)
– positive: high intelligence, low cost
– negative: little variation, many troubles

Page 45: LINGUISTICA GENERALE E COMPUTAZIONALE

Creating sentiment lexica

• Humans

• Semi-automatic

• Fully automatic

Page 46: LINGUISTICA GENERALE E COMPUTAZIONALE

(Semi) Automatic creation of sentiment lexica

• Find relevant words, phrases, patterns that can be used to express subjectivity

• Determine the polarity of subjective expressions

Page 47: LINGUISTICA GENERALE E COMPUTAZIONALE

FINDING POLARITY IN CORPORA USING PATTERNS

Page 48: LINGUISTICA GENERALE E COMPUTAZIONALE

USING PATTERNS

• Lexico-syntactic patterns (Riloff & Wiebe 2003)

• way with <np>: … to ever let China use force to have its way with …

• expense of <np>: at the expense of the world’s security and stability

• underlined <dobj>: Jiang’s subdued tone … underlined his desire to avoid disputes …
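
A rough sketch of how the first two patterns above could be matched with regular expressions. The `<np>` slot is approximated by "the words that follow, up to the next punctuation mark", which is a simplification made for this example and not real noun-phrase chunking; the `underlined <dobj>` pattern needs a syntactic parse and is left out. The names `PATTERNS` and `match_patterns` are hypothetical.

```python
import re

# Trigger phrase followed by a slot. The slot is approximated by a run of
# word characters, apostrophes, and spaces (assumption of this sketch).
PATTERNS = {
    "way with <np>": re.compile(r"\bway with (\w[\w’' ]*)"),
    "expense of <np>": re.compile(r"\bexpense of (\w[\w’' ]*)"),
}


def match_patterns(sentence):
    """Return (pattern name, slot text) for every pattern that fires."""
    matches = []
    for name, regex in PATTERNS.items():
        found = regex.search(sentence)
        if found:
            matches.append((name, found.group(1).strip()))
    return matches


print(match_patterns("at the expense of the world’s security and stability"))
print(match_patterns("I bought an iPhone a few days ago."))
```

A sentence in which one of these patterns fires would then be treated as a candidate subjective expression.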

Page 49: LINGUISTICA GENERALE E COMPUTAZIONALE

DICTIONARY-BASED METHODS

Page 50: LINGUISTICA GENERALE E COMPUTAZIONALE

SEMI-SUPERVISED LEARNING (Esuli and Sebastiani, 2005)

Page 51: LINGUISTICA GENERALE E COMPUTAZIONALE

Corpora for Sentiment and Subjectivity Analysis

Overview

Page 52: LINGUISTICA GENERALE E COMPUTAZIONALE

Definitions and Annotation Scheme

• Manual annotation: human markup of corpora (bodies of text)

• Why?
– Understand the problem
– Create gold standards (and training data)

Wiebe, Wilson, Cardie LRE 2005
Wilson & Wiebe ACL-2005 workshop
Somasundaran, Wiebe, Hoffmann, Litman ACL-2006 workshop
Somasundaran, Ruppenhofer, Wiebe SIGdial 2007
Wilson 2008 PhD dissertation

Page 53: LINGUISTICA GENERALE E COMPUTAZIONALE

Overview

• Fine-grained: expression-level rather than sentence or document level

• Annotate
– Subjective expressions
– Material attributed to a source, but presented objectively

Page 54: LINGUISTICA GENERALE E COMPUTAZIONALE

Corpus

• MPQA: www.cs.pitt.edu/mqpa/databaserelease (version 2)

• English language versions of articles from the world press (187 news sources)

• Also includes contextual polarity annotations (later)

• Themes of the instructions:
– No rules about how particular words should be annotated.

– Don’t take expressions out of context and think about what they could mean, but judge them as they are used in that sentence.

Page 55: LINGUISTICA GENERALE E COMPUTAZIONALE

Gold Standards

• Derived from manually annotated data
• Derived from “found” data (examples):
– Blog tags (Balog, Mishne, de Rijke EACL 2006)
– Websites for reviews, complaints, political arguments
  • amazon.com (Pang and Lee ACL 2004)
  • complaints.com (Kim and Hovy ACL 2006)
  • bitterlemons.com (Lin and Hauptmann ACL 2006)
• Word lists (example):
– General Inquirer (Stone et al. 1966)

Page 56: LINGUISTICA GENERALE E COMPUTAZIONALE

SENTIMENT ANALYSIS IN NLTK

• See 6.1

Page 57: LINGUISTICA GENERALE E COMPUTAZIONALE

TOOLS

Page 58: LINGUISTICA GENERALE E COMPUTAZIONALE

OPINE

Page 59: LINGUISTICA GENERALE E COMPUTAZIONALE

OPINION SUMMARIES

Page 60: LINGUISTICA GENERALE E COMPUTAZIONALE

GOOGLE PRODUCTS

Page 61: LINGUISTICA GENERALE E COMPUTAZIONALE

READINGS

• Bo Pang & Lillian Lee, 2008
– Opinion Mining and Sentiment Analysis
– Foundations and Trends in Information Retrieval, v. 2, 1-2
– On the website

Page 62: LINGUISTICA GENERALE E COMPUTAZIONALE

ACKNOWLEDGMENTS

• Some slides borrowed from
– Janyce Wiebe’s tutorials
– Bing Liu’s tutorials
– Ronen Feldman’s IJCAI 2013 tutorial