
Sentiment classification
Chris Potts

Linguist 287 / CS 424P: Extracting Social Meaning and Sentiment, Fall 2010

Sep 28

1 Overview

Goals

• Introduce some data for you to experiment with.

• Get a feel for the properties of real-world sentiment corpora.

• Explore a wide range of ideas for features.

• Introduce a few classifier models and try to decide between them.

• Run some new experiments and try to get a grip on what they can tell us.

Plan

§1 Some available data (very diverse).

§2 Understanding the available data: imbalances, interdependencies, useful patterns.

§3 Features: lexical, contextual, and structural.

§4 Models: NaiveBayes and various regression models.

§5 Experiments: Training on one corpus, testing on another, to reward linguistically-informed features.

Some shortcomings

• Emphasis is again on positive vs. negative, though the data sets do not have this limitation, so you can push beyond what we do.

• Emphasis is on document classification, but the basic ideas should be useful at other levels (words, sentences, blogs, . . . ).

• Not an especially wide survey of features, for the sake of looking closely at a few, but Pang and Lee 2008 has a huge bibliography.


2 Some available data

2.1 Pang and Lee’s (2004) movie review data

Polarity data 2.0: http://www.cs.cornell.edu/people/pabo/movie-review-data

Rating: pos
Text: when _star wars_ came out some twenty years ago , the image of traveling throughout the stars has become a commonplace image .
when the millenium falcon moves throughout the stars , we see constellations , meteor showers , and cool space-ships .
when han solo goes light speed , the stars change to bright lines , going towards the viewer in lines that converge at an invisible point .
cool .
_october sky_ offers a much simpler image–that of a single white dot , traveling horizontally across the night sky .
[. . . ] (file pos/cv048_16828.txt)

Rating: neg
Text: “ snake eyes ” is the most aggravating kind of movie : the kind that shows so much potential then becomes unbelievably disappointing .
it’s not just because this is a brian depalma film , and since he’s a great director and one who’s films are always greeted with at least some fanfare .
and it’s not even because this was a film starring nicolas cage and since he gives a brauvara performance , this film is hardly worth his talents .
it’s worse than that .
[. . . ] (file neg/cv023_13847.txt)

Table 1: Excerpts from pos and neg reviews from Polarity data 2.0. The reviews are tokenized and formatted, one sentence per line. Sentence chunking with Adwait Ratnaparkhi’s MXTERMINATOR.

          Reviews   Sentences   Words       Vocab    Mean words per review
pos       1,000     32,937      820,885     36,806   820.89
neg       1,000     31,783      738,419     34,543   738.42
Overall   2,000     64,720      1,559,304   34,543   779.65

Table 2: Summary stats for Polarity data 2.0.

      5-star system   4-star system   grading system
pos   3.5 and up      3 and up        B or above
neg   2 or below      1.5 or below    C- or below

Table 3: Class definitions. The source HTML used a number of different rating systems and notations. The ratings were heuristically extracted. See the corpus’s README file for more information.


2.2 IMDB user-supplied reviews

Movie: The Neverending Story II: The Next Chapter (1990); Adventure,Family,Fantasy; . . .
Rating: 1 out of 10 stars
Summary: Awful, Awful, Awful
Review: For fans of the North and South series, this should never have been produced. Never, never, never never!! (If you have seen the first two Books and enjoyed them as most do, [. . . ])
Helpful: 6 out of 17 people
Date: 12 July 2003
Author: ur1328271   Location: Indianapolis IN

Movie: Death Becomes Her (1992); Comedy,Fantasy; . . .
Rating: 5 out of 10 stars
Summary: so-so film about 2 women who can’t stand aging
Review: Two women compete with each other, seeing who can stay the youngest looking. Both go to a beautiful witch who has a youth potion, but they get more than they bargained for. [. . . ]
Helpful: 3 out of 8 people
Date: 10 April 1999
Author: ur0070535   Location: Broken Bow, Oklahoma

Movie: Sliders Pilot (1995); Sci-Fi,Adventure,Fantasy; . . .
Rating: 10 out of 10 stars
Summary: This is AWESOME
Review: This is the greatest TV series ever! I hope it hits the shelves! A movie would be da bomb! The special f/x are so cool! Too bad the series died. Hope for a renewal!!
Helpful: 4 out of 5 people
Date: 25 August 2001
Author: ur0883211   Location: Vancouver, BC

Table 4: Short sample reviews from IMDB, lowest, middle, and highest rating categories.

Rating   Reviews          Words         Vocabulary   Mean words/review
1        124,587 (9%)     25,395,214    172,346      203.84
2        51,390 (4%)      11,755,132    119,245      228.74
3        58,051 (4%)      13,995,838    132,002      241.10
4        59,781 (4%)      14,963,866    138,355      250.31
5        80,487 (6%)      20,390,515    164,476      253.34
6        106,145 (8%)     27,420,036    194,195      258.33
7        157,005 (12%)    40,192,077    240,876      255.99
8        195,378 (14%)    48,723,444    267,901      249.38
9        170,531 (13%)    40,277,743    236,249      236.19
10       358,441 (26%)    73,948,447    330,784      206.31
Total    1,361,796        317,062,312   800,743      232.83

Table 5: IMDB basic stats by rating category. Positive reviews dominate. Neutral ones are longer.


2.3 OpenTable

(a) Restaurant information.

Review date        05/31/2010
Dine date          05/29/2010
Review             Everything was excellent, food, service and atmosphere.
Overall            5 out of 5 (Outstanding)
Ambiance           5 out of 5 (Outstanding)
Food               5 out of 5 (Outstanding)
Service            5 out of 5 (Outstanding)
Noise              1 out of 3 (Quiet)
Special features   romantic, special occasion

Review date        03/22/2010
Dine date          03/19/2010
Review             A mixed bag. The ambience and service were of high quality. Our shared mozarella/red pepper/arugula salad was decent and our entrees were well-prepared but horribly over-sized. The sides were disasters. [. . . ]
Overall            2 out of 5 (Fair)
Ambiance           4 out of 5 (Outstanding)
Food               2 out of 5 (Fair)
Service            4 out of 5 (Very Good)
Noise              2 out of 3 (Moderate)

(b) Sample reviews.

Table 6: Sample restaurant information plus two (of 60) user reviews with meta-data. For more details: http://reviews.opentable.com/0938/6648/reviews.htm.

Rating   Reviews         Words        Vocabulary   Mean words/review
1        9,352 (2%)      699,695      17,912       74.82
2        36,997 (8%)     2,507,147    34,818       67.77
3        73,064 (15%)    4,207,700    45,258       57.59
4        172,195 (35%)   7,789,649    64,143       45.24
5        197,757 (40%)   8,266,564    65,514       41.80
Total    489,365         23,470,755   116,406      47.96

Table 7: OpenTable basic stats by rating category. Positive reviews dominate.


2.4 Experience Project

Confession     I really hate being shy . . . I just want to be able to talk to someone about anything and everything and be myself. . . That’s all I’ve ever wanted.
Date           January 20th, 2010 at 12:38 PM
Username       brokenangelwishes
Demographics   16-17 years old female
Reactions      understand: 10; hugs: 1; just wow: 0; rock: 1; teehee: 2

Comment        I was really shy when I was younger. I got better when I entered the work field and gained confidence. I think you will grow out of it . :)
Date           January 20th, 2010 at 12:41 PM
Username       bigbadbear

Comment        I’m still really shy, and I want that too.. Find that person where I can talk to for endless hours about anything and everything. The person that asks the right questions, that keeps me talking and listening. [. . . ] You know. I might have found ’em.
Date           January 20th, 2010 at 12:52 PM
Username       demetrie1930

Confession     subconsciously, I constantly narrate my own life in my head. in third person. in a british accent. Insane? Probably
Date           July 8th, 2009 at 12:00 AM
Username       flicksy
Demographics   18-21 years old female
Reactions      understand: 0; hugs: 0; just wow: 1; rock: 7; teehee: 8

Comment        =D I do that too!! I sometimes do it with music as well. If it is insane, then I guess we’re both insane. XD YAY FOR INSANITY!!
Username       anagha777
Date           July 9th, 2009 at 4:47 PM

Table 8: Sample Experience Project confessions, with reaction data and sample comments.

Texts Words Vocab Avg. words per text

Confessions   31,675   3,129,463   47,001   98.79
Comments      90,194   3,106,037   51,553   34.44

(a) Basic word-level statistics.

Category Clicks

‘sorry, hugs’     3,733 (16%)
‘you rock’        3,781 (16%)
‘teehee’          3,545 (15%)
‘I understand’    11,277 (48%)

‘wow, just wow’ 916 (4%)

(b) Annotation choices.

Table 9: Basic statistics for the Experience Project corpus.


2.5 Quirky anecdotes

FMyLife (11,989 stories, 524,131 words, 7 categories (health, intimacy, kids, love, misc, money, work))

• Today, I’m starting my 28th year with 28 cents on my bank account. FML (category money; 26,618 ‘you totally deserved it’ clicks; 6,454 ‘I agree, your life sucks’ clicks)

• Today, I misspelled a word at work. I am a tattoo artist. FML (work; 29,353 ‘you totally deserved it’ clicks; 8,812 ‘I agree, your life sucks’ clicks)

My Life is Average (47,600 stories, 2,066,312 words)

• Today, I was bored so I went to whereswaldo.com. The page couldn’t be found. MLIA. (1,011 ‘average’ clicks; 142 ‘meh’ clicks)

• Today, my sister texted my mom “I’m in the splelling bee.” MLIA. (6,377 ‘average’ clicks; 600 ‘meh’ clicks)

It Made My Day (1,960 stories, 72,033 words)

• My father-in-law gave me a 10lb box of bacon for my birthday today. IMMD. (879 ‘thumbs-up’ clicks; 117 ‘thumbs-down’ clicks)

• Today I saw a hummer parked in two spaces, with a parking ticket on the window. IMMD. (2702 ‘thumbs-up’ clicks; 68 ‘thumbs-down’ clicks)

2.6 Other data

• UMass Amherst Sentiment Corpora (Chinese, English, German, Japanese), from Amazon (all), MyPrice (Chinese), Tripadvisor.com (English):

http://semanticsarchive.net/Archive/jQ0ZGZiM/readme.html

• The data and code from Potts 2010:

http://stanford.edu/~cgpotts/data/salt20/potts-salt20-data-and-code.zip

• The data and code from Potts and Schwarz 2010:

http://stanford.edu/~cgpotts/data/lilt/potts-schwarz-lilt-data-and-code.zip

• Twitter APIs: http://apiwiki.twitter.com/

• For much, much more: Pang and Lee 2008:§7.


3 Understanding the available data

Before doing any experiments or analysis, we should spend some time getting to know the corpora themselves.

• Do the meta-data make sense?

• How are the pieces of meta-data related?

• Are there quirks of these real-world corpora that we should be aware of?

3.1 Author and reader perspectives

Ratings   Ratings reflect author choices. Fig. 1 suggests that readers are well-attuned to the way ratings choices and linguistic choices are correlated.

[Scatterplot omitted. x-axis: Author’s rating (1-5); y-axis: Subject’s guess of author’s rating (1-5).]

Figure 1: Results of an experiment conducted with Amazon’s Mechanical Turk (Snow et al. 2008; Munro et al. 2010) in which participants read a short review from OpenTable.com and guessed which star rating, 1-5 stars, was assigned by the review’s author. The individual data points have been jittered so that they don’t all sit atop each other. The dark horizontal lines indicate the median guesses, with boxes surrounding 50% of the guesses. A linear model using the actual rating to predict participants’ guesses produces an excellent fit, with the average prediction just 0.81 stars from the empirical value (residual standard deviation = 0.81; R2 = 0.65).

Two perspectives   Experience Project reactions represent reader choices when we consider them relative to confessions, and they reflect author choices (loosely) when we consider them relative to comments.


3.2 Category imbalances

[Barplots omitted. Panels: IMDB (ratings 1-10), OpenTable overall (1-5), Chinese Amazon (1-5), Japanese Amazon (1-5), and Experience Project (reaction categories ‘sorry, hugs’, ‘you rock’, ‘teehee’, ‘I understand’, ‘wow, just wow’). x-axis: Category; y-axis: Reviews.]

Figure 2: Imbalances in the distribution of texts relative to categories. For the ratings corpora (panels 1-4), positive dominates. For Experience Project, sympathy and solidarity dominate.

[Barplots omitted. Panels: Overall, Food, Service, Ambiance (ratings 1-5), and Noise (1-Quiet, 2-Moderate, 3-Energetic). x-axis: Rating; y-axis: Percentage of reviews.]

Figure 3: OpenTable rating distributions. Positive reviews dominate in all categories. ‘Noise’ is fundamentally different, since it doesn’t have a standard preference ordering.

[Scatterplots omitted. Columns: Amazon, Tripadvisor, Goodreads. Top-row y-axis: Mean rating for product (1-5); bottom-row y-axis: Rating standard deviation for product (0-2); x-axis: Number of reviews by product.]

Figure 4: Star ratings and product review counts. High ratings and high review counts go hand-in-hand (top panels), and this results in a narrowing of opinions (bottom panels). This is presumably because products with a few negative reviews are purchased less and hence reviewed less. Movies are somewhat exceptional (see the leftmost panel in fig. 2), as are video games on Amazon (not pictured). Both are products that many people buy immediately, sight unseen.


3.3 The rating scale

OpenTable labels its five stars with the words ‘Poor’, ‘Fair’, ‘Good’, ‘Very good’, and ‘Outstanding’, to help align people’s intuitions with the scale. Tripadvisor asks a separate but related question that provides insight into how people view the scale (fig. 5). These things suggest that the rating scale is not totally ordered, but rather consists of two separate scales: positive and negative.

[Boxplot omitted: Tripadvisor rating–recommendation connection. x-axis: “Would I recommend this hotel to my best friend?” (no way! / probably not / most likely / absolutely!); y-axis: Rating (1-5).]

Figure 5: Users who both answered the question on the x-axis and gave a rating (y-axis) provide a window into how people conceptualize the rating scale. The dark horizontal lines indicate the median ratings, with boxes surrounding 50% of the ratings. The outliers could trace to people who treat the stars as a ranking system where 1 is the best.

[Plots omitted: pairwise distances between rating categories. Top-panel y-axis: Chi-squared; bottom-panel y-axis: KL divergence; x-axis: rating-category pair, ordered R2R3, R3R4, R1R2, R2R4, R4R5, R1R3, R3R5, R2R5, R1R5, R1R4.]

Figure 6: Category distances, for items with at least 200 tokens in both corpora. The top panel gives the distances according to their χ2 statistics (Kilgarriff and Rose 1998; Manning and Schütze 1999:171), and the bottom panel gives their KL-divergences. The orderings are the same in both cases, and they suggest that we might do well to treat the ratings scale as having three parts: the lowest of the low (1 star), the mushy middle (3-4 stars), and the best of the best (5 stars).


3.4 Relationships between ratings

[Histograms omitted. Panels compare ‘Overall’ with the other OpenTable ratings; x-axis: Rating difference (-4 to 4); y-axis: Reviews.]

(a) Comparisons with ‘Overall’. In each panel, the overall rating value is subtracted from the other rating value. Thus, a value of 0 indicates agreement between the two ratings for the review in question.

Overall, Food        0.82
Overall, Service     0.77
Overall, Ambiance    0.70
Food, Service        0.57
Food, Ambiance       0.56
Ambiance, Service    0.54

(b) Correlations.

Figure 7: OpenTable rating category comparisons. ‘Overall’ and ‘Food’ are highly correlated.

3.5 Helpfulness

[Plots omitted. Panels: English Amazon (300 reviews in each category) and German Amazon (180 reviews in each category). x-axis: product rating associated with review (jittered left-right for readability); y-axis: percentage of readers who found the review helpful. In the plot: only reviews with < 1000 words (eliminates some outliers) and ≥ 20 readers.]

Figure 8: Helpfulness. A few of the corpora have meta-data saying ‘X of Y people found this review helpful’, where Y is not the number of views, but rather the total number of people who selected ‘helpful’ or ‘unhelpful’. These plots show that there is a correlation between star ratings and helpfulness ratings: the higher the star rating, the more helpful people find the review. Review length is another useful predictor. For more on predicting helpfulness ratings, see Ghose et al. 2007; Danescu-Niculescu-Mizil et al. 2009.


3.6 Authorship

[Histogram omitted: standard deviation of ratings, by author, limited to authors with at least 10 reviews. x-axis: standard deviation (0-2, mean marked); y-axis: number of authors.]

Figure 9: Rating standard deviation by author. The upshot of this figure is that author names will be useful features, since individual authors tend to give the same rating repeatedly, and relatively few authors use the whole rating scale.

3.7 Narrative structure?

[Plots omitted. Panels: ‘the’ in Alice, ‘the’ in Alice (random samples), ‘the’ in Tripadvisor, ‘a’ in Alice, ‘a’ in Alice (random samples), ‘a’ in Tripadvisor. x-axis: sample size; y-axis: relative frequency.]

Figure 10: In texts with rich narrative structure, word frequencies are heavily dependent upon placement in the text (Baayen 2001). This is not globally true for collections of reviews. Reviews late in the reviewing history of a product are often influenced by previous ones (sometimes they are direct responses), but the impact this has on word-level features is not clear to me. The reviews do have useful internal structure, though (Pang and Lee 2004).


4 Features

4.1 Lexical features

4.1.1 Counting

Definition 1 (IMDB counts). Let C = {1 . . . 10} be the set of rating categories for the IMDB corpus, and let π be a linguistic type (e.g., a morpheme, word):

    countIMDB(π, c)  def=  the number of tokens of π in IMDB reviews in c ∈ C

Definition 2 (Category counts). Let T be a corpus partitioned by categories C, π a linguistic type, and Π the set of all linguistic types of the same class as π:

    countT,Π(c)  def=  Σ_{π∈Π} countT(π, c)

Definition 3. Let T be a corpus partitioned by categories C, π a linguistic type, and Π the set of all linguistic types of the same class as π:

i. The probability of π given c ∈ C:

    PrT,Π(π|c)  def=  countT(π, c) / countT,Π(c)

ii. The probability of c ∈ C given π:

    PrT,Π(c|π)  def=  PrT,Π(π|c) / Σ_{c′∈C} PrT,Π(π|c′)
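To make the counting definitions concrete, here is a minimal Python sketch (not from the handout’s own code) that turns per-category counts for a single word into the Pr(c|w) values of def. 3. The two-category example at the end reuses the disappoint(ed/ing) counts from table 10, so the values differ from the full ten-category table.

def pr_c_given_w(word_counts, total_counts):
    """word_counts[c] / total_counts[c]: tokens of the word / all tokens in category c."""
    pr_w_given_c = {c: word_counts[c] / total_counts[c] for c in word_counts}
    norm = sum(pr_w_given_c.values())              # def. 3, clause (ii)
    return {c: p / norm for c, p in pr_w_given_c.items()}

# disappoint(ed/ing), IMDB categories 1 and 10 only (counts from table 10):
print(pr_c_given_w({1: 8557, 10: 13570}, {1: 25395214, 10: 73948447}))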

Cat.   Count    Total        PrIMDB(w|c)   PrIMDB(c|w)
1      8,557    25,395,214   0.0003        0.10
2      4,627    11,755,132   0.0004        0.12
3      6,726    13,995,838   0.0005        0.14
4      7,171    14,963,866   0.0005        0.14
5      9,039    20,390,515   0.0004        0.13
6      10,101   27,420,036   0.0004        0.11
7      10,362   40,192,077   0.0003        0.08
8      10,064   48,723,444   0.0002        0.06
9      7,909    40,277,743   0.0002        0.06
10     13,570   73,948,447   0.0002        0.05

(a) Count gives the countIMDB(w, c) values, Total the countIMDB(c) values. For PrIMDB(w|c), divide Count values by corresponding Total values. For PrIMDB(c|w), divide PrIMDB(w|c) values by the sum of all the PrIMDB(w|c) values, as in def. 3.

[Plot omitted: Pr(c|w) for disappoint(ed/ing) (88,126 tokens) across IMDB ratings 1-10.]

(b) The black dots represent PrIMDB(c|w) values. The error bars around each point mark 95% confidence intervals. The horizontal line is the probability we would expect (always 0.1 for IMDB) if the word were equally probable in all categories.

Table 10: disappoint(ed|ing) in IMDB.


Definition 4 (EP token annotations). Let t be a token in the EP corpus and c an EP reaction category. Then R(t, c) is the number of c choices for the text containing t.

Definition 5 (EP counts). Let π be a linguistic type (e.g., morpheme, word), πt the set of tokens of π in the EP corpus, and c an EP reaction category:

    countEP(π, c)  def=  Σ_{t∈πt} R(t, c)
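A minimal Python sketch of defs. 4-5, under the assumption (mine, purely for illustration) that EP confessions are available as (tokens, reactions) pairs, where reactions maps each reaction category to its click count:

from collections import defaultdict

def ep_counts(texts, target):
    """count_EP(target, c) for every reaction category c (defs. 4-5)."""
    counts = defaultdict(int)
    for tokens, reactions in texts:
        n_target = sum(1 for t in tokens if t == target)   # tokens of the type
        for c, clicks in reactions.items():
            counts[c] += n_target * clicks                  # R(t, c) per token
    return counts

texts = [("i was so disappointed today".split(), {"hugs": 3, "understand": 5}),
         ("disappointed again".split(), {"rock": 1, "understand": 2})]
print(dict(ep_counts(texts, "disappointed")))   # {'hugs': 3, 'understand': 7, 'rock': 1}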

Cat.         Count   Total       PrEP(w|c)   PrEP(c|w)
hugs         108     2,153,134   0.00005     0.25
rock         34      1,330,084   0.00002     0.13
teehee       25      845,397     0.00003     0.15
understand   197     3,447,377   0.00006     0.29
just wow     29      838,059     0.00004     0.18

(a) Count gives the word-level reactions for the corpus (def. 5), and Total sums over all the words’ reaction counts, using def. 2 with countEP values. The PrEP(c|w) values are again obtained by dividing each PrEP(w|c) value by the sum of the PrEP(w|c) values.

[Plot omitted: Pr(c|w) for disappoint(ed/ing) (145 tokens) across the five EP reaction categories.]

(b) The PrEP(c|w) values at left, with 95% confidence intervals, and a horizontal line marking the expected frequency assuming an even distribution across the categories (0.20). The token count is the number of times w occurs, which is different from the countEP values defined in def. 5 but which conveys a sense for the corpus’s coverage of the item.

Table 11: disappoint(ed|ing) in Experience Project.

4.1.2 Expected ratings

Definition 6 (Expected rating). For a linguistic unit π in corpus T with ordered and arguably continuous ratings C:

    ER(π)  def=  Σ_{c∈C} c · PrT(c|π)
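A small Python sketch of def. 6; the distribution passed in at the end is made up purely to illustrate the arithmetic:

def expected_rating(pr_cw):
    """pr_cw: dict mapping numeric rating category -> Pr(c|w) (def. 3, clause ii)."""
    return sum(c * p for c, p in pr_cw.items())

print(expected_rating({1: 0.6, 2: 0.3, 10: 0.1}))   # 2.2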

[Histograms omitted. Panels: IMDB (expected ratings roughly -4 to 4) and Five-star reviews (roughly -2 to 2); y-axis: Frequency.]

Figure 11: Distribution of expected ratings.


[Plots omitted. Each panel shows Pr(c|w) across the re-centered rating scale (-4.5 to 4.5).
(a) Positive scalar modifiers: POS good (883,417 tokens), ER = 0.05; great (648,110 tokens), ER = 1.09; awesome (47,142 tokens), ER = 1.68; amazing (103,509 tokens), ER = 1.72.
(b) Negative scalar modifiers: depress(ed|ing) (18,498 tokens), ER = -0.4; NEG good (20,447 tokens), ER = -1.41; bad (368,273 tokens), ER = -1.46; terrible (55,492 tokens), ER = -2.18.]

Figure 12: Scalar modifiers with their expected ratings in the IMDB corpus. The y-axis values are the conditional probabilities of def. 3, clause (ii). The x-axis gives the 10-star scale re-centered at 0 to reflect its positive and negative substructure. Going top to bottom, the modifiers are arranged by the absolute value of their expected rating. This ordering corresponds strikingly well to our intuitive ordering based on strength. de Marneffe et al. (2010) use these orderings to pragmatically enrich the answers in indirect question–answer pairs like “Was it good?”; “It was terrible/great/memorable.”


4.1.3 Logistic regression

Expected ratings are a poor measure for intensives and non-intensives, which are prominent at both ends (intensives) or in the middle (non-intensives). Potts and Schwarz (2010) and Davis and Potts (2009) use a quadratic logistic regression to distinguish these:

    Pr(word) = logit⁻¹(intercept + rating + rating²)    (1)

Where the coefficient on rating² is positive, we see a U shape. Where it is negative, we see a turned U shape. The coefficient on rating can be used to assess whether the turning point of the curve is significantly different from 0, which in turn provides insights into positive and negative biases. (Using just rating is appropriate for items that have a linear relationship to the rating scale. The results yield much the same information as expected ratings, def. 6.)
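As a rough illustration of how one might fit (1), here is a sketch using a binomial GLM with a logit link in statsmodels; the per-category counts below are made up, and the variable names are mine, not Potts and Schwarz’s:

import numpy as np
import statsmodels.api as sm

ratings = np.array([-4.5, -3.5, -2.5, -1.5, -0.5, 0.5, 1.5, 2.5, 3.5, 4.5])
word_counts = np.array([200, 90, 80, 70, 60, 70, 80, 100, 150, 260])   # made-up counts of the word
total_counts = np.full(10, 100000)                                     # made-up total tokens per category

# Binomial GLM with a logit link: successes = word tokens, failures = all other tokens.
endog = np.column_stack([word_counts, total_counts - word_counts])
exog = sm.add_constant(np.column_stack([ratings, ratings ** 2]))

fit = sm.GLM(endog, exog, family=sm.families.Binomial()).fit()
intercept, b_rating, b_rating2 = fit.params
# b_rating2 > 0 suggests a U shape (intensive-like); b_rating2 < 0 an inverted U.
print(fit.params)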

[Plots omitted. Each panel shows Pr(c|w) across the re-centered rating scale (-4.5 to 4.5).
(a) Intensives: totally (68,436 tokens), ER = -0.54, R2 = 0.02; absolutely (76,772 tokens), ER = -0.31, R2 = 0.04; wow (23,987 tokens), ER = -0.3, R2 = 0.05; ! (1,300,044 tokens), ER = -0.13, R2 = 0.05.
(b) Non-intensives: but (2,403,130 tokens), ER = -0.07, R2 = -0.01; however (212,660 tokens), ER = 0.06, R2 = -0.03; quite (212,302 tokens), ER = 0.29, R2 = -0.03; somewhat (54,629 tokens), ER = 0.24, R2 = -0.05.]

Figure 13: The model in (1) distinguishes intensives from non-intensives by the sign of its quadratic coefficient (‘R2’), and it too returns an ordering based on the absolute values. The coefficients get bigger, and the U shapes deeper, as we move from top to bottom here.


4.2 Unstoppable stopwords

• Bird et al. (2009): “Stopwords usually have little lexical content, and their presence in a text fail to distinguish it from other texts.”

• Chung and Pennebaker (2007): “we are finding that the examination of often-overlooked “junk words” — more formally known as function words or particles — can provide powerful insight into the human psyche.”

4.2.1 Negation

• “Learned early on with the association of ‘unpleasant feelings’ ”.(Russell 1948, cited by Horn 1989:164)

• Associated with “falsity, absence, deprivation, and evil”. (Israel 2004:706)

[Plots omitted. Panels: IMDB (4,073,228 tokens; ratings 1-10), Five-star reviews (846,444 tokens; ratings 1-5), EP (60,764 tokens; reaction categories), Convote (18,320 tokens; No vs. Yes); y-axis: Pr(c|w).]

(a) English. The Convote corpus (Thomas et al. 2006) consists of selections from 8,121 speeches made by members of the U. S. Congress during the debate period prior to legislative votes. The categories are ‘Yes’ if the speaker voted in favor of the bill, and ‘No’ if he voted against the bill. The panel depicts the distribution of negative morphemes across these two categories, with no left out of the calculations to avoid undue influence from speakers simply saying “Vote no” and the like.

[Plots omitted. Panels: German Amazon (64,093 tokens) and Japanese Amazon (116,472 tokens); ratings 1-5; y-axis: Pr(c|w).]

(b) German and Japanese, from corpora of Amazon reviews.

Figure 14: Negation without negative polarity items (for reasons discussed in sec. 4.2.2). Negation is distributionally like a negative scalar modifier (fig. 12(b)). It seems very likely that this is cross-linguistically common.


4.2.2 Negative polarity items

Negative polarity items (NPIs) need to be in the scope of (very loosely speaking) a negative operator (see sec. 4.4 and sec. 4.5 for a better characterization). Israel (1996, 2001, 2004) distinguishes between ‘emphatic’ and ‘attenuating’ NPIs, which have different discourse functions:

The pragmatic functions which polarity items encode, emphasis and attenuation, reflect two antithetical ways in which scalar semantics may be deployed for rhetorical effect: emphatic expressions serve to mark commitment or emotional involvement in a communicative exchange, while attenuation both protects a speaker’s credibility and shows deference to a hearer by minimizing any demands on his credulity.

(Israel 2004:717)

Emphatic: any, ever, at all, whatsoever, give a damn
Attenuating: much, overmuch, long, all that, infinitival need

Table 12: Some frequent NPIs. These are pooled into their respective categories in fig. 15.

[Plots omitted. Panels: IMDB (387,758 tokens), Five-star reviews (66,384 tokens), EP (15,064 tokens); y-axis: Pr(c|w).]

(a) Emphatic.

[Plots omitted. Panels: IMDB (223,087 tokens), Five-star reviews (38,505 tokens), EP (3,874 tokens); y-axis: Pr(c|w).]

(b) Attenuating.

Figure 15: Negation with negative polarity items. The effects predicted by Israel are evident throughout the corpora: whereas emphatic NPIs enhance the negativity of negation (cf. fig. 14), attenuating NPIs soften it.


4.2.3 First-person features

Increased relative frequency of first-person pronouns positively correlates with the speaker being in an extreme emotional state (Campbell and Pennebaker 2003; Chung and Pennebaker 2007) (as well as with honesty, Newman et al. 2003).

[Plots omitted. Panels: IMDB (5,902,981 tokens), R2 = 0.02; EP (300,766 tokens); Amazon.de (210,420 tokens), R2 = 0.05; y-axis: Pr(c|w).]

Figure 16: First-person features. The English panels match ^(i|me|my|mine)$ and the German panel matches ^(ich|mich|mir|mein(e|er|em|en|es))$. In the review data, they look like mild intensives (fig. 13(a)), whereas the sympathy/solidarity effects predicted by Pennebaker are in evidence in the EP data.

4.2.4 Demonstratives

Lakoff’s central generalization is that affective demonstratives are markers of solidarity, indicating the speaker’s desire to involve the listener emotionally and foster a sense of closeness and shared sentiment. Comparable generalizations have since been made for Japanese (Naruoka 2003; Ono 1994; Davis and Potts 2009) and German (Potts and Schwarz 2010). See also Liberman (2008) for discussion in the political realm.

(2) This Henry Kissinger is really something!

(3) Okay NPR listeners, time to make that donation!

Figure 17: Demonstratives used with noun complements. Proximal demonstratives (this, these) resemble intensives (fig. 13(a)), whereas distal demonstratives (that, those) resemble non-intensives (fig. 13(b)). The data come from the 5-star reviews: the sentences containing demonstratives were tagged with the Stanford Tagger (Manning et al. 2008) and then chunked with Greenwood’s (2005) NP chunker, a Java implementation of Ramshaw and Marcus’s (1995) NP chunker, which is transformation-based in the sense described by Jurafsky and Martin (2009:§5.6). For details and an accuracy assessment, see Potts and Schwarz 2010.

[Plots omitted. Panels: Proximal det. (142,930 tokens), R2 = 0.08; Distal det. (15,403 tokens), R2 = -0.02; x-axis: Rating (-2 to 2); y-axis: Pr(c|w).]


4.2.5 Emoticons and IM abbreviations

[Plots omitted. Panels: Smiley (18,160 tokens), ER = 1.16; Frownie (2,293 tokens), ER = -0.8; x-axis: re-centered rating; y-axis: Pr(c|w).]

(a) IMDB.

[Plots omitted. Panels: Smiley and Frownie across the five EP reaction categories; y-axis: Pr(c|w).]

(b) EP.

Figure 18: Emoticons. These can reveal a lot about emotion, so it is worth handling them specially when tokenizing.

An emoticon identifier (at least for Western emoticons; http://en.wikipedia.org/wiki/List_of_emoticons):

(^|\s)                       # boundary (\b unlikely to work)
(
  [:;=8]                     # eyes
  [\-o]?                     # optional nose
  [\)\]\(\[dDpP/\:\}\{@]     # mouth
  |                          ##### Switch orientation.
  [\)\]\(\[dDpP/\:\}\{@]     # mouth
  [\-o]?                     # optional nose
  [:;=8]                     # eyes
)
(\s|$)                       # boundary (\b unlikely to work)

Improvements extremely welcome!
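Here is one way the pattern might be wrapped up in Python (my sketch, not part of the handout), using re.VERBOSE so the comments can stay in place:

import re

EMOTICON = re.compile(r"""
    (^|\s)                      # boundary (\b unlikely to work)
    (
      [:;=8]                    # eyes
      [\-o]?                    # optional nose
      [\)\]\(\[dDpP/\:\}\{@]    # mouth
      |                         # switch orientation
      [\)\]\(\[dDpP/\:\}\{@]    # mouth
      [\-o]?                    # optional nose
      [:;=8]                    # eyes
    )
    (\s|$)                      # boundary
    """, re.VERBOSE)

def emoticons(text):
    """Return the emoticon strings found in text."""
    return [m.group(2) for m in EMOTICON.finditer(text)]

print(emoticons("great movie :) but the ending ... :-("))   # [':)', ':-(']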

4.3 Dependency structures

These next few subsections explore using dependency structures to get at sentiment information. I focus on a few features related to scope, but the structures provide a lot of other useful information as well. Throughout, I use the Stanford dependencies (de Marneffe and Manning 2008a,b), which, most notably, map a lot of prepositional relations to edge relations rather than nodes (very Finnish-like). This turns out to be really useful for doing what I’d like to do.

The Stanford Parser includes a dependency parser (Klein and Manning 2003a,b). The output for fig. 19(d) below looks like this (the indices correspond to linear order):

[det(movie-2, the-1), nsubj(good-6, movie-2), cop(good-6, was-3), neg(good-6, not-4), advmod(good-6, very-5)]


[Dependency-tree diagrams omitted. The example sentences:
(a) the movie was very good .
(b) i always enjoy horror movies .
(c) no good musician would play elevator music .
(d) the movie was not very good .
(e) i rarely enjoy horror movies .
(f) her instrument was tuned by every musician .
(g) the reviews said that the movie would be good , but it was n't .
(h) i predicted that it would be outstanding .]

Figure 19: Stanford dependency trees.


4.4 Scope

The scope of an operator (Carpenter 1997:§7) determines its domain of influence. The syntactic notion of scope is roughly: for a node n, the scope of n is the set of all subtrees rooted at the sisters of n, perhaps with some fudging concerning prepositions and determiners. The semantic notion of scope doesn’t have such an easy definition because it traces ultimately to the meanings of the items involved. Nonetheless, we can approximate semantic scope with syntactic scope.

Pronominal binding (Heim and Kratzer 1998:§5-10) and NPIs are useful probes for scope:

(4) [Every cyclist]i wore hisi helmet.

(5) Hisi helmet was worn by [no cyclist]i.

(6) Every cyclist affirmed that hei wore a helmet.

(7) ∗ [Every cyclist]i wore hisi helmet and hei had safety glasses too.

(8) No cyclist has ever climbed that mountain.

(9) Under no circumstances should you ever lie under oath.

(10) ∗Sam has ever climbed no mountain.

(11) During [every cyclist]i ’s break hei drank a lot of water.

[Tree diagrams omitted.]

(a) Parse trees.

(b) Dependency trees. ‘rel’ should exclude certain non-scope relations. See line 1 of the algorithm in sec. 4.7.

Figure 20: Scope configurations. These roughly approximate the semantic notion that I am after. The top two trees are the prototypical situations. The middle tree on the left and the bottom tree on the right allow determiners to control the sisters of their mothers. The bottom tree on the left makes prepositions invisible. This needn’t be done for the Stanford dependencies, which treat prepositions as edge relations. Neither set of trees handles (11) right. Room for improvement!


4.5 Monotonicity in language

Monotonicity is a fundamental notion in linguistic meaning. For overviews: Ladusaw 1980; Keenan 2002; van Benthem 2008. MacCartney (2009) generalized the monotonicity calculus of Sánchez-Valencia 1991 to create a system of natural logic, which he implemented and applied to textual inference (MacCartney and Manning 2007, 2008, 2009). (The 2009 paper is ideal for linguists.)

Definition 7 (Upward monotonicity). An operator δ is upward monotone, ⇑δ, iff for all expressions α in the domain of δ:

    if α ⊆ β, then (δα) ⊆ (δβ)

Definition 8 (Downward monotonicity). An operator δ is downward monotone, ⇓δ, iff for all expressions α in the domain of δ:

    if α ⊆ β, then (δβ) ⊆ (δα)

(12) a. A student took notes.

b. A semantics student took notes.

c. A student took good notes.

(13) a. No student took notes.

b. No semantics student took notes.

c. No student took good notes.

(14) a. Every student took notes.

b. Every semantics student tooknotes.

c. Every student took good notes.

(15) a. Few students took notes.

b. Few semantics students took notes.

c. Few students took good notes.

Quantificational determiners can have different monotonicity properties for their first and second arguments. I won’t try to account for this in what follows.

(16) poodle ⇒ dog

a. non-dog ⇒ non-poodle

b. hungry poodle ⇒ hungry dog

(17) tango ⇒ dance

a. fail to dance ⇒ fail to tango

b. manage to tango ⇒ manage to dance

(18) superb ⇒ great ⇒ good
not good ⇒ not great ⇒ not superb

(19) scalding ⇒ hot ⇒ warm
not warm ⇒ not hot ⇒ not scalding

(20) necessarily ⇒ possibly
not possibly ⇒ not necessarily


4.6 Veridicality

Definition 9 (Veridical). An operator δ is veridical iff, for any α in the domain of δ, δα entails α.

Definition 10 (Non-veridical). An operator δ is non-veridical iff, for any α in the domain of δ, δα does not entail α.

Definition 11 (Anti-veridical). An operator δ is anti-veridical iff, for any α in the domain of δ, δα entails ¬α.

Downward entailing operators are anti-veridical. Many epistemic operators are non-veridical.

(21) They {say/claim/believe/assert} it is good, but it isn’t.

(22) The idea {seems/appears} to be good, but it isn’t.

(23) # They {know/regret/understand} it is good, but it isn’t.

4.7 Scope marking

Here is a procedure for monotonicity marking. It is the same for veridicality marking, except that we change line 2 of MONOTONICITYMARKING and line 4 of DOWNWARDDAUGHTERS so that they distinguish veridical from non-veridical rather than upward from downward.

MONOTONICITYMARKING(T)
   ▷ Input:
   ▷   T: A dependency graph in which the nodes have at least the attributes
   ▷   word (string), index (integer), and pol (values: +, −)
   ▷ Output: T with monotonicity-marking
 1  Nonscope = {dep, conj_but, conj_and, parataxis, advcl, rcmod}    ▷ Will block propagation.
 2  for n ∈ Nodes(T), set n[pol] = − if ⇓n, else set n[pol] = +      ▷ ⇓ is defined in def. 8
 3  for m ∈ Nodes(T)
 4      for n ∈ DOWNWARDDAUGHTERS(m)
 5          m[pol] = −
 6          for d ∈ Daughters(m) − {n}
                ▷ Mark only rightwards and not through non-scope relations.
 7              if Index(n) < Index(d) and EdgeRel(m, d) ∉ Nonscope
 8                  then d[pol] = −
                    ▷ Spread the negativity to the whole subtree.
 9                  for d′ ∈ Nodes(T)
10                      if Path(d, d′) and Index(n) < Index(d′)
11                          then d′[pol] = −

DOWNWARDDAUGHTERS(n)
 1  ProjRels = {det, amod}
 2  for d ∈ Daughters(n)
        ▷ Conjunct 2 spreads negativity found on determiners, as in the lower tree in fig. 20(b).
 3      if pol[d] = − or (∃d′ ∈ Daughters(d) : EdgeRel(d, d′) ∈ ProjRels and pol[d′] = −)
 4          then dd ← dd ∪ {d}
 5  return dd
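A Python sketch of the procedure over a toy dependency-graph representation; the Node class and the DOWNWARD word list stand in for the real lexicon of def. 8 and are my assumptions, not part of the handout or the Stanford tools:

from dataclasses import dataclass, field

NONSCOPE = {"dep", "conj_but", "conj_and", "parataxis", "advcl", "rcmod"}
PROJ_RELS = {"det", "amod"}
DOWNWARD = {"not", "n't", "no", "never", "rarely"}   # stand-in for the downward-monotone lexicon

@dataclass
class Node:
    word: str
    index: int
    pol: str = "+"
    deps: list = field(default_factory=list)   # list of (relation, child) pairs

def daughters(n):
    return [child for _, child in n.deps]

def rel(m, d):
    return next(r for r, child in m.deps if child is d)

def subtree(n):
    nodes = [n]
    for d in daughters(n):
        nodes.extend(subtree(d))
    return nodes

def downward_daughters(m):
    return [d for d in daughters(m)
            if d.pol == "-" or any(rel(d, d2) in PROJ_RELS and d2.pol == "-"
                                   for d2 in daughters(d))]

def monotonicity_marking(root):
    nodes = subtree(root)
    for n in nodes:                                  # line 2: lexical marking
        n.pol = "-" if n.word.lower() in DOWNWARD else "+"
    for m in nodes:                                  # lines 3-11: propagation
        for n in downward_daughters(m):
            m.pol = "-"
            for d in daughters(m):
                if d is n or d.index < n.index or rel(m, d) in NONSCOPE:
                    continue
                for d2 in subtree(d):                # spread '-' through the subtree
                    if d2.index > n.index:
                        d2.pol = "-"
    return nodes

# fig. 19(d): "the movie was not very good"
the, movie, was = Node("the", 1), Node("movie", 2), Node("was", 3)
not_, very, good = Node("not", 4), Node("very", 5), Node("good", 6)
movie.deps = [("det", the)]
good.deps = [("nsubj", movie), ("cop", was), ("neg", not_), ("advmod", very)]
for n in sorted(monotonicity_marking(good), key=lambda n: n.index):
    print(n.word, n.pol)   # 'not', 'very', and 'good' come out '-'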


[Eight scope-marked dependency trees, panels (a)-(h), omitted.]

Figure 21: Dependency trees with scope marking. The black nodes are marked as in the scope of a downward entailing operator. The diamond nodes are marked as in the scope of a non-veridical operator. A black diamond is thus an anti-veridical marking.


4.8 Thwarted expectations

[Boxplots omitted: Inquirer Pos:Neg ratios for the Pang & Lee data. neg panel values: 0.25, 0.92, 1.16, 1.47, 2.29; pos panel values: 0.42, 1.15, 1.53, 2.06, 3.40.]

Figure 22: Inquirer Pos:Neg ratios obtained by counting the terms in the review that are classified as Positiv or Negativ in the Harvard Inquirer (Stone et al. 1966).

i had been looking forward to this film since i heard about it early last year , when matthew perry had just signed on . i’m big fan of perry’s subtle sense of humor , and in addition , i think chris farley’s on-edge , extreme acting was a riot . so naturally , when the trailer for " almost heroes " hit theaters , i almost jumped up and down . a soda in hand , the lights dimming , i was ready to be blown away by farley’s final starring role and what was supposed to be matthew perry’s big breakthrough . i was ready to be just amazed ; for this to be among farley’s best , in spite of david spade’s absence . i was ready to be laughing my head off the minute the credits ran . sadly , none of this came to pass . the humor is spotty at best , with good moments and laughable one-liners few and far between . perry and farley have no chemistry ; the role that perry was cast in seems obviously written for spade , for it’s his type of humor , and not at all what perry is associated with . and the movie tries to be smart , a subject best left alone when it’s a farley flick . the movie is a major dissapointment , with only a few scenes worth a first look , let alone a second . perry delivers not one humorous line the whole movie , and not surprisingly ; the only reason the movie made the top ten grossing list opening week was because it was advertised with farley . and farley’s classic humor is widespread , too . almost heroes almost works , but misses the wagon-train by quite a longshot . guys , let’s leave the exploring to lewis and clark , huh ? stick to " tommy boy " , and we’ll all be " friends " .

Table 13: An example of thwarted expectations. This is a negative review. Inquirer positive terms are in bold, and Inquirer negative terms are underlined. There are 20 positive terms and six negative ones, for a Pos:Neg ratio of 3.33.

Proposed feature   Create a real-valued feature that is the Pos:Neg ratio if that ratio is below 1 (lower quartile for the whole Pang & Lee data set) or above 1.76 (upper quartile), else 1.31 (the median). The goal is to single out ‘imbalanced’ reviews as potentially untrustworthy at the level of their unigrams. (For a similar idea, see Pang et al. 2002.)
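A minimal sketch of the proposed feature in Python, assuming Inquirer Positiv/Negativ term sets have already been loaded (they are not included here); the thresholds are the quartile values given above:

def thwarted_expectations_feature(tokens, pos_terms, neg_terms,
                                  lower=1.0, upper=1.76, median=1.31):
    """Pos:Neg ratio, kept only when it falls in the outer quartiles."""
    pos = sum(1 for t in tokens if t in pos_terms)
    neg = sum(1 for t in tokens if t in neg_terms)
    ratio = pos / max(neg, 1)                 # avoid division by zero
    return ratio if (ratio < lower or ratio > upper) else median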


5 Models

Throughout, we classify document x from corpus T by picking the class c that maximizes PrT (c|x).

5.1 Naive Bayes

Definition 12 (Naive Bayes). T is a corpus with categories C, and x = {π1 . . . πn} is a document, where the πi are features of type Π:

    PrT,Π(c|x)  def=  PrT(c) · Π_{i=1..n} PrT,Π(πi|c)

• PrT(c) = count(c) / Σ_{c′∈C} count(c′)   (count is defined in def. 1)

• PrT,Π(π|c) (as in def. 3, clause (i))

Comment: The Naive Bayes model is easy to build, but it makes extremely strong assumptions about the independence of the features. We rarely fully appreciate the ways in which our features interact, though, and the unnoticed dependencies can impact performance.
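A minimal Python rendering of def. 12 with add-one smoothing (the smoothing is my addition, not part of the definition):

import math
from collections import defaultdict

def train_nb(docs):
    """docs: list of (tokens, category) pairs."""
    cat_counts = defaultdict(int)
    word_counts = defaultdict(lambda: defaultdict(int))
    vocab = set()
    for tokens, c in docs:
        cat_counts[c] += 1
        for w in tokens:
            word_counts[c][w] += 1
            vocab.add(w)
    return cat_counts, word_counts, vocab

def classify_nb(tokens, cat_counts, word_counts, vocab):
    n_docs = sum(cat_counts.values())
    best, best_score = None, float("-inf")
    for c in cat_counts:
        total = sum(word_counts[c].values())
        score = math.log(cat_counts[c] / n_docs)             # log Pr(c)
        for w in tokens:                                      # sum of log Pr(w|c)
            score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best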

5.2 Unordered categorical regression (MaxEnt)

Definition 13 (MaxEnt).

    P(c|x) = (1 / Z(x)) exp( λc + Σ_{i=1..n} λc,i fc,i(x, c) )

where Z sums over all the exponentiated terms for each c.

• fc,i(x, c′) = 1 if c = c′ and x contains an instance of f; 0 otherwise

• The weights are learned via iterative optimization.

Comment: MaxEnt classifiers are a species of logistic regression model. Indeed, following Hatzivassiloglou and McKeown (1997), we used a logistic regression fit as a classifier in the ‘Sentiment lexicons’ discussion. They are more likely than NaiveBayes to deal with the feature-sets that you come up with for sentiment analysis. The problem is rather that they will do too good a job, massively overfitting to your data (see sec. 7).
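For concreteness, a small Python sketch of the scoring step in def. 13 with made-up weights; in practice the weights come from an optimizer (e.g., MALLET or the Stanford classifier), not from code like this:

import math

def maxent_probs(features, weights, bias, classes):
    """features: set of active feature names; weights[c][f] and bias[c] are floats."""
    scores = {c: bias[c] + sum(weights[c].get(f, 0.0) for f in features)
              for c in classes}
    z = sum(math.exp(s) for s in scores.values())      # normalizer Z(x)
    return {c: math.exp(s) / z for c, s in scores.items()}

probs = maxent_probs({"amazing", "!"},
                     weights={"pos": {"amazing": 1.2, "!": 0.3},
                              "neg": {"amazing": -0.9, "!": 0.1}},
                     bias={"pos": 0.0, "neg": 0.0},
                     classes=["pos", "neg"])
print(probs)   # higher probability for 'pos' with these made-up weights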


5.3 Ordered categorical regression

This is probably appropriate for data with definitely ordered rating scales like IMDB (though take care with the scale — it probably isn’t conceptually a total ordering for users, but rather more like a pair of scales, positive and negative). The form of the model is as with MaxEnt, except that one fits a series of two-way models. For n categories:

    P(c > 1 | x), . . . , P(c > n−1 | x)

Probabilities for the categories:

    P(c = k | x) = P(c > k−1 | x) − P(c > k | x)

I don’t know whether any classifier packages can build these models, but R users can fit smaller models using polr (from the MASS library).
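A small sketch of the final step, turning the cumulative probabilities from the series of two-way models into per-category probabilities; the cumulative values in the example are made up:

def category_probs(p_greater):
    """p_greater[k] = P(c > k+1 | x) for k = 0..n-2 (n categories)."""
    cumulative = [1.0] + list(p_greater) + [0.0]    # P(c > 0) = 1, P(c > n) = 0
    return [cumulative[k] - cumulative[k + 1] for k in range(len(cumulative) - 1)]

print(category_probs([0.9, 0.7, 0.4, 0.1]))   # five categories; sums to 1.0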

5.4 Multi-level regression

Multilevel (hierarchical, mixed effects) models allow you to model variation at multiple levels. For example, if you are modeling reviews with linguistic features, you might include movie name, genre, reviewer, etc., as higher-level features. Of course, you could create a bunch of classifiers, one for each movie (genre, reviewer, etc.), but the advantage of multilevel models is that they can often deal well with sparse areas of the data, by relativizing their estimates to the entire population. Mixed-effects models are increasingly important in the social sciences (Bates 2005; Gelman and Hill 2007; Baayen et al. 2008; Jaeger 2008), but I know of less work with them in NLP. A local exception: Finkel and Manning 2010.

[Panels omitted: damn in IMDB, multilevel analysis, fitted values by user. Sixteen per-user panels, each plotting fitted probabilities across the re-centered rating scale, with the user’s review count (N) and expected rating (ER).]

Figure 23: A portrait of individual variation: damn as used by 16 different users in a collection of reviews from IMDB.com. The panels depict the estimates from a fitted multi-level model in which the intercept and all the predictors are allowed to vary by user. Some users swear only when happy, others only when sad, others at either extreme, and still others just whenever.


5.5 Other approaches

Restricting attention to sentiment analysis:

• Pang et al. (2002) also look at support vector machines (SVMs).

• Druck et al. 2008 briefly used generalized expectation criteria to classify the Pang & Lee data. (This is a semi-supervised method in which a small amount of expert labeled data is used to build the model, which can be a MaxEnt-type model.)

• Wilson et al. (2005) use AdaBoost in the context of polarity lexicon construction.

5.6 Tools

• MALLET: http://mallet.cs.umass.edu/ (Variety of models. Simple command-line interface. Builds classifiers from csv files or from directories. Less flexible than the Stanford classifier about how you define features. Excellent support for random train/test splits, assessment, and inspection. Mallet 2.0.5 has support for Generalized Expectation Criteria and topic modeling.)

• Stanford Classifier: http://nlp.stanford.edu/software/classifier.shtml (Variety of models. Simple command-line interface. Builds classifiers from tab-separated files. ‘prop’ files let you specify a lot of information about the features to use, the optimizer, and the assessment values and model information to return.)

These are also recommended, though we’ve not used them much:

• NLTK: http://www.nltk.org/ (I use NLTK all the time, but I’ve not used the classifiers. I’d be curious to hear about your experiences.)

• LingPipe: http://alias-i.com/lingpipe/ (I’ve not used this, but the documentation throughout LingPipe is incredibly detailed and well-written. Many of the pages include mini-tutorials.)

6 How hard is sentiment?

Data-sets: Pang & Lee (sentiment, sec. 2.1), 20_newsgroups baseball vs. hockey subcategories, and a random selection from spam vs. ham. For each corpus, 1000 texts in each of the two categories.

Definition 14 (Mutual information between classes and words).

I(C; π) = H(C) − H(C | π)

        = ∑_{c ∈ C} ∑_{x ∈ {0,1}} P(c, f_π(x)) log₂ [ P(c, f_π(x)) / (P(c) P(f_π(x))) ]

where f_π is a function from {0,1} into counts: f_π(1) is the number of tokens of π in the corpus and f_π(0) is the number of non-π tokens in the corpus.
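One way to compute this quantity for a single word in R, assuming the documents are available as plain strings; the function, tokenization, and toy data are invented for illustration.

## Mutual information between the class label and the presence of a word,
## computed from token counts as in Definition 14. Names are hypothetical.
word_class_mi <- function(word, pos_docs, neg_docs) {
  count_word <- function(docs, w) {
    toks <- unlist(strsplit(tolower(docs), "[^a-z']+"))   # crude tokenizer
    c(yes = sum(toks == w), no = sum(toks != w))
  }
  ## 2x2 table of counts: class x (word token vs. non-word token)
  tab <- rbind(pos = count_word(pos_docs, word),
               neg = count_word(neg_docs, word))
  p  <- tab / sum(tab)                    # joint P(c, f_w(x))
  pc <- rowSums(p)                        # marginal P(c)
  px <- colSums(p)                        # marginal P(f_w(x))
  sum(p * log2(p / outer(pc, px)), na.rm = TRUE)
}

pos_docs <- c("a great great movie", "pretty good")
neg_docs <- c("an awful movie", "not great , just dull")
word_class_mi("great", pos_docs, neg_docs)

## Ranking all candidate unigrams by this score and keeping the top n gives
## the "vocabulary size" used in figure 25.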


[Figure 24 appears here: three panels (Pang & Lee; 20 newsgroups: baseball vs. hockey; Spam vs. ham), each plotting mean accuracy against training percentage with separate Train and Test curves, under the heading "Unigram frequency".]

Figure 24: MaxEnt model with unigram frequencies as percentages. 10 random train-test splits at each training percentage level.

[Figure 25 appears here: two rows of panels, (a) NaiveBayes and (b) MaxEnt, each with sub-panels for panglee, newsgroups-baseballhockey, and spamham, plotting mean accuracy against vocabulary size (roughly 50 up to 80,967), with separate curves for feature presence (Pres.) and feature frequency (Freq.).]

Figure 25: A vocabulary of size n means that the features were the top n unigrams by mutual information with the class label. These results support Pang et al.’s (2002) suggestion that feature presence is, unusually, better for sentiment. See also figures 1–4 of McCallum and Nigam 1998.

It looks like sentiment is pretty hard.


7 Experiments

These experiments were done with MALLET (McCallum 2002) and the Stanford Classifier (Klein and Manning 2003c).

7.1 Pang et al. (2002)

    Features                # of features   frequency or presence?   NB     ME     SVM
(1) unigrams                16165           freq.                    78.7   N/A    72.8
(2) unigrams                ”               pres.                    81.0   80.4   82.9
(3) unigrams+bigrams        32330           pres.                    80.6   80.8   82.7
(4) bigrams                 16165           pres.                    77.3   77.4   77.1
(5) unigrams+POS            16695           pres.                    81.5   80.4   81.9
(6) adjectives              2633            pres.                    77.0   77.7   75.1
(7) top 2633 unigrams       2633            pres.                    80.3   81.0   81.4
(8) unigrams+position       22430           pres.                    81.0   80.1   81.6

Their caption: “Figure 3: Average three-fold cross-validation accuracies, in percent. Boldface: best performance for a given setting (row). Recall that our baseline results ranged from 50% to 69%.”

[The reproduced excerpt also includes Pang et al.’s description of the set-up: 700 positive and 700 negative reviews, three-fold cross-validation, no stemming or stoplists, negation tagging after Das and Chen (2001) (a NOT tag added to every word between a negation word and the following punctuation mark), the 16,165 unigrams occurring at least four times and the 16,165 most frequent bigrams as candidate features, and their discussion of feature frequency vs. feature presence and of why sentiment seems harder than topic-based classification.]

Table 14: Results from Pang et al. 2002 (their Figure 3).

[Figure 26 appears here: panels for (a) MaxEnt and (b) Naive Bayes, each with unigram-presence and unigram-frequency sub-panels, plotting mean accuracy against training percentage with separate Train and Test curves.]

Figure 26: Reproduction of some of their results. The accuracy numbers are higher than theirs, but we have a serious over-fitting problem.


7.2 New test sets

Randomly selected subsets of the IMDB corpus (movies and TV) and of a collection of Amazon reviews (lots of different products). For IMDB, a review is positive if its rating is at least 8, and negative if it is at most 3. For Amazon, a review is positive if its rating is at least 4, and negative if it is at most 2:

https://www.stanford.edu/class/cs424p/restricted/data/imdb-subset.zip
https://www.stanford.edu/class/cs424p/restricted/data/amazon-subset.zip
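A tiny R sketch of this labeling rule; the rating vectors are invented for illustration, and middling ratings map to NA and are dropped.

## Threshold-based labeling for the two new test sets (hypothetical inputs).
label_imdb   <- function(rating) ifelse(rating >= 8, "pos",
                                        ifelse(rating <= 3, "neg", NA))
label_amazon <- function(rating) ifelse(rating >= 4, "pos",
                                        ifelse(rating <= 2, "neg", NA))

label_imdb(c(10, 8, 5, 3, 1))    # "pos" "pos" NA "neg" "neg"
label_amazon(c(5, 4, 3, 2, 1))   # "pos" "pos" NA "neg" "neg"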

                 Reviews                 Mean words/review   Mean sentences/review
Pang & Lee       1,000 pos; 1,000 neg    746.34              32.36
IMDB subset      600 pos; 600 neg        238.67              12.14
Amazon subset    500 pos; 500 neg        162.22              9.79

Table 15: Two new test sets (Pang & Lee included for comparison).

[Figure 27 appears here: three-way Venn diagram of the vocabularies. Totals: Pang & Lee 50,920; IMDB subset 19,689; Amazon subset 14,637. Region sizes: Pang & Lee only 33,514; IMDB only 5,020; Amazon only 3,910; Pang & Lee and IMDB only 6,971; Pang & Lee and Amazon only 3,029; IMDB and Amazon only 292; all three 7,406.]

Figure 27: Vocabulary comparisons. The numbers indicate the sizes of the various regions. (I tried to get the regions roughly to scale but did not uniformly succeed.) There are 7406 words at the intersection of all three test-sets.

Additional structure I tokenized the new subsets using (what I think is) the same method that Pang and Lee used, and then I parsed them into dependency structures using the Stanford dependencies. I have not done a serious error analysis, but my impression is that the parser did a passable job and would have done a terrific job if there had been more reliable end-of-sentence punctuation. Many of the major mix-ups come from heroic attempts to merge multiple clearly free-standing sentences.

Plans Use Pang & Lee to build models and then test them separately on both of the two other corpora. Also, just for fun: train on Amazon (the smallest of the three along every dimension, and the most diverse in terms of products) and then test on the other two. This should penalize the over-fitting seen in fig. 26. It should also reward domain-independent linguistic features.


7.3 Models and assessments

7.3.1 Baseline: Unigrams (presence and frequency)

Test set     Unigrams   Unigrams + thwarting   Unigrams + scope-marking   Unigrams + thwarting+scope-marking
Amazon       0.679      0.698                  0.674                      0.685
IMDB         0.800      0.826                  0.805                      0.830

(a) Training on Pang & Lee.

Test set     Unigrams   Unigrams + thwarting   Unigrams + scope-marking   Unigrams + thwarting+scope-marking
Pang & Lee   0.634      0.626                  0.607                      0.610
IMDB         0.688      0.694                  0.665                      0.677

(b) Training on Amazon.

Table 16: Results, using ‘feature presence’ rather than ‘feature counts’ throughout. All standard errors effectively 0.

Assessment

• The thwarted expectation feature gives the most reliable boost.

• The scope-marking feature is less clearly helpful. I have faith in it, though, and plan to experiment with (i) trying to get more accurate dependency parses and (ii) further refining the scope-marking algorithm.

• Scope-marking and thwarted expectation might not play well together, since scope-marking takes care of some of the apparent thwarting, by accounting for the fact that some wrong-polarity terms are not speaker commitments.

• The scope-marking feature can result in sparseness, which is problematic for the smaller Amazon data, especially when it is used for training.

• The big jumps in performance will likely come from increased attention to the structures themselves. Some quick things to try:

– Immediate dependencies like advmod(adj, adv) (see the sketch after this list).

– Interactions between 1st person subjects and their clausemate predications.

– Marking other environments that could be relevant to commitment.
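As a rough illustration of the first item above, here is a hypothetical R sketch that turns plain-text Stanford typed-dependency output (lines like advmod(good-7, very-6)) into features of the form advmod(adj, adv); the regular expression is deliberately simplistic and the names are made up.

## Extract advmod(head, dependent) features from typed-dependency lines.
dep_lines <- c("advmod(good-7, very-6)", "nsubj(good-7, movie-2)")

advmod_features <- function(deps) {
  ## Capture the head and dependent word forms (indices are discarded).
  m <- regmatches(deps, regexec("^advmod\\(([^-]+)-[0-9]+, ([^-]+)-[0-9]+\\)$", deps))
  feats <- vapply(m, function(x) {
    if (length(x) == 3) paste0("advmod(", x[2], ",", x[3], ")") else NA_character_
  }, character(1))
  feats[!is.na(feats)]
}

advmod_features(dep_lines)   # "advmod(good,very)"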


References

Baayen, R. Harald. 2001. Word Frequency Distributions. Dordrecht: Kluwer Academic Publishers.
Baayen, R. Harald; Douglas J. Davidson; and Douglas M. Bates. 2008. Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language 59:390–412.
Bates, Douglas M. 2005. Fitting linear mixed models in R. R News 5(1):27–30.
van Benthem, Johan. 2008. A brief history of natural logic. In M. Chakraborty; B. Löwe; M. Nath Mitra; and S. Sarukkai, eds., Logic, Navya-Nyaya and Applications: Homage to Bimal Matilal.
Bird, Steven; Ewan Klein; and Edward Loper. 2009. Natural Language Processing with Python. Sebastopol, CA: O’Reilly Media.
Campbell, Sherlock R. and James W. Pennebaker. 2003. The secret life of pronouns: Flexibility in writing style and physical health. Psychological Science 14(1):60–65.
Carpenter, Bob. 1997. Type-Logical Semantics. Cambridge, MA: MIT Press.
Chung, Cindy and James W. Pennebaker. 2007. The psychological function of function words. In Klaus Fiedler, ed., Social Communication, 343–359. New York: Psychology Press.
Danescu-Niculescu-Mizil, Cristian; Gueorgi Kossinets; Jon Kleinberg; and Lillian Lee. 2009. How opinions are received by online communities: A case study on Amazon.com helpfulness votes. In Proceedings of WWW, 141–150. ACL.
Davis, Christopher and Christopher Potts. 2009. Affective demonstratives and the division of pragmatic labor. In Maria Aloni; Harald Bastiaanse; Tikitu de Jager; Peter van Ormondt; and Katrin Schulz, eds., Preproceedings of the 17th Amsterdam Colloquium. Institute for Logic, Language and Computation, Universiteit van Amsterdam.
Druck, Gregory; Gideon Mann; and Andrew McCallum. 2008. Learning from labeled features using generalized expectation criteria. In Proceedings of ACM Special Interest Group on Information Retrieval.
Finkel, Jenny and Christopher D. Manning. 2010. Hierarchical joint learning: Improving joint parsing and named entity recognition with non-jointly labeled data. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, 720–728. ACL.
Gelman, Andrew and Jennifer Hill. 2007. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.
Ghose, Anindya; Panagiotis Ipeirotis; and Arun Sundararajan. 2007. Opinion mining using econometrics: A case study on reputation systems. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, 416–423. Prague, Czech Republic: Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P/P07/P07-1053.
Greenwood, Mark. 2005. NP chunker v1.1. URL http://www.dcs.shef.ac.uk/~mark/index.html?http://www.dcs.shef.ac.uk/~mark/phd/software/chunker.html.
Hatzivassiloglou, Vasileios and Kathleen R. McKeown. 1997. Predicting the semantic orientation of adjectives. In Proceedings of the 35th Annual Meeting of the ACL and the 8th Conference of the European Chapter of the ACL, 174–181. ACL.
Heim, Irene and Angelika Kratzer. 1998. Semantics in Generative Grammar. Oxford: Blackwell Publishers.
Horn, Laurence R. 1989. A Natural History of Negation. University of Chicago Press. Reissued 2001 by CSLI.
Israel, Michael. 1996. Polarity sensitivity as lexical semantics. Linguistics and Philosophy 19(6):619–666.
Israel, Michael. 2001. Minimizers, maximizers, and the rhetoric of scalar reasoning. Journal of Semantics 18(4):297–331.
Israel, Michael. 2004. The pragmatics of polarity. In Laurence Horn and Gregory Ward, eds., The Handbook of Pragmatics, 701–723. Oxford: Blackwell.
Jaeger, T. Florian. 2008. Categorical data analysis: Away from ANOVAs (transformation or not) and towards logit mixed models. Journal of Memory and Language 59(4):434–446.


Jurafsky, Daniel and James H. Martin. 2009. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Englewood Cliffs, NJ: Prentice-Hall, 2nd edition.
Keenan, Edward L. 2002. Some properties of natural language quantifiers: Generalized quantifier theory. Linguistics and Philosophy 25(5–6):627–654.
Kilgarriff, Adam and Tony Rose. 1998. Measures for corpus similarity and homogeneity. In Proceedings of the 3rd Conference on Empirical Methods in Natural Language Processing, 46–52. ACL-SIGDAT.
Klein, Dan and Christopher D. Manning. 2003a. Accurate unlexicalized parsing. In ACL ’03: Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, volume 1, 423–430. ACL.
Klein, Dan and Christopher D. Manning. 2003b. Fast exact inference with a factored model for natural language parsing. In Suzanna Becker; Sebastian Thrun; and Klaus Obermayer, eds., Advances in Neural Information Processing Systems 15, 3–10. Cambridge, MA: MIT Press.
Klein, Dan and Christopher D. Manning. 2003c. Optimization, MaxEnt models, and conditional estimation without magic. Tutorial at HLT-NAACL 2003 and ACL 2003.
Ladusaw, William A. 1980. On the notion ‘affective’ in the analysis of negative polarity items. Journal of Linguistic Research 1(1):1–16. Reprinted in Portner and Partee (2002), 457–470.
Lakoff, Robin. 1974. Remarks on ‘this’ and ‘that’. In Proceedings of the Chicago Linguistics Society 10, 345–356.
Liberman, Mark. 2008. Affective demonstratives. URL http://languagelog.ldc.upenn.edu/nll/?p=674.
MacCartney, Bill. 2009. Natural Language Inference. Ph.D. thesis, Stanford University.
MacCartney, Bill and Christopher D. Manning. 2007. Natural logic for textual inference. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, 193–200. Prague: Association for Computational Linguistics. URL http://www.aclweb.org/anthology/W/W07/W07-1431.
MacCartney, Bill and Christopher D. Manning. 2008. Modeling semantic containment and exclusion in natural language inference. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), 521–528. Manchester, UK: Coling 2008 Organizing Committee. URL http://www.aclweb.org/anthology/C08-1066.
MacCartney, Bill and Christopher D. Manning. 2009. An extended model of natural logic. In Proceedings of the Eighth International Conference on Computational Semantics, 140–156. Tilburg, The Netherlands: Association for Computational Linguistics. URL http://www.aclweb.org/anthology/W09-3714.
Manning, Christopher D.; Dan Klein; Huihsin Tseng; William Morgan; and Anna N. Rafferty. 2008. Stanford log-linear part-of-speech tagger, version 1.6. URL http://nlp.stanford.edu/software/tagger.shtml.
Manning, Christopher D. and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press.
de Marneffe, Marie-Catherine; Christopher D. Manning; and Christopher Potts. 2010. “Was it good? It was provocative.” Learning adjective scales from review corpora and the Web. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, 167–176. ACL.
de Marneffe, Marie-Catherine and Christopher D. Manning. 2008a. Stanford Typed Dependencies Manual. Stanford University.
de Marneffe, Marie-Catherine and Christopher D. Manning. 2008b. The Stanford typed dependencies representation. In Proceedings of the COLING 2008 Workshop on Cross-Framework and Cross-Domain Parser Evaluation, 1–8. ACL.
McCallum, Andrew. 2002. MALLET: A machine learning for language toolkit. Software package, UMass Amherst. URL http://mallet.cs.umass.edu/.
McCallum, Andrew and Kamal Nigam. 1998. A comparison of event models for naive Bayes text classification. In AAAI/ICML-98 Workshop on Learning for Text Categorization, 41–48. AAAI Press.


Munro, Robert; Steven Bethard; Victor Kuperman; Robin Melnick; Christopher Potts; Tyler Schnoebelen; and Harry Tily. 2010. Crowdsourcing and language studies: The new generation of linguistic data. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data With Mechanical Turk, 122–130. ACL.
Naruoka, Keiko. 2003. Expressive functions of Japanese adnominal demonstrative ‘konna/sonna/anna’. In The 13th Japanese/Korean Linguistics Conference, 433–444.
Newman, Matthew L.; James W. Pennebaker; Diane S. Berry; and Jane M. Richards. 2003. Lying words: Predicting deception from linguistic styles. Journal of Language and Social Psychology 29(5):665–675.
Ono, Kiyoharu. 1994. Territories of information and Japanese demonstratives. The Journal of the Association of Teachers of Japanese 28(2):131–155.
Pang, Bo and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, 271–278. Barcelona, Spain.
Pang, Bo and Lillian Lee. 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval 2(1):1–135.
Pang, Bo; Lillian Lee; and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 79–86. Philadelphia: Association for Computational Linguistics.
Portner, Paul and Barbara H. Partee, eds. 2002. Formal Semantics: The Essential Readings. Oxford: Blackwell.
Potts, Christopher. 2010. On the negativity of negation. In David Lutz and Nan Li, eds., Proceedings of Semantics and Linguistic Theory 20. Ithaca, NY: CLC Publications.
Potts, Christopher and Florian Schwarz. 2010. Affective ‘this’. Linguistic Issues in Language Technology 3(5):1–30.
Ramshaw, Lance and Mitchell P. Marcus. 1995. Text chunking using transformation-based learning. In David Yarowsky and Kenneth Church, eds., Proceedings of the Third ACL Workshop on Very Large Corpora, 82–94. The Association for Computational Linguistics.
Russell, Bertrand. 1948. Human Knowledge, Its Scope and Limits. New York: Simon and Schuster.
Sánchez-Valencia, Víctor. 1991. Studies in Natural Logic and Categorial Grammar. Ph.D. thesis, University of Amsterdam.
Snow, Rion; Brendan O’Connor; Daniel Jurafsky; and Andrew Y. Ng. 2008. Cheap and fast — but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, 254–263. ACL.
Stone, Philip J.; Dexter C. Dunphy; Marshall S. Smith; and Daniel M. Ogilvie. 1966. The General Inquirer: A Computer Approach to Content Analysis. Cambridge, MA: MIT Press.
Thomas, Matt; Bo Pang; and Lillian Lee. 2006. Get out the vote: Determining support or opposition from Congressional floor-debate transcripts. In Proceedings of EMNLP, 327–335.
Wilson, Theresa; Janyce Wiebe; and Paul Hoffmann. 2005. Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
