Transcript of slides: Natural ___ Processing (nasmith/slides/casbs.10.pdf), 64 pages.

Page 1

Natural ___ Processing

Noah A. Smith
School of Computer Science
Carnegie Mellon University
[email protected]

Page 2

This Talk

1. A light discussion of micro-analysis

2. Some recent developments in macro-analysis, specifically linking text and non-text

“micro”

“macro”

documents

words

sentences

Page 3

Text Mining/Information Retrieval vs. Natural Language Processing

• Typical representation of a document’s text content: bag of words (see the sketch after this list)

• Great for ranking documents given a query, for classifying documents, and for a number of other applications.

• Engineering: Will (some) applications work better if we “square” the data and opt for more abstract representations of text?

• “Turning unstructured data into structured data”

• Cf. Claire’s talk yesterday!
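A minimal sketch of the bag-of-words idea referenced above: a document is reduced to unordered word counts. This is illustrative only; real systems add tokenization, lowercasing, and stop-word handling.

```python
from collections import Counter

def bag_of_words(text):
    # discard word order; keep only how often each word occurs
    return Counter(text.lower().split())

print(bag_of_words("the market fell as the dollar fell"))
# Counter({'the': 2, 'fell': 2, 'market': 1, 'as': 1, 'dollar': 1})
```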

Page 4

Beyond Engineering

• This isn’t just about how to build a better piece of software.

• Old AI philosophers: Is a bag of words really “understanding”?

• Computational linguistics: computational models for theorizing about the phenomenon of human language

• Very hard problem: representing what a piece of text “means”

• Long-running debate: linguistic theories and NLP

• Less controversial: linguistic representations in NLP

• Is there a parallel for computational social science?

Page 5

Deeper, More Abstract, Structured Representations

• Stemming, lemmatization, and (for other languages) morphological disambiguation (see the sketch after this list)

• Syntactic disambiguation (parts of speech, phrase-structure parsing, dependency parsing)

• Word sense disambiguation, semantic roles, predicate-argument structures

• Named entity recognition, coreference resolution

• Opinion, sentiment, and subjectivity analysis, discourse relations and parsing

Jan’s Continuum
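As an illustration of the first item on the list above, a minimal sketch using NLTK (assumed installed): the Porter stemmer chops suffixes heuristically, while the WordNet lemmatizer consults a lexicon for dictionary forms.

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # lexical resource for the lemmatizer

stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
for word in ["running", "studies", "cities"]:
    # stemming truncates; lemmatization maps to a dictionary form
    print(word, stemmer.stem(word), lemmatizer.lemmatize(word))
```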

Page 6

Phrase-Structure Syntax

" This is not the sort of market to have a big position in , " said David Smith , who heads trading in all non-U.S. stocks .

[Figure: Penn Treebank-style phrase-structure parse of the sentence above, with nonterminals such as SINV, S-TPC, NP-SBJ, NP-PRD, VP, PP, SBAR, and WHNP over part-of-speech-tagged tokens 0-28.]

Page 7

Dependency Syntax

" This is not the sort of market to have a big position in , " said David Smith , who heads trading in all non-U.S. stocks .

[Figure: dependency parse of the sentence above, drawn as directed arcs from each head word to its dependents (tokens 0-28).]

http://www.ark.cs.cmu.edu/TurboParser
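The slide’s parses come from TurboParser (link above). As an illustration of what dependency output looks like, here is a minimal sketch using spaCy instead; it assumes the small English model has been installed (python -m spacy download en_core_web_sm).

```python
import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp("David Smith heads trading in all non-U.S. stocks.")
for token in doc:
    # each word is attached to a head word by a typed relation
    print(f"{token.text:10} --{token.dep_}--> {token.head.text}")
```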

Page 8

Frame-Semantic Structure

The professor chuckled with unabashed glee

[Figure: frame-semantic parse. “chuckled” evokes a Make_noise frame (roles: Sound_source, Internal_cause, Sound); “glee” evokes an Emotion_directed frame (roles: Experiencer, State); “professor” is tagged People_by_vocation.]

http://www.ark.cs.cmu.edu/SEMAFOR

Page 9

A Bit of History

• Many of these are “old” problems.

• Twenty years ago we started using text data to solve them:

1. Pay experts to annotate examples.

2. Use a combination of human ingenuity and statistics to build systems and/or do science.

• Sound familiar?

+ Reusable data, clearly identified evaluation tasks, benchmarks → progress

- Domain specificity, overcommitting to representations, nearsightedness

Page 10

Comments on “Domain”

• Huge challenge: models built for one domain don’t work well in others.

• What’s a “domain”?

• Huge effort expended on news and, later, biomedical articles.

• Lesson: NLP people will follow the annotated datasets.

Page 11

The Next Paradigm?

• If we could make do with less annotated data (or none), maybe we could be more “agile” about text domains and representations.

• Small, noisy datasets; active, semisupervised, and unsupervised learning

• NB: this is where we do Bayesian statistics!

• Can we get information about language from other data? (See part 2.)

1. Pay experts to annotate examples.

2. Use a combination of human ingenuity and statistics to build systems and/or do science.

Page 12

Current Work (Yano, Resnik, and Smith, 2010)

• Pilot study: can untrained annotators consistently annotate political bias?

• What clues give away political bias in sentences from political blogs? (See Monroe, Colaresi, and Quinn, 2008)

• We could probably define “political bias” under Jan’s subjectivity umbrella!

• Sample of sentences from six political blogs (2008); not a uniform sample

• Amazon Mechanical Turk judgments of bias (buzzword: crowdsourcing)

• Survey of basic political views of annotators

Page 13

Example 1

Feminism has much to answer for denigrating men and encouraging women to seek independence whatever the cost to their families.

Page 14

Example 2

They died because of the Bush administration’s hubris.

Page 15

Example 3

In any event, for those of us who have followed this White House carefully during the entirety of the war, the charge is frankly absurd.

Page 16

Other Observations

• Bias is hard to label without context, but not always impossible.

• Pairwise kappa of 0.5 to 0.55 (decent for “moderately difficult” tasks; see the sketch after this list)

• Annotators: more social liberals, more fiscal conservatives

• Liberal blogs are liberal, conservative blogs are conservative

• Liberals are quick to see conservative bias; conservatives see both!
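For reference, Cohen’s kappa measures pairwise agreement corrected for chance agreement. A minimal sketch with scikit-learn, on entirely hypothetical annotations (L = liberal bias, C = conservative bias, N = no bias):

```python
from sklearn.metrics import cohen_kappa_score

# two hypothetical annotators labeling the same ten sentences
a1 = ["L", "L", "N", "C", "N", "L", "C", "N", "N", "L"]
a2 = ["L", "N", "N", "C", "N", "L", "C", "L", "N", "L"]

# 0 = chance-level agreement, 1 = perfect agreement
print(cohen_kappa_score(a1, a2))  # ~0.69 on this toy data
```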

Page 17

Lexical Indicators of Bias

Overall                 Liberal                 Conservative       Not Sure Which
bad             0.60    Administration  0.28    illegal    0.40    pass      0.32
personally      0.56    Americans       0.24    Obama’s    0.38    bad       0.32
illegal         0.53    woman           0.24    corruption 0.32    sure      0.28
woman           0.52    single          0.24    rich       0.28    blame     0.28
single          0.52    personally      0.24    stop       0.26    they’re   0.24
rich            0.52    lobbyists       0.23    tax        0.25    happen    0.24
corruption      0.52    Republican      0.22    claimed    0.25    doubt     0.24
Administration  0.52    union           0.20    human      0.24    doing     0.24
Americans       0.51    torture         0.20    doesn’t    0.24    death     0.24
conservative    0.50    rich            0.20    difficult  0.24    actually  0.24
doubt           0.48    interests       0.20    Democrats  0.24    exactly   0.22
torture         0.47    doing           0.20    less       0.23    wrong     0.22

Table 6: Most strongly biased words, ranked by relative frequency of receiving a bias mark, normalized by total frequency. Only words appearing five times or more in our annotation set are ranked.

Top-ranked words for each calculation are shown in Table 6.

Some of the patterns we see are consistent with what we found in our automatic method for proposing biased bigrams. For example, the bigrams tended to include terms that refer to members or groups on the opposing side. Here we find that Republican and Administration (referring in 2008 to the Bush administration) tend to show liberal bias, while Obama’s and Democrats show conservative bias.

5 Discussion and Future Work

The study we have conducted here represents an initial pass at empirical, corpus-driven analysis of bias using the methods of computational linguistics. The results thus far suggest that it is possible to automatically extract a sample that is rich in examples that annotators would consider biased; that naïve annotators can achieve reasonable agreement with minimal instructions and no training; and that basic exploratory analysis of results yields interpretable patterns that comport with prior expectations, as well as interesting observations that merit further investigation.

In future work, enabled by annotations of biased and non-biased material, we plan to delve more deeply into the linguistic characteristics associated with biased expression. These will include, for example, an analysis of the extent to which explicit “lexical framing” (use of partisan terms, e.g., Monroe et al., 2008) is used to convey bias, versus use of more subtle cues such as syntactic framing (Greene and Resnik, 2009). We will also explore the extent to which idiomatic usages are connected with bias, with the prediction that partisan “memes” tend to be more idiomatic than compositional in nature.

In our current analysis, the issue of subjectivity was not directly addressed. Previous work has shown that opinions are closely related to subjective language (Pang and Lee, 2008). It is possible that asking annotators about sentiment while asking about bias would provide a deeper understanding of the latter. Interestingly, annotator feedback included remarks that mere negative “facts” do not convey an author’s opinion or bias. The nature of subjectivity as a factor in bias perception is an important issue for future investigation.

6 Conclusion

This paper considered the linguistic indicators of bias in political text. We used Amazon Mechanical Turk judgments about sentences from American political blogs, asking annotators to indicate whether a sentence showed bias, and if so, in which political direction and through which word tokens; these data were augmented by a political questionnaire for each annotator. Our preliminary analysis suggests that bias can be annotated reasonably consistently, that bias perception varies based on personal views, and that there are some consistent lexical cues for bias in political blog data.

Acknowledgments

The authors acknowledge research support from HP Labs, help with data from Jacob Eisenstein, and helpful comments from the reviewers, Olivia Buzek, Michael Heilman, and Brendan O’Connor.


Page 18

Lexical Indicators of Bias

[Same paper page as shown on page 17.]

Page 19

Lexical Indicators of Bias

[Same paper page as shown on page 17.]

Page 20

Current Work (Eisenstein, Yano, Smith, Cohen, and Xing, in progress)

• Goal: find mentions of specific “persons of interest” in noisy multi-author text

• Starting point: a list of names

• Cf. noun phrase coreference resolution

• Deal with misspellings, variations, titles, even sobriquets like “Obamessiah”

• Exploit local and document context

• Predominantly unsupervised approach (supervision is the list of names)

Page 21

NLP Success Stories

? Search engines

✓ Translation (Google)

✓ Information extraction (text → databases)

✓ Question answering (recently: IBM’s Watson)

✓ Opinion mining

Page 22

NLP: The Refrigerator of the Information Age

• It enables all kinds of more exciting activities.

• It does something you can’t do very well (cf. dishwashers).

• When it’s working for you, it’s always quietly humming in the background.

• You don’t think about it too often, and instructions never mention it.

• When it stops working, you will notice.

• Though expertise is required, there is very little glamor in manufacturing, maintaining, or improving it.

Opinion

Page 23

2. Adventures in Macro-Analysis: Linking Text to Other Data

Note to John: These may be superficial.

“data describing government activity in all parts of the policymaking process”

social behavior

economic, financial, political

Page 24

Recent Development: Text Regression

• Linear regression:

$\min_{\mathbf{w} \in \mathbb{R}^d} \ \frac{1}{n} \sum_{i=1}^{n} \mathrm{error}\!\left(y_i, \mathbf{w}^{\top} \mathbf{x}_i\right)$

• Each xi is some representation of a document as a d-dimensional vector.

• Each yi is some output value that we seek to learn to predict.

• w is a vector of weights.

• Learning the model = picking w.
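A minimal end-to-end sketch of this setup with scikit-learn, using a ridge (penalized squared-error) objective as the error function; the documents and dollar targets below are invented for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge

# toy reviews paired with toy opening-weekend revenues (in $M)
docs = ["great thrilling sequel", "dull slow documentary",
        "great fun comedy", "slow grim drama"]
y = np.array([40.0, 2.0, 25.0, 5.0])

vec = CountVectorizer()
X = vec.fit_transform(docs)          # each x_i: a bag-of-words vector
model = Ridge(alpha=1.0).fit(X, y)   # learning = picking w

print(model.predict(vec.transform(["a great sequel"])))
```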

Page 25

Text Regression Examples

x y

reviews of a film from newspaper critics

opening weekend revenue

(Joshi, Das, Gimpel, and Smith, 2010)

a company’s annual 10-K report

volatility (a measure of financial risk)

(Kogan, Levin, Routledge, Sagi, and Smith, 2009)

political blog post total comment volume (Yano and Smith, 2010)

microblog microblogger’s geographical coordinates

(Eisenstein, O’Connor, and Smith, in progress)

Page 26

Text Regression Examples

x y

reviews of a film from newspaper critics

opening weekend revenue

(Joshi, Das, Gimpel, and Smith, 2010)

a company’s annual 10-K report

volatility (a measure of financial risk)

(Kogan, Levin, Routledge, Sagi, and Smith, 2009)

political blog post total comment volume (Yano and Smith, 2010)

microblog microblogger’s geographical coordinates

(Eisenstein, O’Connor, and Smith, in progress)

Page 27

Data

• Text: pre-release movie reviews from seven American newspapers (1,147+317+254 movies), during 2005-2009

• Metadata: name, production house, genre(s), scriptwriter(s), director(s), country of origin, primary actors, release date, MPAA rating, running time, production budget (metacritic.com, the-numbers.com)

• (State-of-the-art forecasters are based on these observables; see, e.g., Simonoff and Sparrow, 2000; Sharda and Delen, 2006)

• Target: opening weekend gross revenue, number of opening weekend screens (the-numbers.com)

Page 28

Experimental Results

Model                                 Total MAE ($M)   Per-Screen MAE ($K)
Predict median                        10.521           6.642
Non-text                              5.983            6.540
Words, bigrams, trigrams              7.627            6.060
Non-text + words, bigrams, trigrams   5.750            6.052

Page 29

Features ($M)

rating:  pg +0.085; adult −0.236; rate r −0.364

sequels:  this series +13.925; the franchise +5.112; the sequel +4.224

people:  will smith +2.560; brittany +1.128; ^ producer brian +0.486

genre:  testosterone +1.945; comedy for +1.143; a horror +0.595; documentary −0.037; independent −0.127

sent.:  best parts of +1.462; smart enough +1.449; a good thing +1.117; shame $ −0.098; bogeyman −0.689

plot:  torso +9.054; vehicle in +5.827; superhero $ +2.020

Also ... “of the art,” “and cgi,” “shrek movies,” “voldemort,” “blockbuster,” “anticipation,” “summer movie”; “canne” is bad.

Page 30

Text Regression Examples

x y

reviews of a film from newspaper critics

opening weekend revenue

(Joshi, Das, Gimpel, and Smith, 2010)

a company’s annual 10-K report

volatility (a measure of financial risk)

(Kogan, Levin, Routledge, Sagi, and Smith, 2009)

political blog post total comment volume (Yano and Smith, 2010)

microblog microblogger’s geographical coordinates

(Eisenstein, O’Connor, and Smith, in progress)

Page 31

Why Finance?

• I hate reading financial reports, but they contain crucial information about my investments, hence my future.

• Natural language processing for boring texts seems like a good bet.

• Finance researchers want to know:

• Are financial reports worth the cost?

• Are they informative?

• Does this tell us anything about financial policy?

Page 32

Volatility

• Return on day t:

$r_t = \dfrac{\text{closing price}_t + \text{dividends}_t}{\text{closing price}_{t-1}} - 1$

• Sample standard deviation from day t − τ to day t:

$v_{[t-\tau,\,t]} = \sqrt{\dfrac{1}{\tau} \sum_{i=0}^{\tau} \left(r_{t-i} - \bar{r}\right)^{2}}$

• This is called measured volatility.
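A minimal NumPy sketch of the two definitions above, on made-up prices and dividends:

```python
import numpy as np

def returns(prices, dividends):
    """r_t = (price_t + dividends_t) / price_{t-1} - 1"""
    p, d = np.asarray(prices, float), np.asarray(dividends, float)
    return (p[1:] + d[1:]) / p[:-1] - 1.0

def measured_volatility(r):
    """Sample standard deviation of returns over the window."""
    return np.sqrt(np.mean((r - r.mean()) ** 2))

r = returns([100, 101, 99, 102, 103], [0, 0, 0.5, 0, 0])
print(measured_volatility(r))
```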

Page 33

Important Property of Volatility

• Autoregressive conditional heteroskedasticity: volatility tends to be stable (over horizons like one year).

• v[t - τ, t] is a strong predictor of v[t, t + τ]

• This is our strong baseline.

Page 34

Data

• Text: “Form 10-K” section 7 from publicly traded American firms’ government-mandated disclosures (26,806 reports, ~250M words), during 1996-2006

• sec.gov

• www.ark.cs.cmu.edu/10K

• Metadata: volatility in the year prior to the report (“history”), v[t - 1y, t]

• Target: volatility in the year following the report, v[t, t + 1y]

Source: Center for Research in Security Prices U.S. Stocks Databases

Page 35

Experimental Setup

• Test on year Y.

• Train on (Y - 5, Y - 4, Y - 3, Y - 2, Y - 1).

• Six such splits.

• Compare

• history-only baseline;

• text-only SVR, unigrams and bigrams, log(1 + frequency) features (see the sketch after this list);

• combined SVR.
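A minimal sketch of the text-only model, with scikit-learn’s LinearSVR standing in for the support vector regression used in the paper; the two reports and log-volatility targets are invented.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVR

reports = ["net loss going concern expenses personnel",
           "net income dividends properties insurance"]
log_vol = np.array([-1.2, -2.3])  # hypothetical log-volatility targets

X = CountVectorizer(ngram_range=(1, 2)).fit_transform(reports)
X.data = np.log1p(X.data)  # log(1 + frequency), as on the slide

svr = LinearSVR(epsilon=0.1).fit(X, log_vol)
```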

Page 36

Dominant Weights (2000-4)

High-volatility terms           Low-volatility terms
loss            0.025           net income          −0.021
net loss        0.017           rate                −0.017
year #          0.016           properties          −0.014
expenses        0.015           dividends           −0.013
going concern   0.014           lower interest      −0.012
a going         0.013           critical accounting −0.012
administrative  0.013           insurance           −0.011
personnel       0.013           distributions       −0.011

Page 37

MSE of Log-Volatility

[Bar chart: MSE of log-volatility (range shown: 0.120-0.210; lower is better) for test years 2001-2006 and the micro-average, comparing History, Text, and Text + History models; asterisks mark selected bars.]

Page 38

2002

• Enron and other accounting scandals

• Sarbanes-Oxley Act of 2002 (and other reforms)

• Longer reports

• Are the reports more informative after 2002? Because of Sarbanes-Oxley?

Page 39

Text Regression Examples

x y

reviews of a film from newspaper critics

opening weekend revenue

(Joshi, Das, Gimpel, and Smith, 2010)

a company’s annual 10-K report

volatility (a measure of financial risk)

(Kogan, Levin, Routledge, Sagi, and Smith, 2009)

political blog post total comment volume (Yano and Smith, 2010)

microblog microblogger’s geographical coordinates

(Eisenstein, O’Connor, and Smith, in progress)

Page 40

Political Blogs

• Arguably an influential medium in American politics.

• Adamic and Glance (2005), Leskovec et al. (2007), and many others have considered the link and community structure.

• We’re focusing on predicting how readers will respond.

• Cf. ideological discourse models: Lin, Xing, and Hauptmann (2008)

Page 41

Page 42

Comments

Page 43

Data

• Main Text: blog posts from five American political blogs, 2007-2008; 1,000-2,200 posts per blog, 110K-320K words per blog, averaging 68-185 words per post (varies by blog).

• Comments: text and author for each comment on the above posts. 30-200 comments per post, 20-40 words per comment.

• www.ark.cs.cmu.edu/blog-data

Page 44

Technical Digression

• Latent Dirichlet allocation (Blei et al., 2003) is not so different from dimensionality reduction, but it has a more extensible probabilistic interpretation.

• Quite useful for exploratory data analysis.

• Easy to extend to model other observed and hidden variables.

• I will skip the derivations and give the high-level idea:

$p(\text{response} \mid \text{words}) \propto \sum_{\text{topics},\,\text{mixture}} p(\text{words}, \text{topics}, \text{mixture}, \text{response})$
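A minimal sketch of fitting vanilla LDA (not the comment-aware extensions discussed later) with scikit-learn, on three invented posts:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

posts = ["obama clinton campaign primary voters",
         "iraq war military troops security",
         "health care tax spending insurance"]

X = CountVectorizer().fit_transform(posts)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
print(lda.transform(X))  # per-post topic mixtures
```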

Page 45

1873 women black white men people liberal civil working woman rights

1730 obama clinton campaign hillary barack president presidential really senator democratic

1643 think people policy really way just good political kind going

1561 conservative party political democrats democratic republican republicans immigration gop right

1521 people city school college photo creative states license good time

1484 romney huckabee giuliani mitt mike rudy muslim church really republican

1478 iran world nuclear israel united states foreign war international iranian

1452 carbon oil trade emissions change climate energy human global world

1425 obama clinton win campaign mccain hillary primary voters vote race

1352 health economic plan care tax spending economy money people insurance

1263 iraq war military government american iraq troops forces security years

1246 administration bush congress torture law intelligence legal president cia government

1215 mccain john bush president campaign policy know george press man

1025 team game season defense good trade play player better best

1007 book times news read article post blog know media good

Page 46

Predicting Volume

• Which blog posts will attract more attention than the average?

Model          Precision   Recall
Naive Bayes    72.5        41.7
Poisson SLDA*  70.1        63.2
CommentLDA     70.2        68.8

*Similar to Blei and McAuliffe (2008), but mixture of Poissons.

Page 47

Text Regression Examples

x y

reviews of a film from newspaper critics

opening weekend revenue

(Joshi, Das, Gimpel, and Smith, 2010)

a company’s annual 10-K report

volatility (a measure of financial risk)

(Kogan, Levin, Routledge, Sagi, and Smith, 2009)

political blog post total comment volume (Yano and Smith, 2010)

microblog microblogger’s geographical coordinates

(Eisenstein, O’Connor, and Smith, in progress)

Page 48

“Dialect” from Twitter

[Map: geolocated Twitter messages over the continental United States (longitude −130 to −60, latitude 25 to 50).]

Page 49

Another Macro-Analysis Example

• Linking microblog sentiment to other measures of public opinion

• Warning: this is descriptive stuff!

Page 50

Data

• Text: A billion “tweets,” from 2008-2009. Average length about 11 words.

• Time Series: (all data are freely available online)

Economic confidence:

• Gallup Economic Confidence poll (daily, three-day averages)

• Index of Consumer Sentiment (ICS) from Reuters/U. Michigan (monthly)

Politics:

• Gallup’s poll of Obama approval rating

Page 51

Technical Approach

• Retrieve messages that match a keyword (e.g., jobs or obama).

• Estimate daily opinion.

• “Positive” and “negative” words come from the OpinionFinder lexicon (Wilson, Wiebe, and Hoffmann, 2005; 1600 types and 1200 types respectively).

• Two parameters: number of days for moving average, and lead/lag.

$x_t = \dfrac{\mathrm{count}_t(\text{positive word} \wedge \text{keyword})}{\mathrm{count}_t(\text{negative word} \wedge \text{keyword})}$
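A minimal sketch of the ratio above for one day’s messages; the tiny word lists stand in for the OpinionFinder lexicon.

```python
positive = {"good", "great", "confident"}  # stand-in for OpinionFinder's
negative = {"bad", "worried", "lost"}      # positive/negative word lists

def sentiment_ratio(tweets, keyword):
    """x_t for one day: positive vs. negative counts among
    messages that mention the keyword."""
    pos = neg = 0
    for t in tweets:
        words = set(t.lower().split())
        if keyword in words:
            pos += bool(words & positive)
            neg += bool(words & negative)
    return pos / max(neg, 1)

print(sentiment_ratio(["jobs report looks good", "worried about jobs"], "jobs"))
```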

Page 52

Moving Averages

[Figure: daily sentiment ratio from 2008-01 through 2009-11, smoothed with moving averages over the past 1, 7, and 30 days.]

Page 53

[Figure: sentiment ratios (x) under two smoothing-window and lead settings (k=15, lead=0; k=30, lead=50), plotted from 2008-01 through 2009-11 against Gallup Economic Confidence and the Michigan ICS (y).]

Page 54

jobs and Gallup’s Economic Confidence Poll: r = 0.794

Page 55

Economic Confidence and Twitter

• job and economy did not work well at all.

• jobs ≠ job

Page 56

15-day smoothing; r = 0.725

[Figure: sentiment ratio for “obama”, fraction of messages containing “obama”, and Gallup presidential job approval (with % support for Obama during the 2008 election), 2008-01 through 2009-12.]

Page 57

Questions for You

• Fill in the blank:

computational linguistics is to NLP as computational social science is to ___

(and what is that relationship?)

• What linguistic representations make sense for computational social science?

• What are some applications that are not superficial?

Page 58

Acknowledgments

• Collaborators: Ramnath Balasubramanyan, Desai Chen, William Cohen, Dipanjan Das, Kevin Gimpel, Jacob Eisenstein, Mahesh Joshi, Dimitry Levin, André Martins, Brendan O’Connor, Bryan Routledge, Nathan Schneider, Eric Xing, Tae Yano (all CMU); Shimon Kogan (Texas); Jacob Sagi (Vanderbilt)

• NSF, DARPA, IBM, Google, HP Labs, Yahoo

Page 59

Lead/Lag

• One thing to consider is the correlation between the text statistic (x) and the poll (y).

• But we want forecasting!

• We can track correlation for different “text lead/poll lag” values; a positive value means we are forecasting.

• Does x_t predict y_{t+k}? (See the sketch below.)
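A minimal sketch of sweeping the lead parameter; the series here are synthetic, with the poll built to trail the text statistic by 10 days, so the correlation peaks near lead = 10.

```python
import numpy as np

def lead_lag_corr(x, y, lead):
    """Correlate text statistic x_t with poll y_{t+lead};
    lead > 0 means the text is forecasting the poll."""
    if lead > 0:
        return np.corrcoef(x[:-lead], y[lead:])[0, 1]
    if lead < 0:
        return np.corrcoef(x[-lead:], y[:lead])[0, 1]
    return np.corrcoef(x, y)[0, 1]

rng = np.random.default_rng(0)
x = rng.random(200)                         # synthetic daily sentiment ratio
y = np.roll(x, 10) + 0.1 * rng.random(200)  # poll trailing text by 10 days
print(lead_lag_corr(x, y, lead=10))
```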

Page 60

[Figure: correlation of the sentiment ratio against Gallup (top panel; k = 7, 15, 30) and against the Michigan ICS (bottom panel; k = 30, 60), as a function of text lead / poll lag from −90 to +90 days. Positive values: text leads poll; negative values: poll leads text.]

Page 61

[Figure: the page 53 sentiment-ratio plots and page 60 lead/lag correlation plots, shown side by side.]

Page 62

Open Questions

• Smoothing tends to lead to higher correlation. We do not know why.

• Our sentiment statistic is noisy; is that a problem?

• Data are not IID. What are appropriate training/tuning/testing strategies?

• Intuitively, “check it tomorrow.”

• But how to compare systems fairly, test for significant differences, etc.?

• How do we know whether this is noise?

Page 63

Conclusion

Page 64

Text-Driven Forecasting

• Input: documents

• Output: a prediction about a future, measurable state of the world.

• Attractions:

• theory-neutral,

• easy and natural to evaluate,

• inexpensive, and

• plenty of data (and no annotation needed).

• Challenges: linguistic representations, noisy NLP, experimental design.