Duke University Jensen Lecture

Gabe Ignatow, University of North Texas, Sociology Nick Evangelopoulos, University of North Texas, Management Information Systems Konstantinos Zougris, University of North Texas, Sociology Sentiment Analysis of Polarizing Topics in Social Media: News Site Readers’ Comments on the Trayvon Martin Controversy Duke University Department of Sociology Jensen Lecture February 2015

Transcript of Duke University Jensen Lecture

Page 1: Duke University Jensen Lecture

Gabe Ignatow, University of North Texas, Sociology
Nick Evangelopoulos, University of North Texas, Management Information Systems
Konstantinos Zougris, University of North Texas, Sociology

Sentiment Analysis of Polarizing Topics in Social Media: News Site Readers’ Comments on the Trayvon Martin Controversy

Duke University Department of Sociology
Jensen Lecture, February 2015

Page 2: Duke University Jensen Lecture

Social Science Research Methods

Digital Social Research Methods
Efforts to keep sociology dynamic and relevant (to students and other audiences) in the digital age

Text-based Digital Methods

Semantic Text Analysis
● not network or thematic text analysis (Roberts 1997)
● concerned with social meaning, social construction processes

Situating Our Project

Page 3: Duke University Jensen Lecture

Historical Development of Formal Text Analysis Methods

Pre-Internet Text Analysis
● Harold Lasswell, WWII-era development of content analysis methods
● Klaus Krippendorff

Web 1.0 Text Analysis
● Within sociology: Franzosi, Roberts, Cerulo, Ignatow, Discourse Analysis, CDA
● QDAS packages

Web 2.0 Text Analysis
● accessible Big Data
● social media platforms
● emergence of the field of computational linguistics

Further Situating Our Project

Page 4: Duke University Jensen Lecture

Even Further Situating Our Project
Within my own research

pre-Internet
2003 study of Silicon Valley jargon (Poetics); 2004 study of shipyard union workers’ meeting transcripts (Sociological Forum)

Web 1.0
2009 study of overeaters’ online support groups (Social Forces)

Web 2.0
2015 theoretical article on digital text analysis (Journal for the Theory of Social Behaviour)

2015 (with Rada Mihalcea) Text Mining for the Social Sciences: Research Design, Data Collection, and Analysis (Sage)
2016 (with Rada Mihalcea) An Introduction to Text Mining and Analysis (Sage)

● Project 1: 2015 (with N. Evangelopoulos and K. Zougris) topic sentiment analysis (today’s talk)
● Project 2: 2016 (with N. Evangelopoulos et al.) cultural power as topic centrality
● Project 3: 2016-17 (with Rada Mihalcea et al.) machine-assisted moral sentiment analysis in online communities

Page 5: Duke University Jensen Lecture

Overview of this study

● This study is Project 1 of 3 planned.

● Our modest methodological aim is to demonstrate that a text analysis methodology developed by computer scientists and computational linguists can be used within a social science research project design.

● Our contribution to substantive research is limited. This is a first step in what we hope will be a series of methodological developments at the intersection of computer and social science text analysis.

Page 6: Duke University Jensen Lecture

❖ Surveys
❖ The Implicit Association Test (IAT)

Topic Sentiment Analysis

● How can we know how groups feel about topics?
● Are some topics associated with different emotional experiences by members of different groups or populations?

Page 7: Duke University Jensen Lecture

Topic Sentiment Analysis (TSA) refers to several relatively new text analysis methods that estimate the polarity of sentiments across units of text within large text corpora (Mei et al. 2007; Lin and He 2009).

TSA combines two well established text analysis methods:

1. Topic Modeling/Topic Extraction
2. Sentiment analysis

Page 8: Duke University Jensen Lecture

1. Topic Models
Statistical models and techniques for automatically identifying latent topics in large document collections.

The digital humanities community is far ahead in the use of topic models (Jockers 2013; Meeks and Weingart; Moretti 2013).

These models have recently been used in the social sciences by:
a. political scientists (e.g. Grimmer 2010; Grimmer and King 2011; Grimmer and Stewart 2013)
b. sociologists (Moody and Light 2006; McFarland et al. 2013; Mohr and Bogdanov 2013; Mützel 2012; Kaplan and Vakili 2012).

Different methods of topic identification share three fundamental assumptions.

1. Documents have latent semantic structures, or topics.
2. It is possible to infer topics from word-document co-occurrences.
3. Words are related to topics and topics to documents.
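The second assumption, that topics can be inferred from word-document co-occurrences, presupposes a term-by-document count matrix. The following is a minimal sketch of how such a matrix might be built. The function name `term_document_matrix`, the whitespace tokenizer, and the three toy documents are all assumptions made for illustration and are not taken from the study.

```python
from collections import Counter

def term_document_matrix(documents):
    """Return (vocabulary, matrix) where matrix[i][j] counts term i in document j."""
    # Naive tokenization: lowercase and split on whitespace (an assumption).
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    counts = [Counter(tokens) for tokens in tokenized]
    # One row per term, one column per document.
    matrix = [[doc_counts[term] for doc_counts in counts] for term in vocabulary]
    return vocabulary, matrix

# Toy documents, invented for illustration only.
docs = ["hoodie hoodie geraldo", "obama son", "geraldo hoodie obama"]
vocab, tdm = term_document_matrix(docs)
for term, row in zip(vocab, tdm):
    print(term, row)
```

Topic extraction methods such as LSA or LDA take a matrix of this kind (usually after weighting and stop-word removal) as their input.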

Page 9: Duke University Jensen Lecture

2. Sentiment Analysis

● Analysis of sentiment in texts using standard lexicons

● We use feature-based sentiment analysis.

● FBSA reveals the sentiments associated with targeted words or clauses within a text.
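One simple way to picture feature-based sentiment analysis is to count lexicon terms that occur within a few words of a target word. The sketch below does only that. The mini-lexicons, the window size, and the function name `target_sentiment` are hypothetical, and the slides do not say that the study used this particular windowing scheme.

```python
# Hypothetical mini-lexicons; a real analysis would use a standard sentiment lexicon.
POSITIVE = {"right", "good", "agree"}
NEGATIVE = {"stupid", "idiot", "bad"}

def target_sentiment(text, target, window=3):
    """Count positive and negative lexicon terms within `window` tokens of `target`."""
    tokens = [t.strip(".,!?:;'\"").lower() for t in text.split()]
    pos = neg = 0
    for i, token in enumerate(tokens):
        if token != target:
            continue
        # Tokens immediately before and after this occurrence of the target.
        nearby = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pos += sum(word in POSITIVE for word in nearby)
        neg += sum(word in NEGATIVE for word in nearby)
    return pos, neg

print(target_sentiment("Geraldo is absolutely right about the hoodie theory", "geraldo"))
```

A fixed window is crude: it ignores syntax and negation, so it should be read as an illustration of the idea rather than a usable classifier.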

Page 10: Duke University Jensen Lecture

Topic Sentiment Mixture (TSM; Mei et al. 2007)
● Hidden Markov Model (HMM) structure
● Applying TSM to weblogs discussing topics such as laptops, movies, universities, airlines, cities, iPod and the Da Vinci Code, Mei et al. found that positive and negative bursts in topic-sentiment mixtures tracked closely with known events related to the topics (e.g. the release of the movie The Da Vinci Code, or negative sentiment associated with reviews of the iPod).

The Joint Sentiment Topic (JST) model (Lin and He 2009)
● The JST model is a fully unsupervised probabilistic modeling framework.
● Evaluating their model on a widely used movie review data set, Lin and He found that JST effectively categorized reviews’ overall sentiment.

Topic Sentiment Mixture (TSM) and the Joint Sentiment Topic (JST) Model

Page 11: Duke University Jensen Lecture

Our Adaptation of Topic Sentiment Analysis

We attempt to optimize TSA for social research in terms of simplicity and ease of use as well as analytic value.

First, we employ Latent Semantic Analysis (LSA) as a modeling framework rather than the more commonly used Latent Dirichlet Allocation (LDA).

For this project we also employ a text mining strategy based on John Stuart Mill’s “method of difference” (Lange 2013: 108) to construct text corpora from two independent sources (in this case two partisan news sites).

Our goal is to minimize differences between the corpora in terms of all factors except for the factor of theoretical interest, which for our study is news site ideology.

Page 12: Duke University Jensen Lecture

Our case

We apply TSA to the study of public opinion as expressed in social media by comparing reactions to the Trayvon Martin controversy in spring 2012 by commenters on the partisan news websites the Huffington Post and Daily Caller.

Page 13: Duke University Jensen Lecture

National and global media landscapes have transformed over the last twenty years:
● new media
● social media
● hyper-partisan media

How have these changes affected public attitudes?
Have they contributed to a more polarized public?

Research on how consumption of partisan news may influence audience attitudes has been held back by measurement problems (Prior 2013).

TSA may prove to be a useful tool in this regard.

Why study social media opinion polarization?

Page 14: Duke University Jensen Lecture

Our project is primarily a methodological demonstration, but theoretically our study is based on contemporary studies of news media that depict the media field as an “outrage industry” that incentivizes media personalities to be controversial and polarizing (Berry and Sobieraj 2013) rather than accurate (Tetlock 2006).

Based on these studies and our own abductive process of hypothesis generation (see Bauer, Bicquelet and Suerdem 2014; Ruiz Ruiz 2009) we predict that commenters on highly ideological, partisan news sites will use more polarized emotional language in association with celebrity news commentators than in association with other celebrities, public figures, or topics generally.

Theories and hypotheses (overview)

Page 15: Duke University Jensen Lecture
Page 16: Duke University Jensen Lecture
Page 17: Duke University Jensen Lecture

● Although there is a rich research tradition on how traditional news organizations engage in “gatekeeping” of news topics (Lewin, 1947; White, 1950; Sigal, 1973; Gans, 1979; Shoemaker and Vos 2009), studies of online news media have paid relatively little attention to gatekeeping.

● There are no reliable low-cost methods available for identifying more and less polarizing topics in political discourse.

● Such methods could potentially open new avenues of research on media outlets’ topic choices and the effects of those choices on audiences’ attitudes and opinions.

Page 18: Duke University Jensen Lecture

1. Latent Semantic Analysis (LSA)

● Topic modeling techniques use different mathematical frameworks, mainly linear algebra or probabilistic modeling.

● Several methods of topic extraction are available for topic modeling, including LSA (Latent Semantic Analysis), NMF (Non-negative Matrix Factorization), and LDA (Latent Dirichlet Allocation; see Blei et al. 2003).

● While LDA is currently the most widely used method, LSA has several important advantages over LDA for social research applications.

● However, because the term “topic model” is today used mainly in association with LDA, we use the term “topic extraction” in reference to LSA.

Page 19: Duke University Jensen Lecture

LDA and LSA

LDA
● Markov Chain Monte Carlo and Bayesian Inference
● Most cited paper: Blei et al. 2003

LSA
● Least-squares based
● Most cited paper: Deerwester et al. 1990

Page 20: Duke University Jensen Lecture

LSA continued

● A number of researchers in political science and sociology are using LDA because it is considered to have a superior statistical foundation to LSA.

● Yet LSA is faster, simpler to implement, and more consistent than LDA (Anaya 2011).

● LSA also more accurately models human cognitive processing (Evangelopoulos 2013), and for this reason is more widely used in psychology than is LDA (see Landauer 2007).

● LSA is arguably the overall better choice for social science research applications.

● See also Pilato and Vassallo (2014).
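LSA rests on a truncated singular value decomposition of a (usually weighted) term-by-document matrix. As a rough, standard-library-only illustration, the sketch below recovers just the first latent dimension by power iteration, which is a swapped-in stand-in for a full SVD and not the procedure used in the study. The function name `leading_dimension` and the toy matrix are assumptions.

```python
import math

def leading_dimension(matrix, iterations=200):
    """Approximate the first singular triple of a term-by-document matrix.

    Power iteration on A^T A; a full LSA would compute several dimensions
    with an SVD routine instead.
    """
    n_terms, n_docs = len(matrix), len(matrix[0])
    v = [1.0] * n_docs  # starting guess for the document-side vector
    for _ in range(iterations):
        # u = A v (term scores), then v = A^T u (document scores), renormalized.
        u = [sum(matrix[i][j] * v[j] for j in range(n_docs)) for i in range(n_terms)]
        v = [sum(matrix[i][j] * u[i] for i in range(n_terms)) for j in range(n_docs)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    u = [sum(matrix[i][j] * v[j] for j in range(n_docs)) for i in range(n_terms)]
    sigma = math.sqrt(sum(x * x for x in u))  # leading singular value
    return [x / sigma for x in u], sigma, v

# Toy counts (rows: hoodie, geraldo, obama, son; columns: four made-up comments).
toy = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
term_loadings, sigma, doc_scores = leading_dimension(toy)
print([round(x, 3) for x in term_loadings], round(sigma, 3))
```

In this toy matrix the first dimension loads on "hoodie" and "geraldo" and not on "obama" or "son", which is the sense in which a latent dimension can be read as a topic.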

Page 21: Duke University Jensen Lecture

2. Sentiment Analysis

● Also referred to as opinion mining, sentiment mining, opinion extraction, subjectivity analysis, or emotion analysis

● SA is the field of study that analyzes people’s sentiments, feelings, and appraisals of entities, events, properties and topics as expressed in large document collections.

● This is a relatively new field, with little large-scale text-based research having been done on people’s sentiments and opinions before 2000 (Liu 2010: 1).

● Sentiment analysis generally uses standard lexicons of positive and negative sentiment terms, although it is widely recognized that the meaning of sentiment terms depends on many factors, such as the immediate context and the author’s use of irony, humor, sarcasm, and quotations.
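A lexicon-based approach can be sketched as counting positive and negative terms in each comment and tallying the resulting labels by source. The lexicons, the function name `label_comment`, and the sample comments below are all hypothetical and far smaller than any standard lexicon.

```python
from collections import Counter

# Hypothetical mini-lexicons for illustration; a real study would load a standard lexicon.
POSITIVE_TERMS = {"hurray", "right", "good", "agree"}
NEGATIVE_TERMS = {"stupid", "idiot", "idiotic", "bad"}

def label_comment(text):
    """Label a comment by the sign of (positive hits minus negative hits)."""
    tokens = [t.strip(".,!?:;'\"").lower() for t in text.split()]
    score = sum(t in POSITIVE_TERMS for t in tokens) - sum(t in NEGATIVE_TERMS for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

# Made-up (source, comment) pairs, not drawn from the study's data.
comments = [
    ("DC", "Hurray for Geraldo, finally someone is right"),
    ("DC", "You are stupid beyond words"),
    ("HP", "Geraldo wears a hoodie"),
]
tally = Counter((source, label_comment(text)) for source, text in comments)
print(tally)
```

As the last bullet notes, a counter like this cannot detect irony, sarcasm, or quotation, so a sarcastic "Hurray" would be mislabeled as positive.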

Page 22: Duke University Jensen Lecture

3. Mill’s “Method of Difference”

● Most research papers by computer scientists and statisticians that use topic modeling, sentiment analysis and related techniques are not concerned with sampling data strategically in order to answer social research questions.

● For our study data selection and sampling are critically important.
● Our data selection strategy is based on John Stuart Mill’s “method of difference” (see Lange 2013: 108), although Mill’s logic is vigorously debated in comparative-historical sociology.

● We create document collections from two sources in order to minimize differences between the collections in terms of all factors except for the factor of theoretical interest, which in the present study is website ideology.

Page 23: Duke University Jensen Lecture

● We first constructed document collections based on their relevance to our research question.

● We then used a form of systematic sampling to match the sample sizes of documents, reducing the possibility that differences found between the sentiment term-topic associations across the document collections were caused by exogenous or endogenous factors unrelated to our research questions.

● While our sampling method is not strictly technically necessary for topic discovery, it allows us to isolate empirical variation in only the empirical phenomena that are of theoretical interest, and so is desirable for substantive and theoretical reasons.
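The slides do not spell out the exact sampling procedure, so the following is only a sketch of one common form of systematic sampling: taking items at evenly spaced positions so that two collections end up the same size. The function name `systematic_sample` and the toy collections are assumptions.

```python
def systematic_sample(items, sample_size):
    """Return `sample_size` items taken at evenly spaced positions in `items`."""
    if not 0 < sample_size <= len(items):
        raise ValueError("sample_size must be between 1 and len(items)")
    step = len(items) / sample_size
    # int() truncates, so positions are distinct whenever step >= 1.
    return [items[int(i * step)] for i in range(sample_size)]

# Toy collections standing in for comment lines from two sites.
site_a = [f"A{i}" for i in range(10)]
site_b = [f"B{i}" for i in range(7)]

# Match both collections to the size of the smaller one.
target = min(len(site_a), len(site_b))
matched_a = systematic_sample(site_a, target)
matched_b = systematic_sample(site_b, target)
print(matched_a)
print(matched_b)
```

Matching sizes in this way keeps comment order and spread intact while removing sample size as a difference between the two corpora.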

Page 24: Duke University Jensen Lecture

4. Correspondence Analysis

● We use CA for graphical display of results.

● Familiar to sociologists (Bourdieu, Mohr, Bail, others), CA is similar to principal component analysis (factor analysis) in that it aims to graphically display data in low-dimensional space.

From Distinction (1984)

Page 25: Duke University Jensen Lecture

The Trayvon Martin Controversy

To demonstrate TSA and explore whether high-profile commentators may be more polarizing than other personalities and topics, we analyze public reactions to the shooting death of unarmed teenager Trayvon Martin in spring 2012.

We chose this topic because of the exceptionally large volume of online commentary it generated on partisan news websites.

The shooting of Trayvon Martin by George Zimmerman took place on the night of February 26, 2012, in Sanford, Florida. Martin was an unarmed 17-year-old African American, Zimmerman a 28-year-old multi-racial Hispanic American. Zimmerman was the appointed neighborhood watch coordinator for the gated community where Martin was temporarily staying and where the shooting took place.

While in his vehicle on a personal errand, Zimmerman noticed Martin walking inside the community, and called the Sanford Police Department to report Martin’s behavior as suspicious. While still on the phone with the police dispatcher, Zimmerman left his vehicle. There was a violent encounter between the two men that ended with Zimmerman fatally shooting Martin once in the chest at close range. Zimmerman was detained by police and questioned for approximately five hours, then released without being charged.

Page 26: Duke University Jensen Lecture

The circumstances of Trayvon Martin’s death and the initial decision not to charge Zimmerman received national and international attention. Allegations of racist motivation for both the shooting and police conduct, along with intense media reporting, contributed to public demands for Zimmerman’s arrest.

Protests were staged around the U.S. prior to Zimmerman’s April 11 indictment on murder charges. Over 2.2 million signatures were collected on a Change.org petition, created by Martin’s mother, calling for Zimmerman’s arrest.

Since Martin was killed while wearing a hooded sweatshirt, hoodies were used as a sign of protest, and many cities staged “million hoodie marches” or “hundred hoodie marches.”

Walkouts were staged by students at over a dozen Florida high schools, and thousands of people attended rallies around the country to demand Zimmerman’s arrest. A number of public figures made comments or released statements calling for a full investigation.

In October 2012, Judge Debra S. Nelson set Zimmerman’s trial date for June 10, 2013. In the trial Zimmerman was found not guilty.

Page 27: Duke University Jensen Lecture

Table 1. Reader Comments in Response to Trayvon Martin Stories on Partisan News Sites

| Huffington Post: Date | Huffington Post: Article to which readers responded | Lines | Daily Caller: Date | Daily Caller: Article to which readers responded | Lines |
|---|---|---|---|---|---|
| 3/23 | Geraldo Rivera: Trayvon Martin’s ‘Hoodie Is As Much Responsible For [His] Death As George Zimmerman’ | 405 | 3/23 | Geraldo: ‘The hoodie is as much responsible for Trayvon Martin’s death as George Zimmerman was’ | 405 |

Page 28: Duke University Jensen Lecture

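Sample of Table 1. (continued)

| Huffington Post: Date | Huffington Post: Article to which readers responded | Lines | Daily Caller: Date | Daily Caller: Article to which readers responded | Lines |
|---|---|---|---|---|---|
| 3/23 | Newt Gingrich: Obama’s Trayvon Martin Statement ‘Disgraceful’ | 565 | 3/24 | Gingrich lashes out at Obama for ‘disgraceful’ Trayvon Martin comments as new witness supports shooter | 565 |
| 3/23 | Obama On Trayvon Martin Case: ‘If I Had A Son, He’d Look Like Trayvon’ | 124 | 3/23 | Obama offers belated comments on Trayvon Martin’s death | 124 |
| Total | 129,584 words | 3908 lines | Total | 130,204 words | 3908 lines |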

Page 29: Duke University Jensen Lecture

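Table 2.

| Topic | Topic Label | Topic Total | Huffington Post Negative | Huffington Post Positive | Daily Caller Negative | Daily Caller Positive |
|---|---|---|---|---|---|---|
| T01 | The incident | 1500 | 379 | 131 | 294 | 124 |
| T03 | Black and white | 1038 | 221 | 60 | 309 | 92 |
| T04 | President Obama | 1094 | 187 | 124 | 220 | 143 |
| T05 | Gun laws & Bill Maher | 882 | 239 | 84 | 181 | 69 |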

Page 30: Duke University Jensen Lecture

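Table 2. (continued)

| Topic | Topic Label | Topic Total | Huffington Post Negative | Huffington Post Positive | Daily Caller Negative | Daily Caller Positive |
|---|---|---|---|---|---|---|
| T06 | Hoodies & Geraldo Rivera | 617 | 157 | 43 | 122 | 66 |
| T08 | Looked like Obama's son | 585 | 83 | 74 | 101 | 93 |
| T09 | Racist! | 429 | 89 | 12 | 201 | 21 |
| T10 | God | 263 | 53 | 38 | 47 | 23 |
| T11 | All, justice for all | 725 | 137 | 69 | 153 | 74 |
| T12 | Guns kill, people kill | 830 | 189 | 68 | 184 | 44 |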

Page 31: Duke University Jensen Lecture

Sentiment-Source Polarity Index (PI)

The PI builds upon the idea of a chi-square test for independence between source and sentiment, and is computed separately for each topic k as:

$$PI_k = \sum_{i=1}^{2}\sum_{j=1}^{2}\frac{(O_{ijk} - E_{ijk})^2}{E_{ijk}}, \quad k = 1, \dots, 10.$$

Here, $O_{ijk}$ is the observed comment count from source i, expressing sentiment j, addressing topic k, and $E_{ijk}$ is the corresponding expected comment count under the conditional assumption of independence between source and sentiment given a fixed topic. $E_{ijk}$ is then computed as

$$E_{ijk} = \frac{\left(\sum_{j} O_{ijk}\right)\left(\sum_{i} O_{ijk}\right)}{\sum_{i}\sum_{j} O_{ijk}}.$$
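The PI formula can be checked against the Table 2 counts with a short sketch. The function name `polarity_index` and the row/column layout (rows for Huffington Post and Daily Caller, columns for negative and positive) are choices made here for illustration, not the authors' code.

```python
def polarity_index(observed):
    """Chi-square statistic for a source-by-sentiment table of one topic.

    observed[i][j] is the comment count from source i expressing sentiment j.
    """
    row_totals = [sum(row) for row in observed]          # sum over sentiments j
    col_totals = [sum(col) for col in zip(*observed)]    # sum over sources i
    grand_total = sum(row_totals)
    pi = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected count under independence of source and sentiment.
            e = row_totals[i] * col_totals[j] / grand_total
            pi += (o - e) ** 2 / e
    return pi

# Topic 6 (hoodies/Geraldo Rivera) counts from Table 2.
topic6 = [[157, 43],   # Huffington Post: negative, positive
          [122, 66]]   # Daily Caller: negative, positive
print(round(polarity_index(topic6), 2))  # 8.88
```

Applied to the Topic 1 counts (379, 131, 294, 124), the same function gives about 1.82, which agrees with the value reported for that topic.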

Page 32: Duke University Jensen Lecture

Polarization Index by Topic

| Topic | Topic Label | PI | df | significance |
|---|---|---|---|---|
| 1 | the incident | 1.82 | 1 | 0.1767 |
| 3 | black and white | 0.24 | 1 | 0.6233 |
| 4 | President Obama | 0.02 | 1 | 0.8995 |
| 5 | gun laws/Bill Maher | 0.18 | 1 | 0.6689 |
| 6 | hoodies/Geraldo Rivera | 8.88 | 1 | 0.0029 |
| 8 | looked like Obama's son | 0.02 | 1 | 0.8807 |
| 9 | racist! | 0.44 | 1 | 0.5053 |
| 10 | God | 1.33 | 1 | 0.2484 |
| 11 | all, justice for all | 0.04 | 1 | 0.8431 |
| 12 | guns kill, people kill | 3.49 | 1 | 0.0618 |
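The significance column appears to be the upper-tail probability of a chi-square distribution with one degree of freedom. Under that assumption, it can be approximated from a PI value with the standard library alone, using the identity p = erfc(sqrt(x/2)) for df = 1. The function name `chi2_pvalue_df1` is hypothetical.

```python
import math

def chi2_pvalue_df1(statistic):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    # For df = 1, the statistic is a squared standard normal, so
    # P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(statistic / 2.0))

# PI values as reported (rounded to two decimals) for three topics.
for label, pi in [("hoodies/Geraldo Rivera", 8.88),
                  ("guns kill, people kill", 3.49),
                  ("the incident", 1.82)]:
    print(label, round(chi2_pvalue_df1(pi), 4))
```

Because the inputs are PI values rounded to two decimals, the results can differ from the reported significance levels in the third or fourth decimal place.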

Page 33: Duke University Jensen Lecture

Sample Comments With High Loadings on Topic 6 (hoodies/Geraldo Rivera)

| Comment ID | Comment | Loading |
|---|---|---|
| HP3528 | Kim Kardashian wears Hoodies : | 0.7186 |
| HP0949 | ...Geraldo is absolutely right about the hoodie theory... when you see someone in a hoodie the first thought is what are they up to? Why does that person have a hoodie on?... | 0.6091 |
| HP1394 | So why does Geraldo wear a hoodie? We have pictures. | 0.5558 |
| DC1266 | BS. A hoodie disguises a person's race, idiot. That's why they wear them… | 0.5105 |
| DC2419 | ...You are stupid beyond words. So a black kid can't wear a hoodie because other black people in the country who have committed crimes also wear hoodies?... | 0.4889 |
| DC2530 | Kids today dress like they want to be bad guys...The music culture is the culprit here and the clothing styles...This boy died for stupidity. Stupidity over a good kid dressing like a bad kid when he wasn't… | 0.4741 |
| DC0593 | The blistering idiotic Rivera wears a hoodie during a rainstorm: | 0.4699 |
| DC1951 | Hurray for Geraldo!!!. Finally someone steps up to the plate and tells it like it is. Hoodies should be banned… | 0.4490 |
| DC1564 | First time I agree with Geraldo as well. Hoodies have negative connotations… | 0.4468 |
| HP1193 | Notice I am wearing a Hoodie in my profile picture... Is Rivera suggesting we're in a gang in my town? | 0.4464 |
| HP0618 | How can you say wearing a hoodie=glamorizing gangsta style?... Geraldo wore hoodies on many different occasions, yet he doesn't seem fearful of his life… | 0.4104 |
| HP0451 | Geraldo and Bill O'Reilly wear hoodies. http:www… | 0.4074 |

Page 34: Duke University Jensen Lecture

Correspondence Analysis: Topic-Sentiment Map for Huffington Post Commenters

Page 35: Duke University Jensen Lecture

Correspondence Analysis: Topic-Sentiment Map for Daily Caller Commenters

Page 36: Duke University Jensen Lecture

Discussion

1. Berry and Sobieraj’s “outrage industry” argument makes sense of Geraldo’s “hoodies” comments and of our finding that Geraldo’s comments were highly polarizing.

2. Clearly there are financial incentives in place for celebrity pundits to be outrageous (controversial, polarizing) rather than accurate (Tetlock 2006).

3. A media consultant quoted by Berry and Sobieraj (2013) recommended that for pundits to be successful in the contemporary media landscape they should “be polarizing, be self-deprecating” (p. 114).

Page 37: Duke University Jensen Lecture

TSA has several advantages over survey-based and qualitative methods for analyzing social media data.

● Inexpensive
● Uses widely available technology
● Allows for rigorous analysis of sentiment polarization with only minimal interpretation needed

Conclusions

Page 38: Duke University Jensen Lecture

● The present study admittedly does not contribute much to our knowledge of the contemporary media landscape. Our results confirm Berry and Sobieraj’s (2013) historical analysis.

● But our goal here is to explore how computational linguistics methods can be adapted for social science research. This study is only a first modest step.

● We expect to more fully automate TSA in the near future.
● We see potential applications for conflict negotiation and resolution.
● We also see many other areas of potential collaboration between statisticians, computational linguists and social scientists.

Page 39: Duke University Jensen Lecture

For working papers: unt.academia.edu/GabeIgnatow