The Challenges in Analyzing Twitter Data for Public Opinion Researchers Masahiko Aida, Director of Analytics

WHY TWITTER?

© 2012 All Rights Reserved. 1

• Access to Twitter data is open, unlike Facebook

• User base is large: 140 million users (compared with 901 million Facebook users and 100 million Google+ users)

• Ubiquitous among politicians in the US (as of May 2012)

– 375 House members (of 435)

– 92 Senators (of 100)

– 49 Governors (of 51)

CHALLENGES IN TWITTER ANALYSIS

• Sampling and Coverage Problem

– The volume of Twitter data can be large, and it can be costly to obtain and store

– Coverage issue

• Text Analysis Problem 

• Inference Problem 

DATA SIZE AND SAMPLING

Source: Twitter

• During the State of the Union address, there were 766,681 SOTU-related tweets in 95 minutes (about 8,070 tweets per minute). The data amount to approximately 600 MB.

• Imagine saving all the tweets during an election year.

• However, one can sample and save a subset.
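
The arithmetic on this slide, and the idea of saving only a subset, can be sketched in Python. Reservoir sampling is used here as one common way to keep a fixed-size uniform sample from a stream of unknown length; the slides do not name a sampling method, so that choice, the function name `reservoir_sample`, and the assumption that 1 MB equals 10^6 bytes are all illustrative.

```python
import random

# Figures quoted on the slide.
SOTU_TWEETS = 766_681
SOTU_MINUTES = 95
SOTU_MEGABYTES = 600  # approximate size reported on the slide

tweets_per_minute = SOTU_TWEETS / SOTU_MINUTES
# Assumption: 1 MB = 10^6 bytes.
bytes_per_tweet = SOTU_MEGABYTES * 1_000_000 / SOTU_TWEETS


def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Item i replaces a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample


if __name__ == "__main__":
    print(f"{tweets_per_minute:.0f} tweets per minute")
    print(f"about {bytes_per_tweet:.0f} bytes per stored tweet")
    fake_stream = (f"tweet {n}" for n in range(100_000))  # stand-in for a real feed
    kept = reservoir_sample(fake_stream, k=1_000, seed=42)
    print(len(kept), "tweets kept")
```

The sample size stays fixed no matter how long the stream runs, so storage cost is known in advance.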

COVERAGE – OBAMA TWEETS EXAMPLE

• Ideally, we want to assign a non‐zero chance of selection to all the tweets that discuss a particular topic.

• However, with 340 million tweets a day, it is extremely inefficient to pull a random sample.

– Ex. There are about 20,000 tweets a day that include “Barack Obama”, which is 0.0059% of all tweets.

• Another possibility is to query related words such as “the president”.

– However, it will increase noise.

[Figure: diagram of the tweet universe. Labels: “Barack”, “Obama”, “The President”, “Universe”, “Missed tweets (φ)”, “Irrelevant tweets”.]
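
The inefficiency of a simple random sample follows directly from the two daily counts on this slide. A short sketch of that arithmetic is below; the target of 1,000 on-topic tweets and the function name `draws_needed` are hypothetical and not from the slides.

```python
TWEETS_PER_DAY = 340_000_000    # from the slide
OBAMA_TWEETS_PER_DAY = 20_000   # from the slide

# Share of all tweets that mention the keyword.
hit_rate = OBAMA_TWEETS_PER_DAY / TWEETS_PER_DAY


def draws_needed(target_hits, matching, total):
    """Expected number of random tweets to pull to obtain target_hits on-topic tweets.

    Uses integer ceiling division to avoid floating-point rounding.
    """
    return (target_hits * total + matching - 1) // matching


if __name__ == "__main__":
    print(f"hit rate: {hit_rate:.4%}")
    # 1,000 is a hypothetical target sample size, not a figure from the slides.
    needed = draws_needed(1_000, OBAMA_TWEETS_PER_DAY, TWEETS_PER_DAY)
    print(f"random tweets needed for 1,000 on-topic tweets: {needed:,}")
```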

POTENTIAL SOLUTION FOR COVERAGE ERRORS

• Stratified approach: create a list of users from a keyword query and pull tweets targeting user IDs.

– High density: users who tweeted about Obama within 3 days (least expensive)

– Mid density: users who tweeted about Obama within 1 week (inexpensive)

– Low density: users who tweeted about Obama within 1 month (expensive)

– Very low density: users who have not tweeted about Obama for more than 1 month (cost prohibitive)
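
A minimal sketch of how users might be assigned to these strata is below. It assumes the strata do not overlap (each user falls into the most recent one that applies) and approximates "1 month" as 30 days; the user IDs, dates, and the function name `assign_stratum` are hypothetical.

```python
from datetime import date

# Strata from the slide as (maximum days since last keyword tweet, label).
# Assumption: "1 month" is approximated as 30 days.
STRATA = [
    (3, "high density"),
    (7, "mid density"),
    (30, "low density"),
]


def assign_stratum(last_keyword_tweet, today):
    """Return the density stratum for a user, given the date of their last keyword tweet."""
    if last_keyword_tweet is None:
        # The user has never been seen tweeting the keyword.
        return "very low density"
    days_since = (today - last_keyword_tweet).days
    for max_days, label in STRATA:
        if days_since <= max_days:
            return label
    return "very low density"


if __name__ == "__main__":
    today = date(2012, 5, 31)
    # Hypothetical user IDs and dates, for illustration only.
    last_seen = {
        "user_a": date(2012, 5, 30),
        "user_b": date(2012, 5, 26),
        "user_c": date(2012, 5, 10),
        "user_d": date(2012, 3, 1),
        "user_e": None,
    }
    for user, seen in last_seen.items():
        print(user, assign_stratum(seen, today))
```

Tweets could then be pulled by user ID at a different sampling rate per stratum, with the denser strata sampled more heavily because they are cheaper per relevant tweet.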

TEXT ANALYSIS

• Many of our tools assume numeric data, and it is very difficult to map text into numeric values.

• Several vendors offer rule-based sentiment scores.

– Ex. Like, love, greedy, enthusiastic

– Cannot handle sarcasm.

– Vendors use secret proprietary algorithms to code sentiment.

• Alternative: supervised learning methods
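
To illustrate why rule-based scores struggle with sarcasm, here is a toy word-list scorer. It is not any vendor's algorithm: the four words come from the slide, while the +1/-1 scores, the example tweets, and the function name `rule_based_score` are assumptions.

```python
import re

# A tiny hand-made lexicon. The words are the examples on the slide;
# the +1 / -1 scores are assumptions.
LEXICON = {"like": 1, "love": 1, "enthusiastic": 1, "greedy": -1}


def rule_based_score(text):
    """Sum lexicon scores over the words of a tweet."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(word, 0) for word in words)


if __name__ == "__main__":
    # Sincere praise.
    print(rule_based_score("I love this speech"))
    # Sarcastic, yet it receives the same positive score because of "love".
    print(rule_based_score("Oh great, I just love tax hikes"))
```

Both example tweets receive the same score, because the scorer only sees the word "love" and not the intent behind it.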

HOW SENTIMENT CODING WORKS

[Figure: flow diagram. Tweets are classified by a human coder, who assigns sentiment (training dataset); supervised learning models are created from that dataset and applied to further tweets.]

Example: RTextTools. Timothy P. Jurka, Loren Collingwood, Amber E. Boydstun, Emiliano Grossman and Wouter van Atteveldt (2012). RTextTools: Automatic Text Classification via Supervised Learning. R package version 1.3.6. http://CRAN.R-project.org/package=RTextTools
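
The workflow on this slide (hand-coded tweets used to train a model that then codes new tweets) can be sketched with a small classifier. This is not RTextTools and does not reproduce its API; it is a plain-Python multinomial naive Bayes, chosen only because it fits in a few lines. The training tweets and labels are made up.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a tweet and split it into words."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in tokenize(text):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    # Hypothetical human-coded training set.
    coded = [
        ("I love the president's plan", "positive"),
        ("great speech tonight", "positive"),
        ("what a greedy, terrible policy", "negative"),
        ("worst speech ever", "negative"),
    ]
    model = NaiveBayes().fit([t for t, _ in coded], [s for _, s in coded])
    print(model.predict("love this great plan"))
    print(model.predict("terrible greedy policy"))
```

In practice the training set would need far more than four tweets, and the coded sample should be drawn from the same stream the model will later classify.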

OTHER TEXT ANALYSIS EXAMPLES

• Forget about quantifying: use a strictly visual approach.

– Appearances of nouns and adjectives.

– Process tweets with natural language processing software

• Network analysis

– Identify “influentials” in the network

• Data: Twitter data that include the following

1. Mitt Romney

2. WI recall election

FREQUENT TERMS THAT DESCRIBE ROMNEY

[Figure: frequent terms in tweets about Romney, annotated with “Increased mentions of Romney’s money”, “War on Women”, and “Romney’s proposed tax rate”.]

Data: Sample of tweets that include “Mitt Romney”.

Processed with a natural language analysis package using Python.
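
A rough sketch of the term counting behind a chart like this is below. The slides do not say which Python package was used, so this version relies only on the standard library and skips part-of-speech tagging (the original restricted terms to nouns and adjectives). The stop-word list, the sample tweets, and the function name `frequent_terms` are assumptions.

```python
import re
from collections import Counter

# Minimal stop-word list (an assumption); real analyses use longer lists.
STOP_WORDS = {"a", "an", "and", "the", "is", "on", "of", "to", "in", "his", "for", "about"}


def frequent_terms(tweets, query_terms, top_n=5):
    """Count words across tweets, skipping stop words and the query terms themselves."""
    skip = STOP_WORDS | {term.lower() for term in query_terms}
    counts = Counter()
    for tweet in tweets:
        words = re.findall(r"[a-z]+", tweet.lower())
        counts.update(word for word in words if word not in skip)
    return counts.most_common(top_n)


if __name__ == "__main__":
    # Hypothetical tweets, for illustration only.
    tweets = [
        "Mitt Romney and his money",
        "Mitt Romney tax rate is low",
        "Romney money, money, money",
        "The tax plan of Mitt Romney",
    ]
    for term, count in frequent_terms(tweets, ["Mitt", "Romney"], top_n=2):
        print(term, count)
```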

GOP PRIMARY POLLS AND TWITTER

1st row: GOP primary public polling summary.

2nd row: frequency of candidate names from the Twitter sample.

VISUALIZING MENTIONS: WI RECALL ELECTION

[Figure: network of mentions, with labels “Liberal News Media”, “Tea Party Types”, and “Rasmussen poll”.]

Method: Collect a sample of tweets that include “WI re-election” or “Scott Walker”.

Visualize relationships of mentions.

CONSERVATIVE TWITTER ACCOUNTS

[Figure: network of conservative Twitter accounts, with one account labeled “Brother of Rush Limbaugh”.]

Network visualization allows us to see popular news sources and how mentions are clustered ideologically.
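
Before drawing such a network, the mention relationships have to be extracted from the tweets. A minimal sketch follows; it ranks accounts by the number of distinct users who mention them, which is only one simple proxy for "influentials" and may differ from the measure behind the original charts. The account names and function names are hypothetical.

```python
import re
from collections import Counter

MENTION = re.compile(r"@(\w+)")


def mention_edges(tweets):
    """Yield (author, mentioned_account) pairs from (author, text) tuples."""
    for author, text in tweets:
        for target in MENTION.findall(text):
            if target.lower() != author.lower():  # skip self-mentions
                yield author.lower(), target.lower()


def most_mentioned(tweets, top_n=3):
    """Rank accounts by the number of distinct users who mention them."""
    distinct_edges = set(mention_edges(tweets))
    in_degree = Counter(target for _, target in distinct_edges)
    return in_degree.most_common(top_n)


if __name__ == "__main__":
    # Hypothetical authors and accounts, for illustration only.
    tweets = [
        ("alice", "@news_a reports on the recall"),
        ("bob", "RT @news_a: recall update"),
        ("carol", "@news_b says otherwise @news_a"),
        ("alice", "thanks @news_a"),
    ]
    print(most_mentioned(tweets, top_n=2))
```

The same edge list can be handed to a graph-drawing tool to see how mentions cluster.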

PROBLEM OF INFERENCE I

• We may develop better means of predicting sentiment and of sampling, so the measurement of Twitter opinion will improve as we gain experience.

• However, the distribution of opinions on Twitter is not directly transferable to the opinions of the general public or likely voters.

• Can we find a way to infer the opinions of the general population from Twitter data? Maybe.

PROBLEM OF INFERENCE II

• The purpose of research is not necessarily obtaining unbiased point estimates of a population parameter.

No smoke where there is no fire.

– Ex. Suppose one person claims “All swans are white.”

– I just need one black swan to prove that is not the case.

PROBLEM OF INFERENCE III : MODEL BASED

• If we can approximate the mechanism that separates Twitter users and the general public, and estimate opinion with the correct specification, the bias will be smaller.

– Ex. Heckman model, sample matching (YouGovPolimetrix surveys)

– Ex. Twitter sentiment of a college-educated gay black man who lives in Ohio

– Likely supports Obama and does not favor Romney.
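
The idea of correcting for who is on Twitter can be illustrated with post-stratification, a much simpler adjustment than the Heckman model or sample matching named above and not a substitute for them. It reweights group-level opinion from Twitter by each group's share of the target population. All groups, counts, and shares below are made up, and the function name `poststratified_estimate` is hypothetical.

```python
def poststratified_estimate(sample, population_shares):
    """Weight group-level means by population shares instead of sample shares.

    sample: dict mapping group -> list of 0/1 opinions observed on Twitter
    population_shares: dict mapping group -> share of the target population
    """
    estimate = 0.0
    for group, share in population_shares.items():
        opinions = sample[group]
        if not opinions:
            raise ValueError(f"no Twitter observations for group {group!r}")
        estimate += share * sum(opinions) / len(opinions)
    return estimate


if __name__ == "__main__":
    # Made-up numbers: younger users are over-represented in the Twitter sample.
    sample = {
        "under_35": [1] * 60 + [0] * 20,  # 80 users, 75% favorable
        "35_plus": [1] * 8 + [0] * 12,    # 20 users, 40% favorable
    }
    population_shares = {"under_35": 0.3, "35_plus": 0.7}
    n_favorable = sum(sum(v) for v in sample.values())
    n_total = sum(len(v) for v in sample.values())
    print(f"raw Twitter share: {n_favorable / n_total:.2f}")
    print(f"post-stratified:   {poststratified_estimate(sample, population_shares):.2f}")
```

The adjustment only removes bias to the extent that the grouping variables capture the mechanism separating Twitter users from the general public, which is the condition stated on this slide.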

SUMMARY

• Technical challenges that can be solved

– Sampling and coverage issues in Twitter sampling

– Mapping of text data onto a scale

• Problems that are hard to solve in the near future

– Generalization of the opinion distribution to the general public

• Think differently

– Use Twitter to find emerging or rare patterns

– Use Twitter to see how people are obtaining information

– Different types of inference (find smoke)