Beyond Binary Labels: Political Ideology Prediction of Twitter Users

Daniel Preotiuc-Pietro
Bloomberg LP

Joint work with Ye Liu (NUS), Daniel J Hopkins (Political Science), Lyle Ungar (CS) while at the University of Pennsylvania

2 August 2017

Motivation

User attribute prediction from text is successful:

- Age (Rao et al. 2010 ACL)
- Gender (Burger et al. 2011 EMNLP)
- Location (Eisenstein et al. 2010 EMNLP)
- Personality (Schwartz et al. 2013 PLoS One)
- Impact (Lampos et al. 2014 EACL)
- Political Orientation (Volkova et al. 2014 ACL)
- Mental Illness (Coppersmith et al. 2014 ACL)
- Occupation (Preotiuc-Pietro et al. 2015 ACL)
- Income (Preotiuc-Pietro et al. 2015 PLoS One)

... and useful in many applications.

Political Ideology & Text

Hypothesis:

Political ideology of a user is disclosed through language use

- partisan political mentions or issues
- cultural differences

Previous CS/NLP research used data sets with user labels identified through:

1. User descriptions
2. Partisan hashtags
3. Lists of Conservative/Liberal users
4. Followers of partisan accounts

Hypotheses:

H1 Previous studies used users far more likely to be politically engaged
H2 The prediction problem was so far over-simplified
H3 Neutral users can be identified
H4 Differences in language use exist between moderate and extreme users

Data

- Political ideology
  - specific to country and culture
  - our use case is US politics (similar to all previous work)
  - the major US ideology spectrum is Conservative – Liberal
  - seven-point scale

Data

We collect a new data set:

- 3,938 users (4.8M tweets)
  - public Twitter handle with >100 posts

Political ideology is reported through an online survey

- only way to obtain unbiased ground truth labels (Flekova et al. 2016 ACL, Carpenter et al. 2016 SPPS)
- additionally reported age, gender and other demographics

Data

- Data available at preotiuc.ro
  - full data for research purposes
  - aggregate for replicability
- Twitter Developer Agreement & Policy VII.A4: "Twitter Content, and information derived from Twitter Content, may not be used by, or knowingly displayed, distributed, or otherwise made available to any entity to target, segment, or profile individuals based on [...] political affiliation or beliefs"
- Study approved by the Institutional Review Board (IRB) of the University of Pennsylvania

Class Distribution

[Figure: bar chart of the number of users in each ideology class; y-axis from 0 to 1000. Bar labels as extracted: 401, 453, 696, 195, 501, 692, 594.]

Data

For comparison to previous work, we collect a data set:

- 13,651 users (25.5M tweets)
  - follow liberal/conservative politicians on Twitter

Hypotheses

H1 Previous studies used users far more likely to be politically engaged


H3 Neutral users can be identified


Engagement


Manually coded:

- Political words (234)
- Political NEs: mentions of politician proper names (39)
- Media NEs: mentions of political media sources and pundits (20)
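
As an illustration of how such term lists can be turned into the usage percentages reported on the next slides, here is a minimal sketch. The tiny term lists, the tokenizer, and the function names are hypothetical placeholders; the actual 234/39/20-entry lists and the exact counting procedure are not given in the slides. Multi-word names are not handled.

```python
import re

# Hypothetical, tiny stand-ins for the manually coded lists on the slide
# (the real lists have 234, 39 and 20 entries and are not reproduced here).
TERM_LISTS = {
    "political_words": {"election", "senate", "democrat", "republican"},
    "politician_names": {"obama", "trump"},
    "media_pundit_names": {"maddow", "hannity"},
}


def tokenize(text):
    """Lowercase a tweet and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def usage_percentages(tweets):
    """Return, per term list, the percentage of a user's tokens that match."""
    tokens = [tok for tweet in tweets for tok in tokenize(tweet)]
    if not tokens:
        return {name: 0.0 for name in TERM_LISTS}
    return {
        name: 100.0 * sum(tok in terms for tok in tokens) / len(tokens)
        for name, terms in TERM_LISTS.items()
    }


def group_average(users):
    """Average the per-user percentages over a group of users.

    `users` is a list of users, each given as a list of tweet strings.
    """
    per_user = [usage_percentages(tweets) for tweets in users]
    return {
        name: sum(p[name] for p in per_user) / len(per_user)
        for name in TERM_LISTS
    }
```

Calling `group_average` once per user group (for example, once per survey class) would give one value per term type and group, which is the kind of quantity the engagement charts plot.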

Engagement

Data set obtained using previous methods

[Figure: Political word usage across user groups. Average percentage of political word usage for the two user groups in this data set.]

Term type            Group 1   Group 2
Political Words      2.64      2.95
Politician Names     0.73      0.79
Media/Pundit Names   0.11      0.18

Engagement

Our data set

[Figure: Political word usage across user groups. Average percentage of political word usage. The first and last columns (Prev.) are the two groups from the data set obtained using previous methods; columns 1 to 7 are the survey classes.]

Term type            Prev.   1      2      3      4      5      6      7      Prev.
Political Words      2.64    0.76   0.55   0.42   0.36   0.46   0.51   0.76   2.95
Politician Names     0.73    0.24   0.14   0.07   0.07   0.09   0.12   0.19   0.79
Media/Pundit Names   0.11    0.03   0.03   0.02   0.02   0.03   0.03   0.04   0.18

Engagement

Takeaways:

- 3x more political terms for automatically identified users compared to the highest survey-based scores
- almost perfectly symmetrical U-shape across all three types of political terms
- the difference between classes 1-2 and 6-7 is larger than between classes 2-3 and 5-6

Hypotheses





Over-simplification

[Figure: bar chart of binary classification performance for CvL (Conservative vs. Liberal) and for class pairs on the seven-point scale (1v7, 2v6, 3v5); y-axis from .5 to 1.0.]

Task   Topics   Political Terms   Domain Adaptation
CvL    .891     .972              .976
1v7    .785     .785              .789
2v6    .662     .679              .690
3v5    .581     .590              .625

ROC AUC, Logistic Regression, 10-fold cross-validation
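
For readers unfamiliar with the metric, the sketch below shows one way ROC AUC can be computed under k-fold cross-validation. It is an illustration under stated assumptions, not the authors' pipeline: `fit_predict` is a hypothetical callable standing in for any classifier (such as logistic regression), folds are assigned by index, and out-of-fold scores are pooled before scoring, which the slides do not specify.

```python
def roc_auc(labels, scores):
    """ROC AUC as the probability that a positive outranks a negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def kfold_indices(n_items, n_folds=10):
    """Yield (train, test) index lists; item i goes to fold i % n_folds."""
    for fold in range(n_folds):
        test = [i for i in range(n_items) if i % n_folds == fold]
        train = [i for i in range(n_items) if i % n_folds != fold]
        yield train, test


def cross_validated_auc(features, labels, fit_predict, n_folds=10):
    """Pool out-of-fold scores from `fit_predict`, then compute one ROC AUC.

    `fit_predict(train_x, train_y, test_x)` is a hypothetical callable that
    trains any classifier and returns one score per test item.
    """
    scores = [0.0] * len(labels)
    for train, test in kfold_indices(len(labels), n_folds):
        fold_scores = fit_predict(
            [features[i] for i in train],
            [labels[i] for i in train],
            [features[i] for i in test],
        )
        for i, s in zip(test, fold_scores):
            scores[i] = s
    return roc_auc(labels, scores)
```

A value of .5 corresponds to random ranking, which is why the chart's axis starts there.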

Over-simplification

Predicting continuous political leaning (1 – 7)

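[Figure: bar chart of prediction performance for political leaning by feature set; y-axis from .00 to .40.]

Feature set   Leaning
Unigrams      .294
LIWC          .286
Topics        .300
Emotions      .145
Political     .256
All           .369

Pearson R between predictions and true labels, Linear Regression, 10-fold cross-validation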
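
The evaluation metric here is the Pearson correlation between predicted and self-reported leaning. A minimal sketch of that computation follows; the function name and the error handling are illustrative choices, not taken from the slides.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

With true labels on the 1 to 7 scale and real-valued predictions from a regression model, `pearson_r(predictions, labels)` yields the kind of score shown in the table.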

Over-simplification

Seven-class classification

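[Figure: bar chart of seven-class accuracy for five methods; the method labels were not recovered. Bar values: 19.60%, 22.20%, 24.20%, 26.20%, 27.60%; y-axis from 0% to 30%.]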

Accuracy, 10-fold cross-validation

GR – Logistic regression with Group Lasso regularisation

Hypotheses





Neutral Users


Words associated with either extreme conservative or liberal

Words associated with neutral users

[Word clouds not recovered; the legend shows correlation strength.]

Correlations are age and gender controlled. Extreme groups are combined using matched age and gender distributions.
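
One common way to control a correlation for covariates such as age and gender is the partial correlation. The sketch below uses the standard recursive formula; it is an assumption that this matches the procedure behind the word clouds, since the slides do not describe how the controls were applied. Gender would need a numeric coding (for example 0/1), and all names here are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def partial_r(x, y, controls):
    """Correlation of x and y after controlling for every series in `controls`.

    Uses the recursive partial-correlation formula, peeling off the last
    control variable at each step.
    """
    if not controls:
        return pearson_r(x, y)
    *rest, z = controls
    r_xy = partial_r(x, y, rest)
    r_xz = partial_r(x, z, rest)
    r_yz = partial_r(y, z, rest)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
```

For a single word, `x` would be its relative frequency per user, `y` the user trait of interest, and `controls` the list `[age, gender]`.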

Political Engagement

H3a There is a separate dimension of political engagement

Combine the classes into a scale: 4 – 3&5 – 2&6 – 1&7

[Figure: bar chart of prediction performance for leaning and engagement by feature set; y-axis from .00 to .40.]

Feature set   Leaning   Engagement
Unigrams      .294      .165
LIWC          .286      .149
Topics        .300      .169
Emotions      .145      .079
Political     .256      .169
All           .369      .196

Pearson R between predictions and true labels, Linear Regression, 10-fold cross-validation
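
The engagement scale above (4 – 3&5 – 2&6 – 1&7) can be read as the distance of a user's ideology label from the scale midpoint. A minimal sketch, assuming a numeric 0 to 3 coding that the slides do not spell out:

```python
def engagement_level(ideology):
    """Map a 1-7 ideology label to a 0-3 engagement level.

    The level is the distance from the midpoint 4, so 4 -> 0,
    3 and 5 -> 1, 2 and 6 -> 2, 1 and 7 -> 3.
    """
    if ideology not in range(1, 8):
        raise ValueError("ideology label must be an integer from 1 to 7")
    return abs(ideology - 4)
```

Under this coding, leaning and engagement become two separate regression targets derived from the same survey answer.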

Hypotheses





Moderate Users

H4 Differences between moderate and extreme users

Words associated with moderate liberals (5 and 6).

Words associated with extreme liberals (7).

[Word clouds not recovered; the legend shows relative frequency and correlation strength.]

Correlations are age and gender controlled

Take Aways

- User-level trait acquisition methodologies can generate non-representative samples
- Political ideology:
  - Goes beyond binary classes
  - The problem was to date over-simplified
  - New data set available for research
  - New model to identify political leaning and engagement

Questions?

www.preotiuc.ro

wwbp.org
