Beyond Binary Labels: Political Ideology Prediction of Twitter...
Transcript of Beyond Binary Labels: Political Ideology Prediction of Twitter...
Beyond Binary Labels: Political IdeologyPrediction of Twitter Users
Daniel Preotiuc-PietroBloomberg LP
Joint work with Ye Liu (NUS), Daniel J Hopkins (Political Science), LyleUngar (CS) while at the University of Pennsylvania
2 August 2017
Motivation
User attribute prediction from text is successful:
I Age (Rao et al. 2010 ACL)
I Gender (Burger et al. 2011 EMNLP)
I Location (Eisenstein et al. 2010 EMNLP)
I Personality (Schwartz et al. 2013 PLoS One)
I Impact (Lampos et al. 2014 EACL)
I Political Orientation (Volkova et al. 2014 ACL)
I Mental Illness (Coppersmith et al. 2014 ACL)
I Occupation (Preotiuc-Pietro et al. 2015 ACL)
I Income (Preotiuc-Pietro et al. 2015 PLoS One)
... and useful in many applications.
Political Ideology & Text
Hypothesis:
Political ideology of a user is disclosed through language use
I partisan political mentions or issues
I cultural differences
Political Ideology & Text
Previous CS/NLP research used data sets with user labelsidentified through:
1. User descriptions
H1 Users are far more likely to be politically engaged
Political Ideology & Text
2. Partisan Hashtags
H2 The prediction problem was so far over-simplified
Political Ideology & Text
3. Lists of Conservative/Liberal users
H3 Neutral users
Political Ideology & Text
4. Followers of partisan accounts
H4 Differences in language use exist between moderate andextreme users
Data
I Political ideologyI specific of country and cultureI our use case is US politics (similar to all previous work)I the major US ideology spectrum is Conservative – LiberalI seven point scale
Data
We collect a new data set:
I 3.938 users (4.8M tweets)I public Twitter handle with >100 posts
Political ideology is reported through an online survey
I only way to obtain unbiased ground truth labels (Flekova et al.
2016 ACL, Carpenter et al. 2016 SPPS)
I additionally reported age, gender and other demographics
Data
I Data available at preotiuc.roI full data for research purposesI aggregate for replicability
I Twitter Developer Agreement & Policy VII.A4”Twitter Content, and information derived from Twitter Content, may not be
used by, or knowingly displayed, distributed, or otherwise made available to
any entity to target, segment, or profile individuals based on [...] political
affiliation or beliefs”
I Study approved by the Internal Review Board (IRB) of theUniversity of Pennsylvania
Class Distribution
401453
696
195
401453
696
501
692
594
0
250
500
750
1000
Data
For comparison to previous work, we collect a data set:
I 13.651 users (25.5M tweets)I follow liberal/conservative politicians on Twitter
Hypotheses
H1 Previous studies used users far more likely to be politicallyengaged
H2 The prediction problem was so far over-simplified
H3 Neutral users can be identified
H4 Differences in language use exist between moderate andextreme users
Engagement
H1 Previous studies used users far more likely to be politicallyengaged
Manually coded:
I Political words (234)I Political NEs: mentions of politician proper names (39)I Media NEs: mentions of political media sources and
pundints (20)
Engagement
Data set obtained using previous methods
2.64 2.95
0.73
0.79
0.11
0.18
0.00
0.50
1.00
1.50
2.00
2.50
3.00
3.50
4.00Political word usage across
user groups
Media/Pundit Names
Politician Names
Political Words
Average percentage of political word usage
Engagement
Our data set
2.64 0.76 0.55 0.42 0.36 0.46 0.51 0.76 2.95
0.73
0.24
0.140.07 0.07
0.09 0.12
0.19
0.79
0.11
0.03
0.03
0.02 0.020.03
0.03
0.04
0.18
0.00
0.50
1.00
1.50
2.00
2.50
3.00
3.50
4.00Political word usage across
user groups
Media/Pundit Names
Politician Names
Political Words
Average percentage of political word usage
Engagement
Our data set
2.64 0.76 0.55 0.42 0.36 0.46 0.51 0.76 2.95
0.73
0.24
0.140.07 0.07
0.09 0.12
0.19
0.79
0.11
0.03
0.03
0.02 0.020.03
0.03
0.04
0.18
0.00
0.50
1.00
1.50
2.00
2.50
3.00
3.50
4.00Political word usage across
user groups
Media/Pundit Names
Politician Names
Political Words
Average percentage of political word usage
Engagement
Take aways:
I 3x more political terms for automatically identified userscompared to the highest survey-based scores
I almost perfectly symmetrical U-shape across all threetypes of political terms
I The difference between 1-2/6-7 is larger than 2-3/5-6
Hypotheses
H1 Previous studies used users far more likely to be politicallyengaged
H2 The prediction problem was so far over-simplified
H3 Neutral users can be identified
H4 Differences in language use exist between moderate andextreme users
Over-simplification
H2 The prediction problem was so far over-simplified
.891
.972 .976
.5
.6
.7
.8
.9
1.0
CvL
Topics Political Terms Domain Adaptation
ROC AUC, Logistic Regression, 10-fold cross-validation
Over-simplification
H2 The prediction problem was so far over-simplified
.891
.785
.972
.785
.976
.789
.5
.6
.7
.8
.9
1.0
CvL 1v7
Topics Political Terms Domain Adaptation
ROC AUC, Logistic Regression, 10-fold cross-validation
Over-simplification
H2 The prediction problem was so far over-simplified
.891
.785
.662
.972
.785
.679
.976
.789
.690
.5
.6
.7
.8
.9
1.0
CvL 1v7 2v6
Topics Political Terms Domain Adaptation
ROC AUC, Logistic Regression, 10-fold cross-validation
Over-simplification
H2 The prediction problem was so far over-simplified
.891
.785
.662
.581
.972
.785
.679
.590
.976
.789
.690
.625
.5
.6
.7
.8
.9
1.0
CvL 1v7 2v6 3v5
Topics Political Terms Domain Adaptation
ROC AUC, Logistic Regression, 10 fold-cross validation
Over-simplification
Predicting continuous political leaning (1 – 7)
.294 .286.300
.145
.256
.369
.00
.10
.20
.30
.40
Leaning
Unigrams LIWC Topics Emotions Political All
Pearson R between predictions and true labels, Linear Regression,10-fold cross-validation
Over-simplification
Seven-class classification
19.60%22.20%
24.20%26.20%
27.60%
0%
10%
20%
30%
Accuracy, 10-fold cross-validation
GR – Logistic regression with Group Lasso regularisation
Hypotheses
H1 Previous studies used users far more likely to be politicallyengaged
H2 The prediction problem was so far over-simplified
H3 Neutral users can be identified
H4 Differences in language use exist between moderate andextreme users
Neutral Users
H3 Neutral users can be identified
Words associated with eitherextreme conservative or liberal
Words associated with neutralusers
a aacorrelation strength
Correlations are age and gender controlled. Extreme groups arecombined using matched age and gender distributions.
Political Engagement
H3a There is a separate dimension of political engagement
Combine the classes into a scale: 4 – 3&5 – 2&6 – 1&7
.294
.165
.286
.149
.300
.169.145
.079
.256
.169
.369
.196
.00
.10
.20
.30
.40
Leaning Engagement
Unigrams LIWC Topics Emotions Political All
Pearson R between predictions and true labels, Linear Regression, 10fold-cross validation
Hypotheses
H1 Previous studies used users far more likely to be politicallyengaged
H2 The prediction problem was so far over-simplified
H3 Neutral users can be identified
H4 Differences in language use exist between moderate andextreme users
Moderate Users
H4 Differences between moderate and extreme users
Words associated with moderateliberals (5 and 6).
Words associated with extremeliberals (7).
relative frequency
a aacorrelation strength
Correlations are age and gender controlled
Take Aways
I User-level trait acquisition methodologies can generatenon-representative samples
I Political ideology:I Goes beyond binary classesI The problem was to date over-simplifiedI New data set available for researchI New model to identify political leaning and engagement