Modeling Social Data, Lecture 2: Introduction to Counting
Introduction to Counting
APAM E4990
Modeling Social Data
Jake Hofman
Columbia University
January 27, 2017
Jake Hofman (Columbia University) Intro to Counting January 27, 2017 1 / 27
Why counting?
http://bit.ly/august2016poll
p( y | x ), where y = support and x = age
Why counting?
http://bit.ly/ageracepoll2016
p( y | x1, x2 ), where y = support and x1, x2 = age, race
Why counting?
p( y | x1, x2, x3, . . . ), where y = support and x1, x2, x3, . . . = age, sex, race, party
Why counting?
Problem:
Traditionally difficult to obtain reliable estimates due to small sample sizes or sparsity
(e.g., ∼ 100 age × 2 sex × 5 race × 3 party = 3,000 groups, but typical surveys collect ∼ 1,000s of responses)
Why counting?
Potential solution:
Sacrifice granularity for precision, by binning observations into larger, but fewer, groups
(e.g., bin age into a few groups: 18-29, 30-49, 50-64, 65+)
Why counting?
Potential solution:
Develop more sophisticated methods that generalize well from small samples
(e.g., fit a model: support ∼ β0 + β1·age + β2·age² + . . .)
Why counting?
(Partial) solution:
Obtain larger samples through other means, so we can just count and divide to make estimates via relative frequencies
(e.g., with ∼ 1M responses, we have 100s per group and can estimate support within a few percentage points)
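Count-and-divide can be sketched in a few lines. The age groups and responses below are hypothetical, purely for illustration:

```python
from collections import defaultdict

# Hypothetical survey responses as (age group, supports candidate?) pairs
responses = [
    ("18-29", 1), ("18-29", 0), ("30-49", 1), ("30-49", 1),
    ("50-64", 0), ("65+", 1), ("65+", 0), ("65+", 1),
]

counts = defaultdict(int)    # responses per group
supports = defaultdict(int)  # supporters per group

for group, support in responses:
    counts[group] += 1
    supports[group] += support

# "Count and divide": estimate p(support | group) as a relative frequency
estimates = {group: supports[group] / counts[group] for group in counts}
```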
Why counting?
[Slide shows the first page of: Wei Wang, David Rothschild, Sharad Goel, and Andrew Gelman, "Forecasting elections with non-representative polls," International Journal of Forecasting 31 (2015) 980–991.]

Abstract: Election forecasts have traditionally been based on representative polls, in which randomly sampled individuals are asked who they intend to vote for. While representative polling has historically proven to be quite effective, it comes at considerable costs of time and money. Moreover, as response rates have declined over the past several decades, the statistical benefits of representative sampling have diminished. In this paper, we show that, with proper statistical adjustment, non-representative polls can be used to generate accurate election forecasts, and that this can often be achieved faster and at a lesser expense than traditional survey methods. We demonstrate this approach by creating forecasts from a novel and highly non-representative survey dataset: a series of daily voter intention polls for the 2012 presidential election conducted on the Xbox gaming platform. After adjusting the Xbox responses via multilevel regression and poststratification, we obtain estimates which are in line with the forecasts from leading poll analysts, which were based on aggregating hundreds of traditional polls conducted during the election cycle. We conclude by arguing that non-representative polling shows promise not only for election forecasting, but also for measuring public opinion on a broad range of social, economic and cultural issues.
http://bit.ly/nonreppoll
Why counting?
The good:
Shift away from sophisticated statistical methods on small samples to simpler methods on large samples
Why counting?
The bad:
Even simple methods (e.g., counting) are computationally challenging at large scales
(1M is easy, 1B a bit less so, 1T gets interesting)
Why counting?
Claim:
Solving the counting problem at scale enables you to investigate many interesting questions in the social sciences
Learning to count
This week:
Counting at small/medium scales on a single machine
Following weeks:
Counting at large scales in parallel
Counting, the easy way
Split / Apply / Combine1
• Load dataset into memory
• Split: Arrange observations into groups of interest
• Apply: Compute distributions and statistics within each group
• Combine: Collect results across groups
1 http://bit.ly/splitapplycombine
The generic group-by operation
Split / Apply / Combine
for each observation as (group, value):
place value in bucket for corresponding group
for each group:
apply a function over values in bucket
output group and result
Useful for computing arbitrary within-group statistics when we have the required memory
(e.g., conditional distribution, median, etc.)
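The pseudocode above can be sketched as a minimal Python function; the ratings list and movie ids are made up for illustration:

```python
from collections import defaultdict
from statistics import median

def group_by(observations, func):
    """Generic group-by: split values into buckets, apply func per bucket."""
    buckets = defaultdict(list)
    # Split: place each value in the bucket for its group
    for group, value in observations:
        buckets[group].append(value)
    # Apply / Combine: compute the statistic within each group
    return {group: func(values) for group, values in buckets.items()}

ratings = [("movie_a", 4), ("movie_a", 2), ("movie_b", 5)]
medians = group_by(ratings, median)  # e.g., median rating per movie
```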
Example: Anatomy of the long tail
Dataset     Users   Items   Rating levels   Observations
Movielens   100K    10K     10              10M
Netflix     500K    20K     5               100M
Example: Movielens
How many ratings are there at each star level?
[Bar chart: number of ratings (0 to ∼3,000,000) at each star level, 1–5]
Example: Movielens
[Bar chart: number of ratings at each star level, 1–5]
group by rating value
for each group:
count # ratings
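The grouping-and-counting step above can be sketched with a Counter; the sample ratings are hypothetical:

```python
from collections import Counter

# Group by rating value and count # ratings per group
ratings = [5, 4, 4, 3, 5, 1, 4]  # hypothetical star ratings
counts = Counter(ratings)
```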
Example: Movielens
What is the distribution of average ratings by movie?
[Histogram: density of mean rating by movie, 1–5]
Example: Movielens
group by movie id
for each group:
compute average rating
[Histogram: density of mean rating by movie, 1–5]
Example: Movielens
What fraction of ratings are given to the most popular movies?
[CDF: cumulative fraction of ratings (0%–100%) vs. movie rank (0–9,000)]
Example: Movielens
[CDF: cumulative fraction of ratings (0%–100%) vs. movie rank (0–9,000)]
group by movie id
for each group:
count # ratings
sort by group size
cumulatively sum group sizes
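The steps above can be sketched as follows; the movie ids are made up for illustration:

```python
from collections import Counter

# Count ratings per movie, sort group sizes descending, cumulatively sum
ratings = ["a", "a", "a", "b", "b", "c"]  # hypothetical movie ids

sizes = sorted(Counter(ratings).values(), reverse=True)
total = sum(sizes)

cdf, running = [], 0
for size in sizes:
    running += size
    cdf.append(running / total)  # fraction of ratings covered by top-k movies
```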
Example: Movielens
What is the median rank of each user’s rated movies?
[Histogram: number of users (0–8,000) vs. user eccentricity (100–10,000, log scale)]
Example: Movielens
join movie ranks to ratings
group by user id
for each group:
compute median movie rank
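The join-then-group-by above can be sketched in Python; the (user, movie) events are hypothetical:

```python
from collections import Counter, defaultdict
from statistics import median

# Hypothetical (user id, movie id) rating events, purely for illustration
events = [(1, "a"), (1, "b"), (2, "b"), (2, "c"), (3, "a")]

# Rank movies by popularity: rank 1 = most-rated movie
popularity = Counter(movie for _, movie in events)
rank = {movie: i + 1 for i, (movie, _) in enumerate(popularity.most_common())}

# Join movie ranks to ratings, group by user, compute each user's median rank
ranks_by_user = defaultdict(list)
for user, movie in events:
    ranks_by_user[user].append(rank[movie])
eccentricity = {user: median(ranks) for user, ranks in ranks_by_user.items()}
```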
[Histogram: number of users (0–8,000) vs. user eccentricity (100–10,000, log scale)]
Example: Anatomy of the long tail
Dataset     Users   Items   Rating levels   Observations
Movielens   100K    10K     10              10M
Netflix     500K    20K     5               100M
What do we do when the full dataset exceeds available memory?
Example: Anatomy of the long tail

Sampling? Unreliable estimates for rare groups
Example: Anatomy of the long tail

Random access from disk? 1000x more storage, but 1000x slower2

2 Numbers every programmer should know
Example: Anatomy of the long tail

Streaming: read data one observation at a time, storing only needed state
The combinable group-by operation
Streaming
for each observation as (group, value):
if new group:
initialize result
update result for corresponding group as function of
existing result and current value
for each group:
output group and result
Useful for computing a subset of within-group statistics with a limited memory footprint
(e.g., min, mean, max, variance, etc.)
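The streaming pseudocode above can be sketched as a single pass that keeps only a small running state per group; the (group, value) pairs are hypothetical:

```python
observations = [("a", 4), ("a", 2), ("b", 5)]  # hypothetical (group, value)

state = {}
for group, value in observations:
    if group not in state:  # new group: initialize result
        state[group] = {"count": 0, "sum": 0, "min": value, "max": value}
    s = state[group]        # update result from existing state and value
    s["count"] += 1
    s["sum"] += value
    s["min"] = min(s["min"], value)
    s["max"] = max(s["max"], value)

# Combinable statistics fall out of the running state, e.g., the mean
means = {group: s["sum"] / s["count"] for group, s in state.items()}
```

Memory here scales with the number of groups, not the number of observations, which is exactly the trade-off the slide describes.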
Example: Movielens
[Bar chart: number of ratings (0 to ∼3,000,000) at each star level, 1–5]
for each rating:
counts[rating]++
Example: Movielens
for each rating:
totals[movie id] += rating
counts[movie id]++
for each movie id:
output totals[movie id] / counts[movie id]
[Histogram: density of mean rating by movie, 1–5]
Yet another group-by operation
Per-group histograms
for each observation as (group, value):
histogram[group][value]++
for each group:
compute result as a function of histogram
output group and result
We can recover arbitrary statistics if we can afford to store counts of all distinct values within each group
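A minimal sketch of per-group histograms, recovering the median from counts alone (the (group, value) pairs are hypothetical, and this helper returns the lower median when a group has an even count):

```python
from collections import defaultdict, Counter

# Store counts of each distinct value within each group
histograms = defaultdict(Counter)
for group, value in [("a", 4), ("a", 2), ("a", 4), ("b", 5)]:
    histograms[group][value] += 1

def median_from_histogram(counts):
    # Walk values in sorted order until half the observations are covered
    total = sum(counts.values())
    seen = 0
    for value in sorted(counts):
        seen += counts[value]
        if 2 * seen >= total:
            return value

medians = {group: median_from_histogram(h) for group, h in histograms.items()}
```

Memory scales with V*G (distinct values times groups) rather than N, matching the table on the next slide.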
The group-by operation
For arbitrary input data:
Memory   Scenario              Distributions   Statistics
N        Small dataset         Yes             General
V*G      Small distributions   Yes             General
G        Small # groups        No              Combinable
V        Small # outcomes      No              No
1        Large # both          No              No

N = total number of observations
G = number of distinct groups
V = largest number of distinct values within a group
Examples (w/ 8GB RAM)
Median rating by movie for Netflix
N ∼ 100M ratings
G ∼ 20K movies
V ∼ 10 half-star values
V*G ∼ 200K, store per-group histograms for arbitrary statistics
(scales to arbitrary N, if you’re patient)
Examples (w/ 8GB RAM)
Median rating by video for YouTube
N ∼ 10B ratings
G ∼ 1B videos
V ∼ 10 half-star values
V*G ∼ 10B, fails because per-group histograms are too large to store in memory
G ∼ 1B, but no (exact) calculation for streaming median
Examples (w/ 8GB RAM)
Mean rating by video for YouTube
N ∼ 10B ratings
G ∼ 1B videos
V ∼ 10 half-star values
G ∼ 1B, use streaming to compute combinable statistics
The group-by operation
For pre-grouped input data:
Memory   Scenario              Distributions   Statistics
N        Small dataset         Yes             General
V*G      Small distributions   Yes             General
G        Small # groups        No              Combinable
V        Small # outcomes      Yes             General
1        Large # both          No              Combinable

N = total number of observations
G = number of distinct groups
V = largest number of distinct values within a group