Modeling Social Data, Lecture 1: Overview
Transcript of Modeling Social Data, Lecture 1: Overview
Introduction and Overview
APAM E4990
Modeling Social Data
Jake Hofman
Columbia University
January 20, 2017
Jake Hofman (Columbia University) Introduction and Overview January 20, 2017 1 / 53
Course overview
Modeling social data requires an understanding of:
1 How to obtain data produced by (online) human interactions,
2 What questions we typically ask about human-generated data,
3 How to reframe these questions as mathematical models, and
4 How to interpret the results of these models in ways that address our questions.
Questions
Many long-standing questions in the social sciences are notoriously difficult to answer, e.g.:
• “Who says what to whom in what channel with what effect”? (Lasswell, 1948)
• How do ideas and technology spread through cultures? (Rogers, 1962)
• How do new forms of communication affect society? (Singer, 1970)
• . . .
Questions
Typically difficult to observe the relevant information via conventional methods
Moreno, 1933
Large-scale data
Recently available electronic data provide an unprecedented opportunity to address these questions at scale
Demographic, Behavioral, Network
Computational social science
An emerging discipline at the intersection of:
• the social sciences (motivating questions)
• statistics (fitting large, potentially sparse models)
• computer science (parallel processing for filtering and aggregating data)
Topics
Exploratory Data Analysis
Classification
Regression
Networks
Exploratory Data Analysis
(a.k.a. counting and plotting things)
Regression
(a.k.a. modeling continuous things)
Classification
(a.k.a. modeling discrete things)
Networks
(a.k.a. counting complicated things)
Topics
http://modelingsocialdata.org
The clean real story
“We have a habit in writing articles published in scientific journals to make the work as finished as possible, to cover all the tracks, to not worry about the blind alleys or to describe how you had the wrong idea first, and so on. So there isn’t any place to publish, in a dignified manner, what you actually did in order to get to do the work ...”
-Richard Feynman, Nobel Lecture [1], 1965
[1] http://bit.ly/feynmannobel
Outline
Web demographics
Search predictions
Viral hits
Predicting consumer activity with Web search
with Sharad Goel, Sebastien Lahaie, David Pennock, Duncan Watts
[Figure: weekly rank of "Right Round", Mar-09 to Aug-09, Billboard vs. search]
Search predictions: Motivation
Does collective search activity provide useful predictive signal about real-world outcomes?
Past work mainly focuses on predicting the present [2] and ignores baseline models trained on publicly available data
[Figure: flu level (percent) by date, 2004 to 2010: actual, search, and autoregressive]
[2] Varian, 2009
We predict future sales for movies, video games, and music
[Figure: search volume by time to release (days, −30 to 30) for (a) "Transformers 2" and (b) "Tom Clancy's HAWX"; (c) weekly rank of "Right Round", Mar-09 to Aug-09, Billboard vs. search]
Search predictions: Search models
For movies and video games, predict opening weekend box office and first month sales, respectively:
$$\log(\text{revenue}) = \beta_0 + \beta_1 \log(\text{search}) + \epsilon$$
For music, predict following week’s Billboard Hot 100 rank:
$$\text{billboard}_{t+1} = \beta_0 + \beta_1 \text{search}_t + \beta_2 \text{search}_{t-1} + \epsilon$$
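The log-log search model for movies and video games can be sketched in a few lines. This is a minimal illustration, not the authors' code: the data are synthetic, the helper names `fit_line` and `fit_log_log` are hypothetical, and the true coefficients (2.0 and 0.8) are made up so that the fit can be checked against them.

```python
import math
import random


def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


def fit_log_log(search, revenue):
    """Fit log(revenue) = b0 + b1 * log(search) by least squares."""
    log_search = [math.log(s) for s in search]
    log_revenue = [math.log(r) for r in revenue]
    return fit_line(log_search, log_revenue)


# Synthetic data, made up for illustration: true b0 = 2.0, true b1 = 0.8,
# with Gaussian noise added on the log scale.
random.seed(0)
search = [random.uniform(1e3, 1e6) for _ in range(200)]
revenue = [math.exp(2.0 + 0.8 * math.log(s) + random.gauss(0, 0.1))
           for s in search]

b0, b1 = fit_log_log(search, revenue)
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")
```

Taking logs of both sides means b1 reads as an elasticity: a 1% increase in search volume goes with roughly a b1% increase in predicted revenue.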
Search predictions: Search volume
Search predictions: Search models
Search activity is predictive for movies, video games, and music weeks to months in advance
[Figure: actual vs. predicted revenue (dollars, log scale) for (a) movies and (b) video games (sequel vs. non-sequel), and (c) actual vs. predicted Billboard rank for music; model fit (0.4 to 0.9) by time to release (weeks, −6 to 0) for (d) movies, (e) video games, and (f) music]
Search predictions: Baseline models
For movies, use budget, number of opening screens, and Hollywood Stock Exchange:
$$\log(\text{revenue}) = \beta_0 + \beta_1 \log(\text{budget}) + \beta_2 \log(\text{screens}) + \beta_3 \log(\text{hsx}) + \epsilon$$
For video games, use critic ratings and predecessor sales (sequels only):
$$\log(\text{revenue}) = \beta_0 + \beta_1 \text{rating} + \beta_2 \log(\text{predecessor}) + \epsilon$$
For music, use an autoregressive model with the previously available rank:
$$\text{billboard}_{t+1} = \beta_0 + \beta_1 \text{billboard}_{t-1} + \epsilon$$
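The autoregressive music baseline amounts to regressing each week's rank on the rank from two weeks earlier. The sketch below shows one way to build those lagged pairs and score the fit. The rank series is hypothetical, the helper names are illustrative, and R² is an assumed choice of fit measure, since the slides do not say which one is plotted.

```python
def lagged_pairs(series, lag):
    """Pair each value with the value `lag` steps later: (series[t], series[t + lag])."""
    return list(zip(series[:-lag], series[lag:]))


def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1


def r_squared(x, y, b0, b1):
    """Fraction of the variance in y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical weekly chart ranks for a single song (not real Billboard data).
ranks = [40, 35, 28, 22, 15, 11, 8, 6, 5, 5, 7, 10, 14, 19, 25]

# billboard_{t+1} is predicted from billboard_{t-1}, i.e. a lag of two weeks.
pairs = lagged_pairs(ranks, lag=2)
x = [earlier for earlier, _ in pairs]
y = [later for _, later in pairs]

b0, b1 = fit_line(x, y)
fit = r_squared(x, y, b0, b1)
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}, R^2 = {fit:.2f}")
```

A combined model would add the search terms as extra predictors alongside the lagged rank; the same fit measure can then be compared across the baseline, search, and combined models.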
Search predictions: Baseline + combined models
Baseline models are often surprisingly good
[Figure: actual vs. predicted revenue (dollars, log scale) for movies and video games (sequel vs. non-sequel), and actual vs. predicted Billboard rank for music; panels (a) to (c) show the baseline models, panels (d) to (f) the combined models]
Search predictions: Model comparison
• For movies, search is outperformed by the baseline and of little marginal value
• For video games, search helps substantially for non-sequels, less so for sequels
• For music, the addition of search yields a substantially better combined model
[Figure: model fit (0.4 to 1.0) of the baseline, search, and combined models for non-sequel games, sequel games, music, movies, and flu]
Search predictions: Summary
• Relative performance and value of search vary across domains
• Search provides a fast, convenient, and flexible signal across domains
• “Predicting consumer activity with Web search”, Goel, Hofman, Lahaie, Pennock & Watts, PNAS 2010
Demographic diversity on the Web
with Irmak Sirer and Sharad Goel (ICWSM 2012)
[Figure: daily per-capita pageviews (0 to 70) by income (over vs. under $25k), race (Black & Hispanic vs. White), education (no college vs. some college), age (over vs. under 65), and sex (female vs. male)]
Motivation
Previous work is largely survey-based and focuses on group-level differences in online access
“As of January 1997, we estimate that 5.2 million African Americans and 40.8 million whites have ever used the Web, and that 1.4 million African Americans and 20.3 million whites used the Web in the past week.”
-Hoffman & Novak (1998)
Motivation
Focus on activity instead of access
How diverse is the Web?
To what extent do online experiences vary across demographic groups?
Data
• Representative sample of 265,000 individuals in the US, paid via the Nielsen MegaPanel [3]
• Log of anonymized, complete browsing activity from June 2009 through May 2010 (URLs viewed, timestamps, etc.)
• Detailed individual and household demographic information (age, education, income, race, sex, etc.)
[3] Special thanks to Mainak Mazumdar
Data
# ls -alh nielsen_megapanel.tar
-rw-r--r-- 100G Jul 17 13:00 nielsen_megapanel.tar
• Normalize pageviews to at most three domain levels, sans www
  e.g. www.yahoo.com → yahoo.com, us.mg2.mail.yahoo.com/neo/launch → mail.yahoo.com
• Restrict to top 100k (out of 9M+ total) most popular sites (by unique visitors)
• Aggregate activity at the site, group, and user levels
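The normalization and filtering steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the function names and the `(user_id, url)` record layout are hypothetical, and the rule simply keeps the last three dot-separated labels, so it does not treat multi-part public suffixes such as `co.uk` specially.

```python
from collections import defaultdict


def normalize_domain(url, max_levels=3):
    """Reduce a URL to at most `max_levels` domain levels, without a leading www."""
    host = url.split("://", 1)[-1]          # drop the scheme, if any
    for sep in ("/", "?", "#"):             # drop path, query, and fragment
        host = host.split(sep, 1)[0]
    host = host.split(":", 1)[0].lower()    # drop the port, if any
    labels = [label for label in host.split(".") if label]
    if labels and labels[0] == "www":
        labels = labels[1:]
    return ".".join(labels[-max_levels:])


def top_sites(pageviews, k):
    """Rank normalized sites by number of unique visitors and keep the top k.

    `pageviews` is an iterable of (user_id, url) pairs (an assumed layout).
    """
    visitors = defaultdict(set)
    for user_id, url in pageviews:
        visitors[normalize_domain(url)].add(user_id)
    # Sort by descending unique-visitor count, breaking ties alphabetically.
    ranked = sorted(visitors, key=lambda site: (-len(visitors[site]), site))
    return ranked[:k]


# The two examples from the slide.
print(normalize_domain("www.yahoo.com"))
print(normalize_domain("us.mg2.mail.yahoo.com/neo/launch"))
```

Counting unique visitors with a set per site keeps repeat visits by the same user from inflating a site's popularity.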
Aggregate usage patterns
How do users distribute their time across different categories?
[Figure: fraction of total pageviews (0.05 to 0.25) by category: social media, e-mail, games, portals, search]
All groups spend the majority of their time in the top five most popular categories
[Figure: fraction of pageviews in each category (0.05 to 0.30) by user rank by daily activity (10% to 90%)]
Highly active users devote nearly twice as much of their time to social media relative to typical individuals
Group-level activity
How does browsing activity vary at the group level?
[Figure: daily per-capita pageviews (0 to 70) by income, race, education, age, and sex]
Large differences exist even at the aggregate level (e.g. women on average generate 40% more pageviews than men)
Younger and more educated individuals are both more likely to access the Web and more active once they do
Group-level activity
All demographic groups spend the majority of their time in the same categories
[Figure: fraction of total pageviews in social media, e-mail, games, portals, and search by age (5 to 80), education (grammar school to post graduate degree), sex, income ($0−25k to $150k+), and race (Other, Hispanic, Black, White, Asian)]
Older, more educated, male, wealthier, and Asian Internet users spend a smaller fraction of their time on social media
Lower social media use by these groups is often accompanied by higher e-mail volume
Revisiting the digital divide
How does usage of news, health, and reference vary with demographics?
[Figure: average pageviews per month (0 to 12) on news, health, and reference sites by education, sex, income, and race]
Post-graduates spend three times as much time on health sites as adults with only some high school education
Asians spend more than 50% more time browsing online news than do other race groups
Even when less educated and less wealthy groups gain access to the Web, they utilize these resources relatively infrequently
[Figure: average pageviews per month (0 to 12) on news, health, and reference sites by education (high school graduate to post graduate degree) for Asian, Black, Hispanic, and White users]
Controlling for other variables, effects of race and gender largely disappear, while education continues to have a large effect
$$p_i = \sum_j \alpha_j x_{ij} + \sum_j \sum_k \beta_{jk} x_{ij} x_{ik} + \sum_j \gamma_j x_{ij}^2 + \epsilon_i$$
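The regression above has linear, pairwise-interaction, and squared terms for each individual's covariates. A minimal sketch of how one row of such a design matrix might be assembled is below. The function names are hypothetical, and the double sum is assumed to run over pairs with j < k (so each interaction appears once and the squared terms stay separate); the slide does not state the index range.

```python
from itertools import combinations


def expand_features(x):
    """Linear, pairwise-interaction (j < k), and squared terms for one individual."""
    linear = list(x)
    interactions = [x[j] * x[k] for j, k in combinations(range(len(x)), 2)]
    squares = [xj ** 2 for xj in x]
    return linear + interactions + squares


def predict(x, alpha, beta, gamma):
    """Evaluate the model (without the noise term) for one individual.

    alpha and gamma are lists indexed by j; beta maps (j, k) with j < k to a weight.
    """
    linear = sum(a * xj for a, xj in zip(alpha, x))
    interaction = sum(w * x[j] * x[k] for (j, k), w in beta.items())
    quadratic = sum(g * xj ** 2 for g, xj in zip(gamma, x))
    return linear + interaction + quadratic


# Made-up covariates and weights, purely for illustration.
x = [1.0, 2.0, 3.0]
alpha = [1.0, 1.0, 1.0]
beta = {(0, 1): 1.0, (0, 2): 0.0, (1, 2): 2.0}
gamma = [0.0, 0.0, 1.0]

print(expand_features(x))
print(predict(x, alpha, beta, gamma))
```

For binary indicator covariates the squared term equals the covariate itself, so in practice squared terms only add information for continuous or ordinal variables such as age.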
[Figure: average pageviews per month (0 to 12) on health sites by education (high school graduate to post graduate degree), female vs. male]
However, women spend considerably more time on health sites compared to men, although means can be misleading
[Figure: distribution of monthly pageviews on health sites (20 to 100), female vs. male]
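A small simulation illustrates why means can be misleading for heavy-tailed activity data. The numbers below are synthetic draws from a Pareto distribution, chosen only as an assumed stand-in for skewed pageview counts; they are not the Nielsen data.

```python
import random
import statistics

random.seed(1)

# Hypothetical monthly pageview counts: most users view few pages,
# while a small number of users view very many.
pageviews = [int(random.paretovariate(1.5)) for _ in range(10_000)]

mean = statistics.mean(pageviews)
median = statistics.median(pageviews)

# Share of users whose activity falls below the group mean.
share_below_mean = sum(v < mean for v in pageviews) / len(pageviews)

print(f"mean = {mean:.2f}, median = {median}")
print(f"share of users below the mean = {share_below_mean:.0%}")
```

When a few very active users pull the mean up, most individuals sit below it, so comparing full distributions (or medians) across groups can tell a different story than comparing means.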
Summary
• Highly active users spend disproportionately more of their time on social media and less on e-mail relative to the overall population
• Access to research, news, and healthcare is strongly related to education, not as closely to ethnicity
• User demographics can be inferred from browsing activity with reasonable accuracy
• “Who Does What on the Web”, Goel, Hofman & Sirer, ICWSM 2012
The structural virality of online diffusion
with Ashton Anderson, Sharad Goel, Duncan Watts (Management Science 2015)
“Going Viral”?
“Therefore we ... wish to proceed with great care as is proper, and to cut off the advance of this plague and cancerous disease so it will not spread any further ...” [4]
-Pope Leo X, Exsurge Domine (1520)
[4] http://www.economist.com/node/21541719
Rogers (1962), Bass (1969)
How do popular things become popular?
Data
• Examined one year of tweets from July 2011 to July 2012
• Restricted to 1.4 billion tweets containing links to top news, videos, images, and petitions sites
• Aggregated tweets by URL, resulting in 1 billion distinct “events”
• Crawled friend list of each adopter
• Inferred “who got what from whom” to construct diffusion trees
• Characterized size and structure of trees
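The last three steps can be sketched as follows. This is an illustration under a loudly stated assumption, not the method from the paper: each adopter's parent is taken to be the most recent earlier adopter among the accounts they follow, and adopters with no such friend start a new tree. The function names, the `(user, time)` records, and the `friends` mapping are all hypothetical.

```python
def build_trees(adoptions, friends):
    """Infer a parent for each adopter of one URL.

    adoptions: list of (user, time) pairs for a single "event".
    friends: dict mapping a user to the set of accounts that user follows.
    Returns a dict child -> parent, with parent None for tree roots.
    Assumption: the parent is the most recent earlier adopter among friends.
    """
    parent = {}
    earlier = []  # users who have already adopted, in time order
    for user, _time in sorted(adoptions, key=lambda a: a[1]):
        followed = friends.get(user, set())
        candidates = [u for u in earlier if u in followed]
        parent[user] = candidates[-1] if candidates else None
        earlier.append(user)
    return parent


def tree_stats(parent):
    """Size and depth of each diffusion tree, keyed by its root."""
    stats = {}
    for user in parent:
        node, depth = user, 0
        while parent[node] is not None:  # walk up to the root
            node = parent[node]
            depth += 1
        size, max_depth = stats.get(node, (0, 0))
        stats[node] = (size + 1, max(max_depth, depth))
    return stats


# Made-up example: a chain a -> b -> c -> d, plus an independent adopter e.
adoptions = [("a", 1), ("b", 2), ("c", 3), ("d", 4), ("e", 5)]
friends = {"b": {"a"}, "c": {"a", "b"}, "d": {"c"}, "e": {"x"}}

parent = build_trees(adoptions, friends)
print(parent)
print(tree_stats(parent))
```

Size alone does not separate a single large broadcast from a long person-to-person chain; depth is one simple structural summary that does, and richer measures can be computed from the same parent map.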
The Structural Virality of Online Diffusion
[Figure: panels A through E; axis: Time]
Information diffusion
Cascade size distribution
[Figure: CCDF of cascade size; x-axis: Cascade Size (1 to 10,000, log scale); y-axis: CCDF (0.00001% to 10%, log scale)]
Focus on the rare hits that get at least 100 adoptions
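A small sketch of how such a distribution could be computed, assuming the plotted quantity is the complementary cumulative distribution function (CCDF), i.e. the fraction of cascades with at least a given number of adoptions. The function name `ccdf` and the toy sizes are hypothetical and are not data from the study.

```python
from collections import Counter
from typing import Dict, Sequence


def ccdf(sizes: Sequence[int]) -> Dict[int, float]:
    """Return {s: fraction of cascades with size >= s} for each observed s."""
    total = len(sizes)
    counts = Counter(sizes)
    remaining = total  # number of cascades with size >= current s
    result: Dict[int, float] = {}
    for s in sorted(counts):
        result[s] = remaining / total
        remaining -= counts[s]
    return result


if __name__ == "__main__":
    toy_sizes = [1, 1, 1, 2, 5]  # made-up cascade sizes
    for size, fraction in ccdf(toy_sizes).items():
        print(f"size >= {size}: {fraction:.0%}")
```

Reading the CCDF at a size of 100 gives the share of cascades that count as "hits" under the threshold used on this slide.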
Quantifying structure
Measure the average distance between all pairs of nodes [5]
[5] Wiener (1947); correlated with other possible metrics
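The average pairwise distance can be computed directly for a small tree. The sketch below is an illustration, not the authors' implementation: it runs a breadth-first search from every node, which is fine for toy examples but quadratic in the number of nodes. The function name `structural_virality` and the edge-list input format are assumptions.

```python
from collections import deque
from typing import Dict, Hashable, Iterable, Set, Tuple


def structural_virality(edges: Iterable[Tuple[Hashable, Hashable]]) -> float:
    """Average shortest-path distance over all pairs of distinct nodes.

    edges: undirected (parent, child) pairs of a connected diffusion tree.
    """
    adjacency: Dict[Hashable, Set[Hashable]] = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    n = len(adjacency)
    if n < 2:
        raise ValueError("need at least two nodes")

    total = 0  # sum of distances over ordered pairs
    for source in adjacency:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from source
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + 1
                    queue.append(neighbor)
        total += sum(dist.values())
    return total / (n * (n - 1))


if __name__ == "__main__":
    star = [("root", "a"), ("root", "b"), ("root", "c")]   # broadcast
    chain = [("root", "a"), ("a", "b"), ("b", "c")]        # person to person
    print(structural_virality(star), structural_virality(chain))
```

Both toy trees have four nodes, but the chain scores higher than the star, which is the sense in which this measure separates broadcast-like spread from multi-generation, person-to-person spread.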
Information diffusion
Size and virality by category
Remarkable structural diversity across categories
[Figure, two panels: CCDF of Cascade Size (x-axis: 100 to 10,000) and CCDF of Structural Virality (x-axis: 3 to 30); y-axis in both: CCDF (0.001% to 100%); series: Videos, Pictures, News, Petitions]
Information diffusion
Structural diversity
[Figure: six panels, each plotting size against time]
Size is a relatively poor predictor of structure
Summary
Popular ≠ Viral
Information diffusion
Summary
• Most cascades fail, resulting in fewer than two adoptions, on average
• Of the hits that do succeed, we observe a wide range of diffusion structures
• It’s difficult to say how something spread given only its popularity
• “The structural virality of online diffusion”, Anderson, Goel, Hofman & Watts (Management Science 2015)
1. Ask good questions
There’s nothing interesting in the data without them
2. Think before you code
5 minutes at the whiteboard is worth an hour at the keyboard
3. Keep the answers simple
Exploratory data analysis and linear models go a long way