WORKING WITH INSTAGRAM DATA: THEMES AND IDEAS
WHAT KIND OF POLITICAL CAMPAIGNS/USES HAVE THERE BEEN?
▸ #thisundocumentedlife
▸ Undocumented teenagers living in the US shared aspects of their daily struggles through an Instagram campaign
▸ #blacklivesmatter
▸ Used it to instantly document violence towards black people as part of the larger campaign
Stefano M. Iacus
University of Milan VOICES from the Blogs
R Foundation for Statistical Computing
To which extent Social Media can help migration monitoring?
(on-going work with L. Curini, R. Impicciatore, Y. Theocharis)
Measuring Migration: NTTS 2017 satellite event, Brussels, 13 March 2017
Let us start from the very beginning…
Exabyte, or trillion, or 1000^6: a billion billion bytes
“There was 5 exabytes of information created between the dawn of civilization through 2003… but that much information is now created every 2 days” (Eric Schmidt, Google, 2010)
How big are Big data?
2013: 2.7 billion Internet users, 37.9% of the world population
2014: 2.9 billion users, 40.4% of the world pop.
2015: 3.2 billion, +11.6% increment
2016: 46.1% of the world pop. (as of July 2016)
Source:
Internet growth rate
unfortunately crashed on Sep 1st 2016
Why Social Media?
Take advantage of the specificity of these data:
• Social media are Big Data: we have many observations in time and space
• They exhibit nowcasting properties which can be exploited
• There are limitless ways to look at these data
• They are usually less expensive and faster to collect than survey data
Limits of Social Media data
• The real profiles behind social media accounts are not known in most cases;
• The population on social media is a biased sample of the demographic population;
• The social media population under observation changes according to the topic;
• Social media are not the same everywhere (no FB but VK in Russia, no Twitter but Sina Weibo in China, etc.)
(possible solutions to some of these issues at the end of the talk)
Twitter numbers
Number of (monthly) active users
Instagram numbers
YANNIS THEOCHARIS BI-ANNUAL SMAPP GLOBAL CONFERENCE, NEW YORK UNIVERSITY FLORENCE, MAY 23-24, 2016
Number of (monthly) active users
Instagram users have shared over 30 billion photos to date, and now share an average of 70 million photos per day
70 percent of Instagram users come from outside of the U.S.
Observing the unobservable
In November 2014 VOICES published the article “Support for Isis stronger in Arabic social media in Europe than in Syria” for The Guardian. The analysis of 2 million online posts found that those originating in Europe were more favourable to Isis than those from the frontline of the conflict.
[Figure: Total ISIS mentions and sentiment on social media from July to October 2014]
In December 2015 VOICES published the article “Here’s a paradox: Shutting down the Islamic State on Twitter might help it recruit” for the Washington Post.
“[…] limiting debate in a digital forum could further radicalize and isolate possible Islamic State sympathizers. The resulting “loneliness effect” can be dangerous”
“We examined nearly 13 million tweets in Arabic from 53 countries published between July 2014 and January 2015. We examined the ratio of positive to negative tweets about the Islamic State, by country.”
Observing the unobservable
As of Nov 2014
Belgium Isis attack: 22 March 2016
The “loneliness effect” and its risks (…) limiting debate in a digital forum could further radicalize and isolate possible Islamic State sympathizers.
Why Social Media and Refugee?
• Refugees* are coming from highly wired countries, with youths that are experts in ICT use and high levels of social media use (Howard, 2011; Howard & Hussain, 2014)
• Smartphones are the most important items in most refugees’ luggage: they are about the only means available to them to keep in touch with people at home
• The best possible tool in their hands to document the conditions they live in and their struggle for physical survival
• Help people create and share their own narratives through systematic visual and textual documentation
• Allow them to present themselves as human beings rather than as “others” or “hostiles” or whatever else
* At least those from the Syrian crisis
Why Social Media and Refugee?
Part of Twitter & Instagram data carry geo-reference metadata
For Twitter, this proportion is around 1% to 3% of the total accounts
{ "geo": { "type": "Point", "coordinates": [40.0160921, -105.2812196] }, "coordinates": { "type": "Point", "coordinates": [-105.2812196, 40.0160921] } }
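A minimal sketch of how the payload above could be read in Python. The function names `extract_lat_lon` and `geo_share` are hypothetical helpers introduced only for illustration; the one fact relied upon is visible in the payload itself: the `geo` field lists latitude first, while `coordinates` lists longitude first.

```python
import json

# Payload copied from the slide: "geo" is [lat, lon], "coordinates" is [lon, lat].
RAW = """
{ "geo": { "type": "Point", "coordinates": [40.0160921, -105.2812196] },
  "coordinates": { "type": "Point", "coordinates": [-105.2812196, 40.0160921] } }
"""


def extract_lat_lon(tweet):
    """Return (lat, lon) for a geo-referenced tweet, or None otherwise."""
    point = tweet.get("coordinates")
    if point and point.get("type") == "Point":
        lon, lat = point["coordinates"]  # longitude comes first in this field
        return (lat, lon)
    point = tweet.get("geo")
    if point and point.get("type") == "Point":
        lat, lon = point["coordinates"]  # latitude comes first in this field
        return (lat, lon)
    return None


def geo_share(tweets):
    """Fraction of tweets in a collection that carry a usable point location."""
    tweets = list(tweets)
    if not tweets:
        return 0.0
    located = sum(1 for t in tweets if extract_lat_lon(t) is not None)
    return located / len(tweets)


tweet = json.loads(RAW)
print(extract_lat_lon(tweet))
print(geo_share([tweet, {}]))
```

Running `geo_share` over a full collection is one way to check the 1% to 3% figure on one's own data.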
Do we have enough data then?
Part of Twitter & Instagram data carry geo-reference metadata
For Instagram, this proportion grows to 30%
{ "data": [{ "id": "788029", "latitude": 48.858844300000001, "longitude": 2.2943506, "name": "Eiffel Tower, Paris" }]}
[Figure: Movements in a day in Rome; axes: lon (12.3 to 12.7), lat (41.8 to 42.0)]
Application from tourism study
Density along the year in Milan
[Figure: tweet density maps for 21/02/2014, 08/04/2014 and 25/04/2014; map labels: Salone Mobile, April 25th, Fiera Milano Rho, Duomo Square, Forum Assago, Concert]
Travels of Italian Twitter accounts through 2012-2016
Travel in time
[Timeline figure: Past, t0 (crisis event date), Time; labels: Backward looking, Monitoring activity]
Can we exploit also information coming from networks? Which are the hubs of information?
Would tracking them help in forecasting the flows?
The ongoing Refugee project
COLLECTING DATA USING GEOLOCATION FROM IDOMENI, GREECE
About 5000 Instagram posts limited to the Idomeni camp area in Greece
The ongoing Syrian Refugee project
About 5000 Instagram posts limited to the Idomeni camp area in Greece
Period 11-21 Feb 2016
Plan: track the accounts generating these posts a year later
Issue: Instagram severely restricted API usage in late April 2016, which makes this repository very valuable.
The ongoing Syrian Refugee project
COLLECTING DATA USING #REFUGEES (POSTS LIKED)
Activity around the 5000 posts
The ongoing Syrian Refugee project
1. Extract the Instagram accounts present in the data collection from Idomeni camp (Greece)
2. Follow them in future (i.e. present): are they still in Greece? Or have they moved around Europe?
3. Check if before the crisis the accounts were tweeting from, e.g., Syria, to test for “real” (vs. suspect) refugees
4. Confirm the analysis by manually looking at the profiles
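The four steps above can be sketched as follows. This is only an illustration under stated assumptions: the post records (`account`, `date`, `lat`, `lon`), the function `track_accounts` and the bounding box values are all hypothetical, and the box is a rough illustrative rectangle, not the one used in the project.

```python
from collections import defaultdict

# Hypothetical bounding box around the camp (illustrative values only):
# (lat_min, lat_max, lon_min, lon_max)
CAMP_BOX = (41.10, 41.14, 22.49, 22.54)


def in_box(lat, lon, box=CAMP_BOX):
    lat_min, lat_max, lon_min, lon_max = box
    return lat_min <= lat <= lat_max and lon_min <= lon <= lon_max


def track_accounts(posts, collection_end, box=CAMP_BOX):
    """Summarise, per account, where it was before and after the collection.

    posts: dicts with keys account, date (ISO string, so it sorts as text),
    lat, lon. collection_end: ISO date closing the data collection.
    """
    by_account = defaultdict(list)
    for post in posts:
        by_account[post["account"]].append(post)

    report = {}
    for account, items in by_account.items():
        items.sort(key=lambda p: p["date"])
        in_camp = [p for p in items
                   if p["date"] <= collection_end and in_box(p["lat"], p["lon"], box)]
        if not in_camp:
            continue  # step 1: keep only accounts seen in the camp area
        before = [p for p in items if p["date"] < in_camp[0]["date"]]
        after = [p for p in items if p["date"] > collection_end]
        report[account] = {
            # step 2: last known position after the collection (None if unseen)
            "still_in_camp": in_box(after[-1]["lat"], after[-1]["lon"], box) if after else None,
            # step 3: earliest known position before the camp (None if unseen)
            "origin": (before[0]["lat"], before[0]["lon"]) if before else None,
            # step 4: every automatic result still needs a manual look
            "needs_manual_check": True,
        }
    return report


posts = [
    {"account": "a", "date": "2015-06-01", "lat": 36.20, "lon": 37.16},
    {"account": "a", "date": "2016-02-15", "lat": 41.12, "lon": 22.51},
    {"account": "a", "date": "2017-02-15", "lat": 52.52, "lon": 13.40},
    {"account": "b", "date": "2016-02-12", "lat": 41.12, "lon": 22.51},
    {"account": "c", "date": "2016-02-12", "lat": 48.85, "lon": 2.35},
]
report = track_accounts(posts, collection_end="2016-02-21")
for account, summary in sorted(report.items()):
    print(account, summary)
```

Accounts with no posts before or after the collection window get `None` rather than a guess, which keeps the automatic pass conservative ahead of the manual check.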
Three possible opportunities/challenges:
• small area estimation approach (Rao, 2003): use social media data as IV
• anchoring social media data to official statistics (Bayesian/Multilevel approach)
• build composite statistics (survey + social media) to nowcast migration phenomena
The new magic word in Social Data Science: “data mashup” = mix Big Data with official statistics
Small area estimation idea: HCR (Risk-at-poverty-index) vs Mobile phone data (Md). Marchetti et al. (2015), JOS, 31(2), 263-281.
a measure of entropy where (l1,l2) represents a pair of locations, pv(l1,l2) is the
probability of observing a movement of vehicle v between the locations l1 and l2, and L
is the total number of locations. The probability pv(l1,l2) is given by the ratio between
the number of trips of v between l1 and l2 and the total number of trips of v. When l1 is
equal to l2, pv(l1,l2) is set to 0. Then, we define the mobility of an area d as:
$$M_d = \frac{1}{V_d} \sum_{v \in d} M_v, \qquad (3)$$
where Vd is the number of vehicles resident in area d. A vehicle is considered resident
in the area where it most frequently stops during the night. The mobility value tends to
zero when the vehicle v visits few distinct locations, showing low mobility diversity.
On the other hand, when the mobility measure (2) increases, it means that the vehicle v
makes journeys with several locations as destinations. We calculate the standard
deviation of the mobility Md for each area. For a given area d we measure the standard
deviation of the mobility by:
$$s_{M_d} = \left\{ \frac{\sum_{v \in d} (M_v - M_d)^2}{V_d - 1} \right\}^{1/2}, \qquad (4)$$
where $M_v$ and $M_d$ are defined by (2) and (3).
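As a small worked illustration of (3) and (4), the sketch below computes the area mobility and its standard deviation from per-vehicle mobility values. The function name `area_mobility` and the numbers are hypothetical; the per-vehicle values from (2) are taken as given, since that formula is not reproduced in this excerpt.

```python
import math


def area_mobility(vehicle_mobility):
    """Return (M_d, s_Md) for one area.

    vehicle_mobility: the mobility values M_v of the vehicles resident in
    area d, already computed with measure (2).
    """
    values = list(vehicle_mobility)
    v_d = len(values)  # V_d, number of resident vehicles
    if v_d < 2:
        raise ValueError("at least two vehicles are needed for the V_d - 1 divisor")
    m_d = sum(values) / v_d  # equation (3)
    s_md = math.sqrt(sum((m - m_d) ** 2 for m in values) / (v_d - 1))  # equation (4)
    return m_d, s_md


# Hypothetical per-vehicle values for a single area.
m_d, s_md = area_mobility([0.2, 0.4, 0.6])
print(m_d, s_md)
```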
Figure 1 shows the scatterplot of the HCR values plotted against the $s_{M_d}$ values
computed for the ten provinces of the Tuscany region. Their linear correlation
coefficient, used as a mere descriptive index, is equal to −0.74. This result suggests that
higher levels of heterogeneity of mobility (Md), expressed by the standard deviation $s_{M_d}$,
are in the provinces where there are lower levels of poverty. In other words, the
diversification of mobility within an area with respect to its mean value can be a proxy
Fig. 1. Scatterplot of the standard deviation of the mobility ($s_{M_d}$, horizontal axis) vs. estimates of the HCR (vertical axis) at province level in Tuscany.
Model HCR as a function of Md for the data at hand, then use Md to estimate the unobserved HCR in the region d.
Use the estimate of HCR in further analysis.
Extension: here d is space (province) but we can extend the model to time as well using time series approach
Goal: extend/estimate official statistic data
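A minimal sketch of the idea, assuming the simplest possible model (a straight line fitted by ordinary least squares); the slides do not say which model is used. The helper `fit_line` and all numbers are hypothetical and serve only to show the two stages: fit where HCR is observed, then predict where it is not.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical areas where both the mobility indicator and HCR are observed.
mobility = [100.0, 110.0, 120.0, 130.0]
hcr = [0.25, 0.20, 0.15, 0.10]

intercept, slope = fit_line(mobility, hcr)


def predict_hcr(m):
    """Estimate the unobserved HCR of an area from its mobility indicator."""
    return intercept + slope * m


# Area with mobility data but no survey-based HCR.
print(predict_hcr(115.0))
```

The negative slope in this toy data mirrors the sign of the correlation reported for Tuscany, nothing more.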
On a continuous-time & space model for small area estimation (C-SAE)
(on going project called: SWBI)
The continuous-time and space model is stated as follows.
Let $Y_t^d$ be a direct measure of a proxy of some sort of well-being (e.g. “hospital admissions”) for region $d = 1, \dots, D$, at time $t$.
Let $\mu_t^d$ be the target latent measure of well-being (e.g. “health-related quality of life”) for region $d = 1, \dots, D$.
Assume there are other covariates, e.g., official statistics, say $X_{j,t}^d$, for $j = 1, \dots, K$, region $d$ at time $t$, that summarize the “standard of living”, so that we can write
$$dY_t^d = \mu_t^d\,dt + \sum_{j=1}^{K} \beta_j X_{j,t}^d\,dt + dB_t^d$$
OBS = UNOBS + COVARIATES + NOISE
Then assume that there are $m_1$ other variables $Z_t^d$ that are expressions of $\mu$. We assume that these covariates are also geographically dependent. Finally, $S_t^d$ are $m_2$ variables that come from social media (e.g. SWBI components). We assume that $Z_t^d$ and $S_t^d$ contribute to $\mu_t^d$ in this way:
$$d\mu_t^d = \left(\theta_0^d + \sum_{i=1}^{m_1} \theta_i^d Z_{i,t}^d + \sum_{i=1}^{m_2} \psi_i^d S_{i,t}^d\right)dt + \sum_{l=1}^{D} w_{ld}\,dL_t^d$$
where $\psi_i^d$ is the coefficient of the $i$-th social media component and $w = (w_{d,l})$ is a proximity matrix detected among the different areas, which is used to control for spatial correlation. The covariates evolve as
$$dX_{i,t}^d = \zeta_i^d\,(\nu_i^d - X_{i,t}^d)\,dt + d\Pi_{i,t}^d, \quad i = 1, \dots, K$$
$$dZ_{i,t}^d = \alpha_i^d\,(\gamma_i^d - Z_{i,t}^d)\,dt + dW_{i,t}^d, \quad i = 1, \dots, m_1$$
$$dS_{\ell,t}^d = \lambda_\ell^d\,(\eta_\ell^d - S_{\ell,t}^d)\,dt + d\Xi_{\ell,t}^d, \quad \ell = 1, \dots, m_2$$
where $Z_{i,t}^d$ represent the quality of life variables and $B_t^d$, $L_t^d$, $\Pi_t^d$, $W_t^d$ and $\Xi_t^d$ are vectors of independent Brownian motions.
As quality of life variables we consider the SWBI index and its components, some weather and pollution indicators, the consumer price index for blue and white-collar worker households, and a proxy of job market.
provinces of Lombardy
d=1,…, 11
[Figure: time series for Milan, Dec 2013 to Dec 2015, one panel per variable. Panels: MilanoRicoveri, MilanoConsumi, MilanoCasa, MilanoPensioni, MilanoValoreAggiunto, MilanoTenoreVita, MilanoServiziAmbiente, MilanoAffariLavoro, MilanoOrdinePubblico, MilanoPopolazione, MilanoTempoLibero, Milanodisoccupati, MilanoPrezzi, Milanopop_media, Milanonum_avviamenti, Milanopm10, Milanopm2.5, Milanotmedia, Milanoumidita. Social Media estimates: Milanoemo, Milanofun, Milanorel, Milanores, Milanosat, Milanotru, Milanovit, Milanowor, Milanoswbi]
Putting all together:
$$dY_t^d = \left(\theta_0^d + \sum_{i=1}^{m_1} \theta_i^d Z_{i,t}^d + \sum_{i=1}^{m_2} \psi_i^d S_{i,t}^d\right)dt + \sum_{l=1}^{D} w_{ld}\,dL_t^d + \sum_{j=1}^{K} \beta_j X_{j,t}^d\,dt + dB_t^d$$
Once the parameters are estimated we can predict the underlying measure as follows:
$$d\mu_t^d = \left(\theta_0^d + \sum_{i=1}^{m_1} \theta_i^d Z_{i,t}^d + \sum_{i=1}^{m_2} \psi_i^d S_{i,t}^d\right)dt + \sum_{l=1}^{D} w_{ld}\,dL_t^d$$
In our case we have D = 11, m1 = 4, m2 = 8, K = 2,
which means 165 equations (D*(1+K+m1+m2)) and 473 parameters!
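The equation count follows directly from the formula on the slide. The slides do not spell out how the 473 parameters are counted; the sketch below shows one accounting that reproduces the figure, under the assumption (ours, not stated in the talk) that every drift coefficient is region-specific and that the proximity weights and noise scales are not counted.

```python
D, K, m1, m2 = 11, 2, 4, 8

# One equation per region for Y, each X, each Z and each S.
equations = D * (1 + K + m1 + m2)

# Assumed accounting per region (not stated in the slides):
mu_drift = 1 + m1 + m2         # intercept plus one coefficient per Z and per S
x_coefficients = K             # one coefficient per official-statistics covariate
mean_reversion = 2 * (K + m1 + m2)  # a rate and a long-run level per X, Z and S

parameters = D * (mu_drift + x_coefficients + mean_reversion)

print(equations, parameters)
```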
Advantages of the continuous-time & space model for small area estimation (C-SAE)
Using official statistics for Y, X and Z means that we have at most monthly data, while social media data S are intraday or daily data
Once the model is estimated at “low frequency” (e.g., monthly), we can simulate it at any frequency/time t.
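To illustrate simulation at any frequency, here is a sketch for a single mean-reverting component of the kind used for X, Z and S. It relies on the exact transition of such a process, so the same estimated parameters work for any step size. The function `simulate_ou`, the parameter values and the explicit noise scale `sigma` are assumptions made for the example; the model as written on the slides has unit noise scale.

```python
import math
import random


def simulate_ou(x0, rate, level, sigma, dt, n_steps, seed=0):
    """Simulate dX = rate * (level - X) dt + sigma dW on a grid of step dt.

    Uses the exact Gaussian transition of the process, so dt can be a month,
    a day or an hour without changing the parameters.
    """
    rng = random.Random(seed)
    decay = math.exp(-rate * dt)
    step_sd = sigma * math.sqrt((1.0 - decay ** 2) / (2.0 * rate))
    path = [x0]
    for _ in range(n_steps):
        path.append(level + (path[-1] - level) * decay + step_sd * rng.gauss(0.0, 1.0))
    return path


# Hypothetical parameters, with time measured in months.
monthly = simulate_ou(x0=50.0, rate=0.8, level=60.0, sigma=2.0, dt=1.0, n_steps=12)
daily = simulate_ou(x0=50.0, rate=0.8, level=60.0, sigma=2.0, dt=1.0 / 30.0, n_steps=360)
print(len(monthly), len(daily))
```

Both paths cover the same twelve months; only the sampling grid differs.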
Anchoring Big Data to Official Statistics/Survey data
Anchoring social media data to official statistics: on going.
$X_t^s \sim L(\theta_t^s)$: social media statistics, $t$ = time, $s$ = space
$\theta_t^s \sim Q(\phi_t^s)$: $Q$ is the prior calibrated on official statistics
Outcome: posterior distribution of $X_t^s$, $P(X_t^s; \theta_t^s \mid \phi_t^s)$.
Goal 1: adjust social media statistics (or indirectly estimate bias?) to obtain high-frequency and space-distributed social info in real time, much before official statistics or survey data are collected on the next wave.
Goal 2: make official-statistics subjects (institutions, politicians, academics, etc.) happier with social media data.
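A toy sketch of the anchoring idea. The talk leaves the likelihood and the prior unspecified, so this example assumes the simplest conjugate pair: a binomial count from social media and a Beta prior whose mean is set from an official statistic. All function names and numbers are hypothetical.

```python
def beta_prior_from_official(official_share, pseudo_count):
    """Beta(a, b) prior centred on an official statistic.

    pseudo_count says how many observations the official figure is worth.
    """
    return official_share * pseudo_count, (1.0 - official_share) * pseudo_count


def update(prior, positives, total):
    """Posterior Beta parameters after seeing social media counts."""
    a, b = prior
    return a + positives, b + (total - positives)


def beta_mean(params):
    a, b = params
    return a / (a + b)


# Hypothetical inputs: official share of 30%, trusted like 100 observations;
# 600 of 1000 social media posts in the same area and period are positive.
prior = beta_prior_from_official(0.30, pseudo_count=100)
posterior = update(prior, positives=600, total=1000)
print(beta_mean(prior), beta_mean(posterior))
```

The posterior mean sits between the raw social media share and the official figure; a larger `pseudo_count` pulls it closer to the official statistic.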
Goal: nowcast overall expectations and act accordingly. Example: watch for switches in the WNI’s trend to speculate on the market.
Combining official statistics, survey and social media: Wired Next Index (WNI)
WNI (“measures” expectation about economic wealth of a country, well... Italy) combines different time series:
• Off. Stat: GDP, Import/Export, Unemployment rate (low frequency, backward looking)
• Survey Data: consumer expectations, entrepreneurs’ expectations (low freq., forward looking)
• Social Media Data: sentiment data on economy, politics, personal wealth (high frequency, geo-referenced, nowcasting)
What about Replicability?
Replicability of this particular experiment: given the structure of the query used to download the data and the “post ID”, everyone can replicate the analysis.
Replicability of a similar idea: as the model is very easy to understand (although computationally intensive), the same idea can be applied to other situations. We are indeed working with clandestine migrants in Italy and soon Spain using mainly Twitter data.
Operational headquarters: Via Gaspare Bugatti 7/A, Milano
Tel. +393661652058/61/64 Fax +390269000855
voices-int.com
@blogsvoices
Thanks!
For further information: [email protected]