
WORKING WITH INSTAGRAM DATA: THEMES AND IDEAS

WHAT KIND OF POLITICAL CAMPAIGNS/USES HAVE THERE BEEN?

▸ #thisundocumentedlife

▸ Undocumented teenagers living in the US shared aspects of their life and struggles through an Instagram campaign

▸ #blacklivesmatter

▸ Used it to instantly document violence towards black people as part of the larger campaign

Stefano M. Iacus

University of Milan VOICES from the Blogs

R Foundation for Statistical Computing

To which extent Social Media can help migration monitoring?

(ongoing work with L. Curini, R. Impicciatore, Y. Theocharis)

Measuring Migration: NTTS 2017 satellite event, Brussels, 13 March 2017

Let us start from the very beginning…

exabyte or trillion or 1000^6: a billion of billions of bytes

2003 dawn of civilization

“There was 5 exabytes of information created between the dawn of civilization through 2003… but that much information is now created every 2 days” (Eric Schmidt, Google, 2010)

How big are Big data?

2013: 2.7 billion Internet users, 37.9% of the world population

2014: 2.9 billion users, 40.4% of the world population

2015: 3.2 billion users, +11.6% increment

2016: 46.1% of the world population (as of July 2016)

Source:

Internet growth rate

http://www.internetlivestats.com

2013: 2.7 billion Internet users, 37.9% of the world population

2014: 2.9 billion users, 40.4% of the world population

2015: 3.18 billion users, +11.6% increment

2016: 46.1% of the world population (as of July 2016)

Source:

Internet growth rate

unfortunately crashed on Sep 1st 2016

Why Social Media?

Take advantage of the specificity of these data:

• Social media are Big Data: we have many observations in time and space

• They exhibit nowcasting properties which can be exploited

• There are limitless ways to look at these data

• They are usually less expensive and faster to collect than survey data

Limits of Social Media data

• The real profiles behind social media accounts are not known in most cases;

• The population on Social Media is a biased sample of the demographic population;

• The population of Social Media under observation changes according to the topic;

• Social media are not the same everywhere (no FB but VK in Russia, no Twitter but Sina Weibo in China, etc.)

(possible solutions to some of these issues at the end of the talk)

Twitter numbers

Number of (monthly) active users

Instagram numbers


YANNIS THEOCHARIS BI-ANNUAL SMAPP GLOBAL CONFERENCE, NEW YORK UNIVERSITY FLORENCE, MAY 23-24, 2016

Number of (monthly) active users

Instagram users have shared over 30 billion photos to date, and now share an average of 70 million photos per day

70 percent of Instagram users come from outside of the U.S.

Observing the unobservable

In November 2014 VOICES published the article “Support for Isis stronger in Arabic social media in Europe than in Syria” for The Guardian. The analysis of 2 million online posts found that those originating in Europe were more favourable to Isis than those from the frontline of the conflict.

[Figure: Total ISIS mentions and sentiment on social media from July to October 2014]

In December 2015 VOICES published the article “Here’s a paradox: Shutting down the Islamic State on Twitter might help it recruit” for the Washington Post.

“[…] limiting debate in a digital forum could further radicalize and isolate possible Islamic State sympathizers. The resulting “loneliness effect” can be dangerous”

“We examined nearly 13 million tweets in Arabic from 53 countries published between July 2014 and January 2015. We examined the ratio of positive to negative tweets about the Islamic State, by country.”

Observing the unobservable

As of Nov 2014

Belgium Isis attack: 22 March 2016

The “loneliness effect” and its risks (…) limiting debate in a digital forum could further radicalize and isolate possible Islamic State sympathizers.

https://www.washingtonpost.com/news/monkey-cage/wp/2015/12/10/heres-a-paradox-shutting-down-the-islamic-state-on-twitter-might-help-it-recruit/

Why Social Media and Refugees?

• Refugees* are coming from highly wired countries, with youths that are experts in ICT use and high levels of social media use (Howard, 2011; Howard & Hussain, 2014)

• Smartphones are the most important items in most refugees’ luggage: they are about the only thing available to them to keep in touch with people at home

• The best possible tool in their hands to document the conditions they live in and their struggle for physical survival

• Help people create and share their own narratives through systematic visual and textual documentation

• Allow them to present themselves as human beings rather than as “others” or “hostiles” or whatever else

* At least those from the Syrian crisis

Do we have enough data then?

Only part of Twitter and Instagram content carries geo-reference metadata.

For Twitter, this proportion is around 1% to 3% of the total accounts:

{ "geo": { "type": "Point", "coordinates": [40.0160921, -105.2812196] }, "coordinates": { "type": "Point", "coordinates": [-105.2812196, 40.0160921] } }

For Instagram, this proportion grows to 30%:

{ "data": [{ "id": "788029", "latitude": 48.858844300000001, "longitude": 2.2943506, "name": "Eiffel Tower, Paris" }]}
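The two payloads above store the position differently, which is easy to get wrong. In the Twitter example the "geo" field lists latitude first, while "coordinates" follows the GeoJSON convention of longitude first. The Instagram example uses explicit "latitude" and "longitude" keys. The following is a minimal sketch of how both could be read into (lat, lon) pairs. The function names are hypothetical, and the payloads are simply the two examples copied from the slides, not a full description of either API.

```python
import json

# Payloads copied from the two examples on the slides.
TWEET = ('{ "geo": { "type": "Point", "coordinates": [40.0160921, -105.2812196] }, '
         '"coordinates": { "type": "Point", "coordinates": [-105.2812196, 40.0160921] } }')
INSTAGRAM = ('{ "data": [{ "id": "788029", "latitude": 48.858844300000001, '
             '"longitude": 2.2943506, "name": "Eiffel Tower, Paris" }]}')


def tweet_lat_lon(payload):
    """Return (lat, lon) from a tweet-like dict, or None if it is not geotagged."""
    coords = payload.get("coordinates")
    if coords and coords.get("type") == "Point":
        lon, lat = coords["coordinates"]  # GeoJSON order: longitude first
        return lat, lon
    geo = payload.get("geo")
    if geo and geo.get("type") == "Point":
        lat, lon = geo["coordinates"]  # in the example: latitude first
        return lat, lon
    return None


def instagram_lat_lon(payload):
    """Return a list of (lat, lon) pairs from an Instagram-like location payload."""
    return [(loc["latitude"], loc["longitude"]) for loc in payload.get("data", [])]


tweet_point = tweet_lat_lon(json.loads(TWEET))
insta_points = instagram_lat_lon(json.loads(INSTAGRAM))
print(tweet_point, insta_points)
```

Returning None for non-geotagged posts makes it easy to count how small the geotagged share actually is in a given collection.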



Movements in a day in Rome

[Figure: map of movements in Rome; axes: lon (12.3 to 12.7), lat (41.8 to 42.0)]

Application from tourism study

Density along the year in Milan

[Figure: tweet density maps of Milan on 21/02/2014, 08/04/2014 and 25/04/2014; labelled places and events: Salone Mobile, April 25th, Fiera Milano Rho, Duomo Square, Forum Assago, Concert]

Travels of Italian Twitter accounts through 2012-2016

Travel in time

[Diagram: time axis with labels: Past, t0 (crisis event date), Backward looking, Monitoring activity]

Can we exploit also information coming from networks? Which are the hubs of information?

Would tracking them help in forecasting the flows?

The ongoing Syrian Refugee project

COLLECTING DATA USING GEOLOCATION FROM IDOMENI, GREECE

About 5000 Instagram posts limited to the Idomeni camp area in Greece

Period: 11-21 Feb 2016

Plan: track the accounts generating these posts a year later

Issue: Instagram severely restricted API usage in late April 2016, which makes this repository very valuable.


COLLECTING DATA USING #REFUGEES (POSTS LIKED)

Activity around the 5000 posts


1. Extract the Instagram accounts present in the data collection from the Idomeni camp (Greece)

2. Follow them in the future (i.e. the present): are they still in Greece, or have they moved around Europe?

3. Check whether, before the crisis, the accounts were posting from, e.g., Syria, to test for “real” (suspected) refugees

4. Confirm the analysis by manually looking at the profiles
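Steps 1 to 3 can be sketched as a simple classification of each account's geotagged posts before and after the collection window. This is only an illustrative sketch: the bounding boxes are very rough hypothetical rectangles, the function names are made up, and the three sample posts are invented. Step 4 (manual inspection of the profiles) is still needed to confirm anything.

```python
from datetime import date

# Hypothetical, very rough bounding boxes: (lat_min, lat_max, lon_min, lon_max).
REGIONS = {
    "Syria": (32.0, 37.5, 35.5, 42.5),
    "Greece": (34.5, 42.0, 19.0, 29.7),
}
WINDOW_START = date(2016, 2, 11)  # collection period at Idomeni
WINDOW_END = date(2016, 2, 21)


def region_of(lat, lon):
    """Name of the first bounding box containing the point, else 'elsewhere'."""
    for name, (lat_min, lat_max, lon_min, lon_max) in REGIONS.items():
        if lat_min <= lat <= lat_max and lon_min <= lon <= lon_max:
            return name
    return "elsewhere"


def summarise_account(posts):
    """posts: iterable of (date, lat, lon) tuples for one account."""
    before = {region_of(lat, lon) for day, lat, lon in posts if day < WINDOW_START}
    after = {region_of(lat, lon) for day, lat, lon in posts if day > WINDOW_END}
    return {
        "before": sorted(before),
        "after": sorted(after),
        "posted_from_syria_before": "Syria" in before,       # step 3
        "left_greece": bool(after) and "Greece" not in after,  # step 2
    }


# Invented example: one account seen near Aleppo, then Idomeni, then Berlin.
posts = [
    (date(2015, 6, 1), 36.20, 37.16),
    (date(2016, 2, 15), 41.12, 22.51),
    (date(2017, 2, 20), 52.52, 13.40),
]
summary = summarise_account(posts)
print(summary)
```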


Three possible opportunities/challenges:

• small area estimation approach (Rao, 2003): use social media data as IV

• anchoring social media data to official statistics (Bayesian/Multilevel approach)

• build composite statistics (survey + social media) to nowcast migration phenomena

The new magic word in Social Data Science: “data mashup” = mix Big Data with official statistics

Small area estimation idea: HCR (Risk-at-poverty-index) vs Mobile phone data (Md). Marchetti et al. (2015), JOS, 31(2), 263-281.

[…] a measure of entropy where $(l_1, l_2)$ represents a pair of locations, $p_v(l_1, l_2)$ is the probability of observing a movement of vehicle $v$ between the locations $l_1$ and $l_2$, and $L$ is the total number of locations. The probability $p_v(l_1, l_2)$ is given by the ratio between the number of trips of $v$ between $l_1$ and $l_2$ and the total number of trips of $v$. When $l_1$ is equal to $l_2$, $p_v(l_1, l_2)$ is set to 0. Then, we define the mobility of an area $d$ as:

\[ M_d = \frac{1}{V_d} \sum_{v \in d} M_v, \tag{3} \]

where $V_d$ is the number of vehicles resident in area $d$. A vehicle is considered resident in the area where it most frequently stops during the night. The mobility value tends to zero when the vehicle $v$ visits few distinct locations, showing low mobility diversity. On the other hand, when the mobility measure (2) increases, it means that the vehicle $v$ makes journeys with several locations as destinations. We calculate the standard deviation of the mobility $M_d$ for each area. For a given area $d$ we measure the standard deviation of the mobility by:

\[ s_{M_d} = \left\{ \frac{\sum_{v \in d} (M_v - M_d)^2}{V_d - 1} \right\}^{1/2}, \tag{4} \]

where $M_v$ and $M_d$ are defined by (2) and (3).

Figure 1 shows the scatterplot of the HCR values plotted against the $s_{M_d}$ values computed for the ten provinces of the Tuscany region. Their linear correlation coefficient, used as a mere descriptive index, is equal to $-0.74$. This result suggests that higher levels of heterogeneity of mobility ($M_d$), expressed by the standard deviation $s_{M_d}$, are in the provinces where there are lower levels of poverty. In other words, the diversification of mobility within an area with respect to its mean value can be a proxy […]

Fig. 1. Scatterplot of the standard deviation of the mobility vs. estimates of the HCR at province level in Tuscany.
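As a small worked illustration of equations (3) and (4) in the excerpt, the sketch below computes the area mobility and its standard deviation from a list of vehicle-level mobility values. The vehicle-level values are taken as given (equation (2) is not reproduced in the excerpt), the function names are hypothetical, and the numbers are invented.

```python
import math


def area_mobility(vehicle_mobility):
    """M_d: mean of the vehicle-level mobility values of the vehicles resident in area d (eq. 3)."""
    return sum(vehicle_mobility) / len(vehicle_mobility)


def area_mobility_sd(vehicle_mobility):
    """s_Md: standard deviation of vehicle mobility in area d, with V_d - 1 in the denominator (eq. 4)."""
    m_d = area_mobility(vehicle_mobility)
    v_d = len(vehicle_mobility)
    squared_deviations = sum((m_v - m_d) ** 2 for m_v in vehicle_mobility)
    return math.sqrt(squared_deviations / (v_d - 1))


# Invented mobility values for four vehicles resident in one area.
m_v_values = [0.10, 0.20, 0.30, 0.40]
m_d = area_mobility(m_v_values)
s_md = area_mobility_sd(m_v_values)
print(round(m_d, 4), round(s_md, 4))
```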

Model HCR as a function of Md for the data at hand, then use Md to estimate the unobserved HCR in the region d.

Use the estimate of HCR in further analysis.
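The idea of modelling HCR as a function of the mobility measure and then predicting it where it is unobserved can be sketched with an ordinary least squares line. This is a deliberately simplified stand-in, not the small area model of the cited paper: the function name is hypothetical and all numbers are invented for illustration.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Invented data: areas where both the mobility measure and the HCR are observed.
mobility_observed = [100.0, 110.0, 120.0, 130.0]
hcr_observed = [0.25, 0.20, 0.15, 0.10]

intercept, slope = fit_line(mobility_observed, hcr_observed)

# Area where only the mobility measure is available: estimate its HCR.
hcr_estimate = intercept + slope * 115.0
print(intercept, slope, hcr_estimate)
```

A negative slope here mirrors the negative correlation reported in the excerpt: more heterogeneous mobility goes with lower poverty.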

Extension: here d is space (province), but we can extend the model to time as well using a time series approach

Goal: extend/estimate official statistic data

On a continuous-time & space model for small area estimation (C-SAE)

(ongoing project called: SWBI)

The continuous-time and space model is stated as follows:

Let $Y_t^d$ be a direct measure of a proxy of some sort of well-being (e.g. “hospital admissions”) for region $d = 1, \ldots, D$, at time $t$.

Let $\mu_t^d$ be the target latent measure of well-being (e.g. “health-related quality of life”) for region $d = 1, \ldots, D$.

Assume there are other covariates, e.g., official statistics, say $X_{j,t}^d$, for $j = 1, \ldots, K$, region $d$ at time $t$, that summarize the “standard of living”, so that we can write

\[ dY_t^d = \mu_t^d\,dt + \sum_{j=1}^{K} \beta_j X_{j,t}^d\,dt + dB_t^d \]

OBS = UNOBS + COVARIATES + NOISE

Then assume that there are other $m_1$ variables $Z_t^d$ which are expressions of $\mu$. We assume that these covariates are also geographically dependent. Finally, $S_t^d$ are $m_2$ variables that come from Social Media (e.g. SWBI components). We assume that $Z_t^d$ and $S_t^d$ contribute to $\mu_t^d$ in this way:

\[ d\mu_t^d = \left( \theta_0^d + \sum_{i=1}^{m_1} \theta_i^d Z_{i,t}^d + \sum_{i=1}^{m_2} \psi_i^d S_{i,t}^d \right) dt + \sum_{l=1}^{D} w_{l,d}\,dL_t^d \]

where $w = (w_{d,l})$ is a proximity matrix detected among the different areas, which is used to control for spatial correlation.

\[ dX_{i,t}^d = \zeta_i^d (\nu_i^d - X_{i,t}^d)\,dt + d\Pi_{i,t}^d, \quad i = 1, \ldots, K \]

\[ dZ_{i,t}^d = \alpha_i^d (\gamma_i^d - Z_{i,t}^d)\,dt + dW_{i,t}^d, \quad i = 1, \ldots, m_1 \]

\[ dS_{\ell,t}^d = \lambda_\ell^d (\eta_\ell^d - S_{\ell,t}^d)\,dt + d\Gamma_{\ell,t}^d, \quad \ell = 1, \ldots, m_2 \]

where $Z_{i,t}^d$ represent the quality of life variables, and $B_t^d$, $L_t^d$, $W_t^d$ and $\Gamma_t^d$ are vectors of independent Brownian motions.

As quality of life variables we consider the SWBI index and its components, some weather and pollution indicators, the consumer price index for blue and white-collar worker households, and a proxy of the job market.

provinces of Lombardy, d = 1, …, 11

[Figure: time series for Milan, Dec 2013 to Dec 2015, one panel per variable: MilanoRicoveri, MilanoConsumi, MilanoCasa, MilanoPensioni, MilanoValoreAggiunto, MilanoTenoreVita, MilanoServiziAmbiente, MilanoAffariLavoro, MilanoOrdinePubblico, MilanoPopolazione, MilanoTempoLibero, Milanodisoccupati, MilanoPrezzi, Milanopop_media, Milanonum_avviamenti, Milanopm10, Milanopm2.5, Milanotmedia, Milanoumidita, Milanoemo, Milanofun, Milanorel, Milanores, Milanosat, Milanotru, Milanovit, Milanowor, Milanoswbi]

Social Media estimates

Putting all together:

\[ dY_t^d = \left( \theta_0^d + \sum_{i=1}^{m_1} \theta_i^d Z_{i,t}^d + \sum_{i=1}^{m_2} \psi_i^d S_{i,t}^d \right) dt + \sum_{l=1}^{D} w_{l,d}\,dL_t^d + \sum_{j=1}^{K} \beta_j X_{j,t}^d\,dt + dB_t^d \]

Once the parameters are estimated, we can predict the underlying measure as follows:

\[ d\mu_t^d = \left( \theta_0^d + \sum_{i=1}^{m_1} \theta_i^d Z_{i,t}^d + \sum_{i=1}^{m_2} \psi_i^d S_{i,t}^d \right) dt + \sum_{l=1}^{D} w_{l,d}\,dL_t^d \]

In our case we have D = 11, m1 = 4, m2 = 8, K = 2, which means 165 (= D × (1 + K + m1 + m2)) equations and 473 parameters!
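The arithmetic behind these two numbers can be checked in a few lines. The equation count follows the formula on the slide. The parameter count is only one possible reading that happens to reproduce 473. It assumes that every region has its own intercept, its own coefficients on Z, S and X, and two parameters (speed and long-run level) for each mean-reverting process, and it ignores the proximity matrix and any diffusion coefficients.

```python
D, K, m1, m2 = 11, 2, 4, 8

# One equation for Y plus one for each X, Z and S process, in every region.
equations_per_region = 1 + K + m1 + m2
n_equations = D * equations_per_region

# Assumed parameter count per region (a reading that reproduces 473):
#   1 intercept + m1 + m2 drift coefficients in the latent equation,
#   K coefficients on the official-statistics covariates,
#   2 parameters (speed, long-run level) for each of the K + m1 + m2 processes.
params_per_region = (1 + m1 + m2) + K + 2 * (K + m1 + m2)
n_parameters = D * params_per_region

print(n_equations, n_parameters)  # 165 473
```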

Advantages of the continuous-time & space model for small area estimation (C-SAE)

Using official statistics for Y, X and Z means that we have at most monthly data, while social media data S are intraday or daily data

Once the model is estimated at “low frequency” (e.g., monthly), we can simulate it at any frequency/time t.
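To illustrate what simulating at any frequency means, the sketch below simulates a single mean-reverting process of the kind used for the covariates with the Euler-Maruyama scheme, first on a monthly grid and then on a much finer grid, using the same parameters. Everything here is an assumption made for illustration: the function name, the parameter values, the explicit diffusion coefficient `sigma` and the choice of one month as the time unit.

```python
import math
import random


def simulate_ou(x0, speed, level, sigma, horizon, n_steps, rng):
    """Euler-Maruyama path of dX = speed * (level - X) dt + sigma dW on [0, horizon]."""
    dt = horizon / n_steps
    path = [x0]
    for _ in range(n_steps):
        x = path[-1]
        dw = rng.gauss(0.0, math.sqrt(dt))  # Brownian increment over dt
        path.append(x + speed * (level - x) * dt + sigma * dw)
    return path


rng = random.Random(2017)

# Hypothetical parameters, as if estimated on monthly data (time unit = 1 month).
params = dict(speed=0.8, level=60.0, sigma=2.0)

monthly = simulate_ou(50.0, horizon=12.0, n_steps=12, rng=rng, **params)
daily = simulate_ou(50.0, horizon=12.0, n_steps=360, rng=rng, **params)
print(len(monthly), len(daily))
```

Because the parameters are defined in continuous time, only the step size changes between the two calls, not the model.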

Anchoring Big Data to Official Statistics/Survey data

Anchoring social media data to official statistics: ongoing.

$X_t^s \sim L(\theta_t^s)$: social media statistics, $t$ = time, $s$ = space

$\theta_t^s \sim Q(\lambda_t^s)$: $Q$ is the prior calibrated on official statistics

Outcome: posterior distribution of $X_t^s$, $P(X_t^s; \theta_t^s \mid \lambda_t^s)$.

Goal 1: adjust social media statistics (or indirectly estimate bias?) to obtain high-frequency and space-distributed social info in real time, well before official statistics or survey data are collected in the next wave.

Goal 2: make official-statistics subjects (institutions, politicians, academics, etc.) happier with social media data.
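One very simple instance of the anchoring idea is a conjugate Beta-Binomial update. The prior on a rate is centred on the official statistic, and the social media counts enter as the likelihood. This is only an illustrative sketch: the slides do not specify the distributions, so the Beta and binomial choices, the function names, the prior strength and all numbers are assumptions.

```python
def beta_prior_from_official(rate, strength):
    """Beta(a, b) prior with mean `rate`; `strength` acts like a pseudo-sample size."""
    return rate * strength, (1.0 - rate) * strength


def posterior(a, b, successes, trials):
    """Conjugate update of a Beta(a, b) prior with binomial counts."""
    return a + successes, b + (trials - successes)


def beta_mean(a, b):
    return a / (a + b)


# Official statistic: a rate of 10%, trusted like 200 observations (hypothetical).
a0, b0 = beta_prior_from_official(rate=0.10, strength=200)

# Social media sample for the same area and period: 90 of 300 posts (invented).
a1, b1 = posterior(a0, b0, successes=90, trials=300)

raw_rate = 90 / 300
anchored_rate = beta_mean(a1, b1)
print(raw_rate, anchored_rate)
```

The anchored rate falls between the official figure and the raw social media figure, and the prior strength controls how far it is pulled towards the official statistic.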

Goal: Nowcasting overall expectations and act accordingly. Example: looking at switching of WNI’s trend, speculate on the market.

Combining official statistics, survey and social media: Wired Next Index (WNI)

WNI (“measures” expectations about the economic wealth of a country, well... Italy) combines different time series:

• Off. Stat: GDP, Import/Export, Unemployment rate (low frequency, backward looking)

• Survey Data: consumer expectations, entrepreneurs expectations (low freq., forward looking)

• Social Media Data: sentiment data on economy, politics, personal wealth (high frequency, geo-referenced, nowcasting)

http://index.wired.it

What about Replicability?

Replicability of this particular experiment: given the structure of the query used to download the data and the “post ID”, everyone can replicate the analysis.

Replicability of a similar idea: as the model is very easy to understand (although computationally intensive), the same idea can be applied to other situations. We are indeed working with clandestine migrants in Italy and soon Spain using mainly Twitter data.

Operational office: Via Gaspare Bugatti 7/A, Milano

Tel. +39 3661652058/61/64, Fax +39 0269000855

voices-int.com

[email protected]

@blogsvoices

Thanks!

For further information: [email protected]

http://voices-int.com
