Nov 2014
Social media fingerprints of unemployment

Alejandro Llorente (1,2), Manuel García-Herranz (3), Manuel Cebrian (4,5), Esteban Moro (1,2)
Abstract

Recent wide-spread adoption of electronic and pervasive technologies has enabled the study of human behavior at an unprecedented level, uncovering universal patterns underlying human activity, mobility, and inter-personal communication. In the present work, we investigate whether deviations from these universal patterns may reveal information about the socio-economical status of geographical regions. We quantify the extent to which deviations in diurnal rhythm, mobility patterns, and communication styles across regions relate to their unemployment incidence. For this, we examine a country-scale publicly articulated social media dataset, where we quantify individual behavioral features from over 145 million geo-located messages distributed among more than 340 different Spanish economic regions, inferred by computing communities of cohesive mobility fluxes. We find that regions exhibiting more diverse mobility fluxes, earlier diurnal rhythms, and more correct grammatical styles display lower unemployment rates. As a result, we provide a simple model able to produce accurate, easily interpretable reconstructions of regional unemployment incidence from their social-media digital fingerprints alone. Our results show that cost-effective economical indicators can be built based on publicly-available social media datasets.
Keywords: Human mobility; Social networks; Communication patterns; Unemployment
(1) Instituto de Ingeniería del Conocimiento, Universidad Autónoma de Madrid, Madrid 28049, Spain
(2) Departamento de Matemáticas & GISC, Universidad Carlos III de Madrid, Leganés 28911, Spain
(3) UNICEF Innovation Unit, New York, NY 10017, USA
(4) Department of Computer Science and Engineering, University of California at San Diego, La Jolla, CA 92093, USA
(5) National Information and Communications Technology Australia, Melbourne, Victoria 3003, Australia
Corresponding author: emoro@math.uc3m.es
Human behavior is closely intertwined with socioeconomical status, as many of our daily routines are driven by activities related to maintaining, improving, or afforded by such status [1–3]. From our movements around the city, to our daily schedules, to the communication with others, humans perform different actions along the day that reflect and impact their economical situation. The distribution of different individual behaviors across neighborhoods, municipalities, or cities impacts the economical development of those geographical areas and, in turn, that of the whole country [4–9]. Detecting patterns and quantifying relevant metrics to unveil the complex relationship between geography and collective behavior is thus of paramount importance for understanding the economical heart-beat of cities and the structure of inter-city networks, and thus for economic planning, educational policy, urban planning, transportation design and other large-scale societal problems [10–14].
Much knowledge about how mobility, social communication and education affect the economical development of cities has been obtained through complex and costly surveys, with an update rate ranging from fortnights (unemployment) to decades (census) [15–17]. At the same time, the recent availability of vast and rich datasets of individual digital fingerprints has increased the scale and granularity at which we can measure these behavioral features, reduced the cost and update interval of these measurements, and provided new opportunities to combine them with more traditional socio-economical surveys [14, 18–22].
In this work we provide a proof of concept for the use of individual digital fingerprints from social media to infer city-level behavioral measures and then uncover their relationship with socioeconomic output. We present a comprehensive study of the different behavioral traces that can be extracted from social media: (i) technology adoption from (social media) user demographics, (ii) mobility patterns from geo-located messages, (iii) communication patterns from exchanged messages, and (iv) content analysis from the published messages. To this end, we use a country-scale publicly articulated social media dataset in Spain, where we infer behavioral patterns from almost 146 million geo-located messages. We match this dataset with granular unemployment data at the level of municipality, measured at the peak of the Spanish financial crisis (2012–2013). We consider unemployment to be the most important signal of the socioeconomic status of a region, since the crisis has had a very large impact on unemployment in the country (around 9.2% in 2005, more than 26% in 2013).
Our extensive investigation of this large variety of traces in a large social media dataset allows us not only to build an accurate model of unemployment impact across geographical areas, but also to compare globally the metrics previously reported in diverse works and datasets, as well as to assess their relevance and uniqueness for understanding economical development [14, 19, 20, 22–27]. As we will show, technology adoption, mobility, diurnal activity, and communication style metrics carry different weights in explaining unemployment in different geographical areas. Our goal is not to state causality between unemployment and the extracted metrics, but to uncover the relationship that emerges when we observe the economical metrics of cities and their social behavior at the same time.
arXiv:1411.3140v1 [physics.soc-ph] 12 Nov 2014
Figure 1. A) Map of the mobility fluxes $T_{ij}$ between municipalities based on Twitter-inferred trips (white). Infomap communities detected on the network $T_{ij}$ are colored under the mobility fluxes (blue colors). B) Mobility fluxes $T_{ij}$ between municipalities $i$ and $j$ are constructed by aggregating the number of trips between them. C) Correspondence between the observed fluxes $T_{ij}$ and the fitted gravity model fluxes. The dashed line is $T_{ij} = T^{grav}_{ij}$, while the (blue) solid line is a conditional average of $T^{grav}_{ij}$ for fixed values of $T_{ij}$.
1 Social media dataset and functional partition of cities
Twitter is a microblogging online application where users can express their opinions, share content and receive information from other users in text messages of up to 140 characters, commonly known as tweets. Users can interact with other users by mentioning them or by retweeting their content (sharing someone's tweet with your followers). Some of these tweets contain information about the geographical location where the user was located when the tweet was published; we refer to them as geo-located tweets.
To perform our analysis, we consider 19.6 million geo-located Twitter messages (tweets), collected through the public API provided by Twitter from continental Spain, ranging from 29th November 2012 to 30th June 2013. Tweets were posted by 0.57 million unique (properly anonymized) users and geo-positioned in 7683 different municipalities. We observed a large correlation (Pearson's coefficient $\rho = 0.951$ [0.949, 0.953]) between the number of geo-positioned tweets per municipality and the municipality's population. On average, we find around 50 tweets per month and per 1000 persons in each municipality.
Despite this high level of social media activity within municipalities, we find their official administrative areas not suitable for studying socio-economical activity: administrative boundaries between municipalities reflect political and historical decisions, while economical trade and activity often happen across those boundaries. The result is that municipalities in Spain are artificially diverse, ranging from a municipality with only 7 inhabitants to another with a population of 3.2 million. Although there exist natural aggregations of municipalities into provinces (regions) or statistical/metropolitan areas (NUTS areas), we have used our own procedure to detect economical areas. In particular, we have used daily user trips between pairs of municipalities as a measure of the economic relatedness between said municipalities. We say that there is a daily trip between municipalities $i$ and $j$ if a user has tweeted in places $i$ and $j$ consecutively within the same day. In our dataset we find 1.9 million trips by 0.22 million users. With those trips we construct the daily mobility flux network $T_{ij}$ between municipalities as the number of trips between places $i$ and $j$ (see figure 1B). Remarkably, the statistical properties of trips and of the mobility matrix $T_{ij}$ coincide with those of other mobility datasets (see SI section 2): for example, trip distance $r$ and elapsed time $\delta t$ are power-law distributed, with $P(r) \sim r^{-1.67}$ and $P(\delta t) \sim \delta t^{-0.62}$, very similar to the distributions found in the literature [9, 23]. Moreover, the mobility fluxes $T_{ij}$ are well described by the Gravity Law ($R^2 = 0.80$) [28]:
$$T_{ij} \simeq T^{grav}_{ij} = \frac{P_i^{\alpha_i} P_j^{\alpha_j}}{d_{ij}^{\beta}} \qquad (1)$$
where $P_i$ and $P_j$ are the populations of municipalities $i$ and $j$, and $d_{ij}$ is the distance between them. The exponents in (1) are also very similar to those reported in other works: $\alpha_i \simeq \alpha_j = 0.48$ and $\beta \simeq 1.05$ [23, 29]. These results suggest that the mobility detected from geo-located tweets is a good proxy of human mobility within and between municipalities [30].
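The daily-trip definition and the gravity expression (1) can be illustrated with a minimal Python sketch. It is not the authors' code: the record layout `(user, day, timestamp, municipality)`, the function names `daily_trips` and `gravity_flux`, the undirected storage of fluxes and the toy records are all assumptions made for illustration.

```python
from collections import defaultdict


def daily_trips(tweets):
    """Count trips between municipalities.

    tweets: iterable of (user, day, timestamp, municipality) records.
    A trip i-j is counted when the same user posts two consecutive
    tweets in different municipalities within the same day.
    """
    per_user_day = defaultdict(list)
    for user, day, timestamp, municipality in tweets:
        per_user_day[(user, day)].append((timestamp, municipality))

    flux = defaultdict(int)  # flux[(i, j)] plays the role of T_ij
    for visits in per_user_day.values():
        visits.sort()  # chronological order within the day
        for (_, origin), (_, destination) in zip(visits, visits[1:]):
            if origin != destination:
                # Assumption: fluxes are stored undirected (sorted pair).
                flux[tuple(sorted((origin, destination)))] += 1
    return dict(flux)


def gravity_flux(pop_i, pop_j, distance, alpha=0.48, beta=1.05):
    """Right-hand side of equation (1), with no fitted prefactor."""
    return (pop_i ** alpha) * (pop_j ** alpha) / (distance ** beta)


# Toy records; every value here is made up.
toy_tweets = [
    ("u1", "2013-01-07", 9.0, "A"),
    ("u1", "2013-01-07", 13.5, "B"),
    ("u1", "2013-01-07", 20.0, "A"),
    ("u1", "2013-01-08", 10.0, "C"),
    ("u2", "2013-01-07", 8.0, "B"),
    ("u2", "2013-01-07", 18.0, "B"),
]
toy_flux = daily_trips(toy_tweets)
print(toy_flux)
```

In the toy data, user `u1` moves from A to B and back on the same day, which yields two trips on the pair (A, B); the tweet in C on the following day and the two tweets of `u2` in the same place produce no trip.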
We use the network of daily fluxes between municipalities $T_{ij}$ to detect the geographical communities of economical activity. To this end, we partition the mobility network $T_{ij}$ using standard graph community-finding algorithms. This technique has been applied extensively, especially with mobile phone data, to unveil the effective maps of countries based on the mobility and/or social interactions of people [31–33]. In our case, we have used the Infomap algorithm [34] and found 340 different communities within Spain. For further details about the comparison among different state-of-the-art community detection algorithms executed on the inter-city graph, see SI section 3. The average number of municipalities per community is 21 and the largest community contains 142 municipalities. The communities detected have very interesting features (see SI section 3): (i) they are geographically cohesive (see figure 1); (ii) they are statistically robust against random removal of trips in our database (SI table S2); and (iii) the modularity of the partition is very high ($\simeq 0.76$, see SI table S3). Finally, (iv) the partition found has some overlap (77% of Normalized Mutual Information, NMI, see [35]) with coarser administrative boundaries like provinces (regions) (see SI section 3 for details). Interestingly, it shows a larger overlap (83% of NMI) with comarcas (counties), areas in Spain that reflect geographical and economical relations between municipalities. This result shows that the mobility detected from geo-located tweets and the communities obtained are a good description of economical areas.
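As a hedged illustration of how the overlap between two partitions of the same municipalities can be quantified, the sketch below computes a Normalized Mutual Information. The normalization $2I(A;B)/(H(A)+H(B))$ is an assumption (the exact variant used in [35] is not restated in the text), and the function name and toy labels are hypothetical.

```python
import math
from collections import Counter


def normalized_mutual_information(labels_a, labels_b):
    """NMI = 2 I(A;B) / (H(A) + H(B)) for two labelings of the same items."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("labelings must be non-empty and of equal length")
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))

    def entropy(counts):
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    # I(A;B) = sum_ab p_ab * log(p_ab / (p_a * p_b))
    mutual_info = sum(
        (c / n) * math.log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in count_ab.items()
    )
    h_a, h_b = entropy(count_a), entropy(count_b)
    if h_a + h_b == 0:
        return 1.0  # both partitions place every item in a single group
    return 2.0 * mutual_info / (h_a + h_b)


# Toy example: six municipalities, detected communities vs. provinces.
communities = [0, 0, 1, 1, 2, 2]
provinces = ["X", "X", "X", "X", "Y", "Y"]
print(normalized_mutual_information(communities, provinces))
```

With this normalization, identical partitions score 1 and statistically independent partitions score 0, so values such as 77% or 83% indicate substantial but not complete agreement.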
In the rest of the paper we restrict our analysis to the geographical areas defined by the communities detected with Infomap (see figure 1). For statistical reasons, we discard communities which are not formed by at least 5 municipalities. Despite this sampling, 96% of the total country population is considered in our analysis. Our results in the rest of the paper also hold for municipalities, counties or provinces, though with lower statistical power (see SI section 9).
2 Social media behavioral fingerprints
The goal of this work is to quantify how and what behavioral features can be extracted from social media and then related back to the economical level of cities. To this end, we define four groups of measures that have been widely explored in other fields like economics or the social sciences. These four types of measures rely on the identification of the place where users live. Instead of using information in the user profile, we analyze the places where the user has tweeted, and we set as the hometown of the user the municipality where he or she has tweeted with the highest frequency, a method usually employed with mobile phone and social media data [11, 23]. Specifically, we select those users with more than 5 geo-located tweets in our period who have posted at least 40% of their tweets in a given municipality, which we consider their hometown. After this filtering we end up with 0.32 million users, and we can then define the Twitter population $\pi_i$ in area $i$ as the number of users with their hometown within area $i$. We obtain a very high correlation between $\pi_i$ and the population of the cities $P_i$ in the national census, $\rho = 0.977$ [0.976, 0.978], which provides an indirect validation of our approach with the present data. However, not all demographic groups are equally represented in our Twitter database. As shown in SI section 4, Twitter user demographics in Spain obtained from surveys [36] show that age groups above 44 years old are under-represented. Thus, our results mainly describe the socio-economical status of people below 44 years old. The unemployment analysis is therefore performed for different age groups: people below 25 years old, between 25 and 44 years old, and older than 44 years old. Finally, we have chosen the unemployment reported officially at the end of our observation time window (June 2013), but our results are not affected by the month selected; see SI section 7.
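The hometown filter described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input is taken to be a list of `(user, municipality)` pairs, one per geo-located tweet, and the names `assign_hometowns`, `twitter_population` and `area_of` are hypothetical.

```python
from collections import Counter, defaultdict


def assign_hometowns(tweets, min_tweets=6, min_share=0.40):
    """Map each retained user to a hometown municipality.

    tweets: iterable of (user, municipality) pairs, one per geo-located tweet.
    min_tweets=6 encodes "more than 5 geo-located tweets";
    min_share is the minimum fraction of tweets posted in the hometown.
    """
    per_user = defaultdict(Counter)
    for user, municipality in tweets:
        per_user[user][municipality] += 1

    hometowns = {}
    for user, counts in per_user.items():
        total = sum(counts.values())
        place, top = counts.most_common(1)[0]  # most frequent municipality
        if total >= min_tweets and top / total >= min_share:
            hometowns[user] = place
    return hometowns


def twitter_population(hometowns, area_of):
    """pi_i: number of users whose hometown lies in area i."""
    return dict(Counter(area_of[place] for place in hometowns.values()))


# Toy data (made up): u1 passes both filters, u2 has too few tweets,
# u3 is too spread out to have a hometown.
toy = (
    [("u1", "A")] * 4 + [("u1", "B")] * 2
    + [("u2", "A")] * 3
    + [("u3", m) for m in "ABCDEF"]
)
homes = assign_hometowns(toy)
print(homes)
print(twitter_population(homes, {"A": "area1", "B": "area1"}))
```

Dividing the resulting counts by census population would give a penetration rate per area; the sketch stops at the counts because the census figures are not part of the text.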
Figure 2. Examples of different behaviour in the observed variables and the unemployment. In A we observe that two cities with different unemployment levels have different temporal activity patterns. Panel C shows how communities (red) with distinct entropy levels of social communication with other communities (blue) may hold different unemployment intensity: the left map shows a highly focused communication pattern (low entropy), while the right map corresponds to a community with a diverse communication pattern (high entropy). Finally, panel B shows some examples of misspellings detected in our database using 618 incorrect expressions (see SI section 6), such as "Con migo", "Aver" or "llendo".

For every considered region we investigate the officially reported unemployment for different age groups and a number of metrics related to social media activity. Some of those metrics are already reported in the literature, while others are introduced in this work. Specifically, we consider:
• Social media technology adoption: we can use the Twitter penetration rate $\tau_i = \pi_i / P_i$ in each area $i$ as a proxy of technology adoption. Recent works have shown that there is indeed a positive correlation between $\tau_i$ and GDP at the country level [23]. However, in our data we find the opposite correlation (see figure 3): the larger the penetration rate, the higher the unemployment. This suggests that the impact of technology adoption at the country scale differs from what happens within an (industrialized) country, where the technology to access social media is commoditized.
• Social media activity: regions with very different economical situations should exhibit different patterns of activity during the day. Since working, leisure, family, shopping and other activities happen at different times of the day, we might observe different daily patterns in regions with different socio-economical status. For example, we hypothesize that communities with low levels of unemployment will tend to have higher activity levels at the beginning of a typical weekday. This is indeed what we find: figure 2A shows the hourly fraction of tweets during workdays for two communities with very different unemployment rates. The two profiles are quite different, and in the case of low unemployment we find a strong peak of activity between 8 and 11am (morning) and lower activity during the afternoons and nights. We encode this finding in $\nu_{mrng,i}$, $\nu_{aftn,i}$ and $\nu_{ngt,i}$, the total fraction of tweets happening in geographical area $i$ between 8am and 10am, 3pm and 5pm, and 12am and 3am, respectively. Figure 3 shows a strong negative correlation between $\nu_{mrng,i}$ and the unemployment for the communities in our database, and a positive correlation with $\nu_{aftn,i}$ and $\nu_{ngt,i}$.
• Social media content: some works have observed a correlation between the economical situation of countries and the frequency of words related to work conditions [22] or of forward-looking searches [21]. In our case, we also find a moderate positive correlation between the observed unemployment and the fraction of tweets $\mu_i$ mentioning job or unemployment terms, while the correlation is negative for the number of mentions of employment or the economy. However, we have also tried a different approach by measuring the relation between the way of writing and the educational level [37]. To this end, we build a list of 618 misspelled Spanish expressions and extract the tweets of the dataset containing at least one of these expressions (see SI section 6 for further details about how these expressions were collected). We only consider tweets in Spanish, detected with an N-gram based algorithm, and only misspellings that cannot be justified as abbreviations. Finally, we compute for every region the proportion $\varepsilon_i$ of misspellers among the Twitter population. If the fraction of misspellers per geographical area is a proxy for the educational level of that region, we expect a positive correlation between $\varepsilon_i$ and unemployment. Indeed, we find (see figure 3) a strong correlation between the fraction of misspellers and unemployment.
• Social media interactions and geographical flow diversity: following the ideas in [14], which correlated the economical development of an area with the diversity of its communications with other areas, we consider all tweets mentioning another user and take them as a proxy for communication between users. We then compute the number of communications $w_{ij}$ between areas $i$ and $j$ as the number of mentions between users in those areas. To measure the diversity we use, as in [14], the informational normalized entropy (Entropy 1) $S^u_i = -\sum_j p_{ij} \log p_{ij} / S^r_i$, where $p_{ij} = w_{ij} / \sum_j w_{ij}$ and (Entropy 2) $S^r_i = \log k_i$, with $k_i$ the number of different areas with which users in area $i$ have interacted. As in [14], we find that areas with large unemployment have less diverse communication patterns than areas with low unemployment. This translates into a strong negative correlation between social diversity and the unemployment; see figure 3. Similar ideas are applied to the flows of people between areas, to investigate the diversity of the geographical flows through the entropy $S_i = -\sum_j p_{ij} \log p_{ij} / S^r_i$, where $p_{ij} = T_{ij} / \sum_j T_{ij}$ and $S^r_i = \log k_i$, with $k_i$ the number of different areas which have been visited by users that live in area $i$. Figure 3 shows that, as in [19], the correlation of these geographical entropies with economical development is low.
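The normalized entropy used for both social and geographical diversity can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name is hypothetical, the input is assumed to be the list of weights ($w_{ij}$ or $T_{ij}$) from one area $i$ to every other area, and returning zero when only one area is contacted is an assumption made to avoid dividing by $\log 1 = 0$.

```python
import math


def normalized_entropy(weights):
    """Entropy 1: -sum_j p_j log p_j divided by log k.

    weights: interaction (or trip) counts from area i to each other area.
    k is the number of areas with a nonzero count (Entropy 2 is log k).
    """
    positive = [w for w in weights if w > 0]
    k = len(positive)
    if k <= 1:
        # Assumption: a single contacted area is treated as zero diversity.
        return 0.0
    total = sum(positive)
    shannon = -sum((w / total) * math.log(w / total) for w in positive)
    return shannon / math.log(k)


# Toy weights (made up): a focused and a diverse communication pattern.
focused = [97, 1, 1, 1]
diverse = [25, 25, 25, 25]
print(normalized_entropy(focused), normalized_entropy(diverse))
```

The normalization by $\log k_i$ bounds the measure between 0 and 1, so areas that interact with different numbers of other areas remain comparable: a perfectly even spread scores 1 regardless of $k_i$.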
Normalization of variables is discussed in SI section 5. We have also studied the correlation between the variables considered. As expected, variables in each group show moderate correlations between them. However, inspection of the correlation matrix and a Principal Component Analysis of the variables considered show that there is information (as percentage of variance in the data) in each of the groups of variables; see SI section 5. Because of these two facts, we restrict our analysis to the variables within each group with the highest correlation with the unemployment, namely the penetration rate $\tau_i$, the social and mobility diversity variables (Entropy 1 in both cases), the morning activity $\nu_{mrng,i}$, the fraction of misspellers $\varepsilon_i$, and the fraction of employment-related tweets $\mu_{emp,i}$.
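The activity fractions $\nu$ defined in the list above reduce to counting tweets per hour window. The sketch below is a hedged illustration: the function name is hypothetical, hours are assumed to be integers from 0 to 23 taken from weekday tweets of a single area, and each window is assumed to include its first hour and exclude its last one.

```python
def activity_fractions(tweet_hours):
    """Fraction of an area's weekday tweets falling in each time window.

    tweet_hours: list of integer posting hours (0-23), one per tweet.
    """
    windows = {
        "morning": range(8, 10),     # 8am up to, not including, 10am
        "afternoon": range(15, 17),  # 3pm up to 5pm
        "night": range(0, 3),        # 12am up to 3am
    }
    total = len(tweet_hours)
    if total == 0:
        return {name: 0.0 for name in windows}
    return {
        name: sum(1 for h in tweet_hours if h in hours) / total
        for name, hours in windows.items()
    }


# Toy posting hours for one area (made up).
toy_hours = [8, 9, 9, 12, 15, 16, 22, 1, 2, 20]
print(activity_fractions(toy_hours))
```

In the toy list, three of ten tweets fall in the morning window and two each in the afternoon and night windows; the remaining tweets contribute only to the denominator.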
3 Explanatory power of social media in unemployment
The four previous groups of variables are fingerprints of human behavior reflected in Twitter usage habits. As we observed in figure 3, all of them exhibit statistically strong correlations with unemployment. The questions we address in this section are whether those variables suffice to explain the observed unemployment (their explanatory power) and which of them are the most important (which give more explanatory power than others). Note that we are not stating a causality arrow between the measures built in the previous section and the unemployment rate, but only exploring whether they can be used as alternative indicators with a real translation in the economy.
Figure 4 shows the result of a simple linear regression model for the observed unemployment for ages below 25 years as a function of the variables which have the highest correlation with the unemployment. The model has a significant $R^2 = 0.62$, showing that a large part of the unemployment can be explained by the behavioral variables extracted from Twitter. However, not all the variables weigh equally in the model: the penetration rate, geographical diversity, morning activity, and fraction of misspellers account for up to 92% of the explained variance, while social diversity and the number of employment-related tweets are not statistically significant (see SI section 10 for the methods used to determine the relative importance of the variables). It is interesting to note that, while social diversity obtained from mobile phone communications was a key variable in the explanation of deprivation indexes in [14, 19], the communication diversity of Twitter users seems to have a minor role in explaining the heterogeneity of unemployment in Spain.
Similar explanatory power is found for other age groups: $R^2 = 0.44$ for all ages and $R^2 = 0.52$ for ages between 25 and 44 years. However, the model degrades for ages above 44 years ($R^2 = 0.26$), showing that our variables mainly describe the behavior of the most represented age groups in Twitter, namely those below 44 years old. On the other hand, since our Twitter variables seem to describe the behavior of young people, we have investigated whether Twitter-constructed variables have similar explanatory value (in terms of $R^2$) as simple census demographic variables for young people. Regression models that also include the young population rate yield only a minor improvement ($R^2 = 0.65$), while the young population rate alone gives $R^2 = 0.24$, a result which shows that Twitter variables do indeed possess genuine explanatory power beyond their simple demographic representation. Finally, our model has the largest explanatory power for detected communities, but large $R^2$ values are also found for other geographical areas like counties ($R^2 = 0.54$) and provinces ($R^2 = 0.65$); see SI section 9.
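Two quantities used in this section, the coefficient of determination and the percentage weight of each variable, can be sketched in plain Python. The weight computation follows the description in the Figure 4 caption (relative weight of the absolute values of the regression coefficients) and assumes the variables were normalized beforehand; the methods of SI section 10 are not reproduced here, and the function names, coefficients and data are made up.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def relative_weights(coefficients):
    """Percentage weight of each variable from absolute coefficient values.

    Assumption: coefficients come from a regression on normalized
    variables, so their magnitudes are comparable.
    """
    total = sum(abs(c) for c in coefficients.values())
    return {name: 100.0 * abs(c) / total for name, c in coefficients.items()}


# Toy values (made up): observed vs. predicted unemployment, in percent.
toy_observed = [10.0, 20.0, 30.0]
toy_predicted = [12.0, 18.0, 30.0]
toy_coefficients = {"penetration": 2.0, "morning_activity": -1.0, "misspellers": 1.0}
print(r_squared(toy_observed, toy_predicted))
print(relative_weights(toy_coefficients))
```

Taking absolute values means that a variable with a negative coefficient, such as morning activity in the toy example, contributes to the weights in the same way as one with a positive coefficient of equal size.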
4 Discussion
[Figure 3. Panel A plots the correlation with unemployment for each metric: Penetration rate; Entropy 1 (geo); Entropy 2 (geo); Entropy 1 (social); Entropy 2 (social); Activity (morning); Activity (afternoon); Activity (night); Misspellers rate; Job tweets; Employment tweets; Unemployment tweets; Economy tweets. The scatter panels plot Penetration rate, Entropy 1 (social), Activity (morning) and Misspellers rate against Unemployment.]

Figure 3. A) Correlation coefficient of all the extracted Twitter metrics, grouped by technology adoption (black), geographical diversity (orange), social diversity (light blue), temporal activity (green) and content analysis (dark blue). Error bars correspond to 95% confidence intervals of the correlation coefficient. The gray area corresponds to the statistical significance thresholds. Panels B, C, D and E show the values of 4 selected variables in each geographical community against its percentage of unemployment. The size of the points is proportional to the population in each geographical community. Solid lines correspond to linear fits to the data.

[Figure 4. Scatter panels: Predicted unemployment against Observed unemployment for Age < 25 ($R^2 = 0.62$) and 25 < Age < 44 ($R^2 = 0.52$). Bar panel: Weight of Penetration rate; Entropy 1 (geo); Entropy 1 (social); Activity (morning); Misspellers rate; Employment tweets.]

Figure 4. A) and B) Performance of the model, showing the predicted unemployment rate versus the observed one for ages below 25 ($R^2 = 0.62$) and for ages between 25 and 44. Dashed lines correspond to the equality line and ±20% error. C) Percentage of weight for each of the variables in the regression model, using the relative weight of the absolute values of the coefficients in the regression model (see SI section 10). Variables marked with * are not statistically significant in the model.

This work serves as a proof of concept for how a wide range of behavioral features linked to socioeconomic behavior can be inferred from the digital traces that are left by publicly-available social media. In particular, we demonstrate that behavioral features related to unemployment can be recovered from the digital exhaust left by the microblogging network Twitter. First of all, Twitter geolocalized traces together with off-the-shelf community detection algorithms render an optimal partition of a country for economical activity, showing the remarkable power of social media to understand and unveil economical behavior at a country scale. This insight is likely to apply to other administrative definitions in other countries, especially when considering large cities with an inherent dynamical nature and evolution of mobility fluxes, and cities composed of small satellite cities with arbitrary agglomerations or divisions among them (e.g. London, NYC, Singapore). This result is unsurprising: it should be natural to recompute city clusters/communities of activity based on their real-time mobility, which may vary considerably faster than the update rates of mobility and travel surveys [31–33].
Our main result demonstrates that several key indicators (different penetration rates among regions, fingerprints of the temporal patterns of activity, lexical correctness of content, and geo-social connectivities among regions) can be extracted from social media and then used to infer unemployment levels. These findings shed light in two directions: first, on how individuals' extensive use of their social channels allows us to characterize cities based on their activity in a meaningful fashion, and second, on how this information can be used to build economic indicators that are directly related to the economy. Regarding the latter, our work is important for understanding how country-scale analysis of social media should consider not only the demographic but also the economical differences between users. As we have shown, users in areas with large unemployment have different mobility, different social interactions, and different daily activity than those in low unemployment areas. This intertwined relationship between user behavior and employment should be considered not only in economical analysis derived from social media, but also in other applications like marketing, communication, social mobilization, etc.
It is particularly remarkable that Twitter data can provide these accurate results. Among the many currently popular social networking platforms, Twitter is perhaps the noisiest, sparsest and most 'sabotaged' medium: very few users send out messages at a regular rate, most users do not have geolocated information, the social relationships (followers/followees) contain a lot of unused/unimportant links, the platform is plagued by spam-bots and, last but not least, we have no way to identify the motive/goal/functionality of the mobility fluxes we are able to extract. These limitations are not particular to our sample but general to the samples of Twitter data being employed in the computational social science community. Despite all these caveats, we are able to show that even some simple filtering techniques together with basic statistical regressions yield predictive power about a variable as important as unemployment. Other social media platforms such as Facebook, Google+, Sina Weibo, Instagram, Orkut, or Flickr, with more granular and consistent individual data, are likely to provide similar or better results by themselves or in combination. Further improvements can be obtained by the use of more sophisticated statistical machine learning techniques, some of them even tailored to the peculiarities of social media data. Our work serves to illustrate the tremendous potential of these new digital datasets to improve the understanding of society's functioning at the finest scales of granularity.
The usefulness of our approach must be considered against the cost and update rate of performing detailed surveys of mobility, social structure, and economic performance. Our database is publicly articulated, which means that our analysis could be replicated easily in other countries, other time periods and with different scopes. Naturally, surveys provide more accurate results, but they also consume considerably more financial and human resources, employing hundreds of people and taking months, even years, to complete and be released; they are so costly that countries going through economic recession have considered discontinuing them or altering their update rate in recent times. A particularly problematic aspect of these surveys is that they are "out-of-sync", i.e. a census may be up to date whereas those same individuals' travel surveys may not be, and therefore drawing inferences between both may be particularly difficult. This is a challenging problem that the immediacy of social media can help ameliorate.
A few questions remain open for further investigation. How can traditional surveys and social media digital traces be best combined to maximize their predictive ability? Can social media provide a reliable leading indicator for unemployment and, in general, economic surveys? How much reliable lead time is possible, if any? As we have found, Twitter penetration and educational levels are correlated with unemployment, but these levels are unlikely to change rapidly enough to describe or anticipate changes in the economy or unemployment. However, other indicators like daily activity, social interactions and geographical mobility are more connected with our daily activity, and perhaps they have more predictive power to show and/or anticipate sudden changes in employment. The relationship between unemployment and individual and group behavior may help contextualize the multiple factors affecting the socioeconomic well-being of a region: while penetration, content, daily activity, and mobility diversity seem to be highly correlated with unemployment in Spain, different weights for each group of traces might be expected in other countries [14]. Finally, digital traces could serve as an alternative (sometimes the only one available) to the lack of surveys in poor or remote areas [20, 27]. Another interesting avenue of research involves the use of social media to detect mismatches between the real (hidden, underground) economy and the officially reported one [38].
Most importantly, the immediacy of social media may also allow governments to better measure and understand the effect of policies, social changes, and natural or man-made disasters on the economical status of cities in almost real time [18, 39]. These new avenues for research provide great opportunities at the intersection of the economic, social, and computational sciences that originate from these new, widespread, inexpensive datasets.
Acknowledgments
We would like to thank Kristina Lerman, Lada Adamic, James Fowler, Daniel Villatoro and Ricardo Herranz for stimulating discussions, and Yuri Kryvasheyeu and Thomas Bochynek for their critical reading of the manuscript. This work was partially supported by Spanish Ministry of Science and Technology Grant FIS2013-47532-C3-3-P (to A.L., M.G.H. and E.M.). Manuel Cebrian is funded by the Australian Government as represented by the Department of Broadband, Communications and Digital Economy and the Australian Research Council through the ICT Centre of Excellence program.
References
[1] Becker, G. S. (1976) The economic approach to human behavior. (University of Chicago Press).
[2] Granovetter, M. (1985) Economic action and social structure: the problem of embeddedness. American Journal of Sociology, pp. 481–510.
[3] Camerer, C. F., Loewenstein, G. & Rabin, M. (2011) Advances in behavioral economics. (Princeton University Press).
[4] Glaeser, E. L., Kallal, H. D., Scheinkman, J. A. & Shleifer, A. (1991) Growth in cities. (National Bureau of Economic Research), Technical report.
[5] Bettencourt, L. M., Lobo, J., Helbing, D., Kuhnert, C. & West, G. B. (2007) Growth, innovation, scaling, and the pace of life in cities. Proceedings of the National Academy of Sciences 104, 7301–7306.
[6] Batty, M. (2008) The size, scale, and shape of cities. Science 319, 769–771.
[7] Milgram, S. (1974) The experience of living in cities. Crowding and behavior 167, 41.
[8] Pan, W., Ghoshal, G., Krumme, C., Cebrian, M. & Pentland, A. (2013) Urban characteristics attributable to density-driven tie formation. Nature Communications 4.
[9] Gonzalez, M. C., Hidalgo, C. A. & Barabasi, A.-L. (2008) Understanding individual human mobility patterns. Nature 453, 779–782.
[10] Calabrese, F., Diao, M., Lorenzo, G. D., Jr, J. F. & Ratti, C. (2013) Understanding individual mobility patterns from urban sensing data: A mobile phone trace example. Transportation Research Part C: Emerging Technologies 26, 301–313.
[11] Cheng, Z., Caverlee, J., Lee, K. & Sui, D. Z. (2011) Exploring Millions of Footprints in Location Sharing Services. (AAAI, Menlo Park, CA, USA).
[12] Cho, E., Myers, S. A. & Leskovec, J. (2011) Friendship and mobility: user movement in location-based social networks. KDD '11 (ACM, New York, NY, USA), pp. 1082–1090.
[13] Sun, L., Jin, J. G., Axhausen, K. W., Lee, D.-H. & Cebrian, M. (2014) Quantifying long-term evolution of intra-urban spatial interactions. arXiv preprint arXiv:1407.0145.
[14] Eagle, N., Macy, M. & Claxton, R. (2010) Network diversity and economic development. Science 328, 1029–1031.
[15] Henrich, J., Boyd, R., Bowles, S., Camerer, C., Fehr, E., Gintis, H. & McElreath, R. (2001) In search of homo economicus: behavioral experiments in 15 small-scale societies. American Economic Review, pp. 73–78.
[16] Krieger, N., Williams, D. R. & Moss, N. E. (1997) Measuring social class in US public health research: concepts, methodologies, and guidelines. Annual Review of Public Health 18, 341–378.
[17] Groves, R. M., Fowler Jr, F. J., Couper, M. P., Lepkowski, J. M., Singer, E. & Tourangeau, R. (2013) Survey methodology. (John Wiley & Sons).
[18] Lazer, D., Pentland, A. S., Adamic, L., Aral, S., Barabasi, A. L., Brewer, D., Christakis, N., Contractor, N., Fowler, J., Gutmann, M. et al. (2009) Life in the network: the coming age of computational social science. Science (New York, N.Y.) 323, 721.
[19] Smith, C., Quercia, D. & Capra, L. (2013) Finger on the pulse: identifying deprivation using transit flow analysis. (ACM), pp. 683–692.
[20] Soto, V., Frias-Martinez, V., Virseda, J. & Frias-Martinez, E. (2011) Prediction of socioeconomic levels using cell phone records. UMAP'11 (Springer-Verlag, Berlin, Heidelberg), pp. 377–388.
[21] Preis, T., Moat, H. S., Stanley, H. E. & Bishop, S. R. (2012) Quantifying the advantage of looking forward. Scientific Reports 2.
[22] Antenucci, D., Cafarella, M., Levenstein, M. C., Re, C. & Shapiro, M. D. (2014) Using social media to measure labor market flows. (National Bureau of Economic Research), Technical report.
[23] Hawelka, B., Sitko, I., Beinat, E., Sobolevsky, S., Kazakopoulos, P. & Ratti, C. (2013) Geo-located twitter as the proxy for global mobility patterns. arXiv preprint arXiv:1311.0680.
[24] Lathia, N., Quercia, D. & Crowcroft, J. (2012) in Pervasive computing. (Springer), pp. 91–98.
[25] Frias-Martinez, V., Virseda, J. & Frias-Martinez, E. (2010) Socio-economic levels and human mobility.
[26] Gutierrez, T., Krings, G. & Blondel, V. D. (2013) Evaluating socio-economic state of a country analyzing airtime credit and mobile phone datasets. arXiv preprint arXiv:1309.4496.
[27] Smith, C., Mashhadi, A. & Capra, L. (2013) Ubiquitous sensing for mapping poverty in developing countries. Paper submitted to the Orange D4D Challenge.
[28] Erlander, S. & Stewart, N. F. (1990) The gravity model in transportation analysis: theory and extensions. (VSP) Vol. 3.
[29] Simini, F., Gonzalez, M. C., Maritan, A. & Barabasi, A.-L. (2012) A universal model for mobility and migration patterns. Nature 484, 96–100.
[30] Lenormand, M., Picornell, M., Cantu-Ros, O. G., Tugores, A., Louail, T., Herranz, R., Barthelemy, M., Frias-Martinez, E. & Ramasco, J. J. (2014) Cross-checking different sources of mobility information. arXiv preprint arXiv:1404.0333.
[31] Barthelemy, M. (2011) Spatial networks. Physics Reports 499, 1–101.
[32] Expert, P., Evans, T. S., Blondel, V. D. & Lambiotte, R. (2011) Uncovering space-independent communities in spatial networks. Proceedings of the National Academy of Sciences 108, 7663–7668.
[33] Sobolevsky, S., Szell, M., Campari, R., Couronne, T., Smoreda, Z. & Ratti, C. (2013) Delineating geographical regions with networks of human interactions in an extensive set of countries. PLoS ONE 8, e81707.
[34] Rosvall, M. & Bergstrom, C. T. (2008) Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105, 1118–1123.
[35] Danon, L., Diaz-Guilera, A., Duch, J. & Arenas, A. (2005) Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment 2005, P09008.
[36] ADigital (2013) Uso de twitter en espana 2012. [Online; accessed 1-November-2014].
[37] Davenport, J. R. & DeLine, R. (2014) The readability of tweets and their geographic correlation with education. arXiv preprint arXiv:1401.6058.
[38] Schneider, F., Buehn, A. & Montenegro, C. E. (2011) Shadow economies all over the world: New estimates for 162 countries from 1999 to 2007. Handbook on the shadow economy, pp. 9–77.
[39] Rutherford, A., Cebrian, M., Dsouza, S., Moro, E., Pentland, A. & Rahwan, I. (2013) Limits of social mobilization. Proceedings of the National Academy of Sciences 110, 6281–6286.
Supporting Information for
Social media fingerprints of unemployment
Alejandro Llorente, Manuel García-Herranz, Manuel Cebrian and Esteban Moro
S1 The dataset
Twitter provides an extremely rich and publicly available data set of user interactions, information flows and, thanks to the geo-location of tweets, user movements. Nevertheless, the representativeness of geo-located Twitter as a global source of mobility data has so far received sparse attention. In this sense, while [13] present a promising and extensive study of global country-to-country movements (mostly driven by tourism), within-country human flows (comprising not only internal tourism but also, to a greater extent than country-to-country travel, visiting and commuting) still need further investigation. Therefore, throughout this work we compare our findings using geo-located Twitter with a similar study using commuting surveys.
For the Twitter analysis we consider almost 146 million geo-located Twitter messages (tweets), collected through the public API provided by Twitter for the continental part of Spain from 29th November 2012 to 10th April 2013. In this dataset we consider that there has been a trip from place l to place k if a user has tweeted in place l and place k consecutively. We only keep those transitions in which the first and the second tweet are dated on the same day. We filter the trips database to avoid unrealistic transitions and keep only trips with a geographical displacement larger than 1 km (see Methods section). By this method, 138 million trips from 167376 different users are considered in our work.
From those trips we construct the mobility flow $T_{ij}$ between municipalities, which measures the number of trips in our database whose origin is within the boundaries of city $i$ and whose destination lies within those of city $j$.
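The trip definition and the flow count $T_{ij}$ can be sketched as follows. This is a minimal illustration, not the pipeline used in this work: the tuple layout `(user, timestamp, lat, lon, city)`, the function names and the haversine distance are assumptions, and the additional filter for unrealistic transitions described in the Methods section is not reproduced.

```python
from collections import Counter
from datetime import datetime
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def extract_trips(tweets, min_km=1.0):
    """Return (origin_city, destination_city) pairs from consecutive tweets.

    `tweets` is an iterable of (user, timestamp, lat, lon, city) tuples
    (a hypothetical layout). A trip is kept when both tweets are by the
    same user, fall on the same calendar day, and are more than `min_km`
    apart.
    """
    by_user = {}
    for user, ts, lat, lon, city in tweets:
        by_user.setdefault(user, []).append((ts, lat, lon, city))

    trips = []
    for records in by_user.values():
        records.sort(key=lambda r: r[0])  # chronological order per user
        for (t1, la1, lo1, c1), (t2, la2, lo2, c2) in zip(records, records[1:]):
            if t1.date() != t2.date():
                continue  # tweets dated on different days
            if haversine_km(la1, lo1, la2, lo2) <= min_km:
                continue  # displacement too small to count as a trip
            trips.append((c1, c2))
    return trips


def mobility_flows(trips):
    """Count trips per ordered (origin, destination) pair, i.e. T_ij."""
    return Counter(trips)
```

Because the pairs are ordered, the resulting counts are not forced to be symmetric, which matches the way $T_{ij}$ is treated in the next section.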
We also consider population and economic information about the municipalities from the Spanish Census (2011) [8] and unemployment figures from the Public Service of Employment (Servicio Publico de Empleo Estatal, SEPE) [7]. In the latter case, registered unemployment (in number of persons) is given for each Spanish municipality by gender, age and month. To get unemployment rates we divide registered unemployment by the total workforce in the municipality, estimated as the number of people aged between 16 and 65 years.
S2 Twitter as mobility proxy
Considering all of the available transitions in our database, one can compute the distance between origin and destination, the elapsed time of the transition and the number of trips per user, among many other statistics. All of them seem to follow a power-law distribution with a cutoff due to the finite spatial size of Spain and to the constraint of considering only transitions whose origin and destination check-ins occur on the same day. Focusing on the log-linear part of the distributions, self-similar behaviors arise when Twitter-based mobility is analyzed (see figure 5).
Twitter-based inter-city flows can be well modelled by means of the Gravity Law, which is one of the most widespread methods to represent human mobility [1, 19], with applications in many fields like urban planning [23], traffic engineering [4] or transportation problems [9]. The Gravity Law is also the solution to the problem of maximizing the entropy of the particle distribution among all the possible trips using statistical mechanics techniques [2, 22]. Recently it has also been used as a model for human mobility based on cell phone traces [10, 20, 21] and social media data at a global scale [13] and at the inter-city level [14].
The Gravity Model for human mobility assumes that the flows between cities can be explained by the expression
$$T^{\mathrm{grav}}_{ij} = \frac{P_i^{\alpha_1} P_j^{\alpha_2}}{d_{ij}^{\beta}} \qquad (2)$$
where $T^{\mathrm{grav}}_{ij}$ is the flow, in number of people, between cities $i$ and $j$, $d_{ij}$ is the geographical distance between them, and $P_i$ and $P_j$ are the populations of the two cities.
Given the data, we can obtain the parameters of the model by weighted least squares minimization:
$$\alpha_1^{*}, \alpha_2^{*}, \beta^{*} = \arg\min_{\alpha_1, \alpha_2, \beta} \; \frac{1}{N} \sum_{i,j} w_{ij} \left( T_{ij} - T^{\mathrm{grav}}_{ij} \right)^{2} \qquad (3)$$
where $N$ is the total number of connections in the mobility graph and $w_{ij}$ is a weight proportional to the number of observed transitions between $i$ and $j$. In particular, we find that taking wi j = T 13 i j gives the best performance in the model.
In our case, this model fits the inter-city mobility based on Twitter GPS check-ins quite accurately (see table 1). Even though we do not require $T_{ij}$ to be symmetric, the exponents of the two populations are similar, indicating that we observe similar flows in both directions between $i$ and $j$.
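As an illustration of how the exponents in equation (2) can be estimated, the sketch below fits a log-linearized version of the gravity model by weighted least squares. This is a simplification and not the procedure of equation (3), which minimizes the squared error of the flows themselves; the function names, the dictionary-based inputs and the default unit weights are assumptions made for the example.

```python
from math import log


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_gravity(flows, population, distance, weight=lambda t: 1.0):
    """Estimate (alpha1, alpha2, beta) of T_ij = P_i^a1 P_j^a2 / d_ij^beta.

    flows: dict {(i, j): T_ij > 0}; population: dict {city: P};
    distance: dict {(i, j): d_ij > 0}; weight: maps T_ij to w_ij.
    The fit is done on log T_ij, so it only approximates equation (3).
    """
    xtx = [[0.0] * 3 for _ in range(3)]
    xty = [0.0] * 3
    for (i, j), t in flows.items():
        # Regressors of log T_ij; the minus sign makes the third coefficient beta.
        x = (log(population[i]), log(population[j]), -log(distance[(i, j)]))
        y = log(t)
        w = weight(t)
        for r in range(3):
            xty[r] += w * x[r] * y
            for c in range(3):
                xtx[r][c] += w * x[r] * x[c]
    return tuple(solve_linear_system(xtx, xty))
```

The `weight` argument is left as a parameter so that any choice of $w_{ij}$ as a function of the observed flow can be plugged in.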
S3 Community structures in inter-city mobility graph
Figure 5. Probability distributions for the different properties of daily trips in the Twitter dataset: trip distance (km), number of trips and elapsed time (secs). Dashed lines correspond to power-law fits with exponents −1.67, −2.43 and −0.62, respectively.

Typically, complex networks exhibit community structure, that is, there are subsets of nodes that are more densely connected among themselves than to the rest of the nodes. In mobility networks, whose nodes correspond to geographical areas, these communities are interpreted as zones with high common activity and tend to be constrained by geographical and political barriers. We check whether this is also observed in our dataset by running six state-of-the-art community detection algorithms: FastGreedy [5], Walktrap [16], Infomap [18], MultiLevel [3], Label Propagation [17] and Leading Eigenvector [15]. These six algorithms yield different community structures in terms of number of communities, average community size or modularity (see table 3). Members (municipalities) of the resulting communities are spatially connected, except for a few cases, as figure 7 shows. We test the statistical robustness of the obtained communities by randomly removing a proportion p of the original links and running the algorithms on this new graph $G_p$. We consider communities to be robust when those obtained for the original network $G$ and for $G_p$ are highly similar. In order to compare two arbitrary memberships to communities we use the Normalized Mutual Information (NMI) method described in [6], which returns 0 when two memberships are totally different and 1 when we compare two equal memberships. We compute the NMI for each chosen algorithm run on $G$ and $G_p$ for p between 1% and 10%, concluding that the obtained community structures are robust because they are not broken when some randomly chosen links are removed (see table 2).
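The NMI score used in the robustness test can be computed directly from two membership vectors. The sketch below assumes the common form of the measure, twice the mutual information divided by the sum of the two partition entropies; the function name and the list-based input are illustrative choices, and the community detection step itself is not shown.

```python
from collections import Counter
from math import log


def normalized_mutual_information(labels_a, labels_b):
    """NMI between two community assignments of the same nodes.

    labels_a[k] and labels_b[k] are the communities of node k under each
    assignment. Returns 1.0 for identical partitions (up to relabelling)
    and 0.0 for statistically independent ones.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both assignments must cover the same nodes")
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))

    # Mutual information from the confusion matrix of the two partitions.
    mutual = sum(
        (n_ab / n) * log(n_ab * n / (count_a[a] * count_b[b]))
        for (a, b), n_ab in joint.items()
    )
    entropy_a = -sum((c / n) * log(c / n) for c in count_a.values())
    entropy_b = -sum((c / n) * log(c / n) for c in count_b.values())
    if entropy_a + entropy_b == 0.0:
        return 1.0  # both partitions put every node in a single community
    return 2.0 * mutual / (entropy_a + entropy_b)
```

In a robustness check of the kind described above, `labels_a` would hold the communities found on $G$ and `labels_b` those found on $G_p$, with the score averaged over several random link removals.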
Figure 6. Penetration rate as a function of population for both cities and detected communities.
As other works have shown, mobility graph communities are usually interpreted in terms of geographical and political barriers, and a natural question is whether the mobility-based communities are related to any of these barriers. In Spain there are different territorial divisions for administration purposes. In this work we consider two of them: provinces, defined in the 1978 Constitution, are 48 different heterogeneous aggregations of municipalities; and counties (comarcas in Spanish terminology), which are traditional aggregations of municipalities mainly based on Spanish orography (rivers, valleys, ridges, etc.), some of which are composed of municipalities of different provinces. We again use the NMI method to compare the community structure given by the algorithms with the administrative limits. Except for the Leading Eigenvector algorithm, the methods return communities that are closely related to provinces (NMI ≈ 0.7), whereas for the county limits higher variability is observed. In this last case, the algorithm most closely related to county limits is Infomap (NMI ≈ 0.83). Therefore, Twitter-based mobility summarizes the inter-city flows, showing that these flows are influenced by geographical and political barriers.
S4 Twitter demographics and unemployment rates
Different age groups are not equally represented in Twitter. Recent surveys (2012) in Spain suggest that most (86%) Twitter users are 16 to 44 years old. Comparison of the percentage of users per age group with the total population within the same groups (see figure 8) reveals that age groups above 35 years old are under-represented in Twitter. Thus our Twitter data will be more revealing when trying to describe unemployment in age groups below 44 years old. This is indeed what we find when we build a linear model for the unemployment rate in different age groups with the same Twitter variables: while unemployment rates for ages below 24 can be fitted to a linear model with $R^2 = 0.62$, regression models for unemployment rates for ages between 25 and 44 have $R^2 = 0.52$, and for ages above 44 we get only $R^2 = 0.26$. Table 4 summarizes the results for the regression models of unemployment rates in each age group, showing that our Twitter variables have more explanatory power for ages below 44. Finally, in figure 8 we can see the performance of the model for the different age groups; once again, the poor explanatory power of the Twitter variables for the unemployment rate in ages above 44 years old is evident.
S5 Properties of Twitter variables
Normalization and distributions
Heterogeneity between the values of the variables constructed from Twitter is large but moderate, as the histograms in figure 9 show. We did not find any geographical area with anomalous values in any of the variables considered. Variables are normalized in different ways: both the penetration $\tau_i$ and the misspellers rate $\varepsilon_i$ are defined as the number of users or misspellers per 100,000 persons (population); activity variables $\nu_i$ are normalized as the percentage of tweets per time interval; finally, the numbers of tweets that mention a specific term, $\mu_i$, are also given per 100,000 tweets published in the geographical area.

Figure 7. From left to right and from top to bottom: FastGreedy, Walktrap, Infomap, MultiLevel, Label Propagation and Leading Eigenvector communities on Twitter-based mobility transitions.
Correlation between variables
Variables are constructed to reflect the behavior of areas in the different dimensions of Twitter: penetration, social or geographical diversity, activity through the day, and content. The correlations between variables do indeed show that variables within each dimension are strongly correlated with each other. As we can see in figure 10, social and geographical diversities are highly correlated, an expected fact given that the gravity law accurately describes not only the flows of people between geographical areas but also the amount of communication between them. The same behavior is found for the variables in the activity group, while content variables are less correlated. Finally, we find that both the penetration rate $\tau_i$ and the fraction of misspellers $\varepsilon_i$ have a strong correlation with most of the variables.
High correlation between variables might lead to collinearity effects [24] in the linear regression models, that is, some variables with predictive value might have non-significant weights because they explain the same part of the variance. For instance, in Table 5 the misspellers rate has a very strong predictive value but its p-value is too high to consider it significant. To test this hypothesis, we perform a principal component analysis (PCA) on the independent variables of the regression. Figure 10 shows the loadings of the different variables. The block structure shown in figure 10 results in similar directions of the variables in the first components of the PCA. We observe some groups of variables: on the one hand, geographical and social diversity seem to explain a large part of the variance; on the other hand, we find a perpendicular group of variables formed by temporal activity; finally, penetration rate and misspellers fraction seem to represent a different, independent direction of the data with high collinearity between them. This might explain the low statistical significance in the models of section 4. In any case, the structure of the correlation matrix and the PCA results show that there is indeed information in all groups of variables, and thus we take one variable from each of them for our regression models.
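A simple way to screen for the collinearity discussed above is to list the pairs of variables whose correlation is high before fitting the regression. The following sketch computes Pearson correlations and flags such pairs; the threshold of 0.8 and the variable names in the usage note are assumptions for illustration only, and the PCA step is not reproduced here.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def collinear_pairs(variables, threshold=0.8):
    """Return variable pairs whose absolute correlation exceeds `threshold`.

    `variables` maps a variable name to its values across geographical
    areas. The threshold is an illustrative choice, not a value from the text.
    """
    pairs = []
    for name_a, name_b in combinations(sorted(variables), 2):
        r = pearson(variables[name_a], variables[name_b])
        if abs(r) > threshold:
            pairs.append((name_a, name_b, r))
    return pairs
```

For example, calling `collinear_pairs` on a dictionary with hypothetical keys such as `"geo_entropy"` and `"social_entropy"` would report them together if their values move in step across areas, which suggests keeping only one of the two in the model.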
S6 Misspellers detection
Figure 9. Frequency plots for each variable constructed from Twitter. Panels: tweets (unemployment), tweets (employment), tweets (job), penetration rate, entropy 1 and entropy 2 (social), entropy 1 and entropy 2 (geo), misspellers rate, and activity (night, morning, afternoon).

Figure 8. Top: Percentage of population in each age group from the Spanish Census (dark bars) and surveys about users in Twitter (light bars). Bottom: performance of the linear models (predicted against observed unemployment) for each of the age groups: all ages ($R^2 = 0.47$), below 24 ($R^2 = 0.62$), 25-44 ($R^2 = 0.52$) and above 44 ($R^2 = 0.26$).

In this work we consider only tweets in Spanish. Since several languages coexist in Spain depending on the part of the country, the first step is to reduce our Twitter dataset to those tweets that are written in Spanish. This task is carried out using the n-gram based text categorization R library textcat [11]. Then, in order to decide whether a tweet contains a misspelling or not, we need to establish which patterns to look for in our set of tweets. Since we want to be sure that a detected mistake corresponds to a real misspeller, we do not consider the following cases:

• Lack of written accents. People tend to omit accents when writing colloquially.
• Mistakes derived from removing unnecessary letters. The most common cases are removing an h at the beginning of a word (in Spanish the letter h is not pronounced) or replacing the letters qu by k. We understand that these mistakes can be motivated by the length limit of tweets and not by a real misspelling.
• In the same line, we neglect mistakes produced by removing letters in the middle of a word whose pronunciation can be deduced without them.
• We do not consider either mistakes related to features of specific areas in Spain. For example, in the south the pronunciation of ce and se is the same, which produces a large number of mistakes when writing. However, since we want to draw objective and equitable conclusions over the whole Spanish geography, we neglect those misspellings that only appear in a specific area.

Likewise, we consider as real misspellings the following mistakes:

• Adding letters. For example, writing an h at the beginning of a word that starts with a vowel.
• Replacing the special cases mp, mb by the wrong spellings np, nb.
• Mixing up b with v, g with j, ll with y, and ex with es. These are typical mistakes in Spanish because the pairs have the same or a very close pronunciation.
• Confusing the verb haber with the periphrasis a ver.
• Separating a word into two, for instance writing the word conmigo as con migo.

This way, our list of misspellings is composed of 617 common mistakes in Spanish that cannot be attributed to the special features of Twitter or to a specific region of Spain. Thus, one can expect that this selection provides an accurate and equitable method of detecting misspellers. Under these conditions, the number of users who wrote at least one misspelled word is 27055 (56 over the whole population).
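The list-based detection described in this section can be sketched as a lookup of each word against a set of known misspellings. The three entries below are hypothetical stand-ins chosen to follow the stated rules (the actual 617-entry list is not reproduced in the text), the function names are illustrative, and the language filtering step with textcat is omitted.

```python
import re

# Hypothetical stand-in for the 617-entry list, which the text does not
# reproduce. Each entry follows one of the rules that count as misspellings.
MISSPELLINGS = {
    "tanbien",  # "también" written with nb instead of mb
    "hiba",     # "iba" with an added leading h
    "havia",    # "había" with v instead of b
}

WORD_RE = re.compile(r"[^\W\d_]+")  # runs of letters, including accented ones
ACCENT_MAP = str.maketrans("áéíóúü", "aeiouu")


def normalize(word):
    """Lower-case a word and drop accents, since missing accents are ignored."""
    return word.lower().translate(ACCENT_MAP)


def misspeller_rate(tweets, population, per=100_000):
    """Users with at least one listed misspelling per `per` inhabitants.

    `tweets` is an iterable of (user, text) pairs for one geographical area.
    """
    misspellers = set()
    for user, text in tweets:
        words = (normalize(w) for w in WORD_RE.findall(text))
        if any(w in MISSPELLINGS for w in words):
            misspellers.add(user)
    return len(misspellers) * per / population
```

Matching whole words against a fixed list keeps the method conservative: abbreviations, dropped accents and regional spellings are never flagged unless they were explicitly added to the list. Multi-word patterns such as a ver for haber would need phrase-level matching, which this sketch does not attempt.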
Figure 10. Top: Correlation matrix between the variables constructed from Twitter. Each entry in the matrix is depicted as a circle whose size is proportional to the correlation between variables and whose color is blue/red for positive/negative correlations. Blank entries correspond to statistically insignificant correlations with 95% confidence. Bottom: Projection of the variables on the first two principal components given by PCA. We observe different groups of variables and collinearity between some of them.

We analyze whether misspellers use Twitter differently from people who do not make serious mistakes when publishing a tweet. Comparing the average number of tweets, we observe that misspellers tend to publish a larger number of tweets than those who did not make mistakes (14471 against 2372). This also emerges when the mean number of misspellings given the total number of tweets is considered. For users with fewer than approximately 30 published tweets in the observation period, the number of misspellings is almost zero, whereas for users who publish more often the mean number of misspellings scales sub-linearly with the number of tweets (exponent ≈ 0.33).
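A scaling exponent of this kind is commonly estimated as the slope of a straight-line fit in log-log space. The sketch below shows that calculation under the assumption of a pure power law; the function name is illustrative, and the fitting range and binning used for the reported exponent are not specified in the text.

```python
from math import log


def scaling_exponent(xs, ys):
    """Least-squares slope of log(y) against log(x), i.e. b in y ~ a * x**b."""
    lx = [log(x) for x in xs]
    ly = [log(y) for y in ys]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    sxx = sum((a - mean_x) ** 2 for a in lx)
    return sxy / sxx
```

Here `xs` would be the number of tweets per user (or per bin of users) and `ys` the corresponding mean number of misspellings; a slope below 1 indicates sub-linear growth.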
Since we have observed a segmentation of the Twitter population based on how accurately users write, we consider the misspeller rate as a proxy for the educational level of the cities. A large number of previous works in the literature have revealed the relationship between the economic status and the educational level of geographical areas, and it is therefore natural to ask whether the observed misspellers rate is related to the economy, as measured by the unemployment rate. To test this hypothesis we consider cities with more than 5000 inhabitants to avoid undersampled cases. We find a strong positive correlation between the probability of finding a misspeller in a city and the unemployment rate (0.372, 0.491).
Figure 11. Number (red) and probability (blue) of observed misspellings given the number of tweets.
Figure 12. Explanatory power ($R^2$) of the linear regression model when fitted against the unemployment data for different months, from October 2012 to June 2014. The gray (orange) area corresponds to the time window in which Twitter data is collected and variables are constructed.
S7 Time window and unemployment
In the definition of the variables we have aggregated the Twitter activity within a 7-month time window (from December 2012 to June 2013). Since unemployment varies significantly over time, we investigate here the correlation and explanatory power of the Twitter variables for the values of unemployment at different months within the time window in which Twitter data was collected, and whether the variables collected in that time window are more correlated with past or future values of unemployment. Figure 12 shows the explanatory power of the model when the linear regression is done for values of unemployment of different months before, during and after the Twitter data time window. Although there is a small seasonal effect along the year, we see that the explanatory power remains around $R^2 = 0.6$, which suggests that our Twitter linear model retains its explanatory power even though unemployment changes considerably throughout the year. It is interesting to note that $R^2$ decays slightly during the summer, which means that our variables are less correlated with summer unemployment. Finally, the unemployment used in the main article is from June 2013, i.e. the last month in the time window used to collect the data.
S8 Demographics does not explain unemployment
Since unemployment rates are very high among young people, a natural question is whether demographic variables alone could explain the heterogeneity of youth unemployment rates found across the geographical areas. To test this, we have built four linear models: the first one (named Youth model in Table 5) has the rate of young population as the only explanatory variable; the second and third ones are built on the Twitter variables only, either all those considered in the main text (named Twitter model (I)) or just those whose regression coefficients are statistically significant (Twitter model (II)); the fourth one is fitted with all the variables (named All variables model in Table 5). Table 5 shows the summary of the regression for each model. Focusing on the variance explained by each model in terms of $R^2$, considering all Twitter variables is three times more explanatory than considering only the proportion of young people. On the other hand, comparing $R^2$ for the Twitter model with those for the All variables and Youth models shows that the rate of young population does not add significant explanatory power. This semi-partial analysis shows that our Twitter variables retain a high explanatory power when the effect of the young population rate is controlled for.
S9 Unemployment models for other geographical areas
While municipalities are very heterogeneous demographically, other administrative areas exist in Spain at larger scales that could be used for our model of unemployment. As mentioned in section 4, the smallest administrative division of Spain we have considered is that of the 8200 municipalities. At larger scales we have the 326 counties (comarcas in Spanish), which are aggregations of municipalities. Finally, the largest geographical scale we considered is defined by the 50 provinces (provincias in Spanish). In this section we compare the performance of our Twitter model for unemployment with the variables defined on those administrative areas, and relate it to the geographical communities detected and used in the main paper (see section 4). Not all the areas at the different administrative divisions are considered in the model. To minimize the effect of areas in which the number of geo-tagged tweets is very small, we only consider the 1738 municipalities which have a Twitter population $\pi > 10$. Similarly, we only consider the 198 counties with $\pi > 100$. As we can see in Table 6, the model has a large explanatory power for areas equal to or bigger than counties. As expected, $R^2$ increases as the number of areas in the model gets smaller, but the description level of the model is very low for provinces, for example. The best performance (high $R^2$ and high geographical description level) is attained at the level of the detected communities.
S10 Relative importance of the variables
To asses the relative importance of the variables in the unemploymentmodel we have used several methods They all give qualitatively thesame results with some variations for the statistically insignificantvariables Specifically we have use
1 (weight) Relative weight of the absolute values of the coef-ficients obtained in the linear regression when variables arescaled to have mean zero and variance one
2 (lmg) averaging over orderings proposed by LindemanMerenda and Gold
Social media fingerprints of unemployment mdash 1619
0
10
20
30
40
emp fmiss manana rtwpen sio siosocialnames
values
indabscoefffirstlmgpmvd
microemp mrng SuSu
Rela
tive
impo
rtanc
e (
)
0
10
20
30
40
emp fmiss manana rtwpen sio siosocialnames
values
indabscoefffirstlmgpmvd
weight first lmg pvmd
Figure 13 Relative importance of the variables (in per-centage) in the unemployment model for different ways tocalculate it
3 (pmvd) The PMVD metric introduced by Feldman whichan average over orderings as well but with data-dependentweights
4 (first) The univariate R2-values from regression models withone variable only
All these metrics are obtained using the relaimpo R package [12]. The results for the young unemployment model are shown in figure 13, where we can see that the different methods yield similar relative importance of the variables, except perhaps for the diversity of mobility flows, a variable with a non-significant weight in the regression model.
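The two simplest metrics in the list above can be sketched in a few lines of Python. This is only an illustration and not the relaimpo implementation: the function names, the dictionary layout and the toy numbers are assumptions, and the standardized coefficients are taken as already fitted. The lmg and pmvd metrics need averages over regressor orderings and are not reproduced here.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def weight_metric(std_coefs):
    """'weight': percentage share of each |standardized coefficient|."""
    total = sum(abs(b) for b in std_coefs.values())
    return {name: 100.0 * abs(b) / total for name, b in std_coefs.items()}


def first_metric(columns, y):
    """'first': univariate R^2 of each regressor, which for a one-variable
    regression with intercept equals the squared Pearson correlation."""
    return {name: pearson(x, y) ** 2 for name, x in columns.items()}


# Toy values (hypothetical, not taken from the paper).
toy_coefs = {"penetration": 2.0, "morning": -1.0, "misspellers": 1.0}
toy_weights = weight_metric(toy_coefs)

toy_y = [2.0, 4.0, 6.0, 8.0]
toy_first = first_metric({"linear": [1.0, 2.0, 3.0, 4.0],
                          "noisy": [1.0, 3.0, 2.0, 4.0]}, toy_y)
```

Because the weight metric only looks at coefficient magnitudes, it is meaningful only when all regressors have been scaled to the same variance, as stated in item 1.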
References
[1] B. Ashtakala. Generalized power model for trip distribution. Transportation Research Part B: Methodological, 21(1):59–67, 1987.
[2] Michel Bierlaire. Mathematical models for transportation demand analysis. Transportation Research Part A: Policy and Practice, 31(1):86–86, 1997.
[3] Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.
[4] Harry J. Casey Jr. The law of retail gravitation applied to traffic engineering. Traffic Quarterly, 9(3), 1955.
[5] Aaron Clauset, Mark E. J. Newman, and Cristopher Moore. Finding community structure in very large networks. Physical Review E, 70(6):066111, 2004.
[6] Leon Danon, Albert Díaz-Guilera, Jordi Duch, and Alex Arenas. Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment, 2005(09):P09008, 2005.
[7] Servicio Público de Empleo Estatal (SEPE). Spanish registered unemployment. http://www.sepe.es/contenidos/que_es_el_sepe/estadisticas/index.html
[8] Instituto Nacional de Estadística. Spanish 2011 census. http://www.ine.es/censos2011_datos/cen11_datos_inicio.htm
[9] Suzanne P. Evans. A relationship between the gravity model for trip distribution and the transportation problem in linear programming. Transportation Research, 7(1):39–61, 1973.
[10] Paul Expert, Tim S. Evans, Vincent D. Blondel, and Renaud Lambiotte. Uncovering space-independent communities in spatial networks. Proceedings of the National Academy of Sciences, 108(19):7663–7668, 2011.
[11] Ingo Feinerer, Christian Buchta, Wilhelm Geiger, Johannes Rauch, Patrick Mair, and Kurt Hornik. The textcat package for n-gram based text categorization in R. Journal of Statistical Software, 52(6):1–17, 2013.
[12] Ulrike Grömping. Relative importance for linear regression in R: the package relaimpo. Journal of Statistical Software, 17(1):1–27, 2006.
[13] Bartosz Hawelka, Izabela Sitko, Euro Beinat, Stanislav Sobolevsky, Pavlos Kazakopoulos, and Carlo Ratti. Geo-located Twitter as the proxy for global mobility patterns. arXiv preprint arXiv:1311.0680, 2013.
[14] Yu Liu, Zhengwei Sui, Chaogui Kang, and Yong Gao. Uncovering patterns of inter-urban trips and spatial interactions from check-in data. arXiv preprint arXiv:1310.0282, 2013.
[15] Mark E. J. Newman. Finding community structure in networks using the eigenvectors of matrices. Physical Review E, 74(3):036104, 2006.
[16] Pascal Pons and Matthieu Latapy. Computing communities in large networks using random walks. In Computer and Information Sciences - ISCIS 2005, pages 284–293. Springer, 2005.
[17] Usha Nandini Raghavan, Réka Albert, and Soundar Kumara. Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 76(3):036106, 2007.
[18] Martin Rosvall and Carl T. Bergstrom. Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences, 105(4):1118–1123, 2008.
[19] Morton Schneider. Gravity models and trip distribution theory. Papers in Regional Science, 5(1):51–56, 1959.
[20] Filippo Simini, Marta C. González, Amos Maritan, and Albert-László Barabási. A universal model for mobility and migration patterns. Nature, 484(7392):96–100, 2012.
[21] Chaoming Song, Tal Koren, Pu Wang, and Albert-László Barabási. Modelling the scaling properties of human mobility. Nature Physics, 6(10):818–823, 2010.
[22] Alan Geoffrey Wilson. Entropy in urban and regional modelling. Pion Ltd, 1970.
[23] Alan Geoffrey Wilson. Urban and regional models in geography and planning. 1974.
[24] Svante Wold, Arnold Ruhe, Herman Wold, and W. J. Dunn III. The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses. SIAM Journal on Scientific and Statistical Computing, 5(3):735–743, 1984.
Gravity Model
Parameter   Description                               Spain
α1          Origin exponent                           0.477*** (0.002)
α2          Destination exponent                      0.478*** (0.002)
β           Distance exponent                         1.05*** (0.0035)
R²          Goodness of fit                           0.797
φ           Correlation between T_ij and T^grav_ij    0.826
Table 1. Description of the parameters for the Gravity Law Model in geo-tagged social media data for Spain. (***) means significance p < 0.0001.
NMI between G and Gp for different p
Algorithm   p=0.01  0.02   0.03   0.04   0.05   0.06   0.07   0.08   0.09   0.1
FG          0.995   0.992  0.989  0.983  0.981  0.977  0.983  0.969  0.980  0.959
WT          0.954   0.959  0.950  0.954  0.945  0.948  0.947  0.935  0.926  0.931
IM          0.988   0.981  0.980  0.981  0.978  0.974  0.975  0.970  0.969  0.966
ML          0.994   0.978  0.979  0.983  0.948  0.934  0.972  0.952  0.973  0.947
LP          0.906   0.908  0.911  0.915  0.895  0.907  0.907  0.893  0.905  0.904
LE          0.960   0.957  0.956  0.859  0.910  0.892  0.908  0.858  0.885  0.884
Table 2. NMI measure comparing G and Gp.
Communities Stats
Algorithm   〈|Ni|〉i    max|Ni|   |Ni|   Modularity   NMI P   NMI C
FG          309.696    1385      23     0.726        0.712   0.590
WT          9.262      433       769    0.417        0.744   0.757
IM          21.011     143       339    0.758        0.770   0.831
ML          323.772    1132      22     0.800        0.717   0.599
LP          22.052     750       323    0.732        0.749   0.761
LE          1017.571   5344      7      0.381        0.264   0.205
Table 3. Statistics of the communities Ni returned by the six algorithms. NMI P refers to the comparison between communities and provinces, whereas NMI C considers counties instead of provinces.
                          All ages    < 24        25-44       > 44
(Intercept)               0.11***     0.10***     0.20***     0.20***
                          (0.02)      (0.03)      (0.03)      (0.035)
Penetration rate          3.23*       8.57***     6.28**      2.40
                          (1.41)      (2.22)      (2.17)      (2.77)
Geographical diversity    0.03        0.15***     0.08*       0.06
                          (0.02)      (0.04)      (0.04)      (0.05)
Social diversity          -0.03*      -0.03       -0.05*      -0.06*
                          (0.01)      (0.02)      (0.02)      (0.03)
Morning activity          -0.69*      -1.30**     -1.53***    -1.19*
                          (0.26)      (0.42)      (0.41)      (0.52)
Misspellers rate          11.56       31.51*      15.46       23.60
                          (8.13)      (12.78)     (12.48)     (15.94)
Employment mentions       -1.80       3.17        -9.94       2.71
                          (6.27)      (9.86)      (9.64)      (12.3)
R²                        0.47        0.64        0.55        0.29
Adj. R²                   0.44        0.62        0.52        0.26
***p < 0.001, **p < 0.01, *p < 0.05
Table 4. Regression table for the different models in which unemployment for different age groups is fitted. The All ages model is the fit to the general rate of unemployment in each geographical area, while the other models are for the rates of unemployment in the groups of less than 24 years, between 25 and 44 years, and above 44 years.
                          All variables   Youth model   Twitter model (I)   Twitter model (II)
(Intercept)               0.06            -0.02         0.10***             0.09***
                          (0.03)          (0.03)        (0.03)              (0.027)
Young pop. rate           0.66*           2.20***
                          (0.30)          (0.35)
Penetration rate          8.20***                       8.57***             8.62***
                          (2.25)                        (2.22)              (2.21)
Geographical diversity    0.14***                       0.15***             0.12***
                          (0.04)                        (0.04)              (0.03)
Social diversity          -0.02                         -0.03
                          (0.02)                        (0.02)
Morning activity          -1.42***                      -1.30**             -1.28**
                          (0.41)                        (0.42)              (0.41)
Misspellers rate          23.95                         31.51*              32.28*
                          (13.09)                       (12.78)             (12.71)
Employment mentions       0.34                          3.17
                          (9.81)                        (9.86)
R²                        0.65            0.24          0.64                0.63
Adj. R²                   0.63            0.24          0.62                0.62
***p < 0.001, **p < 0.01, *p < 0.05
Table 5. Regression table for the different statistical models. The All variables model includes both the Twitter variables and the rate of young population. Twitter model (I) includes only the variables described in the main article, while Twitter model (II) only includes those variables which are significant (p < 0.05) in Twitter model (I).
                          Communities   Municipalities   Counties    Provinces
(Intercept)               0.10***       0.16***          0.11***     0.11*
                          (0.03)        (0.01)           (0.03)      (0.05)
Penetration rate          8.57***       4.01***          9.12***     10.47***
                          (2.22)        (0.59)           (1.81)      (1.97)
Geographical diversity    0.15***       0.02             0.12***     0.08
                          (0.04)        (0.01)           (0.03)      (0.07)
Social diversity          -0.03         -0.01            -0.01       -0.03
                          (0.02)        (0.01)           (0.02)      (0.07)
Morning activity          -1.30**       -1.16***         -1.49***    -1.03
                          (0.42)        (0.14)           (0.39)      (0.88)
Misspellers rate          31.51*        14.40***         14.09
                          (12.78)       (2.51)           (10.02)
Employment mentions       3.17          -0.71            2.41        -3.17
                          (9.86)        (0.89)           (8.86)      (12.29)
Number of points          128           1738             198         50
R²                        0.64          0.22             0.55        0.65
Adj. R²                   0.62          0.21             0.54        0.61
***p < 0.001, **p < 0.01, *p < 0.05
Table 6. Regression table for the unemployment linear regression model at different levels of geographical areas. In the Provinces model the misspellers rate has been removed from the model due to its large collinearity with the penetration rate.
- 1 Social media dataset and functional partition of cities
- 2 Social media behavioral fingerprints
- 3 Explanatory power of social media in unemployment
- 4 Discussion
Figure 1. A) Map of the mobility fluxes $T_{ij}$ between municipalities, based on Twitter inferred trips (white). Infomap communities detected on the network $T_{ij}$ are colored under the mobility fluxes (blue colors). B) Mobility fluxes $T_{ij}$ between municipalities i and j are constructed by aggregating the number of trips between them. C) Correspondence between the observed fluxes $T_{ij}$ and the fitted gravity model fluxes. The dashed line is $T_{ij} = T^{\mathrm{grav}}_{ij}$, while the (blue) solid line is a conditional average of $T^{\mathrm{grav}}_{ij}$ for fixed values of $T_{ij}$.
1 Social media dataset and functional partition of cities
Twitter is a microblogging online application where users can express their opinions, share content and receive information from other users in text messages of up to 140 characters, commonly known as tweets. Users can interact with other users by mentioning them or retweeting their content (sharing someone's tweet with your followers). Some of these tweets contain information about the geographical location where the user was when the tweet was published; we refer to them as geo-located tweets.
To perform our analysis we consider 19.6 million geo-located Twitter messages (tweets), collected through the public API provided by Twitter from continental Spain, ranging from 29th November 2012 to 30th June 2013. Tweets were posted by 0.57 million (properly anonymized) unique users and geo-positioned in 7683 different municipalities. We observed a large correlation (Pearson's coefficient ρ = 0.951 [0.949, 0.953]) between the number of geo-positioned tweets per municipality and the municipality's population. On average we find around 50 tweets per month and per 1000 persons in each municipality.
Despite this high level of social media activity within municipalities, we find their official administrative areas not suitable to study socio-economical activity: administrative boundaries between municipalities reflect political and historical decisions, while economical trade and activity often happen across those boundaries. The result is that municipalities in Spain are artificially diverse, ranging from a municipality with only 7 inhabitants to another with a population of 3.2 million. Although there exist natural aggregations of municipalities in provinces (regions) or statistical/metropolitan areas (NUTS areas), we have used our own procedure to detect economical areas. In particular, we have used users' daily trips between pairs of municipalities as a measure of the economic relatedness between said municipalities. We say that there is a daily trip between municipalities i and j if a user has tweeted in places i and j consecutively within the same day. In our dataset we find 1.9 million trips by 0.22 million users. With those trips we construct the daily mobility flux network $T_{ij}$ between municipalities as the number of trips between places i and j (see figure 1B). Remarkably, the statistical properties of trips and of the mobility matrix $T_{ij}$ coincide with those of other mobility datasets (see SI section 2): for example, trip distance $r$ and elapsed time $\delta t$ are power-law distributed, $P(r) \sim r^{-1.67}$ and $P(\delta t) \sim \delta t^{-0.62}$, with exponents very similar to those found in the literature [9, 23]. And the mobility fluxes $T_{ij}$ are well described by the Gravity Law (R² = 0.80) [28]
$$T_{ij} \simeq T^{\mathrm{grav}}_{ij} = \frac{P_i^{\alpha_i}\, P_j^{\alpha_j}}{d_{ij}^{\beta}} \qquad (1)$$
where $P_i$ and $P_j$ are the populations of municipalities i and j, and $d_{ij}$ is the distance between them. Similarly, the exponents in (1) are very similar to those reported in other works, $\alpha_i \simeq \alpha_j = 0.48$ and $\beta \simeq 1.05$ [23, 29]. These results suggest that the mobility detected from geo-located tweets is a good proxy of human mobility within and between municipalities [30].
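One common way to estimate the exponents of a gravity law of the form (1) is ordinary least squares on the logarithms of the fluxes. The sketch below illustrates that idea in plain Python; it is not necessarily the fitting procedure used for the results above. The function names, the added constant prefactor and the synthetic data are assumptions made for the example.

```python
import math


def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_gravity(flows):
    """Least-squares fit of log T = log C + a_i log P_i + a_j log P_j - b log d.

    flows: list of (T_ij, P_i, P_j, d_ij) tuples with strictly positive entries.
    The prefactor C is an assumption of this sketch; it does not appear in (1).
    Returns (log_C, alpha_i, alpha_j, beta).
    """
    rows = [[1.0, math.log(pi), math.log(pj), -math.log(d)]
            for _, pi, pj, d in flows]
    y = [math.log(t) for t, _, _, _ in flows]
    k = 4
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    return tuple(solve_linear(xtx, xty))


# Synthetic, noise-free fluxes generated from known exponents (illustrative).
pops = [1000.0, 5000.0, 20000.0, 80000.0, 300000.0]
synthetic = []
for a, p_a in enumerate(pops):
    for b, p_b in enumerate(pops):
        if a != b:
            dist = 5.0 + 40.0 * ((3 * a + 7 * b) % 11)  # made-up distances
            synthetic.append((p_a ** 0.48 * p_b ** 0.48 / dist ** 1.05,
                              p_a, p_b, dist))

log_c, alpha_i, alpha_j, beta = fit_gravity(synthetic)
```

A log-linear fit silently drops pairs with zero observed trips, so on real data the estimates depend on how such pairs are handled.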
We use the network of daily fluxes between municipalities $T_{ij}$ to detect the geographical communities of economical activity. To this end we employ standard partition techniques of the mobility network $T_{ij}$ using graph community finding algorithms. This technique has been applied extensively, especially with mobile phone data, to unveil the effective maps of countries based on mobility and/or social interactions of people [31–33]. In our case we have used the Infomap algorithm [34] and found 340 different communities within Spain. For further details about the comparison among different state-of-the-art community detection algorithms executed on the inter-city graph, see SI section 3. The average number of municipalities per community is 21, and the largest community contains 142 municipalities. The communities detected have very interesting features (see SI section 3): (i) they are geographically cohesive (see figure 1); (ii) they are statistically robust against random removal of trips in our database (SI table S2); and (iii) the modularity of the partition is very high (≈ 0.76, see SI table S3). Finally, (iv) the partition found has some overlap (77% of Normalized Mutual Information, NMI, see [35]) with coarser administrative boundaries like provinces (regions) (see SI section 3 for details). But, interestingly, it shows a larger overlap (83% of NMI) with comarcas (counties), areas in Spain that reflect geographical and economical relations between municipalities. This result shows that the mobility detected from geo-located tweets and the communities obtained are a good description of economical areas.
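The overlap figures quoted above compare two partitions of the same set of municipalities. A minimal sketch of a normalized mutual information of that kind is given below, using the widespread normalization 2 I(A;B) / (H(A) + H(B)). The function name and the toy labels are assumptions for illustration, and the exact variant used for the reported values should be taken from [35].

```python
import math
from collections import Counter


def nmi(labels_a, labels_b):
    """Normalized mutual information between two labelings of the same items.

    labels_a[k] and labels_b[k] are the groups (for example detected community
    and province) that item k belongs to in each partition.
    """
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))
    mutual = sum(
        (n_ab / n) * math.log(n_ab * n / (count_a[a] * count_b[b]))
        for (a, b), n_ab in count_ab.items()
    )
    h_a = -sum((c / n) * math.log(c / n) for c in count_a.values())
    h_b = -sum((c / n) * math.log(c / n) for c in count_b.values())
    if h_a + h_b == 0.0:
        return 1.0  # both partitions are a single group: treated as identical
    return 2.0 * mutual / (h_a + h_b)


# Toy partitions of six municipalities (hypothetical labels).
communities = ["c1", "c1", "c2", "c2", "c3", "c3"]
same_up_to_names = ["x", "x", "y", "y", "z", "z"]
coarser = ["p1", "p1", "p1", "p1", "p2", "p2"]

identical_score = nmi(communities, same_up_to_names)
coarser_score = nmi(communities, coarser)
```

The score does not depend on how groups are named, only on which items are grouped together, which is why it can compare algorithmic communities with administrative units.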
In the rest of the paper we restrict our analysis to the geographical areas defined by the Infomap detected communities (see figure 1). For statistical reasons we discard communities which are not formed by at least 5 municipalities. Despite this sampling, 96% of the total country population is considered in our analysis. Our results in the rest of the paper also hold for municipalities, counties or provinces, though with lower statistical power (see SI section 9).
2 Social media behavioral fingerprints
The goal of this work is to quantify how and what behavioral features can be extracted from social media and then related back to the economical level of cities. To this end we define four groups of measures that have been widely explored in other fields like economy or social sciences. These four types of measures rely on the identification of the place where users live. Instead of using information in the user profile, we analyze the places where the user has tweeted, and we set as hometown of the user the municipality where he/she has tweeted with the highest frequency, a method usually employed in mobile phone and social media studies [11, 23]. To this end we select those users with more than 5 geo-located tweets in our period who have posted at least 40% of their tweets in a given municipality, which we will consider their hometown. After this filtering we end up with 0.32 million users, and we can then define the Twitter population $\pi_i$ in area i as the number of users with their hometown within area i. We obtain a very high correlation between $\pi_i$ and the population of the cities $P_i$ in the national census, ρ = 0.977 [0.976, 0.978], which provides an indirect validation of our approach with the present data. However, not all demographic groups are equally represented in our Twitter database. As shown in SI section 4, Twitter user demographics in Spain obtained from surveys [36] show that age groups above 44 years old are under-represented. Thus our results would mainly describe the socio-economical status of people below 44 years old. The unemployment analysis is then performed for different age groups: people below 25 years old, between 25 and 44 years old, and older than 44 years old. Finally, we have chosen the unemployment reported officially at the end of our observation time window (June 2013), but our results are not affected by the month selected, see SI Section 7.
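The hometown rule described above (more than 5 geo-located tweets, at least 40% of them in the most frequent municipality) can be sketched as follows. The function names, the input layout and the tie-breaking behaviour are assumptions of this illustration, not details taken from the paper.

```python
from collections import Counter, defaultdict


def assign_hometowns(tweets, min_tweets=6, min_share=0.4):
    """Map each qualifying user to the municipality where they tweet most.

    tweets: iterable of (user_id, municipality) pairs, one per geo-located tweet.
    min_tweets=6 encodes 'more than 5 tweets'; min_share=0.4 encodes the 40% rule.
    Ties between municipalities are broken by first appearance (an assumption).
    """
    per_user = defaultdict(Counter)
    for user, municipality in tweets:
        per_user[user][municipality] += 1

    homes = {}
    for user, counts in per_user.items():
        total = sum(counts.values())
        municipality, top = counts.most_common(1)[0]
        if total >= min_tweets and top / total >= min_share:
            homes[user] = municipality
    return homes


def twitter_population(homes):
    """Number of users whose hometown is each area (the quantity pi_i)."""
    return Counter(homes.values())


# Toy input (hypothetical users and places).
toy_tweets = (
    [("u1", "A")] * 4 + [("u1", "B")] * 2      # 6 tweets, 67% in A: kept
    + [("u2", "A")] * 3                         # only 3 tweets: dropped
    + [("u3", m) for m in "ABCDEF"]             # 6 tweets, 17% each: dropped
)
toy_homes = assign_hometowns(toy_tweets)
toy_population = twitter_population(toy_homes)
```

Dividing the resulting counts by census population per area would give the penetration rate used later in the text.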
Figure 2. Examples of different behaviour in the observed variables and the unemployment. In A we observe that two cities with different unemployment levels have different temporal activity patterns. Figure C shows how communities (red) with distinct entropy levels of social communication with other communities (blue) may hold different unemployment intensity: the left map shows a highly focused communication pattern (low entropy), while the right map corresponds to a community with a diverse communication pattern (high entropy). Finally, figure B shows some examples of detected misspellings in our database using 618 incorrect expressions (see SI Section 6), such as "Con migo", "Aver" or "llendo".
For every considered region we investigate the officially reported unemployment for different age groups and a number of metrics related to social media activity. Some of those metrics are already reported in the literature, but some others are introduced in this work. Specifically, we consider:
• Social media technology adoption: we can use the Twitter penetration rate $\tau_i = \pi_i / P_i$ in each area i as a proxy of technology adoption. Recent works have shown that there is indeed a correlation between country GDP and Twitter penetration: specifically, a positive correlation was found between $\tau_i$ and GDP at the country level [23]. However, in our data we find the opposite correlation (see figure 3), namely that the larger the penetration rate, the bigger the unemployment, which suggests that the impact of technology adoption at country scale is different from what happens within an (industrialized) country, where technology to access social media is commoditized.
• Social media activity: regions with very different economical situations should exhibit different patterns of activity during the day. Since working, leisure, family, shopping, etc. activities happen at different times of the day, we might observe different daily patterns in regions with different socio-economical status. For example, we hypothesize that communities with low levels of unemployment will tend to have higher activity levels at the beginning of a typical weekday. This is indeed what we find: figure 2A shows the hourly fraction of tweets during workdays of two communities with very different rates of unemployment. As we can observe, both profiles are quite different, and in the case of low unemployment we find a strong peak of activity between 8 and 11am (morning) and lower periods of activity during the afternoons and nights. We encode this finding in $\nu^{\mathrm{mrng}}_i$, $\nu^{\mathrm{aftn}}_i$ and $\nu^{\mathrm{ngt}}_i$, the total fraction of tweets happening in geographical area i between 8am and 10am, 3pm and 5pm, and 12am and 3am, respectively. Figure 3 shows a strong negative correlation between $\nu^{\mathrm{mrng}}_i$ and the unemployment for the communities in our database, and a positive correlation with $\nu^{\mathrm{aftn}}_i$ and $\nu^{\mathrm{ngt}}_i$.
• Social media content: some works have observed a correlation between the frequency of words related to work conditions [22] or of forward-looking searches [21] and the economical situation of countries. In our case we also find that there is a moderate positive correlation between the fraction of tweets $\mu_i$ mentioning job or unemployment terms and the observed unemployment, while the correlation is negative for the number of mentions of employment or the economy. However, we have tried a different approach by measuring the relation between the way of writing and the educational level [37]. To this end we build a list of 618 misspelled Spanish expressions and extract the tweets of the dataset containing at least one of these expressions (see SI section 6 for further details about how these expressions were collected). We only consider tweets in Spanish, detected with an n-gram based algorithm. Then we only consider misspellings that cannot be justified as abbreviations. Finally, we compute for every region the proportion $\varepsilon_i$ of misspellers among the Twitter population. If the fraction of misspellers per geographical area is a proxy for the educational level of that region, we expect a positive correlation between $\varepsilon_i$ and unemployment. Indeed, we find (see figure 3) that there is a strong correlation between the fraction of misspellers and unemployment.
• Social media interactions and geographical flow diversity: following the ideas in [14], which correlated the economical development of an area with the diversity of its communications with other areas, we consider all tweets mentioning another user and take them as a proxy for communication between users. Then we compute the number of communications $w_{ij}$ between areas i and j as the number of mentions between users in those areas. To measure the diversity we use, as in [14], the informational normalized entropy (Entropy 1) $S^u_i = -\sum_j p_{ij} \log p_{ij} / S^r_i$, where $p_{ij} = w_{ij} / \sum_j w_{ij}$, and (Entropy 2) $S^r_i = \log k_i$, with $k_i$ the number of different areas with which users in area i have interacted. As in [14], we find that areas with large unemployment have less diverse communication patterns than areas with low unemployment. This translates into a strong negative correlation between $S_i$ and the unemployment, see figure 3. Similar ideas are applied to the flows of people between areas, to investigate the diversity of the geographical flows through the entropy $S_i = -\sum_j p_{ij} \log p_{ij} / S^r_i$, where $p_{ij} = T_{ij} / \sum_j T_{ij}$ and $S^r_i = \log(k_i)$, with $k_i$ the number of different areas which have been visited by users that live in area i. Figure 3 shows that, as in [19], the correlation of these geographical entropies with economical development is low.
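The normalized entropy used for both the social and the geographical diversity in the list above can be computed directly from the outgoing weights of an area. The sketch below assumes a dictionary of weights per destination area (mentions $w_{ij}$ or trips $T_{ij}$); the function name and the convention of returning 0 when fewer than two areas are reached are assumptions of this example.

```python
import math


def normalized_entropy(weights):
    """Normalized entropy -sum_j p_ij log p_ij / log k_i for one origin area i.

    weights: dict mapping destination area j to a non-negative weight
    (number of mentions or of trips from area i to area j).
    Returns a value in [0, 1]; 0.0 is returned when fewer than two areas
    have positive weight, because log k_i would be zero (assumed convention).
    """
    positive = [w for w in weights.values() if w > 0]
    k = len(positive)
    if k < 2:
        return 0.0
    total = sum(positive)
    entropy = -sum((w / total) * math.log(w / total) for w in positive)
    return entropy / math.log(k)


# Toy areas (hypothetical counts): one diverse, one highly focused.
diverse_area = {"a": 25, "b": 25, "c": 25, "d": 25}
focused_area = {"a": 97, "b": 1, "c": 1, "d": 1}

diverse_score = normalized_entropy(diverse_area)
focused_score = normalized_entropy(focused_area)
```

Dividing by log k_i makes areas with different numbers of contacted areas comparable, at the cost of hiding k_i itself, which is why the text also reports $S^r_i$ separately as Entropy 2.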
Normalization of variables is discussed in SI section 5. We have also studied the correlation between the variables considered. As expected, variables in each group show moderate correlations between them. However, the inspection of the correlation matrix and a Principal Component Analysis of the variables considered show that there is information (as percentage of variance in the data) in each of the groups of variables, see SI section 5. Because of these two facts, we restrict our analysis to the variables within each group with the highest correlation with the unemployment, namely the penetration rate $\tau_i$, the social and mobility diversity variables $S^u_i$, the morning activity $\nu^{\mathrm{mrng}}_i$, the fraction of misspellers $\varepsilon_i$ and the fraction of employment-related tweets $\mu^{\mathrm{emp}}_i$.
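The selection step described above, keeping within each group the variable most correlated with unemployment, can be sketched as follows. The sketch ranks variables by the absolute value of the Pearson correlation, which is an assumption about how "highest correlation" is meant; the function names and all numbers are illustrative.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def strongest_per_group(groups, target):
    """For each group, pick the variable with the largest |correlation| to target.

    groups: {group_name: {variable_name: list_of_values_per_area}}
    target: list of unemployment rates, one per area, in the same order.
    """
    chosen = {}
    for group, variables in groups.items():
        chosen[group] = max(
            variables,
            key=lambda name: abs(pearson(variables[name], target)),
        )
    return chosen


# Toy data for five areas (hypothetical values).
toy_unemployment = [10.0, 15.0, 20.0, 25.0, 30.0]
toy_groups = {
    "activity": {
        "morning": [0.30, 0.26, 0.22, 0.17, 0.15],
        "night": [0.05, 0.07, 0.06, 0.08, 0.07],
    },
    "content": {
        "misspellers": [0.01, 0.02, 0.03, 0.04, 0.05],
        "job": [0.2, 0.1, 0.3, 0.1, 0.2],
    },
}
toy_selection = strongest_per_group(toy_groups, toy_unemployment)
```

Keeping a single variable per group reduces collinearity among regressors, at the price of discarding whatever extra signal the other variables of the group carry.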
3 Explanatory power of social media in unemployment
The four previous groups of variables are fingerprints of human behavior reflected in Twitter usage habits. As we observed in figure 3, all of them exhibit statistically strong correlations with unemployment. The question we address in this section is whether those variables suffice to explain the observed unemployment (their explanatory power), and also to determine the most important ones among them (which give more explanatory power than others). Note that we are not stating a causality arrow between the measures built in the previous section and the unemployment rate, but only exploring whether they can be used as alternative indicators with a real translation in the economy.
Figure 4 shows the result of a simple linear regression model for the observed unemployment for ages below 25 years as a function of the variables which have more correlation with the unemployment. The model has a significant R² = 0.62, showing that there is a large explanatory power of the unemployment encoded in the behavioral variables extracted from Twitter. However, not all the variables weigh equally in the model: specifically, the penetration rate, geographical diversity, morning activity and fraction of misspellers account for up to 92% of the explained variance, while social diversity and number of employment-related tweets are not statistically significant (see SI section 10 for the methods used to determine the relative importance of the variables). It is interesting to note that, while social diversity obtained from mobile phone communications was a key variable in the explanation of deprivation indexes in [14, 19], the communication diversity of Twitter users seems to have a minor role in the explanation of the heterogeneity of unemployment in Spain.
Similar explanatory power is found for other age groups: R² = 0.44 for all ages and R² = 0.52 for ages between 25 and 44 years. However, the model degrades for ages above 44 years (R² = 0.26), showing that our variables mainly describe the behavior of the most represented age groups in Twitter, namely those below 44 years old. On the other hand, since our Twitter variables seem to describe the behavior of young people, we have investigated whether Twitter-constructed variables have similar explanatory value (in terms of R²) to simple census demographic variables for young people. However, regression models including the young population rate yield only a minor improvement, R² = 0.65, while the young population rate alone gives R² = 0.24, a result which shows that Twitter variables do indeed possess a genuine explanatory power beyond their simple demographic representation. Finally, our model has the largest explanatory power for detected communities, but large R² are also found for other geographical areas like counties (R² = 0.54) and provinces (R² = 0.65), see SI section 9.
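When comparing models with different numbers of areas and regressors, the adjusted R² is the more conservative figure. As a rough consistency check, the standard adjustment formula can be applied to the rounded values in SI Table 6; the function name below is an assumption, and because the inputs are rounded to two decimals the result can only be expected to agree approximately with the table.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Standard adjusted R^2: 1 - (1 - R^2) (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Detected communities: 128 areas, six Twitter regressors, R^2 = 0.64.
communities_adj = adjusted_r2(0.64, 128, 6)

# Provinces: 50 areas, five regressors (misspellers rate removed), R^2 = 0.65.
provinces_adj = adjusted_r2(0.65, 50, 5)
```

With only 50 provinces the penalty for each extra regressor is much larger than with 128 communities, which is one reason a high raw R² at the province level is less informative.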
4 Discussion
Figure 3. A) Correlation coefficient of all the extracted Twitter metrics with unemployment, grouped by technology adoption (black), geographical diversity (orange), social diversity (light blue), temporal activity (green) and content analysis (dark blue). The metrics shown are: penetration rate; Entropy 1 and Entropy 2 (geo); Entropy 1 and Entropy 2 (social); activity (morning, afternoon, night); misspellers rate; and job, employment, unemployment and economy tweets. Error bars correspond to 95% confidence intervals of the correlation coefficient. The gray area corresponds to the statistical significance thresholds. Panels B, C, D and E show the values of 4 selected variables (penetration rate, Entropy 1 (social), morning activity and misspellers rate) in each geographical community against its percentage of unemployment. The size of the points is proportional to the population in each geographical community. Solid lines correspond to linear fits to the data.
Figure 4. A) and B) Performance of the model, showing the predicted unemployment rate versus the observed one for ages below 25 (R² = 0.62) and for ages between 25 and 44 (R² = 0.52). Dashed lines correspond to the equality line and ±20% error. C) Percentage of weight for each of the variables in the regression model, using the relative weight of the absolute values of the coefficients in the regression model (see SI section 10). Variables marked with * are not statistically significant in the model.
This work serves as a proof of concept for how a wide range of behavioral features linked to socioeconomic behavior can be inferred from the digital traces that are left by publicly-available social media. In particular, we demonstrate that behavioral features related to unemployment can be recovered from the digital exhaust left by the microblogging network Twitter. First of all, Twitter geolocalized traces together with off-the-shelf community detection algorithms render an optimal partition of a country for economical activity, showing the remarkable power of social media to understand and unveil economical behavior at a country scale. This insight is likely to apply to other administrative definitions in other countries, especially when considering large cities with an inherently dynamical nature and evolution of mobility fluxes, and cities composed of small satellite cities with arbitrary agglomerations or divisions among them (e.g. London, NYC, Singapore). This result is unsurprising: it should be natural to recompute city clusters/communities of activity based on their real-time mobility, which may vary considerably faster than the update rates of mobility and travel surveys [31–33].
Our main result demonstrates that several key indicators (different penetration rates among regions, fingerprints of the temporal patterns of activity, content lexical correctness and geo-social connectivities among regions) can be extracted from social media and then used to infer unemployment levels. These findings shed light in two directions: first, on how individuals' extensive use of their social channels allows us to characterize cities based on their activity in a meaningful fashion, and second, on how this information can be used to build indicators that are directly related to the economy. Regarding the latter, our work is important for understanding how country-scale analysis of social media should consider not only the demographic but also the economical differences between users. As we have shown, users in areas with large unemployment have different mobility, different social interactions and different daily activity than those in low-unemployment areas. This intertwined relationship between user behavior and employment should be considered not only in economical analysis derived from social media, but also in other applications like marketing, communication, social mobilization, etc.
It is particularly remarkable that Twitter data can provide these accurate results. Among the many currently popular social networking platforms, Twitter is perhaps the noisiest, sparsest and most 'sabotaged' medium: very few users send out messages at a regular rate; most of the users do not have geolocated information; the social relationships (followers/followees) contain a lot of unused/unimportant links; it is plagued by spam-bots; and, last but not least, we have no way to identify the motive/goal/functionality of the mobility fluxes we are able to extract. These limitations are not particular to our sample, but general to the sample Twitter data being employed in the computational social science community. Despite all these caveats, we are able to show that even some simple filtering techniques together with basic statistical regressions yield predictive power about a variable as important as unemployment. Other social media platforms such as Facebook, Google+, Sina Weibo, Instagram, Orkut or Flickr, with more granular and consistent individual data, are likely to provide similar or better results, by themselves or in combination. Further improvements can be obtained by the use of more sophisticated statistical machine learning techniques, some of them even tailored to the peculiarities of social media data. Our work serves to illustrate the tremendous potential of these new digital datasets to improve the understanding of society's functioning at the finer scales of granularity.
The usefulness of our approach must be weighed against the cost and update rate of detailed surveys of mobility, social structure and economic performance. Our database is publicly articulated, which means that our analysis could be replicated easily in other countries, other time periods and with different scopes. Naturally, surveys provide more accurate results, but they also consume considerably more financial and human resources, employing hundreds of people and taking months, even years, to complete and be released. They are so costly that countries going through economic recession have recently considered discontinuing them or altering their update rate. A particularly problematic aspect of these surveys is that they are "out-of-sync", i.e., the census may be up to date whereas those same individuals' travel surveys may not be, so drawing inferences between the two may be difficult. This is a challenging problem that the immediacy of social media can help ameliorate.
A few questions remain open for further investigation. How can traditional surveys and social media digital traces be best combined to maximize their predictive ability? Can social media provide a reliable leading indicator for unemployment and, in general, for economic surveys? How much reliable lead is possible, if any? As we have found, Twitter penetration and educational levels are correlated with unemployment, but these levels are unlikely to change rapidly enough to describe or anticipate changes in the economy or in unemployment. However, other indicators such as daily activity, social interactions and geographical mobility are more closely connected with our daily routines, and perhaps they have more power to reveal and/or anticipate sudden changes in employment. The relationship between unemployment and individual and group behavior may help contextualize the multiple factors affecting the socioeconomic well-being of a region: while penetration, content, daily activity and mobility diversity seem to be highly correlated with unemployment in Spain, different weights for each group of traces might be expected in other countries [14]. Finally, digital traces could serve as an alternative (sometimes the only one available) where surveys are lacking, as in poor or remote areas [20, 27]. Another interesting avenue of research involves the use of social media to detect mismatches between the real (hidden, underground) economy and the officially reported one [38].
Most importantly, the immediacy of social media may also allow governments to better measure and understand, in almost real time, the effect of policies, social changes, and natural or man-made disasters on the economic status of cities [18, 39]. These new avenues for research provide great opportunities at the intersection of the economic, social and computational sciences, opportunities that originate from these new, widespread and inexpensive datasets.
Acknowledgments
We would like to thank Kristina Lerman, Lada Adamic, James Fowler, Daniel Villatoro and Ricardo Herranz for stimulating discussions, and Yuri Kryvasheyeu and Thomas Bochynek for their critical reading of the manuscript. This work was partially supported by Spanish Ministry of Science and Technology Grant FIS2013-47532-C3-3-P (to A.L., M.G.H. and E.M.). Manuel Cebrian is funded by the Australian Government as represented by the Department of Broadband, Communications and Digital Economy and the Australian Research Council through the ICT Centre of Excellence program.
References
[1] Becker, G. S. (1976) The economic approach to human behavior (University of Chicago Press).
[2] Granovetter, M. (1985) Economic action and social structure: the problem of embeddedness. American Journal of Sociology, pp. 481–510.
[3] Camerer, C. F., Loewenstein, G., & Rabin, M. (2011) Advances in behavioral economics (Princeton University Press).
[4] Glaeser, E. L., Kallal, H. D., Scheinkman, J. A., & Shleifer, A. (1991) Growth in cities (National Bureau of Economic Research), Technical report.
[5] Bettencourt, L. M., Lobo, J., Helbing, D., Kuhnert, C., & West, G. B. (2007) Growth, innovation, scaling, and the pace of life in cities. Proceedings of the National Academy of Sciences 104, 7301–7306.
[6] Batty, M. (2008) The size, scale, and shape of cities. Science 319, 769–771.
[7] Milgram, S. (1974) The experience of living in cities. Crowding and behavior 167, 41.
[8] Pan, W., Ghoshal, G., Krumme, C., Cebrian, M., & Pentland, A. (2013) Urban characteristics attributable to density-driven tie formation. Nature Communications 4.
[9] Gonzalez, M. C., Hidalgo, C. A., & Barabasi, A.-L. (2008) Understanding individual human mobility patterns. Nature 453, 779–782.
[10] Calabrese, F., Diao, M., Lorenzo, G. D., Jr., J. F., & Ratti, C. (2013) Understanding individual mobility patterns from urban sensing data: A mobile phone trace example. Transportation Research Part C: Emerging Technologies 26, 301–313.
[11] Cheng, Z., Caverlee, J., Lee, K., & Sui, D. Z. (2011) Exploring Millions of Footprints in Location Sharing Services (AAAI, Menlo Park, CA, USA).
[12] Cho, E., Myers, S. A., & Leskovec, J. (2011) Friendship and mobility: user movement in location-based social networks. KDD '11 (ACM, New York, NY, USA), pp. 1082–1090.
[13] Sun, L., Jin, J. G., Axhausen, K. W., Lee, D.-H., & Cebrian, M. (2014) Quantifying long-term evolution of intra-urban spatial interactions. arXiv preprint arXiv:1407.0145.
[14] Eagle, N., Macy, M., & Claxton, R. (2010) Network diversity and economic development. Science 328, 1029–1031.
[15] Henrich, J., Boyd, R., Bowles, S., Camerer, C., Fehr, E., Gintis, H., & McElreath, R. (2001) In search of homo economicus: behavioral experiments in 15 small-scale societies. American Economic Review, pp. 73–78.
[16] Krieger, N., Williams, D. R., & Moss, N. E. (1997) Measuring social class in US public health research: concepts, methodologies, and guidelines. Annual Review of Public Health 18, 341–378.
[17] Groves, R. M., Fowler Jr., F. J., Couper, M. P., Lepkowski, J. M., Singer, E., & Tourangeau, R. (2013) Survey methodology (John Wiley & Sons).
[18] Lazer, D., Pentland, A. S., Adamic, L., Aral, S., Barabasi, A. L., Brewer, D., Christakis, N., Contractor, N., Fowler, J., Gutmann, M., et al. (2009) Life in the network: the coming age of computational social science. Science (New York, NY) 323, 721.
[19] Smith, C., Quercia, D., & Capra, L. (2013) Finger on the pulse: identifying deprivation using transit flow analysis (ACM), pp. 683–692.
[20] Soto, V., Frias-Martinez, V., Virseda, J., & Frias-Martinez, E. (2011) Prediction of socioeconomic levels using cell phone records. UMAP'11 (Springer-Verlag, Berlin, Heidelberg), pp. 377–388.
[21] Preis, T., Moat, H. S., Stanley, H. E., & Bishop, S. R. (2012) Quantifying the advantage of looking forward. Scientific Reports 2.
[22] Antenucci, D., Cafarella, M., Levenstein, M. C., Re, C., & Shapiro, M. D. (2014) Using social media to measure labor market flows (National Bureau of Economic Research), Technical report.
[23] Hawelka, B., Sitko, I., Beinat, E., Sobolevsky, S., Kazakopoulos, P., & Ratti, C. (2013) Geo-located twitter as the proxy for global mobility patterns. arXiv preprint arXiv:1311.0680.
[24] Lathia, N., Quercia, D., & Crowcroft, J. (2012) in Pervasive computing (Springer), pp. 91–98.
[25] Frias-Martinez, V., Virseda, J., & Frias-Martinez, E. (2010) Socio-economic levels and human mobility.
[26] Gutierrez, T., Krings, G., & Blondel, V. D. (2013) Evaluating socio-economic state of a country analyzing airtime credit and mobile phone datasets. arXiv preprint arXiv:1309.4496.
[27] Smith, C., Mashhadi, A., & Capra, L. (2013) Ubiquitous sensing for mapping poverty in developing countries. Paper submitted to the Orange D4D Challenge.
[28] Erlander, S. & Stewart, N. F. (1990) The gravity model in transportation analysis: theory and extensions (VSP), Vol. 3.
[29] Simini, F., Gonzalez, M. C., Maritan, A., & Barabasi, A.-L. (2012) A universal model for mobility and migration patterns. Nature 484, 96–100.
[30] Lenormand, M., Picornell, M., Cantu-Ros, O. G., Tugores, A., Louail, T., Herranz, R., Barthelemy, M., Frias-Martinez, E., & Ramasco, J. J. (2014) Cross-checking different sources of mobility information. arXiv preprint arXiv:1404.0333.
[31] Barthelemy, M. (2011) Spatial networks. Physics Reports 499, 1–101.
[32] Expert, P., Evans, T. S., Blondel, V. D., & Lambiotte, R. (2011) Uncovering space-independent communities in spatial networks. Proceedings of the National Academy of Sciences 108, 7663–7668.
[33] Sobolevsky, S., Szell, M., Campari, R., Couronne, T., Smoreda, Z., & Ratti, C. (2013) Delineating geographical regions with networks of human interactions in an extensive set of countries. PLoS ONE 8, e81707.
[34] Rosvall, M. & Bergstrom, C. T. (2008) Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105, 1118–1123.
[35] Danon, L., Diaz-Guilera, A., Duch, J., & Arenas, A. (2005) Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment 2005, P09008.
[36] ADigital (2013) Uso de Twitter en España 2012. [Online; accessed 1-November-2014].
[37] Davenport, J. R. & DeLine, R. (2014) The readability of tweets and their geographic correlation with education. arXiv preprint arXiv:1401.6058.
[38] Schneider, F., Buehn, A., & Montenegro, C. E. (2011) Shadow economies all over the world: New estimates for 162 countries from 1999 to 2007. Handbook on the shadow economy, pp. 9–77.
[39] Rutherford, A., Cebrian, M., Dsouza, S., Moro, E., Pentland, A., & Rahwan, I. (2013) Limits of social mobilization. Proceedings of the National Academy of Sciences 110, 6281–6286.
Supporting Information for
Social media fingerprints of unemployment
Alejandro Llorente, Manuel García-Herranz, Manuel Cebrian and Esteban Moro
S1 The dataset
Twitter provides an extremely rich and publicly available data set of user interactions, information flows and, thanks to the geo-location of tweets, user movements. Nevertheless, the representativeness of geo-located Twitter as a global source of mobility data has so far received little attention. While [13] presents a promising and extensive study of global country-to-country movements (mostly driven by tourism), within-country human flows (which comprise not only internal tourism but also, to a greater extent than country-to-country travel, visiting and commuting) still need further investigation. Therefore, throughout this work we compare our findings using geo-located Twitter with a similar study using commuting surveys.
For the Twitter analysis we consider almost 146 million geo-located Twitter messages (tweets) collected through the public API provided by Twitter for the continental part of Spain, from 29th November 2012 to 10th April 2013. In this dataset we consider that there has been a trip from place l to place k if a user has tweeted in place l and place k consecutively. We only keep those transitions in which the first and second tweets are dated on the same day. We filter the trips database to avoid unrealistic transitions and keep only trips with a geographical displacement larger than 1 km (see Methods section). By this method, 138 million trips from 167,376 different users are considered in our work.
From those trips we construct the mobility flow $T_{ij}$ between municipalities, which measures the number of trips in our database whose origin lies within the boundaries of city i and whose destination lies within those of city j.
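The trip definition and the flow $T_{ij}$ described above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used for the paper: the tuple layout `(user, timestamp, lat, lon, city)`, the function names and the use of the haversine distance are assumptions, and the additional filter against unrealistic transitions mentioned in the text is omitted.

```python
from collections import Counter
from datetime import datetime
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def extract_trips(tweets, min_km=1.0):
    """Return (user, origin_city, destination_city) for consecutive same-day
    tweets of one user that are more than `min_km` apart.

    `tweets` is an iterable of (user, timestamp, lat, lon, city) tuples;
    this layout is a hypothetical one chosen for the example.
    """
    by_user = {}
    for tweet in tweets:
        by_user.setdefault(tweet[0], []).append(tweet)

    trips = []
    for user, rows in by_user.items():
        rows.sort(key=lambda row: row[1])  # chronological order
        for first, second in zip(rows, rows[1:]):
            same_day = first[1].date() == second[1].date()
            far_enough = haversine_km(first[2], first[3], second[2], second[3]) > min_km
            if same_day and far_enough:
                trips.append((user, first[4], second[4]))
    return trips


def flow_matrix(trips):
    """Count trips per (origin, destination) pair: a sparse T_ij."""
    return Counter((origin, destination) for _, origin, destination in trips)


# Toy data: approximate coordinates, invented users (illustration only).
tweets = [
    ("u1", datetime(2013, 1, 5, 9, 0), 40.4168, -3.7038, "Madrid"),
    ("u1", datetime(2013, 1, 5, 18, 0), 40.3281, -3.7635, "Leganes"),
    ("u1", datetime(2013, 1, 6, 10, 0), 40.4168, -3.7038, "Madrid"),  # next day: no trip
    ("u2", datetime(2013, 1, 5, 9, 0), 40.4168, -3.7038, "Madrid"),
    ("u2", datetime(2013, 1, 5, 9, 30), 40.4170, -3.7038, "Madrid"),  # a few metres: no trip
]
trips = extract_trips(tweets)
flows = flow_matrix(trips)
```

Storing the flows as a `Counter` keyed by `(origin, destination)` keeps the matrix sparse, which suits a network in which most pairs of municipalities exchange no trips at all.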
We also consider population and economic information about the municipalities from the Spanish Census (2011) [8] and unemployment figures from the Public Service of Employment (Servicio Público de Empleo Estatal, SEPE) [7]. In the latter case, registered unemployment (in number of persons) is given for each Spanish municipality by gender, age and month. To get unemployment rates, we divide registered unemployment by the total workforce in the municipality, estimated as the number of people aged between 16 and 65 years.
S2 Twitter as mobility proxy
Considering all the available transitions in our database, one can compute the distance between origin and destination, the elapsed time of the transition and the number of trips per user, among many other statistics. All of them seem to follow a power-law distribution with a cutoff due to the finite spatial size of Spain and to the constraint of considering only transitions whose origin and destination check-ins occur on the same day. Focusing on the log-linear part of the distributions, self-similar behaviors arise when Twitter-based mobility is analyzed (see figure 5).
Twitter-based inter-city flows can be well modelled by means of the Gravity Law, one of the most widely used methods to represent human mobility [1, 19], with applications in many fields such as urban planning [23], traffic engineering [4] or transportation problems [9]. The Gravity Law is also the solution to the problem of maximizing the entropy of the particle distribution among all possible trips using statistical mechanics techniques [2, 22]. Recently it has also been used as a model for human mobility based on cell phone traces [10, 20, 21] and on social media data at a global scale [13] and at the inter-city level [14].
The Gravity Model for human mobility assumes that the flows between cities can be explained by the expression

$$T^{\mathrm{grav}}_{ij} = \frac{P_i^{\alpha_1} P_j^{\alpha_2}}{d_{ij}^{\beta}} \tag{2}$$

where $T^{\mathrm{grav}}_{ij}$ is the flow, in number of people, between cities i and j, $d_{ij}$ is the geographical distance between them, and $P_i$ and $P_j$ are the populations of the two cities.
Given the data, we can obtain the parameters of the model by weighted least squares minimization,

$$\alpha_1^{*}, \alpha_2^{*}, \beta^{*} = \underset{\alpha_1, \alpha_2, \beta}{\mathrm{argmin}} \; \frac{1}{N} \sum_{ij} w_{ij} \left( T_{ij} - T^{\mathrm{grav}}_{ij} \right)^2 \tag{3}$$

where N is the total number of connections in the mobility graph and $w_{ij}$ is a weight proportional to the number of observed transitions between i and j. In particular, we find that taking $w_{ij} = T_{ij}^{13}$ gives the best performance in the model.
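As a rough illustration of how the exponents of the gravity model can be estimated, the sketch below fits them by weighted least squares after taking logarithms. This is a simplification and not the objective of Eq. (3), which is written on the raw flows. The prefactor `k`, the function name and the synthetic data are assumptions, and the exponent applied to $T_{ij}$ in the weights is exposed as the parameter `weight_exponent` rather than fixed.

```python
import numpy as np


def fit_gravity_loglinear(T, P_i, P_j, d, weight_exponent=1.0):
    """Fit T = k * P_i**a1 * P_j**a2 / d**b by weighted least squares in log space.

    All inputs are 1-D arrays with one entry per origin-destination pair and
    strictly positive values (pairs with zero flow must be dropped beforehand).
    Weights are w = T**weight_exponent; both the prefactor k and the default
    exponent are assumptions of this sketch.
    """
    T, P_i, P_j, d = (np.asarray(v, dtype=float) for v in (T, P_i, P_j, d))
    X = np.column_stack([np.ones_like(T), np.log(P_i), np.log(P_j), -np.log(d)])
    y = np.log(T)
    sqrt_w = np.sqrt(T ** weight_exponent)
    # Scaling rows by sqrt(w) turns ordinary least squares into weighted least squares.
    coef, *_ = np.linalg.lstsq(X * sqrt_w[:, None], y * sqrt_w, rcond=None)
    log_k, alpha1, alpha2, beta = coef
    return float(np.exp(log_k)), float(alpha1), float(alpha2), float(beta)


# Synthetic flows that follow the gravity law exactly (not data from the paper).
rng = np.random.default_rng(0)
pop_i = rng.uniform(1e3, 1e6, size=200)
pop_j = rng.uniform(1e3, 1e6, size=200)
dist = rng.uniform(5.0, 500.0, size=200)
flows = pop_i ** 0.8 * pop_j ** 0.7 / dist ** 1.5

k, alpha1, alpha2, beta = fit_gravity_loglinear(flows, pop_i, pop_j, dist)
```

On noise-free synthetic flows the fit recovers the exponents used to generate them. On real data the log-space fit and the raw-flow objective of Eq. (3) will generally give different estimates, because they penalize errors on small and large flows differently.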
In our case this model fits the inter-city mobility based on Twitter GPS check-ins quite accurately (see table 1). Even though we do not assume $T_{ij}$ to be symmetric, the exponents of the two populations are similar, indicating that we observe similar flows in both directions between i and j.
S3 Community structures in inter-city mobility graph
Typically, complex networks exhibit community structure, that is, there are subsets of nodes that are more densely connected among themselves than to the rest of the nodes. In mobility networks, whose nodes correspond to geographical areas, these communities are interpreted as zones with high common activity and tend to be constrained by geographical and political barriers. We check whether this is also observed in our dataset by running six state-of-the-art community detection algorithms: FastGreedy [5], Walktrap [16], Infomap [18], MultiLevel [3], Label Propagation [17] and Leading Eigenvector [15]. These six algorithms yield different community structures in terms of number of communities, average community size or modularity (see table 3). Members (municipalities) of the resulting communities are spatially connected, with few exceptions, as figure 7 shows. We test the statistical robustness of the obtained communities by randomly removing a proportion p of the original links and running the algorithms on the new graph $G_p$. We consider communities to be robust when those obtained for the original network G and for $G_p$ are highly similar. To compare two arbitrary community memberships we use the Normalized Mutual Information (NMI) method described in [6], which returns 0 when two memberships are totally different and 1 when they are equal. We compute the NMI for each chosen algorithm run on G and $G_p$ for p between 1% and 10%, concluding that the obtained community structures are robust because they are not broken when some randomly chosen links are removed (see table 2).

Figure 5. Probability distributions for the different properties of daily trips in the Twitter dataset: trip distance (km), number of trips and elapsed time (secs). Dashed lines correspond to power-law fits with exponents −1.67, −2.43 and −0.62, respectively.
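The robustness check described above relies on two ingredients that are easy to sketch: removing a random proportion p of links and computing the NMI between two community memberships. The code below is a minimal sketch in which the function names are illustrative and NMI is written as $2I(A;B)/(H(A)+H(B))$, which takes the values 0 and 1 in the two extreme cases mentioned in the text. The community detection step itself is left to whichever graph library is used.

```python
import random
from collections import Counter
from math import log


def nmi(labels_a, labels_b):
    """Normalized mutual information, 2*I(A;B) / (H(A) + H(B)), of two partitions
    given as equal-length sequences of community labels (one per node)."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("partitions must be non-empty and of equal length")
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mutual = sum(
        n_ab / n * log(n_ab * n / (count_a[a] * count_b[b]))
        for (a, b), n_ab in joint.items()
    )
    entropy_a = -sum(c / n * log(c / n) for c in count_a.values())
    entropy_b = -sum(c / n * log(c / n) for c in count_b.values())
    if entropy_a + entropy_b == 0.0:
        return 1.0  # both partitions put every node in a single community
    return 2.0 * mutual / (entropy_a + entropy_b)


def remove_links(edges, p, seed=0):
    """Return a copy of `edges` with a proportion `p` of links removed at random."""
    rng = random.Random(seed)
    n_keep = len(edges) - round(p * len(edges))
    return rng.sample(list(edges), n_keep)


# Identical partitions up to relabeling, and two independent partitions.
same = nmi([0, 0, 1, 1], [5, 5, 7, 7])
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])
# Toy chain graph with 100 links, 10% of which are removed.
thinned = remove_links([(i, i + 1) for i in range(100)], p=0.1)
```

In a full robustness run one would detect communities on G and on each thinned graph $G_p$, then pass the two membership vectors (aligned by node) to `nmi`.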
Figure 6. Penetration rates as a function of population for both cities and detected communities.
As other works have shown, mobility graph communities are usually interpreted in terms of geographical and political barriers, and a natural question is whether the mobility-based communities are related to any of these barriers. In Spain there are different territorial divisions for administrative purposes. In this work we consider two of them: provinces, defined in the 1978 Constitution, which are 48 different heterogeneous aggregations of municipalities; and counties (comarcas in Spanish terminology), which are traditional aggregations of municipalities mainly based on Spanish orography (rivers, valleys, ridges, etc.), some of which are composed of municipalities from different provinces. We again use the NMI method to compare the community structure given by the algorithms with the administrative limits. Except for the Leading Eigenvector algorithm, all methods return communities that are quite closely related to provinces (NMI ≈ 0.7), whereas higher variability is observed for the county limits. In this last case, the algorithm most closely related to county limits is Infomap, with NMI ≈ 0.83. Therefore, Twitter-based mobility summarizes the inter-city flows and shows that these flows are influenced by geographical and political barriers.
S4 Twitter demographics and unemployment rates
Different age groups are not equally represented in Twitter. Recent surveys (2012) in Spain suggest that most (86%) Twitter users are 16 to 44 years old. Comparing the percentage of users per age group with the total population within the same groups (see figure 8) reveals that age groups above 35 years old are under-represented in Twitter. Thus, our Twitter data will be more revealing when trying to describe unemployment in age groups below 44 years old. This is indeed what we find when we build linear models for the unemployment rate in different age groups with the same Twitter variables: while unemployment rates for ages below 24 can be fitted to a linear model with $R^2 = 0.62$, regression models for unemployment rates for ages between 25 and 44 have $R^2 = 0.52$, and for ages above 44 we get only $R^2 = 0.26$. Table 4 summarizes the results of the regression models of unemployment rates in each age group, showing that our Twitter variables have more explanatory power for ages below 44. Finally, figure 8 shows the performance of the model for the different age groups; once again, the poor explanatory power of the Twitter variables for the unemployment rate at ages above 44 is evident.
S5 Properties of Twitter variables
Normalization and distributions
Heterogeneity between the values of variables constructed from Twitter is large but moderate, as the histograms in figure 9 show. We did not find any geographical area with anomalous values in any of the variables considered. Variables are normalized in different ways: both the penetration rate $\tau_i$ and the misspellers rate $\varepsilon_i$ are defined as the number of users or misspellers per 100,000 persons (population); activity variables $\nu_i$ are normalized as the percentage of tweets per time interval; finally, the numbers of tweets that mention a specific term, $\mu_i$, are given per 100,000 tweets published in the geographical area.

Figure 7. From left to right and from top to bottom: FastGreedy, Walktrap, Infomap, MultiLevel, Label Propagation and Leading Eigenvector communities on Twitter-based mobility transitions.
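The three normalizations described above amount to simple rescalings. The following sketch makes them explicit; the dictionary keys, function names and all figures are invented for illustration and do not come from the dataset.

```python
def per_100k(count, base):
    """Scale a raw count to a rate per 100,000 units of `base`."""
    if base <= 0:
        raise ValueError("base must be positive")
    return 100_000 * count / base


def activity_percentages(tweets_per_interval):
    """Convert tweet counts per time interval into percentages of the total."""
    total = sum(tweets_per_interval.values())
    if total == 0:
        raise ValueError("no tweets to normalize")
    return {name: 100 * n / total for name, n in tweets_per_interval.items()}


# Invented figures for a single hypothetical area.
area = {"population": 80_000, "users": 120, "misspellers": 15,
        "tweets": 40_000, "tweets_job": 24}

penetration = per_100k(area["users"], area["population"])             # tau_i
misspellers_rate = per_100k(area["misspellers"], area["population"])  # epsilon_i
job_mentions = per_100k(area["tweets_job"], area["tweets"])           # mu_i for one term
activity = activity_percentages({"morning": 1000, "afternoon": 2500, "night": 500})  # nu_i
```

Note that the term-mention variables use the number of tweets in the area as the base, whereas penetration and misspellers rate use the population.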
Correlation between variables
Variables are constructed to reflect the behavior of areas along the different dimensions of Twitter: penetration, social or geographical diversity, activity through the day, and content. The correlations between variables do indeed show that variables within each dimension are strongly correlated with each other. As we can see in figure 10, social and geographical diversities are highly correlated, an expected fact given that the gravity law accurately describes not only the flows of people between geographical areas but also the amount of communication between them. The same behavior is found for the variables in the activity group, while content variables are less correlated. Finally, we find that both the penetration rate $\tau_i$ and the fraction of misspellers $\varepsilon_i$ are strongly correlated with most of the variables.
High correlation between variables might lead to collinearity effects [24] in the linear regression models; that is, some variables with predictive value might have non-significant weights because they explain the same part of the variance. For instance, in Table 5 the misspellers rate has a very strong predictive value, but its p-value is too high to consider it significant. To test this hypothesis we perform a principal component analysis (PCA) on the independent variables of the regression. Figure 10 shows the loadings of the different variables on the first two principal components. The block structure shown in figure 10 results in similar directions of the variables in the first components of the PCA. We observe some groups of variables: on the one hand, geographical and social diversity seem to explain a large part of the variance; on the other hand, we find a perpendicular group of variables formed by temporal activity; finally, penetration rate and misspellers fraction seem to represent a different, independent direction of the data, with high collinearity between them. This might explain the low statistical significance in the models of section 4. In any case, the structure of the correlation matrix and the PCA results show that there is indeed information in all groups of variables, and thus we have to take a variable from each of them for our regression models.
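To make the collinearity argument concrete, the sketch below runs a PCA on standardized variables via a singular value decomposition and shows that two collinear variables end up with the same loading on the leading components. The function name and the synthetic data are assumptions of this example; it does not reproduce the analysis behind figure 10.

```python
import numpy as np


def pca_loadings(X):
    """PCA of the standardized columns of X via SVD.

    Returns (loadings, explained): row k of `loadings` holds the weights of the
    original variables on principal component k, and `explained` the fraction
    of total variance carried by each component.
    """
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    _, singular_values, loadings = np.linalg.svd(Z, full_matrices=False)
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    return loadings, explained


# Synthetic example: the first two columns are perfectly collinear.
rng = np.random.default_rng(1)
a = rng.normal(size=300)
c = rng.normal(size=300)
X = np.column_stack([a, 2.0 * a + 1.0, c])
loadings, explained = pca_loadings(X)
```

Here the first two variables point in the same direction of the component space, and the last component carries essentially no variance. This is the signature of redundant predictors that motivates keeping one variable per group in a regression.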
Figure 9. Frequency plots for each variable constructed from Twitter.

S6 Misspellers detection

In this work we consider only tweets in Spanish. Since several languages coexist in Spain depending on the part of the country, the first step is to reduce our Twitter dataset to those tweets that are written in Spanish. This task is carried out using the n-gram-based text categorization R library textcat [11]. Then, in order to decide whether a tweet contains a misspelling, we need to establish some patterns to search for in our set of tweets. Since we want to be sure that a detected mistake corresponds to a real misspeller, we do not consider the following cases:

• Lack of written accents. People tend to omit accents when writing colloquially.

• Mistakes derived from removing unnecessary letters. The most common cases are removing an h at the beginning of a word (in Spanish the letter h is not pronounced) or replacing the letters qu by k. We understand that these mistakes can be motivated by the length limit of tweets rather than by a real misspelling.

• Along the same lines, we neglect mistakes produced by removing letters in the middle of a word whose pronunciation can be deduced without them.

• We also do not consider mistakes related to features of specific areas in Spain. For example, in the south the pronunciation of ce and se is the same, which produces a large number of mistakes when writing. However, since we want to draw objective and equitable conclusions over the whole Spanish geography, we neglect those misspellings that only appear in a specific area.

Likewise, we consider the following mistakes to be real misspellings:

• Adding letters. For example, writing an h at the beginning of a word that starts with a vowel.

• Replacing the special cases mp, mb with the wrong spellings np, nb.

• Mixing up b with v, g with j, ll with y, and ex with es. These are typical mistakes in Spanish because the pairs have the same or a very close pronunciation.
• Confusing the verb haber with the periphrasis a ver.

• Splitting a word in two, for instance writing the word conmigo as con migo.

This way, our list of misspellings is composed of 617 common mistakes in Spanish that cannot be attributed to the special features of Twitter or to a specific region of Spain. Thus, one can expect this selection to provide an accurate and equitable method of detecting misspellers. Under these conditions, the number of users who wrote at least one misspelled word is 27,055 (56 over the whole population).

Figure 8. Top: Percentage of population in each age group from the Spanish Census (dark bars) and from surveys about users in Twitter (light bars). Bottom: performance of the linear models (predicted versus observed unemployment, in %) for each of the age groups: all ages ($R^2 = 0.47$), below 24 ($R^2 = 0.62$), 25–44 ($R^2 = 0.52$) and above 44 ($R^2 = 0.26$).
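A list-based detector of the kind described in this section can be sketched as a lookup of words and word pairs against a fixed set of misspelled forms. The five entries below are hypothetical examples chosen to match the stated rules (added h, nb for mb, ll for y, es for ex, a split word). They are not taken from the 617-entry list used in the paper, and the function names are illustrative.

```python
import re

# Hypothetical miniature list: misspelled form -> intended form.
MISSPELLINGS = {
    "hiba": "iba",          # h added before a vowel
    "tanbien": "también",   # nb instead of mb
    "llendo": "yendo",      # ll mixed up with y
    "estraño": "extraño",   # es instead of ex
    "con migo": "conmigo",  # one word split in two
}

WORD_RE = re.compile(r"[a-záéíóúüñ]+")


def has_misspelling(text):
    """True if the text contains a listed misspelling (single word or word pair)."""
    tokens = WORD_RE.findall(text.lower())
    bigrams = (" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return any(t in MISSPELLINGS for t in tokens) or any(b in MISSPELLINGS for b in bigrams)


def misspellers(tweets):
    """Users with at least one tweet containing a listed misspelling.

    `tweets` is an iterable of (user, text) pairs.
    """
    return {user for user, text in tweets if has_misspelling(text)}


sample = [
    ("ana", "Hoy hiba a salir"),
    ("ana", "todo bien"),
    ("luis", "Vente con migo"),
    ("eva", "Yo tambien voy"),  # only a missing accent: not counted
]
flagged = misspellers(sample)
```

Because missing accents are excluded by design, "tambien" is not flagged while "tanbien" is. Context-dependent confusions such as haber versus a ver cannot be captured by a plain lookup and would need additional rules.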
We analyze whether misspellers use Twitter differently from people who do not make serious mistakes when publishing a tweet. Comparing the average number of tweets, we observe that misspellers tend to publish more tweets than those who did not make mistakes (14471 against 2372). This also emerges when the mean number of misspellings given the total number of tweets is considered. For users with fewer than approximately 30 published tweets in the observation period, the number of misspellings is almost zero, whereas for users who publish more often the mean number of misspellings scales sub-linearly with the number of tweets (exponent ≈ 0.33).

Figure 10. Top: Correlation matrix between the variables constructed from Twitter. Each entry in the matrix is depicted as a circle whose size is proportional to the correlation between variables and whose color is blue/red for positive/negative correlations. Blank entries correspond to statistically insignificant correlations with 95% confidence. Bottom: Projection of the variables on the first two principal components given by PCA. We observe different groups of variables and collinearity between some of them.
Since we have observed a segmentation of the Twitter population based on how accurately users write, we consider the misspeller rate as a proxy for the educational level of the cities. A large number of previous works have revealed the relationship between the economic status and the educational level of geographical areas, and therefore it is natural to ask whether the observed misspellers rate is related to the economy as measured by the unemployment rate. To test this hypothesis, we consider cities with more than 5,000 inhabitants to avoid undersampled cases. We find a strong positive correlation between the probability of finding a misspeller in a city and the unemployment rate (0.372, 0.491).
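The text reports a pair of values next to the correlation without saying what they are. Assuming they bound the correlation coefficient (an interpretation, not something the source states), one common way to obtain a Pearson correlation together with an approximate 95% interval is the Fisher z-transformation, sketched below with invented numbers.

```python
from math import atanh, sqrt, tanh


def pearson_with_ci(x, y, z_crit=1.96):
    """Pearson correlation of x and y with an approximate confidence interval
    from the Fisher z-transformation (z_crit = 1.96 gives roughly 95%)."""
    n = len(x)
    if n != len(y) or n < 4:
        raise ValueError("need two sequences of equal length with at least 4 points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    r = sxy / sqrt(sxx * syy)
    z = atanh(r)                  # undefined for |r| = 1
    half_width = z_crit / sqrt(n - 3)
    return r, tanh(z - half_width), tanh(z + half_width)


# Invented toy values standing in for misspeller probability and unemployment rate.
misspeller_prob = [1, 2, 3, 4, 5]
unemployment = [2, 1, 4, 3, 5]
r, lower, upper = pearson_with_ci(misspeller_prob, unemployment)
```

With only five points the interval is very wide. With several hundred cities it narrows roughly as $1/\sqrt{n-3}$ on the z scale.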
Figure 11. Number (red) and probability (blue) of observed misspellings given the number of tweets.
Figure 12. Explanatory power ($R^2$) of the linear regression model when fitted against the unemployment data for different months, from October 2012 to June 2014. The gray (orange) area corresponds to the time window in which Twitter data is collected and variables are constructed.
S7 Time window and unemployment
In the definition of the variables we have aggregated the Twitter activity within a 7-month time window (from December 2012 to June 2013). Since unemployment varies significantly over time, we investigate here the correlation and explanatory power of the Twitter variables for the values of unemployment at different months within the time window in which Twitter data was collected, and whether the variables collected in that time window are more correlated with past or future values of unemployment. Figure 12 shows the explanatory value of the model when the linear regression is done for values of unemployment in different months before, during and after the Twitter data time window. Although there is a small seasonal effect along the year, the explanatory power remains around $R^2 = 0.6$, which suggests that our Twitter linear model retains its explanatory power even though unemployment changes considerably throughout the year. It is interesting to note that $R^2$ decays slightly during the summer, which means that our variables are less correlated with summer unemployment. Finally, the unemployment used in the main article is that of June 2013, i.e., the last month in the time window used to collect the data.
S8 Demographics do not explain unemployment
Since unemployment rates are very high among young people, a natural question is whether demographic variables alone could explain the heterogeneity of youth unemployment rates found across geographical areas. To test this, we have built four linear models: the first one (named Youth model in Table 5) has the rate of young population as its only explanatory variable; the second and third are built only on the Twitter variables considered in the main text (named Twitter model (I)) or only on those whose regression coefficients are statistically significant (Twitter model (II)); the fourth is fitted with all the variables (named All variables model in Table 5). Table 5 shows the summary of the regression for each model. Focusing on the variance explained by each model in terms of $R^2$, considering all Twitter variables is three times more explanatory than considering only the proportion of young people. On the other hand, comparing $R^2$ for the Twitter model with that of the All variables and Youth models shows that the rate of young population does not provide significant explanatory power. This semi-partial analysis shows that our Twitter variables retain a high explanatory power when the effect of the young population rate is controlled for.
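The model comparison above boils down to fitting nested linear regressions and comparing their $R^2$. The sketch below illustrates that comparison on synthetic data in which the outcome depends mostly on three stand-in "Twitter" variables and only weakly on a "youth" variable. The variable names, coefficients and data are assumptions of the example, not results from Table 5.

```python
import numpy as np


def r_squared(X, y):
    """R^2 of an ordinary least squares fit of y on the columns of X plus an intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    return float(1.0 - residuals @ residuals / np.sum((y - y.mean()) ** 2))


# Synthetic data (not the paper's): three Twitter-like predictors and one youth rate.
rng = np.random.default_rng(42)
n = 300
youth = rng.normal(size=n)
twitter = rng.normal(size=(n, 3))
noise = rng.normal(scale=0.5, size=n)
unemployment = twitter @ np.array([1.0, -0.5, 0.8]) + 0.1 * youth + noise

r2_youth = r_squared(youth, unemployment)                            # youth-only model
r2_twitter = r_squared(twitter, unemployment)                        # Twitter-only model
r2_all = r_squared(np.column_stack([twitter, youth]), unemployment)  # all variables
```

Adding a variable to an OLS model with intercept can never lower $R^2$, so the informative quantity is how small the gain `r2_all - r2_twitter` is compared with `r2_twitter - r2_youth`.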
S9 Unemployment models for other geographical areas
While municipalities are very heterogeneous demographically, other administrative areas exist in Spain at larger scales that could be used for our model of unemployment. As mentioned in section 4, the smallest administrative division of Spain we have considered is that of the 8200 municipalities. At larger scales we have the 326 counties (comarcas in Spanish), which are aggregations of municipalities. Finally, the largest geographical scale we considered is defined by the 50 provinces (provincias in Spanish). In this section we compare the performance of our Twitter model for unemployment with the variables defined in those administrative areas and relate it to the geographical communities detected and used in the main paper (see section 4). Not all the areas at the different administrative divisions are considered in the model. To minimize the effect of areas in which the number of geo-tagged tweets is very small, we only consider the 1738 municipalities which have a Twitter population $\pi > 10$. Similarly, we only consider the 198 counties with $\pi > 100$. As we can see in Table 6, the model has a large explanatory power for areas equal to or bigger than counties. As expected, $R^2$ increases as the number of areas in the model gets smaller, but the description level of the model is then very low, for provinces for example. The best performance (high $R^2$ and high geographical description level) is attained at the level of the detected communities.
S10 Relative importance of the variables
To assess the relative importance of the variables in the unemployment model we have used several methods. They all give qualitatively the same results, with some variations for the statistically insignificant variables. Specifically, we have used:
1. (weight) Relative weight of the absolute values of the coefficients obtained in the linear regression when variables are scaled to have mean zero and variance one.
2. (lmg) Averaging over orderings, as proposed by Lindeman, Merenda and Gold.
Figure 13. Relative importance of the variables (in percentage) in the unemployment model for the different ways to calculate it (weight, first, lmg, pmvd).
3. (pmvd) The PMVD metric introduced by Feldman, which is an average over orderings as well, but with data-dependent weights.
4. (first) The univariate $R^2$ values from regression models with one variable only.
All these metrics are obtained using the relaimpo R package [12]. The results for the young unemployment model are shown in figure 13, where we can see that the different methods yield similar relative importance of the variables, except perhaps for the diversity of mobility flows, a variable with a non-significant weight in the regression model.
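Three of the four metrics listed above (weight, first and lmg) can be sketched in a few lines. This is an illustrative Python re-implementation under stated assumptions, not the relaimpo package the authors used: function names are hypothetical, the data is synthetic, lmg is computed by brute force over all orderings (only practical for a handful of variables), and pmvd is omitted.

```python
from itertools import permutations
import numpy as np

def r2(X, y):
    """R^2 of an OLS fit of y on the columns of X (intercept added)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

def weight_importance(X, y):
    """(weight) Shares of |coefficient| after scaling each variable to mean 0, variance 1."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    A = np.column_stack([np.ones(len(y)), Z])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    w = np.abs(coef[1:])
    return w / w.sum()

def first_importance(X, y):
    """(first) Univariate R^2 of each variable on its own."""
    return np.array([r2(X[:, [j]], y) for j in range(X.shape[1])])

def lmg_importance(X, y):
    """(lmg) Sequential R^2 gain of each variable, averaged over all orderings."""
    p = X.shape[1]
    shares = np.zeros(p)
    orderings = list(permutations(range(p)))
    for order in orderings:
        included, prev = [], 0.0
        for j in order:
            included.append(j)
            cur = r2(X[:, included], y)
            shares[j] += cur - prev
            prev = cur
    return shares / len(orderings)

# Synthetic example (assumption): three correlated predictors.
rng = np.random.default_rng(42)
X = rng.normal(size=(150, 3))
X[:, 1] += 0.6 * X[:, 0]
y = 1.5 * X[:, 0] - 0.7 * X[:, 1] + 0.1 * X[:, 2] + rng.normal(size=150)

weight = weight_importance(X, y)
first = first_importance(X, y)
lmg = lmg_importance(X, y)
print(weight, first, lmg)
```

By construction the lmg shares add up to the $R^2$ of the full model, which makes them convenient to report as percentages of the explained variance.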
References
[1] B. Ashtakala. Generalized power model for trip distribution. Transportation Research Part B: Methodological, 21(1):59–67, 1987.
[2] Michel Bierlaire. Mathematical models for transportation demand analysis. Transportation Research Part A: Policy and Practice, 31(1):86–86, 1997.
[3] Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.
[4] Harry J. Casey Jr. The law of retail gravitation applied to traffic engineering. Traffic Quarterly, 9(3), 1955.
[5] Aaron Clauset, Mark E. J. Newman, and Cristopher Moore. Finding community structure in very large networks. Physical Review E, 70(6):066111, 2004.
[6] Leon Danon, Albert Diaz-Guilera, Jordi Duch, and Alex Arenas. Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment, 2005(09):P09008, 2005.
[7] Servicio Público de Empleo Estatal (SEPE). Spanish registered unemployment. http://www.sepe.es/contenidos/que_es_el_sepe/estadisticas/index.html
[8] Instituto Nacional de Estadística. Spanish 2011 census. http://www.ine.es/censos2011_datos/cen11_datos_inicio.htm
[9] Suzanne P. Evans. A relationship between the gravity model for trip distribution and the transportation problem in linear programming. Transportation Research, 7(1):39–61, 1973.
[10] Paul Expert, Tim S. Evans, Vincent D. Blondel, and Renaud Lambiotte. Uncovering space-independent communities in spatial networks. Proceedings of the National Academy of Sciences, 108(19):7663–7668, 2011.
[11] Ingo Feinerer, Christian Buchta, Wilhelm Geiger, Johannes Rauch, Patrick Mair, and Kurt Hornik. The textcat package for n-gram based text categorization in R. Journal of Statistical Software, 52(6):1–17, 2013.
[12] Ulrike Grömping. Relative importance for linear regression in R: the package relaimpo. Journal of Statistical Software, 17(1):1–27, 2006.
[13] Bartosz Hawelka, Izabela Sitko, Euro Beinat, Stanislav Sobolevsky, Pavlos Kazakopoulos, and Carlo Ratti. Geo-located Twitter as the proxy for global mobility patterns. arXiv preprint arXiv:1311.0680, 2013.
[14] Yu Liu, Zhengwei Sui, Chaogui Kang, and Yong Gao. Uncovering patterns of inter-urban trips and spatial interactions from check-in data. arXiv preprint arXiv:1310.0282, 2013.
[15] Mark E. J. Newman. Finding community structure in networks using the eigenvectors of matrices. Physical Review E, 74(3):036104, 2006.
[16] Pascal Pons and Matthieu Latapy. Computing communities in large networks using random walks. In Computer and Information Sciences (ISCIS 2005), pages 284–293. Springer, 2005.
[17] Usha Nandini Raghavan, Réka Albert, and Soundar Kumara. Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 76(3):036106, 2007.
[18] Martin Rosvall and Carl T. Bergstrom. Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences, 105(4):1118–1123, 2008.
[19] Morton Schneider. Gravity models and trip distribution theory. Papers in Regional Science, 5(1):51–56, 1959.
[20] Filippo Simini, Marta C. González, Amos Maritan, and Albert-László Barabási. A universal model for mobility and migration patterns. Nature, 484(7392):96–100, 2012.
[21] Chaoming Song, Tal Koren, Pu Wang, and Albert-László Barabási. Modelling the scaling properties of human mobility. Nature Physics, 6(10):818–823, 2010.
[22] Alan Geoffrey Wilson. Entropy in urban and regional modelling. Pion Ltd, 1970.
[23] Alan Geoffrey Wilson. Urban and regional models in geography and planning. 1974.
[24] Svante Wold, Arnold Ruhe, Herman Wold, and W. J. Dunn III. The collinearity problem in linear regression: the partial least squares (PLS) approach to generalized inverses. SIAM Journal on Scientific and Statistical Computing, 5(3):735–743, 1984.
Gravity Model

| Parameter | Description | Spain |
|---|---|---|
| α1 | Origin exponent | 0.477*** (0.002) |
| α2 | Destination exponent | 0.478*** (0.002) |
| β | Distance exponent | 1.05*** (0.0035) |
| R2 | Goodness of fit | 0.797 |
| φ | Correlation between T_ij and T^grav_ij | 0.826 |

Table 1. Description of the parameters for the Gravity Law model in geo-tagged social media data for Spain. (***) means significance p < 0.0001.
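NMI between G and Gp for different p

| Algorithm | p = 0.01 | 0.02 | 0.03 | 0.04 | 0.05 | 0.06 | 0.07 | 0.08 | 0.09 | 0.1 |
|---|---|---|---|---|---|---|---|---|---|---|
| FG | 0.995 | 0.992 | 0.989 | 0.983 | 0.981 | 0.977 | 0.983 | 0.969 | 0.980 | 0.959 |
| WT | 0.954 | 0.959 | 0.950 | 0.954 | 0.945 | 0.948 | 0.947 | 0.935 | 0.926 | 0.931 |
| IM | 0.988 | 0.981 | 0.980 | 0.981 | 0.978 | 0.974 | 0.975 | 0.970 | 0.969 | 0.966 |
| ML | 0.994 | 0.978 | 0.979 | 0.983 | 0.948 | 0.934 | 0.972 | 0.952 | 0.973 | 0.947 |
| LP | 0.906 | 0.908 | 0.911 | 0.915 | 0.895 | 0.907 | 0.907 | 0.893 | 0.905 | 0.904 |
| LE | 0.960 | 0.957 | 0.956 | 0.859 | 0.910 | 0.892 | 0.908 | 0.858 | 0.885 | 0.884 |

Table 2. NMI measure comparing G and Gp.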
Communities Stats

| Algorithm | 〈|Ni|〉i | max|Ni| | |Ni| | Modularity | NMI P | NMI C |
|---|---|---|---|---|---|---|
| FG | 309.696 | 1385 | 23 | 0.726 | 0.712 | 0.590 |
| WT | 9.262 | 433 | 769 | 0.417 | 0.744 | 0.757 |
| IM | 21.011 | 143 | 339 | 0.758 | 0.770 | 0.831 |
| ML | 323.772 | 1132 | 22 | 0.800 | 0.717 | 0.599 |
| LP | 22.052 | 750 | 323 | 0.732 | 0.749 | 0.761 |
| LE | 1017.571 | 5344 | 7 | 0.381 | 0.264 | 0.205 |

Table 3. Statistics of the communities Ni returned by the six algorithms. NMI P refers to the comparison between communities and provinces, whereas NMI C considers counties instead of provinces.
| | All ages | < 24 | 25−44 | > 44 |
|---|---|---|---|---|
| (Intercept) | 0.11**** (0.02) | 0.10*** (0.03) | 0.20*** (0.03) | 0.20*** (0.035) |
| Penetration rate | 3.23* (1.41) | 8.57*** (2.22) | 6.28** (2.17) | 2.40 (2.77) |
| Geographical diversity | 0.03 (0.02) | 0.15*** (0.04) | 0.08* (0.04) | 0.06 (0.05) |
| Social diversity | −0.03* (0.01) | −0.03 (0.02) | −0.05* (0.02) | −0.06* (0.03) |
| Morning activity | −0.69* (0.26) | −1.30** (0.42) | −1.53*** (0.41) | −1.19* (0.52) |
| Misspellers rate | 11.56 (8.13) | 31.51* (12.78) | 15.46 (12.48) | 23.60 (15.94) |
| Employment mentions | −1.80 (6.27) | 3.17 (9.86) | −9.94 (9.64) | 2.71 (12.3) |
| R2 | 0.47 | 0.64 | 0.55 | 0.29 |
| Adj. R2 | 0.44 | 0.62 | 0.52 | 0.26 |

***p < 0.001, **p < 0.01, *p < 0.05. Standard errors in parentheses.

Table 4. Regression table for the different models in which unemployment for different age groups is fitted. The All ages model is the fit to the general rate of unemployment in each geographical area, while the other models are for the rates of unemployment in groups of less than 24 years, between 25 and 44 years, and above 44 years.
| | All variables | Youth model | Twitter model (I) | Twitter model (II) |
|---|---|---|---|---|
| (Intercept) | 0.06 (0.03) | −0.02 (0.03) | 0.10*** (0.03) | 0.09*** (0.027) |
| Young pop. rate | 0.66* (0.30) | 2.20*** (0.35) | | |
| Penetration rate | 8.20*** (2.25) | | 8.57*** (2.22) | 8.62*** (2.21) |
| Geographical diversity | 0.14*** (0.04) | | 0.15*** (0.04) | 0.12*** (0.03) |
| Social diversity | −0.02 (0.02) | | −0.03 (0.02) | |
| Morning activity | −1.42*** (0.41) | | −1.30** (0.42) | −1.28** (0.41) |
| Misspellers rate | 23.95 (13.09) | | 31.51* (12.78) | 32.28* (12.71) |
| Employment mentions | 0.34 (9.81) | | 3.17 (9.86) | |
| R2 | 0.65 | 0.24 | 0.64 | 0.63 |
| Adj. R2 | 0.63 | 0.24 | 0.62 | 0.62 |

***p < 0.001, **p < 0.01, *p < 0.05. Standard errors in parentheses.

Table 5. Regression table for the different statistical models. The All variables model includes both Twitter and rate of young population variables. Twitter model (I) includes only the variables described in the main article, while Twitter model (II) only includes those variables which are significant (p < 0.05) in Twitter model (I).
| | Communities | Municipalities | Counties | Provinces |
|---|---|---|---|---|
| (Intercept) | 0.10*** (0.03) | 0.16*** (0.01) | 0.11*** (0.03) | 0.11* (0.05) |
| Penetration rate | 8.57*** (2.22) | 4.01*** (0.59) | 9.12*** (1.81) | 10.47*** (1.97) |
| Geographical diversity | 0.15*** (0.04) | 0.02 (0.01) | 0.12*** (0.03) | 0.08 (0.07) |
| Social diversity | −0.03 (0.02) | −0.01 (0.01) | −0.01 (0.02) | −0.03 (0.07) |
| Morning activity | −1.30** (0.42) | −1.16*** (0.14) | −1.49*** (0.39) | −1.03 (0.88) |
| Misspellers rate | 31.51* (12.78) | 14.40*** (2.51) | 14.09 (10.02) | |
| Employment mentions | 3.17 (9.86) | −0.71 (0.89) | 2.41 (8.86) | −3.17 (12.29) |
| Number of points | 128 | 1738 | 198 | 50 |
| R2 | 0.64 | 0.22 | 0.55 | 0.65 |
| Adj. R2 | 0.62 | 0.21 | 0.54 | 0.61 |

***p < 0.001, **p < 0.01, *p < 0.05. Standard errors in parentheses.

Table 6. Regression table for the unemployment linear regression model at different levels of geographical areas. In the Provinces model the misspellers rate has been removed from the model due to its large collinearity with the penetration rate.
- 1 Social media dataset and functional partition of cities
- 2 Social media behavioral fingerprints
- 3 Explanatory power of social media in unemployment
- 4 Discussion
lic API provided by Twitter from continental Spain, ranging from 29th November 2012 to 30th June 2013. Tweets were posted by 0.57 million unique (properly anonymized) users and geo-positioned in 7683 different municipalities. We observed a large correlation (Pearson's coefficient $\rho = 0.951$ [0.949, 0.953]) between the number of geo-positioned tweets per municipality and the municipality's population. On average we find around 50 tweets per month per 1000 persons in each municipality.
Despite this high level of social media activity within municipalities, we find their official administrative areas unsuitable for studying socio-economical activity: administrative boundaries between municipalities reflect political and historical decisions, while economical trade and activity often happen across those boundaries. As a result, municipalities in Spain are artificially diverse, ranging from a municipality with only 7 inhabitants to another with a population of 3.2 million. Although there exist natural aggregations of municipalities into provinces (regions) or statistical/metropolitan areas (NUTS areas), we have used our own procedure to detect economical areas. In particular, we use daily user trips between pairs of municipalities as a measure of the economic relatedness between those municipalities. We say that there is a daily trip between municipalities $i$ and $j$ if a user has tweeted in place $i$ and $j$ consecutively within the same day. In our dataset we find 1.9 million trips by 0.22 million users. With those trips we construct the daily mobility flux network $T_{ij}$ between municipalities as the number of trips between places $i$ and $j$ (see figure 1B). Remarkably, the statistical properties of trips and of the mobility matrix $T_{ij}$ coincide with those of other mobility datasets (see SI section 2): for example, trip distance $r$ and elapsed time $\delta t$ are power-law distributed, $P(r) \sim r^{-1.67}$ and $P(\delta t) \sim \delta t^{-0.62}$, with exponents very similar to those found in the literature [9, 23]. Moreover, the mobility fluxes $T_{ij}$ are well described by the Gravity Law ($R^2 = 0.80$) [28]:

$$T_{ij} \simeq T^{\mathrm{grav}}_{ij} = \frac{P_i^{\alpha_i} P_j^{\alpha_j}}{d_{ij}^{\beta}} \qquad (1)$$

where $P_i$ and $P_j$ are the populations of municipalities $i$ and $j$, and $d_{ij}$ is the distance between them. Similarly, the exponents in (1) are very similar to those reported in other works, $\alpha_i \simeq \alpha_j = 0.48$ and $\beta \simeq 1.05$ [23, 29]. These results suggest that mobility detected from geo-located tweets is a good proxy of human mobility within and between municipalities [30].
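The trip definition and the Gravity Law fit can be sketched together. This is a minimal sketch under stated assumptions, not the authors' pipeline: the tuple layout of `tweets`, the function names, the directed counting of trips, and the constant prefactor added to equation (1) (fitted as an intercept in log space) are all illustrative choices, and the fitted data is synthetic. Pairs with zero trips would have to be dropped before taking logarithms.

```python
from collections import Counter
import numpy as np

def daily_trips(tweets):
    """Count trips (i, j): consecutive tweets by one user on the same day
    that fall in two different municipalities.

    tweets: iterable of (user, day, timestamp, municipality) tuples.
    """
    trips = Counter()
    last_place = {}  # (user, day) -> municipality of the previous tweet
    for user, day, _, place in sorted(tweets, key=lambda t: t[:3]):
        prev = last_place.get((user, day))
        if prev is not None and prev != place:
            trips[(prev, place)] += 1
        last_place[(user, day)] = place
    return trips

def fit_gravity(T, P_i, P_j, d):
    """Least-squares fit of log T = c + a_i log P_i + a_j log P_j - beta log d."""
    T, P_i, P_j, d = (np.asarray(a, dtype=float) for a in (T, P_i, P_j, d))
    A = np.column_stack([np.ones_like(T), np.log(P_i), np.log(P_j), -np.log(d)])
    coef, *_ = np.linalg.lstsq(A, np.log(T), rcond=None)
    return coef[1], coef[2], coef[3]  # a_i, a_j, beta

# Toy tweets (assumption): u1 goes A -> B, u2 goes B -> A on the same day.
tweets = [
    ("u1", "2013-01-01", 1, "A"), ("u1", "2013-01-01", 2, "B"),
    ("u1", "2013-01-01", 3, "B"), ("u1", "2013-01-02", 1, "A"),
    ("u2", "2013-01-01", 1, "B"), ("u2", "2013-01-01", 5, "A"),
]
trips = daily_trips(tweets)

# Synthetic fluxes generated exactly from a gravity law, then re-fitted.
rng = np.random.default_rng(1)
P_i = rng.uniform(1e3, 1e6, size=200)
P_j = rng.uniform(1e3, 1e6, size=200)
d = rng.uniform(1.0, 500.0, size=200)
T = 2.0 * P_i**0.48 * P_j**0.48 / d**1.05
a_i, a_j, beta = fit_gravity(T, P_i, P_j, d)
print(dict(trips), a_i, a_j, beta)
```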
We use the network of daily fluxes between municipalities $T_{ij}$ to detect the geographical communities of economical activity. To this end we partition the mobility network $T_{ij}$ using standard graph community-finding algorithms. This technique has been applied extensively, especially with mobile phone data, to unveil the effective maps of countries based on mobility and/or social interactions of people [31–33]. In our case we have used the Infomap algorithm [34] and found 340 different communities within Spain. For further details about the comparison among different state-of-the-art community detection algorithms executed on the inter-city graph, see SI section 3. The average number of municipalities per community is 21 and the largest community contains 142 municipalities. The communities detected have very interesting features (see SI section 3): (i) they are geographically cohesive (see figure 1); (ii) they are statistically robust against random removal of trips in our database (SI table S2); (iii) the modularity of the partition is very high (0.76, see SI table S3); and (iv) the partition found has some overlap (77% of Normalized Mutual Information, NMI, see [35]) with coarser administrative boundaries like provinces (regions) (see SI section 3 for details). Interestingly, it shows a larger overlap (83% of NMI) with comarcas (counties), areas in Spain that reflect geographical and economical relations between municipalities. This result shows that the mobility detected from geo-located tweets and the communities obtained are a good description of economical areas.
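The overlap figures quoted above compare two partitions of the same set of municipalities. A minimal sketch of such a comparison follows, assuming the common normalization NMI = 2 I(A;B) / (H(A) + H(B)); the exact variant used in [35] is not stated in this section, and the function name and toy labels are hypothetical.

```python
from collections import Counter
from math import log

def nmi(labels_a, labels_b):
    """Normalized mutual information between two partitions of the same items.

    labels_a[k] and labels_b[k] are the groups assigned to item k
    (e.g. detected community and province of municipality k).
    """
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    h_a = -sum(c / n * log(c / n) for c in count_a.values())
    h_b = -sum(c / n * log(c / n) for c in count_b.values())
    mi = sum(c / n * log(c * n / (count_a[a] * count_b[b]))
             for (a, b), c in joint.items())
    if h_a + h_b == 0:
        return 1.0  # both partitions put everything in one group
    return 2 * mi / (h_a + h_b)

same = nmi([0, 0, 1, 1], ["x", "x", "y", "y"])   # same partition, relabeled
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])      # statistically independent
print(same, unrelated)
```

Values close to 1 indicate that the detected communities nest well inside (or coincide with) the administrative units; values close to 0 indicate no relation.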
In the rest of the paper we restrict our analysis to the geographical areas defined by the communities detected with Infomap (see figure 1). For statistical reasons we discard communities formed by fewer than 5 municipalities. Despite this sampling, 96% of the total country population is considered in our analysis. Our results in the rest of the paper also hold for municipalities, counties or provinces, though with lower statistical power (see SI section 9).
2 Social media behavioral fingerprints
The goal of this work is to quantify how and what behavioral features can be extracted from social media and then related back to the economical level of cities. To this end we define four groups of measures that have been widely explored in other fields like economics or the social sciences. These four types of measures rely on identifying the place where users live. Instead of using information in the user profile, we analyze the places where the user has tweeted and set as the user's hometown the municipality where he or she has tweeted with the highest frequency, a method usually employed with mobile phone and social media data [11, 23]. To this end we select those users with more than 5 geo-located tweets in our period who have posted at least 40% of their tweets in a given municipality, which we consider their hometown. After this filtering we end up with 0.32 million users, and we can then define the Twitter population $\pi_i$ in area $i$ as the number of users with their hometown within area $i$. We obtain a very high correlation between $\pi_i$ and the population of the cities $P_i$ in the national census, $\rho = 0.977$ [0.976, 0.978], which provides an indirect validation of our approach with the present data. However, not all demographic groups are equally represented in our Twitter database. As shown in SI section 4, Twitter user demographics in Spain obtained from surveys [36] show that age groups above 44 years old are under-represented. Thus our results mainly describe the socio-economical status of people below 44 years old. The unemployment analysis is therefore performed for different age groups: people below 25 years old, between 25 and 44 years old, and older than 44 years old. Finally, we have chosen the unemployment reported officially at the end of our observation time window (June 2013), but our results are not affected by the month selected; see SI section 7.
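The hometown rule above (more than 5 geo-located tweets, at least 40% of them in the most frequent municipality) can be sketched as follows. The input layout, function names and parameter names are assumptions made for illustration; ties for the most frequent municipality are broken arbitrarily here, which the text does not specify.

```python
from collections import Counter, defaultdict

def assign_hometowns(tweets, more_than=5, min_share=0.4):
    """Map each qualifying user to a hometown.

    tweets: iterable of (user, municipality) pairs, one per geo-located tweet.
    A user qualifies with more than `more_than` tweets, of which at least
    `min_share` were posted in the most frequent municipality.
    """
    per_user = defaultdict(Counter)
    for user, place in tweets:
        per_user[user][place] += 1
    homes = {}
    for user, counts in per_user.items():
        total = sum(counts.values())
        place, top = counts.most_common(1)[0]
        if total > more_than and top / total >= min_share:
            homes[user] = place
    return homes

def twitter_population(homes):
    """pi_i: number of users whose hometown is area i."""
    return Counter(homes.values())

# Toy data (assumption): only user "a" passes both filters.
tweets = (
    [("a", "M")] * 3 + [("a", "N")] * 2 + [("a", "O")]      # 6 tweets, 50% in M
    + [("b", "M")] * 5                                       # only 5 tweets
    + [("c", p) for p in "MMMNNNOOPP"]                       # 10 tweets, top share 30%
)
homes = assign_hometowns(tweets)
pi = twitter_population(homes)
print(homes, pi)
```

Dividing `pi[i]` by the census population of area `i` then gives the penetration rate used later in this section.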
Figure 2. Examples of different behaviour in the observed variables and the unemployment. In A we observe that two cities with different unemployment levels have different temporal activity patterns. Panel C shows how communities (red) with distinct entropy levels of social communication with other communities (blue) may hold different unemployment intensity: the left map shows a highly focused communication pattern (low entropy), while the right map corresponds to a community with a diverse communication pattern (high entropy). Finally, panel B shows some examples of detected misspellings in our database using 618 incorrect expressions (see SI section 6), such as "Con migo", "Aver" or "llendo".

For every considered region we investigate the officially reported unemployment for different age groups and a number of metrics related to social media activity. Some of those metrics have already been reported in the literature, while others are introduced in this work. Specifically, we consider:
• Social media technology adoption: we use the Twitter penetration rate $\tau_i = \pi_i / P_i$ in each area $i$ as a proxy of technology adoption. Recent works have shown that there is indeed a correlation between country GDP and Twitter penetration: specifically, a positive correlation was found between $\tau_i$ and GDP at the country level [23]. However, in our data we find the opposite correlation (see figure 3), namely that the larger the penetration rate, the higher the unemployment. This suggests that the impact of technology adoption at country scale differs from what happens within an (industrialized) country, where the technology to access social media is commoditized.
• Social media activity: regions with very different economical situations should exhibit different patterns of activity during the day. Since working, leisure, family, shopping and other activities happen at different times of the day, we might observe different daily patterns in regions with different socio-economical status. For example, we hypothesize that communities with low levels of unemployment will tend to have higher activity levels at the beginning of a typical weekday. This is indeed what we find: figure 2A shows the hourly fraction of tweets during workdays for two communities with very different rates of unemployment. As we can observe, the two profiles are quite different, and in the case of low unemployment we find a strong peak of activity between 8 and 11am (morning) and lower activity during the afternoons and nights. We encode this finding in $\nu^{\mathrm{mrng}}_i$, $\nu^{\mathrm{aftn}}_i$ and $\nu^{\mathrm{ngt}}_i$, the total fraction of tweets happening in geographical area $i$ between 8am and 10am, 3pm and 5pm, and 12am and 3am, respectively. Figure 3 shows a strong negative correlation between $\nu^{\mathrm{mrng}}_i$ and the unemployment for the communities in our database, and a positive correlation with $\nu^{\mathrm{aftn}}_i$ and $\nu^{\mathrm{ngt}}_i$.
• Social media content: some works have observed a correlation between the frequency of words related to work conditions [22], or of forward-looking searches [21], and the economical situation of countries. In our case we also find that there is a moderate positive correlation between the fraction of tweets $\mu_i$ mentioning job or unemployment terms and the observed unemployment, while the correlation is negative for the number of mentions of employment or the economy. However, we have also tried a different approach, measuring the relation between the way of writing and the educational level [37]. To this end we build a list of 618 misspelled Spanish expressions and extract the tweets of the dataset containing at least one of these expressions (see SI section 6 for further details about how these expressions were collected). We only consider tweets in Spanish, detected with an n-gram based algorithm, and only misspellings that cannot be justified as abbreviations. Finally, we compute for every region the proportion $\varepsilon_i$ of misspellers among the Twitter population. If the fraction of misspellers per geographical area is a proxy for the educational level of that region, we expect a positive correlation between $\varepsilon_i$ and unemployment. Indeed we find (see figure 3) that there is a strong correlation between the fraction of misspellers and unemployment.
• Social media interactions and geographical flow diversity: following the ideas in [14], which correlated the economical development of an area with the diversity of its communications with other areas, we consider all tweets mentioning another user and take them as a proxy for communication between users. We then compute the number of communications $w_{ij}$ between areas $i$ and $j$ as the number of mentions between users in those areas. To measure the diversity we use, as in [14], the informational normalized entropy (Entropy 1) $S^u_i = -\sum_j p_{ij} \log p_{ij} / S^r_i$, where $p_{ij} = w_{ij} / \sum_j w_{ij}$, and (Entropy 2) $S^r_i = \log k_i$, with $k_i$ the number of different areas with which users in area $i$ have interacted. As in [14], we find that areas with large unemployment have less diverse communication patterns than areas with low unemployment. This translates into a strong negative correlation between $S_i$ and the unemployment; see figure 3. Similar ideas are applied to the flows of people between areas, to investigate the diversity of the geographical flows through the entropy $S_i = -\sum_j p_{ij} \log p_{ij} / S^r_i$, where $p_{ij} = T_{ij} / \sum_j T_{ij}$ and $S^r_i = \log k_i$, with $k_i$ the number of different areas which have been visited by users that live in area $i$. Figure 3 shows that, as in [19], the correlation of these geographical entropies with economical development is low.
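Three of the metrics in the list above can be sketched compactly. This is an illustration under stated assumptions rather than the authors' implementation: function names and input layouts are hypothetical, the time window is taken as half-open (start inclusive, end exclusive), the entropy of an area with fewer than two contacts is set to 0 by convention, and misspellings are matched by crude case-insensitive substring search.

```python
from math import log

def window_fraction(hours, start, end):
    """Fraction of tweets whose posting hour h satisfies start <= h < end."""
    hours = list(hours)
    return sum(start <= h < end for h in hours) / len(hours)

def normalized_entropy(weights):
    """-sum_j p_j log p_j / log k, with p_j = w_j / sum(w) and k the number
    of areas with non-zero weight (mentions w_ij or trips T_ij from area i)."""
    w = [x for x in weights if x > 0]
    k = len(w)
    if k < 2:
        return 0.0  # log k = 0; convention assumed for this sketch
    total = sum(w)
    return -sum(x / total * log(x / total) for x in w) / log(k)

def misspellers_rate(user_texts, bad_expressions):
    """Share of users with at least one tweet containing a listed expression.

    user_texts: dict mapping user -> list of tweet texts.
    """
    bad = [b.lower() for b in bad_expressions]
    flagged = sum(
        any(b in text.lower() for text in texts for b in bad)
        for texts in user_texts.values()
    )
    return flagged / len(user_texts)

morning = window_fraction([8, 9, 10, 15, 23, 9], 8, 10)   # 3 of 6 tweets
diverse = normalized_entropy([5, 5, 5, 5])                # evenly spread
focused = normalized_entropy([10, 0, 0])                  # single contact area
eps = misspellers_rate(
    {"u1": ["llendo a casa"], "u2": ["voy a casa"]},
    ["llendo", "con migo"],
)
print(morning, diverse, focused, eps)
```

An evenly spread communication pattern gives an entropy of 1 and a fully focused one gives 0, matching the low-entropy and high-entropy maps described in figure 2C.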
Normalization of variables is discussed in SI section 5. We have also studied the correlation between the variables considered. As expected, variables within each group show moderate correlations between them. However, inspection of the correlation matrix and a Principal Component Analysis of the variables considered show that there is information (as percentage of variance in the data) in each of the groups of variables; see SI section 5. Because of these two facts, we restrict our analysis to the variables within each group with the highest correlation with the unemployment, namely the penetration rate $\tau_i$, the social and mobility diversity variables $S^u_i$, the morning activity $\nu^{\mathrm{mrng}}_i$, the fraction of misspellers $\varepsilon_i$ and the fraction of employment-related tweets $\mu^{\mathrm{emp}}_i$.
3 Explanatory power of social media in unemployment
The four previous groups of variables are fingerprints of human behavior reflected in Twitter usage habits. As we observed in figure 3, all of them exhibit statistically strong correlations with unemployment. The question we address in this section is whether those variables suffice to explain the observed unemployment (their explanatory power), and which of them are the most important (which give more explanatory power than others). Note that we are not claiming a causal arrow between the measures built in the previous section and the unemployment rate, but only exploring whether they can be used as alternative indicators with a real translation in the economy.
Figure 4 shows the result of a simple linear regression model for the observed unemployment for ages below 25 years as a function of the variables which are most correlated with the unemployment. The model has a significant $R^2 = 0.62$, showing that a large part of the unemployment can be explained by the behavioral variables extracted from Twitter. However, not all the variables weigh equally in the model: specifically, the penetration rate, geographical diversity, morning activity and fraction of misspellers account for up to 92% of the explained variance, while social diversity and the number of employment-related tweets are not statistically significant (see SI section 10 for the methods used to determine the relative importance of the variables). It is interesting to note that while social diversity obtained from mobile phone communications was a key variable in the explanation of deprivation indexes in [14, 19], the communication diversity of Twitter users seems to play a minor role in the explanation of the heterogeneity of unemployment in Spain.
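As a consistency check, the value $R^2 = 0.62$ quoted above appears to correspond to the adjusted $R^2$ row of SI Table 4, whose unadjusted value for the under-25 model is 0.64. The worked equation below assumes the standard adjusted-$R^2$ formula, $n = 128$ communities (the "Number of points" row of SI Table 6) and $p = 6$ predictors (the six variables listed in SI Table 4); these inputs are read off the tables rather than stated in the text.

```latex
\bar{R}^2 \;=\; 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
        \;=\; 1 - (1 - 0.64)\,\frac{127}{121}
        \;\approx\; 1 - 0.378
        \;\approx\; 0.62
```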
Similar explanatory power is found for other age groups: $R^2 = 0.44$ for all ages and $R^2 = 0.52$ for ages between 25 and 44 years. However, the model degrades for ages above 44 years ($R^2 = 0.26$), showing that our variables mainly describe the behavior of the most represented age groups in Twitter, namely those below 44 years old. On the other hand, since our Twitter variables seem to describe the behavior of young people, we have investigated whether the Twitter-constructed variables have explanatory value (in terms of $R^2$) similar to that of simple census demographic variables for young people. However, regression models that add the young population rate yield only a minor improvement, $R^2 = 0.65$, while the young population rate alone gives $R^2 = 0.24$, a result which shows that Twitter variables do indeed possess genuine explanatory power beyond their simple demographic representation. Finally, our model has the largest explanatory power for the detected communities, but large $R^2$ values are also found for other geographical areas like counties ($R^2 = 0.54$) and provinces ($R^2 = 0.65$); see SI section 9.
[Figure 3: panel A lists the Twitter metrics (penetration rate; entropy 1 and entropy 2, geo; entropy 1 and entropy 2, social; activity in the morning, afternoon and night; misspellers rate; job, employment, unemployment and economy tweets) against their correlation with unemployment. The scatter panels plot entropy 1 (social), misspellers rate, penetration rate and activity (morning) against unemployment.]

Figure 3. A) Correlation coefficient of all the extracted Twitter metrics, grouped by technology adoption (black), geographical diversity (orange), social diversity (light blue), temporal activity (green) and content analysis (dark blue). Error bars correspond to 95% confidence intervals of the correlation coefficient. The gray area corresponds to the statistical significance thresholds. Panels B, C, D and E show the values of 4 selected variables in each geographical community against its percentage of unemployment. The size of the points is proportional to the population in each geographical community. Solid lines correspond to linear fits to the data.

[Figure 4: panels A (Age < 25, $R^2 = 0.62$) and B (25 < Age < 44, $R^2 = 0.52$) plot predicted against observed unemployment; panel C shows the weight of penetration rate, entropy 1 (geo), entropy 1 (social), activity (morning), misspellers rate and employment tweets.]

Figure 4. A) and B) Performance of the model, showing the predicted unemployment rate versus the observed one for ages below 25 ($R^2 = 0.62$) and for ages between 25 and 44 ($R^2 = 0.52$). Dashed lines correspond to the equality line and ±20% error. C) Percentage of weight for each of the variables in the regression model, using the relative weight of the absolute values of the coefficients in the regression model (see SI section 10). Variables marked with * are not statistically significant in the model.

4 Discussion

This work serves as a proof of concept for how a wide range of behavioral features linked to socioeconomic behavior can be inferred from the digital traces left in publicly available social media. In particular, we demonstrate that behavioral features related to unemployment can be recovered from the digital exhaust left by the microblogging network Twitter. First of all, Twitter geolocalized traces together with off-the-shelf community detection algorithms render an optimal partition of a country for economical activity, showing the remarkable power of social media to understand and unveil economical behavior at a country scale. This insight is likely to apply to other administrative definitions in other countries, especially when considering large cities with an inherently dynamical nature and evolution of mobility fluxes, and cities composed of small satellite cities with arbitrary agglomerations or divisions among them (e.g. London, NYC, Singapore). This result is unsurprising: it should be natural to recompute city clusters or communities of activity based on their real-time mobility, which may vary considerably faster than the update rates of mobility and travel surveys [31–33].
Our main result demonstrates that several key indicators (different penetration rates among regions, fingerprints of the temporal patterns of activity, lexical correctness of content and geo-social connectivities among regions) can be extracted from social media and then used to infer unemployment levels. These findings shed light in two directions: first, on how individuals' extensive use of their social channels allows us to characterize cities based on their activity in a meaningful fashion; and secondly, on how this information can be used to build indicators that are directly related to the economy. Regarding the latter, our work is important for understanding how country-scale analysis of social media should consider not only the demographic but also the economical differences between users. As we have shown, users in areas with large unemployment have different mobility, different social interactions and different daily activity than those in low unemployment areas. This intertwined relationship between user behavior and employment should be considered not only in economical analysis derived from social media, but also in other applications like marketing, communication, social mobilization, etc.
It is particularly remarkable that Twitter data can provide these accurate results. Twitter is, among the many currently popular social networking platforms, perhaps the noisiest, sparsest, and most 'sabotaged' medium: very few users send out messages at a regular rate, most users do not have geolocated information, the social relationships (followers/followees) contain many unused or unimportant links, it is plagued by spam-bots, and, last but not least, we have no way to identify the motive, goal, or functionality of the mobility fluxes we are able to extract. These limitations are not particular to our sample but general to the sample Twitter data being employed in the computational social science community. Despite all these caveats, we are able to show that even some simple filtering techniques together with basic statistical regressions yield predictive power about a variable as important as unemployment. Other social media platforms such as Facebook, Google+, Sina Weibo, Instagram, Orkut, or Flickr, with more granular and consistent individual data, are likely to provide similar or better results by themselves or in combination. Further improvements can be obtained by using more sophisticated statistical machine learning techniques, some of them even tailored to the peculiarities of social media data. Our work serves to illustrate the tremendous potential of these new digital datasets to improve the understanding of society's functioning at the finest scales of granularity.
The usefulness of our approach must be weighed against the cost and update rate of performing detailed surveys of mobility, social structure, and economic performance. Our database is publicly articulated, which means that our analysis could be replicated easily in other countries, in other time periods, and with different scopes. Naturally, surveys provide more accurate results, but they also consume considerably more financial and human resources, employing hundreds of people and taking months, even years, to complete and be released. They are so costly that countries going through economic recession have considered discontinuing them or altering their update rate in recent times. A particularly problematic aspect of these surveys is that they are "out of sync", i.e. the census may be up to date whereas those same individuals' travel surveys may not be, so drawing inferences between both may be particularly difficult. This is a challenging problem that the immediacy of social media can help ameliorate.
A few questions remain open for further investigation. How can traditional surveys and social media digital traces be best combined to maximize their predictive ability? Can social media provide a reliable leading indicator for unemployment and, in general, for economic surveys? How much reliable lead is possible, if any? As we have found, Twitter penetration and educational levels are correlated with unemployment, but these levels are unlikely to change rapidly enough to describe or anticipate changes in the economy or unemployment. However, other indicators such as daily activity, social interactions, and geographical mobility are more connected with our daily activity, and perhaps they have more predictive power to show and/or anticipate sudden changes in employment. The relationship between unemployment and individual and group behavior may help contextualize the multiple factors affecting the socioeconomic well-being of a region: while penetration, content, daily activity, and mobility diversity seem to be highly correlated with unemployment in Spain, different weights for each group of traces might be expected in other countries [14]. Finally, digital traces could serve as an alternative (sometimes the only one available) where surveys are lacking in poor or remote areas [20, 27]. Another interesting avenue of research involves the use of social media to detect mismatches between the real (hidden, underground) economy and the officially reported one [38].
Most importantly, the immediacy of social media may also allow governments to better measure and understand the effect of policies, social changes, and natural or man-made disasters on the economic status of cities in almost real time [18, 39]. These new avenues for research provide great opportunities at the intersection of the economic, social, and computational sciences that originate from these new, widespread, inexpensive datasets.
Acknowledgments
We would like to thank Kristina Lerman, Lada Adamic, James Fowler, Daniel Villatoro, and Ricardo Herranz for stimulating discussions, and Yuri Kryvasheyeu and Thomas Bochynek for their critical reading of the manuscript. This work was partially supported by Spanish Ministry of Science and Technology Grant FIS2013-47532-C3-3-P (to A. L., M. G. H., and E. M.). Manuel Cebrian is funded by the Australian Government as represented by the Department of Broadband, Communications and Digital Economy and the Australian Research Council through the ICT Centre of Excellence program.
References
[1] Becker, G. S. (1976) The economic approach to human behavior (University of Chicago Press).
[2] Granovetter, M. (1985) Economic action and social structure: the problem of embeddedness. American journal of sociology, pp. 481–510.
[3] Camerer, C. F., Loewenstein, G. & Rabin, M. (2011) Advances in behavioral economics (Princeton University Press).
[4] Glaeser, E. L., Kallal, H. D., Scheinkman, J. A. & Shleifer, A. (1991) Growth in cities (National Bureau of Economic Research), Technical report.
[5] Bettencourt, L. M., Lobo, J., Helbing, D., Kuhnert, C. & West, G. B. (2007) Growth, innovation, scaling, and the pace of life in cities. Proceedings of the National Academy of Sciences 104, 7301–7306.
[6] Batty, M. (2008) The size, scale, and shape of cities. Science 319, 769–771.
[7] Milgram, S. (1974) The experience of living in cities. Crowding and behavior 167, 41.
[8] Pan, W., Ghoshal, G., Krumme, C., Cebrian, M. & Pentland, A. (2013) Urban characteristics attributable to density-driven tie formation. Nature communications 4.
[9] Gonzalez, M. C., Hidalgo, C. A. & Barabasi, A.-L. (2008) Understanding individual human mobility patterns. Nature 453, 779–782.
[10] Calabrese, F., Diao, M., Lorenzo, G. D., Jr., J. F. & Ratti, C. (2013) Understanding individual mobility patterns from urban sensing data: A mobile phone trace example. Transportation Research Part C: Emerging Technologies 26, 301–313.
[11] Cheng, Z., Caverlee, J., Lee, K. & Sui, D. Z. (2011) Exploring Millions of Footprints in Location Sharing Services (AAAI, Menlo Park, CA, USA).
[12] Cho, E., Myers, S. A. & Leskovec, J. (2011) Friendship and mobility: user movement in location-based social networks. KDD '11 (ACM, New York, NY, USA), pp. 1082–1090.
[13] Sun, L., Jin, J. G., Axhausen, K. W., Lee, D.-H. & Cebrian, M. (2014) Quantifying long-term evolution of intra-urban spatial interactions. arXiv preprint arXiv:1407.0145.
[14] Eagle, N., Macy, M. & Claxton, R. (2010) Network diversity and economic development. Science 328, 1029–1031.
[15] Henrich, J., Boyd, R., Bowles, S., Camerer, C., Fehr, E., Gintis, H. & McElreath, R. (2001) In search of homo economicus: behavioral experiments in 15 small-scale societies. American Economic Review, pp. 73–78.
[16] Krieger, N., Williams, D. R. & Moss, N. E. (1997) Measuring social class in US public health research: concepts, methodologies, and guidelines. Annual review of public health 18, 341–378.
[17] Groves, R. M., Fowler Jr., F. J., Couper, M. P., Lepkowski, J. M., Singer, E. & Tourangeau, R. (2013) Survey methodology (John Wiley & Sons).
[18] Lazer, D., Pentland, A. S., Adamic, L., Aral, S., Barabasi, A. L., Brewer, D., Christakis, N., Contractor, N., Fowler, J., Gutmann, M. et al. (2009) Life in the network: the coming age of computational social science. Science (New York, NY) 323, 721.
[19] Smith, C., Quercia, D. & Capra, L. (2013) Finger on the pulse: identifying deprivation using transit flow analysis (ACM), pp. 683–692.
[20] Soto, V., Frias-Martinez, V., Virseda, J. & Frias-Martinez, E. (2011) Prediction of socioeconomic levels using cell phone records. UMAP'11 (Springer-Verlag, Berlin, Heidelberg), pp. 377–388.
[21] Preis, T., Moat, H. S., Stanley, H. E. & Bishop, S. R. (2012) Quantifying the advantage of looking forward. Scientific reports 2.
[22] Antenucci, D., Cafarella, M., Levenstein, M. C., Re, C. & Shapiro, M. D. (2014) Using social media to measure labor market flows (National Bureau of Economic Research), Technical report.
[23] Hawelka, B., Sitko, I., Beinat, E., Sobolevsky, S., Kazakopoulos, P. & Ratti, C. (2013) Geo-located twitter as the proxy for global mobility patterns. arXiv preprint arXiv:1311.0680.
[24] Lathia, N., Quercia, D. & Crowcroft, J. (2012) in Pervasive computing (Springer), pp. 91–98.
[25] Frias-Martinez, V., Virseda, J. & Frias-Martinez, E. (2010) Socio-economic levels and human mobility.
[26] Gutierrez, T., Krings, G. & Blondel, V. D. (2013) Evaluating socio-economic state of a country analyzing airtime credit and mobile phone datasets. arXiv preprint arXiv:1309.4496.
[27] Smith, C., Mashhadi, A. & Capra, L. (2013) Ubiquitous sensing for mapping poverty in developing countries. Paper submitted to the Orange D4D Challenge.
[28] Erlander, S. & Stewart, N. F. (1990) The gravity model in transportation analysis: theory and extensions (Vsp) Vol. 3.
[29] Simini, F., Gonzalez, M. C., Maritan, A. & Barabasi, A.-L. (2012) A universal model for mobility and migration patterns. Nature 484, 96–100.
[30] Lenormand, M., Picornell, M., Cantu-Ros, O. G., Tugores, A., Louail, T., Herranz, R., Barthelemy, M., Frias-Martinez, E. & Ramasco, J. J. (2014) Cross-checking different sources of mobility information. arXiv preprint arXiv:1404.0333.
[31] Barthelemy, M. (2011) Spatial networks. Physics Reports 499, 1–101.
[32] Expert, P., Evans, T. S., Blondel, V. D. & Lambiotte, R. (2011) Uncovering space-independent communities in spatial networks. Proceedings of the National Academy of Sciences 108, 7663–7668.
[33] Sobolevsky, S., Szell, M., Campari, R., Couronne, T., Smoreda, Z. & Ratti, C. (2013) Delineating geographical regions with networks of human interactions in an extensive set of countries. PloS one 8, e81707.
[34] Rosvall, M. & Bergstrom, C. T. (2008) Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105, 1118–1123.
[35] Danon, L., Diaz-Guilera, A., Duch, J. & Arenas, A. (2005) Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment 2005, P09008.
[36] ADigital (2013) Uso de twitter en espana 2012. [Online; accessed 1-November-2014].
[37] Davenport, J. R. & DeLine, R. (2014) The readability of tweets and their geographic correlation with education. arXiv preprint arXiv:1401.6058.
[38] Schneider, F., Buehn, A. & Montenegro, C. E. (2011) Shadow economies all over the world: New estimates for 162 countries from 1999 to 2007. Handbook on the shadow economy, pp. 9–77.
[39] Rutherford, A., Cebrian, M., Dsouza, S., Moro, E., Pentland, A. & Rahwan, I. (2013) Limits of social mobilization. Proceedings of the National Academy of Sciences 110, 6281–6286.
Supporting Information for
Social media fingerprints of unemployment
Alejandro Llorente, Manuel García-Herranz, Manuel Cebrian, and Esteban Moro
S1 The dataset
Twitter provides an extremely rich and publicly available data set of user interactions, information flows and, thanks to the geolocation of tweets, user movements. Nevertheless, the representativeness of geo-located Twitter as a global source of mobility data has so far received scant attention. In this sense, while [13] present a promising and extensive study of global country-to-country movements (mostly driven by tourism), within-country human flows (comprising not only internal tourism but also, to a greater extent than country-to-country travel, visiting and commuting) still need further investigation. Therefore, throughout this work we compare our findings using geo-located Twitter with similar studies using commuting surveys.
For the Twitter analysis we consider almost 146 million geo-located Twitter messages (tweets) collected through the public API provided by Twitter for the continental part of Spain from 29 November 2012 to 10 April 2013. In this dataset we consider that there has been a trip from place l to place k if a user has tweeted in place l and place k consecutively. We keep only those transitions in which the first and second tweets are dated on the same day. We filter the trips database to avoid unrealistic transitions and keep only trips with a geographical displacement larger than 1 km (see Methods section). By this method, 138 million trips from 167376 different users are considered in our work.
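The trip definition above can be illustrated with a minimal sketch. It assumes each tweet is a `(user, timestamp, lat, lon)` tuple and uses the haversine formula for the displacement; the function names, the sample coordinates, and the omission of the "unrealistic transition" filter (whose details are in the Methods section) are all assumptions made for illustration, not the authors' pipeline.

```python
import math
from datetime import datetime

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def extract_trips(tweets, min_km=1.0):
    """Return (user, origin, destination) for consecutive same-day tweets
    of one user that are more than min_km apart."""
    by_user = {}
    for user, ts, lat, lon in tweets:
        by_user.setdefault(user, []).append((ts, lat, lon))
    trips = []
    for user, points in by_user.items():
        points.sort(key=lambda p: p[0])  # chronological order per user
        for (t0, la0, lo0), (t1, la1, lo1) in zip(points, points[1:]):
            if t0.date() != t1.date():
                continue  # both tweets must be dated on the same day
            if haversine_km(la0, lo0, la1, lo1) > min_km:
                trips.append((user, (la0, lo0), (la1, lo1)))
    return trips

# Hypothetical sample data (not from the paper's dataset).
tweets = [
    ("u1", datetime(2013, 1, 5, 9, 0), 40.4168, -3.7038),
    ("u1", datetime(2013, 1, 5, 18, 0), 40.3328, -3.7653),  # same day, ~10 km away
    ("u1", datetime(2013, 1, 6, 10, 0), 41.3851, 2.1734),   # next day: not a trip
    ("u2", datetime(2013, 1, 5, 12, 0), 40.4168, -3.7038),
    ("u2", datetime(2013, 1, 5, 12, 5), 40.4169, -3.7039),  # a few metres: dropped
]
trips = extract_trips(tweets)
```

Counting the resulting trips by origin and destination municipality would then give the flow matrix between cities.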
From those trips we construct the mobility flow $T_{ij}$ between municipalities, which measures the number of trips in our database whose origin lies within the boundaries of city i and whose destination lies within those of city j.
We also consider population and economic information about the municipalities from the Spanish Census (2011) [8] and unemployment figures from the Public Service of Employment (Servicio Público de Empleo Estatal, SEPE) [7]. In the latter case, registered unemployment (in number of persons) is given for each Spanish municipality by gender, age, and month. To obtain unemployment rates, we divide registered unemployment by the total workforce in the municipality, estimated as the number of people aged between 16 and 65 years.
S2 Twitter as mobility proxy
Considering all the available transitions in our database, one can compute the distance between origin and destination, the elapsed time of the transition, and the number of trips per user, among many other statistics. All of them seem to follow a power-law distribution with a cutoff due to the finite spatial size of Spain and the constraint of considering only transitions whose origin and destination check-ins occur on the same day. Focusing on the log-linear part of the distributions, self-similar behavior arises when Twitter-based mobility is analyzed (see figure 5).
Twitter-based inter-city flows can be well modelled by means of the Gravity Law, one of the most widely used methods to represent human mobility [1, 19], with applications in many fields such as urban planning [23], traffic engineering [4], and transportation problems [9]. The Gravity Law is also the solution to the problem of maximizing the entropy of the particle distribution among all the possible trips using statistical mechanics techniques [2, 22]. Recently it has also been used as a model for human mobility based on cell phone traces [10, 20, 21] and social media data at a global scale [13] and at the inter-city level [14].
The Gravity Model for human mobility assumes that the flows between cities can be explained by the expression

$$T^{grav}_{ij} = \frac{P_i^{\alpha_1} P_j^{\alpha_2}}{d_{ij}^{\beta}} \qquad (2)$$

where $T^{grav}_{ij}$ is the flow, in number of people, between cities i and j, $d_{ij}$ is the geographical distance, and $P_i$ and $P_j$ are the populations of the respective cities.
Given the data, we can obtain the parameters of the model by weighted least squares minimization:

$$\alpha_1^{*}, \alpha_2^{*}, \beta^{*} = \arg\min_{\alpha_1, \alpha_2, \beta} \frac{1}{N} \sum_{ij} w_{ij} \left( T_{ij} - T^{grav}_{ij} \right)^2 \qquad (3)$$

where N is the total number of connections in the mobility graph andwi j is a weight proportional to the number of observed transitionsbetween i and j In particular we find that taking wi j = T 13
i j givesthe best performance in the model
In our case this model fits quite accurately the inter-city mobility based on Twitter GPS check-ins (see table 1). Even though we do not assume $T_{ij}$ to be symmetric, the exponents of the populations are similar, indicating that we observe similar flows in both directions between i and j.
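To make equations (2) and (3) concrete, the following sketch evaluates the gravity flow and the weighted squared-error objective, and picks the exponents that minimize it over a coarse grid. The grid search is a stand-in chosen for simplicity (the text does not say which optimizer was used), the weights are supplied by the caller, and the three cities, their populations, and their distances are invented for the example.

```python
import itertools

def gravity_flow(p_i, p_j, d_ij, a1, a2, beta):
    """Equation (2): P_i^a1 * P_j^a2 / d_ij^beta."""
    return (p_i ** a1) * (p_j ** a2) / (d_ij ** beta)

def weighted_loss(obs, params, weights):
    """Equation (3) objective: mean of w_ij * (T_ij - T_ij^grav)^2."""
    a1, a2, beta = params
    total = 0.0
    for (p_i, p_j, d_ij, t_ij), w in zip(obs, weights):
        total += w * (t_ij - gravity_flow(p_i, p_j, d_ij, a1, a2, beta)) ** 2
    return total / len(obs)

def grid_search(obs, weights, grid):
    """Try every (a1, a2, beta) combination on the grid; return the best."""
    return min(itertools.product(grid, repeat=3),
               key=lambda p: weighted_loss(obs, p, weights))

# Synthetic flows generated from known exponents (illustrative values only).
populations = {"A": 1000.0, "B": 5000.0, "C": 20000.0}
distances = {("A", "B"): 10.0, ("A", "C"): 50.0, ("B", "C"): 30.0}
true_params = (0.75, 0.5, 1.0)
obs = []
for (i, j), d in distances.items():
    for o, t in ((i, j), (j, i)):  # flows are directed, so add both directions
        flow = gravity_flow(populations[o], populations[t], d, *true_params)
        obs.append((populations[o], populations[t], d, flow))
weights = [1.0] * len(obs)  # uniform weights, an assumption for this sketch
grid = [0.25, 0.5, 0.75, 1.0, 1.25]
best = grid_search(obs, weights, grid)
```

On real data a continuous optimizer would normally replace the grid, and the weights would be set from the observed number of transitions as described above.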
S3 Community structures in inter-city mobility graph
Typically, complex networks exhibit community structure, that is, there are subsets of nodes that are more densely connected among themselves than to the rest of the nodes. In mobility networks, whose nodes correspond to geographical areas, these communities are interpreted as zones with high common activity and tend to be constrained by geographical and political barriers. We check whether this is also observed in our dataset by running six state-of-the-art community detection algorithms: FastGreedy [5], Walktrap [16], Infomap [18], MultiLevel [3], Label Propagation [17], and Leading Eigenvector [15]. These six algorithms yield different community structures in terms of number of communities, average community size, or modularity (see table 3). Members (municipalities) of the resulting communities are spatially connected except in a few cases, as figure 7 shows. We test the statistical robustness of the obtained communities by randomly removing a proportion p of the original links and running the algorithms on the new graph $G_p$. We consider communities robust when those obtained for the original network G and for $G_p$ are highly similar. To compare two arbitrary community memberships we use the Normalized Mutual Information (NMI) method described in [6], which returns 0 when two memberships are totally different and 1 when they are equal. We compute the NMI for
Figure 5. Probability distributions (density) for the different properties of daily trips in the Twitter dataset: trip distance (km), number of trips, and elapsed time (secs). Dashed lines correspond to power-law fits with exponents −1.67, −2.43, and −0.62 respectively.
each chosen algorithm run on G and $G_p$ for p between 1% and 10%, concluding that the obtained community structures are robust because they are not broken when some randomly chosen links are removed (see table 2).
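The robustness check can be sketched in two pieces: a link-removal helper that builds $G_p$, and an NMI function that compares two community memberships. The normalization used here, $2I(A;B)/(H(A)+H(B))$, is one common form and is assumed rather than taken from the text; the community detection step itself is left out, so the two label lists in the example are hypothetical outputs of some algorithm on G and $G_p$.

```python
import math
import random
from collections import Counter

def nmi(labels_a, labels_b):
    """Normalized mutual information between two memberships (0 to 1)."""
    n = len(labels_a)
    ca, cb = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    h_a = -sum(c / n * math.log(c / n) for c in ca.values())
    h_b = -sum(c / n * math.log(c / n) for c in cb.values())
    mi = sum(c / n * math.log(c * n / (ca[a] * cb[b]))
             for (a, b), c in joint.items())
    if h_a + h_b == 0:
        return 1.0  # both memberships put every node in a single community
    return 2 * mi / (h_a + h_b)

def remove_links(edges, p, seed=0):
    """Return a copy of the edge list with a proportion p removed at random."""
    rng = random.Random(seed)
    keep = len(edges) - round(p * len(edges))
    return rng.sample(edges, keep)

# Hypothetical memberships of six municipalities on G and on G_p.
original = [0, 0, 0, 1, 1, 1]
perturbed = [0, 0, 1, 1, 1, 1]
score = nmi(original, perturbed)

edges = [(i, i + 1) for i in range(10)]
edges_p = remove_links(edges, 0.1)
```

Because NMI depends only on how nodes are grouped, relabelling the communities (swapping 0 and 1) leaves the score unchanged.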
Figure 6. Penetration rate versus population for both cities and detected communities.
As other works have shown, mobility graph communities are usually interpreted in terms of geographical and political barriers, and a natural question is whether the mobility-based communities are related to any of these barriers. In Spain there are different territorial divisions for administrative purposes. In this work we consider two of them: provinces, defined in the 1978 Constitution, which are 48 different heterogeneous aggregations of municipalities; and counties (comarcas in Spanish terminology), which are traditional aggregations of municipalities mainly based on Spanish orography (rivers, valleys, ridges, etc.), some of which are composed of municipalities from different provinces. We again use the NMI method to compare the community structure given by the algorithms with the administrative limits. Except for the Leading Eigenvector algorithm, the methods return communities that are closely related to provinces (NMI ≈ 0.7), whereas for the county limits higher variability is observed. In this last case, the algorithm most closely related to county limits is Infomap (NMI ≈ 0.83). Therefore Twitter-based mobility summarizes the inter-city flows, showing that these flows are influenced by geographical and political barriers.
S4 Twitter demographics and unemployment rates
Different age groups are not equally represented on Twitter. Recent surveys (2012) in Spain suggest that most (86%) Twitter users are 16 to 44 years old. Comparison of the percentage of users per age group with the total population within the same groups (see figure 8) reveals that age groups above 35 years old are under-represented on Twitter. Thus our Twitter data will be more revealing when trying to describe unemployment in age groups below 44 years old. This is indeed what we find when we build a linear model for the unemployment rate in different age groups with the same Twitter variables: while unemployment rates for ages below 24 can be fitted to a linear model with $R^2 = 0.62$, regression models for unemployment rates for ages between 25 and 44 have $R^2 = 0.52$, and for ages above 44 we get only $R^2 = 0.26$. Table 4 summarizes the results for the regression models of unemployment rates in each age group, showing that our Twitter variables have more explanatory power for ages below 44. Finally, in figure 8 we can see the performance of the model for different age groups, and once again the poor explanatory power of the Twitter variables for the unemployment rate at ages above 44 is evident.
S5 Properties of Twitter variables
Normalization and distributions
Heterogeneity between the values of the variables constructed from Twitter is sizeable but moderate, as the histograms in figure 9 show. We did not find any geographical area with anomalous values in any of the variables considered. Variables are normalized in different ways: both the penetration $\tau_i$ and the misspellers rate $\varepsilon_i$ are defined as the number of users or misspellers per 100000 persons (population); activity variables $\nu_i$ are normalized as the percentage of tweets per time interval; finally, the numbers of tweets that mention a specific term, $\mu_i$, are given per 100000 tweets published in the geographical area.

Figure 7. From left to right and from top to bottom: FastGreedy, Walktrap, Infomap, MultiLevel, Label Propagation, and Leading Eigenvector communities on Twitter-based mobility transitions.
Correlation between variables
Variables are constructed to reflect the behavior of areas in the different dimensions of Twitter: penetration, social or geographical diversity, activity through the day, and content. The correlation between variables does indeed show that variables within each dimension are strongly correlated with one another. As we can see in figure 10, social and geographical diversities are highly correlated, as expected given that the gravity law accurately describes not only the flows of people between geographical areas but also the amount of communication between them. The same behavior is found for the variables in the activity group, while content variables are less correlated. Finally, we find that both the penetration rate $\tau_i$ and the fraction of misspellers $\varepsilon_i$ have a strong correlation with most of the variables.
High correlation between variables might lead to collinearity effects [24] in the linear regression models, that is, some variables with predictive value might have non-significant weights because they explain the same part of the variance. For instance, in Table 5 the misspellers rate has a very strong predictive value, but its p-value is too high to consider it significant. To test this hypothesis, we perform a principal component analysis (PCA) on the independent variables of the regression. Figure 10 shows the loadings of the different variables. The block structure shown in figure 10 results in similar directions of the variables in the first components of the PCA. We observe some groups of variables: on the one hand, geographical and social diversity seem to explain a large part of the variance; on the other hand, we find a perpendicular group of variables formed by temporal activity; finally, penetration rate and misspellers fraction seem to represent a different, independent direction of the data, with high collinearity between them. This might explain the low statistical significance in the models of section 4. In any case, the structure of the correlation matrix and the PCA results show that there is indeed information in all groups of variables, and thus we have taken one variable from each of them for our regression models.
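A simple way to screen for the collinearity discussed above is to compute pairwise Pearson correlations and flag pairs above a threshold. The sketch below does only that screening step (it is not the PCA), and the variable names, values, and the 0.8 threshold are illustrative assumptions rather than the paper's data or criteria.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def collinear_pairs(columns, threshold=0.8):
    """List variable pairs whose absolute correlation reaches the threshold."""
    names = sorted(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                pairs.append((a, b, round(r, 3)))
    return pairs

# Hypothetical per-area values for three variables.
columns = {
    "penetration": [1.0, 2.0, 3.0, 4.0, 5.0],
    "misspellers": [2.1, 3.9, 6.2, 8.0, 9.9],
    "morning_activity": [3.0, 1.0, 4.0, 1.0, 5.0],
}
pairs = collinear_pairs(columns)
```

Keeping one variable from each flagged group is then a pragmatic way to limit unstable coefficients in the regression.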
S6 Misspellers detection
In this work we consider only tweets in Spanish. Since several languages coexist in Spain, depending on the part of the country, the first step is to reduce our Twitter dataset to those tweets that are written in Spanish. This task is carried out using the n-gram based text categorization R library textcat [11]. Then, in order to decide whether a tweet contains a misspelling or not, we need to establish which patterns to look for in our set of tweets. Since we want to be sure that a detected mistake corresponds to a real misspeller, we do not consider the following cases:

• Lack of written accents. People tend to omit accents when writing colloquially.
• Mistakes derived from removing unnecessary letters. The most common cases are removing an h at the beginning of a word (in Spanish the letter h is not pronounced) or replacing the letters qu with k. We understand that these mistakes can be motivated by the length limit of tweets rather than by real misspelling.
• In the same vein, we neglect mistakes produced by removing letters in the middle of a word whose pronunciation can be deduced without them.
• We also do not consider mistakes related to features of specific areas of Spain. For example, in the south the pronunciation of ce and se is the same, which produces a large number of writing mistakes. However, since we want to draw objective and equitable conclusions over the whole Spanish geography, we neglect those misspellings that only appear in a specific area.

Likewise, we consider the following mistakes as real misspellings:

• Adding letters. For example, writing an h at the beginning of a word that starts with a vowel.
• Replacing the special cases mp, mb with the incorrect spellings np, nb.
• Mixing up b with v, g with j, ll with y, and ex with es. These are typical mistakes in Spanish because the pairs have the same or a very close pronunciation.
• Confusing the verb haber with the periphrasis a ver.
• Splitting a word in two, for instance writing the word conmigo as con migo.

This way, our list of misspellings is composed of 617 common mistakes in Spanish that cannot be attributed to the special features of Twitter or to a specific region of Spain. Thus one can expect this selection to provide an accurate and equitable method of detecting misspellers. Under these conditions, the number of users who wrote at least one misspelled word is 27055 (56 over the whole population).

Figure 9. Frequency plots P(x) for each variable constructed from Twitter: tweets mentioning unemployment, employment, and job; penetration rate; social and geographical entropies; misspellers rate; and activity in the morning, afternoon, and night.

Figure 8. Top: percentage of population in each age group from the Spanish Census (dark bars) and from surveys about users on Twitter (light bars). Bottom: performance of the linear models (predicted versus observed unemployment) for each of the age groups: all ages ($R^2 = 0.47$), < 24 ($R^2 = 0.62$), 25–44 ($R^2 = 0.52$), and > 44 ($R^2 = 0.26$).
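The detection rule described in this section, matching tweets against a fixed list of misspellings and counting users with at least one hit, can be sketched as follows. The four entries in `MISSPELLINGS` are hypothetical examples of the included categories (an added h, np/nb in place of mp/mb, a split word); the actual 617-entry list is not reproduced here, and context-dependent cases such as haber versus a ver are not handled by this simple matcher.

```python
import re

# Hypothetical entries for illustration; not the paper's 617-item list.
MISSPELLINGS = frozenset({"habrir", "tanbien", "canbio", "con migo"})

def has_misspelling(text, patterns=MISSPELLINGS):
    """True if any listed misspelling appears as a whole word or phrase."""
    normalized = " ".join(re.findall(r"\w+", text.lower()))
    padded = f" {normalized} "  # padding so matches respect word boundaries
    return any(f" {m} " in padded for m in patterns)

def misspeller_users(tweets, patterns=MISSPELLINGS):
    """Set of users who wrote at least one tweet containing a misspelling."""
    return {user for user, text in tweets if has_misspelling(text, patterns)}

# Hypothetical tweets as (user, text) pairs.
tweets = [
    ("u1", "Voy a habrir la puerta"),  # added h: counted
    ("u2", "Ven conmigo"),             # correct spelling
    ("u3", "Ven con migo, anda"),      # split word: counted
    ("u4", "ola k ase"),               # shortening, deliberately not listed
]
users = misspeller_users(tweets)
```

Dividing the number of such users in an area by its population (per 100000 persons) would give the misspellers rate used in the regression.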
We analyze whether misspellers have different Twitter usage behavior from people who do not make serious mistakes when publishing a tweet. Comparing the average number of tweets, it can be observed that misspellers tend to publish a larger number of tweets than those who did not make mistakes (14471 against 2372). This also emerges when the mean number of misspellings given the total number of tweets is considered. For users with fewer than approximately 30 published tweets in the observation period, the number of misspellings is almost zero, whereas for users who publish more often the mean number of misspellings scales sub-linearly with the number of tweets (exponent ≈ 0.33).

Figure 10. Top: correlation matrix between the variables constructed from Twitter. Each entry in the matrix is depicted as a circle whose size is proportional to the correlation between variables and whose color is blue/red for positive/negative correlations. Blank entries correspond to statistically insignificant correlations at 95% confidence. Bottom: projection of the variables on the first two principal components given by PCA. We observe different groups of variables and collinearity between some of them.
Since we have observed a segmentation of the Twitter population based on how accurately users write, we consider the misspeller rate as a proxy for the educational level of the cities. A large number of previous works in the literature have revealed the relationship between the economic status and the educational level of geographical areas, and therefore it is natural to ask whether the observed misspellers rate is related to the economy as measured by the unemployment rate. To test this hypothesis, we consider cities with more than 5000 inhabitants to avoid subsampled cases. We find a strong positive correlation between the probability of finding a misspeller in a city and the unemployment rate (0.372, 0.491).
Figure 11. Number (red) and probability (blue) of observed misspellings given the number of tweets.

Figure 12. Explanatory power ($R^2$) of the linear regression model when fitted against the unemployment data for different months (October 2012 to June 2014). The gray (orange) area corresponds to the time window in which Twitter data is collected and variables are constructed.
S7 Time window and unemployment
In the definition of the variables we have aggregated the Twitter activity within a 7-month time window (from December 2012 to June 2013). Since unemployment varies significantly over time, we investigate here the correlation and explanatory power of the Twitter variables for the values of unemployment determined at different months throughout the time window in which the Twitter data was collected, and whether the variables collected in that time window are more correlated with past or future values of unemployment. Figure 12 shows the explanatory value of the model when the linear regression is done for values of unemployment of different months before, during, and after the Twitter data time window. Although there is a small seasonal effect along the year, we see that the explanatory power remains around $R^2 = 0.6$, which suggests that our Twitter linear model retains its explanatory power even though unemployment changes considerably throughout the year. It is interesting to note that $R^2$ decays slightly during the summer, which means that our variables are less correlated with summer unemployment. Finally, the unemployment used in the main article is from June 2013, i.e. the last month in the time window used to collect the data.
S8 Demographics do not explain unemployment
Since unemployment rates are very large for young people, a natural question is whether demographic variables alone could explain the heterogeneity of youth unemployment rates found across the geographical areas. To test this we have built four linear models: the first (named Youth model in Table 5) has the rate of young population as its only explanatory variable; the second and third are built only on the Twitter variables considered in the main text (Twitter model (I)) or only on those whose regression coefficients are statistically significant (Twitter model (II)); the fourth is fitted with all the variables (All variables model in Table 5). Table 5 shows the summary of the regression for each model. In terms of the variance explained ($R^2$), the Twitter variables alone ($R^2 = 0.64$) are nearly three times more explanatory than the proportion of young people alone ($R^2 = 0.24$). On the other hand, comparing the $R^2$ of the Twitter model with those of the All variables and Youth models shows that the rate of young population does not add significant explanatory power. This semi-partial analysis shows that our Twitter variables retain a high explanatory power when the effect of the young population rate is controlled for.
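The comparison of nested models described above can be sketched as follows. This is an illustrative example on synthetic data, not the paper's dataset or code: the variable names, coefficients and random seed are invented, and ordinary least squares is fitted with NumPy's `lstsq`.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an ordinary least squares fit of y on X (intercept added)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 for n observations and k predictors (intercept excluded)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Synthetic stand-ins (assumptions): three "Twitter" variables and one
# "youth rate" variable that is partly correlated with the first of them.
rng = np.random.default_rng(0)
n = 128
twitter = rng.normal(size=(n, 3))
youth = 0.3 * twitter[:, 0] + rng.normal(size=n)
y = twitter @ np.array([0.5, -0.4, 0.3]) + 0.1 * youth + rng.normal(scale=0.5, size=n)

r2_youth = r_squared(youth[:, None], y)                    # demographic variable only
r2_twitter = r_squared(twitter, y)                         # Twitter variables only
r2_all = r_squared(np.column_stack([twitter, youth]), y)   # all variables
increment_from_youth = r2_all - r2_twitter                 # gain from adding youth
```

The difference `r2_all - r2_twitter` is the additional variance explained once the demographic variable is added, which is the quantity the semi-partial argument relies on. As a consistency check on the reported figures, `adjusted_r_squared(0.64, 128, 6)` gives about 0.62, in line with the adjusted value listed for Twitter model (I) in Table 5.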
S9 Unemployment models for other geographical areas
While municipalities are very heterogeneous demographically, other administrative areas exist in Spain at larger scales that could be used for our model of unemployment. As mentioned in section 4, the smallest administrative division of Spain we have considered is that of the 8200 municipalities. At larger scales we have the 326 counties (comarcas in Spanish), which are aggregations of municipalities. Finally, the largest geographical scale we considered is defined by the 50 provinces (provincias in Spanish). In this section we compare the performance of our Twitter model for unemployment with the variables defined in those administrative areas and relate it to the geographical communities detected and used in the main paper (see section 4). Not all the areas in each administrative division are considered in the model. To minimize the effect of areas in which the number of geo-tagged tweets is very small, we only consider the 1738 municipalities which have a Twitter population $\pi > 10$. Similarly, we only consider the 198 counties with $\pi > 100$. As we can see in Table 6, the model has a large explanatory power for areas equal to or bigger than counties. As expected, $R^2$ increases as the number of areas in the model gets smaller, but the level of geographical description is very low for provinces, for example. The best performance (high $R^2$ and high level of geographical description) is attained at the level of the detected communities.
S10 Relative importance of the variables
To assess the relative importance of the variables in the unemployment model we have used several methods. They all give qualitatively the same results, with some variations for the statistically insignificant variables. Specifically, we have used:
1. (weight) The relative weight of the absolute values of the coefficients obtained in the linear regression when variables are scaled to have mean zero and variance one.
2. (lmg) The $R^2$ contribution averaged over orderings of the regressors, as proposed by Lindeman, Merenda and Gold.
Figure 13. Relative importance of the variables (in percentage) in the unemployment model for the different ways of calculating it (weight, first, lmg and pmvd).
3. (pmvd) The PMVD metric introduced by Feldman, which is also an average over orderings but with data-dependent weights.
4. (first) The univariate $R^2$ values from regression models with one variable only.
All these metrics are obtained using the relaimpo R package [12]. The results for the youth unemployment model are shown in figure 13, where we can see that the different methods yield similar relative importance of the variables, except perhaps for the diversity of mobility flows, a variable with a non-significant weight in the regression model.
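The two simplest metrics in the list, "weight" and "first", can be sketched directly. The example below is a Python illustration under stated assumptions and is not a port of the relaimpo package: the function names are hypothetical, both results are normalized to percentages for comparability, and the more involved lmg and pmvd averages over orderings are not implemented here.

```python
import numpy as np

def weight_importance(X, y):
    """'weight': share (in %) of |coefficient| after standardizing predictors."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)   # mean zero, variance one
    A = np.column_stack([np.ones(len(y)), Z])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    w = np.abs(coef[1:])                        # drop the intercept
    return 100.0 * w / w.sum()

def first_importance(X, y):
    """'first': univariate R^2 of each predictor, normalized to percentages.

    For a one-variable linear regression with intercept, R^2 equals the
    squared Pearson correlation between the predictor and the response.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    r2 = np.array([np.corrcoef(X[:, j], y)[0, 1] ** 2
                   for j in range(X.shape[1])])
    return 100.0 * r2 / r2.sum()
```

With uncorrelated predictors the two metrics rank variables the same way; they can disagree when predictors are collinear, which is one reason methods that average over orderings are reported alongside them.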
References
[1] B. Ashtakala. Generalized power model for trip distribution. Transportation Research Part B: Methodological, 21(1):59–67, 1987.
[2] Michel Bierlaire. Mathematical models for transportation demand analysis. Transportation Research Part A: Policy and Practice, 31(1):86–86, 1997.
[3] Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.
[4] Harry J. Casey Jr. The law of retail gravitation applied to traffic engineering. Traffic Quarterly, 9(3), 1955.
[5] Aaron Clauset, Mark E. J. Newman, and Cristopher Moore. Finding community structure in very large networks. Physical Review E, 70(6):066111, 2004.
[6] Leon Danon, Albert Diaz-Guilera, Jordi Duch, and Alex Arenas. Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment, 2005(09):P09008, 2005.
[7] Servicio Público de Empleo Estatal (SEPE). Spanish registered unemployment. http://www.sepe.es/contenidos/que_es_el_sepe/estadisticas/index.html
[8] Instituto Nacional de Estadística. Spanish 2011 census. http://www.ine.es/censos2011_datos/cen11_datos_inicio.htm
[9] Suzanne P. Evans. A relationship between the gravity model for trip distribution and the transportation problem in linear programming. Transportation Research, 7(1):39–61, 1973.
[10] Paul Expert, Tim S. Evans, Vincent D. Blondel, and Renaud Lambiotte. Uncovering space-independent communities in spatial networks. Proceedings of the National Academy of Sciences, 108(19):7663–7668, 2011.
[11] Ingo Feinerer, Christian Buchta, Wilhelm Geiger, Johannes Rauch, Patrick Mair, and Kurt Hornik. The textcat package for n-gram based text categorization in R. Journal of Statistical Software, 52(6):1–17, 2013.
[12] Ulrike Grömping. Relative importance for linear regression in R: the package relaimpo. Journal of Statistical Software, 17(1):1–27, 2006.
[13] Bartosz Hawelka, Izabela Sitko, Euro Beinat, Stanislav Sobolevsky, Pavlos Kazakopoulos, and Carlo Ratti. Geo-located Twitter as the proxy for global mobility patterns. arXiv preprint arXiv:1311.0680, 2013.
[14] Yu Liu, Zhengwei Sui, Chaogui Kang, and Yong Gao. Uncovering patterns of inter-urban trips and spatial interactions from check-in data. arXiv preprint arXiv:1310.0282, 2013.
[15] Mark E. J. Newman. Finding community structure in networks using the eigenvectors of matrices. Physical Review E, 74(3):036104, 2006.
[16] Pascal Pons and Matthieu Latapy. Computing communities in large networks using random walks. In Computer and Information Sciences - ISCIS 2005, pages 284–293. Springer, 2005.
[17] Usha Nandini Raghavan, Réka Albert, and Soundar Kumara. Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 76(3):036106, 2007.
[18] Martin Rosvall and Carl T. Bergstrom. Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences, 105(4):1118–1123, 2008.
[19] Morton Schneider. Gravity models and trip distribution theory. Papers in Regional Science, 5(1):51–56, 1959.
[20] Filippo Simini, Marta C. González, Amos Maritan, and Albert-László Barabási. A universal model for mobility and migration patterns. Nature, 484(7392):96–100, 2012.
[21] Chaoming Song, Tal Koren, Pu Wang, and Albert-László Barabási. Modelling the scaling properties of human mobility. Nature Physics, 6(10):818–823, 2010.
[22] Alan Geoffrey Wilson. Entropy in urban and regional modelling. Pion Ltd, 1970.
[23] Alan Geoffrey Wilson. Urban and regional models in geography and planning. 1974.
[24] Svante Wold, Arnold Ruhe, Herman Wold, and W. J. Dunn III. The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses. SIAM Journal on Scientific and Statistical Computing, 5(3):735–743, 1984.
Gravity Model
Parameter   Description                                Spain
α1          Origin exponent                            0.477*** (0.002)
α2          Destination exponent                       0.478*** (0.002)
β           Distance exponent                          1.05*** (0.0035)
R2          Goodness of fit                            0.797
φ           Correlation between T_ij and T^gra_ij      0.826
Table 1. Description of the parameters for the Gravity Law Model in geo-tagged social media data for Spain. (***) means significance p < 0.0001.
NMI between G and Gp for different p
Algorithm   p = 0.01   0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09    0.1
FG          0.995      0.992   0.989   0.983   0.981   0.977   0.983   0.969   0.980   0.959
WT          0.954      0.959   0.950   0.954   0.945   0.948   0.947   0.935   0.926   0.931
IM          0.988      0.981   0.980   0.981   0.978   0.974   0.975   0.970   0.969   0.966
ML          0.994      0.978   0.979   0.983   0.948   0.934   0.972   0.952   0.973   0.947
LP          0.906      0.908   0.911   0.915   0.895   0.907   0.907   0.893   0.905   0.904
LE          0.960      0.957   0.956   0.859   0.910   0.892   0.908   0.858   0.885   0.884
Table 2. NMI measure comparing G and Gp.
Communities Stats
Algorithm   ⟨|Ni|⟩i     max|Ni|   |Ni|   Modularity   NMI P   NMI C
FG          309.696     1385      23     0.726        0.712   0.590
WT          9.262       433       769    0.417        0.744   0.757
IM          21.011      143       339    0.758        0.770   0.831
ML          323.772     1132      22     0.800        0.717   0.599
LP          22.052      750       323    0.732        0.749   0.761
LE          1017.571    5344      7      0.381        0.264   0.205
Table 3. Statistics of the communities Ni returned by the six algorithms. NMI P refers to the comparison between communities and provinces, whereas NMI C considers counties instead of provinces.
Variable                 All ages          < 24              25−44             > 44
(Intercept)              0.11**** (0.02)   0.10*** (0.03)    0.20*** (0.03)    0.20*** (0.035)
Penetration rate         3.23* (1.41)      8.57*** (2.22)    6.28** (2.17)     2.40 (2.77)
Geographical diversity   0.03 (0.02)       0.15*** (0.04)    0.08* (0.04)      0.06 (0.05)
Social diversity         −0.03* (0.01)     −0.03 (0.02)      −0.05* (0.02)     −0.06* (0.03)
Morning activity         −0.69* (0.26)     −1.30** (0.42)    −1.53*** (0.41)   −1.19* (0.52)
Misspellers rate         11.56 (8.13)      31.51* (12.78)    15.46 (12.48)     23.60 (15.94)
Employment mentions      −1.80 (6.27)      3.17 (9.86)       −9.94 (9.64)      2.71 (12.3)
R2                       0.47              0.64              0.55              0.29
Adj. R2                  0.44              0.62              0.52              0.26
***p < 0.001, **p < 0.01, *p < 0.05
Table 4. Regression table for the different models in which unemployment for different age groups is fitted. The All ages model is the fit to the general rate of unemployment in each geographical area, while the other models are for the rates of unemployment in the groups of less than 24 years, between 25 and 44 years, and above 44 years.
Variable                 All variables     Youth model      Twitter model (I)   Twitter model (II)
(Intercept)              0.06 (0.03)       −0.02 (0.03)     0.10*** (0.03)      0.09*** (0.027)
Young pop. rate          0.66* (0.30)      2.20*** (0.35)
Penetration rate         8.20*** (2.25)                     8.57*** (2.22)      8.62*** (2.21)
Geographical diversity   0.14*** (0.04)                     0.15*** (0.04)      0.12*** (0.03)
Social diversity         −0.02 (0.02)                       −0.03 (0.02)
Morning activity         −1.42*** (0.41)                    −1.30** (0.42)      −1.28** (0.41)
Misspellers rate         23.95 (13.09)                      31.51* (12.78)      32.28* (12.71)
Employment mentions      0.34 (9.81)                        3.17 (9.86)
R2                       0.65              0.24             0.64                0.63
Adj. R2                  0.63              0.24             0.62                0.62
***p < 0.001, **p < 0.01, *p < 0.05
Table 5. Regression table for the different statistical models. The All variables model includes both the Twitter variables and the rate of young population. Twitter model (I) includes only the variables described in the main article, while Twitter model (II) only includes those variables which are significant (p < 0.05) in Twitter model (I).
Variable                 Communities       Municipalities    Counties          Provinces
(Intercept)              0.10*** (0.03)    0.16*** (0.01)    0.11*** (0.03)    0.11* (0.05)
Penetration rate         8.57*** (2.22)    4.01*** (0.59)    9.12*** (1.81)    10.47*** (1.97)
Geographical diversity   0.15*** (0.04)    0.02 (0.01)       0.12*** (0.03)    0.08 (0.07)
Social diversity         −0.03 (0.02)      −0.01 (0.01)      −0.01 (0.02)      −0.03 (0.07)
Morning activity         −1.30** (0.42)    −1.16*** (0.14)   −1.49*** (0.39)   −1.03 (0.88)
Misspellers rate         31.51* (12.78)    14.40*** (2.51)   14.09 (10.02)
Employment mentions      3.17 (9.86)       −0.71 (0.89)      2.41 (8.86)       −3.17 (12.29)
Number of points         128               1738              198               50
R2                       0.64              0.22              0.55              0.65
Adj. R2                  0.62              0.21              0.54              0.61
***p < 0.001, **p < 0.01, *p < 0.05
Table 6. Regression table for the unemployment linear regression model at different levels of geographical areas. In the Provinces model the misspellers rate has been removed from the model due to its large collinearity with the penetration rate.
Figure 2. Examples of different behaviour in the observed variables and the unemployment. In A we observe that two cities with different unemployment levels have different temporal activity patterns. Panel C shows how communities (red) with distinct entropy levels of social communication with other communities (blue) may have different unemployment intensity: the left map shows a highly focused communication pattern (low entropy), while the right map corresponds to a community with a diverse communication pattern (high entropy). Finally, panel B shows some examples of detected misspellings in our database using 618 incorrect expressions (see SI Section 6), such as "Con migo", "Aver" or "llendo".
of metrics related to social media activity. Some of those metrics are already reported in the literature, while others are introduced in this work. Specifically, we consider:
• Social media technology adoption: we can use the Twitter penetration rate $\tau_i = \pi_i / P_i$ in each area $i$ as a proxy for technology adoption. Recent works have found a positive correlation between $\tau_i$ and GDP at the country level [23]. However, in our data we find the opposite relationship (see figure 3), namely that the larger the penetration rate, the higher the unemployment, which suggests that the impact of technology adoption at the country scale differs from what happens within an (industrialized) country, where the technology to access social media is commoditized.
• Social media activity: regions with very different economic situations should exhibit different patterns of activity during the day. Since working, leisure, family, shopping, etc. activities happen at different times of the day, we might observe different daily patterns in regions with different socio-economic status. For example, we hypothesize that communities with low levels of unemployment will tend to have higher activity levels at the beginning of a typical weekday. This is indeed what we find: figure 2A shows the hourly fraction of tweets during workdays for two communities with very different rates of unemployment. As we can observe, the two profiles are quite different, and in the case of low unemployment we find a strong peak of activity between 8 and 11am (morning) and lower activity during the afternoons and nights. We encode this finding in $\nu^{mrng}_i$, $\nu^{aftn}_i$ and $\nu^{ngt}_i$, the total fraction of tweets happening in geographical area $i$ between 8am and 10am, 3pm and 5pm, and 12am and 3am, respectively. Figure 3 shows a strong negative correlation between $\nu^{mrng}_i$ and the unemployment for the communities in our database, and a positive correlation with $\nu^{aftn}_i$ and $\nu^{ngt}_i$.
• Social media content: some works have observed a correlation between the economic situation of countries and the frequency of words related to work conditions [22] or of forward-looking searches [21]. In our case we also find a moderate positive correlation between the observed unemployment and the fraction of tweets $\mu_i$ mentioning job or unemployment terms, while the correlation is negative for the number of mentions of employment or the economy. However, we have also tried a different approach, measuring the relation between the way of writing and the educational level [37]. To this end we build a list of 618 misspelled Spanish expressions and extract the tweets of the dataset containing at least one of these expressions (see SI section 6 for further details about how they were collected). We only consider tweets in Spanish, detected with an N-gram based algorithm, and only misspellings that cannot be justified as abbreviations. Finally, we compute for every region the proportion $\varepsilon_i$ of misspellers among the Twitter population. If the fraction of misspellers per geographical area is a proxy for the educational level of that region, we expect a positive correlation between $\varepsilon_i$ and unemployment. Indeed, we find (see figure 3) a strong correlation between the fraction of misspellers and unemployment.
• Social media interactions and geographical flow diversity: following the ideas in [14], which correlated the economic development of an area with the diversity of its communications with other areas, we consider all tweets mentioning another user and take them as a proxy for communication between users. We then compute the number of communications $w_{ij}$ between areas $i$ and $j$ as the number of mentions between users in those areas. To measure the diversity we use, as in [14], the normalized informational entropy (Entropy 1) $S^u_i = -\sum_j p_{ij} \log p_{ij} / S^r_i$, where $p_{ij} = w_{ij} / \sum_j w_{ij}$ and (Entropy 2) $S^r_i = \log k_i$, with $k_i$ the number of different areas with which users in area $i$ have interacted. As in [14], we find that areas with large unemployment have less diverse communication patterns than areas with low unemployment. This translates into a strong negative correlation between $S^u_i$ and the unemployment (see figure 3). Similar ideas are applied to the flows of people between areas, to investigate the diversity of the geographical flows through the entropy $S_i = -\sum_j p_{ij} \log p_{ij} / S^r_i$, where $p_{ij} = T_{ij} / \sum_j T_{ij}$ and $S^r_i = \log(k_i)$, with $k_i$ the number of different areas visited by users that live in area $i$. Figure 3 shows that, as in [19], the correlation of these geographical entropies with economic development is low.
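The normalized entropy used for both the communication and the flow diversity can be sketched in a few lines. This is an illustrative example, not the authors' implementation: the function name is hypothetical, natural logarithms are used (the ratio does not depend on the base), and returning 0 when an area interacts with fewer than two other areas is a convention chosen here, since $\log k_i = 0$ leaves the ratio undefined in that case.

```python
import math

def normalized_entropy(weights):
    """Entropy of the shares p_ij = w_ij / sum_j w_ij, divided by log(k_i).

    weights: counts w_ij (mentions or trips) from area i to each area j.
    Only areas with a positive count contribute to k_i.
    """
    positive = [w for w in weights if w > 0]
    k = len(positive)
    if k < 2:
        return 0.0  # convention of this sketch: undefined ratio treated as 0
    total = sum(positive)
    entropy = -sum((w / total) * math.log(w / total) for w in positive)
    return entropy / math.log(k)

focused = normalized_entropy([97, 1, 1, 1])     # most interactions with one area
diverse = normalized_entropy([10, 10, 10, 10])  # interactions spread evenly
```

A value close to 1 means interactions are spread evenly over the areas reached, while a value close to 0 means they are concentrated on a few of them, which corresponds to the "diverse" and "focused" patterns described for Figure 2C.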
Normalization of variables is discussed in SI section 5. We have also studied the correlation between the variables considered. As expected, variables in each group show moderate correlations between them. However, inspection of the correlation matrix and a Principal Component Analysis of the variables show that there is information (as percentage of variance in the data) in each of the groups of variables (see SI section 5). Because of these two facts, we restrict our analysis to the variables within each group with the highest correlation with unemployment, namely the penetration rate $\tau_i$, the social and mobility diversity variables $S^u_i$ (Entropy 1 in each case), the morning activity $\nu^{mrng}_i$, the fraction of misspellers $\varepsilon_i$ and the fraction of employment-related tweets $\mu^{emp}_i$.
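For the temporal activity variables, a minimal sketch of the computation is given below. It assumes that tweets have already been restricted to workdays and converted to a local hour of day, and it treats the three windows as half-open intervals (8 to 10, 15 to 17 and 0 to 3); the names `WINDOWS` and `activity_fractions` are hypothetical.

```python
# Hour windows for the three activity variables (half-open, an assumption).
WINDOWS = {
    "mrng": (8, 10),   # 8am to 10am
    "aftn": (15, 17),  # 3pm to 5pm
    "ngt": (0, 3),     # 12am to 3am
}

def activity_fractions(hours):
    """Fraction of an area's tweets falling in each time window.

    hours: iterable of integer local hours (0-23), one per tweet.
    """
    hours = list(hours)
    if not hours:
        return {name: 0.0 for name in WINDOWS}
    n = len(hours)
    return {
        name: sum(lo <= h < hi for h in hours) / n
        for name, (lo, hi) in WINDOWS.items()
    }

# Toy example: eight tweets from one area.
demo_activity = activity_fractions([8, 9, 9, 15, 1, 12, 20, 23])
```

Dividing by the total number of tweets in the area, rather than by the number of users, keeps the three values comparable between areas with very different Twitter populations.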
3 Explanatory power of social media in unemployment
The four previous groups of variables are fingerprints of human behavior reflected in Twitter usage habits. As we observed in figure 3, all of them exhibit statistically strong correlations with unemployment. The question we address in this section is whether those variables suffice to explain the observed unemployment (their explanatory power), and which of them are the most important (i.e. contribute more explanatory power than the others). Note that we are not claiming a causal arrow between the measures built in the previous section and the unemployment rate, but only exploring whether they can be used as alternative indicators with a real translation in the economy.
Figure 4 shows the result of a simple linear regression model for the observed unemployment for ages below 25 years as a function of the variables most correlated with unemployment. The model has a significant $R^2 = 0.62$, showing that a large part of the unemployment can be explained by the behavioral variables extracted from Twitter. However, not all the variables weigh equally in the model: the penetration rate, geographical diversity, morning activity and fraction of misspellers account for up to 92% of the explained variance, while social diversity and the number of employment-related tweets are not statistically significant (see SI section 10 for the methods used to determine the relative importance of the variables). It is interesting to note that while social diversity obtained from mobile phone communications was a key variable in the explanation of deprivation indexes in [14, 19], the communication diversity of Twitter users seems to play a minor role in explaining the heterogeneity of unemployment in Spain.
Similar explanatory power is found for other age groups: $R^2 = 0.44$ for all ages and $R^2 = 0.52$ for ages between 25 and 44 years. However, the model degrades for ages above 44 years ($R^2 = 0.26$), suggesting that our variables mainly describe the behavior of the age groups most represented in Twitter, namely those below 44 years old. On the other hand, since our Twitter variables seem to describe the behavior of young people, we have investigated whether Twitter-constructed variables have similar explanatory value (in terms of $R^2$) to simple census demographic variables for young people. Regression models that also include the young population rate yield only a minor improvement, $R^2 = 0.65$, while the young population rate alone gives $R^2 = 0.24$, a result which shows that Twitter variables do indeed possess genuine explanatory power beyond their simple demographic representation. Finally, our model has the largest explanatory power for the detected communities, but large $R^2$ values are also found for other geographical areas like counties ($R^2 = 0.54$) and provinces ($R^2 = 0.65$); see SI section 9.
4 Discussion
This work serves as a proof of concept for how a wide range of behavioral features linked to socioeconomic behavior can be inferred from the digital traces that are left by publicly-available social media. In particular, we demonstrate that behavioral features related to unemployment can be recovered from the digital exhaust left by the microblogging network Twitter. First of all, Twitter geolocalized traces together with off-the-shelf community detection algorithms render an optimal partition of a country for economic activity, showing the remarkable power of social media to understand and unveil economic behavior at a country scale. This insight is likely to apply to other administrative definitions in other countries, especially when considering large cities with an inherently dynamical nature and evolving mobility fluxes, and cities composed of small satellite cities with arbitrary agglomerations or divisions among them (e.g. London, NYC, Singapore). This result is unsurprising: it should be natural to recompute city clusters/communities of activity based on their real-time mobility, which may vary considerably faster than the update rates of mobility and travel surveys [31–33].
Figure 3. A) Correlation coefficient of all the extracted Twitter metrics, grouped by technology adoption (black), geographical diversity (orange), social diversity (light blue), temporal activity (green) and content analysis (dark blue). The metrics shown are the penetration rate; Entropy 1 and Entropy 2 (geo); Entropy 1 and Entropy 2 (social); activity (morning, afternoon, night); misspellers rate; and job, employment, unemployment and economy tweets. Error bars correspond to 95% confidence intervals of the correlation coefficient. The gray area corresponds to the statistical significance thresholds. Panels B, C, D and E show the values of 4 selected variables (penetration rate, Entropy 1 (social), activity (morning) and misspellers rate) in each geographical community against its percentage of unemployment. The size of the points is proportional to the population of each geographical community. Solid lines correspond to linear fits to the data.
Figure 4. A) and B) Performance of the model, showing the predicted unemployment rate versus the observed one for ages below 25 ($R^2 = 0.62$) and for ages between 25 and 44 ($R^2 = 0.52$). Dashed lines correspond to the equality line and ±20% error. C) Percentage of weight for each of the variables (penetration rate, Entropy 1 (geo), Entropy 1 (social), activity (morning), misspellers rate and employment tweets) in the regression model, using the relative weight of the absolute values of the coefficients in the regression model (see SI section 10). Variables marked with * are not statistically significant in the model.
Our main result demonstrates that several key indicators (different penetration rates among regions, fingerprints of the temporal patterns of activity, content lexical correctness and geo-social connectivities among regions) can be extracted from social media and then used to infer unemployment levels. These findings shed light in two directions: first, on how individuals' extensive use of their social channels allows us to characterize cities based on their activity in a meaningful fashion, and second, on how this information can be used to build economic indicators that are directly related to the economy. Regarding the latter, our work is important for understanding how country-scale analysis of social media should consider not only the demographic but also the economic differences between users. As we have shown, users in areas with large unemployment have different mobility, different social interactions and different daily activity than those in low unemployment areas. This intertwined relationship between user behavior and employment should be considered not only in economic analyses derived from social media but also in other applications like marketing, communication, social mobilization, etc.
It is particularly remarkable that Twitter data can provide these accurate results. Twitter is, among the many currently popular social networking platforms, perhaps the noisiest, sparsest and most 'sabotaged' medium: very few users send out messages at a regular rate, most users do not have geolocated information, the social relationships (followers) contain many unused or unimportant links, it is plagued by spam-bots and, last but not least, we have no way to identify the motive, goal or function of the mobility fluxes we are able to extract. These limitations are not particular to our sample, but general to the Twitter data samples employed in the computational social science community. Despite all these caveats, we are able to show that even some simple filtering techniques together with basic statistical regressions yield predictive power about a variable as important as unemployment. Other social media platforms such as Facebook, Google+, Sina Weibo, Instagram, Orkut or Flickr, with more granular and consistent individual data, are likely to provide similar or better results, by themselves or in combination. Further improvements could be obtained with more sophisticated statistical machine learning techniques, some of them even tailored to the peculiarities of social media data. Our work serves to illustrate the tremendous potential of these new digital datasets to improve the understanding of society's functioning at the finest scales of granularity.
The usefulness of our approach must be weighed against the cost and update rate of performing detailed surveys of mobility, social structure and economic performance. Our database is publicly articulated, which means that our analysis could be replicated easily in other countries, other time periods and with different scopes. Naturally, surveys provide more accurate results, but they also consume considerably more financial and human resources, employing hundreds of people and taking months, even years, to complete and be released; they are so costly that countries going through economic recession have considered discontinuing them or altering their update rate in recent times. A particularly problematic aspect of these surveys is that they are "out of sync", i.e. the census may be up to date whereas those same individuals' travel surveys may not be, and therefore drawing inferences between both may be particularly difficult. This is a challenging problem that the immediacy of social media can help ameliorate.
A few questions remain open for further investigation. How can traditional surveys and social media digital traces be best combined to maximize their predictive ability? Can social media provide a reliable leading indicator of unemployment and, in general, of economic surveys? How much reliable lead time is possible, if any? As we have found, Twitter penetration and educational levels are correlated with unemployment, but these levels are unlikely to change rapidly enough to describe or anticipate changes in the economy or unemployment. However, other indicators like daily activity, social interactions and geographical mobility are more connected with our daily activity, and perhaps they have more predictive power to show and/or anticipate sudden changes in employment. The relationship between unemployment and individual and group behavior may help contextualize the multiple factors affecting the socioeconomic well-being of a region: while penetration, content, daily activity and mobility diversity seem to be highly correlated with unemployment in Spain, different weights for each group of traces might be expected in other countries [14]. Finally, digital traces could serve as an alternative (sometimes the only one available) where surveys are lacking in poor or remote areas [20, 27]. Another interesting avenue of research involves the use of social media to detect mismatches between the real (hidden, underground) economy and the officially reported one [38].
Most importantly, the immediacy of social media may also allow governments to better measure and understand the effect of policies, social changes and natural or man-made disasters on the economic status of cities in almost real time [18, 39]. These new avenues for research provide great opportunities at the intersection of the economic, social and computational sciences that originate from these new, widespread, inexpensive datasets.
Acknowledgments
We would like to thank Kristina Lerman, Lada Adamic, James Fowler, Daniel Villatoro and Ricardo Herranz for stimulating discussions, and Yuri Kryvasheyeu and Thomas Bochynek for their critical reading of the manuscript. This work was partially supported by Spanish Ministry of Science and Technology Grant FIS2013-47532-C3-3-P (to A. L., M. G. H. and E. M.). Manuel Cebrian is funded by the Australian Government as represented by the Department of Broadband, Communications and Digital Economy and the Australian Research Council through the ICT Centre of Excellence program.
References
[1] Becker, G. S. (1976) The economic approach to human behavior (University of Chicago Press).
[2] Granovetter, M. (1985) Economic action and social structure: the problem of embeddedness. American Journal of Sociology pp 481–510.
[3] Camerer, C. F., Loewenstein, G. & Rabin, M. (2011) Advances in behavioral economics (Princeton University Press).
[4] Glaeser, E. L., Kallal, H. D., Scheinkman, J. A. & Shleifer, A. (1991) Growth in cities (National Bureau of Economic Research), Technical report.
[5] Bettencourt, L. M., Lobo, J., Helbing, D., Kuhnert, C. & West, G. B. (2007) Growth, innovation, scaling, and the pace of life in cities. Proceedings of the National Academy of Sciences 104, 7301–7306.
[6] Batty, M. (2008) The size, scale, and shape of cities. Science 319, 769–771.
[7] Milgram, S. (1974) The experience of living in cities. Crowding and behavior 167, 41.
[8] Pan, W., Ghoshal, G., Krumme, C., Cebrian, M. & Pentland, A. (2013) Urban characteristics attributable to density-driven tie formation. Nature Communications 4.
[9] Gonzalez, M. C., Hidalgo, C. A. & Barabasi, A.-L. (2008) Understanding individual human mobility patterns. Nature 453, 779–782.
[10] Calabrese, F., Diao, M., Lorenzo, G. D., Jr, J. F. & Ratti, C. (2013) Understanding individual mobility patterns from urban sensing data: A mobile phone trace example. Transportation Research Part C: Emerging Technologies 26, 301–313.
[11] Cheng, Z., Caverlee, J., Lee, K. & Sui, D. Z. (2011) Exploring Millions of Footprints in Location Sharing Services (AAAI, Menlo Park, CA, USA).
[12] Cho, E., Myers, S. A. & Leskovec, J. (2011) Friendship and mobility: user movement in location-based social networks. KDD '11 (ACM, New York, NY, USA) pp 1082–1090.
[13] Sun, L., Jin, J. G., Axhausen, K. W., Lee, D.-H. & Cebrian, M. (2014) Quantifying long-term evolution of intra-urban spatial interactions. arXiv preprint arXiv:1407.0145.
[14] Eagle, N., Macy, M. & Claxton, R. (2010) Network diversity and economic development. Science 328, 1029–1031.
[15] Henrich, J., Boyd, R., Bowles, S., Camerer, C., Fehr, E., Gintis, H. & McElreath, R. (2001) In search of homo economicus: behavioral experiments in 15 small-scale societies. American Economic Review pp 73–78.
[16] Krieger, N., Williams, D. R. & Moss, N. E. (1997) Measuring social class in US public health research: concepts, methodologies, and guidelines. Annual Review of Public Health 18, 341–378.
[17] Groves, R. M., Fowler Jr, F. J., Couper, M. P., Lepkowski, J. M., Singer, E. & Tourangeau, R. (2013) Survey methodology (John Wiley & Sons).
[18] Lazer, D., Pentland, A. S., Adamic, L., Aral, S., Barabasi, A. L., Brewer, D., Christakis, N., Contractor, N., Fowler, J., Gutmann, M. et al. (2009) Life in the network: the coming age of computational social science. Science (New York, NY) 323, 721.
[19] Smith, C., Quercia, D. & Capra, L. (2013) Finger on the pulse: identifying deprivation using transit flow analysis (ACM) pp 683–692.
[20] Soto, V., Frias-Martinez, V., Virseda, J. & Frias-Martinez, E. (2011) Prediction of socioeconomic levels using cell phone records. UMAP'11 (Springer-Verlag, Berlin, Heidelberg) pp 377–388.
[21] Preis, T., Moat, H. S., Stanley, H. E. & Bishop, S. R. (2012) Quantifying the advantage of looking forward. Scientific Reports 2.
[22] Antenucci, D., Cafarella, M., Levenstein, M. C., Re, C. & Shapiro, M. D. (2014) Using social media to measure labor market flows (National Bureau of Economic Research), Technical report.
[23] Hawelka, B., Sitko, I., Beinat, E., Sobolevsky, S., Kazakopoulos, P. & Ratti, C. (2013) Geo-located twitter as the proxy for global mobility patterns. arXiv preprint arXiv:1311.0680.
[24] Lathia, N., Quercia, D. & Crowcroft, J. (2012) in Pervasive computing (Springer) pp 91–98.
[25] Frias-Martinez, V., Virseda, J. & Frias-Martinez, E. (2010) Socio-economic levels and human mobility.
[26] Gutierrez, T., Krings, G. & Blondel, V. D. (2013) Evaluating socio-economic state of a country analyzing airtime credit and mobile phone datasets. arXiv preprint arXiv:1309.4496.
[27] Smith, C., Mashhadi, A. & Capra, L. (2013) Ubiquitous sensing for mapping poverty in developing countries. Paper submitted to the Orange D4D Challenge.
[28] Erlander, S. & Stewart, N. F. (1990) The gravity model in transportation analysis: theory and extensions (VSP) Vol 3.
[29] Simini, F., Gonzalez, M. C., Maritan, A. & Barabasi, A.-L. (2012) A universal model for mobility and migration patterns. Nature 484, 96–100.
[30] Lenormand, M., Picornell, M., Cantu-Ros, O. G., Tugores, A., Louail, T., Herranz, R., Barthelemy, M., Frias-Martinez, E. & Ramasco, J. J. (2014) Cross-checking different sources of mobility information. arXiv preprint arXiv:1404.0333.
[31] Barthelemy, M. (2011) Spatial networks. Physics Reports 499, 1–101.
[32] Expert, P., Evans, T. S., Blondel, V. D. & Lambiotte, R. (2011) Uncovering space-independent communities in spatial networks. Proceedings of the National Academy of Sciences 108, 7663–7668.
[33] Sobolevsky, S., Szell, M., Campari, R., Couronne, T., Smoreda, Z. & Ratti, C. (2013) Delineating geographical regions with networks of human interactions in an extensive set of countries. PloS one 8, e81707.
[34] Rosvall, M. & Bergstrom, C. T. (2008) Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105, 1118–1123.
[35] Danon, L., Diaz-Guilera, A., Duch, J. & Arenas, A. (2005) Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment 2005, P09008.
[36] ADigital (2013) Uso de twitter en espana 2012. [Online; accessed 1-November-2014].
[37] Davenport, J. R. & DeLine, R. (2014) The readability of tweets and their geographic correlation with education. arXiv preprint arXiv:1401.6058.
[38] Schneider, F., Buehn, A. & Montenegro, C. E. (2011) Shadow economies all over the world: New estimates for 162 countries from 1999 to 2007. Handbook on the shadow economy pp 9–77.
[39] Rutherford, A., Cebrian, M., Dsouza, S., Moro, E., Pentland, A. & Rahwan, I. (2013) Limits of social mobilization. Proceedings of the National Academy of Sciences 110, 6281–6286.
Supporting Information for Social media fingerprints of unemployment
Alejandro Llorente, Manuel García-Herranz, Manuel Cebrian and Esteban Moro
S1 The dataset
Twitter provides an extremely rich and publicly available data set of user interactions, information flows and, thanks to the geo-location of tweets, user movements. Nevertheless, the representativeness of geo-located Twitter as a global source of mobility data has so far received sparse attention. In this sense, while [13] presents a promising and extensive study of global country-to-country movements (mostly driven by tourism), within-country human flows (comprising not only internal tourism but also, to a greater extent than country-to-country travel, visiting and commuting) still need further investigation. Therefore, throughout this work we compare our findings using geo-located Twitter with a similar study using commuting surveys.
For the Twitter analysis we consider almost 146 million geo-located Twitter messages (tweets) collected through the public API provided by Twitter for the continental part of Spain from 29th November 2012 to 10th April 2013. In this dataset we consider that there has been a trip from place l to place k if a user has tweeted in place l and place k consecutively. We only keep those transitions in which the first tweet and the second one are dated on the same day. We filter the trips database to avoid unrealistic transitions and keep only trips with a geographical displacement larger than 1 km (see Methods section). By this method, 138 million trips from 167376 different users are considered in our work.
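As an illustration only, the trip definition above can be sketched as follows. The `Tweet` record, the function names and the use of the haversine distance are assumptions of this sketch, not the authors' implementation, and the additional filter for unrealistic transitions described in the Methods section is not reproduced.

```python
from dataclasses import dataclass
from datetime import datetime
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumption of this sketch)


@dataclass(frozen=True)
class Tweet:
    """Hypothetical minimal record for a geo-located tweet."""
    user: str
    time: datetime
    lat: float
    lon: float


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def extract_trips(tweets, min_km=1.0):
    """Return (origin, destination) pairs of consecutive tweets by the same
    user, dated on the same day and more than min_km apart."""
    trips = []
    ordered = sorted(tweets, key=lambda t: (t.user, t.time))
    for prev, curr in zip(ordered, ordered[1:]):
        if prev.user != curr.user:
            continue  # consecutive records belong to different users
        if prev.time.date() != curr.time.date():
            continue  # keep only same-day transitions
        if haversine_km(prev.lat, prev.lon, curr.lat, curr.lon) > min_km:
            trips.append((prev, curr))
    return trips
```

Counting the resulting pairs by origin and destination municipality would then give the flows between municipalities; the assignment of coordinates to municipalities is left out of the sketch.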
From those trips we construct the mobility flow $T_{ij}$ between municipalities, which measures the number of trips in our database in which the origin is within the boundaries of city i and the destination lies within those of city j.
We also consider population and economic information about the municipalities from the Spanish Census (2011) [8] and unemployment figures from the Public Service of Employment (Servicio Público de Empleo Estatal, SEPE) [7]. In the latter case, registered unemployment (in number of persons) is given for each Spanish municipality by gender, age and month. To get unemployment rates we divide registered unemployment by the total workforce in the municipality, estimated as the number of people aged between 16 and 65 years.
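The rate defined above is a simple ratio; a minimal sketch follows. The function name, the age-indexed input format and the treatment of the 16 to 65 range as inclusive are assumptions made for illustration.

```python
def unemployment_rate(registered_unemployed, population_by_age):
    """Unemployment rate (in %) of a municipality.

    registered_unemployed: number of registered unemployed persons.
    population_by_age: dict mapping age in years to number of residents.
    The workforce is estimated as residents aged 16 to 65 (inclusive here,
    which is an assumption of this sketch).
    """
    workforce = sum(
        count for age, count in population_by_age.items() if 16 <= age <= 65
    )
    if workforce == 0:
        raise ValueError("municipality has no population aged 16 to 65")
    return 100.0 * registered_unemployed / workforce
```

For example, 150 registered unemployed persons in a municipality with 1000 residents aged 16 to 65 give a rate of 15%.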
S2 Twitter as mobility proxy
Considering all of the available transitions in our database, one can compute the distance between origin and destination, the elapsed time of the transition and the number of trips per user, among many other statistics. All of them seem to show a power-law distribution with a cutoff due to the finite spatial size of Spain and the constraint of considering only transitions where the origin and destination check-ins are made on the same day. Focusing on the log-linear part of the distributions, self-similar behaviors arise when Twitter-based mobility is analyzed (see figure 5).
Twitter-based inter-city flows can be well modelled by means of the Gravity Law, which is one of the most widespread methods to represent human mobility [1, 19], with applications in many fields like urban planning [23], traffic engineering [4] or transportation problems [9]. The Gravity Law is also the solution to the problem of maximizing the entropy of the particle distribution among all the possible trips using statistical mechanics techniques [2, 22]. Recently it has also been used as a model for human mobility based on cell phone traces [10, 20, 21] and social media data at a global scale [13] and at the inter-city level [14].
The Gravity Model for human mobility assumes that the flows between cities can be explained by the expression

$$T^{grav}_{ij} = \frac{P_i^{\alpha_1} P_j^{\alpha_2}}{d_{ij}^{\beta}} \qquad (2)$$

where $T^{grav}_{ij}$ is the flow, in terms of number of people, between cities i and j, $d_{ij}$ is the geographical distance between them, and $P_i$ and $P_j$ are the populations of the two cities.
Given the data, we can obtain the parameters of the model by weighted least squares minimization,

$$(\alpha_1^*, \alpha_2^*, \beta^*) = \arg\min_{\alpha_1, \alpha_2, \beta} \frac{1}{N} \sum_{ij} w_{ij} \left(T_{ij} - T^{grav}_{ij}\right)^2 \qquad (3)$$
where N is the total number of connections in the mobility graph and $w_{ij}$ is a weight proportional to the number of observed transitions between i and j. In particular, we find that taking $w_{ij} = T^{13}_{ij}$ gives the best performance in the model.
In our case this model fits the inter-city mobility based on Twitter GPS check-ins quite accurately (see table 1). Even though we do not require $T_{ij}$ to be symmetric, the exponents of the populations are similar, indicating that we observe similar flows in both directions between i and j.
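To make the fitting step concrete, the sketch below estimates the exponents of Eq. (2) by weighted least squares on the logarithm of the flows. This is a simplification chosen so that the problem becomes linear: Eq. (3) minimizes the weighted squared error on the flows themselves, which requires a nonlinear optimizer. The function names and the input format are hypothetical.

```python
from math import log


def solve_linear(a, b):
    """Solve the linear system a x = b by Gaussian elimination with
    partial pivoting (a is a list of rows)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_gravity(flows, weights=None):
    """Fit log T = a1 log P_i + a2 log P_j - beta log d_ij.

    flows: list of (T_ij, P_i, P_j, d_ij) tuples with positive entries.
    weights: optional list of per-pair weights w_ij (defaults to 1).
    Returns (a1, a2, beta).
    """
    if weights is None:
        weights = [1.0] * len(flows)
    rows = [(log(pi), log(pj), -log(d)) for _, pi, pj, d in flows]
    targets = [log(t) for t, _, _, _ in flows]
    k = 3
    # Normal equations (X^T W X) theta = X^T W y of weighted least squares.
    xtwx = [
        [sum(w * r[a] * r[b] for w, r in zip(weights, rows)) for b in range(k)]
        for a in range(k)
    ]
    xtwy = [
        sum(w * r[a] * y for w, r, y in zip(weights, rows, targets))
        for a in range(k)
    ]
    a1, a2, beta = solve_linear(xtwx, xtwy)
    return a1, a2, beta
```

On synthetic flows generated exactly from Eq. (2), the function recovers the exponents used to generate them; on real data the log-space and linear-space fits will generally differ.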
S3 Community structures in inter-city mobility graph
Figure 5. Probability distributions (density) for the different properties of daily trips in the Twitter dataset: trip distance (km), number of trips and elapsed time (secs). Dashed lines correspond to power-law fits with exponents −1.67, −2.43 and −0.62, respectively.

Typically, complex networks exhibit community structure, that is, there are subsets of nodes that are more densely connected among themselves than to the rest of the nodes. In mobility networks, whose nodes correspond to geographical areas, these communities are interpreted as zones with high common activity and tend to be constrained by geographical and political barriers. We check whether this is also observed in our dataset by running 6 state-of-the-art community detection algorithms: FastGreedy [5], Walktrap [16], Infomap [18], MultiLevel [3], Label Propagation [17] and Leading Eigenvector [15]. These six algorithms yield different community structures in terms of number of communities, average community size or modularity (see table 3). Members (municipalities) of the resulting communities are spatially connected, except in a few cases, as figure 7 shows. We test the statistical robustness of the obtained communities by randomly removing a proportion p of the original links and running the algorithms on this new graph $G_p$. We consider the communities robust when those obtained for the original network G and for $G_p$ are highly similar. In order to compare two arbitrary community memberships we use the Normalized Mutual Information (NMI) method described in [6], which returns 0 when two memberships are totally different and 1 when they are equal. We compute the NMI for each chosen algorithm run on G and $G_p$ for p between 1% and 10%, concluding that the obtained community structures are robust because they are not broken when some randomly chosen links are removed (see table 2).
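The two ingredients of this robustness check, random link removal and the comparison of memberships, can be sketched as below. The NMI is written in the form 2 I(A;B) / (H(A) + H(B)); the function names are hypothetical, and the community detection step itself (run on G and on the perturbed graph) is assumed to be provided by an external library.

```python
import random
from collections import Counter
from math import log


def remove_random_links(edges, p, seed=0):
    """Return a copy of the edge list with a proportion p of links removed
    uniformly at random (the seed is fixed only for reproducibility)."""
    rng = random.Random(seed)
    keep = len(edges) - round(p * len(edges))
    return rng.sample(edges, keep)


def nmi(labels_a, labels_b):
    """Normalized mutual information between two community memberships,
    given as equally long lists of community labels, one per node."""
    n = len(labels_a)
    if n != len(labels_b):
        raise ValueError("memberships must cover the same nodes")
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    h_a = -sum(c / n * log(c / n) for c in count_a.values())
    h_b = -sum(c / n * log(c / n) for c in count_b.values())
    mutual = sum(
        c / n * log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    if h_a + h_b == 0:
        return 1.0  # both memberships place every node in one community
    return 2 * mutual / (h_a + h_b)
```

Two memberships that differ only in the names of the communities give an NMI of 1, while statistically independent memberships give 0.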
Figure 6. Penetration rates, as a function of population, for both cities and detected communities.
As other works have shown, mobility graph communities are usually interpreted in terms of geographical and political barriers, and a natural question is whether the mobility-based communities are related to any of these barriers. In Spain there are different territorial divisions for administrative purposes. In this work we consider two of them: provinces, defined in the 1978 Constitution, which are 48 different heterogeneous aggregations of municipalities; and counties (comarcas in Spanish terminology), which are traditional aggregations of municipalities mainly based on Spanish orography (rivers, valleys, ridges, etc.), some of which are composed of municipalities from different provinces. We again use the NMI method to compare the community structure given by the algorithms to the administrative limits. Except for the Leading Eigenvector algorithm, the methods return communities that are quite related to provinces (NMI ≈ 0.7), whereas for the county administrative limits higher variability is observed. In this last case, the algorithm whose communities are most closely related to county limits is Infomap (NMI ≈ 0.83). Therefore, Twitter-based mobility summarizes the inter-city flows, showing that these flows are influenced by geographical and political barriers.
S4 Twitter demographics and unemployment rates
Different age groups are not equally represented in Twitter. Recent surveys (2012) in Spain suggest that most (86%) Twitter users are 16 to 44 years old. Comparison of the percentage of users per age group with the total population within the same groups (see figure 8) reveals that age groups above 35 years old are under-represented in Twitter. Thus our Twitter data will be more revealing when trying to describe unemployment in age groups below 44 years old. This is indeed what we find when we build a linear model for the unemployment rate in different age groups with the same Twitter variables: while unemployment rates for ages below 24 can be fitted to a linear model with $R^2 = 0.62$, regression models for unemployment rates for ages between 25 and 44 have $R^2 = 0.52$, and for ages above 44 we get only $R^2 = 0.26$. Table 4 summarizes the results of the regression models of unemployment rates in each age group, showing that our Twitter variables have more explanatory power for ages below 44. Finally, in figure 8 we can see the performance of the model for different age groups; once again, the poor explanatory power of the Twitter variables for the unemployment rate at ages above 44 years old is evident.
S5 Properties of Twitter variables
Normalization and distributions
Heterogeneity between the values of variables constructed from Twitter is large but moderate, as the histograms in figure 9 show. We did not find any geographical area with anomalous values in any of the variables considered. Variables are normalized in different ways: both the penetration $\tau_i$ and the misspellers rate $\varepsilon_i$ are defined as the number of users or misspellers per 100000 persons (population); activity variables $\nu_i$ are normalized as the percentage of tweets per time interval; finally, the numbers of tweets that mention a specific term, $\mu_i$, are given per 100000 tweets published in the geographical area.

Figure 7. From left to right and from top to bottom: FastGreedy, Walktrap, Infomap, MultiLevel, Label Propagation and Leading Eigenvector communities on Twitter-based mobility transitions.
Correlation between variables
Variables are constructed to reflect the behavior of areas in the different dimensions of Twitter: penetration, social or geographical diversity, activity through the day, and content. Correlation between variables does indeed show that variables within each dimension hold strong correlations with each other. As we can see in figure 10, social and geographical diversities are highly correlated, an expected fact given that the gravity law accurately describes not only the flows of people between geographical areas but also the amount of communication between them. The same behavior is found for the variables in the activity group, while content variables are less correlated. Finally, we find that both the penetration rate $\tau_i$ and the fraction of misspellers $\varepsilon_i$ have a strong correlation with most of the variables.
High correlation between variables might lead to collinearity effects [24] in the linear regression models; that is, some variables with predictive value might have non-significant weights because they explain the same part of the variance. For instance, in Table 5 the misspellers rate has a very strong predictive value, but its p-value is too high to consider it significant. To test this hypothesis we perform a principal component analysis (PCA) on the independent variables of the regression. Figure 10 shows the loadings of the different variables on the first two principal components. The block structure shown in figure 10 results in similar directions of the variables in the first components of the PCA. We observe some groups of variables: on the one hand, geographical and social diversity seem to explain a large part of the variance; on the other hand, we find a perpendicular group of variables formed by temporal activity; finally, penetration rate and misspellers fraction seem to represent a different, independent direction of the data, with high collinearity between them. This might explain the low statistical significance in the models of section 4. In any case, the structure of the correlation matrix and the PCA results show that there is indeed information in all groups of variables, and thus we have to take a variable from each of them for our regression models.
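A simple way to screen for the collinearity discussed in this section is to list the pairs of variables whose Pearson correlation exceeds a threshold. The sketch below is only illustrative: the variable names, the function names and the threshold of 0.8 are assumptions, and it does not replace the PCA used in the text.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def collinear_pairs(variables, threshold=0.8):
    """Return (name_a, name_b, r) for every pair of variables whose absolute
    correlation is at least the threshold.

    variables: dict mapping a variable name to its values over the areas.
    """
    names = sorted(variables)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(variables[a], variables[b])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    return pairs
```

Pairs flagged in this way are candidates for keeping only one representative variable per group in the regression.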
S6 Misspellers detection
Figure 9. Frequency plots, P(x), for each variable constructed from Twitter: tweets (unemployment), penetration rate, Entropy2 (social), Entropy1 (social), Entropy2 (geo), Entropy1 (geo), misspellers rate, tweets (employment), tweets (job), activity (night), activity (morning) and activity (afternoon).

Figure 8. Top: percentage of population in each age group (16–24, 25–34, 35–44, 45–54, 55–64) from the Spanish Census (dark bars) and from surveys about users in Twitter (light bars). Bottom: performance of the linear models (predicted versus observed unemployment, in %) for each of the age groups: all ages ($R^2 = 0.47$), < 24 ($R^2 = 0.62$), 25-44 ($R^2 = 0.52$) and > 44 ($R^2 = 0.26$).

In this work we consider only tweets in Spanish. Since several languages coexist in Spain depending on the part of the country, the first step is to reduce our Twitter dataset to those tweets that are written in Spanish. This task is carried out using the n-gram based text categorization R library textcat [11]. Then, in order to decide whether a tweet contains a misspelling or not, we need to establish some patterns to search for in our set of tweets. Since we want to be sure that a detected mistake corresponds to a real misspeller, we do not consider the following cases:

• Lack of written accents. People tend to avoid writing accents when writing in a colloquial way.

• Mistakes derived from removing unnecessary letters. The most common cases are removing an h at the beginning of a word (in Spanish the letter h is not pronounced) or replacing the letters qu by k. We understand that these mistakes can be motivated by the length limit of tweets and not by a real misspelling.

• Along the same lines, we neglect mistakes produced by removing letters in the middle of a word whose pronunciation can be deduced without them.

• We also do not consider mistakes related to features of specific areas in Spain. For example, in the south the pronunciation of ce and se is the same, which produces a large number of mistakes when writing. However, since we want to draw objective and equitable conclusions over the whole Spanish geography, we neglect those misspellings that only appear in a specific area.

Likewise, we consider the following mistakes as real misspellings:

• Adding letters. For example, writing an h at the beginning of a word that starts with a vowel.

• Replacing the special cases mp, mb with the wrong spellings np, nb.

• Mixing up b with v, g with j, ll with y, and ex with es. These are typical mistakes in Spanish because they have the same or a very close pronunciation.

• Confusing the verb haber with the periphrasis a ver.

• Separating a word into two; for instance, writing the word conmigo as con migo.

This way, our list of misspellings is composed of 617 common mistakes in Spanish that cannot be attributed to the special features of Twitter or a specific region of Spain. Thus one can expect this selection to provide an accurate and equitable method of detecting misspellers. Under these conditions, the number of users who wrote at least one misspelled word is 27055 (56 over the whole population).
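The list-based detection described in this section can be sketched as follows. The three entries of `MISSPELLINGS` are hypothetical examples in the spirit of the rules above (the actual list of 617 mistakes is not reproduced here), and the function names are assumptions. Acute accents are stripped before matching because missing accents are not counted as misspellings.

```python
import re
import unicodedata

# Hypothetical excerpt of a misspelling list, stored without accents.
MISSPELLINGS = {"tanbien", "habrir", "con migo"}


def normalize(text):
    """Lower-case the text and drop acute accents (but keep the tilde of
    the letter n and other marks)."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    stripped = "".join(ch for ch in decomposed if ch != "\u0301")
    return unicodedata.normalize("NFC", stripped)


def has_misspelling(tweet, patterns=MISSPELLINGS):
    """True if the tweet contains any listed mistake as a whole word."""
    text = normalize(tweet)
    return any(
        re.search(r"\b" + re.escape(p) + r"\b", text) for p in patterns
    )


def misspellers(tweets_by_user):
    """Set of users who wrote at least one tweet with a listed mistake."""
    return {
        user
        for user, tweets in tweets_by_user.items()
        if any(has_misspelling(t) for t in tweets)
    }


def rate_per_100k(count, population):
    """Number of users per 100000 persons, as used for the misspellers rate."""
    return 1e5 * count / population
```

Matching against a fixed list keeps the false-positive rate low at the cost of missing mistakes that are not on the list.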
Figure 10. Top: correlation matrix between the variables constructed from Twitter. Each entry in the matrix is depicted as a circle whose size is proportional to the correlation between variables and whose color gives the sign (blue/red for positive/negative correlations). Blank entries correspond to statistically insignificant correlations at 95% confidence. Bottom: projection of the variables on the first two principal components given by PCA. We observe different groups of variables and collinearity between some of them.

We analyze whether misspellers have a different Twitter usage behavior from people who do not make serious mistakes when publishing a tweet. Comparing the average number of tweets, it can be observed that misspellers tend to publish a larger number of tweets than those who did not make mistakes (14471 against 2372). This also emerges when the mean number of misspellings given the total number of tweets is considered. For users with fewer than approximately 30 published tweets in the observation period the number of misspellings is almost zero, whereas for users who publish more often the mean number of misspellings scales sub-linearly with the number of tweets (exponent ≈ 0.33).
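A scaling exponent of this kind is commonly estimated as the slope of a straight line fitted in log-log scale. The sketch below uses ordinary least squares for that purpose; this is an assumption about the procedure, since the fitting method is not specified in the text, and the function name is hypothetical.

```python
from math import log


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x), i.e. the exponent b
    of a power law y = a * x**b (all values must be positive)."""
    lx = [log(x) for x in xs]
    ly = [log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den
```

An exponent below 1 means that the mean number of misspellings grows more slowly than the number of tweets.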
Since we have observed a segmentation of the Twitter population based on how accurately users write, we consider the misspeller rate as a proxy for the educational level of the cities. A large number of previous works in the literature have revealed the relationship between the economic status and the educational level of geographical areas, and therefore it is natural to ask whether the observed misspellers rate is related to the economy, as measured by the unemployment rate. To test this hypothesis we consider cities with more than 5000 inhabitants, to avoid undersampled cases. We find a strong positive correlation between the probability of finding a misspeller in a city and the unemployment rate (0.372, 0.491).
Figure 11. Number (red) and probability (blue) of observed misspellings given the number of tweets.

Figure 12. Explanatory power ($R^2$) of the linear regression model when fitted against the unemployment data for different months (October 2012 to June 2014). The gray (orange) area corresponds to the time window in which Twitter data is collected and variables are constructed.
S7 Time window and unemployment
In the definition of the variables we have aggregated the Twitter activity within a 7-month time window (from December 2012 to June 2013). Since unemployment varies significantly over time, we investigate here the correlation and explanatory power of the Twitter variables for the values of unemployment determined at different months within the same time window in which Twitter data was collected, and whether the variables collected in that time window are more correlated with past or future values of unemployment. Figure 12 shows the explanatory power of the model when the linear regression is done for values of unemployment of different months before, during and after the Twitter data time window. Although there is a small seasonal effect along the year, we see that the explanatory power remains around $R^2 = 0.6$, which suggests that our Twitter linear model retains its explanatory power even though unemployment changes considerably throughout the year. It is interesting to note that $R^2$ decays slightly during the summer, which means that our variables are less correlated with summer unemployment. Finally, the unemployment used in the main article is from June 2013, i.e. the last month in the time window used to collect the data.
S8 Demographics does not explain unemployment
Since unemployment rates are very large for the group of young people, a natural question is whether demographic variables alone could explain the heterogeneity of youth unemployment rates found in the geographical areas. To test this we have built four linear models: the first one (named Youth model in Table 5) has the rate of young population as the only explanatory variable; the second and third are built using either all the Twitter variables considered in the main text (named Twitter model (I)) or just those whose regression coefficients are statistically significant (Twitter model (II)); the fourth one is fitted with all the variables (named All variables model in Table 5). Table 5 shows the summary of the regression for each model. Focusing on the variance explained by each model in terms of $R^2$, considering all Twitter variables is three times more explanatory than considering only the proportion of young people. On the other hand, comparison of $R^2$ for the Twitter model with that for the All variables and Youth models shows that the rate of young population does not provide significant explanatory power. This semi-partial analysis shows that our Twitter variables retain a high explanatory power when the effect of the young population rate is controlled for.
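The comparison of nested models by their $R^2$ can be sketched with a small ordinary least squares routine. The function names and the toy inputs in the usage note are hypothetical; the sketch only shows how the $R^2$ of a model with and without an extra predictor can be compared, and it does not reproduce the models of Table 5.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def r_squared(columns, y):
    """R^2 of an OLS fit of y on the given predictor columns plus an
    intercept. columns: list of lists, one per predictor."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    beta = solve_linear(xtx, xty)
    fitted = [sum(bj * rj for bj, rj in zip(beta, r)) for r in rows]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot
```

Calling `r_squared([twitter_var], y)`, `r_squared([youth_rate], y)` and `r_squared([twitter_var, youth_rate], y)` gives the three quantities compared above; a negligible gain from the first to the third indicates that the youth rate adds little explanatory power.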
S9 Unemployment models for other geographical areas
While municipalities are very heterogeneous demographically, other administrative areas exist in Spain at larger scales that could be used for our model of unemployment. As mentioned in section 4, the smallest administrative division of Spain we have considered is that of the 8200 municipalities. At larger scales we have the 326 counties (comarcas in Spanish), which are aggregations of municipalities. Finally, the largest geographical scale we consider is defined by the 50 provinces (provincias in Spanish). In this section we compare the performance of our Twitter model for unemployment with the variables defined in those administrative areas and relate it to the geographical communities detected and used in the main paper (see section 4). Not all the areas at the different administrative divisions are considered in the model. To minimize the effect of areas in which the number of geo-tagged tweets is very small, we only consider the 1738 municipalities which have a Twitter population π > 10. Similarly, we only consider the 198 counties with π > 100. As we can see in Table 6, the model has a large explanatory power for areas equal to or bigger than counties. As expected, $R^2$ increases as the number of areas in the model gets smaller, but the description level of the model is very low for provinces, for example. The best performance (high $R^2$ and high geographical description level) is attained at the level of the detected communities.
S10 Relative importance of the variables
Figure 13. Relative importance of the variables (in percentage) in the unemployment model for different ways to calculate it (weight, first, lmg and pmvd).

To assess the relative importance of the variables in the unemployment model we have used several methods. They all give qualitatively the same results, with some variations for the statistically insignificant variables. Specifically, we have used:

1. (weight) Relative weight of the absolute values of the coefficients obtained in the linear regression when variables are scaled to have mean zero and variance one.

2. (lmg) Averaging over orderings, as proposed by Lindeman, Merenda and Gold.

3. (pmvd) The PMVD metric introduced by Feldman, which is an average over orderings as well, but with data-dependent weights.

4. (first) The univariate $R^2$ values from regression models with one variable only.

All these metrics are obtained using the relaimpo R package [12]. The results for the youth unemployment model are shown in figure 13, where we can see that the different methods yield similar relative importance of the variables, except perhaps for the diversity of mobility flows, a variable with a non-significant weight in the regression model.
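Of the four metrics listed, the univariate one (first) is the simplest to write down: for a single predictor, the $R^2$ of the regression equals the squared Pearson correlation with the response. The sketch below computes it for each variable and rescales the values to percentages; the rescaling to a total of 100 and the function names are assumptions of this sketch, and the code is not the relaimpo implementation.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def first_importance(variables, y):
    """Univariate R^2 of each variable against y, rescaled so that the
    values add up to 100 (percent).

    variables: dict mapping a variable name to its list of values.
    """
    r2 = {name: pearson(x, y) ** 2 for name, x in variables.items()}
    total = sum(r2.values())
    return {name: 100.0 * value / total for name, value in r2.items()}
```

Unlike lmg or pmvd, this metric ignores the correlations among predictors, so the shares it assigns to collinear variables overlap.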
References
[1] B. Ashtakala. Generalized power model for trip distribution. Transportation Research Part B: Methodological, 21(1):59–67, 1987.
[2] Michel Bierlaire. Mathematical models for transportation demand analysis. Transportation Research Part A: Policy and Practice, 31(1):86–86, 1997.
[3] Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.
[4] Harry J. Casey Jr. The law of retail gravitation applied to traffic engineering. Traffic Quarterly, 9(3), 1955.
[5] Aaron Clauset, Mark E. J. Newman, and Cristopher Moore. Finding community structure in very large networks. Physical Review E, 70(6):066111, 2004.
[6] Leon Danon, Albert Diaz-Guilera, Jordi Duch, and Alex Arenas. Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment, 2005(09):P09008, 2005.
[7] Servicio Público de Empleo Estatal (SEPE). Spanish registered unemployment. http://www.sepe.es/contenidos/que_es_el_sepe/estadisticas/index.html
[8] Instituto Nacional de Estadística. Spanish 2011 census. http://www.ine.es/censos2011_datos/cen11_datos_inicio.htm
[9] Suzanne P. Evans. A relationship between the gravity model for trip distribution and the transportation problem in linear programming. Transportation Research, 7(1):39–61, 1973.
[10] Paul Expert, Tim S. Evans, Vincent D. Blondel, and Renaud Lambiotte. Uncovering space-independent communities in spatial networks. Proceedings of the National Academy of Sciences, 108(19):7663–7668, 2011.
[11] Ingo Feinerer, Christian Buchta, Wilhelm Geiger, Johannes Rauch, Patrick Mair, and Kurt Hornik. The textcat package for n-gram based text categorization in R. Journal of Statistical Software, 52(6):1–17, 2013.
[12] Ulrike Grömping. Relative importance for linear regression in R: the package relaimpo. Journal of Statistical Software, 17(1):1–27, 2006.
[13] Bartosz Hawelka, Izabela Sitko, Euro Beinat, Stanislav Sobolevsky, Pavlos Kazakopoulos, and Carlo Ratti. Geo-located Twitter as the proxy for global mobility patterns. arXiv preprint arXiv:1311.0680, 2013.
[14] Yu Liu, Zhengwei Sui, Chaogui Kang, and Yong Gao. Uncovering patterns of inter-urban trips and spatial interactions from check-in data. arXiv preprint arXiv:1310.0282, 2013.
[15] Mark E. J. Newman. Finding community structure in networks using the eigenvectors of matrices. Physical Review E, 74(3):036104, 2006.
[16] Pascal Pons and Matthieu Latapy. Computing communities in large networks using random walks. In Computer and Information Sciences-ISCIS 2005, pages 284–293. Springer, 2005.
[17] Usha Nandini Raghavan, Reka Albert, and Soundar Kumara. Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 76(3):036106, 2007.
[18] Martin Rosvall and Carl T. Bergstrom. Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences, 105(4):1118–1123, 2008.
[19] Morton Schneider. Gravity models and trip distribution theory. Papers in Regional Science, 5(1):51–56, 1959.
[20] Filippo Simini, Marta C. Gonzalez, Amos Maritan, and Albert-Laszlo Barabasi. A universal model for mobility and migration patterns. Nature, 484(7392):96–100, 2012.
[21] Chaoming Song, Tal Koren, Pu Wang, and Albert-Laszlo Barabasi. Modelling the scaling properties of human mobility. Nature Physics, 6(10):818–823, 2010.
[22] Alan Geoffrey Wilson. Entropy in urban and regional modelling. Pion Ltd, 1970.
[23] Alan Geoffrey Wilson. Urban and regional models in geography and planning. 1974.
[24] Svante Wold, Arnold Ruhe, Herman Wold, and W. J. Dunn III. The collinearity problem in linear regression: the partial least squares (PLS) approach to generalized inverses. SIAM Journal on Scientific and Statistical Computing, 5(3):735–743, 1984.
Gravity Model

| Parameter | Description | Spain |
|---|---|---|
| α1 | Origin exponent | 0.477*** (0.002) |
| α2 | Destination exponent | 0.478*** (0.002) |
| β | Distance exponent | 1.05*** (0.0035) |
| R² | Goodness of fit | 0.797 |
| φ | Correlation between $T_{ij}$ and $T^{gra}_{ij}$ | 0.826 |

Table 1. Description of the parameters for the Gravity Law Model in geo-tagged social media data for Spain. (***) means significance p < 0.0001.
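As an illustration of how exponents like those in Table 1 can be estimated, the sketch below fits a gravity model by ordinary least squares on logarithms. It assumes the standard form $T_{ij} = C\,P_i^{\alpha_1} P_j^{\alpha_2} / d_{ij}^{\beta}$ suggested by the parameter descriptions; the fitting procedure, the function name `fit_gravity` and its arguments are assumptions for this example, not a description of how the table was actually produced.

```python
import numpy as np


def fit_gravity(T, P_orig, P_dest, d):
    """Fit log T = log C + a1*log P_orig + a2*log P_dest - b*log d by least squares.

    All arguments are 1-D arrays with one entry per origin-destination pair.
    Pairs with zero flow or zero distance are dropped because their log is undefined.
    """
    T = np.asarray(T, dtype=float)
    P_orig = np.asarray(P_orig, dtype=float)
    P_dest = np.asarray(P_dest, dtype=float)
    d = np.asarray(d, dtype=float)
    keep = (T > 0) & (d > 0)
    A = np.column_stack([
        np.ones(keep.sum()),
        np.log(P_orig[keep]),
        np.log(P_dest[keep]),
        -np.log(d[keep]),  # minus sign so the fitted coefficient is beta itself
    ])
    coef, *_ = np.linalg.lstsq(A, np.log(T[keep]), rcond=None)
    log_c, alpha1, alpha2, beta = coef
    return {"C": float(np.exp(log_c)), "alpha1": float(alpha1),
            "alpha2": float(alpha2), "beta": float(beta)}
```

A log-linear fit is the simplest option; it discards zero flows, so a count model such as Poisson regression may be preferable when many pairs have no trips.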
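NMI between G and Gp for different p

| Algorithm | p = 0.01 | 0.02 | 0.03 | 0.04 | 0.05 | 0.06 | 0.07 | 0.08 | 0.09 | 0.1 |
|---|---|---|---|---|---|---|---|---|---|---|
| FG | 0.995 | 0.992 | 0.989 | 0.983 | 0.981 | 0.977 | 0.983 | 0.969 | 0.980 | 0.959 |
| WT | 0.954 | 0.959 | 0.950 | 0.954 | 0.945 | 0.948 | 0.947 | 0.935 | 0.926 | 0.931 |
| IM | 0.988 | 0.981 | 0.980 | 0.981 | 0.978 | 0.974 | 0.975 | 0.970 | 0.969 | 0.966 |
| ML | 0.994 | 0.978 | 0.979 | 0.983 | 0.948 | 0.934 | 0.972 | 0.952 | 0.973 | 0.947 |
| LP | 0.906 | 0.908 | 0.911 | 0.915 | 0.895 | 0.907 | 0.907 | 0.893 | 0.905 | 0.904 |
| LE | 0.960 | 0.957 | 0.956 | 0.859 | 0.910 | 0.892 | 0.908 | 0.858 | 0.885 | 0.884 |

Table 2. NMI measure comparing G and Gp.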
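For readers unfamiliar with the measure, the following is a small sketch of normalized mutual information between two partitions of the same set of nodes. It assumes the common normalization 2·I(A;B)/(H(A)+H(B)), as used in the community-detection literature cited in [6]; the function name `nmi` and the convention of returning 1 for two trivial single-community partitions are choices made for this example.

```python
from collections import Counter
from math import log


def nmi(labels_a, labels_b):
    """Normalized mutual information between two partitions given as label sequences.

    labels_a[i] and labels_b[i] are the community labels of node i in each partition.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same nodes")
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))

    entropy_a = -sum(c / n * log(c / n) for c in count_a.values())
    entropy_b = -sum(c / n * log(c / n) for c in count_b.values())
    mutual_info = sum(
        c / n * log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in count_ab.items()
    )
    if entropy_a + entropy_b == 0:
        return 1.0  # both partitions put every node in a single community
    return 2.0 * mutual_info / (entropy_a + entropy_b)
```

The value is 1 when the two partitions coincide up to relabeling and 0 when they are statistically independent.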
Communities Stats

| Algorithm | ⟨|Ni|⟩i | max|Ni| | |{Ni}| (number of communities) | Modularity | NMI P | NMI C |
|---|---|---|---|---|---|---|
| FG | 309.696 | 1385 | 23 | 0.726 | 0.712 | 0.590 |
| WT | 9.262 | 433 | 769 | 0.417 | 0.744 | 0.757 |
| IM | 21.011 | 143 | 339 | 0.758 | 0.770 | 0.831 |
| ML | 323.772 | 1132 | 22 | 0.800 | 0.717 | 0.599 |
| LP | 22.052 | 750 | 323 | 0.732 | 0.749 | 0.761 |
| LE | 1017.571 | 5344 | 7 | 0.381 | 0.264 | 0.205 |

Table 3. Statistics of the communities Ni returned by the six algorithms. NMI P refers to the comparison between communities and provinces, whereas NMI C considers counties instead of provinces.
| | All ages | < 24 | 25−44 | > 44 |
|---|---|---|---|---|
| (Intercept) | 0.11**** (0.02) | 0.10*** (0.03) | 0.20*** (0.03) | 0.20*** (0.035) |
| Penetration rate | 3.23* (1.41) | 8.57*** (2.22) | 6.28** (2.17) | 2.40 (2.77) |
| Geographical diversity | 0.03 (0.02) | 0.15*** (0.04) | 0.08* (0.04) | 0.06 (0.05) |
| Social diversity | −0.03* (0.01) | −0.03 (0.02) | −0.05* (0.02) | −0.06* (0.03) |
| Morning activity | −0.69* (0.26) | −1.30** (0.42) | −1.53*** (0.41) | −1.19* (0.52) |
| Misspellers rate | 11.56 (8.13) | 31.51* (12.78) | 15.46 (12.48) | 23.60 (15.94) |
| Employment mentions | −1.80 (6.27) | 3.17 (9.86) | −9.94 (9.64) | 2.71 (12.3) |
| R² | 0.47 | 0.64 | 0.55 | 0.29 |
| Adj. R² | 0.44 | 0.62 | 0.52 | 0.26 |

Standard errors in parentheses. ***p < 0.001, **p < 0.01, *p < 0.05.

Table 4. Regression table for the different models in which unemployment for different age groups is fitted. The All ages model is the fit to the general rate of unemployment in each geographical area, while the other models are for the rates of unemployment in groups of less than 24 years, between 25 and 44 years, and above 44 years.
| | All variables | Youth model | Twitter model (I) | Twitter model (II) |
|---|---|---|---|---|
| (Intercept) | 0.06 (0.03) | −0.02 (0.03) | 0.10*** (0.03) | 0.09*** (0.027) |
| Young pop. rate | 0.66* (0.30) | 2.20*** (0.35) | | |
| Penetration rate | 8.20*** (2.25) | | 8.57*** (2.22) | 8.62*** (2.21) |
| Geographical diversity | 0.14*** (0.04) | | 0.15*** (0.04) | 0.12*** (0.03) |
| Social diversity | −0.02 (0.02) | | −0.03 (0.02) | |
| Morning activity | −1.42*** (0.41) | | −1.30** (0.42) | −1.28** (0.41) |
| Misspellers rate | 23.95 (13.09) | | 31.51* (12.78) | 32.28* (12.71) |
| Employment mentions | 0.34 (9.81) | | 3.17 (9.86) | |
| R² | 0.65 | 0.24 | 0.64 | 0.63 |
| Adj. R² | 0.63 | 0.24 | 0.62 | 0.62 |

Standard errors in parentheses. ***p < 0.001, **p < 0.01, *p < 0.05.

Table 5. Regression table for the different statistical models. The All variables model includes both Twitter and rate of young population variables. Twitter model (I) includes only the variables described in the main article, while Twitter model (II) only includes those variables which are significant (p < 0.05) in Twitter model (I).
| | Communities | Municipalities | Counties | Provinces |
|---|---|---|---|---|
| (Intercept) | 0.10*** (0.03) | 0.16*** (0.01) | 0.11*** (0.03) | 0.11* (0.05) |
| Penetration rate | 8.57*** (2.22) | 4.01*** (0.59) | 9.12*** (1.81) | 10.47*** (1.97) |
| Geographical diversity | 0.15*** (0.04) | 0.02 (0.01) | 0.12*** (0.03) | 0.08 (0.07) |
| Social diversity | −0.03 (0.02) | −0.01 (0.01) | −0.01 (0.02) | −0.03 (0.07) |
| Morning activity | −1.30** (0.42) | −1.16*** (0.14) | −1.49*** (0.39) | −1.03 (0.88) |
| Misspellers rate | 31.51* (12.78) | 14.40*** (2.51) | 14.09 (10.02) | |
| Employment mentions | 3.17 (9.86) | −0.71 (0.89) | 2.41 (8.86) | −3.17 (12.29) |
| Number of points | 128 | 1738 | 198 | 50 |
| R² | 0.64 | 0.22 | 0.55 | 0.65 |
| Adj. R² | 0.62 | 0.21 | 0.54 | 0.61 |

Standard errors in parentheses. ***p < 0.001, **p < 0.01, *p < 0.05.

Table 6. Regression table for the unemployment linear regression model at different levels of geographical areas. In the Provinces model the misspellers rate has been removed from the model due to its large collinearity with the penetration rate.
searches [21] to the economical situation of countries. In our case we also find that there is a moderate positive correlation between the fraction of tweets $\mu_i$ mentioning job or unemployment terms and the observed unemployment, while the correlation is negative for the number of mentions of employment or the economy. However, we have tried a different approach by measuring the relation between the way of writing and the educational level [37]. To this end we build a list of 618 misspelled Spanish expressions and extract the tweets of the dataset containing at least one of these words (see SI section 6 for further details about how these expressions were collected). We only consider tweets in Spanish, detected with an N-gram based algorithm. Then we only consider misspellings that cannot be justified as abbreviations. Finally, we compute for every region the proportion $\varepsilon_i$ of misspellers among the Twitter population. If the fraction of misspellers per geographical area is a proxy for the educational level of that region, we expect a positive correlation between $\varepsilon_i$ and unemployment. Indeed we find (see figure 3) that there is a strong correlation between the fraction of misspellers and unemployment.
• Social media interactions and geographical flow diversity: following the ideas in [14], which correlated the economical development of an area with the diversity of communications with other areas, we consider all tweets mentioning another user and take them as a proxy for communication between users. Then we compute the number of communications $w_{ij}$ between areas i and j as the number of mentions between users in those areas. To measure the diversity we use, as in [14], the informational normalized entropy (Entropy 1) $S^u_i = -\sum_j p_{ij} \log p_{ij} / S^r_i$, where $p_{ij} = w_{ij} / \sum_j w_{ij}$, and (Entropy 2) $S^r_i = \log k_i$, with $k_i$ the number of different areas with which users in area i have interacted. As in [14], we find that areas with large unemployment have less diverse communication patterns than areas with low unemployment. This translates into a strong negative correlation between this social entropy and unemployment, see figure 3. Similar ideas are applied to the flows of people between areas to investigate the diversity of the geographical flows through the entropy $S_i = -\sum_j p_{ij} \log p_{ij} / S^r_i$, where $p_{ij} = T_{ij} / \sum_j T_{ij}$ and $S^r_i = \log k_i$, with $k_i$ the number of different areas which have been visited by users that live in area i. Figure 3 shows that, as in [19], the correlation of these geographical entropies with economical development is low.
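The normalized entropy above applies equally to mention counts and to mobility flows. A minimal sketch, assuming the weights of one area towards every other area are available as a plain sequence (the function name and the convention of returning 0 when fewer than two areas are reached, where log k would be zero, are assumptions made for this example):

```python
import numpy as np


def normalized_entropy(weights):
    """Entropy of the outgoing weights of one area, divided by log(k).

    weights: counts (mentions w_ij or trips T_ij) from area i to each other area j.
    k is the number of areas with a non-zero weight.
    """
    w = np.asarray(weights, dtype=float)
    w = w[w > 0]          # areas never contacted do not contribute
    k = w.size
    if k < 2:
        return 0.0        # log(1) = 0, so the ratio is undefined; use 0 by convention
    p = w / w.sum()
    return float(-(p * np.log(p)).sum() / np.log(k))
```

Values close to 1 indicate interactions spread evenly over many areas, while values close to 0 indicate interactions concentrated on a few.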
Normalization of variables is discussed in SI section 5. We have also studied the correlation between the variables considered. As expected, variables in each group show moderate correlations between them. However, the inspection of the correlation matrix and a Principal Component Analysis of the variables considered show that there is information (as percentage of variance in the data) in each of the groups of variables, see SI section 5. Because of these two facts we restrict our analysis to the variables within each group with the highest correlation with the unemployment, namely the penetration rate $\tau_i$, the social and mobility diversity variables (the normalized entropies defined above), the morning activity $\nu^{mrng}_i$, the fraction of misspellers $\varepsilon_i$ and the fraction of employment-related tweets $\mu^{emp}_i$.
3 Explanatory power of social media in unemployment
The four previous groups of variables are fingerprints of human behavior reflected in Twitter usage habits. As we observed in figure 3, all of them exhibit statistically strong correlations with unemployment. The question we address in this section is whether those variables suffice to explain the observed unemployment (their explanatory power) and which of them are the most important (which give more explanatory power than others). Note that we are not stating a causality arrow between the measures built in the previous section and the unemployment rate, but only exploring whether they can be used as alternative indicators with a real translation in the economy.
Figure 4 shows the result of a simple linear regression model for the observed unemployment for ages below 25 years as a function of the variables which have more correlation with the unemployment. The model has a significant R² = 0.62, showing that there is a large explanatory power of the unemployment encoded in the behavioral variables extracted from Twitter. However, not all the variables weigh equally in the model: specifically, the penetration rate, geographical diversity, morning activity and fraction of misspellers account for up to 92% of the explained variance, while social diversity and the number of employment-related tweets are not statistically significant (see SI section 10 for the methods used to determine the relative importance of the variables). It is interesting to note that while social diversity obtained from mobile phone communications was a key variable in the explanation of deprivation indexes in [14, 19], the communication diversity of Twitter users seems to have a minor role in the explanation of the heterogeneity of unemployment in Spain.
Similar explanatory power is found for other age groups: R² = 0.44 for all ages and R² = 0.52 for ages between 25 and 44 years. However, the model degrades for ages above 44 years (R² = 0.26), showing that our variables mainly describe the behavior of the most represented age groups on Twitter, namely those below 44 years old. On the other hand, since our Twitter variables seem to describe the behavior of young people, we have investigated whether Twitter-constructed variables have similar explanatory value (in terms of R²) as simple census demographic variables for young people. However, regression models including the young population rate yield only a minor improvement (R² = 0.65), while the young population rate alone gives R² = 0.24, a result which shows that Twitter variables do indeed possess a genuine explanatory power beyond their simple demographic representation. Finally, our model has the largest explanatory power for detected communities, but large R² values are also found for other geographical areas like counties (R² = 0.54) and provinces (R² = 0.65), see SI section 9.
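Several of the R² values quoted above coincide with the adjusted R² row of the SI regression tables rather than with the raw R² row. As a worked check, the standard adjustment can be sketched as below; the choice of 128 data points (the number of communities listed in Table 6) and 6 predictors is an assumption read off those tables, and the function name is hypothetical.

```python
def adjusted_r2(r2, n_points, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_points - 1) / (n_points - n_predictors - 1)


# Youth model on the detected communities: raw R^2 = 0.64, n = 128, p = 6.
youth = adjusted_r2(0.64, n_points=128, n_predictors=6)
# All-ages model: raw R^2 = 0.47 with the same n and p.
all_ages = adjusted_r2(0.47, n_points=128, n_predictors=6)
print(round(youth, 2), round(all_ages, 2))
```

With these inputs the youth and all-ages models round to the 0.62 and 0.44 reported in the text. Because the tabulated R² values are themselves rounded to two decimals, the last digit of a recomputed value can differ by one for other columns.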
[Figure 3: panel A, correlation coefficient (from −0.5 to 0.5) of each Twitter metric with unemployment: penetration rate; Entropy 1 and Entropy 2 (geo); Entropy 1 and Entropy 2 (social); activity (morning, afternoon, night); misspellers rate; job, employment, unemployment and economy tweets. Panels B to E, Entropy 1 (social), misspellers rate, penetration rate and activity (morning) against unemployment.]
Figure 3. A) Correlation coefficient of all the extracted Twitter metrics, grouped by technology adoption (black), geographical diversity (orange), social diversity (light blue), temporal activity (green) and content analysis (dark blue). Error bars correspond to 95% confidence intervals of the correlation coefficient. The gray area corresponds to the statistical significance thresholds. Panels B, C, D and E show the values of 4 selected variables in each geographical community against its percentage of unemployment. The size of the points is proportional to the population in each geographical community. Solid lines correspond to linear fits to the data.

[Figure 4: panels A (Age < 25, R² = 0.62) and B (25 < Age < 44, R² = 0.52), predicted versus observed unemployment. Panel C, weight of each variable: penetration rate, Entropy 1 (geo), Entropy 1 (social), activity (morning), misspellers rate, employment tweets.]
Figure 4. A) and B) Performance of the model, showing the predicted unemployment rate versus the observed one for ages below 25 (R² = 0.62) and for ages between 25 and 44. Dashed lines correspond to the equality line and ±20% error. C) Percentage of weight for each of the variables in the regression model, using the relative weight of the absolute values of coefficients in the regression model (see SI section 10). Variables marked with * are not statistically significant in the model.

4 Discussion

This work serves as a proof of concept for how a wide range of behavioral features linked to socioeconomic behavior can be inferred from the digital traces that are left by publicly available social media. In particular, we demonstrate that behavioral features related to unemployment can be recovered from the digital exhaust left by the microblogging network Twitter. First of all, Twitter geolocalized traces together with off-the-shelf community detection algorithms render an optimal partition of a country for economical activity, showing the remarkable power of social media to understand and unveil economical behavior at a country scale. This insight is likely to apply to other administrative definitions in other countries, especially when considering large cities with an inherent dynamical nature and evolution of mobility fluxes, and cities composed of small satellite cities with arbitrary agglomerations or divisions among them (e.g. London, NYC, Singapore). This result is unsurprising: it should be natural to recompute city clusters/communities of activity based on their real-time mobility, which may vary considerably faster than the update rates of mobility and travel surveys [31–33].
Our main result demonstrates that several key indicators (different penetration rates among regions, fingerprints of the temporal patterns of activity, content lexical correctness and geo-social connectivities among regions) can be extracted from social media and then used to infer unemployment levels. These findings shed light in two directions: first, on how individuals' extensive use of their social channels allows us to characterize cities based on their activity in a meaningful fashion, and secondly, on how this information can be used to build economic indicators that are directly related to the economy. Regarding the latter, our work is important for understanding how country-scale analysis of social media should consider not only the demographic but also the economical differences between users. As we have shown, users in areas with large unemployment have different mobility, different social interactions and different daily activity than those in low unemployment areas. This intertwined relationship between user behavior and employment should be considered not only in economical analysis derived from social media, but also in other applications like marketing, communication, social mobilization, etc.
It is particularly remarkable that Twitter data can provide these accurate results. Among the many currently popular social networking platforms, Twitter is perhaps the noisiest, sparsest and most 'sabotaged' medium: very few users send out messages at a regular rate, most of the users do not have geolocated information, the social relationships (followers/followees) contain a lot of unused/unimportant links, it is plagued by spam-bots and, last but not least, we have no way to identify the motive/goal/functionality of the mobility fluxes we are able to extract. These limitations are not particular to our sample, but general to the samples of Twitter data being employed in the computational social science community. Despite all these caveats, we are able to show that even some simple filtering techniques together with basic statistical regressions yield predictive power about a variable as important as unemployment. Other social media platforms such as Facebook, Google+, Sina Weibo, Instagram, Orkut or Flickr, with more granular and consistent individual data, are likely to provide similar or better results by themselves or in combination. Further improvements can be obtained by the use of more sophisticated statistical machine learning techniques, some of them even tailored to the peculiarities of social media data. Our work serves to illustrate the tremendous potential of these new digital datasets to improve the understanding of society's functioning at the finer scales of granularity.
The usefulness of our approach must be considered against the cost and update rate of performing detailed surveys of mobility, social structure and economic performance. Our database is publicly articulated, which means that our analysis could be replicated easily in other countries, other time periods and with different scopes. Naturally, surveys provide more accurate results, but they also consume considerably higher financial and human resources, employing hundreds of people and taking months, even years, to complete and be released. They are so costly that countries going through economic recession have considered discontinuing them or altering their update rate in recent times. A particularly problematic aspect of these surveys is that they are "out-of-sync", i.e. the census may be up to date whereas those same individuals' travel surveys may not be, and therefore drawing inferences between both may be particularly difficult. This is a particularly challenging problem that the immediateness of social media can help ameliorate.
A few questions remain open for further investigation. How can traditional surveys and social media digital traces be best combined to maximize their predictive ability? Can social media provide a reliable leading indicator for unemployment and, in general, for economic surveys? How much reliable lead is possible, if at all? As we have found, Twitter penetration and educational levels are correlated with unemployment, but these levels are unlikely to change rapidly enough to describe or anticipate changes in the economy or unemployment. However, other indicators like daily activity, social interactions and geographical mobility are more connected with our daily activity, and perhaps they have more predictive power to show and/or anticipate sudden changes in employment. The relationship between unemployment and individual and group behavior may help contextualize the multiple factors affecting the socioeconomic well-being of a region: while penetration, content, daily activity and mobility diversity seem to be highly correlated to unemployment in Spain, different weights for each group of traces might be expected in other countries [14]. Finally, digital traces could serve as an alternative (sometimes the only one available) to the lack of surveys in poor or remote areas [20, 27]. Another interesting avenue of research involves the use of social media to detect mismatches between the real (hidden, underground) economy and the officially reported one [38].
Most importantly, the immediacy of social media may also allow governments to better measure and understand the effect of policies, social changes and natural or man-made disasters on the economical status of cities in almost real time [18, 39]. These new avenues for research provide great opportunities at the intersection of the economic, social and computational sciences that originate from these new, widespread, inexpensive datasets.
Acknowledgments
We would like to thank Kristina Lerman, Lada Adamic, James Fowler, Daniel Villatoro and Ricardo Herranz for stimulating discussions, and Yuri Kryvasheyeu and Thomas Bochynek for their critical reading of the manuscript. This work was partially supported by Spanish Ministry of Science and Technology Grant FIS2013-47532-C3-3-P (to A.L., M.G.H. and E.M.). Manuel Cebrian is funded by the Australian Government as represented by the Department of Broadband, Communications and Digital Economy and the Australian Research Council through the ICT Centre of Excellence program.
References
[1] Becker GS (1976) The economic approach to human behavior (University of Chicago Press).
[2] Granovetter M (1985) Economic action and social structure: the problem of embeddedness. American Journal of Sociology, pp 481–510.
[3] Camerer CF, Loewenstein G, & Rabin M (2011) Advances in behavioral economics (Princeton University Press).
[4] Glaeser EL, Kallal HD, Scheinkman JA, & Shleifer A (1991) Growth in cities (National Bureau of Economic Research), Technical report.
[5] Bettencourt LM, Lobo J, Helbing D, Kuhnert C, & West GB (2007) Growth, innovation, scaling, and the pace of life in cities. Proceedings of the National Academy of Sciences 104:7301–7306.
[6] Batty M (2008) The size, scale, and shape of cities. Science 319:769–771.
[7] Milgram S (1974) The experience of living in cities. Crowding and behavior 167:41.
[8] Pan W, Ghoshal G, Krumme C, Cebrian M, & Pentland A (2013) Urban characteristics attributable to density-driven tie formation. Nature Communications 4.
[9] Gonzalez MC, Hidalgo CA, & Barabasi A-L (2008) Understanding individual human mobility patterns. Nature 453:779–782.
[10] Calabrese F, Diao M, Lorenzo GD, Jr JF, & Ratti C (2013) Understanding individual mobility patterns from urban sensing data: A mobile phone trace example. Transportation Research Part C: Emerging Technologies 26:301–313.
[11] Cheng Z, Caverlee J, Lee K, & Sui DZ (2011) Exploring Millions of Footprints in Location Sharing Services (AAAI, Menlo Park, CA, USA).
[12] Cho E, Myers SA, & Leskovec J (2011) Friendship and mobility: user movement in location-based social networks, KDD '11 (ACM, New York, NY, USA), pp 1082–1090.
[13] Sun L, Jin JG, Axhausen KW, Lee D-H, & Cebrian M (2014) Quantifying long-term evolution of intra-urban spatial interactions. arXiv preprint arXiv:1407.0145.
[14] Eagle N, Macy M, & Claxton R (2010) Network diversity and economic development. Science 328:1029–1031.
[15] Henrich J, Boyd R, Bowles S, Camerer C, Fehr E, Gintis H, & McElreath R (2001) In search of homo economicus: behavioral experiments in 15 small-scale societies. American Economic Review, pp 73–78.
[16] Krieger N, Williams DR, & Moss NE (1997) Measuring social class in US public health research: concepts, methodologies, and guidelines. Annual Review of Public Health 18:341–378.
[17] Groves RM, Fowler Jr FJ, Couper MP, Lepkowski JM, Singer E, & Tourangeau R (2013) Survey methodology (John Wiley & Sons).
[18] Lazer D, Pentland AS, Adamic L, Aral S, Barabasi AL, Brewer D, Christakis N, Contractor N, Fowler J, Gutmann M, et al. (2009) Life in the network: the coming age of computational social science. Science (New York, NY) 323:721.
[19] Smith C, Quercia D, & Capra L (2013) Finger on the pulse: identifying deprivation using transit flow analysis (ACM), pp 683–692.
[20] Soto V, Frias-Martinez V, Virseda J, & Frias-Martinez E (2011) Prediction of socioeconomic levels using cell phone records, UMAP'11 (Springer-Verlag, Berlin, Heidelberg), pp 377–388.
[21] Preis T, Moat HS, Stanley HE, & Bishop SR (2012) Quantifying the advantage of looking forward. Scientific Reports 2.
[22] Antenucci D, Cafarella M, Levenstein MC, Re C, & Shapiro MD (2014) Using social media to measure labor market flows (National Bureau of Economic Research), Technical report.
[23] Hawelka B, Sitko I, Beinat E, Sobolevsky S, Kazakopoulos P, & Ratti C (2013) Geo-located Twitter as the proxy for global mobility patterns. arXiv preprint arXiv:1311.0680.
[24] Lathia N, Quercia D, & Crowcroft J (2012) in Pervasive computing (Springer), pp 91–98.
[25] Frias-Martinez V, Virseda J, & Frias-Martinez E (2010) Socio-economic levels and human mobility.
[26] Gutierrez T, Krings G, & Blondel VD (2013) Evaluating socio-economic state of a country analyzing airtime credit and mobile phone datasets. arXiv preprint arXiv:1309.4496.
[27] Smith C, Mashhadi A, & Capra L (2013) Ubiquitous sensing for mapping poverty in developing countries. Paper submitted to the Orange D4D Challenge.
[28] Erlander S & Stewart NF (1990) The gravity model in transportation analysis: theory and extensions (VSP) Vol. 3.
[29] Simini F, Gonzalez MC, Maritan A, & Barabasi A-L (2012) A universal model for mobility and migration patterns. Nature 484:96–100.
[30] Lenormand M, Picornell M, Cantu-Ros OG, Tugores A, Louail T, Herranz R, Barthelemy M, Frias-Martinez E, & Ramasco JJ (2014) Cross-checking different sources of mobility information. arXiv preprint arXiv:1404.0333.
[31] Barthelemy M (2011) Spatial networks. Physics Reports 499:1–101.
[32] Expert P, Evans TS, Blondel VD, & Lambiotte R (2011) Uncovering space-independent communities in spatial networks. Proceedings of the National Academy of Sciences 108:7663–7668.
[33] Sobolevsky S, Szell M, Campari R, Couronne T, Smoreda Z, & Ratti C (2013) Delineating geographical regions with networks of human interactions in an extensive set of countries. PloS one 8:e81707.
[34] Rosvall M & Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105:1118–1123.
[35] Danon L, Diaz-Guilera A, Duch J, & Arenas A (2005) Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment 2005:P09008.
[36] ADigital (2013) Uso de Twitter en España 2012. [Online; accessed 1-November-2014].
[37] Davenport JR & DeLine R (2014) The readability of tweets and their geographic correlation with education. arXiv preprint arXiv:1401.6058.
[38] Schneider F, Buehn A, & Montenegro CE (2011) Shadow economies all over the world: New estimates for 162 countries from 1999 to 2007. Handbook on the shadow economy, pp 9–77.
[39] Rutherford A, Cebrian M, Dsouza S, Moro E, Pentland A, & Rahwan I (2013) Limits of social mobilization. Proceedings of the National Academy of Sciences 110:6281–6286.
Supporting Information for
Social media fingerprints of unemployment
Alejandro Llorente, Manuel García-Herranz, Manuel Cebrian and Esteban Moro
S1 The dataset
Twitter provides an extremely rich and publicly available data set of user interactions, information flows and, thanks to the geo-location of tweets, user movements. Nevertheless, the representativeness of geo-located Twitter as a global source of mobility data has so far received sparse attention. In this sense, while [13] present a promising and extensive study regarding global country-to-country movements (mostly driven by tourism), within-country human flows (comprising not only internal tourism but also, to a greater extent than country-to-country travels, visiting and commuting) still need further investigation. Therefore, throughout this work we will compare our findings using geo-located Twitter with a similar study using commuting surveys.
For the Twitter analysis we consider almost 146 million geo-located Twitter messages (tweets) collected through the public API provided by Twitter for the continental part of Spain from 29th November 2012 to 10th April 2013. In this dataset we consider that there has been a trip from place l to place k if a user has tweeted in place l and place k consecutively. We only keep those transitions in which the first tweet and the second one are dated on the same day. We filter the trips database to avoid unrealistic transitions and keep only trips with a geographical displacement larger than 1 km (see Methods section). By this method, 138 million trips from 167376 different users are considered in our work.
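The trip definition above can be sketched in a few lines. This is an illustrative reading of the rules stated in the paragraph (consecutive tweets by the same user, same calendar day, displacement above 1 km); the record layout, the function names and the use of the haversine great-circle distance are assumptions, and the additional filter for unrealistic transitions mentioned in the text is not reproduced because its details are given elsewhere.

```python
from datetime import datetime
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    earth_radius_km = 6371.0
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * earth_radius_km * asin(sqrt(a))


def extract_trips(tweets, min_km=1.0):
    """Return (user, origin, destination) for consecutive same-day tweets > min_km apart.

    tweets: iterable of (user, timestamp, lat, lon), with timestamp a datetime.
    """
    trips = []
    last_seen = {}  # user -> (timestamp, lat, lon) of that user's previous tweet
    for user, ts, lat, lon in sorted(tweets, key=lambda t: (t[0], t[1])):
        previous = last_seen.get(user)
        if previous is not None:
            prev_ts, prev_lat, prev_lon = previous
            same_day = prev_ts.date() == ts.date()
            if same_day and haversine_km(prev_lat, prev_lon, lat, lon) > min_km:
                trips.append((user, (prev_lat, prev_lon), (lat, lon)))
        last_seen[user] = (ts, lat, lon)
    return trips
```

Counting the resulting trips by the municipality of their origin and destination points would then give the flow matrix between municipalities.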
From those trips we construct the mobility flow $T_{ij}$ between municipalities, which measures the number of trips in our database whose origin lies within the boundaries of city i and whose destination lies within those of city j.
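The trip definition and the flow counts described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (user, timestamp, latitude, longitude, city), the function names and the sample coordinates are hypothetical, and the additional filter against unrealistic transitions mentioned in the Methods section is not reproduced here.

```python
from collections import Counter
from datetime import datetime
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def extract_trips(tweets, min_km=1.0):
    """Return (origin_city, destination_city) pairs.

    tweets: iterable of (user, timestamp, lat, lon, city) tuples
    (a hypothetical layout). A trip is a pair of consecutive tweets
    by the same user, on the same day, more than min_km apart.
    """
    by_user = {}
    for tweet in tweets:
        by_user.setdefault(tweet[0], []).append(tweet)
    trips = []
    for user_tweets in by_user.values():
        user_tweets.sort(key=lambda tw: tw[1])  # chronological order
        for prev, curr in zip(user_tweets, user_tweets[1:]):
            same_day = prev[1].date() == curr[1].date()
            far_enough = haversine_km(prev[2], prev[3], curr[2], curr[3]) > min_km
            if same_day and far_enough:
                trips.append((prev[4], curr[4]))
    return trips


def flow_matrix(trips):
    """Count trips per (origin, destination) pair: a sparse T_ij."""
    return Counter(trips)


# Made-up example records (coordinates are approximate).
sample_tweets = [
    ("u1", datetime(2013, 1, 10, 8, 0), 40.4168, -3.7038, "Madrid"),
    ("u1", datetime(2013, 1, 10, 18, 0), 40.3280, -3.7635, "Leganes"),
    ("u1", datetime(2013, 1, 11, 9, 0), 40.4168, -3.7038, "Madrid"),
    ("u2", datetime(2013, 1, 10, 10, 0), 40.4168, -3.7038, "Madrid"),
    ("u2", datetime(2013, 1, 10, 10, 5), 40.4170, -3.7040, "Madrid"),
]
flows = flow_matrix(extract_trips(sample_tweets))
```

In this toy input only the first user's same-day move from Madrid to Leganes counts as a trip; the next-day tweet and the second user's displacement of a few metres are discarded.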
We also consider population and economic information about the municipalities from the Spanish Census (2011) [8] and unemployment figures from the Public Service of Employment (Servicio Público de Empleo Estatal, SEPE) [7]. In the latter case, registered unemployment (in number of persons) is given for each Spanish municipality by gender, age and month. To obtain unemployment rates we divide registered unemployment by the total workforce in the municipality, estimated as the number of people aged between 16 and 65 years.
S2 Twitter as mobility proxy
Considering all the transitions available in our database, one can compute the distance between origin and destination, the elapsed time of the transition and the number of trips per user, among many other statistics. All of them seem to follow a power-law distribution with a cutoff due to the finite spatial size of Spain and to the constraint of considering only transitions in which the origin and destination check-ins are made on the same day. Focusing on the log-linear part of the distributions, self-similar behavior arises when Twitter-based mobility is analyzed (see figure 5).
Twitter-based inter-city flows can be well modelled by means of the Gravity Law, which is one of the most widespread methods to represent human mobility [1, 19], with applications in many fields such as urban planning [23], traffic engineering [4] or transportation problems [9]. The Gravity Law is also the solution to the problem of maximizing the entropy of the particle distribution among all the possible trips using statistical mechanics techniques [2, 22]. Recently it has also been used as a model for human mobility based on cell phone traces [10, 20, 21] and social media data at a global scale [13] and at the inter-city level [14].
The Gravity Model for human mobility assumes that the flows between cities can be explained by the expression
$$T^{grav}_{ij} = \frac{P_i^{\alpha_1} P_j^{\alpha_2}}{d_{ij}^{\beta}} \qquad (2)$$
where $T^{grav}_{ij}$ is the flow, in number of people, between cities i and j, $d_{ij}$ is the geographical distance between them, and $P_i$ and $P_j$ are the populations of the two cities.
Given the data, we can obtain the parameters of the model by Weighted Least Squares minimization:
$$\alpha_1^{*}, \alpha_2^{*}, \beta^{*} = \arg\min_{\alpha_1, \alpha_2, \beta} \frac{1}{N} \sum_{ij} w_{ij} \left( T_{ij} - T^{grav}_{ij} \right)^2 \qquad (3)$$
where N is the total number of connections in the mobility graph and $w_{ij}$ is a weight proportional to the number of observed transitions between i and j. In particular, we find that taking $w_{ij} = T_{ij}^{13}$ gives the best performance of the model.
In our case this model fits the inter-city mobility based on Twitter GPS check-ins quite accurately (see table 1). Even though we do not require $T_{ij}$ to be symmetric, the exponents of the two populations are similar, indicating that we observe similar flows in both directions between i and j.
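To make the fitting procedure concrete, the following Python sketch evaluates the gravity expression (2) and the weighted objective (3), and minimizes the latter by a coarse grid search. It is a minimal illustration, not the fitting code used for table 1: the function names, the synthetic observations, the parameter grid and the default weight (taken here simply proportional to $T_{ij}$) are all assumptions.

```python
def gravity_flow(p_i, p_j, d_ij, alpha1, alpha2, beta):
    """Gravity expression (2): P_i^alpha1 * P_j^alpha2 / d_ij^beta."""
    return (p_i ** alpha1) * (p_j ** alpha2) / (d_ij ** beta)


def weighted_loss(observations, alpha1, alpha2, beta, weight=lambda t: t):
    """Objective (3): mean of w_ij * (T_ij - T_grav_ij)^2.

    observations: list of (P_i, P_j, d_ij, T_ij) tuples.
    weight: maps an observed flow T_ij to w_ij (assumed form).
    """
    total = 0.0
    for p_i, p_j, d_ij, t_ij in observations:
        residual = t_ij - gravity_flow(p_i, p_j, d_ij, alpha1, alpha2, beta)
        total += weight(t_ij) * residual ** 2
    return total / len(observations)


def grid_search(observations, alphas, betas):
    """Return (loss, alpha1, alpha2, beta) with the smallest loss on the grid."""
    best = None
    for a1 in alphas:
        for a2 in alphas:
            for b in betas:
                loss = weighted_loss(observations, a1, a2, b)
                if best is None or loss < best[0]:
                    best = (loss, a1, a2, b)
    return best


# Synthetic flows generated from the model itself (alpha1 = alpha2 = 0.5, beta = 1).
pairs = [(100000, 50000, 10.0), (20000, 300000, 45.0), (5000, 8000, 3.0), (750000, 12000, 120.0)]
observations = [(p_i, p_j, d, gravity_flow(p_i, p_j, d, 0.5, 0.5, 1.0)) for p_i, p_j, d in pairs]
best = grid_search(observations, alphas=[0.3, 0.4, 0.5, 0.6], betas=[0.5, 1.0, 1.5])
```

Because the synthetic flows are produced by the model, the grid search recovers the generating parameters with zero loss. With real data a continuous optimizer would normally replace the grid.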
Figure 5. Probability distributions of the different properties of daily trips in the Twitter dataset: trip distance (km), number of trips and elapsed time (secs). Dashed lines correspond to power-law fits with exponents −1.67, −2.43 and −0.62, respectively.
S3 Community structures in inter-city mobility graph
Typically, complex networks exhibit community structure, that is, there are subsets of nodes that are more densely connected among themselves than to the rest of the nodes. In mobility networks, whose nodes correspond to geographical areas, these communities are interpreted as zones with high common activity and tend to be constrained by geographical and political barriers. We check whether this is also observed in our dataset by running six state-of-the-art community detection algorithms: FastGreedy [5], Walktrap [16], Infomap [18], MultiLevel [3], Label Propagation [17] and Leading Eigenvector [15]. These six algorithms yield different community structures in terms of number of communities, average community size and modularity (see table 3). Members (municipalities) of the resulting communities are spatially connected, except in a few cases, as figure 7 shows. We test the statistical robustness of the obtained communities by randomly removing a proportion p of the original links and running the algorithms on the resulting graph $G_p$. We consider communities to be robust when those obtained for the original network G and for $G_p$ are highly similar. To compare two arbitrary community memberships we use the Normalized Mutual Information (NMI) described in [6], which returns 0 when two memberships are totally different and 1 when they are equal. We compute the NMI for each algorithm run on G and $G_p$ for p between 1% and 10%, and conclude that the obtained community structures are robust, because they are not broken when some randomly chosen links are removed (see table 2).
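As an illustration of the comparison measure, the sketch below computes the Normalized Mutual Information between two community memberships in the form commonly attributed to [6], that is, twice the mutual information divided by the sum of the two partition entropies. The function name and the toy label lists are hypothetical; in the robustness test described above, the two label lists would be the memberships found on G and on $G_p$.

```python
from collections import Counter
from math import log


def nmi(labels_a, labels_b):
    """Normalized Mutual Information between two memberships.

    labels_a[k] and labels_b[k] are the community labels assigned
    to node k by the two partitions being compared.
    """
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    # Unnormalized mutual information (sum over the confusion matrix).
    mutual = 0.0
    for (a, b), n_ab in joint.items():
        mutual += n_ab * log(n_ab * n / (count_a[a] * count_b[b]))
    # Unnormalized entropies of each partition.
    h_a = -sum(c * log(c / n) for c in count_a.values())
    h_b = -sum(c * log(c / n) for c in count_b.values())
    if h_a + h_b == 0:
        return 1.0  # both partitions place every node in a single community
    return 2 * mutual / (h_a + h_b)


same_up_to_relabeling = nmi([0, 0, 1, 1], [1, 1, 0, 0])
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

Relabeling the communities does not change the score, which is why the measure is suitable for comparing the outputs of different runs or algorithms.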
Figure 6. Penetration rates for both cities and detected communities (horizontal axis: population; vertical axis: penetration rate).
As other works have shown, mobility graph communities are usually interpreted in terms of geographical and political barriers, and a natural question is whether the mobility-based communities are related to any of these barriers. In Spain there are different territorial divisions for administrative purposes. In this work we consider two of them: provinces, defined in the 1978 Constitution, which are 48 different heterogeneous aggregations of municipalities; and counties (comarcas in Spanish terminology), which are traditional aggregations of municipalities mainly based on Spanish orography (rivers, valleys, ridges, etc.), some of which are composed of municipalities from different provinces. We again use the NMI to compare the community structure given by the algorithms with the administrative limits. Except for the Leading Eigenvector algorithm, all methods return communities that are closely related to provinces (NMI ≈ 0.7), whereas higher variability is observed for the county limits. In this last case the algorithm whose communities agree best with county limits is Infomap (NMI ≈ 0.83). Therefore, Twitter-based mobility summarizes the inter-city flows and shows that these flows are influenced by geographical and political barriers.
S4 Twitter demographics and unemployment rates
Different age groups are not equally represented in Twitter. Recent surveys (2012) in Spain suggest that most (86%) Twitter users are 16 to 44 years old. Comparing the percentage of users per age group with the total population within the same groups (see figure 8) reveals that age groups above 35 years old are under-represented in Twitter. Thus our Twitter data will be more revealing when describing unemployment in age groups below 44 years old. This is indeed what we find when we build a linear model for the unemployment rate in different age groups with the same Twitter variables: while unemployment rates for ages below 24 can be fitted by a linear model with $R^2$ = 0.62, regression models for ages between 25 and 44 have $R^2$ = 0.52, and for ages above 44 we get only $R^2$ = 0.26. Table 4 summarizes the results of the regression models for the unemployment rate in each age group, showing that our Twitter variables have more explanatory power for ages below 44. Finally, figure 8 shows the performance of the model for the different age groups and confirms the poor explanatory power of the Twitter variables for the unemployment rate at ages above 44 years old.
S5 Properties of Twitter variables
Normalization and distributions
Heterogeneity in the values of the variables constructed from Twitter is sizeable but moderate, as the histograms in figure 9 show. We did not find any geographical area with anomalous values in any of the variables considered. Variables are normalized in different ways: both the penetration rate $\tau_i$ and the misspellers rate $\varepsilon_i$ are defined as the number of users or misspellers per 100000 persons (population); activity variables $\nu_i$ are normalized as the percentage of tweets per time interval; finally, the numbers of tweets that mention a specific term, $\mu_i$, are given per 100000 tweets published in the geographical area.
Figure 7. From left to right and from top to bottom: FastGreedy, Walktrap, Infomap, MultiLevel, Label Propagation and Leading Eigenvector communities on Twitter-based mobility transitions.
Correlation between variables
Variables are constructed to reflect the behavior of areas along the different dimensions of Twitter: penetration, social or geographical diversity, activity through the day, and content. The correlations between variables do indeed show that variables within each dimension are strongly correlated with each other. As we can see in figure 10, social and geographical diversities are highly correlated, which is expected given that the gravity law accurately describes not only the flows of people between geographical areas but also the amount of communication between them. The same behavior is found for the variables in the activity group, while content variables are less correlated. Finally, we find that both the penetration rate $\tau_i$ and the fraction of misspellers $\varepsilon_i$ are strongly correlated with most of the variables.
High correlation between variables might lead to collinearity effects [24] in the linear regression models; that is, some variables with predictive value might have non-significant weights because they explain the same part of the variance. For instance, in Table 5 the misspellers rate has a very strong predictive value, but its p-value is too high to consider it significant. To test this hypothesis we perform a principal component analysis (PCA) on the independent variables of the regression. Figure 10 shows the loadings of the different variables on the first two principal components. The block structure of the correlation matrix shown in figure 10 results in similar directions of the variables in the first components of the PCA. We observe several groups of variables: on the one hand, geographical and social diversity seem to explain a large part of the variance; on the other hand, we find a perpendicular group of variables formed by temporal activity; finally, penetration rate and misspellers fraction seem to represent a different, independent direction of the data, with high collinearity between them. This might explain the low statistical significance in the models of section 4. In any case, the structure of the correlation matrix and the PCA results show that there is indeed information in all groups of variables, and thus we take one variable from each group for our regression models.
S6 Misspellers detection
In this work we consider only tweets in Spanish. Since several languages coexist in Spain, depending on the part of the country, the first step is to reduce our Twitter dataset to those tweets that are written in Spanish. This task is carried out using the n-gram based text categorization R library textcat [11]. Then, in order to decide whether a tweet contains a misspelling or not, we need to establish the patterns to look for in our set of tweets. Since we want to be sure that a detected mistake corresponds to a real misspeller, we do not consider the following cases:
• Lack of written accents. People tend to omit written accents when writing colloquially.
• Mistakes derived from removing unnecessary letters. The most common cases are removing an h at the beginning of a word (in Spanish the letter h is not pronounced) or replacing the letters qu by k. We understand that these mistakes can be motivated by the length limit of tweets and not by a real misspelling.
• Along the same lines, we neglect mistakes produced by removing letters in the middle of a word whose pronunciation can be deduced without them.
• We do not consider mistakes related to features of specific areas of Spain either. For example, in the south the pronunciation of ce and se is the same, which produces a large number of writing mistakes. However, since we want to extract objective and equitable conclusions over the whole Spanish geography, we neglect those misspellings that only appear in a specific area.
Figure 9. Frequency plots for each variable constructed from Twitter. Panels: Penetration rate, Entropy1 (geo), Entropy2 (geo), Entropy1 (social), Entropy2 (social), Activity (morning), Activity (afternoon), Activity (night), Misspellers rate, Tweets (job), Tweets (employment) and Tweets (unemployment).
Likewise, we consider the following mistakes to be real misspellings:
• Adding letters. For example, writing an h at the beginning of a word that starts with a vowel.
• Replacing the special cases mp and mb by the wrong spellings np and nb.
• Mixing up b with v, g with j, ll with y, and ex with es. These are typical mistakes in Spanish because the letters have the same or a very close pronunciation.
• Confusing the verb haber with the periphrasis a ver.
• Splitting a word in two, for instance writing the word conmigo as con migo.
Figure 8. Top: percentage of population in each age group from the Spanish Census (dark bars) and from surveys about users of Twitter (light bars). Bottom: performance of the linear models (predicted versus observed unemployment, in %) for each of the age groups: all ages ($R^2$ = 0.47), < 24 ($R^2$ = 0.62), 25-44 ($R^2$ = 0.52) and > 44 ($R^2$ = 0.26).
In this way, our list of misspellings is composed of 617 common mistakes in Spanish that cannot be attributed to the special features of Twitter or to a specific region of Spain. Thus one can expect this selection to provide an accurate and equitable method of detecting misspellers. Under these conditions, the number of users who wrote at least one misspelled word is 27055 (56 over the whole population).
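The detection rules above amount to matching tweets against a fixed list of mistakes while ignoring missing accents. The Python sketch below shows one way this could look. It is an assumption-laden illustration: the five single-word entries and the one split-word entry are hypothetical stand-ins for the 617-entry list (which is not reproduced in the text), and the function names are invented for this example.

```python
import re
import unicodedata

# Hypothetical stand-ins for the list of common mistakes (b/v, mb/nb, ll/y confusions).
MISSPELLINGS = {"haver", "havia", "tanbien", "canbio", "llendo"}
# Hypothetical example of a word wrongly split in two.
SPLIT_WORDS = ("con migo",)


def strip_acute(text):
    """Remove acute accents only, so that missing accents are never penalized."""
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFC", decomposed.replace("\u0301", ""))


def has_misspelling(tweet):
    """True if the tweet contains an entry of the (hypothetical) mistake list."""
    text = strip_acute(tweet.lower())
    words = re.findall(r"[a-zñü]+", text)
    if any(word in MISSPELLINGS for word in words):
        return True
    joined = " ".join(words)
    return any(re.search(r"\b" + re.escape(p) + r"\b", joined) for p in SPLIT_WORDS)


def misspellers_rate(tweets_by_user, population):
    """Misspellers per 100000 persons, following the normalization in S5."""
    misspellers = sum(
        1 for tweets in tweets_by_user.values() if any(has_misspelling(t) for t in tweets)
    )
    return 100000 * misspellers / population


rate = misspellers_rate({"u1": ["Ven con migo"], "u2": ["ke haces"]}, population=1000)
```

Note that abbreviations such as ke for que and unaccented words are not flagged, in line with the exclusions listed earlier, because only explicit list entries count as misspellings.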
Figure 10. Top: correlation matrix between the variables constructed from Twitter. Each entry in the matrix is depicted as a circle whose size is proportional to the correlation between variables and whose color is blue/red for positive/negative correlations. Blank entries correspond to correlations that are statistically insignificant at the 95% confidence level. Bottom: projection of the variables on the first two principal components given by the PCA. We observe different groups of variables and collinearity between some of them.
We analyze whether misspellers use Twitter differently from people who do not make serious mistakes when publishing a tweet. Comparing the average number of tweets, it can be observed that misspellers tend to publish a larger number of tweets than those who did not make mistakes (14471 against 2372). This also emerges when the mean number of misspellings is considered as a function of the total number of tweets. For users with fewer than approximately 30 published tweets in the observation period the number of misspellings is almost zero, whereas for users who publish more often the mean number of misspellings scales sub-linearly with the number of tweets (exponent ≈ 0.33).
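A scaling exponent such as the one quoted above is typically estimated as the slope of a straight line fitted in log-log space. The following Python sketch shows that calculation on synthetic data; the function name and the data are hypothetical, and ordinary least squares on logarithms is only one common estimator, not necessarily the one used for the figure.

```python
from math import log


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x), i.e. the power-law exponent."""
    lx = [log(x) for x in xs]
    ly = [log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    var = sum((a - mean_x) ** 2 for a in lx)
    return cov / var


# Synthetic data following y = 2 * x^0.33 exactly.
tweets = [50, 100, 200, 400, 800]
mean_misspellings = [2.0 * t ** 0.33 for t in tweets]
exponent = loglog_slope(tweets, mean_misspellings)
```

An exponent below 1 means that doubling the number of tweets less than doubles the mean number of misspellings, which is what sub-linear scaling refers to.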
Since we have observed a segmentation of the Twitter population based on how accurately users write, we consider the misspellers rate as a proxy for the educational level of the cities. A large number of previous works in the literature have revealed the relationship between the economic status and the educational level of geographical areas, and therefore it is natural to ask whether the observed misspellers rate is related to the economy as measured by the unemployment rate. To test this hypothesis we consider cities with more than 5000 inhabitants, to avoid undersampled cases. We find a strong positive correlation between the probability of finding a misspeller in a city and the unemployment rate (0.372, 0.491).
Figure 11. Number (red) and probability (blue) of observed misspellings given the number of tweets.
Figure 12. Explanatory power ($R^2$) of the linear regression model when fitted against the unemployment data for different months, from October 2012 to June 2014. The gray (orange) area corresponds to the time window in which Twitter data is collected and variables are constructed.
S7 Time window and unemployment
In the definition of the variables we have aggregated the Twitter activity within a 7-month time window (from December 2012 to June 2013). Since unemployment varies significantly over time, we investigate here the correlation and explanatory power of the Twitter variables for the values of unemployment at different months within the time window in which the Twitter data was collected, and whether the variables collected in that time window are more correlated with past or future values of unemployment. Figure 12 shows the explanatory power of the model when the linear regression is done for values of unemployment of different months before, during and after the Twitter data time window. Although there is a small seasonal effect along the year, we see that the explanatory power remains around $R^2$ = 0.6, which suggests that our Twitter linear model retains its explanatory power even though unemployment changes considerably throughout the year. It is interesting to note that $R^2$ decays slightly during the summer, which means that our variables are less correlated with summer unemployment. Finally, the unemployment used in the main article is that of June 2013, i.e. the last month in the time window used to collect the data.
S8 Demographics do not explain unemployment
Since unemployment rates are very large for the group of young people, a natural question is whether demographic variables alone could explain the heterogeneity of youth unemployment rates found across the geographical areas. To test this we have built four linear models: the first one (named Youth model in Table 5) has the rate of young population as its only explanatory variable; the second and third ones are built from the Twitter variables only, using either all the variables considered in the main text (named Twitter model (I)) or just those whose regression coefficients are statistically significant (Twitter model (II)); the fourth one is fitted with all the variables (named All variables model in Table 5). Table 5 shows the summary of the regression for each model. Focusing on the variance explained by each model in terms of $R^2$, considering all Twitter variables is roughly three times more explanatory than considering only the proportion of young people. On the other hand, comparing the $R^2$ of the Twitter model with those of the All variables and Youth models shows that the rate of young population does not provide significant additional explanatory power. This semi-partial analysis shows that our Twitter variables retain a high explanatory power when the effect of the young population rate is controlled for.
S9 Unemployment models for other geographical areas
While municipalities are demographically very heterogeneous, other administrative areas exist in Spain at larger scales that could be used for our model of unemployment. As mentioned in section 4, the smallest administrative division of Spain we have considered is that of the 8200 municipalities. At larger scales we have the 326 counties (comarcas in Spanish), which are aggregations of municipalities. Finally, the largest geographical scale we consider is defined by the 50 provinces (provincias in Spanish). In this section we compare the performance of our Twitter model of unemployment with the variables defined on those administrative areas, and relate it to the geographical communities detected and used in the main paper (see section 4). Not all the areas of each administrative division are considered in the model. To minimize the effect of areas in which the number of geo-tagged tweets is very small, we only consider the 1738 municipalities that have a Twitter population π > 10. Similarly, we only consider the 198 counties with π > 100. As we can see in Table 6, the model has a large explanatory power for areas equal to or bigger than counties. As expected, $R^2$ increases as the number of areas in the model gets smaller, but the level of geographical detail of the model is very low for provinces, for example. The best performance (high $R^2$ and high level of geographical detail) is attained at the level of the detected communities.
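Because the models compared here are fitted on very different numbers of areas, the adjusted $R^2$ reported alongside $R^2$ in Table 6 is the fairer comparison. The short Python sketch below applies the standard adjustment formula. The inputs are our reading of the Table 6 entries (R² of 0.64 with 128 communities and six predictors; R² of 0.65 with 50 provinces and five predictors, since the misspellers rate is dropped there) and should be treated as assumptions; the function name is hypothetical.

```python
def adjusted_r2(r2, n_points, n_predictors):
    """Standard adjusted R^2: penalizes R^2 for the number of predictors."""
    return 1.0 - (1.0 - r2) * (n_points - 1) / (n_points - n_predictors - 1)


# Values as read from Table 6 (assumed decimal placement).
communities = adjusted_r2(0.64, n_points=128, n_predictors=6)
provinces = adjusted_r2(0.65, n_points=50, n_predictors=5)
```

Under these assumptions the penalty is visibly larger for the 50 provinces than for the 128 communities, even though the raw $R^2$ values are almost the same.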
S10 Relative importance of the variables
To assess the relative importance of the variables in the unemployment model we have used several methods. They all give qualitatively the same results, with some variations for the statistically insignificant variables. Specifically, we have used:
1. (weight) Relative weight of the absolute values of the coefficients obtained in the linear regression when variables are scaled to have mean zero and variance one.
2. (lmg) Averaging over orderings, as proposed by Lindeman, Merenda and Gold.
3. (pmvd) The PMVD metric introduced by Feldman, which is an average over orderings as well, but with data-dependent weights.
4. (first) The univariate $R^2$ values from regression models with one variable only.
Figure 13. Relative importance of the variables (in percentage) in the unemployment model for the different ways to calculate it (weight, first, lmg and pmvd).
All these metrics are obtained using the relaimpo R package [12]. The results for the youth unemployment model are shown in figure 13, where we can see that the different methods yield similar relative importances of the variables, except perhaps for the diversity of mobility flows, a variable with a non-significant weight in the regression model.
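Of the four metrics, the simplest one (first) can be illustrated without any statistical package: for a regression with a single explanatory variable, $R^2$ equals the squared Pearson correlation with the response. The Python sketch below computes it and expresses the scores as percentages of their sum. This is a hand-written illustration with hypothetical names and toy data, not a call into relaimpo, and the normalization to percentages is an assumption made to mirror the presentation in figure 13.

```python
def univariate_r2(x, y):
    """R^2 of a one-variable linear regression (squared Pearson correlation)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def first_importance(features, target):
    """Univariate R^2 of each feature, rescaled so that the scores sum to 100."""
    scores = {name: univariate_r2(values, target) for name, values in features.items()}
    total = sum(scores.values())
    return {name: 100.0 * score / total for name, score in scores.items()}


# Toy data: "a" is perfectly linear in the target, "b" is uncorrelated with it.
target = [1, 2, 3, 4, 5]
features = {"a": [2, 4, 6, 8, 10], "b": [1, -1, 0, -1, 1]}
importance = first_importance(features, target)
```

Unlike lmg or pmvd, this metric ignores correlations among the explanatory variables, so strongly collinear variables can each receive a high score.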
References
[1] B. Ashtakala. Generalized power model for trip distribution. Transportation Research Part B: Methodological, 21(1):59–67, 1987.
[2] Michel Bierlaire. Mathematical models for transportation demand analysis. Transportation Research Part A: Policy and Practice, 31(1):86–86, 1997.
[3] Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.
[4] Harry J. Casey Jr. The law of retail gravitation applied to traffic engineering. Traffic Quarterly, 9(3), 1955.
[5] Aaron Clauset, Mark E. J. Newman, and Cristopher Moore. Finding community structure in very large networks. Physical Review E, 70(6):066111, 2004.
[6] Leon Danon, Albert Diaz-Guilera, Jordi Duch, and Alex Arenas. Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment, 2005(09):P09008, 2005.
[7] Servicio Público de Empleo Estatal (SEPE). Spanish registered unemployment. http://www.sepe.es/contenidos/que_es_el_sepe/estadisticas/index.html
[8] Instituto Nacional de Estadística. Spanish 2011 census. http://www.ine.es/censos2011_datos/cen11_datos_inicio.htm
[9] Suzanne P. Evans. A relationship between the gravity model for trip distribution and the transportation problem in linear programming. Transportation Research, 7(1):39–61, 1973.
[10] Paul Expert, Tim S. Evans, Vincent D. Blondel, and Renaud Lambiotte. Uncovering space-independent communities in spatial networks. Proceedings of the National Academy of Sciences, 108(19):7663–7668, 2011.
[11] Ingo Feinerer, Christian Buchta, Wilhelm Geiger, Johannes Rauch, Patrick Mair, and Kurt Hornik. The textcat package for n-gram based text categorization in R. Journal of Statistical Software, 52(6):1–17, 2013.
[12] Ulrike Grömping. Relative importance for linear regression in R: the package relaimpo. Journal of Statistical Software, 17(1):1–27, 2006.
[13] Bartosz Hawelka, Izabela Sitko, Euro Beinat, Stanislav Sobolevsky, Pavlos Kazakopoulos, and Carlo Ratti. Geo-located Twitter as the proxy for global mobility patterns. arXiv preprint arXiv:1311.0680, 2013.
[14] Yu Liu, Zhengwei Sui, Chaogui Kang, and Yong Gao. Uncovering patterns of inter-urban trips and spatial interactions from check-in data. arXiv preprint arXiv:1310.0282, 2013.
[15] Mark E. J. Newman. Finding community structure in networks using the eigenvectors of matrices. Physical Review E, 74(3):036104, 2006.
[16] Pascal Pons and Matthieu Latapy. Computing communities in large networks using random walks. In Computer and Information Sciences - ISCIS 2005, pages 284–293. Springer, 2005.
[17] Usha Nandini Raghavan, Réka Albert, and Soundar Kumara. Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 76(3):036106, 2007.
[18] Martin Rosvall and Carl T. Bergstrom. Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences, 105(4):1118–1123, 2008.
[19] Morton Schneider. Gravity models and trip distribution theory. Papers in Regional Science, 5(1):51–56, 1959.
[20] Filippo Simini, Marta C. González, Amos Maritan, and Albert-László Barabási. A universal model for mobility and migration patterns. Nature, 484(7392):96–100, 2012.
[21] Chaoming Song, Tal Koren, Pu Wang, and Albert-László Barabási. Modelling the scaling properties of human mobility. Nature Physics, 6(10):818–823, 2010.
[22] Alan Geoffrey Wilson. Entropy in urban and regional modelling. Pion Ltd, 1970.
[23] Alan Geoffrey Wilson. Urban and regional models in geography and planning. 1974.
[24] Svante Wold, Arnold Ruhe, Herman Wold, and W. J. Dunn III. The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses. SIAM Journal on Scientific and Statistical Computing, 5(3):735–743, 1984.
Gravity Model
Parameter   Description                                          Spain
α1          Origin exponent                                      0.477*** (0.002)
α2          Destination exponent                                 0.478*** (0.002)
β           Distance exponent                                    1.05*** (0.0035)
R²          Goodness of fit                                      0.797
φ           Correlation between $T_{ij}$ and $T^{grav}_{ij}$     0.826
Table 1. Description of the parameters of the Gravity Law model in geo-tagged social media data for Spain. (***) means significance p < 0.0001.
NMI between G and $G_p$ for different p
Algorithm   p = 0.01   0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09    0.1
FG          0.995      0.992   0.989   0.983   0.981   0.977   0.983   0.969   0.980   0.959
WT          0.954      0.959   0.950   0.954   0.945   0.948   0.947   0.935   0.926   0.931
IM          0.988      0.981   0.980   0.981   0.978   0.974   0.975   0.970   0.969   0.966
ML          0.994      0.978   0.979   0.983   0.948   0.934   0.972   0.952   0.973   0.947
LP          0.906      0.908   0.911   0.915   0.895   0.907   0.907   0.893   0.905   0.904
LE          0.960      0.957   0.956   0.859   0.910   0.892   0.908   0.858   0.885   0.884
Table 2. NMI measure comparing G and $G_p$.
Communities Stats
Algorithm   Mean size ⟨|N_i|⟩   Max size max|N_i|   Number of communities   Modularity   NMI P   NMI C
FG          309.696             1385                23                      0.726        0.712   0.590
WT          9.262               433                 769                     0.417        0.744   0.757
IM          21.011              143                 339                     0.758        0.770   0.831
ML          323.772             1132                22                      0.800        0.717   0.599
LP          22.052              750                 323                     0.732        0.749   0.761
LE          1017.571            5344                7                       0.381        0.264   0.205
Table 3. Statistics of the communities $N_i$ returned by the six algorithms. NMI P refers to the comparison between communities and provinces, whereas NMI C considers counties instead of provinces.
                         All ages          < 24              25−44             > 44
(Intercept)              0.11*** (0.02)    0.10*** (0.03)    0.20*** (0.03)    0.20*** (0.035)
Penetration rate         3.23* (1.41)      8.57*** (2.22)    6.28** (2.17)     2.40 (2.77)
Geographical diversity   0.03 (0.02)       0.15*** (0.04)    0.08* (0.04)      0.06 (0.05)
Social diversity         −0.03* (0.01)     −0.03 (0.02)      −0.05* (0.02)     −0.06* (0.03)
Morning activity         −0.69* (0.26)     −1.30** (0.42)    −1.53*** (0.41)   −1.19* (0.52)
Misspellers rate         11.56 (8.13)      31.51* (12.78)    15.46 (12.48)     23.60 (15.94)
Employment mentions      −1.80 (6.27)      3.17 (9.86)       −9.94 (9.64)      2.71 (12.3)
R²                       0.47              0.64              0.55              0.29
Adj. R²                  0.44              0.62              0.52              0.26
Standard errors in parentheses. ***p < 0.001, **p < 0.01, *p < 0.05
Table 4. Regression table for the different models in which unemployment for different age groups is fitted. The All ages model is the fit to the general rate of unemployment in each geographical area, while the other models are for the rates of unemployment in groups of less than 24 years, between 25 and 44 years and above 44 years.
                         All variables     Youth model       Twitter model (I)   Twitter model (II)
(Intercept)              0.06 (0.03)       −0.02 (0.03)      0.10*** (0.03)      0.09*** (0.027)
Young pop. rate          0.66* (0.30)      2.20*** (0.35)
Penetration rate         8.20*** (2.25)                      8.57*** (2.22)      8.62*** (2.21)
Geographical diversity   0.14*** (0.04)                      0.15*** (0.04)      0.12*** (0.03)
Social diversity         −0.02 (0.02)                        −0.03 (0.02)
Morning activity         −1.42*** (0.41)                     −1.30** (0.42)      −1.28** (0.41)
Misspellers rate         23.95 (13.09)                       31.51* (12.78)      32.28* (12.71)
Employment mentions      0.34 (9.81)                         3.17 (9.86)
R²                       0.65              0.24              0.64                0.63
Adj. R²                  0.63              0.24              0.62                0.62
Standard errors in parentheses. ***p < 0.001, **p < 0.01, *p < 0.05
Table 5. Regression table for the different statistical models. The All variables model includes both Twitter and rate of young population variables. Twitter model (I) includes only the variables described in the main article, while Twitter model (II) only includes those variables which are significant (p < 0.05) in Twitter model (I).
Social media fingerprints of unemployment mdash 1919
                         Communities   Municipalities   Counties    Provinces
(Intercept)              0.10***       0.16***          0.11***     0.11*
                         (0.03)        (0.01)           (0.03)      (0.05)
Penetration rate         8.57***       4.01***          9.12***     10.47***
                         (2.22)        (0.59)           (1.81)      (1.97)
Geographical diversity   0.15***       0.02             0.12***     0.08
                         (0.04)        (0.01)           (0.03)      (0.07)
Social diversity         -0.03         -0.01            -0.01       -0.03
                         (0.02)        (0.01)           (0.02)      (0.07)
Morning activity         -1.30**       -1.16***         -1.49***    -1.03
                         (0.42)        (0.14)           (0.39)      (0.88)
Misspellers rate         31.51*        14.40***         14.09
                         (12.78)       (2.51)           (10.02)
Employment mentions      3.17          -0.71            2.41        -3.17
                         (9.86)        (0.89)           (8.86)      (12.29)
Number of points         128           1738             198         50
R2                       0.64          0.22             0.55        0.65
Adj. R2                  0.62          0.21             0.54        0.61
***p < 0.001, **p < 0.01, *p < 0.05
Table 6. Regression table for the unemployment linear regression model at different levels of geographical aggregation. In the Provinces model the misspellers rate has been removed from the model due to its large collinearity with the penetration rate.
Figure 3. A) Correlation coefficient of all the extracted Twitter metrics, grouped by technology adoption (black), geographical diversity (orange), social diversity (light blue), temporal activity (green) and content analysis (dark blue). Error bars correspond to 95% confidence intervals of the correlation coefficient. The gray area corresponds to the statistical significance thresholds. Panels B, C, D and E show the values of 4 selected variables in each geographical community against its percentage of unemployment. Size of the points is proportional to the population in each geographical community. Solid lines correspond to linear fits to the data.
Variables in panel A: Penetration rate; Entropy 1 (geo); Entropy 2 (geo); Entropy 1 (social); Entropy 2 (social); Activity (morning); Activity (afternoon); Activity (night); Misspellers rate; Job tweets; Employment tweets; Unemployment tweets; Economy tweets. Variables plotted against unemployment in panels B to E: Entropy 1 (social), Misspellers rate, Penetration rate and Activity (morning).
Figure 4. A) and B) Performance of the model, showing the predicted unemployment rate versus the observed one for ages below 25 (R2 = 0.62) and for ages between 25 and 44 (R2 = 0.52). Dashed lines correspond to the equality line and ±20% error. C) Percentage of weight for each of the variables in the regression model (Penetration rate, Entropy 1 (geo), Entropy 1 (social), Activity (morning), Misspellers rate, Employment tweets), using the relative weight of the absolute values of the coefficients in the regression model (see SI section 10). Variables marked with * are not statistically significant in the model.
be inferred from the digital traces that are left by publicly-available social media. In particular, we demonstrate that behavioral features related to unemployment can be recovered from the digital exhaust left by the microblogging network Twitter. First of all, Twitter geolocalized traces together with off-the-shelf community detection algorithms render an optimal partition of a country for economical activity, showing the remarkable power of social media to understand and unveil economical behavior at a country scale. This insight is likely to apply to other administrative definitions in other countries, especially when considering large cities with an inherently dynamical nature and evolution of mobility fluxes, and cities composed of small satellite cities with arbitrary agglomerations or divisions among them (e.g. London, NYC, Singapore).
This result is unsurprising: it should be natural to recompute city clusters/communities of activity based on their real-time mobility, which may vary considerably faster than the update rates of mobility and travel surveys [31–33].
Our main result demonstrates that several key indicators (different penetration rates among regions, fingerprints of the temporal patterns of activity, content lexical correctness and geo-social connectivities among regions) can be extracted from social media and then used to infer unemployment levels. These findings shed light in two directions: first, on how individuals' extensive use of their social channels allows us to characterize cities based on their activity in a meaningful fashion; and secondly, on how this information can be used to build economic indicators that are directly related to the economy. Regarding the latter, our work is important for understanding how country-scale analysis of social media should consider not only the demographic but also the economical differences between users. As we have shown, users in areas with large unemployment have different mobility, different social interactions and different daily activity than those in low-unemployment areas. This intertwined relationship between user behavior and employment should be considered not only in economical analysis derived from social media, but also in other applications like marketing, communication, social mobilization, etc.
It is particularly remarkable that Twitter data can provide these accurate results. Twitter is, among the many currently popular social networking platforms, perhaps the noisiest, sparsest and most 'sabotaged' medium: very few users send out messages at a regular rate; most of the users do not have geolocated information; the social relationships (followers/followees) contain a lot of unused/unimportant links; it is plagued by spam-bots; and, last but not least, we have no way to identify the motive/goal/functionality of the mobility fluxes we are able to extract. These limitations are not particular to our sample, but general to the sample Twitter data being employed in the computational social science community. Despite all these caveats, we are able to show that even some simple filtering techniques together with basic statistical regressions yield predictive power about a variable as important as unemployment. Other social media platforms such as Facebook, Google+, Sina Weibo, Instagram, Orkut or Flickr, with more granular and consistent individual data, are likely to provide similar or better results by themselves or in combination. Further improvements can be obtained by the use of more sophisticated statistical machine learning techniques, some of them even tailored to the peculiarities of social media data. Our work serves to illustrate the tremendous potential of these new digital datasets to improve the understanding of society's functioning at the finest scales of granularity.
The usefulness of our approach must be considered against the cost and update rate of performing detailed surveys of mobility, social structure and economic performance. Our database is publicly articulated, which means that our analysis could be replicated easily in other countries, other time periods and with different scopes. Naturally, surveys provide more accurate results, but they also consume considerably more financial and human resources, employing hundreds of people and taking months, even years, to complete and be released. They are so costly that countries going through economic recession have considered discontinuing them or altering their update rate in recent times. A particularly problematic aspect of these surveys is that they are "out-of-sync", i.e. the census may be up to date whereas those same individuals' travel surveys may not be, and therefore drawing inferences between both may be particularly difficult. This is a challenging problem that the immediateness of social media can help ameliorate.
A few questions remain open for further investigation. How can traditional surveys and social media digital traces be best combined to maximize their predictive ability? Can social media provide a reliable leading indicator to unemployment and, in general, economic surveys? How much reliable lead is possible, if at all? As we have found, Twitter penetration and educational levels are correlated with unemployment, but these levels are unlikely to change rapidly enough to describe or anticipate changes in the economy or unemployment. However, other indicators like daily activity, social interactions and geographical mobility are more connected with our daily activity, and perhaps they have more predictive power to show and/or anticipate sudden changes in employment. The relationship between unemployment and individual and group behavior may help contextualize the multiple factors affecting the socioeconomic well-being of a region: while penetration, content, daily activity and mobility diversity seem to be highly correlated with unemployment in Spain, different weights for each group of traces might be expected in other countries [14]. Finally, digital traces could serve as an alternative (sometimes the only one available) to the lack of surveys in poor or remote areas [20, 27]. Another interesting avenue of research involves the use of social media to detect mismatches between the real (hidden, underground) economy and the officially reported one [38].
Most importantly, the immediacy of social media may also allow governments to better measure and understand the effect of policies, social changes, and natural or man-made disasters on the economical status of cities in almost real time [18, 39]. These new avenues for research provide great opportunities at the intersection of the economic, social and computational sciences that originate from these new, widespread, inexpensive datasets.
Acknowledgments
We would like to thank Kristina Lerman, Lada Adamic, James Fowler, Daniel Villatoro and Ricardo Herranz for stimulating discussions, and Yuri Kryvasheyeu and Thomas Bochynek for their critical reading of the manuscript. This work was partially supported by Spanish Ministry of Science and Technology Grant FIS2013-47532-C3-3-P (to A. L., M. G. H. and E. M.). Manuel Cebrian is funded by the Australian Government as represented by the Department of Broadband, Communications and Digital Economy and the Australian Research Council through the ICT Centre of Excellence program.
References
[1] Becker, G. S. (1976) The economic approach to human behavior (University of Chicago Press).
[2] Granovetter, M. (1985) Economic action and social structure: the problem of embeddedness. American journal of sociology, pp. 481–510.
[3] Camerer, C. F., Loewenstein, G., & Rabin, M. (2011) Advances in behavioral economics (Princeton University Press).
[4] Glaeser, E. L., Kallal, H. D., Scheinkman, J. A., & Shleifer, A. (1991) Growth in cities (National Bureau of Economic Research), Technical report.
[5] Bettencourt, L. M., Lobo, J., Helbing, D., Kuhnert, C., & West, G. B. (2007) Growth, innovation, scaling, and the pace of life in cities. Proceedings of the National Academy of Sciences 104, 7301–7306.
[6] Batty, M. (2008) The size, scale, and shape of cities. Science 319, 769–771.
[7] Milgram, S. (1974) The experience of living in cities. Crowding and behavior 167, 41.
[8] Pan, W., Ghoshal, G., Krumme, C., Cebrian, M., & Pentland, A. (2013) Urban characteristics attributable to density-driven tie formation. Nature communications 4.
[9] Gonzalez, M. C., Hidalgo, C. A., & Barabasi, A.-L. (2008) Understanding individual human mobility patterns. Nature 453, 779–782.
[10] Calabrese, F., Diao, M., Lorenzo, G. D., Jr, J. F., & Ratti, C. (2013) Understanding individual mobility patterns from urban sensing data: A mobile phone trace example. Transportation Research Part C: Emerging Technologies 26, 301–313.
[11] Cheng, Z., Caverlee, J., Lee, K., & Sui, D. Z. (2011) Exploring Millions of Footprints in Location Sharing Services (AAAI, Menlo Park, CA, USA).
[12] Cho, E., Myers, S. A., & Leskovec, J. (2011) Friendship and mobility: user movement in location-based social networks. KDD '11 (ACM, New York, NY, USA), pp. 1082–1090.
[13] Sun, L., Jin, J. G., Axhausen, K. W., Lee, D.-H., & Cebrian, M. (2014) Quantifying long-term evolution of intra-urban spatial interactions. arXiv preprint arXiv:1407.0145.
[14] Eagle, N., Macy, M., & Claxton, R. (2010) Network diversity and economic development. Science 328, 1029–1031.
[15] Henrich, J., Boyd, R., Bowles, S., Camerer, C., Fehr, E., Gintis, H., & McElreath, R. (2001) In search of homo economicus: behavioral experiments in 15 small-scale societies. American Economic Review, pp. 73–78.
[16] Krieger, N., Williams, D. R., & Moss, N. E. (1997) Measuring social class in US public health research: concepts, methodologies, and guidelines. Annual review of public health 18, 341–378.
[17] Groves, R. M., Fowler Jr, F. J., Couper, M. P., Lepkowski, J. M., Singer, E., & Tourangeau, R. (2013) Survey methodology (John Wiley & Sons).
[18] Lazer, D., Pentland, A. S., Adamic, L., Aral, S., Barabasi, A. L., Brewer, D., Christakis, N., Contractor, N., Fowler, J., Gutmann, M., et al. (2009) Life in the network: the coming age of computational social science. Science (New York, NY) 323, 721.
[19] Smith, C., Quercia, D., & Capra, L. (2013) Finger on the pulse: identifying deprivation using transit flow analysis (ACM), pp. 683–692.
[20] Soto, V., Frias-Martinez, V., Virseda, J., & Frias-Martinez, E. (2011) Prediction of socioeconomic levels using cell phone records. UMAP'11 (Springer-Verlag, Berlin, Heidelberg), pp. 377–388.
[21] Preis, T., Moat, H. S., Stanley, H. E., & Bishop, S. R. (2012) Quantifying the advantage of looking forward. Scientific reports 2.
[22] Antenucci, D., Cafarella, M., Levenstein, M. C., Re, C., & Shapiro, M. D. (2014) Using social media to measure labor market flows (National Bureau of Economic Research), Technical report.
[23] Hawelka, B., Sitko, I., Beinat, E., Sobolevsky, S., Kazakopoulos, P., & Ratti, C. (2013) Geo-located twitter as the proxy for global mobility patterns. arXiv preprint arXiv:1311.0680.
[24] Lathia, N., Quercia, D., & Crowcroft, J. (2012) in Pervasive computing (Springer), pp. 91–98.
[25] Frias-Martinez, V., Virseda, J., & Frias-Martinez, E. (2010) Socio-economic levels and human mobility.
[26] Gutierrez, T., Krings, G., & Blondel, V. D. (2013) Evaluating socio-economic state of a country analyzing airtime credit and mobile phone datasets. arXiv preprint arXiv:1309.4496.
[27] Smith, C., Mashhadi, A., & Capra, L. (2013) Ubiquitous sensing for mapping poverty in developing countries. Paper submitted to the Orange D4D Challenge.
[28] Erlander, S. & Stewart, N. F. (1990) The gravity model in transportation analysis: theory and extensions (Vsp), Vol. 3.
[29] Simini, F., Gonzalez, M. C., Maritan, A., & Barabasi, A.-L. (2012) A universal model for mobility and migration patterns. Nature 484, 96–100.
[30] Lenormand, M., Picornell, M., Cantu-Ros, O. G., Tugores, A., Louail, T., Herranz, R., Barthelemy, M., Frias-Martinez, E., & Ramasco, J. J. (2014) Cross-checking different sources of mobility information. arXiv preprint arXiv:1404.0333.
[31] Barthelemy, M. (2011) Spatial networks. Physics Reports 499, 1–101.
[32] Expert, P., Evans, T. S., Blondel, V. D., & Lambiotte, R. (2011) Uncovering space-independent communities in spatial networks. Proceedings of the National Academy of Sciences 108, 7663–7668.
[33] Sobolevsky, S., Szell, M., Campari, R., Couronne, T., Smoreda, Z., & Ratti, C. (2013) Delineating geographical regions with networks of human interactions in an extensive set of countries. PloS one 8, e81707.
[34] Rosvall, M. & Bergstrom, C. T. (2008) Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105, 1118–1123.
[35] Danon, L., Diaz-Guilera, A., Duch, J., & Arenas, A. (2005) Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment 2005, P09008.
[36] ADigital (2013) Uso de twitter en España 2012. [Online; accessed 1-November-2014].
[37] Davenport, J. R. & DeLine, R. (2014) The readability of tweets and their geographic correlation with education. arXiv preprint arXiv:1401.6058.
[38] Schneider, F., Buehn, A., & Montenegro, C. E. (2011) Shadow economies all over the world: New estimates for 162 countries from 1999 to 2007. Handbook on the shadow economy, pp. 9–77.
[39] Rutherford, A., Cebrian, M., Dsouza, S., Moro, E., Pentland, A., & Rahwan, I. (2013) Limits of social mobilization. Proceedings of the National Academy of Sciences 110, 6281–6286.
Supporting Information for
Social media fingerprints of unemployment
Alejandro Llorente, Manuel García-Herranz, Manuel Cebrian and Esteban Moro
S1 The dataset
Twitter provides an extremely rich and publicly available data set of user interactions, information flows and, thanks to the geolocation of tweets, user movements. Nevertheless, the representativeness of geo-located Twitter as a global source of mobility data has so far received little attention. In this sense, while [13] presents a promising and extensive study of global country-to-country movements (mostly driven by tourism), within-country human flows (comprising not only internal tourism but also, to a greater extent than country-to-country travel, visiting and commuting) still need further investigation. Therefore, throughout this work we compare our findings using geo-located Twitter with similar studies using commuting surveys.
For the Twitter analysis we consider almost 146 million geo-located Twitter messages (tweets) collected through the public API provided by Twitter for the continental part of Spain, from 29th November 2012 to 10th April 2013. In this dataset we consider that there has been a trip from place l to place k if a user has tweeted in place l and place k consecutively. We only keep those transitions for which the first tweet and the second one are dated on the same day. We filter the trips database to avoid unrealistic transitions and keep only trips with a geographical displacement larger than 1 km (see Methods section). By this method, 138 million trips from 167376 different users are considered in our work.
From those trips we construct the mobility flow $T_{ij}$ between municipalities, which measures the number of trips in our database whose origin lies within the boundaries of city i and whose destination lies within those of city j.
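The trip definition and the flow construction described above can be sketched as follows. This is only an illustrative sketch, not the authors' pipeline: the record layout `(user, timestamp, lat, lon, municipality)`, the function names and the toy tweets are assumptions, and the additional filter for unrealistic transitions mentioned in the text is not reproduced.

```python
import math
from collections import Counter
from datetime import datetime


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    radius = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))


def extract_trips(tweets, min_km=1.0):
    """Return (origin, destination) municipality pairs for consecutive tweets of
    the same user that fall on the same day and are more than min_km apart."""
    by_user = {}
    for record in tweets:
        by_user.setdefault(record[0], []).append(record)
    trips = []
    for records in by_user.values():
        records.sort(key=lambda r: r[1])  # chronological order per user
        for first, second in zip(records, records[1:]):
            same_day = first[1].date() == second[1].date()
            displacement = haversine_km(first[2], first[3], second[2], second[3])
            if same_day and displacement > min_km:
                trips.append((first[4], second[4]))
    return trips


def mobility_flows(trips):
    """T_ij: number of trips with origin in i and destination in j."""
    return Counter(trips)


# Toy tweets, invented for illustration only.
tweets = [
    ("u1", datetime(2013, 1, 7, 9, 0), 40.4168, -3.7038, "Madrid"),
    ("u1", datetime(2013, 1, 7, 18, 0), 40.3281, -3.7635, "Leganes"),
    ("u1", datetime(2013, 1, 8, 8, 0), 40.4168, -3.7038, "Madrid"),   # next day: no trip
    ("u2", datetime(2013, 1, 7, 10, 0), 40.4168, -3.7038, "Madrid"),
    ("u2", datetime(2013, 1, 7, 11, 0), 40.4170, -3.7038, "Madrid"),  # below 1 km: no trip
]
flows = mobility_flows(extract_trips(tweets))
print(dict(flows))
```

Only the first pair of tweets of user u1 qualifies as a trip in this toy example; the other consecutive pairs fail either the same-day or the displacement condition.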
We also consider population and economical information about the municipalities from the Spanish Census (2011) [8] and unemployment figures from the Public Service of Employment (Servicio Público de Empleo Estatal, SEPE) [7]. In the latter case, registered unemployment (in number of persons) is given for each Spanish municipality by gender, age and month. To get unemployment rates we divide registered unemployment by the total workforce in the municipality, estimated as the number of people aged between 16 and 65 years.
S2 Twitter as mobility proxy
Considering all of the available transitions in our database, one can compute the distance between origin and destination, the elapsed time of the transition and the number of trips per user, among many other statistics. All of them seem to follow a power-law distribution with a cutoff due to the finite spatial size of Spain and the constraint of considering only transitions where the origin and destination check-ins are made on the same day. Focusing on the log-linear part of the distributions, self-similar behaviors arise when Twitter-based mobility is analyzed (see figure 5).
Twitter-based inter-city flows can be well modelled by means of the Gravity Law, which is one of the most widespread methods to represent human mobility [1, 19], with applications in many fields like urban planning [23], traffic engineering [4] or transportation problems [9]. The Gravity Law is also the solution to the problem of maximizing the entropy of the particle distribution among all the possible trips using statistical mechanics techniques [2, 22]. Recently it has also been used as a model for human mobility based on cell phone traces [10, 20, 21] and social media data at a global scale [13] and at the inter-city level [14].
The Gravity Model for human mobility assumes that the flows between cities can be explained by the expression

$$T^{\mathrm{grav}}_{ij} = \frac{P_i^{\alpha_1} P_j^{\alpha_2}}{d_{ij}^{\beta}} \qquad (2)$$

where $T^{\mathrm{grav}}_{ij}$ is the flow, in terms of number of people, between cities i and j, $d_{ij}$ is the geographical distance between them, and $P_i$ and $P_j$ are the populations of the two cities.

Given the data, we can obtain the parameters of the model by Weighted Least Squares minimization

$$\alpha_1^{*}, \alpha_2^{*}, \beta^{*} = \operatorname*{argmin}_{\alpha_1, \alpha_2, \beta} \; \frac{1}{N} \sum_{ij} w_{ij} \left(T_{ij} - T^{\mathrm{grav}}_{ij}\right)^2 \qquad (3)$$

where N is the total number of connections in the mobility graph and $w_{ij}$ is a weight that increases with the number of observed transitions between i and j. In particular, we find that taking $w_{ij} = T_{ij}^{1/3}$ gives the best performance in the model.

In our case this model fits quite accurately the inter-city mobility based on Twitter GPS check-ins (see table 1). Even though we do not assume $T_{ij}$ to be symmetric, the exponents of the populations are similar, indicating that we are observing similar flows in both directions between i and j.
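As an illustration of the fitting step, the weighted objective of equation (3) can be evaluated directly and minimized over a coarse parameter grid. The sketch below is an assumption-laden toy: the grid search stands in for whatever optimizer was actually used, the function names are hypothetical, and the populations, distances and flows are synthetic values generated from known exponents so that the recovered parameters can be checked.

```python
import itertools


def gravity_flow(p_i, p_j, d_ij, alpha1, alpha2, beta):
    """Gravity-law flow of equation (2): P_i^alpha1 * P_j^alpha2 / d_ij^beta."""
    return (p_i ** alpha1) * (p_j ** alpha2) / (d_ij ** beta)


def weighted_loss(observations, alpha1, alpha2, beta):
    """Objective of equation (3) with weights w_ij = T_ij^(1/3).

    observations: list of (P_i, P_j, d_ij, T_ij) tuples, one per connection."""
    total = 0.0
    for p_i, p_j, d_ij, t_ij in observations:
        weight = t_ij ** (1.0 / 3.0)
        residual = t_ij - gravity_flow(p_i, p_j, d_ij, alpha1, alpha2, beta)
        total += weight * residual ** 2
    return total / len(observations)


def fit_gravity(observations, grid):
    """Grid search (an assumed stand-in optimizer) for the argmin of equation (3)."""
    candidates = itertools.product(grid, repeat=3)
    return min(candidates, key=lambda params: weighted_loss(observations, *params))


# Synthetic example: four hypothetical cities with flows generated from
# alpha1 = alpha2 = 0.5 and beta = 1.0, so the fit should recover those values.
populations = {"A": 1000, "B": 5000, "C": 20000, "D": 80000}
distances = {("A", "B"): 10, ("A", "C"): 35, ("A", "D"): 60,
             ("B", "C"): 25, ("B", "D"): 50, ("C", "D"): 40}
observations = []
for (i, j), d in distances.items():
    for origin, dest in ((i, j), (j, i)):  # both directions, flows need not be symmetric
        flow = gravity_flow(populations[origin], populations[dest], d, 0.5, 0.5, 1.0)
        observations.append((populations[origin], populations[dest], d, flow))

grid = [k * 0.25 for k in range(9)]  # 0.0, 0.25, ..., 2.0
best = fit_gravity(observations, grid)
print(best)
```

A grid search is chosen here only because it keeps the example dependency-free; on real data a continuous optimizer would normally be preferable.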
S3 Community structures in inter-city mobility graph
Typically, complex networks exhibit community structure, that is, there are subsets of nodes that are more densely connected among themselves than to the rest of the nodes. In mobility networks, whose nodes correspond to geographical areas, these communities are interpreted as zones with high common activity and tend to be constrained by geographical and political barriers. We check whether this is also observed in our dataset by running 6 state-of-the-art community detection algorithms: FastGreedy [5], Walktrap [16], Infomap [18], MultiLevel [3], Label Propagation [17] and Leading Eigenvector [15]. These six algorithms yield different community structures in terms of number of communities, average community size or modularity (see table 3). Members (municipalities) of the resulting communities are spatially connected except in a few cases, as figure 7 shows. We test the statistical robustness of the obtained communities by randomly removing a proportion p of the original links and running the algorithms on this new graph $G_p$. We consider communities robust when the communities obtained for the original network G and for $G_p$ are highly similar. In order to compare two arbitrary community memberships we use the Normalized Mutual Information (NMI) method described in [6], which returns 0 when two memberships are totally different and 1 when they are equal. We compute the NMI for each chosen algorithm run on G and $G_p$ for p between 1% and 10%, concluding that the obtained community structures are robust because they are not broken when some randomly chosen links are removed (see table 2).

Figure 5. Probability distributions (density) for the different properties of daily trips in the Twitter dataset: trip distance (km), number of trips and elapsed time (secs). Dashed lines correspond to power-law fits with exponents -1.67, -2.43 and -0.62, respectively.
Figure 6. Penetration rates for both cities and detected communities (penetration rate versus population).
As other works have shown, mobility graph communities are usually interpreted in terms of geographical and political barriers, and a natural question is whether the mobility-based communities are related to any of these barriers. In Spain there are different territorial divisions for administrative purposes. In this work we consider two of them: provinces, defined in the 1978 Constitution, which are 48 different heterogeneous aggregations of municipalities; and counties (comarca in Spanish terminology), which are traditional aggregations of municipalities mainly based on Spanish orography (rivers, valleys, ridges, etc.), some of which are composed of municipalities from different provinces. We again use the NMI method to compare the community structure given by the algorithms with the administrative limits. Except for the Leading Eigenvector algorithm, the methods return communities that are quite related to provinces (NMI ≈ 0.7), whereas for the county limits higher variability is observed. In this last case, the algorithm most closely related to county limits is Infomap (NMI ≈ 0.83). Therefore, Twitter-based mobility summarizes the inter-city flows, showing that these flows are influenced by geographical and political barriers.
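The comparison between two community memberships used in this section can be sketched with a small Normalized Mutual Information function. The formula below is the normalization commonly used for comparing partitions (for instance in Danon et al. 2005, listed in the main-text references); that it coincides exactly with the variant of [6] is an assumption, and the function name and toy labels are hypothetical.

```python
import math
from collections import Counter


def normalized_mutual_information(labels_a, labels_b):
    """NMI between two community assignments of the same nodes.

    labels_a[k] and labels_b[k] are the communities of node k in each partition.
    Returns 1.0 for identical partitions and 0.0 for independent ones."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("both assignments must cover the same non-empty node set")
    sizes_a = Counter(labels_a)
    sizes_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))  # confusion matrix N_ab
    numerator = -2.0 * sum(
        n_ab * math.log(n_ab * n / (sizes_a[a] * sizes_b[b]))
        for (a, b), n_ab in joint.items()
    )
    denominator = (
        sum(size * math.log(size / n) for size in sizes_a.values())
        + sum(size * math.log(size / n) for size in sizes_b.values())
    )
    if denominator == 0.0:
        # Both partitions put every node in a single community.
        return 1.0
    return numerator / denominator


# Toy example: six municipalities, communities found on G and on a perturbed graph.
communities_g = ["c1", "c1", "c1", "c2", "c2", "c2"]
communities_gp = ["x", "x", "x", "y", "y", "y"]   # same grouping, different labels
unrelated = ["x", "y", "x", "y", "x", "y"]

print(normalized_mutual_information(communities_g, communities_gp))
print(normalized_mutual_information(communities_g, unrelated))
```

Because the measure depends only on how nodes are grouped and not on the label names, it can equally be used to compare detected communities with administrative units such as provinces or counties.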
S4 Twitter demographics and unemployment rates
Different age groups are not equally represented in Twitter. Recent surveys (2012) in Spain suggest that most (86%) of users in Twitter are 16 to 44 years old. Comparison of the percentage of users per age group with the total population within the same groups (see figure 8) reveals that age groups above 35 years old are under-represented in Twitter. Thus our Twitter data will be more revealing when trying to describe unemployment in age groups below 44 years old. This is indeed what we find when we build a linear model for the unemployment rate in different age groups with the same Twitter variables: while unemployment rates for ages below 24 can be fitted to a linear model with R2 = 0.62, regression models for unemployment rates for ages between 25 and 44 have R2 = 0.52, while for ages above 44 we get only R2 = 0.26. Table 4 summarizes the results for the regression models of unemployment rates in each age group, showing that our Twitter variables have more explanatory power for ages below 44. Finally, figure 8 shows the performance of the model for the different age groups; once again, the poor explanatory power of the Twitter variables for the unemployment rate at ages above 44 is evident.
S5 Properties of Twitter variables
Normalization and distributions
Figure 7. From left to right and from top to bottom: Fastgreedy, Walktrap, Infomap, Multilevel, Label Propagation and Leading Eigenvector communities on Twitter-based mobility transitions.

Heterogeneity between the values of the variables constructed from Twitter is appreciable but moderate, as the histograms in figure 9 show. We did not find any geographical area with anomalous values in any of the variables considered. Variables are normalized in different ways: both the penetration τ_i and the misspellers rate ε_i are defined as the number of users or misspellers per 100000 persons (population); activity variables ν_i are normalized as the percentage of tweets per time interval; finally, the numbers of tweets that mention a specific term, μ_i, are given per 100000 tweets published in the geographical area.
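The normalizations just described amount to a few ratios per geographical area. The following sketch shows them on invented counts; the dictionary keys, function names and numbers are hypothetical and only illustrate the arithmetic.

```python
def per_100k(count, total):
    """Scale a raw count to a rate per 100000 units of `total`."""
    return 100000.0 * count / total


def normalize_area(area):
    """Compute the normalized Twitter variables of one geographical area.

    `area` holds raw counts under hypothetical keys: population, users,
    misspellers, tweets, tweets_by_interval and term_mentions."""
    return {
        # tau_i and epsilon_i: users and misspellers per 100000 persons
        "penetration_rate": per_100k(area["users"], area["population"]),
        "misspellers_rate": per_100k(area["misspellers"], area["population"]),
        # nu_i: percentage of the area's tweets published in each time interval
        "activity": {
            interval: 100.0 * count / area["tweets"]
            for interval, count in area["tweets_by_interval"].items()
        },
        # mu_i: tweets mentioning each term per 100000 tweets of the area
        "mentions": {
            term: per_100k(count, area["tweets"])
            for term, count in area["term_mentions"].items()
        },
    }


# Invented counts for a single area, for illustration only.
example_area = {
    "population": 200000,
    "users": 1500,
    "misspellers": 90,
    "tweets": 40000,
    "tweets_by_interval": {"morning": 2400, "afternoon": 16000, "night": 1600},
    "term_mentions": {"job": 200},
}
variables = normalize_area(example_area)
print(variables)
```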
Correlation between variables
Variables are constructed to reflect the behavior of areas in the different dimensions of Twitter: penetration, social or geographical diversity, activity through the day, and content. Correlation between variables does indeed show that variables within each dimension are strongly correlated with each other. As we can see in figure 10, social and geographical diversities are highly correlated, an expected fact given that the gravity law accurately describes not only the flows of people between geographical areas but also the amount of communication between them. The same behavior is found for the variables in the activity group, while content variables are less correlated. Finally, we find that both the penetration rate τ_i and the fraction of misspellers ε_i have a strong correlation with most of the variables.
High correlation between variables might lead to collinearity effects [24] in the linear regression models, that is, some variables with predictive value might have non-significant weights because they explain the same part of the variance. For instance, in Table 5 the misspellers rate has a very strong predictive value but its p-value is too high to consider it significant. To test this hypothesis we perform a principal component analysis (PCA) on the independent variables of the regression. Figure 10 exhibits the loadings of the different variables. The block structure shown in figure 10 results in similar directions of the variables in the first components of the PCA. We observe some groups of variables: on the one hand, geographical and social diversity seem to explain a large part of the variance; on the other hand, we find a perpendicular group of variables formed by temporal activity; finally, penetration rate and misspellers fraction seem to represent a different, independent direction of the data, with high collinearity between them. This might explain the low statistical significance in the models of section 4. In any case, the structure of the correlation matrix and the PCA results show that there is indeed information in all groups of variables, and thus we take one variable from each of them for our regression models.
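The collinearity check can be illustrated by computing the correlation matrix of the independent variables and the loadings of its first principal component. The sketch below uses plain power iteration to keep it dependency-free; this is an assumed simplification (it returns only the leading component and presumes the starting vector is not orthogonal to it), and the three toy variables are invented so that two of them are perfectly collinear.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(columns):
    """columns: dict name -> list of values. Returns (names, matrix)."""
    names = list(columns)
    matrix = [[pearson(columns[a], columns[b]) for b in names] for a in names]
    return names, matrix


def first_principal_component(matrix, iterations=200):
    """Leading eigenvector (loadings) and eigenvalue of a symmetric matrix,
    obtained by power iteration from a uniform starting vector."""
    size = len(matrix)
    vec = [1.0 / math.sqrt(size)] * size
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vec[j] for j in range(size)) for i in range(size)]
        norm = math.sqrt(sum(v * v for v in product))
        vec = [v / norm for v in product]
    eigenvalue = sum(
        vec[i] * sum(matrix[i][j] * vec[j] for j in range(size)) for i in range(size)
    )
    return vec, eigenvalue


# Toy variables (invented): the two diversities are perfectly collinear,
# while the activity variable is uncorrelated with both.
columns = {
    "geo_diversity": [1, 2, 3, 4, 5],
    "social_diversity": [2, 4, 6, 8, 10],
    "morning_activity": [1, -1, 0, -1, 1],
}
names, corr = correlation_matrix(columns)
loadings, eigenvalue = first_principal_component(corr)
explained_share = eigenvalue / len(names)  # fraction of total variance
print(names, loadings, explained_share)
```

In this toy case the first component loads equally on the two collinear variables and not at all on the third, which is the kind of block structure that signals that two regressors explain the same part of the variance.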
S6 Misspellers detection
In this work we consider only tweets in Spanish. Since several languages coexist in Spain, depending on the part of the country, the first step is to reduce our Twitter dataset to those tweets that are written in Spanish. This task is carried out using the n-gram based text categorization R library textcat [11]. Then, in order to decide whether a tweet contains a misspelling or not, we need to establish some patterns to search for in our set of tweets. Since we want to be sure that a detected mistake corresponds to a real misspeller, we do not consider the following cases:

• Lack of written accents. People tend to avoid writing accents when writing in a colloquial way.

• Mistakes derived from removing unnecessary letters. The most common cases are removing an h at the beginning of a word (in Spanish the letter h is not pronounced) or replacing the letters qu by k. We understand that these mistakes can be motivated by the length limit of tweets and not by a real misspelling.

• In the same vein, we neglect mistakes produced by removing letters in the middle of a word whose pronunciation can be deduced without them.

• We do not consider either mistakes related to features of specific areas in Spain. For example, in the south the pronunciation of ce and se is the same, which produces a large number of mistakes when writing. However, since we want to extract objective and equitable conclusions over the whole Spanish geography, we neglect those misspellings that only appear in a specific area.

Figure 9. Frequency plots for each variable constructed from Twitter: Tweets (unemployment), Penetration rate, Entropy 2 (social), Entropy 1 (social), Entropy 2 (geo), Entropy 1 (geo), Misspellers rate, Tweets (employment), Tweets (job), Activity (night), Activity (morning) and Activity (afternoon).

Figure 8. Top: Percentage of population in each age group (16-24, 25-34, 35-44, 45-54, 55-64) from the Spanish Census (dark bars) and surveys about users in Twitter (light bars). Bottom: performance of the linear models (predicted versus observed unemployment, in %) for each of the age groups: All ages (R2 = 0.47), < 24 (R2 = 0.62), 25-44 (R2 = 0.52) and > 44 (R2 = 0.26).

Likewise, we consider as real misspellings the following mistakes:

• Adding letters. For example, writing an h at the beginning of a word that starts with a vowel.

• Replacing the special cases mp, mb by the wrong spellings np, nb.

• Mixing up b with v, g with j, ll with y, and ex with es. These are typical mistakes in Spanish because they have the same or a very close pronunciation.

• Confusing the verb haber with the periphrasis a ver.

• Separating a word into two, for instance writing the word conmigo as con migo.
This way, our list of misspellings is composed of 617 common mistakes in Spanish that cannot be attributed to the special features of Twitter or to a specific region of Spain. One can thus expect this selection to provide an accurate and equitable method of detecting misspellers. Under these conditions, the number of users who wrote at least one misspelled word is 27055 (56 over the whole population).
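The detection step described above can be sketched as a dictionary lookup over tweet texts. This is only an illustration under stated assumptions: the three entries in `MISSPELLINGS` are hypothetical stand-ins for the 617-entry list (which is not reproduced here), the function names are ours, and whole-word, case-insensitive matching is an assumed matching rule.

```python
import re

# Hypothetical stand-ins for the 617-entry list: misspelled form -> correct form.
MISSPELLINGS = {
    "hechar": "echar",       # h added before a vowel
    "canbio": "cambio",      # "nb" written instead of "mb"
    "con migo": "conmigo",   # one word split into two
}

def build_pattern(misspellings):
    """Compile one regex matching any listed form as a whole word or phrase."""
    # Longest forms first, so multi-word entries are tried before shorter ones.
    forms = sorted(map(re.escape, misspellings), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(forms) + r")\b", re.IGNORECASE)

def find_misspellers(tweets, pattern):
    """tweets: list of (user_id, text). Users with at least one listed mistake."""
    return {user for user, text in tweets if pattern.search(text)}

def misspeller_rate(tweets, pattern):
    """Fraction of distinct users who wrote at least one listed mistake."""
    users = {user for user, _ in tweets}
    return len(find_misspellers(tweets, pattern)) / len(users) if users else 0.0
```

Applied per geographical area, `misspeller_rate` would give the misspellers rate variable; accent handling and language filtering are deliberately left out of this sketch.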
Figure 10. Top: Correlation matrix between the variables constructed from Twitter. Each entry in the matrix is depicted as a circle whose size is proportional to the correlation between variables and whose color is blue/red for positive/negative correlations. Blank entries correspond to statistically insignificant correlations with 95% confidence. Bottom: Projection of the variables on the first two principal components given by PCA. We observe different groups of variables and collinearity between some of them.

We analyze whether misspellers use Twitter differently from people who do not make serious mistakes when publishing a tweet. Comparing the average number of tweets, we observe that misspellers tend to publish more tweets than those who did not make mistakes (14471 against 2372). This also emerges when the mean number of misspellings given the total number of tweets is considered. For users with fewer than approximately 30 published tweets in the observation period, the number of misspellings is almost zero, whereas for users who publish more often the mean number of misspellings scales sub-linearly with the number of tweets (exponent ≈ 0.33).
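A sub-linear scaling exponent of this kind can be estimated with a straight-line fit in log-log space. The text does not state which estimator was used, so the ordinary least-squares fit below is an assumption, and the function name `scaling_exponent` is ours.

```python
import numpy as np

def scaling_exponent(n_tweets, mean_misspellings):
    """Fit mean_misspellings ~ c * n_tweets**a by least squares on logs.

    Returns (a, c). Assumes all inputs are strictly positive; in practice one
    would restrict the fit to the range where scaling holds (for example,
    users above roughly 30 tweets).
    """
    x = np.log(np.asarray(n_tweets, dtype=float))
    y = np.log(np.asarray(mean_misspellings, dtype=float))
    slope, intercept = np.polyfit(x, y, 1)  # highest-degree coefficient first
    return slope, np.exp(intercept)
```

An exponent below one means that doubling the number of tweets less than doubles the expected number of misspellings.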
Since we have observed a segmentation of the Twitter population based on how accurately users write, we consider the misspellers rate as a proxy of the educational level of the cities. A large number of previous works have revealed the relationship between the economic status and the educational level of geographical areas, and it is therefore natural to ask whether the observed misspellers rate is related to the economy, as measured by the unemployment rate. To test this hypothesis, we consider cities with more than 5000 inhabitants to avoid undersampled cases. We find a strong positive correlation between the probability of finding a misspeller in a city and the unemployment rate (0.372, 0.491).
Figure 11. Number N(miss|tweets) (red) and probability P(miss|tweets) (blue) of observed misspellings given the number of tweets.

Figure 12. Explanatory power (R2) of the linear regression model when fitted against the unemployment data for different months, from 2012-10 to 2014-06. The gray (orange) area corresponds to the time window in which Twitter data is collected and variables are constructed.
S7 Time window and unemployment
In the definition of the variables we have aggregated the Twitter activity within a 7-month time window (from December 2012 to June 2013). Since unemployment varies significantly over time, we investigate here the correlation and explanatory power of the Twitter variables for the values of unemployment determined at different months within the time window in which the Twitter data was collected, and whether the variables collected in that time window are more correlated with past or future values of unemployment. Figure 12 shows the explanatory power of the model when the linear regression is done for values of unemployment of different months before, during and after the Twitter data time window. Although there is a small seasonal effect along the year, the explanatory power remains around R2 = 0.6, which suggests that our Twitter linear model retains its explanatory power even though unemployment changes considerably throughout the year. It is interesting to note that R2 decays slightly during the summer, which means that our variables are less correlated with summer unemployment. Finally, the unemployment used in the main article is from June 2013, i.e. the last month in the time window used to collect the data.
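The month-by-month comparison amounts to refitting the same linear model against a different unemployment vector each time and recording R2. The sketch below assumes a fixed matrix `X` of Twitter variables (one row per area) and a dictionary of monthly unemployment vectors; the names `r_squared` and `r2_by_month` are ours, and plain ordinary least squares with an intercept is assumed.

```python
import numpy as np

def r_squared(X, y):
    """R2 of an ordinary least-squares fit of y on X plus an intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()

def r2_by_month(X, unemployment_by_month):
    """Refit the same Twitter variables X against each month's unemployment."""
    return {month: r_squared(X, y) for month, y in unemployment_by_month.items()}
```

Plotting the returned values against the month keys would give a curve of the kind shown in Figure 12.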
S8 Demographics does not explain unemployment
Since unemployment rates are very large for the group of young people, a natural question is whether demographic variables alone could explain the heterogeneity of youth unemployment rates found in the geographical areas. To test this, we have built four linear models: the first one (named Youth model in Table 5) has the rate of young population as its only explanatory variable; the second and third are built on only the Twitter variables considered in the main text (Twitter model (I)) or on just those whose regression coefficients are statistically significant (Twitter model (II)); the fourth is fitted with all the variables (named All variables model in Table 5). In Table 5 we show the summary of the regression for each model. Focusing on the variance explained by the model in terms of R2, it can be checked that considering all Twitter variables is almost three times more explanatory than considering only the proportion of young people. On the other hand, the comparison of R2 for the Twitter model with the one for the All variables and Youth models shows that the rate of young population does not provide significant explanatory power. This semi-partial analysis shows that our Twitter variables retain a high explanatory power when the effect of the young population rate is controlled.
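The nested-model comparison can be illustrated as follows. This is a minimal sketch rather than the analysis behind Table 5: the function names and the input arrays `twitter` (areas by Twitter variables), `youth` (rate of young population per area) and `y` (youth unemployment rate) are assumptions. When a single variable is added to a model, the gain in R2 equals its squared semi-partial correlation with the response.

```python
import numpy as np

def fit_r2(X, y):
    """Return (R2, adjusted R2) of an OLS fit of y on X with an intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
    adj = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
    return r2, adj

def compare_models(twitter, youth, y):
    """R2 of the youth-only, Twitter-only and combined models."""
    r2_youth, _ = fit_r2(youth, y)
    r2_twitter, _ = fit_r2(twitter, y)
    r2_all, _ = fit_r2(np.column_stack([twitter, youth]), y)
    return {
        "youth": r2_youth,
        "twitter": r2_twitter,
        "all": r2_all,
        # Extra variance explained by the young population rate once the
        # Twitter variables are already in the model.
        "gain_from_youth": r2_all - r2_twitter,
    }
```

A small `gain_from_youth` alongside a large `twitter` value is the pattern the paragraph above describes.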
S9 Unemployment models for other geographical areas
While municipalities are very heterogeneous demographically, other administrative areas exist in Spain at larger scales that could be used for our model of unemployment. As mentioned in section 4, the smallest administrative division of Spain we have considered is that of the 8200 municipalities. At larger scales we have the 326 counties (comarcas in Spanish), which are aggregations of municipalities. Finally, the largest geographical scale we considered is defined by the 50 provinces (provincias in Spanish). In this section we compare the performance of our Twitter model for unemployment for the variables defined in those administrative areas and relate it to the geographical communities detected and used in the main paper (see section 4). Not all the areas at the different administrative divisions are considered in the model. To minimize the effect of areas in which the number of geo-tagged tweets is very small, we only consider the 1738 municipalities which have a Twitter population π > 10. Similarly, we only consider the 198 counties with π > 100. As we can see in Table 6, the model has a large explanatory power for areas equal to or bigger than counties. As expected, R2 increases as the number of areas in the model gets smaller, but the description level of the model is very low for provinces, for example. The best performance (high R2 and high geographical description level) is attained at the level of the detected communities.
S10 Relative importance of the variables
To assess the relative importance of the variables in the unemployment model we have used several methods. They all give qualitatively the same results, with some variations for the statistically insignificant variables. Specifically, we have used:

1. (weight) Relative weight of the absolute values of the coefficients obtained in the linear regression when variables are scaled to have mean zero and variance one.
2. (lmg) Averaging over orderings, as proposed by Lindeman, Merenda and Gold.
3. (pmvd) The PMVD metric introduced by Feldman, which is an average over orderings as well but with data-dependent weights.
4. (first) The univariate R2 values from regression models with one variable only.

Figure 13. Relative importance of the variables (in percentage) in the unemployment model for different ways to calculate it (weight, first, lmg, pmvd).
All these metrics are obtained using the relaimpo R package [12]. The results for the young unemployment model are shown in Figure 13, where we can see that the different methods yield similar relative importance of the variables, except perhaps for the diversity of mobility flows, a variable with a non-significant weight in the regression model.
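Two of the four metrics, weight and first, are simple enough to sketch directly. The code below is not the relaimpo implementation: it is a NumPy illustration with function names of our own, and rescaling each metric to sum to 100% is an assumption made to match the percentages in Figure 13. The lmg and pmvd metrics, which average over variable orderings, are omitted.

```python
import numpy as np

def standardize(a):
    """Scale to mean zero and variance one (column-wise for 2-D input)."""
    a = np.asarray(a, dtype=float)
    return (a - a.mean(axis=0)) / a.std(axis=0, ddof=1)

def weight_importance(X, y):
    """Share (%) of each variable in the sum of absolute standardized coefficients."""
    Z, z = standardize(X), standardize(y)
    # No intercept is needed because both sides are centered.
    beta, *_ = np.linalg.lstsq(Z, z, rcond=None)
    w = np.abs(beta)
    return 100.0 * w / w.sum()

def first_importance(X, y):
    """Share (%) of each variable's univariate R2 (its squared correlation with y)."""
    X = np.asarray(X, dtype=float)
    r2 = np.array([np.corrcoef(X[:, j], y)[0, 1] ** 2 for j in range(X.shape[1])])
    return 100.0 * r2 / r2.sum()
```

With strongly collinear variables the two metrics can disagree, which is one reason to compare several of them.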
References
[1] B. Ashtakala. Generalized power model for trip distribution. Transportation Research Part B: Methodological, 21(1):59–67, 1987.
[2] Michel Bierlaire. Mathematical models for transportation demand analysis. Transportation Research Part A: Policy and Practice, 31(1):86–86, 1997.
[3] Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.
[4] Harry J. Casey Jr. The law of retail gravitation applied to traffic engineering. Traffic Quarterly, 9(3), 1955.
[5] Aaron Clauset, Mark E. J. Newman, and Cristopher Moore. Finding community structure in very large networks. Physical Review E, 70(6):066111, 2004.
[6] Leon Danon, Albert Diaz-Guilera, Jordi Duch, and Alex Arenas. Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment, 2005(09):P09008, 2005.
[7] Servicio Publico de Empleo Estatal (SEPE). Spanish registered unemployment. http://www.sepe.es/contenidos/que_es_el_sepe/estadisticas/index.html
[8] Instituto Nacional de Estadística. Spanish 2011 census. http://www.ine.es/censos2011_datos/cen11_datos_inicio.htm
[9] Suzanne P. Evans. A relationship between the gravity model for trip distribution and the transportation problem in linear programming. Transportation Research, 7(1):39–61, 1973.
[10] Paul Expert, Tim S. Evans, Vincent D. Blondel, and Renaud Lambiotte. Uncovering space-independent communities in spatial networks. Proceedings of the National Academy of Sciences, 108(19):7663–7668, 2011.
[11] Ingo Feinerer, Christian Buchta, Wilhelm Geiger, Johannes Rauch, Patrick Mair, and Kurt Hornik. The textcat package for n-gram based text categorization in R. Journal of Statistical Software, 52(6):1–17, 2013.
[12] Ulrike Gromping. Relative importance for linear regression in R: the package relaimpo. Journal of Statistical Software, 17(1):1–27, 2006.
[13] Bartosz Hawelka, Izabela Sitko, Euro Beinat, Stanislav Sobolevsky, Pavlos Kazakopoulos, and Carlo Ratti. Geo-located twitter as the proxy for global mobility patterns. arXiv preprint arXiv:1311.0680, 2013.
[14] Yu Liu, Zhengwei Sui, Chaogui Kang, and Yong Gao. Uncovering patterns of inter-urban trips and spatial interactions from check-in data. arXiv preprint arXiv:1310.0282, 2013.
[15] Mark E. J. Newman. Finding community structure in networks using the eigenvectors of matrices. Physical Review E, 74(3):036104, 2006.
[16] Pascal Pons and Matthieu Latapy. Computing communities in large networks using random walks. In Computer and Information Sciences-ISCIS 2005, pages 284–293. Springer, 2005.
[17] Usha Nandini Raghavan, Reka Albert, and Soundar Kumara. Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 76(3):036106, 2007.
[18] Martin Rosvall and Carl T. Bergstrom. Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences, 105(4):1118–1123, 2008.
[19] Morton Schneider. Gravity models and trip distribution theory. Papers in Regional Science, 5(1):51–56, 1959.
[20] Filippo Simini, Marta C. Gonzalez, Amos Maritan, and Albert-Laszlo Barabasi. A universal model for mobility and migration patterns. Nature, 484(7392):96–100, 2012.
[21] Chaoming Song, Tal Koren, Pu Wang, and Albert-Laszlo Barabasi. Modelling the scaling properties of human mobility. Nature Physics, 6(10):818–823, 2010.
[22] Alan Geoffrey Wilson. Entropy in urban and regional modelling. Pion Ltd, 1970.
[23] Alan Geoffrey Wilson. Urban and regional models in geography and planning. 1974.
[24] Svante Wold, Arnold Ruhe, Herman Wold, and W. J. Dunn III. The collinearity problem in linear regression: the partial least squares (PLS) approach to generalized inverses. SIAM Journal on Scientific and Statistical Computing, 5(3):735–743, 1984.
Gravity Model
Parameter   Description                             Spain
α1          Origin exponent                         0.477*** (0.002)
α2          Destination exponent                    0.478*** (0.002)
β           Distance exponent                       1.05*** (0.0035)
R2          Goodness of fit                         0.797
φ           Correlation between T_ij and T^gra_ij   0.826

Table 1. Description of the parameters for the Gravity Law Model in geo-tagged social media data for Spain. (***) means significance p < 0.0001.
NMI between G and Gp for different p
Algorithm   p = 0.01   0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09    0.1
FG          0.995      0.992   0.989   0.983   0.981   0.977   0.983   0.969   0.980   0.959
WT          0.954      0.959   0.950   0.954   0.945   0.948   0.947   0.935   0.926   0.931
IM          0.988      0.981   0.980   0.981   0.978   0.974   0.975   0.970   0.969   0.966
ML          0.994      0.978   0.979   0.983   0.948   0.934   0.972   0.952   0.973   0.947
LP          0.906      0.908   0.911   0.915   0.895   0.907   0.907   0.893   0.905   0.904
LE          0.960      0.957   0.956   0.859   0.910   0.892   0.908   0.858   0.885   0.884

Table 2. NMI measure comparing G and Gp.
Communities Stats
Algorithm   〈|Ni|〉i     max|Ni|   |Ni|   Modularity   NMI P   NMI C
FG          309.696     1385      23     0.726        0.712   0.590
WT          9.262       433       769    0.417        0.744   0.757
IM          21.011      143       339    0.758        0.770   0.831
ML          323.772     1132      22     0.800        0.717   0.599
LP          22.052      750       323    0.732        0.749   0.761
LE          1017.571    5344      7      0.381        0.264   0.205

Table 3. Statistics of the communities Ni returned by the six algorithms. NMI P refers to the comparison between communities and provinces, whereas NMI C considers counties instead of provinces.
                         All ages    < 24        25−44       > 44
(Intercept)              0.11****    0.10***     0.20***     0.20***
                         (0.02)      (0.03)      (0.03)      (0.035)
Penetration rate         3.23*       8.57***     6.28**      2.40
                         (1.41)      (2.22)      (2.17)      (2.77)
Geographical diversity   0.03        0.15***     0.08*       0.06
                         (0.02)      (0.04)      (0.04)      (0.05)
Social diversity         −0.03*      −0.03       −0.05*      −0.06*
                         (0.01)      (0.02)      (0.02)      (0.03)
Morning activity         −0.69*      −1.30**     −1.53***    −1.19*
                         (0.26)      (0.42)      (0.41)      (0.52)
Misspellers rate         11.56       31.51*      15.46       23.60
                         (8.13)      (12.78)     (12.48)     (15.94)
Employment mentions      −1.80       3.17        −9.94       2.71
                         (6.27)      (9.86)      (9.64)      (12.3)
R2                       0.47        0.64        0.55        0.29
Adj R2                   0.44        0.62        0.52        0.26
***p < 0.001, **p < 0.01, *p < 0.05

Table 4. Regression table for the different models in which unemployment for different age groups is fitted. The All ages model is the fit to the general rate of unemployment in each geographical area, while the other models are for the rates of unemployment in groups of less than 24 years, between 25 and 44 years, and above 44 years.
                         All variables   Youth model   Twitter model (I)   Twitter model (II)
(Intercept)              0.06            −0.02         0.10***             0.09***
                         (0.03)          (0.03)        (0.03)              (0.027)
Young pop rate           0.66*           2.20***
                         (0.30)          (0.35)
Penetration rate         8.20***                       8.57***             8.62***
                         (2.25)                        (2.22)              (2.21)
Geographical diversity   0.14***                       0.15***             0.12***
                         (0.04)                        (0.04)              (0.03)
Social diversity         −0.02                         −0.03
                         (0.02)                        (0.02)
Morning activity         −1.42***                      −1.30**             −1.28**
                         (0.41)                        (0.42)              (0.41)
Misspellers rate         23.95                         31.51*              32.28*
                         (13.09)                       (12.78)             (12.71)
Employment mentions      0.34                          3.17
                         (9.81)                        (9.86)
R2                       0.65            0.24          0.64                0.63
Adj R2                   0.63            0.24          0.62                0.62
***p < 0.001, **p < 0.01, *p < 0.05

Table 5. Regression table for the different statistical models. The All variables model includes both Twitter and rate of young population variables. Twitter model (I) includes only the variables described in the main article, while Twitter model (II) only includes those variables which are significant (p < 0.05) in Twitter model (I).
                         Communities   Municipalities   Counties    Provinces
(Intercept)              0.10***       0.16***          0.11***     0.11*
                         (0.03)        (0.01)           (0.03)      (0.05)
Penetration rate         8.57***       4.01***          9.12***     10.47***
                         (2.22)        (0.59)           (1.81)      (1.97)
Geographical diversity   0.15***       0.02             0.12***     0.08
                         (0.04)        (0.01)           (0.03)      (0.07)
Social diversity         −0.03         −0.01            −0.01       −0.03
                         (0.02)        (0.01)           (0.02)      (0.07)
Morning activity         −1.30**       −1.16***         −1.49***    −1.03
                         (0.42)        (0.14)           (0.39)      (0.88)
Misspellers rate         31.51*        14.40***         14.09
                         (12.78)       (2.51)           (10.02)
Employment mentions      3.17          −0.71            2.41        −3.17
                         (9.86)        (0.89)           (8.86)      (12.29)
Number of points         128           1738             198         50
R2                       0.64          0.22             0.55        0.65
Adj R2                   0.62          0.21             0.54        0.61
***p < 0.001, **p < 0.01, *p < 0.05

Table 6. Regression table for the unemployment linear regression model in different levels of geographical areas. In the Provinces model the misspellers rate has been removed from the model due to the large collinearity with the penetration rate.
economy. Regarding the latter, our work is important for understanding how country-scale analysis of social media should consider not only the demographic but also the economic differences between users. As we have shown, users in areas with high unemployment have different mobility, different social interactions and different daily activity than those in low unemployment areas. This intertwined relationship between user behavior and employment should be considered not only in economic analyses derived from social media, but also in other applications like marketing, communication, social mobilization, etc.
It is particularly remarkable that Twitter data can provide these accurate results. Among the many currently popular social networking platforms, Twitter is perhaps the noisiest, sparsest and most ‘sabotaged’ medium: very few users send out messages at a regular rate, most of the users do not have geolocated information, the social relationships (followers/followers) contain a lot of unused/unimportant links, it is plagued by spam-bots and, last but not least, we have no way to identify the motive/goal/functionality of the mobility fluxes we are able to extract. These limitations are not particular to our sample, but general to the sample Twitter data being employed in the computational social science community. Despite all these caveats, we are able to show that even some simple filtering techniques together with basic statistical regressions yield predictive power about a variable as important as unemployment. Other social media platforms such as Facebook, Google+, Sina Weibo, Instagram, Orkut or Flickr, with more granular and consistent individual data, are likely to provide similar or better results by themselves or in combination. Further improvements can be obtained by the use of more sophisticated statistical machine learning techniques, some of them even tailored to the peculiarities of social media data. Our work serves to illustrate the tremendous potential of these new digital datasets to improve the understanding of society's functioning at the finer scales of granularity.
The usefulness of our approach must be considered against the cost and update rate of performing detailed surveys of mobility, social structure and economic performance. Our database is publicly articulated, which means that our analysis could be replicated easily in other countries, other time periods and with different scopes. Naturally, surveys provide more accurate results, but they also consume considerably higher financial and human resources, employing hundreds of people and taking months, even years, to complete and be released; they are so costly that countries going through economic recession have considered discontinuing them or altering their update rate in recent times. A particularly problematic aspect of these surveys is that they are “out-of-sync”, i.e. a census may be up to date whereas the travel surveys of those same individuals may not be, and therefore drawing inferences between both may be particularly difficult. This is a challenging problem that the immediateness of social media can help ameliorate.
A few questions remain open for further investigation. How can traditional surveys and social media digital traces be best combined to maximize their predictive ability? Can social media provide a reliable leading indicator to unemployment and, in general, to economic surveys? How much reliable lead is possible, if any at all? As we have found, Twitter penetration and educational levels are correlated with unemployment, but these levels are unlikely to change rapidly enough to describe or anticipate changes in the economy or unemployment. However, other indicators like daily activity, social interactions and geographical mobility are more connected with our daily activity, and perhaps they have more predictive power to show and/or anticipate sudden changes in employment. The relationship between unemployment and individual and group behavior may help contextualize the multiple factors affecting the socioeconomic well-being of a region: while penetration, content, daily activity and mobility diversity seem to be highly correlated with unemployment in Spain, different weights for each group of traces might be expected in other countries [14]. Finally, digital traces could serve as an alternative (sometimes the only one available) to the lack of surveys in poor or remote areas [20, 27]. Another interesting avenue of research involves the use of social media to detect mismatches between the real (hidden, underground) economy and the officially reported one [38].
Most importantly, the immediacy of social media may also allow governments to better measure and understand the effect of policies, social changes, and natural or man-made disasters on the economic status of cities in almost real time [18, 39]. These new avenues for research provide great opportunities at the intersection of the economic, social and computational sciences that originate from these new, widespread, inexpensive datasets.
Acknowledgments
We would like to thank Kristina Lerman, Lada Adamic, James Fowler, Daniel Villatoro and Ricardo Herranz for stimulating discussions, and Yuri Kryvasheyeu and Thomas Bochynek for their critical reading of the manuscript. This work was partially supported by Spanish Ministry of Science and Technology Grant FIS2013-47532-C3-3-P (to A.L., M.G.H. and E.M.). Manuel Cebrian is funded by the Australian Government as represented by the Department of Broadband, Communications and Digital Economy and the Australian Research Council through the ICT Centre of Excellence program.
References
[1] Becker, G. S. (1976) The economic approach to human behavior. (University of Chicago Press).
[2] Granovetter, M. (1985) Economic action and social structure: the problem of embeddedness. American journal of sociology pp. 481–510.
[3] Camerer, C. F., Loewenstein, G., & Rabin, M. (2011) Advances in behavioral economics. (Princeton University Press).
[4] Glaeser, E. L., Kallal, H. D., Scheinkman, J. A., & Shleifer, A. (1991) Growth in cities, (National Bureau of Economic Research), Technical report.
[5] Bettencourt, L. M., Lobo, J., Helbing, D., Kuhnert, C., & West, G. B. (2007) Growth, innovation, scaling, and the pace of life in cities. Proceedings of the National Academy of Sciences 104, 7301–7306.
[6] Batty, M. (2008) The size, scale, and shape of cities. Science 319, 769–771.
[7] Milgram, S. (1974) The experience of living in cities. Crowding and behavior 167, 41.
[8] Pan, W., Ghoshal, G., Krumme, C., Cebrian, M., & Pentland, A. (2013) Urban characteristics attributable to density-driven tie formation. Nature communications 4.
[9] Gonzalez, M. C., Hidalgo, C. A., & Barabasi, A.-L. (2008) Understanding individual human mobility patterns. Nature 453, 779–782.
[10] Calabrese, F., Diao, M., Lorenzo, G. D., Jr., J. F., & Ratti, C. (2013) Understanding individual mobility patterns from urban sensing data: A mobile phone trace example. Transportation Research Part C: Emerging Technologies 26, 301–313.
[11] Cheng, Z., Caverlee, J., Lee, K., & Sui, D. Z. (2011) Exploring Millions of Footprints in Location Sharing Services. (AAAI, Menlo Park, CA, USA).
[12] Cho, E., Myers, S. A., & Leskovec, J. (2011) Friendship and mobility: user movement in location-based social networks, KDD ’11. (ACM, New York, NY, USA), pp. 1082–1090.
[13] Sun, L., Jin, J. G., Axhausen, K. W., Lee, D.-H., & Cebrian, M. (2014) Quantifying long-term evolution of intra-urban spatial interactions. arXiv preprint arXiv:1407.0145.
[14] Eagle, N., Macy, M., & Claxton, R. (2010) Network diversity and economic development. Science 328, 1029–1031.
[15] Henrich, J., Boyd, R., Bowles, S., Camerer, C., Fehr, E., Gintis, H., & McElreath, R. (2001) In search of homo economicus: behavioral experiments in 15 small-scale societies. American Economic Review pp. 73–78.
[16] Krieger, N., Williams, D. R., & Moss, N. E. (1997) Measuring social class in US public health research: concepts, methodologies, and guidelines. Annual review of public health 18, 341–378.
[17] Groves, R. M., Fowler Jr., F. J., Couper, M. P., Lepkowski, J. M., Singer, E., & Tourangeau, R. (2013) Survey methodology. (John Wiley & Sons).
[18] Lazer, D., Pentland, A. S., Adamic, L., Aral, S., Barabasi, A. L., Brewer, D., Christakis, N., Contractor, N., Fowler, J., Gutmann, M., et al. (2009) Life in the network: the coming age of computational social science. Science (New York, NY) 323, 721.
[19] Smith, C., Quercia, D., & Capra, L. (2013) Finger on the pulse: identifying deprivation using transit flow analysis. (ACM), pp. 683–692.
[20] Soto, V., Frias-Martinez, V., Virseda, J., & Frias-Martinez, E. (2011) Prediction of socioeconomic levels using cell phone records, UMAP’11. (Springer-Verlag, Berlin, Heidelberg), pp. 377–388.
[21] Preis, T., Moat, H. S., Stanley, H. E., & Bishop, S. R. (2012) Quantifying the advantage of looking forward. Scientific reports 2.
[22] Antenucci, D., Cafarella, M., Levenstein, M. C., Re, C., & Shapiro, M. D. (2014) Using social media to measure labor market flows, (National Bureau of Economic Research), Technical report.
[23] Hawelka, B., Sitko, I., Beinat, E., Sobolevsky, S., Kazakopoulos, P., & Ratti, C. (2013) Geo-located twitter as the proxy for global mobility patterns. arXiv preprint arXiv:1311.0680.
[24] Lathia, N., Quercia, D., & Crowcroft, J. (2012) in Pervasive computing. (Springer), pp. 91–98.
[25] Frias-Martinez, V., Virseda, J., & Frias-Martinez, E. (2010) Socio-economic levels and human mobility.
[26] Gutierrez, T., Krings, G., & Blondel, V. D. (2013) Evaluating socio-economic state of a country analyzing airtime credit and mobile phone datasets. arXiv preprint arXiv:1309.4496.
[27] Smith, C., Mashhadi, A., & Capra, L. (2013) Ubiquitous sensing for mapping poverty in developing countries. Paper submitted to the Orange D4D Challenge.
[28] Erlander, S. & Stewart, N. F. (1990) The gravity model in transportation analysis: theory and extensions. (Vsp) Vol. 3.
[29] Simini, F., Gonzalez, M. C., Maritan, A., & Barabasi, A.-L. (2012) A universal model for mobility and migration patterns. Nature 484, 96–100.
[30] Lenormand, M., Picornell, M., Cantu-Ros, O. G., Tugores, A., Louail, T., Herranz, R., Barthelemy, M., Frias-Martinez, E., & Ramasco, J. J. (2014) Cross-checking different sources of mobility information. arXiv preprint arXiv:1404.0333.
[31] Barthelemy, M. (2011) Spatial networks. Physics Reports 499, 1–101.
[32] Expert, P., Evans, T. S., Blondel, V. D., & Lambiotte, R. (2011) Uncovering space-independent communities in spatial networks. Proceedings of the National Academy of Sciences 108, 7663–7668.
[33] Sobolevsky, S., Szell, M., Campari, R., Couronne, T., Smoreda, Z., & Ratti, C. (2013) Delineating geographical regions with networks of human interactions in an extensive set of countries. PloS one 8, e81707.
[34] Rosvall, M. & Bergstrom, C. T. (2008) Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105, 1118–1123.
[35] Danon, L., Diaz-Guilera, A., Duch, J., & Arenas, A. (2005) Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment 2005, P09008.
[36] ADigital. (2013) Uso de twitter en espana 2012. [Online; accessed 1-November-2014].
[37] Davenport, J. R. & DeLine, R. (2014) The readability of tweets and their geographic correlation with education. arXiv preprint arXiv:1401.6058.
[38] Schneider, F., Buehn, A., & Montenegro, C. E. (2011) Shadow economies all over the world: New estimates for 162 countries from 1999 to 2007. Handbook on the shadow economy pp. 9–77.
[39] Rutherford, A., Cebrian, M., Dsouza, S., Moro, E., Pentland, A., & Rahwan, I. (2013) Limits of social mobilization. Proceedings of the National Academy of Sciences 110, 6281–6286.
Supporting Information for
Social media fingerprints of unemployment
Alejandro Llorente, Manuel García-Herranz, Manuel Cebrian and Esteban Moro
S1 The dataset
Twitter provides an extremely rich and publicly available data set of user interactions, information flows and, thanks to the geo-location of tweets, user movements. Nevertheless, the representativeness of geo-located Twitter as a global source of mobility data has still received sparse attention. In this sense, while [13] present a promising and extensive study regarding global country-to-country movements (mostly driven by tourism), within-country human flows (comprising not only internal tourism but also, to a greater extent than country-to-country travel, visiting and commuting) still need further investigation. Therefore, throughout this work we compare our findings using geo-located Twitter with similar studies using commuting surveys.
For the Twitter analysis we consider almost 146 million geo-located Twitter messages (tweets) collected through the public API provided by Twitter for the continental part of Spain from 29th November 2012 to 10th April 2013. In this dataset we consider that there has been a trip from place l to place k if a user has tweeted in place l and place k consecutively. We only keep those transitions in which the first tweet and the second one are dated on the same day. We filter the trips database to avoid unrealistic transitions and keep only trips with a geographical displacement larger than 1 km (see Methods section). By this method, 138 million trips from 167376 different users are considered in our work.
From those trips we construct the mobility flow $T_{ij}$ between municipalities, which measures the number of trips in our database whose origin is within the boundaries of city $i$ and whose destination lies within those of city $j$.
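As a rough sketch, the flow $T_{ij}$ is a count over ordered origin-destination pairs. The example assumes that trips have already been mapped to municipalities (the point-in-boundary lookup is omitted), and the names used are hypothetical.

```python
from collections import Counter


def mobility_flows(trips):
    """Count trips per ordered (origin, destination) municipality pair."""
    flows = Counter()
    for origin, destination in trips:
        flows[(origin, destination)] += 1
    return flows


# Hypothetical trips, already assigned to municipalities.
trips = [
    ("Madrid", "Leganes"),
    ("Madrid", "Leganes"),
    ("Leganes", "Madrid"),
    ("Madrid", "Getafe"),
]
T = mobility_flows(trips)
```

Because pairs are ordered, `T[("Madrid", "Leganes")]` and `T[("Leganes", "Madrid")]` are counted separately, so the resulting flows need not be symmetric.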
We also consider population and economic information about the municipalities from the Spanish Census (2011) [8] and unemployment figures from the Public Service of Employment (Servicio Público de Empleo Estatal, SEPE) [7]. In the latter case, registered unemployment (in number of persons) is given for each Spanish municipality by gender, age and month. To get unemployment rates we divide registered unemployment by the total workforce in the municipality, estimated as the number of people aged between 16 and 65 years.
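The rate computation can be sketched as follows. The age-keyed dictionary and the inclusive treatment of the 16 and 65 bounds are assumptions for illustration, since the text does not specify them.

```python
def unemployment_rate(registered_unemployed, population_by_age):
    """Registered unemployment divided by the population aged 16 to 65."""
    workforce = sum(
        count for age, count in population_by_age.items() if 16 <= age <= 65
    )
    if workforce == 0:
        raise ValueError("no working-age population")
    return registered_unemployed / workforce


# Hypothetical municipality: population counts keyed by age in years.
population = {10: 500, 16: 400, 40: 1200, 65: 400, 70: 300}
rate = unemployment_rate(500, population)  # 500 / 2000
```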
S2 Twitter as mobility proxy
Considering all of the available transitions in our database, one can compute the distance between origin and destination, the elapsed time of the transition and the number of trips per user, among many other statistics. All of them seem to follow a power-law distribution with a cutoff due to the finite spatial size of Spain and the constraint of considering only transitions whose origin and destination check-ins are made on the same day. Focusing on the log-linear part of the distributions, self-similar behaviors arise when Twitter-based mobility is analyzed (see figure 5).
Twitter-based inter-city flows can be well modelled by means of the Gravity Law, which is one of the most widespread methods to represent human mobility [1, 19], with applications in many fields like urban planning [23], traffic engineering [4] or transportation problems [9]. The Gravity Law is also the solution to the problem of maximizing the entropy of the particle distribution among all the possible trips using statistical mechanics techniques [2, 22]. Recently it has also been used as a model for human mobility based on cell phone traces [10, 20, 21] and on social media data at a global scale [13] and at the inter-city level [14].
The Gravity Model for human mobility assumes that the flows between cities can be explained by the expression

$$T^{\mathrm{grav}}_{ij} = \frac{P_i^{\alpha_1} P_j^{\alpha_2}}{d_{ij}^{\beta}} \qquad (2)$$

where $T^{\mathrm{grav}}_{ij}$ is the flow, in terms of number of people, between cities $i$ and $j$, $d_{ij}$ is the geographical distance between them, and $P_i$ and $P_j$ are the populations of the two cities.
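Equation (2) translates directly into a small function. The sketch below uses illustrative exponents and city sizes chosen to give round numbers; they are not the fitted values reported in table 1.

```python
def gravity_flow(pop_i, pop_j, dist_ij, alpha1, alpha2, beta):
    """Flow predicted by equation (2): P_i^alpha1 * P_j^alpha2 / d_ij^beta."""
    if dist_ij <= 0:
        raise ValueError("distance must be positive")
    return (pop_i ** alpha1) * (pop_j ** alpha2) / (dist_ij ** beta)


# Illustrative inputs only: two cities of 10,000 inhabitants, 100 km apart.
flow = gravity_flow(10_000, 10_000, 100.0, alpha1=0.5, alpha2=0.5, beta=1.0)
# With beta = 1, doubling the distance halves the predicted flow.
flow_far = gravity_flow(10_000, 10_000, 200.0, alpha1=0.5, alpha2=0.5, beta=1.0)
```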
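Given the data, we can obtain the parameters of the model by Weighted Least Squares minimization:

$$\alpha_1^{*}, \alpha_2^{*}, \beta^{*} = \operatorname*{argmin}_{\alpha_1, \alpha_2, \beta} \; \frac{1}{N} \sum_{i,j} w_{ij} \left( T_{ij} - T^{\mathrm{grav}}_{ij} \right)^2 \qquad (3)$$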
where $N$ is the total number of connections in the mobility graph and $w_{ij}$ is a weight proportional to the number of observed transitions between $i$ and $j$. In particular, we find that taking wi j = T 13 i j gives the best performance in the model.
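A minimal sketch of the objective in equation (3) follows. The exhaustive grid search is only a stand-in for whatever optimizer was actually used (the text does not say), the weights are uniform for illustration, and the links are synthetic data generated with known exponents.

```python
from itertools import product


def gravity_flow(pop_i, pop_j, dist_ij, alpha1, alpha2, beta):
    """Gravity prediction of equation (2)."""
    return (pop_i ** alpha1) * (pop_j ** alpha2) / (dist_ij ** beta)


def weighted_loss(params, observations, weights):
    """Objective of equation (3): (1/N) * sum_ij w_ij * (T_ij - T_grav_ij)^2."""
    alpha1, alpha2, beta = params
    total = 0.0
    for (pop_i, pop_j, dist_ij, t_ij), w_ij in zip(observations, weights):
        residual = t_ij - gravity_flow(pop_i, pop_j, dist_ij, alpha1, alpha2, beta)
        total += w_ij * residual ** 2
    return total / len(observations)


def grid_search(observations, weights, grid):
    """Try every (alpha1, alpha2, beta) on a coarse grid; keep the best one."""
    return min(
        product(grid, grid, grid),
        key=lambda params: weighted_loss(params, observations, weights),
    )


# Synthetic links (P_i, P_j, d_ij) with flows generated from known exponents.
true_params = (0.5, 0.5, 1.0)
links = [(1e4, 4e4, 50.0), (9e4, 1e4, 120.0), (2.5e5, 1.6e5, 300.0), (4e4, 4e4, 20.0)]
observations = [(pi, pj, d, gravity_flow(pi, pj, d, *true_params)) for pi, pj, d in links]
weights = [1.0] * len(observations)  # uniform weights, for illustration only
best = grid_search(observations, weights, grid=[0.25, 0.5, 0.75, 1.0])
```

On noise-free synthetic data the search recovers the generating exponents; with real flows a continuous optimizer and flow-dependent weights would be the natural replacements.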
In our case, this model fits the inter-city mobility based on Twitter GPS check-ins quite accurately (see table 1). Even though we do not require $T_{ij}$ to be symmetric, the exponents of the two populations are similar, indicating that we observe similar flows in both directions between $i$ and $j$.
Figure 5. Probability distributions (density) of three properties of daily trips in the Twitter dataset: trip distance (km), number of trips and elapsed time (secs). Dashed lines correspond to power-law fits with exponents −1.67, −2.43 and −0.62, respectively.

S3 Community structures in inter-city mobility graph

Typically, complex networks exhibit community structure, that is, there are subsets of nodes that are more densely connected among themselves than to the rest of the nodes. In mobility networks, whose nodes correspond to geographical areas, these communities are interpreted as zones with high common activity and tend to be constrained by geographical and political barriers. We check whether this is also observed in our dataset by running six state-of-the-art community detection algorithms: FastGreedy [5], Walktrap [16], Infomap [18], MultiLevel [3], Label Propagation [17] and Leading Eigenvector [15]. These six algorithms yield different community structures in terms of number of communities, average community size or modularity (see table 3). Members (municipalities) of the resulting communities are spatially connected except in a few cases, as figure 7 shows. We test the statistical robustness of the obtained communities by randomly removing a proportion $p$ of the original links and running the algorithms on the resulting graph $G_p$. We consider the communities robust when those obtained for the original network $G$ and for $G_p$ are highly similar. In order to compare two arbitrary community memberships we use the Normalized Mutual Information (NMI) method described in [6], which returns 0 when two memberships are totally different and 1 when they are equal. We compute the NMI for each chosen algorithm run on $G$ and $G_p$, for $p$ between 1% and 10%, and conclude that the obtained community structures are robust because they are not broken when some randomly chosen links are removed (see table 2).
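The robustness check can be sketched in two pieces: removing a random proportion of links and comparing memberships with NMI. The sketch uses the common form NMI = 2 I(A;B) / (H(A) + H(B)); the community detection step itself is not shown, and the function names and toy memberships are assumptions.

```python
import random
from collections import Counter
from math import log


def remove_random_links(edges, p, seed=0):
    """Return a copy of `edges` with a proportion p removed at random."""
    rng = random.Random(seed)
    keep = len(edges) - round(p * len(edges))
    return rng.sample(edges, keep)


def normalized_mutual_information(labels_a, labels_b):
    """NMI = 2 * I(A;B) / (H(A) + H(B)) for two membership lists."""
    if len(labels_a) != len(labels_b):
        raise ValueError("memberships must cover the same nodes")
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))

    mutual_info = sum(
        (n_ab / n) * log(n_ab * n / (count_a[a] * count_b[b]))
        for (a, b), n_ab in joint.items()
    )
    entropy_a = -sum((c / n) * log(c / n) for c in count_a.values())
    entropy_b = -sum((c / n) * log(c / n) for c in count_b.values())
    if entropy_a + entropy_b == 0:
        return 1.0  # both memberships place every node in one community
    return 2 * mutual_info / (entropy_a + entropy_b)


# Toy memberships for four nodes.
original = [0, 0, 1, 1]
relabelled = [7, 7, 3, 3]   # same partition, different community labels
unrelated = [0, 1, 0, 1]    # statistically independent of `original`
nmi_same = normalized_mutual_information(original, relabelled)
nmi_unrelated = normalized_mutual_information(original, unrelated)

edges = [(i, i + 1) for i in range(100)]
pruned = remove_random_links(edges, p=0.1)
```

In the procedure described above, the memberships compared would be those returned by the same algorithm on $G$ and on the pruned graph $G_p$.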
Figure 6. Penetration rate versus population (logarithmic axes) for both cities and detected communities.
As other works have shown, mobility graph communities are usually interpreted in terms of geographical and political barriers, and a natural question is whether the mobility-based communities are related to any of these barriers. In Spain there are different territorial divisions for administrative purposes. In this work we consider two of them: provinces, defined in the 1978 Constitution, which are 48 different heterogeneous aggregations of municipalities; and counties (comarcas in Spanish terminology), which are traditional aggregations of municipalities mainly based on Spanish orography (rivers, valleys, ridges, etc.), some of which are composed of municipalities from different provinces. We again use the NMI method to compare the community structure given by the algorithms with the administrative limits. Except for the Leading Eigenvector algorithm, the methods return communities that are quite related to provinces (NMI ≈ 0.7), whereas for the county limits higher variability is observed. In this last case, the algorithm most closely related to county limits is Infomap (NMI ≈ 0.83). Therefore, Twitter-based mobility summarizes the inter-city flows, showing that these flows are influenced by geographical and political barriers.
S4 Twitter demographics and unemployment rates
Different age groups are not equally represented in Twitter. Recent surveys (2012) in Spain suggest that most (86%) of users in Twitter are 16 to 44 years old. Comparison of the percentage of users per age group with the total population within the same groups (see figure 8) reveals that age groups above 35 years old are under-represented in Twitter. Thus our Twitter data will be more revealing when trying to describe unemployment in age groups below 44 years old. This is indeed what we find when we try to build a linear model for the unemployment rate in different age groups with the same Twitter variables: while unemployment rates for ages below 24 can be fitted to a linear model with R2 = 0.62, regression models for unemployment rates for ages between 25 and 44 have R2 = 0.52, and for ages above 44 we get only R2 = 0.26. Table 4 summarizes the results for the regression models of unemployment rates in each age group, showing that our Twitter variables have more explanatory power for ages below 44. Finally, in figure 8 we can see the performance of the model for the different age groups and, once again, the poor explanatory power of the Twitter variables for the unemployment rate at ages above 44 years old is evident.
Figure 7. From left to right and from top to bottom: FastGreedy, Walktrap, Infomap, MultiLevel, Label Propagation and Leading Eigenvector communities on Twitter-based mobility transitions.

S5 Properties of Twitter variables
Normalization and distributions
Heterogeneity between the values of variables constructed from Twitter is large but moderate, as the histograms in figure 9 show. We did not find any geographical area with anomalous values in any of the variables considered. Variables are normalized in different ways: both the penetration rate $\tau_i$ and the misspellers rate $\varepsilon_i$ are defined as the number of users or misspellers per 100000 persons (population); the activity variables $\nu_i$ are normalized as the percentage of tweets per time interval; finally, the numbers of tweets that mention a specific term, $\mu_i$, are given per 100000 tweets published in the geographical area.
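The three normalizations just described can be sketched as follows. All counts are hypothetical and the helper names are chosen for illustration.

```python
def per_100k(count, base):
    """Express `count` as a rate per 100,000 units of `base`."""
    return 100_000 * count / base


def activity_shares(tweets_per_interval):
    """Percentage of an area's tweets that fall in each time interval."""
    total = sum(tweets_per_interval.values())
    return {name: 100 * n / total for name, n in tweets_per_interval.items()}


# Hypothetical area with 60,000 inhabitants and 48,000 published tweets.
penetration = per_100k(count=450, base=60_000)    # Twitter users per 100,000 persons
misspellers = per_100k(count=30, base=60_000)     # misspellers per 100,000 persons
job_mentions = per_100k(count=12, base=48_000)    # term mentions per 100,000 tweets
activity = activity_shares({"morning": 1200, "afternoon": 8400, "night": 2400})
```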
Correlation between variables
Variables are constructed to reflect the behavior of areas in the different dimensions of Twitter penetration, social or geographical diversity, activity through the day and content. The correlation between variables does indeed show that variables within each dimension are strongly correlated with each other. As we can see in figure 10, social and geographical diversities are highly correlated with each other, an expected fact given that the gravity law accurately describes not only the flows of people between geographical areas but also the amount of communication between them. The same behavior is found for the variables in the activity group, while content variables are less correlated. Finally, we find that both the penetration rate $\tau_i$ and the fraction of misspellers $\varepsilon_i$ are strongly correlated with most of the variables.
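A correlation matrix of this kind can be sketched with plain Pearson coefficients. The variable names and values below are made up for illustration, and the significance test used to blank out entries in figure 10 is not reproduced.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def correlation_matrix(variables):
    """Pairwise correlations, keyed by (name_a, name_b)."""
    names = list(variables)
    return {(a, b): pearson(variables[a], variables[b]) for a in names for b in names}


# Made-up values for four areas.
variables = {
    "geo_diversity": [1.0, 2.0, 3.0, 4.0],
    "social_diversity": [2.0, 4.0, 6.0, 8.5],
    "night_activity": [4.0, 3.0, 2.0, 1.0],
}
corr = correlation_matrix(variables)
```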
High correlation between variables might lead to collinearity effects [24] in the linear regression models, that is, some variables with predictive value might have non-significant weights because they explain the same part of the variance. For instance, in Table 5 the misspellers rate has a very strong predictive value but its p-value is too high to consider it significant. To test this hypothesis we perform a principal component analysis (PCA) on the independent variables of the regression. Figure 10 shows the loadings of the different variables. The block structure shown in figure 10 results in similar directions of the variables in the first components of the PCA. We observe some groups of variables: on the one hand, geographical and social diversity seem to explain a large part of the variance; on the other hand, we find a perpendicular group of variables formed by temporal activity; finally, penetration rate and misspellers fraction seem to represent a different, independent direction of the data, with high collinearity between them. This might explain the low statistical significance in the models of section 4. In any case, the structure of the correlation matrix and the PCA results show that there is indeed information in all groups of variables, and thus we have taken one variable from each of them for our regression models.
Figure 8. Top: Percentage of population in each age group (16-24, 25-34, 35-44, 45-54, 55-64) from the Spanish Census (dark bars) and surveys about users in Twitter (light bars). Bottom: performance of the linear models (predicted versus observed unemployment, in %) for each of the age groups: all ages (R2 = 0.47), < 24 (R2 = 0.62), 25-44 (R2 = 0.52) and > 44 (R2 = 0.26).

Figure 9. Frequency plots P(x) for each variable constructed from Twitter: penetration rate, geographical and social diversity (Entropy1 and Entropy2 for each), activity (morning, afternoon, night), misspellers rate, and tweets mentioning job, employment and unemployment.

S6 Misspellers detection
In this work we consider only tweets in Spanish. Since several languages coexist in Spain depending on the part of the country, the first step is to reduce our Twitter dataset to those tweets that are written in Spanish. This task is carried out using the n-gram based text categorization R library textcat [11]. Then, in order to decide whether a tweet contains a misspelling or not, we need to establish some patterns to look for in our set of tweets. Since we want to be sure that a detected mistake corresponds to a real misspeller, we do not consider the following cases:

• Lack of written accents. People tend to avoid writing accents when writing colloquially.
• Mistakes derived from removing unnecessary letters. The most common cases are removing an h at the beginning of a word (in Spanish the letter h is not pronounced) or replacing the letters qu by k. We understand that these mistakes can be motivated by the length limit of tweets and not by a real misspelling.
• Along the same lines, we neglect mistakes produced by removing letters in the middle of a word whose pronunciation can be deduced without them.
• We do not consider either mistakes related to features of specific areas in Spain. For example, in the south the pronunciation of ce and se is the same, which produces a large number of mistakes when writing. However, since we want to extract objective and equitable conclusions over the whole Spanish geography, we neglect those misspellings that only appear in a specific area.

Likewise, we consider as real misspellings the following mistakes:

• Adding letters. For example, writing an h at the beginning of a word that starts with a vowel.
• Replacing the special cases mp, mb with the wrong spellings np, nb.
• Mixing up b with v, g with j, ll with y, and ex with es. These are typical mistakes in Spanish because they have the same or a very close pronunciation.
• Confusing the verb haber with the periphrasis a ver.
• Separating a word into two, for instance writing the word conmigo as con migo.
This way, our list of misspellings is composed of 617 common mistakes in Spanish that cannot be attributed to the special features of Twitter or to a specific region of Spain. Thus one can expect that this selection provides an accurate and equitable method of detecting misspellers. Under these conditions, the number of users who wrote at least one misspelled word is 27055 (56 over the whole population).
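A list-based detector in the spirit of this procedure could look like the sketch below. The four entries are hypothetical examples built from the rules above, not items from the authors' list of 617 mistakes, and context-dependent cases such as haber versus a ver cannot be handled by simple word matching.

```python
import re

# Hypothetical mini-list: each key is a misspelling counted as "real",
# mapped to its correct form.
MISSPELLINGS = {
    "haver": "haber",        # b / v confusion
    "canpo": "campo",        # np written instead of mp
    "habrir": "abrir",       # h added before an initial vowel
    "con migo": "conmigo",   # one word split in two
}

PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(m) for m in MISSPELLINGS) + r")\b",
    flags=re.IGNORECASE,
)


def find_misspellings(text):
    """Return the listed misspellings found in `text`, lower-cased."""
    return [match.lower() for match in PATTERN.findall(text)]


def misspellers(tweets):
    """Users with at least one tweet containing a listed misspelling."""
    return {user for user, text in tweets if find_misspellings(text)}


tweets = [
    ("u1", "Tiene que haver algo"),
    ("u1", "hola"),
    ("u2", "Vamos al campo"),
    ("u3", "Ven con migo al cine"),
]
detected = misspellers(tweets)
```

Matching whole words against a fixed list keeps false positives low, which fits the stated goal of counting only mistakes that indicate a real misspeller.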
Figure 10. Top: Correlation matrix between the variables constructed from Twitter. Each entry in the matrix is depicted as a circle whose size is proportional to the correlation between variables and whose color is blue/red for positive/negative correlations. Blank entries correspond to statistically insignificant correlations at 95% confidence. Bottom: Projection of the variables on the first two principal components given by PCA. We observe different groups of variables and collinearity between some of them.

We analyze whether misspellers use Twitter differently from people who do not make serious mistakes when publishing a tweet. Comparing the average number of tweets, it can be observed that misspellers tend to publish a larger number of tweets than those who did not make mistakes (14471 against 2372). This also emerges when the mean number of misspellings given the total number of tweets is considered. For users with fewer than approximately 30 published tweets in the observation period the number of misspellings is almost zero, whereas for users who publish more often the mean number of misspellings scales sub-linearly with the number of tweets (exponent ≈ 0.33).
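A scaling exponent of this kind is commonly estimated as the slope of a straight line in log-log space. The text does not state how the exponent was fitted, so the least-squares approach and the synthetic data below are assumptions for illustration.

```python
from math import log


def loglog_slope(xs, ys):
    """Ordinary least-squares slope of log(y) against log(x)."""
    log_x = [log(x) for x in xs]
    log_y = [log(y) for y in ys]
    mean_x = sum(log_x) / len(log_x)
    mean_y = sum(log_y) / len(log_y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    var = sum((a - mean_x) ** 2 for a in log_x)
    return cov / var


# Synthetic data following y = 2 * x^(1/3); not the paper's measurements.
tweet_counts = [30, 60, 120, 240, 480]
mean_misspellings = [2 * t ** (1 / 3) for t in tweet_counts]
exponent = loglog_slope(tweet_counts, mean_misspellings)
```

An exponent below 1 means that doubling the number of tweets less than doubles the mean number of misspellings, which is what sub-linear scaling refers to.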
Since we have observed a segmentation of the Twitter population based on how accurately users write, we consider the misspeller rate as a proxy for the educational level of the cities. A large number of previous works in the literature have revealed the relationship between the economic status and the educational level of geographical areas, and therefore it is natural to ask whether the observed misspellers rate is related to the economy, as measured by the unemployment rate. To test this hypothesis we consider cities with more than 5000 inhabitants to avoid undersampled cases. We find a strong positive correlation between the probability of finding a misspeller in a city and the unemployment rate (0.372, 0.491).
Figure 11. Number (red), N(miss|tweets), and probability (blue), P(miss|tweets), of observed misspellings given the number of tweets.

Figure 12. Explanatory power (R2) of the linear regression model when fitted against the unemployment data for different months, from October 2012 to June 2014. The gray (orange) area corresponds to the time window in which Twitter data is collected and variables are constructed.
S7 Time window and unemployment
In the definition of the variables we have aggregated the Twitter activity within a 7-month time window (from December 2012 to June 2013). Since unemployment varies significantly over time, we investigate here the correlation and explanatory power of the Twitter variables for the values of unemployment determined at different months within the same time window in which the Twitter data was collected, and whether the variables collected in that time window are more correlated with past or future values of unemployment. Figure 12 shows the explanatory value of the model when the linear regression is done for values of unemployment in different months before, during and after the Twitter data time window. Although there is a small seasonal effect along the year, we see that the explanatory power remains around R2 = 0.6, which suggests that our Twitter linear model retains its explanatory power even though unemployment changes considerably throughout the year. It is interesting to note that R2 decays slightly during the summer, which means that our variables are less correlated with summer unemployment. Finally, the unemployment used in the main article is from June 2013, i.e. the last month in the time window used to collect the data.
S8 Demographics do not explain unemployment
Since unemployment rates are very large for the group of young people, a natural question is whether demographic variables alone could explain the heterogeneity of youth unemployment rates found in the geographical areas. To test this, we have built four linear models: the first one (named Youth model in Table 5) has the rate of young population as its only explanatory variable; the second and third are built on only the Twitter variables considered in the main text (named Twitter model (I)) or just on those whose regression coefficients are statistically significant (Twitter model (II)); the fourth one is fitted with all the variables (named All variables model in Table 5). Table 5 shows the summary of the regression for each model. Focusing on the variance explained by the model in terms of R2, it can be checked that considering all Twitter variables is about three times more explanatory than considering only the proportion of young people. On the other hand, the comparison of R2 for the Twitter model with those for the All variables and Youth models shows that the rate of young population does not provide significant explanatory power. This semi-partial analysis shows that our Twitter variables retain a high explanatory power when the effect of the young population rate is controlled for.
S9 Unemployment models for other geographical areas
While municipalities are very heterogeneous demographically, other administrative areas exist in Spain at larger scales that could be used for our model of unemployment. As mentioned in section 4, the smallest administrative division of Spain we have considered is that of the 8200 municipalities. At larger scales we have the 326 counties (comarcas in Spanish), which are aggregations of municipalities. Finally, the largest geographical scale we considered is defined by the 50 provinces (provincias in Spanish). In this section we compare the performance of our Twitter model for unemployment for the variables defined in those administrative areas and relate it to the geographical communities detected and used in the main paper (see section 4). Not all the areas at the different administrative divisions are considered in the model. To minimize the effect of areas in which the number of geo-tagged tweets is very small, we only consider the 1738 municipalities which have a Twitter population π > 10. Similarly, we only consider the 198 counties with π > 100. As we can see in Table 6, the model has a large explanatory power for areas equal to or bigger than counties. As expected, R2 increases as the number of areas in the model decreases, but the level of geographical detail of the model is very low for provinces, for example. The best performance (high R2 and high level of geographical detail) is attained at the level of the detected communities.
S10 Relative importance of the variables
To assess the relative importance of the variables in the unemployment model we have used several methods. They all give qualitatively the same results, with some variations for the statistically insignificant variables. Specifically, we have used:

1. (weight) Relative weight of the absolute values of the coefficients obtained in the linear regression when variables are scaled to have mean zero and variance one.
2. (lmg) Averaging over orderings, proposed by Lindeman, Merenda and Gold.
3. (pmvd) The PMVD metric introduced by Feldman, which is an average over orderings as well, but with data-dependent weights.
4. (first) The univariate R2 values from regression models with one variable only.

All these metrics are obtained using the relaimpo R package [12]. The results for the youth unemployment model are shown in figure 13, where we can see that the different methods yield similar relative importance of the variables, except perhaps for the diversity of mobility flows, a variable with a non-significant weight in the regression model.

Figure 13. Relative importance of the variables (in percentage) in the unemployment model for different ways to calculate it (weight, first, lmg, pmvd).
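The simplest of these metrics, the relative weight of the absolute standardized coefficients, can be sketched in a few lines. The coefficients below are hypothetical, and the lmg and pmvd metrics, which average over variable orderings, are not reproduced here.

```python
def relative_weights(standardized_coefficients):
    """Share (in %) of each variable in the sum of absolute coefficients."""
    total = sum(abs(c) for c in standardized_coefficients.values())
    return {
        name: 100 * abs(c) / total
        for name, c in standardized_coefficients.items()
    }


# Hypothetical standardized coefficients, not the values fitted in the paper.
coefficients = {
    "penetration": 0.5,
    "morning_activity": -0.25,
    "misspellers": 0.125,
    "social_diversity": -0.125,
}
importance = relative_weights(coefficients)
```

Taking absolute values means that a variable with a negative coefficient, such as morning activity in this toy example, still contributes according to the size of its effect.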
References
[1] B. Ashtakala. Generalized power model for trip distribution. Transportation Research Part B: Methodological, 21(1):59–67, 1987.
[2] Michel Bierlaire. Mathematical models for transportation demand analysis. Transportation Research Part A: Policy and Practice, 31(1):86–86, 1997.
[3] Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.
[4] Harry J. Casey Jr. The law of retail gravitation applied to traffic engineering. Traffic Quarterly, 9(3), 1955.
[5] Aaron Clauset, Mark E. J. Newman and Cristopher Moore. Finding community structure in very large networks. Physical Review E, 70(6):066111, 2004.
[6] Leon Danon, Albert Díaz-Guilera, Jordi Duch and Alex Arenas. Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment, 2005(09):P09008, 2005.
[7] Servicio Público de Empleo Estatal (SEPE). Spanish registered unemployment. http://www.sepe.es/contenidos/que_es_el_sepe/estadisticas/index.html
[8] Instituto Nacional de Estadística. Spanish 2011 census. http://www.ine.es/censos2011_datos/cen11_datos_inicio.htm
[9] Suzanne P. Evans. A relationship between the gravity model for trip distribution and the transportation problem in linear programming. Transportation Research, 7(1):39–61, 1973.
[10] Paul Expert, Tim S. Evans, Vincent D. Blondel and Renaud Lambiotte. Uncovering space-independent communities in spatial networks. Proceedings of the National Academy of Sciences, 108(19):7663–7668, 2011.
[11] Ingo Feinerer, Christian Buchta, Wilhelm Geiger, Johannes Rauch, Patrick Mair and Kurt Hornik. The textcat package for n-gram based text categorization in R. Journal of Statistical Software, 52(6):1–17, 2013.
[12] Ulrike Grömping. Relative importance for linear regression in R: the package relaimpo. Journal of Statistical Software, 17(1):1–27, 2006.
[13] Bartosz Hawelka, Izabela Sitko, Euro Beinat, Stanislav Sobolevsky, Pavlos Kazakopoulos and Carlo Ratti. Geo-located Twitter as the proxy for global mobility patterns. arXiv preprint arXiv:1311.0680, 2013.
[14] Yu Liu, Zhengwei Sui, Chaogui Kang and Yong Gao. Uncovering patterns of inter-urban trips and spatial interactions from check-in data. arXiv preprint arXiv:1310.0282, 2013.
[15] Mark E. J. Newman. Finding community structure in networks using the eigenvectors of matrices. Physical Review E, 74(3):036104, 2006.
[16] Pascal Pons and Matthieu Latapy. Computing communities in large networks using random walks. In Computer and Information Sciences - ISCIS 2005, pages 284–293. Springer, 2005.
[17] Usha Nandini Raghavan, Réka Albert and Soundar Kumara. Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 76(3):036106, 2007.
[18] Martin Rosvall and Carl T. Bergstrom. Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences, 105(4):1118–1123, 2008.
[19] Morton Schneider. Gravity models and trip distribution theory. Papers in Regional Science, 5(1):51–56, 1959.
[20] Filippo Simini, Marta C. González, Amos Maritan and Albert-László Barabási. A universal model for mobility and migration patterns. Nature, 484(7392):96–100, 2012.
[21] Chaoming Song, Tal Koren, Pu Wang and Albert-László Barabási. Modelling the scaling properties of human mobility. Nature Physics, 6(10):818–823, 2010.
[22] Alan Geoffrey Wilson. Entropy in urban and regional modelling. Pion Ltd, 1970.
[23] Alan Geoffrey Wilson. Urban and regional models in geography and planning. 1974.
[24] Svante Wold, Arnold Ruhe, Herman Wold and W. J. Dunn III. The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses. SIAM Journal on Scientific and Statistical Computing, 5(3):735–743, 1984.
Gravity Model
Parameter   Description                                Spain
α1          Origin exponent                            0.477*** (0.002)
α2          Destination exponent                       0.478*** (0.002)
β           Distance exponent                          1.05*** (0.0035)
R2          Goodness of fit                            0.797
φ           Correlation between T_ij and T^grav_ij     0.826

Table 1. Description of the parameters for the Gravity Law Model in geo-tagged social media data for Spain. (***) means significance p < 0.0001.

NMI between G and Gp for different p
Algorithm   p = 0.01   0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09    0.1
FG          0.995      0.992   0.989   0.983   0.981   0.977   0.983   0.969   0.980   0.959
WT          0.954      0.959   0.950   0.954   0.945   0.948   0.947   0.935   0.926   0.931
IM          0.988      0.981   0.980   0.981   0.978   0.974   0.975   0.970   0.969   0.966
ML          0.994      0.978   0.979   0.983   0.948   0.934   0.972   0.952   0.973   0.947
LP          0.906      0.908   0.911   0.915   0.895   0.907   0.907   0.893   0.905   0.904
LE          0.960      0.957   0.956   0.859   0.910   0.892   0.908   0.858   0.885   0.884

Table 2. NMI measure comparing G and Gp.

Communities Stats
Algorithm   Mean size ⟨|Ni|⟩   Max size max|Ni|   Communities |Ni|   Modularity   NMI P   NMI C
FG          309.696            1385               23                 0.726        0.712   0.590
WT          9.262              433                769                0.417        0.744   0.757
IM          21.011             143                339                0.758        0.770   0.831
ML          323.772            1132               22                 0.800        0.717   0.599
LP          22.052             750                323                0.732        0.749   0.761
LE          1017.571           5344               7                  0.381        0.264   0.205

Table 3. Statistics of the communities Ni returned by the six algorithms. NMI P refers to the comparison between communities and provinces, whereas NMI C considers counties instead of provinces.
                         All ages           < 24               25-44              > 44
(Intercept)              0.11**** (0.02)    0.10*** (0.03)     0.20*** (0.03)     0.20*** (0.035)
Penetration rate         3.23* (1.41)       8.57*** (2.22)     6.28** (2.17)      2.40 (2.77)
Geographical diversity   0.03 (0.02)        0.15*** (0.04)     0.08* (0.04)       0.06 (0.05)
Social diversity         -0.03* (0.01)      -0.03 (0.02)       -0.05* (0.02)      -0.06* (0.03)
Morning activity         -0.69* (0.26)      -1.30** (0.42)     -1.53*** (0.41)    -1.19* (0.52)
Misspellers rate         11.56 (8.13)       31.51* (12.78)     15.46 (12.48)      23.60 (15.94)
Employment mentions      -1.80 (6.27)       3.17 (9.86)        -9.94 (9.64)       2.71 (12.3)
R2                       0.47               0.64               0.55               0.29
Adj. R2                  0.44               0.62               0.52               0.26
Standard errors in parentheses. ***p < 0.001, **p < 0.01, *p < 0.05

Table 4. Regression table for the different models in which unemployment for different age groups is fitted. The All ages model is the fit to the general rate of unemployment in each geographical area, while the other models are for the rates of unemployment in groups of less than 24 years, between 25 and 44 years, and above 44 years.

                         All variables      Youth model        Twitter model (I)   Twitter model (II)
(Intercept)              0.06 (0.03)        -0.02 (0.03)       0.10*** (0.03)      0.09*** (0.027)
Young pop. rate          0.66* (0.30)       2.20*** (0.35)
Penetration rate         8.20*** (2.25)                        8.57*** (2.22)      8.62*** (2.21)
Geographical diversity   0.14*** (0.04)                        0.15*** (0.04)      0.12*** (0.03)
Social diversity         -0.02 (0.02)                          -0.03 (0.02)
Morning activity         -1.42*** (0.41)                       -1.30** (0.42)      -1.28** (0.41)
Misspellers rate         23.95 (13.09)                         31.51* (12.78)      32.28* (12.71)
Employment mentions      0.34 (9.81)                           3.17 (9.86)
R2                       0.65               0.24               0.64                0.63
Adj. R2                  0.63               0.24               0.62                0.62
Standard errors in parentheses. ***p < 0.001, **p < 0.01, *p < 0.05

Table 5. Regression table for the different statistical models. The All variables model includes both Twitter and rate of young population variables. Twitter model (I) includes only the variables described in the main article, while Twitter model (II) only includes those variables which are significant (p < 0.05) in Twitter model (I).
Social media fingerprints of unemployment mdash 1919
                         Communities     Municipalities  Counties        Provinces
(Intercept)              0.10***         0.16***         0.11***         0.11*
                         (0.03)          (0.01)          (0.03)          (0.05)
Penetration rate         8.57***         4.01***         9.12***         10.47***
                         (2.22)          (0.59)          (1.81)          (1.97)
Geographical diversity   0.15***         0.02            0.12***         0.08
                         (0.04)          (0.01)          (0.03)          (0.07)
Social diversity         -0.03           -0.01           -0.01           -0.03
                         (0.02)          (0.01)          (0.02)          (0.07)
Morning activity         -1.30**         -1.16***        -1.49***        -1.03
                         (0.42)          (0.14)          (0.39)          (0.88)
Misspellers rate         31.51*          14.40***        14.09
                         (12.78)         (2.51)          (10.02)
Employment mentions      3.17            -0.71           2.41            -3.17
                         (9.86)          (0.89)          (8.86)          (12.29)
Number of points         128             1738            198             50
R2                       0.64            0.22            0.55            0.65
Adj. R2                  0.62            0.21            0.54            0.61
***p < 0.001, **p < 0.01, *p < 0.05
Table 6. Regression table for the unemployment linear regression model at different levels of geographical areas. In the Provinces model the misspellers rate has been removed from the model due to its large collinearity with the penetration rate.
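As a consistency check on the table, the adjusted R2 can be recomputed from R2, the number of points and the number of predictors with the standard correction 1 - (1 - R2)(n - 1)/(n - k - 1). The sketch below is only an illustration: the function name `adjusted_r2` is ours, and we assume the table entries read as R2 = 0.64 with 128 points and 6 predictors for Communities, and R2 = 0.65 with 50 points and 5 predictors for Provinces.

```python
def adjusted_r2(r2: float, n_points: int, n_predictors: int) -> float:
    """Adjusted R^2 for a linear model with an intercept (standard formula)."""
    return 1.0 - (1.0 - r2) * (n_points - 1) / (n_points - n_predictors - 1)


# Communities column (assumed reading: R2 = 0.64, 128 points, 6 predictors).
communities = adjusted_r2(0.64, 128, 6)
# Provinces column (assumed reading: R2 = 0.65, 50 points, 5 predictors,
# because the misspellers rate is removed from that model).
provinces = adjusted_r2(0.65, 50, 5)

print(round(communities, 2), round(provinces, 2))
```

Under these assumptions both values round to the adjusted R2 printed in the corresponding columns.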
[8] Pan W, Ghoshal G, Krumme C, Cebrian M, & Pentland A (2013) Urban characteristics attributable to density-driven tie formation. Nature communications 4.
[9] Gonzalez M C, Hidalgo C A, & Barabasi A-L (2008) Understanding individual human mobility patterns. Nature 453:779–782.
[10] Calabrese F, Diao M, Lorenzo G D, Jr J F, & Ratti C (2013) Understanding individual mobility patterns from urban sensing data: A mobile phone trace example. Transportation Research Part C: Emerging Technologies 26:301–313.
[11] Cheng Z, Caverlee J, Lee K, & Sui D Z (2011) Exploring Millions of Footprints in Location Sharing Services. (AAAI, Menlo Park, CA, USA).
[12] Cho E, Myers S A, & Leskovec J (2011) Friendship and mobility: user movement in location-based social networks, KDD '11. (ACM, New York, NY, USA), pp 1082–1090.
[13] Sun L, Jin J G, Axhausen K W, Lee D-H, & Cebrian M (2014) Quantifying long-term evolution of intra-urban spatial interactions. arXiv preprint arXiv:1407.0145.
[14] Eagle N, Macy M, & Claxton R (2010) Network diversity and economic development. Science 328:1029–1031.
[15] Henrich J, Boyd R, Bowles S, Camerer C, Fehr E, Gintis H, & McElreath R (2001) In search of homo economicus: behavioral experiments in 15 small-scale societies. American Economic Review pp 73–78.
[16] Krieger N, Williams D R, & Moss N E (1997) Measuring social class in US public health research: concepts, methodologies, and guidelines. Annual review of public health 18:341–378.
[17] Groves R M, Fowler Jr F J, Couper M P, Lepkowski J M, Singer E, & Tourangeau R (2013) Survey methodology. (John Wiley & Sons).
[18] Lazer D, Pentland A S, Adamic L, Aral S, Barabasi A L, Brewer D, Christakis N, Contractor N, Fowler J, Gutmann M, et al. (2009) Life in the network: the coming age of computational social science. Science (New York, NY) 323:721.
[19] Smith C, Quercia D, & Capra L (2013) Finger on the pulse: identifying deprivation using transit flow analysis. (ACM), pp 683–692.
[20] Soto V, Frias-Martinez V, Virseda J, & Frias-Martinez E (2011) Prediction of socioeconomic levels using cell phone records, UMAP'11. (Springer-Verlag, Berlin, Heidelberg), pp 377–388.
[21] Preis T, Moat H S, Stanley H E, & Bishop S R (2012) Quantifying the advantage of looking forward. Scientific reports 2.
[22] Antenucci D, Cafarella M, Levenstein M C, Re C, & Shapiro M D (2014) Using social media to measure labor market flows, (National Bureau of Economic Research), Technical report.
[23] Hawelka B, Sitko I, Beinat E, Sobolevsky S, Kazakopoulos P, & Ratti C (2013) Geo-located twitter as the proxy for global mobility patterns. arXiv preprint arXiv:1311.0680.
[24] Lathia N, Quercia D, & Crowcroft J (2012) in Pervasive computing. (Springer), pp 91–98.
[25] Frias-Martinez V, Virseda J, & Frias-Martinez E (2010) Socio-economic levels and human mobility.
[26] Gutierrez T, Krings G, & Blondel V D (2013) Evaluating socio-economic state of a country analyzing airtime credit and mobile phone datasets. arXiv preprint arXiv:1309.4496.
[27] Smith C, Mashhadi A, & Capra L (2013) Ubiquitous sensing for mapping poverty in developing countries. Paper submitted to the Orange D4D Challenge.
[28] Erlander S & Stewart N F (1990) The gravity model in transportation analysis: theory and extensions. (Vsp) Vol. 3.
[29] Simini F, Gonzalez M C, Maritan A, & Barabasi A-L (2012) A universal model for mobility and migration patterns. Nature 484:96–100.
[30] Lenormand M, Picornell M, Cantu-Ros O G, Tugores A, Louail T, Herranz R, Barthelemy M, Frias-Martinez E, & Ramasco J J (2014) Cross-checking different sources of mobility information. arXiv preprint arXiv:1404.0333.
[31] Barthelemy M (2011) Spatial networks. Physics Reports 499:1–101.
[32] Expert P, Evans T S, Blondel V D, & Lambiotte R (2011) Uncovering space-independent communities in spatial networks. Proceedings of the National Academy of Sciences 108:7663–7668.
[33] Sobolevsky S, Szell M, Campari R, Couronne T, Smoreda Z, & Ratti C (2013) Delineating geographical regions with networks of human interactions in an extensive set of countries. PloS one 8:e81707.
[34] Rosvall M & Bergstrom C T (2008) Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences 105:1118–1123.
[35] Danon L, Diaz-Guilera A, Duch J, & Arenas A (2005) Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment 2005:P09008.
[36] ADigital (2013) Uso de twitter en espana 2012. [Online; accessed 1-November-2014].
[37] Davenport J R & DeLine R (2014) The readability of tweets and their geographic correlation with education. arXiv preprint arXiv:1401.6058.
[38] Schneider F, Buehn A, & Montenegro C E (2011) Shadow economies all over the world: New estimates for 162 countries from 1999 to 2007. Handbook on the shadow economy pp 9–77.
[39] Rutherford A, Cebrian M, Dsouza S, Moro E, Pentland A, & Rahwan I (2013) Limits of social mobilization. Proceedings of the National Academy of Sciences 110:6281–6286.
Supporting Information for
Social media fingerprints of unemployment
Alejandro Llorente, Manuel García-Herranz, Manuel Cebrian and Esteban Moro
S1 The dataset
Twitter provides an extremely rich and publicly available data set of user interactions, information flows and, thanks to the geo-location of tweets, user movements. Nevertheless, the representativeness of geo-located Twitter as a global source of mobility data has still received sparse attention. In this sense, while [13] presents a promising and extensive study of global country-to-country movements (mostly driven by tourism), within-country human flows (comprising not only internal tourism but also, to a greater extent than country-to-country travel, visiting and commuting) still need further investigation. Therefore, throughout this work we compare our findings using geo-located Twitter with similar studies using commuting surveys.
For the Twitter analysis we consider almost 146 million geo-located Twitter messages (tweets) collected through the public API provided by Twitter for the continental part of Spain, from 29th November 2012 to 10th April 2013. In this dataset we consider that there has been a trip from place l to place k if a user has tweeted in place l and place k consecutively. We only keep those transitions in which the first tweet and the second one are dated on the same day. We filter the trips database to avoid unrealistic transitions and keep only trips with a geographical displacement larger than 1 km (see Methods section). By this method, 138 million trips from 167376 different users are considered in our work.
From those trips we construct the mobility flow $T_{ij}$ between municipalities, which measures the number of trips in our database in which the origin is within the boundaries of city i and the destination lies within those of city j.
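The construction of trips and flows described above can be sketched in Python as follows. This is a minimal illustration and not the code used in this work: the tuple layout, the function names (`haversine_km`, `extract_trips`, `flow_matrix`) and the toy tweets are assumptions, and only the same-day and 1 km conditions are implemented; the additional filtering of unrealistic transitions mentioned in the Methods section is omitted.

```python
from collections import Counter
from datetime import datetime
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points."""
    p1, p2 = radians(lat1), radians(lat2)
    dphi, dlmb = p2 - p1, radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlmb / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def extract_trips(tweets, min_km=1.0):
    """tweets: iterable of (user, timestamp, lat, lon, municipality).

    A trip is a pair of consecutive tweets of the same user, dated on the
    same day and more than `min_km` apart.
    Returns a list of (origin_municipality, destination_municipality).
    """
    by_user = {}
    for tweet in tweets:
        by_user.setdefault(tweet[0], []).append(tweet)
    trips = []
    for rows in by_user.values():
        rows.sort(key=lambda r: r[1])  # chronological order per user
        for a, b in zip(rows, rows[1:]):
            same_day = a[1].date() == b[1].date()
            if same_day and haversine_km(a[2], a[3], b[2], b[3]) > min_km:
                trips.append((a[4], b[4]))
    return trips


def flow_matrix(trips):
    """Sparse flow T_ij: number of trips from municipality i to j."""
    return Counter(trips)


# Toy data (hypothetical): only the first pair of u1 qualifies as a trip.
example_tweets = [
    ("u1", datetime(2013, 1, 5, 9, 0), 40.4168, -3.7038, "Madrid"),
    ("u1", datetime(2013, 1, 5, 18, 0), 40.3281, -3.7644, "Leganes"),
    ("u1", datetime(2013, 1, 6, 10, 0), 40.4168, -3.7038, "Madrid"),
    ("u2", datetime(2013, 1, 5, 9, 0), 40.4168, -3.7038, "Madrid"),
    ("u2", datetime(2013, 1, 5, 9, 30), 40.4170, -3.7040, "Madrid"),
]
flows = flow_matrix(extract_trips(example_tweets))
```

Storing the flow as a counter keyed by (origin, destination) keeps the matrix sparse, which matters when most pairs of municipalities exchange no trips at all.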
We also consider population and economic information about the municipalities from the Spanish Census (2011) [8] and unemployment figures from the Public Service of Employment (Servicio Publico de Empleo Estatal, SEPE) [7]. In the latter case, registered unemployment (in number of persons) is given for each Spanish municipality by gender, age and month. To get unemployment rates we divide registered unemployment by the total workforce in the municipality, estimated as the number of people aged between 16 and 65 years.
S2 Twitter as mobility proxy
Considering all of the available transitions in our database, one can compute the distance between origin and destination, the elapsed time of the transition and the number of trips per user, among many other statistics. All of them seem to show a power-law distribution with a cutoff due to the finite spatial size of Spain and the constraint of considering only transitions where the origin and destination check-ins are made on the same day. Focusing on the log-linear part of the distributions, self-similar behaviors arise when Twitter-based mobility is analyzed (see figure 5).
Twitter-based inter-city flows can be well modelled by means of the Gravity Law, which is one of the most widely used methods to represent human mobility [1, 19], with applications in many fields like urban planning [23], traffic engineering [4] or transportation problems [9]. The Gravity Law is also the solution to the problem of maximizing the entropy of the particle distribution among all the possible trips using statistical mechanics techniques [2, 22]. Recently it has also been used as a model for human mobility based on cell phone traces [10, 20, 21] and social media data at a global scale [13] and at the inter-city level [14].
The Gravity Model for human mobility assumes that the flows between cities can be explained by the expression

$$T^{grav}_{ij} = \frac{P_i^{\alpha_1} P_j^{\alpha_2}}{d_{ij}^{\beta}} \qquad (2)$$

where $T^{grav}_{ij}$ is the flow, in terms of number of people, between cities i and j, $d_{ij}$ is the geographical distance, and $P_i$ and $P_j$ are the populations of the respective cities.
Given the data, we can obtain the parameters of the model by Weighted Least Squares Minimization:

$$\alpha_1^*, \alpha_2^*, \beta^* = \arg\min_{\alpha_1, \alpha_2, \beta} \frac{1}{N} \sum_{ij} w_{ij} \left( T_{ij} - T^{grav}_{ij} \right)^2 \qquad (3)$$

where N is the total number of connections in the mobility graph and $w_{ij}$ is a weight proportional to the number of observed transitions between i and j. In particular we find that taking wi j = T 13 i j gives the best performance in the model.
In our case this model fits quite accurately the inter-city mobility based on Twitter GPS check-ins (see table 1). Even though we consider $T_{ij}$ not necessarily symmetric, the exponents of the populations are similar, indicating that we are observing similar flows in both directions between i and j.
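To illustrate how the three exponents of the gravity model can be estimated, the sketch below fits them by weighted least squares on the logarithm of the flows. Note that this is a simplification we introduce for the example: equation (3) minimizes the weighted squared error of the flows themselves, which requires a nonlinear optimizer, whereas the log-linear version has a closed-form solution. The function names, the `weight_exponent` parameter and the synthetic data are all hypothetical.

```python
from math import log


def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_gravity(flows, weight_exponent=1.0):
    """flows: list of (T_ij, P_i, P_j, d_ij), all strictly positive.

    Fits log T = alpha1 log P_i + alpha2 log P_j - beta log d with
    weights w = T ** weight_exponent (an assumed weighting scheme).
    Returns (alpha1, alpha2, beta).
    """
    xtx = [[0.0] * 3 for _ in range(3)]
    xty = [0.0] * 3
    for t, p_i, p_j, d in flows:
        x = (log(p_i), log(p_j), -log(d))
        y = log(t)
        w = t ** weight_exponent
        for r in range(3):
            xty[r] += w * x[r] * y
            for c in range(3):
                xtx[r][c] += w * x[r] * x[c]
    return tuple(solve_linear(xtx, xty))


# Synthetic check: flows generated exactly from known exponents.
pops = [1e4, 5e4, 2e5, 1e6]
positions_km = [0.0, 30.0, 100.0, 400.0]
synthetic = []
for i in range(4):
    for j in range(4):
        if i != j:
            d = abs(positions_km[i] - positions_km[j])
            t = pops[i] ** 0.5 * pops[j] ** 0.4 / d ** 1.2
            synthetic.append((t, pops[i], pops[j], d))
alpha1, alpha2, beta = fit_gravity(synthetic)
```

On noise-free synthetic flows the fit recovers the generating exponents; on real data the log-linear and the direct least-squares estimates will generally differ, since they weight large and small flows differently.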
S3 Community structures in inter-city mobility graph
Typically, complex networks exhibit community structure; that is, there are subsets of nodes that are more densely connected among themselves than to the rest of the nodes. In mobility networks, whose nodes correspond to geographical areas, these communities are interpreted as zones with high common activity and tend to be constrained by geographical and political barriers. We check whether this is also observed in our dataset by applying six state-of-the-art community detection algorithms: FastGreedy [5], Walktrap [16], Infomap [18], MultiLevel [3], Label Propagation [17] and Leading Eigenvector [15]. These six algorithms yield different community structures in terms of number of communities, average community size or modularity (see table 3). Members (municipalities) of the resulting communities are spatially connected except in a few cases, as figure 7 shows. We test the statistical robustness of the obtained communities by randomly removing a proportion p of the original links and running the algorithms on this new graph $G_p$. We consider the communities robust when those obtained for the original network G and for $G_p$ are highly similar. In order to compare two arbitrary community memberships we use the Normalized Mutual Information (NMI) method described in [6], which returns 0 when two memberships are totally different and 1 when they are equal. We compute the NMI for each chosen algorithm performed on G and $G_p$ for p between 1% and 10%, concluding that the obtained community structures are robust because they are not broken when some randomly chosen links are removed (see table 2).

[Figure 5: three log-log density panels, for trip distance (km), number of trips, and elapsed time (secs).]
Figure 5. Probability distributions for the different properties of daily trips in the Twitter dataset. Dashed lines correspond to a power-law fit with exponents -1.67, -2.43 and -0.62, respectively.
[Figure 6: penetration rate versus population on logarithmic axes, for cities and communities.]
Figure 6. Penetration rates for both cities and detected communities.
As other works have shown, mobility graph communities are usually interpreted in terms of geographical and political barriers, and a natural question is whether the mobility-based communities are related to any of these barriers. In Spain there are different territorial divisions for administration purposes. In this work we consider two of them: provinces, defined in the 1978 Constitution, which are 48 different heterogeneous aggregations of municipalities; and counties (comarca in Spanish terminology), which are traditional aggregations of municipalities mainly based on Spanish orography (rivers, valleys, ridges, etc.), some of which are composed of municipalities of different provinces. We use the NMI method again to compare the community structure given by the algorithms with the administrative limits. Except for the Leading Eigenvector algorithm, the methods return communities that are closely related to provinces (NMI ≈ 0.7), whereas for the county administrative limits higher variability is observed. In this last case the algorithm that best matches the county limits is Infomap (NMI ≈ 0.83). Therefore, Twitter-based mobility summarizes the inter-city flows, showing that these flows are influenced by geographical and political barriers.
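The NMI comparison used in this section can be sketched in a few lines of Python. This is a minimal illustration of the measure described in [6], not the implementation used in this work: `nmi`, `remove_links` and the toy labels are hypothetical names introduced for the example, and the community detection step itself is omitted because it depends on an external library.

```python
import random
from collections import Counter
from math import log


def nmi(labels_a, labels_b):
    """Normalized mutual information between two partitions of the same nodes.

    labels_a[k] and labels_b[k] are the communities of node k in each partition.
    """
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    # n times the mutual information of the two labelings
    mutual = sum(
        n_ab * log(n_ab * n / (count_a[a] * count_b[b]))
        for (a, b), n_ab in joint.items()
    )
    # n times the entropy of each labeling
    h_a = -sum(c * log(c / n) for c in count_a.values())
    h_b = -sum(c * log(c / n) for c in count_b.values())
    if h_a + h_b == 0:
        return 1.0  # both partitions place every node in one community
    return 2.0 * mutual / (h_a + h_b)


def remove_links(edges, p, seed=0):
    """Return a copy of `edges` with a proportion p removed at random."""
    rng = random.Random(seed)
    keep = len(edges) - round(p * len(edges))
    return rng.sample(edges, keep)


same = nmi([0, 0, 1, 1], [5, 5, 7, 7])         # identical up to relabeling
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])  # no shared information
```

Because the measure only depends on how nodes are grouped, two partitions that differ only in the names of their communities are treated as equal.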
S4 Twitter demographics and unemployment rates
Different age groups are not equally represented in Twitter. Recent surveys (2012) in Spain suggest that most (86%) users of Twitter are 16 to 44 years old. Comparison of the percentage of users per age group with the total population within the same groups (see figure 8) reveals that age groups above 35 years old are under-represented in Twitter. Thus our Twitter data will be more revealing when trying to describe unemployment in age groups below 44 years old. This is indeed what we find when we build linear models for the unemployment rate in different age groups with the same Twitter variables: while unemployment rates for ages below 24 can be fitted to a linear model with R2 = 0.62, regression models for unemployment rates for ages between 25 and 44 have R2 = 0.52, and for ages above 44 we get only R2 = 0.26. Table 4 summarizes the results for the regression models of unemployment rates in each age group, showing that our Twitter variables have more explanatory power for ages below 44. Finally, in figure 8 we can see the performance of the model for the different age groups; once again, the poor explanatory power of the Twitter variables for the unemployment rate at ages above 44 years old is evident.
S5 Properties of Twitter variables
Normalization and distributions
Heterogeneity between the values of variables constructed from Twitter is large but moderate, as the histograms in figure 9 show. We did not find any geographical area with anomalous values in any of the variables considered. Variables are normalized in different ways: both the penetration $\tau_i$ and the misspellers rate $\varepsilon_i$ are defined as the number of users or misspellers per 100000 persons (population); activity variables $\nu_i$ are normalized as the percentage of tweets per time interval; finally, the numbers of tweets that mention a specific term, $\mu_i$, are also given per 100000 tweets published in the geographical area.

Figure 7. From left to right and from top to bottom: FastGreedy, Walktrap, Infomap, MultiLevel, Label Propagation and Leading Eigenvector communities on Twitter-based mobility transitions.
Correlation between variables
Variables are constructed to reflect the behavior of areas in the different dimensions of Twitter: penetration, social or geographical diversity, activity through the day, and content. Correlation between variables does indeed show that variables within each dimension are strongly correlated with each other. As we can see in figure 10, social and geographical diversities are highly correlated, an expected fact given that the gravity law accurately describes not only the flows of people between geographical areas but also the amount of communication between them. The same behavior is found for the variables in the activity group, while content variables are less correlated. Finally, we find that both the penetration rate $\tau_i$ and the fraction of misspellers $\varepsilon_i$ have a strong correlation with most of the variables.
High correlation between variables might lead to collinearity effects [24] in the linear regression models; that is, some variables with predictive value might have non-significant weights because they explain the same part of the variance. For instance, in Table 5 the misspellers rate has a very strong predictive value but its p-value is too high to consider it significant. To test this hypothesis we perform a principal component analysis (PCA) on the independent variables of the regression. Figure 10 exhibits the loadings of the different variables on the first two principal components. The block structure shown in figure 10 results in similar directions of the variables in the first components of the PCA. We observe some groups of variables: on the one hand, geographical and social diversity seem to explain a large part of the variance; on the other hand, we find a perpendicular group of variables formed by temporal activity; finally, penetration rate and misspellers fraction seem to represent a different, independent direction of the data, with high collinearity between them. This might explain the low statistical significance in the models of section 4. In any case, the structure of the correlation matrix and the PCA results show that there is indeed information in all groups of variables, and thus we have taken one variable from each of them for our regression models.
S6 Misspellers detection
In this work we consider only tweets in Spanish. Since several languages coexist in Spain depending on the part of the country, the first step is to reduce our Twitter dataset to those tweets that are written in Spanish. This task is carried out using the n-gram based text categorization R library textcat [11]. Then, in order to decide whether a tweet has a misspelling or not, we need to establish some patterns to look for in our set of tweets. Since we want to be sure that a detected mistake corresponds to a real misspeller, we do not consider the following cases:

• Lack of written accents. People tend to avoid writing accents when writing in a colloquial way.

• Mistakes derived from removing unnecessary letters. The most common cases are removing an h at the beginning of a word (in Spanish the letter h is not pronounced) or replacing the letters qu by k. We understand that these mistakes can be motivated by the length limit of tweets and not by a real misspelling.

• In the same line, we neglect mistakes produced by removing letters in the middle of a word whose pronunciation can be deduced without them.

• We do not consider either mistakes related to features of specific areas in Spain. For example, in the south the pronunciation of ce and se is the same, which produces a large number of mistakes when writing. However, since we want to extract objective and equitable conclusions over the whole Spanish geography, we neglect those misspellings that only appear in a specific area.

[Figure 9: histograms P(x) of each Twitter variable: tweets mentioning unemployment, employment and job; penetration rate; social and geographical entropies; misspellers rate; night, morning and afternoon activity.]
Figure 9. Frequency plots for each variable constructed from Twitter.

Likewise, we consider as real misspellings the following mistakes:

• Adding letters. For example, writing an h at the beginning of a word that starts with a vowel.

• Changing the special cases mp, mb by the wrong spellings np, nb.

• Mixing up b with v, g with j, ll with y, and ex with es. These are typical mistakes in Spanish because they have the same or a very close pronunciation.

• Confusing the verb haber with the periphrasis a ver.

• Separating a word into two, for instance writing the word conmigo as con migo.

[Figure 8: bar chart of percentage of population per age group (16-24, 25-34, 35-44, 45-54, 55-64), and four scatter plots of predicted versus observed unemployment (%): All ages (R2 = 0.47), < 24 (R2 = 0.62), 25-44 (R2 = 0.52), > 44 (R2 = 0.26).]
Figure 8. Top: Percentage of population in each age group from the Spanish Census (dark bars) and surveys about users in Twitter (light bars). Bottom: performance of the linear models for each of the age groups.

This way, our list of misspellings is composed of 617 common mistakes in Spanish that cannot be attributed to the special features of Twitter or to a specific region of Spain. Thus one can expect that this selection provides an accurate and equitable method of detecting misspellers. Under these conditions, the number of users who wrote at least one misspelled word is 27055 (56 over the whole population).
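A lexicon-based detector of this kind can be sketched as follows. The example is only illustrative: the three-entry lexicon is a hypothetical stand-in for the list of 617 mistakes, which is not reproduced here, and the function names are ours. Accents are stripped before matching, in line with the decision not to count missing accents as mistakes; as a side effect of this simple normalization, ñ is also mapped to n.

```python
import re
import unicodedata

# Hypothetical mini-lexicon built from the kinds of mistakes listed above
# (np/nb instead of mp/mb, and a word split into two).
MISSPELLINGS = {"tanbien", "enpezar", "con migo"}


def normalize(text):
    """Lowercase, strip accents and keep only word characters."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    no_accents = "".join(
        c for c in decomposed if unicodedata.category(c) != "Mn"
    )
    return " ".join(re.findall(r"\w+", no_accents))


def has_misspelling(text, lexicon=MISSPELLINGS):
    """True if any lexicon entry appears as whole word(s) in the tweet."""
    padded = f" {normalize(text)} "
    return any(f" {entry} " in padded for entry in lexicon)


def misspellers_rate(tweets, population, lexicon=MISSPELLINGS):
    """tweets: iterable of (user, text). Misspellers per 100000 persons."""
    misspellers = {
        user for user, text in tweets if has_misspelling(text, lexicon)
    }
    return 100000.0 * len(misspellers) / population


example = [
    ("u1", "Tanbién voy con migo mismo"),
    ("u1", "hola"),
    ("u2", "Empezar de nuevo"),
    ("u3", "quiero enpezar ya"),
]
rate = misspellers_rate(example, population=200000)
```

A user counts as a misspeller as soon as one of their tweets matches, so the rate is computed over distinct users rather than over tweets.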
Figure 10. Top: Correlation matrix between the variables constructed from Twitter. Each entry in the matrix is depicted as a circle whose size is proportional to the correlation between variables and whose color is blue/red for positive/negative correlations. Blank entries correspond to statistically insignificant correlations at 95% confidence. Bottom: Projection of the variables on the first two principal components given by PCA. We observe different groups of variables and collinearity between some of them.

We analyze whether misspellers have different Twitter usage behavior from people who do not make serious mistakes when publishing a tweet. Comparing the average number of tweets, it can be observed that misspellers tend to publish a larger number of tweets than those who did not make mistakes (14471 against 2372). This also emerges when the mean number of misspellings given the total number of tweets is considered. For users with fewer than approximately 30 published tweets in the observation period the number of misspellings is almost zero, whereas for users who publish more often the mean number of misspellings scales sub-linearly with the number of tweets (exponent ≈ 0.33).
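A scaling exponent of this kind can be estimated, for instance, as the slope of a straight line fitted to the data on logarithmic axes. The sketch below assumes ordinary least squares on log-transformed values, which is our choice for the illustration since the fitting procedure is not specified here; the function name and the synthetic data are hypothetical.

```python
from math import log


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x), for positive data."""
    lx = [log(x) for x in xs]
    ly = [log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den


# Synthetic data following y = 0.5 * x ** 0.33 exactly.
tweet_counts = [30, 60, 120, 240, 480]
mean_misspellings = [0.5 * x ** 0.33 for x in tweet_counts]
slope = loglog_slope(tweet_counts, mean_misspellings)
```

A slope below 1 on logarithmic axes is what "sub-linear" means: doubling the number of tweets less than doubles the mean number of misspellings.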
Since we have observed a segmentation of the Twitter population based on how accurately users write, we consider the misspellers rate as a proxy for the educational level of the cities. A large number of previous works have revealed the relationship between the economic status and the educational level of geographical areas, and therefore it is natural to ask whether the observed misspellers rate is related to the economy, as measured by the unemployment rate. To test this hypothesis we consider cities with more than 5000 inhabitants to avoid undersampled cases. We find a strong positive correlation between the probability of finding a misspeller in a city and the unemployment rate (0.372, 0.491).
[Figure 11: two panels on logarithmic axes, N(miss|tweets) and P(miss|tweets) versus number of tweets.]
Figure 11. Number (red) and probability (blue) of observed misspellings given the number of tweets.
[Figure 12: R2 of the model by month, from 2012-10 to 2014-06.]
Figure 12. Explanatory power of the linear regression model when fitted against the unemployment data for different months. The gray (orange) area corresponds to the time window in which Twitter data is collected and variables are constructed.
S7 Time window and unemployment
In the definition of the variables we have aggregated the Twitter activity within a 7-month time window (from December 2012 to June 2013). Since unemployment varies significantly over time, we investigate here the correlation and explanatory power of the Twitter variables for the values of unemployment determined at different months within the same time window in which Twitter data was collected, and whether the variables collected in that time window are more correlated with past or future values of unemployment. Figure 12 shows the explanatory value of the model when the linear regression is done for values of unemployment of different months before, during and after the Twitter data time window. Although there is a small seasonal effect along the year, we see that the explanatory power remains around R2 = 0.6, which suggests that our Twitter linear model retains its explanatory power even though unemployment changes considerably throughout the year. It is interesting to note that R2 decays slightly during the summer, which means that our variables are less correlated with summer unemployment. Finally, the unemployment used in the main article is from June 2013, i.e. the last month in the time window used to collect the data.
S8 Demographics do not explain unemployment
Since unemployment rates are very large for the group of young people, a natural question is whether demographic variables alone could explain the heterogeneity of young unemployment rates found in the geographical areas. To test this we have built four linear models: the first one (named Youth model in Table 5) has the rate of young population as its only explanatory variable; the second and third are built from only the Twitter variables considered in the main text (named Twitter model (I)) or just from those whose regression coefficients are statistically significant (Twitter model (II)); the fourth is fitted with all the variables (named All variables model in Table 5). In table 5 we show the summary of the regression for each model. Focusing on the variance explained by the model in terms of R2, considering all Twitter variables is about three times more explanatory than considering only the proportion of young people. On the other hand, the comparison of R2 for the Twitter model with those for the All variables and Youth models shows that the rate of young population does not provide significant explanatory power. This semi-partial analysis shows that our Twitter variables retain a high explanatory power when the effect of the young population rate is controlled.
S9 Unemployment models for other geographical areas
While municipalities are very heterogeneous demographically, other administrative areas exist in Spain at larger scales that could be used for our model of unemployment. As mentioned in section 4, the smallest administrative division of Spain we have considered is that of the 8200 municipalities. At larger scales we have the 326 counties (comarcas in Spanish), which are aggregations of municipalities. Finally, the largest geographical scale we considered is defined by the 50 provinces (provincias in Spanish). In this section we compare the performance of our Twitter model for unemployment for the variables defined in those administrative areas and relate it to the geographical communities detected and used in the main paper (see section 4). Not all the areas at the different administrative divisions are considered in the model. To minimize the effect of areas in which the number of geo-tagged tweets is very small, we only consider the 1738 municipalities which have a Twitter population π > 10. Similarly, we only consider the 198 counties with π > 100. As we can see in Table 6, the model has a large explanatory power for areas equal to or bigger than counties. As expected, R2 increases as the number of areas in the model gets smaller, but the description level of the model is very low for provinces, for example. The best performance (high R2 and high geographical description level) is attained at the level of the detected communities.
S10 Relative importance of the variables
To assess the relative importance of the variables in the unemployment model we have used several methods. They all give qualitatively the same results, with some variations for the statistically insignificant variables. Specifically, we have used:

1 (weight) Relative weight of the absolute values of the coefficients obtained in the linear regression when variables are scaled to have mean zero and variance one.

2 (lmg) Averaging over orderings, as proposed by Lindeman, Merenda and Gold.

3 (pmvd) The PMVD metric introduced by Feldman, which is an average over orderings as well, but with data-dependent weights.

4 (first) The univariate R2 values from regression models with one variable only.

[Figure 13: bar chart of relative importance (%) of each variable for the weight, first, lmg and pmvd metrics.]
Figure 13. Relative importance of the variables (in percentage) in the unemployment model for different ways to calculate it.
All these metrics are obtained using the relaimpo R package [12]. The results for the young unemployment model are shown in figure 13, where we can see that the different methods yield similar relative importance of the variables, except perhaps for the diversity of mobility flows, a variable with a non-significant weight in the regression model.
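The simplest of these metrics, first, can be illustrated without any statistical package, because the R2 of a regression on a single variable equals the squared Pearson correlation between that variable and the response. The sketch below is not the relaimpo implementation: the function names and the toy data are hypothetical, and rescaling the values so that they add up to 100% is an assumption we make to match the percentages shown in figure 13.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def first_importance(predictors, response):
    """predictors: dict name -> values. Returns dict name -> share in %.

    Each share is the univariate R^2 of one predictor, rescaled so that
    all shares sum to 100 (assumed normalization).
    """
    r2 = {name: pearson_r(x, response) ** 2 for name, x in predictors.items()}
    total = sum(r2.values())
    return {name: 100.0 * value / total for name, value in r2.items()}


# Toy data: x1 determines the response, x2 is uncorrelated with it.
shares = first_importance(
    {"x1": [1, 2, 3, 4], "x2": [1, -1, -1, 1]},
    [2, 4, 6, 8],
)
```

Unlike lmg and pmvd, this metric ignores the correlations between predictors, so two collinear variables can both receive a large share.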
References
[1] B. Ashtakala. Generalized power model for trip distribution. Transportation Research Part B: Methodological, 21(1):59–67, 1987.
[2] Michel Bierlaire. Mathematical models for transportation demand analysis. Transportation Research Part A: Policy and Practice, 31(1):86–86, 1997.
[3] Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.
[4] Harry J. Casey Jr. The law of retail gravitation applied to traffic engineering. Traffic Quarterly, 9(3), 1955.
[5] Aaron Clauset, Mark E. J. Newman, and Cristopher Moore. Finding community structure in very large networks. Physical Review E, 70(6):066111, 2004.
[6] Leon Danon, Albert Díaz-Guilera, Jordi Duch, and Alex Arenas. Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment, 2005(09):P09008, 2005.
[7] Servicio Público de Empleo Estatal (SEPE). Spanish registered unemployment. http://www.sepe.es/contenidos/que_es_el_sepe/estadisticas/index.html
[8] Instituto Nacional de Estadística. Spanish 2011 census. http://www.ine.es/censos2011_datos/cen11_datos_inicio.htm
[9] Suzanne P. Evans. A relationship between the gravity model for trip distribution and the transportation problem in linear programming. Transportation Research, 7(1):39–61, 1973.
[10] Paul Expert, Tim S. Evans, Vincent D. Blondel, and Renaud Lambiotte. Uncovering space-independent communities in spatial networks. Proceedings of the National Academy of Sciences, 108(19):7663–7668, 2011.
[11] Ingo Feinerer, Christian Buchta, Wilhelm Geiger, Johannes Rauch, Patrick Mair, and Kurt Hornik. The textcat package for n-gram based text categorization in R. Journal of Statistical Software, 52(6):1–17, 2013.
[12] Ulrike Grömping. Relative importance for linear regression in R: the package relaimpo. Journal of Statistical Software, 17(1):1–27, 2006.
[13] Bartosz Hawelka, Izabela Sitko, Euro Beinat, Stanislav Sobolevsky, Pavlos Kazakopoulos, and Carlo Ratti. Geo-located Twitter as the proxy for global mobility patterns. arXiv preprint arXiv:1311.0680, 2013.
[14] Yu Liu, Zhengwei Sui, Chaogui Kang, and Yong Gao. Uncovering patterns of inter-urban trips and spatial interactions from check-in data. arXiv preprint arXiv:1310.0282, 2013.
[15] Mark E. J. Newman. Finding community structure in networks using the eigenvectors of matrices. Physical Review E, 74(3):036104, 2006.
[16] Pascal Pons and Matthieu Latapy. Computing communities in large networks using random walks. In Computer and Information Sciences - ISCIS 2005, pages 284–293. Springer, 2005.
[17] Usha Nandini Raghavan, Réka Albert, and Soundar Kumara. Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 76(3):036106, 2007.
[18] Martin Rosvall and Carl T. Bergstrom. Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences, 105(4):1118–1123, 2008.
[19] Morton Schneider. Gravity models and trip distribution theory. Papers in Regional Science, 5(1):51–56, 1959.
[20] Filippo Simini, Marta C. González, Amos Maritan, and Albert-László Barabási. A universal model for mobility and migration patterns. Nature, 484(7392):96–100, 2012.
[21] Chaoming Song, Tal Koren, Pu Wang, and Albert-László Barabási. Modelling the scaling properties of human mobility. Nature Physics, 6(10):818–823, 2010.
[22] Alan Geoffrey Wilson. Entropy in urban and regional modelling. Pion Ltd, 1970.
[23] Alan Geoffrey Wilson. Urban and regional models in geography and planning. 1974.
[24] Svante Wold, Arnold Ruhe, Herman Wold, and W. J. Dunn III. The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses. SIAM Journal on Scientific and Statistical Computing, 5(3):735–743, 1984.
Gravity Model

| Parameter | Description | Spain |
|---|---|---|
| $\alpha_1$ | Origin exponent | 0.477*** (0.002) |
| $\alpha_2$ | Destination exponent | 0.478*** (0.002) |
| $\beta$ | Distance exponent | 1.05*** (0.0035) |
| $R^2$ | Goodness of fit | 0.797 |
| $\phi$ | Correlation between $T_{ij}$ and $T^{grav}_{ij}$ | 0.826 |

Table 1. Description of the parameters for the Gravity Law Model in geo-tagged social media data for Spain. (***) means significance p < 0.0001.
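NMI between $G$ and $G_p$ for different p

| Algorithm | p = 0.01 | 0.02 | 0.03 | 0.04 | 0.05 | 0.06 | 0.07 | 0.08 | 0.09 | 0.1 |
|---|---|---|---|---|---|---|---|---|---|---|
| FG | 0.995 | 0.992 | 0.989 | 0.983 | 0.981 | 0.977 | 0.983 | 0.969 | 0.980 | 0.959 |
| WT | 0.954 | 0.959 | 0.950 | 0.954 | 0.945 | 0.948 | 0.947 | 0.935 | 0.926 | 0.931 |
| IM | 0.988 | 0.981 | 0.980 | 0.981 | 0.978 | 0.974 | 0.975 | 0.970 | 0.969 | 0.966 |
| ML | 0.994 | 0.978 | 0.979 | 0.983 | 0.948 | 0.934 | 0.972 | 0.952 | 0.973 | 0.947 |
| LP | 0.906 | 0.908 | 0.911 | 0.915 | 0.895 | 0.907 | 0.907 | 0.893 | 0.905 | 0.904 |
| LE | 0.960 | 0.957 | 0.956 | 0.859 | 0.910 | 0.892 | 0.908 | 0.858 | 0.885 | 0.884 |

Table 2. NMI measure comparing $G$ and $G_p$.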
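Communities Stats

| Algorithm | $\langle \lvert N_i \rvert \rangle_i$ | $\max \lvert N_i \rvert$ | No. of communities | Modularity | NMI P | NMI C |
|---|---|---|---|---|---|---|
| FG | 309.696 | 1385 | 23 | 0.726 | 0.712 | 0.590 |
| WT | 9.262 | 433 | 769 | 0.417 | 0.744 | 0.757 |
| IM | 21.011 | 143 | 339 | 0.758 | 0.770 | 0.831 |
| ML | 323.772 | 1132 | 22 | 0.800 | 0.717 | 0.599 |
| LP | 22.052 | 750 | 323 | 0.732 | 0.749 | 0.761 |
| LE | 1017.571 | 5344 | 7 | 0.381 | 0.264 | 0.205 |

Table 3. Statistics of the communities $N_i$ returned by the six algorithms. NMI P refers to the comparison between communities and provinces, whereas NMI C considers counties instead of provinces.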
| | All ages | < 24 | 25–44 | > 44 |
|---|---|---|---|---|
| (Intercept) | 0.11**** (0.02) | 0.10*** (0.03) | 0.20*** (0.03) | 0.20*** (0.035) |
| Penetration rate | 3.23* (1.41) | 8.57*** (2.22) | 6.28** (2.17) | 2.40 (2.77) |
| Geographical diversity | 0.03 (0.02) | 0.15*** (0.04) | 0.08* (0.04) | 0.06 (0.05) |
| Social diversity | −0.03* (0.01) | −0.03 (0.02) | −0.05* (0.02) | −0.06* (0.03) |
| Morning activity | −0.69* (0.26) | −1.30** (0.42) | −1.53*** (0.41) | −1.19* (0.52) |
| Misspellers rate | 11.56 (8.13) | 31.51* (12.78) | 15.46 (12.48) | 23.60 (15.94) |
| Employment mentions | −1.80 (6.27) | 3.17 (9.86) | −9.94 (9.64) | 2.71 (12.3) |
| $R^2$ | 0.47 | 0.64 | 0.55 | 0.29 |
| Adj. $R^2$ | 0.44 | 0.62 | 0.52 | 0.26 |

Standard errors in parentheses. ***p < 0.001, **p < 0.01, *p < 0.05.

Table 4. Regression table for the different models in which unemployment for different age groups is fitted. The All ages model is the fit to the general rate of unemployment in each geographical area, while the other models are for the rates of unemployment in groups of less than 24 years, between 25 and 44 years, and above 44 years.
| | All variables | Youth model | Twitter model (I) | Twitter model (II) |
|---|---|---|---|---|
| (Intercept) | 0.06 (0.03) | −0.02 (0.03) | 0.10*** (0.03) | 0.09*** (0.027) |
| Young pop. rate | 0.66* (0.30) | 2.20*** (0.35) | | |
| Penetration rate | 8.20*** (2.25) | | 8.57*** (2.22) | 8.62*** (2.21) |
| Geographical diversity | 0.14*** (0.04) | | 0.15*** (0.04) | 0.12*** (0.03) |
| Social diversity | −0.02 (0.02) | | −0.03 (0.02) | |
| Morning activity | −1.42*** (0.41) | | −1.30** (0.42) | −1.28** (0.41) |
| Misspellers rate | 23.95 (13.09) | | 31.51* (12.78) | 32.28* (12.71) |
| Employment mentions | 0.34 (9.81) | | 3.17 (9.86) | |
| $R^2$ | 0.65 | 0.24 | 0.64 | 0.63 |
| Adj. $R^2$ | 0.63 | 0.24 | 0.62 | 0.62 |

Standard errors in parentheses. ***p < 0.001, **p < 0.01, *p < 0.05.

Table 5. Regression table for the different statistical models. The All variables model includes both Twitter and rate of young population variables. Twitter model (I) includes only the variables described in the main article, while Twitter model (II) only includes those variables which are significant (p < 0.05) in Twitter model (I).
| | Communities | Municipalities | Counties | Provinces |
|---|---|---|---|---|
| (Intercept) | 0.10*** (0.03) | 0.16*** (0.01) | 0.11*** (0.03) | 0.11* (0.05) |
| Penetration rate | 8.57*** (2.22) | 4.01*** (0.59) | 9.12*** (1.81) | 10.47*** (1.97) |
| Geographical diversity | 0.15*** (0.04) | 0.02 (0.01) | 0.12*** (0.03) | 0.08 (0.07) |
| Social diversity | −0.03 (0.02) | −0.01 (0.01) | −0.01 (0.02) | −0.03 (0.07) |
| Morning activity | −1.30** (0.42) | −1.16*** (0.14) | −1.49*** (0.39) | −1.03 (0.88) |
| Misspellers rate | 31.51* (12.78) | 14.40*** (2.51) | 14.09 (10.02) | |
| Employment mentions | 3.17 (9.86) | −0.71 (0.89) | 2.41 (8.86) | −3.17 (12.29) |
| Number of points | 128 | 1738 | 198 | 50 |
| $R^2$ | 0.64 | 0.22 | 0.55 | 0.65 |
| Adj. $R^2$ | 0.62 | 0.21 | 0.54 | 0.61 |

Standard errors in parentheses. ***p < 0.001, **p < 0.01, *p < 0.05.

Table 6. Regression table for the unemployment linear regression model at different levels of geographical areas. In the Provinces model the misspellers rate has been removed from the model due to its large collinearity with the penetration rate.
Supporting Information for
Social media fingerprints of unemployment
Alejandro Llorente, Manuel García-Herranz, Manuel Cebrian and Esteban Moro
S1 The dataset
Twitter provides an extremely rich and publicly available data set of user interactions, information flows and, thanks to the geolocation of tweets, user movements. Nevertheless, the representativeness of geo-located Twitter as a global source of mobility data has so far received sparse attention. In this sense, while [13] present a promising and extensive study regarding global country-to-country movements (mostly driven by tourism), within-country human flows (comprising not only internal tourism but also, to a greater extent than in country-to-country travel, visiting and commuting) still need further investigation. Therefore, throughout this work we compare our findings using geo-located Twitter with similar studies using commuting surveys.

For the Twitter analysis we consider almost 146 million geo-located Twitter messages (tweets) collected through the public API provided by Twitter for the continental part of Spain from 29th November 2012 to 10th April 2013. In this dataset we consider that there has been a trip from place l to place k if a user has tweeted in place l and place k consecutively. We only keep those transitions in which the first tweet and the second one are dated on the same day. We filter the trips database to avoid unrealistic transitions and keep only trips with a geographical displacement larger than 1 km (see Methods section). By this method, 138 million trips from 167376 different users are considered in our work.

From those trips we construct the mobility flow $T_{ij}$ between municipalities, which measures the number of trips in our database in which the origin is within the boundaries of city i and the destination lies within those of city j.
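The trip definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout `(user, timestamp, lat, lon, municipality)`, the function names and the sample coordinates are hypothetical, the great-circle (haversine) distance stands in for whatever displacement measure the authors used, and the additional filter for unrealistic transitions mentioned in the text is not implemented.

```python
from collections import Counter
from datetime import datetime
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points."""
    p1, p2 = radians(lat1), radians(lat2)
    dphi = p2 - p1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlmb / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

def build_flows(tweets, min_km=1.0):
    """Count trips between municipalities.

    tweets: iterable of (user, timestamp, lat, lon, municipality).
    A trip is a pair of consecutive tweets by the same user, dated on the
    same day and more than min_km apart.
    Returns a Counter mapping (origin, destination) -> number of trips.
    """
    by_user = {}
    for record in tweets:
        by_user.setdefault(record[0], []).append(record)
    flows = Counter()
    for records in by_user.values():
        records.sort(key=lambda r: r[1])
        for a, b in zip(records, records[1:]):
            same_day = a[1].date() == b[1].date()
            far_enough = haversine_km(a[2], a[3], b[2], b[3]) > min_km
            if same_day and far_enough:
                flows[(a[4], b[4])] += 1
    return flows

# Hypothetical sample records (approximate coordinates, invented users).
tweets = [
    ("u1", datetime(2013, 1, 5, 9, 0), 40.4168, -3.7038, "Madrid"),
    ("u1", datetime(2013, 1, 5, 18, 0), 40.3320, -3.7687, "Leganes"),
    ("u1", datetime(2013, 1, 6, 10, 0), 40.4168, -3.7038, "Madrid"),   # next day: not a trip
    ("u2", datetime(2013, 1, 5, 12, 0), 40.4168, -3.7038, "Madrid"),
    ("u2", datetime(2013, 1, 5, 12, 5), 40.4169, -3.7039, "Madrid"),   # under 1 km: not a trip
]
flows = build_flows(tweets)
```

Keeping the result as a mapping from (origin, destination) pairs to counts makes it direct to read off $T_{ij}$ for any pair of municipalities, and the matrix is not forced to be symmetric.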
We also consider population and economic information about the municipalities from the Spanish Census (2011) [8] and unemployment figures from the Public Service of Employment (Servicio Público de Empleo Estatal, SEPE) [7]. In the latter case, registered unemployment (in number of persons) is given for each Spanish municipality by gender, age and month. To get unemployment rates we divide registered unemployment by the total workforce in the municipality, estimated as the number of people aged between 16 and 65 years.
S2 Twitter as mobility proxy
Considering all of the available transitions in our database, one can compute the distance between origin and destination, the elapsed time of the transition and the number of trips per user, among many other statistics. All of them seem to follow a power-law distribution with a cutoff due to the finite spatial size of Spain and the constraint of considering only transitions where the origin and destination check-ins are made on the same day. Focusing on the log-linear part of the distributions, self-similar behaviors arise when Twitter-based mobility is analyzed (see figure 5).
Twitter-based inter-city flows can be well modelled by means of the Gravity Law, which is one of the most widely used methods to represent human mobility [1, 19], with applications in many fields like urban planning [23], traffic engineering [4] or transportation problems [9]. The Gravity Law is also the solution to the problem of maximizing the entropy of the particle distribution among all the possible trips using statistical mechanics techniques [2, 22]. Recently it has also been used as a model for human mobility based on cell phone traces [10, 20, 21] and on social media data at a global scale [13] and at the inter-city level [14].
The Gravity Model for human mobility assumes that the flows between cities can be explained by the expression

$$T^{grav}_{ij} = \frac{P_i^{\alpha_1} P_j^{\alpha_2}}{d_{ij}^{\beta}} \tag{2}$$

where $T^{grav}_{ij}$ is the flow in terms of number of people between cities i and j, $d_{ij}$ is the geographical distance between them, and $P_i$ and $P_j$ are the populations of the two cities.
Given the data, we can obtain the parameters of the model by Weighted Least Squares minimization

$$(\alpha_1^*, \alpha_2^*, \beta^*) = \arg\min_{\alpha_1, \alpha_2, \beta} \frac{1}{N} \sum_{i,j} w_{ij} \left(T_{ij} - T^{grav}_{ij}\right)^2 \tag{3}$$
where N is the total number of connections in the mobility graph and $w_{ij}$ is a weight proportional to the number of observed transitions between i and j. In particular, we find that taking $w_{ij} = T_{ij}^{13}$ [sic] gives the best performance in the model.
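As a rough illustration of the weighted least squares fit, the sketch below evaluates the gravity expression and the weighted objective and then minimizes it by a coarse grid search. Everything here is an assumption made for the example: the function names, the synthetic observations, the grid, and in particular the `weight_exp` parameter, which stands in for the exponent of the weights $w_{ij}$ and is set to an arbitrary value. The paper does not state which optimizer was used, so the grid search is only a stand-in.

```python
from itertools import product

def gravity_flow(p_i, p_j, d_ij, a1, a2, beta):
    """Gravity-model flow: P_i^a1 * P_j^a2 / d_ij^beta."""
    return (p_i ** a1) * (p_j ** a2) / (d_ij ** beta)

def weighted_loss(observations, a1, a2, beta, weight_exp):
    """Mean weighted squared error between observed and gravity flows.

    observations: list of (P_i, P_j, d_ij, T_ij).
    weight_exp: hypothetical exponent so that w_ij = T_ij ** weight_exp.
    """
    total = 0.0
    for p_i, p_j, d_ij, t_ij in observations:
        w = t_ij ** weight_exp
        total += w * (t_ij - gravity_flow(p_i, p_j, d_ij, a1, a2, beta)) ** 2
    return total / len(observations)

def grid_fit(observations, grid, weight_exp=1.0):
    """Return the (a1, a2, beta) on the grid with the smallest weighted loss."""
    return min(
        product(grid, grid, grid),
        key=lambda th: weighted_loss(observations, th[0], th[1], th[2], weight_exp),
    )

# Synthetic check: flows generated from known exponents are recovered.
TRUE = (0.5, 0.5, 1.0)
pairs = [(1000, 5000, 10), (20000, 3000, 50), (800, 800, 5), (150000, 40000, 300)]
observations = [(p_i, p_j, d, gravity_flow(p_i, p_j, d, *TRUE)) for p_i, p_j, d in pairs]
best = grid_fit(observations, grid=[0.25, 0.5, 0.75, 1.0, 1.25])
```

A grid search keeps the example dependency-free; on real data a numerical optimizer over continuous exponents would be the more natural choice.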
In our case, this model fits the inter-city mobility based on Twitter GPS check-ins quite accurately (see table 1). Even though we do not assume $T_{ij}$ to be symmetric, the exponents of the populations are similar, indicating that we observe similar flows in both directions between i and j.
Figure 5. Probability distributions for the different properties of daily trips in the Twitter dataset. Dashed lines correspond to power-law fits with exponents −1.67, −2.43 and −0.62, respectively. [Three log-log panels of density versus trip distance (km), number of trips, and elapsed time (secs).]

S3 Community structures in inter-city mobility graph

Typically, complex networks exhibit community structure, that is, there are subsets of nodes that are more densely connected among themselves than to the rest of the nodes. In mobility networks, whose nodes correspond to geographical areas, these communities are interpreted as zones with high common activity and tend to be constrained by geographical and political barriers. We check whether this is also observed in our dataset by running six state-of-the-art community detection algorithms: FastGreedy [5], Walktrap [16], Infomap [18], MultiLevel [3], Label Propagation [17] and Leading Eigenvector [15]. These six algorithms yield different community structures in terms of number of communities, average community size or modularity (see table 3). Members (municipalities) of the resulting communities are spatially connected, except in a few cases, as figure 7 shows. We test the statistical robustness of the obtained communities by randomly removing a proportion p of the original links and running the algorithms on this new graph $G_p$. We consider the communities robust when the communities obtained for the original network $G$ and for $G_p$ are highly similar. In order to compare two arbitrary assignments of nodes to communities we use the Normalized Mutual Information (NMI) method described in [6], which returns 0 when two assignments are totally different and 1 when they are equal. We compute the NMI for each chosen algorithm run on $G$ and $G_p$ for p between 1% and 10%, concluding that the obtained community structures are robust, because they are not broken when some randomly chosen links are removed (see table 2).
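The NMI comparison used above can be written compactly. The sketch below assumes the common form of the measure, twice the mutual information divided by the sum of the entropies of the two partitions, which is the normalization used by Danon et al. [6]; the function name and the label lists are illustrative only.

```python
from collections import Counter
from math import log

def nmi(labels_a, labels_b):
    """Normalized mutual information between two community assignments.

    labels_a[k] and labels_b[k] are the community labels of node k in the
    two partitions. Returns 2 I(A;B) / (H(A) + H(B)).
    """
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mutual = sum(
        c / n * log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    h_a = -sum(c / n * log(c / n) for c in count_a.values())
    h_b = -sum(c / n * log(c / n) for c in count_b.values())
    if h_a + h_b == 0:
        return 1.0  # both partitions place every node in a single community
    return 2 * mutual / (h_a + h_b)

# Same grouping under different label names, and two unrelated groupings.
same = nmi([0, 0, 1, 1], [5, 5, 7, 7])
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the measure depends only on how nodes are grouped, not on the label names, it can be applied directly to the output of two different community detection runs, for instance on $G$ and on a perturbed graph $G_p$.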
Figure 6. Penetration rates for both cities and detected communities. [Log-log scatter plot; x-axis: population; y-axis: penetration rate; legend: communities, cities.]
As other works have shown, mobility graph communities are usually interpreted in terms of geographical and political barriers, and a natural question is whether the mobility-based communities are related to any of these barriers. In Spain there are different territorial divisions for administrative purposes. In this work we consider two of them: provinces, defined in the 1978 Constitution, which are 48 heterogeneous aggregations of municipalities; and counties (comarcas in Spanish terminology), which are traditional aggregations of municipalities mainly based on Spanish orography (rivers, valleys, ridges, etc.), some of which are composed of municipalities from different provinces. We again use the NMI method to compare the community structure given by the algorithms with the administrative limits. Except for the Leading Eigenvector algorithm, all methods return communities that are closely related to provinces (NMI ≈ 0.7), whereas for the county limits a higher variability is observed. In this last case the algorithm most closely related to county limits is Infomap, with NMI ≈ 0.83. Therefore, Twitter-based mobility summarizes the inter-city flows, showing that these flows are influenced by geographical and political barriers.
S4 Twitter demographics and unemployment rates
Different age groups are not equally represented in Twitter. Recent surveys (2012) in Spain suggest that most (86%) of users in Twitter are 16 to 44 years old. Comparison of the percentage of users per age group with the total population within the same groups (see figure 8) reveals that age groups above 35 years old are under-represented in Twitter. Thus, our Twitter data will be more revealing when trying to describe unemployment in age groups below 44 years old. This is indeed what we find when we try to build a linear model for the unemployment rate in different age groups with the same Twitter variables: while unemployment rates for ages below 24 can be fitted to a linear model with $R^2 = 0.62$, regression models for unemployment rates for ages between 25 and 44 have $R^2 = 0.52$, while for ages above 44 we get only $R^2 = 0.26$. Table 4 summarizes the results for the regression models of unemployment rates in each age group, showing that our Twitter variables have more explanatory power for ages below 44. Finally, in figure 8 we can see the performance of the model for the different age groups; once again, the poor explanatory power of the Twitter variables for the unemployment rate at ages above 44 years old is apparent.
S5 Properties of Twitter variables
Normalization and distributions
Heterogeneity between the values of variables constructed from Twitter is large but moderate, as the histograms in figure 9 show. We did not find any geographical area with anomalous values in any of the variables considered. Variables are normalized in different ways: both the penetration $\tau_i$ and the misspellers rate $\varepsilon_i$ are defined as the number of users or misspellers per 100000 persons (population); activity variables $\nu_i$ are normalized as the percentage of tweets per time interval; finally, the numbers of tweets that mention a specific term, $\mu_i$, are also given per 100000 tweets published in the geographical area.

Figure 7. From left to right and from top to bottom: FastGreedy, Walktrap, Infomap, MultiLevel, Label Propagation and Leading Eigenvector communities on Twitter-based mobility transitions.
Correlation between variables
Variables are constructed to reflect the behavior of areas in the different dimensions of Twitter: penetration, social or geographical diversity, activity through the day, and content. The correlation between variables does indeed show that variables within each dimension are strongly correlated with each other. As we can see in figure 10, social and geographical diversities are highly correlated, an expected fact given that the gravity law accurately describes not only the flows of people between geographical areas but also the amount of communication between them. The same behavior is found for the variables in the activity group, while content variables are less correlated. Finally, we find that both the penetration rate $\tau_i$ and the fraction of misspellers $\varepsilon_i$ have a strong correlation with most of the variables.
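For readers who want to reproduce this kind of check on their own data, a correlation matrix between per-area variables can be computed as sketched below. The Pearson coefficient is an assumption (the text does not name the correlation measure), and the variable names and numbers are made up purely to exercise the functions.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def correlation_matrix(columns):
    """Map every pair of variable names to the correlation of their columns."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}

# Toy per-area values (invented for illustration, one entry per area).
columns = {
    "geo_diversity": [1.0, 2.0, 3.0, 4.0],
    "social_diversity": [2.0, 4.0, 6.0, 8.0],
    "morning_activity": [4.0, 3.0, 2.0, 1.0],
}
corr = correlation_matrix(columns)
```

Pairs of variables with a coefficient close to 1 or −1 carry largely redundant information, which is the situation in which collinearity can affect the regression weights.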
High correlation between variables might lead to collinearity effects [24] in the linear regression models, that is, some variables with predictive value might have non-significant weights because they explain the same part of the variance. For instance, in Table 5 the misspellers rate has a very strong predictive value but its p-value is too high to consider it significant. To test this hypothesis we perform a principal component analysis (PCA) on the independent variables of the regression. Figure 10 shows the loadings of the different variables on the first two principal components. The block structure shown in figure 10 results in similar directions of the variables in the first components of the PCA. We observe some groups of variables: on the one hand, geographical and social diversity seem to explain a large part of the variance; on the other hand, we find a perpendicular group of variables formed by temporal activity; finally, penetration rate and misspellers fraction seem to represent a different, independent direction of the data, with high collinearity between them. This might explain the low statistical significance in the models of section 4. In any case, the structure of the correlation matrix and the PCA results show that there is indeed information in all groups of variables, and thus we have taken one variable from each of them for our regression models.
S6 Misspellers detection
In this work we consider only tweets in Spanish. Since several languages coexist in Spain, depending on the part of the country, the first step is to reduce our Twitter dataset to those tweets that are written in Spanish. This task is carried out using the n-gram based text categorization R library textcat [11]. Then, in order to decide whether a tweet has a misspelling or not, we need to establish some patterns to search for in our set of tweets. Since we want to be sure that a detected mistake corresponds to a real misspeller, we do not consider the following cases:
• Lack of written accents. People tend to avoid writing accents when talking in a colloquial way.
• Mistakes derived from removing unnecessary letters. The most common cases are removing an h at the beginning of a word (in Spanish the letter h is not pronounced) or replacing the letters qu by k. We understand that these mistakes can be motivated by the length limit of tweets and not by a real misspelling.
• In the same line, we neglect mistakes produced by removing letters in the middle of a word whose pronunciation can be deduced without them.
• We do not consider either mistakes related to features of specific areas in Spain. For example, in the south the pronunciation of ce and se is the same, which produces a large number of mistakes when writing. However, since we want to extract objective and equitable conclusions over the whole Spanish geography, we neglect those misspellings that only appear in a specific area.

Figure 9. Frequency plots for each variable constructed from Twitter. [Twelve histogram panels of P(x) versus x: tweets (unemployment), penetration rate, Entropy2 (social), Entropy1 (social), Entropy2 (geo), Entropy1 (geo), misspellers rate, tweets (employment), tweets (job), activity (night), activity (morning), activity (afternoon).]
Likewise, we consider as real misspellings the following mistakes:

• Adding letters. For example, writing an h at the beginning of a word that starts with a vowel.
• Replacing the special cases mp, mb by the wrong spellings np, nb.
• Mixing up b with v, g with j, ll with y, and ex with es. These are typical mistakes in Spanish because they have the same or a very close pronunciation.
• Confusing the verb haber with the periphrasis a ver.
• Separating a word into two, for instance writing the word conmigo as con migo.

Figure 8. Top: percentage of population in each age group from the Spanish Census (dark bars) and surveys about users in Twitter (light bars). Bottom: performance of the linear models for each of the age groups. [Bottom panels, predicted versus observed unemployment (%): All ages ($R^2 = 0.47$), < 24 ($R^2 = 0.62$), 25–44 ($R^2 = 0.52$), > 44 ($R^2 = 0.26$).]
This way, our list of misspellings is composed of 617 common mistakes in Spanish that cannot be attributed to the special features of Twitter or to a specific region of Spain. Thus one can expect that this selection provides an accurate and equitable method of detecting misspellers. Under these conditions, the number of users who wrote at least one misspelled word is 27055 (56 over the whole population).
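A list-based detector of this kind can be sketched as follows. The four entries in `MISSPELLINGS` are hypothetical examples chosen to match the rules above (np/nb for mp/mb, b/v confusion, a split word); they are not taken from the paper's 617-entry list, which is not reproduced here. Accents are stripped before matching, in line with the decision not to count missing accents as mistakes.

```python
import re

# Hypothetical excerpt of a misspelling list (not the paper's actual list).
MISSPELLINGS = {"tanbien", "enpezar", "haver", "con migo"}

# Accented vowels are folded so that a missing accent is never a mismatch.
ACCENTS = str.maketrans("áéíóúü", "aeiouu")

def normalize(text):
    """Lowercase, fold accents, and keep only letters separated by single spaces."""
    text = text.lower().translate(ACCENTS)
    return " ".join(re.sub(r"[^a-zñ]+", " ", text).split())

def has_misspelling(tweet, patterns=MISSPELLINGS):
    """True if any listed form appears as a whole word or phrase in the tweet."""
    padded = " " + normalize(tweet) + " "
    return any(f" {p} " in padded for p in patterns)

def misspellers(tweets_by_user):
    """Set of users with at least one tweet containing a listed misspelling."""
    return {
        user
        for user, tweets in tweets_by_user.items()
        if any(has_misspelling(t) for t in tweets)
    }

flagged = misspellers({
    "u1": ["hola", "haver si vienes"],
    "u2": ["todo bien", "tambien voy"],  # missing accent only: not flagged
})
```

Matching whole words against a fixed list keeps false positives low, at the cost of missing any mistake that is not on the list.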
Figure 10. Top: correlation matrix between the variables constructed from Twitter. Each entry in the matrix is depicted as a circle whose size is proportional to the correlation between variables and whose color is blue/red for positive/negative correlations. Blank entries correspond to statistically insignificant correlations with 95% confidence. Bottom: projection of the variables on the first two principal components given by PCA. We observe different groups of variables and collinearity between some of them.

We analyze whether misspellers have a different Twitter usage behavior from that of people who do not make serious mistakes when publishing a tweet. Comparing the average number of tweets, it can be observed that misspellers tend to publish a larger number of tweets than those who did not make mistakes (14471 against 2372). This also emerges when the mean number of misspellings given the total number of tweets is considered. For users with fewer than approximately 30 published tweets in the observation period the number of misspellings is almost zero, whereas for users who publish more often the mean number of misspellings scales sub-linearly with the number of tweets (exponent ≈ 0.33).
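A scaling exponent such as the one quoted above can be estimated as the slope of a straight-line fit in log-log space. The sketch below assumes an ordinary least squares fit on the logarithms, which is one common choice and not necessarily the procedure used by the authors; the data points are synthetic and generated from a known exponent.

```python
from math import log

def loglog_slope(xs, ys):
    """Slope of the least squares line through (log x, log y)."""
    lx = [log(x) for x in xs]
    ly = [log(y) for y in ys]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    den = sum((a - mean_x) ** 2 for a in lx)
    return num / den

# Synthetic data following y = 2 * x^0.33 exactly (illustration only).
tweets = [30, 60, 120, 240, 480]
misspellings = [2.0 * t ** 0.33 for t in tweets]
slope = loglog_slope(tweets, misspellings)
```

An exponent below 1 means that doubling the number of tweets less than doubles the mean number of misspellings, which is what "sub-linear" refers to.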
Since we have observed a segmentation of the Twitter population based on how accurately users write, we consider the misspellers rate as a proxy for the educational level of the cities. A large number of previous works in the literature have revealed the relationship between the economic status and the educational level of geographical areas, and therefore it is natural to ask whether the observed misspellers rate is related to the economy, as reflected by the unemployment rate. To test this hypothesis we consider cities with more than 5000 inhabitants, to avoid subsampled cases. We find a strong positive correlation between the probability of finding a misspeller in a city and the unemployment rate (0.372, 0.491).
Figure 11. Number (red) and probability (blue) of observed misspellings given the number of tweets. [Axes: N(miss|tweets) and P(miss|tweets) versus number of tweets, logarithmic scale.]

Figure 12. Explanatory power of the linear regression model when fitted against the unemployment data for different months. The gray (orange) area corresponds to the time window in which Twitter data is collected and variables are constructed. [x-axis: month, from 2012-10 to 2014-06; y-axis: $R^2$.]
S7 Time window and unemployment
In the definition of the variables we have aggregated the Twitter activity within a 7-month time window (from December 2012 to June 2013). Since unemployment varies significantly over time, we investigate here the correlation and explanatory power of the Twitter variables for the values of unemployment at different months within the same time window in which Twitter data was collected, and whether the variables collected in that time window are more correlated with past or future values of unemployment. Figure 12 shows the explanatory value of the model when the linear regression is done for values of unemployment of different months before, during and after the Twitter data time window. Although there is a small seasonal effect along the year, we see that the explanatory power remains around $R^2 = 0.6$, which suggests that our Twitter linear model retains its explanatory power even though unemployment changes considerably throughout the year. It is interesting to note that $R^2$ decays slightly during the summer, which means that our variables are less correlated with summer unemployment. Finally, the unemployment used in the main article is from June 2013, i.e. the last month in the time window used to collect the data.
S8 Demographics does not explain unemployment
Since unemployment rates are very large for the group of young people, a natural question is whether demographic variables alone could explain the heterogeneity of youth unemployment rates found in the geographical areas. To test this, we have built four linear models: the first one (named Youth model in Table 5) has the rate of young population as the only explanatory variable; the second and third ones are built on only the Twitter variables considered in the main text (named Twitter model (I)) or just on those whose regression coefficients are statistically significant (Twitter model (II)); the fourth one is fitted with all the variables (named All variables model in Table 5). In table 5 we show the summary of the regression for each model. Focusing on the variance explained by the model in terms of $R^2$, it can be checked that considering all Twitter variables is roughly three times more explanatory than considering only the proportion of young people ($R^2 = 0.64$ versus $0.24$). On the other hand, the comparison of $R^2$ for the Twitter model with the ones for the All variables and Youth models shows that the rate of young population does not provide significant explanatory power. This semi-partial analysis shows that our Twitter variables retain a high explanatory power when the effect of the young population rate is controlled for.
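The comparison can be made explicit with a few lines of arithmetic on the $R^2$ values reported in Table 5. The sketch assumes the standard definition of adjusted $R^2$ and takes the number of areas, 128, from the Communities column of Table 6; the function name is illustrative.

```python
def adjusted_r2(r2, n_points, n_predictors):
    """Standard adjusted R^2 for a linear model with an intercept."""
    return 1.0 - (1.0 - r2) * (n_points - 1) / (n_points - n_predictors - 1)

# R^2 values as reported in Table 5.
R2_ALL, R2_YOUTH, R2_TWITTER = 0.65, 0.24, 0.64

# Twitter model (I): six predictors, 128 areas (assumed from Table 6).
adj_twitter = adjusted_r2(R2_TWITTER, n_points=128, n_predictors=6)

# Increment in R^2 when one block of variables is added to the other.
gain_from_youth = R2_ALL - R2_TWITTER    # young population rate added to the Twitter variables
gain_from_twitter = R2_ALL - R2_YOUTH    # Twitter variables added to the young population rate
```

Under these assumptions the adjusted value for Twitter model (I) rounds to the 0.62 given in the table, and the increments show that adding the young population rate to the Twitter variables changes $R^2$ far less than adding the Twitter variables to the young population rate.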
S9 Unemployment models for other geo-graphical areas
While municipalities are very heterogeneous demographically otheradministrative areas exist in Spain at large scales that could be usedfor our model of unemployment As mentioned in section 4 thesmallest administrative division of Spain we have considered is thatof the 8200 municipalities At larger scales we have the 326 coun-ties (comarcas in spanish) which are aggregations of municipalitiesFinally the largest geographical scale we considered is defined by50 provinces (provincias in Spanish) In this section we comparethe performance of our Twitter model for unemployment for thevariables defined in those administrative areas and relate it to thegeographical communities detected and used in the main paper (seesection 4) Not all the areas at different administrative divisions areconsidered in the model To minimize the effect of areas in whichthe number of geo-tagged tweets is very small we only consider the1738 municipalities which have a Twitter population π gt 10 Simi-larly we only consider the 198 counties with π gt 100 As we can seein Table 6 the model has a large explanatory power for areas equalor bigger than counties As expected R2 increases as the number ofareas in the model is smaller but the description level of the modelis very low for provinces for example The best performance (highR2 and high geographical description level) is attained at the level ofthe detected communities
S10 Relative importance of the variables
To assess the relative importance of the variables in the unemployment model we have used several methods. They all give qualitatively the same results, with some variations for the statistically insignificant variables. Specifically, we have used:
1. (weight) Relative weight of the absolute values of the coefficients obtained in the linear regression when variables are scaled to have mean zero and variance one.
2. (lmg) Averaging over orderings, as proposed by Lindeman, Merenda and Gold.
3. (pmvd) The PMVD metric introduced by Feldman, which is an average over orderings as well, but with data-dependent weights.
4. (first) The univariate $R^2$ values from regression models with one variable only.
All these metrics are obtained using the relaimpo R package [12]. The results for the youth unemployment model are shown in figure 13, where we can see that the different methods yield similar relative importances for the variables, except perhaps for the diversity of mobility flows, a variable with a non-significant weight in the regression model.

Figure 13. Relative importance of the variables (in percentage) in the unemployment model for the different ways to calculate it (weight, first, lmg and pmvd).
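To make two of the metrics above concrete, the following Python sketch computes the `first` metric (univariate $R^2$) and the `lmg` metric (increase in $R^2$ averaged over all orderings of the predictors) on a small made-up data set. It is not the relaimpo implementation: the function names, the toy predictors `var_a`, `var_b`, `var_c` and the toy response are all assumptions introduced for illustration, and the exhaustive loop over permutations is only practical for a handful of variables.

```python
from itertools import permutations


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def r_squared(columns, y):
    """R^2 of an ordinary least squares fit of y on the columns plus an intercept."""
    if not columns:
        return 0.0
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    xtx = [[sum(row[a] * row[b] for row in design) for b in range(k)] for a in range(k)]
    xty = [sum(row[a] * yi for row, yi in zip(design, y)) for a in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def first_importance(predictors, y):
    """'first' metric: R^2 of each single-variable regression."""
    return {name: r_squared([col], y) for name, col in predictors.items()}


def lmg_importance(predictors, y):
    """'lmg' metric: sequential R^2 gain of each variable, averaged over orderings."""
    names = list(predictors)
    shares = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        included, previous = [], 0.0
        for name in order:
            included.append(predictors[name])
            current = r_squared(included, y)
            shares[name] += current - previous
            previous = current
    return {name: total / len(orderings) for name, total in shares.items()}


# Made-up toy data, for illustration only.
toy_predictors = {
    "var_a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "var_b": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0],
    "var_c": [0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0],
}
toy_y = [0.30, 0.28, 0.25, 0.27, 0.20, 0.22, 0.15, 0.18]

first = first_importance(toy_predictors, toy_y)
lmg = lmg_importance(toy_predictors, toy_y)
full_r2 = r_squared(list(toy_predictors.values()), toy_y)
```

A useful property to check is that the `lmg` shares add up to the $R^2$ of the full model, which is what allows them to be reported as percentages of explained variance.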
References
[1] B. Ashtakala. Generalized power model for trip distribution. Transportation Research Part B: Methodological, 21(1):59–67, 1987.
[2] Michel Bierlaire. Mathematical models for transportation demand analysis. Transportation Research Part A: Policy and Practice, 31(1):86–86, 1997.
[3] Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.
[4] Harry J. Casey Jr. The law of retail gravitation applied to traffic engineering. Traffic Quarterly, 9(3), 1955.
[5] Aaron Clauset, Mark E. J. Newman, and Cristopher Moore. Finding community structure in very large networks. Physical Review E, 70(6):066111, 2004.
[6] Leon Danon, Albert Díaz-Guilera, Jordi Duch, and Alex Arenas. Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment, 2005(09):P09008, 2005.
[7] Servicio Público de Empleo Estatal (SEPE). Spanish registered unemployment. http://www.sepe.es/contenidos/que_es_el_sepe/estadisticas/index.html
[8] Instituto Nacional de Estadística. Spanish 2011 census. http://www.ine.es/censos2011_datos/cen11_datos_inicio.htm
[9] Suzanne P. Evans. A relationship between the gravity model for trip distribution and the transportation problem in linear programming. Transportation Research, 7(1):39–61, 1973.
[10] Paul Expert, Tim S. Evans, Vincent D. Blondel, and Renaud Lambiotte. Uncovering space-independent communities in spatial networks. Proceedings of the National Academy of Sciences, 108(19):7663–7668, 2011.
[11] Ingo Feinerer, Christian Buchta, Wilhelm Geiger, Johannes Rauch, Patrick Mair, and Kurt Hornik. The textcat package for n-gram based text categorization in R. Journal of Statistical Software, 52(6):1–17, 2013.
[12] Ulrike Grömping. Relative importance for linear regression in R: the package relaimpo. Journal of Statistical Software, 17(1):1–27, 2006.
[13] Bartosz Hawelka, Izabela Sitko, Euro Beinat, Stanislav Sobolevsky, Pavlos Kazakopoulos, and Carlo Ratti. Geo-located Twitter as the proxy for global mobility patterns. arXiv preprint arXiv:1311.0680, 2013.
[14] Yu Liu, Zhengwei Sui, Chaogui Kang, and Yong Gao. Uncovering patterns of inter-urban trips and spatial interactions from check-in data. arXiv preprint arXiv:1310.0282, 2013.
[15] Mark E. J. Newman. Finding community structure in networks using the eigenvectors of matrices. Physical Review E, 74(3):036104, 2006.
[16] Pascal Pons and Matthieu Latapy. Computing communities in large networks using random walks. In Computer and Information Sciences - ISCIS 2005, pages 284–293. Springer, 2005.
[17] Usha Nandini Raghavan, Réka Albert, and Soundar Kumara. Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 76(3):036106, 2007.
[18] Martin Rosvall and Carl T. Bergstrom. Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences, 105(4):1118–1123, 2008.
[19] Morton Schneider. Gravity models and trip distribution theory. Papers in Regional Science, 5(1):51–56, 1959.
[20] Filippo Simini, Marta C. González, Amos Maritan, and Albert-László Barabási. A universal model for mobility and migration patterns. Nature, 484(7392):96–100, 2012.
[21] Chaoming Song, Tal Koren, Pu Wang, and Albert-László Barabási. Modelling the scaling properties of human mobility. Nature Physics, 6(10):818–823, 2010.
[22] Alan Geoffrey Wilson. Entropy in urban and regional modelling. Pion Ltd, 1970.
[23] Alan Geoffrey Wilson. Urban and regional models in geography and planning. 1974.
[24] Svante Wold, Arnold Ruhe, Herman Wold, and W. J. Dunn III. The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses. SIAM Journal on Scientific and Statistical Computing, 5(3):735–743, 1984.
Gravity Model
Parameter   Description                               Spain
α1          Origin exponent                           0.477*** (0.002)
α2          Destination exponent                      0.478*** (0.002)
β           Distance exponent                         1.05*** (0.0035)
R^2         Goodness of fit                           0.797
φ           Correlation between T_ij and T^grav_ij    0.826

Table 1. Description of the parameters for the Gravity Law Model in geo-tagged social media data for Spain. (***) means significance p < 0.0001.
NMI between G and Gp for different p
Algorithm   p = 0.01   0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09    0.1
FG          0.995      0.992   0.989   0.983   0.981   0.977   0.983   0.969   0.980   0.959
WT          0.954      0.959   0.950   0.954   0.945   0.948   0.947   0.935   0.926   0.931
IM          0.988      0.981   0.980   0.981   0.978   0.974   0.975   0.970   0.969   0.966
ML          0.994      0.978   0.979   0.983   0.948   0.934   0.972   0.952   0.973   0.947
LP          0.906      0.908   0.911   0.915   0.895   0.907   0.907   0.893   0.905   0.904
LE          0.960      0.957   0.956   0.859   0.910   0.892   0.908   0.858   0.885   0.884

Table 2. NMI measure comparing G and Gp. FG: FastGreedy, WT: Walktrap, IM: Infomap, ML: MultiLevel, LP: Label Propagation, LE: Leading Eigenvector.
Communities Stats
Algorithm   ⟨|Ni|⟩      max |Ni|   |{Ni}|   Modularity   NMI P   NMI C
FG          309.696     1385       23       0.726        0.712   0.590
WT          9.262       433        769      0.417        0.744   0.757
IM          21.011      143        339      0.758        0.770   0.831
ML          323.772     1132       22       0.800        0.717   0.599
LP          22.052      750        323      0.732        0.749   0.761
LE          1017.571    5344       7        0.381        0.264   0.205

Table 3. Statistics of the communities Ni returned by the six algorithms: average size ⟨|Ni|⟩, maximum size max |Ni|, number of communities |{Ni}| and modularity. NMI P refers to the comparison between communities and provinces, whereas NMI C considers counties instead of provinces.
                          All ages     < 24         25−44        > 44
(Intercept)               0.11****     0.10***      0.20***      0.20***
                          (0.02)       (0.03)       (0.03)       (0.035)
Penetration rate          3.23*        8.57***      6.28**       2.40
                          (1.41)       (2.22)       (2.17)       (2.77)
Geographical diversity    0.03         0.15***      0.08*        0.06
                          (0.02)       (0.04)       (0.04)       (0.05)
Social diversity          −0.03*       −0.03        −0.05*       −0.06*
                          (0.01)       (0.02)       (0.02)       (0.03)
Morning activity          −0.69*       −1.30**      −1.53***     −1.19*
                          (0.26)       (0.42)       (0.41)       (0.52)
Misspellers rate          11.56        31.51*       15.46        23.60
                          (8.13)       (12.78)      (12.48)      (15.94)
Employment mentions       −1.80        3.17         −9.94        2.71
                          (6.27)       (9.86)       (9.64)       (12.3)
R^2                       0.47         0.64         0.55         0.29
Adj. R^2                  0.44         0.62         0.52         0.26
***p < 0.001, **p < 0.01, *p < 0.05

Table 4. Regression table for the different models in which unemployment for different age groups is fitted. The All ages model is the fit to the general rate of unemployment in each geographical area, while the other models are for the rates of unemployment in the groups of less than 24 years, between 25 and 44 years, and above 44 years. Standard errors are given in parentheses.
                          All variables   Youth model   Twitter model (I)   Twitter model (II)
(Intercept)               0.06            −0.02         0.10***             0.09***
                          (0.03)          (0.03)        (0.03)              (0.027)
Young pop. rate           0.66*           2.20***
                          (0.30)          (0.35)
Penetration rate          8.20***                       8.57***             8.62***
                          (2.25)                        (2.22)              (2.21)
Geographical diversity    0.14***                       0.15***             0.12***
                          (0.04)                        (0.04)              (0.03)
Social diversity          −0.02                         −0.03
                          (0.02)                        (0.02)
Morning activity          −1.42***                      −1.30**             −1.28**
                          (0.41)                        (0.42)              (0.41)
Misspellers rate          23.95                         31.51*              32.28*
                          (13.09)                       (12.78)             (12.71)
Employment mentions       0.34                          3.17
                          (9.81)                        (9.86)
R^2                       0.65            0.24          0.64                0.63
Adj. R^2                  0.63            0.24          0.62                0.62
***p < 0.001, **p < 0.01, *p < 0.05

Table 5. Regression table for the different statistical models. The All variables model includes both the Twitter variables and the rate of young population. Twitter model (I) includes only the variables described in the main article, while Twitter model (II) only includes those variables which are significant (p < 0.05) in Twitter model (I). Standard errors are given in parentheses.
                          Communities   Municipalities   Counties     Provinces
(Intercept)               0.10***       0.16***          0.11***      0.11*
                          (0.03)        (0.01)           (0.03)       (0.05)
Penetration rate          8.57***       4.01***          9.12***      10.47***
                          (2.22)        (0.59)           (1.81)       (1.97)
Geographical diversity    0.15***       0.02             0.12***      0.08
                          (0.04)        (0.01)           (0.03)       (0.07)
Social diversity          −0.03         −0.01            −0.01        −0.03
                          (0.02)        (0.01)           (0.02)       (0.07)
Morning activity          −1.30**       −1.16***         −1.49***     −1.03
                          (0.42)        (0.14)           (0.39)       (0.88)
Misspellers rate          31.51*        14.40***         14.09
                          (12.78)       (2.51)           (10.02)
Employment mentions       3.17          −0.71            2.41         −3.17
                          (9.86)        (0.89)           (8.86)       (12.29)
Number of points          128           1738             198          50
R^2                       0.64          0.22             0.55         0.65
Adj. R^2                  0.62          0.21             0.54         0.61
***p < 0.001, **p < 0.01, *p < 0.05

Table 6. Regression table for the unemployment linear regression model at different levels of geographical areas. In the Provinces model the misspellers rate has been removed from the model due to its large collinearity with the penetration rate. Standard errors are given in parentheses.
- 1 Social media dataset and functional partition of cities
- 2 Social media behavioral fingerprints
- 3 Explanatory power of social media in unemployment
- 4 Discussion
Supporting Information for
Social media fingerprints of unemployment
Alejandro Llorente, Manuel García-Herranz, Manuel Cebrian and Esteban Moro
S1 The dataset
Twitter provides an extremely rich and publicly available data set of user interactions, information flows and, thanks to the geo-location of tweets, user movements. Nevertheless, the representativeness of geo-located Twitter as a global source of mobility data has so far received little attention. While [13] present a promising and extensive study of global country-to-country movements (mostly driven by tourism), within-country human flows (comprising not only internal tourism but also, to a greater extent than in country-to-country travel, visiting and commuting) still need further investigation. Therefore, throughout this work we compare our findings using geo-located Twitter with those of similar studies using commuting surveys.
For the Twitter analysis we consider almost 146 million geo-located Twitter messages (tweets) collected through the public API provided by Twitter for the continental part of Spain from 29th November 2012 to 10th April 2013. In this dataset we consider that there has been a trip from place l to place k if a user has tweeted in place l and place k consecutively. We only keep those transitions in which the first tweet and the second one are dated on the same day. We filter the trips database to avoid unrealistic transitions and keep only trips with a geographical displacement larger than 1 km (see Methods section). With this method, 138 million trips from 167376 different users are considered in our work.
From those trips we construct the mobility flow $T_{ij}$ between municipalities, which measures the number of trips in our database whose origin lies within the boundaries of city $i$ and whose destination lies within those of city $j$.
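The trip definition and the flow counts described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the tuple layout of a tweet, the function names and the sample coordinates are hypothetical, the great-circle (haversine) distance is our choice for measuring displacement, and the additional filter for unrealistic transitions mentioned in the text is not reproduced.

```python
import math
from collections import Counter, defaultdict
from datetime import datetime

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (latitude, longitude) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def build_flows(tweets, min_km=1.0):
    """Count trips between municipalities.

    tweets: iterable of (user, timestamp, lat, lon, municipality) tuples.
    A trip is a pair of consecutive tweets by the same user, dated on the
    same day and separated by more than min_km.
    """
    by_user = defaultdict(list)
    for tweet in tweets:
        by_user[tweet[0]].append(tweet)
    flows = Counter()
    for user_tweets in by_user.values():
        user_tweets.sort(key=lambda t: t[1])
        for first, second in zip(user_tweets, user_tweets[1:]):
            same_day = first[1].date() == second[1].date()
            displacement = haversine_km(first[2], first[3], second[2], second[3])
            if same_day and displacement > min_km:
                flows[(first[4], second[4])] += 1
    return flows


# Hypothetical sample: one valid trip, one next-day pair, one sub-kilometre pair.
sample = [
    ("u1", datetime(2013, 1, 5, 9, 0), 40.4168, -3.7038, "Madrid"),
    ("u1", datetime(2013, 1, 5, 18, 0), 40.3281, -3.7644, "Leganes"),
    ("u1", datetime(2013, 1, 6, 10, 0), 40.4168, -3.7038, "Madrid"),
    ("u2", datetime(2013, 1, 5, 12, 0), 40.4168, -3.7038, "Madrid"),
    ("u2", datetime(2013, 1, 5, 13, 0), 40.4170, -3.7040, "Madrid"),
]
flows = build_flows(sample)
print(dict(flows))
```

In this sample only the same-day Madrid to Leganes pair is counted; the return to Madrid happens on the following day and the two tweets by `u2` are less than 1 km apart.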
We also consider population and economic information about the municipalities from the Spanish Census (2011) [8] and unemployment figures from the Public Service of Employment (Servicio Público de Empleo Estatal, SEPE) [7]. In the latter case, registered unemployment (in number of persons) is given for each Spanish municipality by gender, age and month. To get unemployment rates we divide registered unemployment by the total workforce in the municipality, estimated as the number of people aged between 16 and 65 years.
S2 Twitter as mobility proxy
Considering all of the available transitions in our database, one can compute the distance between origin and destination, the elapsed time of the transition and the number of trips per user, among many other statistics. All of them seem to follow a power-law distribution with a cutoff, due to the finite spatial size of Spain and to the constraint of considering only transitions whose origin and destination check-ins are made on the same day. Focusing on the log-linear part of the distributions, self-similar behaviors arise when Twitter-based mobility is analyzed (see figure 5).
Twitter-based inter-city flows can be well modelled by means of the Gravity Law, which is one of the most widespread methods to represent human mobility [1, 19], with applications in many fields such as urban planning [23], traffic engineering [4] or transportation problems [9]. The Gravity Law is also the solution to the problem of maximizing the entropy of the particle distribution among all the possible trips using statistical mechanics techniques [2, 22]. Recently it has also been used as a model for human mobility based on cell phone traces [10, 20, 21] and on social media data at a global scale [13] and at the inter-city level [14].
The Gravity Model for human mobility assumes that the flows between cities can be explained by the expression

$$T^{grav}_{ij} = \frac{P_i^{\alpha_1} P_j^{\alpha_2}}{d_{ij}^{\beta}} \qquad (2)$$

where $T^{grav}_{ij}$ is the flow, in terms of number of people, between cities $i$ and $j$, $d_{ij}$ is the geographical distance between them, and $P_i$ and $P_j$ are the populations of the two cities, respectively.
Given the data, we can obtain the parameters of the model by Weighted Least Squares minimization:

$$\alpha_1^{*}, \alpha_2^{*}, \beta^{*} = \arg\min_{\alpha_1, \alpha_2, \beta} \frac{1}{N} \sum_{ij} w_{ij} \left( T_{ij} - T^{grav}_{ij} \right)^2 \qquad (3)$$

where N is the total number of connections in the mobility graph andwi j is a weight proportional to the number of observed transitionsbetween i and j In particular we find that taking wi j = T 13
i j givesthe best performance in the model
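As an illustration of the gravity expression (2) and the weighted objective (3), the Python sketch below evaluates the weighted squared error for given exponents and picks the best combination on a coarse grid. Everything specific here is an assumption made for the example: the three cities, their populations and distances are synthetic, the flows are generated from known exponents, the weight function defaults to uniform weights, and an exhaustive grid search stands in for whatever optimizer was actually used.

```python
from itertools import product


def gravity_flow(p_i, p_j, d_ij, alpha1, alpha2, beta):
    """Gravity prediction: P_i^alpha1 * P_j^alpha2 / d_ij^beta."""
    return (p_i ** alpha1) * (p_j ** alpha2) / (d_ij ** beta)


def weighted_loss(observations, alpha1, alpha2, beta, weight=lambda t: 1.0):
    """Mean weighted squared error between observed and predicted flows.

    observations: list of (P_i, P_j, d_ij, T_ij) tuples.
    weight: function of the observed flow T_ij (uniform by default).
    """
    total = 0.0
    for p_i, p_j, d_ij, t_ij in observations:
        residual = t_ij - gravity_flow(p_i, p_j, d_ij, alpha1, alpha2, beta)
        total += weight(t_ij) * residual ** 2
    return total / len(observations)


def grid_search(observations, grid, weight=lambda t: 1.0):
    """Return the (alpha1, alpha2, beta) on the grid with the smallest loss."""
    return min(
        product(grid, repeat=3),
        key=lambda params: weighted_loss(observations, *params, weight=weight),
    )


# Synthetic example: flows generated exactly from known exponents.
populations = {"A": 1000.0, "B": 5000.0, "C": 20000.0}
distances = {("A", "B"): 10.0, ("A", "C"): 50.0, ("B", "C"): 120.0}
TRUE_PARAMS = (0.5, 0.5, 1.0)

synthetic = []
for (i, j), d in distances.items():
    for origin, dest in ((i, j), (j, i)):
        flow = gravity_flow(populations[origin], populations[dest], d, *TRUE_PARAMS)
        synthetic.append((populations[origin], populations[dest], d, flow))

best = grid_search(synthetic, [0.25, 0.5, 0.75, 1.0])
print(best)
```

On real data one would replace the grid by a continuous optimizer and pass the chosen weight function through the `weight` argument.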
In our case this model fits the inter-city mobility based on Twitter GPS check-ins quite accurately (see table 1). Even though $T_{ij}$ is not necessarily symmetric, the exponents of the two populations are similar, indicating that we observe similar flows in both directions between $i$ and $j$.
S3 Community structures in inter-city mobility graph
Typically, complex networks exhibit community structure, that is, there are subsets of nodes that are more densely connected among themselves than to the rest of the nodes. In mobility networks whose nodes correspond to geographical areas, these communities are interpreted as zones with high common activity and tend to be constrained by geographical and political barriers. We check whether this is also observed in our dataset by running six state-of-the-art community detection algorithms: FastGreedy [5], Walktrap [16], Infomap [18], MultiLevel [3], Label Propagation [17] and Leading Eigenvector [15]. These six algorithms yield different community structures in terms of number of communities, average community size or modularity (see table 3). Members (municipalities) of the resulting communities are spatially connected, except in a few cases, as figure 7 shows. We test the statistical robustness of the obtained communities by randomly removing a proportion $p$ of the original links and running the algorithms on the new graph $G_p$. We consider the communities robust when those obtained for the original network $G$ and for $G_p$ are highly similar. In order to compare two arbitrary community memberships we use the Normalized Mutual Information (NMI) method described in [6], which returns 0 when two memberships are totally different and 1 when they are equal. We compute the NMI for each chosen algorithm run on $G$ and $G_p$ for $p$ between 1% and 10%, concluding that the obtained community structures are robust, because they are not broken when some randomly chosen links are removed (see table 2).

Figure 5. Probability distributions for the different properties of daily trips in the Twitter dataset: trip distance (km), number of trips and elapsed time (secs). Dashed lines correspond to power-law fits with exponents −1.67, −2.43 and −0.62, respectively.
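The NMI comparison used in the robustness test can be written compactly. The Python sketch below uses the common form NMI = 2 I(A;B) / (H(A) + H(B)), which is the normalization we understand [6] to use; the function name and the convention of returning 1 when both partitions consist of a single community are our assumptions. The community detection step itself and the random removal of links are not covered here.

```python
import math
from collections import Counter


def normalized_mutual_information(labels_a, labels_b):
    """NMI between two community memberships given as parallel label lists."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both memberships must cover the same nodes")
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    # Mutual information from the confusion counts of the two partitions.
    mutual = sum(
        n_ab / n * math.log(n_ab * n / (count_a[a] * count_b[b]))
        for (a, b), n_ab in joint.items()
    )
    entropy_a = -sum(c / n * math.log(c / n) for c in count_a.values())
    entropy_b = -sum(c / n * math.log(c / n) for c in count_b.values())
    if entropy_a + entropy_b == 0:
        return 1.0  # assumption: two single-community partitions count as equal
    return 2 * mutual / (entropy_a + entropy_b)


# Same partition with different label names, and two independent partitions.
same = normalized_mutual_information([0, 0, 1, 1, 2, 2], ["x", "x", "y", "y", "z", "z"])
independent = normalized_mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
print(same, independent)
```

Because NMI only depends on how nodes are grouped, relabelling the communities does not change the score, which is what makes it suitable for comparing the output of different algorithms or of the same algorithm on $G$ and $G_p$.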
Figure 6. Penetration rates for both cities and detected communities, as a function of population.
As other works have shown, mobility graph communities are usually interpreted in terms of geographical and political barriers, and a natural question is whether the mobility-based communities are related to any of these barriers. In Spain there are different territorial divisions for administrative purposes. In this work we consider two of them: provinces, defined in the 1978 Constitution, which are 48 different heterogeneous aggregations of municipalities; and counties (comarca in Spanish terminology), which are traditional aggregations of municipalities mainly based on Spanish orography (rivers, valleys, ridges, etc.), some of which are composed of municipalities from different provinces. We again use the NMI method to compare the community structure given by the algorithms to the administrative limits. Except for the Leading Eigenvector algorithm, the methods return communities that are closely related to provinces (NMI ≈ 0.7), whereas for the county limits a higher variability is observed. In this last case, the algorithm showing the closest relationship with county limits is Infomap (NMI ≈ 0.83). Therefore, Twitter-based mobility summarizes the inter-city flows, showing that these flows are influenced by geographical and political barriers.
S4 Twitter demographics and unemployment rates
Different age groups are not equally represented in Twitter. Recent surveys (2012) in Spain suggest that most (86%) Twitter users are 16 to 44 years old. Comparison of the percentage of users per age group with the total population within the same groups (see figure 8) reveals that age groups above 35 years old are under-represented in Twitter. Thus our Twitter data will be more revealing when trying to describe unemployment in age groups below 44 years old. This is indeed what we find when we build linear models for the unemployment rate in different age groups with the same Twitter variables: while unemployment rates for ages below 24 can be fitted by a linear model with $R^2 = 0.62$, regression models for ages between 25 and 44 have $R^2 = 0.52$, and for ages above 44 we get only $R^2 = 0.26$. Table 4 summarizes the results of the regression models of unemployment rates in each age group, showing that our Twitter variables have more explanatory power for ages below 44. Finally, in figure 8 we can see the performance of the model for the different age groups; once again, the poor explanatory power of the Twitter variables for the unemployment rate at ages above 44 is evident.
S5 Properties of Twitter variables
Normalization and distributions
Heterogeneity between the values of the variables constructed from Twitter is noticeable but moderate, as the histograms in figure 9 show. We did not find any geographical area with anomalous values in any of the variables considered. Variables are normalized in different ways: both the penetration $\tau_i$ and the misspellers rate $\varepsilon_i$ are defined as the number of users or misspellers per 100000 persons (population); activity variables $\nu_i$ are normalized as the percentage of tweets per time interval; finally, the numbers of tweets that mention a specific term, $\mu_i$, are also given per 100000 tweets published in the geographical area.

Figure 7. From left to right and from top to bottom: FastGreedy, Walktrap, Infomap, MultiLevel, Label Propagation and Leading Eigenvector communities on Twitter-based mobility transitions.
Correlation between variables
Variables are constructed to reflect the behavior of areas in the different dimensions of Twitter: penetration, social or geographical diversity, activity through the day, and content. The correlation between variables does indeed show that variables within each dimension are strongly correlated with each other. As we can see in figure 10, social and geographical diversities are highly correlated, an expected fact given that the gravity law accurately describes not only the flows of people between geographical areas but also the amount of communication between them. The same behavior is found for the variables in the activity group, while content variables are less correlated. Finally, we find that both the penetration rate $\tau_i$ and the fraction of misspellers $\varepsilon_i$ have a strong correlation with most of the variables.
High correlation between variables might lead to collinearity effects [24] in the linear regression models, that is, some variables with predictive value might have non-significant weights because they explain the same part of the variance. For instance, in Table 5 the misspellers rate has a very strong predictive value but its p-value is too high to consider it significant. To test this hypothesis we perform a principal component analysis (PCA) on the independent variables of the regression. Figure 10 shows the loadings of the different variables on the first two principal components. The block structure shown in figure 10 results in similar directions of the variables in the first components of the PCA. We observe some groups of variables: on the one hand, geographical and social diversity seem to explain a large part of the variance; on the other hand, we find a perpendicular group of variables formed by temporal activity; finally, penetration rate and misspellers fraction seem to represent a different, independent direction of the data, with high collinearity between them. This might explain the low statistical significance in the models of section 4. In any case, the structure of the correlation matrix and the PCA results show that there is indeed information in all groups of variables, and thus we have taken one variable from each of them for our regression models.
S6 Misspellers detection
In this work we consider only tweets in Spanish. Since several languages coexist in Spain, depending on the part of the country, the first step is to reduce our Twitter dataset to those tweets that are written in Spanish. This task is carried out using the n-gram based text categorization R library textcat [11]. Then, in order to decide whether a tweet contains a misspelling or not, we need to establish some patterns to search for in our set of tweets. Since we want to be sure that a detected mistake corresponds to a real misspeller, we do not consider the following cases:

• Lack of written accents. People tend to avoid writing accents when writing in a colloquial way.
• Mistakes derived from removing unnecessary letters. The most common cases are removing an h at the beginning of a word (in Spanish the letter h is not pronounced) or replacing the letters qu by k. We understand that these mistakes can be motivated by the length limit of tweets and not by a real misspelling.
• In the same line, we neglect mistakes produced by removing letters in the middle of a word whose pronunciation can be deduced without them.
• We do not consider either mistakes related to features of specific areas in Spain. For example, in the south the pronunciation of ce and se is the same, which produces a large number of mistakes when writing. However, since we want to extract objective and equitable conclusions over the whole Spanish geography, we neglect those misspellings that only appear in a specific area.

Likewise, we consider as real misspellings the following mistakes:

• Adding letters. For example, writing an h at the beginning of a word that starts with a vowel.
• Replacing the special cases mp, mb by the wrong spellings np, nb.
• Mixing up b with v, g with j, ll with y, and ex with es. These are typical mistakes in Spanish because the pairs have the same or a very close pronunciation.
• Confusing the verb haber with the periphrasis a ver.
• Separating a word into two, for instance writing the word conmigo as con migo.

This way, our list of misspellings is composed of 617 common mistakes in Spanish that cannot be attributed to the special features of Twitter or to a specific region of Spain. Thus, one can expect that this selection provides an accurate and equitable method of detecting misspellers. Under these conditions, the number of users who wrote at least one misspelled word is 27055 (56 over the whole population).

Figure 9. Frequency plots for each variable constructed from Twitter: tweets mentioning unemployment, employment and job, penetration rate, social and geographical entropies, misspellers rate, and night, morning and afternoon activity.

Figure 8. Top: percentage of population in each age group from the Spanish Census (dark bars) and from surveys about users in Twitter (light bars). Bottom: performance of the linear models (predicted versus observed unemployment, in percentage) for each of the age groups: all ages ($R^2 = 0.47$), < 24 ($R^2 = 0.62$), 25−44 ($R^2 = 0.52$) and > 44 ($R^2 = 0.26$).
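The dictionary-based detection described in this section can be sketched as follows. The three entries in `MISSPELLINGS` are hypothetical stand-ins chosen to match the accepted categories (b/v, ll/y and mb/nb confusions); the 617-entry list used in the paper is not reproduced here, and the function names are ours. The sketch only handles single-word mistakes, so multi-word cases such as con migo or a ver versus haber, as well as the language filtering step, would need extra logic.

```python
import re

# Hypothetical three-entry list, for illustration only:
# "haver" (haber), "llendo" (yendo), "tanbien" (también).
MISSPELLINGS = {"haver", "llendo", "tanbien"}
WORD = re.compile(r"[a-záéíóúüñ]+")


def has_misspelling(text):
    """True if any word of the tweet is in the misspelling list."""
    return any(word in MISSPELLINGS for word in WORD.findall(text.lower()))


def misspellers(tweets):
    """Users with at least one flagged tweet; tweets is an iterable of (user, text)."""
    return {user for user, text in tweets if has_misspelling(text)}


def rate_per_100k(count, population):
    """Normalize a count per 100000 persons, as done for the misspellers rate."""
    return 100000.0 * count / population


sample_tweets = [
    ("u1", "Voy llendo a casa"),
    ("u1", "hola"),
    ("u2", "Tambien voy"),         # missing accent only: not counted
    ("u3", "Va a haver fiesta"),
]
flagged = misspellers(sample_tweets)
print(sorted(flagged), rate_per_100k(len(flagged), 50000))
```

Note that the tweet with a missing accent is deliberately left unflagged, in line with the exclusion rules listed above.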
Figure 10. Top: correlation matrix between the variables constructed from Twitter. Each entry in the matrix is depicted as a circle whose size is proportional to the correlation between the variables and whose color is blue/red for positive/negative correlations. Blank entries correspond to statistically insignificant correlations at 95% confidence. Bottom: projection of the variables on the first two principal components given by PCA. We observe different groups of variables and collinearity between some of them.

We analyze whether misspellers use Twitter differently from people who do not make serious mistakes when publishing a tweet. Comparing the average number of tweets, it can be observed that misspellers tend to publish a larger number of tweets than those who did not make mistakes (14471 against 2372). This also emerges when the mean number of misspellings given the total number of tweets is considered. For users with fewer than approximately 30 published tweets in the observation period the number of misspellings is almost zero, whereas for users who publish more often the mean number of misspellings scales sub-linearly with the number of tweets (exponent ≈ 0.33).
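A scaling exponent such as the one quoted above can be estimated as the slope of a straight line in log-log coordinates. The Python sketch below does this with ordinary least squares on synthetic data generated from an exact power law with exponent 0.33; the source does not state how the exponent was fitted, so both the method and the numbers are assumptions made for illustration.

```python
import math


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x), i.e. b in y ~ a * x**b."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    mean_x = sum(log_x) / len(log_x)
    mean_y = sum(log_y) / len(log_y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    variance = sum((a - mean_x) ** 2 for a in log_x)
    return covariance / variance


# Synthetic data following an exact power law (not the paper's data).
tweet_counts = [30, 60, 120, 240, 480]
mean_misspellings = [2.0 * n ** 0.33 for n in tweet_counts]

exponent = loglog_slope(tweet_counts, mean_misspellings)
print(round(exponent, 3))
```

An exponent below 1 means that doubling the number of tweets less than doubles the mean number of misspellings, which is what sub-linear scaling refers to.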
Since we have observed a segmentation of the Twitter population based on how accurately users write, we consider the misspellers rate as a proxy for the educational level of the cities. A large number of previous works in the literature have revealed the relationship between the economic status and the educational level of geographical areas, and therefore it is natural to ask whether the observed misspellers rate is related to the economy, as reflected by the unemployment rate. To test this hypothesis we consider cities with more than 5000 inhabitants, to avoid undersampled cases. We find a strong positive correlation between the probability of finding a misspeller in a city and the unemployment rate (0.372, 0.491).
Figure 11. Number (red) and probability (blue) of observed misspellings given the number of tweets.

Figure 12. Explanatory power ($R^2$) of the linear regression model when fitted against the unemployment data for different months, from October 2012 to June 2014. The gray (orange) area corresponds to the time window in which Twitter data is collected and variables are constructed.
S7 Time window and unemployment
In the definition of the variables we have aggregated the Twitter activity within a 7-month time window (from December 2012 to June 2013). Since unemployment varies significantly over time, we investigate here the correlation and explanatory power of the Twitter variables for the values of unemployment in the different months of the time window in which Twitter data was collected, and whether the variables collected in that time window are more correlated with past or future values of unemployment. Figure 12 shows the explanatory power of the model when the linear regression is done for values of unemployment of different months before, during and after the Twitter data time window. Although there is a small seasonal effect along the year, we see that the explanatory power remains around $R^2 = 0.6$, which suggests that our Twitter linear model retains its explanatory power even though unemployment changes considerably throughout the year. It is interesting to note that $R^2$ decays slightly during the summer, which means that our variables are less correlated with summer unemployment. Finally, the unemployment used in the main article is from June 2013, i.e. the last month in the time window used to collect the data.
S8 Demographics does not explain unem-ployment
Since unemployment rates are very large for the group of youngpeople a natural question is whether only demographic variablescould explain the heterogeneity of young unemployment rates foundin the geographical areas To test this end we have built four linearmodels the first one (named Youth model in Table 5) is composedby the rate of young population as the only explaining variable thesecond ones are built based on only the Twitter variables consideredin the main text (named Twitter model (I)) or just with those whoseregression coefficients are statistically significant (Twitter model(II)) the third one is fitted with all the variables (named All variablesmodel in Table 5) In table 5 we show the summary of the regressionfor each model Focusing on the explained variance by the model interms of R2 it can be checked that considering all Twitter variables isthree times more explanatory than considering only the young peopleproportion On the other hand the comparison of R2 for the Twittermodel with the one for All variables and Youth model shows that therate of young population does not provide a significant explanatorypower This semi-partial analysis shows that our Twitter variablesretain a high explanatory power when the effect of young populationrate is controlled
S9 Unemployment models for other geographical areas
While municipalities are very heterogeneous demographically, other administrative areas exist in Spain at larger scales that could be used for our model of unemployment. As mentioned in section 4, the smallest administrative division of Spain we have considered is that of the 8200 municipalities. At larger scales we have the 326 counties (comarcas in Spanish), which are aggregations of municipalities. Finally, the largest geographical scale we considered is defined by the 50 provinces (provincias in Spanish). In this section we compare the performance of our Twitter model for unemployment with the variables defined in those administrative areas and relate it to that of the geographical communities detected and used in the main paper (see section 4). Not all the areas in each administrative division are considered in the model. To minimize the effect of areas in which the number of geo-tagged tweets is very small, we only consider the 1738 municipalities which have a Twitter population π > 10. Similarly, we only consider the 198 counties with π > 100. As we can see in Table 6, the model has a large explanatory power for areas equal to or bigger than counties. As expected, R² increases as the number of areas in the model gets smaller, but the level of geographical detail of the model is then very low, as is the case for provinces. The best performance (high R² and high level of geographical detail) is attained at the level of the detected communities.
S10 Relative importance of the variables
To assess the relative importance of the variables in the unemployment model we have used several methods. They all give qualitatively the same results, with some variations for the statistically insignificant variables. Specifically, we have used:
1. (weight) Relative weight of the absolute values of the coefficients obtained in the linear regression when variables are scaled to have mean zero and variance one.
2. (lmg) The R² contribution of each variable averaged over orderings of the regressors, as proposed by Lindeman, Merenda and Gold.
3. (pmvd) The PMVD metric introduced by Feldman, which is an average over orderings as well, but with data-dependent weights.
4. (first) The univariate R² values from regression models with one variable only.

Figure 13. Relative importance of the variables (in percentage) in the unemployment model for the different ways to calculate it (weight, first, lmg, pmvd).
All these metrics are obtained using the relaimpo R package [12]. The results for the youth unemployment model are shown in figure 13, where we can see that the different methods yield similar relative importance of the variables, except perhaps for the diversity of mobility flows, a variable with a non-significant weight in the regression model.
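To make the `first` and `lmg` metrics concrete, the following minimal sketch computes them in plain Python on a toy dataset. It is not the relaimpo implementation: the helper names (`solve`, `r_squared`, `first_metric`, `lmg_metric`) and all the numbers are assumptions made for this illustration, and the exhaustive loop over orderings is only practical for a handful of variables.

```python
from itertools import permutations


def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * solution[j] for j in range(i + 1, n))
        solution[i] = (aug[i][n] - tail) / aug[i][i]
    return solution


def r_squared(columns, y):
    """R^2 of an ordinary least squares fit of y on the columns plus an intercept."""
    n = len(y)
    design = [[1.0] * n] + [list(map(float, c)) for c in columns]
    k = len(design)
    xtx = [[sum(a * b for a, b in zip(design[i], design[j])) for j in range(k)]
           for i in range(k)]
    xty = [sum(a * b for a, b in zip(design[i], y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(beta[i] * design[i][t] for i in range(k)) for t in range(n)]
    mean_y = sum(y) / n
    ss_res = sum((obs - fit) ** 2 for obs, fit in zip(y, fitted))
    ss_tot = sum((obs - mean_y) ** 2 for obs in y)
    return 1.0 - ss_res / ss_tot


def first_metric(predictors, y):
    """'first': R^2 of each one-variable regression."""
    return {name: r_squared([col], y) for name, col in predictors.items()}


def lmg_metric(predictors, y):
    """'lmg': R^2 increase when a variable enters, averaged over all orderings."""
    names = list(predictors)
    totals = dict.fromkeys(names, 0.0)
    orderings = list(permutations(names))
    for order in orderings:
        included, previous = [], 0.0
        for name in order:
            included.append(predictors[name])
            current = r_squared(included, y)
            totals[name] += current - previous
            previous = current
    return {name: totals[name] / len(orderings) for name in names}


# Hypothetical toy data: three regressors and a response for eight areas.
predictors = {
    "penetration": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "morning": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0],
    "misspellers": [5.0, 3.0, 8.0, 1.0, 7.0, 2.0, 6.0, 4.0],
}
unemployment = [1.1, 2.3, 2.9, 4.2, 4.8, 6.1, 7.2, 7.7]

first = first_metric(predictors, unemployment)
lmg = lmg_metric(predictors, unemployment)
for name in predictors:
    print(f"{name}: first={first[name]:.3f} lmg={lmg[name]:.3f}")
```

Unlike the `first` values, the `lmg` shares add up to the R² of the full model, which is why they can be reported as percentages of the explained variance.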
References
[1] B. Ashtakala. Generalized power model for trip distribution. Transportation Research Part B: Methodological, 21(1):59–67, 1987.
[2] Michel Bierlaire. Mathematical models for transportation demand analysis. Transportation Research Part A: Policy and Practice, 31(1):86–86, 1997.
[3] Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.
[4] Harry J. Casey Jr. The law of retail gravitation applied to traffic engineering. Traffic Quarterly, 9(3), 1955.
[5] Aaron Clauset, Mark E. J. Newman, and Cristopher Moore. Finding community structure in very large networks. Physical Review E, 70(6):066111, 2004.
[6] Leon Danon, Albert Diaz-Guilera, Jordi Duch, and Alex Arenas. Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment, 2005(09):P09008, 2005.
[7] Servicio Público de Empleo Estatal (SEPE). Spanish registered unemployment. http://www.sepe.es/contenidos/que_es_el_sepe/estadisticas/index.html
[8] Instituto Nacional de Estadística. Spanish 2011 census. http://www.ine.es/censos2011_datos/cen11_datos_inicio.htm
[9] Suzanne P. Evans. A relationship between the gravity model for trip distribution and the transportation problem in linear programming. Transportation Research, 7(1):39–61, 1973.
[10] Paul Expert, Tim S. Evans, Vincent D. Blondel, and Renaud Lambiotte. Uncovering space-independent communities in spatial networks. Proceedings of the National Academy of Sciences, 108(19):7663–7668, 2011.
[11] Ingo Feinerer, Christian Buchta, Wilhelm Geiger, Johannes Rauch, Patrick Mair, and Kurt Hornik. The textcat package for n-gram based text categorization in R. Journal of Statistical Software, 52(6):1–17, 2013.
[12] Ulrike Grömping. Relative importance for linear regression in R: the package relaimpo. Journal of Statistical Software, 17(1):1–27, 2006.
[13] Bartosz Hawelka, Izabela Sitko, Euro Beinat, Stanislav Sobolevsky, Pavlos Kazakopoulos, and Carlo Ratti. Geo-located Twitter as the proxy for global mobility patterns. arXiv preprint arXiv:1311.0680, 2013.
[14] Yu Liu, Zhengwei Sui, Chaogui Kang, and Yong Gao. Uncovering patterns of inter-urban trips and spatial interactions from check-in data. arXiv preprint arXiv:1310.0282, 2013.
[15] Mark E. J. Newman. Finding community structure in networks using the eigenvectors of matrices. Physical Review E, 74(3):036104, 2006.
[16] Pascal Pons and Matthieu Latapy. Computing communities in large networks using random walks. In Computer and Information Sciences - ISCIS 2005, pages 284–293. Springer, 2005.
[17] Usha Nandini Raghavan, Réka Albert, and Soundar Kumara. Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 76(3):036106, 2007.
[18] Martin Rosvall and Carl T. Bergstrom. Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences, 105(4):1118–1123, 2008.
[19] Morton Schneider. Gravity models and trip distribution theory. Papers in Regional Science, 5(1):51–56, 1959.
[20] Filippo Simini, Marta C. González, Amos Maritan, and Albert-László Barabási. A universal model for mobility and migration patterns. Nature, 484(7392):96–100, 2012.
[21] Chaoming Song, Tal Koren, Pu Wang, and Albert-László Barabási. Modelling the scaling properties of human mobility. Nature Physics, 6(10):818–823, 2010.
[22] Alan Geoffrey Wilson. Entropy in urban and regional modelling. Pion Ltd, 1970.
[23] Alan Geoffrey Wilson. Urban and regional models in geography and planning. 1974.
[24] Svante Wold, Arnold Ruhe, Herman Wold, and W. J. Dunn III. The collinearity problem in linear regression: the partial least squares (PLS) approach to generalized inverses. SIAM Journal on Scientific and Statistical Computing, 5(3):735–743, 1984.
Gravity Model
Parameter   Description                            Spain
α1          Origin exponent                        0.477*** (0.002)
α2          Destination exponent                   0.478*** (0.002)
β           Distance exponent                      1.05*** (0.0035)
R²          Goodness of fit                        0.797
φ           Correlation between Tij and Tij^gra    0.826

Table 1. Description of the parameters for the Gravity Law Model in geo-tagged social media data for Spain. (***) means significance p < 0.0001.
NMI between G and Gp for different p
Algorithm   p = 0.01   0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09    0.1
FG          0.995      0.992   0.989   0.983   0.981   0.977   0.983   0.969   0.980   0.959
WT          0.954      0.959   0.950   0.954   0.945   0.948   0.947   0.935   0.926   0.931
IM          0.988      0.981   0.980   0.981   0.978   0.974   0.975   0.970   0.969   0.966
ML          0.994      0.978   0.979   0.983   0.948   0.934   0.972   0.952   0.973   0.947
LP          0.906      0.908   0.911   0.915   0.895   0.907   0.907   0.893   0.905   0.904
LE          0.960      0.957   0.956   0.859   0.910   0.892   0.908   0.858   0.885   0.884

Table 2. NMI measure comparing G and Gp.
Communities Stats
Algorithm   ⟨|Ni|⟩ (mean size)   max|Ni| (largest size)   Number of communities   Modularity   NMI P   NMI C
FG          309.696              1385                     23                      0.726        0.712   0.590
WT          9.262                433                      769                     0.417        0.744   0.757
IM          21.011               143                      339                     0.758        0.770   0.831
ML          323.772              1132                     22                      0.800        0.717   0.599
LP          22.052               750                      323                     0.732        0.749   0.761
LE          1017.571             5344                     7                       0.381        0.264   0.205

Table 3. Statistics of the communities Ni returned by the six algorithms. NMI P refers to the comparison between communities and provinces, whereas NMI C considers counties instead of provinces.
                         All ages    < 24        25−44       > 44
(Intercept)              0.11***     0.10***     0.20***     0.20***
                         (0.02)      (0.03)      (0.03)      (0.035)
Penetration rate         3.23*       8.57***     6.28**      2.40
                         (1.41)      (2.22)      (2.17)      (2.77)
Geographical diversity   0.03        0.15***     0.08*       0.06
                         (0.02)      (0.04)      (0.04)      (0.05)
Social diversity         −0.03*      −0.03       −0.05*      −0.06*
                         (0.01)      (0.02)      (0.02)      (0.03)
Morning activity         −0.69*      −1.30**     −1.53***    −1.19*
                         (0.26)      (0.42)      (0.41)      (0.52)
Misspellers rate         11.56       31.51*      15.46       23.60
                         (8.13)      (12.78)     (12.48)     (15.94)
Employment mentions      −1.80       3.17        −9.94       2.71
                         (6.27)      (9.86)      (9.64)      (12.3)
R²                       0.47        0.64        0.55        0.29
Adj. R²                  0.44        0.62        0.52        0.26
***p < 0.001, **p < 0.01, *p < 0.05

Table 4. Regression table for the different models in which unemployment for different age groups is fitted. The All ages model is the fit to the general rate of unemployment in each geographical area, while the other models are for the rates of unemployment in the groups of less than 24 years, between 25 and 44 years, and above 44 years.
                         All variables   Youth model   Twitter model (I)   Twitter model (II)
(Intercept)              0.06            −0.02         0.10***             0.09***
                         (0.03)          (0.03)        (0.03)              (0.027)
Young pop. rate          0.66*           2.20***
                         (0.30)          (0.35)
Penetration rate         8.20***                       8.57***             8.62***
                         (2.25)                        (2.22)              (2.21)
Geographical diversity   0.14***                       0.15***             0.12***
                         (0.04)                        (0.04)              (0.03)
Social diversity         −0.02                         −0.03
                         (0.02)                        (0.02)
Morning activity         −1.42***                      −1.30**             −1.28**
                         (0.41)                        (0.42)              (0.41)
Misspellers rate         23.95                         31.51*              32.28*
                         (13.09)                       (12.78)             (12.71)
Employment mentions      0.34                          3.17
                         (9.81)                        (9.86)
R²                       0.65            0.24          0.64                0.63
Adj. R²                  0.63            0.24          0.62                0.62
***p < 0.001, **p < 0.01, *p < 0.05

Table 5. Regression table for the different statistical models. The All variables model includes both the Twitter variables and the rate of young population. Twitter model (I) includes only the variables described in the main article, while Twitter model (II) only includes those variables which are significant (p < 0.05) in Twitter model (I).
                         Communities   Municipalities   Counties    Provinces
(Intercept)              0.10***       0.16***          0.11***     0.11*
                         (0.03)        (0.01)           (0.03)      (0.05)
Penetration rate         8.57***       4.01***          9.12***     10.47***
                         (2.22)        (0.59)           (1.81)      (1.97)
Geographical diversity   0.15***       0.02             0.12***     0.08
                         (0.04)        (0.01)           (0.03)      (0.07)
Social diversity         −0.03         −0.01            −0.01       −0.03
                         (0.02)        (0.01)           (0.02)      (0.07)
Morning activity         −1.30**       −1.16***         −1.49***    −1.03
                         (0.42)        (0.14)           (0.39)      (0.88)
Misspellers rate         31.51*        14.40***         14.09
                         (12.78)       (2.51)           (10.02)
Employment mentions      3.17          −0.71            2.41        −3.17
                         (9.86)        (0.89)           (8.86)      (12.29)
Number of points         128           1738             198         50
R²                       0.64          0.22             0.55        0.65
Adj. R²                  0.62          0.21             0.54        0.61
***p < 0.001, **p < 0.01, *p < 0.05

Table 6. Regression table for the unemployment linear regression model at different levels of geographical areas. In the Provinces model the misspellers rate has been removed from the model due to its large collinearity with the penetration rate.
Figure 5. Probability distributions (density, on log-log axes) for the different properties of daily trips in the Twitter dataset: trip distance (km), number of trips and elapsed time (secs). Dashed lines correspond to power-law fits with exponents −1.67, −2.43 and −0.62, respectively.
each chosen algorithm performed on G and Gp for p between 1% and 10%, concluding that the obtained community structures are robust because they are not broken when some randomly chosen links are removed (see Table 2).
Figure 6. Penetration rates as a function of population for both cities and detected communities.
As other works have shown, mobility graph communities are usually interpreted in terms of geographical and political barriers, and a natural question is whether the mobility-based communities are related to any of these barriers. In Spain there are different territorial divisions for administrative purposes. In this work we consider two of them: provinces, defined in the 1978 Constitution, which are 48 different heterogeneous aggregations of municipalities; and counties (comarcas in Spanish terminology), which are traditional aggregations of municipalities mainly based on Spanish orography (rivers, valleys, ridges, etc.), some of which are composed of municipalities of different provinces. We again use the NMI method to compare the community structure given by the algorithms with the administrative limits. Except for the Leading Eigenvector algorithm, the methods return communities that are closely related to provinces (NMI ≈ 0.7), whereas for the county limits a higher variability is observed. In this last case the algorithm most closely related to the county limits is Infomap (NMI ≈ 0.83). Therefore, Twitter-based mobility summarizes the inter-city flows, showing that these flows are influenced by geographical and political barriers.
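The NMI comparison between a detected partition and an administrative one can be sketched as follows. This is a minimal illustration that assumes the common normalization NMI = 2 I(A;B) / (H(A) + H(B)); the text above does not spell out the exact formula used, and the function name `nmi` and the toy labels are hypothetical.

```python
from collections import Counter
from math import log


def nmi(labels_a, labels_b):
    """Normalized mutual information between two partitions of the same nodes.

    labels_a[i] and labels_b[i] are the groups (e.g. detected community and
    province) assigned to node i. Returns a value between 0 and 1.
    """
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    # Mutual information from the joint and marginal group frequencies.
    mutual = sum(
        (c / n) * log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    entropy_a = -sum((c / n) * log(c / n) for c in count_a.values())
    entropy_b = -sum((c / n) * log(c / n) for c in count_b.values())
    if entropy_a + entropy_b == 0:
        return 1.0  # both partitions put every node in a single group
    return 2 * mutual / (entropy_a + entropy_b)


# Hypothetical example: six municipalities, community labels versus provinces.
communities = [0, 0, 0, 1, 1, 1]
provinces = ["A", "A", "A", "B", "B", "A"]
print(round(nmi(communities, provinces), 3))
```

Because the measure only depends on how nodes are grouped, relabeling the communities leaves the value unchanged, which is what allows partitions produced by different algorithms to be compared with provinces or counties.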
S4 Twitter demographics and unemployment rates
Different age groups are not equally represented in Twitter. Recent surveys (2012) in Spain suggest that most (86%) Twitter users are 16 to 44 years old. Comparison of the percentage of users per age group with the total population within the same groups (see figure 8) reveals that age groups above 35 years old are under-represented in Twitter. Thus our Twitter data will be more revealing when trying to describe unemployment in age groups below 44 years old. This is indeed what we find when we build a linear model for the unemployment rate in different age groups with the same Twitter variables: while unemployment rates for ages below 24 can be fitted to a linear model with R² = 0.62, regression models for unemployment rates for ages between 25 and 44 have R² = 0.52, and for ages above 44 we get only R² = 0.26. Table 4 summarizes the results for the regression models of unemployment rates in each age group, showing that our Twitter variables have more explanatory power for ages below 44. Finally, in figure 8 we can see the performance of the model for the different age groups; once again, the poor explanatory power of the Twitter variables for the unemployment rate in ages above 44 years old is evident.
S5 Properties of Twitter variables
Normalization and distributions
The heterogeneity of the values of the variables constructed from Twitter is noticeable but moderate, as the histograms in figure 9 show. We did not find any geographical area with anomalous values in any of the variables considered. Variables are normalized in different ways: both the penetration rate τi and the misspellers rate εi are defined as the number of users or misspellers per 100000 persons (population); activity variables νi are normalized as the percentage of tweets per time interval; finally, the numbers of tweets that mention a specific term, µi, are also given per 100000 tweets published in the geographical area.

Figure 7. From left to right and from top to bottom: Fastgreedy, Walktrap, Infomap, Multilevel, Label Propagation and Leading Eigenvector communities on Twitter-based mobility transitions.
Correlation between variables
Variables are constructed to reflect the behavior of areas in the different dimensions of Twitter: penetration, social or geographical diversity, activity through the day, and content. The correlation between variables does indeed show that variables within each dimension are strongly correlated with each other. As we can see in figure 10, social and geographical diversities are highly correlated, an expected fact given that the gravity law accurately describes not only the flows of people between geographical areas but also the amount of communication between them. The same behavior is found for the variables in the activity group, while content variables are less correlated. Finally, we find that both the penetration rate τi and the fraction of misspellers εi have a strong correlation with most of the variables.
High correlation between variables might lead to collinearity effects [24] in the linear regression models; that is, some variables with predictive value might have non-significant weights because they explain the same part of the variance. For instance, in Table 5 the misspellers rate has a very strong predictive value, but its p-value is too high to consider it significant. To test this hypothesis we perform a principal component analysis (PCA) on the independent variables of the regression. Figure 10 shows the loadings of the different variables on the first two principal components. The block structure shown in figure 10 results in similar directions of the variables in the first components of the PCA. We observe some groups of variables: on the one hand, geographical and social diversity seem to explain a large part of the variance; on the other hand, we find a perpendicular group of variables formed by temporal activity; finally, penetration rate and misspellers fraction seem to represent a different, independent direction of the data, with high collinearity between them. This might explain the low statistical significance in the models of section 4. In any case, the structure of the correlation matrix and the PCA results show that there is indeed information in all groups of variables, and thus we have taken a variable from each of them for our regression models.
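How collinear variables end up pointing in the same direction of the first principal component can be illustrated with a small sketch. It assumes PCA on standardized variables (that is, on the correlation matrix) and extracts only the leading component by power iteration; the variable values are invented, and the function names are hypothetical rather than taken from the analysis above.

```python
from math import sqrt


def standardize(column):
    """Scale a variable to mean zero and unit sample variance."""
    n = len(column)
    mean = sum(column) / n
    sd = sqrt(sum((v - mean) ** 2 for v in column) / (n - 1))
    return [(v - mean) / sd for v in column]


def correlation_matrix(columns):
    """Pearson correlation matrix of the given variables."""
    z = [standardize(c) for c in columns]
    n = len(columns[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / (n - 1) for zj in z] for zi in z]


def first_component(matrix, iterations=500):
    """Loadings and eigenvalue of the leading component, by power iteration."""
    k = len(matrix)
    vector = [1.0 / sqrt(k)] * k
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vector[j] for j in range(k)) for i in range(k)]
        norm = sqrt(sum(p * p for p in product))
        vector = [p / norm for p in product]
    image = [sum(matrix[i][j] * vector[j] for j in range(k)) for i in range(k)]
    eigenvalue = sum(v * w for v, w in zip(vector, image))
    return vector, eigenvalue


# Hypothetical values for six areas: two nearly collinear diversity variables
# and one activity variable that is only weakly related to them.
geo_diversity = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
social_diversity = [1.1, 1.9, 3.2, 3.8, 5.1, 6.0]
morning_activity = [3.0, 1.0, 2.0, 3.0, 1.0, 2.0]

corr = correlation_matrix([geo_diversity, social_diversity, morning_activity])
loadings, eigenvalue = first_component(corr)
explained = eigenvalue / len(corr)  # share of total variance in the first component
print([round(v, 2) for v in loadings], round(explained, 2))
```

In this toy case the two diversity variables receive almost identical loadings, which is the signature of collinearity that a regression cannot separate into individual weights.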
S6 Misspellers detection
In this work we consider only tweets in Spanish. Since several languages coexist in Spain, depending on the part of the country, the first step is to reduce our Twitter dataset to those tweets that are written in Spanish. This task is carried out using the n-gram based text categorization R library textcat [11]. Then, in order to decide whether a tweet has a misspelling or not, we need to establish some patterns to search for in our set of tweets. Since we want to be sure that a detected mistake corresponds to a real misspeller, we will not consider the following cases:

• Lack of written accents. People tend to avoid writing accents when talking in a colloquial way.
• Mistakes derived from removing unnecessary letters. The most common cases are removing an h at the beginning of a word (in Spanish the letter h is not pronounced) or replacing the letters qu by k. We understand that these mistakes can be motivated by the length limitation of tweets and not by a real misspelling.
• In the same line, we neglect mistakes produced by removing letters in the middle of a word whose pronunciation can be deduced without them.
• We do not consider either mistakes related to features of specific areas in Spain. For example, in the south the pronunciation of ce and se is the same, which produces a large number of mistakes when writing. However, since we want to extract objective and equitable conclusions over the whole Spanish geography, we neglect those misspellings that only appear in a specific area.

Likewise, we will consider as real misspellings the following mistakes:

• Adding letters. For example, writing an h at the beginning of a word that starts with a vowel.
• Changing the special cases mp, mb into the wrong spellings np, nb.
• Mixing up b with v, g with j, ll with y, and ex with es. These are typical mistakes in Spanish because the pairs have the same or a very close pronunciation.
• Confusing the verb haber with the periphrasis a ver.
• Separating a word into two, for instance writing the word conmigo as con migo.

Figure 9. Frequency plots for each variable constructed from Twitter: tweets mentioning unemployment, employment and job, penetration rate, the two entropy measures of social and of geographical diversity, misspellers rate, and activity at night, in the morning and in the afternoon.

Figure 8. Top: Percentage of population in each age group from the Spanish Census (dark bars) and surveys about users in Twitter (light bars). Bottom: performance of the linear models (predicted versus observed unemployment, in %) for each of the age groups: all ages (R² = 0.47), < 24 (R² = 0.62), 25−44 (R² = 0.52) and > 44 (R² = 0.26).
In this way, our list of misspellings is composed of 617 common mistakes in Spanish that cannot be attributed to the special features of Twitter or to a specific region of Spain. Thus one can expect that this selection provides an accurate and equitable method of detecting misspellers. Under these conditions, the number of users who wrote at least one misspelled word is 27055 (5.6% of the whole population).
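The detection step can be pictured with the following minimal sketch, which flags a user as a misspeller when at least one of their tweets contains an entry of the misspellings list and then normalizes the count per 100000 persons. The three entries below are only a hypothetical stand-in for the 617-entry list, which is not reproduced here, and the function names and example tweets are invented for the illustration.

```python
import re

# Hypothetical stand-in for the list of common misspellings:
# a split word, "nb" instead of "mb", and an added initial "h".
MISSPELLINGS = ("con migo", "tanbien", "habrir")

# Match any listed form as a whole word or phrase, ignoring case.
PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(m) for m in MISSPELLINGS) + r")\b",
    re.IGNORECASE,
)


def is_misspeller(tweets):
    """True if at least one tweet of the user contains a listed misspelling."""
    return any(PATTERN.search(tweet) for tweet in tweets)


def misspellers_rate(tweets_by_user, population):
    """Number of misspellers per 100000 persons in a geographical area."""
    misspellers = sum(1 for tweets in tweets_by_user.values() if is_misspeller(tweets))
    return 100000 * misspellers / population


# Invented example: three users in an area of 1000 inhabitants.
area_tweets = {
    "user_a": ["Ven con migo al cine"],
    "user_b": ["Vamos al cine esta tarde"],
    "user_c": ["Yo tanbien quiero ir", "Nos vemos luego"],
}
print(misspellers_rate(area_tweets, population=1000))
```

Matching whole words keeps a listed form from being detected inside a longer, correctly written word; the excluded cases described above (missing accents, shortened words, regional spellings) are handled simply by not including them in the list.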
We analyze whether misspellers show a different Twitter usage behavior from people who do not make serious mistakes when publishing a tweet. Comparing the average number of tweets, it can be observed that misspellers tend to publish a larger number of tweets than those who did not make mistakes (144.71 against 23.72). This also emerges when the mean number of misspellings given the total number of tweets is considered. For users with fewer than approximately 30 published tweets in the observation period the number of misspellings is almost zero, whereas for users who publish more often the mean number of misspellings scales sub-linearly with the number of tweets (exponent ≈ 0.33).

Figure 10. Top: Correlation matrix between the variables constructed from Twitter. Each entry in the matrix is depicted as a circle whose size is proportional to the correlation between the variables and whose color is blue/red for positive/negative correlations. Blank entries correspond to statistically insignificant correlations at 95% confidence. Bottom: Projection of the variables on the first two principal components given by PCA. We observe different groups of variables and collinearity between some of them.
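A scaling exponent of this kind is commonly estimated as the slope of a straight-line fit in log-log space. The sketch below shows that estimate on synthetic points that were generated with exponent 0.33, so it recovers that value by construction; it illustrates the procedure only, does not reproduce the fit behind the reported exponent, and the function name is hypothetical.

```python
from math import log


def loglog_slope(x, y):
    """Least squares slope of log(y) against log(x), i.e. the exponent b in y ~ x**b."""
    lx = [log(v) for v in x]
    ly = [log(v) for v in y]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    variance = sum((a - mean_x) ** 2 for a in lx)
    return covariance / variance


# Synthetic data: mean number of misspellings for users with a given tweet count.
tweets = [50, 100, 200, 400, 800]
mean_misspellings = [0.5 * t ** 0.33 for t in tweets]

exponent = loglog_slope(tweets, mean_misspellings)
print(round(exponent, 2))
```

An exponent below one means that doubling the number of tweets less than doubles the mean number of misspellings, which is what sub-linear scaling refers to.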
Since we have observed a segmentation of the Twitter population based on how accurately users write, we consider the misspellers rate as a proxy for the educational level of the cities. A large number of previous works in the literature have revealed the relationship between the economic status and the educational level of geographical areas, and therefore it is natural to ask whether the observed misspellers rate is related to the economy, as measured by the unemployment rate. To test this hypothesis we consider cities with more than 5000 inhabitants to avoid subsampled cases. We find a strong positive correlation between the probability of finding a misspeller in a city and the unemployment rate (0.372, 0.491).
Figure 11. Number, N(miss|tweets) (red), and probability, P(miss|tweets) (blue), of observed misspellings given the number of tweets.

Figure 12. Explanatory power (R²) of the linear regression model when fitted against the unemployment data for different months, from October 2012 to June 2014. The gray (orange) area corresponds to the time window in which Twitter data is collected and variables are constructed.
[18] Martin Rosvall and Carl T Bergstrom Maps of random walkson complex networks reveal community structure Proceedingsof the National Academy of Sciences 105(4)1118ndash1123 2008
[19] Morton Schneider Gravity models and trip distribution theoryPapers in Regional Science 5(1)51ndash56 1959
[20] Filippo Simini Marta C Gonzalez Amos Maritan and Albert-Laszlo Barabasi A universal model for mobility and migrationpatterns Nature 484(7392)96ndash100 2012
[21] Chaoming Song Tal Koren Pu Wang and Albert-LaszloBarabasi Modelling the scaling properties of human mobilityNature Physics 6(10)818ndash823 2010
[22] Alan Geoffrey Wilson Entropy in urban and regional mod-elling Pion Ltd 1970
[23] Alan Geoffrey Wilson Urban and regional models in geogra-phy and planning 1974
[24] Svante Wold Arnold Ruhe Herman Wold and WJ Dunn IIIThe collinearity problem in linear regression the partial leastsquares (pls) approach to generalized inverses SIAM Journalon Scientific and Statistical Computing 5(3)735ndash743 1984
Social media fingerprints of unemployment mdash 1719
Gravity ModelParameter Description Spain
α1 Origin exponent 0477lowastlowastlowast(0002)α2 Destination exponent 0478lowastlowastlowast(0002)β Distance exponent 105lowastlowastlowast(00035)R2 Goodness of fit 0797φ Correlation between Ti j and T gra
i j 0826
Table 1 Description of the parameters for the Gravity Law Model in geo-tagged social media data for Spain (lowast lowast lowast) meanssignificance p lt 00001
NMI between G and Gp for different pAlgorithm p = 001 002 003 004 005 006 007 008 009 01
FG 0995 0992 0989 0983 0981 0977 0983 0969 0980 0959WT 0954 0959 0950 0954 0945 0948 0947 0935 0926 0931IM 0988 0981 0980 0981 0978 0974 0975 0970 0969 0966ML 0994 0978 0979 0983 0948 0934 0972 0952 0973 0947LP 0906 0908 0911 0915 0895 0907 0907 0893 0905 0904LE 0960 0957 0956 0859 0910 0892 0908 0858 0885 0884
Table 2 NMI measure comparing G and Gp
Communities StatsAlgorithm 〈|Ni|〉i max|Ni| |Ni| Modularity NMI P NMI C
FG 309696 1385 23 0726 0712 0590WT 9262 433 769 0417 0744 0757IM 21011 143 339 0758 0770 0831ML 323772 1132 22 0800 0717 0599LP 22052 750 323 0732 0749 0761LE 1017571 5344 7 0381 0264 0205
Table 3 Statistics of the communities Ni returned by the six algorithms NMI P refers to the comparison between communitiesand provinces whereas NMI C considers counties instead of provinces
Social media fingerprints of unemployment mdash 1819
All ages lt 24 25minus44 gt 44(Intercept) 011lowastlowastlowastlowast 010lowastlowastlowast 020lowastlowastlowast 020lowastlowastlowast
(002) (003) (003) (0035)Penetration rate 323lowast 857lowastlowastlowast 628lowastlowast 240
(141) (222) (217) (277)Geographical diversity 003 015lowastlowastlowast 008lowast 006
(002) (004) (004) (005)Social diversity minus003lowast minus003 minus005lowast minus006lowast
(001) (002) (002) (003)Morning activity minus069lowast minus130lowastlowast minus153lowastlowastlowast minus119lowast
(026) (042) (041) (052)Misspellers rate 1156 3151lowast 1546 2360
(813) (1278) (1248) (1594)Employment mentions minus180 317 minus994 271
(627) (986) (964) (123)R2 047 064 055 029Adj R2 044 062 052 026lowastlowastlowastp lt 0001 lowastlowastp lt 001 lowastp lt 005
Table 4 Regression table for the different models in which unemployment for different age groups is fitted The All ages model isthe fit to the general rate of unemployment in each geographical area while the other models are for the rates of unemployment ingroups of less than 24 years between 25 and 44 years and above 44 years
All variables Youth model Twitter model (I) Twitter model (II)(Intercept) 006 minus002 010lowastlowastlowast 009lowastlowastlowast
(003) (003) (003) (0027)Young pop rate 066lowast 220lowastlowastlowast
(030) (035)Penetration rate 820lowastlowastlowast 857lowastlowastlowast 862lowastlowastlowast
(225) (222) (221)Geographical diversity 014lowastlowastlowast 015lowastlowastlowast 012lowastlowastlowast
(004) (004) (003)Social diversity minus002 minus003
(002) (002)Morning activity minus142lowastlowastlowast minus130lowastlowast minus128lowastlowast
(041) (042) (041)Misspellers rate 2395 3151lowast 3228lowast
(1309) (1278) (1271)Employment mentions 034 317
(981) (986)R2 065 024 064 063Adj R2 063 024 062 062lowastlowastlowastp lt 0001 lowastlowastp lt 001 lowastp lt 005
Table 5 Regression table for the different statistical models The All variables model includes both Twitter and rate of youngpopulation variables Twitter model (I) includes only the variables described in the main article while Twitter model (II) only includesthose variables which are significant p lt 005 in Twitter model (I)
Social media fingerprints of unemployment mdash 1919
Communities Municipalities Counties Provinces(Intercept) 010lowastlowastlowast 016lowastlowastlowast 011lowastlowastlowast 011lowast
(003) (001) (003) (005)Penetration rate 857lowastlowastlowast 401lowastlowastlowast 912lowastlowastlowast 1047lowastlowastlowast
(222) (059) (181) (197)Geographical diversity 015lowastlowastlowast 002 012lowastlowastlowast 008
(004) (001) (003) (007)Social diversity minus003 minus001 minus001 minus003
(002) (001) (002) 007Morning activity minus130lowastlowast minus116lowastlowastlowast minus149lowastlowastlowast minus103
(042) (014) (039) (088)Misspellers rate 3151lowast 1440lowastlowastlowast 1409
(1278) (251) (1002)Employment mentions 317 minus071 241 minus317
(986) (089) (886) (1229)Number of points 128 1738 198 50R2 064 022 055 065Adj R2 062 021 054 061lowastlowastlowastp lt 0001 lowastlowastp lt 001 lowastp lt 005
Table 6 Regression table for the unemployment linear regression model in different levels of geographical areas In the Provincesmodel the misspellers rate has been removed from the model due to the large collinearity with the penetration rate
- 1 Social media dataset and functional partition of cities
- 2 Social media behavioral fingerprints
- 3 Explanatory power of social media in unemployment
- 4 Discussion
Figure 7. From left to right and from top to bottom: Fastgreedy, Walktrap, Infomap, Multilevel, Label Propagation and Leading Eigenvector communities on Twitter-based mobility transitions.
µ_i are also given per 100,000 tweets published in the geographical area.
Correlation between variables
Variables are constructed to reflect the behavior of areas in the different dimensions: Twitter penetration, social or geographical diversity, activity through the day, and content. The correlation between variables does indeed show that variables within each dimension are strongly correlated with each other. As we can see in Figure 10, social and geographical diversities are highly correlated, an expected fact given that the gravity law accurately describes not only the flows of people between geographical areas but also the amount of communication between them. The same behavior is found for the variables in the activity group, while content variables are less correlated. Finally, we find that both the penetration rate τ_i and the fraction of misspellers ε_i have a strong correlation with most of the variables.
High correlation between variables might lead to collinearity effects [24] in the linear regression models; that is, some variables with predictive value might have non-significant weights because they explain the same part of the variance. For instance, in Table 5 the misspellers rate has a very strong predictive value, but its p-value is too high to consider it significant. To test this hypothesis we perform a principal component analysis (PCA) on the independent variables of the regression. Figure 10 shows the loadings of the different variables on the first two principal components. The block structure shown in Figure 10 results in similar directions of the variables in the first components of the PCA. We observe some groups of variables: on the one hand, geographical and social diversity seem to explain a large part of the variance; on the other hand, we find a perpendicular group of variables formed by temporal activity; finally, penetration rate and misspellers fraction seem to represent a different, independent direction of the data, with high collinearity between them. This might explain the low statistical significance in the models of section 4. In any case, the structure of the correlation matrix and the PCA results show that there is indeed information in all groups of variables, and thus we have taken a variable from each of them for our regression models.
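The collinearity argument above can be made concrete with the standard variance formula for an ordinary least squares coefficient. This is a textbook result, not a derivation taken from the paper; the symbols σ² (residual variance), R_j² (the R2 of regressing predictor j on the remaining predictors) and the value r = 0.9 are illustrative assumptions.

```latex
% Variance of the j-th OLS coefficient; R_j^2 is the R^2 obtained when
% predictor x_j is regressed on all the other predictors.
\[
  \operatorname{Var}(\hat\beta_j)
  = \frac{\sigma^2}{\sum_{i}(x_{ij}-\bar x_j)^2}\cdot\frac{1}{1-R_j^2},
  \qquad
  \mathrm{VIF}_j = \frac{1}{1-R_j^2}.
\]
% Illustrative case: two predictors with correlation r = 0.9 (made-up value).
\[
  R_j^2 = r^2 = 0.81
  \quad\Longrightarrow\quad
  \mathrm{VIF}_j = \frac{1}{1-0.81} \approx 5.3 .
\]
```

In this made-up case the standard error of each coefficient grows by a factor of about √5.3 ≈ 2.3, which is how a variable with real predictive value can end up with a non-significant weight.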
S6 Misspellers detection
In this work we consider only tweets in Spanish. Since several languages coexist in Spain depending on the part of the country, the first step is to reduce our Twitter dataset to those tweets that are written in Spanish. This task is carried out using the n-gram based text categorization R library textcat [11]. Then, in order to decide whether a tweet contains a misspelling or not, we need to establish which patterns to look for in our set of tweets. Since we want to be sure that a detected mistake corresponds to a real misspeller, we do not consider the following cases:
• Lack of written accents. People tend to avoid writing accents when writing in a colloquial way.
• Mistakes derived from removing unnecessary letters. The most common cases are removing an h at the beginning of a word (in Spanish the letter h is not pronounced) or replacing the letters qu by k. We understand that these mistakes can be motivated by the length limit of tweets and not by a real misspelling.
• Along the same lines, we neglect mistakes produced by removing letters in the middle of a word whose pronunciation can be deduced without them.
• We do not consider either mistakes related to features of specific areas in Spain. For example, in the south the pronunciation of ce and se is the same, which produces a large number of mistakes when writing. However, since we want to draw objective and equitable conclusions over the whole Spanish geography, we neglect those misspellings that only appear in a specific area.
Likewise, we consider the following mistakes as real misspellings:
• Adding letters. For example, writing an h at the beginning of a word that starts with a vowel.
• Replacing the special cases mp, mb by the wrong spellings np, nb.
• Mixing up b with v, g with j, ll with y, and ex with es. These are typical mistakes in Spanish because they have the same or a very close pronunciation.
• Confusing the verb haber with the periphrasis a ver.
• Separating a word into two, for instance writing the word conmigo as con migo.
Figure 9. Frequency plots P(x) for each variable constructed from Twitter. Panels: Tweets (unemployment), Tweets (employment), Tweets (job), Penetration rate, Entropy1 and Entropy2 (social), Entropy1 and Entropy2 (geo), Misspellers rate, and Activity (morning, afternoon, night).
Figure 8. Top: Percentage of population in each age group (16-24, 25-34, 35-44, 45-54, 55-64) from the Spanish Census (dark bars) and surveys about users in Twitter (light bars). Bottom: performance of the linear models for each of the age groups, shown as predicted versus observed unemployment (%): All ages (R2 = 0.47), < 24 (R2 = 0.62), 25-44 (R2 = 0.52) and > 44 (R2 = 0.26).
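The inclusion and exclusion rules listed above can be sketched as a dictionary lookup. This is a minimal illustration under stated assumptions, not the authors' implementation: the dictionaries WORD_ERRORS and PHRASE_ERRORS are tiny hypothetical stand-ins for the 617-entry list, the function names are invented here, and the haber / a ver confusion is left out because it needs grammatical context.

```python
import re

# Accents are removed before matching, so a missing accent never counts as a mistake.
ACCENTS = str.maketrans("áéíóúü", "aeiouu")

# Hypothetical stand-in for the paper's list of 617 common mistakes (wrong -> right).
WORD_ERRORS = {
    "habrir": "abrir",      # h added before a vowel
    "canpo": "campo",       # np written instead of mp
    "tanbien": "tambien",   # nb written instead of mb
    "haver": "haber",       # b / v confusion
    "cojer": "coger",       # g / j confusion
    "llendo": "yendo",      # ll / y confusion
    "esplicar": "explicar", # ex / es confusion
}
PHRASE_ERRORS = {"con migo": "conmigo"}  # one word split into two


def tokenize(tweet):
    """Lowercase, strip accents and keep alphabetic tokens only."""
    return re.findall(r"[a-zñ]+", tweet.lower().translate(ACCENTS))


def find_misspellings(tweet):
    """Return (wrong, right) pairs found in a tweet, in order of appearance."""
    tokens = tokenize(tweet)
    found = []
    for i, token in enumerate(tokens):
        if token in WORD_ERRORS:
            found.append((token, WORD_ERRORS[token]))
        pair = " ".join(tokens[i:i + 2])
        if pair in PHRASE_ERRORS:
            found.append((pair, PHRASE_ERRORS[pair]))
    return found


def misspellers_rate(tweets_by_user):
    """Fraction of users with at least one detected misspelling."""
    if not tweets_by_user:
        return 0.0
    misspellers = sum(
        1 for tweets in tweets_by_user.values()
        if any(find_misspellings(t) for t in tweets)
    )
    return misspellers / len(tweets_by_user)
```

Because only listed forms are flagged, shortenings such as kiero (qu replaced by k) or unaccented words such as tambien pass unflagged, which mirrors the excluded cases; a real list would also need the regional filtering described above.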
This way, our list of misspellings is composed of 617 common mistakes in Spanish that cannot be attributed to the special features of Twitter or to a specific region of Spain. Thus, one can expect this selection to provide an accurate and equitable method of detecting misspellers. Under these conditions, the number of users who wrote at least one misspelled word is 27,055 (5.6% of the whole population).
We analyze whether misspellers use Twitter differently from people who do not make serious mistakes when publishing a tweet. Comparing the average number of tweets, we observe that misspellers tend to publish a larger number of tweets than those who did not make mistakes (144.71 against 23.72). This also emerges when the mean number of misspellings given the total number of tweets is considered. For users with fewer than approximately 30 published tweets in the observation period the number of misspellings is almost zero, whereas for users who publish more often the mean number of misspellings scales sub-linearly with the number of tweets (exponent ≈ 0.33).
Figure 10. Top: Correlation matrix between the variables constructed from Twitter. Each entry in the matrix is depicted as a circle whose size is proportional to the correlation between variables and whose color is blue/red for positive/negative correlations. Blank entries correspond to statistically insignificant correlations with 95% confidence. Bottom: Projection of the variables on the first two principal components given by PCA. We observe different groups of variables and collinearity between some of them.
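A sub-linear scaling exponent of this kind is usually estimated by a straight-line fit in log-log space. The sketch below shows that procedure on synthetic, noise-free numbers; the function name fit_power_law, the prefactor 0.5 and the sample points are assumptions made for illustration, and the paper does not state how its exponent was fitted.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of log(y) = log(a) + b * log(x); returns (a, b)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    b = sxy / sxx                       # scaling exponent
    a = math.exp(mean_y - b * mean_x)   # prefactor
    return a, b


# Synthetic data generated from y = 0.5 * x**0.33 (not the paper's data).
tweets = [30, 50, 100, 200, 500]
mean_misspellings = [0.5 * t ** 0.33 for t in tweets]

a, b = fit_power_law(tweets, mean_misspellings)
print(f"prefactor = {a:.2f}, exponent = {b:.2f}")
```

An exponent below 1 means that doubling the number of tweets less than doubles the expected number of misspellings. With real, noisy data the fit would normally be restricted to users above the roughly 30-tweet threshold mentioned in the text.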
Since we have observed a segmentation of the Twitter population based on how accurately users write, we consider the misspellers rate as a proxy for the educational level of the cities. A large number of previous works in the literature have revealed the relationship between the economic status and the educational level of geographical areas, and therefore it is natural to ask whether the observed misspellers rate is related to the economy, as reflected by the unemployment rate. To test this hypothesis we consider cities with more than 5,000 inhabitants to avoid undersampled cases. We find a strong positive correlation between the probability of finding a misspeller in a city and the unemployment rate (0.372, 0.491).
Figure 11. Number N(miss|tweets) (red) and probability P(miss|tweets) (blue) of observed misspellings given the number of tweets.
Figure 12. Explanatory power (R2) of the linear regression model when fitted against the unemployment data for different months, from October 2012 to June 2014. The gray (orange) area corresponds to the time window in which Twitter data is collected and variables are constructed.
S7 Time window and unemployment
In the definition of the variables we have aggregated the Twitter activity within a 7-month time window (from December 2012 to June 2013). Since unemployment varies significantly over time, we investigate here the correlation and explanatory power of the Twitter variables for the values of unemployment measured at different months within the same time window in which Twitter data was collected, and whether the variables collected in that time window are more correlated with past or future values of unemployment. Figure 12 shows the explanatory power of the model when the linear regression is done for values of unemployment of different months before, during and after the Twitter data time window. Although there is a small seasonal effect along the year, we see that the explanatory power remains around R2 = 0.6, which suggests that our Twitter linear model retains its explanatory power even though unemployment changes considerably throughout the year. It is interesting to note that R2 decays slightly during the summer, which means that our variables are less correlated with summer unemployment. Finally, the unemployment used in the main article is from June 2013, i.e. the last month in the time window used to collect the data.
S8 Demographics does not explain unemployment
Since unemployment rates are very large for the group of young people, a natural question is whether demographic variables alone could explain the heterogeneity of youth unemployment rates found in the geographical areas. To this end we have built four linear models: the first one (named Youth model in Table 5) has the rate of young population as its only explanatory variable; the second and third ones are built on only the Twitter variables considered in the main text (named Twitter model (I)) or just on those whose regression coefficients are statistically significant (Twitter model (II)); the fourth one is fitted with all the variables (named All variables model in Table 5). Table 5 shows the summary of the regression for each model. Focusing on the variance explained by each model in terms of R2, considering all Twitter variables is almost three times more explanatory than considering only the proportion of young people. On the other hand, comparing R2 for the Twitter model with those of the All variables and Youth models shows that the rate of young population does not provide significant additional explanatory power. This semi-partial analysis shows that our Twitter variables retain a high explanatory power when the effect of the young population rate is controlled for.
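The model comparison above relies on R2 and adjusted R2, and the adjustment can be checked with a few lines of arithmetic. The sketch below uses the standard adjusted R2 formula; it assumes that the R2 row of Table 5 reads 0.65, 0.24, 0.64 and 0.63 (decimal points were lost in extraction) and that all four models were fitted on the 128 communities reported as "Number of points" in Table 6, which the source does not state explicitly.

```python
def adjusted_r2(r2, n_points, n_predictors):
    """Standard adjusted R2: penalizes R2 for the number of predictors used."""
    return 1.0 - (1.0 - r2) * (n_points - 1) / (n_points - n_predictors - 1)


N_POINTS = 128  # assumption: number of communities, taken from Table 6

# (R2 read from Table 5, number of predictors excluding the intercept)
models = {
    "Youth model": (0.24, 1),
    "Twitter model (I)": (0.64, 6),
    "Twitter model (II)": (0.63, 4),
    "All variables": (0.65, 7),
}

for name, (r2, k) in models.items():
    print(f"{name}: adjusted R2 = {adjusted_r2(r2, N_POINTS, k):.2f}")
```

Under these assumptions the computed values agree with the Adj. R2 row of Table 5 to within about 0.01, the residual difference being attributable to the rounding of the R2 inputs. Adding the young population rate to the Twitter variables moves the adjusted value only from about 0.62 to about 0.63, which is the point made in the text.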
S9 Unemployment models for other geographical areas
While municipalities are very heterogeneous demographically, other administrative areas exist in Spain at larger scales that could be used for our model of unemployment. As mentioned in section 4, the smallest administrative division of Spain we have considered is that of the 8,200 municipalities. At larger scales we have the 326 counties (comarcas in Spanish), which are aggregations of municipalities. Finally, the largest geographical scale we considered is defined by the 50 provinces (provincias in Spanish). In this section we compare the performance of our Twitter model for unemployment for the variables defined in those administrative areas and relate it to the geographical communities detected and used in the main paper (see section 4). Not all the areas at the different administrative levels are considered in the model. To minimize the effect of areas in which the number of geo-tagged tweets is very small, we only consider the 1,738 municipalities which have a Twitter population π > 10. Similarly, we only consider the 198 counties with π > 100. As we can see in Table 6, the model has a large explanatory power for areas equal to or bigger than counties. As expected, R2 increases as the number of areas in the model gets smaller, but the level of geographical detail of the model is very low for provinces, for example. The best performance (high R2 and high level of geographical detail) is attained at the level of the detected communities.
S10 Relative importance of the variables
To assess the relative importance of the variables in the unemployment model we have used several methods. They all give qualitatively the same results, with some variations for the statistically insignificant variables. Specifically, we have used:
1. (weight) Relative weight of the absolute values of the coefficients obtained in the linear regression when variables are scaled to have mean zero and variance one.
2. (lmg) The average over orderings proposed by Lindeman, Merenda and Gold.
3. (pmvd) The PMVD metric introduced by Feldman, which is an average over orderings as well, but with data-dependent weights.
4. (first) The univariate R2 values from regression models with one variable only.
All these metrics are obtained using the relaimpo R package [12]. The results for the youth unemployment model are shown in Figure 13, where we can see that the different methods yield similar relative importance of the variables, except perhaps for the diversity of mobility flows, a variable with a non-significant weight in the regression model.
Figure 13. Relative importance of the variables (in percentage) in the unemployment model for different ways to calculate it (weight, first, lmg and pmvd).
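Two of the metrics listed above, lmg and first, can be sketched without any external package. The code below is an illustrative re-implementation on made-up data, not a call to relaimpo and not the authors' computation; the predictor names and values are hypothetical, and pmvd is omitted because its data-dependent weights are more involved.

```python
from itertools import permutations


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def r_squared(columns, y):
    """R2 of an OLS fit (with intercept) of y on the given predictor columns."""
    if not columns:
        return 0.0
    n = len(y)
    mean_y = sum(y) / n
    total = sum((v - mean_y) ** 2 for v in y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    xtx = [[sum(row[p] * row[q] for row in design) for q in range(k)] for p in range(k)]
    xty = [sum(row[p] * v for row, v in zip(design, y)) for p in range(k)]
    beta = solve(xtx, xty)
    resid = sum((v - sum(b * x for b, x in zip(beta, row))) ** 2
                for row, v in zip(design, y))
    return 1.0 - resid / total


def lmg(predictors, y):
    """Average, over all orderings, of the R2 gained when each predictor enters."""
    shares = {name: 0.0 for name in predictors}
    orderings = list(permutations(predictors))
    for order in orderings:
        used, previous = [], 0.0
        for name in order:
            used.append(predictors[name])
            current = r_squared(used, y)
            shares[name] += current - previous
            previous = current
    return {name: total / len(orderings) for name, total in shares.items()}


def first(predictors, y):
    """R2 of each one-variable regression."""
    return {name: r_squared([col], y) for name, col in predictors.items()}


# Made-up data with three hypothetical predictors (x1 and x2 are correlated).
predictors = {
    "x1": [1, 2, 3, 4, 5, 6, 7, 8],
    "x2": [2, 1, 4, 3, 6, 5, 8, 7],
    "x3": [1, 0, 0, 1, 1, 0, 1, 0],
}
y = [2.1, 2.9, 5.2, 5.8, 8.1, 8.7, 11.2, 11.6]

shares = lmg(predictors, y)
full_r2 = r_squared(list(predictors.values()), y)
print("lmg:", shares)
print("first:", first(predictors, y))
print("full R2:", full_r2)
```

The lmg shares add up to the R2 of the full model, which is what makes them interpretable as a decomposition of explained variance, whereas the first values can sum to more than the full R2 when predictors are correlated. The cost grows factorially with the number of predictors, so this brute-force version is only practical for a handful of variables.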
References
[1] B. Ashtakala. Generalized power model for trip distribution. Transportation Research Part B: Methodological, 21(1):59–67, 1987.
[2] Michel Bierlaire. Mathematical models for transportation demand analysis. Transportation Research Part A: Policy and Practice, 31(1):86–86, 1997.
[3] Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.
[4] Harry J. Casey Jr. The law of retail gravitation applied to traffic engineering. Traffic Quarterly, 9(3), 1955.
[5] Aaron Clauset, Mark E. J. Newman, and Cristopher Moore. Finding community structure in very large networks. Physical Review E, 70(6):066111, 2004.
[6] Leon Danon, Albert Díaz-Guilera, Jordi Duch, and Alex Arenas. Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment, 2005(09):P09008, 2005.
[7] Servicio Público de Empleo Estatal (SEPE). Spanish registered unemployment. http://www.sepe.es/contenidos/que_es_el_sepe/estadisticas/index.html
[8] Instituto Nacional de Estadística. Spanish 2011 census. http://www.ine.es/censos2011_datos/cen11_datos_inicio.htm
[9] Suzanne P. Evans. A relationship between the gravity model for trip distribution and the transportation problem in linear programming. Transportation Research, 7(1):39–61, 1973.
[10] Paul Expert, Tim S. Evans, Vincent D. Blondel, and Renaud Lambiotte. Uncovering space-independent communities in spatial networks. Proceedings of the National Academy of Sciences, 108(19):7663–7668, 2011.
[11] Ingo Feinerer, Christian Buchta, Wilhelm Geiger, Johannes Rauch, Patrick Mair, and Kurt Hornik. The textcat package for n-gram based text categorization in R. Journal of Statistical Software, 52(6):1–17, 2013.
[12] Ulrike Grömping. Relative importance for linear regression in R: the package relaimpo. Journal of Statistical Software, 17(1):1–27, 2006.
[13] Bartosz Hawelka, Izabela Sitko, Euro Beinat, Stanislav Sobolevsky, Pavlos Kazakopoulos, and Carlo Ratti. Geo-located Twitter as the proxy for global mobility patterns. arXiv preprint arXiv:1311.0680, 2013.
[14] Yu Liu, Zhengwei Sui, Chaogui Kang, and Yong Gao. Uncovering patterns of inter-urban trips and spatial interactions from check-in data. arXiv preprint arXiv:1310.0282, 2013.
[15] Mark E. J. Newman. Finding community structure in networks using the eigenvectors of matrices. Physical Review E, 74(3):036104, 2006.
[16] Pascal Pons and Matthieu Latapy. Computing communities in large networks using random walks. In Computer and Information Sciences - ISCIS 2005, pages 284–293. Springer, 2005.
[17] Usha Nandini Raghavan, Réka Albert, and Soundar Kumara. Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 76(3):036106, 2007.
[18] Martin Rosvall and Carl T. Bergstrom. Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences, 105(4):1118–1123, 2008.
[19] Morton Schneider. Gravity models and trip distribution theory. Papers in Regional Science, 5(1):51–56, 1959.
[20] Filippo Simini, Marta C. González, Amos Maritan, and Albert-László Barabási. A universal model for mobility and migration patterns. Nature, 484(7392):96–100, 2012.
[21] Chaoming Song, Tal Koren, Pu Wang, and Albert-László Barabási. Modelling the scaling properties of human mobility. Nature Physics, 6(10):818–823, 2010.
[22] Alan Geoffrey Wilson. Entropy in urban and regional modelling. Pion Ltd, 1970.
[23] Alan Geoffrey Wilson. Urban and regional models in geography and planning. 1974.
[24] Svante Wold, Arnold Ruhe, Herman Wold, and W. J. Dunn III. The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses. SIAM Journal on Scientific and Statistical Computing, 5(3):735–743, 1984.
Gravity Model
Parameter  Description                             Spain
α1         Origin exponent                         0.477*** (0.002)
α2         Destination exponent                    0.478*** (0.002)
β          Distance exponent                       1.05*** (0.0035)
R2         Goodness of fit                         0.797
φ          Correlation between T_ij and T_ij^gra   0.826
Table 1. Description of the parameters for the Gravity Law Model in geo-tagged social media data for Spain. (***) means significance p < 0.0001.
NMI between G and G_p for different p
Algorithm  p = 0.01  0.02   0.03   0.04   0.05   0.06   0.07   0.08   0.09   0.1
FG         0.995     0.992  0.989  0.983  0.981  0.977  0.983  0.969  0.980  0.959
WT         0.954     0.959  0.950  0.954  0.945  0.948  0.947  0.935  0.926  0.931
IM         0.988     0.981  0.980  0.981  0.978  0.974  0.975  0.970  0.969  0.966
ML         0.994     0.978  0.979  0.983  0.948  0.934  0.972  0.952  0.973  0.947
LP         0.906     0.908  0.911  0.915  0.895  0.907  0.907  0.893  0.905  0.904
LE         0.960     0.957  0.956  0.859  0.910  0.892  0.908  0.858  0.885  0.884
Table 2. NMI measure comparing G and G_p.
Communities Stats
Algorithm  〈|N_i|〉   max |N_i|  # communities  Modularity  NMI P  NMI C
FG         309.696   1385       23             0.726       0.712  0.590
WT         9.262     433        769            0.417       0.744  0.757
IM         21.011    143        339            0.758       0.770  0.831
ML         323.772   1132       22             0.800       0.717  0.599
LP         22.052    750        323            0.732       0.749  0.761
LE         1017.571  5344       7              0.381       0.264  0.205
Table 3. Statistics of the communities N_i returned by the six algorithms. NMI P refers to the comparison between communities and provinces, whereas NMI C considers counties instead of provinces.
                        All ages         < 24             25-44            > 44
(Intercept)             0.11**** (0.02)  0.10*** (0.03)   0.20*** (0.03)   0.20*** (0.035)
Penetration rate        3.23* (1.41)     8.57*** (2.22)   6.28** (2.17)    2.40 (2.77)
Geographical diversity  0.03 (0.02)      0.15*** (0.04)   0.08* (0.04)     0.06 (0.05)
Social diversity        -0.03* (0.01)    -0.03 (0.02)     -0.05* (0.02)    -0.06* (0.03)
Morning activity        -0.69* (0.26)    -1.30** (0.42)   -1.53*** (0.41)  -1.19* (0.52)
Misspellers rate        11.56 (8.13)     31.51* (12.78)   15.46 (12.48)    23.60 (15.94)
Employment mentions     -1.80 (6.27)     3.17 (9.86)      -9.94 (9.64)     2.71 (12.3)
R2                      0.47             0.64             0.55             0.29
Adj. R2                 0.44             0.62             0.52             0.26
***p < 0.001, **p < 0.01, *p < 0.05
Table 4. Regression table for the different models in which unemployment for different age groups is fitted. The All ages model is the fit to the general rate of unemployment in each geographical area, while the other models are for the rates of unemployment in the groups of less than 24 years, between 25 and 44 years, and above 44 years.
                        All variables    Youth model     Twitter model (I)  Twitter model (II)
(Intercept)             0.06 (0.03)      -0.02 (0.03)    0.10*** (0.03)     0.09*** (0.027)
Young pop. rate         0.66* (0.30)     2.20*** (0.35)
Penetration rate        8.20*** (2.25)                   8.57*** (2.22)     8.62*** (2.21)
Geographical diversity  0.14*** (0.04)                   0.15*** (0.04)     0.12*** (0.03)
Social diversity        -0.02 (0.02)                     -0.03 (0.02)
Morning activity        -1.42*** (0.41)                  -1.30** (0.42)     -1.28** (0.41)
Misspellers rate        23.95 (13.09)                    31.51* (12.78)     32.28* (12.71)
Employment mentions     0.34 (9.81)                      3.17 (9.86)
R2                      0.65             0.24            0.64               0.63
Adj. R2                 0.63             0.24            0.62               0.62
***p < 0.001, **p < 0.01, *p < 0.05
Table 5. Regression table for the different statistical models. The All variables model includes both the Twitter variables and the rate of young population. Twitter model (I) includes only the variables described in the main article, while Twitter model (II) includes only those variables which are significant (p < 0.05) in Twitter model (I).
                        Communities      Municipalities   Counties         Provinces
(Intercept)             0.10*** (0.03)   0.16*** (0.01)   0.11*** (0.03)   0.11* (0.05)
Penetration rate        8.57*** (2.22)   4.01*** (0.59)   9.12*** (1.81)   10.47*** (1.97)
Geographical diversity  0.15*** (0.04)   0.02 (0.01)      0.12*** (0.03)   0.08 (0.07)
Social diversity        -0.03 (0.02)     -0.01 (0.01)     -0.01 (0.02)     -0.03 (0.07)
Morning activity        -1.30** (0.42)   -1.16*** (0.14)  -1.49*** (0.39)  -1.03 (0.88)
Misspellers rate        31.51* (12.78)   14.40*** (2.51)  14.09 (10.02)
Employment mentions     3.17 (9.86)      -0.71 (0.89)     2.41 (8.86)      -3.17 (12.29)
Number of points        128              1738             198              50
R2                      0.64             0.22             0.55             0.65
Adj. R2                 0.62             0.21             0.54             0.61
***p < 0.001, **p < 0.01, *p < 0.05
Table 6. Regression table for the unemployment linear regression model at different levels of geographical areas. In the Provinces model the misspellers rate has been removed from the model due to its large collinearity with the penetration rate.
- 1 Social media dataset and functional partition of cities
- 2 Social media behavioral fingerprints
- 3 Explanatory power of social media in unemployment
- 4 Discussion
Figure 9. Frequency plots for each variable constructed from Twitter. [Panels show P(x) for: Tweets (unemployment), Penetration rate, Entropy2 (social), Entropy1 (social), Entropy2 (geo), Entropy1 (geo), Misspellers rate, Tweets (employment), Tweets (job), Activity (night), Activity (morning), Activity (afternoon).]
letters in the middle of a word whose pronunciation can be deduced without them.

• Nor do we consider mistakes related to features of specific areas of Spain. For example, in the south the pronunciation of ce and se is the same, which produces a large number of writing mistakes. However, since we want to draw objective and equitable conclusions over the whole Spanish geography, we neglect those misspellings that only appear in a specific area.

Likewise, we consider the following mistakes to be real misspellings:

• Adding letters, for example writing an h at the beginning of a word that starts with a vowel.

• Replacing the special cases mp and mb with the wrong spellings np and nb.

• Mixing up b with v, g with j, ll with y, and ex with es. These are typical mistakes in Spanish because the pairs have the same or a very close pronunciation.
Figure 8. Top: Percentage of population in each age group (16−24, 25−34, 35−44, 45−54, 55−64) from the Spanish Census (dark bars) and from surveys about users of Twitter (light bars). Bottom: performance of the linear models for each of the age groups, shown as predicted versus observed unemployment (%). [Panels: All ages (R² = 0.47), < 24 (R² = 0.62), 25−44 (R² = 0.52), > 44 (R² = 0.26).]
• Confusing the verb haber with the periphrasis a ver.

• Splitting a word in two, for instance writing the word conmigo as con migo.

In this way, our list of misspellings is composed of 617 common mistakes in Spanish that cannot be attributed to the special features of Twitter or to a specific region of Spain. Thus, one can expect this selection to provide an accurate and equitable method of detecting misspellers. Under these conditions, the number of users who wrote at least one misspelled word is 27055 (56 over the whole population).
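The detection step described above can be sketched as a simple list lookup. This is only an illustration: the six entries below are hypothetical examples of the listed categories (np/nb for mp/mb, ll for y, es for ex, an added h, a split word) and are not the authors' 617-entry list, and the function names are assumptions.

```python
import re

# Hypothetical stand-in for the 617-entry list used in the paper.
MISSPELLINGS = ["tanbien", "enpezar", "llendo", "esplicar", "hechar", "con migo"]

# One regular expression that matches any listed form as a whole word or phrase.
PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(m) for m in MISSPELLINGS) + r")\b"
)


def count_misspellings(text):
    """Number of listed misspellings found in one tweet (case-insensitive)."""
    return len(PATTERN.findall(text.lower()))


def misspellers_rate(tweets_by_user):
    """Fraction of users who wrote at least one listed misspelling."""
    flagged = sum(
        1
        for tweets in tweets_by_user.values()
        if any(count_misspellings(t) > 0 for t in tweets)
    )
    return flagged / len(tweets_by_user)


example = {
    "u1": ["Voy a enpezar hoy", "hola"],
    "u2": ["Ven con migo"],
    "u3": ["todo bien"],
    "u4": ["tambien voy"],
}
rate = misspellers_rate(example)
```

A user counts as a misspeller as soon as one tweet matches, which mirrors the "at least one misspelled word" criterion in the text.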
Figure 10. Top: Correlation matrix between the variables constructed from Twitter. Each entry in the matrix is depicted as a circle whose size is proportional to the correlation between variables and whose color is blue/red for positive/negative correlations. Blank entries correspond to statistically insignificant correlations at 95% confidence. Bottom: Projection of the variables on the first two principal components given by PCA. We observe different groups of variables and collinearity between some of them.

We analyze whether misspellers use Twitter differently from people who do not make serious mistakes when publishing a tweet. Comparing the average number of tweets, we observe that misspellers tend to publish more tweets than those who did not make mistakes (14471 against 2372). This also emerges when the mean number of misspellings given the total number of tweets is considered. For users with fewer than approximately 30 published tweets in the observation period the number of misspellings is almost zero, whereas for users who publish more often the mean number of misspellings scales sub-linearly with the number of tweets (exponent ≈ 0.33).
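A scaling exponent such as the one quoted above is commonly estimated with a least-squares fit in log-log space. The sketch below assumes that approach (the source does not say how the exponent was obtained) and uses synthetic data generated with a known exponent of 0.33; the function name and the prefactor are illustrative.

```python
import math


def scaling_exponent(tweets, mean_misspellings):
    """Fit mean_misspellings ~ prefactor * tweets**exponent by least squares on logs."""
    xs = [math.log(t) for t in tweets]
    ys = [math.log(m) for m in mean_misspellings]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, math.exp(intercept)


# Synthetic data only: users above the ~30-tweet threshold mentioned in the text.
tweets = [30, 50, 100, 200, 500]
mean_miss = [1.5 * t ** 0.33 for t in tweets]
exponent, prefactor = scaling_exponent(tweets, mean_miss)
```

An exponent below 1 means that doubling the number of tweets less than doubles the mean number of misspellings, which is what "sub-linearly" refers to.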
Since we have observed a segmentation of the Twitter population based on how accurately users write, we consider the misspellers rate as a proxy for the educational level of the cities. A large number of previous works in the literature have revealed the relationship between the economic status and the educational level of geographical areas, and it is therefore natural to ask whether the observed misspellers rate is related to the economy, as measured by the unemployment rate. To test this hypothesis we consider cities with more than 5000 inhabitants, to avoid undersampled cases. We find a strong positive correlation between the probability of finding a misspeller in a city and the unemployment rate (0.372, 0.491).
Figure 11. Number (red) and probability (blue) of observed misspellings given the number of tweets. [Axes: N(miss|tweets) and P(miss|tweets) versus number of tweets.]

Figure 12. Explanatory power (R²) of the linear regression model when fitted against the unemployment data for different months (October 2012 to June 2014). The gray (orange) area corresponds to the time window in which the Twitter data was collected and the variables were constructed.
S7 Time window and unemployment
In the definition of the variables we aggregated the Twitter activity within a seven-month time window (from December 2012 to June 2013). Since unemployment varies significantly over time, we investigate here the correlation and explanatory power of the Twitter variables for the values of unemployment at different months within the time window in which the Twitter data was collected, and whether the variables collected in that window are more correlated with past or future values of unemployment. Figure 12 shows the explanatory power of the model when the linear regression is done for values of unemployment of different months before, during and after the Twitter data time window. Although there is a small seasonal effect along the year, the explanatory power remains around R² = 0.6, which suggests that our Twitter linear model retains its explanatory power even though unemployment changes considerably throughout the year. It is interesting to note that R² decays slightly during the summer, which means that our variables are less correlated with summer unemployment. Finally, the unemployment used in the main article is from June 2013, i.e. the last month in the time window used to collect the data.
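The month-by-month comparison can be sketched as refitting the same regressors against a different unemployment vector for each month and recording R². The data below are synthetic and the month labels, sizes and noise level are assumptions chosen for illustration; only the procedure follows the text.

```python
import numpy as np


def r_squared(X, y):
    """R^2 of an ordinary least squares fit of y on X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)


rng = np.random.default_rng(0)
n_areas, n_vars = 120, 6

# The Twitter variables are built once from the fixed collection window.
X = rng.normal(size=(n_areas, n_vars))
signal = X @ rng.normal(size=n_vars)

# Unemployment changes from month to month; here it is the signal plus fresh noise.
months = ["2012-12", "2013-03", "2013-06", "2013-08"]
r2_by_month = {}
for month in months:
    unemployment = signal + rng.normal(scale=1.0, size=n_areas)
    r2_by_month[month] = r_squared(X, unemployment)
```

Plotting `r2_by_month` against the month gives a curve of the kind described for Figure 12.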
S8 Demographics does not explain unemployment

Since unemployment rates are very large among young people, a natural question is whether demographic variables alone could explain the heterogeneity of youth unemployment rates found across the geographical areas. To test this we built four linear models: the first (named Youth model in Table 5) has the rate of young population as its only explanatory variable; the second and third use only Twitter variables, either all those considered in the main text (Twitter model (I)) or just those whose regression coefficients are statistically significant (Twitter model (II)); the fourth is fitted with all the variables (named All variables model in Table 5). Table 5 shows the summary of the regression for each model. Focusing on the variance explained by each model in terms of R², considering all Twitter variables is almost three times more explanatory than considering only the proportion of young people (R² = 0.64 against 0.24). On the other hand, comparing R² for the Twitter models with that of the All variables model (0.65) shows that the rate of young population does not add significant explanatory power. This semi-partial analysis shows that our Twitter variables retain a high explanatory power when the effect of the young population rate is controlled for.
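The comparison of nested models can be sketched as follows. The data are synthetic and the coefficients are arbitrary assumptions; the point is only how the R² of the Youth, Twitter and All variables models relate, and how the gain from adding one block of variables to the other is read off.

```python
import numpy as np


def r_squared(X, y):
    """R^2 of an ordinary least squares fit of y on X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)


rng = np.random.default_rng(1)
n_areas = 128

# Synthetic regressors: six "Twitter" variables and one "young population rate".
twitter = rng.normal(size=(n_areas, 6))
youth = 0.3 * twitter[:, 0] + rng.normal(size=n_areas)
unemployment = (
    twitter @ np.array([1.0, 0.8, 0.0, -0.7, 0.5, 0.0])
    + rng.normal(size=n_areas)
)

r2_youth = r_squared(youth.reshape(-1, 1), unemployment)
r2_twitter = r_squared(twitter, unemployment)
r2_all = r_squared(np.column_stack([twitter, youth]), unemployment)

# Increase in R^2 when one block is added to a model that already has the other.
gain_from_youth = r2_all - r2_twitter
gain_from_twitter = r2_all - r2_youth
```

In Table 5 the analogous gains are 0.65 − 0.64 for the young population rate and 0.65 − 0.24 for the Twitter variables.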
S9 Unemployment models for other geographical areas

While municipalities are very heterogeneous demographically, other administrative areas exist in Spain at larger scales that could be used for our model of unemployment. As mentioned in section 4, the smallest administrative division of Spain we have considered is that of the 8200 municipalities. At larger scales we have the 326 counties (comarcas in Spanish), which are aggregations of municipalities. Finally, the largest geographical scale we considered is defined by the 50 provinces (provincias in Spanish). In this section we compare the performance of our Twitter model for unemployment with the variables defined in those administrative areas, and relate it to the geographical communities detected and used in the main paper (see section 4). Not all the areas at the different administrative levels are considered in the model. To minimize the effect of areas in which the number of geo-tagged tweets is very small, we only consider the 1738 municipalities which have a Twitter population π > 10. Similarly, we only consider the 198 counties with π > 100. As we can see in Table 6, the model has a large explanatory power for areas equal to or bigger than counties. As expected, R² increases as the number of areas in the model gets smaller, but the geographical description level of the model is very low for provinces, for example. The best performance (high R² and high geographical description level) is attained at the level of the detected communities.
S10 Relative importance of the variables
To assess the relative importance of the variables in the unemployment model we have used several methods. They all give qualitatively the same results, with some variations for the statistically insignificant variables. Specifically, we have used:

1 (weight) Relative weight of the absolute values of the coefficients obtained in the linear regression when variables are scaled to have mean zero and variance one.

2 (lmg) Averaging over orderings, as proposed by Lindeman, Merenda and Gold.

3 (pmvd) The PMVD metric introduced by Feldman, which is an average over orderings as well, but with data-dependent weights.

4 (first) The univariate R² values from regression models with one variable only.

All these metrics are obtained using the relaimpo R package [12]. The results for the young unemployment model are shown in Figure 13, where we can see that the different methods yield similar relative importance of the variables, except perhaps for the diversity of mobility flows, a variable with a non-significant weight in the regression model.

Figure 13. Relative importance of the variables (in percentage) in the unemployment model for the different ways to calculate it (weight, first, lmg, pmvd).
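Three of the four metrics listed above (weight, first and lmg) are simple enough to sketch directly. The code below is an illustrative Python reimplementation on synthetic data, not the relaimpo package the authors used; pmvd is omitted because its data-dependent weights are more involved. For lmg, each variable receives the average increase in R² it produces when added to the model, taken over all orderings of the variables.

```python
from itertools import permutations

import numpy as np


def r_squared(X, y, cols):
    """R^2 of an OLS fit of y on the selected columns of X plus an intercept."""
    A = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)


def relative_importance(X, y):
    p = X.shape[1]

    # (first) univariate R^2 of each variable on its own.
    first = np.array([r_squared(X, y, [j]) for j in range(p)])

    # (weight) absolute standardized coefficients, normalized to sum to one.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    zy = (y - y.mean()) / y.std()
    beta, *_ = np.linalg.lstsq(Z, zy, rcond=None)
    weight = np.abs(beta) / np.abs(beta).sum()

    # (lmg) average sequential gain in R^2 over all orderings of the variables.
    lmg = np.zeros(p)
    orders = list(permutations(range(p)))
    for order in orders:
        used, previous = [], 0.0
        for j in order:
            used.append(j)
            current = r_squared(X, y, used)
            lmg[j] += current - previous
            previous = current
    lmg /= len(orders)
    return weight, first, lmg


rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
X[:, 1] += 0.5 * X[:, 0]  # make two regressors correlated
y = X @ np.array([1.0, 0.5, 0.0, -0.8]) + rng.normal(size=200)
weight, first, lmg = relative_importance(X, y)
full_r2 = r_squared(X, y, [0, 1, 2, 3])
```

The lmg shares add up to the R² of the full model, so dividing them by `full_r2` gives percentages comparable to those in Figure 13. Enumerating all orderings is only practical for a handful of variables.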
References
[1] B. Ashtakala. Generalized power model for trip distribution. Transportation Research Part B: Methodological, 21(1):59–67, 1987.
[2] Michel Bierlaire. Mathematical models for transportation demand analysis. Transportation Research Part A: Policy and Practice, 31(1):86–86, 1997.
[3] Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.
[4] Harry J. Casey Jr. The law of retail gravitation applied to traffic engineering. Traffic Quarterly, 9(3), 1955.
[5] Aaron Clauset, Mark E. J. Newman, and Cristopher Moore. Finding community structure in very large networks. Physical Review E, 70(6):066111, 2004.
[6] Leon Danon, Albert Diaz-Guilera, Jordi Duch, and Alex Arenas. Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment, 2005(09):P09008, 2005.
[7] Servicio Público de Empleo Estatal (SEPE). Spanish registered unemployment. http://www.sepe.es/contenidos/que_es_el_sepe/estadisticas/index.html
[8] Instituto Nacional de Estadística. Spanish 2011 census. http://www.ine.es/censos2011_datos/cen11_datos_inicio.htm
[9] Suzanne P. Evans. A relationship between the gravity model for trip distribution and the transportation problem in linear programming. Transportation Research, 7(1):39–61, 1973.
[10] Paul Expert, Tim S. Evans, Vincent D. Blondel, and Renaud Lambiotte. Uncovering space-independent communities in spatial networks. Proceedings of the National Academy of Sciences, 108(19):7663–7668, 2011.
[11] Ingo Feinerer, Christian Buchta, Wilhelm Geiger, Johannes Rauch, Patrick Mair, and Kurt Hornik. The textcat package for n-gram based text categorization in R. Journal of Statistical Software, 52(6):1–17, 2013.
[12] Ulrike Grömping. Relative importance for linear regression in R: the package relaimpo. Journal of Statistical Software, 17(1):1–27, 2006.
[13] Bartosz Hawelka, Izabela Sitko, Euro Beinat, Stanislav Sobolevsky, Pavlos Kazakopoulos, and Carlo Ratti. Geo-located Twitter as the proxy for global mobility patterns. arXiv preprint arXiv:1311.0680, 2013.
[14] Yu Liu, Zhengwei Sui, Chaogui Kang, and Yong Gao. Uncovering patterns of inter-urban trips and spatial interactions from check-in data. arXiv preprint arXiv:1310.0282, 2013.
[15] Mark E. J. Newman. Finding community structure in networks using the eigenvectors of matrices. Physical Review E, 74(3):036104, 2006.
[16] Pascal Pons and Matthieu Latapy. Computing communities in large networks using random walks. In Computer and Information Sciences - ISCIS 2005, pages 284–293. Springer, 2005.
[17] Usha Nandini Raghavan, Reka Albert, and Soundar Kumara. Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 76(3):036106, 2007.
[18] Martin Rosvall and Carl T. Bergstrom. Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences, 105(4):1118–1123, 2008.
[19] Morton Schneider. Gravity models and trip distribution theory. Papers in Regional Science, 5(1):51–56, 1959.
[20] Filippo Simini, Marta C. Gonzalez, Amos Maritan, and Albert-Laszlo Barabasi. A universal model for mobility and migration patterns. Nature, 484(7392):96–100, 2012.
[21] Chaoming Song, Tal Koren, Pu Wang, and Albert-Laszlo Barabasi. Modelling the scaling properties of human mobility. Nature Physics, 6(10):818–823, 2010.
[22] Alan Geoffrey Wilson. Entropy in urban and regional modelling. Pion Ltd, 1970.
[23] Alan Geoffrey Wilson. Urban and regional models in geography and planning. 1974.
[24] Svante Wold, Arnold Ruhe, Herman Wold, and W. J. Dunn III. The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses. SIAM Journal on Scientific and Statistical Computing, 5(3):735–743, 1984.
Gravity Model

| Parameter | Description | Spain |
|---|---|---|
| α1 | Origin exponent | 0.477*** (0.002) |
| α2 | Destination exponent | 0.478*** (0.002) |
| β | Distance exponent | 1.05*** (0.0035) |
| R² | Goodness of fit | 0.797 |
| φ | Correlation between T_ij and T^gra_ij | 0.826 |

Table 1. Description of the parameters for the Gravity Law Model in geo-tagged social media data for Spain. (***) means significance p < 0.0001.
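NMI between G and G_p for different p

| Algorithm | p = 0.01 | 0.02 | 0.03 | 0.04 | 0.05 | 0.06 | 0.07 | 0.08 | 0.09 | 0.1 |
|---|---|---|---|---|---|---|---|---|---|---|
| FG | 0.995 | 0.992 | 0.989 | 0.983 | 0.981 | 0.977 | 0.983 | 0.969 | 0.980 | 0.959 |
| WT | 0.954 | 0.959 | 0.950 | 0.954 | 0.945 | 0.948 | 0.947 | 0.935 | 0.926 | 0.931 |
| IM | 0.988 | 0.981 | 0.980 | 0.981 | 0.978 | 0.974 | 0.975 | 0.970 | 0.969 | 0.966 |
| ML | 0.994 | 0.978 | 0.979 | 0.983 | 0.948 | 0.934 | 0.972 | 0.952 | 0.973 | 0.947 |
| LP | 0.906 | 0.908 | 0.911 | 0.915 | 0.895 | 0.907 | 0.907 | 0.893 | 0.905 | 0.904 |
| LE | 0.960 | 0.957 | 0.956 | 0.859 | 0.910 | 0.892 | 0.908 | 0.858 | 0.885 | 0.884 |

Table 2. NMI measure comparing G and G_p.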
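For reference, the normalized mutual information between two partitions of the same set of nodes can be computed as sketched below. The normalization 2·I(A;B) / (H(A) + H(B)) is an assumption here (it is the form used in the community-detection comparison literature such as [6]); the source section does not spell out which variant was used, and the function name is illustrative.

```python
import math
from collections import Counter


def nmi(labels_a, labels_b):
    """Normalized mutual information between two labelings of the same nodes."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))

    # Mutual information I(A;B) from the joint and marginal label frequencies.
    mutual = sum(
        (n_ab / n) * math.log(n_ab * n / (count_a[a] * count_b[b]))
        for (a, b), n_ab in joint.items()
    )
    entropy_a = -sum((c / n) * math.log(c / n) for c in count_a.values())
    entropy_b = -sum((c / n) * math.log(c / n) for c in count_b.values())

    if entropy_a + entropy_b == 0:
        return 1.0  # both partitions put every node in a single community
    return 2 * mutual / (entropy_a + entropy_b)


identical = nmi([0, 0, 1, 1], [5, 5, 7, 7])    # same grouping, different labels
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])    # statistically independent groupings
```

Values close to 1, as in Table 2, indicate that the two partitions being compared are nearly the same.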
Communities Stats

| Algorithm | ⟨|N_i|⟩_i | max|N_i| | |N_i| | Modularity | NMI P | NMI C |
|---|---|---|---|---|---|---|
| FG | 309.696 | 1385 | 23 | 0.726 | 0.712 | 0.590 |
| WT | 9.262 | 433 | 769 | 0.417 | 0.744 | 0.757 |
| IM | 21.011 | 143 | 339 | 0.758 | 0.770 | 0.831 |
| ML | 323.772 | 1132 | 22 | 0.800 | 0.717 | 0.599 |
| LP | 22.052 | 750 | 323 | 0.732 | 0.749 | 0.761 |
| LE | 1017.571 | 5344 | 7 | 0.381 | 0.264 | 0.205 |

Table 3. Statistics of the communities N_i returned by the six algorithms. NMI P refers to the comparison between communities and provinces, whereas NMI C considers counties instead of provinces.
| | All ages | < 24 | 25−44 | > 44 |
|---|---|---|---|---|
| (Intercept) | 0.11*** (0.02) | 0.10*** (0.03) | 0.20*** (0.03) | 0.20*** (0.035) |
| Penetration rate | 3.23* (1.41) | 8.57*** (2.22) | 6.28** (2.17) | 2.40 (2.77) |
| Geographical diversity | 0.03 (0.02) | 0.15*** (0.04) | 0.08* (0.04) | 0.06 (0.05) |
| Social diversity | −0.03* (0.01) | −0.03 (0.02) | −0.05* (0.02) | −0.06* (0.03) |
| Morning activity | −0.69* (0.26) | −1.30** (0.42) | −1.53*** (0.41) | −1.19* (0.52) |
| Misspellers rate | 11.56 (8.13) | 31.51* (12.78) | 15.46 (12.48) | 23.60 (15.94) |
| Employment mentions | −1.80 (6.27) | 3.17 (9.86) | −9.94 (9.64) | 2.71 (12.3) |
| R² | 0.47 | 0.64 | 0.55 | 0.29 |
| Adj. R² | 0.44 | 0.62 | 0.52 | 0.26 |

Standard errors in parentheses. ***p < 0.001, **p < 0.01, *p < 0.05.

Table 4. Regression table for the different models in which unemployment for different age groups is fitted. The All ages model is the fit to the general rate of unemployment in each geographical area, while the other models are for the rates of unemployment in the groups of less than 24 years, between 25 and 44 years, and above 44 years.
Social media fingerprints of unemployment mdash 1419
0
10
20
30
16minus24 25minus34 35minus44 45minus54 55minus64r
f
000
025
050
075
100llena
Age group
Perc
enta
ge o
f pop
ulat
ion Census
10
15
20
25
10 15 20 25x
y
10
15
20
25
10 15 20 25x
y
5
10
15
20
5 10 15 20x
y
5
10
15
2025
5 10 15 20 25x
y
All ages lt 24
25-44 gt 44
Observed Unemployment () Observed Unemployment ()
Observed Unemployment ()Observed Unemployment ()
Pred
icte
d Un
empl
oym
ent (
)
Pred
icte
d Un
empl
oym
ent (
)
Pred
icte
d Un
empl
oym
ent (
)
Pred
icte
d Un
empl
oym
ent (
)
R2 = 047 R2 = 062
R2 = 052 R2 = 026
Figure 8 Top Percentage of population in each age groupfrom the Spanish Census (dark bars) and surveys about usersin Twitter (light bars) Bottom performance of the linearmodels for each of the age groups
bull Confusing the verb haber with the periphrasis a ver
bull Separating a word into two ones for instance writing theword conmigo as con migo
This way our list of mispellings is composed of 617 common mis-takes in Spanish that cannot be attributed to the special featuresof Twitter or a specific region of Spain Thus one can expect thatthis selection provides an accurate and equitable method of detect-ing misspellers Under these conditions the number of users whowrote at least one misspelled word is 27055 (56 over the wholepopulation)
We analyze whether misspellers have different Twitter usagebehavior from that people who do not make serious mistakes whenpublishing a tweet Comparing the average number of tweets itcan be observed that misspellers tend to publish a larger numberof tweets than those who did not made mistakes (14471 against2372) This also emerges when the mean number of misspellinggiven the total number of tweets is considered For users with lessthan approximately 30 published tweets in the observation period thenumber of misspellings is almost zero whereas for users who publishmore often the mean number of misspellings scales sub-linearly
minus1
minus08
minus06
minus04
minus02
0
02
04
06
08
1rtwpen
sio
sior
siosocial
siorsocial
manana
tarde
madrugada
fmiss
job
emp
unem
p
eco
rtwpensiosior
siosocialsiorsocialmanana
tardemadrugada
fmissjobemp
unempeco
i
Sui
Sri
Sri
Sui
i
microecoi
microunempi
microempi
microjobi
ngti
aftni
mrngi
minus3 minus2 minus1 0 1 2 3
minus3minus2
minus10
12
3
First Principal component
Seco
nd P
rinci
pal C
ompo
nent
6
26
47
70 92123
145
155185
209
213
251
257 279305318339
371
396411
423
455
466
490
507
546552
568
597
621
637672
681
710
718738
772
781
805828
849
876
900
919
925
966979
1002 1021
1047
1064
1085
1109
1126
1137
1164
11861200
12361249
1264
1300
1318
1336
13541386
1406
1425
1444
1457
14711504
15141554
1572
1584
1611
1622
16571667
1686
1721
1739
1800
1823
1837
1863
18831930
1949
1968
20062036
2042
2067
2181
2206
22442264
2305
2331
23322413 2435
2456
2512
2554
2569
25982617
26522670
26972721 2748
2790
2875
2917
2939
29492976
2989
3025
3132
3228
3245
3331
3347
3381
3451
3484
3519
3613
3837
38653987
4326
minus05 00 05
minus05
00
05
sio
sior
siosocial
siorsocial
rtwpen
manana
tarde
madrugada
fmiss
jobemp
unemp
eco
ngt
microunemp
aftn
microempmicroeco
microjobmrng
Si
Si
Sir
˜Sir
domingo 12 de octubre de 14
Figure 10 Top Correlation matrix between the vari-ables constructed from Twitter Each entry in the matrixis depicted as a circle whose size is proportional to thecorrelation between variables and the sign is bluered forpositivenegative correlations Blank entries correspond tostatistically insignificant correlations with 95 confidenceBottom Variables projection on the first two principal com-ponents given by PCA We observe different groups of vari-ables and collinearity between some of them
with the number of tweets (exponent asymp 033)
Since we have observed a segmentation of Twitter populationbased on how accurate they write we consider the misspeller rate as aproxy of the educational level of the cities Large number of previousworks in the literature have revealed the relationship between theeconomical status and the educational level of geographical areasand therefore it is natural to ask whether the observed misspellersrate is related to economy driven by the unemployment rate To testthis hypothesis we consider cities populated with more than 5000inhabitants to avoid subsampled cases We find a strong positivecorrelation between the probability of finding a misspeller in a cityand the unemployment rate (0372 0491)
Social media fingerprints of unemployment mdash 1519
2 5 10 20 50 200 500
10
15
20
30
40
tweets
N(m
iss|
tw
eets
)
2 5 10 20 50 200 500
000
50
050
050
0
tweets
P(m
iss|
twee
ts)
2 50 500
Figure 11 Number (red) and probability (blue) of ob-served misspellings given the number of tweets
050
055
060
065
070
201210201211201212201301201302201303201304201305201306201307201308201309201310201311201312201401201402201403201404201405201406
month
r2R2
month
Figure 12 Explanatory power of the linear regressionmodel when fitted against the unemployment data for dif-ferent months Gray (orange) area correspond to the timewindow in which Twitter data is collected and variables areconstructed
S7 Time window and unemployment
In the definition of the variables we have aggregated the Twitter ac-tivity within a 7 months time window (from December 2012 to June2013) Since unemployment has a significant variation along timewe investigate here what is the correlation and explanatory powerof the Twitter variables for the values of unemployment determinedat different months through the same time window in which Twitterdata was collected Or if the variables collected in that time windoware more correlated with past or future values of unemploymentFigure 12 shows the explanatory value of the model when the linearregression is done for values of unemployment of different monthsbefore during and after the Twitter data time window Althoughthere is a small seasonal effect along the year we see that the ex-planatory power remains around R2 = 06 which suggest that ourTwitter linear model retains its explanatory power even though unem-ployment changes considerably throughout the year It is interestingto note that R2 decays a little bit during the summer which meansthat our variables are less correlated with summer unemploymentFinally unemployment used in the main article is from June 2013ie the last month in the time window used to collect the data
S8 Demographics does not explain unem-ployment
Since unemployment rates are very large for the group of youngpeople a natural question is whether only demographic variablescould explain the heterogeneity of young unemployment rates foundin the geographical areas To test this end we have built four linearmodels the first one (named Youth model in Table 5) is composedby the rate of young population as the only explaining variable thesecond ones are built based on only the Twitter variables consideredin the main text (named Twitter model (I)) or just with those whoseregression coefficients are statistically significant (Twitter model(II)) the third one is fitted with all the variables (named All variablesmodel in Table 5) In table 5 we show the summary of the regressionfor each model Focusing on the explained variance by the model interms of R2 it can be checked that considering all Twitter variables isthree times more explanatory than considering only the young peopleproportion On the other hand the comparison of R2 for the Twittermodel with the one for All variables and Youth model shows that therate of young population does not provide a significant explanatorypower This semi-partial analysis shows that our Twitter variablesretain a high explanatory power when the effect of young populationrate is controlled
S9 Unemployment models for other geo-graphical areas
While municipalities are very heterogeneous demographically otheradministrative areas exist in Spain at large scales that could be usedfor our model of unemployment As mentioned in section 4 thesmallest administrative division of Spain we have considered is thatof the 8200 municipalities At larger scales we have the 326 coun-ties (comarcas in spanish) which are aggregations of municipalitiesFinally the largest geographical scale we considered is defined by50 provinces (provincias in Spanish) In this section we comparethe performance of our Twitter model for unemployment for thevariables defined in those administrative areas and relate it to thegeographical communities detected and used in the main paper (seesection 4) Not all the areas at different administrative divisions areconsidered in the model To minimize the effect of areas in whichthe number of geo-tagged tweets is very small we only consider the1738 municipalities which have a Twitter population π gt 10 Simi-larly we only consider the 198 counties with π gt 100 As we can seein Table 6 the model has a large explanatory power for areas equalor bigger than counties As expected R2 increases as the number ofareas in the model is smaller but the description level of the modelis very low for provinces for example The best performance (highR2 and high geographical description level) is attained at the level ofthe detected communities
S10 Relative importance of the variables
To asses the relative importance of the variables in the unemploymentmodel we have used several methods They all give qualitatively thesame results with some variations for the statistically insignificantvariables Specifically we have use
1 (weight) Relative weight of the absolute values of the coef-ficients obtained in the linear regression when variables arescaled to have mean zero and variance one
2 (lmg) averaging over orderings proposed by LindemanMerenda and Gold
Social media fingerprints of unemployment mdash 1619
0
10
20
30
40
emp fmiss manana rtwpen sio siosocialnames
values
indabscoefffirstlmgpmvd
microemp mrng SuSu
Rela
tive
impo
rtanc
e (
)
0
10
20
30
40
emp fmiss manana rtwpen sio siosocialnames
values
indabscoefffirstlmgpmvd
weight first lmg pvmd
Figure 13 Relative importance of the variables (in per-centage) in the unemployment model for different ways tocalculate it
3 (pmvd) The PMVD metric introduced by Feldman whichan average over orderings as well but with data-dependentweights
4 (first) The univariate R2-values from regression models withone variable only
All these metrics are obtained using the relaimpo R package [12]The results for the young unemployment model are shown in figure13 where we can see that different methods yield to similar rela-tive importance of the variables excepting perhaps for the diversityof mobility flows a variable with a non-significant weight in theregression model
References
[1] B. Ashtakala. Generalized power model for trip distribution. Transportation Research Part B: Methodological, 21(1):59–67, 1987.
[2] Michel Bierlaire. Mathematical models for transportation demand analysis. Transportation Research Part A: Policy and Practice, 31(1):86–86, 1997.
[3] Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.
[4] Harry J. Casey Jr. The law of retail gravitation applied to traffic engineering. Traffic Quarterly, 9(3), 1955.
[5] Aaron Clauset, Mark E. J. Newman, and Cristopher Moore. Finding community structure in very large networks. Physical Review E, 70(6):066111, 2004.
[6] Leon Danon, Albert Díaz-Guilera, Jordi Duch, and Alex Arenas. Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment, 2005(09):P09008, 2005.
[7] Servicio Público de Empleo Estatal (SEPE). Spanish registered unemployment. http://www.sepe.es/contenidos/que_es_el_sepe/estadisticas/index.html
[8] Instituto Nacional de Estadística. Spanish 2011 census. http://www.ine.es/censos2011_datos/cen11_datos_inicio.htm
[9] Suzanne P. Evans. A relationship between the gravity model for trip distribution and the transportation problem in linear programming. Transportation Research, 7(1):39–61, 1973.
[10] Paul Expert, Tim S. Evans, Vincent D. Blondel, and Renaud Lambiotte. Uncovering space-independent communities in spatial networks. Proceedings of the National Academy of Sciences, 108(19):7663–7668, 2011.
[11] Ingo Feinerer, Christian Buchta, Wilhelm Geiger, Johannes Rauch, Patrick Mair, and Kurt Hornik. The textcat package for n-gram based text categorization in R. Journal of Statistical Software, 52(6):1–17, 2013.
[12] Ulrike Grömping. Relative importance for linear regression in R: the package relaimpo. Journal of Statistical Software, 17(1):1–27, 2006.
[13] Bartosz Hawelka, Izabela Sitko, Euro Beinat, Stanislav Sobolevsky, Pavlos Kazakopoulos, and Carlo Ratti. Geo-located Twitter as the proxy for global mobility patterns. arXiv preprint arXiv:1311.0680, 2013.
[14] Yu Liu, Zhengwei Sui, Chaogui Kang, and Yong Gao. Uncovering patterns of inter-urban trips and spatial interactions from check-in data. arXiv preprint arXiv:1310.0282, 2013.
[15] Mark E. J. Newman. Finding community structure in networks using the eigenvectors of matrices. Physical Review E, 74(3):036104, 2006.
[16] Pascal Pons and Matthieu Latapy. Computing communities in large networks using random walks. In Computer and Information Sciences-ISCIS 2005, pages 284–293. Springer, 2005.
[17] Usha Nandini Raghavan, Réka Albert, and Soundar Kumara. Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 76(3):036106, 2007.
[18] Martin Rosvall and Carl T. Bergstrom. Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences, 105(4):1118–1123, 2008.
[19] Morton Schneider. Gravity models and trip distribution theory. Papers in Regional Science, 5(1):51–56, 1959.
[20] Filippo Simini, Marta C. González, Amos Maritan, and Albert-László Barabási. A universal model for mobility and migration patterns. Nature, 484(7392):96–100, 2012.
[21] Chaoming Song, Tal Koren, Pu Wang, and Albert-László Barabási. Modelling the scaling properties of human mobility. Nature Physics, 6(10):818–823, 2010.
[22] Alan Geoffrey Wilson. Entropy in urban and regional modelling. Pion Ltd, 1970.
[23] Alan Geoffrey Wilson. Urban and regional models in geography and planning. 1974.
[24] Svante Wold, Arnold Ruhe, Herman Wold, and W. J. Dunn III. The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses. SIAM Journal on Scientific and Statistical Computing, 5(3):735–743, 1984.
Gravity Model
Parameter  Description                             Spain
α1         Origin exponent                         0.477*** (0.002)
α2         Destination exponent                    0.478*** (0.002)
β          Distance exponent                       1.05*** (0.0035)
R²         Goodness of fit                         0.797
φ          Correlation between T_ij and T^gra_ij   0.826
Table 1. Description of the parameters for the Gravity Law Model in geo-tagged social media data for Spain. (***) means significance p < 0.0001.
NMI between G and Gp for different p
Algorithm  p = 0.01  0.02   0.03   0.04   0.05   0.06   0.07   0.08   0.09   0.1
FG         0.995     0.992  0.989  0.983  0.981  0.977  0.983  0.969  0.980  0.959
WT         0.954     0.959  0.950  0.954  0.945  0.948  0.947  0.935  0.926  0.931
IM         0.988     0.981  0.980  0.981  0.978  0.974  0.975  0.970  0.969  0.966
ML         0.994     0.978  0.979  0.983  0.948  0.934  0.972  0.952  0.973  0.947
LP         0.906     0.908  0.911  0.915  0.895  0.907  0.907  0.893  0.905  0.904
LE         0.960     0.957  0.956  0.859  0.910  0.892  0.908  0.858  0.885  0.884
Table 2. NMI measure comparing G and Gp.
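The NMI values reported in the tables compare two partitions of the same set of nodes. A minimal sketch of one common definition, 2 I(A;B) / (H(A) + H(B)), is given below in Python. The choice of normalization and the function name nmi are assumptions, since the variant used for the tables is not stated here.

```python
from collections import Counter
from math import log


def nmi(labels_a, labels_b):
    """Normalized mutual information 2*I(A;B) / (H(A) + H(B)) of two labelings."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))
    # Entropy of each partition.
    h_a = -sum(c / n * log(c / n) for c in count_a.values())
    h_b = -sum(c / n * log(c / n) for c in count_b.values())
    # Mutual information from the joint counts.
    mutual = sum(
        c / n * log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in count_ab.items()
    )
    if h_a + h_b == 0.0:
        return 1.0  # both partitions put every node in a single community
    return 2.0 * mutual / (h_a + h_b)


# Hypothetical community labels for four nodes.
same_up_to_relabeling = nmi([0, 0, 1, 1], [1, 1, 0, 0])
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])
print(same_up_to_relabeling, unrelated)
```

With this definition, two partitions that differ only in the names of the communities score 1, and statistically independent partitions score 0.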
Communities Stats
Algorithm  〈|Ni|〉i    max|Ni|  |{Ni}|  Modularity  NMI P  NMI C
FG         309.696    1385     23      0.726       0.712  0.590
WT         9.262      433      769     0.417       0.744  0.757
IM         21.011     143      339     0.758       0.770  0.831
ML         323.772    1132     22      0.800       0.717  0.599
LP         22.052     750      323     0.732       0.749  0.761
LE         1017.571   5344     7       0.381       0.264  0.205
Table 3. Statistics of the communities Ni returned by the six algorithms: 〈|Ni|〉i is the mean community size, max|Ni| the size of the largest community and |{Ni}| the number of communities. NMI P refers to the comparison between communities and provinces, whereas NMI C considers counties instead of provinces.
                        All ages   < 24      25−44     > 44
(Intercept)             0.11***    0.10***   0.20***   0.20***
                        (0.02)     (0.03)    (0.03)    (0.035)
Penetration rate        3.23*      8.57***   6.28**    2.40
                        (1.41)     (2.22)    (2.17)    (2.77)
Geographical diversity  0.03       0.15***   0.08*     0.06
                        (0.02)     (0.04)    (0.04)    (0.05)
Social diversity        −0.03*     −0.03     −0.05*    −0.06*
                        (0.01)     (0.02)    (0.02)    (0.03)
Morning activity        −0.69*     −1.30**   −1.53***  −1.19*
                        (0.26)     (0.42)    (0.41)    (0.52)
Misspellers rate        11.56      31.51*    15.46     23.60
                        (8.13)     (12.78)   (12.48)   (15.94)
Employment mentions     −1.80      3.17      −9.94     2.71
                        (6.27)     (9.86)    (9.64)    (12.3)
R²                      0.47       0.64      0.55      0.29
Adj. R²                 0.44       0.62      0.52      0.26
***p < 0.001, **p < 0.01, *p < 0.05
Table 4. Regression table for the different models in which unemployment for different age groups is fitted. The All ages model is the fit to the general rate of unemployment in each geographical area, while the other models are for the rates of unemployment in the groups of less than 24 years, between 25 and 44 years, and above 44 years.
                        All variables  Youth model  Twitter model (I)  Twitter model (II)
(Intercept)             0.06           −0.02        0.10***            0.09***
                        (0.03)         (0.03)       (0.03)             (0.027)
Young pop. rate         0.66*          2.20***
                        (0.30)         (0.35)
Penetration rate        8.20***                     8.57***            8.62***
                        (2.25)                      (2.22)             (2.21)
Geographical diversity  0.14***                     0.15***            0.12***
                        (0.04)                      (0.04)             (0.03)
Social diversity        −0.02                       −0.03
                        (0.02)                      (0.02)
Morning activity        −1.42***                    −1.30**            −1.28**
                        (0.41)                      (0.42)             (0.41)
Misspellers rate        23.95                       31.51*             32.28*
                        (13.09)                     (12.78)            (12.71)
Employment mentions     0.34                        3.17
                        (9.81)                      (9.86)
R²                      0.65           0.24         0.64               0.63
Adj. R²                 0.63           0.24         0.62               0.62
***p < 0.001, **p < 0.01, *p < 0.05
Table 5. Regression table for the different statistical models. The All variables model includes both the Twitter variables and the rate of young population. Twitter model (I) includes only the variables described in the main article, while Twitter model (II) only includes those variables which are significant (p < 0.05) in Twitter model (I).
                        Communities  Municipalities  Counties  Provinces
(Intercept)             0.10***      0.16***         0.11***   0.11*
                        (0.03)       (0.01)          (0.03)    (0.05)
Penetration rate        8.57***      4.01***         9.12***   10.47***
                        (2.22)       (0.59)          (1.81)    (1.97)
Geographical diversity  0.15***      0.02            0.12***   0.08
                        (0.04)       (0.01)          (0.03)    (0.07)
Social diversity        −0.03        −0.01           −0.01     −0.03
                        (0.02)       (0.01)          (0.02)    (0.07)
Morning activity        −1.30**      −1.16***        −1.49***  −1.03
                        (0.42)       (0.14)          (0.39)    (0.88)
Misspellers rate        31.51*       14.40***        14.09
                        (12.78)      (2.51)          (10.02)
Employment mentions     3.17         −0.71           2.41      −3.17
                        (9.86)       (0.89)          (8.86)    (12.29)
Number of points        128          1738            198       50
R²                      0.64         0.22            0.55      0.65
Adj. R²                 0.62         0.21            0.54      0.61
***p < 0.001, **p < 0.01, *p < 0.05
Table 6. Regression table for the unemployment linear regression model at different levels of geographical areas. In the Provinces model the misspellers rate has been removed from the model due to its large collinearity with the penetration rate.
- 1 Social media dataset and functional partition of cities
- 2 Social media behavioral fingerprints
- 3 Explanatory power of social media in unemployment
- 4 Discussion
[Figure 11: two panels plotted against the number of tweets (x axis, 2 to 500): N(miss|tweets) and P(miss|tweets).]
Figure 11. Number (red) and probability (blue) of observed misspellings given the number of tweets.
[Figure 12: R² (y axis, ticks from 0.50 to 0.70) by month (x axis, 2012-10 to 2014-06).]
Figure 12. Explanatory power of the linear regression model when fitted against the unemployment data for different months. The gray (orange) area corresponds to the time window in which the Twitter data is collected and the variables are constructed.
S7 Time window and unemployment
In the definition of the variables we have aggregated the Twitter activity within a 7-month time window (from December 2012 to June 2013). Since unemployment varies significantly over time, we investigate here the correlation and explanatory power of the Twitter variables for the values of unemployment at different months within the time window in which the Twitter data was collected, and whether the variables collected in that time window are more correlated with past or future values of unemployment. Figure 12 shows the explanatory power of the model when the linear regression is done for values of unemployment of different months before, during and after the Twitter data time window. Although there is a small seasonal effect along the year, we see that the explanatory power remains around R² = 0.6, which suggests that our Twitter linear model retains its explanatory power even though unemployment changes considerably throughout the year. It is interesting to note that R² decays slightly during the summer, which means that our variables are less correlated with summer unemployment. Finally, the unemployment used in the main article is from June 2013, i.e. the last month in the time window used to collect the data.
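The procedure described in this section keeps the Twitter variables fixed and refits the regression against the unemployment of each month. A minimal sketch of that loop in Python follows, reduced to a single predictor for brevity. All numbers are synthetic and the names (morning_activity, unemployment_by_month, r_squared_single) are hypothetical; they are not the data or code behind Figure 12.

```python
from statistics import fmean


def r_squared_single(x, y):
    """R^2 of a one-variable linear regression (squared Pearson correlation)."""
    mean_x, mean_y = fmean(x), fmean(y)
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy * s_xy / (s_xx * s_yy)


# Synthetic predictor for five regions, built once from the collection window.
morning_activity = [0.10, 0.14, 0.18, 0.22, 0.26]

# Synthetic unemployment rates of the same five regions in different months.
unemployment_by_month = {
    "2012-12": [0.30, 0.27, 0.25, 0.20, 0.19],
    "2013-06": [0.29, 0.27, 0.24, 0.21, 0.18],
    "2013-08": [0.25, 0.27, 0.21, 0.22, 0.18],
}

# Same predictor, one fit per target month.
r2_by_month = {
    month: r_squared_single(morning_activity, rates)
    for month, rates in unemployment_by_month.items()
}

for month, value in sorted(r2_by_month.items()):
    print(month, round(value, 3))
```

Plotting r2_by_month against the month gives a curve of the kind shown in Figure 12. With several predictors the only change is to replace r_squared_single with the R² of a multiple regression.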
S8 Demographics does not explain unemployment
Since unemployment rates are very large for the group of young people, a natural question is whether demographic variables alone could explain the heterogeneity of young unemployment rates found in the geographical areas. To test this we have built four linear models: the first one (named Youth model in Table 5) has the rate of young population as its only explanatory variable; the second and third ones are built using only Twitter variables, either all those considered in the main text (Twitter model (I)) or just those whose regression coefficients are statistically significant (Twitter model (II)); the fourth one is fitted with all the variables (named All variables model in Table 5). Table 5 shows the summary of the regression for each model. In terms of the variance explained by the model, R², considering all Twitter variables is almost three times more explanatory than considering only the proportion of young people (R² = 0.64 versus 0.24). On the other hand, comparing R² for Twitter model (I) with those of the All variables and Youth models shows that the rate of young population does not add significant explanatory power: including it raises R² only from 0.64 to 0.65. This semi-partial analysis shows that our Twitter variables retain a high explanatory power when the effect of the young population rate is controlled for.
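Because the four models use different numbers of variables, the adjusted R² reported in Table 5 is the fairer basis for comparing them. The short Python sketch below recomputes it from the rounded R² values with the standard formula 1 − (1 − R²)(n − 1)/(n − p − 1). It assumes that all four models are fitted on the n = 128 communities listed in Table 6, which is not stated explicitly for Table 5; the function and variable names are illustrative.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for a model with p predictors fitted on n observations."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


N_AREAS = 128  # assumption: number of communities, taken from Table 6

# (R^2 as printed in Table 5, number of predictors excluding the intercept)
models = {
    "All variables": (0.65, 7),
    "Youth model": (0.24, 1),
    "Twitter model (I)": (0.64, 6),
    "Twitter model (II)": (0.63, 4),
}

adjusted = {
    name: adjusted_r2(r2, N_AREAS, p) for name, (r2, p) in models.items()
}

for name, value in adjusted.items():
    print(f"{name}: {value:.2f}")

# Ratio behind the "almost three times" comparison in the text.
ratio = models["Twitter model (I)"][0] / models["Youth model"][0]
print(f"R^2 ratio, Twitter model (I) over Youth model: {ratio:.2f}")
```

Under that assumption the recomputed values agree with the Adj. R² row of Table 5 to two decimals for the three models that include Twitter variables. For the Youth model the result is 0.23 rather than the tabulated 0.24, a difference that is compatible with the R² input having been rounded.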
Social media fingerprints of unemployment mdash 1619
0
10
20
30
40
emp fmiss manana rtwpen sio siosocialnames
values
indabscoefffirstlmgpmvd
microemp mrng SuSu
Rela
tive
impo
rtanc
e (
)
0
10
20
30
40
emp fmiss manana rtwpen sio siosocialnames
values
indabscoefffirstlmgpmvd
weight first lmg pvmd
Figure 13 Relative importance of the variables (in per-centage) in the unemployment model for different ways tocalculate it
3 (pmvd) The PMVD metric introduced by Feldman whichan average over orderings as well but with data-dependentweights
4 (first) The univariate R2-values from regression models withone variable only
All these metrics are obtained using the relaimpo R package [12]The results for the young unemployment model are shown in figure13 where we can see that different methods yield to similar rela-tive importance of the variables excepting perhaps for the diversityof mobility flows a variable with a non-significant weight in theregression model
References
[1] B Ashtakala Generalized power model for trip distributionTransportation Research Part B Methodological 21(1)59ndash671987
[2] Michel Bierlaire Mathematical models for transportation de-mand analysis Transportation research Part A Policy andpractice 31(1)86ndash86 1997
[3] Vincent D Blondel Jean-Loup Guillaume Renaud Lambiotteand Etienne Lefebvre Fast unfolding of communities in largenetworks Journal of Statistical Mechanics Theory and Exper-iment 2008(10)P10008 2008
[4] Harry J Casey Jr The law of retail gravitation applied to trafficengineering Traffic Quarterly 9(3) 1955
[5] Aaron Clauset Mark EJ Newman and Cristopher Moore Find-ing community structure in very large networks Physicalreview E 70(6)066111 2004
[6] Leon Danon Albert Diaz-Guilera Jordi Duch and AlexArenas Comparing community structure identificationJournal of Statistical Mechanics Theory and Experiment2005(09)P09008 2005
[7] Servicio Publico de Empleo Estatal (SEPE) Spanish registeredunemployment httpwwwsepeescontenidosque_
es_el_sepeestadisticasindexhtml[8] Instituto Nacional de Estadıstica Spanish 2011 cen-
sus httpwwwineescensos2011_datoscen11_
datos_iniciohtm|[9] Suzanne P Evans A relationship between the gravity model
for trip distribution and the transportation problem in linearprogramming Transportation Research 7(1)39ndash61 1973
[10] Paul Expert Tim S Evans Vincent D Blondel and RenaudLambiotte Uncovering space-independent communities inspatial networks Proceedings of the National Academy ofSciences 108(19)7663ndash7668 2011
[11] Ingo Feinerer Christian Buchta Wilhelm Geiger JohannesRauch Patrick Mair and Kurt Hornik The textcat packagefor n-gram based text categorization in r Journal of StatisticalSoftware 52(6)1ndash17 2013
[12] Ulrike Gromping Relative importance for linear regressionin r the package relaimpo Journal of statistical software17(1)1ndash27 2006
[13] Bartosz Hawelka Izabela Sitko Euro Beinat StanislavSobolevsky Pavlos Kazakopoulos and Carlo Ratti Geo-located twitter as the proxy for global mobility patterns arXivpreprint arXiv13110680 2013
[14] Yu Liu Zhengwei Sui Chaogui Kang and Yong Gao Uncov-ering patterns of inter-urban trips and spatial interactions fromcheck-in data arXiv preprint arXiv13100282 2013
[15] Mark EJ Newman Finding community structure in net-works using the eigenvectors of matrices Physical reviewE 74(3)036104 2006
[16] Pascal Pons and Matthieu Latapy Computing communities inlarge networks using random walks In Computer and Informa-tion Sciences-ISCIS 2005 pages 284ndash293 Springer 2005
[17] Usha Nandini Raghavan Reka Albert and Soundar KumaraNear linear time algorithm to detect community structures inlarge-scale networks Physical Review E 76(3)036106 2007
[18] Martin Rosvall and Carl T Bergstrom Maps of random walkson complex networks reveal community structure Proceedingsof the National Academy of Sciences 105(4)1118ndash1123 2008
[19] Morton Schneider Gravity models and trip distribution theoryPapers in Regional Science 5(1)51ndash56 1959
[20] Filippo Simini Marta C Gonzalez Amos Maritan and Albert-Laszlo Barabasi A universal model for mobility and migrationpatterns Nature 484(7392)96ndash100 2012
[21] Chaoming Song Tal Koren Pu Wang and Albert-LaszloBarabasi Modelling the scaling properties of human mobilityNature Physics 6(10)818ndash823 2010
[22] Alan Geoffrey Wilson Entropy in urban and regional mod-elling Pion Ltd 1970
[23] Alan Geoffrey Wilson Urban and regional models in geogra-phy and planning 1974
[24] Svante Wold Arnold Ruhe Herman Wold and WJ Dunn IIIThe collinearity problem in linear regression the partial leastsquares (pls) approach to generalized inverses SIAM Journalon Scientific and Statistical Computing 5(3)735ndash743 1984
Social media fingerprints of unemployment mdash 1719
Gravity ModelParameter Description Spain
α1 Origin exponent 0477lowastlowastlowast(0002)α2 Destination exponent 0478lowastlowastlowast(0002)β Distance exponent 105lowastlowastlowast(00035)R2 Goodness of fit 0797φ Correlation between Ti j and T gra
i j 0826
Table 1 Description of the parameters for the Gravity Law Model in geo-tagged social media data for Spain (lowast lowast lowast) meanssignificance p lt 00001
NMI between G and Gp for different pAlgorithm p = 001 002 003 004 005 006 007 008 009 01
FG 0995 0992 0989 0983 0981 0977 0983 0969 0980 0959WT 0954 0959 0950 0954 0945 0948 0947 0935 0926 0931IM 0988 0981 0980 0981 0978 0974 0975 0970 0969 0966ML 0994 0978 0979 0983 0948 0934 0972 0952 0973 0947LP 0906 0908 0911 0915 0895 0907 0907 0893 0905 0904LE 0960 0957 0956 0859 0910 0892 0908 0858 0885 0884
Table 2 NMI measure comparing G and Gp
Communities StatsAlgorithm 〈|Ni|〉i max|Ni| |Ni| Modularity NMI P NMI C
FG 309696 1385 23 0726 0712 0590WT 9262 433 769 0417 0744 0757IM 21011 143 339 0758 0770 0831ML 323772 1132 22 0800 0717 0599LP 22052 750 323 0732 0749 0761LE 1017571 5344 7 0381 0264 0205
Table 3 Statistics of the communities Ni returned by the six algorithms NMI P refers to the comparison between communitiesand provinces whereas NMI C considers counties instead of provinces
Social media fingerprints of unemployment mdash 1819
All ages lt 24 25minus44 gt 44(Intercept) 011lowastlowastlowastlowast 010lowastlowastlowast 020lowastlowastlowast 020lowastlowastlowast
(002) (003) (003) (0035)Penetration rate 323lowast 857lowastlowastlowast 628lowastlowast 240
(141) (222) (217) (277)Geographical diversity 003 015lowastlowastlowast 008lowast 006
(002) (004) (004) (005)Social diversity minus003lowast minus003 minus005lowast minus006lowast
(001) (002) (002) (003)Morning activity minus069lowast minus130lowastlowast minus153lowastlowastlowast minus119lowast
(026) (042) (041) (052)Misspellers rate 1156 3151lowast 1546 2360
(813) (1278) (1248) (1594)Employment mentions minus180 317 minus994 271
(627) (986) (964) (123)R2 047 064 055 029Adj R2 044 062 052 026lowastlowastlowastp lt 0001 lowastlowastp lt 001 lowastp lt 005
Table 4 Regression table for the different models in which unemployment for different age groups is fitted The All ages model isthe fit to the general rate of unemployment in each geographical area while the other models are for the rates of unemployment ingroups of less than 24 years between 25 and 44 years and above 44 years
Gravity model
Parameter  Description                            Spain
α1         Origin exponent                        0.477*** (0.002)
α2         Destination exponent                   0.478*** (0.002)
β          Distance exponent                      1.05*** (0.0035)
R2         Goodness of fit                        0.797
φ          Correlation between T_ij and T_ij^gra  0.826
Table 1. Description of the parameters for the Gravity Law Model in geo-tagged social media data for Spain. (***) means significance p < 0.0001.
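As a rough illustration of how the Table 1 exponents could be used, the sketch below evaluates a gravity-type flow. The functional form (a product of origin and destination mass terms divided by a power of distance), the use of population as the mass term, the `scale` constant and the sample inputs are all assumptions made for illustration; the table only reports the three exponents, read here as 0.477, 0.478 and 1.05.

```python
# Exponents as read from Table 1 (decimal points assumed lost in extraction).
ALPHA_ORIGIN = 0.477
ALPHA_DESTINATION = 0.478
BETA_DISTANCE = 1.05


def gravity_flow(pop_origin, pop_destination, distance_km, scale=1.0):
    """Assumed gravity form: scale * P_i^alpha1 * P_j^alpha2 / d_ij^beta."""
    if distance_km <= 0:
        raise ValueError("distance must be positive")
    masses = pop_origin ** ALPHA_ORIGIN * pop_destination ** ALPHA_DESTINATION
    return scale * masses / distance_km ** BETA_DISTANCE


# Hypothetical populations and distances, for illustration only.
near = gravity_flow(200_000, 50_000, distance_km=10.0)
far = gravity_flow(200_000, 50_000, distance_km=20.0)
print(far / near)  # ratio of the two predicted flows
```

Under this assumed form, doubling the distance multiplies the predicted flow by 2 to the power of -1.05, so a distance exponent close to 1 means flows decay roughly in inverse proportion to distance.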
NMI between G and Gp for different p
Algorithm  p = 0.01  0.02   0.03   0.04   0.05   0.06   0.07   0.08   0.09   0.1
FG         0.995     0.992  0.989  0.983  0.981  0.977  0.983  0.969  0.980  0.959
WT         0.954     0.959  0.950  0.954  0.945  0.948  0.947  0.935  0.926  0.931
IM         0.988     0.981  0.980  0.981  0.978  0.974  0.975  0.970  0.969  0.966
ML         0.994     0.978  0.979  0.983  0.948  0.934  0.972  0.952  0.973  0.947
LP         0.906     0.908  0.911  0.915  0.895  0.907  0.907  0.893  0.905  0.904
LE         0.960     0.957  0.956  0.859  0.910  0.892  0.908  0.858  0.885  0.884
Table 2. NMI measure comparing G and Gp.
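For readers unfamiliar with the measure, the following is a minimal sketch of normalized mutual information (NMI) between two partitions of the same set of nodes. The normalization used here, 2 I(A;B) / (H(A) + H(B)), is one common choice and is an assumption; the table does not state which variant was used. The function names and the toy label lists are hypothetical.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy of a partition given as one community label per node."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def nmi(labels_a, labels_b):
    """NMI = 2 * I(A;B) / (H(A) + H(B)); assumed normalization."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same nodes")
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mutual_info = sum(
        (c / n) * log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    total_entropy = entropy(labels_a) + entropy(labels_b)
    if total_entropy == 0:
        return 1.0  # both partitions put every node in a single community
    return 2 * mutual_info / total_entropy


# Toy partitions of four nodes (hypothetical data).
same_up_to_relabeling = nmi([0, 0, 1, 1], [5, 5, 7, 7])
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

With this definition, two partitions that differ only in how the communities are named score 1, and partitions that carry no information about each other score 0, which is why values close to 1 in the table indicate that the detected communities are stable.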
Communities stats
Algorithm  〈|Ni|〉i   max|Ni|  |Ni|  Modularity  NMI P  NMI C
FG         309.696   1385     23    0.726       0.712  0.590
WT         9.262     433      769   0.417       0.744  0.757
IM         21.011    143      339   0.758       0.770  0.831
ML         323.772   1132     22    0.800       0.717  0.599
LP         22.052    750      323   0.732       0.749  0.761
LE         1017.571  5344     7     0.381       0.264  0.205
Table 3. Statistics of the communities Ni returned by the six algorithms. NMI P refers to the comparison between communities and provinces, whereas NMI C considers counties instead of provinces.
                        All ages   < 24       25-44      > 44
(Intercept)             0.11***    0.10***    0.20***    0.20***
                        (0.02)     (0.03)     (0.03)     (0.035)
Penetration rate        3.23*      8.57***    6.28**     2.40
                        (1.41)     (2.22)     (2.17)     (2.77)
Geographical diversity  0.03       0.15***    0.08*      0.06
                        (0.02)     (0.04)     (0.04)     (0.05)
Social diversity        -0.03*     -0.03      -0.05*     -0.06*
                        (0.01)     (0.02)     (0.02)     (0.03)
Morning activity        -0.69*     -1.30**    -1.53***   -1.19*
                        (0.26)     (0.42)     (0.41)     (0.52)
Misspellers rate        11.56      31.51*     15.46      23.60
                        (8.13)     (12.78)    (12.48)    (15.94)
Employment mentions     -1.80      3.17       -9.94      2.71
                        (6.27)     (9.86)     (9.64)     (12.3)
R2                      0.47       0.64       0.55       0.29
Adj. R2                 0.44       0.62       0.52       0.26
***p < 0.001, **p < 0.01, *p < 0.05
Table 4. Regression table for the models fitted to unemployment in different age groups. The All ages model is the fit to the general rate of unemployment in each geographical area, while the other models fit the rates of unemployment in the groups aged under 24 years, between 25 and 44 years, and above 44 years.
                        All variables  Youth model  Twitter model (I)  Twitter model (II)
(Intercept)             0.06           -0.02        0.10***            0.09***
                        (0.03)         (0.03)       (0.03)             (0.027)
Young pop. rate         0.66*          2.20***
                        (0.30)         (0.35)
Penetration rate        8.20***                     8.57***            8.62***
                        (2.25)                      (2.22)             (2.21)
Geographical diversity  0.14***                     0.15***            0.12***
                        (0.04)                      (0.04)             (0.03)
Social diversity        -0.02                       -0.03
                        (0.02)                      (0.02)
Morning activity        -1.42***                    -1.30**            -1.28**
                        (0.41)                      (0.42)             (0.41)
Misspellers rate        23.95                       31.51*             32.28*
                        (13.09)                     (12.78)            (12.71)
Employment mentions     0.34                        3.17
                        (9.81)                      (9.86)
R2                      0.65           0.24         0.64               0.63
Adj. R2                 0.63           0.24         0.62               0.62
***p < 0.001, **p < 0.01, *p < 0.05
Table 5. Regression table for the different statistical models. The All variables model includes both the Twitter variables and the rate of young population. Twitter model (I) includes only the variables described in the main article, while Twitter model (II) includes only those variables which are significant (p < 0.05) in Twitter model (I).
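The significance stars in these regression tables can be approximately re-derived from each coefficient and the value printed in parentheses beneath it. The sketch below assumes the parenthesized values are standard errors and uses a normal approximation to the t distribution to get a two-sided p-value; both are assumptions, as are the helper names. The inputs are the Twitter model (I) entries, read with decimal points restored.

```python
from math import erfc, sqrt


def two_sided_p(coef, std_err):
    """Two-sided p-value of coef / std_err under a normal approximation."""
    z = abs(coef / std_err)
    return erfc(z / sqrt(2))


def stars(p_value):
    """Map a p-value to the legend used in the tables."""
    if p_value < 0.001:
        return "***"
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    return ""


# (coefficient, assumed standard error) for Twitter model (I).
twitter_model_1 = {
    "Penetration rate": (8.57, 2.22),
    "Geographical diversity": (0.15, 0.04),
    "Social diversity": (-0.03, 0.02),
    "Morning activity": (-1.30, 0.42),
    "Misspellers rate": (31.51, 12.78),
    "Employment mentions": (3.17, 9.86),
}

labels = {
    name: stars(two_sided_p(coef, se))
    for name, (coef, se) in twitter_model_1.items()
}
```

Under these assumptions the computed labels agree with the stars printed for Twitter model (I), and they show why Twitter model (II) keeps exactly the four starred variables.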
                        Communities  Municipalities  Counties   Provinces
(Intercept)             0.10***      0.16***         0.11***    0.11*
                        (0.03)       (0.01)          (0.03)     (0.05)
Penetration rate        8.57***      4.01***         9.12***    10.47***
                        (2.22)       (0.59)          (1.81)     (1.97)
Geographical diversity  0.15***      0.02            0.12***    0.08
                        (0.04)       (0.01)          (0.03)     (0.07)
Social diversity        -0.03        -0.01           -0.01      -0.03
                        (0.02)       (0.01)          (0.02)     (0.07)
Morning activity        -1.30**      -1.16***        -1.49***   -1.03
                        (0.42)       (0.14)          (0.39)     (0.88)
Misspellers rate        31.51*       14.40***        14.09
                        (12.78)      (2.51)          (10.02)
Employment mentions     3.17         -0.71           2.41       -3.17
                        (9.86)       (0.89)          (8.86)     (12.29)
Number of points        128          1738            198        50
R2                      0.64         0.22            0.55       0.65
Adj. R2                 0.62         0.21            0.54       0.61
***p < 0.001, **p < 0.01, *p < 0.05
Table 6. Regression table for the unemployment linear regression model at different levels of geographical aggregation. In the Provinces model the misspellers rate has been removed because of its large collinearity with the penetration rate.
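The relation between the R2 and adjusted R2 rows of Table 6 can be checked with the usual adjustment for the number of predictors. The sketch below assumes the standard formula, 1 - (1 - R2)(n - 1)/(n - k - 1), with k = 6 Twitter variables (k = 5 for Provinces, where the misspellers rate is dropped) and with R2 read as 0.64, 0.55 and 0.65; the function and variable names are illustrative.

```python
def adjusted_r2(r2, n_points, n_predictors):
    """Standard adjusted R2 for a linear model with an intercept."""
    return 1.0 - (1.0 - r2) * (n_points - 1) / (n_points - n_predictors - 1)


# (R2, number of points, number of predictors) as read from Table 6.
table6 = {
    "Communities": (0.64, 128, 6),
    "Counties": (0.55, 198, 6),
    "Provinces": (0.65, 50, 5),
}

adjusted = {
    level: round(adjusted_r2(r2, n, k), 2)
    for level, (r2, n, k) in table6.items()
}
```

These three values match the adjusted R2 row of the table. For Municipalities the same formula gives 0.22 from the rounded R2 of 0.22, against 0.21 in the table; a gap of that size is within what rounding of the reported R2 can produce.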
- 1 Social media dataset and functional partition of cities
- 2 Social media behavioral fingerprints
- 3 Explanatory power of social media in unemployment
- 4 Discussion