Presentation Final



Transcript of Presentation Final

ANALYSIS OF GSM BIG DATA IN CONJUNCTION WITH TWITTER DATA FOR UNDERSTANDING SOCIAL BEHAVIOURS IN DAKAR, SENEGAL

Tianyu (Leo) Liu

2. Calculate the hourly centroid of traffic for the antennas that belong to the same community, using the Mean Center tool
• 744 centroids are calculated for each community for January (31 days * 24 hours/day = 744 hours)
• The traffic at an antenna can also be read as the number of people around that antenna, since a cellphone automatically connects to the antenna that provides the strongest signal. In most cases, this is the antenna closest to the user’s cellphone
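
The centroid in step 2 can be sketched in plain Python as a traffic-weighted mean of antenna coordinates. This is only an illustration, assuming that hourly traffic is used as the weight; the function name and the (x, y, traffic) input layout are hypothetical and are not the Mean Center tool itself.

```python
def weighted_mean_center(antennas):
    """Return the traffic-weighted mean center of (x, y, traffic) tuples.

    Assumption: traffic acts as the weight of each antenna location.
    Returns None when the community has no traffic in that hour.
    """
    antennas = list(antennas)
    total = sum(traffic for _, _, traffic in antennas)
    if total == 0:
        return None
    center_x = sum(x * traffic for x, _, traffic in antennas) / total
    center_y = sum(y * traffic for _, y, traffic in antennas) / total
    return center_x, center_y


# Two hypothetical antennas: the busier one pulls the centroid towards it.
print(weighted_mean_center([(0.0, 0.0, 1), (10.0, 0.0, 3)]))
```

Running this once per community and per hour would give the 744 January centroids described above.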

3. Calculate the 99.7% probability region of the population centroid for each community
• Create a standard deviation ellipse (3 SD) on the assumption that the centroids of each community are normally distributed, using the Standard Distance tool
• Create a standard deviation ellipse (3 SD) on the assumption that the centroids of each community follow a directional distribution, using the Directional Distribution tool
• Evaluate the computed ellipses for each community to find the 99.7% probability region of each community's centroid
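
A simplified version of the 3 SD test in step 3 is sketched below. It covers only the standard distance case, which summarises the spread of the centroids as one radius around their mean center; the rotated ellipse of the Directional Distribution tool is not reproduced. The function names and the k parameter are hypothetical.

```python
import math


def standard_distance(centroids):
    """Return (mean_center, standard_distance) for a list of (x, y) centroids."""
    n = len(centroids)
    mean_x = sum(x for x, _ in centroids) / n
    mean_y = sum(y for _, y in centroids) / n
    # Mean squared deviation in x plus mean squared deviation in y.
    variance = sum((x - mean_x) ** 2 + (y - mean_y) ** 2 for x, y in centroids) / n
    return (mean_x, mean_y), math.sqrt(variance)


def is_outside(point, center, sd, k=3.0):
    """True when the point lies farther than k standard distances from the center."""
    return math.hypot(point[0] - center[0], point[1] - center[1]) > k * sd


# Hypothetical centroids on the corners of a square.
center, sd = standard_distance([(0, 0), (2, 0), (0, 2), (2, 2)])
print(center, sd, is_outside((10, 10), center, sd))
```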

4. Identify unusual social behaviours
• If a community's centroid falls outside the 99.7% probability region, the movement of people in that hour signals an unusual social behaviour
• If the centroids of the following two hours (or three hours, if the unusual behaviour occurs between midnight and 7 a.m.) also fall outside the 99.7% probability region, the unusual social behaviour is confirmed
• A tool scripted in Python selects the centroids that fall outside their community’s 99.7% probability region for three (or four) consecutive hours. It also reports the day and hour at which this occurs
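
The selection rule in step 4 could look roughly like the sketch below. It is not the poster's script. It assumes one True/False flag per hour of the month in chronological order, and it reads "midnight to 7 a.m." as start hours 0 to 6; both the function name and that reading are assumptions.

```python
def confirmed_unusual_hours(outside, night_start=0, night_end=7):
    """Return (day, hour) pairs at which an unusual behaviour is confirmed.

    outside[t] is True when the centroid of hour t (0 = day 1, 00:00) falls
    outside the probability region. An hour is confirmed when it and the
    following hours form a run of 3 outside centroids, or 4 when the start
    hour lies in the assumed night window [night_start, night_end).
    """
    confirmed = []
    for t, flag in enumerate(outside):
        if not flag:
            continue
        hour = t % 24
        needed = 4 if night_start <= hour < night_end else 3
        window = outside[t:t + needed]
        if len(window) == needed and all(window):
            confirmed.append((t // 24 + 1, hour))
    return confirmed


# Hypothetical month: outside on day 1 from 03:00 to 07:00 and day 2 from 12:00 to 14:00.
flags = [False] * 744
for t in list(range(3, 8)) + list(range(36, 39)):
    flags[t] = True
print(confirmed_unusual_hours(flags))
```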

Contact
Tianyu (Leo) Liu
University of New Brunswick
Email: [email protected]

Acknowledgements
Dr. Monica Wachowicz (Supervisor)
Dr. Emmanuel Stefanakis
Lola Arteaga
David Fraser
Orange France


Introduction

With the development of communication and positioning technologies, we are entering the era of big data. Every minute in 2014, people searched Google 4,000,000 times and sent 280,000 tweets, and mobile networks received around 400 new subscriptions. Almost everything we do nowadays leaves a geolocated digital footprint, and the traces of these footprints reveal collective mobility patterns.

Our main research goal is to use GSM big data and geolocated Twitter data from the same region and time frame to discover the collective mobility patterns of communities. This can improve decision makers’ understanding of unusual social behaviours, defined here as socio-mobility patterns that are relatively infrequent or atypical of the population.

• Firstly, a sentiment and language preference analysis is proposed for community building, based on a lexicon method and clustering procedures. The assumption is that people interacting within the same community share some common characteristics, such as language preference and sentiment.

• Secondly, the GSM big data is used to model the hourly mobility patterns of the people who live in each community built in the first step.

• Finally, hourly centroids of the socio-mobility patterns are computed for each community in each month.

• If the population centroids of a community fall outside the standard deviation ellipse of the spatial distribution of its hourly centroids for consecutive hours, an unusual behaviour is flagged.

The proposed methodology was implemented using geolocated tweets retrieved for the city of Dakar, Senegal, over a period of 6 months, together with hourly GSM big data for the entire year, kindly provided by Orange France within the D4D Challenge framework.

Twitter Data

1. Determine people’s language preference clusters in Dakar
• Create lexicons containing the most commonly used English and French words
• Compare the content of each tweet, word by word, against both the English and French lexicons to determine the tweet's language preference
• Check clustering tendency using the Global Moran's I tool
• Find the optimal distance threshold using the Incremental Spatial Autocorrelation tool
• Build clusters using the Anselin Local Moran's I tool
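
The word-by-word lexicon comparison in step 1 might be sketched as follows. The two tiny word sets are placeholders, not the lexicons built for the poster, and the tie rule ("undetermined") is an assumption.

```python
import re

# Placeholder lexicons for illustration only.
ENGLISH_WORDS = {"the", "and", "is", "to", "of", "you"}
FRENCH_WORDS = {"le", "la", "et", "est", "de", "je", "les"}


def language_preference(tweet, english_words, french_words):
    """Label a tweet by counting its words found in each lexicon."""
    # Extract runs of letters (accented letters included), lower-cased.
    tokens = re.findall(r"[^\W\d_]+", tweet.lower())
    english_hits = sum(1 for token in tokens if token in english_words)
    french_hits = sum(1 for token in tokens if token in french_words)
    if english_hits > french_hits:
        return "english"
    if french_hits > english_hits:
        return "french"
    return "undetermined"  # assumed handling of ties and unmatched tweets


print(language_preference("Je suis à la plage et le soleil est là",
                          ENGLISH_WORDS, FRENCH_WORDS))
```

Each labelled tweet keeps its coordinates, so the labels can then feed the spatial autocorrelation tools listed above.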

2. Determine positive and negative tweet clusters in Dakar
• Translate all the tweets into English
• Use the SentiWordNet 3.0.0 lexicon to evaluate each word’s sentiment
• Check clustering tendency using the Global Moran's I tool
• Find the optimal distance threshold using the Incremental Spatial Autocorrelation tool
• Build clusters using the Anselin Local Moran's I tool
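
The word-level scoring in step 2 can be illustrated with a toy lexicon. The scores below are invented for the example and are not SentiWordNet 3.0.0 values; summing positive minus negative scores per tweet is also an assumption about how word scores are combined.

```python
# Toy lexicon: word -> (positive score, negative score). Values are made up.
TOY_LEXICON = {
    "good": (0.75, 0.0),
    "happy": (0.875, 0.0),
    "bad": (0.0, 0.625),
    "sad": (0.125, 0.75),
}


def tweet_sentiment(text, lexicon):
    """Return (score, label) for an English tweet using a word-level lexicon."""
    score = 0.0
    for word in text.lower().split():
        # Words missing from the lexicon contribute nothing.
        positive, negative = lexicon.get(word.strip(".,!?;:\"'"), (0.0, 0.0))
        score += positive - negative
    if score > 0:
        label = "positive"
    elif score < 0:
        label = "negative"
    else:
        label = "neutral"
    return score, label


print(tweet_sentiment("Happy good day!", TOY_LEXICON))
```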

3. Community building
• Overlay the clustering maps generated in Objectives 1 and 2 on existing municipal subdivision maps
• Select the subdivision scheme that best reflects the overlapping clusters, or create boundaries to separate different clusters

GSM Data

1. Aggregate the GSM data into hourly intervals to compute the hourly traffic for each antenna (the table below shows an example from January’s data)
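
One way to picture this aggregation is the sketch below, which turns antenna-to-antenna records into per-antenna hourly totals. How the poster defines an antenna's traffic is not stated, so the rule used here (a call counts towards both its outgoing and its incoming antenna, and only once when they are the same) is an assumption, as is the function name.

```python
from collections import defaultdict


def hourly_antenna_traffic(records):
    """Sum calls per (antenna, hour) from (time, out_antenna, in_antenna, calls) rows."""
    traffic = defaultdict(int)
    for time, out_antenna, in_antenna, calls in records:
        traffic[(out_antenna, time)] += calls
        # Assumption: avoid double counting calls that stay on one antenna.
        if in_antenna != out_antenna:
            traffic[(in_antenna, time)] += calls
    return dict(traffic)


# The three example rows of the raw data table.
rows = [
    ("2013-01-01 00", 1, 1, 1),
    ("2013-01-01 00", 1, 1654, 8),
    ("2013-01-01 00", 1, 186, 22),
]
print(hourly_antenna_traffic(rows))
```

With only three sample rows the totals are not expected to match the full hourly table.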

Objectives and Methods

GSM big data:
• Hourly antenna-to-antenna traffic in Senegal
• 1666 antennas in total
• Antennas’ latitude and longitude are given
• Data recorded for 1 year (2013)
• Data contains more than 14,400,000,000 rows
• Table above is an example of the data, which can be translated into the figure on the right
• Provided by Orange France within the Data For Development Challenge

Twitter Data
• 32,536 geo-tagged tweets in the city of Dakar
• From February 4th to July 25th, 2013

Raw Data

[Figure: example of raw antenna-to-antenna traffic between antennas 1, 1654 and 186; scale label: 1 call]

Results
• The results shown on the right indicate that unusual behaviour occurred when social events took place in Dakar. Examples include:

• Jan 1, 2013 - New Year's Eve fireworks on Gorée Island
Influenced communities: No. 6 (3-8 a.m.), No. 8 (4-7 a.m.)
A large number of people made their way down to the coastline at the south end of the city to watch the fireworks show. After around 1:00 a.m., people started to move back to residential and bar areas

• Jan 24, 2013 - Mawlid: Prophet Muhammad's Birthday
Influenced communities: No. 4 (0-7 a.m.), No. 7 (0-8 a.m.), No. 16 (1-5 a.m.), No. 17 (4-7 a.m.)
Since 95% of Senegal’s population is Muslim, the Prophet Muhammad’s Birthday is considered one of the most important days in Senegal. People gather at local places of worship at midnight to celebrate and pray

Conclusions
• Unusual social behaviour can be discovered in a timely manner by analyzing GSM and Twitter data, which can potentially benefit fields such as marketing, public health and urban planning
• This research demonstrates the potential of using ESRI ArcPy to process and analyze big data with the average computing power of a personal computer

Results and Conclusions

Time           Out_Antenna  In_Antenna  Num_of_Calls
2013-01-01 00  1            1           1
2013-01-01 00  1            1654        8
2013-01-01 00  1            186         22
…              …            …           …

Rows: Antenna; columns: Day/Hr.

Antenna  1/0   1/1   1/2   1/3   1/4   1/5   1/6   1/7   1/8   1/9   …  31/23
1        13    12    6     2     1     2     1     2     8     9     …  10
2        840   350   296   192   206   115   51    31    56    75    …  215
3        153   63    59    53    24    25    16    18    50    52    …  81
4        391   192   154   57    56    21    27    21    72    121   …  270
5        2273  2052  2429  1997  1535  1087  533   178   167   280   …  708
6        1236  657   540   362   267   141   81    61    132   245   …  927
7        0     0     0     0     0     0     0     0     0     0     …  0
8        1569  2117  2704  2481  1476  1091  488   92    53    75    …  331
9        2994  3018  3739  3726  2586  1945  669   363   940   1707  …  525
10       1383  964   1237  960   696   359   159   63    123   174   …  583
…        …     …     …     …     …     …     …     …     …     …     …  …
1666     9     1     3     1     0     0     0     18    30    95    …  17

Centroid  XCoord        YCoord       Community  Hour  Day
57        -1943298.453  1648243.086  6          3     1
74        -1943272.931  1648225.662  6          4     1
91        -1943415.873  1648301.804  6          5     1
108       -1943348.489  1648253.616  6          6     1
125       -1943618.741  1648468.230  6          7     1
76        -1943942.523  1646592.451  8          4     1
93        -1943963.149  1646592.233  8          5     1
110       -1943953.637  1646659.011  8          6     1
9388      -1940613.685  1651019.540  4          0     24
9405      -1940502.412  1651102.546  4          1     24
9422      -1940508.567  1651107.019  4          2     24
9439      -1940592.240  1651077.932  4          3     24
9456      -1940650.085  1651015.970  4          4     24
9473      -1940617.702  1651045.504  4          5     24
9490      -1940694.016  1650981.382  4          6     24
9391      -1945453.157  1645126.536  7          0     24
9408      -1945562.481  1645129.617  7          1     24
9425      -1945626.517  1645146.684  7          2     24
9442      -1945637.427  1645118.731  7          3     24
9459      -1945617.967  1645106.984  7          4     24
9476      -1945624.254  1645132.200  7          5     24
9493      -1945559.513  1645091.225  7          6     24
9510      -1945340.277  1645119.125  7          7     24
9417      -1942515.231  1641481.918  16         1     24
9434      -1942521.903  1641440.494  16         2     24
9451      -1942532.122  1641497.490  16         3     24
9468      -1942518.896  1641514.979  16         4     24
9452      -1941511.973  1640256.701  17         3     24
9469      -1941569.272  1640306.845  17         4     24
9486      -1941503.265  1640310.359  17         5     24
9503      -1941519.272  1640375.256  17         6     24