Clustering Weekly Patterns of Human Mobility Through ...

15
HAL Id: hal-01992673 https://hal.archives-ouvertes.fr/hal-01992673 Submitted on 24 Jan 2019 HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers. L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés. Clustering Weekly Patterns of Human Mobility Through Mobile Phone Data Etienne Thuillier, Laurent Moalic, Sid Ahmed Lamrous, Alexandre Caminada To cite this version: Etienne Thuillier, Laurent Moalic, Sid Ahmed Lamrous, Alexandre Caminada. Clustering Weekly Pat- terns of Human Mobility Through Mobile Phone Data. IEEE Transactions on Mobile Computing, In- stitute of Electrical and Electronics Engineers, 2018, 17 (4), pp.817-830. 10.1109/TMC.2017.2742953. hal-01992673

Transcript of Clustering Weekly Patterns of Human Mobility Through ...

HAL Id: hal-01992673https://hal.archives-ouvertes.fr/hal-01992673

Submitted on 24 Jan 2019

HAL is a multi-disciplinary open accessarchive for the deposit and dissemination of sci-entific research documents, whether they are pub-lished or not. The documents may come fromteaching and research institutions in France orabroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, estdestinée au dépôt et à la diffusion de documentsscientifiques de niveau recherche, publiés ou non,émanant des établissements d’enseignement et derecherche français ou étrangers, des laboratoirespublics ou privés.

Clustering Weekly Patterns of Human Mobility ThroughMobile Phone Data

Etienne Thuillier, Laurent Moalic, Sid Ahmed Lamrous, Alexandre Caminada

To cite this version:Etienne Thuillier, Laurent Moalic, Sid Ahmed Lamrous, Alexandre Caminada. Clustering Weekly Pat-terns of Human Mobility Through Mobile Phone Data. IEEE Transactions on Mobile Computing, In-stitute of Electrical and Electronics Engineers, 2018, 17 (4), pp.817-830. �10.1109/TMC.2017.2742953�.�hal-01992673�

IEEE TRANSACTIONS ON MOBILE COMPUTING 1

Clustering weekly patterns of human mobilitythrough mobile phone data

Etienne Thuillier1, Laurent Moalic1, Sid Lamrous1, Alexandre Caminada1

1Univ. Bourgogne Franche-Comte, UTBM, OPERA, F-90100, Belfort, France

With the rapid growth of cell phone networks during the last decades, call detail records (CDR) have been used as approximateindicators for large scale studies on human and urban mobility. Although coarse and limited, CDR are a real marker of humanpresence. In this paper, we use more than 800 million of CDR to identify weekly patterns of human mobility through mobile phonedata. Our methodology is based on the classification of individuals into six distinct presence profiles where we focus on the inherenttemporal and geographical characteristics of each profile within a territory. Then, we use an event-based algorithm to clusterindividuals and we identify 12 weekly patterns. We leverage these results to analyze population estimates adjustment processes andas a result, we propose new indicators to characterize the dynamics of a territory. Our model has been applied to real data comingfrom more than 1.6 million individuals and demonstrates its relevance. The product of our work can be used by local authoritiesfor human mobility analysis and urban planning.

Index Terms—Clustering algorithm, Human mobility, Mobile communications, Territory dynamics, User profiles

I. INTRODUCTION

The use of cell phone network data for urban and humanmobility studies have been proposed and largely developedduring the last decades [1],[2]. Cellphone networks wereconsidered as a more reliable data source in comparisonwith household mobility surveys and road sensors. In thissense, the rapid growth of cell phone networks, the deeppenetration and the even deeper daily usage of mobile phonesin the world [3], genuinely made cellular network data anunavoidable candidate for the development of a large panel ofdecision support tools and human mobility simulation tools.But although these sources represented a new opportunityfor crowd tracking and road traffic estimates, their reliabilityhave been largely discussed [4],[5]. One of the main potentialdrawbacks with cellular network data is the high dependenceon the users’ behaviors: the more the users use their phones,the more data are generated. And the more one can leveragethese data.

Initially, the reuse of cellular data with other goals thancustomer billing, was to exploit the users’ locations to estimatethe road traffic and to plan road network infrastructures. Therelatively easy accessibility of cellular data attracted otherresearch fields, and many studies using individuals’ locationsemerged. Thus, different aspects of cellular network data havebeen used for different researches [6]. Among them we find;studies on patterns of mobility [7],[8], large scale studies of ur-ban mobility [9],[10],[11],[12] conception of human mobilitymodels [13],[14],[15],[16], studies on land usage [17],[18], etc.Although the literature abounds with examples of cellular datausages, the suitability of such data to identify and characterizehuman mobility is discussed [19], and many inherent aspectsof these data such as granularity and coarse location estimates,may disrupt mobility models. Recent works propose to mergedifferent types of mobility data in order to avoid limitationsof single view models [20], [21].

In this paper, we focus on using cellular data to understand

the human mobility patterns observable on a territory. Ourcharacterization covers three different aspects: a user profilegeographical aspect (origin and destination), a user’s tem-poral aspect (when do people move), and a flow intensityaspect. We propose a novel approach to address the humanmobility understanding through mobile phone data. Firstly, wecharacterize the cellular network customers and their relationwith the territory with a profiling process based on CallDetail Records (CDR). In [21] however, the authors arguethat mobility models driven by CDR data are limited by theindividuals’ activity during the modeling time. To avoid thisproblem, we propose to cluster individuals’ behaviors intoweekly patterns of mobility. Our methodology then identifies12 weekly patterns, and classifies more than 1.6 millionindividuals into these global patterns. In order to evaluate therelevance of using only mobile phone data, we compare ourresults with existing National Census and surveys. We useour results to dress population estimates, and we propose atemporal, geographical and community vision of a territorythat can easily be verified with national data and reproducedin other places. Finally the contributions of our work are:

• We propose a user profiling algorithm, classifying indi-viduals into 6 distinct groups from CDR data.

• We present a novel approach to extract weekly patternsof mobility from millions of CDR data.

• We propose 12 global weekly patterns, summarizing themobility behaviors of 80% of studied individuals.

• We validate our approach with data from three nationalsources and show the consistency of our methodology.

• We propose new indicators of mobility which providestrong added-values for mobility analysis.

The paper is structured as follows. In the section II wepropose to get back on some basis of cellular networks andcellular data. Section III presents the profiling process thatwe used to leverage cellular data. In section IV we show theresults of our clustering experiment. In section V we propose

IEEE TRANSACTIONS ON MOBILE COMPUTING 2

a comparative analysis with National Census data to validateour results and new insights on territory dynamics. The sectionVI concludes our results.

II. BASIC CONCEPTS

A. Cellular network and CDR

Cellular networks are based on a single concept: to allow theexchange of information from one mobile device to anotherone through linked base stations. If the cellular network detectsthe geographical position of the mobile equipment, it can thusknow the position of the equipment holder, human or machine.

Today the most worldwide used cellular network technologyis the GSM (Global System for Mobile communications).A GSM network is composed of several base stations thattransmit a radio signal into a geographical zone. Any mo-bile equipment within that area that receives enough fieldstrength is connected to the network, and thus reachable forcommunications. The size of each cell (antenna coverage)depends on the network design and may vary from 300 meterswide (microcells in an urban environment) to 30 kilometerswide (macrocells in rural environment). All adjacent cells areoverlapping, allowing a continuous connection to the networkwhen the mobile equipment is moving. Many adjacent cellsare grouped in zones identified by a Local Area Code (LAC).

For billing purposes, when a mobile equipment is used (call,messages, data, etc.), the mobile phone operators store the dataabout the mobile equipment: identification code, location, typeof event, day and hour, service, duration of the connection, etc.Those records are called Call Detail Records (CDR). We maygeneralize and say that all CDR contain at least three basicinformation which are:

1) an IMEI (International Mobile Equipment Identity),which is unique and identifies a terminal

2) a timestamp, as day and hour3) a cell Id, the network operator identity code for the cellThe combination of those three components is particularly

useful for mobility studies since they represent an object intime and space. We easily understand that the combination ofseveral CDR sharing the same IMEI represents the trajectoriesdone by that mobile equipment during a time period.

B. Challenges of using CDR data

CDR are registered according to mobile equipment eventson the network [22],[23]. Thus, they are a pertinent indicatorof presence in an area at a certain time, but their usage as acontinuous indicator of presence is relative and depends on thefrequency of the CDR. Indeed, CDR data highly depend onusers’ behaviors and cannot detect the presence of residentswithout cellphone data [20], [21]. This implies that their usageas a marker of mobility can be discussed. Moreover, a CDRlinks a mobile equipment to a certain spatial area, and thelocation of the mobile depends on the size of the cell whichis space dependent and not strictly known. We can summarizethe main problems of CDR:

1) Their number and frequency depend on the mobileequipment holder usage.

2) The location of the equipment depends on the size ofthe cell.

To counter these problems, many works propose specifictools to generate synthetic CDR data from actual mobilenetwork data and from survey data with fair results [15],[16]. In [20] and [21] the authors propose to use multiplemobility data and to merge CDR data with complementarymobility information to model human mobility. They showgreat results when compared with the synthetic tool proposedin [15]. More generally, generation of synthetic data and multi-source data fusion are current research trends, especially inmobility analysis. These developments facilitate the access totrustful and relatively free mobility data.

Although CDR may present different problems, they alsooffer several advantages. The first one is the deep pene-tration of cellphone in the population, which makes theman omnipresent and reliable source of mobility data. [23]identifies a second advantage which is a passive or activedata collection methodology. We can also argue that, unlikeother kind of technologies (Bluetooth sensors, road magneticsensors, road cameras, National Census, etc.), mobile networksare already widely spread on territories, and do not needspecific equipment deployment and additional costs.

C. Event category

We previously stated that a Call Detail Record correspondsto the storage of specific information such as mobile phoneidentification code, date, time, cell identification code, etc.Nonetheless, this definition is incomplete: a CDR correspondsto one interaction between the network operator and the mobileequipment. We count six different events types and threedifferent event categories for any digital cellular system.Owned events:

• Emission or reception of a phone call• Emission or reception of a message

Transition events:• Handover during a phone call• Location area transition

Induced events:• Every 3 minutes during a phone call• Every 3 hours if none of the above events are done

III. USER PROFILING METHODOLOGY

A. data set description

For this research, we use a data set of real CDR consistingof more than 800 million records collected from a three weekperiod (month of October 2014). These data contain about1.6 million distinct IMEI with both national and internationalmobile phones (roaming). The data set was collected in aterritory around Paris in a 130,000 inhabitants suburban areawith buildings, houses, stores, companies including railwayand highway infrastructures.

The construction of the data set was done as follows: wedetermined a geographical region on which we proposed tofocus our study (approximately shaped as the administrativearea of the studied territory of 70 km2). We registered all the

IEEE TRANSACTIONS ON MOBILE COMPUTING 3

network cells that overlapped this region in a subset that wecall S1. We also prepared a subset S2 that contains all cellsfrom the Ile-de-France province which includes the study area.Then, during the three week period, for each mobile phonewhich was at least detected once in S1, we stored all the pastand future events of this mobile phone within the limits of thethree week period, and within the geographical limit of subsetS1 ∪ S2. With all these data, we are able to better understandthe mobility flows of each individual that crossed this territory.

For privacy purpose the data set was separated according tothe three different weeks, anonymizing all IMEI with threedifferent keys, giving three one-week subsets. With the sameprivacy concern, all information about the type of events wereerased, all IMEI with less than 10 records per week wereerased, and international mobile phone records were kept onlyif the total number of distinct IMEI from that country wasabove 10. In the end each sub-data set on one week consists ofmore than 175 million of CDR which corresponds to more than600,000 distinct IMEI. Each CDR is composed as follows:<IMEI, Date, Time, Network cell Id, Roaming information,Country Code (if roaming)>

Finally, we consider the period between two consecutiveevents as a period of presence or absence within the territoryS1. As due to the nature of CDR data, it is not possible toaccurately calculate the presence period of an individual, wedeliberately consider that between two consecutive events oneindividual is still attached to the last corresponding antenna(discrete model).

B. User profiles

For local authorities, territory dynamics are closely linkedto individuals that use local infrastructures for living, working,circulating, etc. Having the possibility to evaluate and quantifythe usage made of its road network and public infrastructures,like car parks, access ways, malls, schools, is important fordecision making processes. Many studies showed that CDRcan be used to profile individuals that are present in a territory,or to qualify certain areas of a territory. In [24] the authorsuse a system based on a top-down and a bottom-up algorithmto classify individuals on four distinct profiles: Residents,Commuters, In transit and Visitors. We find the same approachin [25] where individuals are characterized according to theircalling behaviors into Residents, Commuters and Visitors. Inearly works, [26] analyzed the daily rhythms of commuters ina territory by computing distances between home and workplaces. These distances often called radius of gyration areused as a mobility indicator for mobility studies [7],[8]. Then,although admitting that roaming information was not the mostpertinent, [27] used CDR to evaluate the proportion of touristsin a territory, whereas [28] used phone calls, ticketing andonline photos information to extract tourist statistics. Suchindividual profiling has also been developed in [10] whichleverages cellular data to detect communities inside a city.

More recently, many works studied the daily profiles ofindividuals. [29] proposes to analyze the results of an activity-based travel survey by the clustering of the individuals accord-ing to their weekday and weekend activity patterns. [30] uses

CDR and survey data from Paris and Chicago to detect 17daily mobility profiles. These mobility profiles represent thedaily patterns of 90% of individuals present on a territory.As well, [31] proposes to identify the mobility profiles byanalyzing filtered CDR from Singapore.

In this paper, we are interested in the description andevaluation of the practices made of a territory by individuals.The chosen area covered by S1 is particularly known for itsnumerous big companies and its access points to the cityof Paris (train and roads). To best characterize this territorywe present six different and exclusive daily profiles that areinferred from the CDR data set, and more specifically fromthe daily patterns of individuals within the territory:

1) Resident Working in Zone (RWiZ): a RWiZ is an in-dividual that lives, sleeps and stays (work, study, etc.)in the territory covered by S1 during the day

2) Resident Working out of Zone (RWoZ): a RWoZ is anindividual that lives and sleeps in the territory S1, butis out of the territory during the day

3) Commuters (C): a commuter is an individual that livesand sleeps outside but works or study in the territory S1

4) Multiple Single Transit (MST): a MST is an individualthat crosses the territory and appears irregularly severaltimes during the day for more than one hour of dailypresence

5) One Single Transit (OST): an OST is an individualwhom the total consecutive daily presence does notexceed one hour

6) Weekend (WE): a WE is an individual that is pre-sent inthe territory only during the weekend

Each of the listed profiles contains important information forthe characterization of a territory. For example, the presenceof residents (RWiZ, RWoZ) indicates that there are houses orresidential areas within the territory. Thus, there is a needfor administrative infrastructures (school, local authorities)but also supermarkets and night car parks. On the contrary,commuters will need rapid entrance and exit ways, but alsoday time car parks. People ”in transit” (MST, OST) will wantto quickly cross the territory, thus they may need secondaryitineraries such as highways or trains. Finally WE will needaccommodations for the weekend, easy access ways, andmaybe tourist areas.

In addition to these patterns we set the index A for absentto all IMEI of the zone S1 which are not detected during agiven day. Absent will further be considered as the seventhcategory of profile even if it is not a real profile at all.

C. Profiling algorithm

Our main concern for this work is to be able to charac-terize the individuals that are actors and users of territoryinfrastructures. According to the spatiotemporal pattern ofeach individual we propose to classify them into a distinctprofile for each day of the week. A classification algorithmis used in [24], which uses temporal patterns of individualsto categorize them, but according to a whole week of data.The main problem with this approach is that it is generalizingand reducing individuals to a global definition. Unfortunately,

IEEE TRANSACTIONS ON MOBILE COMPUTING 4

although human mobility patterns showed to be repetitive [8],they highly depend on the day of the week (part-time jobs,displacements on specific days of the week, special workinghours, etc.).

In this work we propose to study individuals according totheir daily patterns, and then over a whole week. For thatwe present an algorithm that determines the profile of eachindividual’s days. We call pdi the profile of type i associatedto a day d, and w (p1i , p

2i , . . . , p

7i ) the list of the seven

profiles over the week. The profile classification is basedsolely on the temporal patterns by analyzing the presence andabsence periods within the territory S1. For that we determinesix exclusive classification rules depending on the possiblesuccession of events within a day. Note that this algorithm isdeterministic and does a hard mapping; a fuzzy version willbe checked further.

The rules of this algorithm are given by the functionDefineProfile, and figure 1 illustrates our profile classifica-tion process. For each individual we gather the daily set ofCDR and start the algorithm: each diamond corresponds to aTrue/False condition. In the figure, if a condition is verified,then the algorithm follows the bold line. Otherwise, if thecondition is not verified it follows the hashed line until aprofile is set.

Function DefineProfile():if not is in then

/* Is present during the day (is_in) ? Theindividual is detected at least once inS1 during the selected day */

return Absent [A]if is we then

/* Is present only on the weekend (is_we)? The individual is detected onlyduring the weekend. We define theweekend period as ]Friday 19:00 toMonday 06:00[ */

return Week-end [WE]if is ost then

/* Has only one transit (is_ost) ? Thetotal daily presence does not exceedone hour (from the first to the lastevent of the day within S1) */

return One Single Transit [OST]if is mst then

/* Has multiple transits (is_mst) ? Thereare several presence times in the daywithin S1. Each of them do not lastmore than 1 hour, and they are spacedout of more than 3 hours */

return Multiple Single Transit [MST]if not is res then

/* Is resident (is_res) ? The individualis detected during the periods [00:00 –06:00[ and [19:00 – 23:59[ within S1 */

return Commuters [C]if is pres then

/* Is continuously present (is_pres) ? Allconsecutive events within S1 are spacedout of 3 hours maximum (there should beat least an event every 3 hoursaccording to the location updateoperation of the network) */

return Resident Working in Zone [RWiZ]return Resident Working out of Zone [RWoZ]

Fig. 1: The profiling algorithm

Additionally we propose some classification examples infigure 2. The hashed zones correspond to presence periodswithin S1 during a specific day and for each example the cor-responding profile is given. The last example is one possiblerepresentation of a WE profile for any Friday, while in additionthe presence might be at any time on Saturday or Sunday.

Fig. 2: Examples of profiling (Hashed boxes correspond topresence period within S1)

D. Users distribution

We use the profiling algorithm for each distinct IMEI ofthe data set. For that, the algorithm needs to survey all eventsdone in one week by this IMEI. Once all events are stored ina chronological order, the different rules for the classificationare computed, and for each day is attributed a unique profile.In the end all the IMEI of the data set are represented by alist of seven profiles, one per day.

We propose to study the distribution of the different profilesfor each day. The figures 3 and 4 show the distributionof the profiles for all detected IMEI during the 21 days.For a better understanding figure 3 shows the accumulatedcurves of several profiles: RWiZ and RWoZ are summed upinto ”Residents”, OST and MST profiles are summed up in”Transit”. This was done in order to dress a general overviewwithout surcharging the graph. First of all, the sum of theall the profiles’ curves gives almost the same result betweeneach week with a total variation of 0.658%. Thus, for the firstweek we count 534,820 distinct IMEI, 527,622 for the secondweek and 535,271 for the last week. As a matter of facts, it

IEEE TRANSACTIONS ON MOBILE COMPUTING 5

is possible for an IMEI to be detected during one week andto be totally off the two other weeks, and thus not countingin the results. Consequently, the global attractiveness of thezone is very stable from one week to another. The intensityof the curves tells us that many users are regularly detected inthe profile Absent. This does not mean that the algorithm didnot manage to set a profile, but rather that those individualsare not present in the territory S1 on this specific day. Then60% of the detected IMEI are not present every day whateverthe profile. The second main aspect is the repetitiveness of thecurves over the three weeks, comforting the idea that humanbeings produce repetitive patterns, and that their behaviors arehighly predictable.

Additionally, we observe a peak of individuals with profileWE during the weekend, but also a peak of Absent profiles,which means that a large part of individuals who are seenduring the week are not seen at all over the weekend. Conse-quently we observe that there is a decrease in the Commutersand Transit profiles during the weekend. The result of thesetwo observations is that this territory is a crossing and workingarea during the week, and this working force is not presentduring the weekend. Moreover, this territory attracts manyvisitors during the weekends; such visitors can be tourists,but also people that live and work outside of the territorylike students returning home for the weekend. These Weekendindividuals are present in almost the same proportions as theResidents accumulated (In zone and Out of Zone).

The figure 4 presents a detailed view of the figure 3, whereeach profile is clearly presented. We chose to not display theAbsent profile for readability reasons. The first observationonce again is the highly reproductive patterns for each profile.The curve representing the RWoZ profile is almost constantduring the week except for Mondays and weekends wherethis profile is less represented. RWiZ profile on the contraryhas an increasing curve all along the week with the highestlevel during the weekend. Similar end-patterns are observedfor the OST and MST profiles, but in much less proportions.

Conclusions and interpretations on the users distribution areimportant for the comprehension of the dynamics of a territory.However, at this stage of our study we simply do a daily sumof all distinct IMEI for each user profile. This does not implythat the individuals reported within a profile during one dayare the same the next day. On the contrary, individuals adoptdifferent profiles during the week [8]. This leads to the nextpoint that focuses on the ”week patterns” of individuals.

IV. CLUSTERING EXPERIMENT

A. Clustering methodology

The previous graphs inform us on the distribution of thedifferent profiles along three weeks of data but without takinginto account the individuals’ profiles changes from one dayto another. We now wonder what is the repetitiveness ofeach individual within a profile. In this sense, we proposea totally new process to cluster the individuals accordingto their whole week of data, knowing that each day of theweek has a specific human mobility behavior [8]. In [30] theauthors showed that the human mobility patterns are similar

over several days, but do not specifically study the patternsover one entire week. Thus they do not observe the multiplebehaviors adopted by a same individual. Similarly, [29]presents weekday and weekend patterns without distinctionbetween the days, and [31] reduces its data set to studyaveraged days and loses the features of each day. In ourapproach, we keep the features of each day, from Monday toSunday, and characterize the territory through weekly patternsof human mobility.

The first step of this process is to identify the structuresthat can be used for the clustering. For that, we create a subsetof data to run our clustering, and we propose to study theinherent limitations of the clustering. As we do not know inadvance the structure of our data, we decide to use a k-meanslike approach. A k-means clustering algorithm is simpleand easy to setup, and allows us to have a first look at ourdata without introducing bias. Note that different clusteringapproaches will be checked further. We do a cluster analysisof the population and extract the principal characteristics ofthese clusters. Then, we propose to analyze the structure ofthe resulting classes, and we link the different patterns withthe dynamics of the studied territory. This methodology isbased on the one proposed by [29] and [32].

1) Representation of the individualsVector for each dayIn our methodology we propose to give to each individual

a profile for each day of the week. This means that eachindividual has seven profiles for one week. The profiles may bedifferent each day, according to the relation of the individualwith the territory. We propose here to consider each day ofan individual as a binary vector of seven items where eachitem corresponds to a profile, and where the value of the itemcorresponds to the membership intensity of the individual tothis profile. We call such a vector a vector-day. Logically foran individual the membership intensity range is composed ofthe two binary values {0, 1}. This means that for each vector-day only one item (i.e. one profile) of the vector can be setto 1, the others being put to 0. This furthermore implies thatthe total sum of the items values for a vector-day is 1. Figure5 shows the representations of two vectors-day (Monday andTuesday) for a random individual.

Vector for each weekWe propose to also consider the week as a vector of

seven items, where each item corresponds to a day of theweek. We call such a vector a vector-week. Each day beingrepresented by a vector-day with seven items, we can representan individual’s vector-week as a simple vector of 49 items. Thetotal sum of all items of such vector has to give 7. For example,figure 6 shows the representation of an individual which isidentified as Commuters during the week and Absent duringthe weekend. The advantages of using this kind of model, isthat each individual is represented by a unique vector, whichsimplifies the computation and the classification of individuals.

2) Production of a subset of dataIn order to identify the similarities of the population indi-

viduals, we propose to sample the weeks by using a subset

IEEE TRANSACTIONS ON MOBILE COMPUTING 6

Fig. 3: Global overview of the profiles distribution

Fig. 4: Detailed view of the profiles distribution

Fig. 5: Representations of two vectors-day

of individuals for each week. The sampling has proved itsefficiency to allow a better generalization of the proposedclassification and to avoid a useless overlearning of the data set[33]. For that we extract a panel of 10,000 random individuals(1.87% of the total data set) for each of the three weeksand run the cluster analysis on each of the three subsets.The main concern of working with a subset of data is therepresentativeness of the sample. In order to guarantee that

Fig. 6: Representation of a vector-week

the samples are representative, we propose to compare theprofiles’ distribution of each subset with the total profiles’sdistribution. In figure 7 we present two types of curves. Thedotted lines represent the average number of profiles perday, for the three weeks. And the straight lines represent the

IEEE TRANSACTIONS ON MOBILE COMPUTING 7

average number of profiles in the three subsets, multiplied byan adjustment factor i.e. the Average number of individualsin a week divided by 10,000. For readability the profiles aregrouped in four classes (Residents, Commuters, Transit andWeek-end) and the Absent profile is not shown. On average, thenumber of wrongly estimated profiles for each day represents1.46% of the data set. This is small enough to consider thatour subsets are representative of the whole data set. Note thatwe repeated our cluster analysis with several random subsetsand that we obtained similar results, for the representativenessof the subset, as well as for the derived conclusions of ourclustering analysis which are developed later.

Fig. 7: Representativeness of the subsets

B. k-means clustering

1) Determination of k for k-meansWorking with k-means has many benefits, but the main

drawback in our case is to define the appropriate number k ofclusters. To find the best number k we run several clusteringtests where for each test we set the number k. To achievethat, we study a subset of 250 individuals chosen randomlythat we cluster with k ranging from 2 to 250. For each test ofk we compute the average silhouette of the solution obtainedby the average of all silhouettes of the clustered objects. Thesilhouette is a good indicator of the quality of a clusteringsolution, and has been widely used in clustering analysis [34].The silhouette of a clustered object shows if this object lieswell or not within its cluster. It represents the distance ratiobetween the cluster it lies in, and the second-best possiblecluster. The original formula is given by:

s(i) = (b(i)− a(i))/max (a(i), b(i)) (1)

With s(i) the silhouette of the object i, b(i) the dissimilarityto the second-best cluster and a(i) the dissimilarity to its owncluster.

Figure 8 shows the average silhouette obtained with a kvarying from 2 to 250 and 250 individuals. In the figurethe dotted lines correspond to the best average silhouetteobtained for each subset and for a specific k. The straightline corresponds to the average of the 3 subsets. We canobserve that this last curve is plummeting when k is small,then increasing to reach a maximum at one third of the abscissalimit and finally slowly decreasing. By looking at this curve,

the best silhouette is obtained when k is around 60 with 250individuals.

Fig. 8: Average silhouette for 250 individuals, k ranging from2 to 250

However, we need a critical interpretation of these results.What this graph is showing is that the best average silhouetteis obtained for a specific k. We have to understand whatthe silhouette of a cluster means. The dissimilarity of anindividual is the relation between its own cluster and thenext closest cluster in the solution. By increasing k weincrease the probability for an individual to be in the bestsuitable cluster and thus we increase the probability of havingclusters made of very few and specific individuals. Theseindividuals will surely present a good silhouette, but thefigure 8 shows us that having a number k too big decreasesthe general quality of the solution. To sum up, we need anumber k that allows a large number of possibilities, butsmall enough to have a good overall solution. Moreover,clustering with k-means means that we lately can classifyindividuals into distinct groups. A too large number ofgroups is difficult to analyze, and the different resultswould not have much signification. That is why, due toour preliminary test on 250 samples, we propose in ourmethodology to do the cluster analysis using three differentnumbers k {20, 50, 70}, thus multiplying the experimentalconditions, and avoiding local optimum problems. Theresulting clusters are easier to interpret, and evaluating therelation between the individuals and the territory is simplified.

2) Distance computation between clustersThe k-means algorithm is based on the proximity of an

individual to the center of a cluster, often under the form of adistance. Here, all individuals are represented by a vector of49 items holding binary information {0, 1}, that is why weconsider that the best solution is to use Hamming distancesbetween individuals [35]. Then, during the clustering process,the individuals have to determine the closest cluster. Thisis done by computing the Hamming distance between theindividual and the centroids of the clusters. However, usingHamming distances raises drawbacks. In a k-means analysisthe clusters are made of similar individuals, and it seemsinteresting to us to have the possibility of observing the smallvariations (i.e. the profiles’ variations) within each cluster. Inorder to do that, we propose to use real values (instead ofbinary values) to represent the clusters’ centroids during the

IEEE TRANSACTIONS ON MOBILE COMPUTING 8

clustering process. Thus the centroid of a cluster is determinedby computing the average of the vector-week of all individualsbelonging to this cluster.

Then, for the computation of the distance between anindividual and a cluster’s centroid we slightly modify thevector-week of the centroid in order to obtain a correspondingHamming vector. For that, for each vector-day of the cluster’svector-week we set the item with the highest value to 1, andthe six others items to 0. Figure 9 shows the modificationprocess of a vector-day of a cluster’s centroid.

Fig. 9: Modification of the vector-day of a cluster centroid

To sum up, individuals are represented by a vector-weekof 49 items with only one positive item every seven items,and a cluster is represented by a vector-week of 49 items withreal values (i.e. within the interval [0, 1]), except during thedistance computation process where a cluster is representedby the corresponding binary vector-week.

C. Clusters extraction of individual profiles

In our methodology, we propose to do the k-means clusteranalysis with three different values of k {20, 50, 70}, andwith three different weeks, which allows various clusteringpossibilities. This computation generates several groupsfrom the subset of data, and we propose to observe if someparticularities emerge from these groups. However, our mainobjective is to extract general trends and patterns of therelations between the individuals and their territory. To avoidtoo many combinations of clusters we set a process to detectthe main ones that we explain below.

1) Filtering of representative individualsThe first step is to limit the number of clusters in the

output of the solution. We propose in our methodology todisplay only the clusters that hold more than 1% of the totalnumber of individuals. Limiting the output possibilities meansthat the individuals that are not kept in the solution are notrepresentative of a real pattern followed by many. A drawbackhere is that the output solution may contain a really reducednumber of individuals. In theory, if k − 1 clusters containexactly 1% of the individuals, then the last cluster containsN(1−((k−1)∗1/100)) individuals. This corresponds at worstto 81% of individuals for k = 20, 51% for k = 50 and 31%of all individuals for k = 70. In order to study the potentialeffects of this drawback, we propose to count the number ofindividuals present in the output clusters, and to compare itto the real number of individuals in the subset (10,000). Thisallows us to guarantee that the clusters that hold more than1% of the individuals can be considered as good indicatorsat the scale of the whole population. In Table I we present

TABLE I: Clustering results with several k

kGlobal

silhouetteAverage numberof kept clusters

Individualskept (%)

Individualslost (%)

20 weak 13.7 96.4 3.650 average 16.7 86.44 13.5670 good 16.7 82.34 17.66

different statistics about the three number k chosen for themethodology. The number of kept clusters is the averagenumber of clusters that hold more that 1% of the individualsfor the three different weeks. We see that the rate of keptindividuals changes according to the number k. This confirmsour previous analysis on figure 8 where we acknowledgedthat increasing the number k results in a decreasing qualityof clustering. Moreover, the average correct clustering rate isalmost 88%, which means that the individuals kept by theoutput clusters may be considered as good representativesof the subset. And we demonstrated that the subsets arethemselves good representatives of the data of the three weeks.

2) Detection of main-profile clustersFor this analysis, we run nine clustering sessions, one for

each week {1, 2, 3} and for each k {20, 50, 70} respectively.For each session we keep the best run according to the averagesilhouette, and from this run we keep the clusters with morethan 1% of the data set. This gave an average of 15.7 clustersfor each session (141 clusters in totality).

We remark that for each session the clusters present differentpatterns and behaviors, but the most remarkable particularity isthat we encounter similar clusters in each of the nine sessions.Which is to say that even with a different week and a differentk, many clusters are identical. We thus managed to group theseclusters according to their similarities. For that, we comparethe clusters two by two. We compute for the seven daysthe sum of the standard deviations between all correspondingprofiles. If the average of these sums is lower than 0.05 (whichwe empirically set), then we link these two clusters with animaginary edge. We can then map the clusters into a graph, andwe finally compute for each ”clique” of clusters an averagedcluster that we call a main-cluster. We then obtain 12 main-clusters from our analysis. On average, these 12 main-clustershold 82.75% of the individuals of the subset. This means thatin 8 over 10 cases an individual follows one of the 12 extractedpatterns. Each diagram of figure 10 represents the differentpatterns displayed by the 12 main-clusters. On the abscissaare the different days of the week (from 1 to 7) and on theordinates are the rates of the profiles (from 0 to 1). The valuesof the curves show the percentage of a specific profile withinthe main-cluster. It is thus totally possible to have multiplecurves crossing each other’s in a diagram. However the totalsum of the curves for one day is always 1. The color schemeused is the same as in figure 4, RWiZ profile is in dark bluediamonds, RWoZ profile in red squares, Commuters profile ingreen triangles, MST profile in purple crosses, OST profilein light blue double crosses, WE profile in orange disks andAbsent profile is in gray line.

Additionally, for each of these main-clusters we propose inTable II to count the number of occurrences within the nine

IEEE TRANSACTIONS ON MOBILE COMPUTING 9

Fig. 10: The 12 main-clusters extracted from the nine clustering sessions

IEEE TRANSACTIONS ON MOBILE COMPUTING 10

TABLE II: Number of individuals and occurences of the 12main clusters

ClusterNumber ofindividuals

(over 10,000)

Occurencesover the 9

sessions (%)

Profiletype

1 400.25 88.9 Residents2 623.11 100 Residents3 373.11 100 Commuters4 164.57 77.8 Everyday commuting5 225.88 100 Everyday traveling6 160.66 100 Everyday traveling7 744.2 55.6 Punctual crossing8 665 66.7 Punctual crossing9 889.62 88.9 Punctual crossing10 794.66 77.8 Punctual crossing11 852.83 77.8 Punctual crossing12 2381.78 100 WE visitors

sessions and to compute the average number of individualseach main-cluster contains from the nine sessions. The occur-rence column represents in percentage the number of clusters,from the nine sessions, used to create this main-cluster, i.e.it represents the size of the clique used to create this main-cluster. The occurrence shows the representativeness of themain-cluster inherent behavior given the initial conditions (k,week) and we see that the results are fair (from 55% to 100%).Moreover, as we present later, some of these main-clustersare related to the same profiles and we propose to classifythese main-clusters into six different types which are easier tomanage and to interpret. Eventually, by analyzing these main-clusters, we manage to extract the major trends of individualswith respect to their territory, which is the main objective ofour approach to understand the characteristics of the territory.

D. Analysis of the 12 main clusters

Our first observation is that 8,275 distinct IMEI (over10,000) are represented by those 12 main-clusters. Each ofthese IMEI follows one of the 12 presented patterns; thisindicates that there is a high tendency for individuals toreproduce identical patterns.

The different patterns within each cluster show that ourprevious statements about the filling and emptying of theterritory are correct. For example, clusters 1 and 2 show thatRWiZ and RWoZ are closely linked to each other, with a patternon cluster 2 that tends to show that many individuals thatwere Residents Working out of Zone adopt an in zone profileduring the weekend. This demonstrates that almost 60% of theresidents that work outside during the week stay in their nearbyenvironment during the weekend. Cluster 1 in the opposite isin majority composed of Residents Working in Zone that tendto stay in their nearby environment during the weekend.

Cluster 3 and 4 concern commuting patterns within theterritory. We can observe in cluster 3 that individuals have aCommuters profile during the week, and are absent during theweekend. The individuals of cluster 4 are of type Commutersevery day of the week, but this predominance tends to slightlyfall during the weekend, allowing a larger variety of profileson Sunday.

Clusters 5 and 6 show the dynamics of transit profiles, suchas OST and MST individuals. Cluster 5 present individuals

that are mostly of profile MST during the week and Absentthe weekend, whereas in cluster 6 individuals with profile OSTare also present during the weekend. These clusters informus of the presence of many individuals that travel within theterritory almost every day of the week, sometimes during lessthan one hour. It is important to remember here that the totaltime needed to cross the territory about one hour.

Clusters 7, 8, 9, 10 and 11 present the exact same dynamics,but each on a different day of the week. These clustersinform us of the presence of individuals that are absent ofthe territory during the week, except for a specific day wherethey cross the territory, and in majority for less than onehour. These clusters are interesting since they hold almostone third of the behaviors of the individuals detected withinthe territory. However, some of the individuals spend enoughtime in the territory to be classified as Commuters or MST. Inthese categories enter individuals like transporters or punctualindustry or schools visitors. Finally, cluster 12 shows thatalmost all WE are present in the territory two consecutive daysduring the weekend, with a preference for Sunday.

Consequently, we propose a more advanced analysis forsome of these clusters that we think are interesting for thecharacterization of the territory. We then focus on three of thedifferent clusters above, cluster 2, 3 and 9. Cluster 2 shows aresidential and working pattern with 623 detected individuals(over 10,000). During the week almost 83% of the cluster isconsidered as RWoZ, and 15% as RWiZ. During the weekenda convergence occurs, and the individuals tend to show an inZone profile. This comforts us in the idea of a strong relationbetween these two profiles. Many of these individuals live inthe territory, but work outside during the week (out of Zone),and when their economic activity is less important (i.e. duringthe weekend) they are detected in the territory as RWiZ. Notethat in cluster 1, most of the time the individuals are RWiZduring the week and the weekend (76% of the cluster).

Cluster 3 shows a recurrent commuting pattern. This patterntotally corresponds to individuals that are employed in theterritory but reside outside. The majority of these individualshave a commuting pattern every day of the week; they aremostly employees and students that live outside of the territory.Since they are not present during the weekend we can deducethat they do not work in touristic and commerce industry but inmanufacturing, public services, schools... We can consider thiscluster as the opposite of cluster 2 where its dynamic wouldbe switched with the RWoZ flows. It is interesting to note thatin cluster 4 the individuals are present during the weekend andso that they may work in touristic and commerce industry.

Cluster 9 groups 889 individuals (almost 9% of the sample)where OST profile is present on the territory on a specific day(Wednesday). This pattern corresponds to individuals that areonly present on Wednesday, and if we look at the details of thecluster we can see that these individuals are at 61% classifiedin the OST profile, 25% in Commuters and 13% in MST. Itmeans that 74% of individuals were present during periodsof less than one hour, and 25% were present during all theday. We can interpret these dynamics as flows of individualsthat wish to cross the territory without staying (transporters,Wednesday afternoon activities, etc.), which are at the border

IEEE TRANSACTIONS ON MOBILE COMPUTING 11

of the zone, or which came inside the territory to visit industrysites or schools (the studied territory is a dynamic part of theSouth-West of Paris, with many companies and schools). Wecan also add to these individuals a large part of tourists sincethe territory is also known for its huge cultural heritage.

We finally propose a graphical visualization of these dy-namics. Figure 12 represents the dynamics of the territoryaccording to these three different patterns (clusters 2, 3 and 9).The territory is represented by the hexagon shape while eachprofile is represented by its letter within a color circle. The sizeof the circle represents the profiles’ rates within the cluster forthis particular day. The arrows represent the inherent dynamicsof each profile, OST cross the territory, Commuters enter, andRWoZ go out of the territory. The length of the arrows isdirectly linked to the rate of the profile.

V. VALIDATION OF THE MODEL

In this section we verify our conclusions about the dynamicsof a territory with credible sources that already studied thisterritory. For that we use three distinct data sets, a NationalCensus, a professional mobility survey and a scholar mobilitysurvey. All these three surveys are coming from the INSEE, theFrench national statistic organism, and were done in a classicalway, by interrogating people face to face. From the threesurveys we extract several variables related to the demographyand the mobility of individuals over the studied area.

A. Data from national surveys

From the National Census we collect four demographicvariables:

1) The total number of inhabitants (residents)2) The number of residents within 15 and 64 years old3) The working force (registered workers)4) The non-working force (which includes students)Additionally, we use the professional mobility and the

scholar mobility databases from the INSEE to collect moreinformation about workers and students within the zone.We call workers and students active individuals. From thesemobility data sets we consider three more variables:

1) The number of individuals living and being active in thesame city

2) The number of individuals living in the city but beingactive in another city

3) The number of active individuals coming from outsidethe zone

A summary of these seven variables is given in table 3. Allthe individuals of the studied territory are recorded by thesevariables and some of these variables are exclusive, whichmeans they are used to categorize the individuals into distinctclasses (working force and non-working force for example).Figure 12 illustrates the interlinking of these exclusive classesof individuals.

Moreover, we propose to study the similarities of theseclasses with some profiles from our methodology. For that wefocus on the three main profiles that can be easily extractedfrom the seven variables of the survey: RWiZ, RWoZ and

TABLE III: National data about the zone

Source Inhabitants Inhabitantsbetween 15-64

Workingforce

Non-workingforce and

non-students

Nationalcensus 412,500 272,900 207,650 46,385

Source Active from inside,staying in zone

Active from inside,going out of zone

Active in zone,from outside

Professionalmobility survey 98,550 85,950 101,915

Scholarmobility survey 27,115 14,900 18,315

TABLE IV: Summary of the different profiles present on thezone

Category fromnational survey RWiZ RWoZ C

Workers 98,550 85,950 101,915Students 27,115 14,900 18,315

Non-working force andnon-students staying in zone 46,385 / /

Total 172,050 100,850 120,230

Commuters. The RWoZ is the sum of the active individualsgoing out of zone for their activity (work or school), and theRWiZ is the sum of the active individuals staying in zonefor their activity, plus the inactive individuals. Finally theCommuters are individuals which live outside the zone, butwho have their activity (work or school) within the zone.Figure 12 links our three profiles to the different classes ofindividuals and a summary is given in table IV.

B. Adjustments on mobile phone data

1) Size estimation of the profilesIn part 4 we clustered a subset of 10,000 individuals and

then we obtained 12 main clusters (figure 10). Now we classifyevery individual of the full data set (527,622 individuals fromweek 2) into one of these 12 main clusters. This gives usthe actual individuals distribution which will be used forcomparative analysis with national survey data. Note that theupcoming results are similar for week 1 and week 3.

Firstly, we estimate the size of each profile. For that wemultiply the average rate of each profile within a cluster (valuebetween [0, 1]) by the actual size of this cluster, and we onlyuse clusters where a profile is distinctly present. For example,the profile RWiZ corresponds to the number of RWiZ fromcluster 1 and from cluster 2. A rate of 0.8 from cluster 1gives 19,993 individuals and a rate of 0.2 from cluster 2gives 12,292 individuals. Then the total and actual number ofRWiZ individuals is 32,285. Table V presents the results of thisclassification for the 527,622 individuals of week 2. In additionthe same calculation is done for other profiles RWoZ, C, OSTand MST. Note that for the profiles Commuters, OST and MST,we also propose an estimate based on the results from punctualcrossing clusters 7, 8, 9, 10 and 11, to track people comingfor one day per week and to look at their intensity.

2) Size adjustments from mobile network marketOur clusters are only representative of the individuals

tracked by the network operator which provides the data. This

IEEE TRANSACTIONS ON MOBILE COMPUTING 12

Fig. 11: The territory dynamics through three different patterns

Fig. 12: Inhabitants class distribution within the territory

TABLE V: Results of the classification for 527,622 individ-uals

Profile From cluster 1 From cluster 2 TotalRWiZ 19,993 12,292 32,285RWoZ 5,898 37,164 43,153

Profile From cluster 3 From cluster 4 Total Average fromclusters 7,8,9,10,11

C 23,897 11,066 34,963 13,148

Profile From cluster 5 From cluster 6 Total Average fromclusters 7,8,9,10,11

OST 11,400 2,669 14,069 31,618MST 2,043 9,748 11,791 4,330

Profile From cluster 12 TotalWE 62,882 62,882

means that these numbers need to be adjusted to the totalpopulation of the studied area. At the time we obtained thedata there were three main network operators in France. Acommonly used rate to adjust cell phone data is about onethird. We show that this rate is correct for our data set.From figure 3 we estimate the total number of residents to

TABLE VI: Adjusted estimations for the 527,622 individuals

Profile Estimations (recurrent clusters) Adjusted estimations

Residents 75,438 251,460RWiZ 32,285 107,617RWoZ 43,153 142,843

C 34,963 116,543

be 81,912. A simple division shows that 81,912 divided by272,900 (inhabitants between 15 and 64 years old given bythe National Census) gives 0.3 which is close to one third.In [21] the authors note that users without cellphones are notmodeled in cellphone data mobility models, that is why wepropose to use the number of inhabitants between 15 and64 years old to estimate this rate because they are the mostsusceptible of owning a cell phone, and furthermore they arethe individuals taken into account for the active part in nationalsurveys. Table VI presents the adjusted estimations for each ofthe three profiles RWoZ, RWiZ and C compared with nationalsurvey data. Note that Residents adds RWiZ and RWoZ.

C. Comparative analysis

We propose a comparative analysis between our estimationsand the data collected from the three national surveys over theprofiles Residents, RWiZ, RWoZ and C. A short summary ofthe estimated values is given in table VII.

We see that the correlation rates vary from 58% to 97%according to the variables. The estimations on the total numberof Residents and Commuters are close, which shows therelevance of the adjustment factor. However, on the profilesthat involve a dynamic behavior (staying in the zone or goingout of the zone) the correlation rates drop to 58 and 62%. Theinterpretation of such a difference is that for INSEE survey,many resident people declare to work in the zone attachedto their company, but are actually often outside (taxi drivers,transporters, business travelers, managers, etc.). So when theyare tracked by their mobile phone, they are tracked outsidetheir residence and working area [36]. This is a strong added-value of the mobile phone tracking data. Furthermore thenational surveys deliver estimations based on only one-day

IEEE TRANSACTIONS ON MOBILE COMPUTING 13

TABLE VII: Comparison between mobile phone model andsurveys data

Profile INSEE Adjusted mobile phone data Correlation rate (%)

Residents 272,900 251,460 92.14RWiZ 172,050 107,617 62.55RWoZ 100,850 142,843 58.36

C 120,230 116,543 96.93

TABLE VIII: Adjusted estimations for the 527,622 individuals

ProfileEstimations

(recurrent clusters3,4,5,6,12)

Adjustedestimations

Estimations(punctual clusters

7,8,9,10,11)

Adjustedestimations

C 34,963 116,542 13,148 43,827OST 14,069 46,897 31,618 105,393MST 11,791 39,303 4,330 14,433WE 62,882 209,607 / /

survey and are limited by a standard deviation coefficient [37].If we look at the RWoZ collected, we see that from the

mobile phone data there are almost 42,000 more than fromthe INSEE survey, and inversely, 64,000 less in RWiZ category.Concerning the collection of static information about people(age, work, revenue, etc.), the face-to-face surveys are veryconfident, but concerning the collection of dynamic infor-mation about people (when they move, how long, etc.), theinformation are not so truthful and the error rate is up to 30%between what people say and what they really do. Here wesee that deviation.

D. Conception of new indicators

We show that our results are consistent with national datain terms of static behavior (residents and commuters). Now,we propose new indicators to qualify the dynamics of thisterritory. In addition to the RWiZ and RWoZ analysis, tableVIII presents the adjusted estimations for the other profileswe detected with our methodology based on the analysis ofmobile phone data.

Almost 47,000 OST and 39,000 MST are present daily onthe territory. They are individuals that cross the zone, or thatpunctually come, like taxis or transporters. The number ofestimated WE is also interesting; almost 210,000 individualsare present in the territory only during the weekend, whichrepresents 33% of the population present at this time. Thisis notably due to the presence of many cultural and touristattraction in the territory, but we can also put in this categorymany individuals that work or study outside during the weekand come back home only for the weekend. Finally, we detectmany individuals that are present only one day of the week.They are individuals that cross the territory, visit touristic orindustrial venues, taxis or transporters. Thus, almost 164,000individuals are present on the territory for less than one day,which represents almost 20% of the individuals present onthe territory during the day. From those individuals, almost105,000 (12%) stay in the territory for less than one hour.

VI. CONCLUSIONS

Cellular networks are definitely omnipresent in our dailylives, and many applications for the understanding of human

and urban mobility use these systems. Unfortunately, cellulardata suffer from many drawbacks, coarse location estimates,user dependent data frequency, etc. as seen in [22] and [19].Then using these data as a continuous presence indicator canlead to erroneous conclusions. Moreover, for privacy issues, allcollected data is anonymized, and it is impossible to followcontinuously the movements of individuals, and to link theindividuals with a specific social category.

We proposed to leverage the human behavior and thepredictive human patterns by analyzing cellular data from anaggregated and clustered point of view. The activity of thecellular network’s users allowed us to understand the relationthat individuals have with their territory.

We presented a new profiling algorithm that allows theclassification into six different profiles of individuals, accord-ing to their usage of the cellular network. We analyzed thedistribution of these profiles over a three week period of dataand we presented a novel approach to extract the individuals’weekly mobility patterns.

We presented an innovative methodology that groups theindividuals according to their profiles, and we extracted 12patterns that hold the main mobility behaviors of individualsover a territory. Our results can be used by mobility modelingalgorithms that require human mobility patterns like [15] or[16]. We compared our results with data from three Nationalsurveys and we proved that our methodology based on theanalysis of mobile phone data was consistent and relevant. Wethen proposed several dynamic indicators inferred from thisapproach and we used them to dress a map of the dynamicsof a territory. In particular, we saw that a lot of new andunpredictable information about mobility may be added to theface to face survey from mobile network data.

Our future works concern the evolution of the mappingalgorithm to propose a fuzzy version and to compare thefuzzy distribution to the hard one. We also intend to workon the influence of the sample size to build the main-clustersand on the thresholds of similarities between the clusters.Furthermore, we intend to apply the method on a second regionto improve and to generalize the process.

REFERENCES

[1] J. H. Kang, W. Welbourne, B. Stewart, and G. Borriello, “ExtractingPlaces from Traces of Locations,” in Proceedings of the 2Nd ACMInternational Workshop on Wireless Mobile Applications and Serviceson WLAN Hotspots, ser. WMASH 04. Philadelphia, PA, USA: ACM,2004, pp. 110–118.

[2] J. Reades, F. Calabrese, A. Sevtsuk, and C. Ratti, “Cellular Census:Explorations in Urban Data Collection,” IEEE Pervasive Computing,vol. 6, no. 3, pp. 30–38, Jul. 2007.

[3] “World Telecommunication Development Conference (WTDC-14): Fi-nal Report.”

[4] G. Rose, “Mobile Phones as Traffic Probes: Practices, Prospects andIssues,” Transport Reviews, vol. 26, no. 3, May 2006.

[5] C. Iovan, A.-M. Olteanu, T. Couronne, and Z. Smoreda, “Moving andCalling: Mobile Phone Data Quality Measurements and SpatiotemporalUncertainty in Human Mobility Studies,” in Geographic InformationScience at the Heart of Europe, ser. Lecture Notes in Geoinformationand Cartography, D. Vandenbroucke, B. Bucher, and J. Crompvoets,Eds. Springer International Publishing, May 2013, pp. 247–265.

[6] F. Calabrese, L. Ferrari, and V. D. Blondel, “Urban Sensing UsingMobile Phone Network Data: A Survey of Research,” ACM Comput.Surv., vol. 47, no. 2, pp. 25:1–25:20, Nov. 2014.

IEEE TRANSACTIONS ON MOBILE COMPUTING 14

[7] M. C. Gonzalez, C. A. Hidalgo, and A.-L. Barabasi, “Understandingindividual human mobility patterns,” Nature, vol. 453, no. 7196, pp.779–782, 2008.

[8] C. Song, Z. Qu, N. Blumm, and A.-L. Barabasi, “Limits of Predictabilityin Human Mobility,” Science, vol. 327, no. 5968, pp. 1018–1021, Feb.2010.

[9] F. Calabrese, M. Colonna, P. Lovisolo, D. Parata, and C. Ratti, “Real-Time Urban Monitoring Using Cell Phones: A Case Study in Roma,”IEEE Transactions on Intelligent Transportation Systems, vol. 12, no. 1,pp. 141–151, Mar. 2011.

[10] F. Alhasoun, A. Almaatouq, K. Greco, R. Campari, A. Alfaris, andC. Ratti, “The City Browser: Utilizing Massive Call Data to InferCity Mobility Dynamics,” in The 3rd International Workshop on UrbanComputing (UrbComp 2014), New York, NY, Aug. 2014.

[11] R. Becker, R. Caceres, K. Hanson, S. Isaacman, J. M. Loh,M. Martonosi, J. Rowland, S. Urbanek, A. Varshavsky, and C. Volin-sky, “Human Mobility Characterization from Cellular Network Data,”Commun. ACM, vol. 56, no. 1, pp. 74–82, Jan. 2013.

[12] F. Calabrese, M. Diao, G. Di Lorenzo, J. Ferreira Jr., and C. Ratti,“Understanding individual mobility patterns from urban sensing data:A mobile phone trace example,” Transportation Research Part C:Emerging Technologies, vol. 26, pp. 301–313, Jan. 2013.

[13] C. Joumaa, A. Caminada, and S. Lamrous, “Mask Based MobilityModel A new mobility model with smooth trajectories,” in Modeling andOptimization in Mobile, Ad Hoc and Wireless Networks and Workshops,2007. WiOpt 2007. 5th International Symposium on, 2007, pp. 1–6.

[14] K. A. Ali, M. Lalam, L. Moalic, and O. Baala, “V-MBMM: VehicularMask-Based Mobility Model,” in Networks (ICN), 2010 Ninth Interna-tional Conference on, 2010, pp. 243–248.

[15] S. Isaacman, R. Becker, R. Caceres, M. Martonosi, J. Rowland, A. Var-shavsky, and W. Willinger, “Human Mobility Modeling at MetropolitanScales,” in Proceedings of the 10th International Conference on MobileSystems, Applications, and Services, ser. MobiSys ’12. Low Wood Bay,Lake District, UK: ACM, 2012, pp. 239–252.

[16] D. J. Mir, S. Isaacman, R. Caceres, M. Martonosi, and R. N. Wright,“DP-WHERE: Differentially private modeling of human mobility,” in2013 IEEE International Conference on Big Data, Oct. 2013, pp. 580–588.

[17] J. Yuan, Y. Zheng, and X. Xie, “Discovering Regions of DifferentFunctions in a City Using Human Mobility and POIs,” in Proceedingsof the 18th ACM SIGKDD International Conference on KnowledgeDiscovery and Data Mining, ser. KDD ’12. New York, NY, USA:ACM, 2012, pp. 186–194.

[18] J. L. Toole, M. Ulm, M. C. Gonzalez, and D. Bauer, “Inferring LandUse from Mobile Phone Activity,” in Proceedings of the ACM SIGKDDInternational Workshop on Urban Computing, ser. UrbComp ’12. NewYork, NY, USA: ACM, 2012, pp. 1–8.

[19] A. Wesolowski, N. Eagle, A. M. Noor, R. W. Snow, and C. O. Buckee,“The impact of biases in mobile phone ownership on estimates of humanmobility,” Journal of The Royal Society Interface, vol. 10, no. 81, p.20120986, Apr. 2013.

[20] D. Zhang, J. Huang, Y. Li, F. Zhang, C. Xu, and T. He, “Exploring Hu-man Mobility with Multi-source Data at Extremely Large MetropolitanScales,” in Proceedings of the 20th Annual International Conference onMobile Computing and Networking, ser. MobiCom ’14. New York,NY, USA: ACM, 2014, pp. 201–212.

[21] D. Zhang, J. Zhao, F. Zhang, and T. He, “coMobile: Real-time HumanMobility Modeling at Urban Scale Using Multi-view Learning,” inProceedings of the 23rd SIGSPATIAL International Conference onAdvances in Geographic Information Systems, ser. SIGSPATIAL ’15.New York, NY, USA: ACM, 2015, pp. 40:1–40:10.

[22] M. A. Bayir, M. Demirbas, and N. Eagle, “Discovering spatiotempo-ral mobility profiles of cellphone users,” in 2009 IEEE InternationalSymposium on a World of Wireless, Mobile and Multimedia NetworksWorkshops, Jun. 2009, pp. 1–9.

[23] Z. Smoreda, A.-M. Olteanu, and T. Couronne, “Spatiotemporal datafrom mobile phones for personal mobility assessment,” Transport SurveyMethods: Best Practice for Decision Making, 2013.

[24] B. Furletti, L. Gabrielli, C. Renso, and S. Rinzivillo, “Identifyingusers profiles from mobile calls habits,” in Proceedings of the ACMSIGKDD International Workshop on Urban Computing, ser. UrbComp’12. Beijing, China: ACM, 2012, pp. 17–24.

[25] M. Nanni, R. Trasarti, B. Furletti, L. Gabrielli, P. V. D. Mede, J. D.Bruijn, E. D. Romph, and G. Bruil, “MP4-A Project:Mobility PlanningFor Africa,” in Analysis of mobile phone datasets for the developmentof Ivory Coast, ser. Mobility/Transport, V. Blondel, N. d. Cordes,

A. Decuyper, P. Deville, J. Raguenez, and Z. Smoreda, Eds. netmob,May 2013, pp. 423–446.

[26] R. Ahas, A. Aasa, S. Silm, and M. Tiru, “Daily rhythms of suburbancommuters’ movements in the Tallinn metropolitan area: Case study withmobile positioning data,” Transportation Research Part C: EmergingTechnologies, vol. 18, no. 1, pp. 45–54, 2010.

[27] R. Ahas, A. Aasa, A. Roose, l. Mark, and S. Silm, “Evaluating passivemobile positioning data for tourism surveys: An Estonian case study,”Tourism Management, vol. 29, no. 3, pp. 469–486, 2008.

[28] F. Girardin, F. Calabrese, F. Fiore, C. Ratti, and J. Blat, “DigitalFootprinting: Uncovering Tourists with User-Generated Content,” IEEEPervasive Computing, vol. 7, no. 4, pp. 36–43, Oct. 2008.

[29] S. Jiang, J. Ferreira, and M. C. Gonzalez, “Clustering daily patterns ofhuman activities in the city,” Data Mining and Knowledge Discovery,vol. 25, no. 3, pp. 478–510, Nov. 2012.

[30] C. M. Schneider, V. Belik, T. Couronne, Z. Smoreda, and M. C.Gonzalez, “Unravelling daily human mobility motifs,” Journal of TheRoyal Society Interface, vol. 10, no. 84, p. 20130246, Jul. 2013.

[31] S. Jiang, J. Ferreira, and M. C. Gonzalez, “Activity-Based HumanMobility Patterns Inferred from Mobile Phone Data: A Case Study ofSingapore,” IEEE Transactions on Big Data, vol. PP, no. 99, pp. 1–1,2017.

[32] F. Calabrese, J. Reades, and C. Ratti, “Eigenplaces: Segmenting Spacethrough Digital Signatures,” IEEE Pervasive Computing, vol. 9, no. 1,pp. 78–84, 2010.

[33] R. Fergus, A. Zisserman, and P. Perona, “Sampling methods for unsuper-vised learning,” in Advances in Neural Information Processing Systems17. Neural information processing systems foundation, 2005.

[34] P. J. Rousseeuw, “Silhouettes - A graphical aid to the interpretation andvalidation of cluster analysis,” Journal of Computational and AppliedMathematics, vol. 20, no. 0, pp. 53–65, 1987.

[35] R. W. Hamming, “Error Detecting and Error Correcting Codes,” BellSystem Technical Journal, vol. 29, no. 2, pp. 147–160, Apr. 1950.

[36] INSEE, “Definition of working place.” [Online].Available: http://www.insee.fr/fr/bases-de-donnees/default.asp?page=recensement/resultats/doc/definitions-rp.htm#def lieu travail

[37] ——, “Recensement de la population - La precision des resultats durecensement,” INSEE, Tech. Rep., Aug. 2009.

Etienne Thuillier received the title of engineer in computer sciences(M.S. Degree) at Belfort-Montbeliard University of Technology (UTBM),France, in 2013. He is currently pursuing a Ph.D. degree in computersciences at the OPERA team, UTBM. His research interest includesIntelligent Transportation Services technologies, and more generally humanand urban mobility.

Laurent Moalic received the title of engineer in computer sciences (M.S.Degree) at UTBM, France, in 2003. He later received the Ph.D. degree incomputer science from the University of Franche Comte in 2013. In 2004,he joined the Transportation Laboratory of UTBM, as a research engineer.His current research interests include operational research, combinatorialoptimization, and human mobility modeling. Currently, he is the team leaderfor the French National Research Agency project, Norm-Atis, about newstandards to develop advanced transportation information services.

Sid Lamrous is currently associate professor of computer science atUTBM. From 1996 to 1999, he obtained the postgraduate diploma insystems control (M.S. Degree) and the Ph.D in computer science at theUniversity of Technology of Compiegne. From 2011 to 2013 he was headof engineering education by learning for 3 years. For 10 years he has beenworking on models and algorithms for combinatorial optimization, andstochastic modeling of mobility.

Alexandre Caminada is currently full professor of computer scienceat UTBM. He received the MSc research degree in Artificial Intelligencefrom the University of Paris XII/Paris VIII and the diploma in ComputerScience from ESIEA Paris (M.S. Degree), then the Ph.D from the Universityof Montpellier II in telecommunication software engineering. From 1993 to2004, he worked at the National Telecommunications Research Centre (CNETFrance) as head of a Unit Research on Wireless Optimisation and in 2004he has been nominated as full Professor at UTBM. His personal research isabout resources optimization and network performance modelling for mobileand wireless systems.