PDF-Cranshaw Bridging the Gap


    Bridging the Gap Between Physical Location and Online Social Networks

    Justin Cranshaw, Eran Toch, Jason Hong, Aniket Kittur, Norman Sadeh

    School of Computer Science, Carnegie Mellon University
    5000 Forbes Avenue, Pittsburgh, PA 15213, USA

    {jcransh, eran, jasonh, nkittur, sadeh}@cs.cmu.edu

    ABSTRACT

    This paper examines the location traces of 489 users of a location sharing social network for relationships between the users' mobility patterns and structural properties of their underlying social network. We introduce a novel set of location-based features for analyzing the social context of a geographic region, including location entropy, which measures the diversity of unique visitors of a location. Using these features, we provide a model for predicting friendship between two users by analyzing their location trails. Our model achieves significant gains over simpler models based only on direct properties of the co-location histories, such as the number of co-locations. We also show a positive relationship between the entropy of the locations the user visits and the number of social ties that user has in the network. We discuss how the offline mobility of users can have implications for both researchers and designers of online social networks.

    Author Keywords

    Location sensing, Social network analysis, Social computing

    ACM Classification Keywords

    H.1.2 User/Machine Systems: Human information processing; J.4 Social and Behavioral Sciences: Sociology

    General Terms

    Experimentation, Human Factors

    INTRODUCTION

    Although voices in the media and academia often make distinctions between online social networks and offline social networks, until recently it has been extremely difficult to rigorously address questions comparing these two worlds. This has led to conflicting results when researchers have attempted to relate online and offline behavior. For example, in a recent article Deresiewicz argues that online social networks are contributing to the isolation of people in the physical world [2], while a recent Pew Internet and American Life report argues that online social networks have a positive impact on social relations in the physical world [9]. The current lack of methodology for analyzing the distinctions between online and offline social networks can explain, in part, this type of open-ended debate.

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. UbiComp '10, Sep 26-Sep 29, 2010, Copenhagen, Denmark. Copyright 2010 ACM 978-1-60558-843-8/10/09...$10.00.

    At the same time, the growing ubiquity of location-enabled smartphones blurs the distinction between online and offline social networks. This is most apparent in emerging mobile social networks such as Foursquare and Gowalla, which have created new means for online interaction based entirely on the physical location of their users. Furthermore, smart devices also make it possible to study people's offline behavior by continuously tracking their whereabouts. As a consequence, many questions of human behavior which in the past were difficult to answer will soon become easier to analyze.

    One of the challenging problems in this space is inferring properties of the social behavior of users from their location trails. Some promising research in this area can be seen in papers by Eagle et al. [3, 4] and Li et al. [11], who develop measures of user similarity based on mobility and use this to infer the social structures of the users. This task is particularly challenging since co-location of two users, loosely defined as being in the same place at the same time, does not provide enough evidence to reliably establish a relationship between them, especially in urban environments, where co-location among strangers is frequent [12]. Furthermore, in realistic conditions, location tracking is inherently partial and inexact, making this kind of inference difficult on a large scale.

    To meet these challenges we introduce a set of features that shed light on the social context of the locations that users visit. We evaluate these features on two tasks: predicting whether two co-located users are friends on Facebook, and predicting the number of friends a user has in the social network. Additionally, we examine the relative importance of the predictors used, and we show that looking deeper into characteristics of the locations the users visit can significantly improve performance on these tasks.

    Being able to rigorously address these questions requires a special experimental framework capable of observing both the offline social behavior and the online social structure of the users. To meet this end, we use Locaccino, a location-sharing application based on Facebook's social network [14]. Locaccino allows users to share their location with their Facebook friends subject to robust privacy preferences. Our results are based on an analysis of the location trails of 489 participants who were tracked using GPS and WiFi positioning technologies installed on their mobile phones and/or laptop computers.

    In this work we introduce and evaluate a set of contextual features of human location trail data for inferring two social aspects of the users: the existence of an online social network link between two users, and the number of friends a user has. We show that by analyzing characteristics of the locations the users visit, and by studying the patterns of an individual user's mobility, we can gain valuable context into the user's social world.

    This work makes the following primary research contributions:

    1. We establish a model of friendship in an online social network based on contextual features of user co-location.

    2. We identify positive relationships between the mobility patterns of a user and the number of online friends the user has.

    3. We show that diversity measurements of a location, such as the entropy of the distribution of unique visitors there, can be used to analyze the context of the social interactions at that location.

    RELATED WORK

    Several promising results demonstrate the potential of using ubiquitous mobile technologies to study human social behavior. In a series of papers, Gonzalez et al. observed a large group of mobile phone users over six months, showing that phone users' mobility patterns have a high degree of spatial and temporal regularity [7]. They then used this insight to develop statistical models of user mobility patterns. Eagle and Pentland used eigenvalue decomposition to study routine behaviors of mobile phone users [3]. They showed that the inferred principal components of participant behaviors discovered by their decomposition can be used to build a similarity measure between users. Furthermore, they showed that this similarity measure can be used to successfully infer familiarity. Li et al. also used location histories to derive a user similarity measure [11]. Their similarity measure is derived from a hierarchical modeling of the users' location histories that takes into account both movements on a micro scale, say from building to building, and movements on a macro scale, say from city to city. Miklas et al. studied the network of interactions of mobile phone participants in relation to their social network [12]. Although the primary focus of their work was on applications that exploit social interactions, such as routing in delay tolerant networks, they also found several interesting descriptive results about social interactions, such as the distribution of participant interactions with strangers versus interactions with friends.

    Eagle et al. analyzed a set of features of mobility data to study the social structure of the participants [4]. They examined features such as the proximity of the users at work, proximity on a Saturday night, whether there was phone communication between them, and the number of unique locations where they were together as predictors of whether there was a relationship between the two users. They then conducted a regression analysis using self-report data for the actual user relationships to study what factors contribute most to friendship. Their analysis showed that phone communication was by far the most significant predictor of friendship, followed by the number of unique locations and proximity on a Saturday night.

    We build on this foundation and expand it in several ways. First, we compare physical social interactions with an existing online social network rather than self-reported social ties. Not only does this bypass any potential biases introduced by self-report data, this type of analysis also allows our work to contribute new applications to online social networks, such as location-based friend recommendation and categorization systems, and location recommendation systems. Second, we do not record the existence of cellular phone communication between the users. Rather, our methodology is based only on knowing the users' locations. Finally, we expand the existing methodology for analyzing location data by introducing new tools for enhancing the understanding of the context of human location observations by looking at global properties of the location where the observation occurred, such as the entropy of the distribution of users that visit the location. We show that using these new location-based features, we can construct a classifier for predicting social ties that outperforms one that is based on features similar to the proximity-based features used by Eagle et al. [4].

    Unlike the Bluetooth handshake method for inferring interactions used by Eagle et al. [4], which requires communication between the phones of the participants in order to establish proximity, our method records the location of the users using standard GPS and WiFi geo-positioning, similar to Li et al. [11]. We then infer by proximity, rather than explicitly observe via Bluetooth handshake, the social interactions between the users. This method is realistic and highly scalable, making it relevant for researchers and practitioners wishing to study user location traces on a large scale. Most importantly, although inferring social interaction in this way can produce noisy data, we show how a sophisticated analysis of the context of the observed proximity can compensate for data limitations.

    The methodology we present in this paper also offers new tools that can be used in future research on the impact of the Internet on social relations. Current research in this field is based mostly on qualitative findings and surveys. For example, Barry Wellman et al. studied the impact of the Internet on neighborhoods and families [15], Kraut et al. studied the effects of the Internet on the well-being of users [10], and Ellison et al. studied social capital of Facebook [5]. While our current paper does not aim to contribute directly to any of these questions, our work provides another dimension to address these difficult questions in future work.


    METHOD

    We observed users through continuous tracking of their location using laptop computers and smart phones. Additionally, we observed the existence of Facebook friendships between pairs of users. In this section, we describe in detail the technical and experimental framework, and the collected data.

    Locaccino

    Locaccino [14] is a Web application developed by the Mobile Commerce Lab at Carnegie Mellon University that allows a user to share her current location with other Locaccino users through her Facebook social network subject to user-controllable privacy rule specifications (footnote 1). From the user's perspective, there are two components of Locaccino: the web application, which allows users to query their friends' locations and set up and review privacy rules, and the locator software, which runs on laptops and mobile phones (Symbian OS and Android) and updates the user location every 10 minutes.

    Users run the client locator software in the background of their laptops or smart phones, which uses a combination of GPS (if available), WiFi, and IP geolocation to ascertain location coordinates of the user. Each method has differing levels of accuracy. Locations ascertained via GPS are typically accurate to within 10 to 15 meters. Locations ascertained through a WiFi lookup service like that provided by Skyhook Wireless (footnote 2) are typically accurate to within 10 to 20 meters. Locations ascertained via IP geolocation are typically at the city or neighborhood level of granularity. These location observations, which consist of a time-stamp together with latitude and longitude values, are sent to the Locaccino server by the client software in 10 minute intervals.

    Recruitment, Demographics and Data Collection

    The 489 users discussed in this work were each active users of Locaccino for periods ranging from 7 days to several months (mean of 74 days, median of 38 days). The participants started using Locaccino at different times and for different reasons. 285 of the users were recruited as part of 3 different studies from the campus population using fliers and postings on the university's electronic message boards. The rest of the users were either invited by study participants through a built-in invite mechanism, or they found Locaccino through research publications, online press, or other means. Figure 1 shows a plot of the number of unique users being tracked for each day of the period we study in this work. All users of Locaccino, regardless of how they were recruited, gave informed consent to participate in the study prior to registering an account on the system.

    Although we recognize this might limit the generality of our findings, to enforce some control over the data we ignore all observations outside of the Pittsburgh metropolitan region (where Locaccino was first deployed). This allows us to study the users in a closed ecosystem and it frees the data from any bias that might result from an uneven density of observations across geographic regions. In total, over 3 million location observations were collected for this work, with nearly 2 million of these falling in the Pittsburgh region. Additionally, we ignore all location observations that were obtained by IP geolocation. Assuming each data point represents a 5 minute interval, this is over 20 years of cumulative human observational data.

    1 www.locaccino.org
    2 www.skyhookwireless.com

    [Figure 1: plot with y-axis "Number of Users" (0 to 140) and x-axis from Jan-09 to Jan-10.]

    Figure 1. A plot of the number of locatable study users in the Pittsburgh metropolitan region for each day of the study period. Peaks in Apr-09, Jul-09, and Dec-09 indicate controlled studies.

    A large percentage of the observations were collected from the laptop locator software (93.7%). This imposes several limitations on the data analysis. For one, people are not as mobile with laptops, which are often only powered on in stationary locations, as they are with cellular devices, which often remain powered on and near that person at all times. Furthermore, laptops offer a much more sporadic approximation of a person's location than cellular devices do. Laptops are sometimes powered on for hours at a time while the user is in fact not near the laptop (for instance at home, or at the office). Although this adds a significant element of noise to the data that is difficult to quantify, the data is nevertheless realistic, as it represents a real world deployment of a location sharing system. Furthermore, we feel the limitations of the data are testament to the strength of our methods, as we are able to find significant and strong results using data generated in this realistic and highly scalable manner.

    Co-location

    We divide the latitude and longitude space into discrete 0.0002 × 0.0002 latitude/longitude grids (approximately 30 meters × 30 meters) and the time coordinate into whole 10 minute intervals. In this way, a co-location of two users is defined as an observation of the users within the same 0.0002 × 0.0002 location grid within the same discrete 10 minute interval. The particular choice of discretization was based on practical considerations balancing the accuracy of the location sensing technology with the noise associated with larger discretization windows. Although such a discretization adds some noise when trying to infer co-locations, when examining the entire history of co-locations between pairs of users, this noise is marginalized. Unless otherwise stated, when we refer to a location or location observation or co-location observation in this paper we assume the location and time coordinates are subject to this discretization.
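The discretization described above can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the function names, the tuple layout of an observation, and the use of Unix timestamps are all assumptions made for the example. Only the 0.0002 degree grid size and the 10 minute slot length come from the text.

```python
import math
from collections import defaultdict
from itertools import combinations

GRID_DEG = 0.0002        # grid cell size in degrees (from the text)
SLOT_SECONDS = 10 * 60   # time slot length in seconds (from the text)


def discretize(lat, lon, timestamp):
    """Map a raw observation to a (grid cell, time slot) pair.

    Hypothetical helper: floors coordinates onto the grid and the
    timestamp onto whole 10 minute intervals.
    """
    cell = (math.floor(lat / GRID_DEG), math.floor(lon / GRID_DEG))
    slot = int(timestamp // SLOT_SECONDS)
    return cell, slot


def find_colocations(observations):
    """Return co-locations as a set of (user_a, user_b, cell, slot).

    observations: iterable of (user_id, lat, lon, unix_timestamp).
    Two users are co-located when they share a cell and a slot.
    """
    present = defaultdict(set)
    for user, lat, lon, ts in observations:
        present[discretize(lat, lon, ts)].add(user)

    colocations = set()
    for (cell, slot), users in present.items():
        # Every unordered pair of distinct users in the same bucket.
        for a, b in combinations(sorted(users), 2):
            colocations.add((a, b, cell, slot))
    return colocations
```

Bucketing observations by (cell, slot) first keeps the pairwise comparison local to each bucket, which is one plausible way to avoid comparing every observation against every other.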

    Network Data

    Graph Structural Properties      $S$      $C$      $S \cap C$
    Number of vertices               397      397      397
    Number of isolated vertices      15       120      206
    Number of edges                  1063     3636     307
    Num connected components         106      108      234
    Largest component size           315      275      67
    Density                          0.014    0.046    0.004
    Connectedness                    0.63     0.48     0.04
    Degree centralization            0.06     0.22     0.03
    Eigenvector centralization       0.42     0.21     0.50

    Table 1. Structural properties of the networks analyzed in this work.

    In this work we primarily focus on network data induced from Locaccino user observations. In particular, we compare the network formed by co-location of system users with the underlying Facebook social network. The three networks that we consider are defined below:

    The Social Network: We denote the underlying Facebook social network of Locaccino users by $S$. There is an edge between vertices $u_1, u_2 \in S$ if and only if $u_1$ and $u_2$ are friends on Facebook.

    The Co-location Network: We construct an undirected graph based on user co-location, so that an edge exists between $u_1$ and $u_2$ if they were co-located. We call this graph the co-location network and denote it by $C$.

    The Co-located Friends Network: We will refer to the graph induced by those Facebook friends who were actually co-located as the co-located friends network, which we will denote by $S \cap C$.

    Structural properties and descriptive statistics of the networks are shown in Table 1. In this work we will primarily focus on the edges of the graphs. Although there were 3636 observed co-locations among the users, only 307 of these were co-locations of Facebook friends. This shows that although co-location among Locaccino users in Pittsburgh is quite common, co-location among Facebook friends is comparatively rare. Indeed, only roughly 30% of the dyads in the social network ever appear in the co-location network. Also of interest are global properties of the graph structures, such as the distribution of component sizes (see Figure 2). Observe that, ignoring isolated vertices, co-location of the participants occurs in one large connected component, whereas co-location of friends occurs in several smaller distinct components.
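As a rough illustration of how the three networks relate, the sketch below treats each network as a set of undirected edges and takes their intersection. The function names and the edge representation are assumptions for this example only. With the edge counts from Table 1, 307 of the 1063 social edges also appear in the co-location network, which is about 29% and matches the "roughly 30%" stated above.

```python
def normalize(edges):
    """Represent undirected edges as frozensets so (a, b) equals (b, a).

    Self-loops are dropped; this is an assumption of the sketch.
    """
    return {frozenset(e) for e in edges if e[0] != e[1]}


def colocated_friends(social_edges, colocation_edges):
    """Edges present in both the social and the co-location network."""
    return normalize(social_edges) & normalize(colocation_edges)


def overlap_fraction(social_edges, colocation_edges):
    """Fraction of social dyads that also appear in the co-location network."""
    social = normalize(social_edges)
    if not social:
        return 0.0
    return len(colocated_friends(social_edges, colocation_edges)) / len(social)
```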

    MODEL DESCRIPTIONS

    In this section we describe the variables we use to model the co-location of two users and individual user mobility.

    Measuring the diversity of a location

    To better understand the context of each observation, it would be helpful to have information about the type of location where the observation occurred. For example, observations of a user in a private residence should be interpreted differently from those in a crowded shopping center. We introduce a set of measures on locations that attempt to quantify the diversity of observations that occur at a given location. One primary motivation in defining these measures is to be able to distinguish when a co-location between two users happens by chance, say two strangers eating at adjacent tables in the same restaurant, and when the co-location is a social event, say one friend inviting the other to his house for dinner.

    [Figure 2 panels: Social Network: 35 connected components, 20 non-trivial (largest labeled 315). Co-location Network: 122 connected components, 2 non-trivial (largest labeled 275). Co-located Friends Network: 234 connected components, 26 non-trivial (largest labeled 67). Legend: trivial component (isolated vertex); non-trivial component (at least 2 vertices).]

    Figure 2. The distribution of connected component sizes in the social, co-location and co-located friends networks. Ignoring isolated vertices, co-location of the participants occurs in a single connected component, whereas co-location of friends occurs in several smaller components.

    In this work we consider three diversity measures on locations: frequency, user count, and entropy. The frequency of a location measures the raw count of user observations that occurred there, user count looks at the total number of unique users that visit the location, and entropy takes into account both the number of users observed at the location as well as the relative proportions of their observations. A location will have a high entropy if many users were observed at the location with equal proportion. Conversely, it will have low entropy if the distribution of observations at a location is heavily concentrated on few users. See Figure 4 for a concrete illustration of the difference between the three diversity measures. We find entropy to be particularly appealing because locations of high entropy by definition are precisely where chance encounters are most likely to occur.

    Now we define these notions formally. Let $L$ be a location and let $U$ be the set of all users. For a $u \in U$, let $O_u$ be the set of location observations of $u$ and let $O = \bigcup_{u \in U} O_u$. An observation $o \in O_u$ is a 4-tuple of the user ID, the location latitude and longitude coordinates, and a timestamp. Define $U_L = \{u \in U : u \text{ was observed at location } L\}$. Let $O_{u,L} = \{o \in O_u : o \in L\}$ and $O_L = \{o \in O : o \in L\}$ be restrictions of $O_u$ and $O$ to the location $L$ (footnote 3). The probability that a random draw from $O_L$ belongs to $u$ is

    $$P_L(u) = \frac{|O_{u,L}|}{|O_L|},$$

    that is, $P_L(u)$ is the total fraction of all observations at location $L$ that are of user $u$.

    Definition 1: For a location $L$, the frequency of the location is defined as $\mathrm{Freq}(L) := |O_L|$, the user count of the location is defined as $\mathrm{UserCount}(L) := |U_L|$, and the location entropy of the location is defined as

    $$\mathrm{Entropy}(L) := -\sum_{u \in U_L} P_L(u) \log P_L(u).$$

    Our application of the UserCount and Entropy measures to study locations is motivated by their use in ecology in the study of biodiversity [13].
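The three measures of Definition 1 can be illustrated with a short sketch. The function name and the input format (one user ID per observation at a single location) are assumptions for this example. The text does not state the base of the logarithm; base 2 is assumed here because it reproduces the value 1.58 reported in Figure 4 for three equally represented users, since $\log_2 3 \approx 1.585$.

```python
import math
from collections import Counter


def location_diversity(user_ids, base=2):
    """Return (Freq, UserCount, Entropy) for one location.

    user_ids: list with one user ID per observation at the location.
    base: logarithm base; 2 is an assumption inferred from Figure 4.
    """
    counts = Counter(user_ids)
    freq = len(user_ids)          # Freq(L) = |O_L|
    user_count = len(counts)      # UserCount(L) = |U_L|
    # Entropy(L) = -sum over users of P_L(u) * log P_L(u)
    entropy = -sum(
        (c / freq) * math.log(c / freq, base) for c in counts.values()
    )
    return freq, user_count, entropy
```

For instance, twelve observations split evenly among three users give a frequency of 12, a user count of 3, and an entropy of about 1.58, while twelve observations of a single user give the same frequency but an entropy of 0.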

    Co-location features

    3 The notation $o \in O$ and $o \in L$ is imprecise. Elements of $O$ are 4-tuples whereas elements of $L$ are location coordinates. By $o \in L$ we mean the location component of $o$ lies in location $L$.


    [Figure 3: map with labeled regions "University", "Residential Areas", and "Dining and Shopping"; color scale from "Low Entropy" to "High Entropy".]

    Figure 3. A map of the study region where observations are colored according to their level of entropy. Places such as the university campus and shopping and dining districts have high entropy levels. Residential areas have low entropy.

    For each co-location edge $\{u_1, u_2\}$ of $C$, we extract 67 features from the data describing contextual properties of the history of co-locations between $u_1$ and $u_2$. These features, which are outlined in Table 2, are designed to distinguish more meaningful co-location histories from chance co-locations. Broadly, we divide the features into four categories: Intensity and Duration, Location Diversity, Specificity, and Structural Properties.

    Intensity and Duration: The Intensity and Duration features measure qualities related to the size and spatial and temporal range of the set of co-locations. These features quantify how long and how actively users have embraced the system.

    Location Diversity: The location diversity measures given in Definition 1 provide the basis for several features which aid in understanding the context of a set of co-locations. For a given co-location observation between $u_1$ and $u_2$, let $l$ be the location where they were observed. We compute $\mathrm{Freq}(l)$, $\mathrm{UserCount}(l)$ and $\mathrm{Entropy}(l)$ for every co-location of $u_1$ and $u_2$, then we take the average, median, variance, minimum and maximum of the resulting values to get the features listed in Table 2. Additionally, although they are not listed in Table 2, we also use two variations of each of these features where the statistics are taken only over evening and weekend co-locations.

    [Figure 4: four example locations; each marker is an observation of user 1, user 2, or user 3.]

    $l_1$: Frequency: Low (Freq = 2); User count: Low (UserCount = 1); Entropy: Low (Entropy = 0)
    $l_2$: Frequency: High (Freq = 12); User count: Low (UserCount = 1); Entropy: Low (Entropy = 0)
    $l_3$: Frequency: High (Freq = 12); User count: High (UserCount = 3); Entropy: Low (Entropy = 0.60)
    $l_4$: Frequency: High (Freq = 12); User count: High (UserCount = 3); Entropy: High (Entropy = 1.58)

    Figure 4. Four representative scenarios highlighting the differences between the location diversity measures. Circles of a given shade represent location observations of a particular user.

    Specificity: Inspired by the tf-idf ranking technique from information retrieval, we would like to measure how specific a location is to the pair of users who were co-located there. For example, a domestic residence is highly specific to the married couple that lives there, since a large fraction of the observations there are co-locations of that couple. For a given location $l$, we define $\mathrm{TFIDF}_{u_1,u_2}(l)$ to be the number of times $u_1$ and $u_2$ were observed co-located at $l$ divided by $\mathrm{Freq}(l)$ (i.e. the total number of observations at the location). The Specificity features listed in Table 2 are determined by first computing $\mathrm{TFIDF}_{u_1,u_2}$ at each co-location observation of $u_1$ and $u_2$, and then taking the average, minimum, and maximum of the resulting data.

    Structural Properties: We use three variables which measure the strength of the structural relationship between $u_1$ and $u_2$ in $C$. Two of these are standard social network analysis techniques (NumMutualNeighbors and NeighborhoodOverlap). The third feature, LocationOverlap, is not strictly a structural property of $C$, but is computed similarly to NeighborhoodOverlap, and can be viewed as a similarity measure between the sets of locations $u_1$ and $u_2$ visit.
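The Location Diversity, Specificity, and Structural Properties features above can be sketched as follows, using the descriptions in the text and in Table 2. All function names and data layouts are hypothetical. Two choices are assumptions rather than details from the text: the variance is computed as a population variance, and the two users of the pair are excluded from their own neighbor sets when computing overlaps.

```python
import statistics


def summarize(values):
    """Average, median, variance, minimum and maximum of per-co-location values.

    Population variance is an assumption; the text does not say which
    variance estimator is used.
    """
    vals = list(values)
    return {
        "avg": statistics.mean(vals),
        "med": statistics.median(vals),
        "var": statistics.pvariance(vals),
        "min": min(vals),
        "max": max(vals),
    }


def tfidf(pair_colocations_at_l, freq_l):
    """Co-locations of the pair at location l divided by Freq(l)."""
    return pair_colocations_at_l / freq_l


def num_mutual_neighbors(neighbors, u1, u2):
    """People co-located with both users (the pair itself is excluded)."""
    return len((neighbors[u1] & neighbors[u2]) - {u1, u2})


def neighborhood_overlap(neighbors, u1, u2):
    """Mutual neighbors divided by people co-located with either user."""
    both = (neighbors[u1] & neighbors[u2]) - {u1, u2}
    either = (neighbors[u1] | neighbors[u2]) - {u1, u2}
    return len(both) / len(either) if either else 0.0


def location_overlap(locations_u1, locations_u2):
    """Places visited by both users divided by places visited by either."""
    union = locations_u1 | locations_u2
    return len(locations_u1 & locations_u2) / len(union) if union else 0.0
```

Here `neighbors` is assumed to be a dictionary mapping each user to the set of users they were co-located with in $C$, and the location sets are assumed to hold discretized grid cells.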

    Measuring the regularity of a user's routine

    In studying how properties of user mobility relate to properties of the underlying social network, one attribute we would like to quantify is the regularity of a user's schedule. We accomplish this by first representing each location observation $o \in O_u$ as a vector of values of the location, day of the week, and hour of the day of the observation. To measure how a user's mobility pattern repeats in regular intervals, we can restrict this vector to a subset of the components and study the observed probability distribution in the resulting subspace.

    For example, if we restrict the observations to just the location and day of the week components, then we could study how the user's schedule varies as a function of the day of the week. The observed joint probability distribution on these two components provides some insight into the regularity of the user's weekly schedule. If the distribution is concentrated at relatively few values, then the user has a highly regular weekly schedule, and if the distribution is spread out among several values, then the user has a highly irregular schedule. We will again use entropy as a measure of the spread of this distribution.

    Category: Intensity and Duration
      NumObservations: The total number of observations of the user.
      NumColoc, NumColocEvening, NumColocWeekend: The number of co-location observations of the two users, in total, in the evening only, and on weekends only.
      NumLocations, NumLocationsEvening, NumLocationsWeekend: The number of distinct grid boxes where the user or users were observed, in total, in the evening only, and on weekends only.
      NumHours, NumWeekdays, NumDates: The number of distinct hours of the day, days of the week, and calendar dates that the two users were observed together.
      ObservationTimeSpan: The difference in seconds between the last and the first location or co-location observation.
      BoundingBoxArea: The area of the minimal axis aligned rectangle that contains the locations/co-location observations of the user/users.

    Category: Location Diversity
      AvgEntropy, MedEntropy, VarEntropy, MinEntropy, MaxEntropy: The mean/median/variance/min/max of the location entropy at each location/co-location observation of the user/users.
      AvgFreq, MedFreq, VarFreq, MinFreq, MaxFreq: The mean/median/variance/min/max of the location frequency at each location/co-location observation of the user/users.
      AvgUserCount, MedUserCount, VarUserCount, MinUserCount, MaxUserCount: The mean/median/variance/min/max of the location user count at each location/co-location observation of the user/users.

    Category: Mobility Regularity
      SchEntropyL, SchEntropyLH, SchEntropyLD, SchEntropyLHD: The schedule entropy of the user with respect to location, location and hour, location and day of the week, and location and hour and day of the week.
      SchSizeLH, SchSizeLD, SchSizeLHD: The schedule size of the user with respect to location and hour, location and day of the week, and location and hour and day of the week.

    Category: Specificity
      AvgTFIDF, MinTFIDF, MaxTFIDF: The mean/minimum/maximum of the location TFIDF at each co-location of the two users.
      PercentObservationsTogether: The total number of co-locations of the two users divided by the sum of each user's total number of observations.

    Category: Structural Properties
      NumMutualNeighbors: The number of people who have been co-located with both users.
      NeighborhoodOverlap: The number of people who have been co-located with both users divided by the number of people who have been co-located with either user.
      LocationOverlap: The total number of distinct places visited by both users divided by the total number of places visited by either user.

    [The "Co-location" and "User mobility" indicator columns of Table 2 are not preserved in this transcript.]

    Table 2. Above are the names and descriptions of the independent variables used in our models. We divide the variables into five categories to better understand how the groups of variables relate to one another. Further, we indicate whether the variables are features of co-location, or features of individual user mobility, or both.

    We now formalize this intuition. Let $R \subseteq \{L, D, H\}$ denote the components of the restriction (standing for location, day of the week, and hour of the day respectively). For a given $o \in O_u$, we let $o(R)$ be the restriction of $o$ to the components of $R$. Then $O_u(R) = \{o(R) : o \in O_u\}$ is the unique set of $|R|$-tuples generated by applying the restriction $R$ to observation vectors in $O_u$ (i.e. observations of user $u$). We define the observed probability distribution of the restricted observations $r \in O_u(R)$ as

    $$P(r) = \frac{|\{o \in O_u : o(R) = r\}|}{|O_u|}.$$

    Definition 2: The schedule size of $u$ with respect to restriction $R$ is defined as $\mathrm{SchSize}(O_u, R) := |O_u(R)|$ and the schedule entropy of $u$ with respect to $R$ is defined as

    $$\mathrm{SchEntropy}(O_u, R) := -\sum_{r \in O_u(R)} P(r) \log P(r).$$

    To clarify these definitions, suppose again that $R = \{L, D\}$, so $R$ is a restriction of the observation to the location and day of the week. Then $\mathrm{SchSize}(O_u, R)$ will be high if on each day of the week $u$ visits many different locations, and $\mathrm{SchEntropy}(O_u, R)$ will be high if $u$ visits many locations on each weekday in relatively equal proportions.
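    The two definitions above can be illustrated with a minimal Python sketch. The observation layout (dictionaries with keys "L", "H", and "D"), the function names, and the toy trace are assumptions made for illustration and are not part of the study's implementation. The base of the logarithm is not stated in the text, so the natural logarithm is assumed here.

```python
from collections import Counter
from math import log

def restrict(observations, restriction):
    """Project each observation onto the attributes named in `restriction`."""
    return [tuple(obs[key] for key in restriction) for obs in observations]

def schedule_size(observations, restriction):
    """SchSize(O_u, R): the number of distinct restricted tuples."""
    return len(set(restrict(observations, restriction)))

def schedule_entropy(observations, restriction):
    """SchEntropy(O_u, R): Shannon entropy (natural log, an assumption)."""
    counts = Counter(restrict(observations, restriction))
    total = sum(counts.values())
    return -sum((c / total) * log(c / total) for c in counts.values())

# Hypothetical toy trace: "L" is a location id, "H" the hour, "D" the weekday.
trace = [
    {"L": "home", "H": 8, "D": "Mon"},
    {"L": "home", "H": 22, "D": "Mon"},
    {"L": "cafe", "H": 12, "D": "Mon"},
    {"L": "cafe", "H": 12, "D": "Tue"},
]

size_ld = schedule_size(trace, ("L", "D"))        # distinct (location, day) pairs
entropy_ld = schedule_entropy(trace, ("L", "D"))  # spread over those pairs
```

    In this toy trace the restriction to location and day yields three distinct tuples with proportions 1/2, 1/4, and 1/4, so the schedule size is 3 and the schedule entropy is 1.5 log 2.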

    User mobility features

    For each vertex $u$ of $C$, we extract 64 features from the data describing properties of the mobility patterns of $u$. Again, see Table 2 for a full description of the features. We partition the user mobility features into three categories: Intensity and Duration, Location Diversity, and Mobility Regularity.

    Intensity and Duration: Similar to the corresponding category in the co-location model, these features measure the intensity and range of the user's use of the system.

    Location Diversity: Location Diversity features for the user mobility model are identical to the co-location model, except instead of measuring the diversity of co-location observations, we measure the diversity of the location observations of a single user.

    Mobility Regularity: In this work we consider four restrictions: $\{L\}$, $\{L, H\}$, $\{L, D\}$, $\{L, H, D\}$. Computing the schedule size and schedule entropy on these four restrictions yields the seven Mobility Regularity features listed in Table 2 (the eighth feature is $\mathrm{SchSize}(O_u, \{L\})$, which is already represented by NumLocations). Similar to the location diversity variables, we also use evening and weekend variations for the Mobility Regularity features.

    RESULTS

    In this section we present our main results. We analyze the relative importance of the independent variables both in the context of predicting the number of social network ties a user has as well as the existence of a social network tie between two co-located users.

    Inferring social network ties from co-location

    First we consider the task of predicting ties from co-location data. For each edge of C we use a binary response variable FacebookFriends to indicate whether or not the corresponding edge is present in S. In total, there were 307 co-location edges where the users were Facebook friends and 3330 co-location edges where the users were not Facebook friends. We model FacebookFriends as a function of the co-location features.

    We then trained 6 classifiers on the data: 2 Random Forest classifiers, 2 AdaBoost classifiers (with decision stumps), and 2 SVM classifiers (with polynomial kernels). The performance of each classifier was measured against the true values of whether the users are Facebook friends using a 50-fold cross validation procedure. Table 3 shows the observed average precision and recall for each classifier from this procedure.

    Classifier                              Prec.   Recall
    RandomForests (10 vars. per node)       0.62    0.22
    RandomForests (18 vars. per node)       0.61    0.22
    AdaBoost (dec. stumps, exp. loss)       0.68    0.24
    AdaBoost (dec. stumps, lgstc. loss)     0.60    0.28
    SVM (deg 2 polynomial kernel)           0.40    0.31
    SVM (deg 3 polynomial kernel)           0.26    0.37

    Table 3. The observed precision and recall of the 6 classifiers we tested. Predictions were conducted with a 50-fold cross validation procedure over all the observations in the dataset. The number of variables at each split in the Random Forest models was chosen via cross validation. Each Random Forest model had 1000 trees. The AdaBoost algorithm was run for 400 iterations.

    The RandomForest models and the AdaBoost models outperform the two SVM models that we trained. In particular, the AdaBoost model with exponential loss seems to perform the best, having the best observed precision, and very near to the best observed recall. It correctly identifies 74/307 friendships and 3295/3330 non-friendships. Although the overall accuracy of the classifier is high (92%), this is somewhat misleading since the class distribution is heavily biased towards non-friendship, so we prefer to examine the precision and recall to get a clearer picture of performance.
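    The counts reported above are enough to recompute the headline metrics. The short Python sketch below does so from the pooled counts (74 of 307 friendships and 3295 of 3330 non-friendships identified correctly); the function and variable names are illustrative. It also computes the accuracy of a trivial rule that always predicts non-friendship, which is our own addition to show why accuracy alone is uninformative on this class distribution.

```python
def precision_recall_accuracy(tp, fp, fn, tn):
    """Standard confusion-matrix metrics for the positive ('friends') class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy

# Counts taken from the text for AdaBoost with exponential loss.
tp = 74            # friendships correctly identified
fn = 307 - 74      # friendships missed
tn = 3295          # non-friendships correctly identified
fp = 3330 - 3295   # non-friendships flagged as friendships

precision, recall, accuracy = precision_recall_accuracy(tp, fp, fn, tn)

# Accuracy of always predicting "not friends" (illustrative comparison).
majority_accuracy = 3330 / (307 + 3330)
```

    With these counts the precision is about 0.68 and the recall about 0.24, matching Table 3 to two decimals, while the always-negative rule already reaches roughly 0.92 accuracy.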

    To examine the relative predictive power of the 4 feature classes, we trained an AdaBoost classifier (exponential loss / decision stumps) using only the Intensity and Duration features. Such a model uses proximity-based features similar to those used by Eagle et al. for a similar friend prediction task [4]. To see how these features perform against the location features we define in this work, we compare this model to one trained using the three remaining co-location feature classes (Location Diversity, Specificity, and Structural Properties). The results are shown in Figure 5. Here we plot the precision and recall curves at varying thresholds of the class probabilities output by AdaBoost. We again use 50-fold cross validation for all classifier estimates. For comparison, we also plot the precision/recall curves for the full AdaBoost classifier, and a baseline formed by thresholding NumColocations at varying threshold values.

    One can observe that both the full model and the model trained on the Location Diversity, Specificity, and Structural Properties features significantly outperform the model trained only on Intensity and Duration features. Furthermore, at moderate to high recall levels, the Intensity and Duration model does not offer any improvement over simply thresholding on NumColocations. This is in contrast to the other models, which show consistent and significant gains over the baseline at all recall levels.
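    The thresholding baseline can be sketched in a few lines of Python: a pair is predicted to be friends when its co-location count meets a threshold, and precision and recall are recorded as the threshold varies. The function name and the toy data are hypothetical, and reporting a precision of 1.0 when nothing is predicted positive is a convention assumed here.

```python
def precision_recall_at_thresholds(scores, labels, thresholds):
    """Return (threshold, precision, recall) rows for the rule score >= threshold."""
    rows = []
    positives = sum(labels)
    for t in thresholds:
        predicted = [s >= t for s in scores]
        tp = sum(1 for p, y in zip(predicted, labels) if p and y)
        fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
        # Convention (assumption): precision is 1.0 when no pair is predicted positive.
        precision = tp / (tp + fp) if (tp + fp) else 1.0
        recall = tp / positives if positives else 0.0
        rows.append((t, precision, recall))
    return rows

# Hypothetical toy data: co-location counts per user pair and friendship labels.
num_colocations = [1, 1, 2, 3, 5, 8, 13, 21]
facebook_friends = [0, 0, 0, 1, 0, 1, 1, 1]

curve = precision_recall_at_thresholds(num_colocations, facebook_friends, [1, 3, 8])
```

    Raising the threshold trades recall for precision; the same sweep applied to the class probabilities of a classifier gives curves of the kind compared in Figure 5.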

    The low performance of the baseline and of the Intensity and Duration features is illustrative of the wide range of co-location patterns observed among the participants. This result shows that, although co-location alone is not a very strong predictor of online friendship, we can significantly improve the predictive performance by looking at additional contextual social properties of the locations the users visit.

    [Figure 5: precision (vertical axis, 0.0 to 1.0) plotted against recall (horizontal axis, 0.0 to 1.0) for four curves: AdaBoost: Full Model; AdaBoost: Loc. Diversity, Specificity, Structural Prop.; AdaBoost: Intensity and Duration; Thresholding NumColocations.]

    Figure 5. Observed precision and recall of the full AdaBoost classifier on all the features, compared with two sub-models: the first is only trained on the Intensity and Duration features, and the second is trained on the Location Diversity, Specificity, and Structural Properties features. Precision and recall estimates are made using 50-fold cross validation. These are shown against a baseline constructed by thresholding NumColocations at varying threshold values.

    Inferring the number of friends from user mobility data

    Next we consider the relationship between the number of Facebook friends a user has in Locaccino and her mobility patterns. There are plausible hypotheses for why one might expect variables in each of the three mobility feature categories to be correlated with the number of friends the user has in Locaccino. First, one would expect that users who have used the system longer or more intensively might have more friends in the system. Furthermore, since locations of high diversity are in some ways more social, one could hypothesize that users who visit such locations often might have more friends in general. Finally, users with highly irregular schedules might find a system such as Locaccino more useful to help coordinate with their friends and family.

    To examine this question, we first calculated Pearson's correlation between the node degrees in S and each user mobility feature listed in Table 2. The results are plotted in Figure 6, where the three variable categories have been grouped and shaded in different colors. Several observations stand out in this figure. First, variables in the Intensity and Duration category have far weaker correlations with the number of friends than the more nuanced variables in the Location Diversity and Mobility Regularity categories. Only 2 out of the 6 Intensity and Duration features correlate with the number of friends, and only very weakly (0.10 - 0.12).

    [Figure 6: bar chart of correlation (vertical axis, -1.0 to 1.0) for each of the 64 user mobility variables, grouped by category. Intensity and duration: NumObservations, NumLocations, NumLocationsEvening, NumLocationsWeekend, ObservationTimeSpan, BoundingBoxArea. Location diversity: the Avg, Var, Med, Min, and Max statistics of Entropy, Freq, and UserCount, each with Evening and Weekend variations. Mobility regularity: SchEntropyL, SchEntropyLH, SchEntropyLD, SchEntropyLHD, SchEntropyLEvening, SchEntropyLWeekend, SchEntropyLHEvening, SchEntropyLHWeekend, SchSizeLH, SchSizeLHEvening, SchSizeLHWeekend, SchSizeLD, SchSizeLHD.]

    Figure 6. Pearson's correlation of user mobility variables with node degree in S. Error bars indicate an approximate 95% confidence interval. The highest correlation values correspond to variables that measure the location diversity of observations.

    The weak correlations of the Intensity and Duration variables are in contrast to the Location Diversity category, where the variable with highest correlation is MaxEntropyWeekend (cor=0.39 with 95% CI=(0.31, 0.47)). Indeed, MaxEntropyWeekend is the single variable most correlated with the number of friends. Furthermore, note that for each diversity measure, the highest correlated sample statistic on the set of locations is always the maximum: MaxEntropy (cor=0.29 with 95% CI=(0.20, 0.37)), MaxUserCount (cor=0.30 with 95% CI=(0.21, 0.38)), and MaxFreq (cor=0.29 with 95% CI=(0.20, 0.37)), with similar results for the evening and weekend variations of these variables. These variables each quantify in their own way the most diverse location that a user visited, suggesting that users who visit highly diverse locations tend to have more social network ties than those who do not. In addition to the maximum, the average and variance sample statistics exhibit moderate positive correlations with the number of friends, whereas the minimum exhibits a weak negative correlation. The median statistic showed very weak and often insignificant correlations with the number of friends.
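    The text does not say how the approximate 95% confidence intervals were obtained. One common choice is the Fisher z-transformation, sketched below in Python as an assumption rather than as the procedure actually used. The sample size of 489 is the number of users reported for the study and may differ from the number of users entering each correlation; the function names are illustrative.

```python
from math import atanh, sqrt, tanh

def pearson_r(xs, ys):
    """Pearson's product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def fisher_ci(r, n, z_crit=1.959964):
    """Approximate confidence interval for r via the Fisher z-transformation."""
    z = atanh(r)
    half_width = z_crit / sqrt(n - 3)
    return tanh(z - half_width), tanh(z + half_width)

# Reported correlation for MaxEntropyWeekend; n = 489 is an assumption.
low, high = fisher_ci(0.39, 489)
```

    Under these assumptions the interval for cor=0.39 comes out at roughly (0.31, 0.46), close to the reported (0.31, 0.47); the small difference is consistent with a somewhat different sample size or interval method.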

    It is important to make the distinction that the location diversity measures for a user do not simply take census data of other nearby users at each specific location the user visits. Rather they are global properties of the locations themselves, taking into account all observations of all users at each given location (recall Figure 4). It is thus possible, even likely, for a user to be located at a highly diverse location, yet also not be co-located with any other system users. In this sense we believe these correlations are strong, if not surprising, results illustrating a unique relationship between the context of the locations a user visits and the number of online social network ties the user has.

    Next observe the moderate positive correlation values for the schedule entropy variables in the Mobility Regularity category: SchEntropyL (cor=0.16 with 95% CI=(0.06, 0.25)), SchEntropyLH (cor=0.28 with 95% CI=(0.20, 0.37)), SchEntropyLD (cor=0.28 with 95% CI=(0.19, 0.37)), and SchEntropyLHD (cor=0.32 with 95% CI=(0.23, 0.40)), as well as their corresponding evening and weekend variations. As these variables quantify the regularity of a user's schedule, the results suggest that users who have irregular schedules tend to have more ties in the online social network S.

    Variable               b-estimate   β-estimate   p-value
    SchEntropyLHWeekend     7.67e-01     0.237        0.001
    AvgFreqEvening          1.25e-04     0.228      < 0.001
    MinFreqEvening         -1.83e-04    -0.195        0.002
    MaxEntropyWeekend       6.99e-01     0.188        0.007
    VarEntropyEvening       8.46e-01     0.148        0.008
    MinEntropy              9.08e-01     0.119        0.055
    BoundingBoxArea        -2.97e+02    -0.102        0.037
    VarFreq                -3.01e-09    -0.092        0.098
    MinFreqWeekend         -4.26e-05    -0.042        0.398
    NumObservations        -3.32e-05    -0.037        0.500

    Table 4. Raw b-coefficient estimates, standardized β-coefficient estimates, and p-values from a multiple regression with the listed variables taken as independent and the number of friends taken as dependent. Variables are sorted according to the absolute value of the β estimates.

    It is notable that both the diversity variables and the regularity variables have a higher correlation with the number of friends than variables which measure the intensity and duration of system use. This suggests that the correlations observed in the diversity and regularity features are not simply byproducts of heavy system use.

    To better understand the interrelations among the user mobility features with respect to the number of friends, we conducted a multiple regression analysis. First, to account for multicollinearity in the data, for any pair of independent variables having correlation higher than 0.80 in absolute value, the variable with lowest absolute correlation to the number of friends was discarded. We then performed a stepwise search with AIC penalty, working backwards from the full model, to select a sub-model of the full linear model. The fitted model of the remaining 10 variables (2 intensity, 7 diversity, 1 regularity) given by this procedure is shown in Table 4.
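    The correlation filter described above can be sketched as follows in Python. The text does not specify the order in which pairs are examined or how a variable that has already been discarded is treated, so the greedy pass over alphabetically sorted pairs below is an assumption, as are the function names and the toy data.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson's correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def drop_collinear(features, target, threshold=0.80):
    """For each highly correlated pair, discard the variable that is less
    correlated (in absolute value) with the target. Greedy order is an assumption."""
    kept = set(features)
    target_corr = {name: abs(pearson(vals, target)) for name, vals in features.items()}
    for a, b in combinations(sorted(features), 2):
        if a in kept and b in kept and abs(pearson(features[a], features[b])) > threshold:
            kept.discard(a if target_corr[a] < target_corr[b] else b)
    return sorted(kept)

# Hypothetical toy data: x2 nearly duplicates x1, x3 is only weakly related.
toy_features = {
    "x1": [1, 2, 3, 4, 5],
    "x2": [2, 4, 6, 8, 10.5],
    "x3": [5, 1, 4, 2, 3],
}
toy_target = [1, 2, 3, 4, 5]
selected = drop_collinear(toy_features, toy_target)
```

    Here x1 and x2 are correlated above 0.80, and x2 is discarded because it is slightly less correlated with the target; x3 survives because it is not strongly correlated with either.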

    The resulting model (adj. $R^2$ = 0.21, p-val < 0.001) yields very strong evidence that the diversity and regularity variables outperform the intensity variables. Examining the standardized β coefficients from the regression can provide some insight into which variables are the strongest predictors. We can see that the two variables with highest absolute β are AvgFreqEvening (β = 0.223) and SchEntropyLHWeekend (β = 0.237). Indeed, the Intensity and Duration variables in total comprise 10% of the total absolute weight of the β coefficients, whereas the Location Diversity variables comprise 73%, and the Mobility Regularity terms comprise 17%.
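    The category shares quoted above can be checked directly from the standardized coefficients in Table 4. In the Python sketch below, the assignment of each variable to a category is inferred from the naming scheme of Table 2 and from the stated split of 2 intensity, 7 diversity, and 1 regularity variables; it is an assumption rather than something the table states explicitly.

```python
# Standardized coefficients from Table 4 with an inferred category per variable.
betas = {
    "SchEntropyLHWeekend": ("regularity", 0.237),
    "AvgFreqEvening": ("diversity", 0.228),
    "MinFreqEvening": ("diversity", -0.195),
    "MaxEntropyWeekend": ("diversity", 0.188),
    "VarEntropyEvening": ("diversity", 0.148),
    "MinEntropy": ("diversity", 0.119),
    "BoundingBoxArea": ("intensity", -0.102),
    "VarFreq": ("diversity", -0.092),
    "MinFreqWeekend": ("diversity", -0.042),
    "NumObservations": ("intensity", -0.037),
}

# Sum the absolute standardized weights within each category.
totals = {}
for category, beta in betas.values():
    totals[category] = totals.get(category, 0.0) + abs(beta)

# Express each category as a rounded percentage of the total absolute weight.
grand_total = sum(totals.values())
shares = {category: round(100 * total / grand_total) for category, total in totals.items()}
```

    With the Table 4 values this gives 10% for intensity, 73% for diversity, and 17% for regularity, in agreement with the percentages stated in the text.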

    This result provides evidence that an examination of the context of the locations a user visits and analysis of the regularity of the user's routine can provide valuable insight into social behaviors of the user, in this case the number of friends the user has in an online social network. We have shown that the types of places a user visits and the regularity of a user's routine are stronger predictors of the number of Locaccino friends they have than how long or how intensely they use the system.

    DISCUSSION

    In this work we have explored several interesting connections between an online social network and an offline co-location network on the same user set. These networks have very different structures. The co-location network has roughly 3 times the number of edges as the social network, yet the social network is better connected. The co-location network has many small disconnected components, but it has a single large and highly connected subcomponent. Despite these differences, we have shown that the co-location graph contains important information that can be used to reconstruct a portion of the social network.

    We have shown that properties of the locations a user visits can provide valuable context to the user observations. In particular we have shown that the entropy of a location is a valuable tool for analyzing social mobility data. By definition, locations of high entropy are precisely the places where chance encounters are most probable, so co-locations at high entropy locations are much more likely to be random occurrences than co-locations at low entropy locations. Thus if two users are only observed together at a location of high entropy, such as a shopping mall or a university center, they are less likely to actually have a tie in the online social network than if they are observed in a place of low entropy.
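    To make the contrast between high and low entropy locations concrete, the following Python sketch computes a Shannon entropy over the share of observations each user contributes at a location. This is a sketch of a standard entropy calculation under the assumption that it matches the location entropy feature used in this work; the observation logs and names are hypothetical, and the natural logarithm is assumed.

```python
from collections import Counter
from math import log

def location_entropy(visitor_ids):
    """Shannon entropy of the distribution of observations over users at one location."""
    counts = Counter(visitor_ids)
    total = sum(counts.values())
    return -sum((c / total) * log(c / total) for c in counts.values())

# Hypothetical observation logs: one user id per observation at that location.
mall = ["a", "b", "c", "d", "e", "f", "g", "h"]       # eight users, one visit each
apartment = ["a", "a", "a", "a", "a", "a", "b", "b"]  # two users, uneven shares

mall_entropy = location_entropy(mall)
apartment_entropy = location_entropy(apartment)
```

    The mall-like location, visited by many users in equal proportions, has entropy log 8, while the apartment-like location dominated by one user has a much lower entropy, so a co-location there is less likely to be a chance encounter.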

    We have also shown that the entropy of the locations a user visits can provide insight into the number of ties that the individual has in the social network. Users who visit locations of higher entropy tend to have more ties in the social network than users who visit less diverse locations. One possible explanation for this result is that locations of high entropy tend to be more social in nature than locations of lower entropy, and so users who visit these locations tend to be more social. However, future studies are needed to further explore this relationship.

    In addition to location diversity measures, we have explored several novel features that have proven useful in analyzing social mobility data. We looked at features that measure the intensity, location diversity, specificity, and structural properties of a set of co-locations, and we used these to construct a classifier that predicts social network ties between the users. These features far outperform predictions based on simple co-location observation counts between the two users.

    In our analysis we have observed a wide range of co-location patterns between both Facebook friends and non-friends. We view this as testament to the complexities of human social relations (both online and offline). Indeed, the data show many instances where users are not friends in the online social network, yet exhibit very convincing co-location patterns for friendship. Similarly, there are numerous instances of friendships in the online social network with little to no evidence for friendship in the co-location data.

    This disparity highlights two strong use cases for our work. Online social networks could use our classifier in their friend recommendation systems to find users with strong co-location patterns who are not yet friends in the social network. Such a system could strengthen current link-based friend recommendation systems by taking into account user behavior in the offline world to bolster online social relationships. Additionally, although future research is needed to verify the hypothesis, it is plausible that the predictions (and mis-predictions) of our classifier could provide insight into the strength of ties between users [8, 6]. If this were true, our work could help location-aware social networks develop systems that assist users in segregating and categorizing their online connections, which among other things could be useful in building privacy rules and organizing the social graph.

    One area where we feel our work has the greatest potential is as a window into the relationship between online and offline social behavior. We show that location-based features (such as the entropy of a location) have significant correlations with real social behavior features (such as the number of friends in a social network). Understanding the interplay between users' location patterns and social patterns is an important area for future research.

    It is also important to highlight some of the limitations of our work. Location data is extremely sensitive [14], and it is clear that the type of analysis we perform requires strong privacy controls and procedures that would protect users. Also, our pool of study participants is highly homogeneous, containing mostly students. Future studies would be strengthened by seeking a more diverse pool of participants, as other populations could exhibit different online and offline behavior.

    CONCLUSIONS

    In this work we explore the connection between an online social network and the location traces of its users. We evaluate a set of features of the location observations for their potential in analyzing the social behavior of the users. Social network designers may find our methodology useful for designing social applications, such as location-aware information sharing platforms, privacy control mechanisms, and friend suggestion systems.

    This work opens up many future paths of research. Can different types of social relationships be inferred from location data? Can tie strength be estimated from locations? Does offline interaction spur online communication? This work also raises important privacy questions about how much information location-based services leak about their users. We believe that this work provides a necessary step towards addressing such questions.

    ACKNOWLEDGEMENTS

    This work is supported by NSF Cyber Trust grant CNS-0627513, by CyLab at Carnegie Mellon under grant DAAD19-02-1-0389 from the Army Research Office, and by the Fundação para a Ciência e a Tecnologia through the Carnegie Mellon Portugal Program. Additional support has been provided by Microsoft through the Carnegie Mellon Center for Computational Thinking, and by grants from France Telecom, Nokia and Google. The authors would also like to thank David Eggerschwiler and Jay Springfield for developing the locator software, and the members of the Mobile Commerce Lab for their feedback.

    REFERENCES

    1. CULP, M., JOHNSON, K., AND MICHAILIDES, G. ada: An R package for stochastic boosting. Journal of Statistical Software 17, 2 (September 2006), 1-27.

    2. DERESIEWICZ, W. Faux friendship. The Chronicle of Higher Education (2009).

    3. EAGLE, N., AND PENTLAND, A. Eigenbehaviors: identifying structure in routine. Behavioral Ecology and Sociobiology 63, 7 (May 2009), 1057-1066.

    4. EAGLE, N., PENTLAND, A. S., AND LAZER, D. Inferring friendship network structure by using mobile phone data. Proceedings of the National Academy of Sciences 106, 36 (September 2009), 15274-15278.

    5. ELLISON, N. B., STEINFIELD, C., AND LAMPE, C. The benefits of Facebook friends: social capital and college students' use of online social network sites. Journal of Computer-Mediated Communication 12, 4 (2007).

    6. GILBERT, E., AND KARAHALIOS, K. Predicting tie strength with social media. In CHI '09: Proceedings of the 27th international conference on Human factors in computing systems (New York, NY, USA, 2009), ACM, pp. 211-220.

    7. GONZÁLEZ, M. C., HIDALGO, C. A., AND BARABÁSI, A.-L. Understanding individual human mobility patterns. Nature 453, 7196 (June 2008), 779-782.

    8. GRANOVETTER, M. S. The strength of weak ties. The American Journal of Sociology 78, 6 (1973), 1360-1380.

    9. HAMPTON, K., SESSIONS, L., HER, E. J., AND RAINIE, L. Social isolation and new technology. Tech. rep., Pew Internet and American Life report, November 2009.

    10. KRAUT, R., PATTERSON, M., LUNDMARK, V., KIESLER, S., MUKOPADHYAY, T., AND SCHERLIS, W. Internet paradox: A social technology that reduces social involvement and psychological well-being. American Psychologist 53 (1998), 1017-1031.

    11. LI, Q., ZHENG, Y., XIE, X., CHEN, Y., LIU, W., AND MA, W.-Y. Mining user similarity based on location history. In GIS '08: Proceedings of the 16th ACM SIGSPATIAL international conference on Advances in geographic information systems (New York, NY, USA, 2008), ACM, pp. 1-10.

    12. MIKLAS, A. G., GOLLU, K. K., CHAN, K. K. W., SAROIU, S., GUMMADI, K. P., AND DE LARA, E. Exploiting social interactions in mobile systems. In UbiComp '07: Proceedings of the 9th international conference on Ubiquitous computing (Berlin, Heidelberg, 2007), Springer-Verlag, pp. 409-428.

    13. RICOTTA, C., AND SZEIDL, L. Towards a unifying approach to diversity measures: Bridging the gap between the Shannon entropy and Rao's quadratic index. Theoretical Population Biology 70, 3 (2006), 237-243.

    14. SADEH, N., HONG, J., CRANOR, L., FETTE, I., KELLEY, P., PRABAKER, M., AND RAO, J. Understanding and capturing people's privacy policies in a mobile social networking application. Journal of Personal and Ubiquitous Computing 13, 6 (August 2009).

    15. WELLMAN, B., HOGAN, B., BERG, K., BOASE, J., CARRASCO, J.-A., CÔTÉ, R., KAYAHARA, J., KENNEDY, T. L. M., AND TRAN, P. Networked Neighbourhoods. Springer, 2006, ch. Connected Lives: The Project.

    16. WYATT, D., BILMES, J., CHOUDHURY, T., AND KITTS, J. A. Towards the automated social analysis of situated speech data. In UbiComp '08: Proceedings of the 10th international conference on Ubiquitous computing (New York, NY, USA, 2008), ACM, pp. 168-171.

    17. ZHENG, Y., LI, Q., CHEN, Y., XIE, X., AND MA, W.-Y. Understanding mobility based on GPS data. In UbiComp '08: Proceedings of the 10th international conference on Ubiquitous computing (New York, NY, USA, 2008), ACM, pp. 312-321.
