
University of California, San Diego

Master’s Project

Publishing CitiSense Data: Privacy Concerns and Remedies

Author: Kapil Gupta

Supervisor: Prof. Bill Griswold

March 15, 2013

Master’s Project Report • March 15, 2013

Publishing CitiSense Data: Privacy Concerns and Remedies

KAPIL GUPTA

University of California, San Diego

Abstract

Publishing original spatial trajectories obtained from a location-based service (LBS) to the public or to a third party for data analysis could result in serious privacy breaches. CitiSense generates huge collections of spatio-temporal data, variously called moving object data, trajectory data, or mobility data. In the first part of this report we study the privacy violations, such as identity revelation, that an individual could suffer if the CitiSense data were made public. We then apply an existing methodology for privacy-preserving data publication called (k, δ)-anonymity and demonstrate its effectiveness on the CitiSense dataset. This technique exploits the inherent uncertainty of location in order to decrease the amount of distortion required to anonymize the data.

Data from location-based services have great utility in various data-analysis applications such as city traffic control, mobility management, urban planning, and location-based service advertisements, just to mention a few. An extensive amount of research has therefore been done on these data, as indicated by the large number of spatio-temporal data mining techniques developed in the recent past [28, 29, 27, 42, 43, 49, 36, 37, 44, 9]. It is thus critical to develop techniques that transform a database of trajectories of moving objects so that it satisfies some concept of anonymity while the transformed database retains most of its original utility.

Anonymity cannot be assured by simply replacing users’ real identifiers (e.g., name, age, date of birth, etc.) with pseudonyms. As demonstrated in [1], using pseudonyms does not guarantee anonymity, since location is itself a property that can be used to identify an individual. For example, if a person is known to follow a certain route every day, it is highly likely that the end-points of the route are the workplace (or school) and the home of that person. Also, due to the existence of quasi-identifier locations, i.e., sets of locations that can be linked to external information to re-identify individuals, anonymous location data may be traced back to personally identifying information with the help of additional data sources [10]. Contemporary techniques for trajectory data mining and knowledge discovery have concentrated both on the geometrical properties of trajectories and on their background geographic information (semantic trajectory mining).

We cannot simply strip the location information from a reading in the CitiSense data, as that would hurt the utility of the published data. Adding noise to the location data to anonymize it would also hurt the utility of the sensors’ readings. On the other hand, if we simply publish the location information of the readings as they are, we risk exposing the many forms of sensitive information that the trajectories are likely to contain. Trajectories therefore cannot be released for public use before they are properly anonymized.

The problem of location privacy has been well studied in the context of location-based services [39; 46; 31; 22; 47]. The focus is both on on-line, service-centric anonymity and on off-line, data-centric anonymity (as in the context of data publishing). In this report, we focus on the latter and study the problem of anonymity-preserving data publishing for the CitiSense data. We use the NWA algorithm [6], which extends the concept of k-anonymity [2] to handle the type of data we have and to exploit its inherent uncertainty [3], [4], [5].

Please note that the extent to which the location of an individual represents vulnerable information, and what exactly constitutes private and sensitive information, are philosophical, social and individual concerns that are beyond the scope of this project.

Paper content and organization: The rest of the paper is organized as follows. Section I gives an overview of the CitiSense project and its dataset. Section II describes the preprocessing of the CitiSense dataset to remove outliers and to compress the data. Section III discusses privacy concerns in publishing the data and the various kinds of information that can be extracted from the published data; it also presents privacy breaches on the CitiSense data. Section IV examines existing anonymization techniques and proposes Region of Interest based anonymization and temporal cloaking for the CitiSense data. Section V evaluates the dataset after applying anonymization and discusses the findings. Finally, Section VI concludes the paper and suggests some ideas for extensions to this work.

Table 1: Sample reading of publicly available CitiSense sensor data

sensorId  reading  dateSampled             Latitude    Longitude     locationAccuracy
3         0.05     2012-07-29 23:57:47-07  32.8643388  -117.2221105  60.000000

I. CitiSense Dataset

CitiSense is a portable pollution monitoring system that allows one to get real-time air quality readings for one’s surroundings on a smartphone. The CitiSense system includes small sensors carried by users, the users’ Android mobile phones and a back-end infrastructure that stores the collected data. CitiSense devices can estimate air quality in the area where they are deployed, providing information to everyone, not just those carrying sensors. For publishing this dataset to individuals and public health agencies, providing only the sensor’s reading, the date sampled and the location information is sufficient. A sample of the publicly available dataset is shown in Table 1. The sensorId can take 7 different values. The dataset used in this project contains readings from 30 users over a period of five weeks (Jul 30 - Sep 7). The total number of rows in this dataset is more than 21.5 million.

The sampling interval for sensor readings in the CitiSense system is very aggressive, usually a few seconds. Processing data at such a high rate would be computationally very challenging. Also, because of the high sampling rate, the database has enormous privacy implications: even for only 30 days, the data for an individual would contain hundreds of thousands of data points, leading to identification of his/her home, office, spatial patterns, etc.

II. Preprocessing

Choosing high sampling rates for acquiring the sensors’ readings from individuals leads to massive data collections. Thus, it is imperative to apply data compression methodologies during the preprocessing of trajectories. Additionally, filtering the data helps in diminishing noise and in assessing higher-level properties such as speed and direction. Since trajectories are normally measured by a sensor, they inevitably contain some error, including occasional outliers. Simple techniques like mean and median filtering can reduce these errors. In addition to reducing error, certain filters like the Kalman filter and the particle filter can also give error estimates and inferences on speed and direction. Because we acquire data using a sampling-based approach, the representation of object trajectories is discrete even though the object movement is continuous. However, object movements display predictable patterns due to the linear properties of the underlying transportation framework. Consequently, much of the redundant and erroneous data can be eliminated from the trajectory without compromising much of the useful information [8]. An attacker would need these same preprocessing steps to mine the underlying hidden information.

A. Trajectory Filtering & Smoothing

Due to the uncertainty of the data obtained from GPS devices, outliers need to be removed before behavior mining or region of interest extraction can be done. Filtering the data is particularly essential when one intends to deduce other properties from it, such as speed or direction. In this project, we discuss two filters that eliminate outliers, and we segment the trajectory data into trips.

All the calculations (speed, acceleration, compression, data mining, etc.) are done after converting the location data (latitude, longitude, altitude) into Earth’s coordinates using the Mercator projection [7]. It is a cylindrical map projection which specifies how the geographic detail is transferred from the globe to a cylinder tangential to it at the equator. The cylinder is then unrolled to give the planar map (see Figure 1).

Figure 1: A cylindrical map projection to find coordinates in the frame of reference of Earth’s center using latitude and longitude. Figure taken from [7].
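As an illustration of the conversion step, the following is a minimal sketch of a spherical Mercator projection. The function name `mercator_xy` and the choice of the WGS84 equatorial radius are assumptions made here for illustration; the report does not state which radius or library was used.

```python
import math

# Assumption: WGS84 equatorial radius in metres; the report does not give one.
EARTH_RADIUS_M = 6378137.0


def mercator_xy(lat_deg, lon_deg, radius=EARTH_RADIUS_M):
    """Project latitude/longitude (degrees) to planar Mercator x, y (metres)."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    x = radius * lon
    # Standard spherical Mercator formula for the northing.
    y = radius * math.log(math.tan(math.pi / 4.0 + lat / 2.0))
    return x, y


# Project the sample reading from Table 1.
x, y = mercator_xy(32.8643388, -117.2221105)
```

Distances measured in the Mercator plane are stretched by roughly 1/cos(latitude) away from the equator, so speed or distance thresholds applied to projected coordinates may need a corresponding correction.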

Duplication filter: If the distance between two consecutive positions is smaller than a threshold, the duplication filter removes the second position. The CitiSense dataset contains multiple sensor types, leading to multiple readings with the same location and time stamp. To make computation on the dataset efficient, it is essential to remove these duplicate entries. Table 2 presents the results of applying the duplication filter to the CitiSense data.
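A duplication filter of this kind can be sketched in a few lines. The sketch below assumes points are `(x, y, t)` tuples in projected metres and compares each point with the last point that was kept, which is one possible reading of "consecutive positions"; the function name and the threshold value are hypothetical.

```python
import math


def duplication_filter(points, min_dist_m):
    """Drop a point when it lies closer than min_dist_m to the last kept point.

    points: list of (x, y, t) tuples in projected metres (an assumed layout).
    """
    kept = []
    for p in points:
        if kept:
            last = kept[-1]
            if math.hypot(p[0] - last[0], p[1] - last[1]) < min_dist_m:
                continue  # treated as a duplicate of the previous position
        kept.append(p)
    return kept


sample = [(0, 0, 0), (0.5, 0, 1), (3, 0, 2), (3, 0, 2)]
filtered = duplication_filter(sample, min_dist_m=1.0)
```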

Speed and Acceleration filter: It is assumed that individuals move at a plausible speed between two consecutive positions, and that there is a reasonable speed range for individuals (for different means of transportation like walking, biking, car, bus, etc.). The speed and acceleration filter removes the second position if the speed and/or acceleration between two consecutive positions is unreasonable. For example, a few readings in the CitiSense data indicate an impossible speed of 546 km/hr with an acceleration of 52 m/s². These invalid readings need to be removed for trajectory analysis, and their neighboring data points need to be smoothed accordingly. Table 2 presents the results after applying the speed and acceleration filter, with a speed limit of 150 km/hr and an acceleration limit of 10 m/s², to the CitiSense data. Figure 2 shows the variation of speed for a user in a given trip and the smoothed speed after removal of outliers.
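The following sketch shows one way such a filter could be written. It assumes `(x, y, t)` points in metres and seconds, reads the speed limit as 150 km/h (about 41.7 m/s), and estimates acceleration from the change in speed between successive kept segments; all names and these choices are illustrative assumptions rather than the project's implementation.

```python
import math


def speed_accel_filter(points, max_speed=150.0 / 3.6, max_accel=10.0):
    """Remove points implying an implausible speed (m/s) or acceleration (m/s^2)."""
    kept = []
    prev_speed = None  # speed of the last accepted segment, unknown at start
    for p in points:
        if not kept:
            kept.append(p)
            continue
        q = kept[-1]
        dt = p[2] - q[2]
        if dt <= 0:
            continue  # non-increasing timestamp, cannot derive a speed
        speed = math.hypot(p[0] - q[0], p[1] - q[1]) / dt
        accel = 0.0 if prev_speed is None else abs(speed - prev_speed) / dt
        if speed > max_speed or accel > max_accel:
            continue  # drop the second position of the offending pair
        kept.append(p)
        prev_speed = speed
    return kept


track = [(0, 0, 0), (10, 0, 1), (20, 0, 2), (1000, 0, 3), (30, 0, 4)]
clean = speed_accel_filter(track)
```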

B. Trip Segmentation

Route pattern mining requires the recorded data to be segmented into trips. However, asking users to manually turn their GPS devices on and off several times a day for the purpose of trip segmentation would drastically decrease the usability of the system and the reliability of the data. This section discusses how to extract trips from GPS data using the concept of moves and stops [11].

Figure 2: Variation of speed before and after applying the speed and acceleration filter

The basic criterion for splitting GPS data is the time gap between two consecutive positions, since a stop indicates the end of a trip. Algorithm A is used to segment the trips. In this algorithm, T is the array containing all recorded trips of a person, λtime_gap is the time threshold used to segment trips, λtrj_len is the threshold used to remove short trips, and Funct() is one of the data filtering functions described above. Funct() returns true if the positions comply with the restriction of the data filter; otherwise, it returns false, and the corresponding positions are removed.

procedure Trip Segmentation(A)
    Input: T, λtime_gap, λtrj_len, Funct()
    Output: Ttmp

    Ttmp ⇐ φ
    for each route ri in T do
        rtmp = φ
        for each position pj in ri do
            if Funct(pj, ri) returns true then
                rtmp = Append(rtmp, pj)
            else if Time(pj) − Time(pj−1) > λtime_gap then
                if Size(rtmp) > λtrj_len then
                    Ttmp = Append(Ttmp, rtmp)
                end if
            end if
        end for
    end for
    Return Ttmp
end procedure
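A runnable sketch of the same idea is given below. It is an interpretation, not a transcription, of Algorithm A: it assumes `(x, y, t)` points in metres and seconds, measures trip length as path length in metres, starts a new trip after each time gap, and also flushes the final trip. The callback `keep` stands in for Funct(), and all names are hypothetical.

```python
import math


def path_length(trip):
    """Total length in metres of a list of (x, y, t) points."""
    return sum(math.hypot(b[0] - a[0], b[1] - a[1]) for a, b in zip(trip, trip[1:]))


def segment_trips(route, keep, time_gap_s=300.0, min_length_m=100.0):
    """Split one route into trips at time gaps and drop short trips."""
    trips, current = [], []
    for p in route:
        if not keep(p, current):
            continue  # position rejected by the data filter
        if current and p[2] - current[-1][2] > time_gap_s:
            # A long pause ends the current trip (assumed reset step).
            if path_length(current) > min_length_m:
                trips.append(current)
            current = []
        current.append(p)
    if path_length(current) > min_length_m:
        trips.append(current)  # flush the last trip
    return trips


route = [(0, 0, 0), (60, 0, 10), (120, 0, 20),
         (120, 0, 1000), (125, 0, 1010), (130, 0, 1020)]
trips = segment_trips(route, keep=lambda p, cur: True)
```

With the thresholds reported in this section (300 seconds and 100 meters), the example route yields one trip; the second group of points is discarded because it covers only 10 m.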

Figure 3: Various trips made by a user on 10th Aug, 2012. Different colors represent different trips.

The data filtering process removes noisy raw data and greatly reduces the amount of the original trip data. Applying trip segmentation to the CitiSense data results in 0 to 128 trips per person over a period of 30 days. Note that a trip count of 0 indicates the presence of a stationary node in the CitiSense dataset. This implies that simple filters and trip segmentation can identify stationary ’users’ in the dataset. λtime_gap is chosen to be 300 seconds and λtrj_len to be 100 meters. Figure 3 shows all the trips made by a user on 10th Aug, 2012.

C. Trajectory Smoothing:

A simple method to smooth noise is to apply a mean filter. For a measured point p_i, the estimate of the (unknown) true value is the mean of p_i and its n − 1 predecessors in time. The mean filter can be thought of as a sliding window covering n temporally adjacent values of p_i. A major drawback of the mean filter is its sensitivity to outliers. This outlier problem can be alleviated by using a median filter instead of a mean filter. The median filter is the same as the mean filter except that the mean is replaced with a median [8]:

$$x_i = \mathrm{median}\{p_{i-n+1}, p_{i-n+2}, \ldots, p_{i-1}, p_i\} \tag{1}$$

Figure 4: Example of median filter for trajectory smoothing. Figure taken from [8]

Figure 4 shows the median filter in effect on a sample trajectory with outliers. For smoothing a trajectory, both the mean filter and the median filter are simple and effective techniques, but both suffer from lag. The Kalman filter and the particle filter are two more advanced techniques that reduce lag and can be designed to estimate more than just location. Though they are not used in this project, they are worth exploring.
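Equation (1) can be sketched as a causal sliding-window median over one coordinate at a time. The function name is hypothetical, and the handling of the first n − 1 samples (using a shorter window) is an assumption, since the text does not say how the start of the trajectory is treated.

```python
from statistics import median


def median_filter(values, n):
    """Causal median filter: each output is the median of the current value
    and up to n - 1 predecessors (shorter window at the start, by assumption)."""
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - n + 1): i + 1]
        smoothed.append(median(window))
    return smoothed


# The outlier 100 is suppressed, while a mean filter would smear it
# across the following samples.
xs = [1, 2, 100, 3, 4]
print(median_filter(xs, 3))  # [1, 1.5, 2, 3, 4]
```

For a trajectory, the filter would be applied separately to the x and y coordinate sequences.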

D. Error Measure for Trajectory Compression

In this section, we discuss two error measures for the deviation of an approximate trajectory from its original trajectory: perpendicular Euclidean distance and time synchronized Euclidean distance. An estimate of the accuracy of the approximated location values can be obtained from the distance between a location on the original trajectory and the estimated location on the approximated trajectory.

Table 2: Filtering of the CitiSense Dataset

        Raw       After Duplication filter  After Duplication + Speed & Acceleration filter
#Rows   21699711  6579172                   2498653

(a) Error measure based on perpendicular Euclidean distance. This error measure takes into account the geometric relationship of the trajectories.

(b) Error measure based on time synchronized Euclidean distance. This error measure takes into account both the geometric relationship and the temporal factor of the trajectories.

Figure 5: Error Measure for Trajectory Compression. Figure taken from [8]

The shortest distance from a sampled location point in the original trajectory to the approximated trajectory is the perpendicular Euclidean distance. A measure of the error can be obtained by averaging the perpendicular Euclidean distance over all sampled location points. Figure 5(a) illustrates the computation of the error measure based on the perpendicular Euclidean distance between the original trajectory acquired from a moving object and an approximated trajectory generated by applying one of the trajectory data reduction algorithms.

However, projecting each point of the original trajectory onto the segments of the approximated trajectory takes into consideration only the geometric characteristics of the trajectories. The temporal component of object movement in the trajectories is not accounted for [8]. Notice that a sampled data point <x, y, t> in the original trajectory denotes the time t at which the moving object is located at (x, y). Thus, there is a need to also consider the temporal factor in the projection.

To take the temporal factor into account, time synchronized Euclidean distance was proposed [8] as a new error measure for approximated trajectories generated by trajectory data reduction algorithms [24, 25]. This error measure recognizes that there should be a "time-synchronization" of the projected movement on the approximated trajectory with the real movement on the actual trajectory.

Figure 5(b) illustrates the idea of time synchronized Euclidean distance. As shown, the location points on the approximated trajectory, i.e., p0, p5 and p16, are already synchronized by time. The other sampled location points, e.g., p1, p2, p3 and p4, are projected to the time synchronized location points p1′, p2′, p3′ and p4′ on the line segment p0p5.
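The difference between the two error measures can be made concrete with a small sketch. It assumes `(x, y, t)` points and a single approximating segment from `a` to `b`; the function names are hypothetical. The perpendicular measure uses only geometry, while the time synchronized measure compares the point with the position on the segment interpolated at the point's own timestamp.

```python
import math


def perpendicular_distance(p, a, b):
    """Shortest distance from p to the segment a-b, ignoring time."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    seg2 = dx * dx + dy * dy
    if seg2 == 0:
        return math.hypot(p[0] - a[0], p[1] - a[1])
    u = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / seg2
    u = max(0.0, min(1.0, u))  # clamp the projection onto the segment
    return math.hypot(p[0] - (a[0] + u * dx), p[1] - (a[1] + u * dy))


def synchronized_distance(p, a, b):
    """Distance from p to the point on a-b reached at time p[2],
    assuming constant speed between a and b."""
    dt = b[2] - a[2]
    ratio = 0.0 if dt == 0 else (p[2] - a[2]) / dt
    sx = a[0] + ratio * (b[0] - a[0])
    sy = a[1] + ratio * (b[1] - a[1])
    return math.hypot(p[0] - sx, p[1] - sy)


a, b = (0, 0, 0), (10, 0, 10)
p = (2, 3, 8)  # geometrically near the start, but recorded late in the segment
print(perpendicular_distance(p, a, b))  # 3.0
print(synchronized_distance(p, a, b))   # about 6.71
```

The example point is only 3 m from the segment geometrically, yet at its timestamp the approximated object would already be at x = 8, so the time synchronized error is more than twice as large.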

E. Trajectory Compression

Our aim here is to produce an approximate trajectory from the actual trajectory by eliminating some location points while making sure that the error introduced is negligible. This problem closely resembles the well-studied line generalization problem in computer graphics and cartography [8]. A very simple approximation technique is uniform sampling, where every i-th location point (e.g., 10th, 20th, 30th, etc.) is retained and the other points are rejected [27]. This approach does not work well if the location points in the original trajectory carry different amounts of the information required to represent the trajectory.

Douglas-Peucker (DP), a renowned algorithm, can be employed to approximate the original trajectory [9, 15]. Given a curve composed of line segments, this algorithm finds a similar curve with fewer points. The idea is to replace the actual trajectory with an approximating line segment. If the replacement does not comply with the specified error condition, the original problem is partitioned into two sub-problems by choosing the location point responsible for the maximum error as the split point. This partitioning is recursive and continues until the stopping condition is met.


Table 3: Variation of each user’s data after preprocessing

User Id   Raw        After Filtering   After Compression   Compression %
56        486175     84342             17012               3.5%
40        1650974    68098             13503               0.81%
47        2254760    500344            103286              4.6%
55        56133      169               169                 0.3%
6         206933     30856             5230                2.5%
46        1085624    108703            23132               2.1%
54        157504     460               460                 0.3%
45        1468473    97291             20437               1.4%
53        1436417    120764            26947               1.9%
52        979791     103822            20400               2.08%
44        949626     35727             7526                0.8%
3         388015     48239             10885               2.8%
51        2644865    143302            36359               1.4%
43        2070140    260324            55531               2.7%
50        756594     45982             9749                1.3%
42        2543693    240332            50874               2%
49        2382835    360212            69753               2.9%
41        181159     63211             12708               7%
Total     21699711   2312178           483959              2.23%

Figure 6: Pseudocode of proposed GPS trajectory approximation process

The stopping condition is that the error between the approximate and original trajectories falls below the given error threshold. A modified DP algorithm, called the top-down time-ratio (TD-TR) algorithm [24], which uses the synchronous Euclidean distance (SED) instead of the perpendicular Euclidean distance, is also a very popular algorithm for trajectory compression.
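The recursive split described above can be sketched as follows, using the synchronous Euclidean distance as the error so that it follows the top-down time-ratio idea. This is a minimal illustration under assumed conventions (`(x, y, t)` points, hypothetical function names), not the implementation from [24] and not the GTC algorithm used in this project.

```python
import math


def sed(p, a, b):
    """Synchronous Euclidean distance of p from the segment a-b."""
    dt = b[2] - a[2]
    ratio = 0.0 if dt == 0 else (p[2] - a[2]) / dt
    sx = a[0] + ratio * (b[0] - a[0])
    sy = a[1] + ratio * (b[1] - a[1])
    return math.hypot(p[0] - sx, p[1] - sy)


def td_tr(points, tolerance):
    """Top-down compression: keep the end points, split at the worst point
    whenever its error exceeds the tolerance."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    worst_index, worst_error = 0, -1.0
    for i in range(1, len(points) - 1):
        error = sed(points[i], first, last)
        if error > worst_error:
            worst_index, worst_error = i, error
    if worst_error <= tolerance:
        return [first, last]  # stopping condition met
    left = td_tr(points[: worst_index + 1], tolerance)
    right = td_tr(points[worst_index:], tolerance)
    return left[:-1] + right  # the split point appears once


track = [(0, 0, 0), (1, 0, 1), (2, 0, 2), (3, 5, 3), (4, 0, 4), (5, 0, 5), (6, 0, 6)]
compressed = td_tr(track, tolerance=0.5)
```

Because the function is recursive, very long trajectories may need an iterative variant with an explicit stack to stay within Python's recursion limit.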

Figure 7: Variation of % compression vs SED

In this project we have used the GTC trajectory compression algorithm [14], which is a greedy solution for trajectory approximation. It starts from the first point and finds the farthest point whose approximated SED is less than the given error tolerance. The pseudocode is shown in Figure 6. The rest of the analysis in this paper is done on the dataset compressed with SED = 5 m unless otherwise specified. Figure 7 shows how the percentage of points (rows) left varies with the SED value used. Table 3 shows the variation in the number of readings for each user after filtering and compression.

III. Privacy Breaches

There are many real-life situations in which attackers exploit location-detection technologies to gain access to private location information and other sensitive information about victims [16, 17, 18, 19]. The following are some of the techniques that can be applied to LBS data to mine information about individuals:

A. Region of Interest (ROI)

In 2008, Spaccapietra et al. proposed the first data model that looks at trajectories from a conceptual point of view and supports robust semantic analysis, called stops and moves [11]. A stop is a semantically important part of a trajectory that is relevant for an application and where the object has stayed for a minimal amount of time. For instance, on weekdays a stop could be an office or workplace, and on weekends or holidays a stop could be a tourist place, a restaurant, a movie theater, etc. Figure 8 describes this idea pictorially.

Figure 8: Identifying stops and moves from GPS data points. Figure taken from [23]

To extract stops and moves from trajectory points, Alvares et al. introduced an algorithm called IB-SMoT (Intersection Based Stops and Moves of Trajectories) [12]. While IB-SMoT searches for intersections among trajectories, there are several other ways to find important points of interest, such as the speed-based spatio-temporal clustering approach (CB-SMoT) [13]. In this project we have used Weka-STPM [21] for the IB-SMoT and CB-SMoT analysis. Weka-STPM is an extension of Weka for spatio-temporal data.

Figure 9 shows some of the stops taken by an individual over a period of 15 days. It can easily be inferred that if a stop is repeated more than a particular number of times, it is a region of interest for an attacker. Taking this notion a step further and plotting stops over time can lead to identification of regions of interest for an attacker such as the victim’s home, office, gym, preferred shopping mall, etc. Although this interpretation requires manual effort, there exist semantic trajectory frameworks that perform it automatically [20].
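To show how little machinery an attacker needs, here is a deliberately simple stop detector. It is neither IB-SMoT nor CB-SMoT; it only assumes that a stop is a run of consecutive `(x, y, t)` points that stay within a given radius of the run's first point for a minimum duration. The function name, the radius and the duration are hypothetical.

```python
import math


def find_stops(points, radius_m=50.0, min_duration_s=600.0):
    """Return stops as (centroid_x, centroid_y, t_start, t_end) tuples."""
    stops = []
    i, n = 0, len(points)
    while i < n:
        j = i + 1
        # Extend the run while points stay close to the anchor point i.
        while j < n and math.hypot(points[j][0] - points[i][0],
                                   points[j][1] - points[i][1]) <= radius_m:
            j += 1
        run = points[i:j]
        if run[-1][2] - run[0][2] >= min_duration_s:
            cx = sum(p[0] for p in run) / len(run)
            cy = sum(p[1] for p in run) / len(run)
            stops.append((cx, cy, run[0][2], run[-1][2]))
            i = j  # continue after the stop
        else:
            i += 1
    return stops


track = [(0, 0, 0), (5, 0, 300), (0, 5, 600), (3, 3, 900),
         (500, 0, 960), (1000, 0, 1020)]
print(find_stops(track))  # [(2.0, 2.0, 0, 900)]
```

Counting how often the same stop location recurs across days would then rank candidate regions of interest.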

B. Behavior Mining

For most purposes, we can assume that individuals adhere (approximately) to the same paths over regular intervals in time. For instance, people usually follow a fixed routine throughout the day: they wake up at the same time, take just about the same route to work and carry out daily or weekly chores in a regular way. Therefore, trajectory patterns most likely represent summaries of repeated behavior, in terms of both space (i.e., the regions of space visited during movements) and time (i.e., the duration of movements) [8].

Figure 9: Visualization of stops of an individual over a period of 30 days. The markers are color-coded to emphasize the frequency of a ’stop’ taken by the user.

The discovery of hidden periodic movement patterns in spatio-temporal data may violate the privacy of users. Figures 10 and 11 provide examples of the revelation of hidden information about an individual.


Figure 10: Loss of privacy. It can be inferred that the user is a faculty member at CSE, UCSD and uses faculty parking to park his/her car.

Figure 11: The trajectory paths on weekdays, from 8 am to 11 am and from 4 pm to 9 pm, for a user. It can be inferred that the user is a student at CSE, UCSD, uses a bike as conveyance and takes the same route most of the time.

C. Predictive Query

Given the recent movements of an individual and the current time, predictive queries ask for the probable location of the individual at some future time. The methods in [30, 32] accurately forecast locations even when the forecast time is far from the current time. Long-term prediction uses previously extracted movement patterns named Trajectory Patterns, which are a concise representation of the behavior of moving objects as sequences of regions frequently visited within a typical travel time. It has been shown that prediction based on the trajectory patterns of an object is a powerful method [35].

D. Regular Routes Mining

This technique is useful for mining regular (i.e., frequently repeated) routes from users’ route sets. It involves the following steps [33]:

Trajectory Similarity: An estimate of the similarity between two trajectories can be obtained by some form of aggregation of the distances between trajectory points. Based on this idea, several similarity functions have been developed for different purposes, including Closest-Pair Distance, Sum-of-Pairs Distance [34], Dynamic Time Warping (DTW) [38], Longest Common Subsequence (LCSS) [37], Edit Distance with Real Penalty (ERP) [40] and Edit Distance on Real Sequences (EDR) [41]. Even though some of these similarity functions were initially put forth for time series data, they can also be employed for trajectory data, as trajectories can be viewed as a distinctive type of time series in multi-dimensional space.
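As an example of one of these similarity functions, the sketch below computes the classic dynamic time warping distance between two trajectories given as lists of `(x, y)` points, with Euclidean distance as the point-to-point cost. The function name is hypothetical, and this is the textbook O(n·m) formulation rather than the specific variant used in [33] or [38].

```python
import math


def dtw_distance(a, b):
    """Dynamic time warping distance between two point sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best cumulative cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.hypot(a[i - 1][0] - b[j - 1][0], a[i - 1][1] - b[j - 1][1])
            cost[i][j] = d + min(cost[i - 1][j],      # a advances alone
                                 cost[i][j - 1],      # b advances alone
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]


route = [(0, 0), (1, 0), (2, 0)]
same_route_extra_sample = [(0, 0), (0, 0), (1, 0), (2, 0)]
parallel_route = [(0, 1), (1, 1), (2, 1)]
print(dtw_distance(route, same_route_extra_sample))  # 0.0
print(dtw_distance(route, parallel_route))           # 3.0
```

The first comparison shows why DTW suits unevenly sampled trajectories: a repeated sample does not increase the distance.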

Figure 12 shows the basic steps for breaking a route into frequent directed edges (FDE) in order to compare two trajectories [33].

Figure 12: Steps involved in converting a route into FDE to convert it into time series data for further computation. Figure taken from [33].

Routes Grouping: We group the routes that are followed by someone at approximately the same times of the day and that have high trajectory similarity (as defined above).

Finding Regular Routes: We then mine regular routes from each set of routes. To qualify as a regular route, a route must have been traveled frequently at approximately the same hours. Figure 13 shows the regular routes taken by a particular user. The open source code T-Pattern [48], which uses the algorithm proposed in [23], is used to mine the regular trajectory patterns.


Figure 13: Regular routes taken by a user. Red denotes the most common route taken by the user, followed by green and blue respectively. On a side note, all the stops of the user are localized and point to his/her home (Rita Atkinson Residence) and office (CSE, UCSD) locations.

E. Recognizing Travel Modes:

The different travel modes of a route can be recognized. It is observed that public transport stops frequently, and also stops periodically at fixed positions. Therefore, the fixed stop rate (FSR) can be used along with speed variation to recognize the different travel modes. Figure 14 compares the speed variation of three users who commute by walking, bike and car respectively. The FSR can also be seen in this figure.

IV. Trajectory Anonymization

This section discusses privacy-preserving algorithms for trajectory data publication. With regard to the difficulty of privacy protection, trajectory data publication differs from continuous LBS in the following ways [45]: (1) The need for privacy protection mechanisms to be scalable is much greater for continuous LBS than for trajectory data publication. This is because the anonymization module of a continuous LBS handles an enormous number of real-time location updates at high rates, whereas trajectory data publication can carry out the anonymization process offline. (2) Global optimization techniques can be implemented for trajectory data publication, as its anonymization process can scrutinize the entire (static) trajectory data for optimization possibilities. Attaining global optimization is, on the other hand, very hard for continuous LBS because its data arrive at run time from extremely dynamic, unpredictable user movements.

Figure 14: Speed variations for different modes of transportation. The top graph shows walking with a speed of 2-4 miles/hr, the middle graph shows biking with a speed of 3-7 miles/hr, and the bottom graph shows a car with a speed of up to 70 miles/hr.

In the literature there are four major trajectory anonymization techniques for static trajectory data publication, namely, clustering-based [6], generalization-based [50], suppression-based [51] and grid-based [30] anonymization approaches. In this project we have used a combination of three techniques, namely, clustering-based anonymization, temporal cloaking and ROI-based anonymization.

A. Clustering based Anonymization

The clustering-based approach [6] utilizes the uncertainty of trajectory data to group k co-localized trajectories within the same time period to form a k-anonymized aggregate trajectory. Given a trajectory T between times t1 and tn, i.e., [t1, tn], and an uncertainty threshold δ, each location sample in T, pi = (xi, yi, ti), is modeled by a horizontal disk with radius δ centered at (xi, yi). The union of all such disks constitutes the trajectory volume of T, as shown in Figure 15. Two trajectories Tp and Tq defined in [t1, tn] are said to be co-localized with respect to δ if the Euclidean distance between each pair of points in Tp and Tq at time t ∈ [t1, tn] is less than or equal to δ.

An anonymity set of k trajectories is defined as a set of at least k co-localized trajectories. The cluster of k co-localized trajectories is then transformed into an aggregate trajectory in which each location point is the arithmetic mean of the location samples at the same time.

Figure 15: Uncertain trajectory: uncertainty area, trajectory volume and possible motion curve. Figure taken from [6]

Figure 16 shows the trajectory volumes of Tp and Tq, which are drawn with grey dotted lines. The trajectory volume drawn with black lines is a bounding trajectory volume for Tp and Tq. The bounding trajectory volume is then transformed into an aggregate trajectory, which is represented by the sequence of square markers.

The clustering-based anonymization algorithm consists of three main steps, as described in [6]:

1. Pre-processing step. The main task of this phase is to group all trajectories that have the same starting and ending times, i.e., that are in the same equivalence class with respect to time span. To increase the number of trajectories in an equivalence class, given an integer parameter π, all trajectories are trimmed if necessary so that only one timestamp in every π can be the starting or ending point of a trajectory.

2. Clustering step. This phase clusters trajectories based on a greedy clustering scheme. For each equivalence class, a set of appropriate pivot trajectories is selected as cluster centers. For each cluster center, its nearest k − 1 trajectories are assigned to the cluster, such that the radius of the bounding trajectory volume of the cluster is not larger than a certain threshold (e.g., δ/2).

3. Space transformation step. Each cluster is transformed into a k-anonymized aggregate trajectory by moving all points at the same time to the corresponding arithmetic mean of the cluster.
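Two building blocks of this scheme, the co-localization test and the aggregate trajectory of the space transformation step, can be sketched as follows. The sketch assumes that all trajectories in a cluster are lists of `(x, y, t)` samples taken at identical timestamps (that is, they already belong to the same equivalence class). It does not implement the pivot selection or trimming of NWA [6], and the function names are hypothetical.

```python
import math


def co_localized(tp, tq, delta):
    """True if the two trajectories stay within delta of each other
    at every shared timestamp."""
    return all(math.hypot(p[0] - q[0], p[1] - q[1]) <= delta
               for p, q in zip(tp, tq))


def aggregate_trajectory(cluster):
    """Arithmetic mean of the cluster's location samples at each timestamp."""
    aggregated = []
    for samples in zip(*cluster):
        k = len(samples)
        mean_x = sum(s[0] for s in samples) / k
        mean_y = sum(s[1] for s in samples) / k
        aggregated.append((mean_x, mean_y, samples[0][2]))
    return aggregated


tp = [(0, 0, 0), (10, 0, 1)]
tq = [(0, 4, 0), (10, 6, 1)]
print(co_localized(tp, tq, delta=50))   # True
print(aggregate_trajectory([tp, tq]))   # [(0.0, 2.0, 0), (10.0, 3.0, 1)]
```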

Figure 16: A (2, δ)-anonymity set formed by two co-localized trajectories, their respective uncertainty volumes, and the central cylindrical volume of radius δ/2 that contains both trajectories. Figure taken from [6]

B. ROI based Anonymization

As mentioned earlier, ROIs are regions where a large number of moving objects remain for at least a given time interval. As shown in previous sections, the main threat in publishing the CitiSense data is the revelation of the home and office locations of the users. Since this information is revealed by analyzing the stops and moves of trajectory data, the easiest way to remedy this kind of privacy threat is to remove from the trajectory data the neighboring regions of stops that satisfy certain criteria, such as a stop duration greater than 30 minutes. For this kind of analysis, two parameter values therefore need to be decided upon: 1) the radius λr of the circular area around a stop, and 2) the stop duration λt that qualifies a stop for anonymization. Figure 17 shows the result of applying ROI based anonymization to a user’s trip.

Figure 17: Application of ROI based anonymization ona user’s trip.
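The suppression step with its two parameters can be sketched as below. The sketch assumes that stops have already been detected and are given as `(x, y, t_start, t_end)` tuples, and that trajectory points are `(x, y, t)` tuples in projected metres; these layouts, the function name and the example values for λr and λt are hypothetical.

```python
import math


def suppress_rois(points, stops, radius_m, min_stop_s):
    """Remove every point within radius_m of a stop that lasted
    at least min_stop_s seconds."""
    sensitive = [(sx, sy) for sx, sy, t0, t1 in stops if t1 - t0 >= min_stop_s]
    return [p for p in points
            if all(math.hypot(p[0] - sx, p[1] - sy) > radius_m
                   for sx, sy in sensitive)]


points = [(0, 0, 0), (100, 0, 10), (400, 0, 20)]
stops = [(0, 0, 0, 3600),   # one-hour stop: qualifies
         (400, 0, 0, 60)]   # one-minute stop: ignored
published = suppress_rois(points, stops, radius_m=150.0, min_stop_s=1800.0)
print(published)  # [(400, 0, 20)]
```

A larger λr hides the stop more thoroughly but removes more sensor readings, which is the privacy and utility trade-off this section describes.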

Adding semantic analysis: The previous approach can be improved by taking into account the semantics of geographical location. This variant performs geographical semantic analysis on the stops and tags all the locations as public (highways, shopping malls, parkways, etc.) or private (residential places, offices, etc.). For publishing the data, we can then selectively choose the location data tagged as public.

Increasing utility: To reduce the amount of data lost by discarding private location data in dense regions, we can take advantage of the notion of k-anonymity. If a sufficient number of data points is available from k or more users within a circular area of radius δ, we can average the readings for that area into buckets of minutes or hours and publish the averages. Publishing private location data in this way preserves our notion of (k, δ)-anonymity and maintains the utility of CitiSense data for places tagged as private.
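A minimal sketch of this bucketed publication is given below. It assumes the readings have already been restricted to one circular area of radius δ; the function name `publish_bucketed` and the default bucket length of one hour are illustrative choices, not values from this report.

```python
from collections import defaultdict


def publish_bucketed(readings, k, bucket_s=3600):
    """Average readings per time bucket, keeping buckets with >= k users.

    readings: iterable of (user_id, t_seconds, value), all assumed to lie
    inside one circular area of radius delta around a private location.
    """
    buckets = defaultdict(list)
    for user, t, value in readings:
        buckets[t // bucket_s].append((user, value))

    published = {}
    for b, rows in buckets.items():
        # Publish only if at least k distinct users contributed.
        if len({u for u, _ in rows}) >= k:
            published[b * bucket_s] = sum(v for _, v in rows) / len(rows)
    return published
```

Buckets with fewer than k contributing users are suppressed entirely, so a published average never reflects fewer than k users.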

C. Temporal Cloaking

All trajectory pattern mining and behavior mining algorithms depend on the successful creation of trips from the raw GPS data. If the GPS data does not contain a user identifier (as in the case of publicly available CitiSense data), trip segmentation depends heavily on temporal patterns. The idea of temporal cloaking is to blur a user's presence at a location at a particular time by adding Gaussian noise to the timestamps, so that the linear relation between distance and time no longer holds.

Gaussian noise is statistical noise that has its probability density function equal to that of the normal distribution, which is also known as the Gaussian distribution [53]. In other words, the values that the noise can take on are Gaussian-distributed.

$$ P(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-(x-\mu)^2 / (2\sigma^2)} \qquad (2) $$

Temporal cloaking can drastically reduce the number of trips recovered by segmentation and, hence, the information revealed by trajectory data. Notably, the utility of the CitiSense data is not much affected by introducing an uncertainty of a few minutes in time.
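Temporal cloaking as described above can be sketched in a few lines. The function name `cloak_timestamps`, the `seed` argument and the default parameter values are assumptions made for illustration only.

```python
import random


def cloak_timestamps(timestamps, mu=0.0, sigma=120.0, seed=None):
    """Add Gaussian noise (mean mu, std. dev. sigma, in seconds) to each timestamp.

    The defaults are illustrative; they are not the values used in this report.
    """
    rng = random.Random(seed)  # a fixed seed makes the perturbation reproducible
    return [t + rng.gauss(mu, sigma) for t in timestamps]
```

Because each timestamp is perturbed independently, consecutive readings may change order in time, which is part of what disrupts the distance-time relation that trip segmentation relies on.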

D. Results

All the analysis in this section uses the filtered dataset (2312178 readings), not the compressed dataset. Data compression was needed to make the data mining algorithms computationally efficient.

As mentioned in the previous section, the inherent uncertainty parameter (δ) in the NWA algorithm is set to 50 m and k is set to 2. The NWA algorithm changes only about 48% of the points. On further analysis, most of these points turn out to be located in one major region (CSE, UCSD). Hence, NWA does not anonymize the entire dataset. The reason behind this concentration of data points is that this is the region where different users are present at the same time. This exposes a problem in the CitiSense data: data points are sparsely distributed, and there is hardly any other region where CitiSense users coexist. This drawback of NWA is addressed by ROI based anonymization, which is discussed next.

Table 4: Variation in number of stops, trips and active days for each user. This information is further used in ROI based anonymization.

userId   # trips   # stops   # active days
30       20        23        6
40       22        44        22
41       41        51        18
42       27        55        28
43       168       199       30
44       30        57        17
45       83        112       29
46       17        36        18
47       152       189       31
49       121       152       28
50       57        80        28
51       73        102       29
52       34        51        17
53       83        103       20
54       0         3         3
55       0         2         2
56       22        31        8
6        30        36        7

Figure 18: Concentrated stop, leading to 0 trips.

Table 4 shows the number of trips and stops for each user. Note that the number of stops can exceed the number of trips, because the data points sampled in a period may be concentrated in a dense area with no actual movement (see Figure 18).

After recognizing the points that belong to stops, we need to remove them from the dataset for anonymization. Simply removing the points that belong to a stop cluster can still pose a privacy threat, because the surrounding points that survive can be extrapolated to the removed ROIs. To circumvent this possibility, the data points from all stops are first clustered using DBSCAN (density-based spatial clustering of applications with noise). The advantage of DBSCAN over other clustering algorithms is that it can find arbitrarily shaped clusters. Once the diameter (d) of each cluster is found, all data points inside the circular area with radius (d/2 + γ), centered at the mean position of the cluster members, are removed. Figures 19 and 20 depict this idea pictorially.
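The removal step for a single cluster can be sketched as follows. The sketch assumes the cluster has already been produced by a DBSCAN implementation and that positions are planar (x, y) pairs in metres; the function name `remove_cluster_region` is hypothetical, and the diameter is taken here as the largest pairwise distance between cluster members.

```python
import math
from itertools import combinations


def remove_cluster_region(points, cluster, gamma):
    """Remove every point within d/2 + gamma of the cluster's mean position.

    points, cluster: lists of (x, y) in metres (planar coordinates assumed).
    d is the cluster diameter, taken as the largest pairwise distance.
    """
    cx = sum(x for x, _ in cluster) / len(cluster)
    cy = sum(y for _, y in cluster) / len(cluster)
    # Quadratic in the cluster size; acceptable for a sketch.
    d = max((math.dist(p, q) for p, q in combinations(cluster, 2)), default=0.0)
    radius = d / 2 + gamma
    return [p for p in points if math.dist(p, (cx, cy)) > radius]
```

The margin γ is what removes the surviving neighbors that could otherwise be extrapolated back to the stop.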

The two parameters required by the DBSCAN algorithm are set to ε = 100 (i.e., the neighborhood radius within which points are considered reachable from one another) and minPts = 200 (i.e., the minimum number of points needed to form a cluster). Also, γ is set to 50 m. Table 5 presents the result after removing the stops in this manner. From the table, it appears that this approach discards most of the data points in the original dataset and severely hurts the original utility. However, this is not the complete picture. To understand why the situation is not as bad as it appears from Table 5, we introduce the concept of coverage.

# rows     # rows after anonymization   data loss (in %)
2312178    161712                       93%

Table 5: Loss of data in terms of number of rows.

Coverage: The coverage of a data point can be defined as the area over which the sensor reading can be considered the same. For example, in the CitiSense dataset, the CO2 reading at a location x can be treated as the same as the reading at location x + γ for a very small value of γ. Hence the surrounding area can be said to be covered by a single data point. The coverage of a cluster of stop data points can be thought of as the circular area with radius (d/2 + ε) centered at the mean position of the cluster members. Similarly, for each pair of adjacent moving points, the rectangular area spanned by those points can be thought of as the 'covered area'. This is also shown pictorially in Figure 22. Further, Figure 21 shows this concept on a real trajectory.

Figure 19: DB-SCAN clustering performed on the stop data points.

Figure 20: To find the coverage loss, areas spanned by stops (circular area) are removed. Total area is calculated using the notion that if a reading is present at a point, it covers 'some' surrounding area.

Figure 22: The circular area and rectangular boxes around the trajectory path depict areas covered by the data points.
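The two kinds of covered area can be approximated with a short sketch. The rectangle `width` and the circle `margin` are assumptions of this example, since the report does not state the exact values used, and overlaps between shapes are ignored; the function names are hypothetical.

```python
import math


def stop_coverage(diameter, margin):
    """Area of the circle of radius diameter/2 + margin around a stop cluster."""
    return math.pi * (diameter / 2 + margin) ** 2


def move_coverage(path, width):
    """Sum of rectangle areas (segment length x width) along a moving path.

    path: list of (x, y) positions in metres; overlaps are not subtracted.
    """
    return sum(math.dist(p, q) * width for p, q in zip(path, path[1:]))
```

Summing both quantities before and after anonymization gives an approximate coverage loss of the kind reported for ROI based anonymization.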

Note that in the CitiSense data, utility is directly related to coverage, not to the number of data points. Using this notion of coverage, the approximate coverage loss for ROI based anonymization is calculated and shown in Table 6. By this measure, ROI based anonymization hurts utility by only 6%. Interestingly, the coverage calculated above does not need to take into account the overlap of different users' trajectories. This is because, where trajectories overlap, we can apply the NWA algorithm to those points, thus keeping the utility unaltered.

Table 6: Loss of data in terms of area coverage. Area is in m².

Area covered   Area covered after anonymization   coverage loss (in %)
28002564       26304128                           6%
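As a quick check, the percentages in Tables 5 and 6 follow directly from the reported counts, assuming that loss is defined as the removed fraction of the original total:

```latex
\text{data loss} = 1 - \frac{161712}{2312178} \approx 0.930,
\qquad
\text{coverage loss} = 1 - \frac{26304128}{28002564} \approx 0.061
```

The contrast between the two figures is the point of the coverage argument: about 93% of the rows are removed, but only about 6% of the covered area.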

We applied temporal cloaking to the dataset obtained from the above anonymization techniques. The Gaussian parameters µ and σ for temporal cloaking are set to 600 seconds and 1, respectively. Preprocessing this transformed dataset created 54% fewer trips than our previous analysis on the non-anonymized dataset. This further reduces the information that can be gained (for example, regular routes, mode of transportation, etc.).

V. Conclusion

Figure 21: The rectangular boxes around the trajectory path depict areas covered by the data points.

Lately, it has been recognized in [7] and in many other works that k-anonymity alone does not put us on the safe side: although an individual is hidden in a group, if the group does not have enough diversity in its sensitive attributes, an attacker can still associate the individual with sensitive information. In the context of moving object data the problem is especially challenging, because location is a particular kind of information that can be considered both sensitive and a quasi-identifier at the same time.

Moreover, the major privacy concern, identification of locations private to the user, is resolved by the ROI based anonymization method with a mere 6% loss in coverage. Another concern, the limited effectiveness of the clustering based anonymization technique noted in the results, will disappear when the data becomes denser (more precisely, when each region has more than one CitiSense user present at approximately the same time). Temporal cloaking needs more analysis in order to derive rigorous mathematical guarantees of immunity against attackers. Another interesting area to explore is continuous real-time publication of CitiSense data. This is a relatively new field and worth exploring in the context of CitiSense.

VI. Acknowledgement

I would like to thank Prof. Bill Griswold, Department of Computer Science, for his constant support and guidance throughout the course of this project. I would also like to thank Prof. Sanjoy Dasgupta and Prof. Hovav Shacham for their valuable inputs. Last but certainly not least, I am grateful to Nima Nikzad and Celal Ziftci for their kind assistance and cooperation in helping me obtain the CitiSense data.

References

[1] C. Bettini, X. S. Wang, and S. Jajodia, "Protecting Privacy Against Location-Based Personal Identification." in Proc. of the Second VLDB Workshop on Secure Data Management (SDM’05).

[2] P. Samarati and L. Sweeney, "Generalizing data to provide anonymity when disclosing information (abstract)," in Proc. of the 17th ACM Symp. on Principles of Database Systems (PODS’98).

[3] O. Wolfson, S. Chamberlain, S. Dao, L. Jiang, and G. Mendez, "Cost and imprecision in modeling the position of moving objects." in Proc. of the 14th IEEE Int. Conf. on Data Engineering (ICDE’98).

[4] G. Trajcevski, O. Wolfson, K. Hinrichs, and S. Chamberlain, "Managing uncertainty in moving objects databases." ACM Trans. Database Syst., vol. 29, no. 3, pp. 463-507, 2004.

[5] D. Pfoser and C. S. Jensen, "Capturing the uncertainty of moving-object representations." in Proc. of the 6th International Symp. on Advances in Spatial Databases (SSD’99).

[6] Osman Abul, Francesco Bonchi, Mirco Nanni, Never Walk Alone: Uncertainty for Anonymity in Moving Objects Databases, Proceedings of the 2008 IEEE 24th International Conference on Data Engineering, p. 376-385, April 07-12, 2008

[7] http://en.wikipedia.org/wiki/Mercator_projection

[8] Y. Zheng, X. Zhou, Computing with spatial trajectories. Springer 2011. ISBN: 978-1-4614-1628-9

[9] Douglas, D., Peucker, T.: Algorithms for the Reduction of the Number of Points Required to Represent a Line or its Caricature. The Canadian Cartographer 10(2), 112-122 (1973)

[10] Francesco Bonchi, Laks V.S. Lakshmanan, Hui (Wendy) Wang, Trajectory anonymity in publishing personal mobility data, ACM SIGKDD Explorations Newsletter, v.13 n.1, June 2011

[11] Spaccapietra, S., Parent C., Damiani M. L., Macedo J. A., Porto F., Vangenot C. 2008. A Conceptual View on Trajectories. Data and Knowledge Engineering (DKE)

[12] L. O. Alvares, V. Bogorny, B. Kuijpers, J. A. F. de Macedo, B. Moelans, and A. Vaisman. A model for enriching trajectories with semantic geographical information. In ACM-GIS, pages 162-169, New York, NY, USA, 2007. ACM Press

[13] Nanni, M., Pedreschi, D. 2006. Time-focused clustering of trajectories of moving objects. Journal of Intelligent Information Systems 27(3) (2006) 267-289

[14] M. Chen, M. Xu and P. Franti "Compression of GPS trajectories", Proc. IEEE Data Compression Conf., pp. 62-71, 2012

[15] Hershberger, J., Snoeyink, J.: Speeding up the Douglas-Peucker Line simplification Algorithm. In: International Symposium on Spatial Data Handling, pp. 134-143 (1992)

[16] Dateline NBC: Tracing a stalker. http://www.msnbc.msn.com/id/19253352 (2007)

[17] FoxNews: Man accused of stalking ex-girlfriend with GPS. http://www.foxnews.com/story/0,2933,131487,00.html (2004)


[18] USAToday: Authorities: GPS system used to stalk woman. http://www.usatoday.com/tech/news/2002-12-30-gps-stalker_x.htm (2002)

[19] Voelcker, J.: Stalked by satellite: An alarming rise in gps-enabled harassment. IEEE Spectrum 47(7), 15-16 (2006)

[20] Yan, Z., (2009), "Towards Semantic Trajectory Data Analysis: A Conceptual and Computational Approach". VLDB’09, Lyon, France.

[21] L.O. Alvares, A. Palma, G. Oliveira, and V. Bogorny, "Weka-STPM: From Trajectory Samples to Semantic Trajectories", Proceedings of the XI Workshop de Software Livre, WSL’10, Porto Alegre, Brazil, 2010, pp. 164-169.

[22] Gedik, B., and Liu, L. Location Privacy in Mobile Systems: A Personalized Anonymization Model. In Proc. of the 25th Int. Conf. on Distributed Computing Systems (ICDCS’05).

[23] Norma Saiph Savage, Shoji Nishimura, Norma Elva Chavez, and Xifeng Yan. 2010. Frequent trajectory mining on GPS data. In Proceedings of the 3rd International Workshop on Location and the Web (LocWeb ’10). ACM, New York, NY, USA.

[24] Meratnia, N., de By, R.: Spatio-Temporal Compression Techniques for Moving Point Objects. In: International Conference on Extending Database Technology (EDBT), pp. 765-782 (2004)

[25] Potamias, M., Patroumpas, K., Sellis, T.: Sampling Trajectory Streams with Spatio-Temporal Criteria. In: International Conference on Scientific and Statistical Database Management (SSDBM), pp. 275-284 (2006)

[26] Ye Qian, Chen Ling, Chen Gencai. Personal continuous route pattern mining [J]. Journal of Zhejiang University, 2009, 10(2): 211-231.

[27] Lee, J.-G., and Han, J. Trajectory clustering: A partition-and-group framework. In Proc. of the 2007 ACM SIGMOD Int. Conf. on Management of Data (SIGMOD’07) (2007), pp. 593-604.

[28] Lee, J.-G., Han, J., and Li, X. Trajectory outlier detection: A partition-and-detect framework. In Proc. of the 24th IEEE International Conference on Data Engineering (ICDE’08) (2008).

[29] Lee, J.-G., Han, J., Li, X., and Gonzalez, H. Traclass: Trajectory classification using hierarchical region-based and trajectory-based clustering. In Proc. of the 34th Int. Conf. on Very Large Databases (VLDB’08) (2008).

[30] Gidofalvi, G., Huang, X., Pedersen, T.B.: Privacy-preserving data mining on moving object trajectories. In: Proceedings of the International Conference on Mobile Data Management (2007)

[31] Gruteser, M., and Grunwald, D. Anonymous Usage of Location-Based Services Through Spatial and Temporal Cloaking. In Proc. of the First Int. Conf. on Mobile Systems, Applications, and Services (MobiSys 2003).

[32] Freudiger, J., Raya, M., Felegyhazi, M., Papadimitratos, P., Hubaux, J.P.: Mix-zones for location privacy in vehicular networks. In: Proceedings of the International Workshop on Wireless Networking for Intelligent Transportation Systems (2007)

[33] Wen He, Deyi Li, Tianlei Zhang, Mu Guo, Lifeng An: Mining Regular Routes from GPS Data for Ridesharing Recommendation.

[34] Agrawal, R., Faloutsos, C., Swami, A.N.: Efficient similarity search in sequence databases. FODO pp. 69-84 (1993)

[35] Anna Monreale, Fabio Pinelli, Roberto Trasarti, Fosca Giannotti, WhereNext: a location predictor on trajectory pattern mining, Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, June 28-July 01, 2009, Paris, France

[36] Jeung, H., Liu, Q., Shen, H. T., and Zhou, X. A hybrid prediction model for moving objects. In Proc. of the 24th IEEE International Conference on Data Engineering (ICDE’08) (2008).

[37] Zheng, Y., Zhang, L., Xie, X., Ma, W.Y.: Mining interesting locations and travel sequences from gps trajectories. WWW (2009)

[38] Yi, B.K., Jagadish, H., Faloutsos, C.: Efficient retrieval of similar time sequences under time warping. ICDE (1998)

[39] Kido, H., Yanagisawa, Y., and Satoh, T. An Anonymous Communication Technique using Dummies for Location-based Services. In Proc. of the Third Int. Conf. on Pervasive Computing (Pervasive 2005) (2005), pp. 88-97.

[40] Chen, Z., Shen, H.T., Zhou, X., Zheng, Y., Xie, X.: Searching trajectories by locations - an efficiency study. SIGMOD (2010)

[41] Chen, L., Ozsu, M.T., Oria, V.: Robust and fast similarity search for moving object trajectories. SIGMOD (2005)

[42] Li, X., Han, J., Kim, S., and Gonzalez, H. Anomaly detection in moving object.

[43] Li, X., Han, J., Lee, J.-G., and Gonzalez, H. Traffic density-based discovery of hot routes in road networks.

[44] Mamoulis, N., Cao, H., Kollios, G., Hadjieleftheriou, M., Tao, Y., and Cheung, D. W.: Mining, indexing, and querying historical spatiotemporal data.

[45] Chow, Chi-Yin: Trajectory Privacy in Location-based Services and Data. In: ACM SIGKDD Explorations Newsletter 13 (2011), Nr. 1, 19-29.

[46] Mokbel, M. F., Chow, C.-Y., and Aref, W. G. Casper: Query processing for location services without compromising privacy. In Proceedings of the 32nd International Conference on Very Large Databases (VLDB’06)

[47] Mokbel, M. F., Chow, C.-Y., and Aref, W. G. The new casper: A privacy-aware location-based database server. In Proc. of the 23rd IEEE International Conference on Data Engineering (ICDE’07).

[48] http://sourceforge.net/projects/t-patterns/

[49] Nanni, M., and Pedreschi, D. Time-focused clustering of trajectories of moving objects. Journal of Intelligent Information Systems 27, 3 (2006), 267-289.

[50] Nergiz, M.E., Atzori, M., Saygin, Y., Güç, B.: Towards trajectory anonymization: A generalization-based approach. Transactions on Data Privacy 2(1), 47-75 (2009)

[51] Terrovitis, M., Mamoulis, N.: Privacy preservation in the publication of trajectories. In: Proceedings of the International Conference on Mobile Data Management (2008)

[52] Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 226-231 (1996)

[53] http://en.wikipedia.org/wiki/Normal_distribution
