Differentially Private Publication of Location Entropy (Technical Report)

Hien To∗ [email protected]
Kien Nguyen∗ [email protected]
Cyrus Shahabi [email protected]
Department of Computer Science, University of Southern California
Los Angeles, CA 90007

ABSTRACT
Location entropy (LE) is an eminent metric for measuring the popularity of various locations (e.g., points-of-interest). It is used in numerous applications in geo-marketing, crime analysis, epidemiology, traffic incident analysis, spatial crowdsourcing, and geosocial networks. Unlike other metrics computed from only the number of (unique) visits to a location, namely frequency, LE also captures the diversity of the users' visits, and is thus more accurate than other metrics. Current solutions for computing LE require full access to the past visits of users to locations, which poses privacy threats. This paper discusses, for the first time, the problem of perturbing location entropy for a set of locations according to differential privacy. The problem is challenging, inasmuch as removing a single user from the dataset will impact multiple records of the database, i.e., all the visits made by that user to various locations. Towards this end, we first derive non-trivial, tight bounds for both the local and global sensitivity of LE, and show that to satisfy ε-differential privacy, a large amount of noise must be introduced, rendering the published results useless. Hence, we propose a thresholding technique to limit the number of users' visits, which significantly reduces the perturbation error but introduces an approximation error. To achieve better utility, we extend the technique by adopting two weaker notions of privacy: smooth sensitivity (slightly weaker) and crowd-blending (strictly weaker). Extensive experiments on synthetic and real-world datasets show that our proposed techniques preserve the original data distribution without compromising location privacy.

Categories and Subject Descriptors
H.2.4 [Database Management]: Database Applications—Spatial databases and GIS; H.1.1 [Models and Principles]: Systems and Information Theory—Information theory; F.2.2 [Theory of Computation]: Analysis of Algorithms and Problem Complexity—Nonnumerical Algorithms and Problems

Keywords
Differential Privacy, Location Entropy, Algorithms, Theory

1. INTRODUCTION
Due to the pervasiveness of GPS-enabled mobile devices and the popularity of location-based services such as mapping and navigation apps (e.g., Google Maps, Waze), spatial crowdsourcing apps (e.g., Uber, TaskRabbit, Gigwalk), apps with geo-tagging (e.g., Twitter, Picasa, Instagram, Flickr), and apps with check-in functionality (e.g., Foursquare, Facebook), numerous industries are now collecting fine-grained location data from their users. While the collected location data can be used for many commercial purposes by these industries (e.g., geo-marketing), other companies and non-profit organizations (e.g., academia, the CDC) could also be empowered if they could use the location data for the greater good (e.g., research, preventing the spread of disease). Unfortunately, despite the usefulness of the data, industries do not publish their location data due to the sensitivity of their users' location information. However, many of these organizations do not need access to the raw location data; aggregate or processed location data would satisfy their needs.

∗These authors contributed equally to this work.

One example of using location data is to measure the popularity of a location, which can be used in many application domains such as public health, criminology, urban planning, policy, and social studies. One accepted metric for measuring the popularity of a location is location entropy (or LE for short). LE captures both the frequency of visits (how many times each user visited a location) and the diversity of visits (how many unique users visited a location), without looking at the functionality of that location (e.g., is it a private home or a coffee shop?). Hence, LE has been shown to better quantify the popularity of a location than the number of unique visits or the number of check-ins to the location [9]. For example, [9] shows that LE is more successful in accurately predicting friendship from location trails than simpler models based only on the number of visits. LE is also used to improve online task assignment in spatial crowdsourcing [24, 43] by giving priority to workers situated in less popular locations, because there may be no other available worker visiting those locations in the future.

Obviously, LE can be computed from the raw location data collected by various industries; however, the raw data cannot be published due to serious location privacy implications [18, 11, 42]. Without privacy protection, a malicious adversary can stage a broad spectrum of attacks, such as physical surveillance and stalking, and breach sensitive information such as an individual's health issues (e.g., presence in a cancer treatment center), alternative lifestyle, or political and religious preferences (e.g., presence in a church). Hence, in this paper we propose an approach based on differential privacy (DP) [12] to publish LE for a set of locations without compromising users' raw location data. DP has emerged as the de facto standard with strong protection guarantees for publishing aggregate data. It has been adopted by major industries for various tasks without compromising individual privacy, e.g., data analytics with Microsoft [15], discovering users' usage patterns with Apple¹, and crowdsourcing statistics from end-user client software [15] and training of deep neural networks [1] with Google. DP ensures that an adversary is not able to reliably learn from the published sanitized data whether or not a particular individual is present in the original data, regardless of the adversary's prior knowledge.

It is sufficient to achieve ε-DP (where ε is the privacy loss) by adding Laplace noise with mean zero and scale proportional to the sensitivity of the query (LE in this study) [12]. The sensitivity of LE is intuitively the maximum amount by which one individual can impact the value of LE. The higher the sensitivity, the more noise must be injected to guarantee ε-DP. Even though DP has been used before to compute Shannon entropy [4] (the formulation adopted in LE), the main challenge in differentially private publication of LE is that adding (or dropping) a single user from the dataset impacts multiple entries of the database, resulting in a high sensitivity of LE. To illustrate, consider a user who has contributed many visits to a single location; adding or removing this user would significantly change the value of LE for that location. Alternatively, a user may contribute visits to multiple locations and hence impact the entropy of all those visited locations. Another unique challenge in publishing LE (vs. simply computing the Shannon entropy) is the skewness and sparseness of real-world location datasets, where the majority of locations have small numbers of visits.

Towards this end, we first compute a non-trivial tight bound for the global sensitivity of LE. Given the bound, a sufficient amount of noise is introduced to guarantee ε-DP. However, the injected noise increases linearly with the maximum number of locations visited by a user (denoted by M) and monotonically with the maximum number of visits a user contributes to a location (denoted by C), and such an excessive amount of noise renders the published results useless. We refer to this algorithm as Baseline. Accordingly, we propose a technique, termed Limit, to limit user activity by thresholding M and C, which significantly reduces the perturbation error. Nevertheless, limiting an individual's activity entails an approximation error in calculating LE. These two conflicting factors require the derivation of appropriate values for M and C to obtain satisfactory results. We empirically find such optimal values.

Furthermore, to achieve better utility, we extend Limit by adopting two weaker notions of privacy: smooth sensitivity [31] (slightly weaker) and crowd-blending [17] (strictly weaker). We denote these techniques as Limit-SS and Limit-CB, respectively. Limit-SS provides a slightly weaker privacy guarantee, i.e., (ε, δ)-differential privacy, by using local sensitivity with a much smaller noise magnitude. We propose an efficient algorithm to compute the local sensitivity of a particular location, which depends on C and the number of users visiting the location (represented by n), such that the local sensitivity of all locations can be precomputed, regardless of the dataset. Thus far, we publish entropy for all locations; however, the ratio of noise to the true value of LE (the noise-to-true-entropy ratio) is often excessively high when the number of users n visiting a location is small (note that the entropy of a location is bounded by log(n)). For example, given a location visited by only two users with an equal number of visits (LE is log 2), removing one user from the database drops the entropy of the location to zero. To further reduce the noise-to-true-entropy ratio, Limit-CB aims to publish the entropy of locations with at least k users (n ≥ k) and suppress the other locations. By thresholding n, the global sensitivity of LE drops significantly, implying much less noise. We prove that Limit-CB satisfies (k, ε)-crowd-blending privacy.

¹https://www.wired.com/2016/06/apples-differential-privacy-collecting-data/

We conduct an extensive set of experiments on both synthetic and real-world datasets. We first show that the truncation technique (Limit) reduces the global sensitivity of LE by two orders of magnitude, thus greatly enhancing the utility of the perturbed results. We also demonstrate that Limit preserves the original data distribution after adding noise. Thereafter, we show the superiority of Limit-SS and Limit-CB over Limit in terms of achieving higher utility (measured by the KL-divergence and mean squared error metrics). In particular, Limit-CB performs best on sparse datasets, while Limit-SS is recommended over Limit-CB on dense datasets. We also provide insights on the effects of the various parameters ε, C, M, and k on the effectiveness and utility of our proposed algorithms. Based on these insights, we provide a set of guidelines for choosing appropriate algorithms and parameters.

The remainder of this paper is organized as follows. In Section 2, we define the problem of publishing LE according to differential privacy. Section 3 presents the preliminaries. Section 4 introduces the baseline solution and our thresholding technique. Section 5 presents our utility enhancements obtained by adopting weaker notions of privacy. Experimental results are presented in Section 6, followed by a survey of related work in Section 7 and conclusions in Section 8.

2. PROBLEM DEFINITION
In this section we present the notations and the formal definition of the problem.

Each location l is represented by a point in two-dimensional space and a unique identifier l (−90 ≤ llat ≤ 90 and −180 ≤ llon ≤ 180)². Hereafter, l refers to both the location and its unique identifier. For a given location l, let Ol be the set of visits to that location. Thus, cl = |Ol| is the total number of visits to l. Also, let Ul be the set of distinct users that visited l, and Ol,u be the set of visits that user u has made to the location l. Thus, cl,u = |Ol,u| denotes the number of visits of user u to location l. The probability that a random draw from Ol belongs to Ol,u is pl,u = cl,u / cl, which is the fraction of total visits to l that belongs to user u. The location entropy for l is computed from Shannon entropy [37] as follows:

H(l) = H(pl,u1, pl,u2, ..., pl,u|Ul|) = −∑u∈Ul pl,u log pl,u   (1)

In our study the natural logarithm is used. A location has a higher entropy when the visits are distributed more evenly among visiting users, and vice versa. Our goal is to publish the location entropy of all locations L = {l1, l2, ..., l|L|}, where the locations are visited by a set of users U = {u1, u2, ..., u|U|}, while preserving the location privacy of users. Table 1 summarizes the notations used in this paper.
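To make Equation (1) concrete, here is a minimal sketch (our own illustration, not code from the paper; the visit counts are hypothetical) that computes LE with the natural logarithm:

```python
import math

def location_entropy(visit_counts):
    """Equation (1): H(l) = -sum_u p_{l,u} * ln(p_{l,u})."""
    c_l = sum(visit_counts)  # total number of visits to l
    return -sum(c / c_l * math.log(c / c_l) for c in visit_counts)

# Two locations with 6 visits each: evenly spread visits yield a higher entropy.
even = location_entropy([2, 2, 2])    # three users with equal shares: entropy is ln 3
skewed = location_entropy([4, 1, 1])  # one dominant user: lower entropy
```

Both locations have the same frequency (six visits), but LE separates them by the diversity of their visitors, which is exactly the property the paper exploits.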

²llat and llon are real numbers with ten digits after the decimal point.

3. PRELIMINARIES


l, L, |L|   a location, the set of all locations, and its cardinality
H(l)        location entropy of location l
Ĥ(l)        noisy location entropy of location l
∆Hl         sensitivity of location entropy for location l
∆H          sensitivity of location entropy for all locations
Ol          the set of visits to location l
u, U, |U|   a user, the set of all users, and its cardinality
Ul          the set of distinct users who visit l
Ol,u        the set of visits that user u has made to location l
cl          the total number of visits to l
cl,u        the number of visits that user u has made to location l
C           maximum number of visits of a user to a location
M           maximum number of locations visited by a user
pl,u        the fraction of total visits to l that belongs to user u

Table 1: Summary of notations.

We now present properties of the Shannon entropy andthe differential privacy notion that will be used throughoutthe paper.

3.1 Shannon Entropy
Shannon [37] introduces entropy as a measure of the uncertainty in a random variable with a probability distribution U = (p1, p2, ..., p|U|):

H(U) = −∑i pi log pi   (2)

where ∑i pi = 1. H(U) is maximal if all the outcomes are equally likely:

H(U) ≤ H(1/|U|, ..., 1/|U|) = log |U|   (3)

Additivity Property of Entropy: Let U1 and U2 be non-overlapping partitions of a database U including users who contribute visits to a location l, and let φ1 and φ2 be the probabilities that a particular visit belongs to partition U1 and U2, respectively. Shannon discovered that using the logarithmic function preserves the additivity property of entropy:

H(U) = φ1 H(U1) + φ2 H(U2) + H(φ1, φ2)

Subsequently, adding a new person u into U changes its entropy to:

H(U+) = (cl / (cl + cl,u)) H(U) + H(cl,u / (cl + cl,u), cl / (cl + cl,u))   (4)

where U+ = U ∪ {u}, cl is the total number of visits to l, and cl,u is the number of visits to l contributed by user u. Equation (4) can be derived from the additivity property above if we consider that U+ comprises two non-overlapping partitions, {u} and U, with associated probabilities cl,u / (cl + cl,u) and cl / (cl + cl,u). We note that the entropy of a single user is zero, i.e., H(u) = 0.

Similarly, removing a person u from U changes its entropy as follows:

H(U−) = (cl / (cl − cl,u)) (H(U) − H(cl,u / cl, (cl − cl,u) / cl))   (5)

where U− = U \ {u}.
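Identities (4) and (5) can be checked numerically; the sketch below (ours, with hypothetical visit counts) verifies both against a direct entropy computation:

```python
import math

def entropy(counts):
    """Shannon entropy (natural log) of a normalized count vector."""
    total = sum(counts)
    return -sum(c / total * math.log(c / total) for c in counts)

counts = [4, 3, 1]   # visits of the existing users to location l (c_l = 8)
c_lu = 2             # visits contributed by the new user u
c_l = sum(counts)

# Equation (4): entropy after adding u, via the additivity property.
# H(phi1, phi2) = entropy([c_lu, c_l]) since entropy() normalizes its argument.
h_plus = c_l / (c_l + c_lu) * entropy(counts) + entropy([c_lu, c_l])
assert abs(h_plus - entropy(counts + [c_lu])) < 1e-12

# Equation (5) inverts (4): removing u from the enlarged database recovers H(U).
h_minus = (c_l + c_lu) / c_l * (entropy(counts + [c_lu]) - entropy([c_lu, c_l]))
assert abs(h_minus - entropy(counts)) < 1e-12
```

The assertions pass because the grouping property of Shannon entropy holds exactly, not just approximately.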

3.2 Differential Privacy
Differential privacy (DP) [12] has emerged as the de facto standard in data privacy, thanks to its strong protection guarantees rooted in statistical analysis. DP is a semantic model which provides protection against realistic adversaries with background information. Releasing data according to DP ensures that an adversary's chance of inferring any information about an individual from the sanitized data will not substantially increase, regardless of the adversary's prior knowledge. DP ensures that the adversary does not know whether an individual is present or not in the original data. DP is formally defined as follows.

Definition 1 (ε-indistinguishability [13]). Consider a database that produces a set of query results D on the set of queries Q = {q1, q2, ..., q|Q|}, and let ε > 0 be an arbitrarily small real constant. Then, a transcript U produced by a randomized algorithm A satisfies ε-indistinguishability if, for every pair of sibling datasets D1, D2 that differ in only one record, it holds that

ln (Pr[Q(D1) = U] / Pr[Q(D2) = U]) ≤ ε

In other words, an attacker cannot reliably learn whether the transcript was obtained by answering the query set Q on dataset D1 or on D2. The parameter ε is called the privacy budget, and it specifies the amount of protection required, with smaller values corresponding to stricter privacy protection. To achieve ε-indistinguishability, DP injects noise into each query result, and the amount of noise required is proportional to the sensitivity of the query set Q, formally defined as:

Definition 2 (L1-sensitivity [13]). Given any arbitrary sibling datasets D1 and D2, the sensitivity of query set Q is the maximum change in the query results of D1 and D2:

σ(Q) = max_{D1,D2} ||Q(D1) − Q(D2)||1

An essential result from [13] shows that a sufficient condition to achieve DP with parameter ε is to add to each query result randomly distributed Laplace noise with mean 0 and scale λ = σ(Q)/ε.
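As an illustration of this result (a sketch of ours, not code from the paper; the function name is our own), the Laplace mechanism can be implemented with inverse-CDF sampling:

```python
import math
import random

def laplace_mechanism(true_value, sensitivity, epsilon, rng=random):
    """Release true_value under ε-DP by adding Laplace(0, λ) noise, λ = sensitivity / ε."""
    scale = sensitivity / epsilon
    # Inverse-CDF sampling: if u ~ Uniform(-1/2, 1/2), then
    # -λ * sign(u) * ln(1 - 2|u|) is Laplace-distributed with mean 0 and scale λ.
    u = rng.random() - 0.5
    return true_value - scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))
```

A higher sensitivity σ(Q) or a smaller budget ε yields a larger scale λ and hence noisier releases, which is exactly the tension the rest of the paper works to resolve for LE.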

4. PRIVATE PUBLICATION OF LE
In this section we present a baseline algorithm based on the global sensitivity of LE [13], and then introduce a thresholding technique to reduce the global sensitivity by limiting an individual's activity.

4.1 Global Sensitivity of LE
To achieve ε-differential privacy, we must add noise proportional to the global sensitivity (or sensitivity for short) of LE. Thus, to minimize the amount of injected noise, we first propose a tight bound for the sensitivity of LE, denoted by ∆H. ∆H represents the maximum change of LE across all locations when the data of one user is added to (or removed from) the dataset. By the following theorem, the sensitivity bound is a function of the maximum number of visits a user contributes to a location, denoted by C (C ≥ 1).

Theorem 1. The global sensitivity of location entropy is

∆H = max{log 2, log C − log(log C) − 1}

Proof. We prove this theorem by first deriving a tight bound for the sensitivity of a particular location l (visited by n users), denoted by ∆Hl (Theorem 2). The bound is a function of C and n. Thereafter, we generalize the bound to hold for all locations as follows: we take the derivative of the bound derived for ∆Hl with respect to the variable n and find the extremal point where the bound is maximized. The detailed proof can be found in Appendix A.1.
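Theorem 1's bound is easy to evaluate; the following sketch (our own, with the C = 1 corner handled explicitly since log(log C) is undefined there) computes ∆H and illustrates its monotone growth with C:

```python
import math

def global_sensitivity(C):
    """Theorem 1: dH = max{log 2, log C - log(log C) - 1} (natural log), for C >= 1."""
    if C == 1:
        return math.log(2)  # the second term is undefined at C = 1
    return max(math.log(2), math.log(C) - math.log(math.log(C)) - 1)

# The bound grows monotonically (and slowly) with C, matching Figure 1.
bounds = {C: global_sensitivity(C) for C in (1, 10, 100, 1000)}
```

For small C the log 2 term dominates; for large C the second term takes over, which is why capping C (as Limit does later) directly caps the injected noise.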

Theorem 2. The local sensitivity of a particular location l with n users is:

• log 2, when n = 1;

• log((n + 1)/n), when C = 1;

• max{ log((n − 1)/(n − 1 + C)) + (C/(n − 1 + C)) log C,  log(n/(n + C)) + (C/(n + C)) log C,  log(1 + 1/exp(H(C \ cu))) }, where C is the maximum number of visits a user contributes to a location (C ≥ 1) and H(C \ cu) = log(n − 1) − (log C)/(C − 1) + log((log C)/(C − 1)) + 1, when n > 1, C > 1.

Proof. We prove the theorem by considering both cases: when a user is added to and when a user is removed from the database. We first derive a proof for the adding case by using the additivity property of entropy from Equation (4). Similarly, the proof for the removing case can be derived from Equation (5). The detailed proofs can be found in Appendix A.2.

Baseline Algorithm: In this section we present a baseline algorithm that publishes location entropy for all locations (see Algorithm 1). Since adding (or removing) a single user from the dataset would impact the entropy of all locations he visited, the change from adding (or removing) a user across all locations is bounded by Mmax∆H, where Mmax is the maximum number of locations visited by a user. Thus, Line 6 adds randomly distributed Laplace noise with mean zero and scale λ = Mmax∆H/ε to the actual value of location entropy H(l). It has been proved [13] that such a simple mechanism is sufficient to achieve differential privacy.

Algorithm 1 Baseline Algorithm
1: Input: privacy budget ε, a set of locations L = {l1, l2, ..., l|L|}, maximum number of visits of a user to a location Cmax, maximum number of locations a user visits Mmax.
2: Compute sensitivity ∆H from Theorem 1 for C = Cmax.
3: For each location l in L
4:   Count #visits each user made to l: cl,u, and compute pl,u
5:   Compute H(l) = −∑u∈Ul pl,u log pl,u
6:   Publish noisy LE: Ĥ(l) = H(l) + Lap(Mmax∆H/ε)
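A minimal sketch of Algorithm 1 follows (our own illustration; the data layout and helper names are assumptions, and Cmax > 1 is assumed so the Theorem 1 bound is defined):

```python
import math
import random

def location_entropy(visits_by_user):
    """H(l) = -sum_u p_{l,u} ln p_{l,u} over a {user: visit count} map."""
    total = sum(visits_by_user.values())
    return -sum(c / total * math.log(c / total) for c in visits_by_user.values())

def baseline_publish(locations, epsilon, C_max, M_max, rng=random):
    """Algorithm 1: add Laplace noise of scale M_max * dH / epsilon to every location's LE.

    `locations` maps location id -> {user id: visit count}; requires C_max > 1.
    """
    dH = max(math.log(2), math.log(C_max) - math.log(math.log(C_max)) - 1)  # Theorem 1
    scale = M_max * dH / epsilon
    noisy = {}
    for l, visits in locations.items():
        u = rng.random() - 0.5  # inverse-CDF Laplace(0, scale) sample
        noise = -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))
        noisy[l] = location_entropy(visits) + noise
    return noisy
```

With realistic Cmax and Mmax the scale M_max * dH / epsilon dwarfs the entropy values themselves (which are at most log n), which is precisely the utility problem the next subsection addresses.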

4.2 Reducing the Global Sensitivity of LE

4.2.1 Limit Algorithm
Limitation of the Baseline Algorithm: Algorithm 1 provides privacy; however, the added noise is excessively high, rendering the results useless. To illustrate, Figure 1 shows the bound on the global sensitivity (Theorem 1) as C varies. The figure shows that the bound monotonically increases as C grows. Therefore, the noise introduced by Algorithm 1 increases as C and M increase. In practice, C and M can be large because a user may have visited either many locations or a single location many times, resulting in large sensitivity. Furthermore, Figure 2 depicts the noise magnitudes (in log scale) used in our various algorithms as the number of users visiting a location, n, varies. The graph shows that the noise magnitude of the baseline is too high to be useful (see Table 2).

Figure 1: Global sensitivity bound of location entropy when varying C.

Figure 2: Noise magnitude in natural log scale (ε = 5, Cmax = 1000, Mmax = 100, C = 20, M = 5, δ = 10⁻⁸, k = 25); curves: Baseline, Limit, Limit-SS, Limit-CB.

Improving Baseline by Limiting User Activity: To reduce the global sensitivity of LE, and inspired by [27], we propose a thresholding technique, named Limit, to limit an individual's activity by truncating C and M. Our technique is based on the following two observations. First, Figure 3b shows the maximum number of visits a user contributes to a location in the Gowalla dataset, which will be used in Section 6 for evaluation. Although most users have one and only one visit, the sensitivity of LE is determined by the worst-case scenario, i.e., the maximum number of visits³. Second, Figure 3a shows the number of locations visited by a user. The figure confirms that many users contribute to more than ten locations.

(a) A user may visit many locations. (b) The largest number of visits a user contributes to a location.
Figure 3: Gowalla, New York.

Since the introduced noise increases linearly with M and monotonically with C, it can be reduced by capping them. First, to truncate M, we keep the first M location visits of each user who visits more than M locations and discard the rest of that user's visits. As a result, adding or removing a single user in the dataset affects at most M locations. Second, we cap at C the number of visits of any user who has contributed more than C visits to a particular location. Figure 2 shows that the noise magnitude used in Limit drops by two orders of magnitude compared with the baseline's.

At a high level, Limit (Algorithm 2) works as follows. Line 3 limits user activity across locations, while Line 7 limits user activity at a single location. The impact of Line 3 is the introduction of an approximation error in the published data: the number of users visiting some locations may be reduced, which alters their actual LE values. Consequently, some locations may be thrown away without being published. Furthermore, Line 7 also alters the value of location entropy, by trimming the number of visits of a user to a location. The actual LE value of location l (after thresholding M and C) is computed in Line 8. Finally, the noisy LE is published in Line 9, where Lap(M∆H/ε) denotes a random variable drawn independently from the Laplace distribution with mean zero and scale parameter M∆H/ε.

³This suggests that users tend not to check in at places that they visit the most, e.g., their homes, because if they did, the peak of the graph would not be at 1.

Algorithm 2 Limit Algorithm
1: Input: privacy budget ε, a set of locations L = {l1, l2, ..., l|L|}, maximum threshold on the number of visits of a user to a location C, maximum threshold on the number of locations a user visits M
2: For each user u in U
3:   Truncate M: keep the first M locations' visits of the users who visit more than M locations
4: Compute sensitivity ∆H from Theorem 1.
5: For each location l in L
6:   Count #visits each user made to l: cl,u, and compute pl,u
7:   Threshold C: cl,u = min(C, cl,u), then compute pl,u
8:   Compute H(l) = −∑u∈Ul pl,u log pl,u
9:   Publish noisy LE: Ĥ(l) = H(l) + Lap(M∆H/ε)
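The two truncation steps (Lines 3 and 7) can be sketched as follows (our own illustration; the input layout, which preserves the order in which a user's locations were recorded, is an assumption):

```python
def truncate_activity(user_visits, M, C):
    """Cap user activity: keep each user's first M locations, and at most C visits per location.

    `user_visits` maps user id -> ordered list of (location id, visit count).
    Returns location id -> {user id: capped visit count}.
    """
    locations = {}
    for u, visits in user_visits.items():
        for l, c in visits[:M]:                        # Line 3: keep only the first M locations
            locations.setdefault(l, {})[u] = min(C, c)  # Line 7: cap visits at C
    return locations

# Example: a heavy user's 3rd location and 50-visit count are both truncated (M=2, C=10).
data = {"u1": [("l1", 50), ("l2", 1), ("l3", 4)], "u2": [("l1", 2)]}
capped = truncate_activity(data, M=2, C=10)
```

After truncation, each user contributes to at most M locations with at most C visits each, so the noise scale M∆H/ε in Line 9 uses the much smaller thresholds rather than the dataset's worst case.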

The performance of Algorithm 2 depends on how we set C and M, and there is a trade-off in the choice of their values: small values of C and M introduce a small perturbation error but a large approximation error, and vice versa. Hence, in Section 6, we empirically find the values of M and C that strike a balance between the noise and approximation errors.

4.2.2 Privacy Guarantee of the Limit Algorithm
The following theorem shows that Algorithm 2 is differentially private.

Theorem 3. Algorithm 2 satisfies ε-differential privacy.

Proof. For all locations, let L1 be any subset of L. Let T = {t1, t2, ..., t|L1|} ∈ Range(A) denote an arbitrary possible output. Then we need to prove the following:

Pr[A(O1(org), ..., O|L1|(org)) = T] / Pr[A(O1(org) \ Ol,u(org), ..., O|L1|(org) \ Ol,u(org)) = T] ≤ exp(ε)

The details of the proof and the notations used can be found in Appendix A.3.

5. RELAXATION OF PRIVATE LE
This section presents our utility enhancements obtained by adopting two weaker notions of privacy: smooth sensitivity [31] (slightly weaker) and crowd-blending [17] (strictly weaker).

5.1 Relaxation with Smooth Sensitivity
We aim to extend Limit to publish location entropy with smooth sensitivity (or SS for short). We first present the notion of smooth sensitivity and the Limit-SS algorithm. We then show how to precompute the SS of location entropy.

5.1.1 Limit-SS Algorithm

Smooth sensitivity is a technique that calibrates the noise magnitude not only to the function one wants to release (i.e., location entropy), but also to the database itself. The idea is to use the local sensitivity bound of each location rather than the global sensitivity bound, resulting in smaller injected noise. However, simply adopting the local sensitivity to calibrate noise may leak information about the number of users visiting that location. Smooth sensitivity is stated as follows.

Let x, y ∈ D^N denote two databases, where N is the number of users. Let l^x, l^y denote the location l in databases x and y, respectively. Let d(l^x, l^y) be the Hamming distance between l^x and l^y, i.e., the number of users at location l on which x and y differ: d(l^x, l^y) = |{i : l^x_i ≠ l^y_i}|, where l^x_i represents the information contributed by one individual. The local sensitivity of location l^x, denoted by LS(l^x), is the maximum change of location entropy when a user is added or removed.

Definition 3 (Smooth sensitivity [31]). For β > 0, the β-smooth sensitivity of location entropy is:

SS_β(l^x) = max_{l^y ∈ D^N} ( LS(l^y) · e^{−β·d(l^x, l^y)} )
          = max_{k=0,1,...,N} e^{−kβ} ( max_{y : d(l^x, l^y)=k} LS(l^y) )

The smooth sensitivity of the LE of location l^x can be interpreted as the maximum of LS(l^x) and the damped local sensitivities LS(l^y), where the effect of a database y at distance k from x is dropped by a factor of e^{−kβ}. Thereafter, the smooth sensitivity of LE can be plugged into Line 3 of Algorithm 2, producing the Limit-SS algorithm.

Algorithm 3 Limit-SS Algorithm

1: Input: privacy budget ε, privacy parameter δ, L = {l1, l2, ..., l_{|L|}}, C, M
2: Copy Lines 2-8 from Algorithm 2
3: Publish noisy LE: Ĥ(l) = H(l) + (2·M·SS_β(l)/ε)·η, where η ∼ Lap(1) and β = ε/(2 ln(2/δ))

5.1.2 Privacy Guarantee of Limit-SS

The noise of Limit-SS is specific to a particular location, as opposed to that of the Baseline and Limit algorithms. Limit-SS has a slightly weaker privacy guarantee: it satisfies (ε, δ)-differential privacy, where δ is a privacy parameter; δ = 0 corresponds to the case of Definition 1. The choice of δ is generally left to the data releaser. Typically, δ < 1/(number of users) (see [31] for details).

Theorem 4 (Calibrating noise to smooth sensitivity [31]). If β ≤ ε/(2 ln(2/δ)) and δ ∈ (0, 1), the algorithm l ↦ H(l) + (2·SS_β(l)/ε)·η, where η ∼ Lap(1), is (ε, δ)-differentially private.
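Theorem 4's calibration can be sketched directly. The smooth-sensitivity value `ss_beta_l` is assumed precomputed (Section 5.1.3); Limit-SS additionally scales the noise by M, which is omitted here:

```python
import math
import random

def beta_for(eps, delta):
    # the largest beta allowed by Theorem 4: beta <= eps / (2 ln(2/delta))
    return eps / (2 * math.log(2 / delta))

def publish_with_ss(H_l, ss_beta_l, eps, rng=random):
    """Release H(l) + (2 * SS_beta(l) / eps) * eta with eta ~ Lap(1),
    as in Theorem 4."""
    # Lap(1) sample: difference of two Exp(1) variates
    eta = rng.expovariate(1.0) - rng.expovariate(1.0)
    return H_l + (2 * ss_beta_l / eps) * eta
```

For instance, ε = 1 and δ = 0.01 give β = 1/(2 ln 200) ≈ 0.094.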

Theorem 5. Limit-SS is (ε, δ)-differentially private.

Proof. Using Theorem 4, A_l satisfies 0-differential privacy when l ∉ L1 ∩ L(u), and satisfies (ε/M, δ/M)-differential privacy when l ∈ L1 ∩ L(u).

5.1.3 Precomputation of Smooth Sensitivity

This section shows that the smooth sensitivity of a location visited by n users can be effectively precomputed. Figure 2 illustrates the precomputed local sensitivity for a fixed value of C.

Let LS(C, n) and SS(C, n) be the local sensitivity and the smooth sensitivity of all locations visited by n users, respectively. LS(C, n) is defined in Theorem 2. Let GS(C) be the global sensitivity of location entropy given C, as defined in Theorem 1. Algorithm 4 computes SS(C, n). At a high level, the algorithm computes the effect of all locations at every possible distance k from n, which is non-trivial. Thus, to speed up the computation, we propose several stopping conditions based on the following observations.

Let n^x, n^y be the number of users who visited l^x, l^y, respectively. If n^x > n^y, Algorithm 4 stops when e^{−kβ}·GS(C) is less than the current value of the smooth sensitivity (Line 7). If n^x < n^y, given that LS(l^y) starts to decrease when n^y > C/(log C − 1) + 1 and that e^{−kβ} also decreases as k increases, Algorithm 4 stops when n^y > C/(log C − 1) + 1 (Line 8). In addition, the algorithm tolerates a minimum value of smooth sensitivity ξ: when n > n0, where LS(C, n0) < ξ, the precomputation of SS(C, n) is stopped and SS(C, n) is set to ξ for all n > n0 (Line 8).

Algorithm 4 Precompute Smooth Sensitivity

1: Input: privacy parameters ε, δ, ξ; C; maximum number of possible users N
2: Set β = ε/(2 ln(2/δ))
3: For n = 1, ..., N:
4:   SS(C, n) = 0
5:   For k = 1, ..., N:
6:     SS(C, n) = max(SS(C, n), e^{−kβ} · max(LS(C, n − k), LS(C, n + k)))
7:     Stop when e^{−kβ}·GS(C) < SS(C, n) and n + k > C/(log C − 1) + 1
8:   Stop when n > C/(log C − 1) + 1 and LS(C, n) < ξ
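Algorithm 4 can be sketched as follows. `ls(n)` stands in for LS(C, n) (Theorem 2) and `gs` for GS(C) (Theorem 1), both assumed given; `n_peak` plays the role of C/(log C − 1) + 1, past which ls(n) decreases. Clamping n − k to 1 is our own simplification:

```python
import math

def precompute_ss(ls, gs, N, beta, xi, n_peak):
    """Precompute SS(C, n) for n = 1..N with the early-stopping rules of
    Algorithm 4 (Lines 7-8)."""
    ss = {}
    for n in range(1, N + 1):
        if n > n_peak and ls(n) < xi:
            ss[n] = xi  # tolerate the floor xi for all remaining n (Line 8)
            continue
        s = ls(n)       # the k = 0 term of Definition 3
        for k in range(1, N + 1):
            lo, hi = max(n - k, 1), min(n + k, N)
            s = max(s, math.exp(-k * beta) * max(ls(lo), ls(hi)))
            # no later k can beat s once e^{-k beta} * GS falls below it
            # and n + k is past the peak of ls (Line 7)
            if math.exp(-k * beta) * gs < s and n + k > n_peak:
                break
        ss[n] = s
    return ss
```

With a local sensitivity that only decreases in n (e.g., 1/n), the k = 0 term dominates, so SS(C, n) reduces to LS(C, n).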

5.2 Relaxation with Crowd-Blending Privacy

5.2.1 Limit-CB Algorithm

Thus far, we publish entropy for all locations; however, the ratio of noise to the true value of LE (the noise-to-true-entropy ratio) is often excessively high when the number of users n visiting a location is small (Equation 3 shows that the entropy of a location is bounded by log(n)). A large noise-to-true-entropy ratio renders the published results useless, since the introduced noise outweighs the actual value of LE. This is an inherent issue with the sparsity of real-world datasets. For example, Figure 4 summarizes the number of users contributing visits to each location in the Gowalla dataset. The figure shows that most locations have check-ins from fewer than ten users. These locations have LE values of less than log(10), which are particularly prone to the noise-adding mechanism of differential privacy.

Figure 4: Sparsity of location visits (Gowalla, New York). [Plot omitted.]

Figure 5: Global sensitivity bound when varying n. [Plot omitted; curves for C = 10 and C = 20; x-axis: number of users (n), y-axis: global sensitivity.]

Therefore, to reduce the noise-to-true-entropy ratio, we propose a small sensitivity bound of location entropy that depends on the minimum number of users visiting a location, denoted by k. Subsequently, we present Algorithm 5, which satisfies (k, ε)-crowd-blending privacy [17]. We prove this in Section 5.2.2.

The algorithm aims to publish the entropy of locations with at least k users (n ≥ k) and discard the other locations. We refer to the algorithm as Limit-CB. Lines 3-6 publish the entropy of each location according to (k, ε)-crowd-blending privacy; that is, we publish the entropy of the locations with at least k users and suppress the others. The following theorem shows that for locations with at least k users we have a tighter bound on ∆H, which depends on C and k. Figure 2 shows that the sensitivity used in Limit-CB is significantly smaller than Limit's sensitivity.

Theorem 6. The global sensitivity of location entropy for locations with at least k users, k ≥ C/(log C − 1) + 1, where C is the maximum number of visits a user contributes to a location, is the local sensitivity at n = k.

Proof. We prove the theorem by showing that the local sensitivity decreases when the number of users n ≥ C/(log C − 1) + 1. Thus, when n ≥ C/(log C − 1) + 1, the global sensitivity equals the local sensitivity at the smallest value of n, i.e., n = k. The detailed proof can be found in Appendix A.2.

Algorithm 5 Limit-CB Algorithm

1: Input: all users U, privacy budget ε; C, M, k
2: Compute global sensitivity ∆H based on Theorem 6
3: For each location l ∈ L:
4:   Count the number of users who visit l: n_l
5:   If n_l ≥ k, publish Ĥ(l) according to Algorithm 2 with budget ε, using the tighter bound on ∆H
6:   Otherwise, do not publish the data
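The suppression step of Limit-CB can be sketched as follows; `publish_noisy` stands in for Algorithm 2 run with the tighter ∆H of Theorem 6, and the data layout is our own:

```python
def limit_cb(users_per_loc, k, publish_noisy):
    """Publish noisy entropy only for locations with at least k users
    (Lines 3-5 above); suppress the rest (Line 6)."""
    released = {}
    for l, user_counts in users_per_loc.items():
        if len(user_counts) >= k:
            released[l] = publish_noisy(l, user_counts)
        # else: location suppressed, nothing is released for it
    return released
```

A location visited by a single user is thus never published, regardless of its entropy.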

5.2.2 Privacy Guarantee of Limit-CB

Before proving the privacy guarantee of Limit-CB, we first present the notion of crowd-blending privacy, a strict relaxation of differential privacy [17]. A (k, ε)-crowd-blending private sanitization of a database requires each individual in the database to blend with k other individuals in the database. This concept is related to k-anonymity [39], since both are based on the notion of "blending in a crowd." However, unlike k-anonymity, which only restricts the published data, crowd-blending privacy imposes restrictions on the noise-adding mechanism. Crowd-blending privacy is defined as follows.

Definition 4 (Crowd-blending privacy). An algorithm A is (k, ε)-crowd-blending private if for every database D and every individual t ∈ D, either t ε-blends in a crowd of k people in D, or A(D) ≈_ε A(D \ {t}) (or both).

A result from [17] shows that differential privacy impliescrowd-blending privacy.

Theorem 7 (DP ⟶ crowd-blending privacy [17]). Let A be any ε-differentially private algorithm. Then A is (k, ε)-crowd-blending private for every integer k ≥ 1.

The following theorem shows that Algorithm 5 is (k, ε)-crowd-blending private.

Theorem 8. Algorithm 5 is (k, ε)-crowd-blending private.


Table 2: Statistics of the datasets.

                                  Sparse    Dense     Gowalla
# of locations                    10,000    10,000    14,058
# of users                        100K      10M       5,800
Max LE                            9.93      14.53     6.45
Min LE                            1.19      6.70      0.04
Avg. LE                           3.19      7.79      1.45
Variance of LE                    1.01      0.98      0.6
Max #locations per user           100       100       1,407
Avg. #locations per user          19.28     19.28     13.5
Max #visits to a loc. per user    20,813    24,035    162
Avg. #visits to a loc. per user   2,578.0   2,575.8   7.2
Avg. #users per loc.              192.9     19,278    5.6

Proof. First, if there are at least k people in a location, then individual u ε-blends with k people in U. This is because Line 5 of the algorithm satisfies ε-differential privacy, which implies (k, ε)-crowd-blending privacy (Theorem 7). Otherwise, we have A(D) ≈_0 A(D \ {t}), since A suppresses each location with fewer than k users.

6. PERFORMANCE EVALUATION

We conduct several experiments on real-world and synthetic datasets to compare the effectiveness and utility of our proposed algorithms. Below, we first discuss our experimental setup; next, we present our experimental results.

6.1 Experimental Setup

Datasets: We conduct experiments on one real-world dataset (Gowalla) and two synthetic datasets (Sparse and Dense). The statistics of the datasets are shown in Table 2. Gowalla contains the check-in history of users in a location-based social network. For our experiments, we use the check-in data in an area covering the city of New York.

For synthetic data generation, in order to study the impact of the density of the dataset, we consider two cases: Sparse and Dense. Sparse contains 100,000 users, while Dense has 10 million users. The Gowalla dataset is sparse as well; we add the Dense synthetic dataset to emulate the case of large industries, such as Google, who have access to large and fine-granule user location data. To generate visits, without loss of generality, the location with id x ∈ {1, 2, ..., 10,000} has a probability 1/x of being visited by a user. This means that locations with smaller ids tend to have higher location entropy, since more users visit these locations. In the same fashion, the user with id y ∈ {1, 2, ..., 100,000} (Sparse) is selected with probability 1/y. This follows the real-world characteristic of location data, where a small number of locations are very popular and many locations have a small number of visits.
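One concrete reading of the generator described above, sketched with our own names; the acceptance-sampling interpretation of "location x has probability 1/x of being visited" is an assumption:

```python
import random

def generate_visits(n_users, n_locs, n_visits, seed=0):
    """Draw visits pairing a user y (chosen with probability proportional
    to 1/y) with a location x (accepted with probability 1/x), so small
    ids collect the most visits."""
    rng = random.Random(seed)
    user_ids = list(range(1, n_users + 1))
    user_weights = [1.0 / y for y in user_ids]
    visits = []
    while len(visits) < n_visits:
        y = rng.choices(user_ids, weights=user_weights)[0]
        x = rng.randrange(1, n_locs + 1)
        if rng.random() < 1.0 / x:  # location x visited with probability 1/x
            visits.append((y, x))
    return visits
```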

In all of our experiments, we use five values of the privacy budget ε ∈ {0.1, 0.5, 1, 5, 10}. We vary the maximum number of visits a user contributes to a location C ∈ {1, 2, ..., 5, ..., 50} and the maximum number of locations a user visits M ∈ {1, 2, 5, 10, 20, 30}. We also vary the threshold k ∈ {10, 20, 30, 40, 50}. Default values are shown in boldface.

Metrics: We use KL-divergence as one measure of how well the original data distribution is preserved after adding noise. Given two discrete probability distributions P and Q, the KL-divergence of Q from P is defined as follows:

DKL(P||Q) = ∑_i P(i) log( P(i) / Q(i) )     (6)

In this paper, the normalized location entropy of a location l is treated as the probability that l is chosen when a location is randomly selected from the set of all locations; P and Q are respectively the published and the actual LE of the locations after normalization, i.e., the normalized values must sum to unity.

We also use the mean squared error (MSE) over a set of locations L as the metric of accuracy, using Equation 7:

MSE = (1/|L|) ∑_{l∈L} (LEa(l) − LEn(l))^2     (7)

where LEa(l) and LEn(l) are the actual and noisy entropy of location l, respectively.
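The two metrics above can be sketched directly from Equations 6 and 7 (names are our own; entropies are assumed strictly positive so the normalized values form valid distributions):

```python
import math

def kl_and_mse(actual, published):
    """KL-divergence (Eq. 6) between the published (P) and actual (Q)
    entropy vectors after normalizing each to sum to one, plus the MSE
    (Eq. 7) over the same locations."""
    sa, sp = sum(actual), sum(published)
    P = [v / sp for v in published]  # published distribution
    Q = [v / sa for v in actual]     # actual distribution
    kl = sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)
    mse = sum((a - b) ** 2 for a, b in zip(actual, published)) / len(actual)
    return kl, mse
```

Identical inputs give zero for both metrics; any distortion makes KL positive.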

Since Limit-CB discards more locations than Limit and Limit-SS, we consider both cases: 1) the KL-divergence and MSE metrics are computed on all locations L, where the entropy of the suppressed locations is set to zero (the default case); 2) the metrics are computed only on the subset of locations that Limit-CB publishes, termed Throwaway.

6.2 Experimental Results

We first evaluate our algorithms on the synthetic datasets.

6.2.1 Overall Evaluation of the Proposed Algorithms

We evaluate the performance of Limit from Section 4.2.1 and its variants (Limit-SS and Limit-CB). We do not include the results for Baseline, since the excessively high amount of injected noise renders the perturbed data useless.

Figure 6 illustrates the distributions of noisy vs. actual LE on Dense and Sparse. The actual distributions of the dense (Figure 6a) and sparse (Figure 6e) datasets confirm our method of generating the synthetic datasets: locations with smaller ids have higher entropy, and the entropy of locations in Dense is higher than that in Sparse. We observe that Limit-SS generally performs best in preserving the original data distribution for Dense (Figure 6c), while Limit-CB performs best for Sparse (Figure 6h). Note that, as we show later, Limit-CB performs better than Limit-SS and Limit given a small budget ε (see Section 6.2.2).

Due to the truncation technique, some locations may be discarded. Thus, we report the percentage of perturbed locations, named the published ratio. The published ratio is computed as the number of perturbed locations divided by the total number of eligible locations; a location is eligible for publication if it contains check-ins from at least K users (K ≥ 1). Figure 7 shows the effect of k on the published ratio of Limit-CB. Note that the published ratio of Limit and Limit-SS is the same as that of Limit-CB when k = K. The figure shows that the ratio is 100% for Dense, while that of Sparse is less than 10%. The reason is that with Dense, each location is visited by a large number of users on average (see Table 2); thus, limiting M and C reduces the average number of users visiting a location, but not to the point where the locations are suppressed. This result suggests that our truncation technique performs well on large datasets.
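The published ratio defined above amounts to a one-line computation (data layout is our own):

```python
def published_ratio(users_per_loc, K, published):
    """Published ratio in percent: perturbed (published) locations over
    eligible ones, i.e., locations with check-ins from at least K users."""
    eligible = {l for l, n_users in users_per_loc.items() if n_users >= K}
    return 100.0 * len(eligible & set(published)) / len(eligible)
```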

6.2.2 Privacy-Utility Trade-off (Varying ε)

We compare the trade-off between privacy and utility by varying the privacy budget ε. The utility is captured by the KL-divergence metric introduced in Section 6.1; we also use the MSE metric. Figure 8 illustrates the results. As expected, when ε increases, less noise is injected, and the values


Figure 6: Comparison of the distributions of noisy vs. actual location entropy on the dense and sparse datasets. [Plots omitted; panels (a)-(d) show Actual, Limit, Limit-SS, and Limit-CB on Dense, and panels (e)-(h) the same on Sparse; x-axis: location id, y-axis: entropy.]

Figure 7: Published ratio of Limit-CB when varying k (K = 20). [Plot omitted; curves for the Sparse and Dense datasets.]

of KL-divergence and MSE decrease. Interestingly, though, KL-divergence and MSE saturate at ε = 5, where reducing the privacy level (increasing ε) only marginally increases utility. This can be explained by a significant approximation error in our thresholding technique that outweighs the impact of a smaller perturbation error. Note that the approximation errors are constant in this set of experiments, since the parameters C, M, and k are fixed.

Another observation is that the errors incurred are generally higher for Dense (Figures 8b vs. 8c), which is surprising, as differentially private algorithms often perform better on dense datasets. The reason is that limiting M and C has a larger impact on Dense, resulting in a large perturbation error. Furthermore, we observe that the improvements of Limit-SS and Limit-CB over Limit are more significant for small ε; in other words, Limit-SS and Limit-CB have more impact at higher levels of privacy protection. Note that these enhancements come at the cost of weaker privacy protection.

6.2.3 The Effect of Varying M and C

We first evaluate the performance of our proposed techniques by varying the threshold M. For brevity, we present the

Figure 8: Varying ε. [Plots omitted; panels: (a) KL-divergence (Dense), (b) MSE (Dense), (c) MSE (Sparse), (d) MSE (Sparse, Throwaway); curves for Limit, Limit-SS, and Limit-CB.]

results only for MSE, as similar results were observed for KL-divergence. Figure 9 indicates the trade-off between the approximation error and the perturbation error. Our thresholding technique decreases M to reduce the perturbation error, but at the cost of increasing the approximation error. As a result, at a particular value of M, the technique balances the two types of errors and thus minimizes the total error. For example, in Figure 9a, Limit performs best at M = 5, while Limit-SS and Limit-CB work best at M ≥ 30. In Figure 9b, however, Limit-SS performs best at M = 10 and Limit-CB performs best at M = 20.

We then evaluate the performance of our techniques byvarying threshold C. Figure 10 shows the results. For


Figure 9: Varying M. [Plots omitted; panels: (a) MSE (Dense), (b) MSE (Sparse, Throwaway); curves for Limit, Limit-SS, and Limit-CB.]

brevity, we only include the KL-divergence results (the MSE metric shows similar trends). The graphs show that KL-divergence increases as C grows. This observation suggests that C should be set to a small number (less than 10). By comparing the effects of varying M and C, we conclude that M has more impact on the trade-off between the approximation error and the perturbation error.

Figure 10: Varying C. [Plots omitted; panels: (a) KL-divergence (Dense), (b) KL-divergence (Sparse); curves for Limit, Limit-SS, and Limit-CB.]

6.2.4 Results on the Gowalla Dataset

In this section, we evaluate the performance of our algorithms on the Gowalla dataset. Figure 11 shows the distributions of noisy vs. actual location entropy. Note that we sort the locations based on their actual values of LE, as depicted in Figure 11a. As expected, due to the sparseness of Gowalla (see Table 2), the published values of LE in Limit and Limit-SS are scattered, while those in Limit-CB preserve the trend in the actual data, but at the cost of throwing away more locations (Figure 11d). Furthermore, we conducted experiments varying the various parameters (i.e., ε, C, M, k) and observed trends similar to those on the Sparse dataset; for brevity, we only show the impact of varying ε and M in Figure 12.

Recommendations for Data Releases: We summarize our observations and provide guidelines for choosing appropriate techniques and parameters. Limit-CB generally performs best on sparse datasets, because it focuses on publishing only the locations with many visits. Alternatively, if the dataset is dense, Limit-SS is recommended over Limit-CB, since there are sufficient locations with many visits. A dataset is dense if most locations (e.g., 90%) have at least nCB users, where nCB is the threshold for choosing Limit-CB. Particularly, given fixed parameters C, ε, δ, k, nCB can be found by comparing the global sensitivity of Limit-CB with the precomputed smooth sensitivity. In Figure 2, nCB is the value of n at which SS(C, nCB) becomes smaller than the global sensitivity of Limit-CB; in other words, the noise

Figure 11: Comparison of the distributions of noisy vs. actual location entropy on Gowalla, M = 5. [Plots omitted; panels: (a) Actual, (b) Limit, (c) Limit-SS, (d) Limit-CB; x-axis: location id, y-axis: entropy.]

Figure 12: Varying ε and M (Gowalla). [Plots omitted; panels: (a) varying ε, (b) varying M; y-axis: MSE; curves for Limit, Limit-SS, and Limit-CB.]

magnitude required for Limit-SS is smaller than that for Limit-CB. Regarding the choice of parameters, to guarantee strong privacy protection, ε should be as small as possible while the measured utility metrics remain practical. Finally, the value of C should be small (≤ 10), while the value of M can be tuned to achieve maximum utility.

7. RELATED WORK

Location privacy has largely been studied in the context of location-based services [19, 30, 26, 18], participatory sensing [8, 6], and spatial crowdsourcing [42, 34]. Most studies use the model of spatial k-anonymity [39], where the location of a user is hidden among k other users [19, 30]. However, there are known attacks on k-anonymity, e.g., when all k users are at the same location. Other studies focused on space transformation to preserve location privacy [26]. Nevertheless, such techniques assume a centralized architecture with a trusted third party, which is a single point of attack. Consequently, a technique based on cryptographic tools such as private information retrieval was proposed that does not rely on a trusted third party to anonymize locations [18]. Recent studies on location privacy have focused on leveraging differential privacy (DP) to protect the privacy of users [2, 7, 35, 42, 40, 47, 22].


Location entropy has been extensively used in various areas of research, including multi-agent systems [45], wireless sensor networks [46], geosocial networks [9, 5, 33, 32], personalized web search [28], image retrieval [49], and spatial crowdsourcing [24, 43, 41]. The studies most closely related to ours focus on privacy-preserving location-based services [3, 29, 48, 44], in which location entropy is used as the measure of privacy or of the attacker's uncertainty [3, 48, 44]. In [48], a privacy model is proposed that discloses a location on behalf of a user only if the location has at least the same popularity (quantified by location entropy) as a public region specified by the user. In fact, locations with high entropy are more likely to be shared (checked in) than places with low entropy [44]. However, directly using location entropy compromises the privacy of individuals. For example, an adversary can infer whether people visit a location from its entropy value: a low value means a small number of people visit the location, and if they are all in a small geographical area, their privacy is compromised. Unlike the studies that use the true value of LE [3, 29, 48, 44], our techniques add noise to the actual LE so that an attacker cannot reliably learn whether or not a particular individual is present in the original data. Although DP has been used to publish various kinds of statistical data, such as counting queries [13], 1D [21] and 2D [7] histograms, high-dimensional data [50], trajectories [38, 22], time series [36, 16, 25], and graphs [20, 27, 23, 10], to the best of our knowledge, there is no study that uses differential privacy for publishing location entropy.

8. CONCLUSIONS

We introduced the problem of publishing the entropy of a set of locations according to differential privacy. A baseline algorithm was proposed based on the derived tight bound for the global sensitivity of location entropy. We showed that the baseline solution requires an excessively high amount of noise to satisfy ε-differential privacy, which renders the published results useless. A simple yet effective truncation technique was then proposed that reduces the sensitivity bound by two orders of magnitude, enabling the publication of location entropy with reasonable utility. The utility was further enhanced by adopting smooth sensitivity and crowd-blending privacy. We conducted extensive experiments and concluded that the proposed techniques are practical.

Acknowledgement

We would like to thank Prof. Aleksandra Korolova for her constructive feedback during the course of this research.

This research has been funded in part by NSF grants IIS-1320149 and CNS-1461963, the National Cancer Institute, National Institutes of Health, Department of Health and Human Services, under Contract No. HHSN261201500003B, the USC Integrated Media Systems Center, and unrestricted cash gifts from Google, Northrop Grumman, and Oracle. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of any of the sponsors.

9. REFERENCES

[1] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang. Deep learning with differential privacy. arXiv:1607.00133, 2016.
[2] M. E. Andres, N. E. Bordenabe, K. Chatzikokolakis, and C. Palamidessi. Geo-indistinguishability: Differential privacy for location-based systems. In Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security, pages 901-914. ACM, 2013.
[3] A. R. Beresford and F. Stajano. Location privacy in pervasive computing. IEEE Pervasive Computing, 2(1):46-55, 2003.
[4] A. Blum, C. Dwork, F. McSherry, and K. Nissim. Practical privacy: The SuLQ framework. In PODS, pages 128-138. ACM, 2005.
[5] E. Cho, S. A. Myers, and J. Leskovec. Friendship and mobility: user movement in location-based social networks. In SIGKDD, pages 1082-1090. ACM, 2011.
[6] D. Christin, A. Reinhardt, S. S. Kanhere, and M. Hollick. A survey on privacy in mobile participatory sensing applications. Journal of Systems and Software, 84(11):1928-1946, 2011.
[7] G. Cormode, C. Procopiuc, D. Srivastava, E. Shen, and T. Yu. Differentially private spatial decompositions. In 2012 IEEE 28th International Conference on Data Engineering (ICDE), pages 20-31. IEEE, 2012.
[8] C. Cornelius, A. Kapadia, D. Kotz, D. Peebles, M. Shin, and N. Triandopoulos. AnonySense: privacy-aware people-centric sensing. In Proceedings of the 6th International Conference on Mobile Systems, Applications, and Services, pages 211-224. ACM, 2008.
[9] J. Cranshaw, E. Toch, J. Hong, A. Kittur, and N. Sadeh. Bridging the gap between physical location and online social networks. In UbiComp. ACM, 2010.
[10] W.-Y. Day, N. Li, and M. Lyu. Publishing graph degree distribution with node differential privacy. In Proceedings of the 2016 International Conference on Management of Data, pages 123-138. ACM, 2016.
[11] Y.-A. de Montjoye, C. A. Hidalgo, M. Verleysen, and V. D. Blondel. Unique in the crowd: The privacy bounds of human mobility. Scientific Reports, 2013.
[12] C. Dwork. Differential privacy. In Automata, Languages and Programming, pages 1-12. Springer, 2006.
[13] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In TCC, pages 265-284. Springer, 2006.
[14] C. Dwork and A. Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211-407, 2014.
[15] U. Erlingsson, V. Pihur, and A. Korolova. RAPPOR: Randomized aggregatable privacy-preserving ordinal response. In SIGSAC, pages 1054-1067. ACM, 2014.
[16] L. Fan and L. Xiong. Real-time aggregate monitoring with differential privacy. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pages 2169-2173. ACM, 2012.
[17] J. Gehrke, M. Hay, E. Lui, and R. Pass. Crowd-blending privacy. In Advances in Cryptology, pages 479-496. Springer, 2012.
[18] G. Ghinita, P. Kalnis, A. Khoshgozaran, C. Shahabi, and K.-L. Tan. Private queries in location based services: anonymizers are not necessary. In SIGMOD, pages 121-132. ACM, 2008.
[19] M. Gruteser and D. Grunwald. Anonymous usage of location-based services through spatial and temporal cloaking. In MobiSys, pages 31-42. ACM, 2003.
[20] M. Hay, C. Li, G. Miklau, and D. Jensen. Accurate estimation of the degree distribution of private networks. In 2009 Ninth IEEE International Conference on Data Mining, pages 169-178. IEEE, 2009.
[21] M. Hay, V. Rastogi, G. Miklau, and D. Suciu. Boosting the accuracy of differentially private histograms through consistency. Proceedings of the VLDB Endowment, 3(1-2):1021-1032, 2010.
[22] X. He, G. Cormode, A. Machanavajjhala, C. M. Procopiuc, and D. Srivastava. DPT: differentially private trajectory synthesis using hierarchical reference systems. Proceedings of the VLDB Endowment, 8(11):1154-1165, 2015.
[23] Z. Jorgensen, T. Yu, and G. Cormode. Publishing attributed social graphs with formal privacy guarantees. In Proceedings of the 2016 International Conference on Management of Data, pages 107-122. ACM, 2016.
[24] L. Kazemi and C. Shahabi. GeoCrowd: enabling query answering with spatial crowdsourcing. In SIGSPATIAL 2012, pages 189-198. ACM, 2012.
[25] G. Kellaris, S. Papadopoulos, X. Xiao, and D. Papadias. Differentially private event sequences over infinite streams. Proceedings of the VLDB Endowment, 7(12):1155-1166, 2014.
[26] A. Khoshgozaran and C. Shahabi. Blind evaluation of nearest neighbor queries using space transformation to preserve location privacy. In International Symposium on Spatial and Temporal Databases, pages 239-257. Springer, 2007.
[27] A. Korolova, K. Kenthapadi, N. Mishra, and A. Ntoulas. Releasing search queries and clicks privately. In WWW, pages 171-180. ACM, 2009.
[28] K. W.-T. Leung, D. L. Lee, and W.-C. Lee. Personalized web search with location preferences. In ICDE, pages 701-712. IEEE, 2010.
[29] J. Meyerowitz and R. Roy Choudhury. Hiding stars with fireworks: location privacy through camouflage. In Proceedings of the 15th Annual International Conference on Mobile Computing and Networking, pages 345-356. ACM, 2009.
[30] M. F. Mokbel, C.-Y. Chow, and W. G. Aref. The new Casper: query processing for location services without compromising privacy. In VLDB, pages 763-774, 2006.
[31] K. Nissim, S. Raskhodnikova, and A. Smith. Smooth sensitivity and sampling in private data analysis. In STOC, pages 75-84. ACM, 2007.
[32] H. Pham and C. Shahabi. Spatial influence - measuring followship in the real world. In 2016 IEEE 32nd International Conference on Data Engineering (ICDE), pages 529-540, May 2016.
[33] H. Pham, C. Shahabi, and Y. Liu. Inferring social strength from spatiotemporal data. ACM Trans. Database Syst., 41(1):7:1-7:47, Mar. 2016.
[34] L. Pournajaf, L. Xiong, V. Sunderam, and S. Goryczka. Spatial task assignment for crowd sensing with cloaked locations. In 2014 IEEE 15th International Conference on Mobile Data Management, volume 1, pages 73-82. IEEE, 2014.
[35] W. Qardaji, W. Yang, and N. Li. Understanding hierarchical methods for differentially private histograms. Proceedings of the VLDB Endowment, 6(14):1954-1965, 2013.
[36] V. Rastogi and S. Nath. Differentially private aggregation of distributed time-series with transformation and encryption. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pages 735-746. ACM, 2010.
[37] C. E. Shannon and W. Weaver. A mathematical theory of communication, 1948.
[38] H. Su, K. Zheng, H. Wang, J. Huang, and X. Zhou. Calibrating trajectory data for similarity-based analysis. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, pages 833-844. ACM, 2013.
[39] L. Sweeney. k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(05).
[40] H. To, L. Fan, and C. Shahabi. Differentially private H-tree. In 2nd Workshop on Privacy in Geographic Information Collection and Analysis, 2015.
[41] H. To, L. Fan, L. Tran, and C. Shahabi. Real-time task assignment in hyperlocal spatial crowdsourcing under budget constraints. In PerCom. IEEE, 2016.
[42] H. To, G. Ghinita, and C. Shahabi. A framework for protecting worker location privacy in spatial crowdsourcing. VLDB, 7(10):919-930, 2014.
[43] H. To, C. Shahabi, and L. Kazemi. A server-assigned spatial crowdsourcing framework. TSAS, 1(1):2, 2015.
[44] E. Toch, J. Cranshaw, P. H. Drielsma, J. Y. Tsai, P. G. Kelley, J. Springfield, L. Cranor, J. Hong, and N. Sadeh. Empirical models of privacy in location sharing. In UbiComp, pages 129-138. ACM, 2010.
[45] H. Van Dyke Parunak and S. Brueckner. Entropy and self-organization in multi-agent systems. In AAMAS, pages 124-130. ACM, 2001.
[46] H. Wang, K. Yao, G. Pottie, and D. Estrin. Entropy-based sensor selection heuristic for target localization. In IPSN, pages 36-45. ACM, 2004.

[47] Y. Xiao and L. Xiong. Protecting locations withdifferential privacy under temporal correlations. InProceedings of the 22nd ACM SIGSAC Conference onComputer and Communications Security, pages1298–1309. ACM, 2015.

[48] T. Xu and Y. Cai. Feeling-based location privacyprotection for location-based services. In CCS, pages348–357. ACM, 2009.

[49] K. Yanai, H. Kawakubo, and B. Qiu. A visual analysisof the relationship between word concepts andgeographical locations. In CIVR, page 13. ACM, 2009.

[50] J. Zhang, G. Cormode, C. M. Procopiuc,D. Srivastava, and X. Xiao. Privbayes: Private datarelease via bayesian networks. In Proceedings of the2014 ACM SIGMOD international conference onManagement of data, pages 1423–1434. ACM, 2014.

APPENDIX
A. PROOF OF THEOREMS


A.1 Proof of Theorem 1

We prove the bound on the global sensitivity for two cases, C = 1 and C > 1. When C = 1, it is obvious that the maximum change of location entropy is \log 2.

When C > 1, \log\big(1 + \frac{1}{\exp(H(\mathcal{C}\setminus c_u))}\big) decreases as n increases. Thus, it is maximized when n = 1, and the maximum change is \log 2.

We also have:

\[
\Big(\log\frac{n-1}{n-1+C} + \frac{C}{n-1+C}\log C\Big)'_n
= \frac{1}{n-1} - \frac{1}{n-1+C} - \frac{C\log C}{(n-1+C)^2}
\]
\[
= \frac{(n-1+C)^2 - (n-1)(n-1+C) - (n-1)C\log C}{(n-1)(n-1+C)^2}
= \frac{C^2 + (n-1)C - (n-1)C\log C}{(n-1)(n-1+C)^2}
\]

Hence,

\[
\Big(\log\frac{n-1}{n-1+C} + \frac{C}{n-1+C}\log C\Big)'_n \ge 0
\;\Leftrightarrow\; C^2 + (n-1)C - (n-1)C\log C \ge 0
\;\Leftrightarrow\; C \ge (n-1)(\log C - 1)
\]

If \log C - 1 \ge 0, i.e., C is not less than the base of the logarithm (which is e in this work), then at n = \frac{C}{\log C - 1} + 1 the function \log\frac{n-1}{n-1+C} + \frac{C}{n-1+C}\log C is maximized, and:

\[
\log\frac{n-1}{n-1+C} + \frac{C}{n-1+C}\log C
= \log\frac{\frac{C}{\log C - 1}}{\frac{C}{\log C - 1} + C} + \frac{C}{\frac{C}{\log C - 1} + C}\log C
\]
\[
= \log\frac{C}{C + C\log C - C} + \frac{C(\log C - 1)\log C}{C\log C}
= \log\frac{1}{\log C} + \log C - 1
= \log C - \log(\log C) - 1
\]

If \log C - 1 < 0, then \log\frac{n-1}{n-1+C} + \frac{C}{n-1+C}\log C always increases with n. Since

\[
\lim_{n\to\infty}\Big(\log\frac{n-1}{n-1+C} + \frac{C}{n-1+C}\log C\Big) = 0,
\]

we have

\[
\log\frac{n-1}{n-1+C} + \frac{C}{n-1+C}\log C < 0.
\]

Similarly, we can prove that:

\[
\max_n\Big(\log\frac{n}{n+C} + \frac{C}{n+C}\log C\Big) = \log C - \log(\log C) - 1
\]
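As a sanity check on this closed form, the following Python sketch (with an arbitrary choice C = 10, and treating n as continuous, as in the derivation) compares a fine grid search over n against \log C - \log(\log C) - 1:

```python
import math

def g(n, C):
    """g(n) = log(n/(n+C)) + (C/(n+C)) * log C, the quantity maximized in A.1."""
    return math.log(n / (n + C)) + (C / (n + C)) * math.log(C)

C = 10.0  # any C with log C > 1, i.e., C > e
# Fine grid over n in [1, 100] (n treated as continuous, step 0.001).
grid_max = max(g(0.001 * i, C) for i in range(1000, 100001))
closed_form = math.log(C) - math.log(math.log(C)) - 1
print(round(grid_max, 6), round(closed_form, 6))  # the two values agree to ~1e-6
```

The interior maximizer n = C/(log C - 1) ≈ 7.68 for C = 10 is not an integer, which is why the derivation works with a continuous n.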

A.2 Proof of Theorem 2

In this section, we derive the bound on the sensitivity of the location entropy of a particular location l when the data of one user is removed from the dataset.

For convenience, let n = |U_l| be the number of users visiting l and c_u = |O_{l,u}| the number of visits of user u. Let \mathcal{C} = \{c_1, c_2, \ldots, c_n\} be the set of visit counts of all users to location l. Let S = |O_l| = \sum_u c_u, and let S_u = S - c_u be the total number of visits to location l after removing user u. From Equation 1, we have:

\[
H(l) = H(O_l) = H(\mathcal{C}) = -\sum_u \frac{c_u}{S}\log\frac{c_u}{S}
\]

By removing user u, the entropy of location l becomes:

\[
H(\mathcal{C}\setminus c_u) = -\sum_{v\neq u} \frac{c_v}{S_u}\log\frac{c_v}{S_u}
\]

From Equation 4, we have:

\[
H(\mathcal{C}) = \frac{S_u}{S}H(\mathcal{C}\setminus c_u) + H\Big(\frac{S_u}{S}, \frac{c_u}{S}\Big)
\]

Subsequently, the change of location entropy when user u is removed is:

\[
\Delta H_u = H(\mathcal{C}\setminus c_u) - H(\mathcal{C}) = \frac{c_u}{S}H(\mathcal{C}\setminus c_u) - H\Big(\frac{S_u}{S}, \frac{c_u}{S}\Big) \tag{8}
\]
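Both the decomposition of Equation 4 and the resulting expression for \Delta H_u can be checked numerically. The sketch below uses a hypothetical count vector and natural logarithms, matching the convention above:

```python
import math

def H(counts):
    """Shannon entropy (natural log) of a visit-count vector."""
    s = sum(counts)
    return -sum(c / s * math.log(c / s) for c in counts)

C_set = [3, 1, 7, 2, 5]      # hypothetical visit counts c_1..c_n at location l
cu = C_set[0]                # user u's count
rest = C_set[1:]             # the counts C \ c_u
S, Su = sum(C_set), sum(rest)

# Equation 4: H(C) = (Su/S) * H(C \ cu) + H(Su/S, cu/S)
lhs = H(C_set)
rhs = (Su / S) * H(rest) + H([Su, cu])   # H([Su, cu]) is the two-point entropy
print(abs(lhs - rhs))  # ~0

# Equation 8: dHu = H(C \ cu) - H(C) = (cu/S) * H(C \ cu) - H(Su/S, cu/S)
delta = H(rest) - H(C_set)
formula = (cu / S) * H(rest) - H([Su, cu])
print(abs(delta - formula))  # ~0
```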

Taking the derivative of \Delta H_u with respect to c_u (note that H(\mathcal{C}\setminus c_u) and S_u do not depend on c_u, while S = S_u + c_u does):

\[
(\Delta H_u)'_{c_u}
= \frac{S - c_u}{S^2}H(\mathcal{C}\setminus c_u) + \Big(\frac{S_u}{S}\log\frac{S_u}{S} + \frac{c_u}{S}\log\frac{c_u}{S}\Big)'_{c_u}
\]
\[
= \frac{S_u}{S^2}H(\mathcal{C}\setminus c_u) - \frac{S_u}{S^2}\log\frac{S_u}{S} - \frac{S_u}{S^2} + \frac{S_u}{S^2}\log\frac{c_u}{S} + \frac{S_u}{S^2}
\]
\[
= \frac{S_u}{S^2}\Big(H(\mathcal{C}\setminus c_u) - \log\frac{S_u}{c_u}\Big)
\]

We have:

\[
(\Delta H_u)'_{c_u} = 0 \;\Leftrightarrow\; H(\mathcal{C}\setminus c_u) = \log\frac{S_u}{c_u}
\]

Therefore, \Delta H_u decreases when c_u \le c^*_u and increases when c^*_u < c_u, where c^*_u is the value such that H(\mathcal{C}\setminus c_u) = \log\frac{S_u}{c^*_u}.

In addition, because H(\mathcal{C}\setminus c_u) \le \log(n-1) and n-1 < S_u when C > 1:
⇒ \Delta H_u < 0 when c_u = 1
⇒ |\Delta H_u| is maximized when c_u = c^*_u or c_u = C.

Case 1: c_u = C.
If \Delta H_u < 0, then |\Delta H_u| is maximized when c_u = c^*_u.
If \Delta H_u > 0, we have:

\[
0 \le \frac{c_u}{S}H(\mathcal{C}\setminus c_u) - H\Big(\frac{S_u}{S}, \frac{c_u}{S}\Big)
\le \frac{C}{S}\log(n-1) - H\Big(\frac{S_u}{S}, \frac{C}{S}\Big)
\]


Taking the derivative of the right-hand side with respect to S_u (with S = S_u + C):

\[
\Big(\frac{C}{S}\log(n-1) - H\Big(\frac{S_u}{S}, \frac{C}{S}\Big)\Big)'_{S_u}
= \Big(\frac{C}{S_u+C}\log(n-1) + \frac{S_u}{S_u+C}\log\frac{S_u}{S_u+C} + \frac{C}{S_u+C}\log\frac{C}{S_u+C}\Big)'_{S_u}
\]
\[
= \frac{-C}{(S_u+C)^2}\log(n-1) + \frac{C}{(S_u+C)^2}\log\frac{S_u}{S_u+C} + \frac{C}{(S_u+C)^2}
- \frac{C}{(S_u+C)^2}\log\frac{C}{S_u+C} - \frac{C}{(S_u+C)^2}
\]
\[
= \frac{C}{(S_u+C)^2}\Big(-\log(n-1) + \log\frac{S_u}{C}\Big)
= \frac{C}{(S_u+C)^2}\log\frac{S_u}{(n-1)C}
\le 0,
\]

since S_u \le (n-1)C.
⇒ The upper bound is maximized when S_u is minimized, i.e., S_u = n-1 (each of the other n-1 users makes a single visit).
⇒ When c_u = C and \Delta H_u > 0:

\[
\Delta H_u \le \frac{C}{n-1+C}\log(n-1) - H\Big(\frac{n-1}{n-1+C}, \frac{C}{n-1+C}\Big)
\]
\[
= \frac{C}{n-1+C}\log(n-1) + \frac{n-1}{n-1+C}\log\frac{n-1}{n-1+C} + \frac{C}{n-1+C}\log\frac{C}{n-1+C}
\]
\[
= \log\frac{n-1}{n-1+C} + \frac{C}{n-1+C}\log C
\]

Case 2: c_u = c^*_u. Using \log\frac{S_u}{c_u} = H(\mathcal{C}\setminus c_u):

\[
\Delta H_u = \frac{c_u}{S}\log\frac{S_u}{c_u} - H\Big(\frac{S_u}{S}, \frac{c_u}{S}\Big)
= \frac{c_u}{S}\log\frac{S_u}{c_u} + \frac{S_u}{S}\log\frac{S_u}{S} + \frac{c_u}{S}\log\frac{c_u}{S}
\]
\[
= \frac{c_u}{S}\log\frac{S_u}{S} + \frac{S_u}{S}\log\frac{S_u}{S}
= \log\frac{S_u}{S}
= -\log\frac{S}{S_u}
= -\log\frac{S_u + c_u}{S_u}
\]
\[
= -\log\Big(1 + \frac{c_u}{S_u}\Big)
= -\log\Big(1 + \frac{1}{S_u/c_u}\Big)
= -\log\Big(1 + \frac{1}{\exp(H(\mathcal{C}\setminus c_u))}\Big)
\]

⇒ |\Delta H_u| is maximized when H(\mathcal{C}\setminus c_u) is minimized.
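This stationary-point identity can be checked numerically with hypothetical counts for the remaining users: since c^*_u = S_u / \exp(H(\mathcal{C}\setminus c_u)), the directly computed \Delta H_u should match -\log(1 + 1/\exp(H(\mathcal{C}\setminus c_u))):

```python
import math

def H(counts):
    """Shannon entropy (natural log) of a (possibly fractional) count vector."""
    s = sum(counts)
    return -sum(c / s * math.log(c / s) for c in counts)

rest = [4, 1, 6, 2]              # hypothetical counts of the other users (C \ cu)
Su = sum(rest)
h = H(rest)
cu_star = Su / math.exp(h)       # stationary point: H(C \ cu) = log(Su / cu*)

delta = H(rest) - H(rest + [cu_star])      # dHu evaluated at cu = cu*
closed = -math.log(1 + 1 / math.exp(h))    # the closed form derived above
print(delta, closed)
```

Note that c^*_u need not be an integer; it is only used as the extremal point of the continuous relaxation.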

Lemma 1. Given a set of n numbers \mathcal{C} = \{c_1, c_2, \ldots, c_n\} with 1 \le c_i \le C, the entropy H(\mathcal{C}) is minimized when c_i = 1 or c_i = C, for all i = 1, \ldots, n.

Proof. When the value of a single number c_u is changed while the others are fixed, from Equation 8 (and since H(\mathcal{C}\setminus c_u) does not depend on c_u):

\[
H(\mathcal{C})'_{c_u} = -(\Delta H_u)'_{c_u}
= \frac{S_u}{S^2}\Big(\log\frac{S_u}{c_u} - H(\mathcal{C}\setminus c_u)\Big)
\]

Therefore, H(\mathcal{C}) increases when c_u \le c^*_u and decreases when c^*_u < c_u, where c^*_u is the value such that H(\mathcal{C}\setminus c_u) = \log\frac{S_u}{c^*_u}.
⇒ H(\mathcal{C}) is minimized at an endpoint, c_u = 1 or c_u = C.

Lemma 2 (Minimum Entropy). Given a set of n numbers \mathcal{C} = \{c_1, c_2, \ldots, c_n\} with 1 \le c_i \le C, the minimum entropy is

\[
H(\mathcal{C}) = \log n - \frac{\log C}{C-1} + \log\Big(\frac{\log C}{C-1}\Big) + 1.
\]

Proof. By Lemma 1, the entropy H(\mathcal{C}) is minimized when \mathcal{C} consists of n-k copies of 1 and k copies of C, i.e., \mathcal{C} = \{1, \ldots, 1, C, \ldots, C\}.

Let S = \sum_i c_i = n - k + kC. We have:

\[
H(\mathcal{C}) = -\frac{n-k}{S}\log\frac{1}{S} - \frac{kC}{S}\log\frac{C}{S}
= \frac{n-k}{S}\log S + \frac{kC}{S}\log S - \frac{kC}{S}\log C
= \log S - \frac{kC}{S}\log C
\]

Since S'_k = C - 1, taking the derivative of H(\mathcal{C}) with respect to k:

\[
(H(\mathcal{C}))'_k = \frac{C-1}{S} - \frac{S - k(C-1)}{S^2}C\log C
= \frac{1}{S^2}\Big(S(C-1) - (S - kC + k)C\log C\Big)
\]
\[
= \frac{1}{S^2}\Big((n-k+kC)(C-1) - (n-k+kC-kC+k)C\log C\Big)
= \frac{1}{S^2}\big(nC - n - kC + k + kC^2 - kC - nC\log C\big)
\]

\[
(H(\mathcal{C}))'_k = 0 \;\Leftrightarrow\; k = \frac{n(C\log C - C + 1)}{(C-1)^2}
\]

When k = 0 or k = n, H(\mathcal{C}) is maximized (both give H = \log n). Thus H(\mathcal{C}) is minimized when k = \frac{n(C\log C - C + 1)}{(C-1)^2}. Strictly, k must be an integer, so k is chosen as \lfloor k\rfloor or \lfloor k\rfloor + 1, whichever yields the smaller value of H(\mathcal{C}).

At this k:

\[
S = n - k + kC = n + k(C-1) = n + \frac{n(C\log C - C + 1)}{C-1}
= \frac{nC - n + nC\log C - nC + n}{C-1}
= \frac{nC\log C}{C-1}
\]

\[
\frac{kC}{S}\log C = \frac{n(C\log C - C + 1)}{(C-1)^2}\cdot\frac{C(C-1)}{nC\log C}\cdot\log C
= \frac{C\log C - C + 1}{C-1}
\]

\[
H(\mathcal{C}) = \log\Big(\frac{nC\log C}{C-1}\Big) - \frac{C\log C - C + 1}{C-1}
= \log n + \log\Big(\frac{C\log C}{C-1}\Big) - \frac{C\log C}{C-1} + 1
\]
\[
= \log n - \frac{\log C}{C-1} + \log\Big(\frac{\log C}{C-1}\Big) + 1
\]
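The lemma can be verified by brute force for small instances. The sketch below (hypothetical n = 6, C = 4) compares the exhaustive minimum over all integer count vectors with the {1, C} configurations of Lemma 1, and with the continuous closed form, which lower-bounds the discrete minimum because it optimizes over a real-valued k:

```python
import math
from itertools import product

def H(counts):
    """Shannon entropy (natural log) of a visit-count vector."""
    s = sum(counts)
    return -sum(c / s * math.log(c / s) for c in counts)

n, C = 6, 4
# Brute force: minimum entropy over all integer count vectors in [1, C]^n.
brute_min = min(H(v) for v in product(range(1, C + 1), repeat=n))

# Lemma 1 predicts the minimizer uses only the values 1 and C;
# Lemma 2's continuous optimum is k* = n(C log C - C + 1)/(C-1)^2 copies of C.
k_star = n * (C * math.log(C) - C + 1) / (C - 1) ** 2
cand = min(H([1] * (n - k) + [C] * k)
           for k in (math.floor(k_star), math.floor(k_star) + 1))

# Closed form with continuous k: a lower bound on the discrete minimum.
closed = math.log(n) - math.log(C) / (C - 1) + math.log(math.log(C) / (C - 1)) + 1
print(brute_min, cand, closed)
```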


Using Lemma 2, when c_u = c^*_u:

\[
|\Delta H_u| \le \log\Big(1 + \frac{1}{\exp(H(\mathcal{C}\setminus c_u))}\Big)
\]

where H(\mathcal{C}\setminus c_u) = \log(n-1) - \frac{\log C}{C-1} + \log\big(\frac{\log C}{C-1}\big) + 1.

Thus, the maximum change of location entropy when a user is removed equals:

• \log\frac{n}{n-1} when C = 1;

• \max\Big(\log\frac{n-1}{n-1+C} + \frac{C}{n-1+C}\log C,\; \log\big(1 + \frac{1}{\exp(H(\mathcal{C}\setminus c_u))}\big)\Big), where H(\mathcal{C}\setminus c_u) = \log(n-1) - \frac{\log C}{C-1} + \log\big(\frac{\log C}{C-1}\big) + 1, when n > 1, C > 1.

Similarly, the maximum change of location entropy when a user is added equals:

• \log\frac{n+1}{n} when C = 1;

• \max\Big(\log\frac{n}{n+C} + \frac{C}{n+C}\log C,\; \log\big(1 + \frac{1}{\exp(H(\mathcal{C}\setminus c_u))}\big)\Big), where H(\mathcal{C}\setminus c_u) = \log(n-1) - \frac{\log C}{C-1} + \log\big(\frac{\log C}{C-1}\big) + 1, when n > 1, C > 1.

This completes the proof of Theorem 2.
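The removal-case bound can be stress-tested empirically: for random integer count vectors (hypothetical n = 8, C = 5), the observed |\Delta H_u| over all single-user removals should never exceed the stated maximum:

```python
import math
import random

def H(counts):
    """Shannon entropy (natural log) of a visit-count vector."""
    s = sum(counts)
    return -sum(c / s * math.log(c / s) for c in counts)

random.seed(0)
n, C = 8, 5  # hypothetical instance size and visit cap, with n > 1, C > 1

# Theorem 2 bound for user removal.
h_min = math.log(n - 1) - math.log(C) / (C - 1) + math.log(math.log(C) / (C - 1)) + 1
bound = max(math.log((n - 1) / (n - 1 + C)) + C / (n - 1 + C) * math.log(C),
            math.log(1 + 1 / math.exp(h_min)))

worst = 0.0
for _ in range(2000):
    counts = [random.randint(1, C) for _ in range(n)]
    for u in range(n):
        rest = counts[:u] + counts[u + 1:]
        worst = max(worst, abs(H(rest) - H(counts)))
print(worst, bound)  # the observed worst case stays below the bound
```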

A.3 Proof of Theorem 3

In this section, we prove Theorem 3, i.e., that Algorithm 2 satisfies ε-differential privacy. We prove the theorem for the case when a user is removed from the database; the case when a user is added is similar.

Let l be an arbitrary location and A_l : L → R be Algorithm 2 when only location l is considered for perturbation. Let O_l(org) be the original set of observations at location l; O_{l,u}(org) the original set of observations of user u at location l; O_{l,u} the set of observations after limiting c_u to C and limiting each user to at most M locations; O_l the union of the sets O_{l,u} over all users u that visit location l; and \mathcal{C} the set of visit counts. Let O_l \setminus O_{l,u} be the set of observations at location l when user u is removed from the dataset. Let b = \frac{M\,\Delta H}{\varepsilon}.

If O_{l,u}(org) = ∅:

\[
\frac{\Pr[A_l(O_l(org)) = t_l]}{\Pr[A_l(O_l(org) \setminus O_{l,u}(org)) = t_l]} = 1
\]

If O_{l,u}(org) ≠ ∅:

\[
\frac{\Pr[A_l(O_l(org)) = t_l]}{\Pr[A_l(O_l(org) \setminus O_{l,u}(org)) = t_l]}
= \frac{\Pr[A_l(O_l) = t_l]}{\Pr[A_l(O_l \setminus O_{l,u}) = t_l]}
= \frac{\Pr[H(\mathcal{C}) + Lap(b) = t_l]}{\Pr[H(\mathcal{C}\setminus c_{l,u}) + Lap(b) = t_l]}
\]

If H(\mathcal{C}) \le H(\mathcal{C}\setminus c_{l,u}):

\[
\frac{\Pr[H(\mathcal{C}) + Lap(b) = t_l]}{\Pr[H(\mathcal{C}\setminus c_{l,u}) + Lap(b) = t_l]}
= \frac{\Pr[H(\mathcal{C}) + Lap(b) = t_l]}{\Pr[H(\mathcal{C}) + \Delta H(l) + Lap(b) = t_l]}
= \frac{\Pr[Lap(b) = t_l - H(\mathcal{C})]}{\Pr[Lap(b) = t_l - H(\mathcal{C}) - \Delta H(l)]}
\]
\[
= \exp\Big(\frac{1}{b}\big(-|t_l - H(\mathcal{C})| + |t_l - H(\mathcal{C}) - \Delta H(l)|\big)\Big)
\le \exp\Big(\frac{|\Delta H(l)|}{b}\Big)
\le \exp\Big(\frac{\Delta H}{b}\Big)
= \exp\Big(\frac{\varepsilon}{M}\Big)
\]

If H(\mathcal{C}) > H(\mathcal{C}\setminus c_{l,u}), similarly:

\[
\frac{\Pr[H(\mathcal{C}) + Lap(b) = t_l]}{\Pr[H(\mathcal{C}\setminus c_{l,u}) + Lap(b) = t_l]} \le \exp\Big(\frac{\varepsilon}{M}\Big)
\]

Similarly, we can prove that:

\[
\frac{\Pr[A_l(O_l(org) \setminus O_{l,u}(org)) = t_l]}{\Pr[A_l(O_l(org)) = t_l]} \le \exp\Big(\frac{\varepsilon}{M}\Big)
\]

Therefore, A_l satisfies 0-differential privacy when O_{l,u}(org) = ∅, and \frac{\varepsilon}{M}-differential privacy when O_{l,u}(org) ≠ ∅.

For all locations, let A : L → R^{|L|} be Algorithm 2, and let L_1 be any subset of L. Let T = \{t_1, t_2, \ldots, t_{|L_1|}\} \in Range(A) denote an arbitrary possible output; A is the composition of all A_l, l ∈ L. Let L(u) be the set of locations that u visits, with |L(u)| \le M. Applying the composition theorems [14] to all A_l, where A_l satisfies 0-differential privacy when l ∉ L_1 ∩ L(u) and \frac{\varepsilon}{M}-differential privacy when l ∈ L_1 ∩ L(u), we conclude that A satisfies ε-differential privacy.
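The core of the per-location argument is the standard Laplace-density ratio bound. The sketch below evaluates it directly for hypothetical values of ε, M, and ΔH, checking that the worst-case density ratio over outputs t_l stays below exp(ε/M) as long as the shift between the two entropies does not exceed the sensitivity:

```python
import math

def lap_pdf(x, b):
    """Density of the Laplace(0, b) distribution at x."""
    return math.exp(-abs(x) / b) / (2 * b)

eps, M, dH = 0.5, 3, 0.7   # hypothetical privacy budget, location cap, sensitivity
b = M * dH / eps           # noise scale used by the per-location mechanism A_l

h1, h2 = 1.9, 1.9 + 0.55   # entropies of neighboring datasets; |h1 - h2| <= dH
ratio_bound = math.exp(eps / M)

# Worst-case density ratio over a grid of possible outputs t_l.
worst = max(lap_pdf(t - h1, b) / lap_pdf(t - h2, b)
            for t in [x / 100 for x in range(-500, 1000)])
print(worst, ratio_bound)  # worst <= exp(eps / M)
```

The maximum ratio, exp(|h1 - h2| / b), is attained for outputs on the far side of both means, exactly as in the case analysis above.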