Evaluating Privacy-Preserving Generative Models in the Wild

Technical Report

Bristena Oprisanu, Adria Gascon, Emiliano De Cristofaro

Abstract

The ability to share data among different entities can enable a number of compelling applications and analytics. However, useful datasets contain, more often than not, information of a sensitive nature, which can endanger the privacy of users within the dataset. A possible alternative is to use machine learning algorithms known as generative models: these learn the data-generating distribution and make it possible to generate artificial data resembling the data they are trained on. In other words, entities can train and publish the model instead of the original data; however, access to such a model could still allow an adversary to reconstruct (possibly sensitive) training data.

To overcome these issues, researchers have recently proposed a number of techniques that use carefully crafted random noise to provide strong privacy guarantees – specifically, guaranteeing Differential Privacy – so that the privacy leakage of synthetic data generation can be quantified by rigorous, statistical means. A common approach in several of the proposed techniques involves training a generative machine learning model using a differentially private version of stochastic gradient descent, the so-called moments accountant method (Abadi et al. 2016).

In this project, we investigate the real-world feasibility of four recently proposed approaches to private synthetic data generation. We evaluate these approaches extensively on different types of data and data analysis tasks. Moreover, we provide an empirical evaluation of the moments accountant method, highlighting the range of privacy guarantees that it yields. Overall, we show that a generic approach applicable across a wide range of datasets and tasks might be too much to ask. However, our experiments also suggest that satisfactory trade-offs between utility and privacy for private data synthesis are achievable in specific settings. The most encouraging results correspond to image and financial data: for good privacy parameters, synthetic training datasets led to a 5-8% accuracy loss with respect to training on the non-private original data, for simple linear models.

1 Introduction

Every day over 2.5 quintillion bytes of data are created [22]; however, the value of this data is maximized only with the ability to analyze it and provide meaningful insight from it, ultimately facilitating research and meaningful analytics. In this context, data collection and data sharing often constitute necessary steps to enable progress of businesses and society. As a result, entities are often willing or compelled to provide access to their datasets, e.g., to enable analysis by third parties.

Alas, data sharing does not come without privacy risks. The GDPR, recently introduced in the EU, also puts pressure on businesses to be more responsible, more accountable, and more transparent with respect to personal information and privacy laws. In an attempt to mitigate these risks, several methods have been proposed, e.g., anonymizing datasets before sharing them. However, as pointed out on several occasions [6], in practice, anonymization fails to provide realistic privacy guarantees. Another approach has been to release aggregate statistics, but this is also vulnerable to a number of attacks, e.g., membership inference (where one could test for the presence of a given individual's data in the aggregates) [28].

More promising attempts come from providing access to statistics obtained from the data, while adding noise to the queries' responses, in such a way that "differential privacy" [15] is guaranteed; we describe differential privacy in Section 3. However, this approach generally lowers the utility of the dataset, especially on high-dimensional data. Additionally, allowing unlimited non-trivial queries on a dataset can reveal the whole dataset, so this approach needs to keep track of the privacy budget over time.

Privacy-Preserving Generative Models. An alternative approach for addressing these issues is to release realistic synthetic data using privacy-preserving generative models. Generative models yield new samples that follow the same probabilistic distribution as a given training dataset. The intuition is that entities can train and publish the model, but not the original data, so that anybody can generate a synthetic dataset resembling the data it was trained on. Crucially, the model should be published while also guaranteeing differential privacy.

Although these approaches have been evaluated in research papers for certain use cases, we set out to investigate their real-world usability and limitations. Specifically, we select four models: 1) PrivBayes [33], an $\epsilon$-differentially private algorithm, which constructs a generative model based on Bayesian networks, 2) DP-SYN [3] and 3) Priv-VAE [4], which use differentially private generative models relying on the moments accountant [2], and 4) Syn-Loc [7], a seed-based generative model for location data, building on the concept of plausible deniability.

Overview of the Results

Our experimental evaluation is performed on five datasets, grouped into four categories: financial data, images, cyber threat logs, and location data. All the code and data considered in this report are available in a GitHub repository, and easily deployable by means of a Docker container, as part of the project deliverables.

Financial data. We use two datasets from the UCI Machine Learning repository [12] and evaluate accuracy for classification tasks on the synthetic data (e.g., classifying individuals as having good/bad credit risks, or income above/below $50K). We find that PrivBayes performs best under a high-privacy regime for large datasets, but overall synthetic data generated from smaller datasets yields poor accuracy, even lower than random guessing. Overall, DP-SYN reaches the highest accuracy, but its accuracy does not improve with less noise. Furthermore, even though the synthetic data generated by Priv-VAE achieves similar accuracy to DP-SYN, it performs poorly on datasets with imbalanced classes.

Images. For images, we use the MNIST dataset of handwritten digits [20], evaluating the performance of digit recognition. DP-SYN achieves the highest accuracy for classification, and it also seems to generate the clearest samples from a human evaluation perspective.

Network activity logs. We evaluate data synthesis on cyber threat logs obtained from DShield [1] and find it to be the most sensitive to perturbation, to the point that no model is able to classify or predict threats with better accuracy than random guessing. However, we do not rule out that a dataset of this nature richer than DShield could yield better results.

Location data. Finally, for location data, we use the San Francisco cabs dataset [26], evaluating performance on a clustering task, which can be used for detecting areas of interest. We confirm that the model obtaining the distribution of clusters most similar to the original data is Syn-Loc, which specializes in this particular task.

Moments Accountant. Several privacy-preserving generative models are instantiations of the moments accountant method [2]. This technique makes it possible to keep track of the overall privacy budget throughout a differentially private stochastic gradient descent optimization. The moments accountant offers a concrete regime of privacy guarantees that involves a complex dependency between the parameters of the gradient descent procedure, the available privacy budget, and the size of the input dataset. We provide a description of the technique and show empirically what range of privacy guarantees it can provide. Our goal is to shed light on the relationship between the moments accountant and the synthetic data generation techniques that rely on it, a point often omitted in the description of approaches that use differentially private gradient descent as a subprocedure.

Report Organization

In Section 2, we review background information on privacy-preserving data synthesis; then, in Section 3, we provide an overview of differential privacy, followed by a discussion of the moments accountant in Section 4. We then present the models used in our evaluation in Section 5, while, in Section 6, we present the testing methodology and the experimental results, followed by a discussion in Section 7.

2 Privacy-Preserving Data Release

The need to reconcile data release with the protection of sensitive information has encouraged researchers to consider different approaches; we review them in this section.

2.1 Approaches for Privacy and Data Analysis

Anonymization. In theory, one could try to anonymize data by stripping personally identifiable information before sharing it. The assumption is that sharing anonymized records can be done freely, since no one knows who the respective record belongs to. In practice, however, this assumption has been disproven on multiple occasions, for numerous datasets. Archie et al. [6] re-identify users from the Netflix Prize dataset by using publicly available IMDb data. This is an even bigger problem for more sensitive data, such as genomes, where re-identification of users has also been proven to be possible. Gymrek et al. [16] demonstrate that the surnames of genomic data donors can be inferred using data publicly available from recreational genealogy databases. Additionally, not even k-anonymity, where generalization techniques are used to mask exact values of attributes, is safe against inference attacks [5].

Aggregation. Another approach is to share aggregate statistics about a dataset. For example, one can share the number of people in a certain location at a given time in order to determine if the location is considered a point of interest [27]. However, this is also ineffective, due to susceptibility to membership inference attacks. For instance, Pyrgelis et al. [28] show how to determine whether a user is part of aggregate location data. Membership inference attacks have also been demonstrated in other contexts, e.g., genomic data [19, 32].

Differentially Private Data Release. In order to provide stronger privacy guarantees, techniques that satisfy differential privacy have been more widely proposed as more effective solutions. Differential privacy, reviewed in Section 3, provides a formal mathematical definition that specifies requirements for controlling privacy risk, with several properties (e.g., composition, post-processing) that facilitate reasoning about privacy and the construction of differentially private algorithms. However, the tension between usability and privacy is inherently complex and application-dependent, and differentially private algorithms have often been regarded as providing low utility for researchers, as, e.g., in the case of health data [11].

Privacy-Preserving Synthetic Data Generation. To overcome these limitations, another approach is to generate realistic synthetic data using generative models. These yield new samples that follow the same probabilistic distribution as a given training dataset. The intuition is that entities can train and publish the model, in a differentially private way, so that anybody can generate a synthetic dataset resembling the data it was trained on, without exposing the training data itself.

Imputation models. One of the first approaches for generating fully synthetic data was proposed by Rubin [29]. The idea is to treat all observations from the sampling frame as missing data and to impute them using the multiple imputation method. Because the synthetic data has no functional link to the original data, it can preserve the confidentiality of participants. However, as discussed in [10], the synthetic data generated this way is subject to inferential disclosure risk when the model used to generate the data is too accurate.

Statistical models. Other approaches attempt to generate a statistical model based on the original data [33]. The main idea is to generate a low-dimensional distribution of the original data to help with the data generation process. This approach, combined with differential privacy, aims to provide privacy guarantees for the synthetic data.

Generative Models. More recently, generative machine learning models have attracted a lot of attention from the research community. A generative model is a way to learn any kind of data distribution using unsupervised learning, aiming to generate new samples that follow the same probabilistic distribution as a given dataset. Generative models based on neural networks work by optimizing the weights of the connections between neurons via back-propagation. For complex networks, the optimization is usually done with the mini-batch stochastic gradient descent (SGD) algorithm. Generative models can be used in conjunction with differential privacy. This is usually done using a differentially private training procedure, which guarantees that the learned model is differentially private, and thus any synthetic dataset derived from it will also guarantee differential privacy.

We also look at seed-based generative models [7], which condition the output of the model on input data, called the seed. This way, the model will produce synthetic records similar to the seed, which can increase the quality of the output. Because of the high correlation between the output and the seed, privacy tests which provide differential privacy guarantees are introduced.

3 Differential Privacy

In this section, we discuss Differential Privacy (DP) as well as the Moments Accountant method presented in [2] in the context of privacy-preserving deep learning. Readers familiar with these concepts can skip this section without loss of continuity.

3.1 Definitions and Properties

DP addresses the paradox of learning nothing about an individual while learning useful information about a population [15]. Generally speaking, differential privacy aims to provide rigorous, statistical guarantees against what an adversary can infer from learning the result of some randomized algorithm. Typically, differentially private techniques protect the privacy of individual data subjects by adding random noise when producing statistics. In a nutshell, differential privacy guarantees that an individual will be exposed to the same privacy risk whether or not her data is included in a differentially private analysis.

Definition. Formally, for two non-negative numbers $\epsilon, \delta$, a randomized algorithm $A$ satisfies $(\epsilon, \delta)$-differential privacy if and only if, for any neighboring datasets $D$ and $D'$ (i.e., differing in at most one record), and for any possible output set $S \subseteq \mathrm{Range}(A)$, the following holds:

\[ \Pr[A(D) \in S] \le e^{\epsilon} \Pr[A(D') \in S] + \delta \]

The $\epsilon, \delta$ parameters. Differential privacy analyses allow for some information leakage specific to individual data subjects, controlled by the privacy parameter $\epsilon$. This measures the effect of each individual's information on the output of the analysis. With smaller values of $\epsilon$, the dataset is considered to have stronger privacy, but less accuracy, thus reducing its utility. An intuitive description of the privacy parameter, along with supporting examples, is available in [24].

If $\delta = 0$, we say that the mechanism is $\epsilon$-differentially private. This is considered to be the absolute case, in which one cannot gain more than a small amount of probabilistic information about a single individual. By contrast, $\delta > 0$ allows for a small probability of failure, e.g., an output can occur with probability $\delta > 0$ if an individual is present in the dataset, and never happens otherwise.

As per [15], the values of $\delta$ are usually computed as an inverse function of a polynomial in the size of the dataset. In particular, values of $\delta$ on the order of $\frac{1}{|D|}$, where $|D|$ represents the size of the dataset $D$, are considered to be very dangerous: even though this case is "privacy-preserving," it would still allow the publication of complete records for a small number of participants. In order to better understand why values of $\delta$ on the order of $\frac{1}{|D|}$ can be dangerous, consider an algorithm that simply releases an entry of the dataset $D$ uniformly at random. This algorithm is $(\epsilon, \delta)$-differentially private with $\epsilon = 1$ and $\delta = \frac{1}{|D|}$, and yet obviously does not provide a meaningful privacy guarantee.

Post-Processing. Differential privacy is "immune" to post-processing, i.e., any function applied to the output of a differentially private algorithm cannot provide weaker privacy guarantees than the original mechanism. More formally, let $A$ be a randomized algorithm that is $(\epsilon, \delta)$-differentially private, and $f$ be an arbitrary mapping. Then, $f \circ A$ is also $(\epsilon, \delta)$-differentially private.

Composition. One of the most important properties of differential privacy is its robustness under composition. When combining multiple differentially private mechanisms, composition theorems can be used to account for the total differential privacy of the system. More precisely, for mechanisms $M_1, \ldots, M_n$, where each $M_i$ is an $(\epsilon_i, \delta_i)$-differentially private algorithm, we have that $M_{[n]}(x) = (M_1(x), \ldots, M_n(x))$ is $(\sum_{i=1}^{n} \epsilon_i, \sum_{i=1}^{n} \delta_i)$-differentially private.

Strong Composition. Dinur and Nissim [13] and Dwork and Nissim [14] showed that, under $k$-fold adaptive composition on a single database, the privacy parameter deteriorates less if a negligible loss in $\delta$ can be tolerated. This yields the Strong Composition Theorem:


For every $\epsilon > 0$, $\delta, \delta' > 0$ and $k \in \mathbb{N}$, the class of $(\epsilon, \delta)$-differentially private mechanisms is $(\epsilon', k\delta + \delta')$-differentially private under $k$-fold adaptive composition, for

\[ \epsilon' = \sqrt{2k \ln \tfrac{1}{\delta'}} \cdot \epsilon + k \cdot \epsilon \cdot \epsilon_0, \]

where $\epsilon_0 = e^{\epsilon} - 1$. This theorem gives a stronger bound on the expected privacy loss due to multiple mechanisms, which relaxes the worst-case result given by the basic composition theorem.

Sensitivity. The notion of the sensitivity of a function is very useful in the design of differentially private algorithms; it is defined with respect to a neighboring relationship. Given a query $F$ on a dataset $D$, the sensitivity is used to adjust the amount of noise required for $F(D)$. More formally, if $F$ is a function that maps a dataset (in matrix form) into a fixed-size vector of real numbers, we can define the $L_i$-sensitivity of $F$ as:

\[ S_i(F) = \max_{D, D'} \| F(D) - F(D') \|_i, \]

where $\|\cdot\|_i$ denotes the $L_i$ norm, $i \in \{1, 2\}$, and $D$ and $D'$ are any two neighboring datasets.

The Gaussian Mechanism. One of the most widely used methods to achieve $(\epsilon, \delta)$-differential privacy is to add Gaussian noise to the result of a query. Given a function $F : D \rightarrow \mathbb{R}$ over a dataset $D$, if $\sigma = S_2(F)\sqrt{2\ln(2/\delta)}/\epsilon$ and $\mathcal{N}(0, \sigma^2)$ are independent and identically distributed Gaussian random variables, the mechanism $M$ provides $(\epsilon, \delta)$-differential privacy when:

\[ M(D) = F(D) + \mathcal{N}(0, \sigma^2) \]

More specifically, the Gaussian Mechanism with parameter $\sigma$ adds noise scaled to $\mathcal{N}(0, \sigma^2)$ to each of the components of the output.

Differentially Private Data Generation. As discussed earlier, the main focus of our work is differentially private synthetic data generation. The main idea is that, once the data is generated, it can be used for multiple analyses, without the need to further increase the privacy budget. This is a consequence of the post-processing property of differential privacy mentioned above. Incidentally, the National Institute of Standards and Technology (NIST) has recently launched a differential privacy synthetic data challenge [31], aiming to find synthetic data generation algorithms that protect individual privacy while providing high utility of the overall dataset.
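To make the mechanism concrete, the following is a minimal Python/NumPy sketch of the Gaussian Mechanism as described above. The query, its $L_2$ sensitivity, and the example values are illustrative assumptions, not part of any of the systems evaluated in this report.

# Minimal sketch of the Gaussian Mechanism described above (illustrative only).
import numpy as np

def gaussian_mechanism(query_result, l2_sensitivity, epsilon, delta, rng=None):
    # Calibrate sigma = S2(F) * sqrt(2 ln(2/delta)) / epsilon, as in the text.
    rng = rng or np.random.default_rng()
    sigma = l2_sensitivity * np.sqrt(2.0 * np.log(2.0 / delta)) / epsilon
    noise = rng.normal(loc=0.0, scale=sigma, size=np.shape(query_result))
    return np.asarray(query_result) + noise

# Example: a private mean of 1,000 values in [0, 1]; replacing one record
# changes the mean by at most 1/n, so the L2 sensitivity is 1/n.
data = np.random.default_rng(0).random(1000)
noisy_mean = gaussian_mechanism(data.mean(), l2_sensitivity=1 / len(data),
                                epsilon=1.0, delta=1e-5)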

3.2 Moments Accountant

In [2], Abadi et al. show how to bound the privacy loss of gradient descent-like computations. They present a privacy-preserving stochastic gradient descent (SGD) algorithm for training a model with parameters $\theta$ by minimizing the loss function $\mathcal{L}(\theta)$; see Algorithm 1.

At each step of the algorithm, the gradient is computed for a small subset of examples (Line 6), then the $L_2$ norm of each gradient is clipped (Line 7) in order to bound the influence of each individual example.

Algorithm 1 Differentially Private SGD

1: Input: Examples $\{x_1, \ldots, x_N\}$, loss function $\mathcal{L}(\theta) = \frac{1}{N}\sum_i \mathcal{L}(\theta, x_i)$. Parameters: learning rate $\eta_t$, noise scale $\sigma$, group size $L$, gradient norm bound $C$, number of steps $T$.
2: Initialize $\theta_0$ randomly
3: for $t = 1$ to $T - 1$ do
4:   Take a random sample $L_t$, with sampling probability $q = L/N$
5:   for each $i \in L_t$ do
6:     $g_t(x_i) = \nabla_{\theta_t} \mathcal{L}(\theta_t, x_i)$
7:     $g_t(x_i) \leftarrow g_t(x_i) / \max\!\big(1, \|g_t(x_i)\|_2 / C\big)$
8:   $\tilde{g}_t = \frac{1}{L}\big(\sum_i g_t(x_i) + \mathcal{N}(0, \sigma^2 C^2 I)\big)$
9:   $\theta_{t+1} = \theta_t - \eta_t \tilde{g}_t$
return $\theta_T$ and compute the overall privacy cost $(\epsilon, \delta)$ using a privacy accounting method.

Noise is then added to the clipped gradient (Line 8) and a step is taken in the opposite direction of the average noisy gradient. When outputting the model (Line 9), the total privacy loss needs to be computed, using a privacy accounting method. The composition property of differential privacy makes it possible to compute the privacy cost at each access to the training data, accumulating the cost over all training steps. An initial bound is given by the strong composition theorem; however, [2] provides a stronger accounting method, namely the moments accountant, which saves a factor of $\sqrt{\log(1/\delta)}$ in the $\epsilon$ part and $Tq$ in the $\delta$ part of the asymptotic bound.
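As an illustration of Lines 4-9, here is a minimal NumPy sketch of a single noisy SGD step. It is not the implementation used by any of the evaluated systems; the per-example gradient function grad_fn and all parameter names are assumptions made for the example.

# Minimal sketch of one DP-SGD iteration (Algorithm 1, Lines 4-9).
import numpy as np

def dp_sgd_step(theta, examples, grad_fn, lot_size, noise_scale, clip_norm,
                learning_rate, rng):
    n = len(examples)
    # Line 4: sample a lot with per-example probability q = L / N.
    lot = [x for x in examples if rng.random() < lot_size / n]
    # Lines 6-7: per-example gradients, clipped to L2 norm at most C.
    clipped = []
    for x in lot:
        g = grad_fn(theta, x)
        clipped.append(g / max(1.0, np.linalg.norm(g) / clip_norm))
    # Line 8: add noise N(0, sigma^2 C^2 I) to the sum and average over the lot.
    noisy_sum = np.sum(clipped, axis=0) + rng.normal(
        0.0, noise_scale * clip_norm, size=theta.shape)
    g_tilde = noisy_sum / lot_size
    # Line 9: take a step against the average noisy gradient.
    return theta - learning_rate * g_tilde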

Formal Description. Next, we provide a formal description of the main technical aspects of the moments accountant technique, mirroring the presentation in the original paper [2]. For proofs and details we refer the reader to the supplementary material of [2].

For any neighboring databases $D, D'$, a mechanism $M$, an auxiliary input $\mathrm{aux}$ and an outcome $o$, the privacy loss at $o$ is defined as:

\[ c(o; M, \mathrm{aux}, D, D') = \log \frac{\Pr[M(\mathrm{aux}, D) = o]}{\Pr[M(\mathrm{aux}, D') = o]} \]

For a given mechanism $M$, the $\lambda$-th moment $\alpha_M(\lambda; \mathrm{aux}, D, D')$ is defined as:

\[ \alpha_M(\lambda; \mathrm{aux}, D, D') = \log \mathbb{E}_{o \sim M(\mathrm{aux}, D)}\big[\exp\big(\lambda\, c(o; M, \mathrm{aux}, D, D')\big)\big] \]

In order to provide the guarantees of the mechanism, all possible $\alpha_M(\lambda; \mathrm{aux}, D, D')$ should be bounded, so $\alpha_M(\lambda)$ is defined as:

\[ \alpha_M(\lambda) = \max_{\mathrm{aux}, D, D'} \alpha_M(\lambda; \mathrm{aux}, D, D'), \]

where the maximum is taken over all possible $\mathrm{aux}$ and all neighboring datasets $D$ and $D'$.

Then, $\alpha_M$ satisfies the following properties:

• Composability: Suppose that a mechanism $M$ consists of a sequence of adaptive mechanisms $M_1, \ldots, M_k$. Then, for any $\lambda$:

\[ \alpha_M(\lambda) \le \sum_{i=1}^{k} \alpha_{M_i}(\lambda). \]


Algorithm 2 Epsilon Computation

1: Input: $q$, $\sigma$, $\delta$, number of epochs $ep$, number of orders $\lambda$
2: Initialize $l = -\infty$, $\alpha_{list} = \emptyset$, $\epsilon_{list} = \emptyset$
3: for $i = 2$ to $\lambda + 1$ do
4:   for $j = 2$ to $i + 1$ do
5:     $l_1 = \log\!\big(\frac{i!}{j!(i-j)!}\big) + j \cdot \log q + (i - j) \cdot \log(1 - q)$
6:     $s = l_1 + (j^2 - j)/(2\sigma^2)$
7:     if $l$ is $-\infty$ then
8:       $l = s$
9:     else
10:      $l = \log(e^{l - s} + 1) + s$
11:  $\alpha = l \cdot ep \cdot q$
12:  Append $\alpha$ to $\alpha_{list}$
13: $i = 2$
14: for each $\alpha$ in $\alpha_{list}$ do
15:   Append $(\alpha - \log(\delta))/(i - 1)$ to $\epsilon_{list}$
16:   $i = i + 1$
17: $\epsilon = \min(\epsilon_{list})$
18: return $\epsilon$

Thus, in order to bound the mechanism overall, we need to bound each $\alpha_{M_i}(\lambda)$ and sum them.

• Tail Bound: For any $\epsilon > 0$, the mechanism $M$ is $(\epsilon, \delta)$-differentially private for

\[ \delta = \min_{\lambda} \exp\big(\alpha_M(\lambda) - \lambda\epsilon\big). \]

The tail bound converts the moments bound to the $(\epsilon, \delta)$-differential privacy guarantee and gives us a way to compute $\epsilon$, given a fixed $\delta$, as:

\[ \epsilon = \min_{\lambda} \frac{\alpha_M(\lambda) - \log \delta}{\lambda}. \]

Assuming that the training is done over multiple epochs, we can fix the sampling ratio $q = L/N$, where $N$ is the number of data points in the dataset, and $L$ is the number of data points within a lot. Then, for a Gaussian Mechanism with random sampling, it suffices to compute the probability density functions of $\mathcal{N}(0, \sigma^2)$ and $\mathcal{N}(1, \sigma^2)$, denoted as $\mu_0$ and $\mu_1$ respectively. If $\mu = (1 - q)\mu_0 + q\mu_1$, we have $\alpha(\lambda) = \log \max(E_1, E_2)$, where:

\[ E_1 = \mathbb{E}_{z \sim \mu_0}\big[\big(\mu_0(z)/\mu(z)\big)^{\lambda}\big] \]

\[ E_2 = \mathbb{E}_{z \sim \mu}\big[\big(\mu(z)/\mu_0(z)\big)^{\lambda}\big] \]
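For intuition, the per-step moment above can be evaluated numerically. The following Python sketch (assuming NumPy and SciPy are available) integrates $E_1$ and $E_2$ directly from the definitions of $\mu_0$, $\mu_1$ and $\mu$, composes the per-step moments over the number of steps, and applies the tail bound. It is an illustration of the method, not the implementation used in [2] or in our experiments.

# Numerical sketch of the moments accountant for the sampled Gaussian mechanism.
import numpy as np
from scipy import integrate
from scipy.stats import norm

def log_moment_per_step(q, sigma, lam):
    mu0 = lambda z: norm.pdf(z, loc=0.0, scale=sigma)
    mu1 = lambda z: norm.pdf(z, loc=1.0, scale=sigma)
    mu = lambda z: (1.0 - q) * mu0(z) + q * mu1(z)
    # E1 = E_{z~mu0}[(mu0/mu)^lam], E2 = E_{z~mu}[(mu/mu0)^lam]
    e1, _ = integrate.quad(lambda z: mu0(z) * (mu0(z) / mu(z)) ** lam, -np.inf, np.inf)
    e2, _ = integrate.quad(lambda z: mu(z) * (mu(z) / mu0(z)) ** lam, -np.inf, np.inf)
    return np.log(max(e1, e2))

def compute_epsilon(q, sigma, delta, steps, max_order=32):
    # Composability: sum the per-step moments; then apply the tail bound.
    eps_candidates = []
    for lam in range(1, max_order + 1):
        alpha = steps * log_moment_per_step(q, sigma, lam)
        eps_candidates.append((alpha - np.log(delta)) / lam)
    return min(eps_candidates)

# Example: 20 epochs with q = 0.01 (i.e. 2,000 steps), sigma = 4, delta = 1e-5.
print(compute_epsilon(q=0.01, sigma=4.0, delta=1e-5, steps=2000))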

4 A Discussion of the Moments Accountant Method

In Section 3.2, we have defined the moments accountant method proposed in [2]. Here, we discuss and study its effect on the privacy budget, aiming to provide a better understanding of how all variables used in the moments accountant influence the overall privacy budget of the dataset.

The steps for computing the value of $\epsilon$, given $\delta$, the sampling ratio $q$, the noise multiplier $\sigma$, the number of epochs $ep$, and the number of orders for the moments accountant $\lambda$, are shown in Algorithm 2. We evaluate the effect of different parameters on the minimum value of $\epsilon$ by varying the value of one of the parameters, and keeping all the others fixed, for different dataset sizes. For each of the cases, we also illustrate the values of 32 individual moments, similar to the analysis done in [2]. Please note that Abadi et al.'s original analysis aimed to find the best possible $(\epsilon, \delta)$ for each of the datasets tested when testing accuracy on different datasets, whereas we aim to provide a better understanding of the privacy budget limitations when using this method for different dataset sizes.

Figure 1: The minimum value of $\epsilon$, calculated by the moments accountant, for different dataset sizes, with $\delta = \frac{1}{10 \cdot n}$, where $n$ is the dataset size, for 20 training epochs, with sampling ratio $q = 0.01$.

Effect of the noise multiplier. We begin by evaluating the moments accountant with respect to $\sigma$, the noise multiplier. We set $\delta = \frac{1}{10 \cdot n}$, where $n$ represents the size of the different datasets tested, set the sampling ratio $q = 0.01$, and evaluate over 20 training epochs. From Figure 1, we observe that with $\sigma = 2$, the minimum required value for the privacy budget is outside the so-called privacy regime ($\epsilon < 1$) even for the smallest dataset tested ($n = 500$). However, when looking at higher values of $\sigma$ we can observe that the minimum value for the privacy budget decreases up to the case $\sigma = 20$, after which the epsilon values remain close to each other, even for very high values of sigma ($\sigma = 10^7$). Figure 2 illustrates the corresponding values of the moments $\alpha$. The correlation between the values of $\alpha$ and the minimum values of $\epsilon$ is remarkable, as $\alpha$ quickly increases for $\sigma = 2$, while for higher values of $\sigma$ there is little to no variation between the values of $\alpha$.

Effect of $\delta$. We then evaluate the effect of $\delta$ on the overall privacy budget. Recall from Section 3.1 that $\delta$ allows for a small probability of failure in the differential privacy definition, i.e., with a probability of $\delta$ a certain output of a query might be given if an individual is present in the dataset. As each individual record in the dataset has this probability of failure, this will happen, on average, $\delta \cdot n$ times. Hence $\delta \cdot n$ needs to be small: so as not to put any users at risk, $\delta$ needs to be chosen based on the size of the dataset.

Figure 2: The values of the moments accountant $\alpha_{M_i}(\lambda)$, for the minimum values of $\epsilon$, with a noise multiplier of $\sigma = 20$.

Figure 3: The minimum value of $\epsilon$, calculated by the moments accountant, for different dataset sizes, where $n$ is the dataset size, for 20 training epochs, with the sampling ratio $q = 0.01$, and the noise multiplier $\sigma = 20$.

Given that $\epsilon$ is computed as a function of $\delta$ in the moments accountant, the privacy budget is highly dependent on the size of the dataset. In this setting, we evaluate 20 training epochs, with a noise multiplier $\sigma = 20$ and a sampling ratio $q = 0.01$. From Figure 3, we observe that the minimum value for $\epsilon$ increases with lower values of $\delta$. We also illustrate the values of the moments $\alpha$ corresponding to the minimum values of $\epsilon$ over 32 moments in Figure 4. These values are independent of the dataset size, hence independent of $\delta$.

Effect of Number of Epochs. We also studied the effect of the number of epochs on the privacy budget. However, even though there is a small variation in the privacy budget for different numbers of epochs, the effect of this parameter is not as prominent as that of the other parameters presented in this section.

Effect of Sampling Ratio. Finally, we look at the effects of different sampling ratios ($q$) on the overall privacy budget. In Figure 5, we keep the $\delta$ parameter fixed at $\frac{1}{10 \cdot n}$ and the noise multiplier at $\sigma = 20$, and plot the minimum value of $\epsilon$ when varying the sampling ratio $q$. Here we see that the increase in the minimum value of epsilon is correlated with an increase in the sampling ratio $q$. The value of the sampling ratio also affects the value of the moments accountant, as is clear from Figure 6. For a small sampling ratio (i.e., $q = 0.01$ to $q = 0.1$), we find that the values of the moments $\alpha$ are close to 0; however, with increasing batch size, these values increase quickly, making the minimum value of the privacy parameter $\epsilon$ close to 1 even for small dataset sizes ($n \le 10^3$).

Figure 4: The values of the moments accountant $\alpha_{M_i}(\lambda)$, for the minimum values of $\epsilon$, with a noise multiplier $\sigma = 20$ and sampling ratio $q = 0.01$.

Figure 5: The minimum value of $\epsilon$, calculated by the moments accountant, for different dataset sizes, with $\delta = \frac{1}{10 \cdot n}$, where $n$ is the dataset size, for 20 training epochs, with noise multiplier $\sigma = 20$.

Remarks. Overall, our analysis shows that differential privacy is not a purely "out-of-the-box" tool, but a complex process to satisfy a definition where different parameters affect the overall privacy budget. Therefore, before applying differential privacy to a dataset, one has to be aware of the trade-offs between parameters in practice as well as the needs of the system with respect to privacy.


Figure 6: The values of the moments accountant $\alpha_{M_i}(\lambda)$, for the minimum values of $\epsilon$, with a noise multiplier of 20.

5 State-of-the-Art Approaches for Privacy-Preserving Data Synthesis

In this section, we describe the state-of-the-art approaches for privacy-preserving data synthesis, which we evaluate later in Section 6.

5.1 PrivBayes

PrivBayes [33] is a solution for releasing a high-dimensional dataset $D$ in an $\epsilon$-differentially private manner. It involves three phases:

1. A $k$-degree Bayesian network $N$ is built over the attributes in $D$ using an $\frac{\epsilon}{2}$-differentially private method ($k$ is a small value, chosen automatically by PrivBayes).

2. An $\frac{\epsilon}{2}$-differentially private algorithm is used to generate a set of conditional distributions of $D$, such that for each attribute-parent (AP) pair $(X_i, \Pi_i)$ in $N$, we have a noisy version of the conditional distribution $\Pr[X_i \mid \Pi_i]$.

3. The Bayesian network $N$ and the $d$ noisy conditional distributions are used to derive an approximate distribution from which a synthetic dataset $D^*$ is generated.

The model is first constructed in a non-private manner, using standard notions from information theory to construct the $k$-degree Bayesian network $N$ on a dataset $D$ containing a set $A$ of attributes. The mutual information between two variables is defined as:

\[ I(X, \Pi) = \sum_{x \in \mathrm{dom}(X)} \sum_{\pi \in \mathrm{dom}(\Pi)} \Pr[X = x, \Pi = \pi] \log_2 \frac{\Pr[X = x, \Pi = \pi]}{\Pr[X = x]\Pr[\Pi = \pi]}, \]

where $\Pr[X, \Pi]$ is the joint distribution of $X$ and $\Pi$, and $\Pr[X]$ and $\Pr[\Pi]$ are the marginal distributions of $X$ and $\Pi$.
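As a concrete reference for the quantity above, the following short Python sketch estimates $I(X, \Pi)$ from two columns of discrete values using their empirical distributions (an illustration only, not part of the PrivBayes implementation).

# Empirical mutual information between two discrete attributes (in bits).
from collections import Counter
import math

def mutual_information(x_col, pi_col):
    n = len(x_col)
    joint = Counter(zip(x_col, pi_col))
    px = Counter(x_col)
    ppi = Counter(pi_col)
    mi = 0.0
    for (x, pi), count in joint.items():
        p_joint = count / n
        mi += p_joint * math.log2(p_joint / ((px[x] / n) * (ppi[pi] / n)))
    return mi

# Example with two toy columns (perfectly correlated, so I = 1 bit).
print(mutual_information(["a", "a", "b", "b"], [0, 0, 1, 1]))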

The proposed non-private algorithm "GreedyBayes" (see Algorithm 3) extends the Chow-Liu algorithm [9] to higher values of $k$. The Bayesian network is built by greedily picking the next edge based on the maximum mutual information. First, the network $N$ is initialized to an empty set of AP pairs. The set of attributes whose parents were fixed in previous steps ($V$) is also initialized to an empty set. Then an attribute is randomly chosen, and its parent set is set to the empty set. The AP pair obtained is added to $N$ and the attribute to $V$. For each of the next $d - 1$ steps, an AP pair is added to $N$ based on the mutual information, i.e., the edge which maximizes $I$ is selected. Each AP pair is chosen such that i) the size of the parent set is less than or equal to $k$, and ii) $N$ contains no edge from the attribute at the current step to any of the attributes added at previous steps.

Algorithm 3 Greedy Bayes

1: Initialize $N = \emptyset$ and $V = \emptyset$
2: Randomly select an attribute $X_1$ from $A$; add $(X_1, \emptyset)$ to $N$; add $X_1$ to $V$
3: for $i = 2$ to $d$ do
4:   Initialize $\Omega = \emptyset$
5:   for each $X \in A \setminus V$ and each $\Pi \in \binom{V}{k}$ do
6:     $\Omega = \Omega \cup \{(X, \Pi)\}$
7:   Select a pair $(X_i, \Pi_i)$ from $\Omega$ with maximal mutual information $I(X_i, \Pi_i)$
8:   Add $(X_i, \Pi_i)$ to $N$; add $X_i$ to $V$
return $N$

However, both $I$ and the best edge are data-dependent, so GreedyBayes is adapted to be differentially private. Each AP pair $(X, \Pi) \in \Omega$ is inspected and the mutual information between $X$ and $\Pi$ is calculated. After that, an AP pair is sampled from $\Omega$ such that the sampling probability of any pair $(X, \Pi)$ is proportional to $\exp\!\big(\frac{I(X, \Pi)}{2\Delta}\big)$, where $\Delta$ is the scaling factor. In order for the Bayesian network to satisfy $\frac{\epsilon}{2}$-differential privacy, $\Delta$ is set to

\[ \Delta = \frac{2(d-1)S(I)}{\epsilon}, \]

where $S_1(I)$ denotes the $L_1$ sensitivity of $I$. For the mutual information $I$, we have:

\[ S_1(I(X, \Pi)) = \begin{cases} \frac{1}{n}\log_2 n + \frac{n-1}{n}\log_2 \frac{n}{n-1}, & \text{if } X \text{ or } \Pi \text{ is binary} \\[4pt] \frac{2}{n}\log_2 \frac{n+1}{2} + \frac{n-1}{n}\log_2 \frac{n+1}{n-1}, & \text{otherwise} \end{cases} \]

However, since $S(I) > \frac{\log_2 n}{n}$, the sensitivity can be quite large compared to the range of $I$. Hence, the authors propose the use of a novel function $F$ that maps each AP pair $(X, \Pi)$ to a score, such that i) $F$'s sensitivity is small (with respect to the range of $F$), and ii) if $F(X, \Pi)$ is large then $I(X, \Pi)$ tends to be large.
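The private selection step above is an instance of the exponential mechanism; a minimal Python sketch of sampling an AP pair with probability proportional to $\exp(I/(2\Delta))$ is given below. Function and variable names are our own, not PrivBayes code.

# Sample an AP pair with probability proportional to exp(score / (2 * Delta)).
import numpy as np

def sample_ap_pair(candidates, scores, delta_scale, rng=None):
    # candidates: list of (X, Pi) pairs; scores: their mutual information I;
    # delta_scale: the scaling factor Delta = 2(d-1)S(I)/epsilon.
    rng = rng or np.random.default_rng()
    logits = np.asarray(scores, dtype=float) / (2.0 * delta_scale)
    logits -= logits.max()              # subtract max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return candidates[rng.choice(len(candidates), p=probs)]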

In order to define $F$, the maximum joint distribution is defined first. Given an AP pair $(X, \Pi)$, a maximum joint distribution $\Pr^\star[X, \Pi]$ for $X$ and $\Pi$ is one that maximizes the mutual information between $X$ and $\Pi$. In other words, if $|\mathrm{dom}(X)| \le |\mathrm{dom}(\Pi)|$, $\Pr^\star[X, \Pi]$ is a maximum joint distribution if and only if i) $\Pr^\star[X = x] = \frac{1}{|\mathrm{dom}(X)|}$ for all $x \in \mathrm{dom}(X)$, and ii) for all $\pi \in \mathrm{dom}(\Pi)$, there is at most one $x \in \mathrm{dom}(X)$ with $\Pr^\star[X = x, \Pi = \pi] > 0$.


Let $(X, \Pi)$ be an AP pair and $\mathcal{P}^\star[X, \Pi]$ be the set of all maximum joint distributions for $X$ and $\Pi$. Then $F$ is defined as:

\[ F(X, \Pi) = \frac{1}{2} \min_{\Pr^\star \in \mathcal{P}^\star} \big\| \Pr[X, \Pi] - \Pr^\star[X, \Pi] \big\|_1 \]

The function $F$ defined this way has a smaller sensitivity than $I$, namely $S(F) = \frac{1}{n}$. In order to give a computation of $F$, let $(X, \Pi)$ be an AP pair and $|\Pi| = k$. Then, the joint distribution $\Pr[X, \Pi]$ can be represented as a $2 \times 2^k$ matrix, where all elements sum up to 1. In order to identify the minimum $L_1$ distance between $\Pr[X, \Pi]$ and a maximum joint distribution $\Pr^\star[X, \Pi]$, the distributions in $\mathcal{P}^\star$ are partitioned into a number of equivalence classes and then $F(X, \Pi)$ is computed by processing each equivalence class individually.

In a nutshell, PrivBayes is an $\epsilon$-differentially private method for high-dimensional data using a Bayesian network. The network is used to model the dependencies between the attributes of the data. The distribution of the data is approximated using low-dimensional marginals and then noise is added to satisfy differential privacy. Finally, the samples are drawn from the differentially private approximate distribution for release. One of the considered drawbacks of this approach is the addition of too much noise during the network construction, which might make the approximation of the data distribution inaccurate.

The implementation of this method is available from https://github.com/JiaMingLin/privbayes.

5.2 Approaches based on Generative Models and the Moments Accountant

5.2.1 Priv-VAE

In [4], Acs et al. present a technique for privately releasing generative models so that they can be used to generate an arbitrary number of synthetic datasets. It relies on generative neural networks to model the data distribution for various kinds of data, such as images or transit information. As the training procedure is differentially private with respect to the training data, any information derived from the generative models is also differentially private (by the post-processing property of differential privacy), including any dataset produced from them.

More concretely, the non-private version of the proposal works as follows: the dataset is partitioned into $k$ clusters $\hat{D}_1, \hat{D}_2, \ldots, \hat{D}_k$, which are used to train $k$ distinct generative models, where the parameters of the resulting models are denoted $\theta_1, \ldots, \theta_k$. The data samples within a cluster are similar. $\theta_1, \ldots, \theta_k$ are learned using gradient descent. The generative models are then released, so that they can be used by third parties to generate synthetic data.

Note that both the clustering step and the training of the generative models inspect the data, and thus should be made differentially private. The clustering is performed using a differentially private kernel k-means algorithm, which first transforms the data into a low-dimensional representation using the randomized Fourier feature map; standard differentially private k-means is then applied on these low-dimensional features. To select the number of clusters $k$, Acs et al. rely on dimensionality reduction algorithms (e.g., t-SNE [21]). Then, to train a generative model for each cluster, they use the differentially private gradient descent procedure from Algorithm 1. In particular, the authors experiment with Restricted Boltzmann Machines (RBMs) and Variational Autoencoders (VAEs). They evaluate the clustering accuracy of their algorithm on the MNIST dataset (for which they use $k = 10$) and their model on both a Call Data Records and a transit dataset, showing that the average relative error of their model outperforms MWEM [17].
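For illustration, the non-private core of this clustering pipeline can be sketched with scikit-learn as follows: random Fourier features approximate the kernel, and k-means is run on the resulting low-dimensional representation. The private variant would add calibrated noise inside the k-means updates; the function and parameter values here are assumptions for the example, not the authors' implementation.

# Non-private sketch of the kernel k-means pipeline: random Fourier features
# followed by k-means on the resulting low-dimensional representation.
from sklearn.kernel_approximation import RBFSampler
from sklearn.cluster import KMeans

def cluster_with_fourier_features(X, n_clusters, n_components=100, seed=0):
    features = RBFSampler(n_components=n_components,
                          random_state=seed).fit_transform(X)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return km.fit_predict(features)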

Finally, note that the implementation of this approach is available upon request from the authors of [4].

5.2.2 DP-SYN

Abay et al. [3] present a framework combining a private convolutional neural network with a private version of the iterative expectation maximization algorithm, DP-EM [25]. The dataset is first partitioned according to every instance's label, and then an $(\frac{\epsilon}{2}, \frac{\delta}{2})$-differentially private autoencoder is built for each label group. The private latent representation is then injected into an $(\frac{\epsilon}{2}, \frac{\delta}{2})$-differentially private expectation maximization function (DP-EM [25]). The DP-EM function detects different latent patterns in the encoded data and generates output data with similar patterns, which is then decoded to obtain the synthetic data.

DP-SYN uses a different approach to partitioning compared to Priv-VAE. The former partitions the data according to each label, and the model is then trained on the labeled partitions, while the latter does so randomly and uses differentially private k-means clustering to cluster the data samples.

The authors include an extensive experimental evaluation on nine datasets for binary classification tasks, comparing their results with four state-of-the-art techniques, including PrivBayes [33] and Priv-VAE [4]. All the experiments are run for 10 rounds, and only the best results for each algorithm are recorded.

DP-SYN's implementation was originally available from https://github.com/ncabay/synthetic generation, but as of March 2019 it is no longer available.

5.3 Synthesis of Location Traces (Syn-Loc)

Finally, we look at the work by Bindschaedler and Shokri [7], which aims to generate fake, yet semantically plausible, privacy-preserving location traces.

5.3.1 Plausible Deniability

Bindschaedler et al. [8] formalize the concept of plausible deniability in the context of data synthesis, as a new privacy notion for releasing privacy-preserving datasets. More precisely, it is presented as a formal privacy guarantee such that an adversary (with no access to background knowledge) cannot deduce that a particular datapoint within a dataset was "more responsible" for a synthetic output than a collection of other datapoints.


More formally, plausible deniability is defined as follows. For any dataset $D$ with $|D| \ge k$, and any record $y$ generated by a probabilistic generative model $M$ such that $y = M(d_1)$ for $d_1 \in D$, we say that $y$ is releasable with $(k, \gamma)$-plausible deniability if there exist at least $k - 1$ distinct records $d_2, \ldots, d_k \in D \setminus \{d_1\}$ such that:

\[ \gamma^{-1} \le \frac{\Pr[y = M(d_i)]}{\Pr[y = M(d_j)]} \le \gamma, \]

for any $i, j \in \{1, 2, \ldots, k\}$. In short, a synthetic record provides plausible deniability if that synthetic record could have been generated by a number of different real data points.
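A minimal sketch of how such a test can be operationalized is shown below, under the assumption that the generation probabilities $\Pr[y = M(d)]$ are available for every record. It compares each record's probability against the seed's, which is a slight simplification of the pairwise condition in the definition above.

# Simplified (k, gamma)-plausible deniability check for one synthetic record y.
def passes_plausible_deniability(prob_y_given_record, seed_index, k, gamma):
    # prob_y_given_record[i] = Pr[y = M(d_i)] for every record d_i in D.
    p_seed = prob_y_given_record[seed_index]
    alternatives = 0
    for i, p in enumerate(prob_y_given_record):
        if i == seed_index or p <= 0:
            continue
        ratio = p_seed / p
        if 1.0 / gamma <= ratio <= gamma:
            alternatives += 1
    # The record is releasable only if at least k - 1 other records could
    # plausibly have generated it.
    return alternatives >= k - 1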

Plausible deniability has been shown to satisfy differential privacy guarantees if randomness is added to the threshold $k$. In fact, the authors show that if Laplace noise $\mathrm{Lap}(\frac{1}{\epsilon_0})$ is added to $k$, then their system satisfies $(\epsilon, \delta)$-differential privacy, with $\epsilon = \epsilon_0 + \log \gamma^t$ and $\delta = e^{-\epsilon_0 (k - t)}$, for any integer $t$ with $1 \le t \le k$. One of the main differences between plausible deniability and differential privacy for generative models is the lack of noise added to the generated data.

5.3.2 The Data Synthesis Algorithm

The paper first introduces two mobility metrics that capture how realistic a synthetic location trace is with respect to geographic and semantic dimensions of human mobility. It then constructs a probabilistic generative model that produces synthetic, yet plausible, traces. The model is built using a dataset of real locations used as seeds, referred to as the seed dataset. For each seed in the seed dataset, a probabilistic mobility model is computed, representing the visiting probability of each location and the transition probabilities between locations.

Generating the synthetic traces starts by transforming a real trace (taken as seed) into a semantic trace. The equivalent of semantic classes is created using the k-means clustering algorithm, where the number of clusters is chosen such that it optimizes the clustering objective. The seed is then converted to a semantic seed by replacing each location in the trace with all its semantically equivalent locations. Then, some randomness is injected into the semantic seed. Any random walk on the semantic seed trace that travels through the available locations at each time instant is a valid location trace that is semantically similar to the seed trace. Then, the semantic trace is decoded into a geographic trace in order to generate traces that are plausible according to aggregate mobility models. To generate multiple traces for each seed, the Viterbi algorithm with added randomness in the trace reconstruction is used.

Each of the generated traces is tested to ensure statistical dissimilarity and plausible deniability. Statistical dissimilarity enforces a maximum bound both on the similarity between a fake trace and the seed from which it was generated, and on the intersection of all fake traces generated from a specific seed. Plausible deniability means that, for any fake trace generated from a seed, there are at least a number of alternative seeds that could have generated it.

The most notable difference between this model and the other models presented in this section is that it is a seed-based model: it takes a data record as a seed and generates synthetic data from that data record. In contrast, the other models generate synthetic data based on the properties of the original data as a whole.

The protocol was evaluated on the Nokia Lausanne Dataset [26], containing a combination of GPS coordinates, WLAN and GSM identifiers for users, with the location being reported every 20 minutes. The implementation of the protocol is available from https://vbinds.ch/node/70.

6 Experimental Evaluation

6.1 Datasets & Tasks

We use several datasets for our experimental evaluation, which we group into four categories. For each of the datasets, we also consider different tasks, following common experiments used in the literature.

1. Financial data. We use two of the datasets from the UCI Machine Learning Repository [12], namely: i) the German Credit dataset, with anonymized information of 1,000 customers, having 20 features and classifying customers as having good or bad credit risk; and ii) the Adult dataset, with information from 45,222 individuals, extracted from the 1994 US census, with 15 features, indicating whether the income of an individual exceeds 50,000 US dollars.

2. Images. We use the MNIST dataset [20], a public image dataset, which includes 28×28-pixel images of handwritten digits, containing 70,000 samples. The main classification task usually performed on this dataset is to correctly classify the handwritten digit in the image.

3. Cyber threat logs. We rely on the DShield [1] data collected over 10 days, with approximately 5M entries collected each day. Each entry contains the ID, date, source IP, source port, target port, and target IP of an attack, whenever an alert has been raised by the firewall. This dataset has often been used in the context of predictive blacklisting [30], i.e., forecasting which IPs will attack a target.

4. Location data. We use the San Francisco cabs dataset [26], containing mobility traces recorded by San Francisco taxis. This has often been used to predict next locations, identify points of interest, etc. [28].

6.2 Experimental Setup and Objectives

We aim to evaluate the performance of the synthetic datasets for different tasks to give a concrete overview of their usability in practice. Different datasets have different statistical requirements, hence we aim to provide an extensive analysis covering multiple bases. For each of the datasets, we test multiple values of $\epsilon$, while keeping $\delta$ fixed at $\frac{1}{10 \cdot n}$, allowing us to analyze the quality of the synthetic data under different levels of noise.


Classification tasks. We evaluate the synthetic data by training a discriminative model on it and evaluating its accuracy for classification on real test data. This should highlight whether the quality of the original data is preserved when using synthetic data. For this task, we compare our results to two baselines: 1) the original data, i.e., we evaluate how well the discriminative model performs when trained on the synthetic data as opposed to being trained on the original dataset; and 2) a model that randomly assigns a prediction based on the original distribution of the data. Any model that reports an accuracy lower than this baseline is considered unsuitable for any classification task. In our evaluation, we will refer to the former as "Original Data" and to the latter as "Lower Baseline." We use this methodology to evaluate the financial datasets and the cyber threat logs, using an SVM classifier.

Linear regression. We also use the synthetic data generated by each of the models and train a linear regression model on it. We evaluate the resulting model on real testing data, vis-a-vis the accuracy of the predictions made. Similar to the classification tasks, we use the "Lower Baseline" and "Original Data" baselines. We do this for the image data and the German Credit dataset.

Prediction. Inspired by the work of Melis et al. [23] in the context of collaborative predictive blacklisting, we aim to analyze the performance of synthetic data in forecasting future attack sources for the cyber threat logs dataset. We use Melis et al.'s implementation from https://github.com/mex2meou/collsec.git to evaluate the synthetic data obtained from each of the models under their k-nearest neighbors approach. We evaluate the performance of the synthetic data by looking at the true positive rate for predicting future attacks.

Clustering. For location data, we evaluate whether the synthetic data preserves the properties of the original data, specifically by extracting the points of interest and comparing the results to the points of interest from the original dataset.

By comparison, the NIST challenge has similar criteria: the submitted algorithms must be able to preserve the balance of utility and privacy for regression, classification, and clustering, but also for evaluation when the research question is unknown.
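The classification protocol above (train on synthetic data, test on real data, and compare against the two baselines) can be sketched with scikit-learn as follows; the dataset variables and parameter choices are placeholders, not our exact experimental configuration.

# Sketch of the "train on synthetic, test on real" evaluation with baselines.
from sklearn.svm import SVC
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

def evaluate(X_syn, y_syn, X_train_real, y_train_real, X_test_real, y_test_real):
    # Model trained on synthetic data, evaluated on real test data.
    acc_syn = accuracy_score(
        y_test_real, SVC().fit(X_syn, y_syn).predict(X_test_real))
    # "Original Data" baseline: same model trained on the real training data.
    acc_orig = accuracy_score(
        y_test_real, SVC().fit(X_train_real, y_train_real).predict(X_test_real))
    # "Lower Baseline": random predictions following the class distribution.
    dummy = DummyClassifier(strategy="stratified", random_state=0)
    acc_lower = accuracy_score(
        y_test_real, dummy.fit(X_train_real, y_train_real).predict(X_test_real))
    return {"synthetic": acc_syn, "original": acc_orig, "lower": acc_lower}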

6.3 Financial Data

German Credit. In order to evaluate the categorical attributes of the German Credit dataset, we use the numeric encoding provided in [12]. We split the dataset into 70% training data and 30% testing data.

First, we train an SVM model both on the synthetic data generated by each of the algorithms and on the original data, and report the accuracy results of the classification in Figure 7. The accuracy of the models improves with less noise added to the model (i.e., higher values of $\epsilon$), and, among the tested models, DP-SYN obtains the highest accuracy. Priv-VAE has accuracy close to DP-SYN; however, when constructing the confusion matrices for Priv-VAE, we see that it fails to classify the German Credit dataset even in less noisy settings, and classifies most datapoints within one class (in this case it classifies most customers as having a bad credit score). For $\epsilon \le 0.5$, the synthetic data obtained from PrivBayes reports lower accuracy than our lower baseline.

Figure 7: SVM accuracy results for the German Credit dataset with $\delta$ fixed at $\frac{1}{10 \cdot n}$, for varying values of $\epsilon$.

Figure 8: Linear regression accuracy results for the German Credit dataset with $\delta$ fixed at $\frac{1}{10 \cdot n}$, for varying values of $\epsilon$.

In Figure 7, we also report a setting with $\epsilon = \infty$. This illustrates the models' accuracy with very little or no privacy. When the models generate synthetic data under this setting, the accuracy obtained is close to the accuracy of the SVM model trained on the original data.

Second, we tested a logistic regression model on the dataset, and observed the accuracy. From Figure 8, we see that this model fails to correctly classify even the original data. For the synthetic data, it in fact reports a better accuracy than the SVM model; however, when looking at the confusion matrices, we notice that it fails to classify the dataset, reporting most datapoints as being within one class.

Adult Dataset. We first encode the Adult dataset, using One-Hot Encoding [18], to convert the categorical attributes of the dataset into numerical attributes. The encoding obtained has 57 attributes as opposed to 13 attributes in the original dataset. We split the dataset into 70% training data and 30% testing data.
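A minimal pandas sketch of this preprocessing step is shown below; the file name and column selection are illustrative assumptions rather than our exact pipeline.

# One-hot encode the categorical attributes of the Adult dataset and split 70/30.
import pandas as pd

adult = pd.read_csv("adult.csv")                      # illustrative file name
categorical_cols = adult.select_dtypes(include="object").columns
encoded = pd.get_dummies(adult, columns=categorical_cols)

train = encoded.sample(frac=0.7, random_state=0)      # 70% training data
test = encoded.drop(train.index)                      # remaining 30% for testing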

We train an SVM model on the synthetic data obtained from each of the models and present the results in Figure 9. Note that PrivBayes has better accuracy than DP-SYN at higher noise levels ($\epsilon \le 1$). Additionally, when $\epsilon$ is greater than 2, the accuracy of DP-SYN decreases, which is perhaps counter-intuitive, as accuracy is expected to increase when less noise is added to the model. Priv-VAE fails to cluster the data in this test case, often returning empty clusters during the differentially private k-means clustering. The accuracy of Priv-VAE, even though higher than the lower baseline used, is lower than the accuracy of the other two tested models for all tested values of $\epsilon$.

Figure 9: SVM accuracy results for the Adult dataset with $\delta$ fixed at $\frac{1}{10 \cdot n}$, for varying values of $\epsilon$.

To observe whether an increase in accuracy can be correlated with increasing dataset size, we also train the SVM model on partial data, while keeping the same test dataset. In Figure 10, we plot the accuracy for $\epsilon = 0.9$ for DP-SYN (the minimum accepted value of $\epsilon$ for this dataset, when $\delta = \frac{1}{10 \cdot n}$), and $\epsilon = 0.8$ for PrivBayes. Even though in this case PrivBayes has more noise added to the model, it reports better accuracy than DP-SYN for all partial datasets tested. The increase in accuracy for PrivBayes with larger dataset sizes is easily observed, from approximately 0.75 accuracy when 10% of the original data was used for generating the synthetic data to approximately 0.8 when 90% of the data was used. For DP-SYN, the same correlation cannot be observed, and in fact the highest accuracy (0.75) is reported when 80% of the data was used for training.

In Figure 11, we increase the value of $\epsilon$ to 1.2. We can observe that PrivBayes reports better accuracy when less than 40% of the original data was used for generating the synthetic data, and DP-SYN outperforms PrivBayes with increasing dataset size. In contrast to Figure 10, in this case we observe an improvement in accuracy for both models with increasing dataset sizes.

In Figure 12, we report the accuracy for increasing dataset sizes with ε = 3.2. In this case DP-SYN outperforms PrivBayes; however, neither model's improvement in accuracy can be correlated with increasing dataset size.

6.4 Images

We then evaluate the models on the MNIST dataset. We use the usual split for training and testing purposes, i.e., 60,000 samples for training and 10,000 samples for testing.

Figure 10: SVM accuracy results for training models on a percentage of the original datasets. For DP-SYN, we have ε = 0.9 (the minimum value of ε for this dataset, when δ = 1/(10·n)), and for PrivBayes ε = 0.8.

Figure 11: SVM accuracy results for training models on a percentage of the original datasets, with ε = 1.2.

We generate synthetic data with all three methods and then construct a linear regression model on the synthetic data to evaluate the accuracy of each of the models.

For each value of ε tested, we reconstruct the average image for every class. When reconstructing the classes for PrivBayes, we observe that the model does not correctly reconstruct separate class images. As shown in Figure 13, all the classes look similar, even in the lowest privacy case (i.e., ε = 10^7). Therefore, we split the data into separate classes and train on each class separately to generate the synthetic data.
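The average-class reconstruction and the per-class split can be sketched as follows, assuming NumPy arrays of flattened 28×28 images; the function and variable names are illustrative.

    import numpy as np

    def average_class_images(images, labels, n_classes=10):
        # `images`: array of shape (n, 784) or (n, 28, 28); `labels`: array of shape (n,).
        images = np.asarray(images).reshape(len(images), 28, 28)
        labels = np.asarray(labels)
        return np.stack([images[labels == c].mean(axis=0) for c in range(n_classes)])

    # Splitting before training, i.e. fitting one generative model per digit class:
    # per_class = {c: X_train[y_train == c] for c in range(10)}

The same averaging is applied to the original data, so the synthetic per-class means can be compared visually against the real ones.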

In Figure 14, after splitting, we start to distinguish between different classes as the value of ε increases. For DP-SYN (see Figure 15), the reconstruction of each digit class is much clearer even for small ε, and close to the average class reconstruction of the original data. Priv-VAE (see Figure 16) outputs a less noisy reconstruction than PrivBayes, but not as clear as DP-SYN. Finally, in Figure 17, we report the accuracy of the linear regression models. As expected from the reconstructed samples, the accuracy of PrivBayes is very low for the noisier samples. Even in the less noisy cases, where accuracy improves, it does not match the accuracy of the linear regression


Figure 12: SVM accuracy results for training models on a percentage of the original datasets, with ε = 3.2.

Figure 13: Average class reconstruction for PrivBayes, with no splitting before training.

model trained on the original data. Similarly, the accuracy of the synthetic samples generated by Priv-VAE is correlated with the average class reconstruction and, in fact, is better than that of PrivBayes as the value of ε increases. DP-SYN obtains a higher accuracy than both other models; however, it still reports lower accuracy than that of the original dataset.

6.5 Cyber Threat Logs

First, we try a classification task for this dataset. We construct the dataset for this task by using the DShield logs, labeling the existing log entries as threats, and adding an equal amount of non-threat traffic to the dataset. After randomizing, we split the data into 70% training and 30% testing. We train a discriminative model on the synthetic data from all models; however, none of them achieves an accuracy better than the lower baseline.
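As a minimal sketch of this dataset construction, assuming pandas and scikit-learn and interpreting the non-threat traffic as an equal number of benign entries; the DataFrame names dshield_logs and benign_traffic are placeholders.

    import pandas as pd
    from sklearn.model_selection import train_test_split

    threats = dshield_logs.assign(label=1)                                # hypothetical DataFrames
    benign = benign_traffic.sample(n=len(threats), random_state=0).assign(label=0)
    data = pd.concat([threats, benign]).sample(frac=1.0, random_state=0)  # shuffle the rows

    train, test = train_test_split(data, test_size=0.3, random_state=0)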

Our second approach is to use this dataset for both training and testing: we use a 5-day sliding window to train the models and obtain synthetic data, and we use the real data from the following day for testing. The aim is for a model trained on the synthetic data to generate new samples that can then be used to predict the data found in the testing dataset.

Figure 14: Average class reconstruction for PrivBayes, with splitting before training.

Figure 15: Average class reconstruction for DP-SYN.


When using the synthetic dataset obtained from PrivBayes to evaluate prediction with the k-NN approach from [23], it fails to produce any samples that correlate with the testing data. Hence, no meaningful predictions can be obtained from this synthetic dataset, regardless of the noise level. DP-SYN manages to predict some of the future attacks, but reports a lower true positive rate (less than 50%) than the original data. Even for this dataset, DP-SYN does not perform consistently better with less noise added to the model.
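A minimal sketch of this sliding-window evaluation, assuming pandas and a hypothetical DataFrame `logs` with a date column; the prediction step itself is only indicated as a comment.

    import pandas as pd

    logs["date"] = pd.to_datetime(logs["date"])     # `logs` is a hypothetical DataFrame of log entries
    window = pd.Timedelta(days=5)

    for test_day in sorted(logs["date"].unique())[5:]:
        train_mask = (logs["date"] >= test_day - window) & (logs["date"] < test_day)
        train_logs = logs[train_mask]               # 5-day window: fit the generative model here
        test_logs = logs[logs["date"] == test_day]  # real logs of the following day
        # generate synthetic samples from train_logs, then predict the entries of
        # test_logs, e.g. with the k-NN-based approach of [23]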

6.6 Location Data

We generate the synthetic datasets for each of the models and plot the distribution of locations (Figure 18). DP-SYN fails to provide a meaningful distribution of locations, placing all locations on the same point on the map. The synthetic data generated by PrivBayes, even though


Figure 16: Average class reconstruction for Priv-VAE.

Figure 17: Linear regression accuracy results for the MNIST dataset with δ fixed at 1/(10·n), for varying values of ε.

more scattered on the map than that of DP-SYN, still fails to mimic the distribution of the original data. The synthetic data generated by Syn-Loc has a distribution across the map more similar to the original data. This is due to the more specialized model used for data generation.

We cluster the data using k-means clustering to extract the points of interest on the map. We plot the cluster distribution on the map for k = 10 in Figure 19. As expected, the synthetic data generated by DP-SYN is grouped within a single cluster. The synthetic data from PrivBayes does not produce empty clusters, but the distribution of clusters on the map does not resemble that of the original dataset. The data generated by Syn-Loc provides the clustering most similar to the original dataset. In Figure 20, we plot the distribution of clusters for k = 20. In this case, not only is the data from DP-SYN grouped within one cluster, but clustering the synthetic data from PrivBayes also returns empty clusters.

Figure 18: Distribution of locations for the San Francisco Cabs dataset.

Figure 19: 10 clusters of locations for the San Francisco Cabs dataset.

Again, Syn-Loc provides the clustering most similar to the original dataset.
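As a minimal sketch of this clustering step, assuming scikit-learn, one can fit k-means to the (latitude, longitude) pairs of each dataset and compare cluster centres and sizes; the array names real_coords and syn_coords are illustrative.

    import numpy as np
    from sklearn.cluster import KMeans

    def cluster_locations(coords, k):
        # `coords`: array of shape (n, 2) with latitude/longitude pairs.
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(coords)
        sizes = np.bincount(km.labels_, minlength=k)
        return km.cluster_centers_, sizes

    centers_real, sizes_real = cluster_locations(real_coords, k=10)
    centers_syn, sizes_syn = cluster_locations(syn_coords, k=10)

Comparing the spread of the returned centres and the per-cluster sizes gives a quick quantitative counterpart to the visual comparison in Figures 19 and 20.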

7 Discussion

The main goal of our project was to understand how to evaluate privacy-friendly synthetic data generation techniques. In other words, we set out to determine whether or not it is possible to privately synthesize data whose characteristics yield findings that translate to the original data. In short, the answer to this question depends on the task at hand, the size of the dataset, and the privacy requirements of the synthetic dataset.

Our analysis also showed that one of the most important notions to be thoroughly understood before attempting to generate differentially private synthetic datasets is the moments accountant [2]. This is a complex accounting method for keeping track of the privacy budget, in which each parameter can greatly influence the quality of the synthetic data.
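To illustrate the kind of accounting involved, the sketch below composes Rényi-DP bounds for the Gaussian mechanism over a number of noisy steps and converts them to an (ε, δ) guarantee. It is only a rough stand-in, not the moments accountant of [2] itself: it ignores the privacy amplification due to subsampling that the accountant exploits, so it gives only a loose upper bound on ε, and the function name and default order range are illustrative.

    import numpy as np

    def epsilon_upper_bound(noise_multiplier, steps, delta, orders=np.arange(2, 128)):
        # Per-step Renyi-DP of the Gaussian mechanism (sensitivity 1): alpha / (2 sigma^2).
        # Compose over `steps` releases, then convert to (epsilon, delta) via
        # epsilon = rdp + log(1/delta) / (alpha - 1), minimised over the orders.
        rdp = steps * orders / (2.0 * noise_multiplier ** 2)
        eps = rdp + np.log(1.0 / delta) / (orders - 1)
        return float(eps.min())

Even this simplified bound makes the interplay visible: the final ε grows with the number of training steps and shrinks with the noise multiplier, which is exactly the trade-off that governs the quality of the synthetic data.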

We evaluated three of the privacy-preserving synthetic data generation models on binary classification tasks on two financial datasets. The first model, PrivBayes [33], reported accuracy lower than randomly assigning


Figure 20: 20 clusters of locations for the San Francisco Cabs dataset.

labels based on the original distribution of the data when trained on noisy synthetic data (i.e., ε ≤ 0.5) for the German Credit dataset. However, when the same task is performed on a larger dataset, accuracy is better even for noisier synthetic datasets. Moreover, even on smaller samples of the Adult dataset, PrivBayes still performs better than the lower baseline and reports better accuracy than the other models for noisier datasets. By contrast, DP-SYN yields better performance overall on the German Credit dataset, but lower accuracy on the larger dataset for noisy synthetic data (ε < 1). In fact, its average performance actually decreases for some of the less noisy synthetic datasets (ε > 2). Priv-VAE generates synthetic data that reports testing accuracy comparable to DP-SYN but, on closer inspection, classifies most datapoints within one label.

The same models were then evaluated on the MNIST dataset, on a multi-class classification task. For this dataset, the synthetic data obtained from PrivBayes performs poorly when evaluated on classification tasks. In fact, even from the synthetic-data reconstruction of the average sample image for each class, it is easy to observe that the resulting images contain a significant amount of noise. Moreover, we found that splitting the data into separate classes before training the model is necessary in order to be able to distinguish between the different classes. The synthetic data from DP-SYN yields better reconstructed samples; however, the minimum privacy budget for this dataset is ε = 0.9 for δ = 1/(10·n), therefore not allowing too much noise to be added to the model.

For the cyber threat logs, none of the models tested performed well on the two tasks. The discriminative models trained on the synthetic datasets for classifying datapoints as threats or as benign all report a lower accuracy than the lower baseline. For the prediction task, none of the models achieves a meaningful true positive rate for predicting future attacks. We think this is because the attack vectors, based on IP and port, are quite sensitive to noise perturbations, and therefore noisy data can be unsuitable for such tasks.

For location data, the model that returns the distribution most similar to the original data is Syn-Loc. This is an expected outcome, because of the more specialized model it uses. DP-SYN fails to generate meaningful synthetic data, placing all synthetic data points on a single location. We believe its performance is worse on the location dataset than on the other datasets because DP-SYN splits the training data by class and generates synthetic samples for each class separately, whereas no such split is possible in this case. PrivBayes manages to generate synthetic samples with a better distribution than DP-SYN, although not as good as Syn-Loc.

In conclusion, our experimental evaluation suggests that a generic approach able to generate universally meaningful synthetic datasets might not be viable, owing to the complexity and varied nature of the datasets used in the wild. It remains for the data provider to decide whether the loss in utility associated with privacy-preserving synthetic data meets its needs. Overall, there is no “best” model among the ones we tested, as they all perform differently on different datasets. From our evaluation, we can conclude that DP-SYN can be used for image reconstruction, provided that training can be done per class, as it reconstructed the images in the MNIST dataset close to the original samples. PrivBayes provides high accuracy for binary classification tasks on large datasets when a large noise perturbation is needed. Syn-Loc simulates real location data distributions better than the other models, due to its focus on location datasets. Nevertheless, privacy-preserving synthetic data, even though not perfect, offers a better alternative than anonymization techniques, which fail to provide useful privacy guarantees, or than directly applying differential privacy to the data, which can greatly affect its utility.

Overall, the most encouraging results correspond to image and financial data: for settings with good privacy guarantees (i.e., where the epsilon parameter of differential privacy is less than 1 and the delta parameter is an order of magnitude smaller than the inverse of the size of the dataset), the evaluated approaches led to synthetic training datasets on which very basic models for prediction and classification incurred a 5-8% accuracy loss with respect to training on the non-private original data.

These results are encouraging and leave open several avenues for further work that could lead to better trade-offs between privacy and utility: one can (i) consider synthesis procedures that are more specialized to concrete types of data, (ii) evaluate under more powerful domain-specific models, and (iii) consider larger datasets, a realistic direction given that the datasets covered in our evaluation are of moderate size.

References

[1] Internet Storm Center | DShield. http://www.dshield.org.

[2] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang. Deep Learning with Differential Privacy. In ACM CCS, 2016.

[3] N. C. Abay, Y. Zhou, M. Kantarcioglu, B. Thuraisingham, and L. Sweeney. Privacy preserving synthetic data release using deep learning. In ECML PKDD, 2018.

[4] G. Acs, L. Melis, C. Castelluccia, and E. De Cristofaro. Differentially private mixture of generative neural networks. IEEE Transactions on Knowledge and Data Engineering, 2018.

[5] C. C. Aggarwal. On k-anonymity and the curse of dimensionality. In VLDB, 2005.

[6] M. Archie, S. Gershon, A. Katcoff, and A. Zeng. De-anonymization of Netflix Reviews using Amazon Reviews. https://courses.csail.mit.edu/6.857/2018/project/Archie-Gershon-Katchoff-Zeng-Netflix.pdf.

[7] V. Bindschaedler and R. Shokri. Synthesizing plausible privacy-preserving location traces. In IEEE Symposium on Security and Privacy, 2016.

[8] V. Bindschaedler, R. Shokri, and C. A. Gunter. Plausible deniability for privacy-preserving data synthesis. VLDB, 2017.

[9] C. Chow and C. Liu. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 1968.

[10] N. R. Council et al. Putting people on the map: Protecting confidentiality with linked social-spatial data. National Academies Press, 2007.

[11] F. K. Dankar and K. El Emam. The Application of Differential Privacy to Health Data. In Joint EDBT/ICDT Workshops, 2012.

[12] D. Dheeru and E. Karra Taniskidou. UCI Machine Learning Repository. http://mlr.cs.umass.edu/ml/datasets.html, 2017.

[13] I. Dinur and K. Nissim. Revealing information while preserving privacy. In ACM PODS, 2003.

[14] C. Dwork and K. Nissim. Privacy-preserving datamining on vertically partitioned databases. In IACR CRYPTO, 2004.

[15] C. Dwork and A. Roth. The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science, 9, 2013.

[16] M. Gymrek, A. L. McGuire, D. Golan, E. Halperin, and Y. Erlich. Identifying personal genomes by surname inference. Science, 2013.

[17] M. Hardt, K. Ligett, and F. McSherry. A simple and practical algorithm for differentially private data release. In NeurIPS, 2012.

[18] D. Harris and S. Harris. Digital Design and Computer Architecture. Morgan Kaufmann, 2010.

[19] N. Homer, S. Szelinger, M. Redman, D. Duggan, W. Tembe, J. Muehling, J. V. Pearson, D. A. Stephan, S. F. Nelson, and D. W. Craig. Resolving Individuals Contributing Trace Amounts of DNA to Highly Complex Mixtures Using High-Density SNP Genotyping Microarrays. PLoS Genetics, 2008.

[20] Y. LeCun, C. Cortes, and C. Burges. MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist/.

[21] L. v. d. Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 2008.

[22] B. Marr. Forbes – How Much Data Do We Create Every Day? https://bit.ly/2Iz6Hbv, 2018.

[23] L. Melis, A. Pyrgelis, and E. De Cristofaro. On collaborative predictive blacklisting. ACM SIGCOMM CCR, 2019.

[24] K. Nissim, T. Steinke, A. Wood, M. Altman, A. Bembenek, M. Bun, M. Gaboardi, D. R. O'Brien, and S. Vadhan. Differential privacy: A primer for a non-technical audience. In Privacy Law Scholars Conference, 2017.

[25] M. Park, J. Foulds, K. Chaudhuri, and M. Welling. DP-EM: Differentially Private Expectation Maximization. arXiv preprint arXiv:1605.06995, 2016.

[26] M. Piorkowski, N. Sarafijanovic-Djukic, and M. Grossglauser. CRAWDAD dataset epfl/mobility (v. 2009-02-24). https://crawdad.org/epfl/mobility/20090224/.

[27] R. A. Popa, A. J. Blumberg, H. Balakrishnan, and F. H. Li. Privacy and accountability for location-based aggregate statistics. In ACM CCS, 2011.

[28] A. Pyrgelis, C. Troncoso, and E. De Cristofaro. Knock Knock, Who's There? Membership Inference on Aggregate Location Data. In NDSS, 2018.

[29] D. B. Rubin. Statistical disclosure limitation. Journal of Official Statistics, 1993.

[30] F. Soldo, A. Le, and A. Markopoulou. Predictive blacklisting as an implicit recommendation system. In INFOCOM, 2010.

[31] B. Vendetti. 2018 Differential Privacy Synthetic Data Challenge. https://www.nist.gov/ctl/pscr/funding-opportunities/prizes-challenges/2018-differential-privacy-synthetic-data-challenge.

[32] R. Wang, Y. F. Li, X. Wang, H. Tang, and X. Zhou. Learning Your Identity and Disease from Research Papers: Information Leaks in Genome Wide Association Study. In ACM CCS, 2009.

[33] J. Zhang, G. Cormode, C. M. Procopiuc, D. Srivastava, and X. Xiao. PrivBayes: Private Data Release via Bayesian Networks. ACM Transactions on Database Systems, 2017.
