Randomization in Privacy Preserving Data Mining
Agrawal, R., and Srikant, R. Privacy-Preserving Data Mining, ACM SIGMOD’00
The following slides include material from this paper.
Privacy-Preserving Data Mining
• Problem: How do we publish data without compromising individual privacy?
• Solution: randomization, anonymization
Randomization
• Adding random noise to the original dataset
• Challenge – is the perturbed data still useful for further analysis?
Randomization
• Model: data is distorted by adding random noise
• Original data X = {x1, ..., xN}; for each record xi ∈ X, a random value yi from Y = {y1, ..., yN} is added, so the new data is Z = {z1, ..., zN} with zi = xi + yi
• yi is a random value drawn from
– Uniform distribution over [-α, +α]
– Gaussian distribution N(0, σ²)
(a minimal sketch of this model follows below)
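A minimal sketch of the additive model, assuming NumPy; the sample data and the α and σ values are illustrative choices, not taken from the paper.

```python
# Additive randomization sketch: z_i = x_i + y_i with uniform or
# Gaussian noise. Values below are illustrative, not from the paper.
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(30, 5, size=10_000)          # original values, e.g. ages

alpha = 10.0                                # uniform noise half-width
y_uniform = rng.uniform(-alpha, alpha, size=x.shape)

sigma = 4.0                                 # gaussian noise std. dev.
y_gauss = rng.normal(0.0, sigma, size=x.shape)

z_uniform = x + y_uniform                   # perturbed data, uniform noise
z_gauss = x + y_gauss                       # perturbed data, gaussian noise
```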
Reconstruction
• Perturbed data hides the original data distribution, which must be reconstructed before data mining
• Given
– the perturbed values x1+y1, x2+y2, ..., xN+yN
– the probability distribution of Y
• Estimate the probability distribution of X
Reconstruction algorithm (Clifton, AusDM '11):
1. f_X^0 = uniform distribution; j = 0
2. Repeat the Bayesian update
   f_X^(j+1)(a) = (1/N) Σi [ f_Y(zi − a) · f_X^j(a) / ∫ f_Y(zi − t) · f_X^j(t) dt ],  j = j + 1
   until a stopping criterion is met
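A sketch of the update above, discretized on a grid; it assumes the noise density f_Y is known (Gaussian here, via SciPy), and the grid resolution, iteration cap, and stopping tolerance are illustrative choices.

```python
# Iterative Bayesian reconstruction of the density of X from z = x + y,
# discretized on a grid. Assumes the noise density f_Y is known.
import numpy as np
from scipy.stats import norm

def reconstruct(z, noise_pdf, grid, n_iter=50, tol=1e-6):
    """Estimate the density of X on `grid` from perturbed values z."""
    f = np.full(grid.shape, 1.0 / (grid[-1] - grid[0]))  # f_X^0: uniform
    dt = grid[1] - grid[0]
    # Precompute f_Y(z_i - a) for every sample i and grid point a.
    fy = noise_pdf(z[:, None] - grid[None, :])           # shape (N, |grid|)
    for _ in range(n_iter):
        numer = fy * f                                   # f_Y(z_i - a) f^j(a)
        denom = numer.sum(axis=1, keepdims=True) * dt    # ∫ f_Y(z_i - t) f^j(t) dt
        f_new = (numer / denom).mean(axis=0)             # average over samples
        f_new /= f_new.sum() * dt                        # renormalize density
        if np.abs(f_new - f).max() < tol:                # stopping criterion
            return f_new
        f = f_new
    return f

# Example: recover a bimodal X from Gaussian-perturbed observations.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 0.5, 5000), rng.normal(2, 0.5, 5000)])
z = x + rng.normal(0, 1.0, size=x.shape)
grid = np.linspace(-6, 6, 200)
f_hat = reconstruct(z, norm(scale=1.0).pdf, grid)
```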
Reconstruction
• Bayes' rule is used to estimate the cumulative distribution function (and from it, the density) of X
[Figure: original, randomized, and reconstructed distributions produced by the reconstruction algorithm, for Gaussian noise N(0, 0.25) and uniform noise over (-0.5, 0.5)]
Privacy Metric
• If a value x can be estimated to lie in the interval [a, b] with c% confidence, then the width (b − a) defines the amount of privacy at c% confidence
• Example: ages lie in 20–40 (range 20); for 95% confidence and 50% privacy (interval width 20 × 0.5 = 10), uniform noise needs 0.95 × 2α = 10, i.e. 2α = 20 × 0.5 / 0.95 ≈ 10.5
Confidence   50%          95%          99.9%
Uniform      0.5 × 2α     0.95 × 2α    0.999 × 2α
Gaussian     1.34 × σ     3.92 × σ     6.8 × σ
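A short sketch of where the table's entries come from: for uniform noise on [-α, +α] the c-confidence interval has width c × 2α, and for Gaussian noise it has width 2 · Φ⁻¹((1 + c)/2) · σ. The function names below are illustrative.

```python
# Privacy-interval widths behind the table above (standard
# confidence-interval widths; function names are illustrative).
from scipy.stats import norm

def uniform_privacy(alpha, c):
    """Width of the interval containing x with confidence c when the
    noise is uniform on [-alpha, +alpha]: c * 2 * alpha."""
    return c * 2 * alpha

def gaussian_privacy(sigma, c):
    """Width of the central interval of N(0, sigma^2) with mass c:
    2 * Phi^{-1}((1 + c) / 2) * sigma."""
    return 2 * norm.ppf((1 + c) / 2) * sigma

for c in (0.5, 0.95, 0.999):
    print(c, uniform_privacy(1.0, c), gaussian_privacy(1.0, c))
# e.g. c = 0.5 gives 0.5 * 2α and ≈ 1.34σ; c = 0.95 gives 0.95 * 2α
# and ≈ 3.92σ, matching the table.
```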
Decision Tree
Training a Decision Tree
• Split points – interval boundaries of the reconstructed distribution
• Reconstruction algorithms
– Global: reconstruct once over the entire training data
– ByClass: reconstruct separately for each class before building the tree
– Local: reconstruct separately for each class at every tree node
• Dataset – synthetic, with a training set of 100,000 records and a test set of 5,000 records, equally split into two classes (a baseline comparison is sketched below)
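A hedged sketch of the original-vs-randomized baseline using scikit-learn; it is not the paper's reconstruction-based training, and the synthetic generator, noise scale, and tree depth are illustrative.

```python
# Baseline comparison: decision tree accuracy on original vs. randomized
# training data. This illustrates the accuracy gap that the Global /
# ByClass / Local reconstruction variants aim to close; it is NOT the
# paper's reconstruction-based training procedure.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=105_000, n_features=5,
                           n_informative=3, n_classes=2, random_state=0)
X_train, y_train = X[:100_000], y[:100_000]   # 100,000 training records
X_test, y_test = X[100_000:], y[100_000:]     # 5,000 test records

rng = np.random.default_rng(0)
X_noisy = X_train + rng.normal(0, 1.0, size=X_train.shape)  # additive noise

for name, data in [("original", X_train), ("randomized", X_noisy)]:
    tree = DecisionTreeClassifier(max_depth=8, random_state=0)
    tree.fit(data, y_train)
    print(name, tree.score(X_test, y_test))
```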
[Figure: decision tree accuracy for the Original, Randomized, Global, ByClass, and Local variants]
Extended Work
• '02 – proposed a method to quantify information loss, based on mutual information
• '07 – evaluated randomization against adversaries who combine the perturbed data with public information
– Gaussian noise is better than uniform
– datasets with an inherent cluster pattern improve randomization performance
– varying density and outliers decrease performance
Multiplicative Randomization
• Rotation randomization – data is distorted by multiplying it with an orthogonal matrix
• Projection randomization – the high-dimensional dataset is projected into a low-dimensional space
• Both preserve Euclidean distances (exactly under rotation, approximately under projection), so they can be applied with distance-based classification (k-NN, SVM) and clustering (k-means); a sketch follows below
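A minimal sketch of both multiplicative schemes, assuming NumPy/SciPy: a random orthogonal matrix (obtained from a QR decomposition) preserves pairwise Euclidean distances exactly, while a random Gaussian projection preserves them only approximately. All sizes and parameters are illustrative.

```python
# Multiplicative randomization sketch: rotation preserves Euclidean
# distances exactly; random projection preserves them approximately.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))               # 100 records, 50 attributes

# Rotation: random orthogonal matrix via QR decomposition.
Q, _ = np.linalg.qr(rng.normal(size=(50, 50)))
X_rot = X @ Q
print(np.allclose(pdist(X), pdist(X_rot)))   # True: distances unchanged

# Projection: random Gaussian map into k < 50 dimensions.
k = 20
R = rng.normal(size=(50, k)) / np.sqrt(k)
X_proj = X @ R
# After projection, distances are only approximately preserved.
ratio = pdist(X_proj) / pdist(X)
print(ratio.mean(), ratio.std())
```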
Summary
• Pros: noise is generated independently of the data, can be applied at data-collection time, useful for stream data
• Cons: information loss, curse of dimensionality
Questions?