Randomization in Privacy Preserving Data Mining
Agrawal, R., and Srikant, R. Privacy-Preserving Data Mining, ACM SIGMOD’00
The following slides include material from this paper.
Privacy-Preserving Data Mining
• Problem: How do we publish data without compromising individual privacy?
• Solution: randomization, anonymization
Randomization
• Adding random noise to the original dataset
• Challenge – is the perturbed data still useful for further analysis?
Randomization
• Model: data is distorted by adding random noise
• Original data X = {x1, ..., xN}; for each record xi ∈ X, a random value yi from Y = {y1, ..., yN} is added, so the new data is Z = {z1, ..., zN} with zi = xi + yi
• yi is a random value drawn from
– Uniform distribution over [-α, +α]
– Gaussian distribution N(0, σ²)
(a minimal sketch of this model follows below)
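A minimal sketch of the additive model, assuming NumPy; the sample data and the α and σ values are illustrative choices, not taken from the paper.

```python
# Additive randomization sketch: z_i = x_i + y_i with uniform or
# Gaussian noise. Values below are illustrative, not from the paper.
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(30, 5, size=10_000)          # original values, e.g. ages

alpha = 10.0                                # uniform noise half-width
y_uniform = rng.uniform(-alpha, alpha, size=x.shape)

sigma = 4.0                                 # gaussian noise std. dev.
y_gauss = rng.normal(0.0, sigma, size=x.shape)

z_uniform = x + y_uniform                   # perturbed data, uniform noise
z_gauss = x + y_gauss                       # perturbed data, gaussian noise
```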
Reconstruction
• Perturbed data hides the original data distribution, which must be reconstructed before data mining
• Given
– the perturbed values x1+y1, x2+y2, ..., xN+yN
– the probability distribution of Y
• Estimate the probability distribution of X
Reconstruction algorithm (Clifton, AusDM '11):
1. f_X^0 = uniform distribution; j = 0
2. Repeat the Bayesian update
   f_X^(j+1)(a) = (1/N) Σi [ f_Y(zi − a) · f_X^j(a) / ∫ f_Y(zi − t) · f_X^j(t) dt ],  j = j + 1
   until a stopping criterion is met
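A sketch of the update above, discretized on a grid; it assumes the noise density f_Y is known (Gaussian here, via SciPy), and the grid resolution, iteration cap, and stopping tolerance are illustrative choices.

```python
# Iterative Bayesian reconstruction of the density of X from z = x + y,
# discretized on a grid. Assumes the noise density f_Y is known.
import numpy as np
from scipy.stats import norm

def reconstruct(z, noise_pdf, grid, n_iter=50, tol=1e-6):
    """Estimate the density of X on `grid` from perturbed values z."""
    f = np.full(grid.shape, 1.0 / (grid[-1] - grid[0]))  # f_X^0: uniform
    dt = grid[1] - grid[0]
    # Precompute f_Y(z_i - a) for every sample i and grid point a.
    fy = noise_pdf(z[:, None] - grid[None, :])           # shape (N, |grid|)
    for _ in range(n_iter):
        numer = fy * f                                   # f_Y(z_i - a) f^j(a)
        denom = numer.sum(axis=1, keepdims=True) * dt    # ∫ f_Y(z_i - t) f^j(t) dt
        f_new = (numer / denom).mean(axis=0)             # average over samples
        f_new /= f_new.sum() * dt                        # renormalize density
        if np.abs(f_new - f).max() < tol:                # stopping criterion
            return f_new
        f = f_new
    return f

# Example: recover a bimodal X from Gaussian-perturbed observations.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 0.5, 5000), rng.normal(2, 0.5, 5000)])
z = x + rng.normal(0, 1.0, size=x.shape)
grid = np.linspace(-6, 6, 200)
f_hat = reconstruct(z, norm(scale=1.0).pdf, grid)
```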
Reconstruction
• Bayes' rule is used to estimate the cumulative distribution function (and from it, the density) of X
[Figure: original, randomized, and reconstructed distributions produced by the reconstruction algorithm, for Gaussian noise N(0, 0.25) and uniform noise over (-0.5, 0.5)]
Privacy Metric
• If a value x can be estimated to lie in the interval [a, b] with c% confidence, then the width (b − a) defines the amount of privacy at c% confidence
• Example: ages lie in 20–40 (range 20); for 95% confidence and 50% privacy (interval width 20 × 0.5 = 10), uniform noise needs 0.95 × 2α = 10, i.e. 2α = 20 × 0.5 / 0.95 ≈ 10.5
Confidence   50%          95%          99.9%
Uniform      0.5 × 2α     0.95 × 2α    0.999 × 2α
Gaussian     1.34 × σ     3.92 × σ     6.8 × σ
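A short sketch of where the table's entries come from: for uniform noise on [-α, +α] the c-confidence interval has width c × 2α, and for Gaussian noise it has width 2 · Φ⁻¹((1 + c)/2) · σ. The function names below are illustrative.

```python
# Privacy-interval widths behind the table above (standard
# confidence-interval widths; function names are illustrative).
from scipy.stats import norm

def uniform_privacy(alpha, c):
    """Width of the interval containing x with confidence c when the
    noise is uniform on [-alpha, +alpha]: c * 2 * alpha."""
    return c * 2 * alpha

def gaussian_privacy(sigma, c):
    """Width of the central interval of N(0, sigma^2) with mass c:
    2 * Phi^{-1}((1 + c) / 2) * sigma."""
    return 2 * norm.ppf((1 + c) / 2) * sigma

for c in (0.5, 0.95, 0.999):
    print(c, uniform_privacy(1.0, c), gaussian_privacy(1.0, c))
# e.g. c = 0.5 gives 0.5 * 2α and ≈ 1.34σ; c = 0.95 gives 0.95 * 2α
# and ≈ 3.92σ, matching the table.
```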
Decision Tree
Training a Decision Tree
• Split points – interval boundaries of the reconstructed distribution
• Reconstruction algorithms
– Global: reconstruct once over the entire training data
– ByClass: reconstruct separately for each class before building the tree
– Local: reconstruct separately for each class at every tree node
• Dataset – synthetic, with a training set of 100,000 records and a test set of 5,000 records, equally split into two classes (a baseline comparison is sketched below)
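A hedged sketch of the original-vs-randomized baseline using scikit-learn; it is not the paper's reconstruction-based training, and the synthetic generator, noise scale, and tree depth are illustrative.

```python
# Baseline comparison: decision tree accuracy on original vs. randomized
# training data. This illustrates the accuracy gap that the Global /
# ByClass / Local reconstruction variants aim to close; it is NOT the
# paper's reconstruction-based training procedure.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=105_000, n_features=5,
                           n_informative=3, n_classes=2, random_state=0)
X_train, y_train = X[:100_000], y[:100_000]   # 100,000 training records
X_test, y_test = X[100_000:], y[100_000:]     # 5,000 test records

rng = np.random.default_rng(0)
X_noisy = X_train + rng.normal(0, 1.0, size=X_train.shape)  # additive noise

for name, data in [("original", X_train), ("randomized", X_noisy)]:
    tree = DecisionTreeClassifier(max_depth=8, random_state=0)
    tree.fit(data, y_train)
    print(name, tree.score(X_test, y_test))
```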
[Figure: decision tree accuracy for the Original, Randomized, Global, ByClass, and Local variants]
Extended Work
• '02 – proposed a method to quantify information loss, based on mutual information
• '07 – evaluated randomization against adversaries who combine the perturbed data with public information
– Gaussian noise is better than uniform
– datasets with an inherent cluster pattern improve randomization performance
– varying density and outliers decrease performance
Multiplicative Randomization
• Rotation randomization – data is distorted by multiplying it with an orthogonal matrix
• Projection randomization – the high-dimensional dataset is projected into a low-dimensional space
• Both preserve Euclidean distances (exactly under rotation, approximately under projection), so they can be applied with distance-based classification (k-NN, SVM) and clustering (k-means); a sketch follows below
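A minimal sketch of both multiplicative schemes, assuming NumPy/SciPy: a random orthogonal matrix (obtained from a QR decomposition) preserves pairwise Euclidean distances exactly, while a random Gaussian projection preserves them only approximately. All sizes and parameters are illustrative.

```python
# Multiplicative randomization sketch: rotation preserves Euclidean
# distances exactly; random projection preserves them approximately.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))               # 100 records, 50 attributes

# Rotation: random orthogonal matrix via QR decomposition.
Q, _ = np.linalg.qr(rng.normal(size=(50, 50)))
X_rot = X @ Q
print(np.allclose(pdist(X), pdist(X_rot)))   # True: distances unchanged

# Projection: random Gaussian map into k < 50 dimensions.
k = 20
R = rng.normal(size=(50, k)) / np.sqrt(k)
X_proj = X @ R
# After projection, distances are only approximately preserved.
ratio = pdist(X_proj) / pdist(X)
print(ratio.mean(), ratio.std())
```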
Summary
• Pros: noise is generated independently of the data, can be applied at data-collection time, useful for stream data
• Cons: information loss, curse of dimensionality
Questions?