Introduction to Data Mining - Emory Universitylxiong/cs570_s11/share/slides/18... · 2011-04-03 ·...


Page 1: Introduction to Data Mining - Emory Universitylxiong/cs570_s11/share/slides/18... · 2011-04-03 · If a data set D contains examples from nclasses, gini index, gini(D) is defined

Introduction to Data Mining

Privacy preserving data mining

4/3/2011 1

Privacy preserving data mining

Li Xiong

Slides credits:

Chris Clifton

Agrawal and Srikant

Page 2:

Privacy Preserving Data Mining

• Privacy concerns about personal data
• AOL query log release
• Netflix challenge
• Data scraping

Page 3:

A race to the bottom: privacy ranking of Internet service companies

• A study done by Privacy International into the privacy practices of key Internet-based companies
• Amazon, AOL, Apple, BBC, eBay, Facebook, Friendster, Google, LinkedIn, LiveJournal, Microsoft, MySpace, Skype, Wikipedia, LiveSpace, Yahoo!, YouTube

Page 4:

A Race to the Bottom: Methodologies

• Corporate administrative details
• Data collection and processing
• Data retention
• Openness and transparency
• Customer and user control
• Privacy enhancing innovations and privacy invasive innovations

Page 5:

A race to the bottom: interim results revealed

Page 6:

Why Google

• Retains a large quantity of information about users, often for an unstated or indefinite length of time, without clear limitation on subsequent use or disclosure
• Maintains records of all search strings with associated IP and time stamps for at least 18-24 months
• Additional personal information from user profiles in Orkut
• Uses an advanced profiling system for ads

Page 7:

Remember, they are always watching …

Page 8:

Some advice from privacy campaigners …

• Use cash when you can.
• Do not give your phone number, social-security number or address, unless you absolutely have to.
• Do not fill in questionnaires or respond to telemarketers.
• Demand that credit and data-marketing firms produce all information they have on you, correct errors and remove you from marketing lists.
• Check your medical records often.
• Block caller ID on your phone, and keep your number unlisted.
• Never leave your mobile phone on; your movements can be traced.
• Do not use store credit or discount cards.
• If you must use the Internet, encrypt your e-mail, reject all “cookies” and never give your real name when registering at websites.
• Better still, use somebody else’s computer.

Page 9:

Privacy-Preserving Data Mining

• Data obfuscation (non-interactive model)
  Original Data → Anonymization → “Sanitized” Data → Miner
• Output perturbation (interactive model)
  Original Data → Access Interface → “Perturbed” Results → Miner

Page 10:

Classes of Solutions

• Methods
  • Input obfuscation
    • Perturbation
    • Generalization
  • Output perturbation
    • Differential privacy
• Metrics
  • Privacy vs. Utility

Page 11:

Data Perturbation

• Data randomization
• Randomization (additive noise)
• Geometric perturbation and projection (multiplicative noise)
• Randomized response technique (categorical data)

Page 12:

Randomization Based Decision Tree Learning (Agrawal and Srikant ’00)

• Basic idea: perturb data with value distortion
  • User provides xi + r instead of xi
  • r is a random value
    • Uniform: uniform distribution over [-α, α]
    • Gaussian: normal distribution with µ = 0 and standard deviation σ
• Hypothesis
  • Miner doesn’t see the real data or can’t reconstruct real values
  • Miner can reconstruct enough information to identify patterns
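As a rough illustration of value distortion, the sketch below adds uniform or Gaussian noise to each value before it leaves the user. The function names, the sample ages, and the noise parameters are hypothetical choices for the example, not values from the slides.

```python
import random

def perturb_uniform(values, alpha, rng):
    """Return x_i + r with r drawn uniformly from [-alpha, alpha]."""
    return [x + rng.uniform(-alpha, alpha) for x in values]

def perturb_gaussian(values, sigma, rng):
    """Return x_i + r with r drawn from a normal distribution N(0, sigma^2)."""
    return [x + rng.gauss(0.0, sigma) for x in values]

rng = random.Random(0)           # fixed seed so the sketch is repeatable
ages = [23, 30, 41, 52, 65]      # hypothetical original values
noisy_uniform = perturb_uniform(ages, alpha=20, rng=rng)
noisy_gaussian = perturb_gaussian(ages, sigma=10, rng=rng)
```

Only the noisy lists would be handed to the miner; the noise distribution itself (uniform with α, or Gaussian with σ) is assumed to be public.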

Page 13:

Classification using Randomization Data

[Figure: original records (50 | 40K | ..., 30 | 70K | ...) each pass through a Randomizer, giving randomized records (65 | 20K | ..., 25 | 60K | ...) that feed a Classification Algorithm to produce a Model, marked “?”. Annotation: a random number is added to Age, so Alice’s age 30 becomes 65 (30+35).]

Page 14:

Output: A Decision Tree for “buys_computer”

[Figure: decision tree. Root node: age? with branches <=30, 31..40, >40. Internal nodes: student? (branches no, yes) and credit rating? (branches fair, excellent). Leaves: yes / no.]

February 12, 2008 Data Mining: Concepts and Techniques 14

Page 15:

Attribute Selection Measure: Gini index (CART)

• If a data set D contains examples from n classes, the gini index gini(D) is defined as

  $$\mathrm{gini}(D) = 1 - \sum_{j=1}^{n} p_j^2$$

  where p_j is the relative frequency of class j in D
• If a data set D is split on A into two subsets D1 and D2, the gini index gini_A(D) is defined as

  $$\mathrm{gini}_A(D) = \frac{|D_1|}{|D|}\,\mathrm{gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{gini}(D_2)$$

• Reduction in impurity:

  $$\Delta\mathrm{gini}(A) = \mathrm{gini}(D) - \mathrm{gini}_A(D)$$

• The attribute that provides the smallest gini_A(D) (or the largest reduction in impurity) is chosen to split the node
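The three quantities on this slide (the gini index of a set, the index of a binary split, and the reduction in impurity) can be computed directly from class labels. This is a minimal sketch; the function names and the label lists are hypothetical.

```python
from collections import Counter

def gini(labels):
    """1 minus the sum of squared relative class frequencies."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_split(left, right):
    """Size-weighted gini index of a binary split into left and right."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def gini_reduction(left, right):
    """Impurity of the unsplit set minus the impurity of the split."""
    return gini(left + right) - gini_split(left, right)

# Hypothetical node with 9 "yes" and 5 "no" examples.
node_impurity = gini(["yes"] * 9 + ["no"] * 5)
# A split that separates the classes perfectly has split index 0.
perfect = gini_split(["yes"] * 3, ["no"] * 3)
```

A tree builder would evaluate `gini_split` for every candidate split and keep the one with the smallest value.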

Page 16:

Randomization Approach Overview

[Figure: original records (50 | 40K | ..., 30 | 70K | ...) each pass through a Randomizer, giving randomized records (65 | 20K | ..., 25 | 60K | ...). The distributions of Age and of Salary are reconstructed from the randomized records and feed a Classification Algorithm that produces a Model. Annotation: a random number is added to Age, so Alice’s age 30 becomes 65 (30+35).]

Page 17:

Original Distribution Reconstruction

• x1, x2, …, xn are the n original data values
  • Drawn from n iid random variables with distribution X
• Using value distortion,
  • The given values are w1 = x1 + y1, w2 = x2 + y2, …, wn = xn + yn
  • yi’s are from n iid random variables with distribution Y
• Reconstruction problem:
  • Given FY and wi’s, estimate FX

Page 18:

Original Distribution Reconstruction: Method

• Bayes’ theorem for continuous distribution
• The estimated density function (minimum mean square error estimator):

  $$f'_X(a) = \frac{1}{n} \sum_{i=1}^{n} \frac{f_Y(w_i - a)\, f_X(a)}{\int_{-\infty}^{\infty} f_Y(w_i - z)\, f_X(z)\, dz}$$

• Iterative estimation
  • The initial estimate for f_X at j = 0: uniform distribution
  • Update step:

  $$f_X^{j+1}(a) = \frac{1}{n} \sum_{i=1}^{n} \frac{f_Y(w_i - a)\, f_X^{j}(a)}{\int_{-\infty}^{\infty} f_Y(w_i - z)\, f_X^{j}(z)\, dz}$$

  • Stopping criterion: difference between successive iterations is small
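The iterative estimate can be tried numerically by replacing the integral with a sum over an equally spaced grid. The sketch below is an illustration under assumptions that are not in the slides: uniform noise on [-α, α], a hypothetical two-cluster age sample, and hypothetical grid, iteration limit, and tolerance.

```python
import random

def uniform_density(alpha):
    """Density of the noise Y when Y is uniform on [-alpha, alpha]."""
    def f_y(y):
        return 1.0 / (2 * alpha) if -alpha <= y <= alpha else 0.0
    return f_y

def reconstruct(w, f_y, grid, iterations=50, tol=1e-6):
    """Estimate the density of X at each grid point from perturbed values w."""
    step = grid[1] - grid[0]
    m = len(grid)
    f_x = [1.0 / (m * step)] * m              # j = 0: uniform initial estimate
    for _ in range(iterations):
        new = [0.0] * m
        used = 0
        for wi in w:
            like = [f_y(wi - a) for a in grid]
            # Riemann-sum stand-in for the integral in the denominator.
            denom = sum(l * p for l, p in zip(like, f_x)) * step
            if denom == 0.0:
                continue                      # w_i cannot be explained by the grid
            used += 1
            for k in range(m):
                new[k] += like[k] * f_x[k] / denom
        if used == 0:
            raise ValueError("no perturbed value is covered by the grid")
        new = [v / used for v in new]
        change = max(abs(a - b) for a, b in zip(new, f_x))
        f_x = new
        if change < tol:                      # successive estimates are close
            break
    return f_x

rng = random.Random(1)
alpha = 10.0
x = [rng.choice([25.0, 55.0]) + rng.uniform(-3, 3) for _ in range(300)]
w = [xi + rng.uniform(-alpha, alpha) for xi in x]     # what the miner sees
grid = [float(a) for a in range(0, 81, 2)]
f_hat = reconstruct(w, uniform_density(alpha), grid)
```

Each update keeps the estimate non-negative and summing to one over the grid, so `f_hat` can be read as a histogram-style density for the original values.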

Page 19:

Reconstruction of Distribution

[Figure: number of people (y-axis, 0 to 1200) against age (x-axis, 20 to 60); series: Original, Randomized, Reconstructed.]

Page 20:

Original Distribution Reconstruction

Page 21:

Original Distribution Construction for Decision Tree

• When are the distributions reconstructed?
  • Global
    • Reconstruct for each attribute once at the beginning
    • Build the decision tree using the reconstructed data
  • ByClass
    • First split the training data
    • Reconstruct for each class separately
    • Build the decision tree using the reconstructed data
  • Local
    • First split the training data
    • Reconstruct for each class separately
    • Reconstruct at each node while building the tree

Page 22:

Accuracy vs. Randomization Level

[Figure, labeled “Fn 3”: accuracy (y-axis, 40 to 100) against randomization level (x-axis, 10 to 200); series: Original, Randomized, ByClass.]

Page 23:

More Results

• Global performs worse than ByClass and Local
• ByClass and Local have accuracy within 5% to 15% (absolute error) of the Original accuracy
• Overall, all are much better than the Randomized accuracy

Page 24:

Privacy metrics

• Privacy metrics of random additive data perturbation

4/3/2011 Data Mining: Principles and Algorithms 24

Page 25:

Unfortunately

• Random additive data perturbation is subject to data reconstruction attacks
• Original data can be estimated using spectral filtering techniques
  • H. Kargupta, S. Datta. On the privacy preserving properties of random data perturbation techniques, in ICDM 2003

Page 26:

Estimating distribution and data values

Page 27:

Follow-up Work

• Multiplicative randomization
• Geometric randomization
• Also subject to data reconstruction attacks!
  • Known input-output
  • Known samples

Page 28:

Data Perturbation

• Data randomization
• Randomization (additive noise)
• Geometric perturbation and projection (multiplicative noise)
• Randomized response technique (categorical data)

Page 29:

Data Collection Model

Data cannot be shared directly because of privacy concerns

Page 30:

Background: Randomized Response

• Question: “Do you smoke?”
• Biased coin: P(Head) = θ (θ ≠ 0.5)
• Example: the true answer is “Yes”
  • Head: report “Yes”
  • Tail: report “No”

P'(Yes) = P(Yes) ⋅ θ + P(No) ⋅ (1−θ)
P'(No) = P(Yes) ⋅ (1−θ) + P(No) ⋅ θ
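Because P(No) = 1 − P(Yes), the first equation can be solved for the true rate: P(Yes) = (P'(Yes) − (1−θ)) / (2θ − 1), which is why θ must differ from 0.5. A small simulation sketch follows; the true rate of 0.3, θ = 0.8, the sample size, and the function names are hypothetical.

```python
import random

def randomized_answer(truth, theta, rng):
    """Report the true answer with probability theta, otherwise its opposite."""
    return truth if rng.random() < theta else not truth

def estimate_true_yes_rate(observed_yes_rate, theta):
    """Invert P'(Yes) = P(Yes)*theta + (1 - P(Yes))*(1 - theta)."""
    if theta == 0.5:
        raise ValueError("theta = 0.5 makes the reports independent of the truth")
    return (observed_yes_rate - (1 - theta)) / (2 * theta - 1)

rng = random.Random(42)
theta = 0.8
truths = [rng.random() < 0.3 for _ in range(20000)]     # hidden true answers
reports = [randomized_answer(t, theta, rng) for t in truths]
observed = sum(reports) / len(reports)                  # estimate of P'(Yes)
estimate = estimate_true_yes_rate(observed, theta)      # estimate of P(Yes)
```

The collector never learns any individual's true answer, yet the aggregate estimate should land close to the hidden rate for a sample of this size.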

Page 31:

Decision Tree Mining using Randomized Response

• Multiple attributes encoded in bits
• Biased coin: P(Head) = θ (θ ≠ 0.5)
  • Head: true answer E: 110
  • Tail: false answer !E: 001
• Column distribution can be estimated for learning a decision tree

Using Randomized Response Techniques for Privacy-Preserving Data Mining, Du, 2003

Page 32:

Generalization for Multi-Valued Categorical Data

• True value S_i is reported as S_i with probability q1, S_{i+1} with probability q2, S_{i+2} with probability q3, and S_{i+3} with probability q4

$$\begin{pmatrix} P'(s_1) \\ P'(s_2) \\ P'(s_3) \\ P'(s_4) \end{pmatrix} = \underbrace{\begin{pmatrix} q_1 & q_4 & q_3 & q_2 \\ q_2 & q_1 & q_4 & q_3 \\ q_3 & q_2 & q_1 & q_4 \\ q_4 & q_3 & q_2 & q_1 \end{pmatrix}}_{M} \begin{pmatrix} P(s_1) \\ P(s_2) \\ P(s_3) \\ P(s_4) \end{pmatrix}$$
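When M is invertible, the original distribution can be recovered from the observed one by solving P' = M P. The sketch below builds the circulant matrix from the q values and solves the system with a small Gauss-Jordan routine; the helper names and the numeric values of q and P are hypothetical.

```python
def rr_matrix(q):
    """Entry (r, c) is the probability of reporting s_r when the truth is s_c."""
    n = len(q)
    return [[q[(r - c) % n] for c in range(n)] for r in range(n)]

def mat_vec(m, v):
    """Matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def solve(m, b):
    """Solve m x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    a = [row[:] + [b[i]] for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("RR matrix is singular; P cannot be recovered")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]

q = [0.7, 0.1, 0.1, 0.1]            # hypothetical q1..q4, summing to 1
M = rr_matrix(q)
p_true = [0.4, 0.3, 0.2, 0.1]       # hypothetical original distribution P
p_observed = mat_vec(M, p_true)     # distribution P' of the reported values
p_recovered = solve(M, p_observed)  # estimate of P from P'
```

In practice P' is itself estimated from a finite sample, so the recovered P carries sampling error that grows as M gets closer to singular.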

Page 33:

A Generalization

• RR Matrices [Warner 65], [R. Agrawal 05], [S. Agrawal 05]
• RR Matrix can be arbitrary

$$M = \begin{pmatrix} a_{11} & a_{12} & a_{13} & a_{14} \\ a_{21} & a_{22} & a_{23} & a_{24} \\ a_{31} & a_{32} & a_{33} & a_{34} \\ a_{41} & a_{42} & a_{43} & a_{44} \end{pmatrix}$$

• Can we find optimal RR matrices?

OptRR: Optimizing Randomized Response Schemes for Privacy-Preserving Data Mining, Huang, 2008

Page 34:

What is an optimal matrix?

• Which of the following is better?

$$M_1 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \qquad M_2 = \begin{pmatrix} \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \\ \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \\ \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \end{pmatrix}$$

• Privacy: M2 is better
• Utility: M1 is better

So, what is an optimal matrix?

Page 35:

Optimal RR Matrix

• An RR matrix M is optimal if no other RR matrix’s privacy and utility are both better than M’s (i.e., no other matrix dominates M).
• Privacy Quantification
• Utility Quantification
• Privacy and utility metrics
  • Privacy: how accurately one can estimate individual info.
  • Utility: how accurately we can estimate aggregate info.
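The dominance test can be written down in a few lines once each matrix has a (privacy, utility) score. The sketch assumes higher is better on both axes and uses the usual Pareto convention (at least as good on both, strictly better on one), which is slightly broader than the wording on the slide; the scores are made up for illustration.

```python
def dominates(a, b):
    """True if score a = (privacy, utility) Pareto-dominates score b."""
    at_least_as_good = a[0] >= b[0] and a[1] >= b[1]
    strictly_better = a[0] > b[0] or a[1] > b[1]
    return at_least_as_good and strictly_better

def pareto_optimal(scores):
    """Names of the matrices that no other matrix dominates."""
    return {
        name
        for name, score in scores.items()
        if not any(dominates(other, score)
                   for other_name, other in scores.items()
                   if other_name != name)
    }

# Hypothetical (privacy, utility) scores, higher is better on both axes.
scores = {
    "M1": (0.1, 0.9),
    "M2": (0.9, 0.1),
    "M3": (0.5, 0.5),
    "M4": (0.4, 0.4),   # dominated by M3
}
optimal = pareto_optimal(scores)
```

Here M1, M2, and M3 are mutually non-dominated, so all three count as optimal even though they trade privacy against utility very differently.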

Page 36:

Optimization algorithm

• Evolutionary Multi-Objective Optimization (EMOO)
• The algorithm
  • Start with a set of initial RR matrices
  • Repeat the following steps in each iteration
    • Mating: selecting two RR matrices in the pool
    • Crossover: exchanging several columns between the two RR matrices
    • Mutation: change some values in an RR matrix
    • Meet the privacy bound: filtering the resultant matrices
    • Evaluate the fitness value for the new RR matrices.

Note: the fitness value is defined in terms of privacy and utility metrics
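The crossover and mutation steps can be sketched as follows. This is not the OptRR implementation: it assumes each column of an RR matrix is a probability distribution (so it must keep summing to 1), swaps each column with probability 0.5, and mutates by perturbing one column and renormalizing it. Mating selection, the privacy-bound filter, and the fitness evaluation are left out.

```python
import random

def crossover(m1, m2, rng):
    """Exchange a random subset of whole columns between two RR matrices."""
    n = len(m1)
    c1 = [row[:] for row in m1]
    c2 = [row[:] for row in m2]
    for col in range(n):
        if rng.random() < 0.5:
            for r in range(n):
                c1[r][col], c2[r][col] = c2[r][col], c1[r][col]
    return c1, c2

def mutate(m, rng, scale=0.1):
    """Perturb one random column, then renormalize it to sum to 1."""
    n = len(m)
    child = [row[:] for row in m]
    col = rng.randrange(n)
    column = [max(child[r][col] + rng.uniform(-scale, scale), 1e-9)
              for r in range(n)]
    total = sum(column)
    for r in range(n):
        child[r][col] = column[r] / total
    return child

def column_sums(m):
    """Sum of each column; every entry should be 1 for a valid RR matrix."""
    return [sum(m[r][c] for r in range(len(m))) for c in range(len(m))]

rng = random.Random(7)
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
uniform = [[1 / 3] * 3 for _ in range(3)]
child_a, child_b = crossover(identity, uniform, rng)
mutant = mutate(child_a, rng)
```

Swapping whole columns, rather than individual entries, is what keeps the children valid without any repair step after crossover.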

Page 37:

Output of Optimization

The optimal set is often plotted in the objective space as a Pareto front.

[Figure: RR matrices M1 to M8 plotted in the objective space with Privacy and Utility axes, each running from Worse to Better.]

Page 38:

Classes of Solutions

• Methods
  • Input obfuscation
    • Perturbation
    • Generalization
  • Output perturbation
    • Differential privacy
• Metrics
  • Privacy vs. Utility

Page 39:

Data Re-identification

[Figure: re-identification by linking data sets on shared attributes. Labels recovered: Disease, Birthdate, Zip, Sex, Name.]

Page 40:

k-anonymity & l-diversity

Page 41:

Privacy preserving data mining

• Generalization principles
  • k-anonymity, l-diversity, …
• Methods
  • Optimal
  • Greedy
  • Top-down vs. bottom-up
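The slides name these principles without defining them, so the sketch below relies on the commonly used definitions: a table is k-anonymous if every combination of quasi-identifier values is shared by at least k records, and (in the simplest "distinct" form) l-diverse if every such group contains at least l different sensitive values. The table, column names, and function names are hypothetical.

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """Every quasi-identifier combination must occur in at least k rows."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in groups.values())

def is_l_diverse(rows, quasi_identifiers, sensitive, l):
    """Every group must hold at least l distinct sensitive values."""
    groups = {}
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        groups.setdefault(key, set()).add(row[sensitive])
    return all(len(values) >= l for values in groups.values())

# Hypothetical generalized table.
rows = [
    {"age": "20-29", "zip": "303**", "disease": "flu"},
    {"age": "20-29", "zip": "303**", "disease": "asthma"},
    {"age": "30-39", "zip": "303**", "disease": "flu"},
    {"age": "30-39", "zip": "303**", "disease": "flu"},
]
two_anonymous = is_k_anonymous(rows, ["age", "zip"], k=2)
two_diverse = is_l_diverse(rows, ["age", "zip"], "disease", l=2)
```

The example table is 2-anonymous but not 2-diverse: everyone in the 30-39 group has the same disease, so group membership alone reveals it.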

Page 42:

Mondrian: Greedy Partitioning Algorithm

• Problem
  • Need an algorithm to find multi-dimensional partitions
  • Optimal k-anonymous strict multi-dimensional partitioning is NP-hard
• Solution
  • Use a greedy algorithm
  • Based on k-d trees
  • Complexity O(n log n)
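A greedy, k-d-tree-style partitioner in the spirit of Mondrian can be sketched as below. The choices here are assumptions for illustration rather than the published algorithm's exact rules: numeric quasi-identifiers only, the dimension with the widest value range tried first, a strict median cut, and a split accepted only if both halves keep at least k records. This simple version re-sorts at every level and makes no attempt to reach the O(n log n) bound.

```python
def mondrian(records, k):
    """Recursively split records (tuples of numeric quasi-identifiers)."""
    dims = range(len(records[0]))

    def spread(d):
        values = [r[d] for r in records]
        return max(values) - min(values)

    # Try the widest dimension first, falling back to narrower ones.
    for d in sorted(dims, key=spread, reverse=True):
        ordered = sorted(records, key=lambda r: r[d])
        median = ordered[len(ordered) // 2][d]
        left = [r for r in ordered if r[d] < median]
        right = [r for r in ordered if r[d] >= median]
        if len(left) >= k and len(right) >= k:
            return mondrian(left, k) + mondrian(right, k)
    return [records]          # no allowable cut: this is a final partition

def generalize(partition):
    """Replace each quasi-identifier by its (min, max) range in the partition."""
    dims = range(len(partition[0]))
    return tuple((min(r[d] for r in partition), max(r[d] for r in partition))
                 for d in dims)

# Hypothetical (age, zipcode) records.
patients = [(25, 53711), (25, 53712), (26, 53711),
            (27, 53710), (27, 53712), (28, 53711)]
partitions = mondrian(patients, k=2)
ranges = [generalize(p) for p in partitions]
```

Every record in a partition is then published with the partition's ranges in place of its exact age and zipcode, so each published combination covers at least k records.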

Page 43:

Example

k = 2; Quasi Identifiers: Age, Zipcode

What should be the splitting criteria?

[Figure panels: Patient Data; Multi-Dimensional]

Page 44:

Unfortunately

• Generalization-based principles and methods are subject to attacks
  • Background knowledge sensitive
  • Attack dependent

Page 45:

Classes of Solutions

• Methods
  • Input obfuscation
    • Perturbation
    • Generalization
  • Output perturbation
    • Differential privacy
• Metrics
  • Privacy vs. Utility

Page 46:

Differential Privacy

• Differential privacy requires the outcome to be formally indistinguishable when run with and without any particular record in the data set

[Figure: a User sends query Q to a Differentially Private Interface. With D1 (Bob in) the answer is A(D1) = Q(D1) + Y1; with D2 (Bob out) the answer is A(D2) = Q(D2) + Y2.]

Page 47:

Differential Privacy

• Differential privacy
• Laplace mechanism
  • Q(D) + Y where Y is drawn from [distribution lost in extraction]
• Query sensitivity

[Figure: a User sends query Q to a Differentially Private Interface. With D1 (Bob in) the answer is A(D1) = Q(D1) + Y1; with D2 (Bob out) the answer is A(D2) = Q(D2) + Y2.]
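The slide's expression for the distribution of Y did not survive extraction. The sketch below assumes the standard Laplace mechanism, in which Y is zero-mean Laplace noise with scale equal to the query sensitivity divided by the privacy parameter ε. The count query, ε = 0.5, and the function name are hypothetical; a count has sensitivity 1 because adding or removing one record changes it by at most 1.

```python
import random

def laplace_mechanism(true_answer, sensitivity, epsilon, rng):
    """Return Q(D) + Y with Y ~ Laplace(0, sensitivity / epsilon)."""
    scale = sensitivity / epsilon
    # The difference of two iid exponentials with mean `scale` is Laplace(0, scale).
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_answer + noise

rng = random.Random(3)
true_count = 100                      # hypothetical Q(D) for a count query
answers = [laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5, rng=rng)
           for _ in range(20000)]
mean_answer = sum(answers) / len(answers)
```

A smaller ε widens the noise and strengthens the privacy guarantee; any single answer is noisy, but the noise has mean zero, so the average of many answers sits near the true count.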

Page 48:

Coming up

• Data mining algorithms using differential privacy
  • Decision tree learning (Data Mining with Differential Privacy, SIGKDD 10)
  • Frequent itemsets mining (Discovering frequent patterns in sensitive data, SIGKDD 10)

Page 49:

Midterm Exam

• Adjusted mean: 85.3
• Adjusted max: 101
• Your favorite topics: Clustering, frequent itemsets mining, decision tree
• Your favorite assignments: Apriori
• Your least favorite: SOM, Weka analysis