Data mining with differential privacy


Page 1: Data mining with differential privacy

+ Data Mining with Differential Privacy

Chang Wei-Yuan, 2014 / 10 / 3 (Fri.) @ MakeLab Group Meeting

Arik Friedman and Assaf Schuster / KDD’10

Page 2: Data mining with differential privacy

+Outline

• Introduction
• Background
• Method
• Experiment
• Conclusion
• Thought

Page 3: Data mining with differential privacy

+Introduction

• There is great value in data mining solutions that offer both:
  • reliable privacy guarantees
  • available accuracy
• Differential privacy:
  • computations are insensitive to changes in any particular individual's record

Page 4: Data mining with differential privacy

+Introduction (cont.)

• Once an individual is certain that his or her data will remain private, being opted in or out of the database should make little difference.

Page 5: Data mining with differential privacy

+Introduction (cont.)

• Example 1

  Name  | Result
  ------+-------
  Tom   | 0
  Jack  | 1
  Henry | 1
  Diego | 0
  Alice | ?

• f(i) = count(i)
• Alice => i = 5
• count(5) – count(4) reveals Alice's result
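This differencing attack can be sketched in a few lines of Python. Alice's value of 1 below is a hypothetical stand-in, since the slide leaves it as "?":

```python
# Differencing attack: without noise, two exact counting queries that
# differ by a single record reveal that record's value.
results = {"Tom": 0, "Jack": 1, "Henry": 1, "Diego": 0, "Alice": 1}
names = list(results)  # insertion order: Tom, Jack, Henry, Diego, Alice

def count(i):
    # f(i) = count(i): sum of the first i individuals' results.
    return sum(results[n] for n in names[:i])

# count(5) - count(4) isolates Alice's record exactly.
alice_value = count(5) - count(4)
print(alice_value)  # prints 1: Alice's private bit, recovered exactly
```

Differential privacy defends against exactly this: noise must make the two answers statistically hard to tell apart.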

Page 6: Data mining with differential privacy

+Introduction (cont.)

• Example 2
• We can infer information about the target from the other records.

  Id | Sex | Job     | Hometown | Hobby
  ---+-----+---------+----------+--------
  1  | M   | student | Hsinchu  | sport
  2  | M   | teacher | Taipei   | writing
  3  | F   | student | Hsinchu  | singing
  4  | F   | student | Taipei   | singing
  5  | ?   | ?       | ?        | ?

Page 7: Data mining with differential privacy

+Introduction (cont.)

• Goal: count(5) – count(4) ≈ 0
• Goal: "computations are insensitive to changes in any particular individual's record"

Page 8: Data mining with differential privacy

+Outline

• Introduction
• Background
• Method
• Experiment
• Conclusion
• Thought

Page 9: Data mining with differential privacy

+Differential Privacy

• Differential privacy

  [Figure: output probability distributions of M on neighboring datasets]

  • M: a randomized computation
  • f: a query function
  • D, D': datasets whose symmetric difference is a single record

Page 10: Data mining with differential privacy

+Differential Privacy (cont.)

• Differential privacy

  Define. (ε-Differential Privacy)
  We say a randomized computation M provides ε-differential privacy if, for any datasets A and B with symmetric difference |AΔB| = 1, and any set of possible outcomes S ⊆ Range(M):

    Pr[M(A) ∈ S] ≤ e^ε × Pr[M(B) ∈ S]

Page 11: Data mining with differential privacy

+Laplace Mechanism

• Example of Laplace Mechanism

  Name  | Result
  ------+-------
  Tom   | 0
  Jack  | 1
  Henry | 1
  Diego | 0
  Alice | ?

• count(4) = 2 + noise(4)
• count(5) = 3 + noise(5)
• with Laplace noise, the output distributions of count(5) and count(4) differ by a factor of at most e^ε

Page 12: Data mining with differential privacy

+Laplace Mechanism

• Laplace Mechanism

  Theorem. (Laplace mechanism)
  Given a function f over an arbitrary domain D, the computation

    M(d) = f(d) + Laplace(Δf / ε)

  provides ε-differential privacy, where Δf is the sensitivity of f.
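A minimal sketch of this mechanism for a counting query, using only the standard library (function names are illustrative; a counting query has sensitivity Δf = 1):

```python
import math
import random

def laplace_noise(scale):
    # Sample Laplace(0, scale) via the inverse-CDF transform.
    u = random.random() - 0.5
    sign = 1.0 if u >= 0 else -1.0
    return -scale * sign * math.log(max(1.0 - 2.0 * abs(u), 1e-300))

def noisy_count(records, epsilon):
    # Adding or removing one record changes a count by at most 1,
    # so the sensitivity is 1 and the noise scale is 1 / epsilon.
    return len(records) + laplace_noise(1.0 / epsilon)

# Smaller epsilon => more noise => stronger privacy, lower accuracy.
print(noisy_count(["Tom", "Jack", "Henry", "Diego"], epsilon=0.5))
```

The `max(..., 1e-300)` guard only protects `log` against the measure-zero case where the uniform draw lands exactly on the boundary.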

Page 13: Data mining with differential privacy

+Exponential Mechanism

• Example of Exponential Mechanism

  item       | q  | ε=0  | ε=0.1 | ε=1
  -----------+----+------+-------+--------
  Football   | 30 | 0.46 | 0.42  | 0.92
  Volleyball | 25 | 0.38 | 0.33  | 0.07
  Basketball | 8  | 0.12 | 0.14  | 1.5E-05
  Tennis     | 2  | 0.03 | 0.10  | 7.7E-07

Page 14: Data mining with differential privacy

+Exponential Mechanism (cont.)

• Exponential Mechanism

  Theorem. (Exponential Mechanism)
  Let q be a quality function that, given a database d, assigns a score to each outcome r. Then the mechanism M, defined by

    M(d, q) = { return r with probability proportional to exp(ε·q(d, r) / (2Δq)) }

  maintains ε-differential privacy, where Δq is the sensitivity of q.
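The theorem can be sketched directly, using the sports example from the previous slide (the quality scores are the q column; sensitivity Δq = 1 is assumed):

```python
import math
import random

def exponential_mechanism(quality, candidates, epsilon, sensitivity=1.0):
    # Select r with probability proportional to exp(eps * q(r) / (2 * Δq)).
    # Subtracting the max score first keeps exp() from overflowing;
    # it cancels out after normalization.
    scores = {r: quality(r) for r in candidates}
    q_max = max(scores.values())
    weights = [math.exp(epsilon * (scores[r] - q_max) / (2.0 * sensitivity))
               for r in candidates]
    return random.choices(candidates, weights=weights)[0]

votes = {"Football": 30, "Volleyball": 25, "Basketball": 8, "Tennis": 2}
# With a large epsilon the true winner is chosen almost surely;
# with epsilon near 0 the choice approaches uniform.
print(exponential_mechanism(votes.get, list(votes), epsilon=1.0))
```

This matches the table's behavior: as ε grows, probability mass concentrates on Football, the outcome with the highest score.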

Page 15: Data mining with differential privacy

+PINQ Framework

• PINQ Framework
  • PINQ is a proposed architecture for data analysis with differential privacy
  • Another operator presented in PINQ is Partition, which was dubbed parallel composition:
    • the privacy costs do not add up when queries are executed on disjoint datasets
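A sketch of a Partition-style operator and the budget accounting it enables (the function names are illustrative, not PINQ's actual API):

```python
def partition(records, key):
    # Split the dataset into disjoint subsets; every record lands in
    # exactly one partition.
    groups = {}
    for record in records:
        groups.setdefault(key(record), []).append(record)
    return groups

def sequential_cost(epsilons):
    # Sequential composition: costs of queries over overlapping data add up.
    return sum(epsilons)

def parallel_cost(epsilons):
    # Parallel composition: queries on disjoint partitions cost only the
    # maximum epsilon, since each record is touched by at most one query.
    return max(epsilons)

people = [("M", "student"), ("M", "teacher"), ("F", "student"), ("F", "student")]
by_sex = partition(people, key=lambda record: record[0])
# One epsilon=0.1 count per partition costs 0.1 in total, not 0.2.
print(parallel_cost([0.1, 0.1]))  # prints 0.1
```

This is why partitioning is so valuable for decision-tree induction: the counts at sibling nodes operate on disjoint subsets.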

Page 16: Data mining with differential privacy

+PINQ Framework (cont.)

[Figure: PINQ framework architecture]

Page 17: Data mining with differential privacy

+Outline

• Introduction
• Background
• Method
• Experiment
• Conclusion
• Thought

Page 18: Data mining with differential privacy

+Method

• SuLQ-based ID3
• DiffP-ID3
• DiffP-C4.5

Page 19: Data mining with differential privacy

+SuLQ-based ID3

• Based on the SuLQ framework, using the Laplace mechanism.
• It makes direct use of the NoisyCount primitive to evaluate the information gain criterion.
• The queries required to evaluate the information gain must be carried out for each attribute separately:
  • the budget per query is small

Page 20: Data mining with differential privacy

+SuLQ-based ID3

• ID3 Classification
• Split point:
  • max( Gain(Job), Gain(Hometown), Gain(Hobby) )

  Id | Sex | Job     | Hometown | Hobby
  ---+-----+---------+----------+--------
  1  | M   | student | Hsinchu  | sport
  2  | M   | teacher | Taipei   | writing
  3  | F   | student | Hsinchu  | singing
  4  | F   | student | Taipei   | singing

Page 21: Data mining with differential privacy

+SuLQ-based ID3

• SuLQ-based ID3 Classification
• Split point:
  • max( Gain(Job) + noise, Gain(Hometown) + noise, Gain(Hobby) + noise )

  Id | Sex | Job     | Hometown | Hobby
  ---+-----+---------+----------+--------
  1  | M   | student | Hsinchu  | sport
  2  | M   | teacher | Taipei   | writing
  3  | F   | student | Hsinchu  | singing
  4  | F   | student | Taipei   | singing
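The SuLQ-style selection amounts to a noisy argmax: each gain is estimated with its own Laplace-noised query, so the budget is split across attributes (the gain values and the even budget split are illustrative):

```python
import math
import random

def laplace_noise(scale):
    # Sample Laplace(0, scale) via the inverse-CDF transform.
    u = random.random() - 0.5
    sign = 1.0 if u >= 0 else -1.0
    return -scale * sign * math.log(max(1.0 - 2.0 * abs(u), 1e-300))

def sulq_choose_split(gains, epsilon, sensitivity=1.0):
    # One noisy query per attribute: each gets epsilon / len(gains),
    # so more candidate attributes means more noise per estimate.
    per_query = epsilon / len(gains)
    noisy = {attr: gain + laplace_noise(sensitivity / per_query)
             for attr, gain in gains.items()}
    return max(noisy, key=noisy.get)

gains = {"Job": 0.9, "Hometown": 0.3, "Hobby": 0.5}
print(sulq_choose_split(gains, epsilon=1.0))
```

With a small per-query budget the noise can easily swamp the gap between gains, which is exactly the weakness the slide points out.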

Page 22: Data mining with differential privacy

+DiffP-ID3

• Based on the PINQ framework, using the exponential mechanism.
• It evaluates all attributes simultaneously in one query, the outcome of which is the attribute to use for splitting:
  • the quality function q provides the score for each attribute

Page 23: Data mining with differential privacy

+DiffP-ID3 (cont.)

• DiffP-ID3 Classification
• Split point:
  • max( Gain(M(Job)), Gain(M(Hometown)), Gain(M(Hobby)) )
  • PINQ Partition

  Id | Sex | Job     | Hometown | Hobby
  ---+-----+---------+----------+--------
  1  | M   | student | Hsinchu  | sport
  2  | M   | teacher | Taipei   | writing
  3  | F   | student | Hsinchu  | singing
  4  | F   | student | Taipei   | singing
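DiffP-ID3's single-query selection can be sketched as one exponential-mechanism draw over all candidate attributes at once, rather than one noisy query per attribute (the gain values are illustrative):

```python
import math
import random

def choose_split_attribute(gains, epsilon, sensitivity=1.0):
    # One exponential-mechanism query: attribute a is returned with
    # probability proportional to exp(eps * Gain(a) / (2 * Δq)).
    # Subtracting the max gain keeps exp() numerically stable.
    g_max = max(gains.values())
    attrs = list(gains)
    weights = [math.exp(epsilon * (gains[a] - g_max) / (2.0 * sensitivity))
               for a in attrs]
    return random.choices(attrs, weights=weights)[0]

gains = {"Job": 0.9, "Hometown": 0.3, "Hobby": 0.5}
print(choose_split_attribute(gains, epsilon=1.0))
```

Because the whole selection is a single query, the entire budget for this step backs one draw, instead of being split three ways as in the SuLQ-based approach.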

Page 24: Data mining with differential privacy

+DiffP-ID3 (cont.)

• Which quality function should be fed into the exponential mechanism? It depends on:
  • the depth constraint
  • the sensitivity of the splitting criterion
• Information gain is the most sensitive to noise, and the Max operator is the least sensitive to noise.

Page 25: Data mining with differential privacy

+DiffP-C4.5

• One important extension is the ability to handle continuous attributes:
  • First, the domain is divided into ranges where the score is constant. Each range is considered a discrete option.
  • Then, a point from the chosen range is sampled with uniform distribution and returned as the output of the exponential mechanism.
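The two-step procedure can be sketched as follows: weight each constant-score range by its length times the exponential-mechanism factor, then sample uniformly inside the chosen range (the ranges, scores, and sensitivity are illustrative):

```python
import math
import random

def choose_continuous_split(ranges, epsilon, sensitivity=1.0):
    # ranges: list of ((lo, hi), score) pairs, where the splitting
    # score is constant within each range.
    s_max = max(score for _, score in ranges)
    # Step 1: pick a range; its weight is its length times the
    # exponential-mechanism factor for its (constant) score.
    weights = [(hi - lo) * math.exp(epsilon * (score - s_max) / (2.0 * sensitivity))
               for (lo, hi), score in ranges]
    (lo, hi), _ = random.choices(ranges, weights=weights)[0]
    # Step 2: return a uniform sample from the selected range.
    return random.uniform(lo, hi)

# Candidate split ranges over a numeric attribute in [0, 100].
ranges = [((0.0, 40.0), 0.1), ((40.0, 60.0), 0.8), ((60.0, 100.0), 0.3)]
print(choose_continuous_split(ranges, epsilon=1.0))
```

Weighting by range length is what makes this equivalent to running the exponential mechanism over the (continuous) set of candidate split points.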

Page 26: Data mining with differential privacy

+Outline

• Introduction
• Background
• Method
• Experiment
• Conclusion
• Thought

Page 27: Data mining with differential privacy

+Experiment

• The experiment uses a domain with ten nominal attributes and a class attribute, defined in another paper.
• Noise is introduced into the samples by reassigning attributes and classes, replacing each value with a fixed noise probability.
• For testing, a noiseless test set with 10,000 records was generated in the same way.

Page 28: Data mining with differential privacy

+

• the average accuracy is higher as more training samples are available
• the influence of the noise weakens as the number of samples grows when using Gini and Max

Page 29: Data mining with differential privacy

+

• three of the ten attributes were replaced with numeric attributes over the domain [0, 100]
• Figure 4 presents the results of a similar experiment

Page 30: Data mining with differential privacy

+

• for smaller training sets, ID3 allows better accuracy
• for larger training sets, C4.5 outperforms ID3

Page 31: Data mining with differential privacy

+

• the accuracy results presented in Figure 6 were around 5% lower than the results presented in Figure 7, and sometimes even lower
• when the size of the dataset is small, algorithms that make efficient use of the privacy budget are superior

Page 32: Data mining with differential privacy

+Outline

• Introduction
• Background
• Method
• Experiment
• Conclusion
• Thought

Page 33: Data mining with differential privacy

+Conclusion

• When the number of training samples is relatively small, or the privacy constraints set by the data provider are very limiting, the sensitivity of the calculations becomes crucial.

Page 34: Data mining with differential privacy

+Future work

• One solution might be to consider other stopping rules when selecting nodes, trading possible improvements in accuracy for increased stability.
• In addition, it may be fruitful to consider different tactics for budget distribution.

Page 35: Data mining with differential privacy

+Outline

• Introduction
• Background
• Method
• Experiment
• Conclusion
• Thought

Page 36: Data mining with differential privacy

+Thought

Page 37: Data mining with differential privacy

+Thanks for listening. 2014 / 10 / 3 (Fri.) @ MakeLab Group Meeting
[email protected]