+
Data Mining with Differential Privacy
Chang Wei-Yuan, 2014/10/3 (Fri.) @ MakeLab Group Meeting
Arik Friedman and Assaf Schuster / KDD’10
+Outline
- Introduction
- Background
- Method
- Experiment
- Conclusion
- Thoughts
2
+Introduction
- There is great value in data mining solutions.
  - reliable privacy guarantees
  - available accuracy
- Differential privacy
  - computations are insensitive to changes in any particular individual's record
3
+Introduction (cont.)
- Once an individual is certain that his or her data will remain private, being opted in or out of the database should make little difference.
4
+Introduction (cont.)
- Example 1

Name  Result
Tom   0
Jack  1
Henry 1
Diego 0
Alice ?

- f(i) = count(i)
- Alice => i = 5
- count(5) – count(4)
5
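The differencing attack in Example 1 can be sketched in a few lines of Python. Alice's hidden result is assumed to be 1 here, consistent with the counts shown later on the Laplace Mechanism slide:

```python
# Hypothetical table from Example 1; Alice is record 5 and her
# result (assumed to be 1) is what the attacker wants to learn.
results = [0, 1, 1, 0, 1]  # Tom, Jack, Henry, Diego, Alice

def count(i):
    """The query f(i) = count(i): how many of the first i results are 1."""
    return sum(results[:i])

# Two innocuous-looking aggregate queries...
c4 = count(4)  # count over everyone except Alice
c5 = count(5)  # count over everyone including Alice

# ...whose difference reveals Alice's individual value exactly.
alice_result = c5 - c4
```

This is exactly why exact counts cannot be released as-is: two honest aggregates differ by one record, and that record is Alice's.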
+Introduction (cont.)
- Example 2
- We can infer the target's attributes from the released information.
6

Id  Sex  Job      Hometown  Hobby
1   M    student  Hsinchu   sport
2   M    teacher  Taipei    writing
3   F    student  Hsinchu   singing
4   F    student  Taipei    singing
5   ?    ?        ?         ?
+Introduction (cont.)
- Goal: count(5) – count(4) ≈ 0
- Goal: "computations are insensitive to changes in any particular individual's record"
7
+Outline
- Introduction
- Background
- Method
- Experiment
- Conclusion
- Thoughts
8
+Differential Privacy
- Differential privacy
9
[Figure: output probability distributions of M over two neighboring datasets]
- M: a randomized computation
- f: a query function
- D, D': datasets with a symmetric difference of one record
+Differential Privacy (cont.)
- Differential privacy
10
Definition (ε-Differential Privacy). We say a randomized computation M provides ε-differential privacy if for any datasets A and B with symmetric difference AΔB = 1, and any set of possible outcomes S ⊆ Range(M):
Pr[M(A) ∈ S] ≤ e^ε · Pr[M(B) ∈ S]
+Laplace Mechanism
- Example of the Laplace Mechanism
11

Name  Result
Tom   0
Jack  1
Henry 1
Diego 0
Alice ?

- count(4) = 2 + noise(4)
- count(5) = 3 + noise(5)
- for any output value, the probabilities of the noisy count(5) and count(4) differ by a factor of at most e^ε
+Laplace Mechanism (cont.)
- Laplace Mechanism
12
Theorem (Laplace mechanism). Given a function f over an arbitrary domain D, the computation
M(x) = f(x) + Laplace(Δf / ε)
provides ε-differential privacy, where Δf is the sensitivity of f.
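As a rough sketch (not the paper's code), the NoisyCount primitive over a counting query can be implemented as follows; `laplace_noise` and `noisy_count` are illustrative names, and the scale 1/ε follows from a counting query's sensitivity of 1:

```python
import math
import random

def laplace_noise(scale):
    """Draw one sample from Laplace(0, scale) by inverse-transform sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_count(data, predicate, epsilon):
    """NoisyCount: the exact count plus Laplace(1/epsilon) noise.
    One record changes a count by at most 1 (sensitivity 1),
    so scale 1/epsilon gives epsilon-differential privacy."""
    true_count = sum(1 for x in data if predicate(x))
    return true_count + laplace_noise(1.0 / epsilon)
```

Averaged over many runs the noisy answer concentrates around the true count, while any single answer hides each individual's contribution.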
+Exponential Mechanism
- Example of the Exponential Mechanism
13

Item        q    ε=0   ε=0.1  ε=1
Football    30   0.46  0.42   0.92
Volleyball  25   0.38  0.33   0.07
Basketball  8    0.12  0.14   1.5E-05
Tennis      2    0.03  0.10   7.7E-07
+Exponential Mechanism (cont.)
- Exponential Mechanism
14
Theorem (Exponential Mechanism). Let q be a quality function that, given a database d, assigns a score q(d, r) to each outcome r. Then the mechanism M that outputs r with probability proportional to exp(ε · q(d, r) / 2Δq) maintains ε-differential privacy, where Δq is the sensitivity of q.
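The ε=0.1 and ε=1 columns of the table above can be reproduced (up to rounding) with a short sketch of the exponential mechanism's selection probabilities, assuming quality sensitivity Δq = 1; `exp_mechanism_probs` is an illustrative name:

```python
import math

def exp_mechanism_probs(scores, epsilon, sensitivity=1.0):
    """Selection probabilities of the exponential mechanism:
    Pr[r] proportional to exp(epsilon * q(d, r) / (2 * sensitivity))."""
    weights = [math.exp(epsilon * q / (2.0 * sensitivity)) for q in scores]
    total = sum(weights)
    return [w / total for w in weights]

# Quality scores from the slide: Football 30, Volleyball 25,
# Basketball 8, Tennis 2.
probs = exp_mechanism_probs([30, 25, 8, 2], epsilon=1.0)
# probs[0] (Football) comes out near 0.92, as in the ε=1 column.
```

Note how a larger ε concentrates almost all probability on the best-scoring outcome, while a small ε keeps the choices close to indistinguishable.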
+PINQ Framework
- PINQ Framework
  - PINQ is a proposed architecture for data analysis with differential privacy
  - Another operator presented in PINQ is Partition, whose privacy accounting was dubbed parallel composition
    - the costs do not add up when queries are executed on disjoint datasets
15
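The accounting rule behind Partition can be illustrated with a toy sketch (the function names are ours, not PINQ's API): sequential queries over the same data consume the sum of their budgets, while one query per disjoint partition consumes only the maximum.

```python
def sequential_cost(epsilons):
    """Total privacy cost of several queries over the SAME dataset:
    budgets add up (sequential composition)."""
    return sum(epsilons)

def parallel_cost(epsilons):
    """Total privacy cost of one query per DISJOINT partition:
    each record lies in exactly one partition, so the worst-case
    cost is the maximum epsilon, not the sum (parallel composition)."""
    return max(epsilons)
```

This is why partitioning a dataset by attribute value (as the tree-building algorithms below do at each node) is so budget-efficient.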
+PINQ Framework (cont.)
16
+Outline
- Introduction
- Background
- Method
- Experiment
- Conclusion
- Thoughts
17
+Method
18
- SuLQ-based ID3
- DiffP-ID3
- DiffP-C4.5
+SuLQ-based ID3
- Based on the SuLQ framework, using the Laplace Mechanism.
- It makes direct use of the NoisyCount primitive to evaluate the information gain criterion.
- Evaluating the information gain must be carried out for each attribute separately.
  - the budget per query is small
19
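A rough sketch of why the per-query budget shrinks: querying each attribute separately splits the total ε across the attribute queries, so the per-count Laplace noise grows with the number of attributes (illustrative names, not the paper's code):

```python
import math
import random

def laplace_noise(scale):
    """Draw one sample from Laplace(0, scale) by inverse-transform sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_counts_per_attribute(records, attributes, total_epsilon):
    """SuLQ-style evaluation: each attribute gets its own query,
    so the budget is split and each count gets noise of scale
    len(attributes) / total_epsilon."""
    eps_per_query = total_epsilon / len(attributes)  # small per-query budget
    noisy = {}
    for a in attributes:
        counts = {}
        for r in records:
            counts[r[a]] = counts.get(r[a], 0) + 1
        noisy[a] = {v: c + laplace_noise(1.0 / eps_per_query)
                    for v, c in counts.items()}
    return noisy
```

With many attributes and a tight total budget, these noisy counts can drown out the true information gain, which motivates DiffP-ID3 below.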
+SuLQ-based ID3 (cont.)
- ID3 Classification
- Split point
  - max( Gain(Job), Gain(Home), Gain(Hobby) )
20
Id  Sex  Job      Hometown  Hobby
1   M    student  Hsinchu   sport
2   M    teacher  Taipei    writing
3   F    student  Hsinchu   singing
4   F    student  Taipei    singing
+SuLQ-based ID3 (cont.)
- SuLQ-based ID3 Classification
- Split point
  - max( Gain(Job)+noise, Gain(Home)+noise, Gain(Hobby)+noise )
21
Id  Sex  Job      Hometown  Hobby
1   M    student  Hsinchu   sport
2   M    teacher  Taipei    writing
3   F    student  Hsinchu   singing
4   F    student  Taipei    singing
+DiffP-ID3
- Based on the PINQ framework, using the exponential mechanism.
- It evaluates all attributes simultaneously in one query, the outcome of which is the attribute to use for splitting.
  - the quality function q provided to the mechanism scores each attribute
22
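The single-query split selection can be sketched as one exponential-mechanism draw over all candidate attributes at once, instead of one noisy query per attribute. This is an illustrative sketch, not the paper's implementation; `choose_split` is our name, and the caller supplies the splitting criterion's sensitivity:

```python
import math
import random

def choose_split(gains, epsilon, sensitivity):
    """One exponential-mechanism draw over ALL candidate attributes:
    attribute a is chosen with probability proportional to
    exp(epsilon * gains[a] / (2 * sensitivity))."""
    attrs = list(gains)
    weights = [math.exp(epsilon * gains[a] / (2.0 * sensitivity))
               for a in attrs]
    total = sum(weights)
    r = random.random() * total  # sample a point on the weight line
    for a, w in zip(attrs, weights):
        r -= w
        if r <= 0:
            return a
    return attrs[-1]  # guard against floating-point rounding
```

Because the whole selection is one query, it spends the node's budget once rather than once per attribute.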
+DiffP-ID3 (cont.)
- DiffP-ID3 Classification
- Split point
  - Max( Gain(M(Job)), Gain(M(Home)), Gain(M(Hobby)) )
  - PINQ Partition
23
Id  Sex  Job      Hometown  Hobby
1   M    student  Hsinchu   sport
2   M    teacher  Taipei    writing
3   F    student  Hsinchu   singing
4   F    student  Taipei    singing
+DiffP-ID3 (cont.)
- Which quality function should be fed into the exponential mechanism?
  - the depth constraint
  - the sensitivity of the splitting criterion
- Information gain is the most sensitive to noise, while the Max operator is the least sensitive.
24
+DiffP-C4.5
- One important extension is the ability to handle continuous attributes.
  - First, the domain is divided into ranges where the score is constant. Each range is considered a discrete option.
  - Then, a point from the chosen range is sampled with uniform distribution and returned as the output of the exponential mechanism.
25
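The two steps above can be sketched as follows, assuming each range is given as (lo, hi, score) with a constant score; `sample_continuous_split` is our illustrative name. A range's probability mass is its length times the exponential-mechanism weight of its score, and the returned point is uniform within the drawn range:

```python
import math
import random

def sample_continuous_split(ranges, epsilon, sensitivity=1.0):
    """Exponential mechanism over a continuous domain, given as
    (lo, hi, score) ranges of constant score. Step 1: draw a range
    with weight length * exp(epsilon * score / (2 * sensitivity)).
    Step 2: return a uniform sample from the drawn range."""
    weights = [(hi - lo) * math.exp(epsilon * s / (2.0 * sensitivity))
               for lo, hi, s in ranges]
    total = sum(weights)
    r = random.random() * total
    for (lo, hi, _), w in zip(ranges, weights):
        r -= w
        if r <= 0:
            return random.uniform(lo, hi)
    lo, hi, _ = ranges[-1]  # guard against floating-point rounding
    return random.uniform(lo, hi)
```

Discretizing by constant-score ranges keeps the mechanism exact: every point inside a range has the same weight, so uniform sampling within the range matches the continuous exponential-mechanism density.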
+Outline
- Introduction
- Background
- Method
- Experiment
- Conclusion
- Thoughts
26
+Experiment
- It defines a domain with ten nominal attributes and a class attribute, taken from another paper.
- It introduces noise into the samples by reassigning attributes and classes, replacing each value with a given noise probability.
- For testing, it similarly generated a noiseless test set with 10,000 records.
27
+
28
- the average accuracy is higher as more training samples become available
- the influence of the noise weakens as the number of samples grows when using Gini and Max
+
29
- three of the ten attributes were replaced with numeric attributes over the domain [0, 100]
- Figure 4 presents the results of a similar experiment
+
30
- for smaller training sets, ID3 allows for better accuracy
- for larger training sets, C4.5 is better than ID3
+
31
- the accuracy results presented in Figure 6 were around 5% lower, and sometimes even more, than the results presented in Figure 7
- when the size of the dataset is small, algorithms that make efficient use of the privacy budget are superior
+Outline
- Introduction
- Background
- Method
- Experiment
- Conclusion
- Thoughts
32
+Conclusion
- When the number of training samples is relatively small, or the privacy constraints set by the data provider are very limiting, the sensitivity of the calculations becomes crucial.
33
+Future work
- One solution might be to consider other stopping rules when selecting nodes, trading possible improvements in accuracy for increased stability.
- In addition, it may be fruitful to consider different tactics for budget distribution.
34
+Outline
- Introduction
- Background
- Method
- Experiment
- Conclusion
- Thoughts
35
+Thoughts
36
+Thanks for listening.
2014/10/3 (Fri.) @ MakeLab Group Meeting
[email protected]