A Machine Learning Approach to Privacy-Preserving Data Mining
Using Homomorphic Encryption
Seiichi Ozawa
Center for Mathematical Data Science
Graduate School of Engineering, Kobe University
What is PPDM?
Big data contains lots of sensitive private information such as names, addresses, phone numbers, etc.
Obviously, such sensitive data should be properly masked before analysis, but masking can erase valuable information from a database.
How can we analyze big data to extract useful rules in a legitimate way?
Privacy-Preserving Data Mining (PPDM) / Privacy-Preserving Machine Learning (PPML)
Approaches to PPDM (1)
1. Homomorphic Encryption
A form of encryption that allows computation on ciphertexts.
- Additive HE: Paillier
- Multiplicative HE: Unpadded RSA, ElGamal
- Fully HE: addition + multiplication
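The additive property can be made concrete with a toy Paillier sketch in Python. This is only a sketch under illustrative assumptions: the primes are tiny and the function names are ours; real use requires cryptographically large primes and a vetted library.

```python
import math
import random

# Toy Paillier cryptosystem (illustrative only: real use needs large primes).
p, q = 17, 19                       # tiny primes for demonstration
n, n2 = p * q, (p * q) ** 2
g = n + 1                           # standard generator choice for Paillier
lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)   # lambda = lcm(p-1, q-1)

def L(x):                           # Paillier's L function: L(x) = (x-1)/n
    return (x - 1) // n

mu = pow(L(pow(g, lam, n2)), -1, n) # precomputed decryption constant

def encrypt(m):                     # c = g^m * r^n mod n^2, random r in Z_n*
    r = random.randrange(2, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(2, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):                     # m = L(c^lambda mod n^2) * mu mod n
    return (L(pow(c, lam, n2)) * mu) % n

# Additive homomorphism: multiplying ciphertexts adds the plaintexts.
c_sum = (encrypt(7) * encrypt(15)) % n2
print(decrypt(c_sum))               # 22
```

Multiplicative and fully homomorphic schemes extend this idea so that richer circuits can be evaluated directly on ciphertexts.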
2. Garbled Circuits
A cryptographic protocol that enables two-party secure computation, in which two mistrusting parties can jointly evaluate a function over their private inputs without the presence of a trusted third party. (Wikipedia)
Approaches to PPDM (2)
3. Secret Sharing
An approach to distributing a secret amongst a group of participants, each of whom is allocated a share of the secret. The secret can be reconstructed only when a sufficient number, of possibly different types, of shares are combined together; individual shares are of no use on their own. (Wikipedia)
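A minimal additive 2-out-of-2 sharing sketch shows the idea (Shamir's scheme generalizes this to arbitrary thresholds; the modulus and values here are illustrative):

```python
import random

M = 2 ** 61 - 1                     # shares live in Z_M (illustrative modulus)

def share(secret):
    s1 = random.randrange(M)        # uniformly random: reveals nothing alone
    s2 = (secret - s1) % M          # together the shares sum to the secret
    return s1, s2

x1, x2 = share(42)
y1, y2 = share(100)

# Each party adds its own shares locally; the results form a valid
# sharing of the sum, so addition requires no communication at all.
z1, z2 = (x1 + y1) % M, (x2 + y2) % M
print((z1 + z2) % M)                # 142
```

This local-addition property is what makes secret sharing attractive for multi-party computation on distributed data.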
4. Perturbation Approaches
Adding random noise to avoid leaking information, using a mechanism satisfying Differential Privacy.
- Input Perturbation
- Algorithm Perturbation
- Output Perturbation
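Output perturbation can be sketched with the classic Laplace mechanism, which adds noise of scale sensitivity/ε to a query answer to satisfy ε-differential privacy (the function names and the example counting query are our own illustrations):

```python
import math
import random

def laplace_noise(scale):
    # Inverse-CDF sampling of a Laplace(0, scale) variate.
    u = random.random() - 0.5                   # uniform on [-0.5, 0.5)
    sign = 1.0 if u >= 0 else -1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u) + 1e-300)  # guard log(0)

def private_count(true_count, epsilon, sensitivity=1.0):
    # A count changes by at most 1 when one record changes, so sensitivity = 1.
    return true_count + laplace_noise(sensitivity / epsilon)

noisy = private_count(120, epsilon=1.0)         # e.g. "how many records match?"
print(round(noisy, 2))                          # 120 plus noise of typical size ~1
```

Smaller ε means stronger privacy but larger noise; the analyst trades accuracy for the privacy guarantee.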
A New Direction Using PPDM: Fintech
[Figure: In the current approach, an analyst at each bank (Bank A, Bank B, Bank C) individually analyzes its own transaction data, ATM data, and internet banking data. In the new approach, data from Bank A, Bank B, Bank C, and other sources are integrated and automated by a Privacy-Preserving Data Mining Engine performing machine learning over encrypted data, enabling e.g. detection of illegal money transfers and calculation of proper interest rates.]
Privacy-Preserving Platform on Cloud Computing
Additively Homomorphic Encryption
Privacy-Preserving Extreme Learning Machine
Sharing Roles in Computation
Data Contributor
• Nonlinear calculation with an activation function
• Multiplication and inner products
Outsourced Server
• Summation of N data with additive HE
Data Analyst
• Calculating an inverse matrix and the output weights
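Stripping away the encryption, the role split can be sketched in plain Python. This is a toy 1-D regression with L = 2 hidden units; the dataset, weights, and sigmoid activation are all illustrative assumptions, and the summations marked for the server would be carried out over ciphertexts in the actual protocol:

```python
import math
import random

random.seed(1)
g = lambda z: 1.0 / (1.0 + math.exp(-z))      # activation function (sigmoid)

# Random hidden-layer parameters (fixed and shared, as in ELM).
w = [random.uniform(-1, 1) for _ in range(2)]
b = [random.uniform(-1, 1) for _ in range(2)]

# Data Contributors: each holds (x_i, t_i) and computes its hidden-layer
# outputs h_i plus the local products h_i h_i^T and t_i h_i.
data = [(0.0, 0.1), (0.5, 0.9), (1.0, 1.8), (1.5, 3.2)]

A = [[0.0, 0.0], [0.0, 0.0]]                  # accumulates H^T H
B = [0.0, 0.0]                                # accumulates H^T t
for x, t in data:
    h = [g(w[k] * x + b[k]) for k in range(2)]
    # Outsourced Server: these summations over contributors are what the
    # protocol performs under additive HE; here they run in the clear.
    for j in range(2):
        B[j] += t * h[j]
        for k in range(2):
            A[j][k] += h[j] * h[k]

# Data Analyst: decrypts A and B, then solves beta = A^{-1} B (2x2 inverse).
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
beta = [(A[1][1] * B[0] - A[0][1] * B[1]) / det,
        (A[0][0] * B[1] - A[1][0] * B[0]) / det]
print(beta)                                   # output weights of the ELM
```

Only the summation step touches pooled data, which is why additive HE alone suffices for the server's role.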
Performance Evaluation
[Figure: evaluation results vs. L (#hidden units); differences of +0.04〜0.12]
Data Sets: 4 benchmark datasets from the Machine Learning Repository
Encryption: LWE-based homomorphic encryption
Privacy-Preserving Naïve Bayes Classification
Classification Using Posterior Probability
Naïve Bayes Classification
Probability Estimation:
m: #training samples, m_i: #class-i samples, m_it: #occurrences of x_t in class-i samples
x: input, λ: #classes
Assuming independence among the components of x:
P(i|x) ∝ P(i) Π_{t=1..d} P(x_t|i), with P(i) = m_i/m and P(x_t|i) = m_it/m_i
Privacy-Preserving Naïve Bayes Classification
Calculation of Posterior Probability
P(i|x) ∝ (m_i/m) Π_{t=1..d} (m_it/m_i) = (1/m) m_i^(1-d) Π_t m_it  (d: dimensionality)
Obtain the class i* with the largest posterior probability:
m_i*^(1-d) Π_t m_i*t ≥ m_j^(1-d) Π_t m_jt  for the other labels j
Multiplying both sides by (m_i* m_j)^(d-1):
m_j^(d-1) Π_t m_i*t ≥ m_i*^(d-1) Π_t m_jt
Both sides are securely computed using homomorphic encryption.
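The division-free comparison can be checked on illustrative counts (all numbers below are hypothetical); avoiding division is what makes the test computable under homomorphic encryption:

```python
# Division-free Naive Bayes class comparison on illustrative counts.
# m_i: class sizes; m_it: counts of the observed feature values per class.
d = 3                                   # dimensionality of input x
m_i = {0: 21, 1: 10}
m_it = {0: [3, 5, 7], 1: [2, 2, 9]}

def prod(xs):
    r = 1
    for v in xs:
        r *= v
    return r

# P(0|x) >= P(1|x)  <=>  m_1^(d-1) * prod_t m_0t  >=  m_0^(d-1) * prod_t m_1t
lhs = m_i[1] ** (d - 1) * prod(m_it[0])     # side favoring class 0
rhs = m_i[0] ** (d - 1) * prod(m_it[1])     # side favoring class 1
winner = 0 if lhs >= rhs else 1
print(winner)                               # 1
```

Both sides involve only multiplications of counts, so they can be evaluated directly on ciphertexts.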
System Configuration: Multi-Party Computation
- CS1 and CS2 do not collude.
- All participants are assumed to be "honest-but-curious" (they follow the protocols but may want to learn information about the data).
- CS1 has no access to Alice's secret key; CS2 knows Alice's secret key.
- All encryption uses Alice's public key.
Computation of Epk(m_i^(d-1))
[Figure: Alice encrypts her one-hot class labels with homomorphic encryption and provides them to CS1. CS1 adds the encrypted label vectors to obtain the encrypted class counts, e.g. (21, 10, 8, 12), then element-wise multiplies this vector with itself to obtain Epk(m_i^(d-1)), e.g. (9261, 1000, 512, 1728) for d-1 = 3.]
Computation of Epk(m_i^(d-1) Π m_jt)
[Figure: key point of the calculation of m_jt. For each sample, the outer product of the encrypted one-hot label vector and the encrypted one-hot vector for feature t is a matrix with a single 1 at the (class, feature value) position. Accumulating these matrices over all samples yields the encrypted co-occurrence counts m_jt, e.g. a count matrix (1 14 2 1 / 21 1 13 11 / 5 3 31 2 / 25 2 1 7). Since everything remains encrypted, CS1 still cannot observe the actual values.]
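The accumulation shown in the figure can be sketched in the clear (the one-hot vectors and the 4-class, 4-value feature are illustrative; in the protocol both vectors are encrypted and the products are homomorphic):

```python
# Accumulating co-occurrence counts m_jt from one-hot vectors.
labels = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 0, 0, 0]]  # one-hot class labels
feats  = [[0, 1, 0, 0], [0, 1, 0, 0], [1, 0, 0, 0]]  # one-hot feature-t values

C = [[0] * 4 for _ in range(4)]         # C[j][v] accumulates m_jt
for y, x in zip(labels, feats):
    for j in range(4):
        for v in range(4):
            # Outer product: a single 1 at (class, feature value) per sample.
            C[j][v] += y[j] * x[v]
print(C[0])                             # [1, 1, 0, 0]
```

Because each per-sample matrix contains only a single 1, summing them yields exactly the count of each (class, feature value) pair.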
Maximization of Posterior Probability
CS1 constructs a matrix Y from the encrypted comparison values, applies a random rotation, and sends it to CS2, which holds Alice's secret key. CS2 decrypts and finds the maximum column. The random rotation prevents CS2 from observing the actual index (the classification result). For enhanced security, we are considering adopting the garbled circuit method here.
Sending Classification Result
[Figure: CS2 sends a one-hot vector z that has 1 at the place of the maximum column of the rotated matrix; CS1 sends the rotation parameter k (ex. 2). Combining the two, Bob obtains the classification result (ex. class 2). The classification result is thus computed using information from both sides.]
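The rotation trick can be sketched in plaintext (the scores and rotation are illustrative stand-ins for the encrypted values exchanged between CS1, CS2, and Bob):

```python
import random

scores = [9261, 1000, 512, 1728]       # per-class values (encrypted at CS1)
k = random.randrange(len(scores))      # CS1's secret rotation parameter

rotated = scores[k:] + scores[:k]      # CS1 rotates before sending to CS2
j = max(range(len(rotated)), key=rotated.__getitem__)   # CS2 finds the max
z = [1 if i == j else 0 for i in range(len(rotated))]   # one-hot vector z

# CS2 alone sees only j, an index scrambled by k; combining z with k
# recovers the true class, so the result needs information from both sides.
true_class = (j + k) % len(scores)
print(true_class)                      # 0 (index of the maximum, 9261)
```

Neither server learns the result by itself: CS2 sees only a rotated index, and CS1 never sees the decrypted values.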
Performance Evaluation
Data Sets: Iris dataset from the UCI ML Repository (training:test = 80:20)
Preprocessing: each real-valued feature is encoded into a one-hot encoding (dimension: 5)
Encryption: HElib (implementation of the Brakerski-Gentry-Vaikuntanathan scheme)
Prediction Accuracy (smoothing: allocation of a small value to zero frequencies)
Execution Time:
- Protocol 1: computation of Epk(m_i^(d-1))
- Protocol 2: computation of Epk(H)
- Protocol 3: obtaining the classification result
Concluding Remarks
1. Privacy-Preserving Data Mining (PPDM) suggests a new direction of AI applications for Big Data.
2. Aggregating big data provided by multiple organizations could have a major impact on Big Data analysis.
3. Two machine learning approaches (i.e., PP-ELM and PP-NBC) are introduced.
4. The number of papers on PPDM/PPML is rapidly increasing at top conferences (ICML, USENIX, ACM CCS, etc.).