Privacy Preserving Data Mining Lecture 2 Cryptographic Solutions
description
Transcript of Privacy Preserving Data Mining Lecture 2 Cryptographic Solutions
page 1March 1, 2005 10th Estonian Winter School in Computer Science
Privacy Preserving Data Mining
Lecture 2
Cryptographic Solutions
Benny PinkasHP Labs, Israel
page 2March 1, 2005 10th Estonian Winter School in Computer Science
Secure two-party computation - definition
x y
F(x,y) and nothing else
Input:Output:
x yAs if…
F(x,y) F(x,y)
page 3March 1, 2005 10th Estonian Winter School in Computer Science
Secure Function Evaluation
• A major topic of cryptographic research• How to let n parties, P1,..,Pn compute a function
F(x1,..,xn) – Where input xi is known to party Pi
– Parties learn the final input and nothing else
• Caveat: cryptographic definitions of secure computation are both too strong and too weak:– Too strong: do not allow leakage of harmless
information; the price of this extra security is in efficiency.
– Too weak: do not address leakage or misuse caused by the function itself (e.g., information implied by the outputs, or misbehavior in choosing an input).
page 5March 1, 2005 10th Estonian Winter School in Computer Science
Secure Function Evaluation
•Major Result [Yao]: “Any function that can be evaluated using polynomial resources can be securely evaluated using polynomial resources”(under some cryptographic assumption)
page 6March 1, 2005 10th Estonian Winter School in Computer Science
SFE Building Block: 1-out-of 2 Oblivious Transfer
Learns nothing
Yj
Alicej{0,1}
Bob
Y0, Y1
• 1-out-of-2 OT can be based on most public key systems
• There are implementations with two communication rounds
page 7March 1, 2005 10th Estonian Winter School in Computer Science
General Two party Computation
Two party protocol
• Input:– Sender: Function F (some representation)
• The sender’s input Y is already embedded in F
– Receiver: X 0,1n
•Output:– Receiver: F(x) and nothing else about F– Sender: nothing about x
page 8March 1, 2005 10th Estonian Winter School in Computer Science
Representations of F
• Boolean circuits [Yao,GMW,…]• Algebraic circuits [BGW,…]• Low deg polynomials [BFKR]• Matrices product over a large field
[FKN,IK]• Randomizing polynomials [IK]• Communication Complexity Protocol
[NN]
page 9March 1, 2005 10th Estonian Winter School in Computer Science
Secure two-party computation of general functions [Yao]
•First, represent the function F as a Boolean circuit C– It’s always possible– Sometimes it’s easy (additions,
comparisons)– Sometimes the result is inefficient (e.g. for
indirect addressing, e.g. A[x] )
•Then, “garble” the circuit
•Finally, evaluate the garbled circuit
page 10
March 1, 2005 10th Estonian Winter School in Computer Science
Garbling the circuit
• Bob constructs the circuit, and then garbles it.
G
wi0,wi
1 wJ0,wJ
1
wk0,wk
1
W values will serve as cryptographic keys
Wk0 0 on wire
kWk
1 1 on wire k
(Alice will learn onestring per wire, butnot which bit it corresponds to.)
page 11
March 1, 2005 10th Estonian Winter School in Computer Science
Gate tables
• For every gate, every combination of input values is used as a key for encrypting the corresponding output
• Assume G=AND. Bob constructs a table:– Encryption of wk
0 using keys wi0,wJ
0 (AND(0,0)=0)– Encryption of wk
0 using keys wi0,wJ
1 (AND(0,1)=0)– Encryption of wk
0 using keys wi1,wJ
0 (AND(1,0)=0)– Encryption of wk
1 using keys wi1,wJ
1 (AND(1,1)=1)
• Result: given wix,wJ
y, can compute wkG(x,y)
page 12
March 1, 2005 10th Estonian Winter School in Computer Science
Secure computation
• Bob sends the table of gate G to Alice• Given, e.g., wi
0,wJ1, Alice computes wk
0 by decrypting the corresponding entry in the table, but she does not know the actual values of the wires.
G
wi0,wi
1 wJ0,wJ
1
wk0,wk
1
Encryption of wk0 using keys
wi0,wJ
0
Encryption of wk0 using keys
wi0,wJ
1
Encryption of wk1 using keys
wi1,wJ
1
Encryption of wk0 using keys
wi1,wJ
0
Permuted order
page 13
March 1, 2005 10th Estonian Winter School in Computer Science
Secure computation
•Bob sends to Alice– Tables encoding each circuit gate.– Garbled values (w’s) of his input values.– Translation from garbled values of output
wires to actual 0/1 values.
• If Alice gets garbled values (w’s) of her input values, she can compute the output of the circuit, and nothing else.
page 14
March 1, 2005 10th Estonian Winter School in Computer Science
Alice’s input
•For every wire i of Alice’s input:– The parties run an OT protocol– Alice’s input is her input bit (s).– Bob’s input is wi
0,wi1
– Alice learns wis
•The OTs for all input wires can be run in parallel.
•Afterwards Alice can compute the circuit by herself.
page 15
March 1, 2005 10th Estonian Winter School in Computer Science
Secure computation – the big picture
•Represent the function as a circuit C•Bob sends to Alice 4|C| encryptions (e.g. 64|C| Bytes), 4 encryptions for every gate.
•Alice performs an OT for every input bit. (Can do, e.g. 100-1000 OTs per sec.)
•~One round of communication.•Efficient for medium size circuits!
page 16
March 1, 2005 10th Estonian Winter School in Computer Science
Example
• The Millionaires problem: comparing two N bit numbers
• What’s the overhead?
page 17
March 1, 2005 10th Estonian Winter School in Computer Science
Applications
• Two parties. Two large data sets.• Max?• Mean? • Median?• Intersection?• Decision Tree learning? ID3?
page 18
March 1, 2005 10th Estonian Winter School in Computer Science
Fairplay – a secure two-party computation systemMalkhi, Nissan, P., Sella
• A a full fledged secure two-party computation system, implementing Yao’s “garbled circuit” protocol.
• Goals:– Investigate whether two-party SFE is practical– Actual measurements of overall computation– Breakdown of computation into parts– Computation versus communication?– Test-bed for various optimizations
page 19
March 1, 2005 10th Estonian Winter School in Computer Science
Fairplay
• The Compilation paradigm– Programs written in SFDL, a high-level
programming language– Allows clear, formal, easily understandable
definition and requirements by humans
– SHDL: Low-level language describing Boolean circuits
– SFDL SHDL compiler and optimizer– SHDL Java programs implementing Yao’s
protocol
page 20
March 1, 2005 10th Estonian Winter School in Computer Science
Fairplay – SFDL example
program Millionaires { type int = Int<20>; // 20-bit integer type AliceInput = int; type BobInput = int; type AliceOutput = Boolean; type BobOutput = Boolean; type Output = struct {AliceOutput alice, BobOutput bob}; type Input = struct {AliceInput alice, BobInput bob};
function Output output(Input input) { output.alice = input.alice > input.bob; output.bob = input.bob > input.alice; }
page 21
March 1, 2005 10th Estonian Winter School in Computer Science
SFDL properties
• Conventional syntax (C/Pascal-like)• Type system – Boolean, integer, enumerated• Program structure
– Declarations: global constants, types– Sequence of functions (no nesting [C], no recursion)– Function name is its return value [Pascal]
• Conditional execution and loops– if-then, if-then-else statements, For-loop (loop boundaries
should be known at compile time)• Assignments and expressions
– constants, variables, array entries, structure items, function calls, operators (+, -, logical, comparison), parenthesis
page 22
March 1, 2005 10th Estonian Winter School in Computer Science
SHDL example
0 input //output$input.bob$01 input //output$input.bob$12 input //output$input.bob$23 input //output$input.bob$34 input //output$input.alice$05 input //output$input.alice$16 input //output$input.alice$27 input //output$input.alice$38 gate arity 2 table [ 1 0 0 0 ] inputs [ 4 5 ]9 gate arity 2 table [ 0 1 1 0 ] inputs [ 4 5 ]
page 23
March 1, 2005 10th Estonian Winter School in Computer Science
kth-ranked element (e.g. median)
• Inputs:– Alice: SA Bob: SB
– Large sets of unique items (D).• Output:
– x SA SB s.t. x has k-1 elements smaller than it.• The rank k
– Could depend on the size of input datasets. – Median: k = (|SA| + |SB|) / 2
• Motivation:– Basic statistical analysis of distributed data.– E.g. histogram of salaries in CS departments
• The Problem: Generic constructions using circuits [Yao …] yield an overhead which is at least linear in k.
page 24
March 1, 2005 10th Estonian Winter School in Computer Science
An (insecure) two-party median protocol
RALASA
SB
mA
RBLB mB
LA lies below the median, RB lies above the median.
New median is same as original median.Recursion Need log n rounds
(assume each set contains n=2i items)
mA < mB
page 25
March 1, 2005 10th Estonian Winter School in Computer Science
A Secure two-party median protocol
A finds its median mA
B finds itsmedian mB
mA < mB
A deletes elements ≤ mA.B deletes elements > mB.
A deletes elements > mA.B deletes elements ≤ mB.
YES
NO
Secure comparison(e.g. a small circuit)
page 26
March 1, 2005 10th Estonian Winter School in Computer Science
An example
A B
mA>mB
mA<mB
mA<mB
mA>mB
mA<mB
Medianfound!!
8 9 16
1
16 161 1
page 27
March 1, 2005 10th Estonian Winter School in Computer Science
Proof of security
A B
mA>mB
mA<mB
mA<mB
mA>mB
mA<mB
median
mA>mB
mA<mB
mA<mB
mA>mB
mA<mB
page 28
March 1, 2005 10th Estonian Winter School in Computer Science
+
Arbitrary input size, arbitrary k
SA
SB
k
Now, compute the median of two sets of size k.
Size should be a power of 2.
median of new inputs = kth element of original inputs
2i
+
-
page 29
March 1, 2005 10th Estonian Winter School in Computer Science
Hiding size of inputs
• Can search for kth element without revealing size of input sets.
• However, k=n/2 (median) reveals input size.• Solution: Let S=2i be a bound on input size.
|SA|S
-+
-+
|SB|
Median of new datasets is same
as median of original datasets.
page 30
March 1, 2005 10th Estonian Winter School in Computer Science
Privacy preserving data mining
Confidential database D1
Wish to “mine” D1 D2 without revealing more info
Examples:• Medical databases protected by law• Competing businesses• Government agencies (privacy, “need to know”)
Confidential database D2
P1P2
Huge
page 31
March 1, 2005 10th Estonian Winter School in Computer Science
The classification problem
Age > 30 Sex time insured Claim > $500
Did fraud occur?
C1 Yes M t [0,9] years No No
C2 No F t [10,19] years Yes Yes
… … … … … …
Cn Yes F t [20,29] years No No
Goal: based on available data design an algorithm to classify new data
page 32
March 1, 2005 10th Estonian Winter School in Computer Science
Classification using Decision Trees
Time insured
No
[0,9] years > 20 years [10,19] years
Age > 30
No Yes
YesNo
Claim > $500
No Yes
YesNo
ID3: Choose attribute A thatminimizes the conditionalentropy of the attribute class
page 33
March 1, 2005 10th Estonian Winter School in Computer Science
Privacy Preserving ID3
• Scenario: The inputs are private information of P1 and P2
• Main technical problem: Comparing entropies while preserving privacy. (entropy = x logx)
• Efficiency: – most computation done independently by parties.– The overhead of cryptographic operations depends only on
the size of the decision tree (not on the input size).
• Basic task: compute x log x.
x = x1+x2 = e.g., total number of customers with (age
> 30) and (fraud = yes)
page 34
March 1, 2005 10th Estonian Winter School in Computer Science
Privacy Preserving ID3
•Computing x log x:– x = x1 + x2, known to P1 and P2
respectively (independently computed from databases).
– Might as well compute x lnx, or lnx.– First run a protocol to compute random
shares, y1 + y2 = ln x• ln x is Real. Crypto works over finite fields. Must do numerical analysis.
page 35
March 1, 2005 10th Estonian Winter School in Computer Science
Cryptographic Tools
x
Implementation:Two passes, O(degree) (or O( log|F|) ) exponentiations.
A polynomial Q(·)
Q(x) and nothing else nothingInput:Output:
• Secure Function Evaluation (SFE) [Yao]• Oblivious Polynomial Evaluation [NP]
page 36
March 1, 2005 10th Estonian Winter School in Computer Science
Computing random shares of lnx = ln(x1+x2)
Use Taylor approximation for lnx– x = x1 + x2 = 2 n (1+) -½ < < ½ – lnx = ln(2 n (1+)) = ln 2 n + ln(1+)
ln 2 n + i=1..k (-1) i-1 i / i
= ln 2 n + T()
• T() is a polynomial of degree k. Error is exponentially small in k.
• We only know how to work over finite fields• Compute c·lnx, where c compensates for fractions.• Work in F, where |F| sufficiently large.
page 37
March 1, 2005 10th Estonian Winter School in Computer Science
ln(x1+x2) Protocol
• Step 1 of the protocol – Find n, – Apply Yao’s protocol to the following small circuit
• Input: x1 and x2• Output (random shares):
•random a1 and a2 s.t. a1 + a2 = x-2 n = ·2 n
•random b1 and b2 s.t. b1 + b2 = ln 2 n
• Operation: The protocol finds 2 n closest to x1+x2, computes 2 n = x1+x2- 2 n.
– x = x1 + x2 = 2 n + 2 n
– lnx = ln(2 n (1+)) = ln 2 n + ln(1+)
page 38
March 1, 2005 10th Estonian Winter School in Computer Science
ln(x1+x2) Protocol (Cont.)
Step 2 of the protocol– Compute random shares of T() (Taylor approx.)
– P1 chooses a random w1 F and defines a polynomial Q(x), s.t. w1 +Q(a2) = T() (recall a1 + a2 = ·2 n)
– Namely, Q(x) = T( (a1+x)/2 n ) – w1 . – Run an oblivious poly evaluation in which P2
computes
w2 = Q(a2) = T() – w1 . – Now the parties have random w1 and w2 s.t.
– w1 + w2 = T() ln(1+)
– (b1 + w1) + (b2 + w2) ln 2 n + ln(1+) = ln x
page 40
March 1, 2005 10th Estonian Winter School in Computer Science
The rest of the work..
• The parties compute shares of lnx• Then they compute shares of xlnx• Each party computes a share of the entropy by
summing shares of x lnx (H(X) = x lnx )• A small circuit finds the attribute giving the
minimal conditional entropy• The attribute is assigned to the node• The databases are divided according to the
value of this attribute
page 41
March 1, 2005 10th Estonian Winter School in Computer Science
Efficiency
• lnx protocol: – secure computation of a small circuit– one oblivious polynomial evaluation
• ID3 for a database with: – 1,000,000 transactions– 15 attributes– 10 values per attribute– 4 class values
– Communication per node takes seconds (T1)
– Computation per node takes minutes (P3)
page 42
March 1, 2005 10th Estonian Winter School in Computer Science
Contributions
• Cryptographic protocols where the bulk of the operations is done independently.
• Data mining– Rigorous model for secure data-mining.– Efficient, secure protocol for specific problems (median, ID3).
• Cryptography– Sub-linear complexity - secure computation for large data sets.– Efficient protocols for complex known algorithms.– Secure computation of logarithms (real function - numerical
analysis).• Drawbacks:
– Privacy preserving solutions are less efficient– It’s hard to find efficient private solutions for all interesting
functions – Security against malicious parties
page 43
March 1, 2005 10th Estonian Winter School in Computer Science
References
• Lecture notes and overview papers:– B. Pinkas, Cryptographic Techniques for Privacy-Preserving
Data Mining, SIGKDD Explorations, January 2003. http://www.pinkas.net/PAPERS/sigkdd.pdf
– R. Cramer: Introduction to Secure Computation, 2000. http://homepages.cwi.nl/~cramer/papers/CRAMER_revised.ps
– Ivan Damgård, Theory and practice of multiparty computation, 8th EWSCS, http://www.cs.ioc.ee/yik/schools/win2003/damgard.php
• Research papers:– G. Aggarwal, N. Mishra and B. Pinkas, Secure Computation of
the K'th-ranked Element, Eurocrypt '2004. http://www.pinkas.net/PAPERS/ANP04.pdf
– Y. Lindell and B. Pinkas, Privacy Preserving Data Mining, Journal of Cryptology, Vol. 15 – No. 3, 2002. http://www.pinkas.net/PAPERS/id3-final.pdf