Oren Fine Nov. 2008 CS Seminar in Databases (236826)
![Page 1: Oren Fine Nov. 2008 CS Seminar in Databases (236826)](https://reader036.fdocuments.us/reader036/viewer/2022062422/568135cb550346895d9d2f1e/html5/thumbnails/1.jpg)
To Do or Not To Do: The Dilemma of Disclosing
Anonymized Data
Lakshmanan L., Ng R., Ramesh G.
Univ. of British Columbia
Oren Fine
Nov. 2008
CS Seminar in Databases (236826)
![Page 2: Oren Fine Nov. 2008 CS Seminar in Databases (236826)](https://reader036.fdocuments.us/reader036/viewer/2022062422/568135cb550346895d9d2f1e/html5/thumbnails/2.jpg)
Once Upon a Time…
• The police are after Edgar, a suspected drug lord.
– Intelligence has gathered call and meeting data records as a transactional database
– In order to positively incriminate Edgar, the police must find hard evidence, and wish to outsource data mining tasks to “We Mind your Data Ltd.”
– But the police are subject to the law, and are obligated to protect the privacy of the people in the database, including Edgar, who is innocent until proven otherwise.
– Furthermore, Edgar is looking for the smallest hint that he should disappear…
![Page 3: Oren Fine Nov. 2008 CS Seminar in Databases (236826)](https://reader036.fdocuments.us/reader036/viewer/2022062422/568135cb550346895d9d2f1e/html5/thumbnails/3.jpg)
I have the pleasure of introducing: Edgar vs. The Police
![Page 4: Oren Fine Nov. 2008 CS Seminar in Databases (236826)](https://reader036.fdocuments.us/reader036/viewer/2022062422/568135cb550346895d9d2f1e/html5/thumbnails/4.jpg)
Motivation
• The Classic Dilemma:
– Keep your data close to your chest and never risk privacy or confidentiality, or…
– Disclose the data and gain potentially valuable knowledge and benefits
• In order to decide, we need to answer a major question
– “Just how safe is the anonymized data?”
– Safe = protecting the identities of the objects.
![Page 5: Oren Fine Nov. 2008 CS Seminar in Databases (236826)](https://reader036.fdocuments.us/reader036/viewer/2022062422/568135cb550346895d9d2f1e/html5/thumbnails/5.jpg)
Agenda
• Anonymization
• Model the Attacker’s Knowledge
• Determine the risk to our data
![Page 6: Oren Fine Nov. 2008 CS Seminar in Databases (236826)](https://reader036.fdocuments.us/reader036/viewer/2022062422/568135cb550346895d9d2f1e/html5/thumbnails/6.jpg)
Anonymization or De-Identification
• Transform sensitive data into generated unique content (strings, numbers)
• Example
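Original database:

| TID | Names |
|-----|-------|
| 1 | {Hussein, Hassan, Dimitri} |
| 2 | {Hussein, Edgar, Angela} |
| 3 | {Angela, Edgar} |
| 4 | {Raz, Adi, Yishai} |
| 5 | {Hassan, Yishai, Dimitri, Raz} |
| 6 | {Raz, Angela, Nithai} |

Anonymized database:

| TID | Transaction |
|-----|-------------|
| 1 | {1, 2, 3} |
| 2 | {1, 4, 5} |
| 3 | {5, 4} |
| 4 | {6, 7, 8} |
| 5 | {2, 8, 3, 6} |
| 6 | {6, 5, 9} |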
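The transformation in this example can be sketched in a few lines of Python. This is only an illustration, not the procedure from the paper: the function name `anonymize` is hypothetical, and IDs are assigned in order of first appearance purely so that the output matches the slide. A real release would presumably use a random assignment.

```python
def anonymize(transactions):
    """Replace each distinct name with a unique integer ID.

    Assumption: IDs are handed out in order of first appearance, which
    happens to reproduce the slide's table. This is not a secure choice.
    """
    mapping = {}
    anonymized = []
    for names in transactions:
        row = []
        for name in names:
            if name not in mapping:
                mapping[name] = len(mapping) + 1
            row.append(mapping[name])
        anonymized.append(row)
    return anonymized, mapping


transactions = [
    ["Hussein", "Hassan", "Dimitri"],
    ["Hussein", "Edgar", "Angela"],
    ["Angela", "Edgar"],
    ["Raz", "Adi", "Yishai"],
    ["Hassan", "Yishai", "Dimitri", "Raz"],
    ["Raz", "Angela", "Nithai"],
]
anonymized, mapping = anonymize(transactions)
```

Because the mapping is one-to-one, every item keeps its frequency after anonymization, which is exactly the property an attacker with prior knowledge about frequencies can exploit.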
![Page 7: Oren Fine Nov. 2008 CS Seminar in Databases (236826)](https://reader036.fdocuments.us/reader036/viewer/2022062422/568135cb550346895d9d2f1e/html5/thumbnails/7.jpg)
Anonymization or De-Identification
• Advantages
– Very simple
– Does not affect the final outcome or perturb data characteristics
• We do not suggest that anonymization is the “right” way, but it is probably the most common
![Page 8: Oren Fine Nov. 2008 CS Seminar in Databases (236826)](https://reader036.fdocuments.us/reader036/viewer/2022062422/568135cb550346895d9d2f1e/html5/thumbnails/8.jpg)
Frequent Set Mining Crash Course
• Transactional database
• Each transaction has a TID and a set of items
• An association rule of the form X ⇒ Y has
– Support s if s% of the transactions include both X and Y
– Confidence c if c% of the transactions that include X also include Y
• Support = frequent sets
• Confidence = association rules
• A k-itemset is a set of k items
![Page 9: Oren Fine Nov. 2008 CS Seminar in Databases (236826)](https://reader036.fdocuments.us/reader036/viewer/2022062422/568135cb550346895d9d2f1e/html5/thumbnails/9.jpg)
Example
| TID | Names |
|-----|-------|
| 1 | Angela, Ariel, Edgar, Steve, Benny |
| 2 | Edgar, Hassan, Steve, Tommy |
| 3 | Joe, Sara, Israel |
| 4 | Steve, Angela, Edgar |
| 5 | Benny, Mahhmud, Tommy |
| 6 | Angela, Sara, Edgar |
| 7 | Hassan, Angela, Joe, Edgar, Noa |
| 8 | Edgar, Benny, Steve, Tommy |
![Page 10: Oren Fine Nov. 2008 CS Seminar in Databases (236826)](https://reader036.fdocuments.us/reader036/viewer/2022062422/568135cb550346895d9d2f1e/html5/thumbnails/10.jpg)
Example (Cont.)
• First, we look for frequent sets, according to a support threshold
• 2-itemsets: {Angela, Edgar}, {Edgar, Steve} have 50% support (4 out of 8 transactions).
• 3-itemsets: {Angela, Edgar, Steve}, {Benny, Edgar, Steve} and {Tommy, Edgar, Steve} have only 25% support (2 out of 8 transactions)
• The rule {Edgar, Steve} ⇒ {Angela} has 50% confidence (2 out of 4 transactions) and the rule {Tommy} ⇒ {Edgar, Steve} has about 66.7% confidence (2 out of 3 transactions).
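The support and confidence figures in this example can be checked with a short Python sketch. The helper names `support` and `confidence` are illustrative choices, and exact fractions are used instead of percentages to avoid rounding.

```python
from fractions import Fraction

transactions = [
    {"Angela", "Ariel", "Edgar", "Steve", "Benny"},
    {"Edgar", "Hassan", "Steve", "Tommy"},
    {"Joe", "Sara", "Israel"},
    {"Steve", "Angela", "Edgar"},
    {"Benny", "Mahhmud", "Tommy"},
    {"Angela", "Sara", "Edgar"},
    {"Hassan", "Angela", "Joe", "Edgar", "Noa"},
    {"Edgar", "Benny", "Steve", "Tommy"},
]


def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return Fraction(sum(itemset <= t for t in db), len(db))


def confidence(lhs, rhs, db):
    """Confidence of the rule lhs => rhs: support(lhs and rhs) / support(lhs).

    Raises ZeroDivisionError if `lhs` never occurs in the database.
    """
    lhs, rhs = set(lhs), set(rhs)
    return support(lhs | rhs, db) / support(lhs, db)
```

For instance, `support({"Angela", "Edgar"}, transactions)` evaluates to 1/2 and `confidence({"Tommy"}, {"Edgar", "Steve"}, transactions)` to 2/3.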
![Page 11: Oren Fine Nov. 2008 CS Seminar in Databases (236826)](https://reader036.fdocuments.us/reader036/viewer/2022062422/568135cb550346895d9d2f1e/html5/thumbnails/11.jpg)
Frequent Set Mining Crash Course (You’re Qualified!)
• Widely used in market basket analysis, intrusion detection, Web usage mining and bioinformatics
• Aimed at discovering non-trivial, or not necessarily intuitive, relations between items/variables in large databases: “Extracting wisdom out of data”
• Who knows what the most famous frequent set is?
![Page 12: Oren Fine Nov. 2008 CS Seminar in Databases (236826)](https://reader036.fdocuments.us/reader036/viewer/2022062422/568135cb550346895d9d2f1e/html5/thumbnails/12.jpg)
Big Mart’s Database
![Page 13: Oren Fine Nov. 2008 CS Seminar in Databases (236826)](https://reader036.fdocuments.us/reader036/viewer/2022062422/568135cb550346895d9d2f1e/html5/thumbnails/13.jpg)
Modeling the Attacker’s Knowledge
• We believe that the attacker has prior knowledge about the items in the original domain
• This prior information concerns the frequencies of items in the original domain
• We capture the attacker’s knowledge with “Belief Functions”
![Page 14: Oren Fine Nov. 2008 CS Seminar in Databases (236826)](https://reader036.fdocuments.us/reader036/viewer/2022062422/568135cb550346895d9d2f1e/html5/thumbnails/14.jpg)
Examples of Belief Functions
![Page 15: Oren Fine Nov. 2008 CS Seminar in Databases (236826)](https://reader036.fdocuments.us/reader036/viewer/2022062422/568135cb550346895d9d2f1e/html5/thumbnails/15.jpg)
Consistent Mapping
• Mapping anonymized entities to original entities only according to the belief function
![Page 16: Oren Fine Nov. 2008 CS Seminar in Databases (236826)](https://reader036.fdocuments.us/reader036/viewer/2022062422/568135cb550346895d9d2f1e/html5/thumbnails/16.jpg)
Ignorant Belief Function (Q)
• What does the graph look like?
• What is the expected number of cracks?
• Suppose there are n items. Further suppose that we are interested only in a subset of them, of size n1
• What is the expected number of cracks now?
• Don’t you underestimate Edgar…
![Page 17: Oren Fine Nov. 2008 CS Seminar in Databases (236826)](https://reader036.fdocuments.us/reader036/viewer/2022062422/568135cb550346895d9d2f1e/html5/thumbnails/17.jpg)
Ignorant Belief Function (A)
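One way to explore these questions is by brute force. The sketch below rests on two assumptions that are stated here rather than taken from the slide: an ignorant attacker considers every one-to-one mapping consistent, and picks one of the n! mappings uniformly at random. A "crack" is an item mapped back to itself, so the expected number of cracks is the expected number of fixed points of a random permutation. The function name `expected_cracks` is hypothetical.

```python
from fractions import Fraction
from itertools import permutations


def expected_cracks(n, items_of_interest=None):
    """Average number of correctly guessed items over all n! mappings.

    Assumption: the attacker chooses uniformly among all permutations.
    Only feasible for small n, since it enumerates every permutation.
    """
    interest = set(range(n) if items_of_interest is None else items_of_interest)
    hits = 0
    count = 0
    for perm in permutations(range(n)):
        hits += sum(1 for i in interest if perm[i] == i)
        count += 1
    return Fraction(hits, count)
```

Under these assumptions the enumeration gives exactly 1 expected crack regardless of n, and n1/n when only n1 of the n items are of interest.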
![Page 18: Oren Fine Nov. 2008 CS Seminar in Databases (236826)](https://reader036.fdocuments.us/reader036/viewer/2022062422/568135cb550346895d9d2f1e/html5/thumbnails/18.jpg)
Compliant Point-Valued Belief Function (Q)
• What does the graph look like?
• What is the expected number of cracks?
• Suppose there are n items. Further suppose that we are interested only in a subset of them, of size n1
• What is the expected number of cracks now?
• Unless he has an inside source, we shouldn’t overestimate Edgar either…
![Page 19: Oren Fine Nov. 2008 CS Seminar in Databases (236826)](https://reader036.fdocuments.us/reader036/viewer/2022062422/568135cb550346895d9d2f1e/html5/thumbnails/19.jpg)
Compliant Point-Valued Belief Function (A)
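A hedged sketch of how these questions can be explored: assume the attacker knows every item's exact frequency, so a consistent mapping can only permute items within the same frequency group, and assume the attacker picks uniformly among those mappings. Each group then behaves like a small "ignorant" instance. The function name and the frequency data (counts out of 8, taken from the end-to-end example later in the talk) are illustrative.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import permutations


def expected_cracks_point_valued(freq_of_item):
    """Expected cracks when items are only confusable within a frequency group.

    Assumption: uniform choice among mappings that permute each group
    independently. Returns (expected cracks, number of groups).
    """
    groups = defaultdict(list)
    for item, freq in freq_of_item.items():
        groups[freq].append(item)

    expected = Fraction(0)
    for members in groups.values():
        k = len(members)
        hits = 0
        count = 0
        for perm in permutations(range(k)):
            hits += sum(1 for i in range(k) if perm[i] == i)
            count += 1
        expected += Fraction(hits, count)
    return expected, len(groups)


# Occurrence counts (out of 8 transactions) of the 12 anonymized items.
counts = {1: 4, 2: 1, 3: 6, 4: 4, 5: 3, 6: 2, 7: 3, 8: 2, 9: 2, 10: 1, 11: 1, 12: 1}
expected, n_groups = expected_cracks_point_valued(counts)
```

Under these assumptions the expected number of cracks equals the number of distinct frequency groups (5 for this data), and an item that is alone in its group, such as item 3, is cracked with certainty.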
![Page 20: Oren Fine Nov. 2008 CS Seminar in Databases (236826)](https://reader036.fdocuments.us/reader036/viewer/2022062422/568135cb550346895d9d2f1e/html5/thumbnails/20.jpg)
Compliant Interval Belief Functions
• Direct Computation Method
– Build a graph G and its adjacency matrix A_G
– The probability of cracking k out of n items:
• Computing the permanent is known to be a #P-complete problem, and the state-of-the-art approximation runs in O(n^22) time!
• What the !#$!% is a permanent or #P-complete?
![Page 21: Oren Fine Nov. 2008 CS Seminar in Databases (236826)](https://reader036.fdocuments.us/reader036/viewer/2022062422/568135cb550346895d9d2f1e/html5/thumbnails/21.jpg)
Permanent
• The permanent of an n×n matrix A = (a_{i,j}) is

$$\operatorname{perm}(A) = \sum_{\sigma \in S_n} \prod_{i=1}^{n} a_{i,\sigma(i)}$$

• The sum is over all permutations σ of 1, 2, …, n
• Calculating the permanent is #P-complete
• Which brings us to…
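To make the definition concrete, here is a minimal Python sketch that evaluates the permanent directly from the sum over permutations. It takes n! steps, so it is only usable for tiny matrices; the function name `permanent` is an illustrative choice.

```python
from itertools import permutations
from math import prod


def permanent(matrix):
    """Permanent of a square matrix, straight from the definition.

    Sums, over every permutation sigma, the product of the entries
    matrix[i][sigma[i]]. Exponential time: for illustration only.
    """
    n = len(matrix)
    return sum(
        prod(matrix[i][sigma[i]] for i in range(n))
        for sigma in permutations(range(n))
    )
```

For a 0/1 adjacency matrix of a bipartite graph, each non-zero term corresponds to one perfect matching, so the permanent counts the perfect matchings. That is why it shows up when counting consistent mappings.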
![Page 22: Oren Fine Nov. 2008 CS Seminar in Databases (236826)](https://reader036.fdocuments.us/reader036/viewer/2022062422/568135cb550346895d9d2f1e/html5/thumbnails/22.jpg)
#P-Complete
• Unlike the well-known complexity classes, which consist of decision problems, #P is a class of function (counting) problems
• A #P problem asks to "compute f(x)," where f(x) is the number of accepting paths of an NP machine on input x
• Example
– NP: Is there any non-empty subset of a list of integers that adds up to zero?
– #P: How many non-empty subsets of a list of integers add up to zero?
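The difference between the two questions can be illustrated with a brute-force Python sketch. Both functions below are exponential and purely illustrative; the names are hypothetical. The point is that the decision version may stop at the first witness, while the counting version has to account for all of them.

```python
from itertools import combinations


def zero_sum_subsets(numbers):
    """Yield every non-empty subset (as a tuple of indices) that sums to zero."""
    for size in range(1, len(numbers) + 1):
        for idx in combinations(range(len(numbers)), size):
            if sum(numbers[i] for i in idx) == 0:
                yield idx


def exists_zero_sum_subset(numbers):
    """Decision question (NP-style): is there at least one such subset?"""
    return any(True for _ in zero_sum_subsets(numbers))


def count_zero_sum_subsets(numbers):
    """Counting question (#P-style): how many such subsets are there?"""
    return sum(1 for _ in zero_sum_subsets(numbers))
```

For the list `[3, -3, 1, 2]` the decision answer is yes, and the count is 2: the subsets {3, -3} and {-3, 1, 2}.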
![Page 23: Oren Fine Nov. 2008 CS Seminar in Databases (236826)](https://reader036.fdocuments.us/reader036/viewer/2022062422/568135cb550346895d9d2f1e/html5/thumbnails/23.jpg)
Chain Belief Functions
![Page 24: Oren Fine Nov. 2008 CS Seminar in Databases (236826)](https://reader036.fdocuments.us/reader036/viewer/2022062422/568135cb550346895d9d2f1e/html5/thumbnails/24.jpg)
Chain Belief Functions
![Page 25: Oren Fine Nov. 2008 CS Seminar in Databases (236826)](https://reader036.fdocuments.us/reader036/viewer/2022062422/568135cb550346895d9d2f1e/html5/thumbnails/25.jpg)
Unfortunately…
• A general belief function does not always produce a chain…
• We seek a way to estimate the number of cracks.
![Page 26: Oren Fine Nov. 2008 CS Seminar in Databases (236826)](https://reader036.fdocuments.us/reader036/viewer/2022062422/568135cb550346895d9d2f1e/html5/thumbnails/26.jpg)
The O-estimate Heuristic
• Suppose a graph G and an interval belief function β.
• For each x, let O_x denote the outdegree of x in G.
• The probability of cracking x is simply $1/O_x$
• The expected number of cracks is $\sum_x 1/O_x$
![Page 27: Oren Fine Nov. 2008 CS Seminar in Databases (236826)](https://reader036.fdocuments.us/reader036/viewer/2022062422/568135cb550346895d9d2f1e/html5/thumbnails/27.jpg)
Properties of O-estimate
• Inexact (hence “estimate”)
• Monotonic
![Page 28: Oren Fine Nov. 2008 CS Seminar in Databases (236826)](https://reader036.fdocuments.us/reader036/viewer/2022062422/568135cb550346895d9d2f1e/html5/thumbnails/28.jpg)
α-Compliant Belief Function
• Suppose we “somehow” know which items are guessed wrong
• We sum the O-estimates only over the compliant frequency groups
![Page 29: Oren Fine Nov. 2008 CS Seminar in Databases (236826)](https://reader036.fdocuments.us/reader036/viewer/2022062422/568135cb550346895d9d2f1e/html5/thumbnails/29.jpg)
Risk Assessment
• Worst case / best case: unrealistic
• Determine the interval width
– Twice the median gap of all successive frequency groups
– Why?
• Determine the degree of compliancy
– Perform a binary search on α, subject to a “degree of tolerance”
![Page 30: Oren Fine Nov. 2008 CS Seminar in Databases (236826)](https://reader036.fdocuments.us/reader036/viewer/2022062422/568135cb550346895d9d2f1e/html5/thumbnails/30.jpg)
End to End Example
• These intelligence call and meeting data records are classified “Top Secret”

| TID | Names |
|-----|-------|
| 1 | Angela, Ariel, Edgar, Steve, Benny |
| 2 | Edgar, Hassan, Steve, Tommy |
| 3 | Joe, Sara, Israel |
| 4 | Steve, Angela, Edgar |
| 5 | Benny, Mahhmud, Tommy |
| 6 | Angela, Sara, Edgar |
| 7 | Hassan, Angela, Joe, Edgar, Noa |
| 8 | Edgar, Benny, Steve, Tommy |
![Page 31: Oren Fine Nov. 2008 CS Seminar in Databases (236826)](https://reader036.fdocuments.us/reader036/viewer/2022062422/568135cb550346895d9d2f1e/html5/thumbnails/31.jpg)
We Anonymize the Database
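| I (original) | J (anonymized) | freq |
|--------------|----------------|------|
| Angela | 1 | 4/8 |
| Ariel | 2 | 1/8 |
| Edgar | 3 | 6/8 |
| Steve | 4 | 4/8 |
| Benny | 5 | 3/8 |
| Hassan | 6 | 2/8 |
| Tommy | 7 | 3/8 |
| Joe | 8 | 2/8 |
| Sara | 9 | 2/8 |
| Israel | 10 | 1/8 |
| Noa | 11 | 1/8 |
| Mahhmud | 12 | 1/8 |

| TID | Items |
|-----|-------|
| 1 | 1, 2, 3, 4, 5 |
| 2 | 3, 6, 4, 7 |
| 3 | 8, 9, 10 |
| 4 | 4, 1, 3 |
| 5 | 5, 7, 12 |
| 6 | 1, 9, 3 |
| 7 | 6, 1, 8, 3, 11 |
| 8 | 3, 5, 4, 7 |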
![Page 32: Oren Fine Nov. 2008 CS Seminar in Databases (236826)](https://reader036.fdocuments.us/reader036/viewer/2022062422/568135cb550346895d9d2f1e/html5/thumbnails/32.jpg)
Frequency Groups
• The gaps between the frequency groups: 1/8, 1/8, 1/8, 1/8, 2/8
• The median gap = 1/8

| Frequency | Items |
|-----------|-------|
| 1/8 | 2, 10, 11, 12 |
| 2/8 | 6, 8, 9 |
| 3/8 | 5, 7 |
| 4/8 | 1, 4 |
| 6/8 | 3 |
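The interval-width rule from the Risk Assessment slide (twice the median gap between successive frequency groups) can be sketched as follows. The function name `interval_width` is hypothetical. As an assumption, the sketch uses only the four gaps between the five groups in the table, whereas the slide lists five gaps; the median comes out as 1/8 either way.

```python
from fractions import Fraction
from statistics import median


def interval_width(group_frequencies):
    """Twice the median gap between successive distinct frequencies."""
    freqs = sorted(set(group_frequencies))
    gaps = [b - a for a, b in zip(freqs, freqs[1:])]
    return 2 * median(gaps)


group_frequencies = [Fraction(k, 8) for k in (1, 2, 3, 4, 6)]
width = interval_width(group_frequencies)  # 2 * 1/8 = 1/4
```

A width of 2/8 means each belief interval extends 1/8 on either side of the true frequency, for example 3/8 to 5/8 for an item whose frequency is 4/8.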
![Page 33: Oren Fine Nov. 2008 CS Seminar in Databases (236826)](https://reader036.fdocuments.us/reader036/viewer/2022062422/568135cb550346895d9d2f1e/html5/thumbnails/33.jpg)
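The Attacker’s Prior Knowledge

| I | Frequency Group |
|---|-----------------|
| Angela | 3/8 – 5/8 |
| Ariel | 0 – 2/8 |
| Edgar | 5/8 – 7/8 |
| Steve | 3/8 – 5/8 |
| Benny | 2/8 – 4/8 |
| Hassan | 1/8 – 3/8 |
| Tommy | 2/8 – 4/8 |
| Joe | 1/8 – 3/8 |
| Sara | 1/8 – 3/8 |
| Israel | 0 – 2/8 |
| Noa | 0 – 2/8 |
| Mahhmud | 0 – 2/8 |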
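The graph implied by these intervals can be built with a short Python sketch: an original name is connected to every anonymized item whose observed frequency falls inside the name's belief interval. The closed-interval test and the helper name `candidates` are assumptions made for illustration.

```python
from fractions import Fraction as F

# Observed frequency of each anonymized item in the released database.
observed = {
    1: F(4, 8), 2: F(1, 8), 3: F(6, 8), 4: F(4, 8), 5: F(3, 8), 6: F(2, 8),
    7: F(3, 8), 8: F(2, 8), 9: F(2, 8), 10: F(1, 8), 11: F(1, 8), 12: F(1, 8),
}

# The attacker's belief intervals, copied from the table (assumed closed).
belief = {
    "Angela": (F(3, 8), F(5, 8)), "Ariel": (F(0), F(2, 8)),
    "Edgar": (F(5, 8), F(7, 8)), "Steve": (F(3, 8), F(5, 8)),
    "Benny": (F(2, 8), F(4, 8)), "Hassan": (F(1, 8), F(3, 8)),
    "Tommy": (F(2, 8), F(4, 8)), "Joe": (F(1, 8), F(3, 8)),
    "Sara": (F(1, 8), F(3, 8)), "Israel": (F(0), F(2, 8)),
    "Noa": (F(0), F(2, 8)), "Mahhmud": (F(0), F(2, 8)),
}


def candidates(name):
    """Anonymized items whose observed frequency lies in the name's interval."""
    low, high = belief[name]
    return sorted(j for j, f in observed.items() if low <= f <= high)


outdegree = {name: len(candidates(name)) for name in belief}
```

With the intervals exactly as listed, this gives outdegree 4 for Angela and Steve, 7 for Ariel, Benny, Tommy, Israel, Noa and Mahhmud, and 9 for Hassan, Joe and Sara. It gives outdegree 1 for Edgar, whereas the risk calculation in the talk uses 1/3 for Edgar; the sketch does not attempt to reconcile the two.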
![Page 34: Oren Fine Nov. 2008 CS Seminar in Databases (236826)](https://reader036.fdocuments.us/reader036/viewer/2022062422/568135cb550346895d9d2f1e/html5/thumbnails/34.jpg)
The Graph, By the Way…
[Figure: graph connecting the anonymized items 1–12 to the original names Angela, Ariel, Edgar, Steve, Benny, Hassan, Tommy, Joe, Sara, Israel, Noa and Mahhmud]
![Page 35: Oren Fine Nov. 2008 CS Seminar in Databases (236826)](https://reader036.fdocuments.us/reader036/viewer/2022062422/568135cb550346895d9d2f1e/html5/thumbnails/35.jpg)
Calculating the Risk
• Oest = 1/4 + 1/7 + 1/3 + 1/4 + 1/7 + 1/9 + 1/7 + 1/9 + 1/9 + 1/7 + 1/7 + 1/7 ≈ 2.02
• Now it’s a question of how much you would tolerate...
• Note that this is the expected number of cracks. However, if we are interested only in Edgar, then, as we’ve seen in the previous lemmas, the probability of a crack is 1/3.
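The sum above can be reproduced with a few lines of Python. The list of outdegrees is simply the denominators as they appear in the slide's sum, in the same order; the function name `o_estimate` is an illustrative choice.

```python
from fractions import Fraction


def o_estimate(outdegrees):
    """O-estimate: sum of 1/O_x over all items x."""
    return sum(Fraction(1, d) for d in outdegrees)


# Denominators taken verbatim from the slide's sum, in order.
slide_outdegrees = [4, 7, 3, 4, 7, 9, 7, 9, 9, 7, 7, 7]
estimate = o_estimate(slide_outdegrees)  # exactly 85/42, about 2.02
```

Working with exact fractions shows the value is 85/42, which the slide rounds to roughly 2.02 expected cracks out of 12 items.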
![Page 36: Oren Fine Nov. 2008 CS Seminar in Databases (236826)](https://reader036.fdocuments.us/reader036/viewer/2022062422/568135cb550346895d9d2f1e/html5/thumbnails/36.jpg)
Experiments
![Page 37: Oren Fine Nov. 2008 CS Seminar in Databases (236826)](https://reader036.fdocuments.us/reader036/viewer/2022062422/568135cb550346895d9d2f1e/html5/thumbnails/37.jpg)
Open Problems
• The attacker’s prior knowledge remains a largely unsolved issue
• This article does not really deal with frequent sets but rather frequent items
– Frequent sets can add more information and differentiate objects from one frequency group
![Page 38: Oren Fine Nov. 2008 CS Seminar in Databases (236826)](https://reader036.fdocuments.us/reader036/viewer/2022062422/568135cb550346895d9d2f1e/html5/thumbnails/38.jpg)
Modeling the Attacker’s Knowledge in the Real World
• A report for the Canadian Privacy Commissioner gives a broad mapping of adversary knowledge
– Mapping phone directories
– CVs
– Inferring gender, year of birth and postal code from different details
– Data remnants on second-hand hard disks
– Etc.
![Page 39: Oren Fine Nov. 2008 CS Seminar in Databases (236826)](https://reader036.fdocuments.us/reader036/viewer/2022062422/568135cb550346895d9d2f1e/html5/thumbnails/39.jpg)
All’s well that ends well
![Page 40: Oren Fine Nov. 2008 CS Seminar in Databases (236826)](https://reader036.fdocuments.us/reader036/viewer/2022062422/568135cb550346895d9d2f1e/html5/thumbnails/40.jpg)
Bibliography
• Lakshmanan L., Ng R., Ramesh G. To Do or Not To Do: The Dilemma of Disclosing Anonymized Data. ACM SIGMOD Conference, 2005.
• Agrawal, R. and Srikant, R. 1994. Fast algorithms for mining association rules. In Proc. 1994 Int. Conf. Very Large Data Bases (VLDB’94), Santiago, Chile, pp. 487–499.
• Pan-Canadian De-Identification Guidelines for Personal Health Information, Khaled El-Emam et al., April 2007.
• Wikipedia
– Association rule
– #P
– Permanent
![Page 41: Oren Fine Nov. 2008 CS Seminar in Databases (236826)](https://reader036.fdocuments.us/reader036/viewer/2022062422/568135cb550346895d9d2f1e/html5/thumbnails/41.jpg)
Questions?