Dilys Thomas PODS 2006 1
Achieving Anonymity via Clustering
G. Aggarwal, T. Feder, K. Kenthapadi, S. Khuller, R. Panigrahy, D. Thomas,
A. Zhu
Talk outline
• k-Anonymity model
• Achieving Anonymity via Clustering
• r-Gather clustering
• Cellular clustering
• Future Work
Medical Records
Identifying: SSN, Name. Sensitive: Disease.
SSN Name DOB Race Zip code Disease
614 Sara 03/04/76 Cauc 94305 Flu
615 Joan 07/11/80 Cauc 94307 Cold
629 Kelly 05/09/55 Cauc 94301 Diabetes
710 Mike 11/23/62 Afr-A 94305 Flu
840 Carl 11/23/62 Afr-A 94059 Arthritis
780 Joe 01/07/50 Hisp 94042 Heart problem
619 Rob 04/08/43 Hisp 94042 Arthritis
De-identified Medical Records
Sensitive: Disease.
DOB Race Zip code Disease
03/04/76 Cauc 94305 Flu
07/11/80 Cauc 94307 Cold
05/09/55 Cauc 94301 Diabetes
11/23/62 Afr-A 94305 Flu
11/23/62 Afr-A 94059 Arthritis
01/07/50 Hisp 94042 Heart problem
04/08/43 Hisp 94042 Arthritis
k-Anonymity model
Quasi-identifiers (DOB, Race, Zip code): uniquely identify you!
Sensitive: Disease.
DOB Race Zip code Disease
03/04/76 Cauc 94305 Flu
07/11/80 Cauc 94307 Cold
05/09/55 Cauc 94301 Diabetes
11/23/62 Afr-A 94305 Flu
11/23/62 Afr-A 94059 Arthritis
01/07/50 Hisp 94042 Heart problem
04/08/43 Hisp 94042 Arthritis
Quasi-identifiers: approximate foreign keys
k-Anonymity Model [Swe00]
• Suppress some entries of quasi-identifiers
– Each modified row becomes identical to at least k-1 other rows with respect to quasi-identifiers
• Individual records hidden in a crowd of size k
2-Anonymized Table
DOB Race Zip code Disease
* Cauc * Flu
* Cauc * Cold
* Cauc * Diabetes
11/23/62 Afr-A * Flu
11/23/62 Afr-A * Arthritis
* Hisp 94042 Heart problem
* Hisp 94042 Arthritis
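The k-anonymity condition can be checked mechanically: group the rows by their quasi-identifier values and confirm that every group has at least k rows. The sketch below does this for the 2-anonymized table above. The function name `is_k_anonymous` and the row layout are illustrative choices, not part of the talk.

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """Return True if every quasi-identifier combination occurs in at least k rows."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in groups.values())

# The 2-anonymized table from the slide; "*" marks a suppressed entry.
columns = ("DOB", "Race", "Zip code", "Disease")
table = [
    ("*", "Cauc", "*", "Flu"),
    ("*", "Cauc", "*", "Cold"),
    ("*", "Cauc", "*", "Diabetes"),
    ("11/23/62", "Afr-A", "*", "Flu"),
    ("11/23/62", "Afr-A", "*", "Arthritis"),
    ("*", "Hisp", "94042", "Heart problem"),
    ("*", "Hisp", "94042", "Arthritis"),
]
rows = [dict(zip(columns, r)) for r in table]
quasi = ("DOB", "Race", "Zip code")  # Disease is the sensitive attribute, not a quasi-identifier

print(is_k_anonymous(rows, quasi, 2))  # True: groups have sizes 3, 2 and 2
print(is_k_anonymous(rows, quasi, 3))  # False: two groups have only 2 rows
```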
k-Anonymity Optimization
• Minimize the number of generalizations/suppressions to achieve k-Anonymity
• NP-hard to come up with minimum suppressions/generalizations [MW04]
• O(k) approximation for k-anonymity [AFK+05]
• Ω(k) lower bound on approximation ratio with graph assumption
Original Table
Age Salary
Amy 25 50
Brian 27 60
Carol 29 100
David 35 110
Evelyn 39 120
2-Anonymity with Suppression
Age Salary
Amy * *
Brian * *
Carol * *
David * *
Evelyn * *
All attributes suppressed
2-Anonymity with Generalization
Age Salary
Amy 20-30 50-100
Brian 20-30 50-100
Carol 20-30 50-100
David 30-40 100-150
Evelyn 30-40 100-150
Generalization allows pre-specified ranges
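Generalization into pre-specified ranges can be sketched as a lookup: each value is replaced by the first range that contains it. The ranges below are the ones shown in the table above. Treating ranges as closed intervals and taking the first match (so that salary 100 lands in 50-100, as on the slide) is an assumption made for this sketch, as are the names `generalize`, `AGE_RANGES` and `SALARY_RANGES`.

```python
# Pre-specified ranges taken from the slide's table.
AGE_RANGES = [(20, 30), (30, 40)]
SALARY_RANGES = [(50, 100), (100, 150)]

def generalize(value, ranges):
    """Return the first pre-specified range (as 'low-high') that contains value.

    Assumption: ranges are closed intervals and the first match wins,
    which reproduces the slide (salary 100 is reported as 50-100).
    """
    for low, high in ranges:
        if low <= value <= high:
            return f"{low}-{high}"
    raise ValueError(f"{value} is not covered by any range")

original = {
    "Amy": (25, 50),
    "Brian": (27, 60),
    "Carol": (29, 100),
    "David": (35, 110),
    "Evelyn": (39, 120),
}

generalized = {
    name: (generalize(age, AGE_RANGES), generalize(salary, SALARY_RANGES))
    for name, (age, salary) in original.items()
}

for name, (age_range, salary_range) in generalized.items():
    print(name, age_range, salary_range)
```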
2-Anonymity with Clustering
Age Salary
Amy [25-29] [50-100]
Brian [25-29] [50-100]
Carol [25-29] [50-100]
David [35-39] [110-120]
Evelyn [35-39] [110-120]
Cluster centers published:
Cluster {Amy, Brian, Carol}: Age 27 = (25+27+29)/3, Salary 70 = (50+60+100)/3
Cluster {David, Evelyn}: Age 37 = (35+39)/2, Salary 115 = (110+120)/2
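The centers and ranges in this example can be reproduced with a few lines of code. This is a minimal sketch assuming the center is the per-attribute mean and the published range is the per-attribute minimum and maximum; `summarize_cluster` is an illustrative name.

```python
def summarize_cluster(points):
    """Return (center, ranges) for a cluster of equal-length numeric tuples.

    center: per-attribute mean; ranges: per-attribute (min, max).
    """
    n = len(points)
    per_attribute = list(zip(*points))
    center = tuple(sum(values) / n for values in per_attribute)
    ranges = tuple((min(values), max(values)) for values in per_attribute)
    return center, ranges

# (Age, Salary) pairs from the original table, grouped as on the slide.
cluster_a = [(25, 50), (27, 60), (29, 100)]   # Amy, Brian, Carol
cluster_b = [(35, 110), (39, 120)]            # David, Evelyn

print(summarize_cluster(cluster_a))  # ((27.0, 70.0), ((25, 29), (50, 100)))
print(summarize_cluster(cluster_b))  # ((37.0, 115.0), ((35, 39), (110, 120)))
```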
Advantages of Clustering
• Clustering reduces the amount of distortion introduced as compared to suppressions / generalizations
• Clustering allows constant factor approximation algorithms
Quasi-Identifiers form a Metric Space
• Convert quasi-identifiers into points in a metric space
• Distance function, D, on points
– D(X,X) = 0 Reflexive
– D(X,Y) = D(Y,X) Symmetric
– D(X,Z) <= D(X,Y) + D(Y,Z) Triangle Inequality
Metric Space
• Converting (gender, zip code, DOB) into points in a metric space is not easy.
• Define a distance function on each attribute.
• E.g. on Zip code:
– D(Zip1, Zip2) = physical distance between locations Zip1 and Zip2.
• Weight the attributes; the weighted sum of attribute distances gives a metric.
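A weighted sum of per-attribute metrics with positive weights is itself a metric, which the sketch below illustrates on (gender, zip code, birth year) records. Everything concrete here is hypothetical: the zip-code coordinates in `ZIP_LOCATION_KM` are made up for illustration, birth year stands in for DOB, and the weights are arbitrary.

```python
import math

# Hypothetical planar coordinates (in km) for a few zip codes; made up for this sketch.
ZIP_LOCATION_KM = {"94305": (0.0, 0.0), "94301": (3.0, 4.0), "94042": (12.0, 5.0)}

def gender_distance(a, b):
    """Discrete metric: 0 if equal, 1 otherwise."""
    return 0.0 if a == b else 1.0

def zip_distance(a, b):
    """Physical distance between the (hypothetical) locations of two zip codes."""
    return math.dist(ZIP_LOCATION_KM[a], ZIP_LOCATION_KM[b])

def year_distance(a, b):
    """Absolute difference between birth years."""
    return float(abs(a - b))

ATTRIBUTE_DISTANCES = (gender_distance, zip_distance, year_distance)
WEIGHTS = (10.0, 1.0, 0.5)  # arbitrary positive weights

def weighted_distance(x, y):
    """Weighted sum of per-attribute distances between two records."""
    return sum(w * d(a, b) for w, d, a, b in zip(WEIGHTS, ATTRIBUTE_DISTANCES, x, y))

x = ("F", "94305", 1976)
y = ("F", "94301", 1955)
z = ("M", "94305", 1962)

print(weighted_distance(x, y))  # 15.5 = 10*0 + 1*5 + 0.5*21
print(weighted_distance(x, z) <= weighted_distance(x, y) + weighted_distance(y, z))  # True
```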
Clustering for Anonymity
• Cluster quasi-identifiers so that each cluster has at least r members for anonymity.
• Publish cluster centers for anonymity, together with the number of points and the radius of each cluster
• Tight clusters → Usefulness of data for mining
• Large number of points per cluster → Anonymity
Quasi-identifiers: Metric Space
• Assume further that the distance metric has already been defined on the quasi-identifiers
r-Gather Clustering
10 points, radius 5
20 points, radius 10
50 points, radius 20
• Minimize the maximum radius: here max(5, 10, 20) = 20
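The r-gather objective for a given clustering can be evaluated directly: check that every cluster has at least r points, then take the largest radius. The sketch below only evaluates the objective for the three example clusters; it is not the approximation algorithm from the talk. The name `r_gather_objective` and the choice r = 10 are assumptions for illustration.

```python
def r_gather_objective(clusters, r):
    """Return the maximum cluster radius of a clustering.

    clusters: list of (num_points, radius) pairs.
    Raises ValueError if any cluster has fewer than r points,
    since such a clustering is not a valid r-gather solution.
    """
    for num_points, _radius in clusters:
        if num_points < r:
            raise ValueError(f"cluster with {num_points} points violates r = {r}")
    return max(radius for _num_points, radius in clusters)

# The three clusters from the slide: (number of points, radius).
clusters = [(10, 5), (20, 10), (50, 20)]

print(r_gather_objective(clusters, r=10))  # 20
```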
Results
• 2-approximation for minimizing the maximum radius under the cluster size constraint
• Matching lower bound of 2 for maximum radius minimization
Lower Bound: Reduction from 3-SAT
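[Figure: each variable Xi has a true point XiT and a false point XiF (shown: X1T, X1F, X2T, X2F), plus r-2 additional points per variable]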
• r-gather with radius 1 exists iff the formula is satisfiable; otherwise radius ≥ 2
Example clause point: C1 = X1 ∨ X2
Cellular Clustering Metric
10 points, radius 5
20 points, radius 10
50 points, radius 20
Cellular Clustering Metric (sum over clusters of number of points × radius): 10*5 + 20*10 + 50*20 = 50 + 200 + 1000 = 1250
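The metric in this example is a sum, over clusters, of the number of points multiplied by the cluster radius. A minimal sketch of that arithmetic follows; `cellular_metric` is an illustrative name, and the facility cost that appears in the linear program is deliberately left out here.

```python
def cellular_metric(clusters):
    """Sum over clusters of (number of points * radius).

    clusters: list of (num_points, radius) pairs.
    """
    return sum(num_points * radius for num_points, radius in clusters)

# The three clusters from the slide: (number of points, radius).
clusters = [(10, 5), (20, 10), (50, 20)]

print(cellular_metric(clusters))  # 1250 = 50 + 200 + 1000
```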
Cellular Clustering
• Primal-dual 4-approximation algorithm for cellular clustering
• Constant-factor approximation with a minimum cluster size constraint
– Each cluster has at least r points
Cellular Clustering: Linear Program
Minimize $\sum_c \left( \sum_i x_{ic} d_c + f_c y_c \right)$
Sum of cellular cost and facility cost
Subject to:
$\sum_c x_{ic} \ge 1$ Each point belongs to a cluster
$x_{ic} \le y_c$ Cluster must be opened for a point to belong
$0 \le x_{ic} \le 1$ Points belong to clusters positively
$0 \le y_c \le 1$ Clusters are opened positively
Dual Program
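• Maximize $\sum_i \alpha_i$
• Subject to:
$\sum_i \beta_{ic} \le f_c$ (1)
$\alpha_i - \beta_{ic} \le d_c$ (2)
$\alpha_i \ge 0$
$\beta_{ic} \ge 0$
Overview of algorithm: first grow $\alpha_i$ keeping $\beta_{ic} = 0$ until (2) becomes tight, then grow $\beta_{ic}$ at the same rate until (1) becomes tight.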
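To make the primal and dual programs concrete, the sketch below evaluates both on a toy instance and checks weak duality (dual value at most primal cost). It assumes the dual variables are named alpha_i (one per point) and beta_ic (one per point-cluster pair), that every point may join every candidate cluster, and that the numbers d_c and f_c are made up for illustration. It verifies a given pair of solutions and is not an implementation of the 4-approximation algorithm.

```python
def primal_cost(assignment, d, f):
    """Cost of an integral primal solution.

    assignment[i] is the cluster chosen for point i (x_ic = 1 for that c),
    and a cluster is opened (y_c = 1) exactly when some point uses it.
    """
    opened = set(assignment)
    return sum(d[c] for c in assignment) + sum(f[c] for c in opened)

def dual_is_feasible(alpha, beta, d, f, eps=1e-9):
    """Check constraints (1), (2) and non-negativity of the dual program."""
    n, m = len(alpha), len(d)
    for c in range(m):
        if sum(beta[i][c] for i in range(n)) > f[c] + eps:  # constraint (1)
            return False
        for i in range(n):
            if alpha[i] - beta[i][c] > d[c] + eps:          # constraint (2)
                return False
    return all(a >= 0 for a in alpha) and all(b >= 0 for row in beta for b in row)

# Toy instance (hypothetical numbers): 3 points, 2 candidate clusters.
d = [1.0, 2.0]   # d_c: per-point cost of cluster c
f = [3.0, 1.0]   # f_c: facility cost of opening cluster c

assignment = [0, 0, 0]                 # all three points join cluster 0
alpha = [2.0, 2.0, 2.0]
beta = [[1.0, 0.0] for _ in range(3)]  # beta[i][c]

print(primal_cost(assignment, d, f))       # 6.0 = 3*1 + 3
print(dual_is_feasible(alpha, beta, d, f)) # True
print(sum(alpha))                          # 6.0, matching the primal cost
```

In this instance the dual values follow the growth pattern described above: each alpha_i reaches d_0 = 1 with beta at zero, then alpha_i and beta_i0 grow together until the three beta_i0 values sum to f_0 = 3.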
Future Work
• Improve approximation ratio for Cellular Clustering
• Improve running time. Presently r-gather is O(n^2), while cellular clustering is a linear program over n^2 variables.
– Linear or even sub-linear time algorithms
• Weaker guarantees on anonymity, e.g. at least k/2 points per cluster instead of k.