Achieving Anonymity via Clustering
Dilys Thomas PODS 2006 1
Achieving Anonymity via Clustering
G. Aggarwal, T. Feder, K. Kenthapadi, S. Khuller, R. Panigrahy, D. Thomas, A. Zhu
Talk outline
• k-Anonymity model
• Achieving Anonymity via Clustering
• r-Gather clustering
• Cellular clustering
• Future Work
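Medical Records

| SSN | Name | DOB | Race | Zip code | Disease |
|---|---|---|---|---|---|
| 614 | Sara | 03/04/76 | Cauc | 94305 | Flu |
| 615 | Joan | 07/11/80 | Cauc | 94307 | Cold |
| 629 | Kelly | 05/09/55 | Cauc | 94301 | Diabetes |
| 710 | Mike | 11/23/62 | Afr-A | 94305 | Flu |
| 840 | Carl | 11/23/62 | Afr-A | 94059 | Arthritis |
| 780 | Joe | 01/07/50 | Hisp | 94042 | Heart problem |
| 619 | Rob | 04/08/43 | Hisp | 94042 | Arthritis |

Identifying attributes: SSN, Name. Sensitive attribute: Disease.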
De-identified Medical Records

| DOB | Race | Zip code | Disease |
|---|---|---|---|
| 03/04/76 | Cauc | 94305 | Flu |
| 07/11/80 | Cauc | 94307 | Cold |
| 05/09/55 | Cauc | 94301 | Diabetes |
| 11/23/62 | Afr-A | 94305 | Flu |
| 11/23/62 | Afr-A | 94059 | Arthritis |
| 01/07/50 | Hisp | 94042 | Heart problem |
| 04/08/43 | Hisp | 94042 | Arthritis |

Sensitive attribute: Disease.
k-Anonymity model

| DOB | Race | Zip code | Disease |
|---|---|---|---|
| 03/04/76 | Cauc | 94305 | Flu |
| 07/11/80 | Cauc | 94307 | Cold |
| 05/09/55 | Cauc | 94301 | Diabetes |
| 12/30/72 | Afr-A | 94305 | Flu |
| 11/23/62 | Afr-A | 94059 | Arthritis |
| 01/07/50 | Hisp | 94042 | Heart problem |
| 04/08/43 | Hisp | 94042 | Arthritis |

DOB, Race and Zip code together can uniquely identify you! Sensitive attribute: Disease.
Quasi-identifiers: approximate foreign keys
k-Anonymity Model [Swe00]
• Suppress some entries of quasi-identifiers
– each modified row becomes identical to at least k-1 other rows with respect to quasi-identifiers
• Individual records hidden in a crowd of size k
2-Anonymized Table

| DOB | Race | Zip code | Disease |
|---|---|---|---|
| * | Cauc | * | Flu |
| * | Cauc | * | Cold |
| * | Cauc | * | Diabetes |
| 11/23/62 | Afr-A | * | Flu |
| 11/23/62 | Afr-A | * | Arthritis |
| * | Hisp | 94042 | Heart problem |
| * | Hisp | 94042 | Arthritis |
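
The k-anonymity condition can be checked mechanically: group the rows by their quasi-identifier values and confirm that every group has at least k rows. The sketch below does this for the 2-anonymized table and for the de-identified table from the earlier slide. The function name `is_k_anonymous` and the dictionary keys are illustrative choices, not part of the talk.

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs in at least k rows."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(n >= k for n in counts.values())

QI = ("dob", "race", "zip")  # quasi-identifier columns (key names are illustrative)

# De-identified table before suppression (values from the slides).
original = [
    {"dob": "03/04/76", "race": "Cauc", "zip": "94305", "disease": "Flu"},
    {"dob": "07/11/80", "race": "Cauc", "zip": "94307", "disease": "Cold"},
    {"dob": "05/09/55", "race": "Cauc", "zip": "94301", "disease": "Diabetes"},
    {"dob": "11/23/62", "race": "Afr-A", "zip": "94305", "disease": "Flu"},
    {"dob": "11/23/62", "race": "Afr-A", "zip": "94059", "disease": "Arthritis"},
    {"dob": "01/07/50", "race": "Hisp", "zip": "94042", "disease": "Heart problem"},
    {"dob": "04/08/43", "race": "Hisp", "zip": "94042", "disease": "Arthritis"},
]

# 2-anonymized table: "*" marks a suppressed entry.
anonymized = [
    {"dob": "*", "race": "Cauc", "zip": "*", "disease": "Flu"},
    {"dob": "*", "race": "Cauc", "zip": "*", "disease": "Cold"},
    {"dob": "*", "race": "Cauc", "zip": "*", "disease": "Diabetes"},
    {"dob": "11/23/62", "race": "Afr-A", "zip": "*", "disease": "Flu"},
    {"dob": "11/23/62", "race": "Afr-A", "zip": "*", "disease": "Arthritis"},
    {"dob": "*", "race": "Hisp", "zip": "94042", "disease": "Heart problem"},
    {"dob": "*", "race": "Hisp", "zip": "94042", "disease": "Arthritis"},
]

print(is_k_anonymous(original, QI, 2))    # False: every row is unique
print(is_k_anonymous(anonymized, QI, 2))  # True: groups of size 3, 2 and 2
print(is_k_anonymous(anonymized, QI, 3))  # False: two groups have only 2 rows
```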
k-Anonymity Optimization
• Minimize the number of generalizations/suppressions to achieve k-Anonymity
• NP-hard to find the minimum number of suppressions/generalizations [MW04]
• O(k) approximation for k-anonymity [AFK+05]
• Ω(k) lower bound on approximation ratio with graph assumption
Original Table

| Name | Age | Salary |
|---|---|---|
| Amy | 25 | 50 |
| Brian | 27 | 60 |
| Carol | 29 | 100 |
| David | 35 | 110 |
| Evelyn | 39 | 120 |
2-Anonymity with Suppression

| Name | Age | Salary |
|---|---|---|
| Amy | * | * |
| Brian | * | * |
| Carol | * | * |
| David | * | * |
| Evelyn | * | * |

All attributes suppressed
2-Anonymity with Generalization

| Name | Age | Salary |
|---|---|---|
| Amy | 20-30 | 50-100 |
| Brian | 20-30 | 50-100 |
| Carol | 20-30 | 50-100 |
| David | 30-40 | 100-150 |
| Evelyn | 30-40 | 100-150 |

Generalization allows pre-specified ranges
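2-Anonymity with Clustering

| Name | Age | Salary |
|---|---|---|
| Amy | [25-29] | [50-100] |
| Brian | [25-29] | [50-100] |
| Carol | [25-29] | [50-100] |
| David | [35-39] | [110-120] |
| Evelyn | [35-39] | [110-120] |

Cluster centers published
– Cluster 1 (Amy, Brian, Carol): Age 27 = (25+27+29)/3, Salary 70 = (50+60+100)/3
– Cluster 2 (David, Evelyn): Age 37 = (35+39)/2, Salary 115 = (110+120)/2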
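
The published centers are coordinate-wise means of the points in each cluster, and the bracketed ranges are the per-attribute minimum and maximum. A minimal sketch of that arithmetic on the five records from the slides; the helper names `centroid` and `ranges` are illustrative.

```python
def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length numeric tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def ranges(points):
    """Per-attribute (min, max) over the cluster, as shown in brackets on the slide."""
    return tuple((min(p[d] for p in points), max(p[d] for p in points))
                 for d in range(len(points[0])))

# (age, salary) pairs from the Original Table, grouped as on the slide.
cluster_1 = [(25, 50), (27, 60), (29, 100)]   # Amy, Brian, Carol
cluster_2 = [(35, 110), (39, 120)]            # David, Evelyn

for cluster in (cluster_1, cluster_2):
    print(centroid(cluster), ranges(cluster), len(cluster))
# (27.0, 70.0) ((25, 29), (50, 100)) 3
# (37.0, 115.0) ((35, 39), (110, 120)) 2
```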
Advantages of Clustering
• Clustering reduces the amount of distortion introduced as compared to suppressions / generalizations
• Clustering allows constant factor approximation algorithms
Quasi-Identifiers form a Metric Space
• Convert quasi-identifiers into points in a metric space
• Distance function, D, on points
– D(X,X) = 0 (Reflexive)
– D(X,Y) = D(Y,X) (Symmetric)
– D(X,Z) <= D(X,Y) + D(Y,Z) (Triangle Inequality)
Metric Space
• Converting (gender, zip code, DOB) into points in a metric space is not easy.
• Define a distance function on each attribute.
• E.g. on Zip code:
– D(Zip1, Zip2) = physical distance between locations Zip1 and Zip2.
• Weight the attributes; the weighted sum of attribute distances gives a metric.
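
A weighted sum of per-attribute metrics with positive weights is again a metric, which is what the last bullet relies on. The sketch below illustrates this under several assumptions that are not in the talk: gender uses a 0/1 distance, a zip code is represented by hypothetical planar coordinates of its location, date of birth is reduced to a birth year, and the weights are arbitrary.

```python
import math
from itertools import product

def discrete(a, b):
    """0 if equal, 1 otherwise (used here for gender)."""
    return 0.0 if a == b else 1.0

def planar(a, b):
    """Euclidean distance between two (x, y) locations (stand-in for zip code distance)."""
    return math.dist(a, b)

def absolute(a, b):
    """Absolute difference (used here for birth year)."""
    return float(abs(a - b))

ATTRIBUTE_DISTANCES = (discrete, planar, absolute)
WEIGHTS = (1.0, 0.5, 0.1)  # hypothetical positive weights

def distance(r1, r2):
    """Weighted sum of the per-attribute distances between two records."""
    return sum(w * d(a, b)
               for w, d, a, b in zip(WEIGHTS, ATTRIBUTE_DISTANCES, r1, r2))

# Hypothetical records: (gender, zip location, birth year).
records = [
    ("F", (0.0, 0.0), 1976),
    ("F", (3.0, 4.0), 1980),
    ("M", (0.0, 0.0), 1962),
]

# Spot-check the three metric axioms on the sample records.
for x, y, z in product(records, repeat=3):
    assert distance(x, x) == 0.0
    assert distance(x, y) == distance(y, x)
    assert distance(x, z) <= distance(x, y) + distance(y, z) + 1e-12

print(distance(records[0], records[1]))
```

The spot-check is only a sanity test on three records; the general claim follows because each axiom is preserved under positive scaling and addition.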
Clustering for Anonymity
• Cluster quasi-identifiers so that each cluster has at least r members for anonymity.
• Publish cluster centers for anonymity, with the number of points and the radius
• Tight clusters ⇒ Usefulness of data for mining
• Large number of points per cluster ⇒ Anonymity
Quasi-identifiers: Metric Space
• Assume further that the distance metric has already been defined on quasi-identifiers
r-Gather Clustering
10 points, radius 5
20 points, radius 10
50 points, radius 20
• Minimize the maximum radius: 20
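
The r-gather objective can be evaluated for any proposed clustering: every cluster must contain at least r points, and the cost is the largest cluster radius. The sketch below only evaluates a given clustering; it is not the approximation algorithm from the talk. The function name `r_gather_cost` and the toy points are hypothetical.

```python
import math

def r_gather_cost(clusters, r):
    """Return the maximum radius over clusters, each given as (center, points).

    Raises ValueError if some cluster has fewer than r points.
    """
    radii = []
    for center, points in clusters:
        if len(points) < r:
            raise ValueError(f"cluster at {center} has fewer than {r} points")
        radii.append(max(math.dist(center, p) for p in points))
    return max(radii)

# Hypothetical toy clustering in the plane.
toy = [
    ((0.0, 0.0), [(0.0, 0.0), (3.0, 4.0)]),                    # radius 5
    ((10.0, 10.0), [(10.0, 10.0), (10.0, 12.0), (11.0, 10.0)]),  # radius 2
]
print(r_gather_cost(toy, r=2))  # 5.0

# For the three clusters on the slide, the objective is the largest radius.
slide_radii = [5, 10, 20]
print(max(slide_radii))  # 20
```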
Results
• 2-approximation for minimizing the maximum radius under the cluster size constraint
• Matching lower bound of 2 for maximum radius minimization
r-Gather Clustering
[Figure: points grouped into three clusters, each labeled 2d]
Lower Bound: Reduction from 3-SAT
[Figure: variable gadgets (X1T, X1F) and (X2T, X2F), each with r-2 additional points, and a clause point C1 for a clause on X1 and X2]
• r-gather with radius 1 iff formula satisfiable
• Else radius ≥ 2
Cellular Clustering
10 points, radius 5
20 points, radius 10
50 points, radius 20
Cellular Clustering Metric
10 points, radius 5
20 points, radius 10
50 points, radius 20
Cellular Clustering Metric: 10*5 + 20*10 + 50*20 = 50 + 200 + 1000 = 1250
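
The cellular clustering metric weights each cluster's radius by the number of points it contains, so large loose clusters are penalized more than small ones. A short sketch reproducing the slide's arithmetic; the function name `cellular_cost` is illustrative.

```python
def cellular_cost(clusters):
    """Sum over clusters of (number of points) * (radius)."""
    return sum(n_points * radius for n_points, radius in clusters)

# (number of points, radius) for the three clusters on the slide.
slide_clusters = [(10, 5), (20, 10), (50, 20)]

print(cellular_cost(slide_clusters))             # 1250
print(max(radius for _, radius in slide_clusters))  # 20, the r-gather objective for comparison
```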
Cellular Clustering
• Primal-dual 4-approximation algorithm for cellular clustering
• Constant-factor approximation with a minimum cluster size
– Each cluster has at least r points
Cellular Clustering: Linear Program
Minimize $\sum_c \left( \sum_i x_{ic}\, d_c + f_c\, y_c \right)$
Sum of cellular cost and facility cost
Subject to:
$\sum_c x_{ic} \ge 1$ (Each point belongs to a cluster)
$x_{ic} \le y_c$ (Cluster must be opened for point to belong)
$0 \le x_{ic} \le 1$ (Points belong to clusters positively)
$0 \le y_c \le 1$ (Clusters are opened positively)
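Dual Program
• Maximize $\sum_i \alpha_i$
• Subject to:
$\sum_i \beta_{ic} \le f_c$ (1)
$\alpha_i - \beta_{ic} \le d_c$ (2)
$\alpha_i \ge 0$
$\beta_{ic} \ge 0$
Overview of Algorithm: First grow $\alpha_i$ keeping $\beta_{ic} = 0$ till (2) becomes tight, then grow $\beta_{ic}$ at the same rate till (1) becomes tight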
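
Weak duality links the two programs: any feasible dual solution has value at most that of any feasible primal solution. The sketch below checks feasibility and compares objective values on a hypothetical toy instance. It is not the primal-dual algorithm itself. The instance data, the names `d`, `f`, `members`, `alpha`, `beta`, and the restriction that each candidate cluster lists the points it may contain are all assumptions of this sketch.

```python
# Hypothetical instance: two candidate clusters with per-point cost d[c],
# facility cost f[c], and the points each may contain (an assumption here).
d = {"A": 1.0, "B": 3.0}
f = {"A": 2.0, "B": 1.0}
members = {"A": {0, 1}, "B": {0, 1, 2}}
points = {0, 1, 2}
pairs = [(i, c) for c in members for i in sorted(members[c])]

def primal_value(x, y):
    """Cellular cost plus facility cost."""
    return sum(x[i, c] * d[c] for i, c in pairs) + sum(f[c] * y[c] for c in d)

def primal_feasible(x, y, eps=1e-9):
    covered = all(sum(x[i, c] for c in members if i in members[c]) >= 1 - eps
                  for i in points)
    opened = all(x[i, c] <= y[c] + eps for i, c in pairs)
    bounded = all(-eps <= v <= 1 + eps for v in list(x.values()) + list(y.values()))
    return covered and opened and bounded

def dual_value(alpha):
    return sum(alpha.values())

def dual_feasible(alpha, beta, eps=1e-9):
    facility = all(sum(beta[i, c] for i in members[c]) <= f[c] + eps for c in members)  # (1)
    service = all(alpha[i] - beta[i, c] <= d[c] + eps for i, c in pairs)               # (2)
    nonneg = all(v >= -eps for v in list(alpha.values()) + list(beta.values()))
    return facility and service and nonneg

# A primal solution: open both clusters, put points 0 and 1 in A and point 2 in B.
x = {(0, "A"): 1, (1, "A"): 1, (0, "B"): 0, (1, "B"): 0, (2, "B"): 1}
y = {"A": 1, "B": 1}

# A dual solution chosen by hand for this instance.
alpha = {0: 2, 1: 2, 2: 4}
beta = {(0, "A"): 1, (1, "A"): 1, (0, "B"): 0, (1, "B"): 0, (2, "B"): 1}

assert primal_feasible(x, y) and dual_feasible(alpha, beta)
print(primal_value(x, y), dual_value(alpha))  # 8.0 8
```

Because the two values coincide on this instance, both solutions are optimal for it; in general the dual value is only a lower bound.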
Future Work
• Improve approximation ratio for Cellular Clustering
• Improve running time. Presently r-gather is O(n^2) while cellular clustering is a linear program over n^2 variables.
– Linear or even sub-linear time algorithms
• Weaker guarantees on anonymity, e.g. at least k/2 points per cluster instead of k.
THANK YOU!
QUESTIONS?