On the Anonymization of Sparse High-Dimensional
Data
Gabriel Ghinita1, Yufei Tao2, Panos Kalnis1
1 National University of Singapore, {ghinitag,kalnis}@comp.nus.edu.sg
2 Chinese University of Hong Kong, [email protected]
Publishing Transaction Data
Retail chain-owned shopping cart data
Infer consumer spending patterns
Correlations among purchased items
e.g., 90% of cereal buyers also buy milk
What about privacy?
Privacy Paradigm
ℓ-diversity
Prevent association between quasi-identifier (QID) and sensitive attributes (SA)
Create groups of transactions
Frequency of an SA value in a group < 1/p
Objective
Enforce privacy
Preserve correlations among items
Challenge: high data dimensionality
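The group-level privacy condition above can be checked mechanically. The following is a minimal sketch, assuming each transaction is a set of item names and that the sensitive items are known in advance; the function name `satisfies_privacy` and the sample items are hypothetical, and the strict inequality follows the bound as written on this slide.

```python
from collections import Counter
from fractions import Fraction


def satisfies_privacy(group, sensitive_items, p):
    """Return True if every sensitive item appears in fewer than 1/p
    of the transactions in the group (strict bound, as on the slide)."""
    counts = Counter()
    for transaction in group:
        # Count each sensitive item at most once per transaction.
        for item in set(transaction) & set(sensitive_items):
            counts[item] += 1
    limit = Fraction(1, p)
    # Exact rational arithmetic avoids floating-point edge cases at the bound.
    return all(Fraction(c, len(group)) < limit for c in counts.values())


group = [
    {"milk", "pregnancy test"},
    {"bread"},
    {"milk"},
    {"cereal"},
]
print(satisfies_privacy(group, {"pregnancy test"}, p=2))
print(satisfies_privacy(group, {"pregnancy test"}, p=4))
```

Here one of four transactions holds the sensitive item, so the group passes for p = 2 and fails the strict bound for p = 4.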
Contributions
Novel data representation
Preserves correlation among items
Efficient heuristic for group formation
Linear time to data size
Supports multiple sensitive items
State-of-the-art: Mondrian[FWR06]
Generalization-based data-space partitioning similar to k-d-trees
split recursively as long as the privacy condition still holds for the resulting partitions
constrained global recoding
[Figure: Mondrian partitioning of a two-dimensional data space, k = 2; x-axis Age (20 to 60), y-axis Weight (40 to 100)]
[FWR06] K. LeFevre et al. Mondrian Multidimensional k-Anonymity, Proceedings of the 22nd International Conference on Data Engineering (ICDE), 2006
GENERALIZATION + HIGH DIMENSIONALITY = UNACCEPTABLE INFORMATION LOSS
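To make the k-d-tree-style partitioning concrete, here is a simplified sketch of median-based recursive splitting under k-anonymity. It is not the implementation from [FWR06]: the names `mondrian` and `generalize`, the widest-range-first dimension choice, and the sample (age, weight) points are assumptions for illustration.

```python
def mondrian(points, k):
    """Recursively split a list of numeric tuples at the median of one
    dimension, as long as both halves keep at least k points."""
    if k < 1:
        raise ValueError("k must be at least 1")
    dims = range(len(points[0]))
    # Try the dimension with the widest value range first.
    by_range = sorted(
        dims,
        key=lambda d: max(p[d] for p in points) - min(p[d] for p in points),
        reverse=True,
    )
    for d in by_range:
        values = sorted(p[d] for p in points)
        median = values[len(values) // 2]
        left = [p for p in points if p[d] < median]
        right = [p for p in points if p[d] >= median]
        if len(left) >= k and len(right) >= k:
            return mondrian(left, k) + mondrian(right, k)
    # No allowable split: this partition becomes one anonymized group.
    return [points]


def generalize(partition):
    """Replace exact values with the (min, max) range per dimension."""
    dims = range(len(partition[0]))
    return tuple(
        (min(p[d] for p in partition), max(p[d] for p in partition)) for d in dims
    )


points = [(20, 40), (25, 60), (40, 80), (45, 100), (60, 60), (58, 90)]
for part in mondrian(points, k=2):
    print(generalize(part), len(part))
```

Each published range has to cover every dimension at once, so with hundreds or thousands of item columns the ranges grow wide, which is the information loss this slide points to.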
State-of-the-art: Anatomy[XT06]
Permutation-based method
Discloses exact QID values
Age  ZipCode  Disease
42   52000    Ulcer
47   43000    Pneumonia
51   32000    Flu
55   27000    Gastritis
62   41000    Dyspepsia
67   55000    Dyspepsia
“Anatomized” table
Age  ZipCode
42   52000
47   43000
51   32000
62   41000
55   27000
67   55000
Disease: Ulcer(1), Pneumonia(1), Flu(1), Dyspepsia(1), Gastritis(1), Dyspepsia(1)
|G|! permutations
RANDOM GROUP FORMATION DOES NOT PRESERVE CORRELATIONS
[XT06] X. Xiao and Y. Tao. Anatomy: Simple and Effective Privacy Preservation, Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB), 2006
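The permutation-based idea can be sketched as follows: exact QID values are published together with a group id, while sensitive values are published only as per-group counts. This is an illustrative sketch, not the code from [XT06]; the function name `anatomize` and the split into two groups of three are assumptions, since the slide does not show group boundaries.

```python
from collections import Counter


def anatomize(rows, groups):
    """rows: list of (age, zipcode, disease); groups: lists of row indices.
    Returns a QID table with group ids and a per-group sensitive summary."""
    qid_table = []
    sensitive_table = []
    for gid, members in enumerate(groups, start=1):
        for i in members:
            age, zipcode, _ = rows[i]
            # Exact QID values are disclosed, tagged only with the group id.
            qid_table.append((age, zipcode, gid))
        counts = Counter(rows[i][2] for i in members)
        for disease, count in sorted(counts.items()):
            sensitive_table.append((gid, disease, count))
    return qid_table, sensitive_table


rows = [
    (42, 52000, "Ulcer"),
    (47, 43000, "Pneumonia"),
    (51, 32000, "Flu"),
    (55, 27000, "Gastritis"),
    (62, 41000, "Dyspepsia"),
    (67, 55000, "Dyspepsia"),
]
# Hypothetical grouping, chosen only for illustration.
qid, sens = anatomize(rows, [[0, 1, 2], [3, 4, 5]])
for entry in sens:
    print(entry)
```

Within a group, any assignment of the published sensitive values to the members is consistent with the output, which is where the |G|! permutations come from.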
Reverse Cuthill-McKee (RCM)
Heuristic for bandwidth minimization
Solves corresponding graph labeling problem
Permutes rows and columns
Complexity: O(N * D * log D)
N = matrix rows (# transactions)
D = maximum degree of any vertex
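The graph labeling step can be illustrated with a small, self-contained sketch of the Reverse Cuthill-McKee ordering on an undirected graph. It is a simplification: the start vertex is simply a lowest-degree vertex rather than a pseudo-peripheral one, and the names `reverse_cuthill_mckee` and `bandwidth` as well as the example graph are assumptions for illustration.

```python
from collections import deque


def reverse_cuthill_mckee(adj):
    """adj: dict mapping each vertex to the set of its neighbours
    (undirected graph). Returns the vertices in RCM order."""
    degree = {v: len(ns) for v, ns in adj.items()}
    visited = set()
    order = []
    # Cover every connected component, starting from low-degree vertices.
    for start in sorted(adj, key=lambda v: (degree[v], v)):
        if start in visited:
            continue
        visited.add(start)
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            # Breadth-first search, visiting neighbours by increasing degree.
            for w in sorted(adj[v] - visited, key=lambda u: (degree[u], u)):
                visited.add(w)
                queue.append(w)
    order.reverse()  # the "reverse" in Reverse Cuthill-McKee
    return order


def bandwidth(adj, order):
    """Largest distance between the positions of two adjacent vertices."""
    pos = {v: i for i, v in enumerate(order)}
    return max((abs(pos[u] - pos[v]) for u in adj for v in adj[u]), default=0)


# A path graph whose labels are scrambled: 0-3, 3-1, 1-4, 4-2.
adj = {0: {3}, 3: {0, 1}, 1: {3, 4}, 4: {1, 2}, 2: {4}}
order = reverse_cuthill_mckee(adj)
print(bandwidth(adj, sorted(adj)), bandwidth(adj, order))
```

After relabeling, adjacent vertices sit next to each other in the ordering, which corresponds to nonzero entries clustering near the diagonal of the permuted matrix.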
Group Formation
Correlation-aware Anonymization of High-Dimensional Data (CAHD)
Use the order given by RCM
Consecutive transactions highly correlated
O(pN) complexity
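A much-simplified sketch of order-based group formation is given below. The slides do not spell out how companions are selected, so this version assumes that each sensitive transaction is grouped with the p - 1 nearest still-ungrouped non-sensitive transactions by position in the RCM order; the function name `form_groups` and its arguments are hypothetical. Each such group contains one sensitive transaction among p, so a larger group size would be needed if the frequency bound is read strictly.

```python
def form_groups(order, sensitive, p):
    """order: transaction ids in RCM order; sensitive: ids that contain a
    sensitive item. Returns a list of groups (lists of transaction ids)."""
    # free[i] is True while position i holds an ungrouped non-sensitive id.
    free = [t not in sensitive for t in order]
    groups = []
    for i, t in enumerate(order):
        if t not in sensitive:
            continue
        group = [t]
        left, right = i - 1, i + 1
        # Scan outwards so that companions come from nearby positions.
        while len(group) < p and (left >= 0 or right < len(order)):
            if left >= 0:
                if free[left]:
                    free[left] = False
                    group.append(order[left])
                left -= 1
            if len(group) < p and right < len(order):
                if free[right]:
                    free[right] = False
                    group.append(order[right])
                right += 1
        if len(group) < p:
            raise ValueError("not enough non-sensitive transactions left")
        groups.append(group)
    # Remaining non-sensitive transactions form one final group.
    leftover = [order[i] for i, is_free in enumerate(free) if is_free]
    if leftover:
        groups.append(leftover)
    return groups


print(form_groups(list(range(8)), sensitive={2, 6}, p=3))
```

Because neighbours in the RCM order tend to share items, groups built from nearby positions keep correlated transactions together, in contrast to random group formation.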
Experimental Setting
BMS dataset
Compare with hybrid PermMondrian (PM)
Combines Mondrian with Anatomy
Query Workload
Reconstruction Error