A Privacy Preserving Index for Range Queries
Transcript of A Privacy Preserving Index for Range Queries
1
A Privacy Preserving Index for Range Queries
Bijit Hore, Sharad Mehrotra, Gene Tsudik
2
Database as a Service (DAS) [Hacigumus et al., SIGMOD 2002]
A client wants to store data on a remote server & run queries on it
BUT the client does not trust the server
Solution: Encrypt the data & store it
How do you query the encrypted data?
[Figure: DAS architecture. Trusted side: User and Client, with a Query Translator and a Query Post Processor. Untrusted side: Server at the Service Provider, holding the encrypted & indexed client data. Flow: Original Query → Query over Encrypted Data → Encrypted Results → True Results.]
3
Data storage in DAS
Original Table (plain text) R

eid  name   addr   shares  age  sal
345  Tom    Maple  5400    32   390K
876  Mary   Main   5800    22   423K
234  John   River  6000    34   598K
780  Jerry  Ocean  6200    48   632K

Server side Table (encrypted + indexed) RA (server side data)

etuple     sharesA  ageA  salA
X@#$^&FJ   X1       Y2    Z1
CH$^*(G#!  X2       Y1    Z1
^$*D%L*#   X3       Y2    Z2
*%GH%&)$   X3       Y3    Z3

Bucket-tags for sal (meta data, client side storage)

bucket  range (K)
Z0      0 to 200
Z1      200 to 450
Z2      450 to 600
Z3      600 to 650
Z4      650 to 700
4
Querying in DAS
Client-side query
Select * from R where R.sal ∈ [400K, 600K]

Server-side query
Select etuple from RA where RA.salA = Z1 ∨ RA.salA = Z2

Server side Table (encrypted + indexed) RA (sharesA, ageA, salA hold bucket-tags)

etuple     sharesA  ageA  salA
X@#$^&FJ   X1       Y2    Z1
CH$^*(G#!  X2       Y1    Z1
^$*D%L*#   X3       Y2    Z2
*%GH%&)$   X3       Y3    Z3

Client side Table (plain text) R

eid  name   addr   shares  age  sal
345  Tom    Maple  5400    32   390K
876  Mary   Main   5800    22   426K
234  John   River  6000    34   598K
780  Jerry  Ocean  6200    48   634K
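The translation step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the bucket boundaries are taken from the bucket-tags on the storage slide, each bucket is assumed to cover a half-open interval (lower, upper], and the `PLAINTEXT` lookup is a hypothetical stand-in for real decryption. All function names are made up for this sketch.

```python
# Bucket boundaries (in K) and tags for the sal attribute, as listed on the slides.
# Assumption: bucket Zi covers the half-open interval (BOUNDS[i], BOUNDS[i+1]].
BOUNDS = [0, 200, 450, 600, 650, 700]
TAGS = ["Z0", "Z1", "Z2", "Z3", "Z4"]

# Server side: opaque etuples plus bucket tags only.
SERVER_TABLE = [
    {"etuple": "X@#$^&FJ", "salA": "Z1"},
    {"etuple": "CH$^*(G#!", "salA": "Z1"},
    {"etuple": "^$*D%L*#", "salA": "Z2"},
    {"etuple": "*%GH%&)$", "salA": "Z3"},
]

# Hypothetical stand-in for decryption: a client-side lookup from etuple to plaintext.
PLAINTEXT = {
    "X@#$^&FJ": {"eid": 345, "sal": 390},
    "CH$^*(G#!": {"eid": 876, "sal": 426},
    "^$*D%L*#": {"eid": 234, "sal": 598},
    "*%GH%&)$": {"eid": 780, "sal": 634},
}


def translate_range(lo, hi):
    """Query translator: tags of all buckets overlapping the closed range [lo, hi]."""
    return [
        tag
        for tag, b_lo, b_hi in zip(TAGS, BOUNDS, BOUNDS[1:])
        if b_lo < hi and lo <= b_hi
    ]


def server_query(tags):
    """Server: return etuples whose salA tag is in the requested set."""
    return [row["etuple"] for row in SERVER_TABLE if row["salA"] in tags]


def post_process(etuples, lo, hi):
    """Query post processor: decrypt, then drop false positives."""
    rows = [PLAINTEXT[e] for e in etuples]
    return [r for r in rows if lo <= r["sal"] <= hi]


tags = translate_range(400, 600)
encrypted = server_query(tags)
result = post_process(encrypted, 400, 600)
print(tags, len(encrypted), [r["eid"] for r in result])
```

With these inputs the server returns three etuples for tags Z1 and Z2, and the client discards the tuple with sal 390K as a false positive.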
5
Issues in partitioning
How many buckets should one use ?
How to partition the data ?
6
Data Privacy in DAS
Adversary (A): access to server-side data + malicious intentions
Privacy goal of client: to hide all useful information from A
Put all values of an attribute in a single bucket!
Privacy issue in partitioned data: small range of a bucket B + 1 sample value from B ⇒ “almost total” disclosure of all elements in B
7
Research challenges & our contributions
Precision: how to partition data
  Definition
  Optimal partitioning to maximize precision
Privacy: quantifying disclosure
  Adversary’s goals
  Measures of information disclosure
Privacy-Precision trade-off
  Controlled diffusion algorithm
Experiments & Conclusion
8
Precision of range queries
Given a partition of data into M parts
Precision(q) = 1 – (# false positives / # tuples returned for q)
Recall = 1
Workload: all O(N²) range queries are equiprobable (uniform)
[Figure: histogram of Frequency vs. Salary (100K’s) over values 1 to 10, N = 10 (domain size), split into M = 2 buckets: one with N_B = 5, F_B = 32 and one with N_B = 5, F_B = 18, where N_B is the number of domain values in bucket B and F_B its total frequency]
Example query q: Precision = 1 – 20/50 = 0.6
# false positives ∝ ∑_B N_B·F_B = 5·32 + 5·18 = 250
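The two numbers in this example can be reproduced with a short Python sketch. The per-bucket totals (32 and 18) come from the slide; the individual frequencies, in particular their order within values 6 to 10, and the query range [4, 7] are assumptions chosen so that the query has 30 true tuples out of 50 returned.

```python
# Frequencies for salary values 1..10 (100K's). The totals per bucket match the
# slide; the exact order of the frequencies within values 6..10 is an assumption.
FREQ = [4, 4, 4, 10, 10, 4, 6, 2, 2, 4]
BUCKETS = [(1, 5), (6, 10)]  # M = 2 buckets as inclusive value ranges


def bucket_freq(lo, hi):
    """Total frequency of the values lo..hi (inclusive)."""
    return sum(FREQ[v - 1] for v in range(lo, hi + 1))


def precision(q_lo, q_hi):
    """Precision of range query [q_lo, q_hi]; overlapping buckets are returned whole."""
    true_hits = bucket_freq(q_lo, q_hi)
    returned = sum(
        bucket_freq(lo, hi) for lo, hi in BUCKETS if lo <= q_hi and q_lo <= hi
    )
    return 1 - (returned - true_hits) / returned


def workload_cost():
    """Sum of N_B * F_B over all buckets (proportional to total false positives)."""
    return sum((hi - lo + 1) * bucket_freq(lo, hi) for lo, hi in BUCKETS)


print(precision(4, 7))   # 30 true tuples out of 50 returned
print(workload_cost())   # 5*32 + 5*18
```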
9
Query optimal buckets (QOB)
Optimization problem: for the uniform workload, find a partition of the data into M buckets that minimizes the total # false positives, i.e.
Minimize ∑_{B=1..M} N_B·F_B
Example (N = 10 (domain size), 4 buckets):
QOB(1,10,4) = QOB(1,7,3) + Cost(8,10)
  QOB(1,7,3): optimal solution to a sub-problem
  Cost(8,10): cost of rightmost bucket, N_B·F_B = 24
[Figure: histogram of Frequency vs. Salary (100K’s) over values 1 to 10, with the rightmost bucket covering values 8 to 10]
10
QOB (cont.)
[Figure: histogram of Frequency vs. Salary (100K’s) over values 1 to 10, partitioned into four buckets B1, B2, B3, B4]
Optimal cost = ∑_{B=1..4} N_B·F_B = 12·3 + 20·2 + 10·2 + 8·3 = 36 + 40 + 20 + 24 = 120
Time complexity = O(n²M), Space = O(nM)
n = # distinct values in dataset; M = # buckets
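A minimal dynamic-programming sketch of the QOB recurrence, assuming a dense integer domain 1..n and the cost N_B·F_B per bucket. The function name `qob` and the frequency list are illustrative; the frequencies match the bucket totals on the slides, but their order within each bucket is an assumption.

```python
def qob(freq, m):
    """Split values 1..n into m contiguous buckets minimizing sum of N_B * F_B.

    Returns (optimal cost, list of inclusive (lo, hi) bucket ranges).
    """
    n = len(freq)
    prefix = [0]
    for f in freq:
        prefix.append(prefix[-1] + f)

    def cost(i, j):
        # Bucket covering values i..j: width times total frequency.
        return (j - i + 1) * (prefix[j] - prefix[i - 1])

    inf = float("inf")
    # best[k][j]: minimum cost of covering values 1..j with k buckets.
    best = [[inf] * (n + 1) for _ in range(m + 1)]
    split = [[0] * (n + 1) for _ in range(m + 1)]
    best[0][0] = 0
    for k in range(1, m + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):  # first k-1 buckets cover 1..i
                c = best[k - 1][i] + cost(i + 1, j)
                if c < best[k][j]:
                    best[k][j], split[k][j] = c, i

    # Walk the split table backwards to recover the bucket ranges.
    buckets, j = [], n
    for k in range(m, 0, -1):
        i = split[k][j]
        buckets.append((i + 1, j))
        j = i
    return best[m][n], buckets[::-1]


FREQ = [4, 4, 4, 10, 10, 4, 6, 2, 2, 4]  # assumed order within buckets
print(qob(FREQ, 4))
print(qob(FREQ, 2))
```

The three nested loops give the O(n²M) running time quoted above; the two tables use O(nM) space.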
11
Outline
Optimal data partitioning for range queries
Adversarial goals & privacy measures
Balancing privacy and precision
Experiments & conclusion
12
Adversary’s learning model
Need to learn bucket properties to estimate sensitive values
Model: A’s domain knowledge + sample values from buckets ⇒ A learns distribution of values in buckets
Worst case assumption for privacy analysis: A knows exact value distribution for every bucket
13
Adversarial Goal (I)
Individual Centric Information, e.g. “What is the salary of an individual I?”
Value Estimation Power (VEP) of A
Variance of bucket-distribution is an inverse measure of VEP
[Figure: two bucket distributions over the bucket range. Large variance: large average error of value estimation for the adversary (preferred). Small variance: small average error.]
14
Adversarial Goal (II)
Query Centric Information, e.g. “Which individuals have salary in [100k, 150k]?”
Set Estimation Power (SEP) of A
Entropy of bucket-distribution is an inverse measure of SEP*
[Figure: two bucket distributions over the bucket range, with the query range 100k to 150k marked. Low entropy + large variance: small average error of query-set estimation for the adversary. High entropy + large variance: large average error (best case).]
H(X) = −∑_i p_i log p_i
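As a rough illustration of the two measures, the sketch below computes the variance and entropy of a bucket's value distribution from a frequency table. The function name is hypothetical, and the base-2 logarithm is an assumption, since the slide does not fix a base.

```python
import math


def bucket_stats(freq_by_value):
    """Return (variance, entropy in bits) of a bucket's value distribution."""
    total = sum(freq_by_value.values())
    probs = {v: f / total for v, f in freq_by_value.items() if f > 0}
    mean = sum(v * p for v, p in probs.items())
    variance = sum(p * (v - mean) ** 2 for v, p in probs.items())
    entropy = -sum(p * math.log2(p) for p in probs.values())
    return variance, entropy


# Two values far apart: large variance but only one bit of entropy.
print(bucket_stats({100: 5, 150: 5}))
# Same range, values spread evenly: comparable spread, much higher entropy.
print(bucket_stats({v: 1 for v in range(100, 151, 5)}))
```

The first bucket corresponds to the "low entropy + large variance" case: the adversary cannot pin down a numeric value closely, yet has few candidate values to choose from.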
15
Outline
Optimal data partitioning for range queries
Adversarial goals & privacy measures
Balancing privacy and precision
Experiments & conclusion
16
Privacy-Precision Trade-off
Optimal buckets might offer less privacy than desired
  Small variance ⇒ partial disclosure of numeric value
  Small entropy ⇒ total disclosure with high probability (e.g. categorical data)
  Partial detection of query-sets (for all cases)
Objective: an algorithm that allows trading off a bounded amount of query precision for greater variance and entropy
17
The controlled diffusion algorithm
A simple observation
[Figure: a query Q overlapping bucket B0, whose elements are spread over composite buckets CB1, CB2, CB3]
• Let a query Q overlap only with B0
• Suppose the elements of B0 are distributed into CB1, CB2 & CB3 randomly
• Now Q overlaps with CB1, CB2 & CB3
• With the new buckets, the precision for Q drops by a factor of (|CB1|+|CB2|+|CB3|) / |B0|
For any re-distribution scheme in which this ratio is ≤ K for every bucket Bi, precision degradation is bounded above by K
18
Controlled diffusion algorithm
1. Compute optimal buckets B1 … BM on data set D
2. Fix max degradation factor = K
3. Initialize M empty composite buckets CB1 … CBM
4. Set target size of each CB to fCB = |D|/M (equidepth)
5. ∀ Bi, select di CBs at random, where di = K·|Bi|/fCB
6. Diffuse elements of Bi into these uniformly at random
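The steps above could be sketched as follows. Several details are assumptions not given on the slide: di is rounded up and clamped to the range 1..M, each element picks one of the chosen composite buckets independently, and the target size fCB is not enforced as a hard capacity, so the bound K holds only approximately in this sketch. The input is assumed to be non-empty.

```python
import math
import random


def controlled_diffusion(optimal_buckets, k, seed=0):
    """Diffuse optimal buckets B_1..B_M into M composite buckets CB_1..CB_M.

    optimal_buckets: list of lists of values; k: max degradation factor.
    """
    rng = random.Random(seed)
    m = len(optimal_buckets)
    total = sum(len(b) for b in optimal_buckets)
    f_cb = total / m  # equidepth target size |D| / M
    composite = [[] for _ in range(m)]
    for bucket in optimal_buckets:
        # d_i = K * |B_i| / f_CB; rounding up and clamping are assumptions.
        d = min(m, max(1, math.ceil(k * len(bucket) / f_cb)))
        targets = rng.sample(range(m), d)
        for value in bucket:
            composite[rng.choice(targets)].append(value)
    return composite


# Optimal buckets from the running example (frequencies per value are assumed).
B = [
    [1] * 4 + [2] * 4 + [3] * 4,
    [4] * 10 + [5] * 10,
    [6] * 4 + [7] * 6,
    [8] * 2 + [9] * 2 + [10] * 4,
]
CB = controlled_diffusion(B, k=2)
print([len(cb) for cb in CB])
```

A stricter variant would cap each composite bucket at fCB elements, which is what makes the ratio on the previous slide provably at most K.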
19
Controlled Diffusion (Example)
Degradation factor K = 2
Target composite bucket size: fCB = 50/4 = 12.5
E.g. for a bucket with |B| = 12: d = K·|B|/fCB = 2·12/12.5 ≈ 2
[Figure: Freq vs. Values (1 to 10) for the query optimal buckets B1 … B4, and the composite buckets CB1 … CB4 obtained by diffusing their elements; the composite buckets are the final set of buckets on the server]
Metadata size increases from O(M) to O(KM)
20
Some features of the diffusion algorithm
Many consecutive optimal buckets might get diffused into a common set of CBs
  Observed precision degradation < K
Elements with the same value can go to multiple buckets
  Gives it an extra degree of freedom compared to hashing
  Not best for point queries
Random choice in the algorithm
  Each bucket distribution approaches the data distribution as K increases, reducing the information the adversary gains by learning bucket distributions
21
Outline
Optimal data partitioning for range queries
Adversarial goals & privacy measures
Balancing privacy and precision
Experiments & conclusion
22
Experiments
Data sets
  Synthetic Data: 10^5 integers in [0, 999], chosen uniformly at random
  Real Data: 10^4 real values in [-0.8, 8.0] from the “Corel Image” dataset (UCI KDD archive)
Query workloads (2 of size 10^4 each)
  End points chosen uniformly at random from the respective ranges
23
[Figure: three plots on synthetic data; x-axis: Number of Buckets (100 to 400); one curve each for k = 2, 4, 6, 8, 10]
Ratio of Average Precision (synthetic) (a); y-axis: Prec(QOB) / Prec(CB's)
Ratio of Average Std Deviation (synthetic) (a); y-axis: StdDev(CB) / StdDev(QOB)
Ratio of Average Entropy (synthetic) (a); y-axis: Entropy(CB) / Entropy(QOB)
1. Relative decrease in precision of composite buckets
2. Relative increase in standard deviation in composite buckets
3. Relative increase in entropy in composite buckets
24
Composite buckets (sample)
[Figure: two frequency histograms of sample composite buckets (x-axis: Bin, y-axis: Frequency), one for K = 6, M = 350 and one for K = 10, M = 250]
25
Trade-off (Precision vs Entropy)
[Figure: x-axis: Average Precision, y-axis: Average Entropy; series: Opt-Buckets, CB (k=2), CB (k=4), CB (k=6), CB (k=8), CB (k=10)]
Trade-off (Precision vs Std. Dev)
[Figure: x-axis: Average Precision, y-axis: Average Std. Deviation; same series]
• Visualizing trade-offs for various bucketization parameters
• Eg: The marked points show the average entropy & precision we get for 100 buckets & degradation factor of 2
• The same point in the precision vs standard deviation trade-off space
• Provides an easy way to visualize the design space and choose parameters of interest
26
Summary
An optimal algorithm for partitioning data for range queries
Statistical measures of data privacy
  Variance
  Entropy
Fast & simple algorithm for re-bucketizing data
  Bounded amount of precision degradation
  Substantial increase in privacy level
27
Related work
Hacigumus et al., SIGMOD 2002, “Executing SQL over Encrypted Data in the Database Service Provider Model”.
Damiani et al., ACM CCS 2003, “Balancing Confidentiality and Efficiency in Untrusted Relational DBMSs”.
Bouganim et al., VLDB 2002, “Chip-Secured Data Access: Confidential Data on Untrusted Servers”.
28
THANK YOU !
Questions ?
29
Privacy in DAS
Here the goal of “Data Privacy” is not just ensuring “non-disclosure of identity”. It is more general!
Privacy-preserving DM & Statistical DB
• Privacy criteria: Protect against disclosure of identity
• Utility criteria: Minimize information loss, i.e. maximize utility for data miners, retain as much aggregate level information as possible
DAS
• Privacy criteria: Hide as much information as possible (even at the aggregate level)
• Utility criteria: Maintain only the necessary information required for server-side query evaluation (at desired degree of accuracy)
30
Individual Privacy Measure
Average Squared Error of Estimation (ASEE): error in approximating the true value of a r.v. X_B by another r.v. X_B’ (learned by A)
ASEE(X_B, X_B’) = Var(X_B) + Var(X_B’) + (E(X_B) – E(X_B’))²
Variance of bucket distribution, Var(X_B), is our measure of individual privacy (lower bound)
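One way to see where this identity and the lower bound come from, assuming ASEE is defined as the expected squared difference and that X_B and X_B’ are independent (neither assumption is spelled out on the slide):

```latex
\begin{aligned}
\mathrm{ASEE}(X_B, X_B') &= E\big[(X_B - X_B')^2\big] \\
  &= \mathrm{Var}(X_B - X_B') + \big(E[X_B - X_B']\big)^2 \\
  &= \mathrm{Var}(X_B) + \mathrm{Var}(X_B') + \big(E[X_B] - E[X_B']\big)^2 \\
  &\ge \mathrm{Var}(X_B)
\end{aligned}
```

The last step holds because the two dropped terms are non-negative, which is why Var(X_B) serves as a lower bound on the adversary's error whatever X_B’ the adversary learns.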
31
Set oriented Privacy Measure
Entropy of bucket distribution is our measure for query-centric privacy
Measures uncertainty associated with a r.v. (e.g. true class of an element for categorical data)
An inverse measure of the quality of partial solution sets* that A can derive for a query
H(X) = −∑_i p_i log p_i
32
Meta data size increase in diffusion
The meta data increases from O(M) to
K·|B1|/fCB + K·|B2|/fCB + … + K·|BM|/fCB
= (K/fCB)·(|B1| + |B2| + … + |BM|)
= (KM/|D|)·|D| = KM = O(KM)