Part 3: Query Processing -- Data-Independent Methods

Marianne Winslett (1,3), Xiaokui Xiao (2), Yin Yang (3), Zhenjie Zhang (3), Gerome Miklau (4)
1 University of Illinois at Urbana Champaign, USA
2 Nanyang Technological University, Singapore
3 Advanced Digital Sciences Center, Singapore
4 University of Massachusetts, Amherst, USA


Batch query answering

Given a goal query set W (the workload):

- Direct approach: submit the workload queries w1, w2, w3 to the Laplace mechanism, which returns w1(D) + noise, w2(D) + noise, w3(D) + noise.
- Alternative approach: submit a set of alternative queries A = (a1, a2, a3) to the Laplace mechanism, obtain a1(D) + noise, a2(D) + noise, a3(D) + noise, and derive from them a noisy estimate of w1(D), w2(D), w3(D).

Frequency representation of the database

Relational database:

name     gender   grade
Alice    Female   91
Bob      Male     84
Carl     Male     82
Dave     Male     97
Edwina   Female   88
Faith    Female   78
Ghita    Female   85
...      ...      ...

Frequency vector over {gender, grade}, i.e. {Male, Female} x {A,B,C,D,F}, with cells x1, x2, ..., xn and x = [x1, x2, ... xn]:

gender   grade   count
Male     100     10
Male     99      13
Male     98      5
Male     97      7
...      ...     ...
Female   100     15
Female   99      21
Female   98      4
Female   97      14
Female   96      9

Frequency vector over {grade}, x = [x1, x2, x3, x4]:

grade    count
90-100   10      (x1)
80-90    23      (x2)
70-80    16      (x3)
60-70    3       (x4)

Answering all range queries

Goal: answer all range-count queries over x.

AllRange = { w | w = xi + ... + xj for 1 ≤ i ≤ j ≤ n }

Workload W for n = 4, with x = [10, 23, 16, 3]:

query   range          sum            true answer
w1      range(x1,x4)   x1+x2+x3+x4    52
w2      range(x1,x3)   x1+x2+x3       49
w3      range(x2,x4)   x2+x3+x4       42
w4      range(x1,x2)   x1+x2          33
w5      range(x2,x3)   x2+x3          39
w6      range(x3,x4)   x3+x4          19
w7      range(x1,x1)   x1             10
w8      range(x2,x2)   x2             23
w9      range(x3,x3)   x3             16
w10     range(x4,x4)   x4             3

For a domain of size n, there are 1/2*n*(n+1) = O(n^2) range queries.
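The workload table above can be reproduced with a few lines of Python. This is only an illustrative sketch: the function names `all_range_queries` and `answer` are made up for this example and do not come from the slides.

```python
def all_range_queries(n):
    """Return every range as a 1-based inclusive pair (i, j), longest ranges first."""
    return [(i, i + length - 1)
            for length in range(n, 0, -1)
            for i in range(1, n - length + 2)]


def answer(x, query):
    """True answer of a range-count query over the frequency vector x."""
    i, j = query
    return sum(x[i - 1:j])


x = [10, 23, 16, 3]                      # grade histogram from the slide
W = all_range_queries(len(x))            # w1 .. w10 in the slide's order
answers = [answer(x, q) for q in W]
print(answers)                           # [52, 49, 42, 33, 39, 19, 10, 23, 16, 3]
print(len(W) == len(x) * (len(x) + 1) // 2)   # True: n(n+1)/2 queries
```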

The sensitivity for the workload of all range queries is (n/2)*(n/2+1) = O(n^2).

Approach 1: basic Laplace mechanism

Submit the workload W itself (all ten range queries w1, ..., w10) to the Laplace mechanism.

Two problems:

- high error
- inconsistency

Error is measured as variance.

                      n = 4                        general n
Sensitivity ||W||_1   6                            O(n^2)
Error per query       2(||W||_1/ε)^2 = 72/ε^2      2(||W||_1/ε)^2 = O(n^4)/ε^2

Private output = workload answers + (6/ε) b, where b = (b1, ..., b10) is Laplace noise and ||W||_1 = 6.

Example:

query   true answer   noise   private output
w1      52             8.2    60.2
w2      49            -5.4    44.6
w3      42            -3.1    38.9
w4      33             6.6    39.6
w5      39            -7.9    31.1
w6      19             2.4    21.4
w7      10            -3.0    7.0
w8      23            -4.9    18.1
w9      16             6.7    22.7
w10     3              4.6    7.6

The private output is inconsistent: the noisy counts for x1, ..., x4 sum to 7.0 + 18.1 + 22.7 + 7.6 = 55.4, while the noisy answer to w1 = x1+x2+x3+x4 is 60.2.
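A minimal sketch of Approach 1, assuming the workload is represented as a 0/1 matrix and that the sensitivity ||W||_1 is the largest column sum. All helper names here are hypothetical, and the Laplace sampler (difference of two exponentials) is one standard construction, not something the slides specify.

```python
import random


def workload_matrix(n):
    """0/1 matrix with one row per range query, longest ranges first."""
    rows = []
    for length in range(n, 0, -1):
        for start in range(n - length + 1):
            rows.append([1 if start <= k < start + length else 0 for k in range(n)])
    return rows


def l1_sensitivity(M):
    """Largest column sum of absolute values, written ||M||_1 on the slides."""
    return max(sum(abs(row[k]) for row in M) for k in range(len(M[0])))


def laplace(scale, rng):
    """One Laplace(0, scale) sample: the difference of two Exp(1) draws, scaled."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def laplace_mechanism(M, x, eps, rng):
    """Answer every row of M on x, adding Laplace noise of scale ||M||_1 / eps."""
    scale = l1_sensitivity(M) / eps
    return [sum(m * v for m, v in zip(row, x)) + laplace(scale, rng) for row in M]


W = workload_matrix(4)
sens = l1_sensitivity(W)                  # 6 for n = 4
eps = 1.0
var_per_query = 2 * (sens / eps) ** 2     # 72 / eps^2
noisy = laplace_mechanism(W, [10, 23, 16, 3], eps, random.Random(0))

# Inconsistency in the slide's own example: noisy total vs. sum of noisy counts.
slide_output = [60.2, 44.6, 38.9, 39.6, 31.1, 21.4, 7.0, 18.1, 22.7, 7.6]
gap = round(slide_output[0] - sum(slide_output[6:]), 1)   # 60.2 - 55.4
```

Because each of the ten answers receives independent noise, nothing forces the noisy range sums to agree with one another, which is the inconsistency problem listed above.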

Approach 2: noisy frequency counts

Use the Laplace mechanism to get noisy estimates for each xi.

Queries submitted: I = (x1, x2, x3, x4), with ||I||_1 = 1.
Private output: z = (z1, z2, z3, z4), where zi = xi + (1/ε) bi and b is Laplace noise.

Derived workload answers:

w1 = z1+z2+z3+z4    w2 = z1+z2+z3    w3 = z2+z3+z4
w4 = z1+z2          w5 = z2+z3       w6 = z3+z4
w7 = z1             w8 = z2          w9 = z3          w10 = z4

For w = range(xi,xj): Error(w) = 2(j-i+1)/ε^2, e.g. 8/ε^2 for range(x1,x4) and 2/ε^2 for a single count.

(Presenter note: explain computation of estimates.)

Approach 3: hierarchical queries [Hay, 2010]

Hierarchical queries: recursively partition the domain, computing sums of each interval.

Queries submitted: H = (x1+x2+x3+x4, x1+x2, x3+x4, x1, x2, x3, x4), with ||H||_1 = 3 = log n + 1.
Private output: z = (z1, ..., z7), where zi is the answer to the i-th query of H plus (3/ε) bi and b is Laplace noise.

Derived workload answers: more than one possible estimate for a range query can be derived from z.

Possible estimates for query range(x2,x3) = x2 + x3:

- z5 + z6
- z1 - z4 - z7
- z2 - z4 + z6

Least-squares estimate: (6z1 + 3z2 + 3z3 - 9z4 + 12z5 + 12z6 - 9z7)/21

Idea: only a small number of noisy outputs are needed to estimate any range query.

Approach 4: wavelet queries [Xiao, 2010]

Wavelet queries: use the Haar wavelet to get a noisy summary of the data.

Queries submitted: Y = (x1+x2+x3+x4, x1+x2-x3-x4, x1-x2, x3-x4), with ||Y||_1 = 3 = log n + 1.
Private output: z = (z1, z2, z3, z4), where zi is the answer to the i-th query of Y plus (3/ε) bi and b is Laplace noise.

Estimate for query range(x2,x3) = x2 + x3: .5z1 + 0z2 - .5z3 + .5z4

Approaches for workload AllRange

strategy            queries                                    max/avg error
Noisy counts (I)    x1, x2, x3, x4                             O(n/ε^2)
Hierarchical (H)    x1+x2+x3+x4, x1+x2, x3+x4, x1, x2, x3, x4  O(log^3 n/ε^2)
Wavelet (Y)         x1+x2+x3+x4, x1+x2-x3-x4, x1-x2, x3-x4     O(log^3 n/ε^2)

- Noisy counts: very low sensitivity, but large ranges estimated badly.
- Hierarchical and wavelet: low sensitivity, and all range queries can be estimated using no more than log n output entries.

(Presenter note: state error bounds from respective papers.)

Error: workload of all range queries (ε = 0.1, n = 1024)

[Figure: error under ε-differential privacy, plotted from small ranges to big ranges.]
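The estimates quoted for range(x2,x3) under the hierarchical and wavelet strategies, and the per-query error of each strategy, can be checked exactly for n = 4. The sketch below is an illustration rather than code from the cited papers: it assumes the least-squares estimate of a query w from noisy strategy answers z is c . z with c = w A+ and A+ = (A^T A)^-1 A^T, and that the resulting variance is 2 (||A||_1/ε)^2 times the squared length of c. The helper names are hypothetical.

```python
from fractions import Fraction

I = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]   # noisy counts
H = [[1, 1, 1, 1], [1, 1, 0, 0], [0, 0, 1, 1]] + I             # hierarchical
Y = [[1, 1, 1, 1], [1, 1, -1, -1], [1, -1, 0, 0], [0, 0, 1, -1]]  # Haar wavelet


def solve(M, b):
    """Solve M y = b exactly by Gauss-Jordan elimination (M square, invertible)."""
    n = len(M)
    aug = [[Fraction(v) for v in row] + [Fraction(b[i])] for i, row in enumerate(M)]
    for c in range(n):
        p = next(r for r in range(c, n) if aug[r][c] != 0)
        aug[c], aug[p] = aug[p], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n] for row in aug]


def ls_coefficients(A, w):
    """Coefficients c = w A+ such that c . z is the least-squares estimate of w . x."""
    n = len(A[0])
    AtA = [[sum(row[i] * row[j] for row in A) for j in range(n)] for i in range(n)]
    y = solve(AtA, w)                                   # y = (A^T A)^-1 w
    return [sum(a * v for a, v in zip(row, y)) for row in A]   # c = A y


def variance(A, w, eps=1):
    """Variance of the least-squares estimate of w under strategy A."""
    sens = max(sum(abs(row[k]) for row in A) for k in range(len(A[0])))
    c = ls_coefficients(A, w)
    return 2 * Fraction(sens) ** 2 * sum(v * v for v in c) / Fraction(eps) ** 2


w5 = [0, 1, 1, 0]                                       # range(x2,x3)
print([int(c * 21) for c in ls_coefficients(H, w5)])    # [6, 3, 3, -9, 12, 12, -9]
print(ls_coefficients(Y, w5))                           # 1/2, 0, -1/2, 1/2
print(variance(I, w5), variance(H, w5), variance(Y, w5))   # 4, 144/7, 27/2
```

At this tiny domain size the noisy counts have the smallest variance for range(x2,x3); the O(log^3 n/ε^2) bounds for H and Y only pay off as n grows.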

Visualizing error: identity strategy

[Figure: error of the identity strategy for n = 128, over all ranges from range(x1,x1) to range(x1,x128).]

Visualizing error: hierarchical v. wavelet

[Figure: two panels, error of the hierarchical strategy and error of the wavelet strategy (branching = 2).]

(Presenter note: state result about asymptotic equivalence?)

Data-independent methods

Two key ideas in choosing alternative query set A:

- low sensitivity (typically much lower than the workload itself).
- W can be estimated efficiently from A.

Can we do better?

- Are these approaches optimal for all range queries?
- What about other workloads? Arbitrary sets of range queries, data cubes, sets of marginals, CDFs, arbitrary sets of predicate counting queries, etc.

Batch query answering

Given goal query set W (workload):

- (Design) Choose alternative query set A
- (Apply Laplace) Use the Laplace mechanism to answer A
- (Derivation) Compute each query in W using answers to A

[Figure: the data and the alternative queries A = (a1, a2, a3) go to the Laplace mechanism, which returns a1(D) + noise, a2(D) + noise, a3(D) + noise; from these, noisy estimates of w1(D), w2(D), w3(D) are derived for the workload W = (w1, w2, w3).]

(Presenter note: generalize, remove matrix mechanism.)

The matrix mechanism

Given a workload W and a strategy matrix A, the following randomized algorithm is ε-differentially private:

Matrix_A(W,x) = Wx + (||A||_1/ε) W A+ b

where b = Lap(1) is a vector of independent Laplace noise and A+ is the pseudo-inverse of A. The three parts are the true answer Wx, the scaling by ||A||_1, and the transformation by W A+.

Compare with the Laplace mechanism:

Laplace(W,x) = Wx + (||W||_1/ε) b

The matrix mechanism is the Laplace mechanism instantiated with strategy A, followed by derivation of the workload answers:

x̂ = A+ (Ax + (||A||_1/ε) b)
W x̂ = W A+ (Ax + (||A||_1/ε) b)
W x̂ = Wx + (||A||_1/ε) W A+ b

W x̂ are the derived noisy answers to workload W. [Li, 2010]

This is never worse than the Laplace mechanism: even if we have no idea how to choose A, we could set A = W.
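The formula Matrix_A(W,x) = Wx + (||A||_1/ε) W A+ b can be sketched directly. The code below is a hedged illustration, not the implementation from [Li, 2010]: it assumes A has full column rank so that A+ = (A^T A)^-1 A^T, it uses exact fractions for the linear algebra, and every function name is invented for this example.

```python
import random
from fractions import Fraction


def transpose(M):
    return [list(col) for col in zip(*M)]


def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def inverse(M):
    """Exact inverse of a square invertible matrix by Gauss-Jordan elimination."""
    n = len(M)
    aug = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(M)]
    for c in range(n):
        p = next(r for r in range(c, n) if aug[r][c] != 0)
        aug[c], aug[p] = aug[p], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]


def pinv(A):
    """A+ = (A^T A)^-1 A^T; assumes A has full column rank."""
    At = transpose(A)
    return matmul(inverse(matmul(At, A)), At)


def sensitivity(A):
    return max(sum(abs(row[k]) for row in A) for k in range(len(A[0])))


def matrix_mechanism(W, A, x, eps, rng):
    """Answer A with Laplace noise, estimate x by least squares, then derive W."""
    scale = sensitivity(A) / eps
    noisy = [sum(a * v for a, v in zip(row, x))
             + scale * (rng.expovariate(1.0) - rng.expovariate(1.0)) for row in A]
    x_hat = [sum(float(p) * z for p, z in zip(row, noisy)) for row in pinv(A)]
    return [sum(w * v for w, v in zip(row, x_hat)) for row in W]


def total_error(W, A, eps=1):
    """Sum over the rows w of W of the variance 2 (||A||_1/eps)^2 ||w A+||^2."""
    WAp = matmul(W, pinv(A))
    return (2 * Fraction(sensitivity(A)) ** 2
            * sum(v * v for row in WAp for v in row) / Fraction(eps) ** 2)


# All range queries over n = 4, longest first, as 0/1 rows.
W = [[1 if s <= k < s + L else 0 for k in range(4)]
     for L in range(4, 0, -1) for s in range(4 - L + 1)]
I = [[int(i == j) for j in range(4)] for i in range(4)]

laplace_total = 10 * 2 * sensitivity(W) ** 2     # 720: basic Laplace, eps = 1
print(total_error(W, W), total_error(W, I))      # 288 40
out = matrix_mechanism(W, W, [10, 23, 16, 3], 1.0, random.Random(1))
print(abs(out[0] - sum(out[6:])) < 1e-6)         # True: derived answers are consistent
```

Even with the naive choice A = W, the total error drops from 720 to 288 here, and the derived answers are mutually consistent because they all come from a single estimate of x.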

If W is square, the matrix mechanism with A = W is equivalent to the Laplace mechanism. Otherwise, it is better.

Strategies equivalent to wavelet

Wavelet Y, ||Y||_1 = 3:

 1   1   1   1
 1   1  -1  -1
 1  -1   0   0
 0   0   1  -1

[Figure: two further strategies, labeled with sensitivities 2.414 and 3.]

Neither the hierarchical nor the wavelet strategy is optimal, i.e. there exist uniformly better strategies with matching error profiles.

Example (both strategies have sensitivity 3; the error of the left one is greater):

1 1 0 0          1 1 0 0
0 0 1 1          0 0 1 1
1 0 0 0          2 0 0 0
0 1 0 0          0 2 0 0
0 0 1 0     >    0 0 2 0
0 0 0 1          0 0 0 2
1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1

Overview of data-independent methods

Method                       Goal workload               Strategy                Adaptive
Fourier [Barak 07]           sets of marginals           Fourier basis vectors   YES
Wavelet [Xiao 10]            All Range (multi-dim)       Haar wavelet            NO
Hierarchical [Hay 10]        All Range (one-dim)         k-order tree            NO
Matrix Mechanism [Li, 10]    sets of linear queries      set of linear queries   YES
Cuboid [Ding, 11]            sets of data cubes          set of cuboids          YES
Quad-tree [Cormode, 12]      Range queries (multi-dim)   quad-tree               NO

Adaptive: alternative query set customized to workload W.

Adaptive: Fourier basis method [Barak, 2007]

- Goal workload: a set of low-order marginals over multi-dimensional data.
- Alternative query set: subset of Fourier basis vectors.
- Adaptivity: any workload of marginals can be expressed using a small number of Fourier basis vectors, reducing sensitivity. Naturally adaptive, without explicit optimization step.

Adaptive: Cuboid method [Ding, 2011]

- Goal workload: a set of cuboids W selected by the user.
- Alternative query set: a subset A of cuboids.
- Adaptivity: select A to minimize the max error over all cuboids in W. NP-hard, so an approximation algorithm based on set-cover is used to achieve a log(|W|+2)^2 approximation to optimal.

[Figure: lattice of cuboids over the attributes Sex, Age, Salary: (Sex, Age, Salary); (Sex, Age), (Sex, Salary), (Age, Salary); Sex, Age, Salary; *.]

Adaptive: the matrix mechanism

- Goal workload: any set of linear queries W selected by the user.
- Alternative query set: any set of linear queries.
- Adaptivity: select A to minimize the average per-query error for W. Solved exactly using semi-definite programming (but not feasible in practice). Solved approximately by designing A from scaled eigenvectors of W.
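To make "select A to minimize the average per-query error for W" concrete, the sketch below scores a few fixed candidate strategies on the AllRange workload for n = 4 and picks the best. It is not the semi-definite program or the eigenvector design of [Li, 2010] and [Li, 2012]; it is a brute-force comparison under the assumption that the error of query w under strategy A is 2 (||A||_1/ε)^2 w (A^T A)^-1 w^T. The names `A_dup` and `A_scaled` are hypothetical labels for the two ten-row and six-row strategies compared in the "Strategies equivalent to wavelet" slide.

```python
from fractions import Fraction


def solve(M, b):
    """Solve M y = b exactly by Gauss-Jordan elimination (M square, invertible)."""
    n = len(M)
    aug = [[Fraction(v) for v in row] + [Fraction(b[i])] for i, row in enumerate(M)]
    for c in range(n):
        p = next(r for r in range(c, n) if aug[r][c] != 0)
        aug[c], aug[p] = aug[p], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n] for row in aug]


def query_error(A, w, eps=1):
    """Variance of the least-squares estimate of w: 2 (||A||_1/eps)^2 w (A^T A)^-1 w^T."""
    n = len(A[0])
    sens = max(sum(abs(row[k]) for row in A) for k in range(n))
    AtA = [[sum(row[i] * row[j] for row in A) for j in range(n)] for i in range(n)]
    y = solve(AtA, w)
    quad = sum(wi * yi for wi, yi in zip(w, y))
    return 2 * Fraction(sens) ** 2 * quad / Fraction(eps) ** 2


def total_error(A, W, eps=1):
    return sum(query_error(A, w, eps) for w in W)


# AllRange workload over n = 4, longest ranges first.
W = [[1 if s <= k < s + L else 0 for k in range(4)]
     for L in range(4, 0, -1) for s in range(4 - L + 1)]

I = [[int(i == j) for j in range(4)] for i in range(4)]
pairs = [[1, 1, 0, 0], [0, 0, 1, 1]]
candidates = {
    "I": I,                                                   # noisy counts
    "H": [[1, 1, 1, 1]] + pairs + I,                          # hierarchical
    "Y": [[1, 1, 1, 1], [1, 1, -1, -1], [1, -1, 0, 0], [0, 0, 1, -1]],  # wavelet
    "A_dup": pairs + I + I,                                   # identity rows repeated
    "A_scaled": pairs + [[2 * v for v in row] for row in I],  # identity rows scaled by 2
}

totals = {name: total_error(A, W) for name, A in candidates.items()}
best = min(totals, key=totals.get)
print(totals["Y"], totals["A_dup"], totals["A_scaled"])   # 108 108 66
print(best)                                               # I
```

In this small case `A_dup` has exactly the wavelet's error while `A_scaled` is lower on every query, which illustrates the claim that the wavelet strategy is not optimal. The noisy counts win overall only because n = 4 is far too small for the logarithmic strategies to pay off.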
Adaptive matrix mechanism: [Li, 2012], [Li, 2010]

Summary: data-independent methods

For batch query answering, it is possible to exploit properties of the workload to significantly improve accuracy over typical applications of the Laplace mechanism.

By submitting an alternative query set to the Laplace mechanism and inferring answers:

- Sensitivity is reduced.
- Noise ultimately added to workload queries is correlated (not independent), which can fit correlation amongst workload queries.

Next: exploiting properties of the input database (data-dependent methods).

References

[Barak, 2007] B. Barak, K. Chaudhuri, C. Dwork, S. Kale, F. McSherry, and K. Talwar. Privacy, accuracy, and consistency too: a holistic solution to contingency table release. In Principles of Database Systems (PODS), 2007.
[Hay, 2010] M. Hay, V. Rastogi, G. Miklau, and D. Suciu. Boosting the accuracy of differentially-private queries through consistency. Proceedings of the VLDB Endowment (PVLDB), 2010.
[Xiao, 2010] X. Xiao, G. Wang, and J. Gehrke. Differential privacy via wavelet transforms. In International Conference on Data Engineering (ICDE), 2010.
[Li, 2010] C. Li, M. Hay, V. Rastogi, G. Miklau, and A. McGregor. Optimizing linear counting queries under differential privacy. In Principles of Database Systems (PODS), 2010.
[Ding, 2011] B. Ding, M. Winslett, J. Han, and Z. Li. Differentially private data cubes: optimizing noise sources and consistency. In SIGMOD, pages 217-228. ACM, 2011.
[Cormode, 2012] G. Cormode, M. Procopiuc, D. Srivastava, E. Shen, and T. Yu. Differentially private spatial decompositions. In ICDE, 2012.
[Li, 2012] C. Li and G. Miklau. An adaptive mechanism for accurate query answering under differential privacy. PVLDB, 2012.