A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

41
1 A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries Surajit Chaudhuri Gautam Das Vivek Narasayya Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data p. 295 - 306 Presented by Rebecca M. Atchley

description

Surajit Chaudhuri Gautam Das Vivek Narasayya Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data p. 295 - 306. Presented by Rebecca M. Atchley. A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries. Motivation. - PowerPoint PPT Presentation

Transcript of A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries

1

A Robust, Optimization-Based Approach for Approximate Answering

of Aggregate Queries

Surajit ChaudhuriGautam Das

Vivek Narasayya

Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data p. 295 - 306

Presented byRebecca M. Atchley

2

Motivation Decision Support applications - OLAP

and data mining for analyzing large databases have become very popular

Most queries are aggregation queries on these large databases

Expensive and resource intensive Approximate answers given

accurately and efficiently benefit the scalability of the application and are usually “Good Enough”

3

Approaches to Approximation/Data Reduction

This paper’s approach uses pre-computed samples of data instead of complete data

Sampling approaches: Weighted Sampling Congressional sampling On-the-fly sampling (error-prone when selections,

GROUP BYs and joins are used) Workload (Deterministic solution for identical

workloads) “similar” workloads, are considered as an

optimization problem (minimizing the error)

4

Approaches to Approximation/Data Reduction (contd.)

Histograms Wavelets

5

Drawbacks of previous work Lack of rigorous problem formulations

lead to solutions that are difficult to evaluate theoretically

Does not deal with uncertainty in incoming queries that are “similar” but not identical – Assumes fixed workload

Ignores the variance in data distribution of aggregated columns

6

Attacking “similar” workloads

Stratified sampling Minimize error in estimation of

aggregates

7

Architecture for AQP

Queries with selections, foreign-key joins and GROUP BY, containing aggregation functions such as COUNT, SUM, AVG.

8

Architecture for AQP

9

Architecture for AQPInputs: a database and a workload W Offline component for selecting a sample Online component that

Rewrites an incoming query to use the sample to answer the query approximately.

Reports the answer with an estimate of the error in the answer.

ScaleFactor – As in previous works, each record in the sample contains an additional column, ScaleFactor. “The value of the aggregate column of each record in the sample is first scaled up by multiplying with the ScaleFactor, and then aggregated.”

10

Architecture for AQP A workload W is specified as a set of

pairs of queries and their corresponding weights: i.e., W = {<Q1,w1>, … <Qq,wq>}

Weight wi indicates the importance of query Qi in the workload.

Without loss of generality, assume the weights are normalized, i.e., Σiwi =1

11

Architecture for AQPIf correct answer for query Q is y while approximate answer is y’ Relative error : E(Q) = |y - y’| / y Squared error : SE(Q) = (|y - y’| / y)²

If correct answer for the ith group is yi while approximate answer is yi’ Squared error in answering a GROUP BY query Q :

SE(Q) = (1/g) Σi ((yi – yi’)/ yi)²

Given a probability distribution of queries pw Mean squared error for the distribution :

MSE(pw) =ΣQ pw(Q)*SE(Q), (where pw(Q) is probability of query Q) Root mean squared error (L2):

Other error metrics L1 metric : the mean error over all queries in workload L∞ metric : the max error over all queries

)()( WW pMSEpRMSE

12

Special Case: Fixed Workload

Fundamental Regions: For a given relation R and workload W, consider partitioning the records in R into a minimum number of regions R1, R2, …, Rr such that for any region Rj, each query in W selects either all records in Rj or none.

13

FIXEDSAMP Solution – a Deterministic Algorithm called “FIXED”

Step 1: Identify all fundamental regions → r

After step 1, Case A (r ≤ k) and Case B (r >k)(k → sample size)

Case A (r ≤ k): (Pick Sample records) Selects the samples by picking exactly one record from each important fundamental region

Assigns appropriate values to additional columns in the sample records.

Case B (r > k): (Pick Sample records) select k regions and then pick one record from each of the selected regions. The heuristic is to select top k.

Assign Values to Additional Columns. This is an optimization problem, which is solved by partially differentiating and the resulting linear equations using Gauss-Seidel method.

14

Disadvantage

Per-query (probabilistic) error guarantee is not possible

If incoming query is not identical to a query in the given workload, FIXED can result in unpredictable errors

15

Rationale for Stratified Sampling

Stratified sampling is a well-known generalization of uniform sampling where a population is partitioned into multiple strata and samples are selected uniformly from each stratum with “important” strata contributing relatively more samples

An effective scheme for stratification should be such that the expected variance, over all queries in each stratum, is small, and allocate more samples to strata with larger expected variances.

Minimize the MSE of the lifted

workload

16

Lifting Workload to Query Distributions

Should be resilient to situations where “similar” but not identical queries

“Similar”-ity not based on syntaxIf records returned by the two queries have

significant overlap, they’re similar Each record will have a probability

associated with it such that the incoming query will select this record

17

The Non-Special Case with a Lifted Workload

Problem: SAMP Focus on Single-Relation queries with aggregation containing SUM

or COUNT, W consisting of 1 query Q on Relation R Input : R, Pw (a probability distribution function specified by W), and

k Output : A sample of k records (with the appropriate additional

column(s)) such that the MSE(Pw) is minimized. lifted workload

For a given W, define a lifted workload pw, i.e., a probability distribution of incoming queries. high if Q’ is similar to queries in W low if dissimilar

Instead of mapping queries to probabilities, P{Q} maps subsets of R to probabilities.

Objective is to define the distribution P{Q}

18

The Non-Special Case with a Lifted Workload

lifted workload (Cont.) Two parameters δ (½ ≤ δ ≤1) and γ (0 ≤ γ ≤ ½) define the degree to

which the workload “influences” the query distribution. For any given record inside (resp. outside) RQ, the parameter δ (resp. γ) represents the probability that an incoming query will select this record.

P{Q}(R’) is the probability of occurrence of any query that selects exactly

the set of records R’.

n1, n2, n3, and n4 are the counts of records in the regions.

n2 or n4 large (large overlap), P{Q}(R’) is high

n1 or n3 large (small overlap), P{Q}(R’) is low

q

iQi RpwRpi

1}{W )'()'(

19

Example

ProductID Revenue

1 10

2 10

3 10

4 1000

Query Q1 :SELECT COUNT(*)FROM RWHERE PRODUCTID IN (3,4);

Population POPQ1 = {0,0,1,1}

20

Solution STRAT for single-table selection queries with Aggregation

1. Stratification How many strata required during partition? How many records should each stratum have?

2. Allocation Determine the number of samples required

across each strata

3. Sampling

21

Allocation

After stratification into: the r fundamental regions for COUNT query h finer subdivisions of the fundamental regions

for SUM query (since each r may have large internal variance in the aggregate column, we cannot use the same stratification as in the COUNT case)

How do we do distribute the k records for the sample across the r (for COUNT) or h*r (for SUM) strata?

22

The Allocation Step for COUNT Query

We want to minimize the error over queries in pw .

k1, … kr are unknown variables such that Σkj = k.

From Equation (2) on an earlier slide,

MSE(pW) can be expressed as a weighted sum of the MSE of each query in the workload:

Lemma 2: MSE(pW) = Σi wi MSE(p{Q})

q

iQi RpwRpi

1}{W )'()'(

23

For any Q ε W, we express MSE(p{Q}) as a function of the kj’s.

Lemma 3 : For a COUNT query Q in W,

let ApproxMSE(p{Q}) =

Then

The Allocation Step for COUNT Query (Cont.)

24

The Allocation Step for COUNT Query (Cont.)

Since we have an (approximate) formula for MSE(p{Q}), we can express MSE(pw) as a function of the kj’s variables.

Corollary 1 : MSE(pw) = Σj(αj / kj), where each αj is a function of n1,…,nr, δ, and γ.

αj captures the “importance” of a region; it is positively correlated with nj as well as the frequency of queries in the workload that access Rj.

Now we can minimize MSE(pw).

25

Lemma 4: Σj (αj / kj) is minimized subject to Σj kj = k

if kj = k * ( sqrt(αj) / Σi sqrt(αi) )

This provides a closed-form and computationally inexpensive solution to the allocation problem since αj depends only on δ, γ and the number of tuples in each fundamental region.

The Allocation Step for COUNT Query (Cont.)

26

Like COUNT, we express an optimization problem with h*r unknowns k1,…, kh*r.

Unlike COUNT, the specific values of the aggregate column in each region (as well as the variance of values in each region) influence MSE(p{Q}).

Let yj(Yj) be the average (sum) of the aggregate column values of all records in region Rj. Since the variance within each region is small, each value within the region can be approximated as simply yj. Thus to express MSE(p{Q}) as a function of the kj’s for a SUM query Q in W:

Allocation Step for SUM Query

27

Allocation Step for SUM Query

As with COUNT, MSE(pW) for SUM is functionally of the form Σj(αj / kj), and αj depends on the same parameters n1, …nh*r , δ, and γ (see Corollary 1).

The same minimization procedure can be used as in Lemma 4.

29

Experimental Results

FIXED – solution for FIXEDSAMP, fixed workload, identical queries

STRAT – solution for SAMP, workloads with single-table selection queries with aggregation

PREVIOUS WORKUSAMP – uniform random sampling

WSAMP – weighted sampling

OTLIDX – outlier indexing combined with weighted sampling

CONG – Congressional sampling

Experimental Results

Experimental Results

Experimental Results

Experimental Results

Experimental Results

Experimental Results

Experimental Results

Experimental Results

38

Pragmatic Issues

Identifying Fundamental Regions Handling Large Number of

Fundamental Regions Obtaining Integer Solutions Obtaining an Unbiased Estimator

39

Extensions for more General Queries

GROUP BY Queries JOIN Queries Other Extensions

Mix of COUNT and SUM queries

40

Conclusion

The solutions FIXED and STRAT handle the problems of data variance, heterogeneous mixes of queries, GROUP BY and foreign-key joins.

Questions?

41

Thank you!