A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries By : Surajid...
A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries
By: Surajit Chaudhuri, Gautam Das, Vivek Narasayya
Presented by: Sayed Muchallil, September 21st, 2010
CONTENTS
1. INTRODUCTION
2. ARCHITECTURE FOR APPROXIMATE QUERY PROCESSING
3. FIXED WORKLOAD
4. STRATIFIED SAMPLING
5. SOLUTION
6. SUMMARY
Pre-computed samples
Can give approximate answers very efficiently.
A workload is used to make sure that errors are acceptable.
Previous Studies
Solutions are difficult to evaluate theoretically.
They do not formally deal with uncertainty in the expected workload.
They ignore the variance in the data distribution.
Sample
Table R
Product ID   Revenue
1            10
2            10
3            10
4            1000
Only 50% of the records in R can be used as the sample.
Query: “SELECT SUM(Revenue) FROM R”
The exact answer is 1030.
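Sample (cont.)
Sample Table S1: two of the records with Revenue 10 (sample sum 20).
Sample Table S2
Product ID   Revenue
1            10
4            1000
How are these answers obtained? The sample sum is scaled by 2, the inverse of the 50% sampling fraction.
The answer to the query from S1 is 2 × 20 = 40.
The answer to the query from S2 is 2 × 1010 = 2020.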
Sample (cont.)
Large variance in the aggregate column can lead to large relative errors.
Relative error = |y - y’| / y, where y is the exact answer and y’ is the estimate.
Relative error for S1 = |1030 – 40| / 1030 ≈ 0.96
Relative error for S2 = |1030 – 2020| / 1030 ≈ 0.96
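The arithmetic on these sample slides can be reproduced with a minimal sketch. The function names are illustrative, and the membership of S1 is an assumption: the slides only imply that S1 holds two of the Revenue = 10 records.

```python
# Relation R from the slides: ProductID -> Revenue.
R = {1: 10, 2: 10, 3: 10, 4: 1000}

def estimate_sum(sample_ids, relation, fraction):
    """Estimate SUM(Revenue) by scaling the sample sum by 1 / fraction."""
    sample_sum = sum(relation[pid] for pid in sample_ids)
    return sample_sum / fraction

def relative_error(exact, estimate):
    """Relative error |y - y'| / y."""
    return abs(exact - estimate) / exact

exact = sum(R.values())  # exact answer over all of R

# Assumption: S1 = records 1 and 2 (any two Revenue = 10 records give the same sum).
est_s1 = estimate_sum([1, 2], R, 0.5)
# S2 = records 1 and 4, as shown on the slide.
est_s2 = estimate_sum([1, 4], R, 0.5)

err_s1 = relative_error(exact, est_s1)
err_s2 = relative_error(exact, est_s2)
```

Both samples miss the exact answer by 990, one from below and one from above, because a single record carries almost all of the revenue.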
What’s New?
The goal is to pick a sample that minimizes the error.
If the actual workload is identical to the given (fixed) workload, the error will be smaller.
The approach works for queries that are identical or similar to those in the given workload.
Sampling
• Two ways of selecting samples
  – Randomized
  – Deterministic
• A workload W is a set of pairs of queries and their weights.
  – W = {<Q1, w1>, <Q2, w2>, …, <Qq, wq>}
  – Σi wi = 1
Architecture for Approximate Query Processing
Architecture (cont.)
Offline Component
Selects a sample of records from relation R.
Online Component
Rewrites an incoming query to use the sample. What does “rewrites” mean?
Reports the answer with an estimate of the error.
Architecture (cont.)
New method for automatically lifting a given workload.
It is unrealistic to assume that the incoming queries will be identical to the given workload.
The key: the ability to compute a probability distribution pW over incoming queries.
Error Metrics
Relative Error: |y - y’| / y
Squared Error: SE(Q) = (|y - y’| / y)²
Squared Error for a GROUP BY query with g groups:
SE(Q) = (1/g) Σi ((yi – yi’) / yi)²
Given a probability distribution of queries pW:
Mean squared error for the distribution:
MSE(pW) = ΣQ pW(Q) * SE(Q)
Root mean squared error:
RMSE(pW) = √MSE(pW)
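As a minimal sketch, the error metrics can be written out directly. The function names and the two-query distribution used below are hypothetical, chosen only to exercise the formulas.

```python
import math

def squared_error(exact, estimate):
    """SE(Q): squared relative error, averaged over g groups for GROUP BY results."""
    if not isinstance(exact, (list, tuple)):
        exact, estimate = [exact], [estimate]
    g = len(exact)
    return sum(((y - y_est) / y) ** 2 for y, y_est in zip(exact, estimate)) / g

def mse(distribution):
    """MSE: sum of probability * SE(Q) over (probability, SE) pairs."""
    return sum(p * se for p, se in distribution)

def rmse(distribution):
    """RMSE: square root of the MSE."""
    return math.sqrt(mse(distribution))

# Hypothetical distribution: a scalar query and a GROUP BY query, each with probability 0.5.
se_scalar = squared_error(100, 90)             # relative error 0.1
se_grouped = squared_error([10, 20], [10, 25]) # groups with relative errors 0 and 0.25
dist = [(0.5, se_scalar), (0.5, se_grouped)]
total_mse = mse(dist)
total_rmse = rmse(dist)
```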
Fixed Workload
A special case: the incoming queries are “identical” to the given workload.
Problem: FIXEDSAMP
Input: R, W, k
Output: A sample of k records (with appropriate additional columns) such that MSE(W) is minimized.
Fundamental Regions
Relation R contains 9 records.
W consists of 2 queries:
Q1 = select records with C values between 10 and 50
Q2 = select records with C values between 40 and 70
These queries divide relation R into 4 fundamental regions.
Fundamental Regions (cont.)
• The fundamental regions are obtained by partitioning the records in R into a minimum number of regions R1, R2, …, Rr such that for any region Rj, each query in W selects either all records in Rj or none.
• Total number of fundamental regions: r ≤ min(2^|W|, n), where n is the number of records in R.
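One way to see the definition in action is to group records by the exact set of workload queries that select them. The sketch below does this for the two range queries from the example. The nine C values are hypothetical, since the slides do not list them, and the function name is illustrative.

```python
from collections import defaultdict

def fundamental_regions(records, queries):
    """Group records by the exact set of queries that select them."""
    regions = defaultdict(list)
    for rec in records:
        signature = frozenset(name for name, pred in queries.items() if pred(rec))
        regions[signature].append(rec)
    return dict(regions)

# Hypothetical C values for the 9 records (not given on the slides).
C_VALUES = [10, 20, 30, 40, 50, 60, 70, 80, 90]

# The two workload queries from the example, as predicates on C.
QUERIES = {
    "Q1": lambda c: 10 <= c <= 50,
    "Q2": lambda c: 40 <= c <= 70,
}

regions = fundamental_regions(C_VALUES, QUERIES)
for signature, members in regions.items():
    print(sorted(signature), members)
```

Records with the same signature can always share a region, and records with different signatures cannot, so the grouping is the minimum partition. With these values it produces four regions: only Q1, both queries, only Q2, and neither.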
FIXEDSAMP Solution
Step 1. Identify the fundamental regions in R. Two cases: r ≤ k and r > k.
Step 2. Pick sample records.
Step 3. Assign values to the additional columns.
LIFTING WORKLOAD TO QUERY DISTRIBUTION
For a query Q’ that is not identical to any workload query, pW(Q’) is high if Q’ is similar to queries in the workload, and low if not.
Q’ and Q are similar if the records they select have significant overlap.
LIFTED WORKLOAD
p{Q}(R’) is the probability of occurrence of any query that selects exactly the set of records R’.
For any given record inside (resp. outside) RQ, the set of records selected by Q, the parameter δ (resp. γ) represents the probability that an incoming query will select this record.
LIFTED WORKLOAD (Cont.)
δ → 1 and γ → 0: implies that incoming queries are identical to workload queries.
δ → 1 and γ → ½: implies that incoming queries are supersets of workload queries.
δ → ½ and γ → 0: implies that incoming queries are subsets of workload queries.
δ → ½ and γ → ½: implies that incoming queries are unrestricted.
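The role of δ and γ can be illustrated with a small sketch. It assumes that each record is selected independently of the others, which the slides do not state explicitly; under that assumption the probability of a record set is a product of per-record terms. The function names and the four-record relation are illustrative.

```python
from itertools import combinations

def lifted_probability(selected, rq, all_records, delta, gamma):
    """Probability that an incoming query selects exactly `selected`.

    Assumption: records are chosen independently, with probability delta
    for records inside RQ and gamma for records outside RQ.
    """
    p = 1.0
    for rec in all_records:
        prob_in = delta if rec in rq else gamma
        p *= prob_in if rec in selected else (1.0 - prob_in)
    return p

def all_subsets(items):
    """Yield every subset of `items` as a frozenset."""
    items = list(items)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)

records = {1, 2, 3, 4}
rq = {3, 4}  # records selected by the workload query Q

# delta -> 1 and gamma -> 0: only the workload query itself has non-zero probability.
p_identical = lifted_probability(rq, rq, records, 1.0, 0.0)

# Softer setting: queries close to Q are likely, others are not.
p_same = lifted_probability(rq, rq, records, 0.9, 0.1)
total = sum(lifted_probability(s, rq, records, 0.9, 0.1) for s in all_subsets(records))
```

Under the independence assumption the probabilities over all subsets of R sum to 1, so the lifted workload is a proper distribution.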
RATIONALE FOR STRATIFIED SAMPLING
A population is partitioned into multiple strata, and samples are selected uniformly from each stratum.
STRATIFIED SAMPLING
A stratified sampling scheme partitions R into r strata containing n1, …, nr records (where Σj nj = n), with k1, …, kr records uniformly sampled from each stratum (where Σj kj = k).
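Q1 = SELECT COUNT(*) FROM R WHERE ProductID IN (3, 4);
POPQ is the population of query Q: one value per record, 1 if the record satisfies Q and 0 otherwise.
POPQ1 = {0, 0, 1, 1}, which has non-zero variance.
It can be divided into two strata, {0, 0} and {1, 1}, each with zero variance.
Table R
Product ID   Revenue
1            10
2            10
3            10
4            1000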
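The effect of stratifying POPQ1 can be checked with a short sketch. It samples one record per stratum and scales each stratum mean by the stratum size; the function name and the choice of one record per stratum are illustrative assumptions.

```python
import random
from statistics import pvariance

R = {1: 10, 2: 10, 3: 10, 4: 1000}

# POPQ1: 1 if the record satisfies Q1's predicate (ProductID IN (3, 4)), else 0.
pop_q1 = {pid: int(pid in (3, 4)) for pid in R}

def stratified_count(strata, per_stratum, rng):
    """Estimate COUNT by sampling uniformly within each stratum and scaling up."""
    total = 0.0
    for stratum in strata:
        sample = rng.sample(stratum, per_stratum)
        mean = sum(pop_q1[pid] for pid in sample) / per_stratum
        total += mean * len(stratum)
    return total

# Strata by ProductID, matching the split {0, 0} and {1, 1} of POPQ1.
strata = [[1, 2], [3, 4]]

overall_variance = pvariance(list(pop_q1.values()))
stratum_variances = [pvariance([pop_q1[pid] for pid in s]) for s in strata]

# Repeat the estimate under different random seeds.
estimates = {stratified_count(strata, 1, random.Random(seed)) for seed in range(20)}
```

Because every stratum has zero variance, each run returns the exact count of 2 regardless of which record is drawn from each stratum.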
SOLUTION FOR SINGLE-TABLE SELECTION QUERIES WITH AGGREGATION
Stratification: how many strata, and how many records in each stratum.
Allocation: determines how to divide k across the strata.
Sampling: forms the final sample of k records.
SOLUTION FOR COUNT AGGREGATE
Stratification (Lemma 1)
r is not known, so divide R into fundamental regions and treat them as strata.
Allocation (Lemma 2)
MSE(pW) = Σi wi MSE(p{Qi})
MSE(pW) can be expressed as a weighted sum of the MSE of each query in the workload.
SOLUTION FOR COUNT AGGREGATE (Cont.)
For any Q ∈ W, we express MSE(p{Q}) as a function of the kj’s.
Lemma 3: ApproxMSE(p{Q}) = [formula not preserved in the transcript]
SOLUTION FOR COUNT AGGREGATE (Cont.)
Since we have an (approximate) formula for MSE(p{Q}), we can express MSE(pW) as a function of the kj variables.
Corollary 1: MSE(pW) = Σj (αj / kj), where each αj is a function of n1, …, nr, δ, and γ.
αj captures the “importance” of a region; it is positively correlated with nj as well as the frequency of queries in the workload that access Rj.
Now we can minimize MSE(pW).
SOLUTION FOR COUNT AGGREGATE (Cont.)
Lemma 4: Σj (αj / kj) is minimized subject to Σj kj = k
if kj = k * ( sqrt(αj) / Σi sqrt(αi) )
This provides a closed-form and computationally inexpensive solution to the allocation problem, since αj depends only on δ, γ, and the number of tuples in each fundamental region.
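The closed form in Lemma 4 is simple enough to sketch directly. The αj values below are hypothetical, picked only so the arithmetic is easy to follow; in the method they would come from n1, …, nr, δ, and γ. The result is fractional in general, so a real allocation would still need the integer rounding listed under the pragmatic issues.

```python
import math

def allocate(alphas, k):
    """Lemma 4 allocation: k_j = k * sqrt(alpha_j) / sum_i sqrt(alpha_i)."""
    roots = [math.sqrt(a) for a in alphas]
    total = sum(roots)
    return [k * r / total for r in roots]

def objective(alphas, ks):
    """Sum_j alpha_j / k_j, the expression for MSE in Corollary 1."""
    return sum(a / kj for a, kj in zip(alphas, ks))

# Hypothetical alpha values for three strata, chosen only for illustration.
alphas = [1.0, 4.0, 9.0]
k = 12

ks = allocate(alphas, k)                  # proportional to sqrt(alpha_j)
best = objective(alphas, ks)
uniform = objective(alphas, [k / len(alphas)] * len(alphas))
```

With these values the allocation gives the three strata 2, 4, and 6 records, and the objective drops from 3.5 under an even split to 3.0.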
SOLUTION FOR SUM AGGREGATE
Stratification
Bucketing technique: divide fundamental regions with large variance into a set of finer regions.
Treat each region as a stratum.
Allocation
yj (resp. Yj) is the average (resp. sum) of the aggregate column values of all records in region Rj.
SOLUTION FOR SUM AGGREGATE (Cont.)
Each value in the region can be approximated as yj.
This yields an approximate formula for MSE(p{Q}) for a SUM query Q in W.
Pragmatic Issues
Identifying Fundamental Regions
Handling Large Number of Fundamental Regions
Obtaining Integer Solution
Obtaining unbiased error
STRAT ALGORITHM
IMPLEMENTATION AND EXPERIMENTAL RESULT
This experiment compares the STRAT method to other methods:
USAMP – uniform random sampling
WSAMP – weighted sampling
OTLIDX – outlier indexing combined with weighted sampling
CONG – Congressional sampling
COUNT AGGREGATE
SUM AGGREGATE
COUNT AGGREGATE
THANK YOU