1
Scalable Approximate Query Processing through Scalable Error Estimation
Kai Zeng, UCLA
Advisor: Carlo Zaniolo
2
Why Approximate Query Processing?
• AQP is critical for massive data
  – Ever-growing size of big data
  – Need for timely and cost-effective analysis
  – Widely applied
    • RDBMSs (e.g., online aggregation)
    • MapReduce systems (e.g., BlinkDB)
    • Data stream systems (load shedding)
3
Sampling & Quality assessment
• Sampling: widely used in AQP
• Error estimation: fundamental in AQP
  – Analytic error estimation
  – Bootstrap
[Figure: a sample (6, 2, 7, 8, 5, 1, 3, 4, 9, 10) is drawn from the massive data; running AVG on the sample gives the approximate mean 5.5.]
Need to assess the quality! What is the error of this approx. mean?
4
Analytic Error Estimation
• Use closed-form formulas
• Pro: very fast
• Con: restricted to simple aggregates
[Figure: a sample (1, 2, 3, 4, 5, 6, 7, 8, 9, 10) is drawn from the massive data; the query AVG gives the approximate mean 5.5, and the collected number of tuples and variance are fed to the Central Limit Theorem.]
What if I want to estimate?
1. Complex SQL queries
2. Data mining tasks
3. ….
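As a rough illustration of the closed-form approach above, the following sketch computes a CLT-based 95% interval for AVG on the slide's sample (1, ..., 10). The function name `clt_confidence_interval` and the choice of z = 1.96 are assumptions made for this example, not taken from the talk.

```python
import math
import statistics


def clt_confidence_interval(sample, z=1.96):
    """Return (mean, half_width) of a CLT-based interval for the mean.

    Only the number of tuples and the sample variance are needed,
    which is why this kind of estimate is cheap for simple aggregates.
    """
    n = len(sample)
    mean = statistics.fmean(sample)
    variance = statistics.variance(sample)  # unbiased sample variance
    half_width = z * math.sqrt(variance / n)
    return mean, half_width


sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
mean, half_width = clt_confidence_interval(sample)
print(f"AVG = {mean:.2f} +/- {half_width:.2f}")
```

The formula is specific to the mean; a different aggregate would need its own derivation, which is the restriction the slide points out.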
5
Bootstrap [Efron 1979]
• Resample with replacement from the sample
• Run the query on the resample
• Repeat many times, typically 100s or even 1000s of times
[Figure: the sample (6, 2, 7, 8, 5, 1, 3, 4, 9, 10), with sample mean 5.5, is resampled into resamples of the same size, e.g. (2, 10, 10, 5, 9, 2, 5, 10, 8, 10), (8, 1, 2, 1, 1, 9, 7, 4, 10, 1), (9, 10, 2, 10, 7, 1, 3, 6, 10, 10), ……; the query AVG is run on each resample and the results (6.8, 7.1, 4.5, ……) are collected.]
6
• Compute the error from the empirical distribution of all the query results
95%
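The bootstrap procedure (resample, run the query, repeat, then read the error off the empirical distribution) can be sketched as below. This is a minimal illustration, assuming a percentile-style 95% interval; the helper name `bootstrap_interval`, the fixed seed, and the 1000 trials are choices made for the example.

```python
import random
import statistics


def bootstrap_interval(sample, query, trials=1000, level=0.95, seed=0):
    """Percentile interval from the empirical distribution of query results.

    `query` is treated as a black box: any function of a list of values.
    """
    rng = random.Random(seed)
    n = len(sample)
    # Each trial resamples n values with replacement and reruns the query.
    results = sorted(query(rng.choices(sample, k=n)) for _ in range(trials))
    tail = (1.0 - level) / 2.0
    lo = results[int(tail * trials)]
    hi = results[min(trials - 1, int((1.0 - tail) * trials))]
    return lo, hi, results


sample = [6, 2, 7, 8, 5, 1, 3, 4, 9, 10]
lo, hi, results = bootstrap_interval(sample, statistics.fmean)
print(f"95% of bootstrap AVG results fall in [{lo:.2f}, {hi:.2f}]")
```

Swapping `statistics.fmean` for any other function leaves the procedure unchanged, at the cost of rerunning the query once per trial.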
7
Notes on Bootstrap
• Bootstrap treats Q as a black box
• Can handle (almost) arbitrarily complex queries, including UDFs!
• Embarrassingly parallel
• Computationally demanding
• Uses too many resources
Error Estimation
• Analytic error estimation
  – Fast but limited to simple aggregates
• Bootstrap (Monte Carlo simulation):
  – Expensive but general
Fast and General?
9
How To Make Bootstrap Faster
• Optimize the Monte-Carlo simulation process
  – EARL system [VLDB12][ICDE13]
• Bypass the Monte-Carlo simulation process
  – Analytical Bootstrap method (ABM) [SIGMOD14]
10
EARLY ACCURATE RESULT LIBRARY (EARL PROJECT)
11
Motivation
• Existing systems (e.g., Hadoop) use batch processing
  – High latency
  – Waste of resources
• Goals: a general driver that can
  – Return approximate results
  – With accuracy guarantee
  – For a wide range of tasks
12
Incremental Computation
• A small sample → a larger sample → ……
• Use Bootstrap to test accuracy
• Time efficient: enable early returns
• Resource efficient: do not waste resources
[Figure: a sample is drawn from the massive data and bootstrapped to check "Accurate enough?"; if not, the sample is enlarged and the check is repeated, and so on.]
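The sample-enlarging loop can be sketched as follows. This is not the EARL implementation: it reruns a plain bootstrap from scratch at every step, with none of the intra- or inter-iteration optimizations, and the names `early_result` and `bootstrap_relative_error`, the doubling schedule, and the 2% target are all assumptions made for illustration.

```python
import random
import statistics


def bootstrap_relative_error(sample, query, trials, rng):
    """Estimate on the sample plus bootstrap std. deviation relative to it."""
    n = len(sample)
    results = [query(rng.choices(sample, k=n)) for _ in range(trials)]
    estimate = query(sample)
    # Assumes the estimate is non-zero (true for the synthetic data below).
    return estimate, statistics.stdev(results) / abs(estimate)


def early_result(data, query, target=0.02, start=100, trials=200, seed=0):
    """Grow the sample until the bootstrap says it is accurate enough."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)  # a prefix of a shuffled list is a uniform sample
    size = min(start, len(shuffled))
    while True:
        sample = shuffled[:size]
        estimate, error = bootstrap_relative_error(sample, query, trials, rng)
        if error <= target or size == len(shuffled):
            return estimate, error, size
        size = min(2 * size, len(shuffled))  # hypothetical doubling schedule


data_rng = random.Random(42)
data = [data_rng.gauss(100, 20) for _ in range(20000)]  # synthetic data
estimate, error, size = early_result(data, statistics.fmean)
print(f"estimate={estimate:.2f}, relative error={error:.4f}, tuples used={size}")
```

The loop returns as soon as the bootstrap error meets the target, so most of the data is never touched when a small sample suffices.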
13
Basic Ideas: Optimization
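• Intra-iteration optimization
  – We have to repeat the same computation on all resamples
  – Much of the data is shared!
  – Compute the shared part once
[Figure: within one iteration, the function f is applied to the sample S and to each resample S1, S2, ……; the input is split into a shared part and a non-shared part.]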
14
Basic Ideas: Optimization
• Inter-iteration optimization
  – Reuse the old computation
  – Cannot simply merge, because of randomness
  – Keep a small sample in memory for adjustment
[Figure: one iteration works on S with resamples S1, S2, ……; the next iteration works on S and ΔS with resamples S1′, S2′, ……. S1, ΔS1: adjustment is small.]
15
ANALYTICAL BOOTSTRAP
16
Analytical Bootstrap
• Scope: relational algebra
  σ (selection), π (projection), ⋈ (join), γ (aggregate)
• Basic idea
  – Annotate tuples with random variables
  – Extend relational algebra to manage these random variables
A single-round evaluation = 100s/1000s of bootstrap trials!
# of times a tuple will be drawn in a bootstrap trial
17
Bootstrap Resamples As Multiset DB
• Bootstrap generates multiset relations
  – Tuples annotated with multiplicities
  – Query processing manipulates these multiplicities

Sample:
ID  Product  Qty
1   A        2
2   B        3
3   A        2
4   A        4

Resamples:
ID  Product  Qty    ID  Product  Qty    ID  Product  Qty
1   A        2      1   A        2      1   A        2
2   B        3      2   B        3      3   A        2
2   B        3      4   A        4      3   A        2
4   A        4      4   A        4      4   A        4
……

A resample as a multiset relation (# = multiplicity):
ID  Product  Qty  #
1   A        2    1
2   B        3    0
3   A        2    2
4   A        4    1
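To make the multiset view concrete, here is a small sketch that turns a bootstrap resample into per-tuple multiplicities. The helper names `to_multiplicities` and `resample_multiplicities` are invented for this example; the table contents are the slide's.

```python
import random
from collections import Counter

# The slide's sample relation: (ID, Product, Qty).
sample = [(1, "A", 2), (2, "B", 3), (3, "A", 2), (4, "A", 4)]


def to_multiplicities(drawn_ids, all_ids):
    """Count how many times each base tuple appears in a resample."""
    counts = Counter(drawn_ids)
    return [counts[i] for i in all_ids]


def resample_multiplicities(rows, rng):
    """Draw len(rows) tuples with replacement and return their multiplicities."""
    ids = [row[0] for row in rows]
    drawn = rng.choices(ids, k=len(ids))
    return to_multiplicities(drawn, ids)


# The resample containing IDs 1, 3, 3, 4 becomes multiplicities (1, 0, 2, 1).
slide_example = to_multiplicities([1, 3, 3, 4], [1, 2, 3, 4])
print(slide_example)

# A fresh random resample; the multiplicities always sum to the sample size.
multiplicities = resample_multiplicities(sample, random.Random(0))
for row, m in zip(sample, multiplicities):
    print(row, m)
```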
18
Querying Multiset DB: Projection
• Projection takes sum of multiplicities
ID  Product  Qty  #
1   A        2    1
2   B        3    0
3   A        2    2
4   A        4    1

Projection onto (Product, Qty):
Product  Qty  #
A        2    3    (1+2=3)
B        3    0
A        4    1

SELECT Product, SUM(Qty)
FROM Orders
WHERE Qty < (SELECT SUM(Qty) / 4
             FROM Orders)
GROUP BY Product
How many products are ordered by small quantity orders?
19
Querying Multiset DB: Aggregate
• Aggregate takes weighted sum of multiplicities
ID  Product  Qty  #
1   A        2    1
2   B        3    0
3   A        2    2
4   A        4    1

SUM(Qty)  #
10        1

2×1 + 3×0 + 2×2 + 4×1 = 10
20
Querying Multiset DB: Join
• Join takes product of multiplicities
Product  Qty  #
A        2    3
B        3    0
A        4    1

SUM(Qty)  #
10        1

Join result:
Product  Qty  SUM(Qty)  #
A        2    10        3    (3×1=3)
B        3    10        0
A        4    10        1
21
Querying Multiset DB: Selection
• Selection takes product of multiplicities
Before selection:
Product  Qty  SUM(Qty)  #
A        2    10        3
B        3    10        0
A        4    10        1

After selection:
Product  Qty  SUM(Qty)  #
A        2    10        3    (3×1=3)
B        3    10        0
A        4    10        0    (1×0=0)
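The four multiplicity rules (projection sums, aggregate takes a weighted sum, join and selection multiply) can be replayed on the running example. The sketch below is a toy evaluation over plain Python structures, with function names chosen for illustration; it is not how a database engine would implement them.

```python
from collections import defaultdict

# (ID, Product, Qty, multiplicity) for one bootstrap resample.
orders = [(1, "A", 2, 1), (2, "B", 3, 0), (3, "A", 2, 2), (4, "A", 4, 1)]


def project(rows):
    """Projection onto (Product, Qty): sum multiplicities of merged tuples."""
    out = defaultdict(int)
    for _id, product, qty, mult in rows:
        out[(product, qty)] += mult
    return dict(out)


def aggregate_sum(rows):
    """SUM(Qty): weighted sum, each Qty weighted by its multiplicity."""
    return sum(qty * mult for _id, _product, qty, mult in rows)


projected = project(orders)
total = aggregate_sum(orders)

# Join with the single aggregate tuple (multiplicity 1): multiply.
joined = {(p, q, total): mult * 1 for (p, q), mult in projected.items()}

# Selection Qty < SUM(Qty) / 4: multiply by 1 if the predicate holds, else 0.
selected = {
    key: mult * int(key[1] < key[2] / 4) for key, mult in joined.items()
}

print(projected)
print(total)
print(selected)
```

Only the (A, 2) tuple survives the selection with a non-zero multiplicity, matching the tables above.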
22
Bootstrap Resamples As Multiset DB
• Bootstrap generates multiset relations
  – Tuples annotated with multiplicities
  – Query processing manipulates these multiplicities
23
Random Variables on Probabilistic Multiset DB (PMDB)
• Multiset DB
  – Tuples are annotated with multiplicities $(m_1, m_2, m_3, m_4) \sim \mathrm{Multinomial}(4, (0.25, 0.25, 0.25, 0.25))$

Resamples as multiset relations:
ID  Product  Qty  #    ID  Product  Qty  #    ID  Product  Qty  #
1   A        2    2    1   A        2    1    1   A        2    1
2   B        3    1    2   B        3    1    2   B        3    0
3   A        2    1    3   A        2    0    3   A        2    2
4   A        4    0    4   A        4    2    4   A        4    1

Probabilistic Multiset DB (each tuple is drawn with probability 0.25):
ID  Product  Qty  #
1   A        2    m1
2   B        3    m2
3   A        2    m3
4   A        4    m4

Similar to Tossing Coins
24
Querying PMDB
• Whenever we apply
  – projection (π) or aggregate (γ) to the multiplicity column: sum the annotated random variables
  – join (⋈) or selection (σ) to the multiplicity column: multiply the annotated random variables
25
Querying PMDB: Projection
• Projection takes convolution sum of multiplicities

ID  Product  Qty  #
1   A        2    m1
2   B        3    m2
3   A        2    m3
4   A        4    m4

Product  Qty  #
A        2    m1 + m3
B        3    m2
A        4    m4
26
From Theory To Practice
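• Annotated random variables
  – Marginal distribution

ID  Product  Qty  #
1   A        2    m1
2   B        3    m2
3   A        2    m3
4   A        4    m4

Joint distribution (each tuple drawn with probability 0.25):
$(m_1, m_2, m_3, m_4) \sim \mathrm{Multinomial}(4, (0.25, 0.25, 0.25, 0.25))$
Marginal distribution of a single tuple (probability 0.25 versus 0.75 for all other tuples):
$\sim \mathrm{Multinomial}(4, (0.25, 0.75))$

Numeric Form! (the # column is stored as n, 0, 1)
ID  Product  Qty  n  0     1
1   A        2    4  0.75  0.25
2   B        3    4  0.75  0.25
3   A        2    4  0.75  0.25
4   A        4    4  0.75  0.25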
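One way to read the numeric form (n = 4, 0.75, 0.25) is as four independent draws, each picking a given tuple with probability 0.25, so that the tuple's multiplicity is Binomial(4, 0.25). That reading is an interpretation of the slide, and the helper name `marginal_pmf` is invented for this sketch, which simply expands the marginal into an explicit distribution.

```python
from math import comb


def marginal_pmf(n, p):
    """P(multiplicity = k) for k = 0..n when each of n draws hits with prob. p."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]


pmf = marginal_pmf(4, 0.25)
mean = sum(k * pk for k, pk in enumerate(pmf))

for k, pk in enumerate(pmf):
    print(f"P(# = {k}) = {pk:.4f}")
print(f"expected multiplicity = {mean:.2f}")
```

Storing only (n, p) instead of the full distribution keeps the annotation compact.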
27
Querying PMDB: an Example

ID  Product  Qty  #       ID  Product  Qty  n  0     1
1   A        2    m1      1   A        2    4  0.75  0.25
2   B        3    m2      2   B        3    4  0.75  0.25
3   A        2    m3      3   A        2    4  0.75  0.25
4   A        4    m4      4   A        4    4  0.75  0.25

After projection:
Product  Qty  #           Product  Qty  n  0     1
A        2    m1 + m3     A        2    4  0.5   0.5
B        3    m2          B        3    4  0.75  0.25
A        4    m4          A        4    4  0.75  0.25

Result:
Product  Qty  n  0    1
A        2    4  0.5  0.5
• : works with the numeric forms
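In the example, merging the two (A, 2) tuples changes the numeric form from 0.25 to 0.5. The sketch below checks this by brute force: it enumerates every equally likely sequence of 4 draws over 4 tuples and compares the exact distribution of the summed multiplicity with Binomial(4, 0.25 + 0.25). The function names are invented for illustration, and the binomial reading of the numeric form is an assumption.

```python
from itertools import product
from math import comb


def exact_sum_pmf(n_tuples, n_draws, merged):
    """Exact pmf of the total multiplicity of the tuples in `merged`.

    Enumerates all n_tuples ** n_draws equally likely draw sequences.
    """
    counts = [0] * (n_draws + 1)
    for draws in product(range(n_tuples), repeat=n_draws):
        counts[sum(1 for d in draws if d in merged)] += 1
    total = n_tuples**n_draws
    return [c / total for c in counts]


def binomial_pmf(n, p):
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]


# Tuples with ID 1 and ID 3 (indices 0 and 2) both project to (A, 2).
exact = exact_sum_pmf(4, 4, {0, 2})
numeric = binomial_pmf(4, 0.25 + 0.25)

for k, (e, b) in enumerate(zip(exact, numeric)):
    print(f"k={k}: enumerated={e:.4f}, numeric form={b:.4f}")
```

Adding the probabilities is valid here because the merged tuples are distinct base tuples, so a single draw can hit at most one of them.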
28
Querying PMDB: an Example
• Correctness of
  – Tuples projected are disjoint: they do not depend on the same base tuple
  – Can be detected by functional dependency:

ID  Product  Qty  #
1   A        2    m1
2   B        3    m2
3   A        2    m3
4   A        4    m4

Product  Qty  #
A        2    m1 + m3
B        3    m2
A        4    m4
29
Querying PMDB in Numeric Form
• ABM is correct for queries with eligible plans
• A large subset of queries can be evaluated by ABM in DBPTIME
• Eligible plans can be tested at compile time
Functional Dependency Rules
30
Coverage of Various Techniques
Technique                    TPC-H    Conviva Log
Analytic error estimation    9/22     36.9%
ABM DBPTIME eligible         15/22    81.0%
ABM eligible                 19/22    98.6%
ABM                          19/22    99.1%
Bootstrap                    19/22    99.1%
Over 6660 queries
31
EXPERIMENTAL EVALUATION
32
Experimental Setting
• Synthetic and real-life datasets and queries:
  – TPC-H: 100 GB
  – Skewed-TPC-H: 1 GB
  – Customer: 52 GB
• Compare relative error
  – Of: mean, standard deviation, quantile, KS-distance, confidence interval, existence probability
  – Between: Analytical Bootstrap Method (ABM), bootstrap (BS), ground truth (GT)
33
Accuracy of ABM
Comparing the distributions given by ABM & bootstrap on quantiles & existence probability (1% sample)
ABM models Bootstrap accurately
34
Accuracy of ABM
Comparing user-defined measures given by ABM & bootstrap to ground truth (1% sample)
ABM is consistent with Bootstrap
35
Accuracy of ABM
Comparing predictions given by ABM & bootstrap when varying number of bootstrap trials (TPC-H 1%)
Bootstrap converges to ABM
36
Time Performance of ABM
Bootstrap: Original bootstrap
BLB-10: Bag of Little Bootstraps using 10 machines
ODM: On-Demand Materialization
Comparing time performance of ABM & bootstrap variants (TPC-H 10%)
ABM is 3-4 orders of magnitude faster than sequential/parallel bootstrap variants
37
Time Performance of ABM
Exact: Run the query on the original data
Sample: Run the query on the sample
CLT: Analytic error estimation using Central Limit Theorem
Comparing time performance of ABM & various techniques (TPC-H 10%)
ABM introduces little overhead
38
Conclusion & Future Work
• Bootstrap is critical for scalable AQP
• ABM provides an analytical model for bootstrap, and achieves significant speed-up
• ABM+EARL: a bootstrap-based system that can automatically choose/combine error estimation methods
• Integrating ABM into Hive/Shark