Universalizing Approximate Query Processing
Yongjoo Park, Barzan Mozafari, Joseph Sorenson, Junhao Wang
What is Approximate Query Processing (AQP)?
• Exact query processing: I/O → Computation → Exact Answer
• AQP: Less I/O → Less Computation → Approximate Answer
Why AQP?
Higher Productivity
• Numerous studies: a latency >2 seconds is no longer interactive and negatively affects creativity!
Lower Cost (Time + Resources)
• Human time: Money
• Machine time: No one loves their EC2 bill! [Image: Jeff Bezos]
AQP research in academia
[Timeline figure, 1984 to 2017]
• 1984: Double sampling
• 1991: Selectivity estimation on random samples
• 1996: Approximate count estimator
• 1997: Online aggregation
• 1999: AQUA, Ripple join
• 2001: STRAT
• 2003: Dynamic sample selection
• 2005: Bootstrap for AQP
• 2006: SMS join
• 2007: Optimized stratified, Scalable with DBO
• 2010: MapReduce Online, COSMOS
• 2011: SciBORQ
• 2013: BlinkDB
• 2016: QuickR, Seek+Sample, Wander join
• 2017: DBLearning
35 years of research, little industry adoption
AQP is hard to adopt
AQP typically requires significant modifications of DBMS internals
• Error estimation: [BlinkDB ’13], [G-OLA ’15], …
• Query evaluation: [Online ’97], [Join Synopses ’99], …
• Relational operators: [ABM ’14], …
Traditional DBMS vendors
• Stable codebase, reluctant to make major changes
• Slow in adopting ANYTHING :-)
Newer SQL-on-Hadoop systems: implementing standard features
Users won’t abandon their existing DBMS just to use AQP.
Built-in AQP functions in OLAP engines
count-distinct or quantile
• APPROXIMATE PERCENTILE_DISC
• approxCountDistinct, approxQuantile
• NDV, APPX_MEDIAN
• approx_count_distinct, approx_percentile
Good progress! But, too little, too slow
Limitations
1. Good only when the data does not fit in memory
2. Good only for flat queries: no error propagation
3. Applicable only for order statistics: no support for UDAs or arithmetic aggregates
Need for complete AQP solutions that are easy to adopt
Our proposal: Universal AQP
• Without AQP: user/app → SQL → SQL DB → Exact Result → user/app
• With Universal AQP: user/app → SQL → Thin AQP layer → Rewritten SQL → SQL DB → Exact Result → Thin AQP layer → Approx Result + Error Bound → user/app
SQL (from user/app):
    select avg(price)
    from sales
    where channel = 'online'
Rewritten SQL (sent to the SQL DB by the thin AQP layer):
    select avg(a1), std(a1)
    from (select avg(price) as a1
          from sales
          where channel = 'online'
          group by sid) t1
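The rewritten query can be tried end to end on a toy table. The sketch below is only an illustration, not VerdictDB code: it assumes an in-memory SQLite database, a hypothetical `sid` column that already holds a subsample id per row, and a user-registered `std()` aggregate (SQLite has none built in) that is assumed here to mean the population standard deviation. How the returned spread is turned into an error bound is not shown on this slide and is left out.

```python
import math
import sqlite3


class PopulationStd:
    """Population standard deviation, registered below under the name std()."""

    def __init__(self):
        self.values = []

    def step(self, value):
        # Called once per input row; NULLs are skipped like in SQL aggregates.
        if value is not None:
            self.values.append(value)

    def finalize(self):
        if not self.values:
            return None
        mean = sum(self.values) / len(self.values)
        return math.sqrt(sum((v - mean) ** 2 for v in self.values) / len(self.values))


# The rewritten SQL from the slide, unchanged apart from line breaks.
REWRITTEN_SQL = """
select avg(a1), std(a1)
from (select avg(price) as a1
      from sales
      where channel = 'online'
      group by sid) t1
"""


def run_rewritten_query(rows):
    """rows: iterable of (price, channel, sid); sid is a hypothetical subsample id."""
    conn = sqlite3.connect(":memory:")
    conn.create_aggregate("std", 1, PopulationStd)  # assumption: std = population std
    conn.execute("create table sales (price real, channel text, sid integer)")
    conn.executemany("insert into sales values (?, ?, ?)", rows)
    result = conn.execute(REWRITTEN_SQL).fetchone()
    conn.close()
    return result


if __name__ == "__main__":
    demo_rows = [(2.0, "online", 1), (4.0, "online", 1),
                 (4.0, "online", 2), (6.0, "online", 2),
                 (9.0, "store", 1)]
    print(run_rewritten_query(demo_rows))
```

The inner query produces one average per subsample; the outer query reduces those averages to a point estimate and a spread, all inside the database.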
Challenges of Universal AQP
1. Statistical correctness (inter-tuple correlations)
• Foreign-key constraints [Join Synopses ’99]
• Modifying the join algorithm [Wander Join ’16]
• Modifying the query plan [BlinkDB ’13, Quickr ’16]
[Figure: sale table with cities AA, NYC, SF; correct error bounds]
2. Middleware efficiency
• Lack of access to DBMS machinery
[Figure: thin AQP layer connected to the DB over the network]
3. Server efficiency
• Resampling-based techniques [Pol and Jermaine ’05, BlinkDB ’14]
• Intimate integration of error estimation logic into scan operators [Quickr ’16, SnappyData]
• Overriding the relational operators altogether [ABM ’14]
VerdictDB Overview
First Universal AQP system
Deployment
user/app → (JDBC, API call) → VerdictDB → (JDBC, spark.sql) → SQL DB
The SQL DB stores (1) offline-created samples, and (2) VerdictDB-managed metadata
The only requirements (SQL syntax supported by almost any SQL engine):
• create table as select …
• rand(), agg(col) over (partition by …)
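To make the two required building blocks concrete, here is a minimal sketch of creating an offline uniform sample with nothing but `create table as select` and `rand()`. It is an assumption-laden illustration rather than VerdictDB's actual sample-creation logic: it uses SQLite, which ships `random()` instead of `rand()`, so a `rand()` function is registered from Python, and the table names `sales` and `sales_sample` are hypothetical.

```python
import random
import sqlite3


def create_uniform_sample(conn, ratio, seed=0):
    """Create sales_sample from sales, keeping each row with probability `ratio`.

    Returns the number of rows in the new sample table.
    """
    rng = random.Random(seed)
    # Assumption: the engine lacks rand(), so register one returning a float in [0, 1).
    conn.create_function("rand", 0, rng.random)
    conn.execute("drop table if exists sales_sample")
    # The sampling itself is plain SQL: create table as select ... where rand() < ratio.
    conn.execute(
        f"create table sales_sample as "
        f"select * from sales where rand() < {float(ratio)!r}"
    )
    return conn.execute("select count(*) from sales_sample").fetchone()[0]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("create table sales (price real, channel text)")
    conn.executemany("insert into sales values (?, ?)",
                     [(float(i), "online") for i in range(1000)])
    print(create_uniform_sample(conn, 0.1), "rows sampled out of 1000")
```

Because the sample is an ordinary table, it lives in the same database as the original data and can be queried through the same connection.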
Architecture
incoming query → VerdictDB [Query Parser → Query Rewriter → DBMS Drivers (Impala driver, Hive driver, Redshift driver, …)] → SQL DB → VerdictDB [Answer Rewriter] → approximate answer
Crucial component: the Query Rewriter
1. Chooses an optimal set of samples
2. Scales values appropriately
3. Inserts error estimation logic
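Step 2 ("scales values appropriately") can be illustrated for the simplest case. The sketch below assumes a uniform sample with a known sampling ratio; the function name and the aggregate labels are hypothetical and are not VerdictDB's API. Under that assumption, counts and sums computed on the sample are divided by the ratio, while averages need no scaling.

```python
def scale_to_full_table(aggregate, sample_value, sampling_ratio):
    """Scale an aggregate computed on a uniform sample to a full-table estimate.

    aggregate: "count", "sum", or "avg" (hypothetical labels for this sketch).
    sampling_ratio: fraction of rows kept in the sample, in (0, 1].
    """
    if not 0 < sampling_ratio <= 1:
        raise ValueError("sampling_ratio must be in (0, 1]")
    if aggregate in ("count", "sum"):
        # Each sampled row stands for 1 / sampling_ratio rows of the original table.
        return sample_value / sampling_ratio
    if aggregate == "avg":
        # A ratio of two equally scaled quantities: the scale factors cancel.
        return sample_value
    raise ValueError(f"unsupported aggregate: {aggregate}")


if __name__ == "__main__":
    print(scale_to_full_table("count", 50, 0.25))
    print(scale_to_full_table("avg", 4.2, 0.25))
```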
Error estimation in VerdictDB
Error estimation in general
• User interested in Q(T)
• We compute Q(S) where S is a sample of T
• Main question: how close is Q(S) to Q(T)?
| Method | Fast? | General? |
| --- | --- | --- |
| Closed-form (CLT, Hoeffding, HT) | YES | NO (no UDAs, requires IID) |
| Existing resampling (subsampling, bootstrap) | NO (can be slow in SQL) | YES (Hadamard differentiable) |
| Ours (variational subsampling) | YES | YES (Hadamard differentiable) |
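As an example of the closed-form row, the sketch below computes a CLT-based confidence interval for an average. It assumes Q is the mean, that S is an IID sample, and a 95% level (z = 1.96); the function name is hypothetical. It also shows why this family is fast but not general: the formula is specific to the mean and does not carry over to arbitrary UDAs.

```python
import math
import statistics


def clt_interval(sample, z=1.96):
    """Return (estimate, half_width) for the mean of an IID sample.

    half_width = z * s / sqrt(n), where s is the sample standard deviation.
    """
    n = len(sample)
    estimate = statistics.mean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)
    return estimate, z * std_err


if __name__ == "__main__":
    est, half = clt_interval([1.0, 2.0, 3.0, 4.0, 5.0])
    print(f"Q(S) = {est:.3f} +/- {half:.3f}")
```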
Recap: traditional subsampling
• T: original table (size N); Q(T) is slow / expensive
• S: random sample of T (size n); what is the error of Q(S)?
• s1, s2, s3, …, sb: subsamples of S (size s ≪ n); each is a random sample without replacement, and each subsample is independent
• Q(s1), Q(s2), Q(s3), …, Q(sb): the query evaluated on every subsample
Important properties
1. A tuple may belong to multiple subsamples.
2. The size of every subsample is s.
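The procedure above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: Q defaults to the mean, the function name is hypothetical, and the spread of the subsample answers around Q(S) is rescaled by sqrt(s / n), the usual subsampling correction for estimators that converge at a root-n rate.

```python
import math
import random
import statistics


def traditional_subsampling_error(sample, s, b, query=statistics.mean, seed=0):
    """Estimate the standard error of query(sample) from b subsamples of size s.

    Returns (error_estimate, subsamples).
    """
    rng = random.Random(seed)
    n = len(sample)
    # Each subsample is drawn independently, without replacement within itself,
    # so the same tuple may appear in several subsamples.
    subsamples = [rng.sample(sample, s) for _ in range(b)]
    full_answer = query(sample)
    answers = [query(sub) for sub in subsamples]
    # Root-mean-square deviation of the subsample answers from Q(S).
    rms = math.sqrt(sum((a - full_answer) ** 2 for a in answers) / b)
    # Assumption: root-n convergence, so shrink the spread from size s to size n.
    return math.sqrt(s / n) * rms, subsamples


if __name__ == "__main__":
    data_rng = random.Random(1)
    data = [data_rng.gauss(10.0, 2.0) for _ in range(10_000)]
    err, _ = traditional_subsampling_error(data, s=100, b=50)
    print(f"estimated standard error of the mean: {err:.4f}")
```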
Traditional subsampling in SQL is slow
| CITY | PRODUCT | PRICE | 1 | 2 | ··· | b |
| --- | --- | --- | --- | --- | --- | --- |
| AA | egg | $3.00 | 1 | 0 | ··· | 1 |
| AA | milk | $5.00 | 0 | 1 | ··· | 0 |
| AA | egg | $3.00 | 0 | 0 | ··· | 1 |
| NYU | egg | $4.00 | 0 | 1 | ··· | 0 |
| NYU | milk | $6.00 | 0 | 0 | ··· | 1 |
| NYU | candy | $2.00 | 1 | 0 | ··· | 0 |
| SF | milk | $6.00 | 0 | 1 | ··· | 0 |
| SF | egg | $4.00 | 0 | 0 | ··· | 0 |
| SF | egg | $4.00 | 0 | 1 | ··· | 1 |
Columns 1, 2, …, b are subsample IDs; the table has n tuples; each subsample column sums to s.
Algorithm:
    for i = 1, ..., n
        for j = 1, ..., b
            if sid[i,j] == 1
                sum[j] += price[i]
Time Complexity: O(n·b)
No error est: 0.35 sec
Trad. subsampling: 118 sec (337x slower)
(based on 1G sample, Impala)
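The nested loop above translates directly into runnable code. The sketch below uses the nine rows of the slide's table, with only the three subsample columns that are actually shown (1, 2 and b); the names `PRICES`, `SID` and `subsample_sums` are chosen for this illustration.

```python
# PRICE column of the example table, top to bottom.
PRICES = [3.00, 5.00, 3.00, 4.00, 6.00, 2.00, 6.00, 4.00, 4.00]

# Membership indicators for the three subsample columns shown on the slide.
SID = [
    [1, 0, 1],
    [0, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
    [0, 1, 1],
]


def subsample_sums(prices, sid):
    """Sum the price of every tuple into each subsample it belongs to: O(n * b)."""
    b = len(sid[0])
    sums = [0.0] * b
    for i in range(len(prices)):      # n tuples
        for j in range(b):            # b subsamples checked per tuple
            if sid[i][j] == 1:
                sums[j] += prices[i]
    return sums


if __name__ == "__main__":
    print(subsample_sums(PRICES, SID))
```

Every tuple is tested against every subsample, which is where the O(n·b) cost comes from.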
Our approach: variational subsampling
T → random sample → S (size n) → subsamples s1, s2, s3, …, sb (sizes n1, n2, n3, …, nb)
Important properties (each replaces a property of traditional subsampling)
1. Each sampled tuple can belong to at most one subsample (traditionally: a tuple may belong to multiple subsamples).
2. Subsamples are allowed to differ in size (traditionally: the size of every subsample is s).
![Page 62: Universalizing Approximate Query ProcessingMapReduce Online, COSMOS Optimized stratified, Scalable with DBO SMS join Bootsrap for AQP Dynamic sample selection STRAT AQUA, Ripple join](https://reader034.fdocuments.us/reader034/viewer/2022042912/5f476f96ece5210f334baf41/html5/thumbnails/62.jpg)
Our approach: variational subsampling
A random sample S (size n) is drawn from the table T, and its tuples are assigned to b subsamples s1, s2, s3, ..., sb (sizes n1, n2, n3, ..., nb).
Important properties (each replaces a requirement of traditional subsampling):
1. Each sampled tuple can belong to at most one subsample (traditionally, a tuple may belong to multiple subsamples).
2. Subsamples are allowed to differ in size (traditionally, the size of every subsample is s).
Can be implemented in SQL as a single group-by query!
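A minimal sketch of the single group-by idea, using Python's built-in `sqlite3` module as a stand-in engine. The table name `sales_sample`, its column names, the choice b = 5, and the fixed random seed are all assumptions made for illustration; they are not taken from VerdictDB.

```python
import random
import sqlite3

# Example sample tuples (city, product, price).
rows = [
    ("AA", "egg", 3.00), ("AA", "milk", 5.00), ("AA", "egg", 3.00),
    ("NYU", "egg", 4.00), ("NYU", "milk", 6.00), ("NYU", "candy", 2.00),
    ("SF", "milk", 6.00), ("SF", "egg", 4.00), ("SF", "egg", 4.00),
]
b = 5  # number of subsamples (assumed)
rng = random.Random(0)  # fixed seed so the sketch is repeatable

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales_sample (city TEXT, product TEXT, price REAL, sid INTEGER)"
)
# Each tuple receives exactly one subsample ID drawn from 1..b.
conn.executemany(
    "INSERT INTO sales_sample VALUES (?, ?, ?, ?)",
    [(city, product, price, rng.randint(1, b)) for city, product, price in rows],
)

# One group-by query yields the aggregate and the size of every subsample.
per_subsample = conn.execute(
    "SELECT sid, SUM(price) AS sub_sum, COUNT(*) AS sub_size "
    "FROM sales_sample GROUP BY sid ORDER BY sid"
).fetchall()
total = conn.execute("SELECT SUM(price) FROM sales_sample").fetchone()[0]

for sid, sub_sum, sub_size in per_subsample:
    print(sid, sub_sum, sub_size)
```

Because every tuple carries a single ID, the subsample sizes add up to the sample size and the subsample sums add up to the overall sum.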
![Page 66: Universalizing Approximate Query ProcessingMapReduce Online, COSMOS Optimized stratified, Scalable with DBO SMS join Bootsrap for AQP Dynamic sample selection STRAT AQUA, Ripple join](https://reader034.fdocuments.us/reader034/viewer/2022042912/5f476f96ece5210f334baf41/html5/thumbnails/66.jpg)
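Variational subsampling in SQL is fast

| CITY | PRODUCT | PRICE | subsample ID |
|------|---------|-------|--------------|
| AA   | egg     | $3.00 | 1            |
| AA   | milk    | $5.00 | 3            |
| AA   | egg     | $3.00 | 2            |
| NYU  | egg     | $4.00 | 4            |
| NYU  | milk    | $6.00 | 3            |
| NYU  | candy   | $2.00 | 1            |
| SF   | milk    | $6.00 | 5            |
| SF   | egg     | $4.00 | 4            |
| SF   | egg     | $4.00 | 5            |

The table has n tuples; each subsample ID is drawn as randint(1,b).
We call this augmented table a variational table.
Algorithm:
for i = 1, ..., n
    sum[sid[i]] += price[i]
Time Complexity: O(n)
No error estimation: 0.35 sec
Traditional subsampling: 118 sec
Variational subsampling: 0.73 sec (162× faster than traditional)
(based on a 1G sample, Impala)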
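The one-pass algorithm can be sketched in Python with the subsample IDs listed in the table above. The function name is hypothetical, and b = 5 is an assumption based on the largest ID that appears in the example.

```python
# Prices and subsample IDs of the nine example tuples, in slide order.
prices = [3.00, 5.00, 3.00, 4.00, 6.00, 2.00, 6.00, 4.00, 4.00]
sids = [1, 3, 2, 4, 3, 1, 5, 4, 5]
b = 5  # assumed from the largest ID in the example


def variational_subsample_sums(prices, sids, b):
    """Sum prices per subsample in a single pass over the tuples."""
    sums = [0.0] * (b + 1)  # slot 0 is unused so IDs 1..b index directly
    for price, sid in zip(prices, sids):
        sums[sid] += price
    return sums[1:]


sums = variational_subsample_sums(prices, sids, b)
print(sums)
```

Each tuple is touched once regardless of b, which is where the O(n) cost comes from.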
![Page 68: Universalizing Approximate Query ProcessingMapReduce Online, COSMOS Optimized stratified, Scalable with DBO SMS join Bootsrap for AQP Dynamic sample selection STRAT AQUA, Ripple join](https://reader034.fdocuments.us/reader034/viewer/2022042912/5f476f96ece5210f334baf41/html5/thumbnails/68.jpg)
Main results
Theorem 1 (Consistency) The distribution of the aggregates of variational subsamples, after appropriate scaling, converges to the true distribution of the aggregate of a sample as 𝑛 → ∞.
Theorem 2 (Convergence Rate) The convergence rate of variational subsampling is equal to that of traditional subsampling when b is finite.
$$O\left(n_s^{-1/2} + \frac{n_s}{n} + b^{-1/2}\right)$$
where $n_s$ is the subsample size; the $b^{-1/2}$ term is the error from the finite b (by the Dvoretzky–Kiefer–Wolfowitz inequality).
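The "appropriate scaling" in Theorem 1 can be illustrated with a small simulation for the mean. This is a sketch under assumptions, not the authors' implementation: the data are synthetic Gaussian values, n and b are arbitrary, and the scaling factor sqrt(average subsample size / n) follows the same pattern as the `cc_err` expression in the deck's query-rewriting example.

```python
import math
import random
import statistics


def variational_error_estimate(sample, b, rng):
    """Estimate the standard error of the sample mean from b disjoint subsamples."""
    groups = [[] for _ in range(b)]
    for x in sample:
        # Each value goes to exactly one subsample; sizes may differ.
        groups[rng.randrange(b)].append(x)
    groups = [g for g in groups if g]  # skip subsamples that stayed empty
    sub_means = [statistics.fmean(g) for g in groups]
    avg_size = statistics.fmean(len(g) for g in groups)
    # Scale the spread of subsample means from subsample size to sample size.
    return statistics.stdev(sub_means) * math.sqrt(avg_size / len(sample))


rng = random.Random(42)
n, b = 100_000, 100  # sample size and number of subsamples (assumed values)
sigma = 2.0
sample = [rng.gauss(10.0, sigma) for _ in range(n)]

estimated = variational_error_estimate(sample, b, rng)
expected = sigma / math.sqrt(n)  # textbook standard error of the mean
print(estimated, expected)
```

With these settings the two printed values should be of the same magnitude; the remaining gap reflects the finite number of subsamples.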
![Page 74: Universalizing Approximate Query ProcessingMapReduce Online, COSMOS Optimized stratified, Scalable with DBO SMS join Bootsrap for AQP Dynamic sample selection STRAT AQUA, Ripple join](https://reader034.fdocuments.us/reader034/viewer/2022042912/5f476f96ece5210f334baf41/html5/thumbnails/74.jpg)
Experiments
Datasets:
• 500GB TPC-H benchmark / 200GB Instacart dataset / synthetic datasets
Underlying databases:
• Amazon Redshift, Apache Spark SQL, Apache Impala on a 10+1 r4.xlarge cluster
1. Does VerdictDB provide enough speedup?
2. Is VerdictDB (UAQP)’s performance comparable to a tightly-integrated AQP engine?
3. Is variational subsampling statistically correct?
Yes
![Page 77: Universalizing Approximate Query ProcessingMapReduce Online, COSMOS Optimized stratified, Scalable with DBO SMS join Bootsrap for AQP Dynamic sample selection STRAT AQUA, Ripple join](https://reader034.fdocuments.us/reader034/viewer/2022042912/5f476f96ece5210f334baf41/html5/thumbnails/77.jpg)
Speedup for Redshift
[Figure: per-query speedups on Redshift; panels: TPC-H benchmark, micro-benchmark]
Redshift: 24.0× speedup
t3, t10, t15: no speedup (i.e., 1×) due to high-cardinality grouping attributes
Other queries: 26.3× speedups (relative errors were 2%)
![Page 79: Universalizing Approximate Query ProcessingMapReduce Online, COSMOS Optimized stratified, Scalable with DBO SMS join Bootsrap for AQP Dynamic sample selection STRAT AQUA, Ripple join](https://reader034.fdocuments.us/reader034/viewer/2022042912/5f476f96ece5210f334baf41/html5/thumbnails/79.jpg)
Speedup for Apache Spark & Impala
speedup = (overhead + processing) / (overhead + sample processing)
Lower overhead → larger speedup
Spark SQL: 12.0× speedup
Impala: 18.6× speedup
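The effect of overhead on the ratio can be checked with a few lines of Python. The timings below are hypothetical numbers chosen only to illustrate the formula; they are not measurements from the experiments.

```python
def speedup(overhead, full_processing, sample_processing):
    """Speedup of a sample-based query over the full query, per the slide's ratio."""
    return (overhead + full_processing) / (overhead + sample_processing)


# Hypothetical timings in seconds: same workload, different fixed overhead.
low_overhead = speedup(overhead=0.5, full_processing=100.0, sample_processing=1.0)
high_overhead = speedup(overhead=5.0, full_processing=100.0, sample_processing=1.0)

print(low_overhead, high_overhead)
```

Since the overhead is added to both the numerator and the denominator, it dominates the denominator once sample processing becomes cheap, which caps the achievable speedup.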
![Page 82: Universalizing Approximate Query ProcessingMapReduce Online, COSMOS Optimized stratified, Scalable with DBO SMS join Bootsrap for AQP Dynamic sample selection STRAT AQUA, Ripple join](https://reader034.fdocuments.us/reader034/viewer/2022042912/5f476f96ece5210f334baf41/html5/thumbnails/82.jpg)
UAQP vs. Tightly-integrated AQP
VerdictDB was comparable to SnappyData.
SnappyData ver 0.8 didn’t support the join of two sample tables.
![Page 85: Universalizing Approximate Query ProcessingMapReduce Online, COSMOS Optimized stratified, Scalable with DBO SMS join Bootsrap for AQP Dynamic sample selection STRAT AQUA, Ripple join](https://reader034.fdocuments.us/reader034/viewer/2022042912/5f476f96ece5210f334baf41/html5/thumbnails/85.jpg)
Variational subsampling: correctness
The bars are 5th and 95th percentiles.
Relative errors naturally become smaller for higher selectivity.
The accuracy of variational subsampling ≈ (a) bootstrap and (b) traditional subsampling.
The estimated errors were close to the true errors.
![Page 88: Universalizing Approximate Query ProcessingMapReduce Online, COSMOS Optimized stratified, Scalable with DBO SMS join Bootsrap for AQP Dynamic sample selection STRAT AQUA, Ripple join](https://reader034.fdocuments.us/reader034/viewer/2022042912/5f476f96ece5210f334baf41/html5/thumbnails/88.jpg)
Variational subsampling: convergence rate
Variational subsampling was significantly faster.
The accuracy was almost the same for relatively large samples.
![Page 95: Universalizing Approximate Query ProcessingMapReduce Online, COSMOS Optimized stratified, Scalable with DBO SMS join Bootsrap for AQP Dynamic sample selection STRAT AQUA, Ripple join](https://reader034.fdocuments.us/reader034/viewer/2022042912/5f476f96ece5210f334baf41/html5/thumbnails/95.jpg)
Conclusion: Universal AQP is viable
1. Comparable performance to a fully-integrated solution
2. New error estimation technique: variational subsampling
   1. Generality and computational efficiency
   2. The first subsampling-based error estimation technique for AQP
3. Offers considerable speedup (18.45× on average, up to 171×, with errors below 2-3%)
Open-sourced (Apache v2.0): http://verdictdb.org
![Page 100: Universalizing Approximate Query ProcessingMapReduce Online, COSMOS Optimized stratified, Scalable with DBO SMS join Bootsrap for AQP Dynamic sample selection STRAT AQUA, Ripple join](https://reader034.fdocuments.us/reader034/viewer/2022042912/5f476f96ece5210f334baf41/html5/thumbnails/100.jpg)
Future Work
Development
• Adding more drivers (Presto, Teradata, Oracle, SQL Server, …)
Research
• Support for online sampling
• Robust physical designer (see CliffGuard @ SIGMOD 15)
• Integration with ML libraries (sampling-based model tuning)
![Page 101: Universalizing Approximate Query ProcessingMapReduce Online, COSMOS Optimized stratified, Scalable with DBO SMS join Bootsrap for AQP Dynamic sample selection STRAT AQUA, Ripple join](https://reader034.fdocuments.us/reader034/viewer/2022042912/5f476f96ece5210f334baf41/html5/thumbnails/101.jpg)
Thank You
http://verdictdb.org
![Page 102: Universalizing Approximate Query ProcessingMapReduce Online, COSMOS Optimized stratified, Scalable with DBO SMS join Bootsrap for AQP Dynamic sample selection STRAT AQUA, Ripple join](https://reader034.fdocuments.us/reader034/viewer/2022042912/5f476f96ece5210f334baf41/html5/thumbnails/102.jpg)
VerdictDB: current status
• We support
  • aggregates: sum, count, avg, count-distinct, quantiles, UDAs
  • sources: base table, derived table, equi-join
  • filters: comparison, some subquery
  • others: group-by, having, etc.
• Open-sourced under Apache License version 2.0
  • http://verdictdb.org for code and documentation
• Upcoming features
  • Online sampling, automated physical designer
![Page 105: Universalizing Approximate Query ProcessingMapReduce Online, COSMOS Optimized stratified, Scalable with DBO SMS join Bootsrap for AQP Dynamic sample selection STRAT AQUA, Ripple join](https://reader034.fdocuments.us/reader034/viewer/2022042912/5f476f96ece5210f334baf41/html5/thumbnails/105.jpg)
Example of query rewriting
Original:
select l_returnflag, count(*) as cc
from lineitem
group by l_returnflag;

Rewritten:
select vt1.`l_returnflag` AS `l_returnflag`,
       round(sum((vt1.`cc` * vt1.`sub_size`)) / sum(vt1.`sub_size`)) AS `cc`,
       (stddev(vt1.`cc`) * sqrt(avg(vt1.`sub_size`))) / sqrt(sum(vt1.`sub_size`)) AS `cc_err`
from (select vt0.`l_returnflag` AS `l_returnflag`,
             ((sum((1.0 / vt0.`sampling_prob`)) / count(*)) * sum(count(*)) OVER (partition BY vt0.`l_returnflag`)) AS `cc`,
             vt0.`sid` AS `sid`,
             count(*) AS `sub_size`
      from lineitem_sample vt0
      GROUP BY vt0.`l_returnflag`, vt0.`sid`) AS vt1
GROUP BY vt1.`l_returnflag`;
![Page 106: Universalizing Approximate Query ProcessingMapReduce Online, COSMOS Optimized stratified, Scalable with DBO SMS join Bootsrap for AQP Dynamic sample selection STRAT AQUA, Ripple join](https://reader034.fdocuments.us/reader034/viewer/2022042912/5f476f96ece5210f334baf41/html5/thumbnails/106.jpg)
Bibliography
[Pol and Jermaine ’05] Pol, Abhijit, and Christopher Jermaine. "Relational confidence bounds are easy with the bootstrap." SIGMOD, 2005.
[BlinkDB ’13] Agarwal, Sameer, Barzan Mozafari, Aurojit Panda, Henry Milner, Samuel Madden, and Ion Stoica. "BlinkDB: Queries with bounded errors and bounded response times on very large data." EuroSys, 2013.
[BlinkDB ’14] Agarwal, Sameer, Henry Milner, Ariel Kleiner, Ameet Talwalkar, Michael Jordan, Samuel Madden, Barzan Mozafari, and Ion Stoica. "Knowing when you're wrong: building fast and reliable approximate query processing systems." SIGMOD, 2014.
[Quickr ’16] Kandula, Srikanth, Anil Shanbhag, Aleksandar Vitorovic, Matthaios Olma, Robert Grandl, Surajit Chaudhuri, and Bolin Ding. "Quickr: Lazily approximating complex adhoc queries in big data clusters." SIGMOD, 2016.
![Page 107: Universalizing Approximate Query ProcessingMapReduce Online, COSMOS Optimized stratified, Scalable with DBO SMS join Bootsrap for AQP Dynamic sample selection STRAT AQUA, Ripple join](https://reader034.fdocuments.us/reader034/viewer/2022042912/5f476f96ece5210f334baf41/html5/thumbnails/107.jpg)
Bibliography
[G-OLA ’15] Zeng, Kai, Sameer Agarwal, Ankur Dave, Michael Armbrust, and Ion Stoica. "G-OLA: Generalized on-line aggregation for interactive analysis on big data." SIGMOD, 2015.
[ABM ’14] Zeng, Kai, Shi Gao, Barzan Mozafari, and Carlo Zaniolo. "The analytical bootstrap: a new method for fast error estimation in approximate query processing." SIGMOD, 2014.
[Join Synopses ’99] Acharya, Swarup, Phillip B. Gibbons, Viswanath Poosala, and Sridhar Ramaswamy. "Join synopses for approximate query answering." SIGMOD Record, 1999.
[Wander Join ’16] Li, Feifei, Bin Wu, Ke Yi, and Zhuoyue Zhao. "Wander join: Online aggregation via random walks." SIGMOD, 2016.
[Online ’97] Hellerstein, Joseph M., Peter J. Haas, and Helen J. Wang. "Online aggregation." SIGMOD, 1997.
[Politis ’94] Politis, Dimitris N., and Joseph P. Romano. "Large sample confidence regions based on subsamples under minimal assumptions." The Annals of Statistics, 1994.
![Page 110: Universalizing Approximate Query ProcessingMapReduce Online, COSMOS Optimized stratified, Scalable with DBO SMS join Bootsrap for AQP Dynamic sample selection STRAT AQUA, Ripple join](https://reader034.fdocuments.us/reader034/viewer/2022042912/5f476f96ece5210f334baf41/html5/thumbnails/110.jpg)
Variational subsampling: overhead
Variational subsampling was 100×–237× faster compared to Consolidated Bootstrap.
Overhead of variational subsampling: 0.38–0.87 seconds