Approximate Queries on Very Large Data
UC Berkeley
Sameer Agarwal
Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael Jordan, Samuel Madden, Ion Stoica
Our Goal
Support interactive SQL-like aggregate queries over massive sets of data
blinkdb> SELECT AVG(jobtime) FROM very_big_log
AVG, COUNT, SUM, STDEV, PERCENTILE etc.
blinkdb> SELECT AVG(jobtime) FROM very_big_log WHERE src = 'hadoop'
FILTERS, GROUP BY clauses
blinkdb> SELECT AVG(jobtime) FROM very_big_log LEFT OUTER JOIN logs2 ON very_big_log.id = logs2.id WHERE src = 'hadoop'
JOINS, Nested Queries etc.
blinkdb> SELECT my_function(jobtime) FROM very_big_log LEFT OUTER JOIN logs2 ON very_big_log.id = logs2.id WHERE src = 'hadoop'
ML Primitives, User Defined Functions
100 TB on 1000 machines
Hard Disks: ½ - 1 Hour
Memory: 1 - 5 Minutes
?: 1 second
Query Execution on Samples
ID   City      Salary
1    NYC       50,000
2    NYC       62,492
3    Berkeley  78,212
4    NYC       120,242
5    NYC       98,341
6    Berkeley  75,453
7    NYC       60,000
8    NYC       72,492
9    Berkeley  88,212
10   Berkeley  92,242
11   NYC       70,000
12   Berkeley  102,492
What is the average Salary of all the people in the table?
$80,848
Uniform Sample (sampling rate 1/4):
ID   City      Salary   Sampling Rate
2    NYC       62,492   1/4
6    Berkeley  75,453   1/4
8    NYC       72,492   1/4
Estimated average: $70,145 +/- 10,815 (true average: $80,848)
Uniform Sample (sampling rate 1/2):
ID   City      Salary   Sampling Rate
2    NYC       62,492   1/2
3    Berkeley  78,212   1/2
7    NYC       60,000   1/2
6    Berkeley  75,453   1/2
8    NYC       72,492   1/2
12   Berkeley  102,492  1/2
Estimated average: $75,190 +/- 5,895 (true average: $80,848; the 1/4 sample gave $70,145 +/- 10,815)
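The averages on these slides can be checked with a few lines of Python. This is a minimal sketch: the salary lists are copied from the example table and from the two uniform samples, and all variable names are illustrative. It compares each estimate with the true average instead of reproducing the +/- values, because the slides do not say how those error bars were computed.

```python
import statistics

# Salaries of all 12 rows in the example table (IDs 1..12).
salaries = [50000, 62492, 78212, 120242, 98341, 75453,
            60000, 72492, 88212, 92242, 70000, 102492]

# Salary values of the rows shown in the two uniform samples.
quarter_sample = [62492, 75453, 72492]                      # sampling rate 1/4
half_sample = [62492, 78212, 60000, 75453, 72492, 102492]   # sampling rate 1/2

true_mean = statistics.mean(salaries)
quarter_mean = statistics.mean(quarter_sample)
half_mean = statistics.mean(half_sample)

# How far each sample estimate is from the exact answer.
quarter_gap = abs(quarter_mean - true_mean)
half_gap = abs(half_mean - true_mean)

print(f"full table : {true_mean:,.2f}")
print(f"1/4 sample : {quarter_mean:,.2f} (off by {quarter_gap:,.2f})")
print(f"1/2 sample : {half_mean:,.2f} (off by {half_gap:,.2f})")
```

On this toy table the larger sample lands closer to the exact answer, which is the trade-off the next slide describes: more rows cost more time but reduce error.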
Speed/Accuracy Trade-off
[Figure: Error (y-axis) vs. Execution Time (x-axis). Marked points: Interactive Queries (5 sec) and Time to Execute on Entire Dataset (30 mins). Annotation: Pre-Existing Noise.]
What is BlinkDB?
A data analysis (warehouse) system that …
- builds on Shark and Spark
- returns fast, approximate answers with error bars by executing queries on small samples of data
- is compatible with Apache Hive (storage, serdes, UDFs, types, metadata) and supports Hive's SQL-like query structure with minor modifications
Sampling Vs. No Sampling
Fraction of full data   Query Response Time (Seconds)   Error Bars
1                       1020                            -
10^-1                   103                             0.02%
10^-2                   18                              0.07%
10^-3                   13                              1.1%
10^-4                   10                              3.4%
10^-5                   8                               11%
10x as response time is dominated by I/O
Hive Architecture
Hadoop Storage (e.g., HDFS, HBase)
Metastore
MapReduce
SQL Parser
Query Optimizer
Physical Plan
SerDes, UDFs
Execution
Driver
Command-line Shell Thrift/JDBC
Shark Architecture
Hadoop Storage (e.g., HDFS, HBase)
Metastore
Spark
SQL Parser
Query Optimizer
Physical Plan
SerDes, UDFs
Execution
Driver
Command-line Shell Thrift/JDBC
BlinkDB Architecture
Hadoop Storage (e.g., HDFS, HBase)
Metastore
Spark
SQL Parser
Query Optimizer
Physical Plan
SerDes, UDFs
Execution
Driver
Command-line Shell Thrift/JDBC
BlinkDB alpha-0.1.0
1. Released and available at http://blinkdb.org
2. Allows you to create random and stratified samples on native tables and materialized views
3. Adds approximate aggregate functions with statistical closed forms to HiveQL: approx_avg(), approx_sum(), approx_count() etc.
Example: Preparing the Data
blinkdb> create external table logs (dt string, event string, bytes int) row format delimited fields terminated by ' ' location '/tmp/logs';
Referencing an external table logs in BlinkDB
blinkdb> create table logs_sample as select * from logs samplewith 0.01;
Create a 1% random sample logs_sample from logs
blinkdb> create table logs_sample_cached as select * from logs_sample;
Supports all Shark primitives for caching samples in memory
blinkdb> set blinkdb.sample.size=32810;
blinkdb> set blinkdb.dataset.size=3198910;
Giving BlinkDB information about the size of sample you wish to operate on and the size of the original dataset
Example: Analyzing the Data
blinkdb> select approx_count(1) from logs_sample_cached where event = "foo";
Prefixing approx_ to an aggregate operator tells BlinkDB to return an approximate answer
12810132 +/- 3423 (99% Confidence)
Returns an approximate answer with an error bar and confidence interval
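One way to see why BlinkDB is told both the sample size and the dataset size: a count observed in a sample has to be scaled up to the full dataset. The sketch below uses a standard normal-approximation closed form for a scaled count. It is an illustration under stated assumptions, not BlinkDB's actual implementation: the function name, the z value of 2.576 (roughly 99% confidence), and the number of matching rows are all hypothetical.

```python
import math


def scaled_count(matches_in_sample, sample_size, dataset_size, z=2.576):
    """Scale a count seen in a uniform sample up to the full dataset.

    Returns (estimate, half_width). Uses a normal approximation and
    ignores the finite-population correction.
    """
    p = matches_in_sample / sample_size       # fraction of sample rows that match
    estimate = dataset_size * p               # scaled-up count
    std_err = dataset_size * math.sqrt(p * (1 - p) / sample_size)
    return estimate, z * std_err


sample_size = 32810      # blinkdb.sample.size from the slides
dataset_size = 3198910   # blinkdb.dataset.size from the slides
matches = 1312           # hypothetical: sample rows where event = "foo"

estimate, half_width = scaled_count(matches, sample_size, dataset_size)
print(f"{estimate:.0f} +/- {half_width:.0f} (99% Confidence)")
```

The output is formatted like the result line on the slide, but the numbers come from the hypothetical match count above, not from any real run.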
Example: There's more!
blinkdb> create table logs_sample as select * from [any subquery] samplewith 0.01;
The sample operator can be anywhere in the query graph
blinkdb> select approx_count(1) from logs_sample_cached where event = "foo" GROUP BY dt ORDER BY dt;
Retains remaining Hive Query Structure
12810132 +/- 3423 (99% Confidence)
Note: The output is a String
Feature Roadmap
1. Integrating BlinkDB with Shark as an experimental feature (coming soon!)
2. Automatic Sample Management
3. More Hive Aggregates, UDAF Support
4. Runtime Correctness Tests
Automatic Sample Management
Goal: The API should abstract the details of creating, deleting and managing samples from the user
SELECT avg(sessionTime) FROM Table WHERE city='San Francisco' WITHIN 1 SECONDS
234.23 ± 15.32
SELECT avg(sessionTime) FROM Table WHERE city='San Francisco' WITHIN 2 SECONDS
239.46 ± 4.96
SELECT avg(sessionTime) FROM Table WHERE city='San Francisco' ERROR 0.1 CONFIDENCE 95.0%
Offline-sampling: Creates an optimal set of samples on native tables and materialized views based on query history and workload characteristics
[Diagram: TABLE (Original Data) and Sampling Module]
Sample Placement: Samples striped over 100s or 1,000s of machines, both on disks and in-memory.
[Diagram: TABLE (Original Data), Sampling Module, In-Memory Samples, On-Disk Samples]
Online sample selection to pick best sample(s) based on query latency and accuracy requirements
[Diagram: HiveQL/SQL Query (SELECT foo(*) FROM TABLE WITHIN 2), Query Plan, Sample Selection, TABLE (Original Data), Sampling Module, In-Memory Samples, On-Disk Samples]
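To make the idea of online sample selection concrete, here is a minimal sketch of choosing among pre-built samples for a WITHIN time bound or an ERROR bound. Everything in it is hypothetical: the Sample record, the sample names, and the predicted latency and error figures are invented for illustration, and treating ERROR 0.1 as a relative error is an assumption. The slides do not describe BlinkDB's actual selection logic.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    name: str
    est_seconds: float     # hypothetical predicted query latency on this sample
    est_rel_error: float   # hypothetical predicted relative error on this sample


def pick_for_time_bound(samples: List[Sample], within_seconds: float) -> Optional[Sample]:
    """Most accurate sample whose predicted latency fits the time bound."""
    feasible = [s for s in samples if s.est_seconds <= within_seconds]
    return min(feasible, key=lambda s: s.est_rel_error) if feasible else None


def pick_for_error_bound(samples: List[Sample], max_rel_error: float) -> Optional[Sample]:
    """Fastest sample whose predicted error fits the error bound."""
    feasible = [s for s in samples if s.est_rel_error <= max_rel_error]
    return min(feasible, key=lambda s: s.est_seconds) if feasible else None


# Hypothetical catalog of samples of increasing size.
catalog = [
    Sample("sample_a", est_seconds=0.4, est_rel_error=0.11),
    Sample("sample_b", est_seconds=0.9, est_rel_error=0.034),
    Sample("sample_c", est_seconds=1.8, est_rel_error=0.011),
    Sample("sample_d", est_seconds=12.0, est_rel_error=0.0007),
]

print(pick_for_time_bound(catalog, within_seconds=1).name)    # sample_b
print(pick_for_time_bound(catalog, within_seconds=2).name)    # sample_c
print(pick_for_error_bound(catalog, max_rel_error=0.1).name)  # sample_b
```

A looser time bound admits a larger sample and so a tighter error bar, which matches the WITHIN 1 SECONDS and WITHIN 2 SECONDS results shown earlier.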
Parallel query execution on multiple samples striped across multiple machines.
[Diagram: HiveQL/SQL Query (SELECT foo(*) FROM TABLE WITHIN 2), Sample Selection, New Query Plan, Shark, In-Memory Samples, On-Disk Samples, Error Bars & Confidence Intervals]
Result: 182.23 ± 5.56 (95% confidence)
More Aggregates / UDAFs Support
1. Using Bootstrap to estimate error
[Diagram: the Bootstrap Operator takes Sample A and produces resamples A1, A2, …, An]
Placement of the Bootstrap Operator in the query graph is critical to performance
However, the bootstrap can fail
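The bootstrap idea can be sketched in a few lines: resample the sample with replacement many times, re-run the aggregate on each resample, and read the error off the spread of the results. The code below is a generic percentile bootstrap, offered as a minimal illustration and not as BlinkDB's Bootstrap Operator. The function name, the number of resamples, and the synthetic data are assumptions.

```python
import random
import statistics


def bootstrap_interval(sample, aggregate, n_resamples=200, confidence=0.95, seed=0):
    """Percentile-bootstrap interval for an arbitrary aggregate function.

    Returns (point_estimate, lower, upper).
    """
    rng = random.Random(seed)
    n = len(sample)
    # Each resample A1..An draws n rows from the sample with replacement.
    estimates = sorted(
        aggregate(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    tail = (1 - confidence) / 2
    lower = estimates[int(tail * (n_resamples - 1))]
    upper = estimates[int((1 - tail) * (n_resamples - 1))]
    return aggregate(sample), lower, upper


# Synthetic skewed data standing in for "Sample A" (hypothetical values).
data_rng = random.Random(42)
sample_a = [data_rng.lognormvariate(4, 1) for _ in range(500)]

point, lower, upper = bootstrap_interval(sample_a, statistics.median)
print(f"median = {point:.2f}, 95% interval = [{lower:.2f}, {upper:.2f}]")
```

Because the aggregate is passed in as a plain function, the same routine works for aggregates that have no closed-form error formula, which is the motivation for using the bootstrap with UDAFs. The cost is that the aggregate runs once per resample.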
Runtime Correctness Tests
1. Given a query, how do you know if it can be approximated at runtime?
- Depends on the query, data distribution, and sample size
2. Need for runtime diagnosis tests
- Check whether error improves as sample size increases
- 30,000 extremely small query tasks
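The check "does error improve as sample size increases" can be illustrated with a small sketch. It draws samples of growing size, estimates the standard error of the mean on each, and reports whether the error shrinks at every step. This is a simplified stand-in under stated assumptions, not BlinkDB's diagnostic: the function names, the sample sizes, and the synthetic data are hypothetical, and a closed-form standard error is used where a real test would likely use the system's own error estimate.

```python
import math
import random
import statistics


def standard_error(sample):
    """Closed-form standard error of the mean."""
    return statistics.stdev(sample) / math.sqrt(len(sample))


def error_shrinks_with_size(data, sizes, seed=0):
    """Return (ok, errors): ok is True if the estimated error strictly
    decreases each time the sample size grows."""
    rng = random.Random(seed)
    errors = [standard_error(rng.sample(data, k)) for k in sorted(sizes)]
    ok = all(later < earlier for earlier, later in zip(errors, errors[1:]))
    return ok, errors


# Synthetic, well-behaved data (hypothetical): the diagnostic should pass.
data_rng = random.Random(1)
data = [data_rng.gauss(100, 15) for _ in range(20000)]

ok, errors = error_shrinks_with_size(data, sizes=[100, 1000, 10000])
print("error shrinks with sample size:", ok)
print("estimated errors:", [round(e, 3) for e in errors])
```

If the error does not shrink as the sample grows, the error estimate for that query and data distribution should not be trusted, and the query is a candidate for running on more data.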
Getting Started
1. BlinkDB alpha-0.1.0 released and available at http://blinkdb.org
2. Takes just 5-10 minutes to run it locally or to spin up an EC2 cluster
3. Hands-on Exercises today at the AMPCamp
4. Designed to be a drop-in tool like Shark
Summary
1. Approximate queries are an important means to achieve interactivity in processing large datasets
2. BlinkDB ...
- builds on Shark and Spark
- returns approximate answers with error bars by executing queries on small samples of data
- supports existing Hive queries with minor modifications
3. For more information, please check out our EuroSys 2013 (http://bit.ly/blinkdb-1) and KDD 2014 (http://bit.ly/blinkdb-2) papers
Thanks!