© 2010 IBM Corporation
Business-Intelligence Queries with Order Dependencies in DB2
Jarek Szlichta, University of Toronto and IBM CAS
Joint work with Parke Godfrey, Jarek Gryz, Wenbin Ma, Weinan Qiu and Calisto Zuzarte
York University and IBM CAS
Jarek Szlichta, 26/03/2014
IBM Power Systems
Data, Data Everywhere…
Open data
Business Data
Web Data
From Business to Data Science
Data Scientist: The Sexiest Job of the 21st Century
Harvard Business Review, Oct. 2012
(c) 2012 Biocomicals by Dr. Alper Uzon
Can we take data science out of the realm of geekdom and make it a trusted, respected profession?
Outline
I. Background
II. Optimization of Business Intelligence (BI)
– Implicit Order Dependencies (ODs)
– Explicit ODs
– Inference System
III. Conclusions
I. Business-Intelligence
Data warehouses are designed to assist in the analysis of business data
Business-Intelligence applications have become more complex and data volumes have grown
The increasing complexity raises performance issues and numerous challenges for query optimization
Traditional methods often fail when logical subtleties in database schemas and in queries circumvent them.
Date in TPC-DS Schema
date_dim table:
d_date_sk  d_date      d_year  d_month  d_day
…          …           …       …        …
2000       2010-08-30  2010    08       30
2001       2010-09-31  2010    09       31
2401       2011-01-05  2011    01       05
2402       2011-01-06  2011    01       06
2487       2011-04-01  2011    04       01
…          …           …       …        …
Motivation
Order dependencies (ODs) capture a monotonicity property between attributes
Functional dependencies (FDs) play important roles in query optimization [SIGMOD, Simmen et al., 1996]
We introduce ODs so that they can be used in a similar manner
Order plays a pivotal role in query optimization. Data is often stored sorted by the key of a clustered (tree) index
Order Dependencies
The order dependency (OD) [d_date] ↦ [d_year, d_month, d_day] states that whenever two tuples grow or agree on the value of d_date, they must lexicographically grow or agree on the values of d_year, d_month, d_day
However, date_dim ⊭
d_date_sk  d_date      d_year  d_month  d_day
2000       2010-08-30  2010    08       30
2001       2010-09-31  2010    09       31
2401       2011-01-05  2011    01       05
2402       2011-01-06  2011    01       06
2487       2011-04-01  2011    04       01
2488       2011-04-01  2011    04       01

d_date_sk  d_date      d_year  d_month  d_day  d_quarter
2000       2010-08-30  2010    08       30     02
2001       2010-09-31  2010    09       31     03
2401       2011-01-05  2011    01       05     01
2402       2011-01-06  2011    01       06     01
2487       2011-04-01  2011    04       01     01
2488       2011-04-01  2011    04       01     01
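What it means for a table to satisfy an OD can be sketched in a few lines of Python, using the rows shown above. The helper names `project` and `satisfies_od` are illustrative assumptions, not part of DB2. The check compares every ordered pair of rows, so it is quadratic and meant for tiny examples only. The second OD tested, [d_date] ↦ [d_quarter], is chosen here purely as an illustration of a violated dependency.

```python
from itertools import permutations

# Rows copied from the table above:
# (d_date_sk, d_date, d_year, d_month, d_day, d_quarter)
COLS = ("d_date_sk", "d_date", "d_year", "d_month", "d_day", "d_quarter")
ROWS = [
    (2000, "2010-08-30", 2010, 8, 30, 2),
    (2001, "2010-09-31", 2010, 9, 31, 3),
    (2401, "2011-01-05", 2011, 1, 5, 1),
    (2402, "2011-01-06", 2011, 1, 6, 1),
    (2487, "2011-04-01", 2011, 4, 1, 1),
    (2488, "2011-04-01", 2011, 4, 1, 1),
]

def project(row, attrs):
    """Return the values of the listed attributes as a tuple (compared lexicographically)."""
    return tuple(row[COLS.index(a)] for a in attrs)

def satisfies_od(rows, lhs, rhs):
    """True iff for every pair s, t: s <= t on lhs implies s <= t on rhs."""
    return all(
        project(s, rhs) <= project(t, rhs)
        for s, t in permutations(rows, 2)
        if project(s, lhs) <= project(t, lhs)
    )

# ISO-formatted date strings compare in chronological order.
holds = satisfies_od(ROWS, ["d_date"], ["d_year", "d_month", "d_day"])
fails = satisfies_od(ROWS, ["d_date"], ["d_quarter"])
```

On these rows `holds` is true and `fails` is false: d_quarter drops from 03 to 01 while d_date keeps growing.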
Optimization with ODs
OD optimization techniques are applicable even when the database has no declared ODs
ODs also arise through SQL functions and SQL expressions
‒ [d_date] ↦ [year(d_date)]
‒ [d_date] ↦ [d_date + 30 days]
Also, if there is a predicate A = B, then the OD [A] ↦ [B] is satisfied within the scope of the query
If an OD [A] ↦ [Z] is declared, then the OD [B] ↦ [Z] is also satisfied
‒ Therefore, the optimizer needs to infer ODs from other ODs
Substring with group-by
select substr(s_zip, 1, 2) as area,
       count(distinct s_zip) as cnt,
       sum(ss_net_profit) as net
from store_sales, store
where ss_store_sk = s_store_sk
group by substr(s_zip, 1, 2)
Let there be an index on s_zip in table store
[s_zip] ↦ [substr(s_zip, 1, 2)]
Optimizer can accomplish group-by on-the-fly (partial group by)
Note that a clever SQL programmer could not manually rewrite the query with group by s_zip to avoid this issue
‒ Since the substring changes the partition of the group by
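The on-the-fly group-by can be sketched in Python as follows. The rows are hypothetical, and the sketch assumes the joined rows reach the aggregation step in s_zip order, as a scan over the index on s_zip could deliver them. Because [s_zip] ↦ [substr(s_zip, 1, 2)], each two-character area forms a contiguous run in that order, so one pass is enough and no extra sort or hash table is needed.

```python
from itertools import groupby

# Hypothetical joined rows (s_zip, ss_net_profit), already in s_zip order.
rows = [
    ("10001", 5.0),
    ("10001", 7.5),
    ("10456", 1.0),
    ("20500", 2.0),
    ("20500", 3.0),
    ("20777", 4.0),
]

def area(row):
    """Python equivalent of substr(s_zip, 1, 2)."""
    return row[0][:2]

def streaming_group_by(sorted_rows):
    """Aggregate per area in a single pass over s_zip-ordered input."""
    for key, group in groupby(sorted_rows, key=area):
        group = list(group)  # holds one area at a time, never the whole input
        cnt = len({zip_code for zip_code, _ in group})  # count(distinct s_zip)
        net = sum(profit for _, profit in group)        # sum(ss_net_profit)
        yield key, cnt, net

result = list(streaming_group_by(rows))
```

Grouping by s_zip instead would produce four groups rather than two, which is why the manual rewrite is not equivalent.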
Near-Sortedness
select d_year, i_brand_id brand_id, i_brand, sum(ss_ext_sales_price) sum_agg
from date_dim, store_sales, item
where d_date_sk = ss_sold_date_sk ...
group by d_year, sum_agg, i_brand_id
order by d_year, sum_agg, i_brand_id
The inference algorithm is triggered by the order by and group by clauses
‒ It detects that [d_date] ↦ [d_year]
Therefore, the optimizer can take advantage of the index on d_date
‒ simplifying the sort operator in the plan (near-sortedness), to accomplish order by and group by
If partitions of d_year are suitably small, each partition of d_year can be sorted on sum_agg and i_brand_id in main memory “on-the-fly”
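Near-sortedness can be illustrated with a small Python sketch. The rows are hypothetical and are assumed to arrive ordered by d_year only, as they could after a scan that follows the index on d_date. The function `finish_sort` is an illustrative name, not an optimizer operator; it sorts one d_year partition at a time instead of sorting the whole input.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows (d_year, sum_agg, i_brand_id), ordered by d_year only.
rows = [
    (2010, 30.0, 7),
    (2010, 10.0, 9),
    (2010, 10.0, 2),
    (2011, 5.0, 4),
    (2011, 1.0, 8),
]

def finish_sort(near_sorted_rows):
    """Complete the order on (d_year, sum_agg, i_brand_id) partition by partition."""
    for _, partition in groupby(near_sorted_rows, key=itemgetter(0)):
        # Only one d_year partition is materialized in memory at a time.
        yield from sorted(partition, key=itemgetter(1, 2))

ordered = list(finish_sort(rows))
```

The output equals a full sort of the input on all three columns, but the memory needed is bounded by the largest d_year partition.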
Query with year(d_date)
select year(d_date), sm_type, ws_web_name, ...
from web_sales, warehouse, ship_mode, web_site, date_dim
where ws_ship_date_sk = d_date_sk and ...
group by year(d_date), sm_type, ws_web_name
order by year(d_date), sm_type, ws_web_name
Similarly, ODs and near-sortedness can be exploited when the query uses SQL functions such as year()
[d_date] ↦ [year(d_date)]
…
Query with Case Expression
select ..., sum(quantity),
       (case when customer_id between 1 and 10 then 1
             ...
             when customer_id between 91 and 100 then 10
             ...
        end)
from sales S, ...
where ...
group by (case customer_id between...)
order by (case customer_id between ...)
There is an OD in the scope of the query between customer_id and the output of the case expression
When this relationship is discovered, the index on customer_id can be used
These kinds of subtleties are common in customer queries created by BI reporting tools such as Cognos (auto-generation)
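Why the case expression induces an OD can be sketched in Python: an expression that never decreases as customer_id grows preserves the order of customer_id. The function `bucket` is a stand-in that assumes the elided branches follow the regular pattern 11 to 20 gives 2, and so on; `is_monotone` is an illustrative helper that tests monotonicity over a finite domain only.

```python
def bucket(customer_id):
    """Stand-in for the case expression: 1-10 -> 1, 11-20 -> 2, ..., 91-100 -> 10.
    The middle branches are an assumption; the query elides them."""
    return (customer_id - 1) // 10 + 1

def is_monotone(f, domain):
    """True iff f never decreases as its argument grows over the given domain."""
    values = [f(x) for x in sorted(domain)]
    return all(a <= b for a, b in zip(values, values[1:]))

customer_ids = range(1, 101)
# Monotone expression: the OD from customer_id to the expression holds.
od_holds = is_monotone(bucket, customer_ids)
# Non-monotone expression (last digit): no such OD, the index order does not help.
od_fails = is_monotone(lambda c: c % 10, customer_ids)
```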
Sample of TPC-DS Schema
Query with an Expensive Join
select ...
from web_sales W, item I, date_dim D
where W.ws_item_sk = I.i_item_sk
  and I.i_category in ('Sports', 'Books', 'Home')
  and W.ws_sold_date_sk = D.d_date_sk
  and D.d_date between cast('1999-02-22' as date)
                   and (cast('1999-02-22' as date) + 30 days)
  ...;
Eliminating Join in Data Warehouses
This requires a potentially expensive join between the sales fact table and the date dimension table.
We optimize such queries by removing the join
There is an OD between d_date and the surrogate key d_date_sk, which can be declared as an integrity constraint
Two probes can be made into the dimension table to calculate the range of the surrogate keys of date in the fact table
Rewrite of the Query
select ...
from web_sales W, item I,
     (select min(d_date_sk) as mindate
      from date_dim
      where d_date >= cast('1999-02-22' as date)) as A,
     (select max(d_date_sk) as maxdate
      from date_dim
      where d_date <= cast('1999-02-22' as date) + 30 days) as Z
where ... and
      W.ws_sold_date_sk between A.mindate and Z.maxdate
...;
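The equivalence of the two plans can be checked on toy data with a short Python sketch. All data below is hypothetical. The sketch assumes that surrogate keys are assigned in date order, so that [d_date_sk] ↦ [d_date] holds on date_dim, and that every fact row references an existing date_dim row.

```python
from datetime import date, timedelta

# Hypothetical date_dim: d_date_sk -> d_date, keys assigned in date order.
start = date(1999, 1, 1)
date_dim = {1000 + i: start + timedelta(days=i) for i in range(120)}
# Hypothetical fact table, reduced to its ws_sold_date_sk column.
web_sales = [1005, 1052, 1052, 1060, 1082, 1083, 1119]

lo = date(1999, 2, 22)
hi = lo + timedelta(days=30)

# Original plan: join every fact row to date_dim, then filter on d_date.
joined = [sk for sk in web_sales if lo <= date_dim[sk] <= hi]

# Rewritten plan: two probes into date_dim, then a range predicate on the fact table.
mindate = min(sk for sk, d in date_dim.items() if d >= lo)
maxdate = max(sk for sk, d in date_dim.items() if d <= hi)
rewritten = [sk for sk in web_sales if mindate <= sk <= maxdate]
```

Both plans select the same fact rows. Without the OD, a key inside the range [mindate, maxdate] could map to a date outside the requested interval, and the rewrite would be unsound.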
Sound and Complete Axiomatization
(Order equivalent) X ↔ Y iff X ↦ Y and Y ↦ X
(Order compatibility) X ~ Y iff XY ↔ YX
1: Reflexivity
2: Prefix
3: Normalization
4: Transitivity
5: Suffix
6: Chain
J. Szlichta, P. Godfrey, J. Gryz. Fundamentals of Order Dependencies. Very Large Data Bases (PVLDB), 2012
Complexity of OD Inference
The inference problem is to decide whether a dependency is implied by a given set of dependencies
• Assume M = {AB ↦ C, C ↦ D}. Is it true that M ⊨ {AB ↦ D}?
• An inference system is very useful for query optimization
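For the example above the answer is yes, by Transitivity. A minimal Python sketch that applies only the Transitivity axiom until nothing new can be derived is given below. It is sound but far from complete: a full inference procedure needs all the axioms, or a chase. The representation of an OD as a pair of attribute tuples and the function name `transitive_closure` are illustrative assumptions.

```python
def transitive_closure(ods):
    """Close a set of ODs, given as (lhs, rhs) pairs of attribute tuples,
    under the Transitivity axiom only: X -> Y and Y -> Z give X -> Z."""
    closure = set(ods)
    changed = True
    while changed:
        changed = False
        for x, y in list(closure):
            for y2, z in list(closure):
                if y == y2 and (x, z) not in closure:
                    closure.add((x, z))
                    changed = True
    return closure

# M = {AB -> C, C -> D}; attribute lists are written as tuples.
M = {(("A", "B"), ("C",)), (("C",), ("D",))}
closed = transitive_closure(M)
implied = (("A", "B"), ("D",)) in closed       # AB -> D is derived
not_derived = (("D",), ("A", "B")) in closed   # D -> AB is not derived
```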
We show how to infer ODs with a chase procedure (exponential-time schema complexity)
We establish a proof of co-NP-completeness for the inference problem for ODs
…
J. Szlichta, P. Godfrey, J. Gryz, C. Zuzarte. Expressiveness and Complexity of Order Dependencies. Very Large Data Bases (PVLDB), 2013
Restricted Domain
We introduced a restricted domain
A domain is restricted if an additional order property is guaranteed over the schema
It makes reasoning over ODs “simpler”
We propose a restricted set of axioms
Restricted Domain
The complexity of inference with the restricted axiom-based approach is polynomial
1: Reflexivity
2: Prefix
3: Normalization
4: Transitivity
5: Suffix
6: Downward Closure X Y
Replacing Chain Axiom
Evaluation
Our experiments on the TPC-DS schema show a 30% gain on a 10 GB database and a 50% gain on a 100 GB database (customer-driven queries)
[Figure: bar chart of elapsed time (s) for the original and rewritten versions of queries A1, A1', A4, A5, A5', A6, A8, A9, A10]
Evaluation (TPC-DS Queries)
[Figure: bar chart of elapsed time (s) for the original and rewritten versions of TPC-DS queries Q7, Q12, Q13, Q16, Q18, Q20, Q26, Q37, Q32, Q36, Q48, Q94, Q98]
III. Conclusions
Business Intelligence is an important area of research
We have done some (hopefully) interesting work in this area
– Optimization of Queries for Business Intelligence
More research can be done!
Contact: [email protected]
Thank you!