© 2010 IBM Corporation

Business-Intelligence Queries with Order Dependencies in DB2

Jarek Szlichta, University of Toronto and IBM CAS

Joint work with Parke Godfrey, Jarek Gryz, Wenbin Ma, Weinan Qiu and Calisto Zuzarte

York University and IBM CAS

26/03/2014

IBM Power Systems

Data, Data Everywhere…

Open data

Business Data

Web Data

From Business to Data Science

Data Scientist: The Sexiest Job of the 21st Century

Harvard Business Review, Oct. 2012

(c) 2012 Biocomicals by Dr. Alper Uzon

Can we take data science out of the realm of geekdom and make it a trusted, respected profession?

Outline

I. Background

II. Optimization of Business Intelligence (BI)

– Implicit Order Dependencies (ODs)

– Explicit ODs

– Inference System

III. Conclusions

I. Business-Intelligence

Data warehouses are designed to assist in analysis of business data

Business-Intelligence applications have become more complex and data volumes have grown

The increasing complexity raises performance issues and numerous challenges for query optimization

Traditional methods often fail when logical subtleties in database schemas and in queries circumvent them.

Date in TPC-DS Schema

date_dim table:

d_date_sk  d_date      d_year  d_month  d_day
…          …           …       …        …
2000       2010-08-30  2010    08       30
2001       2010-09-31  2010    09       31
2401       2011-01-05  2011    01       05
2402       2011-01-06  2011    01       06
2487       2011-04-01  2011    04       01
…          …           …       …        …

Motivation

Order dependencies (ODs) capture a monotonicity property between attributes

Functional dependencies (FDs) play an important role in query optimization [Simmen et al., SIGMOD 1996]

We introduce ODs to use them in a similar manner

Order plays a pivotal role in query optimization. Data is often stored sorted by the key of a clustered (tree) index

Order Dependencies

The order dependency (OD) [d_date] ↦ [d_year, d_month, d_day] states that whenever one tuple is less than or equal to another on d_date, it must also be lexicographically less than or equal to it on the attribute list d_year, d_month, d_day

However, date_dim ⊭

d_date_sk  d_date      d_year  d_month  d_day
2000       2010-08-30  2010    08       30
2001       2010-09-31  2010    09       31
2401       2011-01-05  2011    01       05
2402       2011-01-06  2011    01       06
2487       2011-04-01  2011    04       01
2488       2011-04-01  2011    04       01

d_date_sk  d_date      d_year  d_month  d_day  d_quarter
2000       2010-08-30  2010    08       30     02
2001       2010-09-31  2010    09       31     03
2401       2011-01-05  2011    01       05     01
2402       2011-01-06  2011    01       06     01
2487       2011-04-01  2011    04       01     01
2488       2011-04-01  2011    04       01     01
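The OD definition can be checked directly on a small table. The following is a minimal sketch, writing X ↦ Y for "X orders Y": it compares every pair of rows and is meant only to illustrate the semantics, not how an optimizer would verify a constraint. The helper names `lex_leq` and `satisfies_od` and the sample rows are assumptions made for this illustration.

```python
from itertools import product

def lex_leq(s, t, attrs):
    """True if row s is lexicographically <= row t on the attribute list attrs."""
    return [s[a] for a in attrs] <= [t[a] for a in attrs]

def satisfies_od(rows, lhs, rhs):
    """Check lhs ↦ rhs: every pair with s <= t on lhs must have s <= t on rhs."""
    return all(
        lex_leq(s, t, rhs)
        for s, t in product(rows, repeat=2)
        if lex_leq(s, t, lhs)
    )

# Illustrative rows in the style of the date_dim sample (not real TPC-DS data).
COLUMNS = ("d_date_sk", "d_date", "d_year", "d_month", "d_day")
date_dim = [dict(zip(COLUMNS, r)) for r in [
    (2000, "2010-08-30", 2010, 8, 30),
    (2401, "2011-01-05", 2011, 1, 5),
    (2402, "2011-01-06", 2011, 1, 6),
    (2487, "2011-04-01", 2011, 4, 1),
    (2488, "2011-04-01", 2011, 4, 1),
]]

# The date orders its (year, month, day) decomposition.
print(satisfies_od(date_dim, ["d_date"], ["d_year", "d_month", "d_day"]))  # True
# Two rows share a date but differ on the key, so the date does not order the key.
print(satisfies_od(date_dim, ["d_date"], ["d_date_sk"]))  # False
```

Note that a tie on the left-hand side forces a tie on the right-hand side, which is why every OD also implies the corresponding functional dependency.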

Optimization with ODs

OD optimization techniques are applicable even when the database has no declared ODs

ODs also arise through SQL functions and SQL expressions

‒ [d_date] ↦ [year(d_date)]

‒ [d_date] ↦ [d_date + 30 days]

Also, if there is a predicate A = B, then the OD [A] ↦ [B] is satisfied within the scope of the query

If the OD [A] ↦ [Z] is declared, then the OD [B] ↦ [Z] is also satisfied

‒ Therefore, the optimizer needs to infer ODs from other ODs
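The A = B example can be sketched as a small closure computation, writing [A] ↦ [Z] for "A orders Z". The sketch below handles only single-attribute ODs and applies only the transitivity rule, so it is far from a full inference system. The function name `infer_closure` and the pair encoding are assumptions made for this illustration.

```python
def infer_closure(ods):
    """Close a set of single-attribute ODs under transitivity.

    Each OD is a pair (a, b), read as [a] ↦ [b].
    """
    closure = set(ods)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

declared = {("A", "Z")}                    # declared OD [A] ↦ [Z]
from_predicate = {("A", "B"), ("B", "A")}  # predicate A = B holds in both directions

closure = infer_closure(declared | from_predicate)
print(("B", "Z") in closure)  # True
```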

Substring with group-by

select substr(s_zip, 1, 2) as area,
       count(distinct s_zip) as cnt,
       sum(ss_net_profit) as net
from store_sales, store
where ss_store_sk = s_store_sk
group by substr(s_zip, 1, 2)

Let there be an index on s_zip in table store

[s_zip] ↦ [substr(s_zip, 1, 2)]

The optimizer can then accomplish the group-by on the fly (partial group-by)

Note that even a clever SQL programmer could not manually rewrite the query with group by s_zip to avoid this issue

‒ since the substring changes the partitioning of the group-by
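The on-the-fly group-by can be illustrated outside the database. Because sorting by s_zip also sorts by its two-character prefix, all rows of one group arrive contiguously and can be aggregated in a single pass without a hash table or an extra sort. This is a minimal sketch assuming the joined rows are already delivered in s_zip order (as an index scan would do); the function name `streaming_group_by` and the sample rows are hypothetical.

```python
from itertools import groupby

def streaming_group_by(rows_sorted_by_zip):
    """Aggregate per 2-character zip prefix in one pass over rows sorted by s_zip."""
    result = []
    for area, group in groupby(rows_sorted_by_zip, key=lambda r: r["s_zip"][:2]):
        zips = set()
        net = 0
        for row in group:
            zips.add(row["s_zip"])
            net += row["ss_net_profit"]
        result.append((area, len(zips), net))
    return result

# Hypothetical joined rows; sorting here stands in for reading via the s_zip index.
rows = sorted(
    [
        {"s_zip": "90210", "ss_net_profit": 4},
        {"s_zip": "10001", "ss_net_profit": 5},
        {"s_zip": "94105", "ss_net_profit": 6},
        {"s_zip": "10458", "ss_net_profit": 3},
        {"s_zip": "10001", "ss_net_profit": 7},
    ],
    key=lambda r: r["s_zip"],
)

print(streaming_group_by(rows))  # [('10', 2, 15), ('90', 1, 4), ('94', 1, 6)]
```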

Near-Sortedness

select d_year, i_brand_id brand_id, i_brand, sum(ss_ext_sales_price) sum_agg
from date_dim, store_sales, item
where d_date_sk = ss_sold_date_sk ...
group by d_year, i_brand, i_brand_id
order by d_year, sum_agg, i_brand_id

The inference algorithm is triggered by the order-by and group-by clauses

‒ It detects that [d_date] ↦ [d_year]

Therefore, the optimizer can take advantage of the index on d_date

‒ simplifying the sort operator in the plan (near-sortedness), to accomplish the order-by and group-by

If the partitions of d_year are suitably small, each partition of d_year could be sorted on sum_agg and i_brand_id in main memory “on the fly”
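Near-sortedness can be sketched as a partition-wise sort. When the input already arrives ordered by d_year (for example through the index on d_date), only the rows within each year need sorting, one partition at a time. The function name `sort_within_partitions` and the sample rows below are assumptions made for this illustration.

```python
from itertools import groupby
from operator import itemgetter

def sort_within_partitions(rows_sorted_by_year):
    """Yield rows ordered by (d_year, sum_agg, brand_id), given input ordered by d_year."""
    for _, partition in groupby(rows_sorted_by_year, key=itemgetter("d_year")):
        # Only one d_year partition is held in memory at a time.
        yield from sorted(partition, key=itemgetter("sum_agg", "brand_id"))

# Hypothetical aggregated rows, already ordered by d_year but not by the other keys.
rows = [
    {"d_year": 2010, "sum_agg": 50, "brand_id": 2},
    {"d_year": 2010, "sum_agg": 20, "brand_id": 7},
    {"d_year": 2010, "sum_agg": 20, "brand_id": 3},
    {"d_year": 2011, "sum_agg": 10, "brand_id": 1},
    {"d_year": 2011, "sum_agg": 5, "brand_id": 9},
]

ordered = list(sort_within_partitions(rows))
fully_sorted = sorted(rows, key=itemgetter("d_year", "sum_agg", "brand_id"))
print(ordered == fully_sorted)  # True
```

The result matches a full sort on all three keys, while the working set is bounded by the largest d_year partition.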

Query with year(d_date)

select year(d_date), sm_type, ws_web_name, ...
from web_sales, warehouse, ship_mode, web_site, date_dim
where ws_ship_date_sk = d_date_sk and ...
group by year(d_date), sm_type, ws_web_name
order by year(d_date), sm_type, ws_web_name

Similarly, ODs and near-sortedness can be used when using SQL functions such as year()

[d_date] ↦ [year(d_date)]

Query with Case Expression

select ..., sum(quantity),
       (case when customer_id between 1 and 10 then 1
             ...
             when customer_id between 91 and 100 then 10
             ...
        end)
from sales S, ...
where ...
group by (case customer_id between ...)
order by (case customer_id between ...)

Within the scope of the query, there is an OD between customer_id and the output of the case expression

When this relationship is discovered, the index on customer_id can be used

These kinds of subtleties are common in customer queries auto-generated by BI reporting tools such as Cognos

Sample of TPC-DS Schema

Query with an Expensive Join

select ...
from web_sales W, item I, date_dim D
where W.ws_item_sk = I.i_item_sk and
      I.i_category in ('Sports', 'Books', 'Home') and
      W.ws_sold_date_sk = D.d_date_sk and
      D.d_date between cast('1999-02-22' as date) and
                       (cast('1999-02-22' as date) + 30 days)
      ...;

Eliminating Join in Data Warehouses

This query requires a potentially expensive join between the sales fact table and the date dimension table.

We optimize such queries by removing the join

There is an OD that can be declared as an integrity constraint

‒ [d_date_sk] ↦ [d_date]

Two probes can be made into the dimension table to calculate the range of the surrogate keys of date in the fact table

Rewrite of the Query

select ...
from web_sales W, item I,
     (select min(d_date_sk) as mindate
      from date_dim
      where d_date >= cast('1999-02-22' as date)) as A,
     (select max(d_date_sk) as maxdate
      from date_dim
      where d_date <= cast('1999-02-22' as date) + 30 days) as Z
where ... and
      W.ws_sold_date_sk between A.mindate and Z.maxdate
      ...;
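The equivalence of the two query forms can be tried out on a toy database. The sketch below uses Python's sqlite3 module rather than DB2, so the date arithmetic is written with SQLite's `date(..., '+30 days')` instead of the DB2 syntax on the slides. It assumes that surrogate keys are assigned in date order, that is, that the OD [d_date_sk] ↦ [d_date] holds, and that every fact row references an existing date key. The table contents and the choice of `ws_net_paid` as the aggregated column are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table date_dim (d_date_sk integer primary key, d_date text);
    create table web_sales (ws_sold_date_sk integer, ws_net_paid integer);
    -- Surrogate keys are assigned in date order, so d_date_sk orders d_date.
    insert into date_dim values
        (1, '1999-02-20'), (2, '1999-02-22'), (3, '1999-03-10'),
        (4, '1999-03-24'), (5, '1999-03-25');
    insert into web_sales values
        (1, 10), (2, 20), (3, 30), (3, 5), (4, 40), (5, 50);
""")

# Original form: join the fact table with the date dimension.
original = """
    select sum(W.ws_net_paid)
    from web_sales W, date_dim D
    where W.ws_sold_date_sk = D.d_date_sk
      and D.d_date between date('1999-02-22') and date('1999-02-22', '+30 days')
"""

# Rewritten form: two probes into date_dim yield the surrogate-key range.
rewritten = """
    select sum(W.ws_net_paid)
    from web_sales W,
         (select min(d_date_sk) as mindate from date_dim
          where d_date >= date('1999-02-22')) as A,
         (select max(d_date_sk) as maxdate from date_dim
          where d_date <= date('1999-02-22', '+30 days')) as Z
    where W.ws_sold_date_sk between A.mindate and Z.maxdate
"""

original_total = conn.execute(original).fetchone()[0]
rewritten_total = conn.execute(rewritten).fetchone()[0]
print(original_total, rewritten_total)  # 95 95
```

If the keys were not assigned in date order, the key range could include dates outside the requested interval and the two queries could disagree, which is why the OD is needed for the rewrite.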

Sound and Complete Axiomatization

(Order equivalence) X ↔ Y iff X ↦ Y and Y ↦ X

(Order compatibility) X ~ Y iff XY ↔ YX

1: Reflexivity

2: Prefix

3: Normalization

4: Transitivity

5: Suffix

6: Chain

J. Szlichta, P. Godfrey, J. Gryz. Fundamentals of Order Dependencies. Very Large Data Bases (PVLDB), 2012

Complexity of OD Inference

The inference problem is to decide whether a dependency is implied by a given set of dependencies

• Assume M = {AB ↦ C, C ↦ D}. Is it true that M ⊨ AB ↦ D?

• An inference system is very useful for query optimization

We show how to infer ODs with a chase procedure (exponential-time schema complexity)

We establish a proof of co-NP-completeness for the inference problem for ODs

J. Szlichta, P. Godfrey, J. Gryz, C. Zuzarte. Expressiveness and Complexity of Order Dependencies. Very Large Data Bases (PVLDB), 2013
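The example question can be answered with a brute-force check, writing X ↦ Y for "X orders Y". This is a sketch for intuition only and is not the chase procedure referenced above. It relies on two observations: an OD is violated exactly when some pair of tuples violates it, and for a single pair only the relative order on each attribute (less, equal, greater) matters. Enumerating all such sign patterns is exponential in the number of attributes. It assumes every attribute domain has at least two comparable values; all function names are hypothetical.

```python
from itertools import product

def lex_sign(signs, attrs):
    """Sign (-1, 0, 1) of the lexicographic comparison of two tuples on attrs."""
    for a in attrs:
        if signs[a] != 0:
            return signs[a]
    return 0

def holds_on_pair(signs, od):
    """Does lhs ↦ rhs hold on the two-tuple table described by per-attribute signs?"""
    lhs, rhs = od
    x, y = lex_sign(signs, lhs), lex_sign(signs, rhs)
    # s <= t on lhs must imply s <= t on rhs, and likewise with s and t swapped.
    return (x > 0 or y <= 0) and (x < 0 or y >= 0)

def implies(premises, goal, attributes):
    """True if no two-tuple table satisfies all premises while violating the goal."""
    for combo in product((-1, 0, 1), repeat=len(attributes)):
        signs = dict(zip(attributes, combo))
        if all(holds_on_pair(signs, od) for od in premises) and not holds_on_pair(signs, goal):
            return False
    return True

# M = {AB ↦ C, C ↦ D}, with attribute lists written as tuples.
M = [(("A", "B"), ("C",)), (("C",), ("D",))]
attributes = ("A", "B", "C", "D")

print(implies(M, (("A", "B"), ("D",)), attributes))  # True
print(implies(M, (("D",), ("C",)), attributes))      # False
```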

Restricted Domain

We introduced a restricted domain

A domain is restricted if an additional order property is guaranteed over the schema

It makes reasoning over ODs “simpler”

We propose a restricted set of axioms

Restricted Domain

The complexity of the restricted axiom-based approach is polynomial

1: Reflexivity

2: Prefix

3: Normalization

4: Transitivity

5: Suffix

6: Downward Closure (replaces the Chain axiom)

Evaluation

Our experiments on the TPC-DS schema show a 30% gain on a 10 GB database and a 50% gain on a 100 GB database (customer-driven queries)

[Bar chart: elapsed time (s) of the original query vs. the rewritten query, for queries A1, A1', A4, A5, A5', A6, A8, A9, A10]

Evaluation (TPC-DS Queries)

[Bar chart: elapsed time (s) of the original query vs. the rewritten query, for TPC-DS queries Q7, Q12, Q13, Q16, Q18, Q20, Q26, Q37, Q32, Q36, Q48, Q94, Q98]

III. Conclusions

Business Intelligence is an important area of research

We have done some (hopefully) interesting work in this area

– Optimization of Queries for Business Intelligence

More research can be done!

Contact: [email protected]

Thank you!
