GPORCA: Query Optimization as a Service

33
1 © Copyright 2015 Pivotal. All rights reserved. 1 © Copyright 2015 Pivotal. All rights reserved. GPORCA: Query Optimization as a Service Venkatesh Raghavan Apache HAWQ Nest

Transcript of GPORCA: Query Optimization as a Service

1© Copyright 2015 Pivotal. All rights reserved. 1© Copyright 2015 Pivotal. All rights reserved.

GPORCA: Query Optimization as a Service

Venkatesh Raghavan

Apache HAWQ Nest

2© Copyright 2015 Pivotal. All rights reserved.

• Motivation

• Introduction to GPORCA (a.k.a Orca)

• How to add an Orca feature?

• How to enable Orca to Apache HAWQ?

Outline

3© Copyright 2015 Pivotal. All rights reserved.

Why a New Optimizer?

• Query optimization is key to performance

• Legacy Planner was not initially designed with distributed

data processing in mind

• Average time to fix customer issues:– Legacy Optimizer (Planner) ~ 70 days

– Pivotal Query Optimizer (Orca) ~ 13 days

4© Copyright 2015 Pivotal. All rights reserved.

Legacy Query Planner

• Technology of the 90’s

• Addresses Join re-ordering

• Treats everything else as add-on (grouping, with clause, etc.)

• Imposes order on specific optimization steps

• Recursively descends into sub-queries– Cannot Unnest Complex Correlated Sub Queries

• High Code Complexity– Maximum: 102 (Orca 8.5)

– Minimum: 6.4 (Orca 1.5)

5© Copyright 2015 Pivotal. All rights reserved.

Join Ordering vs. “Everything Else”• TPC-H Query 5

– 6 Tables

– “Harmless” query“Everything Else”

Size of search space ~230,000,000

Join Order ProblemSize of search space

< 100,000

6© Copyright 2015 Pivotal. All rights reserved.

What Is GP-Orca?

• State-of-the-art query optimization framework designed from

scratch to – Improve – performance, ease-of-use

– Enable – foundation for future research and development

– Connect – applies to multiple host systems (GPDB and HAWQ)

7© Copyright 2015 Pivotal. All rights reserved.

Modularity

• Orca is not baked into one host system

Orca

Parser

Host System< />

SQL

Q2DXL

DXL

Query

MD Provider

MD

req.

Catalog Executor

DXL

MD

DXL

Plan

DXL2Plan

Results

8© Copyright 2015 Pivotal. All rights reserved.

Key Features

• Smarter partition elimination

• Subquery unnesting

• Common table expressions (CTE)

• Additional Functionality

– Improved join ordering

– Join-Aggregate reordering

–Sort order optimization

–Skew aware

9© Copyright 2015 Pivotal. All rights reserved.

Not Yet Feature Complete

• Improve Performance for Short Running Queries

• External parameters

• Cubes

• Multiple grouping sets

• Inverse distribution functions

• Ordered aggregates

• Catalog Queries

10© Copyright 2015 Pivotal. All rights reserved.

Currently in Apache HAWQ

Parser Executor

Orca

PlannerSQL Results

Query Plan

11© Copyright 2015 Pivotal. All rights reserved.

When Orca is exercised

Parser Executor

Orca

Planner

Query Plan

SQL Results

12© Copyright 2015 Pivotal. All rights reserved.

When Orca fallbacks

Parser Executor

Orca

Planner

Query

SQL Results

Fallback

Plan

Orca

Orca will automatically fallback to the legacy optimizer

for unsupported features

13© Copyright 2015 Pivotal. All rights reserved. 13© Copyright 2015 Pivotal. All rights reserved. 13© Copyright 2015 Pivotal. All rights reserved.

Subquery Unnesting

14© Copyright 2015 Pivotal. All rights reserved.

• A query that is nested inside an outer query block

• Correlated Subquery (CSQ) is a subquery that uses values

from the outer query

Subqueries: Definition

SELECT * FROM part p1

WHERE price >

(SELECT avg(price) FROM part p2 WHERE p2.brand = p1.brand)

15© Copyright 2015 Pivotal. All rights reserved.

Subqueries: Impact

• Heavily used in many workloads

– BI/Reporting tools generate substantial number of subqueries

– TPC-H workload: 40% of the 22 queries

– TPC-DS workload: 20% of the 111 queries

• Inefficient plans means query takes a long time or does not terminate

• Optimizations

– De-correlation

– Conversion of subqueries to joins

16© Copyright 2015 Pivotal. All rights reserved.

Subqueries in Disjunctive Filters

• Find parts with: size > 40 OR price > the average brand price

SELECT *

FROM part p1

WHERE p_size > 40 OR

p_retailprice >(SELECT avg(p_retailprice)

FROM part p2

WHERE p2.p_brand = p1.p_brand)

17© Copyright 2015 Pivotal. All rights reserved.

Subqueries in Disjunctive Filters

18© Copyright 2015 Pivotal. All rights reserved.

Subquery Handling: Orca vs. Planner

CSQ Class Planner Orca

CSQ in select list Correlated Execution Join

CSQ in disjunctive filter Correlated Execution Join

Multi-Level CSQ No Plan Join

CSQ with group by and inequality Correlated Execution Join

CSQ must return one row Correlated Execution Join

CSQ with correlation in select list Correlated Execution Correlated Execution

19© Copyright 2015 Pivotal. All rights reserved.

TPC-DS Orca vs. Planner

TPC-DS 10TB, 16 nodes, 48 GB/node

20© Copyright 2015 Pivotal. All rights reserved. 20© Copyright 2015 Pivotal. All rights reserved. 20© Copyright 2015 Pivotal. All rights reserved.

How to add an Orca feature?

21© Copyright 2015 Pivotal. All rights reserved.

• Step 1: Pre-Process Input Logical Expression– Apply heuristics like pushing selects down etc.

• Step 2: Exploration (via Transforms)

– Generate all equivalent logical plans

• Step 3: Implementation (via Transforms)

– Generate all physical implementation for all logical operators

• Step 4: Optimization

– Enforce distribution and ordering requirements and pick the cheapest

plan

Optimization Steps in Orca

22© Copyright 2015 Pivotal. All rights reserved.

• Split an aggregate into a pair of local and global aggregate.

• Schema: CREATE TABLE foo (a int, b int, c int) distributed by (a);

• Query: SELECT sum(c) FROM foo GROUP BY b

• Do local aggregation and then send the partial aggregation to

the master.

• The final aggregation can then be done on the master.

Let’s Pair

23© Copyright 2015 Pivotal. All rights reserved.

// HEADER FILES

~/orca/libgpopt/include/gpopt/xforms

// SOURCE FILES

~/orca/libgpopt/src/xforms

CXformSplitGbAgg

24© Copyright 2015 Pivotal. All rights reserved.

• Pattern

• Pre-Condition Check

Define What Will Trigger This Transformation

25© Copyright 2015 Pivotal. All rights reserved.

Pattern

GPOS_NEW(pmp)

CExpression

(

pmp,

// logical aggregate operator

GPOS_NEW(pmp) CLogicalGbAgg(pmp),

// relational child

GPOS_NEW(pmp) CExpression(pmp, GPOS_NEW(pmp) CPatternLeaf(pmp)),

// scalar project list

GPOS_NEW(pmp) CExpression(pmp, GPOS_NEW(pmp) CPatternTree(pmp))

));

26© Copyright 2015 Pivotal. All rights reserved.

// Compatibility function for splitting aggregates

virtual

BOOL FCompatible(CXform::EXformId exfid)

{

return (CXform::ExfSplitGbAgg != exfid);

}

Pre-Condition Check

Common sense rules such as do not fire this rule on a logical operator

that was a result of the same rule. (Avoid Infinite Recurssion)

27© Copyright 2015 Pivotal. All rights reserved.

void Transform

(

CXformContext *pxfctxt,

CXformResult *pxfres,

CExpression *pexpr

)

const;

The Actual Transformation

28© Copyright 2015 Pivotal. All rights reserved.

void CXformFactory::Instantiate()

{

….

Add(GPOS_NEW(m_pmp) CXformSplitGbAgg(m_pmp));

….

}

Register Transformation

29© Copyright 2015 Pivotal. All rights reserved. 29© Copyright 2015 Pivotal. All rights reserved. 29© Copyright 2015 Pivotal. All rights reserved.

How to enable Orca on HAWQ?

30© Copyright 2015 Pivotal. All rights reserved.

• GPORCA https://github.com/greenplum-db/gporca

• White Paper: bit.ly/1ntrE8v

• Pivotal Tracker: bit.ly/1m1WGDn

Get Your Hands On It!

31© Copyright 2015 Pivotal. All rights reserved.

• Step 1: Get Centos07 Docker Imagehttps://github.com/xinzweb/hawq-devel-env/blob/master/README.md

• Step 2: Update CMake to 3.4.3

• Step 3: Install GPORCA

– Install Xerces 3.1.2

– Install GPOS

Getting Orca on Apache HAWQ

32© Copyright 2015 Pivotal. All rights reserved.

• Step 4: Apply my changes to Apache HAWQ Makefilehttps://github.com/vraghavan78/incubator-hawq/tree/fix-enable-orca

• Step 5: Compile HAWQ with orca enabled

./configure --enable-orca --with-perl --with-python --with-libxml

• Step 6: May have to copy libraries to the different segments

For Example: scp -r . gpadmin@centos7-datanode1:/usr/local/lib

Getting Orca on Apache HAWQ

33© Copyright 2015 Pivotal. All rights reserved.

Publications• Optimization of Common Table Expressions in MPP Database Systems, VLDB 2015

– Amr El-Helw, Venkatesh Raghavan, Mohamed A. Soliman, George C. Caragea, Zhongxian Gu, Michalis Petropoulos.

• Orca: A Modular Query Optimizer Architecture for Big Data, SIGMOD 2014

– Mohamed A. Soliman, Lyublena Antova, Venkatesh Raghavan, Amr El-Helw, Zhongxian Gu, Entong Shen, George C. Caragea, Carlos Garcia-

Alvarado, Foyzur Rahman, Michalis Petropoulos, Florian Waas, Sivaramakrishnan Narayanan, Konstantinos Krikellas, Rhonda

Baldwin

• Optimizing Queries over Partitioned Tables in MPP Systems, SIGMOD 2014

– Lyublena Antova, Amr El-Helw, Mohamed Soliman, Zhongxian Gu, Michalis Petropoulos, Florian Waas

• Reversing Statistics for Scalable Test Databases Generation, DBTest 2013

– Entong Shen, Lyublena Antova

• Total Operator State Recall - Cost-Effective Reuse of Results in Greenplum Database, ICDE Workshops 2013

– George C. Caragea, Carlos Garcia-Alvarado, Michalis Petropoulos, Florian M. Waas

• Testing the Accuracy of Query Optimizers, DBTest 2012

– Zhongxian Gu, Mohamed A. Soliman, Florian M. Waas

– Automatic Capture of Minimal, Portable, and Executable Bug Repros using AMPERe, DBTest 2012

– Lyublena Antova, Konstantinos Krikellas, Florian M. Waas

– Automatic Data Placement in MPP Databases, ICDE Workshops 2012

– Carlos Garcia-Alvarado, Venkatesh Raghavan, Sivaramakrishnan Narayanan, Florian M. Waas