12-1EPL446: Advanced Database Systems - Demetris Zeinalipour (University of Cyprus)
EPL446 – Advanced Database Systems
Lecture 12
A Typical Relational Query Optimizer
Chapter 15: Ramakrishnan & Gehrke
(* exclude 15.5 and 15.7)
Demetris Zeinalipour
http://www.cs.ucy.ac.cy/~dzeina/courses/epl446
Department of Computer Science
University of Cyprus
Lecture Outline: Relational Query Optimizer
• Introduction to Relational Query Optimization
• Relational Algebra Equivalences
• Query Blocks: Units of Optimization
• Enumeration of Alternative Plans
• Cost Estimation of Plans
[Figure: DBMS architecture layers, top to bottom: Query Optimization and Execution; Relational Operators; Files and Access Methods; Buffer Management; Disk Space Management; DB]
Relational Query Optimization
• A user of a DBMS formulates SQL queries.
• The query optimizer translates each query into an
equivalent Relational Algebra (RA) query, i.e., an
RA query with the same result.
• To make query processing more efficient, the
query optimizer reorders the individual
operators within the RA query.
• Reordering has to preserve the query semantics
and is based on Relational Algebra
equivalences (covered in the following slides).
• Why can re-ordering improve the
efficiency?
• Different orders can imply different sizes
of the intermediate results.
• The smaller the intermediate results, the
more efficient the execution plan!
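A small sketch of why join order matters, using toy relations and a naive join. The relations R, S, T, their contents, and the helper `join` are made up for illustration; they are not from the lecture.

```python
def join(left, right, attr):
    """Naive nested-loops equi-join on an attribute name shared by both inputs."""
    return [{**l, **r} for l in left for r in right if l[attr] == r[attr]]

# Hypothetical toy relations: R(a), S(a, b), T(b)
R = [{"a": i} for i in range(100)]
S = [{"a": i, "b": i % 10} for i in range(100)]
T = [{"b": 0}]

# Order 1: (R join S) join T -> large intermediate result
rs = join(R, S, "a")
plan1 = join(rs, T, "b")

# Order 2: R join (S join T) -> small intermediate result
st = join(S, T, "b")
plan2 = join(R, st, "a")

print(len(rs), len(st))   # sizes of the two intermediate results
print(plan1 == plan2)     # both orders return the same tuples
```

Both orders produce the same final result, but the first materializes every R-S match before T filters most of them away.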
Relational Algebra Equivalences
• The most important RA equivalences are the commutative
and associative laws.
• A commutative law about some
operation (e.g., join) states that the order of its (two)
arguments does not matter.
– e.g., Join is commutative: (R ⋈ S) ≡ (S ⋈ R)
• An associative law about some
(binary) operation states that (more than two) arguments
can be grouped either from the left or from the right.
– e.g., Join is associative: R ⋈ (S ⋈ T) ≡ (R ⋈ S) ⋈ T
• If an operation is both commutative and associative,
then any number of arguments can be (re-)ordered in an
arbitrary manner.
RA Equivalences: Joins
• The following (binary) RA operations are
commutative and associative: ×, ⋈, ∪, ∩
• For example, we have:
(R ⋈ S) ≡ (S ⋈ R) (Commutative)
– the order of (two) arguments does not matter.
R ⋈ (S ⋈ T) ≡ (R ⋈ S) ⋈ T (Associative)
– arguments can be grouped either from the left or
from the right.
• Set difference (−) is neither commutative nor
associative:
(R − S) ≢ (S − R) and, in general, R − (S − T) ≢ (R − S) − T
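A quick check of these laws on small sets, using Python's built-in set operators as stand-ins for ∪, ∩ and −. The sets R, S, T are arbitrary illustrative values; a single example can only refute a law, not prove one.

```python
R = {1, 2}
S = {2, 3}
T = {2, 4}

# Union and intersection: commutative and associative on this example
union_commutes = (R | S) == (S | R)
intersection_associates = (R & (S & T)) == ((R & S) & T)

# Set difference: this example is a counterexample to both laws
difference_commutes = (R - S) == (S - R)                   # {1} vs {3}
difference_associates = (R - (S - T)) == ((R - S) - T)     # {1, 2} vs {1}

print(union_commutes, intersection_associates)
print(difference_commutes, difference_associates)
```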
RA Equivalences: Selections
• Selections are crucial from the point of view of query
optimization, because they typically reduce the size
of intermediate results by a significant factor.
• Laws for selections only:
– σ_{A1 ∧ … ∧ An}(R) ≡ σA1(… σAn(R))
(Cascading of conditions)
– σA1(σA2(R)) ≡ σA2(σA1(R))
(Commutative)
• Laws for combining selections with × and ⋈:
if R has all attributes mentioned in c,
σc(R ⋈ S) ≡ σc(R) ⋈ S
• Laws for combining selections with the other set operations −, ∪, ∩
(written θ here):
σc(R θ S) ≡ σc(R) θ σc(S)
• The above laws can be applied to “push selections down”
as far as possible in an expression, i.e., to perform
selections as early as possible. For example, if A ≡ B ∧ C ∧ D, where predicate B
mentions only attributes of R, C only attributes of S, and D attributes of both:
σA(R ⋈ S) ≡ σ_{B ∧ C ∧ D}(R ⋈ S) ≡ σD(σB(R) ⋈ σC(S))
• A selection over a Cartesian product yields a join:
σc(R × S) ≡ R ⋈c S
(A, B, C, D and c denote predicates.)
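The law σc(R ⋈ S) ≡ σc(R) ⋈ S can be checked on toy data. This is a sketch only: the tuples and the helpers `select` and `join` are illustrative assumptions, and the predicate mentions only Sailors attributes, as the law requires.

```python
def select(pred, rel):
    """Selection: keep the tuples that satisfy the predicate."""
    return [t for t in rel if pred(t)]

def join(left, right, attr):
    """Naive nested-loops equi-join on a shared attribute name."""
    return [{**l, **r} for l in left for r in right if l[attr] == r[attr]]

# Hypothetical sample data
sailors = [{"sid": 1, "rating": 9}, {"sid": 2, "rating": 3}, {"sid": 3, "rating": 7}]
reserves = [{"sid": 1, "bid": 101}, {"sid": 2, "bid": 102},
            {"sid": 2, "bid": 103}, {"sid": 3, "bid": 101}]

def high_rating(t):
    return t["rating"] > 5   # uses only attributes of Sailors

joined = join(sailors, reserves, "sid")
late = select(high_rating, joined)            # selection after the join

filtered = select(high_rating, sailors)
early = join(filtered, reserves, "sid")       # selection pushed below the join

print(late == early)                 # same result either way
print(len(joined), len(filtered))    # what each plan feeds into its next operator
```

Pushing the selection down means the join sees fewer outer tuples, while the answer is unchanged.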
RA Equivalences: Projections
• Projections can be cascaded: if A1 ⊆ A2 ⊆ … ⊆ An, then
πA1(R) ≡ πA1(πA2(…(πAn(R))…))
• Projection distributes over union: πA(R ∪ S) ≡ πA(R) ∪ πA(S).
It does not, in general, distribute over ∩ or −.
• Selection and projection: a projection commutes
with a selection.
– Applies only if σ uses only attributes retained by π.
• Projection and joins: we can “push” the projection
down by retaining only the attributes of R (and S) that are
needed for the join or are kept by the projection.
* Study book chapter 15.3 for more details on RA equivalences.
Query Blocks: Units of Optimization
• An SQL query is parsed into a
collection of query blocks, and these are
optimized one block at a time.
• Nested blocks are usually treated
as calls to a subroutine, made once
per outer tuple.
SELECT S.sname                            -- outer block
FROM Sailors S
WHERE S.age IN (SELECT MAX(S2.age)        -- nested block
                FROM Sailors S2
                GROUP BY S2.rating)
• For each block, the plans considered are:
– All available access methods, for each relation in the FROM clause.
– All possible join trees for the relations in the FROM clause.
• We shall see the above in further detail in the following slides.
[Optimizer stages: SQL ⇒ RA ⇒ Enumerate Plans ⇒ Estimate Cost]
Optimizing Query Block Example
• Example Schema
– Sailors (sid: integer, sname: string, rating: integer, age: real)
– Reserves (sid: integer, bid: integer, day: dates, rname: string)
– Boats (bid: integer, bname: string, color: string)
• Example Query
– For each sailor with the highest rating (over all sailors) and at
least two reservations for red boats, find the sailor id and the
earliest date on which the sailor has a reservation for a red boat.
SQL:
SELECT S.sid, MIN(R.day)   -- sailor id and earliest day of a red-boat reservation
FROM Sailors S, Reserves R, Boats B
WHERE S.sid = R.sid AND R.bid = B.bid AND B.color = 'red' AND
S.rating = (SELECT MAX(S2.rating) FROM Sailors S2)   -- highest rating
GROUP BY S.sid HAVING COUNT(*) > 1   -- at least two such reservations
Query Blocks: Units of Optimization
SQL: User's Query
SELECT S.sid, MIN(R.day)
FROM Sailors S, Reserves R, Boats B
WHERE S.sid = R.sid AND R.bid = B.bid AND B.color = 'red' AND
S.rating = (SELECT MAX(S2.rating) FROM Sailors S2)
GROUP BY S.sid
HAVING COUNT(*) > 1
SQL: Only the outer block is considered for the optimization part:
SELECT S.sid, MIN(R.day)
FROM Sailors S, Reserves R, Boats B
WHERE S.sid = R.sid AND R.bid = B.bid AND B.color = 'red' AND
S.rating = <reference to nested (inner) block>
GROUP BY S.sid
HAVING COUNT(*) > 1
Extended* Relational Algebra Block:
π_{S.sid, MIN(R.day)}(HAVING_{COUNT(*) > 1}(GROUP BY_{S.sid}(σ_{S.sid=R.sid ∧ R.bid=B.bid ∧ B.color='red' ∧ S.rating=value from nested block}(Sailors × Reserves × Boats))))
* Recall that HAVING, GROUP BY and aggregates are not part of Relational Algebra.
• A query block is treated as a σ-π-× algebra expression,
with the remaining operations (if any) carried out on
the result.
• For our example, the optimizer only considers the following Relational Algebra block:
π_{S.sid, R.day}(σ_{S.sid=R.sid ∧ R.bid=B.bid ∧ B.color='red' ∧ S.rating=value from nested block}(Sailors × Reserves × Boats))
• Aggregates, HAVING and GROUP BY are computed after
the σ-π-× part of the query.
• Now the optimizer needs to i) enumerate the
alternative plans and ii) estimate the cost of each plan.
Enumeration of Alternative Plans
• Problem: The space of alternative plans for a given query is
very large!
• To motivate the discussion, consider binary query
evaluation plans and assume that only one join algorithm exists.
• Question: How many such plans can we have?
• Answer: The number of binary plans (binary trees with n nodes) is C_n = (2n)! / (n+1)!
– N=4: 336 possible trees
– N=5: 5040 possible trees
– ….
– N=10: about 6 × 10^10 possible trees
[Figure: three example binary join trees over the relations A, B, C, D]
We certainly need to prune the search space!
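The counts can be reproduced by evaluating the plan-count formula directly, taking it to be C_n = (2n)! / (n+1)!, which matches the values given for N=4 and N=10. The function name `binary_plan_count` is an illustrative choice.

```python
from math import factorial

def binary_plan_count(n: int) -> int:
    """Evaluate C_n = (2n)! / (n+1)! using exact integer arithmetic."""
    return factorial(2 * n) // factorial(n + 1)

for n in (4, 5, 10):
    print(n, binary_plan_count(n))
```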
• The query optimizer therefore focuses on a subset
of plans.
• Algebraic plans: those that can be
expressed with Relational Algebra
operators.
• Enumerable plans: e.g., only binary
plans.
• Searched plans: among binary plans, only
consider the left-deep plans, i.e., those where the
right child of each join is a leaf (base
relation).
• Constructed plans: those that are
actually constructed.
Focus of the Query Optimizer
• Left-deep join trees:
– A left-deep tree is a tree in which the right child of each join is a leaf
(i.e., a base table or index).
– Left-deep trees allow us to generate all fully pipelined plans.
• As results are generated, they are forwarded to the operator
higher in the tree.
• Intermediate results are not written to temporary files.
• Not all left-deep trees are fully pipelined (e.g., with sort-merge join, no
results are generated during sorting, only during merging).
[Figure: a left-deep join tree over the relations A, B, C, D]
• Even when considering only left-deep plans, the number of
plans still grows rapidly as the number of joins increases!
• In particular, we have N! possible left-deep plans*, where N is the number
of base relations participating in the join.
– With N=4, we have 24 possible plans
– With N=5, we have 120 possible plans
– With N=6, we have 720 possible plans
– ….
– With N=10, we have 3628800 possible plans
[Figure: example left-deep trees over A, B, C, D that differ in the order of the relations]
* Again assuming that only one join algorithm exists.
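The N! count follows because a left-deep tree is fully determined by the order in which the relations are joined. A minimal sketch that enumerates these orders as nested tuples (the function name and the tuple representation are illustrative assumptions):

```python
from itertools import permutations

def left_deep_plans(relations):
    """Yield each left-deep join tree as a nested tuple, e.g. ((('A', 'B'), 'C'), 'D')."""
    for order in permutations(relations):
        plan = order[0]
        for rel in order[1:]:
            plan = (plan, rel)   # the right child is always a base relation
        yield plan

plans = list(left_deep_plans(["A", "B", "C", "D"]))
print(len(plans))   # one plan per permutation of the four relations
print(plans[0])
```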
• When enumerating plans we need a way to
determine the cost of each plan.
• The cost of a query plan is determined largely by
the order in which the tables are joined.
• Most query optimizers determine the join order via a
dynamic programming algorithm pioneered by
IBM's System R database project (next slide).
Sketch of the Enumeration Algorithm (uses Dynamic Programming)
• Pass 1: Find the access paths (file scan, indexes, etc.) for each relation in the
query.
– Objective: Record the cheapest way to scan the relation, as well as the cheapest
way to scan the relation that produces records in a particular sorted order.
– e.g., FileScan for fetching all tuples and B+ tree for fetching IDs in sorted order.
• Pass 2: For each pair of relations (for which a join condition exists) find
the cheapest way to join the relations and generate results i) with no order
and ii) with order.
– Utilize the available join algorithms implemented by the DBMS (nested-loops join,
sort-merge join, etc.).
• Pass 3: For each set of 3 relations (for which a join condition exists) find
the cheapest way to join the relations and generate results i) with no order
and ii) with order.
– In particular, join each two-relation plan produced by the previous pass with the
remaining relations in the query.
• Pass N: Continue the above until all N relations in the query are considered.
• At the end we obtain the overall best plan!
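The passes above can be sketched as a bottom-up dynamic program over sets of relations. This is a deliberately simplified illustration, not the System R implementation: the cost of a plan is assumed to be the sum of its intermediate result sizes, access paths and sorted orders are ignored, cross products are skipped, and the cardinalities and selectivities are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def best_left_deep_plan(card, sel):
    """card: {relation: cardinality}; sel: {frozenset({r1, r2}): join selectivity}.
    Returns (cost, result_size, join_order) of the cheapest left-deep plan.
    Assumes the join graph is connected."""
    rels = sorted(card)
    # Pass 1: single-relation plans (scan cost ignored in this toy model)
    best = {frozenset([r]): (0, Fraction(card[r]), (r,)) for r in rels}
    # Pass k: extend the best (k-1)-relation plans by one inner relation
    for k in range(2, len(rels) + 1):
        for subset in map(frozenset, combinations(rels, k)):
            for inner in sorted(subset):
                outer = subset - {inner}
                if outer not in best:
                    continue
                preds = [sel[frozenset((o, inner))] for o in outer
                         if frozenset((o, inner)) in sel]
                if not preds:
                    continue  # no join condition: skip the cross product
                cost_outer, size_outer, order = best[outer]
                size = size_outer * card[inner]
                for rf in preds:
                    size *= rf
                cost = cost_outer + size
                if subset not in best or cost < best[subset][0]:
                    best[subset] = (cost, size, order + (inner,))
    return best[frozenset(rels)]

# Hypothetical statistics for the running example
card = {"Sailors": 1000, "Reserves": 10000, "Boats": 10}
sel = {frozenset(("Sailors", "Reserves")): Fraction(1, 1000),
       frozenset(("Reserves", "Boats")): Fraction(1, 100)}
cost, size, order = best_left_deep_plan(card, sel)
print(cost, size, order)
```

Only the cheapest plan per set of relations is kept, so each pass reuses the results of the previous one instead of re-enumerating all join orders.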
Cost Estimation of Plans
• Consider a query block:
SELECT attribute list
FROM A, B, …, Z
WHERE term1 AND ... AND termz
• The maximum number of tuples in the result is the product of the cardinalities of the
relations in the FROM clause,
– i.e., |A| * |B| * … * |Z|
• Reduction factor (RF): the
ratio of the expected result size to the input size for a term.
– e.g., term1 yields 200 expected answers out of 1000 => RF of term1 = 0.2
– Result cardinality = max # tuples * product of all RFs.
• How can a DBMS know these RFs for a table without
spending too much time? (next slide)
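The estimate is a single product, sketched here with exact fractions. The cardinalities and the second reduction factor are hypothetical; the first RF is the 200-out-of-1000 example. Multiplying the RFs implicitly assumes that the terms are independent.

```python
from fractions import Fraction
from math import prod

def estimate_result_size(cardinalities, reduction_factors):
    """Result cardinality = max # tuples * product of all RFs."""
    max_tuples = prod(cardinalities)
    return max_tuples * prod(reduction_factors)

# Two relations with 1000 and 500 tuples (hypothetical), two WHERE terms
estimate = estimate_result_size(
    [1000, 500],
    [Fraction(200, 1000), Fraction(1, 500)],
)
print(estimate)
```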
Reduction Factors Using Histograms
• Wrong answer: scan the table => too expensive.
• Correct answer: utilize histograms, i.e., tiny data
structures, stored in the system catalog, that approximate the real distribution of
values in a table.
• Example
[Figure: initial distribution of “age” (y-axis: frequency of appearance), next to its equiwidth histogram and its equidepth histogram]
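A minimal sketch of the two histogram types, on made-up age values. The function names, the bucket count, and the assumption that the number of values divides evenly into the buckets are all illustrative simplifications.

```python
def equiwidth(values, lo, hi, buckets):
    """Count values per bucket; every bucket covers an equally wide value range."""
    width = (hi - lo) / buckets
    counts = [0] * buckets
    for v in values:
        idx = min(int((v - lo) / width), buckets - 1)  # clamp hi into the last bucket
        counts[idx] += 1
    return counts

def equidepth(values, buckets):
    """Return (min, max) per bucket; every bucket holds the same number of values.
    Assumes len(values) is a multiple of buckets."""
    ordered = sorted(values)
    size = len(ordered) // buckets
    return [(ordered[i * size], ordered[(i + 1) * size - 1]) for i in range(buckets)]

# Hypothetical skewed "age" column
ages = [18, 19, 20, 20, 21, 21, 21, 22, 30, 40, 50, 60]
width_counts = equiwidth(ages, lo=10, hi=70, buckets=3)
depth_ranges = equidepth(ages, buckets=3)
print(width_counts)
print(depth_ranges)
```

With skewed data the equiwidth histogram lumps most values into one bucket, while the equidepth histogram spends narrow buckets where the values are dense.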
Query Optimization Example
• Assume that we have the following access methods for
the working example we've been using:
• Sailors:
– Clustered B+ tree on rating
– Clustered Hash index on sid
• Reserves:
– Unclustered B+ tree on bid
• Task: The query optimizer needs to
optimize the query evaluation plan
shown in the figure.
[Figure: query evaluation plan over Reserves and Sailors with join condition sid=sid, selections bid=100 and rating > 5, and projection on sname]
• Pass 1:
– Sailors:
• Use the clustered B+ tree for rating > 5.
• If the index were unclustered, we would also consider a FileScan.
• Even then, the B+ tree plan might be kept, because it returns tuples in rating
order.
– Reserves:
• Use the B+ tree on bid, as it can quickly match
bid=100 (regardless of whether the index is clustered
or unclustered).
• Pass 2:
– Consider each plan retained from Pass 1 as the outer, and
consider how to join it with the (only) other relation as the inner.
• e.g., Reserves as outer: the Hash index can be used to get the
Sailors (inner) tuples that satisfy sid = outer tuple's sid
value.
[Figure: the same plan, annotated with the available B+ tree and Hash indexes]
Highlights of System R Optimizer
• Basic ideas in the System R query optimizer:
– Plan space: too large, must be pruned!
– Cost estimation: an approximate art at best.
• Characteristics:
– Statistics, maintained in system catalogs, are used to estimate the cost of
operations and result sizes.
– Considers only left-deep plans.
– No nested sub-queries (as these would enlarge the plan search space and slow
down optimization).
– No duplicate elimination inside the tree (only as a final step).
• Why? Duplicate elimination requires sorting or hashing; consequently the operator
cannot pipeline its results to operators higher in the query plan.
– Considers a combination of CPU and I/O costs.
• Impact:
– Most widely used currently; works well for < 10 joins.
Summary
• Query optimization is an important task in a relational
DBMS.
• Must understand optimization in order to understand
the performance impact of a given database design
(relations, indexes) on a workload (set of queries).
• Two parts to optimizing a query:
– Consider a set of alternative plans.
• Must prune search space; typically, left-deep plans only.
– Must estimate cost of each plan that is considered.
• Must estimate size of result and cost for each plan node.
• Key issues: Statistics, indexes, operator implementations.