Database Management 9. course. Execution of queries.


Database Management

9. course

Execution of queries

Query evaluation

[Figure: query evaluation flow. Query → Parse, compile → Relational algebra → Optimize (using Statistics) → Execution plan → Evaluate (using Data) → Query output]

• Query: SQL

• Parse: is the SQL query correct?

• Relational algebra: a form the computer can work with

• Optimize: based on what?

• Execution plan: if several queries give the same result, which is the best?

• Evaluate: find the proper data records

• Query output: give the answer to the user

Optimization example

• Data of a bank

• select balance from account where balance < 2500

• Two equivalent relational algebra representations: σ_{balance<2500}(π_{balance}(account)) and π_{balance}(σ_{balance<2500}(account))

• The cost of an operation depends on the algorithms we can use: e.g. an index speeds up the selection

• Primitive: elementary operation (projection, selection, …)

• Pipeline: building blocks for evaluation and statistics

• Input of a primitive = output of the previous primitive

Catalogue cost approximation

• To choose the proper strategy, an approximation of the cost is needed

• Cost approximation can be based on several attributes
  – Space
  – Time

• Statistics are stored in the catalogue

Content of the catalogue

• Number of records in relation r: n_r

• Number of blocks used for relation r: b_r

• Size of one record in relation r: s_r

• Number of records in one block: f_r

• Number of distinct values of attribute A in relation r: V(A,r) = |π_A(r)|

• Average number of records that fulfill an equality selection on attribute A: SC(A,r)
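The catalogue quantities above can be tied together in a small sketch. It assumes that records are packed into fixed-size blocks and that attribute values are uniformly distributed, so that SC(A,r) = n_r / V(A,r); the class name `RelationStats`, the `block_size` parameter and the sample numbers are hypothetical and not taken from the slides.

```python
from dataclasses import dataclass
from math import ceil


@dataclass
class RelationStats:
    """Hypothetical container for the catalogue statistics of one relation."""
    n_r: int          # number of records
    s_r: int          # size of one record in bytes
    block_size: int   # bytes per block (assumed parameter)
    distinct: dict    # attribute name -> V(A, r)

    @property
    def f_r(self) -> int:
        # records per block, assuming records do not span blocks
        return self.block_size // self.s_r

    @property
    def b_r(self) -> int:
        # blocks needed, assuming the records are packed together
        return ceil(self.n_r / self.f_r)

    def sc(self, attr: str) -> float:
        # average matches of an equality selection, assuming uniform values
        return self.n_r / self.distinct[attr]


account = RelationStats(n_r=10_000, s_r=200, block_size=4_000,
                        distinct={"branch_name": 50})
print(account.f_r, account.b_r, account.sc("branch_name"))
```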

Catalogue information about indexes

• Hash tables are considered special indexes

• Average number of pointers in one node (average number of children) of index i: f_i

• Height of tree i: HT_i = ⌈log_{f_i} V(A,r)⌉, or HT_i = 1 in the case of a hash index

• Number of lowest-level index blocks (leaf nodes): LB_i

• Updating the statistics after every modification → expensive

• Updated when the DB has time

• Not always consistent, but gives a good approximation
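A minimal sketch of the height formula, assuming a tree index with average fanout f_i. It uses integer arithmetic instead of a floating-point logarithm to avoid rounding errors at exact powers, and it reports at least one level even for a single value; the function name `index_height` is hypothetical.

```python
def index_height(fanout: int, distinct_values: int, is_hash: bool = False) -> int:
    """Smallest h with fanout**h >= distinct_values; 1 for a hash index."""
    if is_hash:
        return 1
    height, capacity = 1, fanout
    while capacity < distinct_values:
        capacity *= fanout   # one more level multiplies the reachable entries
        height += 1
    return height


print(index_height(20, 8_000))                 # tree index
print(index_height(20, 8_000, is_hash=True))   # hash index
```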

Cost of operations

• Just approximations: reading/writing is assumed to need the same time

Equality selection

• Full scan: b_r

• Binary search:
  – Blocks are sequential on the disk
  – File is ordered by attribute A
  – Just for equality search

• For clustered index on search key: HT_i + 1

• For clustered index not on search key: HT_i +

• For unclustered index: HT_i + SC(A,r)
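The strategies above can be compared numerically. In this sketch the full-scan, clustered-on-key and unclustered costs follow the slides; the binary-search cost and the second term of the clustered non-key cost are not preserved in the transcript, so the commonly used textbook forms are assumed and marked as such in the comments. The function name and the sample numbers are hypothetical.

```python
from math import ceil, log2


def equality_selection_costs(b_r: int, f_r: int, sc: int, ht_i: int) -> dict:
    """Estimated block accesses per strategy for an equality selection."""
    return {
        "full scan": b_r,
        # ASSUMED textbook form: locate the first block, then read the matches
        "binary search": ceil(log2(b_r)) + ceil(sc / f_r) - 1,
        "clustered index, key": ht_i + 1,
        # ASSUMED textbook form: matching records sit in consecutive blocks
        "clustered index, non-key": ht_i + ceil(sc / f_r),
        # one block access per matching record in the worst case
        "unclustered index": ht_i + sc,
    }


costs = equality_selection_costs(b_r=500, f_r=20, sc=200, ht_i=3)
for strategy, cost in costs.items():
    print(f"{strategy}: {cost}")
```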

Range selection

• Selection: σ_{A≤v}(r)
  – If v is unknown: n_r/2
  – If v is known (with uniform distribution): [formula missing from the transcript]

• With clustered index
  – If v is unknown: HT_i + b_r/2
  – If v is known: [formula missing from the transcript], where c is the number of records that fulfill A ≤ v

• With unclustered index: HT_i + LB_i/2 + n_r/2
  – Sometimes it is better to use a full scan
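A sketch of the range estimates, under stated assumptions. The n_r/2 estimate and the unclustered index cost come from the slides; the estimate for a known v is not preserved in the transcript, so the usual uniform-distribution interpolation between the minimum and maximum of A is assumed here. Function names and numbers are hypothetical.

```python
def estimate_matching_rows(n_r, v=None, min_a=None, max_a=None):
    """Estimated number of records with A <= v."""
    if v is None:
        return n_r / 2          # v unknown: assume half of the records match
    if v < min_a:
        return 0.0
    if v >= max_a:
        return float(n_r)
    # ASSUMED uniform-distribution estimate (not preserved in the slides)
    return n_r * (v - min_a) / (max_a - min_a)


def unclustered_range_cost(ht_i, lb_i, n_r):
    # cost from the slides: descend the tree, scan half of the leaves,
    # then fetch half of the records one by one
    return ht_i + lb_i / 2 + n_r / 2


n_r, b_r = 10_000, 500
rows = estimate_matching_rows(n_r, v=2_500, min_a=0, max_a=10_000)
index_cost = unclustered_range_cost(ht_i=3, lb_i=50, n_r=n_r)
use_full_scan = b_r < index_cost
print(rows, index_cost, use_full_scan)
```

With these numbers the full scan (b_r block accesses) is cheaper than the unclustered index, which illustrates the last bullet.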

Types of join

• Operands: relations r and s

• Natural join: r ⋈ s

• Outer join
  – Left join: r ⟕ s
  – Right join: r ⟖ s
  – Full join: r ⟗ s

• Theta join: r ⋈_Θ s

Distinct

• Nested-loop join of relations r and s:

FOR each tr ∈ r DO BEGIN
    FOR each ts ∈ s DO BEGIN
        test whether (tr, ts) fulfills Θ
        IF yes THEN add (tr, ts) to the result
    END
END

• Worst-case cost: n_r * b_s + b_r
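The pseudocode above translates almost line by line into a runnable sketch. Relations are modelled as in-memory lists of tuples and Θ as a Python predicate; the sample relations and the helper `worst_case_cost` are illustrative assumptions.

```python
def nested_loop_join(r, s, theta):
    """Tuple-at-a-time nested-loop join of relations r and s."""
    result = []
    for t_r in r:                 # outer relation
        for t_s in s:             # inner relation, rescanned for every t_r
            if theta(t_r, t_s):
                result.append((t_r, t_s))
    return result


def worst_case_cost(n_r, b_r, b_s):
    # the inner relation is read once per outer record, the outer once
    return n_r * b_s + b_r


depositor = [("Johnson", "A-101"), ("Smith", "A-215"), ("Hayes", "A-102")]
account = [("A-101", 500), ("A-102", 400), ("A-215", 700)]
pairs = nested_loop_join(depositor, account, lambda d, a: d[1] == a[0])
print(len(pairs), worst_case_cost(n_r=5_000, b_r=100, b_s=400))
```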

Nested with blocks

FOR each block Br of r DO BEGIN
    FOR each block Bs of s DO BEGIN
        FOR each tr ∈ Br DO BEGIN
            FOR each ts ∈ Bs DO BEGIN
                test whether (tr, ts) fulfills Θ
                IF yes THEN add (tr, ts) to the result
            END
        END
    END
END

Worst-case cost: b_r * b_s + b_r
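A sketch of the block version, assuming each relation is given as a list of blocks and each block as a list of records. The counter only simulates block reads to show where the b_r * b_s + b_r figure comes from; the function name and the sample data are hypothetical.

```python
def block_nested_loop_join(r_blocks, s_blocks, theta):
    """Block nested-loop join; also returns the number of simulated block reads."""
    result, block_reads = [], 0
    for block_r in r_blocks:
        block_reads += 1                  # read one block of r
        for block_s in s_blocks:
            block_reads += 1              # read one block of s per block of r
            for t_r in block_r:
                for t_s in block_s:
                    if theta(t_r, t_s):
                        result.append((t_r, t_s))
    return result, block_reads


r_blocks = [[1, 2], [3, 4]]               # b_r = 2
s_blocks = [[2, 3], [4, 5], [6, 7]]       # b_s = 3
matches, reads = block_nested_loop_join(r_blocks, s_blocks, lambda a, b: a == b)
print(matches, reads)
```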

Indexed nested-loop join

• If one of the relations is indexed

• No need for a full scan

• Cost: b_r + n_r * c, where c is the cost of a selection on s
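A minimal sketch of the idea, assuming an equality join and using an in-memory dictionary as a stand-in for the index on s. In a real system the index already exists on disk; here it is built first only to keep the example self-contained. All names and data are hypothetical.

```python
from collections import defaultdict


def indexed_nested_loop_join(r, s, r_key, s_key):
    """Join r and s on r[r_key] == s[s_key] by probing an index on s."""
    index = defaultdict(list)             # stand-in for the index on s
    for t_s in s:
        index[t_s[s_key]].append(t_s)
    # one index lookup per record of r instead of a full scan of s
    return [(t_r, t_s) for t_r in r for t_s in index.get(t_r[r_key], [])]


def indexed_join_cost(b_r, n_r, c):
    # cost from the slides: read r once, one selection on s per record of r
    return b_r + n_r * c


depositor = [("Johnson", "A-101"), ("Hayes", "A-102"), ("Smith", "A-999")]
account = [("A-101", "Downtown"), ("A-102", "Perryridge")]
joined = indexed_nested_loop_join(depositor, account, r_key=1, s_key=0)
print(joined, indexed_join_cost(b_r=100, n_r=5_000, c=4))
```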

Merge join

• First sort the relations on the join attributes

• Reading each relation once is enough

• Cost: cost of sorting + b_r + b_s
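A sketch of a merge join on in-memory lists, assuming an equality join. Both inputs are sorted first, then two cursors advance through them once; groups of equal keys are paired with each other. The function name, the key functions and the data are hypothetical.

```python
def merge_join(r, s, key_r, key_s):
    """Sort both inputs on the join key, then merge them in a single pass."""
    r, s = sorted(r, key=key_r), sorted(s, key=key_s)
    result, i, j = [], 0, 0
    while i < len(r) and j < len(s):
        k_r, k_s = key_r(r[i]), key_s(s[j])
        if k_r < k_s:
            i += 1
        elif k_r > k_s:
            j += 1
        else:
            # find the end of the group of s records sharing this key
            j_end = j
            while j_end < len(s) and key_s(s[j_end]) == k_r:
                j_end += 1
            # pair every r record of the group with every s record of the group
            while i < len(r) and key_r(r[i]) == k_r:
                result.extend((r[i], t_s) for t_s in s[j:j_end])
                i += 1
            j = j_end
    return result


r = [(4, "d"), (2, "b"), (1, "a"), (2, "c")]
s = [(3, "z"), (2, "x"), (4, "w"), (2, "y")]
merged = merge_join(r, s, key_r=lambda t: t[0], key_s=lambda t: t[0])
print(merged)
```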

Other operations

• Duplicate elimination (distinct)
  – Sort
  – Delete the repeated records

• Cost: cost of sorting

• Projection: cost of sorting (+ duplicate elimination) + b_r

• Union: sort both relations + merge + duplicate elimination

• Intersection: sort both + select the common rows

• Difference: sort both + delete the rows of the 2nd relation from the 1st
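The sort-based operations listed above can be sketched on in-memory lists. This is an illustration under the assumption that rows are comparable values; the function names are hypothetical, and a real system would sort on disk.

```python
def distinct_sorted(rows):
    """Sort, then drop each row that equals its predecessor."""
    out = []
    for row in sorted(rows):
        if not out or out[-1] != row:
            out.append(row)
    return out


def union(r, s):
    return distinct_sorted(r + s)


def intersection(r, s):
    r, s = distinct_sorted(r), distinct_sorted(s)
    out, i, j = [], 0, 0
    while i < len(r) and j < len(s):      # merge step keeps the common rows
        if r[i] < s[j]:
            i += 1
        elif r[i] > s[j]:
            j += 1
        else:
            out.append(r[i])
            i += 1
            j += 1
    return out


def difference(r, s):
    r, s = distinct_sorted(r), distinct_sorted(s)
    out, j = [], 0
    for row in r:
        while j < len(s) and s[j] < row:  # skip smaller rows of s
            j += 1
        if j == len(s) or s[j] != row:    # keep rows that s does not contain
            out.append(row)
    return out


r, s = [3, 1, 2, 3], [2, 4, 2]
print(distinct_sorted(r), union(r, s), intersection(r, s), difference(r, s))
```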

Evaluation - Materialization

• Tree of operations

• Leaves: relations

• Nodes: operations

• Cost: storing temporary relations + cost of operations

• Parallel processing

Pipelining

• Temporary storage is reduced

• Result records are passed to the next operation and not stored any more

• Saves memory (records are stored, not relations)

• Sorting is not possible

• Demand-driven pipeline: the system requests data when needed

• Data-driven pipeline: operations push data into the pipeline without request until the buffer gets full
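A demand-driven pipeline maps naturally onto Python generators: each operator produces a record only when the next operator asks for one, so no intermediate relation is stored. This is a sketch with hypothetical operator names and sample data, not a description of any particular DBMS.

```python
def scan(relation):
    for record in relation:
        yield record                      # one record at a time


def select(records, predicate):
    for record in records:
        if predicate(record):
            yield record


def project(records, attributes):
    for record in records:
        yield tuple(record[a] for a in attributes)


account = [
    {"account_number": "A-101", "balance": 500},
    {"account_number": "A-102", "balance": 4_000},
    {"account_number": "A-215", "balance": 700},
]

# select balance from account where balance < 2500
plan = project(select(scan(account), lambda rec: rec["balance"] < 2_500),
               ["balance"])
result = list(plan)                       # pulling from the top drives the pipeline
print(result)
```

Nothing is computed until `list(plan)` starts requesting records, which is the demand-driven behaviour; a data-driven pipeline would instead have the producers fill a bounded buffer.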

Pipeline evaluation

• Records arrive one after another

• Merge join cannot be used

• Indexed nested-loop join can be used

Transformation of relational expressions

• Transform to equivalent expressions with smaller evaluation time

• Example: give the names of the customers who have an account in Brooklyn

• Time-consuming (selection after joining 3 tables)

• Much better
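The effect of moving the selection before the join can be seen by counting intermediate rows. The relation names (`branch`, `account`, `depositor`), their columns and the sample rows below are assumptions made for this illustration; the original expressions are not preserved in the transcript.

```python
branch = [("Downtown", "Brooklyn"), ("Perryridge", "Horseneck"),
          ("Brighton", "Brooklyn")]                      # (branch_name, city)
account = [("A-101", "Downtown"), ("A-102", "Perryridge"),
           ("A-201", "Brighton"), ("A-215", "Perryridge")]  # (number, branch_name)
depositor = [("Johnson", "A-101"), ("Hayes", "A-102"),
             ("Smith", "A-201"), ("Smith", "A-215")]     # (customer, number)


def join(left, right, on):
    """Theta join that concatenates the matching tuples."""
    return [l + r for l in left for r in right if on(l, r)]


def join_all(branches):
    with_accounts = join(branches, account, lambda b, a: b[0] == a[1])
    return join(with_accounts, depositor, lambda ba, d: ba[2] == d[1])


# Plan A: join the three relations, select Brooklyn at the end
plan_a_rows = join_all(branch)
plan_a = {row[4] for row in plan_a_rows if row[1] == "Brooklyn"}

# Plan B: select the Brooklyn branches first, then join
plan_b_rows = join_all([b for b in branch if b[1] == "Brooklyn"])
plan_b = {row[4] for row in plan_b_rows}

print(plan_a == plan_b, len(plan_a_rows), len(plan_b_rows))
```

Both plans return the same customers, but the second one carries fewer rows through the joins.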

Equivalence rules

• Predicates: Θ, Θ1, Θ2

• Attributes: L1, L2, L3

• Relational algebra expression: E, E1, E2

• Cascade of selections

• Commutativity of selection

• Cascade of projections

• Connection between join and Cartesian product

• Commutativity of theta join

• Associativity of natural join

• Distributivity of selection over join
  – Θ0 contains attributes only from E1

• Distributivity of projection over theta join
  – L1, L2 contain attributes from E1, E2, and the join condition contains attributes only from L1 ∪ L2
  – L1, L2 contain attributes from E1, E2; L3, L4 contain the attributes of E1, E2 that appear in the join condition but not in L1 ∪ L2

• Commutativity of union and intersection

• Associativity of union and intersection

• Distributivity of selection on union, intersection, and difference

• Distributivity of projection on union

• These are only examples!
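The formulas belonging to the rules named above are not preserved in the transcript. The block below gives their usual textbook forms in the notation of this section (Θ for predicates, L for attribute lists, E for expressions); it is a reconstruction under that assumption, not a copy of the original slides.

```latex
\begin{align*}
\sigma_{\Theta_1 \wedge \Theta_2}(E) &= \sigma_{\Theta_1}(\sigma_{\Theta_2}(E))
  && \text{cascade of selections} \\
\sigma_{\Theta_1}(\sigma_{\Theta_2}(E)) &= \sigma_{\Theta_2}(\sigma_{\Theta_1}(E))
  && \text{commutativity of selection} \\
\pi_{L_1}(\pi_{L_2}(E)) &= \pi_{L_1}(E), \quad L_1 \subseteq L_2
  && \text{cascade of projections} \\
\sigma_{\Theta}(E_1 \times E_2) &= E_1 \bowtie_{\Theta} E_2
  && \text{join and Cartesian product} \\
E_1 \bowtie_{\Theta} E_2 &= E_2 \bowtie_{\Theta} E_1
  && \text{commutativity of theta join} \\
(E_1 \bowtie E_2) \bowtie E_3 &= E_1 \bowtie (E_2 \bowtie E_3)
  && \text{associativity of natural join} \\
\sigma_{\Theta_0}(E_1 \bowtie_{\Theta} E_2) &= \sigma_{\Theta_0}(E_1) \bowtie_{\Theta} E_2
  && \text{selection over join} \\
\pi_{L_1 \cup L_2}(E_1 \bowtie_{\Theta} E_2) &= \pi_{L_1}(E_1) \bowtie_{\Theta} \pi_{L_2}(E_2)
  && \text{projection over theta join} \\
E_1 \cup E_2 &= E_2 \cup E_1
  && \text{commutativity of union} \\
\sigma_{\Theta}(E_1 - E_2) &= \sigma_{\Theta}(E_1) - \sigma_{\Theta}(E_2)
  && \text{selection over difference} \\
\pi_{L}(E_1 \cup E_2) &= \pi_{L}(E_1) \cup \pi_{L}(E_2)
  && \text{projection over union}
\end{align*}
```

The selection-over-join rule holds when Θ0 uses attributes of E1 only, and the projection-over-join rule in this simple form holds when Θ uses attributes of L1 ∪ L2 only.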

Choosing evaluation plan

• Create an algorithm for each expression

• Give an order for the operations

• Organize them into processes

• Example:

[Figure: example evaluation plan tree; surviving annotations: pipeline, pipeline, use index 1, use linear scan, sort to remove duplicates]

Cost-based optimization

• List all the equivalent expressions

• Assign an execution plan to every expression

• Calculate the cost of every plan

• Choose the cheapest (based on approximations and statistics)

• Disadvantage: if there are too many plans, too much has to be calculated in advance

Example

• Joining 3 relations: 6 orderings, each parenthesized in two ways, so 12 plans; in general (2(n-1))! / (n-1)!

• If n = 10, there are more than 17.6 billion plans…

• Solution: use some heuristics

• Consider: first find the optimal join of the first 3 relations, then join the result with the rest

• 12 + 12 plans remain → not good!
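The growth in the number of join orders can be checked directly. The sketch assumes the closed form (2(n-1))! / (n-1)! for the number of join orders of n relations; the function name is hypothetical.

```python
from math import factorial


def join_orders(n: int) -> int:
    """Number of join orders for n relations: (2(n-1))! / (n-1)!"""
    return factorial(2 * (n - 1)) // factorial(n - 1)


for n in (3, 5, 7, 10):
    print(n, join_orders(n))
```

For n = 3 this gives the 12 plans of the example, and for n = 10 it gives 17,643,225,600, which is why exhaustive enumeration is replaced by heuristics.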

Rules for heuristics

• Do the selection at the beginning to reduce the number of rows

• Do the projection as soon as possible to reduce the size of rows

• Split a conjunction of selections into a sequence of selections (apply only one selection at a time)

• Push the selections down the tree

• Use the selection or join that results in the fewest rows → use associativity of join

• If a Cartesian product is followed by a selection, merge them into a join operation: fewer records are generated

• Break up the projection lists and push them down the tree (sometimes new projections can be generated)

• Search for subtrees where pipelining can be applied

1. By applying the rules, several trees are obtained
2. Calculate their costs
3. Apply the cheapest

• The optimization itself adds a cost → optimize it too

• The optimal optimizer optimizes the cost of its own work as well as that of the execution

Thank you for your attention!