Query optimization in relational DBs Leveraging the mathematical formal underpinnings of the...

Query optimization in relational DBs

Leveraging the mathematical formal underpinnings of the relational model.

Pre-processing a query

Convert SQL to algebra

Create a binary parse tree

Assign execution fragments to each interior node

Optimizing a query: stage 1

Manipulate parse tree to create new, but equivalent trees that will be faster independently of the properties of the database

Select a handful that would most likely be the fastest

Stage 2: collect information about the DB’s physical state

Size of tables being manipulated (in both rows and space)

Sorting order of tables

Location of indices on tables

Distribution of PKs and FKs (to estimate hit ratio on joins)

Hit ratio of attributes referenced by 1-operator commands (select and project)

Stage 3: transform parse tree into a tree that is likely to be faster

Heuristics: unary operations Push unary operations down stream Bundle unary operations on common leaves Goal is to perform unary operations together and before

binary operations, to make binary operations faster

Heuristics: joins Most join algorithms tree the two operands differently Make use of sort orders and index locations

Stage 4: choose implementations of binary operations

If both tables sorted by desired attributes, do a merge join

If one table has an index, to a nested join where the inner table is accessed via the index and the outer one is accessed sequentially

Order unary in-line operations so that only necessary attributes are funneled upstream

Stage 4, Notes on joins

If no indices exist and the tables are not sorted the way we want, build an index on one, then do a sort merge or a nested join

If both tables have indices, use the one for the shortest table

If the hit ratio of a join will be very small, do it as early as you can

Stage 5: consider multiple execution plans

Estimate the costs of various execution plans and pick the best one

Balance time spent optimizing versus time spent executing

For one time only transactions, pick a plan fast

Other important issues

Use indices that are fast with respect to pulling in blocks from the disk Wide, but shallow trees, like b-trees

An index on a sorted attribute can be interpolated

Hash indices good for key-based selections, but are not as versatile as b-trees

A note on b-trees

They are dynamic and therefore do not suffer from the migrating artificial primary key problems

Since they are dynamic, the tradeoff is that they are slightly less space efficient and often lead to an extra layer in the tree

Important: manipulating parse trees

There is a sort of algebra of parse tree manipulation

For example, A join B and B join A are equivalent

(A join B) join C = A join (B join C)

If a and b are used as join attributes, we can strip away all but those and the ones that will be chosen by a later projection or selection

More notes on optimization

The smallest parse tree might not be the best Sometimes we add unary operations to make the

operands of binary operations smaller If a unary operation will have a very small hit ratio, do it

early

The cost of a complete execution plan is complex to calculate Statistics of the data size and value distribution Choice of parse tree Choice of execution plan for each tree !! It is not a linear process

In light of modern database applications

Often the cost of optimization is not worth it Many one of a kind transactions Database with very small transaction load, perhaps do to

multiple servers and hard drives Tables are small and queries are simple, and so almost

any execution plan is good

This is a key reason by traditional relational database servers are often far less efficient than no-SQL DBs – because they were built to efficiently run a high volume of similar transactions that involve multiple tables

Making use of parallelism

SQL, compared to various non-declarative languages, is very east to parallelize

But communication costs can be very high if the servers are not co-located

Often, disk arrays are the chosen form of parallelism

Optimization in the real world today

Languages and queries are often not set-based

The code in a query can be very imperative

Binary operations might be in the minority

Databases can be so huge that optimization has to be heavily biased toward certain kinds of queries

Query optimization in relational DBs Leveraging the mathematical formal underpinnings of the...

Documents

Transcript of Query optimization in relational DBs Leveraging the mathematical formal underpinnings of the...