Advanced Databases: Lecture 2 Query Optimization (I) 1 Query Optimization (introduction to query...

Advanced Databases: Lecture 2 Query Optimization (I)

Query Optimization (introduction to query processing)

Advanced Databases

By Dr. Akhtar Ali


What is Optimization

• Best use of resources.
– Good time management
– Effective allocation of lecturers and labs to course units

• Efficient solution to a problem.
– Quick response to a user query

• Less costly.
– Solar energy vs. nuclear vs. hydro-electric power
– Minimum I/O, CPU cycles, memory space


Query Optimization

• A classical component of a DBMS.

• Choosing the best composition of algebraic operators to answer a query.
– A query (e.g. in SQL) may have several alternative representations in algebra.
– The optimizer selects the best algebraic representation it can find.

• Choosing an efficient, low-cost plan to answer a query.
– One that takes less time to compute.
– One with the least cost (in terms of I/Os).

• Why query optimization?
– To make query evaluation faster.
– To reduce the response time of the query processor.
– To let users write queries without knowing the physical access mechanisms and without having to tell the system explicitly how the queries should be evaluated.


Recommended Text

• Database Management Systems, by R. Ramakrishnan, Chapters 12, 13 (copy provided)

• Fundamentals of Database Systems, 3rd Edition, by R. Elmasri and S. B. Navathe, Chapter 18

• An Introduction to Database Systems, 7th Edition, by C. J. Date, Chapter 17


Query Processing – the context

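[Figure: the query processing pipeline]

user/application
  | SQL query
Translator (scanning, parsing, validating; builds a parse tree)
  | Relational Algebra query tree
Logical Optimizer (uses transformations)
  | optimized Relational Algebra query tree
Physical Optimizer (uses a cost model and database statistics from the Catalog)
  | code to execute the query
Runtime Database Engine (reads data from the Database)
  | result of the query
user/application

The Catalog holds the metadata; the Database holds the data.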


Example database schema

• We will use the following schema throughout this lecture:
  Sailors(sid:integer, sname:string, rating:integer, age:real)
  Reserves(sid:integer, bid:integer, day:date, rname:string)

• Consider the following statistics about the relations.
– Each tuple of Reserves is 40 bytes long,
– A data page can hold 100 Reserves tuples,
– The size of the Reserves relation is 1000 pages,
– Each tuple of Sailors is 50 bytes long,
– A data page can hold 80 Sailors tuples, and
– The size of the Sailors relation is 500 pages.
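The tuple counts implied by these statistics follow from pages × tuples per page. A minimal sketch in Python; the dictionary layout and function names are illustrative assumptions, and the counts are upper bounds that assume every page is full:

```python
# Catalog statistics from the slide. The dict layout is an illustrative assumption.
STATS = {
    "Reserves": {"tuple_bytes": 40, "tuples_per_page": 100, "pages": 1000},
    "Sailors": {"tuple_bytes": 50, "tuples_per_page": 80, "pages": 500},
}

def total_tuples(relation: str) -> int:
    """Upper bound on the number of tuples, assuming every page is full."""
    s = STATS[relation]
    return s["pages"] * s["tuples_per_page"]

def bytes_used_per_page(relation: str) -> int:
    """Bytes of tuple data on one full page."""
    s = STATS[relation]
    return s["tuple_bytes"] * s["tuples_per_page"]

for name in STATS:
    print(name, total_tuples(name), "tuples,", bytes_used_per_page(name), "bytes per page")
```

Both relations use 4000 bytes of tuple data per full page, so the two sets of statistics are consistent with a single page size.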


Translating SQL into Relational Algebra

• Once the SQL query has been parsed and found to be syntactically correct, it is mapped onto a Relational Algebra (RA) expression, usually shown as a query tree that is read bottom up.

• Consider the SQL query:
  SELECT S.sname
  FROM Reserves R, Sailors S
  WHERE R.sid = S.sid
    AND R.bid = 100 AND S.rating > 5

The same query in RA:

  π sname (σ bid=100 and rating>5 (Reserves ⋈ sid=sid Sailors))

As a query tree:

          π sname
             |
  σ bid = 100 and rating > 5
             |
         ⋈ sid=sid
          /      \
    Reserves    Sailors
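The bottom-up evaluation of this query tree can be sketched with three small functions over in-memory rows. This is only an illustration: the sample rows are made up, the helper names `join`, `select` and `project` are assumptions, and a real engine works on pages rather than Python lists.

```python
# Tiny in-memory relations. The sample rows are invented for illustration.
reserves = [
    {"sid": 1, "bid": 100, "day": "2024-01-01", "rname": "Joe"},
    {"sid": 2, "bid": 101, "day": "2024-01-02", "rname": "Ann"},
    {"sid": 3, "bid": 100, "day": "2024-01-03", "rname": "Bob"},
]
sailors = [
    {"sid": 1, "sname": "Dustin", "rating": 7, "age": 45.0},
    {"sid": 2, "sname": "Lubber", "rating": 8, "age": 55.5},
    {"sid": 3, "sname": "Rusty", "rating": 3, "age": 35.0},
]

def join(r, s, r_attr, s_attr):
    """Equi-join by nested loops: pair rows whose join attributes match."""
    return [{**x, **y} for x in r for y in s if x[r_attr] == y[s_attr]]

def select(rows, predicate):
    """Selection: keep only the rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]

def project(rows, attrs):
    """Projection: keep only the listed attributes (duplicates are not removed)."""
    return [{a: row[a] for a in attrs} for row in rows]

# π sname (σ bid=100 and rating>5 (Reserves ⋈ sid=sid Sailors)), evaluated bottom up
result = project(
    select(
        join(reserves, sailors, "sid", "sid"),
        lambda t: t["bid"] == 100 and t["rating"] > 5,
    ),
    ["sname"],
)
print(result)  # [{'sname': 'Dustin'}]
```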


Implementation of Relational Operators

• We will discuss how to implement:
– Selection (σ): selects a subset of rows from a relation.
– Projection (π): keeps only the required attributes and removes unwanted attributes from a relation.
– Join (⋈): combines two relations.


Access Paths

• There is usually more than one way to retrieve tuples from a relation, if indexes are available and if the query contains selection conditions.

• The selection condition comes from a select or a join.

• The alternative ways to retrieve tuples from a relation are called access paths.

• An access path is either:
– A file scan (when there is no selection condition or no index can be used).
– An index plus a matching selection condition. For example, attr op value, where op is a comparison operator (<, >, =) and there is an index available on attr.


Implementing Selection operator

• Depends on the available file organization, that is, whether:
– No index is available and the physical file for the relation is unsorted. This is the most expensive case.
– No index is available but the file is sorted on some attribute.
– A B+ tree index is available.
– A hash index is available.

• The cost of the selection operator differs in each of these cases, and that is the main thing to know.


Selection Operator – an Example Query

• Consider the following query:
  SELECT *
  FROM Reserves
  WHERE rname = 'Joe'

• Assume that 100 tuples qualify for the result of this query, that is, 100 tuples have rname = 'Joe'.


Selection using no index & no sorting

• For a general selection query σ R.attr op value (R), we have to scan the entire file to find the qualifying tuples. Note that op can be <, >, =, <>, etc.

• Each tuple is tested to see whether the condition (R.attr op value) holds. If the condition holds, the tuple is added to the result.

• The cost of this approach is M I/Os, where M is the number of pages in R.

• For the example query, the cost is 1000 I/Os because there are 1000 pages in Reserves relation.
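The full scan can be simulated by modelling the file as a list of pages and counting one I/O per page read. This is a sketch under stated assumptions: the toy file, the `(sid, rname)` tuple layout and the function names are invented for illustration.

```python
def build_reserves(num_pages=1000, per_page=100, matches=100):
    """Build a toy unsorted Reserves file as a list of pages of (sid, rname) tuples."""
    total = num_pages * per_page
    step = total // matches  # spread the matching tuples evenly across the file
    return [
        [(n, "Joe" if n % step == 0 else "Other")
         for n in range(p * per_page, (p + 1) * per_page)]
        for p in range(num_pages)
    ]

def file_scan(pages, predicate):
    """Read every page and test every tuple. Returns (matching tuples, page I/Os)."""
    result, ios = [], 0
    for page in pages:
        ios += 1  # one I/O per page, whether or not it holds a match
        result.extend(t for t in page if predicate(t))
    return result, ios

rows, ios = file_scan(build_reserves(), lambda t: t[1] == "Joe")
print(len(rows), ios)  # 100 1000
```

The I/O count equals M regardless of how many tuples qualify, which is why the cost of this access path depends only on the size of the relation.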


Selection using sorting but no index

• For a general selection query σ R.attr op value (R), if R is physically sorted on R.attr, we use a binary search to locate the first qualifying tuple.

• We then keep testing the condition on the tuples in each page scanned, adding them to the result, until the condition fails to hold.

• The cost of this approach is the cost of the binary search plus the number of pages read.
– The cost of binary search = log2 M I/Os
– The cost of retrieving tuples = T I/Os, where T is the number of pages scanned to retrieve the qualifying tuples.

• For the example query, the cost is computed as follows:
– The binary search cost = log2 1000 = log 1000 / log 2 ≈ 9.97, rounded up to 10 I/Os.
– Since there are 100 qualifying tuples and a page holds 100 Reserves tuples, 1 page will hold these tuples, and scanning that page will cost 1 I/O.
– So the total cost is 10 + 1 = 11 I/Os.
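The arithmetic above generalizes to other file sizes and result sizes. A minimal sketch, assuming the qualifying tuples are stored contiguously and start at a page boundary (otherwise one extra page may be read); the function name and parameters are illustrative:

```python
import math

def sorted_file_selection_cost(num_pages, qualifying_tuples, tuples_per_page):
    """Estimated I/Os for a selection on a file sorted on the selection attribute."""
    # Binary search over the pages of the file, rounded up to whole I/Os.
    search_ios = math.ceil(math.log2(num_pages))
    # Pages holding the qualifying tuples, assuming they are contiguous and page-aligned.
    scan_ios = math.ceil(qualifying_tuples / tuples_per_page)
    return search_ios + scan_ios

print(sorted_file_selection_cost(1000, 100, 100))  # 11
```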


B+ tree Index

• A B+ tree index is a balanced tree in which the internal nodes (the top two levels in the figure) direct the search and the leaf nodes contain the data entries.

• Searching for a record requires just a traversal from the root to the appropriate leaf node.

• The length of the path from the root to a leaf is called the height of the tree (usually 2 or 3).

• To search for entry 9*, we follow the leftmost child pointer from the root (as 9 < 10). Then at level two we follow the right child pointer (as 9 > 6). Once at the leaf node, the data entries can be read sequentially.

• Leaf nodes are linked to one another, which makes the index suitable for range queries.

[Figure: example B+ tree]

Root:     [10 | 20]
Level 2:  [6]   [12]   [23 | 35]
Leaves:   [3* 4*] [6* 9*]   [10* 10*] [12* 13*]   [20* 22*] [23* 31*] [35* 36*]
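The root-to-leaf traversal can be sketched on the example tree from the figure. The node layout below is one reading of that figure, and the tuple-based representation and the `search` function are illustrative assumptions rather than how a real index stores its nodes.

```python
from bisect import bisect_right

# Leaves hold the data entries; internal nodes are (keys, child indices).
leaves = [[3, 4], [6, 9], [10, 10], [12, 13], [20, 22], [23, 31], [35, 36]]
level2 = [
    ([6], [0, 1]),          # children are indices into `leaves`
    ([12], [2, 3]),
    ([23, 35], [4, 5, 6]),
]
root = ([10, 20], [0, 1, 2])  # children are indices into `level2`

def search(key):
    """Follow one child pointer per level. Returns (path, leaf, found)."""
    keys, children = root
    i = bisect_right(keys, key)      # keys smaller than keys[0] go to the leftmost child
    keys2, children2 = level2[children[i]]
    j = bisect_right(keys2, key)
    leaf = leaves[children2[j]]
    return [i, j], leaf, key in leaf

print(search(9))  # ([0, 1], [6, 9], True)
```

For key 9 the path is child 0 at the root (9 < 10) and child 1 at level two (9 > 6), matching the walk-through above.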


Selection using B+ tree index

• For a general selection query σ R.attr op value (R), a B+ tree index is the best choice when op is not equality (e.g. <, >). It is also good for the = operator.

• We search the B+ tree to find the first page that contains a qualifying tuple. Assume that the tree index is clustered.

• We then read all the pages that contain qualifying tuples.

• The cost of this approach is equal to the sum of the following:
– The cost of identifying the starting page = 2 or 3 I/Os. We assume 2 I/Os throughout.
– The cost of retrieving tuples = T I/Os, where T is the number of pages scanned to retrieve the qualifying tuples.

• For the example query, the cost is computed as follows:
– Since there are 100 qualifying tuples, 1 page will hold these tuples and scanning that page will cost 1 I/O.
– So the total cost is 2 + 1 = 3 I/Os.


Hash Index

• A function called a hash function is applied to the hash field value (key field) to get the address of the disk page in which the record is stored.

• A bucket is a set of records.

• The directory is an array of size n (4 in the figure); each element is a pointer to a bucket.

• To search for a data entry:
– The hash function is applied to the search field, and the last two bits of its binary form are used to get a number between 0 and 3.
– This number gives the array position that holds the pointer to the desired bucket.
– To locate a record with key field 5 (binary 101), we look at directory element 01 and follow the pointer to the data page (Bucket B).

[Figure: hash index with a directory of size 4]

Directory (global depth 2)    Data pages (each bucket has local depth 2)
00  ->  Bucket A:  4* 12* 32* 16*
01  ->  Bucket B:  1* 5* 21*
10  ->  Bucket C:  10*
11  ->  Bucket D:  15* 7* 19*
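The directory lookup can be sketched with the buckets from the figure. This assumes the hash function is the identity on integer keys, which is consistent with the figure's entries but is an assumption; the names `directory`, `buckets` and `lookup` are illustrative.

```python
GLOBAL_DEPTH = 2  # number of low-order bits used to index the directory

# Bucket contents as read from the figure.
buckets = {
    "A": [4, 12, 32, 16],
    "B": [1, 5, 21],
    "C": [10],
    "D": [15, 7, 19],
}
# Directory slot i corresponds to the bit pattern of i: 00, 01, 10, 11.
directory = ["A", "B", "C", "D"]

def hash_fn(key: int) -> int:
    """Identity hash, assumed here for illustration."""
    return key

def lookup(key: int):
    """Return (bucket name, whether the key is stored in that bucket)."""
    slot = hash_fn(key) & ((1 << GLOBAL_DEPTH) - 1)  # keep the last two bits
    name = directory[slot]
    return name, key in buckets[name]

print(lookup(5))  # ('B', True)
```

Key 5 is 101 in binary, so its last two bits are 01 and the lookup lands in Bucket B, as in the slide's example.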


Selection using Hash Index

• For a general selection query σ R.attr op value (R), a hash index is the best choice when op is equality (=). It is not suitable for other operators (e.g. <, >, <>).

• We retrieve the index page that contains the rids (record identifiers) of the qualifying tuples.

• Then the pages that contain these tuples are scanned.

• The cost of this approach is equal to the sum of the following:
– The cost to retrieve the index page = 1 I/O
– The cost of retrieving tuples = T I/Os, where T is the number of pages scanned to retrieve the qualifying tuples.
– For non-equality operators, T = the number of qualifying tuples.

• For the example query, the cost is computed as follows:
– Since there are 100 qualifying tuples, 1 page will hold these tuples and scanning that page will cost 1 I/O.
– So the total cost is 1 + 1 = 2 I/Os.
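Putting the four access paths side by side for the example query makes the differences concrete. A minimal sketch using the lecture's assumptions (2 I/Os to reach the starting page in the B+ tree, 1 I/O for the hash index page, and all qualifying tuples on one page); the variable names are illustrative:

```python
import math

M = 1000           # pages in Reserves
PER_PAGE = 100     # Reserves tuples per page
QUALIFYING = 100   # tuples with rname = 'Joe'

# Pages that hold the qualifying tuples, assuming they are stored together.
T = math.ceil(QUALIFYING / PER_PAGE)

costs = {
    "file scan (no index, unsorted)": M,
    "sorted file (binary search)": math.ceil(math.log2(M)) + T,
    "clustered B+ tree index": 2 + T,
    "hash index": 1 + T,
}

for path, ios in sorted(costs.items(), key=lambda kv: kv[1]):
    print(f"{path}: {ios} I/Os")

cheapest = min(costs, key=costs.get)
```

For this equality selection the hash index is cheapest at 2 I/Os, followed by the B+ tree at 3, the sorted file at 11 and the full scan at 1000.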


Summary of Lecture 2

• Query Optimization
– What and why

• Query Processing
– The various stages through which a query goes

• Translation of SQL into Relational Algebra
– Internal representation of the query

• Access Paths
– Different paths and ways to get the same data

• Implementation of the Selection Operator
– Different ways of evaluating selection using different access paths
