Advanced Databases: Lecture 2 Query Optimization (I)1
Query Optimization (introduction to query processing)
Advanced Databases
By Dr. Akhtar Ali
What is Optimization
• Best use of resources.
– Good time management
– Effective allocation of lecturers and labs to course units
• Efficient solution to a problem.
– Quick response to a user query
• Less costly.
– Solar energy vs. nuclear vs. hydro-electric power
– Minimum I/O, CPU cycles, memory space
Query Optimization
• A classical component of a DBMS.
• Choosing the best composition of algebraic operators to answer a query.
– A query (e.g. in SQL) may have several alternative representations in algebra.
– The optimizer selects the best possible algebraic representation.
• Choosing an efficient and less costly plan to answer a query.
– One that takes less time to compute.
– One with the least cost (in terms of I/Os).
• Why Query Optimization?
– To make query evaluation faster.
– To reduce the response time of the query processor.
– To allow the user to write queries without being aware of the physical access mechanisms, and without having to dictate explicitly to the system how the queries should be evaluated.
Recommended Text
• Database Management Systems, by R. Ramakrishnan, Chapters 12, 13 (copy provided)
• Fundamentals of Database Systems, 3rd Edition, by R. Elmasri and S. B. Navathe, Chapter 18
• An Introduction to Database Systems, 7th Edition, by C. J. Date, Chapter 17
Query Processing – the context
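[Figure: the query processing pipeline. The user/application submits an SQL query. Scanning, parsing and validating produce a parse tree, and the Translator produces a Relational Algebra query tree. The Logical Optimizer uses transformations to produce an optimized Relational Algebra query tree. The Physical Optimizer uses a cost model, together with database statistics (meta data from the Catalog), to produce code to execute the query. The Runtime Database Engine runs this code against the data in the Database and returns the result of the query.]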
Example database schema
• We will use the following schema throughout this lecture:
Sailors(sid:integer, sname:string, rating:integer, age:real)
Reserves(sid:integer, bid:integer, day:date, rname:string)
• Consider the following statistics about the relations.
– Each tuple of Reserves is 40 bytes long,
– A data page can hold 100 Reserves tuples,
– The size of the Reserves relation is 1000 pages,
– Each tuple of Sailors is 50 bytes long,
– A data page can hold 80 Sailors tuples, and
– The size of the Sailors relation is 500 pages.
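As a quick consistency check on these statistics, the sketch below derives the total number of tuples and the bytes used per page for each relation. The dictionary layout and the function name are illustrative only and are not part of the lecture.

```python
# Statistics copied from the slide; the structure itself is an assumption.
relations = {
    "Reserves": {"tuple_bytes": 40, "tuples_per_page": 100, "pages": 1000},
    "Sailors": {"tuple_bytes": 50, "tuples_per_page": 80, "pages": 500},
}

def summarize(stats):
    """Derive total tuples and bytes of tuple data per page."""
    return {
        "tuples": stats["tuples_per_page"] * stats["pages"],
        "bytes_per_page": stats["tuple_bytes"] * stats["tuples_per_page"],
    }

for name, stats in relations.items():
    print(name, summarize(stats))
```

Both relations come out at 4000 bytes of tuple data per page, with 100,000 Reserves tuples and 40,000 Sailors tuples.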
Translating SQL into Relational Algebra
• After the SQL query has been parsed and found to be syntactically correct, it is mapped onto a Relational Algebra (RA) expression, usually shown as a query tree (evaluated bottom up).
• Consider the SQL query:
SELECT S.sname
FROM Reserves R, Sailors S
WHERE R.sid = S.sid
AND R.bid = 100 AND S.rating > 5
The same query in RA:
π sname (σ bid=100 and rating>5 (Reserves ⋈ sid=sid Sailors))
The same expression as a query tree, from the root down to the leaves:
π sname
σ bid = 100 and rating > 5
⋈ sid=sid
Reserves    Sailors
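To make the bottom-up evaluation of the query tree concrete, here is a minimal sketch that evaluates join, then selection, then projection over tiny in-memory relations. The sample tuples and the helper names `join`, `select` and `project` are assumptions made for illustration; a real engine would not materialize the full join like this.

```python
# Hypothetical sample data; only the attributes the query needs are shown.
reserves = [
    {"sid": 1, "bid": 100},
    {"sid": 2, "bid": 101},
    {"sid": 3, "bid": 100},
]
sailors = [
    {"sid": 1, "sname": "Dustin", "rating": 7},
    {"sid": 2, "sname": "Lubber", "rating": 8},
    {"sid": 3, "sname": "Rusty", "rating": 3},
]

def join(r, s, attr):
    """Equi-join on a shared attribute (naive nested loops)."""
    return [{**x, **y} for x in r for y in s if x[attr] == y[attr]]

def select(rel, pred):
    """Keep only the tuples that satisfy the predicate."""
    return [t for t in rel if pred(t)]

def project(rel, attrs):
    """Keep only the listed attributes (no duplicate elimination)."""
    return [{a: t[a] for a in attrs} for t in rel]

# Evaluate the tree bottom up: join, then selection, then projection.
result = project(
    select(join(reserves, sailors, "sid"),
           lambda t: t["bid"] == 100 and t["rating"] > 5),
    ["sname"],
)
print(result)  # [{'sname': 'Dustin'}]
```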
Implementation of Relational Operators
• We will discuss how to implement:
– Selection (σ): selects a subset of rows from a relation.
– Projection (π): picks only the required attributes and removes unwanted attributes from a relation.
– Join (⋈): combines two relations.
Access Paths
• There is usually more than one way to retrieve tuples from a relation, if indexes are available and if the query contains selection conditions.
• The selection condition comes from a select or a join.
• The alternative ways to retrieve tuples from a relation are called access paths.
• An access path is either:
– A file scan (when there is no selection condition or no index can be used).
– An index plus a matching selection condition. For example, attr op value, where op is an operator (<, >, =), and there is an index available on attr.
Implementing Selection operator
• Depends on the available file organizations, that is, whether we have:
– No index available and the physical file for a given relation is unsorted. This is very expensive.
– No index, but the file is sorted on some attribute.
– A B+ tree index is available.
– A hash index is available.
• The cost of the selection operator differs in each of these cases, and that is the main thing to know.
Selection Operator – an Example Query
• Consider the following query:
SELECT *
FROM Reserves
WHERE rname = 'Joe'
• Assume that 100 tuples qualify for the result of the above query. That is, 100 tuples have rname = 'Joe'.
Selection using no index & no sorting
• For a general selection query, σ R.attr op value (R), we have to scan the entire file to get the qualifying tuples. Note that op can be <, >, =, <>, etc.
• Each tuple is tested to see whether the given condition (R.attr op value) holds. If the condition holds, the tuple is added to the result.
• The cost of this approach is M I/Os, where M is the number of pages in R.
• For the example query, the cost is 1000 I/Os because there are 1000 pages in the Reserves relation.
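The full scan can be sketched as a loop that reads every page once and tests every tuple. The miniature file below (10 pages of 4 tuples) and the function name `file_scan` are assumptions for illustration; the point is that the I/O count equals the number of pages no matter how few tuples qualify.

```python
def file_scan(pages, predicate):
    """Scan every page; return (matching tuples, pages read)."""
    result, io = [], 0
    for page in pages:
        io += 1  # one I/O per page, whether or not anything matches
        result.extend(t for t in page if predicate(t))
    return result, io

# Hypothetical miniature file: 10 pages of 4 (id, rname) tuples each.
pages = [
    [(p * 4 + i, "Joe" if (p * 4 + i) % 7 == 0 else "Ann") for i in range(4)]
    for p in range(10)
]

matches, io = file_scan(pages, lambda t: t[1] == "Joe")
print(len(matches), io)  # 6 matching tuples, 10 page reads
```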
Selection using sorting but no index
• For a general selection query, σ R.attr op value (R), if R is physically sorted on R.attr, we use a binary search to locate the first qualifying tuple.
• We keep testing the condition on the tuples in every page that is scanned, adding them to the result, until the condition fails to hold.
• The cost of this approach is equal to the cost of the binary search plus the number of pages that have been read.
– The cost of binary search = log2 M I/Os
– The cost of retrieving tuples = T I/Os, where T is the number of pages scanned to retrieve the qualifying tuples.
• For the example query, the cost is computed as follows:
– The binary search cost = log2 1000 = log 1000 / log 2 ≈ 9.97, rounded up to 10
– Since there are 100 qualifying tuples and a page holds 100 Reserves tuples, we assume that 1 page holds these tuples, and scanning that page costs 1 I/O.
– So the total cost is 10 + 1 = 11 I/Os.
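The arithmetic above can be checked with a short sketch. The function name `sorted_file_cost` is hypothetical, and the rounding up of log2 M to a whole number of I/Os follows the lecture's example.

```python
import math

def sorted_file_cost(num_pages, matching_pages):
    """Binary search to the first match, then read the matching pages."""
    return math.ceil(math.log2(num_pages)) + matching_pages

print(round(math.log2(1000), 2))   # 9.97
print(sorted_file_cost(1000, 1))   # 11
```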
B+ tree Index
• A B+ tree index is a balanced tree in which the internal nodes (the top two levels) direct the search and the leaf nodes contain the data entries.
• Searching for a record requires just a traversal from the root to the appropriate leaf node.
• The length of the path from the root to a leaf is called the height of the tree (usually 2 or 3).
• To search for entry 9*, we follow the leftmost child pointer from the root (as 9 < 10). Then at level two we follow the right child pointer (as 9 > 6). Once at the leaf node, data entries can be found sequentially.
• Leaf nodes are interconnected, which makes the index suitable for range queries.
[Figure: example B+ tree. Root: 10, 20. Level two: 6 | 12 | 23, 35. Leaves, left to right: 3* 4* | 6* 9* | 10* 10* | 12* 13* | 20* 22* | 23* 31* | 35* 36*.]
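The root-to-leaf traversal can be sketched as follows. The nested-tuple representation and the `search` function are assumptions for illustration, and the tree layout is one reading of the example figure (root 10, 20; level two 6, 12, and 23, 35); it is not a full B+ tree implementation (no inserts, splits or sibling pointers).

```python
from bisect import bisect_right

# Internal node: (keys, children). Leaf: a plain list of data entries.
leaves = [[3, 4], [6, 9], [10, 10], [12, 13], [20, 22], [23, 31], [35, 36]]
tree = ([10, 20], [
    ([6], [leaves[0], leaves[1]]),
    ([12], [leaves[2], leaves[3]]),
    ([23, 35], [leaves[4], leaves[5], leaves[6]]),
])

def search(node, key):
    """Return (leaf, nodes_visited) for the leaf that may hold `key`."""
    visited = 1
    while isinstance(node, tuple):  # internal node: choose a child
        keys, children = node
        # Keys smaller than keys[0] go to child 0, and so on.
        node = children[bisect_right(keys, key)]
        visited += 1
    return node, visited

leaf, visited = search(tree, 9)
print(leaf, visited)  # [6, 9] 3
```

Searching for 9 takes the leftmost pointer at the root (9 < 10) and the right pointer at level two (9 > 6), matching the walk-through above.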
Selection using B+ tree index
• For a general selection query, σ R.attr op value (R), a B+ tree index is best when op is not equality (e.g. <, >). It is also good for the = operator.
• We search the B+ tree to find the first page that contains a qualifying tuple. Assume that the tree index is clustered.
• We then read all those pages that contain the qualifying tuples.
• The cost of this approach is equal to the sum of the following:
– The cost of identifying the starting page = 2 or 3 I/Os. We assume 2 I/Os throughout.
– The cost of retrieving tuples = T I/Os, where T is the number of pages scanned to retrieve the qualifying tuples.
• For the example query, the cost is computed as follows:
– Since there are 100 qualifying tuples and a page holds 100 Reserves tuples, we assume that 1 page holds these tuples, and scanning that page costs 1 I/O.
– So the total cost is 2 + 1 = 3 I/Os.
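With a clustered index, T grows with the number of qualifying tuples, which matters for range selections. The sketch below assumes the qualifying tuples are packed contiguously starting at a page boundary, so that T = ceil(matching tuples / tuples per page); the function name and the `descent_ios` parameter are hypothetical.

```python
import math

def clustered_btree_cost(matching_tuples, tuples_per_page, descent_ios=2):
    """Index descent plus the data pages that hold the matches."""
    data_pages = math.ceil(matching_tuples / tuples_per_page)
    return descent_ios + data_pages

print(clustered_btree_cost(100, 100))    # 3, the lecture's example
print(clustered_btree_cost(1000, 100))   # 12, a wider range selection
```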
Hash Index
• A function called a hash function is applied to the hash field value (key field) to get the address of the disk page in which the record is stored.
• A bucket is a set of records.
• The directory is an array of size n (4 in the figure); each element is a pointer to a bucket.
• To search for a data entry:
– the hash function is applied to the search field, and the last two bits of the binary form of the result are used to get a number between 0 and 3.
– this number gives the array position that holds the pointer to the desired bucket.
– to locate a record with key field 5 (binary 101), we look at directory element 01 and follow the pointer to the data page (Bucket B).
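[Figure: hash index. Directory (global depth 2) with elements 00, 01, 10, 11, each pointing to a data page. Bucket A (local depth 2): 4* 12* 32* 16*. Bucket B (local depth 2): 1* 5* 21*. Bucket C (local depth 2): 10*. Bucket D (local depth 2): 15* 7* 19*.]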
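The directory lookup can be sketched in a few lines. This toy model assumes the hash function is the identity on integer keys, as the key-5 example suggests, and uses the bucket contents from the figure; the names `directory` and `lookup` are illustrative only.

```python
GLOBAL_DEPTH = 2  # number of trailing bits used to index the directory

# Directory element -> (bucket name, data entries), as in the figure.
directory = {
    0b00: ("Bucket A", [4, 12, 32, 16]),
    0b01: ("Bucket B", [1, 5, 21]),
    0b10: ("Bucket C", [10]),
    0b11: ("Bucket D", [15, 7, 19]),
}

def lookup(key):
    """Return (bucket name, whether the key is present in that bucket)."""
    slot = key & ((1 << GLOBAL_DEPTH) - 1)  # keep the last two bits
    name, entries = directory[slot]
    return name, key in entries

print(lookup(5))   # ('Bucket B', True): 5 is binary 101, last bits 01
print(lookup(9))   # ('Bucket B', False): same bucket, but 9 is not stored
```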
Selection using Hash Index
• For a general selection query, σ R.attr op value (R), a hash index is best when op is equality (=). It is not good for non-equality operators (e.g. <, >, <>).
• We retrieve the index page that contains the rids (record identifiers) of the qualifying tuples.
• Then the pages that contain these tuples are scanned.
• The cost of this approach is equal to the sum of the following:
– The cost to retrieve the index page = 1 I/O
– The cost of retrieving tuples = T I/Os, where T is the number of pages scanned to retrieve the qualifying tuples.
– For non-equality operators, T = the number of qualifying tuples.
• For the example query, the cost is computed as follows:
– Since there are 100 qualifying tuples and a page holds 100 Reserves tuples, we assume that 1 page holds these tuples, and scanning that page costs 1 I/O.
– So the total cost is 1 + 1 = 2 I/Os.
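Putting the four access paths side by side for the example query (rname = 'Joe' on Reserves) gives the comparison below. It uses only the figures stated in the lecture (M = 1000 pages, T = 1 page, 2 I/Os for the B+ tree descent, 1 I/O for the hash index page); the dictionary labels are illustrative.

```python
import math

M = 1000  # pages in Reserves
T = 1     # pages holding the 100 qualifying tuples (lecture's assumption)

costs = {
    "file scan (no index, unsorted)": M,
    "sorted file (binary search)": math.ceil(math.log2(M)) + T,
    "clustered B+ tree index": 2 + T,  # 2 I/Os to find the starting page
    "hash index": 1 + T,               # 1 I/O for the index page
}

# Print the cheapest access path first.
for path, ios in sorted(costs.items(), key=lambda kv: kv[1]):
    print(f"{path:32} {ios:5d} I/Os")
```

For this equality selection the hash index is cheapest at 2 I/Os, followed by the B+ tree (3), the sorted file (11) and the full scan (1000).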
Summary of Lecture 2
• Query Optimization
– What and why
• Query Processing
– The various stages through which a query goes
• Translation of SQL into Relational Algebra
– Internal representation of the query
• Access Paths
– Different paths and ways to get the same data
• Implementation of the Selection Operator
– Different ways of evaluating selection using different access paths