H.Lu/HKUST
L05: Query Processing & Tuning Queries
Query Processing -- Principles Tuning Queries
H.Lu/HKUST L04: Physical Database Design (2) -- 2
Query Processing & Optimization
Any high-level query (SQL) on a database must be processed, optimized and executed by the DBMS
The high-level query is scanned, and parsed to check for syntactic correctness
An internal representation of a query is created, which is either a query tree or a query graph
The DBMS then devises an execution strategy for retrieving the result of the query. (An execution strategy is a plan for executing the query by accessing the data, and storing the intermediate results)
The process of choosing one out of the many execution strategies is known as query optimization
H.Lu/HKUST L04: Physical Database Design (2) -- 3
Query Processor & Query Optimizer
A query processor is a module in the DBMS that performs the tasks to process, to optimize, and to generate execution strategy for a high-level query
Queries expressed in SQL can have multiple equivalent relational algebra query expressions. The query optimizer must select the optimal one
H.Lu/HKUST L04: Physical Database Design (2) -- 4
Database Query Processing
Quer y
Quer yResul t s
Par ser
QuerOpt i mi zor
CodeGener at i on
Runt i meSuppor t
Dat abase
St at i st i cs
Dat abase
St or age
I nternalRepresentat i on
QueryExecut i on Pl an
Execut abl eCode
H.Lu/HKUST L04: Physical Database Design (2) -- 5
Relational Operations
We will consider how to implement: Selection : Selects a subset of rows from relation. Projection : Deletes unwanted columns from
relation. Join : Allows us to combine two relations.
Since each operation returns a relation, opreations can be composed!
H.Lu/HKUST L04: Physical Database Design (2) -- 6
Simple Selections
Of the form Size of result approximated as size of R * reduction
factor; we will consider how to estimate reduction factors later.
With no index, unsorted: Must essentially scan the whole relation; cost is |R| (# pages in R).
With an index on selection attribute: Use index to find qualifying data entries, then retrieve corresponding data records. (Hash index useful only for equality selections.)
SELECT *FROM Reserves RWHERE R.rname < ‘C%’
R attr valueop R. ( )
H.Lu/HKUST L04: Physical Database Design (2) -- 7
Using an Index for Selections
Cost depends on #qualifying tuples, and clustering. Cost of finding qualifying data entries (typically small)
plus cost of retrieving records (could be large w/o clustering).
In example, assuming uniform distribution of names, about 10% of tuples qualify (100 pages, 10000 tuples). With a clustered index, cost is little more than 100 I/Os; if unclustered, upto 10000 I/Os!
Important refinement for unclustered indexes: 1. Find qualifying data entries. 2. Sort the rid’s of the data records to be retrieved. 3. Fetch rids in order. This ensures that each data page is
looked at just once (though # of such pages likely to be higher than with clustering).
H.Lu/HKUST L04: Physical Database Design (2) -- 8
Selection - Summary
R.attr op value (R) Size of result = R * selectivity Processing methods
Table scan Index scan
• B+-tree, Hash index• Clustered vs. non-clustered
– Clustered index: Good– Non-clustered index:
• Good for low selectivity• Worse than scan for high selectivity
SELECT *FROM Reserves RWHERE R.rname < ‘C%’
H.Lu/HKUST L04: Physical Database Design (2) -- 9
General Selection Conditions
Such selection conditions are first converted to conjunctive normal form (CNF):
(day<8/9/94 OR bid=5 OR sid=3 ) AND (rname=‘Paul’ OR bid=5 OR sid=3)
We only discuss the case with no ORs (a conjunction of terms of the form attr op value).
An index matches (a conjunction of) terms that involve only attributes in a prefix of the search key. Index on <a, b, c> matches a=5 AND b= 3, but not b=3.
(day<8/9/94 AND rname=‘Paul’) OR bid=5 OR sid=3
H.Lu/HKUST L04: Physical Database Design (2) -- 10
Two Approaches to General Selections
First approach: Find the most selective access path, retrieve tuples using it, and apply any remaining terms that don’t match the index: Most selective access path: An index or file scan that we
estimate will require the fewest page I/Os. Terms that match this index reduce the number of tuples
retrieved; other terms are used to discard some retrieved tuples, but do not affect number of tuples/pages fetched.
Consider day<8/9/94 AND bid=5 AND sid=3. A B+ tree index on day can be used; then, bid=5 and sid=3 must be checked for each retrieved tuple. Similarly, a hash index on <bid, sid> could be used; day < 8/9/94 must then be checked.
H.Lu/HKUST L04: Physical Database Design (2) -- 11
Intersection of Rids
Second approach (if we have 2 or more matching indexes that use Alternatives (2) or (3) for data entries): Get sets of rids of data records using each matching index. Then intersect these sets of rids (we’ll discuss intersection
soon!) Retrieve the records and apply any remaining terms. Consider day<8/9/94 AND bid=5 AND sid=3. If we have
a B+ tree index on day and an index on sid, both using Alternative (2), we can retrieve rids of records satisfying day<8/9/94 using the first, rids of recs satisfying sid=3 using the second, intersect, retrieve records and check bid=5.
H.Lu/HKUST L04: Physical Database Design (2) -- 12
The Projection Operation
An approach based on sorting: Modify Pass 0 of external sort to eliminate unwanted fields.
Thus, runs of about 2B pages are produced, but tuples in runs are smaller than input tuples. (Size ratio depends on # and size of fields that are dropped.)
Modify merging passes to eliminate duplicates. Thus, number of result tuples smaller than input. (Difference depends on # of duplicates.)
Cost: In Pass 0, read original relation (size M), write out same number of smaller tuples. In merging passes, fewer tuples written out in each pass. Using Reserves example, 1000 input pages reduced to 250 in Pass 0 if size ratio is 0.25
SELECT DISTINCT R.sid, R.bidFROM Reserves R
H.Lu/HKUST L04: Physical Database Design (2) -- 13
Projection Based on Hashing
Partitioning phase: Read R using one input buffer. For each tuple, discard unwanted fields, apply hash function h1 to choose one of B-1 output buffers. Result is B-1 partitions (of tuples with no unwanted fields).
2 tuples from different partitions guaranteed to be distinct. Duplicate elimination phase: For each partition, read it and
build an in-memory hash table, using hash fn h2 (<> h1) on all fields, while discarding duplicates. If partition does not fit in memory, can apply hash-based
projection algorithm recursively to this partition. Cost: For partitioning, read R, write out each tuple, but with
fewer fields. This is read in next phase.
H.Lu/HKUST L04: Physical Database Design (2) -- 14
Discussion of Projection
Sort-based approach is the standard; better handling of skew and result is sorted.
If an index on the relation contains all wanted attributes in its search key, can do index-only scan. Apply projection techniques to data entries (much
smaller!) If an ordered (i.e., tree) index contains all wanted attributes as
prefix of search key, can do even better: Retrieve data entries in order (index-only scan), discard
unwanted fields, compare adjacent tuples to check for duplicates.
H.Lu/HKUST L04: Physical Database Design (2) -- 15
Equality Joins With One Join Column
In algebra: R S. Common! Must be carefully optimized.
R S is large; so, R S followed by a selection is inefficient.
Cost metric: # of I/Os. We will ignore output costs.
SELECT *FROM Reserves R1, Sailors S1WHERE R1.sid=S1.sid
H.Lu/HKUST L04: Physical Database Design (2) -- 16
Notations
|R| = number of pages in outer table R ||R|| = number of tuples in outer table R |S| = number of pages in inner table S ||S|| = number of tuples in inner table S M = number of main memory pages allocated
H.Lu/HKUST L04: Physical Database Design (2) -- 17
Example of Join
sid sname rating age22 dustin 7 45.028 yuppy 9 35.031 lubber 8 55.544 guppy 5 35.058 rusty 10 35.0
sid bid day rname
31 101 10/11/96 lubber58 103 11/12/96 dustin
sid sname rating age bid day rname
31 lubber 8 55.5 101 10/11/96 lubber58 rusty 10 35.0 103 11/12/96 dustin
SELECT *FROM Sailors R, Reserve SWHERE R.sid=S.sid
H.Lu/HKUST L04: Physical Database Design (2) -- 18
Schema for Examples
Similar to old schema; rname added for variations. Reserves:
Each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
Sailors: Each tuple is 50 bytes long, 80 tuples per page, 500 pages.
Sailors (sid: integer, sname: string, rating: integer, age: real)Reserves (sid: integer, bid: integer, day: dates, rname: string)
H.Lu/HKUST L04: Physical Database Design (2) -- 19
Simple Nested Loops Join
For each tuple in the outer relation R, we scan the entire inner relation S. Cost: |R| + ||R|| * |S| = 1000 + 100*1000*500 I/Os.
Page-oriented Nested Loops join: For each page of R, get each page of S, and write out matching pairs of tuples <r, s>, where r is in R-page and S is in S-page. Cost: |R| + |R|*|S| = 1000 + 1000*500 If smaller relation (S) is outer, cost = 500 + 500*1000
foreach tuple r in R doforeach tuple s in S do
if ri == sj then add <r, s> to result
H.Lu/HKUST L04: Physical Database Design (2) -- 20
Block Nested Loops Join
InputBuffer
Hash Table for R
(M-2) pages
S
R
OutputBuffer
h( )Read in block by
block
Read in page by
page
R S
Use one page as an input buffer for scanning the inner S, one page as the output buffer, and use all remaining pages to hold ``block’’ of outer R. For each matching
tuple r in R-block, s in S-page, add <r, s> to result.
Then read next R-block, scan S, etc.
H.Lu/HKUST L04: Physical Database Design (2) -- 21
Block Nested Loop Join
Scan inner table S per block of (M – 2) pages of R tuples Each scan costs |S| pages |R| / (M – 2) blocks of R tuples
|R| pages for outer table R Total cost = |R| + |R| / (M – 2) * |S| pages R should be the smaller table
H.Lu/HKUST L04: Physical Database Design (2) -- 22
Examples of Block Nested Loops
Cost: Scan of outer + #outer blocks * scan of inner #outer blocks =no of pages of outer / blocksize
With Reserves (R) as outer, and 100 pages of R: Cost of scanning R is 1000 I/Os; a total of 10 blocks. Per block of R, we scan Sailors (S); 10*500 I/Os. If space for just 90 pages of R, we would scan S 12 times.
With 100-page block of Sailors as outer: Cost of scanning S is 500 I/Os; a total of 5 blocks. Per block of S, we scan Reserves; 5*1000 I/Os.
With sequential reads considered, analysis changes: may be best to divide buffers evenly between R and S.
H.Lu/HKUST L04: Physical Database Design (2) -- 23
Index Nested Loops Join
InputBuffer
S
R
OutputBuffer
Read in page by
page
R S
Use one page as an input buffer for scanning the outer R, one page as the output buffer.
Use all remaining pages to hold index of inner S. For each tuple r in
R- page, probe index of S to find matches, s, add <r, s> to result.
Then read next R-page, repeat.
Index for S
H.Lu/HKUST L04: Physical Database Design (2) -- 24
Index Nested Loop Join
Probe S index for matching S tuples per R tuple Probe hash index: 1.2 I/Os Probe B+ tree: 2-4 I/Os, plus retrieve matching S
tuples: 1 I/O For ||R|| tuples
|R| pages for outer table R Total cost = |R| + ||R|| * index retrieval Better than Block NL join only for small number of
R tuples
H.Lu/HKUST L04: Physical Database Design (2) -- 25
Examples of Index Nested Loops
Hash-index (Alt. 2) on sid of Sailors (as inner): Scan Reserves: 1000 page I/Os, 100*1000 tuples. For each Reserves tuple: 1.2 I/Os to get data entry in
index, plus 1 I/O to get (the exactly one) matching Sailors tuple. Total: 220,000 I/Os.
Hash-index (Alt. 2) on sid of Reserves (as inner): Scan Sailors: 500 page I/Os, 80*500 tuples. For each Sailors tuple: 1.2 I/Os to find index page with
data entries, plus cost of retrieving matching Reserves tuples. Assuming uniform distribution, 2.5 reservations per sailor (100,000 / 40,000). Cost of retrieving them is 1 or 2.5 I/Os depending on whether the index is clustered.
H.Lu/HKUST L04: Physical Database Design (2) -- 26
Sort-Merge Join
Sort R and S on the join column Scan sorted R and S to do a ``merge’’ (on join col.), and
output result tuples. Advance scan of R until current R-tuple >= current S
tuple, then advance scan of S until current S-tuple >= current R tuple; do this until current R tuple = current S tuple.
At this point, all R tuples with same value in Ri (current R group) and all S tuples with same value in Sj (current S group) match; output <r, s> for all pairs of such tuples.
Then resume scanning R and S. R is scanned once; each S group is scanned once per
matching R tuple. (Multiple scans of an S group are likely to find needed pages in buffer.)
H.Lu/HKUST L04: Physical Database Design (2) -- 27
External Sort ( K-Way Merge Sort)
43 26 49 78 65 13 2
3 264 74 9 8 31 65 2
2 1 33 44 65 7 698 2
21 32 43 54 66 87 9
I n p u t D ata
S o r tin g
F ir s t m er g e
S ec o n d m er g e
S o r ted r u n s ( L = 3 )
S o r ted r u n s ( L = 9 )
S o r ted d a ta
H.Lu/HKUST L04: Physical Database Design (2) -- 28
Sort Merge Join
External-sort R: 2 |R| * (logM-1 |R|/M + 1) Split R into |R|/M sorted runs each of size M: 2 |R| Merge up to (M – 1) runs repeatedly logM-1 |R|/M passes, each costing 2 |R|
External-sort S: 2 |S| * (logM-1 |S|/M + 1) Merge matching tuples from sorted R and S: |R| + |S| Total cost = 2 |R| * (logM-1 |R|/M + 1) + 2 |S| * (logM-
1 |S|/M + 1) + |R| + |S| If |R| < M*(M-1), cost = 5 * (|R| + |S|)
H.Lu/HKUST L04: Physical Database Design (2) -- 29
Example of Sort-Merge Join
Cost of merge is M+N, could be M*N (very unlikely!) With 35, 100 or 300 buffer pages, both Reserves and
Sailors can be sorted in 2 passes; total join cost: 7500.
sid sname rating age22 dustin 7 45.028 yuppy 9 35.031 lubber 8 55.544 guppy 5 35.058 rusty 10 35.0
sid bid day rname
28 103 12/4/96 guppy28 103 11/3/96 yuppy31 101 10/10/96 dustin31 102 10/12/96 lubber31 101 10/11/96 lubber58 103 11/12/96 dustin
(BNL cost: 2500 to 15000 I/Os)
H.Lu/HKUST L04: Physical Database Design (2) -- 30
Hash Join – The Principle
S
R
Si
Ri
1
n
i
i iR S R S
Ri Sj If
for i j
H.Lu/HKUST L04: Physical Database Design (2) -- 31
Hash-Join
Partition Phase: Partition both relations using hash fn h1: R tuples in partition i will only match S tuples in partition i.
Joining Phase: Read in a partition of R, hash it using h2 (<> h1!). Scan matching partition of S, search for matches.
Partitionsof R & S
Input bufferfor Si
Hash table for partitionRi (k < M -1 pages)
M main memory buffersDisk
Output buffer
Disk
Join Result
hashfnh2
h2
M main memory buffers DiskDisk
Original Relation OUTPUT
2INPUT
1
hashfunction
h1 M-1
Partitions
1
2
M-1
. . .
H.Lu/HKUST L04: Physical Database Design (2) -- 32
Hash Join Algorithm
Partition phase: 2 (|R| + |S|) Partition table R using hash function h1: 2 |R| Partition table S using hash function h1: 2 |S| R tuples in partition i will match only S tuples in partition i R (M – 1) partitions, each of size |R| / (M – 1)
Join phase: |R| + |S| Read in a partition of R (|R| / (M – 1) < M -1) Hash it using function h2 (<> h1!) Scan corresponding S partition, search for matches
Total cost = 3 (|R| + |S|) pages Condition: M > √f|R|, f ≈ 1.2 to account for hash table
H.Lu/HKUST L04: Physical Database Design (2) -- 33
General Join Conditions
Equalities over several attributes (e.g., R.sid=S.sid AND R.rname=S.sname): For Index NL, build index on <sid, sname> (if S is inner);
or use existing indexes on sid or sname. For Sort-Merge and Hash Join, sort/partition on
combination of the two join columns. Inequality conditions (e.g., R.rname < S.sname):
For Index NL, need (clustered!) B+ tree index.• Range probes on inner; # matches likely to be much higher than
for equality joins. Hash Join, Sort Merge Join not applicable. Block NL quite likely to be the best join method here.
H.Lu/HKUST L04: Physical Database Design (2) -- 34
Impact of Buffering
If several operations are executing concurrently, estimating the number of available buffer pages is guesswork.
Repeated access patterns interact with buffer replacement policy. e.g., Inner relation is scanned repeatedly in Simple Nested
Loop Join. With enough buffer pages to hold inner, replacement policy does not matter. Otherwise, MRU is best, LRU is worst (sequential flooding).
Does replacement policy matter for Block Nested Loops? What about Index Nested Loops? Sort-Merge Join?
H.Lu/HKUST L04: Physical Database Design (2) -- 35
Summary of Join Operator
Simple nested loop: |R| + ||R|| * |S| Block nested loop: |R| + |R| / (M – 2) * |S| Index nested loop: |R| + ||R|| * index retrieval Sort-merge: 2 |R| * (logM-1 |R|/M + 1) + 2 |S| *
(logM-1 |S|/M + 1) + |R| + |S|
Hash Join: 3 * (|R| + |S|) Condition: M > √f|R|
H.Lu/HKUST L04: Physical Database Design (2) -- 36
Overview of Query Processing
Parser QueryOptimizer
Statistics Cost Model
QEPParsed Query
Database
High Level Query Query Result
QueryEvaluator
H.Lu/HKUST L04: Physical Database Design (2) -- 37
Query Optimization
Given: An SQL query joining n tables Dream: Map to most efficient plan Reality: Avoid rotten plans State of the art:
Most optimizers follow System R’s technique Works fine up to about 10 joins
SELECT S.snameFROM Reserves R, Sailors SWHERE R.sid=S.sid AND R.bid=100 AND S.rating>5
Reserves Sailors
sid=sid
bid=100 rating > 5
sname
H.Lu/HKUST L04: Physical Database Design (2) -- 38
Complexity of Query Optimization
Many degrees of freedom Selection: scan versus
(clustered, non-clustered) index
Join: block nested loop, sort-merge, hash
Relative order of the operators
Exponential search space!
Heuristics Push the selections
down Push the projections
down Delay Cartesian
products System R: Only left-
deep trees
BA
C
D
H.Lu/HKUST L04: Physical Database Design (2) -- 39
Selection: - cascade
- commutative
Projection: - cascade
Join: - associative
- commutative
Equivalences in Relational Algebra
c cn c cnR R1 1 ... . . .
c c c cR R1 2 2 1
a a anR R1 1 . . .
R (S T) (R S) T
(R S) (S R)
H.Lu/HKUST L04: Physical Database Design (2) -- 40
Equivalences in Relational Algebra
A projection commutes with a selection that only uses attributes retained by the projection
Selection between attributes of the two arguments of a cross-product converts cross-product to a join
A selection on just attributes of R commutes with join R S (i.e., (R S) (R) S )
Similarly, if a projection follows a join R S, we can `push’ it by retaining only attributes of R (and S) that are needed for the join or are kept by the projection
H.Lu/HKUST L04: Physical Database Design (2) -- 41
System R Optimizer
1. Find all plans for accessing each base table2. For each table
• Save cheapest unordered plan• Save cheapest plan for each interesting order• Discard all others
3. Try all ways of joining pairs of 1-table plans; save cheapest unordered + interesting ordered plans
4. Try all ways of joining 2-table with 1-table5. Combine k-table with 1-table till you have full plan tree6. At the top, to satisfy GROUP BY and ORDER BY
• Use interesting ordered plan• Add a sort node to unordered plan
H.Lu/HKUST L04: Physical Database Design (2) -- 42
Source: Selinger et al, “Access Path Selection in a Relational Database Management System”
H.Lu/HKUST L04: Physical Database Design (2) -- 43
Note: Only branches for NL join are shown here. Additional branches for other join methods (e.g. sort-merge) are not shown.
Source: Selinger et al, “Access Path Selection in a Relational Database Management System”
H.Lu/HKUST L04: Physical Database Design (2) -- 44
What is “Cheapest”?
Need information about the relations and indexes involved Catalogs typically contain at least:
# tuples (NTuples) and # pages (NPages) for each relation. # distinct key values (NKeys) and NPages for each index. Index height, low/high key values (Low/High) for each tree
index. Catalogs updated periodically.
Updating whenever data changes is too expensive; lots of approximation anyway, so slight inconsistency ok.
More detailed information (e.g., histograms of the values in some field) are sometimes stored.
H.Lu/HKUST L04: Physical Database Design (2) -- 45
Estimating Result Size
Consider a query block:
Maximum # tuples in result is the product of the cardinalities of relations in the FROM clause.
Reduction factor (RF) associated with each termi reflects the impact of the term in reducing result size Term col=value has RF 1/NKeys(I) Term col1=col2 has RF 1/MAX(NKeys(I1), NKeys(I2)) Term col>value has RF (High(I)-value)/(High(I)-Low(I))
Result cardinality = Max # tuples * product of all RF’s. Implicit assumption that terms are independent!
SELECT attribute listFROM relation listWHERE term1 AND ... AND termk
H.Lu/HKUST L04: Physical Database Design (2) -- 46
Cost Estimates for Single-Table Plans
Index I on primary key matches selection: Cost is Height(I)+1 for a B+ tree, about 1.2 for hash
index. Clustered index I matching one or more selects:
(NPages(I)+NPages(R)) * product of RF’s of matching selects.
Non-clustered index I matching one or more selects: (NPages(I)+NTuples(R)) * product of RF’s of matching
selects. Sequential scan of file:
NPages(R).
Note: Typically, no duplicate elimination on projections! (Exception: Done on answers if user says DISTINCT.)
H.Lu/HKUST L04: Physical Database Design (2) -- 47
Counting the Costs
With 5 buffers, cost of plan: Scan Reserves (1000) + write temp
T1 (10 pages, if we have 100 boats, uniform distribution)
Scan Sailors (500) + write temp T2 (250 pages, if we have 10 ratings).
Sort T1 (2*10*2), sort T2 (2*250*4), merge (10+250), total=2300
Total: 4060 page I/Os If we used BNL join, join cost =
10+4*250, total cost = 2770 If we ‘push’ projections, T1 has only
sid, T2 only sid and sname: T1 fits in 3 pages, cost of BNL
drops to under 250 pages, total < 2000
Reserves Sailors
sid=sid
bid=100
sname(On-the-fly)
rating > 5(Scan;write to temp T1)
(Scan;write totemp T2)
(Sort-Merge Join)
SELECT S.snameFROM Reserves R, Sailors SWHERE R.sid=S.sid AND R.bid=100 AND S.rating>5
H.Lu/HKUST L04: Physical Database Design (2) -- 48
Exercise Reserves: 100,000 tuples, 100
tuples per page With clustered index on bid of
Reserves, we get 100,000/100 = 1000 tuples on 1000/100 = 10 pages
Join column sid is a key for Sailors - at most one matching tuple
Decision not to push rating>5 before the join is based on availability of sid index on Sailors
Cost: Selection of Reserves tuples (10 I/Os); for each tuple, must get matching Sailors tuple (1000*1.2); total 1210 I/Os
Reserves
Sailors
sid=sid
bid=100
sname(On-the-fly)
rating > 5
(Use clustered index on sid)
(Index Nested Loops,with pipelining )
(On-the-fly)
(Use hashIndex on sid)
H.Lu/HKUST
L05: Query Processing & Tuning Queries
Query Processing – Principles
Tuning Queries
H.Lu/HKUST L04: Physical Database Design (2) -- 50
Avoid Redundant DISTINCT
DISTINCT usually entails a sort operation Slow down query optimization because one more
“interesting” order to consider Remove if you know the result has no duplicates
SELECT DISTINCT ssnumFROM EmployeeWHERE dept = ‘information systems’
H.Lu/HKUST L04: Physical Database Design (2) -- 51
Change Nested Queries to Join
Might not use index on Employee.dept
Need DISTINCT if an employee might belong to multiple departments
SELECT ssnumFROM EmployeeWHERE dept IN (SELECT dept FROM Techdept)
SELECT ssnumFROM Employee, TechdeptWHERE Employee.dept = Techdept.dept
H.Lu/HKUST L04: Physical Database Design (2) -- 52
Avoid Unnecessary Temp Tables
Creating temp table causes update to catalog Cannot use any index on original table
SELECT * INTO TempFROM EmployeeWHERE salary > 40000
SELECT ssnumFROM TempWHERE Temp.dept = ‘information systems’
SELECT ssnumFROM EmployeeWHERE Employee.dept = ‘information systems’AND salary > 40000
H.Lu/HKUST L04: Physical Database Design (2) -- 53
Avoid Complicated Correlation Subqueries
Search all of e2 for each e1 record!
SELECT ssnumFROM Employee e1WHERE salary = (SELECT MAX(salary) FROM Employee e2 WHERE e2.dept = e1.dept
SELECT MAX(salary) as bigsalary, dept INTO TempFROM EmployeeGROUP BY dept
SELECT ssnumFROM Employee, TempWHERE salary = bigsalaryAND Employee.dept = Temp.dept
H.Lu/HKUST L04: Physical Database Design (2) -- 54
Avoid Complicated Correlation Subqueries
SQL Server 2000 does a good job at handling the correlated subqueries (a hash join is used as opposed to a nested loop between query blocks) The techniques
implemented in SQL Server 2000 are described in “Orthogonal Optimization of Subqueries and Aggregates” by C.Galindo-Legaria and M.Joshi, SIGMOD 2001.-10
0
10
20
30
40
50
60
70
80
correlated subquery
Th
rou
gh
pu
t im
pro
vem
ent p
erce
nt
SQLServer 2000
Oracle 8i
DB2 V7.1
> 10000> 1000
H.Lu/HKUST L04: Physical Database Design (2) -- 55
Join on Clustering and Integer Attributes
Employee is clustered on ssnum ssnum is an integer
SELECT Employee.ssnumFROM Employee, StudentWHERE Employee.name = Student.name
SELECT Employee.ssnumFROM Employee, StudentWHERE Employee.ssnum = Student.ssnum
H.Lu/HKUST L04: Physical Database Design (2) -- 56
Avoid HAVING when WHERE is enough
May first perform grouping for all departments!
SELECT AVG(salary) as avgsalary, deptFROM EmployeeGROUP BY deptHAVING dept = ‘information systems’
SELECT AVG(salary) as avgsalaryFROM EmployeeWHERE dept = ‘information systems’GROUP BY dept
H.Lu/HKUST L04: Physical Database Design (2) -- 57
Avoid Views with unnecessary Joins
Join with Techdept unnecessarily
CREATE VIEW TechlocationAS SELECT ssnum, Techdept.dept, locationFROM Employee, TechdeptWHERE Employee.dept = Techdept.dept
SELECT deptFROM TechlocationWHERE ssnum = 4444
SELECT deptFROM EmployeeWHERE ssnum = 4444
H.Lu/HKUST L04: Physical Database Design (2) -- 58
Aggregate Maintenance
Materialize an aggregate if needed “frequently” Use trigger to update
create trigger updateVendorOutstanding on orders for insert asupdate vendorOutstandingset amount =
(select vendorOutstanding.amount+sum(inserted.quantity*item.price)from inserted,itemwhere inserted.itemnum = item.itemnum)
where vendor = (select vendor from inserted) ;
H.Lu/HKUST L04: Physical Database Design (2) -- 59
Avoid External Loops
No loop:sqlStmt = “select * from lineitem where l_partkey <=
200;”odbc->prepareStmt(sqlStmt);odbc->execPrepared(sqlStmt);
Loop:sqlStmt = “select * from lineitem where l_partkey = ?;”odbc->prepareStmt(sqlStmt);for (int i=1; i<200; i++){
odbc->bindParameter(1, SQL_INTEGER, i);odbc->execPrepared(sqlStmt);
}
H.Lu/HKUST L04: Physical Database Design (2) -- 60
Avoid External Loops
SQL Server 2000 on Windows 2000
Crossing the application interface has a significant impact on performance
0
100
200
300
400
500
600
loop no loop
thro
ug
hp
ut
(re
co
rds
/se
c)
Let the DBMS optimizeset operations
H.Lu/HKUST L04: Physical Database Design (2) -- 61
Avoid Cursors
No cursorselect * from employees;
CursorDECLARE d_cursor CURSOR FOR select * from employees;OPEN d_cursorwhile (@@FETCH_STATUS = 0)BEGIN
FETCH NEXT from d_cursorENDCLOSE d_cursorgo
H.Lu/HKUST L04: Physical Database Design (2) -- 62
Avoid Cursors
SQL Server 2000 on Windows 2000
Response time is a few seconds with a SQL query and more than an hour iterating over a cursor
0
1000
2000
3000
4000
5000
cursor SQL
Th
rou
gh
pu
t (r
eco
rds/
sec)
H.Lu/HKUST L04: Physical Database Design (2) -- 63
Retrieve Needed Columns Only
AllSelect * from lineitem;
Covered subsetSelect l_orderkey, l_partkey, l_suppkey, l_shipdate, l_commitdate from lineitem;
Avoid transferring unnecessary data
May enable use of a covering index.
0
0.25
0.5
0.75
1
1.25
1.5
1.75
no index index
Th
rou
gh
pu
t (q
uer
ies/
mse
c)
all
covered subset
H.Lu/HKUST L04: Physical Database Design (2) -- 64
Use Direct Path for Bulk Loading
sqlldr directpath=true control=load_lineitem.ctl data=E:\Data\lineitem.tbl
load data infile "lineitem.tbl"into table LINEITEM appendfields terminated by '|' (
L_ORDERKEY, L_PARTKEY, L_SUPPKEY, L_LINENUMBER, L_QUANTITY, L_EXTENDEDPRICE, L_DISCOUNT, L_TAX, L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE DATE "YYYY-MM-DD", L_COMMITDATE DATE "YYYY-MM-DD", L_RECEIPTDATE DATE "YYYY-MM-DD", L_SHIPINSTRUCT, L_SHIPMODE, L_COMMENT
)
H.Lu/HKUST L04: Physical Database Design (2) -- 65
Use Direct Path for Bulk Loading
Direct path loading bypasses the query engine and the storage manager. It is orders of magnitude faster than for conventional bulk load (commit every 100 records) and inserts (commit for each record).
650
10000
20000
30000
40000
50000
conventional direct path insert
Th
rou
gh
pu
t (r
ec
/se
c)
H.Lu/HKUST L04: Physical Database Design (2) -- 66
Some Idiosyncrasies
OR may stop the index being used break the query and use UNION
Order of tables may affect join implementation
H.Lu/HKUST L04: Physical Database Design (2) -- 67
Query Tuning Principles
Avoid redundant DISTINCT Change nested queries to join Avoid unnecessary temp tables Avoid complicated correlation subqueries Join on clustering and integer attributes Avoid HAVING when WHERE is enough Avoid views with unnecessary joins Maintain frequently used aggregates Avoid external loops
H.Lu/HKUST L04: Physical Database Design (2) -- 68
Query Tuning Principles
Avoid cursors Retrieve needed columns only Use direct path for bulk loading
Top Related