1 Optimization - Selection. 2 The Selection Operation Table: Reserves(sid, bid, day, agent) A page...

1 Optimization - Selection
  • date post

    21-Dec-2015

Optimization - Selection

The Selection Operation

• Table: Reserves(sid, bid, day, agent)
• A page (block) can hold 100 Reserves tuples
• There are 1,000 pages
• Compute the query:

SELECT *
FROM Reserves R
WHERE R.agent = 'Joe'

General Problem

• Compute: σ_{A op c}(R), where
– A is an attribute of R
– op is an operator, such as =, <, etc.
– c is a constant

• We assume that there are M pages in R.

No Index, Unsorted Data

• We can compute the selection by scanning the entire relation.

Cost: M

(For the query on Reserves, Cost = 1,000)

No Index, Sorted Data

• Suppose that the file in which R is stored is physically sorted on A.

• Then, we can use binary search to find the first tuple matching the search condition.

• Then, scan to find all additional tuples that match the condition.

Cost: log2(M) + #pages with tuples from the result

(For the query on Reserves: 10 + #pages with tuples from the result)
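As a rough check of the two no-index costs, the sketch below computes the I/O count for a full scan and for binary search on a sorted file. The function names are illustrative, and rounding log2(M) up to a whole number of page reads is an assumption.

```python
import math


def scan_cost(num_pages: int) -> int:
    """Full scan: every page of the relation is read once."""
    return num_pages


def sorted_search_cost(num_pages: int, result_pages: int) -> int:
    """Binary search to the first match, then read the pages holding the result.

    Assumption: log2(M) is rounded up to a whole number of page reads.
    """
    return math.ceil(math.log2(num_pages)) + result_pages


M = 1000  # pages in Reserves
print(scan_cost(M))                            # 1000
print(sorted_search_cost(M, result_pages=1))   # 10 + 1 = 11
```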

B+ Tree Index

• Suppose that there is a B+ Tree Index available on attribute A of R.

• Search the tree to find the first index entry that points to a tuple satisfying the condition.

• Scan leaf pages of index to find all entries in which key value satisfies the condition.

• Retrieve the satisfying tuples from the file.

Example

• Selection condition: agent < 'C%'
• Assume that names are uniformly distributed with respect to first letter
• About 2/26 ≈ 10% of the tuples match the condition: 10,000 tuples = 100 pages
• If the B+ Tree is clustered, we can return them in 100 I/Os (plus a few to traverse the tree)
• Otherwise, up to 10,000 I/Os may be needed!
– Can you improve the number of I/Os needed?
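One common answer to the question above is to collect the matching record ids from the index first, sort them by page id, and fetch each data page only once. The sketch below compares the two fetch orders on made-up data; the random placement of tuples and the (page, slot) representation of a record id are assumptions for illustration only.

```python
import random


def ios_in_index_order(rids):
    """Worst case for an unclustered index: one page fetch per matching entry."""
    return len(rids)


def ios_sorted_by_page(rids):
    """Sort the rids by page id first, so that each data page is fetched once."""
    return len({page for page, _slot in rids})


random.seed(0)
# Hypothetical data: 10,000 matching tuples scattered over the 1,000 pages of Reserves.
rids = [(random.randrange(1000), random.randrange(100)) for _ in range(10_000)]

print(ios_in_index_order(rids))   # 10000
print(ios_sorted_by_page(rids))   # at most 1000, the number of pages in the file
```

With this trick the unclustered case can never cost more than reading the whole file, although it is still far from the 100 I/Os of the clustered case.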

Hash Index, Equality Selection

• Find the appropriate bucket (about 1-2 I/Os)
• Retrieve the qualifying tuples from R
– Time depends on whether the index is clustered.
• Example: Consider the condition agent = 'Joe'. Suppose there is an unclustered hash index on agent. Suppose there are 100 reservations made by Joe.
• Cost: Could be between 1 and 100 I/Os to retrieve the tuples (in addition to finding the bucket).
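The range in the example can be reproduced with a small helper. This is only a sketch: the function name and the `bucket_ios` parameter are hypothetical, the best case assumes all matches are packed onto as few pages as possible, and the worst case assumes every match sits on a different page.

```python
import math


def hash_selection_cost_range(matches, tuples_per_page, bucket_ios=0):
    """Return (best, worst) I/O counts for an equality selection via a hash index."""
    best = bucket_ios + math.ceil(matches / tuples_per_page)  # matches packed together
    worst = bucket_ios + matches                              # one page per match
    return best, worst


# 100 reservations by Joe, 100 tuples per page, ignoring the bucket lookup
print(hash_selection_cost_range(100, 100))                # (1, 100)
# same, counting 1 I/O to find the bucket
print(hash_selection_cost_range(100, 100, bucket_ios=1))  # (2, 101)
```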

Optimization - Join

The Join Operation

• Compute the query:

SELECT *

FROM Reserves R, Sailors S

WHERE R.sid = S.sid

     Number of Pages    Tuples per Page
R    1000 (M)           100 (pR)
S    500 (N)            80 (pS)

Block Nested Loops Join

• Suppose there are B buffer pages

foreach block of B-2 pages of R do
    foreach page of S do {
        for all matching in-memory pairs r, s:
            add <r,s> to result
    }
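The loop structure above can be sketched in Python as follows. Relations are modelled as lists of pages, each page being a list of tuples; this representation, the function name, and the use of a dictionary to find matching in-memory pairs are all assumptions made for the illustration. The function also counts how many S pages are read.

```python
def block_nested_loops_join(r_pages, s_pages, buffer_pages, r_key, s_key):
    """Join two paged relations; returns (result pairs, number of S page reads)."""
    block_size = buffer_pages - 2  # one buffer page for S input, one for output
    if block_size < 1:
        raise ValueError("need at least 3 buffer pages")
    result = []
    s_page_reads = 0
    for start in range(0, len(r_pages), block_size):
        block = r_pages[start:start + block_size]
        # Index the in-memory block of R by join key to find matches quickly.
        table = {}
        for page in block:
            for r in page:
                table.setdefault(r[r_key], []).append(r)
        for s_page in s_pages:  # S is scanned once per block of R
            s_page_reads += 1
            for s in s_page:
                for r in table.get(s[s_key], []):
                    result.append((r, s))
    return result, s_page_reads


# Tiny example: 3 pages of R, 2 pages of S, B = 3 (so blocks of 1 page)
r_pages = [[(28, 103)], [(31, 101)], [(58, 103)]]
s_pages = [[(22, "dustin"), (28, "yuppy")], [(31, "lubber"), (58, "rusty")]]
pairs, s_reads = block_nested_loops_join(r_pages, s_pages, 3, r_key=0, s_key=0)
print(len(pairs), s_reads)  # 3 6
```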

Cost Analysis

• R is read once: M
• S is read ceil(M/(B-2)) times: ceil(M/(B-2))*N
• Total: M + ceil(M/(B-2))*N
• For our Sailors, Reserves example (B = 102):
• Reserves is the outer relation:
– Cost = 1,000 + (1,000/100)*500 = 6,000 I/Os
• Sailors is the outer relation:
– Cost = 500 + (500/100)*1,000 = 5,500 I/Os
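The formula can be checked directly. The helper below is a minimal sketch (the name `bnl_cost` is illustrative) that reproduces both figures for B = 102.

```python
import math


def bnl_cost(outer_pages, inner_pages, buffer_pages):
    """I/O cost of block nested loops: outer read once, inner read once per block."""
    blocks = math.ceil(outer_pages / (buffer_pages - 2))
    return outer_pages + blocks * inner_pages


print(bnl_cost(1000, 500, 102))  # Reserves outer: 6000
print(bnl_cost(500, 1000, 102))  # Sailors outer: 5500
```

Putting the smaller relation on the outside reduces the number of blocks, and therefore the number of times the inner relation is scanned.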

Index Nested Loops Join

• Suppose there is an index on the join attribute of S

• We find the tuple s using the index!

foreach tuple r of R
    foreach tuple s of S where ri=sj
        add <r,s> to result
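A minimal in-memory sketch of the same idea follows. A Python dictionary stands in for the index on the join attribute of S; the function names and the tuple layout are assumptions for illustration, not part of the slides.

```python
from collections import defaultdict


def build_hash_index(tuples, key):
    """Stand-in for an index on S: maps a join-key value to the matching tuples."""
    index = defaultdict(list)
    for t in tuples:
        index[t[key]].append(t)
    return index


def index_nested_loops_join(outer, inner_index, outer_key):
    """For each outer tuple r, look up the matching s tuples through the index."""
    for r in outer:
        for s in inner_index.get(r[outer_key], []):
            yield (r, s)


sailors = [(22, "dustin"), (28, "yuppy"), (31, "lubber"), (58, "rusty")]
reserves = [(28, 103, "Joe"), (31, 101, "Joe"), (58, 103, "Frank")]

sid_index = build_hash_index(sailors, key=0)
joined = list(index_nested_loops_join(reserves, sid_index, outer_key=0))
print(len(joined))  # 3
```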

Cost Analysis

1. If index is B+ Tree, about 2-4 I/Os to find leaf. If index is Hash index, about 1-2 I/Os to find bucket

2. Retrieve tuples. Time depends on whether index is clustered. If so, cost is usually 1 I/O per r tuple. Otherwise, could be 1 I/O per matching s tuple

Example (1)

• Hash index on sid of Sailors. (about 1.2 I/Os to find bucket)

• sid is a key in Sailors, so there is at most one matching tuple (actually, exactly 1: why?)

• Scanning Reserves: 1,000
• There are 100 * 1,000 = 100,000 tuples in Reserves.
• For each tuple, search index (1.2) and retrieve page (1)
• Total time: 1,000 + 100,000 * 2.2 = 221,000 I/Os

Example

• Hash index on sid of Reserves.
• Scanning Sailors: 500
• There are 80 * 500 = 40,000 tuples in Sailors
• For each tuple, search index (1.2) and retrieve page (??)
• There are 100,000 reservations for 40,000 sailors. Assuming uniform distribution on reservations, each sailor tuple matches about 2.5 reserves tuples. If index is unclustered, they may be on different pages
• Total: 500 + 40,000*(1.2 + 2.5) = 148,500 I/Os
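Both index nested loops examples follow the same pattern: scan the outer relation, then pay an index probe plus some page fetches per outer tuple. The sketch below encodes that pattern; the function name and parameter names are illustrative, and the result is rounded to avoid floating-point noise from values such as 1.2.

```python
def inl_cost(outer_pages, tuples_per_page, probe_ios, fetch_ios):
    """I/O cost of index nested loops.

    probe_ios: I/Os to search the index for one outer tuple
    fetch_ios: I/Os to retrieve the matching inner tuples for one outer tuple
    """
    outer_tuples = outer_pages * tuples_per_page
    return round(outer_pages + outer_tuples * (probe_ios + fetch_ios))


# Reserves outer, hash index on Sailors.sid (exactly one match per tuple)
print(inl_cost(1000, 100, probe_ios=1.2, fetch_ios=1))    # 221000
# Sailors outer, unclustered hash index on Reserves.sid (about 2.5 matches per tuple)
print(inl_cost(500, 80, probe_ios=1.2, fetch_ios=2.5))    # 148500
```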

Sort-Merge Join

• Sort both relations on join attribute.
• This creates "partitions" according to the join attributes.
• Join relations while merging them. Tuples in corresponding partitions are joined.
• Cost depends on whether partitions are large and therefore, are scanned multiple times.

Sailors

sid   sname    rating   age
22    dustin   7        45
28    yuppy    9        35
31    lubber   8        55
36    lubber   6        36
44    guppy    5        35
58    rusty    10       35

Reserves

sid   bid   day       agent
28    103   12/4/96   Joe
28    103   11/3/96   Frank
31    101   10/2/96   Joe
31    102   12/7/96   Sam
31    101   13/7/96   Sam
58    103   22/6/96   Frank
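To make the merge step concrete, the sketch below runs a sort-merge join over the two sample tables. It is an in-memory illustration only: the function name and tuple layout are assumptions, and Python's built-in sort stands in for an external sort.

```python
def sort_merge_join(r, s, r_key, s_key):
    """Sort both inputs on the join key, then merge and join equal-key partitions."""
    r = sorted(r, key=lambda t: t[r_key])
    s = sorted(s, key=lambda t: t[s_key])
    result = []
    i = j = 0
    while i < len(r) and j < len(s):
        if r[i][r_key] < s[j][s_key]:
            i += 1
        elif r[i][r_key] > s[j][s_key]:
            j += 1
        else:
            key = r[i][r_key]
            # Find the end of the partition with this key on each side.
            i_end = i
            while i_end < len(r) and r[i_end][r_key] == key:
                i_end += 1
            j_end = j
            while j_end < len(s) and s[j_end][s_key] == key:
                j_end += 1
            # Join every pair of tuples in the two matching partitions.
            for rt in r[i:i_end]:
                for st in s[j:j_end]:
                    result.append((rt, st))
            i, j = i_end, j_end
    return result


sailors = [(22, "dustin", 7, 45), (28, "yuppy", 9, 35), (31, "lubber", 8, 55),
           (36, "lubber", 6, 36), (44, "guppy", 5, 35), (58, "rusty", 10, 35)]
reserves = [(28, 103, "12/4/96", "Joe"), (28, 103, "11/3/96", "Frank"),
            (31, 101, "10/2/96", "Joe"), (31, 102, "12/7/96", "Sam"),
            (31, 101, "13/7/96", "Sam"), (58, 103, "22/6/96", "Frank")]

joined = sort_merge_join(reserves, sailors, r_key=0, s_key=0)
print(len(joined))  # 6: every reservation matches exactly one sailor
```

Because sid is a key of Sailors, each Sailors partition holds a single tuple here, so no partition is scanned more than once.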

Cost Analysis

• Sort R: O(M log M)
• Sort S: O(N log N)
• Merge: M + N (if partitions aren't scanned multiple times; otherwise, worst case is M*N!)

Cost: O(M + N + M log M + N log N)

Example

• Sort Reserves: 1,000 log 1,000 ≈ 10,000 (actually better with a good algorithm for external sorting)
• Sort Sailors: 500 log 500 ≈ 4,500
• Merge: 1,000 + 500 = 1,500
• Total: 1,500 + 10,000 + 4,500 = 16,000 (actually about 7,500 if sorting is done well)
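The two totals can be reproduced with a short calculation. The first function uses the rough M log2 M estimate from the example. The second is one reading that is consistent with the 7,500 figure: it assumes each relation is sorted in two passes of an external merge sort, with every pass reading and writing each page once. That two-pass assumption is not stated in the slides.

```python
import math


def smj_cost_rough(m, n):
    """Sort cost estimated as M*log2(M) + N*log2(N), plus one merge pass."""
    return round(m * math.log2(m)) + round(n * math.log2(n)) + m + n


def smj_cost_two_pass(m, n):
    """Assumed two-pass external sort: 2 passes * (read + write), plus one merge pass."""
    sort_cost = 2 * 2 * (m + n)
    merge_cost = m + n
    return sort_cost + merge_cost


print(smj_cost_rough(1000, 500))     # close to the rounded 16,000 above
print(smj_cost_two_pass(1000, 500))  # 7500
```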

Hash Join

//Partition R into k partitions
foreach tuple r in R do //flush when fills
    read r and add it to buffer page h(ri)

//Partition S into k partitions
foreach tuple s in S do //flush when fills
    read s and add it to buffer page h(sj)

for l = 1..k
    //Build in-memory hash table for Rl using h2
    foreach tuple r in Rl do
        read r and insert into hash table with h2
    //Probe the table with the tuples of Sl
    foreach tuple s in Sl do
        read s and probe table using h2
        output matching pairs <r,s>
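The partition and probe phases can be sketched in Python as below. Everything stays in memory, so this only illustrates the control flow, not the I/O behaviour. The function name is illustrative; `hash(key) % k` plays the role of h, and a Python dictionary plays the role of the second hash function h2.

```python
def partitioned_hash_join(r, s, r_key, s_key, k):
    """Partition both inputs with h, then build and probe one partition at a time."""
    r_parts = [[] for _ in range(k)]
    s_parts = [[] for _ in range(k)]
    for t in r:  # partition R with h
        r_parts[hash(t[r_key]) % k].append(t)
    for t in s:  # partition S with the same h
        s_parts[hash(t[s_key]) % k].append(t)

    result = []
    for r_part, s_part in zip(r_parts, s_parts):
        table = {}  # in-memory hash table for this partition of R (h2)
        for rt in r_part:
            table.setdefault(rt[r_key], []).append(rt)
        for st in s_part:  # probe with the matching partition of S
            for rt in table.get(st[s_key], []):
                result.append((rt, st))
    return result


reserves = [(28, "Joe"), (28, "Frank"), (31, "Joe"), (31, "Sam"), (58, "Frank")]
sailors = [(22, "dustin"), (28, "yuppy"), (31, "lubber"), (58, "rusty")]
joined = partitioned_hash_join(reserves, sailors, r_key=0, s_key=0, k=3)
print(len(joined))  # 5
```

Tuples with equal join keys always land in partitions with the same number, which is why only corresponding partitions need to be compared.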

Cost Analysis

• Partition phase - Read R, S once and write them once: 2(M + N)

• In the second phase, we can read each partition once, assuming that it fits into memory: M + N

Cost: 3(M + N)

• In our example: 3(1,000 + 500) = 4,500 I/Os
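For comparison, the sketch below recomputes the hash join cost and sets it next to the best figures quoted earlier for the other join methods on the same Reserves and Sailors example. The dictionary labels are informal, and the non-hash figures are simply copied from the preceding examples.

```python
def hash_join_cost(m, n):
    """Partition phase reads and writes both relations; probe phase reads them once."""
    partition = 2 * (m + n)
    probe = m + n
    return partition + probe


costs = {
    "block nested loops (Sailors outer, B = 102)": 5_500,
    "index nested loops (hash index on Reserves.sid)": 148_500,
    "sort-merge (sorting done well)": 7_500,
    "hash join": hash_join_cost(1000, 500),
}

for method, ios in sorted(costs.items(), key=lambda item: item[1]):
    print(f"{ios:>8,} I/Os  {method}")
```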

Estimating Result Sizes

Picking a Query Plan

• Suppose we want to find the natural join of: Reserves, Sailors, Boats.

• The 2 options that appear the best are (ignoring the order within a single join):

(Sailors ⋈ Reserves) ⋈ Boats

Sailors ⋈ (Reserves ⋈ Boats)

• We would like intermediate results to be as small as possible. Which is better?

Analyzing Result Sizes

• In order to answer the question in the previous slide, we must be able to estimate the size of (Sailors ⋈ Reserves) and (Reserves ⋈ Boats).

• The DBMS stores statistics about the relations and indexes.

• They are updated periodically (not every time the underlying relations are modified).

Statistics Maintained by DBMS

• Cardinality: Number of tuples NTuples(R) in each relation R
• Size: Number of pages NPages(R) in each relation R
• Index Cardinality: Number of distinct key values NKeys(I) for each index I
• Index Size: Number of pages INPages(I) in each index I
• Index Height: Number of non-leaf levels IHeight(I) in each B+ Tree index I
• Index Range: The minimum ILow(I) and maximum value IHigh(I) for each index I

Estimating Result Sizes

• Consider:

SELECT attribute-list
FROM relation-list
WHERE term1 and ... and termn

• The maximum number of tuples is the product of the cardinalities of the relations in the FROM clause
• The WHERE clause associates a reduction factor with each term
• Estimated result size is: maximum size times product of reduction factors

Estimating Reduction Factors

• column = value: 1/NKeys(I) if there is an index I on column. This assumes a uniform distribution. Otherwise, System R assumes 1/10.

• column1 = column2: 1/Max(NKeys(I1),NKeys(I2)) if there is an index I1 on column1 and I2 on column2. If only one column has an index, we use it to estimate the value. Otherwise, use 1/10.

• column > value: (IHigh(I)-value)/(IHigh(I)-ILow(I)) if there is an index I on column.
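The three rules can be written as small functions. This is a sketch only: the function names are made up, and passing `None` for a missing index statistic is an assumption about how "no index" would be represented.

```python
def rf_equals_value(nkeys=None):
    """column = value: 1/NKeys(I) with an index, else the default 1/10."""
    return 1 / nkeys if nkeys else 1 / 10


def rf_equals_column(nkeys1=None, nkeys2=None):
    """column1 = column2: 1/Max(NKeys(I1), NKeys(I2)), using whichever indexes exist."""
    known = [k for k in (nkeys1, nkeys2) if k]
    return 1 / max(known) if known else 1 / 10


def rf_greater_than(value, low, high):
    """column > value: fraction of the index range [low, high] that lies above value."""
    return (high - value) / (high - low)


print(rf_equals_value(100))              # 0.01
print(rf_equals_column(40_000, None))    # 2.5e-05
print(rf_greater_than(3, low=0, high=10))  # 0.7
```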

Example

• Cardinality(R) = 1,000 * 100 = 100,000
• Cardinality(S) = 500 * 80 = 40,000
• NKeys(Index on R.agent) = 100
• IHigh(Index on S.rating) = 10, ILow = 0

SELECT *
FROM Reserves R, Sailors S
WHERE R.sid = S.sid and S.rating > 3 and R.agent = 'Joe'

Example (cont.)

• Maximum cardinality: 100,000 * 40,000
• Reduction factor of R.sid = S.sid: 1/40,000
• Reduction factor of S.rating > 3: (10-3)/(10-0) = 7/10
• Reduction factor of R.agent = 'Joe': 1/100
• Total estimated size: 100,000 * 40,000 * (1/40,000) * (7/10) * (1/100) = 700
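The estimate can be verified with exact arithmetic. The sketch below multiplies the maximum cardinality by the three reduction factors; the function name is illustrative, and `Fraction` is used only to avoid floating-point rounding.

```python
from fractions import Fraction
from math import prod


def estimate_result_size(cardinalities, reduction_factors):
    """Maximum size (product of cardinalities) times the product of reduction factors."""
    return prod(cardinalities) * prod(reduction_factors)


size = estimate_result_size(
    cardinalities=[100_000, 40_000],      # Reserves, Sailors
    reduction_factors=[
        Fraction(1, 40_000),              # R.sid = S.sid
        Fraction(7, 10),                  # S.rating > 3
        Fraction(1, 100),                 # R.agent = 'Joe'
    ],
)
print(size)  # 700
```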