
CS 500: Fundamentals of Databases

Storage and indexing. Query evaluation.

supplementary material: Ch. 8.1-8.4, 10, 11, 13.1-13.3

Julia Stoyanovich ([email protected])


Outline

• The I/O model of computation

• Storage and indexing

• Overview of query optimization

• Object-relational databases


Architecture of a typical DBMS

[Figure: architecture of a typical DBMS. An application issues queries to the query evaluation engine (parser, compiler, optimizer, evaluator), which sits on top of concurrency control (transaction manager, lock manager), the recovery manager, and the storage manager (index/record manager, buffer manager, disk space manager), which in turn access storage.]

The memory hierarchy

[Figure, based on Figure 9.1 in R&G: a request for data travels down the hierarchy; data satisfying the request travels back up.]

CPU

Cache: primary storage (volatile), access time = 10 nsec

Main memory: primary storage (volatile), access time = 10-100 nsec

Magnetic disk: secondary storage (stable), access time = 10-15 msec; need to consider seek, rotation, transfer times

Tape: tertiary storage (stable), only for sequential access

1 nsec = $10^{-9}$ sec, 1 msec = $10^{-3}$ sec

Buffer management in a DBMS

[Figure: the buffer pool (volatile) in main memory is a set of frames, some free and some holding a disk page; pages on disk are stable.]

• Data must be in main memory for a DBMS to operate on it

• The unit of transfer between disk and memory is a block; reading or writing a disk block is called an I/O operation. We assume that a disk page and a memory block are of the same size, and use these terms interchangeably.

• Table of <frame#, page#> pairs is maintained by the buffer manager

• Different page replacement policy than in general OS tasks. Why?

Disk structure

• Time to read or write a block depends on the location of the data

• I/O dominates the running time of database operations

• access time = seek time + rotation delay + transfer time

• If data is used together, it should be co-located

• Sequential vs. random access

[Figure: disk structure, showing the spindle, platters, tracks, sectors, a cylinder, the disk heads, the arm assembly, and arm movement. Rotation speed: 5400 RPM. Number of platters: 1-30. Number of tracks <= 10,000.]

Disk access characteristics

• access time = seek time + rotation delay + transfer time

• seek time = time for the head to reach the cylinder (10-40 ms)

• rotation delay = time for the sector to rotate under the head (10 ms)

• transfer rate = 10 MB/sec; transfer time = amount of data / transfer rate

• disks read / write 1 block at a time (typically 4KB)
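The access-time formula can be worked through numerically. The sketch below is only an illustration: it assumes the low end of the seek range quoted on the slide (10 ms), the 10 ms rotation delay, a 10 MB/sec transfer rate, and 4KB blocks. The function name and the 100-block comparison are made up for this example.

```python
def access_time_ms(seek_ms, rotation_ms, n_bytes, bytes_per_sec):
    """access time = seek time + rotation delay + transfer time (in ms)."""
    transfer_ms = n_bytes / bytes_per_sec * 1000.0
    return seek_ms + rotation_ms + transfer_ms

BLOCK_BYTES = 4 * 1024             # typical 4KB block
TRANSFER_RATE = 10 * 1024 * 1024   # 10 MB/sec
SEEK_MS = 10                       # assumed: low end of the 10-40 ms range
ROTATION_MS = 10

# One random block read: seek and rotation dominate the transfer time.
one_block = access_time_ms(SEEK_MS, ROTATION_MS, BLOCK_BYTES, TRANSFER_RATE)

# 100 blocks read at random locations vs. 100 co-located blocks read
# sequentially after a single seek and rotation.
random_100 = 100 * one_block
sequential_100 = access_time_ms(SEEK_MS, ROTATION_MS,
                                100 * BLOCK_BYTES, TRANSFER_RATE)

print(f"one random block:      {one_block:.2f} ms")
print(f"100 random blocks:     {random_100:.2f} ms")
print(f"100 sequential blocks: {sequential_100:.2f} ms")
```

Under these assumptions the transfer itself takes well under 1 ms per block, which is why co-locating data that is used together pays off.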


The I/O model of computation

• In main-memory algorithms we care about CPU time

• In databases the cost is dominated by I/O

• Assumption: cost is given only by I/O

• Consequences: need to redesign certain algorithms

• Will illustrate here with sorting: Sort 1GB of data with 1MB of RAM

• Needed in many relational operations: projection, order by, grouping, some join algorithms


2-way merge-sort

Example with N=7 pages ("|" separates pages in the input file and runs after each pass):

Input file:            3,4 | 6,2 | 9,4 | 8,7 | 5,6 | 3,1 | 2

PASS 0 (1-page runs):  3,4 | 2,6 | 4,9 | 7,8 | 5,6 | 1,3 | 2

PASS 1 (2-page runs):  2,3 4,6 | 4,7 8,9 | 1,3 5,6 | 2

PASS 2 (4-page runs):  2,3 4,4 6,7 8,9 | 1,2 3,5 6

PASS 3 (8-page runs):  1,2 2,3 3,4 4,5 6,6 7,8 9

2-way merge-sort

• Pass 0: read in the input 1 page at a time, sort the page in memory (e.g., using QuickSort), write it out. This produces $2^k$ sorted runs of 1 page each. Uses 1 buffer page at a time.

• Passes 1, 2, …, k-1: read in pairs of sorted runs, 1 page from each at a time, merge them (using 1 page for output), output to disk. This produces $2^{k-1}, 2^{k-2}, \dots$ sorted runs. Uses 3 buffer pages at a time.

• Pass k produces one sorted run of $2^k$ pages. Uses 3 buffer pages at a time.

Suppose the input occupies $N = 2^k$ disk pages.

[Figure: pages flow from disk through three main memory buffers (INPUT 1, INPUT 2, OUTPUT) and back to disk.]

This algorithm requires a total of 3 buffer pages!

2-way merge-sort

• What is the cost of this algorithm?

• In each pass, we read each page, process it, and write it out: 2 disk I/Os per page, per pass

• With $N = 2^k$ pages there are $k + 1 = \log_2 N + 1$ passes (passes 0 through k)

• The overall cost is $2N(\log_2 N + 1)$ I/Os
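To make the pass and I/O counts concrete, here is a minimal in-memory simulation of 2-way merge-sort on the 7-page example file. It is a sketch, not a real external sort: pages are Python lists, no disk is involved, and each pass is simply charged 2 I/Os per page. The function names are made up, and the ceiling in `two_way_cost` is an assumption added to handle inputs whose size is not a power of 2.

```python
import heapq
import math

def two_way_merge_sort(pages):
    """Sort a non-empty 'file' given as a list of pages (lists of keys).

    Returns (sorted_keys, number_of_passes, total_ios), charging one read
    and one write per page in every pass.
    """
    n_pages = len(pages)
    # Pass 0: sort each page on its own, producing 1-page runs.
    runs = [sorted(page) for page in pages]
    passes, ios = 1, 2 * n_pages
    # Later passes: merge runs pairwise until a single run remains.
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + 2]))
                for i in range(0, len(runs), 2)]
        passes += 1
        ios += 2 * n_pages
    return runs[0], passes, ios

def two_way_cost(n_pages):
    """2N(ceil(log2 N) + 1); the ceiling generalizes the N = 2^k case."""
    return 2 * n_pages * (math.ceil(math.log2(n_pages)) + 1)

input_file = [[3, 4], [6, 2], [9, 4], [8, 7], [5, 6], [3, 1], [2]]
result, passes, ios = two_way_merge_sort(input_file)
print(result, passes, ios, two_way_cost(len(input_file)))
```

For the 7-page file the simulation needs passes 0 through 3, and its I/O count agrees with the closed-form cost.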


Generalization: external merge-sort

N records, each of size R (NR total input size). Main memory size M. Disk block size B.

Phase 0: load M bytes into memory, sort, write out a sorted run; do this NR/M times.

[Figure: the input on disk is read in chunks of M bytes (M/R records) into the M bytes of main memory, sorted, and written back to disk.]

Result: N records, divided into NR/M sorted runs of M/R records each


Phase 1, 2, …: merge intermediate runs into a new run

[Figure: runs are read from disk into the M bytes of main memory, which are divided into block-sized input buffers (Input 1, Input 2, …, Input M/B) and an Output buffer; the merged run is written back to disk.]

[Figure: the N records, divided into NR/M sorted runs of M/R records each, are merged into the final sorted result.]

Cost of external merge-sort

B: block size. M: main memory size (in blocks). N: input size (pages). R: size of 1 record.

Cost = $2N \left( \log_{M-1} \frac{N}{M} + 1 \right)$

Given B = 4KB, M = 64MB, R = 0.1KB:

Pass 1: runs of 40 * 16 * 1024 = about 640,000 records

Pass 2: runs increase by a factor of M/B - 1 = 16,000: sorted runs of 10,240,000,000 records

Pass 3: runs increase by a factor of M/B - 1 = 16,000: sorted runs of about $10^{14}$ records

With a modest memory size, we can sort everything in 2-3 passes!
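The numbers on this slide can be reproduced with a short calculation. The sketch below makes a few assumptions that are not stated on the slide: the number of merge passes is obtained by integer iteration (equivalent to taking a ceiling of the logarithm), the exact fan-in M/B - 1 = 16,383 is used instead of the rounded 16,000, and all names are invented for the example.

```python
import math

def external_sort_cost(n_pages, m_pages):
    """I/O cost 2N * (number of merge passes + 1) with m_pages buffer pages.

    Pass 0 creates ceil(N / M) runs; every later pass merges M - 1 runs
    at a time (one buffer page is reserved for output).
    """
    if m_pages < 3:
        raise ValueError("need at least 3 buffer pages to merge")
    runs = math.ceil(n_pages / m_pages)
    merge_passes = 0
    while runs > 1:
        runs = math.ceil(runs / (m_pages - 1))
        merge_passes += 1
    return 2 * n_pages * (merge_passes + 1)

RECORDS_PER_PAGE = 40        # 4KB block / 0.1KB record
M_PAGES = 64 * 1024 // 4     # 64MB of memory in 4KB blocks = 16,384
FAN_IN = M_PAGES - 1         # M/B - 1

# Run length (in records) after the initial pass and two merge passes.
run_sizes = [RECORDS_PER_PAGE * M_PAGES]
for _ in range(2):
    run_sizes.append(run_sizes[-1] * FAN_IN)

for i, size in enumerate(run_sizes, start=1):
    print(f"pass {i}: runs of {size:,} records")
```

The exact values come out slightly larger than the rounded figures on the slide, but the conclusion is the same: two merge passes already cover more than $10^{14}$ records.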


Storage and indexing

• How do we efficiently store large amounts of data?

• The answer depends on how the data will be accessed!

• Primary storage of the data: heap, hashed, sorted

• Additional indexing: tree-based and hash-based

• Data records are stored in files. Each record has a unique identifier called record id, or rid.

• Cost model: ignore CPU cost, focus on I/O

• Average-case analysis based on simplifying assumptions

Basic file organization

• Heap files: good for full file scans or frequent updates

• unordered files

• insert at the end of file

• supports retrieval of all records, or equality selection on rid (exactly one match - why?)

• Sorted files: good for range queries on sort field(s)

• need external sort to keep sorted

• compacted after deletion

• assumes selection on sort field(s)

• Hashed files: good for selection on equality

• collection of buckets with primary & overflow pages

• hashing function h(r) = bucket for record r

• each bucket is a heap file (unsorted)

Cost of operations (average case)

                    Heap File     Sorted File                             Hashed File *
Scan all recs       p(T) D        p(T) D                                  1.25 p(T) D
Equality Search **  p(T) D / 2    D log_2 p(T)                            D
Range Search        p(T) D        D log_2 p(T) + (# pages with matches)   1.25 p(T) D
Insert              2D            Search + p(T) D ***                     2D
Delete              Search + D    Search + p(T) D ***                     2D

p(T) - number of data pages in table T

D - time to read or write a disk page

* assuming no overflow bucket, 80% page occupancy

** assuming search on candidate key

*** search, insert or delete (in the middle on average), then move all records 1 position to the right or left, at a cost of 2D per page, for p(T) / 2 pages

Indexing

• An index (plural: indexes) represents the fundamental trade-off between space and processing time

• An index on a file speeds up selections on the search key attributes

• any subset of the fields of a relation can be the search key for an index on the relation

• search key is not the same as candidate key!

• An index is a collection of data entries that supports efficient retrieval of all data entries k* with a given search key value k


“If you don’t find it in the index, look very carefully through the entire catalog.”

Sears, Roebuck and Co., Consumer’s Guide, 1897

Alternatives

• 3 alternatives for what to store as a data entry k*:

1. the actual data record with search key value k

2. <k, rid of data record with search key value k>

3. <k, list of rids of data records with search key value k>

• At most one index should use Alternative 1 - why?

• Alternative 3 is more compact than Alternative 2, but its data entries are variable-length

More on index types

• Clustered index: order of records in the file that stores the data records is the same as, or close to, the order of data entries in the index

• Alternative 1 index is clustered by definition

• Alternative 2 or 3 index is usually not clustered

• (Alternative 2 or 3 indexes are only clustered if data records are sorted on the search key; in practice this is rare)

• Cost for using an index to answer a query varies greatly depending on whether the index is clustered or unclustered

• Primary index: an index on the primary key or on a superkey. Any other index is called a secondary index.

• Unique index: search key is a superkey

• Tree-based index, e.g., B+ tree vs. hash-based index

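Clustered vs. unclustered index

[Figure: in both cases, the index file holds data entries that point to data records in the data file. CLUSTERED: data records are stored in the same order as the data entries. UNCLUSTERED: the order of data records is unrelated to the order of data entries.]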

Composite search keys

• Composite search keys support search on a combination of fields

• Equality (point) query: every field value is equal to a constant, e.g., w.r.t. a <sal, age> index: age = 12 and sal = 75

• Range query: some field value is not a constant, e.g., age = 12 and sal > 10; or age < 45 and sal > 15

Examples of composite key indexes using lexicographic order.

Data records sorted by name:

name  age  sal
bob   12   10
cal   11   80
joe   12   20
sue   13   75

Data entries in each index, sorted by the search key:

<age, sal>:  11,80  12,10  12,20  13,75
<sal, age>:  10,12  20,12  75,13  80,11
<age>:       11  12  12  13
<sal>:       10  20  75  80

Hash-based indexes

• Good for equality selections, cost = 1 I/O

• Index is a collection of buckets

• bucket = primary page plus zero or more overflow pages

• buckets contain data entries

• Hashing function h: h(r) = bucket in which data entry for record r belongs. h looks at the search key fields of r.

Tree-based indexes

• ISAM (static, clustered) - Indexed Sequential Access Method

• B+ tree (dynamic, height-balanced)

• good for equality and range selections; cost = tree height + # pages with matching records

[Figure: non-leaf pages sit on top of leaf pages (sorted by search key). An index page has the form P0, K1, P1, K2, P2, …, Km, Pm.]

B+ tree: the DB World’s favorite index

• Insert / delete at $\log_F N$ cost

• F = fanout, N = # leaf pages

• keep tree height-balanced

• Minimum 50% occupancy, except for root

• each node contains d <= m <= 2d entries; d is called the order of the tree (measure of tree node capacity)

• B+ tree supports equality and range queries efficiently

[Figure: index entries in the non-leaf levels direct the search to the data entries in the leaf level.]

B+ tree search

• Start at root, use key comparisons to navigate to a leaf

• Search for 5*, 15*, all data entries >=24*

Root: 13 | 17 | 24 | 30

Leaves: [2* 3* 5* 7*] [14* 16*] [19* 20* 22*] [24* 27* 29*] [33* 34* 38* 39*]

How many disk I/Os to answer a point query? How many disk I/Os to answer a range query?
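The searches on this slide can be traced with a small sketch. It hard-codes the example tree (one root page over five leaf pages) rather than implementing a general B+ tree, charges one I/O per page touched, and assumes that a range scan follows the leaf pages to the right until the end. The function names are invented for the illustration.

```python
from bisect import bisect_right

# The example tree: root keys and the leaf pages below them, left to right.
ROOT_KEYS = [13, 17, 24, 30]
LEAVES = [
    [2, 3, 5, 7],
    [14, 16],
    [19, 20, 22],
    [24, 27, 29],
    [33, 34, 38, 39],
]

def find_leaf(key):
    """Index of the leaf whose key range contains `key`.

    Child i of the root covers keys k with K_i <= k < K_(i+1).
    """
    return bisect_right(ROOT_KEYS, key)

def point_query(key):
    """Return (found, pages_read): one root page plus one leaf page."""
    leaf = LEAVES[find_leaf(key)]
    return key in leaf, 2

def range_query(low):
    """All data entries >= low, plus the number of pages read."""
    start = find_leaf(low)
    matches = [k for leaf in LEAVES[start:] for k in leaf if k >= low]
    pages_read = 1 + (len(LEAVES) - start)   # root + scanned leaves
    return matches, pages_read

print(point_query(5))    # 5* is in the first leaf
print(point_query(15))   # 15* is not in the tree
print(range_query(24))   # all data entries >= 24*
```

In this tree of height 1 a point query touches 2 pages; the range query touches the root plus every leaf page that can hold a match.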


Insertion in a B+ tree

Insert(K,P)

• find leaf where K belongs, insert

• if no overflow (2d keys or less) - done

• if overflow (2d+1 keys), split node, insert in parent

[Figure: an overflowing node with keys K1 K2 K3 K4 K5 and pointers P0 P1 P2 P3 P4 P5 is split into a left node (K1 K2 with P0 P1 P2) and a right node (K4 K5 with P3 P4 P5); the entry (K3, pointer to the new node) goes to the parent.]

• if leaf, also keep K3 in right node (copy-up vs. push-up)

• when root splits, new root has only 1 key
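The difference between copy-up and push-up is easy to see on plain key lists. The sketch below is a simplification: it handles only the keys of one overflowing node (2d+1 keys, already sorted), ignores pointers and rids, and uses invented function names.

```python
def split_leaf(entries):
    """Split an overflowing leaf; the middle key is COPIED up.

    Returns (left, right, key_for_parent). The key sent to the parent
    also stays in the right leaf, because leaves hold all data entries.
    """
    mid = len(entries) // 2
    left, right = entries[:mid], entries[mid:]
    return left, right, right[0]

def split_index(keys):
    """Split an overflowing index node; the middle key is PUSHED up.

    Returns (left, right, key_for_parent). The key sent to the parent
    appears in neither half.
    """
    mid = len(keys) // 2
    return keys[:mid], keys[mid + 1:], keys[mid]

# d = 2: a node overflows when it holds 2d + 1 = 5 keys.
print(split_leaf([2, 3, 5, 7, 8]))       # inserting 8* into a full leaf
print(split_index([5, 13, 17, 24, 30]))  # parent overflows after copy-up of 5
```

With d = 2 these two calls correspond to inserting 8* into the example tree: 5 is copied up from the leaf, and 17 is then pushed up from the index node.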


Inserting 8* example: push up

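Before: root 13 | 17 | 24 | 30; leaves [2* 3* 5* 7*] [14* 16*] [19* 20* 22*] [24* 27* 29*] [33* 34* 38* 39*]

Leaf split: inserting 8* splits the first leaf into [2* 3*] and [5* 7* 8*]; 5 is copied up into the parent.

The parent would now hold 5, 13, 17, 24, 30: need to split node & push up.

Index node split: [5 13] and [24 30], with 17 as the entry to be inserted in the parent node. (Note that 17 is pushed up and only appears once in the index. Contrast this with a leaf split.)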

Deletion from a B+ tree

• Start at root, find leaf L where the entry belongs

• Remove the entry

• if L is at least half-full, done

• If L has only d-1 entries

• try to re-distribute, borrowing from sibling (adjacent node with same parent as L)

• if re-distribution fails, merge L and sibling

• If merge occurred, must delete entry (pointing to L or sibling) from parent of L

• Merge could propagate to root, decreasing height


B+ trees in practice

• Typical order d=100, typical fill factor 67%, fanout = 133

• Typical capacities:

• Height 3: $133^3$ = 2,352,637 records

• Height 4: $133^4$ = 312,900,721 records

• Can usually hold top levels of the tree in buffer pool: level 3 is 133MB, level 4 is 18GB (assuming 8KB pages)
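These capacities follow directly from the fanout. The sketch below recomputes them, assuming a fanout of 133 at every level, 8KB pages, and level 1 being the root; the helper names are made up for the example.

```python
FANOUT = 133
PAGE_BYTES = 8 * 1024

def records_at_height(height):
    """Number of records reachable through a tree of the given height."""
    return FANOUT ** height

def pages_at_level(level):
    """Number of pages on a level, with level 1 being the root."""
    return FANOUT ** (level - 1)

def level_size_bytes(level):
    return pages_at_level(level) * PAGE_BYTES

for h in (3, 4):
    print(f"height {h}: {records_at_height(h):,} records")

for lvl in (1, 2, 3, 4):
    mib = level_size_bytes(lvl) / 2**20
    print(f"level {lvl}: {pages_at_level(lvl):,} pages, {mib:,.1f} MiB")
```

The exact sizes differ a little from the rounded 133MB and 18GB figures, but the orders of magnitude match: the top three levels fit comfortably in a buffer pool.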


“Nearly O(1)” access time to data - for equality or range queries!


Summary so far

• We briefly looked at the memory hierarchy. Disk is the most important storage device.

• We discussed alternative file and index organizations

• We argued for estimating the cost of an algorithm using I/O rather than CPU operations, and gave an example, external sort

• We saw an important index, B+ tree

• Next: overview of query optimization

Overview of query optimization

Given a SQL query, there may be different execution plans that produce the same result but that have different performance characteristics

• Goal of query optimization: find an efficient query execution plan for a given query

• Ideally: find the absolute best plan

• In reality: find a reasonable plan, avoid the really bad ones

• Query execution plan is represented by a tree of relational algebra operators, annotated with a choice of an algorithm for each operator

• Two main issues in query optimization:

1. Which plans are considered for a given query, that is, what is the search space of the query optimization algorithm?

2. How do we estimate the cost of a particular query execution plan?


Example of a query execution plan

SELECT S.sname
FROM Reserves R, Sailors S
WHERE R.sid = S.sid AND R.bid = 100 AND S.rating > 5

RA tree: $\pi_{sname}(\sigma_{bid=100 \wedge rating>5}(Reserves \bowtie_{sid=sid} Sailors))$

Plan: the same tree, with the join evaluated by simple nested loops, and the selection and the projection evaluated on-the-fly.

• Query plans: data-flow graphs of relational algebra operators

• Typically determined by the query optimizer

Algebraic equivalences

• Query optimization relies on reasoning about equivalence among relational algebra expressions

• Equivalent expressions compute the same result (for all instances) but may differ in cost

• There is also cost associated with implementations of particular operators, of course (e.g., whether an index is used to retrieve data values)

cascading selections: $\sigma_{c_1 \wedge c_2 \wedge \dots \wedge c_n}(R) = \sigma_{c_1}(\sigma_{c_2}(\dots \sigma_{c_n}(R) \dots))$

selection is commutative: $\sigma_{c_1}(\sigma_{c_2}(R)) = \sigma_{c_2}(\sigma_{c_1}(R))$

cascading projections: $\pi_{a_1}(R) = \pi_{a_1}(\pi_{a_2}(\dots \pi_{a_n}(R) \dots))$

Algebraic equivalences

join, cross product are commutative, associative:

$R \times S = S \times R$, $R \times (S \times T) = (R \times S) \times T$

$R \bowtie S = S \bowtie R$, $R \bowtie (S \bowtie T) = (R \bowtie S) \bowtie T$

pushing selections (when c refers only to attributes of R): $\sigma_c(R \bowtie S) = \sigma_c(R) \bowtie S$

do selection and projection commute? projection and join / cross product?

Examples

Sailors (sid, name, rating, age)
Boats (bid, name, color)
Reserves (sid, bid, day)

Write SQL queries, give several equivalent relational algebra expressions, show operator trees

List ids of sailors who reserved boat 102

select sid
from Reserves
where bid = 102

$\pi_{sid}(\sigma_{bid=102}(Reserves))$

$\pi_{sid}(\sigma_{bid=102}(\pi_{sid,bid}(Reserves)))$

[Operator trees: Reserves, then $\sigma_{bid=102}$, then $\pi_{sid}$; and Reserves, then $\pi_{sid,bid}$, then $\sigma_{bid=102}$, then $\pi_{sid}$.]

Examples

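List names of sailors who reserved boat 102

select S.name
from Reserves R, Sailors S
where R.sid = S.sid
and R.bid = 102

$\pi_{name}(\sigma_{bid=102 \wedge R.sid=S.sid}(R \times S))$

$\pi_{name}((\sigma_{bid=102} R) \bowtie S)$

[Operator trees: R and S under $\times$, then $\sigma_{bid=102 \wedge R.sid=S.sid}$, then $\pi_{name}$; and $\sigma_{bid=102}$ on R, joined with S, then $\pi_{name}$.]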
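The two expressions can be checked against each other on a toy instance. Everything in the sketch below is made up for illustration: the tuples, the reduced schemas (Reserves as (sid, bid), Sailors as (sid, name)), and the function names. It evaluates both plans with list comprehensions and also reports the size of the intermediate result each plan builds.

```python
# Hypothetical toy instance.
reserves = [(22, 101), (22, 102), (31, 102), (58, 103)]    # (sid, bid)
sailors = [(22, "dustin"), (31, "lubber"), (58, "rusty")]  # (sid, name)

def plan_cross_then_select():
    """pi_name(sigma_{bid=102 and R.sid=S.sid}(R x S))"""
    product = [(r, s) for r in reserves for s in sailors]
    rows = [(r, s) for r, s in product if r[1] == 102 and r[0] == s[0]]
    return {s[1] for _, s in rows}, len(product)

def plan_select_then_join():
    """pi_name((sigma_{bid=102} R) join S)"""
    selected = [r for r in reserves if r[1] == 102]
    joined = [(r, s) for r in selected for s in sailors if r[0] == s[0]]
    return {s[1] for _, s in joined}, len(selected)

names_a, intermediate_a = plan_cross_then_select()
names_b, intermediate_b = plan_select_then_join()
print(names_a == names_b, intermediate_a, intermediate_b)
```

Both plans return the same names, but pushing the selection below the join shrinks the intermediate result from the full cross product to the two reservations of boat 102.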


Examples

List names of sailors who reserved the red Interlake.

select S.name
from Reserves R, Sailors S, Boats B
where R.sid = S.sid
and R.bid = B.bid
and B.name = 'Interlake'
and B.color = 'red'

$\pi_{Sailors.name}(((\sigma_{name='Interlake' \wedge color='red'} Boats) \bowtie Reserves) \bowtie Sailors)$

[Operator tree: $\sigma_{name='Interlake' \wedge color='red'}$ on Boats, joined with Reserves, then joined with Sailors, with $\pi_{Sailors.name}$ at the root.]

Examples

List names of sailors who reserved the red Interlake (the same query as on the previous slide), with a different join order:

$\pi_{Sailors.name}((\sigma_{name='Interlake' \wedge color='red'} Boats) \bowtie (Reserves \bowtie Sailors))$

[Operator tree: Reserves joined with Sailors; the result is joined with $\sigma_{name='Interlake' \wedge color='red'}$ on Boats, with $\pi_{Sailors.name}$ at the root.]

Common implementation techniques

• Of course, it is extremely important to efficiently implement individual relational algebra operators

• The following approaches are used for implementing different operators (high-level insights)

• Indexing: can use WHERE conditions to retrieve a small set of tuples (selection, join)

• Iteration: sometimes it is faster to scan all tuples even if there is an index. And sometimes we can scan the data entries in an index, rather than in the data file itself.

• Partitioning: By using sorting or hashing, we can partition the input tuples and replace an expensive operation by similar operations on smaller inputs.

Statistics and catalogs

• Need information about the relations and indexes. Catalogs typically contain at least:

• # tuples (NTuples) and # pages (NPages) for each relation

• # distinct key values (NKeys) and NPages for each index

• index height, low / high key values for each tree index

• Catalogs are updated periodically

• Updating whenever data changes is too expensive, lots of approximation anyway, so slight imprecision is OK

• More detailed information, e.g., histograms of the values in some field, is sometimes also stored

Access paths: example

Employees(id, dept, age);

select dept
from Employees
where age > ?

A1: clustered B+ tree index on id

A2: unclustered B+-tree index on age

A3: sequential scan of relation, stored in a sorted file on id

A4: unclustered hash index on age

Q1: age > 10

Q2: age > 40

Q3: age > 70

select dept, count(*)
from Employees
where age > ?
group by dept

Q1: age > 10

Q2: age > 40

Q3: age > 70

A1: clustered B+ tree index on id

A2: unclustered B+-tree index on age

A3: sequential scan of relation, stored in a sorted file on id

A4: clustered B+tree index on dept

select age, count(*)
from Employees
where age > ?
group by age

Q1: age > 10

Q2: age > 40

Q3: age > 70

A1: clustered B+ tree index on age

A2: unclustered B+ tree index on age

A3: sequential scan of relation, stored in a sorted file

Access paths

• An access path is a method of retrieving tuples: file scan, or index that matches a selection in the query

• An index matches a conjunction of terms if it can be used to retrieve all data values that match this conjunction of terms.

• A tree index matches a conjunction of terms that involve only attributes in a prefix of the search key.

• e.g., tree index <a,b,c> matches the selection a=5 AND b=3; it also matches a=5 AND b>4; it does not match b=3.

• A hash index matches a conjunction of terms that has a term attribute=value for every attribute in the search key of the index.

• e.g., hash index on <a,b,c> matches a=6 AND b=3 AND c=5; it does not match b=3; or a=5 and b=5; or a>5 AND b=3 and c=5
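The matching rules above can be written as two small predicates. This is one possible reading of the definitions, not a statement of how any particular optimizer decides: a conjunction is a list of (attribute, operator) pairs, a tree index is taken to match only if the attributes mentioned form exactly a prefix of the search key, and a hash index is taken to match if there is an equality term for every search key attribute. The function names are invented.

```python
def tree_index_matches(search_key, terms):
    """True if the attributes in `terms` form a non-empty prefix of the key.

    search_key: tuple of attribute names, e.g. ("a", "b", "c")
    terms: list of (attribute, operator) pairs, e.g. [("a", "="), ("b", ">")]
    """
    attrs = {attr for attr, _ in terms}
    if not attrs:
        return False
    prefix = search_key[:len(attrs)]
    return set(prefix) == attrs

def hash_index_matches(search_key, terms):
    """True if there is an equality term for every search key attribute."""
    eq_attrs = {attr for attr, op in terms if op == "="}
    return set(search_key) <= eq_attrs

key = ("a", "b", "c")
print(tree_index_matches(key, [("a", "="), ("b", "=")]))               # a=5 AND b=3
print(tree_index_matches(key, [("a", "="), ("b", ">")]))               # a=5 AND b>4
print(tree_index_matches(key, [("b", "=")]))                           # b=3
print(hash_index_matches(key, [("a", "="), ("b", "="), ("c", "=")]))   # a=6 AND b=3 AND c=5
print(hash_index_matches(key, [("a", ">"), ("b", "="), ("c", "=")]))   # a>5 AND b=3 AND c=5
```

The calls reproduce the examples in the bullets: the tree index accepts range terms on a prefix, while the hash index needs equality on the whole search key.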


One approach to selection

• Find the most selective access path, retrieve tuples using it, and apply the remaining terms that do not match the index.

• The most selective access path: an index or file scan (!) that we estimate will require the fewest page I/Os.

• Terms that match this index reduce the number of tuples retrieved; other terms are used to discard some retrieved tuples, but do not affect the number of tuples / pages fetched.

• Example: day < 1/1/2011 AND bid=5 AND sid=3

• option 1: use a B+ tree index on day, then check bid=5 and sid=3 for each retrieved tuple

• option 2: use a hash index on <bid, sid>, then check day <1/1/2011 for each retrieved tuple

Once again, we are interested in quantifying the I/O-based cost


Index-only evaluation

• Many DBMSs implement index-only query plans: the query is answered using only the information in the search key of the index, without going to the data records on disk

• Important because typically only 1 index is clustered, and so using other indexes to fetch data records will potentially trigger several random I/Os

• Works well even with unclustered indexes, since data records are never fetched

Using an index for selection

• Cost of finding qualifying data entries (typically small) plus cost of retrieving records (could be large)

• Example: assuming uniform distribution of rname, about 10% of tuples qualify (100 pages, 10,000 tuples).

• with a clustered index, cost is little more than 100 I/Os

• with an unclustered index, cost is up to 10,000 I/Os!
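The two estimates can be reproduced with a few lines of arithmetic. The sketch assumes Reserves has 1000 pages with 100 tuples per page, ignores the (typically small) cost of traversing the index itself, and treats the unclustered case as the worst case of one I/O per qualifying tuple. The function name is made up.

```python
import math

def record_retrieval_ios(pages, tuples_per_page, percent_qualifying, clustered):
    """I/Os to fetch the qualifying data records through an index."""
    total_tuples = pages * tuples_per_page
    qualifying = total_tuples * percent_qualifying // 100
    if clustered:
        # Qualifying records sit next to each other on consecutive pages.
        return math.ceil(qualifying / tuples_per_page)
    # Worst case: every qualifying record is on a different page.
    return qualifying

clustered_ios = record_retrieval_ios(1000, 100, 10, clustered=True)
unclustered_ios = record_retrieval_ios(1000, 100, 10, clustered=False)
print(clustered_ios, unclustered_ios)
```

Note that the unclustered estimate is ten times the cost of simply scanning all 1000 pages of Reserves, which is why a file scan can be the better access path.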

SELECT * FROM Reserves R WHERE R.rname < 'C%'

Sailors (sid, name, rating, age);
Reserves (sid, bid, day, rname);

Reserves (R): 100 tuples per page, 1000 pages
Sailors (S): 80 tuples per page, 500 pages

Access paths: another example

Employees (eid, name, salary, age, did);
Departments (did, budget, floor, manager_eid);

Salaries from $10K to $100K; ages from 20 to 80; 5 employees per department; 10 floors; budgets from $10K to $1M. Uniform, uncorrelated values.

Q1. Print name, age, salary for all employees

A1: clustered hash index on (name, age, salary) of Employees

A2: unclustered hash index on (name, age, salary) of Employees

A3: clustered B+-tree index on (name, age, salary) of Employees

A4: unclustered hash index on (eid, did) of Employees

A5: sequential access of Employees, Departments


Q2. Find dids of departments on the 10th floor with budget < $15K

A1: clustered hash index on (floor) of Departments

A2: clustered hash index on (floor, budget) of Departments

A3: clustered B+-tree index on (floor, budget) of Departments

A4: clustered B+-tree index on (budget) of Departments

A5: no index


Q3. Compute average budget of departments per floor

A1: clustered hash index on (did, floor) of Departments

A2: clustered hash index on (floor) of Departments

A3: clustered B+-tree index on (did, floor, budget) of Departments

A4: clustered B+-tree index on (floor, budget, did) of Departments

A5: no index

Sailors (sid, name, rating, age);

Sids from 1 to 100K, ratings from 1 to 10, ages from 20 to 80. Uniform, uncorrelated values.

Q1. Print name, age, rating of all sailors

A1: sequential scan of sorted file, sorted on (id)

A2: clustered hash index on (rating)

A3: unclustered hash index on (id)

A4: unclustered hash index on (age, rating)

A5: unclustered hash index on (name, age)

A6: clustered B+-tree index on (name, age)

A7: unclustered B+-tree index on (age, rating)




Q2. Print name, age, rating of the sailor with sid 123











Q3. Count sailors with rating = 5 and age < 40











Q4. Count sailors with rating = 5











Q5. Print name, age, rating of sailors with age < 40 and rating < 5








Projection

• The expensive part of projection is removing duplicates (if we are retrieving sets, not bags)

• Sorting approach: sort on <sid,bid> and remove duplicates. This can be optimized by dropping unwanted information while sorting.

• Hashing approach: hash on <sid,bid> to create partitions. Load partitions into memory one at a time, build in-memory hash structure and eliminate duplicates.

• If there is an index with both R.sid and R.bid in the search key, it may be cheaper to sort data entries.

• Similar issues arise when processing the group-by operator
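Both duplicate-elimination strategies can be sketched in memory. The code below only illustrates the idea: rows are Python dicts, the Reserves tuples are invented, "partitions" are plain lists rather than disk files, and the number of partitions is an arbitrary choice.

```python
def project_sort(rows, cols):
    """Sorting approach: drop unwanted fields, sort, skip adjacent duplicates."""
    projected = sorted(tuple(row[c] for c in cols) for row in rows)
    result = []
    for t in projected:
        if not result or result[-1] != t:
            result.append(t)
    return result

def project_hash(rows, cols, n_partitions=4):
    """Hashing approach: partition by hash, then deduplicate each partition."""
    partitions = [[] for _ in range(n_partitions)]
    for row in rows:
        t = tuple(row[c] for c in cols)
        partitions[hash(t) % n_partitions].append(t)
    result = []
    for part in partitions:
        seen = set()   # in-memory hash structure for one partition
        for t in part:
            if t not in seen:
                seen.add(t)
                result.append(t)
    return result

# Hypothetical Reserves tuples.
reserves = [
    {"sid": 22, "bid": 101, "day": "10/10"},
    {"sid": 22, "bid": 101, "day": "10/11"},
    {"sid": 31, "bid": 102, "day": "10/12"},
    {"sid": 22, "bid": 102, "day": "10/12"},
]
print(project_sort(reserves, ("sid", "bid")))
print(sorted(project_hash(reserves, ("sid", "bid"))))
```

Duplicates always hash to the same partition, so each partition can be deduplicated independently; the sort-based variant additionally returns its output in sorted order.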


SELECT DISTINCT R.sid, R.bid FROM Reserves R

Join

• A highly optimized operation in a DBMS

• Several well-studied algorithms are available:

• Nested loops join family of algorithms

• Sort-merge join

• Hash join

• As before, the cost of a join is based on I/O (# pages exchanged between disk and memory)

• Best choice for a query depends on the characteristics of the query and the relations, as well as on the available indexes

• Also important to think about how the join fits within the overall query execution pipeline (a join algorithm may be slower on its own, but may have a side effect that is desirable for later operators, e.g., sorted output)

Highlights of the System R optimizer

• Currently the most widely used approach, works well for < 10 joins

• A combination of rule-based and cost-based optimization

• Cost estimation: Approximate art at best.

• Statistics, maintained in system catalogs, used to estimate cost of operations and result sizes

• Consider combination of CPU and I/O cost, mostly interested in I/O cost

• Plan space: too large, must be pruned

• Only the space of left-deep plans is considered. Left-deep plans allow output of each operator to be pipelined into the next operator without storing it in a temporary relation.

• Cartesian products are avoided


Cost estimation

• For a given SQL query, there are multiple (often very many!) query execution plans

• A query execution plan is a tree of relational operators, each node is annotated with a particular implementation of the operator

• To cost a plan, we must estimate the cost of each operation


Size estimation and reduction factors

• Consider this query:

SELECT attribute list
FROM relation list
WHERE term1 AND ... AND termk

• Maximum # tuples in result is the product of the cardinalities of relations in the FROM clause

• Reduction factor (RF) associated with each term reflects the impact the term has in reducing result size

• Result cardinality = max # tuples * product of all RFs

• Implicitly assumes that terms are independent

• Term col=val has RF 1/NKeys(I), given index I on column col

• Term col1=col2 has RF 1/max(NKeys(I1), NKeys(I2)), given indexes I1 on col1 and I2 on col2

• Term col > val has RF (high(I) - val) / (high(I) - low(I))
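The reduction-factor rules translate into a short size estimator. The sketch below applies them to the query R.sid = S.sid AND R.bid = 100 AND S.rating > 5 over Reserves and Sailors. The statistics are assumptions made for the example (100,000 Reserves tuples, 40,000 Sailors tuples, 40,000 distinct sid values in both indexes, 100 distinct bid values, ratings between 1 and 10), and the function names are invented.

```python
import math

def rf_equals_value(nkeys):
    """Term col = val: RF is 1 / NKeys(I)."""
    return 1 / nkeys

def rf_equals_column(nkeys1, nkeys2):
    """Term col1 = col2: RF is 1 / max(NKeys(I1), NKeys(I2))."""
    return 1 / max(nkeys1, nkeys2)

def rf_greater_than(val, low, high):
    """Term col > val: RF is (high - val) / (high - low)."""
    return (high - val) / (high - low)

def result_cardinality(relation_sizes, reduction_factors):
    """max # tuples * product of all RFs (terms assumed independent)."""
    size = math.prod(relation_sizes)
    for rf in reduction_factors:
        size *= rf
    return size

# Assumed catalog statistics, for illustration only.
estimate = result_cardinality(
    [100_000, 40_000],                     # |Reserves|, |Sailors|
    [
        rf_equals_column(40_000, 40_000),  # R.sid = S.sid
        rf_equals_value(100),              # R.bid = 100
        rf_greater_than(5, 1, 10),         # S.rating > 5
    ],
)
print(round(estimate))
```

Under these assumed statistics, the estimate drops from 4 billion candidate pairs to a few hundred result tuples.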

Example

Sailors (sid: int, sname: string, rating: int, age: real)

Reserves (sid: int, bid: int, day: date, rname: string)

Reserves (R): each tuple is 40 bytes long, 100 tuples per page, 1000 pages
Sailors (S): each tuple is 50 bytes long, 80 tuples per page, 500 pages

Simple nested loops join algorithm

for each tuple r in R do
    for each tuple s in S do
        if r_i == s_j then add <r, s> to result

• for each tuple in R we scan the entire relation S

• the cost of scanning R is 1000 I/Os

• the cost of scanning S is 500 I/Os; this is done 1000 * 100 times

total cost = 1000 + (100 * 1000 * 500) = 1000 + $5 \times 10^7$ I/Os; with 10 ms per I/O, this is about 140 hours!

• a simple refinement is to compute the join page-at-a-time

• total cost = 1000 + 1000 * 500 = 501,000 I/Os

Can we do better?
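The cost figures for the two nested loops variants can be recomputed directly. The sketch below uses the page and tuple counts given for Reserves and Sailors and the 10 ms per I/O figure; the function names are invented for the example.

```python
def simple_nested_loops_cost(outer_pages, outer_tuples_per_page, inner_pages):
    """Scan the outer once, and the whole inner once per outer TUPLE."""
    return outer_pages + outer_pages * outer_tuples_per_page * inner_pages

def page_nested_loops_cost(outer_pages, inner_pages):
    """Scan the outer once, and the whole inner once per outer PAGE."""
    return outer_pages + outer_pages * inner_pages

MS_PER_IO = 10

tuple_at_a_time = simple_nested_loops_cost(1000, 100, 500)   # Reserves outer
page_at_a_time = page_nested_loops_cost(1000, 500)           # Reserves outer
page_at_a_time_swapped = page_nested_loops_cost(500, 1000)   # Sailors outer

hours = tuple_at_a_time * MS_PER_IO / 1000 / 3600
minutes = page_at_a_time * MS_PER_IO / 1000 / 60
print(f"{tuple_at_a_time:,} I/Os, about {hours:.0f} hours")
print(f"{page_at_a_time:,} I/Os, about {minutes:.0f} minutes")
print(f"{page_at_a_time_swapped:,} I/Os with Sailors as the outer relation")
```

Going page-at-a-time cuts the running time from days to under an hour and a half, and putting the smaller relation on the outside saves a further 500 I/Os.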


Example (continued)

[Plan: Reserves and Sailors are joined on sid=sid with simple nested loops; the selection bid=100 and rating > 5 and the projection on sname are applied on-the-fly.]

• Cost with page-at-a-time nested loops join = 500 + 500 * 1000 I/O

• This is by no means the worst plan! can you think of a plan that’s worse?

• Nonetheless, this plan misses several opportunities: selections could have been pushed, we could have used indexes

Goal of optimization: To find reasonably efficient plans while avoiding the absolutely terrible ones

Choice of indexes to build

What indexes should we create?

• Which relations should have indexes? What fields should be the search key? Should we build several indexes?

• For each index, what kind of an index should it be?

• One approach

• consider the most important queries in your workload

• consider the best query plan using the current indexes, and see whether a better plan becomes possible with an additional index

• Keep in mind that indexes (1) take up space; (2) require maintenance under updates (time). So there is a cost!

Choice of indexes to build: guidelines

• Attributes in WHERE are candidates for index keys

• exact match conditions suggest hash index

• range conditions suggest tree index

• Multi-attribute search keys should be considered when query contains several conditions on the same relation

• order of attributes is important for range queries

• such indexes can enable index-only strategies for important queries

• Try to choose indexes that benefit as many queries as possible

Understanding the workload

• For each query

• Which relations does it access?

• Which attributes are retrieved?

• Which attributes are involved in selection / join conditions? How selective are these conditions likely to be?

• For each update in the workload

• Which attributes are involved in selection / join conditions? How selective are these conditions likely to be?

• What is the type of update (INSERT / DELETE / UPDATE), and what attributes are affected?

Motivating example

California Department of Water Resources: 500,000 photos, with captions

create table Photos (
    id number primary key,
    ts date,
    caption document,
    picture photo_CD_image);

create table Landmarks (
    name varchar(64) primary key,
    location point);

Find sunset pictures of landmarks within 20 miles of Sacramento, CA.

select id
from Photos P, Landmarks L, Landmarks S
where sunset(P.picture)
and contains(P.caption, L.name)
and L.location |20| S.location
and S.name = 'Sacramento, CA'

Note: user-defined functions (UDFs) and operators

Note: query optimization must be re-considered

Object-database systems

• Relational databases: relations are in first normal form. Clean and simple. But the world is more complex!

• There is sometimes a need to accommodate

• complex data types / nesting

• inheritance hierarchies

• Two directions, conceptually very similar but implementations differ

• Object-Oriented Database Systems (OODBMS)

• Object-Relational database Systems (ORDBMS) is the currently accepted model, part of the SQL:1999 standard. Extends the relational model, borrows concepts from OO programming languages.

• implemented by Oracle, PostgreSQL and others

The Dinky Entertainment Company

• About the company

• Location: Hollywood, CA

• Main assets: cartoon characters, e.g., Herbert the Worm

• Products: film shows, voice and video-footage licenses (e.g., for action figures, video games)

DBMS manages everything!

• New data types required

• user-defined abstract data types (ADTs): image, sound, video, with functions and operators

• type constructors: sets, tuples, arrays

• inheritance: low-resolution and high-resolution images are images

Why an RDBMS won’t do

create table Frames (
    frame_number number primary key,
    image BLOB,
    category number);

BLOB = binary large object: no structure / semantics; cannot issue any conditional queries against the image attribute

Enter ORDBMS

• user-defined data types possible for attributes

• complex attributes are possible (non-1NF)

• reference types (pointers) - why do we need them? why are these potentially problematic?

• inheritance

User-Defined Abstract Data Types

I’ve got one word for you: encapsulation!

• Users define new types, with their operations (methods).

• Define how to read and output objects of the new type

• Define the size of the objects of the new type


create abstract data type jpeg_image(internallength=VARIABLE, input=jpeg_in, output=jpeg_out);

create function is_sunrise(jpeg_image) returns boolean as external name '/usr/local/bin/dinky.jar';

• Atomic and user-defined types

• Type constructors:

• row (f1 t1, …, fn tn) - a tuple of n fields, where fi is the name of the field and ti is its type

• Listof (base), Arrayof(base), Setof(base), Bagof(base)


Reference types

• Objects have an OID

• Consequences? ref / deref

• Examples:

• ref(theater_t)

• setof(ref(arrayof(integer)))

• Shallow vs. deep equality

• deep equality is defined recursively for complex types
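
The difference between the two notions of equality can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Ref` class, the `store` dictionary standing in for the database, and the sample theater tuples are all hypothetical, and the sketch assumes references never form a cycle.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Ref:
    """A reference value: it holds only the OID of the object it points to."""
    oid: int

# Hypothetical object store: OID -> object contents.
# Objects 1 and 2 are distinct objects that happen to have identical contents.
store = {
    1: ("Majestic", 300),
    2: ("Majestic", 300),
}

def deref(ref):
    """Follow a reference to the object it identifies."""
    return store[ref.oid]

def shallow_equal(a, b):
    """References are shallow-equal only if they carry the same OID."""
    if isinstance(a, Ref) and isinstance(b, Ref):
        return a.oid == b.oid
    return a == b

def deep_equal(a, b):
    """Recursive definition: follow references, compare structures part by part."""
    if isinstance(a, Ref) and isinstance(b, Ref):
        return deep_equal(deref(a), deref(b))
    if isinstance(a, (tuple, list)) and isinstance(b, (tuple, list)):
        return len(a) == len(b) and all(deep_equal(x, y) for x, y in zip(a, b))
    return a == b

r1, r2 = Ref(1), Ref(2)
print(shallow_equal(r1, r2))  # different OIDs
print(deep_equal(r1, r2))     # same contents after dereferencing
```

Two references to different objects with the same contents are deep-equal but not shallow-equal; two references with the same OID are both.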


Inheritance

• To reuse and refine type definitions

• To create hierarchies of collections of similar but not identical objects


create type superhero_t under superbeing_t (strength, power);

Substitution principle: Given a supertype A and a subtype B, it is always possible to substitute an object of type B into a legal expression written for objects of type A.
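
The substitution principle can be illustrated with ordinary class inheritance. This is a rough Python analogy, not ORDBMS syntax: the `name` field of the supertype and the `introduce` function are assumptions made for the example; only the subtype's added `strength` and `power` fields come from the type definition above.

```python
from dataclasses import dataclass

@dataclass
class Superbeing:
    name: str            # hypothetical field; the supertype's fields are not given

@dataclass
class Superhero(Superbeing):
    strength: int        # fields added by the subtype
    power: str

def introduce(being: Superbeing) -> str:
    """An expression written for the supertype; it uses only supertype fields."""
    return f"This is {being.name}"

hero = Superhero(name="Herbert", strength=9, power="burrowing")
print(introduce(hero))   # a Superhero is accepted wherever a Superbeing is expected
```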


ORDBMS Implementation

• Physical data layout

- nested objects, arrays

- objects that vary in size over their lifetime

• Access methods

- indexes on predicates

- indexes over collection hierarchies

• Query processing

- New aggregates

1. specify what to do with the first object (e.g., sum = 0)

2. specify what to do on each next object (e.g., sum += item)

3. specify what to do on the last object (e.g., avg = sum / cnt)

- Method caching for expensive predicates
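
The three-step recipe for new aggregates can be sketched as a small Python class. The method names `initialize`, `iterate` and `terminate`, and the driver `run_aggregate`, are illustrative assumptions and not the interface of any particular DBMS.

```python
class AvgAggregate:
    """A user-defined average, expressed as the three aggregate steps."""

    def initialize(self):
        # Step 1: set up state before the first object is seen.
        self.total = 0.0
        self.count = 0

    def iterate(self, item):
        # Step 2: fold each next object into the running state.
        self.total += item
        self.count += 1

    def terminate(self):
        # Step 3: produce the result after the last object.
        return self.total / self.count if self.count else None

def run_aggregate(aggregate, items):
    """What the query engine would do: drive the three steps over a scan."""
    aggregate.initialize()
    for item in items:
        aggregate.iterate(item)
    return aggregate.terminate()

print(run_aggregate(AvgAggregate(), [2, 4, 6]))
```

Because the engine only calls these three hooks, the same driver works for any aggregate that supplies them.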


Query optimization

• Using new indexes

• what where-clause conditions are matched by the index

• what does it cost to fetch a tuple using the index - either supplied or measured by the DBMS directly

• Reduction factor and cost estimation for ADT methods

• the cost of evaluating a selection condition is no longer negligible! Must consider both selectivity and cost.

• a selectivity of 1/N means that 1 in N tuples will pass the selection condition


Query optimization example


Retrieve all photos in which Herbert is enjoying the sunrise.

$\sigma_{\text{isSunrise}() \wedge \text{isHerbert}()}(R)$

N=100,000 images in R

isHerbert: cost c1 = 0.5 sec/image, returns true for r1 = 10% of the images in R

isSunrise: cost c2 = 0.01 sec/image, returns true for r2 = 20% of the images in R

$\text{cost}(\sigma_{\text{isHerbert}}(\sigma_{\text{isSunrise}}(R))) = N c_2 + N r_2 c_1 = 11{,}000$ sec (better)

$\text{cost}(\sigma_{\text{isSunrise}}(\sigma_{\text{isHerbert}}(R))) = N c_1 + N r_1 c_2 = 50{,}100$ sec
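
The arithmetic behind the two plan costs can be checked with a short Python sketch. The parameter values are the ones stated on the slide; the function and variable names are illustrative.

```python
N = 100_000                          # images in R
C_HERBERT, R_HERBERT = 0.5, 0.10     # cost (sec/image), fraction of images passing
C_SUNRISE, R_SUNRISE = 0.01, 0.20

def two_predicate_cost(n, c_first, r_first, c_second):
    """Apply the first predicate to all n tuples, the second only to the survivors."""
    return n * c_first + n * r_first * c_second

sunrise_first = two_predicate_cost(N, C_SUNRISE, R_SUNRISE, C_HERBERT)
herbert_first = two_predicate_cost(N, C_HERBERT, R_HERBERT, C_SUNRISE)

print(f"isSunrise first: {sunrise_first:,.0f} sec")
print(f"isHerbert first: {herbert_first:,.0f} sec")
```

Running the cheap predicate first pays off here because the expensive predicate then sees only 20% of the images.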


Query optimization example


Retrieve all photos of Herbert in which there is no sunrise.

N=100,000 images in R

$\sigma_{\text{NOT isSunrise}() \wedge \text{isHerbert}()}(R)$

isHerbert: cost c1 = 0.5 sec/image, returns true for r1 = 10% of the images in R

NOT isSunrise: cost c2 = 0.01 sec/image, returns true for 1 - r2 = 80% of the images in R

$\text{cost}(\sigma_{\text{NOT isSunrise}}(\sigma_{\text{isHerbert}}(R))) = N c_1 + N r_1 c_2 = 50{,}100$ sec (same as in the previous case!)

$\text{cost}(\sigma_{\text{isHerbert}}(\sigma_{\text{NOT isSunrise}}(R))) = N c_2 + N (1 - r_2) c_1 = 41{,}000$ sec (better)


The order of predicates

For n predicates, there are n! orders

Suppose that there are N tuples in the input


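$$\text{cost}(\sigma_n(\sigma_{n-1}(\ldots(\sigma_1(R))\ldots))) = c_1 N + c_2 (N r_1) + c_3 (N r_1 r_2) + \ldots + c_n (N r_1 r_2 \cdots r_{n-1})$$

Suppose that $\text{cost}(\sigma_2(\sigma_1(R))) \le \text{cost}(\sigma_1(\sigma_2(R)))$. Then:

$$N(c_1 + c_2 r_1) \le N(c_2 + c_1 r_2)$$

$$c_1 + c_2 r_1 \le c_2 + c_1 r_2$$

$$c_2 (r_1 - 1) \le c_1 (r_2 - 1)$$

$$\frac{r_1 - 1}{c_1} \le \frac{r_2 - 1}{c_2}$$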
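
The general cost formula, and the n! search space it implies, can be explored with a brute-force sketch in Python. The function names are illustrative, and `isOutdoor` with its cost and reduction factor is a made-up third predicate added only to make the search non-trivial.

```python
from itertools import permutations

def pipeline_cost(n, predicates):
    """Cost of applying (cost_per_tuple, reduction_factor) pairs in the given order."""
    total, surviving = 0.0, float(n)
    for cost, reduction in predicates:
        total += cost * surviving      # every surviving tuple pays this predicate's cost
        surviving *= reduction         # only a fraction moves on to the next predicate
    return total

def best_order(n, named_predicates):
    """Try all n! orders and return (names, cost) of the cheapest one."""
    best = min(
        permutations(named_predicates.items()),
        key=lambda order: pipeline_cost(n, [params for _, params in order]),
    )
    names = tuple(name for name, _ in best)
    return names, pipeline_cost(n, [params for _, params in best])

predicates = {
    "isHerbert": (0.5, 0.10),
    "isSunrise": (0.01, 0.20),
    "isOutdoor": (0.05, 0.50),   # hypothetical predicate, not from the lecture
}
order, cost = best_order(100_000, predicates)
print(order, round(cost))
```

Exhaustive search is fine for three predicates but grows factorially, which is why a closed-form ordering criterion such as the inequality derived above is useful.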


Query optimization

Reduction factor and cost estimation for ADT methods

1. compute the rank of each condition involving an ADT method

2. order conditions by increasing rank, process in that order


$$\text{rank} = \frac{\text{reductionFactor} - 1}{\text{cost}}$$

isHerbert: cost c1 = 0.5 sec/image, returns true for r1 = 10% of the images in R

isSunrise: cost c2 = 0.01 sec/image, returns true for r2 = 20% of the images in R

$$\text{rank}(\text{isSunrise}) = \frac{r_2 - 1}{c_2} = \frac{0.2 - 1}{0.01} = -80$$

$$\text{rank}(\text{isHerbert}) = \frac{r_1 - 1}{c_1} = \frac{0.1 - 1}{0.5} = -1.8$$

Since −80 < −1.8, isSunrise should be executed before isHerbert, i.e., this plan is optimal: $\sigma_{\text{isHerbert}}(\sigma_{\text{isSunrise}}(R))$
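
The two-step procedure (compute ranks, then sort by increasing rank) is short enough to sketch directly in Python. The costs and reduction factors are those given for isHerbert and isSunrise; the function and variable names are illustrative.

```python
def rank(cost, reduction_factor):
    """rank = (reductionFactor - 1) / cost; more negative means evaluate earlier."""
    return (reduction_factor - 1) / cost

# name -> (cost in sec/image, reduction factor)
predicates = {
    "isHerbert": (0.5, 0.10),
    "isSunrise": (0.01, 0.20),
}

# Step 1: compute the rank of each condition.
ranks = {name: rank(cost, rf) for name, (cost, rf) in predicates.items()}

# Step 2: order conditions by increasing rank and process them in that order.
evaluation_order = sorted(predicates, key=ranks.get)

for name in evaluation_order:
    print(f"{name}: rank = {ranks[name]:.1f}")
```

Sorting takes O(n log n) time, in contrast to enumerating all n! orders.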
