CS 500: Fundamentals of Databases
Storage and indexing. Query evaluation.
supplementary material: Ch. 8.1-8.4, 10, 11, 13.1-13.3
Julia Stoyanovich ([email protected])
Outline
• The I/O model of computation
• Storage and indexing
• Overview of query optimization
• Object-relational databases
Architecture of a typical DBMS
Figure: layers of a typical DBMS. Application → Query evaluation engine (parser, compiler, optimizer, evaluator) → Concurrency control (transaction manager, lock manager), Recovery manager, and Storage manager (index/record manager, buffer manager, disk space manager) → Storage.
The memory hierarchy
Figure (based on Figure 9.1 in R&G): CPU, Cache, Main memory, Magnetic disk, Tape. A request for data travels down the hierarchy; data satisfying the request travels back up.
• Cache: access time = 10 nsec
• Main memory: primary storage (volatile), access time = 10-100 nsec
• Magnetic disk: secondary storage (stable), access time = 10-15 msec; need to consider seek, rotation, transfer times
• Tape: tertiary storage (stable), only for sequential access
1 nsec = $10^{-9}$ sec; 1 msec = $10^{-3}$ sec
Buffer management in a DBMS
Figure: a buffer pool (volatile) of frames in main memory, each either a free frame or holding a disk page; pages on disk (stable).
• Data must be in main memory for a DBMS to operate on it
• The unit of transfer between disk and memory is a block; reading / writing a disk block is called an I/O operation. Assume that disk page and memory block are of the same size, use these terms interchangeably.
• Table of <frame#, page#> pairs is maintained by the buffer manager
• Different page replacement policy than in general OS tasks. Why?
Disk structure
• Time to read or write a block depends on location of the data
• I/O dominates the running time of database operations
• access time = seek time + rotation delay + transfer time
• If data is used together, it should be co-located
• Sequential vs. random access
Figure labels: sector, platters, spindle, disk head, arm movement, arm assembly, tracks, cylinder.
Rotation speed: 5400 RPM
Number of platters: 1-30
Number of tracks <= 10,000
Disk access characteristics
• access time = seek time + rotation delay + transfer time
• seek time = time for the head to reach the cylinder (10-40ms)
• rotation delay = time for the sector to rotate under the head (10ms)
• transfer time = time to read or write the block, at a transfer rate of 10MB/sec
• disks read / write 1 block at a time (typically 4KB)
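The access-time formula can be checked with a few lines of arithmetic. This is a minimal sketch that plugs in the slide's figures (10-40ms seek, 10ms rotation, 10MB/sec, 4KB blocks); the function name and the convention 1 MB = 1024 KB are assumptions made for illustration.

```python
def access_time_ms(seek_ms, rotation_ms, block_kb, transfer_mb_per_sec):
    """access time = seek time + rotation delay + transfer time, in msec."""
    # Assumption: 1 MB = 1024 KB; transfer time = block size / transfer rate.
    transfer_ms = block_kb / (transfer_mb_per_sec * 1024) * 1000
    return seek_ms + rotation_ms + transfer_ms

best = access_time_ms(10, 10, 4, 10)   # shortest seek from the slide
worst = access_time_ms(40, 10, 4, 10)  # longest seek from the slide
print(f"{best:.2f} ms to {worst:.2f} ms per random 4KB block")
```

Seek and rotation dominate: the transfer itself takes well under 1 msec, which is why co-locating data that is used together pays off.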
The I/O model of computation
• In main-memory algorithms we care about CPU time
• In databases the cost is dominated by I/O
• Assumption: cost is given only by I/O
• Consequences: need to redesign certain algorithms
• Will illustrate here with sorting: Sort 1GB of data with 1MB of RAM
• Needed in many relational operations: projection, order by, grouping, some join algorithms
2-way merge-sort
Example with N=7 pages (2 keys per page; | separates pages, || separates runs):
Input file: 3,4 | 6,2 | 9,4 | 8,7 | 5,6 | 3,1 | 2
PASS 0 (1-page runs): 3,4 || 2,6 || 4,9 || 7,8 || 5,6 || 1,3 || 2
PASS 1 (2-page runs): 2,3 | 4,6 || 4,7 | 8,9 || 1,3 | 5,6 || 2
PASS 2 (4-page runs): 2,3 | 4,4 | 6,7 | 8,9 || 1,2 | 3,5 | 6
PASS 3 (8-page runs): 1,2 | 2,3 | 3,4 | 4,5 | 6,6 | 7,8 | 9
2-way merge-sort
Suppose the input occupies $N = 2^k$ disk pages.
• Pass 0: read in the input 1 page at a time, sort the page in memory (e.g., using QuickSort), write it out. This produces $2^k$ sorted runs of 1 page each. Uses 1 buffer page at a time.
• Passes 1, 2, …, k-1: read in pairs of sorted runs, 1 page from each at a time, merge them (using 1 page), output to disk. This produces $2^{k-1}, 2^{k-2}, \ldots$ sorted runs. Uses 3 buffer pages at a time.
• Pass k produces one sorted run of $2^k$ pages. Uses 3 buffer pages at a time.
Figure: Disk → main memory buffers (INPUT 1, INPUT 2, OUTPUT) → Disk
This algorithm requires a total of 3 buffer pages!
2-way merge-sort
• What is the cost of this algorithm?
• In each pass, we read each page, process it, and write it out: 2 disk I/Os per page, per pass
• With $N = 2^k$ input pages there are $k + 1 = \log_2 N + 1$ passes (passes 0 through k)
• The overall cost is $2N(\log_2 N + 1)$ I/Os
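The passes and the I/O count can be traced with a small in-memory simulation. This is only a sketch: pages are modelled as Python lists, no real disk is involved, and the function name is made up for illustration. I/O is charged at 2 per page per pass, as in the cost argument above.

```python
import heapq

def two_way_external_sort(pages):
    """Simulate 2-way external merge sort on a non-empty list of pages.

    Returns (sorted_pages, number_of_passes, io_count)."""
    page_size = max(len(p) for p in pages)
    n = len(pages)
    # Pass 0: sort each page on its own; every page is read and written once.
    runs = [[sorted(p)] for p in pages]      # a run is a list of pages
    passes, io = 1, 2 * n
    while len(runs) > 1:
        merged_runs = []
        for i in range(0, len(runs), 2):     # merge runs pairwise
            pair = runs[i:i + 2]
            keys = list(heapq.merge(*[[k for page in run for k in page] for run in pair]))
            merged_runs.append([keys[j:j + page_size] for j in range(0, len(keys), page_size)])
        runs = merged_runs
        passes += 1
        io += 2 * n                          # each page read and written once per pass
    return runs[0], passes, io

example = [[3, 4], [6, 2], [9, 4], [8, 7], [5, 6], [3, 1], [2]]
sorted_pages, passes, io = two_way_external_sort(example)
print(sorted_pages, passes, io)
```

For the 7-page example the simulation needs 4 passes, which agrees with rounding $\log_2 7$ up to 3 and adding pass 0.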
Generalization: external merge-sort
N records, each of size R (NR total input size). Main memory size M. Disk block size B.
Phase 0: load M bytes into memory, sort, write out a sorted run; do this NR/M times.
Figure: Disk → M bytes of main memory (M/R records) → Disk; the output is a sequence of sorted runs of size M.
Result: N records, divided into NR / M sorted runs of M / R records each
Generalization: external merge-sort
Phase 1, 2, …: merge intermediate runs into a new run
Figure: M bytes of main memory are divided into M/B block-sized buffers, used as input buffers (Input 1, Input 2, …) and an Output buffer; sorted runs of size M are read from disk, merged, and written back to disk as longer runs.
Generalization: external merge-sort
Figure: N records, divided into NR / M sorted runs of M / R records each, are merged pass by pass into progressively longer runs, ending with the final sorted result.
Cost of external merge-sort
Given B = 4KB, M = 64MB, R = 0.1KB
Pass 1: runs of 40*16*1024 = about 640,000 records
Pass 2: runs increase by a factor of M/B - 1 = about 16,000: sorted runs of about 10,240,000,000 records
Pass 3: runs increase by a factor of M/B - 1 = about 16,000: sorted runs of about $10^{14}$ records
With a modest memory size, we can sort everything in 2-3 passes!
B: block size; M: main memory size (in blocks); N: input size (pages); R: size of 1 record
$\text{Cost} = 2N\left(\log_{M-1}\frac{N}{M} + 1\right)$
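The pass count behind the cost formula can be computed without logarithms by repeatedly dividing the number of runs. The sketch below assumes M buffer pages, runs of M pages after the first pass, and (M - 1)-way merges afterwards; the function names and the 10^9-page input are hypothetical.

```python
def external_sort_passes(n_pages, m_pages):
    """Number of passes to sort n_pages of input with m_pages of buffer memory."""
    runs = -(-n_pages // m_pages)            # ceiling division: runs after the first pass
    passes = 1
    while runs > 1:
        runs = -(-runs // (m_pages - 1))     # each later pass merges M - 1 runs into one
        passes += 1
    return passes

def external_sort_cost(n_pages, m_pages):
    """Total I/Os: every pass reads and writes every page once."""
    return 2 * n_pages * external_sort_passes(n_pages, m_pages)

m_blocks = 64 * 1024 // 4                    # M = 64MB of 4KB blocks
records_per_run = 40 * m_blocks              # 40 whole records of 0.1KB per block
big_input = 10**9                            # hypothetical input of 10^9 pages (about 4TB)
print(records_per_run, external_sort_passes(big_input, m_blocks))
```

Even for the hypothetical 4TB input, three passes are enough, which matches the "2-3 passes" claim.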
Outline
• The I/O model of computation
• Storage and indexing
• Overview of query optimization
• Object-relational databases
Storage and indexing
• How do we efficiently store large amounts of data?
• The answer depends on how the data will be accessed!
• Primary storage of the data: heap, hashed, sorted
• Additional indexing: tree-based and hash-based
• Data records are stored in files. Each record has a unique identifier called record id, or rid.
• Cost model: ignore CPU cost, focus on I/O
• Average-case analysis based on simplifying assumptions
Basic file organization
• Heap files: good for full file scans or frequent updates
• unordered files
• insert at the end of file
• supports retrieval of all records, or equality selection on rid (exactly one match - why?)
• Sorted files: good for range queries on sort field(s)
• need external sort to keep sorted
• compacted after deletion
• assumes selection on sort field(s)
• Hashed files: good for selection on equality
• collection of buckets with primary & overflow pages
• hashing function h(r) = bucket for record r
• each bucket is a heap file (unsorted)
Cost of operations (average case)
Operation | Heap File | Sorted File | Hashed File
Scan all recs | p(T) D | p(T) D | 1.25 p(T) D *
Equality Search | p(T) D / 2 ** | D log_2 p(T) | D
Range Search | p(T) D | D log_2 p(T) + (# pages with matches) | 1.25 p(T) D
Insert | 2D | Search + p(T) D *** | 2D
Delete | Search + D | Search + p(T) D *** | 2D
p(T) - number of data pages in table T
D - time to read or write a disk page
* assuming no overflow bucket, 80% page occupancy
** assuming search on candidate key
*** search, insert or delete (in the middle on average), then move all records 1 position to the right or left, at a cost of 2D per page, for p(T) / 2 pages
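To get a feel for the magnitudes, the average-case costs can be tabulated for a concrete table size. This is a sketch under the stated assumptions (80% occupancy for hashed files, candidate-key equality search); the function name is hypothetical, and the pages with matches in a sorted-file range search are charged D apiece.

```python
import math

def average_costs(p, d, match_pages=1):
    """Average-case cost of each operation for a table of p pages; d = one page I/O."""
    sorted_search = d * math.log2(p)         # binary search over the pages
    heap_search = p * d / 2                  # scan half the file on average
    return {
        "scan":     {"heap": p * d,           "sorted": p * d,                           "hashed": 1.25 * p * d},
        "equality": {"heap": heap_search,     "sorted": sorted_search,                   "hashed": d},
        "range":    {"heap": p * d,           "sorted": sorted_search + d * match_pages, "hashed": 1.25 * p * d},
        "insert":   {"heap": 2 * d,           "sorted": sorted_search + p * d,           "hashed": 2 * d},
        "delete":   {"heap": heap_search + d, "sorted": sorted_search + p * d,           "hashed": 2 * d},
    }

costs = average_costs(p=1024, d=1)
for operation, by_file in costs.items():
    print(operation, by_file)
```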
Indexing
• An index (plural: indexes) represents the fundamental trade-off between space and processing time
• An index on a file speeds up selections on the search key attributes
• any subset of the fields of a relation can be the search key for an index on the relation
• search key is not the same as candidate key!
• An index is a collection of data entries that supports efficient retrieval of all data entries k* with a given search key value k
“If you don’t find it in the index, look very carefully through the entire catalog.”
Sears, Roebuck and Co., Consumer’s Guide, 1897
Alternatives
• 3 alternatives for what to store as a data entry
1. data entry = actual data record with search key value k
2. <k, rid of data record with search key value k>
3. <k, list of rids of data records with search key value k>
• At most one index should use Alternative 1 - why?
• Alternative 3 is more compact than Alternative 2, but records are variable-length
More on index types
• Clustered index: order of records in the file that stores the data records is the same as, or close to, the order of data entries in the index
• Alternative 1 index is clustered by definition
• Alternative 2 or 3 index is usually not clustered
• (Alternative 2 or 3 indexes are only clustered if data records are sorted on the search key; in practice this is rare)
• Cost for using an index to answer a query varies greatly depending on whether the index is clustered or unclustered
• Primary index: an index on the primary key or on a superkey. Any other index is called a secondary index.
• Unique index: search key is a superkey
• Tree-based index, e.g., B+ tree vs. hash-based index
Clustered vs. unclustered index
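Figure: CLUSTERED vs. UNCLUSTERED. In both cases the data entries (index file) point to data records (data file). With a clustered index the data records are stored in the same order as the data entries; with an unclustered index they are not.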
Composite search keys
• Composite search keys support search on a combination of fields
• Equality (point) query: every field value is equal to a constant, e.g., w.r.t. <sal, age> index age=12 and sal = 75
• Range query: some field value is not constant, e.g., age=12 and sal >10 or age < 45 and sal > 15
Data records sorted by name (name, age, sal):
bob 12 10
cal 11 80
joe 12 20
sue 13 75
Data entries in index sorted by <age, sal>: 11,80 12,10 12,20 13,75
Data entries in index sorted by <sal, age>: 10,12 20,12 75,13 80,11
Data entries sorted by <age>: 11 12 12 13
Data entries sorted by <sal>: 10 20 75 80
Examples of composite key indexes using lexicographic order.
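Lexicographic order on a composite key is the same order Python uses for tuples, so the example can be imitated with a sorted list and binary search. This is an illustrative sketch only: the name stands in for the rid, and the `scan` helper is hypothetical.

```python
import bisect
import math

# Records from the example: (name, age, sal).
records = [("bob", 12, 10), ("cal", 11, 80), ("joe", 12, 20), ("sue", 13, 75)]

# Data entries of an <age, sal> index: (search key, rid), sorted lexicographically.
entries = sorted(((age, sal), name) for name, age, sal in records)
keys = [key for key, _ in entries]

def scan(low_key, high_key, low_inclusive=True):
    """Return rids of entries whose key lies between low_key and high_key."""
    find_start = bisect.bisect_left if low_inclusive else bisect.bisect_right
    start = find_start(keys, low_key)
    end = bisect.bisect_right(keys, high_key)
    return [rid for _, rid in entries[start:end]]

equality = scan((12, 20), (12, 20))                           # age=12 and sal=20
age_only = scan((12, -math.inf), (12, math.inf))              # age=12: prefix of the key
ranged = scan((12, 10), (12, math.inf), low_inclusive=False)  # age=12 and sal>10
print(equality, age_only, ranged)
```

A condition on sal alone would not map to one contiguous slice of this list, which is why the order of attributes in a composite key matters.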
Hash-based indexes
• Good for equality selections, cost = 1 I/O
• Index is a collection of buckets
• bucket = primary page plus zero or more overflow pages
• buckets contain data entries
• Hashing function h: h(r) = bucket in which data entry for record r belongs. h looks at the search key fields of r.
Tree-based indexes
• ISAM (static, clustered) - Indexed Sequential Access Method
• B+ tree (dynamic, height-balanced)
Figure: non-leaf pages above leaf pages (sorted by search key); an index page has the form P0 K1 P1 K2 P2 … Km Pm
• good for equality and range selections
• cost = tree height + # pages with matching records
B+ tree: the DB World’s favorite index
• Insert / delete at $\log_F N$ cost
• F = fanout, N = # leaf pages
• keep tree height-balanced
• Minimum 50% occupancy, except for root
• each node contains d<= m <= 2d entries, d is called the order of the tree (measure of tree node capacity)
• B+ tree supports equality and range queries efficiently
Figure: index entries (non-leaf levels) above data entries (leaf level).
B+ tree search
• Start at root, use key comparisons to navigate to a leaf
• Search for 5*, 15*, all data entries >=24*
Figure (example B+ tree):
Root: 13 17 24 30
Leaves: 2* 3* 5* 7* | 14* 16* | 19* 20* 22* | 24* 27* 29* | 33* 34* 38* 39*
How many disk I/Os to answer a point query?
How many disk I/Os to answer a range query?
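The three searches can be traced on the example tree with a sketch that keeps the root keys and the leaves in plain lists. This is not a full B+ tree: it has one index level, and the helper names are made up for illustration.

```python
import bisect

root_keys = [13, 17, 24, 30]
leaves = [[2, 3, 5, 7], [14, 16], [19, 20, 22], [24, 27, 29], [33, 34, 38, 39]]

def find_leaf(key):
    """Index of the leaf that may hold key: follow pointer i where K_i <= key < K_(i+1)."""
    return bisect.bisect_right(root_keys, key)

def point_query(key):
    """True if key is present; reads the root page and one leaf page."""
    return key in leaves[find_leaf(key)]

def range_query(low):
    """All data entries >= low: descend to the first leaf, then scan leaves left to right."""
    result = []
    for leaf in leaves[find_leaf(low):]:
        result.extend(k for k in leaf if k >= low)
    return result

print(point_query(5), point_query(15), range_query(24))
```

In this two-level tree a point query touches 2 pages (root plus one leaf) if nothing is cached; the range query touches the root plus every leaf from the starting leaf onward.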
Insertion in a B+ tree
Insert(K,P)
• find leaf where K belongs, insert
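• if no overflow (2d keys or fewer) - done
• if overflow (2d+1 keys), split node, insert in parent
Figure (d = 2): the overflowing node with keys K1 K2 K3 K4 K5 and pointers P0 P1 P2 P3 P4 P5 is split into a left node (K1 K2; P0 P1 P2) and a right node (K4 K5; P3 P4 P5); the entry (K3, pointer to the right node) goes to the parent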
• if leaf, also keep K3 in right node (copy-up vs. push-up)
• when root splits, new root has only 1 key
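The difference between copy-up (leaf split) and push-up (non-leaf split) shows up clearly when only the key lists are considered. The sketch below ignores pointers and rids; both function names are hypothetical.

```python
def split_leaf(keys, d):
    """Split an overflowing leaf (2d+1 sorted keys); the middle key is copied up."""
    left, right = keys[:d], keys[d:]
    return left, right, right[0]       # the separator also stays in the right leaf

def split_internal(keys, d):
    """Split an overflowing non-leaf node (2d+1 sorted keys); the middle key is pushed up."""
    return keys[:d], keys[d + 1:], keys[d]   # the separator appears only in the parent

# Order d = 2, inserting 8* into the leaf 2* 3* 5* 7*:
leaf_split = split_leaf([2, 3, 5, 7, 8], d=2)
# The parent 13 17 24 30 receives 5 and overflows in turn:
node_split = split_internal([5, 13, 17, 24, 30], d=2)
print(leaf_split, node_split)
```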
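Inserting 8* example: push up
Figure: before the insert, the root holds 13 17 24 30 and the leaves are 2* 3* 5* 7* | 14* 16* | 19* 20* 22* | 24* 27* 29* | 33* 34* 38* 39*. Inserting 8* splits the first leaf into 2* 3* and 5* 7* 8*, and 5 is copied up. The root would then hold 5 13 17 24 30, so we need to split the node and push up: 17 moves to a new root, with children 5 13 and 24 30.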
Entry to be inserted in parent node. (Note that 17 is pushed up and only appears once in the index. Contrast this with a leaf split.)
Deletion from a B+ tree
• Start at root, find leaf L where the entry belongs
• Remove the entry
• if L is at least half-full, done
• If L has only d-1 entries
• try to re-distribute, borrowing from sibling (adjacent node with same parent as L)
• if re-distribution fails, merge L and sibling
• If merge occurred, must delete entry (pointing to L or sibling) from parent of L
• Merge could propagate to root, decreasing height
B+ trees in practice
• Typical order d=100, typical fill factor 67%, fanout = 133
• Typical capacities:
• Height 3: $133^3$ = 2,352,637 records
• Height 4: $133^4$ = 312,900,721 records
• Can usually hold top levels of the tree in buffer pool: level 3 is 133MB, level 4 is 18GB (assuming 8KB pages)
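The capacities follow directly from the fanout. A minimal sketch, assuming every node has the typical fanout of 133 (the function name is made up for illustration):

```python
FANOUT = 133  # order d = 100 at a 67% fill factor

def capacity(height, fanout=FANOUT):
    """Number of records reachable through a tree of the given height."""
    return fanout ** height

for height in (3, 4):
    print(f"height {height}: {capacity(height):,} records")
```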
“Nearly O(1)” access time to data - for equality or range queries!
Summary so far
• We briefly looked at the memory hierarchy. Disk is the most important storage device.
• We discussed alternative file and index organizations
• We argued for estimating the cost of an algorithm using I/O rather than CPU operations, and gave an example, external sort
• We saw an important index, B+ tree
• Next: overview of query optimization
Outline
• The I/O model of computation
• Storage and indexing
• Overview of query optimization
• Object-relational databases
Overview of query optimization
Given a SQL query, there may be different execution plans that produce the same result but that have different performance characteristics
• Goal of query optimization: find an efficient query execution plan for a given query
• Ideally: find the absolute best plan
• In reality: find a reasonable plan, avoid the really bad ones
• Query execution plan is represented by a tree of relational algebra operators, annotated with a choice of an algorithm for each operator
• Two main issues in query optimization:
1. Which plans are considered for a given query, that is, what is the search space of the query optimization algorithm?
2. How do we estimate the cost of a particular query execution plan?
Example of a query execution plan
SELECT S.sname
FROM Reserves R, Sailors S
WHERE R.sid=S.sid AND R.bid=100 AND S.rating>5
RA Tree: $\pi_{sname}(\sigma_{bid=100 \wedge rating>5}(Reserves \bowtie_{sid=sid} Sailors))$
Plan: the same tree, with the join evaluated by Simple Nested Loops and the selection and projection applied On-the-fly
• Query plans: data-flow graphs of relational algebra operators
• Typically determined by the query optimizer
Algebraic equivalences
• Query optimization relies on reasoning about equivalence among relational algebra expressions
• Equivalent expressions compute the same result (for all instances) but may differ in cost
• There is also cost associated with implementations of particular operators, of course (e.g., whether an index is used to retrieve data values)
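selection is commutative: $\sigma_{c_1}(\sigma_{c_2}(R)) = \sigma_{c_2}(\sigma_{c_1}(R))$
cascading selections: $\sigma_{c_1 \wedge c_2 \wedge \ldots \wedge c_n}(R) = \sigma_{c_1}(\sigma_{c_2}(\ldots\sigma_{c_n}(R)\ldots))$
cascading projections: $\pi_{a_1}(R) = \pi_{a_1}(\pi_{a_2}(\ldots\pi_{a_n}(R)\ldots))$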
Algebraic equivalences
join, cross product are commutative, associative:
$R \times S = S \times R$, $R \times (S \times T) = (R \times S) \times T$
$R \bowtie S = S \bowtie R$, $R \bowtie (S \bowtie T) = (R \bowtie S) \bowtie T$
pushing selections: $\sigma_c(R \bowtie S) = \sigma_c(R) \bowtie S$ (when c refers only to attributes of R)
do selection and projection commute? projection and join / cross product?
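Equivalences such as pushing a selection below a join can be spot-checked on tiny relations. The sketch below models relations as lists of dictionaries; the sample tuples are made up, and a check on one instance illustrates an equivalence rather than proving it.

```python
def select(rel, pred):
    """Selection: keep the tuples that satisfy pred."""
    return [t for t in rel if pred(t)]

def join(r, s, attr):
    """Equi-join of r and s on the shared attribute attr."""
    return [{**a, **b} for a in r for b in s if a[attr] == b[attr]]

def as_set(rel):
    """Order-independent view of a relation, for comparing results."""
    return {tuple(sorted(t.items())) for t in rel}

# Made-up sample instances.
reserves = [{"sid": 1, "bid": 100}, {"sid": 2, "bid": 101}, {"sid": 3, "bid": 100}]
sailors = [{"sid": 1, "rating": 7}, {"sid": 2, "rating": 9}, {"sid": 3, "rating": 4}]

def on_boat_100(t):
    return t["bid"] == 100    # refers only to attributes of Reserves

late = select(join(reserves, sailors, "sid"), on_boat_100)    # select after the join
early = join(select(reserves, on_boat_100), sailors, "sid")   # selection pushed down
print(as_set(late) == as_set(early))
```

Both expressions return the same tuples, but the second one joins 2 Reserves tuples instead of 3, which is the point of pushing selections.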
Examples
Sailors (sid, name, rating, age) Boats (bid, name, color) Reserves (sid, bid, day)
List ids of sailors who reserved boat 102
Write SQL queries, give several equivalent relational algebra expressions, show operator trees
select sid
from Reserves
where bid = 102
$\pi_{sid}(\sigma_{bid=102}(Reserves))$
$\pi_{sid}(\sigma_{bid=102}(\pi_{sid, bid}(Reserves)))$
Operator trees (bottom to top): Reserves → $\sigma_{bid=102}$ → $\pi_{sid}$; and Reserves → $\pi_{sid, bid}$ → $\sigma_{bid=102}$ → $\pi_{sid}$
Examples
Sailors (sid, name, rating, age) Boats (bid, name, color) Reserves (sid, bid, day)
List names of sailors who reserved boat 102
Write SQL queries, give several equivalent relational algebra expressions, show operator trees
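select S.name
from Reserves R, Sailors S
where R.sid = S.sid
and R.bid = 102
$\pi_{name}(\sigma_{bid=102 \wedge R.sid=S.sid}(R \times S))$
$\pi_{name}((\sigma_{bid=102}(R)) \bowtie S)$
Operator trees (bottom to top): R and S → $\times$ → $\sigma_{bid=102 \wedge R.sid=S.sid}$ → $\pi_{name}$; and R → $\sigma_{bid=102}$, joined ($\bowtie$) with S → $\pi_{name}$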
Examples
Sailors (sid, name, rating, age) Boats (bid, name, color) Reserves (sid, bid, day)
List names of sailors who reserved the red Interlake.
Write SQL queries, give several equivalent relational algebra expressions, show operator trees
select S.name
from Reserves R, Sailors S, Boats B
where R.sid = S.sid
and R.bid = B.bid
and B.name = 'Interlake'
and B.color = 'red'
$\pi_{Sailors.name}(((\sigma_{name='Interlake' \wedge color='red'}(Boats)) \bowtie Reserves) \bowtie Sailors)$
Operator tree (bottom to top): Boats → $\sigma_{name='Interlake' \wedge color='red'}$, joined with Reserves, then joined with Sailors → $\pi_{Sailors.name}$
Examples
Sailors (sid, name, rating, age) Boats (bid, name, color) Reserves (sid, bid, day)
List names of sailors who reserved the red Interlake.
Write SQL queries, give several equivalent relational algebra expressions, show operator trees
select S.name
from Reserves R, Sailors S, Boats B
where R.sid = S.sid
and R.bid = B.bid
and B.name = 'Interlake'
and B.color = 'red'
$\pi_{Sailors.name}((\sigma_{name='Interlake' \wedge color='red'}(Boats)) \bowtie (Reserves \bowtie Sailors))$
Operator tree (bottom to top): Reserves joined with Sailors; Boats → $\sigma_{name='Interlake' \wedge color='red'}$; the two results joined → $\pi_{Sailors.name}$
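The "red Interlake" query can be tried out with Python's built-in `sqlite3` module. This is a sketch with made-up sample rows; the second formulation filters Boats in a subquery first, mirroring the relational algebra expressions, and both should return the same answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Sailors (sid INTEGER PRIMARY KEY, name TEXT, rating INTEGER, age REAL);
    CREATE TABLE Boats (bid INTEGER PRIMARY KEY, name TEXT, color TEXT);
    CREATE TABLE Reserves (sid INTEGER, bid INTEGER, day TEXT);
    -- Made-up sample rows, for illustration only.
    INSERT INTO Sailors VALUES (1, 'Ann', 7, 35.0), (2, 'Bo', 9, 41.5);
    INSERT INTO Boats VALUES (101, 'Interlake', 'blue'), (102, 'Interlake', 'red');
    INSERT INTO Reserves VALUES (1, 102, '2011-01-01'), (2, 101, '2011-01-02');
""")

# Formulation 1: join everything, then apply the selection.
select_last = conn.execute("""
    SELECT S.name FROM Reserves R, Sailors S, Boats B
    WHERE R.sid = S.sid AND R.bid = B.bid
      AND B.name = 'Interlake' AND B.color = 'red'
""").fetchall()

# Formulation 2: select the red Interlake boats first, then join.
select_first = conn.execute("""
    SELECT S.name
    FROM (SELECT bid FROM Boats WHERE name = 'Interlake' AND color = 'red') B
    JOIN Reserves R ON R.bid = B.bid
    JOIN Sailors S ON S.sid = R.sid
""").fetchall()

print(select_last, select_first)
```

Which physical plan SQLite picks for either formulation is up to its own optimizer; the sketch only shows that the two formulations are equivalent on this instance.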
Common implementation techniques
• Of course, it is extremely important to efficiently implement individual relational algebra operators
• The following approaches are used for implementing different operators (high-level insights)
• Indexing: can use WHERE conditions to retrieve a small set of tuples (selection, join)
• Iteration: sometimes it is faster to scan all tuples even if there is an index. And sometimes we can scan the data entries in an index, rather than in the data file itself.
• Partitioning: By using sorting or hashing, we can partition the input tuples and replace an expensive operation by similar operations on smaller inputs.
Statistics and catalogs
• Need information about the relations and indexes. Catalogs typically contain at least:
• # tuples (NTuples) and # pages (NPages) for each relation
• # distinct key values (NKeys) and NPages for each index
• index height, low / high key values for each tree index
• Catalogs are updated periodically
• Updating whenever data changes is too expensive, lots of approximation anyway, so slight imprecision is OK
• More detailed information, e.g., histograms of the values in some field, is sometimes also stored
Access paths: example
Employees(id, dept, age);
select dept
from Employees
where age > ?
A1: clustered B+ tree index on id
A2: unclustered B+-tree index on age
A3: sequential scan of relation, stored in a sorted file on id
A4: unclustered hash index on age
Q1: age > 10
Q2: age > 40
Q3: age > 70
Access paths: example
Employees(id, dept, age);
select dept, count(*)
from Employees
where age > ?
group by dept
Q1: age > 10
Q2: age > 40
Q3: age > 70
A1: clustered B+ tree index on id
A2: unclustered B+-tree index on age
A3: sequential scan of relation, stored in a sorted file on id
A4: clustered B+tree index on dept
Access paths: example
Employees(id, dept, age);
select age, count(*)
from Employees
where age > ?
group by age
Q1: age > 10
Q2: age > 40
Q3: age > 70
A1: clustered B+ tree index on age
A2: unclustered B+ tree index on age
A3: sequential scan of relation, stored in a sorted file
Access paths
• An access path is a method of retrieving tuples: file scan, or index that matches a selection in the query
• An index matches a conjunction of terms if it can be used to retrieve all data values that match this conjunction of terms.
• A tree index matches a conjunction of terms that involve only attributes in a prefix of the search key.
• e.g., tree index <a,b,c> matches the selection a=5 AND b=3; it also matches a=5 AND b>4; it does not match b=3.
• A hash index matches a conjunction of terms that has a term attribute=value for every attribute in the search key of the index.
• e.g., hash index on <a,b,c> matches a=6 AND b=3 AND c=5; it does not match b=3; or a=5 and b=5; or a>5 AND b=3 and c=5
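The two matching rules can be written down as small predicates. In this sketch a selection is a list of (attribute, operator) terms and a search key is a tuple of attribute names; both function names are hypothetical, and the tree rule is implemented as stated above (the terms involve exactly the attributes of some prefix of the search key).

```python
def tree_index_matches(search_key, terms):
    """Tree index: the terms involve only attributes in a prefix of the search key."""
    attrs = {attr for attr, _ in terms}
    return bool(attrs) and attrs == set(search_key[:len(attrs)])

def hash_index_matches(search_key, terms):
    """Hash index: there is an equality term for every attribute in the search key."""
    equality_attrs = {attr for attr, op in terms if op == "="}
    return set(search_key) <= equality_attrs

key = ("a", "b", "c")
print(tree_index_matches(key, [("a", "="), ("b", "=")]),             # a=5 AND b=3
      tree_index_matches(key, [("a", "="), ("b", ">")]),             # a=5 AND b>4
      tree_index_matches(key, [("b", "=")]),                         # b=3
      hash_index_matches(key, [("a", "="), ("b", "="), ("c", "=")]), # a=6 AND b=3 AND c=5
      hash_index_matches(key, [("a", ">"), ("b", "="), ("c", "=")])) # a>5 AND b=3 AND c=5
```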
One approach to selection
• Find the most selective access path, retrieve tuples using it, and apply the remaining terms that do not match the index.
• The most selective access path: an index or file scan (!) that we estimate will require the fewest page I/Os.
• Terms that match this index reduce the number of tuples retrieved; other terms are used to discard some retrieved tuples, but do not affect the number of tuples / pages fetched.
• Example: day < 1/1/2011 AND bid=5 AND sid=3
• option 1: use a B+ tree index on day, then check bid=5 and sid=3 for each retrieved tuple
• option 2: use a hash index on <bid, sid>, then check day <1/1/2011 for each retrieved tuple
Once again, we are interested in quantifying the I/O-based cost
Index-only evaluation
• Many DBMSs implement index-only query plans: the query is answered using only the information in the search key of the index, without going to the data record on disk
• Important because typically only 1 index is clustered, and so using other indexes will potentially trigger several random I/Os
• Works well only with unclustered indexes
Using an index for selection
• Cost of finding qualifying data entries (typically small) plus cost of retrieving records (could be large)
• Example: assuming uniform distribution of rname, about 10% of tuples qualify (100 pages, 10,000 tuples).
• with a clustered index, cost is little more than 100 I/Os
• with an unclustered index, cost is up to 10,000 I/Os!
SELECT * FROM Reserves R WHERE R.rname < 'C%'
Reserves (R): 100 tuples per page, 1000 pages
Sailors (S): 80 tuples per page, 500 pages
Sailors (sid, name, rating, age);
Reserves (sid, bid, day, rname);
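The gap between the clustered and the unclustered case comes from whether qualifying tuples share pages. A rough sketch of the record-retrieval cost, ignoring the (typically small) cost of finding the data entries; the function name is made up for illustration.

```python
import math

def retrieval_io(pages, tuples_per_page, selectivity, clustered):
    """Page I/Os to fetch the qualifying records of a selection via an index."""
    matching_tuples = round(pages * tuples_per_page * selectivity)
    if clustered:
        # Qualifying records are stored together: read only the pages they fill.
        return math.ceil(matching_tuples / tuples_per_page)
    # Worst case for an unclustered index: one page I/O per qualifying record.
    return matching_tuples

clustered_cost = retrieval_io(1000, 100, 0.10, clustered=True)
unclustered_cost = retrieval_io(1000, 100, 0.10, clustered=False)
print(clustered_cost, unclustered_cost)
```

Note that the unclustered worst case (10,000 I/Os) is ten times the cost of simply scanning all 1000 pages of Reserves, so a file scan can be the better access path.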
Access paths: another example
Employees (eid, name, salary, age, did);
Departments (did, budget, floor, manager_eid);
Salaries from $10K to $100K; ages from 20 to 80; 5 employees per department; 10 floors; budgets from $10K to $1M. Uniform, uncorrelated values.
Q1. Print name, age, salary for all employees
A1: clustered hash index on (name, age, salary) of Employees
A2: unclustered hash index on (name, age, salary) of Employees
A3: clustered B+-tree index on (name, age, salary) of Employees
A4: unclustered hash index on (eid, did) of Employees
A5: sequential access of Employees, Departments
Access paths: another example
Employees (eid, name, salary, age, did);
Departments (did, budget, floor, manager_eid);
Salaries from $10K to $100K; ages from 20 to 80; 5 employees per department; 10 floors; budgets from $10K to $1M. Uniform, uncorrelated values.
Q2. Find dids of departments on the 10th floor with budget < $15K
A1: clustered hash index on (floor) of Departments
A2: clustered hash index on (floor, budget) of Departments
A3: clustered B+-tree index on (floor, budget) of Departments
A4: clustered B+-tree index on (budget) of Departments
A5: no index
Access paths: another example
Employees (eid, name, salary, age, did);
Departments (did, budget, floor, manager_eid);
Salaries from $10K to $100K; ages from 20 to 80; 5 employees per department; 10 floors; budgets from $10K to $1M. Uniform, uncorrelated values.
Q3. Compute average budget of departments per floor
A1: clustered hash index on (did, floor) of Departments
A2: clustered hash index on (floor) of Departments
A3: clustered B+-tree index on (did, floor, budget) of Departments
A4: clustered B+-tree index on (floor, budget, did) of Departments
A5: no index
Access paths: another example
Sailors (sid, name, rating, age);
Sids from 1 to 100K, ratings from 1 to 10, ages from 20 to 80. Uniform, uncorrelated values.
Q1. Print name, age, rating of all sailors
A1: sequential scan of sorted file, sorted on (id)
A2: clustered hash index on (rating)
A3: unclustered hash index on (id)
A4: unclustered hash index on (age, rating)
A5: unclustered hash index on (name, age)
A6: clustered B+-tree index on (name, age)
A7: unclustered B+-tree index on (age, rating)
Access paths: another example
Sailors (sid, name, rating, age);
Sids from 1 to 100K, ratings from 1 to 10, ages from 20 to 80. Uniform, uncorrelated values.
Q2. Print name, age, rating of the sailor with sid 123
A1: sequential scan of sorted file, sorted on (id)
A2: clustered hash index on (rating)
A3: unclustered hash index on (id)
A4: unclustered hash index on (age, rating)
A5: unclustered hash index on (name, age)
A6: clustered B+-tree index on (name, age)
A7: unclustered B+-tree index on (age, rating)
Access paths: another example
Sailors (sid, name, rating, age);
Sids from 1 to 100K, ratings from 1 to 10, ages from 20 to 80. Uniform, uncorrelated values.
Q3. Count sailors with rating = 5 and age < 40
A1: sequential scan of sorted file, sorted on (id)
A2: clustered hash index on (rating)
A3: unclustered hash index on (id)
A4: unclustered hash index on (age, rating)
A5: unclustered hash index on (name, age)
A6: clustered B+-tree index on (name, age)
A7: unclustered B+-tree index on (age, rating)
Access paths: another example
Sailors (sid, name, rating, age);
Sids from 1 to 100K, ratings from 1 to 10, ages from 20 to 80. Uniform, uncorrelated values.
Q4. Count sailors with rating = 5
A1: sequential scan of sorted file, sorted on (id)
A2: clustered hash index on (rating)
A3: unclustered hash index on (id)
A4: unclustered hash index on (age, rating)
A5: unclustered hash index on (name, age)
A6: clustered B+-tree index on (name, age)
A7: unclustered B+-tree index on (age, rating)
Access paths: another example
Sailors (sid, name, rating, age);
Sids from 1 to 100K, ratings from 1 to 10, ages from 20 to 80. Uniform, uncorrelated values.
Q5. Print name, age, rating of sailors with age < 40 and rating < 5
A1: sequential scan of sorted file, sorted on (id)
A2: clustered hash index on (rating)
A3: unclustered hash index on (id)
A4: unclustered hash index on (age, rating)
A5: unclustered hash index on (name, age)
A6: clustered B+-tree index on (name, age)
A7: unclustered B+-tree index on (age, rating)
Projection
• The expensive part of projection is removing duplicates (if we are retrieving sets, not bags)
• Sorting approach: sort on <sid,bid> and remove duplicates. This can be optimized by dropping unwanted information while sorting.
• Hashing approach: hash on <sid,bid> to create partitions. Load partitions into memory one at a time, build in-memory hash structure and eliminate duplicates.
• If there is an index with both R.sid and R.bid in the search key, it may be cheaper to sort data entries.
• Similar issues arise when processing the group-by operator
SELECT DISTINCT R.sid, R.bid FROM Reserves R
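The hashing approach to duplicate elimination can be sketched in memory as two phases. This is an illustration only: the partitions are Python lists rather than disk-resident files, and the function name and sample rows are made up.

```python
def distinct_by_hashing(rows, num_partitions=4):
    """Duplicate elimination on projected rows: partition by hash, then dedupe each partition."""
    # Partitioning phase: identical rows always land in the same partition.
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash(row) % num_partitions].append(row)
    # Duplicate-elimination phase: one partition at a time, using an in-memory hash set.
    result = []
    for part in partitions:
        seen = set()
        for row in part:
            if row not in seen:
                seen.add(row)
                result.append(row)
    return result

# Made-up (sid, bid) pairs projected from Reserves.
projected = [(1, 101), (1, 101), (2, 102), (1, 102), (2, 102)]
print(sorted(distinct_by_hashing(projected)))
```

Because duplicates can only occur within a partition, each partition can be processed independently, which is what makes the approach work when the whole input does not fit in memory.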
Join
• A highly optimized operation in a DBMS
• Several well-studied algorithms are available:
• Nested loops join family of algorithms
• Sort-merge join
• Hash join
• As before, the cost of a join is based on I/O (# pages exchanged between disk and memory)
• Best choice for a query depends on the characteristics of the query and the relations, as well as on the available indexes
• Also important to think about how the join fits within the overall query execution pipeline (an algorithm may be slower, but may have a side-effect that is desirable upstream)
Highlights of the System R optimizer
• Currently the most widely used approach, works well for < 10 joins
• A combination of rule-based and cost-based optimization
• Cost estimation: Approximate art at best.
• Statistics, maintained in system catalogs, used to estimate cost of operations and result sizes
• Consider combination of CPU and I/O cost, mostly interested in I/O cost
• Plan space: too large, must be pruned
• Only the space of left-deep plans is considered. Left-deep plans allow output of each operator to be pipelined into the next operator without storing it in a temporary relation.
• Cartesian products are avoided
Cost estimation
• For a given SQL query, there are multiple (often very many!) query execution plans
• A query execution plan is a tree of relational operators, each node is annotated with a particular implementation of the operator
• To cost a plan, we must estimate the cost of each operation
Size estimation and reduction factors
• Consider this query:
SELECT attribute list FROM relation list WHERE term1 AND ... AND termk
• Maximum # tuples in result is the product of the cardinalities of relations in the FROM clause
• Reduction factor (RF) associated with each term reflects the impact the term has in reducing result size
• Result cardinality = max # tuples * product of all RFs
• Implicitly assumes that terms are independent
• Term col=val has RF 1/NKeys(I), given index I on column col
• Term col1=col2 has RF 1/max(NKeys(I1), NKeys(I2))
• Term col > val has RF (high(I) - val) / (high(I) - low(I))
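The reduction-factor rules translate directly into arithmetic. The sketch below applies them to a query joining Reserves and Sailors with the terms R.sid=S.sid, R.bid=100 and S.rating>5; the relation sizes (100,000 and 40,000 tuples) follow from the page counts used elsewhere in these slides, while the NKeys values and the rating range 1 to 10 are assumptions made for illustration.

```python
import math

def rf_equals_value(nkeys):
    """Term col = val, given an index with nkeys distinct key values."""
    return 1 / nkeys

def rf_equals_column(nkeys1, nkeys2):
    """Term col1 = col2, given indexes on both columns."""
    return 1 / max(nkeys1, nkeys2)

def rf_greater_than(val, low, high):
    """Term col > val, given the low and high key values of an index."""
    return (high - val) / (high - low)

def result_cardinality(cardinalities, reduction_factors):
    """max # tuples * product of all RFs (terms assumed independent)."""
    return math.prod(cardinalities) * math.prod(reduction_factors)

estimate = result_cardinality(
    [100_000, 40_000],                      # Reserves, Sailors
    [
        rf_equals_column(40_000, 40_000),   # R.sid = S.sid (assumed 40,000 distinct sids)
        rf_equals_value(100),               # R.bid = 100 (assumed 100 distinct boats)
        rf_greater_than(5, low=1, high=10), # S.rating > 5 (ratings assumed 1 to 10)
    ],
)
print(round(estimate))
```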
Example
Sailors (sid:int, sname: string, rating:int, age:real)
Reserves (sid:int, bid:int, day:date, rname:string)
Reserves (R): each tuple is 40 bytes long, 100 tuples per page, 1000 pages
Sailors (S): each tuple is 50 bytes long, 80 tuples per page, 500 pages
SELECT S.sname FROM Reserves R, Sailors S WHERE R.sid=S.sid AND R.bid=100 AND S.rating>5
Simple nested loops join algorithm:
for each tuple r in R do
  for each tuple s in S do
    if r_i == s_j (here, r.sid == s.sid) then add <r, s> to result
• for each tuple in R we scan the entire relation S.
• the cost of scanning R is 1000 I/Os
• the cost of scanning S is 500 I/Os, this is done 1000 * 100 times
• total cost = 1000 + (100 * 1000 * 500) = 1000 + $5 \times 10^7$ I/Os; with 10ms per I/O, this is about 140 hours!
• a simple refinement is to compute the join page-at-a-time
• total cost = 1000 + 1000 * 500 = 501,000 I/Os; can we do better?
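The two cost figures can be reproduced with a few lines of arithmetic. A minimal sketch, with Reserves as the outer relation; the function names are made up for illustration.

```python
def tuple_nested_loops_io(outer_pages, outer_tuples_per_page, inner_pages):
    """Scan the outer relation once, and the inner relation once per outer tuple."""
    return outer_pages + outer_pages * outer_tuples_per_page * inner_pages

def page_nested_loops_io(outer_pages, inner_pages):
    """Scan the outer relation once, and the inner relation once per outer page."""
    return outer_pages + outer_pages * inner_pages

tuple_cost = tuple_nested_loops_io(1000, 100, 500)   # Reserves outer, Sailors inner
page_cost = page_nested_loops_io(1000, 500)
hours = tuple_cost * 0.010 / 3600                    # at 10ms per I/O
print(tuple_cost, page_cost, round(hours, 1))
```

Swapping the roles, `page_nested_loops_io(500, 1000)` gives 500,500 I/Os, so the smaller relation is the slightly better choice for the outer loop.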
Example (continued)
SELECT S.sname FROM Reserves R, Sailors S WHERE R.sid=S.sid AND R.bid=100 AND S.rating>5
Plan: $\pi_{sname}$ (On-the-fly) over $\sigma_{bid=100 \wedge rating>5}$ (On-the-fly) over Reserves $\bowtie_{sid=sid}$ Sailors (Simple Nested Loops)
• Cost with page-at-a-time nested loops join = 500 + 500 * 1000 I/O
• This is by no means the worst plan! can you think of a plan that’s worse?
• Nonetheless, this plan misses several opportunities: selections could have been pushed, we could have used indexes
Goal of optimization: To find reasonably efficient plans while avoiding the absolutely terrible ones
Choice of indexes to build
What indexes should we create?
• Which relations should have indexes? What fields should be the search key? Should we build several indexes?
• For each index, what kind of an index should it be?
• One approach
• consider the most important queries in your workload
• consider the best query plan given the current indexes, and see whether a better plan becomes possible if an additional index is available
• Keep in mind that indexes (1) take up space; (2) require maintenance under updates (time). So there is a cost!
Choice of indexes to build: guidelines
• Attributes in WHERE are candidates for index keys
• exact match conditions suggest hash index
• range conditions suggest tree index
• Multi-attribute search keys should be considered when query contains several conditions on the same relation
• order of attributes is important for range queries
• such indexes can enable index-only strategies for important queries
• Try to choose indexes that benefit as many queries as possible
Understanding the workload
• For each query
• Which relations does it access?
• Which attributes are retrieved?
• Which attributes are involved in selection / join conditions? How selective are these conditions likely to be?
• For each update in the workload
• Which attributes are involved in selection / join conditions? How selective are these conditions likely to be?
• What is the type of update (INSERT / DELETE / UPDATE), and what attributes are affected?
Outline
• The I/O model of computation
• Storage and indexing
• Overview of query optimization
• Object-relational databases
Motivating example
California Department of Water Resources: 500,000 photos, with captions
select id
from Photos P, Landmarks L, Landmarks S
where sunset(P.picture)
and contains(P.caption, L.name)
and L.location |20| S.location
and S.name = 'Sacramento, CA'
Find sunset pictures of landmarks within 20 miles of Sacramento, CA.
Note: user-defined functions (UDFs) and operators
Note: query optimization must be re-considered
create table Photos ( id number primary key, ts date, caption document, picture photo_CD_image);
create table Landmarks ( name varchar(64) primary key, location point);
Object-database systems
• Relational databases: relations are in first normal form. Clean and simple. But the world is more complex!
• There is sometimes a need to accommodate
• complex data types / nesting
• inheritance hierarchies
• Two directions, conceptually very similar but implementations differ
• Object-Oriented Database Systems (OODBMS)
• Object-Relational Database Systems (ORDBMS): the currently accepted model, part of the SQL:1999 standard. Extends the relational model and borrows concepts from OO programming languages.
• implemented by Oracle, PostgreSQL and others
The Dinky Entertainment Company
• About the company
• Location: Hollywood, CA
• Main assets: cartoon characters, e.g., Herbert the Worm
• Products: film shows, voice and video-footage licenses (e.g., for action figures, video games)
DBMS manages everything!
• New data types required
• user-defined abstract data types (ADTs): image, sound, video, with functions and operators
• type constructors: sets, tuples, arrays
• inheritance: low-resolution and high-resolution images are images
Why an RDBMS won’t do
create table Frames ( frame_number number primary key, image BLOB, category number);
BLOB = binary large object
no structure / semantics
cannot issue any conditional queries against the image attribute
Enter ORDBMS
• user-defined data types possible for attributes
• complex attributes are possible (non-1NF)
• reference types (pointers) - why do we need them? why are these potentially problematic?
• inheritance
User-Defined Abstract Data Types
I've got one word for you: encapsulation!
• Users define new types, with their operations (methods).
• Define how to read and output objects of the new type
• Define the size of the objects of the new type
create abstract data type jpeg_image(internallength=VARIABLE, input=jpeg_in, output=jpeg_out);
create function is_sunrise(jpeg_image) returns boolean as external name '/usr/local/bin/dinky.jar';
• Atomic and user-defined types
• Type constructors:
• row (f1 t1, …, fn tn) - a tuple of n fields, where fi is the name of the field and ti is its type
• Listof (base), Arrayof(base), Setof(base), Bagof(base)
Reference types
• Objects have an OID
• Consequences? ref / deref
• Examples:
• ref(theater_t)
• setof(ref(arrayof(integer)))
• Shallow vs. deep equality
• deep equality is defined recursively for complex types
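The difference between the two notions of equality can be sketched in a few lines. This is an illustrative model only, assuming a hypothetical in-memory object store keyed by OID and a hypothetical `Ref` class standing in for a reference type; the theater values are made up. It does not handle cyclic references.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """A reference value: holds only the OID of the object it points to."""
    oid: int


# Hypothetical object store mapping OIDs to object values.
store = {
    1: ("Majestic", 400),  # a theater-like tuple: (name, seats)
    2: ("Majestic", 400),  # a distinct object with an identical value
}


def shallow_equal(a, b):
    """References are shallow-equal only if they carry the same OID."""
    return a == b


def deep_equal(a, b):
    """Recursively compare values, following references into the store."""
    if isinstance(a, Ref) and isinstance(b, Ref):
        return deep_equal(store[a.oid], store[b.oid])
    if isinstance(a, (tuple, list)) and isinstance(b, (tuple, list)):
        return len(a) == len(b) and all(deep_equal(x, y) for x, y in zip(a, b))
    return a == b


r1, r2 = Ref(1), Ref(2)
print(shallow_equal(r1, r2))  # False: different OIDs
print(deep_equal(r1, r2))     # True: the referenced objects have equal values
```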
Inheritance
• To reuse and refine type definitions
• To create hierarchies of collections of similar but not identical objects
create type superhero_t under superbeing_t (strength, power);
Substitution principle: Given a supertype A and a subtype B, it is always possible to substitute an object of type B into a legal expression written for objects of type A.
ORDBMS Implementation
• Physical data layout
- nested objects, arrays
- objects that vary in size over their lifetime
• Access methods
- indexes on predicates
- indexes over collection hierarchies
• Query processing
- New aggregates
1. specify what to do with the first object (e.g., sum=0)
2. specify what to do on the next object (e.g., sum+=item)
3. specify what to do on the last object (e.g., avg = sum / cnt)
- Method caching for expensive predicates
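The three-step recipe for new aggregates listed above maps naturally onto the user-defined aggregate interface of Python's built-in sqlite3 module, where the constructor, `step`, and `finalize` play the three roles. This is a minimal sketch; the aggregate name `my_avg` and the table `Items` are hypothetical, and SQLite stands in for an ORDBMS only for illustration.

```python
import sqlite3


class Avg:
    """User-defined average, split into the three steps listed above."""

    def __init__(self):
        # 1. First object: initialize the running state.
        self.total = 0.0
        self.count = 0

    def step(self, value):
        # 2. Next object: fold one item into the running state.
        if value is not None:
            self.total += value
            self.count += 1

    def finalize(self):
        # 3. Last object: produce the result from the accumulated state.
        return self.total / self.count if self.count else None


conn = sqlite3.connect(":memory:")
conn.create_aggregate("my_avg", 1, Avg)
conn.execute("CREATE TABLE Items (x REAL)")
conn.executemany("INSERT INTO Items VALUES (?)", [(1.0,), (2.0,), (3.0,)])
result = conn.execute("SELECT my_avg(x) FROM Items").fetchone()[0]
print(result)  # 2.0
```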
Query optimization
• Using new indexes
• what where-clause conditions are matched by the index
• what does it cost to fetch a tuple using the index - either supplied or measured by the DBMS directly
• Reduction factor and cost estimation for ADT methods
• the effect of evaluating a selection condition is no longer negligible! Must consider both selectivity and cost.
• a selectivity of 1/N means that 1 in N tuples will pass the selection condition
Query optimization example
$\sigma_{\text{isSunrise}() \wedge \text{isHerbert}()}(R)$
Retrieve all photos in which Herbert is enjoying the sunrise.
N = 100,000 images in R
isHerbert: cost $c_1$ = 0.5 sec/image, returns true for $r_1$ = 10% of the images in R
isSunrise: cost $c_2$ = 0.01 sec/image, returns true for $r_2$ = 20% of the images in R
$\text{cost}(\sigma_{\text{isHerbert}}(\sigma_{\text{isSunrise}}(R))) = N c_2 + N r_2 c_1 = 11{,}000$ sec (better)
$\text{cost}(\sigma_{\text{isSunrise}}(\sigma_{\text{isHerbert}}(R))) = N c_1 + N r_1 c_2 = 50{,}100$ sec
Query optimization example
Retrieve all photos of Herbert in which there is no sunrise.
N=100,000 images in R
isHerbert: cost $c_1$ = 0.5 sec/image, returns true for $r_1$ = 10% of the images in R
NOT isSunrise: cost $c_2$ = 0.01 sec/image, returns true for $1 - r_2$ = 80% of the images in R
$\sigma_{\text{NOT isSunrise}() \wedge \text{isHerbert}()}(R)$
$\text{cost}(\sigma_{\text{NOT isSunrise}}(\sigma_{\text{isHerbert}}(R))) = N c_1 + N r_1 c_2 = 50{,}100$ sec (same as in the previous case!)
$\text{cost}(\sigma_{\text{isHerbert}}(\sigma_{\text{NOT isSunrise}}(R))) = N c_2 + N (1 - r_2) c_1 = 41{,}000$ sec (better)
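The four costs in this example and the previous one can be reproduced with a short script. This is a minimal sketch in Python using the numbers from the slides; the function name and the (cost, selectivity) tuple layout are illustrative choices, and the predicates are assumed to be independent.

```python
def two_step_cost(n, first, second):
    """Cost of applying `first` to all n tuples, then `second` to the survivors.

    Each predicate is a (cost_per_tuple, selectivity) pair.
    """
    c_first, r_first = first
    c_second, _ = second
    return n * c_first + n * r_first * c_second


N = 100_000
is_herbert = (0.5, 0.10)    # c1 = 0.5 sec/image, r1 = 10%
is_sunrise = (0.01, 0.20)   # c2 = 0.01 sec/image, r2 = 20%
not_sunrise = (0.01, 0.80)  # same cost, selectivity 1 - r2 = 80%

for label, first, second in [
    ("isSunrise, then isHerbert", is_sunrise, is_herbert),
    ("isHerbert, then isSunrise", is_herbert, is_sunrise),
    ("NOT isSunrise, then isHerbert", not_sunrise, is_herbert),
    ("isHerbert, then NOT isSunrise", is_herbert, not_sunrise),
]:
    print(f"{label}: {two_step_cost(N, first, second):,.0f} sec")
```

Negating the cheap predicate raises its selectivity from 20% to 80%, so running it first saves far less work on the expensive predicate, although it still comes out ahead with these numbers.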
The order of predicates
For n predicates, there are n! orders
Suppose that there are N tuples in the input
$$\text{cost}(\sigma_n(\sigma_{n-1}(\ldots(\sigma_1(R))\ldots))) = c_1 N + c_2 (N r_1) + c_3 (N r_1 r_2) + \ldots + c_n (N r_1 r_2 \cdots r_{n-1})$$
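Suppose that $\text{cost}(\sigma_2(\sigma_1(R))) \le \text{cost}(\sigma_1(\sigma_2(R)))$. Then:
$$N(c_1 + c_2 r_1) \le N(c_2 + c_1 r_2)$$
$$c_1 + c_2 r_1 \le c_2 + c_1 r_2$$
$$c_2 (r_1 - 1) \le c_1 (r_2 - 1)$$
$$\frac{r_1 - 1}{c_1} \le \frac{r_2 - 1}{c_2}$$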
Query optimization
Reduction factor and cost estimation for ADT methods
1. compute the rank of each condition involving an ADT method
2. order conditions by increasing rank, process in that order
$$\text{rank} = \frac{\text{reductionFactor} - 1}{\text{cost}}$$
isHerbert: cost $c_1$ = 0.5 sec/image, returns true for $r_1$ = 10% of the images in R
isSunrise: cost $c_2$ = 0.01 sec/image, returns true for $r_2$ = 20% of the images in R
$$\text{rank}(\text{isSunrise}) = \frac{r_2 - 1}{c_2} = \frac{0.2 - 1}{0.01} = -80$$
$$\text{rank}(\text{isHerbert}) = \frac{r_1 - 1}{c_1} = \frac{0.1 - 1}{0.5} = -1.8$$
Meaning, isSunrise should be executed before isHerbert, i.e., this plan is optimal: $\sigma_{\text{isHerbert}}(\sigma_{\text{isSunrise}}(R))$
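The rank rule can be compared against exhaustive enumeration of all n! orders. This is a minimal sketch in Python, assuming independent predicates and the cost formula from the previous slide; the function names are illustrative, and the third predicate (`isIndoors`) is hypothetical, added only so that there are more than two orders to compare.

```python
from itertools import permutations


def pipeline_cost(n, preds):
    """Cost of applying predicates in the given order.

    Each predicate is (name, cost_per_tuple, reduction_factor); a predicate
    is evaluated only on tuples that passed all earlier ones.
    """
    total, surviving = 0.0, float(n)
    for _, cost, reduction in preds:
        total += cost * surviving
        surviving *= reduction
    return total


def rank(pred):
    """rank = (reductionFactor - 1) / cost, as defined above."""
    _, cost, reduction = pred
    return (reduction - 1) / cost


def order_by_rank(preds):
    """Process conditions in order of increasing rank."""
    return sorted(preds, key=rank)


N = 100_000
preds = [
    ("isHerbert", 0.5, 0.10),
    ("isSunrise", 0.01, 0.20),
    ("isIndoors", 0.1, 0.50),  # hypothetical third predicate, not from the lecture
]

by_rank = order_by_rank(preds)
brute_force = min(permutations(preds), key=lambda p: pipeline_cost(N, p))

print([name for name, _, _ in by_rank])
print(pipeline_cost(N, by_rank), pipeline_cost(N, brute_force))
```

Sorting by rank takes O(n log n) time, whereas the brute-force search grows as n!, which is why the rank rule is attractive when a query has many expensive predicates.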