
Assignment II
• Critique "Enhancing P2P File-Sharing with an Internet-Scale Query Processor", by Boon Thau Loo (UC Berkeley), Joseph M. Hellerstein (UC Berkeley and Intel Research Berkeley), Ryan Huebsch (UC Berkeley), Scott Shenker (UC Berkeley and International Computer Science Institute), Ion Stoica (UC Berkeley), VLDB 2004
• Due on Sept 30, 2004
• % contributed to the final grade: 8%
• Total score: 100 points
• The critique includes
– Your name and last four digits of your university ID
– (20 points) Discuss originality of the work (what's new in this paper that has never been done previously)
– (50 points) 2 page summary of the proposed work (how does the technique work, how performance evaluation is performed)
– (20 points) Discuss drawbacks of the proposed work (in your opinion)
  • Drawback of the proposed technique
  • Drawback of the performance evaluation
– (10 points) What would you do differently to solve the same problem?

Centralized Query Optimization

Static Query Optimization: Relies on database statistics to decide the query execution plan; done after query decomposition
•System R
•More widely used than dynamic optimization

Dynamic Query Optimization: Combines query decomposition and optimization; doesn't rely on database statistics, but uses the actual results of the intermediate relations
•INGRES

Static Query Optimization: System R

Input: Query tree from the query decomposition step
Output: Cheapest query execution plan

Solution Sketch:
• Consider some subsets of all possible plans (e.g., Cartesian product plan is eliminated)

• Assign cost to candidate trees and choose the one with the cheapest cost

– A plan that applies selection first may still not be chosen if it costs more in the later steps

• Utilize cost model for each low-level operation (e.g., clustered B+-tree index with a range predicate)

System R
• Pass I: Enumerate all single-relation plans. Retain the cheapest plan(s) for each relation.

• Pass II: Generate all two-relation plans by considering each single-relation plan after Pass I as the outer relation and every other relation as the inner relation.

• There are as many passes as the number of relations involved in the join.

Example

Query:
SELECT *
FROM EMP, ASG, PROJ
WHERE EMP.ENO = ASG.ENO
AND ASG.PNO = PROJ.PNO
AND PNAME = 'CAD/CAM'

EMP has an index on ENO.
ASG has an index on PNO.
PROJ has an index on PNO and another index on PNAME.

Assume that only nested loops join and merge-join are considered.

Enumeration of Left-Deep Plans

Input: an algebraic query tree

Pass I: Enumerate all single-relation plans. Retain the cheapest plan(s) for each relation.
•Cheapest plan where tuples are not in any order (Plan A).
•Cheapest plan where tuples are in a sort order (Plan B).
•If plan B is cheaper than plan A, retain only plan B.

Example:
EMP: file scan (no selection on EMP)
ASG: file scan (no selection on ASG)
PROJ: index on PNAME (selection on PNAME)

Cost computation

• EMP
– Compute the cost for file scan
• ASG
– Compute the cost for file scan on ASG
• PROJ
– Compute the cost for file scan
– Compute the cost for using index on PNAME

I/O Cost for File Scan

• Scan on R: Read every tuple from R
– Record Format
– Page Format
– File Format
– Cardinality of R
– The simplest cost assumes fixed length record; one page has as many records as possible; pages are stored consecutively on disk
• Number of pages to read ≈ {Card(R)*RecordSize(R)}/{pagesize-headersize}

Example
Sailors(sid,sname,rating,age)
– Each Sailors tuple is 50 bytes long (fixed length record format)
– A page size is 4K bytes
– #tuples: 40,000
– All pages are full, unpacked bitmap; 96 bytes are reserved for slot directory
– How many pages for Sailors?
• One page can contain at most $(4096 - 96)/50 = 80$ tuples
• Sailors occupies $40000/80 = 500$ pages
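The page-count arithmetic of the Sailors example can be checked with a short Python sketch. The function names are illustrative only (they do not come from the lecture), and the sketch assumes fixed-length records that never span pages.

```python
import math

def records_per_page(page_size, reserved_bytes, record_size):
    """Whole fixed-length records that fit in a page after the reserved bytes."""
    return (page_size - reserved_bytes) // record_size

def pages_for_relation(num_tuples, page_size, reserved_bytes, record_size):
    """Pages needed when every page is filled to capacity."""
    per_page = records_per_page(page_size, reserved_bytes, record_size)
    return math.ceil(num_tuples / per_page)

# Sailors: 50-byte tuples, 4K pages, 96 reserved bytes, 40,000 tuples
print(records_per_page(4096, 96, 50))            # 80
print(pages_for_relation(40_000, 4096, 96, 50))  # 500
```

The number of pages is also the I/O cost of a full file scan under this simple model.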

Selection Operators

$\sigma_{R.A \ \mathrm{op} \ \mathrm{value}}(R)$

No index, unsorted data: file scan
No index, sorted data
Index
•Clustered/unclustered
•Dense/Sparse
•Tree-based index/Hash-based index
•Data entries format: Alternatives 1,2,3

Record Formats: Fixed Length

• Information about field types is stored in system catalogs.

• Li: Size of field i in bytes

[Figure: a record with fields F1 to F4 of sizes L1 to L4, stored from base address B; for example, the address of F3 is B+L1+L2.]

Record Formats: Variable Length
• Two alternative formats (# fields is fixed):
– Fields Delimited by Special Symbols: a field count (4) followed by F1 $ F2 $ F3 $ F4 $
– Array of Field Offsets: an offset array followed by F1 F2 F3 F4; can be used for fixed length fields or variable length fields

Second offers direct access to i'th field, efficient storage of nulls; small directory overhead.

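Page Formats: Fixed Length Records

[Figure: two page layouts. PACKED: Slot 1 to Slot N stored contiguously, followed by free space, with the number of records (N) kept on the page. UNPACKED, BITMAP: Slot 1 to Slot M with a bitmap of M bits marking occupied slots, and the number of slots (M) kept on the page.]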

Record id = <page id, slot #>. In first alternative, moving records for free space management changes rid; may not be acceptable.

Page Formats: Variable Length Records

[Figure: page i holds records with Rid = (i,1), Rid = (i,2), ..., Rid = (i,N). A SLOT DIRECTORY at the end of the page stores one entry per slot, the # slots (N), and a pointer to the start of free space.]

Can move records on page without changing rid; so, attractive for fixed-length records too.

File Formats

Heap files (unordered files): The data in the pages of a heap file is not ordered.

• Heap File: Suitable when typical access is a file scan retrieving all records.

• Sorted File: Best if records must be retrieved in some order, or only a 'range' of records is needed.

• Hashed File: Good for equality selections.

Heap File Implemented as a List

• The header page id and heap file name must be stored someplace.

• Each page contains at least 2 'pointers' plus data.

[Figure: a header page linked to two lists of data pages: full pages, and pages with free space.]

Heap File Using a Page Directory

• The entry for a page can include the number of free bytes on the page.

• The directory is a collection of pages;
• linked list implementation is just one alternative.
– Much smaller than linked list of all HF pages!

[Figure: a DIRECTORY, starting at the header page, whose entries point to Data Page 1, Data Page 2, ..., Data Page N.]

Hashed files

•File is a collection of buckets.
•Bucket = primary page plus zero or more overflow pages.

•Hashing function h: h(r) = bucket in which record r belongs. h looks at only some of the fields of r, called the search fields.

[Figure: file hashed on age (hash function h1), records shown as name, age, sal:
h(age)=00: Smith, 44, 3000; Jones, 40, 6003; Tracy, 44, 5004
h(age)=01: Ashby, 25, 3000; Basu, 33, 4003; Bristow, 29, 2007
h(age)=10: Cass, 50, 5004; Daniels, 22, 6003]

Implementation of Relational Operators/Estimated Cost


Cost for File Scan

Sailors(sid,sname,rating,age)
– Each Sailors tuple is 50 bytes long (fixed length record format)
– A page size is 4K bytes
– #tuples: 40,000
– All pages are full, unpacked bitmap; 96 bytes are reserved for slot directory
– How many pages for Sailors?
• One page can contain at most $(4096 - 96)/50 = 80$ tuples
• Sailors occupies $40000/80 = 500$ pages

Reserves(sid,bid,day,rname)
– Each Reserves tuple is 40 bytes long
– Total #tuples = 100,000
– Given the same page size, page format as in the previous example, we have
• A page can hold 4000/40=100 Reserves tuples
• #Pages for Reserves: 100,000/100=1000 pages

No index, unsorted file

$\sigma_{R.A = \mathrm{value}}(R)$
Suppose R is Reserves relation

Best access path: File Scan
I/O Cost: 1000 pages
I/O time cost: 1000 * time to access each page
Complexity: O(|R|)

Notation: |R| is the number of pages in R

No Index, sorted file on R.A

$\sigma_{R.A = \mathrm{value}}(R)$
Suppose R is Reserves

Sorted order: Ascending
Best Access Path:
•Binary search to locate the first tuple with R.A=Value
•Scan the remaining records
I/O Cost:
•$\log_2(|R|)$ + cost of scan for remaining tuples (0 ~ |R|)
•$\log_2(1000) \approx 10$, plus the cost of retrieving the remaining tuples
•About 10 + 0.1*1000 = 110 when the reduction factor is 0.1
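As a quick check of the sorted-file estimate, the following Python sketch adds the binary-search cost to the scan cost. The function name is hypothetical, and rounding both terms up to whole pages is an assumption of this sketch.

```python
import math

def sorted_file_selection_cost(num_pages, reduction_factor):
    """I/O estimate: binary search for the first match, then scan the matching pages."""
    search_io = math.ceil(math.log2(num_pages))
    # round() guards against floating-point noise before taking the ceiling
    scan_io = math.ceil(round(reduction_factor * num_pages, 6))
    return search_io + scan_io

# Reserves: 1000 pages, reduction factor 0.1
print(sorted_file_selection_cost(1000, 0.1))  # 110
```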


B+Tree Index on R.A

$\sigma_{R.A = \mathrm{value}}(R)$

• Need to know
– What is the format of the data entries in the leaf nodes of the index?
– Whether the index is clustered or unclustered, dense/sparse
– Cost of traversing from the root to the leaf (<4 I/Os, typically) + cost of retrieving the pages in the sequence set + cost of retrieving the pages containing the data records.

Indexes

• Index: A data structure that helps speed up selections of data.

• It can be viewed as a set of data entries, each denoted as k*.

– Any subset of the fields of a relation can be the search key for an index on the relation.

– Search key is not the same as key (minimal set of fields that uniquely identify a record in a relation).

– Tree structured Index vs Hash-based Index

Alternatives for Data Entry k* in Index

Alternative 1: Data entry is the actual tuple
– Tuples organized according to the search key value
– Allow only one Alternative 1 index per relation

Alternative 2: Data entry is <k, rid of data record with search key value k>
– Ex. <50, (1, 10)>: search key value = 50, rid tells that the data is in page 1, slot 10

Alternative 3: Data entry is <k, list of rids of data records with search key k>
– Ex: <50, (1,10), (2,20), (3,1)>: search key value = 50; three record IDs

Pros and Cons for different data entries format

– Data entries are typically much smaller than data records. So, Alternatives 2 and 3 are better than Alternative 1 with large data records, especially if search keys are small.

– If more than one index is required on a given file, at most one index can use Alternative 1; the rest must use Alternatives 2 or 3.

– Alternative 3 is more compact than Alternative 2, but leads to variable sized data entries even if search keys are of fixed length.

Alternatives for Data Entries in an Index

•k* is an actual data record (with search key value k)
•<k,rid>
•<k,rid-list>

[Figure: CLUSTERED INDEX vs. UNCLUSTERED INDEX. In both, index entries direct the search for data entries (index file), and the data entries point to data records (data file).]

B+Tree: The Most Widely Used Index for RDBMS

• Support equality and range-searches efficiently
• Balanced tree
• Search/Insert/delete at $\log_F N$ cost (F = fanout, N = # leaf pages)
• Minimum 50% occupancy for each node except for root
• Each node except root contains d <= m <= 2d entries. The root node contains 1 <= m <= 2d entries where d is the order of the tree.

[Figure: index entries (direct search) on top of the data entries (the "sequence set").]

Example B+Tree
• Search begins at root, and key comparisons direct it to a leaf.
• Search for tuples whose search key >=5
• Follow the left pointer if the desired value is less than the value in the node
• Otherwise, follow the right pointer

Based on the search for 15*, we know it is not in the tree!

Root: 13 17 24 30
Leaves: 2* 3* 5* 7* | 14* 16* | 19* 20* 22* | 24* 27* 29* | 33* 34* 38* 39*

[Figure: a data file with records Ashby, 25, 3000; Smith, 44, 3000; Bristow, 30, 2007; Basu, 33, 4003; Cass, 50, 5004; Tracy, 44, 5004; Daniels, 22, 6003; Jones, 40, 6003. A hashed index on sal (hash function h2) stores the data entries 3000, 3000, 5004, 5004 in bucket h2(sal)=00 and 4003, 2007, 6003, 6003 in bucket h2(sal)=11; each data entry points to its data record.]

[Figure: a sparse, clustered tree index with entries Ashby, Cass, Smith and a dense, unclustered index over the same data file.]

Dense: At least one data entry per data record
Sparse: At least one data entry per block

Index Classification
• Primary vs. secondary: If search key contains primary key, then called primary index.
– Unique index: Search key contains a candidate key.

• Clustered vs. unclustered: If order of data records is the same as, or 'close to', order of data entries, then called clustered index.
– Clustered index is good for a range search query.

• Dense vs. Sparse:
– Dense: At least one data entry per data record
– Sparse: At least one data entry per block

B+Tree Index on R.A

$\sigma_{R.A = \mathrm{value}}(R)$

– Suppose R is Reserves table
– Format of Data Entry: Alternative 2 <k, rid>
– Size of data entry = 20 bytes
– Page size=4K bytes; 96 bytes are reserved for header
– Unclustered and dense
– Reduction Factor: 0.1
– Total Cost =
• Cost of traversing from the root to the leaf (assume 4 I/Os) +
• Cost of retrieving the pages of data entries in the sequence set +
• Cost of retrieving pages containing the data records

• B+tree, Alternative 2, dense, unclustered
• I/O cost of retrieving pages of qualifying data entries
– Dense: matching data entries: 0.1*100000=10000 entries
– #Data entries per page: $(4096 - 96)/20 = 200$
– Pages of matching data entries = 10000/200 = 50 pages

• I/O cost of retrieving qualifying tuples
– 10000 pages: since the index is unclustered, the qualifying tuples are not always in the same order as the data entries.
– In the worst case, for each qualifying data entry, one I/O is needed

• Total I/O Cost = 4 + 50 + 10,000 = 10,054 pages

• B+tree, Alternative 2, dense, clustered
• I/O cost of retrieving pages of qualifying data entries
– Matching data entries: 0.1*100000=10000 entries
– #Data entries per page: $(4096 - 96)/20 = 200$
– #Pages of matching data entries = $10000/200 = 50$

• I/O cost of retrieving qualifying tuples
– #Matching tuples: 10000
– Since the index is dense and clustered, the qualifying tuples are also clustered
– # pages: 10000/100=100 due to 100 tuples per page

• Total I/O Cost = 4 + 50 + 100 = 154 pages

• B+tree, Alternative 2, sparse (must be clustered)
• I/O cost of retrieving qualifying tuples
– #Matching tuples: 0.1*100,000=10000
– Since the index is sparse, one data entry points to a page.
– # pages for matching tuples: $10000/100 = 100$ since a page can hold 100 tuples

• I/O cost of retrieving pages of qualifying data entries
– #pages of matching tuples: 100
– #Data entries per page: $(4096 - 96)/20 = 200$
– #Pages of matching data entries = $\lceil 100/200 \rceil = 1$ page

• Total I/O Cost = 4 + 1 + 100 = 105 pages
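The three B+tree estimates above follow one pattern: traversal cost, plus pages of matching data entries, plus pages of matching tuples. The Python sketch below reproduces that arithmetic for the Reserves numbers. The function name and parameter names are hypothetical, and the unclustered case uses the worst-case assumption of one I/O per qualifying tuple, as in the slides.

```python
import math

def btree_selection_cost(num_tuples, tuples_per_page, entries_per_page,
                         reduction_factor, dense, clustered, traversal_io=4):
    """I/O estimate for an equality selection through an Alternative 2 B+tree."""
    if not dense and not clustered:
        raise ValueError("a sparse index must be clustered")
    matching = round(reduction_factor * num_tuples)
    data_pages = math.ceil(matching / tuples_per_page)
    if dense:
        # one data entry per matching record
        entry_pages = math.ceil(matching / entries_per_page)
    else:
        # sparse: one data entry per page of matching records
        entry_pages = math.ceil(data_pages / entries_per_page)
    # unclustered: worst case of one I/O per qualifying tuple
    tuple_io = data_pages if clustered else matching
    return traversal_io + entry_pages + tuple_io

reserves = dict(num_tuples=100_000, tuples_per_page=100,
                entries_per_page=200, reduction_factor=0.1)
print(btree_selection_cost(**reserves, dense=True, clustered=False))   # 10054
print(btree_selection_cost(**reserves, dense=True, clustered=True))    # 154
print(btree_selection_cost(**reserves, dense=False, clustered=True))   # 105
```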

Hash-based Index

• Hash-based index on A is good for equality search on attribute A

• Cannot support range search

$\sigma_{R.A = \mathrm{value}}(R)$

• Suppose R is Reserves
• Reduction Factor: 0.1
• Data entry format: Alternative 2: 20 bytes per data entry
• Cost for searching the matching data entry: 1.2 I/O

• I/O cost = cost for retrieving the matching data entries+cost for retrieving the qualifying tuples

• I/O cost of retrieving pages of matching data entries
– Matching data entries: 0.1*100000=10000 entries
– I/O cost = 10000*1.2 = 12000 I/Os

• I/O cost for retrieving the qualifying tuples = 12000 I/Os

• Total cost = 12000+12000 = 24000 I/Os

• Hash-based indexes are unclustered

Cost comparisons among various access paths for Equality Selection on Reserves

• File scan: 1000
• Sorted File: 110
• Dense, unclustered B+tree = 10054
• Dense, clustered B+tree = 154
• Sparse, B+tree = 105
• Static Hash index = 24000

• Select the cheapest access path: by these estimates, the sparse B+tree (105 I/Os), with the sorted file (110 I/Os) a close second

$\sigma_{R.A = \mathrm{value}}(R)$

Enumeration of Left-Deep Plans (Cont’d)

Pass II: Generate all two-relation plans by considering each single-relation plan after Pass I as the outer relation and every other relation as the inner relation.

For each join method of A joins B, determine the best access method for B. Consider also the plans that involve the join attribute of B.

If a sort-merge join is used and the access method for B does not produce the result in a sorted order according to the join attribute of B, cost of sorting must be included.

Example
EMP has an index on ENO.
ASG has an index on PNO.
PROJ has an index on PNO and another index on PNAME.

Suppose these are the cheapest plans from Pass I:
EMP: file scan (no selection on EMP)
ASG: file scan (no selection on ASG)
PROJ: index on PNAME (selection on PROJ.PNAME)

Plans of EMP join with PROJ are removed because they use Cartesian product.
Compute the cost of the following plans and select the cheapest one.

EMP (file scan) (nested loops join) ASG (file scan)
EMP (file scan) (merge-join) ASG (file scan)

ASG (file scan) (nested loops join) PROJ (file scan)
ASG (file scan) (nested loops join) PROJ (index on PNO)
ASG (file scan) (merge-join) PROJ (file scan)
ASG (file scan) (merge-join) PROJ (index on PNO)

PROJ (index on PNAME) (nested loops join) ASG (file scan)
PROJ (index on PNAME) (nested loops join) ASG (index on PNO)
PROJ (index on PNAME) (merge join) ASG (file scan)
PROJ (index on PNAME) (merge join) ASG (index on PNO)

Suppose plan X is the cheapest.

Plan X: (PROJ (index on PNAME) nested loops join with ASG (index on PNO))

Pass III: Generate all three-relation plans by considering each plan after Pass II as the outer relation. Select the cheapest plan.

Additional passes are needed if more relations are involved in the join.

Plan X nested loops join with EMP (file scan)
Plan X nested loops join with EMP (index on ENO)
Plan X merge join with EMP (file scan)
Plan X merge join with EMP (index on ENO)
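The pass-by-pass enumeration can be sketched as a small dynamic program in Python. This is only an illustration of the control flow, under loud assumptions: the cardinalities and join divisors are invented, the cost of a plan is simply the sum of its intermediate result sizes (not System R's actual cost model), and access methods, join methods, and sort orders are ignored.

```python
from itertools import combinations

# Assumed toy statistics, NOT from the lecture: tuples left after local selections.
CARD = {"EMP": 400, "ASG": 1000, "PROJ": 1}
# Join edges of the example query; each join divides the cross-product
# size by the given (assumed) number.
JOIN_DIVISOR = {
    frozenset({"EMP", "ASG"}): 400,
    frozenset({"ASG", "PROJ"}): 100,
}

def joins_with(outer, rel):
    """True if rel shares a join predicate with some relation in outer."""
    return any(frozenset({r, rel}) in JOIN_DIVISOR for r in outer)

def result_size(rels):
    """Estimated number of tuples in the join of the relations in rels."""
    size = 1.0
    for r in rels:
        size *= CARD[r]
    for edge, divisor in JOIN_DIVISOR.items():
        if edge <= rels:
            size /= divisor
    return size

def best_left_deep_plan(relations):
    # Pass I: one (cost, join order) entry per single relation.
    best = {frozenset({r}): (0.0, (r,)) for r in relations}
    # Pass k: extend the best (k-1)-relation plans by one inner relation.
    for k in range(2, len(relations) + 1):
        for combo in combinations(relations, k):
            subset = frozenset(combo)
            for inner in combo:
                outer = subset - {inner}
                # Skip Cartesian products and outers pruned in earlier passes.
                if outer not in best or not joins_with(outer, inner):
                    continue
                cost = best[outer][0] + result_size(subset)
                if subset not in best or cost < best[subset][0]:
                    best[subset] = (cost, best[outer][1] + (inner,))
    return best[frozenset(relations)]

cost, order = best_left_deep_plan(["EMP", "ASG", "PROJ"])
print(order, cost)
```

With these invented statistics the sketch picks the order PROJ, ASG, EMP, because the selection on PROJ keeps the intermediate results small.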


Block Nested Loops Join

B: Available memory in pages.
M: Number of pages for R
N: Number of pages for S
R is the outer relation

If R can fit in memory entirely (M=B-2):
• Bring bigger relation S one page at a time.
• Cost=M+N. Optimal if one of the relations can fit in memory.

Otherwise:
foreach block of B-2 pages of R do
  foreach page of S do
    if ri in the R block == sj in the S page then add <ri, sj> to result

Cost = $M + N \cdot \lceil M/(B-2) \rceil$, where $\lceil M/(B-2) \rceil$ is the number of blocks of R.

[Figure: memory holds R, one input page for S, and an output buffer.]
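The block nested loops cost formula can be evaluated with a few lines of Python. The function name is hypothetical, and the sample values are assumptions chosen for illustration: M = 500 and N = 1000 match the Sailors and Reserves page counts used earlier, while the buffer sizes are invented.

```python
import math

def block_nested_loops_cost(m_pages, n_pages, buffer_pages):
    """I/O cost with R (M pages) as outer: read R once, scan S once per block of R."""
    block_size = buffer_pages - 2          # one page for S input, one for output
    num_blocks = math.ceil(m_pages / block_size)
    return m_pages + n_pages * num_blocks

print(block_nested_loops_cost(500, 1000, 102))  # 5 blocks of R: 500 + 5*1000 = 5500
print(block_nested_loops_cost(500, 1000, 502))  # R fits in memory: 500 + 1000 = 1500
```

The second call shows the special case M = B-2, where the cost reduces to M+N.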

Finding Matching Pairs

To reduce the CPU time to find a matching pair, an in-memory hash table is used.

[Figure: R and S are read from disk; memory holds a hash table for a block of R (k < B-1 pages), an input buffer for S, and an output buffer that collects the join result.]

Indexed Nested Loops Join

foreach tuple r in R do
  foreach tuple s in S where r.A == s.B do (located via the index on S.B)
    add <r, s> to result

Use index on the joining attribute of S.
Cost = M + M*PR*(cost of retrieving a matching tuple in S), where M is the cost to scan R and PR is the number of R tuples per page. The retrieval cost depends on the type of index and the number of matching tuples.

• For each R tuple, cost of probing S index is about 1.2 for hash index, 2-4 for B+ tree. Cost of then finding S tuples (assuming Alt. (2) or (3) for data entries) depends on clustering.

– Clustered index: 1 I/O (typical); unclustered: up to 1 I/O per matching S tuple.
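A minimal Python sketch of the indexed nested loops cost formula follows. The function and parameter names are hypothetical. The example values are assumptions: Reserves as the outer relation (1000 pages, 100 tuples per page), a hash index probe of 1.2 I/Os, and exactly one matching inner tuple fetched with 1 I/O.

```python
def index_nested_loops_cost(outer_pages, outer_tuples_per_page,
                            probe_io, fetch_io_per_outer_tuple):
    """I/O cost: scan the outer once, then probe the inner index once per outer tuple."""
    outer_tuples = outer_pages * outer_tuples_per_page
    return outer_pages + outer_tuples * (probe_io + fetch_io_per_outer_tuple)

# Hash index on the inner relation: 1.2 I/Os per probe, 1 I/O to fetch the match
cost = index_nested_loops_cost(1000, 100, 1.2, 1.0)
print(round(cost))  # 221000
```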

Sort-Merge Join ($R \bowtie_{i=j} S$)
• Sort R and S on the join column, then scan them to do a "merge" (on join col.), and output result tuples.

– Advance scan of R until current R-tuple >= current S tuple, then advance scan of S until current S-tuple >= current R tuple; do this until current R tuple = current S tuple.

– At this point, all R tuples with same value in Ri (current R group) and all S tuples with same value in Sj (current S group) match; output <r, s> for all pairs of such tuples.

– Then resume scanning R and S.

Example of Sort-Merge Join

Sailors:
sid  sname   rating  age
22   dustin  7       45.0
28   yuppy   9       35.0
31   lubber  8       55.5
44   guppy   5       35.0
58   rusty   10      35.0

Reserves:
sid  bid  day       rname
28   103  12/4/96   guppy
28   103  11/3/96   yuppy
31   101  10/10/96  dustin
31   102  10/12/96  lubber
31   101  10/11/96  lubber
58   103  11/12/96  dustin

• Cost: M log M + N log N + (M+N)
– The cost of scanning, M+N, could be M*N (very unlikely!)
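The merge step can be illustrated with an in-memory Python sketch that joins the Sailors and Reserves tuples of this example on sid. It is a simplified illustration, assuming both inputs fit in memory and are sorted with Python's built-in sort; the function name and tuple layout are choices made here, not part of the lecture.

```python
def sort_merge_join(r, s, r_key, s_key):
    """Join two lists of tuples on r_key(r_tuple) == s_key(s_tuple)."""
    r = sorted(r, key=r_key)
    s = sorted(s, key=s_key)
    result = []
    i = j = 0
    while i < len(r) and j < len(s):
        rk, sk = r_key(r[i]), s_key(s[j])
        if rk < sk:
            i += 1                      # advance scan of R
        elif rk > sk:
            j += 1                      # advance scan of S
        else:
            # Find the end of the current R group and the current S group.
            i_end = i
            while i_end < len(r) and r_key(r[i_end]) == rk:
                i_end += 1
            j_end = j
            while j_end < len(s) and s_key(s[j_end]) == rk:
                j_end += 1
            # Output all pairs from the two matching groups.
            for rt in r[i:i_end]:
                for st in s[j:j_end]:
                    result.append(rt + st)
            i, j = i_end, j_end         # resume scanning after both groups
    return result

sailors = [(22, "dustin", 7, 45.0), (28, "yuppy", 9, 35.0), (31, "lubber", 8, 55.5),
           (44, "guppy", 5, 35.0), (58, "rusty", 10, 35.0)]
reserves = [(28, 103, "12/4/96", "guppy"), (28, 103, "11/3/96", "yuppy"),
            (31, 101, "10/10/96", "dustin"), (31, 102, "10/12/96", "lubber"),
            (31, 101, "10/11/96", "lubber"), (58, 103, "11/12/96", "dustin")]

joined = sort_merge_join(sailors, reserves, lambda t: t[0], lambda t: t[0])
print(len(joined))  # 6
```

Sailors 22 and 44 have no reservations, so they contribute nothing to the result.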

Join Ordering in Fragment Queries
Join ordering is important in a centralized DBMS.
Join ordering is even more important in a distributed DBMS.

[Figure: join graph with EMP at Site 1, ASG at Site 2, and PROJ at Site 3; EMP and ASG are joined on ENO, ASG and PROJ on PNO.]

R → site j: "relation R is transferred to site j"

1. EMP → site 2; site 2 computes EMP'; EMP' → site 3; site 3 computes the result.
2. ASG → site 1; site 1 computes EMP'; EMP' → site 3; site 3 computes the result.
3. ASG → site 3; site 3 computes ASG'; ASG' → site 1; site 1 computes the result.
4. PROJ → site 2; site 2 computes PROJ'; PROJ' → site 1; site 1 computes the result.
5. EMP → site 2; PROJ → site 2; site 2 computes the join.

Approach I: Ordering joins without using semi-joins

•Distributed INGRES
•R* (distributed version of System R)

Consider costs of all strategies and choose the best one

For $R \bowtie S$ with R and S stored at different sites:
•If size(R) < size(S): transfer R to the site of S
•If size(S) < size(R): transfer S to the site of R
•Ignore the transfer time for producing data at the result site.

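Approach II: Use semi-joins

Semijoin (SJ):
$R \bowtie_A S \Leftrightarrow (R \ \mathrm{SJ}_A \ S) \bowtie_A S \Leftrightarrow R \bowtie_A (S \ \mathrm{SJ}_A \ R) \Leftrightarrow (R \ \mathrm{SJ}_A \ S) \bowtie_A (S \ \mathrm{SJ}_A \ R)$

R is at Site 1 and S is at Site 2:
•Site 2 computes $S' = \Pi_A(S)$; S' → site 1
•Site 1 computes $R' = R \ \mathrm{SJ}_A \ S'$; R' → site 2
•Site 2 computes result $= R' \bowtie_A S$

Semijoin is better than join if size(R')+size(S') < size(R) (i.e., a few tuples of R participate in the join).
S' can be minimized by encoding it in a bit array (BA).
BA[i]=1 if h(value of S.A)=i, BA[i]=0 otherwise. h() is the hash function.
R' consists of tuples whose BA[h(value of R.A)]=1.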
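The bit-array encoding of S' can be sketched in Python as follows. The relations, the array size, and the toy hash function are all invented for illustration; a real system would choose the hash function and array size to keep false positives rare.

```python
def build_bit_array(s_join_values, size, h):
    """BA[i] = 1 if some S.A value hashes to slot i, else 0."""
    bits = [0] * size
    for value in s_join_values:
        bits[h(value) % size] = 1
    return bits

def semijoin_filter(r_tuples, join_attr, bits, h):
    """R' = tuples of R whose join value hashes to a set bit (may contain false positives)."""
    return [t for t in r_tuples if bits[h(t[join_attr]) % len(bits)] == 1]

# Hypothetical data: R(A, name) at site 1, S(A, payload) at site 2
R = [(a, f"r{a}") for a in range(1, 7)]
S = [(2, "s1"), (5, "s2"), (5, "s3")]
h = lambda v: v                      # toy hash for small integers (assumption)

bits = build_bit_array({a for a, _ in S}, 4, h)       # shipped to site 1 instead of S'
r_prime = semijoin_filter(R, 0, bits, h)              # shipped to site 2
result = [(r, s) for r in r_prime for s in S if r[0] == s[0]]

print([t[0] for t in r_prime])  # [1, 2, 5, 6]
print(len(result))              # 3
```

Values 1 and 6 are false positives caused by hash collisions; the final join at the site of S discards them, so the result is still correct, only the transfer is slightly larger than with an exact S'.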


Some Semi-Join Alternatives

• ASG1=ASG SJ EMP
• ASG11= (ASG SJ PROJ) SJ EMP
• Complex: Most algorithms use single semi-joins

[Figure: semi-join graphs over EMP, ASG, and PROJ, with edges labeled ENO and PNO.]

Cyclic Queries
• ET(ENO, ENAME, TITLE, CITY)
• ASG(ENO,PNO,RESP,DUR)
• PT(PNO, PNAME, BUDGET, CITY)

• Query: Retrieve the names of all employees living in the city in which their project is located

SELECT ET.ENAME
FROM ET, ASG, PT
WHERE ET.ENO=ASG.ENO
AND ASG.PNO=PT.PNO
AND ET.CITY=PT.CITY

[Figure: a cyclic join graph over ET, ASG, and PT with edges ET.ENO=ASG.ENO, ASG.PNO=PT.PNO, and ET.CITY=PT.CITY.]

Transformation of Cyclic Queries

• Remove (ET,PT)
• Add additional predicates

[Figure: the cyclic join graph over ET, ASG, and PT (edges ET.ENO=ASG.ENO, ASG.PNO=PT.PNO, ET.CITY=PT.CITY) is transformed into an acyclic join graph with two edges: ET.ENO=ASG.ENO AND ET.CITY=ASG.CITY between ET and ASG, and ASG.PNO=PT.PNO AND ASG.CITY=PT.CITY between ASG and PT.]