Assignment II
Critique "Enhancing P2P File-Sharing with an Internet-Scale Query Processor", by Boon Thau Loo (UC Berkeley), Joseph M. Hellerstein (UC Berkeley and Intel Research Berkeley), Ryan Huebsch (UC Berkeley), Scott Shenker (UC Berkeley and International Computer Science Institute), Ion Stoica (UC Berkeley), VLDB 2004
• Due on Sept 30, 2004
• % contributed to the final grade: 8%
• Total score: 100 points
• The critique includes
– Your name and last four digits of your university ID
– (20 points) Discuss originality of the work (what's new in this paper that has never been done previously)
– (50 points) 2 page summary of the proposed work (how does the technique work, how performance evaluation is performed)
– (20 points) Discuss drawbacks of the proposed work (in your opinion)
  • Drawback of the proposed technique
  • Drawback of the performance evaluation
– (10 points) What would you do differently to solve the same problem?
Centralized Query Optimization
Static Query Optimization: relies on database statistics to decide the query execution plan; done after query decomposition
•System R
•More widely used than dynamic optimization
Dynamic Query Optimization: combines query decomposition and optimization; does not rely on database statistics, but uses the actual results of the intermediate relations
•INGRES
Static Query Optimization: System R
Input: Query tree from the query decomposition step
Output: Cheapest query execution plan
Solution Sketch:
• Consider some subsets of all possible plans (e.g., Cartesian product plans are eliminated)
• Assign a cost to each candidate tree and choose the cheapest one
  •A plan that applies selection first may not be chosen because it costs more in a later step
• Utilize a cost model for each low-level operation (e.g., clustered B+-tree index with a range predicate)
System R
• Pass I: Enumerate all single-relation plans. Retain the cheapest plan(s) for each relation.
• Pass II: Generate all two-relation plans by considering each single-relation plan after Pass I as the outer relation and every other relation as the inner relation.
• There are as many passes as the number of relations involved in the join.
Example
Query:
  SELECT *
  FROM EMP, ASG, PROJ
  WHERE EMP.ENO = ASG.ENO
    AND ASG.PNO = PROJ.PNO
    AND PNAME = 'CAD/CAM'
EMP has an index on ENO.
ASG has an index on PNO.
PROJ has an index on PNO and another index on PNAME.
Assume that only nested loops join and merge-join are considered.
Enumeration of Left-Deep Plans
Input: an algebraic query tree
Pass I: Enumerate all single-relation plans. Retain the cheapest plan(s) for each relation.
•Cheapest plan where tuples are not in any order (Plan A).
•Cheapest plan where tuples are in a sort order (Plan B).
•If plan B is cheaper than plan A, retain only plan B.
Example:
EMP: file scan (no selection on EMP)
ASG: file scan (no selection on ASG)
PROJ: index on PNAME (selection on PNAME)
Cost computation
• EMP
– Compute the cost for file scan
• ASG
– Compute the cost for file scan on ASG
• PROJ
– Compute the cost for file scan
– Compute the cost for using index on PNAME
I/O Cost for File Scan
• Scan on R: Read every tuple from R
– Record Format
– Page Format
– File Format
– Cardinality of R
– The simplest cost assumes fixed length records; one page has as many records as possible; pages are stored consecutively on disk
• Number of pages read: $\dfrac{Card(R) \times RecordSize(R)}{pagesize - headersize}$
Example
Sailors(sid,sname,rating,age)
– Each Sailors tuple is 50 bytes long (fixed length record format)
– A page size is 4K bytes
– #tuples: 40,000
– All pages are full, unpacked bitmap; 96 bytes are reserved for slot directory
– How many pages for Sailors?
• One page can contain at most $\frac{4096 - 96}{50} = 80$ tuples
• Sailors occupies $\frac{40000}{80} = 500$ pages
Selection Operators
$\sigma_{R.A\ \mathrm{op}\ value}(R)$
No index, unsorted data: file scan
No index, sorted data
Index:
•Clustered/unclustered
•Dense/Sparse
•Tree-based index/Hash-based index
•Data entries format: Alternatives 1,2,3
Record Formats: Fixed Length
• Information about field types is stored in system catalogs.
• Li: Size of field i in bytes
[Figure: a record with fields F1 to F4 of lengths L1 to L4, stored contiguously from base address B; the address of F3 is B+L1+L2]
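The address rule for fixed-length records can be sketched as a prefix sum over the field lengths. This is a minimal illustration only; the function name `field_address` and the sample lengths are assumptions, not part of the slides.

```python
from itertools import accumulate

def field_address(base, lengths, i):
    """Address of field i (1-based) in a fixed-length record starting at base."""
    # Field offsets are prefix sums of the lengths: 0, L1, L1+L2, ...
    offsets = [0] + list(accumulate(lengths))
    return base + offsets[i - 1]

# Hypothetical record: B = 1000, field lengths L1..L4 = 4, 20, 2, 8 bytes.
f3 = field_address(1000, [4, 20, 2, 8], 3)  # B + L1 + L2
```

Because the lengths come from the system catalog, no per-record directory is needed to locate a field.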
Record Formats: Variable Length
• Two alternative formats (# fields is fixed):
– Fields delimited by special symbols: a field count (4 in the figure), then F1 $ F2 $ F3 $ F4 $
– Array of field offsets, followed by F1 F2 F3 F4
• Second offers direct access to i'th field, efficient storage of nulls; small directory overhead.
• Can be used for fixed length fields or variable length fields
Page Formats: Fixed Length Records
[Figure: two page layouts. PACKED: slots 1 to N filled contiguously, followed by free space, with the number of records (N) stored on the page. UNPACKED, BITMAP: slots 1 to M, with an M-bit bitmap marking occupied slots and the number of slots (M) stored on the page.]
Record id = <page id, slot #>. In the first alternative, moving records for free space management changes the rid; this may not be acceptable.
Page Formats: Variable Length Records
[Figure: page i holding records with Rid = (i,1), (i,2), ..., (i,N); a SLOT DIRECTORY on the page stores one entry per slot, the number of slots (N), and a pointer to the start of free space.]
Can move records on page without changing rid; so, attractive for fixed-length records too.
File Formats
• Heap File (unordered file): The data in the pages of a heap file is not ordered. Suitable when typical access is a file scan retrieving all records.
• Sorted File: Best if records must be retrieved in some order, or only a 'range' of records is needed.
• Hashed File: Good for equality selections.
Heap File Implemented as a List
• The header page id and heap file name must be stored someplace.
• Each page contains at least 2 `pointers’ plus data.
[Figure: a header page pointing to two linked lists of data pages: full pages, and pages with free space.]
Heap File Using a Page Directory
• The entry for a page can include the number of free bytes on the page.
• The directory is a collection of pages; linked list implementation is just one alternative.
– Much smaller than linked list of all HF pages!
[Figure: a DIRECTORY, starting at the header page, whose entries point to data pages 1 to N.]
Hashed files
•File is a collection of buckets.
•Bucket = primary page plus zero or more overflow pages.
•Hashing function h: h(r) = bucket in which record r belongs. h looks at only some of the fields of r, called the search fields.
[Figure: file hashed on age by hash function h1; records shown as name, age, sal]
h(age)=00: Smith, 44, 3000; Jones, 40, 6003; Tracy, 44, 5004
h(age)=01: Ashby, 25, 3000; Basu, 33, 4003; Bristow, 29, 2007
h(age)=10: Cass, 50, 5004; Daniels, 22, 6003
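A minimal sketch of a file hashed on age, using the eight records from the figure. The hash function here (the two low-order bits of age, i.e. `age % 4`) is an assumption that happens to reproduce the bucket labels 00, 01, 10 in the figure; the slides do not define h. Overflow pages are ignored.

```python
from collections import defaultdict

# (name, age, sal) records from the figure.
records = [
    ("Smith", 44, 3000), ("Jones", 40, 6003), ("Tracy", 44, 5004),
    ("Ashby", 25, 3000), ("Basu", 33, 4003), ("Bristow", 29, 2007),
    ("Cass", 50, 5004), ("Daniels", 22, 6003),
]

def h(age):
    # Assumed hash function: the two low-order bits of the search field.
    return age % 4

def build_hashed_file(recs):
    """Place each record in the bucket chosen by h applied to its age."""
    buckets = defaultdict(list)
    for rec in recs:
        buckets[h(rec[1])].append(rec)
    return buckets

def equality_search(buckets, age):
    # Only the single bucket h(age) has to be read.
    return [r for r in buckets[h(age)] if r[1] == age]

buckets = build_hashed_file(records)
matches = equality_search(buckets, 44)
```

An equality selection touches one bucket, which is why hashed files suit equality searches; a range selection would still have to visit every bucket.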
Implementation of Relational Operators/Estimated Cost
Selection Operators
$\sigma_{R.A\ \mathrm{op}\ value}(R)$
No index, unsorted data: file scan
No index, sorted data
Index:
•Clustered/unclustered
•Dense/Sparse
•Tree-based index/Hash-based index
•Data entries format: Alternatives 1,2,3
Cost for File Scan
Sailors(sid,sname,rating,age)
– Each Sailors tuple is 50 bytes long (fixed length record format)
– A page size is 4K bytes
– #tuples: 40,000
– All pages are full, unpacked bitmap; 96 bytes are reserved for slot directory
– How many pages for Sailors?
• One page can contain at most $\frac{4096 - 96}{50} = 80$ tuples
• Sailors occupies $\frac{40000}{80} = 500$ pages
Reserves(sid,bid,day,rname)
– Each Reserves tuple is 40 bytes long
– Total #tuples = 100,000
– Given the same page size and page format as in the previous example, we have
• A page can hold 4000/40=100 Reserves tuples
• #Pages for Reserves: 100,000/100=1000 pages
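The page counts for Sailors and Reserves follow from one small calculation, sketched here under the slide's assumptions (fixed-length records, all pages full, 96 bytes reserved per 4096-byte page). The function name `pages_needed` is hypothetical.

```python
import math

def pages_needed(num_tuples, record_size, page_size=4096, reserved=96):
    """Pages occupied by a file of fixed-length records when all pages are full."""
    tuples_per_page = (page_size - reserved) // record_size
    return math.ceil(num_tuples / tuples_per_page)

sailors_pages = pages_needed(40_000, 50)    # 80 tuples per page
reserves_pages = pages_needed(100_000, 40)  # 100 tuples per page
```

The result is also the I/O cost of a file scan, since a scan reads every page once.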
No index, unsorted file
$\sigma_{R.A = value}(R)$
Suppose R is the Reserves relation
Best access path: File Scan
I/O Cost: 1000 pages
I/O time cost: 1000 * time to access each page
Complexity: O(|R|)
Notation: |R| is the number of pages in R
No Index, sorted file on R.A
$\sigma_{R.A = value}(R)$
Suppose R is Reserves
Sorted order: Ascending
Best Access Path:
•Binary search to locate the first tuple with R.A=Value
•Scan the remaining records
I/O Cost:
•log2(|R|) + cost of scan for remaining tuples (0 ~ |R|)
•log2(1000) is about 10, so about 10 + cost of retrieving remaining tuples
•About 10 + 0.1*1000 = 110 when the reduction factor is 0.1
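The sorted-file estimate can be written as a short function. This is a sketch of the slide's arithmetic only: it rounds the binary-search term up to whole page reads and assumes the qualifying tuples occupy `reduction_factor * |R|` contiguous pages. The name `sorted_file_selection_cost` is hypothetical.

```python
import math

def sorted_file_selection_cost(num_pages, reduction_factor):
    """Estimated page I/Os: binary search plus a scan of the qualifying pages."""
    search = math.ceil(math.log2(num_pages))     # about 10 for 1000 pages
    scan = round(reduction_factor * num_pages)   # contiguous qualifying pages
    return search + scan

reserves_cost = sorted_file_selection_cost(1000, 0.1)
```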
B+Tree Index on R.A
$\sigma_{R.A = value}(R)$
• Need to know
– What format are the data entries in the leaf nodes of the index?
– Whether the index is clustered or unclustered, dense or sparse
– Cost of traversing from the root to the leaf (<4 I/Os, typically) + cost of retrieving the pages in the sequence set + cost of retrieving pages containing the data records.
Indexes
• Index: A data structure which helps speed up selections of data.
• It can be viewed as a set of data entries, each denoted as k*.
– Any subset of the fields of a relation can be the search key for an index on the relation.
– Search key is not the same as key (minimal set of fields that uniquely identify a record in a relation).
– Tree structured Index vs Hash-based Index
Alternatives for Data Entry k* in Index
Alternative 1: Data entry is the actual tuple
•Tuples organized according to the search key value
•Allow only one Alternative 1 index per relation
Alternative 2: Data entry is <k, rid of data record with search key value k>
•Ex. <50, (1, 10)>: search key value = 50; rid tells that the data is in page 1, slot 10
Alternative 3: Data entry is <k, list of rids of data records with search key k>
•Ex: <50, (1,10), (2,20), (3,1)>: search key value = 50; three record IDs
Pros and Cons for different data entries format
– Data entries are typically much smaller than data records. So, Alternatives 2 and 3 are better than Alternative 1 with large data records, especially if search keys are small.
– If more than one index is required on a given file, at most one index can use Alternative 1; the rest must use Alternatives 2 or 3.
– Alternative 3 is more compact than Alternative 2, but leads to variable sized data entries even if search keys are of fixed length.
Alternatives for Data Entries in an Index
•k* is an actual data record (with search key value k)
•<k,rid>
•<k,rid-list>
[Figure: CLUSTERED INDEX vs UNCLUSTERED INDEX. In both, index entries (index file) direct the search for data entries, which point to data records (data file).]
B+Tree: The Most Widely Used Index for RDBMS
• Support equality and range-searches efficiently
• Balanced tree
• Search/Insert/delete at $\log_F N$ cost (F = fanout, N = # leaf pages)
• Minimum 50% occupancy for each node except for root
• Each node except root contains d <= m <= 2d entries. The root node contains 1 <= m <= 2d entries, where d is the order of the tree.
[Figure: index entries at the upper levels (direct search) above the data entries at the leaf level ("sequence set").]
Example B+Tree
• Search begins at root, and key comparisons direct it to a leaf.
• Search for tuples whose search key >=5
• Follow the left pointer if the desired value is less than the value in the node
• Otherwise, follow the right pointer
Based on the search for 15*, we know it is not in the tree!
[Figure: B+ tree with root keys 13, 17, 24, 30 and leaf data entries 2* 3* 5* 7* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39*]
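The search rule can be tried on the example tree with a two-level sketch. The grouping of data entries into five leaves is an assumption (the extracted figure only lists the entries in order); `ROOT_KEYS`, `LEAVES`, `search`, and `range_from` are hypothetical names, and a real B+ tree would have more levels and page-based nodes.

```python
from bisect import bisect_right

ROOT_KEYS = [13, 17, 24, 30]
# Assumed leaf grouping: leaf i holds the entries between root keys i-1 and i.
LEAVES = [
    [2, 3, 5, 7],
    [14, 16],
    [19, 20, 22],
    [24, 27, 29],
    [33, 34, 38, 39],
]

def search(key):
    """Equality search: go left if key < node value, otherwise right."""
    # bisect_right counts the root keys <= key, i.e. the child to follow.
    child = bisect_right(ROOT_KEYS, key)
    return key in LEAVES[child]

def range_from(key):
    """Range search key >= value: find the first leaf, then scan the sequence set."""
    child = bisect_right(ROOT_KEYS, key)
    return [k for leaf in LEAVES[child:] for k in leaf if k >= key]

found_5 = search(5)
found_15 = search(15)
```

The search for 15 lands in the leaf holding 14* and 16*, so its absence there is enough to conclude it is not in the tree.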
[Figure: a data file with the records Ashby, 25, 3000; Smith, 44, 3000; Bristow, 30, 2007; Basu, 33, 4003; Cass, 50, 5004; Tracy, 44, 5004; Daniels, 22, 6003; Jones, 40, 6003, shown with two indexes: a sparse, clustered tree index with entries Ashby, Cass, Smith, and a dense, unclustered hashed index on sal (hash function h2, buckets h2(sal)=00 and h2(sal)=11) whose data entries are 3000, 3000, 5004, 4003, 2007, 6003, 6003, 5004.]
Dense: At least one data entry per data record
Sparse: At least one data entry per block
Index Classification
• Primary vs. secondary: If search key contains primary key, then called primary index.
– Unique index: Search key contains a candidate key.
• Clustered vs. unclustered: If order of data records is the same as, or 'close to', order of data entries, then called clustered index.
– Clustered index is good for a range search query.
• Dense vs. Sparse:
– Dense: At least one data entry per data record
– Sparse: At least one data entry per block
B+Tree Index on R.A
$\sigma_{R.A = value}(R)$
– Suppose R is Reserves table
– Format of Data Entry: Alternative 2 <k, rid>
– Size of data entry = 20 bytes
– Page size=4K bytes; 96 bytes are reserved for header
– Unclustered and dense
– Reduction Factor: 0.1
– Total Cost =
• Cost of traversing from the root to the leaf (assume 4 I/Os) +
• Cost of retrieving the pages of data entries in the sequence set +
• Cost of retrieving pages containing the data records
• B+tree, Alternative 2, dense, unclustered
• I/O cost of retrieving pages of qualifying data entries
– Dense: matching data entries: 0.1*100000=10000 entries
– #Data entries per page: $\frac{4096 - 96}{20} = 200$
– Pages of matching data entries = 10000/200 = 50 pages
• I/O cost of retrieving qualifying tuples
– 10000 pages: since the index is unclustered, the qualifying tuples are not always in the same order as the data entries.
– In the worst case, for each qualifying data entry, one I/O is needed
• Total I/O Cost = 4 + 50 + 10,000 = 10,054 pages
• B+tree, Alternative 2, dense, clustered
• I/O cost of retrieving pages of qualifying data entries
– Matching data entries: 0.1*100000=10000 entries
– #Data entries per page: $\frac{4096 - 96}{20} = 200$
– #Pages of matching data entries = $\frac{10000}{200} = 50$
• I/O cost of retrieving qualifying tuples
– #Matching tuples: 10000
– Since the index is dense and clustered, the qualifying tuples are also clustered
– # pages: 10000/100=100 due to 100 tuples per page
• Total I/O Cost = 4 + 50 + 100 = 154 pages
• B+tree, Alternative 2, sparse (must be clustered)
• I/O cost of retrieving qualifying tuples
– #Matching tuples: 0.1*100,000=10000
– Since the index is sparse, one data entry points to a page.
– # pages for matching tuples: $\frac{10000}{100} = 100$ since a page can hold 100 tuples
• I/O cost of retrieving pages of qualifying data entries
– #pages of matching tuples: 100
– #Data entries per page: $\frac{4096 - 96}{20} = 200$
– #Pages of matching data entries = $\left\lceil \frac{100}{200} \right\rceil = 1$ page
• Total I/O Cost = 4 + 1 + 100 = 105 pages
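The three B+tree estimates differ only in how many data-entry pages and data pages are charged. A minimal sketch of that arithmetic, with the slide's parameters as defaults; the function name and keyword arguments are hypothetical, and the unclustered case uses the worst-case assumption of one I/O per qualifying tuple.

```python
import math

def btree_selection_cost(num_tuples, tuples_per_page, reduction_factor,
                         entry_size=20, page_size=4096, reserved=96,
                         traversal=4, dense=True, clustered=True):
    """Estimated page I/Os for an equality selection through a B+tree (Alternative 2)."""
    matches = round(reduction_factor * num_tuples)
    entries_per_page = (page_size - reserved) // entry_size
    data_pages = math.ceil(matches / tuples_per_page)  # pages holding the matches if clustered
    if dense:
        # One data entry per matching tuple.
        entry_pages = math.ceil(matches / entries_per_page)
        tuple_io = data_pages if clustered else matches
    else:
        # Sparse (always clustered): one data entry per data page.
        entry_pages = math.ceil(data_pages / entries_per_page)
        tuple_io = data_pages
    return traversal + entry_pages + tuple_io

dense_unclustered = btree_selection_cost(100_000, 100, 0.1, clustered=False)
dense_clustered = btree_selection_cost(100_000, 100, 0.1)
sparse = btree_selection_cost(100_000, 100, 0.1, dense=False)
```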
Hash-based Index
$\sigma_{R.A = value}(R)$
• Hash-based index on A is good for equality search on attribute A
• Cannot support range search
• Suppose R is Reserves
• Reduction Factor: 0.1
• Data entry format: Alternative 2: 20 bytes per data entry
• Cost for searching the matching data entry: 1.2 I/O
• I/O cost = cost for retrieving the matching data entries + cost for retrieving the qualifying tuples
• I/O cost of retrieving pages of matching data entries
– Matching data entries: 0.1*100000=10000 entries
– I/O cost = 10000*1.2 = 12000 I/Os
• I/O cost for retrieving the qualifying tuples = 12000 I/Os
• Total cost = 12000+12000 = 24000 I/Os
• Hash-based indexes are unclustered
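The hash-index total can be reproduced with the same kind of arithmetic. This sketch charges 1.2 I/Os per matching data entry, as stated, and assumes the same 1.2 I/Os per qualifying tuple, which is the value that yields the slide's 12000 figure for tuple retrieval; the function and parameter names are hypothetical.

```python
def hash_index_selection_cost(num_tuples, reduction_factor,
                              probe_cost=1.2, tuple_cost=1.2):
    """Estimated I/Os for an equality selection through an unclustered hash index."""
    matches = round(reduction_factor * num_tuples)
    entry_io = matches * probe_cost   # locating the matching data entries
    tuple_io = matches * tuple_cost   # assumed per-tuple charge (see lead-in)
    return round(entry_io + tuple_io)

reserves_hash_cost = hash_index_selection_cost(100_000, 0.1)
```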
Cost comparisons among various access paths for Equality Selection on Reserves
$\sigma_{R.A = value}(R)$
• File scan: 1000
• Sorted File: 110
• Dense, unclustered B+tree = 10054
• Dense, clustered B+tree = 154
• Sparse, B+tree = 105
• Static Hash index = 24000
• Select the cheapest access path: by the figures above, the sparse B+tree (105), narrowly ahead of the sorted file (110)
Enumeration of Left-Deep Plans (Cont’d)
Pass II: Generate all two-relation plans by considering each single-relation plan after Pass I as the outer relation and every other relation as the inner relation.
•For each join method of A joins B, determine the best access method for B. Consider also the plans that involve the join attribute of B.
•If a sort-merge join is used and the access method for B does not produce the result in a sorted order according to the join attribute of B, the cost of sorting must be included.
Example
EMP has an index on ENO.
ASG has an index on PNO.
PROJ has an index on PNO and another index on PNAME.
Suppose these are the cheapest plans from Pass I:
EMP: file scan (no selection on EMP)
ASG: file scan (no selection on ASG)
PROJ: index on PNAME (selection on PROJ.PNAME)
Plans that join EMP with PROJ are removed because they use a Cartesian product.
Compute the cost of the following plans and select the cheapest one.
EMP (file scan) (nested loops join) ASG (file scan)
EMP (file scan) (merge-join) ASG (file scan)
ASG (file scan) (nested loops join) PROJ (file scan)
ASG (file scan) (nested loops join) PROJ (index on PNO)
ASG (file scan) (merge-join) PROJ (file scan)
ASG (file scan) (merge-join) PROJ (index on PNO)
PROJ (index on PNAME) (nested loops join) ASG (file scan)
PROJ (index on PNAME) (nested loops join) ASG (index on PNO)
PROJ (index on PNAME) (merge join) ASG (file scan)
PROJ (index on PNAME) (merge join) ASG (index on PNO)
Suppose plan X is the cheapest.
Plan X: (PROJ (index on PNAME) nested loops join with ASG (index on PNO))
Pass III: Generate all three-relation plans by considering each plan after Pass II as the outer relation. Select the cheapest plan.
(PROJ (index on PNAME) nested loops join with ASG (index on PNO)) nested loops join with EMP (file scan)
(PROJ (index on PNAME) nested loops join with ASG (index on PNO)) nested loops join with EMP (index on ENO)
(PROJ (index on PNAME) nested loops join with ASG (index on PNO)) merge join with EMP (file scan)
(PROJ (index on PNAME) nested loops join with ASG (index on PNO)) merge join with EMP (index on ENO)
Additional passes are needed if more relations are involved in the join.
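The pruning of Cartesian-product plans can be sketched on the EMP, ASG, PROJ example. This is a brute-force walk over permutations, not System R's pass-by-pass dynamic programming, and it has no cost model; `JOIN_EDGES` and `left_deep_orders` are hypothetical names.

```python
from itertools import permutations

# Join graph of the example query: an edge means a join predicate exists.
JOIN_EDGES = {frozenset({"EMP", "ASG"}), frozenset({"ASG", "PROJ"})}

def left_deep_orders(relations, edges):
    """Left-deep join orders in which every step has a join predicate to use."""
    orders = []
    for order in permutations(relations):
        joined = {order[0]}
        valid = True
        for rel in order[1:]:
            # Joining rel with no predicate to the relations joined so far
            # would be a Cartesian product, so the whole order is dropped.
            if not any(frozenset({rel, r}) in edges for r in joined):
                valid = False
                break
            joined.add(rel)
        if valid:
            orders.append(order)
    return orders

orders = left_deep_orders(["EMP", "ASG", "PROJ"], JOIN_EDGES)
```

Of the six permutations, the two that start by joining EMP with PROJ are discarded, matching the removal of those plans in Pass II.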
Block Nested Loops Join
• R is the outer relation.
• If R can fit in memory entirely (M=B-2): bring the bigger relation S in one page at a time. Cost=M+N. Optimal if one of the relations can fit in memory.
• Otherwise, read R in blocks of B-2 pages:
    foreach block of B-2 pages of R do
        foreach page of S do
            foreach tuple ri in the R block and tuple sj in the S page do
                if ri matches sj on the join attribute then add <ri, sj> to result
• Cost = $M + N \times \left\lceil \frac{M}{B-2} \right\rceil$, where $\left\lceil \frac{M}{B-2} \right\rceil$ is the number of blocks of R
– B: Available memory in pages
– M: Number of pages for R
– N: Number of pages for S
Finding Matching Pairs
• To reduce the CPU time to find a matching pair, an in-memory hash table is used.
[Figure: R & S on disk; memory holds a hash table for a block of R (k < B-1 pages), an input buffer for S, and an output buffer that produces the join result.]
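A minimal in-memory sketch of block nested loops join with a hash table per block of R, together with the cost formula. Relations are modeled as lists of pages (lists of tuples), which is an assumption for illustration; `bnl_cost`, `block_nested_loops_join`, and the key-extractor arguments are hypothetical names.

```python
import math
from collections import defaultdict

def bnl_cost(m_pages, n_pages, buffers):
    """Page I/Os: read R once, and read S once per block of B-2 pages of R."""
    return m_pages + n_pages * math.ceil(m_pages / (buffers - 2))

def block_nested_loops_join(r_pages, s_pages, buffers, r_key, s_key):
    block_size = buffers - 2  # one buffer for S input, one for output
    result = []
    for start in range(0, len(r_pages), block_size):
        # Build an in-memory hash table on the current block of R.
        table = defaultdict(list)
        for page in r_pages[start:start + block_size]:
            for r in page:
                table[r_key(r)].append(r)
        # Scan S one page at a time and probe the hash table.
        for page in s_pages:
            for s in page:
                for r in table.get(s_key(s), []):
                    result.append((r, s))
    return result

r_pages = [[(1, "a"), (2, "b")], [(3, "c")]]
s_pages = [[(2, "x"), (3, "y")], [(3, "z"), (4, "w")]]
pairs = block_nested_loops_join(r_pages, s_pages, 3,
                                lambda r: r[0], lambda s: s[0])
```

With Sailors as R (500 pages), Reserves as S (1000 pages), and B = 102, `bnl_cost` gives 500 + 1000 * 5 = 5500; once B-2 reaches 500 the cost drops to M+N.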
Indexed Nested Loops Join
    foreach tuple r in R do
        foreach tuple s in S do
            if r.A == s.B then add <r, s> to result
• Use index on the joining attribute of S.
• Cost = M + M*PR*(cost of retrieving a matching tuple in S)
– M is the cost to scan R; PR is the number of tuples per page of R.
– The cost of retrieving a matching tuple depends on the type of index and the number of matching tuples.
• For each R tuple, cost of probing S index is about 1.2 for hash index, 2-4 for B+ tree. Cost of then finding S tuples (assuming Alt. (2) or (3) for data entries) depends on clustering.
– Clustered index: 1 I/O (typical), unclustered: up to 1 I/O per matching S tuple.
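The cost formula can be evaluated directly, and the join itself can be sketched with a Python dict standing in for the index on S. Splitting the per-tuple cost into a probe cost and a fetch cost is an assumption for illustration, as are the names `inl_cost` and `index_nested_loops_join`.

```python
def inl_cost(m_pages, tuples_per_page, probe_cost, fetch_cost):
    """M + M*PR*(cost of retrieving a matching S tuple), split into probe + fetch."""
    return m_pages + m_pages * tuples_per_page * (probe_cost + fetch_cost)

def index_nested_loops_join(r_tuples, s_index, r_key):
    # s_index maps a join-attribute value to the S tuples with that value;
    # a dict is only a stand-in for a hash or B+ tree index.
    return [(r, s) for r in r_tuples for s in s_index.get(r_key(r), [])]

# Sailors as outer (500 pages, 80 tuples per page), B+ tree probe of 3 I/Os,
# and 1 I/O to fetch the matching tuple.
cost_btree = inl_cost(500, 80, 3, 1)

s_index = {28: [(28, 103)], 31: [(31, 101), (31, 102)]}
joined = index_nested_loops_join([(22, "dustin"), (31, "lubber")],
                                 s_index, lambda r: r[0])
```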
Sort-Merge Join ($R \bowtie_{i=j} S$)
• Sort R and S on the join column, then scan them to do a "merge" (on join col.), and output result tuples.
– Advance scan of R until current R-tuple >= current S tuple, then advance scan of S until current S-tuple >= current R tuple; do this until current R tuple = current S tuple.
– At this point, all R tuples with same value in Ri (current R group) and all S tuples with same value in Sj (current S group) match; output <r, s> for all pairs of such tuples.
– Then resume scanning R and S.
Example of Sort-Merge Join
Sailors:
sid  sname   rating  age
22   dustin  7       45.0
28   yuppy   9       35.0
31   lubber  8       55.5
44   guppy   5       35.0
58   rusty   10      35.0
Reserves:
sid  bid  day       rname
28   103  12/4/96   guppy
28   103  11/3/96   yuppy
31   101  10/10/96  dustin
31   102  10/12/96  lubber
31   101  10/11/96  lubber
58   103  11/12/96  dustin
• Cost: M log M + N log N + (M+N)
– The cost of scanning, M+N, could be M*N (very unlikely!)
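The merge step can be sketched in memory on the Sailors and Reserves tuples from the example. This is a minimal illustration that sorts Python lists rather than performing an external sort; `sort_merge_join` and the key-extractor arguments are hypothetical names.

```python
def sort_merge_join(r, s, r_key, s_key):
    """Sort both inputs on the join column, then merge matching groups."""
    r = sorted(r, key=r_key)
    s = sorted(s, key=s_key)
    out, i, j = [], 0, 0
    while i < len(r) and j < len(s):
        if r_key(r[i]) < s_key(s[j]):
            i += 1                      # advance scan of R
        elif r_key(r[i]) > s_key(s[j]):
            j += 1                      # advance scan of S
        else:
            k = r_key(r[i])
            # Find the current R group and the current S group for value k.
            i_end = i
            while i_end < len(r) and r_key(r[i_end]) == k:
                i_end += 1
            j_end = j
            while j_end < len(s) and s_key(s[j_end]) == k:
                j_end += 1
            # Output every pair from the two groups, then resume scanning.
            out.extend((a, b) for a in r[i:i_end] for b in s[j:j_end])
            i, j = i_end, j_end
    return out

sailors = [(22, "dustin", 7, 45.0), (28, "yuppy", 9, 35.0),
           (31, "lubber", 8, 55.5), (44, "guppy", 5, 35.0),
           (58, "rusty", 10, 35.0)]
reserves = [(28, 103, "12/4/96", "guppy"), (28, 103, "11/3/96", "yuppy"),
            (31, 101, "10/10/96", "dustin"), (31, 102, "10/12/96", "lubber"),
            (31, 101, "10/11/96", "lubber"), (58, 103, "11/12/96", "dustin")]
result = sort_merge_join(sailors, reserves, lambda t: t[0], lambda t: t[0])
```

Sailors 22 and 44 have no reservations, so the six result tuples come from sids 28, 31, and 58.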
Join Ordering in Fragment Queries
Join ordering is important in a centralized DBMS.
Join ordering is even more important in distributed DBMSs.
[Figure: join graph with EMP at Site 1, ASG at Site 2, and PROJ at Site 3; EMP and ASG join on ENO, ASG and PROJ join on PNO]
R -> site j: "relation R is transferred to site j"
1. EMP -> site 2; site 2 computes EMP'; EMP' -> site 3; site 3 computes the result.
2. ASG -> site 1; site 1 computes EMP'; EMP' -> site 3; site 3 computes the result.
3. ASG -> site 3; site 3 computes ASG'; ASG' -> site 1
4. PROJ -> site 2; site 2 computes PROJ'; PROJ' -> site 1
5. EMP -> site 2; PROJ -> site 2; site 2 computes the join.
Approach I: Ordering joins without using semi-joins
•Consider costs of all strategies and choose the best one
•For $R \bowtie S$: transfer R to the site of S if size(R) < size(S); transfer S to the site of R if size(S) < size(R)
•Ignore the transfer time for producing data at the result site.
•Distributed INGRES
•R* (distributed version of System R)
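Approach II: Use semi-joins
Semijoin (SJ):
$$R \bowtie_A S \;\Leftrightarrow\; (R \ltimes_A S) \bowtie_A S \;\Leftrightarrow\; R \bowtie_A (S \ltimes_A R) \;\Leftrightarrow\; (R \ltimes_A S) \bowtie_A (S \ltimes_A R)$$
With R at Site 1 and S at Site 2:
•Site 2 computes $S' = \Pi_A(S)$ and sends S' to Site 1
•Site 1 computes $R' = R \ltimes_A S'$ and sends R' to Site 2
•Site 2 computes result $= R' \bowtie_A S$
Semijoin is better than join if size(R')+size(S') < size(R) (i.e., only a few tuples of R participate in the join).
S' can be minimized by encoding it in a bit array (BA).
•BA[i]=1 if h(value of S.A)=i for some tuple of S; BA[i]=0 otherwise. h() is the hash function.
•R' consists of the tuples of R whose BA[h(value of R.A)]=1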
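The bit-array variant of the semijoin can be sketched as three steps: build BA from S, filter R with it, and join the reduced R' with S. The hash function (Python's `hash` modulo the array size), the array size, and all function names are assumptions for illustration. Note that hash collisions can let non-matching R tuples into R'; the final join removes them.

```python
def build_bit_array(s, s_key, nbits):
    """BA[i] = 1 if some S tuple hashes to i on the join attribute."""
    ba = [0] * nbits
    for t in s:
        ba[hash(s_key(t)) % nbits] = 1
    return ba

def reduce_with_bit_array(r, r_key, ba):
    """R': the tuples of R whose join value hashes to a set bit."""
    return [t for t in r if ba[hash(r_key(t)) % len(ba)] == 1]

def join(r, s, r_key, s_key):
    index = {}
    for t in s:
        index.setdefault(s_key(t), []).append(t)
    return [(a, b) for a in r for b in index.get(r_key(a), [])]

R = [(1, "a"), (2, "b"), (3, "c"), (9, "d"), (10, "e")]  # at site 1
S = [(2, "x"), (3, "y"), (11, "z")]                      # at site 2

ba = build_bit_array(S, lambda t: t[0], 8)               # shipped to site 1
r_reduced = reduce_with_bit_array(R, lambda t: t[0], ba) # shipped to site 2
result = join(r_reduced, S, lambda t: t[0], lambda t: t[0])
```

Here the tuple with key 10 survives the filter because 10 and 2 collide modulo 8, yet it does not appear in the final result. Only the 8-bit array and three R tuples cross the network instead of all five R tuples.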
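Some Semi-Joins Alternatives
• ASG1 = ASG SJ EMP
• ASG11 = (ASG SJ PROJ) SJ EMP
• Complex: Most algorithms use single semi-joins
[Figure: join graphs over EMP, ASG, and PROJ, with EMP and ASG linked on ENO and ASG and PROJ linked on PNO, illustrating the semi-join alternatives.]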
Cyclic Queries
• ET(ENO, ENAME, TITLE, CITY)
• ASG(ENO,PNO,RESP,DUR)
• PT(PNO, PNAME, BUDGET, CITY)
• Query: Retrieve the names of all employees living in the city in which their project is located
  SELECT ET.ENAME
  FROM ET, ASG, PT
  WHERE ET.ENO = ASG.ENO
    AND ASG.PNO = PT.PNO
    AND ET.CITY = PT.CITY
[Figure: a cyclic join graph over ET, ASG, and PT with edges ET.ENO=ASG.ENO, ASG.PNO=PT.PNO, and ET.CITY=PT.CITY]
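Whether a join graph is cyclic can be checked with a small union-find pass over the join predicates. This is a sketch that treats each predicate as one undirected edge between two relations and assumes at most one edge per pair of relations; `has_cycle` is a hypothetical name.

```python
def has_cycle(predicates):
    """True if the join graph given as (relation, relation) edges contains a cycle."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in predicates:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            # a and b are already connected, so this edge closes a cycle.
            return True
        parent[root_a] = root_b
    return False

cyclic = has_cycle([("ET", "ASG"), ("ASG", "PT"), ("ET", "PT")])
chain = has_cycle([("EMP", "ASG"), ("ASG", "PROJ")])
```

The third predicate, ET.CITY=PT.CITY, connects two relations that are already linked through ASG, which is what makes the query cyclic.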