Post on 19-Dec-2015
Graph-Based Synopses for Relational Selectivity Estimation
Joshua Spiegel and Neoklis Polyzotis
University of California, Santa Cruz
2
Motivation
Problem: determine the result cardinality of a complex relational query
- Query optimization: cost factors of candidate plans depend on query selectivity
- Data exploration: query selectivity provides timely feedback
Solution: approximate selectivity over data synopses
[Figure: Count(Q) over the relational database yields the selectivity (expensive); Count(Q) over a database synopsis yields a selectivity estimate (efficient).]
3
Previous Work
Table Level Synopses
- Examples: Histograms [Poosala+96], Sketches [Dobra+02], Wavelets [Chakrabarti+00], Table samples [Lipton+93]
- Weakness: do not summarize key values well
Schema Level Synopses
- Examples: Join Synopses [Acharya+99], PRMs [Getoor+01]
- Weakness: restricted to certain types of schemata
[Figure: schema diagram over relations R, T, Z, W, with labels SRZT, SRZW, SR, SZ, ST, SW.]
4
Synopsis Desiderata
Schema level
Capture key/foreign-key joins
Applicable to general schemata and queries
5
Contributions
Tuple Graph Synopsis (TuG)
- Model: semi-structured view of relational data
- Schema level summary
- Schemata with many-to-many relationships
- Complex join queries
TuG construction algorithm
- Basis: tuple clustering
- Novel heuristics
- Builds on existing clustering techniques
Experimental study
- TuGs are effective synopses for small space budgets
- Better accuracy compared to previous techniques
6
Outline
The TuG Synopsis
- Synopsis model
- Estimation framework
TuG Construction
- TuG Compression
- Construction Algorithm
Experimental Study
Conclusions
7
TuG Model: Intuition #1
Relational database ↔ Data graph

Movies
mid  year  genre
1    2005  Action
2    2004  Action
3    2000  Drama

Cast
mid  aid
1    1
1    2
2    3
3    3
3    4

Actors
aid  sex
1    Male
2    Female
3    Male
4    Male

[Figure: the corresponding data graph, with tuple nodes m1-m3, c1-c5, a1-a4 and value nodes Action, Drama, 2005, 2004, 2000, Female, Male.]
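The table-to-graph mapping on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the helper name `build_data_graph` and the adjacency-set representation are assumptions, and the node names (m1, c1, a1, ...) simply follow the slide's figure.

```python
from collections import defaultdict

# Example tables from the slide.
movies = [(1, 2005, "Action"), (2, 2004, "Action"), (3, 2000, "Drama")]
actors = [(1, "Male"), (2, "Female"), (3, "Male"), (4, "Male")]
cast = [(1, 1), (1, 2), (2, 3), (3, 3), (3, 4)]


def build_data_graph(movies, actors, cast):
    """Return an undirected graph as {node: set(neighbors)} (hypothetical helper)."""
    graph = defaultdict(set)

    def link(u, v):
        graph[u].add(v)
        graph[v].add(u)

    # Each tuple becomes a node linked to its attribute values.
    for mid, year, genre in movies:
        link(f"m{mid}", year)
        link(f"m{mid}", genre)
    for aid, sex in actors:
        link(f"a{aid}", sex)
    # Each Cast tuple links to the tuples it references via foreign keys.
    for i, (mid, aid) in enumerate(cast, start=1):
        link(f"c{i}", f"m{mid}")
        link(f"c{i}", f"a{aid}")
    return graph


graph = build_data_graph(movies, actors, cast)
```

Tuple nodes and value nodes share one node space here only for brevity; a real synopsis builder would likely keep them typed.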
8
TuG Model: Intuition #2
Join query ↔ Sub-graph matching
Selectivity ↔ Count of matching sub-graphs

SELECT *
FROM M, C, A
WHERE M.mid = C.mid
  AND C.aid = A.aid
  AND A.sex = 'Male'
  AND M.genre = 'Drama'

[Figure: the data graph from the previous slide with the query pattern M, C, A (Drama on M, Male on A) matched against it.]
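To make the "selectivity is a count of matching sub-graphs" idea concrete, the sketch below evaluates the query exactly over the example tables from the Intuition #1 slide. The function name `count_matches` is hypothetical. Each Cast tuple anchors at most one M-C-A sub-graph, so counting qualifying Cast tuples gives the exact answer for this particular schema.

```python
# Example tables, keyed by primary key.
movies = {1: (2005, "Action"), 2: (2004, "Action"), 3: (2000, "Drama")}
actors = {1: "Male", 2: "Female", 3: "Male", 4: "Male"}
cast = [(1, 1), (1, 2), (2, 3), (3, 3), (3, 4)]


def count_matches(genre, sex):
    """Count M-C-A sub-graphs whose movie has `genre` and whose actor has `sex`."""
    return sum(
        1
        for mid, aid in cast
        if movies[mid][1] == genre and actors[aid] == sex
    )


result = count_matches("Drama", "Male")  # exact selectivity of the slide's query
```

On this data the query matches the two Cast tuples of movie 3, so `result` is 2.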
9
TuG Synopsis Model
Tuple node: set of tuples from the same relation
- Node count: number of tuples
Edge: join between tuple sets
- Edge count: result size of join
[Figure: the data graph (tuple nodes m1-m3, c1-c5, a1-a3) next to its TuG, with tuple nodes mα(2), mβ(1), cα(3), cβ(2), aα(3) and edge counts on the join edges.]
10
TuG Synopsis Model
Value node: a distinct value in the database
Value edge: appearance of value in the tuple set
- Edge count: frequency of value in the tuple set
In practice, distributions are compressed with summaries
[Figure: the data graph next to its TuG, now with value nodes Action, Drama, Female, Male, 2005, 2004 and value-edge counts attached to the tuple nodes mα(2), mβ(1), cα(3), cβ(2), aα(3).]
11
TuG Semantics
Assumption 1: independence across edges
Assumption 2: uniformity along each edge
[Figure: TuG with tuple nodes mα(2), mβ(1), cα(3), cβ(2), aα(3); the join edges into aα carry counts 3 (from cα) and 2 (from cβ), and the value edges of aα carry counts 2 (Male) and 1 (Female).]
For each actor:
- prob[cα ⋈ aα] = 1/3
- prob[cβ ⋈ aα] = 1/3
- prob[sex=Female] = 1/3
- prob[sex=Male] = 2/3
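The per-actor probabilities on this slide can be reproduced from node and edge counts. The sketch below assumes (it is not stated on the slide) that a join probability is the edge count divided by the product of the two node counts, and that a value probability is the value-edge count divided by the node count; the edge counts are read off the figure and the identifier names are illustrative.

```python
from fractions import Fraction

# Counts read from the slide's figure (assumed assignment of edge labels).
node_count = {"c_alpha": 3, "c_beta": 2, "a_alpha": 3}
join_edge = {("c_alpha", "a_alpha"): 3, ("c_beta", "a_alpha"): 2}
value_edge = {("a_alpha", "Male"): 2, ("a_alpha", "Female"): 1}


def join_prob(u, v):
    """Probability that a given tuple of u joins a given tuple of v (uniformity)."""
    return Fraction(join_edge[(u, v)], node_count[u] * node_count[v])


def value_prob(node, value):
    """Probability that a tuple of `node` carries `value`."""
    return Fraction(value_edge[(node, value)], node_count[node])
```

Under these assumptions the four values on the slide (1/3, 1/3, 1/3, 2/3) fall out directly.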
12
Tuple Clustering
Tuple node ↔ Cluster
Join and value probabilities ↔ Centroid
Validity of assumptions ↔ Error of clustering
Tight clusters ⇒ Valid assumptions ⇒ Accurate synopsis
[Figure: the TuG from the previous slide, with tuple nodes mα(2), mβ(1), cα(3), cβ(2), aα(3).]
For each actor:
- prob[cα ⋈ aα] = 1/3
- prob[cβ ⋈ aα] = 1/3
- prob[sex=Female] = 1/3
- prob[sex=Male] = 2/3
The centroid of aα: (cα: 1/3, cβ: 1/3, Female: 1/3, Male: 2/3)
13
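Selectivity Estimation
Query: M ⋈ C ⋈ A with year = 2005 and sex = Male
[Figure: the TuG with tuple nodes mα(2), mβ(1), cα(3), cβ(2), aα(3); the query matches the sub-graph mα, cα, aα.]
Single pass estimation algorithm
Accuracy depends on the validity of our assumptions
Tight clustering ⇒ Accurate estimates
Sub-graph selectivity:
$$
\begin{aligned}
\text{Selectivity} &= (2 \cdot 3 \cdot 3) \cdot \mathrm{Prob}[2005 \wedge m_\alpha \bowtie c_\alpha \wedge c_\alpha \bowtie a_\alpha \wedge \text{Male}] \\
&= (2 \cdot 3 \cdot 3) \cdot \mathrm{Prob}[\text{Male}] \cdot \mathrm{Prob}[m_\alpha \bowtie c_\alpha] \cdot \mathrm{Prob}[c_\alpha \bowtie a_\alpha] \cdot \mathrm{Prob}[2005] \\
&= (2 \cdot 3 \cdot 3) \cdot \tfrac{2}{3} \cdot \tfrac{1}{2} \cdot \tfrac{1}{3} \cdot \tfrac{1}{2} = 1 \text{ tuple}
\end{aligned}
$$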
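The arithmetic on this slide can be checked with a short script. This is a sketch of the estimate for a single matching TuG sub-graph only, under the independence and uniformity assumptions; the function name `estimate`, the dictionary layout, and the edge counts (read from the figure) are assumptions for illustration. A full estimator would presumably sum such terms over every matching sub-graph of the synopsis.

```python
from fractions import Fraction

# Counts for the sub-graph m_alpha - c_alpha - a_alpha (read from the figure).
node_count = {"m_alpha": 2, "c_alpha": 3, "a_alpha": 3}
join_edge = {("m_alpha", "c_alpha"): 3, ("c_alpha", "a_alpha"): 3}
value_edge = {("m_alpha", 2005): 1, ("a_alpha", "Male"): 2}


def estimate(path, predicates):
    """Estimated result size for one TuG path with value predicates."""
    est = Fraction(1)
    # Start from the size of the cross product of the tuple sets.
    for node in path:
        est *= node_count[node]
    # Scale by the join probability of each edge (uniformity along the edge).
    for u, v in zip(path, path[1:]):
        est *= Fraction(join_edge[(u, v)], node_count[u] * node_count[v])
    # Scale by each value probability (independence across edges).
    for node, value in predicates:
        est *= Fraction(value_edge[(node, value)], node_count[node])
    return est


result = estimate(
    ["m_alpha", "c_alpha", "a_alpha"],
    [("m_alpha", 2005), ("a_alpha", "Male")],
)
```

With these counts the product is 18 · 1/2 · 1/3 · 1/2 · 2/3, which equals the slide's answer of 1 tuple.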
15
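The Node-Merge Operation
Collapse a set of nodes into one new node
- New node acquires aggregate characteristics
- New centroid represents the union of the tuple sets
[Figure: merging aα and aβ. Before: cα(6) joins aα(2) with edge count 4 and aβ(2) with edge count 2; each node has a Male edge with count 2. After: cα(6) joins aγ(4) with edge count 6; the Male edge has count 4.]
Centroid of aα: (cα: 1/3, Male: 1)
Centroid of aβ: (cα: 1/6, Male: 1)
Centroid of aγ: (cα: 1/4, Male: 1)
- When is a merge lossless?
- How do we quantify lossy merges?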
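The merge on this slide can be reproduced by summing node counts and edge counts, then recomputing the centroid. The sketch below is an illustration under assumptions: the node representation, the names `merge` and `centroid`, and the rule that a join probability is edge count / (node count × neighbor count) are not taken from the slides.

```python
from fractions import Fraction


def merge(nodes):
    """Collapse nodes into one: counts and per-neighbor edge counts are summed."""
    merged = {"count": 0, "edges": {}}
    for node in nodes:
        merged["count"] += node["count"]
        for neighbor, cnt in node["edges"].items():
            merged["edges"][neighbor] = merged["edges"].get(neighbor, 0) + cnt
    return merged


def centroid(node, neighbor_count):
    """Join/value probabilities of a node; value neighbors have implicit count 1."""
    return {
        neighbor: Fraction(cnt, node["count"] * neighbor_count.get(neighbor, 1))
        for neighbor, cnt in node["edges"].items()
    }


# Numbers from the slide's figure.
neighbor_count = {"c_alpha": 6}
a_alpha = {"count": 2, "edges": {"c_alpha": 4, "Male": 2}}
a_beta = {"count": 2, "edges": {"c_alpha": 2, "Male": 2}}
a_gamma = merge([a_alpha, a_beta])
```

The merged centroid (1/4 toward cα) lies between the two original centroids (1/3 and 1/6), which is why this particular merge loses information.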
16
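Full Similarity Merge
Nodes are fully similar if they have the same centroids
[Figure: merging aα and aβ. Before: cα(12) joins aα(2) with edge count 4 and aβ(4) with edge count 8; the Male edges have counts 2 and 4. After: the merged node aγ(6) has join edge count 12 and Male edge count 6.]
Centroid of aα: (cα: 1/6, Male: 1)
Centroid of aβ: (cα: 1/6, Male: 1)
Centroid of aγ: (cα: 1/6, Male: 1)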
17
All-but-one Similarity
Nodes are all-but-one similar if they have the same centroids with respect to all schema neighbors but one
Theorem: a merge of all-but-one similar nodes is lossless
- Order of merging can affect final compression
- Potential application in other domains (e.g., XML summarization)
[Figure: lossless merge of aα(2) and aβ(4) into aγ(6); both nodes join cα(12) and cβ(8) and differ only in their Male/Female value edges.]
Centroid of aα: (cα: 1/6, cβ: 1/8, Male: 1, Female: 0)
Centroid of aβ: (cα: 1/6, cβ: 1/8, Male: 1/4, Female: 3/4)
Centroid of aγ: (cα: 1/6, cβ: 1/8, Male: 1/2, Female: 1/2)
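A similarity test along these lines might look as follows. This is a sketch under assumptions: centroids are grouped by schema neighbor (here the Cast relation and the sex attribute, a reading of the figure rather than something the slide states), the function names are hypothetical, and fully similar nodes are treated as a special case of all-but-one similar.

```python
from fractions import Fraction as F


def differing_neighbors(c1, c2):
    """Schema neighbors on which two centroids disagree."""
    return [s for s in sorted(set(c1) | set(c2)) if c1.get(s) != c2.get(s)]


def fully_similar(c1, c2):
    return not differing_neighbors(c1, c2)


def all_but_one_similar(c1, c2):
    # Assumption: zero differing neighbors also qualifies.
    return len(differing_neighbors(c1, c2)) <= 1


# Centroids from the slide, grouped by schema neighbor.
a_alpha = {
    "Cast": {"c_alpha": F(1, 6), "c_beta": F(1, 8)},
    "sex": {"Male": F(1), "Female": F(0)},
}
a_beta = {
    "Cast": {"c_alpha": F(1, 6), "c_beta": F(1, 8)},
    "sex": {"Male": F(1, 4), "Female": F(3, 4)},
}
```

Here the two nodes agree on their join probabilities to Cast and differ only in the sex distribution, so they pass the all-but-one test but not the full-similarity test.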
18
Effect of All-but-one Similarity
Number of nodes in synopsis graph
Data set | Data graph  | Full-Similarity Synopsis | Ab1-Similarity Synopsis
TPCH     | 8 million   | 4.4 million              | 33K
IMDB     | 4.7 million | 4.5 million              | 65K
19
Lossy Merges
Question: When is a lossy merge good?
Intuition: Similar centroids ⇒ Good merge
Measure merge quality by error of centroid clustering
- Radius, Diameter, Manhattan Distance, …
[Figure: nodes bα(3), bβ(2), bγ(5) joined to aα(10) and cα(8), with edge counts 12, 6, 6, 8, 6, 5; beside it, the b centroids plotted with axes "Join prob. to aα" and "Join prob. to cα".]
Centroids (join prob. to aα, join prob. to cα):
bα: (0.4, 0.3)
bβ: (0.3, 0.4)
bγ: (0.1, 0.1)
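As a small illustration of ranking candidate merges by centroid distance, the sketch below applies Manhattan distance to the three centroids on this slide and picks the closest pair. It only illustrates the distance measure; the names `manhattan` and `closest_pair` are hypothetical, and the actual construction algorithm selects merges through clustering rather than exhaustive pair comparison.

```python
from itertools import combinations

# Centroids from the slide: (join prob. to a_alpha, join prob. to c_alpha).
centroids = {
    "b_alpha": (0.4, 0.3),
    "b_beta": (0.3, 0.4),
    "b_gamma": (0.1, 0.1),
}


def manhattan(p, q):
    """Sum of absolute coordinate differences between two centroids."""
    return sum(abs(x - y) for x, y in zip(p, q))


def closest_pair(centroids):
    """Pair of node names whose centroids are nearest in Manhattan distance."""
    return min(
        combinations(sorted(centroids), 2),
        key=lambda pair: manhattan(centroids[pair[0]], centroids[pair[1]]),
    )


best = closest_pair(centroids)
```

For these centroids bα and bβ are about 0.2 apart, while bγ is about 0.5 from either of them, so merging bα with bβ is the least lossy choice under this measure.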
20
Construction Algorithm
Database → Reference Synopsis
- All-but-one similar (lossless) merges
- Adaptive selection of merge operations
Reference Synopsis → Join Compressed TuG
- Lossy merges
- Good merges are identified by adaptive clustering
- Clustering algorithm: BIRCH [Zhang+96] + CM-Sketches [CM04]
Join Compressed TuG → Value Compressed TuG
- Value distributions → Histograms
- Histograms are shared among nodes with similar distributions
[Figure: pipeline Database → Reference Synopsis → TuG → TuG, with stages Lossless Merge Compression, Lossy Merge Compression, Value Compression.]
22
Techniques
TuGs
Join Synopses [Acharya+99]
Multidimensional wavelets [Chakrabarti+00]
Single dimensional histograms [Poosala+96]
- Generated by commercial database System X
23
Data and Queries
Data Sets
Dataset | Size   | Data Graph Nodes | Space Budget
TPCH    | 1 GB   | 8 Million        | 30 KB
IMDB    | 139 MB | 4.7 Million      | 20 KB
Workload
- ~200 randomly generated positive queries
- 4-8 join predicates
- 1-7 value predicates
24
Evaluation Metric
Absolute relative error (ARE)
$$
\mathrm{ARE} = \frac{|\,\text{selectivity} - \text{estimate}\,|}{\max(\text{selectivity},\ \text{sanity bound})}
$$
Sanity bound = 10th percentile of the true selectivities of the workload
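The metric can be computed as in the sketch below. The function names are hypothetical, and the nearest-rank percentile is an assumption; the slides do not say which percentile definition was used.

```python
def sanity_bound(true_selectivities, pct=10):
    """Nearest-rank percentile of the workload's true selectivities (assumed definition)."""
    ordered = sorted(true_selectivities)
    # Ceiling division in integer arithmetic avoids floating-point rounding.
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]


def absolute_relative_error(selectivity, estimate, bound):
    """ARE = |selectivity - estimate| / max(selectivity, sanity bound)."""
    return abs(selectivity - estimate) / max(selectivity, bound)
```

The sanity bound keeps queries with very small true selectivity from dominating the error: with a bound of 10, a true count of 2 estimated as 5 scores 0.3 instead of 1.5.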
25
Estimation Error - TPCH
[Figure: plot with axes "Absolute Relative Error (%)" and "Queries in Workload (%)"; series: TuG, Histograms, Join Synop.]
TuG error is less than 30% for 56% of the queries in the workload
Join Synopsis error is less than 30% for 40% of the queries in the workload
Histogram error is less than 30% for 25% of the queries in the workload
26
Estimation Error - IMDB
[Figure: plot with axes "Absolute Relative Error (%)" and "Queries in Workload (%)"; series: TuG, Wavelet.]
TuGs have significantly less estimation error for most queries in the workload
Join Synopses are not applicable for this schema
27
Conclusions
TuG Synopses
- Schema-level relational summaries
- Model: semi-structured view of the relational data set
- Selectivity estimates for complex join queries
- Support for a large class of practical schemata
- Effective construction algorithm
Experimental Results
- Accurate selectivity estimates given a small budget
- Benefits over existing techniques
28
Questions?
29
Construction Times
TuG Construction Times
- TPCH: 55 minutes
- IMDB: 85 minutes
Histograms and Join Synopses can be constructed relatively quickly (e.g. < 10 minutes for our datasets)
Multidimensional wavelets are prohibitively expensive to construct over key values
[Figure: pipeline Database → Reference Synopsis → TuG → TuG, with stages Lossless Merge Compression, Lossy Merge Compression, Value Compression.]
30
Estimation Error - IMDB
[Figure: plot with axes "Absolute Relative Error" and "Queries in Workload (%)"; series: TuG, Histograms.]
31
A synopsis should be:
Accurate
Much smaller than the database
Efficient to construct
Applicable for any schema and query
- Many-to-many relationships
- Join graphs with cycles
[Figure: two example schemata, one over Movies, Cast, Actors and one over Customer, Region, Orders.]