Graph-Based Synopses for Relational Selectivity Estimation Joshua Spiegel and Neoklis Polyzotis...

Graph-Based Synopses for Relational Selectivity Estimation

Joshua Spiegel and Neoklis Polyzotis
University of California, Santa Cruz

Motivation
- Problem: determine the result cardinality of a complex relational query
  - Query optimization: cost factors of candidate plans depend on query selectivity
  - Data exploration: query selectivity provides timely feedback
- Solution: approximate selectivity over data synopses

[Figure: evaluating Count(Q) over the relational database yields the exact selectivity but is expensive; evaluating Count(Q) over a database synopsis yields a selectivity estimate efficiently.]

Previous Work
Table-level synopses
- Examples: Histograms [Poosala+96], Sketches [Dobra+02], Wavelets [Chakrabarti+00], Table samples [Lipton+93]
- Weakness: do not summarize key values well
Schema-level synopses
- Examples: Join Synopses [Acharya+99], PRMs [Getoor+01]
- Weakness: restricted to certain types of schemata

Synopsis Desiderata
- Schema level
- Capture key/foreign-key joins
- Applicable to general schemata and queries

Contributions
Tuple Graph Synopsis (TuG)
- Model: Semi-structured view of relational data
- Schema-level summary
  - Schemata with many-to-many relationships
  - Complex join queries
TuG construction algorithm
- Basis: Tuple clustering
- Novel heuristics
- Builds on existing clustering techniques
Experimental study
- TuGs are effective synopses for small space budgets
- Better accuracy compared to previous techniques

Outline
The TuG Synopsis
- Synopsis model
- Estimation framework
TuG Construction
- TuG Compression
- Construction Algorithm
Experimental Study
Conclusions

TuG Model: Intuition #1
Relational database ↔ Data graph

Movies
mid   year   genre
1     2005   Action
2     2004   Action
3     2000   Drama

Actors
aid   sex
1     Male
2     Female
3     Male
4     Male

Cast
mid   aid

[Figure: the data graph of the database, with value nodes such as Action and Female]

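TuG Model: Intuition #2
- Join query ↔ Sub-graph matching
- Selectivity ↔ Count of matching sub-graphs

    SELECT *
    FROM M, C, A
    WHERE M.mid = C.mid
      AND C.aid = A.aid
      AND A.sex = 'Male'
      AND M.genre = 'Drama'

[Figure: the query sub-graph, with value nodes Male and Drama, matched against the data graph, which also contains the value nodes Action and Female]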
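The correspondence between a join query and sub-graph matching can be checked on the toy database. The sketch below is only an illustration: the Movies and Actors rows come from the slide, while the Cast rows are hypothetical, because the slide does not list them. It counts the query result once with SQLite and once by matching movie-cast-actor paths, and the two counts agree.

```python
import sqlite3

# Movies and Actors rows are taken from the slide. The Cast rows are
# hypothetical, since the slide does not list them.
MOVIES = [(1, 2005, "Action"), (2, 2004, "Action"), (3, 2000, "Drama")]
ACTORS = [(1, "Male"), (2, "Female"), (3, "Male"), (4, "Male")]
CAST = [(1, 1), (1, 2), (2, 2), (3, 3), (3, 4)]  # (mid, aid), assumed


def exact_count(genre, sex):
    """Exact result cardinality of the three-way join, computed by SQLite."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE M (mid INTEGER, year INTEGER, genre TEXT)")
    con.execute("CREATE TABLE A (aid INTEGER, sex TEXT)")
    con.execute("CREATE TABLE C (mid INTEGER, aid INTEGER)")
    con.executemany("INSERT INTO M VALUES (?, ?, ?)", MOVIES)
    con.executemany("INSERT INTO A VALUES (?, ?)", ACTORS)
    con.executemany("INSERT INTO C VALUES (?, ?)", CAST)
    (count,) = con.execute(
        "SELECT COUNT(*) FROM M, C, A "
        "WHERE M.mid = C.mid AND C.aid = A.aid AND A.sex = ? AND M.genre = ?",
        (sex, genre),
    ).fetchone()
    con.close()
    return count


def subgraph_count(genre, sex):
    """Same count, obtained by matching movie-cast-actor paths in the data graph."""
    movie_genre = {mid: g for mid, _, g in MOVIES}
    actor_sex = dict(ACTORS)
    return sum(
        1
        for mid, aid in CAST
        if movie_genre[mid] == genre and actor_sex[aid] == sex
    )
```

Each Cast tuple is one path in the data graph, so counting the paths whose endpoints carry the requested values gives the same number as the join.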

TuG Synopsis Model
- Tuple node: Set of tuples from the same relation
  - Node count: number of tuples
- Edge: Join between tuple sets
  - Edge count: result size of join

[Figure: the data graph and the corresponding TuG, with tuple nodes mα(2), mβ(1), cα(3), cβ(2), aα(3) and value nodes Action and Female; node counts in parentheses]

TuG Synopsis Model
- Value node: A distinct value in the database
- Value edge: Appearance of value in the tuple set
  - Edge count: frequency of value in the tuple set
- In practice, distributions are compressed with summaries

[Figure: the data graph and the corresponding TuG, with tuple nodes mα(2), mβ(1), cα(3), cβ(2), aα(3) connected to value nodes Action and Female]

TuG Semantics
- Assumption 1: Independence across edges
- Assumption 2: Uniformity along each edge

[Figure: TuG with tuple nodes mα(2), mβ(1), cα(3), cβ(2), aα(3) and value nodes Action and Female]

For each actor:
- prob[cα ⋈ aα] = 1/3
- prob[cβ ⋈ aα] = 1/3
- prob[sex=Female] = 1/3
- prob[sex=Male] = 2/3
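One reading of the uniformity assumption is that a join probability equals the edge count divided by the product of the two node counts, and a value probability equals the value frequency divided by the node count. The sketch below reproduces the per-actor probabilities under that reading. The node counts are from the slide; the edge counts and the function names are assumptions chosen to be consistent with the stated probabilities.

```python
from fractions import Fraction

# Node counts are read off the slide. The edge counts are assumptions,
# chosen to be consistent with the probabilities the slide states.
NODE_COUNT = {"c_alpha": 3, "c_beta": 2, "a_alpha": 3}
JOIN_EDGE = {("c_alpha", "a_alpha"): 3, ("c_beta", "a_alpha"): 2}  # assumed
VALUE_EDGE = {("a_alpha", "Female"): 1, ("a_alpha", "Male"): 2}    # assumed


def join_prob(u, v):
    """Probability that a given tuple of u joins a given tuple of v."""
    return Fraction(JOIN_EDGE[(u, v)], NODE_COUNT[u] * NODE_COUNT[v])


def value_prob(node, value):
    """Probability that a tuple of `node` carries `value`."""
    return Fraction(VALUE_EDGE[(node, value)], NODE_COUNT[node])
```

With these counts, `join_prob("c_alpha", "a_alpha")` and `join_prob("c_beta", "a_alpha")` both evaluate to 1/3, and the value probabilities are 1/3 for Female and 2/3 for Male.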

Tuple Clustering
- Tuple node ↔ Cluster
- Join and value probabilities ↔ Centroid
- Validity of assumptions ↔ Error of clustering
- Tight clusters → Valid assumptions → Accurate synopsis

[Figure: TuG with tuple nodes mα(2), mβ(1), cα(3), cβ(2), aα(3) and value nodes Action and Female]

The centroid of aα: (cα: 1/3, cβ: 1/3, Female: 1/3, Male: 2/3)

For each actor:
- prob[cα ⋈ aα] = 1/3
- prob[cβ ⋈ aα] = 1/3
- prob[sex=Female] = 1/3
- prob[sex=Male] = 2/3

Selectivity Estimation

[Figure: the query sub-graph, with value predicates 2005 and Male, matched against the TuG nodes mα(2), cα(3), aα(3); the TuG also contains mβ(1), cβ(2) and the value nodes Action and Female]

- Single-pass estimation algorithm
- Accuracy depends on the validity of our assumptions
  - Tight clustering → Accurate estimates

Sub-graph selectivity
  = (2 · 3 · 3) · Prob[2005 ∧ mα ⋈ cα ∧ cα ⋈ aα ∧ Male]
  = (2 · 3 · 3) · Prob[2005] · Prob[mα ⋈ cα] · Prob[cα ⋈ aα] · Prob[Male]
  = 1 tuple
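The estimate multiplies the product of the node counts by one probability per join edge and one per value predicate. The sketch below is a minimal illustration of that product, not the authors' implementation. The node counts (2, 3, 3) and the result of 1 tuple are from the slide; the individual edge counts and value frequencies are assumptions, picked so that Prob[cα ⋈ aα] = 1/3 and Prob[Male] = 2/3 as stated earlier and so that the product reproduces the slide's result.

```python
from fractions import Fraction
from math import prod


def estimate(node_counts, join_edges, value_edges):
    """Estimate sub-graph selectivity under independence and uniformity.

    node_counts: {node: tuple count}
    join_edges:  {(u, v): edge count} for the joins in the query
    value_edges: {(node, value): frequency} for the value predicates
    """
    size = prod(node_counts.values())
    p_join = prod(
        (Fraction(e, node_counts[u] * node_counts[v])
         for (u, v), e in join_edges.items()),
        start=Fraction(1),
    )
    p_value = prod(
        (Fraction(f, node_counts[n]) for (n, _), f in value_edges.items()),
        start=Fraction(1),
    )
    return size * p_join * p_value


# Node counts from the slide; edge counts and frequencies are assumed.
NODES = {"m_alpha": 2, "c_alpha": 3, "a_alpha": 3}
JOINS = {("m_alpha", "c_alpha"): 3, ("c_alpha", "a_alpha"): 3}  # assumed
VALUES = {("m_alpha", 2005): 1, ("a_alpha", "Male"): 2}         # assumed
```

Calling `estimate(NODES, JOINS, VALUES)` multiplies 18 by 1/2 · 1/3 · 1/2 · 2/3 and yields exactly 1. Each edge is visited once, which is what makes a single pass over the query sub-graph sufficient.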

Outline

The Node-Merge Operation
Collapse a set of nodes into one new node
- New node acquires aggregate characteristics
- New centroid represents the union of the tuple sets

[Figure: merging aα(2) and aβ(2), joined with cα(6) by edges with counts 4 and 2, into aγ(4), joined with cα(6) by an edge with count 6. Centroids over (cα, Male): aα = (1/3, 1), aβ = (1/6, 1), aγ = (1/4, 1).]

- When is a merge lossless?
- How do we quantify lossy merges?
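The merge in the figure can be reproduced by summing node counts and edge counts and then recomputing the centroid. The sketch below assumes that a centroid entry is the edge count divided by the product of the two node counts, with value nodes treated as having count 1. The join edge counts (4, 2, 6) and the merged Male count (4) are read from the figure; the Male counts of the two input nodes are inferred from their centroid value of 1, and the function names are hypothetical.

```python
from fractions import Fraction


def merge(nodes):
    """Merge tuple nodes by summing node counts and per-neighbor edge counts.

    Each node is a pair (count, {neighbor: edge_count}).
    """
    total = sum(count for count, _ in nodes)
    edges = {}
    for _, node_edges in nodes:
        for neighbor, e in node_edges.items():
            edges[neighbor] = edges.get(neighbor, 0) + e
    return total, edges


def centroid(node, neighbor_counts):
    """Per-neighbor probabilities: edge_count / (node_count * neighbor_count)."""
    count, edges = node
    return {
        neighbor: Fraction(e, count * neighbor_counts[neighbor])
        for neighbor, e in edges.items()
    }


# Value nodes are given count 1 so the same formula covers value edges.
NEIGHBOR_COUNTS = {"c_alpha": 6, "Male": 1}
A_ALPHA = (2, {"c_alpha": 4, "Male": 2})  # Male count inferred from centroid
A_BETA = (2, {"c_alpha": 2, "Male": 2})   # Male count inferred from centroid
A_GAMMA = merge([A_ALPHA, A_BETA])
```

Here `centroid(A_GAMMA, NEIGHBOR_COUNTS)` gives 1/4 for cα and 1 for Male. The inputs had join probabilities 1/3 and 1/6, so the merged node no longer distinguishes them, which is why this merge is lossy.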

Full Similarity Merge
Nodes are fully similar if they have the same centroids

[Figure: merging aα(2) and aβ(4), both joined with cα(12), into aγ(6). Centroids over (cα, Male): aα = (1/6, 1), aβ = (1/6, 1), aγ = (1/6, 1).]

All-but-one Similarity
- Nodes are all-but-one similar if they have the same centroids with respect to all schema neighbors but one
- Theorem: a merge of all-but-one similar nodes is lossless
- Order of merging can affect final compression
- Potential application in other domains (e.g., XML summarization)

[Figure: lossless merge of aα(2) and aβ(4), both joined with cα(12) and cβ(8), into aγ(6). Centroids over (cα, cβ, Male, Female): aα = (1/6, 1/8, 1, 0), aβ = (1/6, 1/8, 1/4, 3/4), aγ = (1/6, 1/8, 1/2, 1/2).]
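The all-but-one test can be sketched by grouping centroid entries per schema neighbor and counting the neighbors on which two nodes disagree. The centroid values and node counts below are taken from the figure. The grouping into a "Cast" neighbor and a "sex" neighbor, the function names, and the choice to treat fully similar nodes as a special case (zero differing neighbors) are assumptions.

```python
from fractions import Fraction as F

# Centroids from the figure, grouped by schema neighbor (grouping assumed).
A_ALPHA = {
    "Cast": {"c_alpha": F(1, 6), "c_beta": F(1, 8)},
    "sex": {"Male": F(1), "Female": F(0)},
}
A_BETA = {
    "Cast": {"c_alpha": F(1, 6), "c_beta": F(1, 8)},
    "sex": {"Male": F(1, 4), "Female": F(3, 4)},
}


def differing_neighbors(x, y):
    """Schema neighbors on which the two centroids disagree."""
    return sorted(n for n in x.keys() | y.keys() if x.get(n) != y.get(n))


def all_but_one_similar(x, y):
    """True if the centroids agree on every schema neighbor except at most one."""
    return len(differing_neighbors(x, y)) <= 1


def merged_centroid(x, nx, y, ny):
    """Count-weighted average of two centroids over the same dimensions."""
    total = nx + ny
    return {
        n: {k: (nx * x[n][k] + ny * y[n][k]) / total for k in x[n]}
        for n in x
    }
```

For the two nodes in the figure only the "sex" neighbor differs, and `merged_centroid(A_ALPHA, 2, A_BETA, 4)` reproduces the centroid (1/6, 1/8, 1/2, 1/2) shown for aγ.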

Effect of All-but-one Similarity
Number of nodes in synopsis graph

Data set   Data graph    Full-Similarity Synopsis   Ab1-Similarity Synopsis
TPCH       8 million     4.4 million                33K
IMDB       4.7 million   4.5 million                65K

Lossy Merges
- Question: When is a lossy merge good?
- Intuition: Similar centroids → Good merge
- Measure merge quality by error of centroid clustering
  - Radius, Diameter, Manhattan Distance, …

[Figure: tuple nodes bα(3), bβ(2), bγ(5) joined with aα(10) and cα(8). Centroids (join probabilities to aα and cα): bα = (0.4, 0.3), bβ = (0.3, 0.4), bγ = (0.1, 0.1).]
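As an illustration of measuring merge quality, the sketch below computes the Manhattan distance between the three centroids in the figure and picks the closest pair as the best merge candidate. Manhattan distance is one of the measures the slide names; using it alone, and the function names, are assumptions for the sake of the example.

```python
from fractions import Fraction as F
from itertools import combinations

# Centroids from the figure, stored as exact fractions.
CENTROIDS = {
    "b_alpha": (F(4, 10), F(3, 10)),
    "b_beta": (F(3, 10), F(4, 10)),
    "b_gamma": (F(1, 10), F(1, 10)),
}


def manhattan(p, q):
    """Manhattan (L1) distance between two centroids."""
    return sum(abs(a - b) for a, b in zip(p, q))


def closest_pair(centroids):
    """Pair of node names whose centroids are closest in Manhattan distance."""
    return min(
        combinations(sorted(centroids), 2),
        key=lambda pair: manhattan(centroids[pair[0]], centroids[pair[1]]),
    )
```

The distance between bα and bβ is 0.2, while each of them is 0.5 away from bγ, so `closest_pair(CENTROIDS)` selects bα and bβ.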

Construction Algorithm
Database → Reference synopsis
- All-but-one similar (lossless) merges
- Adaptive selection of merge operations
Reference synopsis → Join-compressed TuG
- Lossy merges
- Good merges are identified by adaptive clustering
- Clustering algorithm: BIRCH [Zhang+96] + CM-Sketches [CM04]
Join-compressed TuG → Value-compressed TuG
- Value distributions → Histograms
- Histograms are shared among nodes with similar distributions

[Figure: pipeline Database → (lossless merge compression) → Reference Synopsis → (lossy merge compression) → TuG → (value compression) → TuG]
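The lossy compression stage can be pictured with a much simpler stand-in than the clustering the slides describe. The sketch below is a plain greedy agglomerative loop, not BIRCH and without CM-Sketches: it repeatedly merges the two nodes with the closest centroids until a node budget is met. Expressing the budget as a number of nodes rather than bytes, the count-weighted centroid average, and all names are assumptions.

```python
from fractions import Fraction as F
from itertools import combinations


def manhattan(p, q):
    """Manhattan (L1) distance between two centroids."""
    return sum(abs(a - b) for a, b in zip(p, q))


def merge(a, b):
    """Merge two (count, centroid) nodes using a count-weighted average."""
    (na, ca), (nb, cb) = a, b
    n = na + nb
    return n, tuple((na * x + nb * y) / n for x, y in zip(ca, cb))


def greedy_compress(nodes, budget):
    """Merge closest pairs until at most `budget` nodes remain.

    nodes: list of (count, centroid) pairs over the same centroid dimensions.
    """
    nodes = list(nodes)
    while len(nodes) > max(budget, 1):
        i, j = min(
            combinations(range(len(nodes)), 2),
            key=lambda ij: manhattan(nodes[ij[0]][1], nodes[ij[1]][1]),
        )
        merged = merge(nodes[i], nodes[j])
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [merged]
    return nodes


# Example input: the three nodes from the lossy-merge figure.
B_NODES = [
    (3, (F(4, 10), F(3, 10))),
    (2, (F(3, 10), F(4, 10))),
    (5, (F(1, 10), F(1, 10))),
]
```

With a budget of two nodes, `greedy_compress(B_NODES, 2)` merges the first two nodes into one of count 5 with centroid (9/25, 17/50) and leaves the third untouched. The quadratic pair search is only workable for small inputs, which is one reason a scalable clustering method is needed on graphs with millions of nodes.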

Outline

Techniques
- TuGs
- Join Synopses [Acharya+99]
- Multidimensional wavelets [Chakrabarti+00]
- Single-dimensional histograms [Poosala+96]
  - Generated by commercial database System X

Data and Queries
Data Sets

Dataset   Size     Data Graph Nodes   Space Budget
TPCH      1 GB     8 Million          30 KB
IMDB      139 MB   4.7 Million        20 KB

Workload
- ~200 randomly generated positive queries
- 4-8 join predicates
- 1-7 value predicates

Evaluation Metric
- Absolute relative error (ARE)
- Sanity bound = 10th percentile of the true selectivities of the workload

ARE = |selectivity − estimate| / max(selectivity, sanity bound)
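The metric is straightforward to compute once the sanity bound is fixed. The sketch below assumes a nearest-rank definition of the 10th percentile, since the slide does not say which percentile method was used; the function names are hypothetical.

```python
def sanity_bound(true_selectivities, pct=10):
    """Nearest-rank percentile of the true selectivities (method assumed)."""
    ordered = sorted(true_selectivities)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def absolute_relative_error(true_sel, estimate, bound):
    """ARE = |selectivity - estimate| / max(selectivity, sanity bound)."""
    return abs(true_sel - estimate) / max(true_sel, bound)
```

The sanity bound keeps queries with very small true selectivities from dominating the error: with a bound of 10, a true selectivity of 2 and an estimate of 5 give an error of 0.3 rather than 1.5.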

Estimation Error - TPCH

[Figure: Queries in Workload (%) against Absolute Relative Error (%), axis ticks 0 to 140, with curves for TuG, Join Synopses, and Histograms]

- TuG error is less than 30% for 56% of the queries in the workload
- Join Synopsis error is less than 30% for 40% of the queries in the workload
- Histogram error is less than 30% for 25% of the queries in the workload

Estimation Error - IMDB

[Figure: Queries in Workload (%) against Absolute Relative Error (%), axis ticks 0 to 140, with curves for TuG and Wavelet]

- TuGs have significantly less estimation error for most queries in the workload
- Join Synopses are not applicable for this schema

Conclusions
TuG Synopses
- Schema-level relational summaries
- Model: Semi-structured view of the relational data set
- Selectivity estimates for complex join queries
- Support for a large class of practical schemata
- Effective construction algorithm
Experimental Results
- Accurate selectivity estimates given a small budget
- Benefits over existing techniques

Questions?

Construction Times
- TuG construction times
  - TPCH: 55 minutes
  - IMDB: 85 minutes
- Histograms and Join Synopses can be constructed relatively quickly (e.g., < 10 minutes for our datasets)
- Multidimensional wavelets are prohibitively expensive to construct over key values


Estimation Error - IMDB

[Figure: Queries in Workload (%) against Absolute Relative Error, axis ticks 0 to 140, with a curve for Histograms]

A synopsis should be:
- Accurate
- Much smaller than the database
- Efficient to construct
- Applicable for any schema and query
  - Many-to-many relationships
  - Join graphs with cycles

[Figure: example schemata over Movies and Actors, and over Customer, Region, and Orders]
