
Estimating Join Selectivities using Bandwidth-Optimized Kernel Density Models

Martin Kiefer 1, Max Heimel 2, Sebastian Breß 1,3, Volker Markl 1,3

1 Technische Universität Berlin, [email protected]
2 Snowflake Computing, [email protected]
3 German Research Center for Artificial Intelligence (DFKI), [email protected]

ABSTRACT

Accurately predicting the cardinality of intermediate plan operations is an essential part of any modern relational query optimizer. The accuracy of said estimates has a strong and direct impact on the quality of the generated plans, and incorrect estimates can have a negative impact on query performance. One of the biggest challenges in this field is to predict the result size of join operations.

Kernel Density Estimation (KDE) is a statistical method to estimate multivariate probability distributions from a data sample. Previously, we introduced a modern, self-tuning selectivity estimator for range scans based on KDE that outperforms state-of-the-art multidimensional histograms and is efficient to evaluate on graphics cards. In this paper, we extend these bandwidth-optimized KDE models to estimate the result size of single and multiple joins. In particular, we propose two approaches: (1) Building a KDE model from a sample drawn from the join result. (2) Efficiently combining the information from base table KDE models.

We evaluated our KDE-based join estimators on a variety of synthetic and real-world datasets, demonstrating that they are superior to state-of-the-art join estimators based on sketching or sampling.

1. INTRODUCTION

In order to correctly predict the cost of candidate plans, the query optimizer of a relational database engine requires accurate information about the result sizes of intermediate plan operations [32]. The accuracy of these cardinality estimates has a direct impact on the quality of the generated query plans. Incorrect estimates are known to cause unexpectedly bad query performance [6, 15, 20, 23, 28]. In fact, due to the multiplicative nature of joins, errors in these estimates typically propagate exponentially through larger query plans [15]. This means that even small improvements can dramatically improve the information quality that is available to the query optimizer [15, 20].

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For any use beyond those covered by this license, obtain permission by emailing [email protected].

Proceedings of the VLDB Endowment, Vol. 10, No. 13

Copyright 2017 VLDB Endowment 2150-8097/17/08.

A particularly challenging problem is to accurately and consistently predict the result size of joins [34]. The typical approach for this is to combine the information from base table estimators under the assumptions of uniformity, independence, and containment [34]. However, while easy to compute, this approach can cause severe estimation errors if any of the underlying assumptions is violated [6]. Multiple authors suggested specialized methods to tackle the Join Estimation problem, including Sampling [8, 11, 21, 36], Graphical Models [9, 35], and Sketches [19, 30]. However, none of them has established itself as a generally viable solution, leaving the problem as one of the major unsolved challenges in research on query optimization [22].

In prior work, we introduced bandwidth-optimized Kernel Density Models (KDE) as a way to compute multidimensional selectivity estimates. KDE is a data-driven, non-parametric method to estimate probability densities from a data sample [31]. We demonstrated that combining KDE with a query-driven tuning mechanism to optimally pick the so-called bandwidth yields an estimator that typically outperforms the accuracy of state-of-the-art multidimensional histograms. Furthermore, the estimator is easy to maintain and to parallelize on modern hardware [12]. In this paper, we significantly expand upon this work and demonstrate how to estimate the result size of queries spanning multiple equijoins and base table predicates from bandwidth-optimized KDE models. In particular, we explain how to compute estimates from both joint models and combined base table models. We present pruning methods to reduce the computational overhead and demonstrate a mechanism to automatically tune the bandwidth parameters. Based on an extensive experimental evaluation, we found that our family of estimators matches and usually outperforms the accuracy of existing state-of-the-art join estimators like correlated sampling [36] or the AGMS sketch [30]. Finally, we provide all sources and experimental scripts to allow others to reproduce and build upon our results.1

In the following two sections, we introduce background knowledge on the join estimation problem and bandwidth-optimized KDE models. In Section 4, we lay the theoretical foundation of our work, explaining how to estimate join selectivities from a KDE model. Section 5 introduces pruning techniques to reduce the computational overhead, and Section 6 discusses strategies to fine-tune the bandwidth parameter of these models. Finally, Section 7 presents our experimental evaluation, and Section 8 concludes the paper by summarizing our findings.

1 The source code is available at: goo.gl/RejjVk.


2. THE JOIN ESTIMATION PROBLEM

Given a set of n relations $R_1, R_2, \ldots, R_n$ and a query $Q = \sigma_{c_1}(R_1) \bowtie_{\theta_1} \left(\ldots \bowtie_{\theta_{n-1}} \sigma_{c_n}(R_n)\right)$, where $\sigma_{c_i}$ denotes a selection with (local) predicate $c_i$ and $\bowtie_{\theta_i}$ a join with join predicate $\theta_i$, our goal is to predict the fraction of tuples from the Cartesian product $R_1 \times \ldots \times R_n$ that fall into the query's result. This Join Estimation Problem is one of the classic problems from query optimization research [22], and improving the quality of join estimates has a direct and measurable impact on the plan quality produced by cost-based optimizers [6, 15, 18, 20, 23, 28, 33]. We consider the important subproblem where all joins are equijoins.

2.1 Classic Join Estimation

The classic way to estimate an equijoin between two tables R1 and R2 requires us to know the number of distinct keys $n_{R_1.A_1}$ and $n_{R_2.A_1}$ in the corresponding join columns. Assuming uniformity, each distinct key in $R_i$ will appear $|R_i| / n_{R_i.A_1}$ times. Further assuming that the key domain from the table with fewer distinct keys is a subset of the other table's domain (the containment assumption), and assuming that the local predicates c1 and c2 are independent of the join attribute, we arrive at the classic join estimation formula that is used by most query optimizers [32, 34]:

$$|\sigma_{c_1}(R_1) \bowtie \sigma_{c_2}(R_2)| \approx \frac{|\sigma_{c_1}(R_1)| \cdot |\sigma_{c_2}(R_2)|}{\max\left(n_{R_1.A_1}, n_{R_2.A_1}\right)} \qquad (1)$$

While straightforward to derive and easy to compute, the underlying assumptions make Equation (1) susceptible to several sources of estimation errors that can cause substantial under- or overestimation of the join result size [15, 20, 34]. This instability has inspired several researchers to investigate more sophisticated methods for estimating join result sizes. These methods can be broadly categorized into two classes: While base table models dynamically combine the information from individual estimators, joint models directly model the value distribution for preselected joins. Joint models usually produce more accurate estimates but are less flexible and harder to maintain than base table models.
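To make Equation (1) concrete, the following minimal Python sketch computes the classic estimate; the cardinalities and distinct-key counts in the usage example are hypothetical, not values from the paper.

    def classic_join_estimate(card_r1, card_r2, ndv_r1, ndv_r2):
        # Equation (1): product of the filtered cardinalities, divided by
        # the larger number of distinct join keys (containment assumption).
        return (card_r1 * card_r2) / max(ndv_r1, ndv_r2)

    # Hypothetical example: |sigma_c1(R1)| = 10000, |sigma_c2(R2)| = 5000,
    # 1000 distinct keys in R1.A1 and 800 distinct keys in R2.A1:
    estimate = classic_join_estimate(10000, 5000, 1000, 800)  # -> 50000.0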

2.2 Sampling-based Join Estimation

Sampling is a powerful tool to estimate selectivities for both individual and joined query results. Creating and maintaining a random sample from database tables is a well-understood topic [37], and we can directly produce estimates from base table samples by evaluating the actual query on them [11, 26]. However, joining base table samples has one major drawback: While it is an unbiased estimator of the selectivity, its variance is very high for small sample sizes due to missing join partners in the sample. While this problem can be mitigated using non-uniform methods like end-biased [7] or correlated sampling [36], the created samples are targeted to predefined joins. Another possibility is to build a joint model by sampling directly from the result of a particular join. Sampling from a join result, and maintaining said sample under updates, deletions and insertions, are well-understood problems, and can be done efficiently in the presence of join indexes [5, 26]. Such a join sample generally produces better estimates and requires smaller sample sizes compared to evaluating the query based on base table samples [11, 21]. Leis et al. showed that estimates computed from join samples significantly improve plan quality [21].

Figure 1: A Kernel Density Estimator approximates the underlying distribution of a given dataset (a) by picking a random sample of data points from the set (b), centering local probability distributions (kernels) around the sampled points (c), and averaging those local distributions (d).

However, join samples are limited to queries that use the exact same join from which the sample was drawn.
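For illustration, here is a minimal sketch of naïve sample evaluation for a single equijoin with local predicates; the samples are assumed to be uniform base table samples represented as lists of rows, and all names are hypothetical.

    def naive_sample_join_estimate(sample1, sample2, card1, card2,
                                   key1, key2, pred1, pred2):
        # Evaluate the query on the samples: count matching pairs ...
        matches = sum(
            1
            for t1 in sample1 if pred1(t1)
            for t2 in sample2 if pred2(t2) and t1[key1] == t2[key2]
        )
        # ... and scale the observed selectivity to the full cross product.
        selectivity = matches / (len(sample1) * len(sample2))
        return selectivity * card1 * card2

With small samples, matches is frequently zero, which is exactly the high-variance problem described above.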

2.3 The AGMS Sketch

The AGMS sketch is a probabilistic data structure to estimate the join size between two data streams [1]. Given a data stream $A = [v_1, v_3, v_2, \ldots]$, every distinct value $v_i$ is associated with a uniform random variable $X_i \in \{-1, +1\}$. Whenever the value $v_i$ is encountered in stream A, the counter $sk(A)$ is incremented by the value of $X_i$. Given a sketch $sk(B)$ constructed from the stream B with the same set of random variables, an estimate for the join cardinality $|A \bowtie B|$ is given by $sk(A) \cdot sk(B)$. While a single pair of AGMS sketches does not provide reliable results, the variance of the estimator is reduced by a factor of n when the estimates provided by n pairs of sketches are averaged. The AGMS sketch can be extended to multiple joins. It also extends to selections by representing them as a join with all values matching the selection predicate [36].

3. BANDWIDTH-OPTIMIZED KDE

Kernel Density Estimation (KDE) is a data-driven, non-parametric method to estimate a probability density function from a data sample [31]. We illustrate the principal idea behind a KDE-based estimator in Figure 1: Based on a sample (Figure 1(b)) drawn from a table (Figure 1(a)), KDE places local probability density functions, the so-called kernels, around the sample points (Figure 1(c)). The probability density function for the overall data is then estimated by summing and averaging over those kernels (Figure 1(d)).


Figure 2: Tuning the bandwidth is crucial for the estimation quality of KDE. If the bandwidth is too small (a), the estimator overfits the sample. If it is too large (b), all local information is lost.

Formally, based on a data sample $S = \{t^{(1)}, \ldots, t^{(s)}\}$ of size s from a d-dimensional dataset, multivariate KDE defines an estimator $\hat{p}(x) : \mathbb{R}^d \to \mathbb{R}$ that assigns a probability density to each point $x \in \mathbb{R}^d$:

$$\hat{p}(x) = \frac{1}{s} \sum_{i=1}^{s} \hat{p}^{(i)}(x) = \frac{1}{s \cdot |H|} \sum_{i=1}^{s} K\left(H^{-1}\left[t^{(i)} - x\right]\right) \qquad (2)$$

In this equation, $\hat{p}^{(i)}(x)$ denotes the local probability contribution coming from sample point $t^{(i)}$. The function K is the kernel function, which defines the shape of the local probability distributions. In theory, there is a large number of possible choices for K, as any symmetric probability distribution is valid. However, in practice, the kernel function itself has only a minuscule impact on estimation quality [31], and it is typically chosen based on other desired properties. In our case, we will be using the Gaussian kernel, a standard normal distribution, since it simplifies the required derivations. The parameter H is the so-called bandwidth matrix, which controls the spread of the local probability distributions. Figure 2 illustrates the effects of setting this parameter for the estimator from Figure 1: If the bandwidth is chosen too small (Figure 2(a)), the estimator is not smooth enough, resulting in a very spiky distribution that overfits the sample. On the other hand, if the bandwidth is chosen too large (Figure 2(b)), the estimator is smoothed too strongly, losing much of the local information and underfitting the actual distribution. Accordingly, choosing the right bandwidth value is essential to achieve good estimation quality with KDE [12, 31].
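As a minimal illustration of Equation (2), consider a one-dimensional Gaussian KDE with a scalar bandwidth h (in one dimension, the bandwidth matrix H degenerates to h); the sample values below are made up.

    import math

    def kde_density(x, sample, h):
        # Equation (2) in one dimension: average Gaussian kernels of
        # bandwidth h centered at the sample points.
        norm = 1.0 / math.sqrt(2.0 * math.pi)
        return sum(norm * math.exp(-0.5 * ((t - x) / h) ** 2)
                   for t in sample) / (len(sample) * h)

    # Made-up sample; a larger h yields a smoother density estimate.
    density = kde_density(5.0, [1.0, 4.0, 4.5, 7.0], h=1.5)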

In prior work [12], we demonstrated how to derive a selectivity estimator for multidimensional range queries based on KDE. We also demonstrated how to select this estimator's bandwidth parameter by numerically minimizing the estimation error. For this, we plugged the gradient of KDE's estimation error with respect to its bandwidth into an off-the-shelf numerical solver. We then continuously fed this solver with training data obtained from user queries that we collected on-the-fly. This query-driven tuning mechanism allowed us to outperform the accuracy of state-of-the-art multidimensional histograms such as GenHist [10] or STHoles [4], while still offering the flexibility and maintainability of a sample-based method. Furthermore, we explained how our estimator is efficiently evaluated and maintained on GPUs [12, 17] and, thus, identified an interesting use case for GPUs in relational databases besides actual query execution [2, 13].

4. KDE-BASED JOIN ESTIMATION

We will now discuss how to compute the result size of equijoin queries from KDE models. In particular, we introduce two basic strategies: A joint model that works by building a KDE estimator from a sample directly drawn from the join result, and a base table model that works by dynamically combining multiple base table KDE models. By combining base table models, we effectively join their estimated distributions and avoid the problem of empty join results for naïve sample evaluation.

4.1 Estimating from a Join Sample

The straightforward method to estimate joins based on KDE is to build the model from a sample that we drew directly from the join result. Sampling data directly from a join result is well-understood and can be efficiently implemented [26]. Since KDE is inherently sample-based, this method allows us to build a KDE-based join estimator without having to change a single line of code of the base table model. The advantages and disadvantages of this approach are identical to naïve sample evaluation: While we expect the joint model to produce very accurate estimates [11], it is less flexible than any method that combines base table KDE models. In fact, while the latter model can provide selectivity estimates for any arbitrary equijoin, the former requires us to construct and maintain joint models for all potential joins in the query workload.

4.2 Combining Base Table Models

Let us now derive the estimation formula for computing join estimates from individual base table models. In order to simplify this derivation, let us first consider the case of predicting the result size of a two-way equijoin query with local predicates: $Q = \sigma_{c_1}(R_1) \bowtie_{R_1.A_1 = R_2.A_1} \sigma_{c_2}(R_2)$. We will later generalize this to the case of multiple joins. For each individual join key $\nu$, we can express the number of result tuples produced for that key as:

$$\left|\sigma_{R_1.A_1 = \nu \wedge c_1}(R_1)\right| \cdot \left|\sigma_{R_2.A_1 = \nu \wedge c_2}(R_2)\right| = |R_1| \cdot p_1(R_1.A_1 = \nu \wedge c_1) \cdot |R_2| \cdot p_2(R_2.A_1 = \nu \wedge c_2) \qquad (3)$$

In this equation, the function $p_i(c)$ denotes the exact base table selectivity for a predicate c on table $R_i$. We can now define the join selectivity $J(Q) = |Q| / (|R_1| \cdot |R_2|)$ by summing Equation (3) over all keys $\nu \in A$, where A is the join domain, which is the set of distinct join keys. Note that A can be reduced to a subset of the distinct keys in the join columns due to local table predicates:

$$J(Q) = \sum_{\nu \in A} p_1(A_1 = \nu \wedge c_1) \cdot p_2(A_1 = \nu \wedge c_2) \qquad (4)$$

We replace $p_1$ and $p_2$ by our base table estimators $\hat{p}_1$ and $\hat{p}_2$, and arrive at the join selectivity estimator $\hat{J}(Q)$:

$$\hat{J}(Q) = \sum_{\nu \in A} \hat{p}_1(A_1 = \nu \wedge c_1) \cdot \hat{p}_2(A_1 = \nu \wedge c_2) \qquad (5)$$


Substituting the definition of a KDE estimator from Equation (2), we find that both estimators are evaluated and their results are multiplied. By distributivity, we can compute and multiply the individual contributions for all combinations of sample points in their respective samples S1 and S2, and sum over the products.

$$\hat{J}(Q) = \frac{1}{s_1 \cdot s_2} \sum_{\nu \in A} \left(\sum_{i=1}^{s_1} \hat{p}_1^{(i)}(A_1 = \nu \wedge c_1)\right) \left(\sum_{j=1}^{s_2} \hat{p}_2^{(j)}(A_1 = \nu \wedge c_2)\right) = \frac{1}{s_1 \cdot s_2} \sum_{\nu \in A} \sum_{i=1}^{s_1} \sum_{j=1}^{s_2} \hat{p}_1^{(i)}(A_1 = \nu \wedge c_1) \cdot \hat{p}_2^{(j)}(A_1 = \nu \wedge c_2) \qquad (6)$$

Assuming that the kernel functions used by our KDE models are product kernels [31], the following identity holds: $\hat{p}^{(i)}(A_1 = \nu \wedge c) = \hat{p}^{(i)}(A_1 = \nu) \cdot \hat{p}^{(i)}(c)$. Substituting this into Equation (6) allows us to isolate the join-specific parts of the computation:

$$\hat{J}(Q) = \frac{1}{s_1 \cdot s_2} \sum_{\nu \in A} \sum_{i=1}^{s_1} \sum_{j=1}^{s_2} \hat{p}_1^{(i)}(c_1) \cdot \hat{p}_1^{(i)}(A_1 = \nu) \cdot \hat{p}_2^{(j)}(c_2) \cdot \hat{p}_2^{(j)}(A_1 = \nu) = \frac{1}{s_1 \cdot s_2} \sum_{i=1}^{s_1} \sum_{j=1}^{s_2} \hat{p}_1^{(i)}(c_1) \cdot \hat{p}_2^{(j)}(c_2) \cdot \underbrace{\left(\sum_{\nu \in A} \hat{p}_1^{(i)}(A_1 = \nu) \cdot \hat{p}_2^{(j)}(A_1 = \nu)\right)}_{\hat{J}_{i,j}} \qquad (7)$$

We refer to $\hat{J}_{i,j}$ as the cross contribution. Naïvely computing the cross contribution in a selectivity estimation scenario is infeasible, as the join key domain A is potentially huge and generally unknown at query optimization time. Instead, we have to exploit properties of the kernels to avoid explicitly summing over the entire join domain. Equation (8), which is derived in Appendix A, provides a closed-form approximation to the cross contribution for a Gaussian kernel on integer attributes. The Gaussian kernel is a common choice for KDE-based estimators and is the one used in our experimental evaluation.

$$\hat{J}_{i,j} \approx \mathcal{N}_{t_1^{(i)},\, (\delta_1^2 + \delta_2^2)}\left(t_2^{(j)}\right) \qquad (8)$$

In this equation, $\mathcal{N}_{\mu,\sigma^2}$ denotes the probability density function of a normal distribution with mean $\mu$ and variance $\sigma^2$, $t_1^{(i)}$ denotes the i-th sample point from $R_1.A_1$, and $\delta_1$ denotes the join estimator bandwidth for $R_1$.
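Putting Equations (7) and (8) together yields the following naïve quadratic sketch; the per-point local contributions $\hat{p}^{(i)}(c)$ are assumed to be precomputed, and all names are hypothetical.

    import math

    def normal_pdf(x, mean, var):
        return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

    def kde_join_selectivity(keys1, locals1, delta1, keys2, locals2, delta2):
        # Naive evaluation of Equation (7): for every pair of sample points,
        # multiply the local contributions with the cross contribution,
        # approximated via Equation (8) as N_{t1, d1^2 + d2^2}(t2).
        var = delta1 ** 2 + delta2 ** 2
        est = 0.0
        for t1, p1 in zip(keys1, locals1):
            for t2, p2 in zip(keys2, locals2):
                est += p1 * p2 * normal_pdf(t2, t1, var)
        return est / (len(keys1) * len(keys2))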

Algorithm 1: Combining base table KDE models

1  # 1) Apply sample pruning:
2  S1 = S1 \ { t1(i) ∈ S1 | p̂1(i)(c1) < θ }
3  S2 = S2 \ { t2(i) ∈ S2 | p̂2(i)(c2) < θ }
4  # 2) Sort S2 by the join key:
5  S2 = sort(S2, S2.A1)
6  # 3) Apply cross pruning and estimate selectivity:
7  for i in {1, ..., s1}:
8      j = binarySearch(t1(i) - maxdiff(δ1, δ2), S2)
9      while |t1(i) - t2(j)| ≤ maxdiff(δ1, δ2) ∧ j ≤ s2:
10         Ĵ = computeĴ(t1(i), t2(j), δ1, δ2)
11         e += p̂1(i) · Ĵ · p̂2(j)
12         j += 1
13 return e / (s1 · s2)

4.3 Extending to Multiple Joins

In order to generalize our approach to multiple joins, we need to introduce the notion of equivalence classes. For two tables $R_i$ and $R_j$ that join on the attributes $R_i.A_l$ and $R_j.A_m$, we consider the pair of attributes equivalent: $R_i.A_l \sim R_j.A_m$. Note that, by definition, this equivalence also holds transitively. We denote the equivalence class for a given attribute $R_j.A_m$ by $\Gamma(R_j.A_m)$. Each equivalence class contains a set of attributes that have to be equal for all tuples in the join result. Based on this definition, we can now discuss how the cross contribution can be generalized to equivalence classes containing more than two relations, and how we can compute the join selectivity for an arbitrary number of equivalence classes.

First, we consider the case of joins consisting of a single equivalence class $\Gamma(R_1.A_1)$ containing n relations. Since all join attributes are in the same equivalence class, we can still sum over the shared join domain A. Assuming that each joined table $R_i$ has a kernel density estimator model $\hat{p}_i$, we define the generalized cross contribution $\hat{J}_{o_1,\ldots,o_n}$ that needs to be computed for the cross product between all samples:

$$\hat{J}_{o_1,\ldots,o_n} = \sum_{\nu \in A} \prod_{i=1}^{n} \hat{p}_i^{(o_i)}(A_1 = \nu) \qquad (9)$$

Appendix A provides a closed-form approximation for the generalized cross contribution of a Gaussian kernel. Now, since the kernels for each dimension are independent, the final formula to compute the join selectivity for k equivalence classes $\Gamma_1, \ldots, \Gamma_k$ over a total of n relations can be computed by multiplying their respective generalized cross contributions $\hat{J}_j$ (omitting the sample offsets for readability) with the contributions for the local predicates $c_j$:

$$\hat{J}(Q) = \frac{1}{\prod_{i=1}^{n} s_i} \sum_{i_1=1,\ldots,i_n=1}^{s_1,\ldots,s_n} \prod_{j=1}^{n} \hat{p}_j^{(i_j)}(c_j) \prod_{j=1}^{k} \hat{J}_j \qquad (10)$$

5. EFFICIENTLY JOINING KDE MODELS

In the previous section, we derived the theoretical foundation for computing join selectivities from base table models. We now discuss how we can efficiently evaluate them in practice. Our estimator receives the relational query $Q = \sigma_{c_1}(R_1) \bowtie_{R_1.A_1 = R_2.A_1} \sigma_{c_2}(R_2)$, as well as two KDE estimators with their respective base table samples $S_1, S_2$ and bandwidth vectors $\delta_1, \delta_2$. Based on Equation (7), we


know that computing the join selectivity requires us to compute the cross contributions $\hat{J}_{i,j}$ for all $s_1 \cdot s_2$ pairs from the cross product of $S_1$ and $S_2$, incurring quadratic complexity. Accordingly, a naïve implementation would severely limit the scalability and applicability of our approach. In order to reduce the number of required computations, we now introduce two pruning techniques: Sample Pruning and Cross Pruning. Algorithm 1 illustrates these methods and the general selectivity estimation procedure.

5.1 Sample Pruning

If the local contribution for a particular sample tuple is sufficiently small, the contribution of every derived tuple from the cross product will be negligible. Thus, we can omit this sample point in all following computations, which we call Sample Pruning (Lines 2-3). Similar to pushing down selections in query execution plans, we can reduce the number of input tuples that have to be considered in the more expensive computation of the cross contributions. We chose the threshold as the inverse of the cross product size, $\theta = \frac{1}{r_1 \cdot r_2}$, as this limits the overall error to the join cardinality estimate to at most one tuple.

5.2 Cross Pruning

Next, we compute the cross contributions (Line 10), multiply them with their corresponding local contributions and sum them up to compute the join selectivity (Line 11). In this part of the algorithm, we apply Cross Pruning to reduce the computational load: As the Gaussian kernel applies smoothing by distance, it is intuitive that the cross contribution for two sample points becomes negligible when the distance between the two points is very large. Again choosing the maximum tolerable error to be $\frac{1}{r_1 \cdot r_2}$, the maximum tolerable distance between two sample values is:

$$\frac{1}{r_1 \cdot r_2} > \mathcal{N}_{t_1^{(i)},\, \delta_1^2 + \delta_2^2}\left(t_2^{(j)}\right)$$
$$\Longleftrightarrow \frac{1}{r_1 \cdot r_2} > \frac{1}{\sqrt{2\pi\left(\delta_1^2 + \delta_2^2\right)}} \exp\left(-\frac{1}{2} \frac{\left(t_1^{(i)} - t_2^{(j)}\right)^2}{\delta_1^2 + \delta_2^2}\right)$$
$$\Longleftrightarrow \left|t_1^{(i)} - t_2^{(j)}\right| > \sqrt{-2 \cdot \ln\left(\frac{\sqrt{2\pi\left(\delta_1^2 + \delta_2^2\right)}}{r_1 \cdot r_2}\right) \cdot \left(\delta_1^2 + \delta_2^2\right)} \qquad (11)$$

To exploit this property, we first sort S2 on the join attribute (Line 5). Next, instead of iterating over the cross product, the algorithm considers only tuples that are sufficiently close to each other by iterating over all tuples from S1 (Line 7) and finding the first qualifying tuple from S2 via binary search (Line 8). The function maxdiff computes the maximum distance according to Equation (11). Finally, we iterate over all qualifying tuples from S2 (Lines 9-11), compute the cross contribution $\hat{J}$ (Line 10) and iteratively compute the join selectivity (Line 11).
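A compact Python sketch of the pruned evaluation from Algorithm 1; maxdiff follows Equation (11), bisect stands in for the binarySearch step, and r1, r2 are the base table cardinalities. All names are hypothetical.

    import math
    from bisect import bisect_left

    def maxdiff(d1, d2, r1, r2):
        # Equation (11): distance beyond which a cross contribution falls
        # below the error threshold 1/(r1*r2).
        var = d1 ** 2 + d2 ** 2
        ratio = math.sqrt(2.0 * math.pi * var) / (r1 * r2)
        return 0.0 if ratio >= 1.0 else math.sqrt(-2.0 * math.log(ratio) * var)

    def pruned_join_selectivity(s1, s2_sorted, d1, d2, r1, r2):
        # s1, s2_sorted: lists of (join_key, local_contribution) pairs,
        # with s2_sorted sorted by join key (Line 5 of Algorithm 1).
        var = d1 ** 2 + d2 ** 2
        radius = maxdiff(d1, d2, r1, r2)
        keys2 = [key for key, _ in s2_sorted]
        est = 0.0
        for t1, p1 in s1:
            j = bisect_left(keys2, t1 - radius)  # first candidate (Line 8)
            while j < len(keys2) and abs(t1 - keys2[j]) <= radius:
                t2, p2 = s2_sorted[j]
                # Cross contribution (Equation 8) times local contributions.
                cross = (math.exp(-0.5 * (t1 - t2) ** 2 / var)
                         / math.sqrt(2.0 * math.pi * var))
                est += p1 * cross * p2
                j += 1
        return est / (len(s1) * len(s2_sorted))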

Sample pruning and cross pruning can significantly reduce the number of computations required to compute an estimate, in particular when the join and selections are very selective. Sorting, if necessary, can be done in $O(s_2 \log s_2)$; pruning and computing the local contributions can be done in a single pass over each sample [12]. We have to compute $s_1$ binary searches, each requiring at most $\log(s_2)$ accesses to $S_2$. The actual number of elements traversed in the inner while-loop is data-dependent. In the degenerate case of a join that is close to a cross product, we still need to traverse $S_2$ for every tuple in $S_1$, and the complexity of the overall algorithm remains $O(s_1 \cdot s_2)$. However, in the optimal case, we only have to check a handful of tuples from $S_2$, which yields $O(s_1 \log s_2)$. We argue that the degenerate case rarely appears in real-world data and provide an experimental evaluation on real-world data in Section 7.4.

5.3 Extending to Multiple Joins

Generalizing this algorithm to multiple joins requires only a few modifications. In particular, we have to apply sample pruning to all base table samples. We pick a left-deep join order and sort the samples for the right-hand side of all join operators based on their join attribute. This way, we ensure that we need to sort at most $j - 1$ samples for a total of j joins. As we only allow bandwidth values such that the function values of the cross contribution never exceed one, we can handle the joins by subsequent binary searches and apply cross pruning for each of them.

6. BANDWIDTH OPTIMIZATION

In prior work [12], we demonstrated that query-driven bandwidth optimization is crucial to the estimation quality of KDE-based selectivity estimators. During query execution, we observe the true selectivity of operators, which allows us to numerically optimize the bandwidth based on the estimation error. Since we cannot use sample or cross pruning to speed up the gradient computations for base table KDEs (a negligible contribution to the estimate does not imply a negligible contribution to the gradient), we instead rely on a derivative-free, bound-constrained optimization algorithm. In particular, we use constrained optimization by linear approximation (COBYLA) [27] from nlopt [16]. We use the same algorithm for KDE over join samples.

We optimize the bandwidth based on the multiplicative error, which, given the true selectivity c and an estimate $\hat{c}$, is defined as:

$$m(c, \hat{c}) = \frac{\max(c, \hat{c})}{\min(c, \hat{c})} \qquad (12)$$

If the estimate is larger than the actual selectivity, the multiplicative error is equivalent to the relative error; otherwise, it is the inverse of the relative error. Thus, the smallest multiplicative error is 1.0, and over- and underestimations are equally penalized. The multiplicative error is the error metric of choice for cardinality estimation, as it minimizes error propagation in query plans and correlates directly with plan quality [25]. By optimizing for this error function, we ensure that the optimization translates to improved query plans.

Given a set of representative queries $Q_1, \ldots, Q_n$, we then optimize the bandwidth vectors for all tables $\vec{\delta}_1, \ldots, \vec{\delta}_m$ for the geometric mean over the estimation error:

$$\operatorname*{arg\,min}_{\vec{\delta}_1, \ldots, \vec{\delta}_m} \left(\prod_{i=1}^{n} m\left(J(Q_i), \hat{J}(Q_i)\right)\right)^{\frac{1}{n}} \qquad (13)$$

Note that the estimate $\hat{J}$ depends on the bandwidth vectors $\vec{\delta}_1, \ldots, \vec{\delta}_m$. We can only optimize the bandwidth for attributes that are actually covered in the queries $Q_1, \ldots, Q_n$.


We suggest collecting query feedback for all base table filters and subsequent join operators in a query plan. This only requires keeping track of intermediate results, which can be done with very little overhead. The optimization process can then be executed periodically or triggered by a database command. Note that bandwidth optimization does not block the KDE models and, thus, optimization and estimation can be interleaved.
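The following sketch shows the tuning loop around Equations (12) and (13). The paper uses COBYLA from nlopt; this sketch substitutes SciPy's COBYLA implementation and optimizes log-bandwidths to keep them positive. estimate_fn and the feedback records are hypothetical placeholders for the KDE estimator and the collected query feedback.

    import numpy as np
    from scipy.optimize import minimize

    def multiplicative_error(true_sel, est_sel, eps=1e-12):
        # Equation (12): max/min ratio; 1.0 means a perfect estimate.
        c, c_hat = max(true_sel, eps), max(est_sel, eps)
        return max(c, c_hat) / min(c, c_hat)

    def tune_bandwidths(bw_start, feedback, estimate_fn):
        # Minimize the (log of the) geometric mean of the multiplicative
        # error over observed (query, true_selectivity) feedback, as in
        # Equation (13).
        def objective(log_bw):
            bw = np.exp(log_bw)  # log-parameterization keeps bandwidths positive
            logs = [np.log(multiplicative_error(true_sel, estimate_fn(q, bw)))
                    for q, true_sel in feedback]
            return float(np.mean(logs))
        result = minimize(objective, np.log(np.asarray(bw_start, dtype=float)),
                          method="COBYLA")
        return np.exp(result.x)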

7. EVALUATION

In this section, we present the experimental evaluation of our KDE-based join estimators in terms of estimation quality and execution time. All experiments can be reproduced using the code and datasets from our public repository.2

7.1 Experimental Setup

We will now describe the datasets, workloads, and estimators that were used for our experiments.

7.1.1 Compared Estimators

We compared the following estimators:

Postgres: Our first baseline estimator uses the EXPLAIN feature of vanilla Postgres 9.6. Postgres uses the classical join estimation formula, relying on the independence assumption and 1D statistics (frequent values, histograms, and number of distinct values).

Table Sample (TS): Our second baseline estimator, which implements naïve sample evaluation based on uniform samples from the base tables.

Join Sample (JS): The final baseline estimator, which implements naïve sample evaluation based on a single uniform sample from the join result.

Correlated Sample (CS): A sampling-based estimator operating on biased samples constructed by using a common hash function on join attributes [36]. By correlating the samples on the join attribute, the problem of empty join results due to independence is avoided. Compared to other biased sampling algorithms, it does not require prior knowledge of the data distribution.

AGMS: We implemented the AGMS sketch with extensions to filter conditions as proposed in [36]. Random variables were generated by the EH3 hashing scheme, which was shown to be favorable in terms of hash size and generation efficiency [29].

JS+KDE: Our KDE estimator based on a join sample, as described in Section 4.1.

TS+KDE: Our KDE estimator based on base table samples, as described in Section 5.

Note that we specifically evaluated against estimators that are comparable in terms of model construction and supported estimation operations (joins subject to conjunctive base table predicates). In particular, all compared estimators can be constructed in a single pass over the data and can be maintained under updates.

All KDE models use bandwidth vectors optimized for the geometric mean of the multiplicative error on a set of 100 training queries.

2 https://goo.gl/RejjVk

7.1.2 Evaluated Datasets

We conducted our experiments based on the following datasets that cover both synthetic and real-world examples:

SN (Shifted Normal): Synthetic dataset consisting of 100k tuples drawn from normal distributions $\mathcal{N}_{\mu_1,\Sigma}$, $\mathcal{N}_{\mu_2,\Sigma}$, and $\mathcal{N}_{\mu_3,\Sigma}$. Values were rounded to the closest integer to simulate a discrete dataset. The covariance matrix $\Sigma$ was chosen as $\begin{pmatrix} 1200 & 1100 \\ 1100 & 1200 \end{pmatrix}$. The means were chosen as $\mu_1 = (500\ 700)^T$, $\mu_2 = (600\ 700)^T$, and $\mu_3 = (700\ 700)^T$. Thus, all attributes are dense and highly correlated.

IMDb: Real-world dataset based on data from the Internet Movie Database3, which we obtained using the Python package IMDbPY4. Our queries use the tables title (3.5m tuples), movie_keyword (6m), cast_info (50m), company_name (300k), and movie_companies (4m).

DMV: Real-world dataset based on data from a Department of Motor Vehicles [14, 24]. The dataset contains information on cars, their owners, and accidents in six relations containing 23 columns. The relations contain between 269 and 430k tuples.

We used dictionary encoding to transform all string attributes into integers.

7.1.3 Query Workload

Every workload in our evaluation is defined by a prepared query statement and a workload strategy (Uniform, Distinct). The prepared statement contains up to three joins as well as selections with conjunctive range and equality predicates. The left-hand side of these selection predicates is a base table attribute, while the right-hand side is a parameter. We used the following algorithm to generate actual queries from the prepared statement based on the chosen workload strategy:

1. We compute the full join in the prepared statement, while ignoring the selections, and project on the non-join-attributes in the prepared statement.

2. We select a tuple t from the join result based on the specified workload strategy: (Uniform) We select a tuple from the join result with uniform probability. (Distinct) We eliminate duplicates from the join result and draw a tuple with uniform probability.

3. For attributes subject to equality predicates, the values of the previously drawn tuple t become the selection parameters. For an attribute A with range predicates, we need to provide an upper and a lower bound. We retrieve the minimum value $min_A$ and maximum value $max_A$ after applying the equality predicates. Defining the value of attribute A on the tuple t as t.A, we select the upper bound as $u_A = [t.A + (max_A - t.A) \cdot rand()]$ and the lower bound as $l_A = [t.A - (t.A - min_A) \cdot rand()]$. The function rand returns a random real value in [0, 1). (Steps 2 and 3 are sketched in code at the end of this subsection.)

3 http://www.imdb.com/interfaces
4 http://imdbpy.sourceforge.net


The uniform strategy follows the distribution in the join result and therefore favors queries with higher selectivities. As the distinct strategy disregards the distribution of tuples in the join result, it favors queries with lower selectivities. Intuitively, a learning estimator should be more effective on the uniform workload, as the queries focus on strongly represented regions. The distinct workload is harder to learn, as the estimator has to fit the entire dataset.
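A small Python sketch of the query generation procedure (steps 2 and 3 above); the join result is assumed to be materialized as a list of tuples, and all names are hypothetical.

    import random

    def pick_tuple(join_result, strategy):
        # Step 2: Uniform respects duplicate frequencies in the join result;
        # Distinct eliminates duplicates before drawing.
        pool = join_result if strategy == "uniform" else list(set(join_result))
        return random.choice(pool)

    def range_bounds(t_a, min_a, max_a):
        # Step 3: draw the bounds around the value t.A of the chosen tuple;
        # rounding mimics the integer (dictionary-encoded) domain.
        upper = round(t_a + (max_a - t_a) * random.random())
        lower = round(t_a - (t_a - min_a) * random.random())
        return lower, upper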

7.2 Estimation Quality

In the first set of experiments, we compared the accuracy of all estimators to get a feeling for how our KDE-based join estimators stack up against the state of the art. For these experiments, the size of base table samples was fixed to one percent; join samples and the number of AGMS sketches were chosen to match the memory required by the base table samples. While the sample sizes for correlated samples vary inherently, we chose the sampling threshold to meet the target size in expectation. The actual experiment then consisted of optimizing the bandwidth of our KDE-based models on 100 training queries, followed by measuring the multiplicative estimation error for all estimators on another 100 queries from the selected workload. We ran this experiment for different query patterns over the dataset and both workloads, repeating it 20 times for each combination.

7.2.1 SN Dataset

Figure 3 illustrates the results of this experiment for the SN dataset. SN Q1 joins the tables generated with µ1 and µ2 on their first attribute. The remaining two attributes are subject to range selections. Methods based on uniform base table or join samples clearly outperform the other estimators on this workload by more than 60% in terms of the median estimation error. KDE on base tables provides a small improvement of up to 15% over plain base table sample evaluation, while join samples with and without KDE both provide close to perfect estimates. Correlated samples provide the worst estimates for this query and are off by one order of magnitude. As the join is not a PK-FK join and the join attributes are skewed, the problem tackled by correlated samples does not arise.

SN Q2 adds the table generated with µ3 to the join and introduces an additional range predicate to the query. Like for SN Q1, join sample-based estimators produce close to perfect estimates. This is not surprising, as the complexity introduced by the additional join is handled in the sampling process. TS+KDE is clearly superior to all other base table estimators and provides a better median estimation error by a factor of 4 compared to Postgres. In contrast, table samples, AGMS and correlated samples are heavily affected by the additional joins and yield estimates that are off by up to seven orders of magnitude.

Figure 3: Estimation quality, SN dataset (panels: SN Q1 and SN Q2, each for the Uniform and Distinct workloads; one box plot per estimator). The y-axis shows the multiplicative estimation error.

7.2.2 DMV Dataset

Figure 4 illustrates the results of this experiment for the DMV dataset. We evaluate three query patterns over the four main tables of the dataset. DMV Q1 consists of a single join and four base table selections (two range predicates, two equality predicates). DMV Q2 and DMV Q3 successively add an additional join. Furthermore, they add two range selections and one equality selection, respectively.

We observe that TS+KDE is superior to the other base table estimators and outperforms them by at least one order

of magnitude in terms of the median estimation error in all experiments. Only Join Sample and JS+KDE perform better, by up to a factor of four. While these estimators are very close in terms of the median estimation error, JS+KDE improves the estimation errors above the median for DMV Q2 and Q3.

For DMV Q1, correlated sampling outperforms Table Sample by more than an order of magnitude. However, for DMV Q2 and Q3, Postgres, Table Sample and Correlated Sample perform very similarly.

The AGMS sketch scales poorly with the introduced selections and performs worse than the other estimators in this experiment. Its median estimation error compared to Postgres is worse by two orders of magnitude, and the upper whisker and box boundary even extend beyond the plot boundaries. Given that the estimator variance for the AGMS sketch is proportional to the size of the cross product and selections are handled by joining with additional virtual tables [36], this behavior is expected.

7.2.3 IMDb Dataset

Figure 5 illustrates the results of this experiment for the IMDb dataset. IMDb Q1 joins two tables subject to one range and two equality predicates. IMDb Q2 and Q3 add an additional join and equality predicate, respectively.

Join sample and JS+KDE provide close to perfect estimates for almost all experiments. IMDb Q3 with the distinct workload is the only exception to this: While both estimators have comparable median errors, JS+KDE shows a much better error distribution above the median, as the upper box boundary and whisker improved by one and two orders of magnitude, respectively.

TS+KDE provides the best estimates among the base table estimators. We see drastic improvements of an order of magnitude over correlated samples and the AGMS sketch. Compared to Table Sample, we observe drastic improvements of more than an order of magnitude for IMDb Q2 (Distinct), Q3 (Uniform), and Q3 (Distinct); for all other experiments the estimates are comparable.


Figure 4: Estimation quality, DMV dataset (panels: DMV Q1, Q2, and Q3, each for the Uniform and Distinct workloads; one box plot per estimator). The y-axis shows the multiplicative estimation error.

Postgres estimation errors predominantly lie between one and ten, which is very competitive, especially for Q2 and Q3. However, TS+KDE still provides an improvement between factors of two and four for IMDb Q1 and Q2. For IMDb Q3, the provided estimates are comparable.

7.2.4 Discussion

Based on the results of our experiments, we can make the following general observations:

1. KDE-based join estimators generally perform better than the AGMS sketch and traditional estimators that rely on the independence assumption and 1D statistics.

2. KDE-based join estimators never perform significantly worse than their naïve sample evaluation counterparts or correlated sampling, but usually improve the estimates significantly.

7.3 Quality Impact of Model Size

In our second experiment, we investigate how the relationship between the different estimators changes with the model size. For this, we ran our previous experiments while increasing the sample ratio, and accordingly the other model sizes, by factors of two from 0.001 to 0.128. We report the geometric mean error for DMV Q1 (Uniform), as a representative for estimates on a single join, and DMV Q3 (Uniform), as a representative for multiple joins.

Figure 6 illustrates the results of this experiment for DMV Q1 with the uniform workload. There are a few noteworthy observations.

Figure 5: Estimation quality, IMDb dataset. The y-axis shows the multiplicative estimation error.

First, KDE-based estimators are a significant improvement over naïve sample evaluation for small sampling fractions. This is clearly visible for TS+KDE: A sampling fraction of 0.001 is not sufficient for naïve evaluation of base table samples, as the join between the two samples is likely to be empty, causing estimation errors of three orders of magnitude and more. Adding a KDE model improves the estimation error by more than two orders of magnitude, which outperforms Postgres by a factor of two. While correlated samples tackle the same problem and bring substantial improvements over naïve base table samples for smaller sample sizes, KDE models are still clearly superior. Furthermore, we see JS+KDE bringing a 50% improvement over join sample evaluation for a sample size of 0.001.

Second, KDE-based estimators never perform significantly worse than Table Sample. While correlated samples bring improvements of an order of magnitude and more for small sample sizes, they converge more slowly to exact estimates. This causes an intersection point, after which Table Sample yields a smaller estimation error. This is consistent with our observations in Section 7.2, which showed that correlated samples are not always preferable to uniform table samples. As sample evaluation is in the parameter space of KDE-based estimators, their estimation error converges with that of their sample evaluation counterpart for larger model sizes.

Figure 7 illustrates the results of this scaling experiment for DMV Q3 with the uniform workload, which adds two more joins to the query. The key observation is that larger sample sizes are required for the base table estimators to clearly outperform Postgres by 10% or more. TS+KDE provides better estimates at sample size 0.008; correlated samples need twice as many points. A clear improvement for table samples is only visible at a sample size of 0.128.


JS+KDE provides better estimation errors than join sample evaluation for sample sizes 0.001 to 0.004, by a factor between 1.5 and 2.

These experiments confirm that bandwidth-optimized KDE models can significantly improve the estimates computed from samples. Furthermore, the measured estimates were never significantly worse than the estimates provided by Postgres, and were usually much better depending on the sample size and the workload.

Figure 6: Estimation quality on DMV Q1 (Uniform) with growing sample sizes.

Figure 7: Estimation quality on DMV Q3 (Uniform) with growing sample sizes.

7.4 Performance Evaluation

In our final series of experiments, we evaluated the runtime scalability of our estimators for increasing sample sizes.

7.4.1 Setup

As demonstrated in our prior publication, KDE models are very well suited to be accelerated by graphics cards [12]. Accordingly, we implemented all estimators as GPU programs using custom OpenCL kernels and GPU primitives as provided by the Boost.Compute framework5. Naïve sample evaluation on the join sample was implemented as a straightforward table scan; evaluation on the table samples was implemented by applying the local filter predicates on the sample, followed by performing a binary search join. Accordingly, both naïve estimators perform basically the same operations as our KDE estimators, but with less computational overhead and the most aggressive pruning.

5 www.boost.org/libs/compute

The experiment was conducted on a custom server with an AMD Opteron 6376 processor, an NVIDIA GeForce GTX 980 graphics card and 64 GB of DDR3 memory. The server was running Ubuntu Linux 16.04.1 with kernel 4.4.0. The GPU was controlled by NVIDIA's 367.57 driver.

We performed our experiments on the IMDb dataset, as it is the largest of our evaluated datasets. We repeated the previous experiment for two queries over the IMDb dataset, while reporting the average runtime in milliseconds instead of the estimation error.

7.4.2 IMDb Q1 (Uniform)

Figure 8 shows the results for query IMDb Q1 using the uniform workload. The first observation is that the runtime for all sample-based estimators barely increases up to a sample size of 0.016. For the AGMS sketch, we observe the same effect for sample sizes of 0.001 and 0.002. This is caused by the overheads introduced by the OpenCL framework, OpenCL kernel execution and memory transfer dominating the runtime, which is 1.5 ms for table sample estimators, 0.4 ms for join sample estimators and 0.1 ms for AGMS.

Once the actual computation dominates execution times, we see the expected linear increase in the runtime of both AGMS (0.04) and JS+KDE (0.064). As join sample evaluation only requires computationally cheap comparisons and increment operations for every tuple, the framework overhead dominates throughout the experiment.

The runtime for table sample, TS+KDE and correlated sample is very close throughout the experiment, within a 10% margin. This shows the effectiveness of our pruning techniques: While computing the cross contribution is computationally expensive, our pruning techniques reduce the number of computations to a degree that they do not dominate the estimation time in this experiment.

7.4.3 IMDb Q3 (Uniform)

In order to show the performance of our estimator for multiple joins, we repeated the experiment for IMDb Q3, which adds two joins and additional base table predicates. The results are shown in Figure 9. Again, we observe that for smaller sample sizes, the framework overhead dominates the execution time. However, it is slightly larger for base table models, which is due to the additional number of involved tables. For AGMS and JS+KDE, the execution time increases roughly linearly for sample sizes of 0.008 and larger. For Join Sample, the overhead dominates for all sample sizes and the execution time does not increase.

For sample sizes from 0.001 to 0.004, we see that the estimation times for TS+KDE, TS and Correlated Sample barely increase. TS+KDE adds an overhead of at most 50% to naïve samples. From a sample size of 0.032 onwards, TS+KDE imposes a larger overhead and grows roughly linearly, while table and correlated samples increase sublinearly. The additional overhead imposed by KDE over Table Sample is within an order of magnitude. Only for the largest sample size of 0.128 can we observe a substantial overhead of two orders of magnitude.

Our pruning techniques are very effective for this query, as the runtime complexity without them would be quartic in the sample size.


Figure 8: Estimation time on IMDb Q1 (Uniform) with growing sample sizes.

Figure 9: Estimation time on IMDb Q3 (Uniform) with growing sample sizes.

8. CONCLUSION & FUTURE WORK

In this paper, we introduced a novel way to estimate join selectivities based on bandwidth-optimized Kernel Density Estimators. Existing models suffer from at least one of the following drawbacks: They (1) provide inaccurate estimates, (2) are expensive to construct, (3) are restricted to a single type of query, or (4) are expensive or impossible to maintain under changing data.

Our approach uses KDE models, which are constructed from base table or join samples and provide an estimate of the underlying distribution. They apply smoothing to the sample distribution by placing probability density functions on all sample points, averaging over them to compute the final estimate. The degree of smoothing is controlled by a hyper-parameter, the so-called bandwidth. Selecting this bandwidth parameter is essential for the estimation quality and can be done by performing numerical optimization over query feedback, based either on base table or join queries. KDE combines the flexibility and maintainability of a sample-based method with the quality of state-of-the-art selectivity estimators.

We evaluated the quality of our approach using queries on both synthetic and real-world datasets. We found that KDEs provide significantly better join estimates than traditional methods in case one of the underlying assumptions is violated. Compared to naïve sample evaluation, our models can provide significantly improved results for relatively small sample sizes, while still converging to the same accuracy for larger samples. In practice, we suggest maintaining base table KDE models for all tables in a database, as they usually provide better estimates for supported operators and data types (numeric attributes, dictionary encoded attributes). Joint KDE models can be added manually when additional accuracy is required for particular joins.

In order to conclude the paper, we now present and discuss selected topics that we believe to be interesting directions for future work in the field of KDE-based selectivity estimation:

Investigating Different Kernel Functions: As using the Gaussian kernel requires evaluating the error and exponential functions, it is computationally intensive. Investigating alternative kernel functions could lead to a more efficient evaluation, while still providing similar results. In particular, one could extend our ideas to the Cauchy or Epanechnikov kernel, which are cheaper to evaluate (see the sketch after this list).

Generalization to Theta Joins: Since KDE models provide a probabilistic model for the join frequency distribution of a table, they could also be used to estimate the selectivity of the more general class of theta joins. In particular, we would have to derive closed-form estimation formulae by solving Equation 14 for different integral bounds and identify how to efficiently implement those. While this is surely not trivially possible for every join predicate, we think that extending support to certain classes of general theta joins, like band joins ($R_1.L \le R_2.A \le R_1.U$) or simple inequality joins ($R_2.x \le R_1.l$), would be possible and interesting.

KDE on Runtime Samples: Recent work showed promising results for uniform join samples created at estimation time in in-memory databases [21]. It would be an interesting research direction to investigate how the bandwidth of KDE models could be selected in systems without materialized samples.
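To illustrate the first direction, the following sketch (ours, not an estimator from the paper) evaluates a KDE with the Epanechnikov kernel, which needs only a bounds check and a multiplication per sample point instead of erf or exp:

```python
def epanechnikov(u):
    # Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero elsewhere.
    return 0.75 * (1.0 - u * u) if -1.0 <= u <= 1.0 else 0.0

def kde_density(x, sample, bandwidth):
    # Kernel density estimate at x; no transcendental functions required.
    return sum(epanechnikov((x - s) / bandwidth)
               for s in sample) / (len(sample) * bandwidth)
```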

Acknowledgment

The work has received funding from the European Union's Horizon 2020 Research & Innovation Program under grant agreement 671500 (project 'SAGE') and from the German Ministry for Education and Research as Berlin Big Data Center BBDC (funding mark 01IS14013A).

9. REFERENCES

[1] N. Alon, P. Gibbons, Y. Matias, and M. Szegedy. Tracking join and self-join sizes in limited storage. In PODS, pages 10-20, 1999.

[2] S. Breß, H. Funke, and J. Teubner. Robust query processing in co-processor-accelerated databases. In SIGMOD, pages 1891-1906, 2016.

[3] P. Bromiley. Products and convolutions of Gaussian probability density functions. http://www.tina-vision.net, Internal Report, 2003.

[4] N. Bruno, S. Chaudhuri, and L. Gravano. STHoles: a multidimensional workload-aware histogram. In SIGMOD Record, volume 30, pages 211-222, 2001.

[5] S. Chaudhuri, R. Motwani, and V. Narasayya. On random sampling over joins. In SIGMOD, pages 263-274, June 1999.


[6] S. Christodoulakis. Implications of certain assumptions in database performance evaluation. TODS, 9(2):163-186, 1984.

[7] C. Estan and J. Naughton. End-biased samples for join cardinality estimation. In ICDE, page 20, 2006.

[8] S. Ganguly, P. Gibbons, Y. Matias, and A. Silberschatz. Bifocal sampling for skew-resistant join size estimation. In SIGMOD Record, volume 25, pages 271-281, 1996.

[9] L. Getoor, B. Taskar, and D. Koller. Selectivity estimation using probabilistic models. SIGMOD Record, 30(2):461-472, May 2001.

[10] D. Gunopulos, G. Kollios, J. Tsotras, and C. Domeniconi. Selectivity estimators for multidimensional range queries over real attributes. VLDB Journal, 14(2):137-154, 2005.

[11] P. Haas, J. Naughton, and A. Swami. On the relative cost of sampling for join selectivity estimation. In PODS, pages 14-24, 1994.

[12] M. Heimel, M. Kiefer, and V. Markl. Self-tuning, GPU-accelerated kernel density models for multidimensional selectivity estimation. In SIGMOD, pages 1477-1492, 2015.

[13] M. Heimel, M. Saecker, H. Pirk, S. Manegold, and V. Markl. Hardware-oblivious parallelism for in-memory column-stores. PVLDB, 6(9):709-720, 2013.

[14] I. Ilyas, V. Markl, P. Haas, P. Brown, and A. Aboulnaga. CORDS: Automatic discovery of correlations and soft functional dependencies. In SIGMOD, pages 647-658, 2004.

[15] Y. Ioannidis and S. Christodoulakis. On the propagation of errors in the size of join results. In SIGMOD, pages 268-277, 1991.

[16] S. G. Johnson. The NLopt nonlinear-optimization package. http://ab-initio.mit.edu/nlopt.

[17] M. Kiefer, M. Heimel, and V. Markl. Demonstrating transfer-efficient sample maintenance on graphics cards. In EDBT, pages 513-516, 2015.

[18] P.-A. Larson, W. Lehner, J. Zhou, and P. Zabback. Cardinality estimation using sample views with quality assurance. In SIGMOD, pages 175-186, 2007.

[19] H. Lee, R. Ng, and K. Shim. Similarity join size estimation using locality sensitive hashing. PVLDB, 4(6):338-349, 2011.

[20] V. Leis, A. Gubichev, A. Mirchev, P. Boncz, A. Kemper, and T. Neumann. How good are query optimizers, really? PVLDB, 9(3):204-215, 2015.

[21] V. Leis, B. Radke, A. Gubichev, A. Kemper, and T. Neumann. Cardinality estimation done right: Index-based join sampling. In CIDR, 2017.

[22] G. Lohman. Is query optimization a "solved" problem? SIGMOD Blog, April 2014.

[23] V. Markl, P. Haas, M. Kutsch, N. Megiddo, U. Srivastava, and T. Tran. Consistent selectivity estimation via maximum entropy. VLDB Journal, 16(1):55-76, 2007.

[24] V. Markl, V. Raman, D. Simmen, G. Lohman, H. Pirahesh, and M. Cilimdzic. Robust query processing through progressive optimization. In SIGMOD, pages 659-670, 2004.

[25] G. Moerkotte, T. Neumann, and G. Steidl. Preventing bad plans by bounding the impact of cardinality estimation errors. PVLDB, 2(1):982-993, 2009.

[26] F. Olken. Random sampling from databases. PhD thesis, University of California at Berkeley, 1993.

[27] M. Powell. A direct search optimization method that models the objective and constraint functions by linear interpolation. In Advances in Optimization and Numerical Analysis, pages 51-67, 1994.

[28] N. Reddy and J. Haritsa. Analyzing plan diagrams of database query optimizers. In VLDB, pages 1228-1239, 2005.

[29] F. Rusu. Sketches for aggregate estimations over data streams. PhD thesis, University of Florida, 2009.

[30] F. Rusu and A. Dobra. Sketches for size of join estimation. TODS, 33(3):15:1-15:46, Sept. 2008.

[31] D. Scott. Multivariate Density Estimation: Theory, Practice, and Visualization. John Wiley & Sons, 2nd edition, 2015.

[32] P. Selinger, M. Astrahan, D. Chamberlin, R. Lorie, and T. Price. Access path selection in a relational database management system. In SIGMOD, pages 23-34, 1979.

[33] M. Stillger, G. Lohman, V. Markl, and M. Kandil. LEO - DB2's learning optimizer. In VLDB, pages 19-28, 2001.

[34] A. Swami and K. Schiefer. On the estimation of join result sizes. In EDBT, pages 287-300, 1994.

[35] K. Tzoumas, A. Deshpande, and C. Jensen. Lightweight graphical models for selectivity estimation without independence assumptions. PVLDB, 4(11):852-863, 2011.

[36] D. Vengerov, A. Menck, M. Zait, and S. Chakkappen. Join size estimation subject to filter conditions. PVLDB, 8(12):1530-1541, 2015.

[37] J. Vitter. Random sampling with a reservoir. TOMS, 11(1):37-57, 1985.

APPENDIX

A. GAUSSIAN CROSS CONTRIBUTION

The Gaussian kernel is a common choice for a continuous kernel in KDE. It is a Normal distribution $\mathcal{N}_{s,\delta^2}(x)$ centered on the sample point $s$ and using the bandwidth $\delta$ as its standard deviation. In order to compute discretized probability estimates for an integer join key $\nu$ from this continuous kernel, we simply integrate it over the interval $[\nu - 0.5, \nu + 0.5]$, leaving us with the discretized Gaussian kernel. We assume that join attributes are not subject to selections ($A = \mathbb{Z}$) and lift this assumption in Appendix B.
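As a small illustration (our own sketch; the helper names are ours), the discretized Gaussian kernel can be evaluated as a difference of two Gaussian CDFs:

```python
import math

def normal_cdf(x, mean, sigma):
    # CDF of a Normal distribution with the given mean and standard deviation.
    return 0.5 * (1.0 + math.erf((x - mean) / (sigma * math.sqrt(2.0))))

def discretized_gaussian(nu, s, delta):
    # Mass that the Gaussian kernel centered on sample point s with
    # bandwidth delta assigns to the integer join key nu, i.e. its
    # integral over [nu - 0.5, nu + 0.5].
    return normal_cdf(nu + 0.5, s, delta) - normal_cdf(nu - 0.5, s, delta)
```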

A.1 Cross Contribution

Substituting the discretized Gaussian kernel into Equation 7 yields the cross contribution for the Gaussian kernel:

$$J_{i,j} = \sum_{\nu \in \mathbb{Z}} \int_{\nu-0.5}^{\nu+0.5} \mathcal{N}_{t_1^{(i)},\,\delta_1^2}(x)\,dx \cdot \int_{\nu-0.5}^{\nu+0.5} \mathcal{N}_{t_2^{(j)},\,\delta_2^2}(x)\,dx \tag{14}$$

Equation (14) does not have a closed-form solution that would allow us to efficiently compute it without summing over all $\nu \in A$.


[Figure 10: Approximating the cross contribution for the Gaussian kernel by the multiply-then-integrate approach; the plot compares $\mathcal{N}_{1,16}(\nu)\cdot\mathcal{N}_{3,4}(\nu)$ with $\int_{\nu-0.5}^{\nu+0.5}\mathcal{N}_{1,16}(x)\,dx \cdot \int_{\nu-0.5}^{\nu+0.5}\mathcal{N}_{3,4}(x)\,dx$.]

However, for probability densities $f(x)$, $g(x)$ with values smaller than or equal to one, we can approximate $\int_{\nu-0.5}^{\nu+0.5} f(x)\,dx \cdot \int_{\nu-0.5}^{\nu+0.5} g(x)\,dx \approx \int_{\nu-0.5}^{\nu+0.5} f(x)\,g(x)\,dx$. We illustrate this approximation in Figure 10, showing that in this case the results for first integrating, then multiplying are very similar to those for first multiplying, then integrating.

Substituting this approximation into Equation (14) gives us an approximate closed-form solution:

$$J_{i,j} \approx \sum_{\nu\in\mathbb{Z}} \int_{\nu-0.5}^{\nu+0.5} \mathcal{N}_{t_1^{(i)},\,\delta_1^2}(x) \cdot \mathcal{N}_{t_2^{(j)},\,\delta_2^2}(x)\,dx = \int_{-\infty}^{+\infty} \mathcal{N}_{t_1^{(i)},\,\delta_1^2}(x) \cdot \mathcal{N}_{t_2^{(j)},\,\delta_2^2}(x)\,dx = \mathcal{N}_{t_1^{(i)},\,(\delta_1^2+\delta_2^2)}\big(t_2^{(j)}\big) \int_{-\infty}^{+\infty} \mathcal{N}_{t',\delta'}(x)\,dx = \mathcal{N}_{t_1^{(i)},\,(\delta_1^2+\delta_2^2)}\big(t_2^{(j)}\big) \tag{15}$$

The last transformation requires some explanation: The product $\mathcal{N}_{1,2} = \mathcal{N}_1 \cdot \mathcal{N}_2$ of two Normal probability density functions is itself a scaled Normal probability density function [3]. Its location $t'$ and scale $\delta'$ are functions of the parameters of the individual densities. Since we integrate over the full domain, this factor integrates to one, leaving only the scaling factor, which is again given by a Normal density function that depends on the mean and bandwidth parameters of the original functions [3].

Equation (15) does not hold for densities with function values larger than one. In particular, when the bandwidth of one of the two Gaussian functions drops below $(2\pi)^{-1/2} \approx 0.4$, the error increases drastically. However, since for this value only 1.3% of the probability mass of a Gaussian is located outside of $[\mu - 0.5, \mu + 0.5]$, the estimator is still able to fall back close to sample evaluation.
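The approximation is easy to check numerically. The sketch below (our verification code; the parameters are those of Figure 10) compares the exact sum from Equation (14), truncated to a finite range of join keys, against the closed form of Equation (15):

```python
import math

def normal_cdf(x, mean, sigma):
    return 0.5 * (1.0 + math.erf((x - mean) / (sigma * math.sqrt(2.0))))

def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def cross_contribution_exact(t1, d1, t2, d2, span=200):
    # Equation (14): sum over integer join keys of the product of the
    # per-kernel masses on [nu - 0.5, nu + 0.5].
    total = 0.0
    for nu in range(-span, span + 1):
        m1 = normal_cdf(nu + 0.5, t1, d1) - normal_cdf(nu - 0.5, t1, d1)
        m2 = normal_cdf(nu + 0.5, t2, d2) - normal_cdf(nu - 0.5, t2, d2)
        total += m1 * m2
    return total

def cross_contribution_closed(t1, d1, t2, d2):
    # Equation (15): one Gaussian evaluation with the combined variance.
    return normal_pdf(t2, t1, d1 * d1 + d2 * d2)

# N_{1,16} and N_{3,4} as in Figure 10: means 1 and 3, bandwidths 4 and 2.
print(cross_contribution_exact(1.0, 4.0, 3.0, 2.0))   # approx. 0.0807
print(cross_contribution_closed(1.0, 4.0, 3.0, 2.0))  # approx. 0.0807
```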

A.2 Generalized Cross Contribution

Plugging the discretized Gaussian kernel into Equation (9), we arrive at:

$$J_{o_1,\ldots,o_n} = \sum_{\nu\in\mathbb{Z}} \prod_{i=1}^{n} \int_{\nu-0.5}^{\nu+0.5} \mathcal{N}_{t_i^{(o_i)},\,\delta_i^2}(x)\,dx \tag{16}$$

Similar to the single-join case, we can plug in the approximation $\prod_i \int_{\nu-0.5}^{\nu+0.5} p_i(x)\,dx \approx \int_{\nu-0.5}^{\nu+0.5} \prod_i p_i(x)\,dx$. The product of $k$ normal densities can be rewritten as a normal density and a scale factor that does not depend on the integrand [3]:

$$J_{o_1,\ldots,o_n} = \int_{-\infty}^{\infty} \prod_{i=1}^{n} \mathcal{N}_{t_i^{(o_i)},\,\delta_i^2}(x)\,dx = S_{1\ldots n} \cdot \int_{-\infty}^{\infty} \mathcal{N}_{t_{1\ldots n},\,\delta_{1\ldots n}^2}(x)\,dx = S_{1\ldots n} \tag{17}$$

Since we integrate over the entire domain, the normal density integrates to one, which leaves us with the scale factor $S_{1\ldots n}$ only. The scale factor $S_{1\ldots n}$ is given by:

$$S_{1\ldots n} = \frac{\sqrt{\delta_{1\ldots n}^2 \,/\, \prod_{i=1}^{n}\delta_i^2}}{(2\pi)^{(n-1)/2}} \exp\left(-\frac{1}{2}\left(\sum_{i=1}^{n}\frac{\big(t_i^{(o_i)}\big)^2}{\delta_i^2} - \frac{t_{1\ldots n}^2}{\delta_{1\ldots n}^2}\right)\right) \tag{18}$$

Finally, the quantities $\delta_{1\ldots n}^2$ and $t_{1\ldots n}$ can be computed from the individual means and variances of each normal density:

$$\delta_{1\ldots n}^2 = \left(\sum_{i=1}^{n}\frac{1}{\delta_i^2}\right)^{-1} \qquad t_{1\ldots n} = \delta_{1\ldots n}^2\,\sum_{i=1}^{n}\frac{t_i^{(o_i)}}{\delta_i^2} \tag{19}$$
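A direct transcription of Equations (17)-(19) into code (our sketch; the function and variable names are ours) computes the generalized cross contribution as the scale factor $S_{1\ldots n}$:

```python
import math

def generalized_cross_contribution(means, variances):
    # Scale factor S_{1..n} of Equations (18) and (19) for the product of
    # n Gaussians with means t_i and variances delta_i^2.
    # (math.prod requires Python 3.8 or later.)
    n = len(means)
    var_c = 1.0 / sum(1.0 / v for v in variances)                  # Equation (19)
    t_c = var_c * sum(t / v for t, v in zip(means, variances))     # Equation (19)
    norm = (math.sqrt(var_c / math.prod(variances))
            / (2.0 * math.pi) ** ((n - 1) / 2.0))
    exponent = -0.5 * (sum(t * t / v for t, v in zip(means, variances))
                       - t_c * t_c / var_c)                        # Equation (18)
    return norm * math.exp(exponent)

# For n = 2 this reduces to Equation (15): N_{1, 16+4}(3), approx. 0.0807.
print(generalized_cross_contribution([1.0, 3.0], [16.0, 4.0]))
```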

B. SELECTIONS ON JOIN ATTRIBUTES

The assumption that join attributes are not subject to selections can be lifted. In general, selections on the join attributes allow us to apply sample pruning for join attributes as well, which potentially reduces the number of tuples that we have to consider in the cross pruning step.

Range selections require adjusting the equations for the cross contribution. If a range selection $l \le A \le u$ is applied to a join attribute, the normal density function in Equation (17) has to be integrated over $[l, u]$ instead of the full domain:

$$J_{o_1,\ldots,o_n} = S_{1\ldots n}\cdot\int_{l}^{u}\mathcal{N}_{t_{1\ldots n},\,\delta_{1\ldots n}^2}(x)\,dx \tag{20}$$
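In code, the only change relative to the full-domain case is the final integral, which becomes a difference of Gaussian CDFs over $[l, u]$. The sketch below (ours, reusing the `generalized_cross_contribution` helper from the Appendix A.2 sketch) illustrates Equation (20):

```python
import math

def normal_cdf_var(x, mean, var):
    # CDF of a Normal distribution parameterized by its variance.
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))

def cross_contribution_with_range(means, variances, low, high):
    # Equation (20): scale factor times the combined Gaussian's mass on
    # [low, high] instead of its full-domain integral (which is one).
    s = generalized_cross_contribution(means, variances)
    var_c = 1.0 / sum(1.0 / v for v in variances)
    t_c = var_c * sum(t / v for t, v in zip(means, variances))
    return s * (normal_cdf_var(high, t_c, var_c)
                - normal_cdf_var(low, t_c, var_c))
```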

However, we do not have to adjust the maximum distance inequality (Equation 11) in cross pruning. As the additionally introduced factor $\int_l^u \mathcal{N}_{t',\delta'}(x)\,dx$ is smaller than or equal to one, the given inequality remains intact even in the case of range predicates on the join attribute.

Point selections are a special case of range selections. As a join attribute subject to a point selection reduces the estimate computation to multiplying estimates for a single selection per table, treating them differently simplifies the required computations: Point selections allow us to handle the joins for an entire equivalence class by local predicates. Furthermore, every table whose join attributes are all subject to point selections can even be excluded entirely from the sum over the cross product of all samples.
