Automatic Physical Design of Database Systems:
An Optimization Theory Approach
(Thesis Proposal)
Debabrata Dash
Thesis Committee
Anastasia Ailamaki (Chair)
Christos Faloutsos
Carlos Guestrin
Guy Lohman (IBM Almaden Research Lab)
1 Introduction
Database systems are too large, too complex, and too dynamic for a non-expert user
to understand all the features available to obtain the most out of the system. One
of the most complex tasks for the database administrator is physically designing the
database to make it perform optimally for a query workload. The complexity of
physical design emanates from the availability of a large number of options, such as indexes, partitions, materialized views, different workload modeling techniques, and
different data layouts. The space of possible physical design features is huge, and to
make matters worse, the interaction between these features is difficult to model even
for sophisticated users.
This thesis proposes algorithms to automate database physical design. We use
optimization techniques to provide near-optimal solutions to the physical design prob-
lem. We show that our solutions are better than existing solutions and run order of
magnitude faster than current tools for large workloads.
This thesis states that database physical design problems can be modeled as convex
optimization problems and solved efficiently and scalably, without using heuristics.
We demonstrate the validity of the thesis statement by testing our proposed algo-
rithms on a variety of workload models (offline workload, online workload), a variety
of data layouts (row-oriented, column-oriented), for structured as well as unstruc-
tured, data. Finally, as a case study, we use our algorithms to aid astronomers in real
challenges they face in everyday life, such as managing large unstructured simulation
data, tracking objects over time, and querying complex spatio-temporal relationships between objects. Using such a large spectrum of databases, we show that our algorithms
are generic enough to be applicable to new types of databases as well.
2 Background and Related Work
In this section we discuss the existing research on physical design selection. First, we
discuss the need for physical design and then discuss different ways to address the
problem.
2.1 Database Design Process
The database design process usually comprises the following steps [19]:
1. Requirement Analysis to determine the data, the relationship between the data
elements, and the queries required by the software on top of the data.
2. Logical Design to determine the conceptual model of the database from the user
requirements and the database tables. The tables and their relationships are
usually shown in an entity-relationship (ER) or a Unified Modeling Language
(UML) diagram.
3. Physical Design to ensure that the queries on the database are executed ef-
ficiently. The database designer needs to build auxiliary structures, such as indexes, materialized views, and partitions, to ensure efficiency. In the rest of this thesis, we denote these structures as design features. Unlike the logical de-
sign step, this step involves both the data and the queries determined in the
requirement analysis phase.
4. Monitoring and Refinement is an iterative step, where the database is monitored
for changes in user requirements or data characteristics. Once a change is detected, the logical and physical design steps may be repeated. Typically, the physical design is repeated more often than the logical design, as the workload tends to change faster than the data characteristics.
The logical design phase mainly involves gathering the data dependencies. Once the dependencies are known, determining the right schema is straightforward using one of the available automated tools [28].
The physical design process, however, is the most challenging one, since (1) the
number of possible indexes, materialized views, and partitions can be exponential
with respect to the size of the tables; (2) their effects are very often interrelated.
This has led to many different automated physical design tools, such as IBM's DB2
Design Advisor [30], Microsoft’s Database Tuning Advisor [1], and Oracle’s SQL
Access Advisor [29].
Once the design features are built, the query optimizer in the database optimizes
queries to run them efficiently using those features. Since the query optimizer is the
consumer of the design features, the physical design process depends on modeling the optimizer.
2.2 Physical Design using Optimizer Cost Models
Early research on physical design focuses on modeling the query optimizer math-
ematically and then suggesting the design features. Since the early optimizers use
simple cost models [26], it is easy to model the entire optimization process and select
appropriate design features accurately. Lum et al. model the selection of secondary
indexes as an optimization problem [20]. Eisner et al. improve on that by mapping
the index selection problem to an equivalent network flow problem [13]. Researchers
also propose various integer linear program formulations for vertical partitioning of
tables [5, 10, 16, 22]. They all assume simple cost models for using vertical partitions
and build their optimization problems on those models. Because of the rapid improvement of optimizers over the last few decades, the cost models used in these problem formulations are now obsolete. We improve upon these formulations by modeling the
optimizer as a black-box and reusing the past optimization results. This allows us to
model the optimizer just enough to speed up the cost estimation, while avoiding the
complexity of fully modeling it [23].
2.3 Modern Physical Design using “what-if” Features
Over time, commercial-grade optimizers have become so complex that they cannot be
modeled easily. Therefore, to find the effects of different design features on a query,
it is necessary to invoke the optimizer itself. Instead of building real design features,
all modern physical design selection techniques depend on creating “what-if” features
and observing the optimizer behavior using those features. The “what-if” features
are virtual in nature; they only simulate real features, i.e., they contain statistics similar to the real features without building the features' underlying data.
Microsoft’s Index Tuning Wizard and Database Tuning Advisor use a ‘candidate
selection’ phase to identify promising indexes [1, 8], a ‘merging’ step for augmenting
the candidate set and a greedy search for selecting a locally optimal solution [9]. The
DB2 Advisor uses the optimizer to select an initial set of indexes and formulates a
knapsack problem that is solved by a greedy search [14].
Our ILP formulation is an extension of the facility-location problem [12], enabling it to search the features exhaustively instead of greedily. Caprara et al. first use this formulation to find a single index per query [7]. We extend their formulation to account for queries using more than one index and also model index update costs. Our work is concerned with applying the formalism to real-world problems that involve commercial database systems and workloads.
In terms of modeling, we integrate the facility-location problem model with the
query optimizer in existing systems. We also deal with the problems of selecting
candidates and configurations using sensitivity analysis, so that we derive reasonably-
sized ILP instances that can be solved efficiently. In terms of solutions, we may
not achieve the optimal solutions, as the model, due to candidate and configuration
selection, already incorporates approximations. Using our technique, however, we
can determine the distance from optimal solutions. Finally, to accurately model real
world constraints, we include storage limits in our formulation. Caprara et al. do
not consider storage and it is unclear if their analysis and proposed algorithms can
be applied to problem instances resulting from our formulation.
Heeren et al. describe an ILP and a solution based on randomized-rounding,
assuming a single index per query [15]. Their solution has optimal performance but
requires a bounded amount of additional storage.
Besides index selection, there is work on designing additional features, such as ma-
terialized views and table partitions [2, 4, 30, 31]. The proposed algorithms are similar to their index selection counterparts, but they face a combinatorial explosion in the number of alternative designs when exploring feature combinations.
Our modeling approach will be beneficial for problems combining multiple features,
because it can better capture the ‘interaction’ between features and it offers higher
scalability. Rao et al. extend the physical design selection to parallel databases using
a rank-based index selection [25]. Our model is more powerful than rank-based index
selection and, hence, is applicable to parallel databases as well.
Agrawal et al. model the workload as a sequence [3] and Bruno et al. improve
on it by modeling the workload as an online query sequence [6]. They change the
workload model, but still use greedy index selection. Furthermore, their space of
possible query plans is limited, as they allow only local changes to the plan nodes.
Our approach considers complete changes in the plan, and selects better indexes than
the greedy approach. Recently, Tata et al. propose a technique for selecting the physical design for all major databases using a common client-based tool [27]. Our server-based approach is complementary to their client-based approach. Moreover, since our INUM
model is portable across multiple databases, it can simplify implementation of such
client-based tools.
Below we summarize the limitations of current approaches.
• Sub-optimal solutions: Empirically, we verify that the solutions provided by the design tools are far from optimal. For a TPC-H-like workload, a commercial system provided only 50% of the optimal improvement in query time. This shows that the greedy selection of candidate solutions is not effective in finding optimal solutions.
• No quality guarantee: The existing tools do not allow the users to trade
speed for quality of the final solution. Although they allow the user to control
how long the designer tool can run in order to come up with the solution, they
do not provide any indication of how the running time affects the final solution.
• Difficulty in resource planning: The existing tools assume that the user knows exactly how many resources the new physical designs are allowed to consume. When the user wants to plan for possible resource acquisition, however, it is desirable to present the user with a whole range of query performance for different resource requirements. For example, if the user knows that allowing 10% more disk space for new indexes increases performance by 20%, she can plan for a larger disk. This is not easily achieved with the current tools.
• No solution for alternative data layouts: Column-oriented databases are becoming more popular for DSS-type workloads consisting of analytical queries, and in-memory databases are projected to take over the OLTP workloads consisting of transactional queries. There are no tools on the market, however, to suggest physical designs for these databases. Physical design for in-memory databases is critical because memory is much more expensive than disk, requiring the selection of the best design features.
• No solution for scientific databases: Scientific workloads use different query types compared to transactional or analytical workloads. For
example, they use recursive queries, and time-windowed queries in addition to
standard queries. The mainstream physical design tools fail to find indexes or
materialized views for these query types.
3 Thesis Outline
This thesis proposes to address the problems discussed above by formulating the
various physical design problems as standard optimization problems and then solving
them systematically using existing tools. The thesis attempts to solve the problems
in the following steps:
1. Accurate and fast cost estimation: (Completed) We design a fast and
accurate cost estimator for queries with different physical design configurations.
We speed up the cost estimation by caching the previous optimizer outputs and
reusing them for future configurations. Reusing the optimizer outputs ensures
that our cost model keeps matching the model of the optimizer. We extend our
earlier work [23] by modeling the cost for materialized views and partitions.
2. Convex optimization formulation: (Completed) Using our caching frame-
work we model the design selection as a convex optimization problem. We then
solve it using industry-strength optimization solvers.
3. Online design selection: (Ongoing) So far we have considered the case where the workload is known in advance. When the workload is not known, we formulate the design selection problem as an online optimization problem. In
particular, we are studying the application of online optimization to physical
design of caches.
4. Physical design for alternative data stores: Physical design is an impor-
tant part of the alternative database designs such as column stores. Since the
existing column store databases depend heavily on selecting the right set of
columns to group together, tuning the design with the workload is extremely
important. We plan to model the design problem including the grouping, order-
ing and compression requirements of the column stores to achieve the optimal
design.
5. Physical design for unstructured databases: In unstructured databases
we lack statistics about the data on which queries operate.
We target the problem of optimizing the physical design for unstructured data
queried using tools like map-reduce [11], Dryad [18] and Hadoop [17]. The phys-
ical design algorithm needs to learn the data and the workload characteristics
to optimize the performance of the workload.
6. Case study – simulation and observational databases for astronomy:
(Ongoing) We intend to apply our design selection methods to optimizing scientific workloads. Scientists have different types of workloads, such as those involving moving objects or massive recursive queries. We intend to apply, and if needed extend, our methods to select the optimal design features for scientific database workloads.
Step 1 enables us to build large scale optimization problems by speeding up the
cost estimation. Steps 2 and 3 address problems such as sub-optimality, lack of
quality guarantee, and difficulty in resource planning. By formulating the physical
design problem as an optimization problem, we obtain optimal solutions. We allow
the user to speed up the optimization process by reducing the design search space
and provide guarantees on the loss of quality because of the reduced search space.
Since the space constraint is just one of the thousands of constraints in our problem
formulations, and modern solvers can easily re-optimize for small changes in the
problem, we can provide the entire spectrum of design solutions by altering the space
constraints. Steps 4 and 5 select the physical designs for alternative data layouts and unstructured databases. Finally, Step 6 addresses the lack of physical design tools for the scientific database community.
4 Efficient Use of Optimizer for Index Selection
Problem
State-of-the-art database design tools rely on the query optimizer for comparing physical design alternatives. Although it provides an appropriate cost model
for physical design, query optimization is a computationally expensive process. The
significant time consumed by optimizer invocations poses serious performance limita-
tions for physical design tools, causing long running times, especially for large problem
instances. So far it has been impossible to remove query optimization overhead with-
out sacrificing cost estimation precision. Inaccuracies in query cost estimation are
detrimental to the quality of physical design algorithms, as they increase the chances
of “missing” good designs and consequently selecting sub-optimal ones. Precision loss and the resulting reduction in solution quality are particularly undesirable, since precise cost estimation is the reason the query optimizer is used in the first place.
In our approach, instead of replacing the optimizer completely, we cache the pre-
viously returned plans from the optimizer and reuse them to provide accurate and
fast cost estimation. The intuition behind the INUM is that although design tools
must examine a large number of alternative designs, the number of different opti-
mal query execution plans and thus the range of different optimizer outputs is much
smaller. Therefore it makes sense to reuse the optimizer output, instead of repeatedly
computing the same plan.
The INUM works by first performing a small number of key optimizer calls per
query in a precomputation phase and caching the optimizer output (query plans along
with statistics and costs for the individual operators). During normal operation,
query costs are derived exclusively from the precomputed information without any
further optimizer invocation. The derivation involves a simple calculation (similar to computing the value of an analytical model) and thus is significantly faster than the complex query optimization code.
We present the details of the INUM model in our earlier work [23]; below we list its important implications, followed by a short sketch of the cost derivation.
1. If a query involves only Merge-Join and Hash-Join plans, the cost of joining
and aggregation does not depend on the cost of accessing data from the table
or indexes. Hence, the total cost of the query depends linearly on the access
costs for each table.
2. If a query involves only Merge-Join and Hash-Join plans, then caching one plan
per interesting-order combination is sufficient to answer all possible incoming
index combinations. We define an interesting-order combination as a tuple containing at most one interesting order for each table referenced in the query. An interesting order is a column on which ordering the data improves query performance [26].
3. For queries involving all join methods, including Nested-Loop Joins, we cache more than one plan per interesting-order combination and approximate the optimal plan cost to within 5%.
Experimental Results for INUM We implement INUM using Java (JDK1.4.0)
and interface our code to the optimizer of a commercial DBMS, which we will call
System1. Our implementation demonstrates the feasibility of our approach in the
context of a real commercial optimizer and workloads and allows us to compare di-
rectly with existing index selection tools. To evaluate the benefits of INUM, we build
on top of it a very simple index selection tool, called eINUM. eINUM is essentially
an enumerator, taking as input a set of candidate indexes and performing a simple
greedy search, similar to the one used in [8].
We choose not to implement any candidate pruning heuristics because one of our
goals is to demonstrate that the high scalability offered by INUM can deal with large
candidate sets that have not been pruned in any way. We “feed” eINUM with an
exhaustive candidate set generated by building an index on every possible subset
of attributes referenced in the workload. From each subset, we generate multiple
indexes, each having a different attribute as its prefix. This algorithm generates a set of indexes on all possible attribute subsets, with every possible attribute as the key.
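The generator admits a compact sketch; the attribute names below are illustrative, and the enumeration mirrors the description above (every non-empty attribute subset, replicated once per choice of leading column).

```python
# Sketch of the exhaustive candidate generator described above.
from itertools import combinations

def candidate_indexes(attributes):
    """Yield an index (as an attribute tuple) for every non-empty subset
       of the referenced attributes, once per choice of prefix column."""
    for r in range(1, len(attributes) + 1):
        for subset in combinations(attributes, r):
            for key in subset:
                yield (key,) + tuple(a for a in subset if a != key)

# Illustrative attribute names from one table:
for idx in candidate_indexes(("l_orderkey", "l_shipdate", "l_quantity")):
    print(idx)   # 12 candidates for 3 attributes (n * 2**(n-1))
```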
We experiment with the 1GB version of the TPC-H benchmark with a workload
consisting of 15 out of the 22 queries, which we call TPCH15. We were forced to
omit certain queries due to limitations in our parser but our sample preserves the
complexity of the full workload.
We use a dual-Xeon 3.0GHz based server with 4 gigabytes of RAM running Win-
dows Server 2003 (64bit).
On average, eINUM takes only 0.34 milliseconds to find the cost of a configuration, while the commercial system takes about 1.3 seconds. The significant speedup in cost estimation allows eINUM to suggest better indexes as well.
The construction of the INUM takes 1243 seconds, or about 21 minutes, spent performing 1358 “real” optimizer calls. The number of actual optimizer calls is very
small compared to the millions of INUM cost evaluations performed during tuning.
We can also “compress” the time spent in INUM construction for smaller problems.
System1 required 246 seconds of total tuning time, and optimization time accounted
for 92% of the total tool running time.
5 An ILP Model for Index Selection
In this section, we introduce an integer linear programming formulation that captures
the full complexity of the index selection problem.
5.1 Mathematical Formulation
Consider a workload consisting of $m$ queries and a set of $n$ indexes $I_1, \ldots, I_n$, with sizes $s_1, \ldots, s_n$. We want our model to account for the fact that a query has different costs depending on the combination of indexes it uses. A configuration is a subset $C_k = \{I_{k_1}, I_{k_2}, \ldots\}$ of indexes with the property that all of the indexes in $C_k$ are used by some query.
Let $P$ be the set of all the configurations that can be constructed using the indexes in $I$ and that can potentially be useful for a query. For example, if a query accesses tables $T_1$, $T_2$, and $T_3$, then $P$ contains all the elements of the set (indexes in $I$ on $T_1$) $\times$ (indexes in $I$ on $T_2$) $\times$ (indexes in $I$ on $T_3$).
The cost of a query $i$ when accessing a configuration $C_k$ is $c(i, C_k)$, and $c(i, \{\})$ denotes the cost of the query on an unindexed database. We define the benefit of a configuration $C_k$ for query $i$ as $b_{ik} = \max(0,\ c(i, \{\}) - c(i, C_k))$.
Let $y_j$ be a binary decision variable that is 1 if index $I_j$ is actually implemented and 0 otherwise. In addition, let $x_{ik}$ be a binary decision variable that is equal to 1
if query $i$ uses configuration $C_k$ and 0 otherwise.

Using $x_{ik}$ and $b_{ik}$, the benefit for the workload, $Z$, is

$$Z = \sum_{i=1}^{m} \sum_{k=1}^{p} b_{ik} \times x_{ik} \qquad (1)$$
where $p = |P|$. The values of $x_{ik}$ depend on the values of $y_j$: we cannot have a query using $C_k$ if a member of $C_k$ is not implemented.

Also, we require that a query uses at most one configuration at a time. For instance, a query cannot be simultaneously using both $C_1 = \{I_1, I_2, I_3\}$ and $C_2 = \{I_1, I_2\}$. Finally, we require that the set of selected indexes consumes no more than $S$ units of storage. Thus the formal specification of the index selection problem is as follows.
$$\text{maximize} \quad Z = \sum_{i=1}^{m} \sum_{k=1}^{p} b_{ik} \times x_{ik} \qquad (2)$$

subject to

$$\sum_{k=1}^{p} x_{ik} \le 1 \quad \forall i \qquad (3)$$

$$x_{ik} \le y_j \quad \forall i,\ \forall j, k : I_j \in C_k \qquad (4)$$

$$\sum_{j=1}^{n} s_j \times y_j \le S \qquad (5)$$
Equation 3 guarantees that a query uses at most one configuration, Equation 4 ensures that we cannot use a configuration $C_k$ unless all the indexes in it are built, and Equation 5 expresses the constraint for the available storage $S$. Observe that the space restriction is just one constraint in the generated ILP problem. Hence, unlike earlier heuristic approaches, we can consider different storage constraints by tweaking the problem slightly.
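To illustrate, Equations 2–5 translate almost line by line into an off-the-shelf ILP modeling layer. The following is a minimal sketch using the open-source PuLP package rather than the commercial solver used in our experiments; the index sizes, configurations, and benefit values are fabricated purely for illustration.

```python
# Tiny instance of the index-selection ILP (Equations 2-5), sketched in
# PuLP. All data below is fabricated for illustration.
from pulp import LpProblem, LpMaximize, LpVariable, lpSum, LpBinary

sizes = {1: 50, 2: 80, 3: 40}                  # s_j, e.g. in MB
configs = {1: {1}, 2: {2}, 3: {1, 3}}          # C_k as sets of index ids
benefit = {(1, 1): 10, (1, 3): 25,             # b_ik for queries i = 1, 2
           (2, 2): 15, (2, 3): 5}
S = 100                                        # storage budget

prob = LpProblem("index_selection", LpMaximize)
y = {j: LpVariable(f"y_{j}", cat=LpBinary) for j in sizes}
x = {(i, k): LpVariable(f"x_{i}_{k}", cat=LpBinary) for (i, k) in benefit}

prob += lpSum(b * x[ik] for ik, b in benefit.items())            # Eq. 2
for q in {i for (i, k) in benefit}:                              # Eq. 3
    prob += lpSum(x[(i, k)] for (i, k) in benefit if i == q) <= 1
for (i, k) in benefit:                                           # Eq. 4
    for j in configs[k]:
        prob += x[(i, k)] <= y[j]
prob += lpSum(sizes[j] * y[j] for j in sizes) <= S               # Eq. 5

prob.solve()
print([j for j in y if y[j].value() == 1])     # indexes to build -> [1, 3]
```

Because the storage budget enters only through Equation 5, re-solving the same model for a sweep of values of $S$ yields exactly the performance-versus-storage spectrum that Section 2.3 identifies as missing from current tools.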
The above ILP formulation can be easily extended to account for update costs. With every index $I_j$ we associate a (negative) benefit value $-f_j$, which denotes the cost of updating index $I_j$ over all the update statements in the workload. Hence, we modify the objective function in Equation 2 to take into account the negative benefit values $f_j$.
$$\text{maximize} \quad Z = \sum_{i=1}^{m+m_1} \sum_{k=1}^{p} b_{ik} \times x_{ik} \;-\; \sum_{j=1}^{n} f_j \times y_j \qquad (6)$$
Equation 6 describes the workload benefit in the presence of $m$ queries and $m_1$ update statements. The second term simply states that if index $I_j$ is constructed as part of the solution, it will cost $f_j$ units of benefit to maintain it in the presence of the $m_1$ update statements.
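Continuing the hedged PuLP sketch above, the update penalty of Equation 6 is a one-line change to the objective; the $f_j$ values are again fabricated.

```python
# Update-cost extension (Equation 6): subtract the maintenance cost f_j of
# every index that gets built. Illustrative values.
f = {1: 2, 2: 6, 3: 4}
prob.setObjective(lpSum(b * x[ik] for ik, b in benefit.items())
                  - lpSum(f[j] * y[j] for j in sizes))
prob.solve()
```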
Supporting clustered indexes is straightforward in our model. A candidate clustered index is yet another index in the candidate set, one that contains all the attributes in a relation. We allocate a $y_j$ variable to it as usual, and it participates in combinations naturally. The size of a clustered index is artificially set to 0 (as no additional space is required to sort a table). For each table $T$ we restrict the set of clustered indexes on it, say $\{y^c_{T1}, y^c_{T2}, \ldots, y^c_{Tl}\}$, so that only one clustered index is picked:

$$y^c_{T1} + y^c_{T2} + \cdots + y^c_{Tl} \le 1 \qquad (7)$$
The exact solution provided by the ILP formulation is optimal for the given initial selection of indexes. If we were to include all the possible indexes relevant to the given workload, it would give us the globally optimal solution.
Considering the set of all possible indexes is prohibitively expensive, and thus a candidate selection module is necessary. The ILP approach is flexible enough to be used with an arbitrary candidate index set; in fact, the ILP model enables us to derive tight bounds on the quality loss when we remove some candidates.
5.2 An ILP-based Index Selection Tool
In the previous section, we present an ILP formulation that completely describes the
index selection problem. In this section, we discuss the architecture of a practical
ILP-based index selection tool.
Fig. 1 details the components that are used in our tool. All the modules except for the ILP solver are used in the construction of the model: deciding the $x_{ik}$ and $y_j$ variables and computing the benefit values $b_{ik}$, all described in Section 5.1. Once the model is constructed, the ILP solver is used to determine the optimal solution.
The Candidate Selection and Combination Selection modules allow the determination of the $x_{ik}$ and $y_j$ decision variables from the problem at hand. For each combination participating in the model, the Cost Estimation module determines the query
Figure 1: Architecture for an ILP-based index selection algorithm. [Diagram: the Workload and Constraints feed the Candidate Selection and Combination Selection modules; the Cost Estimation module couples the INUM with the Optimizer; the assembled ILP Model is passed to the ILP Solver, which produces Solutions and Bounds.]
costs and the corresponding benefit values $b_{ik}$. Cost Estimation is typically based on the query optimizer. In our system, we couple the query optimizer with the Index Usage
Model (INUM), a mechanism we have developed for improving the performance of
query cost estimation through caching and reusing of optimizer computation. The
INUM is ≈3,000 times faster than a query optimizer call, while providing exactly the
same result. Section 4 provides an overview of the INUM, while a detailed description
appears in [23]. The completed ILP representation is consequently input to the ILP
solver.
The overall performance depends on the numbers of $y_j$ and $x_{ik}$ variables, which control the problem size and consequently the optimization time and the time spent
in the cost estimation module. Our ILP formulation is advantageous compared to
existing approaches in that the efficiency of modern ILP Solvers and of the INUM
allows us to consider very large numbers of decision variables (candidate indexes and
index combinations). In our experiments, we have been able to solve ILP instances
considering up to 110,000 candidate indexes and 3.2 million combinations within
minutes.
Furthermore, our ILP formulation allows for a particularly attractive modular
design, where the impact of optimizations in each module on the final solution can be
analyzed and quantified. Each module can apply cost-based pruning (for Candidate
and Combination Selection) and approximations (for Cost Estimation), to improve
performance.
5.3 Experimental Results on TPC-H
We experiment with a workload consisting of 1000 queries, generated by varying the parameters of the TPC-H queries using the QGEN utility. We compare our ILP-based approach against the commercial index selection tool integrated in our server, in terms of solution quality, for various amounts of storage space available for index selection.
Table 1 shows the number of candidates and combinations considered by our
ILP-based and the commercial index selection tool. The first column of the table
corresponds to the size of the resulting ILP instance. The data for the second column
is obtained through the profiling of the commercial tool. Due to the efficiency of our
ILP solver and the INUM, our ILP tool was able to handle an order of magnitude
more indexes and combinations.
                  ILP        Commercial
Combinations      348,147    42,189
Candidates        1,931      138

Table 1: Number of candidates and combinations considered by our ILP-based and the commercial index selection tool.
Figure 2: Comparing the solution quality between our ILP-based and the commercial index selection tool for two storage constraint values. [Bar chart: workload speedup (%) at the 1GB and 3GB index-size budgets for Commercial, ilp, and ilp_cost_approx.]
Figure 2 compares the solution quality achieved by our ILP-based and the com-
mercial tool for 1GB and 3GB storage constraint values. For the “tight” storage
constraint of 1GB, the ILP algorithm (ilp) provides 16 more percentage points of
benefit to the workload compared to the commercial tool (commercial). Equivalently,
when using the indexes recommended by our ILP tool the workload runs 30% faster.
Figure 2 also shows the solution quality achieved using cost approximation, in addition to table-subset pruning (ilp_cost_approx). In this case we apply cost approx-
imation using a single plan per query in the INUM space. The quality loss resulting
from cost approximation is less than 3 percentage points in both cases.
6 Integer-Convex Programming Model
Although the above approach allows us to accurately model the index selection problem, the number of configurations explodes combinatorially. Only by pruning some of the configurations does the above formulation become practical. In this section, we propose a different formulation of the problem, which makes constructing the optimization program even more practical. We reduce the program size by exploiting the model created by INUM and the solver's ability to compute optimal combinations efficiently.
We consider a workload consisting of $m$ queries and a set of $n$ indexes $I_1, \ldots, I_n$, with sizes $s_1, \ldots, s_n$. Instead of maximizing the benefit of using a configuration, we minimize the cost of using it. Let $O_{ij}$ be the $j$th interesting-order combination for query $i$. Since we cache multiple plans for each interesting order, let $P_{ijk}$ be the $k$th plan cached for the interesting-order combination $O_{ij}$. According to INUM's cost model, the cost for the configurations matching an interesting-order combination $O_{ij}$ can be represented as:

$$\mathrm{Cost}(O_{ij}) = \min_k \left( w_{ijk} + \sum_{t \in T_i} w_{ijkt}\, a_{ti} \right)$$

where $T_i$ is the set of tables accessed in query $i$, $a_{ti}$ is the cost of accessing table $t$ in query $i$, and $w_{ijk}$, $w_{ijkt}$ are the constants of the linear cost equation in INUM. Using the above equation, we can represent the cost of query $i$, $C_i$, as
$$C_i = \min_j \mathrm{Cost}(O_{ij}) = \min_{j,k} \left( w_{ijk} + \sum_{t \in T_i} w_{ijkt}\, a_{ti} \right)$$
Therefore, the total cost of the workload, $C$, becomes

$$C = \sum_i \min_{j,k} \left( w_{ijk} + \sum_{t \in T_i} w_{ijkt}\, a_{ti} \right)$$
In this form the above formulation is not a convex program, as the min function is not convex. To convert it to a convex program, we introduce a binary variable $o_{ir}$, which represents the cached plan selected for query $i$. The variable $o_{ir}$ is essentially an indicator for the selection of interesting-order combination $r$ of query $i$.
We also reuse the variable $y_j$ to represent the selection of an index $I_j$. The objective becomes:

$$\text{minimize} \quad \sum_i C_i$$

such that:

$$o_{ir} = 1 \;\Rightarrow\; C_i = C^{int}_{ir} + \sum_{t \in T_i} w_{irt} S_{ti}\, y_j$$
Commercial solvers such as CPLEX provide efficient implementations of such indicator constraints.
To maintain the correctness of the formulation, we have to ensure that (1) the variable $o_{ir}$ is picked only when all the indexes providing the interesting orders for that cached plan are selected; (2) for each query, exactly one interesting-order combination is picked; (3) for each table in a query plan, only one index is used to decide the cost.
We introduce another binary variable $y_{irj}$, which indicates that index $I_j$ is used by the $r$th plan of query $i$.
$$\text{minimize} \quad \sum_i C_i \qquad (8)$$

such that: (9)

$$o_{ir} = 1 \;\Rightarrow\; C_i = C^{int}_{ir} + \sum_{t \in T_i} w_{irt} S_{ti}\, y_{irj} \qquad (10)$$

$$\forall i: \quad \sum_r o_{ir} = 1 \qquad (11)$$

$$\forall i, \forall t: \quad \sum_{y_{irj} \in Uses(t)} y_{irj} = 1 \qquad (12)$$

$$\forall j: \quad y_j \le \sum_{i,r} y_{irj} \le m\, y_j \qquad (13)$$

$$\forall i, \forall r, \forall t \in T_i: \quad \sum_{y_{irj} \in V_{irt}} y_{irj} \ge o_{ir} \qquad (14)$$

$$\sum_j s_j \times y_j \le S \qquad (15)$$

$$y^c_{T1} + y^c_{T2} + \cdots + y^c_{Tl} \le 1 \qquad (16)$$
where $V_{irt}$ represents the set of indexes on table $t$ which provide the interesting order required by $o_{ir}$, and the set $Uses(t)$ is the set of indexes on table $t$. Equation 10 ensures that if we select a plan, then we also select its associated cost. Equation 11 ensures that exactly one plan is selected for each query. Equation 12 ensures that, for a query, only one index is selected per table. Equation 13 ensures that if we select an index for a plan, then the global index variable is selected as well. Equation 15 is the space constraint. The last constraint ensures that we select only one clustered index for each table.
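For solvers without native indicator constraints, the implication in Equation 10 can be emulated with a standard big-M inequality: because the objective minimizes each $C_i$, a lower bound that is active only when $o_{ir} = 1$ suffices. The sketch below, again in PuLP with fabricated numbers, shows the idea for one query with two cached plans; the remaining constraints of the full formulation are omitted.

```python
# Big-M emulation of the indicator constraint in Equation 10 (with CPLEX
# the implication can instead be stated natively). Minimizing the total
# cost makes a conditional lower bound on C_i sufficient. Illustrative data.
from pulp import LpProblem, LpMinimize, LpVariable, LpBinary

M = 1e6                                       # exceeds any plan cost here
prob = LpProblem("icp_sketch", LpMinimize)

C = LpVariable("C_1", lowBound=0)             # cost of query 1
o1 = LpVariable("o_1_1", cat=LpBinary)        # plan 1: full table scan
o2 = LpVariable("o_1_2", cat=LpBinary)        # plan 2: uses index I_5
y5 = LpVariable("y_5", cat=LpBinary)          # is index I_5 built?
plan_cost = [(o1, 70), (o2, 40 + 20 * y5)]    # internal + access (Eq. 10)

prob += C                                     # objective: minimize C_1
prob += o1 + o2 == 1                          # Eq. 11: exactly one plan
prob += o2 <= y5                              # Eq. 14: plan 2 needs I_5
for o, cost in plan_cost:                     # Eq. 10 via big-M
    prob += C >= cost - M * (1 - o)
prob.solve()
print(C.value(), y5.value())                  # -> 60.0 1.0
```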
Unlike the formulation in Equation 6, this formulation does not enumerate all configurations. Therefore, the problem size remains linear with respect to the size of the candidate index set. To understand the variable and constraint counts, consider a simple query with just one possible plan: a join on two tables. If $t_1$ and $t_2$ are the numbers of possible candidate indexes on those two tables, then the number of possible configurations for the ILP is $t_1 \times t_2$. Therefore the number of variables in the ILP is $O(t_1 \times t_2 + t_1 + t_2)$ and the number of constraints is $O(t_1 \times t_2)$. In the ICP, however, the numbers of variables and constraints are approximately $O(t_1 + t_2)$. The ICP formulation scales linearly with the number of cached plans, whereas the ILP is independent of the number of cached plans.
ILP vs. ICP: Unlike the ICP, the ILP formulation does not assume any internal details of the cost estimation module. Given any cost estimation module that is fast enough to support millions of optimization queries, it is possible to build an ILP. For large problems, however, the ILP formulation needs pruning methods to keep the problem size under control. Although the pruning methods do not affect the final solution in our experiments, the upper bounds on the quality loss are relatively high, making them less desirable than no pruning at all. The ICP formulation depends on the details of the cost estimation module (INUM in our case); by exploiting those internal details, however, it generates a smaller optimization problem. Furthermore, the cost bounds determined using the ICP formulation are much tighter than the cost bounds for the ILP formulation. Although we build the ICP on top of INUM's cost model, we can adopt any other convex formulation as the basis for the ICP. Most modern optimizers use dynamic programming to determine the optimal plan for a query, which can be easily modeled as a convex optimization problem [24]. Hence a more complex optimizer model based on dynamic programming can also be captured by the ICP.
7 Applying Physical Design Techniques to Scien-
tific Databases
Scientists have very specific needs for their databases. For example, the simula-
tion databases generated by scientists need recursive queries to determine the ori-
gin/termination of particles in the space-time continuum. Unsurprisingly, the recur-
sive queries are not part of database benchmarks such as TPC-H, TPC-DS, etc. Hence the commercial vendors do not support optimal physical design for such queries.
To design the database for a scientific workload, we use a database to manage the output of an astronomy experiment, consisting of 128 gigabytes of raw data. Since the data is generated in a number of steps, we design a loader which combines the data from different steps and loads them together into the database. The regime
where database technology can make the greatest impact for simulations is tracking
group evolution over time. This is because simulation outputs are typically stored as
snapshots. Thus, queries that focus on a single instant in time have an advantage over
ones that examine time evolution simply because the file format naturally lends itself
to this kind of inquiry. We find that scientists correspondingly limit their own research
on this basis, preferring the “low-hanging fruit” of problems that can be addressed using a single snapshot (or maybe comparing two snapshots) and avoiding the ones that require looking at a range of snapshots. Our current query set, therefore, focuses on
the time domain. We implement 6 such representative queries on the data, so that
the astronomers can query the data by changing query parameters.
The most common type of query is “going back in time”, i.e. identifying a set
of progenitors for a given set of “particle groups”. Implementing this traversal in
a database is not straightforward, requiring the development of “recursive” queries,
which are not optimized in existing systems. The recursive aspect arises from the
nature of the data. A given cosmological group $g$ in output $t$ knows its group ID in outputs $t-1$ and $t+1$. To follow the evolution of a group through many outputs, one
must trace these links through many sequential outputs: e.g. galaxy 123 in output 10
is the same as galaxy 186 in output 9 which is the same as galaxy 452 in output 8, etc.
Optimizing recursive query performance over large datasets is not well understood,
and the tools currently available for automated indexing and partitioning do not work
for recursive queries. We initially experiment with the obvious choice of flattening
our database structures to eliminate recursive queries, but that significantly increases
data sizes.
We implement a hybrid approach between the existing techniques and completely flattening the time steps. We do not materialize the complete progenitor relationships, but materialize them in steps. In the sample database, the progenitors are recorded over 256 time steps. Instead of materializing 255 different progenitor tables (one for each time-step difference), we materialize only 8 progenitor tables. Each progenitor table contains the information $2^i$ time steps away, where $i = 0, 1, \ldots, 7$; that is, we keep 1-, 2-, 4-, 8-, 16-, 32-, 64-, and 128-hop progenitor information. This allows us to obtain accurate progenitor information using only 7 joins instead of 255 (a lookup sketch follows the list below). This method is possible because of the following two properties of the data:
1. For most of the queries, we know the starting and the ending time steps.
2. Inspecting the progenitor tree, we observe that on average the fan-out at each node is considerably smaller than 2. This keeps the tables at $2^i$ hops similar in size to the original table. The average degree of the tree is 1.018 in the sample data.
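The lookup sketch promised above: to trace a group back $d$ time steps, decompose $d$ in binary and chain the precomputed $2^i$-hop tables. Plain dictionaries stand in for the materialized progenitor tables (illustrative data), and each group is assumed to have a single progenitor per hop, a simplification consistent with the 1.018 average degree.

```python
# Sketch of the binary-hop traversal over the materialized progenitor
# tables. hop[i] maps a group id to its progenitor 2**i time steps back.

def progenitor(group_id, steps_back, hop):
    """Follow precomputed 2**i-hop links: at most len(hop) lookups (joins)
       instead of steps_back, e.g. 7 vs. 255 for a 256-snapshot run."""
    i = 0
    while steps_back:
        if steps_back & 1:            # this power of two is needed
            group_id = hop[i][group_id]
        steps_back >>= 1
        i += 1
    return group_id

# Toy data: galaxy 123 in output 10 <- 186 in output 9 <- 452 in output 8.
hop = {0: {123: 186, 186: 452},       # 1-hop progenitor table
       1: {123: 452}}                 # 2-hop progenitor table
print(progenitor(123, 2, hop))        # -> 452, via a single 2-hop lookup
```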
Figure 3: Performance comparison of the hybrid method against the naive methods. [Panel (a): the space usage, in number of rows (log scale), of the Original, Hybrid, and Full materialization schemes. Panel (b): the query times, in seconds (log scale), for the same schemes.]
Figures 3(a) and 3(b) show that the hybrid scheme occupies only 0.02% of the space of the full materialization, while being only 14 times slower than the full materialization. The hybrid scheme is about 12 times faster than the naive implementation. This benefit increases when the simulation has more than 256 time steps.1

1 This is joint work with Andy Connelly and Jeff Gardner at the University of Washington.
7.1 Physical Design of Distributed Database Caches
Distributed databases usually maintain proxy caches to serve a large number of queries.
Unlike standard web-based workloads, the queries sent to the proxy caches are highly dynamic and evolve over time. Therefore, selecting a representative workload
to tune using the existing physical design tools is difficult. Caches present a unique
design point for the physical design tools, as the designs can be optimized both at
the client site and the source site. Often they can be optimized together to provide
the best performance and low network bandwidth. In this work, we design both
competitive and incremental algorithms for the physical design of such distributed caches [21].2

2 This is joint work with Xiaodan Wang of Johns Hopkins University and Tanu Malik of Purdue University.
8 Things to be done
8.1 Online Optimization
In our current work, we assume full knowledge of the queries in the system. This is true for standard workloads such as web applications or the TPC-H benchmark. In many cases, however, the user can form ad-hoc queries, and then we have no prior information about the workload to optimize for. We need to suggest the future physical design using only information about the past workload. Since the workload can change in the future, and implementing design features takes time and space, we need to suggest design changes that appropriately balance query performance against the cost of the changes.
8.2 Alternative Data Layouts
Column-oriented databases provide better access to the data when the data is mostly read-only. They also provide good support for ad-hoc queries. To allow our convex optimization techniques to be used in this framework, we will model the query costs in column-oriented databases. Since the column-oriented database optimizers are not as complex as the ones for row databases, we can take advantage of better and simpler models for this problem. Similarly, for in-memory OLTP databases, the focus will be on modeling change: so far we have not considered change a significant factor in physical design, an assumption that is completely invalid in the case of OLTP databases. This work should provide interesting applications of our convex optimization techniques in these new problem spaces.
8.3 Physical Design for Unstructured Databases
In the case of unstructured databases, we know neither the queries nor the structure of the data. In the techniques currently used to process unstructured data, such as map-reduce [11] and Hadoop [17], the user parses and processes the data afresh every time. This makes the queries easy to implement, but does not take advantage of many database innovations, such as using statistics to model query cost and physical design features for speed. We intend to study the application of convex optimization methods to the map-reduce framework, to find the optimal materialization points and reuse the work done by previous map-reduce jobs to speed up future tasks.
8.4 Case Study: Scientific Database Optimization
Scientific data continue to pose interesting optimization problems. For example, one of the recent optimization goals of our collaborators is to design spatial and temporal indexes that use the least space possible. The space requirement is of paramount interest to the scientific database community, and our optimization techniques are ideal for such constrained environments, where heuristics lead to suboptimal behavior. We intend to showcase our techniques using scientific data management problems as case studies.
9 Time-line
Task Task Description Start Date End Date
1 ILP/ICP formulations 1-Apr-08 1-Nov-08
1.1 Scaling the ILP using sampling 1-Apr-08 27-Jun-08
1.2 Integrating Materialized Views into INUM 1-Feb-07 27-Jun-08
1.3 Integrating all features into INUM 1-Jul-08 1-Oct-08
1.4 Scaling ICP using solver techniques 1-Jul-08 1-Nov-08
2 Online physical design 1-Apr-08 27-Jun-08
2.1 ILP for vertical partitioning 1-May-08 27-Jun-08
2.2 INUM for vertical partitioning 1-May-08 27-Jun-08
2.3 Online algorithm 1-May-08 27-Jun-08
3 Physical design for column stores 1-Dec-08 1-Apr-09
3.1 Cost model for column stores 1-Dec-08 1-Jan-09
3.2 ILP for column stores 1-Jan-09 1-Apr-09
3.3 Improving column store for scientific DB 1-Dec-08 1-Apr-09
4 Physical design for Hadoop 1-Apr-09 1-Nov-09
4.1 Identifying the physical design features 1-Apr-09 1-Nov-09
4.2 Learning data characteristics 1-Apr-09 1-Nov-09
4.3 Learning workload characteristics 1-Apr-09 1-Nov-09
4.4 Cost model 1-Apr-09 1-Nov-09
5 Physical design for Astronomy database 1-Apr-07 1-Jan-10
5.1 Collecting and loading data 1-Apr-07 1-Mar-08
5.2 Optimizing recursive queries 1-Oct-07 1-Jan-08
5.3 Optimizing spatial moving object queries 1-Apr-09 1-Oct-08
5.4 Implementing physical design features 1-Oct-09 1-Jan-10
References
[1] Sanjay Agrawal, Surajit Chaudhuri, Lubor Kollar, Arunprasad P. Marathe,
Vivek R. Narasayya, and Manoj Syamala. Database Tuning Advisor for Microsoft SQL Server 2005. In VLDB, pages 1110–1121, 2004.
[2] Sanjay Agrawal, Surajit Chaudhuri, and Vivek R. Narasayya. Automated selec-
tion of materialized views and indexes in SQL databases. In VLDB 2000.
[3] Sanjay Agrawal, Eric Chu, and Vivek Narasayya. Automatic physical design
tuning: workload as a sequence. In SIGMOD ’06: Proceedings of the 2006 ACM
SIGMOD international conference on Management of data, pages 683–694, New
York, NY, USA, 2006. ACM.
[4] Sanjay Agrawal, Vivek Narasayya, and Beverly Yang. Integrating vertical and
horizontal partitioning into automated physical database design. In Proceedings
of the SIGMOD Conference, New York, NY, USA, 2004. ACM Press.
[5] Jair M. Babad. A record and file partitioning model. Commun. ACM, 20(1):22–
31, 1977.
[6] Nicolas Bruno and Surajit Chaudhuri. An online approach to physical design
tuning. In IEEE 23rd International Conference on Data Engineering (ICDE
2007), 2007.
[7] A. Caprara and J. Salazar. A branch-and-cut algorithm for a generalization of
the uncapacitated facility location problem, 1996.
[8] Surajit Chaudhuri and Vivek R. Narasayya. An efficient cost-driven index selec-
tion tool for Microsoft SQL server. In Proceedings of VLDB 1997.
[9] Surajit Chaudhuri and Vivek R. Narasayya. Index merging. In Proceedings of
ICDE 1999.
[10] D.W. Cornell and P.S. Yu. An effective approach to vertical partitioning for phys-
ical design of relational databases. IEEE Transactions on Software Engineering,
16(2):248–258, 1990.
[11] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, 2008.
[12] Zvi Drezner and Horst W. Hamacher. Facility Location: Applications and Theory.
2001.
[13] Mark J. Eisner and Dennis G. Severance. Mathematical techniques for efficient
record segmentation in large database systems. 1976.
[14] G. Valentin, M. Zuliani, D. Zilio, and G. Lohman. DB2 Advisor: An optimizer smart enough to recommend its own indexes. In Proceedings of ICDE 2000.
[15] C. Heeren, H. V. Jagadish, and L. Pitt. Optimal indexes using near-minimal
space. In PODS, 2003.
[16] Jeffrey A. Hoffer and Dennis G. Severance. The use of cluster analysis in physical
data base design. In Douglas S. Kerr, editor, Proceedings of the International
Conference on Very Large Data Bases, September 22-24, 1975, Framingham,
Massachusetts, USA, pages 69–86. ACM, 1975.
[17] http://hadoop.apache.org. Apache Hadoop project.
[18] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly.
Dryad: Distributed data-parallel programs from sequential building blocks. In
European Conference on Computer Systems (EuroSys), pages 59–72, Lisbon, Por-
tugal, March 21-23 2007. also as MSR-TR-2006-140.
[19] Sam S. Lightstone, Toby J. Teorey, and Tom Nadeau. Physical Database De-
sign: the database professional’s guide to exploiting indexes, views, storage, and
more (The Morgan Kaufmann Series in Data Management Systems). Morgan
Kaufmann Publishers Inc., San Francisco, CA, USA, 2007.
[20] Vincent Y. Lum and Huei Ling. An optimization problem on the selection of
secondary keys. In Proceedings of the 1971 26th annual conference, pages 349–
356, New York, NY, USA, 1971. ACM.
[21] T. Malik, X. Wang, R. Burns, D. Dash, and Anastasia Ailamaki. Automated
physical design in database caches. In In Proceedings of SMDB 2008, 2008.
[22] Shamkant Navathe, Stefano Ceri, Gio Wiederhold, and Jinglie Dou. Vertical par-
titioning algorithms for database design. ACM Trans. Database Syst., 9(4):680–
710, 1984.
[23] Stratos Papadomanolakis, Debabrata Dash, and Anastasia Ailamaki. Intelligent
use of query optimizer for automated physical design. Technical Report CMU-
CS-06-151, CMU, 2006.
[24] A. Rantzer. Dynamic programming via convex optimization, 1999.
[25] Jun Rao, Chun Zhang, Nimrod Megiddo, and Guy Lohman. Automating physical
database design in a parallel database. In SIGMOD ’02: Proceedings of the 2002
ACM SIGMOD international conference on Management of data, pages 558–569,
New York, NY, USA, 2002. ACM.
[26] P. Selinger, M. Astrahan, D. Chamberlin, R. Lorie, and T. Price. Access path
selection in a relational database management system. In SIGMOD 1979.
[27] Sandeep Tata, Lin Qiao, and Guy M. Lohman. On common tools for databases
- the case for a client-based index advisor. In ICDE Workshops, pages 42–49,
2008.
[28] http://dbtools.cs.cornell.edu/norm_index.html. The database normaliza-
tion tool.
[29] http://www.oracle.com/technology/obe/11gr1_db/manage/sqlaccadv/sqlaccadv.htm. Improving schema design with SQL Access Advisor.
[30] Daniel C. Zilio, Jun Rao, Sam Lightstone, Guy M. Lohman, Adam Storm, Christian Garcia-Arellano, and Scott Fadden. DB2 Design Advisor: Integrated automatic physical database design. In VLDB, pages 1087–1097, 2004.
[31] Daniel C. Zilio, Calisto Zuzarte, Guy M. Lohman, Hamid Pirahesh, Jarek Gryz, Eric Alton, Dongming Liang, and Gary Valentin. Recommending materialized views and indexes with IBM DB2 Design Advisor. In ICAC '04: Proceedings of the First International Conference on Autonomic Computing, pages 180–188, Washington, DC, USA, 2004. IEEE Computer Society.