Member, IEEE IEEE Proof - CityU CSchiychow/papers/TKDE_2016.pdf · 2016-04-24 · the contexts of...

15
IEEE Proof Explaining Missing Answers to Top-k SQL Queries Wenjian Xu, Zhian He, Eric Lo, and Chi-Yin Chow, Member, IEEE Abstract—Due to the fact that existing database systems are increasingly more difficult to use, improving the quality and the usability of database systems has gained tremendous momentum over the last few years. In particular, the feature of explaining why some expected tuples are missing in the result of a query has received more attention. In this paper, we study the problem of explaining missing answers to top-k queries in the context of SQL (i.e., with selection, projection, join, and aggregation). To approach this problem, we use the query-refinement method. That is, given as inputs the original top-k SQL query and a set of missing tuples, our algorithms return to the user a refined query that includes both the missing tuples and the original query results. Case studies and experimental results show that our algorithms are able to return high quality explanations efficiently. Index Terms—Missing answers, Top-K, SQL, usability Ç 1 INTRODUCTION A FTER decades of effort working on database perfor- mance, recently the database research community has paid more attention to the issue of database usability, i.e., how to make database systems and database applications more user friendly? Among all the studies that focus on improving database usability (e.g., keyword search [1], form-based search [2], query recommendation [3] and query auto-com- pletion [4]), the feature of explaining why some expected tuples are missing in the result of a query, or the so-called “why-not?” feature [5], is gaining momentum. A why-not question is being posed when a user wants to know why her expected tuples do not show up in the query result. Currently, end users cannot directly sift through the dataset to determine “why-not?” because the query inter- face (e.g., web forms) restricts the types of query that they can express. When end users query the data through a data- base application and ask “why-not?” but do not find any means to get an explanation through the query interface, that would easily cause them to throw up their hands and walk away from the tool forever—the worst result that nobody, especially the database application developers who have spent months to build the database applications, want to see. Unfortunately, supporting the feature of explaining missing answers requires deep knowledge of various database query evaluation algorithms, which is beyond the capabilities of most database application developers. In view of this, recently, the database community has started to research techniques to answer why-not questions on vari- ous query types. Among them, a few works have focused on answering why-not questions on Select-Project-Join-Aggre- gate (SPJA) SQL queries (e.g., [5], [6], [7], [8]) and preference queries (e.g., top-k queries [9], reverse skyline queries [10]). So far, these proposals only work independently. For exam- ple, when answering why-not questions on top-k queries, the proposal in [9] assumes there are no SPJ constructs (e.g., selection, projection, join, and aggregation). In this paper, we study the problem of answering why- not top-k questions in the context of SQL. Generally, a top-k query in SQL appears as: SELECT A 1 , ..., A m , aggðÞ FROM T 1 , ... T k WHERE P 1 AND ... AND P n GROUP BY A 1 , ..., A m ORDER BY f ð ~ wÞ LIMIT k where each P i is either a selection predicate “A i op v” or a join predicate “A j op A k ”, where A i is an attribute, v is a constant, and op is a comparison operator. Besides, aggðÞ is an aggregation function and f ð ~ wÞ is a ranking function with weighting vector ~ w. The weighting vector represents the user preference when making multi-criteria decisions. To address the problem of answering why-not questions on top-k SQL queries, we employ the query refinement approach [8], [9], [10]. Specifically, given as inputs the origi- nal top-k SQL query and a set of missing tuples, this approach requires to return to the user a refined query whose result includes the missing tuples as well as the original query results. In this paper, we show that finding the best refined query is actually computationally expensive. After- wards, we present efficient algorithms that can obtain the best approximate explanations (i.e., the refined query) in reasonable time. We present case studies to demonstrate our solutions. We also present experimental results to show that our solutions return high quality explanations efficiently. This paper is an extension of [9], [11], which W. Xu and E. Lo are with the Department of Computing, Hong Kong Polytechnic University, Hong Kong. E-mail: {cswxu, ericlo}@comp.polyu.edu.hk. Z. He is with the Department of Computer Science, University of Hong Kong, Hong Kong. E-mail: [email protected]. C.-Y. Chow is with the Department of Computer Science, City University of Hong Kong, Hong Kong. E-mail: [email protected]. Manuscript received 5 Jan. 2015; revised 28 Feb. 2016; accepted 18 Mar. 2016. Date of publication 0 . 0000; date of current version 0 . 0000. Recommended for acceptance by G. Das. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference the Digital Object Identifier below. Digital Object Identifier no. 10.1109/TKDE.2016.2547398 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 28, NO. X, XXXXX 2016 1 1041-4347 ß 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Transcript of Member, IEEE IEEE Proof - CityU CSchiychow/papers/TKDE_2016.pdf · 2016-04-24 · the contexts of...

Page 1: Member, IEEE IEEE Proof - CityU CSchiychow/papers/TKDE_2016.pdf · 2016-04-24 · the contexts of spatial keyword top-k queries [17], reverse top-k queries [18], and graph matching

IEEE P

roof

Explaining Missing Answersto Top-k SQL Queries

Wenjian Xu, Zhian He, Eric Lo, and Chi-Yin Chow,Member, IEEE

Abstract—Due to the fact that existing database systems are increasingly more difficult to use, improving the quality and the usability

of database systems has gained tremendous momentum over the last few years. In particular, the feature of explaining why some

expected tuples are missing in the result of a query has received more attention. In this paper, we study the problem of explaining

missing answers to top-k queries in the context of SQL (i.e., with selection, projection, join, and aggregation). To approach this

problem, we use the query-refinement method. That is, given as inputs the original top-k SQL query and a set of missing tuples, our

algorithms return to the user a refined query that includes both the missing tuples and the original query results. Case studies and

experimental results show that our algorithms are able to return high quality explanations efficiently.

Index Terms—Missing answers, Top-K, SQL, usability

Ç

1 INTRODUCTION

AFTER decades of effort working on database perfor-mance, recently the database research community has

paid more attention to the issue of database usability, i.e.,how to make database systems and database applications moreuser friendly? Among all the studies that focus on improvingdatabase usability (e.g., keyword search [1], form-basedsearch [2], query recommendation [3] and query auto-com-pletion [4]), the feature of explaining why some expectedtuples are missing in the result of a query, or the so-called“why-not?” feature [5], is gaining momentum.

A why-not question is being posed when a user wants toknow why her expected tuples do not show up in the queryresult. Currently, end users cannot directly sift through thedataset to determine “why-not?” because the query inter-face (e.g., web forms) restricts the types of query that theycan express. When end users query the data through a data-base application and ask “why-not?” but do not find anymeans to get an explanation through the query interface,that would easily cause them to throw up their hands andwalk away from the tool forever—the worst result thatnobody, especially the database application developers whohave spent months to build the database applications, wantto see. Unfortunately, supporting the feature of explainingmissing answers requires deep knowledge of variousdatabase query evaluation algorithms, which is beyond thecapabilities of most database application developers. Inview of this, recently, the database community has started

to research techniques to answer why-not questions on vari-ous query types. Among them, a few works have focused onanswering why-not questions on Select-Project-Join-Aggre-gate (SPJA) SQL queries (e.g., [5], [6], [7], [8]) and preferencequeries (e.g., top-k queries [9], reverse skyline queries [10]).So far, these proposals only work independently. For exam-ple, when answering why-not questions on top-k queries,the proposal in [9] assumes there are no SPJ constructs (e.g.,selection, projection, join, and aggregation).

In this paper, we study the problem of answering why-not top-k questions in the context of SQL. Generally, a top-kquery in SQL appears as:

SELECT A1, . . ., Am, aggð�ÞFROM T1, . . . Tk

WHERE P1 AND . . . AND Pn

GROUP BY A1, . . ., Am

ORDER BY fð~wÞLIMIT k

where each Pi is either a selection predicate “Ai op v” or ajoin predicate “Aj op Ak”, where Ai is an attribute, v is aconstant, and op is a comparison operator. Besides, aggð�Þ isan aggregation function and fð~wÞ is a ranking function withweighting vector ~w. The weighting vector represents theuser preference when making multi-criteria decisions.

To address the problem of answering why-not questionson top-k SQL queries, we employ the query refinementapproach [8], [9], [10]. Specifically, given as inputs the origi-nal top-k SQL query and a set of missing tuples, thisapproach requires to return to the user a refined query whoseresult includes the missing tuples as well as the originalquery results. In this paper, we show that finding the bestrefined query is actually computationally expensive. After-wards, we present efficient algorithms that can obtain thebest approximate explanations (i.e., the refined query) inreasonable time. We present case studies to demonstrateour solutions. We also present experimental results toshow that our solutions return high quality explanationsefficiently. This paper is an extension of [9], [11], which

� W. Xu and E. Lo are with the Department of Computing, Hong KongPolytechnic University, Hong Kong.E-mail: {cswxu, ericlo}@comp.polyu.edu.hk.

� Z. He is with the Department of Computer Science, University of HongKong, Hong Kong. E-mail: [email protected].

� C.-Y. Chow is with the Department of Computer Science, City Universityof Hong Kong, Hong Kong. E-mail: [email protected].

Manuscript received 5 Jan. 2015; revised 28 Feb. 2016; accepted 18 Mar. 2016.Date of publication 0 . 0000; date of current version 0 . 0000.Recommended for acceptance by G. Das.For information on obtaining reprints of this article, please send e-mail to:[email protected], and reference the Digital Object Identifier below.Digital Object Identifier no. 10.1109/TKDE.2016.2547398

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 28, NO. X, XXXXX 2016 1

1041-4347� 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Page 2: Member, IEEE IEEE Proof - CityU CSchiychow/papers/TKDE_2016.pdf · 2016-04-24 · the contexts of spatial keyword top-k queries [17], reverse top-k queries [18], and graph matching

IEEE P

roof

discussed answering why-not questions on top-k queries inthe absence of other SQL constructs such as selection, pro-jection, join, and aggregation.

The rest of the paper is organized as follows. Section 2presents the related work. Section 3 presents (i) the problemformulation, (ii) the problem analysis, and (iii) the algo-rithms of answering why-not questions on top-k SQLqueries. Section 4 extends the discussion to top-k SQLqueries with GROUP BY and aggregation. Section 5 demon-strates both the case studies and the experimental results.Section 6 concludes the paper.

2 RELATED WORK

Explaining a null answer for a database query was set out by[12], [13] but the concept of why-not was first formally dis-cussed in [5]. That work answers a user’s why-not questionon Select-Project-Join (SPJ) queries by telling herwhich queryoperator(s) eliminated her desired tuples. After that, this lineof work has gradually expanded. In [6] and [7], the missinganswers of SPJ [6] and SPJA [7] queries are explained by adata-refinement approach, i.e., it tells the user how the datashould be modified (e.g., adding a tuple) if she wants themissing answer back to the result. In [8], a query-refinementapproach is adopted. The answer to a why-not question is totell the user how to revise her original SPJA queries so thatthe missing answers can return to the result. They define thata good refined query should be (a) similar—have few “edits”comparing with the original query (e.g., modifying the con-stant value in a selection predicate is a type of edit; adding/removing a join predicate is another type of edit) and (b) pre-cise — have few extra tuples in the result, except the originalresult plus the missing tuples. In this paper, we adopt thequery-refinement approach as our explanation model and alsoapply the above similarity and precisionmetrics.

This paper is an extension of our early work [9], [11],which discussed the answering of why-not questions ontop-k queries in the absence of other SQL constructs such asselection, projection, join, and aggregation. Consideringthese additional SQL constructs could complicate the solu-tion space. As an example, consider table U in Fig. 1a andthe following top-3 SQL query:

Qo:SELECT U.IDFROM UWHERE U.A � 205ORDER BY 0.5 * U.A + 0.5 * U.BLIMIT 3

Fig. 1b shows the ranking scores of all tuples in U and thetop-3 result is {P3, P1, P2}. Assuming that we are interested inaskingwhy P5 is not in the top-3, we see that using SPJA querymodification techniques in [8] to modify only the SPJ con-structs (e.g.,modifying WHERE clause to beU.A� 140) cannotinclude P5 in the top-3 result (because P5 indeed ranks 4thunder the currentweighting ~w ¼ j0:5 0:5j). Using our prelimi-nary top-k query modification technique [9] to modify onlythe top-k constructs (e.g., modifying k to be four) cannotwork either because P5 is filtered by the WHERE clause.This motivates us to develop holistic solutions that consider themodification of both SPJA constructs and top-k constructs in order

to answer why-not questions on top-k SQL queries. For the exam-ple above, the following refined query Q0 is one candidateanswer:

Q0:SELECT U.IDFROM UWHERE U.A � 140ORDER BY 0.5 * U.A + 0.5 * U.BLIMIT 4

Q0 is precise because it includes no extra tuple and is sim-ilar to Qo because only essential edits were carried out: (1)modifying from U.A � 205 to U.A � 140, and (2) modify-ing k from three to four.

There are a few other related works. In [14], the authorsdiscuss how to help users to quantify their preferences as aset of weightings. Their solution is based on presentingusers a set of objects to choose, and try to infer the users’weightings based on the objects that they have chosen. Inthe why-not paradigm, users are quite clear with which arethe missing objects and our job is to explain to them whythose objects are missing. In [15], the notion of reverse top-kqueries is proposed. A reverse top-k query takes as input atop-k query, a missing object ~m, and a set of candidateweightings W . The output is a weighting ~w 2W that makes~m in its top-k result. Two solutions are given in [15]. Thefirst one insists users to provide W as input, which slightlylimits its practicability. The second one does not requireusers to provide W , however, it only works when the top-kqueries involve two attributes. Although the problems looksimilar, answering why-not questions on top-k SQL queriesindeed does not require users to provideW .

In [10], the notion of why-not questions on reverse skylinequeries is proposed. Given a set of data pointsD and a queryq, the dynamic skyline of a data point p 2 D is the skyline of atransformed data space using p as the origin and the reverseskyline [16] returns the objects whose dynamic skyline con-tains the query point q. The why-not question to a reverseskyline query then asks about why a specific data point p 2 Dis not in the reverse skyline of q. In [10], explanations usingboth data-refinement and query-refinement are discussed. Inthe data-refinement approach, the explanation is based onmodifying the values (in minimal) of p 2 D such that pappears in the reverse skyline of q. In the query-refinementapproach, the explanation is based on modifying the querypoint q until p is in its reverse skyline. In addition to that, thediscussion of why-not questions has also been extended tothe contexts of spatial keyword top-k queries [17], reversetop-k queries [18], and graph matching [19]. All these worksabove, however, are not SQL oriented.

Fig. 1. Motivating example.

2 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 28, NO. X, XXXXX 2016

Page 3: Member, IEEE IEEE Proof - CityU CSchiychow/papers/TKDE_2016.pdf · 2016-04-24 · the contexts of spatial keyword top-k queries [17], reverse top-k queries [18], and graph matching

IEEE P

roof

Overall, the interactive query refinement approach (e.g.,[20], [21], [22], [23]) advocates that users should be guidedto refine the query step-by-step in a systematic way. Theinteractive approach is more commonly used when the userhas no specific target. In [14], a step-by-step interactiveapproach is employed to help users that are unsure abouttheir preferences when specifying preference queries. In[20], an interactive approach is adopted to suggest usersalternative queries when a user’s initial query returns anempty answer. In the former, the users have no specific tar-get preference. In the latter, the users have no specific targetrefined query.

The single shot (i.e., non-interactive) approach is morecommonly used when the user has a specific target. In reversetop-k [15], it takes the single shot approach to find a set ofweightings such that a missing tuple specified by the userappears in the top-k result under those weightings. In dataprovenance (e.g., [24], [25]), the single shot approach isadopted to answer why and how a tuple specified by the useris generated. In why-not problems, the user has specific tar-gets as well (i.e., the why-not tuples). Therefore, we observethat most why-not problems employ the single shotapproach [5], [8], [17], [18]. In this paper, we also adopt thesingle shot approach.

3 ANSWERING WHY-NOT QUESTIONS ON TOP-KSPJ QUERIES

In this section, we first focus on answering why-not ques-tions on top-k SQL queries with SPJ clauses. We will extendthe discussion to why-not top-k SQL queries with GROUP

BY and aggregation in the next section. For easy reading,Table 1 summarizes the most frequently used symbols inthis paper.

3.1 The Problem and The Explanation Model

We consider a top-k SPJ query Qwith a set of Select-Project-Join clauses SPJ, a monotonic scoring function f and aweighting vector ~w ¼ jw½1� w½2� � � � w½d�j, where d is thenumber of attributes in the scoring function. For simplicity,we assume a larger value means a better score (and rank)and the weighting space subject to the constraint

Pw½i� ¼ 1

where 0 � w½i� � 1. We only consider conjunctions of predi-cates P1 ^ . . . ^ Pn, where each Pi is either a selection predi-cate “Aj op v” or a join predicate “Aj op Ak”, where A is anattribute, v is a constant, and op is a comparison operator.For simplicity, our discussion focuses on � comparisonbecause generalizing our discussion to other comparisonoperators is straightforward. The query result would thenbe a set of k tuples whose scores are the largest (in casetuples with the same scores are tied at rank kth, only one ofthem is returned).

Initially, a user issues an original top-k SPJ queryQoðSPJo; ko; ~woÞ on a dataset D. After she gets the queryresult, denoted as Ro, she may pose a why-not question witha set of missing tuples Y ¼ fy1; . . . ; ylg (l � 1), where yi hasthe same set of projection attributes as Qo. In this paper, weadopt the query-refinement approach in [8] so that the sys-tem returns the user a refined query Q0ðSPJ 0; k0; ~w0Þ, whoseresult R0 includes Y and Ro, i.e., fY [Rog � R0. It is possiblethat there are indeed no refined queries Q0 that can includeY (e.g., Y contains a missing tuple whose expected attributevalues indeed do not exist in the database). For those cases,the system will report to the user about her error.

There are possibly multiple refined queries for being theanswers to a why-not question hQo; Y i. We thus use DSPJ ,Dk, and Dw to measure the quality of a refined query Q0,where Dk ¼ k0 � ko, Dw ¼ jj~w0 � ~wojj2, and DSPJ is definedbased on four different types of edit operations of SPJclauses adopted in [8]:

� (e1) modifying the constant value of a selection pred-icate in the WHERE clause.

� (e2) adding a selection predicate in the WHERE clause.� (e3) adding/removing a join predicate in the WHERE

clause.� (e4) adding/removing a relation in the FROM clause.Following [8], we do not allow other edit operations such

as changing the projection attributes (because users usuallyhave a clear intent about the projection attributes). Note thatthere is no explicit edit operation for removing a selectionpredicate, since it is equivalent to modifying the constantvalue in the predicate to cover the whole domain of theattribute. Furthermore, we also do not consider modifyingthe joins to include self-join. Let ci denote the cost of theedit operation ei, and we follow [8] to set c1 ¼ 1; c2 ¼ 3;c3 ¼ 5; c4 ¼ 7. So, DSPJ ¼P

1�i�4ðci niÞ, where ni is the

number of edit operations ei used to obtain the refinedquery Q0. In order to capture a user’s tolerance to thechanges of SPJ clauses, k, and ~w on her original query Qo,we first define a basic penalty model that sets the penalties�spj, �k and �w to DSPJ , Dk and Dw, respectively, where�spj þ �k þ �w ¼ 1:

Basic Penalty ¼ �spjDSPJ þ �kDkþ �wDw: (1)

TABLE 1Major Notations in This Paper

Notation Meaning

D The datasetQoðSPJo; ko; ~woÞ The original query with SPJ clauses SPJo,

the k value ko and the weighting ~wo

Ro The query result of Qo on dataset DQSo The original query schemaQS0 The refined query schema

SQS0sel

A set of candidate selection conditionsunder QS0

Sw A set of candidate weighting vectorsPi A selection predicate or a join predicateAi An attribute of a relationvmini

The minimum value of attribute Ai amongtuples in Y [Ro

Y A set of missing tuples (groups)rij The worst rank of tuples in Y [Ro under

selection condition seli and weighting ~wj

Q0ij The refined query under selection conditionseli and weighting ~wj

Rij The query result of Q0ij on dataset Dci The edit cost of edit operation eirT The threshold ranking for stopping a

progressive top-k SQL operation earlierDwf The maximum feasible weighting distance

Dw for skipping progressive top-k SQLoperations

XU ET AL.: EXPLAINING MISSING ANSWERS TO TOP-K SQL QUERIES 3

Page 4: Member, IEEE IEEE Proof - CityU CSchiychow/papers/TKDE_2016.pdf · 2016-04-24 · the contexts of spatial keyword top-k queries [17], reverse top-k queries [18], and graph matching

IEEE P

roofNote that the basic penalty model is able to capture both

the similar and precise requirements. Specifically, a refinedquery Q0 that minimizes Basic Penalty implies it is similarto the original query Qo. To make the result precise (i.e.,having fewer extra tuples), we can set a larger penalty �k toDk such that modifying k significantly is undesired.

The basic penalty model, however, has a drawbackbecause Dk generally could be a large integer (as large asjDj) whereas Dw and DSPJ are generally smaller. One possi-ble way to mitigate this discrimination is to normalize themrespectively.

[Normalizing DSPJ] We normalize DSPJ using the max-imum editing cost DSPJmax.

Definition 1 (maximum editing cost DSPJmax). Given theoriginal query Qo, the maximum editing cost DSPJmax is

the editing cost of obtaining a refined SPJ query QSPJmax, whose

(1) SPJ constructs most deviated from the SPJ constructs ofthe original queryQo (based on the four types of edit operationse1 to e4) and (2) with a query result that includes all missingtuples Y and the original query result Ro.

Example 1 (maximum editing cost DSPJmax). Fig. 2 showsan example database D with three base tables T1, T2, andT3. Assume a user has issued the following top-2 SQLquery:

Qo:SELECT BFROM T1, T2

WHERE T1.A = T2.A AND D � 400ORDER BY 0.5 * D + 0.5 * ELIMIT 2

By referring to Fig. 3 (the join result of T1 ffl T2), theresult Ro of the top-2 query is: {Gary, Alice}. Assumingthe missing tuples set Y of the why-not question is

{Chandler}. Then, QSPJmax is:

1

QSPJmax:

SELECT BFROM T1, T2, T3T3

WHERE T1.A = T2.A AND T1:A ¼ T3:AT1:A ¼ T3:AAND C � 50AND D � 100AND E � 50AND F � 60ANDG � 200ANDH � 60

Accordingly, DSPJmax = 1 c1 þ 5 c2 þ 1 c3þ 1c4 = 1þ 5 3þ 5þ 7 = 28.

[Normalizing Dk] We normalize Dk using (ro � ko),where ro is the worst rank with minimal edits.

Definition 2 (worst rank with minimal edits). The worstrank with minimal edits ro is the worst rank among all tuples

in Y [Ro of a refined top-k SQL query QSPJmin , whose (1) SPJ

constructs least deviated from the original query Qo (measuredby c1 to c4), (2) using the original weighting ~wo, (3) with aquery result that includes all missing tuples Y and the originalquery resultRo, and (4) the modification of k is minimal.

To explain why ro is a suitable value to normalize Dk, wefirst remark that we could normalize Dk using the cardinalityof the join result ofQo because that is the worst possible rank.But to get amore reasonable normalizing constant, we look atEquation (1). First, to obtain the “worst” but reasonable valueof Dk, we can assume that we do not modify the weighting,leading to condition (2) in Definition 2. Similarly, we do notwant to modify the SPJ constructs so much but we hope theSPJ constructs at least do not filter out the missing tuples Yand the original query resultRo, leading to conditions (1) and

(3) in Definition 2. So, based on Example 1,QSPJmin is:

QSPJmin :

SELECT BFROM T1, T2

WHERE T1.A = T2.A AND D � 200ORDER BY 0.5 * D + 0.5 * ELIMIT 7

We note that the following is not QSPJmin although it also

satisfies conditions (1) to (3) because its modification of k isfrom two to eight, which is not minimal (condition 4) com-

paring with the true QSPJmin above:

SELECT BFROM T1, T2

WHERE T1.A = T2.A AND D � 100ORDER BY 0.5 * D + 0.5 * ELIMIT 8

So, based on QSPJmin , Dk will be normalized by ðro � koÞ ¼

ð7� 2Þ ¼ 5.[Normalizing Dw] Let the original weighting vector

~wo ¼ jwo½1� wo½2� � � � wo½d�j, we normalize Dw usingffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi1þP

wo½i�2q

, because:

Lemma 1. In our concerned weighting space, given ~wo and anarbitrary weighting vector ~w ¼ jw½1� w½2� � � � w½d�j, Dw �ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi

1þPwo½i�2

q.

Fig. 2. Running example: data setD.

Fig. 3. T1 ffl T2 ranked under ~wo ¼ j0:5 0:5j.

1. To obtainQSPJmax in this example, we (i) add all other tables (e.g., T3)

with the corresponding join conditions, and (ii) add/refine selectionpredicates for all attributes so that Y [Ro is included the query result.

4 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 28, NO. X, XXXXX 2016

Page 5: Member, IEEE IEEE Proof - CityU CSchiychow/papers/TKDE_2016.pdf · 2016-04-24 · the contexts of spatial keyword top-k queries [17], reverse top-k queries [18], and graph matching

IEEE P

roofProof. First, we have Dw ¼ k~w� ~wok2 ¼

ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiPðw½i� � wo½i�Þ2q

.Since w½i� and wo½i� are both nonnegative, then we

can haveffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiPðw½i� � wo½i�Þ2

q�

ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiPðw½i�2 þ wo½i�2Þq

¼ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiPw½i�2 þP

wo½i�2q

. It is easy to see thatP

w½i�2 �ðPw½i�Þ2. As

Pw½i� ¼ 1, we know that

Pw½i�2 � 1.

Therefore, we have Dw �ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiP

w½i�2 þPwo½i�2

q�ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi

1þPwo½i�2

q. tu

Summarizing the above discussion about normalizingEquation (1), our normalized penalty function is as follows:

Penalty ¼ �spjDSPJ

DSPJmaxþ �k

Dk

ðro � koÞ þ �wDwffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi

1þPwo½i�2

q : (2)

The problem definition is as follows. Given a why-notquestion hQo; Y i, where Y is a set of missing tuples and Qo

is the user’s initial query with result Ro, our goal is to find arefined top-k SQL query Q0ðSPJ 0; k0; ~w0Þ that includesY [Ro in the result with the smallest Penalty. In this paper,we use Equation (2) as the penalty function. Nevertheless,our solution works for all kinds of monotonic (with respectto all DSPJ , Dk and Dw) penalty functions. For better usabil-ity, we do not explicitly ask users to specify the values for�spj, �k and �w. Instead, we follow our early work [9] so thatusers are prompted to answer a simple multiple-choicequestion as illustrated in Fig. 4.2

Assume the default option “Never mind” is chosen.Table 2 lists some examples of refined queries that could bethe answer of the why-not question to Qo in Example 1.According to the above discussion, we have DSPJmax ¼28, ro � ko ¼ 5 and

ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi1þP

wo½i�2q

¼ 1:22. Among those

refined queries, Q01 dominates Q02 because its Dk is smallerthan that of Q02 and the other dimensions are equal. Thebest refined query in the example is Q04 (Penalty ¼ 0:11).At this point, readers may notice that the best refinedquery is in the skyline of the answer space of three dimen-sions: (1) DSPJi, (2) Dki, and (3) Dwi. Later, we will showhow to exploit properties like this to obtain better effi-ciency in our algorithm.

3.2 Problem Analysis

Answering a why-not question is essentially searching forthe best refined SPJ clauses and weighting in (1) the spaceSSSPJ of all possible modified SPJ clauses and in (2) the space

SSw of all possible weightings, respectively. It is not neces-sary to search for k because once the best set of SPJclauses and the best weighting ~w are found, the value of kcan be accordingly set as the worst rank of tuples inY [Ro. The search space SSSPJ can be further divided intotwo: (1a) the space of the query schemas SSQS and (1b) thespace of all the selection conditions SSsel. A query schemaQS represents the set of relations in the FROM clause andthe set of join predicates in the WHERE clause. A selectioncondition represents the set of selection predicates in theWHERE clause.

First, the space SSQS is Oð2nÞ, where n is the number ofrelations in the database. Second, the space SSsel given aquery schema is OðQm

1 ðjAij þ 1ÞÞ, where m is the number ofattributes in that query schema, jAij is the number of dis-tinct values in attribute Ai, and jAij þ 1 takes into accountattribute Ai can optionally be or not be added to the selec-tion condition. Third, the space SSw is infinite. Hence, it isobvious that finding the exact best refined query fromSSQS SSsel SSw is impractical.

3.3 The Solution

According to the problem analysis presented above, find-ing the best refined query is computationally difficult.Therefore, we trade the quality of the answer with the run-ning time.

Let us start the discussion by illustrating our basic ideaunder the assumption that there is only one missing tuple y.First, we observe that not every query schema can generatea query whose results contain fyg [Ro. For Example 1, ifthe missing tuple y is {Henry}, then the query schemaT1 ffl T2 does not include y. In this case, no matter how weexhaust the search space SSsel and SSw, we cannot generate avalid refined query. In other words, during the search fory ¼ {Henry}, if we are able to filter out query schemaT1 ffl T2, then the subsequent search in Ssel and Sw can beskipped accordingly. Hence, we enumerate the space SSQS

prior to SSsel and SSw. Similarly, we enumerate SSsel prior toSSw, since some predicates in SSsel may filter out the tuples infyg [Ro

When enumerating SSQS , there could be multiple queryschemas that satisfy the requirement of generating a querywhose results contain fyg [Ro. Our idea here is similar to[8], which starts from the original query schema QSo, car-ries out incremental modification to QSo (using edit opera-tion e4), and stops once we have found a query schemaQS0 for which queries based on that can include fyg [Ro

in the result. If no such a query schema is found, we reportto the user that no refined query can answer her why-notquestion.

Fig. 4. A multiple-choice question for freeing users from specifying �spj,�k, and �w.

TABLE 2Examples of Candidate Refined Queries

(Option Never Mind (NM): �spj ¼ 1=3, �k ¼ 1=3, �w ¼ 1=3)

Refined Query DSPJi Dki Dwi Penalty

Q01ðfT1 fflA T2; D � 200g; 7; j0:5 0:5jÞ 1 5 0 0.35

Q02ðfT1 fflA T2; D � 100g; 8; j0:5 0:5jÞ 1 6 0 0.41

Q03ðfT1 fflA T2; D � 200g; 7; j0:6 0:4jÞ 1 5 0.14 0.38

Q04ðfT1 fflA T2; D � 200 ^ C � 90g; 3; j0:5 0:5jÞ 4 1 0 0.11

Q05ðfT1 fflA T2; D � 200g; 3; j0:2 0:8jÞ 1 1 0.42 0.20

2. The number of choices and the pre-defined values for �spj, �k and�w, of course, could be adjusted. For example, to make the result precise(i.e., having fewer extra tuples), we suggest the user to choose theoption where �k is a large value.

XU ET AL.: EXPLAINING MISSING ANSWERS TO TOP-K SQL QUERIES 5

Page 6: Member, IEEE IEEE Proof - CityU CSchiychow/papers/TKDE_2016.pdf · 2016-04-24 · the contexts of spatial keyword top-k queries [17], reverse top-k queries [18], and graph matching

IEEE P

roof

Once the target query schema QS0 is found, we next enu-

merate3 all possible selection conditions SSQS0sel that can be

derived from QS0 with a set Sw of weighting vectors. Theset Sw includes a random sample S of vectors ~w1; ~w2; . . . ; ~ws

from the weighting space SSw and the original weighting ~wo.That is, Sw ¼ f~wog [ S. For each selection condition

seli 2 SSQS0sel and each weighting ~wj 2 Sw, we formulate a

refined top-k SQL query Q0ij and execute it using a pro-

gressive top-k SQL algorithm (e.g., [26], [27], [28]), whichprogressively reports each top rank tuple one-by-one,until all tuples in fyg [Ro come forth to the result set at

ranking rij. So, after jSSQS0sel j � jSwj progressive top-k SQL

executions, we have jSSQS0sel j � jSwj refined queries. Finally,

using (1) rij as the refined k0, (2) seli and QS0 to formu-late the refined SPJ clauses SPJ 0, and (3) ~wj as therefined weighting ~w0, the refined query Q0ðSPJ 0; k0; ~w0Þwith the least penalty is returned to the user as the why-not answer.

The basic idea above incurs many progressive top-k SQLexecutions. In the following, we present our algorithm indetail. The algorithm consists of three phases and includesdifferent optimization techniques in order to reduce the exe-cution time.

[PHASE-1] In this phase, we start from the originalquery schema QSo and do incremental modification toQSo (using edit operation e4) and stop until we find aquery schema QS0 for which queries based on that caninclude fyg [Ro in the result. As all the subsequent con-sidered refined queries will be based on QS0, we material-ize the join result J in memory in order to avoid repeatedcomputation of the same join result based on QS0 in thesubsequent phase. If no such a query schema is found,we report to the user that no refined query can answerher why-not question. Finally, we randomly sample sweightings ~w1; ~w2; . . . ; ~ws from the weighting space andadd them into Sw in addition to ~wo.

[PHASE-2] Next, for a subset SQS0sel � SSQS0

sel of selectionconditions that can be derived from QS0 and weighting vec-tors ~wj 2 Sw, we execute a progressive top-k SQL query onthe materialized join result J using a selection condition

seli 2 SQS0sel and a weighting ~wj 2 Sw until a stopping condition

is met. From now on, we denote a progressive top-k SQLexecution as:

rij ¼ topkðseli; ~wj; stopping-conditionÞ;where rij denotes the rank when all fyg [Ro come forth tothe results under selection condition seli and weighting ~wj.

In the basic idea mentioned above, we have to execute

jSSQS0sel j � jSwj progressive top-k SQL executions. In Section 3.3.1,

we will illustrate that we can just focus on a much smaller

subset SQS0sel � SSQS0

sel without jeopardizing the quality of the

answer. Consequently, the number of progressive top-k SQL

executions could be largely reduced to jSQS0sel j � jSwj.

Furthermore, the original stopping condition in our basicidea is to proceed the TOPK execution on J until all tuples infyg [Ro come forth to the result4. However, if some tuplesin fyg [Ro rank very poorly under some weighting ~wj, the

corresponding progressive top-k SQL operation may bequite slow because it has to access many tuples in the mate-rialized join result J . In Section 3.3.2, we present a muchmore aggressive and effective stopping condition thatmakes most of those operations stop early before fyg [Ro

come forth to the result.Finally, we present a technique in Section 3.3.3 that can

identify some weightings in Sw whose generated refinedqueries have poorer quality, thereby skipping the TOPK exe-cution for those weightings to gain better efficiency.

[PHASE-3] Using rij as the refined k0, seli and QS0 as therefined SPJ clauses SPJ 0, and ~wj as the refined weighting ~w0,the refined query Q0ðSPJ 0; k0; ~w0Þ with the least penalty isreturned to the user as the why-not answer.

3.3.1 Excluding Selection Predicates That Could not

Yield Good Refined Queries

In the basic idea, we have to enumerate all possible selectionconditions based on the query schema QS0 chosen inPHASE-1. As mentioned, if there are m attributes in QS0

and jAij is the number of distinct values in attribute Ai in

that query schema, there would be OðQm1 ðjAij þ 1ÞÞ possible

selection conditions. However, the following lemma canhelp us to exclude selection conditions that could not yieldgood refined queries:

Lemma 2. If we need to modify the original selection conditionby modifying the constant value vi of its selection predicateP in the form of Ai � vi (edit operation e1), or adding aselection predicate P in the form of Ai � vi (edit operatione2) to the selection condition, we can simply consider only

one possibility of P , which is Ai � vmini , where vmin

i is theminimum value of attribute Ai among tuples in fyg [Ro,because P in any other form would not lead to betterrefined top-k SQL queries whose penalties are better than P

as Ai � vmini .

Proof. Let sel0 ¼ fA1 � v1; . . . ; Al � vlg ðl � mÞ be the selec-tion condition after modification and particularly Ai � v0ibe a predicate P 0 in sel0 who gets modified/added fromthe original selection condition.

First, v0i in P 0 has to be smaller than or equal to vmini or

otherwise some tuples in fyg [Ro would get filteredaway.

Second, comparing a predicate P : Ai � x, with x as

any value smaller than or equal to vmini , and the predicate

Pmin: Ai � vmini , these two predicates incur the same

DSPJ to the original selection condition.Third, as the predicate P : Ai � x, is less restrictive

than the predicate Pmin: Ai � vmini , more tuples could

pass P . So, given a weighting ~w, the worst rank of

3. Let A denote all possible subsets of attributes in QS0. For eachsubset Aset 2 A, each enumerated selection condition is in the form ofVðAi � vjÞ, 8Ai 2 Aset; vj is enumerated from the domain of Ai. Only

selection conditions that do not filter fyg [Ro are added to SSQS0sel .

4. There could be multiple tuples in J that match y. One option is torandomly choose one in J that matches y as the missing tuple. Anotheroption is to repeat the whole process for each matching tuple and regard theone with the lowest penalty as the answer. It is a tradeoff between effi-ciency and quality. In this paper, we adopt the first option.

6 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 28, NO. X, XXXXX 2016

Page 7: Member, IEEE IEEE Proof - CityU CSchiychow/papers/TKDE_2016.pdf · 2016-04-24 · the contexts of spatial keyword top-k queries [17], reverse top-k queries [18], and graph matching

IEEE P

roof

fyg [Ro under P cannot be better than that under Pmin.That implies Dk under P would not be better than thatunder Pmin as well. Therefore, we conclude that top-kSQL queries based on P would not lead to better refinedtop-k SQL queries based on Pmin. tuConsider Example 1 again. The query schema QS0 that

can include back the missing tuple (Chandler) would leadto a join result like Fig. 3. Originally, we have to considereight predicates when dealing with attribute D, which areD � 500, D � 400, D � 300, D � 290, D � 280, D � 250,D � 210, and D � 100. By using Lemma 2, we just need toconsider D � 200 because among fyg [Ro ¼ {Chandler,Gary, Alice}, their attribute values of D are 200, 400, and500, with 200 as the minimum. Similar for attribute E, byusing Lemma 2, we just need to consider E � 80. The abovediscussion can be straightforwardly generalized to othercomparisons including �, < , and > .

3.3.2 Stopping a Progressive Top-k SQL Operation

Earlier

In PHASE-2 of our algorithm, the basic idea is to executethe progressive top-k SQL query until fyg [Ro comeforth to the results, with rank rij. In the following, weshow that it is actually possible for a progressive top-kSQL execution to stop early even before fyg [Ro comeforth to the result.

Consider an example where a user specifies a top-kSQL query QoðSPJo; ko ¼ 2; ~wo) on a data set D and poses awhy-not question about a missing tuple y. Assume that thelist of weightings Sw is ½~wo; ~w1; ~w2; ~w3� and topkðsel1; ~wo;until-see-ffyg [RogÞ is first executed, where all tuples infyg [Ro come forth to the result at rank 7. Now, we have ourfirst candidate refined query Q01oðsel1; r1o ¼ 7; ~wo), withDk ¼ 7� 2 ¼ 5 and Dw ¼ jjwo � wojj2 ¼ 0. The correspond-ing penalty, denoted as, PenQ0

1o, could be calculated using

Equation (2). Remember that we want to find the refinedquery with the least penalty Penmin. So, at this moment, weset penaltyPenmin ¼ PenQ0

1o.

According to our basic idea outlined above, we shouldnext execute another progressive top-k SQL using the selec-tion condition sel1 and another weighting vector, say, ~w1,until fyg [Ro come forth to the result with a rank r11.However, we notice that the skyline property in the answerspace can help to stop that operation earlier, even beforefyg [Ro are seen. Given the first candidate refined queryQ01oðsel1; r1o ¼ 7; ~woÞwith Dw ¼ 0 and Dk ¼ 5, any other can-didate refined queries Q01j with Dk > 5 must be dominated

by Q01o. In our example, after the first executed progressivetop-k SQL execution, all the subsequent progressive top-kSQL executions with sel1 can stop once fyg [Ro do notshow up in the top-7 tuples.

Fig. 5 illustrates the answer space of the example. Theidea above essentially means a progressive top-k SQL exe-cution can stop if fyg [Ro do not show up in result afterreturning the top-7 tuples (i.e., Dk > 5; see the dottedregion). For example, consider Q011 in Fig. 5 whose weight-ing is w1 and fyg [Ro show up in the result only whenk ¼ 9, i.e., Dk ¼ 9� 2 ¼ 7. So, the progressive top-k SQLassociated with Q011 can stop after k ¼ 7, because after that

Q011 has no chance to dominate Q01o anymore. In otherwords, the progressive top-k SQL operation associated withQ011 does not wait to reach k ¼ 9 where fyg [Ro come forthto the result but stop early when k ¼ 7.

While useful, we can actually be even more aggressivein many cases. Consider another candidate refined query,say, Q012, in Fig. 5. Assume that r12 ¼ topkðsel1; ~w2;until-see-ffyg [RogÞ ¼ 6 (i.e., Dk ¼ 6� 2 ¼ 4), whichis not covered by the above technique (since Dk 6> 5).However, Q012 can also stop early, as follows. In Fig. 5,we show the normalized penalty Equation (2) as a slopePenmin ¼ PenQ0

1othat passes through the best refined

query so far (currently Q01o). All refined queries that lie onthe slope have the same penalty value as Penmin. In addi-tion, all refined queries that lie above the slope actuallyhave a penalty larger than Penmin, and thus are inferior toQ01o. Therefore, similar to the skyline discussion above,

we can determine an even tighter threshold ranking rT forstopping the subsequent progressive top-k SQL opera-tions with sel1:

rT ¼ DkT þ ko, where

DkT ¼ Penmin � �spjDSPJ

DSPJmax� �w

Dwffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi1þP

wo½i�2q

0B@

1CA ro � ko

�k

66664

77775:

(3)

Equation (3) is a rearrangement of Equation (2) (withPenalty ¼ Penmin). Back to our example in Fig. 5, given thatthe weighting of candidate refined query Q012 is ~w2, wecan first compute its Dw2 value. Then, we can project Dw2

onto the slope Penmin (currently Penmin ¼ PQ01o) to obtain

the corresponding DkT value, which is three in Fig. 5.That means, if we carry out a progressive top-k SQLoperation using sel1 as the predicates and ~w2 as theweighting, and if fyg [Ro still do not appear in result

after the top-5 tuples (rT ¼ DkT þ ko ¼ 3þ 2 ¼ 5) are seen,then we can stop it early because the penalty of Q012 isworse than the penalty Penmin of the best refined query(Q01o) seen so far.

Following the discussion above, we now have two earlystopping conditions for the progressive top-k SQL algo-rithm: until-see-ffyg [Rog and until-rank-rT . Except

Fig. 5. Example of answer space under selection condition sel1.

XU ET AL.: EXPLAINING MISSING ANSWERS TO TOP-K SQL QUERIES 7

Page 8: Member, IEEE IEEE Proof - CityU CSchiychow/papers/TKDE_2016.pdf · 2016-04-24 · the contexts of spatial keyword top-k queries [17], reverse top-k queries [18], and graph matching

IEEE P

roof

for the first progressive top-k SQL operation whichtopkðsel1; ~wo;until-see-ffyg [RogÞ must be used, the sub-sequent progressive top-k SQL operations with sel1 can use

“until-see-ffyg [Rog OR until-rank-rT” as the stopping

condition. We remark that the conditions until-rank-rT

and until-see-ffyg [Rog are both useful. For example,assume that the actual worst rank of fyg [Ro under sel1and ~w2 is four, which gives it a Dk ¼ 2 (see Q0012 in Fig. 5).Recall that by projecting Dw2 onto the slope ofPenmin ¼ PenQ0

1o, we can stop the progressive top-k SQL

operation after rT ¼ 3þ 2 ¼ 5 tuples have been seen. How-ever, using the condition until-see-ffyg [Rog, we can stopthe progressive top-k SQL operation when all fyg [Ro

show up at rank four. This drives us to use

“until-see-ffyg [Rog OR until-rank-rT” as the stoppingcondition.

Finally, we remark that the optimization power of thistechnique increases while the algorithm proceeds. Forexample, after Q0012 has been executed, the best refined queryseen so far should be updated as Q0012 (because its penalty isbetter than that of Q01o). Therefore, Penmin now is updatedas PQ00

12and the slope Penmin should be updated to pass

through Q0012 now (the dotted slope in Fig. 5). Because

Penmin is continuously decreasing, DkT and thus the thresh-

old ranking rT would get smaller and smaller and the subse-quent progressive top-k SQL operations can terminate evenearlier while the algorithm proceeds.

The above early stopping technique can be applied to thesubsequent selection conditions. In our example, after selec-tion condition sel1 and turning to consider selection condi-

tion sel2 with the set of weightings Sw, we can derive rT

based on Equation (3) by simply reusing the Penmin

obtained from sel1.

3.3.3 Skipping Progressive Top-k SQL Operations

In PHASE-2 of our algorithm, the basic idea is to executeprogressive top-k SQL queries for all selection conditions in

SQS0sel and all weightings in Sw. After the discussion in Sec-

tion 3.3.2, we know them some progressive top-k SQL exe-cutions can stop early. We now illustrate three pruningopportunities where some of those executions could beskipped entirely.

The first pruning opportunity is based on the observationfrom [15] that under the same selection condition seli, similarweighting vectors may lead to top-k SQL results with morecommon tuples. Therefore, if an operation TOPKðseli; ~wj;

until-see-ffyg [RogÞÞ for ~wj has already been executed, andif a weighting ~wl is similar to ~wj, then we can use the queryresult Rij of topkðseli; ~wj;until-see-ffyg [RogÞ to deducethe smallest k value for seli and ~wl. Let k

0 be the deduced kvalue for seli and ~wl. If the deduced k0 is larger than the

threshold ranking rT , then we can skip the entireTOPKðseli; ~wl; stopping-conditionÞ operation.

We illustrate the above by reusing our running example.Assume that we have cached the result sets of executed pro-gressive top-k SQL queries. Let R1o be the result set of thefirst executed query topkðsel1; ~wo;until-see-ffyg [RogÞand Ro ¼ ft5; t6g. Assume that R1o ¼ ½t1; t2; t3; t4; t5; t6; y�.Then, when we are considering the next weighting vector,

say, ~w1, in Sw, we first follow Equation (3) to calculate the

threshold ranking rT . In Fig. 5, projecting ~w1 onto slope PQ01o

we get rT ¼ 4þ 2 ¼ 6. Next we calculate the scores of alltuples in R1o using ~w1 as the weighting. More specifically, letus denote the tuple in fyg [Ro under weighting vector ~w1

as tbad if it has the worst rank among fyg [Ro. In the exam-ple, assume under ~w1, the scores of t1, t2, t3 and t4 are stillbetter than tbad, then the k0 value for sel1 and ~w1 is at least

4þ 3 ¼ 7. Since k0 is worse than rT ¼ 6, we can skip theentire TOPKðsel1; ~w1; stopping-conditionÞ operation.

The above caching technique is shown to be the mosteffective between similar weighting vectors [15]. Therefore,we design the algorithm in a way that the list of weightingsSw is sorted according to their corresponding Dwi values (ofcourse, ~wo is in the head of the list since Dwo ¼ 0). In addi-tion, the technique is general so that the cached result for aspecific selection condition seli can also be used to derivethe smallest k0 value for another selection condition selj. Aslong as seli and selj are similar, the chance that we candeduce k0 from the cached result that leads to TOPK operationpruning is also higher. So, we design the algorithm in a waythat seli is enumerated in increasing order of Dsel as well.

The second pruning opportunity is to exploit the bestpossible ranking of fyg [Ro (under all possible weightings)to set up an early termination condition for some weightings,so that after a certain number of progressive top-k SQLoperations have been executed under seli, operations associ-ated with some other weightings for the same seli can beskipped.

Recall that the best possible ranking of fyg [Ro is ko þ 1,since jfyg [Roj ¼ ko þ 1. Therefore, the lower bound of Dk,denoted as DkL equals 1. So, this time, we project DkL ontoslope Penmin in order to determine the corresponding maxi-

mum feasible Dw value. We name that value as Dwf . For any

Dw > Dwf , it means “fyg [Ro has Dk < DkL”, which isimpossible. As our algorithm is designed to examineweightings in their increasing order Dw values, when a

weighting wj 2 Sw has jjwj � wojj > Dwf , topkðseli; ~wj;

stopping-conditionÞ and all subsequent progressive top-kSQL operations topkðseli; ~wl; stopping-conditionÞ wherel > jþ 1 could be skipped.

Reuse Fig. 5 as an example. By projecting DkL ¼ 1 ontothe slope PenQ0

1o, we could determine the corresponding

Dwf value. So, when the algorithm finishes executing a pro-gressive top-k SQL operation for weighting ~w2, the algo-rithm can skip all the remaining weightings and proceed toexamine the next selection condition.

As a remark, we would like to point out that the pruningpower of this technique also increases while the algorithmproceeds. For instance, in Fig. 5, if Q0012 has been executed,slope Penmin is changed from slope PenQ0

1oto slope PenQ00

12.

Projecting DkL onto the new Penmin slope would result in a

smaller Dwf , which in turn increases the chance of eliminat-ing more weightings.

The last pruning opportunity is to set up an early termi-nation condition for the whole algorithm. In fact, since we aresorting seli in their increasing order of Dsel, as soon as we

encounter a DSPJ > ðPenmin � �kDkLro�koÞ

DSPJmax�spj

, we can skip

all subsequent progressive top-k SQL operations and

8 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 28, NO. X, XXXXX 2016

Page 9: Member, IEEE IEEE Proof - CityU CSchiychow/papers/TKDE_2016.pdf · 2016-04-24 · the contexts of spatial keyword top-k queries [17], reverse top-k queries [18], and graph matching

IEEE P

roof

terminate the algorithm. This equation is a rearrangement ofthe following equation:

Penmin < �spjDSPJ

DSPJmaxþ �k

DkLro � ko

: (4)

3.3.4 How Large Should Be the List of Weighting

Vectors?

Given that there are an infinite number of points (weight-ings) in the weighting space, how many sample weightingsshould we put into Sw in order to obtain a good approximation ofthe answer?

Recall that more sample weightings in Sw will increasethe number of progressive top-k SQL executions and thusthe running time. Therefore, we hope Sw to be as small aspossible while maintaining good approximation. We say arefined query is the best-T percent refined query if its penaltyis smaller than ð1� T Þ percent refined queries in the whole(infinite) answer space, and we hope the probability of get-ting at least one such refined query is larger than a certainthreshold Pr:

1� ð1� T%Þs � Pr: (5)

Equation (5) is general. The sample size s is independent ofthe data size but is controlled by two parameters: T percentand Pr. Following our usual practice of not improvingusability (i.e., why not queries) by increasing the users’ bur-den (e.g., specifying parameter values for T percent, andPr), we make T percent, and Pr system parameters and letusers override their values only when necessary.

The pseudo-code of the complete algorithm is presentedin Algorithm 1. It is self-explanatory and mainly summa-rizes what we have discussed above, so we do not give it afull walkthrough here.

3.3.5 Multiple Missing Tuples

To handle multiple missing tuples Y ¼ fy1; . . . ; ylg; l > 1,we just need little modification to the algorithm. First of all,a refined query needs to ensure fY [Rog (instead offyg [Ro) come forth to the result. Correspondingly, thestopping condition for the progressive top-k SQL algorithm

becomes “until-see-fY [Rog OR until-rank-rT”. Apartfrom that, when generating the candidate selection condi-tions, we need to consider the minimum attribute values forall tuples in Y [Ro.

4 ANSWERING WHY-NOT QUESTIONS ON TOP-KSPJA QUERIES

In this section, we extend the discussion to why-not top-kSQL queries with GROUP BY and aggregation.

4.1 The Problem and The Explanation Model

Initially, a user issues an original top-k SPJA queryQoðSPJAo; ko; ~woÞ on a dataset D. After she gets the resultRo, she may pose a why-not question about a set of missinggroups Y ¼ fg1; . . . ; glg (l � 1), where gi has the same set ofprojection attributes as Qo. Then, the system returns theuser a refined query Q0ðSPJA0; k0; ~w0Þ, whose result R0

includes Y and Ro, i.e., fY [Rog � R0. If there are indeed

no refined queries Q0 that can include Y and Ro, the systemwill report to the user about her error.

Algorithm 1. Answering a Why-not Top-k SPJ Question

Input:The dataset D; original top-k SQL query QoðSPJo; ko; ~woÞ andits query result Ro; the missing tuple y; penalty settings �spj,�k, �w; T% and Pr; edit cost for SPJ clauses: c1, c2, c3 and c4

Output:A refined query Q0ðSPJ 0; k0; ~w0ÞPhase 1:1: Obtain QS0 and J by doing incremental modification to the

original query schema QSo;2: if QS0 does not exist then3: return “cannot answer the why-not question”;4: end if5: Construct SQS0

sel based on Section 3.3.1;6: Sort SQS0

sel according to their Dsel value;7: Determine s from T% and Pr using Equation (5);8: Sample s weightings from the weighting space, add them

and ~wo into Sw;9: Sort Sw according to their Dw values;

Phase 2:10: Penmin 1;11: Dwf 1;12: for all seli 2 SQS0

sel do13: if DSPJi > ðPenmin � �k

DkLro�koÞ

DSPJmax�spj

then

14: break; //Section 3.3.3 — early algorithm termination15: end if16: for all ~wj 2 Sw do17: if Dwj > Dwf then18: break; //Technique 3.3.3 — skipping weightings19: end if20: Compute rT based on Equation (3);21: if there exist rT� jfyg [Roj þ 1 objects in some Rij 2 RR

having scores better than the worst rank of tuples infyg [Ro under ~wj then

22: continue; //Section 3.3.3 — use cached result to skipa progressive top-k SQL operation

23: end if24: ðRij; rijÞ topkðseli; ~wj;until-see-ffyg [Rog

or until-rank-rT Þ; //Section 3.3.2 — stopping a pro-gressive top-k SQL operation earlier

25: Compute Penij based on Equation (2);26: Add Rij intoRR;27: if Penij < Penmin then28: Penmin Penij;29: Update Dwf according to Section 3.3.3;30: end if31: end for32: end for

Phase 3:33: Return the best refined query Q0ðSPJ 0; k0; ~w0Þwhose penalty

equals Penmin;

Following [8], we do not modify the aggregation functionand the group-by attributes in the original query. For theSPJA clauses, we adopt the four edit operations in Section 3whereas the penalty function is similar to Equation (2):

Penalty ¼ �spjaDSPJA

DSPJAmaxþ �k

Dkðro�koÞ þ �w

Dwffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi1þP

wo½i�2p : (6)

XU ET AL.: EXPLAINING MISSING ANSWERS TO TOP-K SQL QUERIES 9

Page 10: Member, IEEE IEEE Proof - CityU CSchiychow/papers/TKDE_2016.pdf · 2016-04-24 · the contexts of spatial keyword top-k queries [17], reverse top-k queries [18], and graph matching

IEEE P

roof

The calculation of DSPJA is the same as that of DSPJ .The cost of DSPJAmax refers to the editing cost of obtaining

a refined SPJA query QSPJAmax , whose definition is the same as

QSPJmax except that it needs to include all missing groups Y

and the original result groups Ro.The problem definition is as follows. Given a why-not

question hQo; Y i, where Y is a set of missing groups and Qo

is the user’s initial query with result Ro, our goal is to find arefined top-k SQL query Q0ðSPJA0; k0; ~w0Þ that includesY [Ro in the result with the smallest penalty with respectto Equation (6). Again, we do not explicitly ask users tospecify the values for �spja, �k and �w, but prompt users toanswer a simple multiple-choice question listed in Fig. 4.

4.2 Problem Analysis

Since we allow the same set of edit operations as top-k SPJqueries, the problem complexity of answering why-not top-k SPJA questions follows the one presented in Section 3.2.

4.3 The Solution

The following changes are required to extend Algorithm 1to handle why-not top-k SPJA questions.

First, when materializing the join result J of QS0 (Line 1),we sort J using the group-by attributes so as to facilitatethe subsequent grouping step. Second, we do not applyLemma 2. Recall that Lemma 2 was proposed to excludeselection conditions that could not yield good refinedqueries. Specifically, when enumerating selection condi-tions, Lemma 2 considers only selection predicates in the

form of Ai � vmini , where vmin

i is the minimum value of attri-bute Ai among tuples in fyg [Ro. That works because amissing tuple y corresponds to a single value in attribute Ai.However, in the context of missing groups, a missing groupY could correspond to multiple values in attribute Ai. There-fore, Lemma 2 might miss some good refined queries.

Example 2. Consider the following top-k SPJA query basedon the data set in Fig. 2:

Qo:SELECT BFROM T1, T2

WHERE T1.A = T2.A AND D � 400GROUP BY BORDER BY AVG(0.5 * D + 0.5 * E)LIMIT 2

The result Ro is : {Gary, Alice}. Let us assume the why-not question is asking for the missing group {Daniel}.From Fig. 3, we see that the group {Daniel} is composedby two base tuples (i.e., the two rows with B equals Dan-iel), with a group score average as ð185þ 155Þ=2 ¼ 170. Ifwe follow Lemma 2 (using the minimum attribute D’svalues among Gary, Alice, and Daniel) to modify theselection condition D � 400 to D � 100, the top-4 wouldthen become:

GROUP BY B AVG(0.5 * D + 0.5 * E) Rank

Gary 300 1Alice 240 2Bob 175 3

Daniel 170 4

If we do not follow Lemma 2, we can modify the selec-tion condition D � 400 to, say, D � 300. In this manner,the top-3 result already includes all required groups{Gary, Alice, Daniel} due to the fact that the group {Bob}has been excluded by D � 300. Thus, we obtain a betterrefined query by (i) improving Dk from 2 to 1 and(ii) keeping DSPJ and Dw unchanged.

Lastly, when using the caching technique described inSection 3.3.3, we cannot just cache the resulting groups orotherwise we do not have the base tuples that contribute toeach group to derive the new score using another weighting(or selection condition). Therefore, for that technique, wealso cache the base tuples for each resulting group.

5 EXPERIMENTS

We evaluate our proposed techniques through a case studyand a performance study. By default, we set the systemparameters T percent and Pr as 0:5 percent and 0:7, respec-tively, resulting in a sample size of 241 weighting vectors.The algorithms are implemented in C++ and the experi-ments are run on a Ubuntu PC with Intel 3.4 GHz i7 proces-sor and 16 GB RAM. For comparisons, we implement threebaseline algorithms:

� Top-k: The algorithm in our early work [9], whichmodifies only top-k (i.e., ~w and k) but not SQL’s con-structs. This baseline may not be able to answersome why-not questions if the missing tuple isindeed filtered by some SQL constructs.

� Top-k-adapted: This baseline adapts our earlywork [9] as follows: when sampling s weightings,we draw samples from a restricted weighting spaceconstituted by the missing tuples Y and the originalresult Ro, as well as the data points that are incompa-rable with Y [Ro. It was proved by us that such arestricted weighting space is smaller than the entireweighting space and therefore samples from thatspace could have higher quality [9]. Since the con-struction of the restricted weighting space dependson data points that are incomparable with Y [Ro,and the selection condition seli would influence thenumber of such incomparable data points, so wehave to construct a specific restricted weighting

space for each seli 2 SQS0sel .

� Staged: This algorithm employs a two-stage pro-cess. (1) Ignore the top-k part and first apply thewhy-not SPJA solution in [8] to get the refined query.(2) If the desired tuples still do not show up in thetop-k, increase k accordingly.

5.1 Case Study

The case study is based on the NBA data set, whose schema isshown in Fig. 6. It contains statistics of all NBA players from1,973–2,009 with four tables: (i) Player (4,051 tuples) recordsplayers’ name and their career start year, (ii) Career(4,051 tuples) stores players’ performance in their wholecareer, (iii) Regular (21,961 tuples) and (iv) Playoffs (8,341tuples) contain players’ performance year-by-year in regulargames and playoffs games, respectively. We designedfour cases based on four queries on the data set. Then, we putthe cases and the results returned by our approach, Top-k,

10 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 28, NO. X, XXXXX 2016

Page 11: Member, IEEE IEEE Proof - CityU CSchiychow/papers/TKDE_2016.pdf · 2016-04-24 · the contexts of spatial keyword top-k queries [17], reverse top-k queries [18], and graph matching

IEEE P

roof

Top-k-adapted, and Staged online5 and recruited 100qualified AMT (Amazon Mechanical Turk) workers to gradethe user satisfaction about the why-not answers in a scale of 1to 4 (1 - very dissatisfied, 2 - dissatisfied, 3 - satisfied, 4 - verysatisfied). Table 3 shows a summary of the results in terms ofexecution time, penalty and user satisfaction. In the following,we present somemore details.

Case 1 (Finding the top-3 players who have played at least1,000 games in NBA history). The first case study was to findthe top-3 players with at least 1,000-game experience in theNBA history. Therefore, we issued a query Q1:

Q1:SELECT P.NameFROM Player P, Career CWHERE P.PID = C.PID AND C.GP � 1,000ORDER BY (17 * C.PPG + 1

7 * C.RPG + 17 * C.APG

+ 17 * C.SPG + 1

7 * C.BPG + 17 * C.FG percent

+ 17 * C.FT percent) DESC

LIMIT 3

The initial result was6:

Rank Name GP PPG RPG APG SPG BPG FG% FT%

1 Michael Jordan 1072 30.1 6.2 5.3 2.3 0.8 49.7 83.5

2 Oscar Robertson 1040 25.7 7.5 9.5 0 0 48.5 83.8

3 Kareem Abdul-jabbar 1560 24.6 11.2 3.6 0.7 2.0 55.9 72.1

We were surprised that Magic Johnson, who has won 5NBA championships and 3 NBA Most Valuable Player(MVP) in his career, was not in the result. So we issued awhy-not question hQ1, {Magic Johnson}i.

Using our algorithm, we got a refined query Q01 in10.2 ms (modifications are in bold face):

Q01:SELECT P.NameFROM Player P, Career CWHERE P.PID = C.PID AND C.GP � 906ORDER BY (17 * C.PPG + 1

7 * C.RPG + 17 * C.APG

+ 17 * C.SPG + 1

7 * C.BPG + 17 * C.FG percent

+ 17 * C.FT percent) DESC

LIMIT 4

Its new top-k result was:

Rank Name GP PPG RPG APG SPG BPG FG% FT%

1 Michael Jordan 1072 30.1 6.2 5.3 2.3 0.8 49.7 83.5

2 Magic Johnson 906 19.5 7.2 11.2 1.9 0.4 52.0 84.8

3 Oscar Robertson 1040 25.7 7.5 9.5 0 0 48.5 83.8

4 Kareem Abdul-jabbar 1560 24.6 11.2 3.6 0.7 2.0 55.9 72.1

The refined query essentially hinted us that our originalselection predicate C.GP � 1,000 (the number of gamesplayed) eliminated Magic Johnson (who got brilliantrecords without playing a lot of games).

In this case, baseline algorithm Top-k could not answerthe why-not question because it could not modify the predi-cate C.GP � 1,000. Baselines Top-k-adapted andStaged returned the same refined query (Q01) as ourapproach. However, they took much longer time (i.e., 258and 96ms, respectively).

Case 2 (Finding the top-3 players who performed best in year2004). In this case, we first look up the top-3 players in year2004:

Q2:SELECT P.NameFROM Player P, Regular R, Playoffs OWHERE P.PID = R.PID AND P.PID = O.PID

AND R.Year = 2004 AND O.Year = 2004ORDER BY (17 * R.PPG + 1

7 * R.RPG +17 * R.APG

+ 17 * R.SPG + 1

7 * R.BPG + 17 * R.FG percent

+ 17 * R.FT percent) DESC

LIMIT 3

The initial result was: {Dirk Nowitzki, Allen Iverson,Steve Nash}. Wewonderedwhy Kobe Bryant was missing inthe answer, so we posed a why-not question hQ2, {KobeBryant}i.

Our algorithm returned the following refined query in18.1 ms:

Q02:SELECT P.NameFROM Player P, Regular R Playoffs O_____________

WHERE P.PID = R.PID AND P.PID = O.PID________________________

AND R.Year = 2004 AND O.Year = 2004________________________

ORDER BY (0.2125 * R.PPG + 0.0902 * R.RPG+ 0.1955 * R.APG + 0.0844 * R.SPG+ 0.1429 * R.BPG + 0.0732 * R.FG percent+ 0.2013 * R.FT percent) DESC

LIMIT 4

The refined query essentially hinted us that Kobe Bryantdid not play PlayOff in year 2004. We checked back thedata, and we were confirmed with the fact that KobeBryant’s host team, LA Lakers, failed to enter the Playoffsin that year. The result of Q02 was: {Allen Iverson, DirkNowitzki, Steve Nash, Kobe Bryant}. As an interpretation,in addition to eliminating the PlayOff records, the refinedweighting of Q02 indicates that, in order to include KobeBryant in the result, we should weigh the players’ scoring/assisting/free throwing abilities higher than the otherabilities.

In this case, baseline Top-k again could not workbecause it could not modify the query schema. BaselineTop-k-adapted returned the same refined query (Q02) as

Fig. 6. Schema of the NBA data set.

5. http://goo.gl/forms/64sYbnmFIl6. We attach extra columns to the results for better understanding.

XU ET AL.: EXPLAINING MISSING ANSWERS TO TOP-K SQL QUERIES 11

Page 12: Member, IEEE IEEE Proof - CityU CSchiychow/papers/TKDE_2016.pdf · 2016-04-24 · the contexts of spatial keyword top-k queries [17], reverse top-k queries [18], and graph matching

IEEE P

roofour approach, but took much longer time (576 ms). Baseline

Staged returned the following refined query with higherpenalty (0.37) and lower user satisfaction (2.1) compared toour approach (0.13 and 3.6, respectively):

Q002 :SELECT P.NameFROM Player P, Regular R , Playoffs O_____________

WHERE P.PID = R.PID AND P.PID = O.PID________________________

AND R.Year = 2004 AND O.Year = 2004________________________

ORDER BY (17 * R.PPG + 17 * R.RPG

+ 17 * R.APG + 1

7 * R.SPG

+ 17 * R.BPG + 1

7 * R.FG percent

+ 17 * R.FT percent) DESC

LIMIT 7

Summary of Results: Due to space limit, we do notinclude the details of case 3 and case 4 here. Briefly, case 3 isto find the top-3 players who performed best in their earlycareer and case 4 is to find the top-3 players who haveplayed more than 15 seasons of Playoffs, with the field goalpercentage higher than 50 percent in each season. Readerscan access the link of the online case study for details. Insummary, baseline Top-k is not applicable to answer manywhy-not questions (Q1, Q2, and Q4) because it cannot mod-ify the SQL constructs. Baseline Top-k-adapted is able toreturn answers with penalty as good as our approach but itis slower by an order of magnitude. That is because it has toconstruct a restricted weighting space for each selectionpredicate. Baseline Staged runs slower and gives answerswith poorer penalty and user satisfaction. That is becauseStaged is not solving the problem holistically but in astaged manner.

5.2 Performance Study

We use TPC-H to perform a performance study. Weselected 10 TPC-H queries that can let us extend as top-kSPJA queries with minor modifications. The list of modifica-tions of the selected queries is listed in Table 6.

Table 4 shows the parameters we varied in the experi-ments. The default values are in bold faces. The defaultweighting is ~wo ¼ j 1d . . . 1

d j, where d is the number of attrib-utes in the ranking function. By default, the why-not ques-tion asks for a missing tuple/group that is rankedð10 � ko þ 1Þ-th under ~wo. The performance study includestwo parts. Section 5.2.1 studies the various aspects of ourproposed algorithm. Section 5.2.2 evaluates our algorithmwith the baselines.

5.2.1 Microbenchmark

Effectiveness of Optimization Techniques. In this experiment,weinvestigate the effectiveness of (i) excluding unnecessaryselection condition (Section 3.3.1), (ii) the early stopping (Sec-tion 3.3.2) and (iii) skipping (Section 3.3.3) used in our algo-rithm. Fig. 7 shows the performance of our algorithm usingonly (i), only (ii), only (iii), all, and none, under the defaultsetting. The effectiveness of the techniques is very promisingwhen they are applicable (in particular, Lemma 2 is not usedwhen the queries contain GROUP-BY). Without using anyoptimization technique, the algorithm requires a runningtime of roughly 1,000 seconds on these TPC-H queries. How-ever, our algorithm runs about two to three orders fasterwhen our optimization techniques are all enabled.

Varying Data Size. Fig. 8a shows the running time of ouralgorithm under different data size (i.e., scale factor of TPC-H). We can see our algorithm for answering why-not ques-tions scales linearly with the data size for all queries.

Varying ko. Fig. 8b shows the running time of our algo-rithm using top-k SQL queries with different ko values. Inthis experiment, when a top-5 query (ko ¼ 5) is used, thecorresponding why-not question is to ask why the tuple inrank 51st is missing. Similarly, when a top-50 query(ko ¼ 50) is used, the corresponding why-not question is toask why the tuple in rank 501-st is missing. Naturally, whenko increases, the time to answer a why-not question shouldalso increase because the execution time of a progressivetop-k SQL operation also increases with ko. Fig. 8b showsthat our algorithm scales well with ko.

TABLE 4Parameter Setting

Parameter Range

Data size 1G, 2G, 3G, 4G, 5GT% 10%, 5%, 1%, 0.5%, 0.1%Pr 0.1, 0.3, 0.5, 0.7, 0.9

Preference option PMSPJ, PMW, PMK,NMko 5, 10, 50, 100ro 101, 501, 1001jY j 1, 2, 3, 4, 5

Fig. 7. Effectiveness of optimization techniques.

TABLE 3Result of Case Study

Approaches Q1 Q2 Q3 Q4

Time Penalty UserSatisfaction

Time Penalty UserSatisfaction

Time Penalty UserSatisfaction

Time Penalty UserSatisfaction

Our Approach 10.2 ms 0.39 3.5 18.1 ms 0.13 3.6 48.3 ms 0.09 3.7 29.5 ms 0.19 3.6Top-k — — — — — — 52.2 ms 0.25 2.4 — — —Top-k-adapted 258 ms 0.39 3.5 576 ms 0.13 3.6 566 ms 0.09 3.7 319.5 ms 0.19 3.6Staged 96 ms 0.39 3.5 158 ms 0.37 2.1 105 ms 0.09 3.7 95.5 ms 0.37 2.0

12 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 28, NO. X, XXXXX 2016

Page 13: Member, IEEE IEEE Proof - CityU CSchiychow/papers/TKDE_2016.pdf · 2016-04-24 · the contexts of spatial keyword top-k queries [17], reverse top-k queries [18], and graph matching

IEEE P

roofVarying the Missing Tuple to be Inquired.We next study the

performance of our algorithm by posing why-not questionswith missing tuples from different rankings (i.e., ro). In thisexperiment, we set ko ¼ 10 and asked three individual why-not questions about why the tuple that ranked 101st, 501st,and 1001st, respectively, is missing in the result. Fig. 8cshows that our algorithm scales well with the ranking of themissing tuple. Of course, when the missing tuple has aworse ranking under the original weighting ~wo, the progres-sive top-k SQL operation should take a longer time to dis-cover it in the result and thus the overall running time mustincrease.

Varying Preference Option.We next study the performanceof algorithm under different user preference on changingSPJ constructs, k and ~w. These values can be system parame-ters or user specified as stated in Section 3.1. In Fig. 8d, it isgood to show that our algorithm is insensitive to variouspreference options. In all cases, our algorithm can returnanswer very efficiently.

Varying the Number of Missing Tuples jY j. We also studythe performance of our algorithm by posing why-notquestions with different numbers of missing tuples. Inthis experiment, a total of five why-not questions areasked for each TPC-H query. In the first question, onemissing tuple that ranked 101th under ~wo is included inY . In the second question, two missing tuples that respec-tively ranked 101th and 102th under ~wo are included in

Y . The third to the fifth questions are constructed simi-larly. Fig. 8e shows that our algorithm scales linearlywith respect to different size of Y .

Varying Query Size. In this experiment, we study the per-formance based on varying the number of predicates ineach query. For example, when the number of predicates iscontrolled to 1, we randomly keep one selection predicatein the query and discard the rest. Fig. 8f shows that the per-formance is insensitive to the query size. That is because thesearch space of selection conditions SQS0

sel only depends onthe number of attributes in the query schema QS0 instead ofthe query size.

Varying T percent. We would also like to know howthe performance and solution quality of our algorithmvary when we look for refined queries with differentquality guarantees. Figs. 9a and 9b show the runningtime of our algorithm and the penalty of the returnedrefined queries when we changed from accepting refinedqueries that are within the best 10 percent (jSj ¼ 12) toaccepting refined queries that are within the best 0.1 per-cent (jSj ¼ 1;204). From Fig. 9a, we can see that the run-ning time of our algorithm increases when the guaranteeis more stringent. However, from Fig. 9b, we can seethat the solution quality of the algorithm improves whenT increases, until T reaches 1 percent where furtherincreases the sample size cannot yield any significantimprovement.

Varying Pr. The experimental results of varying Pr, theprobability of getting the best-T percent of refined queries,are similar to the results of varying T percent above. That is

Fig. 8. Varying parameters.

Fig. 9. Varying T percent or Pr. Fig. 10. Comparison with baselines.

XU ET AL.: EXPLAINING MISSING ANSWERS TO TOP-K SQL QUERIES 13

Page 14: Member, IEEE IEEE Proof - CityU CSchiychow/papers/TKDE_2016.pdf · 2016-04-24 · the contexts of spatial keyword top-k queries [17], reverse top-k queries [18], and graph matching

IEEE P

roofbecause both parameters are designed for controlling the

quality of the approximate solutions. In Figs. 9c and 9d, wecan see that when we vary Pr from 0.1 (jSj ¼ 22) to 0.9(jSj ¼ 460), the running times increase mildly. However, thesolution quality also increases gradually.

5.2.2 Comparison with Baselines

Fig. 10 shows the performance and penalty of our algorithmand the three baselines on TPC-H. Similar to our findings incase study, we observe that our approach is themost efficient.Top-k is comparable with our approach in terms of perfor-mance, but the quality of its answers is way poorer. Top-k-adapted is comparable with our approach in terms ofanswer quality, but its performance is the worst. Staged ispoor in terms of both performance and answer quality. Wealso show a time breakdown of our algorithm. We see thatPhase-2 dominates the execution time since it requires proc-essing progressive top-k SQL operations for each candidatequery.

Table 5 shows the maximum memory usage of our algo-rithm and the baselines. We observe that all algorithms con-sume roughly similar amount of memory. The memory ismainly used for holding the the join result J in memory,which is used in Phase-1 of all algorithms (Section 3.3), inorder to avoid repeated computation of the same join resultbased on QS0 in the subsequent phase.

Lastly, we indeed also implemented another baselinealgorithm, Exact, which leverages a quadratic program-ming (QP) solver to solve the why-not problem mathemati-cally. Unfortunately, this exact solution ran too slow andthus we only ran experiments for TPC-H Q2 on 1GB data.In that experiment, the best refined query returned byExact incurs a penalty of 0.298. However, Exact took 30days to find that. In contrast, our algorithm is able to find arefined query with 0.31 penalty, but in 1.45 seconds.

6 CONCLUSION

In this paper, we have studied the problem of answeringwhy-not questions on top-k SQL queries. Our target is togive an explanation to a user who is wondering why herexpected answers are missing in the query result. We returnto the user a refined query that can include the missingexpected answers back to the result. Our case studies andexperimental results show that our solutions efficientlyreturn very high quality solutions. In future work, we willstudy this issue on queries involving non-numeric attributes.

ACKNOWLEDGMENTS

The research was supported by grants 521012E, 520413E,15200715 from Hong Kong RGC. Zhian He is the corre-sponding author.

TABLE 6Modified TPC-H Queries (Modified Parts are in Bold Face; [...]

Means That Clause is Exactly the Same as the OriginalTPC-H Queries)

Q2 SELECT s_acctbal, s_name, n_name, p_partkey, p_mfgr, s_address,

s_phone, s_comment

FROM part, supplier, partsupp, nation, region

WHERE [...]

ORDER BY (0.5 * p_retailprice + 0.5 * ps_supplycost) DESC LIMIT 10

Q3 SELECT l_orderkey, sum(0.5 * l_extendedprice * (1 - l_discount) +

0.5 * o_totalprice) as amount, o_orderdate, o_shippriority

FROM customer, orders, lineitem

WHERE [...]

GROUP BY l_orderkey, o_orderdate, o_shippriority

ORDER BY amount DESC LIMIT 10

Q9 SELECT nation, o_year, sum(amount) as sum_profit

FROM (SELECT n_name as nation, extract(year from o_orderdate) as

o_year,

(0.5 * l_extendedprice * (1 - l_discount) -

0.5 * ps_supplycost*l_quantity) as amount

FROM part, supplier, lineitem, partsupp, orders, nation

WHERE [...]) as profit

GROUP BY nation, o_year

ORDER BY sum_profit DESC LIMIT 10

Q10 SELECT c_custkey, c_name, sum(0.5 * l_extendedprice * (1 - l_discount)+

0.5 * c_acctbal) as amount,

n_name, c_address, c_phone, c_comment

FROM customer, orders, lineitem, nation

WHERE [...]

GROUP BY c_custkey, c_name, c_acctbal, c_phone, n_name,

c_address, c_comment

ORDER BY amount DESC LIMIT 10

Q11 SELECT ps_partkey, sum(0.5 * ps_supplycost * ps_availqty +

0.5 * s_acctbal) as value

FROM partsupp, supplier, nation

WHERE ps_suppkey=s_suppkey AND s_nationkey=n_nationkey

AND n_name=‘GERMANY’

GROUP BY [...]

ORDER BY value DESC LIMIT 10

Q12 SELECT o_orderkey, sum(0.5 * l_extendedprice * (1 - l_discount) +

0.5 * o_totalprice) as score

FROM orders, lineitem

WHERE [...]

GROUP BY o_orderkey

ORDER BY score DESC LIMIT 10

Q13 SELECT c_custkey, sum(0.5 * c_acctbal + 0.5 * o_totalprice) as score

FROM customer, orders

WHERE [...]

GROUP BY c_custkey

ORDER BY score DESC LIMIT 10

Q16 SELECT p_brand, p_type, p_size, sum(0.5 * ps_supplycost +

0.5 * p_retailprice) as value

FROM partsupp, part

WHERE [...]

GROUP BY p_brand, p_type, p_size

ORDER BY value, p_brand, p_type, p_size DESC LIMIT 10

Q18 SELECT c_name, p_custkey, p_orderkey, o_orderdate, o_totalprice,

sum(0.5 * l_extendedprice * (1 - l_discount)

+ 0.5 * o_totalprice) as value

FROM customer, orders, lineitem

WHERE [...]

GROUP BY c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice

ORDER BY value, o_totalprice, o_orderdate DESC LIMIT 10

Q20 SELECT s_name, s_address, sum(0.5 * s_acctbal +

0.5 * ps_supplycost) as value

FROM supplier, nation

WHERE [...]

ORDER BY s_name, value DESC LIMIT 10

TABLE 5Maximum Memory Usage (in MB)

Approaches Q2 Q3 Q9 Q10 Q11 Q12 Q13 Q16 Q18 Q20

Our Approach 25 258 647 788 147 288 887 103 687 15Top-k 26 261 650 780 150 291 879 101 680 16Top-k-adapted 25 260 655 789 148 290 870 101 689 16Staged 24 258 650 788 150 285 888 105 690 14

14 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 28, NO. X, XXXXX 2016

Page 15: Member, IEEE IEEE Proof - CityU CSchiychow/papers/TKDE_2016.pdf · 2016-04-24 · the contexts of spatial keyword top-k queries [17], reverse top-k queries [18], and graph matching

IEEE P

roof

REFERENCES

[1] S. Agrawal, S. Chaudhuri, and G. Das, “DBXplorer: A system forkeyword-based search over relational databases,” in Proc. 18th Int.Conf. Data Eng., 2002, pp. 5–16.

[2] H. Wu, G. Li, C. Li, and L. Zhou, “Seaform: Search-as-you-type informs,” in Proc. VLDB Endowment, 2010, vol. 3, no. 2, pp. 1565–1568.

[3] J. Akbarnejad, G. Chatzopoulou, M. Eirinaki, S. Koshy, S. Mittal,D. On, N. Polyzotis, and J. S. V. Varman, “SQL QueRIE recom-mendations,” in Proc. VLDB Endowment, 2010, vol. 3, no. 2,pp. 1597–1600.

[4] M. B. N. Khoussainova, Y.C. Kwon and D. Suciu, “Snipsuggest:Context-aware autocompletion for SQL,” in Proc. VLDB Endow-ment, 2010, vol. 4, no. 1, pp. 22–33.

[5] A. Chapman and H. Jagadish, “Why not?” in Proc. ACM SIGMODInt. Conf. Manage. Data, 2009, pp. 523–534.

[6] J. Huang, T. Chen, A.-H. Doan, and J. F. Naughton, “On the prove-nance of non-answers to queries over extracted data,” in Proc.VLDB, 2008, pp. 736–747.

[7] M. Herschel and M. A. Hern�andez, “Explaining missing answersto SPJUA queries,” in Proc. VLDB, 2010, pp. 185–196.

[8] Q. T. Tran and C.-Y. Chan, “How to ConQueR why-not ques-tions,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2010,pp. 15–26.

[9] Z. He and E. Lo, “Answering why-not questions on top-k quer-ies,” in Proc. IEEE 28th Int. Conf. Data Eng., 2012, pp. 750–761.

[10] I. Md. Saiful, Z. Rui, and L. Chengfei, “On answering why-notquestions in reverse skyline queries,” in Proc. IEEE 28th Int. Conf.Data Eng., 2013, pp. 973–984.

[11] Z. He and E. Lo, “Answering why-not questions on top-k quer-ies,” IEEE Trans. Knowl. Data Eng., vol. 26, no. 6, pp. 1300–1315,Jun. 2014.

[12] A. Motro, “Query generalization: A method for interpreting nullanswers,” in Proc. 1st Int. Expert Database Workshop, 1984, pp. 597–616.

[13] A. Motro, “SEAVE: A mechanism for verifying user presupposi-tions in query systems,” ACM Trans. Inf. Syst., vol. 4, no. 4,pp. 312–330, 1986.

[14] F. Zhao, K.-L. T. G. Das, and A. K. H. Tung, “Call to order: A hier-archical browsing approach to eliciting users’ preference,” in Proc.ACM SIGMOD Int. Conf. Manage. Data, 2010, pp. 27–38.

[15] A. Vlachou, C. Doulkeridis, Y. Kotidis, and K. Nørva�g, “Reverse

top-k queries,” in Proc. IEEE 28th Int. Conf. Data Eng., 2010,pp. 365–376.

[16] D. Prasad M. and P. Deepak, “Efficient reverse skyline retrievalwith arbitrary non-metric similarity measures,” in Proc. 14th Int.Conf. Extending Database Technol., 2011, pp. 319–330.

[17] L. Chen, X. Lin, H. Hu, C. S. Jensen, and J. Xu, “Answering why-not questions on spatial keyword top-k queries,” in Proc. IEEE31st Int. Conf. Data Eng. , 2015, pp. 279–290.

[18] Y. Gao, Q. Liu, G. Chen, B. Zheng, and L. Zhou, “Answering why-not questions on reverse top-k queries,” in Proc. VLDB Endowment,vol. 8, no. 7, pp. 738–749, 2015.

[19] M. S. Islam, C. Liu, and J. Li, “Efficient answering of why-notquestions in similar graph matching,” IEEE Trans. Knowl. DataEng., vol. 27, no. 10, pp. 2672–2686, Oct. 2015.

[20] D. Mottin, A. Marascu, S. B. Roy, G. Das, T. Palpanas, and Y. Vele-grakis, “A probabilistic optimization framework for the empty-answer problem,” in Proc. VLDB Endowment, vol. 6, no. 14,pp. 1762–1773, 2013.

[21] C. Mishra and N. Koudas, “Interactive query refinement,” in Proc.12th Int. Conf. Extending Database Technol.: Adv. Database Technol.,2009, pp. 862–873.

[22] M. S. Islam, C. Liu, and R. Zhou, “A framework for query refine-ment with user feedback,” J. Syst. Softw., vol. 86, no. 6, pp. 1580–1595, 2013.

[23] M. S. Islam, C. Liu, and R. Zhou, “Flexiq: A flexible interactivequerying framework by exploiting the skyline operator,” J. Syst.Softw., vol. 97, pp. 97–117, 2014.

[24] J. Cheney, L. Chiticariu, andW.-C. Tan, “Provenance in databases:Why, how, and where,” Found. Trends Databases, vol. 1, no. 4,pp. 379–474, 2009.

[25] M. Interlandi, K. Shah, S. D. Tetali, M. A. Gulzar, S. Yoo, M. Kim,T. Millstein, and T. Condie, “Titian: Data provenance support inspark,” in Proc. VLDB Endowment, 2015, pp. 216–227, vol. 9, no. 3.

[26] R. Faginm, A. Lotem, and M. Naor, “Optimal aggregation algo-rithm for middleware,” J. Comput. Syst. Sci., vol. 64, no. 4, pp. 614–656, 2003.

[27] I. F. Ilyas, W. G. Aref, and A. K. Elmagarmid, “Supporting top-kjoin queries in relational databases,” VLDB J., vol. 13, no. 3,pp. 207–221, 2004.

[28] C. Li, K. Chen-Chuan Chang, and I. F. Ilyas, “Supporting ad-hocranking aggregates,” in Proc. ACM SIGMOD Int. Conf. Manage.Data, 2006, pp. 61–72.

Wenjian Xu received the bachelor’s degree in2010 from Northwestern Polytechnical University,China. He is currently working toward the PhDdegree in the Department of Computing, HongKong Polytechnic University, Hung Hom, HongKong. His research interests include query proc-essing and cloud computing.

Zhian He received the PhD degree in 2015 fromthe Hong Kong Polytechnic University, Hung Hom,Hong Kong. He is currently a post-doc researcherin the Department of Computing Science, the Uni-versity of Hong Kong. His research focuses on bigdata analytics and cloud computing.

Eric Lo received the PhD degree in 2007 fromETH Zurich, Switzerland. He is currently an associ-ate professor in the Department of Computing,Hong Kong Polytechnic University, Hung Hom,Hong Kong. His research interests include big dataanalytic, cloud computing, and any algorithmic andsystem aspect with respect tomodern hardware.

Chi-Yin Chow received the MS and PhDdegrees from the University of Minnesota-TwinCities, Minneapolis, MN, USA, in 2008 and 2010,respectively. He is currently an assistant profes-sor in the Department of Computer Science, CityUniversity of Hong Kong. His research interestsinclude spatio-temporal data management andanalysis, GIS, mobile computing, and location-based services. He is the co-founder and co-organizer of ACM SIGSPATIAL MobiGIS 2012,2013, and 2014. He is a member of the IEEE.

" For more information on this or any other computing topic,please visit our Digital Library at www.computer.org/publications/dlib.

XU ET AL.: EXPLAINING MISSING ANSWERS TO TOP-K SQL QUERIES 15