Answering Top-k Queries Using Views
By Gautam Das, Dimitrios Gunopulos, Nick Koudas, Dimitris Tsirogiannis
Presented by Raju Buchi, Poornima Ancha
Agenda
• Introduction
• Views
• Related Work
• Preliminaries
• Problems Discussed
• Algorithm LPTA
• View Selection Problem
• Experimental Results
Introduction
Answering Top-k Queries
• Active research topic
• Quickly retrieve the k highest-ranking tuples in the presence of
monotone ranking functions defined on the attributes of the
underlying relations
Algorithms
• Threshold Algorithm (TA) by Fagin et al.
• Independently by Güntzer et al.
• Nepal et al.
Views
Materialized Views
• A database table that contains the results of a previously asked
query. It is actually constructed and stored.
Problem Discussed
Find efficient methods of answering a query using a set of
previously defined materialized views over the database.
Why Views?
• Relevance to a variety of data management problems.
• Promise of increased performance.
• Views are materialized (incurring a space overhead) with the
hope to gain in performance for some queries.
Views
• Views do not specify any selection conditions on the attributes
they aim to rank.
• Example: (TOP-k)
Relation R
tid   X1   X2   X3
1     82    1   59
2     53   19   83
3     29    1    2
4     80   22   90
5     28    8   87
6     12   55   82
7     16   99   42
8     18   42   67
9     42    1   23
10    23   21   88

View1 (V1): top-5 query, f1=2x1+5x2
tid   Score
7     527
6     299
4     270
8     246
2     201

View2 (V2): top-3 query, f2=x2+2x3
tid   Score
6     219
4     202
10    197
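As a minimal sketch of how the two views above could be materialized, the following Python snippet scores every tuple of R and keeps the k best. The function name `materialize_view` and the dictionary layout for R are illustrative assumptions, not something the slides define.

```python
# Relation R from the example: tid -> (X1, X2, X3).
R = {
    1: (82, 1, 59), 2: (53, 19, 83), 3: (29, 1, 2), 4: (80, 22, 90),
    5: (28, 8, 87), 6: (12, 55, 82), 7: (16, 99, 42), 8: (18, 42, 67),
    9: (42, 1, 23), 10: (23, 21, 88),
}

def materialize_view(relation, weights, k):
    """Return the top-k (tid, score) pairs under a linear scoring function."""
    scored = [(tid, sum(w * x for w, x in zip(weights, t)))
              for tid, t in relation.items()]
    # Highest score first; ties broken by tid (an arbitrary choice).
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]

V1 = materialize_view(R, (2, 5, 0), 5)   # f1 = 2*x1 + 5*x2
V2 = materialize_view(R, (0, 1, 2), 3)   # f2 = x2 + 2*x3
print(V1)
print(V2)
```

The printed lists reproduce the Score columns of V1 and V2 shown above.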
Views – Example Contd…
• Given a top-2 query defined using the function f3=3x1+10x2+5x3,
we can apply a standard top-k algorithm (e.g., TA) to the data
in R and obtain the answer to the query.
• Using the views instead?
• Feasibility
• Guaranteeing an answer
• Speed of using R directly vs. using the views
Related Work
• Multimedia context: uses ordered lists
• Threshold Algorithm:
• The algorithm requires the scoring function to be monotonic,
i.e., for tuples t and u, if t[i] ≤ u[i] for all 1 ≤ i ≤ m, then ScoreQ(t) ≤ ScoreQ(u).
• TA requires that each attribute has an index mechanism that
allows all tids to be accessed in sorted order.
• A single random access is required to resolve all attributes of a tid.
• The paper focuses on additive scoring functions (monotonic),
where ScoreQ(t) = w1t[1] + w2t[2] + … + wmt[m]
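The TA requirements listed above can be illustrated with a short Python sketch. It assumes non-negative weights (so the additive score is monotonic), keeps the example relation R in memory, and builds the per-attribute sorted lists by sorting, which stands in for the index mechanism. The function name `threshold_algorithm` is hypothetical.

```python
# Relation R from the running example: tid -> (X1, X2, X3).
R = {
    1: (82, 1, 59), 2: (53, 19, 83), 3: (29, 1, 2), 4: (80, 22, 90),
    5: (28, 8, 87), 6: (12, 55, 82), 7: (16, 99, 42), 8: (18, 42, 67),
    9: (42, 1, 23), 10: (23, 21, 88),
}

def threshold_algorithm(relation, weights, k):
    """Top-k by an additive score; returns (top-k list, lock-step iterations)."""
    m = len(weights)
    # One list of tids per attribute, sorted by decreasing attribute value.
    lists = [sorted(relation, key=lambda tid, i=i: -relation[tid][i])
             for i in range(m)]
    seen = {}
    top_k = []
    for depth in range(len(relation)):
        last_values = []
        for i in range(m):
            tid = lists[i][depth]                 # sorted access on list i
            last_values.append(relation[tid][i])
            if tid not in seen:                   # one random access per new tid
                seen[tid] = sum(w * x for w, x in zip(weights, relation[tid]))
        # Best score any unseen tuple could still reach.
        threshold = sum(w * v for w, v in zip(weights, last_values))
        top_k = sorted(seen.items(), key=lambda p: (-p[1], p[0]))[:k]
        if len(top_k) == k and top_k[-1][1] >= threshold:
            return top_k, depth + 1
    return top_k, len(relation)

result, iterations = threshold_algorithm(R, (3, 10, 5), 2)
print(result, iterations)
```

For the query f3=3x1+10x2+5x3 with k=2, this sketch stops once the threshold drops to or below the second-best score in the buffer.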
Related Work
• Variants:
• TA-Sorted: lists are always accessed sequentially and no
random accesses are performed.
• PREFER [Hristidis et al.]:
• Stores multiple copies of R.
• Uses only one copy of the relation, the one closest to the
new query, to answer that query.
Ranking Queries
• Consider a relation R with m numeric attributes (X1, X2, …, Xm)
• Domi = [lbi, ubi] is the domain of the i-th attribute.
• A tuple t is viewed as a numeric vector t = (t[1], t[2], …, t[m])
• Top-k Ranking Queries in SQL-like syntax:
SELECT TOP[k] FROM R WHERE RangeQ ORDER BY ScoreQ
• Expressed as a triple Q=(ScoreQ, k, RangeQ)
• ScoreQ: Function that assigns a numeric score to any tuple ‘t’.
• RangeQ : Boolean function that defines a selection condition
for the tuples of ‘R’.
• The semantics requires that the system retrieve the k tuples
with the top scores satisfying the selection condition.
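To make the triple Q=(ScoreQ, k, RangeQ) concrete, here is a small Python sketch that evaluates such a query by brute force. The relation is the example R from the earlier slides; the function name `top_k_query` and the sample range condition X1 ≥ 50 are illustrative assumptions.

```python
# Example relation R: tid -> (X1, X2, X3).
R = {
    1: (82, 1, 59), 2: (53, 19, 83), 3: (29, 1, 2), 4: (80, 22, 90),
    5: (28, 8, 87), 6: (12, 55, 82), 7: (16, 99, 42), 8: (18, 42, 67),
    9: (42, 1, 23), 10: (23, 21, 88),
}

def top_k_query(relation, score, k, in_range=lambda t: True):
    """Evaluate Q = (score, k, in_range): filter, score, keep the k best."""
    qualifying = [(tid, score(t)) for tid, t in relation.items() if in_range(t)]
    qualifying.sort(key=lambda pair: (-pair[1], pair[0]))
    return qualifying[:k]

def f3(t):
    return 3 * t[0] + 10 * t[1] + 5 * t[2]

unrestricted = top_k_query(R, f3, 2)                          # RangeQ = *
restricted = top_k_query(R, f3, 2, in_range=lambda t: t[0] >= 50)
print(unrestricted, restricted)
```

A full scan like this is the baseline that TA and the view-based approach try to avoid.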
Ranking Views
• Materialized Ranking View (V’):
• The materialized result of a previously executed top-k
query Q’ = (ScoreQ’, k’, RangeQ’), with its tuples ordered according to the scoring function ScoreQ’.
• The corresponding materialized ranking view V’ is a set of k’ (tid,
ScoreQ’(tid)) pairs, ordered by decreasing value of ScoreQ’(tid).
Problems Discussed
• Problem 1: TOP-k QUERY ANSWERING USING VIEWS
• Given a set of views U and a query Q, obtain an answer to Q
by combining all the information conveyed by the views in U.
• SOLUTION: an algorithm named LPTA.
• Problem 2: VIEW SELECTION
• Given a collection of views V = {V1, V2, …, Vr} that includes
the base views (thus r ≥ m) and a query Q, determine the
most efficient subset U ⊆ V to execute Q on.
• Such a subset U is provided as input to LPTA.
• The goal is to identify a set of views that can provide an answer
to the query and, if possible, provide it faster
than running TA on the base set of views.
LPTA: Linear Programming Adaptation of the Threshold Algorithm
• An adaptation of the TA algorithm in the sense that it answers
top-k queries using multiple ranking views
• Requires the scoring functions of the query and the views to be
linear and additive
• Sorted access on pairs (tid, scoreQ(tid))
• Views and queries are of the form V’ = (ScoreV’, n, *) and
Q = (ScoreQ, k, *) respectively.
• Pseudo code
• Example
• General Approach
• Pseudo code
• Initialize top-k buffer to empty.
• Retrieve the tids from the views V1 and V2 in a lock-step
fashion, in the order of decreasing score.
• Retrieve corresponding tuple by random access on R.
• Compute score according to f3 and update top-k buffer to
contain largest scores.
• Check the stopping condition.
• Once the stopping condition is satisfied we will have the
results in the top-k buffer.
• Stopping Condition:
• After the d-th iteration, let the pair read from V1 be (tid1^d, s1^d),
the pair read from V2 be (tid2^d, s2^d),
and the minimum score in the top-k buffer be top-kmin.
• At this point the unseen tuples have to satisfy the following
inequalities (domain of each attribute of R = [1, 100]):
0 ≤ x1, x2, x3 ≤ 100
2x1 + 5x2 ≤ s1^d
x2 + 2x3 ≤ s2^d
• These inequalities define a convex region in 3-d space.
• unseenmax is the solution to the linear program
that maximizes the function f3=3x1+10x2+5x3 over this region.
• LPTA stops when unseenmax ≤ top-kmin.
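The pseudo code and stopping condition can be put together in a short Python sketch on the running example. Two things here are assumptions rather than part of the slides: the helper names, and the LP step, where a brute-force enumeration of the vertices of the feasible region stands in for a real linear-programming solver (adequate only for tiny problems like this one).

```python
from itertools import combinations

R = {
    1: (82, 1, 59), 2: (53, 19, 83), 3: (29, 1, 2), 4: (80, 22, 90),
    5: (28, 8, 87), 6: (12, 55, 82), 7: (16, 99, 42), 8: (18, 42, 67),
    9: (42, 1, 23), 10: (23, 21, 88),
}
F1, F2, F3 = (2, 5, 0), (0, 1, 2), (3, 10, 5)

def score(w, t):
    return sum(wi * ti for wi, ti in zip(w, t))

def materialize(w, k):
    """Ranking view: top-k (tid, score) pairs in decreasing score order."""
    return sorted(((tid, score(w, t)) for tid, t in R.items()),
                  key=lambda p: (-p[1], p[0]))[:k]

def solve(A, b):
    """Solve the square system A x = b by Gauss-Jordan; None if singular."""
    n = len(A)
    M = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        if abs(M[p][c]) < 1e-12:
            return None
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def unseen_max(query_w, view_ws, last_scores, lo=0, hi=100):
    """Maximize query_w.x subject to view_w.x <= last score and lo <= x <= hi.
    The region is a bounded polytope, so the optimum lies at a vertex."""
    n = len(query_w)
    cons = [(list(w), s) for w, s in zip(view_ws, last_scores)]
    for i in range(n):
        unit = [1 if j == i else 0 for j in range(n)]
        cons.append((unit, hi))                    # x_i <= hi
        cons.append(([-u for u in unit], -lo))     # x_i >= lo
    best = float("-inf")
    for combo in combinations(cons, n):
        x = solve([c[0] for c in combo], [c[1] for c in combo])
        if x is not None and all(score(a, x) <= s + 1e-9 for a, s in cons):
            best = max(best, score(query_w, x))
    return best

def lpta(query_w, k, views, view_ws):
    """Lock-step scan of the views; returns (top-k, iterations) or (None, depth)."""
    seen = {}
    depth = min(len(v) for v in views)
    for d in range(depth):
        last = []
        for view in views:
            tid, s = view[d]                       # sorted access on the view
            last.append(s)
            if tid not in seen:
                seen[tid] = score(query_w, R[tid])  # random access on R
        top_k = sorted(seen.items(), key=lambda p: (-p[1], p[0]))[:k]
        if len(top_k) == k and unseen_max(query_w, view_ws, last) <= top_k[-1][1]:
            return top_k, d + 1
    return None, depth   # views exhausted: no guaranteed answer from them

V1, V2 = materialize(F1, 5), materialize(F2, 3)
answer, iterations = lpta(F3, 2, [V1, V2], [F1, F2])
print(answer, iterations)
```

Returning `None` when the views run out before the stopping condition holds is one way to model the feasibility question raised earlier: a truncated view may not be able to guarantee an answer.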
• Example: (Top-k Query Answering using Views), first iteration
Query = (f3, k, *) with f3=3x1+10x2+5x3 and k = 2, answered from View1 (V1, top-5 query, f1=2x1+5x2) and View2 (V2, top-3 query, f2=x2+2x3) over R.
Read from V1: (tid 7, score 527); tuple fetched from R: 7 16 99 42
Read from V2: (tid 6, score 219); tuple fetched from R: 6 12 55 82
Remaining in V1: (6, 299), (4, 270), (8, 246), (2, 201)
Remaining in V2: (4, 202), (10, 197)
top-2 buffer: {(7, 1248), (6, 996)}
Linear programming solution with s1^d = 527 and s2^d = 219 gives
unseenmax = 1338 > top-kmin = 996, so LPTA continues.
• Example: (Top-k Query Answering using Views), second iteration
Read from V1: (tid 6, score 299); tuple in R: 6 12 55 82 (already in the buffer)
Read from V2: (tid 4, score 202); tuple fetched from R: 4 80 22 90
Scores under f3=3x1+10x2+5x3: (6, 996), (4, 910)
top-2 buffer: {(7, 1248), (6, 996)}
Linear programming solution with s1^d = 299 and s2^d = 202 gives
unseenmax = 953.5 ≤ top-kmin = 996, so LPTA stops and returns the top-2 buffer.
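The two bounds in this example can be checked by hand. As a sketch (the slides do not spell this out), note that the query function is a non-negative combination of the two view functions, so the view constraints alone bound f3 from above, and in both iterations a point inside the attribute domain attains that bound:

```latex
\begin{aligned}
f_3 = 3x_1 + 10x_2 + 5x_3
    &= 1.5\,(2x_1 + 5x_2) + 2.5\,(x_2 + 2x_3)
     \;\le\; 1.5\,s_1^d + 2.5\,s_2^d \\[4pt]
d = 1:\quad 1.5 \cdot 527 + 2.5 \cdot 219 &= 790.5 + 547.5 = 1338,
     \quad\text{attained at } (x_1, x_2, x_3) = (13.5,\ 100,\ 59.5) \\
d = 2:\quad 1.5 \cdot 299 + 2.5 \cdot 202 &= 448.5 + 505 = 953.5,
     \quad\text{attained at } (x_1, x_2, x_3) = (12,\ 55,\ 73.5)
\end{aligned}
```

Since 1338 > 996 the scan continues after the first iteration, and since 953.5 ≤ 996 it stops after the second. The closed form relies on the domain bounds not cutting off the maximizing point, which holds for these two iterations but is not guaranteed in general.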
[Figure: LPTA on a two-attribute relation R(X1, X2) for a top-1 query Q. Views V1 and V2 each list five (tid, score) pairs, (tid1^1, s1^1) … (tid1^5, s1^5) and (tid2^1, s2^1) … (tid2^5, s2^5). The plot over axes X1 and X2 shows the unit square O=(0,0), P=(1,0), R=(1,1), T=(0,1), the directions of V1, V2 and Q, the first tuples read (tid1^1 and tid2^1), and the stopping condition.]
• At iteration d, LPTA reads (tid1^d, s1^d) from View1 (V1), with fV1=2x1+5x2, and (tid2^d, s2^d) from View2 (V2), with fV2=x2+2x3.
• For the query Q with fQ=3x1+10x2+5x3, the unseen tuples satisfy:
0 ≤ x1, x2, x3 ≤ 100
2x1 + 5x2 ≤ s1^d
x2 + 2x3 ≤ s2^d
• Stop when unseenmax ≤ top-kmin
[Figure: the same two-dimensional illustration for R(X1, X2) and a top-1 query Q after the second lock-step iteration. The pairs (tid1^2, s1^2) and (tid2^2, s2^2) have now been read from V1 and V2; the plot over axes X1 and X2 shows the unit square O=(0,0), P=(1,0), R=(1,1), T=(0,1), the tuples tid1^1, tid2^1, tid1^2, tid2^2, and the stopping condition.]
TA Vs. LPTA
• LPTA essentially becomes TA when the set of views U is equal to the set of base views.
• In terms of execution cost, both perform sequential as well as random accesses.
• Execution efficiency: I/O operations play a significant role; they overshadow the cost of CPU operations such as updating the top-k buffer, testing the stopping condition, and so on.
• The two kinds of access are highly correlated: every sequential access incurs a random access.
• Determining factor: if d is the number of lock-step iterations and
r is the number of views, then the running cost is O(dr).
View Selection: Conceptual Discussion
Given a collection of views Ѵ = {V1, V2, …, Vr} that includes the
base views, determine the most efficient subset U ⊆ Ѵ to
execute the query Q on.
Conceptual Discussion
• View Selection in Two Dimensions
• View Selection in Higher Dimensions
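Conceptual Discussion: View Selection in Two Dimensions
[Figure: unit square O=(0,0), P=(1,0), R=(1,1), T=(0,1) with axes X and Y, showing the view vectors V1 and V2, the query vector Q, the min top-k tuple, and the points M, A, A1, A’1, A’2, B, B’1, B’2, B2.]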
Conceptual Discussion: View Selection in Higher Dimensions
For a set of views Ѵ = {V1, V2, …, Vr} over an m-dimensional dataset and a query Q, the optimal execution of LPTA requires the use of a subset of views U ⊆ Ѵ such that |U| ≤ m.
View Selection Problem: Cost Estimation
• Compute histograms representing the distribution of scores
along each view in U.
• Estimate top-kmin from Hq by determining the bucket that
contains the k-th highest tuple.
• “Walk down” these histograms until the stopping condition
is reached.
• Check the stopping condition by linear programming.
• When unseenmax ≤ top-kmin, perform a logarithmic search
within the last bucket.
• Number of sorted accesses: ((d-1)n/b + n’)r’.
• Running time of the algorithm is O((d-1) + log n’).
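The access-count formula above is plain arithmetic, sketched here in Python. The slide does not define its symbols, so the reading used below is an assumption: d is the number of histogram buckets walked, n the number of tuples per view, b the number of buckets per histogram, n’ the number of tuples accessed inside the last bucket, and r’ the number of views in U. The sample numbers are made up.

```python
def estimated_sorted_accesses(d, n, b, n_last, r_views):
    """((d - 1) * n / b + n') * r', with the symbol meanings assumed above."""
    full_buckets = (d - 1) * n / b      # tuples in the buckets fully walked
    return (full_buckets + n_last) * r_views

# Hypothetical inputs: 3 buckets walked, 1000 tuples per view in 10 buckets,
# 25 tuples read in the last bucket, 2 views.
estimate = estimated_sorted_accesses(d=3, n=1000, b=10, n_last=25, r_views=2)
print(estimate)
```

Under this reading, each fully walked bucket contributes n/b sorted accesses per view, and only the last bucket needs the finer search.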
Select Views(Q,V)
• Start with MinCost = ∞, MinCurCost = ∞ and U = { }; consider each V ∈ Ѵ − U.
• Compare the cost estimate for V with MinCurCost:
if EstimateCost < MinCurCost, V becomes MinV.
• MinCurCost is then set to the EstimateCost of V.
• The above steps are followed ∀ V.
• When MinCurCost < MinCost, MinV is added to U.
• This is repeated for each of the m attributes considered.
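The steps above read like a greedy loop, and a minimal Python sketch of that reading follows. Several details are assumptions: the cost estimator is passed in as a callable (`estimate_cost`, hypothetical), the loop stops as soon as no view improves the cost, and the cost table in the usage example is invented purely for illustration.

```python
import math

def select_views(views, m, estimate_cost):
    """Greedy selection: in each of up to m rounds, add the view that gives
    the lowest estimated cost, as long as it improves on the current cost."""
    chosen = []
    min_cost = math.inf
    for _ in range(m):
        min_cur_cost, min_view = math.inf, None
        for v in views:
            if v in chosen:
                continue
            cost = estimate_cost(chosen + [v])
            if cost < min_cur_cost:
                min_cur_cost, min_view = cost, v
        if min_view is None or min_cur_cost >= min_cost:
            break                       # assumed: stop when nothing improves
        chosen.append(min_view)
        min_cost = min_cur_cost
    return chosen, min_cost

# Invented cost table keyed by the set of views used.
TOY_COSTS = {
    frozenset({"V1"}): 50, frozenset({"V2"}): 60, frozenset({"X1"}): 90,
    frozenset({"V1", "V2"}): 20, frozenset({"V1", "X1"}): 45,
    frozenset({"V1", "V2", "X1"}): 30,
}

def toy_cost(subset):
    return TOY_COSTS.get(frozenset(subset), math.inf)

selected, cost = select_views(["V1", "V2", "X1"], 3, toy_cost)
print(selected, cost)
```

With this table the loop picks V1, then V2, and stops because adding X1 would raise the estimated cost.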
View Selection Algorithms
Select Views(Q,V) / Exhaustive: estimates the cost of all possible C(r, p) (“r choose p”) subsets of V and selects the one with minimum cost.
Simple Greedy Heuristic: iterates over the set of views and selects the one that reduces the total cost by the greatest amount.
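For contrast with the greedy heuristic, the exhaustive variant can be sketched with `itertools.combinations`. The sketch assumes subset sizes p from 1 up to a caller-supplied maximum and takes the cost estimator as a callable; the cost table is invented so that the cheapest subset is one a greedy search would miss.

```python
import math
from itertools import combinations

def select_views_exhaustive(views, max_size, estimate_cost):
    """Try every subset of size 1..max_size and keep the cheapest one."""
    best, best_cost = None, math.inf
    for p in range(1, max_size + 1):
        for subset in combinations(views, p):   # C(r, p) subsets of size p
            cost = estimate_cost(subset)
            if cost < best_cost:
                best, best_cost = subset, cost
    return best, best_cost

# Invented costs: {V2, X1} is cheapest, although V1 is the best single view.
TOY_COSTS = {
    frozenset({"V1"}): 50, frozenset({"V2"}): 60, frozenset({"X1"}): 90,
    frozenset({"V1", "V2"}): 20, frozenset({"V1", "X1"}): 45,
    frozenset({"V2", "X1"}): 10, frozenset({"V1", "V2", "X1"}): 30,
}

def toy_cost(subset):
    return TOY_COSTS.get(frozenset(subset), math.inf)

best_subset, best_cost = select_views_exhaustive(["V1", "V2", "X1"], 3, toy_cost)
print(best_subset, best_cost)
```

The number of cost estimates grows with the number of subsets, which is the price paid for never missing the minimum.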
View Selection Algorithms
Select Views Spherical(Q,V): has to solve the linear program just once and is very effective for highly restrictive data sets.
Select View By Angles: sorts the view vectors by increasing angle with the query vector and returns the top-m views.
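The angle-based heuristic is simple enough to sketch directly. The Python below assumes each view is represented by the weight vector of its scoring function, with base views represented as unit vectors; the function name and the dictionary of views are illustrative, using the scoring functions from the running example.

```python
import math

def select_views_by_angle(query_w, views, m):
    """Return the names of the m views whose weight vectors form the
    smallest angles with the query's weight vector."""
    def angle(w):
        dot = sum(a * b for a, b in zip(query_w, w))
        norms = math.sqrt(sum(a * a for a in query_w)) * math.sqrt(sum(b * b for b in w))
        # Clamp to [-1, 1] to guard against floating-point drift before acos.
        return math.acos(max(-1.0, min(1.0, dot / norms)))
    return sorted(views, key=lambda name: angle(views[name]))[:m]

views = {
    "V1": (2, 5, 0),   # f1 = 2*x1 + 5*x2
    "V2": (0, 1, 2),   # f2 = x2 + 2*x3
    "X1": (1, 0, 0),   # base views, one per attribute
    "X2": (0, 1, 0),
    "X3": (0, 0, 1),
}
chosen = select_views_by_angle((3, 10, 5), views, 3)   # query f3
print(chosen)
```

No cost estimation or linear program is needed here, which keeps the selection cheap at the expense of ignoring the data distribution.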
More General Queries & Views
Views that Only Materialize their Top-k Tuples
• Truncate the histograms
Accommodating Range Conditions
• Select the views that cover the range conditions.
• Truncate each attribute’s histogram
Performance Evaluation: Experimental Results
[Figure: real data, performance comparison of PREFER, LPTA, and TA; panels (2d) and (3d).]
References
• Answering Top-k Queries Using Views: Gautam Das, Dimitrios Gunopulos, Nick Koudas
• aitrc.kaist.ac.kr/~vldb06/slides/R13-1.ppt
Thank you. Questions?