Supporting Ranking in Queries Score-based Paradigm
Russell Greenspan, CS 411, Spring 2004
2
Supporting Ranking in Queries: Talk Outline
What
Why
How
– “Out-of-the-box” support
– “Smart” top-k processing
3
Ranking in Queries: What is ranking in queries?
A mechanism to return only the top-k results
– Closest matches to user-specified boolean criteria
– Scoring results based on user-specified predicates
    SELECT Address
    FROM HousesForSale
    ORDER BY Best(Size, Price)
Express similarity, relevance, or preference to a given query
4
What is ranking in queries? Definitions
Intuitive
– Output an ordered list of k items such that every item in the list scores higher than every item not included
Formal
– “Given retrieval size k and scoring function F, a ranked query returns a list K of k objects (i.e. |K| = k) with query scores, sorted in a descending order, such that F(t1, ..., tn) [u] > F(t1, ..., tn) [v] for all u in K and all v not in K.” [Chang, Hwang, 2002]
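The formal definition can be illustrated with a minimal Python sketch. The `houses` data and the `best` scoring function (size per unit of price) are hypothetical stand-ins for the HousesForSale table and Best(Size, Price); the talk does not define them.

```python
import heapq

def ranked_query(objects, score, k):
    """Return the k highest-scoring (score, object) pairs, best first."""
    scored = ((score(obj), obj) for obj in objects)
    # nlargest keeps only k candidates in memory instead of sorting everything
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])

# Hypothetical rows standing in for HousesForSale
houses = [
    {"address": "12 Elm St", "size": 1800, "price": 250_000},
    {"address": "4 Oak Ave", "size": 2400, "price": 300_000},
    {"address": "9 Pine Rd", "size": 1500, "price": 275_000},
    {"address": "7 Maple Ct", "size": 2000, "price": 240_000},
]

def best(house):
    # Assumed scoring function: more square footage per dollar is better
    return house["size"] / house["price"]

top2 = ranked_query(houses, best, 2)
print([house["address"] for _, house in top2])
```

Note that this still scores every object once; the techniques in the rest of the talk aim to avoid even that.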
5
What is ranking in queries? Differences from traditional queries
How does this differ from traditional queries?
– Traditional queries:
    Do not stop processing until all results are computed
    Do not focus on ranking tuples to best match the input query
– Traditional boolean queries:
    Do not return “close” matches
    Can “over” or “under” match, producing too few or too many results
6
Ranking in Queries: Why use ranking in queries?
Exact matches not required
– Often something “close enough” satisfies a user’s demands
Fuzzy matches desired
– Multimedia/image matching, where the very nature of the query does not involve an exact match
Avoid unnecessary computations
– Find the “best” answers quickly as opposed to all answers
7
Ranking in Queries: How do we execute ranked queries?
“Out-of-the-box” support
– Perform query as any other, then perform sort and return only first k rows
– Why is this bad?
    Lots of unnecessary processing
    Waste of resources in intermediate results
    If scoring function is expensive, could result in computation of unneeded scores
Can we do better?
8
How do we execute ranked queries? “Smart” Ranked Query Execution
Query Processing
– Try to achieve significant reduction in query execution time
– Use mid-query (i.e. as query executes) techniques to optimize query plan for top-k results
– Consider the minimal number of tuples necessary to return k results
Scoring Predicate
– Consider expense of scoring function in determining optimal query plan
9
“Smart” Ranked Query Execution: Two Areas of Research Focus
Top-k processing
– Reducing number of tuples considered at each intermediate step
    Assume minimal work necessary to retrieve items sorted by score (i.e. indexes on simple attributes)
Rank function
– Reducing number of calls to ranking function
    Assume rank calculation is expensive
– Implementing unusual ranking function
10
“Smart” Ranked Query Execution: Research and Techniques
Reducing number of tuples considered
– Middleware/Multimedia
    Garlic [Fagin, 1999]
    CHITRA [Nepal, Ramakrishna, 1999]
– Relational
    STOP Operator [Carey, Kossmann, 1997]
    Probabilistic [Donjerkovic, Ramakrishnan, 1999]
    Statistical [Chaudhuri, Gravano, 1999]
Reducing number of calls to ranking function
– MPro [Chang, Hwang, 2002]
Implementing unusual ranking function
– AutoRank [Agrawal, Chaudhuri, Das, Gionis, 2003]
11
“Smart” Ranked Querying (Middleware) – Garlic [Fagin, 1999]
Integrates data from different database systems or non-database data servers
– Relational Query Set vs. “Sorted List”
Example: “Return the reddest covers of Beatles albums”, i.e. (Artist = ‘Beatles’) AND (AlbumColor LIKE ‘red’), where artists are stored relationally and album colors in a multimedia database
Assign grade to each object
– Boolean grade either 0 or 1
– Fuzzy value 0<=x<=1 indicating closeness
12
Garlic [Fagin, 1999]: Rank Processing Methods
How to combine two fuzzy values to retrieve top-k objects?
– Inefficient
    Consider graded sets of all objects by color and shape
    Compute combined score for every object, then output top k objects
– Efficient
    Retrieve objects (sorted by grade) from each subsystem until there are at least k of the same objects in each set
    Compute combined score for each of these k objects
13
Garlic [Fagin, 1999]: Example Query
Example: (use combined scoring function = x * y)
Return Top 2 Color = ‘red’ AND Shape = ‘round’
Object Roundness
A .6
B .8
C .3
D .2
E .9
F .1
G .7
H .4
Object Redness
A .2
B .6
C .1
D .8
E .3
F .5
G .9
H .3
14
Garlic [Fagin, 1999]: Inefficient vs. Efficient Processing
Inefficient
– Calculate combined score for every object
– Sort by score
– Return top k objects {G, B}
Object Score
A .12
B .48
C .03
D .16
E .27
F .05
G .63
H .12
15
Garlic [Fagin, 1999]: Inefficient vs. Efficient Processing
Efficient (Fagin’s A0 algorithm)
– Consider ordered members from each set until there are k of the same object in each set
    A1 = {G(.9), D(.8), B(.6)}
    A2 = {E(.9), B(.8), G(.7)}
– Calculate combined score for each of the k objects
    G = .9 * .7 = .63
    B = .6 * .8 = .48
– Return these objects ordered by combined score {G, B}
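The efficient method can be sketched in Python on the redness and roundness grades from the example. The function name `fagin_a0` and the dictionary layout are illustrative assumptions. One deliberate difference from the slide: this sketch scores every object seen during sorted access (here G, D, B and E), not only the k objects in the intersection, which is the more conservative reading of Fagin's algorithm; on this data both readings return {G, B}.

```python
ROUNDNESS = {"A": .6, "B": .8, "C": .3, "D": .2, "E": .9, "F": .1, "G": .7, "H": .4}
REDNESS = {"A": .2, "B": .6, "C": .1, "D": .8, "E": .3, "F": .5, "G": .9, "H": .3}

def fagin_a0(grades, combine, k):
    """grades: one {object: grade} dict per subsystem; combine must be monotone."""
    # Sorted access: each subsystem hands back objects in descending grade order
    sorted_lists = [sorted(g.items(), key=lambda kv: kv[1], reverse=True)
                    for g in grades]
    seen = [set() for _ in grades]
    depth = 0
    while depth < len(sorted_lists[0]):
        for i, lst in enumerate(sorted_lists):
            seen[i].add(lst[depth][0])
        depth += 1
        # Stop once at least k objects have appeared in every list
        if len(set.intersection(*seen)) >= k:
            break
    # Random access: fetch missing grades and score every object seen so far
    candidates = set.union(*seen)
    scored = [(combine(*(g[obj] for g in grades)), obj) for obj in candidates]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:k], depth

top, depth = fagin_a0([REDNESS, ROUNDNESS], lambda x, y: x * y, k=2)
print([obj for _, obj in top], "after", depth, "sorted accesses per list")
```

Only four of the eight objects are scored here, compared with all eight in the inefficient method.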
16
Garlic [Fagin, 1999]: Conclusions
Why is this more efficient?
– Incur expense of scoring function k times, as opposed to n times (where n is the total number of items)
– Access each subsystem at least k and at most n times, as opposed to n times (again, where n is the total number of items)
17
“Smart” Ranked Querying (Middleware) – CHITRA [Nepal, Ramakrishna, 1999]
Expands on Fagin’s Garlic system by proposing a new “multi-step” processing algorithm
Experimental results show 50% improvement
18
CHITRA [Nepal, Ramakrishna, 1999]: “Multi-step” Algorithm
Consider first sorted item x from each subsystem i
Perform random access into every other subsystem to obtain other rankings of x
Add object to result set if its rank is at least the threshold grade, quit when we have k objects
– Threshold is the combined score of the grades of the items considered in each iteration
19
CHITRA [Nepal, Ramakrishna, 1999]: Example Query
Back to our example...
Return Top 2 Color = ‘red’ AND Shape = ‘round’
Object Roundness
A .6
B .8
C .3
D .2
E .9
F .1
G .7
H .4
Object Redness
A .2
B .6
C .1
D .8
E .3
F .5
G .9
H .3
20
CHITRA [Nepal, Ramakrishna, 1999]: Example Scoring Functions Results
Consider two scoring functions as examples:
– min[x, y]
– [x * y]

Using min[x, y]:
Iter. | Items | Grade | Threshold | Resultset
1 | i1 = {G(.9)}, i2 = {E(.9)} | G = min[.9, .7] = .7, E = min[.9, .3] = .3 | min[.9, .9] = .9 |
2 | i1 = {D(.8)}, i2 = {B(.8)} | D = min[.8, .2] = .2, B = min[.8, .6] = .6 | min[.8, .8] = .8 |
3 | i1 = {B(.6)}, i2 = {G(.7)} | B = min[.6, .8] = .6, G = min[.7, .9] = .7 | min[.6, .7] = .6 | {G, B}

Using [x * y]:
Iter. | Items | Grade | Threshold | Resultset
1 | i1 = {G(.9)}, i2 = {E(.9)} | G = [.9 * .7] = .63, E = [.9 * .3] = .27 | [.9 * .9] = .81 |
2 | i1 = {D(.8)}, i2 = {B(.8)} | D = [.8 * .2] = .16, B = [.8 * .6] = .48 | [.8 * .8] = .64 |
3 | i1 = {B(.6)}, i2 = {G(.7)} | B = [.6 * .8] = .48, G = [.7 * .9] = .63 | [.6 * .7] = .42 | {G, B}
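The iterations in these tables can be reproduced with a short Python sketch. It is an interpretation of the multi-step idea as described on the slides, not the authors' code: the function name, the `>=` comparison against the threshold, and the data layout are assumptions.

```python
ROUNDNESS = {"A": .6, "B": .8, "C": .3, "D": .2, "E": .9, "F": .1, "G": .7, "H": .4}
REDNESS = {"A": .2, "B": .6, "C": .1, "D": .8, "E": .3, "F": .5, "G": .9, "H": .3}

def multi_step_top_k(grades, combine, k):
    """Return (top-k object names, number of iterations used)."""
    sorted_lists = [sorted(g.items(), key=lambda kv: kv[1], reverse=True)
                    for g in grades]
    scores = {}  # object -> combined score, filled in by random access
    for depth in range(len(sorted_lists[0])):
        # Sorted access: the next item from each subsystem
        frontier = [lst[depth] for lst in sorted_lists]
        for obj, _ in frontier:
            if obj not in scores:
                # Random access into the other subsystems for this object
                scores[obj] = combine(*(g[obj] for g in grades))
        # Threshold: combined score of the grades seen in this iteration
        threshold = combine(*(grade for _, grade in frontier))
        result = sorted((o for o, s in scores.items() if s >= threshold),
                        key=lambda o: -scores[o])
        if len(result) >= k:
            return result[:k], depth + 1
    # Lists exhausted: fall back to the best of everything scored
    return sorted(scores, key=lambda o: -scores[o])[:k], len(sorted_lists[0])

print(multi_step_top_k([REDNESS, ROUNDNESS], min, 2))
print(multi_step_top_k([REDNESS, ROUNDNESS], lambda x, y: x * y, 2))
```

Both scoring functions stop in the third iteration, because that is the first point at which two scored objects reach the falling threshold.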
21
CHITRA [Nepal, Ramakrishna, 1999]: Conclusions
Why is this more efficient?
– Requires fewer accesses to each subsystem
How do we know this algorithm is correct?
– Proof by contradiction
    Assume an object z that should have been included in place of a returned object y
    If Rank(z) > Rank(y), either:
        – y must have at least one subsystem rank smaller than all subsystem ranks of z
        – z must have at least one subsystem rank greater than all subsystem ranks of y
    However, since Rank(z) < Threshold and Rank(y) >= Threshold, Rank(z) cannot be greater than Rank(y)
22
“Smart” Ranked Querying (Relational) – STOP Operator [Carey, et al, 1997]
Specifies extension to SQL-92 standard to allow limit on cardinality of result
– STOP AFTER
Return subset of results from each section of query plan
Implement with STOP operator
– STOP(N, D, E) where N is the number of desired tuples, D is the Sort Directive [asc, desc, none], and E is the Sort Expression
– Heuristically determine when and how to apply
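As a rough illustration of STOP(N, D, E), the sketch below models the operator as a Python function over an iterable of rows. The function `stop`, the EMP rows and the salary expression are hypothetical; a real query engine would implement this inside its plan executor.

```python
import heapq
from itertools import islice

def stop(rows, n, direction="none", expr=None):
    """Sketch of STOP(N, D, E): keep n rows, optionally ordered by expr."""
    if direction == "none":
        # No sort directive: pass through the first n rows and stop reading
        return list(islice(rows, n))
    if direction == "desc":
        return heapq.nlargest(n, rows, key=expr)
    if direction == "asc":
        return heapq.nsmallest(n, rows, key=expr)
    raise ValueError("direction must be 'asc', 'desc' or 'none'")

# Hypothetical EMP rows
EMP = [
    {"name": "Ann", "salary": 90},
    {"name": "Bob", "salary": 60},
    {"name": "Cy", "salary": 120},
    {"name": "Di", "salary": 75},
]

def by_salary(row):
    return row["salary"]

print([r["name"] for r in stop(EMP, 2, "desc", by_salary)])
print([r["name"] for r in stop(EMP, 2)])
```

Placing such an operator below a join is what shrinks the intermediate results; where it may safely be placed is the subject of the conservative and aggressive heuristics.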
23
STOP Operator [Carey, et al, 1997]: Example query plans
Fig a shows traditional JOIN
– Join all EMP to DEPT, sort, output top k
Fig b shows implementation of STOP operators
– Based on cardinality estimates, only 20 rows of EMP need be joined with 30 rows of DEPT to produce top-k of 10
24
STOP Operator [Carey, et al, 1997]: Conservative Heuristic
Ensures that every tuple in each intermediate result is guaranteed to generate at least one tuple of the overall query result
Advantages
– No restarts from intermediate processing returning fewer than k results
– Intermediate STOP operators take their N value from overall query k value
Disadvantages
– Only inserts STOP operators where all remaining predicates are non-reductive (cannot use with multi-way joins)
25
STOP Operator [Carey, et al, 1997]: Aggressive Heuristic
Applies STOP operator wherever it may be beneficial, thus reducing intermediate results to a greater degree
Choose N value using cardinality estimates
Requires RESTART operator when intermediate processing returns too few results
26
STOP Operator [Carey, et al, 1997]: Experimental Results
Which heuristic is better?
– Depends on cardinality, expense of processing intermediate results, accuracy of prediction, etc.
– With low expense of processing intermediate results, experimental results show aggressive overestimation the best:

Traditional | Conservative | Aggressive, Underestimate (1/10) | Aggressive, Overestimate (10)
128.3 sec | 63.9 sec | 63.1 sec | 18.5 sec
27
STOP Operator [Carey, et al, 1997]: Experimental Results
Performance vs. Traditional (“out-of-the-box”) processing shows benefits in both indexed and non-indexed situations
28
“Smart” Ranked Querying (Relational) – Probabilistic [Donjerkovic, et al, 1999]
Introduces idea of ‘selection cutoff’ to produce top k results without requiring SORT
Quantifies the risk of fewer than k results being generated using inherent database statistics
– “List the top 10 paid employees” becomes “List the employees whose salary is greater than x”, where x is determined by the distribution of employees’ salaries
29
Probabilistic [Donjerkovic, et al, 1999]: Comparison with STOP Operator
In theory, likely to be cheaper to simply ‘select’ the necessary intermediate rows using cutoff (fig b) rather than performing sort and returning top-k (fig a)
30
Probabilistic [Donjerkovic, et al, 1999]: Implementation
Leverage same statistics used by traditional query optimizer to guess cutoff
– Histograms
– Selectivity factors
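A simplified Python sketch of guessing a cutoff from a histogram follows. It is not the paper's optimizer: it always picks the safest bucket boundary rather than weighing restart risk against cost, and the function names, the salary values and the (low, high, count) bucket format are assumptions made for illustration.

```python
def choose_cutoff(histogram, k):
    """histogram: list of (low, high, count) buckets over the ranking attribute.
    Walk buckets from the highest values down until they cover k tuples."""
    remaining = k
    for low, _high, count in sorted(histogram, reverse=True):
        remaining -= count
        if remaining <= 0:
            return low
    return float("-inf")  # fewer than k tuples overall: select everything

def top_k_by_cutoff(salaries, histogram, k):
    cutoff = choose_cutoff(histogram, k)
    # The top-k request becomes a plain selection: salary >= cutoff
    survivors = [s for s in salaries if s >= cutoff]
    restarted = False
    if len(survivors) < k:
        # Statistics were stale or too optimistic: restart without a cutoff
        survivors, restarted = list(salaries), True
    return sorted(survivors, reverse=True)[:k], cutoff, restarted

# Hypothetical salaries (in thousands) and a matching equi-width histogram
salaries = [30, 42, 55, 61, 48, 75, 90, 38, 67, 83]
histogram = [(20, 40, 2), (40, 60, 3), (60, 80, 3), (80, 100, 2)]

print(top_k_by_cutoff(salaries, histogram, 3))
```

Only the five survivors above the cutoff are sorted, instead of all ten rows.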
31
Probabilistic [Donjerkovic, et al, 1999]: Performance
For simple query using no indexes (return k highest paid employees, no index on ‘Salary’ attribute), easily outperforms traditional (scan, sort, return top k)
Also provides benefit to JOIN queries due to complexity of estimating join selectivity
32
“Smart” Ranked Querying (Relational) – Statistical [Chaudhuri, Gravano, 1999]
Expansion of probabilistic model
Maps rank queries into boolean range queries
Works with a variety of scoring functions, including Min, Euclidean, and Sum
33
Statistical [Chaudhuri, Gravano, 1999]: Expansion of probabilistic model
Consider multiple levels of ‘selection cutoff’, here referred to as ‘search score’ (Sq)
– NoRestarts – score low enough to guarantee no restarts are ever needed
– Restarts – score high enough that restarts might result
– Intermediate – score between NoRestarts and Restarts
34
Statistical [Chaudhuri, Gravano, 1999]: Implementation
Determine Sq from histograms
– Choose bounding tuples in each bucket to ensure NoRestarts (fig a) or tight tuples to minimize selection but potentially require Restarts (fig b)
35
Statistical [Chaudhuri, Gravano, 1999]: Implementation
Determine relational query to retrieve all tuples that score above Sq
– Compute n-rectangle bounding such tuples
– SELECT *
  FROM R
  WHERE (a1<=A1<=b1) ... AND ... (an<=An<=bn)
Compute score for all returned tuples
Output top-k tuples with score > Sq or rerun query with lower search score
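For the Min scoring function the n-rectangle is easy to state: a tuple scores at least Sq exactly when every attribute is at least Sq. The Python sketch below uses that fact on the (redness, roundness) grades from the earlier example. The function names and the list of search scores to try are assumptions; the paper derives Sq from histograms rather than taking it as input.

```python
# (redness, roundness) grades from the running example
GRADES = {
    "A": (.2, .6), "B": (.6, .8), "C": (.1, .3), "D": (.8, .2),
    "E": (.3, .9), "F": (.5, .1), "G": (.9, .7), "H": (.3, .4),
}

def range_query(relation, sq):
    """Boolean range query: every attribute must lie in [sq, 1]."""
    return {name: attrs for name, attrs in relation.items()
            if all(a >= sq for a in attrs)}

def top_k_min(relation, k, search_scores):
    """Try each search score in turn (highest first); rerun on a shortfall.
    Returns (top-k (score, name) pairs, number of restarts)."""
    scored = []
    for restarts, sq in enumerate(search_scores):
        candidates = range_query(relation, sq)
        # Score only the tuples the range query returned
        scored = sorted(((min(attrs), name) for name, attrs in candidates.items()),
                        reverse=True)
        if len(scored) >= k:
            return scored[:k], restarts
    return scored[:k], max(len(search_scores) - 1, 0)

# An optimistic first guess (0.8) returns nothing, forcing one restart at 0.6
print(top_k_min(GRADES, 2, [0.8, 0.6, 0.0]))
```

Starting directly at a search score of 0.0 corresponds to the NoRestarts extreme: it never reruns, but the range query returns every tuple.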
36
Statistical [Chaudhuri, Gravano, 1999]: Expansion of Fagin’s model
Expands Fagin’s ideas to relational queries
– Substitute ‘search score’ query to determine top tuples for each subsystem
– Use NoRestarts strategy to ensure that expensive re-querying is avoided
37
“Smart” Ranked Querying (Rank) – MPro [Chang, Hwang, 2002]
Extends consideration of top-k querying to expensive predicates (monotonic only)
– As opposed to other work, which assumes the expense of score calculation to be minimal
Attempt to minimize the number of scores calculated
– Consider only Necessary Probes, i.e. only those calculations without which the top-k results cannot be found
38
MPro [Chang, Hwang, 2002]: Determining if probe is necessary
An object’s “ceiling score” is the highest combined score it could still reach given the predicates evaluated so far (for Min, its lowest calculated score, since no further score can raise the minimum)
If “ceiling score” falls below top-k object’s complete score, object is ruled out and no further calculations on the object need be performed
Simple Example:
– Consider scoring function like Min and top-1 results desired
– If we know object A’s combined score with respect to F(x) and F(y) is .8, and we calculate object B’s score with respect to F(x) to be .7, B’s score with respect to F(y) need not be calculated (its Min value cannot be higher than .7)
39
MPro [Chang, Hwang, 2002]: Determining all necessary probes
Only objects with ceiling scores in the top-k need be further evaluated
If objects are kept in sorted order by current ceiling scores:
– For any object u in the top-k slots, its next probe is necessary
40
MPro [Chang, Hwang, 2002]: Minimal Probes Algorithm (MPro)
Priority queue initialization
– Evaluate each object over first predicate (same as sequentially accessing objects sorted by x)
Necessary probing
– Request from queue the object with highest ceiling score
– Evaluate object over next predicate y
– Update ceiling score and reinsert into queue
Stop when at least k objects have been completely scored (and output these objects)
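The queue-driven loop can be sketched in Python as follows, reusing the redness and roundness grades from the earlier example with Min as the scoring function. This is a reading of the steps above, not the authors' implementation: the function name `mpro`, the use of 1.0 as the best possible grade for an unevaluated predicate, and the probe counter are assumptions.

```python
import heapq

ROUNDNESS = {"A": .6, "B": .8, "C": .3, "D": .2, "E": .9, "F": .1, "G": .7, "H": .4}
REDNESS = {"A": .2, "B": .6, "C": .1, "D": .8, "E": .3, "F": .5, "G": .9, "H": .3}

def mpro(objects, predicates, combine, k):
    """Return (top-k (object, score) pairs, number of predicate evaluations)."""
    probes = 0
    queue = []  # min-heap on negated ceiling score, so the best ceiling pops first

    def ceiling(scores):
        # Unevaluated predicates are assumed to reach the maximum grade 1.0
        return combine(scores + [1.0] * (len(predicates) - len(scores)))

    # Initialization: evaluate every object over the first predicate
    for obj in objects:
        scores = [predicates[0](obj)]
        probes += 1
        heapq.heappush(queue, (-ceiling(scores), obj, scores))

    results = []
    while queue and len(results) < k:
        neg_ceiling, obj, scores = heapq.heappop(queue)
        if len(scores) == len(predicates):
            # Completely scored and still on top: nothing left can beat it
            results.append((obj, -neg_ceiling))
            continue
        # Necessary probe: evaluate the next predicate for the top object only
        scores = scores + [predicates[len(scores)](obj)]
        probes += 1
        heapq.heappush(queue, (-ceiling(scores), obj, scores))
    return results, probes

results, probes = mpro(sorted(REDNESS), [REDNESS.get, ROUNDNESS.get], min, k=2)
print(results, probes)
```

On this data only three roundness probes (G, D and B) are made on top of the eight initial redness evaluations, 11 in total against 16 for scoring everything.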
41
MPro [Chang, Hwang, 2002]: Further Applications
Incremental results
– Output top k, resume processing where it left off for next k as user requests
Fuzzy joins
– Consider join predicate in same manner
Parallel processing
– Distribute necessary probes across multiple servers
– Distribute data, calculate top-n over each chunk, merge results
42
MPro [Chang, Hwang, 2002]: Experimental Results
On experimental dataset, over 96% of complete probes found to be unnecessary
Elapsed time significantly improved (see below), from 21009 to 408 seconds for k = 10
43
“Smart” Ranked Querying (Rank) – AutoRank [Agrawal, et al, 2003]
Consider ranking of relational attributes in similar way to Information Retrieval (IR)
– IDF Similarity
    Extend TF-IDF based on frequency of occurrence of attribute values
– QF Similarity
    Use database workload to determine frequency with which attributes and attribute values are referenced
    “Poor man’s relevance feedback”
– ITA
    Index-based top-k algorithm that exploits above ranking functions
44
AutoRank [Agrawal, et al, 2003]: IDF Similarity
Extend TF (term frequency)
– IR – frequency of terms in a document
– Relational – frequency of values for an attribute
Extend IDF (inverse document frequency)
– IR – total documents / documents containing term
– Relational – tuples / tuples where attribute = value
For a tuple whose attribute matches the queried value, IDF Similarity is the attribute’s IDF (for the queried value); for all other tuples it is 0
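A small Python sketch of IDF Similarity over a toy relation follows. The car rows, the function names, the logarithm applied to the ratio (the usual IR convention; the slide gives only the ratio), and summing the per-attribute similarities into one tuple score are all assumptions made for illustration.

```python
import math
from collections import Counter

def idf_table(tuples, attribute):
    """IDF of each value of one attribute: log(tuples / tuples with that value)."""
    n = len(tuples)
    counts = Counter(t[attribute] for t in tuples)
    return {value: math.log(n / freq) for value, freq in counts.items()}

def idf_similarity(row, query, idf_tables):
    """Sum, over queried attributes, the value's IDF on a match and 0 otherwise."""
    return sum(idf_tables[attr].get(value, 0.0)
               for attr, value in query.items()
               if row[attr] == value)

# Hypothetical relation
cars = [
    {"make": "Honda", "type": "sedan"},
    {"make": "Honda", "type": "sedan"},
    {"make": "Toyota", "type": "sedan"},
    {"make": "Porsche", "type": "convertible"},
]
tables = {attr: idf_table(cars, attr) for attr in ("make", "type")}

query = {"make": "Toyota", "type": "sedan"}
ranked = sorted(cars, key=lambda row: idf_similarity(row, query, tables), reverse=True)
print(ranked[0])
```

Rare values earn a higher IDF than common ones, which is exactly the behaviour QF Similarity is meant to correct when the common value is the desirable one.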
45
AutoRank [Agrawal, et al, 2003]: QF Similarity
Consider problem of IDF where desired result is also the most frequent
– Realty database where homes built in the last three years are most desired, but the few entries existing for old homes (with higher IDF) will be considered “top”
Instead, use frequency of occurrence of attribute values in executed queries to determine ranking (by examining workload)
Can extend workload analysis to draw comparative conclusions from attribute values queried together
– Assume similarity between ‘Honda’ and ‘Toyota’ if users frequently look for cars by either of these manufacturers
46
AutoRank [Agrawal, et al, 2003]: Implementation
Store approximate representations of IDF and QF values using smooth function
– Minimal storage required
– IDF and QF values can be quickly retrieved at runtime
ITA (Index-based Threshold Algorithm)
– Use available, existing indexes (B+ trees)
– Define threshold by computing best tuple in data not yet examined
– Stop processing when similarity of this tuple is no greater than similarity of lowest ranking tuple in top-k buffer
47
AutoRank [Agrawal, et al, 2003]: Experimental Results
Used large realtor database from http://homeadvisor.microsoft.com and MS SQL Server
Measured result-quality via user studies
– For each test query, asked users to identify relevant and irrelevant tuples and compared results of QF and IDF queries to users’ responses
ITA judged to be more efficient than SQL Server’s Top-k operator when indexes exist
48
Conclusions
Clearly, an exciting and worthwhile field
Research has gone in several directions, but all of it shares roots in Fagin’s and Carey’s work
Combines many areas of computer science
– Artificial Intelligence (Fuzzy Logic)
– Information Retrieval
49
The Future
Implementation in major RDBMS vendors
– Microsoft should be among the first to revamp their Top-K operator, as in-house research [Agrawal, et al, 2003] has provided a smarter, faster technique
Explore more complex ranking functions that cannot be easily mapped to range queries or used with indexes
50
References
M. J. Carey, D. Kossmann. On saying “enough already!” in SQL. SIGMOD Conference 1997: 219-230, 1997.
D. Donjerkovic, R. Ramakrishnan. Probabilistic Optimization of Top N Queries. VLDB 1999: 411-422, 1999.
R. Fagin. Combining Fuzzy Information from Multiple Systems. PODS 1996: 216-226, 1996.
S. Nepal, M. V. Ramakrishna. Query Processing Issues in Image (Multimedia) Databases. ICDE 1999: 22-29, 1999.
S. Chaudhuri, L. Gravano. Evaluating Top-k Selection Queries. VLDB 1999: 397-410, 1999.
K. C. Chang, S. Hwang. Minimal Probing: Supporting Expensive Predicates for Top-k Queries. SIGMOD Conference 2002: 346-357, 2002.
S. Agrawal, S. Chaudhuri, G. Das, A. Gionis. Automated Ranking of Database Query Results. CIDR 2003, 2003.