Answering Approximate Queries Efficiently
Transcript of Answering Approximate Queries Efficiently
Chen Li, Department of Computer Science
Joint work with Liang Jin, Nick Koudas, Anthony Tung, and Rares Vernica
Answering Approximate Queries Efficiently
2
30,000-Foot View of Info Systems
[Diagram] Query → Data Repository (RDBMS, search engines, etc.) → Answers matching the conditions
3
Example: a movie database
Star | Title | Year | Genre
Keanu Reeves | The Matrix | 1999 | Sci-Fi
Samuel Jackson | Star Wars: Episode III - Revenge of the Sith | 2005 | Sci-Fi
Schwarzenegger | The Terminator | 1984 | Sci-Fi
Samuel Jackson | Goodfellas | 1990 | Drama
… | … | … | …
Find movies starring Samuel Jackson.
4
How about our governor: Schwarrzenger?
(the same movie table as above)
The user doesn’t know the exact spelling!
5
Relaxing Conditions
(the same movie table as above)
Find movies with a star “similar to” Schwarrzenger.
6
In general: Gap between Queries and Facts
• Errors in the query:
– The user doesn’t remember a string exactly
– The user unintentionally types a wrong string
[Figure] Relation R (Star: Samuel Jackson, Schwarzenegger, Keanu Reeves, …) vs. Relation S (Star: Samuel L. Jackson, Schwarzenegger, Keanu Reeves, …): the same person appears as “Samuel Jackson” in R and as “Samuel L. Jackson” in S.
• Errors in the database:
– Data often is not clean by itself
– Especially true in data integration and cleansing
7
“Did you mean…?” features in Search Engines
8
What if we don’t want the user to change the query?
Answering Queries Approximately
[Diagram] Query → Data Repository (RDBMS, search engines, etc.) → Answers matching the conditions approximately
9
Technical Challenges
• How to relax conditions?
– Name: “Schwarzenegger” vs. “Schwarrzenger”
– Salary: “in [50K,60K]” vs. “in [49K,63K]”
• How to answer queries efficiently?
– Index structures
– Selectivity estimation
See our three recent VLDB papers
10
Rest of the talk
• Selectivity estimation of fuzzy predicates
• Our approach: SEPIA
• Construction and maintenance of SEPIA
• Experiments
• Other works
11
Queries with Fuzzy String Predicates
• Stars: name similar to “Schwarrzenger”
• Employees: SSN similar to “430-87-7294”
• Customers: telephone number similar to “412-0964”
• “Similar to”:
– a domain-specific function
– returns a similarity value between two strings
• Examples:
– Edit distance: ed(Schwarrzenger, Schwarzenegger) = 3
– Cosine similarity
– Jaccard coefficient
– Soundex
– …
12
Example Similarity Function: Edit Distance
• A widely used metric to define string similarity
• ed(s1, s2) = minimum # of operations (insertion, deletion, substitution) to change s1 into s2
• Example: s1 = “Tom Hanks”, s2 = “Ton Hank”, ed(s1, s2) = 2
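The definition above is computed with the classic dynamic program. A minimal sketch (the function name `edit_distance` is ours, not from the talk):

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to turn s1 into s2 (classic DP, O(|s1| * |s2|) time)."""
    n = len(s2)
    prev = list(range(n + 1))          # distances from "" to prefixes of s2
    for i, c1 in enumerate(s1, 1):
        curr = [i]                     # distance from s1[:i] to ""
        for j, c2 in enumerate(s2, 1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # delete c1
                            curr[j - 1] + 1,      # insert c2
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[n]

print(edit_distance("Tom Hanks", "Ton Hank"))  # 2, as on the slide
```

Only two rows of the DP matrix are kept, since each cell depends only on the current and previous row.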
13
Selectivity of Fuzzy Predicates
star SIMILARTO ’Schwarrzenger’
• Selectivity: # of records satisfying the predicate
(the same movie table as above)
14
Selectivity Estimation: Problem Formulation
• Given: a bag of strings
• Input: a fuzzy string predicate P(q, δ), e.g., star SIMILARTO ’Schwarrzenger’
• Output: # of strings s that satisfy dist(s, q) <= δ
15
Why Selectivity Estimation?
SELECT *
FROM Movies
WHERE star SIMILARTO ’Schwarrzenger’
AND year BETWEEN 1980 AND 1989;
Movies: (the same movie table as above)
SELECT *
FROM Movies
WHERE star SIMILARTO ’Schwarrzenger’
AND year BETWEEN 1970 AND 1971;
The optimizer needs to know the selectivity of a predicate to decide a good plan.
16
Using traditional histograms?
• No “nice” order for strings
• Lexicographical order?
– Similar strings could be far from each other: Kammy/Cammy
– Adjacent strings have different selectivities: Cathy/Catherine
17
Outline
• Selectivity estimation of fuzzy predicates
• Our approach: SEPIA
– Overview
– Proximity between strings
– Estimation algorithm
• Construction and maintenance of SEPIA
• Experiments
• Other works
18
Our approach: SEPIA
Selectivity Estimation of Approximate Predicates
[Figure] Intuition: a cluster with pivot p and member string s; for a query string q, v1 is the edit vector between q and p, and v2 the edit vector between p and s. Given (v1, v2), we know the distribution of ed(q, s): e.g., cumulative probabilities of 10%, 28%, 44%, and 100% at distances 1 through 4.
19
Proximity between Strings
[Figure] Query string “lukas” against a cluster with pivot “lucia” containing “lucas” and “luciano”; the members lie at edit distances 2 and 3 from the pivot.
Edit distance? Not discriminative enough.
20
Edit Vector from s1 to s2
• A vector <I, D, S> in a minimum-cost sequence of edit operations:
– I: # of insertions
– D: # of deletions
– S: # of substitutions
• Easily computable
• Not symmetric
• Not unique, but tends to be (for ed <= 3, 91% of vectors are unique)
Example: from “lucia” to “lucas” the edit vector is <1,1,0>; from “lucia” to “luciano” it is <2,0,0>.
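One way to compute an edit vector is to run the same dynamic program as for edit distance and backtrace one optimal path, counting operation types. A sketch (naming is ours); note the result is just one minimum-cost choice, matching the non-uniqueness mentioned above:

```python
def edit_vector(s1: str, s2: str):
    """One minimum-cost edit vector <I, D, S> from s1 to s2:
    counts of insertions, deletions, and substitutions in one
    optimal edit script (not unique in general)."""
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
    # Backtrace one optimal path, counting operation types.
    ins = dele = sub = 0
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and \
           d[i][j] == d[i - 1][j - 1] + (0 if s1[i - 1] == s2[j - 1] else 1):
            if s1[i - 1] != s2[j - 1]:
                sub += 1              # substitution
            i, j = i - 1, j - 1
        elif j > 0 and d[i][j] == d[i][j - 1] + 1:
            ins += 1                  # insert s2[j-1] into s1
            j -= 1
        else:
            dele += 1                 # delete s1[i-1]
            i -= 1
    return (ins, dele, sub)

print(edit_vector("lucia", "luciano"))  # (2, 0, 0): insert 'n' and 'o'
```

The components of the returned vector always sum to the edit distance.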
21
Why Edit Vector?
More discriminative:
[Figure] In the cluster with pivot “lucia”, edit vectors separate “lucas” (<1,1,0>), “lukas” (<1,1,1>), and “luciano” (<2,0,0>), even when their edit distances from the pivot are close.
22
SEPIA histograms: Overview
[Figure] SEPIA keeps:
• k clusters with pivots p1 … pk, each with a frequency table mapping each edit vector (from the pivot) to the number of strings with that vector, e.g.:

Edit Vector | # of Strings
<0,0,0> | 4
<0,0,1> | 12
<0,1,0> | 7
... | ...

• one global PPD table with rows (Vector v1, Vector v2, Edit Distance, Percentage (%), Count), e.g., for v1 = <1,1,0>, v2 = <1,1,1>:

Edit Distance | Percentage (%) | Count
2 | 32 | 8
3 | 76 | 19
4 | 88 | 22
5 | 100 | 25
23
Frequency table for each cluster
Edit Vector | # of Strings
<0,0,0> | 4
<0,0,1> | 12
<0,1,0> | 7
... | ...

Cluster i has pivot pi; the entry <0,1,0> → 7 means 7 strings in the cluster have edit vector <0,1,0> from pi.
24
Global PPD Table
Proximity Pair Distribution table
A PPD entry (v1, v2, d) gives the percentage of sampled (query, pivot, string) triples with vectors (v1, v2) that satisfy ed(q, s) <= d, together with a count.

Example rows for v1 = <1,1,0>, v2 = <1,0,1>:

Edit Distance | Percentage (%) | Count
1 | 30 | 9
2 | 60 | 18
3 | 100 | 30

[Figure] Given v1 between query q and pivot p, and v2 between p and string s, the probability that ed(q, s) <= d is 30% for d = 1, 60% for d = 2, and 100% for d = 3.
25
SEPIA histograms: summary
SEPIA = k clusters, each with a pivot (p1 … pk) and a frequency table (edit vector → # of strings), plus one global PPD table (v1, v2, edit distance → percentage, count), as illustrated on the previous slides.
26
Selectivity Estimation: ed(lukas, 2)
• Compute v1, the edit vector from the query to each pivot; e.g., from “lukas” to pivot “lucia” it is <1,1,1>.
• For each vector v2 in cluster i’s frequency table (e.g., <0,1,0> with 40 strings), look up (v1, v2, d = 2) in the global PPD table to get a percentage, e.g., 76%.
• Expected contribution of that group: 76% * 40.
• Do this for all v2 vectors in each cluster, for all clusters, and take the sum of these contributions.
27
Selectivity Estimation for ed(q,d)
• For each cluster Ci with pivot pi:
– Compute v1, the edit vector from q to pi
– For each v2 (with N strings) in the frequency table of Ci:
• Use (v1, v2, d) to look up the percentage f in the global PPD table
• Expected contribution: f * N
• Take the sum of these f * N contributions
• Pruning is possible (triangle inequality)
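Assuming the frequency tables and the PPD are held in memory as plain dictionaries (a representation we chose for illustration; the names `estimate_selectivity`, `freq_tables`, and `ppd` are ours), the loop above looks like:

```python
# Hypothetical in-memory versions of SEPIA's structures:
# - each frequency table: {v2 edit vector -> number of strings N}
# - ppd: {(v1, v2, d) -> percentage of sampled pairs with ed(q, s) <= d}

def estimate_selectivity(query_vectors, freq_tables, ppd, d):
    """Estimate # of strings s with ed(q, s) <= d.
    query_vectors[i] is v1, the edit vector from q to cluster i's pivot."""
    total = 0.0
    for v1, freq_table in zip(query_vectors, freq_tables):
        for v2, n_strings in freq_table.items():
            f = ppd.get((v1, v2, d), 0.0)    # percentage; 0 on a PPD miss
            total += f / 100.0 * n_strings   # expected contribution f * N
    return total

# The running example: v1 = <1,1,1> (lukas -> pivot lucia), one group of
# 40 strings with v2 = <0,1,0>, and PPD((v1, v2, 2)) = 76%.
ppd = {((1, 1, 1), (0, 1, 0), 2): 76.0}
freq_tables = [{(0, 1, 0): 40}]
print(estimate_selectivity([(1, 1, 1)], freq_tables, ppd, 2))  # ~30.4
```

Pruning by the triangle inequality would skip (v1, v2) pairs whose vector sums make ed(q, s) <= d impossible; that check is omitted here.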
28
Outline
• Selectivity estimation of fuzzy predicates
• Our approach: SEPIA
– Overview
– Proximity between strings
– Estimation algorithm
• Construction and maintenance of SEPIA
• Experiments
• Other works
29
Clustering Strings
Two example algorithms:
• Lexicographic-order based
• k-Medoids:
– Choose initial pivots
– Assign each string to its closest pivot
– Swap a pivot with another string
– Reassign the strings
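The k-Medoids steps above can be sketched as follows (a simplified version: rather than trying arbitrary swaps, each pivot is swapped with the cluster member that minimizes total intra-cluster distance; all names are ours):

```python
import random

def ed(s1, s2):
    """Standard DP edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            curr.append(min(prev[j] + 1,                 # delete c1
                            curr[j - 1] + 1,             # insert c2
                            prev[j - 1] + (c1 != c2)))   # substitute / match
        prev = curr
    return prev[-1]

def assign(strings, pivots):
    """Assign each string to its closest pivot."""
    clusters = {p: [] for p in pivots}
    for s in strings:
        clusters[min(pivots, key=lambda p: ed(s, p))].append(s)
    return clusters

def k_medoids(strings, k, iters=10, seed=0):
    """Sketch of k-Medoids clustering of strings under edit distance."""
    pivots = random.Random(seed).sample(strings, k)   # initial pivots
    for _ in range(iters):
        clusters = assign(strings, pivots)
        # Swap each pivot with the member minimizing total distance.
        new_pivots = [min(ms, key=lambda c: sum(ed(c, s) for s in ms))
                      for ms in clusters.values() if ms]
        if sorted(new_pivots) == sorted(pivots):
            break                                     # converged
        pivots = new_pivots
    return assign(strings, pivots)                    # final reassignment

clusters = k_medoids(["lucia", "lucas", "luciano", "lukas",
                      "tommy", "tomy", "thomas"], 2)
```

The result is a dict from pivot string to cluster members; real k-Medoids variants differ mainly in how candidate swaps are chosen.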
30
Number of Clusters
It affects:
• Cluster quality
– Similarity of strings within each cluster
• Costs:
– Space
– Estimation time
31
Constructing Frequency Tables
• For each cluster, group strings based on their edit vector from the pivot
• Count the frequency for each group
[Figure] Cluster i with pivot pi: strings with edit vector <0,1,0> from pi form one group.
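With an edit-vector routine (the DP-with-backtrace approach; all names here are ours), building a frequency table is just counting vectors:

```python
from collections import Counter

def edit_vector(s1, s2):
    """One minimum-cost edit vector <I, D, S> from s1 to s2."""
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
    ins = dele = sub = 0
    i, j = m, n
    while i > 0 or j > 0:   # backtrace one optimal path
        if i > 0 and j > 0 and \
           d[i][j] == d[i - 1][j - 1] + (0 if s1[i - 1] == s2[j - 1] else 1):
            if s1[i - 1] != s2[j - 1]:
                sub += 1
            i, j = i - 1, j - 1
        elif j > 0 and d[i][j] == d[i][j - 1] + 1:
            ins += 1
            j -= 1
        else:
            dele += 1
            i -= 1
    return (ins, dele, sub)

def build_frequency_table(pivot, strings):
    """Group a cluster's strings by their edit vector from the pivot."""
    return Counter(edit_vector(pivot, s) for s in strings)
```

For example, `build_frequency_table("lucia", ["lucia", "luciano"])` maps <0,0,0> and <2,0,0> to one string each.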
32
Constructing PPD Table
• Get enough samples of string triples (q, p, s): q from a collection of query strings, p and s from a set of clusters
• We propose a few sampling heuristics:
– ALL_RAND
– CLOSE_RAND
– CLOSE_LEX
– CLOSE_UNIQUE
[Figure] For each sampled triple, record v1 (between q and p), v2 (between p and s), and ed(q, s), and accumulate the distribution of ed(q, s) for each (v1, v2) pair.
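The slide names the sampling heuristics but not the bookkeeping. One way to turn samples into PPD entries, assuming each sampled triple has already been reduced to (v1, v2, ed(q, s)) (function and variable names are ours):

```python
from collections import defaultdict

def build_ppd(reduced_samples, max_d):
    """reduced_samples: iterable of (v1, v2, dist), where v1 is the edit
    vector between a sampled query q and a pivot p, v2 between p and a
    string s, and dist = ed(q, s).
    Returns {(v1, v2, d): (percentage, count)} for d = 0 .. max_d."""
    groups = defaultdict(list)
    for v1, v2, dist in reduced_samples:
        groups[(v1, v2)].append(dist)
    ppd = {}
    for (v1, v2), dists in groups.items():
        for d in range(max_d + 1):
            count = sum(1 for x in dists if x <= d)
            ppd[(v1, v2, d)] = (100.0 * count / len(dists), count)
    return ppd

# 30 sampled triples for (v1, v2) = (<1,1,0>, <1,0,1>): 9 at distance 1,
# 9 at distance 2, 12 at distance 3 -- reproducing the example PPD rows.
samples = [((1, 1, 0), (1, 0, 1), dist)
           for dist in [1] * 9 + [2] * 9 + [3] * 12]
ppd = build_ppd(samples, 3)
print(ppd[((1, 1, 0), (1, 0, 1), 2)])  # (60.0, 18)
```

Each entry is the cumulative fraction of sampled pairs within distance d, which is exactly what the estimation loop multiplies by N.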
33
Dynamic Maintenance: Frequency Table
Take insertion as an example:
[Figure] When a new string is inserted into cluster i with pivot pi, compute its edit vector from pi (e.g., <0,1,0>) and increment that vector’s count in the frequency table (e.g., 7 → 8).
34
Dynamic Maintenance: PPD
[Figure] When a new string s is added to one of the clusters used in constructing the PPD: for each retained query string q, compute v1 (between q and the pivot p), v2 (between p and s), and ed(q, s) (e.g., 2); increment the count of (v1, v2) at that distance; and adjust the affected percentage entries.
35
Improving Estimation Accuracy
• Reasons for estimation errors:
– Misses in the PPD
– Inaccurate percentage entries in the PPD
• Improvement: use sample fuzzy predicates and analyze their estimation errors

Predicate | Real | Estimate | Relative Error
P1(tommy, 2) | 500 | 750 | +50%
P2(james, 3) | 400 | 600 | +50%
P3(jordan, 2) | 600 | 600 | 0%
P4(david, 2) | 500 | 300 | -40%

[Figure] Distribution of relative errors: -40% with probability 25%, 0% with 25%, +50% with 50%.
36
Relative-Error Model
• Use the errors to build a model• Use the model to adjust initial estimation
d: distance threshold; L: query string length; IE: initial estimate
[Figure] A decision tree first branches on d (1, 2, 3), then on L (1 <= L <= 5 vs. L >= 6), then on IE (0 <= IE <= 40 vs. IE >= 41); each leaf stores an average relative error (e.g., -15%, -20%, +17%, -8%, +1%, +12%, -23%, +25%) used to adjust the initial estimate.
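The deck does not spell out the adjustment formula. Assuming a leaf stores the average relative error err = (estimate - real) / real, a natural correction is real ≈ estimate / (1 + err). A sketch with our own bucket encoding:

```python
def correct_estimate(initial, d, query_len, model):
    """Adjust an initial estimate using a relative-error model.
    model maps (d, length bucket, IE bucket) -> average relative error
    observed on sample predicates (e.g., 0.50 for +50%)."""
    length_bucket = "1<=L<=5" if query_len <= 5 else "L>=6"
    ie_bucket = "0<=IE<=40" if initial <= 40 else "IE>=41"
    err = model.get((d, length_bucket, ie_bucket), 0.0)
    return initial / (1.0 + err)  # invert err = (est - real) / real

# If predicates like P1(tommy, 2) are overestimated by +50% on average,
# an initial estimate of 750 is corrected back to 750 / 1.5 = 500.
model = {(2, "1<=L<=5", "IE>=41"): 0.50}
print(correct_estimate(750, 2, len("tommy"), model))  # 500.0
```

A miss in the model (no matching leaf) leaves the estimate unchanged.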
37
Outline
• Motivation: selectivity estimation of fuzzy predicates
• Our approach: SEPIA
– Overview
– Proximity between strings
– Estimation algorithm
• Construction and maintenance of SEPIA
• Experiments
• Other works
38
Data
• Citeseer: 71K author names; length in [2,20], avg = 12
• Movie records from the UCI KDD repository: 11K movie titles; length in [3,80], avg = 35
• Introduced duplicates: for 10% of the records; # of duplicates in [1,20], uniform
• Final data sets:
– Citeseer: 142K author names
– UCI KDD: 23K movie titles
39
Setting
• Test bed:
– PC: 2.4GHz P4, 1.2GB RAM, Windows XP
– Visual C++ compiler
• Query workload:
– Strings from the data
– Strings not in the data
– Results are similar for both
• Quality measurements:
– Relative error: (fest – freal) / freal
– Absolute relative error: |fest – freal| / freal
40
Clustering Algorithms
[Chart] k-Medoids vs. lexicographic clustering, compared on clustering time (sec), estimation time (ms), and average absolute relative error (%).
k-Medoids is better.
41
Quartile distribution of relative errors
[Chart] Quartile distribution of relative errors (%), from -100 through 0 to +100 and infinity, shown as the fraction of the workload (0 to 1). Data set 1, CLOSE_RAND, 1000 clusters.
42
Number of Clusters
43
Effectiveness of Applying Relative-Error Model
[Chart] Average absolute relative error (%), without vs. with error correction: roughly 18 → 10 for Data set 1 and 25 → 12 for Data set 2.
44
Dynamic Maintenance
45
Other work 1: Relaxing SQL queries with Selections/Joins
SELECT *
FROM Jobs J, Candidates C
WHERE J.Salary <= 95
AND J.Zipcode = C.Zipcode
AND C.WorkYear >= 5;

Jobs:
JID | Company | Zipcode | Salary
r1 | Broadcom | 92047 | 80
r2 | Intel | 93652 | 95
r3 | Microsoft | 82632 | 120
r4 | IBM | 90391 | 130
... | … | … | …

Candidates:
CID | Zipcode | ExpSalary | WorkYear
s1 | 93652 | 120 | 3
s2 | 92612 | 130 | 6
s3 | 82632 | 100 | 5
s4 | 90391 | 150 | 1
... | … | … | …
46
Query Relaxation: Skyline!
[Figure] The lattice of condition subsets to relax: {}; then R, J, S; then RJ, RS, SJ; then RSJ.
[Figure] In the (J.Salary, C.WorkYear) plane, the original conditions J.Salary <= 95 and C.WorkYear >= 5 define a corner at (95, 5); relaxed answers are the skyline points with respect to that corner.
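The deck only names the skyline idea; a minimal sketch of the Pareto-optimal filter (our formulation), where lower J.Salary and higher C.WorkYear are preferred:

```python
def dominates(a, b):
    """a dominates b if a is at least as good on both dimensions
    (salary: lower is better, work years: higher is better)
    and strictly better on at least one."""
    return (a[0] <= b[0] and a[1] >= b[1]) and (a[0] < b[0] or a[1] > b[1])

def skyline(points):
    """Points not dominated by any other point."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

# Join results (J.Salary, C.WorkYear) from the example tables:
# (r2, s1) -> (95, 3), (r3, s3) -> (120, 5), (r4, s4) -> (130, 1).
# No tuple satisfies Salary <= 95 AND WorkYear >= 5 exactly, so the
# skyline gives the best relaxed answers.
print(skyline([(95, 3), (120, 5), (130, 1)]))  # [(95, 3), (120, 5)]
```

Here (130, 1) drops out because (95, 3) beats it on both dimensions, while (95, 3) and (120, 5) are incomparable trade-offs.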
47
Other work 2: Fuzzy predicates on attributes of mixed types
SELECT *
FROM Movies
WHERE star SIMILARTO ’Schwarrzenger’
AND |year – 1977| <= 3;
Movies: (the same movie table as above)
48
Mixed-Type Predicates
• String attributes: edit distance
• Numeric attributes: absolute numeric difference
SELECT *
FROM Movies
WHERE star SIMILARTO ’Schwarrzenger’
AND |year – 1977| <= 3;
49
MAT-tree: Intuition
• One index on both attributes is more effective than two separate index structures
• Numeric attribute: B-tree• String attribute: tree-based index structure?
50
MAT-tree: Overview
• Tree-based index structure: each node has an MBR for both the numeric attribute and the string attribute
• Strings are compressed into a “compressed trie” that fits into a limited space
• The edit distance between a string and a compressed trie can be computed
• Experiments show that the MAT-tree is very efficient
[Figure] Example MAT-tree: leaf entries (Spielberg, 1946), (Hanks, 1956), (Gibson, 1956), (Hanks, 1957), (Crowe, 1964), (Robert, 1968), (DiCaprio, 1974), (Roberrts, 1977); internal nodes store year MBRs such as <1946,1956>, <1956,1957>, <1964,1968>, <1974,1977>, <1946,1957>, <1964,1977>, together with compressed tries (*) for the string attribute.
51
Conclusion
• It’s important to support answering approximate queries efficiently
• Our results so far:
– SEPIA: accurate selectivity estimation for fuzzy string predicates
– Relaxing SQL queries with selections and joins
– MAT-tree: an index structure supporting fuzzy queries with mixed-type predicates