Answering Approximate Queries Efficiently

Chen Li, Department of Computer Science. Joint work with Liang Jin, Nick Koudas, Anthony Tung, and Rares Vernica.


Transcript of Answering Approximate Queries Efficiently

Page 1: Answering Approximate Queries Efficiently

Chen Li, Department of Computer Science

Joint work with Liang Jin, Nick Koudas, Anthony Tung, and Rares Vernica

Answering Approximate Queries Efficiently

Page 2: Answering Approximate Queries Efficiently

2

30,000-Foot View of Info Systems

[Diagram: Query -> Data Repository (RDBMS, Search Engines, etc.) -> Answers matching conditions]

Page 3: Answering Approximate Queries Efficiently

3

Example: a movie database

Star | Title | Year | Genre
Keanu Reeves | The Matrix | 1999 | Sci-Fi
Samuel Jackson | Star Wars: Episode III - Revenge of the Sith | 2005 | Sci-Fi
Schwarzenegger | The Terminator | 1984 | Sci-Fi
Samuel Jackson | Goodfellas | 1990 | Drama
… | … | … | …

Tom: Find movies starring Samuel Jackson

Page 4: Answering Approximate Queries Efficiently

4

How about our governor: Schwarrzenger?

Star | Title | Year | Genre
Keanu Reeves | The Matrix | 1999 | Sci-Fi
Samuel Jackson | Star Wars: Episode III - Revenge of the Sith | 2005 | Sci-Fi
Schwarzenegger | The Terminator | 1984 | Sci-Fi
Samuel Jackson | Goodfellas | 1990 | Drama
… | … | … | …

The user doesn’t know the exact spelling!

Page 5: Answering Approximate Queries Efficiently

5

Relaxing Conditions

Star | Title | Year | Genre
Keanu Reeves | The Matrix | 1999 | Sci-Fi
Samuel Jackson | Star Wars: Episode III - Revenge of the Sith | 2005 | Sci-Fi
Schwarzenegger | The Terminator | 1984 | Sci-Fi
Samuel Jackson | Goodfellas | 1990 | Drama
… | … | … | …

Find movies with a star “similar to” Schwarrzenger.

Page 6: Answering Approximate Queries Efficiently

6

In general: Gap between Queries and Facts

• Errors in the query:
– The user doesn’t remember a string exactly
– The user unintentionally types a wrong string
• Errors in the database:
– Data often is not clean by itself
– Especially true in data integration and cleansing

Relation R (Star) | Relation S (Star)
Samuel Jackson | Samuel L. Jackson
Schwarzenegger | Schwarzenegger
Samuel Jackson | Samuel L. Jackson
Keanu Reeves | Keanu Reeves

Page 7: Answering Approximate Queries Efficiently

7

“Did you mean…?” features in Search Engines

Page 8: Answering Approximate Queries Efficiently

8

What if we don’t want the user to change the query?

Answering Queries Approximately

[Diagram: Query -> Data Repository (RDBMS, Search Engines, etc.) -> Answers matching conditions approximately]

Page 9: Answering Approximate Queries Efficiently

9

Technical Challenges

• How to relax conditions?
– Name: “Schwarzenegger” vs “Schwarrzenger”
– Salary: “in [50K,60K]” vs “in [49K,63K]”
• How to answer queries efficiently?
– Index structures
– Selectivity estimation

See our three recent VLDB papers

Page 10: Answering Approximate Queries Efficiently

10

Rest of the talk

• Selectivity estimation of fuzzy predicates
• Our approach: SEPIA
• Construction and maintenance of SEPIA
• Experiments
• Other works

Page 11: Answering Approximate Queries Efficiently

11

Queries with Fuzzy String Predicates

• Stars: name similar to “Schwarrzenger”
• Employees: SSN similar to “430-87-7294”
• Customers: telephone number similar to “412-0964”
• Similar to:
– a domain-specific function
– returns a similarity value between two strings
• Examples:
– Edit distance: ed(Schwarrzenger, Schwarzenegger) = 3 (delete one “r”, insert “e” and “g”)
– Cosine similarity
– Jaccard coefficient distance
– Soundex
– …

Database

Page 12: Answering Approximate Queries Efficiently

12

Example Similarity Function: Edit Distance

• A widely used metric to define string similarity
• ed(s1,s2) = minimum # of operations (insertion, deletion, substitution) to change s1 to s2
• Example:
s1: Tom Hanks
s2: Ton Hank
ed(s1,s2) = 2
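The definition above can be sketched with the standard dynamic-programming recurrence. This is a minimal illustration, not code from the talk; the function name `edit_distance` is chosen here for the example.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning s1 into s2."""
    # prev[j] is the distance between the prefix of s1 processed so far and s2[:j].
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # turning s1[:i] into the empty string takes i deletions
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + 1,               # delete c1
                curr[j - 1] + 1,           # insert c2
                prev[j - 1] + (c1 != c2),  # match (cost 0) or substitute (cost 1)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))  # prints 2
```

The two operations in the slide's example are one substitution (m to n) and one deletion (the trailing s).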

Page 13: Answering Approximate Queries Efficiently

13

Selectivity of Fuzzy Predicates

Predicate: star SIMILARTO ’Schwarrzenger’
• Selectivity: # of records satisfying the predicate

Star | Title | Year | Genre
Keanu Reeves | The Matrix | 1999 | Sci-Fi
Samuel Jackson | Star Wars: Episode III - Revenge of the Sith | 2005 | Sci-Fi
Schwarzenegger | The Terminator | 1984 | Sci-Fi
Samuel Jackson | Goodfellas | 1990 | Drama
… | … | … | …

Page 14: Answering Approximate Queries Efficiently

14

Selectivity Estimation: Problem Formulation

A bag of strings

Input: fuzzy string predicate P(q, δ)

star SIMILARTO ’Schwarrzenger’

Output: # of strings s that satisfy dist(s,q) <= δ
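For reference, the quantity being estimated can be computed exactly by scanning the whole bag, which is what an estimator is meant to avoid. The sketch below assumes dist is edit distance; the names `exact_selectivity` and `bag` are illustrative only.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Standard dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (c1 != c2)))
        prev = curr
    return prev[-1]


def exact_selectivity(strings, q: str, delta: int) -> int:
    """Number of strings s in the bag with dist(s, q) <= delta, found by a full scan."""
    return sum(1 for s in strings if edit_distance(s, q) <= delta)


bag = ["lucia", "lucas", "luciano", "lukas", "lucia"]  # a bag: duplicates are counted
print(exact_selectivity(bag, "lucia", 2))  # prints 4
```

The scan costs one distance computation per string, so an optimizer cannot afford it at planning time.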

Page 15: Answering Approximate Queries Efficiently

15

Why Selectivity Estimation?

SELECT *
FROM Movies
WHERE star SIMILARTO 'Schwarrzenger'
AND year BETWEEN 1980 AND 1989;

SELECT *
FROM Movies
WHERE star SIMILARTO 'Schwarrzenger'
AND year BETWEEN 1970 AND 1971;

Movies:
Star | Title | Year | Genre
Keanu Reeves | The Matrix | 1999 | Sci-Fi
Samuel Jackson | Star Wars: Episode III - Revenge of the Sith | 2005 | Sci-Fi
Schwarzenegger | The Terminator | 1984 | Sci-Fi
Samuel Jackson | Goodfellas | 1990 | Drama
… | … | … | …

The optimizer needs to know the selectivity of each predicate to choose a good plan.

Page 16: Answering Approximate Queries Efficiently

16

Using traditional histograms?

• No “nice” order for strings
• Lexicographical order?
– Similar strings could be far from each other: Kammy/Cammy
– Adjacent strings have different selectivities: Cathy/Catherine

Page 17: Answering Approximate Queries Efficiently

17

Outline

• Selectivity estimation of fuzzy predicates
• Our approach: SEPIA
– Overview
– Proximity between strings
– Estimation algorithm
• Construction and maintenance of SEPIA
• Experiments
• Other works

Page 18: Answering Approximate Queries Efficiently

18

Our approach: SEPIA

Selectivity Estimation of Approximate Predicates

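Intuition

[Figure: a cluster with pivot p and member string s, and a query string q. v1 is the edit vector between q and p; v2 is the edit vector between p and s. A histogram plots probability against edit distance (axis labeled ed(p,s), ticks 1 to 4), with bar labels 10%, 28%, 44%, and 100%.]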

Page 19: Answering Approximate Queries Efficiently

19

Proximity between Strings

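[Figure: the query string lukas is at edit distance 3 from the pivot lucia; the cluster members luciano and lucas are each at edit distance 2 from the pivot.]

Edit Distance? Not discriminative enough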

Page 20: Answering Approximate Queries Efficiently

20

Edit Vector from s1 to s2

• A vector <I, D, S>
– I: # of insertions
– D: # of deletions
– S: # of substitutions
in a sequence of edit operations that realizes the edit distance from s1 to s2
– Easily computable
– Not symmetric
– Not unique, but tends to be (for ed <= 3, 91% are unique)

Examples:
lucia -> luciano: <2,0,0>
lucia -> lucas: <1,1,0>
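One way to compute an edit vector is to fill the usual edit-distance table and then walk back through it, counting each kind of operation. The sketch below is an assumption about how this could be done, not the authors' implementation. Because edit vectors are not unique, the tie-breaking order in the walk-back (match first, then insertion, deletion, substitution) is a choice made here; it happens to reproduce the vectors shown on the slides.

```python
def edit_vector(s1: str, s2: str) -> tuple[int, int, int]:
    """Return (insertions, deletions, substitutions) of one optimal edit script s1 -> s2."""
    n, m = len(s1), len(s2)
    # dist[i][j] = edit distance between s1[:i] and s2[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist[i][j] = min(dist[i - 1][j] + 1,
                             dist[i][j - 1] + 1,
                             dist[i - 1][j - 1] + (s1[i - 1] != s2[j - 1]))
    # Walk back from the bottom-right corner, counting operations.
    ins = dele = sub = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and s1[i - 1] == s2[j - 1] and dist[i][j] == dist[i - 1][j - 1]:
            i, j = i - 1, j - 1          # characters match, no operation
        elif j > 0 and dist[i][j] == dist[i][j - 1] + 1:
            ins, j = ins + 1, j - 1      # insertion of s2[j-1]
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            dele, i = dele + 1, i - 1    # deletion of s1[i-1]
        else:
            sub, i, j = sub + 1, i - 1, j - 1  # substitution
    return ins, dele, sub


print(edit_vector("lucia", "luciano"))  # prints (2, 0, 0)
print(edit_vector("luciano", "lucia"))  # prints (0, 2, 0), so the vector is not symmetric
```

The components always sum to the edit distance, whichever optimal script the walk-back picks.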

Page 21: Answering Approximate Queries Efficiently

21

Why Edit Vector?

More discriminative

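[Figure: the query string lukas has edit vector <1,1,1> to the pivot lucia. Within the cluster, lucia -> lucas is <1,1,0> and lucia -> luciano is <2,0,0>, so the two members are distinguished even though both are at edit distance 2 from the pivot.]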

Page 22: Answering Approximate Queries Efficiently

22

SEPIA histograms: Overview

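[Figure: SEPIA histogram structure. Each cluster (Cluster 1, Cluster 2, ..., Cluster k) has a pivot (p1, p2, ..., pk) and its own frequency table; all clusters share one global PPD table.]

Frequency tables (one per cluster; example entries):

Edit Vector | # of Strings
<0,0,0> | 3
<0,1,0> | 40
... | ...

Edit Vector | # of Strings
<0,0,0> | 4
<0,0,1> | 12
<0,1,0> | 7
... | ...

Edit Vector | # of Strings
<0,0,0> | 2
<1,0,2> | 84
... | ...

Global PPD Table:

Vector v1 | Vector v2 | Edit Distance | Percentage (%) | Count
<1,0,1> | <1,1,0> | 1 | 30 | 9
<1,0,1> | <1,1,0> | 2 | 60 | 18
<1,0,1> | <1,1,0> | 3 | 100 | 30
<1,1,1> | <1,1,0> | 2 | 32 | 8
<1,1,1> | <1,1,0> | 3 | 76 | 19
<1,1,1> | <1,1,0> | 4 | 88 | 22
<1,1,1> | <1,1,0> | 5 | 100 | 25
… | … | … | … | …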

Page 23: Answering Approximate Queries Efficiently

23

Frequency table for each cluster

Cluster i, pivot pi:

Edit Vector | # of Strings
<0,0,0> | 4
<0,0,1> | 12
<0,1,0> | 7
... | ...

Example: 7 strings have the edit vector <0,1,0> from pi.

Page 24: Answering Approximate Queries Efficiently

24

Global PPD Table

Proximity Pair Distribution table

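Vector v1 | Vector v2 | Edit Distance | Percentage (%) | Count
<1,0,1> | <1,1,0> | 1 | 30 | 9
<1,0,1> | <1,1,0> | 2 | 60 | 18
<1,0,1> | <1,1,0> | 3 | 100 | 30
<1,1,1> | <1,1,0> | 2 | 32 | 8
<1,1,1> | <1,1,0> | 3 | 76 | 19
<1,1,1> | <1,1,0> | 4 | 88 | 22
<1,1,1> | <1,1,0> | 5 | 100 | 25
… | … | … | … | …

[Figure: a cluster with pivot p and member string s, and a query string q. The edit vector between q and p is <1,0,1>; the edit vector between p and s is <1,1,0>. A histogram plots probability against edit distance (axis labeled ed(p,s)): 30% at 1, 60% at 2, 100% at 3.]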

Page 25: Answering Approximate Queries Efficiently

25

SEPIA histograms: summary

[Figure: summary of the structure. Per cluster (Cluster 1, Cluster 2, ..., Cluster k): a pivot (p1, p2, ..., pk) and a frequency table mapping each edit vector from the pivot to its # of strings, e.g. <0,0,0>: 4, <0,0,1>: 12, <0,1,0>: 7. Shared by all clusters: one global PPD table with columns Vector v1, Vector v2, Edit Distance, Percentage (%), Count.]

Page 26: Answering Approximate Queries Efficiently

26

Selectivity Estimation: ed(lukas, 2)

Example for one cluster and one v2:
• Cluster i: query lukas, pivot lucia, v1 = <1,1,1>
• Frequency table i: edit vector v2 = <0,1,0> has 40 strings
• Global PPD table: (v1 = <1,1,1>, v2 = <0,1,0>, edit distance 2) has percentage 76% (count 19)
• Expected contribution: 76% * 40

• Do it for all v2 vectors in each cluster, for all clusters
• Take the sum of these contributions

Page 27: Answering Approximate Queries Efficiently

27

Selectivity Estimation for ed(q,d)

• For each cluster Ci
  • For each v2 in frequency table of Ci
    • Use (v1,v2,d) to lookup PPD
• Take the sum of these f * N
• Pruning possible (triangle inequality)

[Figure: for cluster i, v1 is the edit vector between q and the pivot. Frequency table i gives N, the # of strings with edit vector v2. The global PPD table gives the percentage f for (v1, v2, d). Expected contribution: f * N.]
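The loop above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the data layout (dictionaries keyed by cluster id and by `(v1, v2, d)`), the function name `estimate_selectivity`, and the choice to treat a PPD miss as a zero contribution are all assumptions made here. Pruning by the triangle inequality is omitted.

```python
def estimate_selectivity(v1_by_cluster, freq_tables, ppd, d):
    """Sum f * N over every (cluster, v2) pair.

    v1_by_cluster: cluster id -> edit vector v1 between the query and that cluster's pivot
    freq_tables:   cluster id -> {v2: N}, the per-cluster frequency table
    ppd:           (v1, v2, d) -> f, the fraction stored in the global PPD table
    d:             edit-distance threshold of the fuzzy predicate
    """
    total = 0.0
    for cluster_id, freq_table in freq_tables.items():
        v1 = v1_by_cluster[cluster_id]
        for v2, n in freq_table.items():
            f = ppd.get((v1, v2, d), 0.0)  # assumption: a missing entry contributes nothing
            total += f * n
    return total


# Numbers from the ed(lukas, 2) example: v1 = <1,1,1>, v2 = <0,1,0>, f = 76%, N = 40.
v1_by_cluster = {"i": (1, 1, 1)}
freq_tables = {"i": {(0, 1, 0): 40}}
ppd = {((1, 1, 1), (0, 1, 0), 2): 0.76}
estimate = estimate_selectivity(v1_by_cluster, freq_tables, ppd, d=2)
print(round(estimate, 1))  # prints 30.4
```

The cost is one table lookup per frequency-table row, independent of how many strings each row represents.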

Page 28: Answering Approximate Queries Efficiently

28

Outline

• Selectivity estimation of fuzzy predicates
• Our approach: SEPIA
– Overview
– Proximity between strings
– Estimation algorithm
• Construction and maintenance of SEPIA
• Experiments
• Other works

Page 29: Answering Approximate Queries Efficiently

29

Clustering Strings

Two example algorithms:
• Lexicographic order based
• K-Medoids
– Choose initial pivots
– Assign each string to its closest pivot
– Swap a pivot with another string
– Reassign the strings
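The K-Medoids steps listed above can be sketched as a simple swap-based search. This is a generic textbook-style variant written for illustration, not the procedure used in the talk: the initial pivots (the first k strings), the greedy acceptance of any improving swap, and the `max_rounds` cap are all assumptions.

```python
def edit_distance(s1, s2):
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (c1 != c2)))
        prev = curr
    return prev[-1]


def k_medoids(strings, k, max_rounds=10):
    """Cluster strings around k pivots; returns {pivot: [member strings]}."""
    pivots = list(strings[:k])  # assumption: the first k strings are the initial pivots

    def total_cost(candidate_pivots):
        # Each string is charged its distance to the closest pivot.
        return sum(min(edit_distance(s, p) for p in candidate_pivots) for s in strings)

    best = total_cost(pivots)
    for _ in range(max_rounds):
        improved = False
        for i in range(k):
            for cand in strings:
                if cand in pivots:
                    continue
                # Swap pivot i with a non-pivot string and keep the swap if it helps.
                trial = pivots[:i] + [cand] + pivots[i + 1:]
                cost = total_cost(trial)
                if cost < best:
                    pivots, best, improved = trial, cost, True
        if not improved:
            break
    # Final assignment of every string to its closest pivot.
    clusters = {p: [] for p in pivots}
    for s in strings:
        clusters[min(pivots, key=lambda p: edit_distance(s, p))].append(s)
    return clusters


clusters = k_medoids(["lucia", "lucas", "luciano", "tom", "tim", "tam"], k=2)
for pivot, members in clusters.items():
    print(pivot, members)
```

Every candidate swap recomputes the total cost from scratch, so this form is only practical for small inputs.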

Page 30: Answering Approximate Queries Efficiently

30

Number of Clusters

It affects:
• Cluster quality
– Similarity of strings within each cluster
• Costs:
– Space
– Estimation time

Page 31: Answering Approximate Queries Efficiently

31

Constructing Frequency Tables

• For each cluster, group strings based on their edit vector from the pivot

• Count the frequency for each group

[Figure: Cluster i with pivot pi; member strings are labeled with their edit vectors from the pivot, e.g. <0,1,0>.]
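Grouping and counting can be sketched with a counter keyed by edit vector. The function name `build_frequency_table` is hypothetical, and the edit-vector routine is passed in as a parameter; in the usage example it is replaced by a small lookup of vectors quoted on the slides, purely to keep the sketch short.

```python
from collections import Counter


def build_frequency_table(pivot, members, edit_vector):
    """Group a cluster's strings by their edit vector from the pivot and count each group.

    edit_vector is any function (pivot, string) -> (I, D, S).
    """
    return Counter(edit_vector(pivot, s) for s in members)


# Stand-in for a real edit-vector routine: vectors from the pivot lucia as shown on the slides.
KNOWN_VECTORS = {
    ("lucia", "lucia"): (0, 0, 0),
    ("lucia", "lucas"): (1, 1, 0),
    ("lucia", "luciano"): (2, 0, 0),
}

table = build_frequency_table(
    "lucia",
    ["lucia", "lucas", "luciano", "lucas"],
    lambda p, s: KNOWN_VECTORS[(p, s)],
)
print(table[(1, 1, 0)])  # prints 2
```

Under this layout, inserting a new string later only needs `table[vector] += 1` for the vector of the new string.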

Page 32: Answering Approximate Queries Efficiently

32

Constructing PPD Table

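• Get enough samples of string triplets (q,p,s)
• Propose a few heuristics
– ALL_RAND
– CLOSE_RAND
– CLOSE_LEX
– CLOSE_UNIQUE

[Figure: q is drawn from a collection of q strings; the pivot p and string s come from a set of clusters. v1 is the edit vector between q and p; v2 is the edit vector between p and s. A histogram plots probability against edit distance (axis labeled ed(p,s), ticks 1 to 4), with bar labels 10%, 28%, 44%, and 100%.]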
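Once triplets have been sampled and their vectors computed, the aggregation into the PPD can be sketched as below. This is an assumption about the bookkeeping, consistent with the example table where each percentage equals the cumulative count divided by the total for that (v1, v2) pair. The sampling heuristics themselves are not shown, and the name `build_ppd` and the input format are hypothetical.

```python
from collections import Counter, defaultdict


def build_ppd(observations):
    """Aggregate sampled triplets into cumulative PPD entries.

    observations: iterable of (v1, v2, dist), where for a sampled triplet (q, p, s)
                  v1 is the edit vector between q and p, v2 the one between p and s,
                  and dist = ed(q, s).
    Returns (fractions, counts): both map (v1, v2, d) to the cumulative share / number
    of samples with that (v1, v2) whose distance is at most d.
    """
    per_pair = defaultdict(Counter)
    for v1, v2, dist in observations:
        per_pair[(v1, v2)][dist] += 1

    fractions, counts = {}, {}
    for (v1, v2), histogram in per_pair.items():
        total = sum(histogram.values())
        running = 0
        for dist in sorted(histogram):
            running += histogram[dist]
            counts[(v1, v2, dist)] = running
            fractions[(v1, v2, dist)] = running / total
    return fractions, counts


# 30 samples for one (v1, v2) pair: 9 at distance 1, 9 at distance 2, 12 at distance 3.
v1, v2 = (1, 0, 1), (1, 1, 0)
samples = [(v1, v2, 1)] * 9 + [(v1, v2, 2)] * 9 + [(v1, v2, 3)] * 12
fractions, counts = build_ppd(samples)
print(counts[(v1, v2, 2)], fractions[(v1, v2, 2)])  # prints 18 0.6
```

Storing the cumulative counts alongside the fractions is what makes later incremental adjustment possible.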

Page 33: Answering Approximate Queries Efficiently

33

Dynamic Maintenance: Frequency Table

Take insertion as an example

Cluster i, pivot pi:

Edit Vector | # of Strings
<0,0,0> | 4
<0,0,1> | 12
<0,1,0> | 7 -> 8
... | ...

A new string with edit vector <0,1,0> from pi raises that entry's count from 7 to 8.

Page 34: Answering Approximate Queries Efficiently

34

Dynamic Maintenance: PPD

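[Figure: a new string arrives in one of the clusters used in the construction of the PPD, with pivot p and ed(p,s) = 2 as labeled on the slide. For a query string q from the collection used in the construction of the PPD, v1 is the edit vector between q and p, and v2 the edit vector between p and the new string.]

Vector v1 | Vector v2 | Edit Distance | Percentage (%) | Count
v1 | v2 | 0 | 32 | 8
v1 | v2 | 1 | 76 | 19
v1 | v2 | 2 | 88 | 22
v1 | v2 | 3 | 100 | 25
… | … | … | … | …

The affected counts are incremented (+1) and the percentages adjusted.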

Page 35: Answering Approximate Queries Efficiently

35

Improving Estimation Accuracy

• Reasons for estimation errors:
– Lookup misses in the PPD
– Inaccurate percentage entries in the PPD
• Improvement: use sample fuzzy predicates to analyze their estimation errors

Predicate | Real | Estimate | Relative Error
P1(tommy, 2) | 500 | 750 | +50%
P2(james, 3) | 400 | 600 | +50%
P3(jordan, 2) | 600 | 600 | 0%
P4(david, 2) | 500 | 300 | -40%

[Histogram: probability against relative error, with bars at -40%, 0%, and +50%; labels as extracted: 25%, 50%, 25%.]

Page 36: Answering Approximate Queries Efficiently

36

Relative-Error Model

• Use the errors to build a model
• Use the model to adjust initial estimation

d: threshold; L: query string length; IE: initial estimate

[Figure: decision tree. The first level splits on d (d = 1, d = 2, d = 3, ...), the next on L (1<=L<=5 vs. L>=6), and the next on IE (0<=IE<=40 vs. IE>=41). Leaves hold relative errors; leaf labels as extracted: -15%, -20%, +17%, -8%, 1%, +12%, -23%, +25%.]
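A relative-error model of this shape can be sketched as a lookup from (d, L, IE) buckets to an average relative error. The slides do not spell out how the adjustment is applied; the sketch assumes relative error is (estimate - real) / real, so the adjusted estimate is IE / (1 + e). The bucket boundaries follow the tree above, while the function names and the sample values are hypothetical.

```python
from collections import defaultdict


def bucket(d, length, ie):
    """Map a predicate to a leaf of the tree: threshold, length range, estimate range."""
    return (d, "L<=5" if length <= 5 else "L>=6", "IE<=40" if ie <= 40 else "IE>=41")


def fit_error_model(samples):
    """samples: iterable of (d, query_length, initial_estimate, real_selectivity), real > 0.

    Returns {bucket: average relative error}, relative error = (estimate - real) / real.
    """
    errors = defaultdict(list)
    for d, length, ie, real in samples:
        errors[bucket(d, length, ie)].append((ie - real) / real)
    return {b: sum(es) / len(es) for b, es in errors.items()}


def adjust(ie, d, length, model):
    """Correct an initial estimate using the average relative error of its bucket."""
    e = model.get(bucket(d, length, ie), 0.0)  # unseen bucket: leave the estimate as is
    if 1.0 + e <= 0.0:
        return ie  # assumption: skip corrections that would divide by zero or flip the sign
    return ie / (1.0 + e)


# Hypothetical training samples: two short queries at d = 2, each overestimated by 50%.
model = fit_error_model([(2, 5, 750, 500), (2, 5, 600, 400)])
print(adjust(750, d=2, length=5, model=model))  # prints 500.0
```

Averaging per bucket is the simplest choice; any other summary of the per-bucket error distribution would slot into `fit_error_model` the same way.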

Page 37: Answering Approximate Queries Efficiently

37

Outline

• Motivation: selectivity estimation of fuzzy predicates
• Our approach: SEPIA
– Overview
– Proximity between strings
– Estimation algorithm
• Construction and maintenance of SEPIA
• Experiments
• Other works

Page 38: Answering Approximate Queries Efficiently

38

Data

• Citeseer:
– 71K author names
– Length: [2,20], avg = 12
• Movie records from UCI KDD repository:
– 11K movie titles
– Length: [3,80], avg = 35
• Introduced duplicates:
– 10% of records
– # of duplicates: [1,20], uniform
• Final results:
– Citeseer: 142K author names
– UCI KDD: 23K movie titles

Page 39: Answering Approximate Queries Efficiently

39

Setting

• Test bed:
– PC: 2.4 GHz P4, 1.2GB RAM, Windows XP
– Visual C++ compiler
• Query workload:
– Strings from the data
– Strings not in the data
– Results similar
• Quality measurements:
– Relative error: (f_est - f_real) / f_real
– Absolute relative error: |f_est - f_real| / f_real

Page 40: Answering Approximate Queries Efficiently

40

Clustering Algorithms

[Bar chart: k-Medoids vs. Lexicographic clustering, compared on clustering time (sec), estimation time (ms), and average absolute relative error (%). Bar labels as extracted: 217, 45, 18, 120, 47, 37.]

k-Medoids is better

Page 41: Answering Approximate Queries Efficiently

41

Quartile distribution of relative errors

[Figure: distribution of relative errors over the workload. X-axis: relative error (%), with ticks from -100 to 100 and infinity; y-axis: percentage in workload, from 0 to 1.]

Data set 1. CLOSE_RAND; 1000 clusters

Page 42: Answering Approximate Queries Efficiently

42

Number of Clusters

Page 43: Answering Approximate Queries Efficiently

43

Effectiveness of Applying Relative-Error Model

[Bar chart: average absolute relative error (%) for data set 1 and data set 2, without error correction vs. with error correction. Bar labels as extracted: 18, 25, 10, 12.]

Page 44: Answering Approximate Queries Efficiently

44

Dynamic Maintenance

Page 45: Answering Approximate Queries Efficiently

45

Other work 1: Relaxing SQL queries with Selections/Joins

SELECT *
FROM Jobs J, Candidates C
WHERE J.Salary <= 95
AND J.Zipcode = C.Zipcode
AND C.WorkYear >= 5

Jobs:
JID | Company | Zipcode | Salary
r1 | Broadcom | 92047 | 80
r2 | Intel | 93652 | 95
r3 | Microsoft | 82632 | 120
r4 | IBM | 90391 | 130
... | … | … | …

Candidates:
CID | Zipcode | ExpSalary | WorkYear
s1 | 93652 | 120 | 3
s2 | 92612 | 130 | 6
s3 | 82632 | 100 | 5
s4 | 90391 | 150 | 1
... | … | … | …

Page 46: Answering Approximate Queries Efficiently

46

Query Relaxation: Skyline!

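[Figure: lattice over the query's conditions R, J, and S, with nodes {}, R, J, S, RJ, RS, SJ, RSJ. Beside it, a plot with axes J.Salary and C.WorkYear, with the original conditions J.Salary <= 95 and C.WorkYear >= 5 marked at 95 and 5.]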

Page 47: Answering Approximate Queries Efficiently

47

Other work 2: Fuzzy predicates on attributes of mixed types

SELECT *
FROM Movies
WHERE star SIMILARTO ’Schwarrzenger’
AND |year – 1977| <= 3;

Movies:
Star | Title | Year | Genre
Keanu Reeves | The Matrix | 1999 | Sci-Fi
Samuel Jackson | Star Wars: Episode III - Revenge of the Sith | 2005 | Sci-Fi
Schwarzenegger | The Terminator | 1984 | Sci-Fi
Samuel Jackson | Goodfellas | 1990 | Drama
… | … | … | …

Page 48: Answering Approximate Queries Efficiently

48

Mixed-Typed Predicates

• String attributes: edit distance
• Numeric attributes: absolute numeric difference

SELECT *
FROM Movies
WHERE star SIMILARTO ’Schwarrzenger’
AND |year – 1977| <= 3;

Page 49: Answering Approximate Queries Efficiently

49

MAT-tree: Intuition

• A single index on both attributes is more effective than two separate index structures
• Numeric attribute: B-tree
• String attribute: tree-based index structure?

Page 50: Answering Approximate Queries Efficiently

50

MAT-tree: Overview

• Tree-based indexing structure:
– Each node has MBR for both numeric attribute and string attribute
• Compressing strings as a “compressed trie” that fits into a limited space
• An edit distance between a string and compressed trie can be computed
• Experiments show that MAT-tree is very efficient

[Figure: example MAT-tree. Leaf entries: Spielberg 1946, Hanks 1956, Gibson 1956, Hanks 1957, Crowe 1964, Robert 1968, DiCaprio 1974, Roberrts 1977. Leaf-node MBRs on the numeric attribute: <1946,1956>, <1956,1957>, <1964,1968>, <1974,1977>. Their parents have MBRs <1946,1957> and <1964,1977>, under a single root. Each node also carries a string summary, shown as *.]

Page 51: Answering Approximate Queries Efficiently

51

Conclusion

• It’s important to support answering approximate queries efficiently
• Our results so far:
– SEPIA: provides accurate selectivity estimation for fuzzy string predicates
– Relaxing SQL queries with selections and joins
– MAT-tree: indexing structure supporting fuzzy queries with mixed-types predicates