CORDS: Automatic Discovery of Correlations and Soft ...nasghar/CORDS_presentation.pdf · Jo Watson...

37
CORDS: Automatic Discovery of Correlations and Soft Functional Dependencies Presenter: Nabiha Asghar . I. F. Ilyas, V. Markl, P. Haas, P. Brown and A. Aboulnaga SIGMOD 2004

Transcript of CORDS: Automatic Discovery of Correlations and Soft ...nasghar/CORDS_presentation.pdf · Jo Watson...

Page 1: CORDS: Automatic Discovery of Correlations and Soft ...nasghar/CORDS_presentation.pdf · Jo Watson 765776 Ottawa 13 . CORDS Main Idea: •Generate candidate column pairs (a 1, a 2)

CORDS: Automatic Discovery of

Correlations and Soft Functional

Dependencies

Presenter: Nabiha Asghar

.

I. F. Ilyas, V. Markl, P. Haas, P. Brown and A. Aboulnaga

SIGMOD 2004

Page 2: CORDS: Automatic Discovery of Correlations and Soft ...nasghar/CORDS_presentation.pdf · Jo Watson 765776 Ottawa 13 . CORDS Main Idea: •Generate candidate column pairs (a 1, a 2)

Outline

• Introduction & Motivation

• Main contributions of the paper

• Description of algorithm and techniques

• Experimental results

2

Page 3: CORDS: Automatic Discovery of Correlations and Soft ...nasghar/CORDS_presentation.pdf · Jo Watson 765776 Ottawa 13 . CORDS Main Idea: •Generate candidate column pairs (a 1, a 2)

Intro & Motivation

• Why is it important to discover correlations

and functional dependencies in the data?

3

Page 4: CORDS: Automatic Discovery of Correlations and Soft ...nasghar/CORDS_presentation.pdf · Jo Watson 765776 Ottawa 13 . CORDS Main Idea: •Generate candidate column pairs (a 1, a 2)

Intro & Motivation

• Why is it important to discover correlations

and functional dependencies in the data?

QUERY OPTIMIZATION

4

Page 5: CORDS: Automatic Discovery of Correlations and Soft ...nasghar/CORDS_presentation.pdf · Jo Watson 765776 Ottawa 13 . CORDS Main Idea: •Generate candidate column pairs (a 1, a 2)

Intro & Motivation

Query optimizers:

- need to estimate the cost of different access

methods (e.g. the optimal join order)

- need to compute ‘selectivity’ of predicates

- ‘selectivity’ of a predicate p = fraction of rows in

a table that are chosen by p

5

Page 6: CORDS: Automatic Discovery of Correlations and Soft ...nasghar/CORDS_presentation.pdf · Jo Watson 765776 Ottawa 13 . CORDS Main Idea: •Generate candidate column pairs (a 1, a 2)

Example: Selectivity of a Predicate

FirstName LastName

Emma Stewart

Chris Martin

Jennifer Jackson

Emma Johnson

Liam Watson

sel(“FirstName = Emma”) = 2/5

sel(“LastName = Watson”) = 1/5

6

Page 7: CORDS: Automatic Discovery of Correlations and Soft ...nasghar/CORDS_presentation.pdf · Jo Watson 765776 Ottawa 13 . CORDS Main Idea: •Generate candidate column pairs (a 1, a 2)

Intro & Motivation

Query optimizers:

- need to estimate the cost of different access

methods (e.g. the optimal join order)

- need to compute ‘selectivity’ of predicates

7

Page 8: CORDS: Automatic Discovery of Correlations and Soft ...nasghar/CORDS_presentation.pdf · Jo Watson 765776 Ottawa 13 . CORDS Main Idea: •Generate candidate column pairs (a 1, a 2)

Intro & Motivation

Query optimizers:

- need to estimate the cost of different access

methods (e.g. the optimal join order)

- need to compute ‘selectivity’ of predicates

- example: joining the columns R.C1 and S.C2 over

the predicate R.C1 = ‘a’ AND S.C2 = ‘b’

- typically estimate this selectivity as

sel(R.C1 = ‘a’ ) x sel(S.C2 = ‘b’)

8

Page 9: CORDS: Automatic Discovery of Correlations and Soft ...nasghar/CORDS_presentation.pdf · Jo Watson 765776 Ottawa 13 . CORDS Main Idea: •Generate candidate column pairs (a 1, a 2)

Intro & Motivation

Query optimizers:

- Assume sel(p1 AND p2 ) = sel(p1) * sel(p2 )

- This assumption does not hold if the two columns

are dependent/correlated

Gist: we need to figure out such dependencies to

enable more efficient query execution plans

9

Page 10: CORDS: Automatic Discovery of Correlations and Soft ...nasghar/CORDS_presentation.pdf · Jo Watson 765776 Ottawa 13 . CORDS Main Idea: •Generate candidate column pairs (a 1, a 2)

Main Contributions

• CORDS Algorithm

detects functional dependencies (FDs) of the form

X → Y

detects soft FDs of the form X Y (the value

of X determines the value of Y with a high

probability)

scalable, efficient

• Experimental Evaluation

10

Page 11: CORDS: Automatic Discovery of Correlations and Soft ...nasghar/CORDS_presentation.pdf · Jo Watson 765776 Ottawa 13 . CORDS Main Idea: •Generate candidate column pairs (a 1, a 2)

Examples of FDs & Soft FDs

Name SIN City

Emma Stewart 123456 Toronto

Jack Hugh 456789 Boston

Jennifer Li 567890 LA

Liam Yang 364566 Miami

Sandra B. 871235 Austin

Megan Ray 639123 Seattle

Jo Watson 789012 NYC

Jo Watson 901234 NYC

Jo Watson 765776 Ottawa

11

Page 12: CORDS: Automatic Discovery of Correlations and Soft ...nasghar/CORDS_presentation.pdf · Jo Watson 765776 Ottawa 13 . CORDS Main Idea: •Generate candidate column pairs (a 1, a 2)

Examples of FDs & Soft FDs

FDs:

SIN → Name

SIN → City

Name SIN City

Emma Stewart 123456 Toronto

Jack Hugh 456789 Boston

Jennifer Li 567890 LA

Liam Yang 364566 Miami

Sandra B. 871235 Austin

Megan Ray 639123 Seattle

Jo Watson 789012 NYC

Jo Watson 901234 NYC

Jo Watson 765776 Ottawa

12

Page 13: CORDS: Automatic Discovery of Correlations and Soft ...nasghar/CORDS_presentation.pdf · Jo Watson 765776 Ottawa 13 . CORDS Main Idea: •Generate candidate column pairs (a 1, a 2)

Examples of FDs & Soft FDs

FDs:

SIN → Name

SIN → City

Soft FDs:

Name City

Name SIN City

Emma Stewart 123456 Toronto

Jack Hugh 456789 Boston

Jennifer Li 567890 LA

Liam Yang 364566 Miami

Sandra B. 871235 Austin

Megan Ray 639123 Seattle

Jo Watson 789012 NYC

Jo Watson 901234 NYC

Jo Watson 765776 Ottawa

13

Page 14: CORDS: Automatic Discovery of Correlations and Soft ...nasghar/CORDS_presentation.pdf · Jo Watson 765776 Ottawa 13 . CORDS Main Idea: •Generate candidate column pairs (a 1, a 2)

CORDS

Main Idea:

• Generate candidate column pairs (a1, a2) that

potentially have interesting/useful

dependencies

• For each candidate, the dependency is detected

by computing some statistics on sampled

values from the columns

14

Page 15: CORDS: Automatic Discovery of Correlations and Soft ...nasghar/CORDS_presentation.pdf · Jo Watson 765776 Ottawa 13 . CORDS Main Idea: •Generate candidate column pairs (a 1, a 2)

CORDS: Generating Candidates

A candidate is a triple (a1, a2, P), where • a1 and a2 are attributes

• P is a pairing rule, which specifies which particular a1

values get paired with which particular a2 values to form

the set of pairs of potentially correlated values.

15

Page 16: CORDS: Automatic Discovery of Correlations and Soft ...nasghar/CORDS_presentation.pdf · Jo Watson 765776 Ottawa 13 . CORDS Main Idea: •Generate candidate column pairs (a 1, a 2)

Example 1: Candidates

16

Page 17: CORDS: Automatic Discovery of Correlations and Soft ...nasghar/CORDS_presentation.pdf · Jo Watson 765776 Ottawa 13 . CORDS Main Idea: •Generate candidate column pairs (a 1, a 2)

Example 1: Candidates

A possible candidate: (orders.shipDate,

deliveries.deliveryDate,

orders.orderID = deliveries.orderID)

a1 a2

P

17

Page 18: CORDS: Automatic Discovery of Correlations and Soft ...nasghar/CORDS_presentation.pdf · Jo Watson 765776 Ottawa 13 . CORDS Main Idea: •Generate candidate column pairs (a 1, a 2)

Example 2: Candidates

18

Page 19: CORDS: Automatic Discovery of Correlations and Soft ...nasghar/CORDS_presentation.pdf · Jo Watson 765776 Ottawa 13 . CORDS Main Idea: •Generate candidate column pairs (a 1, a 2)

Example 2: Candidates

a1 a2

A possible candidate: (orders.shipDate,

orders.deliveryDate,

Øorders)

19

Page 20: CORDS: Automatic Discovery of Correlations and Soft ...nasghar/CORDS_presentation.pdf · Jo Watson 765776 Ottawa 13 . CORDS Main Idea: •Generate candidate column pairs (a 1, a 2)

Example 2: Candidates

a1 a2

A possible candidate: (orders.shipDate,

orders.deliveryDate,

Øorders)

Trivial pairing rule:

when the columns are

in the same table and

each a1 value is paired

with the a2 value in

the same row

20

Page 21: CORDS: Automatic Discovery of Correlations and Soft ...nasghar/CORDS_presentation.pdf · Jo Watson 765776 Ottawa 13 . CORDS Main Idea: •Generate candidate column pairs (a 1, a 2)

CORDS: Generating Candidates

• Generate all possible candidates having a

trivial pairing rule

• Generate all possible candidates with

nontrivial pairing rules which look like key-to-

foreign-key join predicates

i. First find all the declared primary/unique keys,

and all the soft (almost-unique) keys

ii. For each such key, examine every other column

in the schema to find potential matches

21

Page 22: CORDS: Automatic Discovery of Correlations and Soft ...nasghar/CORDS_presentation.pdf · Jo Watson 765776 Ottawa 13 . CORDS Main Idea: •Generate candidate column pairs (a 1, a 2)

CORDS: Generating Candidates (cont’d)

To prune this huge set of candidates, use:

• Type constraints e.g. prune columns who’s

data type is not integer

• Statistical constraints e.g. prune columns with

too few distinct values

• Pairing constraints e.g. only allow key-to-

foreign-key pairing rules

etc.

22

Page 23: CORDS: Automatic Discovery of Correlations and Soft ...nasghar/CORDS_presentation.pdf · Jo Watson 765776 Ottawa 13 . CORDS Main Idea: •Generate candidate column pairs (a 1, a 2)

CORDS

Main Idea:

• Generate candidate column pairs (a1, a2) that

potentially have interesting/useful

dependencies

• For each candidate, the dependency is detected

by computing some statistics on sampled

values from the candidates’ columns

23

Page 24: CORDS: Automatic Discovery of Correlations and Soft ...nasghar/CORDS_presentation.pdf · Jo Watson 765776 Ottawa 13 . CORDS Main Idea: •Generate candidate column pairs (a 1, a 2)

CORDS: Testing for Correlation

For each candidate triple C = (a1, a2, P): • Draw a random sample of n pairs from the columns (a1, a2)

• Estimate where

di =no. of distinct values in column ai

d = min (d1, d2)

= fraction of pairs where a1 = i and a2 = j

• corresponds to an FD

• corresponds to independence for a small

24

Page 25: CORDS: Automatic Discovery of Correlations and Soft ...nasghar/CORDS_presentation.pdf · Jo Watson 765776 Ottawa 13 . CORDS Main Idea: •Generate candidate column pairs (a 1, a 2)

CORDS

Main Idea:

• Generate candidate column pairs (a1, a2) that

potentially have interesting/useful

dependencies

• For each candidate, the dependency is detected

by computing some statistics on sampled

values from the candidates’ columns

25

Page 26: CORDS: Automatic Discovery of Correlations and Soft ...nasghar/CORDS_presentation.pdf · Jo Watson 765776 Ottawa 13 . CORDS Main Idea: •Generate candidate column pairs (a 1, a 2)

26

Page 27: CORDS: Automatic Discovery of Correlations and Soft ...nasghar/CORDS_presentation.pdf · Jo Watson 765776 Ottawa 13 . CORDS Main Idea: •Generate candidate column pairs (a 1, a 2)

27

Page 28: CORDS: Automatic Discovery of Correlations and Soft ...nasghar/CORDS_presentation.pdf · Jo Watson 765776 Ottawa 13 . CORDS Main Idea: •Generate candidate column pairs (a 1, a 2)

28

Page 29: CORDS: Automatic Discovery of Correlations and Soft ...nasghar/CORDS_presentation.pdf · Jo Watson 765776 Ottawa 13 . CORDS Main Idea: •Generate candidate column pairs (a 1, a 2)

29

Page 30: CORDS: Automatic Discovery of Correlations and Soft ...nasghar/CORDS_presentation.pdf · Jo Watson 765776 Ottawa 13 . CORDS Main Idea: •Generate candidate column pairs (a 1, a 2)

Helping the Optimizer

30

Page 31: CORDS: Automatic Discovery of Correlations and Soft ...nasghar/CORDS_presentation.pdf · Jo Watson 765776 Ottawa 13 . CORDS Main Idea: •Generate candidate column pairs (a 1, a 2)

Experimental Evaluation

• Created a schema with 4 relations: CAR, OWNER,

DEMOGRAPHICS and ACCIDENTS (size 1GB)

• Noted FDs, soft FDs, correlations

• CORDS, using a sample size of 4000 rows, detected

all correlations and soft FDs, and did not incorrectly

detect any spurious relationships

31

Page 32: CORDS: Automatic Discovery of Correlations and Soft ...nasghar/CORDS_presentation.pdf · Jo Watson 765776 Ottawa 13 . CORDS Main Idea: •Generate candidate column pairs (a 1, a 2)

Experimental Evaluation

32

Page 33: CORDS: Automatic Discovery of Correlations and Soft ...nasghar/CORDS_presentation.pdf · Jo Watson 765776 Ottawa 13 . CORDS Main Idea: •Generate candidate column pairs (a 1, a 2)

Experimental Evaluation

33

Page 34: CORDS: Automatic Discovery of Correlations and Soft ...nasghar/CORDS_presentation.pdf · Jo Watson 765776 Ottawa 13 . CORDS Main Idea: •Generate candidate column pairs (a 1, a 2)

Experimental Evaluation

34

Page 35: CORDS: Automatic Discovery of Correlations and Soft ...nasghar/CORDS_presentation.pdf · Jo Watson 765776 Ottawa 13 . CORDS Main Idea: •Generate candidate column pairs (a 1, a 2)

Experimental Evaluation

35

Page 36: CORDS: Automatic Discovery of Correlations and Soft ...nasghar/CORDS_presentation.pdf · Jo Watson 765776 Ottawa 13 . CORDS Main Idea: •Generate candidate column pairs (a 1, a 2)

Experimental Evaluation

36

Page 37: CORDS: Automatic Discovery of Correlations and Soft ...nasghar/CORDS_presentation.pdf · Jo Watson 765776 Ottawa 13 . CORDS Main Idea: •Generate candidate column pairs (a 1, a 2)

Summary

• CORDS automatically discovers correlations and soft

FDs in data

• Works on samples instead of the entire data, so very

scalable

• Improves selectivity estimation for query

optimization, therefore better query plans are

generated

• Efficient, because it works on pairs of columns only

37