sherlock - DA QCRI (da.qcri.org/ntang/pubs/sherlock.slides.pdf)

Proof Positive and Negative in Data Cleaning Matteo Interlandi Nan Tang Sherlock Rules

Proof Positive and Negative in Data Cleaning

Matteo Interlandi, Nan Tang

Sherlock Rules

Outline

• Motivation

• Sherlock Rules

• Fundamental problems

• Algorithms

Roadblocks to Get Value from Data?

High Quality Data

Data Mining

Machine Learning

Rule Discovery

D

name  nation  capital
Si    China   Beijing
Yan   China   Shanghai
Ian   China   Tokyo

consistent D’ (data repairing with nation -> capital)

name  nation  capital
Si    China   Beijing
Yan   China   Beijing
Ian   China   Beijing

annotated D” (proof positive and negative)

name  nation  capital
Si    China   Beijing
Yan   China   Shanghai
Ian   China   Tokyo

help

Sherlock Rules
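
The dependency nation -> capital can flag that the China tuples in D disagree, but it cannot say which capital is the right one. A minimal Python sketch of that detection step follows; the function name `fd_violations` is hypothetical and not taken from the slides.

```python
from collections import defaultdict

# Toy instance D from the slide: (name, nation, capital).
D = [
    ("Si", "China", "Beijing"),
    ("Yan", "China", "Shanghai"),
    ("Ian", "China", "Tokyo"),
]

def fd_violations(rows):
    """Return nations that map to more than one capital (violations of nation -> capital)."""
    capitals = defaultdict(set)
    for _name, nation, capital in rows:
        capitals[nation].add(capital)
    return {nation: sorted(caps) for nation, caps in capitals.items() if len(caps) > 1}

violations = fd_violations(D)
print(violations)
```

The result lists three candidate capitals for China with no indication of which one to trust, which is the gap that proof positive and negative is meant to close.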

Proof Positive and Negative

      name  dept  nation  capital   bornat    officePhn
t1    Si    DA    China   Beijing   ChenYang  28098001
t2    Yan   DA    China   Shanghai  Chengdu   24038698
t3    Ian   ALT   China   Beijing   Hangzhou  33668323

      name  officePhn  mobile
r1    Si    28098001   66700541
r2    Yan   24038698   66706563
r3    Ian   27364928   33668323

Proof Positive/Negative, Correction: t3[Ian] is correct, t3[officePhn] = 27364928

Proof Positive/Negative: t3[Ian] is correct, t3[officePhn] is wrong

Proof Positive and Negative

      name  dept  nation  capital   bornat    officePhn
t1    Si    DA    China   Beijing   ChenYang  28098001
t2    Yan   DA    China   Shanghai  Chengdu   24038698
t3    Ian   ALT   China   Beijing   Hangzhou  33668323

      country  capital
s1    China    Beijing
s2    Japan    Tokyo
s3    Chile    Santiago

Proof Positive: t1[nation, capital] is correct, t3[nation, capital] is correct
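
The three kinds of evidence on these slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function `annotate`, the dictionary layout, and the choice to treat both master tables as fully trusted are hypothetical, not the authors' implementation.

```python
# Master tables from the slides, treated here as trusted reference data.
R_MASTER = [  # (name, officePhn, mobile)
    ("Si", "28098001", "66700541"),
    ("Yan", "24038698", "66706563"),
    ("Ian", "27364928", "33668323"),
]
S_MASTER = {("China", "Beijing"), ("Japan", "Tokyo"), ("Chile", "Santiago")}

def annotate(t):
    """Return (marks, fixes): marks maps attribute -> '+' or '-', fixes maps attribute -> corrected value."""
    marks, fixes = {}, {}
    # Proof positive: the (nation, capital) pair appears in the master table.
    if (t["nation"], t["capital"]) in S_MASTER:
        marks["nation"] = marks["capital"] = "+"
    for name, office, mobile in R_MASTER:
        if t["name"] != name:
            continue
        if t["officePhn"] == office:
            # Name and office phone both agree with the master data.
            marks["name"] = marks["officePhn"] = "+"
        elif t["officePhn"] == mobile:
            # The mobile number was stored as the office phone:
            # proof negative, plus a correction taken from the master data.
            marks["name"], marks["officePhn"] = "+", "-"
            fixes["officePhn"] = office
    return marks, fixes

t3 = {"name": "Ian", "dept": "ALT", "nation": "China", "capital": "Beijing",
      "bornat": "Hangzhou", "officePhn": "33668323"}
marks, fixes = annotate(t3)
print(marks)
print(fixes)
```

For t3 this marks name, nation, and capital as positive, marks officePhn as negative, and proposes 27364928 as the correction, matching the evidence listed on the slides.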

Sherlock Rules

D

      name  dept  nation  capital   bornat    officePhn
t1    Si    DA    China   Beijing   ChenYang  28098001
t2    Yan   DA    China   Shanghai  Chengdu   24038698
t3    Ian   ALT   China   Beijing   Hangzhou  33668323

Dm

      name  officePhn  mobile
r1    Si    28098001   66700541
r2    Yan   24038698   66706563
r3    Ian   27364928   33668323

      country  capital
s1    China    Beijing
s2    Japan    Tokyo
s3    Chile    Santiago

evidence: positive, negative

Point of Innovation

Integrity Constraints

There do not exist t1, t2 such that t1[X1] = t2[X2] but t1[B1] <> t2[B2]

(China, Shanghai) vs. (China, Beijing): China = China, but Shanghai <> Beijing

Sherlock Rules

If t1[X1] = t2[X2] and t1[B] = t2[B-], then t1[B] := t2[B+]

(China, Shanghai) vs. (China, Beijing, Shanghai)
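
To make the contrast concrete, here is a minimal Python sketch of a single rule of the form above. The class `SherlockRule`, the function `apply_rule`, and the master column name `mistaken_capital` are all assumptions introduced for illustration; the slide only gives the master tuple as (China, Beijing, Shanghai).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SherlockRule:
    x: str      # attribute of the data tuple used for matching (X)
    x_m: str    # matching attribute in the master tuple
    b: str      # attribute of the data tuple to check (B)
    b_neg: str  # master attribute holding a known wrong value (B-)
    b_pos: str  # master attribute holding the correct value (B+)

def apply_rule(rule, t, master):
    """Return a repaired copy of t if a master tuple proves t[B] wrong, else t unchanged."""
    for m in master:
        if t[rule.x] == m[rule.x_m] and t[rule.b] == m[rule.b_neg]:
            return {**t, rule.b: m[rule.b_pos]}
    return t

# Hypothetical master tuple for the slide's (China, Beijing, Shanghai) example.
master = [{"country": "China", "capital": "Beijing", "mistaken_capital": "Shanghai"}]
rule = SherlockRule(x="nation", x_m="country", b="capital",
                    b_neg="mistaken_capital", b_pos="capital")

t2 = {"name": "Yan", "nation": "China", "capital": "Shanghai"}
repaired = apply_rule(rule, t2, master)
print(repaired)
```

Unlike the integrity constraint, the rule fires only when the wrong value is explicitly known, so a tuple such as (China, Tokyo) is left untouched rather than guessed at.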

Applying Multiple Rules

[Figure: Pos(t), Neg(t), and Free(t), with + and - markers]

Sherlock Rules in Action

t1 (Si, DA, China, Beijing, ChenYang, 28098001)

t1 (Si+, DA, China, Beijing, ChenYang-, 28098001+)

t1 (Si+, DA, China, Beijing, ShenYang+, 28098001+)

Pos(t1)

Transformation Rules

Fundamental Problems

Termination

Determinism

Consistency

Implication

(coNP-complete)

(coNP-complete)

Algorithms

Naive Repairing

chase-based

O(|R|x|Sigma|x|M|)
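
A chase-based repair loop could look roughly like the sketch below. It assumes that Sigma denotes the rule set and M the master data, and it uses a hypothetical five-field tuple encoding for rules; none of these names come from the slides.

```python
# Master data M: rows from the slide's (name, officePhn, mobile) table.
M = [
    {"name": "Si", "officePhn": "28098001", "mobile": "66700541"},
    {"name": "Yan", "officePhn": "24038698", "mobile": "66706563"},
    {"name": "Ian", "officePhn": "27364928", "mobile": "33668323"},
]

# Sigma: each rule is (X, Xm, B, B-, B+) over attribute names (hypothetical encoding).
SIGMA = [("name", "name", "officePhn", "mobile", "officePhn")]

def chase(t, rules, master):
    """Apply rules until a full pass changes nothing; positive attributes are never rewritten."""
    t = dict(t)
    pos = set()                  # attributes proven correct so far
    changed = True
    while changed:
        changed = False
        for x, x_m, b, b_neg, b_pos in rules:    # scan every rule in Sigma
            if b in pos:
                continue
            for m in master:                     # scan every master tuple in M
                if t[x] != m[x_m]:
                    continue
                if t[b] == m[b_neg]:
                    t[b] = m[b_pos]              # proof negative: apply the correction
                if t[b] == m[b_pos]:
                    pos.update((x, b))           # proof positive on evidence and target
                    changed = True
                    break
    return t, pos

t3 = {"name": "Ian", "officePhn": "33668323"}
fixed, pos = chase(t3, SIGMA, M)
print(fixed, sorted(pos))
```

Each pass scans every rule against every master tuple, which is where the |Sigma| and |M| factors in the bound come from in this sketch.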

Fast Repairing

Similarity indices to reduce |M|

(BK-tree, FastSS, n-gram)

Inverted index to reduce |Sigma|

(hash map)

O(|R| x |Sigma| x com(S))

Caching similarity index accesses

Rule pruning based on dependency
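
As an illustration of how a similarity index can avoid scanning all of M, the following is a minimal BK-tree sketch over edit distance. The class layout and the list of city names used as master values are assumptions for the example, not details from the slides.

```python
def edit_distance(a, b):
    """Levenshtein distance via the classic row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

class BKTree:
    def __init__(self, words):
        it = iter(words)
        self.root = (next(it), {})      # node = (word, {distance: child node})
        for w in it:
            self.add(w)

    def add(self, word):
        node = self.root
        while True:
            d = edit_distance(word, node[0])
            if d == 0:
                return                  # word already stored
            if d not in node[1]:
                node[1][d] = (word, {})
                return
            node = node[1][d]

    def query(self, word, radius):
        """Return all stored words within `radius` edits of `word`."""
        hits, stack = [], [self.root]
        while stack:
            w, children = stack.pop()
            d = edit_distance(word, w)
            if d <= radius:
                hits.append(w)
            # Triangle inequality: only children at distance k with |k - d| <= radius can match.
            stack.extend(c for k, c in children.items() if d - radius <= k <= d + radius)
        return sorted(hits)

# Hypothetical master values for the bornat attribute.
words = ["ShenYang", "Chengdu", "Hangzhou", "Beijing", "Shanghai"]
tree = BKTree(words)
print(tree.query("ChenYang", 1))
```

A lookup for the dirty value ChenYang with radius 1 visits only the subtrees that the triangle inequality cannot rule out, instead of comparing against every master value.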

Rule Pruning Example

[Figure: rules R1, R2, R3]

t3(Ian, ALT, Chine, Beijing, Hangzhou, 33668323)

iteration 1: {(R1, Yes), (R2, Yes), (R3, No)}

iteration 2: {(R1, Yes), (R2, No), (R3, No)}

iteration 3: {(R1, Yes), (R2, No), (R3, No)}

Conclusion

• Sherlock rules for accurately annotating and repairing data

• Fundamental problems

• Efficient algorithms

Future Work

• Let SQL drive the Sherlock workhorse

• Extend Sherlock rules to more data such as RDF (knowledge bases)