Solomon: Seeking the Truth Via Copying Detection
![Page 1: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/1.jpg)
SOLOMON: SEEKING THE TRUTH VIA COPYING DETECTION
Xin Luna Dong, AT&T Labs-Research
9/13 @QDB’2010
![Page 2: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/2.jpg)
We Live in an Information Era
A visualization of the topology of a portion of the Internet. Web 2.0
![Page 3: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/3.jpg)
But the Freely Accessible Information Has Its Downside
![Page 4: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/4.jpg)
Information Propagation Becomes Much Easier with the Web Technologies
![Page 5: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/5.jpg)
False Information Can Be Propagated (I)
UA’s bankruptcy
Chicago Tribune, 2002
Sun-Sentinel.com
Google News
Bloomberg.com
The UAL stock plummeted to $3 from $12.5
![Page 6: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/6.jpg)
False Information Can Be Propagated (II)
Maurice Jarre (1924-2009) French Conductor and Composer
“One could say my life itself has been one long soundtrack. Music was my life, music brought me to life, and music is how I will be remembered long after I leave this life. When I die there will be a final waltz playing in my head and that only I can hear.”
2:29, 30 March 2009
![Page 7: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/7.jpg)
False Information Can Be Propagated (III)
Pasadena Fire Department …received several calls Monday from people saying they heard a quake was imminent
![Page 8: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/8.jpg)
False Information Can Be Propagated (IV)
Posted by Andrew Breitbart in his blog
…
![Page 9: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/9.jpg)
The Internet needs a way to help people separate rumor from real science.
– Tim Berners-Lee
We now live in this media culture where something goes up on YouTube or a blog and everybody scrambles. - Barack Obama
![Page 10: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/10.jpg)
Copying Can Happen on Structured Data (Copying of Weather Data)
![Page 11: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/11.jpg)
Copying Can Be Large Scaled (Copying of AbeBooks Data)
Data collected from AbeBooks [Yin et al., 2007]
![Page 12: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/12.jpg)
Intuitively Meaningful Clusters According to the Copying Relationships
![Page 13: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/13.jpg)
Intuitively Meaningful Clusters According to the Copying Relationships
![Page 14: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/14.jpg)
Copying Can Be Large Scaled (Copying of AbeBooks Data)
![Page 15: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/15.jpg)
Solomon
Goal
• Discover copying relationships between structured data sources
• Leverage the copying relationships to improve various components of data integration
Other applications
• Business purpose: data are valuable
• In-depth data analysis: information dissemination
![Page 16: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/16.jpg)
Solomon
Outline
Copying discovery
• Local detection [VLDB’09a]
• Global detection [VLDB’10a]
• Detection w. dynamic data [VLDB’09b]
Applications in data integration
• Truth discovery [VLDB’09a][VLDB’09b]
• Query answering [Submitted]
• Record linkage [VLDB’10b]
Visualization and decision explanation
• Visualization
• Decision explanation [VLDB’10 demo]
![Page 17: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/17.jpg)
Problem Definition: Input

| Src | ISBN | Name | Author |
|---|---|---|---|
| S1 | 1 | IPV6: Theory, Protocol, and Practice | Loshin, Peter |
| S1 | 2 | Web Usability: A User-Centered Design Approach | Lazar, Jonathan |
| S2 | 1 | IPV4: Theory, Protocol, and Practice | - |
| S2 | 2 | Web Usability: A User | Jonathan Lazar |
| S3 | 1 | IPV6: Theory, Protocol, and Practice | Loshin, Peter |
| S3 | 2 | Web Usability: A User | Jonathan Lazar |
| S4 | 1 | IPV6: Theory, Protocol, and Practice | Loshin |
| S4 | 2 | Web Usability: A User | Lazar |

Callouts on the table: missing values; different formats; incorrect values
Input
• Objects: real-world entities, each described by a set of attributes; each attribute is associated w. a true value
• Sources: each providing data for a subset of objects
![Page 18: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/18.jpg)
Formatting Patterns for Author List
![Page 19: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/19.jpg)
Problem Definition: Output
For each pair S1, S2, decide the probability of S1 copying directly from S2
• A copier copies all or a subset of data
• A copier can add values and verify/modify copied values (independent contribution)
• A copier can re-format copied values (still considered as copied)
S1 S2
S3
S4
![Page 20: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/20.jpg)
Challenges in Copying Detection
• Sharing data may be due to both sources providing accurate data
• A copier can copy only a small fraction of data
• With only a snapshot it is hard to decide which source is a copier
• Copying relationship can be complex: co-copying, transitive copying
[Figure: sources S1, S2, S3, S4]
![Page 21: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/21.jpg)
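High-Level Intuitions for Copying Detection
Intuition I: decide dependence (w/o direction)
For shared data, $\Pr(\Phi(S_1) \mid S_1 \perp S_2)$ is low, e.g., incorrect value
$\Pr(\Phi(S_1) \mid S_1 \sim S_2) \gg \Pr(\Phi(S_1) \mid S_1 \perp S_2) \Rightarrow S_1 \sim S_2$
($\perp$: independent; $\sim$: copying, direction not specified)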
![Page 22: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/22.jpg)
Copying? Not necessarily
Name: Alice  Score: 5
1. A  2. C  3. D  4. C  5. B  6. D  7. B  8. A  9. B  10. C
Name: Bob  Score: 5
1. A  2. C  3. D  4. C  5. B  6. D  7. B  8. A  9. B  10. C
![Page 23: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/23.jpg)
Copying? (Common Errors) Very likely
Name: Mary  Score: 1
1. A  2. B  3. B  4. D  5. A  6. C  7. C  8. D  9. E  10. C
Name: John  Score: 1
1. A  2. B  3. B  4. D  5. A  6. C  7. C  8. D  9. E  10. B
![Page 24: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/24.jpg)
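High-Level Intuitions for Copying Detection
Intuition I: decide dependence (w/o direction)
For shared data, $\Pr(\Phi(S_1) \mid S_1 \perp S_2)$ is low, e.g., incorrect data
$\Pr(\Phi(S_1) \mid S_1 \sim S_2) \gg \Pr(\Phi(S_1) \mid S_1 \perp S_2) \Rightarrow S_1 \sim S_2$
Intuition II: decide copying direction
Let F be a property function of the data (e.g., accuracy of data)
$|F(\Phi(S_1) \cap \Phi(S_2)) - F(\Phi(S_1) - \Phi(S_2))| > |F(\Phi(S_1) \cap \Phi(S_2)) - F(\Phi(S_2) - \Phi(S_1))| \Rightarrow S_1 \to S_2$
($\perp$: independent; $\sim$: copying, direction not specified; $S_1 \to S_2$: S1 copies from S2)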
![Page 25: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/25.jpg)
Copying? (Different Accuracy) John copies from Alice
Name: Alice  Score: 3
1. B  2. B  3. D  4. D  5. B  6. D  7. D  8. A  9. B  10. C
Name: John  Score: 1
1. B  2. B  3. D  4. D  5. B  6. C  7. C  8. D  9. E  10. B
![Page 26: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/26.jpg)
Copying? (Different Accuracy) Alice copies from John
Name: John  Score: 1
1. A  2. B  3. B  4. D  5. A  6. C  7. C  8. D  9. E  10. B
Name: Alice  Score: 3
1. A  2. B  3. B  4. D  5. A  6. D  7. B  8. A  9. B  10. C
![Page 27: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/27.jpg)
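Bayesian Analysis: Basic
[Figure: data items O.A provided by both S1 and S2, split into same values (TRUE: O.At; FALSE: O.Af) and different values (O.Ad)]
Observation: $\Phi$
Goal: $\Pr(S_1 \perp S_2 \mid \Phi)$, $\Pr(S_1 \sim S_2 \mid \Phi)$ (sum up to 1)
According to the Bayes Rule, we need to know $\Pr(\Phi \mid S_1 \perp S_2)$, $\Pr(\Phi \mid S_1 \sim S_2)$
Key: computing $\Pr(\Phi_{O.A} \mid S_1 \perp S_2)$, $\Pr(\Phi_{O.A} \mid S_1 \sim S_2)$ for each $O.A \in S_1 \cap S_2$
($\perp$: independent; $\sim$: copying)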
![Page 28: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/28.jpg)
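Bayesian Analysis: Probability Computation

| Pr | Independence | Copying |
|---|---|---|
| O.At | $(1-\varepsilon)^2$ | $(1-\varepsilon)\,c + (1-\varepsilon)^2(1-c)$ |
| O.Af | $\varepsilon^2/n$ | $\varepsilon\,c + \frac{\varepsilon^2}{n}(1-c)$ |
| O.Ad | $P_d = 1 - (1-\varepsilon)^2 - \varepsilon^2/n$ | $P_d(1-c)$ |

ε: error rate; n: #wrong-values; c: copy rate
For same values, the probability under copying is greater than under independence (>)
[Figure: data items O.A provided by both S1 and S2, split into same values (TRUE: O.At; FALSE: O.Af) and different values (O.Ad)]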
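The probability table can be turned into a small calculation. The following is a minimal sketch, not the authors' implementation: it assumes data items are conditionally independent given the hypothesis, uses a hypothetical prior `prior_copy`, and the default values for ε, n, and c are illustrative only. Observations are encoded as a string over `t` (same true value), `f` (same false value), and `d` (different values).

```python
from math import prod


def item_likelihoods(eps: float, n: int, c: float) -> dict:
    """Per-item probabilities of the three observation types under each hypothesis."""
    p_t = (1 - eps) ** 2      # both sources independently provide the true value
    p_f = eps ** 2 / n        # both independently provide the same false value
    p_d = 1 - p_t - p_f       # the two sources provide different values
    indep = {"t": p_t, "f": p_f, "d": p_d}
    copy = {
        "t": (1 - eps) * c + p_t * (1 - c),
        "f": eps * c + p_f * (1 - c),
        "d": p_d * (1 - c),
    }
    return {"indep": indep, "copy": copy}


def copy_posterior(observations: str, eps: float = 0.2, n: int = 10,
                   c: float = 0.8, prior_copy: float = 0.5) -> float:
    """Pr(copying | observations) by Bayes' rule over the two hypotheses."""
    lik = item_likelihoods(eps, n, c)
    l_indep = prod(lik["indep"][o] for o in observations)
    l_copy = prod(lik["copy"][o] for o in observations)
    return prior_copy * l_copy / (prior_copy * l_copy + (1 - prior_copy) * l_indep)


if __name__ == "__main__":
    for obs in ("ttt", "fff", "ddd"):
        print(obs, round(copy_posterior(obs), 3))
```

With these illustrative parameters, three shared false values push the posterior close to 1, three shared true values move it only modestly above the prior, and three differing values push it close to 0, which matches the intuition that shared errors are much stronger evidence than shared correct values.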
![Page 29: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/29.jpg)
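Considering Source Accuracy

| Pr | Independence | S1 Copies S2 | S2 Copies S1 |
|---|---|---|---|
| O.At | $P_t = (1-\varepsilon_{S_1})(1-\varepsilon_{S_2})$ | $(1-\varepsilon_{S_2})\,c + P_t(1-c)$ | $(1-\varepsilon_{S_1})\,c + P_t(1-c)$ |
| O.Af | $P_f = \varepsilon_{S_1}\varepsilon_{S_2}/n$ | $\varepsilon_{S_2}\,c + P_f(1-c)$ | $\varepsilon_{S_1}\,c + P_f(1-c)$ |
| O.Ad | $P_d = 1 - P_t - P_f$ | $P_d(1-c)$ | $P_d(1-c)$ |

For same values, the two copying directions give different probabilities (≠)
[Figure: data items O.A provided by both S1 and S2, split into same values (TRUE: O.At; FALSE: O.Af) and different values (O.Ad)]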
![Page 30: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/30.jpg)
Correctness of Data as Evidence for Copying
S1 S2
S3
S4
![Page 31: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/31.jpg)
Extending the Basic Technique
• Consider correctness of data [VLDB’09a]
• Consider additional evidence [VLDB’10a]
![Page 32: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/32.jpg)
Formatting as Evidence for Copying
S1 S2
S3
S4
Different formats
SubValues
![Page 33: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/33.jpg)
Extending the Basic Technique
• Consider correctness of data [VLDB’09a]
• Consider additional evidence [VLDB’10a]
• Consider correlated copying [VLDB’10a]
![Page 34: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/34.jpg)
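Correlated Copying

| | K | A1 | A2 | A3 | A4 |
|---|---|---|---|---|---|
| O1 | S | S | S | D | D |
| O2 | S | D | S | S | D |
| O3 | S | S | D | S | D |
| O4 | S | S | S | D | S |
| O5 | S | D | S | S | S |

17 same values, and 8 different values

| | K | A1 | A2 | A3 | A4 |
|---|---|---|---|---|---|
| O1 | S | S | S | S | S |
| O2 | S | S | S | S | S |
| O3 | S | S | S | S | S |
| O4 | S | D | D | D | D |
| O5 | S | D | D | D | D |

17 same values, and 8 different values
Copying
S: Two sources providing the same value
D: Two sources providing different values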
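The point of the two tables can be checked mechanically: the totals of S and D are identical, but the S values are spread differently across objects. The sketch below encodes each row as a string over columns K, A1 to A4; the helper names are made up for illustration.

```python
from collections import Counter

# Rows read off the two slide tables (columns K, A1, A2, A3, A4).
table_a = {"O1": "SSSDD", "O2": "SDSSD", "O3": "SSDSD", "O4": "SSSDS", "O5": "SDSSS"}
table_b = {"O1": "SSSSS", "O2": "SSSSS", "O3": "SSSSS", "O4": "SDDDD", "O5": "SDDDD"}


def totals(table: dict) -> tuple:
    """Return (#same, #different) over all cells of the table."""
    counts = Counter("".join(table.values()))
    return counts["S"], counts["D"]


def fully_shared_objects(table: dict) -> list:
    """Objects on which the two sources agree for every attribute."""
    return [obj for obj, row in table.items() if set(row) == {"S"}]


if __name__ == "__main__":
    for name, table in (("first", table_a), ("second", table_b)):
        print(name, totals(table), fully_shared_objects(table))
```

Both tables give 17 same and 8 different values, yet only the second has objects shared in full, so a per-cell count alone cannot separate the two cases.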
![Page 35: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/35.jpg)
Extending the Basic Technique
Local Detection Global Detection [VLDB’10a]
• Consider correctness of data [VLDB’09a]
• Consider additional evidence [VLDB’10a]
• Consider correlated copying [VLDB’10a]
• Consider updates [VLDB’09b]
![Page 36: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/36.jpg)
Multi-Source Copying? Co-copying? Transitive Copying?
• Multi-source copying: S1 {V1-V100}, S2 {V51-V130}, S3 {V1-V50, V101-V130}
• Co-copying: S1 {V1-V100}; S2 and S3 provide {V1-V50} and {V21-V70}
• Transitive copying: S1 {V1-V100}, S2 {V1-V50}, S3 {V21-V50, V81-V100} (V81-V100 are popular values)
![Page 37: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/37.jpg)
Multi-Source Copying? Co-copying? Transitive Copying?
Local copying detection results
• Multi-source copying: S1 {V1-V100}, S2 {V51-V130}, S3 {V1-V50, V101-V130}
• Co-copying: S1 {V1-V100}; S2 and S3 provide {V1-V50} and {V21-V70}
• Transitive copying: S1 {V1-V100}, S2 {V1-V50}, S3 {V21-V50, V81-V100} (V81-V100 are popular values)
![Page 38: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/38.jpg)
Multi-Source Copying? Co-copying? Transitive Copying?
- Looking at the copying probabilities?
• Multi-source copying: S1 {V1-V100}, S2 {V51-V130}, S3 {V1-V50, V101-V130}
• Co-copying: S1 {V1-V100}; S2 and S3 provide {V1-V50} and {V21-V70}
• Transitive copying: S1 {V1-V100}, S2 {V1-V50}, S3 {V21-V50, V81-V100} (V81-V100 are popular values)
![Page 39: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/39.jpg)
Multi-Source Copying? Co-copying? Transitive Copying?
X Looking at the copying probabilities?
- Counting shared values?
• Multi-source copying: S1 {V1-V100}, S2 {V51-V130}, S3 {V1-V50, V101-V130}
• Co-copying: S1 {V1-V100}; S2 and S3 provide {V1-V50} and {V21-V70}
• Transitive copying: S1 {V1-V100}, S2 {V1-V50}, S3 {V21-V50, V81-V100} (V81-V100 are popular values)
Edge labels (copying probabilities): 1 for every pair of sources in all three scenarios
![Page 40: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/40.jpg)
Multi-Source Copying? Co-copying? Transitive Copying?
X Looking at the copying probabilities?
X Counting shared values?
- Comparing the set of shared values?
• Multi-source copying: S1 {V1-V100}, S2 {V51-V130}, S3 {V1-V50, V101-V130}
• Co-copying: S1 {V1-V100}; S2 and S3 provide {V1-V50} and {V21-V70}
• Transitive copying: S1 {V1-V100}, S2 {V1-V50}, S3 {V21-V50, V81-V100} (V81-V100 are popular values)
Edge labels (number of shared values): 50, 50, 30 in each of the three scenarios
![Page 41: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/41.jpg)
Multi-Source Copying? Co-copying? Transitive Copying?
X Looking at the copying probabilities?
X Counting shared values?
- Comparing the set of shared values?
• Multi-source copying: S1 {V1-V100}, S2 {V51-V130}, S3 {V1-V50, V101-V130}; shared sets: V1-V50, V51-V100, V101-V130
• Co-copying: S1 {V1-V100}; S2 and S3 provide {V1-V50} and {V21-V70}; shared sets: V1-V50, V21-V70, V21-V50
• Transitive copying: S1 {V1-V100}, S2 {V1-V50}, S3 {V21-V50, V81-V100} (V81-V100 are popular values); shared sets: V1-V50, V21-V50, and V21-V50 with V81-V100
![Page 42: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/42.jpg)
Multi-Source Copying? Co-copying? Transitive Copying?
X Looking at the copying probabilities?
X Counting shared values?
X Comparing the set of shared values?
• Multi-source copying: S1 {V1-V100}, S2 {V51-V130}, S3 {V1-V50, V101-V130}; shared sets: V1-V50, V51-V100, V101-V130
• Co-copying: S1 {V1-V100}; S2 and S3 provide {V1-V50} and {V21-V70}; shared sets: V1-V50, V21-V70, V21-V50
• Transitive copying: S1 {V1-V100}, S2 {V1-V50}, S3 {V21-V50, V81-V100} (V81-V100 are popular values); shared sets: V1-V50, V21-V50, and V21-V50 with V81-V100
V21-V50 shared by 3 sources
We need to reason for each data item in a principled way!
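Why counting shared values fails can be seen with plain set arithmetic. The sketch below models the three scenarios as Python sets; in the co-copying case, which of S2 and S3 holds {V1-V50} is an assumption made only for illustration, and the helper names are hypothetical.

```python
from itertools import combinations


def vals(lo: int, hi: int) -> set:
    """The value ids V<lo> .. V<hi>, inclusive."""
    return set(range(lo, hi + 1))


scenarios = {
    "multi-source": {"S1": vals(1, 100), "S2": vals(51, 130),
                     "S3": vals(1, 50) | vals(101, 130)},
    # Assumption: S2 holds V1-V50 and S3 holds V21-V70.
    "co-copying": {"S1": vals(1, 100), "S2": vals(1, 50), "S3": vals(21, 70)},
    "transitive": {"S1": vals(1, 100), "S2": vals(1, 50),
                   "S3": vals(21, 50) | vals(81, 100)},
}


def pairwise_shared_counts(sources: dict) -> list:
    """Sorted sizes of the pairwise intersections."""
    return sorted(len(sources[a] & sources[b])
                  for a, b in combinations(sorted(sources), 2))


def shared_by_all(sources: dict) -> set:
    """Values provided by every source."""
    return set.intersection(*sources.values())


if __name__ == "__main__":
    for name, sources in scenarios.items():
        print(name, pairwise_shared_counts(sources), len(shared_by_all(sources)))
```

All three scenarios produce the same pairwise counts (30, 50, 50), so the counts cannot tell them apart; only looking at which individual values are shared, and by how many sources, separates them.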
![Page 43: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/43.jpg)
Global Copying Detection
1. Find a set of copyings R that significantly influence the rest of the copyings
   • Maximize [objective formula not preserved in the transcript]
   • Finding R is NP-complete
   • We propose a fast greedy algorithm
2. Adjust copying probability for the rest of the copyings: $P(S_1 \to S_2 \mid R)$
   • Replace $\Pr(\Phi_{O.A}(S_1) \mid S_1 \perp S_2)$ everywhere with $\Pr(\Phi_{O.A}(S_1) \mid S_1 \perp S_2, R)$, which considers sources that S1 copies from according to R and that provide the same value on O.A as S1
$\Pr(\Phi(S_1) \mid S_1 \to S_2) \gg \Pr(\Phi(S_1) \mid S_1 \perp S_2) \Rightarrow S_1 \to S_2$
![Page 44: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/44.jpg)
Multi-Source Copying? Co-copying? Transitive Copying?
• Multi-source copying: S1 {V1-V100}, S2 {V51-V130}, S3 {V1-V50, V101-V130}; shared sets: V1-V50, V51-V100, V101-V130
  R={S3→S1}, Pr(Φ(S3)) = Pr(Φ(S3)|R) for V101-V130
• Co-copying: S1 {V1-V100}; S2 and S3 provide {V1-V50} and {V21-V70}; shared sets: V1-V50, V21-V70, V21-V50
  R={S3→S1}, Pr(Φ(S3)) << Pr(Φ(S3)|R) for V21-V50
• Transitive copying: S1 {V1-V100}, S2 {V1-V50}, S3 {V21-V50, V81-V100} (V81-V100 are popular values); shared sets: V1-V50, V21-V50, and V21-V50 with V81-V100
  R={S3→S2}, Pr(Φ(S3)) << Pr(Φ(S3)|R) for V21-V50; Pr(Φ(S3)) is high for V81-V100
[Figure: edges of the three copying graphs marked X or ?]
![Page 45: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/45.jpg)
Experiment Setup
• 18 weather websites
• for 30 major USA cities
• collected every 45 minutes for a day
• 33 collections, so 990 objects
• 28 distinct attributes in total
![Page 46: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/46.jpg)
Silver Standard
![Page 47: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/47.jpg)
Experiment Results
Measure: Precision, Recall, F-measure
C: real copying; D: detected copying
$P = \frac{|C \cap D|}{|D|},\quad R = \frac{|C \cap D|}{|C|},\quad F = \frac{2PR}{P+R}$

| Methods | Precision | Recall | F-measure |
|---|---|---|---|
| Corr (Only correctness) | .5 | .43 | .46 |
| Enriched (More evidence) | 1 | .14 | .25 |
| Local (correlated copying) | .33 | .86 | .48 |
| Global (global detection) | .79 | .79 | .79 |

Callouts:
• Transitive/co-copying not removed
• Ignoring evidence from correlated copying
• Enriched improves over Corr when true/false notion does apply
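The three measures are simple set ratios. The sketch below computes them for sets of copying pairs and recomputes the F-measure column from the reported precision and recall; the function names and the example pairs are hypothetical.

```python
def precision_recall_f(real: set, detected: set) -> tuple:
    """P, R, F for a set of real copyings C and detected copyings D."""
    hits = len(real & detected)
    p = hits / len(detected) if detected else 0.0
    r = hits / len(real) if real else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def f_measure(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


if __name__ == "__main__":
    # Hypothetical copying pairs (copier, original), for illustration only.
    real = {("A", "B"), ("C", "D")}
    detected = {("A", "B"), ("E", "F")}
    print(precision_recall_f(real, detected))

    # Recompute F from the precision/recall values reported on the slide.
    reported = {"Corr": (0.5, 0.43), "Enriched": (1.0, 0.14),
                "Local": (0.33, 0.86), "Global": (0.79, 0.79)}
    for method, (p, r) in reported.items():
        print(method, round(f_measure(p, r), 2))
```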
![Page 48: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/48.jpg)
What Is Missing? (a.k.a. Future Work)
Local Detection Global Detection
• Consider correctness of data [VLDB’09a]
• Consider additional evidence [VLDB’10a]
• Consider correlated copying [VLDB’10a]
• Consider updates [VLDB’09b]
![Page 49: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/49.jpg)
What Is Missing? (a.k.a. Future Work)
Local Detection Global Detection
• Loop copying
• Copying by category
• Summarizing copying patterns
• Exploring evidence from schemas, tuple ordering, etc.
• Scalability
• Detecting opinion influence
• Hidden sources
• Global detection for dynamic data
![Page 50: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/50.jpg)
Solomon
Outline
Copying discovery
• Local detection [VLDB’09a]
• Global detection [VLDB’10a]
• Detection w. dynamic data [VLDB’09b]
Applications in data integration
• Truth discovery [VLDB’09a][VLDB’09b]
• Query answering [Submitted]
• Record linkage [VLDB’10b]
Visualization and decision explanation
• Visualization
• Decision explanation [VLDB’10 demo]
![Page 51: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/51.jpg)
Data Integration Faces 3 Challenges
Data Conflicts
Instance Heterogeneity
Structure Heterogeneity
![Page 52: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/52.jpg)
Data Integration Faces 3 Challenges
Data Conflicts
Instance Heterogeneity
Structure Heterogeneity
![Page 53: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/53.jpg)
Data Integration Faces 3 Challenges
Data Conflicts
Instance Heterogeneity
Structure Heterogeneity
Scissors
Paper Scissors
![Page 54: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/54.jpg)
Data Integration Faces 3 Challenges
Data Conflicts
Instance Heterogeneity
Structure Heterogeneity
Scissors
Glue
![Page 55: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/55.jpg)
Existing Solutions Assume Independence of Data Sources
Data Conflicts
• Data fusion
• Truth discovery
Instance Heterogeneity
• String matching (edit distance, token-based, etc.)
• Object matching (aka. record linkage, reference reconciliation, …)
Structure Heterogeneity
• Schema matching
• Model management
• Query answering using views
• Information extraction
Assume INDEPENDENCE of data sources
![Page 56: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/56.jpg)
Source Copying Adds A New Dimension to Data Integration
Data Conflicts
Instance Heterogeneity
Structure Heterogeneity
Data Fusion
• Truth discovery [VLDB’09a, VLDB’09b]
• Integrating probabilistic data
Record Linkage
• Improve record linkage
• Distinguish between wrong values and alternative representations [VLDB’10b]
Query Answering
• Query optimization [Submitted]
• Improve schema matching
Source Recommendation
• Recommend trustworthy, up-to-date, and independent sources
![Page 57: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/57.jpg)
Application I. Truth Discovery: Naïve Voting

| | S1 | S2 | S3 |
|---|---|---|---|
| Stonebraker | MIT | Berkeley | MIT |
| Dewitt | MSR | MSR | UWisc |
| Bernstein | MSR | MSR | MSR |
| Carey | UCI | AT&T | BEA |
| Halevy | Google | Google | UW |
![Page 58: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/58.jpg)
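Application I. Truth Discovery: Naïve Voting

| | S1 | S2 | S3 | S4 | S5 |
|---|---|---|---|---|---|
| Stonebraker | MIT | Berkeley | MIT | MIT | MS |
| Dewitt | MSR | MSR | UWisc | UWisc | UWisc |
| Bernstein | MSR | MSR | MSR | MSR | MSR |
| Carey | UCI | AT&T | BEA | BEA | BEA |
| Halevy | Google | Google | UW | UW | UW |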
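Naïve voting on this table is a per-object majority count. The sketch below runs it first on S1 to S3 only and then on all five sources; the function name is made up, and ties are broken by the first value encountered, which is an arbitrary choice.

```python
from collections import Counter

# Affiliation claims per researcher, in source order S1..S5 (from the slide table).
claims = {
    "Stonebraker": ["MIT", "Berkeley", "MIT", "MIT", "MS"],
    "Dewitt": ["MSR", "MSR", "UWisc", "UWisc", "UWisc"],
    "Bernstein": ["MSR", "MSR", "MSR", "MSR", "MSR"],
    "Carey": ["UCI", "AT&T", "BEA", "BEA", "BEA"],
    "Halevy": ["Google", "Google", "UW", "UW", "UW"],
}


def naive_vote(claims: dict, num_sources: int) -> dict:
    """Pick the most frequent value per object among the first num_sources sources."""
    return {obj: Counter(values[:num_sources]).most_common(1)[0][0]
            for obj, values in claims.items()}


if __name__ == "__main__":
    print(naive_vote(claims, 3))
    print(naive_vote(claims, 5))
```

Adding S4 and S5 changes the winner for Dewitt and Halevy and breaks the three-way tie for Carey, because every source counts as one full vote whether or not it provides its values independently.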
![Page 59: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/59.jpg)
Application I. Truth Discovery: Our Solution
Round 1 (same five-source table as on the previous slide)
Copying Relationship: [Figure: graph over S1-S5 with edge probabilities .87, .2, .2, .99, .99, .99]
Truth Discovery: [Figure: votes for Carey's affiliation: UCI (S1), AT&T (S2), BEA (S3, S4, S5), with discounted vote counts (1 - .99*.8 = .2) and (.2^2)]
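The discounting shown on this slide (1 - .99*.8 = .2, then .2 squared) can be sketched as follows. This is an illustration under stated assumptions, not the authors' algorithm: it assumes a copy rate c = 0.8, a hypothetical assignment of the .99 copying probabilities (S4 copies from S3; S5 copies from S3 and S4), and it ignores source accuracy.

```python
def discounted_vote_count(providers: list, copy_prob: dict, c: float = 0.8) -> float:
    """Total vote for one value, discounting each provider by the chance it copied.

    providers: sources providing the value, in the order they are counted.
    copy_prob: (copier, original) -> probability that copier copies from original.
    c: assumed probability that a copier copies a particular value.
    """
    total = 0.0
    counted = []
    for source in providers:
        weight = 1.0
        for earlier in counted:
            # Probability that `source` did NOT copy this value from `earlier`.
            weight *= 1 - c * copy_prob.get((source, earlier), 0.0)
        total += weight
        counted.append(source)
    return total


if __name__ == "__main__":
    # Hypothetical assignment of the .99 edge probabilities.
    copy_prob = {("S4", "S3"): 0.99, ("S5", "S3"): 0.99, ("S5", "S4"): 0.99}
    print(discounted_vote_count(["S3", "S4", "S5"], copy_prob))  # votes for BEA
    print(discounted_vote_count(["S1"], copy_prob))              # votes for UCI
```

Under these assumptions the three BEA providers contribute 1 + 0.208 + 0.208² votes, about 1.25 instead of 3, so near-certain copiers add very little to the count.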
![Page 60: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/60.jpg)
Application I. Truth Discovery: Our Solution
Round 2 (same five-source table as on the previous slide)
Copying Relationship: [Figure: graph over S1-S5 with edge probabilities .14, .49, .49, .49, .08, .49, .49, .49]
Truth Discovery: [Figure: votes for Carey's affiliation: UCI (S1), AT&T (S2), BEA (S3, S4, S5)]
![Page 61: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/61.jpg)
Application I. Truth Discovery: Our Solution
Round 3 (same five-source table as on the previous slide)
Copying Relationship: [Figure: graph over S1-S5 with edge probabilities .12, .49, .49, .49, .06, .49, .49, .49]
Truth Discovery: [Figure: votes for Carey's affiliation: UCI (S1), AT&T (S2), BEA (S3, S4, S5)]
![Page 62: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/62.jpg)
Application I. Truth Discovery: Our Solution
Round 4 (same five-source table as on the previous slide)
Copying Relationship: [Figure: graph over S1-S5 with edge probabilities .10, .48, .49, .50, .05, .49, .48, .50]
Truth Discovery: [Figure: votes for Carey's affiliation: UCI (S1), AT&T (S2), BEA (S3, S4, S5)]
![Page 63: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/63.jpg)
Application I. Truth Discovery: Our Solution
Round 5 (same five-source table as on the previous slide)
Copying Relationship: [Figure: graph over S1-S5 with edge probabilities .09, .47, .49, .51, .04, .49, .47, .51]
Truth Discovery: [Figure: votes for Carey's affiliation: UCI (S1), AT&T (S2), BEA (S3, S4, S5)]
![Page 64: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/64.jpg)
Application I. Truth Discovery: Our Solution
Round 13 (same five-source table as on the previous slide)
Copying Relationship: [Figure: graph over S1-S5 with edge probabilities .55, .49, .55, .49, .44, .44]
Truth Discovery: [Figure: votes for Carey's affiliation: UCI (S1), AT&T (S2), BEA (S3, S4, S5)]
![Page 65: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/65.jpg)
Application I. Truth Discovery (Con’t)
[Figure: iterative loop among Truth Discovery, Source-accuracy Computation, and Copying Detection, labeled Step 1, Step 2, Step 3]
Theorem: w/o accuracy, converges
Observation: w. accuracy, converges when #objs >> #srcs
![Page 66: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/66.jpg)
Experiment on Static Data [VLDB’09a]
Dataset: AbeBooks
• 877 bookstores
• 1265 CS books
• 24364 listings, w. ISBN, name, author-list
• After pre-cleaning, each book on avg has 19 listings and 4 author lists (ranges from 1-23)
Gold standard: 100 random books
• Manually check author list from book cover
Measure: Precision = #(Corr author lists)/#(All lists)
![Page 67: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/67.jpg)
Naïve Voting and Types of Errors
Naïve voting has precision .71

| Error type | Num |
|---|---|
| Missing authors | 23 |
| Additional authors | 4 |
| Mis-ordering | 3 |
| Mis-spelling | 2 |
| Incomplete names | 2 |
![Page 68: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/68.jpg)
Contributions of Various Components

| Methods | Prec | #Rnds | Time(s) |
|---|---|---|---|
| Naïve | .71 | 1 | .2 |
| Only value similarity | .74 | 1 | .2 |
| Only source accuracy | .79 | 23 | 1.1 |
| Only source copying | .83 | 3 | 28.3 |
| Copy+accu | .87 | 22 | 185.8 |
| Copy+accu+sim | .89 | 18 | 197.5 |

• Precision improves by 25.4% over Naïve
• Considering copying improves the results most
• Reasonably fast
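The 25.4% figure is a relative improvement of the best precision (.89) over the Naïve precision (.71). A quick check, with a hypothetical helper name:

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Relative gain of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100


if __name__ == "__main__":
    print(round(relative_improvement(0.71, 0.89), 1))  # 25.4
```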
![Page 69: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/69.jpg)
Experiment on Dynamic Data [VLDB’09b]
Dataset: Manhattan restaurants
• Data crawled from 12 restaurant websites
• 8 versions: weekly from 1/22/2009 to 3/12/2009
• 5269 restaurants, 5231 appearing in the first crawling and 5251 in the last crawling
• 467 restaurants deleted from some websites, 280 closed before 3/15/2009 (gold standard)
Measure: Precision, Recall, F-measure
G: really closed restaurants; D: detected closed restaurants
$P = \frac{|G \cap D|}{|D|},\quad R = \frac{|G \cap D|}{|G|},\quad F = \frac{2PR}{P+R}$
![Page 70: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/70.jpg)
Discovered Copying
Copying is likely between 12 out of 66 pairs of sources
![Page 71: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/71.jpg)
Contributions of Various Components

| Method | Ever-existing #Rest | Closed Prec | Closed Rec | Closed F-msr | #Rnds | Time(s) |
|---|---|---|---|---|---|---|
| ALL | - | .60 | 1.0 | .75 | - | - |
| ALL2 | - | .94 | .34 | .50 | - | - |
| Naïve | 1192 | .70 | .93 | .80 | 1 | 158 |
| Quality | 5068 | .83 | .88 | .85 | 7 | 637 |
| CopyQua | 5186 | .86 | .87 | .86 | 6 | 1408 |
| Google | - | .84 | .19 | .30 | - | - |

• Quality and CopyQua obtain high precision and recall
• Applying rules is inadequate
• Naïve missed a lot of restaurants
• Google Map listed a lot of out-of-business restaurants
![Page 72: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/72.jpg)
Application II. Query Optimization in DI
[Figure: copying graph over sources S1 {V1-V100}, S2 {V101-V200}, S3, S4, S5, S6, with value sets {V201-V250} and {V251-V300} and copying percentages 50%, 50%, 100%, 100%, 100%, 100%, 80%]
Minimize #sources: {S5, S6}
Minimize #tuples: {S3, S4, S5}
![Page 73: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/73.jpg)
Key Problems in IDS
Goal: return only independently provided data
Key problems
• Coverage: fraction of answers returned by a subset of sources
• Cost minimization: minimal set of sources to retrieve all answers
• Maximum coverage: set of sources to retrieve the maximum set of answers under a resource bound
• Source ordering: best ordering of data sources to provide more answers quickly
![Page 74: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/74.jpg)
Complexity of Computing Coverage

| Copying type | Exact Solution | (ε, δ)-Approximation |
|---|---|---|
| Copy a fraction of data | #P-complete | O(LNE) |
| Copy all data | O(N + E) | N/A |
| Copy w. select predicate | Attr. Dep: O((2bE)^k (N + E)); Attr. Indep: O(bkE(N + E)) | N/A |

N: #sources; E: #copyings; L = log(1/δ) / ε²
k: #attributes w. selection predicates
b: maximum number of constants in predicates for each attribute for each copying
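To give a feel for what an (ε, δ)-approximation of coverage looks like, here is a minimal Monte Carlo sketch. It rests on several assumptions that are not in the slides: one original source provides `universe_size` tuples, each copier copies every tuple independently with its own probability, and the number of samples comes from Hoeffding's inequality for an additive error ε, which is not necessarily the L defined on the slide. All function and parameter names are hypothetical.

```python
import math
import random


def hoeffding_samples(eps, delta):
    """Samples needed so the mean of [0, 1]-valued draws is within eps of its
    expectation with probability at least 1 - delta (Hoeffding's inequality)."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))


def sample_coverage(universe_size, copy_probs, rng):
    """One random world: the fraction of tuples copied by at least one copier."""
    covered = 0
    for _ in range(universe_size):
        if any(rng.random() < p for p in copy_probs):
            covered += 1
    return covered / universe_size


def estimate_coverage(universe_size, copy_probs, eps, delta, seed=0):
    """Average the sampled coverage over the Hoeffding number of random worlds."""
    rng = random.Random(seed)
    n = hoeffding_samples(eps, delta)
    total = sum(sample_coverage(universe_size, copy_probs, rng) for _ in range(n))
    return total / n, n


# Two hypothetical copiers that copy 50% and 80% of the same 100 tuples.
estimate, samples = estimate_coverage(100, [0.5, 0.8], eps=0.05, delta=0.01)
print(samples, round(estimate, 3))
```

Under these assumptions the exact expected coverage is 1 − (1 − 0.5)(1 − 0.8) = 0.9, so the estimate can be checked against a known value.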
![Page 75: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/75.jpg)
Complexity of Source Selection/Ordering Problems

| Problem | Exact Solution | Approximation (w. PTIME coverage solution) |
|---|---|---|
| Cost Minimization | NP-complete, MaxSNP-hard | log α-approx |
| Maximum Coverage | PP-hard | (1 − 1/e)-approx |
| Source Ordering | PP-hard | 2-approx |
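The slides do not say which algorithms reach these approximation ratios. The sketch below uses the textbook greedy heuristic, which is known to give a (1 − 1/e) guarantee for maximum coverage and a logarithmic guarantee for set cover. It assumes the simplest coverage model, where the coverage of a set of sources is the size of the union of their tuple sets; the source names and data are made up for illustration.

```python
def greedy_max_coverage(sources, budget):
    """Pick up to `budget` sources, each time taking the one that adds the
    most not-yet-covered tuples. `sources` maps a name to a set of tuples."""
    remaining = dict(sources)
    chosen, covered = [], set()
    for _ in range(budget):
        if not remaining:
            break
        best = max(remaining, key=lambda name: len(remaining[name] - covered))
        if not remaining[best] - covered:
            break  # no remaining source adds a new tuple
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered


def greedy_cost_minimization(sources):
    """Greedy set cover: keep adding sources until every tuple is covered."""
    chosen, _ = greedy_max_coverage(sources, budget=len(sources))
    return chosen


# Hypothetical sources and tuple identifiers.
sources = {
    "A": {1, 2, 3, 4},
    "B": {3, 4, 5},
    "C": {5, 6},
    "D": {1, 6},
}
print(greedy_max_coverage(sources, budget=2))
print(greedy_cost_minimization(sources))
```

With this data the greedy choice is A first (four new tuples) and then C (two new tuples), which already covers all six tuples. The order in which sources are picked also gives one simple answer to the source-ordering problem under the same union-size assumption.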
![Page 76: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/76.jpg)
What is Missing (a.k.a. Future Work)
Data Conflicts / Instance Heterogeneity / Structure Heterogeneity
Data Fusion
• Truth discovery [VLDB’09a, VLDB’09b]
• Integrating probabilistic data
Record Linkage
• Improve record linkage
• Distinguish between wrong values and alternative representations [VLDB’10b]
Query Answering
• Query optimization [Submitted]
• Improve schema matching
Source Recommendation
• Recommend trustworthy, up-to-date, and independent sources
![Page 77: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/77.jpg)
Solomon
Outline
Copying discovery
• Local detection [VLDB’09a]
• Global detection [VLDB’10a]
• Detection w. dynamic data [VLDB’09b]
Applications in data integration
• Truth discovery [VLDB’09a][VLDB’09b]
• Query answering [Submitted]
• Record linkage [VLDB’10b]
Visualization and decision explanation
• Visualization
• Decision explanation [VLDB’10 demo]
![Page 78: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/78.jpg)
Copying of AbeBooks Data
AbeBooks data set: 877 bookstores, 1265 CS books, 24364 listings
Copying between 465 pairs of sources
![Page 79: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/79.jpg)
A Picture Is Worth a Thousand Words [VLDB’10 Demo]
![Page 80: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/80.jpg)
Demo Here
![Page 81: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/81.jpg)
Future Work: Explaining Copying-Detection Decisions
Provide the simplest, understandable explanation for Bayesian analysis
A copying detection decision is complex
• Why copying?
• Why a particular copying pattern (per-object copying vs. per-attribute copying)?
• Why a particular copying direction?
• Why the local decision is different from the global decision?
Answer “what-if” questions
• What if the two sources actually use the same format for those common values?
• What if there is a hidden source that S1 and S2 both copy from?
Answer “comparison” questions
• Why S1 is a copier of S2 but not a copier of S3?
• Why S1 has copied attributes “title” but not “authors”?
![Page 82: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/82.jpg)
Related Work
Copying detection
• Texts/Programs [Schleimer et al., 03][Buneman, 71]
• Videos [Law-To et al., 07]
• Structured sources
  [Dong et al., 09a] [Dong et al., 09b]: Local decision
  [Blanco et al., 10]: Assume a copier must copy all attribute values of an object
Data provenance [Buneman et al., PODS’08]
• Focus on effective presentation and retrieval
• Assume knowledge of provenance/lineage
![Page 83: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/83.jpg)
Take-Aways
• Copying is common on the Web
• Detecting copying for structured data is possible and beneficial
• Next step: reduce redundancy for quality
  How many sources are sufficient?
  How to help a user effectively explore the sources?
![Page 84: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/84.jpg)
Acknowledgements
Divesh Srivastava (AT&T Research)
Alon Halevy (Google)
Yifan Hu (AT&T Research)
Laure Berti-Equille (Univ de Rennes 1)
Remi Zajac (AT&T Interactive)
Songtao Guo (AT&T Interactive)
Xuan Liu (Singapore National Univ.)
Pei Li (Univ di Milano-Bicocca)
Amelie Marian (Rutgers Univ.)
Andrea Maurino (Univ di Milano-Bicocca)
Anish Das Sarma (Yahoo!)
Ordered by the amount of time spent at AT&T
![Page 85: Solomon: Seeking the Truth Via Copying Detection](https://reader036.fdocuments.us/reader036/viewer/2022070500/568168af550346895ddf6bd7/html5/thumbnails/85.jpg)
SOLOMON: SEEKING THE TRUTH VIA COPYING
DETECTION
http://www2.research.att.com/~yifanhu/SourceCopying/