Data Integration with Uncertainty
description
Transcript of Data Integration with Uncertainty
![Page 1: Data Integration with Uncertainty](https://reader035.fdocuments.us/reader035/viewer/2022062315/56814cfc550346895dba19bb/html5/thumbnails/1.jpg)
Data Integration with Uncertainty
Xin (Luna) Dong Alon Halevy Cong YuUniv. of Washington Google Inc.Univ. of Michigan
@ VLDB 2007
![Page 2: Data Integration with Uncertainty](https://reader035.fdocuments.us/reader035/viewer/2022062315/56814cfc550346895dba19bb/html5/thumbnails/2.jpg)
Many Applications Need to Manage Heterogeneous Data Sources
D1
D2
D3
D4
D5
![Page 3: Data Integration with Uncertainty](https://reader035.fdocuments.us/reader035/viewer/2022062315/56814cfc550346895dba19bb/html5/thumbnails/3.jpg)
Traditional Data Integration Systems
D1
D2
D3
D4
D5
Mediated Schema
SELECT P.title AS title, P.year AS year, A.name AS authorFROM Author AS A, Paper AS P, AuthoredBy AS BWHERE A.aid=B.aid AND P.pid=B.pid
Author(aid, name)Paper(pid, title, year)AuthoredBy(aid,pid)
Publication(title, year, author)
![Page 4: Data Integration with Uncertainty](https://reader035.fdocuments.us/reader035/viewer/2022062315/56814cfc550346895dba19bb/html5/thumbnails/4.jpg)
Querying on Traditional Data Integration Systems
D1
D2
D3
D4
D5
Mediated Schema
QQQ
Q1
Q2Q4
Q
Q5
Q3
![Page 5: Data Integration with Uncertainty](https://reader035.fdocuments.us/reader035/viewer/2022062315/56814cfc550346895dba19bb/html5/thumbnails/5.jpg)
Uncertainty Can Occur at Three Levels in Data Integration Applications
D1
D2
D3
D4
D5
Mediated Schema
I. Data Level
II. Mapping Level
III. Query Level
![Page 6: Data Integration with Uncertainty](https://reader035.fdocuments.us/reader035/viewer/2022062315/56814cfc550346895dba19bb/html5/thumbnails/6.jpg)
Mediated Schema
Q
Data Integration with Uncertainty
D1
D2
D3
D4
D5
I. Probabilistic Model
PD1
PD2
PD3
PD4
PD5
PM1
PM2
PM3
PM4
PM5
Q1..Qm
II. Keyword QueryReformulationIII. Top-K Query
Answering
![Page 7: Data Integration with Uncertainty](https://reader035.fdocuments.us/reader035/viewer/2022062315/56814cfc550346895dba19bb/html5/thumbnails/7.jpg)
Mediated Schema
Q
Data Integration with Uncertainty
D1
D2
D3
D4
D5PD1
PD2
PD3
PD4
PD5
PM1
PM2
PM3
PM4
PM5
Q1..QmFocus of the paper: Probabilistic schema mappings
![Page 8: Data Integration with Uncertainty](https://reader035.fdocuments.us/reader035/viewer/2022062315/56814cfc550346895dba19bb/html5/thumbnails/8.jpg)
Example Probabilistic Mappings
Mediated Schema
T(name, email, mailing-addr, home-addr, office-addr)
Q: SELECT mailing-addr FROM T
Q1: SELECT current-addr FROM S
Q2: SELECT permanent-addr FROM S
Q3: SELECT email-addr FROM S
PName Email-addr Current-addr Permanent-addr
Alice alice@ Mountain View Sunnyvale
Bob bob@ Sunnyvale Sunnyvale
S
T(name, email, mailing-addr, home-addr, office-addr) 0.5
S(pname, email-addr, current-addr, permanent-addr)
m1
T(name, email, mailing-addr, home-addr, office-addr) 0.4
S(pname, email-addr, current-addr, permanent-addr)
m2
T(name, email, mailing-addr, home-addr, office-addr) 0.1
S(pname, email-addr, current-addr, permanent-addr)
m3
Tuple Prob
(‘Sunnyvale’)
(‘Mountain View’)
(‘alice@’)
(‘bob@’)
0.9
0.5
0.1
0.1
![Page 9: Data Integration with Uncertainty](https://reader035.fdocuments.us/reader035/viewer/2022062315/56814cfc550346895dba19bb/html5/thumbnails/9.jpg)
Q1: SELECT current-addr FROM S
Q2: SELECT permanent-addr FROM S
Mediated Schema
T(name, email, mailing-addr, home-addr, office-addr)
Q: SELECT mailing-addr FROM T
Q1: SELECT current-addr FROM S
Q2: SELECT permanent-addr FROM S
Q3: SELECT email-addr FROM S
PName Email-addr Current-addr Permanent-addr
Alice alice@ Mountain View Sunnyvale
Bob bob@ Sunnyvale Sunnyvale
S
T(name, email, mailing-addr, home-addr, office-addr) 0.5
S(pname, email-addr, current-addr, permanent-addr)
m1
T(name, email, mailing-addr, home-addr, office-addr) 0.4
S(pname, email-addr, current-addr, permanent-addr)
m2
T(name, email, mailing-addr, home-addr, office-addr) 0.1
S(pname, email-addr, current-addr, permanent-addr)
m3
Tuple Prob
(‘Sunnyvale’)
(‘Mountain View’)
(‘alice@’)
(‘bob@’)
0.9
0.5
0.1
0.1
Top-k Query Answering w.r.t. Probabilistic Mappings
Tuple Prob
(‘Sunnyvale’)(‘Mountain View’)
(‘alice@’)
(‘bob@’)
0.90.5
0.1
0.1
![Page 10: Data Integration with Uncertainty](https://reader035.fdocuments.us/reader035/viewer/2022062315/56814cfc550346895dba19bb/html5/thumbnails/10.jpg)
Contributions Definition of probabilistic mappings
Semantics: by-table v.s. by-tuple Complexity of query answering with respect to
probabilistic mappings
Concise representations of probabilistic mappings More expressive extensions to probabilistic
mappings
By-table By-tuple
Data Complexity PTIME #P-complete
Mapping Complexity
PTIME PTIME
![Page 11: Data Integration with Uncertainty](https://reader035.fdocuments.us/reader035/viewer/2022062315/56814cfc550346895dba19bb/html5/thumbnails/11.jpg)
Outline Motivation and overview of our results Definition of probabilistic mappings Complexity of query answering
Concise representations of probabilistic mappings
More expressive extensions to probabilistic mappings
By-table By-tuple
Data Complexity PTIME #P-complete
Mapping Complexity
PTIME PTIME
![Page 12: Data Integration with Uncertainty](https://reader035.fdocuments.us/reader035/viewer/2022062315/56814cfc550346895dba19bb/html5/thumbnails/12.jpg)
Schema Mapping
S=(pname, email-addr, home-addr, office-addr)
T=(name, mailing-addr)
Mappings: one-to-one schema matching Queries: select-project-join queries
![Page 13: Data Integration with Uncertainty](https://reader035.fdocuments.us/reader035/viewer/2022062315/56814cfc550346895dba19bb/html5/thumbnails/13.jpg)
Probabilistic Mapping
S=(pname, email-addr, home-addr, office-addr)
T=(name, mailing-addr)
Possible MappingProbabil
ity{(pname,name),(home-addr, mailing-addr)}
0.5
{(pname,name),(office-addr, mailing-addr)}
0.4
{(pname,name),(email-addr, mailing-addr)}
0.1
![Page 14: Data Integration with Uncertainty](https://reader035.fdocuments.us/reader035/viewer/2022062315/56814cfc550346895dba19bb/html5/thumbnails/14.jpg)
By-Table v.s. By-Tuple Semantics
![Page 15: Data Integration with Uncertainty](https://reader035.fdocuments.us/reader035/viewer/2022062315/56814cfc550346895dba19bb/html5/thumbnails/15.jpg)
By-Table v.s. By-Tuple Semantics
pname
email-addr
home-addroffice-addr
Alice alice@Mountain
ViewSunnyvale
Bob bob@ Sunnyvale Sunnyvale
Ds=
name
mailing-addr
AliceMountain
View
Bob Sunnyvale
DT=nam
emailing-
addr
Alice Sunnyvale
Bob Sunnyvale
name
mailing-addr
Alice alice@
Bob bob@ Pr(m1)=0.5 Pr(m2)=0.4 Pr(m3)=0.1
![Page 16: Data Integration with Uncertainty](https://reader035.fdocuments.us/reader035/viewer/2022062315/56814cfc550346895dba19bb/html5/thumbnails/16.jpg)
By-Table v.s. By-Tuple Semantics
pname
email-addr
home-addroffice-addr
Alice alice@Mountain
ViewSunnyvale
Bob bob@ Sunnyvale Sunnyvale
Ds=
name
mailing-addr
AliceMountain
View
Bob bob@
DT=nam
emailing-
addr
AliceSunnyval
e
Bob bob@
name
mailing-addr
Alice alice@
Bob bob@ Pr(<m1,m3>)=0.05 Pr(<m2,m3>)=0.04 Pr(<m3,m3>)=0.01
…
![Page 17: Data Integration with Uncertainty](https://reader035.fdocuments.us/reader035/viewer/2022062315/56814cfc550346895dba19bb/html5/thumbnails/17.jpg)
By-Table Query Answering
pname
email-addr
home-addroffice-addr
Alice alice@Mountain
ViewSunnyvale
Bob bob@ Sunnyvale Sunnyvale
Ds=
name
mailing-addr
AliceMountain
View
Bob Sunnyvale
DT=nam
emailing-
addr
AliceSunnyval
e
BobSunnyval
e
name
mailing-addr
Alice alice@
Bob bob@ 0.5 0.4 0.1
SELECT mailing-addr
FROM T
Tuple Probability
(‘Sunnyvale’) 0.9
(‘Mountain View’) 0.5
(‘alice@’) 0.1
(‘bob@’) 0.1
![Page 18: Data Integration with Uncertainty](https://reader035.fdocuments.us/reader035/viewer/2022062315/56814cfc550346895dba19bb/html5/thumbnails/18.jpg)
By-Tuple Query Answering
pname
email-addr
home-addroffice-addr
Alice alice@Mountain
ViewSunnyvale
Bob bob@ Sunnyvale Sunnyvale
Ds=
name
mailing-addr
AliceMountain
View
Bob bob@
DT=nam
emailing-
addr
AliceSunnyval
e
Bob bob@
name
mailing-addr
Alice alice@
Bob bob@ 0.05 0.04 0.01
Tuple Probability
(‘Sunnyvale’) 0.94
(‘Mountain View’) 0.5
(‘alice@’) 0.1
(‘bob@’) 0.1
…
SELECT mailing-addr
FROM T
![Page 19: Data Integration with Uncertainty](https://reader035.fdocuments.us/reader035/viewer/2022062315/56814cfc550346895dba19bb/html5/thumbnails/19.jpg)
Outline Motivation and overview of our results Definition of probabilistic mappings Complexity of query answering
Concise representations of probabilistic mappings
More expressive extensions to probabilistic mappings
By-table By-tuple
Data Complexity PTIME #P-complete
Mapping Complexity
PTIME PTIME
![Page 20: Data Integration with Uncertainty](https://reader035.fdocuments.us/reader035/viewer/2022062315/56814cfc550346895dba19bb/html5/thumbnails/20.jpg)
By-Table Query Answering
SELECT mailing-addr FROM T
SELECT home-addr FROM S (0.5)
SELECT office-addr FROM S (0.4)
SELECT email-addr FROM S (0.1)
Theorem: Query answering in by-table semantics is in PTIME in the size of the data and the size of the mapping
Query Rewriting
![Page 21: Data Integration with Uncertainty](https://reader035.fdocuments.us/reader035/viewer/2022062315/56814cfc550346895dba19bb/html5/thumbnails/21.jpg)
By-Tuple Query Answering
pname email-addr home-addr office-addr
Alice alice@ Mountain View Sunnyvale
Bob bob@ Sunnyvale San Jose
Ds=
DT=
name
mailing-addr
Alice Sunnyvale
Bob Sunnyvale0.2
…
SELECT mailing-addr
FROM T
Theorem: Query answering in by-tuple semantics is #P-complete in the size of the data, and in PTIME in the size of the mapping
Target Enumeration name
mailing-addr
Alice Sunnyvale
Bob @bob0.04
name mailing-addr
Alice alice@
name mailing-addr
AliceMountain
View
Bob @bob0.05
![Page 22: Data Integration with Uncertainty](https://reader035.fdocuments.us/reader035/viewer/2022062315/56814cfc550346895dba19bb/html5/thumbnails/22.jpg)
More on By-Tuple Query Answering The high complexity comes from computing
probabilitiesTheorem: Even computing the probability for
one possible answer is #P-hardTheorem: Computing all possible answers w/o
probabilities is in PTIME In general query answering cannot be done by
query rewriting There are two subsets of queries that can be
answered in PTIME by query rewriting
![Page 23: Data Integration with Uncertainty](https://reader035.fdocuments.us/reader035/viewer/2022062315/56814cfc550346895dba19bb/html5/thumbnails/23.jpg)
PTIME for Queries with a Single P-Mapping Target
Answers
(‘Mountain View’)
(‘Sunnyvale’)
SELECT mailing-addr FROM T
SELECT home-addr FROM S
SELECT office-addr FROM S
SELECT email-addr FROM S
Answers
(‘Sunnyvale’)
(‘Sunnyvale’)
Answers
(‘alice@’)
(‘bob@’)
Pr=0.5 Pr=0.4 Pr=0.1
TupleProbabil
ity
(‘Sunnyvale’) 0.94
(‘Mountain View’)
0.5
(‘alice@’) 0.1
(‘bob@’) 0.1
=1-(1-0.4)(1-0.5-0.4)=1-(1-0.5)(1-0)
![Page 24: Data Integration with Uncertainty](https://reader035.fdocuments.us/reader035/viewer/2022062315/56814cfc550346895dba19bb/html5/thumbnails/24.jpg)
PTIME for Queries that Return Join Attributes
SELECT mailing-addr FROM T,VWHERE T.mailing-addr = V.hightech
SELECT mailing-addr FROM T
SELECT hightechFROM V
TupleProbabil
ity
(‘Sunnyvale’) 0.94
(‘Mountain View’)
0.5
(‘alice@’) 0.1
(‘bob@’) 0.1
=0.94*0.8
TupleProbabil
ity
(‘Sunnyvale’) 0.8
(‘Mountain View’)
0.8
TupleProbabil
ity
(‘Sunnyvale’) 0.752
(‘Mountain View’)
0.4
![Page 25: Data Integration with Uncertainty](https://reader035.fdocuments.us/reader035/viewer/2022062315/56814cfc550346895dba19bb/html5/thumbnails/25.jpg)
Query Rewriting Does Not Apply to Queries that Do NOT Return Join Attributes
SELECT ‘true’ FROM T,VWHERE T.mailing-addr = V.hightech
SELECT mailing-addr FROM T
SELECT hightechFROM V
TupleProbabil
ity
(‘Sunnyvale’) 0.94
(‘Mountain View’)
0.5
(‘alice@’) 0.1
(‘bob@’) 0.1
≠0.94*0.8+0.5*0.8
TupleProbabil
ity
(‘Sunnyvale’) 0.8
(‘Mountain View’)
0.8
TupleProbabilit
y
(‘true’) 0.864
![Page 26: Data Integration with Uncertainty](https://reader035.fdocuments.us/reader035/viewer/2022062315/56814cfc550346895dba19bb/html5/thumbnails/26.jpg)
Outline Motivation and overview of our results Definition of probabilistic mappings Complexity of query answering
Concise representations of probabilistic mappings
More expressive extensions to probabilistic mappings
By-table By-tuple
Data Complexity PTIME #P-complete
Mapping Complexity
PTIME PTIME
![Page 27: Data Integration with Uncertainty](https://reader035.fdocuments.us/reader035/viewer/2022062315/56814cfc550346895dba19bb/html5/thumbnails/27.jpg)
Group Probabilistic Mapping Intuition: Divide a p-mapping into a set of
independent mappings on disjoint attributes
Theorem: Query answering is in PTIME in the size of the group probabilistic mapping
S(a, b, c)
T(a’, b’, c’)p=0.72
m1 S(a, b, c)
T(a’, b’, c’)p=0.18
m2 S(a, b, c)
T(a’, b’, c’)p=0.08
m3 S(a, b, c)
T(a’, b’, c’)p=0.02
m4
S(a, b)
T(a’, b’)p=0. 8
m’1 S(a, b)
T(a’, b’)p=0. 2
m’2 S(c)
T(c’)p=0. 9
m’’1 S(c)
T(c’)p=0. 1
m’’2
![Page 28: Data Integration with Uncertainty](https://reader035.fdocuments.us/reader035/viewer/2022062315/56814cfc550346895dba19bb/html5/thumbnails/28.jpg)
Probabilistic Correspondences
S(a, b, c)
T(a’, b’, c’)p=0.72
m1 S(a, b, c)
T(a’, b’, c’)p=0.18
m2
S(a, b, c)
T(a’, b’, c’)p=0.08
m3 S(a, b, c)
T(a’, b’, c’)p=0.02
m4
Attr Correspondence
Probability
(a, a’) 0.8
(a, b’) 0.2
(b, b’) 0.8
(c, c’) 0.9
Intuition: Compute marginal probabilities for attribute pairs
Theorem: Query answering is in PTIME in the size of the probabilistic correspondence
Many-to-one relationship between probabilistic mappings and probabilistic correspondences
We can use probabilistic correspondences to answer ONLY queries that contain no more than one attribute
![Page 29: Data Integration with Uncertainty](https://reader035.fdocuments.us/reader035/viewer/2022062315/56814cfc550346895dba19bb/html5/thumbnails/29.jpg)
Bayes Nets Representation Theorem: When we encode probabilistic mappings
using a Bayes Net, query answering can be exponential in the size of the representation
Example:S(s1, s2, …, sn, s1’, s2’,…, sn’)
T(t1, t2, …, tn)
--Pr(s1, t1) = Pr(s1’, t1) = .5
--if s1 maps to t1, then for 2<=i<=n, si maps to ti with probability .9 and si’ maps to ti with probability .1; and the other way around
Original representation: O(2n); Bayes representation: O(n)
![Page 30: Data Integration with Uncertainty](https://reader035.fdocuments.us/reader035/viewer/2022062315/56814cfc550346895dba19bb/html5/thumbnails/30.jpg)
Outline Motivation and overview of our approach Definition of probabilistic mappings Complexity of query answering
Concise representations of probabilistic mappings
More expressive extensions to probabilistic mappings
By-table By-tuple
Data Complexity PTIME #P-complete
Mapping Complexity
PTIME PTIME
![Page 31: Data Integration with Uncertainty](https://reader035.fdocuments.us/reader035/viewer/2022062315/56814cfc550346895dba19bb/html5/thumbnails/31.jpg)
Extensions to More Expressive Mappings The complexity results for query answering carry over to
three extensions to more expressive mappings Complex mappings
E.g., address street, city and state GLAV mappings
E.g., Paper Authorship Author Publication Conditional mappings:
E.g., if age>65, Pr(home-addr mailing-addr)=0.8 if age<=65, Pr(home-addr mailing-addr)=0.5
![Page 32: Data Integration with Uncertainty](https://reader035.fdocuments.us/reader035/viewer/2022062315/56814cfc550346895dba19bb/html5/thumbnails/32.jpg)
Related Work Using top-k schema mappings can increase the
recall of query answering w/o sacrificing precision much [Magnani&Montesi, 2007]
Using top-k schema mappings to improve schema mapping [Gal, 2006]
Using probabilities to model overlaps between data sources [Florescu et al, 1997]
Probabilistic Databases (Tutorial [Suciu&Dalvi, Sigmod’05])
![Page 33: Data Integration with Uncertainty](https://reader035.fdocuments.us/reader035/viewer/2022062315/56814cfc550346895dba19bb/html5/thumbnails/33.jpg)
Conclusions Contributions
Data integration with uncertainty Probabilistic mappings and query answering in their
presence
Future work Incorporate uncertainty on keyword reformulation and
dirty data Use the probabilistic model to improve schema
mappings Build a real system
By-table By-tuple
Data Complexity PTIME #P-complete
Mapping Complexity PTIME PTIME
![Page 34: Data Integration with Uncertainty](https://reader035.fdocuments.us/reader035/viewer/2022062315/56814cfc550346895dba19bb/html5/thumbnails/34.jpg)
Data Integration with Uncertainty
Xin (Luna) Dong Alon Halevy Cong YuUniv. of Washington Google Inc.Univ. of Michigan
@ VLDB 2007