Data Integration with Uncertainty

Xin (Luna) Dong (Univ. of Washington), Alon Halevy (Google Inc.), Cong Yu (Univ. of Michigan), VLDB 2007

Transcript of Data Integration with Uncertainty

Page 1: Data Integration  with Uncertainty

Data Integration with Uncertainty

Xin (Luna) Dong (Univ. of Washington), Alon Halevy (Google Inc.), Cong Yu (Univ. of Michigan)

@ VLDB 2007

Page 2: Data Integration  with Uncertainty

Many Applications Need to Manage Heterogeneous Data Sources

[Figure: heterogeneous data sources D1, D2, D3, D4, D5]

Page 3: Data Integration  with Uncertainty

Traditional Data Integration Systems

[Figure: data sources D1-D5 beneath a mediated schema]

Mediated Schema:
Publication(title, year, author)

Source relations:
Author(aid, name)
Paper(pid, title, year)
AuthoredBy(aid, pid)

SELECT P.title AS title, P.year AS year, A.name AS author
FROM Author AS A, Paper AS P, AuthoredBy AS B
WHERE A.aid=B.aid AND P.pid=B.pid

Page 4: Data Integration  with Uncertainty

Querying on Traditional Data Integration Systems

[Figure: a query Q over the mediated schema and queries Q1-Q5 over data sources D1-D5]

Page 5: Data Integration  with Uncertainty

Uncertainty Can Occur at Three Levels in Data Integration Applications

[Figure: data sources D1-D5, the mediated schema, and queries Q, annotated with three levels of uncertainty]

I. Data Level

II. Mapping Level

III. Query Level

Page 6: Data Integration  with Uncertainty

Data Integration with Uncertainty

[Figure: query Q over the mediated schema, queries Q1..Qm, and data sources D1-D5 labeled PD1-PD5 and PM1-PM5]

I. Probabilistic Model
II. Keyword Query Reformulation
III. Top-K Query Answering

Page 7: Data Integration  with Uncertainty

Data Integration with Uncertainty

[Figure: query Q over the mediated schema, queries Q1..Qm, and data sources D1-D5 labeled PD1-PD5 and PM1-PM5]

Focus of the paper: Probabilistic schema mappings

Page 8: Data Integration  with Uncertainty

Example Probabilistic Mappings

Mediated Schema: T(name, email, mailing-addr, home-addr, office-addr)

Q: SELECT mailing-addr FROM T

Source table S:
PName   Email-addr   Current-addr    Permanent-addr
Alice   alice@       Mountain View   Sunnyvale
Bob     bob@         Sunnyvale       Sunnyvale

Possible mappings from S(pname, email-addr, current-addr, permanent-addr) to T(name, email, mailing-addr, home-addr, office-addr):
m1 (probability 0.5), Q1: SELECT current-addr FROM S
m2 (probability 0.4), Q2: SELECT permanent-addr FROM S
m3 (probability 0.1), Q3: SELECT email-addr FROM S

Tuple               Prob
(‘Sunnyvale’)       0.9
(‘Mountain View’)   0.5
(‘alice@’)          0.1
(‘bob@’)            0.1

Page 9: Data Integration  with Uncertainty

Top-k Query Answering w.r.t. Probabilistic Mappings

Mediated Schema: T(name, email, mailing-addr, home-addr, office-addr)

Q: SELECT mailing-addr FROM T

Q1: SELECT current-addr FROM S (m1, probability 0.5)
Q2: SELECT permanent-addr FROM S (m2, probability 0.4)
Q3: SELECT email-addr FROM S (m3, probability 0.1)

Source table S:
PName   Email-addr   Current-addr    Permanent-addr
Alice   alice@       Mountain View   Sunnyvale
Bob     bob@         Sunnyvale       Sunnyvale

Tuple               Prob
(‘Sunnyvale’)       0.9
(‘Mountain View’)   0.5
(‘alice@’)          0.1
(‘bob@’)            0.1

Page 10: Data Integration  with Uncertainty

Contributions
- Definition of probabilistic mappings
- Semantics: by-table vs. by-tuple
- Complexity of query answering with respect to probabilistic mappings
- Concise representations of probabilistic mappings
- More expressive extensions to probabilistic mappings

                     By-table   By-tuple
Data Complexity      PTIME      #P-complete
Mapping Complexity   PTIME      PTIME

Page 11: Data Integration  with Uncertainty

Outline
- Motivation and overview of our results
- Definition of probabilistic mappings
- Complexity of query answering
- Concise representations of probabilistic mappings
- More expressive extensions to probabilistic mappings

                     By-table   By-tuple
Data Complexity      PTIME      #P-complete
Mapping Complexity   PTIME      PTIME

Page 12: Data Integration  with Uncertainty

Schema Mapping

S=(pname, email-addr, home-addr, office-addr)

T=(name, mailing-addr)

Mappings: one-to-one schema matching
Queries: select-project-join queries

Page 13: Data Integration  with Uncertainty

Probabilistic Mapping

S=(pname, email-addr, home-addr, office-addr)

T=(name, mailing-addr)

Possible Mapping                              Probability
{(pname,name),(home-addr, mailing-addr)}      0.5
{(pname,name),(office-addr, mailing-addr)}    0.4
{(pname,name),(email-addr, mailing-addr)}     0.1
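The table above can be held in a small data structure. The following is a minimal sketch, assuming (as the example suggests) that the probabilities of the possible mappings sum to 1 and that each possible mapping is one-to-one; the names `P_MAPPING`, `is_one_to_one`, and `is_valid_p_mapping` are hypothetical and do not come from the slides.

```python
from math import isclose

# Hypothetical representation: each possible mapping is a frozenset of
# (source attribute, target attribute) pairs, keyed to its probability.
P_MAPPING = {
    frozenset({("pname", "name"), ("home-addr", "mailing-addr")}): 0.5,
    frozenset({("pname", "name"), ("office-addr", "mailing-addr")}): 0.4,
    frozenset({("pname", "name"), ("email-addr", "mailing-addr")}): 0.1,
}


def is_one_to_one(mapping):
    """True if no source or target attribute appears in two correspondences."""
    sources = [s for s, _ in mapping]
    targets = [t for _, t in mapping]
    return len(set(sources)) == len(sources) and len(set(targets)) == len(targets)


def is_valid_p_mapping(p_mapping):
    """Assumed validity check: probabilities in [0, 1], summing to 1,
    and every possible mapping one-to-one."""
    probs = list(p_mapping.values())
    return (
        all(0.0 <= p <= 1.0 for p in probs)
        and isclose(sum(probs), 1.0)
        and all(is_one_to_one(m) for m in p_mapping)
    )
```

Calling `is_valid_p_mapping(P_MAPPING)` returns `True` for the three mappings on this slide.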

Page 14: Data Integration  with Uncertainty

By-Table vs. By-Tuple Semantics

Page 15: Data Integration  with Uncertainty

By-Table vs. By-Tuple Semantics

Ds=
pname   email-addr   home-addr       office-addr
Alice   alice@       Mountain View   Sunnyvale
Bob     bob@         Sunnyvale       Sunnyvale

DT= (one possible target instance per mapping)

Pr(m1)=0.5
name    mailing-addr
Alice   Mountain View
Bob     Sunnyvale

Pr(m2)=0.4
name    mailing-addr
Alice   Sunnyvale
Bob     Sunnyvale

Pr(m3)=0.1
name    mailing-addr
Alice   alice@
Bob     bob@

Page 16: Data Integration  with Uncertainty

By-Table vs. By-Tuple Semantics

Ds=
pname   email-addr   home-addr       office-addr
Alice   alice@       Mountain View   Sunnyvale
Bob     bob@         Sunnyvale       Sunnyvale

DT= (possible target instances, one per sequence of mappings <mapping for Alice, mapping for Bob>)

Pr(<m1,m3>)=0.05
name    mailing-addr
Alice   Mountain View
Bob     bob@

Pr(<m2,m3>)=0.04
name    mailing-addr
Alice   Sunnyvale
Bob     bob@

Pr(<m3,m3>)=0.01
name    mailing-addr
Alice   alice@
Bob     bob@

Page 17: Data Integration  with Uncertainty

By-Table Query Answering

Ds=
pname   email-addr   home-addr       office-addr
Alice   alice@       Mountain View   Sunnyvale
Bob     bob@         Sunnyvale       Sunnyvale

DT= (one possible target instance per mapping)

Pr=0.5
name    mailing-addr
Alice   Mountain View
Bob     Sunnyvale

Pr=0.4
name    mailing-addr
Alice   Sunnyvale
Bob     Sunnyvale

Pr=0.1
name    mailing-addr
Alice   alice@
Bob     bob@

SELECT mailing-addr
FROM T

Tuple               Probability
(‘Sunnyvale’)       0.9
(‘Mountain View’)   0.5
(‘alice@’)          0.1
(‘bob@’)            0.1
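The by-table answer probabilities can be reproduced with a few lines of Python. This is a sketch under stated assumptions, not the authors' implementation: each possible mapping is reduced to the one source attribute it maps to mailing-addr, answers use set semantics, and the names `DS`, `MAPPINGS`, and `by_table_answers` are hypothetical.

```python
from collections import defaultdict

# Source instance Ds from the slide.
DS = [
    {"pname": "Alice", "email-addr": "alice@",
     "home-addr": "Mountain View", "office-addr": "Sunnyvale"},
    {"pname": "Bob", "email-addr": "bob@",
     "home-addr": "Sunnyvale", "office-addr": "Sunnyvale"},
]

# For each possible mapping: (source attribute mapped to mailing-addr, probability).
MAPPINGS = [("home-addr", 0.5), ("office-addr", 0.4), ("email-addr", 0.1)]


def by_table_answers(rows, mappings):
    """Answer SELECT mailing-addr FROM T under by-table semantics.

    One mapping applies to the whole table, so an answer's probability is
    the sum of the probabilities of the mappings that produce it.
    """
    probs = defaultdict(float)
    for source_attr, pr in mappings:
        answers = {row[source_attr] for row in rows}  # distinct values only
        for value in answers:
            probs[value] += pr
    return dict(probs)
```

With the data above, `by_table_answers(DS, MAPPINGS)` gives approximately 0.9 for 'Sunnyvale', 0.5 for 'Mountain View', and 0.1 each for 'alice@' and 'bob@', matching the table on the slide up to floating-point rounding.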

Page 18: Data Integration  with Uncertainty

By-Tuple Query Answering

Ds=
pname   email-addr   home-addr       office-addr
Alice   alice@       Mountain View   Sunnyvale
Bob     bob@         Sunnyvale       Sunnyvale

DT= (examples of possible target instances)

Pr=0.05
name    mailing-addr
Alice   Mountain View
Bob     bob@

Pr=0.04
name    mailing-addr
Alice   Sunnyvale
Bob     bob@

Pr=0.01
name    mailing-addr
Alice   alice@
Bob     bob@

SELECT mailing-addr
FROM T

Tuple               Probability
(‘Sunnyvale’)       0.94
(‘Mountain View’)   0.5
(‘alice@’)          0.1
(‘bob@’)            0.1
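The by-tuple numbers can be checked by brute force: enumerate every sequence that assigns one mapping to each source tuple, and add the sequence probability to each distinct answer it produces. The sketch below assumes that mappings are chosen independently per tuple and again reduces each mapping to the attribute it maps to mailing-addr; `DS`, `MAPPINGS`, and `by_tuple_answers` are hypothetical names. The enumeration visits (number of mappings)^(number of tuples) sequences, so it is only practical for tiny inputs.

```python
from collections import defaultdict
from itertools import product

# Source instance Ds from the slide.
DS = [
    {"pname": "Alice", "email-addr": "alice@",
     "home-addr": "Mountain View", "office-addr": "Sunnyvale"},
    {"pname": "Bob", "email-addr": "bob@",
     "home-addr": "Sunnyvale", "office-addr": "Sunnyvale"},
]

# For each possible mapping: (source attribute mapped to mailing-addr, probability).
MAPPINGS = [("home-addr", 0.5), ("office-addr", 0.4), ("email-addr", 0.1)]


def by_tuple_answers(rows, mappings):
    """Answer SELECT mailing-addr FROM T under by-tuple semantics
    by enumerating all mapping sequences (exponential in len(rows))."""
    probs = defaultdict(float)
    for sequence in product(mappings, repeat=len(rows)):
        seq_pr = 1.0
        answers = set()
        for row, (source_attr, pr) in zip(rows, sequence):
            seq_pr *= pr                    # independent choice per tuple
            answers.add(row[source_attr])   # value this tuple contributes
        for value in answers:
            probs[value] += seq_pr
    return dict(probs)
```

For the two-tuple instance this enumerates nine sequences and yields approximately 0.94 for 'Sunnyvale', 0.5 for 'Mountain View', and 0.1 each for 'alice@' and 'bob@'.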

Page 20: Data Integration  with Uncertainty

By-Table Query Answering

SELECT mailing-addr FROM T

Query Rewriting:
SELECT home-addr FROM S (0.5)
SELECT office-addr FROM S (0.4)
SELECT email-addr FROM S (0.1)

Theorem: Query answering in by-table semantics is in PTIME in the size of the data and the size of the mapping

Page 21: Data Integration  with Uncertainty

By-Tuple Query Answering

Ds=
pname   email-addr   home-addr       office-addr
Alice   alice@       Mountain View   Sunnyvale
Bob     bob@         Sunnyvale       San Jose

SELECT mailing-addr
FROM T

Target Enumeration

DT=

Pr=0.2
name    mailing-addr
Alice   Sunnyvale
Bob     Sunnyvale

Pr=0.04
name    mailing-addr
Alice   Sunnyvale
Bob     bob@

Pr=0.01
name    mailing-addr
Alice   alice@
Bob     bob@

Pr=0.05
name    mailing-addr
Alice   Mountain View
Bob     bob@

Theorem: Query answering in by-tuple semantics is #P-complete in the size of the data, and in PTIME in the size of the mapping

Page 22: Data Integration  with Uncertainty

More on By-Tuple Query Answering
- The high complexity comes from computing probabilities
  Theorem: Even computing the probability for one possible answer is #P-hard
  Theorem: Computing all possible answers w/o probabilities is in PTIME
- In general query answering cannot be done by query rewriting
- There are two subsets of queries that can be answered in PTIME by query rewriting

Page 23: Data Integration  with Uncertainty

PTIME for Queries with a Single P-Mapping Target

SELECT mailing-addr FROM T

Rewritten query              Pr    Answers
SELECT home-addr FROM S      0.5   (‘Mountain View’), (‘Sunnyvale’)
SELECT office-addr FROM S    0.4   (‘Sunnyvale’), (‘Sunnyvale’)
SELECT email-addr FROM S     0.1   (‘alice@’), (‘bob@’)

Tuple               Probability
(‘Sunnyvale’)       0.94 = 1-(1-0.4)(1-0.5-0.4)
(‘Mountain View’)   0.5 = 1-(1-0.5)(1-0)
(‘alice@’)          0.1
(‘bob@’)            0.1
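The arithmetic on this slide follows a simple pattern: for each answer value, take for every source tuple the total probability of the mappings under which that tuple yields the value, then compute one minus the product of the complements. The sketch below reproduces that arithmetic for the projection query only; it is an illustration of the slide's formula rather than the paper's algorithm, it assumes tuples choose mappings independently, and `DS`, `MAPPINGS`, and `by_tuple_single_target` are hypothetical names.

```python
# Source instance Ds used on the by-tuple slides.
DS = [
    {"pname": "Alice", "email-addr": "alice@",
     "home-addr": "Mountain View", "office-addr": "Sunnyvale"},
    {"pname": "Bob", "email-addr": "bob@",
     "home-addr": "Sunnyvale", "office-addr": "Sunnyvale"},
]

# For each possible mapping: (source attribute mapped to mailing-addr, probability).
MAPPINGS = [("home-addr", 0.5), ("office-addr", 0.4), ("email-addr", 0.1)]


def by_tuple_single_target(rows, mappings):
    """Compute Pr(value) = 1 - prod over tuples of (1 - hit), where hit is the
    probability that the tuple produces the value. Polynomial in the input."""
    values = {row[attr] for row in rows for attr, _ in mappings}
    result = {}
    for value in values:
        miss = 1.0  # probability that no tuple produces this value
        for row in rows:
            hit = sum(pr for attr, pr in mappings if row[attr] == value)
            miss *= 1.0 - hit
        result[value] = 1.0 - miss
    return result
```

For 'Sunnyvale', Alice contributes 0.4 (office-addr) and Bob contributes 0.5 + 0.4 (home-addr and office-addr), giving 1-(1-0.4)(1-0.9), approximately 0.94, without enumerating mapping sequences.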

Page 24: Data Integration  with Uncertainty

PTIME for Queries that Return Join Attributes

SELECT mailing-addr FROM T, V
WHERE T.mailing-addr = V.hightech

SELECT mailing-addr FROM T
Tuple               Probability
(‘Sunnyvale’)       0.94
(‘Mountain View’)   0.5
(‘alice@’)          0.1
(‘bob@’)            0.1

SELECT hightech FROM V
Tuple               Probability
(‘Sunnyvale’)       0.8
(‘Mountain View’)   0.8

Result:
Tuple               Probability
(‘Sunnyvale’)       0.752 = 0.94*0.8
(‘Mountain View’)   0.4

Page 25: Data Integration  with Uncertainty

Query Rewriting Does Not Apply to Queries that Do NOT Return Join Attributes

SELECT ‘true’ FROM T, V
WHERE T.mailing-addr = V.hightech

SELECT mailing-addr FROM T
Tuple               Probability
(‘Sunnyvale’)       0.94
(‘Mountain View’)   0.5
(‘alice@’)          0.1
(‘bob@’)            0.1

SELECT hightech FROM V
Tuple               Probability
(‘Sunnyvale’)       0.8
(‘Mountain View’)   0.8

Result:
Tuple      Probability
(‘true’)   0.864 ≠ 0.94*0.8+0.5*0.8
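The value 0.864 can be reproduced by enumerating mapping sequences directly. The sketch below rests on assumptions that the slide does not state explicitly: T is derived from the two-tuple source instance used on the earlier by-tuple slides, and V is a probabilistic table whose two tuples are present independently with probability 0.8 each. All names (`DS`, `MAPPINGS`, `V_PROBS`, `boolean_join_probability`) are hypothetical.

```python
from itertools import product

# Source instance assumed from the earlier by-tuple slides.
DS = [
    {"pname": "Alice", "email-addr": "alice@",
     "home-addr": "Mountain View", "office-addr": "Sunnyvale"},
    {"pname": "Bob", "email-addr": "bob@",
     "home-addr": "Sunnyvale", "office-addr": "Sunnyvale"},
]

# For each possible mapping: (source attribute mapped to mailing-addr, probability).
MAPPINGS = [("home-addr", 0.5), ("office-addr", 0.4), ("email-addr", 0.1)]

# Assumption: V.hightech tuples are present independently with these probabilities.
V_PROBS = {"Sunnyvale": 0.8, "Mountain View": 0.8}


def boolean_join_probability(rows, mappings, v_probs):
    """Probability that T.mailing-addr = V.hightech has at least one match,
    summed over all by-tuple mapping sequences."""
    total = 0.0
    for sequence in product(mappings, repeat=len(rows)):
        seq_pr = 1.0
        t_values = set()
        for row, (attr, pr) in zip(rows, sequence):
            seq_pr *= pr
            t_values.add(row[attr])
        # Probability that none of the values in T finds a partner in V.
        no_join = 1.0
        for value in t_values:
            no_join *= 1.0 - v_probs.get(value, 0.0)
        total += seq_pr * (1.0 - no_join)
    return total


# Summing the per-answer products instead gives a value above 1,
# so it cannot be a probability.
NAIVE_SUM = 0.94 * 0.8 + 0.5 * 0.8
```

Under these assumptions `boolean_join_probability(DS, MAPPINGS, V_PROBS)` is approximately 0.864, while `NAIVE_SUM` is about 1.152. The per-answer probabilities cannot simply be added because the events that 'Sunnyvale' and 'Mountain View' appear in T are not mutually exclusive.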

Page 27: Data Integration  with Uncertainty

Group Probabilistic Mapping
- Intuition: Divide a p-mapping into a set of independent mappings on disjoint attributes
- Theorem: Query answering is in PTIME in the size of the group probabilistic mapping

Original p-mapping between S(a, b, c) and T(a’, b’, c’):
m1   p=0.72
m2   p=0.18
m3   p=0.08
m4   p=0.02

Group p-mapping:
Between S(a, b) and T(a’, b’):
m’1   p=0.8
m’2   p=0.2
Between S(c) and T(c’):
m’’1   p=0.9
m’’2   p=0.1
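The probabilities on this slide are consistent with multiplying one choice from each independent group. The sketch below expands a group p-mapping back into a full p-mapping; it uses only the mapping labels and probabilities, since the attribute-level arrows of each mapping are not recoverable from the transcript. `GROUPS` and `expand_groups` are hypothetical names.

```python
from itertools import product

# One dict per independent group: mapping label -> probability.
GROUPS = [
    {"m'1": 0.8, "m'2": 0.2},    # mappings between S(a, b) and T(a', b')
    {"m''1": 0.9, "m''2": 0.1},  # mappings between S(c) and T(c')
]


def expand_groups(groups):
    """Combine one mapping from each group; because the groups are assumed
    independent, the probability of a combination is the product."""
    expanded = {}
    for combo in product(*(group.items() for group in groups)):
        labels = tuple(label for label, _ in combo)
        pr = 1.0
        for _, p in combo:
            pr *= p
        expanded[labels] = pr
    return expanded
```

Expanding `GROUPS` yields four combinations with probabilities of approximately 0.72, 0.18, 0.08, and 0.02, the same values as m1 through m4. The group form stores 2 + 2 entries where the full form stores 2 × 2; with more groups the gap grows multiplicatively.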

Page 28: Data Integration  with Uncertainty

Probabilistic Correspondences

P-mapping between S(a, b, c) and T(a’, b’, c’):
m1   p=0.72
m2   p=0.18
m3   p=0.08
m4   p=0.02

Attr Correspondence   Probability
(a, a’)               0.8
(a, b’)               0.2
(b, b’)               0.8
(c, c’)               0.9

- Intuition: Compute marginal probabilities for attribute pairs
- Theorem: Query answering is in PTIME in the size of the probabilistic correspondence
- Many-to-one relationship between probabilistic mappings and probabilistic correspondences
- We can use probabilistic correspondences to answer ONLY queries that contain no more than one attribute
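Marginal probabilities for attribute pairs can be computed by summing, for each pair, the probabilities of the mappings that contain it. In the sketch below the contents of m1 through m4 are an assumption: the slide's arrows did not survive extraction, so the attribute pairs were chosen to be consistent with the correspondence table above. `P_MAPPING` and `correspondences` are hypothetical names.

```python
from collections import defaultdict

# Assumed contents of m1..m4 (reconstructed to agree with the slide's table).
P_MAPPING = [
    ({("a", "a'"), ("b", "b'"), ("c", "c'")}, 0.72),  # m1
    ({("a", "b'"), ("c", "c'")}, 0.18),               # m2
    ({("a", "a'"), ("b", "b'")}, 0.08),               # m3
    ({("a", "b'")}, 0.02),                            # m4
]


def correspondences(p_mapping):
    """Marginal probability of each attribute pair: the sum of the
    probabilities of all possible mappings that include the pair."""
    marginals = defaultdict(float)
    for mapping, pr in p_mapping:
        for pair in mapping:
            marginals[pair] += pr
    return dict(marginals)
```

With the assumed mappings, the marginals are approximately 0.8 for (a, a'), 0.2 for (a, b'), 0.8 for (b, b'), and 0.9 for (c, c'). The marginals discard which pairs occur together, which is why different p-mappings can share one set of correspondences.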

Page 29: Data Integration  with Uncertainty

Bayes Nets Representation
Theorem: When we encode probabilistic mappings using a Bayes Net, query answering can be exponential in the size of the representation

Example:
S(s1, s2, …, sn, s1’, s2’, …, sn’)
T(t1, t2, …, tn)
-- Pr(s1, t1) = Pr(s1’, t1) = 0.5
-- If s1 maps to t1, then for 2<=i<=n, si maps to ti with probability 0.9 and si’ maps to ti with probability 0.1; if s1’ maps to t1, the two probabilities are swapped

Original representation: O(2^n); Bayes representation: O(n)

Page 31: Data Integration  with Uncertainty

Extensions to More Expressive Mappings
The complexity results for query answering carry over to three extensions to more expressive mappings:
- Complex mappings
  E.g., address corresponds to street, city and state
- GLAV mappings
  E.g., Paper, Authorship and Author correspond to Publication
- Conditional mappings
  E.g., if age>65, Pr(home-addr, mailing-addr)=0.8; if age<=65, Pr(home-addr, mailing-addr)=0.5

Page 32: Data Integration  with Uncertainty

Related Work
- Using top-k schema mappings can increase the recall of query answering w/o sacrificing precision much [Magnani&Montesi, 2007]
- Using top-k schema mappings to improve schema mapping [Gal, 2006]
- Using probabilities to model overlaps between data sources [Florescu et al, 1997]
- Probabilistic Databases (Tutorial [Suciu&Dalvi, Sigmod’05])

Page 33: Data Integration  with Uncertainty

Conclusions
Contributions
- Data integration with uncertainty
- Probabilistic mappings and query answering in their presence
Future work
- Incorporate uncertainty on keyword reformulation and dirty data
- Use the probabilistic model to improve schema mappings
- Build a real system

                     By-table   By-tuple
Data Complexity      PTIME      #P-complete
Mapping Complexity   PTIME      PTIME

Page 34: Data Integration  with Uncertainty

Data Integration with Uncertainty

Xin (Luna) Dong (Univ. of Washington), Alon Halevy (Google Inc.), Cong Yu (Univ. of Michigan)

@ VLDB 2007