Uncertainty in Data Integration Ai Jing 2007-11-10.

Post on 26-Mar-2015


Uncertainty in Data Integration

Ai Jing, 2007-11-10

Outline

• Data Integration with Uncertainty
• Overview of Workshop on Management of Uncertain Data
• Uncertainty in Deep Web

Data Integration with Uncertainty

• Motivation and overview
• Definition of probabilistic mappings
• Query answering w.r.t. p-mappings
• Complexity of query answering
• Contributions

Traditional Data Integration Systems

Q: SELECT P.title AS title, P.year AS year, A.name AS author
FROM Author AS A, Paper AS P, AuthoredBy
WHERE A.aid = AuthoredBy.aid AND P.pid = AuthoredBy.pid

[Figure: query Q over the mediated schema, with queries Q1 to Q5 over the data sources]

Uncertainty Can Occur at Three Levels in Data Integration Applications

III. Query Level

II. Mapping Level

I. Data Level

Focus of the paper: probabilistic schema mappings

Example Probabilistic Mappings

T(name, email, mailing-addr, home-addr, office-addr)

S(pname, email-addr, current-addr, permanent-addr)

[Figure: three candidate mappings between S and T, drawn as attribute correspondences]

m1: 0.5

m2: 0.4

m3: 0.1

Top-k Query Answering w.r.t. Probabilistic Mappings

Mediated Schema

Q: SELECT mailing-addr FROM T

Q1: SELECT current-addr FROM S (0.5)

Q2: SELECT permanent-addr FROM S (0.4)

Q3: SELECT email-addr FROM S (0.1)
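The rewriting of Q into Q1, Q2 and Q3 can be illustrated with a minimal sketch, assuming by-table semantics (one mapping holds for the whole source table) and set semantics for answers. The source rows, the helper names `by_table_answers` and `top_k`, and the way ties are broken are all assumptions made for illustration; only the three rewritings and their probabilities (0.5, 0.4, 0.1) come from the slide.

```python
from collections import defaultdict

# Hypothetical source instance of S(pname, email-addr, current-addr, permanent-addr).
S_ROWS = [
    {"pname": "Alice", "email-addr": "alice@example.org",
     "current-addr": "Mountain View", "permanent-addr": "Sunnyvale"},
    {"pname": "Bob", "email-addr": "bob@example.org",
     "current-addr": "Sunnyvale", "permanent-addr": "Sunnyvale"},
]

# For each candidate mapping: the source attribute that mailing-addr is
# mapped to, and the probability of that mapping (taken from the slide).
P_MAPPING = [
    ("current-addr", 0.5),    # rewriting Q1
    ("permanent-addr", 0.4),  # rewriting Q2
    ("email-addr", 0.1),      # rewriting Q3
]


def by_table_answers(rows, p_mapping):
    """Run each rewriting and add its probability to every value it returns."""
    scores = defaultdict(float)
    for source_attr, prob in p_mapping:
        # Set semantics: a value counts once per rewriting, however often it occurs.
        for value in {row[source_attr] for row in rows}:
            scores[value] += prob
    return dict(scores)


def top_k(scores, k):
    """Return the k answers with the highest probability (ties broken by value)."""
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


ANSWERS = by_table_answers(S_ROWS, P_MAPPING)
TOP_2 = top_k(ANSWERS, 2)
print(TOP_2)
```

On this toy instance "Sunnyvale" is returned by both Q1 and Q2, so its probability is the sum of the two mapping probabilities, while "Mountain View" is returned by Q1 only.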


Definition of probabilistic mappings

Schema Mapping

S=(pname, email-addr, home-addr, office-addr)

T=(name, mailing-addr)

• one-to-one schema matching
• have exact knowledge of mapping

Probabilistic Mapping

S=(pname, email-addr, home-addr, office-addr)

T=(name, mailing-addr)

[Figure: correspondences between S and T labeled with probabilities 1.0, 0.1, 0.5, 0.4]
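A probabilistic mapping can be read as a set of distinct one-to-one mappings whose probabilities sum to 1. The sketch below checks those conditions on a hypothetical p-mapping between S and T. The three candidate mappings, and the reading of the slide's numbers 1.0, 0.1, 0.5, 0.4 as per-correspondence totals, are assumptions: the arrows of the original figure are not in the transcript.

```python
from collections import defaultdict

# Hypothetical candidate mappings from S attributes to T attributes,
# each paired with its probability. The assignment is an assumption.
CANDIDATES = [
    ({"pname": "name", "home-addr": "mailing-addr"}, 0.5),
    ({"pname": "name", "office-addr": "mailing-addr"}, 0.4),
    ({"pname": "name", "email-addr": "mailing-addr"}, 0.1),
]


def is_one_to_one(mapping):
    """No two source attributes may map to the same target attribute."""
    return len(set(mapping.values())) == len(mapping)


def is_valid_p_mapping(candidates, tol=1e-9):
    """Distinct one-to-one mappings with probabilities in [0, 1] summing to 1."""
    mappings = [frozenset(m.items()) for m, _ in candidates]
    probs = [p for _, p in candidates]
    return (
        len(set(mappings)) == len(mappings)
        and all(is_one_to_one(m) for m, _ in candidates)
        and all(0.0 <= p <= 1.0 for p in probs)
        and abs(sum(probs) - 1.0) <= tol
    )


def correspondence_marginals(candidates):
    """Total probability of each (source, target) attribute correspondence."""
    totals = defaultdict(float)
    for mapping, prob in candidates:
        for pair in mapping.items():
            totals[pair] += prob
    return dict(totals)


VALID = is_valid_p_mapping(CANDIDATES)
MARGINALS = correspondence_marginals(CANDIDATES)
print(VALID, MARGINALS)
```

Under this assumed assignment the correspondence (pname, name) appears in every candidate and so totals 1.0, while the three correspondences into mailing-addr total 0.5, 0.4 and 0.1.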

By-Table Semantics

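[Figure: a target instance DT obtained under a single mapping m, with probability 0.5]

By-Tuple Semantics

[Figure: a target instance DT obtained under the mapping sequence <m1, m3>, with Pr(<m1,m3>) = 0.05]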
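The value Pr(<m1,m3>) = 0.05 is consistent with multiplying the probabilities of m1 (0.5) and m3 (0.1), that is, with choosing a mapping independently for each source tuple. A minimal sketch under that independence assumption follows; the function name `sequence_probability` is hypothetical.

```python
from itertools import product
from math import prod

# Mapping probabilities from the example p-mapping on the earlier slide.
MAPPING_PROBS = {"m1": 0.5, "m2": 0.4, "m3": 0.1}


def sequence_probability(seq, probs):
    """Probability of a mapping sequence, one mapping per source tuple,
    assuming the per-tuple choices are independent."""
    return prod(probs[name] for name in seq)


PR_M1_M3 = sequence_probability(("m1", "m3"), MAPPING_PROBS)

# Sanity check: over two tuples, the probabilities of all sequences sum to 1.
TOTAL = sum(
    sequence_probability(seq, MAPPING_PROBS)
    for seq in product(MAPPING_PROBS, repeat=2)
)
print(PR_M1_M3, TOTAL)
```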


By-Table Query Answering

By-Tuple Query Answering


Complexity of query answering

More on By-Tuple Query Answering

• The high complexity comes from computing probabilities: the number of mapping sequences is exponential in the size of the input data
  n tuples, m mappings: m^n mapping sequences
• There are two subsets of queries that can be answered in PTIME by query rewriting
  SELECT mailing-addr FROM T
  SELECT mailing-addr FROM T, V WHERE T.mailing-addr = V.hightech
• In general, query answering cannot be done by query rewriting
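To make the m^n blow-up concrete, here is a brute-force sketch of by-tuple answering for SELECT mailing-addr FROM T. It enumerates every mapping sequence, which is exactly the exponential step; it is not the PTIME rewriting mentioned above. The source rows and the function name `by_tuple_answers` are hypothetical, and the attribute each mapping uses for mailing-addr is taken from the Q1 to Q3 rewritings on the earlier slide.

```python
from collections import defaultdict
from itertools import product
from math import prod

# Hypothetical source instance of S (n = 2 tuples).
S_ROWS = [
    {"email-addr": "alice@example.org",
     "current-addr": "Mountain View", "permanent-addr": "Sunnyvale"},
    {"email-addr": "bob@example.org",
     "current-addr": "Sunnyvale", "permanent-addr": "Sunnyvale"},
]

# m = 3 mappings: source attribute used for mailing-addr, and probability.
P_MAPPING = [("current-addr", 0.5), ("permanent-addr", 0.4), ("email-addr", 0.1)]


def by_tuple_answers(rows, p_mapping):
    """Enumerate all m^n mapping sequences and sum the probability of every
    sequence under which a value appears in the answer."""
    scores = defaultdict(float)
    sequences_seen = 0
    for seq in product(p_mapping, repeat=len(rows)):
        sequences_seen += 1
        seq_prob = prod(p for _, p in seq)
        # The i-th mapping of the sequence is applied to the i-th tuple.
        answers = {row[attr] for row, (attr, _) in zip(rows, seq)}
        for value in answers:
            scores[value] += seq_prob
    return dict(scores), sequences_seen


SCORES, N_SEQUENCES = by_tuple_answers(S_ROWS, P_MAPPING)
print(N_SEQUENCES, SCORES)
```

With 2 tuples and 3 mappings the loop visits 3^2 = 9 sequences; each extra tuple multiplies that count by 3, which is why enumeration does not scale.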

One of Dt

Extensions to More Expressive Mappings

The complexity results for query answering carry over to three extensions to more expressive mappings

• Complex mappings
• GLAV mappings
• Conditional mappings


Contributions

Definition of probabilistic mappings

Semantics: by-table vs. by-tuple

Complexity of query answering


Overview of MUD 2007

Theory

• A New Language and Architecture to Obtain Fuzzy Global Dependencies
• About the Processing of Division Queries Addressed to Possibilistic Databases
• Making Aggregation Work in Uncertain and Probabilistic Databases

Application

• Materialized Views in Probabilistic Databases
• Flexible matching of Ear Biometrics
• Consistent Joins Under Primary Key Constraints

A New Language and Architecture to Obtain Fuzzy Global Dependencies

SQL does not satisfy the minimum requirements to be a true data mining (DM) language

A New Language: dmFSQL (data mining Fuzzy Structured Query Language)

Fuzzy Database Data mining

About the Processing of Division Queries Addressed to Possibilistic Databases

They devised a data model which is a strong representation system for operations in possibilistic databases

A possibilistic database D can be interpreted as a weighted disjunctive set of regular databases

Division Queries

Making Aggregation Work in Uncertain and Probabilistic Databases

Trio is a prototype database management system for storing and querying data with uncertainty and lineage

Trio's query language: TriQL

Trio data model and query semantics

Aggregation functions in the Trio system for uncertain and probabilistic data

Materialized Views in Probabilistic Databases

Materialized views for probabilistic databases may not define a unique probability distribution

View representation

Answer queries on large probabilistic data sets more efficiently with materialized views

Flexible matching of Ear Biometrics

Research area: image recognition (or identification)

Scenario: identifying found bodies in a large-scale disaster

Challenge: fast and cheap identification; no DNA databases or fingerprint databases are at hand

Consistent Joins Under Primary Key Constraints

Inconsistent database, primary key

Will the natural join of the repaired relations always be nonempty, no matter which tuples are selected?

Game theory, winning strategy


Uncertainty in Deep Web

No “perfect” data

• Noise
• Dirty
• Redundancy
• ……

No “perfect” solution

• Web data extraction
• Interface integration
• ……

Uncertainty in Deep Web Data Integration(1)

[Figure: architecture of a Deep Web data integration system. Modules: Interface Integration Module, Query Process Module, Result Process Module. Components: WDB Discovery, Interface Schema Extraction, WDB Clustering, Interface Integration, WDB Selection, Query Translation, Query Submission, Results Extraction, Results Annotation, Data Merging. Other labels: Deep Web, Web DB, RDB, Integrated Interface.]

• Robust
• Evaluable

Uncertainty in Deep Web Data Integration(2)

[Figure: the same Deep Web data integration architecture, with the Interface Integration Module, Query Process Module and Result Process Module]

• Tuning
• Feedback
• Evaluable

Uncertainty in Jobtong(1)

Data level

Uncertainty in Jobtong(2)

Query level

How can we give every result a probability to show its importance?

Uncertainty in Jobtong(3)

The automatic maintenance of configuration files

<record>
  <xpath>/html/body//table/tr[@class='nob']</xpath>
  <combination>2</combination>
  <items>
    <item>
      <name>title</name>
      <xpath>td[2]/a/span</xpath>
    </item>
    <item>
      <name>company</name>
      <xpath>td[3]/a/span</xpath>
    </item>
  </items>
</record>

<record>
  <xpath>/html/body//table/tr[@class='list2' or @class='list3']</xpath>
  <combination>2</combination>
  <items>
    <item>
      <name>title</name>
      <xpath>td[2]/a</xpath>
    </item>
    <item>
      <name>company</name>
      <xpath>td[3]/a</xpath>
    </item>
  </items>
</record>
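One way to picture the maintenance problem is a check that applies a record configuration to a page and flags it as stale when nothing is extracted. The sketch below is only an illustration and not Jobtong's implementation: the sample page, the helper names, and the staleness rule are hypothetical, and the meaning of `<combination>` is not given on the slide, so it is ignored. The sketch uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath; the first record's paths fit that subset once the absolute path is made relative, whereas the `or` predicate of the second record would need a full XPath engine.

```python
import xml.etree.ElementTree as ET

# First record configuration from the slide.
CONFIG = """<record>
  <xpath>/html/body//table/tr[@class='nob']</xpath>
  <combination>2</combination>
  <items>
    <item><name>title</name><xpath>td[2]/a/span</xpath></item>
    <item><name>company</name><xpath>td[3]/a/span</xpath></item>
  </items>
</record>"""

# Hypothetical, well-formed result page used only for this example.
PAGE = """<html><body><table>
<tr class="nob"><td>1</td><td><a><span>Data Engineer</span></a></td><td><a><span>Acme</span></a></td></tr>
<tr class="nob"><td>2</td><td><a><span>DBA</span></a></td><td><a><span>Globex</span></a></td></tr>
</table></body></html>"""


def load_record(config_xml):
    """Return the record-level path and a {field name: item path} dict."""
    root = ET.fromstring(config_xml)
    items = {i.findtext("name"): i.findtext("xpath") for i in root.iter("item")}
    return root.findtext("xpath"), items


def to_relative(path):
    """ElementTree cannot evaluate absolute paths on an element, so rewrite
    '/html/...' relative to the <html> root (assumes the root is <html>)."""
    prefix = "/html/"
    return "./" + path[len(prefix):] if path.startswith(prefix) else path


def extract(page_xml, record_path, items):
    """Apply the record path, then each item path, to a well-formed page."""
    page = ET.fromstring(page_xml)
    rows = []
    for tr in page.findall(to_relative(record_path)):
        row = {}
        for name, item_path in items.items():
            node = tr.find(item_path)
            row[name] = node.text if node is not None else None
        rows.append(row)
    return rows


def config_is_stale(rows, items):
    """Hypothetical rule: stale if no record matched or any field is missing."""
    return not rows or any(row[name] is None for row in rows for name in items)


RECORD_PATH, ITEMS = load_record(CONFIG)
ROWS = extract(PAGE, RECORD_PATH, ITEMS)
# Simulate a site redesign that renames the row class.
CHANGED_ROWS = extract(PAGE.replace('class="nob"', 'class="row"'), RECORD_PATH, ITEMS)
print(ROWS, config_is_stale(ROWS, ITEMS), config_is_stale(CHANGED_ROWS, ITEMS))
```

A check like this only detects that a configuration no longer matches; producing the repaired XPath automatically is the harder problem the slide points at.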

Q&A

Thank you!