Uncertainty in Data Integration
Ai Jing, 2007-11-10
Outline
• Data Integration with Uncertainty
• Overview of Workshop on Management of Uncertain Data
• Uncertainty in Deep Web
Data Integration with Uncertainty
• Motivation and overview
• Definition of probabilistic mappings
• Query answering w.r.t. p-mappings
• Complexity of query answering
• Contributions
Traditional Data Integration Systems
Q: SELECT P.title AS title, P.year AS year, A.name AS author
FROM Author AS A, Paper AS P, AuthoredBy AS B
WHERE A.aid = B.aid AND P.pid = B.pid
[Figure: query labels Q1, Q2, Q3, Q4, Q5 from the system diagram]
Uncertainty Can Occur at Three Levels in Data Integration Applications
I. Data Level
II. Mapping Level
III. Query Level
Focus of the paper: probabilistic schema mappings
Example Probabilistic Mappings
T(name, email, mailing-addr, home-addr, office-addr)
S(pname, email-addr, current-addr, permanent-addr)
[Figure: three candidate mappings between S and T; the correspondence arrows are not preserved]
m1: 0.5
m2: 0.4
m3: 0.1
Top-k Query Answering w.r.t. Probabilistic Mappings
Mediated Schema
Q: SELECT mailing-addr FROM T
Q1: SELECT current-addr FROM S (probability 0.5)
Q2: SELECT permanent-addr FROM S (probability 0.4)
Q3: SELECT email-addr FROM S (probability 0.1)
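The rewriting above can be sketched in a few lines of Python. This is only an illustration under by-table semantics, not the paper's algorithm: the source rows are invented, and the helper names `by_table_answers` and `top_k` are hypothetical. Each candidate mapping turns Q into one source query, and every answer value collects the probability of the mappings that produce it.

```python
from collections import defaultdict

# Hypothetical source table S(pname, email-addr, current-addr, permanent-addr).
S = [
    {"pname": "Alice", "email-addr": "alice@example.com",
     "current-addr": "Mountain View", "permanent-addr": "Sunnyvale"},
    {"pname": "Bob", "email-addr": "bob@example.com",
     "current-addr": "Sunnyvale", "permanent-addr": "Sunnyvale"},
]

# Which S attribute stands for T.mailing-addr under each candidate mapping,
# with the probabilities shown on the slide.
P_MAPPING = [
    ("current-addr", 0.5),    # corresponds to Q1
    ("permanent-addr", 0.4),  # corresponds to Q2
    ("email-addr", 0.1),      # corresponds to Q3
]


def by_table_answers(source, p_mapping):
    """Answer SELECT mailing-addr FROM T: one mapping applies to the whole table."""
    prob = defaultdict(float)
    for source_attr, p in p_mapping:
        # Rewritten query: SELECT <source_attr> FROM S, with set semantics.
        for value in {row[source_attr] for row in source}:
            prob[value] += p
    return dict(prob)


def top_k(answers, k):
    """Return the k most probable answers, ties broken alphabetically."""
    return sorted(answers.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


if __name__ == "__main__":
    for value, p in top_k(by_table_answers(S, P_MAPPING), 2):
        print(f"{value}: {p:.2f}")
```

A value returned by several rewritten queries accumulates their probabilities, which is why ranking happens only after all rewritings have been evaluated.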
Definition of probabilistic mappings
Schema Mapping
S=(pname, email-addr, home-addr, office-addr)
T=(name, mailing-addr)
One-to-one schema matching; we have exact knowledge of the mapping
Probabilistic Mapping
S=(pname, email-addr, home-addr, office-addr)
T=(name, mailing-addr)
[Figure: correspondence arrows labeled with probabilities 1.0, 0.1, 0.5, 0.4; the arrows are not preserved]
By-Table Semantics
DT = [table not preserved] (mapping m, probability 0.5)
By-Tuple Semantics
DT = [table not preserved]
Pr(<m1, m3>) = 0.05
…
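The value 0.05 is consistent with a product of per-tuple choices. A worked version, assuming Pr(m1) = 0.5 and Pr(m3) = 0.1 as in the earlier example, and assuming each tuple picks its mapping independently:

```latex
\Pr(\langle m^{1}, \dots, m^{n} \rangle) = \prod_{i=1}^{n} \Pr(m^{i}),
\qquad
\Pr(\langle m_1, m_3 \rangle) = \Pr(m_1) \cdot \Pr(m_3) = 0.5 \times 0.1 = 0.05
```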
By-Table Query Answering
By-Tuple Query Answering
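A naive sketch of by-tuple answering for SELECT mailing-addr FROM T, assuming every source tuple chooses its mapping independently. The sample rows and the function name `by_tuple_answers` are hypothetical. The code enumerates every mapping sequence, so with n tuples and m mappings it visits m^n sequences and is meant only to make the semantics concrete.

```python
from collections import defaultdict
from itertools import product

# Hypothetical source table S(pname, email-addr, current-addr, permanent-addr).
S = [
    {"pname": "Alice", "email-addr": "alice@example.com",
     "current-addr": "Mountain View", "permanent-addr": "Sunnyvale"},
    {"pname": "Bob", "email-addr": "bob@example.com",
     "current-addr": "Sunnyvale", "permanent-addr": "Sunnyvale"},
]

# Source attribute used for T.mailing-addr under m1, m2, m3.
P_MAPPING = [
    ("current-addr", 0.5),
    ("permanent-addr", 0.4),
    ("email-addr", 0.1),
]


def by_tuple_answers(source, p_mapping):
    """Sum, for each answer value, the probability of the sequences yielding it."""
    prob = defaultdict(float)
    # A sequence assigns one mapping to each source tuple.
    for seq in product(p_mapping, repeat=len(source)):
        seq_prob = 1.0
        answers = set()
        for row, (source_attr, p) in zip(source, seq):
            seq_prob *= p
            answers.add(row[source_attr])
        for value in answers:
            prob[value] += seq_prob
    return dict(prob)


if __name__ == "__main__":
    for value, p in sorted(by_tuple_answers(S, P_MAPPING).items()):
        print(f"{value}: {p:.2f}")
```

On this toy data the two semantics already disagree: a value can be produced by different mappings on different tuples, so its by-tuple probability can exceed its by-table probability.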
Complexity of query answering
More on By-Tuple Query Answering
• The high complexity comes from computing probabilities: the number of mapping sequences is exponential in the size of the input data
• n tuples, m mappings: m^n mapping sequences
• There are two subsets of queries that can be answered in PTIME by query rewriting:
  SELECT mailing-addr FROM T
  SELECT mailing-addr FROM T, V WHERE T.mailing-addr = V.hightech
• In general, query answering cannot be done by query rewriting
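To see why a query such as SELECT mailing-addr FROM T need not pay the m^n cost, here is a sketch that computes the same probabilities directly. It is not the rewriting from the paper; it only assumes that tuples choose mappings independently, so a value v is in the answer with probability 1 - prod_i (1 - p_i(v)), where p_i(v) is the probability that tuple i yields v. The data and all function names are hypothetical, and the exhaustive enumeration is included only as a reference to compare against.

```python
from itertools import product
from math import prod

# Hypothetical source table S(pname, email-addr, current-addr, permanent-addr).
S = [
    {"pname": "Alice", "email-addr": "alice@example.com",
     "current-addr": "Mountain View", "permanent-addr": "Sunnyvale"},
    {"pname": "Bob", "email-addr": "bob@example.com",
     "current-addr": "Sunnyvale", "permanent-addr": "Sunnyvale"},
    {"pname": "Carol", "email-addr": "carol@example.com",
     "current-addr": "Mountain View", "permanent-addr": "Palo Alto"},
]

# Source attribute used for T.mailing-addr under m1, m2, m3.
P_MAPPING = [
    ("current-addr", 0.5),
    ("permanent-addr", 0.4),
    ("email-addr", 0.1),
]


def contribution(row, p_mapping):
    """Probability that this single tuple yields each mailing-addr value."""
    contrib = {}
    for source_attr, p in p_mapping:
        value = row[source_attr]
        contrib[value] = contrib.get(value, 0.0) + p
    return contrib


def direct_probabilities(source, p_mapping):
    """Polynomial-time computation: one pass per tuple, no sequences."""
    per_tuple = [contribution(row, p_mapping) for row in source]
    values = {v for contrib in per_tuple for v in contrib}
    return {v: 1.0 - prod(1.0 - c.get(v, 0.0) for c in per_tuple)
            for v in values}


def enumerated_probabilities(source, p_mapping):
    """Reference computation: sum over all m^n mapping sequences."""
    result = {}
    for seq in product(p_mapping, repeat=len(source)):
        seq_prob = prod(p for _, p in seq)
        for value in {row[attr] for row, (attr, _) in zip(source, seq)}:
            result[value] = result.get(value, 0.0) + seq_prob
    return result


if __name__ == "__main__":
    direct = direct_probabilities(S, P_MAPPING)
    reference = enumerated_probabilities(S, P_MAPPING)
    for value in sorted(direct):
        print(f"{value}: {direct[value]:.4f} (enumerated {reference[value]:.4f})")
```

The shortcut works here because the query only projects one attribute; it is not meant to cover joins or the general case.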
One of Dt
Extensions to More Expressive Mappings
The complexity results for query answering carry over to three extensions to more expressive mappings:
• Complex mappings
• GLAV mappings
• Conditional mappings
Contributions
Definition of probabilistic mappings
Semantics: by-table vs. by-tuple
Complexity of query answering
Overview of MUD 2007
Theory
• A New Language and Architecture to Obtain Fuzzy Global Dependencies
• About the Processing of Division Queries Addressed to Possibilistic Databases
• Making Aggregation Work in Uncertain and Probabilistic Databases
Application
• Materialized Views in Probabilistic Databases
• Flexible matching of Ear Biometrics
• Consistent Joins Under Primary Key Constraints
A New Language and Architecture to Obtain Fuzzy Global Dependencies
SQL does not satisfy the minimum requirements to be a true data mining (DM) language
A New Language: dmFSQL (data mining Fuzzy Structured Query Language)
Fuzzy database, data mining
About the Processing of Division Queries Addressed to Possibilistic Databases
They devised a data model which is a strong representation system for operations in possibilistic databases
A possibilistic database D can be interpreted as a weighted disjunctive set of regular databases
Division Queries
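A small sketch of the "weighted disjunctive set of regular databases" reading, applied to a division query. It does not reproduce the data model from the workshop paper: the worlds, their possibility degrees, and the function names are invented for illustration, and answers are scored with the standard possibility and necessity measures, which is an assumption on my part.

```python
# Hypothetical possibilistic database: each entry pairs a possibility degree
# with one regular database, here a single relation R(supplier, part).
WORLDS = [
    (1.0, {("s1", "p1"), ("s1", "p2"), ("s2", "p1")}),
    (0.7, {("s1", "p1"), ("s1", "p2"), ("s2", "p1"), ("s2", "p2")}),
    (0.3, {("s1", "p1"), ("s2", "p1"), ("s2", "p2")}),
]

# Divisor relation: the parts a supplier must supply to be in the answer.
DIVISOR = {"p1", "p2"}


def divide(relation, divisor):
    """Relational division: suppliers paired with every part in the divisor."""
    suppliers = {s for s, _ in relation}
    return {s for s in suppliers
            if all((s, part) in relation for part in divisor)}


def possibility_and_necessity(worlds, divisor, supplier):
    """Score one candidate answer across all weighted worlds."""
    inside = [deg for deg, rel in worlds if supplier in divide(rel, divisor)]
    outside = [deg for deg, rel in worlds if supplier not in divide(rel, divisor)]
    possibility = max(inside, default=0.0)
    necessity = 1.0 - max(outside, default=0.0)
    return possibility, necessity


if __name__ == "__main__":
    for supplier in ("s1", "s2"):
        print(supplier, possibility_and_necessity(WORLDS, DIVISOR, supplier))
```

Evaluating the query world by world is exponential in general, which is the motivation for a representation that supports the operations directly.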
Making Aggregation Work in Uncertain and Probabilistic Databases
Trio is a prototype database management system for storing and querying data with uncertainty and lineage
Trio’s query language: TriQL
Trio data model and query semantics
Aggregation functions in the Trio system for uncertain and probabilistic data
Materialized Views in Probabilistic Databases
Materialized views over probabilistic databases may not define a unique probability distribution
View representation
Answer queries on large probabilistic data sets more efficiently with materialized views
Flexible matching of Ear Biometrics
Research area: image recognition (or identification)
Scenario: identifying found bodies in a large-scale disaster
Challenge: fast and cheap identification; no DNA databases or fingerprint databases are at hand
Consistent Joins Under Primary Key Constraints
Inconsistent database: primary key constraints are violated
Will the natural join of the repaired relations always be nonempty, no matter which tuples are selected?
Game theory, winning strategy
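The question can be made concrete with a brute-force check. This is a sketch, not the game-theoretic method mentioned on the slide: it assumes a repair keeps exactly one tuple per primary-key value, enumerates every combination of repairs, and tests whether the natural join is nonempty each time. The relations and function names are hypothetical.

```python
from itertools import product


def repairs(relation, key):
    """Yield every repair: exactly one tuple kept for each primary-key value."""
    groups = {}
    for row in relation:
        groups.setdefault(row[key], []).append(row)
    for choice in product(*groups.values()):
        yield list(choice)


def natural_join(r, s):
    """Join rows that agree on all attributes the two relations share."""
    joined = []
    for a in r:
        for b in s:
            shared = a.keys() & b.keys()
            if all(a[c] == b[c] for c in shared):
                joined.append({**a, **b})
    return joined


def join_always_nonempty(r, r_key, s, s_key):
    """True if the join is nonempty for every pair of repairs."""
    return all(natural_join(rr, ss)
               for rr in repairs(r, r_key)
               for ss in repairs(s, s_key))


# Hypothetical relations: WORKS violates its key "emp" (two tuples for e1).
WORKS = [{"emp": "e1", "dept": "d1"}, {"emp": "e1", "dept": "d2"}]
DEPTS_FULL = [{"dept": "d1", "city": "x"}, {"dept": "d2", "city": "y"}]
DEPTS_PARTIAL = [{"dept": "d1", "city": "x"}]

if __name__ == "__main__":
    print(join_always_nonempty(WORKS, "emp", DEPTS_FULL, "dept"))
    print(join_always_nonempty(WORKS, "emp", DEPTS_PARTIAL, "dept"))
```

The number of repairs grows exponentially with the number of violated keys, so this enumeration is only practical for toy inputs.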
Uncertainty in Deep Web
No “perfect” data: noise, dirty data, redundancy, ...
No “perfect” solution: Web data extraction, interface integration, ...
Uncertainty in Deep Web Data Integration (1)
[Figure: deep Web data integration architecture over Deep Web sources (Web DBs, RDB), with an Integrated Interface]
Modules: Interface Integration Module, Query Process Module, Result Process Module
Components: WDB Discovery, Interface Schema Extraction, WDB Clustering, Interface Integration, WDB Selection, Query Translation, Query Submission, Results Extraction, Results Annotation, Data Merging
• Robust
• Evaluable
Uncertainty in Deep Web Data Integration (2)
[Figure: the deep Web data integration architecture diagram, repeated from (1)]
• Tuning
• Feedback
• Evaluable
Uncertainty in Jobtong(1)
Data level
Uncertainty in Jobtong(2)
Query level
How can we give every result a probability to show its importance?
Uncertainty in Jobtong(3)
The automatic maintenance of configuration files
<record>
  <xpath>/html/body//table/tr[@class='nob']</xpath>
  <combination>2</combination>
  <items>
    <item>
      <name>title</name>
      <xpath>td[2]/a/span</xpath>
    </item>
    <item>
      <name>company</name>
      <xpath>td[3]/a/span</xpath>
    </item>
  </items>
</record>
<record>
  <xpath>/html/body//table/tr[@class='list2' or @class='list3']</xpath>
  <combination>2</combination>
  <items>
    <item>
      <name>title</name>
      <xpath>td[2]/a</xpath>
    </item>
    <item>
      <name>company</name>
      <xpath>td[3]/a</xpath>
    </item>
  </items>
</record>
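A sketch of how such a configuration record might drive extraction, using only Python's standard library. The result page is invented and well-formed, the function names are hypothetical, and the meaning of `combination` is not stated in the talk, so it is parsed but not interpreted. ElementTree supports only a subset of XPath: absolute paths are adapted to relative ones here, and the `or` predicate in the second record would need a fuller XPath engine.

```python
import xml.etree.ElementTree as ET

# First configuration record from the slide.
CONFIG = """
<record>
  <xpath>/html/body//table/tr[@class='nob']</xpath>
  <combination>2</combination>
  <items>
    <item><name>title</name><xpath>td[2]/a/span</xpath></item>
    <item><name>company</name><xpath>td[3]/a/span</xpath></item>
  </items>
</record>
"""

# Hypothetical, well-formed result page used only for this sketch.
PAGE = """
<html><body><div><table>
  <tr class="nob">
    <td>1</td>
    <td><a href="#"><span>Data Engineer</span></a></td>
    <td><a href="#"><span>Example Corp</span></a></td>
  </tr>
  <tr class="head"><td>ignored</td></tr>
</table></div></body></html>
"""


def load_record_config(xml_text):
    """Parse one <record> element into a plain dictionary."""
    rec = ET.fromstring(xml_text.strip())
    return {
        "xpath": rec.findtext("xpath"),
        "combination": int(rec.findtext("combination")),
        "items": {item.findtext("name"): item.findtext("xpath")
                  for item in rec.find("items")},
    }


def extract(page_text, config):
    """Apply the record and item paths; an empty result hints at a stale config."""
    root = ET.fromstring(page_text.strip())  # root is the <html> element
    row_path = config["xpath"]
    # ElementTree rejects absolute paths, so make the path relative to <html>.
    if row_path.startswith("/html/"):
        row_path = row_path[len("/html/"):]
    return [{name: row.findtext(path) for name, path in config["items"].items()}
            for row in root.findall(row_path)]


if __name__ == "__main__":
    print(extract(PAGE, load_record_config(CONFIG)))
```

When a site changes its layout, the record path stops matching and `extract` returns an empty list, which is one simple signal that a configuration file needs maintenance.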
Q&A
Thank you!