Query Processing over Incomplete Autonomous Web Databases MS Thesis Defense by Hemal Khatri...
Date posted: 18-Dec-2015
Query Processing over Incomplete Autonomous Web Databases
MS Thesis Defense by Hemal Khatri
Committee Members: Prof. Subbarao Kambhampati (chair), Prof. Chitta Baral, Prof. Yi Chen, Prof. Huan Liu

Introduction to Web databases
Many websites let users pose queries through a form-based interface and are backed by databases
Consider used-car websites such as Cars.com and Yahoo! Autos
[Figure label: Autonomous Database]

Incompleteness in Web databases
Web databases are often populated by lay individuals without any curation, e.g. Cars.com and Yahoo! Autos
Web databases are also populated using automated information extraction techniques, which are inherently imperfect
The local schema of a data source may not support certain attributes supported by the global schema
Incomplete/Uncertain tuple: a tuple in which one or more attributes have a missing value

Website          # of attributes   # of tuples   incomplete tuples   body style   engine
autotrader.com   13                25127         33.67%              3.6%         8.1%
carsdirect.com   14                32564         98.74%              55.7%        55.8%

Problem Statement
Many entities corresponding to tuples with missing values might be relevant to the user query
Current query processing techniques return only answers that exactly satisfy the user query
– Such techniques return results with high precision but low recall
Relevant Uncertain tuple: a tuple that does not exactly satisfy the query predicates, but whose represented entity might be relevant to the query
How to support query processing over incomplete autonomous databases in order to retrieve ranked uncertain results?
Example: Q: Make=Honda, with the tuple (null, Accord, 2003, sedan)

Challenges Involved
How to predict missing values in autonomous databases?
As autonomous databases are accessible only through form-based interfaces, how to retrieve relevant uncertain answers?
– How to keep query processing cost manageable in retrieving uncertain tuples?
How to rank the retrieved uncertain answers?

Related Work
Probabilistic databases
– Incomplete databases are similar to probabilistic databases once we assess the probabilities for missing values
– TRIO: uncertainty with lineage
– ConQuer: handling inconsistency over databases
• Assume probability distributions are given for uncertain or inconsistent attributes
– We assess the probability distribution for the missing attribute and use it to rank rewritten queries to retrieve relevant answers, since the probabilities cannot be stored in the databases
– Our query rewriting framework is general and can be used by these systems if the databases are autonomous
Handling Missing Values
– EM algorithm, Bayes Net, Association rules

Possible Approaches
For a query Q: body style = convt
1. Certain Answers Only (CAO): Return certain answers only, as in traditional databases (low recall)
2. All Uncertain Answers (AUA): Null matches any concrete value, hence return all answers having body style=convt along with answers having body style as null (low precision, infeasible)
3. Relevant Uncertain Answers (RUA): Rank answers by predicting the value of the missing attribute (costly, infeasible)
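
The three approaches can be contrasted on a toy in-memory relation. This is only a sketch: the table, the function names and the `toy_predict` predictor are hypothetical, and a real web source would only be reachable through its form interface, which is why AUA and RUA are marked infeasible above.

```python
# Toy relation; None stands for a null value. All data here is illustrative.
CARS = [
    {"model": "a4", "body_style": "convt"},
    {"model": "z4", "body_style": None},
    {"model": "civic", "body_style": "sedan"},
    {"model": "a4", "body_style": None},
]


def certain_answers(rows, attr, value):
    """CAO: only tuples whose attribute is present and equals the query value."""
    return [r for r in rows if r[attr] == value]


def all_uncertain_answers(rows, attr, value):
    """AUA: certain answers plus every tuple with a null attribute, unranked."""
    return [r for r in rows if r[attr] == value or r[attr] is None]


def relevant_uncertain_answers(rows, attr, value, predict):
    """RUA: certain answers first, then null-valued tuples ranked by prediction."""
    certain = certain_answers(rows, attr, value)
    uncertain = [r for r in rows if r[attr] is None]
    uncertain.sort(key=lambda r: predict(r, value), reverse=True)
    return certain + uncertain


def toy_predict(row, value):
    """Hypothetical predictor for P(body_style=value | model); numbers are made up."""
    return {"z4": 1.0, "a4": 0.4}.get(row["model"], 0.0)


ranked = relevant_uncertain_answers(CARS, "body_style", "convt", toy_predict)
```

In this sketch CAO misses both null tuples, AUA returns them in arbitrary order, and RUA places the z4 tuple ahead of the a4 tuple.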

Outline
Introduction
QPIAD: Query Processing over Incomplete Autonomous Databases
Data Integration over Incomplete Autonomous Databases
Other Contributions
Conclusion

QPIAD System Architecture
[Figure: system architecture diagram]

RRUA: Generating Rewritten Queries
The Restricted Relevant Uncertain Answers (RRUA) approach retrieves only the relevant incomplete tuples, instead of retrieving all tuples as in AUA and RUA
Consider a query Q: Body style=convt

Base Result Set RS(Q):
Make      Model     Year   Price   Body style
Audi      a4        2004   20000   convt
BMW       z4        2003   17000   convt
Porsche   boxster   2000   13000   convt
…         …         …      …       …

Rewritten queries are based on the determining attribute set (dtrSet) from the AFD for Body style: Model ~~> Body style (confidence 0.9)
Q1: model=‘a4’
Q2: model=‘z4’
Q3: model=‘boxster’
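
A minimal sketch of the rewriting step, assuming the base result set is available as a list of dictionaries and the determining set is given as a list of attribute names. The function names are hypothetical; the rows are those shown on the slide.

```python
def rewritten_queries(base_result_set, dtr_set):
    """Project the base result set onto dtr_set; one selection query per distinct value."""
    seen = set()
    queries = []
    for row in base_result_set:
        key = tuple((attr, row[attr]) for attr in dtr_set)
        # Skip projections that contain nulls or that were already generated.
        if any(value is None for _, value in key) or key in seen:
            continue
        seen.add(key)
        queries.append(dict(key))
    return queries


def keep_uncertain(rows, attr):
    """From the answers to a rewritten query, keep only tuples with a null query attribute."""
    return [r for r in rows if r[attr] is None]


RS_Q = [
    {"make": "Audi", "model": "a4", "year": 2004, "price": 20000, "body_style": "convt"},
    {"make": "BMW", "model": "z4", "year": 2003, "price": 17000, "body_style": "convt"},
    {"make": "Porsche", "model": "boxster", "year": 2000, "price": 13000, "body_style": "convt"},
]

queries = rewritten_queries(RS_Q, ["model"])
```

Each rewritten query would then be issued through the source's form interface, and `keep_uncertain` drops the tuples that the original query already returned as certain answers.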

Learning Attribute Correlations
AFD: VIN ~~> Model, where VIN is an Approximate Key (AKey) with high confidence
VIN will not be useful for query rewriting and feature selection, since it will not be able to retrieve additional new tuples
[Figure: Sample Database -> TANE Algorithm -> AFDs and AKeys -> Prune AFDs based on AKeys -> AFDs for Query Rewriting and Feature Selection in classifier]
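
The pruning idea can be sketched on a small sample. The slides do not define the confidence measures, so this sketch assumes one common choice: the confidence of X ~~> A is the largest fraction of tuples that can be kept so that the dependency holds exactly, and the confidence of an AKey is the fraction of tuples with distinct key values. It does not implement TANE itself; the candidate AFDs, function names and the 0.9 threshold are hypothetical.

```python
from collections import Counter, defaultdict


def afd_confidence(rows, lhs, rhs):
    """Fraction of tuples kept when each lhs group keeps only its most common rhs value."""
    groups = defaultdict(Counter)
    for r in rows:
        groups[tuple(r[a] for a in lhs)][r[rhs]] += 1
    kept = sum(counts.most_common(1)[0][1] for counts in groups.values())
    return kept / len(rows)


def akey_confidence(rows, attrs):
    """Fraction of tuples that remain if attrs must be an exact key."""
    distinct = {tuple(r[a] for a in attrs) for r in rows}
    return len(distinct) / len(rows)


def prune_afds(rows, afds, akey_threshold=0.9):
    """Drop AFDs whose determining set is (approximately) a key of the sample."""
    return [(lhs, rhs) for lhs, rhs in afds
            if akey_confidence(rows, lhs) < akey_threshold]


SAMPLE = [
    {"vin": "v1", "model": "a4", "body_style": "convt"},
    {"vin": "v2", "model": "a4", "body_style": "sedan"},
    {"vin": "v3", "model": "z4", "body_style": "convt"},
    {"vin": "v4", "model": "z4", "body_style": "convt"},
]

candidates = [(["vin"], "model"), (["model"], "body_style")]
useful = prune_afds(SAMPLE, candidates)
```

On this sample VIN is an exact key, so VIN ~~> Model is pruned, while Model ~~> Body style survives with confidence 0.75.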

RRUA: Ranking Rewritten Queries
Not all queries are equally good at retrieving relevant answers
– “z4” model cars are more likely to be convertibles than “a4” model cars
When database or network resources are limited, the mediator can choose to issue only the top K queries to get the most relevant uncertain answers

Learning Value Distributions
Value distributions are used to rank queries based on the determining set of attributes from the AFD for the query attribute
We use a Naïve Bayes Classifier with m-estimates, with the AFD as a feature selection step
Rank of a rewritten query Qi: R(Qi) = P(Am=vm | ti), where ti ∈ Π_dtrSet(Am)(RS(Q))
– Q1: model=‘a4’, R(Q1) = P(bodystyle=convt | model=a4) = 0.4
– Q2: model=‘z4’, R(Q2) = P(bodystyle=convt | model=z4) = 1.0
– Q3: model=‘boxster’, R(Q3) = P(bodystyle=convt | model=boxster) = 0.7
R(Q2) > R(Q3) > R(Q1)
Relevant uncertain answers are ranked based on the rank of the rewritten query that retrieved them
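
A rough sketch of how such ranks could be computed with a Naïve Bayes estimate and m-estimate smoothing. The sample data, the choice m=1 and the uniform prior p = 1/|domain| are assumptions made for illustration, so the resulting numbers differ from the 0.4, 1.0 and 0.7 quoted on the slide.

```python
def m_estimate(count, total, m, p):
    """Smoothed conditional probability: (count + m*p) / (total + m)."""
    return (count + m * p) / (total + m)


def posterior(sample, target, value, evidence, m=1.0):
    """P(target=value | evidence) under Naive Bayes with m-estimate smoothing."""
    scores = {}
    for c in {r[target] for r in sample}:
        rows_c = [r for r in sample if r[target] == c]
        score = len(rows_c) / len(sample)  # class prior
        for attr, v in evidence.items():
            domain = {r[attr] for r in sample}
            n = sum(1 for r in rows_c if r[attr] == v)
            score *= m_estimate(n, len(rows_c), m, 1.0 / len(domain))
        scores[c] = score
    total = sum(scores.values())
    return scores.get(value, 0.0) / total if total else 0.0


def rank_queries(sample, target, value, queries, m=1.0):
    """Order rewritten queries by the estimated probability of the query value."""
    ranked = [(posterior(sample, target, value, q, m), q) for q in queries]
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return ranked


SAMPLE = [
    {"model": "a4", "body_style": "convt"},
    {"model": "a4", "body_style": "sedan"},
    {"model": "a4", "body_style": "sedan"},
    {"model": "z4", "body_style": "convt"},
    {"model": "z4", "body_style": "convt"},
]

ranked = rank_queries(SAMPLE, "body_style", "convt",
                      [{"model": "a4"}, {"model": "z4"}])
top_k = ranked[:1]  # a mediator with limited resources issues only the top K queries
```

Here the evidence attributes play the role of the determining set chosen by the AFD, which is the feature selection step mentioned above.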

Combining AFDs and Classifiers
More than one AFD may exist for some attributes
Experimented with several approaches:
– Only the best AFD, i.e. the one with the highest confidence
– All attributes, ignoring AFDs
– Hybrid One-AFD
– Ensemble of classifiers

Empirical Evaluation of QPIAD
Test Databases: AutoTrader database containing 100K tuples and Census database from UCI Repository containing 50K tuples
Oracular study: To evaluate the effectiveness of our system against a ground truth, we artificially insert missing values in 10% of the tuples within these databases

RRUA vs AUA vs RUA
[Figure: precision vs. recall for Q: education=bachelors; curves AUA (467), RUA (467), RRUA (204)]
[Figure: precision vs. recall for Q: body style=convt; curves AUA (1245), RUA (1245), RRUA (209)]

Precision over Top K Tuples
[Figure: precision vs. top K tuples for Q: body style=convt; curves AUA, RUA, RRUA]
[Figure: precision vs. top K tuples for Q: education=bachelors; curves AUA, RUA, RRUA]

Ranking the Rewritten Queries
[Figure: average accumulated precision vs. Kth query, Cars database]
[Figure: average accumulated precision vs. Kth query, Census database]

Robustness of QPIAD
[Figure: accumulated precision vs. Kth query for Q: workshop=private; series 3%, 5%, 10%, 15%]

User Relevance Issues with QPIAD
When the query processor presents incomplete tuples, it becomes a recommender system
For a query Q: year=2000, how to convince users to trust the system's results?

Make    Model   Year   Price   Mileage
Honda   Civic   null   15000   18000

Explanation: We have determined that this car’s year is 60% likely to be 2000 based on price=15000 and mileage=18000

Leveraging Correlations between Data Sources
Mediator: GS(Make,Model,Year,Price,Mileage,Bodystyle)
Q: Body style=coupe

Correlated Source and Maximum Correlated Source
Consider four sources with schema:
– S1(Make,Model,Year,Price)
– S2(Engine,Drive,Bodystyle)
• AFD: {Engine, Drive} -> Body style, confidence 0.7
– S3(Make,Model,Body style)
• AFD: Model -> Body style, confidence 0.8
– S4(Make,Price,Body style)
• AFD: {Make, Price} -> Body style, confidence 0.6
– Mediator global schema GS(Make,Model,Year,Price,Bodystyle,Engine,Drive)
S3 and S4 are correlated sources with S1 on the Body style attribute
S3 is the maximum correlated source for S1 on the Body style attribute
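
The example above can be reproduced with a short sketch. The slides do not spell out the definition, so the rule used here is an assumption inferred from the example: a source is correlated with the target on an attribute if it supports that attribute and has an AFD for it whose determining set lies entirely within the target's schema. The maximum correlated source is then the one with the highest AFD confidence. The data structure and function names are hypothetical.

```python
# Each source: its local schema and its AFDs as (determining set, attribute, confidence).
SOURCES = {
    "S1": {"schema": {"make", "model", "year", "price"}, "afds": []},
    "S2": {"schema": {"engine", "drive", "body_style"},
           "afds": [({"engine", "drive"}, "body_style", 0.7)]},
    "S3": {"schema": {"make", "model", "body_style"},
           "afds": [({"model"}, "body_style", 0.8)]},
    "S4": {"schema": {"make", "price", "body_style"},
           "afds": [({"make", "price"}, "body_style", 0.6)]},
}


def correlated_sources(sources, target, attr):
    """Sources whose AFD for attr only uses attributes the target source supports."""
    target_schema = sources[target]["schema"]
    found = []
    for name, src in sources.items():
        if name == target or attr not in src["schema"]:
            continue
        for lhs, rhs, conf in src["afds"]:
            if rhs == attr and lhs <= target_schema:
                found.append((name, conf))
    return found


def maximum_correlated_source(sources, target, attr):
    """Correlated source with the highest AFD confidence, or None if there is none."""
    candidates = correlated_sources(sources, target, attr)
    return max(candidates, key=lambda c: c[1])[0] if candidates else None


best = maximum_correlated_source(SOURCES, "S1", "body_style")
```

S2 is excluded because S1 supports neither Engine nor Drive, so queries rewritten on those attributes could not be issued to S1.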

Retrieving Relevant Uncertain Answers from CarsDirect.com
Consider a query Q: body style = coupe (GS)
Cars.com has an AFD: Model ~~> Body style (0.9)
Cars.com is the maximum correlated source for CarsDirect.com, which doesn’t support Body style but supports the Model attribute

Make    Model     Year   Price   Body style
Honda   Accord    2003   19000   coupe
Ford    Mustang   2004   29100   coupe
Acura   Legend    1997   12000   coupe
BMW     325       2003   28000   coupe

Q1: model=Accord
Q2: model=Mustang
Q3: model=Legend
Q4: model=325

Empirical Evaluation of using Correlation between Data Sources
We consider a mediator performing data integration over three sources: Cars.com, Yahoo! Autos and CarsDirect.com
Yahoo! Autos and CarsDirect.com do not allow querying on body style, but once tuples are retrieved we can check their body style attribute to determine whether they have the body style specified in the query
Evaluation uses attribute correlations and value distributions learned from Cars.com for 5 test queries on the body style attribute

Retrieving Relevant Answers using Correlations from Cars.com
[Figure: precision vs. Kth tuple, Yahoo! Autos]
[Figure: precision vs. Kth tuple, Carsdirect.com]

Handling Joins over Incomplete Autonomous databases
Mediator performing data integration across two sources:
– Source S1 is incomplete
– Source S2 is complete

Source   Local Schema
S1       Cars(Make,Model,Year,Price)
S2       Review(Model,Ratings)

Mediator View
UsedCars(Make,Model,Year,Price,Ratings) :- Cars(Make,Model,Year,Price), Review(Model,Ratings)

Issues in Handling Joins
Performing joins over probabilistic databases will lead to a disjunction in join results
Consider joining uncertain tuples from the two sources:

Make    Model                           Year   Price
Honda   null [0.6 Civic] [0.4 Accord]   2003   18000

Model    Ratings
Civic    5
Accord   4

Join result (a disjunction of two tuples):
Make    Model    Year   Price   Ratings   Probability
Honda   Civic    2003   18000   5         0.6
or
Honda   Accord   2003   18000   4         0.4

Approximation
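
The disjunction can be made concrete with a small sketch that enumerates one candidate join result per possible value of the missing Model. The slide does not say which approximation is applied, so keeping only the most likely candidate at the end is just one possibility, shown here as an assumption. Function and variable names are hypothetical.

```python
def join_uncertain(car, model_dist, reviews):
    """Return (probability, joined tuple) pairs, one per possible model, best first."""
    results = []
    for model, prob in model_dist.items():
        if model in reviews:
            joined = dict(car, model=model, ratings=reviews[model])
            results.append((prob, joined))
    results.sort(key=lambda pair: pair[0], reverse=True)
    return results


car = {"make": "Honda", "model": None, "year": 2003, "price": 18000}
model_dist = {"Civic": 0.6, "Accord": 0.4}   # predicted distribution for the null Model
reviews = {"Civic": 5, "Accord": 4}          # Review(Model, Ratings)

candidates = join_uncertain(car, model_dist, reviews)
most_likely = candidates[0]  # assumed approximation: keep only the top candidate
```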

Handling Join Queries
Q: σMake=Honda(UsedCars)
Assume AFDs: {Make,Year} ~~> Model, Model ~~> Make

Source S1:
Make     Model(FK)   Year   Price
Honda    Odyssey     2000   10000
Honda    Accord      2004   20000
Honda    null        2000   15000
null     Accord      2002   18000
Toyota   Camry       2003   16000

Source S2:
Model(PK)   Ratings
Civic       5
Corolla     4
Accord      4
Altima      3
Camry       5
Odyssey     3

Rewritten queries:
Q1: Model=Odyssey, R(Q1)=1
Q2: Model=Accord, R(Q2)=1
Predicted values for the missing Model: 0.6 Civic, 0.4 Accord

Queries on source S2 to join:
Q3: Model=Odyssey, R(Q3)=1
Q4: Model=Accord, R(Q4)=1
Q5: Model=Civic, R(Q5)=0.6

Join results:
Make    Model     Year   Price   Ratings
Honda   Odyssey   2000   10000   3
Honda   Accord    2004   20000   4
null    Accord    2002   18000   4
Honda   null      2000   15000   5
Ranks shown beside the join results: 1.0, 0.6
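
One way to derive the ranked queries on S2 from the S1 tuples is sketched below. The combination rule is an assumption: a tuple's rank is taken as the product of the probability that its Make matches the query and the probability of each candidate Model, and each Model keeps the best rank seen. The predictor functions are hypothetical stand-ins for the learned value distributions, returning the numbers shown on the slide.

```python
S1_CARS = [
    {"make": "Honda", "model": "Odyssey", "year": 2000, "price": 10000},
    {"make": "Honda", "model": "Accord", "year": 2004, "price": 20000},
    {"make": "Honda", "model": None, "year": 2000, "price": 15000},
    {"make": None, "model": "Accord", "year": 2002, "price": 18000},
    {"make": "Toyota", "model": "Camry", "year": 2003, "price": 16000},
]


def make_prob(car):
    """Hypothetical P(Make=Honda | Model) for tuples whose Make is null."""
    return {"Accord": 1.0, "Odyssey": 1.0}.get(car["model"], 0.0)


def model_dist(car):
    """Hypothetical predicted distribution over Model for tuples whose Model is null."""
    return {"Civic": 0.6, "Accord": 0.4}


def s2_join_queries(cars, query_make, make_prob, model_dist):
    """Collect Model=... queries for S2, each with the best rank over contributing tuples."""
    ranks = {}
    for car in cars:
        if car["make"] is None:
            p_make = make_prob(car)
        elif car["make"] == query_make:
            p_make = 1.0
        else:
            continue  # certainly not an answer to the selection on Make
        models = {car["model"]: 1.0} if car["model"] is not None else model_dist(car)
        for model, p_model in models.items():
            ranks[model] = max(p_make * p_model, ranks.get(model, 0.0))
    return sorted(ranks.items(), key=lambda kv: (-kv[1], kv[0]))


join_queries = s2_join_queries(S1_CARS, "Honda", make_prob, model_dist)
```

With these inputs the sketch yields the same three queries and ranks as Q3 to Q5 above, up to the order of the two queries with rank 1.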

Experimental Results Joins
[Figure: precision vs. recall for Q: ratings=4; curves RUA (2475), RRUA (157)]
[Figure: precision vs. recall for Q: make=audi; curves RUA (4892), RRUA (58)]
[Figure: precision vs. recall for Q: model=Civic; curves RUA (2475), RRUA (24)]

QUIC: Querying under Imprecision and Incompleteness
Consider a query Q: model like Civic (Cars)
The user might be interested in similar cars like “Accord”, “Camry”, etc.
Ranking results in the presence of both similar and incomplete tuples

Id   Make     Model     Year   Body style
1    Honda    Civic     2000   Sedan
2    Honda    Accord    2004   Coupe
3    Toyota   Camry     2001   Sedan
4    Honda    null      2004   Coupe
5    Honda    null      2000   Sedan
6    Honda    Civic     2004   Coupe
7    BMW      3series   2001   convt
8    Toyota   null      1999   sedan

Other Contributions [*Collaboration with Garrett Wolf]
Handling multi-attribute selection queries for incomplete databases*
QUIC system for query processing under imprecision and incompleteness
Online learning of value distribution based on base result set to avoid sample biases

Conclusion
The thesis proposed a framework for query processing over incomplete autonomous web databases:
– QPIAD: Query processing over incomplete autonomous databases
– QPIAD: Data Integration over multiple incomplete data sources
Results of empirical evaluation on real-world databases show that our system returns relevant answers with high precision while keeping the query processing cost manageable
Thank You!!
Questions??