Answering Imprecise Queries over Web Databases
Ullas Nambiar and Subbarao Kambhampati
Department of Computer Science & Engineering, Arizona State University
VLDB, Aug 30 – Sep 02, 2005, Trondheim, Norway
Why Imprecise Queries?
Want a 'sedan' priced around $7000.
A feasible precise query:
Make = "Toyota", Model = "Camry", Price ≤ $7000
But what about the price of a Honda Accord?
Is there a Camry for $7100?
Solution: support imprecise queries.

Sample answers:
Make    Model   Price   Year
Toyota  Camry   $7000   1999
Toyota  Camry   $7000   2001
Toyota  Camry   $6700   2000
Toyota  Camry   $6500   1998
The Imprecise Query Answering Problem
Problem statement: Given a conjunctive query Q over a relation R, find a ranked set of tuples of R whose similarity to Q exceeds a threshold Tsim:
Ans(Q) = {x | x ∈ R, Similarity(Q, x) > Tsim}
Constraints:
– Autonomous database
  • Data accessible only by querying
  • Data model, operators etc. not modifiable
– Supports only the boolean (relational) query model
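The answer-set definition above can be sketched directly; this is a minimal illustration in which `similarity` and `t_sim` are placeholders for AIMQ's learned similarity measure and threshold, not the paper's actual implementation:

```python
def answer(query, relation, similarity, t_sim):
    """Ans(Q) = {x | x in R, Similarity(Q, x) > Tsim}, ranked by similarity."""
    scored = [(similarity(query, x), x) for x in relation]
    hits = [pair for pair in scored if pair[0] > t_sim]
    return sorted(hits, key=lambda p: p[0], reverse=True)

# Toy usage: a "relation" of prices and a normalized-distance similarity.
sim = lambda q, x: 1 - abs(q - x) / q
result = answer(7000, [6500, 6700, 7000, 7500], sim, 0.95)
```

Note that, unlike a plain similarity search, AIMQ must realize this set by issuing ordinary boolean queries against the autonomous source.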
Existing Approaches
Similarity search over a vector space
– Data must be stored as vectors of text (WHIRL, W. Cohen, 1998)
Enhanced database model
– Add a 'similar-to' operator to SQL, with distances provided by an expert/system designer (VAGUE, A. Motro, 1988)
– Support similarity search and query refinement over abstract data types (Binderberger et al, 2003)
User guidance
– Users provide information about the objects required and their possible neighborhood (Proximity Search, Goldman et al, 1998)
Limitations:
1. A user/expert must provide the similarity measures
2. New operators are needed to use the distance measures
3. Not applicable over autonomous databases
Motivation & Challenges
Objectives:
– Minimal burden on the end user
– No changes to the existing database
– Domain independent
Motivation:
– Mimic the relevance-based ranked-retrieval paradigm of IR systems
– Can we learn relevance statistics from the database?
– Use the estimated relevance model to improve the querying experience of users
Challenges:
– Estimating query-tuple similarity
  • Weighted summation of attribute similarities
  • Syntactic similarity is inadequate; we need to estimate semantic similarity, and there are not enough ontologies
– Measuring attribute importance
  • Not all attributes are equally important
  • Users cannot quantify importance
[Figure: the AIMQ system architecture. Wrappers over autonomous web data sources (DataSource 1 … DataSource n) feed a sample dataset collected by a Data Collector that probes the sources using random sample queries. In the data-processing phase, a Similarity Miner estimates value similarities and extracts concepts into a similarity matrix, while a Dependency Miner mines AFDs and approximate keys into weighted dependencies. At query time, the Query Engine maps the imprecise query to a precise query, identifies and executes similar (relaxed) queries, and ranks the results, returning ranked tuples to the user.]
The AIMQ approach
Given an imprecise query Q:
1. Map: convert "like" to "=", giving Qpr = Map(Q)
2. Derive the base set Abs = Qpr(R)
3. Use the base set as a set of relaxable selection queries
4. Using AFDs, find the relaxation order
5. Derive the extended set by executing the relaxed queries
6. Use concept similarity to measure tuple similarities
7. Prune tuples below the threshold
8. Return the ranked set
Query-Tuple Similarity
Tuples in the extended set show different levels of relevance. They are ranked according to their similarity to the corresponding tuples in the base set, using:
– n = Count(Attributes(R)), and Wimp, the importance weight of each attribute
– a normalized distance as similarity for numerical attributes, e.g. Price, Year
– VSim, the semantic value similarity estimated by AIMQ, for categorical attributes, e.g. Make, Model
Sim(Q, t) = Σ_{i=1..n} Wimp(Ai) × { VSim(Q.Ai, t.Ai)           if Dom(Ai) = Categorical
                                   { 1 − |Q.Ai − t.Ai| / Q.Ai   if Dom(Ai) = Numerical
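The similarity formula above can be sketched as follows. This is an illustrative implementation, not the authors' code; `vsim` is assumed to be a lookup of the learned categorical value similarities:

```python
def query_tuple_similarity(query, tup, w_imp, vsim, categorical):
    """Weighted sum of per-attribute similarities, per the slide's formula."""
    total = 0.0
    for attr, q_val in query.items():
        t_val = tup[attr]
        if attr in categorical:
            # Learned semantic similarity; fall back to exact match if unseen.
            s = vsim.get((attr, q_val, t_val), 1.0 if q_val == t_val else 0.0)
        else:
            # Normalized numeric distance: 1 - |Q.Ai - t.Ai| / Q.Ai
            s = 1.0 - abs(q_val - t_val) / q_val
        total += w_imp[attr] * s
    return total
```

For example, with equal weights on Make and Price, a Toyota at $6500 scores 27/28 against the query (Make = "Toyota", Price = 7000).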
Deciding Attribute Order
Mine AFDs and approximate keys, then create a dependence graph using the AFDs.
– The graph is strongly connected, hence a topological sort is not possible.
Instead, using the approximate key with the highest support, partition the attributes into
– a deciding set
– a dependent set
and sort the subsets using dependence and influence weights.
Measure attribute importance using the weight formula below.
Example: CarDB(Make, Model, Year, Price)
Decides: Make, Year; Depends: Model, Price
Relaxation order: Price, Model, Year, Make
1-attribute relaxations: {Price, Model, Year, Make}
2-attribute relaxations: {(Price, Model), (Price, Year), (Price, Make), …}
• Attribute relaxation order is all non-keys first, then keys
• Greedy multi-attribute relaxation
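The enumeration above can be sketched as a generator of relaxed queries: drop one attribute at a time following the relaxation order, then pairs, and so on. This is a sketch of the enumeration shown on the slide, not the authors' implementation:

```python
from itertools import combinations

def relaxation_sequence(query_attrs, relax_order):
    """Enumerate relaxed queries: all 1-attribute relaxations in the given
    order, then 2-attribute relaxations, etc. Each result is the tuple of
    attributes KEPT in the relaxed query."""
    ordered = [a for a in relax_order if a in query_attrs]
    seq = []
    for k in range(1, len(ordered)):
        for dropped in combinations(ordered, k):
            kept = tuple(a for a in query_attrs if a not in dropped)
            seq.append(kept)
    return seq
```

For CarDB, the first relaxation drops Price (the least important attribute), the second drops Model, matching the order on the slide.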
Wimp(Ai) = RelaxOrder(Ai) / count(Attributes(R)) × { Wtdecides(Ai) / Σi Wtdecides(Ai)   if Ai is in the deciding set
                                                    { Wtdepends(Ai) / Σi Wtdepends(Ai)   if Ai is in the dependent set
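A sketch of the importance-weight computation, assuming RelaxOrder(Ai) is the 1-based position of Ai in the relaxation order (an interpretation, not stated explicitly on the slide) and `wt` holds the decides/depends weights for whichever set Ai belongs to:

```python
def importance_weights(relax_order, wt, attrs):
    """Wimp(Ai) = RelaxOrder(Ai)/count(Attributes(R)) * wt(Ai)/sum(wt),
    so later-relaxed (more important) attributes get larger weights."""
    n = len(attrs)
    total = sum(wt[a] for a in attrs)
    rank = {a: i + 1 for i, a in enumerate(relax_order)}
    return {a: (rank[a] / n) * (wt[a] / total) for a in attrs}
```

With the CarDB order (Price, Model, Year, Make), Make, relaxed last, receives the largest weight.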
Empirical Evaluation
Goal:
– Test robustness of the learned dependencies
– Evaluate the effectiveness of the query relaxation and similarity estimation
Database:
– Used-car database CarDB(Make, Model, Year, Price, Mileage, Location, Color), based on Yahoo Autos
– Populated using 100k tuples from Yahoo Autos
Algorithms:
– AIMQ
  • RandomRelax – randomly picks an attribute to relax
  • GuidedRelax – uses the relaxation order determined from approximate keys and AFDs
– ROCK: RObust Clustering using linKs (Guha et al, ICDE 1999)
  • Computes neighbours and links between every pair of tuples
    (Neighbour – tuples similar to each other; Link – number of common neighbours between two tuples)
  • Clusters tuples having common neighbours
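The neighbour/link computation used by the ROCK baseline can be sketched as follows (an illustration of the definitions on the slide, with `similar` standing in for ROCK's neighbourhood criterion):

```python
def rock_links(tuples, similar):
    """Count 'links' (common neighbours) between every pair of tuples.
    A tuple's neighbours are all other tuples deemed similar to it."""
    n = len(tuples)
    neighbours = [{j for j in range(n)
                   if j != i and similar(tuples[i], tuples[j])}
                  for i in range(n)]
    links = {}
    for i in range(n):
        for j in range(i + 1, n):
            links[(i, j)] = len(neighbours[i] & neighbours[j])
    return links
```

ROCK then merges clusters with many links; pairs with zero links, like an outlier tuple, never cluster together.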
Robustness of Dependencies
[Figure: two charts. Left: dependence (0–0.4) of each dependent attribute (Model, Color, Year, Make) for sample sizes of 15k, 25k, 50k and 100k tuples. Right: quality (0–0.9) of the mined keys for the same sample sizes.]
Attribute dependence order & Key quality is unaffected by sampling
Robustness of Value Similarities

Value             Similar value    25k    100k
Make = "Kia"      Hyundai          0.17   0.17
                  Isuzu            0.15   0.15
                  Subaru           0.13   0.13
Make = "Bronco"   Aerostar         0.19   0.21
                  F-350            0      0.12
                  Econoline Van    0.11   0.11
Year = "1985"     1986             0.16   0.16
                  1984             0.13   0.14
                  1987             0.12   0.12
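The slides do not spell out how these value similarities are computed. One plausible co-occurrence-based estimator, offered purely as an illustration and not necessarily AIMQ's exact measure, compares the contexts (other attribute values) in which two values appear:

```python
def value_similarity(db, attr, v1, v2, other_attrs):
    """Jaccard similarity of the contexts in which two categorical values
    occur -- a hypothetical sketch of co-occurrence-based semantic
    similarity, not the authors' published estimator."""
    def context(v):
        return {(a, row[a]) for row in db if row[attr] == v
                for a in other_attrs}
    c1, c2 = context(v1), context(v2)
    return len(c1 & c2) / len(c1 | c2) if c1 | c2 else 0.0
```

Under such a scheme, Kia and Hyundai score high because they co-occur with similar models, years and prices in the sampled data.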
Efficiency of Relaxation
[Figure: two charts of work per relevant tuple over 10 test queries, for similarity thresholds ε = 0.5, 0.6 and 0.7 — left panel Random Relaxation (0–900), right panel Guided Relaxation (0–180).]
Random relaxation:
• On average 8 tuples extracted per relevant tuple for ε = 0.5; increases to 120 tuples for ε = 0.7.
• Not resilient to changes in ε.
Guided relaxation:
• On average 4 tuples extracted per relevant tuple for ε = 0.5; goes up to only 12 tuples for ε = 0.7.
• Resilient to changes in ε.
Accuracy over CarDB
• 14 queries over 100k tuples
• Similarity learned using a 25k sample
• Mean Reciprocal Rank (MRR) estimated as
  MRR(Q) = Avg( 1 / (|UserRank(ti) − AIMQRank(ti)| + 1) )
• Overall high MRR shows high relevance of the suggested answers
[Figure: average MRR (0–1) per query (1–14) for GuidedRelax, RandomRelax, and ROCK.]
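The MRR estimate above compares the rank a user assigns each answer tuple with the rank AIMQ assigns it; a sketch of the computation (illustrative, with ranks supplied as parallel lists):

```python
def mrr(user_ranks, aimq_ranks):
    """MRR(Q) = average of 1 / (|UserRank(ti) - AIMQRank(ti)| + 1)
    over the answer tuples; 1.0 means a perfect rank agreement."""
    return sum(1.0 / (abs(u - a) + 1)
               for u, a in zip(user_ranks, aimq_ranks)) / len(user_ranks)
```

For instance, if AIMQ swaps the user's second and third choices, the score drops from 1.0 to 2/3.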
AIMQ – Summary
An approach for answering imprecise queries over Web databases:
– Mines and uses AFDs to determine the attribute relaxation order
– Domain-independent semantic similarity estimation technique
– Automatically computes attribute importance scores
Empirical evaluation shows:
– Efficiency and robustness of the algorithms
– Better performance than current approaches
– High relevance of the suggested answers
– Domain independence