Post on 28-Mar-2015
ICDT 2005
An Abstract Framework for Generating Maximal Answers to Queries
Sara Cohen, Yehoshua Sagiv
ICDT 2005
Motivation
Queries and Databases
Answers and Semantics
Graph Properties
ICDT 2005
The Problem
In many different domains, we are given the option to query some source of information
Usually, the user only gets results if the query can be completely answered (satisfied)
In many domains, this is not appropriate, e.g., The user is not familiar with the database The database does not contain complete information There is a mismatch between the ontology of the user
and that of the database The query is a “search” that is not expected to be
correct
ICDT 2005
Search for papers by “Smith” that appeared in
ICDT 2004
ICDT 2005
Sorry, no matching record found
ICDT 2005
Search for buses from “Haifa-Technion” to “Ben Gurion Airport”
ICDT 2005
There is no direct bus line between the required
destinations
ICDT 2005
Search for buses to “Ben Gurion Airport”
ICDT 2005
Must choose From and To
ICDT 2005
What Do Users Need?
Users need a way to get interesting partial answers to their queries, especially if a complete answer does not exist
These partial answers should contain maximal information
Main Problems: What should be the semantics of partial answers? How can all partial answers be efficiently computed?
ICDT 2005
Previous Work
Many solutions have been given for the main problems solutions differ, according to the problem domain
Examples: Full disjunctions: Galindo-Legaria (94), Rajaraman,
Ullman (96), Kanza, Sagiv (03) Queries with incomplete answers over semistructured
data: Kanza, Nutt, Sagiv (99) FleXPath: Amer-Yahia, Lakshmanan, Pandit (04) Interconnections: Cohen, Kanza, Sagiv (03)
ICDT 2005
Our Contribution In the past, for each semantics considered, the
query evaluation problem had to be studied anew. In this paper, we: Present a general framework for defining semantics
for partial answers Framework is general enough to cover most
previously studied semantics Query evaluation problem can be solved once within
this framework – and reused for new semantics Results improve upon previous evaluation algorithms Presents relationship between this problem and that
of the maximal P-subgraph problem
ICDT 2005
Motivation
Queries and Databases
Answers and Semantics
Graph Properties
ICDT 2005
Databases
Databases are modeled as data graphs: (V, E, r, lV, lE)r: Can have a designated root lV: Labels on the vertices lE: Labels on the edges
Note:Nodes correspond to data itemsEven databases that do not have an inherent
graph structure can be modeled as graphs, e.g., relational databases
ICDT 2005
XML as a Data Graph
Technion
University
NameDept Dept
Name Faculty Name Faculty
Professor
Name Teaches Teaches
Lecturer
Name Teaches
ComputerScience
ChanaIsraeli
Databases Bioinformatics AviLevy
Biology
MolecularBiology
ICDT 2005
Relational Database as a Data Graph
Country Climate
Canadadiverse
UKtemporate
USAtemporate
Country City Hotel
UKLondonPlaza
CanadaMontrealHitlon
Canada TorontoRamada
Country City Site
UKLondonBuckingham
USANYMetropolitan
Climates Sites
Accommodations
ICDT 2005
Relational Database as a Data Graph
Country Climate
Canadadiverse
UKtemporate
USAtemporate
Country City Hotel
UKLondonPlaza
CanadaMontrealHitlon
Canada TorontoRamada
Country City Site
UKLondonBuckingham
USANYMetropolitan
Climates Sites
Accommodations(C, (Canada, diverse))
(C, (UK, temporate))
(C, (USA, temporate))
ICDT 2005
Relational Database as a Data Graph
Country City Hotel
UKLondonPlaza
CanadaMontrealHitlon
Canada TorontoRamada
Country City Site
UKLondonBuckingham
USANYMetropolitan
Sites
Accommodations
(C, (Canada, diverse))
(C, (UK, temporate))
(C, (USA, temporate))
(A, (UK, London, Plaza))
(A, (Canda, Montreal, Hilton))
(A, (Canda, Toronto, Ramada))
ICDT 2005
Relational Database as a Data Graph
Country City Site
UKLondonBuckingham
USANYMetropolitan
Sites(C, (Canada, diverse))
(C, (UK, temporate))
(C, (USA, temporate))
(A, (UK, London, Plaza))
(A, (Canda, Montreal, Hilton))
(A, (Canda, Toronto, Ramada))
(S, (UK, London, Buckingham))
(S, (USA, NY, Metropolitan))
ICDT 2005
Relational Database as a Data Graph
(C, (Canada, diverse))
(C, (UK, temporate))
(C, (USA, temporate))
(A, (UK, London, Plaza))
(A, (Canda, Montreal, Hilton))
(A, (Canda, Toronto, Ramada))
(S, (UK, London, Buckingham))
(S, (USA, NY, Metropolitan))
ICDT 2005
Queries
Queries are modeled as query graphs: (V, E, r, CV, CE, s) r: Can have a designated root CV : Vertex constraints on the vertices (basically, a
boolean function on vertices) CE : Edge constraints on the edges (basically, a
boolean function on pairs of vertices) s: A structural constraint, one of the letters C, R, N
(defines the required structure of answers, i.e., connected, rooted or none)
Note: Nodes correspond to query variables
ICDT 2005
= Dept and ContainsText(Biology)
XML Query as a Graph
Returns faculty members from the Biology Department
= University
= Faculty
= Name
Is Descendent
Is GrandChild
Is Child
Structural Constraint: Rooted
ICDT 2005
Join Query as a Graph
C A S
Belongs to: C
Belongs to: A Belongs to: S
C.Country = A.Country
C.Country = S.Country
A.Country = S.Company and A.City = S.City
q1
q2 q3
Structural Constraint: Connected
ICDT 2005
Motivation
Queries and Databases
Answers and Semantics
Graph Properties
ICDT 2005
Assignment Graphs
Assignment graphs are used to compactly represent assignments of query nodes to database nodes
Basically, assignment graph for Q and D, written QD has:Node (q,d) for each pair q Q and d D such
that d satisfies the constraint on qEdge ((q,d), (q’,d’)) if there is an edge (q,q’)
in Q and (d,d’) satisfies the constraint on (q,q’)May also have a root (details omitted)
ICDT 2005
Belongs to: A q2
Belongs to: C
Belongs to: S
C.Country = A.Country
C.Country = S.Country
A.Country = S.Company and A.City = S.City
q1
q3
(C, (Canada, diverse))
(C, (UK, temporate))
(C, (USA, temporate))
(A, (UK, London, Plaza))
(A, (Canda, Montreal, Hilton))
(A, (Canda, Toronto, Ramada))
(S, (UK, London, Buckingham))
(S, (USA, NY, Metropolitan))
c1
c2
c3
a1
a2
s1
s2
a3
(q1, c1)
(q1, c2)
(q1, c3)
ICDT 2005
Belongs to: A q2
Belongs to: C
Belongs to: S
C.Country = A.Country
C.Country = S.Country
A.Country = S.Company and A.City = S.City
q1
q3
(C, (Canada, diverse))
(C, (UK, temporate))
(C, (USA, temporate))
(A, (UK, London, Plaza))
(A, (Canda, Montreal, Hilton))
(A, (Canda, Toronto, Ramada))
(S, (UK, London, Buckingham))
(S, (USA, NY, Metropolitan))
c1
c2
c3
a1
a2
s1
s2
a3
(q1, c1)
(q1, c2)
(q1, c3)
(q2, a1)
(q2, a2)
(q2, a3)
ICDT 2005
Belongs to: A q2
Belongs to: C
Belongs to: S
C.Country = A.Country
C.Country = S.Country
A.Country = S.Company and A.City = S.City
q1
q3
(C, (Canada, diverse))
(C, (UK, temporate))
(C, (USA, temporate))
(A, (UK, London, Plaza))
(A, (Canda, Montreal, Hilton))
(A, (Canda, Toronto, Ramada))
(S, (UK, London, Buckingham))
(S, (USA, NY, Metropolitan))
c1
c2
c3
a1
a2
s1
s2
a3
(q3, s1)
(q3, s2)
(q1, c1)
(q1, c2)
(q1, c3)
(q2, a1)
(q2, a2)
(q2, a3)
ICDT 2005
Belongs to: A q2
Belongs to: C
Belongs to: S
C.Country = A.Country
C.Country = S.Country
A.Country = S.Company and A.City = S.City
q1
q3
(C, (Canada, diverse))
(C, (UK, temporate))
(C, (USA, temporate))
(A, (UK, London, Plaza))
(A, (Canda, Montreal, Hilton))
(A, (Canda, Toronto, Ramada))
(S, (UK, London, Buckingham))
(S, (USA, NY, Metropolitan))
c1
c2
c3
a1
a2
s1
s2
a3
(q3, s1)
(q3, s2)
(q1, c1)
(q1, c2)
(q1, c3)
(q2, a1)
(q2, a2)
(q2, a3)
ICDT 2005
Belongs to: A q2
Belongs to: C
Belongs to: S
C.Country = A.Country
C.Country = S.Country
A.Country = S.Company and A.City = S.City
q1
q3
(C, (Canada, diverse))
(C, (UK, temporate))
(C, (USA, temporate))
(A, (UK, London, Plaza))
(A, (Canda, Montreal, Hilton))
(A, (Canda, Toronto, Ramada))
(S, (UK, London, Buckingham))
(S, (USA, NY, Metropolitan))
c1
c2
c3
a1
a2
s1
s2
a3
(q3, s1)
(q3, s2)
(q1, c1)
(q1, c2)
(q1, c3)
(q2, a1)
(q2, a2)
(q2, a3)
ICDT 2005
Belongs to: A q2
Belongs to: C
Belongs to: S
C.Country = A.Country
C.Country = S.Country
A.Country = S.Company and A.City = S.City
q1
q3
(C, (Canada, diverse))
(C, (UK, temporate))
(C, (USA, temporate))
(A, (UK, London, Plaza))
(A, (Canda, Montreal, Hilton))
(A, (Canda, Toronto, Ramada))
(S, (UK, London, Buckingham))
(S, (USA, NY, Metropolitan))
c1
c2
c3
a1
a2
s1
s2
a3
(q3, s1)
(q3, s2)
(q1, c1)
(q1, c2)
(q1, c3)
(q2, a1)
(q2, a2)
(q2, a3)
ICDT 2005
Partial Assignment
A partial assignment is any subgraph of QD that does not contain two different nodes (q,d) and (q,d’)otherwise, would map the node q to two
different database nodes Can distinguish special types of partial
assignments:vertex completeedge completestructurally consistent
Every query node must appear in the partial
assignment
Every edge constraint between query
variables in the partial assignment holds
The partial assignment satisfies the query’s structural constraint
ICDT 2005
Vertex Complete,
Edge Complete,
Structurally Consistent
Vertex Complete,
Edge Complete,
Structurally Consistent
Vertex Complete,
Edge Complete,
Structurally Consistent
Example
(q3, s1)
(q3, s2)
(q1, c1)
(q1, c2)
(q1, c3)
(q2, a1)
(q2, a2)
(q2, a3)
ICDT 2005
Semantics All partial assignments for Q over D that satisfy
the vertex and edge constraints are encoded in QD
A semantics defines which subgraphs of the answer graph (i.e., which partial assignments) are in fact answers, e.g., Sves allows all partial assignments that are vertex
complete, edge complete and structurally consistent Ses allows all partial assignments that are edge
complete and structurally consistent Ss allows all partial assignments that are structurally
consistent Usually, we are only interested in maximal
partial assignemnts
ICDT 2005
Example: Join
(q3, s1)
(q3, s2)
(q1, c1)
(q1, c2)
(q1, c3)
(q2, a1)
(q2, a2)
(q2, a3)
Using semantics Sves
we get the natural join
ICDT 2005
Example: Join “becomes” a Full Disjunction
(q3, s1)
(q3, s2)
(q1, c1)
(q1, c2)
(q1, c3)
(q2, a1)
(q2, a2)
(q2, a3)
Using semantics Ses
we get the full disjunction
ICDT 2005
Other Examples
Queries with incomplete answers over semistructured data: Kanza, Nutt, Sagiv (PODS 99) Weak semantics modeled by Ses; Or-semantics
modeled by Ss
FleXPath: Amer-Yahia, Lakshmanan, Pandit (Sigmond 04) Modeled by Ses
Interconnections: Cohen, Kanza, Sagiv (03) Complete interconnection can be modeled by Ses;
Reachable interconnection can be modeled by Ss
ICDT 2005
Motivation
Queries and Databases
Answers and Semantics
Graph Properties
ICDT 2005
Semantics are a type of Graph Property
A graph property P is a set of graphs, e.g., is a clique is a bipartite graph
A semantics defines a set of graphs, for every Q, D (these graphs are subgraphs of QD)
Therefore, semantics are a type of graph property
ICDT 2005
Hereditary Graph Properties and their Variants
There are several interesting types of graph properties that have been studied in graph theory
A graph property P is hereditary if every induced subgraph of a graph in P, is also in P (e.g., clique, is a forest)
A graph property P is connected-hereditary if every connected induced subgraph of a graph in P, is also in P (e.g., is a tree)
Can define rooted-hereditary similarly
ICDT 2005
Semantics are usually Hereditary
Most semantics for partial answers considered in the past are hereditary (in some sense), i.e., subgraphs of a partial answer are also partial answers
Many semantics require connectivity of results (e.g., full disjunctions)
Some require answers to be rooted (e.g., FlexPath)
ICDT 2005
Maximal P-Subgraph Problem
Given a graph property P, and a graph G The maximal P-subgraph problem is: Find all maximal induced subgraphs of G that have property P
Therefore, the problem of finding all maximal answers for a query over a database, under a given semantics, is a special case of the maximal P-subgraph problem
ICDT 2005
Efficient Query Evaluation
There are efficient algorithms that find all maximal P-subgraphs for hereditary, connected hereditary and rooted hereditary properties Efficient in terms of the input and the output (i.e.,
incremental polynomial time)
Use these algorithms to find maximal query answers, e.g., to find full disjunctions, weak answers, or-answers, etc. Improves upon previous results
ICDT 2005
Conclusion
Presented abstract framework Can model many different types of
queries, databases and semantics in the framework
Semantics in the framework are graph properties
Solve the maximal P-subgraph problem once and reuse it to find maximal query answers
ICDT 2005
Future Work
It is convenient to define ranking functions and return answers in ranking order
How/when can this be done in our framework? Note: From the modeling it is immediately
apparent that ranking cannot always be performed efficiently The problem of finding a maximal P-subgraph of size
k is NP complete for hereditary and connected-hereditary graph properties (Yannakakis, STOC 78)
ICDT 2005
Thank you!
Questions?