KEYWORD SEARCH OVER RELATIONAL TABLES AND STREAMS
ALEXANDER MARKOWETZ
University of Bonn
YIN YANG and DIMITRIS PAPADIAS
Hong Kong University of Science and Technology
Doklea Meci (A.M 2152)
May 2012
University Of Crete
Department Of Computer Science
THE CHALLENGES OF ACCESSING STRUCTURED DATA
Query languages: numerous complex SQL statements
Schemas: complex or nontrivial
R-KWS queries:
  replace numerous complex SQL statements
  liberate users from studying a database schema
  allow querying for terms in unknown locations (tables/attributes)
INTRODUCTION
Keyword Search (KWS):
  each document/Web page constitutes one unit of information
  a document is a result if it contains a subset of the query’s keywords
  it has been applied to relational DBMS, where it allows data retrieval without SQL
Relational Keyword Search (R-KWS):
  the basic unit of information is a record/tuple
  queries cannot be answered by inspecting records individually
  results have to be constructed by joining tuples
OUTLINE
Introduction
Relational Keyword Search On Tables
  Graph-Based Processing
  Operator-Based Processing
Optimizations For Continuous GB
  Predecessor-KL
  Time-KL
Optimizations For Continuous OB
  Operator Mesh
  Demand-Driven Operator Execution
  Partial-Mesh
Experimental Evaluation
  Snapshot R-KWS Queries over Tables
  Continuous R-KWS Queries over Streams
  Summary of Experimental Evaluation
Conclusion
RELATIONAL KEYWORD SEARCH ON TABLES
Goal: methods for graph-based (GB) and operator-based (OB) processing that
  avoid the shortcomings of prior systems
  improve the performance of R-KWS in conventional databases
GRAPH-BASED PROCESSING
Basic Idea: given an inverted index I (on disk), GB traverses an undirected data graph G (in memory), searching for results in the form of MTJNT (Minimal Total Join Networks of Tuples).
Join Networks of Tuples (JNT) are connected acyclic components of G.
A JNT is called a Minimal Total JNT (MTJNT) iff it is impossible to remove any node and find the remainder to be total.
GSEARCH ALGORITHM
Basic Idea: the algorithm enumerates all possible trees in G rooted at node sn.
Result: the trees that correspond to an MTJNT.
GSearch maintains a queue Q of trees, each constituting a fraction of a potential MTJNT.
Every tree is de-queued and expanded by adding one new node, resulting in a new tree.
The new tree falls into one of three categories:
  It forms an MTJNT, and is included in the result set.
  It has the potential to become an MTJNT, and is inserted in Q to be expanded later.
  It is neither of the above, and can be safely discarded.
The algorithm terminates when Q becomes empty.
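The queue-driven enumeration described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the adjacency-dict graph, the `keywords_of` map, the `max_nodes` size limit and all function names are hypothetical. The sketch treats a tree as total when it covers every query keyword and as minimal when no leaf can be removed without losing totality. Duplicates are suppressed with a plain `seen` set, and a tree counts as having "potential" whenever it is not yet total and still below the size limit.

```python
from collections import deque


def is_total(nodes, keywords_of, query):
    """A tree is total if its nodes together cover every query keyword."""
    covered = set()
    for n in nodes:
        covered |= set(keywords_of.get(n, ())) & query
    return covered == query


def leaves(nodes, edges):
    """Nodes of degree 0 or 1; only these can be removed without disconnecting."""
    degree = {n: 0 for n in nodes}
    for edge in edges:
        for n in edge:
            degree[n] += 1
    return [n for n in nodes if degree[n] <= 1]


def is_minimal(nodes, edges, keywords_of, query):
    """Minimal: removing any single leaf makes the remainder non-total."""
    return all(not is_total(nodes - {leaf}, keywords_of, query)
               for leaf in leaves(nodes, edges))


def gsearch(graph, keywords_of, query, start, max_nodes):
    """Enumerate trees containing `start`; return those that are MTJNT."""
    query = set(query)
    first = (frozenset([start]), frozenset())
    queue, seen, results = deque([first]), {first}, []
    while queue:
        nodes, edges = queue.popleft()
        if is_total(nodes, keywords_of, query):
            # Category 1 (result) or category 3 (total but not minimal).
            if is_minimal(nodes, edges, keywords_of, query):
                results.append((nodes, edges))
            continue  # no supertree of a total tree can be minimal
        if len(nodes) >= max_nodes:
            continue  # category 3: size limit reached without totality
        for u in nodes:  # category 2: expand by one new node
            for v in graph.get(u, ()):
                if v in nodes:
                    continue
                tree = (nodes | {v}, edges | {frozenset((u, v))})
                if tree not in seen:
                    seen.add(tree)
                    queue.append(tree)
    return results
```

Calling `gsearch(graph, keywords_of, {"x", "y"}, "a", 3)` returns each qualifying tree as a pair of a node set and an edge set.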
GSearch computes the set of MTJNT containing node sn, and so GB answers an R-KWS query q correctly, completely, and without duplicates.
OPERATOR-BASED PROCESSING
Basic Idea: query processing relies on Candidate Networks (CN).
Candidate Networks (CN) are projections of MTJNT onto the expanded schema EG(q).
A tuple s of relation S maps to node S{K} of EG(q) iff s contains all keywords in K, but does not contain any term in q\K.
An MTJNT projects to a unique CN.
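The mapping from a tuple to its expanded-schema node S{K} can be illustrated with a short sketch. The whitespace tokenization, the representation of a node as a `(relation, keyword set)` pair and both function names are assumptions made for illustration. The projection keeps only node labels and ignores the join edges that a full CN would also carry.

```python
def expanded_schema_node(relation, tuple_text, query):
    """Map a tuple to (relation, K), where K is exactly the set of query
    keywords that occur in the tuple (so no term of q\\K occurs in it)."""
    terms = set(tuple_text.lower().split())
    matched = frozenset(k for k in query if k.lower() in terms)
    return (relation, matched)


def project_to_cn(mtjnt_tuples, query):
    """Project each (relation, text) tuple of an MTJNT to its schema node."""
    return [expanded_schema_node(rel, text, query) for rel, text in mtjnt_tuples]
```

Because K is determined entirely by the tuple's content, every tuple maps to exactly one node, which is why an MTJNT projects to a unique CN.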
OPTIMIZATIONS FOR CONTINUOUS GB
Basic Idea: keyword labeling, a simple and effective method to summarize the reachable keywords for a given node.
It improves performance by avoiding unnecessary calls to GSearch and by constraining graph traversals.
A keyword label (KL) of format [k, h], stored at node n, indicates a path of h edges in the data graph, connecting n to an occurrence of keyword k.
EXAMPLE
The label s:[k, 2] corresponds to a path connecting s to an occurrence of keyword k via 2 edges.
BENEFITS OF A MIN-COMPLETE LABELING
GSearch(G, q, s) is called only if node s can reach all query terms, that is, only if s stores a KL for every k ∈ q.
In any other case, s is guaranteed not to participate in an MTJNT.
KL-aware GSearch algorithm: a new tree is inserted into Q iff there exists a set NL of labels that meets the following criteria:
  The KL in NL can reach all missing keywords.
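As an illustration of how such labels gate the call to GSearch, the sketch below computes, for a static graph, the minimum hop count from every node to each query keyword and then tests whether a node stores a label for every keyword. It uses a plain breadth-first search and does not cover the incremental maintenance over streams. The adjacency-dict graph, the `max_hops` bound and the function names are assumptions.

```python
from collections import deque


def min_keyword_labels(graph, keywords_of, query, max_hops):
    """Return labels[node][k] = fewest edges from node to an occurrence of k,
    recorded only when that distance is at most max_hops."""
    labels = {node: {} for node in graph}
    for k in query:
        frontier = deque()
        for node in graph:
            if k in keywords_of.get(node, ()):
                labels[node][k] = 0          # the node itself contains k
                frontier.append(node)
        while frontier:                      # multi-source BFS for keyword k
            node = frontier.popleft()
            hops = labels[node][k]
            if hops == max_hops:
                continue
            for neighbor in graph[node]:
                if k not in labels[neighbor]:
                    labels[neighbor][k] = hops + 1
                    frontier.append(neighbor)
    return labels


def worth_searching(labels, node, query):
    """GSearch is only worth calling if the node has a label for every keyword."""
    return all(k in labels[node] for k in query)
```

A node for which `worth_searching` returns `False` cannot reach some keyword within the hop bound, so the search from that node can be skipped.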
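EXAMPLE - INTERMEDIATE TREES ABANDONED BY KL-AWARE GSEARCH (Tmax = 9)
The tree is lacking a keyword, and new nodes can only be added to one of its nodes.
That node can reach the missing keyword in four hops, which is the shortest path to it.
The second criterion is not satisfied: the tree already has size 6, and 6 + 4 exceeds 9.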
PREDECESSOR-KL IMPLEMENTATION
Basic Idea: a predecessor-KL stored at node n is a triplet of the form [k, h, p]:
  it indicates a path of length h, connecting n to an occurrence of keyword k
  p is n’s predecessor, the neighbor through which the path leads
Every node n must contain a predecessor-KL [k, h, p] for the shortest path leading from n through p to an occurrence of k.
An arriving tuple s can itself contain a keyword, or create new paths between keywords and nodes.
  Both cases require KL insertions and updates.
  The number of edges in each path is bounded.
PREDECESSOR-KL EXAMPLE
A node must keep both KL when they represent the shortest paths via two different predecessors.
When both paths share the same predecessor, it suffices to keep a single KL through that node.
TIME-KL
Basic Idea: a more efficient labeling that does not require explicit removal.
A time-KL is a triplet [k, h, t] indicating a path of length h to an occurrence of keyword k, which exists until time t.
A KL [k, h1, t1] dominates another KL [k, h2, t2] iff (h1 ≤ h2 and t1 ≥ t2).
Result: the graph contains all KL that are not dominated by others.
TIME-KL EXAMPLE
Three paths connect a node to occurrences of a keyword:
  1) a path of 2 hops
  2) a path of 1 hop
  3) a path of 3 hops, through a node that expires at 21
Result:
  (1) and (2) must be stored, as each indicates the shortest path for some period of time.
  (3) is not recorded, as it expires sooner than the other two.
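A small sketch of the domination test makes the example concrete. It assumes that one label dominates another when it is for the same keyword, is no longer, and expires no sooner. The tuple layout `(keyword, hops, expiry)`, the function names and the expiry values 30 and 25 are hypothetical; only the expiry 21 of the third path comes from the slide.

```python
def dominates(a, b):
    """a and b are (keyword, hops, expiry) triplets. a dominates b if it is for
    the same keyword, is no longer than b, and expires no sooner than b."""
    return a[0] == b[0] and a[1] <= b[1] and a[2] >= b[2]


def insert_time_kl(labels, new):
    """Return the label list after offering `new`: it is dropped if an existing
    label dominates it, otherwise it replaces every label it dominates."""
    if any(dominates(old, new) for old in labels):
        return labels
    return [old for old in labels if not dominates(new, old)] + [new]


# The three paths of the example, with assumed expiry times for (1) and (2).
labels = []
for kl in [("k", 2, 30), ("k", 1, 25), ("k", 3, 21)]:
    labels = insert_time_kl(labels, kl)
```

After the loop, `labels` holds the first two triplets: the 1-hop path is shortest until it expires, after which the 2-hop path takes over, while the 3-hop path is dominated from the start. No explicit removal is needed, because a dominated label never becomes the shortest path at any point in time.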
OPTIMIZATIONS FOR CONTINUOUS OB
Basic Idea: over static tables, if a selection on a table (e.g., T{}) returns no tuples, all operator trees using this input can be discarded immediately.
For data streams, this is not permissible: even though the selection T{} does not currently produce tuples, it may do so in the future, and all operator trees must thus be maintained.
Solution: optimizations that enable efficient OB R-KWS over data streams.
OPERATOR MESH (1/3)
Basic Idea: sharing common subexpressions.
All operator trees are integrated into an operator mesh, reducing CPU cost (for evaluating joins) as well as memory overhead (for intermediate results).
The number of clusters in the mesh is determined by |SR|, the number of streaming relations, and |K|, the number of query keywords.
Each cluster contains the operator trees for all CN (Candidate Networks) discovered from a certain root node.
The entire operator mesh has one leaf/source for each node of the extended schema.
The maximum depth of the mesh is bounded by the limit on the length of join sequences.
The number of edges depends on the schema complexity.
Different clusters are interconnected only through their source operators; joins from different clusters do not connect directly.
OPERATOR MESH EXAMPLE
[Figure: the shared execution of four operator trees.]
Algorithm:
  The first node in a cluster corresponds to the root node, from which CNGen starts.
  Whenever the algorithm generates a new tree t' from a tree t (by adding a new child to a parent node), a join t'.op is added to the mesh.
  The left child of t'.op is t.op (the operator that was inserted when t was created).
  The right child is the source of the newly added node.
  For each tree t in CNGen, a pointer is maintained to the corresponding operator t.op, to decide where to place subsequent joins when t is expanded.
  The algorithm is initialized with the pointer of the first tree pointing to the source of the root node.
PROBLEMS WITH OPERATOR MESH APPROACH
Example: assume that tuples arrive from S{} and T{}, while V{}, U{, } and V{, } are empty.
  None of the joins above the join of these two inputs requires its output, because they do not receive right input.
Worst case: the join’s results expire before the arrival of any tuples from V{}, U{, } or V{, }.
  The join has then wasted CPU and memory, without any contribution to the query.
DEMAND-DRIVEN OPERATOR EXECUTION (2/3)
The mesh is maintained in main memory throughout the lifespan of the query.
A join is considered to be either
  running: the operator processes input
  sleeping: the operator ignores input
A join operator is sent to sleep if
  it has no input from the right child (a source), or
  all its parents are sleeping
Sending operators to sleep does not affect the result’s correctness or completeness, because either
  the operator cannot produce output, or
  its output would not be consumed
DEMAND-DRIVEN OPERATOR EXECUTION - EXAMPLE
[Figure: the state diagram for a join operator.]
States are characterized by two binary flags:
  d, indicating that at least one parent operator is running, and
  r, specifying that the operator’s right input is not empty.
An operator only runs in the topmost state (d/r).
When it leaves this state (transition 2 or 3), it goes to sleep (or halts), to wake up (or restart) later (transitions 9 and 10).
Operators exchange messages regarding their state, in order to ensure that all d and r flags are up-to-date.
A join operator communicates changes (running/sleeping) to its left child, which adjusts its d flag.
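The interplay of the d and r flags can be sketched with a small class. This is an illustrative model under stated assumptions rather than the authors' design: the class and method names are hypothetical, the d flag is derived from a counter of running parents, and an `is_output` marker is introduced so that operators whose results are reported to the user count as always in demand.

```python
class JoinOp:
    """A join that runs only while it is in demand (d) and has right input (r)."""

    def __init__(self, name, left=None, is_output=False):
        self.name = name
        self.left = left              # left child join; None if fed by a source
        self.is_output = is_output    # assumption: result-producing joins are always demanded
        self.r = False                # right input currently non-empty
        self.running_parents = 0      # number of parents that are running
        self.running = False

    @property
    def d(self):
        return self.is_output or self.running_parents > 0

    def set_right_input(self, nonempty):
        """Called when the right source starts or stops producing tuples."""
        self.r = nonempty
        self._refresh()

    def _refresh(self):
        should_run = self.d and self.r
        if should_run == self.running:
            return                    # no state change, nothing to communicate
        self.running = should_run
        if self.left is not None:     # tell the left child about the change
            self.left.running_parents += 1 if should_run else -1
            self.left._refresh()
```

With one lower join shared by two parents, the lower join keeps running while at least one parent runs, goes to sleep when the last parent does, and wakes up again as soon as a parent restarts.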
DEMAND-DRIVEN OPERATOR EXECUTION - EXAMPLE
Assume U{, } stops producing output.
Result:
  the join that reads U{, } as its right input turns off its r flag and goes to sleep (transition 2)
  it calls its left child, which decreases its counter of running parents
  the left child takes no further action, as it still has other running parents
If T{} and V{, } dry up too, the joins that read them go to sleep as well.
When an operator decreases its counter of running parents to zero (rParents = 0), it goes to sleep (transition 3).
EXAMPLE - CONSIDERING THAT ONLY TWO JOIN OPERATORS ARE RUNNING
A join does not generate results, due to lack of left input.
When T{} begins producing output, it causes the join that reads it to adjust its r flag, wake up (transition 9), and call Pstart on its left child.
That operator restarts and informs its own child in turn.
EXAMPLE - ALL JOINS RUN AGAIN EXCEPT TWO
Note: this method is not restricted to keyword search; it can equally benefit other data stream applications.
PARTIAL-MESH (3/3)
Basic Idea: a Partial-Mesh (PM) is built at runtime and breaks the distinction between
  operator initialization
  tuple processing
The method maintains relatively few active operators in memory.
It is each operator’s responsibility to create its parents before it can produce output.
An operator destroys its parents (and other operators up the tree) if it cannot supply them with input.
In large meshes, most operators are idle. Their absence does not affect the result’s completeness, but dramatically reduces memory consumption.
PARTIAL-MESH EXAMPLE
When the leftmost source S{} first produces output, it creates its direct parents.
When one of these parents generates results, it creates its own parents.
When an operator outputs a first tuple t and instantiates a parent, this parent immediately probes t against T{}.
PARTIAL-MESH ALGORITHM
Basic Idea: TreeGen is an algorithm for reconstructing the tree of an operator.
It decides which parents to create.
The algorithm checks the join condition of the operator. If a source is joined with a node of the existing tree, the new tree is generated by adding that source as the rightmost child of that node.
PARTIAL-MESH EXAMPLES OF TREEGEN
TreeGen(S{}) returns a tree that contains a single node, S{}.
Each parent is inserted in the mesh and connected to its left and right inputs.
Calling TreeGen on an operator returns the tree of that operator.
The expansion of this tree reveals the parents of the operator.
SNAPSHOT R-KWS QUERIES OVER TABLES (1/3)
Comparing the GB and OB implementations:
  Experiments are focused on the tables Part (0.2M entries), Supplier (10K), PartSupp (0.8M), Customer (150K), Orders (1.5M), and LineItem (6M).
  Two tables can join if and only if there is a foreign-key to primary-key relationship between them.
  The length of join sequences is restricted to Tmax, which ranges between 4 and 6.
EXAMPLE - SEVEN SETS OF R-KWS QUERIES QS1-QS7
QS1, QS2: people’s or companies’ names (denoted as PeopleName), which appear in the columns Customer.Name, Supplier.Name, and Orders.Clerk (retrieve connections between multiple people)
QS3, QS4: terms from the name of a part, for example, “ivory”, from the Part.Name attribute
QS5, QS6: years, which are present in LineItem.ShipDate, LineItem.CommitDate, LineItem.ReceiptDate, and Orders.OrderDate
QS7: terms from Part.Brand, Part.Mfgr, Part.Size, and Part.Container
EXAMPLE - PROCESSING TIME FOR QUERIES QS1-QS7
[Figure: the total runtime (y-axis) of GB and OB, and the result set cardinality |R| (below the x-axis), for the seven query sets.]
The reported values are medians, after setting Tmax to 4, 5, and 6.
SNAPSHOT R-KWS QUERIES OVER TABLES - CONCLUSION
GB
  (+) For conventional tables, GB is more efficient than OB.
  (+) GSearch avoids duplicate results, which reduces the total cost.
  (+) GB is preferable for datasets with frequent updates.
  (-) It is not efficient for queries involving numerous keywords and/or a large value of Tmax.
  (-) It consumes a large amount of main memory to store the data graph.
  Conclusion: on servers dedicated to R-KWS queries, GB is the best choice due to its high performance.
OB
  (+) OB utilizes the functionality provided by a DBMS and, thus, can answer R-KWS queries using much less memory than GB.
  Conclusion: on servers running multiple applications and only answering R-KWS queries infrequently, OB might be preferable due to its low memory footprint.
CONTINUOUS R-KWS QUERIES OVER STREAMS (2/2)
[Figures: experimental results for continuous R-KWS queries over streams.]
CONTINUOUS R-KWS QUERIES OVER STREAMS - CONCLUSION
Full Mesh (FM) is usually the most CPU-efficient method for a single query.
GB and Partial Mesh (PM) are more economical in terms of memory consumption.
CONCLUSION – ADVANTAGES OF R-KWS
R-KWS handles broad query tasks whose complexity does not permit hand-coded structured queries.
It presents considerable algorithmic challenges, because query processing has to explore a vast search space.
The authors address these challenges through a series of contributions:
  they provide R-KWS semantics that are well defined and easily extensible to streaming environments
  they develop GB and OB processing techniques that match these semantics and remedy problems encountered in previous systems
  they adapt their framework to relational streams, and propose a wide range of optimizations
  they support their claims through an extensive set of experiments
CONCLUSION – FUTURE WORK
The authors plan to further improve R-KWS performance by means of indexing.
They intend to integrate ranking into continuous R-KWS query processing.
Example: if there is a sudden burst of results, it may be desirable to report only the top-k answers for the affected period.