Efficient Keyword Search over Virtual XML Views
![Page 1: Efficient Keyword Search over Virtual XML Views](https://reader036.fdocuments.us/reader036/viewer/2022070410/568146b3550346895db3d04b/html5/thumbnails/1.jpg)
Efficient Keyword Search over Virtual XML Views
Feng Shao, Lin Guo, Chavdar Botev, Anand Bhaskar, Muthiah Chettiar, and Fan Yang
Cornell University
Jayavel Shanmugasundaram
Yahoo! Research
2008. 02. 14.
Summarized and presented by Dongmin Shin, IDS Lab., Seoul National University
![Page 2: Efficient Keyword Search over Virtual XML Views](https://reader036.fdocuments.us/reader036/viewer/2022070410/568146b3550346895db3d04b/html5/thumbnails/2.jpg)
Copyright 2007 by CEBT
Index
Introduction
Background
System Overview
QPT Generation Module
PDT Generation Module
Experiments
Conclusion and Future Work
![Page 3: Efficient Keyword Search over Virtual XML Views](https://reader036.fdocuments.us/reader036/viewer/2022070410/568146b3550346895db3d04b/html5/thumbnails/3.jpg)
![Page 4: Efficient Keyword Search over Virtual XML Views](https://reader036.fdocuments.us/reader036/viewer/2022070410/568146b3550346895db3d04b/html5/thumbnails/4.jpg)
Introduction
Fundamental assumption of traditional information retrieval systems: the set of documents being searched is materialized.
![Page 5: Efficient Keyword Search over Virtual XML Views](https://reader036.fdocuments.us/reader036/viewer/2022070410/568146b3550346895db3d04b/html5/thumbnails/5.jpg)
Introduction
But the view is often virtual (unmaterialized)
– The aggregator may not have the resources to materialize all the data
– If the view is materialized, its contents may be out of date, or maintaining the view may be expensive
– The data sources may not wish to provide the entire data
![Page 6: Efficient Keyword Search over Virtual XML Views](https://reader036.fdocuments.us/reader036/viewer/2022070410/568146b3550346895db3d04b/html5/thumbnails/6.jpg)
Introduction
Need: efficiently evaluating keyword search queries over virtual XML views
![Page 7: Efficient Keyword Search over Virtual XML Views](https://reader036.fdocuments.us/reader036/viewer/2022070410/568146b3550346895db3d04b/html5/thumbnails/7.jpg)
![Page 8: Efficient Keyword Search over Virtual XML Views](https://reader036.fdocuments.us/reader036/viewer/2022070410/568146b3550346895db3d04b/html5/thumbnails/8.jpg)
Background
![Page 9: Efficient Keyword Search over Virtual XML Views](https://reader036.fdocuments.us/reader036/viewer/2022070410/568146b3550346895db3d04b/html5/thumbnails/9.jpg)
Background
XML Scoring: TF-IDF method
tf(e,k): the number of distinct occurrences of the keyword k in element e and its descendants
idf(k) =
score(e,Q) =
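The right-hand sides of idf(k) and score(e,Q) are not present in the slide text, so the following is only a minimal sketch under assumptions: it takes idf(k) to be the number of view elements divided by the number of view elements containing k, and score(e,Q) to be the sum of tf × idf over the query keywords, which is one common TF-IDF form and not necessarily the exact one on the slide. The function names and the sample view are hypothetical.

```python
import xml.etree.ElementTree as ET

def tf(element, keyword):
    """Count occurrences of keyword in the text of element and its descendants."""
    words = " ".join(element.itertext()).lower().split()
    return words.count(keyword.lower())

def idf(view_elements, keyword):
    """Assumed form: (# view elements) / (# view elements containing keyword)."""
    containing = sum(1 for e in view_elements if tf(e, keyword) > 0)
    return len(view_elements) / containing if containing else 0.0

def score(element, query, view_elements):
    """Assumed form: sum of tf * idf over the query keywords."""
    return sum(tf(element, k) * idf(view_elements, k) for k in query)

# Hypothetical view result with two view elements (the <book> elements).
view = ET.fromstring(
    "<books>"
    "<book><title>XML search</title><review>fast XML engine</review></book>"
    "<book><title>Databases</title><review>solid search chapter</review></book>"
    "</books>"
)
books = list(view)
scores = [score(b, ["xml", "search"], books) for b in books]
print(scores)
```

The first book matches both keywords and contains the rarer one twice, so it ranks above the second.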
![Page 10: Efficient Keyword Search over Virtual XML Views](https://reader036.fdocuments.us/reader036/viewer/2022070410/568146b3550346895db3d04b/html5/thumbnails/10.jpg)
![Page 11: Efficient Keyword Search over Virtual XML Views](https://reader036.fdocuments.us/reader036/viewer/2022070410/568146b3550346895db3d04b/html5/thumbnails/11.jpg)
System Overview
(1) Keyword queries are issued over virtual views
(2) The parser redirects the query to the Query Pattern Tree (QPT) Generation Module
(3) The QPT is sent to the Pruned Document Tree (PDT) Generation Module
(4) PDTs are generated using only the path indices and inverted list indices
(5) The rewritten query and PDTs are sent to the Evaluator
(6) The Evaluator produces the view that contains all view elements with pruned content
(7) Elements are scored; only those with the highest scores are fully materialized using document storage
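The last step, where only the best-scoring elements are fetched in full, can be sketched as follows. This is an illustration under assumptions, not the system's actual code: `materialize_top_k`, `fetch_full`, and the dictionary standing in for document storage are all hypothetical names.

```python
import heapq

def materialize_top_k(scored_ids, k, fetch_full):
    """scored_ids: (element_id, score) pairs computed on pruned content only.
    Only the k best elements are fetched from document storage."""
    best = heapq.nlargest(k, scored_ids, key=lambda pair: pair[1])
    return [(element_id, s, fetch_full(element_id)) for element_id, s in best]

# Hypothetical document storage keyed by Dewey ID.
document_storage = {
    "1.1": "<book>A</book>",
    "1.2": "<book>B</book>",
    "1.3": "<book>C</book>",
}
fetched = []  # records which elements were actually read from storage

def fetch_full(element_id):
    fetched.append(element_id)
    return document_storage[element_id]

top = materialize_top_k([("1.1", 5.0), ("1.2", 1.0), ("1.3", 3.5)], k=2, fetch_full=fetch_full)
print(top)
print(fetched)
```

Element 1.2 is scored but never read from storage, which is the point of deferring materialization.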
![Page 12: Efficient Keyword Search over Virtual XML Views](https://reader036.fdocuments.us/reader036/viewer/2022070410/568146b3550346895db3d04b/html5/thumbnails/12.jpg)
System Overview
XML Storage: Dewey IDs
– Popular ID format
– Hierarchical numbering scheme
– ID of an element contains the ID of its parent
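Because each ID contains its parent's ID as a prefix, parent and ancestor tests need no access to the document. A minimal sketch, assuming Dewey IDs are written as dot-separated integers such as "1.2.3" (the helper names are hypothetical):

```python
def parse_dewey(text):
    """Turn '1.2.3' into (1, 2, 3) so that components compare numerically."""
    return tuple(int(part) for part in text.split("."))

def parent(dewey):
    """The parent's ID is the ID with its last component dropped."""
    return dewey[:-1] if len(dewey) > 1 else None

def is_ancestor(a, d):
    """a is a proper ancestor of d exactly when a is a proper prefix of d."""
    return len(a) < len(d) and d[:len(a)] == a

node = parse_dewey("1.2.3")
print(parent(node))
print(is_ancestor((1,), node), is_ancestor((1, 3), node))

# Sorting integer tuples yields document order; plain strings would
# wrongly place "1.10" before "1.2".
ordered = sorted(parse_dewey(s) for s in ["1.10", "1.2", "1.2.1"])
print(ordered)
```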
![Page 13: Efficient Keyword Search over Virtual XML Views](https://reader036.fdocuments.us/reader036/viewer/2022070410/568146b3550346895db3d04b/html5/thumbnails/13.jpg)
System Overview
XML Indexing: Path indices
– Evaluate XML path and twig (i.e., branching path) queries
– Store XML paths with values in a relational table
– Use indices such as a B+-tree
– One row for each unique (Path, Value) pair
– IDList: the list of IDs of all elements on the path
– A B+-tree index is built on the (Path, Value) pair
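The table layout described above can be approximated with an in-memory SQLite table. This is a sketch under assumptions: the table and column names are invented, the IDList is stored as a space-separated string for brevity, and SQLite's automatic index on the composite primary key stands in for the B+-tree on (Path, Value).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per unique (path, value) pair; the primary key gives an index on that pair.
conn.execute(
    "CREATE TABLE path_index ("
    " path TEXT NOT NULL, value TEXT NOT NULL, id_list TEXT NOT NULL,"
    " PRIMARY KEY (path, value))"
)
conn.executemany(
    "INSERT INTO path_index VALUES (?, ?, ?)",
    [
        ("/books/book/year", "1996", "1.1.1"),
        ("/books/book/year", "1990", "1.2.1 1.4.1"),
        ("/books/book/title", "XML search", "1.1.2"),
    ],
)

def lookup(conn, path, value):
    """Return the Dewey IDs of all elements on `path` whose value equals `value`."""
    row = conn.execute(
        "SELECT id_list FROM path_index WHERE path = ? AND value = ?",
        (path, value),
    ).fetchone()
    return row[0].split() if row else []

print(lookup(conn, "/books/book/year", "1990"))
```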
![Page 14: Efficient Keyword Search over Virtual XML Views](https://reader036.fdocuments.us/reader036/viewer/2022070410/568146b3550346895db3d04b/html5/thumbnails/14.jpg)
System Overview
Inverted list indices
– For each keyword in the document collection, store the list of XML elements that directly contain the keyword
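A small sketch of building such lists, assuming whitespace tokenization and Dewey IDs assigned by child position starting from a root ID of (1,). Each entry also keeps a tf count per element. The function name and sample document are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def build_inverted_lists(root):
    """Map each keyword to [(dewey_id, tf)] for elements that directly contain it."""
    lists = defaultdict(list)

    def visit(elem, dewey):
        # Direct text: the element's own text plus text following each child.
        direct = (elem.text or "") + " " + " ".join(c.tail or "" for c in elem)
        counts = {}
        for word in direct.lower().split():
            counts[word] = counts.get(word, 0) + 1
        for word, n in counts.items():
            lists[word].append((dewey, n))
        # Pre-order traversal keeps every list sorted in document order.
        for i, child in enumerate(elem, start=1):
            visit(child, dewey + (i,))

    visit(root, (1,))
    return dict(lists)

doc = ET.fromstring(
    "<books>"
    "<book><title>XML search</title></book>"
    "<book><title>search engines</title></book>"
    "</books>"
)
inverted = build_inverted_lists(doc)
print(inverted["search"])
```

Only the title elements appear in the lists; their book ancestors contain the keywords indirectly and are therefore not listed.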
![Page 15: Efficient Keyword Search over Virtual XML Views](https://reader036.fdocuments.us/reader036/viewer/2022070410/568146b3550346895db3d04b/html5/thumbnails/15.jpg)
![Page 16: Efficient Keyword Search over Virtual XML Views](https://reader036.fdocuments.us/reader036/viewer/2022070410/568146b3550346895db3d04b/html5/thumbnails/16.jpg)
QPT Generation Module
![Page 17: Efficient Keyword Search over Virtual XML Views](https://reader036.fdocuments.us/reader036/viewer/2022070410/568146b3550346895db3d04b/html5/thumbnails/17.jpg)
![Page 18: Efficient Keyword Search over Virtual XML Views](https://reader036.fdocuments.us/reader036/viewer/2022070410/568146b3550346895db3d04b/html5/thumbnails/18.jpg)
PDT Generation Module
Output
– Only contains elements that correspond to nodes in the QPT
– Only contains element values that are required during query evaluation
Advantage
– Query evaluation is likely to be more efficient and scalable
– Allows us to use the regular (unmodified) query evaluator
![Page 19: Efficient Keyword Search over Virtual XML Views](https://reader036.fdocuments.us/reader036/viewer/2022070410/568146b3550346895db3d04b/html5/thumbnails/19.jpg)
PDT Generation Module
Key Idea
An element e in the document corresponding to a node n in the QPT is selected for inclusion only if it satisfies three types of constraints:
(1) Ancestor constraint – an ancestor element of e that corresponds to the parent of n in the QPT should also be selected
(2) Descendant constraint – for each mandatory edge from n to a child of n in the QPT, at least one child/descendant element of e corresponding to that child of n should also be selected
(3) Predicate constraint – if e is a leaf node, it satisfies all predicates associated with n
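The three constraints can be illustrated with a naive two-pass filter over elements held in memory. This is a sketch for intuition only, not the index-based PDT algorithm of the following slides. The `QptNode` and `Element` classes, the `select` function, and the sample data are all hypothetical.

```python
from __future__ import annotations
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class QptNode:
    name: str
    mandatory: bool = True                             # is the edge from the parent mandatory?
    predicate: Optional[Callable[[str], bool]] = None  # predicate on the value, if any
    children: list[QptNode] = field(default_factory=list)

@dataclass(frozen=True)
class Element:
    dewey: tuple[int, ...]
    node: str                    # name of the QPT node this element corresponds to
    value: Optional[str] = None

def is_ancestor(a, d):
    return len(a) < len(d) and d[:len(a)] == a

def select(elements, root):
    by_node = {}
    for e in elements:
        by_node.setdefault(e.node, []).append(e)
    selected = {}

    def bottom_up(n):
        cands = by_node.get(n.name, [])
        if n.predicate is not None:                    # predicate constraint
            cands = [e for e in cands if e.value is not None and n.predicate(e.value)]
        for c in n.children:
            kids = bottom_up(c)
            if c.mandatory:                            # descendant constraint
                cands = [e for e in cands
                         if any(is_ancestor(e.dewey, k.dewey) for k in kids)]
        selected[n.name] = cands
        return cands

    def top_down(n):
        for c in n.children:                           # ancestor constraint
            selected[c.name] = [e for e in selected[c.name]
                                if any(is_ancestor(p.dewey, e.dewey) for p in selected[n.name])]
            top_down(c)

    bottom_up(root)
    top_down(root)
    return selected

# Hypothetical QPT: book with a mandatory year > 1995 and an optional title.
qpt = QptNode("book", children=[
    QptNode("year", mandatory=True, predicate=lambda v: int(v) > 1995),
    QptNode("title", mandatory=False),
])
elements = [
    Element((1, 1), "book"), Element((1, 1, 1), "year", "1996"), Element((1, 1, 2), "title", "XML"),
    Element((1, 2), "book"), Element((1, 2, 1), "year", "1990"), Element((1, 2, 2), "title", "SQL"),
    Element((1, 3), "book"), Element((1, 3, 1), "title", "IR"),
]
result = {name: [e.dewey for e in es] for name, es in select(elements, qpt).items()}
print(result)
```

Only the first book survives: the second fails the predicate on its year, and the third has no year element at all, so their titles are dropped by the ancestor constraint.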
![Page 20: Efficient Keyword Search over Virtual XML Views](https://reader036.fdocuments.us/reader036/viewer/2022070410/568146b3550346895db3d04b/html5/thumbnails/20.jpg)
PDT Generation Module
PrepareList
(1) Issues a lookup on path indices for each QPT node that has no mandatory child edges
(2) Identifies nodes that have a ‘v’ annotation to obtain values and ids
(3) Looks up the inverted list indices and retrieves the list of Dewey IDs containing the keywords, along with their tf values
![Page 21: Efficient Keyword Search over Virtual XML Views](https://reader036.fdocuments.us/reader036/viewer/2022070410/568146b3550346895db3d04b/html5/thumbnails/21.jpg)
PDT Generation Module
Candidate Tree (CT)
![Page 22: Efficient Keyword Search over Virtual XML Views](https://reader036.fdocuments.us/reader036/viewer/2022070410/568146b3550346895db3d04b/html5/thumbnails/22.jpg)
PDT Generation Module
Step 1: adding new IDs
– Adds the current minimum IDs in pathLists
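One way to read this step is sketched below, assuming each path list is kept sorted in Dewey (document) order so that its head is its current minimum. The function name and the dictionary of deques are hypothetical, and how the IDs are then inserted into the CT is not modeled here.

```python
from collections import deque

def pop_current_minimums(path_lists):
    """Remove and return the smallest remaining Dewey ID of every non-empty list,
    paired with the path it came from and ordered by document position."""
    heads = []
    for path, ids in path_lists.items():
        if ids:
            heads.append((ids.popleft(), path))
    return sorted(heads)

# Hypothetical path lists, each already sorted by Dewey ID.
path_lists = {
    "/books/book/title": deque([(1, 1, 2), (1, 2, 2)]),
    "/books/book/year": deque([(1, 1, 1), (1, 2, 1)]),
}
first_round = pop_current_minimums(path_lists)
second_round = pop_current_minimums(path_lists)
third_round = pop_current_minimums(path_lists)
print(first_round)
```

Each call consumes one ID per list, so the lists are scanned once, front to back.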
![Page 23: Efficient Keyword Search over Virtual XML Views](https://reader036.fdocuments.us/reader036/viewer/2022070410/568146b3550346895db3d04b/html5/thumbnails/23.jpg)
PDT Generation Module
Step 2: creating PDT nodes
– Create PDT nodes using CT nodes
– Top-down
– Check the DM value of each CT node: if it is “1”, create the node in the pdt cache; if not, check the children of that node
– If the DM value of a child node is “1”, create it in the pdt cache of the parent node
![Page 24: Efficient Keyword Search over Virtual XML Views](https://reader036.fdocuments.us/reader036/viewer/2022070410/568146b3550346895db3d04b/html5/thumbnails/24.jpg)
PDT Generation Module
Step 3: removing CT nodes
– Bottom-up
– Check whether each node satisfies the ancestor constraints: if not, remove it; if so, propagate it to the pdt cache of the ancestor
– If a node has no children and does not satisfy the descendant constraints, remove it
![Page 25: Efficient Keyword Search over Virtual XML Views](https://reader036.fdocuments.us/reader036/viewer/2022070410/568146b3550346895db3d04b/html5/thumbnails/25.jpg)
PDT Generation Module
– When we remove the root node “books”, all IDs in its pdt cache will be propagated to the result PDT
![Page 26: Efficient Keyword Search over Virtual XML Views](https://reader036.fdocuments.us/reader036/viewer/2022070410/568146b3550346895db3d04b/html5/thumbnails/26.jpg)
PDT Generation Module
![Page 27: Efficient Keyword Search over Virtual XML Views](https://reader036.fdocuments.us/reader036/viewer/2022070410/568146b3550346895db3d04b/html5/thumbnails/27.jpg)
![Page 28: Efficient Keyword Search over Virtual XML Views](https://reader036.fdocuments.us/reader036/viewer/2022070410/568146b3550346895db3d04b/html5/thumbnails/28.jpg)
Experiments
500MB INEX dataset
Varying parameters
– Size of data, # of keywords, selectivity of keywords
– # of joins, join selectivity, level of nesting
– # of results, avg. size of view element
Four alternative approaches
– Baseline
– GTP: general solution to integrate structure and keyword search queries
– Efficient: proposed architecture
– Proj: techniques of projecting XML documents
![Page 29: Efficient Keyword Search over Virtual XML Views](https://reader036.fdocuments.us/reader036/viewer/2022070410/568146b3550346895db3d04b/html5/thumbnails/29.jpg)
Experiments
EFFICIENT is a scalable and efficient solution
– The cost of generating PDTs scales gracefully
– The overhead of post-processing (scoring and materializing) is negligible
– The cost of the query evaluator dominates the entire cost
![Page 30: Efficient Keyword Search over Virtual XML Views](https://reader036.fdocuments.us/reader036/viewer/2022070410/568146b3550346895db3d04b/html5/thumbnails/30.jpg)
Experiments
Run time for EFFICIENT increases slightly because it accesses more inverted lists to retrieve tf values
Run time for EFFICIENT increases because the cost of the query evaluation increases
![Page 31: Efficient Keyword Search over Virtual XML Views](https://reader036.fdocuments.us/reader036/viewer/2022070410/568146b3550346895db3d04b/html5/thumbnails/31.jpg)
![Page 32: Efficient Keyword Search over Virtual XML Views](https://reader036.fdocuments.us/reader036/viewer/2022070410/568146b3550346895db3d04b/html5/thumbnails/32.jpg)
Conclusion and Future Work
Conclusion
– A general technique for evaluating keyword search queries over views
– Efficient over a wide range of parameters
Future Work
– Instead of using the regular query evaluator, we could use the techniques proposed for ranked query evaluation
– Views may contain non-monotonic operators such as group-by