Post on 04-Feb-2016
1
CS 561 Presentation:
Indexing and Querying XML Data for Regular Path Expressions
A Paper by Quanzhong Li and Bongki Moon
Presented by Ming Li
2
Our Objective
• Developing a system that will enable us to perform XML data queries efficiently.
3
XML Query Languages
• Used for retrieving data from XML files.
• Use a regular path expression syntax.
• e.g. XPath, XQuery.
4
Queries Today - Inefficient
• Usually XML tree traversals, which are inefficient:
– Top-Down Approach
– Bottom-Up Approach
– An example: the query
/chapter/_*/figure
(finding all figures in all chapters.)
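To make the cost of a top-down traversal concrete, here is a minimal sketch that answers /chapter/_*/figure by walking every node below the chapter element. The sample document and the function name figures_top_down are made up for illustration, and Python's standard xml.etree.ElementTree stands in for whatever XML store a real system would use.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, not taken from the paper.
DOC = """<chapter>
  <title>Intro</title>
  <section><figure/><para>text</para></section>
  <section><subsection><figure/></subsection></section>
</chapter>"""


def figures_top_down(root):
    """Return (matching figures, number of nodes visited) for /chapter/_*/figure."""
    if root.tag != "chapter":
        return [], 0
    found, visited = [], 0
    stack = list(root)  # start from the children of <chapter>
    while stack:
        node = stack.pop()
        visited += 1
        if node.tag == "figure":
            found.append(node)
        stack.extend(node)  # every subtree must be explored, match or not
    return found, visited


found, visited = figures_top_down(ET.fromstring(DOC))
print(len(found), "figures found after visiting", visited, "nodes")
```

Every node under the chapter is visited even though only two of them match, which is the inefficiency the slide points at.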
5
Our Objective - Refined
• Developing a system that will enable us to perform XML data queries efficiently
• Developing such a system consists of:
– Developing a way to efficiently store XML data.
– Developing efficient algorithms for processing regular path expressions (e.g. XQuery expressions).
6
Storing XML Documents - XISS
• XISS - XML Indexing and Storage System.
• Provides us with ways to:
– efficiently find all elements or attributes with the same name string, grouped by the document to which they belong.
– quickly determine the ancestor-descendant relationship between elements and/or attributes in the XML data hierarchy.
7
Determining Ancestor-Descendant Relationship
• According to Dietz's numbering scheme: for two given nodes x and y of a tree T, x is an ancestor of y iff x occurs before y in the preorder traversal and after y in the postorder traversal.
• Example:
8
Determining Ancestor-Descendant Relationship – cont.
• Advantage: the ancestor-descendant relationship can be determined in constant time.
• Disadvantage: a lack of flexibility.
– e.g. inserting a new node requires recomputation of many tree nodes.
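Dietz's test can be sketched in a few lines. This is only an illustration under assumptions: the Node class, the number_tree helper, and the small chapter tree are hypothetical and do not come from the paper.

```python
class Node:
    """Hypothetical tree node carrying its preorder and postorder ranks."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.pre = None
        self.post = None


def number_tree(root):
    """Assign preorder and postorder ranks (starting at 1) in one traversal."""
    counters = {"pre": 0, "post": 0}

    def visit(node):
        counters["pre"] += 1
        node.pre = counters["pre"]
        for child in node.children:
            visit(child)
        counters["post"] += 1
        node.post = counters["post"]

    visit(root)


def is_ancestor(x, y):
    """x is an ancestor of y iff x comes first in preorder and last in postorder."""
    return x.pre < y.pre and x.post > y.post


figure = Node("figure")
section = Node("section", [figure])
title = Node("title")
chapter = Node("chapter", [title, section])
number_tree(chapter)

print(is_ancestor(chapter, figure))  # chapter contains figure
print(is_ancestor(title, figure))    # title is not on figure's path
```

The test itself is two integer comparisons, but inserting a node means calling number_tree again, because the ranks of all later nodes shift.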
9
Determining Ancestor-Descendant Relationship – cont.
• A new numbering scheme:
– Each node is associated with an <order, size> pair.
• For a tree node y and its parent x:
[order(y), order(y) + size(y)] ⊆ (order(x), order(x) + size(x)]
(the lower bound of the parent's interval is exclusive.)
• For two sibling nodes x and y, if x precedes y in the preorder traversal, then:
order(x) + size(x) < order(y).
10
Determining Ancestor-Descendant Relationship – cont.
• Fact: for two given nodes x and y of a tree T, x is an ancestor of y iff:
order(x) < order(y) ≤ order(x) + size(x)
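The interval test above is easy to check by hand. The sketch below assumes each node is represented as a plain (order, size) tuple; the first four values are the ones used in the EA-Join example later in this deck, while the sparse numbering at the end is a made-up illustration of how unused numbers leave room for insertions.

```python
def is_ancestor(x, y):
    """x and y are (order, size) pairs; True iff x is an ancestor of y."""
    order_x, size_x = x
    order_y, _ = y
    return order_x < order_y <= order_x + size_x


# Values from the EA-Join example: <Ele Att="A1"><Ele Att="A2"> </Ele></Ele>
outer_ele = (1, 3)
att_a1 = (2, 0)
inner_ele = (3, 1)
att_a2 = (4, 0)

print(is_ancestor(outer_ele, att_a2))  # outer element encloses the inner attribute
print(is_ancestor(inner_ele, att_a1))  # inner element does not enclose Att="A1"

# Hypothetical sparse numbering: the parent reserves more room than it uses.
parent = (10, 100)
first_child = (20, 30)
new_child = (60, 5)  # inserted later, no other node is renumbered
print(is_ancestor(parent, new_child))
print(first_child[0] + first_child[1] < new_child[0])  # sibling condition holds
```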
11
Determining Ancestor-Descendant Relationship – cont.
• Properties:
– the ancestor-descendant relationship can be determined in constant time.
– flexibility: node insertion usually doesn't require recomputation of tree nodes.
– an element can be uniquely identified in a document by its order value.
12
XISS System Overview
13
Name Index and Value Table
• Objective: minimizing the storage and computation overhead by eliminating replicated strings and string comparisons.
• Name Index - mapping distinct name strings into unique name identifiers (nid).
• Value Table - mapping distinct value strings (i.e. attribute value and text value) into unique value identifiers (vid).
• Both are implemented as B+-trees.
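The idea of replacing repeated strings with identifiers can be sketched as follows. This is an assumption-laden illustration: the StringIndex class is hypothetical, and a Python dict stands in for the B+-tree that XISS uses.

```python
class StringIndex:
    """Maps distinct strings to small integer identifiers and back.

    Hypothetical stand-in for the name index (nid) or the value table (vid);
    a dict replaces the B+-tree used by XISS.
    """

    def __init__(self):
        self._ids = {}
        self._strings = []

    def intern(self, text):
        """Return the identifier for text, allocating a new one if needed."""
        if text not in self._ids:
            self._ids[text] = len(self._strings)
            self._strings.append(text)
        return self._ids[text]

    def lookup(self, ident):
        """Return the string stored under ident."""
        return self._strings[ident]


names = StringIndex()
values = StringIndex()

nid_ele = names.intern("Ele")
nid_att = names.intern("Att")
vid_a1 = values.intern("A1")

# A repeated name costs no extra storage and compares as an integer.
print(names.intern("Ele") == nid_ele)
print(values.lookup(vid_a1))
```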
14
The Element Index
• Objective: quickly finding all elements with the same name string.
• Structure:
15
The Attribute Index
• Objective: quickly finding all attributes with the same name string.
• Structure:
– Same structure as the Element Index, except that each record in the Attribute Index has a value identifier (vid), which is a key used to obtain the attribute value from the value table.
16
The Structure Index
• Objectives:
– Finding the parent element and child elements (or attributes) for a given element.
– Finding the parent element for a given attribute.
• Structure:
17
The Structure Index – cont.
• Structure:
– B+-tree using document identifier (did) as a key.
– Leaf nodes: linear arrays with records for all elements and attributes from an XML document.
– Each record: {nid, <order,size>, Parent order, Child order, Sibling order, Attribute order}.
– Records are ordered by order value.
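The record layout above can be sketched as a small data structure. Several things here are assumptions rather than details from the slides: "Child order" is read as the first child, "Sibling order" as the next sibling, and "Attribute order" as the first attribute; a dict keyed by did replaces the B+-tree; and the did 7 and the nid values 0 and 1 are invented. The sample records describe the two-element document used in the EA-Join example.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    nid: int
    order: int
    size: int
    parent: Optional[int] = None     # order of the parent element
    child: Optional[int] = None      # assumed: order of the first child element
    sibling: Optional[int] = None    # assumed: order of the next sibling
    attribute: Optional[int] = None  # assumed: order of the first attribute


class Leaf:
    """Linear array of records for one document, sorted by order value."""

    def __init__(self, records):
        self.records = sorted(records, key=lambda r: r.order)
        self.orders = [r.order for r in self.records]

    def find(self, order):
        """Binary search by order value; None if no such record exists."""
        i = bisect_left(self.orders, order)
        if i < len(self.orders) and self.orders[i] == order:
            return self.records[i]
        return None

    def parent_of(self, order):
        rec = self.find(order)
        if rec is None or rec.parent is None:
            return None
        return self.find(rec.parent)

    def children_of(self, order):
        """Follow the first-child link, then the sibling chain."""
        rec = self.find(order)
        result = []
        nxt = rec.child if rec else None
        while nxt is not None:
            child = self.find(nxt)
            result.append(child)
            nxt = child.sibling
        return result


# Hypothetical index: did 7 holds <Ele Att="A1"><Ele Att="A2"> </Ele></Ele>
structure_index = {
    7: Leaf([
        Record(nid=0, order=1, size=3, child=3, attribute=2),
        Record(nid=1, order=2, size=0, parent=1),
        Record(nid=0, order=3, size=1, parent=1, attribute=4),
        Record(nid=1, order=4, size=0, parent=3),
    ])
}

leaf = structure_index[7]
print([c.order for c in leaf.children_of(1)])
print(leaf.parent_of(4).order)
```

Keeping the array sorted by order is what makes the binary search in find possible.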
18
Querying Method
• Decomposing path expressions into simple path expressions.
• Applying algorithms on simple path expressions and their intermediate results.
19
Decomposition of Path Expressions
• The main idea:
– A complex path expression is decomposed into several simple path expressions.
– Each simple path expression produces an intermediate result that can be used in the subsequent stage of processing.
– The results of the simple path expressions are then combined or joined together to obtain the final result of the given query.
20
Basic Subexpressions - Example
Decomposition of
(E1/E2)*/ E3 / ((E4[@a=V]) | (E5/_*/E6)):
Subexpression types:
(1) Single Element/Attribute
(2) Element-Attribute
(3) Element-Element
(4) Kleene Closure
(5) Union
[Figure: decomposition tree of the expression above, with each node labeled by its subexpression type (1)-(5).]
21
Example: EA-Join: Element and Attribute Join
22
EA-Join: Element and Attribute Join
Input:
{E1,…,Em}: Ei is a set of elements having a common document identifier (did);
{A1,…,An}: Aj is a set of attributes having a common document identifier (did);
Output:
A set of (e,a) pairs such that the element e is the parent of the attribute a.
23
EA-Join: Element and Attribute Join
The Algorithm:
// Sort-merge {Ei} and {Aj} by did.
(1) foreach Ei and Aj with the same did do
    // Sort-merge Ei and Aj by
    // PARENT-CHILD relationship
(2)   foreach e ∈ Ei and a ∈ Aj do
(3)     if (e is a parent of a) then output (e,a)
      end
    end
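The two-stage merge can be sketched as runnable code. This is a simplified illustration, not the paper's implementation: it assumes each attribute record carries its parent's order value, that both lists are already sorted by order within a document, and that the inputs are dicts keyed by did (so the first stage is a sorted intersection of keys rather than a true merge). The names Element, Attribute, and ea_join are hypothetical.

```python
from collections import namedtuple

Element = namedtuple("Element", "order size")
Attribute = namedtuple("Attribute", "order parent")  # parent field is an assumption


def ea_join(elements_by_did, attributes_by_did):
    """Return (did, element, attribute) triples where element is the attribute's parent."""
    pairs = []
    # Stage 1: match element sets and attribute sets with the same did.
    for did in sorted(elements_by_did.keys() & attributes_by_did.keys()):
        elems = elements_by_did[did]
        attrs = attributes_by_did[did]
        # Stage 2: single forward scan. Attributes are numbered right after
        # their parent, so parent orders never decrease along attrs.
        i = 0
        for a in attrs:
            while i < len(elems) and elems[i].order < a.parent:
                i += 1
            if i < len(elems) and elems[i].order == a.parent:
                pairs.append((did, elems[i], a))
    return pairs


# did 1 holds <Ele Att="A1"><Ele Att="A2"> </Ele></Ele>; did 2 has no attributes.
elements = {1: [Element(1, 3), Element(3, 1)], 2: [Element(1, 0)]}
attributes = {1: [Attribute(2, 1), Attribute(4, 3)]}

result = ea_join(elements, attributes)
for did, e, a in result:
    print(did, e, a)
```

The element pointer i only moves forward, so each document's two lists are read once.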
24
EA-Join – Example
• Consider the XML document:
<Ele Att="A1">
<Ele Att="A2"> </Ele>
</Ele>
• And the query: /Ele[@Att="A1"]
• Node numbering (<order,size>):
Ele <1,3>
Att <2,0>
Ele <3,1>
Att <4,0>
25
EA-Join – Querying /Ele[@Att="A1"]
<Ele Att="A1">
<Ele Att="A2"> </Ele>
</Ele>
Ele <1,3>
Att <2,0>
Ele <3,1>
Att <4,0>
• Sort-merging the "Ele"s and "Att"s by parent-child relationship gives the list: <1,3>, <2,0>, <3,1>, <4,0>
• From this merged list, finding the "Ele" elements that have a child attribute "Att" with the value "A1" is easy using the information in the Element Record.
26
EA-Join – Comments
• Only a two-stage sort-merge operation, without the additional cost of sorting:
– First merge: by did.
– Second merge: by examining the parent-child relationship.
• The second merge is based on the order values of the elements and attributes as defined by the numbering scheme.
• Attributes should be placed before their sibling elements in the order of the numbering scheme.
– This guarantees that elements and attributes with the same did can be merged in a single scan.
27
Conclusions
• XISS can efficiently process regular path expression queries.
• Performance improvement over the conventional methods by up to an order of magnitude.
• Future work: determining the optimal page size, or the break-even point between the two criteria.
28
Thank you so much!