CS 561 Presentation: Indexing and Querying XML Data for Regular Path Expressions


A Paper by Quanzhong Li and Bongki Moon

Presented by Ming Li

Our Objective

• Developing a system that will enable us to perform XML data queries efficiently.

XML Query Languages

• Used for retrieving data from XML files.

• Use a regular path expression syntax.

• e.g. XPath, XQuery.

Queries Today - Inefficient

• Usually XML tree traversals, which are inefficient:
– Top-Down Approach
– Bottom-Up Approach
– An example: the query

/chapter/_*/figure

(finding all figures in all chapters.)
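As a rough illustration of the top-down approach, the sketch below answers the query by walking the whole subtree of every chapter. The sample document and the function name figures_in_chapters are hypothetical, and Python's standard xml.etree.ElementTree is used only as a stand-in for a traversal-based query processor.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document (not taken from the slides).
DOC = """
<book>
  <chapter>
    <figure id="f1"/>
    <section>
      <figure id="f2"/>
    </section>
  </chapter>
  <appendix>
    <figure id="f3"/>
  </appendix>
</book>
"""


def figures_in_chapters(root):
    """Top-down traversal: find each chapter, then walk its entire subtree."""
    found = []
    for chapter in root.iter("chapter"):
        for node in chapter.iter("figure"):
            found.append(node.get("id"))
    return found


root = ET.fromstring(DOC)
print(figures_in_chapters(root))  # ['f1', 'f2']
```

Every node below a chapter is visited even when it cannot lead to a figure, which is the cost that the indexing approach aims to avoid.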

Our Objective - Refined

• Developing a system that will enable us to perform XML data queries efficiently.

• Developing such a system consists of:
– Developing a way to efficiently store XML data.
– Developing efficient algorithms for processing regular path expressions (e.g. XQuery expressions).

Storing XML Documents - XISS

• XISS - XML Indexing and Storage System.

• Provides us with ways to:
– efficiently find all elements or attributes with the same name string, grouped by the document to which they belong.
– quickly determine the ancestor-descendant relationship between elements and/or attributes in the hierarchy of XML data.

Determining Ancestor-Descendant Relationship

• According to Dietz's numbering scheme: for two given nodes x and y of a tree T, x is an ancestor of y iff x occurs before y in the preorder traversal and after y in the postorder traversal.

• Example:
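A minimal Python sketch of Dietz's test, using a hypothetical five-node tree (not the figure from the slides). The names TREE, number_tree and is_ancestor are illustrative only.

```python
# Hypothetical tree, given as parent -> list of children.
TREE = {"a": ["b", "e"], "b": ["c", "d"], "c": [], "d": [], "e": []}


def number_tree(tree, root):
    """Return two dicts with each node's preorder and postorder rank."""
    pre, post = {}, {}

    def visit(node):
        pre[node] = len(pre) + 1      # numbered on the way down
        for child in tree[node]:
            visit(child)
        post[node] = len(post) + 1    # numbered on the way back up

    visit(root)
    return pre, post


def is_ancestor(x, y, pre, post):
    """x is an ancestor of y iff x comes first in preorder and last in postorder."""
    return pre[x] < pre[y] and post[x] > post[y]


pre, post = number_tree(TREE, "a")
print(is_ancestor("a", "d", pre, post))  # True
print(is_ancestor("b", "e", pre, post))  # False
```

Once the two ranks are stored with each node, the test is two integer comparisons and needs no tree traversal.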

Determining Ancestor-Descendant Relationship – cont.

• Advantage: the ancestor-descendant relationship can be determined in constant time.

• Disadvantage: a lack of flexibility.
– e.g. inserting a new node requires recomputing the numbers of many tree nodes.

• A new numbering scheme:
– Each node is associated with an <order, size> pair:

• For a tree node y and its parent x (the parent's interval is exclusive at order(x)):

[order(y), order(y) + size(y)] ⊆ (order(x), order(x) + size(x)]

• For two sibling nodes x and y, if x is the predecessor of y in preorder traversal, then:

order(x) + size(x) < order(y)

• Fact: for two given nodes x and y of a tree T, x is an ancestor of y iff:

order(x) < order(y) ≤ order(x) + size(x)

• Properties:
– the ancestor-descendant relationship can be determined in constant time.
– flexibility – node insertion usually doesn't require recomputation of tree nodes.
– an element can be uniquely identified in a document by its order value.
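The following sketch shows one way <order, size> pairs satisfying these conditions could be assigned, and how reserved space allows an insertion without renumbering. The slack parameter, the assign function and the sample tree are assumptions made for illustration; the slides do not say how the extra space is chosen.

```python
def assign(tree, node, next_order, slack, labels):
    """Label the subtree rooted at `node` with (order, size) pairs.

    Returns the largest order value reserved for this subtree.
    """
    order = next_order
    last = order
    for child in tree[node]:
        last = assign(tree, child, last + 1, slack, labels)
    last += slack  # leave unused order values for future insertions
    labels[node] = (order, last - order)
    return last


def is_ancestor(x, y):
    """x and y are (order, size) pairs; constant-time interval test."""
    (order_x, size_x), (order_y, _) = x, y
    return order_x < order_y <= order_x + size_x


# Hypothetical tree: a has children b and c.
TREE = {"a": ["b", "c"], "b": [], "c": []}
labels = {}
assign(TREE, "a", 1, 2, labels)   # a=(1, 8), b=(2, 2), c=(5, 2)

# Insert a new child d of a after c. c's interval ends at 7 and a's at 9,
# so order 8 is free and no existing node has to be renumbered.
labels["d"] = (8, 0)
print(is_ancestor(labels["a"], labels["d"]))  # True
print(is_ancestor(labels["c"], labels["d"]))  # False
```

With slack set to 0 the numbering becomes dense, and any insertion would then force renumbering, much as in the preorder/postorder scheme.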

XISS System Overview

Name Index and Value Table

• Objective: minimizing the storage and computation overhead by eliminating replicated strings and string comparisons.

• Name Index - mapping distinct name strings into unique name identifiers (nid).

• Value Table - mapping distinct value strings (i.e. attribute value and text value) into unique value identifiers (vid).

• Both implemented as a B+-tree.
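The idea of replacing strings by identifiers can be sketched as below. A Python dict stands in for the B+-tree here, which is a simplification, and the class name StringTable and its methods are hypothetical.

```python
class StringTable:
    """Maps distinct strings to small integer identifiers and back.

    A dict replaces the B+-tree of the real system (an assumption of this sketch).
    """

    def __init__(self):
        self._ids = {}       # string -> identifier
        self._strings = []   # identifier -> string

    def intern(self, s):
        """Return the identifier of s, assigning a new one on first use."""
        if s not in self._ids:
            self._ids[s] = len(self._strings)
            self._strings.append(s)
        return self._ids[s]

    def lookup(self, ident):
        return self._strings[ident]


name_index = StringTable()    # name strings  -> nid
value_table = StringTable()   # value strings -> vid

nid_chapter = name_index.intern("chapter")
nid_figure = name_index.intern("figure")
print(name_index.intern("chapter") == nid_chapter)  # True
print(name_index.lookup(nid_figure))                # figure
```

After interning, two names can be compared by comparing their integer identifiers, and each distinct string is stored only once.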

The Element Index

• Objective: quickly finding all elements with the same name string.

• Structure:

The Attribute Index

• Objective: quickly finding all attributes with the same name string.

• Structure:
– Same structure as the Element Index, except that each record in the Attribute Index has a value identifier (vid), which is a key used to obtain the attribute value from the Value Table.

The Structure Index

• Objectives:
– Finding the parent element and child elements (or attributes) for a given element.
– Finding the parent element for a given attribute.

• Structure:

The Structure Index – cont.

• Structure:
– B+-tree using document identifier (did) as a key.
– Leaf nodes: linear arrays with records for all elements and attributes from an XML document.
– Each record: {nid, <order,size>, Parent order, Child order, Sibling order, Attribute order}.
– Records are ordered by order value.
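A sketch of one leaf array and a parent lookup is given below. It makes several assumptions not stated on the slide: that Child, Sibling and Attribute order refer to the first child, the next sibling and the first attribute, and that records are located by binary search on the order value. The outer B+-tree keyed by did is left out, and the names Record and DocumentRecords are hypothetical.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Record:
    nid: int
    order: int
    size: int
    parent: Optional[int] = None     # order of the parent element
    child: Optional[int] = None      # assumed: order of the first child element
    sibling: Optional[int] = None    # assumed: order of the next sibling
    attribute: Optional[int] = None  # assumed: order of the first attribute


class DocumentRecords:
    """Stand-in for the leaf array holding the records of one document."""

    def __init__(self, records: List[Record]):
        self.records = sorted(records, key=lambda r: r.order)
        self.orders = [r.order for r in self.records]

    def find(self, order: int) -> Optional[Record]:
        """Binary search, possible because records are kept in order-value order."""
        i = bisect_left(self.orders, order)
        if i < len(self.orders) and self.orders[i] == order:
            return self.records[i]
        return None

    def parent_of(self, order: int) -> Optional[Record]:
        rec = self.find(order)
        if rec is None or rec.parent is None:
            return None
        return self.find(rec.parent)


ELE, ATT = 0, 1  # hypothetical name identifiers
doc = DocumentRecords([
    Record(ELE, 1, 3, child=3, attribute=2),
    Record(ATT, 2, 0, parent=1),
    Record(ELE, 3, 1, parent=1, attribute=4),
    Record(ATT, 4, 0, parent=3),
])
print(doc.parent_of(4).order)  # 3
```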

Querying Method

• Decomposing path expressions into simple path expressions.

• Applying algorithms on simple path expressions and their intermediate results.

Decomposition of Path Expressions

• The main idea:
– A complex path expression is decomposed into several simple path expressions.
– Each simple path expression produces an intermediate result that can be used in the subsequent stage of processing.
– The results of the simple path expressions are then combined or joined together to obtain the final result of the given query.

Basic Subexpressions - Example

Decomposition of

(E1/E2)*/ E3 / ((E4[@a=V]) | (E5/_*/E6)):

(1) Single Element/Attribute

(2) Element-Attribute

(3) Element-Element

(4) Kleene Closure

(5) Union

Each of the seven leaves of the decomposition (E1, E2, E3, E4, @a, E5, E6) is a type (1) subexpression.

Example: EA-Join: Element and Attribute Join

EA-Join: Element and Attribute Join

Input:

{E1,…,Em}: Ei is a set of elements having a common document identifier (did);

{A1,…,An}: Aj is a set of attributes having a common document identifier (did);

Output:

A set of (e,a) pairs such that the element e is the parent of the attribute a.

EA-Join: Element and Attribute Join – cont.

The Algorithm:

// Sort-merge {Ei} and {Aj} by did.
foreach Ei and Aj with the same did do:
    // Sort-merge Ei and Aj by PARENT-CHILD relationship.
    foreach e ∈ Ei and a ∈ Aj do
        if (e is a parent of a) then output (e,a)
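The algorithm above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: each attribute record is assumed to carry its parent's order value, the outer merge by did is simplified to a dictionary key intersection, and the names Element, Attribute and ea_join are hypothetical.

```python
from collections import namedtuple

Element = namedtuple("Element", "did order size")
# Assumption: an attribute record stores the order value of its parent element.
Attribute = namedtuple("Attribute", "did order parent")


def ea_join(elements, attributes):
    """Join elements with their attributes.

    Both arguments map did -> list of records sorted by order value.
    Returns (element, attribute) pairs where the element is the parent.
    """
    pairs = []
    # Outer step: match groups with the same did (simplified from a sort-merge).
    for did in sorted(elements.keys() & attributes.keys()):
        elems, attrs = elements[did], attributes[did]
        i = 0
        # Inner step: one forward scan over both sorted lists.
        for a in attrs:
            while i < len(elems) and elems[i].order < a.parent:
                i += 1
            if i < len(elems) and elems[i].order == a.parent:
                pairs.append((elems[i], a))
    return pairs


elements = {1: [Element(1, 1, 3), Element(1, 3, 1)]}
attributes = {1: [Attribute(1, 2, 1), Attribute(1, 4, 3)]}
for e, a in ea_join(elements, attributes):
    print(e.order, a.order)
# 1 2
# 3 4
```

The element pointer never moves backwards because attributes are numbered directly after their parent, so the parent order values seen while scanning the attribute list never decrease.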

EA-Join – Example

• Consider an XML document with two nested “Ele” elements, each with an “Att” attribute, numbered as follows:

Ele <1,3>
    Att <2,0>
    Ele <3,1>
        Att <4,0>

• And the query: /Ele[@Att=“A1”]

• Sort-merging the “Ele”s and “Att”s by parent-child relationship gives us the list: <1,3>, <2,0>, <3,1>, <4,0>

• Finding the “Ele” elements that have a child attribute “Att” with the value “A1” from the resulting list is easy using the information in the Element Record.

EA-Join – Querying /Ele[@Att=“A1”]

Tree nodes in order: Ele <1,3>, Att <2,0>, Ele <3,1>, Att <4,0>

EA-Join – Comments

• Only a two-stage sort-merge operation without additional cost of sorting:
– First merge: by did.
– Second merge: by examining parent-child relationship.

• This merge is based on the order values of the element and attribute as defined by the numbering scheme.

• Attributes should be placed before their sibling elements in the order of the numbering scheme.
– This guarantees that elements and attributes with the same did can be merged in a single scan.

Conclusions

• XISS can efficiently process regular path expression queries.

• Improves performance over conventional methods by up to an order of magnitude.

• Future work: optimal page size or the break-even point between the two criteria.

Thank you so much!
