On the Memory Requirements of XPath Evaluation over XML Streams

Ziv Bar-Yossef, Marcus Fontoura, Vanja Josifovski

IBM Almaden Research Center

Preliminaries: XML

<conference>
  <name> PODS </name>
  <speaker> <name> Josifovski </name> <paper_cnt> 1 </paper_cnt> </speaker>
  <speaker> <name> Fagin </name> <paper_cnt> 3 </paper_cnt> </speaker>
</conference>

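[Figure: the document as a tree. The conference root has two speaker children, each with a name and a paper_cnt child (Josifovski, 1; Fagin, 3); node labels x4, x5 and x7 are shown.]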

Preliminaries: XPath 1.0

/conference[name = PODS]/speaker[paper_cnt > 1]/name

[Figure: the query tree (conference, with name = PODS and speaker below it, and name and paper_cnt below speaker) next to the document tree with node labels x4, x5 and x7.]

Result: { x7 }

XML Streams

XML stream: XML document arriving as a one-way stream

Critical resources:

• Memory

• Processing time

Why XML streams?

• For transferring XML between systems

• For efficient access to large XML documents

Streaming XML Algorithms

• XFilter and YFilter [Altinel and Franklin 00] [Diao et al 02]
• X-scan [Ives, Levy, and Weld 00]
• XMLTK [Avila-Campillo et al 02]
• XTrie [Chan et al 02]
• SPEX [Olteanu, Kiesling, and Bry 03]
• Lazy DFAs [Green et al 03]
• The XPush Machine [Gupta and Suciu 03]
• XSQ [Peng and Chawathe 03]
• TurboXPath [Josifovski, Fontoura, and Barta 04]
• …

Our Results

• Space lower bounds for evaluating XPath on XML streams

• A streaming XML algorithm
    • Matches the lower bounds on a large fragment of the language
    • Uses space sub-linear in the query size rather than exponential in the query size

Related Work

• Space complexity of XPath evaluation over non-streaming XML documents [Gottlob, Koch, Pichler 03], [Segoufin 03]

• Space complexity of XPath evaluation over streams of indexed XML data [Choi, Mahoui, Wood 03]

• Space complexity of select-project-join queries over relational data streams [Arasu et al 02]

Data Complexity [Vardi 82]

(Q,D) Evaluation function of a query Q on document D.

Q(D) Evaluation function of a fixed query Q on document D.

Data complexity of Q: Complexity of the best algorithm for Q on the worst-case document D.

Worst-case data complexity: max_Q (data complexity of Q).

We characterize the data complexity of Q separately for each Q (not just the worst-case one).

XPath Fragment

1. Queries are subsumption-free

Not subsumption-free: conference[name = PODS and name != SIGMOD]

Subsumption-free: conference[name != SIGMOD]

XPath Fragment (cont.)

2. Queries are univariate

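[Figure: two query trees rooted at conference. Not univariate: a single predicate that relates paper_cnt and author_cnt to each other. Univariate: separate single-variable predicates, paper_cnt < 30 and author_cnt > 30.]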

XPath Fragment (cont.)

3. Queries consist of conjunctions only

4. Queries are “star-restricted”

Query Frontier Size

Definitions:

1. Frontier at u: u, its siblings, and the siblings of its ancestors.

2. FrontierSize(Q): size of the largest frontier.

Theorem 1: For all queries Q in the fragment,

stream-space(Q) = Ω(FrontierSize(Q)).

[Figure: query tree for the running example: conference with children name (= PODS) and speaker; speaker with children name and paper_cnt.]
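The two definitions above can be turned into a small computation. The following is a minimal sketch, assuming the query is given as a plain parent-to-children mapping and that the root has no siblings; the function name `frontier_size` and the node identifiers are hypothetical, and the formal definition in the paper may differ in details.

```python
from typing import Dict, List


def frontier_size(children: Dict[str, List[str]], root: str) -> int:
    """Size of the largest frontier in the tree rooted at `root`.

    The frontier at u is u, its siblings, and the siblings of its ancestors.
    If the frontier at a node has size s and the node has k children, then
    the frontier at each child has size s - 1 + k: the node itself drops out
    and all k children come in.
    """
    best = 0
    stack = [(root, 1)]  # assumption: the root has no siblings
    while stack:
        node, size = stack.pop()
        best = max(best, size)
        kids = children.get(node, [])
        for kid in kids:
            stack.append((kid, size - 1 + len(kids)))
    return best


# Running example: /conference[name = PODS]/speaker[paper_cnt > 1]/name
QUERY = {
    "conference": ["conference/name", "speaker"],
    "speaker": ["speaker/name", "paper_cnt"],
}

print(frontier_size(QUERY, "conference"))  # 3
```

For the running example the largest frontier is at either child of speaker: the two children of speaker plus the name child of conference.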

Document Recursion Depth

Definition:

recDepth_Q(D): Max number of nodes in D that lie on one root-to-leaf path and “path match” the same node in Q.

[Figure: query Q (//part with name and number nodes) and document D (values Refrigerator, Compressor, 12; node label x5).]

Theorem 2: For all queries Q in the fragment that have at least one “//” node,

stream-space(Q) = Ω(recDepth_Q(D)).
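To make the recursion-depth definition concrete, here is a minimal sketch for the simplest case only: a query whose single “//” step is `//part`, so that a document node path matches the query node exactly when its tag is `part`. The function `rec_depth` and the sample document are illustrative assumptions; general path matching against an arbitrary query is more involved.

```python
import xml.etree.ElementTree as ET


def rec_depth(elem: ET.Element, tag: str) -> int:
    """Max number of `tag` elements on one root-to-leaf path below `elem`."""
    here = 1 if elem.tag == tag else 0
    below = max((rec_depth(child, tag) for child in elem), default=0)
    return here + below


# Hypothetical document: a part nested inside another part.
DOC = ET.fromstring(
    "<part><name>Refrigerator</name>"
    "<part><name>Compressor</name><number>12</number></part>"
    "</part>"
)

print(rec_depth(DOC, "part"))  # 2
```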

Document Depth

Definition:

depth(D): Length of longest root-to-leaf path.

[Figure: document D (name and number nodes; values Refrigerator, Compressor, 12; node label x5).]

Theorem 3: For all queries Q in the fragment that have at least one “/” node,

stream-space(Q) = Ω(log depth(D)).
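One intuition for a log depth(D) term is that a streaming evaluator has to know at which level it currently is, and a level counter takes about log2 of the depth in bits. The sketch below only illustrates that counter; it is not the lower-bound argument. It assumes depth is counted in nodes, and `stream_depth` is a hypothetical helper built on the standard `xml.etree.ElementTree.iterparse` event stream.

```python
import io
import xml.etree.ElementTree as ET


def stream_depth(xml_bytes: bytes) -> int:
    """Depth of the document (in nodes), computed from start/end events."""
    depth = deepest = 0
    events = ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end"))
    for event, elem in events:
        if event == "start":
            depth += 1
            deepest = max(deepest, depth)
        else:
            depth -= 1
            elem.clear()  # drop content we no longer need
    return deepest


d = stream_depth(b"<a><b><c/></b><d/></a>")
print(d, d.bit_length())  # depth 3, which fits in a 2-bit counter
```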

New algorithm

Theorem 4(a):

For all queries Q in “Univariate XPath”:

Space: O(|Q| · recDepth(D) · log depth(D)).

Time: O(|D| · |Q| · recDepth(D)).

Theorem 4(b):

For all queries Q in a subset of our fragment and for non-recursive documents D:

Space: O(FrontierSize(Q) · log depth(D)).

Time: O(|D| · FrontierSize(Q)).

Proof of Theorem 1

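Fragment:

• “subsumption-free”
• “univariate”
• Conjunctions only
• “star-restricted”

[Figure: query tree for the running example: conference with children name (= PODS) and speaker; speaker with children name and paper_cnt.]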

Critical Document

Definition: Document D is critical for query Q, if:

(1) D matches Q.

(2) If we remove from D any node, it no longer matches Q.

[Figure: query Q (conference with name = PODS and speaker; speaker with name and paper_cnt) and document D (conference with two speakers: Josifovski, 1 and Fagin, 3; node labels x4, x5, x7).]
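The definition can be checked by brute force on small examples. The sketch below is an illustration under stated assumptions, not the paper's formalism: queries use the child axis only, "removing a node" is modeled as removing its whole subtree, and the classes `Doc` and `Query` and the helpers `matches` and `is_critical` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List, Optional


@dataclass
class Doc:
    tag: str
    text: str = ""
    kids: List["Doc"] = field(default_factory=list)


@dataclass
class Query:
    tag: str
    test: Optional[Callable[[str], bool]] = None  # optional value predicate
    kids: List["Query"] = field(default_factory=list)


def matches(q: Query, d: Doc) -> bool:
    """True if document node d matches query node q (child axis only)."""
    if q.tag != d.tag or (q.test is not None and not q.test(d.text)):
        return False
    return all(any(matches(qk, dk) for dk in d.kids) for qk in q.kids)


def nodes(d: Doc) -> Iterator[Doc]:
    yield d
    for k in d.kids:
        yield from nodes(k)


def without(d: Doc, victim: Doc) -> Doc:
    """Copy of d with the subtree rooted at `victim` removed."""
    return Doc(d.tag, d.text, [without(k, victim) for k in d.kids if k is not victim])


def is_critical(q: Query, d: Doc) -> bool:
    if not matches(q, d):
        return False
    return all(not matches(q, without(d, v)) for v in nodes(d) if v is not d)


# /conference[name = PODS]/speaker[paper_cnt > 1]/name
Q = Query("conference", kids=[
    Query("name", lambda t: t == "PODS"),
    Query("speaker", kids=[Query("paper_cnt", lambda t: int(t) > 1), Query("name")]),
])
fagin = Doc("speaker", kids=[Doc("name", "Fagin"), Doc("paper_cnt", "3")])
josifovski = Doc("speaker", kids=[Doc("name", "Josifovski"), Doc("paper_cnt", "1")])
two_speakers = Doc("conference", kids=[Doc("name", "PODS"), josifovski, fagin])
one_speaker = Doc("conference", kids=[Doc("name", "PODS"), fagin])

print(is_critical(Q, two_speakers), is_critical(Q, one_speaker))  # False True
```

The two-speaker document matches Q but is not critical, because the Josifovski speaker can be removed without affecting the match; the one-speaker document is critical.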

Main Lemmas

Lemma 1: For all queries Q in the fragment and any critical document D for Q,

stream-space(Q) = Ω(FrontierSize(D)).

Lemma 2: For all queries Q in the fragment, there is a critical document D so that

FrontierSize(D) = FrontierSize(Q).

One-way Communication Complexity

f: X × Y → Z

[Figure: Alice holds x and sends a single message to Bob, who holds y and outputs f(x, y).]

CC(f) = number of communication bits used by the best protocol on the worst-case choice of inputs.

Reduction

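A: streaming algorithm for Q using space S

[Figure: Alice runs A on the document prefix and sends the resulting state of A to Bob, who resumes A from that state on the document suffix.]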

Theorem: stream-space(Q) >= CC(Q)
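The reduction can be illustrated with a toy streaming algorithm. In this sketch, which is an assumption-laden illustration rather than the construction from the talk, the "query" only asks whether the token stream contains both `b` and `c`; the class `BothSeen` and the function `one_way_protocol` are hypothetical names. Alice's only message is the algorithm's state after the prefix, so the message is no longer than the space the algorithm uses.

```python
from typing import Iterable, Tuple


class BothSeen:
    """Toy streaming algorithm: has the stream contained both 'b' and 'c'?"""

    def __init__(self, state: Tuple[bool, bool] = (False, False)) -> None:
        self.seen_b, self.seen_c = state

    def feed(self, tokens: Iterable[str]) -> "BothSeen":
        for t in tokens:
            self.seen_b = self.seen_b or t == "b"
            self.seen_c = self.seen_c or t == "c"
        return self

    def state(self) -> Tuple[bool, bool]:
        return (self.seen_b, self.seen_c)

    def answer(self) -> bool:
        return self.seen_b and self.seen_c


def one_way_protocol(prefix: Iterable[str], suffix: Iterable[str]) -> bool:
    message = BothSeen().feed(prefix).state()       # Alice: run on the prefix
    return BothSeen(message).feed(suffix).answer()  # Bob: resume on the suffix


print(one_way_protocol(["a", "c"], ["b"]))  # True
```

The protocol gives the same answer as running the algorithm on the concatenated stream, which is why a space-S streaming algorithm yields a one-way protocol with an S-bit message.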

Fooling Set Technique

Theorem: For any fooling set T, CC(Q) = Ω(log |T|).

Partitioned document: a document split into a document prefix and a document suffix.

Definition

A set T of partitioned documents is a fooling set for Q if:

1. All documents in T match Q.

2. For any two distinct documents D and D' in T, at least one of the two crossed documents (the prefix of D followed by the suffix of D', or the prefix of D' followed by the suffix of D) does not match Q.
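The fooling-set conditions can be verified mechanically on a toy example. The sketch below assumes a simplified "query" that asks whether a token stream contains all of `b`, `c` and `d`, and builds one partitioned stream per subset, loosely mirroring the one-document-per-subset construction used for Lemma 1. The names `is_fooling_set`, `matches` and `REQUIRED` are hypothetical.

```python
from itertools import combinations
from math import log2
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[List[str], List[str]]  # (prefix, suffix)


def is_fooling_set(pairs: Sequence[Pair], matches: Callable[[List[str]], bool]) -> bool:
    # Condition 1: every partitioned document matches.
    if not all(matches(prefix + suffix) for prefix, suffix in pairs):
        return False
    # Condition 2: for each two distinct members, some crossing fails to match.
    return all(
        not matches(p1 + s2) or not matches(p2 + s1)
        for (p1, s1), (p2, s2) in combinations(pairs, 2)
    )


REQUIRED = ("b", "c", "d")


def matches(tokens: List[str]) -> bool:
    return set(REQUIRED) <= set(tokens)


def subsets(items: Sequence[str]):
    for r in range(len(items) + 1):
        yield from combinations(items, r)


# One partitioned stream per subset S: S in the prefix, the rest in the suffix.
PAIRS = [(list(s), [x for x in REQUIRED if x not in s]) for s in subsets(REQUIRED)]

print(is_fooling_set(PAIRS, matches), log2(len(PAIRS)))  # True 3.0
```

With 8 members, the fooling set gives a bound of log2(8) = 3 bits, one per required token.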

Proof of Lemma 1

Lemma 1: For all queries Q in the fragment and any critical document D for Q,

stream-space(Q) = Ω(FS(D)).

[Figure: query Q and a critical document D for it (conference; speaker with name Fagin and paper_cnt 3).]

Proof of Lemma 1

For each subset S of Frontier(D), define a partitioned document D_S.

[Figure: query Q and the partitioned document D_S for S = { x2, x5 }.]

Proof of Lemma 1 (cont)

Claim: { D_S : S is a subset of Frontier(D) } is a fooling set.

Hence stream-space(Q) >= log(2^FS(D)) = FS(D).

Proof of Claim:

1. For all S, D_S matches Q.

2. If S ≠ T, need: either D_ST or D_TS does not match Q, where D_ST is the prefix of D_S followed by the suffix of D_T.

Proof of Claim (example)

[Figure: documents D_S (S = { x2, x5 }) and D_T (T = { x4, x5 }), and the crossed document D_TS under root x0. In D_TS the conference name is missing, so D_TS does not match Q.]

Algorithm

• Uses the query as an NFA

• Based on three global data structures:
    • Pointer array
    • Validation array
    • Level array

Matches the lower bounds for a fragment of XPath.

Algorithm Example Run

Query: /a[b and c]

Input XML: <a> <c>c1</c> b1</a>...

[Figure: step-by-step animation of the run, showing the level array, the validation array (Index 0, Index 1) and the pointer array with one entry (at the /c step, between query nodes u2 and u3) as the input is consumed. The final frame returns the match for /a.]
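For readers who want to experiment with the example query, the following sketch evaluates /a[b and c] over a stream of parser events. It is not the pointer/validation/level-array algorithm from the talk; it is a hand-written evaluator for this one query, using the standard `xml.etree.ElementTree.iterparse` API, and the function name `matches_a_b_and_c` and the sample inputs are assumptions.

```python
import io
import xml.etree.ElementTree as ET


def matches_a_b_and_c(xml_bytes: bytes) -> bool:
    """Does the root element <a> have both a <b> child and a <c> child?"""
    depth = 0
    seen = set()     # which of the predicate children have been observed
    matched = False
    events = ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end"))
    for event, elem in events:
        if event == "start":
            depth += 1
            if depth == 2 and elem.tag in ("b", "c"):  # direct child of root
                seen.add(elem.tag)
        else:
            if depth == 1 and elem.tag == "a":         # root element closes
                matched = {"b", "c"} <= seen
            depth -= 1
            elem.clear()
    return matched


print(matches_a_b_and_c(b"<a><c>c1</c><b>b1</b></a>"))  # True
```

The evaluator keeps only a level counter and one flag per predicate, which is in the spirit of tracking a frontier rather than buffering the document.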

Conclusion: Our Contributions

• Space lower bounds on the instance data complexity of XPath on XML streams:

1. In terms of Query Frontier Size

2. In terms of Document Recursion Depth

3. In terms of Document Depth

• A streaming XML algorithm
    • Matches the lower bounds on a fragment of the language
    • Does not use finite-state automata

XPath 1.0

[Figure: document tree with leaf values Josifovski, 1, Fagin, 3 and node labels x4, x5, x7, x8.]

/conference/name

Result: { x2 }

XPath 1.0

[Figure: document tree with leaf values Josifovski, 1, Fagin, 3 and node labels x4, x5, x7, x8.]

/conference//name

Result: { x2, x4, x7 }
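The difference between the child axis and the descendant axis can be reproduced with Python's standard `xml.etree.ElementTree`, which supports a small XPath subset. This is a sketch under the assumption that the conference element carries a name child with value PODS, as the running example query suggests; because `findall` is evaluated relative to the root element, `name` stands for /conference/name and `.//name` for /conference//name.

```python
import xml.etree.ElementTree as ET

# Assumed document for the running example (the PODS name is an assumption).
DOC = """
<conference>
  <name>PODS</name>
  <speaker><name>Josifovski</name><paper_cnt>1</paper_cnt></speaker>
  <speaker><name>Fagin</name><paper_cnt>3</paper_cnt></speaker>
</conference>
"""

root = ET.fromstring(DOC)

# /conference/name: name elements that are children of the root
child_names = [e.text for e in root.findall("name")]

# /conference//name: name elements anywhere below the root
descendant_names = [e.text for e in root.findall(".//name")]

print(child_names)       # ['PODS']
print(descendant_names)  # ['PODS', 'Josifovski', 'Fagin']
```

The one child match and the three descendant matches correspond to the results { x2 } and { x2, x4, x7 } on the slides.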

Reduction

A: S-space streaming algorithm for Q.

r ≥ 1: integer.

[Figure: Alice and Bob with messages s0, s1, s2, s3, s4, s5, s6 (example with r = 6); the output is Q(D).]

Theorem: S ≥ CC(Qr) / r
