Post on 21-Jan-2016
On the Memory Requirements of XPath
Evaluation over XML Streams
Ziv Bar-Yossef, Marcus Fontoura, Vanja Josifovski
IBM Almaden Research Center
Preliminaries: XML
<conference>
  <name> PODS </name>
  <speaker> <name> Josifovski </name> <paper_cnt> 1 </paper_cnt> </speaker>
  <speaker> <name> Fagin </name> <paper_cnt> 3 </paper_cnt> </speaker>
</conference>
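[Figure: document tree. root (x0) has child conference (x1); conference has children name (x2, PODS), speaker (x3) and speaker (x6); speaker x3 has children name (x4, Josifovski) and paper_cnt (x5, 1); speaker x6 has children name (x7, Fagin) and paper_cnt (x8, 3).]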
Preliminaries: XPath 1.0
/conference[name = PODS]/speaker[paper_cnt > 1]/name
[Figure: the query tree (root, conference with predicate name = PODS, speaker with predicate paper_cnt > 1, output node name) shown next to the document tree x0 to x8 from the previous slide.]
Result: { x7 }
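The query on this slide can be checked against the example document with a short script. This is only a sketch: it evaluates the two predicates by hand on an in-memory tree (not on a stream), and the helper names `text` and `evaluate` are made up for the illustration.

```python
import xml.etree.ElementTree as ET

DOC = """
<conference> <name> PODS </name>
  <speaker> <name> Josifovski </name> <paper_cnt> 1 </paper_cnt> </speaker>
  <speaker> <name> Fagin </name> <paper_cnt> 3 </paper_cnt> </speaker>
</conference>
"""

def text(elem):
    """Return the element's text with surrounding whitespace removed."""
    return (elem.text or "").strip()

def evaluate(xml_string):
    """Evaluate /conference[name = PODS]/speaker[paper_cnt > 1]/name by hand."""
    conference = ET.fromstring(xml_string.strip())
    if conference.tag != "conference":
        return []
    # Predicate [name = PODS]: some name child must equal "PODS".
    if not any(text(n) == "PODS" for n in conference.findall("name")):
        return []
    results = []
    for speaker in conference.findall("speaker"):
        # Predicate [paper_cnt > 1]: some paper_cnt child must exceed 1.
        if any(float(text(p)) > 1 for p in speaker.findall("paper_cnt")):
            results.extend(text(n) for n in speaker.findall("name"))
    return results

print(evaluate(DOC))  # the selected name node is x7 in the slide's numbering
```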
XML Streams
XML stream: XML document arriving as a one-way stream
Critical resources:
• Memory
• Processing time
Why XML streams?
• For transferring XML between systems
• For efficient access to large XML documents
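As a rough illustration of one-way access, the sketch below reads a document front to back and keeps only per-tag counters. It uses Python's standard `iterparse`; the function name `count_elements` and the sample bytes are assumptions for the example, and this is not one of the algorithms discussed in the talk.

```python
import io
import xml.etree.ElementTree as ET

DOC = (b"<conference><name>PODS</name>"
       b"<speaker><name>Josifovski</name><paper_cnt>1</paper_cnt></speaker>"
       b"<speaker><name>Fagin</name><paper_cnt>3</paper_cnt></speaker>"
       b"</conference>")

def count_elements(stream):
    """Count elements per tag in a single forward pass over the stream."""
    counts = {}
    for _event, elem in ET.iterparse(stream, events=("end",)):
        counts[elem.tag] = counts.get(elem.tag, 0) + 1
        elem.clear()  # release the closed element's children and text
    return counts

print(count_elements(io.BytesIO(DOC)))
```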
Streaming XML Algorithms
• XFilter and YFilter [Altinel and Franklin 00] [Diao et al 02]
• X-scan [Ives, Levy, and Weld 00]
• XMLTK [Avila-Campillo et al 02]
• XTrie [Chan et al 02]
• SPEX [Olteanu, Kiesling, and Bry 03]
• Lazy DFAs [Green et al 03]
• The XPush Machine [Gupta and Suciu 03]
• XSQ [Peng and Chawathe 03]
• TurboXPath [Josifovski, Fontoura, and Barta 04]
• …
Our Results
• Space lower bounds for evaluating XPath on XML streams
• A streaming XML algorithm
  • Matches the lower bounds on a large fragment of the language
  • Uses space sub-linear in the query size rather than exponential in the query size
Related Work
• Space complexity of XPath evaluation over non-streaming XML documents [Gottlob, Koch, Pichler 03], [Segoufin 03]
• Space complexity of XPath evaluation over streams of indexed XML data [Choi, Mahoui, Wood 03]
• Space complexity of select-project-join queries over relational data streams [Arasu et al 02]
Data Complexity [Vardi 82]
(Q,D) Evaluation function of a query Q on document D.
Q(D) Evaluation function of a fixed query Q on document D.
Data complexity of Q: complexity of the best algorithm for Q on the worst-case D.
Worst-case data complexity: max over Q of (data complexity of Q).
We characterize the data complexity of Q separately for each Q (not just the worst-case one).
XPath Fragment
1. Queries are subsumption-free
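[Figure: two query trees.]
Not subsumption-free: conference with predicates name = PODS and name != SIGMOD (the first predicate already implies the second).
Subsumption-free: conference with the single predicate name != SIGMOD.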
XPath Fragment (cont.)
2. Queries are univariate
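[Figure: two query trees.]
Not univariate: conference with a predicate comparing two nodes, paper_cnt < author_cnt.
Univariate: conference with predicates paper_cnt < 30 and author_cnt > 30, each comparing a single node to a constant.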
XPath Fragment (cont.)
3. Queries consist of conjunctions only
4. Queries are “star-restricted”
Query Frontier Size
Definitions:
1. Frontier at u: u, its siblings, and the siblings of its ancestors.
2. FrontierSize(Q): size of the largest frontier.
Theorem 1: For all queries Q in the fragment,
stream-space(Q) = Ω(FrontierSize(Q)).
[Figure: query tree. root, conference with children name (= PODS) and speaker; speaker with children name and paper_cnt (> 1).]
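The two definitions on this slide can be turned into a few lines of code. The sketch below encodes the example query tree as a dictionary of children; the node labels `conf_name` and `spk_name` are hypothetical names used only to tell the two name nodes apart.

```python
# Query tree for /conference[name = PODS]/speaker[paper_cnt > 1]/name,
# written as node -> list of children (labels are illustrative).
QUERY = {
    "root": ["conference"],
    "conference": ["conf_name", "speaker"],
    "speaker": ["spk_name", "paper_cnt"],
    "conf_name": [],
    "spk_name": [],
    "paper_cnt": [],
}

def frontier(tree, root, target):
    """Frontier at target: the node, its siblings, and the siblings of its ancestors."""
    parent = {child: p for p, children in tree.items() for child in children}
    result = {target}
    node = target
    while node != root:
        # Add the siblings of the current node, then move up to its parent.
        result.update(c for c in tree[parent[node]] if c != node)
        node = parent[node]
    return result

def frontier_size(tree, root):
    """FrontierSize: size of the largest frontier over all nodes of the tree."""
    return max(len(frontier(tree, root, u)) for u in tree)

print(sorted(frontier(QUERY, "root", "spk_name")))
print(frontier_size(QUERY, "root"))
```

On this query the largest frontier is the one at either child of speaker, which also picks up the conference-level name node.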
Document Recursion Depth
Definition:
recDepthQ(D): maximum number of nodes in D that lie on one root-to-leaf path and “path match” the same node in Q.
[Figure: Query Q (root, //part with children name and number) and a recursive Document D (nodes x0 to x7) in which one part element is nested inside another; the text values shown are Refrigerator, Compressor, 12 and 456.]
Theorem 2: For all queries Q in the fragment that have at least one “//” node,
stream-space(Q) = Ω(recDepthQ(D)).
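For a query node such as //part, every part element path-matches it, so the recursion depth reduces to the deepest nesting of part inside part. The sketch below computes that in one pass. It covers only this special case, and both the function name `rec_depth` and the way the values are assigned to elements in the sample document are assumptions.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical serialization of the slide's recursive document.
DOC = (b"<part><name>Refrigerator</name>"
       b"<part><name>Compressor</name><number>12</number></part>"
       b"<number>456</number></part>")

def rec_depth(stream, tag):
    """Largest number of simultaneously open elements named `tag`."""
    open_count = 0
    deepest = 0
    for event, elem in ET.iterparse(stream, events=("start", "end")):
        if elem.tag != tag:
            continue
        if event == "start":
            open_count += 1
            deepest = max(deepest, open_count)
        else:
            open_count -= 1
    return deepest

print(rec_depth(io.BytesIO(DOC), "part"))
```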
Document Depth
Definition:
depth(D): length of the longest root-to-leaf path.
[Figure: the recursive Document D (nodes x0 to x7) from the previous slide.]
Theorem 3: For all queries Q in the fragment that have at least one “/” node,
stream-space(Q) = Ω(log depth(D)).
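One way to see where a log depth(D) term can come from: an algorithm that needs to know the current level of the stream has to store a counter, and a counter that ranges up to depth(D) takes about log2(depth(D)) bits. The sketch below is an intuition aid only, not the proof; it counts nested element levels, which is an assumed reading of "length".

```python
import io
import xml.etree.ElementTree as ET

def depth_and_counter_bits(stream):
    """Return (deepest element nesting, bits needed to store a level counter)."""
    level = 0
    deepest = 0
    for event, _elem in ET.iterparse(stream, events=("start", "end")):
        if event == "start":
            level += 1
            deepest = max(deepest, level)
        else:
            level -= 1
    return deepest, deepest.bit_length()

doc = b"<part><name>Refrigerator</name><part><name>Compressor</name></part></part>"
print(depth_and_counter_bits(io.BytesIO(doc)))
```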
New algorithm
Theorem 4(a):
For all queries Q in “Univariate XPath”:
Space: O(|Q| · recDepth(D) · log depth(D)).
Time: O(|D| · |Q| · recDepth(D)).
Theorem 4(b):
For all queries Q in a subset of our fragment and for non-recursive documents D:
Space: O(FrontierSize(Q) · log depth(D)).
Time: O(|D| · FrontierSize(Q)).
Proof of Theorem 1
Fragment:
• “subsumption-free”
• “univariate”
• Conjunctions only
• “star-restricted”
Theorem 1: For all queries Q in the fragment,
stream-space(Q) = Ω(FrontierSize(Q)).
[Figure: query tree. root, conference with children name (= PODS) and speaker; speaker with children name and paper_cnt (> 1).]
Critical Document
Definition: Document D is critical for query Q if:
(1) D matches Q.
(2) If we remove any node from D, it no longer matches Q.
[Figure: Query Q (root, conference with children name = PODS and speaker; speaker with children name and paper_cnt > 1) next to Document D, drawn over the example document tree x0 to x8.]
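The definition of a critical document can be tested by brute force on small trees. The sketch below hard-codes a matcher for the example query and tries every single-node removal. Treating "remove a node" as removing the node together with its subtree is an assumption, and the helpers `node`, `matches`, `removals` and `is_critical` are names invented for the illustration.

```python
from copy import deepcopy

def node(tag, text="", *children):
    """Build a tree node as a plain dictionary."""
    return {"tag": tag, "text": text, "children": list(children)}

def matches(root):
    """True if /conference[name = PODS]/speaker[paper_cnt > 1]/name selects a node."""
    for conf in root["children"]:
        if conf["tag"] != "conference":
            continue
        kids = conf["children"]
        if not any(k["tag"] == "name" and k["text"] == "PODS" for k in kids):
            continue
        for spk in kids:
            if spk["tag"] != "speaker":
                continue
            sub = spk["children"]
            has_cnt = any(s["tag"] == "paper_cnt" and float(s["text"]) > 1 for s in sub)
            has_name = any(s["tag"] == "name" for s in sub)
            if has_cnt and has_name:
                return True
    return False

def removals(root):
    """Yield copies of the tree, each with one non-root node (and its subtree) removed."""
    def paths(n, prefix=()):
        for i, child in enumerate(n["children"]):
            yield prefix + (i,)
            yield from paths(child, prefix + (i,))
    for path in paths(root):
        pruned = deepcopy(root)
        parent = pruned
        for i in path[:-1]:
            parent = parent["children"][i]
        del parent["children"][path[-1]]
        yield pruned

def is_critical(root):
    """Condition (1): the tree matches. Condition (2): no single removal still matches."""
    return matches(root) and not any(matches(r) for r in removals(root))

fagin = node("speaker", "", node("name", "Fagin"), node("paper_cnt", "3"))
josifovski = node("speaker", "", node("name", "Josifovski"), node("paper_cnt", "1"))
small = node("root", "", node("conference", "", node("name", "PODS"), fagin))
full = node("root", "", node("conference", "", node("name", "PODS"), josifovski, fagin))
print(is_critical(small), is_critical(full))
```

The full example document matches the query but is not critical, because the first speaker can be removed without affecting the match.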
Main Lemmas
Lemma 1: For all queries Q in the fragment and any critical document D for Q,
stream-space(Q) = Ω(FrontierSize(D)).
Lemma 2: For all queries Q in the fragment, there is a critical document D so that
FrontierSize(D) = FrontierSize(Q).
Theorem 1: For all queries Q in the fragment,
stream-space(Q) = Ω(FrontierSize(Q)).
One-way Communication Complexity
[Figure: Alice holds input x, Bob holds input y. Alice sends a single message m to Bob, and Bob outputs f(x,y).]
f: X × Y → Z
CC(f) = number of communication bits used by the best protocol on the worst-case choice of inputs.
Reduction
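A: streaming algorithm for Q using space S.
[Figure: the document D is split into a prefix, held by Alice, and a suffix, held by Bob. Alice runs A on the prefix and sends the resulting state of A to Bob; Bob resumes A from that state on the suffix and outputs Q(D).]
Theorem: stream-space(Q) >= CC(Q)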
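The reduction can be mimicked in code: whatever a streaming algorithm remembers at the split point is exactly the message Alice has to send. The sketch below uses a toy matcher for /a[b and c] over a list of tag events. The class `StreamMatcher`, the event encoding, and the use of JSON for the message are all assumptions, and the reported length is in characters of the serialized state rather than bits.

```python
import json

class StreamMatcher:
    """Toy streaming algorithm for /a[b and c] over tag events such as '<a>' and '</a>'."""

    def __init__(self, state=None):
        if state is None:
            state = {"depth": 0, "in_a": False, "b": False, "c": False, "result": False}
        self.state = state

    def feed(self, events):
        s = self.state
        for ev in events:
            if ev.startswith("</"):
                s["depth"] -= 1
                if s["depth"] == 0 and s["in_a"]:
                    # Top-level <a> closed: it matches if both children were seen.
                    s["result"] = s["result"] or (s["b"] and s["c"])
                    s["in_a"], s["b"], s["c"] = False, False, False
            else:
                s["depth"] += 1
                tag = ev[1:-1]
                if s["depth"] == 1:
                    s["in_a"] = (tag == "a")
                elif s["depth"] == 2 and s["in_a"] and tag in ("b", "c"):
                    s[tag] = True
        return s["result"]

def one_way_protocol(prefix, suffix):
    """Alice runs the algorithm on the prefix, Bob resumes it on the suffix."""
    alice = StreamMatcher()
    alice.feed(prefix)
    message = json.dumps(alice.state)          # the only communication: Alice -> Bob
    bob = StreamMatcher(json.loads(message))
    return bob.feed(suffix), len(message)

events = ["<a>", "<c>", "</c>", "<b>", "</b>", "</a>"]
answer, message_length = one_way_protocol(events[:3], events[3:])
print(answer, message_length)
```

Because the protocol communicates only the algorithm's state, any lower bound on the communication translates into a lower bound on the space the algorithm uses.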
Fooling Set Technique
Partitioned document: a document split into a document prefix and a document suffix.
Definition
A set T of partitioned documents is a fooling set for Q if:
1. All documents in T match Q.
2. For any two distinct documents (α, β) and (α', β') in T, either (α, β') does not match Q or (α', β) does not match Q.
Theorem: For any fooling set T, CC(Q) = Ω(log |T|).
Proof of Lemma 1
Lemma 1: For all queries Q in the fragment and any critical document D for Q,
stream-space(Q) = Ω(FS(D)).
[Figure: Query Q (root, conference with children name = PODS and speaker; speaker with children name and paper_cnt > 1) and a critical Document D: root (x0), conference (x1), name (x2, PODS), speaker (x3), name (x4, Fagin), paper_cnt (x5, 3).]
Proof of Lemma 1
For each subset S of Frontier(D), define a partitioned document D_S.
Example: S = { x2, x5 }
[Figure: Query Q and the partitioned Document D_S, built from the critical document D (nodes x0 to x5, with values PODS, Fagin and 3).]
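Proof of Lemma 1 (cont)
Claim: { D_S : S is a subset of Frontier(D) } is a fooling set.
Hence stream-space(Q) >= log(2^FS(D)) = FS(D).
Proof of Claim:
1. For all S, D_S matches Q.
2. If S ≠ T, need: either D_ST or D_TS does not match Q, where D_ST and D_TS are the two documents obtained by crossing the prefix of one of D_S, D_T with the suffix of the other.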
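The bound on this slide follows from chaining the earlier results. Written out, with T standing for the fooling set of documents D_S and assuming logarithms to base 2:

```latex
\[
\text{stream-space}(Q) \;\ge\; CC(Q) \;=\; \Omega(\log_2 |T|),
\qquad
|T| = 2^{FS(D)}
\;\Longrightarrow\;
\text{stream-space}(Q) = \Omega\bigl(FS(D)\bigr).
\]
```

The count |T| = 2^{FS(D)} holds because there is one partitioned document D_S for each subset S of Frontier(D).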
Proof of Claim (example)
T = { x4, x5 }, S = { x2, x5 }
[Figure: Document D_T, Document D_S, and the crossed Document D_TS. D_TS contains root (x0), conference (x1), speaker (x3) with name (x4, Fagin) and paper_cnt (x5, 3), and a second copy of name (x4, Fagin), but no node x2.]
Conference name missing! D_TS does not match Q.
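The claim can be checked exhaustively for the example. The sketch below abstracts each document to the set of frontier nodes it contains, and assumes that the prefix of D_S carries the nodes of S while the suffix carries the remaining frontier nodes (the slides do not spell this out, so treat it as one plausible reading). Since D is critical, a crossed document matches Q only if no frontier node is missing.

```python
from itertools import combinations
from math import log2

FRONTIER = frozenset({"x2", "x4", "x5"})

def subsets(items):
    """All subsets of `items`, as frozensets."""
    items = sorted(items)
    return [frozenset(c) for r in range(len(items) + 1) for c in combinations(items, r)]

def crossed_matches(s, t):
    """True if the prefix of D_s joined with the suffix of D_t still matches Q."""
    present = s | (FRONTIER - t)   # prefix contributes s, suffix contributes the rest of t's frontier
    return present == FRONTIER

def is_fooling_set(family):
    """Check both conditions of the fooling-set definition."""
    if not all(crossed_matches(s, s) for s in family):
        return False
    return all(not crossed_matches(s, t) or not crossed_matches(t, s)
               for s, t in combinations(family, 2))

family = subsets(FRONTIER)
print(is_fooling_set(family), len(family), log2(len(family)))

# The slide's example: T = {x4, x5}, S = {x2, x5}; the crossed document lacks x2.
T, S = frozenset({"x4", "x5"}), frozenset({"x2", "x5"})
print(crossed_matches(T, S))
```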
Algorithm
• Uses the query as an NFA
• Based on three global data structures:
  • Pointer array
  • Validation array
  • Level array
Matches the lower bounds for a fragment of XPath.
Algorithm Example Run
Query: /a[b and c]   (query nodes: u0 = $, u1 = /a, u2 = /b, u3 = /c)
Input XML: <a> <c>c1</c> <b>b1</b> </a> ...
[Figure: seven build-up slides showing the pointer array (Index 0, Index 1), the validation array and the level array after each input event. Each entry below is (query node, validation flag, level).]
Event    Entries after the event
$        (a, F, 1)
<a>      (b, F, 2) (c, F, 2)
<c>      (b, F, 2) (c, F, 2)
</c>     (b, F, 2) (c, T, 2)
<b>      (b, F, 2) (c, T, 2)
</b>     (b, T, 2) (c, T, 2)
</a>     (a, T, 1)
Return TRUE
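The example run can be approximated with a much simpler evaluator that keeps one validation flag per predicate child and flips it when that child closes. This is a hedged sketch for queries of the shape /a[b and c] only; it does not reproduce the pointer, validation and level arrays of the actual algorithm, and the function `run` and its parameters are invented for the illustration.

```python
import io
import xml.etree.ElementTree as ET

def run(xml_bytes, parent="a", predicates=("b", "c")):
    """Stream the document; return (matched, trace) for /parent[p1 and p2 ...]."""
    flags = {}        # validation flag per predicate child of the open <parent>
    level = 0         # current nesting level (1 = top-level element)
    inside = False    # True while inside a top-level <parent> element
    matched = False
    trace = []
    for event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end")):
        if event == "start":
            level += 1
            if level == 1 and elem.tag == parent:
                inside = True
                flags = {p: False for p in predicates}
            label = "<%s>" % elem.tag
        else:
            if inside and level == 2 and elem.tag in flags:
                flags[elem.tag] = True      # predicate child closed: mark it validated
            if inside and level == 1:
                matched = matched or all(flags.values())
                inside = False
            level -= 1
            label = "</%s>" % elem.tag
        trace.append((label, dict(flags)))
    return matched, trace

matched, trace = run(b"<a><c>c1</c><b>b1</b></a>")
for label, flags in trace:
    print(label, flags)
print("Return", matched)
```

The flags change at the same events as in the run above: c is validated at </c>, b at </b>, and the result is reported when </a> arrives.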
Conclusion: Our Contributions
• Space lower bounds on the instance data complexity of XPath on XML streams:
  1. In terms of Query Frontier Size
  2. In terms of Document Recursion Depth
  3. In terms of Document Depth
• A streaming XML algorithm
  • Matches the lower bounds on a fragment of the language
  • Does not use finite-state automata
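XPath 1.0
Query Q: /conference/name   (query nodes: u0 = $, u1 = /C, u2 = /N)
[Figure: document D, the conference tree with nodes x0 to x8, with tags abbreviated C (conference), N (name), S (speaker), P (paper_cnt) and values PODS, Josifovski, 1, Fagin, 3.]
Result: { x2 }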
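XPath 1.0
Query Q: /conference//name   (query nodes: u0 = $, u1 = /C, u2 = //N)
[Figure: document D, the conference tree with nodes x0 to x8, with tags abbreviated C (conference), N (name), S (speaker), P (paper_cnt) and values PODS, Josifovski, 1, Fagin, 3.]
Result: { x2, x4, x7 }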
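The difference between the child axis (/) and the descendant axis (//) in these two queries can be reproduced with Python's standard ElementTree, which supports both path forms. The sketch labels elements x1, x2, ... in document order to line up with the slides; the compact serialization of the document is an assumption.

```python
import xml.etree.ElementTree as ET

DOC = ("<conference><name>PODS</name>"
       "<speaker><name>Josifovski</name><paper_cnt>1</paper_cnt></speaker>"
       "<speaker><name>Fagin</name><paper_cnt>3</paper_cnt></speaker>"
       "</conference>")

conference = ET.fromstring(DOC)

# Label elements x1, x2, ... in document order (x0 is the root above <conference>).
ids = {elem: "x%d" % i for i, elem in enumerate(conference.iter(), start=1)}

child_axis = [ids[e] for e in conference.findall("name")]          # /conference/name
descendant_axis = [ids[e] for e in conference.findall(".//name")]  # /conference//name

print(child_axis)
print(descendant_axis)
```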
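Reduction
A: S-space streaming algorithm for Q.
r ≥ 1: integer.
[Figure: document D is split into consecutive segments that Alice and Bob process alternately. After each segment the current state of A is passed to the other party, giving the state sequence s0, s1, s2, s3, s4, s5, s6 (shown for r = 6); the final output is Q(D).]
Theorem: S ≥ CC(Qr) / r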