On the Memory Requirements of XPath Evaluation over XML Streams
![Page 1: On the Memory Requirements of XPath Evaluation over XML Streams](https://reader033.fdocuments.us/reader033/viewer/2022051418/56815068550346895dbe669e/html5/thumbnails/1.jpg)
On the Memory Requirements of XPath
Evaluation over XML Streams
Ziv Bar-Yossef, Marcus Fontoura, Vanja Josifovski
IBM Almaden Research Center
![Page 2: On the Memory Requirements of XPath Evaluation over XML Streams](https://reader033.fdocuments.us/reader033/viewer/2022051418/56815068550346895dbe669e/html5/thumbnails/2.jpg)
Preliminaries: XML
<conference> <name> PODS </name>
<speaker> <name> Josifovski </name> <paper_cnt> 1 </paper_cnt> </speaker>
<speaker> <name> Fagin </name> <paper_cnt> 3 </paper_cnt> </speaker></conference>
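[Figure: the document as a tree. root (x0) → conference (x1); conference → name (x2, "PODS"), speaker (x3), speaker (x6); speaker x3 → name (x4, "Josifovski"), paper_cnt (x5, "1"); speaker x6 → name (x7, "Fagin"), paper_cnt (x8, "3").]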
![Page 3: On the Memory Requirements of XPath Evaluation over XML Streams](https://reader033.fdocuments.us/reader033/viewer/2022051418/56815068550346895dbe669e/html5/thumbnails/3.jpg)
Preliminaries: XPath 1.0
/conference[name = PODS]/speaker[paper_cnt > 1]/name
[Figure: the query tree (root → conference → name = PODS, speaker; speaker → name, paper_cnt > 1) shown next to the document tree with nodes x0 to x8.]
Result: { x7 }
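The query on this slide can be reproduced on the sample document in a few lines of Python. This is only an illustrative, non-streaming sketch using the standard library's `xml.etree.ElementTree`. The helper name `evaluate` is hypothetical, and the predicates are checked by hand because ElementTree's XPath subset has no numeric comparisons.

```python
import xml.etree.ElementTree as ET

DOC = """
<conference> <name> PODS </name>
  <speaker> <name> Josifovski </name> <paper_cnt> 1 </paper_cnt> </speaker>
  <speaker> <name> Fagin </name> <paper_cnt> 3 </paper_cnt> </speaker>
</conference>
"""

def evaluate(xml_text):
    """Names selected by /conference[name = PODS]/speaker[paper_cnt > 1]/name."""
    conference = ET.fromstring(xml_text)  # the root element is <conference>
    if conference.tag != "conference":
        return []
    # Predicate [name = PODS] on the conference node.
    if not any(n.text.strip() == "PODS" for n in conference.findall("name")):
        return []
    results = []
    for speaker in conference.findall("speaker"):
        # Predicate [paper_cnt > 1] on each speaker node.
        if any(float(p.text) > 1 for p in speaker.findall("paper_cnt")):
            results.extend(n.text.strip() for n in speaker.findall("name"))
    return results

print(evaluate(DOC))  # ['Fagin']
```

The single selected name is the node labelled x7 in the figure.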
![Page 4: On the Memory Requirements of XPath Evaluation over XML Streams](https://reader033.fdocuments.us/reader033/viewer/2022051418/56815068550346895dbe669e/html5/thumbnails/4.jpg)
XML Streams
XML stream: XML document arriving as a one-way stream
Critical resources:
• Memory
• Processing time
Why XML streams?
• For transferring XML between systems
• For efficient access to large XML documents
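To make the one-way access model concrete, the sketch below reads a document as a sequence of start and end events using `xml.etree.ElementTree.iterparse` from the Python standard library. The function name `stream_events` is an assumption made for illustration. The point is only that the consumer sees each event once, in document order.

```python
import io
import xml.etree.ElementTree as ET

def stream_events(xml_bytes):
    """Yield (event, tag) pairs in document order, in one pass over the input."""
    for event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end")):
        yield event, elem.tag
        if event == "end":
            elem.clear()  # drop the element's content once it is closed

events = list(stream_events(b"<a><c>c1</c><b>b1</b></a>"))
print(events)
# [('start', 'a'), ('start', 'c'), ('end', 'c'), ('start', 'b'), ('end', 'b'), ('end', 'a')]
```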
![Page 5: On the Memory Requirements of XPath Evaluation over XML Streams](https://reader033.fdocuments.us/reader033/viewer/2022051418/56815068550346895dbe669e/html5/thumbnails/5.jpg)
Streaming XML Algorithms
• XFilter and YFilter [Altinel and Franklin 00] [Diao et al 02]
• X-scan [Ives, Levy, and Weld 00]
• XMLTK [Avila-Campillo et al 02]
• XTrie [Chan et al 02]
• SPEX [Olteanu, Kiesling, and Bry 03]
• Lazy DFAs [Green et al 03]
• The XPush Machine [Gupta and Suciu 03]
• XSQ [Peng and Chawathe 03]
• TurboXPath [Josifovski, Fontoura, and Barta 04]
• …
![Page 6: On the Memory Requirements of XPath Evaluation over XML Streams](https://reader033.fdocuments.us/reader033/viewer/2022051418/56815068550346895dbe669e/html5/thumbnails/6.jpg)
Our Results
• Space lower bounds for evaluating XPath on XML streams
• A streaming XML algorithm that:
  • Matches the lower bounds on a large fragment of the language
  • Uses space sub-linear in the query size rather than exponential in the query size
![Page 7: On the Memory Requirements of XPath Evaluation over XML Streams](https://reader033.fdocuments.us/reader033/viewer/2022051418/56815068550346895dbe669e/html5/thumbnails/7.jpg)
Related Work
• Space complexity of XPath evaluation over non-streaming XML documents [Gottlob, Koch, Pichler 03], [Segoufin 03]
• Space complexity of XPath evaluation over streams of indexed XML data [Choi, Mahoui, Wood 03]
• Space complexity of select-project-join queries over relational data streams [Arasu et al 02]
![Page 8: On the Memory Requirements of XPath Evaluation over XML Streams](https://reader033.fdocuments.us/reader033/viewer/2022051418/56815068550346895dbe669e/html5/thumbnails/8.jpg)
Data Complexity [Vardi 82]
(Q,D): evaluation function that takes both the query Q and the document D as input.
Q(D): evaluation function of a fixed query Q, taking only the document D as input.
Data complexity of Q: complexity of the best algorithm for Q on the worst-case D.
Worst-case data complexity: max over Q of the data complexity of Q.
We characterize the data complexity of Q separately for each Q (not just the worst-case one).
![Page 9: On the Memory Requirements of XPath Evaluation over XML Streams](https://reader033.fdocuments.us/reader033/viewer/2022051418/56815068550346895dbe669e/html5/thumbnails/9.jpg)
XPath Fragment
1. Queries are subsumption-free
[Figure: two query trees. Not subsumption-free: root → conference with children name = PODS and name != SIGMOD. Subsumption-free: root → conference → name != SIGMOD.]
![Page 10: On the Memory Requirements of XPath Evaluation over XML Streams](https://reader033.fdocuments.us/reader033/viewer/2022051418/56815068550346895dbe669e/html5/thumbnails/10.jpg)
XPath Fragment (cont.)
2. Queries are univariate
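[Figure: two query trees. Not univariate: root → conference with a predicate comparing its children paper_cnt and author_cnt (paper_cnt < author_cnt). Univariate: root → conference with children paper_cnt < 30 and author_cnt > 30.]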
![Page 11: On the Memory Requirements of XPath Evaluation over XML Streams](https://reader033.fdocuments.us/reader033/viewer/2022051418/56815068550346895dbe669e/html5/thumbnails/11.jpg)
XPath Fragment (cont.)
3. Queries consist of conjunctions only
4. Queries are “star-restricted”
![Page 12: On the Memory Requirements of XPath Evaluation over XML Streams](https://reader033.fdocuments.us/reader033/viewer/2022051418/56815068550346895dbe669e/html5/thumbnails/12.jpg)
Query Frontier Size
Definitions:
1. Frontier at u: u, its siblings, and the siblings of its ancestors.
2. FrontierSize(Q): size of the largest frontier.
Theorem 1: For all queries Q in the fragment,
stream-space(Q) = Ω(FrontierSize(Q)).
[Figure: query tree root → conference → name = PODS, speaker; speaker → name, paper_cnt > 1.]
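The two definitions can be checked on the example query with a short Python sketch. The nested-tuple encoding of the query tree and the function name `frontier_size` are assumptions made for illustration. The code applies the definition literally as worded on the slide, which may differ in details (for instance, how the root is counted) from the formal definition in the paper.

```python
# A query tree as (label, [children]); labels are for readability only.
QUERY = ("root", [
    ("conference", [
        ("name = PODS", []),
        ("speaker", [("name", []), ("paper_cnt > 1", [])]),
    ]),
])

def frontier_size(query):
    """Size of the largest frontier over all nodes of the query tree."""
    best = 0

    def visit(node, size_here):
        nonlocal best
        best = max(best, size_here)
        _, children = node
        # A child's frontier holds the child and its siblings (len(children) nodes)
        # plus everything in the parent's frontier except the parent itself.
        for child in children:
            visit(child, len(children) + size_here - 1)

    visit(query, 1)  # the frontier at the root is the root alone
    return best

print(frontier_size(QUERY))  # 3
```

The largest frontier is reached at either child of speaker: that child, its sibling, and the sibling of speaker (the conference name).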
![Page 13: On the Memory Requirements of XPath Evaluation over XML Streams](https://reader033.fdocuments.us/reader033/viewer/2022051418/56815068550346895dbe669e/html5/thumbnails/13.jpg)
Document Recursion Depth
Definition:
recDepth_Q(D): maximum number of nodes in D that lie on one root-to-leaf path and “path match” the same node in Q.
Theorem 2: For all queries Q in the fragment that have at least one “//” node,
stream-space(Q) = Ω(recDepth_Q(D)).
[Figure: query Q (root → //part → name, number) and document D with nodes x0 to x7, in which part elements are nested inside other part elements; text values include Refrigerator, Compressor, 12, and 456.]
![Page 14: On the Memory Requirements of XPath Evaluation over XML Streams](https://reader033.fdocuments.us/reader033/viewer/2022051418/56815068550346895dbe669e/html5/thumbnails/14.jpg)
Document Depth
Definition:
depth(D): length of the longest root-to-leaf path.
Theorem 3: For all queries Q in the fragment that have at least one “/” node,
stream-space(Q) = Ω(log depth(D)).
[Figure: document D with nodes x0 to x7, in which part elements are nested inside other part elements.]
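The log depth(D) term has a simple intuition: a streaming algorithm that needs to know the current level must at least keep a level counter, and a counter that can reach depth(D) takes about log2(depth(D)) bits. The sketch below tracks such a counter over start and end events. The name `stream_depth` and the sample document are assumptions, and the code counts element nesting levels only, which may differ by a constant from the depth convention used on the slides.

```python
import io
import xml.etree.ElementTree as ET

def stream_depth(xml_bytes):
    """Maximum element nesting level, computed in one pass with a single counter."""
    level = deepest = 0
    for event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end")):
        if event == "start":
            level += 1
            deepest = max(deepest, level)
        else:
            level -= 1
            elem.clear()
    return deepest

doc = (b"<part><name>Refrigerator</name>"
       b"<part><name>Compressor</name><number>12</number></part></part>")
depth = stream_depth(doc)
print(depth, depth.bit_length())  # 3 2
```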
![Page 15: On the Memory Requirements of XPath Evaluation over XML Streams](https://reader033.fdocuments.us/reader033/viewer/2022051418/56815068550346895dbe669e/html5/thumbnails/15.jpg)
New algorithm
Theorem 4(a):
For all queries Q in “Univariate XPath”:
Space: O(|Q| · recDepth(D) · log depth(D)).
Time: O(|D| · |Q| · recDepth(D)).
Theorem 4(b):
For all queries Q in a subset of our fragment and for non-recursive documents D:
Space: O(FrontierSize(Q) · log depth(D)).
Time: O(|D| · FrontierSize(Q)).
![Page 16: On the Memory Requirements of XPath Evaluation over XML Streams](https://reader033.fdocuments.us/reader033/viewer/2022051418/56815068550346895dbe669e/html5/thumbnails/16.jpg)
Proof of Theorem 1
Fragment:
• “subsumption-free”
• “univariate”
• Conjunctions only
• “star-restricted”
Theorem 1: For all queries Q in the fragment,
stream-space(Q) = Ω(FrontierSize(Q)).
[Figure: query tree root → conference → name = PODS, speaker; speaker → name, paper_cnt > 1.]
![Page 17: On the Memory Requirements of XPath Evaluation over XML Streams](https://reader033.fdocuments.us/reader033/viewer/2022051418/56815068550346895dbe669e/html5/thumbnails/17.jpg)
Critical Document
Definition: Document D is critical for query Q if:
(1) D matches Q.
(2) If we remove any node from D, it no longer matches Q.
[Figure: query Q (root → conference → name = PODS, speaker; speaker → name, paper_cnt > 1) next to document D, the conference document with nodes x0 to x8.]
![Page 18: On the Memory Requirements of XPath Evaluation over XML Streams](https://reader033.fdocuments.us/reader033/viewer/2022051418/56815068550346895dbe669e/html5/thumbnails/18.jpg)
Main Lemmas
Lemma 1: For all queries Q in the fragment and any critical document D for Q,
stream-space(Q) = Ω(FrontierSize(D)).
Lemma 2: For all queries Q in the fragment, there is a critical document D such that
FrontierSize(D) = FrontierSize(Q).
Theorem 1: For all queries Q in the fragment,
stream-space(Q) = Ω(FrontierSize(Q)).
![Page 19: On the Memory Requirements of XPath Evaluation over XML Streams](https://reader033.fdocuments.us/reader033/viewer/2022051418/56815068550346895dbe669e/html5/thumbnails/19.jpg)
One-way Communication Complexity
[Figure: Alice holds x and Bob holds y. Alice sends a single message m to Bob, who outputs f(x, y).]
f: X × Y → Z
CC(f) = number of communication bits used by the best protocol on the worst-case choice of inputs.
![Page 20: On the Memory Requirements of XPath Evaluation over XML Streams](https://reader033.fdocuments.us/reader033/viewer/2022051418/56815068550346895dbe669e/html5/thumbnails/20.jpg)
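Reduction
A: streaming algorithm for Q using space S.
[Figure: the document D is split between Alice, who holds a prefix, and Bob, who holds the remaining suffix. Alice runs A on her part and sends the resulting state of A to Bob, who resumes A from that state on his part and outputs Q(D).]
Theorem: stream-space(Q) ≥ CC(Q)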
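The reduction can be simulated directly: Alice runs a streaming algorithm on her prefix, ships its state, and Bob resumes from that state. The sketch below does this for a toy matcher for the query /a[b and c]. The class `TwoFlagMatcher`, the function `one_way_protocol`, and the use of `pickle` as the message encoding are all assumptions for illustration. The pickled size is only a loose upper bound on the number of state bits, not a tight measure.

```python
import pickle

class TwoFlagMatcher:
    """Toy streaming algorithm for /a[b and c]; its attributes are its whole state."""

    def __init__(self):
        self.level = 0
        self.in_a = False
        self.seen = set()
        self.result = False

    def feed(self, events):
        for kind, tag in events:
            if kind == "start":
                self.level += 1
                if self.level == 1:
                    self.in_a = (tag == "a")
                elif self.level == 2 and self.in_a and tag in ("b", "c"):
                    self.seen.add(tag)
            else:
                if self.level == 1:
                    if self.in_a and self.seen == {"b", "c"}:
                        self.result = True
                    self.seen = set()
                self.level -= 1
        return self

def one_way_protocol(prefix, suffix):
    alice = TwoFlagMatcher().feed(prefix)
    message = pickle.dumps(alice.__dict__)      # Alice -> Bob: the state of A
    bob = TwoFlagMatcher()
    bob.__dict__.update(pickle.loads(message))  # Bob resumes A from that state
    return bob.feed(suffix).result, len(message)

events = [("start", "a"), ("start", "c"), ("end", "c"),
          ("start", "b"), ("end", "b"), ("end", "a")]
matched, message_bytes = one_way_protocol(events[:3], events[3:])
print(matched)  # True
```

Because any streaming algorithm yields such a protocol, a lower bound on the communication needed translates into a lower bound on the state the algorithm must keep.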
![Page 21: On the Memory Requirements of XPath Evaluation over XML Streams](https://reader033.fdocuments.us/reader033/viewer/2022051418/56815068550346895dbe669e/html5/thumbnails/21.jpg)
Fooling Set Technique
Partitioned document D(α, β): a document prefix α followed by a document suffix β.
Definition: A set T of partitioned documents is a fooling set for Q if:
1. All documents in T match Q.
2. For any two distinct documents D(α, β) and D(α′, β′) in T, either D(α, β′) does not match Q or D(α′, β) does not match Q.
Theorem: For any fooling set T, CC(Q) = Ω(log |T|).
![Page 22: On the Memory Requirements of XPath Evaluation over XML Streams](https://reader033.fdocuments.us/reader033/viewer/2022051418/56815068550346895dbe669e/html5/thumbnails/22.jpg)
Proof of Lemma 1
Lemma 1: For all queries Q in the fragment and any critical document D for Q,
stream-space(Q) = Ω(FS(D)).
[Figure: query Q (root → conference → name = PODS, speaker; speaker → name, paper_cnt > 1) and a critical document D: root (x0) → conference (x1) → name (x2, "PODS"), speaker (x3); speaker → name (x4, "Fagin"), paper_cnt (x5, "3").]
![Page 23: On the Memory Requirements of XPath Evaluation over XML Streams](https://reader033.fdocuments.us/reader033/viewer/2022051418/56815068550346895dbe669e/html5/thumbnails/23.jpg)
Proof of Lemma 1
For each subset S of Frontier(D), define a partitioned document D_S.
Example: S = { x2, x5 }
[Figure: query Q and the partitioned document D_S built from the critical document D (nodes x0 to x5).]
![Page 24: On the Memory Requirements of XPath Evaluation over XML Streams](https://reader033.fdocuments.us/reader033/viewer/2022051418/56815068550346895dbe669e/html5/thumbnails/24.jpg)
Proof of Lemma 1 (cont)
Claim: { D_S : S ⊆ Frontier(D) } is a fooling set.
Consequently, stream-space(Q) ≥ log(2^FS(D)) = FS(D).
Proof of Claim:
1. For all S, D_S matches Q.
2. If S ≠ T, need: either D_ST or D_TS does not match Q, where D_ST denotes the prefix of D_S followed by the suffix of D_T.
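The counting behind the claim can be checked on a deliberately abstract model. The sketch below assumes, purely for illustration, that a partitioned document is described by which frontier nodes its prefix and its suffix contain, that D_S places the nodes of S in the prefix and the remaining frontier nodes in the suffix, and that a document matches Q exactly when every frontier node is present. The real construction works on document structure. The names `FRONTIER`, `matches`, and `is_fooling_set` are hypothetical.

```python
from itertools import combinations
from math import log2

FRONTIER = frozenset({"x2", "x4", "x5"})

def subsets(items):
    items = sorted(items)
    return [frozenset(c) for k in range(len(items) + 1)
            for c in combinations(items, k)]

def matches(prefix_nodes, suffix_nodes):
    # Abstraction: the document matches Q iff every frontier node appears.
    return prefix_nodes | suffix_nodes == FRONTIER

# D_S: nodes of S in the prefix, the remaining frontier nodes in the suffix.
docs = {S: (S, FRONTIER - S) for S in subsets(FRONTIER)}

def is_fooling_set(docs):
    if not all(matches(p, s) for p, s in docs.values()):
        return False  # condition 1: every D_S must match Q
    for S, T in combinations(docs, 2):
        d_st = matches(docs[S][0], docs[T][1])  # prefix of D_S, suffix of D_T
        d_ts = matches(docs[T][0], docs[S][1])  # prefix of D_T, suffix of D_S
        if d_st and d_ts:
            return False  # condition 2 violated
    return True

print(is_fooling_set(docs), log2(len(docs)))  # True 3.0
```

In this model D_ST matches only when T ⊆ S and D_TS only when S ⊆ T, so both match only if S = T. The 2^3 documents therefore give a bound of 3, the frontier size.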
![Page 25: On the Memory Requirements of XPath Evaluation over XML Streams](https://reader033.fdocuments.us/reader033/viewer/2022051418/56815068550346895dbe669e/html5/thumbnails/25.jpg)
Proof of Claim (example)
[Figure: document D_S for S = { x2, x5 }, document D_T for T = { x4, x5 }, and the crossed document D_TS. D_TS is missing the conference name node, so it does not match Q.]
![Page 26: On the Memory Requirements of XPath Evaluation over XML Streams](https://reader033.fdocuments.us/reader033/viewer/2022051418/56815068550346895dbe669e/html5/thumbnails/26.jpg)
Algorithm
• Uses the query as an NFA
• Based on three global data structures:
  • Pointer array
  • Validation array
  • Level array
• Matches the lower bounds for a fragment of XPath.
![Page 27: On the Memory Requirements of XPath Evaluation over XML Streams](https://reader033.fdocuments.us/reader033/viewer/2022051418/56815068550346895dbe669e/html5/thumbnails/27.jpg)
Algorithm Example Run
Query: /a[b and c] (query nodes: $ = u0, /a = u1, /b = u2, /c = u3)
Input XML: <a> <c>c1</c> <b>b1</b></a>...
[Figure: pointer array with one entry, pointing to the entry for a: validation array value F, level array value 1.]
![Page 28: On the Memory Requirements of XPath Evaluation over XML Streams](https://reader033.fdocuments.us/reader033/viewer/2022051418/56815068550346895dbe669e/html5/thumbnails/28.jpg)
Algorithm Example Run
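Query: /a[b and c] (query nodes: $ = u0, /a = u1, /b = u2, /c = u3)
Input XML: <a> <c>c1</c> <b>b1</b></a>...
[Figure: array contents as (node, validation flag, level), one row per column label in the figure.
$ (Index 0): (a, F, 1)
a (Index 1): (b, F, 2), (c, F, 2)]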
![Page 29: On the Memory Requirements of XPath Evaluation over XML Streams](https://reader033.fdocuments.us/reader033/viewer/2022051418/56815068550346895dbe669e/html5/thumbnails/29.jpg)
Algorithm Example Run
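Query: /a[b and c] (query nodes: $ = u0, /a = u1, /b = u2, /c = u3)
Input XML: <a> <c>c1</c> <b>b1</b></a>...
[Figure: array contents as (node, validation flag, level), one row per column label in the figure.
$ (Index 0): (a, F, 1)
a (Index 1): (b, F, 2), (c, F, 2)
c: (b, F, 2), (c, F, 2)]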
![Page 30: On the Memory Requirements of XPath Evaluation over XML Streams](https://reader033.fdocuments.us/reader033/viewer/2022051418/56815068550346895dbe669e/html5/thumbnails/30.jpg)
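Algorithm Example Run
Query: /a[b and c] (query nodes: $ = u0, /a = u1, /b = u2, /c = u3)
Input XML: <a> <c>c1</c> <b>b1</b></a>...
[Figure: array contents as (node, validation flag, level), one row per column label in the figure.
$ (Index 0): (a, F, 1)
a (Index 1): (b, F, 2), (c, F, 2)
c: (b, F, 2), (c, F, 2)
/c: (b, F, 2), (c, T, 2)]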
![Page 31: On the Memory Requirements of XPath Evaluation over XML Streams](https://reader033.fdocuments.us/reader033/viewer/2022051418/56815068550346895dbe669e/html5/thumbnails/31.jpg)
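Algorithm Example Run
Query: /a[b and c] (query nodes: $ = u0, /a = u1, /b = u2, /c = u3)
Input XML: <a> <c>c1</c> <b>b1</b></a>...
[Figure: array contents as (node, validation flag, level), one row per column label in the figure.
$ (Index 0): (a, F, 1)
a (Index 1): (b, F, 2), (c, F, 2)
c: (b, F, 2), (c, F, 2)
/c: (b, F, 2), (c, T, 2)
b: (b, F, 2), (c, T, 2)]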
![Page 32: On the Memory Requirements of XPath Evaluation over XML Streams](https://reader033.fdocuments.us/reader033/viewer/2022051418/56815068550346895dbe669e/html5/thumbnails/32.jpg)
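Algorithm Example Run
Query: /a[b and c] (query nodes: $ = u0, /a = u1, /b = u2, /c = u3)
Input XML: <a> <c>c1</c> <b>b1</b></a>...
[Figure: array contents as (node, validation flag, level), one row per column label in the figure.
$ (Index 0): (a, F, 1)
a (Index 1): (b, F, 2), (c, F, 2)
c: (b, F, 2), (c, F, 2)
/c: (b, F, 2), (c, T, 2)
b: (b, F, 2), (c, T, 2)
/b: (b, T, 2), (c, T, 2)]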
![Page 33: On the Memory Requirements of XPath Evaluation over XML Streams](https://reader033.fdocuments.us/reader033/viewer/2022051418/56815068550346895dbe669e/html5/thumbnails/33.jpg)
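Algorithm Example Run
Query: /a[b and c] (query nodes: $ = u0, /a = u1, /b = u2, /c = u3)
Input XML: <a> <c>c1</c> <b>b1</b></a>...
[Figure: array contents as (node, validation flag, level), one row per column label in the figure.
$: (a, F, 1)
a: (b, F, 2), (c, F, 2)
c: (b, F, 2), (c, F, 2)
/c: (b, F, 2), (c, T, 2)
b: (b, F, 2), (c, T, 2)
/b: (b, T, 2), (c, T, 2)
/a: (a, T, 1), Return TRUE]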
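The flavour of this run can be reproduced with a much simpler streaming matcher. The sketch below is not the pointer, validation, and level array algorithm from the slides. It is a simplified stand-in, under stated assumptions: child-axis-only conjunctive queries without value predicates, encoded as hypothetical `(name, [children])` tuples. It keeps one frame per open element, so it does not attain the space bounds of Theorem 4. It does show validation flags flipping from F to T as end tags arrive.

```python
import io
import xml.etree.ElementTree as ET

# Query /a[b and c] as a tree; "$" is the virtual root.
QUERY = ("$", [("a", [("b", []), ("c", [])])])

def matches(xml_bytes, query=QUERY):
    # Each frame maps a candidate query node (by id) to the node itself and the
    # set of its children already validated under the currently open element.
    stack = [{id(query): (query, set())}]
    for event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end")):
        if event == "start":
            frame = {}
            for qnode, _ in stack[-1].values():
                for child in qnode[1]:
                    if child[0] == elem.tag:
                        frame[id(child)] = (child, set())  # validation flag starts as F
            stack.append(frame)
        else:
            frame = stack.pop()
            # A candidate is validated (flag T) once all its children are validated.
            validated = {cid for cid, (q, done) in frame.items()
                         if len(done) == len(q[1])}
            for qnode, done in stack[-1].values():
                done.update(id(c) for c in qnode[1] if id(c) in validated)
            elem.clear()
    root, done = stack[0][id(query)]
    return len(done) == len(root[1])

print(matches(b"<a><c>c1</c><b>b1</b></a>"))  # True
print(matches(b"<a><c>c1</c></a>"))           # False
```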
![Page 34: On the Memory Requirements of XPath Evaluation over XML Streams](https://reader033.fdocuments.us/reader033/viewer/2022051418/56815068550346895dbe669e/html5/thumbnails/34.jpg)
Conclusion: our Contributions
Space lower bounds on the instance data complexity of XPath on XML streams:
1. In terms of Query Frontier Size
2. In terms of Document Recursion Depth
3. In terms of Document Depth
A streaming XML algorithm:
• Matches the lower bounds on a fragment of the language
• Does not use finite-state automata
![Page 35: On the Memory Requirements of XPath Evaluation over XML Streams](https://reader033.fdocuments.us/reader033/viewer/2022051418/56815068550346895dbe669e/html5/thumbnails/35.jpg)
XPath 1.0
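Query Q: /conference/name (query nodes: $ = u0, /C = u1, /N = u2)
[Figure: document D, the conference tree with abbreviated labels (C = conference, N = name, S = speaker, P = paper_cnt) and nodes x0 to x8.]
Result: { x2 }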
![Page 36: On the Memory Requirements of XPath Evaluation over XML Streams](https://reader033.fdocuments.us/reader033/viewer/2022051418/56815068550346895dbe669e/html5/thumbnails/36.jpg)
XPath 1.0
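Query Q: /conference//name (query nodes: $ = u0, /C = u1, //N = u2)
[Figure: document D, the conference tree with abbreviated labels (C = conference, N = name, S = speaker, P = paper_cnt) and nodes x0 to x8.]
Result: { x2, x4, x7 }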
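The difference between the child axis and the descendant axis on these two slides can be seen with the standard library's `xml.etree.ElementTree`, whose limited XPath support covers both forms. This is a non-streaming illustration only. The variable names are assumptions, and the paths are written relative to the `<conference>` root element.

```python
import xml.etree.ElementTree as ET

conference = ET.fromstring(
    "<conference><name>PODS</name>"
    "<speaker><name>Josifovski</name><paper_cnt>1</paper_cnt></speaker>"
    "<speaker><name>Fagin</name><paper_cnt>3</paper_cnt></speaker>"
    "</conference>"
)

child_axis = [n.text for n in conference.findall("name")]          # /conference/name
descendant_axis = [n.text for n in conference.findall(".//name")]  # /conference//name

print(child_axis)       # ['PODS']
print(descendant_axis)  # ['PODS', 'Josifovski', 'Fagin']
```

The three names returned by the descendant query correspond to nodes x2, x4, and x7.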
![Page 37: On the Memory Requirements of XPath Evaluation over XML Streams](https://reader033.fdocuments.us/reader033/viewer/2022051418/56815068550346895dbe669e/html5/thumbnails/37.jpg)
Reduction
A: S-space streaming algorithm for Q.
r ≥ 1: integer.
[Figure: example with r = 6. Alice and Bob exchange the states s1, s2, ..., s6 of A (starting from s0), and Q(D) is output at the end.]
Theorem: S ≥ CC(Qr) / r