Date posted: 19-Dec-2015
Querying Streaming XML Data
Layout of the presentation
Introduction
Common Problems faced
Solution proposed
Basic Building blocks of the solution
How to build up a solution to a given query
Features of the system
Streaming XML
XML is a standard for information exchange.
Some XML documents are available only in streaming format.
Streaming is like reading data from a tape drive.
Used for stock market data, news, and network statistics.
Predecessor systems were used to filter documents.
Structure of an XPath Query
Consists of a location path and an output expression (name).
The location path consists of a closure axis (//), a node test (book), and a predicate (year>2000).
e.g. //book[year>2000]/name
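The structure above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function name `split_query`, the regular expression, and the dictionary layout are hypothetical, and the last step is treated as the output expression, following the example on this slide.

```python
import re

# One step: an axis ('/' or '//'), a node test, and an optional [predicate].
STEP = re.compile(r"(//|/)([A-Za-z_][\w.-]*)(?:\[([^\]]*)\])?")

def split_query(query):
    """Split a simple XPath query into (location path steps, output expression)."""
    steps = []
    pos = 0
    while pos < len(query):
        m = STEP.match(query, pos)
        if m is None:
            raise ValueError(f"cannot parse step at offset {pos}: {query[pos:]!r}")
        axis, tag, predicate = m.groups()
        steps.append({"axis": axis, "tag": tag, "predicate": predicate})
        pos = m.end()
    # Assumption: the final step names the output, as in //book[year>2000]/name.
    *location_path, last = steps
    return location_path, last["tag"]

path, output = split_query("//book[year>2000]/name")
print(path, output)
```

The sketch ignores anything outside this simple shape, such as nested brackets inside a predicate.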
Features of our Approach
Efficient
Easy to understand design
Design of BPDT is tricky
Common Problems faced
<root>
  <pub>
    <book id="1">
      <price> 12.00 </price>
      <name> First </name>
      <author> A </author>
      <price type="discount"> 10.00 </price>
    </book>
    <book id="2">
      <price> 14.00 </price>
      <name> Second </name>
      <author> A </author>
      <author> B </author>
      <price type="discount"> 12.00 </price>
    </book>
    <year> 2002 </year>
  </pub>
</root>
Query: /pub[year=2002]/book[price<11]/author
Element satisfies the path
Failure??
Test passed. But year=2002?
Buffer both A & B
Failed price<11. Remove
Test passed. Output
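The buffering behaviour annotated above can be sketched in Python with the standard `xml.sax` module. This is a minimal, hand-coded illustration for this one query and not the presentation's transducer: the class name `BufferedQuery`, its fields, and the choice to decide each predicate at the closing tag are assumptions, and the `<root>` wrapper is ignored as on the slide.

```python
import xml.sax

DOC = """<root><pub>
<book id="1"><price> 12.00 </price><name> First </name><author> A </author>
<price type="discount"> 10.00 </price></book>
<book id="2"><price> 14.00 </price><name> Second </name><author> A </author>
<author> B </author><price type="discount"> 12.00 </price></book>
<year> 2002 </year></pub></root>"""

class BufferedQuery(xml.sax.ContentHandler):
    """Hand-coded evaluator for /pub[year=2002]/book[price<11]/author."""

    def __init__(self):
        super().__init__()
        self.path = []         # names of the currently open elements
        self.text = []         # text chunks of the current leaf element
        self.book_buffer = []  # authors whose book predicate is undecided
        self.pub_buffer = []   # authors whose pub predicate is undecided
        self.price_ok = False
        self.year_ok = False
        self.results = []

    def startElement(self, name, attrs):
        self.path.append(name)
        self.text = []
        if name == "pub":
            self.year_ok, self.pub_buffer = False, []
        elif name == "book":
            self.price_ok, self.book_buffer = False, []

    def characters(self, content):
        self.text.append(content)

    def endElement(self, name):
        value = "".join(self.text).strip()
        parent = self.path[-2] if len(self.path) > 1 else None
        if name == "author" and parent == "book":
            self.book_buffer.append(value)        # buffer: predicates unknown
        elif name == "price" and parent == "book":
            self.price_ok = self.price_ok or float(value) < 11
        elif name == "year" and parent == "pub":
            self.year_ok = self.year_ok or int(value) == 2002
        elif name == "book":
            if self.price_ok:                     # pass candidates upward
                self.pub_buffer.extend(self.book_buffer)
            self.book_buffer = []                 # otherwise they are dropped
        elif name == "pub":
            if self.year_ok:                      # output surviving candidates
                self.results.extend(self.pub_buffer)
            self.pub_buffer = []
        self.path.pop()
        self.text = []

handler = BufferedQuery()
xml.sax.parseString(DOC.encode("utf-8"), handler)
print(handler.results)
```

Only the author of the first book survives: the second book never satisfies price<11, so its buffered authors are discarded before the year element is seen.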
Problems caused by closure axis
1.  <root>
2.    <pub>
3.      <book>
4.        <name> X </name>
5.        <author> A </author>
6.      </book>
7.      <book>
8.        <name> Y </name>
9.        <pub>
10.         <book>
11.           <name> Z </name>
12.           <author> B </author>
13.         </book>
14.         <year> 1999 </year>
15.       </pub>
16.     </book>
17.     <year> 2002 </year>
18.   </pub>
19. </root>
Query: //pub[year=2002]//book[author]//name
pub       [year=2002]   book      [author]
Line 2    True          Line 7    False
Line 2    True          Line 10   True
Line 9    False         Line 10   True
Fails year=2002 (pub at line 9)
Passes year=2002 (pub at line 2)
Problems caused by closure axis
1.  <root>
2.    <pub>
3.      <book>
4.        <name> X </name>
5.        <author> A </author>
6.      </book>
7.      <book>
8.        <name> Y </name>
9.        <author> B </author>
10.       <pub>
11.         <book>
12.           <name> Z </name>
13.           <author> B </author>
14.         </book>
15.         <year> 1999 </year>
16.       </pub>
17.     </book>
18.     <year> 2002 </year>
19.   </pub>
20. </root>
Query: //pub[year=2002]//book[author]//name
Line numbers in this table refer to the document before the author was added:
pub       [year=2002]   book      [author]
Line 2    True          Line 7    False
Line 2    True          Line 10   True
Line 9    False         Line 10   True
Fails year=2002
Passes year=2002
Let's add an author to the book at line 7 (new line 9). Result?
Handling XML Stream
Input: well-formed XML stream.
Use SAX API to parse XML.
Events belong to:
Begin = {(a, attrs, d)}
End = {(/a, d)}
Text = {(a, text(), d)}
XML stream: {e1, e2, …, ei, …} | ei ∈ Begin ∪ End ∪ Text
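The event model above can be sketched with Python's standard `xml.sax` module. The class `EventCollector` and the helper `events_of` are hypothetical names, and the depth convention (outermost element at depth 1, text events carrying the depth of their enclosing element) is an assumption, since the slide does not state it.

```python
import xml.sax

class EventCollector(xml.sax.ContentHandler):
    """Record SAX callbacks as Begin, Text, and End tuples with a depth."""

    def __init__(self):
        super().__init__()
        self.events = []
        self.open_tags = []

    def startElement(self, name, attrs):
        self.open_tags.append(name)
        depth = len(self.open_tags)
        # Begin event: (a, attrs, d)
        self.events.append((name, dict(attrs.items()), depth))

    def characters(self, content):
        text = content.strip()
        if text:  # skip whitespace-only chunks between elements
            # Text event: (a, text, d)
            self.events.append((self.open_tags[-1], text, len(self.open_tags)))

    def endElement(self, name):
        # End event: (/a, d)
        self.events.append(("/" + name, len(self.open_tags)))
        self.open_tags.pop()

def events_of(xml_text):
    """Parse a well-formed XML string and return its event list."""
    collector = EventCollector()
    xml.sax.parseString(xml_text.encode("utf-8"), collector)
    return collector.events

for event in events_of("<pub><year>2002</year></pub>"):
    print(event)
```

A parser may deliver long text in several `characters` calls, so a production handler would join chunks before emitting a Text event; the sketch skips that for brevity.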
Grammar for XPath Queries
Q → N+ [/O]
N → [/ | //] tag [F]
F → [FO [OP constant]]
FO → @attribute | tag [@attribute] | text()
O → @attribute | text()
OP → > | ≥ | = | < | ≤ | ≠ | contains
XPath query of the form N1N2…Nn/O
Cannot handle reverse axes or positional functions.
Solution to Query
Query: /pub[year=2002]/book[price<11]/author
[Figure: PDA and PDT]
Basic PushDown Transducer (BPDT)
Similar to PushDown Automata
Actions defined on Transition Arcs
Finite set of states
A Start state
A set of final states
Set of input symbols
Set of Stack symbols
Book – Author: Buffer for future: Begin event of Author.
Book – Author: Remove from Buffer: End event of Book.
Book – Author: Output result if predicates true: Begin event of Author.
Building a BPDT
Query: /pub[year>2000]/book[author]/name/text()
Consider location step: /book[author]
Basic Building Blocks
XPath Expression: /tag[child]
Buffer operations needed:
Enqueue(x): Adds x to the end of the queue.
Clear(): Removes all items from the queue.
Flush(): Outputs all items in the queue in FIFO order.
Upload(): Moves all items to the end of the queue of a parent BPDT.
No Dequeue operation is needed.
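The four buffer operations listed above can be sketched as a small Python class. The class name `BpdtBuffer`, the `parent` link, and the `out` list passed to `flush` are hypothetical; the sketch also assumes that `flush` empties the queue after writing its items, which the slide does not state.

```python
from collections import deque

class BpdtBuffer:
    """FIFO buffer holding potential results for one BPDT."""

    def __init__(self, parent=None):
        self.items = deque()
        self.parent = parent  # buffer of the parent BPDT, if any

    def enqueue(self, x):
        """Add x to the end of the queue."""
        self.items.append(x)

    def clear(self):
        """Remove all items from the queue."""
        self.items.clear()

    def flush(self, out):
        """Append all items to `out` in FIFO order, then empty the queue."""
        out.extend(self.items)
        self.items.clear()

    def upload(self):
        """Move all items to the end of the parent BPDT's queue."""
        if self.parent is None:
            raise ValueError("upload() needs a parent buffer")
        self.parent.items.extend(self.items)
        self.items.clear()

parent = BpdtBuffer()
child = BpdtBuffer(parent)
child.enqueue("A")
child.enqueue("B")
child.upload()          # candidates move to the parent, order preserved
output = []
parent.flush(output)    # the parent's predicate turned out true
print(output)
```

Items only ever leave a queue all at once, through `clear`, `flush`, or `upload`, which is why no dequeue operation appears.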
Basic Building Blocks (continued)
XPath Expression: /tag[@attr=val]
XPath Expression: /tag[text()=val]
XPath Expression: /tag[child@attr=val]
XPath Expression: /tag[child=val]
A sample BPDT
Query: /pub[year>2000]
Building a solution
HPDT for Query: //pub[year>2000]//book[author]//name/text()
HPDT Structure
Each BPDT in the HPDT has:
Position: BPDT position (l, k), where l = depth of the BPDT in the HPDT and k = sequence number from right to left.
  BPDT position (i-1, k) has right child BPDT position (i, 2k), connected to the NA state.
  BPDT position (i-1, k) has left child BPDT position (i, 2k+1), connected to the TRUE state.
  BPDT position (i, 2^i - 1) means the predicates in higher-level BPDTs evaluate to true.
Buffer: potential results
Stack: stack of element (SAX) events
Depth vector
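The position numbering described above can be sketched in a few lines of Python. The function names are hypothetical, the root BPDT is assumed to sit at position (0, 0), and the leftmost position at level i is read as (i, 2^i - 1), which is what repeatedly applying k → 2k+1 from the root produces.

```python
def right_child(level, k):
    """Child attached to the NA state of the BPDT at (level, k)."""
    return (level + 1, 2 * k)

def left_child(level, k):
    """Child attached to the TRUE state of the BPDT at (level, k)."""
    return (level + 1, 2 * k + 1)

def all_higher_predicates_true(level, k):
    """True for the leftmost BPDT of a level, reached only through TRUE states."""
    return k == 2 ** level - 1

# Follow the TRUE branch three times from the assumed root position (0, 0).
position = (0, 0)
for _ in range(3):
    position = left_child(*position)
    print(position, all_higher_predicates_true(*position))
```

Any position reached through at least one NA branch has a smaller k than the leftmost one on its level, so the check fails for it.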
Example Query
1.  <root>
2.    <pub>
3.      <book>
4.        <name> X </name>
5.        <author> A </author>
6.      </book>
7.      <book>
8.        <name> Y </name>
9.        <pub>
10.         <book>
11.           <name> Z </name>
12.           <author> B </author>
13.         </book>
14.         <year> 1999 </year>
15.       </pub>
16.     </book>
17.     <year> 2002 </year>
18.   </pub>
19. </root>
Query: //pub[year=2002]//book[author]//name
root   pub   book   name
1      2     7      11
1      2     10     11
1      9     10     11
3 paths from $1 to $14
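The three rows above can be reproduced with a short Python sketch that enumerates every way the closure steps pub, book, name can be matched against the ancestors of the name element at line 11. The function `closure_matches` and the `(tag, line)` representation are hypothetical and only illustrate why a closure axis yields several candidate matches for one element; predicates are not evaluated here.

```python
def closure_matches(chain, tags):
    """Return every in-order embedding of `tags` into `chain` (descendant axis).

    `chain` is a list of (tag, line) pairs from the outermost element inward.
    Each result is the list of line numbers chosen for the query steps.
    """
    if not tags:
        return [[]]
    results = []
    for i, (tag, line) in enumerate(chain):
        if tag == tags[0]:
            # Later steps may match any element nested deeper than this one.
            for rest in closure_matches(chain[i + 1:], tags[1:]):
                results.append([line] + rest)
    return results

# Open elements when <name> Z </name> at line 11 is read.
ancestors = [("root", 1), ("pub", 2), ("book", 7),
             ("pub", 9), ("book", 10), ("name", 11)]

for match in closure_matches(ancestors, ["pub", "book", "name"]):
    print(match)
```

Each printed list gives the pub, book, and name lines of one row of the table, without the leading root column.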
System Features
Columns: Name, Support, Streaming, Multiple Predicates, Closure, Buffered Predicate Evaluation
XSQ-F XPath X X X X
XSQ-NC XPath X X X
XMLTK XPath X X
XQEngine XQuery X X
Galax XQuery X X
Joost STX X X
Reference
Feng Peng and Sudarshan S. Chawathe. XPath Queries on Streaming Data. In SIGMOD 2003.
Thank You
???