Post on 19-Dec-2015

Querying Streaming XML Data

Layout of the presentation

Introduction
Common Problems faced
Solution proposed
Basic building blocks of the solution
How to build up a solution to a given query
Features of the system

Streaming XML

XML is a standard for information exchange.
Some XML documents are only available in streaming format.
Streaming is like reading data from a tape drive.
Used for stock market, news, and network statistics feeds.
Predecessor systems were used to filter documents.

Structure of an XPath Query

Example: //book[year>2000]/name

Consists of a location path (//book[year>2000]) and an output expression (name).

The location path consists of a closure axis (//), a node test (book), and a predicate (year>2000).

Features of our Approach

Efficient
Easy-to-understand design
Design of the BPDT (basic pushdown transducer) is tricky

Common Problems faced

Query: /pub[year=2002]/book[price<11]/author

 1. <root>
 2.   <pub>
 3.     <book id="1">
 4.       <price> 12.00 </price>
 5.       <name> First </name>
 6.       <author> A </author>
 7.       <price type="discount"> 10.00 </price>
 8.     </book>
 9.     <book id="2">
10.       <price> 14.00 </price>
11.       <name> Second </name>
12.       <author> A </author>
13.       <author> B </author>
14.       <price type="discount"> 12.00 </price>
15.     </book>
16.     <year> 2002 </year>
17.   </pub>
18. </root>

Line 6: Element satisfies the path.
Failure?? The only price seen so far (line 4, 12.00) fails price<11.
Line 7: Test passed (price<11). But year=2002?
Lines 12-13: Buffer both A & B.
Line 15: Failed price<11. Remove A & B from the buffer.
Line 16: Test passed (year=2002). Output.
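The walkthrough above can be sketched in Python. This is only an illustration of the buffering idea, not the system's implementation: the function name run_query, the two queues, and the choice to skip the <root> wrapper when matching the path are assumptions made for this sketch.

```python
import io
import xml.etree.ElementTree as ET

DOC = b"""<root><pub>
<book id="1"><price> 12.00 </price><name> First </name>
<author> A </author><price type="discount"> 10.00 </price></book>
<book id="2"><price> 14.00 </price><name> Second </name>
<author> A </author><author> B </author>
<price type="discount"> 12.00 </price></book>
<year> 2002 </year></pub></root>"""


def run_query(xml_bytes):
    """Evaluate /pub[year=2002]/book[price<11]/author in one pass (sketch)."""
    output = []       # results that are safe to emit
    pub_queue = []    # authors whose book passed, still waiting for year=2002
    book_queue = []   # authors of the open book, still waiting for price<11
    pub_ok = book_ok = False
    path = []         # tags of the currently open elements
    events = ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end"))
    for event, elem in events:
        if event == "start":
            path.append(elem.tag)
            if path[1:] == ["pub", "book"]:
                book_queue, book_ok = [], False
            continue
        # Assumption: the <root> wrapper is ignored when matching the path.
        steps, text = path[1:], (elem.text or "").strip()
        if steps == ["pub", "book", "author"]:
            book_queue.append(text)                 # buffer for later
        elif steps == ["pub", "book", "price"] and float(text) < 11:
            book_ok = True
        elif steps == ["pub", "book"]:
            if book_ok:                             # otherwise the authors are dropped
                (output if pub_ok else pub_queue).extend(book_queue)
        elif steps == ["pub", "year"] and int(text) == 2002:
            pub_ok = True
            output.extend(pub_queue)                # year known: emit buffered authors
            pub_queue.clear()
        elif steps == ["pub"]:
            pub_queue, pub_ok = [], False
        path.pop()
    return output
```

On the sample document, author A of book 1 is held until line 16, and authors A and B of book 2 are discarded at line 15.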

Problems caused by closure axis

Query: //pub[year=2002]//book[author]//name

 1. <root>
 2.   <pub>
 3.     <book>
 4.       <name> X </name>
 5.       <author> A </author>
 6.     </book>
 7.     <book>
 8.       <name> Y </name>
 9.       <pub>
10.         <book>
11.           <name> Z </name>
12.           <author> B </author>
13.         </book>
14.         <year> 1999 </year>
15.       </pub>
16.     </book>
17.     <year> 2002 </year>
18.   </pub>
19. </root>

Ways in which <name> Z </name> (line 11) matches the path:

pub [year=2002]    book [author]
Line 2: True       Line 7: False
Line 2: True       Line 10: True
Line 9: False      Line 10: True

The pub at line 9 fails year=2002 (its year is 1999). The pub at line 2 passes year=2002.
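A small Python sketch of why the closure axis multiplies the work: one element can match the path in several ways, and it is a result if any one of them satisfies every predicate. The ancestor chain and predicate outcomes are copied from the slide; the names ANCESTORS, PREDICATE, matchings and is_result are hypothetical.

```python
from itertools import combinations

# (tag, line) of every element open when <name> Z </name> (line 11) is read.
ANCESTORS = [("root", 1), ("pub", 2), ("book", 7),
             ("pub", 9), ("book", 10), ("name", 11)]

# Predicate outcome per element line, as listed on the slide.
# In a stream these values are only known after later lines have been read.
PREDICATE = {2: True, 9: False, 7: False, 10: True}


def matchings(chain, tags):
    """All ways to match tags (joined by //) against chain, ending at its last element."""
    *above, last = chain
    if last[0] != tags[-1]:
        return []
    found = []
    # combinations() keeps document order, so each combo is a valid ancestor sequence.
    for combo in combinations(above, len(tags) - 1):
        if [tag for tag, _ in combo] == tags[:-1]:
            found.append([line for _, line in combo] + [last[1]])
    return found


def is_result(chain, tags):
    """True if at least one matching has all of its predicates satisfied."""
    return any(all(PREDICATE[line] for line in m[:-1])
               for m in matchings(chain, tags))
```

Here matchings(ANCESTORS, ["pub", "book", "name"]) yields the three line combinations of the slide, and only the combination through lines 2 and 10 makes name Z a result.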

Problems caused by closure axis

Let's add an author to the book that starts at line 7 (new line 9). Result?

 1. <root>
 2.   <pub>
 3.     <book>
 4.       <name> X </name>
 5.       <author> A </author>
 6.     </book>
 7.     <book>
 8.       <name> Y </name>
 9.       <author> B </author>
10.       <pub>
11.         <book>
12.           <name> Z </name>
13.           <author> B </author>
14.         </book>
15.         <year> 1999 </year>
16.       </pub>
17.     </book>
18.     <year> 2002 </year>
19.   </pub>
20. </root>

Query: //pub[year=2002]//book[author]//name

Handling XML Stream

Input: well-formed XML stream.
Use the SAX API to parse XML.
Events belong to:

Begin = {(a, attrs, d)}
End = {(/a, d)}
Text = {(a, text(), d)}

XML stream: {e1, e2, …, ei, …} | ei ∈ Begin ∪ End ∪ Text
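A minimal sketch of producing such events with Python's standard xml.sax module. It assumes that a is the element tag, attrs its attributes, and d the nesting depth, with the document element at depth 1; the class EventCollector and the function to_events are hypothetical names.

```python
import xml.sax


class EventCollector(xml.sax.ContentHandler):
    """Turns SAX callbacks into begin, text and end tuples that carry a depth."""

    def __init__(self):
        super().__init__()
        self.events = []
        self.stack = []   # tags of the open elements; its length is the depth

    def startElement(self, name, attrs):
        self.stack.append(name)
        self.events.append(("begin", name, dict(attrs.items()), len(self.stack)))

    def characters(self, content):
        # A parser may deliver long text in several chunks; each chunk
        # becomes its own text event here. Whitespace-only chunks are skipped.
        if content.strip():
            self.events.append(("text", self.stack[-1], content.strip(), len(self.stack)))

    def endElement(self, name):
        self.events.append(("end", "/" + name, len(self.stack)))
        self.stack.pop()


def to_events(xml_text):
    handler = EventCollector()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.events
```

For example, to_events("<pub><year>2002</year></pub>") gives a begin event for pub at depth 1, begin, text and end events for year at depth 2, and an end event for pub at depth 1.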

Grammar for XPath Queries

Q → N+ [/O]
N → [/ | //] tag [F]
F → "[" FO [OP constant] "]"
FO → @attribute | tag [@attribute] | text()
O → @attribute | text()
OP → > | ≥ | = | < | ≤ | ≠ | contains

XPath query of the form N1N2…Nn/O

Cannot handle reverse axes or positional functions.
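A rough Python sketch of splitting a query of the form N1N2…Nn/O into its location steps and output expression. It covers only the grammar above and leaves the filter as an unparsed string; the regular expressions and the name parse_query are assumptions of this sketch, not part of the system.

```python
import re

# One location step: axis (/ or //), tag, optional [filter].
STEP = re.compile(r"(//|/)([A-Za-z_][\w.-]*)(?:\[([^\]]*)\])?")
# Optional output expression at the end: /@attribute or /text().
OUTPUT = re.compile(r"/(@[A-Za-z_][\w.-]*|text\(\))$")


def parse_query(query):
    """Return (steps, output): steps are (axis, tag, filter) tuples."""
    output = None
    m = OUTPUT.search(query)
    if m:
        output = m.group(1)
        query = query[:m.start()]
    steps, pos = [], 0
    while pos < len(query):
        m = STEP.match(query, pos)
        if not m:
            raise ValueError(f"cannot parse step at position {pos}: {query[pos:]!r}")
        steps.append(m.groups())   # filter is None when the step has no predicate
        pos = m.end()
    return steps, output
```

Each returned step corresponds to one N in the grammar, which is the unit a BPDT is later built for.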

Solution to Query

Query: /pub[year=2002]/book[price<11]/author

PDA PDT

Basic PushDown Transducer (BPDT)

Similar to a pushdown automaton (PDA)
Actions defined on transition arcs
Finite set of states
  A start state
  A set of final states
Set of input symbols
Set of stack symbols

Book/author example:
Buffer for future: begin event of author.
Remove from buffer: end event of book.
Output result if predicates are true: begin event of author.

Building a BPDT

Query: /pub[year>2000]/book[author]/name/text()

Consider location step: /book[author]
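As a rough illustration of this location step, the Python sketch below tracks whether the [author] predicate of the open book is still unknown (NA) or already true, and buffers name text in the meantime. The state names, the event tuples and the function book_author_names are assumptions of this sketch; it also ignores depth, so it does not handle nested books.

```python
def book_author_names(events):
    """events: iterable of (kind, tag, text), kind in {'begin', 'text', 'end'}."""
    state = "START"          # START -> NA on <book>, NA -> TRUE on <author>
    queue, output = [], []
    for kind, tag, text in events:
        if kind == "begin" and tag == "book":
            state = "NA"                     # predicate not yet decided
        elif kind == "begin" and tag == "author" and state == "NA":
            state = "TRUE"                   # predicate satisfied
            output.extend(queue)             # flush what was buffered
            queue.clear()
        elif kind == "text" and tag == "name" and state != "START":
            if state == "TRUE":
                output.append(text)          # safe to output directly
            else:
                queue.append(text)           # buffer until the predicate is known
        elif kind == "end" and tag == "book":
            queue.clear()                    # predicate never became true
            state = "START"
    return output
```

A book whose name precedes its author has the name buffered and then flushed; a book without an author has its buffered name cleared at the end event of book.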

Basic Building Blocks

XPath Expression: /tag[child]

Buffer operations needed:

Enqueue(x): Add x to the end of the queue.

Clear(): Removes all items from the queue.

Flush(): Outputs all items in the queue in FIFO order.

Upload(): Moves all items to the end of the queue of a parent BPDT.

No Dequeue operation needed.
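The four operations map naturally onto a small queue class. The Python sketch below is one possible reading: the class name BpdtBuffer is hypothetical, and it assumes that Flush() and Upload() leave the queue empty afterwards.

```python
from collections import deque


class BpdtBuffer:
    """FIFO buffer of potential results held by one BPDT (sketch)."""

    def __init__(self, parent=None):
        self.items = deque()
        self.parent = parent      # buffer of the parent BPDT, if any

    def enqueue(self, x):
        self.items.append(x)      # add x to the end of the queue

    def clear(self):
        self.items.clear()        # drop everything: the predicate failed

    def flush(self, out):
        out.extend(self.items)    # output in FIFO order
        self.items.clear()        # assumption: flushed items leave the queue

    def upload(self):
        self.parent.items.extend(self.items)   # append to the parent's queue
        self.items.clear()
```

No dequeue method is defined, since items only ever leave the queue all at once.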

Basic Building Blocks

XPath Expression: /tag[@attr=val]

Basic Building Blocks

XPath Expression: /tag[text()=val]

Basic Building Blocks

XPath Expression: /tag[child@attr=val]

Basic Building Blocks

XPath Expression: /tag[child=val]

A sample BPDT

Query: /pub[year>2000]

Building a solution

HPDT (hierarchical pushdown transducer) for query:

//pub[year>2000]//book[author]//name/text()

HPDT Structure

Each BPDT in the HPDT has:

Position: BPDT position (l, k), where l = depth of the BPDT in the HPDT and k = sequence number from right to left.
The BPDT at position (i-1, k) has a right child BPDT at position (i, 2k), connected to its NA state.
The BPDT at position (i-1, k) has a left child BPDT at position (i, 2k+1), connected to its TRUE state.
BPDT position (i, 2^i - 1) means the predicates in all higher-level BPDTs evaluate to true.
Buffer: potential results
Stack: stack of element (SAX) events
Depth vector
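The position scheme can be checked with a few lines of Python. The sketch assumes the root BPDT sits at position (0, 0), which is consistent with the formula (i, 2^i - 1) at i = 0; the function names are hypothetical.

```python
def right_child(pos):
    """Child connected to the NA state of the BPDT at pos."""
    level, k = pos
    return (level + 1, 2 * k)


def left_child(pos):
    """Child connected to the TRUE state of the BPDT at pos."""
    level, k = pos
    return (level + 1, 2 * k + 1)


def all_true_position(level, root=(0, 0)):
    """Follow TRUE (left) children from the root down to the given level."""
    pos = root
    for _ in range(level):
        pos = left_child(pos)
    return pos
```

Following only TRUE children gives k = 0, 1, 3, 7, …, that is 2^i - 1 at level i, which is why that position means every higher-level predicate is true.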

Example Query

 1. <root>
 2.   <pub>
 3.     <book>
 4.       <name> X </name>
 5.       <author> A </author>
 6.     </book>
 7.     <book>
 8.       <name> Y </name>
 9.       <pub>
10.         <book>
11.           <name> Z </name>
12.           <author> B </author>
13.         </book>
14.         <year> 1999 </year>
15.       </pub>
16.     </book>
17.     <year> 2002 </year>
18.   </pub>
19. </root>

Query: //pub[year=2002]//book[author]//name

root  pub  book  name
1     2    7     11
1     2    10    11
1     9    10    11

3 paths from $1 to $14

System Features

Columns: Name, Support, Streaming, Multiple Predicates, Closure, Buffered Predicate Evaluation

XSQ-F XPath X X X X

XSQ-NC XPath X X X

XMLTK XPath X X

XQEngine XQuery X X

Galax XQuery X X

Joost STX X X

Reference

Feng Peng and Sudarshan S. Chawathe. XPath Queries on Streaming Data. In SIGMOD 2003.

Thank You

???