Efficient Processing of Partially Specified Twig Queries


Page 1: Efficient Processing of Partially Specified Twig Queries


Efficient Processing of Partially Specified Twig Queries

Junfeng Zhou

Renmin University of China

Page 2: Efficient Processing of Partially Specified Twig Queries


Outline

• Introduction

• Preliminary

• PTwigStack

• Conclusion

Page 4: Efficient Processing of Partially Specified Twig Queries

Introduction(1)

• XML has been used extensively as a standard for information representation and exchange

• More and more data is stored and exchanged in XML format

• Effective and efficient querying of XML data is indispensable

Page 5: Efficient Processing of Partially Specified Twig Queries

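Introduction(2)

• Using a standard query language (XPath or XQuery)

• How can we write a proper query when:
  – the structure or schema is not fully available, or
  – information must be extracted from different data sources with different structures?

[Figure: XML document tree. bibliography(1) contains bib(2) and further bib(…) elements; bib(2) contains year(3), book(4) with title(5) and author(6), and article(7) with title(8), author(9) and author(10). Text values shown: 1999, XML, Joe, Mary, XML, Bob. Twig query Q: book with children title and author.]

[Note by zhoujf: Add an XPath example to introduce the three areas of related work; this needs to be explained in advance.]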
Page 6: Efficient Processing of Partially Specified Twig Queries

Introduction (4)

• Using a keyword-based query

• For example [1]:
  – Find the title and author of the publications

[Figure: XML document tree with nodes bibliography(1), bib(2), year(3), book(4), title(5), author(6), article(7), title(8), author(9) and author(10).]

The answer is: (5,6), (8,9,10)

[1] Y. Li, C. Yu, and H. V. Jagadish. Schema-Free XQuery. In Proceedings of VLDB 2004, pages 72-83, 2004.

Page 7: Efficient Processing of Partially Specified Twig Queries

Introduction (5)

• Using a keyword-based query

• What if nodes 6 and 8 are removed from the document?
  – Find the title and author of the publications

[Figure: the same document tree without author(6) and title(8), leaving bibliography(1), bib(2), year(3), book(4), title(5), article(7), author(9) and author(10).]

The answer is: (5,9,10), which is a meaningless result

Correct answer: (5,NULL), (NULL,9,10)

Page 8: Efficient Processing of Partially Specified Twig Queries

Introduction (6)

• Using a Partially Specified Twig Query (PSTQ) [2]
  – Can provide users the most flexibility

• But
  – No existing method can process a PSTQ efficiently

[2] Heuristic Containment Check of Partial Tree-Pattern Queries in the Presence of Index Graphs, CIKM, 2006

Page 9: Efficient Processing of Partially Specified Twig Queries

Introduction(7)

• Objective
  – A concise but effective way to specify more flexible semantic constraints in a twig query
  – An efficient approach that processes a PSTQ holistically, without deriving twig queries and processing them one by one
    • Scan once: each stream whose elements' tag appears in the twig pattern is scanned only once.
    • No redundant output: none of the intermediate path solutions is useless.
    • Bounded space complexity: the space required by the algorithm is bounded by a factor that is independent of the source document size.

Page 10: Efficient Processing of Partially Specified Twig Queries

Outline

• Introduction

• Preliminary
  – Holistic Twig Join
  – Partially Specified Twig Query

• PTwigStack

• Conclusion

Page 11: Efficient Processing of Partially Specified Twig Queries

Preliminary - Holistic Twig Join [3]

• Query processing
  – Output useful path solutions
  – Merge all path solutions to get the final results

• Data structure
  – Each query node is associated with a stack and an element stream

• Benefits
  – No useless path solutions

[Figure: an XML document with root R and elements a1, b1, a2, b2 and c1, and a twig query Q with root A and children B and C.]

[3] N. Bruno, N. Koudas, and D. Srivastava. Holistic twig joins: Optimal XML pattern matching. Technical Report, Columbia University, March 2002.

[Note by zhoujf: the purpose of this method.]
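The per-node element streams and the ancestor test that holistic twig joins rely on can be sketched as follows. This is a minimal illustration, not the TwigStack algorithm itself: it assumes a (start, end) region encoding, uses a hypothetical document, and replaces the stack-based first phase with a brute-force `path_solutions` helper. All names are illustrative.

```python
from collections import namedtuple

# Assumed region encoding: an element spans (start, end) in document order.
Element = namedtuple("Element", "tag name start end")


def is_ancestor(a, d):
    """True if element a properly contains element d."""
    return a.start < d.start and d.end < a.end


def build_streams(elements, query_tags):
    """One stream per query node, holding its elements sorted by start."""
    streams = {tag: [] for tag in query_tags}
    for e in sorted(elements, key=lambda e: e.start):
        if e.tag in streams:
            streams[e.tag].append(e)
    return streams


def path_solutions(streams, anc_tag, desc_tag):
    """Brute-force stand-in for the stack-based first phase (illustration only)."""
    return [(a.name, d.name)
            for a in streams[anc_tag]
            for d in streams[desc_tag]
            if is_ancestor(a, d)]


# Hypothetical document: a1 contains b1 and a2; a2 contains b2 and c1.
doc = [
    Element("A", "a1", 1, 10),
    Element("B", "b1", 2, 3),
    Element("A", "a2", 4, 9),
    Element("B", "b2", 5, 6),
    Element("C", "c1", 7, 8),
]
streams = build_streams(doc, ["A", "B", "C"])
ab_paths = path_solutions(streams, "A", "B")  # root-to-leaf path A//B
ac_paths = path_solutions(streams, "A", "C")  # root-to-leaf path A//C
print(ab_paths)
print(ac_paths)
```

Merging `ab_paths` and `ac_paths` on the shared A element would then give the matches of the twig query with root A and children B and C.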
Page 12: Efficient Processing of Partially Specified Twig Queries

Preliminary - Partially Specified Twig Query [2]

• Q1 consists of two partial paths (PP), p1 and p2
• In p1, Y is a descendant of W
• In p2, W and A are on the same path
• p1 shares W with p2
• “*” means p2 is the output path

[Figure: query Q1 with PP p1 (W, Y) and PP p2* (W, A).]

• Compared with a twig query:
  – Some nodes are specified only as being on the same path as other nodes, without a precedence relationship

• Compared with a keyword-based query:
  – Each part of the query can be a path expression, not just a keyword

• Benefits of using a PSTQ:
  – Users can specify a query with whatever partial knowledge they have, whenever possible

[2] Heuristic Containment Check of Partial Tree-Pattern Queries in the Presence of Index Graphs, CIKM, 2006

[Note by z: p1 needs to be labelled when it is presented.]
Page 13: Efficient Processing of Partially Specified Twig Queries

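Preliminary - Partially Specified Twig Query

• Query processing of a PSTQ: a naïve method
  – Derive the twig queries
  – Process each twig query

• Problems of the naïve method
  – The processing cost is too high
  – Redundant results must be eliminated

[Figure: PSTQ Q with PP p1 (A, C) and PP p2* (A, B); the four twig queries derived from it, Q1 (chain A, B, C), Q2 (chain A, C, B), Q3 (chain B, A, C) and Q4 (A with children B and C); and an XML document with elements a1, b1 and c1.]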
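The cost of the naïve method can be seen in a small sketch. It assumes a (start, end) region encoding, reads the four derived twig queries from the figure as sets of ancestor-descendant edges (an interpretation, not something the slide states explicitly), and evaluates each one by brute force over a hypothetical chain document a1 > b1 > c1.

```python
from collections import namedtuple
from itertools import product

Element = namedtuple("Element", "name start end")


def is_ancestor(a, d):
    """True if element a properly contains element d (region encoding)."""
    return a.start < d.start and d.end < a.end


# Assumed reading of the derived twig queries as (ancestor, descendant) edges.
DERIVED_TWIGS = {
    "Q1": [("A", "B"), ("B", "C")],  # chain A, B, C
    "Q2": [("A", "C"), ("C", "B")],  # chain A, C, B
    "Q3": [("B", "A"), ("A", "C")],  # chain B, A, C
    "Q4": [("A", "B"), ("A", "C")],  # A with children B and C
}


def evaluate_twig(edges, streams):
    """Brute-force evaluation of one derived twig query."""
    matches = []
    for a, b, c in product(streams["A"], streams["B"], streams["C"]):
        binding = {"A": a, "B": b, "C": c}
        if all(is_ancestor(binding[x], binding[y]) for x, y in edges):
            matches.append((a.name, b.name, c.name))
    return matches


def naive_pstq(streams):
    """Evaluate every derived query, then eliminate redundant results."""
    raw = []
    for edges in DERIVED_TWIGS.values():
        raw.extend(evaluate_twig(edges, streams))
    return raw, sorted(set(raw))


# Hypothetical document: a1 contains b1, which contains c1.
streams = {
    "A": [Element("a1", 1, 6)],
    "B": [Element("b1", 2, 5)],
    "C": [Element("c1", 3, 4)],
}
raw, distinct = naive_pstq(streams)
print(len(raw), distinct)
```

In this document the same binding matches both the chain A, B, C and the query A with children B and C, so every stream is read once per derived query and the union still needs a de-duplication pass.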

Page 14: Efficient Processing of Partially Specified Twig Queries

Outline

• Introduction

• Preliminary

• PTwigStack

• Conclusion

[Note by z: How should the being-at-the-same-path relationship be expressed? The algorithm should be changed.]
Page 15: Efficient Processing of Partially Specified Twig Queries

PTwigStack: PSTQ Expression

• Extending XPath by adding an operator
  – “ ” is used to denote the being-at-the-same-path relationship
    • A B is equivalent to A//B or B//A
    • A B C ?

[Figure: query Q over nodes A, B and C, and seven derived twig queries Q1 to Q7.]
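The same-path relationship reduces to two ancestor tests, as the following sketch shows. It assumes a (start, end) region encoding; `same_path` and `chain_same_path` are hypothetical helper names, and reading the three-operand form as constraining adjacent operands only is an assumption, not something the slide confirms.

```python
from collections import namedtuple

Element = namedtuple("Element", "name start end")


def is_ancestor(a, d):
    """True if element a properly contains element d (region encoding)."""
    return a.start < d.start and d.end < a.end


def same_path(x, y):
    """x and y lie on one root-to-leaf path: x//y or y//x."""
    return is_ancestor(x, y) or is_ancestor(y, x)


def chain_same_path(a, b, c):
    """Assumed reading of the three-operand form: adjacent operands only."""
    return same_path(a, b) and same_path(b, c)


# Hypothetical chain: a1 contains b1, which contains c1.
a1, b1, c1 = Element("a1", 1, 6), Element("b1", 2, 5), Element("c1", 3, 4)
# Hypothetical branch: b2 contains the siblings a2 and c2.
b2, a2, c2 = Element("b2", 7, 12), Element("a2", 8, 9), Element("c2", 10, 11)

print(same_path(a1, b1), same_path(b1, a1))
print(chain_same_path(a1, b1, c1))
print(chain_same_path(a2, b2, c2), same_path(a2, c2))
```

Under that reading, the branch case (B above both A and C, with A and C on different paths) also satisfies the expression. Together with the six possible chains over A, B and C, this would account for seven derived twig queries.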

Page 16: Efficient Processing of Partially Specified Twig Queries

PTwigStack

• Objective
  – Scan once
  – No redundant output
  – Bounded space complexity

• Problems
  – Which query node should be processed first? According to a special order in the given query
  – Which element should be processed first? The element with a solution extension
  – How to guarantee that no useless path solutions are produced? An element that cannot participate in answers is not pushed onto a stack

[Figure: example document with elements b1, a1, a2, c1, b2 and b3; the PSTQ Q over nodes A, B and C; and its derived twig queries Q1 (chain A, B, C), Q2 (chain A, C, B), Q3 (chain B, A, C) and Q4 (A with children B and C).]

Page 17: Efficient Processing of Partially Specified Twig Queries

PTwigStack

• Problems(1)
  – Which query node should be processed first?
  – Depth-first order: A, B, C

[Figure: example document with elements b1, a1, a2, c1, b2 and b3; the PSTQ Q over nodes A, B and C; and its derived twig queries Q1 to Q4.]

Page 20: Efficient Processing of Partially Specified Twig Queries

PTwigStack

• Problems(2)
  – Which element should be processed first?
  – The element with a Partial Solution Extension

[Figure: example document with elements b1, a1, a2, c1, b2 and b3; the PSTQ Q over nodes A, B and C; and its derived twig queries Q1 to Q4.]

• Partial Solution Extension (PSE)
  – We say a query node q has a PSE iff q satisfies any one of the following conditions:
    • If q is a leaf node, Cq is not NULL.
    • If q is a non-leaf node, for each q’ ∈ children(q):
      – If q//q’, then Cq is an ancestor of Cq’
      – If q q’ (being at the same path) and q’ has a PSE, then Cq covers Cq’, or is covered by Cq’, or Cq.end < Cq’.start
      – If q q’ and q’ does not have a PSE, let p be a descendant of q’ that has a PSE; then Cq.start < Cp.start
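The PSE conditions can be transcribed fairly literally into code. The sketch below is only one reading of the slide and rests on several assumptions: Cq is taken to be the current head element of q's stream, elements carry a (start, end) region encoding, "covers" means "is an ancestor of", a NULL Cq at a non-leaf node means no PSE, and a same-path child with no PSE-bearing descendant makes the test fail. `QNode`, `has_pse` and the axis labels are hypothetical names.

```python
from collections import namedtuple

Element = namedtuple("Element", "name start end")


class QNode:
    """Query node; children are (axis, QNode) pairs, axis 'desc' (//) or 'same'."""

    def __init__(self, tag, children=()):
        self.tag = tag
        self.children = list(children)


def is_ancestor(a, d):
    return a.start < d.start and d.end < a.end


def descendants(q):
    """Query-tree descendants of q in depth-first order."""
    for _, child in q.children:
        yield child
        yield from descendants(child)


def has_pse(q, cur):
    """cur maps a query tag to its current stream element Cq, or None for NULL."""
    cq = cur.get(q.tag)
    if cq is None:                 # assumption: exhausted stream means no PSE
        return False
    if not q.children:             # leaf: Cq is not NULL
        return True
    for axis, child in q.children:
        cc = cur.get(child.tag)
        if axis == "desc":         # q//q': Cq must be an ancestor of Cq'
            if cc is None or not is_ancestor(cq, cc):
                return False
        elif has_pse(child, cur):  # same path and q' has a PSE
            if not (is_ancestor(cq, cc) or is_ancestor(cc, cq)
                    or cq.end < cc.start):
                return False
        else:                      # same path and q' has no PSE
            p = next((d for d in descendants(child) if has_pse(d, cur)), None)
            if p is None or not cq.start < cur[p.tag].start:  # assumption if p missing
                return False
    return True


# PSTQ with A//C and A on the same path as B (hypothetical heads below).
query = QNode("A", [("desc", QNode("C")), ("same", QNode("B"))])
covers = {"A": Element("a1", 1, 10), "B": Element("b1", 2, 3), "C": Element("c1", 4, 5)}
before = {"A": Element("a1", 3, 6), "B": Element("b1", 7, 8), "C": Element("c1", 4, 5)}
behind = {"A": Element("a1", 3, 6), "B": Element("b1", 1, 2), "C": Element("c1", 4, 5)}
print(has_pse(query, covers), has_pse(query, before), has_pse(query, behind))
```

In the `behind` case the B head ends before a1 starts and does not contain it, so none of the three alternatives of the second condition holds and A has no PSE for these stream positions.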

Page 21: Efficient Processing of Partially Specified Twig Queries

PTwigStack

• Feature of Partial Solution Extension
  – If E has a PSE, E must have a solution extension for some twig query derived from the given PSTQ, which means CE may participate in the final results.

• Usage of Partial Solution Extension
  – Guiding the execution of PTwigStack

Page 22: Efficient Processing of Partially Specified Twig Queries

PTwigStack

• Problems(3)
  – How to guarantee that no useless path solutions are produced?
    • Prevent useless elements from being pushed onto a stack
  – What is a useless element?
    • One that cannot satisfy the query requirement together with the top elements of the correlated stacks or the head element of each element stream

[Figure: a query over nodes A, B and C, and three example documents over the elements a0, a1, b1 and c1.]

Page 23: Efficient Processing of Partially Specified Twig Queries

PTwigStack

• Data structure
  – Stack
    • Each query node is also associated with a stack that compactly represents intermediate results
  – Tag index
    • Each query node is associated with an element stream

Page 24: Efficient Processing of Partially Specified Twig Queries

PTwigStack

PTwigStack(root)
// the first stage
1. while not end(root)
2.   q = getNext(root)
3.   Clean all stacks related with q and output relevant path solutions
4.   If Cq can be pushed into stack Sq
5.     Push(Sq, Cq)
6.   Process other elements Cq’ iteratively, where q’ is a child of q in the query and Cq’.start < Cq.start
7.   Output all possible path solutions
8.   Advance(Cq)
// the second stage
9. MergeAllPathSolution();

Page 25: Efficient Processing of Partially Specified Twig Queries

PTwigStack

[Figure: running example for PTwigStack. Document elements: a1, a2, a3, b1, b2, c1 and c2. PSTQ Q over nodes A, B and C, with its derived twig queries Q1 to Q4. Output and Final Result are both empty at this point.]

Page 26: Efficient Processing of Partially Specified Twig Queries

PTwigStack

[Running example, continued: element c1 is displayed. Output and Final Result are still empty.]

Page 27: Efficient Processing of Partially Specified Twig Queries

PTwigStack

[Running example, continued: elements a1 and b1 are displayed. Output and Final Result are still empty.]

Page 29: Efficient Processing of Partially Specified Twig Queries

PTwigStack

[Running example, continued: elements a1, b1 and c2 are displayed. Output: a1c2. Final Result: empty.]

Page 30: Efficient Processing of Partially Specified Twig Queries

PTwigStack

[Running example, continued: elements a1, b1 and b2 are displayed. Output: a1c2, a1b2. Final Result: empty.]

Page 31: Efficient Processing of Partially Specified Twig Queries

PTwigStack

[Running example, continued: elements a1 and b1 are displayed. Output: a1c2, a1b2, a1b1. Final Result: a1b1c2, a1b2c2.]
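The second stage, merging path solutions into final results, can be sketched with the path solutions shown in the running example (a1c2, a1b2, a1b1). This is a minimal sketch: the function name and the tuple layout (A element first) are assumptions, and a full implementation would also have to handle path solutions in which B appears above A.

```python
from collections import defaultdict


def merge_path_solutions(ab_paths, ac_paths):
    """Join (A, B) and (A, C) path solutions on their shared A element."""
    c_by_a = defaultdict(list)
    for a, c in ac_paths:
        c_by_a[a].append(c)
    return sorted((a, b, c)
                  for a, b in ab_paths
                  for c in c_by_a.get(a, []))


# Path solutions output in the first stage of the running example.
ab_paths = [("a1", "b2"), ("a1", "b1")]
ac_paths = [("a1", "c2")]
final = merge_path_solutions(ab_paths, ac_paths)
print(final)
```

The two merged tuples correspond to the final results a1b1c2 and a1b2c2 listed on the slide.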

Page 32: Efficient Processing of Partially Specified Twig Queries

PTwigStack

• Properties:
  – Each element is scanned only once
  – Each element in a stack must participate in at least one final result
  – No “eliminating operation” is needed for redundant results
  – Space is bounded by |Q|×L, where L is the length of the longest path in the XML source document and |Q| is the number of nodes in the given query Q

Page 33: Efficient Processing of Partially Specified Twig Queries

Outline

• Introduction

• Preliminary

• PTwigStack

• Conclusion

[Note by z: How should the being-at-the-same-path relationship be expressed? The algorithm should be changed.]
Page 34: Efficient Processing of Partially Specified Twig Queries

Conclusion

• We propose a concise but effective way to express the semantics of being at the same path by extending XPath

• We propose a new concept, Partial Solution Extension, to guide the execution of getNext

• We propose a new holistic join method to process a PSTQ with a root node

Page 35: Efficient Processing of Partially Specified Twig Queries

Future Work

• The above method cannot be applied directly to a query that is not specified with a root node, e.g.
  – #[//A]//B
  – #[//A//B]//C
  – #[//A B]//C

• Possible solutions
  – Implementing a special algorithm to process a PSTQ that is not specified with a root node (using Dewey codes)
  – Using ORA-SS [4] to construct a twig query with more semantic constraints (using range codes)

[4] Gillian Dobbie, Wu Xiaoying, Tok Wang Ling, Mong Li Lee. ORA-SS: An Object-Relationship-Attribute Model for Semistructured Data. TR21/00, Technical Report, Department of Computer Science, National University of Singapore, December 2000.

Page 36: Efficient Processing of Partially Specified Twig Queries


Thank You !

Q & A