Reuse or Never Reuse the Deleted Labels in XML Query Processing Based on Labeling Schemes

30
Reuse or Never Reuse the Deleted Labels in XML Query Processing Based on Labeling Schemes Changqing Li, Tok Wang Ling, Min Hu

description

Reuse or Never Reuse the Deleted Labels in XML Query Processing Based on Labeling Schemes. Changqing Li, Tok Wang Ling, Min Hu. Roadmap. Related Work Preliminary and Motivation Reuse the Deleted Labels Never Reuse the Deleted Labels Experimental Results. Related Work. Labeling scheme - PowerPoint PPT Presentation

Transcript of Reuse or Never Reuse the Deleted Labels in XML Query Processing Based on Labeling Schemes

Page 1: Reuse or Never Reuse the Deleted Labels in XML Query Processing Based on Labeling Schemes

Reuse or Never Reuse the Deleted Labels in XML Query Processing

Based on Labeling Schemes

Changqing Li, Tok Wang Ling, Min Hu

Page 2: Reuse or Never Reuse the Deleted Labels in XML Query Processing Based on Labeling Schemes

DASFAA'06 Reuse or Never Reuse 2

Roadmap

Related Work Preliminary and Motivation Reuse the Deleted Labels Never Reuse the Deleted Labels Experimental Results

Page 3: Reuse or Never Reuse the Deleted Labels in XML Query Processing Based on Labeling Schemes

DASFAA'06 Reuse or Never Reuse 3

Related Work

Labeling scheme Labeling schemes are proposed to efficiently process XML queries Three main categories of labeling schemes to process XML queries

Containment labeling scheme [Zhang et al SIGMOD01 etc.] Prefix labeling scheme [Tatarinov et al SIGMOD02 etc.] Prime labeling scheme [Wu et al ICDE04]

Version Control No related work on XML version control based on labeling schemes

[Zhang et al SIGMOD01] C. Zhang, et al. On Supporting Containment Queries in Relational Database Management Systems. In Proc. of SIGMOD, pages 425-436, 2001.

[Tatarinov et al SIGMOD02 ] I. Tatarinov, S. Viglas, K.S. Beyer, J. Shanmugasundaram, E.J. Shekita, and C. Zhang. Storing and querying ordered XML using a relational database system. In Proc. of SIGMOD, pages 204-215, 2002.

[Wu et al ICDE04] X. Wu, M.L. Lee, and W. Hsu. A Prime Number Labeling Scheme for Dynamic Ordered XML Trees. In Proc. of ICDE, pages 66-78, 2004

Page 4: Reuse or Never Reuse the Deleted Labels in XML Query Processing Based on Labeling Schemes

DASFAA'06 Reuse or Never Reuse 4

Preliminary

To completely avoid the re-labeling, we propose a Quaternary Encoding for Dynamic XML Data, called QED

Four quaternary strings “0”, “1”, “2” and “3” are used in the code and each quaternary string is stored with two bits, i.e. “00”, “01”, “10” and “11”.

The quaternary string “0” is used as the separator, and only “1”, “2”, and “3” are used in the QED encoding.

“0” will never encounter the overflow problem, so QED can completely avoid re-labeling in XML updates.

[Li & Ling CIKM05] Changqing Li, Tok Wang Ling. QED: A Novel Quaternary Encoding to Completely Avoid Re-labeling in XML Updates. In Proc. of the 14th International Conference on Information and Knowledge Management (CIKM), pages 501-508, 2005.

Page 5: Reuse or Never Reuse the Deleted Labels in XML Query Processing Based on Labeling Schemes

DASFAA'06 Reuse or Never Reuse 5

QED Encoding

Decimal number QED codes

1 112

2 12

3 122

4 13

5 132

6 2

7 212

8 22

9 23

10 232

11 3

12 312

13 32

14 322

15 33

16 332

Each time, encode the (1/3)th and (2/3)th numbers

Support insertion with orders kept and without re-encoding

When we try to insert two codes between “112” and “12”, the two codes are “113” and “1132”. We need not re-encode any existing numbers, but we can keep the orders, i.e. “112” < “113” < “1132” < “12” lexicographically.

QED can be applied broadly to different labeling schemes to completely avoid re-labeling in XML updates.

Page 6: Reuse or Never Reuse the Deleted Labels in XML Query Processing Based on Labeling Schemes

DASFAA'06 Reuse or Never Reuse 6

QED Algorithm for Insertion

1: get the sizes, i.e. number of bits, of Left_Code and Right_Code 2: if size(Left_Code) < size(Right_Code) //size is the number of bits of

the code //case 1 3: then Inserted_Code = the Right_Code with the last symbol changed to “1” concatenating “2” 4: else if size(Left_Code) > size(Right_Code) 5: if the last symbol of Left_Code is “2” //case 2 6: then Inserted_Code = the Left_Code with the last symbol changed from “2” to “3” 7: else if the last symbol of Left_Code is “3” //case 3 8: then Inserted_Code = Left_Code concatenating “2” 9: else if size(Left_Code) = size(Right_Code) //case 410: then Inserted_Code = Left_Code concatenating “2”

Page 7: Reuse or Never Reuse the Deleted Labels in XML Query Processing Based on Labeling Schemes

DASFAA'06 Reuse or Never Reuse 7

Motivation

If there are no deletions, the QED algorithm makes the inserted code with the smallest size and with the orders kept

However, if there are deletions, the QED algorithm can not make the inserted code with the smallest size, though it can keep the orders

Page 8: Reuse or Never Reuse the Deleted Labels in XML Query Processing Based on Labeling Schemes

DASFAA'06 Reuse or Never Reuse 8

Motivation (Cont.)

Decimal number QED codes

1 112

2 12

3 122

4 13

5 132

6 2

7 212

8 22

9 23

10 232

11 3

12 312

13 32

14 322

15 33

16 332

When deleting “12” between “112” and “122” and insert another code at this place

The inserted code will be “1122” based on QED algorithm.

The deleted code “12” is not reused

The re-inserted code “1122” has larger size than the deleted code “12”, therefore the size increases fast.

Page 9: Reuse or Never Reuse the Deleted Labels in XML Query Processing Based on Labeling Schemes

DASFAA'06 Reuse or Never Reuse 9

Motivation (Cont.)

Decimal number QED codes

1 112

2 12

3 122

4 13

5 132

6 2

7 212

8 22

9 23

10 232

11 3

12 312

13 32

14 322

15 33

16 332

On the other hand, if deleting “122” between “12” and “13” and inserting another code at this place

The inserted code is still “122”.

The deleted code “122” is reused because it has larger size than its neighbors (“12” and “13”).

Page 10: Reuse or Never Reuse the Deleted Labels in XML Query Processing Based on Labeling Schemes

DASFAA'06 Reuse or Never Reuse 10

Motivation (Cont.)

It is not good to process the deleted labels in this way. If we want to improve the query performance, all the deleted labels

should be reused which will hinder the label size from increasing fast.

If we want to query different versions of the XML, we should never reuse the deleted labels.

That is to say, the current QED sometimes reuses the deleted codes, sometimes not. It is not what we expect.

Therefore, we propose algorithms to reuse the deleted labels for improving query performance and never reuse the deleted labels to control versions respectively.

Page 11: Reuse or Never Reuse the Deleted Labels in XML Query Processing Based on Labeling Schemes

DASFAA'06 Reuse or Never Reuse 11

Reuse

Idea of reuse The main idea of the Reuse algorithm is to compare the Left_Code

and Right_Code symbol by symbol from left to right to find the smallest code lexicographically between Left_Code and Right_Code

The Reuse algorithm can be found in the paper. Because it is too long, here we do not repeat it.

We use examples to show how Reuse works

Page 12: Reuse or Never Reuse the Deleted Labels in XML Query Processing Based on Labeling Schemes

DASFAA'06 Reuse or Never Reuse 12

Example of Reuse

Decimal number QED codes

1 112

2 12

3 122

4 13

5 132

6 2

7 212

8 22

9 23

10 232

11 3

12 312

13 32

14 322

15 33

16 332

When deleting “12” between “112” and “122” and insert another code at this place

The second symbol of left_code (“112”) is “1” and the second symbol of right_code (“122”) is “2”.

temp_code = the second symbol of “112” changed to “2”, i.e. temp_code = “12”.

“12” > “112” lexicographically, and “12” < “122” lexicographically, therefore inserted_code = temp_code = “12”.

It can be seen the deleted code “12” is reused.

Page 13: Reuse or Never Reuse the Deleted Labels in XML Query Processing Based on Labeling Schemes

DASFAA'06 Reuse or Never Reuse 13

Theorem of Reuse

Theorem Suppose some codes are deleted between left_code and right_code, and suppose the minimum size of these deleted codes is MS. The Reuse Algorithm guarantees that the inserted code between left_code and right_code is with size MS.

Page 14: Reuse or Never Reuse the Deleted Labels in XML Query Processing Based on Labeling Schemes

DASFAA'06 Reuse or Never Reuse 14

Example of Reusing the Code with Smaller Size Firstly

Decimal number QED codes

1 112

2 12

3 122

4 13

5 132

6 2

7 212

8 22

9 23

10 232

11 3

12 312

13 32

14 322

15 33

16 332

When “212”, “22” and “23” between “2” and “232” are deleted and we need to insert a new code between “2” and “232”.

left_code “2” is a prefix of right_code “232”

Remove left_code from right code, i.e. remove the first “2” from “232”, “32” is left.

The firstly encountered “2” in “32” is at the 2nd symbol, and the firstly encountered “3” in “32” is at the 1st symbol.

“3” appears before “2”, therefore temp_code = change the first encountered “3” to “2”, i.e. temp_code = “2”.

Page 15: Reuse or Never Reuse the Deleted Labels in XML Query Processing Based on Labeling Schemes

DASFAA'06 Reuse or Never Reuse 15

Example of Reusing the Code with Smaller Size Firstly (Cont.)

Decimal number QED codes

1 112

2 12

3 122

4 13

5 132

6 2

7 212

8 22

9 23

10 232

11 3

12 312

13 32

14 322

15 33

16 332

When “212”, “22” and “23” between “2” and “232” are deleted and we need to insert a new code between “2” and “232”.

The final inserted_code = left_code concatenates temp_code = “2” concatenates “2” = “22”.

The deleted code “22” is reused, and it can be seen that the size of “22” is less than or equal to the size of the deleted codes “212” and “23”.

That means the deleted code with smaller size is reused firstly.

Page 16: Reuse or Never Reuse the Deleted Labels in XML Query Processing Based on Labeling Schemes

DASFAA'06 Reuse or Never Reuse 16

Comparison between QED and Reuse

Common points Both of them support insertions without re-encoding and with orders

kept

Different points Update cost

The update cost of QED is smaller, it only needs to modify the last 2 bits The update cost of Reuse is higher, it needs to compare labels symbol by symbol

Size increasing speed If there are deletions, QED can not always reuse the deleted labels, therefore its

size increases faster Reuse can reuse the deleted codes, thus its size increasing speed is slower

Page 17: Reuse or Never Reuse the Deleted Labels in XML Query Processing Based on Labeling Schemes

DASFAA'06 Reuse or Never Reuse 17

Never Reuse

Idea of NeverReuse The main idea of the NeverReuse algorithm is that we do not

physically delete the codes, but mark the deleted codes as “deleted”. The insertion of the new codes is that we

insert a code between left_code and the first deleted_code inserting between any two consecutive deleted codes and inserting between the last deleted code and right_code using the QED

Algorithm the final inserted code is the code of all these inserted codes with the smallest size

Page 18: Reuse or Never Reuse the Deleted Labels in XML Query Processing Based on Labeling Schemes

DASFAA'06 Reuse or Never Reuse 18

Example of NeverReuse

Decimal number QED codes

1 112

2 12

3 122

4 13

5 132

6 2

7 212

8 22

9 23

10 232

11 3

12 312

13 32

14 322

15 33

16 332

When deleting “122”, “13” and “132” between “12” and “2” and insert another code at this place

We do not delete them physically, but mark them as “deleted”.

When a new code needs to be inserted between “12” and “2” we insert codes

between left_code “12” and the first deleted_code “122”

between deleted_codes “122” and “13”

between deleted_codes “13” and “132”

and between the last deleted_code “132” and right_code “2”

Page 19: Reuse or Never Reuse the Deleted Labels in XML Query Processing Based on Labeling Schemes

DASFAA'06 Reuse or Never Reuse 19

Example of NeverReuse (Cont.)

Decimal number QED codes

1 112

2 12

3 122

4 13

5 132

6 2

7 212

8 22

9 23

10 232

11 3

12 312

13 32

14 322

15 33

16 332

When deleting “122”, “13” and “132” between “12” and “2” and insert another code at this place

The inserted codes will be “1212”, “123”, “1312” and “133” based on the QED Algorithm.

We select the inserted code with the smallest size, e.g. “123”, as the final inserted code.

“123” and “133” are the codes between “12” and “2” with the smallest sizes which do not reuse the deleted codes.

Page 20: Reuse or Never Reuse the Deleted Labels in XML Query Processing Based on Labeling Schemes

DASFAA'06 Reuse or Never Reuse 20

NeverReuse-II and NeverReuse-III

The previous NeverReuse Algorithm intends to make the label size increase slowly, called NeverReuse-I

However, NeverReuse-I needs more time to calculate the inserted code especially when there are a lot of deleted codes between left_code and right_code.

If we want to reduce the insertion time, we can directly use any inserted code as the final inserted code, called NeverReuse-II, but this can not guarantee that the inserted code is with the smallest size.

Furthermore, if a code is required to be inserted between two specific deleted codes (the inserted code should have order relationships with the two specific deleted codes), then insert a code between these two specific deleted codes, called NeverReuse-III

Page 21: Reuse or Never Reuse the Deleted Labels in XML Query Processing Based on Labeling Schemes

DASFAA'06 Reuse or Never Reuse 21

Theorem of NeverReuse

Theorem NeverReuse-I, NeverReuse-II, and NeverReuse-I all will NOT reuse the deleted codes.

Page 22: Reuse or Never Reuse the Deleted Labels in XML Query Processing Based on Labeling Schemes

DASFAA'06 Reuse or Never Reuse 22

Comparison of NeverReuse with Other Approaches

Time stamps + labeling scheme This approach may reuse the deleted labels, the time stamps + labels

can uniquely specify a node If the deleted labels have order requirement, this approach does not

work E.g. the inserted label A is before the inserted label B, this approach can not

distinguish the space order of A and B, but can only distinguish the time order of A and B

NeverReuse Our NeverReuse can keep both the space order and time order if

time stamps are also added into our NeverReuse

Page 23: Reuse or Never Reuse the Deleted Labels in XML Query Processing Based on Labeling Schemes

DASFAA'06 Reuse or Never Reuse 23

Experimental Setup

We select an XML file Hamlet in Dataset Shakespeare’s play [NIAGARA] to test the performances of Reuse and NeverReuse. It is similar for all the other files in other datasets [Washington, XMark].

[NIAGARA] NIAGARA Experimental Data. Available at: http://www.cs.wisc.edu/niagara/data.html

[Washington] University of Washington XML Repository. Available at: http://www.cs.washington.edu/research/xmldatasets/

[XMark] XMark — An XML Benchmark Project. Available at: http://monetdb.cwi.nl/xml/downloads.html

Page 24: Reuse or Never Reuse the Deleted Labels in XML Query Processing Based on Labeling Schemes

DASFAA'06 Reuse or Never Reuse 24

Experiment about Reuse

We generate 1,000,000 QED codes. We test the case that codes are deleted then inserted at the

odd positions of the 1,000,000 codes; after the deletions and insertions, we call these new codes CodeSet2; this is case 1.

Secondly we test that the codes are deleted then inserted at the even positions of CodeSet2, thirdly odd positions of CodeSet3, fourthly even positions of CodeSet4, and so on.

We compare the performance of QED and Reuse.

Page 25: Reuse or Never Reuse the Deleted Labels in XML Query Processing Based on Labeling Schemes

DASFAA'06 Reuse or Never Reuse 25

Experiment about Reuse: Label Size

The label size of QED increases fast Because Reuse can reuse the deleted labels, its size does not increase

0

10

20

30

40

50

1 2 3 4 5 6 7 8 9 10

Deletion and insertion cases

Lab

el s

ize

(1,0

00,0

00 b

its)

QED

Reuse

Page 26: Reuse or Never Reuse the Deleted Labels in XML Query Processing Based on Labeling Schemes

DASFAA'06 Reuse or Never Reuse 26

Experiment about NeverReuse

We delete and insert at any place of the 1,000,000 QED codes.

The experimental results confirm that our NeverReuse algorithm(s) (NeverReuse-I, NeverReuse-II, and NeverReuse-III; see the discussions after Theorem 5.1) never reuse any deleted codes, hence the NeverReuse algorithm(s) can truly maintain different label versions of the XML data.

There are no other researches about how to never reuse the deleted labels in labeling schemes. Therefore we do not compare different schemes on label version control in the experiments.

Page 27: Reuse or Never Reuse the Deleted Labels in XML Query Processing Based on Labeling Schemes

DASFAA'06 Reuse or Never Reuse 27

Experiment about NeverReuse (Cont.)

We compare the size and the update time increasing speeds of NeverReuse-I, NeverReuse-II and NeverReuse-III.

The below figure shows that the size (only the size of the inserted codes) differences among the three approaches are not very large though NeverReuse-I is better.

0

0.05

0.1

0.15

0.2

0.25

0.3

100020003000400050006000700080009000

10000

Number of nodes inserted

Lab

el s

ize

(1,0

00,0

00 b

its)

NeverReuse-I

NeverReuse-II

NeverReuse-III

Page 28: Reuse or Never Reuse the Deleted Labels in XML Query Processing Based on Labeling Schemes

DASFAA'06 Reuse or Never Reuse 28

Experiment about NeverReuse (Cont.)

The below figure shows that the update time (only the processing time) of NeverReuse-I increases very fast, but the update time of NeverReuse-II and NeverReuse-III is almost 0 millisecond (ms).

0

100200

300

400500

600

700800

900

100020003000400050006000700080009000

10000

number of nodes inserted

Inse

rtio

n t

ime

(ms)

NeverReuse-I

NeverReuse-II

NeverReuse-III

Page 29: Reuse or Never Reuse the Deleted Labels in XML Query Processing Based on Labeling Schemes

DASFAA'06 Reuse or Never Reuse 29

Experiment about NeverReuse (Cont.)

In practice, we suggest using NeverReuse-III because its update time is small, its code size is not large, and the most important reason is that NeverReuse-III can maintain the order relationships among the deleted codes.

Maintaining the orders of the deleted codes can only be achieved by our approach.

Page 30: Reuse or Never Reuse the Deleted Labels in XML Query Processing Based on Labeling Schemes

DASFAA'06 Reuse or Never Reuse 30

Thank youQ & A