Managing XML and Semistructured Data


Page 1: Managing XML and Semistructured Data

1

Part 4: Compressing XML Data

Managing XML and Semistructured Data

Page 2: Managing XML and Semistructured Data

In this section

XML Compression
• Motivation
• The State-of-the-Art
    Queriable compressors
    Non-queriable compressors

Resources
• XMill: An Efficient Compressor for XML Data, by Liefke and Suciu, in SIGMOD 2000
• Others: XGrind, XPRESS, XQueC, XMLZip, …
• XCQ: from my publications
• XQzip: from my publications
• MQX: from my publications

Page 3: Managing XML and Semistructured Data

Introduction

More and more XML data is created
• Duplicate structures (tags, paths, …)
• Data inflation: data in XML is much larger than raw data
• Compression: storage and data transfer

General-purpose compressor (e.g. gzip)
• Characteristics of XML data not utilized
• Unqueriable

Page 4: Managing XML and Semistructured Data

Compression: The Problem

XML for exchange (space or time)

But XML is verbose and inflated due to
• Duplicated tags and paths

Users prefer application-specific formats:
• E.g. web server logs

Is XML doomed to fail?

Solution: XML-specific compressor
• Non-queriable: XMill
• Queriable: XQzip

Page 5: Managing XML and Semistructured Data

XML-Specific Compressors

Unqueriable compression (e.g. XMill):
• Full-chunked: data commonalities eliminated
• Very good compression ratio

Queriable compression (e.g. XGrind, XPRESS):
• Fine-grained: data commonalities ignored
• Inadequate compression ratio and time
• Support simple path queries with atomic predicates

Page 6: Managing XML and Semistructured Data

Issues in XML Compression

Compression ratio, compression time, query coverage, memory usage, … (see my survey paper in WWWJ)

Comparison of existing technologies

Page 7: Managing XML and Semistructured Data

An Example: Web Server Logs

ASCII file 15.9 MB (gzipped 1.6 MB):

202.239.238.16|GET / HTTP/1.0|text/html|200|1997/10/01-00:00:02|-|4478|-|-|http://www.net.jp/|Mozilla/3.1[ja](I)

XML-ized Apache web log inflates to 24.2 MB (gzipped 2.1 MB):

<apache:entry>
<apache:host>202.239.238.16</apache:host>
<apache:requestLine>GET / HTTP/1.0</apache:requestLine>
<apache:contentType>text/html</apache:contentType>
<apache:statusCode>200</apache:statusCode>
<apache:date>1997/10/01-00:00:02</apache:date>
<apache:byteCount>4478</apache:byteCount>
<apache:referer>http://www.net.jp/</apache:referer>
<apache:userAgent>Mozilla/3.1[ja](I)</apache:userAgent>
</apache:entry>
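The inflation can be reproduced on a small scale. The following is a minimal sketch, not the tool used for the slide: the field order is inferred from the example line, the three "-" columns are skipped because the slide does not name them, and XML escaping is omitted because the sample values contain no special characters. `log_line_to_xml` and `sizes` are hypothetical helper names.

```python
import zlib

# Field order inferred from the slide; None marks the "-" columns whose
# meaning the slide does not name, so they are left out of the XML.
FIELDS = ["host", "requestLine", "contentType", "statusCode", "date",
          None, "byteCount", None, None, "referer", "userAgent"]

LINE = ("202.239.238.16|GET / HTTP/1.0|text/html|200|1997/10/01-00:00:02"
        "|-|4478|-|-|http://www.net.jp/|Mozilla/3.1[ja](I)")


def log_line_to_xml(line: str) -> str:
    """Wrap each named field of one pipe-delimited log line in an apache: tag."""
    rows = ["<apache:entry>"]
    for name, value in zip(FIELDS, line.split("|")):
        if name is not None:
            rows.append(f"<apache:{name}>{value}</apache:{name}>")
    rows.append("</apache:entry>")
    return "\n".join(rows)


def sizes(n_entries: int = 1000):
    """Return (raw, xml, gzipped raw, gzipped xml) sizes in bytes."""
    raw = "\n".join([LINE] * n_entries).encode()
    xml = "\n".join([log_line_to_xml(LINE)] * n_entries).encode()
    return len(raw), len(xml), len(zlib.compress(raw)), len(zlib.compress(xml))


if __name__ == "__main__":
    print(log_line_to_xml(LINE))
    print(sizes())
```

Because the sketch repeats one identical line, the compressed sizes it prints are far smaller than anything a real log would give; only the raw-versus-XML inflation is meaningful here.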

Page 8: Managing XML and Semistructured Data

XMill

First specialized compressor for XML data
• SAX parser for parsing XML data
• Still uses gzip as its underlying compressor
• Clever grouping of data into containers for compression

Compresses XML via three basic techniques
• Compress the structure separately from the data
• Group the data values according to their types
• Apply semantic (specialized) compressors

Downloadable:
• www.cs.washington.edu/homes/suciu/XMILL

Page 9: Managing XML and Semistructured Data

XMill Architecture

[figure: XMill architecture]

Page 10: Managing XML and Semistructured Data

How XMill Works: Three Ideas

Compress the structure separately from the data:

Structure:

<apache:entry>
<apache:host> </apache:host>
. . .
</apache:entry>
<apache:entry>
<apache:host> </apache:host>
. . .
</apache:entry>

Data:

202.239.238.16
GET / HTTP/1.0
text/html
200
202.239.238.16
GET / HTTP/1.0
text/html
200

gzip(Structure) + gzip(Data) = 1.75 MB

Page 11: Managing XML and Semistructured Data

How XMill Works: Three Ideas

Group the data values according to their types:

Structure:

<apache:entry>
. . .
</apache:entry>
<apache:entry>
. . .
</apache:entry>

Data1:

202.23.23.16
224.42.24.55
202.23.23.16
224.42.24.55

Data2:

GET / HTTP/1.0
GET / HTTP/1.1
GET / HTTP/1.0
GET / HTTP/1.1

gzip(Structure) + gzip(Data1) + gzip(Data2) + ... = 1.33 MB
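The first two ideas can be sketched in a few lines of Python. This is an illustration under stated assumptions, not XMill's actual code: it uses `xml.etree.ElementTree.iterparse` instead of a SAX handler, groups values by root-to-leaf tag path, leaves out the container references that XMill writes into the structure, and uses `zlib` as a stand-in for gzip. The names `split_into_containers`, `compressed_size` and `SAMPLE` are hypothetical.

```python
import io
import zlib
import xml.etree.ElementTree as ET
from collections import defaultdict

SAMPLE = (
    "<log>"
    "<entry><host>202.23.23.16</host><requestLine>GET / HTTP/1.0</requestLine></entry>"
    "<entry><host>224.42.24.55</host><requestLine>GET / HTTP/1.1</requestLine></entry>"
    "</log>"
)


def split_into_containers(xml_text):
    """Separate a document into structure tokens and per-path data containers."""
    structure = []                  # tag names, plus "/" for every end tag
    containers = defaultdict(list)  # "a/b/c" path -> text values in document order
    path = []
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            path.append(elem.tag)
            structure.append(elem.tag)
        else:
            # The element text is complete once its end tag has been seen.
            text = (elem.text or "").strip()
            if text:
                containers["/".join(path)].append(text)
            path.pop()
            structure.append("/")
    return structure, dict(containers)


def compressed_size(structure, containers):
    """Total bytes when the structure and each container are compressed separately."""
    total = len(zlib.compress(" ".join(structure).encode("utf-8")))
    for values in containers.values():
        total += len(zlib.compress("\n".join(values).encode("utf-8")))
    return total


if __name__ == "__main__":
    structure, containers = split_into_containers(SAMPLE)
    print(structure)
    print(containers)
    print(compressed_size(structure, containers))
```

On a document this small the separate zlib headers outweigh any gain; the benefit reported on the slides comes from large inputs where each container holds many similar values.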

Page 12: Managing XML and Semistructured Data

How XMill Works: Three Ideas

Apply semantic (specialized) compressors:

gzip(Structure) + gzip(c1(Data1)) + gzip(c2(Data2)) + ... = 0.82 MB

Examples:
• 8-, 16-, 32-bit integer encoding (signed/unsigned)
• Differential compression (e.g. 1999, 1995, 2001, 2000, 1995, ...)
• Compress lists, records (e.g. 104.32.23.1 → 4 bytes)

Needs user input to select the semantic compressor
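Two of the semantic compressors listed above are simple enough to sketch. The functions below are illustrative only and do not reproduce XMill's encoders or file format: `pack_ipv4` stores a dotted IPv4 address in 4 bytes, and `delta_encode` stores each value as its difference from the previous one (the first value is kept as a difference from 0). All function names are hypothetical.

```python
def pack_ipv4(addr: str) -> bytes:
    """Encode a dotted-quad address such as "104.32.23.1" in 4 bytes."""
    parts = [int(p) for p in addr.split(".")]
    if len(parts) != 4:
        raise ValueError(f"not an IPv4 address: {addr!r}")
    return bytes(parts)  # bytes() rejects values outside 0..255


def unpack_ipv4(raw: bytes) -> str:
    """Inverse of pack_ipv4."""
    return ".".join(str(b) for b in raw)


def delta_encode(values):
    """Replace each value by its difference from the previous value."""
    previous, deltas = 0, []
    for value in values:
        deltas.append(value - previous)
        previous = value
    return deltas


def delta_decode(deltas):
    """Inverse of delta_encode."""
    current, values = 0, []
    for delta in deltas:
        current += delta
        values.append(current)
    return values


if __name__ == "__main__":
    print(pack_ipv4("104.32.23.1"))
    print(delta_encode([1999, 1995, 2001, 2000, 1995]))
```

The deltas of nearby years are small integers, which a later integer encoding or gzip pass can store more compactly than the four-digit originals.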

Page 13: Managing XML and Semistructured Data

Path Processor – structure container

• Replace data values with container numbers (negative integers)
• Replace end tags with 0
• Replace tags/attributes with positive integers

<Book>
<Title lang="English">Data Compression</Title>
<Author>Gray</Author>
<Author>Reiter</Author>
</Book>

<Book>
<Title lang=-1>-2</Title>
<Author>-3</Author>
<Author>-3</Author>
</Book>

<Book><Title lang=-1 0>-2 0 <Author>-3 0 <Author>-3 0 0

Book = 1, Title = 2, @lang = 3, Author = 4

1 2 3 -1 0 -2 0 4 -3 0 4 -3 0 0

Less storage: 14 bytes!

Dictionary: one more entry for each new word

Repeated structure entries can be compressed effectively!
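The three replacement rules can be traced with a short sketch. It is an assumption-laden illustration rather than XMill's path processor: it walks an `ElementTree` instead of consuming SAX events, assigns one container per root-to-leaf path, and ignores mixed content. `encode_structure` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

BOOK = ('<Book><Title lang="English">Data Compression</Title>'
        '<Author>Gray</Author><Author>Reiter</Author></Book>')


def encode_structure(xml_text):
    """Return (token list, tag dictionary) for the structure container."""
    tag_ids = {}        # tag or "@attribute" name -> positive dictionary number
    container_ids = {}  # path of names -> container number (emitted negated)
    out = []

    def number(table, key):
        # Hand out 1, 2, 3, ... in order of first appearance.
        if key not in table:
            table[key] = len(table) + 1
        return table[key]

    def visit(elem, path):
        here = path + (elem.tag,)
        out.append(number(tag_ids, elem.tag))
        for name in elem.attrib:
            # An attribute is encoded like a child element: id, value, end.
            out.append(number(tag_ids, "@" + name))
            out.append(-number(container_ids, here + ("@" + name,)))
            out.append(0)
        if elem.text and elem.text.strip():
            out.append(-number(container_ids, here))
        for child in elem:
            visit(child, here)
        out.append(0)  # end tag

    visit(ET.fromstring(xml_text), ())
    return out, tag_ids


if __name__ == "__main__":
    tokens, dictionary = encode_structure(BOOK)
    print(tokens)
    print(dictionary)
```

For the Book example this yields the same 14 tokens as the slide; both Author values share container 3 because they have the same path.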

Page 14: Managing XML and Semistructured Data

XML Compression

XMill Evaluation using XML datasets

Page 15: Managing XML and Semistructured Data

Queriable Compressors

XQzip: queriable XML compressor (our work [EDBT04])

• Existing XML compressors (survey in [WWWJ05]):
    Unqueriable (e.g. XMill [SIGMOD00]): exploit data commonalities (→ better compression ratio than gzip)
    Queriable (e.g. XGrind [ICDE02], XPRESS [SIGMOD03], XQueC, XQzip [EDBT04], XCQ [KAISJ05]): compress data individually (→ inadequate compression ratio and time)
• Features of XQzip:
    Use the SIT to aid query evaluation
    Block compression: allows data commonalities to be exploited, and blocks to be used as buffers to reduce decompression overhead

Page 16: Managing XML and Semistructured Data

Structure Index Tree (SIT)

Effective elimination of duplicate structures in the XML data

Merging of nodes that have
• the same incoming path
• the same ordered set of paths of their descendants

SIT Construction
• A linear scan of the XML document
• Merging of the subtree under construction into its equivalent subtree in the base tree
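The merging rule can be illustrated with a simplified sketch. It is not the XQzip construction algorithm: it builds the tree bottom-up from a parsed document rather than during a linear scan, and it treats two sibling subtrees as equivalent when their tags and their (already merged) child structures match. Element ids are document-order numbers assigned by the sketch. `SITNode`, `build_sit`, `merge_into` and `count_nodes` are hypothetical names.

```python
import xml.etree.ElementTree as ET


class SITNode:
    def __init__(self, tag):
        self.tag = tag
        self.children = []  # merged child nodes, in order of first appearance
        self.ids = []       # document-order ids of all elements merged here

    def signature(self):
        """Tag plus the ordered signatures of the merged children."""
        return (self.tag, tuple(c.signature() for c in self.children))


def merge_into(base, other):
    """Fold an equivalent subtree into the base tree, keeping its ids."""
    base.ids.extend(other.ids)
    # Equal signatures guarantee that the child lists line up pairwise.
    for base_child, other_child in zip(base.children, other.children):
        merge_into(base_child, other_child)


def build_sit(elem, counter=None):
    counter = [0] if counter is None else counter
    node = SITNode(elem.tag)
    node.ids.append(counter[0])
    counter[0] += 1
    seen = {}  # signature -> child already present under this node
    for child in elem:
        subtree = build_sit(child, counter)
        sig = subtree.signature()
        if sig in seen:
            merge_into(seen[sig], subtree)
        else:
            seen[sig] = subtree
            node.children.append(subtree)
    return node


def count_nodes(node):
    return 1 + sum(count_nodes(c) for c in node.children)


if __name__ == "__main__":
    doc = "<a><b><c/><d/></b><b><c/><d/></b><b><c/></b></a>"
    sit = build_sit(ET.fromstring(doc))
    print(count_nodes(sit), [child.ids for child in sit.children])
```

In the usage example the first two b subtrees collapse into one node that remembers both element ids, while the third b keeps its own node because its descendant paths differ.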

Page 17: Managing XML and Semistructured Data

SIT Construction

[figure: example document tree (elements a, b, c, d, e under the root /, node ids 0 to 10) and the corresponding SIT, whose merged nodes list the ids of the eliminated nodes]

Page 18: Managing XML and Semistructured Data

XQzip Architecture

[figure: Input XML Document → SAX Parser → Index Constructor and Compressor (gzip) → XQzip Repository (SIT, hashtable, compressed blocks); Query → Query Processor (Parser, Executor, Buffer Manager with Buffer Pool) → Query Result]

Index Constructor: constructs the SIT

Compressor
• Groups semantically related items in blocks
• Compresses each block with gzip

Query Processor: evaluates the query
• Parser
• Executor: applies the SIT to evaluate the query
• Buffer Manager (LRU)
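The role of the Buffer Manager can be sketched as an LRU pool of decompressed blocks. This is a minimal illustration, not XQzip's implementation: `BlockBuffer` is a hypothetical class, blocks are keyed by an arbitrary id, and `zlib` stands in for gzip.

```python
import zlib
from collections import OrderedDict


class BlockBuffer:
    """Keep the most recently used decompressed blocks in memory."""

    def __init__(self, compressed_blocks, capacity=2):
        self.blocks = compressed_blocks  # block id -> zlib-compressed bytes
        self.capacity = capacity
        self.pool = OrderedDict()        # block id -> decompressed bytes
        self.decompressions = 0          # counts buffer misses

    def get(self, block_id):
        if block_id in self.pool:
            self.pool.move_to_end(block_id)  # mark as most recently used
            return self.pool[block_id]
        data = zlib.decompress(self.blocks[block_id])
        self.decompressions += 1
        self.pool[block_id] = data
        if len(self.pool) > self.capacity:
            self.pool.popitem(last=False)    # evict the least recently used
        return data


if __name__ == "__main__":
    stored = {i: zlib.compress(f"block {i}".encode()) for i in range(3)}
    buffer = BlockBuffer(stored, capacity=2)
    for block_id in (0, 1, 0, 2, 0, 1):
        buffer.get(block_id)
    print(buffer.decompressions)
```

Repeated accesses to a block that is still in the pool cost no decompression, which is the effect the warm buffer-pool measurements later in the deck rely on.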

Page 19: Managing XML and Semistructured Data

SIT Construction Complexity

N: total number of elements in the input XML document

Time complexity:
• Worst case: O(N |SIT|)
• Average case: O(N)

Space complexity:
• Base tree and the subtree being merged: ≤ 2|SIT|
• Space for storing ids of eliminated nodes: O(N)

Page 20: Managing XML and Semistructured Data

Data Compression

A balance between full-chunked and fine-grained compression
• A distinct data container for each distinct element
• Each container compressed (using gzip) into many smaller blocks

Block size?
• Too small: query time ↑, compression ratio ↓
• Too large: query time ↓, compression ratio ↑
• Can only be determined by an empirical study
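The block-size trade-off can be explored with a small sketch. It assumes newline-free string values, uses `zlib` in place of gzip, and measures only compressed size and the cost of fetching one record; it does not model query time. `compress_container` and `fetch` are hypothetical names.

```python
import zlib


def compress_container(values, block_size):
    """Split one data container into blocks of block_size records and compress each."""
    return [
        zlib.compress("\n".join(values[i:i + block_size]).encode("utf-8"))
        for i in range(0, len(values), block_size)
    ]


def fetch(blocks, block_size, index):
    """Read one record by decompressing only the block that holds it."""
    records = zlib.decompress(blocks[index // block_size]).decode("utf-8").split("\n")
    return records[index % block_size]


def total_size(blocks):
    return sum(len(block) for block in blocks)


if __name__ == "__main__":
    values = [f"item-{i % 50}" for i in range(5000)]
    for size in (10, 100, 1000):
        blocks = compress_container(values, size)
        print(size, len(blocks), total_size(blocks))
    print(fetch(compress_container(values, 1000), 1000, 1234))
```

Small blocks give the compressor little context, so the total size grows, while large blocks force more data to be decompressed for a single lookup.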

Page 21: Managing XML and Semistructured Data

Block Size

Representative datasets and queries:

Datasets:
• Heavy text
• Light text
• A mix of heavy text and light text

Queries:
• High selectivity
• Medium selectivity
• Low selectivity

Page 22: Managing XML and Semistructured Data

Block Size

[figure: querying time (sec) versus block size (# data records, 10 to 10000) for SwissProt-L, SwissProt-M, SwissProt-H, XMark-L, XMark-M, XMark-H, OMIM-L, OMIM-M and OMIM-H; plot annotations: 13.6, 12.9, 600]

Page 23: Managing XML and Semistructured Data

Structure of Compressed Data

Block size?
• Determined by an empirical study
• Querying time: near-optimal range is 600-1000 data items/block (average optimum: 950)
• Compression ratio: not improved much beyond 150 KB/block (such blocks usually contain more than 1000 items)
• ≥ 1000 data items/block

Page 24: Managing XML and Semistructured Data

Outline

Introduction
XQzip [EDBT 2004]
• Indexing
• Data Compression
• Query Evaluation
• Performance Evaluation
Conclusion

Page 25: Managing XML and Semistructured Data

XQzip Query Coverage

All XPath axes except the sideways axes (e.g. preceding-sibling, following-sibling)

Multiple and nested predicates
• and / or / not expressions

Aggregations: sum, count, average, max, min

Group queries: e.g. (L1 (L2 + L3 + L4))
• L1: //a[b = "Crete"] (prefix)
• L2: c
• L3: d[f/count() > 100]
• L4: e[//g]

Page 26: Managing XML and Semistructured Data

Query Evaluation

Depth-first traversal of the index tree

Buffer management (LRU)
• Why buffering? Decompression time dominates
• Decompression avoidance

Page 27: Managing XML and Semistructured Data

Outline

Introduction
XQzip
• Indexing
• Data Compression
• Query Evaluation
• Performance Evaluation
Conclusion

Page 28: Managing XML and Semistructured Data

Effectiveness of the SIT

Data Source    Node Reduction   Load Time   Node Selection Acceleration
XMark          1.64%            0.67s       2.15
OMIM           0.24%            0.07s       2.16
DBLP           0.04%            1.62s       2.11
SwissProt      28.38%           5.61s       1.92
Treebank       93.42%           2.26s       1.76
PSD            10.85%           9.97s       2.18
Shakespeare    1.96%            0.07s       2.10
Lineitem       0.002%           0.42s       1.78

Page 29: Managing XML and Semistructured Data

Effectiveness of the SIT

Index size: less than 1% of original size
Load time: a fraction of a second
Node selection acceleration: twice as fast as F&B-Index
Construction time: more than 3 times faster than F&B-Index

Page 30: Managing XML and Semistructured Data

Compression Ratio

[figure: compression ratio (%) of XQzip+, XQzip, XMill, gzip and XGrind on the data sources XMark, OMIM, DBLP, SwissProt, Treebank, PSD, Shakespeare and Lineitem]

XQzip is comparable to XMill and gzip, 17% better than XGrind with index size included, and 42% better than XGrind without the index.

Page 31: Managing XML and Semistructured Data

Compression/Decompression Time

XQzip (compression + index construction) is more than 5 times faster than XGrind and 1.5 times slower than XMill

XQzip (index loading + decompression) is more than 3 times faster than XGrind and 1.4 times slower than XMill

Page 32: Managing XML and Semistructured Data

Data Source      Query  Node Selecting  Partial Decomp.  Result Processing  Querying Time   Querying Time   Querying Time
                        Time (sec)      Time (sec)       Time (sec)         XQzip- (sec)    XQzip+ (sec)    XGrind (sec)
XMark (111MB)    Q1     0.001           ---              0.911              0.913           0.122           22.774
                 Q2     0.001           0.920            0.012              0.934           0.295           23.067
                 Q3     0.001           3.395            0.014              3.411           0.349           35.012
                 Q4     0.003           ---              0.551              0.584           0.118           ---
                 Q5     0.831           4.534            0.010              5.376           1.544           ---
OMIM (24.5MB)    Q1     0.001           ---              0.030              0.032           0.005           3.513
                 Q2     0.001           0.021            0.011              0.034           0.014           4.690
                 Q3     0.001           0.036            0.057              0.095           0.067           6.134
                 Q4     0.005           ---              ---                0.005           0.005           ---
                 Q5     0.012           0.020            0.580              0.613           0.034           ---
DBLP (148MB)     Q1     0.001           ---              0.370              0.381           0.034           19.582
                 Q2     0.001           0.330            0.013              0.345           0.029           26.108
                 Q3     0.033           0.391            8.997              9.541           1.543           50.344
                 Q4     0.001           ---              0.000              0.001           0.001           ---
                 Q5     0.087           1.122            0.260              1.481           0.642           ---

Page 33: Managing XML and Semistructured Data

Query Performance

Cold buffer-pool evaluation
• 13 times better than XGrind

Warm buffer-pool evaluation
• 80 times better than XGrind

Impressive buffer effect!

Page 34: Managing XML and Semistructured Data

Lessons on XML Compression

Good compression ratio and time
• Comparable to that of XMill
• Much better than that of XGrind (and XPRESS)

Supports a very practical set of queries
• A much wider range of queries than XGrind and XPRESS

Very competitive querying time with buffer
• 13 times better than XGrind with cold buffer
• 80 times better than XGrind with warm buffer

Limitations
• Cost of building and maintaining complex indexes
• No theoretical foundation for the block size

Page 35: Managing XML and Semistructured Data

XCQ

XCQ Framework

Experimental Results
• Compression Performance
• Query Performance

Lessons and Development

Page 36: Managing XML and Semistructured Data

XCQ

Objectives:
• Achieve good compression ratio
    Comparable to XMill
    Better than XGrind
• Achieve good query performance
    More efficient than XGrind
    Querying compressed documents with block-based partial decompression
• But addressing issues different from XQzip
    Adopt minimal indexing
    Establish theory relating selectivity and block size

Page 37: Managing XML and Semistructured Data

XCQ Strategy

Based on four techniques
• DTD Tree and SAX Event Stream Parsing (DSP)
• Partition Path-Based Data Grouping (PPB) Format
• Block-Statistic Signature (BSS) Indexing
• Access Methods

[figure: the XCQ Compression Engine (DSP, PPB format, BSS indexing) takes a DTD and an XML Document and produces a Compressed Document; the XCQ Querying Engine (Access Methods) evaluates XPath Queries over it and returns Query Results]

Page 38: Managing XML and Semistructured Data

Technique 1 – DTD Tree and SAX Event Stream Parsing (DSP)

[figure: XCQ architecture diagram (DTD, XML Document, XCQ Compression Engine with DSP, PPB format and BSS indexing, Compressed Document, XCQ Querying Engine with Access Methods, XPath Queries, Query Results)]

Page 39: Managing XML and Semistructured Data

Technique 1 – DTD Tree and SAX Event Stream Parsing (DSP)

Purpose:
• To utilize information in the associated DTD of the document

Benefits:
• Only encode the information that cannot be inferred from the DTD
• Precise path-based grouping of data items
• Runs in an automated manner

Page 40: Managing XML and Semistructured Data

DSP – Input and Output

[figure: the DSP Module takes a DTD Tree and a Stream of SAX Events as input and outputs a Structure Stream and Data Streams]

Page 41: Managing XML and Semistructured Data

DSP Step 1 – Creating a DTD Tree

<!ELEMENT library (entry*)>
<!ELEMENT entry (author, title, year, publisher?, (paper|course_note|book), num_copy)>
<!ELEMENT author EMPTY>
<!ATTLIST author name CDATA>
<!ELEMENT title (#PCDATA)>
<!ELEMENT year (#PCDATA)>
<!ELEMENT publisher (#PCDATA)>
<!ELEMENT paper EMPTY>
<!ELEMENT course_note EMPTY>
<!ELEMENT book EMPTY>
<!ELEMENT num_copy (#PCDATA)>

[figure: DTD tree rooted at library with child entry*; entry has the children author(name), title, year, publisher?, a choice node | over paper, course_note and book, and num_copy; the key marks PCDATA nodes]


Page 43: Managing XML and Semistructured Data

DSP Step 2 – Processing in DSP Module

How does the DSP module process the following XML document?

<library>
  <entry>
    <author name="Tom"/>
    <title>Introduction to &#34;OS&#34;</title>
    <year>2003</year>
    <course_note/>
    <num_copy>3</num_copy>
  </entry>
</library>

Page 44: Managing XML and Semistructured Data

SAX Event: Start element – "library"

Structure Stream: (empty)

Data Streams: (empty)

[figure: DTD tree with the traversal path at the library node]

Page 45: Managing XML and Semistructured Data

SAX Event: Start element – "entry"

Match with entry* in the DTD tree: T is written to the structure stream.

Structure Stream: T

Data Streams: (empty)

Page 46: Managing XML and Semistructured Data

SAX Events: Start element – "author", att0: name="Tom"; End element – "author"

Match with author(name) in the DTD tree: d0 is written to the structure stream and the attribute value to data stream d0.

Structure Stream: T, d0

Data Streams:
d0: Tom

Page 47: Managing XML and Semistructured Data

SAX Events: Start element – "title"; PCDATA – "Introduction to &#34;OS&#34;"; End element – "title"

Structure Stream: T, d0, d1

Data Streams:
d0: Tom
d1: Introduction to &#34;OS&#34;

Page 48: Managing XML and Semistructured Data

SAX Events: Start element – "year"; PCDATA – "2003"; End element – "year"; Start element – "course_note"

No match with publisher? in the DTD tree: F is written to the structure stream.

Structure Stream: T, d0, d1, d2, F

Data Streams:
d0: Tom
d1: Introduction to &#34;OS&#34;
d2: 2003

Page 49: Managing XML and Semistructured Data

SAX Events: Start element – "course_note"; End element – "course_note"

Choice node | with alternatives p0 (paper), p1 (course_note), p2 (book): no match with paper, match with course_note, so p1 is written to the structure stream.

Structure Stream: T, d0, d1, d2, F, p1

Data Streams:
d0: Tom
d1: Introduction to &#34;OS&#34;
d2: 2003

Page 50: Managing XML and Semistructured Data

SAX Events: Start element – "num_copy"; PCDATA – "3"; End element – "num_copy"; End element – "entry"

Structure Stream: T, d0, d1, d2, F, p1

Data Streams:
d0: Tom
d1: Introduction to &#34;OS&#34;
d2: 2003
d4: 3

Page 51: Managing XML and Semistructured Data

DSP Step 3 – Generated Output

Structure Stream

Keys for path-based grouped Data Streams:
d0: /library/entry/author/@name
d1: /library/entry/title/text()
d2: /library/entry/year/text()
d3: /library/entry/publisher/text()
d4: /library/entry/num_copy/text()

[figure: the structure stream and the data streams d0 to d4]
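The walk-through can be approximated in code. The sketch below is a loose illustration, not the XCQ implementation: the content model of entry is hard-coded from the DTD on the slides, the document is read with `ElementTree` rather than a SAX event stream, and two details are assumptions because the slides do not show them: the d4 token for num_copy and the final F that closes the entry* repetition. `ENTRY_MODEL` and `dsp_encode` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Content model of <entry>, transcribed by hand from the DTD on the slides.
ENTRY_MODEL = [
    ("attr", "author", "name", "d0"),
    ("text", "title", "d1"),
    ("text", "year", "d2"),
    ("opt", "publisher", "d3"),
    ("choice", ["paper", "course_note", "book"]),
    ("text", "num_copy", "d4"),
]

DOCUMENT = (
    '<library><entry><author name="Tom"/>'
    "<title>Introduction to &#34;OS&#34;</title><year>2003</year>"
    "<course_note/><num_copy>3</num_copy></entry></library>"
)


def dsp_encode(xml_text):
    """Return (structure stream, data streams) for a library document."""
    structure = []
    data = {f"d{i}": [] for i in range(5)}
    for entry in ET.fromstring(xml_text).findall("entry"):
        structure.append("T")  # one more occurrence of entry*
        for rule in ENTRY_MODEL:
            kind = rule[0]
            if kind == "attr":
                _, tag, attribute, stream = rule
                data[stream].append(entry.find(tag).get(attribute))
                structure.append(stream)
            elif kind == "text":
                _, tag, stream = rule
                data[stream].append(entry.find(tag).text)
                structure.append(stream)
            elif kind == "opt":
                _, tag, stream = rule
                node = entry.find(tag)
                if node is None:
                    structure.append("F")  # optional element absent
                else:
                    structure.append("T")
                    data[stream].append(node.text)
                    structure.append(stream)
            elif kind == "choice":
                _, tags = rule
                for position, tag in enumerate(tags):
                    if entry.find(tag) is not None:
                        structure.append(f"p{position}")
                        break
    structure.append("F")  # assumption: marks the end of the entry* repetition
    return structure, data


if __name__ == "__main__":
    stream, streams = dsp_encode(DOCUMENT)
    print(stream)
    print(streams)
```

Only the choices the DTD leaves open (is publisher present, which alternative was taken, is there another entry) reach the structure stream; everything else is implied by the DTD tree.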

Page 52: Managing XML and Semistructured Data

Technique 2 – Partition Path-Based Data Grouping (PPB) Format

[figure: XCQ architecture diagram (DTD, XML Document, XCQ Compression Engine with DSP, PPB format and BSS indexing, Compressed Document, XCQ Querying Engine with Access Methods, XPath Queries, Query Results)]

Page 53: Managing XML and Semistructured Data

Technique 2 – Partition Path-Based Data Grouping (PPB) Format

Purpose:
• To partition the data streams into blocks
    Each block contains a number of data items

Benefits:
• Each block can be compressed and decompressed as an individual unit
• Supports partial decompression during query processing

Page 54: Managing XML and Semistructured Data

Technique 2 – Partition Path-Based Data Grouping (PPB) Format

Structure Stream

Keys for path-based grouped Data Streams:
d0: /library/entry/author/@name
d1: /library/entry/title/text()
d2: /library/entry/year/text()
d3: /library/entry/publisher/text()
d4: /library/entry/num_copy/text()

[figure: the structure stream and the data streams d0 to d4]

Page 55: Managing XML and Semistructured Data

Technique 2 – Partition Path-Based Data Grouping (PPB) Format

• A cost model is developed for PPB
• The relationship between block size, processing cost and selectivity can be derived
• Further modelling is possible

Page 56: Managing XML and Semistructured Data

Two layers

[figure]

Page 57: Managing XML and Semistructured Data

n layers

[figure]

Page 58: Managing XML and Semistructured Data

Technique 3 – Block-Statistic Signature (BSS) Indexing

[figure: XCQ architecture diagram (DTD, XML Document, XCQ Compression Engine with DSP, PPB format and BSS indexing, Compressed Document, XCQ Querying Engine with Access Methods, XPath Queries, Query Results)]

Page 59: Managing XML and Semistructured Data

Technique 3 – Block-Statistic Signature (BSS) Indexing

Purpose: to avoid accessing irrelevant data blocks during querying, which saves
• I/O cost
• Decompression overhead
• Time to scan the data inside the block

Details
• Statistic summary (signature) for each block: Min, Max, Sum and Count
• Benefit: requires little processing time and storage space
• Research status: supports numerical data only

Page 60: Managing XML and Semistructured Data

Technique 3 – Block-Statistic Signature (BSS) Indexing

Compressed Data Blocks      Block Statistic Signatures
0, 1210, 100, 10000, 10     Min: 0, Max: 10000, Sum: 11320, Count: 5
0, 10, 18, 27, 5            Min: 0, Max: 27, Sum: 60, Count: 5
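Computing a signature is a single pass over a block. The sketch below is illustrative: `Signature` and `block_signature` are hypothetical names, and the two blocks are the values read off the slide (0, 1210, 100, 10000, 10 and 0, 10, 18, 27, 5).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signature:
    minimum: int
    maximum: int
    total: int
    count: int


def block_signature(values):
    """Summarize one block of numerical data items."""
    return Signature(min(values), max(values), sum(values), len(values))


if __name__ == "__main__":
    print(block_signature([0, 1210, 100, 10000, 10]))
    print(block_signature([0, 10, 18, 27, 5]))
```

The signature is stored next to the compressed block, so it can be consulted without decompressing the block itself.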

Page 61: Managing XML and Semistructured Data

Technique 3 – Block-Statistic Signature (BSS) Indexing

Comparing the interval of the index with the interval of the selection predicate:
• No overlapping (out of range): block contains no relevant data
• With overlapping(s): block contains relevant data
• With overlapping (covered): block contains relevant data
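The three cases reduce to comparing two closed intervals. In the sketch below, `classify` is a hypothetical helper, the index interval is taken to be the block's [Min, Max], and "covered" is read as the block interval lying entirely inside the predicate interval. With Min and Max alone, a partial overlap only shows that the block may hold matching items, so such a block still has to be decompressed and scanned.

```python
def classify(block_min, block_max, pred_low, pred_high):
    """Compare a block's [min, max] with a range predicate [pred_low, pred_high]."""
    if block_max < pred_low or block_min > pred_high:
        return "skip"     # no overlapping: the block cannot match
    if pred_low <= block_min and block_max <= pred_high:
        return "covered"  # every item in the block satisfies the predicate
    return "overlap"      # partial overlap: the block must be inspected


if __name__ == "__main__":
    print(classify(0, 27, 100, 200))
    print(classify(0, 27, 0, 50))
    print(classify(0, 10000, 20, 50))
```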

Page 62: Managing XML and Semistructured Data

Technique 4 – Access Methods

[figure: XCQ architecture diagram (DTD, XML Document, XCQ Compression Engine with DSP, PPB format and BSS indexing, Compressed Document, XCQ Querying Engine with Access Methods, XPath Queries, Query Results)]

Page 63: Managing XML and Semistructured Data

Technique 4 – Access Methods

Purpose
• To realize partial decompression during query processing

4 types of queries
• Selection queries
• Structural queries
• Structure-based aggregation queries
• Path-based aggregation queries

Page 64: Managing XML and Semistructured Data

Technique 4 – Access Methods: Selection Queries

//entry[author/@name="Jess" and publisher/text()="ABC"]

[figure: the structure stream and the path-based grouped data streams d0 to d4 (d0: /library/entry/author/@name, d1: /library/entry/title/text(), d2: /library/entry/year/text(), d3: /library/entry/publisher/text(), d4: /library/entry/num_copy/text())]

Page 65: Managing XML and Semistructured Data

Technique 4 – Access Methods: Structural Queries

/library/entry/author

[figure: the structure stream and the path-based grouped data streams d0 to d4 (d0: /library/entry/author/@name, d1: /library/entry/title/text(), d2: /library/entry/year/text(), d3: /library/entry/publisher/text(), d4: /library/entry/num_copy/text())]

Page 66: Managing XML and Semistructured Data

Technique 4 – Access Methods: Structure-Based Aggregation Queries

count(//entry)

[figure: the structure stream and the path-based grouped data streams d0 to d4 (d0: /library/entry/author/@name, d1: /library/entry/title/text(), d2: /library/entry/year/text(), d3: /library/entry/publisher/text(), d4: /library/entry/num_copy/text())]

Page 67: Managing XML and Semistructured Data

Technique 4 – Access Methods: Path-Based Aggregation Queries

sum(//num_copy/text()=1)

[figure: the structure stream and the path-based grouped data streams d0 to d4 (d0: /library/entry/author/@name, d1: /library/entry/title/text(), d2: /library/entry/year/text(), d3: /library/entry/publisher/text(), d4: /library/entry/num_copy/text())]
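A path-based aggregation over one data stream can use the block signatures to limit decompression. The sketch below is one possible reading of that idea, not XCQ's access method: it sums the items of a numeric stream that fall in a range, takes the stored Sum for fully covered blocks, skips out-of-range blocks, and decompresses only partially overlapping ones. `build_blocks` and `range_sum` are hypothetical names and `zlib` stands in for the real block compressor.

```python
import zlib


def build_blocks(values, block_size):
    """Partition a numeric data stream into (signature, compressed block) pairs."""
    blocks = []
    for start in range(0, len(values), block_size):
        chunk = values[start:start + block_size]
        signature = (min(chunk), max(chunk), sum(chunk), len(chunk))
        payload = zlib.compress(",".join(map(str, chunk)).encode("ascii"))
        blocks.append((signature, payload))
    return blocks


def range_sum(blocks, low, high):
    """Sum all items v with low <= v <= high; also report how many blocks were decompressed."""
    total, decompressed = 0, 0
    for (block_min, block_max, block_sum, _count), payload in blocks:
        if block_max < low or block_min > high:
            continue                 # out of range: skip the block
        if low <= block_min and block_max <= high:
            total += block_sum       # covered: the signature alone is enough
            continue
        decompressed += 1            # partial overlap: scan the block
        items = zlib.decompress(payload).decode("ascii").split(",")
        total += sum(v for v in map(int, items) if low <= v <= high)
    return total, decompressed


if __name__ == "__main__":
    stream = [0, 10, 18, 27, 5, 0, 1210, 100, 10000, 10]
    blocks = build_blocks(stream, block_size=5)
    print(range_sum(blocks, 0, 50))
    print(range_sum(blocks, 20000, 30000))
```

In the usage example the first query needs only one of the two blocks to be decompressed and the second needs none.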

Page 68: Managing XML and Semistructured Data

Experiment Context

Compressors under study
• gzip, XMill, XGrind, XCQ

Datasets

Document       Size      Data-Centric/Document-Centric   Regularity (Relative Level)
Weblog         89 MB     Data-Centric                    5
SwissProt      32 MB     Data-Centric                    3
DBLP           41 MB     Data-Centric                    2
TPC-H          32 MB     Data-Centric                    6
XMark          104 MB    Data-Centric                    4
Shakespeare    8 MB      Document-Centric                1

Page 69: Managing XML and Semistructured Data

Experiment – Compression Performance

Compression Performance
• gzip, XMill, XCQ (No Partition) and XGrind
• Scalability
• XCQ
    Partitioning
    BSS Indexing overhead

Objective: comparable to XMill and better than XGrind

Page 70: Managing XML and Semistructured Data

Compression Ratios

[figure: compression ratio (bits/byte) of gzip, XMill, XCQ and XGrind]

Page 71: Managing XML and Semistructured Data

Compression Times

[figure: compression time (s) of gzip, XMill, XCQ and XGrind on Weblog, SwissProt, DBLP, TPC-H, XMark and Shakespeare]

Page 72: Managing XML and Semistructured Data

Decompression Times

[figure: decompression time (s) of gzip, XMill, XCQ and XGrind on Weblog, SwissProt, DBLP, TPC-H, XMark and Shakespeare]

Page 73: Managing XML and Semistructured Data

Experiment – Compression Performance

Compression Performance
• gzip, XMill, XCQ and XGrind
• Scalability
• XCQ
    Partitioning
    BSS Indexing overhead

Result: comparable to XMill

Page 74: Managing XML and Semistructured Data

Scalability: Compressed Sizes

[figure, three panels, each plotting gzip, XMill, XCQ and XGrind against input document size (MB): compressed document size (MB), compression time (s), decompression time (s)]

Page 75: Managing XML and Semistructured Data

Experiment – Compression Performance

Compression Performance
• gzip, XMill, XCQ (No Partition) and XGrind
• Scalability
• XCQ
    Partitioning
    BSS Indexing

Result: overheads introduced are low

Page 76: Managing XML and Semistructured Data

Experiment Results – Partitioning Effect on XCQ Compression

[figure, three panels, each plotting XCQ with partition, XCQ without partition and XMill against block size (records/block, 0 to 2000): compression ratio (bits/byte), compression time (s), decompression time (s)]

Page 77: Managing XML and Semistructured Data

Experiment Results – BSS Indexing Effect on XCQ Compression

[figure, three panels, each plotting XCQ with BSS and without BSS against block size (records/block, 0 to 5000): compression ratio (bits/byte), compression time (s), decompression time (s)]

Page 78: Managing XML and Semistructured Data

Experiment – Compression Performance

Query Performance
• Different block sizes have impact!
• XCQ vs XGrind

Result: choose a good block size

Page 79: Managing XML and Semistructured Data

Experiment Results – Query performance: Selection queries

[figure: Processing Low Selectivity Queries; response time (s) versus block size (records/block, 0 to 3000) for selectivities 0.01%, 0.05%, 0.08%, 0.10%, 0.40%, 0.60% and 0.80%]

Page 80: Managing XML and Semistructured Data

Experiment Results – Query performance: Selection queries

[figure: Processing High Selectivity Queries; response time (s) versus block size (records/block, 0 to 4000) for selectivities 1%, 10%, 50% and 75%]

Page 81: Managing XML and Semistructured Data

Experiment Results – Query performance: Structural Query and Structure-Based Aggregation Query

[figure, two panels, each plotting response time (s) against block size (records/block, 0 to 600): Processing Structural Query, Processing Structure-Based Aggregation Query]

Page 82: Managing XML and Semistructured Data

82

Experiment Results – Query performance: Path-Based Aggregation Query

[Figure: Processing Path-Based Aggregation Queries. x-axis: Block Size (records/block), 0 to 3000; y-axis: Response Time (s), -0.5 to 2.5; one curve per selectivity: 0.01%, 0.05%, 0.10%, 0.40%, 1.00%, 10.00%, 50.00%, 75.00%, 100.00%]

Page 83: Managing XML and Semistructured Data

83

Experiment – Compression Performance

Query Performance
• Different block sizes
• XCQ vs XGrind

Objective: how to choose a good block size?

A few hundred elements

Page 84: Managing XML and Semistructured Data

84

Experiment – Compression Performance

Query Performance
• Different block sizes
• XCQ vs XGrind

Objective: more efficient query performance

Page 85: Managing XML and Semistructured Data

85

Experiment Results – XCQ vs XGrind (Data Centric Documents)

[Figure: four plots of Response Time (s) against Selectivity (%), each with the series XCQ - exact, XGrind - exact, XCQ - range, XGrind - range]
• TPC-H: response time axis 0 to 14; selectivities 0.01%, 0.40%, 1.00%, 10.00%, 50%, 75.00%
• XMark: response time axis 0 to 35; selectivities 0.01%, 0.40%, 1.00%, 10.00%, 50%, 75.00%
• Weblog: response time axis 0 to 25; selectivities 0.01%, 0.04%, 1%, 10%, 50%, 75%
• DBLP: response time axis 0 to 30; selectivities 0.02%, 0.40%, 1%, 10%, 50%, 75%

Page 86: Managing XML and Semistructured Data

86

Experiment Results – XCQ vs XGrind (Document Centric Document)

[Figure: Shakespeare. x-axis: Selectivity (%), at 0.01%, 0.40%, 1.00%, 10.00%, 15.00%, 20.00%, 35.00%, 50.00%, 75.00%; y-axis: Response Time (s), 0 to 3.5; series: XCQ - exact, XGrind - exact, XCQ - range, XGrind - range]

Page 87: Managing XML and Semistructured Data

87

Lessons and Development

XCQ Framework
• Developed techniques: DSP, PPG document format, BSS indexing, access methods

Benefits of XCQ from experimental results
• Simple indexing, mathematical foundation
• Compression performance: comparable to XMill
• Query performance: better than XGrind for data-centric documents; comparable to XGrind for document-centric documents

Page 88: Managing XML and Semistructured Data

88

Multi-query evaluation of Compressed Data over network

Widespread XML documents in remote locations
• Large scale
• XML verbosity

Traditional XML query processing
• One by one on a standalone system
• Original result fragments or whole documents are forwarded

Heavy bandwidth costs for Internet or poor processing efficiency

Motivations:
• Provide efficient query evaluation on compressed XML data
• Reduce bandwidth consumption in result publication

Page 89: Managing XML and Semistructured Data

89

Architecture

Composed of the server and a group of clients

On the server side
• A large-scale XML document
• Largest results are directed to the nearest clients
• Under compression

Co-operative clients
• Further dissemination of XML data to remote clients is possible

[Figure: Clients A to I send their queries QA to QI to the server (Query Submission, Qi); the server returns results to the clients (Result Publication)]

Page 90: Managing XML and Semistructured Data

90

Preliminaries - XPress

XPress
• For tags: reverse arithmetic encoding (tags are encoded into numerical intervals)
• For text: dictionary & Huffman encoder
• Compared with XGrind: higher compression ratio, more efficient query evaluation
• Less decompression needed

Page 91: Managing XML and Semistructured Data

91

Preliminaries - Interval Encoding

Reverse arithmetic encoding
• Adopted to compress tags in XPress

Element            a           b           c
Probability        0.3         0.3         0.4
Original interval  [0.0, 0.3)  [0.3, 0.6)  [0.6, 1.0)

Page 92: Managing XML and Semistructured Data

92

Preliminaries - Interval Encoding

Reverse arithmetic encoding
• Adopted to compress tags in XPress
• The interval of “/a/c” is [0.6 + 0.4*0.0, 0.6 + 0.4*0.3) = [0.6, 0.72)

Element            a           b           c
Probability        0.3         0.3         0.4
Original interval  [0.0, 0.3)  [0.3, 0.6)  [0.6, 1.0)

(Callout: original interval of c, [0.6, 1.0))

Page 93: Managing XML and Semistructured Data

93

Preliminaries - Interval Encoding

Reverse arithmetic encoding
• Adopted to compress tags in XPress
• The interval of “/a/c” is [0.6 + 0.4*0.0, 0.6 + 0.4*0.3) = [0.6, 0.72)

Element            a           b           c
Probability        0.3         0.3         0.4
Original interval  [0.0, 0.3)  [0.3, 0.6)  [0.6, 1.0)

(Callout: probability of c, 0.4)

Page 94: Managing XML and Semistructured Data

94

Preliminaries - Interval Encoding

Reverse arithmetic encoding
• Adopted to compress tags in XPress
• The interval of “/a/c” is [0.6 + 0.4*0.0, 0.6 + 0.4*0.3) = [0.6, 0.72)

Element            a           b           c
Probability        0.3         0.3         0.4
Original interval  [0.0, 0.3)  [0.3, 0.6)  [0.6, 1.0)

(Callout: original interval of a, [0.0, 0.3))

Page 95: Managing XML and Semistructured Data

95

Preliminaries - Interval Encoding

Reverse arithmetic encoding
• Adopted to compress tags in XPress
• The interval of “/a/c” is [0.6 + 0.4*0.0, 0.6 + 0.4*0.3) = [0.6, 0.72)
• The interval of “//c” is [0.6, 1.0)

Element            a           b           c
Probability        0.3         0.3         0.4
Original interval  [0.0, 0.3)  [0.3, 0.6)  [0.6, 1.0)

Page 96: Managing XML and Semistructured Data

96

Preliminaries - Interval Encoding

Reverse arithmetic encoding
• Adopted to compress tags in XPress
• The interval of “/a/c” is [0.6 + 0.4*0.0, 0.6 + 0.4*0.3) = [0.6, 0.72)
• The interval of “//c” is [0.6, 1.0)
• “//c” is a suffix of “/a/c”, so the interval of “//c” contains the interval of “/a/c”

Element            a           b           c
Probability        0.3         0.3         0.4
Original interval  [0.0, 0.3)  [0.3, 0.6)  [0.6, 1.0)
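
The interval arithmetic on this slide can be reproduced with a short sketch. The function names (`base_intervals`, `path_interval`, `contains`) are hypothetical, and the extension to paths longer than two tags is an assumption that simply repeats the slide's narrowing step; this is not the XPress implementation.

```python
from fractions import Fraction

# Tag probabilities from the slide's table (exact fractions avoid float rounding).
PROB = {"a": Fraction(3, 10), "b": Fraction(3, 10), "c": Fraction(4, 10)}

def base_intervals(prob):
    """Give each tag a sub-interval of [0, 1) whose width is its probability."""
    intervals, low = {}, Fraction(0)
    for tag, p in prob.items():
        intervals[tag] = (low, low + p)
        low += p
    return intervals

def path_interval(path, prob=PROB):
    """Interval of a simple path, e.g. ['a', 'c'] for /a/c.

    Assumption: tags are processed from the last one back towards the root,
    each step narrowing the current interval by that tag's original interval.
    """
    base = base_intervals(prob)
    low, high = Fraction(0), Fraction(1)
    for tag in reversed(path):
        width = high - low
        t_low, t_high = base[tag]
        low, high = low + width * t_low, low + width * t_high
    return low, high

def contains(outer, inner):
    """True if interval `inner` lies inside interval `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]

ac = path_interval(["a", "c"])  # /a/c
c = path_interval(["c"])        # //c
print(float(ac[0]), float(ac[1]), contains(c, ac))
```

Because the last tag fixes the outermost interval, any path ending in c falls inside [0.6, 1.0), which is why a suffix query such as “//c” can be answered by an interval containment test.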

Page 97: Managing XML and Semistructured Data

97

Preliminaries - XML Containment

Query evaluation on compressed document
• XP{/, //, *}
• Queries QA and QB submitted by clients CA and CB

Page 98: Managing XML and Semistructured Data

98

Preliminaries - XML Containment

Query evaluation on compressed document
• XP{/, //, *}
• Queries QA and QB submitted by clients CA and CB

XPath Containment
• If QA’s result is always contained by QB’s for every XML document, then QB contains QA.

Page 99: Managing XML and Semistructured Data

99

Preliminaries - XML Containment

Query evaluation on compressed document
• XP{/, //, *}
• Queries QA and QB submitted by clients CA and CB

XPath Containment
• If QA’s result is always contained by QB’s for every XML document, then QB contains QA.

Application in our scenario
• If QB contains QA, then the result of QA can be published by CB.
• Classify queries based on the containment relationship
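
To make the containment definition concrete, the sketch below evaluates linear XP{/, //, *} paths over a tiny tree and compares result sets. The tree encoding (nested `(tag, children)` tuples), the `evaluate` function and the sample document are all assumptions for illustration. Note that a subset check on one document is only a necessary condition: true containment must hold for every XML document.

```python
import re

def evaluate(path, doc):
    """Return the pre-order ids of the nodes selected by a linear path over {/, //, *}."""
    # Flatten the tree; node 0 is a virtual document node above the root element.
    tags, children = ["#doc"], [[]]

    def add(node, parent):
        tag, kids = node
        nid = len(tags)
        tags.append(tag)
        children.append([])
        children[parent].append(nid)
        for kid in kids:
            add(kid, nid)

    add(doc, 0)

    def descendants(n):
        for child in children[n]:
            yield child
            yield from descendants(child)

    context = {0}
    # Each step is a separator ('/' = child, '//' = descendant) plus a name test.
    for axis, name in re.findall(r"(//|/)([^/]+)", path):
        selected = set()
        for n in context:
            candidates = descendants(n) if axis == "//" else children[n]
            selected.update(c for c in candidates if name == "*" or tags[c] == name)
        context = selected
    return context

# Sample document: <a><c><d/></c><d><e/></d></a>, ids assigned in pre-order.
doc = ("a", [("c", [("d", [])]), ("d", [("e", [])])])
qa, qb = evaluate("/a/c/d", doc), evaluate("/a//d", doc)
print(qa <= qb)  # on this document, every result of /a/c/d is also a result of /a//d
```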

Page 100: Managing XML and Semistructured Data

100

Our approach

• Query-Index-Tree (QIT)
• QIT Construction
• Multi-Query Evaluation
• Sub-Index Construction for Clients

Page 101: Managing XML and Semistructured Data

101

Query-Index-Tree (QIT)

Built at the server side
• Each node corresponds to a query
• Explores the containment relationship among ancestors and descendants
• Marks all result locations as indices

Target
• Based on the hierarchical level of the QIT: evaluate queries, route result fragments

Page 102: Managing XML and Semistructured Data

102

A QIT Example

[Figure: QIT over the compressed document at the server. Each node is labelled (Query ID, begin, end, P/I), where begin is the beginning point in the document, end is the ending point in the document, and P/I means precise or imprecise.]

Node rows, top to bottom:
QA
QC, QD, QG
QB, QF, QH, QE
QI

QA = /a
QB = /a/c/d
QC = /a/*/d
QD = /a//e
QE = /a/d/q
QF = /a/c/*/e
QG = /a/d
QH = /a/*/d/e
QI = /a/d/q/e


Page 106: Managing XML and Semistructured Data

106

QIT Construction

Recursive classification: all submitted queries form the descendant set of the root

Page 107: Managing XML and Semistructured Data

107

QIT Construction

Recursive classification: QA contains all other queries

Page 108: Managing XML and Semistructured Data

108

QIT Construction

Recursive classification: the classification is repeated within QA’s descendant set

Page 109: Managing XML and Semistructured Data

109

QIT Construction

Recursive classification: each class has a query containing the others

Page 110: Managing XML and Semistructured Data

110

QIT Construction

Recursive classification: classification continues until the leaves are reached
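
The recursive classification can be sketched as follows. This is an illustrative reading of the slides, not the authors' algorithm: `build_qit` is a hypothetical name, the tree is a plain nested dict, and the containment test is a stand-in (one simple path contains another when it is a proper prefix of it, so the inner result fragments are nested in the outer ones). Any transitive, acyclic containment test could be passed in instead.

```python
def prefix_contains(outer, inner):
    """Stand-in containment test (assumption): outer's steps are a proper prefix of inner's."""
    o, i = outer.strip("/").split("/"), inner.strip("/").split("/")
    return len(o) < len(i) and i[:len(o)] == o

def build_qit(queries, contains=prefix_contains):
    """Classify queries recursively into a nested dict {query: subtree}."""
    # Class representatives: queries that no other query in this set contains.
    tops = [q for q in queries
            if not any(contains(p, q) for p in queries if p != q)]
    classes = {t: [] for t in tops}
    for q in queries:
        if q in classes:
            continue
        # Put each remaining query into the first class whose representative contains it.
        for t in tops:
            if contains(t, q):
                classes[t].append(q)
                break
    # Repeat the classification inside every class until only leaves remain.
    return {t: build_qit(members, contains) for t, members in classes.items()}

tree = build_qit(["/a", "/a/d", "/a/d/q", "/a/d/q/e", "/a/c"])
print(tree)
```

A query contained by several representatives is placed under the first one here; how such ties are resolved in the actual system is not stated on the slides.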

Page 111: Managing XML and Semistructured Data

111

Preprocess for Multi-Query Evaluation

On the server side, over the compressed document
• How to evaluate queries using the QIT
• How to support intermediate clients in locating results

Tags are encoded into intervals
• To avoid decompression in query processing
• Interval translation: simple path → interval; complex path → simple paths → intervals
• Examples:
  “/a/b//c/d” → “/a/b” & “/c/d”
  “/a/b/*/c/d” → “/a/b”, “*” & “/c/d”
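
The splitting of a complex path into simple paths, as in the two examples, can be sketched like this. `decompose` is a hypothetical helper; like the slide's examples, it does not record whether two pieces were joined by “//” or by “*”, which a real evaluator would also need to track.

```python
def decompose(path):
    """Split a path over {/, //, *} into simple sub-paths and '*' markers."""
    parts = []
    for chunk in path.split("//"):  # '//' ends one simple path and starts the next
        current = []
        for step in chunk.split("/"):
            if not step:
                continue  # skip the empty string produced by a leading '/'
            if step == "*":
                # A wildcard closes the simple path collected so far.
                if current:
                    parts.append("/" + "/".join(current))
                    current = []
                parts.append("*")
            else:
                current.append(step)
        if current:
            parts.append("/" + "/".join(current))
    return parts

print(decompose("/a/b//c/d"))
print(decompose("/a/b/*/c/d"))
```

Each simple sub-path returned here could then be translated into a single interval, so that matching proceeds on intervals without decompressing the tags.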

Page 112: Managing XML and Semistructured Data

112

Experiment - Overall Cost Savings

Compare with linear query processing (without QIT)
• Saving Ratio

Page 113: Managing XML and Semistructured Data

113

Collaborative Processing

• A co-operative framework for multi-query processing over compressed XML data
• Keeps results under compression to save bandwidth
• Proposes the QIT and its construction algorithm

Future work
• QIT is not enough for handling complex XPath
• Subscribed queries and non-subscribed queries
• XPath queries and XPath FT queries

Page 114: Managing XML and Semistructured Data

114

Papers: Compression

• XMILL: An Efficient Compressor for XML Data by Liefke and Suciu, in SIGMOD'2001
• P. M. Tolani and J. R. Haritsa. XGRIND: A Query-friendly XML Compressor. IEEE ICDE Conf., pp. 225-234, 2002.
• M. Girardot and N. Sundaresan. Millau: an encoding format for efficient representation and exchange of XML over the Web. WWW Conf., pp. 747-765, 2000.
• H. Ishikawa, S. Yokoyama, S. Isshiki and M. Ohta. Project Xanadu: XML- and Active-Database-Unified Approach to Distributed E-Commerce. Int. Workshop on DEXA, 2001.
• A. Arion, A. Bonifati, G. Costa, S. D’Aguanno, I. Manolescu, A. Pugliese. Efficient Query Evaluation over XML Compressed Data. EDBT 2004.
• JunKi Min, MyungJae Park, ChinWan Chung. XPRESS: A Queriable Compression for XML Data. EDBT 2004.

Page 115: Managing XML and Semistructured Data

115

Our publications for XML compression

• Xiaoling WANG, Aoying ZHOU, Juzhen HE and Wilfred NG. MQX: Multi-Query Processing Engine for Compressed XML Data. International Conference on Information Retrieval. ACM SIGIR 2007, Amsterdam, Holland (Demonstration Paper), pp. 897, (2007).

• Wilfred NG, Ho-Lam LAU and Aoying ZHOU. Divide, Compress and Conquer: Querying XML via Partitioned Path-Based Compressed Data Blocks. Accepted and to appear: World Wide Web Journal, (2006).

• Juzhen HE, Wilfred NG, Xiaoling WANG and Aoying ZHOU. An Efficient Co-operative Framework for Multi-Query Processing over Compressed XML Data. International Conference of Database Systems for Advanced Applications. DASFAA 2006, Lecture Notes in Computer Science Vol. 3882, Singapore, pp. 218-232, (2006).

• Wilfred NG, Wai-Yeung LAM, Peter WOOD and Mark LEVENE. XCQ: A Queriable XML Compression System. Accepted and to appear: An International Journal of Knowledge and Information Systems, (2005).

• Wilfred NG, Wai-Yeung LAM and James CHENG. Comparative Analysis of XML Compression Technologies. Accepted and to appear: World Wide Web Journal: Internet and Web Information Systems, (2005).

• James CHENG and Wilfred NG. XQzip: Querying Compressed XML Using Structural Indexing. International Conference on Extending Database Technology EDBT 2004, Lecture Notes in Computer Science Vol. 2992, Heraklion, Crete, Greece, pp. 219-236, (2004).

• Wai-Yeung LAM, Wilfred NG, Peter WOOD and Mark LEVENE. XCQ: XML Compression and Querying System. Poster Proceedings of the World Wide Web WWW'2003, Budapest, (2003).