
Divide, Compress and Conquer: Querying XML via Partitioned Path-Based Compressed Data Blocks

Wilfred Ng  Ho-Lam Lau
Department of Computer Science and Engineering
The Hong Kong University of Science and Technology, Hong Kong
{wilfred, lauhl}@cs.ust.hk

Aoying Zhou
Department of Computer Science and Engineering
Fudan University, Shanghai
[email protected]

Abstract

We propose a novel Partition Path-Based (PPB) grouping strategy to store compressed XML data in a stream of blocks. In addition, we employ a minimal indexing scheme called Block Statistic Signature (BSS) on the compressed data, which is a simple but effective technique to support evaluation of selection and aggregate XPath queries over the compressed data. We present a formal analysis and empirical study of these techniques. The BSS indexing is first extended into effective Cluster Statistic Signature (CSS) and Multiple-Cluster Statistic Signature (MSS) indexing by establishing more layers of indexes. We analyze how the response time is affected by various parameters involved in our compression strategy, such as the data stream block size, the number of cluster layers, and the query selectivity. We also gain further insight into the compression and querying performance by studying the optimal block size in a stream, which leads to the minimum processing cost for queries. The cost model analysis provides a solid foundation for predicting the querying performance. Finally, we demonstrate that our PPB grouping and indexing strategies are not only efficient enough to support path-based selection and aggregate queries over the compressed XML data, but they also require relatively low computation time and storage space when compared with other state-of-the-art compression strategies.

Index Terms: Data Compression, Query processing, Cost Model, Markup Languages

1 Introduction

XML is, by nature, verbose, since repeated tags and structures are needed to specify element items in the document. As a result, XML data are larger in size than the original data format.


For example, the file size of the Weblog obtained from [24] expands by three times when its log format is converted to XML. The size problem of XML documents hinders their practical usage, since the size substantially increases the costs of storing, processing and exchanging data.

In this paper, we propose a Partition Path-Based (PPB) data grouping strategy to address the size problem. The term "stream" used in our subsequent discussion originates from our XCQ compression methodology proposed in [12]. In a nutshell, given an XML document, we achieve the compression using a DTD Tree and SAX Event Stream Parsing (DSP) technique. This DSP algorithm takes as input the DTD tree and the SAX event stream created by the DTD Tree Building module and the SAX Parser module, respectively. The objectives of the DSP algorithm are to extract the structural information [18] from the input XML document that cannot be inferred from the DTD during the parsing process, and to group data elements in the document based on their corresponding tree paths in the DTD tree.

By structural information, we mean the information necessary to reconstruct the tree structure of the XML document. By data elements, we mean the attributes and PCDATA within the document. The output of the DSP algorithm is a stream of structural information, which we call the structure stream, and streams of XML data elements, which we call the data streams. The generated structure stream and data streams are compressed and packed into a single file. (A full explanation of the DSP algorithm and a detailed example of generating the structure and data streams can be found in Sections 3.2 and 3.3 in [12].)

In Figure 1, we show that the structure stream derived from a given DTD is stored and compressed separately from the data streams. Each PPB data stream corresponds to a particular path structure in the DTD labeled by a key, and all data elements in the data stream are the elements that match the path structure. In addition, each PPB data stream is partitioned into its set of data blocks with a pre-defined block size, which helps to increase the overall compression ratio (cf. [9, 10, 16, 17]). A data block in a PPB data stream can be compressed or decompressed as an individual unit. This partitioning strategy allows us to access the data in a compressed document by decompressing only those data blocks that contain the data elements relevant to the imposed query. We term this a partial decompression strategy.
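To make the partitioning concrete, the following is a minimal sketch of packing one PPB data stream into individually compressed blocks; it uses Python's zlib in place of the Gzip library mentioned below, and the names pack_stream, unpack_block and BLOCK_SIZE are our own illustrations rather than the PPB engine's API.

    import zlib

    BLOCK_SIZE = 1000  # records per block; this is the tunable block size

    def pack_stream(records):
        # Partition one PPB data stream into fixed-size batches and
        # compress each batch individually, so that any block can later
        # be decompressed on its own (partial decompression).
        blocks = []
        for i in range(0, len(records), BLOCK_SIZE):
            batch = records[i:i + BLOCK_SIZE]
            blocks.append(zlib.compress("\n".join(batch).encode("utf-8")))
        return blocks

    def unpack_block(block):
        # Decompress a single block without touching its neighbours.
        return zlib.decompress(block).decode("utf-8").split("\n")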

Figure 1: A Conceptual Diagram of Partitioned Path-Based Data Grouping. The structure stream is stored separately from the path-based grouped data streams, whose keys are:
d0: /library/entry/author/@name
d1: /library/entry/title/text()
d2: /library/entry/year/text()
d3: /library/entry/publisher/text()
d4: /library/entry/num_copy/text()

For example, if the block size is n records per block, the first batch of n records in the stream, d0, is packed in the first data block, while the next batch of n records is packed in the subsequent block. As such, each data block in the data stream contains a certain number of elements in the order listed in their corresponding data streams, which are under the same tree path in the DTD tree as shown in Figure 2, where name is an attribute node; author, title, year, publisher and num_copy are element nodes; and paper, course_notes and book are empty elements.

Figure 2: A Library DTD XML Tree. The root library contains repeated entry elements; each entry has the element nodes year, title, publisher, num_copy and author (the latter carrying the attribute node name), together with an optional choice among the empty elements book, paper and course_notes. PCDATA nodes hang off the element nodes.

The data blocks in the data streams are compressed individually using Gzip [23]. The low-level compressor can be replaced by another generic one, such as the Bzip2 library, as reported in [12], which could gain a better compression ratio at the expense of compression time. The underlying idea of query evaluation is that, by utilizing the structure stream, we only need to decompress on the fly the data streams that are relevant to the query evaluation. This partial decompression accessing strategy significantly reduces the decompression delay and storage overhead, and is similar in spirit to some recent XML compressors [21, 15, 4, 1]. We do not present the query evaluation algorithms and the technical details of implementation for our query processor, as they are described in another submitted paper.

Some preliminary ideas about the query evaluation strategy were highlighted in our earlier study [12], in which the PPB engine was developed to support the logical operators of core XPath queries: "not", "or" and "and", along with the comparison operators: "=", "≠", ">", and "<". The PPB engine also supports "CONTAINS" (i.e., for a substring) and "STARTS-WITH" for strings. We call this class of path-based queries collectively the selection queries. The PPB engine also implements the core library functions: "COUNT", "SUM", "AVG", "MIN" and "MAX", which are standard aggregation operators. These aggregation operators can be embodied in the core XPath queries as predicates. We call this class of path-based queries collectively the aggregate queries.

For the sake of clarity, we only illustrate how and why PPB is useful to tackle single path expressions that consist of at most one occurrence of a descendant axis "//", since such expressions are fundamental to current XML query languages such as XPath and XQuery [28, 29]. Nevertheless, the efficient support of complex path expressions (i.e., more than one descendant axis or having branch expressions in a path) over XML data can be naturally extended with some decomposition and execution plan. For example, a complex path expression, "p1[/p2]//p3", where p1, p2 and p3 are single path expressions, can be supported as follows. We can first decompose the expression into "p1/p2" and "p1//p3". Then the query processor will first retrieve those results that satisfy p1 and evaluate p3 among the descendants of the result from p1 as the first set of intermediate answers. The second set of intermediate answers is only an execution of a single path expression "p1/p2". The final result can be obtained by performing the set intersection or some sophisticated join techniques as illustrated in existing XML query evaluation techniques [1, 2, 15]. In principle, it would also be straightforward to modify the PPB engine to handle IDREF or IDREFS in DTDs by building a DTD graph rather than a DTD tree in the structure stream. When parsing a document against the DTD graph, the graph would still be traversed in depth-first order in order to determine the correct data stream.
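The decomposition plan above can be pictured with a small sketch. Here evaluate_path is a hypothetical single-path evaluator that returns (p1-ancestor identifier, matched node) pairs; the merge is the plain set-intersection option mentioned above, applied on the shared p1 ancestors.

    def evaluate_complex(evaluate_path, doc):
        # Evaluate p1[/p2]//p3 by decomposing it into the two single
        # path expressions p1//p3 and p1/p2.
        p3_hits = evaluate_path(doc, "p1//p3")       # (anchor, p3 node) pairs
        anchors_with_p2 = {a for a, _ in evaluate_path(doc, "p1/p2")}
        # Keep the p3 answers whose p1 ancestor also has a p2 child.
        return [node for anchor, node in p3_hits if anchor in anchors_with_p2]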

The main contributions of this paper are as follows. First, we clarify how the evaluation of an important class of XPath queries (selection and aggregate queries) is affected by various parameters involved in the compression strategy, such as the data stream block size, the number of cluster layers, the cost of scanning the indexes, the cost of decompressing a block, and the query selectivity. Second, we establish a solid foundation on which to optimize the block size in a stream, which leads to the minimum processing cost for a query. Finally, we verify our results by carrying out an extensive experimental study to test the performance of the PPB engine. Our empirical results presented in this paper demonstrate that our PPB engine prototype performs well when compared with two existing compressors, XMill and XGrind, and, importantly, that our PPB cost model predicts the trend of the query response time correctly. To the best of our knowledge, our PPB cost model is the first analytical model of the cost of XML compression.

The remainder of this paper is organized as follows. The rest of this section is devoted to a description of related work. Section 2 presents the Block Statistic Signature (BSS) indexing scheme, which is adopted in the PPB system to aid partial decompression. Section 3 establishes the cost model for the PPB data grouping and presents a preliminary analysis of the optimal query processing cost. Section 4 studies the impact of varying the selectivity and block size on the processing cost for path-based selection and aggregate queries. Section 5 improves BSS by introducing a level of clustering of the blocks and the Cluster Statistic Signature (CSS) indexing scheme in a data stream. We then further generalize CSS by considering cases when there is more than one level of clustering, which we term the Multiple-cluster Statistic Signature (MSS). Section 6 uses real XML datasets to verify various predictions from our model. Finally, we offer concluding remarks in Section 7.


1.1 Related Work

The motivation for supporting direct querying of compressed XML documents is related to studies on traditional data compression [19, 5, 30]. We are able to save bandwidth and achieve more efficient query processing over compressed data, due to the fact that more information can be carried by a given data size. In addition, compression techniques are particularly important in dealing with the verbosity of XML files. Many well-known XML-conscious compression technologies have been proposed recently (cf. our recent survey in [13]). These technologies are able to achieve good compression performance. They include XMill and Millau [10, 20]. However, these systems are solely optimized for better compression and do not support querying of the compressed data. The more recently proposed XML compressors, such as XGrind [21], XPress [15], skeleton compression [2], XCQ [12], XQzip [4], and XQuec [1] (all these compressors except XGrind do not have released code), are able to support querying. However, their foundations are not sufficiently adequate from the analytical point of view. In particular, various underlying cost factors involved in processing queries over compressed data have not yet been introduced in these compressors. It is worth mentioning that XGrind and XPress are able to encode XML data items individually and to preserve the XML structure. Thus, both compressors possess the desirable feature that their querying operations can be executed without fully decompressing the document. However, the compression ratio and time performance of XMill is much better than that of XGrind and XPress (see Figures 12 and 13 in [21]).

2 BSS Indexing in PPB Data Streams

In this section, we describe the Block Statistic Signature (BSS) indexing scheme used in the PPB engine. The indexing scheme is designed for supporting data values from numerical or string domains and is minimal in the sense that it requires very small amounts of storage space and time resources in the PPB engine. This indexing scheme simplifies signature file indexing approaches [7, 11, 6]. Like projection signature indexing in [6], BSS indexing is used to index block-oriented compressed data.

Definition 2.1 (BSS Index and Value Range) Assume that there are n blocks. The Block Statistic Signature (BSS) index of the i-th block is given by $B_i = \langle s^{\beta}_i, b^{\beta}_i \rangle$, where $b^{\beta}_i$ represents the data items in the compressed block and the BSS index value is $s^{\beta}_i = \langle \min(b^{\beta}_i), \max(b^{\beta}_i), sum(b^{\beta}_i), count(b^{\beta}_i) \rangle$. We define the value range for a BSS indexed block, $B_i$, denoted as $l^{\beta}_i$, as $\langle \min(b^{\beta}_i), \max(b^{\beta}_i) \rangle$. From now on, we use more handy notation such as $s^{\beta}$, $b^{\beta}$ and $l^{\beta}$ when there is no ambiguity in the block labelling.

Figure 3 depicts the underlying idea of the BSS indexing scheme. In the PPB engine, a statistical signature is generated for each compressed data block of a document. The signature summarizes the data values inside a particular block. When a query is being evaluated,


the compressed data blocks in the corresponding data streams are accessed by the PPB engine. A filtering process is carried out by the PPB engine as follows. Before a data block is fetched from the disk, the PPB engine consults the corresponding BSS index and ignores those data blocks that do not contain the required record(s). To do this, the PPB engine checks the BSS signature of the data block and decides whether the value range of that block overlaps with the value range specified in the interval query. If these two ranges overlap, which means that the data block may contain the required record(s), then the data block is fetched and decompressed for evaluation. If the two ranges do not overlap, the block does not contain the required records, and in this case the data block is not fetched.
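The signature computation and the filtering test amount to a few lines. The sketch below assumes numeric data values and uses names of our own choosing, not the engine's actual interfaces.

    def bss_index(block_values):
        # Compute the BSS signature <min, max, sum, count> of one block.
        return (min(block_values), max(block_values),
                sum(block_values), len(block_values))

    def is_hit(signature, lo, hi):
        # A block is fetched only when its value range <min, max>
        # overlaps the query interval [lo, hi].
        blk_min, blk_max, _, _ = signature
        return blk_min <= hi and blk_max >= lo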

Figure 3: BSS Indexes in a PPB Data Stream. For example, a compressed data block holding the values 10, 18, 27 and 5 carries the block statistics signature (Min: 5, Max: 27, Sum: 60, Count: 4), and a block holding 0, 1210, 100, 10000 and 10 carries (Min: 0, Max: 10000, Sum: 11320, Count: 5).

If more statistical information is stored in the block signature, we may have the benefit of better knowledge about the compressed data block content, which helps in the block scanning and filtering processes during query evaluation. However, this additional information requires additional resources to generate, scan and store the block signatures, which has a negative effect on the compression time, the query response time and the compression ratio. The BSS indexing scheme requires low computation overhead. It is easy to compute and generate the indexes in the compression phase and it is quick to scan the indexes during query processing. The operations of generating and scanning a signature can be done in linear time, O(n). In the case of signature generation, n is the number of elements in the data stream. In the case of signature scanning, n is the number of compressed data blocks in the PPB data stream. We consider the following selection query of a compressed XML document that conforms to the DTD given in Figure 2 (we skip the full path of the queries for simplicity):

(Q1): entry[author/@name="Lam" and publisher="X Ltd."].

This query, Q1, selects those XML fragments that satisfy the two given predicates. We define the selectivity as the percentage of XML fragments (e.g., the entry fragments in Figure 2) returned in the query result compared to the total number of XML fragments in the compressed document. Two data streams, d0 and d3, are involved during the process of query evaluation. The PPB engine first returns the XML fragments that have matched "entry" elements. Assuming there are m records satisfying the first predicate, the PPB engine then needs to find the blocks in d3 that contain the corresponding m "publisher" records to evaluate the second predicate. The structure stream directs the engine to these two streams. The query predicates can be Exact Matching (equality comparison only) or Range Matching (inequality comparison involved).

Consider also the following aggregate query that finds the sum of (paper, book or course notes) copies based on the data elements that satisfy some specified predicates inside a single data stream. The query aims at finding aggregate information related to a particular path.

(Q2): SUM(//num_copy ≥ 2).

This query, Q2, selects only those elements in the data stream d4. The PPB engine needs to find the data elements that satisfy the predicate "num_copy ≥ 2" and then return the sum of the num_copy values. The BSS index in d4 is scanned in the filtering process and those data blocks that contain the data elements are fetched from the disk. This helps the PPB engine to avoid fetching the compressed data blocks that do not contain relevant records; alternatively, the statistical parameters in the signature can be used directly to evaluate the query.
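A sketch of this short-cut for SUM, under the same assumptions as the previous sketch; decompress stands for whatever routine unpacks a block into its data values.

    def sum_query(blocks, signatures, lo, hi, decompress):
        # Answer SUM over values in [lo, hi]. Blocks wholly inside the
        # interval are summed from the signature's sum field alone;
        # partially overlapping blocks are decompressed and scanned.
        total = 0
        for block, (bmin, bmax, bsum, _) in zip(blocks, signatures):
            if bmax < lo or bmin > hi:
                continue                    # miss: the block is not fetched
            if lo <= bmin and bmax <= hi:
                total += bsum               # contained: signature answers it
            else:
                total += sum(v for v in decompress(block) if lo <= v <= hi)
        return total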

3 PPB Data Grouping Analysis

In this section, we study the process of checking data elements in the PPB data grouping against the selection predicate of a given query. This paves the way to studying the optimization of the PPB engine for running selection and aggregate queries.

We begin by discussing the cost of running a selection query that has a simple range-matching, open-interval predicate, such as entry[num_copy > 1]. We then extend the basic result to other queries. We now assume that the PPB data stream is finite with N data elements, each of which has a fixed and uniform probability, P, of being selected by the query predicate, and that there is only a single predicate that selects data elements from the range $l_p$.

Definition 3.1 (BSS Hit and Missed Blocks) We compare $l^{\beta}_i$ of a compressed block, $B_i$, in a BSS indexed PPB data stream, $\langle B_1, \ldots, B_n \rangle$, with $l_p$ in a query and check if these two intervals overlap (i.e., when $l^{\beta}_i \cap l_p \neq \emptyset$). If they overlap, we say that a BSS hit occurs at block $B_i$ (or simply $B_i$ is a hit block). A BSS miss occurs at block $B_i$ (or simply $B_i$ is a missed block) if they do not overlap. A block is decompressed for further comparison of elements with $l_p$ iff it is hit.

We make the following assumptions to establish cost equations related to query processing.


1. The cost of scanning the value range of a block, B (which includes checking $l^{\beta}$ against $l_p$), denoted as $C_s$, is relatively small compared with the cost of decompressing the block.

2. We assume that $C_s$ is constant and independent of the block size. We also assume that the average cost of decompressing a data element in a block, denoted as $C_d$, is constant and independent of the block size.

3. Elements in a block have the same inherent probability, P, of satisfying the predicate specified in a given query.

The first assumption is needed because decompressing a block is a costly operation, as the whole block is fetched from the disk into the main memory. The glossary used in our model is given in Table 1.

Notation   Definition
$l_p$      The range specified in a query predicate.
$\pi$      The hitting probability of a compressed block in a PPB data stream.
N          Number of data elements in a data stream.
P          Probability that a data element in a stream satisfies the predicate.
K          Number of data elements in a block of a data stream.
L          Number of blocks in a data stream specified by a PPB path.
X          Number of blocks being hit (i.e., $l^{\beta}$ and $l_p$ overlap in these blocks).
$C_q$      Total cost of processing a query q.
$C_s$      Cost of scanning a BSS range of a block.
$C_d$      Average cost of decompressing a data element.
Y          Unit processing cost of processing a query.

Table 1: Notation Used in the Cost Processing Analysis

As N data elements with the same semantics in a PPB data stream are divided into L compressed data blocks, we have $K = \frac{N}{L}$ data elements in each block. The probability that all the elements in a block do not fall within $l_p$ is equal to $(1-P)^K$, where P is the selectivity of the predicate. It follows that the probability that a block is hit is equal to $1 - (1-P)^K$. Let us call $\pi = 1 - (1-P)^K$ the hitting probability of a compressed data block in a PPB data stream (or simply the block hitting probability). The number of hit blocks, X, satisfies a binomial distribution as follows.

The probability that X blocks are hit in a stream $= C_{L,X} \cdot \pi^X (1-\pi)^{L-X}$, where $C_{L,X} = \frac{L!}{X!(L-X)!}$ and $X \in \{0, 1, \ldots, L\}$ for some positive integer L.

As the mean of the distribution is equal to $\pi L$, we have the expected number of hit blocks, $E(X) = \pi L$, which are decompressed for further checking of the data elements against $l_p$.


We ignore the cost of checking the decompressed items that are already fetched into the main memory. Thus, the total cost of processing a query in the data streams is equal to the sum of the costs of scanning the BSS indexes in a data stream and decompressing X blocks (i.e., $L \cdot C_s + K \cdot X \cdot C_d$). The expected cost of running the selection query is given by $E(C_q) = L \cdot C_s + K \cdot E(X) \cdot C_d$. Let us call the expression $Y_q = \frac{E(C_q)}{N}$ (or simply Y if q is clear in the context) the unit processing cost with respect to the query, q. We now give the following formula that expresses Y in terms of the block size, K, and the selectivity, P:

$$Y = \frac{C_s}{K} + \left(1 - (1-P)^K\right) C_d. \qquad (1)$$
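Equation 1 is straightforward to evaluate numerically; the sketch below reproduces the quantity plotted in Figure 4, with function names and default costs of our own choosing.

    def unit_cost(K, P, Cs=1.0, Cd=1.0):
        # Unit processing cost Y of Equation 1: index scanning cost per
        # element plus the expected decompression cost per element.
        hit_prob = 1.0 - (1.0 - P) ** K    # block hitting probability pi
        return Cs / K + hit_prob * Cd

    def expected_hit_blocks(L, K, P):
        # Mean of the binomial distribution above: E(X) = pi * L.
        return L * (1.0 - (1.0 - P) ** K)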

In Figures 4(a) and 4(b), we sketch the unit processing cost function against the block size under different selectivity values, in which we assume $C_s = C_d = 1$ in plotting the charts for the sake of easy illustration. These charts help to give further insights for developing our cost model. Note that if the block size is large, then there are virtually no missed blocks (i.e., we need to decompress the whole document) and thus the unit processing cost function converges to the unit decompressing cost, $C_d$ (i.e., as shown in Figure 4(a), $K \to \infty$, $Y \to C_d$ in our case). We can see that the cost function can take three shapes, depending on the selectivity values. First, the curves for high selectivity values in Figure 4(b) are generally decreasing (e.g., P ≥ 0.5) and, in these cases, the larger the block size, the lower the unit processing cost. There is also comparatively little difference among these curves. Second, the curves for low selectivity values have a local minimum (e.g., P = 0.4), but the value may still be higher than the asymptotic value, $C_d$. Third, the curves for sufficiently low selectivity values have a global minimum (e.g., P = 0.2) (i.e., the minimum value is smaller than $C_d$).

Figure 4: Cost Analysis of the Unit Processing Cost for Selection Queries ($C_s = 1$; $C_d = 1$). (a) Low Selectivity Selection Queries: Y against block size K for P = 0.1 to 0.5, with the asymptote $C_d$ marked. (b) High Selectivity Selection Queries: Y against block size K for P = 0.5 to 0.9.

We formally show that there exists an optimal query processing cost if and only if the selectivity value is within a certain threshold. The theorem helps to clarify the unit cost processing function shown in Figures 4(a) and 4(b).

Theorem 3.1 The following statements are true for the unit processing cost, Y.

1. Y has a local minimum if and only if $P \le 1 - e^{-\frac{4C_d}{C_s}e^{-2}}$.

2. Y has a global minimum if and only if $P \le 1 - e^{-\frac{C_d}{C_s}e^{-1}}$.

Proof. We let $x = (1 - P)$, where $0 \le x \le 1$, and we have the unit processing cost function as follows:

$$Y = \frac{C_s}{K} + (1 - x^K)C_d. \qquad (2)$$

In order to establish the result in Part 1, we proceed to find the extreme points of Y by differentiating it with respect to K. We then set the derivative to zero and thus have the equation:

$$\frac{dY}{dK} = -\frac{C_s}{K^2} - x^K C_d \ln x = 0. \qquad (3)$$

It follows from Equation 3 that we have

$$K^2 x^K = \frac{-C_s}{C_d \ln x}. \qquad (4)$$

Note that the parameter K appears only on the left-hand side. In order to solve Equation 3, we define a function on K by $F(K) = K^2 x^K$ and find the maximum of F first. By solving the equation $\frac{dF}{dK} = 2Kx^K + K^2 x^K \ln x = 0$, we obtain the (unique) extreme value of F at $K = K_F = \frac{-2}{\ln x}$. We conclude that $K_F$ is the global maximum point from the evidence $\frac{d^2F}{dK^2}\big|_{K=K_F} = -2x^{K_F} < 0$. Therefore, there exists a root of Equation 3 if and only if the following condition holds:

$$K_F^2 x^{K_F} \ge \frac{-C_s}{C_d \ln x}. \qquad (5)$$

The intuition behind Equation 5 is that, given $\alpha$ and x, the horizontal line (with respect to K) $F' = \frac{-\alpha}{\ln x}$ cuts the curve F if and only if the line is lower than or equal to the maximum value of F. When the line is higher than $F(K_F)$, there is no intersection between F and $F'$ and therefore no solution for Equation 3. By substituting $K_F = \frac{-2}{\ln x}$ into Equation 5, we then have $x \ge e^{-\frac{4C_d}{C_s}e^{-2}}$. It follows that $P \le 1 - e^{-\frac{4C_d}{C_s}e^{-2}}$.

In order to prove Part 2, we need to compare the minimum value of Y in Equation 2 with the asymptotic value of Y when K is large (i.e., $Y_\infty = C_d$). We define

$$Y_{local\ min} = \frac{C_s}{K^*} + (1 - x^{K^*})C_d, \qquad (6)$$

where $K^*$ is the minimum point corresponding to $Y_{local\ min}$. If $Y_{local\ min} \le C_d$, then $Y_{local\ min}$ is in fact the global minimum of Y, occurring at some finite K value (cf. Figure 4(a)).

We therefore have the condition $Y_{local\ min} \le C_d$. By Equation 6, we further simplify this condition and obtain the following equation:

$$\frac{C_s}{K^*} \le C_d x^{K^*}. \qquad (7)$$

Bear in mind that $K^*$ is the root of Equation 3. By using Equations 4 and 7, we have the following condition:

$$-K^* \ln x \le 1. \qquad (8)$$

From Equation 8, we substitute $K^*$ in Equation 4 and obtain the inequality condition $\ln\left(\frac{-C_s \ln x}{C_d}\right) \ge -1$. After simplifying the logarithmic terms, it follows that $P \le 1 - e^{-\frac{C_d}{C_s}e^{-1}}$, which is a stricter condition than the counterpart of the local minimum, since we have $4 > e$. □

As illustrated in Figure 4(a), Y has the local minimum, and the smaller the selectivity, the lower the local minimum. The local minimum is in fact the global one if it is smaller than $C_d$. This can be proved by using Equations 6 and 7 as follows:

$$\frac{dY_{local\ min}}{dP} = C_d\left(K^*(1-P)^{K^*-1}\right) > 0. \qquad (9)$$

It follows from Equation 2 that when $K \to \infty$, we have $\frac{C_s}{K} + (1 - x^K)C_d \to C_d$, which means that Y can be brought arbitrarily close to $C_d$ by choosing a sufficiently large K value. On the other hand, we find the expression of K that gives rise to the optimal Y for a sufficiently small P value in Corollary 3.2.

Corollary 3.2 Y has a global minimum at $K = \sqrt{\frac{C_s}{PC_d}}$ when P is sufficiently small.

Proof. From Theorem 3.1, it follows that the global minimum value of Y exists if P is sufficiently small. In this case, we have the approximation $(1 - x^K) \approx KP$. Using this approximation, we reduce Equation 1 to the following equation:

$$Y \approx \frac{C_s}{K} + KPC_d. \qquad (10)$$

We differentiate Equation 10 with respect to K, by which we are able to find the optimal Y value. So we have the following equation:

$$\frac{dY}{dK} = PC_d - \frac{C_s}{K^2} = 0. \qquad (11)$$

Thus, Y has the extreme value at $K \approx \sqrt{\frac{C_s}{PC_d}}$, which is a global minimum, since $\frac{d^2Y}{dK^2} = \frac{2C_s}{K^3} > 0$ for all K. □

To summarize, we present three main findings that involve the unit processing cost, the block size, and the scanning and unit decompressing costs in our analysis.

1. For low selectivity queries: if the selectivity satisfies $P \le P_{local\ min} = 1 - e^{-\frac{4C_d}{C_s}e^{-2}}$, as shown in Part 1 of Theorem 3.1, then the optimal unit processing cost, Y, occurs in the case $K \approx \sqrt{\frac{C_s}{PC_d}}$, as shown in Corollary 3.2.

2. For high selectivity queries: if the selectivity satisfies $P > P_{global\ min} = 1 - e^{-\frac{C_d}{C_s}e^{-1}}$, as shown in Part 2 of Theorem 3.1, the optimal unit processing cost, Y, approaches $C_d$ as $K \to \infty$, as shown in Figure 4(b). In practice, this means that the larger the block size, the more efficient the query processing.

3. For sufficiently large K, the unit processing cost, Y, of a selection query approaches the unit decompressing cost, $C_d$, irrespective of the P values.
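The three findings can be folded into a single block-size chooser. This is a sketch for the open-interval selection case only, taking the threshold from Part 2 of Theorem 3.1 and the optimum from Corollary 3.2; the cap K_max merely stands in for "as large as the stream allows".

    import math

    def choose_block_size(P, Cs=1.0, Cd=1.0, K_max=10**6):
        # Threshold of Theorem 3.1(2): above it, no finite global minimum.
        p_global = 1.0 - math.exp(-(Cd / Cs) * math.exp(-1.0))
        if P > p_global:
            return K_max    # high selectivity: the bigger the block, the better
        return max(1, round(math.sqrt(Cs / (P * Cd))))   # Corollary 3.2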

4 Processing Interval Queries

In this section, we discuss the hitting probabilities of selection and aggregate queries. We then present an analysis of their unit processing costs.

4.1 Modified Block Hitting Probability

We now consider the block hitting probability used in selection and aggregate queries with open or closed intervals as selective predicates.

Selection Queries. We classify the selection query intervals on data values specified by the selective predicate into two categories.

1. Open Interval Queries. The interval specified by the selection predicate has an open end. For example, the following query, which seeks the "byte count" in a compressed XMLised Weblog document [24], has an open interval used in the selection predicate:

(Q3): element[byteCount > 1000].

2. Closed Interval Queries. The interval specified by the selection predicate has two closed ends. For example, the following query, which seeks the byte counts in a compressed XMLised Weblog document [24], has a closed interval used in the selection predicate:

(Q4): element[2000 > byteCount > 1000].

In Section 3, we simply assumed that hitting a block in a stream occurs when $l_p$ and $l^{\beta}$ overlap, which is given by the block hitting probability $\pi = 1 - (1-P)^K$. The parameter $\pi$ is essential for us to deduce the expected number of blocks to be decompressed in a stream. However, we have not tackled two issues. The first is whether $\pi$ is applicable to both open and closed intervals; the second is whether $\pi$ is applicable to both selection and aggregate queries.

We now consider the case of an open interval predicate. There are two cases of overlapping between $l^{\beta}$ and $l_p$, shown as Cases B and C in Figure 5. It is clear that these two cases happen if and only if there exist some data elements in a block falling in the interval $l_p$. In other words, we only need to exclude the probability of the event that no data element falls in $l_p$ (i.e., reject Case A), which is equal to $(1-P)^K$. Thus, our assumption of using $\pi = 1 - (1-P)^K$ in Equation 1 is justified.

Figure 5: Overlapping BSS Index and Open Interval Selection Query. Case A (miss): the interval of the index lies outside the open query interval $l_p$, so only BSS index scanning is needed. Cases B and C (hit): the interval of the index overlaps $l_p$, so BSS index scanning is followed by block decompressing.

However, there are three cases of overlapping, Cases B, C and D, in the case of closed interval queries, as shown in Figure 6. Notably, in Case D, it may happen that none of the data elements in a block actually lies in the range of the interval $l_p$, but the block is still a hit. This is because, even when $l_p$ and $l^{\beta}$ overlap, we still cannot determine whether the data elements fall in $l_p$, $l_a$ or $l_b$. In other words, the hitting parameter $\pi$ does not reflect the outcomes in Case D.

Figure 6: Overlapping BSS Index and Closed Query Intervals for Selection Queries. The closed interval $l_p = [a, b]$ of the selection predicate divides the range of values in a PPB data stream into $l_a$, $l_p$ and $l_b$. Case A (miss) requires BSS index scanning only; Cases B, C and D (hit) require BSS index scanning plus block decompressing, where in Case D the interval of the index straddles $l_p$ without any element necessarily falling in it.

In order to modify the expression of $\pi$ to $\pi^*$, we take into account the possibility that some data elements may fall in $l_a$ or $l_b$ in Case D as follows:

$\pi^*$ = the probability of having at least one element in the interval $l_p$ (i.e., related to Cases B and C and (part of) D) + the probability of having at least one element in $l_a$ but others in $l_b$ (i.e., related to (part of) Case D) + the probability of having at least one element in $l_b$ but others in $l_a$ (i.e., related to (part of) Case D).

We assume that the elements in a PPB data stream are distributed with the probabilities $P_a$ and $P_b$ in the intervals $l_a$ and $l_b$, respectively. Thus, we have the following equation:

$$\pi^* = \left(1 - (1-P)^K\right) + (1-P)^K\left(1 - (1-P_a)^K\right)\left(1 - (1-P_b)^K\right). \qquad (12)$$

Note that $P + P_a + P_b = 1$ in Equation 12. The probability values of P, $P_a$ and $P_b$ depend on the data distribution in the range [a, b]. If we assume that the distribution of the data elements is uniform in the range, then we have the simple ratio for the probabilities, $P_a = (1-P)\frac{l_a}{l_a+l_b}$ and $P_b = (1-P)\frac{l_b}{l_a+l_b}$.

Aggregate Queries. There is one fundamental difference between aggregate and selection queries in estimating the processing cost, $C_q$: it may not be necessary to decompress a hit block in a PPB data stream. For example, consider the following aggregate queries, which are modified from the selection queries Q3 and Q4:

(Q5): element[COUNT(byteCount > 1000)].
(Q6): element[COUNT(2000 > byteCount > 1000)].

To process the above queries, we do not need to decompress a block if the comparison satisfies the condition that $l^{\beta}$ is contained in $l_p$. We have to exclude the decompressing cost arising from Case C. We now modify the expression of $\pi$ to $\pi^*$ in order to take account of this saving in decompression.

In the case of an open interval query, we have the following hitting probability:

$\pi^*$ = the probability of having at least one element in the interval $l_p$ (i.e., Cases B and C) − the probability of having all elements in $l_p$ (i.e., reject Case C).

Thus, we have the following equation:

$$\pi^* = \left(1 - (1-P)^K\right) - P^K. \qquad (13)$$

From Equation 13, we show in Figures 7(a) and 7(b) the unit processing cost of aggregate queries against block sizes when the selectivity values are low and high, respectively. The low selectivity case in Figure 7(a) is similar to its counterpart in the selection queries. However, the high selectivity case in Figure 7(b) also exhibits a minimum unit processing cost at a certain block size, which is in contrast to what we observe in Figure 4(b).

Figure 7: Cost Analysis of the Unit Processing Cost for Aggregate Queries ($C_s = C_d = 1$). (a) Low Selectivity Aggregate Queries: Y against block size K for P = 0.1 and 0.5. (b) High Selectivity Aggregate Queries: Y against block size K for P = 0.6, 0.9 and 1.0.

In the case of a closed interval query, we have the following hitting probability:


$\pi^*$ = the probability of having at least one element in the interval $l_p$ (i.e., Cases B and C) − the probability of having all elements in $l_p$ (i.e., reject Case C) + the probability of having at least one element in $l_a$ but others in $l_b$ (i.e., Case D) + the probability of having at least one element in $l_b$ but others in $l_a$ (i.e., Case D).

Thus, we have the following equation:

$$\pi^* = \left(1 - (1-P)^K\right) - P^K + (1-P)^K\left(1 - (1-P_a)^K\right)\left(1 - (1-P_b)^K\right). \qquad (14)$$

We summarize the modified block hitting probabilities to be used in the unit cost equations in Table 2 for the different query interval cases.

Query             Selection              Aggregate
Open Interval     $\pi = 1 - (1-P)^K$    Equation 13
Closed Interval   Equation 12            Equation 14

Table 2: Modified Block Hitting Probabilities
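Table 2 translates directly into four small formulas. In the sketch below, Pa and Pb are supplied by the caller; under the uniform-distribution assumption they follow the ratio given after Equation 12.

    def hit_open_selection(P, K):
        return 1 - (1 - P) ** K                    # pi

    def hit_closed_selection(P, Pa, Pb, K):        # Equation 12
        return (1 - (1 - P) ** K) + \
               (1 - P) ** K * (1 - (1 - Pa) ** K) * (1 - (1 - Pb) ** K)

    def hit_open_aggregate(P, K):                  # Equation 13
        return (1 - (1 - P) ** K) - P ** K

    def hit_closed_aggregate(P, Pa, Pb, K):        # Equation 14
        return hit_open_aggregate(P, K) + \
               (1 - P) ** K * (1 - (1 - Pa) ** K) * (1 - (1 - Pb) ** K)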

4.2 Processing Cost and Selectivity

For the sake of clarity, we first transform the unit processing cost equation for selection queries with an open interval from Equation 1 as follows:

$$Y = H_K - (1-P)^K C_d, \qquad (15)$$

where $H_K = \left(\frac{C_s}{K} + C_d\right)$.

We differentiate Y with respect to P ($0 \le P \le 1$). At the extreme cases of P = 0 and 1, we have $Y = (H_K - C_d)$ and $H_K$, respectively. This result implies that our partitioning strategy favors low selectivity queries and that the processing cost is bounded by $H_K$ (when all the blocks are decompressed). The implication is reasonable, since in practice we do not enjoy a processing cost benefit if we have to decompress too many blocks in a data stream, which happens in the case of high selectivity selection queries. We show the unit processing cost function in relation to the selectivity of the selection queries in Figure 8(a).

We now study the relationship between the unit processing cost and selectivity for aggregate queries. The unit processing cost equation, in which we use the modified block hitting probability given in Equation 13, is:

$$Y = H_K - \left((1-P)^K + P^K\right) \cdot C_d. \qquad (16)$$

Similarly, we differentiate Y with respect to P and then set the corresponding derivative to zero. Now we have the following equation:

$$\frac{dY}{dP} = C_d K (1-P)^{K-1} - C_d K P^{K-1} = 0, \qquad (17)$$

which gives the stationary point at $P = P_m = \frac{1}{2}$. Assuming that $K \gg 1$, which is reasonable in practice, we conclude that $P_m$ is in fact a local maximum point, since we have

$$\frac{d^2Y}{dP^2}\Big|_{P=P_m} = -C_d K(K-1)(1-P_m)^{K-2} - C_d K(K-1)P_m^{K-2} < 0. \qquad (18)$$

By substituting $P_m$ into Equation 16, we obtain the maximum value $Y(P_m) = H_K - \frac{2C_d}{2^K}$. At the extreme cases of P = 0 and 1, we have the same value, given by $Y = H_K - C_d$. We show the unit processing cost function Y in relation to the selectivity in Figure 8(b). The graph is remarkably different from that in Figure 8(a) in the region of high selectivity, since the cost decreases rapidly when the selectivity is high. The difference can be explained as follows. When a wider selectivity is given, more data items are selected in a block; the chance that $l_p$ covers $l^{\beta}$ becomes higher (i.e., the occurrence of Case C becomes more frequent than the others). It is therefore less likely that the hit blocks are decompressed.

Figure 8: Cost Analysis of the Unit Processing Cost against Selectivity ($C_s = C_d = 1$; K = 30, 50, 70, 90). (a) Selection Queries: Y against selectivity P. (b) Aggregate Queries: Y against selectivity P.

In a given workload, if the distribution of selectivities and queries can be estimated (e.g., from past query log data), then similar differentiation can be straightforwardly applied to the following extended unit processing cost equation: $Y = w_1 Y_1 + w_2 Y_2 + \cdots + w_n Y_n$, where $w_i$ is the normalized weight and $Y_i$ is the cost for the query having selectivity $P_i$.
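A sketch of this weighted cost for open-interval selection queries, restating the unit cost of Equation 1 so that the snippet stays self-contained:

    def unit_cost(K, P, Cs=1.0, Cd=1.0):
        return Cs / K + (1.0 - (1.0 - P) ** K) * Cd

    def workload_cost(K, workload, Cs=1.0, Cd=1.0):
        # Weighted unit processing cost Y = w1*Y1 + ... + wn*Yn, where the
        # workload is a list of (normalized weight w_i, selectivity P_i) pairs.
        return sum(w * unit_cost(K, P, Cs, Cd) for w, P in workload)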


5 A Block Clustering Scheme

In this section, we impose a layer of clusters on the PPB data streams; each cluster consists of a sequence of an equal number of data blocks. We then extend the BSS indexing scheme into the Cluster Statistics Signature (CSS) indexing scheme. We further extend the results obtained from studying the CSS indexing scheme into the general case of the Multiple-cluster Statistics Signature (MSS) indexing scheme.

5.1 The CSS Indexing Scheme

The CSS index of a given cluster, denoted by $s^{\alpha}$, is computed from the BSS indexes (recall Definition 2.1) in the cluster.

Definition 5.1 (CSS Index and Value Range) Assume that there are $n_j$ compressed blocks in a j-th cluster. The Cluster Statistic Signature (CSS) index is given by $C_j = \langle s^{\alpha}_j, c^{\alpha}_j \rangle$, where $c^{\alpha}_j = \{b_1, \ldots, b_{n_j}\}$ is a collection of $n_j$ blocks of data items. The index value is given by $s^{\alpha}_j = \langle \min^{\alpha}_j, \max^{\alpha}_j, sum^{\alpha}_j, count^{\alpha}_j \rangle$. The parameters are $\min^{\alpha}_j = Min(\min(b_1), \ldots, \min(b_{n_j}))$, $\max^{\alpha}_j = Max(\max(b_1), \ldots, \max(b_{n_j}))$, $sum^{\alpha}_j = sum(b_1) + \cdots + sum(b_{n_j})$, and $count^{\alpha}_j = count(b_1) + \cdots + count(b_{n_j})$. Similar to $l^{\beta}$, we define the value range for a CSS index, $l^{\alpha}_j$, as $\langle \min^{\alpha}_j, \max^{\alpha}_j \rangle$. We may use more handy notation such as $s^{\alpha}$, $c^{\alpha}$ and $l^{\alpha}$ when no ambiguity arises from the cluster labelling.

The index checking of the compressed data with this new scheme is carried out as follows. In the first phase, we scan and check the value ranges of the CSS indexes. If there is an overlap between $l_p$ and $l^{\alpha}$ (i.e., when the cluster is hit), then in the second phase we use the BSS indexes to scan and check all blocks that are contained in the hit cluster. We depict the idea of imposing a CSS indexing scheme on a PPB data stream in Figure 9. We study the unit processing cost while including a level of clusters in the PPB engine. In Table 3, we extend the BSS notation in order to specify the parameters in the CSS layer.

Notation   Definition
$\pi_c$    Cluster hitting probability of a PPB data stream.
$\pi_b$    Block hitting probability in a hit cluster.
$K_b$      Number of data elements in a block of a data stream.
$K_c$      Number of data elements in a cluster.
$L_b$      Number of blocks in a cluster.
$L_c$      Number of clusters in a PPB data stream.
$X_b$      Number of hit blocks in a (hit) cluster (i.e., $l^{\beta}$ and $l_p$ overlap).
$X_c$      Number of hit clusters (i.e., $l^{\alpha}$ and $l_p$ overlap in these clusters).

Table 3: Notation Used in the CSS Indexing Scheme

Figure 9: The CSS Indexing Scheme. The first scanning pass checks the CSS indexes in the CSS indexing layer; for each hit cluster, a second scanning pass checks the BSS indexes of the compressed data blocks it contains, and the hit blocks are collected for decompression.

Note that, as shown in Figure 9, $X_c$ is equal to the number of clusters that have an overlap between $l^{\alpha}$ (the CSS index) and $l_p$ in a PPB data stream, which is processed in the first phase, and $X_b$ is equal to the number of blocks that have an overlap between $l^{\beta}$ (the BSS index) and $l_p$ in a hit cluster, which is processed in the second phase. For simplicity, we assume that the cost of scanning a BSS index is equal to the cost of scanning a CSS index.

We now construct the unit processing cost equation based on the following reasoning:

Total cost of processing a selection query
= Cost of scanning the CSS indexes in a data stream
+ Cost of scanning the BSS indexes in all the hit clusters
+ Cost of decompressing all the hit blocks.

Thus, we have the following cost equation for processing a query using the CSS indexing scheme: $C_q = L_c \cdot C_s + X_c \cdot L_b \cdot C_s + X_c \cdot X_b \cdot K_b \cdot C_d$.

We extend the notion of the hitting probability to the cluster hitting probability by using a similar argument to the one leading to the notion of the block hitting probability, which is discussed in Section 4.1. We now simply state the cluster hitting probability and the block hitting probability as $\pi_c = 1 - (1-P)^{K_c}$ and $\pi_b = 1 - (1-P)^{K_b}$, respectively. We refer to the case of open interval selection queries. However, other cases of interval queries can be studied in a similar way, using the respective hitting probabilities given in Table 2.

We extend the concept of hit blocks to hit clusters. The number of hit clusters, $X_c$, is equal to the number of clusters that have a corresponding CSS value range overlapping with the predicate range of a query (i.e., $l^{\alpha} \cap l_p \neq \emptyset$). The number of hit clusters, $X_c$, satisfies a binomial distribution, which corresponds to the first phase of scanning in Figure 9. Thus, the expected number of hit clusters is given by $E(X_c) = \pi_c \cdot L_c$. In the second phase, all the blocks (i.e., $L_b$ blocks) in these hit clusters are scanned and their BSS indexes are checked. The expected number of hit blocks in a hit cluster is given by $E(X_b) = \pi_b \cdot L_b$.

Based on the above reasoning, we formulate the expected cost equation, which can be viewed as an extension of Equation 1:

$$E(C_q) = L_c \cdot C_s + \pi_c \cdot L_b \cdot L_c \cdot C_s + \pi_b \pi_c L_b L_c \frac{N}{L_b L_c} C_d. \qquad (19)$$

By using the formula $N = L_b L_c K_b$, we simplify Equation 19 and formulate the following unit processing cost equation:

$$Y = \frac{E(C_q)}{N} = \frac{C_s}{K_c} + \frac{\pi_c C_s}{K_b} + \pi_b \pi_c C_d. \qquad (20)$$
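Equation 20 can be evaluated directly; a sketch with our own function name, again for the open-interval selection case:

    def css_unit_cost(Kb, Kc, P, Cs=1.0, Cd=1.0):
        # Unit processing cost Y of Equation 20 under one cluster layer.
        pi_c = 1.0 - (1.0 - P) ** Kc    # cluster hitting probability
        pi_b = 1.0 - (1.0 - P) ** Kb    # block hitting probability
        return Cs / Kc + pi_c * Cs / Kb + pi_b * pi_c * Cd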

Theorem 5.1 Y has a local minimum value for sufficiently small selectivity values, where Y is given in Equation 20.

Proof. Consider the case of small selectivity values in Equation 20, so that $\pi_b \approx K_b P$ and $\pi_c \approx K_c P$. We then obtain the following equation:

$$Y = \frac{E(C_q)}{N} = \frac{C_s}{K_c} + \frac{C_s K_c P}{K_b} + C_d K_b K_c P^2. \qquad (21)$$

We now find the partial derivatives with respect to $K_b$ and $K_c$ and then set them to zero, in order to obtain the stationary points of the unit processing cost function Y:

$$\frac{\partial Y}{\partial K_b} = -\frac{K_c P C_s}{K_b^2} + K_c P^2 C_d, \qquad \frac{\partial Y}{\partial K_c} = -\frac{C_s}{K_c^2} + \frac{C_s P}{K_b} + K_b P^2 C_d. \qquad (22)$$

By setting $\frac{\partial Y}{\partial K_b} = 0$ and $\frac{\partial Y}{\partial K_c} = 0$ in Equation 22, we have the following optimal choices when, respectively, $K_b = K_b^{min}$ and $K_c = K_c^{min}$:

$$K_b^{min} = \sqrt{\frac{C_s}{C_d P}}, \qquad K_c^{min} = \left(\frac{C_s}{4 C_d P^3}\right)^{\frac{1}{4}}. \qquad (23)$$

We proceed to prove the claim that $(K_b^{min}, K_c^{min})$ is a local minimum point of Y by considering the following sufficient conditions (cf. [14]): (1) $\frac{\partial^2 Y}{\partial K_b^2} > 0$ or $\frac{\partial^2 Y}{\partial K_c^2} > 0$, and (2) $\frac{\partial^2 Y}{\partial K_b^2}\frac{\partial^2 Y}{\partial K_c^2} - \left(\frac{\partial^2 Y}{\partial K_b \partial K_c}\right)^2 > 0$.

From Equation 22, the second derivatives $\frac{\partial^2 Y}{\partial K_b^2}$ and $\frac{\partial^2 Y}{\partial K_c^2}$ are obtained as follows:

$$\frac{\partial^2 Y}{\partial K_b^2} = \frac{2 C_s P K_c}{K_b^3}, \qquad \frac{\partial^2 Y}{\partial K_c^2} = \frac{2 C_s}{K_c^3}, \qquad (24)$$

which are always positive for all positive parameters. In addition, for sufficiently small P (ignoring $P^2$ and other higher power terms), we have

$$\frac{\partial^2 Y}{\partial K_b^2}\frac{\partial^2 Y}{\partial K_c^2} - \left(\frac{\partial^2 Y}{\partial K_b \partial K_c}\right)^2 = \frac{4 C_s^2 P K_c}{K_b^3 K_c^3} - \left(P^2 C_d - \frac{P C_s}{K_b^2}\right)^2 \approx P\left(\frac{4 C_s^2 K_c}{K_b^3 K_c^3}\right), \qquad (25)$$

which is always positive. □

We now present in Figure 10 a three-dimensional plot of the unit processing cost, Y, in relation to the block and cluster sizes in the case of low selectivity.

Figure 10: Unit Processing Cost in Relation to Block and Cluster Sizes in Low Selectivity Queries. A three-dimensional surface of Y against block size $K_b$ and cluster size $K_c$ ($C_s = 1$; $C_d = 1$; P = 0.1), with the local minimum point marked.

Notably, there is a minimum point on the surface in Figure 10, by Theorem 5.1. In fact, we can easily deduce from Equation 23 that in the extreme case of $P \to 0$, we have $K_b^{min} \to \infty$ and $K_c^{min} \to \infty$. There are two further remarks concerning the minimum points given in Equation 23. First, the parameters $K_b$ and $K_c$ should be positive integers. In addition, the ratio $\frac{K_c}{K_b} = L_b$ must be an integer in order to give a whole number of blocks in a cluster. In reality, $K_c$ may not be an exact multiple of $K_b$ and hence there may be a source of inaccuracy in our modelling. Second, it is interesting to see that the minimum point $(K_b^{min}, K_c^{min})$ depends only on the selectivity P ($C_s$ and $C_d$ are constant in a given configuration).
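A sketch of picking the pair from Equation 23 while respecting the integrality caveat just mentioned; the rounding policy is ours.

    import math

    def optimal_css_sizes(P, Cs=1.0, Cd=1.0):
        # Minimum point (Kb_min, Kc_min) of Equation 23, rounded so that
        # Kb and Kc are positive integers and Kb divides Kc.
        Kb = max(1, round(math.sqrt(Cs / (Cd * P))))
        Kc = max(Kb, round((Cs / (4.0 * Cd * P ** 3)) ** 0.25))
        Kc -= Kc % Kb       # force a whole number of blocks per cluster
        return Kb, Kc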

5.2 The MSS Indexing Scheme

The clustering technique can be generalized to p layers of clusters that are arranged in a top-down manner. All the statistical parameters of the parent's cluster layer are computed from the indexes of its children's cluster layers. To visualize this idea, we extend Figure 9 into Figure 11.

Figure 11: MSS Indexing Scheme. The MSS indexes are scanned from the n-th (outermost) cluster layer down through the (n−1)-th and lower cluster layers to the 1st (innermost) cluster layer; hits at each layer direct the scanning into the next layer, until the BSS indexes of the compressed data blocks in a hit cluster at the innermost layer are scanned.

Let us call this indexing the Multiple-cluster Statistics Signature (MSS) indexing scheme. The scanning first starts at the outermost layer of clusters. Once we identify the hit clusters at this level, we then proceed to the scanning of the next level of clusters (i.e., $C^{p-1}_j$ for $j \in \{1, \ldots, X_n\}$). The process repeats until it reaches the innermost cluster, and finally the BSS indexes of the blocks (i.e., $B_k$ for $k \in \{1, \ldots, X_b\}$) in the hit clusters are scanned and the hit blocks are collected for decompression.

Definition 5.2 (The MSS Index and Value Range) Assume that there are $n_j$ clusters at level (p−1) being contained in a j-th cluster at level p. The Multiple-cluster Statistics Signature (MSS) index at the p-th level of clusters is given by $C^p_j = \langle s^p_j, c^p_j \rangle$, where $c^p_j = \{c^{p-1}_1, \ldots, c^{p-1}_{n_j}\}$ is a collection of the $n_j$ clusters at the immediate lower level (i.e., at the (p−1)-th level). The index value is given by $s^p_j = \langle \min^{\alpha_p}_j, \max^{\alpha_p}_j, sum^{\alpha_p}_j, count^{\alpha_p}_j \rangle$. The parameters are $\min^{\alpha_p}_j = Min(\min^{\alpha_{p-1}}_1, \ldots, \min^{\alpha_{p-1}}_{n_j})$, $\max^{\alpha_p}_j = Max(\max^{\alpha_{p-1}}_1, \ldots, \max^{\alpha_{p-1}}_{n_j})$, $sum^{\alpha_p}_j = sum^{\alpha_{p-1}}_1 + \cdots + sum^{\alpha_{p-1}}_{n_j}$, and $count^{\alpha_p}_j = count^{\alpha_{p-1}}_1 + \cdots + count^{\alpha_{p-1}}_{n_j}$. Similar to $l^{\beta}$, we define the value range for an MSS indexed cluster at the p-th level, denoted as $l^{\alpha_p}_j$, as $\langle \min^{\alpha_p}_j, \max^{\alpha_p}_j \rangle$.

The searching of the hit clusters and blocks in the MSS indexing scheme is formally presented as the following pseudo-code algorithm, designated as SEARCH($C^p_j$).

Algorithm 1 (SEARCH($C^p_j$))
begin
  Result := ∅;
  if p = 0 do
    let $C^0_j = \{B_{1,j}, \ldots, B_{n_j,j}\}$ where $B_{i,j} = \langle s^{\beta}_{i,j}, b_{i,j} \rangle$;
    for i = 1 to $n_j$ do
      scan($s^{\beta}_{i,j}$);
      if hit($B_{i,j}$) = true do Result := Result ∪ $\{b_{i,j}\}$;
    end for
  else
    let $C^p_j = \{C^{p-1}_{1,j}, \ldots, C^{p-1}_{n_j,j}\}$ where $C^{p-1}_{i,j} = \langle s^{\alpha_{p-1}}_{i,j}, c^{p-1}_{i,j} \rangle$;
    for i = 1 to $n_j$ do
      scan($s^{\alpha_{p-1}}_{i,j}$);
      if hit($C^{p-1}_{i,j}$) = true do Result := Result ∪ SEARCH($C^{p-1}_{i,j}$);
    end for
  return Result;
end.

The notation in Algorithm 1 is heavy, since there is more than one cluster at different levels in general. For example, the term $s^{\beta}_{i,j}$ represents the BSS index value of the i-th block in the j-th cluster, and the term $c^{p-1}_{i,j}$ represents the i-th cluster at the (p−1)-th level, which is a member of the j-th cluster at the p-th level. In essence, Algorithm 1 performs scanning on MSS indexes from the outermost level in order to find the hit blocks at the innermost level, which is done in a depth-first manner. The complexity of Algorithm 1 is O(mn), where m and n are the numbers of levels and blocks, respectively.

We now state the unit processing cost equation for the MSS indexing scheme, which is a generalized form of Equation 21. The parameters are naturally extended in a corresponding way; for example, $K_i$ and $\pi_i$ indicate the number of elements in a cluster and the cluster hitting probability at the i-th MSS layer for $i \in \{1, \ldots, n\}$. When i = 0, $K_0 = K_b$, which represents the number of data elements in a block.

$$Y_n = \frac{C_s}{K_n} + \frac{K_n}{K_{n-1}}C_s P + \frac{K_n K_{n-1}}{K_{n-2}}C_s P^2 + \cdots + \frac{K_n K_{n-1} \cdots K_1}{K_0}C_s P^n + C_d K_n \cdots K_0 P^{n+1}. \qquad (26)$$

The following theorem is a generalization of Theorem 5.1 to n clusters.

Theorem 5.2 The unit processing cost, $Y_n$, for $n \ge 1$ has a local minimum value for sufficiently small selectivity values.

Proof. Equation 26 can be written in a recursive form as follows:

$$Y_1 = \frac{C_s}{K_1} + \frac{K_1}{K_0}C_s P + C_d K_1 K_0 P^2, \quad \text{when } n = 0;$$
$$Y_{n+1} = \frac{C_s}{K_{n+1}} + K_{n+1} P\,(Y_n), \quad \text{when } n \ge 1. \qquad (27)$$


Setting the first-order derivatives of the above equation to zero yields a critical point; we then verify that this point is a minimum. We proceed with the proof by induction on $n$.

(Basis): When $n = 0$, by Theorem 5.1, we have proved that $Y_1$ has a local minimum for sufficiently small $P$.¹ In other words, we have the minimum of $Y_1$ at

$$(K_0, K_1) = \left(\sqrt{\frac{C_s}{C_dP}},\ \left(\frac{C_s}{4C_dP^3}\right)^{\frac{1}{4}}\right).$$

(Induction): We assume that $Y_k$ in Equation 26 has a local minimum at $\mathbf{K}' = (K'_0, \ldots, K'_k)$ for $k \geq 1$. We need to show that $Y_{k+1}$ also has a local minimum at $\mathbf{K}'' = (K''_0, \ldots, K''_{k+1})$. First, there is a solution, $\mathbf{K}$, for the system of equations $\frac{\partial Y_{k+1}}{\partial K_i} = 0$, where $(k+1) \geq i \geq 1$. Second, the corresponding Hessian of $Y_{k+1}$ is positive-definite. The Hessian is a standard tool in calculus for verifying whether a critical point of a function of several variables is a minimum (cf. [14]).

From Equation 27, we express the first-order derivative of $Y_{k+1}$ for $k \geq i \geq 1$ as follows,

$$\frac{\partial Y_{k+1}}{\partial K_i} = K_{k+1}P\left(\frac{\partial Y_k}{\partial K_i}\right). \qquad (28)$$

By the inductive assumption, it follows that $\frac{\partial Y_k}{\partial K_i} = 0$ at $\mathbf{K}' = (K'_0, \ldots, K'_k)$ for $k \geq i \geq 1$. By Equation 28, it then follows that $\frac{\partial Y_{k+1}}{\partial K_i} = 0$ for $k \geq i \geq 1$. We now show that $\frac{\partial Y_{k+1}}{\partial K_i} = 0$ when $i = k+1$.

From Equation 27 again, we have the first-order derivative for $i = k+1$ as follows,

$$\frac{\partial Y_{k+1}}{\partial K_{k+1}} = -\frac{C_s}{K_{k+1}^2} + P(Y_k), \qquad (29)$$

which has the solution $K_{k+1} = \sqrt{\frac{C_s}{PY_k(\mathbf{K}')}}$ when $\frac{\partial Y_{k+1}}{\partial K_{k+1}} = 0$.

It remains for us to show that the $(k+1)$-dimensional point $\mathbf{K} = (K'_0, \ldots, K'_k, K_{k+1})$ is the minimum of $Y_{k+1}$. Equivalently, we show that the Hessian of $Y_{k+1}$ at $\mathbf{K}$ is positive-definite. We first define the Hessian function, $H$, of $Y_{k+1}$ by (cf. page 252 in [14])

$$HY_{k+1}(\mathbf{K})(\mathbf{h}) = \frac{1}{2}\sum_{i,j=1}^{k+1}\frac{\partial^2 Y_{k+1}}{\partial K_i\partial K_j}(\mathbf{K})\,h_ih_j, \qquad (30)$$

where $\mathbf{h} = \langle h_1, \ldots, h_{k+1} \rangle$. By expanding Equation 30, we have the following expression,

$$HY_{k+1}(\mathbf{K})(\mathbf{h}) = \frac{1}{2}\left(\sum_{i,j=1}^{k}\frac{\partial^2 Y_{k+1}}{\partial K_i\partial K_j}(\mathbf{K})\,h_ih_j + \sum_{i=1}^{k}\frac{\partial^2 Y_{k+1}}{\partial K_i\partial K_{k+1}}(\mathbf{K})\,h_ih_{k+1} + \sum_{j=1}^{k}\frac{\partial^2 Y_{k+1}}{\partial K_{k+1}\partial K_j}(\mathbf{K})\,h_{k+1}h_j + \frac{\partial^2 Y_{k+1}}{\partial K_{k+1}^2}\,h_{k+1}^2\right). \qquad (31)$$

¹ When $n = 0$ in Equation 27, $Y_1$ is simply the case of CSS, as already shown in Equation 21.


From Equations 28 and 29, we express the derivatives of $Y_{k+1}$ in terms of those of $Y_k$ as follows,

$$HY_{k+1}(\mathbf{K})(\mathbf{h}) = \frac{1}{2}\left(K_{k+1}P\sum_{i,j=1}^{k}\frac{\partial^2 Y_k}{\partial K_i\partial K_j}(\mathbf{K})\,h_ih_j + P\sum_{i=1}^{k}\frac{\partial Y_k}{\partial K_i}(\mathbf{K})\,h_ih_{k+1} + P\sum_{j=1}^{k}\frac{\partial Y_k}{\partial K_j}(\mathbf{K})\,h_{k+1}h_j + \frac{2C_s}{K_{k+1}^3}\,h_{k+1}^2\right). \qquad (32)$$

By the inductive assumption, we have $\frac{\partial Y_k}{\partial K_i}(\mathbf{K}) = 0$, and therefore the second and third summation terms are zero. Furthermore, it follows from Equation 30 that the first summation term is equal to $K_{k+1}P(HY_k(\mathbf{K})(\mathbf{h}))$, which is positive-definite by the inductive assumption, since $P$ and $K_{k+1}$ are positive numbers. The fourth term, $\frac{2C_s}{K_{k+1}^3}h_{k+1}^2$, is a product of positive quantities. Therefore, we conclude that $HY_{k+1}(\mathbf{K})(\mathbf{h})$ is positive-definite. □
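As a quick sanity check of the basis step, the following Python sketch (using sympy, with illustrative cost parameters Cs = 1, Cd = 10 and P = 0.01 that are not taken from the paper) verifies symbolically that the gradient of $Y_1$ vanishes at the closed-form optimum and that the Hessian there is positive-definite:

    import sympy as sp

    K0, K1 = sp.symbols('K0 K1', positive=True)
    Cs, Cd, P = 1, sp.Integer(10), sp.Rational(1, 100)  # illustrative values only

    # Y1 from Equation 27 (the CSS case)
    Y1 = Cs / K1 + (K1 / K0) * Cs * P + Cd * K1 * K0 * P**2

    # Closed-form optimum from the basis step of Theorem 5.2
    opt = {K0: sp.sqrt(Cs / (Cd * P)),
           K1: (Cs / (4 * Cd * P**3)) ** sp.Rational(1, 4)}

    grad = [sp.simplify(sp.diff(Y1, v).subs(opt)) for v in (K0, K1)]
    H = sp.hessian(Y1, (K0, K1)).subs(opt)

    print(grad)                    # -> [0, 0]: the gradient vanishes
    print(H.is_positive_definite)  # -> True: the critical point is a minimum

Notably, the cross second derivative vanishes at the optimum, so the Hessian is diagonal there and positive-definiteness reduces to the positivity of the two diagonal entries.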

From Equation 27 we now deduce the expression of the minimum cost, $Y_{min}(n)$, as follows:

$$Y_{min}(n+1) = \frac{C_s}{K_{n+1}} + K_{n+1}P(Y_{min}(n)) = C_s\sqrt{\frac{PY_{min}(n)}{C_s}} + PY_{min}(n)\sqrt{\frac{C_s}{PY_{min}(n)}} = 2\sqrt{PC_sY_{min}(n)}. \qquad (33)$$

Now, we let $\alpha = 2\sqrt{PC_s}$ and $\beta = \sqrt{Y_{min}(1)} = \sqrt{\frac{C_s}{K_1} + \frac{K_1}{K_0}C_sP + C_dK_1K_0P^2}$. From Equation 27, we expand the expression of the minimum cost for an $n$-layer MSS, $Y_{min}(n)$, and obtain the following:

$$Y_{min}(n+1) = \alpha \cdot \alpha^{\frac{1}{2}}\alpha^{\frac{1}{4}}\cdots\alpha^{\frac{1}{2^{n-1}}}\sqrt{Y_{min}(1)} = \beta\alpha^{2(1-\frac{1}{2^n})}. \qquad (34)$$

We remark from Equation 34 that, as $n \to \infty$, $Y_{min} \to \beta\alpha^2$. In other words, $Y_{min}$ is a decreasing function of $n$ whose range lies between $\beta\alpha$ and $\beta\alpha^2$.
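To illustrate the decreasing trend predicted by Equations 33 and 34, the following small Python sketch iterates the recursion $Y_{min}(n+1) = 2\sqrt{PC_sY_{min}(n)}$ from an arbitrary positive starting cost; the parameter values are purely illustrative and not taken from our experiments:

    import math

    P, Cs = 0.01, 1.0   # illustrative selectivity and unit scanning cost
    y = 5.0             # Y_min(1): any positive starting cost

    # Iterate Equation 33; each added MSS layer lowers the minimum unit cost
    for n in range(1, 9):
        y = 2.0 * math.sqrt(P * Cs * y)
        print(f"Y_min({n + 1}) = {y:.6f}")

The printed sequence decreases monotonically with rapidly diminishing returns per added layer, consistent with the bounded range noted above.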

6 Verification of the PPB Engine

In this section we examine the essential features of the PPB engine, which are developed based on the techniques introduced in Section 2 and analyzed in Section 3. In particular, we study the relationships among the query response time, the block size, the selectivity, and the selection and aggregate queries. The different selectivities in the experiments are obtained by varying the interval range of the queries. The experiments are run on a notebook computer with the following configuration: a Pentium III 600 MHz processor, 192 MB of main memory, a 20 GB hard disk (Ultra DMA/66, 4200 rpm, 512 KB cache, 12 ms seek time), and the Windows 2000 Professional SP2 platform. We assume that the unit processing cost is proportional to the response time in our system, and that the block size used here, measured in records per block, is proportional to the block size used in Section 3, which is defined in data items per block. Formally, we have Response time (s) = $a_1 \cdot Y$ and Block size (records per block) = $a_2 \cdot K$, where $a_1$ and $a_2$ are proportionality constants determined by our system configuration.

6.1 PPB Compression Performance

We now study the compression performance of the PPB engine using four XML datasets: Weblog [24], DBLP [22], TPC-H [26] and XMark [27], which are commonly used in XML research (see the experiments in [3, 10, 21]). Some characteristics of these data sources are shown in Table 4, where E_num and A_num refer to the number of elements and attributes in the document, respectively. We briefly introduce each dataset below.

1. Weblog is constructed from the Apache webserver log [24]. The original documents are not in XML format.

2. DBLP is a collection of the XML documents freely available in the DBLP archive [22]. The documents are already in XML format.

3. TPC-H is an XML representation of the TPC-H benchmark database, which is available from the Transaction Processing Performance Council [26].

4. XMark is an XML document that models an auction website. It is generated by the generation tool provided in [27].

Dataset   File Size   Depth   E_num     A_num
Weblog    32.72 MB    3       641037    0
DBLP      40.90 MB    6       1107711   118028
TPC-H     32.30 MB    3       1022976   0
XMark     103.64 MB   11      2873293   621490

Table 4: XML Datasets for Studying PPB Compression Performance

Figure 12(a) shows the compression ratio (expressed in bits per byte) achieved by the four compressors. Notably, both XMill and PPB consistently achieve better compression ratios than Gzip. Figures 12(b) and 12(c) show the compression and decompression times (in seconds), which indicate that Gzip outperforms the other compressors. The time overhead can be explained by the fact that both XMill and PPB introduce a pre-compression phase that restructures XML documents to help the main compression process. XMill adopts by default an approximate match on a Reversed DataGuide for determining which container a data value belongs to. This group-by-enclosing-tag heuristic runs faster than the grouping method used in the PPB engine. Thus, XMill runs slightly faster than PPB, since the compression buffer window size in XMill is solely optimized for better compression [10]. One observation from Figure 12(c) is that, in general, XGrind requires longer compression and decompression times than the other three compressors, which agrees with the findings reported in [21].

Figure 12: Compression Performance of the PPB Engine. (a) Compression ratio (bits/byte); (b) compression time (s); (c) decompression time (s); the panels compare Gzip, XMill, XGrind and PPB on the DBLP, TPC-H and XMark datasets.

6.2 Selection and Aggregate Queries

We study the selection query, Q3, and the aggregate query, Q5, given in Section 4.1, running on 89 MB of XMLized Weblog data [24]. Figure 13(a) shows the query response time for evaluating low-selectivity selection queries under different block sizes. As shown in the figure, the PPB engine performs as expected and enjoys the benefits of using PPB data streams and block partitioning. The optimal point depends on the selectivity of the query. As the block size increases further, the response times for queries of different selectivities approach the same value (recall Figure 4(a) and Theorem 3.1). Figure 13(b) shows the response time for evaluating high-selectivity selection queries. The response time for these queries is consistent with our analysis in Figure 4(b). In essence, the decompression overhead increases when finer partitioning is used; when the block size increases, the response time decreases because the decompression overhead decreases.


Figure 13: Query Processing in the PPB Engine. (a) Low-selectivity selection; (b) high-selectivity selection; (c) low-selectivity aggregate; (d) high-selectivity aggregate; (e) selection query selectivity; (f) aggregate query selectivity. Panels (a) to (d) plot response time (s) against block size (records per block); panels (e) and (f) plot response time (s) against selectivity (%).

Figures 13(c) and 13(d) show the query response time for evaluating aggregate queries under different block sizes, which match our expectations from Figures 7(a) and 7(b). As we see in Figure 13(c), for aggregate queries with low selectivity, PPB can take advantage of the data stream partitioning in processing the queries, similar to the case of selection queries. It should be noted that the query response time has an optimal point, as expected. For an aggregate query with a very high selectivity value (say, at the extreme near 100%), the corresponding query response time is almost zero. The underlying reason is that when the selectivity of an imposed query is extremely large, the probability that $l_\beta$ overlaps the predicate interval, $l_p$, of the imposed query is near unity. In this case, the PPB engine uses only the BSS indexes to compute the aggregate value without decompressing many blocks. It is straightforward to see that the optimal block size for the response time of the different queries roughly falls within a short range of 250 to 500 records per block. We can therefore exploit this finding when configuring the PPB engine, where block size estimation must be done with insufficient or no query information.
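The following Python sketch illustrates this behaviour for a SUM aggregate, assuming each BSS index carries the ⟨min, max, sum, count⟩ signature of Definition 5.2: blocks whose value range lies entirely inside the predicate interval contribute their precomputed sum directly, and only partially overlapping blocks are decompressed. The decompress callable and the signature layout are illustrative, not the engine's actual storage format.

    from dataclasses import dataclass
    from typing import Callable, List, Tuple

    Interval = Tuple[float, float]

    @dataclass
    class BSS:
        lo: float     # min of the block's values
        hi: float     # max of the block's values
        total: float  # sum of the block's values
        count: int    # number of values in the block

    def bss_sum(signatures: List[BSS], lp: Interval,
                decompress: Callable[[int], List[float]]) -> float:
        # SUM over all values falling in the predicate interval lp
        lo, hi = lp
        result = 0.0
        for block_id, s in enumerate(signatures):
            if s.hi < lo or s.lo > hi:
                continue                 # miss: skip the block entirely
            if lo <= s.lo and s.hi <= hi:
                result += s.total        # l_beta inside l_p: index alone suffices
            else:
                # partial overlap: decompress the block and filter its values
                result += sum(v for v in decompress(block_id) if lo <= v <= hi)
        return result

    # Toy usage: three in-memory "blocks" stand in for compressed blocks
    blocks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 9.0]]
    sigs = [BSS(min(b), max(b), sum(b), len(b)) for b in blocks]
    print(bss_sum(sigs, (2.0, 8.0), lambda i: blocks[i]))  # -> 20.0

At very high selectivity nearly every block is fully covered, so the loop reduces to summing precomputed totals, which explains the near-zero response times observed in Figure 13(d).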

6.3 Selectivity and Block Size

Figure 13(e) shows the query response time achieved by the PPB engine for a set of selection queries with different selectivity values and block sizes. As the selectivity of the imposed queries increases, the query response time increases, because more XML fragments are expected to be selected and returned in the output. Figure 13(f) shows the query response time achieved by the PPB engine for aggregate queries with different selectivities and block sizes. When the selectivity of the imposed query is low, the query response times are extremely small. As the selectivity increases, the query response time rises rapidly to its maximum and then decreases gradually as the selectivity increases further. Compared with Figures 8(a) and 8(b), the trends of the response time in both cases are similar, except that the rate of decrease in the high-selectivity region of aggregate queries appears to depend more heavily on the block size than expected.

Figure 14: Processing Selection Queries over Different Datasets in PPB and XGrind. (a) DBLP; (b) TPC-H; (c) XMark. Each panel plots response time (s) against selectivity (%) for PPB (block size = 250) and for XGrind with exact match and range match.


6.4 Comparing PPB with XGrind

We now compare the performance of XGrind with that of the PPB engine by running the following queries:

(Q7): /dblp/article[1998 > year > 1950].
(Q8): /table/T[10010 > L_ORDERKEY > 10000].
(Q9): /open_auctions/open_auction[100 > current > 20].

As XGrind supports only exact-match and partial-match selection queries, we use these selection queries in the tests. Figures 14(a), 14(b) and 14(c) show the query response times of XGrind and PPB for processing queries Q7 to Q9 on the respective datasets. We varied the interval ranges of year, L_ORDERKEY and current to obtain different selectivities for the experiment. Clearly, the query response time obtained by the PPB engine is consistently shorter than that obtained by XGrind. The savings in response time by PPB are attributed to the use of partial decompression of data streams and to the BSS indexing adopted in the PPB engine.

7 Conclusions

In this paper, we developed and analyzed the PPB data grouping strategy in order to support querying over compressed XML data. We demonstrated that the technique is effective from both analytical and empirical points of view. We introduced the BSS indexing scheme for compressed blocks in a data stream and presented a critical analysis of partial decompression performance based on the PPB and BSS indexing techniques. Our analysis yields query processing cost expressions for answering selection and aggregate queries. We showed in Theorem 3.1 that an optimal cost exists for low-selectivity selection queries and that an asymptotic cost exists at high selectivity values. We discussed the extension of BSS into CSS and found that optimal cluster and block sizes exist in the CSS indexing scheme; these optimal values are independent of the data size, N, but not of the selectivity value, P. The CSS indexing scheme can be further generalized into the MSS indexing scheme, and we established in Theorem 5.2 the generalized result that a local minimum of the unit processing cost exists for all MSS indexing schemes if the selectivity and scanning costs are sufficiently low. We demonstrated in Sections 6.2 and 6.3 that the relationship between processing time, selectivity and block size is consistent with our analysis. We also showed that, within the scope of our study, the compression performance of PPB is comparable to that of XMill and its querying performance is superior to that of XGrind.

Overall, our analysis paves the way to optimizing query processing of compressed XML data within a path-partitioned framework. Our future work will proceed in three directions. First, we need to verify empirically how the querying performance is impacted by the MSS indexing scheme, although we have proved from the theoretical point of view that the processing cost can be reduced. Second, we need to analyze further the cost of using the PPB technique to handle a wider scope of XML semantics, for example, the join (for the For clause), semi-join (for the Where clause) and outer-join (for the Return clause) in standard XQuery expressions [29]. Third, it is both challenging and practical to consider XML query processing over compressed XML documents in distributed applications, such as distributed XML query processing and result dissemination in a co-operative framework [8]. When query evaluation is carried out over a network of clients, the work is more complex and challenging: we need to devise succinct data structures and efficient query rewriting techniques in order to organize multiple XML queries from network clients and support query evaluation over compressed documents.

References

[1] A. Arion, A. Bonifati, G. Costa, S. D'Aguanno, I. Manolescu, and A. Pugliese. Efficient Query Evaluation over Compressed XML Data. Proc. of EDBT, (2004).

[2] P. Buneman, M. Grohe and C. Koch. Path Queries on Compressed XML. Proc. of VLDB, (2003).

[3] J. Cheney. Compressing XML with Multiplexed Hierarchical PPM Models. Proc. of IEEE Data Compression Conf., pp. 163-172, (2000).

[4] J. Cheng and W. Ng. XQzip: Querying Compressed XML Using Structural Indexing. Proc. of EDBT, (2004).

[5] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley-Interscience, John Wiley & Sons, Inc., New York, (1991).

[6] A. Datta and H. Thomas. Accessing Data in Block-Compressed Data Warehouses. Workshop on Information Technologies and Systems (WITS), (1999).

[7] C. Faloutsos and S. Christodoulakis. Design of a Signature File Method that Accounts for Non-Uniform Occurrence and Query Frequencies. Proc. of VLDB, pp. 165-170, (1985).

[8] J. He, W. Ng, X. Wang and A. Zhou. An Efficient Co-operative Framework for Multi-Query Processing over Compressed XML Data. Proc. of DASFAA 2006, LNCS Vol. 3882, pp. 218-232, (2006).

[9] B. Iyer and D. Wilhite. Data Compression Support in Databases. Proc. of VLDB Conf., pp. 695-704, (1994).

[10] H. Liefke and D. Suciu. XMill: An Efficient Compressor for XML Data. Proc. of ACM SIGMOD, pp. 153-164, (2000).

[11] Z. Lin and C. Faloutsos. Frame-Sliced Signature Files. IEEE TKDE, 4(3), pp. 281-289, (1992).

[12] W. Ng, W. Y. Lam, P. T. Wood and M. Levene. XCQ: A Queriable XML Compression System. An International Journal of Knowledge and Information Systems, (2005).

[13] W. Ng, W. Y. Lam and J. Cheng. Comparative Analysis of XML Compression Technologies. Journal of World Wide Web: Internet and Web Information Systems, 9(1), pp. 5-33, (2005).

[14] J. E. Marsden and A. J. Tromba. Vector Calculus. W. H. Freeman and Company, (1988).

[15] J. K. Min, M. J. Park and C. W. Chung. XPRESS: A Queriable Compression for XML Data. Proc. of ACM SIGMOD, (2003).

[16] W. Ng and C. Ravishankar. Block-Oriented Compression Techniques for Large Statistical Databases. IEEE TKDE, 9(2), pp. 314-328, (1997).

[17] M. Poess and D. Potapov. Data Compression in Oracle. Proc. of VLDB Conf., (2003).

[18] L. Segoufin and V. Vianu. Validating Streaming XML Documents. Proc. of PODS, (2002).

[19] C. E. Shannon. A Mathematical Theory of Communication. The Bell System Technical Journal, Vol. 27, pp. 398-403, (1948).

[20] N. Sundaresan and R. Moussa. Algorithms and Programming Models for Efficient Representation of XML for Internet Applications. WWW Conf., pp. 366-375, (2001).

[21] P. M. Tolani and J. R. Haritsa. XGRIND: A Query-friendly XML Compressor. IEEE ICDE Conf., pp. 225-234, (2002).

[22] DBLP, http://dblp.uni-trier.de/

[23] Gzip, http://www.gzip.org/

[24] Log Files - Apache, http://httpd.apache.org/docs/

[25] SAX, http://www.saxproject.org/

[26] TPC-H, http://www.tpc.org/tpch/default.asp

[27] XMark, http://monetdb.cwi.nl/xml/

[28] XPath 1.0, http://www.w3c.org/TR/xpath/

[29] XQuery 1.0, http://www.w3c.org/TR/2002/WD-xquery-20020816.

[30] J. Ziv and A. Lempel. A Universal Algorithm for Sequential Data Compression. IEEE Transactions on Information Theory, Vol. IT-23, No. 3, pp. 337-343, May (1977).