
On Producing High and Early Result Throughput in Multi-join Query Plans

Justin J. Levandoski, Member, IEEE, Mohamed E. Khalefa, Member, IEEE,
and Mohamed F. Mokbel, Member, IEEE

Abstract—This paper introduces an efficient framework for producing high and early result throughput in multi-join query plans. While most previous research focuses on optimizing for cases involving a single join operator, this work takes a radical step by addressing query plans with multiple join operators. The proposed framework consists of two main methods, a flush algorithm and an operator state manager. The framework assumes a symmetric hash join, a common method for producing early results, when processing incoming data. In this way, our methods can be applied to a group of previous join operators (optimized for single-join queries) when taking part in multi-join query plans. Specifically, our framework can be applied by (1) employing a new flushing policy to write in-memory data to disk, once memory allotment is exhausted, in a way that helps increase the probability of producing early result throughput in multi-join queries, and (2) employing a state manager that adaptively switches operators in the plan between joining in-memory data and disk-resident data in order to positively affect the early result throughput. Extensive experimental results show that the proposed methods outperform the state-of-the-art join operators optimized for both single and multi-join query plans.

Index Terms—Database Management Systems, Query processing.


1 INTRODUCTION

TRADITIONAL join algorithms (e.g., see [1], [2], [3]) are designed with the implicit assumption that all input data is available beforehand. Furthermore, traditional join algorithms are optimized to produce the entire query result. Unfortunately, such algorithms are not suitable for emerging applications and environments that require results as soon as possible. Such environments call for a new non-blocking join algorithm design that is: (1) applicable in cases where input data is retrieved from remote sources through slow and bursty network connections, (2) optimized to produce as many early results as possible (i.e., produce a high early result throughput) in a non-blocking fashion, while not sacrificing performance in processing the complete query result, and (3) able to operate in concert with other non-blocking operators to provide early (i.e., online) results to users or applications. In other words, any blocking operation (e.g., sorting) must be avoided when generating online results.

Examples of applications requiring online results include web-based environments, where data is gathered from multiple remote sources and may exhibit slow and bursty behavior [4]. Here, a join algorithm should be capable of producing results without waiting for the arrival of all input data. In addition, web users prefer early query feedback, rather than

• J.J. Levandoski, M.E. Khalefa, and M.F. Mokbel are with the Department of Computer Science and Engineering, University of Minnesota - Twin Cities, 200 Union Street SE, Minneapolis, MN 55455. E-mail: {justin,khalefa,mokbel}@cs.umn.edu

waiting an extended time period for the complete result. Another vital example is scientific experimental simulation, where experiments may take up to days to produce large-scale results. In such a setting, a join query should be able to function while the experiment is running, and not have to wait for the experiment to finish. Also, scientists prefer to receive early feedback from long-running experiments in order to tell if the experiment must halt and be restarted with different settings due to unexpected results [5], [6]. Other applications that necessitate the production of early results include streaming applications [7], [8], workflow management [9], data integration [10], parallel databases [11], spatial databases [12], sensor networks [13], and moving object environments [14].

Toward the goal of producing a high early result throughput in new and emerging online environments such as those just described, several research efforts have been dedicated to the development of non-blocking join operators (e.g., see [15], [16], [17], [12], [18], [19], [20], [8], [21]). However, with the exception of [17], these algorithms focus on query plans containing a single join operator. The optimization techniques employed by these join operators focus on producing a high early result throughput locally, implying that each operator is unaware of other join operators that may exist above or below in the query pipeline. Optimizing for local throughput does not necessarily contribute to the goal of producing a high overall early result throughput in multi-join query plans.

In general, the ability of join queries to produce high early result throughput in emerging environments lies in the optimization of two main tasks:


(1) The ability to handle large input sizes by flushing the least beneficial data from memory to disk when memory becomes exhausted during runtime, thus allowing new input tuples to produce early results, and (2) The ability to adapt to input behavior by producing results in a state (e.g., in-memory or on-disk) that positively affects early result throughput. In this paper, we explore a holistic approach to optimizing these two tasks for multi-join query plans by proposing two methods that consider join operators in concert, rather than as separate entities. The first method is a novel flushing scheme, named AdaptiveGlobalFlush, that intelligently selects a portion of in-memory data that is expected to contribute to result throughput the least. This data is written to disk once the memory allotment for the query plan is exhausted, thus increasing the probability that new input data will help increase overall result throughput. The second method is a novel state manager module that is designed to fetch disk-resident data that will positively affect result throughput. Thus, the state manager directs each operator in the query pipeline to the most beneficial state during query runtime. These two methods assume that symmetric hash join, a common non-blocking join operator [17], [18], [21], is used to compute join results. This assumption makes the AdaptiveGlobalFlush method and the state manager module compatible with previous operators optimized for single-join query plans. Also, while these two methods are mainly designed to produce high early throughput in multi-join query plans, they also ensure the production of complete and exact query results, making them suitable for applications that do not tolerate approximations.

To the authors' knowledge, the state spilling approach [17] is the only work to have considered multi-join query plans. However, the work presented in this paper distinguishes itself from state-spilling and all other previous work through two novel aspects. First, the AdaptiveGlobalFlush algorithm attempts to maximize overall early result throughput by building its decisions over time based on a set of simple collected statistics that take into account both the data input and the result output. Second, the state manager is a novel module that does not exist in previous work, and is designed specifically for multi-join query plans. The state manager has the ability to switch any join operator back and forth between joining in-memory data and disk-resident data during runtime, based on the operation most beneficial for the set of pipelined join operators to produce early result throughput. When making its decision, the state manager module maintains a set of accurate, lightweight statistics that help in predicting the contribution of each join operator state. In general, the contributions of this paper can be summarized as follows:

1) We propose a novel flushing scheme, AdaptiveGlobalFlush, applicable to any hash-based join algorithm in a multi-join query plan. AdaptiveGlobalFlush helps produce high early result throughput in multi-join query plans while being adaptive to the data arrival patterns.

2) We propose a novel state manager module that directs each join operator in the query pipeline to join either in-memory data or disk-resident data in order to produce high early result throughput.

3) We provide experimental evidence that our methods outperform the state-of-the-art join algorithms optimized for early results in terms of efficiency and throughput.

The rest of this paper is organized as follows: Section 2 highlights related work. Section 3 presents an overview of the methods proposed in this paper. The AdaptiveGlobalFlush method is described in Section 4. Section 5 presents the state manager module. Correctness of AdaptiveGlobalFlush and the state manager is covered in Section 6. Experimental evidence that our methods outperform other optimization techniques is presented in Section 7. Section 8 concludes the paper.

2 RELATED WORK

The symmetric hash join [21] is the most widely used non-blocking join algorithm for producing early join results. However, it was designed for cases where all input data fits in memory. With the massive explosion of data sizes, several research attempts have aimed to extend the symmetric hash join to support disk-resident data. Such algorithms can be classified into three categories: (1) hash-based algorithms [10], [18], [19], [20] that flush in-memory hash buckets to disk either individually or in groups, (2) sort-based algorithms [15], [22] in which in-memory data is sorted before being flushed to disk, and (3) nested-loop-based algorithms [16] in which a variant of the traditional nested-loop algorithm is employed. Also, several methods have been proposed to extend these algorithms for other operators, e.g., MJoin [8] extends XJoin [20] for multi-way join operators, while hash-based joins have been extended for spatial join operators [12]. However, these join algorithms employ optimization techniques that focus only on the case of a single join operator, with no applicable extension for multi-join query plans.

In terms of memory flushing algorithms, previous techniques can be classified into two categories: (1) Flushing a single hash bucket. Examples of this category include XJoin [20], which aims to flush the largest memory hash bucket regardless of its input source, and RPJ [19], which evicts a hash bucket based on an extensive probabilistic analysis of input rates. (2) Flushing a pair of corresponding buckets from both data sources. Examples of this category include hash-merge join [18], which keeps the memory balanced between the input sources, and state spilling [17], which attempts to maximize the overall early result


[Fig. 1. Join Overview and Example Query Plan. (a) Join Operation Overview: incoming data enters a buffer feeding the in-memory join; when memory is full, data is flushed to disk (Sec. 4), and the state manager (Sec. 5) moves each operator among three states (in-memory join, joining disk-resident data, and temporary blocking) guided by observed and collected statistics, producing the join result. (b) Query Plan: a left-deep plan in which O1 joins inputs A and B, O2 joins with input C, and O3 joins with input D to produce the result.]

throughput of the query plan. Our proposed AdaptiveGlobalFlush method belongs to the latter category, as flushing a pair of buckets eliminates the need for extensive timestamp collection. Timestamps are necessary when flushing a single bucket to avoid duplicate join results [19], [20].

A family of previous work can be applied to streaming environments where input data is potentially infinite (e.g., [23], [24], [25]). Such algorithms aim to produce an approximate join result based on either load shedding [26], [27] or the definition of a sliding window over incoming streams [28]. Since our methods are concerned with finite input and exact query results, discussion of infinite data streams is outside the scope of this paper.

To the authors' knowledge, the closest work to ours is PermJoin [29] and the state-spilling method [17]. PermJoin does not discuss specific methods for producing early results in multi-join query plans, only a generic framework. The basic idea behind state spilling is to score each hash partition group (i.e., symmetric hash buckets) in the query plan based on its current contribution to the query result. When memory is full, the partition group with the lowest score is flushed to disk. Once all inputs have finished transmission, the state-spill approach joins disk-resident data using a traditional sort-merge join. Our proposed AdaptiveGlobalFlush method differs from the state-spilling flush approach in that it is predictive, taking into account both input and output characteristics to make an optimal flush decision. Furthermore, our proposed state manager is not found in previous work, as it continuously operates during query runtime to place operators in an optimal state with regard to overall throughput. State-spill and other previous work consider changing operator state only when sources block or data transmission has terminated.

3 OVERVIEW

This section gives an overview of the two novel methods we propose for producing a high early result throughput in multi-join query plans. These methods are: (1) A new memory flushing algorithm, designed with the goal of evicting the data from memory that will contribute to the result throughput the least, and optimized for overall (rather than local) early result throughput. (2) A state manager module, designed with the goal of placing each operator in a state that will positively affect result throughput. Each operator can function in an in-memory, on-disk, or blocking state.

An overview of how our memory flushing algorithm and state manager can be added to existing non-blocking hash-based join algorithms is given in the state diagram in Figure 1(a). As depicted in the diagram, whenever a new tuple r is received by the input buffer from source S of operator O, the state manager determines how the tuple is processed. If O is currently not in memory, the tuple r will be temporarily stored in the buffer until O is brought back to memory. Otherwise, r will be immediately used to produce early results by joining it with in-memory data. Initially, all join operators function in memory. Once memory becomes full, the memory flushing algorithm frees memory space by flushing a portion of in-memory data to disk. During query runtime, each join operator may switch between processing results using either memory-resident or disk-resident data.

Memory Flushing. Most hash-based join algorithms optimized for early results employ a flushing policy to write data to disk once memory is full. In a multi-join query plan, policies that optimize for local throughput will likely perform poorly compared to policies that consider all operators together to optimize for overall throughput. For example, given the plan in Figure 1(b), if a local policy constantly flushes data from O1, then downstream operators (O2 and O3) will be starved of data, degrading overall result throughput. We introduce a flushing policy, AdaptiveGlobalFlush, that can be added to existing hash-based join algorithms when part of a multi-join query plan. AdaptiveGlobalFlush opts to flush pairs of hash buckets, where flushing a bucket from an input source implies flushing the corresponding hash bucket from the opposite input at the same time. AdaptiveGlobalFlush evicts the pairs of hash buckets that have the lowest contribution to the overall query output, using an accurately collected set of statistics that reflect both the data input and query output patterns. Details of AdaptiveGlobalFlush are covered in Section 4.

State Manager. The main responsibility of the state manager is to place each join operator in the most beneficial state in terms of producing high early result throughput. These states, as depicted in Figure 1(a), are: (1) Joining in-memory data, (2) Joining disk-resident data, or (3) Temporary blocking, i.e., not performing a join operation. As a motivating example, consider the query pipeline given in Figure 1(b). During query runtime, sources A and B may be transmitting data, while sources C and D are blocked. In this case, query results can only be generated from the base operator O1.


[Fig. 2. Join and Disk Merge Examples. (a) Symmetric Hash Join: sources A and B each maintain a hash table of N buckets; an arriving tuple (1) probes the opposite table to produce join results, then (2) is inserted into its own table. (b) On-Disk Merge: partition group h1 before and after merging, with flushed partitions a1,1 and a1,2 from source A (tuples 4, 7, 9 and 1, 6) and b1,1 and b1,2 from source B (tuples 1, 4 and 2, 6, 7); in-memory processing produced the results (4,4) and (6,6), while the on-disk merge produces (1,1) and (7,7).]

The overall query results produced by O1 rely on the selectivity of the two operators above it in the pipeline (i.e., O2 and O3). If the selectivity of these operators is low, merging disk-resident data at either O2 or O3 may be more beneficial in maximizing the overall result throughput than performing an in-memory join at the base operator O1. Thus, the state manager may decide to place O3 in the in-memory (i.e., default) state and O2 in the on-disk merge state, while O1 is placed in a low-priority state. Section 5 covers state manager details.

Architecture assumptions. We assume the in-memory join employs the symmetric hash join algorithm to produce join results [21]. Figure 2(a) gives the main idea of the symmetric hash join. Each input source (A and B) maintains a hash table with hash function h and n buckets. Once a tuple r arrives from input A, its hash value h(r) is used to probe the hash table of B and produce join results. Then, r is stored in hash bucket h(r) of source A. A similar scenario occurs when a tuple arrives at source B. Symmetric hash join is typical in many join algorithms optimized for early results [18], [20], [21]. We also assume a join operator in the disk merge state employs a disk-based sort-merge join to produce results (e.g., see [17], [18]). Figure 2(b) gives an example of the sort-merge join in which partition group h1 has been flushed twice. The first flush resulted in writing partitions a1,1 and b1,1 to disk, while the second flush resulted in writing a1,2 and b1,2. Results from joining disk-resident data are produced by joining (and merging) a1,1 with b1,2 and a1,2 with b1,1. Partitions (a1,1, b1,1) and (a1,2, b1,2) do not need to be merged, as they were already joined while residing in memory; the results produced from in-memory and on-disk operations are labeled in Figure 2(b). Flushing partition groups and using a disk-based sort-merge has the major advantage of not requiring timestamps for removal of duplicate results [17], [18]; once a partition group is flushed to disk it is not used again for in-memory joins.
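To make the probe-then-insert order concrete, the following Python sketch implements the symmetric hash join just described for a single equality predicate, with flushing omitted. It is a minimal illustration under those assumptions, not the paper's implementation, and all names are ours:

    from collections import defaultdict

    class SymmetricHashJoin:
        def __init__(self):
            # One hash table per input source; Python's dict hashing plays h.
            self.tables = {"A": defaultdict(list), "B": defaultdict(list)}

        def insert(self, source, key, tup):
            other = "B" if source == "A" else "A"
            # (1) Probe the opposite table and emit any matches.
            matches = self.tables[other].get(key, [])
            results = [(tup, m) if source == "A" else (m, tup) for m in matches]
            # (2) Store the tuple in its own table for future probes.
            self.tables[source][key].append(tup)
            return results

    shj = SymmetricHashJoin()
    shj.insert("A", key=4, tup=("a", 4))
    print(shj.insert("B", key=4, tup=("b", 4)))  # [(('a', 4), ('b', 4))]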

In this work, we consider left-deep query plans with m binary join operators (m > 1), where the base operator joins two streaming input sources while all other operators join one streaming input with the output of the operator below. An example query plan is given in Figure 1(b). We choose to study left-deep plans as they are common in databases. Our methods can apply to bushy trees, where our flush algorithm and state manager can be applied to the leaf nodes of the plan. Future work will involve adapting our methods to non-leaf nodes in bushy multi-join query plans. Further, an extension of our framework to the case of query plans consisting of more than one m-way join operator is straightforward.

4 MEMORY FLUSHING

In online environments, most join operators optimized for early throughput produce results using symmetric hash join [18], [20], [21]. If the memory allotment for the query plan is exhausted, data is flushed to disk to make room for new input, thus continuing the production of early results. In this section, we present the AdaptiveGlobalFlush algorithm that aims to produce a high early overall throughput for multi-join query plans. As discussed in Section 3, the AdaptiveGlobalFlush policy flushes partition groups simultaneously (i.e., corresponding hash partitions from both hash tables). The main idea behind AdaptiveGlobalFlush is to consider partition groups across all join operators in concert, by iterating through all possible groups, scoring them based on their expected contribution to the overall result throughput, and finally flushing the partition group with the lowest score, i.e., the partition group that is expected to contribute to the overall result throughput the least.

In general, a key to the success of any flushing policy lies in its ability to sustain a high result throughput; in other words, the ability to flush the hash table data that contributes the least to the result throughput. To avoid the drawbacks of previous flushing techniques [17], [18], [19], we identify the following three important characteristics that should be taken into consideration by any underlying flushing policy in order to accurately determine which in-memory data to flush:

• Global contribution of each partition group. Global contribution refers to the number of overall results that have been produced by each partition group. In multi-join query plans, the global contribution of each partition is continuously changing. For instance, any non-root join in the plan is affected by the selectivity of the join operators residing above it in the operator chain. These selectivities may change over time due to new data arriving and possibly being evicted due to flushing. The ability to successfully predict the global contribution of each partition group is a key to the success of a flushing policy in producing result throughput.

• Data arrival patterns at each join operation. Since data is transmitted through unreliable network connections, input patterns (i.e., arrival rates or delays) can significantly change the overall throughput during the query runtime. For example, a higher data arrival rate implies that a join operator (and its partition groups) will contribute more to the overall throughput. Likewise, a lower data arrival rate implies less contribution to the overall throughput. Thus, a flushing policy should consider the fluctuations of input patterns in order to accurately predict the overall throughput contribution of a partition group.

• Data properties. Considering data properties is important as they directly affect the population, and hence the result production, of each partition group in a join operator. Such properties could be the join attribute distribution or whether the data is sorted. For instance, if input data is sorted in a many-to-many or one-to-many join, a partition group j may contribute a large amount to the overall result throughput within a time period T, as data in a sorted group (i.e., all tuples have join attribute N) may all hash to j. However, after time T, j may not contribute to the result throughput for a while, as tuples from a new sorted group (i.e., with join attribute N+1) now hash to a different partition K. Therefore, the flushing policy should be able to adapt during query runtime to data properties in order to accurately predict the population of a partition group.

TABLE 1: Statistics

Class  | Statistic  | Definition
Size   | prtSizeSj  | Size of hash partition j for input S
Size   | grpSizej   | Size of partition group j
Size   | tupSizeS   | Tuple size for input S
Input  | inputtot   | Total input count
Input  | uniquej    | No. of unique values in partition group j
Input  | prtInputSj | Input count at partition j of input S
Output | obsLocj    | Local output of partition group j
Output | obsGloj    | Global output of partition group j

4.1 Adaptive Global Flush Algorithm

AdaptiveGlobalFlush is a novel algorithm that aims to flush the in-memory hash partition groups that contribute the least to the overall result throughput. The decisions made by the AdaptiveGlobalFlush algorithm mainly depend on the three characteristics just discussed, namely, global contribution, data arrival patterns, and data properties. The main idea of the AdaptiveGlobalFlush algorithm is to collect and observe statistics during the query runtime to help the algorithm choose the least useful partition groups to flush to disk. Flushing the least useful groups increases the probability that new input data will continue to help produce results at a high rate. We begin by discussing the statistical model that forms the “brains” behind the decision made by AdaptiveGlobalFlush to evict data from memory to disk. We then discuss the steps taken by AdaptiveGlobalFlush, and how it uses its statistical model, when flushing must occur. Throughout this section, we refer to the running example given in Figure 3, depicting a pipeline query plan with two join operators, JoinAB and JoinABC.

4.1.1 Statistics

The AdaptiveGlobalFlush algorithm is centered around a group of statistics observed and collected during the query runtime (summarized in Table 1). These statistics help the algorithm achieve its ultimate goal of determining, and flushing, the partition groups with the least potential of contributing to the overall result throughput. The collected statistics serve three distinct purposes, grouped into the following classes:

Size statistics. The size statistics summarize the data residing inside a particular operator at any particular time. For each operator O, we keep track of the size of each hash partition j at each input S, denoted prtSizeSj. In Figure 3, the operator JoinAB has four partitions (i.e., two partition groups) with sizes prtSizeA1 = 150, prtSizeB1 = 30, prtSizeA2 = 100, and prtSizeB2 = 15. We also track the size of each pair of partitions, grpSizej; this value is simply the sum of the symmetric partitions: grpSizej = prtSizeAj + prtSizeBj. For example, in Figure 3, the sizes of the partition groups at JoinAB are grpSize1 = 150 + 30 = 180 and grpSize2 = 100 + 15 = 115. Finally, we keep track of the tuple size for the input data at each source S, assuming the tuple size is constant at each join input. In Figure 3, we observe tuple sizes of tupSizeA = 10 and tupSizeB = 8.

Input statistics. The input statistics summarize the properties of incoming data at each operator. Due to the dynamic nature of data in an online environment, the input statistics are collected over a time interval [ti, tj] (i < j) and updated directly after the interval expiration (we cover this maintenance in Section 4.2). Specifically, for each operator O we track the total number of input tuples received from all inputs, inputtot. Also, for each partition group j in operator O, we track the statistic uniquej that indicates the number of unique join attribute values observed in j throughout the query runtime. Finally, for each input S and hash partition j, we track the number of tuples received at each partition j, prtInputSj. For the example in Figure 3, we assume the following values have been collected over a certain period of time: inputtot = 100, unique2 = 5, prtInputA2 = 6, and prtInputB2 = 4.
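As a concrete illustration, these counters admit constant-time updates on each arriving tuple. The following sketch is our own (field names mirror Table 1; tracking uniquej with a set is one possible realization of the running unique count):

    from collections import defaultdict

    class InputStats:
        def __init__(self):
            self.input_tot = 0                   # inputtot: tuples from all inputs
            self.unique_vals = defaultdict(set)  # len() gives unique_j for group j
            self.prt_input = defaultdict(int)    # prtInput_Sj per (input, partition)

        def on_arrival(self, source, partition, join_value):
            self.input_tot += 1
            self.unique_vals[partition].add(join_value)
            self.prt_input[(source, partition)] += 1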

Output statistics. Opposite of the input statistics, the output statistics summarize the results produced by the join operator(s). Like the input statistics, the output statistics are collected and updated over a predefined time interval. Specifically, for each partition group j in each join operator, we track two output statistics, namely, the local output (obsLocj) and the global output (obsGloj). In a pipeline query plan, the local output of a join operator O is the number of tuples sent from O to the join above it in the operator tree. In Figure 3, all tuples that flow from JoinAB to JoinABC are considered local output of JoinAB.


[Fig. 3. Flush Example Plan. A two-operator pipeline in which JoinAB (inputs A, B) feeds JoinABC (input C), producing the result. Partition sizes at JoinAB: A1 = 150, A2 = 100, B1 = 30, B2 = 15; at JoinABC: AB1 = 100, AB2 = 75, C1 = 80, C2 = 50. Observed values (JoinAB): N = 2,000; unique2 = 5; prtInputA2 = 6; prtInputB2 = 4; inputtot = 100; tupSizeA = 10; tupSizeB = 8. Calculated values (JoinAB): expPrtInputA2 = [2,000 × (6/100)]/10 = 12; expPrtInputB2 = [2,000 × (4/100)]/8 = 10; expLocPrtRsltA2 = (10 × 100)/5 = 200; expLocPrtRsltB2 = (12 × 15)/5 = 36; expLocRsltNEW2 = (12 × 10)/5 = 24; expLocRslt2 = 200 + 36 + 24 = 260.]

The collected statistic obsLocj in operator O is the number of tuples in the local output of O produced by partition group j. The collected statistic obsGloj is the number of tuples in the global output (i.e., the local output of the root join operator, JoinABC in Figure 3) produced by partition group j. To track this information, an identifier for the partition group(s) is stored on each tuple. Thus, each tuple stores a compact list of identifiers of the partition groups where it “lived” during query runtime. We note that storing identifiers versus re-computing the join to find the original hash group (as done in [17]) is a trade-off between storage overhead and computation costs. We choose the former method, which is more suitable for larger tuples, whereas a hash group re-computation algorithm can be implemented for smaller tuples. Once a tuple is output from the root join operator, it is considered to contribute to the global output, and hence the observed global output values for all the partition group identifiers stored on that tuple are incremented by one.
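The identifier bookkeeping can be pictured as follows; this is a hedged sketch of the scheme described above, with names of our own choosing:

    def tag(lineage, operator_id, group_id):
        # Record on the tuple that it "lived" in this partition group.
        return lineage | {(operator_id, group_id)}

    def on_root_output(lineage, obs_glo):
        # A root result credits every partition group on its lineage list.
        for ident in lineage:
            obs_glo[ident] = obs_glo.get(ident, 0) + 1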

4.1.2 AdaptiveGlobalFlush Algorithm

The AdaptiveGlobalFlush algorithm takes as input a single parameter N, representing the amount of memory that must be freed (i.e., flushed to disk). The idea behind AdaptiveGlobalFlush is to evaluate each hash partition group at each join operator and score each group based on its expected contribution to the global output. The algorithm then flushes the partition group(s) with the lowest score until at least the amount of memory N is freed and its data written to disk. We classify this process into four steps. In the rest of this section, we provide the ideas and intuition behind each step, and finish by providing a detailed look at the AdaptiveGlobalFlush algorithm.

Step 1: Input estimation. The input estimation step uses the input statistics (covered in Section 4.1.1) in order to estimate the number of tuples that will arrive at each input hash partition group Sj of each operator until the next memory flush. Since each flushing instance aims to flush enough tuples to free an amount N of memory, we know the next flush instance will occur after tuples with a combined size of N enter the entire query plan. Thus, the goal of input estimation is to predict where (i.e., at which partition group) each of these new tuples will arrive. We note that the size N encompasses both new tuples streaming into the query plan and intermediate results generated by joins lower in the operator tree. Estimating the number of tuples that will arrive at each partition group is the first step in gaining an idea of how “productive” each group will be.

Step 2: Local output estimation. The local output estimation step uses the size statistics and the input estimation from Step 1 in order to estimate the contribution of each partition group to the local output of its join operator until the next flushing instance. The intuition behind this step is as follows. By tracking size statistics, we know, for each binary join operator, the size of each partition j for each input A and B. Further, through input estimation, we have an idea of how many new tuples will arrive at partitions Aj and Bj. Thus, this step estimates a partition group's contribution to the local output as the combination of three calculations: (1) The output expected from the newly arrived tuples at partition Aj joining with tuples already residing in partition Bj. (2) The output expected from the newly arrived tuples at partition Bj joining with tuples already residing in partition Aj. (3) The output expected from the newly arrived tuples at partition Aj joining with the newly arrived tuples at partition Bj.

Step 3: Partition group scoring. The partition group scoring step uses the observed output statistics and the local output estimation from Step 2 in order to score each partition group j across all operators O, where the lowest score implies that j is the least likely to contribute to the overall global result output. The rationale for this score is to maximize the global output per number of tuples in the partition group. For example, consider two partition groups, P with size 20 and Q with size 10, each having an expected global output of 25. In this case, Q would be assigned a better score as it has a higher global output count per number of tuples. The intuition behind this approach is that using local output estimation, we know how many local results will be produced by each partition group j. Through our observed output statistics, we can form a ratio obsGloj/obsLocj that allows us to estimate how many of these local results will become global results. We normalize the estimated global contribution by the partition size to derive a final score giving us the global output per partition size.

Step 4: Partition group flushing. The final partition group flushing step uses the scores from the partition group scoring in Step 3 to flush partition groups to disk. This step simply chooses the lowest scoring partition group across all operators, and flushes it to disk. It iterates until tuples totaling the size N have been flushed to disk.

Detailed Algorithm. We now turn to the details behind AdaptiveGlobalFlush. Algorithm 1 gives the pseudo-code for the algorithm, where each step previously outlined is denoted by a comment. The algorithm starts by iterating over all partition groups within all operators (Lines 2 to 4 in Algorithm 1).


Algorithm 1 AdaptiveGlobalFlush Algorithm

1: Function AdaptiveGlobalFlush(N)
2: for all Operators O do
3:   A ← O.sideA; B ← O.sideB; SizeFlushed ← 0
4:   for all Partition Groups j ∈ O do
5:     /* Step 1: Input estimation */
6:     expPrtInputAj ← (N · prtInputAj / inputtot) / tupSizeA
7:     expPrtInputBj ← (N · prtInputBj / inputtot) / tupSizeB
8:     /* Step 2: Local output estimation */
9:     expLocPrtRsltAj ← (prtSizeAj · expPrtInputBj) / uniquej
10:    expLocPrtRsltBj ← (prtSizeBj · expPrtInputAj) / uniquej
11:    expLocRsltNEWj ← (expPrtInputAj · expPrtInputBj) / uniquej
12:    expLocRsltj ← expLocPrtRsltAj + expLocPrtRsltBj + expLocRsltNEWj
13:    /* Step 3: Partition group scoring */
14:    expGloRsltj ← expLocRsltj · (obsGloj / obsLocj)
15:    grpScorej ← expGloRsltj / grpSizej
16:  end for
17: end for
18: /* Step 4: Partition group flushing */
19: while SizeFlushed ≤ N do
20:   Pj ← the partition group with the lowest grpScorej across all operators
21:   Flush Pj to disk
22:   SizeFlushed += (prtSizeAj × tupSizeA) + (prtSizeBj × tupSizeB)
23: end while

Step 1 calculates an estimate of the number of new input tuples for each side S of the partition group j, denoted as expPrtInputSj. This value is calculated by first multiplying the expected input size of all tuples in the query plan, N, by the observed ratio of tuples that have hashed to side S of j, formally prtInputSj / inputtot. Second, this value is divided by the tuple size tupSizeS for input S to arrive at a number of tuples. As an example, in Figure 3, assuming that N = 2,000, the expected input for each partition in group 2 of JoinAB is: expPrtInputA2 = [2,000 × (6/100)]/10 = 12 and expPrtInputB2 = [2,000 × (4/100)]/8 = 10, given the tuple sizes of 10 and 8 at inputs A and B, respectively.

Step 2 (Lines 9 to 12 in Algorithm 1) estimates a partition group's local output as follows. (1) The output from new tuples arriving at side B joining with existing data in partition side A is computed by multiplying the two values expPrtInputBj (from Step 1) and prtSizeAj (a size statistic). However, to accommodate the fact that a hash bucket j may contain up to uniquej different values, we divide the computed value by uniquej. As an example, in Figure 3, this value for partition A2 at JoinAB is expLocPrtRsltA2 = (10 × 100)/5 = 200. (2) The output from new tuples arriving at side A joining with existing data in partition side B is computed in a symmetric manner by exchanging the roles of A and B. For example, in Figure 3, this value for partition B2 at JoinAB is expLocPrtRsltB2 = (12 × 15)/5 = 36. (3) The output estimate from newly arrived tuples from both A and B is computed by multiplying expPrtInputAj by expPrtInputBj and dividing by uniquej. For example, in Figure 3, this value for partition group 2 at JoinAB is (12 × 10)/5 = 24. Finally, the local output estimation for partition group j, expLocRsltj, is the sum of the three previously computed values. In the example given in Figure 3, this value for partition group 2 at JoinAB is expLocRslt2 = 200 + 36 + 24 = 260.

Step 3 (Lines 14 to 15 in Algorithm 1) proceeds by first estimating the global output of partition group j (denoted expGloRsltj) by multiplying expLocRsltj by the ratio obsGloj / obsLocj. For example, assume that in Figure 3 the global/local ratio for partition group 2 in JoinAB is 0.5 (i.e., half of its local output contributes to the global output). Then the expected global output for this partition group is expGloRslt2 = 260 × 0.5 = 130. Finally, the algorithm assigns a score (grpScorej) to partition group j by dividing the expected global output by the size of the partition group (Line 15 in Algorithm 1).

In Step 4 (Lines 19 to 23 in Algorithm 1), the algorithm flushes the partition groups with the lowest scores (grpScorej) until the desired data size N is reached.
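For concreteness, the following Python sketch reproduces the Steps 1 to 3 arithmetic on the Figure 3 running example. It is our own illustration of the formulas above (parameter names mirror Table 1), not the authors' code:

    def group_score(N, prt_input_a, prt_input_b, input_tot,
                    tup_size_a, tup_size_b, prt_size_a, prt_size_b,
                    unique, glo_loc_ratio):
        # Step 1: expected new tuples arriving at each side of group j.
        exp_in_a = (N * prt_input_a / input_tot) / tup_size_a   # 12
        exp_in_b = (N * prt_input_b / input_tot) / tup_size_b   # 10
        # Step 2: expected local output (new x old, old x new, new x new).
        exp_loc = (prt_size_a * exp_in_b + prt_size_b * exp_in_a
                   + exp_in_a * exp_in_b) / unique              # 260
        # Step 3: expected global output, normalized by group size.
        exp_glo = exp_loc * glo_loc_ratio                       # 130
        return exp_glo / (prt_size_a + prt_size_b)

    # Partition group 2 of JoinAB, with a global/local ratio of 0.5:
    print(group_score(N=2000, prt_input_a=6, prt_input_b=4, input_tot=100,
                      tup_size_a=10, tup_size_b=8, prt_size_a=100,
                      prt_size_b=15, unique=5, glo_loc_ratio=0.5))
    # 1.1304... (130 expected global results per 115 resident tuples)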

4.2 Scoring Accuracy

The quality of the AdaptiveGlobalFlush algorithm depends on the accuracy of the collected statistics. The size statistics (i.e., grpSizej and prtSizeSj) are exact, as they reflect the current cardinality that changes only when new tuples are received or memory flushing takes place. The input statistic uniquej is exact, as it reflects a running total of the unique join attributes that arrive at partition group j. Also, we assume the statistic tupSizeS is static, i.e., that tuple size will not change during runtime. However, the other input statistics (i.e., prtInputSj and inputtot) and the output statistics (i.e., obsLocj and obsGloj) are observed over a time interval [ti, tj] (i < j). In this section, we present three methods, namely, Recent, Average, and EWMA, for maintaining such input and output statistics. Figure 4 gives an example of each method for the statistic obsLocj over four observations made from t0 to t4.

Recent. This method stores the most recent observation. In Figure 4, the Recent method stores only the value 60 observed between time periods t3 and t4. The main advantage of this method is ease of use, as no extra calculations are necessary. This method is not well suited for slow and bursty behavior.

Average. In this method, the average of the last N observations is stored, where N is a predefined constant. In Figure 4, the Average method stores the average of the last N = 3 observations from time period t1 to t4, which is 77. The average method is straightforward, as it considers the last N time periods to model the average behavior of the environment. However, it does not account for the fact that recent behavior should weigh more than old behavior.

Exponential Weighted Moving Average (EWMA). Using the EWMA method, each past observation is aged as follows: if the statistic at time period ti has value stati, and over the time period [ti, ti+1] we observe a new value ObsVal, then the value of the collected statistic is updated to be stati+1 = α · stati + (1 − α) · ObsVal, where α is a decay constant (0 < α < 1).

[Fig. 4. Statistic Maintenance Sample. Four observations of obsLocj (50, 100, 70, 60) over the time periods t0 through t4 yield: Recent: obsLocj = 60; Average (N = 3): obsLocj = 77; EWMA (α = 0.2): obsLocj = 63.]

By choosing a smaller α, this method forgets past observations quickly and gives a greater weight to recent observations. The opposite is true for a larger α. In Figure 4, the EWMA method (with α = 0.2) tracks all previous observations, but weights each new observation by a factor of 0.8 while weighting old observed values by 0.2. The EWMA method gives a more accurate model for statistic maintenance.
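A short sketch of the three maintenance methods, reproducing the Figure 4 numbers (our own illustration, not the paper's code):

    def recent(obs):
        return obs[-1]                        # keep only the last observation

    def average(obs, n=3):
        window = obs[-n:]                     # mean of the last n observations
        return sum(window) / len(window)

    def ewma(obs, alpha=0.2):
        stat = obs[0]
        for v in obs[1:]:                     # stat = alpha*stat + (1-alpha)*ObsVal
            stat = alpha * stat + (1 - alpha) * v
        return stat

    obs = [50, 100, 70, 60]                   # obsLoc_j samples from Figure 4
    print(recent(obs), round(average(obs)), round(ewma(obs)))   # 60 77 63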

4.3 Discussion

The statistics collected and calculated by AdaptiveGlobalFlush are designed to be lightweight due to their application in join processing. For instance, instead of calculating the exact intermediate input for all non-base join operations coming from operator output lower in the query plan, we track the input independently at each input using prtInputSj. By simplifying this statistical tracking, the flushing algorithm does not have to derive correlations between partition groups at each operator to calculate the input to intermediate operators. As an example, consider a query plan with three joins as given in Figure 5(b). In order to calculate the intermediate input to JoinABCD, we would need to derive the correlation between each partition group in JoinAB and the partition groups in JoinABC (i.e., where tuples in partition group n of JoinAB will hash in JoinABC). Similarly, the correlation between groups in JoinABC and JoinABCD is needed. While calculating intermediate input this way would be more accurate than our approach, we feel the trade-off between accuracy and algorithmic complexity is important in this case.

Comparison to State-of-the-Art Algorithms. The flushing algorithms used by other state-of-the-art non-blocking join algorithms lack at least two of the three important aspects achieved by PermJoin. For example, state-spilling flushing [17] only considers global contribution, which is calculated based on the previously observed output; it does not adapt to data arrival patterns or data properties that affect the global throughput during the query runtime. The hash-merge join flushing algorithm [18] only considers data arrival patterns when attempting to keep hash tables balanced between inputs; global contribution and data properties are not taken into consideration. Finally, the RPJ flushing policy [19] adapts only to data arrival patterns and data properties by using a probability model to predict data arrival rates at each hash partition. However, global contribution is not considered by RPJ flushing. Furthermore, the RPJ policy

[Fig. 5. Multi-Join Plan Examples. (a) State Example: a pipeline of four join operators ⋈1 through ⋈4 over inputs A through E, where ⋈3 is in the on-disk merge state, ⋈4 joins in-memory data, and ⋈1 and ⋈2 are temporarily blocked. (b) Memory and Disk State: the plan JoinAB → JoinABC → JoinABCD, with in-memory partition group sizes (JoinAB: A = 150/100, B = 30/15; JoinABC: AB = 100/75, C = 80/50; JoinABCD: ABC = 140/26, D = 10/23 for groups 1/2) and disk-resident sizes at JoinABCD (ABC = 300/200, D = 60/100 for groups 1/2).]

is designed only for single-join query plans, with no straightforward extension for multi-join plans.

5 THE STATE MANAGER

While the goal of the AdaptiveGlobalFlush algorithm is to evict data from memory that will contribute to the result throughput the least, the goal of the state manager is to fetch disk-resident data that will positively affect result throughput. The basic idea behind the state manager is, at any time during query execution, to place each operator in one of the following states: (1) In-memory join, (2) On-disk merge, or (3) Temporarily blocking (i.e., waiting for a join operator above in the pipeline to finish its merge phase). The state manager constantly monitors the overall throughput potential of the in-memory and on-disk states for each join operator. Overall throughput of intermediate operators (i.e., non-root operators) refers to the number of intermediate results produced that propagate up the pipeline to form an overall query result, whereas results at the root operator always contribute to the overall query result.

The state manager accomplishes its task by invoking a daemon process that traverses the query pipeline from top to bottom. During traversal, it determines the operator O closest to the root that will produce a higher overall throughput in the on-disk state compared to its in-memory state. If this operator O exists, it is immediately directed to its on-disk merge state to process results. Meanwhile, all operators below O in the pipeline are directed to temporarily block, while all operators above O in the pipeline continue processing tuples in memory. Figure 5(a) gives an example of a pipeline where the state manager finds that join operator ⋈3 is the closest operator to the root capable of maximizing the query's global throughput when joining disk-resident data rather than in-memory data. Thus, the state manager directs ⋈3 to the on-disk state, while all the operators above ⋈3 (i.e., ⋈4) are directed to join in-memory data and all operators below ⋈3 (i.e., ⋈1 and ⋈2) are temporarily blocked.

The main rationale behind the state-manager approach is threefold. (1) Top-down traversal serves as an efficiency measure by shortcutting the search for


operators where disk-resident operations are more beneficial to overall throughput than in-memory operations. (2) Each operator in a query pipeline produces an overall throughput greater than or equal to that of the operators below it in the query plan. Thus, increasing the overall throughput of an operator higher in the pipeline implies an increase in the overall throughput of the multi-join query. For example, in Figure 5(a) the state manager places ⋈3 in the on-disk merge state without traversing the pipeline further, as the overall result throughput contributions of ⋈1 and ⋈2 are bounded by ⋈3, i.e., all intermediate results produced by ⋈1 and ⋈2 must be processed by ⋈3 before propagating further. Thus, the overall result throughput produced by in-memory operations at ⋈1 and ⋈2 cannot be greater than the overall in-memory throughput produced by ⋈3. (3) While performing on-disk merge operations at multiple operators in the query plan (e.g., ⋈3 and ⋈2) could also benefit result throughput, we feel placing a single operator in its on-disk state is sufficient due to the rationale just discussed. Furthermore, a single operator performing an on-disk join limits the context switches, random I/O, and seeks that can result when multiple requests for disk operations are present.

In the rest of this section, we first provide the high-level state-manager algorithm. We then discuss the details behind the state manager's statistics and calculations. We finish with a discussion of the relative advantages of the state manager.

5.1 State Manager Algorithm

Algorithm 2 provides the pseudo-code for the state manager module. As previously mentioned, the state manager algorithm performs a top-down traversal of all join operators in the pipeline. At each operator O, it compares the potential overall throughput for the in-memory and on-disk states. If the on-disk state is found to have a greater potential overall throughput at an operator O, the state manager directs O to the on-disk merge state immediately. All operators above O remain in the in-memory state, while all operators below O are directed to temporarily block. In general, the state manager executes in two main stages: (1) Evaluate states and (2) Direct operators. We now outline the functionality of each stage.

Stage 1: Evaluate states. The evaluate states stage performs a top-down traversal of the query plan, starting at the root operator (Line 2 in Algorithm 2). For the example given in Figure 5(b), the algorithm would traverse from JoinABCD down to JoinAB. The traversal ends when either an operator is found to switch to the on-disk state, or all operators have been visited (Line 4 in Algorithm 2). For the current operator O, the algorithm compares the overall throughput produced by the in-memory state to the throughput for the on-disk merge state (Lines 5 to 7 in Algorithm 2).

Algorithm 2 State Manager

1: Function StateManager()
2: O ← root join operator; mergeGroup ← φ
3: /* Stage 1 - Evaluate states */
4: while mergeGroup = φ AND O ≠ φ do
5:   memThroughput ← memThroughput(O) {Sec 5.2.1}
6:   diskThroughput ← Max(dskThroughput(j)), iterating through all disk partition groups j ∈ O {Sec 5.2.2}
7:   if (diskThroughput × dskConst) > memThroughput then
8:     mergeGroup ← diskThroughput.j
9:   else
10:    O ← next operator in pipeline
11:  end if
12: end while
13: /* Stage 2 - Direct operators */
14: if O = φ then
15:   for all Operators O3 in Pipeline do O3.state = IN-MEMORY
16: else
17:   O.merge ← mergeGroup; O.state ← ON-DISK
18:   for all Operators O1 above O do O1.state = IN-MEMORY
19:   for all Operators O2 below O do O2.state = BLOCK
20: end if

Details behind the in-memory and on-disk throughput calculations are covered shortly in Section 5.2. This process begins by first calculating the in-memory throughput for O (Line 5 in Algorithm 2). The on-disk merge state can only process one disk-resident partition group at a time. Thus, when comparing the on-disk throughput to the in-memory throughput, the algorithm iterates over all disk-resident partition groups j in O to find the j that will produce the maximal overall disk throughput (Line 6 in Algorithm 2). The algorithm then compares the overall throughput for an on-disk merge to the overall in-memory throughput for operator O. Since producing join results from disk is much slower than producing join results from memory, the state manager accounts for this fact by multiplying the calculated disk throughput by a disk throughput constant, dskConst. The value dskConst is a tunable database parameter, and represents the maximum number of tuples per time period (e.g., 1 second) that can be read from disk (Line 7 in Algorithm 2). If the on-disk throughput is greater, the algorithm will mark the disk partition group (Line 8 in Algorithm 2). By marking j, the algorithm ends the traversal process and moves to the next stage. If the algorithm does not mark a partition group, traversal continues until all operators have been visited.

Stage 2: Direct operators. This stage of the algorithm is responsible for directing each operator in the query pipeline to the appropriate state based on the analysis in Stage 1. Stage 2 starts by checking whether the traversal in Stage 1 found an operator O to move to the on-disk state (Line 14 in Algorithm 2). If this is not the case, all operators in the query pipeline are directed to the in-memory state (Line 15 in Algorithm 2). Otherwise, we direct O to its on-disk state, where it is asked to merge the marked disk partition group from Stage 1 (Line 17 in Algorithm 2). Also, all operators above O in the pipeline will be directed to the in-memory state, while all operators below O in the pipeline will be directed to block (Lines 18 to 19 in Algorithm 2).
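The two stages condense to a short routine. The sketch below is our own rendering of Algorithm 2, under the assumption that plan lists the operators root-first and that each operator exposes the throughput estimates of Section 5.2; all names are illustrative:

    IN_MEMORY, ON_DISK, BLOCK = "in-memory", "on-disk", "block"

    def state_manager(plan, dsk_const):
        # Stage 1: top-down search for the first operator whose best
        # on-disk merge beats its in-memory throughput.
        chosen, group = None, None
        for op in plan:                                   # root first
            best = max(op.disk_groups, key=op.dsk_throughput, default=None)
            if best is not None and \
               op.dsk_throughput(best) * dsk_const > op.mem_throughput():
                chosen, group = op, best
                break
        # Stage 2: direct every operator according to the outcome.
        if chosen is None:
            for op in plan:
                op.state = IN_MEMORY
            return
        chosen.state, chosen.merge = ON_DISK, group
        i = plan.index(chosen)
        for op in plan[:i]:                               # above the chosen operator
            op.state = IN_MEMORY
        for op in plan[i + 1:]:                           # below the chosen operator
            op.state = BLOCK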


5.2 Throughput Calculation

From the high-level overview, we can see that the core of the state manager is based on an accurate estimate of the throughput potential for the on-disk and in-memory states of each join operator in the query plan. We now cover the statistics and calculations necessary to estimate these throughput values.

5.2.1 Memory Throughput

The overall result throughput produced by the in-memory state at operator O can be estimated as the sum of all observed global output values for its partition groups. Formally, this estimation is calculated as:

memThroughput(O) = memConst × (Σ_{j=1}^{|O|} obsGlo_j)

where |O| represents the number of partition groups in operator O and obsGlo_j is the observed global output estimate over a predefined time interval (using a method from Section 4.2) for a partition group j in O (see Table 1). Much like the disk constant dskConst presented in Stage 1 of the state manager algorithm, the memory throughput calculation uses a tunable parameter memConst that models the maximum number of results that can be produced in memory over a constant time period (e.g., 1 second).
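As a minimal sketch (assuming, hypothetically, that each operator keeps its partition groups in a list and each group carries its obs_glo estimate as an attribute), the calculation reduces to one line of Python:

def mem_throughput(op, mem_const=1000):
    # Sum the observed global output estimates (obsGlo_j) over all
    # partition groups j in operator op, scaled by the tunable rate
    # constant memConst; the value 1000 here is only a placeholder.
    return mem_const * sum(g.obs_glo for g in op.partition_groups)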

5.2.2 Disk Throughput

The on-disk merge state allows only one disk-resident partition group to be merged at a time, meaning the overall disk throughput at an operator O is represented by the disk group that produces the maximal overall throughput when merged. The overall disk throughput estimation is done per disk-resident group.

Collected statistics. The disk throughput calculation relies on two exact statistics (i.e., they are not maintained using methods from Section 4.2). First, a size statistic maintains, for each operator, the number of disk-resident tuples from input S for each partition j (dskSize_Sj). As with the in-memory size statistics, dskSize_Sj is exact, as it represents the current cardinality. The other collected statistic maintains the current number of local results expected from an operator O if it were to merge a disk partition group j (dskLocRslt_j). The statistic dskLocRslt_j is initially set to zero, increases with every flush of partition group j, and is reset to zero when the disk partition group j is merged. Upon flushing partition group j (with sides A and B) to disk, the statistic dskLocRslt_j is increased by the number of results expected to be produced by j, which can be computed as follows: the flushed partition from side B (partSizeB_j; a collected in-memory size statistic) must eventually join with data already residing on disk (from a previous memory flush) from side A (dskSizeA_j; a collected on-disk size statistic), while the flushed partition from A (partSizeA_j) joins with disk-resident data from B (dskSizeB_j). These estimations are possible because data is flushed in partition groups; new data being flushed from one partition side has not yet joined with disk-resident data from the other side. Since a partition group may contain tuples with multiple join attributes, the expected result must be divided by the number of unique join attributes observed in the partition group (unique_j; a collected in-memory input statistic). Formally, the statistic dskLocRslt_j is increased by the following value when a flush occurs:

((dskSizeA_j × partSizeB_j) + (dskSizeB_j × partSizeA_j)) / unique_j.

As an example, in Figure 5(b) (assuming unique_j = 5), if partition group 1 from JoinABCD were to be flushed to disk, the statistic dskLocRslt_1 would be increased by: ((140 × 60) + (300 × 10)) / 5 = 2280.

Overall disk throughput calculation. Estimating the overall disk throughput for merging a disk partition group j involves two steps. The first step calculates the expected global output for the merge (dskGloRslt_j). Similar to the computation made by the AdaptiveGlobalFlush algorithm, we can derive this value by multiplying the expected local output (dskLocRslt_j; a collected on-disk statistic) by the observed global/local output ratio (obsGlo_j/obsLoc_j; a collected in-memory statistic). Formally, this equation is:

dskGloRslt_j = dskLocRslt_j × (obsGlo_j / obsLoc_j).

In Figure 5(b), the global/local output ratio at JoinABCD is 1, as it is the root join operator. Therefore, if dskLocRslt_1 = 2280 for partition group 1 at JoinABCD, then dskGloRslt_1 is also 2280. The second step calculates the number of global results expected per tuple read from disk. This calculation is similar to the scoring step of the AdaptiveGlobalFlush algorithm, as we divide the expected global throughput by the size of the expected data read from disk. Formally, the equation for overall disk throughput is:

dskThroughput(j) = dskGloRslt_j / (dskSizeA_j + dskSizeB_j).

For example, in Figure 5(b) (assuming dskLocRslt_1 = 2280), the expected disk throughput for partition group 1 at JoinABCD is dskThroughput(ABCD1) = 2280 / (300 + 60) = 6.33. Also, since producing join results from disk is much slower than producing join results from memory, the estimation for the throughput of an on-disk merge must be bounded by the time needed to read data from disk. In Section 5.1, we described how the state manager accounts for this bound using the tunable parameter dskConst that represents disk throughput.
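The two collected statistics and the throughput estimate might be maintained as in the sketch below. The per-group fields (dsk_size_a, part_size_b, unique, obs_glo, obs_loc, and so on) are hypothetical names mirroring the statistics above; this is not the paper's implementation.

def on_flush(g):
    # Increase dskLocRslt_j when partition group j is flushed:
    # each newly flushed side joins the other side's data already
    # on disk, divided by the distinct join-attribute count.
    g.dsk_loc_rslt += (g.dsk_size_a * g.part_size_b
                       + g.dsk_size_b * g.part_size_a) / g.unique

def dsk_throughput(g):
    # Expected global results per disk-resident tuple read:
    # scale the local estimate by the observed global/local output
    # ratio, then divide by the amount of data read from disk.
    dsk_glo_rslt = g.dsk_loc_rslt * (g.obs_glo / g.obs_loc)
    return dsk_glo_rslt / (g.dsk_size_a + g.dsk_size_b)

Plugging in the Figure 5(b) numbers used above (dskLocRslt_1 = 2280, a global/local ratio of 1, and 360 disk-resident tuples) reproduces the 6.33 estimate.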

5.3 Discussion

One main goal of join algorithms optimized for early result throughput is the ability to produce results when sources block. When an operator is optimized for single-join query plans, the decision to join data previously flushed to disk is straightforward, as disk-resident data is used to produce results only when both sources block [18], [19]. For multi-join queries, the state-spilling algorithm [17] does not consider using disk-resident data to produce results while input is streaming to the query. Disk-resident data is used by state-spilling in its cleanup phase, producing results when all input has terminated. Our proposed state manager goes beyond the idea of using disk-resident data only in the cleanup phase by invoking the disk merge operation when it is expected to increase the overall result throughput of the query. In this regard, the state manager is a novel algorithm for multi-operator query plans that uses disk-resident data to maintain a high result throughput outside the cleanup phase.

If the state manager finds that the on-disk state can produce a greater overall throughput than the in-memory state at an operator O, it immediately directs O to enter its on-disk state. While this case certainly holds when both inputs to operator O block, it may also hold otherwise. For instance, input rates at operator O may slow down, decreasing its ability to produce overall results. If operator O is not a base operator, then one of these input rates is directly affected by the throughput of an operator located below it in the pipeline. Thus, merging disk-resident data at operator O is beneficial if it is guaranteed to generate tuples that will increase the overall result throughput.

Input Buffering. The state manager is able to place an operator in an on-disk or blocking state even if data is still streaming into the query plan at or below the on-disk operator. One consequence of this freedom is buffer overrun, i.e., exhausting the buffer memory for incoming tuples. One method to avoid buffer overrun is to ensure that switching operators to the on-disk and blocking phase will not cause buffer overrun. Using our collected statistics, we can easily perform this check. Let BufSpace be the available space in the buffer. Further, let DskTime be the time necessary to perform an on-disk merge. DskTime can be calculated as the time needed to read an on-disk partition group j. Using our disk size statistics, this value is:

DskTime = (dskSizeA_j + dskSizeB_j) / dskConst

where dskConst is our tunable parameter modeling disk throughput. Also, recall from Section 4.1.1 that we measure (over time interval [t_i, t_j]) the incoming count to each partition j for each join input S as prtInput_Sj. Let InputSum be the sum of all prtInput_Sj values for partitions residing in join operators at or below the operator about to be placed in an on-disk state. Then, we can model the time until the next buffer overrun as:

BufTime = (t_j − t_i) × BufSpace / InputSum.

The state manager can then reliably place the operators in the on-disk and blocking states, without the threat of a buffer overrun, if the value BufTime is greater than DskTime.
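Under the same hypothetical statistics layout as the earlier sketches, the overrun check is a direct comparison of the two times; here prt_input is assumed to hold the incoming tuple count measured over the interval [t_i, t_j].

def safe_to_switch(group, ops_at_or_below, buf_space,
                   interval, dsk_const):
    # DskTime: time to read the on-disk partition group j.
    dsk_time = (group.dsk_size_a + group.dsk_size_b) / dsk_const
    # InputSum: tuples arriving per interval at or below the
    # operator about to enter its on-disk state.
    input_sum = sum(p.prt_input
                    for op in ops_at_or_below
                    for p in op.partitions)
    # BufTime: time until the incoming-tuple buffer overruns.
    buf_time = interval * buf_space / input_sum
    # Safe only if the merge finishes before the buffer fills.
    return buf_time > dsk_time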

Cleanup. Because the AdaptiveGlobalFlush algorithm and the state manager are designed for multi-join query plans, results that are not produced during the in-memory or on-disk phase require special consideration. In this vein, we outline a cleanup process for each operator, compatible with our framework, that ensures a complete result set is produced. In this phase, each operator begins cleanup by flushing all memory-resident data to disk after receiving an end-of-transmission (EOT) signal from both inputs. After this flush, all disk-resident partition groups are merged, producing a complete result set for that operator. The order of operator cleanup is important due to the top-down dependency of pipeline operators. An example of this dependency is given in Figure 5(b). In order for JoinABC to produce a complete result set, it must receive all input from source C and JoinAB before beginning the cleanup phase. Cleanup starts at the base operator and ends at the root of the query plan. To enforce cleanup order, we assume each operator in the chain can send an EOT signal to the next pipeline operator after completing its cleanup.
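The ordered cleanup might look like the following sketch; the operator methods (wait_for_eot, flush_memory_to_disk, merge, send_eot_to_parent) are hypothetical names for the steps just described.

def cleanup(pipeline):
    # Walk the plan bottom-up: the base operator finishes first, so
    # each parent sees its complete input before its own cleanup.
    for op in pipeline:                 # base-first, root-last
        op.wait_for_eot()               # EOT received from both inputs
        op.flush_memory_to_disk()       # spill remaining in-memory data
        for group in op.disk_groups:    # merge every partition group,
            op.merge(group)             # completing this operator's result
        op.send_eot_to_parent()         # release the next operator up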

6 CORRECTNESS

Correctness of both AdaptiveGlobalFlush and the state manager can be mapped to previous work. In terms of producing all results, we use symmetric hash join to produce in-memory results. Further, we flush partition groups and use a variant of sort-merge join to produce disk results. Using these methods ensures that a single operator will produce all results [17], [18]. These methods also ensure that a single operator will produce a unique result exactly once in either the in-memory or on-disk states [18]. If the base operator in a query plan guarantees production of all results exactly once, the ordered cleanup phase ensures that all results from the base operator will propagate up the query plan. It follows that this cleanup phase ensures that using our methods in a multi-join query plan will produce all results exactly once.

7 EXPERIMENTAL RESULTS

This section provides experimental evidence for the performance of the AdaptiveGlobalFlush algorithm and the state manager, by implementing them along with symmetric hash join [21]. We compare this implementation to the optimizations used by the state-of-the-art non-blocking join operators optimized for multi-join queries (i.e., state-spill [17]) as well as single-join queries (i.e., Hash-Merge Join (HMJ) [18] and RPJ [19]). Unless mentioned otherwise, all experiments use three join operations with four inputs A, B, C, and D, as given in Figure 5(b). All input arrival rates exhibit slow and bursty behavior. To model this behavior, input rates follow a Pareto distribution, widely used to model slow and bursty network behavior [30]. Each data source contains 300K tuples with join key attributes uniformly distributed in a range of 600K values, where each join key "bucket" is assigned to a source with equal probability. Thus, given M total join key buckets assigned to each join input S, there will be X (X < M) overlapping join key buckets and (M − X) non-overlapping join key buckets at each operator. Memory for the query plan is set to 10% of total input size.

For our proposed methods, statistical maintenance (see Section 4.1.1) is performed every five seconds (t = 5) using the EWMA method with an α value of 0.5. These parameters were determined through initial tuning experiments (not included for space reasons). For flushing policies, the percentage of data evicted per flush is constant at 5% of total memory.
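For concreteness, the workload and statistic maintenance can be approximated as in the Python sketch below; the Pareto shape and time scale are illustrative choices, not parameters reported by the paper.

import random

def bursty_delays(n=300_000, shape=1.5, scale_sec=0.001):
    # Inter-arrival delays drawn from a Pareto distribution to
    # model slow/bursty sources (shape/scale are illustrative).
    return [scale_sec * random.paretovariate(shape) for _ in range(n)]

def source_keys(n=300_000, key_range=600_000):
    # Join keys drawn uniformly from the 600K-value key range.
    return [random.randrange(key_range) for _ in range(n)]

def ewma(prev, observed, alpha=0.5):
    # EWMA update applied to each statistic every t = 5 seconds.
    return alpha * observed + (1 - alpha) * prev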

7.1 Input Behavior

Figure 6 gives the performance of AdaptiveGlobalFlush and the state manager (abbr. AGF+SM) compared to that of state-spill, HMJ, and RPJ when run in environments with four different input behaviors. In these tests, we focus on the first 100K results produced, corresponding to early query feedback. Figure 6(a) gives the results for the default behavior (all inputs slow/bursty), and demonstrates the usefulness of AGF+SM. AdaptiveGlobalFlush is able to keep beneficial data in memory, thus allowing incoming data to produce a better overall early throughput. Meanwhile, the state manager helps to produce results under the slow and bursty input behavior, where sources may block. The state-spill flush policy does not adapt to the slow and bursty input behavior in this case, and thus flushes beneficial data to disk early in the query runtime. Also, state-spill does not employ a state manager capable of finding on-disk data at each operator to help produce results; it only invokes its disk phase once all inputs terminate. HMJ and RPJ, optimized for single-join queries, base their flushing decisions on maximizing output locally at each operator. However, these flushing policies are unaware of how data at each operator is being processed by subsequent operators in the pipeline, flooding the pipeline with tuples that are unlikely to produce results higher in the query plan. HMJ invokes its disk phase once both inputs to an operator block. However, in a multi-join query, this scenario will likely occur only at the base operator, as one input of each intermediate operator is generated by a join operation residing below it in the pipeline. In this case, disk results from the base operator can be pushed up the pipeline without producing results at subsequent operators, causing flushing to occur higher in the pipeline while inputs block, which also degrades performance. On the other hand, RPJ performs better than HMJ due to its ability to switch from memory to disk operations based on its statistical model that predicts local throughput. For multi-join queries, a state manager is needed to manage processing at all operators in concert.

Figures 6(b) through 6(d) give the performance of AGF+SM compared to state-spill, HMJ, and RPJ for three, two, and one input(s) exhibiting slow and bursty behavior, respectively. In this left-to-right progression, as the inputs exhibit steadier behavior, the disk phase is used less, allowing us to pinpoint the relative behavior of the flushing policies alone. While the performance of each algorithm converges from left to right, AGF+SM consistently outperforms state-spill by at least 30%, RPJ by 40%, and HMJ by 45% in producing early results. AGF+SM also exhibits consistently steady behavior in both slow/bursty and steady environments. Finally, we note that these experiments show the benefit of optimization techniques designed for multi-join query plans, given the performance of state-spill and AGF+SM compared to that of RPJ and HMJ. Also, due to the similar behavior of HMJ and RPJ relative to state-spill and AGF+SM, subsequent experiments use only HMJ to represent algorithms optimized for single-join query plans.

7.2 Memory Size

Figure 7 gives the time needed for AGF+SM, state-spill, and HMJ to produce 100K results with query plan memory sizes 5% above and below the default memory size (i.e., 10%) for two different environments. Figures 7(a) and 7(b) give the performance for inputs exhibiting the default behavior, while Figures 7(c) and 7(d) give the performance for inputs A and B exhibiting steady behavior while C and D are slow and bursty. For smaller memory (Figures 7(a) and 7(c)), flushing occurs earlier, causing more flushes during early result production. In this case, AdaptiveGlobalFlush is able to predict beneficial in-memory data and keep it in memory in order to maintain a high early overall throughput. Furthermore, when all inputs are slow/bursty (Figure 7(a)), the state manager helps in early result production. With larger memory (Figures 7(b) and 7(d)), more room is available for incoming data, hence less flushing occurs early on, meaning the symmetric hash join is able to process most early results when memory is plentiful. For both input behaviors, state-spill and AGF+SM perform relatively similarly early on when more memory is available. Meanwhile, HMJ performs relatively better for larger memory sizes, but its flushing and disk policies remain a drawback in multi-join query plans when inputs are slow/bursty compared to a steady environment.

7.3 Data Flush Size

Figure 8 gives the performance for different data flush sizes. Figure 8(a) shows the time needed to produce the first 100K results when smaller and larger percentages of total memory are evicted per flush operation, while Figure 8(b) focuses on the specific case of a 10% data flush. When the percentage is very low (1%), flushes occur often, causing delays in processing.


[Figure: Fig. 6. Input Behavior. Four panels plot Time (sec) against Number of Results × 1000 for AGF+SM, State Spill, RPJ, and HMJ: (a) ABCD_B, (b) A_S, BCD_B, (c) AB_S, CD_B, (d) ABC_S, D_B.]

[Figure: Fig. 7. Query Memory Size. Four panels plot Time (sec) against Number of Results × 1000 for AGF+SM, State Spill, and HMJ: (a) ABCD_B, Mem 5%; (b) ABCD_B, Mem 15%; (c) AB_S CD_B, Mem 5%; (d) AB_S CD_B, Mem 15%.]

[Figure: Fig. 8. Data Flush Size. (a) Flush Progression: Time (sec) against Data Flushed (%), from 5% to 35%; (b) Flush 10%: Time (sec) against Number of Results × 1000. Series: AGF+SM, State Spill, HMJ.]

In general, writing less data to disk (in a reasonable range) during a flush operation leads to a greater number of early results, as more data is available in memory for arriving tuples. However, more overall flush operations are performed during the query runtime, leading to a large number of on-disk buckets with low cardinality, which is less beneficial to the disk-merge phase. Writing more data to disk during a flush leads to fewer early results, as less in-memory data is present for arriving tuples, while fewer flushes are performed overall. Both state-spill and HMJ produce fewer early results in this case. However, AGF+SM performance remains stable, mainly due to the state manager: when AdaptiveGlobalFlush is required to write more tuples to disk at every flush, the state manager is able to quickly reclaim beneficial data to produce early results.

7.4 Scalability

Figure 9 gives the time needed for AGF+SM and state-spill to produce the first 15K results in 5-join (6 inputs) and 6-join (7 inputs) query plans. In these tests, HMJ was omitted, as it produced no results in the first 120 seconds. Due to the higher selectivity from more join operations, we plot the first 15K results, as cleanup begins earlier (the steep slopes in the figures) than in previous experiments. Here, AGF+SM is able to delay the cleanup phase longer.

[Figure: Fig. 9. Number of Joins. Two panels plot Time (sec) against Number of Results × 1000 for AGF+SM and State Spill: (a) 5-Join Query, (b) 6-Join Query.]

If the AdaptiveGlobalFlush algorithm makes an incorrect decision and flushes beneficial data at an intermediate operator, the state manager is able to propagate these beneficial tuples by placing each operator in the in-memory or on-disk state in order to produce more results before cleanup. The state-spill algorithm does not make use of disk-resident data while producing these early results. Furthermore, state-spill relies only on its flushing algorithm, which scores partition groups based on global versus local contribution over size. This scoring method generally does not perform well with a large number of joins, where selectivity is high, in slow/bursty environments. Here, the ability to predict beneficial data in both the flush algorithm and the disk states is important. Thus, state-spill exhibits lower relative performance in these cases.

8 CONCLUSION

This paper introduces a framework for producing high and early result throughput in multi-join query plans. This framework can be added to existing join operators optimized for early results by implementing two novel methods. First, a new flushing algorithm, AdaptiveGlobalFlush, flushes the data predicted to contribute the least to the overall early throughput by adapting to important input and output characteristics. This adaptation allows for considerable performance improvement over the methods used by other state-of-the-art algorithms optimized for early results. Second, a novel module, termed a state manager, adaptively switches each operator in the multi-join query plan between two processing states (in-memory or on-disk) to maximize the overall early throughput. The state manager is a novel concept, as it is the first module to consider the efficient use of disk-resident data in producing early results in multi-join queries. Extensive experimental results show that the proposed methods outperform the state-of-the-art join operators optimized for both single and multi-join query plans in terms of efficiency, resilience to changing input behavior, and scalability when producing early results.

ACKNOWLEDGEMENTS

This work is supported in part by the National Science Foundation under Grants IIS0811998, IIS0811935, IIS-0952977, CNS0708604, and by Microsoft Research Gifts.

REFERENCES

[1] G. Graefe, "Query Evaluation Techniques for Large Databases," ACM Computing Surveys, vol. 25, no. 2, pp. 73–170, 1993.
[2] P. Mishra and M. H. Eich, "Join Processing in Relational Databases," ACM Computing Surveys, vol. 24, no. 1, pp. 63–113, 1992.
[3] L. D. Shapiro, "Join Processing in Database Systems with Large Main Memories," TODS, vol. 11, no. 3, pp. 239–264, 1986.
[4] T. Urhan, M. J. Franklin, and L. Amsaleg, "Cost Based Query Scrambling for Initial Delays," in SIGMOD, 1998.
[5] G. Abdulla, T. Critchlow, and W. Arrighi, "Simulation Data as Data Streams," SIGMOD Record, vol. 33, no. 1, pp. 89–94, 2004.
[6] J. Becla and D. L. Wang, "Lessons Learned from Managing a Petabyte," in CIDR, 2005.
[7] R. S. Barga, J. Goldstein, M. Ali, and M. Hong, "Consistent Streaming Through Time: A Vision for Event Stream Processing," in CIDR, 2007.
[8] S. Viglas, J. F. Naughton, and J. Burger, "Maximizing the Output Rate of Multi-Way Join Queries over Streaming Information Sources," in VLDB, 2003.
[9] D. T. Liu, M. J. Franklin, G. Abdulla, J. Garlick, and M. Miller, "Data-Preservation in Scientific Workflow Middleware," in SSDBM, 2006.
[10] Z. G. Ives, D. Florescu, M. Friedman, A. Y. Levy, and D. S. Weld, "An Adaptive Query Execution System for Data Integration," in SIGMOD, 1999.
[11] D. J. DeWitt and J. Gray, "Parallel Database Systems: The Future of High Performance Database Systems," Communications of the ACM, vol. 35, no. 6, pp. 85–98, 1992.
[12] G. Luo, J. F. Naughton, and C. Ellmann, "A Non-Blocking Parallel Spatial Join Algorithm," in ICDE, 2002.
[13] M. A. Hammad, W. G. Aref, and A. K. Elmagarmid, "Stream Window Join: Tracking Moving Objects in Sensor-Network Databases," in SSDBM, 2003.
[14] G. S. Iwerks, H. Samet, and K. P. Smith, "Maintenance of K-NN and Spatial Join Queries on Continuously Moving Points," TODS, vol. 31, no. 2, pp. 485–536, 2006.
[15] J.-P. Dittrich, B. Seeger, D. S. Taylor, and P. Widmayer, "Progressive Merge Join: A Generic and Non-blocking Sort-based Join Algorithm," in VLDB, 2002.
[16] P. J. Haas and J. M. Hellerstein, "Ripple Joins for Online Aggregation," in SIGMOD, 1999.
[17] B. Liu, Y. Zhu, and E. A. Rundensteiner, "Run-time Operator State Spilling for Memory Intensive Long-Running Queries," in SIGMOD, 2006.
[18] M. F. Mokbel, M. Lu, and W. G. Aref, "Hash-Merge Join: A Non-blocking Join Algorithm for Producing Fast and Early Join Results," in ICDE, 2004.
[19] Y. Tao, M. L. Yiu, D. Papadias, M. Hadjieleftheriou, and N. Mamoulis, "RPJ: Producing Fast Join Results on Streams through Rate-based Optimization," in SIGMOD, 2005.
[20] T. Urhan and M. J. Franklin, "XJoin: A Reactively-Scheduled Pipelined Join Operator," IEEE Data Engineering Bulletin, vol. 23, no. 2, pp. 27–33, 2000.
[21] A. N. Wilschut and P. M. G. Apers, "Dataflow Query Execution in a Parallel Main-Memory Environment," in PDIS, 1991.
[22] J.-P. Dittrich, B. Seeger, D. S. Taylor, and P. Widmayer, "On Producing Join Results Early," in PODS, 2003.
[23] J. Kang, J. F. Naughton, and S. Viglas, "Evaluating Window Joins over Unbounded Streams," in ICDE, 2003.
[24] U. Srivastava and J. Widom, "Memory-Limited Execution of Windowed Stream Joins," in VLDB, 2004.
[25] J. Xie, J. Yang, and Y. Chen, "On Joining and Caching Stochastic Streams," in SIGMOD, 2005.
[26] N. Tatbul, U. Cetintemel, S. B. Zdonik, M. Cherniack, and M. Stonebraker, "Load Shedding in a Data Stream Manager," in VLDB, 2003.
[27] Y.-C. Tu, S. Liu, S. Prabhakar, and B. Yao, "Load Shedding in Stream Databases: A Control-Based Approach," in VLDB, 2006.
[28] M. A. Hammad, M. J. Franklin, W. G. Aref, and A. K. Elmagarmid, "Scheduling for Shared Window Joins over Data Streams," in VLDB, 2003.
[29] J. J. Levandoski, M. E. Khalefa, and M. F. Mokbel, "PermJoin: An Efficient Algorithm for Producing Early Results in Multi-join Query Plans," in ICDE, 2008.
[30] M. E. Crovella, M. S. Taqqu, and A. Bestavros, "Heavy-Tailed Probability Distributions in the World Wide Web," in A Practical Guide to Heavy Tails: Statistical Techniques and Applications. Chapman & Hall, 1998.

Justin J. Levandoski received his Bachelor's degree from Carleton College, MN, USA in 2003, and his Master's degree from the University of Minnesota, MN, USA in 2008. He is pursuing a PhD degree in computer science at the University of Minnesota. His PhD research focuses on embedding preference and contextual awareness inside the database engine. His general research interests focus on extending database technologies to handle emerging applications.

Mohamed E. Khalefa received the BS and MS degrees in Computer Science and Engineering in 2003 and 2006, respectively, from Alexandria University, Egypt. He is currently pursuing a PhD in database systems at the University of Minnesota. His research interests include non-traditional query processing, preference queries, uncertain data processing, and cloud computing.

Mohamed F. Mokbel (PhD, Purdue University, 2005; MS, BSc, Alexandria University, 1999, 1996) is an assistant professor in the Department of Computer Science and Engineering, University of Minnesota. His main research interests focus on advancing the state of the art in the design and implementation of database engines to cope with the requirements of emerging applications (e.g., location-based applications and sensor networks). His research work has been recognized by two best paper awards, at IEEE MASS 2008 and MDM 2009. Mohamed is a recipient of the NSF CAREER award 2010. Mohamed has actively participated in several program committees for major conferences, including ICDE, SIGMOD, VLDB, SSTD, and ACM GIS. He is/was a program co-chair for ACM SIGSPATIAL GIS 2008, 2009, and 2010. Mohamed is an ACM and IEEE member and a founding member of ACM SIGSPATIAL. For more information, please visit http://www.cs.umn.edu/~mokbel.