Adaptive Query Processing in Data Stream Systems. Paper written by Shivnath Babu, Kamesh Munagala,...
Adaptive Query Processing in Data Stream Systems
Paper written by
Shivnath Babu
Kamesh Munagala, Rajeev Motwani, Jennifer Widom
Stanford Stream Data Manager
Itaru Nishizawa
Hitachi, Ltd.
Stanford University
Data Streams
Continuous, unbounded, rapid, time-varying streams of data elements
Occur in a variety of modern applications:
Network monitoring and intrusion detection
Sensor networks
Telecom call records
Financial applications
Web logs and click-streams
Manufacturing processes
Example Continuous Queries
Web: Amazon’s best sellers over the last hour
Network Intrusion Detection: Track HTTP packets with destination address matching a prefix in a given table and content matching “*\.ida”
Finance: Monitor NASDAQ stocks between $20 and $200 that have moved down more than 2% in the last 20 minutes
Traditional Query Optimization
Executor: Runs chosen plan to completion
Chosen query plan
Optimizer: Finds “best” query plan to process this query
Query
Statistics Manager: Periodically collects statistics, e.g., table sizes, histograms
Which statistics are required
Estimated statistics
Optimizing Continuous Queries is Different
Continuous queries are long-running
Stream characteristics can change over time
Data properties: selectivities, correlations
Arrival properties: bursts, delays
System conditions can change over time
Performance of a fixed plan can change significantly over time
Adaptive processing: find the best plan for current conditions
Traditional Optimization → Adaptive Optimization
Optimizer: Finds “best” query plan to
process this query
Executor: Runs chosen plan to
completion
Chosen query plan
Query
Statistics Manager: Periodically collects statistics, e.g., table sizes, histograms
Which statistics are required
Estimated statistics
Reoptimizer: Ensures that plan is efficient
for current characteristics
Profiler: Monitors current stream and
system characteristics
Executor: Executes current plan
Decisions to adapt
Combined in part for efficiency
Preliminaries
Let query Q process input stream I, applying the conjunction of n commutative filters F1, F2, …, Fn.
Each filter Fi takes a stream tuple e as input and returns either true or false. If Fi returns false for tuple e, we say that Fi drops e.
A tuple is emitted in the continuous query result if and only if all n filters return true.
A plan for executing Q consists of an ordering P = Ff(1), Ff(2), …, Ff(n), where f is the mapping from positions in the filter ordering to the indexes of the filters at those positions.
When a tuple e is processed by P, first Ff(1) is evaluated. If it returns false (e is dropped by Ff(1)), then e is not processed further. Otherwise, Ff(2) is evaluated on e, and so on.
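The processing loop just described can be sketched in a few lines (a minimal illustration; the filter predicates and names below are hypothetical, not from the paper):

```python
# Evaluate a tuple against a filter ordering: stop at the first filter that drops it.
def process_tuple(e, ordering):
    """Return True iff e survives every filter (i.e., e is emitted in the result)."""
    for f in ordering:
        if not f(e):      # f returns false: f drops e, processing stops here
            return False
    return True           # all filters returned true: e is emitted

# Hypothetical filters over integer tuples:
F1 = lambda e: e % 2 == 1   # drops even values
F2 = lambda e: e < 10       # drops values >= 10

print(process_tuple(3, [F1, F2]))   # True: passes both filters
print(process_tuple(4, [F1, F2]))   # False: dropped by F1
```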
Preliminaries – cont’d
At any time, the cost of an ordering O is the expected time to process an incoming tuple in I to completion (either emitted or dropped), using O.
Consider O = Ff(1), Ff(2), …, Ff(n). d(i|j) is the conditional probability that Ff(i) will drop a tuple e from input stream I, given that e was not dropped by any of Ff(1), Ff(2), …, Ff(j). The unconditional probability that Ff(i) will drop an I tuple is d(i|0).
ti is the expected time for Fi to process one tuple.
Preliminaries – cont’d
Given these notations, the cost of O = Ff(1), Ff(2), …, Ff(n) per tuple can be formalized as:
cost(O) = Σ_{i=1..n} Di · tf(i), where D1 = 1 and Di = Π_{j=1..i−1} (1 − d(j|j−1))
Notice Di is the portion of tuples that is left for operator Ff(i) to process.
The goal is to maintain filter orderings that minimize this cost at any point in time.
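The cost formula can be evaluated directly by accumulating Di incrementally (a sketch; the argument names are my own):

```python
# Direct evaluation of the per-tuple cost formula. cond_drop[i] = d(i+1|i),
# the probability that the filter at position i+1 drops a tuple that survived
# positions 1..i; times[i] is that filter's expected per-tuple processing time.
def expected_cost(cond_drop, times):
    cost, reach = 0.0, 1.0        # reach plays the role of Di
    for d, t in zip(cond_drop, times):
        cost += reach * t         # every tuple reaching this filter pays time t
        reach *= (1.0 - d)        # surviving fraction goes on to the next filter
    return cost

# Two filters, each dropping half of what reaches it, unit time each:
print(expected_cost([0.5, 0.5], [1.0, 1.0]))   # 1.0 + 0.5 = 1.5
```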
Example
In this example, a sequence of tuples is arriving on stream I: 1, 2, 1, 4, ...
We have four filters F1–F4, such that Fi drops a tuple e if and only if Fi does not contain e.
Note that all of the incoming tuples except e = 1 are dropped by some filter. For O1 = F1, F2, F3, F4, the total number of probes for the eight I tuples shown is 20. (For example, e = 2 requires three probes, F1, F2, and F3, before it is dropped by F3.)
The corresponding number for O2 = F3, F2, F4, F1 is 18.
O3 = F3, F1, F2, F4 is optimal for this example at 16 probes.
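Probe counting of this kind is easy to mechanize. The filter sets and tuple stream below are hypothetical stand-ins: the slide’s own data is only partially shown, so its probe counts are not reproduced here.

```python
# Counting probes for alternative filter orderings, as in the example above.
def count_probes(stream, ordering, filter_sets):
    """One probe per membership test; filter f drops e iff e is not in its set."""
    probes = 0
    for e in stream:
        for f in ordering:
            probes += 1
            if e not in filter_sets[f]:   # f drops e: no further probes for e
                break
    return probes

filter_sets = {1: {1, 2}, 2: {1, 4}, 3: {1}, 4: {1, 3}}
stream = [1, 2, 1, 4]
print(count_probes(stream, [1, 2, 3, 4], filter_sets))   # 11 probes
print(count_probes(stream, [3, 1, 2, 4], filter_sets))   # 10: F3 drops 2 and 4 at once
```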
Greedy Algorithm
Assume for the moment uniform times ti for all filters.
A greedy approach to filter ordering proceeds as follows:
1. Choose the filter Fi with the highest unconditional drop probability d(i|0) as Ff(1).
2. Choose the filter Fj with the highest conditional drop probability d(j|1) as Ff(2).
3. Choose the filter Fk with the highest conditional drop probability d(k|2) as Ff(3).
4. And so on.
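The steps above can be sketched directly, with the conditional drop probabilities estimated on a sample of tuples (the filters, sample data, and names here are hypothetical):

```python
# A sketch of the greedy ordering, assuming uniform filter times as the slide does.
def greedy_order(filters, sample):
    """filters: dict name -> predicate (False = drop).
    Repeatedly pick the filter dropping the most of the surviving sample tuples."""
    order, remaining, survivors = [], dict(filters), list(sample)
    while remaining:
        # d(.|prefix): how many current survivors each candidate filter would drop
        name = max(remaining,
                   key=lambda m: sum(not remaining[m](e) for e in survivors))
        order.append(name)
        survivors = [e for e in survivors if remaining[name](e)]
        del remaining[name]
    return order

filters = {"F1": lambda e: e % 2 == 0,   # drops odd values
           "F2": lambda e: e < 5,        # drops values >= 5
           "F3": lambda e: e != 2}       # drops only e == 2
print(greedy_order(filters, [1, 2, 3, 4, 5, 6, 7, 8]))   # ['F1', 'F2', 'F3']
```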
Greedy Invariant
To factor in varying filter times ti, replace d(i|0) in step 1 with d(i|0)/ti, d(j|1) in step 2 with d(j|1)/tj, and so on. We refer to this ordering algorithm as Static Greedy, or simply Greedy.
Greedy maintains the following Greedy Invariant (GI):
d(i|i−1)/tf(i) ≥ d(j|i−1)/tf(j) for all 1 ≤ i < j ≤ n
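The invariant can be checked mechanically given a table of conditional drop probabilities and filter times (a sketch with 0-based indices; the data layout is my own, not the paper’s):

```python
# d[i][j]: conditional probability that the filter at position j drops a tuple
# that survived positions 0..i-1; t[j]: expected time of the filter at position j.
def satisfies_gi(d, t):
    n = len(t)
    for i in range(n):
        for j in range(i + 1, n):
            # position i must have the best drop-rate/time ratio among the
            # remaining positions, conditioned on tuples surviving positions < i
            if d[i][i] / t[i] < d[i][j] / t[j]:
                return False
    return True

d = [[0.5, 0.3, 0.2],
     [0.0, 0.4, 0.3],
     [0.0, 0.0, 0.1]]
print(satisfies_gi(d, [1.0, 1.0, 1.0]))   # True: each diagonal entry dominates its row
```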
So far – Pipelined Filters: Stable Statistics
Assume statistics are not changing
Order filters by decreasing unconditional drop-rate/cost [prev. work]
Correlations → NP-Hard
Greedy algorithm: Use conditional selectivities
Ff(1) has maximum drop-rate/cost ratio
Ff(2) has maximum drop-rate/cost ratio for tuples not dropped by Ff(1)
And so on
Adaptive Version of Greedy
Greedy gives strong guarantees
4-approximation, best poly-time approximation possible
For arbitrary (correlated) characteristics
Usually optimal in experiments
Challenge: Online algorithm
Fast adaptivity to Greedy ordering
Low run-time overhead
A-Greedy: Adaptive Greedy
Profiler: Maintains conditional filter selectivities and costs over recent tuples
Executor: Processes tuples with current filter ordering
Reoptimizer: Ensures that filter ordering is Greedy for current statistics
Which statistics are required
Estimated statistics
Combined in part for efficiency
Changes in filter ordering
A-Greedy Profiler
For n filters, the total number of conditional selectivities is n·2^(n−1).
Clearly it is impractical for the profiler to maintain online estimates of all these selectivities.
Fortunately, to check whether a given ordering satisfies the GI, we need to check only (n + 2)(n − 1)/2 = O(n²) selectivities.
Once a GI violation has occurred, to find a new ordering that satisfies the GI we may need O(n²) new selectivities in the worst case.
The new set of required selectivities depends on the new input characteristics, so it cannot be predicted in advance.
Profiler – cont’d
The profiler maintains a profile of tuples dropped in the recent past.
The profile is a sliding window of profile tuples created by sampling tuples from input stream I that get dropped during filter processing.
A profile tuple contains n boolean attributes b1, …, bn corresponding to filters F1, …, Fn.
When a tuple e ∈ I is dropped during processing, e is profiled with some probability p, called the drop-profiling probability.
If e is chosen for profiling, processing of e continues artificially to determine whether any of the remaining filters unconditionally drop e.
Profiler – cont’d
The profiler then logs a tuple with attribute bi = 1 if Fi drops e and bi = 0 otherwise, 1 ≤ i ≤ n.
The profile is maintained as a sliding window so that older input data does not contribute to statistics used by the reoptimizer.
A sliding window of processing-time samples is also maintained to calculate the average processing time ai for each filter Fi.
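The drop-profiling step can be sketched as follows (function and variable names are my own; the paper’s actual data structures may differ):

```python
import random

def maybe_profile(e, filters, p, profile_window, max_window):
    """Called when tuple e is dropped. With probability p, evaluate ALL filters
    on e (continuing past the drop) and log one boolean profile tuple b1..bn."""
    if random.random() >= p:          # not sampled for profiling
        return
    profile_tuple = [0 if f(e) else 1 for f in filters]   # bi = 1 iff Fi drops e
    profile_window.append(profile_tuple)
    if len(profile_window) > max_window:                  # sliding window bound
        profile_window.pop(0)                             # forget the oldest tuple

window = []
maybe_profile(3, [lambda e: e % 2 == 0, lambda e: e < 5], 1.0, window, 10)
print(window)   # [[1, 0]]: F1 drops 3 (b1 = 1), F2 passes it (b2 = 0)
```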
A-Greedy Reoptimizer
The reoptimizer’s job is to maintain an ordering O such that O satisfies the GI for statistics estimated from the tuples in the current profile window.
The view maintained over the profile window is an n × n upper triangular matrix V[i, j], 1 ≤ i ≤ j ≤ n, so we call it the matrix view.
The n columns of V correspond in order to the n filters in O. That is, the filter corresponding to column c is Ff(c).
Reoptimizer – cont’d
Entries in the ith row of V represent the conditional selectivities of filters Ff(i), Ff(i+1), …, Ff(n) for tuples that are not dropped by Ff(1), Ff(2), …, Ff(i−1).
Specifically, V[i, j] is the number of tuples in the profile window that were dropped by Ff(j) among tuples that were not dropped by Ff(1), Ff(2), …, Ff(i−1).
Notice that V[i, j] is proportional to d(j|i−1).
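For concreteness, here is a sketch that rebuilds the matrix view from scratch over a profile window (the real implementation maintains V incrementally on window inserts and deletes; the data layout is assumed, with 0-based indices):

```python
# Each profile tuple b is a list with b[k] = 1 iff filter F_{k+1} drops the tuple.
# f is the current ordering: f[c] = 0-based index of the filter at position c.
def build_matrix_view(profile_window, f):
    n = len(f)
    V = [[0] * n for _ in range(n)]
    for b in profile_window:
        for i in range(n):                       # row i: survivors of positions < i
            if any(b[f[k]] for k in range(i)):   # dropped by an earlier position,
                break                            # so it is absent from later rows too
            for j in range(i, n):
                V[i][j] += b[f[j]]               # dropped by the filter at position j
    return V

# Two profile tuples, identity ordering: V stays upper triangular.
print(build_matrix_view([[1, 0], [0, 1]], [0, 1]))   # [[1, 1], [0, 1]]
```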
Updating V on an insert to profile window
Violation of GI
The reoptimizer maintains the ordering O such that the matrix view for O always satisfies the condition:
V[i, i]/af(i) ≥ V[i, j]/af(j), 1 ≤ i ≤ j ≤ n
Suppose an update to the matrix view or to a processing-time estimate causes the following condition to hold for some 1 ≤ i < j ≤ n:
V[i, i]/af(i) < V[i, j]/af(j)
Then a GI violation has occurred at position i.
Detecting a violation
An update to V or to an ai can cause a GI violation at position i either because it reduces V[i, i]/af(i), or because it increases some V[i, j]/af(j), j > i.
Correcting a violation
We may need to reevaluate the filters at positions > i because their conditional selectivities may have changed.
The adaptive ordering can thrash if both sides of the condition are almost equal for some pair of filters. To avoid thrashing, a thrashing-avoidance parameter β is introduced into the condition:
V[i, i]/af(i) ≤ β·V[i, j]/af(j), 1 ≤ i < j ≤ n
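Detection with the β-damped condition can be sketched as below. The default β value is an assumption: with β somewhat below 1, near-ties between the two ratios no longer count as violations, so the ordering does not thrash between nearly-equal filters.

```python
# V[i][j]: matrix-view counts (0-based); a[j]: processing-time estimate of the
# filter at position j. Mirrors the damped violation condition above.
def find_violation(V, a, beta=0.9):
    """Return the first position i with a GI violation, else None."""
    n = len(a)
    for i in range(n):
        for j in range(i + 1, n):
            if V[i][i] / a[i] <= beta * V[i][j] / a[j]:
                return i              # reorder filters from position i onward
    return None

print(find_violation([[10, 2], [0, 5]], [1.0, 1.0]))   # None: invariant holds
print(find_violation([[2, 10], [0, 5]], [1.0, 1.0]))   # 0: position 0 violated
```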
Tradeoffs
Suppose changes are infrequent
Slower adaptivity is okay
Want best plans at very low run-time overhead
Three-way tradeoff among speed of adaptivity, run-time overhead, and convergence properties