Adaptive Query Processing in Data Stream Systems. Paper written by Shivnath Babu, Kamesh Munagala,...
Adaptive Query Processing in Data Stream Systems
Paper written by
Shivnath Babu
Kamesh Munagala, Rajeev Motwani, Jennifer Widom
Stanford Stream Data Manager
Itaru Nishizawa
Hitachi, Ltd.
Stanford University
Data Streams
Continuous, unbounded, rapid, time-varying streams of data elements
Occur in a variety of modern applications:
Network monitoring and intrusion detection
Sensor networks
Telecom call records
Financial applications
Web logs and click-streams
Manufacturing processes
Example Continuous Queries
Web: Amazon’s best sellers over the last hour
Network Intrusion Detection: Track HTTP packets with destination address matching a prefix in a given table and content matching “*\.ida”
Finance: Monitor NASDAQ stocks between $20 and $200 that have moved down more than 2% in the last 20 minutes
Traditional Query Optimization
Executor: Runs chosen plan to completion
Chosen query plan
Optimizer: Finds “best” query plan to process this query
Query
Statistics Manager: Periodically collects statistics, e.g., table sizes, histograms
Which statistics are required
Estimated statistics
Optimizing Continuous Queries is Different
Continuous queries are long-running
Stream characteristics can change over time
Data properties: selectivities, correlations
Arrival properties: bursts, delays
System conditions can change over time
Performance of a fixed plan can change significantly over time
Adaptive processing: find the best plan for current conditions
Traditional Optimization → Adaptive Optimization
Optimizer: Finds “best” query plan to
process this query
Executor: Runs chosen plan to
completion
Chosen query plan
Query
Statistics Manager: Periodically collects statistics, e.g., table sizes, histograms
Which statistics are required
Estimated statistics
Reoptimizer: Ensures that plan is efficient
for current characteristics
Profiler: Monitors current stream and
system characteristics
Executor: Executes current plan
Decisions to adapt
Combined in part for efficiency
Preliminaries
Let query Q process input stream I, applying the conjunction of n commutative filters F1, F2, …, Fn.
Each filter Fi takes a stream tuple e as input and returns either true or false. If Fi returns false for tuple e, we say that Fi drops e.
A tuple is emitted in the continuous query result if and only if all n filters return true.
A plan for executing Q consists of an ordering P = Ff(1), Ff(2), …, Ff(n), where f is the mapping from positions in the filter ordering to the indexes of the filters at those positions.
When a tuple e is processed by P, first Ff(1) is evaluated. If it returns false (e is dropped by Ff(1)), then e is not processed further. Otherwise, Ff(2) is evaluated on e, and so on.
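The processing loop just described can be sketched in a few lines (a minimal illustration; the filter predicates and names below are hypothetical, not from the paper):

```python
# Evaluate a tuple against a filter ordering: stop at the first filter that drops it.
def process_tuple(e, ordering):
    """Return True iff e survives every filter (i.e., e is emitted in the result)."""
    for f in ordering:
        if not f(e):      # f returns false: f drops e, processing stops here
            return False
    return True           # all filters returned true: e is emitted

# Hypothetical filters over integer tuples:
F1 = lambda e: e % 2 == 1   # drops even values
F2 = lambda e: e < 10       # drops values >= 10

print(process_tuple(3, [F1, F2]))   # True: passes both filters
print(process_tuple(4, [F1, F2]))   # False: dropped by F1
```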
Preliminaries – cont’d
At any time, the cost of an ordering O is the expected time to process an incoming tuple in I to completion (either emitted or dropped), using O.
Consider O = Ff(1), Ff(2), …, Ff(n). d(i|j) is the conditional probability that Ff(i) will drop a tuple e from input stream I, given that e was not dropped by any of Ff(1), Ff(2), …, Ff(j). The unconditional probability that Ff(i) will drop an I tuple is d(i|0).
ti is the expected time for Fi to process one tuple.
Preliminaries – cont’d
Given these notations, the cost of O = Ff(1), Ff(2), …, Ff(n) per tuple can be formalized as:
cost(O) = Σ_{i=1..n} Di · tf(i), where D1 = 1 and Di = Π_{j=1..i−1} (1 − d(j|j−1))
Notice Di is the portion of tuples that is left for operator Ff(i) to process.
The goal is to maintain filter orderings that minimize this cost at any point in time.
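The cost formula can be evaluated directly by accumulating Di incrementally (a sketch; the argument names are my own):

```python
# Direct evaluation of the per-tuple cost formula. cond_drop[i] = d(i+1|i),
# the probability that the filter at position i+1 drops a tuple that survived
# positions 1..i; times[i] is that filter's expected per-tuple processing time.
def expected_cost(cond_drop, times):
    cost, reach = 0.0, 1.0        # reach plays the role of Di
    for d, t in zip(cond_drop, times):
        cost += reach * t         # every tuple reaching this filter pays time t
        reach *= (1.0 - d)        # surviving fraction goes on to the next filter
    return cost

# Two filters, each dropping half of what reaches it, unit time each:
print(expected_cost([0.5, 0.5], [1.0, 1.0]))   # 1.0 + 0.5 = 1.5
```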
Example
In this example, a sequence of tuples is arriving on stream I: 1, 2, 1, 4, ...
We have four filters F1–F4, such that Fi drops a tuple e if and only if Fi does not contain e.
Note that all of the incoming tuples except e = 1 are dropped by some filter. For O1 = F1, F2, F3, F4, the total number of probes for the eight I tuples shown is 20. (For example, e = 2 requires three probes, F1, F2, and F3, before it is dropped by F3.)
The corresponding number for O2 = F3, F2, F4, F1 is 18.
O3 = F3, F1, F2, F4 is optimal for this example at 16 probes.
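Probe counting of this kind is easy to mechanize. The filter sets and tuple stream below are hypothetical stand-ins: the slide’s own data is only partially shown, so its probe counts are not reproduced here.

```python
# Counting probes for alternative filter orderings, as in the example above.
def count_probes(stream, ordering, filter_sets):
    """One probe per membership test; filter f drops e iff e is not in its set."""
    probes = 0
    for e in stream:
        for f in ordering:
            probes += 1
            if e not in filter_sets[f]:   # f drops e: no further probes for e
                break
    return probes

filter_sets = {1: {1, 2}, 2: {1, 4}, 3: {1}, 4: {1, 3}}
stream = [1, 2, 1, 4]
print(count_probes(stream, [1, 2, 3, 4], filter_sets))   # 11 probes
print(count_probes(stream, [3, 1, 2, 4], filter_sets))   # 10: F3 drops 2 and 4 at once
```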
Greedy Algorithm
Assume for the moment uniform times ti for all filters.
A greedy approach to filter ordering proceeds as follows:
1. Choose the filter Fi with the highest unconditional drop probability d(i|0) as Ff(1).
2. Choose the filter Fj with the highest conditional drop probability d(j|1) as Ff(2).
3. Choose the filter Fk with the highest conditional drop probability d(k|2) as Ff(3).
4. And so on.
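The steps above can be sketched directly, with the conditional drop probabilities estimated on a sample of tuples (the filters, sample data, and names here are hypothetical):

```python
# A sketch of the greedy ordering, assuming uniform filter times as the slide does.
def greedy_order(filters, sample):
    """filters: dict name -> predicate (False = drop).
    Repeatedly pick the filter dropping the most of the surviving sample tuples."""
    order, remaining, survivors = [], dict(filters), list(sample)
    while remaining:
        # d(.|prefix): how many current survivors each candidate filter would drop
        name = max(remaining,
                   key=lambda m: sum(not remaining[m](e) for e in survivors))
        order.append(name)
        survivors = [e for e in survivors if remaining[name](e)]
        del remaining[name]
    return order

filters = {"F1": lambda e: e % 2 == 0,   # drops odd values
           "F2": lambda e: e < 5,        # drops values >= 5
           "F3": lambda e: e != 2}       # drops only e == 2
print(greedy_order(filters, [1, 2, 3, 4, 5, 6, 7, 8]))   # ['F1', 'F2', 'F3']
```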
Greedy Invariant
To factor in varying filter times ti, replace d(i|0) in step 1 with d(i|0)/ti, d(j|1) in step 2 with d(j|1)/tj, and so on. We refer to this ordering algorithm as Static Greedy, or simply Greedy.
Greedy maintains the following Greedy Invariant (GI):
d(i|i−1)/tf(i) ≥ d(j|i−1)/tf(j) for all 1 ≤ i < j ≤ n
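The invariant can be checked mechanically given a table of conditional drop probabilities and filter times (a sketch with 0-based indices; the data layout is my own, not the paper’s):

```python
# d[i][j]: conditional probability that the filter at position j drops a tuple
# that survived positions 0..i-1; t[j]: expected time of the filter at position j.
def satisfies_gi(d, t):
    n = len(t)
    for i in range(n):
        for j in range(i + 1, n):
            # position i must have the best drop-rate/time ratio among the
            # remaining positions, conditioned on tuples surviving positions < i
            if d[i][i] / t[i] < d[i][j] / t[j]:
                return False
    return True

d = [[0.5, 0.3, 0.2],
     [0.0, 0.4, 0.3],
     [0.0, 0.0, 0.1]]
print(satisfies_gi(d, [1.0, 1.0, 1.0]))   # True: each diagonal entry dominates its row
```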
So far – Pipelined Filters: Stable Statistics
Assume statistics are not changing
Order filters by decreasing unconditional drop-rate/cost [prev. work]
Correlations → NP-Hard
Greedy algorithm: Use conditional selectivities
Ff(1) has maximum drop-rate/cost ratio
Ff(2) has maximum drop-rate/cost ratio for tuples not dropped by Ff(1)
And so on
Adaptive Version of Greedy
Greedy gives strong guarantees
4-approximation, best poly-time approximation possible
For arbitrary (correlated) characteristics
Usually optimal in experiments
Challenge: Online algorithm
Fast adaptivity to Greedy ordering
Low run-time overhead
A-Greedy: Adaptive Greedy
Profiler: Maintains conditional filter selectivities and costs over recent tuples
Executor: Processes tuples with current filter ordering
Reoptimizer: Ensures that filter ordering is Greedy for current statistics
Which statistics are required
Estimated statistics
Combined in part for efficiency
Changes in filter ordering
A-Greedy Profiler
For n filters, the total number of conditional selectivities is n·2^(n−1).
Clearly it is impractical for the profiler to maintain online estimates of all these selectivities.
Fortunately, to check whether a given ordering satisfies the GI, we need to check only (n + 2)(n − 1)/2 = O(n²) selectivities.
Once a GI violation has occurred, to find a new ordering that satisfies the GI we may need O(n²) new selectivities in the worst case.
The new set of required selectivities depends on the new input characteristics, so it cannot be predicted in advance.
Profiler – cont’d
The profiler maintains a profile of tuples dropped in the recent past.
The profile is a sliding window of profile tuples created by sampling tuples from input stream I that get dropped during filter processing.
A profile tuple contains n boolean attributes b1, …, bn corresponding to filters F1, …, Fn.
When a tuple e ∈ I is dropped during processing, e is profiled with some probability p, called the drop-profiling probability.
If e is chosen for profiling, processing of e continues artificially to determine whether any of the remaining filters unconditionally drop e.
Profiler – cont’d
The profiler then logs a tuple with attribute bi = 1 if Fi drops e and bi = 0 otherwise, 1 ≤ i ≤ n.
The profile is maintained as a sliding window so that older input data does not contribute to statistics used by the reoptimizer.
A sliding window of processing-time samples is also maintained to calculate the average processing time ai for each filter Fi.
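The drop-profiling step can be sketched as follows (function and variable names are my own; the paper’s actual data structures may differ):

```python
import random

def maybe_profile(e, filters, p, profile_window, max_window):
    """Called when tuple e is dropped. With probability p, evaluate ALL filters
    on e (continuing past the drop) and log one boolean profile tuple b1..bn."""
    if random.random() >= p:          # not sampled for profiling
        return
    profile_tuple = [0 if f(e) else 1 for f in filters]   # bi = 1 iff Fi drops e
    profile_window.append(profile_tuple)
    if len(profile_window) > max_window:                  # sliding window bound
        profile_window.pop(0)                             # forget the oldest tuple

window = []
maybe_profile(3, [lambda e: e % 2 == 0, lambda e: e < 5], 1.0, window, 10)
print(window)   # [[1, 0]]: F1 drops 3 (b1 = 1), F2 passes it (b2 = 0)
```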
A-Greedy Reoptimizer
The reoptimizer’s job is to maintain an ordering O such that O satisfies the GI for statistics estimated from the tuples in the current profile window.
The view maintained over the profile window is an n × n upper triangular matrix V[i, j], 1 ≤ i ≤ j ≤ n, so we call it the matrix view.
The n columns of V correspond in order to the n filters in O. That is, the filter corresponding to column c is Ff(c).
Reoptimizer – cont’d
Entries in the ith row of V represent the conditional selectivities of filters Ff(i), Ff(i+1), …, Ff(n) for tuples that are not dropped by Ff(1), Ff(2), …, Ff(i−1).
Specifically, V[i, j] is the number of tuples in the profile window that were dropped by Ff(j) among tuples that were not dropped by Ff(1), Ff(2), …, Ff(i−1).
Notice that V[i, j] is proportional to d(j|i−1).
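For concreteness, here is a sketch that rebuilds the matrix view from scratch over a profile window (the real implementation maintains V incrementally on window inserts and deletes; the data layout is assumed, with 0-based indices):

```python
# Each profile tuple b is a list with b[k] = 1 iff filter F_{k+1} drops the tuple.
# f is the current ordering: f[c] = 0-based index of the filter at position c.
def build_matrix_view(profile_window, f):
    n = len(f)
    V = [[0] * n for _ in range(n)]
    for b in profile_window:
        for i in range(n):                       # row i: survivors of positions < i
            if any(b[f[k]] for k in range(i)):   # dropped by an earlier position,
                break                            # so it is absent from later rows too
            for j in range(i, n):
                V[i][j] += b[f[j]]               # dropped by the filter at position j
    return V

# Two profile tuples, identity ordering: V stays upper triangular.
print(build_matrix_view([[1, 0], [0, 1]], [0, 1]))   # [[1, 1], [0, 1]]
```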
Updating V on an insert to profile window
Violation of GI
The reoptimizer maintains the ordering O such that the matrix view for O always satisfies the condition:
V[i, i]/af(i) ≥ V[i, j]/af(j), 1 ≤ i ≤ j ≤ n
Suppose an update to the matrix view or to a processing-time estimate causes the following condition to hold for some 1 ≤ i < j ≤ n:
V[i, i]/af(i) < V[i, j]/af(j)
Then a GI violation has occurred at position i.
Detecting a violation
An update to V or to an ai can cause a GI violation at position i either because it reduces V[i, i]/af(i), or because it increases some V[i, j]/af(j), j > i.
Correcting a violation
We may need to reevaluate the filters at positions > i because their conditional selectivities may have changed.
The adaptive ordering can thrash if both sides of the condition are almost equal for some pair of filters. To avoid thrashing, a thrashing-avoidance parameter β is introduced into the condition:
V[i, i]/af(i) ≤ β·V[i, j]/af(j), 1 ≤ i < j ≤ n
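Detection with the β-damped condition can be sketched as below. The default β value is an assumption: with β somewhat below 1, near-ties between the two ratios no longer count as violations, so the ordering does not thrash between nearly-equal filters.

```python
# V[i][j]: matrix-view counts (0-based); a[j]: processing-time estimate of the
# filter at position j. Mirrors the damped violation condition above.
def find_violation(V, a, beta=0.9):
    """Return the first position i with a GI violation, else None."""
    n = len(a)
    for i in range(n):
        for j in range(i + 1, n):
            if V[i][i] / a[i] <= beta * V[i][j] / a[j]:
                return i              # reorder filters from position i onward
    return None

print(find_violation([[10, 2], [0, 5]], [1.0, 1.0]))   # None: invariant holds
print(find_violation([[2, 10], [0, 5]], [1.0, 1.0]))   # 0: position 0 violated
```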
Tradeoffs
Suppose changes are infrequent
Slower adaptivity is okay
Want best plans at very low run-time overhead
Three-way tradeoff among speed of adaptivity, run-time overhead, and convergence properties