One-Pass Wavelet Synopses for Maximum-Error Metrics
![Page 1: One-Pass Wavelet Synopses for Maximum-Error Metrics](https://reader034.fdocuments.us/reader034/viewer/2022042703/56814f56550346895dbd0244/html5/thumbnails/1.jpg)
One-Pass Wavelet Synopses for Maximum-Error Metrics
Panagiotis Karras
Trondheim, August 31st, 2005
Research at HKU with Nikos Mamoulis
![Page 2: One-Pass Wavelet Synopses for Maximum-Error Metrics](https://reader034.fdocuments.us/reader034/viewer/2022042703/56814f56550346895dbd0244/html5/thumbnails/2.jpg)
Outline
• Preliminaries & Motivation
– Usefulness of Synopses
– Haar wavelet decomposition, conventional wavelet synopses
– The maximum error guarantee problem
• Earlier Approach: Wavelet Synopses with Optimal Error Guarantees
– Impracticability of this approach
• Solution: Practicable Wavelet Synopses for Maximum Error Metrics
– Low-Complexity Algorithms that provide near-optimal error results
• Extension to Data Streams
– One-Pass adaptations of the proposed algorithms
• Conclusions & Future Directions
![Page 3: One-Pass Wavelet Synopses for Maximum-Error Metrics](https://reader034.fdocuments.us/reader034/viewer/2022042703/56814f56550346895dbd0244/html5/thumbnails/3.jpg)
Compact Data Synopses useful in:
• Approximate Query Processing (exact answers not always required)
• Learning, Classification, Event Detection
• Data Mining, Selectivity Estimation
• Situations where massive data arrives in a stream
![Page 4: One-Pass Wavelet Synopses for Maximum-Error Metrics](https://reader034.fdocuments.us/reader034/viewer/2022042703/56814f56550346895dbd0244/html5/thumbnails/4.jpg)
Haar Wavelet Decomposition
[Figure: error tree for an example data vector]
Data: 34 16 2 20 20 0 36 16
Pairwise averages, level by level: 25 11 10 26; 18 18; 18
Coefficients, root to leaves: 18 (overall average); 0; 7 -8; 9 -9 10 10
• Wavelet decomposition: orthogonal transform for the hierarchical representation of functions and signals
• Haar wavelets: simplest wavelet system, easy to understand and implement
• Extensible to many dimensions
• Error tree: structure for the visualization of decomposition and value reconstructions
• Reconstructions require logarithmically many terms, along appropriate error tree paths
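The decomposition and the path-based reconstruction described on this slide can be sketched as follows. This is a minimal illustration, assuming unnormalized coefficients (pairwise averages and half-differences) and a data length that is a power of two; the function names `haar_decompose` and `reconstruct_value` are hypothetical and not taken from the talk.

```python
def haar_decompose(data):
    """Return [overall average, detail coefficients...] in error-tree order."""
    n = len(data)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    averages = [float(x) for x in data]
    details = []  # coarser levels are prepended as they are computed
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        details = [(a - b) / 2 for a, b in pairs] + details
        averages = [(a + b) / 2 for a, b in pairs]
    return averages + details


def reconstruct_value(coeffs, i):
    """Rebuild data value i from the log2(N) + 1 coefficients on its error-tree path."""
    n = len(coeffs)
    value = coeffs[0]
    node, lo, hi = 1, 0, n  # node covers data positions [lo, hi)
    while node < n:
        mid = (lo + hi) // 2
        if i < mid:   # value lies in the left subtree: add the coefficient
            value += coeffs[node]
            node, hi = 2 * node, mid
        else:         # value lies in the right subtree: subtract it
            value -= coeffs[node]
            node, lo = 2 * node + 1, mid
    return value


data = [34, 16, 2, 20, 20, 0, 36, 16]
coeffs = haar_decompose(data)
print(coeffs)
print([reconstruct_value(coeffs, i) for i in range(len(data))])
```

Each reconstruction touches only the overall average plus one coefficient per tree level, which is the logarithmic cost mentioned in the last bullet.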
![Page 5: One-Pass Wavelet Synopses for Maximum-Error Metrics](https://reader034.fdocuments.us/reader034/viewer/2022042703/56814f56550346895dbd0244/html5/thumbnails/5.jpg)
Wavelet Synopses
• Compute Haar wavelet decomposition of D
• Coefficient thresholding: retain B coefficients, B << |D|
• Approximate query engine can operate over such compact synopses
– [MVW, SIGMOD’98]; [VW, SIGMOD’99]; [CGRS, VLDB’00]
• Conventional approach: Retain B largest coefficients in absolute normalized value
– Normalized Haar basis: divide coefficients at resolution j by $\sqrt{2^j}$
– Minimizes the Total Squared (L2) Error
• However…
![Page 6: One-Pass Wavelet Synopses for Maximum-Error Metrics](https://reader034.fdocuments.us/reader034/viewer/2022042703/56814f56550346895dbd0244/html5/thumbnails/6.jpg)
The Problem with Conventional Synopses
• Example data vector and synopsis (|D|=8, B=4)
Original Data: 34 16 2 20 20 0 36 16
Error tree coefficients, root to leaves: 18; 0; 7 -8; 9 -9 10 10
Reconstruction: 18 18 18 18 20 0 36 16
• Large variation in answer quality
• Root cause: the aggregate error measure may be optimal, but the error is distributed unevenly among individual values
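The example on this slide can be reproduced with a short sketch of conventional thresholding. It assumes that resolution level 0 is the coarsest level (the overall average and the root coefficient) and that a coefficient at level j is normalized by dividing by sqrt(2^j). The function names are hypothetical.

```python
import math


def haar_decompose(data):
    """Unnormalized Haar coefficients in error-tree order (length must be a power of two)."""
    averages, details = [float(x) for x in data], []
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        details = [(a - b) / 2 for a, b in pairs] + details
        averages = [(a + b) / 2 for a, b in pairs]
    return averages + details


def reconstruct(coeffs):
    """Invert the decomposition level by level, from the root downwards."""
    values, pos = [coeffs[0]], 1
    while pos < len(coeffs):
        details = coeffs[pos:pos + len(values)]
        # each value splits into (value + detail, value - detail)
        values = [v + s * d for v, d in zip(values, details) for s in (1, -1)]
        pos += len(details)
    return values


def conventional_synopsis(coeffs, budget):
    """Keep the `budget` coefficients of largest absolute normalized value; zero the rest."""
    def normalized(i):
        level = max(i, 1).bit_length() - 1  # assumption: level 0 is the coarsest
        return abs(coeffs[i]) / math.sqrt(2 ** level)
    keep = set(sorted(range(len(coeffs)), key=normalized, reverse=True)[:budget])
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]


data = [34, 16, 2, 20, 20, 0, 36, 16]
synopsis = conventional_synopsis(haar_decompose(data), budget=4)
print(reconstruct(synopsis))  # [18.0, 18.0, 18.0, 18.0, 20.0, 0.0, 36.0, 16.0]
```

The second half of the vector is reproduced exactly while the first half collapses to the overall average, which is the uneven error distribution the slide points out.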
![Page 7: One-Pass Wavelet Synopses for Maximum-Error Metrics](https://reader034.fdocuments.us/reader034/viewer/2022042703/56814f56550346895dbd0244/html5/thumbnails/7.jpg)
Solution: Thresholding for Maximum-Error Metrics
• Error Metrics providing tight error guarantees for all reconstructed values:
– Maximum Absolute Error: $\max_i |\hat{d}_i - d_i|$
– Maximum Relative Error with Sanity Bound (to avoid domination by small data values): $\max_i \frac{|\hat{d}_i - d_i|}{\max\{|d_i|, s\}}$
• Aim at minimization of these metrics
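As a small illustration, the two metrics can be evaluated directly on the original and reconstructed vectors from the earlier example. The function names are hypothetical, and the sanity bound values used below are arbitrary choices for demonstration.

```python
def max_abs_error(data, approx):
    """Maximum absolute error over all reconstructed values."""
    return max(abs(a - d) for d, a in zip(data, approx))


def max_rel_error(data, approx, sanity_bound):
    """Maximum relative error; the sanity bound caps the influence of small |d_i|."""
    return max(abs(a - d) / max(abs(d), sanity_bound) for d, a in zip(data, approx))


original = [34, 16, 2, 20, 20, 0, 36, 16]
reconstruction = [18, 18, 18, 18, 20, 0, 36, 16]
print(max_abs_error(original, reconstruction))          # 16
print(max_rel_error(original, reconstruction, 1))       # 8.0
print(max_rel_error(original, reconstruction, 10))      # 1.6
```

With a sanity bound of 1 the value 2 dominates the relative error; raising the bound to 10 damps that effect.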
![Page 8: One-Pass Wavelet Synopses for Maximum-Error Metrics](https://reader034.fdocuments.us/reader034/viewer/2022042703/56814f56550346895dbd0244/html5/thumbnails/8.jpg)
Former Approach: Optimal Thresholding for Maximum-Error Metrics [GK, PODS’04]
• Based on Dynamic-Programming Formulation
• Relies on recursive function that computes minimum maximum error for a coefficient’s sub-tree given an allocated storage space
• Optimally distributes allocated space b between a node’s two child sub-trees and decides whether to retain the coefficient on this node
• Approximation schemes for multiple dimensions, also applicable in one dimension
![Page 9: One-Pass Wavelet Synopses for Maximum-Error Metrics](https://reader034.fdocuments.us/reader034/viewer/2022042703/56814f56550346895dbd0244/html5/thumbnails/9.jpg)
• Challenge: Design efficient, low-complexity thresholding schemes that achieve competitive results in comparison to the optimal solution and are extensible to streaming data
BBNO log2
However:
• Complexity:
• time (reducible to )
• space (reducible to )
• 1-D Approximation Schemes
• Impractical for the purpose it is meant for
• All Inapplicable in Streaming Environments
BNO 2
BNNBO loglog2
NO BNO log2
![Page 10: One-Pass Wavelet Synopses for Maximum-Error Metrics](https://reader034.fdocuments.us/reader034/viewer/2022042703/56814f56550346895dbd0244/html5/thumbnails/10.jpg)
Solution: Greedy Thresholding for Maximum-Error Metrics
• Key Idea: Greedy solution that makes the best choice of next coefficient to discard at each step
• Each error-tree node stores the Maximum Potential Error that would result if the coefficient on it were discarded:
– For Absolute Error: $MA_k = \max_{d_j \in leaves_k} |err_j - \delta_{jk} c_k|$
– For Relative Error: $MR_k = \max_{d_j \in leaves_k} \frac{|err_j - \delta_{jk} c_k|}{\max\{|d_j|, S\}}$
– Here $err_j$ is the error accumulated on data value $d_j$ so far, and $\delta_{jk}$ is +1 if $d_j$ lies in the left subtree of $c_k$ and -1 if it lies in the right subtree
• Global Heap structure returns node of Least Maximum Potential Error
• For Absolute Error:
– Max and Min values of Accumulated Error below maintained on nodes
• For Relative Error:
– Accumulated Error on data level stored on leaf nodes
– Heaps returning leaf of Maximum Potential Error augmented on nodes
![Page 11: One-Pass Wavelet Synopses for Maximum-Error Metrics](https://reader034.fdocuments.us/reader034/viewer/2022042703/56814f56550346895dbd0244/html5/thumbnails/11.jpg)
Solution: Greedy Thresholding for Maximum-Error Metrics
After each discarding operation:
• Changes in Accumulated Error values propagated up and down the tree
• On each affected node:
• For Absolute Error:
– Max, Min Accumulated Error updated
– New Maximum Potential Absolute Error calculated as:
$MA_k = \max\{ |\mathrm{max}^l_k - c_k|, |\mathrm{min}^l_k - c_k|, |\mathrm{max}^r_k + c_k|, |\mathrm{min}^r_k + c_k| \}$
• For Relative Error:
– Descendants’ Heap updated
– New Maximum Potential Relative Error returned from Heap
• Update node’s position in Global Heap
![Page 12: One-Pass Wavelet Synopses for Maximum-Error Metrics](https://reader034.fdocuments.us/reader034/viewer/2022042703/56814f56550346895dbd0244/html5/thumbnails/12.jpg)
An Example (absolute error)
Data: 11 -1 -6 8 -2 6 6 10
Error tree coefficients, root to leaves: 4 (overall average); -1; 2 -3; 6 -7 -4 -2
• First drop coefficient -1
• Error accumulates on leaf nodes: 1 1 1 1 -1 -1 -1 -1
• Next drop coefficient 2 of maximum potential error 3
• And so on…
[Figure: error tree with signed edges; further leaf error values shown in later animation steps: -1 -1 -3 -3; 1 -3]
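The greedy selection rule in this example can be sketched in a few lines. This is a deliberately naive version: it recomputes every Maximum Potential Absolute Error by scanning the leaves instead of using the global heap and the per-node max/min bookkeeping from the slides, so it does not have the complexity of the proposed algorithm. It assumes error is measured as reconstructed minus original value, treats the overall average as a discardable node, and breaks ties by lowest node index; all names are hypothetical.

```python
def haar_decompose(data):
    """Unnormalized Haar coefficients in error-tree order (length must be a power of two)."""
    averages, details = [float(x) for x in data], []
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        details = [(a - b) / 2 for a, b in pairs] + details
        averages = [(a + b) / 2 for a, b in pairs]
    return averages + details


def leaf_signs(k, n):
    """Return (j, sign) for every data position j under error-tree node k."""
    if k == 0:  # overall average contributes +c_0 to every value
        return [(j, 1) for j in range(n)]
    level = k.bit_length() - 1
    span = n >> level                    # number of leaves under node k
    start = (k - (1 << level)) * span
    half = span // 2
    return [(j, 1 if j < start + half else -1) for j in range(start, start + span)]


def greedy_abs(data, budget):
    """Discard coefficients one by one, always the one of least maximum potential error."""
    coeffs = haar_decompose(data)
    n = len(coeffs)
    err = [0.0] * n                      # accumulated signed error per data value
    alive, dropped = set(range(n)), []

    def potential(k):                    # MA_k computed naively over the leaves of k
        return max(abs(err[j] - s * coeffs[k]) for j, s in leaf_signs(k, n))

    while len(alive) > budget:
        k = min(sorted(alive), key=potential)   # ties go to the lowest index
        dropped.append((k, coeffs[k], potential(k)))
        for j, s in leaf_signs(k, n):
            err[j] -= s * coeffs[k]
        alive.remove(k)
    return sorted(alive), err, dropped


retained, err, dropped = greedy_abs([11, -1, -6, 8, -2, 6, 6, 10], budget=5)
for node, coefficient, ma in dropped:
    print(node, coefficient, ma)
print(max(abs(e) for e in err))
```

On the slide's data the sketch drops coefficient -1 first and coefficient 2 (maximum potential error 3) second, as in the example. The signs of the individual leaf errors depend on the assumed error convention.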
![Page 13: One-Pass Wavelet Synopses for Maximum-Error Metrics](https://reader034.fdocuments.us/reader034/viewer/2022042703/56814f56550346895dbd0244/html5/thumbnails/13.jpg)
Complexity Analysis

• Absolute Error Algorithm:
Time: $O(N \log^2 N)$
Space: $O(N)$

• Relative Error Algorithm:
Time: $O(N \log^3 N)$
Space: $O(N \log N)$
![Page 14: One-Pass Wavelet Synopses for Maximum-Error Metrics](https://reader034.fdocuments.us/reader034/viewer/2022042703/56814f56550346895dbd0244/html5/thumbnails/14.jpg)
Extension to Data Streams

• Major application area
• Existing methods inapplicable
• Assumption: O(B) available memory budget
• Further Problem:
– Extend proposed methods to streams
– One-pass overall process
– Construct and truncate error-tree on-the-fly
![Page 15: One-Pass Wavelet Synopses for Maximum-Error Metrics](https://reader034.fdocuments.us/reader034/viewer/2022042703/56814f56550346895dbd0244/html5/thumbnails/15.jpg)
Solution for Absolute Error
• After first B data, pair of coefficients discarded for every arriving data pair
• Scope limited to error-tree constructed so far
• Higher tree level for each higher power of 2 in the number of data items
• Frontline structure storing:
– Hanging coefficient nodes
– Temporary average of data in hanging subtree
– Error information from deleted orphan nodes
• Error propagation similar to static case, with some elaboration in upward propagation due to tree sparseness
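The on-the-fly construction part of this scheme can be illustrated with a small sketch. It only shows how a frontline of temporary averages lets coefficients be completed in one pass; it does not discard coefficients, keep error information from deleted nodes, or respect the O(B) memory budget, so it is not the streaming algorithm itself. The function name and the dictionary-based frontline are assumptions made for illustration.

```python
def stream_haar(stream):
    """One-pass Haar coefficient construction with a frontline of pending averages."""
    frontline = {}     # height h -> average of a complete subtree of 2**h values
    coefficients = []  # (height, coefficient) in the order they are completed
    for value in stream:
        avg, h = float(value), 0
        # Merge upwards while a completed sibling subtree of equal height is waiting.
        while h in frontline:
            left = frontline.pop(h)
            coefficients.append((h + 1, (left - avg) / 2))
            avg = (left + avg) / 2
            h += 1
        frontline[h] = avg  # hangs until its right sibling is complete
    return coefficients, frontline


stream = [9, 3, 9, -5, 5, 13, 13, 17, 14, -2, 9, 7, 7, 3]
coefficients, frontline = stream_haar(stream)
print(coefficients)
print(frontline)  # {3: 8.0, 2: 7.0, 1: 5.0}
```

A new tree level appears only when the number of items seen reaches the next power of two. After 14 items the frontline holds one hanging average per subtree of 8, 4, and 2 values.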
![Page 16: One-Pass Wavelet Synopses for Maximum-Error Metrics](https://reader034.fdocuments.us/reader034/viewer/2022042703/56814f56550346895dbd0244/html5/thumbnails/16.jpg)
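Example: Classic Error-Tree
Data Stream: 9 3 9 -5 5 13 13 17 14 -2 9 7 7 3 . . .
[Figure: error tree with signed edges, and frontline. Coefficient nodes over the first eight values: -4; 2 -3; 3 7 -4 -2. Coefficient nodes over the following values: -1; 8 1; 2. Frontline averages: 8, 7, 5.]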
![Page 17: One-Pass Wavelet Synopses for Maximum-Error Metrics](https://reader034.fdocuments.us/reader034/viewer/2022042703/56814f56550346895dbd0244/html5/thumbnails/17.jpg)
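Example: Sibling Error-Tree
Data Stream: 9 3 9 -5 5 13 13 17 14 -2 9 7 7 3 . . .
[Figure: error tree and frontline. Coefficient nodes over the first eight values: -4; 2 -3; 3 7 -4 -2. Coefficient nodes over the following values: -1; 8 1; 2. Frontline averages: 8, 7, 5.]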
![Page 18: One-Pass Wavelet Synopses for Maximum-Error Metrics](https://reader034.fdocuments.us/reader034/viewer/2022042703/56814f56550346895dbd0244/html5/thumbnails/18.jpg)
Example: B = 6, after 6 values
Data Stream: 9 3 9 -5 5 13 . . .
[Figure: error tree and frontline. Coefficient nodes: 2; 3 7 -4. Frontline averages: 4, 9.]
![Page 19: One-Pass Wavelet Synopses for Maximum-Error Metrics](https://reader034.fdocuments.us/reader034/viewer/2022042703/56814f56550346895dbd0244/html5/thumbnails/19.jpg)
Example: B = 6, after 8 values
Data Stream: 9 3 9 -5 5 13 13 17 . . .
[Figure: error tree and frontline. Node values shown: -4; -3; 3 7 -4 9; 4; 8; -2; 2.]
![Page 20: One-Pass Wavelet Synopses for Maximum-Error Metrics](https://reader034.fdocuments.us/reader034/viewer/2022042703/56814f56550346895dbd0244/html5/thumbnails/20.jpg)
Example: B = 6, after 10 values
Data Stream: 9 3 9 -5 5 13 13 17 14 -2 . . .
[Figure: error tree and frontline. Node values shown: -4; -3; 7; 8; 8; 6.]
![Page 21: One-Pass Wavelet Synopses for Maximum-Error Metrics](https://reader034.fdocuments.us/reader034/viewer/2022042703/56814f56550346895dbd0244/html5/thumbnails/21.jpg)
Example: B = 6, after 12 values
Data Stream: 9 3 9 -5 5 13 13 17 14 -2 9 7 . . .
[Figure: error tree and frontline. Node values shown: -4; -3; 7; 8; 8; 7.]
![Page 22: One-Pass Wavelet Synopses for Maximum-Error Metrics](https://reader034.fdocuments.us/reader034/viewer/2022042703/56814f56550346895dbd0244/html5/thumbnails/22.jpg)
Example: B = 6, after 14 values
Data Stream: 9 3 9 -5 5 13 13 17 14 -2 9 7 7 3
[Figure: error tree and frontline. Node values shown: -4; 7; 8; 8; 7; 5.]
![Page 23: One-Pass Wavelet Synopses for Maximum-Error Metrics](https://reader034.fdocuments.us/reader034/viewer/2022042703/56814f56550346895dbd0244/html5/thumbnails/23.jpg)
Example: B = 6, after padding
Reconstruction: 4 4 11 -3 12 12 12 12 15 -1 7 7 5 5
[Figure: final error tree. Node values shown: 7; 1; -4; 1; 7; 8.]
![Page 24: One-Pass Wavelet Synopses for Maximum-Error Metrics](https://reader034.fdocuments.us/reader034/viewer/2022042703/56814f56550346895dbd0244/html5/thumbnails/24.jpg)
Solution for Relative Error
• Analogous Extension not feasible
• Solution: Heuristic Techniques
• Estimate of $MR_k$ calculated based on:
– 4 quantities as in Absolute Error (with denominators)
– Minimum Absolute values in each subtree (with errors)
– A sample value (with error) for each subtree, initialized as Minimum Absolute value beneath, changed by error propagation process when a sample below involves larger relative error
• Heuristic Estimate set as Maximum Relative Error among these 8 positions
![Page 25: One-Pass Wavelet Synopses for Maximum-Error Metrics](https://reader034.fdocuments.us/reader034/viewer/2022042703/56814f56550346895dbd0244/html5/thumbnails/25.jpg)
Experimental Setting
• Experiments with Real Data:
– Frequency counts in US Forest Service Database
– Photon counts by Voyager 2 stellar occultation experiments
– Temperature measures from equatorial Pacific
• Comparison of both Static and Stream Algorithms with the Optimal Solution and the Conventional Method
• Streaming Algorithm can produce window-based synopses by discarding those retained coefficients whose scope falls entirely outside the window of interest
• We present results for fixed data sets arriving in a stream in order to preserve comparability with those of the non-streaming algorithms
• We present results for the relative error heuristic in the static case as well
![Page 26: One-Pass Wavelet Synopses for Maximum-Error Metrics](https://reader034.fdocuments.us/reader034/viewer/2022042703/56814f56550346895dbd0244/html5/thumbnails/26.jpg)
Experimental Results
• Run-time, B = N / 16, Relative Error
[Plot: time (msec), logarithmic axis from 10 to 10,000,000, versus N from 32 to 131072. Series: OPT, GSTA, GSTR, CON, GSTA-2.]
![Page 27: One-Pass Wavelet Synopses for Maximum-Error Metrics](https://reader034.fdocuments.us/reader034/viewer/2022042703/56814f56550346895dbd0244/html5/thumbnails/27.jpg)
Experimental Results
• Quality, Absolute Error, Real Data (frequency counts), N = 360
[Plot: Maximum Absolute Error, axis from 0 to 1400, versus B from 20 to 90. Series: OPT, GSTA, GSTR, CON.]
![Page 28: One-Pass Wavelet Synopses for Maximum-Error Metrics](https://reader034.fdocuments.us/reader034/viewer/2022042703/56814f56550346895dbd0244/html5/thumbnails/28.jpg)
Experimental Results
• Quality, Relative Error, Real Data (frequency counts), N = 360
[Plot: Maximum Relative Error, axis from 0 to 1.2, versus B from 20 to 90. Series: OPT, GSTA, GSTR, CON, GSTA-2.]
![Page 29: One-Pass Wavelet Synopses for Maximum-Error Metrics](https://reader034.fdocuments.us/reader034/viewer/2022042703/56814f56550346895dbd0244/html5/thumbnails/29.jpg)
Experimental Results
• Scalability, Absolute Error, Real Data (photon counts), N = 16K
[Plot: Maximum Absolute Error, axis from 0 to 14, versus B from 300 to 1500. Series: GSTA, GSTR, CON.]
![Page 30: One-Pass Wavelet Synopses for Maximum-Error Metrics](https://reader034.fdocuments.us/reader034/viewer/2022042703/56814f56550346895dbd0244/html5/thumbnails/30.jpg)
Experimental Results
• Scalability, Relative Error, Real Data (photon counts), N = 16K
[Plot: Maximum Relative Error, axis from 0 to 2, versus B from 100 to 1500. Series: GSTA, GSTR, CON, GSTA-2.]
![Page 31: One-Pass Wavelet Synopses for Maximum-Error Metrics](https://reader034.fdocuments.us/reader034/viewer/2022042703/56814f56550346895dbd0244/html5/thumbnails/31.jpg)
Experimental Results
• Scalability, Absolute Error, Real Data (temperature measures), B = N / 16
[Plot: Maximum Absolute Error, axis from 0 to 2.5, versus N from 1024 to 131072. Series: GSTA, GSTR, CON.]
![Page 32: One-Pass Wavelet Synopses for Maximum-Error Metrics](https://reader034.fdocuments.us/reader034/viewer/2022042703/56814f56550346895dbd0244/html5/thumbnails/32.jpg)
Experimental Results
• Scalability, Relative Error, Real Data (temperature measures), B = N / 16
[Plot: Maximum Relative Error, axis from 0.02 to 0.09, versus N from 1024 to 131072. Series: GSTA, GSTR, CON, GSTA-2.]
![Page 33: One-Pass Wavelet Synopses for Maximum-Error Metrics](https://reader034.fdocuments.us/reader034/viewer/2022042703/56814f56550346895dbd0244/html5/thumbnails/33.jpg)
Conclusions & Future Directions

• Feasibility of Wavelet Synopses with near-optimal Error Guarantees at near-linear cost for both Static and Streaming Data
• Extension to Multidimensional Wavelets?
• Alternative Relative Error Heuristics?
• Variable Coefficients?
• Theoretical Worst-case Guarantee?
![Page 34: One-Pass Wavelet Synopses for Maximum-Error Metrics](https://reader034.fdocuments.us/reader034/viewer/2022042703/56814f56550346895dbd0244/html5/thumbnails/34.jpg)
Related Work
• Y. Matias, J. S. Vitter, and M. Wang. Wavelet-based histograms for selectivity estimation. SIGMOD 1998
• J. S. Vitter and M. Wang. Approximate computation of multidimensional aggregates of sparse data using wavelets. SIGMOD 1999
• K. Chakrabarti, M. Garofalakis, R. Rastogi, and K. Shim. Approximate query processing using wavelets. VLDB Journal 2001
• A. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. Strauss. Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries. VLDB 2001
• M. Garofalakis and A. Kumar. Deterministic wavelet thresholding for maximum-error metrics. PODS 2004
• S. Guha and B. Harb. Wavelet Synopses for Data Streams: Minimizing Non-Euclidean Error. KDD 2005
![Page 35: One-Pass Wavelet Synopses for Maximum-Error Metrics](https://reader034.fdocuments.us/reader034/viewer/2022042703/56814f56550346895dbd0244/html5/thumbnails/35.jpg)
Thank you! Questions?