Offline, Stream and Approximation Algorithms for Synopsis Construction
Sudipto Guha University of Pennsylvania
Kyuseok Shim Seoul National University
VLDB 2005: A tutorial on synopsis construction algorithms
About this Tutorial
Information is incomplete and could be inaccurate
Our presentation reflects our understanding which may be erroneous
Synopses Construction
Where is the life we have lost in living?
Where is the wisdom we have lost in knowledge?
Where is the knowledge we have lost in information?
T. S. Eliot, from The Rock.
Routers, sensors, Web, astronomy and sciences
Too much data, too little time.
The idea
To see the world in a grain of sand…
Broad characteristics of the data
Compression
Dimensionality reduction
Approximate query answering
Denoising, outlier detection and a broad array of signal processing
What is a synopsis ?
Hmm. Any “shorthand” representation
Clustering! SVD!
In this tutorial we will focus on signal/time series processing
The basic problem
Formally, given a signal $X$ and a dictionary $\{\psi_i\}$, find a representation $F = \sum_i z_i \psi_i$ with at most $B$ non-zero $z_i$, minimizing some error which is a function of $X - F$.
Note: the above extends to any dimension.
Many issues
What is the dictionary?
Which B terms?
What is the error?
What are the constraints?
Many issues
What is the dictionary? A set of vectors, maybe a basis. Example: Top K.
Which B terms?
What is the error?
What are the constraints?
Many issues
What is the dictionary? A set of vectors, maybe a basis. Example: Haar wavelets. Also Fourier, polynomials, …
Which B terms?
What is the error?
What are the constraints?
Many issues
What is the dictionary? A set of vectors, which may not be a basis.
Histograms: there are n choose 2 vectors, but since we impose a non-overlapping restriction we get a unique representation.
Which B terms?
What is the error?
What are the constraints?
Many issues
What is the dictionary?
Which B terms? First B? Best B?
Why should we choose the first B?
1. B vs 2B numbers (the best B terms need their indices stored as well as their values)
2. Also …
What is the error?
What are the constraints?
Approximation theory
Discipline of Math associated with approximation of functions.
Same as our problem
Linear theory (Parseval, 1800; over two centuries old)
Non-linear theory (Schmidt 1909, Haar 1910)
Is it relevant? Yes. However, the mathematical treatment has been “extremal”, i.e., how does the error change as a function of B, and is that bound tight?
Note: a yes answer does not say anything about “given this signal, is that the best we can do?”
Many issues
What is the dictionary?
Which B terms?
What is the error? This controls which B.
$\|X-F\|_2$ is most common, used all over mathematics.
$\|X-F\|_1$ and $\|X-F\|_\infty$ are also useful.
Weights. Relative error of approximation: approximating 1000 by 1010 is not so bad; approximating 1 by 11 is not a good idea.
What are the constraints?
Many issues
What is the dictionary?
Which B terms?
What is the error?
What are the constraints?
Input? Stream, stream of updates, …
Space, time, precision and range of values (for the $z_i$ in the expression $F = \sum_i z_i \psi_i$)
In this tutorial
Histograms & Wavelets
Will focus on Optimal, Approximation and Streaming algorithms
How to get one from the other! Connections to top K and Fourier.
I. Histograms.
VOpt Histograms
Let's start simple.
Given a signal $X$, find a piecewise constant representation $H$ with at most $B$ pieces minimizing $\|X-H\|_2$.
Jagadish, Koudas, Muthukrishnan, Poosala, Sevcik, Suel, 1998
Consider one bucket: the mean is the best value.
This gives a natural dynamic programming formulation.
An Example Histogram
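Data Distribution

| Location (i) | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| Value (Xi) | 12 | 10 | 2 | 8 | 14 | 28 | 16 |

V-Optimal Histogram

| Range | [1,4] | [5,5] | [6,6] | [7,7] |
|---|---|---|---|---|
| Representative | 8 | 14 | 28 | 16 |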
Idea: VOpt Algorithm
Within a “step/bucket”: the mean is the best.
Assume that the last bucket is [j+1,n].
What can we say about the remaining k-1 buckets?
[Figure: positions 1..n split at j; the prefix [1,j] has cost OPT[j,k-1] and the last bucket [j+1,n] has cost SQERR[j+1,n]]
They must also be optimal for the range [1, j] with (k-1) buckets! Dynamic programming!!
Idea: VOpt Algorithm
A dynamic programming algorithm was given to construct the V-optimal histogram.
$$\mathrm{OPT}[n,k] = \min_{1 \le j < n} \{\, \mathrm{OPT}[j,k-1] + \mathrm{SQERR}[(j+1)..n] \,\}$$
OPT[j, k]: the minimum cost of representing the set of values indexed by [1..j] by a histogram with k buckets.
SQERR[(j+1)..n]: the sum of squared errors incurred when the values from (j+1) to n are represented by their mean.
The DP-based VOpt Algorithm
Base case: OPT[i, 1] = SQERR[1, i]
for i = 1 to n do
    for k = 2 to B do
        for j = 1 to i-1 do   (split point between the (k-1)-bucket histogram and the last bucket)
            OPT[i, k] = min{ OPT[i, k], OPT[j, k-1] + SQERR[j+1, i] }
We need O(Bn) entries for the table OPT.
Each entry OPT[i,k] takes O(n) time if SQERR[j+1,i] can be computed in O(1) time.
O(Bn) space and O(Bn²) time.
[Figure: the OPT table, of dimensions n by B]
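The triple loop above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `vopt_histogram`, the back-pointer table `back` and the inline prefix sums used to evaluate SQERR are all choices made for this example.

```python
def vopt_histogram(x, B):
    """Exact V-optimal histogram by dynamic programming: O(B n^2) time, O(B n) space.

    Returns (total squared error, list of (start, end, mean)) with 1-based,
    inclusive bucket ranges. Names and table layout are illustrative assumptions.
    """
    n = len(x)
    # Prefix sums so that SQERR(a, b) is evaluated in O(1).
    S = [0.0] * (n + 1)
    Q = [0.0] * (n + 1)
    for i, v in enumerate(x, 1):
        S[i] = S[i - 1] + v
        Q[i] = Q[i - 1] + v * v

    def sqerr(a, b):
        s = S[b] - S[a - 1]
        return (Q[b] - Q[a - 1]) - s * s / (b - a + 1)

    INF = float("inf")
    opt = [[INF] * (B + 1) for _ in range(n + 1)]
    back = [[0] * (B + 1) for _ in range(n + 1)]
    opt[0][0] = 0.0
    for i in range(1, n + 1):
        for k in range(1, min(B, i) + 1):
            for j in range(k - 1, i):  # the last bucket is [j+1, i]
                cost = opt[j][k - 1] + sqerr(j + 1, i)
                if cost < opt[i][k]:
                    opt[i][k], back[i][k] = cost, j

    # Walk the back-pointers to recover the buckets.
    k = min(B, n)
    total, buckets, i = opt[n][k], [], n
    while k > 0:
        j = back[i][k]
        buckets.append((j + 1, i, (S[i] - S[j]) / (i - j)))
        i, k = j, k - 1
    return total, buckets[::-1]
```

On the seven-value example signal (12, 10, 2, 8, 14, 28, 16) with B = 4, this sketch returns the buckets [1,4], [5,5], [6,6], [7,7] with representatives 8, 14, 28, 16.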
Computation of Sum of Squared Absolute Error in O(1) time
index 1 2 3 4
x 2 3 7 5
sum 2 5 12 17
sum(2,3) = x[2]+x[3] = sum[3]-sum[1]= 12-2 = 10
Computation of Sum of Squared Absolute Error in O(1) time
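$$\mathrm{SQERR}[i,j] = \sum_{p=i}^{j} (x_p - \bar{x})^2 = \sum_{p=i}^{j} x_p^2 - \frac{1}{j-i+1}\Big(\sum_{p=i}^{j} x_p\Big)^2$$
Let $\mathrm{SUM}[1,i] = \sum_{p=1}^{i} x_p$ and $\mathrm{SQSUM}[1,i] = \sum_{p=1}^{i} x_p^2$.
Then,
$$\mathrm{SUM}(i,j) = \sum_{p=i}^{j} x_p = \mathrm{SUM}[1,j] - \mathrm{SUM}[1,i-1]$$
$$\mathrm{SQSUM}(i,j) = \sum_{p=i}^{j} x_p^2 = \mathrm{SQSUM}[1,j] - \mathrm{SQSUM}[1,i-1]$$
Thus,
$$\mathrm{SQERR}[i,j] = \big(\mathrm{SQSUM}[1,j] - \mathrm{SQSUM}[1,i-1]\big) - \frac{1}{j-i+1}\big(\mathrm{SUM}[1,j] - \mathrm{SUM}[1,i-1]\big)^2$$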
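The identity can be checked with a few lines of Python. This is a sketch only: the helper name `make_sqerr` is hypothetical, indices are 1-based and inclusive to match the slides, and the data is the four-value example x = 2, 3, 7, 5 from the previous slide.

```python
from itertools import accumulate


def make_sqerr(x):
    """Precompute SUM and SQSUM in O(n); the returned function answers SQERR(i, j) in O(1)."""
    SUM = [0] + list(accumulate(x))                    # SUM[i]   = x_1 + ... + x_i
    SQSUM = [0] + list(accumulate(v * v for v in x))   # SQSUM[i] = x_1^2 + ... + x_i^2

    def sqerr(i, j):
        s = SUM[j] - SUM[i - 1]
        return (SQSUM[j] - SQSUM[i - 1]) - s * s / (j - i + 1)

    return sqerr


sqerr = make_sqerr([2, 3, 7, 5])
# Bucket [2,3] holds 3 and 7: mean 5, squared error (3-5)^2 + (7-5)^2 = 8.
print(sqerr(2, 3))
```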
Analysis of VOpt Algorithm
$O(n^2 B)$ time
$O(nB)$ space
The space can be reduced (Wednesday)
Main question: the end use of a histogram is to approximate something. Why not find an “approximately optimal” (e.g., (1+ε)) histogram?
If you had to improve something ?
(1+ε) streaming, ssq: O(n) time, O(B/ε²) space
(1+ε) streaming: O(nB²/ε) time, O(B²/ε) space
O(n²B) time, O(n) space
Via wavelets, ssq: O(n) time, O(B²/ε²) space
Offline: O(n) time, O(n+B/ε) space
O(n²B) time, O(nB) space
(1+ε) streaming: O(n) time, O(B²/ε) space
Offline: O(n) time, O(B²/ε) space
Take 1:
For i = 1 to n do
    For k = 1 to B do
        For j = 1 to i-1 do   (split point for the last bucket)
            OPT[1…i, k] = min( OPT[1…i, k], OPT[1…j, k-1] + SQERR(j+1, i) )
As j increases, OPT[1..j,k-1] is increasing and SQERR(j+1,i) is decreasing.
Question: Can we use the monotonicity for searching the minimum?
No
Consider a sequence of positive numbers $y_1, y_2, \ldots, y_n$, with $F(i) = \sum_{j \le i} y_j$ and $G(i) = F(n) - F(i-1)$.
F(i): monotonically increasing … plays the role of OPT[1..j,k-1]
G(i): monotonically decreasing … plays the role of SQERR(j+1,i)
$\Omega(n)$ time is necessary to find $\min_i \{ F(i)+G(i) \}$.
Open question: does it extend to $\Omega(n^2)$ over the entire algorithm?
What gives ?
Consider a sequence of positive numbers $y_1, y_2, \ldots, y_n$.
$F(i) = \sum_{j \le i} y_j$ and $G(i) = F(n) - F(i-1)$. Thus, $F(i)+G(i) = F(n) + y_i$.
Any $i$ gives a 2-approximation to $\min_i \{ F(i)+G(i) \}$:
$F(i)+G(i) = F(n) + y_i \le 2F(n)$
$\min_i \{ F(i)+G(i) \}$ is at least $F(n)$.
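A small numeric check of this argument, as a sketch: the helper name `f_plus_g` and the sample values are made up for illustration. Finding the exact minimum amounts to finding the smallest $y_i$, which needs a look at every element, while every index is already within a factor 2 of the minimum.

```python
from itertools import accumulate


def f_plus_g(y):
    """Return [F(i) + G(i) for i = 1..n], with F(i) = y_1 + ... + y_i and G(i) = F(n) - F(i-1)."""
    F = [0] + list(accumulate(y))
    n = len(y)
    return [F[i] + (F[n] - F[i - 1]) for i in range(1, n + 1)]


y = [3, 1, 4, 1, 5]           # arbitrary positive numbers
values = f_plus_g(y)          # each entry equals F(n) + y_i, here 14 + y_i
print(values, min(values), max(values) <= 2 * min(values))
```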
Round 1
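Use a histogram to approximate the function. Bootstrap!
Approximate the increasing function in powers of (1+δ).
The right end point is a (1+δ) approximation of the left end point.
[Figure: an interval whose left end point has value h and whose right end point has value ≤ (1+δ)h]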
What does that do ?
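Consider evaluating the function at the two end points.
Proof by picture.
[Figure: values h at the left end point and h' at the right end point. h ≥ h'/(1+δ). Why? By construction. The decreasing function at any point in between is ≥ its value at the right end point. Why? By monotonicity!]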
Therefore… the right-hand end point is a (1+δ) approximation! This holds for any point x between a and b:
OPT[x] + SQERR[x+1] ≥ OPT[a] + SQERR[b+1] ≥ OPT[b]/(1+δ) + SQERR[b+1] ≥ {OPT[b] + SQERR[b+1]}/(1+δ)
Are we done? Not quite yet. What happens for B>2? We do not compute OPT[i,b] exactly!!
[Figure: the increasing OPT curve and the decreasing SQERR curve over an interval [a, b], with value h' at b]
Zen and the art of histograms
Approximate the increasing function in powers of (1+δ). The right end point is a (1+δ) approximation.
Prove by induction that the error is $(1+\delta)^B$.
This tells us what δ should be (small); in fact, if we set $\delta = \varepsilon/(2B)$ then $(1+\delta)^B \le 1+\varepsilon$.
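A short derivation of why this choice of δ suffices, under the added assumption that $0 < \varepsilon \le 1$ (the slide does not state a range for ε):

$$\left(1+\frac{\varepsilon}{2B}\right)^{B} \;\le\; \exp\!\left(\frac{\varepsilon}{2}\right) \;\le\; 1+\varepsilon \qquad \text{for } 0 < \varepsilon \le 1,$$

using $1+x \le e^{x}$ for the first step and $e^{x} \le 1+2x$ on $0 \le x \le 1/2$ for the second.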
Complexity analysis
Number of intervals p ~ (B/ε) log n. Why?
$c(1+\delta)^{p-1} \le nR^2$ and $\delta = \varepsilon/(2B)$
R is the largest number in the data. Assume R is polynomially bounded by n.
Running time ~ nB · (B/ε) log n
Why are we approximating the increasing function? Why not the decreasing one?
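The interval count can be derived as follows. This is a sketch that reads c as the smallest non-zero value taken by the increasing function and assumes, beyond what the slide states, that $\delta \le 1$ and that c is at least $1/\mathrm{poly}(n)$:

$$c(1+\delta)^{p-1} \le nR^2 \;\Longrightarrow\; p \;\le\; 1 + \frac{\ln(nR^2/c)}{\ln(1+\delta)} \;\le\; 1 + \frac{2}{\delta}\ln\frac{nR^2}{c} \;=\; 1 + \frac{4B}{\varepsilon}\ln\frac{nR^2}{c} \;=\; O\!\left(\frac{B}{\varepsilon}\log n\right),$$

where the middle step uses $\ln(1+\delta) \ge \delta/2$ for $0 < \delta \le 1$, and the last uses that R is polynomially bounded in n.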
The first streaming model
The signal X is specified by $x_i$ arriving in increasing order of i.
Not the most general model, but extremely useful for modeling time series data.
Streaming
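Need to store, for interval end points a and b: $\sum_{i=1}^{a} x_i$, $\sum_{i=1}^{a} x_i^2$, $\sum_{i=1}^{b} x_i$, $\sum_{i=1}^{b} x_i^2$
Required space is $(B^2/\varepsilon) \log n$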
VOpt Construction: $O(Bn^2)$
[Figure: two rows of the DP table, OPT[j,k-1] and OPT[j,k], over positions 1 to 10 of n]
[Jagadish et al.: VLDB 1998] $\mathrm{OPT}(i,k) = \min_{1 \le j < i} \{ \mathrm{OPT}(j,k-1) + \mathrm{SQERR}(j+1,i) \}$
AHIST-S: (1+ε) Approximation
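[Figure: the rows AOPT[j,k-1] and AOPT[j,k] over positions 1..n; row k-1 is covered by P intervals. For consecutive positions a, b, c: (1+δ)·AOPT[a] ≥ AOPT[b] and (1+δ)·AOPT[a] < AOPT[c], so the interval starting at a ends at b]
δ = ε/(2B)
P = $O(B\varepsilon^{-1}\log n)$
$\mathrm{AOPT}[n,k] = \min_{p} \{ \mathrm{AOPT}[b_p,k-1] + \mathrm{SQERR}[b_p+1,n] \}$, where the $b_p$ are the right end points of the intervals kept for k-1 buckets
$O(B^2\varepsilon^{-1} n \log n)$ time and $O(B^2\varepsilon^{-1}\log n)$ space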
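A compact Python sketch of this one-pass scheme follows. It is an illustration under assumptions rather than the authors' code: the function name `ahist_s`, the list layout, and the convention that "k buckets" means "at most k buckets" are choices made here, and the (1+ε) guarantee relies on 0 < ε ≤ 1. Only interval end points are kept, each with its AOPT value and the running SUM and SQSUM at that position.

```python
def ahist_s(stream, B, eps):
    """One-pass approximate V-optimal error, storing only interval end points.

    lists[k] covers AOPT[., k] by intervals [start_value, end_index, end_value, S_end, Q_end]
    such that end_value <= (1 + delta) * start_value. Layout is an illustrative assumption.
    """
    delta = eps / (2.0 * B)
    S = Q = 0.0                              # running SUM and SQSUM
    lists = [[] for _ in range(B)]           # lists[1..B-1] are used
    aopt = [0.0] * (B + 1)
    for i, v in enumerate(stream, 1):
        S += v
        Q += v * v
        aopt = [0.0] * (B + 1)
        aopt[1] = max(0.0, Q - S * S / i)    # one bucket over [1, i]
        for k in range(2, B + 1):
            best = aopt[k - 1]               # using fewer buckets is allowed
            for _, b, val, Sb, Qb in lists[k - 1]:
                s = S - Sb
                tail = max(0.0, (Q - Qb) - s * s / (i - b))   # SQERR(b+1, i)
                best = min(best, val + tail)
            aopt[k] = best
        for k in range(1, B):
            lst = lists[k]
            if lst and aopt[k] <= (1 + delta) * lst[-1][0]:
                lst[-1][1:] = [i, aopt[k], S, Q]        # extend: i is the new right end point
            else:
                lst.append([aopt[k], i, aopt[k], S, Q])  # open a new interval at i
    return aopt[B]
```

The inner loop runs over the stored end points only, which is where the factor P replaces the factor n of the exact dynamic program.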
The overall idea
The natural DP table
The approximate table
Do s talk to us ?
DJIA data from 1901-1993
[Plot: execution time versus B]
Take 2: GK02
Sliding window streams
Potentially infinite data; we are interested in the last n only.
Q: Suppose we constructed a histogram for [1..n] and now want it for [2..(n+1)].
The previous idea is dead on arrival. Consider 100,1,2,3,4,5,7,8,…
Formal problem
Maintain a data structure
Given an interval [a,b] construct a B bucket histogram for [a,b]
Compute on the fly
Generalizes the window! Generalizes VOpt when a=1,b=n
Reconsider the take 1
We are evaluating
Left to right, i.e.,
But we are still evaluating this guy !
A brave new world
Assume an O(n) size buffer holds the $x_i$ values.
The previous algorithm was:
Several issues
1. Which values are necessary and sufficient?
2. We are not evaluating all values. What induction?
GK02: Enhanced (1+ε) Approximation
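[Figure: the rows AOPT[j,k-1] and AOPT[j,k] over positions 1..n; row k-1 is covered by P intervals. An interval starting at a ends at the z with (1+δ)·AOPT[a] ≥ AOPT[z] and (1+δ)·AOPT[a] < AOPT[z+1]]
Lazy evaluation using binary search
$O(B^3\varepsilon^{-2}\log^3 n)$ time and $O(n)$ space
Pre-processing takes O(n) time: SUM and SQSUM
P = $O(B\varepsilon^{-1}\log n)$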
GK02: Enhanced (1+ε) Approximation
Creates all B interval lists at once.
The necessary values AOPT[j,k] are computed recursively to find the intervals $[a^j_p, b^j_p]$, where $b^j_p$ is the largest z such that
$(1+\delta)\,\mathrm{AOPT}[a^j_p,k] \ge \mathrm{AOPT}[z,k]$ and $(1+\delta)\,\mathrm{AOPT}[a^j_p,k] < \mathrm{AOPT}[z+1,k]$.
Note that AOPT increases as z increases. Thus, we can use binary search to find z.
O(n) space for the SUM and SQSUM arrays needs to be maintained to allow the computation of SQERR(j+1,i) in O(1) time.
$O(n + B^3\varepsilon^{-2}\log^3 n)$ time and $O(n)$ space
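The binary search step can be sketched as below, assuming (as the slide states) that the function is nondecreasing in z. The name `last_within_factor` and the callable `f`, which stands in for a lazily evaluated z ↦ AOPT[z,k], are hypothetical.

```python
def last_within_factor(f, a, n, delta):
    """Largest z in [a, n] with f(z) <= (1 + delta) * f(a), for a nondecreasing f.

    Uses O(log n) evaluations of f instead of scanning every position.
    """
    bound = (1 + delta) * f(a)
    lo, hi = a, n
    while lo < hi:
        mid = (lo + hi + 1) // 2     # round up so that the loop always makes progress
        if f(mid) <= bound:
            lo = mid                 # mid still lies inside the interval starting at a
        else:
            hi = mid - 1
    return lo


# Toy monotone function standing in for AOPT[., k].
print(last_within_factor(lambda z: z * z, 2, 10, 3.0))   # bound is 16, so this prints 4
```

Starting the next interval at the returned z + 1 and repeating yields the whole interval list with O(P log n) evaluations.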
Take 2 summary
$O(n)$ space and $O(n + B^3\varepsilon^{-2}\log^2 n)$ time
Is that the best ? Obviously no.
Take 3: AHIST-L-Δ
Suppose we knew Δ ≤ OPT ≤ 2Δ; then…
Instead of powers of (1+ε/B), use additive terms of εΔ/(2B); then…
Time is $O(B^3\varepsilon^{-2}\log n)$.
To get Δ? A 2-approximation (ε = O(1)) and a binary search (O(log n)). Thus, $O(B^3 \log n \cdot \log n)$.
Overall $O(n + B^3(\varepsilon^{-2} + \log n)\log n)$ time and $O(n + B^2/\varepsilon)$ space.
[Figure label: O(B/ε)]
Take 4: AHIST-B
Consider the take 4 algorithm. How to stream it ?
[Figure: labels “M”, “On the new part” and “Overall”]
Not done yet
First find an ε = O(1) approximation, then proceed back and refine.
[Figure: labels “1+r”, “k” and “k-1”]
The running space-time
B (# insertions)(log M)(log τ), where τ = $O(B\varepsilon^{-1}\log n)$ is the length of a list
Space Who cares and why ?
Asymptotics
For fixed B and ε we can compute a (1+ε)-approximate piecewise constant representation in
O(n log log n) time and O(log n) space, or
O(n) time and O(log n log log n) space.
This extends to degree-d polynomials: space increases by O(d) and time is O(nd + d³…).
Our friendly Running time
[Plot: execution time versus B]
Our friendly Error
[Plot: (Error − VOPT)/VOPT versus B]
What you analyze is what you get
[Plot: execution time versus n]
Questions ?
For general error measure, IF…
The error of a bucket only depends on the values in the bucket.
The overall error function, is the sum of the errors in the buckets.
The data can be processed in O(T) time per item such that in O(Q) time we can find the error of a bucket, storing O(P) info.
The error (of a bucket) is a monotonic function of the interval.
The value of the maximum and the minimum nonzero error is polynomially bounded in n.
Then…
Optimum histogram in $O(nT + n^2(B+Q))$ time and $O(n(P+B))$ space
(1+ε)-approximation in
$O(nT + nQB^2\varepsilon^{-1}\log n)$ time and $O(PB^2\varepsilon^{-1}\log n)$ space,
$O(nT + QB^3(\log n + \varepsilon^{-2})\log n)$ time and $O(nP)$ space,
$O(nT)$ time and $O\big(PB^2\varepsilon^{-1}\log n + (QB/T)\,[B\varepsilon^{-1}\log^2(B\varepsilon^{-1}\log n) + \log n \log\log n]\big)$ space
Splines and piecewise polynomials
Instead of
If we wanted
Or maybe…
![Page 61: 1 Offline, Stream and Approximation Algorithms for Synospis Construction Sudipto Guha University of Pennsylvania Kyuseok Shim Seoul National University.](https://reader037.fdocuments.us/reader037/viewer/2022110321/56649cf35503460f949c118e/html5/thumbnails/61.jpg)
The overall idea
If we want to represent $\{x_{a+1},\ldots,x_b\}$ by $p_0 + p_1(x-x_a) + p_2(x-x_a)^2 + \ldots$
The solution is as above…
We need O(d) times more space than before, and we need to solve the system. This means an increase by a factor of $O(d^3)$ in time.
![Page 62: 1 Offline, Stream and Approximation Algorithms for Synospis Construction Sudipto Guha University of Pennsylvania Kyuseok Shim Seoul National University.](https://reader037.fdocuments.us/reader037/viewer/2022110321/56649cf35503460f949c118e/html5/thumbnails/62.jpg)
Another useful example: Relative error
Issue with global measures: Estimating 10 by 20 and 1000 by 1010 has the same effect
The above is ok if we are querying for “1000” a 1000 times and 10 times for “10” (point queries and VOPT measure)
But consider approximating a time series. We may be interested in per point guarantees.
![Page 63: 1 Offline, Stream and Approximation Algorithms for Synospis Construction Sudipto Guha University of Pennsylvania Kyuseok Shim Seoul National University.](https://reader037.fdocuments.us/reader037/viewer/2022110321/56649cf35503460f949c118e/html5/thumbnails/63.jpg)
Sum of Squared Relative Error for a Bucket
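Relative error for a bucket $(s_r, e_r, x_r)$:
$$ERR_{SQ}(s_r,e_r) = \min_{x_r} \sum_{i=s_r}^{e_r} \frac{(x_i - x_r)^2}{\max\{c,|x_i|\}^2} = \min_{x_r} \left( A x_r^2 - 2B x_r + C \right)$$
where $A = \sum_i 1/\max\{c,|x_i|\}^2$, $B = \sum_i x_i/\max\{c,|x_i|\}^2$ and $C = \sum_i x_i^2/\max\{c,|x_i|\}^2$
Since A > 0, it is minimized when $x_r = B/A$
The minimum value is $C - B^2/A$
If the aggregated sums of A, B and C are stored, $ERR_{SQ}(i,j)$ can be computed in O(1) time
Optimal histogram can be constructed in $O(Bn^2)$ time…
Approximation algorithms follow…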
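The O(1) bucket evaluation above can be sketched with prefix sums. This is a minimal illustration, not the authors' code: the names `build_prefix_sums` and `err_sq` are hypothetical, positions are 0-based, and `c` is assumed to be a positive sanity constant.

```python
from itertools import accumulate

def build_prefix_sums(xs, c):
    """Prefix sums of the A, B and C terms (c > 0 is the sanity constant)."""
    weights = [1.0 / max(c, abs(x)) ** 2 for x in xs]
    prefix_a = [0.0] + list(accumulate(weights))
    prefix_b = [0.0] + list(accumulate(x * w for x, w in zip(xs, weights)))
    prefix_c = [0.0] + list(accumulate(x * x * w for x, w in zip(xs, weights)))
    return prefix_a, prefix_b, prefix_c

def err_sq(prefix, i, j):
    """Minimum sum of squared relative errors for the bucket xs[i..j] (inclusive).

    Returns (error, best representative) using C - B^2/A and x_r = B/A.
    """
    prefix_a, prefix_b, prefix_c = prefix
    a = prefix_a[j + 1] - prefix_a[i]
    b = prefix_b[j + 1] - prefix_b[i]
    c_term = prefix_c[j + 1] - prefix_c[i]
    return c_term - b * b / a, b / a
```

For the bucket {10, 20} with c = 1 this gives the representative 12 rather than the plain average 15, because the smaller value carries more relative weight.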
![Page 64: 1 Offline, Stream and Approximation Algorithms for Synospis Construction Sudipto Guha University of Pennsylvania Kyuseok Shim Seoul National University.](https://reader037.fdocuments.us/reader037/viewer/2022110321/56649cf35503460f949c118e/html5/thumbnails/64.jpg)
Maximum Error and the $l_\infty$ metric
![Page 65: 1 Offline, Stream and Approximation Algorithms for Synospis Construction Sudipto Guha University of Pennsylvania Kyuseok Shim Seoul National University.](https://reader037.fdocuments.us/reader037/viewer/2022110321/56649cf35503460f949c118e/html5/thumbnails/65.jpg)
Maximum Error Histograms
A bucket $(s_r, e_r, x_r)$ over the numbers $\{x_1, x_2, \ldots, x_n\}$, s.t.
$s_r$: starting position
$e_r$: ending position
$x_r$: representative value
Maximum error is given by
$$ERR_M(s_r,e_r) = \min_{x_r} \max_{i \in [s_r,e_r]} |x_i - x_r|$$
Maximum relative error is defined as:
$$ERR_M(s_r,e_r) = \min_{x_r} \max_{i \in [s_r,e_r]} \frac{|x_i - x_r|}{\max\{c,|x_i|\}}$$
![Page 66: 1 Offline, Stream and Approximation Algorithms for Synospis Construction Sudipto Guha University of Pennsylvania Kyuseok Shim Seoul National University.](https://reader037.fdocuments.us/reader037/viewer/2022110321/56649cf35503460f949c118e/html5/thumbnails/66.jpg)
Maximum Error of a bucket
Given numbers $\{x_1, x_2, \ldots, x_n\}$, the maximum error is given by $Err_M = \min_{x_r} \max_i |x_i - x_r|$
What is the best $x_r$ ?
$(x_{min}+x_{max})/2$, which gives an error of $(x_{max}-x_{min})/2$
![Page 67: 1 Offline, Stream and Approximation Algorithms for Synospis Construction Sudipto Guha University of Pennsylvania Kyuseok Shim Seoul National University.](https://reader037.fdocuments.us/reader037/viewer/2022110321/56649cf35503460f949c118e/html5/thumbnails/67.jpg)
Maximum Relative Error of a set
Given a set of numbers $\{x_1, x_2, \ldots, x_n\}$
max: the maximum of $\{x_1, x_2, \ldots, x_n\}$
min: the minimum of $\{x_1, x_2, \ldots, x_n\}$
c: a sanity constant
The error is some function of c, max, min
E.g., when $c \le \min \le \max$ the error is $(\max-\min)/(\max+\min)$
Optimal maximum relative error for a bucket can be computed in O(1) time
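Both bucket errors depend only on the minimum and maximum of the bucket, which a short sketch makes concrete. The function names are hypothetical, and the relative version below covers only the case c ≤ min ≤ max, where balancing the relative error at the two extremes gives the representative 2·min·max/(min+max); other ranges of c need separate handling that is not shown here.

```python
def max_error(values):
    """Return (error, representative) minimising max_i |x_i - x_r|."""
    lo, hi = min(values), max(values)
    return (hi - lo) / 2, (lo + hi) / 2

def max_relative_error(values, c):
    """Return (error, representative) minimising max_i |x_i - x_r| / max(c, |x_i|).

    Assumes 0 < c <= min(values); this is only one case of the general function.
    """
    lo, hi = min(values), max(values)
    if not 0 < c <= lo:
        raise ValueError("this sketch only handles 0 < c <= min(values)")
    representative = 2 * lo * hi / (lo + hi)
    return (hi - lo) / (hi + lo), representative
```

Once the running min and max of a bucket are known, either error is available in constant time, which is what the dynamic programs on the following slides rely on.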
![Page 68: 1 Offline, Stream and Approximation Algorithms for Synospis Construction Sudipto Guha University of Pennsylvania Kyuseok Shim Seoul National University.](https://reader037.fdocuments.us/reader037/viewer/2022110321/56649cf35503460f949c118e/html5/thumbnails/68.jpg)
The Naïve Optimal Algorithm
for i := 1 to n do {
    OPTM[i,1] := ERRM(1,i)
    for k := 2 to B do {
        max := -∞; min := ∞; OPTM[i,k] := ∞
        for j := i-1 downto 1 do {
            if (max < x[j+1]) max := x[j+1]
            if (min > x[j+1]) min := x[j+1]
            OPTM[i,k] := min{ OPTM[i,k], max( OPTM[j,k-1], ERRM(j+1,i) ) }
        }
    }
}
ERRM(j+1,i) can be obtained in O(1) time from the running max and min
O(Bn) space and $O(Bn^2)$ time optimal algorithm
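A runnable sketch of the naive dynamic program, under a few assumptions of our own: positions are 0-based, `opt[i][k]` covers the first `i` values with at most `k` buckets, and the function name is hypothetical.

```python
import math

def naive_max_error_histogram(x, B):
    """Smallest achievable maximum error when x is split into at most B buckets."""
    n = len(x)
    # opt[i][k]: best error for the prefix x[0:i] using at most k buckets.
    opt = [[math.inf] * (B + 1) for _ in range(n + 1)]
    for k in range(B + 1):
        opt[0][k] = 0.0  # the empty prefix needs no bucket
    for i in range(1, n + 1):
        for k in range(1, B + 1):
            hi, lo = -math.inf, math.inf
            # The last bucket is x[j:i]; grow it leftwards, updating max and min.
            for j in range(i - 1, -1, -1):
                hi = max(hi, x[j])
                lo = min(lo, x[j])
                bucket_error = (hi - lo) / 2
                opt[i][k] = min(opt[i][k], max(opt[j][k - 1], bucket_error))
    return opt[n][B]
```

The three nested loops give the O(Bn²) time and the table gives the O(Bn) space quoted on the slide.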
![Page 69: 1 Offline, Stream and Approximation Algorithms for Synospis Construction Sudipto Guha University of Pennsylvania Kyuseok Shim Seoul National University.](https://reader037.fdocuments.us/reader037/viewer/2022110321/56649cf35503460f949c118e/html5/thumbnails/69.jpg)
An Improved Optimal Algorithm
OPTM[i,k] := min_j { max( OPTM[j,k-1], ERRM(j+1,i) ) }
Observations
OPTM[j,k-1] is a non-decreasing function of j
ERRM(j+1,i) is a non-increasing function of j
To compute min_x { max( F(x), G(x) ) } where F(x) and G(x) are non-decreasing and non-increasing functions, respectively
We can perform binary search for the value of x such that F(x) ≥ G(x) and F(x-1) < G(x-1)
The minimum is min{ G(x-1), F(x) }
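The binary search over the crossing point can be sketched as follows. `F` and `G` are passed in as callables over an integer range; the function name `min_of_max` and the handling of the two boundary cases (no crossing inside the range) are our own additions.

```python
def min_of_max(F, G, lo, hi):
    """Minimise max(F(x), G(x)) over integers lo..hi.

    Assumes F is non-decreasing and G is non-increasing on that range.
    """
    # Find the smallest x with F(x) >= G(x); x == hi + 1 means no such point.
    left, right = lo, hi + 1
    while left < right:
        mid = (left + right) // 2
        if F(mid) >= G(mid):
            right = mid
        else:
            left = mid + 1
    x = left
    candidates = []
    if x <= hi:
        candidates.append(F(x))      # right of the crossing, F dominates
    if x > lo:
        candidates.append(G(x - 1))  # left of the crossing, G dominates
    return min(candidates)
```

Only O(log(hi - lo)) evaluations of F and G are needed, instead of one per position.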
![Page 70: 1 Offline, Stream and Approximation Algorithms for Synospis Construction Sudipto Guha University of Pennsylvania Kyuseok Shim Seoul National University.](https://reader037.fdocuments.us/reader037/viewer/2022110321/56649cf35503460f949c118e/html5/thumbnails/70.jpg)
An Improved Optimal Algorithm
OPTM[i,k] := min_j { max( OPTM[j,k-1], ERRM(j+1,i) ) }
We can reduce the innermost loop of the naïve algorithm to O(log n) steps.
However, ERRM(j+1,i) cannot be computed in O(1) time any more
Using an interval tree, we can compute min and max values for [j+1, i], i.e. ERRM(j+1,i), in O(log n) time
Thus, our improved algorithm takes $O(Bn \log^2 n)$ time with O(Bn) space
![Page 71: 1 Offline, Stream and Approximation Algorithms for Synospis Construction Sudipto Guha University of Pennsylvania Kyuseok Shim Seoul National University.](https://reader037.fdocuments.us/reader037/viewer/2022110321/56649cf35503460f949c118e/html5/thumbnails/71.jpg)
An Interval Tree Example
[Figure: an interval tree over positions 1 to 8]
[1,8]
[1,4] [5,8]
[1,2] [3,4] [5,6] [7,8]
[1,1] [2,2] [3,3] [4,4] [5,5] [6,6] [7,7] [8,8]
Min Interval [2,4]: decomposeLeft, decomposeRight
The steps of decomposing [2,4] with an interval tree
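A range min/max structure of this kind can be sketched in a few lines. The version below is a bottom-up array layout, a swapped-in variant rather than the top-down decomposeLeft/decomposeRight traversal of the figure; the class name and the 0-based inclusive indexing are assumptions.

```python
import math

class MinMaxTree:
    """Range min/max over a fixed list: O(n) build, O(log n) per query."""

    def __init__(self, values):
        self.n = len(values)
        # Leaves live at positions n..2n-1; position 0 is unused.
        self.lo = [0] * self.n + list(values)
        self.hi = [0] * self.n + list(values)
        for node in range(self.n - 1, 0, -1):
            self.lo[node] = min(self.lo[2 * node], self.lo[2 * node + 1])
            self.hi[node] = max(self.hi[2 * node], self.hi[2 * node + 1])

    def query(self, left, right):
        """Return (min, max) of values[left..right], inclusive."""
        lo, hi = math.inf, -math.inf
        l, r = left + self.n, right + self.n + 1
        while l < r:
            if l & 1:  # l is a right child: take it and move right
                lo, hi = min(lo, self.lo[l]), max(hi, self.hi[l])
                l += 1
            if r & 1:  # r is exclusive: step left and take that node
                r -= 1
                lo, hi = min(lo, self.lo[r]), max(hi, self.hi[r])
            l //= 2
            r //= 2
        return lo, hi

    def err(self, left, right):
        """Maximum error of a single bucket covering values[left..right]."""
        lo, hi = self.query(left, right)
        return (hi - lo) / 2
```

Each query touches O(log n) canonical intervals, in the same spirit as decomposing [2,4] into [2,2] and [3,4] in the example tree.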
![Page 72: 1 Offline, Stream and Approximation Algorithms for Synospis Construction Sudipto Guha University of Pennsylvania Kyuseok Shim Seoul National University.](https://reader037.fdocuments.us/reader037/viewer/2022110321/56649cf35503460f949c118e/html5/thumbnails/72.jpg)
Consider another solution
Make the first bucket as large as possible
i.e. push the boundary right E.g. in the figure we can….
As long as the max and min is same…
Why will we have to stop ?
![Page 73: 1 Offline, Stream and Approximation Algorithms for Synospis Construction Sudipto Guha University of Pennsylvania Kyuseok Shim Seoul National University.](https://reader037.fdocuments.us/reader037/viewer/2022110321/56649cf35503460f949c118e/html5/thumbnails/73.jpg)
Consider another solution (2)
In this example we cannot…
But maybe the error comes from a different bucket!
Here’s one idea: Given an i, find Err[1,i]. If i is small, Err[1,i] ≤ OPT. If i is large, Err[1,i] ≥ OPT. How ?
By binary search !
Observe that given an error τ, it is easy to check if the error can be realized by B buckets
![Page 74: 1 Offline, Stream and Approximation Algorithms for Synospis Construction Sudipto Guha University of Pennsylvania Kyuseok Shim Seoul National University.](https://reader037.fdocuments.us/reader037/viewer/2022110321/56649cf35503460f949c118e/html5/thumbnails/74.jpg)
How ?
Assume given an interval [a,b], we can find the min and max, and therefore Err[a,b]
With O(n) time and space preprocessing, we can find Err[] in O(log n) time. (interval tree)
Check[p,q,b,τ]:
If p > q (for b ≥ 0), we are done.
Otherwise, find mid s.t. Err[p,mid] ≤ τ and Err[p,mid+1] > τ, then call Check[mid+1,q,b-1,τ]
$O(B \log^2 n)$ in total
Binary search: log n * log n (to find min and max for Err)
Invocation of Check: B times
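The feasibility test can be sketched as a greedy pass that makes each bucket as long as the error bound allows. To stay short this version extends buckets with a linear scan, so it runs in O(n) per call instead of the O(B log² n) obtained with binary search over an interval tree; the name `check` follows the slide, while the signature and the threshold name `tau` are assumptions.

```python
def check(x, B, tau):
    """True if x can be covered by at most B buckets, each with max error <= tau."""
    n = len(x)
    i = 0
    buckets = 0
    while i < n:
        buckets += 1
        if buckets > B:
            return False
        lo = hi = x[i]
        i += 1
        # Push the right boundary as far as the error bound allows.
        while i < n:
            new_lo, new_hi = min(lo, x[i]), max(hi, x[i])
            if (new_hi - new_lo) / 2 > tau:
                break
            lo, hi = new_lo, new_hi
            i += 1
    return True
```

Greedy extension is safe here because the error of a bucket never decreases when the bucket grows, so ending a bucket early cannot help the remaining buckets.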
![Page 75: 1 Offline, Stream and Approximation Algorithms for Synospis Construction Sudipto Guha University of Pennsylvania Kyuseok Shim Seoul National University.](https://reader037.fdocuments.us/reader037/viewer/2022110321/56649cf35503460f949c118e/html5/thumbnails/75.jpg)
Now for the original problem
By binary search, find the largest s such that, with τ=Err[1,s] and τ’=Err[1,s+1], Check[1,n,B-1,τ]=false and Check[1,n,B-1,τ’]=true
Now OPT=τ’ or the best B-1 bucket error of [s+1,n]
A recursive algorithm!
$T(B) = \log n \cdot B \log^2 n + T(B-1) \approx O(B^2 \log^3 n)$ !! (the $B \log^2 n$ term is the cost of one Check[])
![Page 76: 1 Offline, Stream and Approximation Algorithms for Synospis Construction Sudipto Guha University of Pennsylvania Kyuseok Shim Seoul National University.](https://reader037.fdocuments.us/reader037/viewer/2022110321/56649cf35503460f949c118e/html5/thumbnails/76.jpg)
Summary
In $O(n + B^2 \log^3 n)$ time and O(n) space we can find the optimum error.
What do we do if the data is a stream, or if we have less than O(n) space ?
Approximate, using some of the old ideas…
![Page 77: 1 Offline, Stream and Approximation Algorithms for Synospis Construction Sudipto Guha University of Pennsylvania Kyuseok Shim Seoul National University.](https://reader037.fdocuments.us/reader037/viewer/2022110321/56649cf35503460f949c118e/html5/thumbnails/77.jpg)
Short break !
When we return
• Range Query Histograms
• Wavelets
  • Optimum synopsis
  • Connection to Histograms
• Overall ideas and themes
![Page 78: 1 Offline, Stream and Approximation Algorithms for Synospis Construction Sudipto Guha University of Pennsylvania Kyuseok Shim Seoul National University.](https://reader037.fdocuments.us/reader037/viewer/2022110321/56649cf35503460f949c118e/html5/thumbnails/78.jpg)
Range Query Histograms
![Page 79: 1 Offline, Stream and Approximation Algorithms for Synospis Construction Sudipto Guha University of Pennsylvania Kyuseok Shim Seoul National University.](https://reader037.fdocuments.us/reader037/viewer/2022110321/56649cf35503460f949c118e/html5/thumbnails/79.jpg)
A more synopsis structure
Instead of estimating the value at a point we are interested in sum of the values in intervals/ranges.
Clearly, very useful.
Clearly we need new optimization. E.g., not useful in this example
![Page 80: 1 Offline, Stream and Approximation Algorithms for Synospis Construction Sudipto Guha University of Pennsylvania Kyuseok Shim Seoul National University.](https://reader037.fdocuments.us/reader037/viewer/2022110321/56649cf35503460f949c118e/html5/thumbnails/80.jpg)
A more difficult problem
Only special cases solved (satisfactorily)
Hierarchies:
  Prefix ranges: all ranges of the form [1,j] as j varies
  Complete binary ranges
  General hierarchies
Uniform ranges: all ranges
![Page 81: 1 Offline, Stream and Approximation Algorithms for Synospis Construction Sudipto Guha University of Pennsylvania Kyuseok Shim Seoul National University.](https://reader037.fdocuments.us/reader037/viewer/2022110321/56649cf35503460f949c118e/html5/thumbnails/81.jpg)
Status Range Query
Caveat:
Against a restricted Opt which stores the average of the values in a bucket.
![Page 82: 1 Offline, Stream and Approximation Algorithms for Synospis Construction Sudipto Guha University of Pennsylvania Kyuseok Shim Seoul National University.](https://reader037.fdocuments.us/reader037/viewer/2022110321/56649cf35503460f949c118e/html5/thumbnails/82.jpg)
The uniform case
Consider a sequence X={0,x1,x2,…,xn}
Define the operators: $\Sigma(g)[i] = \sum_{j \le i} g[j]$ is the prefix sum
![Page 83: 1 Offline, Stream and Approximation Algorithms for Synospis Construction Sudipto Guha University of Pennsylvania Kyuseok Shim Seoul National University.](https://reader037.fdocuments.us/reader037/viewer/2022110321/56649cf35503460f949c118e/html5/thumbnails/83.jpg)
Unbiased
Suppose H is a histogram such that $F = \Sigma(X-H)$ is s.t. $\sum_i F[i] = 0$
Or think of $\sum_i \sum_{r<i} (X[r]-H[r]) = 0$
Claim: Error of using H to answer range queries for X is twice the error of using $\Sigma(H)$ to answer point queries about $\Sigma(X)$ !
![Page 84: 1 Offline, Stream and Approximation Algorithms for Synospis Construction Sudipto Guha University of Pennsylvania Kyuseok Shim Seoul National University.](https://reader037.fdocuments.us/reader037/viewer/2022110321/56649cf35503460f949c118e/html5/thumbnails/84.jpg)
The main idea
Define $G[i] = \sum_{r<i} (X[r] - H[r]) = \Sigma(X)[i] - \Sigma(H)[i]$
Now $\sum_i G[i] = 0$ if H is unbiased
Pick a RANDOM element u
Expected[ G[u] ] = 0
Pick two random elements u,v
Expected[ $(G[u]-G[v])^2$ ] = Expected error of using H to answer range queries for X
But that is equal to 2 * Expected[ $G[u]^2$ ]
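The factor of two in the last step follows from expanding the square. The derivation below is a sketch that assumes u and v are drawn independently and uniformly, which the slide does not state explicitly.

```latex
\begin{align*}
E\big[(G[u]-G[v])^2\big]
  &= E\big[G[u]^2\big] - 2\,E\big[G[u]\,G[v]\big] + E\big[G[v]^2\big] \\
  &= 2\,E\big[G[u]^2\big] - 2\,E\big[G[u]\big]\,E\big[G[v]\big]
     && \text{($u$, $v$ independent, identically distributed)} \\
  &= 2\,E\big[G[u]^2\big]
     && \text{(unbiased: } E[G[u]] = 0\text{)}
\end{align*}
```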
![Page 85: 1 Offline, Stream and Approximation Algorithms for Synospis Construction Sudipto Guha University of Pennsylvania Kyuseok Shim Seoul National University.](https://reader037.fdocuments.us/reader037/viewer/2022110321/56649cf35503460f949c118e/html5/thumbnails/85.jpg)
A simple approximation
What we want is: Hard
But we know how to get:
$\Sigma(X)$
$\Sigma(H)$
Piecewise linear histograms!
![Page 86: 1 Offline, Stream and Approximation Algorithms for Synospis Construction Sudipto Guha University of Pennsylvania Kyuseok Shim Seoul National University.](https://reader037.fdocuments.us/reader037/viewer/2022110321/56649cf35503460f949c118e/html5/thumbnails/86.jpg)
An easy trick
We can also find:
A “buffer” of size 1 after each bucket
Use it as a patch-up
2B buckets
Same error as OPT
Approximation algorithms try to find the “continuous variant”
![Page 87: 1 Offline, Stream and Approximation Algorithms for Synospis Construction Sudipto Guha University of Pennsylvania Kyuseok Shim Seoul National University.](https://reader037.fdocuments.us/reader037/viewer/2022110321/56649cf35503460f949c118e/html5/thumbnails/87.jpg)
The Synopsis Construction Problem
Formally, given a signal X and a dictionary $\{\psi_i\}$, find a representation $F = \sum_i z_i \psi_i$ with at most B non-zero $z_i$, minimizing some error which is a fn of X-F
In case of histograms the “dictionary” was the set of all possible intervals – but we could only choose a non-overlapping set.
![Page 88: 1 Offline, Stream and Approximation Algorithms for Synospis Construction Sudipto Guha University of Pennsylvania Kyuseok Shim Seoul National University.](https://reader037.fdocuments.us/reader037/viewer/2022110321/56649cf35503460f949c118e/html5/thumbnails/88.jpg)
The eternal “what if”
If the $\{\psi_i\}$ are “designed for the data” do we get a better synopsis ?
Absolutely! Consider a Sine wave … Or any smooth fn.
Why though ?
![Page 89: 1 Offline, Stream and Approximation Algorithms for Synospis Construction Sudipto Guha University of Pennsylvania Kyuseok Shim Seoul National University.](https://reader037.fdocuments.us/reader037/viewer/2022110321/56649cf35503460f949c118e/html5/thumbnails/89.jpg)
Representations not piecewise const.
Electromagnetic signals are sine/cosine waves.
If we are considering any process which involves electromagnetic signals – this is a great idea.
These are particularly great for representing periodic functions.
Often these algorithms are found in DSP (digital signal processing) chips
A fascinating 300+ years of history in Math !
![Page 90: 1 Offline, Stream and Approximation Algorithms for Synospis Construction Sudipto Guha University of Pennsylvania Kyuseok Shim Seoul National University.](https://reader037.fdocuments.us/reader037/viewer/2022110321/56649cf35503460f949c118e/html5/thumbnails/90.jpg)
A slight problem …
Fourier is suitable to smooth “natural processes”
If we are talking about signals from man-made processes, clearly they cannot be natural (and hardly likely to be smooth) …
More seriously, discreteness and burstiness…
![Page 91: 1 Offline, Stream and Approximation Algorithms for Synospis Construction Sudipto Guha University of Pennsylvania Kyuseok Shim Seoul National University.](https://reader037.fdocuments.us/reader037/viewer/2022110321/56649cf35503460f949c118e/html5/thumbnails/91.jpg)
The Wavelet (frames)
Inherits properties from both worlds
Fourier transform has all frequencies.
Considers frequencies that are powers of 2 but the effect of each wave is limited (shifted)
![Page 92: 1 Offline, Stream and Approximation Algorithms for Synospis Construction Sudipto Guha University of Pennsylvania Kyuseok Shim Seoul National University.](https://reader037.fdocuments.us/reader037/viewer/2022110321/56649cf35503460f949c118e/html5/thumbnails/92.jpg)
Wavelets
What to do in a discrete world ?
The Haar Wavelets (1910) !
![Page 93: 1 Offline, Stream and Approximation Algorithms for Synospis Construction Sudipto Guha University of Pennsylvania Kyuseok Shim Seoul National University.](https://reader037.fdocuments.us/reader037/viewer/2022110321/56649cf35503460f949c118e/html5/thumbnails/93.jpg)
The Haar Wavelets
Best “energy” synopsis amongst all wavelets (we will see more later)
Great for data with discontinuities. A natural extension to discrete spaces
{1,-1,0,0,0,0…}, {0,0,1,-1,0,0,…},{0,0,0,0,1,-1,…}…
{1,1,-1,-1,0,0,0,0,…},{0,0,0,0,1,1,-1,-1,…}…
![Page 94: 1 Offline, Stream and Approximation Algorithms for Synospis Construction Sudipto Guha University of Pennsylvania Kyuseok Shim Seoul National University.](https://reader037.fdocuments.us/reader037/viewer/2022110321/56649cf35503460f949c118e/html5/thumbnails/94.jpg)
The Haar Synopsis Problem
Formally, given a signal X and the Haar basis $\{\psi_i\}$, find a representation $F = \sum_i z_i \psi_i$ with at most B non-zero $z_i$, minimizing some error which is a fn of X-F
Let's begin with the VOPT error ($\|X-F\|_2^2$)
![Page 95: 1 Offline, Stream and Approximation Algorithms for Synospis Construction Sudipto Guha University of Pennsylvania Kyuseok Shim Seoul National University.](https://reader037.fdocuments.us/reader037/viewer/2022110321/56649cf35503460f949c118e/html5/thumbnails/95.jpg)
The Magic of Parseval (no spears)
The $l_2$ distance is unchanged by a rotation.
A set of basis vectors $\{\psi_i\}$ define a rotation iff $\langle \psi_i, \psi_j \rangle = \delta_{ij}$, i.e., the inner product is 1 if i = j and 0 otherwise
Redefine the basis (scale) s.t. $\|\psi_i\|_2 = 1$
Let the transform be W
Then $\|X-F\|_2 = \|W(X-F)\|_2 = \|W(X) - W(F)\|_2$
Now $W(F) = \{z_1, z_2, \ldots, z_n\}$ and so $\|W(X) - W(F)\|_2^2 = \sum_i (W(X)_i - z_i)^2$
![Page 96: 1 Offline, Stream and Approximation Algorithms for Synospis Construction Sudipto Guha University of Pennsylvania Kyuseok Shim Seoul National University.](https://reader037.fdocuments.us/reader037/viewer/2022110321/56649cf35503460f949c118e/html5/thumbnails/96.jpg)
What did we achieve ?
Storing the largest coefficients is the best solution.
Note that the fact zi=W(X)i is a consequence of the optimization and IS NOT a specification of the problem.
More on that later.
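The reason the largest coefficients win can be written out in one line. In the sketch below, S denotes the set of indices with non-zero $z_i$; this symbol is introduced here for illustration and does not appear on the slides.

```latex
\begin{align*}
\min_{z:\ z_i = 0 \text{ for } i \notin S} \ \sum_i \big(W(X)_i - z_i\big)^2
  &= \sum_{i \notin S} W(X)_i^2
  && \text{(attained at } z_i = W(X)_i \text{ for } i \in S\text{)} \\
\min_{|S| \le B} \ \sum_{i \notin S} W(X)_i^2
  &= \sum_i W(X)_i^2 \;-\; \max_{|S| \le B} \sum_{i \in S} W(X)_i^2
\end{align*}
```

The second line is maximized by taking S to be the B indices with the largest $|W(X)_i|$, and the choice $z_i = W(X)_i$ falls out of the first line rather than being imposed.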
![Page 97: 1 Offline, Stream and Approximation Algorithms for Synospis Construction Sudipto Guha University of Pennsylvania Kyuseok Shim Seoul National University.](https://reader037.fdocuments.us/reader037/viewer/2022110321/56649cf35503460f949c118e/html5/thumbnails/97.jpg)
What is the best algorithm ?
How to find the largest B coefficients of the set {x1,x2,…} ?
Cascade Algorithm. Recall the hierarchical nature.
![Page 98: 1 Offline, Stream and Approximation Algorithms for Synospis Construction Sudipto Guha University of Pennsylvania Kyuseok Shim Seoul National University.](https://reader037.fdocuments.us/reader037/viewer/2022110321/56649cf35503460f949c118e/html5/thumbnails/98.jpg)
Cascade algorithm ?
Given a, b, represent them as (a-b) and (a+b)
Divide by sqrt(2) so that the sum of squares is preserved
Running time O(n)
Example input: 1 4 5 6
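The cascade can be sketched as repeated pairwise averaging and differencing. The function name and the output ordering (overall average first, then details from coarse to fine) are choices made for this illustration, and the input length is assumed to be a power of two.

```python
import math

def haar_transform(x):
    """Orthonormal Haar transform of a sequence whose length is a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = [float(v) for v in x]
    length = n
    while length > 1:
        half = length // 2
        # Pairwise (a+b)/sqrt(2) and (a-b)/sqrt(2) keep the sum of squares unchanged.
        sums = [(out[2 * i] + out[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        diffs = [(out[2 * i] - out[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        out[:length] = sums + diffs
        length = half
    return out
```

On the example input 1 4 5 6 the first coefficient is 8 and the coarsest detail is -3, and the squared coefficients sum to 78, the same as 1 + 16 + 25 + 36. The work halves at every level, which is where the O(n) running time comes from.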
![Page 99: 1 Offline, Stream and Approximation Algorithms for Synospis Construction Sudipto Guha University of Pennsylvania Kyuseok Shim Seoul National University.](https://reader037.fdocuments.us/reader037/viewer/2022110321/56649cf35503460f949c118e/html5/thumbnails/99.jpg)
Surfing Streams
Notice that once the left half is done we only need to remember the
A stream algorithm is natural
1 4 5 6
![Page 100: 1 Offline, Stream and Approximation Algorithms for Synospis Construction Sudipto Guha University of Pennsylvania Kyuseok Shim Seoul National University.](https://reader037.fdocuments.us/reader037/viewer/2022110321/56649cf35503460f949c118e/html5/thumbnails/100.jpg)
Surfing Streams
Have an auxiliary structure that maintains the top B of a set of numbers
Where else have you seen this ?
Reduce Merge Paradigm
Also used in clustering data streams
![Page 101: 1 Offline, Stream and Approximation Algorithms for Synospis Construction Sudipto Guha University of Pennsylvania Kyuseok Shim Seoul National University.](https://reader037.fdocuments.us/reader037/viewer/2022110321/56649cf35503460f949c118e/html5/thumbnails/101.jpg)
In summary
Given a series of {x1,x2,…xi,…xn} in increasing order of i we can find (maintain) the largest B coefficients in O(n) time and O(B+log n) space
Ok, but only for $\|X-F\|_2$
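A one-pass version can be sketched with a stack of pending partial averages (at most one per level) plus a size-B heap for the top coefficients. Everything named here is an assumption for illustration: the function `stream_top_haar`, the label tuples, and the power-of-two stream length. The heap costs O(log B) per update, so this sketch runs in O(n log B) rather than the O(n) stated above.

```python
import heapq
import math

def stream_top_haar(stream, B):
    """Return the B largest-magnitude orthonormal Haar coefficients of a stream."""
    if B <= 0:
        return []
    pending = []  # stack of (level, scaled average): O(log n) entries
    top = []      # min-heap of (|coef|, tie-breaker, label, coef): O(B) entries
    counter = 0
    seen = 0

    def offer(label, coef):
        nonlocal counter
        counter += 1
        item = (abs(coef), counter, label, coef)
        if len(top) < B:
            heapq.heappush(top, item)
        elif item[0] > top[0][0]:
            heapq.heapreplace(top, item)

    for value in stream:
        seen += 1
        level, avg = 0, float(value)
        # Merge with a finished left sibling of the same level, cascading upward.
        while pending and pending[-1][0] == level:
            _, left = pending.pop()
            detail = (left - avg) / math.sqrt(2)
            avg = (left + avg) / math.sqrt(2)
            level += 1
            offer(("detail", level, seen // 2 ** level - 1), detail)
        pending.append((level, avg))

    if len(pending) != 1:
        raise ValueError("stream length must be a power of two")
    offer(("average",), pending[0][1])
    return [(label, coef) for _, _, label, coef in sorted(top, reverse=True)]
```

For the stream 1, 4, 5, 6 with B = 2 the result holds the overall average coefficient 8 and the level-2 detail -3. Only the stack and the heap are kept, matching the O(B + log n) space bound.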
![Page 102: 1 Offline, Stream and Approximation Algorithms for Synospis Construction Sudipto Guha University of Pennsylvania Kyuseok Shim Seoul National University.](https://reader037.fdocuments.us/reader037/viewer/2022110321/56649cf35503460f949c118e/html5/thumbnails/102.jpg)
What do we do in presence of multiple dimensions/measures ?
Use multi-dim transforms
Use many 1-D transforms
Strategy: Use a flexible scheme that allows us to store the index and a bitmap to indicate which measures are stored.
Extended Histograms
Indices are large.
Correlations
![Page 103: 1 Offline, Stream and Approximation Algorithms for Synospis Construction Sudipto Guha University of Pennsylvania Kyuseok Shim Seoul National University.](https://reader037.fdocuments.us/reader037/viewer/2022110321/56649cf35503460f949c118e/html5/thumbnails/103.jpg)
How to solve it ?
For the basic 1-D problem we need to choose the largest B coefficients
Use Parseval to transform error of data to choosing/not choosing coefficients
Here we have “bags”
We can choose coefficient j with bitmap
0100 using H+S space
0101 using H+2S space
1111 using H+4S space
![Page 104: 1 Offline, Stream and Approximation Algorithms for Synospis Construction Sudipto Guha University of Pennsylvania Kyuseok Shim Seoul National University.](https://reader037.fdocuments.us/reader037/viewer/2022110321/56649cf35503460f949c118e/html5/thumbnails/104.jpg)
Is 0101 better than 1100 ?
Subproblem: Given the fact that we have settled on choosing 2 coefficients for j, which 2 ?
It is the largest 2 again!
Basically we can choose a set of indices j and decide how many coefficients we choose for each j
What does this remind you of ?
![Page 105: 1 Offline, Stream and Approximation Algorithms for Synospis Construction Sudipto Guha University of Pennsylvania Kyuseok Shim Seoul National University.](https://reader037.fdocuments.us/reader037/viewer/2022110321/56649cf35503460f949c118e/html5/thumbnails/105.jpg)
Knapsack
Each item j is available with M different “versions”.
Cost of the rth version is H+rS. The profit is an increasing function of r.
Can choose only one version.
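This is a multiple-choice knapsack, and a small dynamic program over the space budget illustrates it. The sketch assumes integer H, S and budget, and takes the profits as given input; the function name and argument layout are hypothetical.

```python
def multi_choice_knapsack(profits, H, S, budget):
    """Maximum total profit within the space budget.

    profits[j][r-1] is the profit of keeping the r largest coefficients of
    index j, which costs H + r*S space. At most one version per index is used.
    """
    best = [0.0] * (budget + 1)  # best[s]: max profit using at most s space
    for versions in profits:
        new_best = best[:]       # option: skip index j entirely
        for r, profit in enumerate(versions, start=1):
            cost = H + r * S
            for space in range(cost, budget + 1):
                new_best[space] = max(new_best[space], best[space - cost] + profit)
        best = new_best
    return best[budget]
```

Reading from `best` and writing to `new_best` is what enforces the "only one version" rule for each index.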
![Page 106: 1 Offline, Stream and Approximation Algorithms for Synospis Construction Sudipto Guha University of Pennsylvania Kyuseok Shim Seoul National University.](https://reader037.fdocuments.us/reader037/viewer/2022110321/56649cf35503460f949c118e/html5/thumbnails/106.jpg)
Strange roadbumps
Optimal profit + Optimal error= total energy
The relationship does not hold in approximation.
99+1=100. Approximating 99 by 95 increases error by 400%
We will return to this.
![Page 107: 1 Offline, Stream and Approximation Algorithms for Synospis Construction Sudipto Guha University of Pennsylvania Kyuseok Shim Seoul National University.](https://reader037.fdocuments.us/reader037/viewer/2022110321/56649cf35503460f949c118e/html5/thumbnails/107.jpg)
Many questions
What do we do for other error measures ?
What is the connection with Histograms ?
Positives:
Some direction
Cascade algorithm
Hierarchy of coefficients
![Page 108: 1 Offline, Stream and Approximation Algorithms for Synospis Construction Sudipto Guha University of Pennsylvania Kyuseok Shim Seoul National University.](https://reader037.fdocuments.us/reader037/viewer/2022110321/56649cf35503460f949c118e/html5/thumbnails/108.jpg)
Non l2 errors
![Page 109: 1 Offline, Stream and Approximation Algorithms for Synospis Construction Sudipto Guha University of Pennsylvania Kyuseok Shim Seoul National University.](https://reader037.fdocuments.us/reader037/viewer/2022110321/56649cf35503460f949c118e/html5/thumbnails/109.jpg)
Storing coefficients is suboptimal
Recall the example {1,4,5,6}. We want a 1-term summary and the error is $l_\infty$. What do we store ?
1 4 5 6
What is the final Result ?
{3.5,3.5,3.5,3.5}
What is the transform ?
{7,0,0,0}
But the set of coefficients available {8,?,?,?}
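The gap between the available coefficient and the best stored value can be checked numerically. This sketch assumes the orthonormal scaling, under which a lone first coefficient z reconstructs every point as z / sqrt(n); the helper name is hypothetical.

```python
import math

def max_error_of_constant(data, first_coefficient):
    """Maximum error when only the first (average) coefficient is stored."""
    value = first_coefficient / math.sqrt(len(data))
    return max(abs(x - value) for x in data)

data = [1, 4, 5, 6]
error_with_available = max_error_of_constant(data, 8.0)  # reconstructs 4 everywhere
error_with_best = max_error_of_constant(data, 7.0)       # reconstructs 3.5 everywhere
```

Storing the transform coefficient 8 gives a maximum error of 3, while storing 7, which is not a coefficient of the data, gives 2.5.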
![Page 110: 1 Offline, Stream and Approximation Algorithms for Synospis Construction Sudipto Guha University of Pennsylvania Kyuseok Shim Seoul National University.](https://reader037.fdocuments.us/reader037/viewer/2022110321/56649cf35503460f949c118e/html5/thumbnails/110.jpg)
What to do ?
Search where there is light.
Restricted problem. Useful if the synopsis has more than one use.
Think outside the coefficients
Probabilistic Rounding
Search (cleverly) over the whole space
![Page 111: 1 Offline, Stream and Approximation Algorithms for Synospis Construction Sudipto Guha University of Pennsylvania Kyuseok Shim Seoul National University.](https://reader037.fdocuments.us/reader037/viewer/2022110321/56649cf35503460f949c118e/html5/thumbnails/111.jpg)
The Best Restricted Synopsis
Maximum Error.
A value (at the leaf) is affected by only the ancestors.
# of ancestors = log n
Guess/try all subsets of the ancestor set! O(n) choices
Start bottom up and use a DP to choose the best B coefficients overall.
Works for a large number of error measures.
![Page 112: 1 Offline, Stream and Approximation Algorithms for Synospis Construction Sudipto Guha University of Pennsylvania Kyuseok Shim Seoul National University.](https://reader037.fdocuments.us/reader037/viewer/2022110321/56649cf35503460f949c118e/html5/thumbnails/112.jpg)
Analysis
At each internal node j we need to maintain the table
Error[j,Ancestor set,b]: the contribution to the minimum error by only the subtree rooted at j when using b or less coefficients (for the subtree)
Size of table $O(n^2 B)$;
Time ~ $O(n^2 B \log B)$ [depends on measure]
But we can do better.
![Page 113: 1 Offline, Stream and Approximation Algorithms for Synospis Construction Sudipto Guha University of Pennsylvania Kyuseok Shim Seoul National University.](https://reader037.fdocuments.us/reader037/viewer/2022110321/56649cf35503460f949c118e/html5/thumbnails/113.jpg)
Faster Restricted Synopsis
A better cut
Number of coefficients in a subtree is at most size+1
Size of the table storing Err[j, Ancestor Set, b] remains constant as we go up the levels!
Ancestor set decreases by 1
b takes twice as many values
$O(n^2)$ algorithm
We can also reduce the space to O(n)
![Page 114: 1 Offline, Stream and Approximation Algorithms for Synospis Construction Sudipto Guha University of Pennsylvania Kyuseok Shim Seoul National University.](https://reader037.fdocuments.us/reader037/viewer/2022110321/56649cf35503460f949c118e/html5/thumbnails/114.jpg)
Thinking beyond the coefficient
Probabilistic Rounding
Start from the coefficients.
Randomly round most of them to 0
A few are rounded to non-zero values
E.g. set $z_i = \lambda$ with prob. $W(X)_i/\lambda$ and 0 otherwise
Has promise (correct expectation, variance)
Two issues,
The quality is unclear (wrt the original optimization)
The expected number of non-zero coefficients is B
The variance is large, so with reasonable prob ~ 2B
![Page 115: 1 Offline, Stream and Approximation Algorithms for Synospis Construction Sudipto Guha University of Pennsylvania Kyuseok Shim Seoul National University.](https://reader037.fdocuments.us/reader037/viewer/2022110321/56649cf35503460f949c118e/html5/thumbnails/115.jpg)
More exploration reqd
Interestingly the method (as proposed) eliminates a region of the search space
We can construct examples where the optimum lies in that region.
But it is an interesting method and likely (I/we are guessing) preserves more than one error measure simultaneously (multi-criterion optimization)
![Page 116: 1 Offline, Stream and Approximation Algorithms for Synospis Construction Sudipto Guha University of Pennsylvania Kyuseok Shim Seoul National University.](https://reader037.fdocuments.us/reader037/viewer/2022110321/56649cf35503460f949c118e/html5/thumbnails/116.jpg)
What is the optimum strategy
Consider the best set of coefficients $Z^* = \{z_1, z_2, \ldots, z_n\}$
“Nudge” them a bit by making them multiples of some $\delta$
The “extra error” is small (and a fn of $\delta$)
In fact each point sees $\pm \delta \log n$
By reducing $\delta$ we can get a $(1+\epsilon)$ approx
![Page 117: 1 Offline, Stream and Approximation Algorithms for Synospis Construction Sudipto Guha University of Pennsylvania Kyuseok Shim Seoul National University.](https://reader037.fdocuments.us/reader037/viewer/2022110321/56649cf35503460f949c118e/html5/thumbnails/117.jpg)
A straightforward idea
But we still need to find the solution
The ancestor set is unimportant – what is important is their combined effect.
Try all possible values (multiples of $\delta$, but we still need to fix the range)
The graphs – the data
The graphs … l1
Relative Error (small B), Relative l1
The times
What have we seen so far
Wavelet representation for l_2 error: streaming
Wavelet representation for non-l_2 error: restricted, unrestricted, stream
A return to histograms
Easy relationships
A B-bucket (piecewise constant) histogram can be represented by 2B log n Haar wavelet coefficients.
Why? Only the 2B boundary points matter.
A B-term Haar wavelet synopsis can be represented by a 3B-bucket histogram.
Why? Each wavelet basis function creates 3 extra pieces from 1 line.
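The first relationship can be illustrated with a small sketch: expand a piecewise-constant histogram, take its Haar transform, and count the non-zero coefficients. The unnormalized Haar convention and the example buckets are assumptions chosen for illustration; a detail coefficient can only be non-zero when a bucket boundary falls strictly inside its support.

```python
import math

def haar_forward(x):
    """Unnormalized Haar: [overall average, details from coarsest to finest]."""
    assert len(x) > 0 and len(x) & (len(x) - 1) == 0, "length must be a power of two"
    cur, details = list(x), []
    while len(cur) > 1:
        half = len(cur) // 2
        avg = [(cur[2 * i] + cur[2 * i + 1]) / 2 for i in range(half)]
        det = [(cur[2 * i] - cur[2 * i + 1]) / 2 for i in range(half)]
        details = det + details
        cur = avg
    return cur + details

def expand(buckets):
    """buckets is a list of (length, value) pairs; returns the full signal."""
    out = []
    for length, value in buckets:
        out.extend([value] * length)
    return out

buckets = [(5, 2.0), (6, 7.0), (5, 1.0)]        # hypothetical histogram, B = 3, n = 16
signal = expand(buckets)
coeffs = haar_forward(signal)
nonzero = sum(1 for c in coeffs if abs(c) > 1e-12)
bound = 2 * len(buckets) * math.log2(len(signal))   # the 2B log n bound from the slide
```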
Anything else ?
Totally!
We can use Wavelets to get (1+\epsilon)-approximate V-optimal histograms.
In fact the method has advantages…
Histograms, Take 5:
A B-term Histogram can be represented by cB log n wavelet terms.
What if we choose the largest cB log n wavelet terms?
Need not be good.
The best histogram has the cB log n wavelets “aligned” such that the result is B buckets.
The best cB log n coefficients are all over the place and give us 3cB log n buckets.
All hope is lost ?
If at first you don’t succeed…
We repeat the process and also keep the next cB log n coefficients …
No.
But notice that the “energy” drops. Energy = $\|X\|_2 = \|W(X)\|_2$
Basic intuition: If there were a lot of coefficients which were large then the best V-Opt histogram MUST have a large error.
Why?
The “robust” property
Look at $\|W(X)-W(H)\|_2 = \|X-H\|_2$
W(H) has $cB\log n$ entries.
If W(X) has $cB\epsilon^{-2}\log n$ large entries …
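The identity between the error in signal space and the error in wavelet space holds for an orthonormal transform. A minimal sketch, assuming the orthonormal Haar normalization (division by the square root of 2 at every level) and two made-up vectors standing in for X and a histogram H:

```python
import math

def haar_orthonormal(x):
    """Orthonormal Haar transform; preserves the l2 norm (Parseval)."""
    assert len(x) > 0 and len(x) & (len(x) - 1) == 0, "length must be a power of two"
    s = math.sqrt(2.0)
    cur, details = list(x), []
    while len(cur) > 1:
        half = len(cur) // 2
        avg = [(cur[2 * i] + cur[2 * i + 1]) / s for i in range(half)]
        det = [(cur[2 * i] - cur[2 * i + 1]) / s for i in range(half)]
        details = det + details
        cur = avg
    return cur + details

def l2(v):
    return math.sqrt(sum(t * t for t in v))

x = [3.0, 5.0, 10.0, 8.0, 2.0, 2.0, 7.0, 1.0]   # hypothetical signal X
h = [4.0, 4.0, 9.0, 9.0, 3.0, 3.0, 3.0, 3.0]    # hypothetical 3-bucket histogram H

err_signal = l2([a - b for a, b in zip(x, h)])
err_wavelet = l2([a - b for a, b in zip(haar_orthonormal(x), haar_orthonormal(h))])
```

Because the transform is linear and norm-preserving, any large coefficient of W(X) that W(H) cannot match contributes directly to the histogram error.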
A strange idea in 1000 words
Consider the projection to the largest $cB\epsilon^{-2}\log n$ wavelet terms.
Is … ≈ ? (figure)
No. But flatten the function: … ≈ X (figure)
In fact
If we choose a large number, i.e., $(B \log n)^{O(1)}$, of coefficients, then the boundary points of those coefficients are (approximately) good boundary points for a VOPT histogram.
The take away:
I’m ok, you’re ok.
If I’m not ok then you’re not ok either.
An oft-repeated approximation paradigm:
“If there are too many coefficients then my algorithm is doomed, but so is anyone else’s, and therefore I am good.”
“If there are not too many coefficients then we’re good.”
The Extended Wavelets in l2
We can store the largest coefficients
If there are too many coefficients which are large then optimum error is large.
Otherwise we repeatedly take out coefficients till taking out coefficients will not reduce the error any more.
DP on the set of coefficients taken out.
The Full Monty – update streams
So far we have been looking at X arriving as {x1,x2,…}
What happens when X is specified by a stream of updates ?
i.e., the update $(i, d_i)$ means: change $x_i$ to $x_i + d_i$
Sketches: Stream Embeddings
Basically dimensionality reduction.
To compute the histogram H of signal X:
Compute an embedding g(X) that fits the space.
Compute H s.t. g(H) is close to g(X).
Linear Embeddings [JL Lemma]
A is a random matrix drawn from a Gaussian distribution, of dimension $(\log n/\epsilon^2) \times n$.
$\|x\|_2 \le \|Ax\|_2 \le (1+\epsilon)\|x\|_2$
Too many elements in the matrix!
Use pseudorandom generators.
p-stable distributions for $l_p$ where $p \in (0, 2]$.
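A toy illustration of the embedding, assuming a dense Gaussian matrix scaled by 1/sqrt(k) and a made-up test vector; the dimensions, the seed, and the helper names are arbitrary choices for this sketch, and the matrix is stored explicitly here, which is exactly what the pseudorandom-generator remark is meant to avoid.

```python
import math
import random

def gaussian_matrix(k, n, rng):
    """k x n matrix of N(0, 1/k) entries, so that E[||Ax||^2] = ||x||^2."""
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n)] for _ in range(k)]

def apply(A, x):
    """Matrix-vector product Ax."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def l2(v):
    return math.sqrt(sum(t * t for t in v))

rng = random.Random(7)
n, k = 1024, 200                      # hypothetical sizes: 1024 coordinates, 200-row sketch
x = [math.sin(0.1 * i) + (5.0 if i == 40 else 0.0) for i in range(n)]
A = gaussian_matrix(k, n, rng)
ratio = l2(apply(A, x)) / l2(x)       # concentrates around 1 as k grows
```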
What it achieves
Computes the norm (from the sketch $Ax$).
Increasing a coordinate of x adds the corresponding column of A, scaled by the increment, to the sketch.
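Because the sketch is linear, an update stream can be processed one update at a time. The sketch below is an illustration under assumptions: the class name `LinearSketch` is hypothetical, and each column is regenerated on demand from a string-seeded `random.Random`, which is only a stand-in for a real pseudorandom generator.

```python
import math
import random

class LinearSketch:
    """Maintains s = A x under updates (i, d) without storing A or x."""

    def __init__(self, k, seed=0):
        self.k = k
        self.seed = seed
        self.s = [0.0] * k

    def column(self, i):
        # Regenerate column i deterministically; a stand-in for a PRG.
        rng = random.Random(f"{self.seed}:{i}")
        scale = 1.0 / math.sqrt(self.k)
        return [rng.gauss(0.0, 1.0) * scale for _ in range(self.k)]

    def update(self, i, d):
        """Apply the update x_i <- x_i + d by adding d times column i."""
        col = self.column(i)
        for r in range(self.k):
            self.s[r] += d * col[r]

    def norm_estimate(self):
        return math.sqrt(sum(v * v for v in self.s))

sk = LinearSketch(k=16, seed=3)
for i, d in [(3, 5.0), (7, -2.0), (3, -1.0)]:
    sk.update(i, d)
```

Two streams that result in the same final vector produce the same sketch (up to floating-point rounding), regardless of the order or granularity of the updates.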
Suppose we knew the intervals
The best histogram minimizes $\|X-H\|_2 \approx \|AX-AH\|_2$.
AX is a vector; AH is a linear function of B values.
We have a minimum squared error program, solvable in polynomial time; it is more involved in the 1-norm.
Cannot do that
$\|X-H\|_2 = \|W(X)-W(H)\|_2 \approx \|AW(X)-AW(H)\|_2$
Idea:
Use the linear map to find the large wavelet coefficients (a top-k problem using sketches).
Use ideas similar to Take 5 to get the final solution.
The return of the pink Fourier
Assuming x1,x2,…,xi,… arrive in increasing order of i, find/maintain the top k Fourier coefficients.
Use the strategy: assume that there are O(k log n) frequencies and try to find them.
If not, we are doomed and so is everyone. So we are ok.
For the 3rd time …
What about top k
Assuming x1,x2,…,xi,… are specified by a stream of updates, find/maintain the top k values (all elements with frequency ~1/k or more).
Use the strategy: assume that there are O(k log n) elements and try to find them.
If not, we are doomed and so is everyone. So we are ok. Again!
Use group testing: 20 questions, bit chasing. Is a heavy item in the first half?
You can use norms, or you can use collisions (hashes).
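The bit-chasing idea can be sketched for the simplest case: a single item that holds more than half of the total mass, with all final values non-negative (both are assumptions of this illustration, as is the class name `BitChaser`). One counter per bit position answers the question "is the heavy item in the half where this bit is 1?".

```python
class BitChaser:
    """Recovers a majority item from an update stream using one counter per bit."""

    def __init__(self, bits):
        self.bits = bits
        self.total = 0
        self.ones = [0] * bits      # ones[b] = total mass of items whose bit b is 1

    def update(self, i, d):
        """Apply the update x_i <- x_i + d."""
        self.total += d
        for b in range(self.bits):
            if (i >> b) & 1:
                self.ones[b] += d

    def majority_item(self):
        """Index of the item holding more than half the mass, if one exists.

        If no such item exists the returned index is meaningless.
        """
        item = 0
        for b in range(self.bits):
            if 2 * self.ones[b] > self.total:
                item |= 1 << b
        return item

chaser = BitChaser(bits=4)
for i, d in [(5, 10), (3, 2), (9, 3), (5, 4), (3, -1)]:
    chaser.update(i, d)
heavy = chaser.majority_item()
```

To find several heavy items, one common approach is to hash items into groups so that each heavy item is likely to dominate its own group, and run the same test per group.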
From optimization to learning
We are trying to “learn” a “pure” signal that has few coefficients…
A general paradigm.
The Meaning of Life
In Summary (high level):
Approximation is very useful for synopsis construction (the execution time speedups plus “the end use of synopsis is approximation only”)
Synopses are usually applied on large data. Asymptotic behaviour matters
The exact definition of the optimization is important. How natural is natural…
Few degrees of separation between the synopsis structures. They are related. They should be. But then we can use algorithmic techniques back and forth between them.
The Summary (contd.)
In algorithm design terms:
Most synopsis construction problems involve DP. Investigating how to change the DP to get approximate, space-efficient algorithms is often useful.
Search techniques (computational geometry), such as searching exponents first, are useful.
What you analyze (carefully) is often what you would get asymptotically. The usual techniques we use for pruning etc. can be analyzed and shown to be better.
Reduce-Merge ⇒ Streaming?
The top k in various disguises. Group testing matters.
What lies ahead
OK. So 1-D histograms have good algorithms. 2-D?
NP-hard. Some approximation algorithms known.
Q: In linear time and sublinear space, what can we do?
Sketch-based results. Long way to go.
What lies ahead
So 1-D Haar wavelets have good algorithms (non-l2).
2-D?
Unlikely to be NP-hard.
Quasi-polynomial time ($n^{\log n}$) approximation algorithms known.
Q: In linear time and sublinear space, what can we do?
What lies ahead
So 1-D Haar wavelets have good algorithms (non-l2).
Non-Haar? Daubechies. Multifractals.
Unlikely to be NP-hard.
Quasi-polynomial time ($n^{\log n}$) approximation algorithms known.
What can we do?
What lies ahead
All the update stream results are based on l2 error because of Johnson-Lindenstrauss (and some on lp for $0 < p \le 2$).
What about other errors? Will require new techniques for streaming.
Notes (not from the underground)
The VOPT definition: Poosala, Haas, Ioannidis, Shekita, SIGMOD ’96.
The VOPT histogram algorithm: Jagadish, Koudas, Muthukrishnan, Poosala, Sevcik, Suel, VLDB ’98.
Take 1: Guha, Koudas, Shim, STOC ’01.
Take 2: Guha, Koudas, ICDE ’02.
Take 3 & 4: Guha, Koudas, Shim, TODS ’05.
Take 5: Guha, Indyk, Muthukrishnan, Strauss, ICALP ’02.
Relative Error Histograms: Guha, Shim, Woo, VLDB ’04.
Maximum Error histograms: Nicole, J. of Parallel Distributed Computing, 1994. (Muthukrishnan, Khanna, Skiena, ICALP ’97). Guha, Shim, (here) ’05.
More Notes
Range Query Histograms: Muthukrishnan, Strauss, SODA ’03.
The Full Monty: Gilbert, Guha, Indyk, Kotidis, Muthukrishnan, Strauss, STOC ’02.
Parseval stuff: Parseval, (margin of notebook?), 1799.
Folklore sum of squares and l2 The mandala
Surfing Wavelets: Gilbert, Kotidis, Muthukrishnan, Strauss, VLDB ’01.
Probabilistic Synopsis: Gibbons, Garofalakis, SIGMOD ’02 (also TODS ’04).
Maximum error (restricted version): Garofalakis, Kumar, PODS ’04.
Notes again
Faster Restricted Synopsis: Guha, VLDB ’05.
Unrestricted non l2 error: Guha, Harb, KDD ’05 + new results.
Extended Wavelets: Deligiannakis, Roussopoulos, SIGMOD ’03. Guha, Kim, Shim, VLDB ’04.
Streaming Fourier approximation: Gilbert, Guha, Indyk, Muthukrishnan, Strauss, STOC ’02.
Learning Fourier Coefficients: Linial, Kushilevitz, Mansour, JACM ’93.
JL Lemma: Johnson, Lindenstrauss, ’84.
Sketches: Alon, Matias, Szegedy, JCSS ’99. Feigenbaum, Kannan, Vishwanathan, Strauss, FOCS ’99. Indyk, FOCS ’00.
Roads not taken
(but are relevant to synopsis)
Property Testing
Weighted sampling and SVD
Median Finding
Sampling based estimators