Offline, Stream and Approximation Algorithms for Synopsis Construction
Sudipto Guha, University of Pennsylvania
Kyuseok Shim, Seoul National University


VLDB 2005: A tutorial on synopsis construction algorithms

About this Tutorial

Information is incomplete and could be inaccurate

Our presentation reflects our understanding which may be erroneous


Synopses Construction

Where is the life we have lost in living?
Where is the wisdom we have lost in knowledge?
Where is the knowledge we have lost in information?

T. S. Eliot, from The Rock.

Routers
Sensors
Web
Astronomy and sciences

Too much data, too little time.


The idea

To see the world in a grain of sand…

Broad characteristics of the data
Compression
Dimensionality reduction
Approximate query answering
Denoising, outlier detection and a broad array of signal processing


What is a synopsis ?

Hmm. Any “shorthand” representation

Clustering! SVD!

In this tutorial we will focus on signal/time series processing


The basic problem

Formally, given a signal X and a dictionary {ψ_i}, find a representation F = Σ_i z_i ψ_i with at most B non-zero z_i, minimizing some error which is a function of X − F.

Note: the above extends to any dimension.
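As a minimal sketch of this problem statement, assume the simplest dictionary, the standard basis (each dictionary vector is a unit vector), and the ℓ2 error. Under that assumption the best B-term representation simply keeps the B largest-magnitude values. The function name `best_b_terms` and the sample signal are illustrative and not taken from the tutorial.

```python
import math


def best_b_terms(x, b):
    """Keep the b largest-magnitude entries of x (standard-basis dictionary)."""
    # Indices sorted by decreasing |x_i|; the first b carry the non-zero z_i.
    keep = sorted(range(len(x)), key=lambda i: -abs(x[i]))[:b]
    f = [0.0] * len(x)
    for i in keep:
        f[i] = float(x[i])
    # l2 error ||X - F||_2 of the representation.
    err = math.sqrt(sum((xi - fi) ** 2 for xi, fi in zip(x, f)))
    return f, err


signal = [12, 10, 2, 8, 14, 28, 16]
approx, l2_error = best_b_terms(signal, 2)
```

Other dictionaries (Haar wavelets, histogram buckets) change how the B terms are chosen, which is the subject of the rest of the tutorial.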


Many issues

What is the dictionary?
Which B terms?
What is the error?
What are the constraints?


Many issues

What is the dictionary?
  Set of vectors
  Maybe a basis
Which B terms?
What is the error?
What are the constraints?

Top K


Many issues

What is the dictionary?
  Set of vectors
  Maybe a basis
Which B terms?
What is the error?
What are the constraints?

Haar Wavelets

Also Fourier, Polynomials,…


Many issues

What is the dictionary?
  Set of vectors
  May not be a basis
  Histograms: there are (n choose 2) vectors, but since we impose a non-overlapping restriction we get a unique representation.
Which B terms?
What is the error?
What are the constraints?


Many issues

What is the dictionary?
Which B terms?
  First B? Best B?
  Why should we choose the first B?
    1. B vs 2B numbers
    2. Also …
What is the error?
What are the constraints?


Approximation theory

A discipline of mathematics associated with the approximation of functions.

Same as our problem.

Linear theory (Parseval, 1800; over two centuries old)
Non-linear theory (Schmidt 1909, Haar 1910)

Is it relevant? Yes. However, the mathematical treatment has been “extremal”, i.e., how does the error change as a function of B, and is that bound tight?

Note: a yes answer does not say anything about “given this signal, is that the best we can do?”


Many issues

What is the dictionary?
Which B terms?
What is the error?
  This controls which B.
  ||X-F||_2 is most common, used all over mathematics.
  ||X-F||_1 and ||X-F||_∞ are also useful.
  Weights. Relative error of approximation: approximating 1000 by 1010 is not so bad; approximating 1 by 11 is not too good an idea.
What are the constraints?


Many issues

What is the dictionary?
Which B terms?
What is the error?
What are the constraints?
  Input? Stream, stream of updates, …
  Space, time, precision and range of values (for z_i in the expression F = Σ_i z_i ψ_i)


In this tutorial

Histograms & Wavelets

Will focus on Optimal, Approximation and Streaming algorithms

How to get one from the other! Connections to top K and Fourier.


I. Histograms.


VOpt Histograms

Let's start simple.
Given a signal X, find a piecewise constant representation H with at most B pieces minimizing ||X-H||_2.

Jagadish, Koudas, Muthukrishnan, Poosala, Sevcik, Suel, 1998

Consider one bucket. The mean is the best value.
A natural dynamic programming formulation.
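The claim that the mean is the best value for one bucket follows from a one-line calculation. The symbols m (number of values in the bucket) and h (the value stored for the bucket) are introduced here only for illustration.

```latex
E(h) = \sum_{i=1}^{m} (x_i - h)^2, \qquad
\frac{dE}{dh} = -2 \sum_{i=1}^{m} (x_i - h) = 0
\;\Longrightarrow\;
h = \frac{1}{m} \sum_{i=1}^{m} x_i .
```

Since the second derivative is 2m > 0, this stationary point is the minimum.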


An Example Histogram

Data Distribution

Location (i)     1   2   3   4   5   6   7
Value (Xi)      12  10   2   8  14  28  16

V-Optimal Histogram

Range           [1,4]  [5,5]  [6,6]  [7,7]
Representative    8     14     28     16


Idea: VOpt Algorithm

Within “step/bucket”: Mean is the best.

Assume that the last bucket is [j+1,n].

What can we say about the remaining k-1 buckets?

[Figure: the range 1..n split at j; the prefix [1, j] costs OPT[j,k-1] and the last bucket [j+1, n] costs SQERR[j+1,n].]

They must also be optimal for the range [1, j] with (k-1) buckets! Dynamic Programming!!


Idea: VOpt Algorithm

A dynamic programming algorithm was given to construct the V-optimal histogram.

OPT[n,k] = min over 1 ≤ j < n of { OPT[j,k-1] + SQERR[(j+1)..n] }

OPT[j,k]: the minimum cost of representing the set of values indexed by [1..j] by a histogram with k buckets.

SQERR[(j+1)..n]: the sum of squared errors of representing the values from (j+1) to n by their mean.


The DP-based VOpt Algorithm

Initialize OPT[i,1] = SQERR[1,i] and every other entry to ∞.

for i = 1 to n do
  for k = 2 to B do
    for j = 1 to i-1 do    (split point between the (k-1)-bucket histogram and the last bucket)
      OPT[i,k] = min{ OPT[i,k], OPT[j,k-1] + SQERR[j+1,i] }

We need O(Bn) entries for the table OPT.
Each entry OPT[i,k] takes O(n) time, provided SQERR[j+1,i] can be computed in O(1) time.
O(Bn) space and O(Bn²) time.

[Figure: the OPT table, with n on one axis and B on the other.]
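The recurrence can be sketched in Python as follows. This is a minimal illustration rather than the authors' code: the function name `vopt`, the 0-based table layout and the "at most B buckets" convention are assumptions. The inner loop grows the last bucket leftwards, so its squared error is updated in O(1) per step without any extra arrays.

```python
def vopt(x, num_buckets):
    """Return (error, buckets) of an optimal histogram with at most num_buckets buckets."""
    n = len(x)
    inf = float("inf")
    # opt[k][i]: least squared error for the prefix x[0:i] using at most k buckets.
    opt = [[inf] * (n + 1) for _ in range(num_buckets + 1)]
    split = [[0] * (n + 1) for _ in range(num_buckets + 1)]
    for k in range(num_buckets + 1):
        opt[k][0] = 0.0
    for k in range(1, num_buckets + 1):
        for i in range(1, n + 1):
            s = sq = 0.0
            # The last bucket is x[j:i]; extend it one value to the left per step.
            for j in range(i - 1, -1, -1):
                s += x[j]
                sq += x[j] * x[j]
                cost = opt[k - 1][j] + (sq - s * s / (i - j))
                if cost < opt[k][i]:
                    opt[k][i] = cost
                    split[k][i] = j
    # Walk the split points back to recover the buckets (1-based, inclusive).
    buckets = []
    i, k = n, num_buckets
    while i > 0:
        j = split[k][i]
        buckets.append((j + 1, i))
        i, k = j, k - 1
    buckets.reverse()
    return opt[num_buckets][n], buckets


error, buckets = vopt([12, 10, 2, 8, 14, 28, 16], 4)
```

For this seven-value signal and B = 4 the sketch returns an error of 56 with buckets [1,4], [5,5], [6,6], [7,7]. The three nested loops make the O(Bn²) running time visible directly.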


Computation of Sum of Squared Absolute Error in O(1) time

index 1 2 3 4

x 2 3 7 5

sum 2 5 12 17

sum(2,3) = x[2]+x[3] = sum[3]-sum[1]= 12-2 = 10


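$$SQERR[i,j] = \sum_{p=i}^{j} \Big( x_p - \frac{1}{j-i+1} \sum_{q=i}^{j} x_q \Big)^2 = \sum_{p=i}^{j} x_p^2 - \frac{1}{j-i+1} \Big( \sum_{p=i}^{j} x_p \Big)^2$$

Let $SUM[1,i] = \sum_{p=1}^{i} x_p$ and $SQSUM[1,i] = \sum_{p=1}^{i} x_p^2$.

Then,

$$SUM(i,j) = \sum_{p=i}^{j} x_p = SUM[j] - SUM[i-1], \qquad SQSUM(i,j) = \sum_{p=i}^{j} x_p^2 = SQSUM[j] - SQSUM[i-1]$$

Thus,

$$SQERR[i,j] = (SQSUM[1,j] - SQSUM[1,i-1]) - \frac{1}{j-i+1} \big( SUM[1,j] - SUM[1,i-1] \big)^2$$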
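A small Python sketch of the SUM and SQSUM arrays, using the four-value example x = 2, 3, 7, 5. The function names `build_prefix_sums` and `sqerr` are illustrative; index 0 holds the empty prefix so that positions stay 1-based as on the slides.

```python
from itertools import accumulate


def build_prefix_sums(x):
    # sums[i] = x_1 + ... + x_i and sqsums[i] = x_1^2 + ... + x_i^2, with entry 0 equal to 0.
    sums = [0] + list(accumulate(x))
    sqsums = [0] + list(accumulate(v * v for v in x))
    return sums, sqsums


def sqerr(sums, sqsums, i, j):
    """Squared error of bucket [i, j] (1-based, inclusive) around its mean, in O(1)."""
    total = sums[j] - sums[i - 1]
    total_sq = sqsums[j] - sqsums[i - 1]
    return total_sq - total * total / (j - i + 1)


x = [2, 3, 7, 5]
SUM, SQSUM = build_prefix_sums(x)
bucket_error = sqerr(SUM, SQSUM, 2, 3)  # values 3 and 7, mean 5
```

Here SUM[3] - SUM[1] = 12 - 2 = 10, and the bucket [2,3] has squared error (3-5)² + (7-5)² = 8.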


Analysis of VOpt Algorithm

O(n²B) time
O(nB) space
The space can be reduced (Wednesday)

Main question: the end use of a histogram is to approximate something. Why not find an “approximately optimal” (e.g., (1+ε)) histogram?


If you had to improve something ?

(1+ε) streaming, ssq: O(n) time, O(B/ε²) space
(1+ε) streaming: O(nB²/ε) time, O(B²/ε) space
O(n²B) time, O(n) space
Via wavelets, ssq: O(n) time, O(B²/ε²) space
Offline: O(n) time, O(n+B/ε) space
O(n²B) time, O(nB) space
(1+ε) streaming: O(n) time, O(B²/ε) space
Offline: O(n) time, O(B²/ε) space


Take 1:

For i = 1 to n do
  For k = 1 to B do
    For j = 1 to i-1 do    (split point for the last bucket)
      OPT[1…i, k] = min( OPT[1…i, k], OPT[1…j, k-1] + SQERR(j+1, i) )

As j increases:
  OPT[1..j,k-1] is increasing
  SQERR(j+1,i) is decreasing

Question: can we use the monotonicity for searching the minimum?


No

Consider a sequence of positive numbers y_1, y_2, …, y_n.
F(i) = Σ_{j≤i} y_j and G(i) = F(n) – F(i-1)

F(i): monotonically increasing … Opt[1..j,k-1]
G(i): monotonically decreasing … SQERR(j+1,i)

Ω(n) time is necessary to find min_i { F(i)+G(i) }

Open question: does it extend to Ω(n²) over the entire algorithm?


What gives ?

Consider a sequence of positive numbers y_1, y_2, …, y_n.

F(i) = Σ_{j≤i} y_j and G(i) = F(n) – F(i-1)
Thus, F(i) + G(i) = F(n) + y_i

Any i gives a 2-approximation to min_i { F(i) + G(i) }:

F(i) + G(i) = F(n) + y_i ≤ 2 F(n)

min_i { F(i) + G(i) } is at least F(n)


Round 1

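Use a histogram to approximate the function. Bootstrap!
Approximate the increasing function in powers of (1+δ).
The right end point is a (1+δ) approximation of the left end point.

[Figure: an interval whose left end point has value h and whose right end point has value at most (1+δ)h.]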


What does that do ?

Consider evaluating the function at the two end points.

Proof by picture.

h ≥ h’/(1+δ). Why? By construction.

≥ (the second inequality in the picture). Why? By monotonicity!


Therefore… the right hand point is a (1+δ) approximation! This holds for any point x in between.

OPT[x] + SQERR[x+1] ≥ OPT[a] + SQERR[b] ≥ OPT[b]/(1+δ) + SQERR[b] ≥ { OPT[b] + SQERR[b] } / (1+δ)

Are we done? Not quite yet. What happens for B > 2? We do not compute OPT[i,b] exactly!!

[Figure: an interval [a, b] with the increasing OPT curve (value h’ at the right end) and the decreasing SQERR curve.]


Zen and the art of histograms

Approximate the increasing function in powers of (1+δ).
The right end point is a (1+δ) approximation.

Prove by induction that the error is (1+δ)^B.

This tells us what δ should be (small); in fact, if we set δ = ε/2B then (1+δ)^B ≤ 1+ε.
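The choice of the per-level factor can be verified directly. The derivation below assumes 0 ≤ ε ≤ 1, a restriction the slide does not state explicitly.

```latex
(1+\delta)^{B} = \Big(1 + \frac{\epsilon}{2B}\Big)^{B}
\;\le\; e^{\epsilon/2}
\;\le\; 1 + \epsilon
\qquad \text{for } 0 \le \epsilon \le 1 .
```

The first step uses 1 + x ≤ e^x. The second holds because e^{ε/2} is convex, equals 1 at ε = 0, and is about 1.65 ≤ 2 at ε = 1, so it stays below the line 1 + ε on the whole interval.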


Complexity analysis

Number of intervals p ~ (B/ε) log n. Why?

c(1+δ)^(p-1) ≤ nR² and δ = ε/(2B)
R is the largest number in the data.
Assume R is polynomially bounded by n.

Running time ~ nB · (B/ε) log n

Why are we approximating the increasing function? Why not the decreasing one?


The first streaming model

The signal X is specified by xi arriving in increasing order of i

Not the most general model
But extremely useful for modeling time series data


Streaming

Need to store, for the end points a and b of each interval:

Σ_{i=1}^{a} x_i, Σ_{i=1}^{a} x_i², Σ_{i=1}^{b} x_i, Σ_{i=1}^{b} x_i²

Required space is (B²/ε) log n


VOpt Construction: O(Bn²)

[Jagadish et al.: VLDB 1998]
OPT(i,k) = min over 1 ≤ j < i of { OPT(j,k-1) + SQERR(j+1,i) }

[Figure: rows OPT[j,k-1] and OPT[j,k] over positions 1 to 10 (n = 10).]


AHIST-S: (1+ε) Approximation

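[Figure: rows AOPT[j,k-1] and AOPT[j,k] over positions 1..n; row k-1 is covered by P intervals; a and b are the end points of one interval and c is the start of the next.]

(1+δ)a ≥ b and (1+δ)a < c, where a, b, c stand for the AOPT[·,k-1] values at those positions

δ = ε/2B

P = O(Bε⁻¹ log n)

AOPT[n,k] = min over the P interval end points b_p of { AOPT[b_p,k-1] + SQERR[b_p+1,n] }

O(B²ε⁻¹ n log n) time and O(B²ε⁻¹ log n) space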
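The interval idea can be sketched as a one-pass Python routine. This is a hedged illustration of the technique, not the published AHIST-S code: the name `ahist_s`, the list layout and the "at most B buckets" convention are assumptions. For each k < B it keeps a list of intervals whose approximate errors grow by at most a factor (1+δ), and stores the running SUM and SQSUM only at interval end points.

```python
def ahist_s(stream, num_buckets, eps):
    """One-pass (1+eps)-approximate V-optimal error (sketch, assumes 0 < eps <= 1)."""
    delta = eps / (2 * num_buckets)
    # lists[k] holds intervals for k-bucket prefixes, each stored as
    # [err_at_start, end_index, err_at_end, SUM_at_end, SQSUM_at_end].
    lists = [[] for _ in range(num_buckets)]
    total = total_sq = 0.0
    count = 0
    answer = 0.0
    for value in stream:
        count += 1
        total += value
        total_sq += value * value
        # errs[k]: approximate error of the prefix [1..count] with at most k buckets.
        errs = [None, max(0.0, total_sq - total * total / count)]
        for k in range(2, num_buckets + 1):
            best = errs[k - 1]
            # Only the stored interval end points are tried as split points.
            for _, end, end_err, end_sum, end_sq in lists[k - 1]:
                s = total - end_sum
                last = max(0.0, (total_sq - end_sq) - s * s / (count - end))
                best = min(best, end_err + last)
            errs.append(best)
        for k in range(1, num_buckets):
            intervals = lists[k]
            if intervals and errs[k] <= (1 + delta) * intervals[-1][0]:
                # Still within a (1+delta) factor of the interval start: extend it.
                intervals[-1][1:] = [count, errs[k], total, total_sq]
            else:
                intervals.append([errs[k], count, errs[k], total, total_sq])
        answer = errs[num_buckets]
    return answer


data = [12, 10, 2, 8, 14, 28, 16]
approx_error = ahist_s(data, 4, 0.5)
```

The returned value is always the error of some real histogram, so it is never below the optimum, and the induction on the slides bounds it by (1+ε) times the optimum. On very short inputs such as this one the lists stay tiny and the result typically coincides with the exact answer.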


The overall idea

The natural DP table

The approximate table


Do s talk to us ?

DJIA data from 1901-1993

[Plot: execution time versus B]


Take 2: GK02

Sliding window streams
Potentially infinite data – interested in the last n only
Q: Suppose we constructed a histogram for [1..n] and now want it for [2..(n+1)]

The previous idea is dead on arrival. Consider 100,1,2,3,4,5,7,8,…


Formal problem

Maintain a data structure

Given an interval [a,b] construct a B bucket histogram for [a,b]

Compute on the fly

Generalizes the window! Generalizes VOpt when a=1,b=n


Reconsider the take 1

We are evaluating

Left to right, i.e.,

But we are still evaluating this guy !


A brave new world

Assume an O(n)-size buffer holds the x_i values.
The previous algorithm was:

Several issues:
1. Which values are necessary and sufficient?
2. We are not evaluating all values – what induction?


GK02: Enhanced (1+ε) Approximation

[Figure: rows AOPT[j,k-1] and AOPT[j,k] over positions 1..n, with an interval [a, b] among the P intervals.]

(1+δ)a ≥ z
(1+δ)a < z+1

P = O(Bε⁻¹ log n)

Lazy evaluation using binary search
O(B³ε⁻² log³ n) time and O(n) space
Pre-processing takes O(n) time – SUM and SQSUM


GK02: Enhanced (1+ε) Approximation

Creates all of the B interval lists at once.
The necessary values of AOPT[j,k] are computed recursively to find the intervals [a_p^j, b_p^j], where b_p^j is the largest z such that
  (1+δ) AOPT[a_p^j,k] ≥ AOPT[z,k] and (1+δ) AOPT[a_p^j,k] < AOPT[z+1,k]
Note that AOPT increases as z increases; thus, we can use binary search to find z.
The O(n)-space SUM and SQSUM arrays need to be maintained to allow the computation of SQERR(j+1,i) in O(1) time.

O(n + B³ε⁻² log³ n) time and O(n) space


Take 2 summary

O(n) space and O(n + B³ε⁻² log² n) time

Is that the best? Obviously no.


Take 3: AHIST-L-Δ

Suppose we knew Δ ≤ OPT ≤ 2Δ; then…
Instead of powers of (1+ε/B), use additive terms of εΔ/(2B); then each list has O(B/ε) intervals.
Time is O(B³ε⁻² log n).
To get Δ?
  2-approximation: ε = O(1)
  A binary search: O(log n)
  Thus, O(B³ log n · log n)

Overall O(n + B³(ε⁻² + log n) log n) time and O(n + B²/ε) space.


Take 4: AHIST-B

Consider the take 4 algorithm. How to stream it ?

[Figure labels: M; On the new part; Overall]


Not done yet

First find an ε = O(1) approximation, then proceed back and refine.

[Figure labels: 1+r; k; k-1]


The running space-time

B · (# insertions) · (log M) · (log of the list length), where the length of a list is O(Bε⁻¹ log n)

Space
Who cares and why?


Asymptotics

For fixed B and ε we can compute a (1+ε)-approximate piecewise constant representation in

O(n log log n) time and O(log n) space, or

O(n) time and O(log n log log n) space.

Extends to degree-d polynomials: space increases by O(d) and time is O(nd + d³…)


Our friendly Running time

[Plot: execution time versus B]


Our friendly Error

[Plot: (Error – VOPT)/VOPT versus B]


What you analyze is what you get

[Plot: execution time versus n]


Questions ?


For general error measure, IF…

The error of a bucket only depends on the values in the bucket.

The overall error function is the sum of the errors in the buckets.

The data can be processed in O(T) time per item such that in O(Q) time we can find the error of a bucket, storing O(P) info.

The error (of a bucket) is a monotonic function of the interval.

The value of the maximum and the minimum nonzero error is polynomially bounded in n.


Then…

Optimum histogram in O(nT + n²(B+Q)) time and O(n(P+B)) space

(1+ε)-approximation in

O(nT + nQB²ε⁻¹ log n) time and O(PB²ε⁻¹ log n) space,

O(nT + QB³(log n + ε⁻²) log n) time and O(nP) space,

O(nT) time and space O(PB²ε⁻¹ log n + (QB/T) [Bε⁻¹ log² (Bε⁻¹ log n) + log n log log n])


Splines and piecewise polynomials

Instead of

If we wanted

Or maybe…


The overall idea

If we want to represent {x_{a+1}, …, x_b} by p_0 + p_1(x − x_a) + p_2(x − x_a)² + …

The solution is as above…

We need O(d) times more space than before, and we need to solve the system. This means an increase by a factor of O(d³) in time.


Another useful example: Relative error

Issue with global measures: Estimating 10 by 20 and 1000 by 1010 has the same effect

The above is ok if we are querying for “1000” a 1000 times and 10 times for “10” (point queries and VOPT measure)

But consider approximating a time series. We may be interested in per point guarantees.


Sum of Squared Relative Error for a Bucket

Relative error for a bucket (s_r, e_r, x_r):

$$ERR_{SQ}(s_r, e_r) = \min_{x_r} \sum_{i=s_r}^{e_r} \frac{(x_i - x_r)^2}{\max\{c, x_i\}^2} = \min_{x_r} \big( A x_r^2 - 2 B x_r + C \big)$$

Since A > 0, it is minimized when x_r = B/A.
The minimum value is C − B²/A.
If the aggregated sums of A, B and C are stored, ERR_SQ(i,j) can be computed in O(1) time.

The optimal histogram can be constructed in O(Bn²) time…
Approximation algorithms follow…
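A short Python sketch of the bucket computation. Expanding the sum gives A = Σ 1/w_i², B = Σ x_i/w_i² and C = Σ x_i²/w_i² with w_i = max{c, |x_i|}; taking the magnitude |x_i| in the denominator is an assumption made here so that negative values are handled. The function names are illustrative, and c must be positive.

```python
def relative_sq_error(values, c):
    """Best representative and minimum sum of squared relative errors for one bucket."""
    # Each term is ((x - xr) / max(c, |x|))^2, so the total is A*xr^2 - 2*B*xr + C.
    a = sum(1.0 / max(c, abs(x)) ** 2 for x in values)
    b = sum(x / max(c, abs(x)) ** 2 for x in values)
    cc = sum(x * x / max(c, abs(x)) ** 2 for x in values)
    best_xr = b / a
    return best_xr, cc - b * b / a


def direct_error(values, c, xr):
    """The same error evaluated term by term, for checking the closed form."""
    return sum(((x - xr) / max(c, abs(x))) ** 2 for x in values)


bucket = [10, 1000]
xr, err = relative_sq_error(bucket, 1.0)
```

For the bucket {10, 1000} the best representative lands close to 10 rather than near the arithmetic mean, because an error on the small value is weighted far more heavily. Keeping running totals of the three sums gives the O(1) bucket error mentioned above.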


Maximum Error and the ℓ∞ metric


Maximum Error Histograms

A bucket (s_r, e_r, x_r) over the numbers {x_1, x_2, …, x_n}, where
  s_r: starting position
  e_r: ending position
  x_r: representative value

Maximum error is given by

$$ERR_M(s_r, e_r) = \min_{x_r} \max_{i \in [s_r, e_r]} |x_i - x_r|$$

Maximum relative error is defined as

$$ERR_M(s_r, e_r) = \min_{x_r} \max_{i \in [s_r, e_r]} \frac{|x_i - x_r|}{\max\{c, |x_i|\}}$$


Maximum Error of a bucket

Given numbers {x_1, x_2, …, x_n}, the maximum error is given by Err_M = min_{x_r} max_i |x_i – x_r|

What is the best x_r?

(x_min + x_max)/2


Maximum Relative Error of a set

Given a set of numbers {x_1, x_2, …, x_n}
  max: the maximum of {x_1, x_2, …, x_n}
  min: the minimum of {x_1, x_2, …, x_n}
  c: a sanity constant

The error is some function of c, max and min.
E.g., when c ≤ min ≤ max the error is (max − min)/(max + min).
The optimal maximum relative error for a bucket can be computed in O(1) time.


The Naïve Optimal Algorithm

for i := 1 to n do {
  OPTM[i,1] := ERRM(1,i)
  for k := 2 to B do {
    max := -∞; min := ∞; OPTM[i,k] := ∞
    for j := i-1 downto 1 do {
      if (max < x[j+1]) max := x[j+1]
      if (min > x[j+1]) min := x[j+1]
      OPTM[i,k] := min{ OPTM[i,k], max( OPTM[j,k-1], ERRM(j+1,i) ) }
    }
  }
}

ERRM(j+1,i) can be obtained in O(1) time from the running max and min.
O(Bn) space and O(Bn²) time optimal algorithm.


An Improved Optimal Algorithm

OPTM[i,k] := min over j of { max( OPTM[j,k-1], ERRM(j+1,i) ) }

Observations (as j increases):
  OPTM[j,k-1] is an increasing function
  ERRM(j+1,i) is a decreasing function

To compute min_x { max( F(x), G(x) ) } where F(x) is non-decreasing and G(x) is non-increasing:

We can perform a binary search for the value of x such that F(x) ≥ G(x) and F(x-1) < G(x-1).

The minimum is min{ G(x-1), F(x) }.
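The binary search can be sketched as follows, with F and G given as equal-length lists. The function name `min_of_max` and the sample values are illustrative only.

```python
def min_of_max(f, g):
    """Return min over x of max(f[x], g[x]) with O(log n) probes.

    Assumes f is non-decreasing and g is non-increasing.
    """
    lo, hi = 0, len(f) - 1
    # Find the first index where f[x] >= g[x]; lo stops at the last index if there is none.
    while lo < hi:
        mid = (lo + hi) // 2
        if f[mid] >= g[mid]:
            hi = mid
        else:
            lo = mid + 1
    best = max(f[lo], g[lo])
    if lo > 0:
        # To the left of the crossing the maximum is g, which is smallest at lo - 1.
        best = min(best, max(f[lo - 1], g[lo - 1]))
    return best


F = [0, 2, 20, 56, 85]
G = [90, 70, 30, 10, 0]
result = min_of_max(F, G)
```

Here the crossing is at the fourth position, and the answer is min{G(x-1), F(x)} = min{30, 56} = 30.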


An Improved Optimal Algorithm

OPTM[i,k] := min over j of { max( OPTM[j,k-1], ERRM(j+1,i) ) }

We can reduce the innermost loop of the naïve algorithm to O(log n) evaluations.

However, ERRM(j+1,i) cannot be computed in O(1) time any more.

Using an interval tree, we can compute the min and max values for [j+1, i], i.e. ERRM(j+1,i), in O(log n) time.

Thus, our improved algorithm takes O(Bn log² n) time with O(Bn) space.


An Interval Tree Example

[1,8]
[1,4] [5,8]
[1,2] [3,4] [5,6] [7,8]
[1,1] [2,2] [3,3] [4,4] [5,5] [6,6] [7,7] [8,8]

Min Interval: [2,4]
[Figure labels: decomposeLeft, decomposeRight]

The steps of decomposing [2,4] with an interval tree
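One way to realise such a tree in code is an array-based segment tree that stores the min and max of every node. This is a sketch of that standard structure, not necessarily the exact tree on the slide; the class name `MinMaxTree` and the sample values are assumptions, and positions are 0-based.

```python
class MinMaxTree:
    """Array-based segment tree answering (min, max) over a range in O(log n)."""

    def __init__(self, values):
        self.n = len(values)
        # Leaves live at indices n..2n-1; index 0 is unused.
        self.lo = [float("inf")] * self.n + list(values)
        self.hi = [float("-inf")] * self.n + list(values)
        for i in range(self.n - 1, 0, -1):
            self.lo[i] = min(self.lo[2 * i], self.lo[2 * i + 1])
            self.hi[i] = max(self.hi[2 * i], self.hi[2 * i + 1])

    def query(self, left, right):
        """Min and max of values[left..right], 0-based inclusive."""
        mn, mx = float("inf"), float("-inf")
        l, r = left + self.n, right + self.n + 1
        while l < r:
            if l & 1:  # l is a right child: take it and move past it
                mn, mx = min(mn, self.lo[l]), max(mx, self.hi[l])
                l += 1
            if r & 1:  # r - 1 is a left child: take it
                r -= 1
                mn, mx = min(mn, self.lo[r]), max(mx, self.hi[r])
            l //= 2
            r //= 2
        return mn, mx

    def max_error(self, left, right):
        """Best achievable maximum error of one bucket: (max - min) / 2."""
        mn, mx = self.query(left, right)
        return (mx - mn) / 2


tree = MinMaxTree([12, 10, 2, 8, 14, 28, 16, 5])
low, high = tree.query(1, 3)  # positions 2..4 in the slide's 1-based numbering
```

Querying positions 2..4 visits the leaf for position 2 and the node covering positions 3..4, mirroring the decomposition of [2,4] above.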


Consider another solution

Make the first bucket as large as possible

i.e. push the boundary right E.g. in the figure we can….

As long as the max and min is same…

Why will we have to stop ?


Consider another solution (2)

In this example we cannot…

But maybe the error comes from a different bucket!

Here's one idea: given an i, find Err[1,i]
If i is small, Err[1,i] $\le$ OPT
If i is large, Err[1,i] $\ge$ OPT
How?

By binary search!

Observe that given an error $\Delta$, it is easy to check whether the error can be realized by B buckets


How?

Assume that given an interval [a,b] we can find the min and max, and therefore Err[a,b]
With O(n) time and space preprocessing, we can find Err[a,b] in O(log n) time (interval tree)

Check[p,q,b,$\Delta$]:
If p > q (for b $\ge$ 0), we are done. Otherwise,
find mid s.t. Err[p,mid] $\le \Delta$ and Err[p,mid+1] $> \Delta$, then call Check[mid+1,q,b-1,$\Delta$]

Running time: $O(B \log^2 n)$
Binary search: log n * log n (to find min and max for Err)
Invocations of Check: B times
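The Check procedure can be sketched as below. To keep the example short it swaps the interval tree for a sparse table (O(n log n) preprocessing, O(1) lookups), and it assumes Err[a,b] = (max - min) / 2 with 0-indexed positions; all function names are hypothetical.

```python
def build_sparse(values, op):
    """table[j][i] = op over values[i .. i + 2**j - 1]."""
    table = [list(values)]
    j = 1
    while (1 << j) <= len(values):
        prev, half = table[-1], 1 << (j - 1)
        table.append([op(prev[i], prev[i + half])
                      for i in range(len(values) - (1 << j) + 1)])
        j += 1
    return table

def query(table, op, a, b):
    """op over values[a..b], inclusive, via two overlapping power-of-two blocks."""
    k = (b - a + 1).bit_length() - 1
    return op(table[k][a], table[k][b - (1 << k) + 1])

def make_err(values):
    mins, maxs = build_sparse(values, min), build_sparse(values, max)
    def err(a, b):
        return (query(maxs, max, a, b) - query(mins, min, a, b)) / 2
    return err

def check(err, p, q, b, delta):
    """True if positions p..q fit in at most b buckets, each with error <= delta."""
    while p <= q:
        if b == 0:
            return False
        # Binary search for the largest mid with err(p, mid) <= delta.
        lo, hi = p, q
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if err(p, mid) <= delta:
                lo = mid
            else:
                hi = mid - 1
        p, b = lo + 1, b - 1
    return True
```

The greedy choice is safe because Err[p,mid] can only grow as mid moves right, so extending each bucket as far as possible never hurts the remaining buckets.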


Now for the original problem

By binary search, find the largest s such that, with $\Delta$ = Err[1,s] and $\Delta'$ = Err[1,s+1], Check[1,n,B-1,$\Delta$] = false and Check[1,n,B-1,$\Delta'$] = true

Now OPT = $\Delta'$ or the best (B-1)-bucket error of [s+1,n]

A recursive algorithm!
$T(B) = \log n \cdot B \log^2 n + T(B-1) \approx O(B^2 \log^3 n)$ !!
(the $B \log^2 n$ factor is the cost of one Check[] call)


Summary

In $O(n + B^2 \log^3 n)$ time and O(n) space we can find the optimum error.

What do we do for a stream, or with less than O(n) space?

Approximate, using some of the old ideas…


Short break !

When we return

• Range Query Histograms

• Wavelets
  • Optimum synopsis
  • Connection to Histograms

• Overall ideas and themes


Range Query Histograms


A more synopsis structure

Instead of estimating the value at a point we are interested in sum of the values in intervals/ranges.

Clearly, very useful.
Clearly we need a new optimization. E.g., not useful in this example


A more difficult problem

Only special cases solved (satisfactorily)

Hierarchies:
  Prefix ranges: all ranges of the form [1,j] as j varies
  Complete binary ranges
  General hierarchies

Uniform ranges: all ranges


Status Range Query

Caveat:

Against a restricted Opt which stores the average of the values in a bucket.


The uniform case

Consider a sequence $X = \{0, x_1, x_2, \ldots, x_n\}$

Define the operators: $\mathcal{S}(g)[i] = \sum_{j \le i} g[j]$ is the prefix sum


Unbiased

Suppose H is a histogram such that $F = \mathcal{S}(X-H)$ satisfies $\sum_i F[i] = 0$

Or think of $\sum_i \sum_{r<i} (X[r]-H[r]) = 0$

Claim: the error of using H to answer range queries for X is twice the error of using $\mathcal{S}(H)$ to answer point queries about $\mathcal{S}(X)$!


The main idea

Define $G[i] = \sum_{r<i} (X[r] - H[r]) = \mathcal{S}(X)[i] - \mathcal{S}(H)[i]$

Now $\sum_i G[i] = 0$ if H is unbiased
Pick a random element u: Expected[ G[u] ] = 0

Pick two random elements u, v:
Expected[ $(G[u]-G[v])^2$ ] = expected error of using H to answer range queries for X
But that is equal to 2 * Expected[ $G[u]^2$ ]
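The factor of two can be checked numerically. The sketch below assumes u and v are drawn independently and uniformly from the positions (the slide only says "random"), and works directly with a vector G of prefix-sum differences; the function name is hypothetical. Exact fractions avoid any rounding doubt.

```python
from fractions import Fraction

def range_vs_point_error(G):
    """Return (E[(G[u] - G[v])^2], E[G[u]^2]) for independent uniform u, v."""
    n = len(G)
    pair = sum((Fraction(gu) - gv) ** 2 for gu in G for gv in G) / (n * n)
    point = sum(Fraction(g) ** 2 for g in G) / n
    return pair, point
```

Expanding the square gives E[(G[u]-G[v])^2] = 2 E[G[u]^2] - 2 E[G[u]]^2, so the two sides differ by exactly a factor of 2 when the entries of G sum to zero, and not otherwise.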


A simple approximation

What we want is: [figure omitted] Hard

But we know how to get: [figure omitted, showing $\mathcal{S}(X)$ and $\mathcal{S}(H)$]

Piecewise linear histograms!


An easy trick

We can also find:
  A “buffer” of size 1 after each bucket
  Use it as a patch-up

2B buckets, same error as OPT
Approximation algorithms try to find the “continuous variant”


The Synopsis Construction Problem

Formally, given a signal X and a dictionary $\{\psi_i\}$, find a representation $F = \sum_i z_i \psi_i$ with at most B non-zero $z_i$, minimizing some error which is a function of X-F

In case of histograms the “dictionary” was the set of all possible intervals – but we could only choose a non-overlapping set.


The eternal “what if”

If the $\{\psi_i\}$ are “designed for the data” do we get a better synopsis?

Absolutely! Consider a Sine wave … Or any smooth fn.

Why though ?


Representations not piecewise const.

Electromagnetic signals are sine/cosine waves.

If we are considering any process which involves electromagnetic signals – this is a great idea.

These are particularly great for representing periodic functions.

Often these algorithms are found in DSP (digital signal processing) chips

A fascinating 300+ years of history in Math !


A slight problem …


Fourier is suitable for smooth “natural processes”

If we are talking about signals from man-made processes, clearly they cannot be natural (and hardly likely to be smooth) …

More seriously, discreteness and burstiness…


The Wavelet (frames)

Inherits properties from both worlds

The Fourier transform has all frequencies.

The wavelet considers only frequencies that are powers of 2, but the effect of each wave is limited (shifted)


Wavelets

What to do in a discrete world ?

The Haar Wavelets (1910) !


The Haar Wavelets

Best “energy” synopsis amongst all wavelets (we will see more later)

Great for data with discontinuities. A natural extension to discrete spaces

{1,-1,0,0,0,0…}, {0,0,1,-1,0,0,…},{0,0,0,0,1,-1,…}…

{1,1,-1,-1,0,0,0,0,…},{0,0,0,0,1,1,-1,-1,…}…


The Haar Synopsis Problem

Formally, given a signal X and the Haar basis $\{\psi_i\}$, find a representation $F = \sum_i z_i \psi_i$ with at most B non-zero $z_i$, minimizing some error which is a function of X-F

Let's begin with the VOPT error ($\|X-F\|_2^2$)


The Magic of Parseval (no spears)

The $\ell_2$ distance is unchanged by a rotation.
A set of basis vectors $\{\psi_i\}$ defines a rotation iff $\langle \psi_i, \psi_j \rangle = \delta_{ij}$, i.e., 1 if i = j and 0 otherwise

Redefine the basis (scale) s.t. $\|\psi_i\|_2 = 1$
Let the transform be W
Then $\|X-F\|_2 = \|W(X-F)\|_2 = \|W(X) - W(F)\|_2$

Now $W(F) = \{z_1, z_2, \ldots, z_n\}$ and so $\|W(X) - W(F)\|_2^2 = \sum_i (W(X)_i - z_i)^2$


What did we achieve ?

Storing the largest coefficients is the best solution.

Note that the fact $z_i = W(X)_i$ is a consequence of the optimization and IS NOT a specification of the problem.

More on that later.
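The step from Parseval to "store the largest coefficients" can be written out as a short derivation. Here S denotes the set of indices with non-zero $z_i$, a symbol introduced only for this sketch.

```latex
\begin{align*}
\|X - F\|_2^2
  &= \sum_i \bigl(W(X)_i - z_i\bigr)^2
   = \sum_{i \in S} \bigl(W(X)_i - z_i\bigr)^2 + \sum_{i \notin S} W(X)_i^2 \\
  &\ge \sum_{i \notin S} W(X)_i^2, \qquad |S| \le B .
\end{align*}
```

Equality holds when $z_i = W(X)_i$ for every $i \in S$, and the remaining sum is smallest when S holds the B coefficients of largest magnitude. This is why the stored values coincide with the transform coefficients for the $\ell_2$ error, even though the problem statement never requires it.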


What is the best algorithm ?

How to find the largest B coefficients of the set {x1,x2,…} ?

Cascade Algorithm. Recall the hierarchical nature.


Cascade algorithm ?

Given a, b, represent them as (a-b) and (a+b)
Divide by sqrt(2) so that the sum of squares etc…
Running time O(n)

Example: 1 4 5 6
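A minimal sketch of the cascade on the example 1 4 5 6, assuming the input length is a power of two and the coefficient ordering shown in the docstring; the function names are hypothetical.

```python
import math

def haar_cascade(x):
    """Orthonormal Haar transform of a list whose length is a power of two.

    Returns [overall coefficient, then detail coefficients from the
    coarsest level to the finest].
    """
    sums = [float(v) for v in x]
    details = []
    while len(sums) > 1:
        pairs = list(zip(sums[0::2], sums[1::2]))
        # Each pair (a, b) becomes a scaled difference and a scaled sum.
        level = [(a - b) / math.sqrt(2) for a, b in pairs]
        sums = [(a + b) / math.sqrt(2) for a, b in pairs]
        details = level + details
    return sums + details

def top_b_error(coeffs, b):
    """Squared l2 error when only the b largest-magnitude coefficients are kept."""
    dropped = sorted((c * c for c in coeffs), reverse=True)[b:]
    return sum(dropped)
```

For 1 4 5 6 the coefficients are 8, -3, -3/sqrt(2), -1/sqrt(2), whose squares sum to 78 = 1 + 16 + 25 + 36. Each level halves the number of sums, so the total work is n/2 + n/4 + … = O(n).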


Surfing Streams

Notice that once the left half is done we only need to remember the

A stream algorithm is natural

1 4 5 6


Surfing Streams

Have an auxiliary structure that maintains the top B of a set of numbers

Where else have you seen this?

Reduce-Merge paradigm
Also used in clustering data streams


In summary

Given a series $\{x_1, x_2, \ldots, x_i, \ldots, x_n\}$ in increasing order of i we can find (maintain) the largest B coefficients in O(n) time and O(B + log n) space

Ok, but only for $\|X-F\|_2$
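One way to realize this bound is sketched below: a stack of at most log n + 1 partial sums plays the role of the cascade, and a min-heap of size B is the auxiliary top-B structure. It assumes the stream length is a power of two and, for brevity, keeps only coefficient values (a real synopsis would also record each coefficient's index); the function name is hypothetical.

```python
import heapq
import math

def stream_top_b(stream, b):
    """One pass over the stream; returns the b largest-magnitude Haar coefficients."""
    stack = []   # (level, scaled sum) of completed blocks, at most log n + 1 entries
    heap = []    # min-heap of (|coefficient|, level, coefficient), at most b entries

    def offer(level, coeff):
        if b <= 0:
            return
        item = (abs(coeff), level, coeff)
        if len(heap) < b:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            heapq.heapreplace(heap, item)

    for value in stream:
        level, total = 0, float(value)
        # Merge equal-sized neighbouring blocks, emitting one detail coefficient each.
        while stack and stack[-1][0] == level:
            _, left = stack.pop()
            offer(level + 1, (left - total) / math.sqrt(2))
            total = (left + total) / math.sqrt(2)
            level += 1
        stack.append((level, total))
    for level, total in stack:   # a single entry when the length is a power of two
        offer(level, total)
    return sorted((c for _, _, c in heap), key=abs, reverse=True)
```

Each arriving value triggers a constant amortized number of merges, and every emitted coefficient costs one heap operation, so the O(n) figure above holds up to the heap's log B factor in this simple version.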


What do we do in presence of multiple dimensions/measures ?

Use multi-dim transforms
Use many 1-D transforms

Strategy: use a flexible scheme that stores the index and a bitmap indicating which measures are stored.

Extended Histograms

Indices are large.

Correlations


How to solve it ?

For the basic 1-D problem we need to choose the largest B coefficients

Use Parseval to transform error of data to choosing/not choosing coefficients

Here we have “bags”
We can choose coefficient j with bitmap
  0100 using H+S space
  0101 using H+2S space
  1111 using H+4S space


Is 0101 better than 1100 ?

Subproblem: given the fact that we have settled on choosing 2 coefficients for j, which 2?

It is the largest 2 again!
Basically we can choose a set of indices j and decide how many coefficients we choose for each j

What does this remind you of ?


Knapsack

Each item j is available with M different “versions”.

Cost of the rth version is H+rS. The profit is an increasing function of r.

Can choose only one version.
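The knapsack view can be sketched as an exact dynamic program over an integer space budget. This is an illustration only: it assumes H (header) and S (slot) are integers, takes the profit of a version to be the retained energy (sum of the r largest squared coefficients), and uses hypothetical names throughout.

```python
def best_extended_synopsis(coeffs, budget, header, slot):
    """coeffs[j] holds the M coefficient values (one per measure) at index j.

    Version r >= 1 of item j costs header + r * slot and keeps the r largest
    squared values. Returns the maximum total profit within the budget.
    """
    best = [0.0] * (budget + 1)          # best[c] = max profit using space <= c
    for values in coeffs:
        squares = sorted((v * v for v in values), reverse=True)
        versions, profit = [], 0.0
        for r, sq in enumerate(squares, start=1):
            profit += sq
            versions.append((header + r * slot, profit))
        new = best[:]                    # version 0: skip this index entirely
        for c in range(budget + 1):
            for cost, gain in versions:  # at most one version per item
                if cost <= c and best[c - cost] + gain > new[c]:
                    new[c] = best[c - cost] + gain
        best = new
    return best[budget]
```

With two indices holding (3, 0, 1, 0) and (2, 2, 0, 0), H = 2 and S = 1, a budget of 7 is best spent on one coefficient of the first index and two of the second.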


Strange roadbumps

Optimal profit + optimal error = total energy

The relationship does not carry over to approximation.

99 + 1 = 100. Approximating the profit 99 by 95 raises the error from 1 to 5, an increase of 400%

We will return to this.


Many questions

What do we do for other error measures ?

What is the connection with Histograms ?

Positives: Some direction Cascade algorithm Hierarchy of coefficients


Non l2 errors


Storing coefficients is suboptimal

Recall the example {1,4,5,6}
We want a 1-term summary and the error is $\ell_\infty$ (maximum error)
What do we store?

1 4 5 6

What is the final result?

{3.5,3.5,3.5,3.5}

What is the transform?

{7,0,0,0}

But the set of coefficients available is {8,?,?,?}
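The gap can be checked directly. The sketch below takes the error to be the maximum absolute error, the measure under which 3.5 is the best constant for {1,4,5,6}, and uses the orthonormal scaling in which a constant signal c of length n has the single coefficient c * sqrt(n); the variable names are hypothetical.

```python
def max_error(data, estimate):
    return max(abs(d - e) for d, e in zip(data, estimate))

data = [1, 4, 5, 6]
n = len(data)

# Restricted: keep the data's own first coefficient, 16 / sqrt(4) = 8.
stored = sum(data) / n ** 0.5
restricted = [stored / n ** 0.5] * n      # reconstructs to the mean, 4.0

# Unrestricted: choose the value freely; 7 reconstructs to the midrange, 3.5.
free = 7.0
unrestricted = [free / n ** 0.5] * n

restricted_error = max_error(data, restricted)
unrestricted_error = max_error(data, unrestricted)
```

Storing the available coefficient 8 gives a maximum error of 3, while the freely chosen 7 gives 2.5, so the best stored value is not one of the transform coefficients.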


What to do ?

Search where there is light.
  Restricted problem.
  Useful if the synopsis has more than one use.

Think outside the coefficients
  Probabilistic rounding
  Search (cleverly) over the whole space


The Best Restricted Synopsis

Maximum Error.

A value (at the leaf) is affected by only the ancestors.

# of ancestors = log n

Guess/try all subsets of the ancestor set! O(n) choices
Start bottom up and use a DP to choose the best B coefficients overall.

Works for a large number of error measures.


Analysis

At each internal node j we need to maintain the table

Error[j,Ancestor set,b]: the contribution to the minimum error by only the subtree rooted at j when using b or less coefficients (for the subtree)

Size of table $O(n^2 B)$;

Time ~ $O(n^2 B \log B)$ [depends on measure]
But we can do better.


Faster Restricted Synopsis

A better cut

Number of coefficients in a subtree is at most size+1

Size of the table storing Err[j,Ancestor Set,b]

Remains constant as we go up the levels!

Ancestor set decreases by 1
b takes twice as many values

$O(n^2)$ algorithm
We can also reduce the space to O(n)


Thinking beyond the coefficient

Probabilistic rounding
  Start from the coefficients.
  Randomly round most of them to 0
  A few are rounded to non-zero values
  E.g. set $z_i = \lambda$ with prob. $W(X)_i/\lambda$ and 0 otherwise

Has promise (correct expectation, variance)
Two issues:
  The quality is unclear (wrt the original optimization)
  The expected number of non-zero coefficients is B; the variance is large, so with reasonable prob. ~2B


More exploration reqd

Interestingly the method (as proposed) eliminates a region of search space

We can construct examples that the optimum lies in that range.

But it is an interesting method and likely (I/we are guessing) preserves more than one error measure simultaneously (multi-criterion optimization)


What is the optimum strategy

Consider the best set of coefficients $Z^* = \{z_1, z_2, \ldots, z_n\}$
“Nudge” them a bit by making them multiples of some $\delta$

The “extra error” is small (and a function of $\delta$)
In fact each point sees $\pm\,\delta \log n$

By reducing $\delta$ we can get a $(1+\epsilon)$ approximation


A straightforward idea

But we still need to find the solution

The ancestor set is unimportant – what is important is their combined effect.

Try all possible values (multiples of $\delta$, but we still need to fix the range)


The graphs – the data


The graphs … l1


Relative Error (small B), Relative l1


The times


What have we seen so far

Wavelet representation for l_2 error
  Streaming

Wavelet representation for non-l_2 error
  Restricted
  Unrestricted
  Stream


A return to histograms


Easy relationships

A B-bucket (piecewise constant) histogram can be represented by 2B log n Haar wavelet coefficients.

Why? Only the 2B boundary points matter

A B-term Haar wavelet synopsis can be represented by a 3B-bucket histogram.

Why? Each wavelet basis creates 3 extra pieces from 1 line
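The first relationship is easy to observe on a small example. The sketch below counts non-zero Haar coefficients of a piecewise constant signal, assuming a power-of-two length; it uses the unnormalized transform with exact fractions, which has the same zero pattern as the orthonormal one. The function name is hypothetical.

```python
from fractions import Fraction

def haar_nonzero_count(x):
    """Count non-zero Haar coefficients of a list whose length is a power of two."""
    avgs = [Fraction(v) for v in x]
    nonzero = 0
    while len(avgs) > 1:
        pairs = list(zip(avgs[0::2], avgs[1::2]))
        # The detail coefficient (a - b) / 2 is non-zero only if the halves differ.
        nonzero += sum(1 for a, b in pairs if a != b)
        avgs = [(a + b) / 2 for a, b in pairs]
    return nonzero + (1 if avgs[0] != 0 else 0)
```

A detail coefficient is non-zero only when its support straddles a bucket boundary, and each boundary lies in one support per level, so a 3-bucket histogram over 16 points has far fewer than 16 non-zero coefficients.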


Anything else ?

Totally!

We can use Wavelets to get (1+\epsilon)-approximate V-optimal histograms.

In fact the method has advantages…


Histograms, Take 5:

A B-term Histogram can be represented by cB log n wavelet terms.

What if we choose the largest cB log n wavelet terms?


Need not be good.

The best histogram has the cB log n wavelets “aligned” such that the result is B buckets.

The best cB log n coefficients are all over the place and give us 3cB log n buckets.

All hope is lost ?


If at first you don’t succeed…

We repeat the process and also keep the next cB log n coefficients …

No.

But notice that the “energy” drops. Energy = $\|X\|_2 = \|W(X)\|_2$

Basic intuition: If there were a lot of coefficients which were large then the best V-Opt histogram MUST have a large error.

Why?


The “robust” property

Look at $\|W(X)-W(H)\|_2 = \|X-H\|_2$

W(H) has cB log n entries
If W(X) has $cB\epsilon^{-2}\log n$ large entries ..


A strange idea in 1000 words

Consider the projection to the largest $cB\epsilon^{-2}\log n$ wavelet terms

Is … $\approx$ … ? [figures omitted]


No. But flatten the fn

[figure omitted: … $\approx$ X]


In fact

If we choose a large number, i.e. $(B \log n)^{O(1)}$, of coefficients then the boundary points of the coefficients are (approximately) good boundary points for a VOPT histogram.


The take away:

I’m ok, you’re ok. If I’m not ok then you’re not ok either.
An oft-repeated approximation paradigm

“if there are too many coefficients then my algorithm is doomed – but so is anyone else's, and therefore I am good”

“if there are not too many coefficients then we’re good”.


The Extended Wavelets in l2

We can store the largest coefficients

If there are too many coefficients which are large then optimum error is large.

Otherwise we repeatedly take out coefficients till taking out coefficients will not reduce the error any more.

DP on the set of coefficients taken out.


The Full Monty – update streams

So far we have been looking at X arriving as {x1,x2,…}

What happens when X is specified by a stream of updates ?

i.e., $(i, d_i)$ = change $x_i$ to $x_i + d_i$


Sketches :Stream Embeddings

Basically dimensionality reduction
To compute the histogram H of signal X

Compute embedding g(X) to fit the space

Compute H s.t. g(H) is close to g(X)


Linear Embeddings [JL Lemma ]

A is a random matrix drawn from a Gaussian distribution, of size $(\epsilon^{-2}\log n) \times n$, such that

$\|x\|_2 \le \|Ax\|_2 \le (1+\epsilon)\|x\|_2$

Too many elements in the matrix!

Use pseudorandom generators
p-stable distributions for $\ell_p$ where $p \in (0, 2]$


What it achieves

Computes the norm

[Figure: the sketch is the product of the matrix A and the vector x]

Increasing a coordinate adds the corresponding column of A to the sketch.
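The update rule can be sketched as follows. This is an illustration under loud assumptions: Python's `random.Random`, seeded per column, stands in for a pseudorandom generator with provable guarantees, the 1/sqrt(k) scaling is one common normalization, and the class and method names are hypothetical.

```python
import random

class LinearSketch:
    """Maintains A @ x for an implicit k-by-n Gaussian matrix A under updates."""

    def __init__(self, k, seed=0):
        self.k = k
        self.seed = seed
        self.sketch = [0.0] * k

    def column(self, i):
        # Regenerate column i of A on demand from (seed, i) instead of storing A.
        rng = random.Random(self.seed * 1_000_003 + i)
        return [rng.gauss(0.0, 1.0) / self.k ** 0.5 for _ in range(self.k)]

    def update(self, i, d):
        """Apply x[i] += d by adding d times column i to the sketch."""
        for row, a in enumerate(self.column(i)):
            self.sketch[row] += d * a

    def norm_estimate(self):
        """l2 norm of the sketch, used as the estimate of the l2 norm of x."""
        return sum(s * s for s in self.sketch) ** 0.5
```

Because the map is linear, the order of updates does not matter and a deletion simply subtracts the same column, which is what makes the sketch usable on update streams.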


Suppose we knew the intervals

The best histogram minimizes $\|X-H\|_2 \approx \|AX - AH\|_2$

AX is a vector, AH is a linear function of B values

We have a min sq. error program, solvable in polynomial time; more involved in the 1-norm.


Cannot do that

$\|X-H\|_2 = \|W(X) - W(H)\|_2 \approx \|AW(X) - AW(H)\|_2$

Idea:

Use the linear map to find the large number of Wavelet coefficients (top k problem using sketches)

Use similar ideas to Take 5 to get the final solution.


The return of the pink Fourier

Assuming x1,x2,…,xi,… arrive in increasing order of i, find/maintain the top k Fourier coefficients.

Use the strategy:
Assume that there are O(k log n) frequencies and try to find them.
If not, we are doomed and so is everyone. So we are ok.
For the 3rd time …


What about top k

Assuming x1,x2,…,xi,… are specified by a stream of updates find/maintain the top k values (all elements with frequency ~1/k or more).

Use the strategy:
Assume that there are O(k log n) elements and try to find them.
If not, we are doomed and so is everyone. So we are ok. Again!

Use group testing
20 questions, bit chasing: is a heavy item in the first half?
You can use norms, or you can use collisions (hashes).
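The "20 questions" idea can be sketched for the simplest case: a single item that holds more than half of the total weight. The sketch assumes all final weights are non-negative; finding k heavy items would additionally hash items into groups so that each group has one dominant item, which is not shown. The class name is hypothetical.

```python
class BitChaser:
    """Recovers one item holding more than half the total weight, if it exists."""

    def __init__(self, bits):
        self.bits = bits
        self.total = 0
        self.ones = [0] * bits    # ones[j] = weight of items whose j-th bit is 1

    def update(self, item, delta):
        """Process the stream update (item, delta)."""
        self.total += delta
        for j in range(self.bits):
            if (item >> j) & 1:
                self.ones[j] += delta

    def candidate(self):
        """Ask, bit by bit, which half holds the majority of the weight."""
        item = 0
        for j in range(self.bits):
            if 2 * self.ones[j] > self.total:
                item |= 1 << j
        return item
```

The structure uses one counter per bit, i.e. O(log n) counters, and supports deletions because every counter is a linear function of the weights.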


From optimization to learning

We are trying to “learn” a “pure” signal that has few coefficients…

A general paradigm.


The Meaning of Life

In Summary (high level):

Approximation is very useful for synopsis construction (the execution time speedups plus “the end use of synopsis is approximation only”)

Synopses are usually applied on large data. Asymptotic behaviour matters

The exact definition of the optimization is important. How natural is natural…

Few degrees of separation between the synopsis structures. They are related. They should be. But then we can use algorithmic techniques back and forth between them.


The Summary (contd.) In algorithm design terms

Most synopsis construction problems involve DP. Investigating how to change the DP to get approximation, space efficient algs., is often useful.

Search techniques (computational geometry) – search exponents first are useful.

What you analyze (carefully) is often what you would get asymptotically. The usual techniques we use for pruning etc., can be analyzed and shown to be better.

Reduce-Merge $\Rightarrow$ Streaming?

The top k in various disguises. Group testing matters.


What lies ahead

Ok. So 1-D histograms have good algos. 2-D?

NP-Hard.
Some approximation algorithms known.
Q: In linear time and sublinear space what can we do?

Sketch based results. Long way to go.


What lies ahead

So 1-D Haar wavelets have good algorithms (non-l2).

2-D?

Unlikely to be NP-Hard. Quasi-polynomial time, n^(log n), approximation algorithms known.

Q: In linear time and sublinear space, what can we do?

What lies ahead

So 1-D Haar wavelets have good algorithms (non-l2). Non-Haar? Daubechies. Multifractals.

Unlikely to be NP-Hard. Quasi-polynomial time, n^(log n), approximation algorithms known.

What can we do?

What lies ahead

All the update stream results are based on l2 error because of Johnson-Lindenstrauss (and some on lp for 0 < p ≤ 2).

What about other errors? They will require new techniques for streaming.
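To make the dependence on l2 concrete, the following is a minimal sketch of a Johnson-Lindenstrauss style linear sketch maintained under stream updates. The class name `L2Sketch` and its methods are hypothetical, and the projection matrix is stored explicitly only to keep the example short; a real streaming algorithm would generate the entries pseudorandomly to stay in sublinear space. The point is that the sketch is linear in the data, so insertions and deletions both reduce to adding a scaled column, and its squared norm concentrates around the squared l2 norm of the underlying vector.

```python
import random


class L2Sketch:
    """Gaussian random projection of a dim-length vector down to k numbers."""

    def __init__(self, dim, k, seed=0):
        rng = random.Random(seed)
        scale = 1.0 / k ** 0.5
        # ASSUMPTION: explicit k x dim matrix, for illustration only.
        self.rows = [[rng.gauss(0.0, 1.0) * scale for _ in range(dim)]
                     for _ in range(k)]
        self.sketch = [0.0] * k

    def update(self, index, delta):
        # Stream item "add delta to coordinate index" (delta may be negative).
        for r, row in enumerate(self.rows):
            self.sketch[r] += row[index] * delta

    def norm_squared(self):
        # Estimate of the squared l2 norm of the vector seen so far.
        return sum(x * x for x in self.sketch)
```

The guarantee is specific to l2 (sums of squares are preserved in expectation by such projections), which is why other error measures need different tools.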

Notes (not from the underground)

The VOPT definition: Poosala, Haas, Ioannidis, Shekita, SIGMOD ’96.

The VOPT histogram algorithm: Jagadish, Koudas, Muthukrishnan, Poosala, Sevcik, Suel, VLDB ’98.

Take 1: Guha, Koudas, Shim, STOC ’01.

Take 2: Guha, Koudas, ICDE ’02.

Take 3 & 4: Guha, Koudas, Shim, TODS ’05.

Take 5: Guha, Indyk, Muthukrishnan, Strauss, ICALP ’02.

Relative Error Histograms: Guha, Shim, Woo, VLDB ’04.

Maximum Error Histograms: Nicol, J. of Parallel and Distributed Computing, 1994; (Muthukrishnan, Khanna, Skiena, ICALP ’97); Guha, Shim (here), ’05.

More Notes

Range Query Histograms Muthukrishnan, Strauss, SODA, ‘03.

The Full Monty Gilbert, Guha, Indyk, Kotidis, Muthukrishnan, Strauss, STOC, ’02.

Parseval stuff Parseval, (margin of notebook ?), 1799.

Folklore sum of squares and l2 The mandala

Surfing Wavelets Gilbert, Kotidis, Muthukrishnan, Strauss, VLDB, ’01.

Probabilistic Synopsis Gibbons, Garofalakis, SIGMOD, ’02 (also TODS, ’04).

Maximum error (restricted version) Garofalakis, Kumar, PODS, ‘04.

Notes again

Faster Restricted Synopsis: Guha, VLDB ’05.

Unrestricted non-l2 error: Guha, Harb, KDD ’05 + new results.

Extended Wavelets: Deligiannakis, Roussopoulos, SIGMOD ’03; Guha, Kim, Shim, VLDB ’04.

Streaming Fourier approximation: Gilbert, Guha, Indyk, Muthukrishnan, Strauss, STOC ’02.

Learning Fourier Coefficients: Linial, Kushilevitz, Mansour, JACM ’93.

JL Lemma: Johnson, Lindenstrauss, ’84.

Sketches: Alon, Matias, Szegedy, JCSS ’99; Feigenbaum, Kannan, Viswanathan, Strauss, FOCS ’99; Indyk, FOCS ’00.

Roads not taken (but are relevant to synopsis)

Property Testing

Weighted sampling and SVD

Median Finding

Sampling based estimators