Optimal Approximations of the Frequency Moments of Data Streams Piotr Indyk David Woodruff.


Page 1

Optimal Approximations of the Frequency Moments of Data Streams

Piotr Indyk

David Woodruff

Page 2

The Streaming Model

Example stream: 7 1 1 3 7 3 4 …

Stream of elements $a_1, \ldots, a_n$, each in $\{1, \ldots, m\}$

Want to compute statistics on stream

Elements arranged in adversarial order

Algorithms given one pass over stream

Goal: Minimum space algorithm

Page 3

Frequency Moments [AMS96]

n = stream size, m = universe size

$f_i$ = # occurrences of item $i$

Why are frequency moments important?

$F_0$ = # of distinct elements

$F_1 = n$ = stream size

$F_2$ = self-join size

k-th moment: $F_k = \sum_{i=1}^{m} f_i^k$

Page 4

Applications

Estimating distinct elements with low space:

Estimate query selectivity to huge DB without sorting

Routers gather # distinct destinations

$F_2$ estimates size of self-joins:

First table: (Bob, x), (Alice, y), (Bob, z)

Second table: (Bob, a), (Alice, b), (Bob, c)

Join on name: (Alice, b, y), (Bob, a, x), (Bob, a, z), (Bob, c, x), (Bob, c, z)

$f_B^2 + f_A^2 = 4 + 1 = 5$

$F_k$ measures data skewness
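The Bob/Alice example can be checked directly. The sketch below is only an illustration: the variable names and the nested-loop join are assumptions made here, not part of the slides. It joins the two small tables on the name column and compares the number of result rows with the sum of squared name frequencies.

```python
from collections import Counter

# The two tables from the slide, as (name, value) pairs.
left = [("Bob", "x"), ("Alice", "y"), ("Bob", "z")]
right = [("Bob", "a"), ("Alice", "b"), ("Bob", "c")]

# Join on the name column; rows are (name, right value, left value).
join = [(n1, v2, v1) for (n1, v1) in left for (n2, v2) in right if n1 == n2]

# F2 of the name column: sum of squared frequencies.
name_counts = Counter(name for name, _ in left)
f2 = sum(f * f for f in name_counts.values())

# General join size: sum over names of (left count) * (right count).
right_counts = Counter(name for name, _ in right)
join_size = sum(name_counts[n] * right_counts[n] for n in name_counts)

print(len(join), f2, join_size)
```

The join size equals $F_2$ here because both tables have the same name frequencies (Bob twice, Alice once).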

Page 5

The Best Deterministic Algorithm

Trivial algorithm for Fk

Store/update $f_i$ for each item $i$, sum $f_i^k$ at end

Space = $O(m \log n)$: $m$ items $i$, $\log n$ bits to count $f_i$
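A minimal sketch of the trivial algorithm, assuming the stream fits in a Python list; the function name `exact_frequency_moment` is made up for this illustration. It keeps one counter per distinct item, which is exactly the $O(m \log n)$ space cost noted above.

```python
from collections import Counter

def exact_frequency_moment(stream, k):
    """Compute F_k exactly by storing one counter per distinct item."""
    counts = Counter()
    for item in stream:          # one pass over the stream
        counts[item] += 1        # store/update f_i
    return sum(f ** k for f in counts.values())  # sum f_i^k at the end

stream = [7, 1, 1, 3, 7, 3, 4]
for k in range(4):
    print(k, exact_frequency_moment(stream, k))
```

On this stream the items 7, 1 and 3 each occur twice and 4 occurs once, so $F_0 = 4$, $F_1 = 7$, $F_2 = 13$ and $F_3 = 25$.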

Negative Results [AMS96]:

Compute $F_k$ exactly $\Rightarrow$ $\Omega(m)$ space

Any deterministic alg. that outputs $X$ with $|F_k - X| < \varepsilon F_k$ must use $\Omega(m)$ space

What about randomized algorithms?

Page 6

Randomized Approx Algs for Fk

Randomized alg. $\varepsilon$-approximates $F_k$ if it outputs $X$ s.t.

$\Pr[|F_k - X| < \varepsilon F_k] > 2/3$

Previous work (table suppresses polylog(mn) factors)

Moment   Upper bound                                Lower bound

F0       $1/\varepsilon^2$ [FM85, GT02, BJKST02]    $1/\varepsilon^2$ [IW03, W04]

F1       1 -                                        1 -

F2       $1/\varepsilon^2$ [AMS96]                  $1/\varepsilon^2$ [W04]

Fk       $m^{1-1/(k-1)}$ [CK04, G04]                $m^{1-2/k}$ [BJKS02]

Page 7

Matching Upper Bound

Our Contribution:

For every k there is a 1-pass $\tilde{O}(m^{1-2/k})$ space algorithm to $\varepsilon$-approximate $F_k$

Additional Features:

1. Works even if we allow deletions, that is, stream of elements (i, +), (i,-)

2. Constant update time

Page 8

Techniques

Our “algorithm”

1. Divide frequencies into “buckets” 0, [1, 2), [2, 4), [4, 8), …, $[2^{i-1}, 2^i)$, …

2. Estimate size $s_i$ of each bucket

3. Output $X = \sum_i s_i 2^{ik}$
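To see what the bucketed output looks like when the bucket sizes are known exactly, here is a small sketch. The helper names are hypothetical, and the sizes are computed from full counts rather than estimated in small space, so this illustrates only the rounding step. Since every frequency in bucket $i$ lies in $[2^{i-1}, 2^i)$, rounding each one up to $2^i$ overshoots $F_k$ by a factor of at most $2^k$.

```python
from collections import Counter

def bucket_index(freq):
    """Index i such that freq lies in [2^(i-1), 2^i); requires freq >= 1."""
    return freq.bit_length()

def bucket_sizes(stream):
    """s_i = number of distinct items whose frequency falls in bucket i."""
    return Counter(bucket_index(f) for f in Counter(stream).values())

def bucketed_moment(sizes, k):
    """X = sum_i s_i * 2^(i*k), using the upper end of each bucket."""
    return sum(s * 2 ** (i * k) for i, s in sizes.items())

def exact_moment(stream, k):
    return sum(f ** k for f in Counter(stream).values())

stream = [7, 1, 1, 3, 7, 3, 4]
sizes = bucket_sizes(stream)        # three items in [2, 4), one in [1, 2)
x = bucketed_moment(sizes, 2)
f2 = exact_moment(stream, 2)
print(dict(sizes), x, f2)
```

With factor-2 buckets the answer is only a constant-factor approximation; narrower buckets would be needed for a finer guarantee, a detail these slides do not spell out.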

Previous Algorithms [AMS96, CK04, G04]

1. Cleverly construct small-space estimator X s.t.

$E[X] = F_k$

Var[X] small

2. Apply Chebyshev’s inequality
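As one concrete instance of this recipe, the $F_2$ estimator of [AMS96] assigns each item a random sign, keeps the running sum $Z$ of the signs of arriving items, and outputs $X = Z^2$. The sketch below checks $E[X] = F_2$ on the small example stream by enumerating every sign assignment. The enumeration and the function name are assumptions made for illustration; a real implementation would draw signs from a limited-independence hash family rather than enumerate them.

```python
from collections import Counter
from itertools import product

def ams_estimate(stream, sign):
    """One estimator: Z = sum of signs of arriving items, X = Z^2."""
    z = 0
    for item in stream:
        z += sign[item]   # only the single counter z is stored
    return z * z

stream = [7, 1, 1, 3, 7, 3, 4]
items = sorted(set(stream))

# Average X over all 2^4 sign assignments, i.e. compute E[X] exactly.
estimates = [ams_estimate(stream, dict(zip(items, signs)))
             for signs in product((-1, 1), repeat=len(items))]
mean_estimate = sum(estimates) / len(estimates)

exact_f2 = sum(f * f for f in Counter(stream).values())
print(mean_estimate, exact_f2)
```

The cross terms cancel in expectation, which is why the average matches $F_2$. A single estimate can be far off, so in practice many copies are averaged before applying Chebyshev's inequality.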

Page 9

What’s Left?

Remaining Problem: Estimate $s_i$ = # of elements with frequency in each bucket $[2^{i-1}, 2^i)$

Is this always easy? No.

Suppose it were always easy: then we could approximate the maximum frequency

This is HARD: $\Omega(m)$ space [AMS96]

However, $\Omega(m)$ only applies to “worst-case” streams, otherwise can do better: CountSketch [CCF-C02]

Page 10

For the moment, let’s assume:

1. $\exists$ a 1-pass oracle Max returning the maximum frequency using O(B) space (we remove this using CountSketch)

2. We have a very long RAM of random bits (we remove this using Nisan’s generator)

[Figure: a string of random bits, 0 1 1 0 0 0 1 …, and a plot of frequency against items with the largest frequency labeled Max]

Page 11

General Idea: Max + Sampling

Restrict input stream to a random subset of items in {1, …, m}, where items are included independently with probability p.

Stream: 7 1 1 3 7 3 4 …

Random subset = {1, 3}

Restricted stream: … 3 3 1 1
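The restriction step can be sketched as follows, using exact counting as a stand-in for the Max oracle. The function names and the use of Python's `random` module are assumptions for illustration; the algorithm described in the talk derives its subsets from hash functions instead.

```python
import random
from collections import Counter

def sample_subset(m, p, rng):
    """Include each item of {1, ..., m} independently with probability p."""
    return {i for i in range(1, m + 1) if rng.random() < p}

def restrict(stream, subset):
    """Keep only the stream elements that belong to the sampled subset."""
    return [a for a in stream if a in subset]

def max_frequency(stream):
    """Stand-in for the Max oracle: the largest frequency in the stream."""
    counts = Counter(stream)
    return max(counts.values()) if counts else 0

stream = [7, 1, 1, 3, 7, 3, 4]
restricted = restrict(stream, {1, 3})      # the subset from the slide
print(restricted, max_frequency(restricted))

rng = random.Random(0)
print(sample_subset(7, 0.5, rng))          # a fresh random subset
```

Restricting to {1, 3} leaves the elements 1, 1, 3, 3 in arrival order, and the maximum frequency of that sub-stream is 2.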

Page 12

General Idea: Max + Sampling

What are chances the maximum lies in $S_i$ = elements $r$ such that $f_r \in [2^{i-1}, 2^i)$?

Restrict input to a random subset of items in {1, …, m}, where items are included independently with probability p.

$q = (1-p)^{\sum_{j>i} s_j} \cdot (1 - (1-p)^{s_i})$

Idea:

1. Estimate q as q’ by taking independent trials and computing fraction of max in $S_i$

2. If already estimated $s_j$ for $j > i$, solve this expression for $s_i$.
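Solving the expression for the bucket size is a one-line inversion. The sketch below uses hypothetical function names and assumes $0 < p < 1$. It evaluates $q$ for known sizes and then recovers $s_i$ from $q$ and the total size of the higher buckets.

```python
import math

def prob_max_in_bucket(p, s_i, higher):
    """q = (1-p)^higher * (1 - (1-p)^s_i), with higher = sum of s_j for j > i."""
    return (1 - p) ** higher * (1 - (1 - p) ** s_i)

def solve_bucket_size(p, q, higher):
    """Invert the expression for s_i; q may be an empirical estimate q'."""
    ratio = q / (1 - p) ** higher
    if not 0 <= ratio < 1:
        raise ValueError("q is inconsistent with p and the higher buckets")
    return math.log(1 - ratio) / math.log(1 - p)

# Round trip: 4 items in bucket i, 1 item in higher buckets, p = 1/4.
q = prob_max_in_bucket(0.25, 4, 1)
recovered = solve_bucket_size(0.25, q, 1)
print(q, recovered)
```

With an estimated $q'$ the ratio can reach or exceed 1 by chance, which is why the sketch rejects that case instead of taking the logarithm of a non-positive number.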

Page 13

When is this estimate any good?

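Recall $q = (1-p)^{\sum_{j>i} s_j} (1 - (1-p)^{s_i})$, so estimate $s_i$ by solving with $q'$ in place of $q$ and the estimates $s'_j$ in place of $s_j$:

$s'_i = \ln\left(1 - q' / (1-p)^{\sum_{j>i} s'_j}\right) / \ln(1-p)$

Need

1. $s'_j \approx s_j$ for all $j > i$ (holds inductively)

2. $q' \approx q$

Requires $\exists p$ so that $q > 1/R$, where R = # trials used to estimate q (tight concentration of q’)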

Page 14

When is this estimate any good?

Motivates the following:

Say a class $S_i$ contributes if and only if $s_i > \sum_{j>i} s_j / R$

If $R = \Omega(\log n)$, then $F_k \approx \sum_{\text{contributing } i} s_i 2^{ik}$

$q = (1-p)^{\sum_{j>i} s_j} (1 - (1-p)^{s_i})$

p too large? $\Rightarrow$ q too small

p too small? $\Rightarrow$ q too small
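The two failure modes, and the reason a contributing class always admits a workable $p$, can be made explicit with a short calculation. This is a worked sketch rather than the argument from the talk: it writes $S = \sum_{j>i} s_j$, assumes $S \ge 1$, and treats $p$ as a continuous parameter, whereas the algorithm only uses a discrete set of sampling rates.

```latex
\begin{align*}
q(p) &= (1-p)^{S}\bigl(1-(1-p)^{s_i}\bigr), \qquad S=\sum_{j>i}s_j \ge 1,\\
p \to 1 &: \quad (1-p)^{S} \to 0 \;\Rightarrow\; q \to 0,\\
p \to 0 &: \quad 1-(1-p)^{s_i} = s_i\,p + O(p^2) \to 0 \;\Rightarrow\; q \to 0.\\
\intertext{Setting $x = 1-p$ and maximizing $x^{S}-x^{S+s_i}$ gives $x^{s_i} = S/(S+s_i)$, so}
q_{\max} &= \Bigl(1+\tfrac{s_i}{S}\Bigr)^{-S/s_i}\cdot\frac{s_i}{S+s_i}
\;\ge\; \frac{1}{e}\cdot\frac{s_i}{S+s_i}.\\
\intertext{If $S_i$ contributes, i.e. $s_i > S/R$, then $s_i/(S+s_i) > 1/(R+1)$, hence}
q_{\max} &> \frac{1}{e\,(R+1)}.
\end{align*}
```

So for a contributing class the best choice of $p$ makes $q$ at least on the order of $1/R$, while a class that is tiny compared with the classes above it may have no good $p$ at all.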

Page 15

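The Idealized Algorithm

1. Use the random string to generate hash functions $h^j_r : [m] \to [2^j]$ for $j \in [\log m]$ and $r \in [R]$

2. Restrict stream Str to $Str^j_r$, those items $i$ with $h^j_r(i) = 1$

3. For each $Str^j_r$, compute Max($Str^j_r$)

4. To estimate $s_i$ given $s'_t$ for $t > i$, find some $j$ for which “enough” of the Max($Str^j_r$) come from $S_i$, and then set $s'_i$ by solving $q' = (1-p)^{\sum_{t>i} s'_t} (1 - (1-p)^{s'_i})$ for $s'_i$, where $p = 2^{-j}$ and $q'$ is the fraction of $r \in [R]$ whose Max lies in $S_i$

5. Output $F'_k = \sum_i s'_i 2^{ik}$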
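The five steps can be simulated end to end under generous assumptions. In the sketch below the Max oracle is exact, fresh random sampling replaces the hash functions, and step 4 simply tries sampling levels in order of decreasing hit fraction. That selection rule, the function names and the parameter `trials` are all choices made for this illustration, not the rule from the paper.

```python
import math
import random

def bucket_of(freq):
    """Index i with freq in [2^(i-1), 2^i); requires freq >= 1."""
    return freq.bit_length()

def idealized_fk(freqs, k, trials=400, seed=0):
    rng = random.Random(seed)
    items = list(freqs)
    levels = range(1, max(1, math.ceil(math.log2(len(items)))) + 1)
    top = max(bucket_of(f) for f in freqs.values())
    # hits[j][i] = number of trials at rate 2^-j whose maximum fell in bucket i
    hits = {j: [0] * (top + 1) for j in levels}
    for j in levels:
        p = 2.0 ** -j
        for _ in range(trials):
            sample = [x for x in items if rng.random() < p]
            if sample:  # exact Max oracle on the restricted input
                hits[j][bucket_of(max(freqs[x] for x in sample))] += 1
    est = [0.0] * (top + 1)
    higher = 0.0  # running sum of estimates s'_t for t > i
    for i in range(top, 0, -1):
        for j in sorted(levels, key=lambda lvl: hits[lvl][i], reverse=True):
            p = 2.0 ** -j
            q_hat = hits[j][i] / trials
            ratio = q_hat / (1 - p) ** higher
            if q_hat == 0 or ratio >= 1:
                continue  # this level cannot be inverted; try the next one
            est[i] = math.log(1 - ratio) / math.log(1 - p)
            break
        higher += est[i]
    return sum(s * 2 ** (i * k) for i, s in enumerate(est))

# One item of frequency 64, four of frequency 8, sixty-four of frequency 1.
freqs = {0: 64}
freqs.update({x: 8 for x in range(1, 5)})
freqs.update({x: 1 for x in range(5, 69)})
estimate = idealized_fk(freqs, k=2)
bucketed = 1 * 2 ** 14 + 4 * 2 ** 8 + 64 * 2 ** 2  # output with exact s_i
print(estimate, bucketed)
```

The simulation stores every frequency, so it says nothing about space. It only illustrates how the bucket sizes are peeled off from the largest bucket downward.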

Page 16

Removing the assumptions

[CCF-C02]: $\exists$ a 1-pass O(B)-space algorithm CountSketch which, given stream Str, outputs all $x$ for which $f_x^2 \ge F_2/B$
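For readers unfamiliar with CountSketch, here is a minimal point-query sketch of the data structure: several rows of counters, each row with its own bucket hash and sign hash, and a median over rows as the estimate. The class layout and the use of salted Python `hash` calls are assumptions made for illustration; the construction in [CCF-C02] uses pairwise independent hash families and adds heavy-hitter tracking on top.

```python
import random
from statistics import median

class CountSketch:
    def __init__(self, rows, width, seed=0):
        rng = random.Random(seed)
        self.width = width
        # One (bucket salt, sign salt) pair per row; salted hash() is a stand-in
        # for the pairwise independent hash functions of the real construction.
        self.salts = [(rng.getrandbits(64), rng.getrandbits(64))
                      for _ in range(rows)]
        self.table = [[0] * width for _ in range(rows)]

    def _bucket_and_sign(self, row, item):
        bucket_salt, sign_salt = self.salts[row]
        bucket = hash((bucket_salt, item)) % self.width
        sign = 1 if hash((sign_salt, item)) & 1 else -1
        return bucket, sign

    def update(self, item, delta=1):
        """Process (item, +) with delta=1 or (item, -) with delta=-1."""
        for row in range(len(self.table)):
            bucket, sign = self._bucket_and_sign(row, item)
            self.table[row][bucket] += sign * delta

    def estimate(self, item):
        """Median over rows of the signed counter the item hashes to."""
        values = []
        for row in range(len(self.table)):
            bucket, sign = self._bucket_and_sign(row, item)
            values.append(sign * self.table[row][bucket])
        return median(values)

sketch = CountSketch(rows=5, width=64)
for _ in range(1000):
    sketch.update(42)            # one heavy item
for other in range(100, 150):
    sketch.update(other)         # fifty light items
heavy = sketch.estimate(42)
sketch.update(42, -1000)         # deletions are just negative updates
after_delete = sketch.estimate(42)
print(heavy, after_delete)
```

In every row the estimate for item 42 is its true count plus at most the combined weight of the fifty light items, so the heavy item stands out clearly. The same structure handles deletions without modification.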

1. Assumption: $\exists$ a 1-pass oracle Max returning the maximum frequency using O(B) space

Lemma: If $S_i = [2^{i-1}, 2^i)$ contributes, then

Proof: Hölder’s inequality.

Recall: $S_i$ contributes if and only if $s_i > \sum_{j>i} s_j / R$

Page 17

Removing the assumptions

2. We have an infinite string of random bits

Consider a space-S algorithm A and a function f, with random strings $R_1, \ldots, R_n$, that, when processing a stream, maintains a variable C and updates it as follows: $C = C + f(i, R_i)$

[Indyk00] Then $R_1, \ldots, R_n$ can be generated using Nisan’s PRG, and:

1. The new algorithm A’ has space $\tilde{O}(S)$

2. The outputs of A’ and A are indistinguishable

Our algorithm follows this framework

Page 18

Conclusions

Result:

Tight $\tilde{O}(m^{1-2/k})$ upper bound

Handle deletions (j, -)

$\tilde{O}(1)$ update time

Open Problem: Reduce $\tilde{O}$ factors