Page 1: Distinct items:

Distinct items
• Given a stream $\sigma = \langle a_1, \dots, a_m \rangle$, where each $a_i \in [n]$, count the number of distinct items (so we are in the cash register model)
• Example: 3 5 7 4 3 4 3 4 7 5 9
• 5 distinct elements: 3 4 5 7 9 (we only want the count of distinct elements, and not the set of distinct elements)
• In terms of frequency moments estimation, this is the problem of estimating $F_0$

• The easy deterministic solutions: space $O(n)$ and $O(d \log n)$ ($d$ = number of distinct elements)
• Deterministic exact solution requires space $\Omega(n)$ in the worst case
• How about deterministic approximate solutions? And exact randomized?

• Can we do better with randomization and approximation?
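As a baseline, the exact approach of storing every distinct item seen so far can be sketched as follows. This is only an illustration of the easy deterministic solution; the function name `count_distinct_exact` is made up here, and its memory grows with the number of distinct items.

```python
def count_distinct_exact(stream):
    """Exact distinct count: remember every distinct item seen so far."""
    seen = set()
    for item in stream:
        seen.add(item)
    return len(seen)


# The example stream from the slide has 5 distinct elements.
example = [3, 5, 7, 4, 3, 4, 3, 4, 7, 5, 9]
print(count_distinct_exact(example))
```

The randomized estimators on the following slides trade this exactness for much smaller memory.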

Page 2: Distinct items:

Counting distinct elements (Flajolet-Martin 1985)
• Let $h : [n] \to [0,1]$ be a random hash function: for each $i \in [n]$, the value $h(i)$ is uniformly distributed in $[0,1]$
• What is the relation between the minimum $X = \min_j h(a_j)$ and the number of distinct elements $d$? $\mathbb{E}[X] = \frac{1}{d+1}$ (We will do two proofs on the board, one algebraic and one pictorial)
• Moreover, the variance can also be bounded via $\mathbb{E}[X^2] = \frac{2}{(d+1)(d+2)}$ (Fun problem: I only know an algebraic proof for this, but there could be a pictorial one too given the suggestive-looking rhs)
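One algebraic route to these moments is sketched below. It is not necessarily the proof done on the board, and it assumes the $d$ distinct items receive independent uniform values in $[0,1]$, with $X$ denoting their minimum.

```latex
\begin{align*}
\Pr[X > x] &= (1 - x)^d \quad \text{for } x \in [0,1], \\
\mathbb{E}[X] &= \int_0^1 \Pr[X > x]\,dx = \int_0^1 (1-x)^d\,dx = \frac{1}{d+1}, \\
\mathbb{E}[X^2] &= \int_0^1 2x \Pr[X > x]\,dx = \int_0^1 2x(1-x)^d\,dx = \frac{2}{(d+1)(d+2)}, \\
\mathrm{Var}[X] &= \frac{2}{(d+1)(d+2)} - \frac{1}{(d+1)^2} = \frac{d}{(d+1)^2(d+2)} \le \frac{1}{(d+1)^2}.
\end{align*}
```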

Page 3: Distinct items:

Counting distinct elements
First algorithm
• Pick a random hash function $h : [n] \to [0,1]$
• Find the minimum $X$ of $h(a_1), \dots, h(a_m)$
• Output $1/X - 1$

• Estimator has high variance. Improving the estimator by averaging:

Second algorithm
• Run $k$ parallel independent copies of the first algorithm
• Set $\bar{X} = \frac{1}{k} \sum_{j=1}^{k} X_j$ ($X_j$ is the minimum given by the $j$th copy)
• Return $1/\bar{X} - 1$
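The two algorithms can be sketched in Python as below. Several things are assumptions of this sketch rather than part of the slides: the idealized random hash is simulated with a memoized dictionary (which itself uses memory proportional to the number of distinct items, so this only illustrates the estimator, not the space bound), the averaged quantity is taken to be the minimum from each copy, and the names `make_random_hash`, `min_hash`, `first_algorithm`, and `second_algorithm` are made up.

```python
import random
import statistics


def make_random_hash(seed):
    """Simulate an idealized random hash into [0, 1) by memoizing random draws."""
    rng = random.Random(seed)
    table = {}

    def h(item):
        if item not in table:
            table[item] = rng.random()
        return table[item]

    return h


def min_hash(stream, seed):
    """Minimum hash value over the stream for one random hash function."""
    h = make_random_hash(seed)
    return min(h(a) for a in stream)


def first_algorithm(stream, seed=0):
    """Single-copy estimate 1/X - 1 (a draw of exactly 0.0 is ignored here)."""
    return 1.0 / min_hash(stream, seed) - 1.0


def second_algorithm(stream, k=200, seed=0):
    """Average the minima of k independent copies, then invert (assumed form)."""
    mean_min = statistics.fmean(min_hash(stream, seed + j) for j in range(k))
    return 1.0 / mean_min - 1.0


stream = list(range(1000)) + list(range(500))  # 1000 distinct items
print(first_algorithm(stream, seed=1), second_algorithm(stream, k=400, seed=1))
```

The stream is passed as a list because each simulated copy re-reads it; a true streaming implementation would update all copies in a single pass.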

Page 4: Distinct items:

Counting distinct elements
• Space complexity of the first algorithm: to compute the minimum we just need to keep one real number in memory, but we need to limit its precision
• So the space requirement is $O(\log n)$ bits?

• Not quite: also need to account for the memory requirements for a random hash function

• What property of random hash function did we really use?

Page 5: Distinct items:

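Counting distinct elements
• Pick $h$ from a 2-wise independent hash function family mapping $[n] \to \{0, 1, \dots, M-1\}$ for a prime $M$ ($M$ is chosen large to reduce round-off errors)
• $S = \{b_1, \dots, b_d\}$: set of distinct elements
• New estimator: $M / X$, where $X = \min_{x \in S} h(x)$
• No longer clear that $\mathbb{E}[X] \approx \frac{M}{d+1}$, but $X$ does provide useful information

Lemma With probability at least $\frac{2}{3} - \frac{d}{M}$, we have $\frac{d}{6} \le \frac{M}{X} \le 6d$ (probability is over the random choice of $h$)

Proof (1) First, prove $\Pr[M/X > 6d] \le \frac{1}{6} + \frac{d}{M}$:
$\Pr[M/X > 6d] = \Pr\left[X < \frac{M}{6d}\right] = \Pr\left[\exists i : h(b_i) < \frac{M}{6d}\right] \le \sum_{i=1}^{d} \Pr\left[h(b_i) < \frac{M}{6d}\right] \le d \cdot \frac{\lceil M/(6d) \rceil}{M} \le \frac{1}{6} + \frac{d}{M}$ (union bound)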

Page 6: Distinct items:

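Counting distinct elements
(2) Prove $\Pr[M/X < d/6] \le \frac{1}{6}$, where $X$ is the minimum hash value over the distinct elements $b_1, \dots, b_d$:
• Define indicator $Y_i = 1$ if $h(b_i) \le \frac{6M}{d}$ (this is the good event), $Y_i = 0$ otherwise
• $\mathbb{E}[Y_i] = \Pr[Y_i = 1] \ge \frac{6}{d}$, and so $\mathbb{E}[Y] \ge 6$ for $Y = \sum_{i=1}^{d} Y_i$
• We now upper bound $\Pr[X > 6M/d] = \Pr[Y = 0]$ by using the pairwise independence of the $Y_i$ and Chebyshev’s inequality (proof on the board; also in the book page 297)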
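A pairwise independent family of the kind used here can be sketched with linear functions modulo a prime. The specific modulus `M = 2**31 - 1`, the form `h(x) = (a*x + b) mod M`, and the function names are assumptions of this sketch; items are assumed to be integers in `{0, ..., M-1}`.

```python
import random

M = 2_147_483_647  # the Mersenne prime 2**31 - 1 (an assumed choice of modulus)


def sample_hash(rng):
    """Draw h(x) = (a*x + b) mod M with a, b uniform in {0, ..., M-1}."""
    a = rng.randrange(M)
    b = rng.randrange(M)
    return lambda x: (a * x + b) % M


def estimate_distinct(stream, rng):
    """Return M divided by the smallest hash value seen in the stream."""
    h = sample_hash(rng)
    smallest = min(h(x) for x in stream)
    return M / max(smallest, 1)  # guard against a hash value of exactly 0


rng = random.Random(42)
items = random.Random(5).sample(range(10**6), 500)  # 500 distinct items
print(estimate_distinct(items + items[:100], rng))
```

Only `a`, `b`, and the running minimum need to be stored, which is what makes the hash function cheap to remember compared with a fully random one.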

Page 7: Distinct items:

Boosting the success probability
• Take the median of the means estimator
• But this doesn’t seem to give a $(1 \pm \epsilon)$-factor approximation: approximation only within factors $1/6$ and $6$

• A related estimator [BJKST 2004]:

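• $\mathcal{H}$: pairwise independent hash function family of functions of type $[n] \to [N]$
• We can take $N = n^3$, and each $h \in \mathcal{H}$ has an $O(\log n)$-bit description
• So the probability that a random $h \in \mathcal{H}$ is injective on $[n]$ is at least $1 - 1/n$
• Maintain the $t = O(1/\epsilon^2)$ smallest hash values; let $v_t$ be the $t$th smallest hash value at the end of the stream. The new estimator (BJKST estimator) is $\hat{d} = \frac{tN}{v_t}$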

Page 8: Distinct items:

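Analyzing the BJKST estimator
• Requirements to maintain the BJKST estimator with $t = O(1/\epsilon^2)$ stored hash values:
– Space $O(t \log n) = O(\epsilon^{-2} \log n)$ bits
– Update time $O(\log t) = O(\log(1/\epsilon))$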

• We assume (satisfied if true for )

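• Recall that $S$ is the set of distinct elements in the stream, with $d = |S|$
• We separately upper bound $\Pr[\hat{d} > (1+\epsilon) d]$ and $\Pr[\hat{d} < (1-\epsilon) d]$, where $\hat{d}$ is the BJKST estimate, using the Chebyshev inequality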

Page 9: Distinct items:

Analyzing the BJKST estimator

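• $\hat{d} > (1+\epsilon) d \iff v_t < \frac{tN}{(1+\epsilon) d}$
• I.e., $h(S)$ contains at least $t$ elements less than $\frac{tN}{(1+\epsilon) d} \le (1 - \epsilon/2) \frac{tN}{d}$ (using $\epsilon \le 1$)
• For $a \in S$, define $X_a = 1$ if $h(a) < (1 - \epsilon/2) \frac{tN}{d}$ and $X_a = 0$ otherwise
• For $Y = \sum_{a \in S} X_a$:
• $\mathbb{E}[Y] \le (1 - \epsilon/2)\, t$, $\mathrm{Var}[Y] \le \mathbb{E}[Y] \le t$, $\Pr[Y \ge t] \le \Pr\left[\,|Y - \mathbb{E}[Y]| \ge \epsilon t/2\,\right] \le \frac{4}{\epsilon^2 t}$ (Chebyshev)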

Page 10: Distinct items:

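Analyzing the BJKST estimator
• Similarly, $\Pr[\hat{d} < (1-\epsilon) d] = O\left(\frac{1}{\epsilon^2 t}\right)$
• Thus, $\Pr[\,|\hat{d} - d| > \epsilon d\,] \le \frac{1}{3}$ for $t = \lceil c/\epsilon^2 \rceil$ with a sufficiently large constant $c$
• And now we can apply the median trick: run $O(\log(1/\delta))$ parallel independent copies of the algorithm to compute estimates $\hat{d}_1, \hat{d}_2, \dots$ and output their median

Theorem The output of the above algorithm is an $(\epsilon, \delta)$-approximation of $d$. It uses $O(\epsilon^{-2} \log n \log(1/\delta))$ space and $O(\log(1/\epsilon) \log(1/\delta))$ update time per streaming element

Very powerful: A variant needs 128 bytes for all works of Shakespeare, $\epsilon \approx 1/10$ [Durand-Flajolet 2003]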

• What streaming model does the above algorithm require?
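A compact sketch of the BJKST-style estimator with the median trick follows. Assumptions of this sketch, none taken from the slides: the hash is `h(x) = (a*x + b) mod P` with the Mersenne prime `P = 2**61 - 1` standing in for the range $[N]$, the choice `t = ceil(24 / eps**2)` uses an illustrative constant, and the function names are made up. Items are assumed to be nonnegative integers below `P`.

```python
import heapq
import math
import random
import statistics

P = (1 << 61) - 1  # Mersenne prime standing in for the hash range (assumption)


def bjkst_estimate(stream, t, rng):
    """One copy: keep the t smallest distinct hash values, return t * P / v_t."""
    a, b = rng.randrange(P), rng.randrange(P)
    heap = []      # negated hash values, so -heap[0] is the largest kept value
    kept = set()   # the same values, for constant-time duplicate checks
    for x in stream:
        v = (a * x + b) % P
        if v in kept:
            continue
        if len(heap) < t:
            heapq.heappush(heap, -v)
            kept.add(v)
        elif v < -heap[0]:
            evicted = -heapq.heappushpop(heap, -v)
            kept.discard(evicted)
            kept.add(v)
    if len(heap) < t:
        # Fewer than t distinct hash values: report how many were seen.
        return float(len(heap))
    return t * P / max(-heap[0], 1)


def bjkst_median(stream, eps, copies, seed=0):
    """Median of several independent copies (the median trick)."""
    t = math.ceil(24 / eps ** 2)  # illustrative constant, not from the slides
    rng = random.Random(seed)
    return statistics.median(bjkst_estimate(stream, t, rng) for _ in range(copies))


items = random.Random(7).sample(range(10**6), 5000)  # 5000 distinct items
print(bjkst_median(items + items[:1000], eps=0.2, copies=9, seed=3))
```

A max-heap of size `t` gives the logarithmic update time mentioned above; the copies here re-read a list for simplicity, whereas a streaming version would feed each arriving item to all copies at once.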

Page 11: Distinct items:

Counting distinct elements (strict turnstile model)

• What about the strict turnstile model?
• $\sigma = \langle (a_1, c_1), \dots, (a_m, c_m) \rangle$ with $a_j \in [n]$ and $c_j$ integers
• Frequency vector $f$ nonnegative
• The previous algorithm requires the cash register model

• A different but closely related algorithm that works in the strict turnstile model

• We will only give the basic idea and not the full details of the proof

Page 12: Distinct items:

Counting distinct elements (strict turnstile model)

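• $S = \{i : f_i > 0\}$: set of distinct elements, $d = |S|$
• First reduce the problem to its decision version:
• Input: stream $\sigma$, parameters $\epsilon, \delta$, and an additional parameter $T$
• Output:
– YES if $d \ge (1+\epsilon) T$
– NO if $d \le (1-\epsilon) T$
– Arbitrary otherwise
• Solution of the decision version gives a solution of the general problem with a slight blow up in the space:
• Run parallel versions of the decision problem with $T = 1, (1+\epsilon), (1+\epsilon)^2, \dots, n$
• A total of $O\left(\frac{\log n}{\epsilon}\right)$ copies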

Page 13: Distinct items:

Algorithm for the decision version of counting distinct elements

Basic algorithm

• Choose a random set $R \subseteq [n]$ by picking each element independently with probability $1/T$:
$\Pr[i \in R] = 1/T$ for all $i \in [n]$
• Maintain $Z = \sum_{i \in R} f_i$
• Output YES if $Z > 0$, else output NO

Page 14: Distinct items:

Decision version of counting distinct elements (analysis idea)

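Lemma For $T$ sufficiently large and $\epsilon$ sufficiently small, with $Z = \sum_{i \in R} f_i$ for the random set $R$:
$\Pr[Z = 0] < \frac{1}{e} - \frac{\epsilon}{3}$ if $d \ge (1+\epsilon) T$
$\Pr[Z = 0] > \frac{1}{e} + \frac{\epsilon}{3}$ if $d \le (1-\epsilon) T$

Proof $\Pr[Z = 0] = \Pr[R \cap S = \emptyset] = (1 - 1/T)^d \approx e^{-d/T}$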

Page 15: Distinct items:

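Full algorithm
• Run $k = C \log(1/\delta)/\epsilon^2$ independent parallel copies of the basic algorithm for a sufficiently large constant $C$: sample $R_1, \dots, R_k$ independently, and maintain $Z_j = \sum_{i \in R_j} f_i$ for each $j$
• $Y_j = 1$ if the $j$’th instance of the basic algorithm gives $Z_j = 0$; $Y_j = 0$ otherwise
• Output YES (i.e. declare $d \ge (1+\epsilon) T$) if $\sum_{j} Y_j < k/e$
• Output NO otherwise
• An application of the Chernoff bound using the independence of the $Y_j$ shows that this provides an $(\epsilon, \delta)$-approximation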

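• Space requirement?
• Use 2-wise independent sampling to choose the random sets
• Total space requirement is one counter per copy of the basic algorithm for each of the $O(\log n / \epsilon)$ values of $T$, plus the seeds describing the random sets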
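The decision procedure can be sketched as below for a stream of `(item, change)` updates. This is a sketch under stated assumptions: the random sets are sampled with full independence and stored explicitly (the slides suggest 2-wise independent sampling to save space), the YES threshold `k / e` on the number of zero counters is an assumed choice, items are integers in `range(n)`, and the names `sample_set` and `decide` are made up.

```python
import math
import random


def sample_set(n, T, rng):
    """Keep each item of range(n) independently with probability 1/T."""
    return {i for i in range(n) if rng.random() < 1.0 / T}


def decide(updates, n, T, k, seed=0):
    """Return True (YES) if the distinct count looks larger than T."""
    rng = random.Random(seed)
    sets = [sample_set(n, T, rng) for _ in range(k)]
    sums = [0] * k  # one linear counter per copy: sum of f_i over the sampled set
    for item, change in updates:
        for j in range(k):
            if item in sets[j]:
                sums[j] += change
    zeros = sum(1 for z in sums if z == 0)
    return zeros < k / math.e


# Insert 400 items, then delete 100 of them: 300 distinct items remain.
many = [(i, 1) for i in range(400)] + [(i, -1) for i in range(300, 400)]
# Insert 300 items, then delete all but 30 of them.
few = [(i, 1) for i in range(300)] + [(i, -1) for i in range(30, 300)]
print(decide(many, n=2000, T=100, k=200), decide(few, n=2000, T=100, k=200))
```

Because each counter is a sum of frequencies, deletions simply subtract, which is why this approach tolerates negative updates as long as the final frequencies are nonnegative.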

Page 16: Distinct items:

Counting distinct elements
• Why didn’t we just maintain whether or not $R \cap S = \emptyset$?
• $Z = \sum_{i \in R} f_i$ is a linear sketch
• Allows for negative updates $c_j$
• So works in the (strict) turnstile model
• The problem of computing $F_0$ is by now very well understood: space complexity $O(\epsilon^{-2} + \log n)$ with $O(1)$ update time. This is optimal up to constant factors [Kane et al. 2010]