
1

Algorithms for Large Data Sets

Ziv Bar-Yossef
Lecture 13

June 25, 2006

http://www.ee.technion.ac.il/courses/049011

2

Data Streams (cont.)

3

Outline

Distinct elements
Lp norms

Notation: for integers a < b,

[a,b] = {a, a+1, …, b}

4

Distinct Elements [Flajolet, Martin 85], [Alon, Matias, Szegedy 96], [Bar-Yossef, Jayram, Kumar, Sivakumar, Trevisan 02]

Input: a vector x ∈ [1,m]^n

Goal: find D = the number of distinct elements of x
Exact algorithms: need Ω(m) bits of space
Deterministic algorithms: need Ω(m) bits of space
Approximate randomized algorithms: O(log m) bits of space

5

Distinct Elements, 1st Attempt

Let M >> m²

Pick a “random hash function” h: [1,m] → [1,M]
h(1),…,h(m) are chosen uniformly and independently from [1,M]
Since M >> m², the probability of collisions is tiny

1. min ← M
2. for i = 1 to n do
3.   read xi from the stream
4.   if h(xi) < min, min ← h(xi)
5. output M/min
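A minimal Python sketch of this idealized estimator, simulating a truly random hash by drawing h(x) lazily and uniformly from [1,M]; the function name and constants are illustrative, not from the lecture.

import random

def distinct_estimate(stream, m):
    """Idealized estimator: output M / min h(x_i)."""
    M = 100 * m * m              # M >> m^2, so collisions are unlikely
    h = {}                       # lazily simulated truly random h: [1,m] -> [1,M]
    minimum = M
    for x in stream:
        if x not in h:
            h[x] = random.randint(1, M)
        minimum = min(minimum, h[x])
    return M / minimum

# Example: a stream over [1,10] with 4 distinct elements.
print(distinct_estimate([3, 7, 3, 1, 9, 7, 1], m=10))

Note that storing h explicitly costs O(m log M) bits, which is exactly the problem analyzed on the next slide.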

6

Distinct Elements: Analysis

Space:
O(log M) = O(log m) for min
O(m log M) = O(m log m) for h

Too much! Worse than the naïve O(m)-space algorithm

Next: show how to use more “space-efficient” hash functions

7

Small Families of Hash Functions

H = { h | h: [1,m] → [1,M] }: a family of hash functions

|H| = O(m^c) for some constant c
Therefore, each h ∈ H can be represented in O(log m) bits

Need H to be “explicit”: given representation of h, can compute h(x), for any x, efficiently.

How do we make sure H has the “random-like” properties of random hash functions?

8

Universal Hash Functions [Carter, Wegman 79]

H is a 2-universal family of hash functions if:
For all x ≠ y ∈ [1,m] and for all z,w ∈ [1,M], when choosing h from H at random,

Pr[h(x) = z and h(y) = w] = 1/M²

Conclusions:
For each x, h(x) is uniform in [1,M]
For all x ≠ y, h(x) and h(y) are independent
h(1),…,h(m) is a sequence of uniform, pairwise-independent random variables

k-universal families: straightforward generalization

9

Construction of a Universal Family

Suppose M is a prime power
[1,M] can be viewed as the finite field F_M

[1,m] can be viewed as elements of F_M

H = { h_{a,b} | a,b ∈ F_M } is defined by:

h_{a,b}(x) = ax + b

Note:

|H| = M²

If x ≠ y ∈ F_M and z,w ∈ F_M, then h_{a,b}(x) = z and h_{a,b}(y) = w iff

ax + b = z and ay + b = w

Since x ≠ y, this linear system (in the unknowns a,b) has a unique solution
Hence, Pr_{a,b}[h_{a,b}(x) = z and h_{a,b}(y) = w] = 1/M².
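A small Python sketch of this family (using a prime modulus M rather than a general prime power, purely for simplicity; the class name HashAB is illustrative):

import random

class HashAB:
    """2-universal hash h_{a,b}(x) = (a*x + b) mod M, for a prime M."""
    def __init__(self, M):
        self.M = M
        self.a = random.randrange(M)   # storing (a, b) takes O(log M) bits
        self.b = random.randrange(M)

    def __call__(self, x):
        return (self.a * x + self.b) % self.M

# M is a prime much larger than m^2 (here m = 100).
h = HashAB(M=1_000_003)
print(h(1), h(2), h(3))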

10

Distinct Elements, 2nd Attempt

Use a 2-universal hash function rather than a truly random hash function
Space:
O(log m) for tracking the minimum
O(log m) for storing the hash function

Correctness:
Part 1:
h(a1),…,h(aD) are still uniform in [1,M]
Linearity of expectation holds regardless of whether Z1,…,Zk are independent or not
Part 2:
h(a1),…,h(aD) are now pairwise independent
Main point: the variance of pairwise-independent variables is additive: Var[∑i Zi] = ∑i Var[Zi]
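A compact sketch of this second attempt (the min-hash estimator combined with the h_{a,b} family above; the prime modulus is an illustrative choice):

import random

def distinct_estimate_2universal(stream):
    """Min-hash estimator with a 2-universal hash: O(log m) bits overall."""
    M = 1_000_003                                     # a prime with M >> m^2
    a, b = random.randrange(M), random.randrange(M)   # O(log m) bits for the hash
    minimum = M
    for x in stream:                                  # O(log m) bits for the minimum
        minimum = min(minimum, (a * x + b) % M)
    return M / minimum

print(distinct_estimate_2universal([3, 7, 3, 1, 9, 7, 1]))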

11

Distinct Elements, Better Approximation

So far we had a factor-6 approximation. How do we get a better one?
1 + ε approximation algorithm (a sketch follows below):

Find the t = O(1/ε²) smallest hash values, rather than just the smallest one.

If v is the largest among these, output tM/v

Space: O(1/ε² · log m)
Better algorithm: O(1/ε² + log m)
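A hedged sketch of this (1+ε) variant, keeping the t smallest distinct hash values; the constant inside t and the prime M are illustrative choices.

import random

def distinct_estimate_eps(stream, eps=0.1):
    """Keep the t = O(1/eps^2) smallest distinct hash values; output t*M/v."""
    M = 1_000_003                             # a prime with M >> m^2
    a, b = random.randrange(M), random.randrange(M)
    t = int(10 / eps ** 2)                    # the constant 10 is illustrative
    smallest = set()                          # at most t smallest distinct hash values
    for x in stream:
        smallest.add((a * x + b) % M)
        if len(smallest) > t:
            smallest.discard(max(smallest))   # drop the largest to keep only t
    if len(smallest) < t:                     # fewer than t distinct values: exact count
        return len(smallest)
    v = max(smallest)                         # v = the t-th smallest hash value
    return t * M / v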

12

Lp Norms

Input: an integer vector x ∈ [-m,+m]^n

Goal: find ||x||p = Lp norm of x

Popular instantiations:
L2: Euclidean distance
L1: Manhattan distance
L∞: max_i |x_i|
L0: number of non-zero entries (with the conventions x_i⁰ = 1 for x_i ≠ 0 and 0⁰ = 0); not a norm

Data stream algorithm: when x is given explicitly, coordinate by coordinate, this can be done trivially in O(log m) space
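For the explicit-vector case, one pass and a single running sum suffice; a direct sketch:

def lp_norm_stream(coords, p):
    """One pass over the coordinates of x, keeping a single running sum."""
    total = 0
    for xi in coords:
        total += abs(xi) ** p
    return total ** (1 / p)

print(lp_norm_stream([-2, 5, -2], p=1))   # L1 norm = 9
print(lp_norm_stream([-2, 5, -2], p=2))   # L2 norm = sqrt(33)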

13

Lp Norms: The “Cash Register” Model

Input: a sequence X of N pairs (i1,a1),…,(iN,aN)

For each j, ij ∈ {1,…,n}
For each j, aj ∈ [-m,m]

Ex: X = (1,3), (3,-2), (1,-5), (2,4), (2,1)

For each i = 1,…,n, let Si = { j | ij = i }
Ex: S1 = {1,3}, S2 = {4,5}, S3 = {2}

Define: xi = ∑_{j∈Si} aj

Ex: x1 = -2, x2 = 5, x3 = -2

Goal: find ||x||p = Lp norm of x
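A short sketch of the aggregation this model implies (exactly the per-coordinate state a small-space streaming algorithm cannot afford to store), reproducing the example above:

from collections import defaultdict

def aggregate(pairs):
    """x_i = sum of a_j over all updates (i_j, a_j) with i_j = i."""
    x = defaultdict(int)
    for i, a in pairs:
        x[i] += a
    return dict(x)

X = [(1, 3), (3, -2), (1, -5), (2, 4), (2, 1)]
print(aggregate(X))   # {1: -2, 3: -2, 2: 5}, matching x1 = -2, x2 = 5, x3 = -2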

14

Lp Norms in the “Cash Register” Model: Applications

Standard Lp norms
Lp distances:

Input: two vectors x,y ∈ [-m,+m]^n (interleaved arbitrarily)
Goal: find ||x – y||p

Frequency moments:
Input: a vector X ∈ [1,n]^N

Ex: X = (1 2 3 1 1 2)
For each i = 1,…,n, define: xi = frequency of i in X

Ex: x1 = 3, x2 = 2, x3 = 1
Goal: output ||x||p
Special cases:

p = ∞: most frequent element
p = 0: distinct elements

15

Lp Norms: State of the Art Results

0 < p ≤ 2: O(log n log m) space algorithm [Indyk 00]

2 < p < ∞:
O(n^{1-2/p} log m) space algorithm [Indyk, Woodruff 05]
Ω(n^{1-2/p-o(1)}) space lower bound [Saks, Sun 02], [Bar-Yossef, Jayram, Kumar, Sivakumar 02], [Chakrabarti, Khot, Sun 03]

p = ∞:
O(n) space algorithm [Alon, Matias, Szegedy 96]
Ω(n) space lower bound [Alon, Matias, Szegedy 96]

p = 0 (distinct elements):
O(log n + 1/ε²) space algorithm [Bar-Yossef, Jayram, Kumar, Sivakumar, Trevisan 02]
Ω(log n + 1/ε²) space lower bound [Alon, Matias, Szegedy 96], [Indyk, Woodruff 03]

16

Stable Distributions

D: a distribution on R, x ∈ R^n, p ∈ (0,2]

The distribution D_x:
Z1,…,Zn: i.i.d. random variables with distribution D
D_x = the distribution of ∑i xi Zi

The distribution D_{p,x}:
Z: a random variable with distribution D
D_{p,x} = the distribution of ||x||p · Z

Definition: D is p-stable if, for every x, D_x = D_{p,x}.

Examples:
p = 2: the standard normal distribution
p = 1: the Cauchy distribution
Other p’s: no closed-form pdf
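A quick Monte Carlo illustration of 1-stability (a sanity check, not part of the lecture): for i.i.d. Cauchy Z1,…,Zn, the sum ∑i xi·Zi is distributed like ||x||1·Z, so the medians of the absolute values of both sides should agree.

import math, random, statistics

def cauchy():
    # Standard Cauchy sample via the inverse CDF: tan(pi*(U - 1/2)).
    return math.tan(math.pi * (random.random() - 0.5))

x = [3, -2, 5]                                     # ||x||_1 = 10
trials = 200_000
lhs = [abs(sum(xi * cauchy() for xi in x)) for _ in range(trials)]
rhs = [abs(10 * cauchy()) for _ in range(trials)]  # 10 = ||x||_1
# Both medians should be close to ||x||_1 * median(|Cauchy|) = 10 * 1 = 10.
print(statistics.median(lhs), statistics.median(rhs))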

17

Indyk’s Algorithm

For simplicity, assume p = 1.
Input: a sequence X = (i1,a1),…,(iN,aN)
Output: a value z s.t. Pr[(1-ε)·||x||1 ≤ z ≤ (1+ε)·||x||1] ≥ 1-δ

“Cauchy hash function”: h: [1,n] → R
h(1),…,h(n) are i.i.d. with the Cauchy distribution
In practice, use bounded precision

18

Indyk’s Algorithm, 1st Attempt

1. k ← O(1/ε² · log(1/δ))

2. generate k Cauchy hash functions h1,…,hk

3. for t = 1,…,k do

4. At ← 0

5. for j = 1,…,N do

6. read (ij,aj) from data stream

7. for t = 1,…,k do

8. At ← At + aj · ht(ij)

9. output median(|A1|,…,|Ak|)
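A runnable sketch of this first attempt, with full independence simulated by storing every ht(i) explicitly (exactly the O(n log m)-space problem discussed two slides below); the constant inside k and the helper names are illustrative.

import math, random, statistics

def cauchy():
    return math.tan(math.pi * (random.random() - 0.5))       # standard Cauchy sample

def indyk_l1(stream, n, eps=0.2, delta=0.05):
    """First attempt: k counters, each driven by an explicit Cauchy hash."""
    k = int(10 / eps ** 2 * math.log(1 / delta))              # constant 10 is illustrative
    h = [[cauchy() for _ in range(n + 1)] for _ in range(k)]  # h[t][i] for i in [1,n]
    A = [0.0] * k
    for i, a in stream:                                       # one pass over the updates
        for t in range(k):
            A[t] += a * h[t][i]
    return statistics.median(abs(At) for At in A)

X = [(1, 3), (3, -2), (1, -5), (2, 4), (2, 1)]                # x = (-2, 5, -2), ||x||_1 = 9
print(indyk_l1(X, n=3))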

19

Correctness Analysis

Fix some t ∈ [1,k]. What value does At have at the end of the execution?

Recall: ht(1),…,ht(n) are i.i.d. with a 1-stable distribution
At the end of the pass, At = ∑i xi · ht(i), so At is distributed the same as ||x||1 · Z

Z: random variable with Cauchy distribution

20

Correctness Analysis (cont.)

Z1,…,Zk: i.i.d. random variables with Cauchy distribution

Output of the algorithm: median(|A1|,…,|Ak|)

Same as: median(||x||1·|Z1|,…,||x||1·|Zk|) = ||x||1 · median(|Z1|,…,|Zk|)

Conclusion: it is enough to show that Pr[1-ε ≤ median(|Z1|,…,|Zk|) ≤ 1+ε] ≥ 1-δ

21

Correctness Analysis (cont.)

Claim: Let Z be distributed Cauchy. Then, Pr[|Z| ≤ 1] = 1/2, i.e., the median of |Z| is 1.

Proof: The cdf of |Z| for a Cauchy Z is F(z) = Pr[|Z| ≤ z] = (2/π) · arctan(z)

Therefore, Pr[|Z| ≤ 1] = (2/π) · arctan(1) = (2/π) · (π/4) = 1/2

Claim: Let Z be distributed Cauchy. For any sufficiently small ε > 0, Pr[|Z| ≤ 1-ε] ≤ 1/2 - ε/4 and Pr[|Z| ≤ 1+ε] ≥ 1/2 + ε/4

22

Correctness Analysis (cont.)

Claim: Let Z1,…,Zk be k = O(1/ε² · log(1/δ)) i.i.d. Cauchy random variables. Then, Pr[1-ε ≤ median(|Z1|,…,|Zk|) ≤ 1+ε] ≥ 1-δ

Proof: For j = 1,…,k, let Yj = 1 if |Zj| < 1-ε and Yj = 0 otherwise. Then,

median(|Z1|,…,|Zk|) < 1-ε iff ∑j Yj ≥ k/2

E[∑j Yj] ≤ k/2 - εk/4

By the Chernoff-Hoeffding bound,

Pr[∑j Yj ≥ k/2] < δ/2

A similar analysis shows:

Pr[median(|Z1|,…,|Zk|) > 1+ε] < δ/2
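A small numeric check of this concentration claim (not from the lecture): with k on the order of (1/ε²)·log(1/δ), the sample median of |Z1|,…,|Zk| should land in [1-ε, 1+ε] in nearly every trial.

import math, random, statistics

def cauchy():
    return math.tan(math.pi * (random.random() - 0.5))

eps, delta = 0.1, 0.05
k = int(10 / eps ** 2 * math.log(1 / delta))     # constant 10 is illustrative
trials, inside = 200, 0
for _ in range(trials):
    med = statistics.median(abs(cauchy()) for _ in range(k))
    inside += (1 - eps <= med <= 1 + eps)
print(f"{inside}/{trials} medians fell in [1-eps, 1+eps]")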

23

Space Analysis

Space used: k = O(1/ε² · log(1/δ)) times:
At: O(log m) bits
ht: O(n log m) bits
Too much!

This time we really need ht(1),…,ht(n) to be fully independent

Otherwise, the resulting distribution is not stable
Cannot use universal hashing
What can we do?

24

Pseudo-Random Generators for Space-Bounded Computations [Nisan 90]

Notation: Uk = a random sequence of k bits

An S-space, R-random-bits randomized algorithm A:
Uses at most S bits of space
Uses at most R random bits
Accesses its random bits sequentially
A(x, U_R): the (random) output of A on input x

Nisan’s pseudo-random generator: G: {0,1}^{S log R} → {0,1}^R s.t. for every S-space, R-random-bits randomized algorithm A and for every input x, A(x, U_R) has almost the same distribution as A(x, G(U_{S log R}))
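A rough, purely illustrative sketch of the recursive structure commonly used to describe Nisan's generator: the seed is one block x plus k pairwise-independent hash functions, and each level doubles the output length. All parameter choices below (the prime, block width, number of levels) are assumptions made for the example, not the lecture's definition.

import random

P = (1 << 61) - 1                       # a Mersenne prime; blocks are ~61-bit values

def make_hash():
    # Pairwise-independent hash on blocks: x -> (a*x + b) mod P.
    a, b = random.randrange(1, P), random.randrange(P)
    return lambda x: (a * x + b) % P

def nisan(x, hashes):
    """G(x; h1..hk) = G(x; h1..h_{k-1}) || G(h_k(x); h1..h_{k-1});  G(x;) = [x]."""
    if not hashes:
        return [x]
    rest, h_k = hashes[:-1], hashes[-1]
    return nisan(x, rest) + nisan(h_k(x), rest)

hashes = [make_hash() for _ in range(10)]        # k = 10 levels -> 2^10 output blocks
out = nisan(random.randrange(P), hashes)
print(len(out), out[:3])

The point for the slide above: the seed costs O(block size · k) bits while the output is block size · 2^k bits, matching the {0,1}^{S log R} → {0,1}^R shape.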

25

Space Analysis

Suppose the input stream is guaranteed to arrive in the following order:
First, all pairs of the form (1,*)
Then, all pairs of the form (2,*), …
Finally, all pairs of the form (n,*)

Then we can generate the values ht(1),…,ht(n) on the fly, one at a time, with no need to store them
O(log m) bits suffice to store the hash function's current value

Therefore, for such input streams, Indyk’s algorithm uses:
O(log m) bits of space
O(n log m) random bits

26

Space Analysis (cont.)

Conclusion: For “ordered” input streams, Indyk’s algorithm is an O(log m)-space, O(n log m)-random-bits randomized algorithm.

So we can use Nisan’s generator:
ht can now be generated from only O(log m · log n) random bits
Space needed: O(log n · log m) bits

Crucial observation: Indyk’s algorithm does not depend on the order of the input stream.

Conclusion: If we generate the Cauchy hash functions using Nisan’s generator, then Indyk’s algorithm will work even for “unordered” streams.

27

Wrapping Up

Space used: k = O(1/ε² · log(1/δ)) times:
At: O(log m) bits

ht: O(log n · log m) bits (using Nisan’s generator)

Total: O(1/ε² · log(1/δ) · log n · log m) bits

28

End of Lecture 13