Lower bounds on data stream computations
![Page 1: Lower bounds on data stream computations](https://reader035.fdocuments.us/reader035/viewer/2022062422/56813602550346895d9d7810/html5/thumbnails/1.jpg)
Lower bounds on data stream computations
Seminar in Communication Complexity
By Michael Umansky
Instructor: Ronitt Rubinfeld
![Page 2: Lower bounds on data stream computations](https://reader035.fdocuments.us/reader035/viewer/2022062422/56813602550346895d9d7810/html5/thumbnails/2.jpg)
Previously...
We proved 3 theorems concerning space complexity of data stream algorithms.
Using the streaming model discussed earlier, we found some lower bounds for the MAX, MAXNEIGHBOR, MAXTOTAL and MAXPATH problems.
And now, for something completely different.
![Page 3: Lower bounds on data stream computations](https://reader035.fdocuments.us/reader035/viewer/2022062422/56813602550346895d9d7810/html5/thumbnails/3.jpg)
Today
In this lecture, I introduce lower bounds from communication complexity.
Trust me, they are correct.
Using these bounds and (mostly) reductions, our goal is to prove even more theorems. Theorems are good.
I'll prove 3 of them.
Starting with “Theorem 4”.
![Page 4: Lower bounds on data stream computations](https://reader035.fdocuments.us/reader035/viewer/2022062422/56813602550346895d9d7810/html5/thumbnails/4.jpg)
Theorem 4
Setting: Sequence of m numbers in {1,...,n}.
– Multiple occurrences are allowed.
Claim: Finding one of the k most frequent items in one pass requires Ω(n/k) space.
Moreover, random sampling yields an upper bound of O(n (log m + log n) / k).
We're going to use a blackbox to prove it.
![Page 5: Lower bounds on data stream computations](https://reader035.fdocuments.us/reader035/viewer/2022062422/56813602550346895d9d7810/html5/thumbnails/5.jpg)
Theorem 4 blackbox
Alon-Matias-Szegedy: Finding the most frequent number in a sequence of length m in range {1,...,n} takes Ω(n) space.
Proof outline: Reduction. Namely, we create a new stream that we can (ab)use this blackbox on.
The reduction will replace each number in the sequence with a sequence of numbers:
– Each i in {1,...,n} is replaced with ki+1,...,ki+k.
– In total, nk numbers.
![Page 6: Lower bounds on data stream computations](https://reader035.fdocuments.us/reader035/viewer/2022062422/56813602550346895d9d7810/html5/thumbnails/6.jpg)
Reduction example
Our data stream is {4,5,3,2,7,3,4,5,1} in range {1,...,10} and we want to obtain the 2 most frequent numbers.
The reduction will create the numbers: {9,10}, {11,12}, {7,8}, {5,6}, {15,16}, {7,8}, {9,10}, {11,12}, {3,4}
The most frequent numbers in the original sequence (3, 4 and 5, twice each) map to the most frequent numbers in the new sequence (7 through 12, twice each).
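The example can be replayed in a few lines. This is only a sketch of the replacement rule from the slides; the function name `expand_stream` is made up for illustration.

```python
from collections import Counter

def expand_stream(stream, k):
    """Replace each item i with the block k*i+1, ..., k*i+k."""
    out = []
    for i in stream:
        out.extend(range(k * i + 1, k * i + k + 1))
    return out

stream = [4, 5, 3, 2, 7, 3, 4, 5, 1]
k = 2
expanded = expand_stream(stream, k)

old_counts = Counter(stream)
new_counts = Counter(expanded)

# Every number in the block of i inherits the count of i.
for i, c in old_counts.items():
    for x in range(k * i + 1, k * i + k + 1):
        assert new_counts[x] == c
```

Blocks of different items never overlap, which is why frequencies carry over unchanged.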
![Page 7: Lower bounds on data stream computations](https://reader035.fdocuments.us/reader035/viewer/2022062422/56813602550346895d9d7810/html5/thumbnails/7.jpg)
Proof outline
If x_i = x_j, then the sequences created by the reduction coincide. Otherwise, they are disjoint.
If x_i occurs l times in the stream, each of its k replacement numbers occurs l times in the new stream (kl elements in total).
It follows that finding one of the k most frequent items in one pass requires Ω(n/k) space. Running this 'algorithm' k times we get the AMS theorem.
Great success.
![Page 8: Lower bounds on data stream computations](https://reader035.fdocuments.us/reader035/viewer/2022062422/56813602550346895d9d7810/html5/thumbnails/8.jpg)
As for the upper bound
Reminder: a Monte-Carlo algorithm is a randomized algorithm that succeeds with a high probability.
So we'll show a Monte-Carlo algorithm that succeeds with high probability to get the right upper bound.
![Page 9: Lower bounds on data stream computations](https://reader035.fdocuments.us/reader035/viewer/2022062422/56813602550346895d9d7810/html5/thumbnails/9.jpg)
The Monte-Carlo algorithm
Before reading the stream:
– Sample each number with probability 1/k.
– Only keep a counter for the sampled numbers.
Read the stream normally.
Output the successfully sampled number with largest count.
With constant probability, one of the k most frequent numbers has been sampled successfully.
This requires O(n (log m + log n) / k) space. Epic win.
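A minimal sketch of the sampling algorithm, assuming the values are the integers 1..n. The function name and the explicit `rng` argument are illustrative choices, not part of the lecture.

```python
import random
from collections import Counter

def sampled_frequent_item(stream, n, k, rng):
    # Before reading the stream: keep each value in 1..n with probability 1/k.
    sampled = {v for v in range(1, n + 1) if rng.random() < 1.0 / k}

    # Read the stream normally, counting only the sampled values.
    counts = Counter()
    for x in stream:
        if x in sampled:
            counts[x] += 1

    # Output the sampled value with the largest count (None if no sampled value showed up).
    if not counts:
        return None
    return max(counts, key=counts.get)

stream = [3, 1, 3, 2, 3, 5, 5]
# With k = 1 every value is sampled, so this is just exact counting.
exact = sampled_frequent_item(stream, 5, 1, random.Random(0))
# With k = 2 about half of the values get a counter.
guess = sampled_frequent_item(stream, 5, 2, random.Random(1))
```

In expectation about n/k counters are kept, each needing log m bits for the count and log n bits for the value, which is where the O(n (log m + log n) / k) bound comes from.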
![Page 10: Lower bounds on data stream computations](https://reader035.fdocuments.us/reader035/viewer/2022062422/56813602550346895d9d7810/html5/thumbnails/10.jpg)
And now for something completely different
Introducing the approximate median problem (AMP).
Reminder: The median is the value which separates the higher half of the set from the lower half.
We want to approximate that. Why? Because it's cool.
![Page 11: Lower bounds on data stream computations](https://reader035.fdocuments.us/reader035/viewer/2022062422/56813602550346895d9d7810/html5/thumbnails/11.jpg)
This slide isn't the median problem
First, a blackbox from communication complexity.
Consider the bit-vector probing problem:
– Let A have a bit sequence of length m and B an index i. B needs to know x_i, the i-th input bit.
– But the communication is one way only: B cannot send anything to A.
Ideas?
![Page 12: Lower bounds on data stream computations](https://reader035.fdocuments.us/reader035/viewer/2022062422/56813602550346895d9d7810/html5/thumbnails/12.jpg)
Blackbox cont.
Turns out there is no better method for B to learn the i-th bit than for A to send the entire string to B.
– So it takes Ω(m) bits of communication.
But what about randomization?
– Too bad: any algorithm that succeeds in guessing x_i with probability better than (1+ε)/2 requires at least εm bits of communication.
![Page 13: Lower bounds on data stream computations](https://reader035.fdocuments.us/reader035/viewer/2022062422/56813602550346895d9d7810/html5/thumbnails/13.jpg)
Approximate median problem
Goal: Find a number whose rank is in the interval [m/2 – εm, m/2 + εm].
It can be solved by a one-pass Monte-Carlo algorithm with 1/10 error probability.
Takes O(log n (log 1/ε)^2 / ε) space.
I have a truly magnificent proof of this theorem. This slideshow is too small to contain it.
![Page 14: Lower bounds on data stream computations](https://reader035.fdocuments.us/reader035/viewer/2022062422/56813602550346895d9d7810/html5/thumbnails/14.jpg)
AMP cont.
Motivation: We want to prove a corresponding lower bound on this problem.
How: We show that any 1-pass Las Vegas algorithm that solves the ε-AMP requires Ω(1/ε) space.
We show a reduction from the bit-vector probing problem.
![Page 15: Lower bounds on data stream computations](https://reader035.fdocuments.us/reader035/viewer/2022062422/56813602550346895d9d7810/html5/thumbnails/15.jpg)
AMP lower bound proof
Let B be a bit vector (b_1,...,b_n), followed by a query index i.
This is translated to a sequence of numbers as follows:
– First, output 2j+b_j, for each j.
– Then, upon getting the query, output n-i+1 copies of 0 and i+1 copies of 2(n+1).
![Page 16: Lower bounds on data stream computations](https://reader035.fdocuments.us/reader035/viewer/2022062422/56813602550346895d9d7810/html5/thumbnails/16.jpg)
Reduction example
B = (0,1,0,1,1,0,1,1,0,1), i=5.
The reduction maps:
– 2j+b_j: [2,5,6,9,11,12,15,17,18,21]
– n-i+1=6 copies of 0: [0,0,0,0,0,0]
– i+1=6 copies of 22=2(n+1): [22,22,22,22,22,22]
The median of this set (the 11th smallest of the 22 numbers) is 11. Its LSB is 1, which is exactly the value of b_5.
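The mapping is easy to check mechanically. The sketch below rebuilds the stream for every query index and reads the bit back from the median; the function names are illustrative, and "median" here means the element of rank m/2 (the lower of the two middle elements), which is an assumption about the convention used.

```python
def probe_to_median_stream(bits, i):
    """bits holds b_1..b_n (so b_j is bits[j-1]); i is a 1-based query index."""
    n = len(bits)
    stream = [2 * j + bits[j - 1] for j in range(1, n + 1)]
    stream += [0] * (n - i + 1)          # n-i+1 copies of 0
    stream += [2 * (n + 1)] * (i + 1)    # i+1 copies of 2(n+1)
    return stream

def lower_median(values):
    s = sorted(values)
    return s[len(s) // 2 - 1]            # rank m/2 (1-based) for even m

bits = [0, 1, 0, 1, 1, 0, 1, 1, 0, 1]
for i in range(1, len(bits) + 1):
    med = lower_median(probe_to_median_stream(bits, i))
    assert med == 2 * i + bits[i - 1]    # the median is exactly 2i + b_i
    assert med & 1 == bits[i - 1]        # so its least significant bit is b_i
```

The padding shifts the sorted order so that 2i+b_i lands at rank n+1 out of 2n+2, whatever i is.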
![Page 17: Lower bounds on data stream computations](https://reader035.fdocuments.us/reader035/viewer/2022062422/56813602550346895d9d7810/html5/thumbnails/17.jpg)
AMP proof cont.
It is easily verified that the least significant bit of the median of this sequence is the value of b_i (that is, the bit we seek).
Choose ε=1/(2n), so that the ε-approximate median is the exact median. The “reduced” stream has 2n+2 numbers, so εm is about 1; strictly, any ε below 1/(2n+2) forces the exact rank, which only changes the constant.
Therefore any one-pass algorithm that requires fewer than 1/(2ε) = n bits of memory can be used...
![Page 18: Lower bounds on data stream computations](https://reader035.fdocuments.us/reader035/viewer/2022062422/56813602550346895d9d7810/html5/thumbnails/18.jpg)
AMP proof cont.
… to derive a communication protocol that requires fewer than n bits to be communicated from A to B in solving bit vector probing.
But every protocol that solves bit vector probing must communicate n bits.
Contradiction. Quod erat demonstrandum.
![Page 19: Lower bounds on data stream computations](https://reader035.fdocuments.us/reader035/viewer/2022062422/56813602550346895d9d7810/html5/thumbnails/19.jpg)
Corollary
What's the point I've been trying to make?
Randomization can sometimes reduce space complexity significantly, at the cost of the guarantee that the output is correct.
Moving right along.
![Page 20: Lower bounds on data stream computations](https://reader035.fdocuments.us/reader035/viewer/2022062422/56813602550346895d9d7810/html5/thumbnails/20.jpg)
Some graph theory
A graph can be considered as a stream.
– Example: Adjacency list.
This means some graph-theoretic problems can be approximated or solved using data stream and communication complexity techniques.
I'll address a small part of them.
![Page 21: Lower bounds on data stream computations](https://reader035.fdocuments.us/reader035/viewer/2022062422/56813602550346895d9d7810/html5/thumbnails/21.jpg)
Why is this good?
Suppose we can read the stream more than once (we don't have enough memory to store it but we do have access).
But the number of times we can read the stream is bounded.
What possible graph theoretic problems could we approximate with this method?
![Page 22: Lower bounds on data stream computations](https://reader035.fdocuments.us/reader035/viewer/2022062422/56813602550346895d9d7810/html5/thumbnails/22.jpg)
Theorem 6
In P passes, the following problems on an n-node graph take Ω(n / P) space:
– Computing connected components.
– Computing k-edge connected components.
– Computing k-vertex connected components.
– Testing graph planarity.
– Finding the sinks of a directed graph.
I'll prove graph connectivity.
![Page 23: Lower bounds on data stream computations](https://reader035.fdocuments.us/reader035/viewer/2022062422/56813602550346895d9d7810/html5/thumbnails/23.jpg)
Connected components
Proof by reduction of DISJOINT to the graph connectivity problem. Reminder: DISJOINT(x,y) returns 1 iff there exists i such that x_i=y_i=1.
Given bit vectors A and B, construct a graph with vertices {a,b,1,...,n}.
Insert an edge (a,i) iff i is in A's vector and an edge (i,b) iff it's in B's vector.
Vertices a and b are connected (they lie in the same component) iff there exists a bit that's set in both vectors.
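The construction can be sanity-checked on small inputs. In this sketch the vertices are the strings "a" and "b" plus the integers 1..n, and a small union-find decides whether a and b end up in the same component; the helper names are illustrative, and the union-find is there only to check the reduction, not as a small-space streaming algorithm.

```python
def build_edges(x, y):
    """x, y: 0/1 lists of length n. Vertices: 'a', 'b', and 1..n."""
    edges = [("a", i) for i, bit in enumerate(x, start=1) if bit]
    edges += [(i, "b") for i, bit in enumerate(y, start=1) if bit]
    return edges

def a_reaches_b(edges):
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    for u, v in edges:
        parent[find(u)] = find(v)           # merge the two components
    return find("a") == find("b")

# Bit 3 is set in both vectors, so a - 3 - b is a path.
assert a_reaches_b(build_edges([1, 0, 1], [0, 0, 1]))
# No common bit: a and b stay in different components.
assert not a_reaches_b(build_edges([1, 0, 0], [0, 1, 1]))
```

Note that A's edges come first in the stream and B's edges second, which is what lets the two parties split a pass between them.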
![Page 24: Lower bounds on data stream computations](https://reader035.fdocuments.us/reader035/viewer/2022062422/56813602550346895d9d7810/html5/thumbnails/24.jpg)
Connectivity cont.
From communication complexity, we know that every DISJOINT-solving protocol sends Ω(n) bits.
So if we have P passes over the data, one of the passes must use Ω(n / P) space. This is a total cheating hack by the way. Blame HRR.
QED anyway.
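For the record, the step from Ω(n) communication to Ω(n / P) space can be spelled out with the usual simulation argument. This is a sketch under assumptions not stated on the slides: S denotes the space of the streaming algorithm in bits, c is the constant hidden in the DISJOINT bound, and the stream consists of A's edges followed by B's edges. In each pass A runs the algorithm on her edges and sends its memory to B; B continues on his edges and, unless it was the last pass, sends the memory back. That is at most 2P-1 messages of S bits each.

```latex
\[
(2P-1)\,S \;\ge\; c\,n
\quad\Longrightarrow\quad
S \;\ge\; \frac{c\,n}{2P-1} \;=\; \Omega(n/P).
\]
```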
That's all folks!