Sketching (1)
Alex Andoni (Columbia University)
MADALGO Summer School on Streaming Algorithms 2015
Stream of packets:
131.107.65.14, 131.107.65.14, 131.107.65.14, 18.0.1.12, 18.0.1.12, 80.97.56.20, 80.97.56.20

IP              Frequency
131.107.65.14   3
18.0.1.12       2
80.97.56.20     2

(In general the table ranges over every IP seen, e.g. 127.0.0.1 -> 9, 192.168.0.1 -> 8, ..., and can be huge.)
Challenge: log statistics of the data, using small space
Streaming statistics
Let x_i = frequency of IP i
1st moment (sum): ∑ x_i
  Trivial: keep a total counter
2nd moment (variance): ∑ x_i² = ||x||²
  Trivially: n counters -- too much space
  Can't do better exactly
  Better with small approximation!
  Via dimension reduction in ℓ_2
Example: for the stream above (frequencies 3, 2, 2):
  ∑ x_i = 7,  ∑ x_i² = 17
2nd frequency moment
Let x_i = frequency of IP i
2nd moment: ∑ x_i² = ||x||²
Dimension reduction
Store a sketch of x:
  φ(x) = (G_1·x, G_2·x, …, G_k·x) = 𝒢x
  where each G_i is an n-dimensional Gaussian vector
Estimator:
  (1/k)·||𝒢x||² = (1/k)·[(G_1·x)² + (G_2·x)² + ⋯ + (G_k·x)²]
Updating the sketch: use linearity of the sketching function φ:
  𝒢(x + e_i) = 𝒢x + 𝒢e_i
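The sketch and its linear update rule fit in a few lines of numpy. This is only a sketch of the idea: the sizes n and k, the stream, and the seed are illustrative, and the matrix 𝒢 is stored explicitly purely for the demo (which costs O(nk) reals, the very issue addressed later).

```python
import numpy as np

rng = np.random.default_rng(0)

n = 10_000   # universe size (number of possible IPs); illustrative
k = 400      # sketch size, k = O(1/eps^2); illustrative

# G: k x n matrix of i.i.d. N(0,1) entries, stored explicitly for the demo.
G = rng.standard_normal((k, n))

x = np.zeros(n)       # true frequency vector (kept only to check the estimate)
sketch = np.zeros(k)  # Gx, maintained incrementally

# Stream of 5000 packet arrivals; each increments one coordinate of x.
# Linearity: G(x + e_i) = Gx + G e_i, so we just add column i of G.
for i in rng.integers(0, n, size=5000):
    x[i] += 1.0
    sketch += G[:, i]

estimate = np.sum(sketch**2) / k   # (1/k) * ||Gx||^2
true_val = np.sum(x**2)            # ||x||^2 = second moment
```

With k = 400 the relative error is typically a few percent, matching the 1 ± ε guarantee for ε ≈ √(2/k).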
Correctness
Theorem [Johnson-Lindenstrauss]:
  (1/k)·||𝒢x||² = (1 ± ε)·||x||² with probability 1 − e^{−Ω(ε²k)}
Why Gaussian?
  Stability property: G_i·x = ∑_j G_{ij}·x_j is distributed as ||x||·g,
  where g is also Gaussian
  Equivalently: G_i is spherically symmetric -- it points in a random
  direction, and the projection of x onto a random direction depends only
  on the length of x
  Indeed, the joint pdf of two independent Gaussians depends only on the length:
    (1/√(2π)·e^{−x²/2}) · (1/√(2π)·e^{−y²/2}) = 1/(2π)·e^{−(x²+y²)/2}
  pdf of a standard Gaussian g: 1/√(2π)·e^{−g²/2}, with E[g] = 0, E[g²] = 1
Proof [sketch]
Claim: for any x ∈ ℝⁿ, we have:
  Expectation: E[(G_i·x)²] = ||x||²
  Standard deviation: σ[(G_i·x)²] = O(||x||²)
Proof (of the expectation):
  E[(G_i·x)²] = E[(||x||·g)²] = ||x||²·E[g²] = ||x||²
𝒢x is distributed as (||x||·g_1, …, ||x||·g_k),
where each g_i is a standard 1D Gaussian
Estimator: (1/k)·||𝒢x||² = ||x||² · (1/k)·∑_i g_i²
∑_i g_i² follows the chi-squared distribution with k degrees of freedom
Fact: chi-squared is very well concentrated:
  (1/k)·∑_i g_i² = 1 ± ε with probability 1 − e^{−Ω(ε²k)}
  Akin to the central limit theorem
2nd frequency moment: overall
Correctness:
  (1/k)·||𝒢x||² = (1 ± ε)·||x||² with probability 1 − e^{−Ω(ε²k)}
  Enough to set k = O(1/ε²) for constant probability of success
Space requirement:
  k = O(1/ε²) counters of O(log n) bits
  But what about 𝒢 itself: store O(nk) reals?
Storing randomness [AMS'96]
OK if the entries of each G_i are "less random": choose them 4-wise independent
Also OK if each entry of G_i is a random ±1
Then only O(k) counters of O(log n) bits are needed
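A minimal sketch of the ±1 variant. For brevity this demo draws fully independent signs and stores them explicitly; [AMS'96] shows 4-wise independent signs suffice, so a real implementation would store only small hash seeds instead of a k × n matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 5_000, 400   # illustrative sizes

# AMS-style sketch: random +/-1 signs in place of Gaussians.
# (Fully independent here for simplicity; 4-wise independence suffices.)
S = rng.choice([-1.0, 1.0], size=(k, n))

x = np.zeros(n)
sketch = np.zeros(k)
for i in rng.integers(0, n, size=4000):   # stream of increments
    x[i] += 1.0
    sketch += S[:, i]                     # linear update, as before

estimate = np.mean(sketch**2)   # E[(S_j . x)^2] = ||x||^2 for each row j
true_val = np.sum(x**2)
```

The estimator is the same average of squared counters; only the projection vectors changed.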
More efficient sketches?
Smaller space:
  No: Ω(1/ε² · log n) bits [JW'11] -- David's lecture
Faster update time:
  Yes: Jelani's lecture
Streaming Scenario 2
Two routers, two streams, two frequency vectors:
  Router 1 sees: 131.107.65.14, 18.0.1.12, 18.0.1.12
    x: 131.107.65.14 -> 1, 18.0.1.12 -> 2
  Router 2 sees: 131.107.65.14, 18.0.1.12, 80.97.56.20
    y: 131.107.65.14 -> 1, 18.0.1.12 -> 1, 80.97.56.20 -> 1
Focus: difference in traffic
  1st moment: ∑ |x_i − y_i| = ||x − y||_1      (here ||x − y||_1 = 2)
  2nd moment: ∑ |x_i − y_i|² = ||x − y||_2²    (here ||x − y||_2² = 2)
Similar questions: average delay/variance in a network,
differential statistics between logs at different servers, etc.
Definition: Sketching
Sketching: φ : objects → short bit-strings
  given φ(x) and φ(y), we should be able to estimate some function of x and y
Here: from the two short sketches φ(x) and φ(y) of the routers'
frequency vectors, estimate ||x − y||_2²
Sketching for ℓ_2
As before, dimension reduction:
  Pick 𝒢 (using common randomness)
  φ(x) = 𝒢x
Estimator: ||φ(x) − φ(y)||_2² = ||𝒢x − 𝒢y||_2² = ||𝒢(x − y)||_2²
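The two-party use of common randomness can be sketched as follows: both sides regenerate the same 𝒢 from a shared seed and exchange only their k-dimensional sketches. The seed, sizes, and Poisson-distributed frequencies are made up for the demo.

```python
import numpy as np

def make_sketcher(shared_seed, n, k):
    # Both parties regenerate the SAME Gaussian matrix from the shared seed,
    # so no one ever ships G itself -- only the k-dimensional sketches.
    G = np.random.default_rng(shared_seed).standard_normal((k, n))
    return lambda v: G @ v

n, k = 3_000, 500                 # illustrative sizes
sketch = make_sketcher(42, n, k)  # 42 = the common randomness

rng = np.random.default_rng(7)
x = rng.poisson(2.0, size=n).astype(float)  # frequencies at router 1
y = rng.poisson(2.0, size=n).astype(float)  # frequencies at router 2

sx = sketch(x)   # computed at site 1
sy = sketch(y)   # computed at site 2

estimate = np.sum((sx - sy)**2) / k   # (1/k) * ||G(x - y)||^2
true_val = np.sum((x - y)**2)         # ||x - y||_2^2
```

By linearity, 𝒢x − 𝒢y = 𝒢(x − y), so the single-vector analysis applies verbatim to the difference.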
Sketching for Manhattan distance (ℓ_1)
Dimension reduction?
  Essentially no: [CS'02, BC'03, LN'04, JN'10, …]
  For n points and approximation D: dimension between n^{Ω(1/D²)} and O(n/D)
  [BC'03, NR'10, ANN'10, …], even if the map depends on the dataset!
  In contrast: for ℓ_2, [JL] gives dimension O(ε⁻² log n)
  No distributional dimension reduction either
Weak dimension reduction is to the rescue…
Dimension reduction for ℓ_1?
Can we do the "analog" of Euclidean projections?
For ℓ_2 we used the Gaussian distribution, which has the stability property:
  g_1·z_1 + g_2·z_2 + ⋯ + g_n·z_n is distributed as g·||z||_2
Is there something similar for the 1-norm?
Yes: the Cauchy distribution! It is 1-stable:
  c_1·z_1 + c_2·z_2 + ⋯ + c_n·z_n is distributed as c·||z||_1
What's wrong then?
  Cauchy random variables are heavy-tailed…
  |c| does not even have finite expectation
  pdf(s) = 1 / (π(s² + 1))
Sketching for ℓ_1 [Indyk'00]
Still, can consider a map as before:
  φ(x) = (C_1·x, C_2·x, …, C_k·x) = 𝒞x
Consider φ(x) − φ(y) = 𝒞x − 𝒞y = 𝒞(x − y) = 𝒞z, where z = x − y
  each coordinate of 𝒞z is distributed as ||z||_1 × Cauchy
Take the 1-norm ||𝒞z||_1?
  does not have finite expectation, but…
Can estimate ||z||_1 by:
  the median of the absolute values of the coordinates of 𝒞z!
Correctness claim: for each i:
  Pr[|C_i·z| > ||z||_1·(1 − ε)] > 1/2 + Ω(ε)
  Pr[|C_i·z| < ||z||_1·(1 + ε)] > 1/2 + Ω(ε)
Estimator for ℓ_1
Estimator: median(|C_1·z|, |C_2·z|, …, |C_k·z|)
Correctness claim: for each i:
  Pr[|C_i·z| > ||z||_1·(1 − ε)] > 1/2 + Ω(ε)
  Pr[|C_i·z| < ||z||_1·(1 + ε)] > 1/2 + Ω(ε)
Proof:
  |C_i·z| = abs(C_i·z) is distributed as abs(||z||_1·c) = ||z||_1·|c|,
  where c is Cauchy
  Easy to verify that:
    Pr[|c| > 1 − ε] > 1/2 + Ω(ε)
    Pr[|c| < 1 + ε] > 1/2 + Ω(ε)
Hence, if we take k = O(1/ε²):
  median(|C_1·z|, …, |C_k·z|) = (1 ± ε)·||z||_1 with probability at least 90%
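The median estimator above can be sketched directly; the vector z and the sizes are illustrative. The key fact used in the comment is that the median of the absolute value of a standard Cauchy variable equals 1.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 2_000, 600   # k = O(1/eps^2); sizes illustrative

C = rng.standard_cauchy((k, n))   # rows C_i: 1-stable projections
z = rng.normal(0.0, 3.0, size=n)  # stands in for z = x - y

# Each coordinate of Cz is distributed as ||z||_1 * (a Cauchy variable),
# whose mean is infinite -- but median(|Cauchy|) = 1, so the median of
# |C_i z| over the k coordinates concentrates around ||z||_1.
estimate = np.median(np.abs(C @ z))
true_val = np.sum(np.abs(z))   # ||z||_1
```

Replacing the mean by the median is exactly what sidesteps the heavy tail: order statistics only need the distribution function near its median, not finite moments.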
To finish the ℓ_p norms…
p-moment: ∑ x_i^p = ||x||_p^p
p ≤ 2:
  works via p-stable distributions [Indyk'00]
p > 2:
  can do (and need) Õ(n^{1−2/p}) counters
  [AMS'96, SS'02, BYJKS'02, CKS'03, IW'05, BGKS'06, BO'10, AKO'11, G'11, BKSV'14]
  Will see a construction via Precision Sampling
A task: estimate a sum
Given: n quantities a_1, a_2, …, a_n in the range [0,1]
Goal: estimate S = a_1 + a_2 + ⋯ + a_n "cheaply"
Standard sampling: pick a random set J = {j_1, …, j_m} of size m
  Estimator: S̃ = (n/m)·(a_{j_1} + a_{j_2} + ⋯ + a_{j_m})
  Chebyshev bound: with 90% success probability,
    (1/2)·S − O(n/m) < S̃ < 2·S + O(n/m)
  For constant additive error, need m = Ω(n)
(Picture: of a_1, a_2, a_3, a_4, only a_1 and a_3 are sampled;
compute an estimate S̃ from a_1, a_3.)
Precision Sampling Framework
Alternative "access" to the a_i's:
  for each term a_i, we get a (rough) estimate ã_i,
  up to some precision u_i chosen in advance: |a_i − ã_i| < u_i
Challenge: achieve a good trade-off between:
  quality of the approximation to S
  using only weak precisions u_i (minimize the "cost" of estimating S)
(Picture: each a_i is known only up to its precision interval u_i;
compute an estimate S̃ from ã_1, ã_2, ã_3, ã_4.)
Formalization
The game between a sum estimator and an adversary:
  Estimator: 1. fixes the precisions u_i
  Adversary: 1. fixes a_1, a_2, …, a_n
             2. fixes ã_1, ã_2, …, ã_n s.t. |a_i − ã_i| < u_i
  Estimator: 3. given ã_1, …, ã_n, outputs S̃ s.t.
             |∑_i a_i − γ·S̃| < 1 (for some small γ)
What is the cost?
  Here, average cost = (1/n)·∑_i 1/u_i
  To achieve precision u_i, one uses 1/u_i "resources": e.g., if a_i is itself
  a sum a_i = ∑_j a_{ij} computed by subsampling, then one needs Θ(1/u_i) samples
For example, one can choose all u_i = 1/n: average cost ≈ n
Precision Sampling Lemma
Goal: estimate ∑ a_i from {ã_i} satisfying |a_i − ã_i| < u_i.
Precision Sampling Lemma: can get, with 90% success, O(1) additive error
and 1.5 multiplicative error:
  S/1.5 − O(1) < S̃ < 1.5·S + O(1)
with average cost equal to O(log n)
Example: distinguish ∑ a_i = 3 vs ∑ a_i = 0. Consider two extreme cases:
  if three of the a_i equal 1: enough to have a crude approximation for
  all of them (u_i = 0.1)
  if all a_i = 3/n: only a few terms need a good approximation u_i = 1/n,
  and the rest can have u_i = 1
Refinement [A-Krauthgamer-Onak'11]: (1 + ε) multiplicative error,
  S/(1 + ε) − ε < S̃ < (1 + ε)·S + ε,
with average cost O(ε⁻³ log n)
Precision Sampling Algorithm
Recall the lemma: with 90% success, O(1) additive error and 1.5
multiplicative error,
  S/1.5 − O(1) < S̃ < 1.5·S + O(1),
with average cost O(log n)
Algorithm:
  Choose each u_i ∈ [0,1] i.i.d.
  Estimator: S̃ = 6 · #{i : ã_i/u_i > 6} (6 is the normalization constant)
Proof of correctness:
  we use only those ã_i which are 1.5-approximations to a_i
  E[S̃/6] ≈ ∑_i Pr[a_i/u_i > 6] = ∑_i a_i/6
  E[1/u_i] = O(log n) w.h.p.
For the (1 + ε) version [A-Krauthgamer-Onak'11]: the estimator becomes a
function of the values [ã_i/u_i − 4/ε]_+ and the u_i's, with each u_i drawn
as the minimum of O(ε⁻³) uniform random variables; this gives
  S/(1 + ε) − ε < S̃ < (1 + ε)·S + ε
with average cost O(ε⁻³ log n)
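The constant-factor algorithm is simple enough to simulate end to end. This sketch uses the lecture's uniform-precision variant with threshold 6; the input distribution and the ±u_i noise model are made up for the demo, and the refined lemma would use a different distribution for the u_i's.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 20_000

# Quantities in [0,1], mostly zero so the sum is modest.
a = rng.uniform(0.0, 1.0, size=n) * (rng.uniform(size=n) < 0.05)
S = a.sum()

# Precisions chosen in advance, i.i.d. uniform on [0,1].
u = rng.uniform(0.0, 1.0, size=n)

# Adversarial-ish inputs: each a_i revealed only up to +/- u_i.
a_tilde = a + rng.uniform(-1.0, 1.0, size=n) * u

# Estimator: 6 * #{i : a_tilde_i / u_i > 6}. With exact inputs,
# E[#] = sum_i Pr[u_i < a_i/6] = S/6; the +/- u_i noise only perturbs the
# threshold (a_i <= 1 < 6 means noise alone can never fire it).
S_est = 6.0 * np.count_nonzero(a_tilde / u > 6.0)
```

Note that a term is inspected closely only when its u_i happens to be small, which is exactly how the scheme keeps the average cost low.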
ℓ_p via precision sampling
Theorem: there is a linear sketch for ℓ_p with O(1) approximation
and O(n^{1−2/p} log n) space (90% success probability).
Sketch:
  Pick random r_i ∈ {±1}, and u_i an exponential r.v.
  let y_i = x_i · r_i / u_i^{1/p}
  throw the y_i's into one hash table H with m = O(n^{1−2/p} log n) cells
Estimator: max_j |H[j]|^p
Linear: works for differences x − y as well
Randomness: bounded independence suffices
(Picture: x = (x_1, …, x_6); each y_i is hashed into a cell of H, and each
cell sums the y_i's that land in it; u ∼ e^{−u}.)
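The whole construction is a dozen lines of numpy. This is a sketch under illustrative parameters (p = 3, a generous number of cells, a planted heavy coordinate); since the guarantee is only an O(1)-approximation with constant probability, the demo takes a median over repetitions purely to stabilize the output.

```python
import numpy as np

rng = np.random.default_rng(11)
p = 3.0
n = 5_000
m = 1_000   # cells; theory says O(n^{1-2/p} log n), this is generous

x = rng.uniform(-1.0, 1.0, size=n)
x[0] = 5.0   # plant one heavy coordinate

def lp_sketch_estimate(x, p, m, rng):
    n = len(x)
    r = rng.choice([-1.0, 1.0], size=n)     # random signs
    u = rng.exponential(1.0, size=n)        # exponential r.v.'s
    y = x * r / u**(1.0 / p)
    cell = rng.integers(0, m, size=n)       # hash coordinate i -> a cell
    H = np.bincount(cell, weights=y, minlength=m)
    return np.max(np.abs(H))**p             # max_j |H[j]|^p

# Median over repetitions, only to stabilize this demo.
estimate = np.median([lp_sketch_estimate(x, p, m, rng) for _ in range(21)])
true_val = np.sum(np.abs(x)**p)   # ||x||_p^p
```

The sketch is linear in x, so the same code estimates ||x − y||_p^p from the difference of two tables built with shared randomness.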
Correctness of ℓ_p estimation
Sketch:
  y_i = x_i · r_i / u_i^{1/p}, where r_i ∈ {±1} and u_i is an exponential r.v.
  Throw the y_i's into a hash table H
Theorem: max_j |H[j]|^p is an O(1) approximation with 90% probability,
for m = O(n^{1−2/p} · log^{O(1)} n) cells
Claim 1: max_i |y_i| is a constant approximation to ||x||_p:
  max_i |y_i|^p = max_i |x_i|^p / u_i
  Fact [max-stability]: max_i (a_i/u_i) is distributed as (∑_i a_i)/u,
  where u is also an exponential r.v.
  Hence max_i |y_i|^p is distributed as ||x||_p^p / u,
  and u is Θ(1) with constant probability
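The max-stability fact is easy to check numerically: both sides of the identity have the same distribution, as the one-line CDF computation in the comment shows. The weights and sample count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
a = np.array([0.5, 1.0, 2.0, 3.5])   # arbitrary nonnegative weights
T = 200_000                          # samples per side

# Left side: max_i a_i / u_i, with independent exponentials u_i.
U = rng.exponential(1.0, size=(T, len(a)))
lhs = np.max(a / U, axis=1)

# Right side: (sum_i a_i) / u, with a single exponential u. Max-stability:
# Pr[max_i a_i/u_i <= t] = prod_i e^{-a_i/t} = e^{-(sum a_i)/t}
#                        = Pr[(sum_i a_i)/u <= t].
rhs = a.sum() / rng.exponential(1.0, size=T)

med_lhs = np.median(lhs)
med_rhs = np.median(rhs)   # both medians near sum(a)/ln 2
```

The identity is exact in distribution, not just in the median; the medians are simply a convenient statistic to compare, since neither side has a finite mean of much use here.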
Correctness (cont.)
Claim 2: max_j |H[j]| = Θ(1)·||x||_p
Consider the hash table H and the cell j that y_{i*} falls into,
for the i* which maximizes |y_{i*}|
How much "extra stuff" is there in that cell?
  δ² = (H[j] − y_{i*})² = (∑_{i≠i*} y_i·[i → j])²
  ([i → j] is the indicator that coordinate i hashes to cell j)
  E[δ²] = ∑_{i≠i*} y_i²·Pr[i → j] = ∑_{i≠i*} y_i²/m ≤ ||y||²/m
We have: E_u[||y||²] ≤ ||x||²·E[1/u^{2/p}] = O(log n)·||x||²
  and ||x||² ≤ n^{1−2/p}·||x||_p²
By Markov: δ² ≤ ||x||_p²·n^{1−2/p}·O(log n)/m with probability 0.9
Then: H[j] = y_{i*} + δ = Θ(1)·||x||_p
Need to argue about the other cells too -- concentration