CS 361A (Advanced Data Structures and Algorithms)
Lectures 16 & 17 (Nov 16 and 28, 2005)
Synopses, Samples, and Sketches
Rajeev Motwani
Game Plan for Week

Last Class
Models for Streaming/Massive Data Sets
Negative results for Exact Distinct Values
Hashing for Approximate Distinct Values
Today

Synopsis Data Structures
Sampling Techniques
Frequency Moments Problem
Sketching Techniques
Finding High-Frequency Items
Synopsis Data Structures

Synopses
Webster – a condensed statement or outline (as of a narrative or treatise)
CS 361A – succinct data structure that lets us answer queries efficiently
Synopsis Data Structure – "lossy" summary (of a data stream)
Advantages – fits in memory + easy to communicate
Disadvantage – lossiness implies approximation error
Negative Results ⇒ best we can do
Key Techniques – randomization and hashing
Numerical Examples

Approximate Query Processing [AQUA/Bell Labs]
Database Size – 420 MB
Synopsis Size – 420 KB (0.1%)
Approximation Error – within 10%
Running Time – 0.3% of time for exact query
Histograms/Quantiles [Chaudhuri-Motwani-Narasayya, Manku-Rajagopalan-Lindsay, Khanna-Greenwald]

Data Size – 10^9 items
Synopsis Size – 1249 items
Approximation Error – within 1%
Synopses

Desiderata
Small Memory Footprint
Quick Update and Query
Provable, low-error guarantees
Composable – for distributed scenario
Applicability?

General-purpose – e.g. random samples
Specific-purpose – e.g. distinct values estimator
Granularity?

Per database – e.g. sample of entire table
Per distinct value – e.g. customer profiles
Structural – e.g. GROUP-BY or JOIN result samples
Examples of Synopses

Synopses need not be fancy!
Simple Aggregates – e.g. mean/median/max/min
Variance?
Random Samples
Aggregates on small samples represent entire data
Leverage extensive work on confidence intervals
Random Sketches
structured samples
Tracking High-Frequency Items
Random Samples
Types of Samples

Oblivious sampling – at item level
o Limitations [Bar-Yossef–Kumar–Sivakumar STOC 01]
Value-based sampling – e.g. distinct-value samples
Structured samples – e.g. join sampling

Naïve approach – keep samples of each relation
Problem – sample-of-join ≠ join-of-samples
Foreign-Key Join [Chaudhuri-Motwani-Narasayya SIGMOD 99]
what if A sampled from L and B from R?
(Figure: foreign-key join of relations L and R on attributes A and B)
Basic Scenario

Goal – maintain uniform sample of item-stream
Sampling Semantics?

Coin flip
o select each item with probability p
o easy to maintain
o undesirable – sample size is unbounded
Fixed-size sample without replacement
o Our focus today

Fixed-size sample with replacement
o Show – can generate from previous sample
Non-Uniform Samples [Chaudhuri-Motwani-Narasayya]
Reservoir Sampling [Vitter]

Input – stream of items X1, X2, X3, …
Goal – maintain uniform random sample S of size n (without replacement) of stream so far
Reservoir Sampling
Initialize – include first n elements in S
Upon seeing item Xt
o Add Xt to S with probability n/t
o If added, evict a random previous item
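The update rule above can be sketched in a few lines of Python (a minimal illustration; function and variable names are mine):

```python
import random

def reservoir_sample(stream, n, rng=random):
    """Vitter's reservoir sampling: uniform sample of size n, without
    replacement, over a stream of unknown length."""
    sample = []
    for t, x in enumerate(stream, start=1):
        if t <= n:
            sample.append(x)              # first n items fill the reservoir
        elif rng.random() < n / t:        # keep X_t with probability n/t
            sample[rng.randrange(n)] = x  # evict a random previous item
    return sample
```

Running it repeatedly and counting how often each item appears is a quick empirical check of the n/t invariant.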
Analysis

Correctness?
Fact: At each instant, |S| = n
Theorem: At time t, any Xi ∈ S with probability n/t
Exercise – prove via induction on t
Efficiency?

Let N be stream size

Naïve implementation – N coin flips ⇒ time O(N)

E[# updates to S] = Σ_{t=n+1}^{N} n/t = n(H_N − H_n) ≈ n ln(N/n), i.e. O(n(1 + log(N/n)))

Remark: Verify this is optimal.
Improving Efficiency
Random variable Jt – number jumped over after time t
Idea – generate Jt and skip that many items
Cumulative Distribution Function – F(s) = P[Jt ≤ s], for t>n & s≥0
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14
items inserted into sample S (where n=3)
J3 = 2, J9 = 4
F(s) = 1 − ∏_{T=t+1}^{t+s+1} (1 − n/T) = 1 − (t+1−n)^(s+1) / (t+1)^(s+1)

where a^(b) = a(a+1)(a+2)...(a+b−1)
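The jump idea can be illustrated by drawing U and inverting F with a linear scan (a sketch built from the product form of F above; names are mine, and a real implementation would use the faster search from Vitter's paper):

```python
import random

def next_jump(t, n):
    """Sample J_t, the number of items skipped after time t for sample size n,
    by inverting F(s) = 1 - prod_{T=t+1}^{t+s+1} (1 - n/T) with a linear scan."""
    u = random.random()
    s = 0
    tail = 1.0  # running value of P[J_t > s]
    while True:
        tail *= 1.0 - n / (t + s + 1)
        if u <= 1.0 - tail:  # u <= F(s): the next insertion is s skips away
            return s
        s += 1
```

For example, F(0) = n/(t+1), so P[J_t = 0] should match n/(t+1) empirically.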
Analysis

Number of calls to RANDOM()?
one per insertion into sample
this is optimal!
Generating Jt?

Pick random number U ∈ [0,1]
Find smallest j such that U ≤ F(j)
How?
o Linear scan – O(N) time
o Binary search with Newton's interpolation – O(n^2 (1 + polylog(N/n))) time
Remark – see paper for optimal algorithm
Sampling over Sliding Windows [Babcock-Datar-Motwani]
Sliding Window W – last w items in stream
Model – item Xt expires at time t+w
Why?
Applications may require ignoring stale data
Type of approximation
Only way to define JOIN over streams
Goal – Maintain uniform sample of size n of sliding window
Reservoir Sampling?

Observe
any item in sample S will expire eventually
must replace with random item of current window
Problem
no access to items in W-S
storing entire window requires O(w) memory
Oversampling

Backing sample B – select each item with probability θ(n log w / w)
Sample S – select n items from B at random
Upon expiry in S ⇒ replenish from B

Claim: n < |B| < n log w with high probability
Index-Set Approach
Pick random index set I = {i1, …, in} ⊆ {0, 1, …, w−1}

Sample S – items Xi with i mod w ∈ I, in current window

Example
Suppose – w=2, n=1, and I={1}
Then – sample is always Xi with odd i

Memory – only O(n)

Observe
S is uniform random sample of each window
But sample is periodic (union of arithmetic progressions)
Correlation across successive windows

Problems
Correlation may hurt in some applications
Some data (e.g. time-series) may be periodic
Chain-Sample Algorithm

Idea
Fix expiry problem in Reservoir Sampling
Advance planning for expiry of sampled items
Focus on sample size 1 – keep n independent such samples
Chain-Sampling
Add Xt to S with probability 1/min{t,w} – evict earlier sample
Initially – standard Reservoir Sampling up to time w
Pre-select Xt's replacement Xr ∈ Wt+w = {Xt+1, …, Xt+w}
o Xt expires ⇒ must replace from Wt+w
o At time r, save Xr and pre-select its own replacement ⇒ building "chain" of potential replacements
Note – if evicting earlier sample, discard its “chain” as well
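A simplified single-sample sketch of this bookkeeping is below. The chain layout and update order are my own reading of the slides, so treat it as an illustration rather than the paper's algorithm; the test checks only window membership and the initial reservoir phase:

```python
import random

def chain_sample(stream, w, rng=random):
    """Simplified chain-sampling sketch (sample size 1) over a sliding
    window of the last w items. Chain entries are (index, value) pairs;
    value is None until that index arrives. Returns the final sample."""
    chain = []  # head is the current sample; the tail is its replacement chain
    for t, x in enumerate(stream, start=1):
        # record any pre-selected replacement index reached now,
        # and pre-select that element's own eventual replacement
        for j, (idx, val) in enumerate(list(chain)):
            if idx == t and val is None:
                chain[j] = (t, x)
                chain.append((rng.randint(t + 1, t + w), None))
        # the head expires once it leaves the window {t-w+1, ..., t}
        while chain and chain[0][0] <= t - w:
            chain.pop(0)
        # the new item becomes the sample with probability 1/min(t, w)
        if rng.random() < 1.0 / min(t, w):
            chain = [(t, x), (rng.randint(t + 1, t + w), None)]
    return chain[0][1] if chain else None
```

For t ≤ w this reduces to plain size-1 reservoir sampling, and the expiry loop guarantees the returned item always lies in the final window.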
Example
3 5 1 4 6 2 8 5 2 3 5 4 2 2 5 0 9 8 4 6 7 3
(Figure: successive sliding windows over this stream, with the sampled item and its replacement chain highlighted)
Expectation for Chain-Sample
T(x) = E[chain length for Xt at time t+x]

T(x) = 1                          for x ≤ 1
T(x) = 1 + (1/w) Σ_{i<x} T(i)     for x > 1

E[chain length] = T(w) ≤ e ≈ 2.718

E[memory required for sample size n] = O(n)
Tail Bound for Chain-Sample

Chain = "hops" of total length at most w
Chain of h hops ⇒ ordered (h+1)-partition of w
(h hops of total length less than w, plus remainder)

Each partition has probability w^−h

Number of partitions: C(w, h) ≤ (we/h)^h

h = O(log w) ⇒ probability of a partition is O(w^−c)

Thus – memory O(n log w) with high probability
Comparison of Algorithms
Chain-Sample beats Oversample:
Expected memory – O(n) vs O(n log w)
High-probability memory bound – both O(n log w)
Oversample may have sample size shrink below n!
Algorithm Expected High-Probability
Periodic O(n) O(n)
Oversample O(n log w) O(n log w)
Chain-Sample O(n) O(n log w)
Sketches and Frequency Moments
Generalized Stream Model
Input Element (i,a)
a copies of domain-value i
increment to ith dimension of m by a
a need not be an integer
Negative value – captures deletions
Data stream: 2, 0, 1, 3, 1, 2, 4, . . .
m = (m0, m1, m2, m3, m4) = (1, 2, 2, 1, 1)
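Maintaining the frequency vector under these updates is a one-liner per element; a minimal sketch (names mine):

```python
from collections import defaultdict

def process(updates):
    """Maintain the frequency vector m under generalized stream updates (i, a):
    a copies of domain value i; negative a models deletions."""
    m = defaultdict(float)
    for i, a in updates:
        m[i] += a
    return dict(m)
```

Feeding the stream 2, 0, 1, 3, 1, 2, 4 as unit updates (v, 1) reproduces the vector above, and the (2,2) and (1,−1) updates of the next slide behave as shown there.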
Example
m = (m0, m1, m2, m3, m4) = (1, 2, 2, 1, 1)

On seeing element (i,a) = (2,2): m = (1, 2, 4, 1, 1)

On seeing element (i,a) = (1,-1): m = (1, 1, 4, 1, 1)
Frequency Moments

Input Stream
values from U = {0,1,…,N-1}
frequency vector m = (m0,m1,…,mN-1)
Kth Frequency Moment Fk(m) = Σi mi^k
F0: number of distinct values (Lecture 15)
F1: stream size
F2: Gini index, self-join size, Euclidean norm
Fk: for k>2, measures skew, sometimes useful
F∞: maximum frequency
Problem – estimation in small space
Sketches – randomized estimators
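Computing the moments exactly from a frequency vector is trivial (the hard part, below, is doing it in small space); a direct sketch for reference, with names mine:

```python
def frequency_moment(m, k):
    """k-th frequency moment F_k = sum_i m_i^k, over a dict m that stores
    only nonzero frequencies (so k=0 counts distinct values)."""
    return sum(v ** k for v in m.values())
```

On the example vector m = (1, 2, 2, 1, 1): F0 = 5 distinct values, F1 = 7 items, F2 = 11, and F∞ = max frequency = 2.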
Naive Approaches

Space N – counter mi for each distinct value i
Space O(1)
if input sorted by i
single counter recycled when new i value appears
Goal
Allow arbitrary input
Use small (logarithmic) space
Settle for randomization/approximation
Sketching F2
Random Hash h(i): {0,1,…,N-1} → {-1,1}

Define Zi = h(i)

Maintain X = Σi mi Zi

Easy for update streams (i,a) – just add aZi to X

Claim: X^2 is unbiased estimator for F2

Proof: E[X^2] = E[(Σi mi Zi)^2]
             = E[Σi mi^2 Zi^2] + E[Σ_{i≠j} mi mj Zi Zj]
             = Σi mi^2 E[Zi^2] + Σ_{i≠j} mi mj E[Zi]E[Zj]   (from independence)
             = Σi mi^2 + 0 = F2

Last Line? – Zi^2 = 1 and E[Zi] = 0, as Zi is uniform on {-1,1}
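A minimal sketch of this estimator in Python. For clarity the signs Z_i are stored explicitly in a dict (full independence); the lecture's space-efficient version replaces this with a 4-wise independent hash, discussed below:

```python
import random

class F2Sketch:
    """AMS-style sketch for F2 = sum_i m_i^2: maintain X = sum_i m_i * Z_i
    with random signs Z_i in {-1,+1}; X^2 is an unbiased estimator of F2."""
    def __init__(self, rng):
        self.rng = rng
        self.sign = {}  # explicit signs; a 4-wise hash would avoid O(N) memory
        self.x = 0.0

    def _z(self, i):
        if i not in self.sign:
            self.sign[i] = self.rng.choice((-1, 1))
        return self.sign[i]

    def update(self, i, a=1):
        self.x += a * self._z(i)  # handles (i, a) update streams directly

    def estimate(self):
        return self.x ** 2
```

Averaging many independent sketches over the stream 2, 0, 1, 3, 1, 2, 4 should approach F2 = 11.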
Estimation Error?

Chebyshev bound:
Define Y = X^2 ⇒ E[Y] = E[X^2] = Σi mi^2 = F2

Observe E[X^4] = E[(Σ mi Zi)^4]
 = E[Σ mi^4 Zi^4] + 4E[Σ mi mj^3 Zi Zj^3] + 6E[Σ mi^2 mj^2 Zi^2 Zj^2]
   + 12E[Σ mi mj mk^2 Zi Zj Zk^2] + 24E[Σ mi mj mk ml Zi Zj Zk Zl]
 = Σ mi^4 + 6 Σ_{i<j} mi^2 mj^2

By definition Var[Y] = E[Y^2] − E[Y]^2 = E[X^4] − E[X^2]^2
 = [Σ mi^4 + 6 Σ_{i<j} mi^2 mj^2] − [Σ mi^4 + 2 Σ_{i<j} mi^2 mj^2]
 = 4 Σ_{i<j} mi^2 mj^2 ≤ 2 E[X^2]^2 = 2 F2^2

Chebyshev: P[|Y − E[Y]| ≥ λ E[Y]] ≤ Var[Y] / (λ^2 E[Y]^2)    (Why?)
Estimation Error?

Chebyshev bound:
P[relative estimation error > λ] = P[|Y − E[Y]| ≥ λE[Y]] ≤ Var[Y] / (λ^2 E[Y]^2) ≤ 2F2^2 / (λ^2 F2^2) = 2/λ^2

Problem – What if we want λ really small?

Solution
Compute s = 8/λ^2 independent copies of X
Estimator Y = mean(Xi^2)
Variance reduces by factor s

P[relative estimation error > λ] ≤ 2F2^2 / (s λ^2 F2^2) = 1/4
Boosting Technique

Algorithm A: Randomized λ-approximate estimator f
P[(1- λ)f* ≤ f ≤ (1+ λ)f*] = 3/4
Heavy Tail Problem: the estimate may equal f*−z, f*, f*+z with probabilities 1/16, 3/4, 3/16, so averaging can be thrown off by a large z
Boosting Idea
O(log 1/ε) independent estimates from A(X)
Return median of estimates

Claim: P[median is λ-approximate] > 1 − ε

Proof:
P[specific estimate is λ-approximate] = ¾
Bad event only if >50% estimates not λ-approximate
Binomial tail – probability less than ε
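The median trick in the claim above is tiny in code; a sketch with a toy heavy-tailed estimator (of my own construction) showing why the median succeeds where the mean would not:

```python
import random

def boost_median(estimator, trials, rng):
    """Median boosting: if each call to `estimator` is lambda-approximate
    with probability >= 3/4, the median of `trials` independent estimates
    fails only if a majority fail - a binomial tail, exponentially small."""
    vals = sorted(estimator(rng) for _ in range(trials))
    return vals[len(vals) // 2]
```

With an estimator that returns the true value 100 three-quarters of the time and a wild outlier otherwise, the median of 101 trials is the true value with overwhelming probability.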
Overall Space Requirement

Observe
Let m = Σmi
Each hash needs O(log m)-bit counter
s = 8/λ^2 hash functions for each estimator

O(log 1/ε) such estimators

Total – O(λ^−2 (log 1/ε)(log m)) bits
Question – Space for storing hash function?
Sketching Paradigm

Random Sketch – inner product Sketch(f) = ⟨f, Z⟩ = Σi f(i) Zi, with
frequency vector m = (m0, m1, …, mN−1)
random vector Z (currently, uniform {−1,1})
Observe
Linearity – Sketch(m1) ± Sketch(m2) = Sketch(m1 ± m2)
Ideal for distributed computing
Observe
Suppose: Given i, can efficiently generate Zi
Then: can maintain sketch for update streams
Problem
o Must generate Zi = h(i) on first appearance of i
o Need Ω(N) memory to store h explicitly
o Need Ω(N) random bits
Two Birds, One Stone

Pairwise Independent Z1, Z2, …, Zn
for all Zi and Zk, P[Zi=x, Zk=y] = P[Zi=x]·P[Zk=y]
⇒ property E[Zi Zk] = E[Zi]·E[Zk]

Example – linear hash function
Seed S = <a,b> chosen at random from [0..p−1], where p is prime
Zi = h(i) = ai + b (mod p)

Claim: Z1, Z2, …, Zn are pairwise independent
Zi=x and Zk=y ⇔ x = ai+b (mod p) and y = ak+b (mod p)
fixing i, k, x, y ⇒ unique solution for a, b
P[Zi=x, Zk=y] = 1/p^2 = P[Zi=x]·P[Zk=y]

Memory/Randomness: n log p → 2 log p
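The linear hash family is two random draws and one line of arithmetic; a sketch (the prime and names are mine), plus a brute-force check of pairwise independence by enumerating every seed:

```python
import random

def make_linear_hash(p, rng):
    """Pairwise-independent linear hash: draw seed <a, b> once from
    [0..p-1] (p prime), then Z_i = (a*i + b) mod p."""
    a = rng.randrange(p)
    b = rng.randrange(p)
    return lambda i: (a * i + b) % p
```

For p = 5 and indices i = 1, k = 2, every pair (x, y) arises from exactly one of the p^2 seeds, so P[Z1=x, Z2=y] = 1/p^2 exactly.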
Wait a Minute!

Doesn't pairwise independence screw up proofs?
No – E[X2] calculation only has degree-2 terms
But – what about Var[X2]?
Need 4-wise independence
Application – Join-Size Estimation
Given
Join attribute frequencies f1 and f2
Join size = f1.f2
Define – X1 = f1.Z and X2 = f2.Z
Choose – Z as 4-wise independent & uniform {-1,1}
Exercise: Show, as before,
E[X1 X2] = f1.f2
Var[X1 X2] ≤ 2 |f1|^2 |f2|^2
Hint: a.b ≤ |a|.|b|
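The product of the two sketches X1·X2 is the join-size estimator; a toy sketch (my own construction) with an exact unbiasedness check by enumerating all sign assignments:

```python
def sketch(f, z):
    """Sketch X = <f, Z>: inner product of a frequency vector f (a dict)
    with a sign vector z mapping each value to -1 or +1."""
    return sum(m * z[i] for i, m in f.items())
```

With f1 = {0:1, 1:2} and f2 = {0:3, 1:1}, the true join size is f1·f2 = 1·3 + 2·1 = 5, and averaging X1·X2 over all four sign assignments on {0,1} gives exactly 5.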
Bounding Error Probability

Using s copies of X's & taking their mean Y
P[|Y − f1·f2| ≥ λ f1·f2] ≤ Var(Y) / (λ^2 (f1·f2)^2)
                         ≤ 2 |f1|^2 |f2|^2 / (s λ^2 (f1·f2)^2)
                         = 2 / (s λ^2 cos^2 θ)

Bounding error probability?
Need – s > 2/(λ^2 cos^2 θ)

Memory? – O((log 1/ε) cos^−2 θ λ^−2 (log N + log m))

Problem
To choose s – need a-priori lower bound on cos θ = f1·f2 / (|f1| |f2|)
What if cos θ is really small?
Sketch Partitioning
(Figure: frequency histograms over dom(R1.A) and dom(R2.B), each with a few high-frequency values of about 10-12 and many small ones, split into two partitions)
self-join(R1.A)*self-join(R2.B) = 205*205 = 42K
self-join(R1.A)*self-join(R2.B) + self-join(R1.A)*self-join(R2.B) = 200*5 +200*5 = 2K
Idea for dealing with the |f1|^2 |f2|^2 / (f1·f2)^2 issue – partition the domain into regions where the self-join size is smaller, to compensate for small join size (cos θ)
Sketch Partitioning

Idea
Idea
intelligently partition join-attribute space
need coarse statistics on stream
build independent sketches for each partition
Estimate = Σ partition sketches
Variance = Σ partition variances
Sketch Partitioning
Partition Space Allocation?
Can solve optimally, given domain partition
Optimal Partition: Find K-partition to minimize Σ_{i=1}^{K} √Var[Xi], where each Var[Xi] is governed by the partition's self-join sizes

Results

Dynamic Programming – optimal solution for single join

NP-hard – for queries with multiple joins
Fk for k > 2
Assume – stream length m is known
(Exercise: Show this can be fixed, with log m space overhead, by a repeated-doubling estimate of m.)
Choose – random stream item a_p, with p uniform from {1,2,…,m}

Suppose – a_p = v ∈ {0,1,…,N−1}

Count subsequent frequency of v
r = |{q : q ≥ p, a_q = v}|

Define X = m(r^k − (r−1)^k)
Example

Stream
7,8,5,1,7,5,2,1,5,4,5,10,6,5,4,1,4,7,3,8
m = 20
p = 9
ap = 5
r = 3
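The estimator is easy to state in code (a sketch; names are mine). A nice sanity check: averaging X over every choice of p telescopes exactly to Fk, which is the unbiasedness argument of the next slide:

```python
def fk_estimate(stream, p, k):
    """Estimator X for F_k: pick position p (1-indexed), let v = stream[p-1],
    let r = frequency of v from p onward; return X = m * (r^k - (r-1)^k)."""
    m = len(stream)
    v = stream[p - 1]
    r = sum(1 for q in range(p - 1, m) if stream[q] == v)
    return m * (r ** k - (r - 1) ** k)
```

On the slide's stream, p = 9 gives v = 5 and r = 3; for k = 2, the mean of X over all 20 positions equals F2 = 60 exactly.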
Fk for k > 2
E[X] = (1/m)·m·[(1^k + (2^k − 1^k) + … + (m1^k − (m1−1)^k))
              + (1^k + (2^k − 1^k) + … + (m2^k − (m2−1)^k))
              + …
              + (1^k + (2^k − 1^k) + … + (mN^k − (mN−1)^k))]
     = Σi mi^k = Fk      (summing over the m choices of stream elements)

Var(X) ≤ k N^(1−1/k) Fk^2

Bounded Error Probability ⇒ s = O(k N^(1−1/k) / λ^2)

Boosting ⇒ memory bound O(k N^(1−1/k) λ^−2 (log 1/ε)(log N + log m))
Frequency Moments

F0 – distinct values problem (Lecture 15)
F1 – sequence length
for case with deletions, use Cauchy distribution

F2 – self-join size/Gini index (Today)

Fk for k > 2
omitting grungy details
can achieve space bound O(k N^(1−1/k) λ^−2 (log 1/ε)(log N + log m))
F∞ – maximum frequency
Communication Complexity
Cooperatively compute function f(A,B)
Minimize bits communicated
Unbounded computational power
Communication Complexity C(f) – bits exchanged by optimal protocol Π
Protocols?
1-way versus 2-way
deterministic versus randomized
Cδ(f) – randomized complexity for error probability δ
(Figure: ALICE with input A exchanging messages with BOB with input B)
Streaming & Communication Complexity
Stream Algorithm ⇒ 1-way communication protocol

Simulation Argument
Given – algorithm S computing f over streams
Alice – initiates S, providing A as input stream prefix
Communicates to Bob – S’s state after seeing A
Bob – resumes S, providing B as input stream suffix
Theorem – Stream algorithm’s space requirement is at least the communication complexity C(f)
Example: Set Disjointness

Set Disjointness (DIS)
A, B subsets of {1,2,…,N}
Output – DIS(A,B) = 1 if A ∩ B = φ, 0 if A ∩ B ≠ φ

Theorem: Cδ(DIS) = Ω(N), for any δ < 1/2
Lower Bound for F∞
Theorem: Fix ε<1/3, δ<1/2. Any stream algorithm S with
P[ (1-ε)F∞ < S < (1+ε)F∞ ] > 1-δ
needs Ω(N) space
Proof
Claim: S ⇒ 1-way protocol for DIS (on any sets A and B)
Alice streams set A to S
Communicates S’s state to Bob
Bob streams set B to S
Observe
F∞ = 1 if A ∩ B = φ; F∞ = 2 if A ∩ B ≠ φ
Relative error ε < 1/3 ⇒ DIS solved exactly!
P[error] < δ < ½ ⇒ Ω(N) space
Extensions

Observe
Used only 1-way communication in proof
Cδ(DIS) bound was for arbitrary communication
Exercise – extend lower bound to multi-pass algorithms
Lower Bound for Fk, k>2
Need to increase gap beyond 2
Multiparty Set Disjointness – t players
Theorem: Fix ε, δ < ½ and k > 5. Any stream algorithm S with
P[ (1−ε)Fk < S < (1+ε)Fk ] > 1−δ
needs Ω(N^(1−(2+δ)/k)) space

Implies Ω(N^(1/2)) even for multi-pass algorithms
Tracking High-Frequency Items
Problem 1 – Top-K List [Charikar-Chen-Farach-Colton]
The Google Problem
Return list of k most frequent items in stream
Motivation
search engine queries, network traffic, …
Remember
Saw lower bound recently!
Solution
Data structure Count-Sketch maintaining count-estimates of high-frequency elements
Definitions

Notation
Assume {1, 2, …, N} in order of frequency
mi is frequency of ith most frequent element
m = Σmi is number of elements in stream
FindCandidateTop
Input: stream S, int k, int p
Output: list of p elements containing top k
Naive sampling gives a solution with p = θ(m log k / mk)

FindApproxTop
Input: stream S, int k, real ε
Output: list of k elements, each of frequency mi > (1−ε) mk
Naive sampling gives no solution
Main Idea

Consider
single counter X
hash function h(i): {1, 2,…,N} → {-1,+1}

Input element i ⇒ update counter X += Zi = h(i)

For each r, use X·Zr as estimator of mr

Theorem: E[X·Zr] = mr

Proof
X = Σi mi Zi
E[X·Zr] = E[Σi mi Zi Zr] = Σi mi E[Zi Zr] = mr E[Zr^2] = mr
(cross-terms cancel)
Finding Max Frequency Element

Problem – Var[X] = F2 = Σi mi^2
Idea – t counters, with independent 4-wise hashes h1,…,ht: i → {+1, −1}

Use t = O(log m · Σi mi^2 / m1^2)

Claim: New variance < (Σi mi^2) / t = m1^2 / log m

Overall Estimator
repeat + median of averages
with high probability, approximate m1
Problem with “Array of Counters”
Variance – dominated by the highest frequency
Estimates for less-frequent elements, like element k, are corrupted by the higher frequencies
variance >> m_k²
Avoiding Collisions?
spread out the high-frequency elements
replace each counter with a hashtable of b counters
Count Sketch
Hash Functions
4-wise independent hashes h_1, …, h_t and s_1, …, s_t
hashes independent of each other
Data structure: t hashtables of b counters X(r, c)
s_1: i → {1, …, b}    h_1: i → {+1, -1}
…
s_t: i → {1, …, b}    h_t: i → {+1, -1}
(counters in each hashtable indexed 1, 2, …, b)
Overall Algorithm
s_r(i) – one of the b counters in the rth hashtable
Input i ⇒ for each r, update X(r, s_r(i)) += h_r(i)
Estimator(m_i) = median_r { X(r, s_r(i)) · h_r(i) }
Maintain heap of top k elements seen so far
Observe
Collisions with high-frequency items are not completely eliminated
A few of the estimates X(r, s_r(i)) · h_r(i) could have high variance
Median is not sensitive to these poor estimates
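The data structure and estimator above can be sketched in Python. This is an illustrative toy (class and method names are ours, not the lecture's): for simplicity it memoizes fully random hash values per item, whereas a space-efficient implementation would use true 4-wise independent hash families.

```python
import random
import statistics

class CountSketch:
    """Toy count sketch: t hashtables of b counters with per-row signs."""

    def __init__(self, t, b, seed=0):
        self.t, self.b = t, b
        self.table = [[0] * b for _ in range(t)]
        self._rng = random.Random(seed)
        self._bucket = [dict() for _ in range(t)]  # s_r: i -> {0, ..., b-1}
        self._sign = [dict() for _ in range(t)]    # h_r: i -> {+1, -1}

    def _hashes(self, r, i):
        # Lazily draw (and memoize) s_r(i) and h_r(i) for item i in row r.
        if i not in self._bucket[r]:
            self._bucket[r][i] = self._rng.randrange(self.b)
            self._sign[r][i] = self._rng.choice((-1, 1))
        return self._bucket[r][i], self._sign[r][i]

    def update(self, i):
        # Input i => for each row r, X(r, s_r(i)) += h_r(i)
        for r in range(self.t):
            c, z = self._hashes(r, i)
            self.table[r][c] += z

    def estimate(self, i):
        # median over rows of X(r, s_r(i)) * h_r(i)
        vals = []
        for r in range(self.t):
            c, z = self._hashes(r, i)
            vals.append(self.table[r][c] * z)
        return statistics.median(vals)

# Example: one heavy item among many light ones.
cs = CountSketch(t=9, b=64, seed=1)
stream = [1] * 1000 + [j for j in range(2, 51) for _ in range(20)]
for x in stream:
    cs.update(x)
print(cs.estimate(1))   # close to the true frequency 1000
```

The median over the t rows damps the occasional row in which item 1's bucket collides with another frequent item, which is exactly the "poor estimates" observation above.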
Avoiding Large Items
b = Ω(k) ⇒ with probability Ω(1), no collision with the top-k elements
t hashtables represent independent trials
Need O(log(m/δ)) trials to estimate with probability 1-δ
Also need – small variance for colliding small elements
Claim:
P[variance due to small items in each estimate < (Σ_{i>k} m_i²)/b] = Ω(1)
Final bound: b = O(k + (Σ_{i>k} m_i²) / (ε·m_k)²)
Final Results
Zipfian Distribution: m_i ∝ 1/i^z [Power Law]
FindApproxTop
Space: O([k + (Σ_{i>k} m_i²) / (ε·m_k)²] · log(m/δ))
Roughly: the sampling bound with frequencies squared
Zipfian distribution gives improved results
FindCandidateTop
Zipf parameter z > 0.5
Space: O(k log N · log m)
Compare: sampling bound O((kN)^0.5 log k)
Problem 2 – Elephants-and-Ants
[Manku-Motwani]
Identify items whose current frequency exceeds support threshold s = 0.1%
[Jacobson 2000, Estan-Verghese 2001]
Stream
Algorithm 1: Lossy Counting
Step 1: Divide the stream into ‘windows’
Window-size w is function of support s – specify later…
Window 1 Window 2 Window 3
Lossy Counting in Action …
Frequency counts start out empty
Add the counts from the first window
At window boundary, decrement all counters by 1
Lossy Counting (continued)
Add the counts from the next window
At window boundary, decrement all counters by 1
Error Analysis
If current size of stream = N and window-size w = 1/ε, then # windows = εN
How much do we undercount? Each counter is decremented at most once per window, so the frequency error is at most # windows = εN
Rule of thumb: Set ε = 10% of support s
Example: Given support frequency s = 1%, set error frequency ε = 0.1%
Output: Elements with counter values exceeding (s-ε)N
Approximation guarantees
Frequencies underestimated by at most εN
No false negatives
False positives have true frequency at least (s-ε)N
Putting it all together…
How many counters do we need?
Worst-case bound: (1/ε) log(εN) counters
Implementation details…
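One possible implementation sketch (illustrative Python, our own function names, not the paper's code): counters live in a dict, and every window boundary decrements all of them, deleting those that hit zero.

```python
def lossy_counting(stream, epsilon):
    """Lossy Counting with window size w = 1/epsilon.

    Returns surviving counters (element -> count); each count
    undercounts the true frequency by at most epsilon * N.
    """
    w = max(1, round(1 / epsilon))   # round guards against float error
    counts = {}
    for n, x in enumerate(stream, start=1):
        counts[x] = counts.get(x, 0) + 1
        if n % w == 0:                      # window boundary
            for key in list(counts):
                counts[key] -= 1            # decrement every counter
                if counts[key] == 0:
                    del counts[key]         # drop counters that hit zero
    return counts

def frequent(stream, s, epsilon):
    """Output elements whose counter exceeds (s - epsilon) * N."""
    N = len(stream)
    counts = lossy_counting(stream, epsilon)
    return {x for x, c in counts.items() if c > (s - epsilon) * N}

# Example: N = 1000, 'a' has true frequency 100 (10%).
stream = ['a'] * 100 + [f'x{i}' for i in range(900)]
print(lossy_counting(stream, 0.005)['a'])        # undercounts by <= eps*N = 5
print(frequent(stream, s=0.05, epsilon=0.005))   # {'a'}
```

Singleton counters die at the very next window boundary, which is what keeps the number of live counters within the (1/ε) log(εN) worst-case bound.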
Number of Counters?
Window size w = 1/ε
Number of windows m = εN
n_i – # counters alive over the last i windows
Fact: Σ_{i=1..j} i·n_i ≤ j·w, for j = 1, 2, …, m
Claim (# active counters): Σ_{i=1..m} n_i ≤ Σ_{i=1..m} w/i ≤ w log m = (1/ε) log εN
A counter must average 1 increment per window to survive
Enhancements
Frequency Errors
For counter (X, c), true frequency in [c, c+εN]
Trick: Track number of windows t the counter has been active
For counter (X, c, t), true frequency in [c, c+t-1]
Batch Processing
Decrements after k windows
If (t = 1), no error!
Algorithm 2: Sticky Sampling
Stream
Create counters by sampling
Maintain exact counts thereafter
What is sampling rate?
Sticky Sampling (continued)
For finite stream of length N
Sampling rate = (2/εN) log(1/(sδ)), where δ = probability of failure
Same rule of thumb: Set ε = 10% of support s
Example: Given support threshold s = 1%, set error threshold ε = 0.1% and failure probability δ = 0.01%
Output: Elements with counter values exceeding (s-ε)N
Same error guarantees as Lossy Counting, but probabilistic
Approximation guarantees (probabilistic)
Frequencies underestimated by at most εN
No false negatives
False positives have true frequency at least (s-ε)N
Number of Counters?
Finite stream of length N: sampling rate (2/εN) log(1/(sδ))
Infinite stream with unknown N: gradually adjust sampling rate
In either case, expected number of counters = (2/ε) log(1/(sδ)) – independent of N
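A rough Python sketch of this scheme (our illustrative code; it assumes the doubling rate schedule of the Manku-Motwani paper with t = (1/ε)·log(1/(sδ)), and all names are ours):

```python
import math
import random

def sticky_sampling(stream, s, epsilon, delta, seed=0):
    """Toy Sticky Sampling: counters are created by sampling at rate 1/r
    and maintained exactly; r doubles after each phase, keeping the
    expected number of counters at roughly (2/epsilon) * log(1/(s*delta))."""
    rng = random.Random(seed)
    t = (1 / epsilon) * math.log(1 / (s * delta))
    counts = {}
    r = 1
    phase_end = 2 * t              # first phase: 2t elements at rate 1
    for n, x in enumerate(stream, start=1):
        if n > phase_end:          # rate doubles at each phase boundary
            r *= 2
            phase_end *= 2
            for key in list(counts):
                # Toss unbiased coins until heads; each tails decrements
                # the counter (simulates re-sampling at the lower rate).
                c = counts[key]
                while c > 0 and rng.random() < 0.5:
                    c -= 1
                if c > 0:
                    counts[key] = c
                else:
                    del counts[key]
        if x in counts:
            counts[x] += 1                   # existing counters are exact
        elif rng.random() < 1.0 / r:
            counts[x] = 1                    # create counter w.p. 1/r
    return counts

def frequent_sticky(stream, s, epsilon, delta):
    """Output elements with counter values exceeding (s - epsilon) * N."""
    N = len(stream)
    counts = sticky_sampling(stream, s, epsilon, delta)
    return {x for x, c in counts.items() if c >= (s - epsilon) * N}

# Example: this short stream fits inside the first phase (rate 1),
# so counts are exact and the heavy item is reported.
res = frequent_sticky(['a'] * 60 + [f'x{i}' for i in range(300)],
                      s=0.1, epsilon=0.02, delta=0.1)
print(res)   # {'a'}
```

Note that once an element has a counter, it is counted exactly from then on, which is why the counter values only ever undercount the true frequency.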
References – Synopses
Synopsis Data Structures for Massive Data Sets. Gibbons and Matias. DIMACS 1999.
Tracking Join and Self-Join Sizes in Limited Storage, Alon, Gibbons, Matias, and Szegedy. PODS 1999.
Join Synopses for Approximate Query Answering, Acharya, Gibbons, Poosala, and Ramaswamy. SIGMOD 1999.
Random Sampling for Histogram Construction: How much is enough? Chaudhuri, Motwani, and Narasayya. SIGMOD 1998.
Random Sampling Techniques for Space Efficient Online Computation of Order Statistics of Large Datasets, Manku, Rajagopalan, and Lindsay. SIGMOD 1999.
Space-efficient online computation of quantile summaries, Greenwald and Khanna. SIGMOD 2001.
References – Sampling
Random Sampling with a Reservoir. Vitter. Transactions on Mathematical Software 11(1):37-57 (1985).
On Sampling and Relational Operators. Chaudhuri and Motwani. Bulletin of the Technical Committee on Data Engineering (1999).
On Random Sampling over Joins. Chaudhuri, Motwani, and Narasayya. SIGMOD 1999.
Congressional Samples for Approximate Answering of Group-By Queries, Acharya, Gibbons, and Poosala. SIGMOD 2000.
Overcoming Limitations of Sampling for Aggregation Queries, Chaudhuri, Das, Datar, Motwani and Narasayya. ICDE 2001.
A Robust Optimization-Based Approach for Approximate Answering of Aggregate Queries. Chaudhuri, Das, and Narasayya. SIGMOD 2001.
Sampling From a Moving Window Over Streaming Data. Babcock, Datar, and Motwani. SODA 2002.
Sampling Algorithms: Lower Bounds and Applications. Bar-Yossef, Kumar, and Sivakumar. STOC 2001.
References – Sketches
Probabilistic Counting Algorithms for Data Base Applications. Flajolet and Martin. JCSS (1985).
The space complexity of approximating the frequency moments. Alon, Matias, and Szegedy. STOC 1996.
Approximate Frequency Counts over Streaming Data. Manku and Motwani. VLDB 2002.
Finding Frequent Items in Data Streams. Charikar, Chen, and Farach-Colton. ICALP 2002.
An Approximate L1-Difference Algorithm for Massive Data Streams. Feigenbaum, Kannan, Strauss, and Viswanathan. FOCS 1999.
Stable Distributions, Pseudorandom Generators, Embeddings and Data Stream Computation. Indyk. FOCS 2000.