Mergeable Summaries
Transcript of Mergeable Summaries
![Page 1: Mergeable Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062410/56816552550346895dd7cc71/html5/thumbnails/1.jpg)
Mergeable Summaries
Ke Yi (HKUST)
Pankaj Agarwal (Duke), Graham Cormode (Warwick), Zengfeng Huang (HKUST), Jeff Phillips (Utah), Zhewei Wei (Aarhus)
+ = ?
![Page 2: Mergeable Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062410/56816552550346895dd7cc71/html5/thumbnails/2.jpg)
Small summaries for BIG data
• Allow approximate computation with guarantees and small space – save space, time, and communication
• Tradeoff between error and size
![Page 3: Mergeable Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062410/56816552550346895dd7cc71/html5/thumbnails/3.jpg)
Summaries
• Summaries allow approximate computations:
– Random sampling
– Sketches (JL transform, AMS, Count-Min, etc.)
– Frequent items
– Quantiles & histograms (ε-approximation)
– Geometric coresets
– …
![Page 4: Mergeable Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062410/56816552550346895dd7cc71/html5/thumbnails/4.jpg)
Mergeability
• Ideally, summaries are algebraic: associative, commutative
– Allows arbitrary computation trees (shape and size unknown to the algorithm)
– Quality remains the same
– Similar to the MUD model [Feldman et al. SODA'08]
• Summaries should have bounded size
– Ideally, independent of base data size
– Sublinear in base data (logarithmic, square root)
– Rules out the "trivial" solution of keeping the union of the inputs
• Generalizes the streaming model
![Page 5: Mergeable Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062410/56816552550346895dd7cc71/html5/thumbnails/5.jpg)
Application of mergeability: large-scale distributed computation
Programmers have no control over how things are merged
![Page 6: Mergeable Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062410/56816552550346895dd7cc71/html5/thumbnails/6.jpg)
MapReduce
![Page 7: Mergeable Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062410/56816552550346895dd7cc71/html5/thumbnails/7.jpg)
Dremel
![Page 8: Mergeable Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062410/56816552550346895dd7cc71/html5/thumbnails/8.jpg)
Pregel: Combiners
![Page 9: Mergeable Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062410/56816552550346895dd7cc71/html5/thumbnails/9.jpg)
Sensor networks
![Page 10: Mergeable Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062410/56816552550346895dd7cc71/html5/thumbnails/10.jpg)
Summaries to be merged
• Random samples – easy
• Sketches – easy
• MinHash – easy
• Heavy hitters – easy and cute
• ε-approximations (quantiles, equi-height histograms) – easy algorithm, analysis requires work
![Page 11: Mergeable Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062410/56816552550346895dd7cc71/html5/thumbnails/11.jpg)
Merging random samples
+
N1 = 10 N2 = 15
With prob. N1/(N1+N2), take a sample from S1
S1
S2
![Page 12: Mergeable Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062410/56816552550346895dd7cc71/html5/thumbnails/12.jpg)
Merging random samples
+
N1 = 9 N2 = 15
With prob. N1/(N1+N2), take a sample from S1
S1
S2
![Page 13: Mergeable Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062410/56816552550346895dd7cc71/html5/thumbnails/13.jpg)
Merging random samples
+
N1 = 9 N2 = 15
With prob. N2/(N1+N2), take a sample from S2
S1
S2
![Page 14: Mergeable Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062410/56816552550346895dd7cc71/html5/thumbnails/14.jpg)
Merging random samples
+
N1 = 9 N2 = 14
With prob. N2/(N1+N2), take a sample from S2
S1
S2
![Page 15: Mergeable Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062410/56816552550346895dd7cc71/html5/thumbnails/15.jpg)
Merging random samples
+
N1 = 8 N2 = 12
With prob. N2/(N1+N2), take a sample from S2
S1
S2
N3 = 15
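The merge procedure in these frames can be sketched in a few lines of Python. This is an illustrative sketch (the function name `merge_samples` is my own): each summary is a size-k random sample without replacement, stored as a list, together with the size N of the dataset it was drawn from.

```python
import random

def merge_samples(s1, n1, s2, n2, k):
    """Merge two size-k random samples (without replacement), drawn from
    datasets of sizes n1 and n2, into one size-k sample of the union."""
    s1, s2 = list(s1), list(s2)
    random.shuffle(s1)   # consume each sample in random order
    random.shuffle(s2)
    merged = []
    while len(merged) < k:
        # With probability n1/(n1+n2), the next point comes from S1,
        # mirroring the chance a random union element lies in dataset 1.
        if random.random() < n1 / (n1 + n2):
            merged.append(s1.pop())
            n1 -= 1      # that dataset element is now accounted for
        else:
            merged.append(s2.pop())
            n2 -= 1
    return merged
```

Note that the dataset sizes n1, n2 are decremented as points are drawn, exactly as in the slides (N1 goes 10 → 9 → 8, N2 goes 15 → 14 → …).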
![Page 16: Mergeable Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062410/56816552550346895dd7cc71/html5/thumbnails/16.jpg)
Merging sketches
• Linear sketches (random projections) are easily mergeable
• Data is a multiset over domain [1..U], represented as a vector x[1..U]
• Count-Min sketch:
– Creates a small summary as an array CM[i,j] of size w × d
– Uses d hash functions h to map vector entries to [1..w]
• Trivially mergeable: CM(x + y) = CM(x) + CM(y)
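As a minimal sketch of this idea (my own rendering, not the authors' code): the merge is entry-wise addition of the d × w arrays, and it is only valid when both sketches use the same width, depth, and hash functions (here enforced by a shared seed).

```python
import random

class CountMin:
    def __init__(self, w, d, seed=0):
        self.w, self.d = w, d
        rnd = random.Random(seed)  # same seed => same hash functions
        self.salts = [rnd.randrange(1 << 31) for _ in range(d)]
        self.cm = [[0] * w for _ in range(d)]

    def _bucket(self, i, x):
        # i-th hash function: salted hash mapped into [0, w)
        return hash((self.salts[i], x)) % self.w

    def update(self, x, count=1):
        for i in range(self.d):
            self.cm[i][self._bucket(i, x)] += count

    def estimate(self, x):
        # Count-Min returns the minimum over the d counters for x
        return min(self.cm[i][self._bucket(i, x)] for i in range(self.d))

    def merge(self, other):
        # CM(x + y) = CM(x) + CM(y): entry-wise addition, valid only when
        # both sketches share width, depth, and hash functions.
        assert (self.w, self.d, self.salts) == (other.w, other.d, other.salts)
        for i in range(self.d):
            for j in range(self.w):
                self.cm[i][j] += other.cm[i][j]
```

Because the sketch is linear, merging two sketches gives exactly the sketch of the combined stream.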
![Page 17: Mergeable Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062410/56816552550346895dd7cc71/html5/thumbnails/17.jpg)
MinHash
Hash function h
Retain the k elements with smallest hash values
Trivially mergeable
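A sketch of the k-smallest-hash-values summary and its merge (function names are illustrative; the one assumption is a single fixed hash function h shared by all summaries):

```python
import hashlib
import heapq

def h(x):
    # Fixed hash function shared by all summaries (first 8 bytes of SHA-1).
    return int.from_bytes(hashlib.sha1(str(x).encode()).digest()[:8], "big")

def minhash(items, k):
    """Retain the k distinct elements with the smallest hash values."""
    return heapq.nsmallest(k, set(items), key=h)

def merge_minhash(m1, m2, k):
    # Trivially mergeable: the k smallest hash values over the union of the
    # two summaries are exactly the k smallest over the union of the inputs.
    return heapq.nsmallest(k, set(m1) | set(m2), key=h)
```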
![Page 18: Mergeable Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062410/56816552550346895dd7cc71/html5/thumbnails/18.jpg)
Summaries to be merged
• Random samples – easy
• Sketches – easy
• MinHash – easy
• Heavy hitters – easy and cute
• ε-approximations (quantiles, equi-height histograms) – easy algorithm, analysis requires work
![Page 19: Mergeable Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062410/56816552550346895dd7cc71/html5/thumbnails/19.jpg)
Heavy hitters
• The Misra-Gries (MG) algorithm finds up to k items that occur more than a 1/k fraction of the time in a stream [MG'82]
– Estimates their frequencies with additive error N/(k+1)
• Keep k candidate items in hand. For each item in the stream:
– If the item is monitored, increase its counter
– Else, if < k items are monitored, add the new item with count 1
– Else, decrease all counts by 1
(Example figure: items 1–9, k = 5)
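The update rule above can be sketched directly (a minimal Python version of the MG update, my own rendering):

```python
def misra_gries(stream, k):
    """Misra-Gries: maintain at most k candidate items with counters."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1        # item already monitored
        elif len(counters) < k:
            counters[x] = 1         # room for a new candidate
        else:
            # Decrement every counter by 1; drop those that reach zero.
            for y in list(counters):
                counters[y] -= 1
                if counters[y] == 0:
                    del counters[y]
    return counters
```

Each stored count is a lower bound on the item's true frequency, off by at most N/(k+1).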
![Page 20: Mergeable Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062410/56816552550346895dd7cc71/html5/thumbnails/20.jpg)
![Page 21: Mergeable Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062410/56816552550346895dd7cc71/html5/thumbnails/21.jpg)
![Page 22: Mergeable Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062410/56816552550346895dd7cc71/html5/thumbnails/22.jpg)
Streaming MG analysis
• N = total input size
• Previous analyses show that the error is at most:
– N/(k+1) [MG'82] – the standard bound, but too weak
– F1^res(k)/(k+1) [Berinde et al. TODS'10] – too strong
• M = sum of counters in the data structure
• Error in any estimated count is at most (N−M)/(k+1)
– Each estimated count is a lower bound on the true count
– Each decrement is spread over (k+1) items: 1 new one and k in the MG summary
– Equivalent to deleting (k+1) distinct items from the stream
– So there are at most (N−M)/(k+1) decrement operations
– Hence at most (N−M)/(k+1) copies of any item can have been "deleted"
– So the estimated counts have at most this much error
![Page 23: Mergeable Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062410/56816552550346895dd7cc71/html5/thumbnails/23.jpg)
Merging two MG summaries
• Merging algorithm:
– Merge the two sets of k counters in the obvious way
– Take the (k+1)-th largest counter Ck+1, and subtract it from all counters
– Delete non-positive counters
(Example figure: items 1–9, k = 5)
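The three steps can be sketched as follows (summaries represented as plain dicts; `merge_mg` is an illustrative name):

```python
def merge_mg(c1, c2, k):
    """Merge two Misra-Gries summaries, keeping at most k counters."""
    # 1. Merge the two counter sets in the obvious way (add shared counts).
    merged = dict(c1)
    for x, c in c2.items():
        merged[x] = merged.get(x, 0) + c
    # 2. Find the (k+1)-th largest counter (0 if there are at most k).
    counts = sorted(merged.values(), reverse=True)
    ck1 = counts[k] if len(counts) > k else 0
    # 3. Subtract it from every counter and delete non-positive ones.
    return {x: c - ck1 for x, c in merged.items() if c - ck1 > 0}
```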
![Page 24: Mergeable Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062410/56816552550346895dd7cc71/html5/thumbnails/24.jpg)
Merging two MG summaries
• This algorithm gives mergeability:
– The merge step subtracts at least (k+1)·Ck+1 from the counter sums
– So (k+1)·Ck+1 ≤ M1 + M2 − M12, where M12 is the sum of the remaining (at most k) counters
– By induction, the error is ((N1−M1) + (N2−M2) + (M1+M2−M12))/(k+1) = ((N1+N2) − M12)/(k+1), i.e., (prior error) + (error from the merge), as claimed
(Example figure: items 1–9, k = 5, with Ck+1 marked)
![Page 25: Mergeable Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062410/56816552550346895dd7cc71/html5/thumbnails/25.jpg)
Comparison with previous merging algorithms
• Two previous merging algorithms for MG [Manjhi et al. SIGMOD'05, ICDE'05]
– No guarantee on size
– Error increases after each merge
– Need to know the size or the height of the merging tree in advance, for provisioning
![Page 26: Mergeable Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062410/56816552550346895dd7cc71/html5/thumbnails/26.jpg)
Comparison with previous merging algorithms
• Experiment on a BFS routing tree over 1024 randomly deployed sensor nodes
• Data: Zipf distribution
![Page 27: Mergeable Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062410/56816552550346895dd7cc71/html5/thumbnails/27.jpg)
Comparison with previous merging algorithms
• On a contrived example
![Page 28: Mergeable Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062410/56816552550346895dd7cc71/html5/thumbnails/28.jpg)
SpaceSaving: another heavy hitter summary
• At least 10+ papers on this problem
• The SpaceSaving (SS) summary also keeps k counters [Metwally et al. TODS'06]
– If a stream item is not in the summary, overwrite the item with the least count (and increment that counter)
– SS seems to perform better in practice than MG
• Surprising observation: SS is actually isomorphic to MG!
– An SS summary with k+1 counters has the same information as an MG summary with k counters
– SS outputs an upper bound on the count, which tends to be tighter than the MG lower bound
• The isomorphism is proved inductively
– Show that every update maintains the isomorphism
• Immediate corollary: SS is mergeable
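A minimal sketch of the SS update rule (my own rendering): the evicted item's counter is inherited by the newcomer, which is why each stored count is an upper bound on the item's true frequency.

```python
def space_saving(stream, k):
    """SpaceSaving: k counters; evict the minimum-count item on overflow."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k:
            counters[x] = 1
        else:
            # Overwrite the item with the least count; the new item
            # inherits (and increments) that counter, so stored counts
            # are upper bounds on true counts.
            y = min(counters, key=counters.get)
            counters[x] = counters.pop(y) + 1
    return counters
```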
![Page 29: Mergeable Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062410/56816552550346895dd7cc71/html5/thumbnails/29.jpg)
Summaries to be merged
• Random samples – easy
• Sketches – easy
• MinHash – easy
• Heavy hitters – easy and cute
• ε-approximations (quantiles, equi-height histograms) – easy algorithm, analysis requires work
![Page 30: Mergeable Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062410/56816552550346895dd7cc71/html5/thumbnails/30.jpg)
ε-approximations: a more "uniform" sample
• A "uniform" sample needs 1/ε sample points
• A random sample needs Θ(1/ε²) sample points (with constant probability)
Guarantee: | (# sample points in range)/(# all sample points) − (# data points in range)/(# all data points) | ≤ ε
![Page 31: Mergeable Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062410/56816552550346895dd7cc71/html5/thumbnails/31.jpg)
Quantiles (order statistics)
• Quantiles generalize the median:
– Exact answer: CDF⁻¹(φ) for 0 < φ < 1
– Approximate version: tolerate an answer in CDF⁻¹(φ−ε)…CDF⁻¹(φ+ε)
– An ε-approximation solves the dual problem: estimate CDF(x) to within ε
• Binary search to find quantiles
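For intuition, reading a φ-quantile off a summary with this guarantee is just a rank lookup (an illustrative sketch; `summary` is assumed to be an ε-approximation of the data, so the returned element's rank in the data is within εn of φn):

```python
def quantile(summary, phi):
    """Estimate the phi-quantile from an epsilon-approximation: the element
    at relative rank phi within the summary is within epsilon of the true
    phi-quantile, by the epsilon-approximation guarantee."""
    s = sorted(summary)
    return s[min(int(phi * len(s)), len(s) - 1)]
```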
![Page 32: Mergeable Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062410/56816552550346895dd7cc71/html5/thumbnails/32.jpg)
Quantiles give an equi-height histogram
• Automatically adapts to skewed data distributions
• Equi-width histograms (fixed binning) are trivially mergeable, but do not adapt to the data distribution
![Page 33: Mergeable Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062410/56816552550346895dd7cc71/html5/thumbnails/33.jpg)
Previous quantile summaries
• Streaming
– At least 10+ papers on this problem
– GK algorithm: O(1/ε log n) [Greenwald and Khanna, SIGMOD'01]
– Randomized algorithm: O(1/ε log³(1/ε)) [Suri et al. DCG'06]
• Mergeable
– Q-digest: O(1/ε log U) [Shrivastava et al. SenSys'04]; requires a fixed universe of size U
– [Greenwald and Khanna, PODS'04]: error increases after each merge (not truly mergeable!)
– New: O(1/ε log^1.5(1/ε)); works in the comparison model
![Page 34: Mergeable Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062410/56816552550346895dd7cc71/html5/thumbnails/34.jpg)
Equal-weight merges
• A classic result (Munro-Paterson '80):
– Base case: fill the summary with k input points
– Input: two summaries of size k, built from data sets of the same size
– Merge and sort the summaries to get size 2k
– Take every other element
• Error grows proportionally to the height of the merge tree
• Randomized twist:
– Randomly pick whether to take the odd or the even elements
Example: {1, 5, 6, 7, 8} + {2, 3, 4, 9, 10} → sorted → take odd positions → {1, 3, 5, 7, 9}
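The merge with the randomized twist is only a few lines (a sketch; the function name is my own):

```python
import random

def equal_weight_merge(s1, s2):
    """Merge two equal-weight quantile summaries of size k into one of
    size k: sort the union, then keep the odd- or even-indexed elements,
    chosen uniformly at random (the randomized twist)."""
    assert len(s1) == len(s2)
    merged = sorted(s1 + s2)
    offset = random.randrange(2)   # 0: even positions, 1: odd positions
    return merged[offset::2]
```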
![Page 35: Mergeable Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062410/56816552550346895dd7cc71/html5/thumbnails/35.jpg)
Equal-weight merge analysis: base case
Guarantee: | (# sample points in range)/(# all sample points) − (# data points in range)/n | ≤ ε
• Let the resulting sample be S, drawn from data set D. Consider any interval I
• The estimate 2·|I ∩ S| is unbiased and has error at most 1
– If |I ∩ D| is even: 2·|I ∩ S| has no error
– If |I ∩ D| is odd: 2·|I ∩ S| has error ±1 with equal probability
![Page 36: Mergeable Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062410/56816552550346895dd7cc71/html5/thumbnails/36.jpg)
Equal-weight merge analysis: multiple levels
(Merge tree figure: levels i = 1, 2, 3, 4)
• Consider the j-th merge at level i, of L(i−1) and R(i−1) into S(i)
– The estimate is 2^i·|I ∩ S(i)|
– The error introduced by replacing L, R with S is
X_{i,j} = 2^i·|I ∩ S(i)| − 2^{i−1}·|I ∩ (L(i−1) ∪ R(i−1))|  (new estimate − old estimate)
– The absolute error satisfies |X_{i,j}| ≤ 2^{i−1} by the previous argument
• Bound the total error over all m levels by summing the errors:
– M = Σ_{i,j} X_{i,j} = Σ_{1≤i≤m} Σ_{1≤j≤2^{m−i}} X_{i,j}
– max |M| grows with the number of levels
– Var[M] doesn't! (dominated by the highest level)
![Page 37: Mergeable Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062410/56816552550346895dd7cc71/html5/thumbnails/37.jpg)
Equal-sized merge analysis: Chernoff bound
• Chernoff-Hoeffding: given independent, unbiased variables Y_j with |Y_j| ≤ y_j:
Pr[ |Σ_{1≤j≤t} Y_j| > α ] ≤ 2·exp(−2α² / Σ_{1≤j≤t} (2y_j)²)
• Set α = h·2^m for our variables:
2α² / Σ_{i,j} (2·max|X_{i,j}|)² = 2(h·2^m)² / (Σ_i 2^{m−i}·2^{2i})
= 2h²·2^{2m} / Σ_i 2^{m+i}
= 2h² / Σ_i 2^{i−m}
= 2h² / Σ_i 2^{−i}
≥ 2h²
• From the Chernoff bound, the error probability is at most 2·exp(−2h²)
– Set h = O(log^{1/2}(1/δ)) to obtain success probability 1−δ
![Page 38: Mergeable Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062410/56816552550346895dd7cc71/html5/thumbnails/38.jpg)
Equal-sized merge analysis: finishing up
• The Chernoff bound ensures absolute error at most α = h·2^m
– m = number of merges = log(n/k) for summary size k
– So the error is at most h·n/k
• Set the size of each summary k to O(h/ε) = O((1/ε)·log^{1/2}(1/δ))
– Guarantees εn error with probability 1−δ for any one range
• There are O(1/ε) different ranges to consider
– Set δ = Θ(ε) to ensure all ranges are correct with constant probability
– Summary size: O((1/ε)·log^{1/2}(1/ε))
![Page 39: Mergeable Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062410/56816552550346895dd7cc71/html5/thumbnails/39.jpg)
Fully mergeable ε-approximation
• Use equal-size merging in a standard logarithmic trick: keep at most one summary of each weight 1, 2, 4, 8, 16, 32, …
• Merge two structures as in binary addition: two summaries of the same weight combine (via an equal-weight merge) into one of double the weight, carrying upward as needed
• Fully mergeable quantiles, in O((1/ε)·log n·log^{1/2}(1/ε))
– n = number of items summarized, not known a priori
• But can we do better?
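One way to picture the binary-addition merge (an illustrative sketch, not the paper's implementation): each structure holds at most one size-k summary per weight class, stored like the digits of a binary number, and equal-weight merges play the role of carries.

```python
import random

def equal_weight_merge(s1, s2):
    # Equal-weight merge with the randomized odd/even twist.
    merged = sorted(s1 + s2)
    return merged[random.randrange(2)::2]

def merge_weight_classes(a, b):
    """Merge two 'binary' summary structures: a[i] and b[i] each hold at
    most one size-k summary of weight 2^i (or None). Two summaries of the
    same weight merge into one of double weight, carrying like addition."""
    n = max(len(a), len(b)) + 1
    a = a + [None] * (n - len(a))
    b = b + [None] * (n - len(b))
    out, carry = [], None
    for x, y in zip(a, b):
        digits = [d for d in (x, y, carry) if d is not None]
        if len(digits) < 2:
            out.append(digits[0] if digits else None)
            carry = None
        else:
            # Merge two weight-2^i summaries into one weight-2^(i+1) carry;
            # a third summary, if present, stays as this position's digit.
            carry = equal_weight_merge(digits[0], digits[1])
            out.append(digits[2] if len(digits) == 3 else None)
    if carry is not None:
        out.append(carry)
    return out
```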
![Page 40: Mergeable Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062410/56816552550346895dd7cc71/html5/thumbnails/40.jpg)
Hybrid summary
• Classical result: it suffices to build the summary on a random sample of size Θ(1/ε²)
– Problem: we don't know n in advance
• Hybrid structure:
– Keep the top O(log(1/ε)) levels: summary size O((1/ε)·log^{1.5}(1/ε))
– Also keep a "buffer" sample of O(1/ε) items
– When the buffer is "full", extract points as a sample of the lowest weight
(Figure: weight classes Wt 32, Wt 16, Wt 8, plus buffer)
![Page 41: Mergeable Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062410/56816552550346895dd7cc71/html5/thumbnails/41.jpg)
ε-approximations in higher dimensions
• ε-approximations generalize to range spaces with bounded VC-dimension
– Generalize the "odd-even" trick to low-discrepancy colorings
– An ε-approximation for constant VC-dimension d has size Õ(ε^{−2d/(d+1)})
![Page 42: Mergeable Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062410/56816552550346895dd7cc71/html5/thumbnails/42.jpg)
Other mergeable summaries: ε-kernels
• ε-kernels in d-dimensional space approximately preserve the projected extent in any direction
– An ε-kernel has size O(1/ε^{(d−1)/2})
– A streaming ε-kernel has size O((1/ε^{(d−1)/2})·log(1/ε))
– A mergeable ε-kernel has size O((1/ε^{(d−1)/2})·log^d n)
![Page 43: Mergeable Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062410/56816552550346895dd7cc71/html5/thumbnails/43.jpg)
| Summary | Static | Streaming | Mergeable |
| --- | --- | --- | --- |
| Heavy hitters | 1/ε | 1/ε | 1/ε |
| ε-approximation (quantiles), deterministic | 1/ε | (1/ε)·log n | (1/ε)·log U |
| ε-approximation (quantiles), randomized | – | (1/ε)·log^1.5(1/ε) | (1/ε)·log^1.5(1/ε) |
| ε-kernel | 1/ε^{(d−1)/2} | (1/ε^{(d−1)/2})·log(1/ε) | (1/ε^{(d−1)/2})·log^d n |
![Page 44: Mergeable Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062410/56816552550346895dd7cc71/html5/thumbnails/44.jpg)
Open problems
• Better bounds for mergeable ε-kernels
– Match the streaming bound?
• Lower bounds for mergeable summaries
– Separation from the streaming model?
• Other streaming algorithms (summaries)
– Lp sampling
– Coresets for minimum enclosing balls (MEB)
![Page 45: Mergeable Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062410/56816552550346895dd7cc71/html5/thumbnails/45.jpg)
Thank you!
![Page 46: Mergeable Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062410/56816552550346895dd7cc71/html5/thumbnails/46.jpg)
Hybrid analysis (sketch)
• Keep the buffer (sample) size at O(1/ε)
– Accuracy is then only √ε·n, not εn
– But if the buffer only summarizes O(εn) points, this is OK
• The analysis is rather delicate:
– Points go into and out of the buffer, but always move "up"
– The number of "buffer promotions" is bounded
– A Chernoff bound similar to before controls the probability of large error
– Gives constant probability of accuracy in O((1/ε)·log^{1.5}(1/ε)) space
![Page 47: Mergeable Summaries](https://reader036.fdocuments.us/reader036/viewer/2022062410/56816552550346895dd7cc71/html5/thumbnails/47.jpg)
Models of summary construction
• Offline computation: e.g. sort the data, take percentiles
• Streaming: the summary is merged with one new item at each step
• One-way merge
– Caterpillar graph of merges
• Equal-weight merges: can only merge summaries of the same weight
• Full mergeability (algebraic): allow arbitrary merging trees
– Our main interest