Streaming Quantiles - edoliberty.github.io
![Page 1: Streaming Quantiles - edoliberty.github.io](https://reader033.fdocuments.us/reader033/viewer/2022061100/629a69f64199fe4af4189d63/html5/thumbnails/1.jpg)
Streaming Quantiles
Edo Liberty, Principal Scientist, Amazon Web Services
![Page 2: Streaming Quantiles - edoliberty.github.io](https://reader033.fdocuments.us/reader033/viewer/2022061100/629a69f64199fe4af4189d63/html5/thumbnails/2.jpg)
![Page 3: Streaming Quantiles - edoliberty.github.io](https://reader033.fdocuments.us/reader033/viewer/2022061100/629a69f64199fe4af4189d63/html5/thumbnails/3.jpg)
Streaming Quantiles
• Manku, Rajagopalan, Lindsay. Random sampling techniques for space efficient online computation of order statistics of large datasets.
• Munro, Paterson. Selection and sorting with limited storage.
• Greenwald, Khanna. Space-efficient online computation of quantile summaries.
• Wang, Luo, Yi, Cormode. Quantiles over data streams: An experimental study.
• Greenwald, Khanna. Quantiles and equi-depth histograms over streams.
• Agarwal, Cormode, Huang, Phillips, Wei, Yi. Mergeable summaries.
• Felber, Ostrovsky. A randomized online quantile summary in $O((1/\varepsilon)\log(1/\varepsilon))$ words.
• Lang, Karnin, Liberty. Optimal Quantile Approximation in Streams.
![Page 4: Streaming Quantiles - edoliberty.github.io](https://reader033.fdocuments.us/reader033/viewer/2022061100/629a69f64199fe4af4189d63/html5/thumbnails/4.jpg)
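Problem Definition
[Figure: the $n$ stream items laid out in sorted order on an axis from 0 to $n$; the marked item $x$ has rank $R(x) = 0.6 \cdot n$.]
Create a sketch for the stream that yields an approximate rank $R'$ such that $|R'(x) - R(x)| \le \varepsilon n$.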
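To make the guarantee concrete, here is a minimal Python sketch of the exact rank and of the $\varepsilon n$ tolerance check. The names `exact_rank` and `within_tolerance` are illustrative, and defining the rank as the number of items less than or equal to $x$ is an assumption (the slide does not say whether ties are included).

```python
from bisect import bisect_right


def exact_rank(sorted_items, x):
    """R(x): number of items <= x (assumed convention), given a sorted list."""
    return bisect_right(sorted_items, x)


def within_tolerance(approx_rank, true_rank, n, eps):
    """Check the sketch guarantee |R'(x) - R(x)| <= eps * n."""
    return abs(approx_rank - true_rank) <= eps * n


stream = [1, 0, 3, 5, 4, 7]
items = sorted(stream)
true_rank = exact_rank(items, 2)  # the items 0 and 1 are <= 2
print(true_rank, within_tolerance(3, true_rank, len(stream), eps=0.2))
```

An exact answer needs the whole sorted stream in memory; the sketches discussed next trade that for the additive error $\varepsilon n$.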
![Page 5: Streaming Quantiles - edoliberty.github.io](https://reader033.fdocuments.us/reader033/viewer/2022061100/629a69f64199fe4af4189d63/html5/thumbnails/5.jpg)
Solutions
• Uniform sampling: fast and simple, fully mergeable. Space $O(1/\varepsilon^2)$
• Greenwald Khanna (GK) sketch: slow, complex, not mergeable. Space $\log(n)/\varepsilon$
• Felber-Ostrovsky, combines sampling and GK (2015): slow, complex, not mergeable. Space $\log(1/\varepsilon)/\varepsilon$
$\log(1/\varepsilon)/\varepsilon$ was previously conjectured to be space optimal for all algorithms. It is a lower bound for deterministic algorithms by Hung and Ting 2010.
![Page 6: Streaming Quantiles - edoliberty.github.io](https://reader033.fdocuments.us/reader033/viewer/2022061100/629a69f64199fe4af4189d63/html5/thumbnails/6.jpg)
Solutions cont'
• Uniform sampling: fast, simple, fully mergeable. Space $O(1/\varepsilon^2)$
• Greenwald Khanna (GK) sketch: slow, complex, not mergeable. Space $\log(n)/\varepsilon$
• Felber-Ostrovsky, combines sampling and GK (2015): slow, complex, not mergeable. Space $\log(1/\varepsilon)/\varepsilon$
Buffer based solutions:
• Manku-Rajagopalan-Lindsay (MRL): fast, simple, fully mergeable. Space $\log^2(n)/\varepsilon$
• Agarwal, Cormode, Huang, Phillips, Wei, Yi: fast, complex, fully mergeable. Space $\log^{3/2}(1/\varepsilon)/\varepsilon$
![Page 7: Streaming Quantiles - edoliberty.github.io](https://reader033.fdocuments.us/reader033/viewer/2022061100/629a69f64199fe4af4189d63/html5/thumbnails/7.jpg)
The basic buffer idea
[Figure: the stream 1, 0, 3, 5, 4, 7 flowing into a buffer of size $k$.]
![Page 8: Streaming Quantiles - edoliberty.github.io](https://reader033.fdocuments.us/reader033/viewer/2022061100/629a69f64199fe4af4189d63/html5/thumbnails/8.jpg)
The basic buffer idea
Stores $k$ stream entries
[Figure: the buffer holding 1, 0, 3, 5, 4, 7.]
![Page 9: Streaming Quantiles - edoliberty.github.io](https://reader033.fdocuments.us/reader033/viewer/2022061100/629a69f64199fe4af4189d63/html5/thumbnails/9.jpg)
The basic buffer idea
The buffer sorts $k$ stream entries
[Figure: the buffer after sorting: 0, 1, 3, 4, 5, 7.]
![Page 10: Streaming Quantiles - edoliberty.github.io](https://reader033.fdocuments.us/reader033/viewer/2022061100/629a69f64199fe4af4189d63/html5/thumbnails/10.jpg)
The basic buffer idea
Deletes every other item
[Figure: the sorted buffer 0, 1, 3, 4, 5, 7 with 1, 4 and 7 deleted.]
![Page 11: Streaming Quantiles - edoliberty.github.io](https://reader033.fdocuments.us/reader033/viewer/2022061100/629a69f64199fe4af4189d63/html5/thumbnails/11.jpg)
The basic buffer idea
And outputs the rest with double the weight
[Figure: the output items 0, 3, 5.]
![Page 12: Streaming Quantiles - edoliberty.github.io](https://reader033.fdocuments.us/reader033/viewer/2022061100/629a69f64199fe4af4189d63/html5/thumbnails/12.jpg)
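The basic buffer idea
[Figure: the sorted buffer 0, 1, 3, 4, 5, 7 and its two possible outputs, 0, 3, 5 or 1, 4, 7, each with weight 2, shown for two query points $x$.]
• For the $x$ with $R(x) = 2$: $R'(x) = 2$ whichever half is kept.
• For the $x$ with $R(x) = 5$: $R'(x) = 6$ if 0, 3, 5 are kept and $R'(x) = 4$ if 1, 4, 7 are kept.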
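The compaction step on these slides (sort, delete every other item, double the weight) can be sketched in a few lines of Python. The function names and the `offset` parameter are illustrative choices, not taken from the slides; the rank convention (items less than or equal to $x$) is likewise an assumption.

```python
def compact(buffer, offset):
    """Sort the buffer and keep every other item, starting at `offset` (0 or 1).

    Each surviving item stands in for two originals, so its weight doubles.
    """
    ordered = sorted(buffer)
    return ordered[offset::2]


def weighted_rank(items, weight, x):
    """Rank estimate: `weight` times the number of kept items <= x."""
    return weight * sum(1 for v in items if v <= x)


buffer = [1, 0, 3, 5, 4, 7]
for offset in (0, 1):
    kept = compact(buffer, offset)
    # x = 2 has true rank 2; x = 6 has true rank 5.
    print(kept, weighted_rank(kept, 2, 2), weighted_rank(kept, 2, 6))
```

A query whose true rank inside the buffer is even is answered exactly; one with an odd rank is off by one item of the original weight, in a direction that depends on which half survives.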
![Page 13: Streaming Quantiles - edoliberty.github.io](https://reader033.fdocuments.us/reader033/viewer/2022061100/629a69f64199fe4af4189d63/html5/thumbnails/13.jpg)
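The basic buffer idea
Repeat $n/k$ times until the end of the stream.
[Figure: a stream of $n$ items (1, 0, 3, 5, 5, ...) passes through the buffer, which outputs $n/2$ items.]
Each compaction changes the rank of any $x$ by at most 1, and there are $n/k$ compactions, so
$|R'(x) - R(x)| < n/k$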
![Page 14: Streaming Quantiles - edoliberty.github.io](https://reader033.fdocuments.us/reader033/viewer/2022061100/629a69f64199fe4af4189d63/html5/thumbnails/14.jpg)
Manku-Rajagopalan-Lindsay (MRL) sketch
[Figure: a stream of $n$ items (1, 0, 3, 5, ...) feeds a stack of buffers of size $k$.]
$H = \log_2(n)$
$|R'(x) - R(x)| \le n/k \cdot \log_2(n)$
![Page 15: Streaming Quantiles - edoliberty.github.io](https://reader033.fdocuments.us/reader033/viewer/2022061100/629a69f64199fe4af4189d63/html5/thumbnails/15.jpg)
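Manku-Rajagopalan-Lindsay (MRL) sketch
If we set $k = \log_2(n)/\varepsilon$ we get $|R'(x) - R(x)| \le \varepsilon n$ while maintaining at most $H \cdot k \le \log_2^2(n)/\varepsilon$ stream items.
• Manku-Rajagopalan-Lindsay (MRL) sketch: fast, simple, fully mergeable. Space $\log^2(n)/\varepsilon$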
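As a worked example of this parameter choice, the following Python snippet plugs in numbers. The function name `mrl_parameters` and the rounding up to whole buffers and whole items are assumptions made for illustration; the formulas $H = \log_2(n)$ and $k = \log_2(n)/\varepsilon$ are the ones on the slides.

```python
import math


def mrl_parameters(n, eps):
    """Return (k, H, stored items) for the MRL setting k = log2(n) / eps."""
    levels = math.ceil(math.log2(n))    # H = log2(n), rounded up to whole levels
    k = math.ceil(math.log2(n) / eps)   # buffer size, rounded up to whole items
    return k, levels, levels * k


n, eps = 10_000_000, 0.01
k, H, stored = mrl_parameters(n, eps)
error_bound = n / k * math.log2(n)      # (n / k) * log2(n) from the previous slide
print(k, H, stored, error_bound <= eps * n)
```

Because of the rounding, the stored-item count here comes out slightly above $\log_2^2(n)/\varepsilon$; the point is only the order of magnitude, tens of thousands of items for ten million inputs at $\varepsilon = 0.01$.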
![Page 16: Streaming Quantiles - edoliberty.github.io](https://reader033.fdocuments.us/reader033/viewer/2022061100/629a69f64199fe4af4189d63/html5/thumbnails/16.jpg)
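Agarwal, Cormode, Huang, Phillips, Wei, Yi (1)
[Figure: a stream (1, 0, 3, 5, ...) feeds $\log(1/\varepsilon)$ buffers of size $k$; start sampling after $O(1/\varepsilon^2)$ items.]
Reduces space usage to $\log^2(1/\varepsilon)/\varepsilon$ items from the stream.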
![Page 17: Streaming Quantiles - edoliberty.github.io](https://reader033.fdocuments.us/reader033/viewer/2022061100/629a69f64199fe4af4189d63/html5/thumbnails/17.jpg)
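Agarwal, Cormode, Huang, Phillips, Wei, Yi (2)
$R'(x)$ is a random variable now and $E[R'(x)] = R(x)$
[Figure: a buffer holding 5 and 7 with a query $x$ between them, so $R(x) = 1$. Keeping 5 gives $R'(x) = 2$; keeping 7 gives $R'(x) = 0$.]
Reduces space usage to $\log^{3/2}(1/\varepsilon)/\varepsilon$ items from the stream.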
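The unbiasedness claim $E[R'(x)] = R(x)$ can be checked by brute force on a small buffer. The Python snippet below assumes the compaction keeps the even or the odd positions of the sorted buffer with probability 1/2 each (the slide shows the two outcomes but does not spell out the coin flip), and averages the two resulting estimates. Function names are illustrative.

```python
def compact(buffer, offset):
    """Keep every other item of the sorted buffer, starting at `offset`."""
    return sorted(buffer)[offset::2]


def expected_estimate(buffer, x):
    """Average of the weight-2 rank estimate over both equally likely offsets."""
    estimates = [2 * sum(1 for v in compact(buffer, offset) if v <= x)
                 for offset in (0, 1)]
    return sum(estimates) / 2


def true_rank(buffer, x):
    return sum(1 for v in buffer if v <= x)


print(expected_estimate([5, 7], 6), true_rank([5, 7], 6))
```

For the slide's buffer of 5 and 7 with $x$ between them, the two estimates are 2 and 0, which average to the true rank 1. The same holds for any even-sized buffer and any query point.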
![Page 18: Streaming Quantiles - edoliberty.github.io](https://reader033.fdocuments.us/reader033/viewer/2022061100/629a69f64199fe4af4189d63/html5/thumbnails/18.jpg)
Recap
• Uniform sampling: fast, simple, fully mergeable. Space $O(1/\varepsilon^2)$
• Greenwald Khanna (GK) sketch: slow, complex, not mergeable. Space $\log(n)/\varepsilon$
• Felber-Ostrovsky, combines sampling and GK (2015): slow, complex, not mergeable. Space $\log(1/\varepsilon)/\varepsilon$
• Manku-Rajagopalan-Lindsay (MRL): fast, simple, fully mergeable. Space $\log^2(n)/\varepsilon$
• Agarwal, Cormode, Huang, Phillips, Wei, Yi: fast, complex, fully mergeable. Space $\log^{3/2}(1/\varepsilon)/\varepsilon$
![Page 19: Streaming Quantiles - edoliberty.github.io](https://reader033.fdocuments.us/reader033/viewer/2022061100/629a69f64199fe4af4189d63/html5/thumbnails/19.jpg)
Our goal
• Uniform sampling: fast, simple, fully mergeable. Space $O(1/\varepsilon^2)$
• Greenwald Khanna (GK) sketch: slow, complex, not mergeable. Space $\log(n)/\varepsilon$
• Felber-Ostrovsky, combines sampling and GK (2015): slow, complex, not mergeable. Space $\log(1/\varepsilon)/\varepsilon$
• Manku-Rajagopalan-Lindsay (MRL): fast, simple, fully mergeable. Space $\log^2(n)/\varepsilon$
• Agarwal, Cormode, Huang, Phillips, Wei, Yi: fast, complex, fully mergeable. Space $\log^{3/2}(1/\varepsilon)/\varepsilon$
• Karnin, Lang, Liberty: fast, simple, fully mergeable. Space $\log(1/\varepsilon)/\varepsilon$
![Page 20: Streaming Quantiles - edoliberty.github.io](https://reader033.fdocuments.us/reader033/viewer/2022061100/629a69f64199fe4af4189d63/html5/thumbnails/20.jpg)
Observation
The first buffers contribute very little to the error. They are "too good".
[Figure: a stack of buffers at heights $h = 1, h = 2, \ldots, h = H$.]
Weight of items in the level: $w_h = 2^{h-1}$
Number of compactions: $m_h = 2^{H-h-1}$
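A short worked calculation, using only the two formulas on this slide, suggests why the low levels are "too good". It assumes (this is not stated on the slide) that each compaction at height $h$ shifts the rank of $x$ by an independent random amount of magnitude at most $w_h$, so a level contributes at most $m_h w_h^2$ to the variance of the error:

```latex
m_h w_h^2 \;=\; 2^{H-h-1} \cdot 2^{2(h-1)} \;=\; 2^{H+h-3}
\qquad\Longrightarrow\qquad
\frac{m_{h+1}\, w_{h+1}^2}{m_h\, w_h^2} \;=\; 2 .
```

Under that assumption the bound doubles with every level: the level at $h = 1$ contributes at most $2^{H-2}$, while the level at $h = H-1$ contributes up to $2^{2H-4}$, a factor of $2^{H-2}$ more.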
![Page 21: Streaming Quantiles - edoliberty.github.io](https://reader033.fdocuments.us/reader033/viewer/2022061100/629a69f64199fe4af4189d63/html5/thumbnails/21.jpg)
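Idea
Let buffers shrink at-most-exponentially: $k_h \ge k c^{H-h}$
[Figure: a stack of buffers at heights $h = 1, h = 2, \ldots, h = H$.]
Weight of items in the level: $w_h = 2^{h-1}$
Number of compactions: $m_h \le (2/c)^{H-h-1}$
$H \le \log(n/ck) + 2$
TBD later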
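To see what exponentially shrinking capacities buy in space, the Python snippet below tabulates $k_h = \lceil k c^{H-h} \rceil$ and compares the total with the $H \cdot k$ items of equal-size buffers. The values of $k$, $c$ and $H$ are illustrative assumptions, and the ceiling is one way to satisfy $k_h \ge k c^{H-h}$ with whole items.

```python
import math


def capacities(k, c, H):
    """Capacities k_h = ceil(k * c**(H - h)) for heights h = 1..H (top level last)."""
    return [math.ceil(k * c ** (H - h)) for h in range(1, H + 1)]


k, c, H = 200, 2 / 3, 12
caps = capacities(k, c, H)
total = sum(caps)
# Geometric series: sum of k * c**j over j >= 0 is k / (1 - c); ceilings add at most H.
print(caps, total, k / (1 - c) + H, H * k)
```

The total stays below $k/(1-c) + H$ however tall the stack grows, whereas equal-size buffers need $H \cdot k$ items.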
![Page 22: Streaming Quantiles - edoliberty.github.io](https://reader033.fdocuments.us/reader033/viewer/2022061100/629a69f64199fe4af4189d63/html5/thumbnails/22.jpg)
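Analysis
Let $R(x, h)$ be the rank of $x$ among:
1. The items yielded by the compactor at height $h$
2. All the items stored in the compactors of heights $h' \le h$
Claim, for $C = c^2(2c - 1)$:
$$\Pr\left[\,|R(x, H') - R(x)| \ge \varepsilon n\,\right] \le 2\exp\left(-C \varepsilon^2 k^2 2^{2(H-H')}\right)$$
Proof: Use Hoeffding's inequality on
$$\sum_{h=1}^{H} \left[R(x, h) - R(x, h-1)\right]$$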
![Page 23: Streaming Quantiles - edoliberty.github.io](https://reader033.fdocuments.us/reader033/viewer/2022061100/629a69f64199fe4af4189d63/html5/thumbnails/23.jpg)
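Solution 1
Set $k_h = \lceil k c^{H-h} \rceil + 1$ and $c = 2/3$
• Karnin, Lang, Liberty (1): fast, simple, fully mergeable. Space $\sqrt{\log(1/\varepsilon)}/\varepsilon$
[Figure: a stack of height $\log(n)$ of exponentially decreasing capacity buffers; a sampler replaces all buffers of size 2.]
Better than previously conjectured optimal!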
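The following Python sketch illustrates the structure described here: a stack of compactors with capacities $k_h = \lceil k c^{H-h} \rceil + 1$ and $c = 2/3$. It is a simplified illustration under stated assumptions, not the authors' implementation: the class and method names are invented, the sampler that replaces the size-2 buffers is left out, and each compaction keeps the even or odd positions with a fair coin flip.

```python
import math
import random


class KLLSketch:
    """Illustrative stack of compactors with exponentially shrinking capacities."""

    def __init__(self, k=256, c=2.0 / 3.0, seed=None):
        self.k = k
        self.c = c
        self.rng = random.Random(seed)
        self.levels = [[]]  # levels[h] stores items of weight 2**h

    def capacity(self, h):
        # k_h = ceil(k * c**depth) + 1, where the top level has depth 0.
        depth = len(self.levels) - 1 - h
        return math.ceil(self.k * self.c ** depth) + 1

    def update(self, item):
        self.levels[0].append(item)
        h = 0
        while h < len(self.levels):
            if len(self.levels[h]) >= self.capacity(h):
                self._compact(h)
            h += 1

    def _compact(self, h):
        if h + 1 == len(self.levels):
            self.levels.append([])  # grow the stack; lower capacities shrink
        buf = sorted(self.levels[h])
        # Hold one item back when the count is odd so total weight is preserved.
        leftover = [buf.pop()] if len(buf) % 2 else []
        offset = self.rng.randrange(2)  # keep even or odd positions at random
        self.levels[h + 1].extend(buf[offset::2])
        self.levels[h] = leftover

    def rank(self, x):
        """Estimated number of stream items <= x."""
        return sum((1 << h) * sum(1 for v in buf if v <= x)
                   for h, buf in enumerate(self.levels))

    def stored(self):
        return sum(len(buf) for buf in self.levels)


n = 20000
data = list(range(n))
random.Random(2).shuffle(data)
sketch = KLLSketch(k=256, seed=1)
for value in data:
    sketch.update(value)

# For the stream 0..n-1 the exact rank of x is x + 1.
worst = max(abs(sketch.rank(x) - (x + 1)) for x in range(0, n, 200))
print(sketch.stored(), worst / n)
```

The number of stored items stays near $3k$ regardless of the stream length, because the capacities form a geometric series with ratio $c = 2/3$.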
![Page 24: Streaming Quantiles - edoliberty.github.io](https://reader033.fdocuments.us/reader033/viewer/2022061100/629a69f64199fe4af4189d63/html5/thumbnails/24.jpg)
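Solution 2 (KLL + MRL)
Set $k_h = \lceil k c^{H-h} \rceil + 1$ and $c = 2/3$, except that the top $\log\log(1/\varepsilon)$ buffers all have capacity $k$.
• Karnin, Lang, Liberty (2): fast, simple, fully mergeable. Space $\log^2\log(1/\varepsilon)/\varepsilon$
[Figure: a stack of height $\log(n)$: $\log\log(1/\varepsilon)$ buffers of capacity $k$ on top, exponentially decreasing capacity buffers below them; a sampler replaces all buffers of size 2.]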
![Page 25: Streaming Quantiles - edoliberty.github.io](https://reader033.fdocuments.us/reader033/viewer/2022061100/629a69f64199fe4af4189d63/html5/thumbnails/25.jpg)
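Solution 3 (KLL + GK)
Set $k_h = \lceil k c^{H-h} \rceil + 1$ and $c = 2/3$, and replace the top $\log\log(1/\varepsilon)$ buffers with a GK sketch.
• Karnin, Lang, Liberty (3): fast, simple, fully mergeable. Space $\log\log(1/\varepsilon)/\varepsilon$
[Figure: a stack of height $\log(n)$: a GK sketch replaces the top $\log\log(1/\varepsilon)$ levels, exponentially decreasing capacity buffers sit below it; a sampler replaces all buffers of size 2.]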
![Page 26: Streaming Quantiles - edoliberty.github.io](https://reader033.fdocuments.us/reader033/viewer/2022061100/629a69f64199fe4af4189d63/html5/thumbnails/26.jpg)
Count Distinct (Demo Only)
![Page 27: Streaming Quantiles - edoliberty.github.io](https://reader033.fdocuments.us/reader033/viewer/2022061100/629a69f64199fe4af4189d63/html5/thumbnails/27.jpg)
Assume you need to estimate the distribution of numbers in a file.
$ head data.csv
0
1
0
3
0
2
3
7
3
2
In this one, row i takes a value from [0, i] uniformly at random.
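A file with this distribution could be produced along the following lines. This is a hedged Python sketch: the function name `generate_rows` and the seed are illustrative, and the slides do not show how `data.csv` was actually generated.

```python
import random


def generate_rows(n, seed=0):
    """Row i takes a value drawn uniformly at random from the integers [0, i]."""
    rng = random.Random(seed)
    return [rng.randint(0, i) for i in range(n)]  # randint includes both ends


rows = generate_rows(1000)
print("\n".join(str(v) for v in rows[:10]))
```

Writing one value per line for ten million rows would give a file shaped like the one in the demo.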
![Page 28: Streaming Quantiles - edoliberty.github.io](https://reader033.fdocuments.us/reader033/viewer/2022061100/629a69f64199fe4af4189d63/html5/thumbnails/28.jpg)
Some stats: there are 10,000,000 such numbers in this ~76 MB file.
$ time wc -lc data.csv
10000000 76046666 data.csv

real 0m0.101s
user 0m0.072s
sys 0m0.021s
Reading the file takes ~1/10 seconds. We don't foresee IO being an issue.
![Page 29: Streaming Quantiles - edoliberty.github.io](https://reader033.fdocuments.us/reader033/viewer/2022061100/629a69f64199fe4af4189d63/html5/thumbnails/29.jpg)
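In Python it looks like this:
$ cat quantiles.py
import sys
# Read every value into memory and sort (exact, but needs the whole input in RAM).
ints = sorted([int(x) for x in sys.stdin])
# Print about 100 evenly spaced order statistics; max() guards against a zero step.
step = max(1, len(ints) // 100)
for i in range(0, len(ints), step):
    print(ints[i])
$ time cat data.csv | python quantiles.py > /dev/null

real 0m13.406s
user 0m12.937s
sys 0m0.407s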
![Page 30: Streaming Quantiles - edoliberty.github.io](https://reader033.fdocuments.us/reader033/viewer/2022061100/629a69f64199fe4af4189d63/html5/thumbnails/30.jpg)
This is the way to do this with the sketching library:
$ time cat data.csv | sketch rank > /dev/null

real 0m1.495s
user 0m1.878s
sys 0m0.141s
$ time cat data.csv | sketch rank
Too fast to use the system monitor UI...
It uses ~4k of memory!
![Page 31: Streaming Quantiles - edoliberty.github.io](https://reader033.fdocuments.us/reader033/viewer/2022061100/629a69f64199fe4af4189d63/html5/thumbnails/31.jpg)
[Figure: exact and approximate quantiles. Two series, approximate quantiles and exact quantiles; the y-axis runs from 0 to 10,000,000 and the x-axis ticks run from 1 to 103.]
[Figure: exact vs approximate quantiles. Both axes run from 0 to 10,000,000.]
![Page 32: Streaming Quantiles - edoliberty.github.io](https://reader033.fdocuments.us/reader033/viewer/2022061100/629a69f64199fe4af4189d63/html5/thumbnails/32.jpg)
Some experimental results
[Figure: Lazy KLL versus (Sketch Library and Two Variants). Error (0 to 0.01) against Number of Items in Randomly Permuted Stream (100 to 1e+06). Series: Sketch Library, Variant 1, Variant 2, Lazy KLL.]
[Figure: Lazy KLL versus (Sketch Library and Two Variants). Space Used For Storing Samples (0 to 4000) against Number of Items in Randomly Permuted Stream (100 to 1e+06). Series: Sketch Library, Variant 1, Variant 2, Lazy KLL.]
![Page 33: Streaming Quantiles - edoliberty.github.io](https://reader033.fdocuments.us/reader033/viewer/2022061100/629a69f64199fe4af4189d63/html5/thumbnails/33.jpg)
Thank you!