2013 open analytics_countingv3
![Page 1: 2013 open analytics_countingv3](https://reader036.fdocuments.us/reader036/viewer/2022081603/55a1fdc71a28ab1c4c8b46d4/html5/thumbnails/1.jpg)
Cardinality Estimation for Very Large Data Sets
Matt Abrams, VP Data and Operations
March 25, 2013
![Page 2: 2013 open analytics_countingv3](https://reader036.fdocuments.us/reader036/viewer/2022081603/55a1fdc71a28ab1c4c8b46d4/html5/thumbnails/2.jpg)
THANKS FOR COMING!
• I build large scale distributed systems and work on algorithms that make sense of the data stored in them
• Contributor to the open source project Stream-Lib, a Java library for summarizing data streams (https://github.com/clearspring/stream-lib)
• Ask me questions: @abramsm
![Page 3: 2013 open analytics_countingv3](https://reader036.fdocuments.us/reader036/viewer/2022081603/55a1fdc71a28ab1c4c8b46d4/html5/thumbnails/3.jpg)
HOW CAN WE COUNT THE NUMBER OF DISTINCT ELEMENTS IN LARGE DATA SETS?
![Page 4: 2013 open analytics_countingv3](https://reader036.fdocuments.us/reader036/viewer/2022081603/55a1fdc71a28ab1c4c8b46d4/html5/thumbnails/4.jpg)
HOW CAN WE COUNT THE NUMBER OF DISTINCT ELEMENTS IN VERY LARGE DATA SETS?
![Page 5: 2013 open analytics_countingv3](https://reader036.fdocuments.us/reader036/viewer/2022081603/55a1fdc71a28ab1c4c8b46d4/html5/thumbnails/5.jpg)
GOALS FOR COUNTING SOLUTION
• Support high throughput data streams (up to many hundreds of thousands per second)
• Estimate cardinality with known error thresholds in sets up to around 1 billion (or even 1 trillion when needed)
• Support set operations (unions and intersections)
• Support data streams with a large number of dimensions
![Page 6: 2013 open analytics_countingv3](https://reader036.fdocuments.us/reader036/viewer/2022081603/55a1fdc71a28ab1c4c8b46d4/html5/thumbnails/6.jpg)
http://msnbcmedia.msn.com/j/MSNBC/Components/Photo/_new/pb-111031-hajjj-01.photoblog900.jpg
![Page 7: 2013 open analytics_countingv3](https://reader036.fdocuments.us/reader036/viewer/2022081603/55a1fdc71a28ab1c4c8b46d4/html5/thumbnails/7.jpg)
513a71b843e54b73 1 UID = 128 bits
![Page 8: 2013 open analytics_countingv3](https://reader036.fdocuments.us/reader036/viewer/2022081603/55a1fdc71a28ab1c4c8b46d4/html5/thumbnails/8.jpg)
In one month AddThis logs 5B+ UIDs
2,500,000 * 2000 = 5,000,000,000
![Page 9: 2013 open analytics_countingv3](https://reader036.fdocuments.us/reader036/viewer/2022081603/55a1fdc71a28ab1c4c8b46d4/html5/thumbnails/9.jpg)
That’s 596 GB of just UIDs
![Page 10: 2013 open analytics_countingv3](https://reader036.fdocuments.us/reader036/viewer/2022081603/55a1fdc71a28ab1c4c8b46d4/html5/thumbnails/10.jpg)
NAÏVE SOLUTIONS
• Select count(distinct UID) from table where dimension = foo
• HashSet<K>
• Run a batch job for each new query request
![Page 11: 2013 open analytics_countingv3](https://reader036.fdocuments.us/reader036/viewer/2022081603/55a1fdc71a28ab1c4c8b46d4/html5/thumbnails/11.jpg)
WE ARE NOT A BANK
http://graphics8.nytimes.com/images/2008/01/30/timestopics/feddc.jpg
This means an estimate, rather than an exact value, is acceptable.
![Page 12: 2013 open analytics_countingv3](https://reader036.fdocuments.us/reader036/viewer/2022081603/55a1fdc71a28ab1c4c8b46d4/html5/thumbnails/12.jpg)
![Page 13: 2013 open analytics_countingv3](https://reader036.fdocuments.us/reader036/viewer/2022081603/55a1fdc71a28ab1c4c8b46d4/html5/thumbnails/13.jpg)
THREE INTUITIONS
• It is possible to estimate the cardinality of a set by understanding the probability of a sequence of events occurring in a random variable (e.g. how many coins were flipped if I saw n heads in a row?)
• Averaging the results of multiple observations can reduce the variance associated with random variables
• Applying a good hash function effectively de-duplicates the input stream
![Page 14: 2013 open analytics_countingv3](https://reader036.fdocuments.us/reader036/viewer/2022081603/55a1fdc71a28ab1c4c8b46d4/html5/thumbnails/14.jpg)
INTUITION
What is the probability that a binary string starts with ’01’?
![Page 15: 2013 open analytics_countingv3](https://reader036.fdocuments.us/reader036/viewer/2022081603/55a1fdc71a28ab1c4c8b46d4/html5/thumbnails/15.jpg)
INTUITION
$(1/2)^2 = 25\%$
![Page 16: 2013 open analytics_countingv3](https://reader036.fdocuments.us/reader036/viewer/2022081603/55a1fdc71a28ab1c4c8b46d4/html5/thumbnails/16.jpg)
INTUITION
$(1/2)^3 = 12.5\%$
![Page 17: 2013 open analytics_countingv3](https://reader036.fdocuments.us/reader036/viewer/2022081603/55a1fdc71a28ab1c4c8b46d4/html5/thumbnails/17.jpg)
INTUITION
Crude analysis: If a stream has 8 unique values the hash of at least one of them should start with ‘001’
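The crude analysis above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `hash32` (the first 4 bytes of SHA-1), `rho`, and `crude_estimate` are hypothetical helpers chosen for this sketch, not part of Stream-Lib. The guess is 2 raised to the largest leftmost-1-bit position seen, so a hash starting with '001' suggests about 8 distinct values.

```python
import hashlib

HASH_BITS = 32

def hash32(value: str) -> int:
    """32-bit hash: the first 4 bytes of SHA-1 (an arbitrary choice for this sketch)."""
    return int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest()[:4], "big")

def rho(x: int) -> int:
    """Position of the leftmost 1-bit in a 32-bit word, e.g. 001... gives 3."""
    return HASH_BITS - x.bit_length() + 1

def crude_estimate(values) -> int:
    """Guess the distinct count as 2^(largest rho seen)."""
    largest = 0
    for value in values:
        largest = max(largest, rho(hash32(value)))
    return 2 ** largest

# 1000 distinct values, each appearing five times
stream = [f"user-{i % 1000}" for i in range(5000)]
print(crude_estimate(stream))
```

Because duplicates hash to the same bits, repeating a value never changes the result. The estimate is always a power of two and can be far off, which is why a single observation is not enough.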
![Page 18: 2013 open analytics_countingv3](https://reader036.fdocuments.us/reader036/viewer/2022081603/55a1fdc71a28ab1c4c8b46d4/html5/thumbnails/18.jpg)
INTUITION
Given the variability of a single random value, we cannot use a single variable for accurate cardinality estimations
![Page 19: 2013 open analytics_countingv3](https://reader036.fdocuments.us/reader036/viewer/2022081603/55a1fdc71a28ab1c4c8b46d4/html5/thumbnails/19.jpg)
MULTIPLE OBSERVATIONS HELP REDUCE VARIANCE
By taking the mean of multiple random variables we can make the error rate as small as desired by controlling the size of m (the number of random variables)
$$\text{error} = \sigma / \sqrt{m}$$
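A small simulation can make the square-root relationship concrete. This is a sketch under assumptions: it averages uniform(0, 1) samples rather than HLL observables, and `spread_of_means` is a hypothetical helper. It compares the measured spread of the mean of m samples with σ/√m.

```python
import random
import statistics

def spread_of_means(m: int, trials: int = 2000, seed: int = 42) -> float:
    """Standard deviation of the mean of m uniform(0, 1) samples, over many trials."""
    rng = random.Random(seed)
    means = [statistics.fmean([rng.random() for _ in range(m)]) for _ in range(trials)]
    return statistics.pstdev(means)

sigma = (1 / 12) ** 0.5  # standard deviation of a single uniform(0, 1) variable

for m in (1, 4, 16, 64):
    measured = spread_of_means(m)
    predicted = sigma / m ** 0.5
    print(f"m = {m:2d}  measured = {measured:.4f}  sigma/sqrt(m) = {predicted:.4f}")
```

Quadrupling m should roughly halve the spread, which is the behaviour the formula predicts.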
![Page 20: 2013 open analytics_countingv3](https://reader036.fdocuments.us/reader036/viewer/2022081603/55a1fdc71a28ab1c4c8b46d4/html5/thumbnails/20.jpg)
THE PROBLEM WITH MULTIPLE HASH FUNCTIONS
• It is too costly from a computational perspective to apply m hash functions to each data point
• It is not clear that it is possible to generate m good hash functions that are independent
![Page 21: 2013 open analytics_countingv3](https://reader036.fdocuments.us/reader036/viewer/2022081603/55a1fdc71a28ab1c4c8b46d4/html5/thumbnails/21.jpg)
STOCHASTIC AVERAGING
• Emulating the effect of m experiments with a single hash function
• Divide input stream into m sub-streams
• An average of the observable values for each sub-stream will yield a cardinality estimate whose accuracy improves in proportion to $1/\sqrt{m}$ as m increases
$$h(M) \rightarrow \left[ \tfrac{1}{m}, \tfrac{2}{m}, \ldots, \tfrac{m-1}{m}, 1 \right]$$
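One way to picture stochastic averaging is to let the first b bits of a single hash pick the sub-stream. The sketch below assumes a 32-bit hash built from SHA-1 and m = 2^b sub-streams. `hash32` and `substream_index` are illustrative names, not a library API.

```python
import hashlib
from collections import Counter

HASH_BITS = 32

def hash32(value: str) -> int:
    """32-bit hash: the first 4 bytes of SHA-1 (an arbitrary choice for this sketch)."""
    return int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest()[:4], "big")

def substream_index(hashed: int, b: int) -> int:
    """Use the first b bits of the hash to pick one of m = 2^b sub-streams."""
    return hashed >> (HASH_BITS - b)

b = 4
m = 1 << b
sizes = Counter(substream_index(hash32(f"item-{i}"), b) for i in range(16000))

for index in range(m):
    print(index, sizes[index])
```

With a well-mixed hash, each sub-stream should receive roughly 16000 / m items, so one hash function stands in for m independent experiments.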
![Page 22: 2013 open analytics_countingv3](https://reader036.fdocuments.us/reader036/viewer/2022081603/55a1fdc71a28ab1c4c8b46d4/html5/thumbnails/22.jpg)
HASH FUNCTIONS
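Number of hashed values at which a collision reaches the given odds:

| Odds of a Collision | 32 Bit Hash | 64 Bit Hash | 160 Bit Hash |
|---|---|---|---|
| 1 in 2 | 77163 | 5.06 Billion | 1.42 * 10^24 |
| 1 in 10 | 30084 | 1.97 Billion | 5.55 * 10^23 |
| 1 in 100 | 9292 | 609 million | 1.71 * 10^23 |
| 1 in 1000 | 2932 | 192 million | 5.41 * 10^22 |
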
http://preshing.com/20110504/hash-collision-probabilities
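The figures above can be sanity-checked with the standard birthday approximation, p ≈ 1 − exp(−k(k−1) / 2N), where k is the number of hashed values and N = 2^bits. The function name below is an assumption made for this sketch.

```python
import math

def collision_probability(k: float, hash_bits: int) -> float:
    """Birthday approximation: chance that k random hash values contain a collision."""
    n = 2.0 ** hash_bits
    return -math.expm1(-k * (k - 1) / (2 * n))

for k, bits in [(77163, 32), (30084, 32), (5.06e9, 64), (1.42e24, 160)]:
    print(f"{k:g} values, {bits}-bit hash: {collision_probability(k, bits):.3f}")
```

Each extra bit of hash doubles N, so the number of values that can be hashed before a given collision risk grows by a factor of about √2 per bit.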
![Page 23: 2013 open analytics_countingv3](https://reader036.fdocuments.us/reader036/viewer/2022081603/55a1fdc71a28ab1c4c8b46d4/html5/thumbnails/23.jpg)
HYPERLOGLOG (2007)
Philippe Flajolet (1948-2011)
Counts up to 1 Billion in 1.5KB of space
![Page 24: 2013 open analytics_countingv3](https://reader036.fdocuments.us/reader036/viewer/2022081603/55a1fdc71a28ab1c4c8b46d4/html5/thumbnails/24.jpg)
HYPERLOGLOG (HLL)
• Operates with a single pass over the input data set
• Produces a typical error of $1.04/\sqrt{m}$
• Error decreases as m increases. Error is not a function of the number of elements in the set
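The error formula makes it easy to size an HLL for a target accuracy. The helpers below are a sketch with assumed names. They evaluate 1.04/√m and find the smallest power-of-two m that meets a target.

```python
import math

def standard_error(m: int) -> float:
    """Typical relative error of an HLL with m registers."""
    return 1.04 / math.sqrt(m)

def registers_for_error(target: float) -> int:
    """Smallest power-of-two register count m with 1.04/sqrt(m) <= target."""
    b = 0
    while standard_error(1 << b) > target:
        b += 1
    return 1 << b

for b in (8, 10, 12, 14):
    m = 1 << b
    print(f"m = {m:6d}  typical error = {standard_error(m):.2%}")

print(registers_for_error(0.02))
```

Halving the error requires four times as many registers, regardless of how many elements are counted.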
![Page 25: 2013 open analytics_countingv3](https://reader036.fdocuments.us/reader036/viewer/2022081603/55a1fdc71a28ab1c4c8b46d4/html5/thumbnails/25.jpg)
HLL SUBSTREAMS
HLL uses a single hash function and splits the result into m buckets
[Diagram: input values S pass through the hash function and are split into Bucket 1, Bucket 2, ..., Bucket m]
![Page 26: 2013 open analytics_countingv3](https://reader036.fdocuments.us/reader036/viewer/2022081603/55a1fdc71a28ab1c4c8b46d4/html5/thumbnails/26.jpg)
HLL ALGORITHM BASICS
• Each substream maintains an Observable
• The Observable is the largest value of ρ(x), which is the position of the leftmost 1-bit in a binary string x
• 32 bit hashing function with 5 bit “short bytes”
• Harmonic mean
  • Increases quality of estimates by reducing variance
![Page 27: 2013 open analytics_countingv3](https://reader036.fdocuments.us/reader036/viewer/2022081603/55a1fdc71a28ab1c4c8b46d4/html5/thumbnails/27.jpg)
WHAT ARE “SHORT BYTES”?
• We know a priori that the value of a given substream of the multiset M is in the range $0..(L + 1 - \log_2 m)$
• Assuming L = 32 we only need 5 bits to store the value of the register
• About 84% less memory usage (5 bits instead of 32) as compared to a standard Java int (32 bits)
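The register-size arithmetic can be checked directly. In this sketch, `register_bits` is a hypothetical helper and b = 10 (so m = 1024) is an arbitrary illustrative choice.

```python
def register_bits(hash_bits: int, b: int) -> int:
    """Bits needed to store any value in 0..(L + 1 - log2 m), where m = 2^b."""
    max_value = hash_bits + 1 - b
    return max_value.bit_length()

L, b = 32, 10            # b = 10 is an illustrative choice (m = 1024 registers)
m = 1 << b
bits = register_bits(L, b)
saving = 1 - bits / 32   # compared with a 32-bit int per register

print(f"bits per register: {bits}")
print(f"memory saved vs 32-bit ints: {saving:.1%}")
print(f"total size: {m * bits // 8} bytes instead of {m * 4} bytes")
```

The same calculation with a 64-bit hash needs 6 bits per register, since the largest possible value no longer fits in 5 bits.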
![Page 28: 2013 open analytics_countingv3](https://reader036.fdocuments.us/reader036/viewer/2022081603/55a1fdc71a28ab1c4c8b46d4/html5/thumbnails/28.jpg)
ADDING VALUES TO HLL
• The first b bits of the new value define the index for the multiset M that may be updated when the new value is added
• The bits b+1 to L are used to determine ρ, based on the number of leading zeros
$$\text{index} = 1 + \langle x_1 x_2 \cdots x_b \rangle_2 \qquad \rho(x_{b+1} x_{b+2} \cdots)$$
![Page 29: 2013 open analytics_countingv3](https://reader036.fdocuments.us/reader036/viewer/2022081603/55a1fdc71a28ab1c4c8b46d4/html5/thumbnails/29.jpg)
ADDING VALUES TO HLL
The multiset $\{M[1], M[2], \ldots, M[m]\}$ (the observations) is updated using the equation:
$$M[j] := \max(M[j], \rho(\omega))$$
where $\rho(\omega)$ is the number of leading zeros + 1
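The update rule can be sketched as follows. Assumptions: a 32-bit hash taken from SHA-1, 0-based register indexes (the slides use 1-based), and hypothetical helper names `hash32`, `rho`, and `add`. This is not how Stream-Lib implements it.

```python
import hashlib

HASH_BITS = 32

def hash32(value: str) -> int:
    """32-bit hash: the first 4 bytes of SHA-1 (an arbitrary choice for this sketch)."""
    return int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest()[:4], "big")

def rho(w: int, width: int) -> int:
    """Number of leading zeros + 1 in a width-bit word."""
    return width - w.bit_length() + 1

def add(registers: list, value: str, b: int) -> None:
    """Update one register with the observation from value; len(registers) must be 2^b."""
    x = hash32(value)
    rest_bits = HASH_BITS - b
    j = x >> rest_bits                  # first b bits: register index (0-based here)
    w = x & ((1 << rest_bits) - 1)      # remaining bits feed rho
    registers[j] = max(registers[j], rho(w, rest_bits))

b = 4
registers = [0] * (1 << b)
for i in range(1000):
    add(registers, f"uid-{i}", b)
print(registers)
```

Because the update is a max, adding the same value again leaves the registers unchanged, which is what makes the structure insensitive to duplicates.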
![Page 30: 2013 open analytics_countingv3](https://reader036.fdocuments.us/reader036/viewer/2022081603/55a1fdc71a28ab1c4c8b46d4/html5/thumbnails/30.jpg)
INTUITION ON EXTRACTING CARDINALITY FROM HLL
• If we add n unique elements to a stream then each substream will contain roughly n/m elements
• The MAX value in each substream should be about $\log_2(n/m)$ (from earlier intuition re random variables)
• The harmonic mean (mZ) of $2^{MAX}$ is on the order of n/m
• So $m^2 Z$ is on the order of n ← That’s the cardinality!
![Page 31: 2013 open analytics_countingv3](https://reader036.fdocuments.us/reader036/viewer/2022081603/55a1fdc71a28ab1c4c8b46d4/html5/thumbnails/31.jpg)
HLL CARDINALITY ESTIMATE
• $m^2 Z$ has a systematic multiplicative bias that needs to be corrected. This is done by multiplying by a constant value, $\alpha_m$
$$E := \alpha_m m^2 \cdot \left( \sum_{j=1}^{m} 2^{-M[j]} \right)^{-1}$$
The inverted sum is the harmonic mean term.
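Putting the pieces together, the raw estimate can be sketched in Python. Assumptions in this sketch: a 32-bit hash taken from SHA-1, m = 1024 registers, and the bias constant α_m = 0.7213 / (1 + 1.079/m) that the HyperLogLog paper gives for m ≥ 128. The small-range and long-range corrections are omitted, and the helper names are hypothetical.

```python
import hashlib

HASH_BITS = 32

def hash32(value: str) -> int:
    """32-bit hash: the first 4 bytes of SHA-1 (an arbitrary choice for this sketch)."""
    return int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest()[:4], "big")

def build_registers(values, b: int) -> list:
    """Fill m = 2^b registers with the largest rho observed per substream."""
    rest_bits = HASH_BITS - b
    registers = [0] * (1 << b)
    for value in values:
        x = hash32(value)
        j = x >> rest_bits                        # first b bits pick the register
        w = x & ((1 << rest_bits) - 1)            # remaining bits
        observation = rest_bits - w.bit_length() + 1  # leading zeros + 1
        registers[j] = max(registers[j], observation)
    return registers

def estimate(registers) -> float:
    """Raw HLL estimate: alpha_m * m^2 / sum(2^-M[j])."""
    m = len(registers)
    alpha_m = 0.7213 / (1 + 1.079 / m)  # bias constant from the paper, for m >= 128
    z = 1.0 / sum(2.0 ** -r for r in registers)
    return alpha_m * m * m * z

n = 100_000
registers = build_registers((f"uid-{i}" for i in range(n)), b=10)
print(round(estimate(registers)))
```

With m = 1024 the typical error is about 3%, so the printed value should land near 100,000 without matching it exactly.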
![Page 32: 2013 open analytics_countingv3](https://reader036.fdocuments.us/reader036/viewer/2022081603/55a1fdc71a28ab1c4c8b46d4/html5/thumbnails/32.jpg)
A NOTE ON LONG RANGE CORRECTIONS
• The paper says to apply a long range correction function when the estimate is greater than: $E > \frac{1}{30} 2^{32}$
• The correction function is: $E^* := -2^{32} \log(1 - E / 2^{32})$
• DON’T DO THIS! It doesn’t work and increases error. A better approach is to use a bigger/better hash function
![Page 33: 2013 open analytics_countingv3](https://reader036.fdocuments.us/reader036/viewer/2022081603/55a1fdc71a28ab1c4c8b46d4/html5/thumbnails/33.jpg)
Let’s look at HLL in action.
DEMO TIME!
http://www.aggregateknowledge.com/science/blog/hll.html
![Page 34: 2013 open analytics_countingv3](https://reader036.fdocuments.us/reader036/viewer/2022081603/55a1fdc71a28ab1c4c8b46d4/html5/thumbnails/34.jpg)
HLL UNIONS
• Merging two or more HLL data structures is a similar process to adding a new value to a single HLL
• For each register in the HLL take the max value of the HLLs you are merging and the resulting register set can be used to estimate the cardinality of the combined sets
[Diagram: a Root node with children MON, TUE, WED, THU, FRI, each holding its own HLL]
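The register-wise max can be sketched as below. Assumptions: every HLL uses the same hash function and the same number of registers. `hash32`, `build_registers`, and `merge` are illustrative helpers, not the Stream-Lib API.

```python
import hashlib

HASH_BITS = 32

def hash32(value: str) -> int:
    """32-bit hash: the first 4 bytes of SHA-1 (an arbitrary choice for this sketch)."""
    return int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest()[:4], "big")

def build_registers(values, b: int) -> list:
    """Fill m = 2^b registers with the largest observation per substream."""
    rest_bits = HASH_BITS - b
    registers = [0] * (1 << b)
    for value in values:
        x = hash32(value)
        j = x >> rest_bits
        w = x & ((1 << rest_bits) - 1)
        registers[j] = max(registers[j], rest_bits - w.bit_length() + 1)
    return registers

def merge(*sketches) -> list:
    """Union of HLLs: take the max of each register across all inputs."""
    if len({len(s) for s in sketches}) != 1:
        raise ValueError("need one or more HLLs with the same number of registers")
    return [max(column) for column in zip(*sketches)]

monday = [f"uid-{i}" for i in range(0, 600)]
tuesday = [f"uid-{i}" for i in range(400, 1000)]

merged = merge(build_registers(monday, 6), build_registers(tuesday, 6))
combined = build_registers(monday + tuesday, 6)
print(merged == combined)
```

The merged registers are identical to those built from the combined stream, so daily HLLs can be rolled up into weekly or monthly ones without revisiting the raw data.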
![Page 35: 2013 open analytics_countingv3](https://reader036.fdocuments.us/reader036/viewer/2022081603/55a1fdc71a28ab1c4c8b46d4/html5/thumbnails/35.jpg)
HLL INTERSECTION
You must understand the properties of your sets to know if you can trust the resulting intersection
$$|C| = |A| + |B| - |A \cup B|$$
where C is the intersection of A and B
[Venn diagram of A, B, and C]
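A quick numeric sketch shows why the properties of the sets matter. The cardinalities below are made-up illustrative numbers, and `intersection_estimate` is a hypothetical helper applying inclusion-exclusion to three estimates.

```python
def intersection_estimate(card_a: float, card_b: float, card_union: float) -> float:
    """Inclusion-exclusion: |A ∩ B| = |A| + |B| - |A ∪ B|."""
    return card_a + card_b - card_union

# With exact inputs the identity holds exactly.
print(intersection_estimate(1000, 800, 1500))

# Two large sets with a small overlap (illustrative numbers).
true_a = true_b = 1_000_000
true_union = 1_990_000          # so the true intersection is 10,000

# A 1% overestimate of the union alone swamps the true intersection.
skewed = intersection_estimate(true_a, true_b, true_union * 1.01)
print(skewed)
```

The absolute error of each estimate scales with the size of the large sets, so when the overlap is small relative to them, the computed intersection can be dominated by noise or even go negative.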
![Page 36: 2013 open analytics_countingv3](https://reader036.fdocuments.us/reader036/viewer/2022081603/55a1fdc71a28ab1c4c8b46d4/html5/thumbnails/36.jpg)
HYPERLOGLOG++
• Google researchers have recently released an update to the HLL algorithm
• Uses clever encoding/decoding techniques to create a single data structure that is very accurate for small cardinality sets and can estimate sets that have over a trillion elements in them
• Empirical bias correction. Observations show that most of the error in HLL comes from the bias function. Using empirically derived values significantly reduces error
![Page 37: 2013 open analytics_countingv3](https://reader036.fdocuments.us/reader036/viewer/2022081603/55a1fdc71a28ab1c4c8b46d4/html5/thumbnails/37.jpg)
HLL++ DELTA ENCODING
$\{1024, 1027, 1028, 1030, 1033, 1035\}$
$\{0, 3, 1, 2, 3, 2\}$
By using delta encoding, fewer bits are required to represent the array, making it easier to fit larger sets in memory
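The example above can be reproduced with a minimal delta encoder. This sketch assumes the first value is kept as a base and the first delta is 0, which matches the numbers on the slide. It illustrates the idea only and is not the HLL++ wire format.

```python
def delta_encode(values):
    """Return (base, deltas): each delta is the gap from the previous value."""
    base = values[0]
    deltas = []
    previous = base
    for value in values:
        deltas.append(value - previous)
        previous = value
    return base, deltas

def delta_decode(base, deltas):
    """Rebuild the original sorted values from the base and the deltas."""
    values = []
    current = base
    for delta in deltas:
        current += delta
        values.append(current)
    return values

original = [1024, 1027, 1028, 1030, 1033, 1035]
base, deltas = delta_encode(original)

print(base, deltas)
print("bits per raw value:", max(original).bit_length())
print("bits per delta:", max(deltas).bit_length())
```

Sorted values that sit close together produce small deltas, so each entry needs far fewer bits than the raw value would.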
![Page 38: 2013 open analytics_countingv3](https://reader036.fdocuments.us/reader036/viewer/2022081603/55a1fdc71a28ab1c4c8b46d4/html5/thumbnails/38.jpg)
OTHER PROBABILISTIC DATA STRUCTURES
• Bloom Filters – set membership detection
• CountMinSketch – estimate number of occurrences for a given element
• TopK Estimators – estimate the frequency and top elements from a stream
![Page 39: 2013 open analytics_countingv3](https://reader036.fdocuments.us/reader036/viewer/2022081603/55a1fdc71a28ab1c4c8b46d4/html5/thumbnails/39.jpg)
REFERENCES
• Stream-Lib - https://github.com/clearspring/stream-lib
• HyperLogLog - http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.142.9475
• HyperLogLog In Practice - http://research.google.com/pubs/pub40671.html
• Aggregate Knowledge HLL Blog Posts - http://blog.aggregateknowledge.com/tag/hyperloglog/
![Page 40: 2013 open analytics_countingv3](https://reader036.fdocuments.us/reader036/viewer/2022081603/55a1fdc71a28ab1c4c8b46d4/html5/thumbnails/40.jpg)
THANKS!
AddThis is hiring!