Probabilistic data structures in real life

PROBABILISTIC DATA STRUCTURES IN REAL LIFE Valentin Bazarevsky

WHO ARE THEY?

Bloom Filter
LogLog Family
MinHash

BUSINESS CASE: ESTIMATE YOUR AUDIENCE

SEGMENT BUILDER

15 TB of transactional data
4 h SLA

POSSIBLE SOLUTIONS

Brute force (15 TB of transactional data)
Sampling (1% of users => 1.2 MB / b.o.)
Magic tool (?!)

Estimator
HyperLogLog can estimate the cardinality of sets with more than 1 000 000 000 unique elements with 1% error, and requires only 4 KB of memory

50 000 000 basic operations

OOPS…

Supports only Unions

But we need Intersections, Subtractions, Not operators
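
A minimal sketch of a HyperLogLog estimator may help show why union is the only native operation: merging two sketches is a register-wise maximum, and no comparable register operation yields an intersection or a complement. The class name, the SHA-1 based 64-bit hash, and the default of 2**12 registers are assumptions made for illustration, not details taken from the talk.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HyperLogLog sketch with 2**p registers (p >= 7)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    @staticmethod
    def _hash64(item):
        # Assumption: the first 8 bytes of SHA-1 serve as a 64-bit hash.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, item):
        h = self._hash64(item)
        width = 64 - self.p
        index = h >> width                     # top p bits select the register
        rest = h & ((1 << width) - 1)          # remaining bits
        rank = width - rest.bit_length() + 1   # leading zeros in rest, plus one
        if rank > self.registers[index]:
            self.registers[index] = rank

    def merge(self, other):
        """Return the sketch of the union: register-wise maximum."""
        if self.p != other.p:
            raise ValueError("sketches must use the same number of registers")
        merged = HyperLogLog(self.p)
        merged.registers = [max(x, y) for x, y in zip(self.registers, other.registers)]
        return merged

    def count(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)       # bias constant for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:           # small-range correction
            return m * math.log(m / zeros)
        return raw


a, b = HyperLogLog(), HyperLogLog()
for user_id in range(80_000):
    a.add(user_id)
for user_id in range(20_000, 100_000):
    b.add(user_id)
union = a.merge(b)
print(round(a.count()), round(b.count()), round(union.count()))
```

The merged sketch is identical to one built by adding every element of both sets to a single sketch, which is why unions cost nothing extra in accuracy.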

HYPERLOGLOG INTUITION

00101010101010001111010101101 => a[2] = 0
10010101010100101010101001011 => a[9] = 1
00000101010100101010101110101 => a[0] = 1
01010101010100100101010101010 => a[5] = 1

01010000000000000000000000010 => a[5] = 23
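
The bit patterns above appear to follow one rule: the first 4 bits of the hash select a register, and the register keeps the longest run of leading zeros seen in the remaining bits. The sketch below reproduces the slide values under that reading; `register_update` is a hypothetical helper name, and the last hash is built programmatically as `0101`, 23 zeros, then `10`. Most HyperLogLog descriptions store the position of the first 1 bit (zeros plus one) rather than the zero count used here.

```python
def register_update(bits, index_bits=4):
    """Split a hash given as a bit string into (register index, leading-zero count)."""
    index = int(bits[:index_bits], 2)           # first bits select the register
    rest = bits[index_bits:]
    zeros = len(rest) - len(rest.lstrip("0"))   # leading zeros in the remainder
    return index, zeros


hashes = [
    "00101010101010001111010101101",
    "10010101010100101010101001011",
    "00000101010100101010101110101",
    "01010101010100100101010101010",
    "0101" + "0" * 23 + "10",
]

registers = {}
for h in hashes:
    index, zeros = register_update(h)
    # Each register remembers the maximum observed so far.
    registers[index] = max(registers.get(index, 0), zeros)

print(registers)  # {2: 0, 9: 1, 0: 1, 5: 23}
```

A long run of zeros is rare, so seeing 23 of them in one register suggests that many distinct values have been hashed into it.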

INCLUSION-EXCLUSION PRINCIPLE
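
For two and three sets, the principle can be rearranged so that only set sizes and union sizes appear on the right-hand side, which is what a union-only estimator can supply. These are the standard identities, written out here as a reminder rather than as the exact formulation used in the talk.

```latex
\begin{aligned}
|A \cap B| &= |A| + |B| - |A \cup B| \\
|A \cap B \cap C| &= |A| + |B| + |C|
  - |A \cup B| - |A \cup C| - |B \cup C|
  + |A \cup B \cup C|
\end{aligned}
```

Each term on the right carries its own estimation error, so the relative error of the result grows when the intersection is small compared with the union.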

MINHASH

Store only the x (8192) smallest hashes in the set
Jaccard Distance
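
The idea of keeping only the smallest hashes can be sketched as a bottom-k MinHash. The function names, the SHA-1 based 64-bit hash, and the estimator below are assumptions for illustration; the talk does not show its implementation. The estimate is the Jaccard similarity |A ∩ B| / |A ∪ B|, and the Jaccard distance is one minus that value.

```python
import hashlib
import heapq


def hash64(item):
    # Assumption: the first 8 bytes of SHA-1 serve as a 64-bit hash.
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def bottom_k(items, k=8192):
    """Sketch of a set: its k smallest distinct hash values, ascending."""
    return heapq.nsmallest(k, {hash64(x) for x in items})


def jaccard_similarity(sketch_a, sketch_b, k=8192):
    """Estimate |A ∩ B| / |A ∪ B| from two bottom-k sketches."""
    set_a, set_b = set(sketch_a), set(sketch_b)
    # The k smallest hashes of both sketches form the bottom-k sketch of A ∪ B.
    union_sketch = heapq.nsmallest(k, set_a | set_b)
    if not union_sketch:
        return 0.0
    shared = sum(1 for h in union_sketch if h in set_a and h in set_b)
    return shared / len(union_sketch)


sketch_a = bottom_k(range(80_000))
sketch_b = bottom_k(range(20_000, 100_000))
similarity = jaccard_similarity(sketch_a, sketch_b)
print(similarity, 1 - similarity)  # similarity and distance; true similarity is 0.6
```

One common way to turn the similarity into an intersection size is to multiply it by an estimate of the union cardinality, which a union-capable estimator can provide.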

UNION OF INTERSECTIONS

A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)
A - B - C = A - (B ∪ C)
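
Assuming the two identities on this slide are the distributive law A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C) and the subtraction rule A - B - C = A - (B ∪ C), a brute-force check over every triple of subsets of a small universe confirms them. The 4-element universe is an arbitrary choice for the check.

```python
from itertools import combinations

universe = range(4)
# All 16 subsets of the 4-element universe.
subsets = [set(c) for r in range(5) for c in combinations(universe, r)]

checked = 0
for a in subsets:
    for b in subsets:
        for c in subsets:
            # Intersection distributes over union.
            assert a & (b | c) == (a & b) | (a & c)
            # Repeated subtraction equals subtracting the union.
            assert a - b - c == a - (b | c)
            checked += 1

print(checked)  # 4096
```

Rewriting a query into a union of intersections presumably lets each intersection term be estimated on its own before the terms are combined.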

NOT OPERATOR

Subtraction

I WANT EVERYONE EXCEPT…

A and not B
Not A and Not B

CORNER CASES

|(A not(B)) C| => |A C|
|A ∪ not(B)| = |Everything| - |B| + |A ∩ B|
|A ∩ not(B)| => |A| - |A ∩ B|
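
Reading the last two corner cases as |A ∪ not(B)| = |Everything| - |B| + |A ∩ B| and |A ∩ not(B)| = |A| - |A ∩ B| (the operators are an assumption), a complement never has to be materialized: the sizes follow from cardinalities that are already available. The function names below are hypothetical, and exact Python sets stand in for the estimated cardinalities.

```python
def card_a_or_not_b(card_everything, card_b, card_a_and_b):
    """|A ∪ not(B)|: everything except the part of B that lies outside A."""
    return card_everything - card_b + card_a_and_b


def card_a_and_not_b(card_a, card_a_and_b):
    """|A ∩ not(B)|: A minus its overlap with B."""
    return card_a - card_a_and_b


# Exact check on small sets.
everything = set(range(20))
a = set(range(0, 12))
b = set(range(8, 16))
not_b = everything - b

print(card_a_or_not_b(len(everything), len(b), len(a & b)), len(a | not_b))  # 16 16
print(card_a_and_not_b(len(a), len(a & b)), len(a & not_b))                  # 8 8
```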

ARCHITECTURE

ERROR RATE

Median = 5%
75th percentile = 8%