Flexible Approximate Counting
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.
Scott A. Mitchell and David M. Day, Sandia National Laboratories
Scott – presenter
IDEAS’11: 15th International Database Engineering & Applications Symposium
Outline
• What is approximate counting?
• What’s new?
  – Functional form
  – Increment decision strategies
    • Speed it up!
    • Random number and bit generators
  – Inverse problem
    • Find function given how high you want to count
(Focus on red since that’s what’s significant)
What is approximate counting?
• Approximate counter C
  – Trade decreased memory for decreased accuracy
  – Standard (unsigned) integer or bit field, but C represents some bigger number N
  – Normal integers use log2(N) bits to represent 0..N
  – Counter C can use log2(log2(N)) bits to represent 1..log2(N)
    • Accurate to within a factor of 2
    • “Count” to 2^(2^8) using 8 bits
• N = φ(C) function
[Figure: unary, binary, and floating point representations; 100110, 100 110]
• Count using only the exponent
What is approximate counting?
• Count occurrences of datastream objects, e.g. pairs of IP addresses
• Problem
  – Object arrives, decide whether to increment
    • N+1 = ? if you only stored C?
    • C=4, N=16. Choose 16+1 = 16 or 16+1 = 32?
• Solution
  – Coin flipping: 16+1 = 32 with probability p = 1/(32-16)
  – Flajolet papers prove expected value and error are reasonable, 1985-2004+
  – Two sources of error
    • Unavoidable: intermediate numbers not representable. Constant-factor approximation.
    • Datastream: can’t view all the data at once, random decisions. Expected error bounds.
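The coin-flipping rule above can be sketched in a few lines. This is a minimal illustration, assuming the base-2 function φ(C) = 2^C from the slide's example (C=4, N=16); the names `phi` and `increment` and the seed are hypothetical choices for the sketch, not taken from the talk.

```python
import random

def phi(c):
    """Count represented by counter value c (assumed base 2: C=4 -> N=16)."""
    return 2 ** c

def increment(c, rng):
    """On one arrival, move c -> c+1 with probability 1 / (phi(c+1) - phi(c))."""
    p = 1.0 / (phi(c + 1) - phi(c))
    return c + 1 if rng.random() < p else c

rng = random.Random(0)      # arbitrary seed, only for repeatability
c = 0
for _ in range(100_000):    # 100,000 arrivals of the same object
    c = increment(c, rng)
print(c, phi(c))            # a small counter standing in for a large count
```

The counter stays tiny while the represented count φ(C) lands within a constant factor of the true count on average.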
Motivation
• Old idea (memory-accuracy) with some new uses
  – Morris 1978, one small register on a CPU
  – Today big data, lots of counters
• Data-summarization
  – Approximate Counting useful by itself, for counting all objects
    • Database merge
      – Choose most efficient algorithm, pre-allocate memory
  – May be combined with other techniques
    • Bloom filters
      – Replace 1-bit with a small counter, Van Durme & Lall 2009
      – Spread counter into multiple bits of a Bloom filter, vary the number of bits for skewed data, Talbot 2009
Generalize Function: q-ary counting and Floating Point AC
• ΔN = 2^C. Why base 2?
  – p = 2^(-C). Use fast random-bit source for increment decisions
• Csűrös 2010
  – Treat counter as binary-exponent floating point number
    • Exponent gives powers-of-two increment probabilities
    • Significand gives better accuracy than base 2
  – Stair-step approximation to “q-ary” counting
  – I.e. restricted to 9 choices for 8-bit counters
• First contribution
  – Get these advantages… without these restrictions
[Figure: 8-bit counter 0100 0110 split into an (8-d)-bit exponent and a d-bit significand]
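A rough sketch of a floating-point counter in the style described above, with the high bits as exponent t and the low d bits as significand u. The value formula in `fp_value` is an assumption, chosen so that every step adds exactly 2^t to the represented count; it is not taken from the slides, and the names and the 4+4 bit split are illustrative.

```python
import random

D = 4  # significand bits (assumed split of an 8-bit counter: 4 + 4)

def fp_value(x, d=D):
    """Count represented by counter x (assumed formula, see lead-in)."""
    t = x >> d                      # exponent: high bits
    u = x & ((1 << d) - 1)          # significand: low d bits
    return ((1 << d) + u) * (1 << t) - (1 << d)

def fp_increment(x, rng, d=D):
    """Increment with probability 2**-t, decided by t random bits."""
    t = x >> d
    # t == 0 means exact counting; otherwise succeed only if all t bits are zero
    if t == 0 or rng.getrandbits(t) == 0:
        return x + 1
    return x

rng = random.Random(1)              # arbitrary seed
x = 0
for _ in range(10_000):
    x = fp_increment(x, rng)
print(x, fp_value(x))
```

With d significand bits the first 2^d counts are exact, and the step size doubles only every 2^d counter values, which is the stair-step behaviour the slide refers to.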
Our Flexible AC
• Flexible AC
  – Perfect counting below a threshold T, then
  – ΔN = a^(C-T), p = 1/a^(C-T), where a is any floating point value
  – a is small (<2), since with base 2 an 8-bit counter already reaches 255 = log2(5.7e76)
  – Round ΔN to integer
• Still get prior speedups
  – Round all ΔN to powers-of-two if speed(RandomBit) < ½ speed(RandomNumber)
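The functional form above can be sketched as a table of step sizes. This is a minimal illustration, assuming ΔN = 1 below the threshold and ΔN = round(a^(C-T)) from T onward; the function names, the saturation behaviour at the top value, and the parameters a = 1.1, T = 16 are assumptions for the example only.

```python
import random

def make_deltas(a, T, c_max=255):
    """Step size for each counter value: 1 below T, then round(a**(C-T))."""
    return [1 if c < T else max(1, round(a ** (c - T))) for c in range(c_max)]

def value(c, deltas):
    """Count represented by counter value c: sum of the steps taken so far."""
    return sum(deltas[:c])

def increment(c, deltas, rng):
    """One arrival: step up with probability 1/ΔN; stay put once saturated."""
    if c >= len(deltas):
        return c
    dn = deltas[c]
    return c + 1 if dn == 1 or rng.random() < 1.0 / dn else c

deltas = make_deltas(a=1.1, T=16)
rng = random.Random(2)              # arbitrary seed
c = 0
for _ in range(50_000):
    c = increment(c, deltas, rng)
print(c, value(c, deltas))
```

Rounding each entry of `deltas` to a power of two instead would allow the random-bit decision to be used, at some cost in how finely the maximum count can be tuned.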
Random Bit Generator
• Many well-tested random number generators
  – Fewer random bit generators
• Knuth vol. 2 eq 10 – very simple (fast!)
    A = 0x0102010081010101   // 64-bit constant
    X = X << 1               // shift left
    if overflow: X = X xor A
    RandomBit = X & 1        // lowest bit of X
  – A is your choice of primitive polynomial mod 2 with many one-bits: 8 out of 64, Rajski & Tyszer 2003
  – Every length-64 bit-sequence occurs once before repetition
• Consider accuracy in terms of intended use. What matters for our application:
  – k one-bits in a row occurs 1 in 2^k times
  – Generated 2^47 bits; 42 one-bits in a row occurs 1 in 2^42 times, verified experimentally
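The shift-and-xor generator above translates almost line for line. A minimal sketch, assuming a 64-bit state emulated with a mask (Python integers do not overflow on their own); the helper names `step` and `bits` are hypothetical.

```python
MASK64 = (1 << 64) - 1
A = 0x0102010081010101  # constant from the slide; 8 of its 64 bits are set

def step(x):
    """Advance the 64-bit state once; return (new_state, random_bit)."""
    overflow = x >> 63            # top bit, about to be shifted out
    x = (x << 1) & MASK64         # shift left, keep 64 bits
    if overflow:
        x ^= A
    return x, x & 1               # lowest bit of the new state

def bits(seed, n):
    """Return the first n bits produced from a nonzero seed."""
    x = seed & MASK64
    if x == 0:
        raise ValueError("state must be nonzero")
    out = []
    for _ in range(n):
        x, b = step(x)
        out.append(b)
    return out

print(bits(0x9E3779B97F4A7C15, 16))  # arbitrary nonzero seed
```

A seed with few one-bits produces a long run of zeros before the state fills in, so a well-mixed seed or a short warm-up is a reasonable precaution.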
Speed Comparison
• If this is embedded in a datastream application, speed may be important.
• Random number generator is the bottleneck (goal is incrementing a counter!)
    if RandNumber < p: increment
    // p = 2^(-k)
    if k RandomBits in a row: increment
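The two decision rules compared above can be written side by side. This is a sketch under the assumption that the bit source is any zero-argument callable returning 0 or 1; `decide_with_number` and `decide_with_bits` are hypothetical names.

```python
import random

def decide_with_number(p, rng):
    """Increment if a uniform random number in [0, 1) falls below p."""
    return rng.random() < p

def decide_with_bits(k, next_bit):
    """Increment if k random bits in a row are all one, i.e. p = 2**-k.

    Stops at the first zero bit, so on average fewer than two bits are drawn.
    """
    for _ in range(k):
        if not next_bit():
            return False
    return True

rng = random.Random(3)                       # arbitrary seed
print(decide_with_number(1 / 8, rng))
print(decide_with_bits(3, lambda: rng.getrandbits(1)))
```

The early exit is what makes the bit version cheap: half of all decisions end after a single bit.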
Random Countdown Speedup
• Why generate a random number every time?
  – Set countdown counter P
    P = number of times in a row RandNumber > p [no increment]
  – This is the definition of a geometric distribution
  – Need one countdown counter per counter value (1..255), not per counter (billions)
  – Calculating P is (relatively) very expensive
    • Fast on average if P is large, i.e. p is small
• Hybrid algorithm
  – RandNumber < p? or RandomBit for small P
  – Random Countdown for large P
  – “small” means <10 or <22
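A minimal sketch of the random countdown, assuming the countdown P is drawn by inverting the geometric distribution with a logarithm (one common way to sample it; the slides do not say which method was used). The class and function names are hypothetical. One countdown is kept per counter value and shared by every counter currently holding that value.

```python
import math
import random

def draw_countdown(p, rng):
    """Number of arrivals to skip before the next increment (geometric in p)."""
    if p >= 1.0:
        return 0
    u = 1.0 - rng.random()                  # uniform in (0, 1], avoids log(0)
    return int(math.log(u) / math.log1p(-p))

class CountdownTable:
    """One shared countdown per counter value, not per counter."""

    def __init__(self, probs, rng):
        self.probs = probs                  # probs[c] = increment probability at value c
        self.rng = rng
        self.left = [draw_countdown(p, rng) for p in probs]

    def should_increment(self, c):
        if self.left[c] > 0:                # common case: just count down
            self.left[c] -= 1
            return False
        self.left[c] = draw_countdown(self.probs[c], self.rng)
        return True

table = CountdownTable([0.01], random.Random(4))   # arbitrary seed
hits = sum(table.should_increment(0) for _ in range(200_000))
print(hits)                                  # close to 200,000 * 0.01 on average
```

The expensive logarithm is paid once per increment rather than once per arrival, which is why this pays off only when increments are rare.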
Fixed Countdown Speedup
• Why generate a random number at all?
  – Increment “1 in Δφ” times deterministically
    Slightly different value to get correct expected value
  – Best possible accuracy if only one item
  – Fastest
  – Relies on randomness of stream
    • E.g. alternating items give bad counts
• Speed: RandomCount vs. FixedCount
  – RandomCount = 1.5x FixedCount for Δφ=255
  – RandomCount = ¼x RandomBit for Δφ=172
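The dependence on stream randomness can be seen in a toy version. This sketch simplifies heavily: it uses a single shared countdown with one fixed step `delta` for all items, whereas the scheme above keeps a countdown per counter value; the function name is hypothetical.

```python
def fixed_countdown_increments(stream, delta):
    """Give an increment to exactly every delta-th arrival, with no randomness."""
    left = delta - 1
    hits = {}
    for item in stream:
        hits.setdefault(item, 0)
        if left > 0:
            left -= 1                # skip this arrival
        else:
            hits[item] += 1          # this arrival gets the increment
            left = delta - 1
    return hits

print(fixed_countdown_increments("AAAAAAAA", 2))   # single item: exact 1-in-2
print(fixed_countdown_increments("ABABABAB", 2))   # alternating items: skewed
```

With one item the count is as accurate as possible, but in the alternating stream every increment goes to the same item, which is the failure mode the slide warns about.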
Punchline
How High Do You Want to Count? Inverse problem (David M. Day)
• Find a, never discussed in approximate counting literature
  – For some applications, determine by hand ahead of time
  – Our run-time solution
  – Inverse geometric sum
• Find root > 1 of r(a)
  – Initial guess depends on s compared to K, i.e. a^(K+1) vs. sa vs. (s-1)
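One way to read the inverse geometric sum: if the steps 1, a, a^2, ..., a^K must add up to a target s, then (a^(K+1) - 1)/(a - 1) = s, which rearranges to r(a) = a^(K+1) - sa + (s-1) = 0. That form of r(a) is an assumption reconstructed from the three terms named on the slide. The sketch below solves it by plain bisection rather than by the initial-guess method from the talk; the function names are hypothetical.

```python
def geometric_sum(a, K):
    """1 + a + a**2 + ... + a**K, evaluated by Horner's rule."""
    total = 0.0
    for _ in range(K + 1):
        total = total * a + 1.0
    return total

def solve_base(s, K, iterations=200):
    """Find a > 1 with geometric_sum(a, K) == s, by bisection."""
    if K < 1 or s <= K + 1:
        raise ValueError("need K >= 1 and s > K + 1 for a root above 1")
    lo, hi = 1.0, 2.0
    while geometric_sum(hi, K) < s:     # widen the bracket if base 2 is too small
        hi *= 2.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if geometric_sum(mid, K) < s:
            lo = mid
        else:
            hi = mid
    return hi

a = solve_base(1e9, 255)                # count to about a billion in 255 steps
print(a, geometric_sum(a, 255))
```

Because the sum is increasing in a, bisection cannot miss the root, though it takes more iterations than a good initial guess followed by Newton steps would.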
Inverse Problem Alternatives
• We’re only approximately counting,
  – So accuracy may not be important
• We only calculate function once,
  – So efficiency may not be important
(Application dependent)
  – Use the initial guesses
  – Use binary search or lookup table
  – Use N=φ(C) function with easier inverse
    • E.g. exponential + linear function, but increments are too small for small C
Conclusion
• Flexible Approximate Counting provides
  – Customization of functional form
    • At run-time, for maximum value to count to
  – Fast decisions of whether to increment
    • If datastream is sufficiently random
      – Use fixed countdown
    • Else
      – Switch to random countdown for large increments
    • If speed is more important than accuracy for small increments
      – Use random bits and power-of-two increments
• Random generator accuracy limits
  – Consider the intended use
    • RandNumber: min r such that probability(u<r) ≈ r
    • RandomBit: max k such that probability(k one-bits in a row) ≈ 2^(-k)
• Thank you
  – Have a safe trip home