Flexible Approximate Counting

Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

Scott A. Mitchell and David M. Day, Sandia National Laboratories

Scott – presenter

IDEAS’11: 15th International Database Engineering & Applications Symposium


Outline

• What is approximate counting?
• What’s new?
  – Functional form
  – Increment decision strategies
    • Speed it up!
    • Random number and bit generators
  – Inverse problem
    • Find function given how high you want to count

(Focus on red since that’s what’s significant)

What is approximate counting?

• Approximate counter C
  – Trade decreased memory for decreased accuracy
  – Standard (unsigned) integer or bit field, but C represents some bigger number N
  – Normal integers use log2(N) bits to represent 0..N
  – Counter C can use log2(log2(N)) bits to represent 1..log2(N)
    • Accurate to within a factor of 2
    • “Count” to 2^(2^8) using 8 bits
• N = φ(C) function

[Figure: unary, binary, and floating-point representations (100110; 100 110). Count using only the exponent.]

What is approximate counting?

• Count occurrences of datastream objects, e.g. pairs of IP addresses
• Problem
  – Object arrives, decide whether to increment
    • N+1 = ? if you only stored C
    • C=4, N=16. Choose 16+1 = 16 or 16+1 = 32?
• Solution
  – Coin flipping. 16+1 = 32 with probability p = 1/(32-16)
  – Flajolet papers (1985-2004+) prove expected value and error are reasonable
  – Two sources of error
    • Unavoidable: intermediate numbers not representable. Constant-factor approximation.
    • Datastream: can’t view all the data at once, random decisions. Expected error bounds.
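The coin-flipping rule above can be sketched in a few lines. This is a minimal illustration, assuming the base-2 function φ(C) = 2^C from the example (C=4, N=16); the function names are hypothetical and not from the talk.

```python
import random

def phi(c):
    """Value N represented by counter value C (assumed here: N = 2**C)."""
    return 2 ** c

def increment_probability(c):
    """p = 1 / (phi(C+1) - phi(C)); for C=4 this is 1/(32-16)."""
    return 1.0 / (phi(c + 1) - phi(c))

def observe(c, rng=random):
    """Process one arriving object and return the new counter value."""
    if rng.random() < increment_probability(c):
        return c + 1
    return c

def count_stream(n, rng=random):
    """Feed n objects through one counter, starting from C=0 (N=1)."""
    c = 0
    for _ in range(n):
        c = observe(c, rng)
    return c
```

Each arrival raises the represented value by φ(C+1) − φ(C) with probability 1/(φ(C+1) − φ(C)), so the expected increase per object is exactly 1. Any single counter is still only accurate to within a constant factor.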

Motivation

• Old idea (memory-accuracy tradeoff) with some new uses
  – Morris 1978, one small register on a CPU
  – Today big data, lots of counters
• Data summarization
  – Approximate counting useful by itself, for counting all objects
    • Database merge
      – Choose most efficient algorithm, pre-allocate memory
  – May be combined with other techniques
    • Bloom filters
      – Replace 1-bit with a small counter, Van Durme & Lall 2009
      – Spread counter into multiple bits of a Bloom filter, vary the number of bits for skewed data, Talbot 2009

Generalize Function: q-ary counting and Floating Point AC

• ΔN = 2^C. Why base 2?
  – p = 2^(-C). Use fast random-bit source for increment decisions
• Csűrös 2010
  – Treat counter as binary-exponent floating point number
    • Exponent gives powers-of-two increment probabilities
    • Significand gives better accuracy than base 2
  – Stair-step approximation to “q-ary” counting
  – I.e. restricted to 9 choices for 8-bit counters (significand width d = 0..8)
• First contribution
  – Get these advantages without these restrictions

[Figure: 8-bit counter 0100 0110 split into an (8-d)-bit exponent and a d-bit significand]
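To make the exponent/significand split concrete, here is a small sketch of a floating-point counter. It follows the value function f(x) = (M + u)·2^t − M with M = 2^d, t the exponent bits and u the significand bits, which is the formulation usually attributed to Csűrös 2010; treat it as an assumption, since the slide itself does not show the formula. Function names are hypothetical.

```python
import random

def fp_value(x, d):
    """Count represented by counter x with a d-bit significand."""
    m = 1 << d
    t, u = x >> d, x & (m - 1)      # exponent bits, significand bits
    return (m + u) * (1 << t) - m

def fp_increment(x, d, rng=random):
    """Increment x with probability 2**-t, where t is the exponent."""
    t = x >> d
    # t random bits are all zero with probability 2**-t
    if t == 0 or rng.getrandbits(t) == 0:
        return x + 1
    return x
```

With d = 0 this reduces to base-2 counting (value 2^x − 1). With larger d the counter counts exactly up to 2^d and then doubles its step size every 2^d counter values, which is the stair-step behaviour described above.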

Our Flexible AC

• Flexible AC
  – Perfect counting below a threshold T, then
  – ΔN = a^(C-T), p = 1/a^(C-T), where a is any floating point value
  – a is small (<2), since 255 = log2(5.7e76)
  – Round ΔN to integer
• Still get prior speedups
  – Round all ΔN to powers-of-two
  – If speed(RandomBit) < ½ speed(RandomNumber)
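The flexible form can be sketched as follows, assuming an 8-bit counter (maximum value 255) and the rule ΔN = a^(C−T) rounded to an integer. The names `delta`, `phi`, and `increment` and the parameter `c_max` are illustrative, not the authors' code.

```python
import random

def delta(c, threshold, a):
    """Increment size when moving from counter value c to c + 1."""
    if c < threshold:
        return 1                                  # perfect counting below T
    return max(1, round(a ** (c - threshold)))    # ΔN rounded to an integer

def phi(c, threshold, a):
    """Value N represented by counter value c (sum of earlier increments)."""
    return sum(delta(j, threshold, a) for j in range(c))

def increment(c, threshold, a, rng=random, c_max=255):
    """Process one arriving object and return the new counter value."""
    if c >= c_max:
        return c                                  # saturate at the top value
    if rng.random() < 1.0 / delta(c, threshold, a):
        return c + 1
    return c
```

For example, with T = 10 and a = 2 the counter is exact up to 10, and `phi(12, 10, 2.0)` is 13: ten unit steps, then steps of 1 and 2.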

Random Bit Generator

• Many well-tested random number generators
  – Fewer random bit generators
• Knuth vol. 2 eq 10: very simple (fast!)
    A = 0x0102010081010101    // 64-bit constant
    X = X << 1                // shift left
    if overflow: X = X xor A
    RandomBit = X & 1         // lowest bit of X
  – A is your choice of primitive polynomial mod 2 with many one-bits: 8 out of 64, Rajski & Tyszer 2003
  – Every length-64 bit-sequence occurs once before repetition
• Consider accuracy in terms of intended use. What matters for our application:
  – k one-bits in a row occurs 1 in 2^k times
  – Generated 2^47 bits; verified experimentally that 42 one-bits in a row occurs 1 in 2^42 times
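A direct Python transcription of the shift-register pseudocode, as a sketch. The 64-bit constant is copied from the slide and has not been independently checked for primitivity here; the class name and the `width`/`taps` parameters are assumptions added so that a small 4-bit register (polynomial x^4 + x + 1) can be checked exhaustively.

```python
A64 = 0x0102010081010101   # 64-bit constant from the slide (not re-verified)

class ShiftRegisterBits:
    """Shift left; on overflow, xor with the tap constant; use the low bit."""

    def __init__(self, seed=1, taps=A64, width=64):
        self.width = width
        self.mask = (1 << width) - 1
        self.taps = taps
        self.state = seed & self.mask
        if self.state == 0:
            raise ValueError("seed must be nonzero")

    def next_bit(self):
        overflow = (self.state >> (self.width - 1)) & 1   # bit shifted out
        self.state = (self.state << 1) & self.mask
        if overflow:
            self.state ^= self.taps
        return self.state & 1

    def k_ones(self, k):
        """True if the next bits are k ones in a row (stops at the first zero)."""
        for _ in range(k):
            if self.next_bit() == 0:
                return False
        return True
```

With `taps=0b0011, width=4` the register steps through all 15 nonzero states before repeating, which is the small-scale analogue of the full-period property claimed for the 64-bit constant.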

Speed Comparison

• If this is embedded in a datastream application, speed may be important.

• Random number generator is the bottleneck (goal is incrementing a counter!)

    if RandNumber < p: increment

    // p = 2^(-k)
    if k RandomBits in a row are one: increment
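The two decision rules can be written side by side. This sketch uses Python's standard `random` module as a stand-in for both generators, so it illustrates the logic only and says nothing about the relative speeds reported in the talk; the function names are hypothetical.

```python
import random

def increment_by_number(p, rng):
    """Increment decision from one uniform random number in [0, 1)."""
    return rng.random() < p

def increment_by_bits(k, rng):
    """Increment decision for p = 2**-k: succeed only on k one-bits in a row."""
    for _ in range(k):
        if rng.getrandbits(1) == 0:
            return False        # stop early at the first zero bit
    return True
```

The bit-based test stops at the first zero bit, so on average it consumes fewer than two bits per decision regardless of k.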

Random Countdown Speedup

• Why generate a random number every time?
  – Set countdown counter P
    • P = number of times in a row RandNumber > p [no increment]
    • This is the definition of a geometric distribution
  – Need one countdown counter per counter value (1..255), not per counter (billions)
  – Calculating P is (relatively) very expensive
    • Fast on average if P is large, i.e. p is small
• Hybrid algorithm
  – RandNumber < p, or RandomBit, for small P (large p)
  – Random Countdown for large P (small p)
  – “small” means <10 or <22
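A sketch of the random countdown, assuming the usual inverse-transform sampling of a geometric distribution (the slide does not say how P is calculated). `draw_countdown` and `CountdownTable` are hypothetical names; the table keeps one countdown per counter value, shared by all counters currently at that value.

```python
import math
import random

def draw_countdown(p, rng=random):
    """Number of consecutive non-increments before the next increment."""
    if p >= 1.0:
        return 0
    u = 1.0 - rng.random()                    # uniform in (0, 1]
    return int(math.log(u) / math.log1p(-p))  # geometric, failures before success

class CountdownTable:
    """One shared countdown per counter value, not per counter."""

    def __init__(self, probabilities, rng=random):
        self.p = list(probabilities)          # p[c] = increment probability at value c
        self.rng = rng
        self.left = [draw_countdown(p, rng) for p in self.p]

    def should_increment(self, c):
        if self.left[c] > 0:
            self.left[c] -= 1                 # cheap path: just count down
            return False
        self.left[c] = draw_countdown(self.p[c], self.rng)
        return True
```

The expensive logarithm is evaluated once per increment rather than once per arrival, which is why this pays off only when p is small.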

Fixed Countdown Speedup

• Why generate a random number at all?
  – Increment “1 in Δφ” times deterministically
    • Slightly different value to get correct expected value
    • Best possible accuracy if only one item
    • Fastest
    • Relies on randomness of stream
      – E.g. alternating items give bad counts
• Speed: RandomCount vs. FixedCount
  – RandomCount = 1.5x FixedCount for Δφ=255
  – RandomCount = ¼x RandomBit for Δφ=172
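The failure mode for non-random streams is easy to reproduce. The sketch below is a deliberate simplification: it uses a single fixed increment size (`period`) shared by all items instead of a growing Δφ per counter value, and the function name is hypothetical.

```python
from collections import defaultdict

def fixed_countdown_counts(stream, period):
    """Every `period`-th arrival adds `period` to the arriving item's estimate."""
    estimates = defaultdict(int)
    remaining = period - 1            # arrivals to skip before the next increment
    for item in stream:
        if remaining > 0:
            remaining -= 1
        else:
            estimates[item] += period
            remaining = period - 1
    return dict(estimates)
```

A stream of 100 copies of one item with `period=4` is counted exactly as 100. The alternating stream `["a", "b"] * 50` with `period=2` credits all 100 arrivals to "b" and none to "a", although each occurred 50 times.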

Punchline

How High Do You Want to Count? Inverse problem (David M. Day)

• Find a, never discussed in approximate counting literature
  – For some applications, determine by hand ahead of time
  – Our run-time solution
  – Inverse geometric sum: tricky case
• Find root > 1 of r(a)
  – Initial guess depends on s compared to K, i.e. a^(K+1) vs. sa vs. the constant (s-1)
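The slide's formula for r(a) did not survive in this transcript. One reading consistent with the three terms listed is r(a) = a^(K+1) − s·a + (s − 1), which follows from setting the geometric sum 1 + a + … + a^K equal to s; this is an assumption. Under it, a = 1 is always a trivial root, hence the need for the root greater than 1. The sketch below sidesteps the trivial root by bisecting on the sum itself. It is a plain binary search, not the authors' initial-guess scheme, and the names are hypothetical.

```python
def geometric_sum(a, k):
    """1 + a + a**2 + ... + a**k, evaluated by Horner's rule."""
    total = 0.0
    for _ in range(k + 1):
        total = total * a + 1.0
    return total

def solve_base(s, k, iterations=200):
    """Find a > 1 with geometric_sum(a, k) == s; requires s > k + 1."""
    if k < 1 or s <= k + 1:
        raise ValueError("need k >= 1 and s > k + 1 for a root greater than 1")
    lo, hi = 1.0, float(s) ** (1.0 / k)   # sum(hi) >= hi**k + 1 > s
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if geometric_sum(mid, k) < s:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For instance, `solve_base(511, 8)` recovers a = 2 up to floating-point error, since 1 + 2 + … + 2^8 = 511.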

Inverse Problem Alternatives

• We’re only approximately counting,
  – So accuracy may not be important
• We only calculate function once,
  – So efficiency may not be important

(Application dependent)

– Use the initial guesses
– Use binary search or lookup table
– Use N=φ(C) function with easier inverse
  • E.g. exponential + linear function, but increments are too small for small C

Conclusion

• Flexible Approximate Counting provides
  – Customization of functional form
    • At run-time, for maximum value to count to
  – Fast decisions of whether to increment
    • If datastream is sufficiently random
      – Use fixed countdown
    • Else
      – Switch to random countdown for large increments
    • If speed is more important than accuracy for small increments
      – Use random bits and power-of-two increments
• Random generator accuracy limits
  – Consider the intended use
    • RandNumber: min r such that probability(u<r) ≈ r
    • RandomBit: max k such that probability(k one-bits in a row) ≈ 2^(-k)
• Thank you
  – Have a safe trip home
