Post on 14-Apr-2017
–Hal Abelson and Gerald J. Sussman, SICP
“In testing primality of very large numbers chosen at random, the chance of stumbling upon a value that
fools the Fermat test is less than the chance that cosmic radiation will cause the computer to make
an error in carrying out a 'correct' algorithm. Considering an algorithm to be inadequate for the first reason but not for the second illustrates the
difference between mathematics and engineering.”
Load Balancing
Monitoring
Application
CDN
DatabasesCount-distinct
Join-shortest-queue
Reliable broadcast
importnumpyasnpimportnumpy.randomasnr
n=8#numberofserversm=1000#numberofrequestsbins=[0]*n
mu=0.0sigma=1.15
forweightinnr.lognormal(mu,sigma,m):chosen_bin=nr.randint(0,n)bins[chosen_bin]+=normalize(weight)
[100.7,137.5,134.3,126.2,113.5,175.7,101.6,113.7]
importnumpyasnpimportnumpy.randomasnr
n=8#numberofserversm=1000#numberofrequestsbins=[0]*n
mu=0.0sigma=1.15
forweightinnr.lognormal(mu,sigma,m):a=nr.randint(0,n)b=nr.randint(0,n)chosen_bin=aifbins[a]<bins[b]elsebbins[chosen_bin]+=normalize(weight)
[130.5,131.7,129.7,132.0,131.3,133.2,129.9,132.6]
[100.7,137.5,134.3,126.2,113.5,175.7,101.6,113.7]
[130.5,131.7,129.7,132.0,131.3,133.2,129.9,132.6]
STANDARD DEVIATION: 1.18
STANDARD DEVIATION: 22.9
Load Balancing
Monitoring
Application
CDN
DatabasesCount-distinct
Join-shortest-queue
Reliable broadcast
166.208.249.236--[25/Feb/2015:07:20:13+0000]"GET/product/982"20020246103.138.203.165--[25/Feb/2015:07:20:13+0000]"GET/article/490"20029870191.141.247.227--[25/Feb/2015:07:20:13+0000]"HEAD/page/1"20020409150.255.232.154--[25/Feb/2015:07:20:13+0000]"GET/page/1"20042999191.141.247.227--[25/Feb/2015:07:20:13+0000]"GET/article/490"20025080150.255.232.154--[25/Feb/2015:07:20:13+0000]"GET/listing/567"20033617103.138.203.165--[25/Feb/2015:07:20:13+0000]"HEAD/listing/567"20029618191.141.247.227--[25/Feb/2015:07:20:13+0000]"HEAD/page/1"20030265166.208.249.236--[25/Feb/2015:07:20:13+0000]"GET/page/1"2005683244.210.202.222--[25/Feb/2015:07:20:13+0000]"HEAD/article/490"20047124103.138.203.165--[25/Feb/2015:07:20:13+0000]"HEAD/listing/567"20048734103.138.203.165--[25/Feb/2015:07:20:13+0000]"GET/listing/567"20027392191.141.247.227--[25/Feb/2015:07:20:13+0000]"GET/listing/567"20015705150.255.232.154--[25/Feb/2015:07:20:13+0000]"GET/page/1"20022587244.210.202.222--[25/Feb/2015:07:20:13+0000]"HEAD/product/982"20030063244.210.202.222--[25/Feb/2015:07:20:13+0000]"GET/page/1"2006041166.208.249.236--[25/Feb/2015:07:20:13+0000]"GET/product/982"20025783191.141.247.227--[25/Feb/2015:07:20:13+0000]"GET/article/490"2001099244.210.202.222--[25/Feb/2015:07:20:13+0000]"GET/product/982"20031494191.141.247.227--[25/Feb/2015:07:20:13+0000]"GET/listing/567"20030389150.255.232.154--[25/Feb/2015:07:20:13+0000]"GET/article/490"20010251191.141.247.227--[25/Feb/2015:07:20:13+0000]"GET/product/982"20019384150.255.232.154--[25/Feb/2015:07:20:13+0000]"HEAD/product/982"20024062244.210.202.222--[25/Feb/2015:07:20:13+0000]"GET/article/490"20019070191.141.247.227--[25/Feb/2015:07:20:13+0000]"GET/page/648"20045159191.141.247.227--[25/Feb/2015:07:20:13+0000]"HEAD/page/648"2005576166.208.249.236--[25/Feb/2015:07:20:13+0000]"GET/page/648"20041869166.208.249.236--[25/Feb/2015:07:20:13+0000]"GET/listing/567"20042414
defcombined_cardinality(seen_sets):combined=set()forseeninseen_sets:combined|=seenreturnlen(combined)
classLogLog(object):def__init__(self,k):self.k=kself.m=2**kself.M=np.zeros(self.m,dtype=np.int)self.alpha=Alpha[k]definsert(self,token):y=hash_fn(token)j=y>>(hash_len-self.k)remaining=y&((1<<(hash_len-self.k))-1)first_set_bit=(64-self.k)-int(math.log(remaining,2))self.M[j]=max(self.M[j],first_set_bit)defcardinality(self):returnself.alpha*2**np.mean(self.M)
HyperLogLog
• Adding an item: O(1)
• Retrieving cardinality: O(1)
• Space: O(log log n)
• Error rate: 2%
HyperLogLog
For 100 million unique items, and an error rate of 2%,
the size of the HyperLogLog is...
1,500 bytes
Load Balancing
Monitoring
Application
CDN
DatabasesCount-distinct
Join-shortest-queue
Reliable broadcast
End-to-End Latency
42ms
74ms
83ms
133ms
New York
London
San Jose
Tokyo
0.00
0.05
0.10
0.00
0.05
0.10
0.00
0.05
0.10
0.00
0.05
0.10
0 50 100 150Latency (ms)
Den
sity
Density plot and 95th percentile of purge latency by server location
End-to-End Latency42ms
74ms
83ms
133ms
New York
London
San Jose
Tokyo
0.00
0.05
0.10
0.00
0.05
0.10
0.00
0.05
0.10
0.00
0.05
0.10
0 50 100 150Latency (ms)
Den
sity
Density plot and 95th percentile of purge latency by server location
Load Balancing
Monitoring
Application
CDN
DatabasesCount-distinct
Join-shortest-queue
Reliable broadcast
Bloom Filters
Hash tables!ECMP
Consistent hashing
Quicksort
Probabilistic Algorithms
1. An iota of theory
2. Where are they useful and where are they not?
3. HyperLogLog
4. Locality-sensitive Hashing
5. Bimodal Multicast
Las Vegasdeffind_las_vegas(haystack,needle):length=len(haystack)whileTrue:index=randrange(length)ifhaystack[index]==needle:returnindex
Monte Carlodeffind_monte_carlo(haystack,needle,k):length=len(haystack)foriinrange(k):index=randrange(length)ifhaystack[index]==needle:returnindex
– Prabhakar Raghavan (author of Randomized Algorithms)
“For many problems a randomized algorithm is the simplest the
fastest or both.”
Slightly Less Naive
ips_seen=BloomFilter(capacity=expected_size,error_rate=0.03)counter=0forlineinlog_file:ip=extract_ip(line)ifitems_bloom.add(ip):counter+=1print"UniqueIPs:",counter
Slightly Less Naive
• Adding an IP: O(1)
• Retrieving cardinality: O(1)
• Space: O(n)
• Error rate: 3%
kind of
Slightly Less Naive
For 100 million unique IPv4 addresses, and an error rate of 3%,
the size of the bloom filter is...
87mb
definsert(self,token):#Gethashoftokeny=hash_fn(token)#Extract`k`mostsignificantbitsof`y`j=y>>(hash_len-self.k)#Extractremainingbitsof`y`remaining=y&((1<<(hash_len-self.k))-1)#Find"first"setbitof`remaining`first_set_bit=(64-self.k)-int(math.log(remaining,2))#Update`M[j]`tomaxof`first_set_bit`#andexistingvalueof`M[j]`self.M[j]=max(self.M[j],first_set_bit)
defcardinality(self):#Themeanof`M`estimates`log2(n)`with#anadditivebiasreturnself.alpha*2**np.mean(self.M)
“There has been a lot of recent work on streaming algorithms, i.e. algorithms that produce an output by making one pass (or a few passes) over the data while using a limited amount of storage space and time. To cite a few examples, ...”
{"there":1,"has":1,"been":1,“a":4,"lot":1,"of":2,"recent":1,...}