Statistics 101 for System Administrators
EuroPython 2014, 22nd July - Berlin
Roberto Polli - [email protected]
Babel Srl P.zza S. Benedetto da Norcia, 33 - 00040, Pomezia (RM) - www.babel.it
Who? What? Why?
• Using (and learning) elements of statistics with python.
• Roberto Polli - Community Manager @ Babel.it. Loves writing in C, Java and Python. Red Hat Certified Engineer and Virtualization Administrator.
• Babel – Proud sponsor of this talk ;) Delivers large mail infrastructures based on Open Source software for Italian ISP and PA. Contributes to various FLOSS.
Agenda
• A latency issue: what happened?
• Correlation in 30”
• Combining data
• Plotting time
• modules: scipy, matplotlib
A Latency Issue
• Episodic network latency issues
• Log traces: message size, #peers, retransmissions
• Do we need to scale? Was it a peak problem?
Find a rapid answer with python!
Basic statistics
Python provides basic statistics, like:

    # Current scipy.stats no longer ships mean/std: take them from numpy.
    from numpy import mean  # x̄
    from numpy import std   # σX

    T = {'ts': (1, 2, 3, ...),
         'late': (0.12, 6.31, 0.43, ...),
         'peers': (2313, 2313, 2312, ...),
         ...}

    # One [name, max, min, mean, std] row per series
    print([[k, max(X), min(X), mean(X), std(X)]
           for k, X in T.items()])
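The same summary can be tried without any third-party package. The following is only a sketch: the sample values and the `summarize` helper are made up for illustration, and `pstdev` is the population standard deviation (it divides by n).

```python
from statistics import mean, pstdev

# Hypothetical sample data, shaped like the T dictionary of the slide.
T = {
    "late": (0.12, 6.31, 0.43, 0.25),
    "peers": (2313, 2313, 2312, 2310),
}


def summarize(table):
    """Return {field: (max, min, mean, population std)} for each series."""
    return {
        name: (max(xs), min(xs), mean(xs), pstdev(xs))
        for name, xs in table.items()
    }


for name, stats in summarize(T).items():
    print(name, ["{:.3f}".format(s) for s in stats])
```

A large standard deviation compared with the mean, as in the hypothetical 'late' series, is a first hint that a few outliers dominate the data.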
Distributions
Data distribution - aka δX - shows event frequency.
    # The fastest way to get a
    # distribution is
    from matplotlib import pyplot as plt
    freq, bins, _ = plt.hist(T['late'])

    # plt.hist returns a (counts, bin edges, patches) tuple;
    # pair each bin's left edge with its count
    distribution = zip(bins, freq)
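To see what a histogram actually computes, here is a minimal pure-Python sketch of equal-width binning. The `histogram` helper and the RTT values are hypothetical; `plt.hist` does the equivalent work (and draws the bars) for you.

```python
def histogram(values, nbins=10):
    """Count how many values fall in each of nbins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / nbins or 1.0  # avoid a zero width on constant data
    counts = [0] * nbins
    for v in values:
        # The maximum value lands in the last bin instead of overflowing.
        idx = min(int((v - lo) / width), nbins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(nbins + 1)]
    return counts, edges


# Hypothetical ping round-trip times, in ms.
rtt = [158.3, 158.4, 158.4, 158.6, 159.2, 159.3, 160.7, 161.9]
counts, edges = histogram(rtt, nbins=4)
for left, count in zip(edges, counts):
    print("{:7.2f} ms  {}".format(left, "#" * count))
```

Note that there is one more edge than there are counts, which is why zipping edges with counts pairs each count with the left edge of its bin.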
A ping rtt distribution
[Figure: histogram "Ping RTT distribution"; x axis: rtt in ms, 158.0 to 162.0; y axis: 0.0 to 4.0]
Correlation I
Are two data series X, Y related?
Given ∆xi = xi − x̄, Mr. Pearson answered with this formula:

\[
\rho(X, Y) = \frac{\sum_i \Delta x_i \, \Delta y_i}{\sqrt{\sum_i \Delta x_i^2 \, \sum_i \Delta y_i^2}} \in [-1, +1] \qquad (1)
\]
ρ identifies if the values of X and Y ‘move’ together on the same line.
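Pearson's formula translates almost literally into code. This is a sketch for understanding only (the `pearson` name is ours); for real work a library implementation is the safer choice.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson's correlation coefficient of two equally long series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of the same length (>= 2)")
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    dx = [x - x_mean for x in xs]  # the deviations of x from its mean
    dy = [y - y_mean for y in ys]  # the deviations of y from its mean
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denominator == 0:
        raise ValueError("a constant series has no defined correlation")
    return numerator / denominator


print(pearson([1, 2, 3, 4], [10, 20, 30, 40]))  # points on a rising line
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))      # points on a falling line
```

Points on a rising line give a coefficient close to +1, points on a falling line close to −1; a constant series makes the denominator zero, so the coefficient is undefined.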
You must (scatter) plot
ρ doesn’t find non-linear correlation!
Probability Indicator
Python scipy provides a correlation function, returning two values:
• the ρ correlation coefficient ∈ [−1,+1]
• the probability that such datasets are produced by uncorrelated systems

    from random import randint
    from scipy.stats import pearsonr  # our beloved ρ

    a, b = range(0, 100), range(0, 400, 4)
    c, d = [randint(0, 100) for x in a], [randint(0, 100) for x in a]
    correlation, probability = pearsonr(a, b)  # ρ = 1.000, p = 0.000
    # c and d are random: values change at every run
    correlation, probability = pearsonr(c, d)  # e.g. ρ = −0.041, p = 0.683
Combinations
itertools is a pot of gold full of useful tools.

    from itertools import combinations
    # returns all possible combinations of
    # items grouped by N at a time
    items = "heart spades clubs diamonds".split()
    combinations(items, 2)

    # And now all possible combinations between
    # dataset fields!
    combinations(T, 2)

Combining 4 suits, 2 at a time:

♥♠ ♥♣ ♥♦ ♠♣ ♠♦ ♣♦
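Since `combinations` returns a lazy iterator, wrapping it in `list` is the easiest way to look at the pairs. The sketch below also checks the count against the binomial coefficient via `math.comb` (Python 3.8+), which tells in advance how many field pairs, and later how many plots, a dataset will produce.

```python
from itertools import combinations
from math import comb

suits = "heart spades clubs diamonds".split()

# combinations() is lazy: materialize it to inspect the pairs.
pairs = list(combinations(suits, 2))
for first, second in pairs:
    print(first, second)

# n items taken 2 at a time give n * (n - 1) / 2 unordered pairs.
print(len(pairs), comb(len(suits), 2))
```

Order does not matter here: ("heart", "spades") is produced, ("spades", "heart") is not, so each pair of fields is examined only once.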
Netfishing correlation I
    from itertools import combinations
    from scipy.stats import pearsonr

    # Now we have all the ingredients for
    # net-fishing relations between our data!
    for (k1, v1), (k2, v2) in combinations(T.items(), 2):
        # Look for correlations between every dataset!
        corr, prob = pearsonr(v1, v2)

        if corr > .6:
            print("Series", k1, k2, "can be correlated", corr)
        elif prob < 0.05:
            print("Series", k1, k2, "probability lower than 5%", prob)
Netfishing correlation II
Now plot all combinations: there’s more than meets the eye!

    # Plot everything, and insert data in plots!
    for (k1, v1), (k2, v2) in combinations(T.items(), 2):
        corr, prob = pearsonr(v1, v2)
        plt.scatter(v1, v2)

        # 3 digit precision on title
        plt.title("R={:0.3f} P={:0.3f}".format(corr, prob))
        plt.xlabel(k1)
        plt.ylabel(k2)

        # save and close the plot
        plt.savefig("{}_{}.png".format(k1, k2))
        plt.close()
Color is the 3rd dimension
    from itertools import combinations, cycle

    colors = cycle("rgb")  # use more than 3 colors!
    labels = cycle("morning afternoon night".split())
    datalen = len(T['ts'])  # assumes all series have the same length
    size = datalen // 3     # 3 colors, right? (integer slice size)
    for (k1, v1), (k2, v2) in combinations(T.items(), 2):
        # one scatter call per time slot, each with its own color
        for i in range(0, datalen, size):
            plt.scatter(v1[i:i+size], v2[i:i+size],
                        color=next(colors), label=next(labels))
        # set title, save plot & co
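The slicing step is worth trying on its own, because a series whose length is not a multiple of three needs some care. A minimal sketch, with a hypothetical `labelled_chunks` helper that rounds the chunk size up so that no leftover fourth chunk appears:

```python
def labelled_chunks(series, labels):
    """Split series into len(labels) consecutive chunks, one per label."""
    if not series:
        return
    # Ceiling division: the last chunk may be shorter, never an extra one.
    size = -(-len(series) // len(labels))
    for start, label in zip(range(0, len(series), size), labels):
        yield label, series[start:start + size]


# Hypothetical latency samples, ordered by time of day.
late = [0.12, 0.15, 0.11, 6.31, 5.90, 6.02, 0.43]
for label, chunk in labelled_chunks(late, ["morning", "afternoon", "night"]):
    print(label, chunk)
```

Each (label, chunk) pair can then be passed to one `plt.scatter` call with its own color, as in the slide.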
Latency Solution
• Latency wasn’t related to packet size or system throughput
• Errors were not related to packet size
• Discovered system throughput
Wrap Up
• Use statistics: it’s easy
• Don’t use ρ to exclude relations
• Plot, Plot, Plot
• Continue collecting results
That’s all folks!
Thank you for the attention!
Roberto Polli - [email protected]