DRAFT: Python for System Administrators

Post on 10-May-2015


Draft of the EP14 Training

Python for System Administrators

EuroPython 2014, 24th July - Berlin

Roberto Polli - roberto.polli@babel.it

Babel Srl, P.zza S. Benedetto da Norcia, 33 - 00040 Pomezia (RM) - www.babel.it


Agenda

• Intro
• ipython
• Path management: 10'
• Encoding: 10'
• Data Gathering: 20'
    • module: psutil
    • module: subprocess
    • The /proc filesystem
• Parsing: 60'
    • Regular Expressions
• Nosetest Intermezzo: 15'
• Processing: 45'
    • Distributions
    • Deviation
    • Correlation
    • Plotting Time
• End

Who? What? Why?

• Use Python to replace grep, awk, sed and Perl. Speed up your daily job.
• Roberto Polli - Community Manager @ Babel.it. Loves writing in C, Java and Python. Red Hat Certified Engineer and Virtualization Administrator.
• Babel - Proud sponsor of this talk ;) Delivers large mail infrastructures based on Open Source software for Italian ISPs and PA. Contributes to various FLOSS projects.

Requirements

• python 2.7+, ipython
• course code from github
    # git clone https://github.com/ioggstream/python-course
• test your environment (eg. psutil, numpy, scipy, matplotlib)
    # nosetests -vs test_prerequisites.py
• first part: nose, psutil
• second part: scipy, numpy, matplotlib
• ♦ optional/advanced content ♦

How

• Get ready before starting: the code is on github!
• Type everything but #comments and try/except
• Type fast with tab-completion and copy-paste
• Be curious: inspect and print returned variables
• Never* close your iPython session: you'll lose your precious variables

* (ok, sometimes you can).

References

• irc.freenode.net #python - The Python Community :D
• Python Cookbook, 3rd ed., O'Reilly - David Beazley and Brian K. Jones
• Programming Python, 4th ed., O'Reilly - Mark Lutz
• Dive into Python 3, 2nd ed., Apress - Mark Pilgrim
• nose.readthedocs.org
• github.com/ioggstream/python-course

iPython I

• Interactive interpreter with tons of functionalities, and the main tool of our training.
• The most fun way to learn and use python!
• Supports tab-completion, readline and inline help
• Allows pasting from the clipboard with %paste, and multi-line editing with %edit
• Run it enabling plotting support:
    # ipython --pylab

iPython II

    # iPython supports inline help: append ? to an object
    str?

    # We can run commands and capture the output in a variable;
    # on unix we don't need to quote when using the ! magic
    ret = !cat /etc/hosts

    # windows has etc\hosts too ;)
    ret = !type c:\windows\system32\drivers\etc\hosts

iPython III

    # returned objects can be filtered with
    ret.grep('localhost')
    # Now get the first space-separated column of the output
    ret.fields(0)
    ret.grep('localhost').fields(0)

    # And the last returned value is stored in
    localip = _

    # We can type long commands in an editor like `vi` using
    %edit mytmp.py  # type print(ret[0]), then exit (eg. wq!)
    > Editing... done. Executing edited code...

Path management: Goal

• Normalize paths on different platforms
• Create, copy and remove folders
• Handle errors

modules: os, os.path, shutil, errno
see also: pathlib on Python 3.4+

Path management: os.path, sys

    basedir, hosts = "/", "etc/hosts"
    # Check the hosting platform with the sys module
    from sys import platform
    if platform.startswith('win'):
        basedir = 'c:/windows/system32/drivers'

    # Always use the os.path module!
    from os.path import join, normpath
    hosts = join(basedir, hosts)
    hosts = normpath(hosts)
    print("Normalized path is", hosts)

Path management: os.path, sys

• os.path is the best way to manage paths!
    • multiplatform
    • safe
• join removes redundant "/"
• normpath fixes "/" orientation and redundant ".."
• realpath resolves symlinks

And now, a quick glance at other tools.
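To see what join and normpath do on each platform without switching machines, here is a minimal sketch. It assumes we call the platform-specific modules posixpath and ntpath directly (os.path is simply one of the two, chosen at import time); the sample paths are the ones used in the previous slide.

```python
import ntpath      # the Windows flavour of os.path
import posixpath   # the Unix flavour of os.path

# join adds the separator only where needed, normpath removes the doubled one
posix_hosts = posixpath.normpath(posixpath.join("/", "etc//hosts"))

# on Windows normpath also flips "/" into "\"
win_hosts = ntpath.normpath(
    ntpath.join("c:/windows/system32/drivers", "etc/hosts"))

# normpath collapses redundant ".." components
collapsed = posixpath.normpath("/tmp/py/foo/../foo2")

print(posix_hosts)
print(win_hosts)
print(collapsed)
```

In everyday code keep importing from os.path: the two modules are used here only to show both behaviours side by side.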

Move trees: shutil, os, os.path

    from os import makedirs    # ...tree creation...
    from os.path import isdir  # ...checking...
    from shutil import copytree, rmtree
    makedirs("/tmp/py/foo/bar")

    # We can copy a whole tree and test it
    copytree("/tmp/py/foo", "/tmp/py/foo2")
    assert isdir("/tmp/py/foo2/bar")

    rmtree("/tmp/py/foo")  # ... and finally delete it
    assert not isdir("/tmp/py/foo/bar")

Move trees: errno

    # We can use exception handlers to investigate errors
    try:
        # python2 cannot ignore existing directories...
        makedirs("/tmp/py/foo/bar")
        # ...and raises an OSError
    except OSError as e:
        # Just use the errno module to check the error value
        import errno
        assert e.errno == errno.EEXIST

    help(makedirs)
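The errno check above can be wrapped in a small helper. The sketch below is one possible way to do it, not part of the course code: ensure_dir is a hypothetical name, and a temporary directory is used instead of /tmp/py so the example cleans up after itself. On Python 3.2+ the exist_ok argument of os.makedirs gives the same result.

```python
import errno
import os
import shutil
import sys
import tempfile


def ensure_dir(path):
    """Create path (hypothetical helper); an existing directory is not an error."""
    try:
        os.makedirs(path)
    except OSError as e:
        # re-raise anything that is not "the directory is already there"
        if e.errno != errno.EEXIST or not os.path.isdir(path):
            raise


base = tempfile.mkdtemp()
target = os.path.join(base, "foo", "bar")

ensure_dir(target)
ensure_dir(target)  # the second call is a no-op instead of an OSError

if sys.version_info >= (3, 2):
    os.makedirs(target, exist_ok=True)  # built-in equivalent on Python 3.2+

existed = os.path.isdir(target)
shutil.rmtree(base)  # remove the temporary tree
```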

Encoding: Goal

• A string is more than a sequence of bytes
• A string is a couple (bytes, encoding)
• Use unicode literals in python2
• Manage differently encoded filenames
• A string is not a sequence of bytes

modules: os, os.path, glob

Song of Childhood

[Slide background: the German text of the poem "Song of Childhood", whose stanzas open with "Als das Kind Kind war, ..." ("When the child was a child, ...")]

"When the child was a child,
characters were bytes, and
strings list of bytes"

Encoding is a map

    # Py3 doesn't need the u prefix
    the_string = u"S\u00fcd"  # Süd

    # It can be encoded in different ways
    in_utf8 = the_string.encode('utf-8')
    in_win = the_string.encode('cp1252')

    type(in_utf8) == bytes  # byte-sequences

    # Decoding bytes using the wrong map...
    # ...gives sad results ;)
    in_utf8.decode('cp1252')  # SÃ¼d

• Encoding is a one-to-one map between a typographical character and a byte-sequence
• Decoding is its reverse map

    char   ascii   utf-8        cp1252
    a      [97]    [97]         [97]
    ü      -       [195, 188]   [252]
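The table can be rebuilt in the interpreter. This is a small sketch that works on both Python 2.7 and 3: byte_values is a helper name introduced here only for illustration, and bytearray is used because it yields integers on both versions.

```python
# -*- coding: utf-8 -*-
the_string = u"S\u00fcd"  # Süd


def byte_values(text, codec):
    """Return the list of byte values of text in the given codec, or None."""
    try:
        return list(bytearray(text.encode(codec)))
    except UnicodeEncodeError:
        return None  # the codec has no byte-sequence for some character


table = {codec: byte_values(u"\u00fc", codec)
         for codec in ("ascii", "utf-8", "cp1252")}
print(table)

# Encoding with one map and decoding with another gives mojibake
mojibake = the_string.encode("utf-8").decode("cp1252")
```

The two bytes that UTF-8 uses for ü are read by cp1252 as two separate characters, which is why the decoded string is one character longer than the original.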

Enters Encoding

    # Filenames are binary data! Be careful when reading from
    # a (eg. vfat) filesystem!
    # To make python2 encoding-aware we should
    from __future__ import unicode_literals

    # Create 3 windows-encoded filenames in
    basedir = "/tmp/py"

    # using the provided function
    from course import create_wuerstelstrasse
    create_wuerstelstrasse(basedir)

Encoded filenames: glob

    from glob import glob as ls  # expands wildcards like a shell

    files = ls("/tmp/py/*.txt")  # To avoid encoding issues...
    # UnicodeDecodeError: 'ascii' codec can't decode byte 0xFC
    0xFC == 252  # remember the ü in the cp1252 map?

    files = ls(b"/tmp/py/*.txt")  # ...we explicitly use bytes

Data Gathering: Goal

Gathering system data with multiplatform and platform-dependent tools.

• Get infos from files, /proc and /sys
• Capture command output
• Use psutil to get IO, CPU and memory data
• Parse files with a strategy

modules: psutil, subprocess, os

Data Gathering: grep

    def grep(needle, fpath):
        """A minimal grep implementation.

        goal: open() is iterable and doesn't need splitlines()
        goal: comprehension can filter iterables
        """
        return [x for x in open(fpath) if needle in x]

    # Do we have "localhost" in our "/etc/hosts"?
    grep("localhost", "/etc/hosts")

Data Gathering: psutil

    # The psutil module is very nice!
    import psutil

    # Works on Windows, Linux and MacOS
    psutil.cpu_percent()

    # And its output is easy to manage
    psutil.disk_io_counters()

Exercise: Which other information does psutil provide?

Data Gathering: Exercises

Write a vmstat-like function printing every second:
• cpu usage %;
• bytes read and written in the given interval;
• Hint: use psutil, time.sleep(1)
• Hint: try on ipython and then write the function using %edit vmstat.py

Data Gathering: subprocess

    # The check_output function returns the command stdout
    from subprocess import check_output

    # It takes a list as an argument!
    out = check_output("ping -w1 -c1 www.google.com".split())

    # and returns a string (a byte string on python3)
    print(out)
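A variant that runs without network access, as a sketch: it assumes we launch the current interpreter (sys.executable) as the child command instead of ping, so it behaves the same on every platform. It also shows that on Python 3 the result must be decoded before being treated as text.

```python
import sys
from subprocess import check_output

# The command is a list: program first, then one item per argument
cmd = [sys.executable, "-c", "print('hello')"]

raw = check_output(cmd)             # bytes on Python 3, str on Python 2
text = raw.decode("utf-8").strip()  # decode explicitly, drop the newline

print(repr(raw), repr(text))
```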

Data Gathering: subprocess, sys

    from subprocess import check_output

    def sh(cmd, shell=False, timeout=0):
        """Returns an iterable output of a command string, checking ... """
        from sys import version_info as python_version
        if python_version < (3, 3):  # ..before using...
            if timeout:
                raise ValueError("Timeout not supported")
            output = check_output(cmd.split(), shell=shell)
        else:
            # timeout=0 means "no timeout", which check_output spells None
            output = check_output(cmd.split(), shell=shell,
                                  timeout=timeout or None)
        return output.splitlines()

Data Gathering: Exercises

Write a simple pgrep-like function for your OS, with the following signature:

    def ppgrep(program):
        """@param program - eg. firefox, explorer.exe"""
        raise NotImplementedError

• it prints a list of processes executing `program`;
• Hint: use subprocess, os, and list-comprehension

    items = [x for x in a_list if 'firefox' in x]

♦ Data Gathering: Parsing /proc I ♦

    def linux_threads(pid):
        """The Linux /proc filesystem is a cool place to get infos."""
        from glob import glob  # replaces * and ?
        path = "/proc/{}/task/*/status".format(pid)

        # Pick a set of fields to gather...
        t_info = ('Pid', 'Tgid', 'voluntary')  # a tuple
        for t_path in glob(path):
            # ...and use comprehension to get interesting data.
            print([x for x in open(t_path)
                   if x.startswith(t_info)])  # accepts tuples!

Data Gathering: Parsing /proc II

    # On Linux, /proc/diskstats is the source of I/O infos
    disk_l = grep("sda", "/proc/diskstats")

    # To gather that data we put the headers in a multi-line string
    from course import diskstats_headers as headers

    disk_info = disk_l[0].split()  # Take the 1st entry, split the data...
    zip(headers, disk_info)        # ...and tie them with the headers
    list(_)                        # On py3 you need to iterate the generator!

Data Gathering: Parsing /proc III

    # Or create a reusable commodity class with
    from collections import namedtuple
    # using headers as attributes
    # like the one provided by psutil
    DiskStats = namedtuple('DiskStat', headers)

    # ... and disk_info as values
    dstat = DiskStats(*disk_info)
    dstat.device, dstat.writes_ms

    # Homework: check further features with
    help(collections)
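The same steps can be tried away from a Linux box. In this sketch both the header names and the sample line are assumptions made up for illustration: the real headers come from course.diskstats_headers, and only device and writes_ms are names that appear in the slides.

```python
from collections import namedtuple

# Hypothetical header names, one per column of a 14-column diskstats line
headers = ("major minor device reads reads_merged sectors_read reads_ms "
           "writes writes_merged sectors_written writes_ms "
           "io_in_progress io_ms weighted_io_ms").split()

# A made-up line in the style of /proc/diskstats
sample = "8 0 sda 4187 1322 251182 2440 9731 14110 1004392 11812 0 6272 14252"
disk_info = sample.split()

# zip ties each header to its value...
as_dict = dict(zip(headers, disk_info))

# ...while a namedtuple gives attribute access
DiskStats = namedtuple("DiskStats", headers)
dstat = DiskStats(*disk_info)

print(dstat.device, dstat.writes_ms)
```

Note that every value is still a string: convert the counters with int() before doing arithmetic on them.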

Parsing: Goal

• Plan a parsing strategy
• Use basic regular expressions: match, search, sub
• Benchmark a parser
• Run nosetests
• Write a simple parser

modules: re, nose, %timeit

Parsing is hard...

"System Administrators spend 24.3% of their work-life parsing files."*

* Independent analysis by The GASP(1) Society ;)

(1) Grep Awk Sed Perl

...use a strategy!

1. Collect parsing samples
2. Play in ipython and collect %history
3. Write tests, then the parser
4. Eventually benchmark

Parsing postfix logs

    # Before writing the parser, collect samples of
    # the interesting lines. For now just
    from course import mail_sent, mail_delivered

    # and %edit a simple
    def test_sent():
        hour, host, to = parse_line(mail_sent)
        assert hour == '08:00:00'
        assert to == 'jon@doe.it'

Parsing lines: split, zip

    May 31 08:00:00 test-1 postfix/smtp[169]: 7CD8E730020: to=<joe@foo.it>, relay=mx2.foo.it[10.0.4.5]:25, ...

    mail_sent.split()  # Start using basic strings in ipython

    # Then tie them with zip()
    fields, counting = _, zip(range(20), _)
    fields = fields[:7]  # We just care for the first 7 values

    # and pick fields singularly
    hour, host, dest = fields[2], fields[3], fields[6]

Parse: Exercise I

In another window
• edit 03_parsing_test.py
• complete the parse_line(line) function

    def parse_line(line):
        """Write your function and test it
        with test_sent()
        """
        raise NotImplementedError

%paste your solution's code in iPython and run the test functions manually

Python Regexp

    # Python supports regular expressions via
    import re

    # We start showing a grep-reloaded function
    def grep(expr, fpath):
        one = re.compile(expr)   # ...has two lookup methods...
        assert (one.match        # which searches from ^ the beginning
                and one.search)  # that searches anywhere

        with open(fpath) as fp:
            return [x for x in fp if one.search(x)]
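The difference between the two lookup methods is easier to see on a single line. A minimal sketch, assuming a typical /etc/hosts entry as sample input:

```python
import re

line = "127.0.0.1 localhost"
pattern = re.compile("localhost")

at_start = pattern.match(line)   # match only looks at the beginning: no hit
anywhere = pattern.search(line)  # search scans the whole string

# match succeeds when the expression describes the start of the line
anchored = re.compile("127").match(line)

print(at_start, anywhere.group(0), anywhere.start())
```

This is why the grep function above uses search: with match it would only find lines that begin with the expression.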

Splitting with re.split

    import sys
    from re import split  # is a very nice function

    # Let's gather some ping stats
    if sys.platform.startswith('win'):
        cmd = "ping -n10 www.google.it"
    else:
        cmd = "ping -c10 -w10 www.google.it"

    # Split for both space and =
    ping_output = [split("[ =]", x) for x in sh(cmd)]
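What the split produces can be checked on one line without running ping. The sample below is made up in the style of Linux ping output (the address is a documentation placeholder), so treat the field positions as an assumption about your own platform's format.

```python
from re import split

# A made-up reply line in the style of Linux ping
sample = "64 bytes from 192.0.2.1: icmp_seq=1 ttl=57 time=12.3 ms"

# The character class splits on either a space or an equals sign
fields = split("[ =]", sample)

# The value following the "time" label is the round-trip time
rtt = float(fields[fields.index("time") + 1])

print(fields)
print(rtt)
```

Looking the label up with index() is slower than a fixed position but survives small differences between ping implementations.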

Splitting with re.findall

    from re import findall  # can be misused too ;)

    # eg. for adding the ":" to a
    mac = "00""24""e8""b4""33""20"

    # ...using this
    re_hex = '[0-9A-Fa-f]{2}'
    mac_address = ':'.join(findall(re_hex, mac))
    print("The mac address is ", mac_address)

Actually this does a bit of validation, requiring all chars to be in the 0-F range.

Benchmarking in iPython I

• Parsing big files needs benchmarks. The iPython %timeit magic is a good starting point.

    test_regexps = ("..", "[a-fA-F0-9]{2}")
    for re_s in test_regexps:
        %timeit ':'.join(findall(re_s, mac))

• We can even compare compiled and inline regexps

    import re
    for re_s in test_regexps:
        re_c = re.compile(re_s)
        %timeit ':'.join(re_c.findall(mac))

Benchmarking in iPython II

Or find other methods:
• complex...

    from re import sub as sed
    %timeit sed(r'(..)', r'\1:', mac)

• ...or simple

    %timeit ':'.join([mac[i:i+2] for i in range(0, 12, 2)])

• Outside iPython check the timeit module
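Outside iPython the same comparison can be sketched with the standard timeit module. The function names and the number of runs are arbitrary choices for this example, and the timings depend on the machine, so none are shown.

```python
import re
import timeit

mac = "0024e8b43320"


def with_findall():
    """Insert the colons with a regular expression."""
    return ":".join(re.findall("[0-9A-Fa-f]{2}", mac))


def with_slices():
    """Insert the colons by slicing the string two characters at a time."""
    return ":".join(mac[i:i + 2] for i in range(0, 12, 2))


for func in (with_findall, with_slices):
    # timeit calls the function `number` times and returns the total seconds
    seconds = timeit.timeit(func, number=10000)
    print("{}: {:.4f}s for 10000 runs".format(func.__name__, seconds))
```

Before comparing speed, check that the candidates return the same value: the sed-like substitution in the slide above, for instance, leaves a trailing colon.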

♦ Parsing: a real world Example ♦

    # Don't need to type this VSAN configuration script
    # which uses linux FC information from /sys filesystem
    import re
    from glob import glob
    fc_id_path = "/sys/class/fc_host/host*/port_name"
    for x in glob(fc_id_path):
        # ...we boldly skip an explicit close()
        pwwn = open(x).read()  # 0x500143802427e66c
        pwwn = pwwn[2:]
        # ...and even use the slower but readable
        pwwn = re.findall(r'..', pwwn)
        print("member pwwn ", ':'.join(pwwn))

Parsing logs: a simple solution

    def parse_line(line):
        import re
        # using _ we improve readability
        _, _, hour, host, _, _, dest = line.split()[:7]
        try:
            # and if dest isn't what we expect...
            dest = re.split(r'[<>]', dest)[1]
        except IndexError:
            # ...we set it to None
            dest = None
        return (hour, host, dest)
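To watch the solution at work without the course module, here it is applied to the two sample lines quoted in these slides (the "sent" line from the split/zip slide and the "removed" line from the maillog exercise). The function body is repeated only to keep the sketch self-contained.

```python
import re


def parse_line(line):
    """Return (hour, host, destination) from a postfix log line."""
    _, _, hour, host, _, _, dest = line.split()[:7]
    try:
        # the address sits between < and > in the 7th field
        dest = re.split(r'[<>]', dest)[1]
    except IndexError:
        # no angle brackets: this line has no destination
        dest = None
    return (hour, host, dest)


sent = ("May 31 08:00:00 test-1 postfix/smtp[169]: 7CD8E730020: "
        "to=<joe@foo.it>, relay=mx2.foo.it[10.0.4.5]:25,")
removed = "May 14 16:00:04 rpolli postfix/qmgr[122]: 4DC3DA: removed"

print(parse_line(sent))
print(parse_line(removed))
```

A line with fewer than seven fields would still fail on the unpacking, so real logs may need one more guard.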

Parsing logs: II

    # Now another test for the delivered messages
    # %edit 03_parsing_test
    def test_delivered():
        hour, host, destination = parse_line(test_str_2)
        assert hour == '08:00:00'
        # Delivery logs should have destination == None
        assert destination is None

    # Exercise: fix parse_line to work with both tests
    # and save test

Running nosetest

• Now run the following command from a shell

    # nosetests -vs 03_parsing_test.py
    03_parsing_test.test_sent ... ok
    03_parsing_test.test_delivered ... ok
    Ran 2 tests in 0.001s

• Nose is a test framework.
• Nose runs every file matching test_*
• Nose runs every function matching test_*

Simple Test Script

• Open the 02_nosetests_simple.py file

    def setup():
        print("is run before the testsuite, while")

    def teardown():
        print("after all tests")

    def test_one():
        # name a function like test_* to run it!
        assert 1 == 1

    def test_two():
        # and use assert to test for success
        assert 1 == 0, "I was expecting 0"

♦ Complete Test Script: I ♦

• A more flexible script is 02_nosetests_full.py which uses a Test class

    import os

    class Test(object):
        @classmethod
        def setup_class(cls):  # is run once at startup,
            # ..eg. to create database structure
            print("setup testsuite environment")
            open("/tmp/test2.out", "w").write("0")

        @classmethod
        def teardown_class(cls):  # is run once after all tests to...
            print("cleanup testsuite environment")
            os.unlink("/tmp/test2.out")

♦ Complete Test Script: II ♦

• allowing pre-post testsuite and pre-post test fixtures

    class Test(object):
        ...
        # Using a Test class...
        def setup(self):
            print("is_run_before_every_test")  # ..and..

        def teardown(self):
            print("after_every_test")  # eg truncate a table

        # each test can use the prepared environment
        def test_a(self):
            assert os.path.isfile("/tmp/test2.out")

Simple processing: Goal

• Handle gathered data with dict() and zip()
• Find data relations with scipy
• Get essential information like standard deviation σ and distributions δ
• Linear correlation: what it is and when it can help
• Plotting

modules: numpy, scipy, scipy.stats.stats, collections, random, time

The Chicken Paradox

"According to latest statistics,
it appears that you eat one chicken per year:
and, if that doesn't fit your budget,
you'll fit into statistic anyway,
because someone will eat two."
C. A. Salustri

Simple processing: Exercise

How to dismantle the chicken paradox? Gather data!

• Write the following function using our parsing strategy

    def ping_rtt(seconds=10):
        """@return: a list of ping RTT"""
        from course import sh
        # get sample output
        # find a solution in ipython
        # test and paste the code
        raise NotImplementedError

• Gather 10 seconds of ping output
• Hint: reuse the sh() function
• Hint: slice and filter lists using comprehension

Distributions: set, defaultdict

A distribution or δ shows the frequency of events, like how many people ate x chickens ;)

    # Create a simple δ with set and dict
    d = {x: rtt.count(x) for x in set(rtt)}

    # We can even use
    from collections import defaultdict
    d = defaultdict(int)
    for x in rtt:
        d[x] += 1

Distributions and Mean are both important!
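A third option, not shown in the slides, is collections.Counter from the standard library (Python 2.7+), which does the counting loop for us. The rtt values below are made up for illustration.

```python
from collections import Counter

# Made-up round-trip times in milliseconds
rtt = [12.1, 11.9, 12.1, 13.4, 12.1, 11.9]

# The set/dict version from the slide...
by_count = {x: rtt.count(x) for x in set(rtt)}

# ...and the Counter version: a dict subclass built in a single pass
by_counter = Counter(rtt)

# Counter can also rank the events by frequency
most_common = by_counter.most_common(1)

print(by_counter)
```

On long lists Counter and defaultdict scan the data once, while rtt.count(x) rescans the whole list for every distinct value.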

Standard Deviation: scipy

• The standard deviation or σ formula is

    $$\sigma^2(X) := \frac{\sum (x - \bar{x})^2}{n}$$

• σ tells if δ is fair or not, and how much the mean ($\bar{x}$) is representative
• matplotlib.mlab.normpdf is a smooth function approximating the histogram

    from scipy import std, mean
    fair = [1, 1]    # chickens
    unfair = [0, 2]  # chickens
    assert mean(fair) == mean(unfair)

    # Use standard deviation!
    std(fair)    # 0
    std(unfair)  # 1
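The chicken example can also be checked with the standard library alone. This sketch assumes Python 3.4+, where the statistics module is available; pstdev is the population standard deviation, that is the formula above with n in the denominator.

```python
from statistics import mean, pstdev

fair = [1, 1]    # chickens
unfair = [0, 2]  # chickens

# Same mean: one chicken each...
same_mean = mean(fair) == mean(unfair)

# ...but the spread around the mean tells the two stories apart
sigma_fair = pstdev(fair)
sigma_unfair = pstdev(unfair)

print(same_mean, sigma_fair, sigma_unfair)
```

statistics.stdev divides by n - 1 instead (the sample standard deviation), so it would not match the formula on this slide.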

Simple processing: scipy

Check your computed values against the σ returned by ping (didn't you notice ping returned it?)

    """goal: remember to convert to numeric / float
    goal: use scipy
    goal: check stdev
    """
    from scipy import std, mean  # max, min are builtin
    rtt = ping_rtt()

    print(max(rtt), min(rtt), mean(rtt), std(rtt))

Time Distributions: Exercise

• Parse the provided maillog in ipython using its ! magic and get an hourly email δ
• Expected output:

    time_d = {  # mail delivered (removed) between
        0: xxx  # 00:00 - 00:59
        1: xxx  # 01:00 - 01:59
        ..
    }

Time Distributions: Exercise Solution

    # delivered emails are like the following
    # May 14 16:00:04 rpolli postfix/qmgr[122]: 4DC3DA: removed

    ret = !grep removed maillog  # get the interesting lines

    ts = ret.fields(2)  # find the timestamp (3rd column)

    # keep only the hour of each HH:MM:SS timestamp
    hours = [int(x.split(':')[0]) for x in ts]
    time_d = {x: hours.count(x) for x in set(hours)}

Plotting distributions

    # To plot data..
    from matplotlib import pyplot as plt
    # and set the interactive mode
    plt.ion()

    # Plotting a histogram...
    frequency, bins, _ = plt.hist(hours)

    # .. returns a
    distribution = dict(zip(bins, frequency))

[Figure: histogram of the hourly distribution. Caption: This server works mostly at night...]

Size Distributions: Exercise

• Create a size δ using hist(..., bins=...)
• Hint: help(hist)

    size_d = {  # mail size between
        0: xxx  # 0 - 10k
        1: xxx  # 10k - 20k
        ..
    }

• Homework: Use the size δ to find the size mean and the size sigma, and compare them with the σ and mean evaluated from the original data series

♦ Simulating data with σ and $\bar{x}$ ♦

The mean and the standard deviation are a useful starting point to simulate data using the gaussian distribution.

    # A mail load generator creating attachments of a given size...
    from random import gauss
    mail_size = gauss(mean_s, sigma_s)  # a random number

    # and use time_d to simulate the load during the day
    from time import localtime
    hour = localtime().tm_hour
    mail_per_minute = time_d[hour] / 60  # minutes in hour

Linear Correlation

    # Let's plot the following datasets
    # taken from a 4-hour distribution
    mail_sent = [1, 5, 500, 250, 100, 7]
    kB_s = [70, 300, 29000, 12500, 450, 500]

    # A scatter plot can suggest relations
    # between data
    plt.scatter(mail_sent, kB_s)

[Figure: scatter plot "Correlating mail and thruput". x axis: Mail sent; y axis: Thruput kB/s]

Linear Correlation

The Pearson Coefficient ρ is a relation indicator.

     0  no relation
     1  direct relation (both datasets increase together)
    -1  inverse relation (one increases as the other decreases)

    $$\rho(X, Y) = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2}\,\sqrt{\sum (y - \bar{y})^2}} \qquad (1)$$

    from scipy.stats import pearsonr
    ret = pearsonr(mail_sent, kB_s)
    print(ret)
    > (0.9823, 0.0004)
    correlation, probability = ret
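Formula (1) is short enough to be written by hand, which helps to see what pearsonr computes. The sketch below uses only the standard library; pearson is a name chosen for this example and it returns the coefficient only, not the probability.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson coefficient of two equally long series, as in formula (1)."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]  # deviations from the mean
    dy = [y - mean_y for y in ys]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = sqrt(sum(a * a for a in dx)) * sqrt(sum(b * b for b in dy))
    return numerator / denominator


mail_sent = [1, 5, 500, 250, 100, 7]
kB_s = [70, 300, 29000, 12500, 450, 500]

rho = pearson(mail_sent, kB_s)
print("rho = {:.4f}".format(rho))
```

The function divides by zero when one series is constant: in that case the coefficient is simply undefined.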

You must (scatter) plot!

ρ does not detect non-linear correlation

Combinations

    # Given a table with many data series
    from course import table
    table = {...
        'cpu_usr': [10, 23, 55, ..],
        'byte_in': [2132, 3212, 3942, ..], }

    # We can combine all their names with
    from itertools import combinations
    list(combinations(table, 2))
    > [('swap_in', 'cpu_sys'), ('swap_in', 'csw'), ('cpu_sys', 'csw') ... ]

Combining 4 suits, 2 at a time: ♥♠ ♥♣ ♥♦ ♠♣ ♠♦ ♣♦
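The card picture can be reproduced directly. A minimal sketch, with the suits spelled out as plain strings:

```python
from itertools import combinations

suits = ["hearts", "spades", "clubs", "diamonds"]

# Every unordered pair, in the order of the input list
pairs = list(combinations(suits, 2))

for first, second in pairs:
    print(first, second)
```

With n items there are n * (n - 1) / 2 pairs, so the number of plots grows quickly as the table gets more data series.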

Netfishing correlation

We can try every combination between data series and check if there's some ρ.

    for k1, k2 in combinations(table, 2):
        corr, probability = pearsonr(table[k1], table[k2])
        if abs(corr) < 0.5:  # abs() keeps inverse relations too
            # I'm *still* not interested in data under this threshold
            continue
        print("linear correlation between {} and {} is {}".format(k1, k2, corr))

Correlating I/O and Context Switch

Now we'll generate some correlation plots from table data, like this one.

Netfishing correlation II

    # create all combined plots
    for k1, k2 in combinations(table, 2):
        corr, probability = pearsonr(table[k1], table[k2])
        plt.scatter(table[k1], table[k2])

        # 3 digit precision on title
        plt.title("R={:0.3f}".format(corr))
        plt.xlabel(k1); plt.ylabel(k2)

        # save and close the plot
        plt.savefig("{}_{}.png".format(k1, k2)); plt.close()

Mark time with colors

    # Use 3 colors to mark time-slots
    from itertools import cycle
    colors = cycle('rgb')  # Red Green Blue
    my_list = range(10)

    # then import a function to chunk datasets
    from course import in_chunks
    in_chunks(my_list, size=4)  # returns a <generator object ...>
    list(_)                     # ... which iterates to...
    > [[0, 1, 2, 3],  # Plotted in Red
       [4, 5, 6, 7],  # ..Green
       [8, 9]]        # ..Blue
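The course module provides in_chunks; its source is not shown in these slides, so the version below is only a guess at an equivalent generator, written to reproduce the output above.

```python
from itertools import cycle


def in_chunks(sequence, size):
    """Yield consecutive slices of at most `size` items (assumed behaviour)."""
    for start in range(0, len(sequence), size):
        yield list(sequence[start:start + size])


my_list = range(10)
chunks = list(in_chunks(my_list, size=4))

# Pair every chunk with the next color of the cycle
colors = cycle('rgb')
colored = [(next(colors), chunk) for chunk in chunks]

print(colored)
```

Because cycle never ends, a fourth chunk would simply be plotted in red again.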

Mark time with colors

    # Get combined data directly via items
    for (k1, v1), (k2, v2) in combinations(table.items(), 2):
        corr, probability = pearsonr(v1, v2)

        # Two nice generators
        time_chunked = zip(in_chunks(v1, size=8*3600),
                           in_chunks(v2, size=8*3600))
        [plt.scatter(t1, t2, color=next(colors))  # iterate colors!
         for t1, t2 in time_chunked]

        # save and close the plot
        plt.savefig("timed_{}_{}.png".format(k1, k2)); plt.close()

That's all folks!

Thank you for the attention!