Post on 15-Jul-2015
High Performance Python
Marc Garcia
February 19, 2015
Barcelona Python Meetup
1 / 31High Performance Python - Marc Garcia - Barcelona Python Meetup
Overview
1 Warm up example
2 Some theory
3 Profiling
4 Speeders
5 Summary
Warm up example
Can we optimize this?
def list_numbers(until):
    '''Returns a string representing the sequence of numbers from 1 to 'until'

    >>> list_numbers(10)
    '1, 2, 3, 4, 5, 6, 7, 8, 9, 10'
    '''
    num_list = []
    for i in range(until):
        num_list.append(str(i+1))
    return ', '.join(num_list)

%timeit _ = list_numbers(int(1e6))
1 loops, best of 3: 461 ms per loop
Without using a list comprehension first
Warm up example
Some tricks...
def list_numbers_opt(until):
    '''Returns a string representing the sequence of numbers from 1 to 'until'

    >>> list_numbers_opt(10)
    '1, 2, 3, 4, 5, 6, 7, 8, 9, 10'
    '''
    num_list = []
    local_str = str                     # <- first variable lookup is local (avoiding fallback)
    num_list__append = num_list.append  # <- avoiding attribute lookup
    for i in range(1, until+1):
        num_list__append(local_str(i))  # <- avoiding sum in the loop
    return ', '.join(num_list)

%timeit _ = list_numbers_opt(int(1e6))
1 loops, best of 3: 323 ms per loop
Warm up example
With a list comprehension
def list_numbers_comprehension(until):
    '''Returns a string representing the sequence of numbers from 1 to 'until'

    >>> list_numbers_comprehension(10)
    '1, 2, 3, 4, 5, 6, 7, 8, 9, 10'
    '''
    local_str = str
    return ', '.join([local_str(num) for num in range(1, until+1)])

%timeit _ = list_numbers_comprehension(int(1e6))
1 loops, best of 3: 311 ms per loop
Warm up example
With map function
def list_numbers_map(until):
    '''Returns a string representing the sequence of numbers from 1 to 'until'

    >>> list_numbers_map(10)
    '1, 2, 3, 4, 5, 6, 7, 8, 9, 10'
    '''
    return ', '.join(map(str, range(1, until+1)))  # ', ' (with a space) to match the docstring

%timeit _ = list_numbers_map(int(1e6))
1 loops, best of 3: 274 ms per loop
Warm up example
Comparison
Approach             Time (absolute)   Time (relative)
Not optimized        461 ms            1.68
Optimized            323 ms            1.18
List comprehension   311 ms            1.14
Map function         274 ms            1.00
Some theory
Types of optimizations
CPU bound
    Better algorithms
    Minimization of in-loop tasks†
    Better compilation / low-level optimizations†
I/O bound
    I/O (disk, network, etc.) access optimization
    Compression
    Multithreading†
Memory bound
    Memory access optimization / Use of caches†
    Compression
Programmer bound
    Code readability, styles, etc.
    Use of libraries
Some theory
Low-level optimizations
In Python, we do not want to implement low-level optimizations ourselves. But we can benefit from the ones that already exist in libraries.
Write your program so it can be optimized
Vectorization (avoid loops)
map instead of list comprehensions
Objects or dicts?
Generators instead of lists
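As a minimal sketch of the last point in the list above, assuming only the standard library: a generator produces values on demand, so its memory footprint stays small and constant, while a list stores every element up front. The function names squares_list and squares_gen are hypothetical, chosen for this illustration.

```python
import sys

def squares_list(n):
    """Build the whole list of squares in memory."""
    return [i * i for i in range(n)]

def squares_gen(n):
    """Return a generator that yields the squares one at a time."""
    return (i * i for i in range(n))

n = 100_000
as_list = squares_list(n)
as_gen = squares_gen(n)

# The list holds n references; the generator only holds its iteration state.
list_size = sys.getsizeof(as_list)
gen_size = sys.getsizeof(as_gen)

# Both can be consumed by sum() and give the same result.
total_from_list = sum(as_list)
total_from_gen = sum(as_gen)

print(f"list: {list_size} bytes, generator: {gen_size} bytes")
```

Note that sys.getsizeof reports only the container itself, not the integers it refers to, so the real gap is larger than the printed numbers suggest.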
Some theory
Multithreading
GIL (Global Interpreter Lock) [1]: no multicore execution of Python code
Is released mainly for:
    I/O operations
    numpy operations [2]
So only these operations can be parallelized with threads
[1] https://wiki.python.org/moin/GlobalInterpreterLock
[2] http://wiki.scipy.org/ParallelProgramming
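A minimal sketch of the I/O case, assuming only the standard library. Here time.sleep stands in for a blocking I/O call (it releases the GIL while waiting), and fake_io is a hypothetical name used for this illustration. Five waits of 0.2 s each overlap when run in threads, so the total elapsed time should be well below the 1 s a sequential run would need.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_io(delay):
    """Stand-in for a blocking I/O call; time.sleep releases the GIL."""
    time.sleep(delay)
    return delay

delays = [0.2] * 5

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(delays)) as pool:
    # Each call waits in its own thread, so the waits overlap.
    results = list(pool.map(fake_io, delays))
elapsed = time.perf_counter() - start

print(f"threaded: {elapsed:.2f} s, sequential would take about {sum(delays):.2f} s")
```

Replacing fake_io with a pure-Python CPU-bound loop would not show the same gain, since only one thread can hold the GIL at a time.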
Some theory
Memory performance (I)
[Figure not preserved in this transcript]
Source: http://www.edn.com/Home/PrintView?contentItemId=4397051
Some theory
Memory performance (II)
[Figure not preserved in this transcript]
Source: https://dl.dropboxusercontent.com/u/3967849/sfmu/pub/index.html
Some theory
Memory performance (III)
Optimal use of CPU cache when possible (Numexpr [1])
    Reusing data
    Sequential data
Preallocate (Zero Buffer [2])

while True:
    data = os.read(fd, 1024)  # os.read allocates memory on every call
    print(data.lstrip())

[1] https://github.com/pydata/numexpr
[2] http://zero-buffer.readthedocs.org/en/latest/
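The preallocation idea can be sketched with the standard library alone. This is not the Zero Buffer API; it is a stand-in that uses a bytearray allocated once and the readinto method of an unbuffered binary file, so the read loop reuses the same memory instead of allocating a new bytes object per call as os.read does. The temporary file, CHUNK and payload are assumptions made up for the example.

```python
import os
import tempfile

CHUNK = 1024
payload = b"  some data\n" * 500

# Write sample data to a temporary file so the sketch is self-contained.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    path = tmp.name

buf = bytearray(CHUNK)  # allocated once, reused for every read
total = 0
lines = 0

with open(path, "rb", buffering=0) as f:
    while True:
        n = f.readinto(buf)  # fills the existing buffer, returns bytes read
        if not n:
            break            # 0 means end of file
        total += n
        lines += buf.count(b"\n", 0, n)  # only look at the valid part

os.remove(path)
print(f"read {total} bytes, {lines} lines")
```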
Profiling
Profiling basics
Use %timeit to compare different implementations, and %lprun (from the line_profiler extension) to find the most expensive lines of your program.
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def foo(n):
     2         1            3      3.0      0.0      phrase = 'repeat me'
     3         1          185    185.0      0.1      pmul = phrase * n
     4    100001        97590      1.0     32.4      pjoi = ''.join([phrase for x in xrange(n)])
     5         1            4      4.0      0.0      pinc = ''
     6    100001        90133      0.9     29.9      for x in xrange(n):
     7    100000       112935      1.1     37.5          pinc += phrase
     8         1          182    182.0      0.1      del pmul, pjoi, pinc
Measurements may vary when the CPU or RAM is under heavier load, so repeat them and compare like with like.
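Outside IPython, the same kind of comparison can be sketched with the standard-library timeit module. The two functions below are shortened copies of the warm-up examples, and the input size, repeat count and call count are arbitrary choices for this illustration. Taking the minimum over several repeats reduces the influence of other load on the machine.

```python
import timeit

def list_numbers_loop(until):
    num_list = []
    for i in range(until):
        num_list.append(str(i + 1))
    return ', '.join(num_list)

def list_numbers_map(until):
    return ', '.join(map(str, range(1, until + 1)))

n = 10_000
calls = 10
candidates = {"loop": list_numbers_loop, "map": list_numbers_map}

# For each candidate: 3 repeats of `calls` calls, keep the fastest repeat.
timings = {
    name: min(timeit.repeat(lambda f=func: f(n), repeat=3, number=calls))
    for name, func in candidates.items()
}

for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name}: {seconds / calls * 1e3:.3f} ms per call")
```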
Speeders
What are speeders?
Just-in-time (JIT) compilers and others.
Dynamic languages are slower by design; making them partly static improves performance.
    a + b    (1)
To evaluate (1), the interpreter has to:
1 Get a and b from memory
2 Get the types of a and b from memory
3 Look up the add method
4 Allocate memory for the result
5 Store the result in memory
Suppose a and b are integers and the expression sits inside a loop executed a million times. There is a huge cost that can be avoided.
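The steps above can be sketched in plain Python. This is a simplified illustration, not the interpreter's actual implementation: it ignores the reflected __radd__ method and the NotImplemented fallback, and add_spelled_out is a hypothetical name. It shows that the method to run is only known once the types of the operands have been inspected at run time.

```python
def add(a, b):
    return a + b

def add_spelled_out(a, b):
    """Roughly what `a + b` involves (simplified: no __radd__ fallback)."""
    operand_type = type(a)           # step 2: inspect the type of the operand
    method = operand_type.__add__    # step 3: look up the add method on that type
    return method(a, b)              # steps 4-5: a new result object is created

# The same function dispatches to different implementations per call.
for left, right in [(2, 3), ("a", "b"), ([1], [2])]:
    assert add(left, right) == add_spelled_out(left, right)
    print(type(left).__name__, add_spelled_out(left, right))
```

A compiler that knows a and b are always machine integers can skip the type inspection and the lookup, which is the saving the speeders below aim for.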
Speeders
PyPy
No modification in the code should be required, only changing the interpreter
C extensions need to be recompiled (in some cases modified)
Minor compatibility issues (e.g. a __del__ method can't be added to a class after it has already been created)
python myscript.py
pypy myscript.py
Speeders
Numba
Uses a decorator
Built on the LLVM compiler framework
Caches compiled code, so subsequent executions are faster
Some limitations (as of this talk, February 2015):
    Generators not supported
    Nested functions not supported
    Default arguments not supported
    "is not" operator not supported
Speeders
C extensions
Implementation is in C, so development time is much higher
Compilation of the extension is required
Overhead due to moving data from Python to C and C to Python
Minimize this by making as few calls with as much data as possible
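The "few calls, much data" advice can be sketched without writing any C, assuming CPython, where the built-in sum and operator.add are both implemented in C. They serve here only as stand-ins for a hand-written extension; total_per_item and total_batched are hypothetical names. The first version crosses the Python/C boundary once per element, the second crosses it once for the whole list.

```python
import operator
import timeit

data = list(range(1_000_000))

def total_per_item(values):
    """One call into C code per element."""
    total = 0
    for v in values:
        total = operator.add(total, v)
    return total

def total_batched(values):
    """A single call: the loop runs inside the C implementation of sum()."""
    return sum(values)

per_item_time = timeit.timeit(lambda: total_per_item(data), number=1)
batched_time = timeit.timeit(lambda: total_batched(data), number=1)

print(f"per item: {per_item_time:.3f} s, batched: {batched_time:.3f} s")
```

Both functions return the same value; only the number of boundary crossings differs.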
Speeders
Cython (I)
Cython != CPython
Optimising static compiler: Allows us to write C extensions, but in Python
Main difference with Python code is that types are declared
Cython files need to be compiled:

# setuptools provides the same setup() as distutils, which was removed in Python 3.12
from setuptools import setup
from Cython.Build import cythonize

setup(
    name='Hello world app',
    ext_modules=cythonize("hello.pyx"),
)
Speeders
Cython (II)
# Pure Python
def f(x):
    return x**2 - x

def integrate_f(a, b, N):
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f(a + i*dx)
    return s * dx

# Cython
def f(double x):
    return x**2 - x

def integrate_f(double a, double b, int N):
    cdef int i
    cdef double s, dx
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f(a + i*dx)
    return s * dx
Speeders
Numexpr
Executes code optimizing memory and cache usage
Works well with numpy
Numexpr gets the code as a string
Performance may improve by up to an order of magnitude
import numpy as np
import numexpr as ne

a = np.arange(1e6)  # Choose large arrays for better speedups

ne.evaluate("a + 1")  # a simple expression
array([  1.00000000e+00,   2.00000000e+00,   3.00000000e+00, ...,
         9.99998000e+05,   9.99999000e+05,   1.00000000e+06])
Speeders
GPU
Performance can increase by up to two orders of magnitude when using a GPU
Extra hardware is required
Implementation requires the use of parallel programming techniques
Libraries can be used: PyCUDA, NumbaPro
Speeders
Performance comparison
Do not try and find the winner. That's impossible.
Instead... only try to realize the truth.
There is no winner.
Look for the solution that works best for your problem.
Summary
Takeaways
"Premature optimization is the root of all evil". Donald Knuth
Optimize only when necessary, and only the bottlenecks
Vectorize and avoid loops (and focus on them when they are required)
Write your programs so they can be optimized
Mostly using libraries
The bottleneck will be memory or I/O in many cases
Static typing improves performance (numpy, Cython, etc)
Summary
Useful links (I)
Talks
    Raymond Hettinger talk
        https://vimeo.com/114368783 (starting at 21:30)
        http://bit.ly/python-sfmu
    It's the memory stupid - Francesc Alted
        http://www.slideshare.net/BigDataSpain/francesc-alted-how-i-learned-to-stop-worrying-about-cpu-speed
    Fast Python, Slow Python - Alex Gaynor
        https://www.youtube.com/watch?v=7eeEf_rAJds

https://twitter.com/raymondh
https://twitter.com/ContinuumIO
https://twitter.com/FrancescAlted
Summary
Useful links (II)
Speeders
    http://pypy.org/
    http://numba.pydata.org/
    http://cython.org/
    https://github.com/pydata/numexpr
    http://docs.continuum.io/numbapro/
Memory performance
http://queue.acm.org/detail.cfm?id=2513149
Profiling
http://www.huyng.com/posts/python-performance-analysis/