Pandas/Data Analysis at Baypiggies
Python Pandas: Lessons Learned in Performance and Design
Who we are
Chang She - CTO/Cofounder @ DataPad, core pandas contributor, recovering financial quant. Follow me on twitter: @changhiskhan
Andy Hayden - core pandas contributor, analyst and software engineer from the UK turned Data Scientist in CA, avid data tool maker
What are we talking about
- Why pandas?
- What’s cool about pandas?
- How do we improve and track performance
- A few data structures and algorithms
- Bad idioms and how to fix them
What is it?
- Python library for analyzing real world data
- Created by Wes McKinney, now led by Jeff Reback
- Supported on all platforms
- Supports Python 3.4 as of latest version
- Big and active community
Pandas Highlights
- Labelled data and automatic alignment
- Easy data integration
- Flexible slicing and dicing of data
- Analytics made to fit your brain, not vice versa (I’m looking at you SQL)
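To make "labelled data and automatic alignment" concrete, here is a minimal sketch. The Series names and values are made up for illustration and are not from the talk's notebook.

```python
import pandas as pd

# Two Series whose labels overlap only partly and appear in different orders
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["z", "y", "w"])

# Arithmetic aligns on labels, not on position.
# Labels present on only one side ("x" and "w") produce NaN.
total = a + b
print(total)
```

Only "y" and "z" exist in both inputs, so only those two labels get numeric results.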
USER PRODUCTIVITY
Productivity via better workflow
- Single tool to minimize cognitive dissonance
- Iterative and not linear workflow
- Performant enough for interactive work
Pandas basics
(notebook)
Priorities
- Build the right abstractions
- Get the API right
- Then optimize for performance
Open source APIs
- Sometimes you can’t be all things to all people
- You can only add to an API, rarely change it, and never remove anything from it
- Documentation Documentation Documentation
An example
- DataFrame started life as essentially a dict of Series
- There was also DataMatrix
- Unified under DataFrame via combining homogeneous blocks. Performant and single API
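The "dict of Series" view is still visible in the public API today. The sketch below uses made-up column names and data; it shows only user-facing behaviour, not the internal block layout.

```python
import pandas as pd

# Hypothetical columns with partly overlapping labels
prices = pd.Series([1.5, 2.5], index=["a", "b"])
volumes = pd.Series([100, 200, 300], index=["a", "b", "c"])

# Building a DataFrame from a dict of Series aligns them on the union of labels
df = pd.DataFrame({"price": prices, "volume": volumes})

# Selecting a column gives back a Series, as with a dict lookup
price_col = df["price"]

# Each column keeps its own dtype (float for price, integer for volume)
print(df.dtypes)
```

Label "c" has no price, so that cell is NaN while the volume column stays integer.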
Optimization
- Push slow code paths into cython or directly into C
- Try to be smart about minimizing cache misses and not creating unnecessary copies
- Careful with NAs
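One reason NAs need care, sketched under the assumption of default NumPy-backed dtypes: introducing a missing value into an integer Series forces an upcast to float, and reductions skip NaN unless told otherwise. The data here is illustrative only.

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, 2, 3])

# Reindexing to a label with no value introduces NaN.
# NaN is a float, so the integer data is upcast to float64 (a copy is made).
with_gap = ints.reindex([0, 1, 2, 3])

# Reductions skip NaN by default; skipna=False propagates it instead
mean_skip = with_gap.mean()
mean_keep = with_gap.mean(skipna=False)

print(ints.dtype, with_gap.dtype, mean_skip, mean_keep)
```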
Tracking Performance (vbench)
what to track?
Use vbench to track everything we care about (read: whatever users have complained is slow).
Unofficial vbench repos exist for numpy and scikit
(look)
why
Once users are using your API, they’ll notice performance changes ("it feels slower").
Then they timeit and have a legitimate grievance. We want to automate this process (before users get upset).
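The idea of automating "timeit, then compare" can be sketched with the standard library alone. This is not vbench's API: `best_time` and `is_regression` are hypothetical helpers, and the 20% tolerance is an arbitrary assumption.

```python
import timeit

import numpy as np
import pandas as pd

s = pd.Series(np.arange(10_000, dtype="float64"))


def pandas_sum():
    # The operation whose speed we want to track
    return s.sum()


def best_time(func, repeat=3, number=10):
    # Hypothetical helper: best-of-N wall time per call, in seconds
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number


def is_regression(current, baseline, tolerance=0.2):
    # Hypothetical rule: flag runs more than `tolerance` slower than the baseline
    return current > baseline * (1 + tolerance)


current = best_time(pandas_sum)
# In a real setup the baseline would come from a stored result for an earlier commit
print(f"{current:.2e} s per call")
```

Storing one such number per benchmark per commit is enough to spot a slowdown before a user reports it.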
how
(notebook)
Pandorable pandas
(notebook)
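The notebook itself is not part of this transcript. As a generic illustration of a bad idiom and its fix (an assumption about the kind of example shown, with made-up data), compare a row-by-row loop with the vectorized form:

```python
import pandas as pd

df = pd.DataFrame({"price": [1.5, 2.0, 4.0], "qty": [10, 20, 30]})

# Bad idiom: a Python-level loop that builds a new Series object for every row
slow = []
for _, row in df.iterrows():
    slow.append(row["price"] * row["qty"])

# Pandorable version: one vectorized operation on whole columns
fast = df["price"] * df["qty"]

print(slow, fast.tolist())
```

Both produce the same values; the vectorized form avoids per-row Python overhead.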
The End