Pandas/Data Analysis at Baypiggies

Post on 02-Dec-2014


description

Presented at BayPiggies by Chang She and Andy Hayden. pandas is used by many people to make their lives easier when analyzing data. This talk is centered around how the overarching goal of user productivity has driven the balance of API development and performance optimization. We will cover some pandas basics. We'll talk about pandas performance. And we'll discuss data structures and algorithms. Along the way, we'll cover best practices and tools useful for developing open source projects.

Chang She is the CTO/co-founder of DataPad. A pythonista and recovering financial quant, Chang was a core contributor to pandas prior to co-founding DataPad. Chang is passionate about creating better data tools to make knowledge workers more productive.

Andy is a core contributor to pandas and holds the dubious accolade of having answered the most pandas-related questions on Stack Overflow. Andy is an analyst and software engineer from the UK, turned Data Scientist in CA, and is enthusiastic about making data tools easy.

IPython notebooks available here:

https://www.wakari.io/sharing/bundle/hayd/baypiggies

https://www.wakari.io/sharing/bundle/hayd/vbench

https://www.wakari.io/sharing/bundle/hayd/pandorable

Transcript of Pandas/Data Analysis at Baypiggies

Python Pandas: Lessons Learned in Performance and Design

Who we are

Chang She - CTO/Cofounder @ DataPad, core pandas contributor, recovering financial quant. Follow me on twitter: @changhiskhan

Andy Hayden - core pandas contributor, analyst and software engineer from the UK turned Data Scientist in CA, avid data tool maker

What are we talking about

- Why pandas?

- What’s cool about pandas?

- How do we improve and track performance?

- A few data structures and algorithms

- Bad idioms and how to fix them

What is it?

- Python library for analyzing real world data

- Created by Wes McKinney, now led by Jeff Reback

- Supported on all platforms

- Supports Python 3.4 as of latest version

- Big and active community

Pandas Highlights

- Labelled data and automatic alignment

- Easy data integration

- Flexible slicing and dicing of data

- Analytics made to fit your brain, not vice versa (I’m looking at you SQL)
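As a small illustration of the "labelled data and automatic alignment" highlight, here is a minimal sketch. The labels and values are made up for the example and do not come from the talk.

```python
import pandas as pd

# Two Series whose labels only partly overlap (illustrative data).
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0, 30.0], index=["y", "z", "w"])

# Arithmetic matches values by index label, not by position.
total = a + b

# Labels found in only one operand ("x" and "w") come out as NaN.
print(total)
```

The point is that nobody had to sort or join the two inputs by hand before adding them.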

USER PRODUCTIVITY

Productivity via better workflow

- Single tool to minimize cognitive dissonance

- Iterative and not linear workflow

- Performant enough for interactive work

Pandas basics

(notebook)

Priorities

- Build the right abstractions

- Get the API right

- Then optimize for performance

Open source APIs

- Sometimes you can’t be all things to all people

- You can add to an API, rarely change it, and never remove anything from it

- Documentation Documentation Documentation

An example

- DataFrame started life as essentially a dict of Series

- There was also DataMatrix

- Unified under DataFrame via combining homogeneous blocks. Performant and single API
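The "dict of Series" view is still visible in today's public API. The sketch below uses made-up column names and data; it shows only the user-facing behaviour and says nothing about how the internal blocks are laid out.

```python
import pandas as pd

# Two Series with different lengths (illustrative data).
prices = pd.Series([10.5, 11.0, 12.25], index=["a", "b", "c"])
volumes = pd.Series([100, 200], index=["a", "b"])

# A DataFrame can be built from a dict mapping column names to Series.
# The Series are aligned on their index labels along the way.
df = pd.DataFrame({"price": prices, "volume": volumes})

# Looking up a column by name hands back a Series, like a dict lookup.
price_column = df["price"]

print(df)
print(type(price_column))
```

Row "c" has no volume, so that cell is filled with NaN.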

Optimization

- Push slow code paths into cython or directly into C

- Try to be smart about minimizing cache misses and not creating unnecessary copies

- Careful with NAs
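One reason NAs need care, sketched with made-up data: with the default NumPy-backed dtypes, introducing a missing value into an integer Series changes its dtype, and reductions treat NaN differently depending on `skipna`. This is an illustration of public behaviour, not the specific cases the speakers had in mind.

```python
import pandas as pd

# An integer Series with no missing values.
ints = pd.Series([1, 2, 3])

# Reindexing to a label that does not exist introduces a NaN,
# and the integer data is upcast to float64 to hold it.
with_na = ints.reindex([0, 1, 2, 3])

print(ints.dtype)     # an integer dtype
print(with_na.dtype)  # float64

# Reductions skip NaN by default; turning that off propagates it.
skipped = with_na.sum()
propagated = with_na.sum(skipna=False)
```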

Tracking Performance (vbench)

what to track?

Use vbench to track everything we care about (read: whatever users have complained is slow).

There are also unofficial vbench repos for numpy and scikit.

(look)

why

Once users are using your API, they’ll notice performance changes: “it feels slower”.

Then they run timeit and have a legitimate grievance… We want to automate this process (before users get upset).
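The idea of automating the "timeit and compare" step can be sketched with the standard library alone. This is not vbench's API: the helper names `benchmark` and `is_regression`, the 25% tolerance, and the groupby workload are all assumptions made for illustration.

```python
import timeit

# Setup code runs once per measurement and is not timed.
SETUP = """
import numpy as np
import pandas as pd
df = pd.DataFrame({
    "key": np.arange(10000) % 10,
    "value": np.arange(10000, dtype=float),
})
"""

def benchmark(stmt, setup, repeat=3, number=10):
    """Return the best observed time per call, in seconds."""
    times = timeit.repeat(stmt, setup=setup, repeat=repeat, number=number)
    return min(times) / number

def is_regression(current, baseline, tolerance=0.25):
    """Flag a run that is more than `tolerance` slower than the baseline."""
    return current > baseline * (1 + tolerance)

# Time one operation that users might care about.
elapsed = benchmark("df.groupby('key')['value'].sum()", SETUP)
print("best time per call: %.6f s" % elapsed)
```

In a real setup the baseline would come from timings recorded against earlier commits, so a slowdown is flagged before a release rather than after a user reports it.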

how

(notebook)

Pandorable pandas

(notebook)
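The notebook itself is not reproduced in this transcript. As a hypothetical example of the kind of "bad idiom and how to fix it" the outline mentions (not taken from the notebook), here is a row-by-row loop next to its column-wise equivalent, on made-up data.

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

# Less idiomatic: loop over rows in Python and build a list by hand.
totals_loop = []
for _, row in df.iterrows():
    totals_loop.append(row["price"] * row["qty"])

# More idiomatic: multiply whole columns in one vectorized operation.
totals_vec = df["price"] * df["qty"]

print(totals_loop)
print(totals_vec.tolist())
```

Both give the same numbers; the column-wise form is shorter and avoids per-row Python overhead.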

The End