Pandas/Data Analysis at Baypiggies
Uploaded by andy-hayden
![Page 1: Pandas/Data Analysis at Baypiggies](https://reader034.fdocuments.us/reader034/viewer/2022051109/547e44615806b5c25e8b4686/html5/thumbnails/1.jpg)
Python Pandas: Lessons Learned in Performance and Design
![Page 2: Pandas/Data Analysis at Baypiggies](https://reader034.fdocuments.us/reader034/viewer/2022051109/547e44615806b5c25e8b4686/html5/thumbnails/2.jpg)
Who we are
Chang She - CTO/Cofounder @ DataPad, core pandas contributor, recovering financial quant. Follow me on Twitter: @changhiskhan
Andy Hayden - core pandas contributor, analyst and software engineer from the UK turned Data Scientist in CA, avid data tool maker
![Page 3: Pandas/Data Analysis at Baypiggies](https://reader034.fdocuments.us/reader034/viewer/2022051109/547e44615806b5c25e8b4686/html5/thumbnails/3.jpg)
What are we talking about
- Why pandas?
- What’s cool about pandas?
- How do we improve and track performance?
- A few data structures and algorithms
- Bad idioms and how to fix them
![Page 4: Pandas/Data Analysis at Baypiggies](https://reader034.fdocuments.us/reader034/viewer/2022051109/547e44615806b5c25e8b4686/html5/thumbnails/4.jpg)
![Page 5: Pandas/Data Analysis at Baypiggies](https://reader034.fdocuments.us/reader034/viewer/2022051109/547e44615806b5c25e8b4686/html5/thumbnails/5.jpg)
What is it?
- Python library for analyzing real world data
- Created by Wes McKinney, now led by Jeff Reback
- Supported on all platforms
- Supports Python 3.4 as of latest version
- Big and active community
![Page 6: Pandas/Data Analysis at Baypiggies](https://reader034.fdocuments.us/reader034/viewer/2022051109/547e44615806b5c25e8b4686/html5/thumbnails/6.jpg)
Pandas Highlights
- Labelled data and automatic alignment
- Easy data integration
- Flexible slicing and dicing of data
- Analytics made to fit your brain, not vice versa (I’m looking at you, SQL)
USER PRODUCTIVITY
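To make "labelled data and automatic alignment" concrete, here is a minimal sketch. The Series names and values are invented for illustration and are not from the talk's notebook.

```python
import pandas as pd

# Two Series with partially overlapping labels (illustrative data).
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0, 30.0], index=["y", "z", "w"])

# Arithmetic aligns on labels, not on position.
# Labels present in only one operand come out as NaN.
total = a + b

print(total)
```

Only "y" and "z" appear in both inputs, so only those two labels get a numeric sum.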
![Page 7: Pandas/Data Analysis at Baypiggies](https://reader034.fdocuments.us/reader034/viewer/2022051109/547e44615806b5c25e8b4686/html5/thumbnails/7.jpg)
Productivity via better workflow
- Single tool to minimize cognitive dissonance
- Iterative and not linear workflow
- Performant enough for interactive work
![Page 8: Pandas/Data Analysis at Baypiggies](https://reader034.fdocuments.us/reader034/viewer/2022051109/547e44615806b5c25e8b4686/html5/thumbnails/8.jpg)
Pandas basics
(notebook)
![Page 9: Pandas/Data Analysis at Baypiggies](https://reader034.fdocuments.us/reader034/viewer/2022051109/547e44615806b5c25e8b4686/html5/thumbnails/9.jpg)
Priorities
- Build the right abstractions
- Get the API right
- Then optimize for performance
![Page 10: Pandas/Data Analysis at Baypiggies](https://reader034.fdocuments.us/reader034/viewer/2022051109/547e44615806b5c25e8b4686/html5/thumbnails/10.jpg)
Open source APIs
- Sometimes you can’t be all things to all people
- You can only add to an API, rarely change it, and never remove anything from it
- Documentation, documentation, documentation
![Page 11: Pandas/Data Analysis at Baypiggies](https://reader034.fdocuments.us/reader034/viewer/2022051109/547e44615806b5c25e8b4686/html5/thumbnails/11.jpg)
An example
- DataFrame started life as essentially a dict of Series
- There was also DataMatrix
- Unified under DataFrame via combining homogeneous blocks. Performant and single API
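The "dict of Series" view is still visible in the public API today. The sketch below uses invented column names and data. It does not show the internal block storage, which is not part of the public interface.

```python
import pandas as pd

# Illustrative columns; names and values are assumptions, not from the talk.
prices = pd.Series([1.5, 2.5], index=["a", "b"])
counts = pd.Series([3, 4], index=["a", "b"])

# A DataFrame can be built directly from a dict of Series.
df = pd.DataFrame({"price": prices, "count": counts})

# Selecting a column hands back a Series that shares the frame's index.
col = df["price"]

# Each column keeps its own dtype (float and integer here).
print(df.dtypes)
```

Columns of the same dtype are the natural candidates for being stored together, which is the idea behind the homogeneous blocks mentioned above.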
![Page 12: Pandas/Data Analysis at Baypiggies](https://reader034.fdocuments.us/reader034/viewer/2022051109/547e44615806b5c25e8b4686/html5/thumbnails/12.jpg)
Optimization
- Push slow code paths into Cython or directly into C
- Try to be smart about minimizing cache misses and not creating unnecessary copies
- Be careful with NAs
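One reason NAs need care, shown as a small sketch with made-up data: with the default NumPy-backed dtypes, introducing a missing value into an integer Series upcasts it to float, and NaN does not compare equal to itself.

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, 2, 3])

# Reindexing to a label that does not exist introduces NaN,
# which forces the integer data to be upcast to float64.
with_gap = ints.reindex([0, 1, 2, 3])

# NaN != NaN, so count missing values with isna() rather than ==.
n_missing = int(with_gap.isna().sum())

# Reductions such as sum() skip NaN by default.
total = with_gap.sum()

print(ints.dtype, with_gap.dtype, n_missing, total)
```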
![Page 13: Pandas/Data Analysis at Baypiggies](https://reader034.fdocuments.us/reader034/viewer/2022051109/547e44615806b5c25e8b4686/html5/thumbnails/13.jpg)
Tracking Performance (vbench)
![Page 14: Pandas/Data Analysis at Baypiggies](https://reader034.fdocuments.us/reader034/viewer/2022051109/547e44615806b5c25e8b4686/html5/thumbnails/14.jpg)
What to track?
Use vbench to track everything we care about (read: anything users have complained is slow).
Unofficial vbench repos exist for numpy and scikit
(look)
![Page 15: Pandas/Data Analysis at Baypiggies](https://reader034.fdocuments.us/reader034/viewer/2022051109/547e44615806b5c25e8b4686/html5/thumbnails/15.jpg)
Why
Once users are using your API, they’ll notice performance changes: “it feels slower”.
Then they timeit and have a legitimate grievance. We want to automate this process and catch the slowdown before users get upset.
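The idea of automating the "timeit and complain" loop can be sketched with the standard library alone. This is not vbench and does not use its API. The helper names `best_time` and `regressions`, the benchmark names, and the 1.5x tolerance are all hypothetical choices for illustration.

```python
import timeit


def best_time(stmt, setup, repeat=3, number=10):
    """Return the best observed per-call time in seconds for stmt."""
    timings = timeit.repeat(stmt=stmt, setup=setup, repeat=repeat, number=number)
    return min(timings) / number


# Shared setup string for the illustrative benchmarks below.
SETUP = (
    "import numpy as np; import pandas as pd; "
    "s = pd.Series(np.arange(10000, dtype='float64'))"
)

# Time a couple of operations we care about (hypothetical benchmark names).
results = {
    "series_sum": best_time("s.sum()", SETUP),
    "series_cumsum": best_time("s.cumsum()", SETUP),
}


def regressions(current, baseline, tolerance=1.5):
    """Names whose current time exceeds tolerance * baseline time."""
    return sorted(
        name
        for name, seconds in current.items()
        if name in baseline and seconds > tolerance * baseline[name]
    )
```

In practice the baseline would come from timings stored for an earlier commit, so a slowdown is flagged by the tooling rather than by a user.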
![Page 16: Pandas/Data Analysis at Baypiggies](https://reader034.fdocuments.us/reader034/viewer/2022051109/547e44615806b5c25e8b4686/html5/thumbnails/16.jpg)
how
(notebook)
![Page 17: Pandas/Data Analysis at Baypiggies](https://reader034.fdocuments.us/reader034/viewer/2022051109/547e44615806b5c25e8b4686/html5/thumbnails/17.jpg)
Pandorable pandas
(notebook)
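The notebook itself is not reproduced in this transcript. As a stand-in, here is one commonly cited bad idiom and its fix, with invented data: looping over rows in Python versus a vectorized column operation.

```python
import pandas as pd

# Illustrative data; column names are assumptions, not from the notebook.
df = pd.DataFrame({"price": [1.0, 2.0, 3.0], "qty": [10, 20, 30]})

# Slow idiom: iterate row by row in Python.
totals_loop = []
for _, row in df.iterrows():
    totals_loop.append(row["price"] * row["qty"])

# More idiomatic: operate on whole columns at once.
df["total"] = df["price"] * df["qty"]

print(df)
```

Both approaches give the same numbers. The vectorized form runs in compiled code instead of building a new Series for every row.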
![Page 18: Pandas/Data Analysis at Baypiggies](https://reader034.fdocuments.us/reader034/viewer/2022051109/547e44615806b5c25e8b4686/html5/thumbnails/18.jpg)
The End