Pandas/Data Analysis at Baypiggies

Python Pandas: Lessons Learned in Performance and Design

Who we are

Chang She - CTO/Cofounder @ DataPad, core pandas contributor, recovering financial quant. Follow me on twitter: @changhiskhan

Andy Hayden - core pandas contributor, analyst and software engineer from the UK turned Data Scientist in CA, avid data tool maker

What are we talking about

- Why pandas?

- What’s cool about pandas?

- How do we improve and track performance

- A few data structures and algorithms

- Bad idioms and how to fix

What is it?

- Python library for analyzing real world data

- Created by Wes McKinney, now led by Jeff Reback

- Supported on all platforms

- Supports Python 3.4 as of latest version

- Big and active community

Pandas Highlights

- Labelled data and automatic alignment

- Easy data integration

- Flexible slicing and dicing of data

- Analytics made to fit your brain, not vice versa (I’m looking at you SQL)
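To make "labelled data and automatic alignment" concrete, here is a minimal sketch. The Series names and values are made up for illustration; the point is only that arithmetic matches on index labels rather than on position, and labels present on one side only produce NaN.

```python
import pandas as pd

# Two Series with overlapping but not identical labels (illustrative data).
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["y", "z", "w"])

# Addition aligns on labels: "y" and "z" exist in both, "x" and "w" do not.
total = a + b

print(total)
```

Labels "y" and "z" are summed (12 and 23), while "x" and "w" come out as NaN because each is missing from one operand.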

USER PRODUCTIVITY

Productivity via better workflow

- Single tool to minimize cognitive dissonance

- Iterative and not linear workflow

- Performant enough for interactive work

Pandas basics

(notebook)

Priorities

- Build the right abstractions

- Get the API right

- Then optimize for performance

Open source APIs

- Sometimes you can’t be all things to all people

- You can only add to an API, rarely change, and never get rid of APIs

- Documentation Documentation Documentation

An example

- DataFrame started life as essentially a dict of Series

- There was also DataMatrix

- Unified under DataFrame via combining homogeneous blocks. Performant and single API
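The "dict of Series" view is still how a DataFrame looks from the outside. A small sketch, with made-up column names and data: each column keeps its own dtype and comes back as a Series. How columns are grouped into blocks internally is not part of the public API, so this only shows the user-visible side.

```python
import pandas as pd

# Illustrative columns of different dtypes sharing one index.
prices = pd.Series([1.5, 2.5], index=["a", "b"])
volumes = pd.Series([100, 200], index=["a", "b"])

# Build the frame the way the original design suggests: a dict of Series.
df = pd.DataFrame({"price": prices, "volume": volumes})

# Selecting a column returns a Series; dtypes are tracked per column.
price_column = df["price"]
print(type(price_column).__name__)
print(df.dtypes)
```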

Optimization

- Push slow code paths into cython or directly into C

- Try to be smart about minimizing cache misses and not creating unnecessary copies

- Careful with NAs
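One way the "careful with NAs" point shows up in practice, as a rough sketch with illustrative data: introducing a missing value into an integer Series upcasts it to float64 (a new array), and code that drops down to raw NumPy loses the NaN-skipping behaviour of pandas reductions.

```python
import numpy as np
import pandas as pd

# An integer Series (illustrative data).
counts = pd.Series([1, 2, 3], index=["a", "b", "c"])

# Reindexing to a label that does not exist introduces a NaN,
# which forces the default int64 data to be upcast to float64.
reindexed = counts.reindex(["a", "b", "c", "d"])

# pandas reductions skip NaN by default; the raw NumPy sum does not.
pandas_total = reindexed.sum()
numpy_total = reindexed.to_numpy().sum()

print(counts.dtype, reindexed.dtype)
print(pandas_total, numpy_total)
```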

Tracking Performance (vbench)

What to track?

Use vbench to track everything we care about (read: things users have complained are slow).

Unofficial vbench repos exist for numpy and scikit.

(look)

Once users are using your API, they’ll notice performance changes: “it feels slower”.

Then they run timeit and have a legitimate grievance. We want to automate this process and catch regressions before users get upset.
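The kind of measurement a user might make by hand can be sketched with the standard-library timeit module. This does not use vbench; the frame, its size, and the two helper functions (row_loop, vectorized) are hypothetical and chosen only to compare a row-by-row idiom with its vectorized equivalent.

```python
import timeit

import numpy as np
import pandas as pd

# Small illustrative frame; the size is arbitrary.
df = pd.DataFrame({"a": np.arange(2000), "b": np.arange(2000)})


def row_loop(frame):
    # Slow idiom: build one Series per row and add in Python.
    return [row["a"] + row["b"] for _, row in frame.iterrows()]


def vectorized(frame):
    # Same result computed column-wise in a single operation.
    return frame["a"] + frame["b"]


# Take the best of a few runs to reduce noise.
loop_time = min(timeit.repeat(lambda: row_loop(df), number=1, repeat=3))
vec_time = min(timeit.repeat(lambda: vectorized(df), number=1, repeat=3))

print(f"row loop:   {loop_time:.6f} s")
print(f"vectorized: {vec_time:.6f} s")
```

Recording numbers like these for every commit, rather than waiting for someone to rerun them after a release, is the part worth automating.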

(notebook)

Pandorable pandas

(notebook)

The End
