Python Packages for Data Science - The Data...

www.thedataincubator.com

Ranked: 15 Python packages for Data Science

The Data Incubator


Who We Are

Michael Li Co-Author & Founder of The Data Incubator

This report ranks Python packages for data science, and we're hoping to stir the pot a bit and get our colleagues to join the discussion.

Our discoveries here aren't final, but rather serve to showcase the depth, and the breadth, of knowledge available to the data science community.

At The Data Incubator we pride ourselves on having the latest data science curriculum. Much of our course material is based on feedback from corporate and government partners about the technologies they are looking to learn.

However, we wanted to develop a more data-driven approach to what we should be teaching in our data science corporate training and our free fellowship for masters and PhDs looking to enter data science careers in industry.

This report is the second in a series analyzing data-science-related topics. We thought it would be useful to the data science community to rank and analyze a variety of topics related to the profession in a simple, easy-to-digest cheat sheet, ranking, or report. It’s our way of practicing what we teach.

Paul Paczuski Co-Author & The Data Incubator Fellow



The Rankings

This project began as a ranking of the top packages for all data scientists, but we soon found that the scope was too broad. Data scientists do many different things, and you can classify almost any Python package as helping a data scientist.

Python, along with R, is one of the most popular tools in a data scientist’s arsenal, mostly for its simplicity and ease of use: most concepts can be expressed in fewer lines of code in Python than in other languages.

That is why we wanted to rank the most popular and useful Python packages: to help those new to data science and to provide insight into what’s driving the popularity of certain Python packages.

Below is a ranking of Python packages that are useful for Data Science, based on Github and Stack Overflow activity, as well as PyPI (The Python Package Index) downloads.

The table shows standardized scores, where a value of 1 means one standard deviation above average (average = score of 0).

For example, numpy is 2 standard deviations above average in Stack Overflow activity, while tensorflow is close to average. See below for methods.
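To make the standardized scores concrete, here is a minimal sketch of how a z-score is computed. The package names and counts are hypothetical, not the report's data, and the use of the population standard deviation is an assumption (the report does not say which variant was used).

```python
from statistics import mean, pstdev


def standardize(counts):
    """Return z-scores: (value - mean) / standard deviation."""
    mu = mean(counts)
    sigma = pstdev(counts)  # assumption: population standard deviation
    return [(c - mu) / sigma for c in counts]


# Hypothetical Stack Overflow question counts for three made-up packages.
questions = {"pkg_a": 100, "pkg_b": 200, "pkg_c": 300}
scores = dict(zip(questions, standardize(list(questions.values()))))

for name, score in scores.items():
    # A score of 0 is average; +1 is one standard deviation above average.
    print(f"{name}: {score:+.2f}")
```

A package whose count equals the mean gets a score of 0, and the scores across all packages average to 0 by construction.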



1. numpy
2. tensorflow
3. pandas
4. ipython
5. scikit-learn
6. matplotlib
7. pattern
8. scrapy
9. scipy
10. plotly
11. nltk
12. theano
13. sympy
14. bokeh
15. networkx



For this ranking, The Data Incubator focused on a number of criteria, including an exhaustive list of ML packages and three objective metrics: total downloads, Github stars, and the number of Stack Overflow questions.

The scalable machine learning package tensorflow (started at Google) trounces the other libraries in Github activity (based on both stars and forks), with the more general machine learning module scikit-learn a distant second on Github and fifth overall.

Both numpy and pandas (high-performance data structures and data analysis package) are only average on Github, but strong in the other two categories.

The interactive interpreter ipython is fourth overall, while the jupyter project (of the popular notebook) is 19th overall (not shown).

*All data was downloaded on January 19, 2017. CRAN download counts were from the past 365 days: January 19, 2016 to January 19, 2017.


The Insights

matplotlib most popular graphics library

As expected, matplotlib (2D plotting library) is the most popular graphics package, but the ranking also features plotly (interactive, publication-quality graphs that can be easily published online) and bokeh (an interactive visualization library that targets modern web browsers for presentation). ggpy (Python port for R's popular ggplot2 package) was 18th overall, but the data is less reliable, as noted in the next section.

Github vs. Stack Overflow activity

There appears to be an inverse correlation between Github activity and both Stack Overflow activity and downloads for the top packages. For example, there are a lot of Stack Overflow questions for numpy and pandas compared to tensorflow and scikit-learn, but the latter two have an edge on Github. Since numpy and pandas are two "utility" packages, perhaps more people are actually using them (and need help).
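One way to check a pattern like this is to compute a Pearson correlation between the two sets of scores. The sketch below uses made-up standardized scores for four hypothetical packages; it is not the report's data and does not reproduce its result.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical standardized scores: packages strong on Github are weak on
# Stack Overflow in this toy data, and vice versa.
github = [2.0, 1.5, 0.1, 0.0]
stack_overflow = [0.1, 0.3, 1.8, 2.0]

r = pearson(github, stack_overflow)
print(f"correlation: {r:.2f}")  # a negative value indicates an inverse relationship
```

With only a handful of top packages, a coefficient like this is suggestive at best, which matches the cautious wording above.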

numpy most popular core library (beats pandas, scipy)

Among the core libraries, numpy is a clear first, with pandas third, ipython fourth, and scipy (an ecosystem of open-source software for mathematics, science, and engineering) ninth.

tensorflow outperforms theano in Neural Networks

The other deep-learning package, theano, is far behind tensorflow in this ranking. The interactive and polished Tensorflow Playground could be a factor.

Inverse relationship between Github activity and downloads.


Limitations

As with any analysis, decisions were made along the way. All source code and data are on our Github page.

The full list of machine learning packages came from a few sources, and a few packages were unranked due to unavailable downloads or Github data.

These are: basemap (mapping with matplotlib), d3py (D3-like plotting), jupyter-notebook, mlpy (machine learning based on scipy and numpy), pylearn2 (machine learning, based on theano), pytables (big tables), and shogun (machine learning). They were all below average compared to the ranked packages, in all categories.

Importantly, the Anaconda distribution bundles together many of these packages, and this was not considered.

Further, naturally, some packages that have been around longer will have higher metrics, and therefore higher ranking. This is not adjusted for in any way.

Older packages feature strong metrics by default.

The data presented a few difficulties:

The Python port of ggplot was recently renamed to ggpy, and we used the latter for all metrics except downloads (which used ggplot).

ipython notebook is now jupyter notebook. Stack Overflow auto-corrected ipython-notebook to jupyter-notebook, so we combined these results with jupyter-notebook results from the other two sources. But jupyter-notebook didn't have a downloads count, so it doesn't feature in the final ranking. Instead, we performed analyses for ipython and jupyter individually.

The pattern package has inflated Stack Overflow (SO) question metrics since its name is a common word. It has no tags data, as SO auto-corrects the query "[pattern]" to something unrelated.

SO data for plotly may be inflated, since it's both an R and a Python package.


The methodology

A few other notes:

All source code and data is on our Github Page. Rankings come from a simple understanding of who is using which packages on popular platforms.

sqlite3 was removed from analysis, as it is a base Python module.

ggplot downloads were combined with ggpy results from Github and Stack Overflow.

Any unavailable Stack Overflow counts were converted to zero count.

Counts were standardized to mean 0 and standard deviation 1, and then averaged to get Github and Stack Overflow scores, and, combined with the Downloads, the Overall score.

Some manual checks were done to confirm Github repository location.
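The scoring steps in these notes can be sketched roughly as follows. This is an illustration under stated assumptions, not the report's actual code: the package names and counts are hypothetical, the population standard deviation is assumed, and the overall score is assumed to be the plain mean of the Github, Stack Overflow, and downloads scores (the notes only say they were "combined").

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize values to mean 0 and standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical raw counts; None marks an unavailable Stack Overflow count.
raw = {
    "pkg_a": {"stars": 900, "forks": 300, "so_tags": 5000,
              "so_questions": 8000, "downloads": 70000},
    "pkg_b": {"stars": 5000, "forks": 2000, "so_tags": None,
              "so_questions": 1000, "downloads": 20000},
    "pkg_c": {"stars": 100, "forks": 40, "so_tags": 200,
              "so_questions": 300, "downloads": 5000},
}
names = list(raw)
metrics = ["stars", "forks", "so_tags", "so_questions", "downloads"]

# Unavailable counts become zero; each metric is standardized across packages.
z = {}
for metric in metrics:
    column = [0 if raw[n][metric] is None else raw[n][metric] for n in names]
    z[metric] = dict(zip(names, zscores(column)))

scores = {}
for n in names:
    github = mean([z["stars"][n], z["forks"][n]])
    stack_overflow = mean([z["so_tags"][n], z["so_questions"][n]])
    downloads = z["downloads"][n]
    # Assumption: overall = plain mean of the three component scores.
    overall = mean([github, stack_overflow, downloads])
    scores[n] = {"github": github, "stack_overflow": stack_overflow,
                 "downloads": downloads, "overall": overall}

# Rank packages from highest to lowest overall score.
ranking = sorted(names, key=lambda n: scores[n]["overall"], reverse=True)
print(ranking)
```

In this toy data, pkg_b leads on Github but pkg_a ranks first overall because of its Stack Overflow and download counts, the same kind of trade-off the report describes for tensorflow and numpy.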

We first generated a list of data science packages and then collected metrics using these three resources:

Github data is based on both stars and forks, while StackOverflow data is based on tags and questions containing the package name.
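The report does not say how the star and fork counts were retrieved. As one possible approach, the sketch below assumes the public GitHub REST API, whose repository endpoint returns `stargazers_count` and `forks_count` fields. The helper names and the sample payload are hypothetical, and the network call is defined but not executed here.

```python
import json
from urllib.request import urlopen

API_ROOT = "https://api.github.com/repos"


def repo_url(owner, repo):
    """Build the repository endpoint URL for a given owner and repo name."""
    return f"{API_ROOT}/{owner}/{repo}"


def extract_activity(payload):
    """Pull star and fork counts out of a decoded repository payload."""
    return {"stars": payload["stargazers_count"],
            "forks": payload["forks_count"]}


def fetch_activity(owner, repo):
    """Fetch live counts (network call; unauthenticated requests are rate-limited)."""
    with urlopen(repo_url(owner, repo)) as response:
        return extract_activity(json.load(response))


# Offline example with a made-up payload, so no network access is needed.
sample = {"name": "example", "stargazers_count": 1200, "forks_count": 340}
activity = extract_activity(sample)
print(activity)
```

Keeping the parsing separate from the fetch makes it easy to run manual checks, such as confirming that a package name maps to the intended repository before trusting its counts.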

Downloads are from PyPI, using a fork of the vanity project. Other projects to get download counts include pypi-download-stats (couldn't get it to work) and pypi-ranking.info (gives smaller numbers than vanity).

Github, Upwork, Data Science Central

https://github.com/thedataincubator/data-science-blogs

https://github.com/rasbt/pattern_classification/blob/master/resources/python_data_libraries.md

https://www.upwork.com/hiring/data/15-python-libraries-data-science/


The Resources


Source code is available on The Data Incubator's Github: https://github.com/thedataincubator/data-science-blogs/

If you're interested in learning more, consider taking a look at the following:

Data Incubator Links

Complete ranking of Python packages: https://github.com/thedataincubator/data-science-blogs/blob/master/output/python-ranks-with-na.csv

Tensorflow Playground: http://playground.tensorflow.org/

Raw ranking data: https://github.com/thedataincubator/data-science-blogs/blob/master/output/python-data-wide.csv

Anaconda Open Data Science Platform: https://www.continuum.io/anaconda-overview

PyPI Ranking: http://pypi-ranking.info/alltime

PyPI Download Stats: https://github.com/jantman/pypi-download-stats

Fork of vanity project in Github: https://github.com/pavopax/vanity

