Python Packages for Data Science - The Data...
www.thedataincubator.com
Ranked: 15 Python packages for Data Science
T H E D A T A I N C U B A T O R
Who We Are
Michael Li Co-Author & Founder of The Data Incubator
This report ranks Python packages for data science, and we're hoping to stir the pot a bit and get our colleagues to join the discussion.
Our discoveries here aren't final, but rather serve to showcase the depth, and the breadth, of knowledge available to the data science community.
At The Data Incubator we pride ourselves on having the latest data science curriculum. Much of our course material is based on feedback from corporate and government partners about the technologies they are looking to learn.
However, we wanted to develop a more data-driven approach to what we should be teaching in our data science corporate training and our free fellowship for masters and PhDs looking to enter data science careers in industry.
This report is the second in a series analyzing data science related topics. We thought it would be useful to the data science community to rank and analyze a variety of topics related to the profession in a simple, easy to digest cheat sheet, ranking, or report. It’s our way of practicing what we teach.
Paul Paczuski Co-Author & The Data Incubator Fellow
The Rankings
This project began as a ranking of the top packages for all data scientists, but we soon found that the scope was too broad. Data scientists do many different things, and you can classify almost any Python package as helping a data scientist.
Python, along with R, is one of the most popular tools in a data scientist’s arsenal, mostly for its simplicity and ease of use: most concepts can be expressed in fewer lines of code in Python than in other languages.
We therefore wanted to rank the most popular and useful Python packages, both to help those new to data science and to provide insight into what’s driving the popularity of certain Python packages.
Below is a ranking of Python packages that are useful for Data Science, based on Github and Stack Overflow activity, as well as PyPI (The Python Package Index) downloads.
The table shows standardized scores, where a value of 1 means one standard deviation above average (average = score of 0).
For example, numpy is 2 standard deviations above average in Stack Overflow activity, while tensorflow is close to average. See below for methods.
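As a rough illustration of how a standardized score of this kind is computed, the sketch below converts raw counts into z-scores. The package names and counts are made-up placeholders, not the report's data, and the choice of the population standard deviation is an assumption, since the report does not say which variant was used.

```python
import statistics


def standardize(counts):
    """Return z-scores: (value - mean) / standard deviation."""
    values = list(counts.values())
    mean = statistics.mean(values)
    # Assumption: population standard deviation; the report does not specify.
    std = statistics.pstdev(values)
    return {name: (value - mean) / std for name, value in counts.items()}


# Hypothetical question counts, for illustration only.
example_counts = {"pkg_a": 100, "pkg_b": 200, "pkg_c": 300}
scores = standardize(example_counts)
```

A package sitting exactly at the mean (here `pkg_b`) gets a score of 0, and positive scores mean above-average activity.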
1. numpy
2. tensorflow
3. pandas
4. ipython
5. scikit-learn
6. matplotlib
7. pattern
8. scrapy
9. scipy
10. plotly
11. nltk
12. theano
13. sympy
14. bokeh
15. networkx
For this ranking, The Data Incubator focused on a number of criteria, including an exhaustive list of ML packages and three objective metrics: total downloads, Github stars, and the number of Stack Overflow questions.
The scalable machine learning package tensorflow (started at Google) trounces the other libraries in Github activity (based on both stars and forks). The more general machine learning module scikit-learn is a distant second on Github, and fifth overall.
Both numpy and pandas (high-performance data structures and data analysis package) are only average on Github, but strong in the other two categories.
The interactive interpreter ipython is fourth overall, while the jupyter project (of the popular notebook) is 19th overall (not shown).
*All data was downloaded on January 19, 2017. Download counts were from the past 365 days: January 19, 2016 to January 19, 2017.
The Insights
matplotlib most popular graphics library
As expected, matplotlib (2D plotting library) is the most popular graphics package, but the ranking also features plotly (interactive, publication-quality graphs that can be easily published online) and bokeh (an interactive visualization library that targets modern web browsers for presentation). ggpy (Python port for R's popular ggplot2 package) was 18th overall, but the data is less reliable, as noted in the next section.
Github vs. Stack Overflow activity
There appears to be an inverse correlation between Github activity and both Stack Overflow activity and Downloads for the top packages. For example, there are many more Stack Overflow questions for numpy and pandas than for tensorflow and scikit-learn, but the latter two have an edge on Github. Since numpy and pandas are two "utility" packages, perhaps more people are actually using them (and need help).
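One way to check a claim like this is to compute a correlation coefficient between the two sets of standardized scores. The sketch below uses Pearson's r, which is an assumption (the report does not name a statistic), and the score values are invented placeholders rather than the report's numbers.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical standardized scores for four packages, for illustration only.
github_scores = [3.0, 1.5, 0.2, 0.1]
stack_overflow_scores = [0.1, 0.4, 2.0, 1.8]

r = pearson(github_scores, stack_overflow_scores)
# A negative r indicates an inverse relationship between the two metrics.
```

With only a handful of top packages, a coefficient like this is suggestive at best, which matches the tentative wording above.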
numpy most popular core library (beats pandas, scipy)
Among the core libraries, numpy is a clear first, with pandas third, ipython fourth, and scipy (an ecosystem of open-source software for mathematics, science, and engineering) ninth.
tensorflow outperforms theano in Neural Networks
The other deep-learning package, theano, trails tensorflow by a wide margin in this ranking. The interactive and polished Tensorflow Playground could be a factor.
Limitations
As with any analysis, decisions were made along the way. All source code and data are on our Github page.
The full list of machine learning packages came from a few sources, and a few packages were unranked due to unavailable downloads or Github data.
These are: basemap (mapping with matplotlib), d3py (D3-like plotting), jupyter-notebook, mlpy (machine learning based on scipy and numpy), pylearn2 (machine learning, based on theano), pytables (big tables), and shogun (machine learning). They were all below average compared to the ranked packages, in all categories.
Importantly, the Anaconda distribution bundles together many of these packages, and this was not considered.
Further, packages that have been around longer will naturally have higher metrics, and therefore a higher ranking. This is not adjusted for in any way.
The data presented a few difficulties:
The Python port of ggplot was recently renamed to ggpy, and we used the latter for all metrics except downloads (which used ggplot).
ipython notebook is now jupyter notebook. Stack Overflow auto-corrected ipython-notebook to jupyter-notebook, so we combined these results with jupyter-notebook results from the other two sources. But jupyter-notebook didn't have a downloads count, so it doesn't feature in the final ranking. Instead, we performed analyses for ipython and jupyter individually.
The pattern package has inflated Stack Overflow (SO) question metrics, since its name is a common word. It has no tags data, as SO auto-corrects the query "[pattern]" to something unrelated.
SO data for plotly may be inflated, since it's both an R and a Python package.
The methodology
A few other notes:
All source code and data is on our Github Page. Rankings come from a simple understanding of who is using which packages on popular platforms.
sqlite3 was removed from the analysis, as it is a base Python module.
ggplot downloads were combined with ggpy results from Github and Stack Overflow.
Any unavailable Stack Overflow counts were converted to a count of zero.
Counts were standardized to mean 0 and standard deviation 1, then averaged to get the Github and Stack Overflow scores and, combined with the Downloads score, the Overall score.
Some manual checks were done to confirm Github repository locations.
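The scoring steps described in these notes can be sketched roughly as follows. This is not the report's actual script (that lives in the linked repository). The raw counts are invented, the metric names are hypothetical, and the equal weighting of the three component scores in the Overall score is an assumption.

```python
from statistics import mean, pstdev

# Hypothetical raw counts, for illustration only (not the report's data).
# None marks an unavailable Stack Overflow count.
raw = {
    "pkg_a": {"stars": 900, "forks": 300, "so_tags": 50, "so_questions": 400, "downloads": 1000},
    "pkg_b": {"stars": 300, "forks": 100, "so_tags": None, "so_questions": 100, "downloads": 5000},
    "pkg_c": {"stars": 600, "forks": 200, "so_tags": 20, "so_questions": 250, "downloads": 3000},
}
metrics = ["stars", "forks", "so_tags", "so_questions", "downloads"]
names = list(raw)


def zscores(values):
    """Standardize a list of counts to mean 0 and standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


# Standardize each metric across packages; unavailable counts become zero.
standardized = {name: {} for name in names}
for metric in metrics:
    column = [raw[name][metric] or 0 for name in names]
    for name, z in zip(names, zscores(column)):
        standardized[name][metric] = z

# Average within each source, then across sources for the overall score.
scores = {}
for name in names:
    z = standardized[name]
    github = mean([z["stars"], z["forks"]])
    stack_overflow = mean([z["so_tags"], z["so_questions"]])
    downloads = z["downloads"]
    # Assumption: the three component scores are weighted equally.
    overall = mean([github, stack_overflow, downloads])
    scores[name] = {"github": github, "stack_overflow": stack_overflow,
                    "downloads": downloads, "overall": overall}

ranking = sorted(names, key=lambda name: scores[name]["overall"], reverse=True)
```

Because every metric is standardized before averaging, a source with large raw numbers (such as downloads) does not dominate the Overall score.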
We first generated a list of data science packages and then collected metrics using these three resources:
Github data is based on both stars and forks, while Stack Overflow data is based on tags and questions containing the package name.
Downloads are from PyPI, using a fork of the vanity project. Other projects to get download counts include pypi-download-stats (couldn't get it to work) and pypi-ranking.info (gives smaller numbers than vanity).
Github, Upwork, Data Science Central
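For readers who want to collect the Github stars and forks themselves, one possible approach is the public GitHub REST API, whose repository endpoint returns `stargazers_count` and `forks_count` fields. The sketch below is an assumption about how this could be done, not the collection code used for the report, and the helper names are hypothetical.

```python
import json
import urllib.request


def parse_repo_counts(payload):
    """Extract star and fork counts from a GitHub repository JSON object."""
    return {"stars": payload["stargazers_count"], "forks": payload["forks_count"]}


def fetch_repo_counts(owner, repo):
    """Fetch counts for one repository (requires network access)."""
    url = f"https://api.github.com/repos/{owner}/{repo}"
    request = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"}
    )
    with urllib.request.urlopen(request) as response:
        return parse_repo_counts(json.load(response))


# Offline check with a hand-written payload (values are placeholders).
sample = {"name": "example", "stargazers_count": 10, "forks_count": 3}
counts = parse_repo_counts(sample)
```

Calling `fetch_repo_counts("numpy", "numpy")` would query the live API. Unauthenticated requests are rate-limited, so a full run over many packages may need an access token.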
The Resources
Source code is available on The Data Incubator's Github: https://github.com/thedataincubator/data-science-blogs/
If you're interested in learning more, consider taking a look at the following:
Data Incubator Links
Complete ranking of Python packages: https://github.com/thedataincubator/data-science-blogs/blob/master/output/python-ranks-with-na.csv
Tensorflow Playground: http://playground.tensorflow.org/
Raw ranking data: https://github.com/thedataincubator/data-science-blogs/blob/master/output/python-data-wide.csv
Anaconda Open Data Science Platform: https://www.continuum.io/anaconda-overview
PyPI Ranking: http://pypi-ranking.info/alltime
PyPI Download Stats: https://github.com/jantman/pypi-download-stats
Fork of vanity project in Github: https://github.com/pavopax/vanity