Open Source Software for Data Scientists -- BigConf 2014

Open Source Software for Data Scientists Charlie Greenbacker, Director of Data Science 28 Mar 2014

description

As presented at BigConf on 28 March 2014 in Silver Spring, MD: http://www.bigconf.io/schedule/index#charlie_greenbacker

Harvard Business Review called data scientist "the sexiest job of the 21st century." These days, data scientists face an onslaught of companies pitching products that promise to solve all their problems. Is there such a thing as a "silver bullet" for data science, and is it worth the hefty price tag? This talk will briefly discuss what data science is, argue that open source software is usually the right choice for data scientists, and examine some of the leading OSS tools for data science available today. Topics will include statistical analysis, data mining, machine learning, natural language processing, and data visualization. Additional materials will be provided on the presentation's companion website: oss4ds.com

Transcript of Open Source Software for Data Scientists -- BigConf 2014

Page 1: Open Source Software for Data Scientists -- BigConf 2014

Open Source Software for Data Scientists

Charlie Greenbacker, Director of Data Science 28 Mar 2014

Page 2: Open Source Software for Data Scientists -- BigConf 2014

Altamira Technologies Corporation 2014

Agenda

■ What is a Data Scientist?
■ Why use Open Source Software?
■ Survey of Open Source Software Tools:
  ¤ Statistical Analysis
  ¤ Data Mining
  ¤ Machine Learning
  ¤ Natural Language Processing
  ¤ Social Network Analysis
  ¤ Data Visualization

Page 3: Open Source Software for Data Scientists -- BigConf 2014

About me: @greenbacker
Theories: popular tripe
Methods: sloppy
Conclusions: highly questionable

photo: Columbia Pictures

Page 4: Open Source Software for Data Scientists -- BigConf 2014

Best reason for not finishing PhD

Page 5: Open Source Software for Data Scientists -- BigConf 2014

@ExploreAltamira

Page 6: Open Source Software for Data Scientists -- BigConf 2014

What is a Data Scientist?

Page 7: Open Source Software for Data Scientists -- BigConf 2014
Page 8: Open Source Software for Data Scientists -- BigConf 2014
Page 9: Open Source Software for Data Scientists -- BigConf 2014
Page 10: Open Source Software for Data Scientists -- BigConf 2014

credit: Drew Conway (http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram)

Page 11: Open Source Software for Data Scientists -- BigConf 2014

http://www.itproportal.com/2014/02/11/how-to-pick-a-data-scientist-the-right-way/

Paul Cooper, ITProPortal.com

“A data scientist is someone who understands the domains of programming, machine learning, data mining, statistics, and hacking”

Page 12: Open Source Software for Data Scientists -- BigConf 2014

Computer Programming

Mathematics & Analytic Methodology

Distributed Computing & Big Data

Data Science

Statistical Analysis
Data Mining
Machine Learning
Natural Language Processing
Social Network Analysis
Data Visualization

Domain Knowledge & Communication Skills

etc.

Page 13: Open Source Software for Data Scientists -- BigConf 2014

Why use Open Source Software?

Page 14: Open Source Software for Data Scientists -- BigConf 2014

photo: Karen (https://flic.kr/p/5njby2)

THERE ARE NO SILVER BULLETS.

Page 15: Open Source Software for Data Scientists -- BigConf 2014

photo: Paul Inkles (https://flic.kr/p/e2QMS5)

IF YOUR BOSS BUYS SOMETHING, YOU DAMN WELL BETTER USE IT.

Page 16: Open Source Software for Data Scientists -- BigConf 2014

photo: Valugi (http://bit.ly/1jrvVBC)

BUDGETS DON’T SCALE.

Page 17: Open Source Software for Data Scientists -- BigConf 2014

Survey of OSS Tools

Page 18: Open Source Software for Data Scientists -- BigConf 2014

Statistical Analysis

■ Name: R
■ Creator: Gentleman, Ihaka, et al.
■ License: GPL Version 2
■ Website: r-project.org
■ Source: cran.us.r-project.org/src/base/
■ Features:
  ¤ Language & environment for statistical computing & viz
  ¤ Linear and nonlinear modeling, classical statistical tests, time-series analysis, graphical techniques, and more…
  ¤ 5000+ packages available in CRAN repository
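
To make "linear modeling" concrete, here is a minimal sketch of an ordinary least squares fit for a single predictor. It is written in Python with only the standard library, not in R, and it is not how R implements its own modeling functions. The names `fit_line` and `r_squared` are hypothetical helpers introduced for illustration.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired observations")
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of squared deviations of x, and co-deviations of x and y.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Share of the variance in y explained by the fitted line."""
    y_bar = mean(ys)
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    if ss_tot == 0:
        raise ValueError("y values must not all be identical")
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return 1 - ss_res / ss_tot


if __name__ == "__main__":
    xs = [1, 2, 3, 4]
    ys = [3, 5, 7, 9]
    slope, intercept = fit_line(xs, ys)
    print(slope, intercept, r_squared(xs, ys, slope, intercept))
```

A dedicated statistics environment adds what this sketch leaves out: multiple predictors, standard errors, significance tests, and diagnostics.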

Page 19: Open Source Software for Data Scientists -- BigConf 2014

Data Mining

■ Name: Pandas
■ Creator: Wes McKinney, et al.
■ License: BSD 3-Clause License
■ Website: pandas.pydata.org
■ Source: github.com/pydata/pandas
■ Features:
  ¤ Data analysis workflow in Python
  ¤ DataFrame object for fast manipulation & indexing
  ¤ Tools for reading & writing data between formats
  ¤ Label-based slicing, indexing, and subsetting of data

Page 20: Open Source Software for Data Scientists -- BigConf 2014

Data Mining

■ Name: Impala
■ Creator: Cloudera
■ License: Apache License 2.0
■ Website: impala.io
■ Source: github.com/cloudera/impala
■ Features:
  ¤ MPP query engine implemented on Hadoop
  ¤ Low latency, high concurrency SQL & BI queries
  ¤ Same interfaces as Apache Hive, but ~24x faster
  ¤ Written in C++; does not use MapReduce

Page 21: Open Source Software for Data Scientists -- BigConf 2014

Machine Learning

■ Name: Mahout
■ Creator: ASF
■ License: Apache License 2.0
■ Website: mahout.apache.org
■ Source: svn.apache.org/viewvc/mahout
■ Features:
  ¤ Distributed/scalable ML library for Hadoop
  ¤ Classification, Clustering, Collaborative filtering
  ¤ Logistic regression, naïve Bayes, random forest, neural networks, HMM, k-means, SVD, PCA, ALS, LDA, etc.

Page 22: Open Source Software for Data Scientists -- BigConf 2014

Machine Learning

■ Name: Scikit-learn
■ Creator: Cournapeau, et al.
■ License: BSD 3-Clause License
■ Website: scikit-learn.org
■ Source: github.com/scikit-learn/scikit-learn
■ Features:
  ¤ ML library for Python built on NumPy, SciPy, matplotlib
  ¤ Support for classification, clustering, dimensionality reduction, regression, model selection, preprocessing
  ¤ SVM, k-NN, PCA, NNMF, crossval, feature extraction, ...
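
As an illustration of one technique named above, k-NN classification, here is a minimal standard-library sketch. It shows the idea only and is not scikit-learn's implementation or API. The function name `knn_predict` and the toy data are assumptions made for this example.

```python
from collections import Counter
from math import dist


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (features, label) pairs, where features is a
    sequence of numbers with the same length as `query`.
    """
    if not 1 <= k <= len(train):
        raise ValueError("k must be between 1 and the number of training points")
    # Sort training points by Euclidean distance to the query and keep k.
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    # On a tie, the label seen first among the nearest points wins.
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical toy data: two well-separated clusters.
    train = [
        ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
        ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b"),
    ]
    print(knn_predict(train, (1.1, 1.0)))
    print(knn_predict(train, (5.1, 5.0)))
```

This brute-force version compares the query against every training point. A library implementation would typically add spatial indexing, alternative distance metrics, and vectorized arithmetic.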

Page 23: Open Source Software for Data Scientists -- BigConf 2014

Machine Learning + NLP

■ Name: Mallet
■ Creator: UMass (McCallum, et al.)
■ License: Common Public License 1.0
■ Website: mallet.cs.umass.edu
■ Source: hg-iesl.cs.umass.edu/hg/mallet
■ Features:
  ¤ Java-based “Machine Learning for Language Toolkit”
  ¤ Document classification, clustering, topic modeling, information extraction & sequence tagging, etc.
  ¤ Efficient implementation of LDA for topic modeling

Page 24: Open Source Software for Data Scientists -- BigConf 2014

Natural Language Processing

■ Name: NLTK
■ Creator: Bird, Loper, et al.
■ License: Apache License 2.0
■ Website: nltk.org
■ Source: github.com/nltk/nltk
■ Features:
  ¤ Natural Language Toolkit for Python
  ¤ Built-in support for dozens of corpora & trained models
  ¤ Libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning
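
Tokenization is usually the first step in the pipeline listed above. The sketch below uses only Python's standard library to split text into lowercase word tokens and count the most frequent ones. It does not use NLTK, whose tokenizers handle many more cases. The regular expression, the tiny stopword set, and the names `tokenize` and `top_terms` are assumptions made for this example.

```python
import re
from collections import Counter

# Runs of letters or digits, optionally followed by an apostrophe suffix
# such as "n't" or "'s" (straight apostrophes only).
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?")

# Deliberately tiny stopword set, for illustration only.
STOPWORDS = frozenset({"the", "a", "an", "of", "and", "is", "for"})


def tokenize(text):
    """Return the lowercase word tokens found in `text`."""
    return [match.group(0).lower() for match in TOKEN_RE.finditer(text)]


def top_terms(text, n=3):
    """Return the n most frequent non-stopword tokens with their counts."""
    counts = Counter(tok for tok in tokenize(text) if tok not in STOPWORDS)
    return counts.most_common(n)


if __name__ == "__main__":
    print(tokenize("Data science isn't magic."))
    print(top_terms("the data and the data science", n=1))
```

A regular expression like this one is English-centric and ignores hyphenation, Unicode letters, and curly apostrophes, which is part of why a dedicated toolkit is worth using.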

Page 25: Open Source Software for Data Scientists -- BigConf 2014

Natural Language Processing

■ Name: Stanford CoreNLP
■ Creator: Stanford NLP Group
■ License: GPL Version 2
■ Website: nlp.stanford.edu/software/corenlp.shtml
■ Source: github.com/stanfordnlp/CoreNLP
■ Features:
  ¤ Suite of high-quality, Java-based NLP tools
  ¤ Includes POS tagger, named entity recognizer, parser, coreference resolution, sentiment analysis, SUTime, etc.
  ¤ Includes models for English, Chinese, Arabic, German

Page 26: Open Source Software for Data Scientists -- BigConf 2014

NLP + Geospatial Analysis

■ Name: CLAVIN
■ Creator: Berico Technologies
■ License: Apache License 2.0
■ Website: clavin.io
■ Source: github.com/Berico-Technologies/CLAVIN
■ Features:
  ¤ Extracts location names from text, resolves to gazetteer
  ¤ Employs context-based geospatial entity resolution
  ¤ ~75% accuracy, processes 1M documents per hour
  ¤ Built on Hadoop, CoreNLP, OpenNLP, GeoNames.org

Page 27: Open Source Software for Data Scientists -- BigConf 2014

Social Network Analysis

■ Name: Gephi
■ Creator: UTC France
■ License: GPL Version 3
■ Website: gephi.org
■ Source: github.com/gephi/gephi
■ Features:
  ¤ Network analysis and visualization package for Java
  ¤ Dynamic network analysis with temporal filtering
  ¤ Metrics include: community detection, betweenness, closeness, clustering coefficient, PageRank, etc.
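
Of the metrics listed, PageRank is easy to sketch in a few lines. The following is a plain power-iteration version in standard-library Python. It is not Gephi's implementation. The function name `pagerank`, the default damping factor of 0.85, the convergence tolerance, and the toy graph are assumptions made for this example.

```python
def pagerank(graph, damping=0.85, tol=1e-10, max_iter=200):
    """Approximate PageRank by power iteration.

    `graph` maps each node to a list of the nodes it links to.
    Nodes that appear only as link targets are included automatically.
    """
    # Collect every node, preserving first-seen order.
    nodes = list(graph)
    seen = set(nodes)
    for targets in graph.values():
        for target in targets:
            if target not in seen:
                seen.add(target)
                nodes.append(target)
    if not nodes:
        return {}

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Rank held by nodes with no outgoing links is spread evenly.
        dangling = sum(rank[node] for node in nodes if not graph.get(node))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for node in nodes:
            targets = graph.get(node, [])
            for target in targets:
                new_rank[target] += damping * rank[node] / len(targets)
        change = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank


if __name__ == "__main__":
    # Hypothetical toy graph: "a" and "b" link to each other, "c" links to "a".
    ranks = pagerank({"a": ["b"], "b": ["a"], "c": ["a"]})
    for node, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
        print(node, round(score, 4))
```

In the toy graph, "a" ranks highest because it receives links from both other nodes, while "c" receives none and keeps only the baseline share.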

Page 28: Open Source Software for Data Scientists -- BigConf 2014

Data Visualization

■ Name: D3.js
■ Creator: Mike Bostock
■ License: BSD 3-Clause License
■ Website: d3js.org
■ Source: github.com/mbostock/d3
■ Features:
  ¤ JavaScript library based on HTML, SVG, and CSS
  ¤ Binds data to DOM & enables transformations
  ¤ ~200 examples, including: force-directed graphs, choropleths, treemaps, dendrograms, animations, etc.

Page 29: Open Source Software for Data Scientists -- BigConf 2014

Fusion, Analysis, and Visualization

■ Name: Lumify
■ Creator: Altamira
■ License: Apache License 2.0
■ Website: lumify.io
■ Source: github.com/altamiracorp/lumify
■ Features:
  ¤ Built on Hadoop, Storm, Accumulo, Elasticsearch, etc.
  ¤ Integrates structured data, text, images, video
  ¤ Cell-level security & access controls
  ¤ Live, shared collaborative workspaces

Page 30: Open Source Software for Data Scientists -- BigConf 2014
Page 31: Open Source Software for Data Scientists -- BigConf 2014

Final Thought…

Save your $$$ for:
■ People
  ¤ salaries, training, etc.
■ Resources
  ¤ hardware, AWS, etc.
■ Proprietary software
  ¤ if no viable OSS alternative exists

photo: Brett Weinstein (http://bit.ly/1dHXvqJ)

Springer’s FINAL THOUGHT

Page 32: Open Source Software for Data Scientists -- BigConf 2014

open source software for data scientists

oss4ds.com

Page 33: Open Source Software for Data Scientists -- BigConf 2014

Charlie Greenbacker | @greenbacker www.oss4ds.com