Open Source Software for Data Scientists -- BigConf 2014

description

As presented at BigConf on 28 March 2014 in Silver Spring, MD (http://www.bigconf.io/schedule/index#charlie_greenbacker). Harvard Business Review called it "the sexiest job of the 21st century." These days, data scientists face an onslaught of companies pitching products that promise to solve all their problems. Is there such a thing as a "silver bullet" for data science, and is it worth the hefty price tag? This talk briefly discusses what data science is, argues why open source software is usually the right choice for data scientists, and examines some of the leading OSS tools for data science available today. Topics include statistical analysis, data mining, machine learning, natural language processing, and data visualization. Additional materials are provided on the presentation's companion website: oss4ds.com

Transcript of Open Source Software for Data Scientists -- BigConf 2014

  • Open Source Software for Data Scientists. Charlie Greenbacker, Director of Data Science. 28 Mar 2014
  • Altamira Technologies Corporation 2014. Agenda: What is a Data Scientist? Why use Open Source Software? Survey of Open Source Software Tools: Statistical Analysis, Data Mining, Machine Learning, Natural Language Processing, Social Network Analysis, Data Visualization
  • About me: @greenbacker. Theories: popular tripe. Methods: sloppy. Conclusions: highly questionable. photo: Columbia Pictures
  • Best reason for not finishing PhD
  • Altamira Technologies Corporation 2014 @ExploreAltamira
  • What is a Data Scientist?
  • credit: Drew Conway (http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram)
  • http://www.itproportal.com/2014/02/11/how-to-pick-a-data-scientist-the-right-way/ Paul Cooper, ITProPortal.com A data scientist is someone who understands the domains of programming, machine learning, data mining, statistics, and hacking
  • Computer Programming; Mathematics & Analytic Methodology; Distributed Computing & Big Data; Data Science; Statistical Analysis; Data Mining; Machine Learning; Natural Language Processing; Social Network Analysis; Data Visualization; Domain Knowledge & Communication Skills; etc.
  • Why use Open Source Software?
  • photo: Karen (https://flic.kr/p/5njby2) "THERE ARE NO SILVER BULLETS."
  • photo: Paul Inkles (https://flic.kr/p/e2QMS5) "IF YOUR BOSS BUYS SOMETHING, YOU DAMN WELL BETTER USE IT."
  • photo: Valugi (http://bit.ly/1jrvVBC) "BUDGETS DON'T SCALE."
  • Survey of OSS Tools
  • Statistical Analysis. Name: R. Creator: Gentleman, Ihaka, et al. License: GPL Version 2. Website: r-project.org. Source: cran.us.r-project.org/src/base/. Features: Language & environment for statistical computing & viz. Linear and nonlinear modeling, classical statistical tests, time-series analysis, graphical techniques, and more. 5000+ packages available in CRAN repository.
  • Data Mining. Name: Pandas. Creator: Wes McKinney, et al. License: BSD 3-Clause License. Website: pandas.pydata.org. Source: github.com/pydata/pandas. Features: Data analysis workflow in Python. DataFrame object for fast manipulation & indexing. Tools for reading & writing data between formats. Label-based slicing, indexing, and subsetting of data.
  • Data Mining. Name: Impala. Creator: Cloudera. License: Apache License 2.0. Website: impala.io. Source: github.com/cloudera/impala. Features: MPP query engine implemented on Hadoop. Low latency, high concurrency SQL & BI queries. Same interfaces as Apache Hive, but ~24x faster. Written in C++; does not use MapReduce.
  • Machine Learning. Name: Mahout. Creator: ASF. License: Apache License 2.0. Website: mahout.apache.org. Source: svn.apache.org/viewvc/mahout. Features: Distributed/scalable ML library for Hadoop. Classification, clustering, collaborative filtering. Logistic regression, naive Bayes, random forest, neural networks, HMM, k-means, SVD, PCA, ALS, LDA, etc.
  • Machine Learning. Name: Scikit-learn. Creator: Cournapeau, et al. License: BSD 3-Clause License. Website: scikit-learn.org. Source: github.com/scikit-learn/scikit-learn. Features: ML library for Python built on NumPy, SciPy, matplotlib. Support for classification, clustering, dimensionality reduction, regression, model selection, preprocessing. SVM, k-NN, PCA, NNMF, crossval, feature extraction, ...
  • Machine Learning + NLP. Name: Mallet. Creator: UMass (McCallum, et al.). License: Common Public License 1.0. Website: mallet.cs.umass.edu. Source: hg-iesl.cs.umass.edu/hg/mallet. Features: Java-based Machine Learning for Language Toolkit. Document classification, clustering, topic modeling, information extraction & sequence tagging, etc. Efficient implementation of LDA for topic modeling.
  • Natural Language Processing. Name: NLTK. Creator: Bird, Loper, et al. License: Apache License 2.0. Website: nltk.org. Source: github.com/nltk/nltk. Features: Natural Language Toolkit for Python. Built-in support for dozens of corpora & trained models. Libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.
  • Natural Language Processing. Name: Stanford CoreNLP. Creator: Stanford NLP Group. License: GPL Version 2. Website: nlp.stanford.edu/software/corenlp.shtml. Source: github.com/stanfordnlp/CoreNLP. Features: Suite of high-quality, Java-based NLP tools. Includes POS tagger, named entity recognizer, parser, coreference resolution, sentiment analysis, SUTime, etc. Includes models for English, Chinese, Arabic, German.
  • NLP + Geospatial Analysis. Name: CLAVIN. Creator: Berico Technologies. License: Apache License 2.0. Website: clavin.io. Source: github.com/Berico-Technologies/CLAVIN. Features: Extracts location names from text, resolves to gazetteer. Employs context-based geospatial entity resolution. ~75% accuracy, processes 1M documents per hour. Built on Hadoop, CoreNLP, OpenNLP, GeoNames.org.
  • Social Network Analysis. Name: Gephi. Creator: UTC France. License: GPL Version 3. Website: gephi.org. Source: github.com/gephi/gephi. Features: Network analysis and visualization package for Java. Dynamic network analysis with temporal filtering. Metrics include: community detection, betweenness, closeness, clustering coefficient, PageRank, etc.
  • Data Visualization. Name: D3.js. Creator: Mike Bostock. License: BSD 3-Clause License. Website: d3js.org. Source: github.com/mbostock/d3. Features: JavaScript library based on HTML, SVG, and CSS. Binds data to DOM & enables transformations. ~200 examples, including: force-directed graphs, choropleths, treemaps, dendrograms, animations, etc.
  • Fusion, Analysis, and Visualization. Name: Lumify. Creator: Altamira. License: Apache License 2.0. Website: lumify.io. Source: github.com/altamiracorp/lumify. Features: Built on Hadoop, Storm, Accumulo, Elasticsearch, etc. Integrates structured data, text, images, video. Cell-level security & access controls. Live, shared collaborative workspaces.
  • Final Thought. Save your $$$ for: People (salaries, training, etc.). Resources (hardware, AWS, etc.). Proprietary software (if no viable OSS alternative exists). photo: Brett Weinstein (http://bit.ly/1dHXvqJ) FINAL THOUGHT Springers
  • open source software for data scientists oss4ds.com
  • Charlie Greenbacker | @greenbacker www.oss4ds.com
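
The Gephi slide lists PageRank among its network metrics. As a rough illustration of what that metric computes, here is a minimal pure-Python power-iteration sketch. It is not Gephi's implementation. The function name `pagerank`, the damping factor of 0.85, the convergence tolerance, and the toy graph are all assumptions chosen for illustration.

```python
def pagerank(graph, damping=0.85, tol=1e-10, max_iter=200):
    """Approximate PageRank by power iteration.

    graph maps each node to the list of nodes it links to.
    All parameter defaults are illustrative assumptions.
    """
    # Collect every node, including ones that only appear as link targets.
    nodes = sorted(set(graph) | {v for targets in graph.values() for v in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(max_iter):
        # Rank held by nodes with no outgoing links is spread evenly.
        dangling = sum(rank[u] for u in nodes if not graph.get(u))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}

        # Each node passes an equal share of its rank along each outgoing link.
        for u in nodes:
            targets = graph.get(u, [])
            if targets:
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new_rank[v] += share

        change = sum(abs(new_rank[x] - rank[x]) for x in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank


if __name__ == "__main__":
    # Hypothetical four-node directed graph.
    toy = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    for node, score in sorted(pagerank(toy).items(), key=lambda kv: -kv[1]):
        print(f"{node}: {score:.3f}")
```

In this toy graph, node `c` receives links from every other node and ends up with the highest score, while `d`, which nothing links to, keeps only the baseline share.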