An Analytics Toolkit Tour
A Programming Language/Toolkit Tour
Rory Winston
Monday, 27 February 2012
Agenda
• A quick overview and tour of:
• R
• Python
• Java/C++
• For data analysis/analytics applications
• Comparison
Purpose
• To give a feeling for the relative advantages and disadvantages of each approach
• Understand the tradeoffs involved
• See some demos
R
• R is a domain-specific language (DSL) for statistics and data analysis
• Functional-style language
• Based on an earlier language called S
• Core engine written in C
• Open-source
• Popularity has exploded in the last few years
• Some commercial support
Pros
• R is the de facto standard in statistical analysis tooling
• Incredible range of functionality via contributed libraries
• Powerful interactive analysis environment and visualization tools
• Large number of built-in datasets
• Cross-platform
• Broad user community
• Wide range of resources (books, tutorials, papers) available
Cons
• Performance limitations
• Single-threaded interpreter
• Language limitations and quirks
• Initial learning curve may be steep
• R gives you a lot of power, but assumes you know how to use it!
Language Features
• R is vectorized:
• Loops are not required for many operations (and are actually discouraged)
• R is functional:
• Functions can be passed around like other variables
• R integrates with a BLAS:
• high-performance numerical operations
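The "functions can be passed around like other variables" point can be sketched in a few lines. The sketch below uses Python rather than R, purely as an analogue; the helper names apply_to_all and compose are hypothetical and not from the talk, and it does not demonstrate vectorization.

```python
def apply_to_all(func, values):
    """Call func on every element and return the results as a list."""
    return [func(v) for v in values]


def compose(f, g):
    """Return a new function that computes f(g(x))."""
    return lambda x: f(g(x))


def square(v):
    return v * v


# A function is an ordinary value: it can be stored, passed, and returned.
operation = square
squares = apply_to_all(operation, [1, 2, 3])

# Functions can also be built from other functions.
square_of_abs = compose(square, abs)
result = square_of_abs(-3)
```

In R the same idea shows up in calls such as sapply(x, f), where f is passed as an argument.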
Demo
• Console R
• R GUI
• RStudio
Tips
• Learn how to use ggplot2 (http://had.co.nz/ggplot2/)
• Consider using RStudio (http://www.rstudio.org)
Python
• Initially developed in the late 1980s
• Object-oriented / functional support
• Open-source
• Initially popular in web applications, now popular across a number of domains
Pros
• Very readable, simple and clear syntax
• Well-supported (many libraries and extensions)
• Easy to integrate with other languages (e.g. C)
• Very efficient environment to develop in
Cons
• Language syntax is not universally popular
• In terms of analytics, many libraries are still slightly immature
• Performance can be lacking (although there are many options to tune it)
• Interpreter is effectively single-threaded
Python + Analytics
• There are a number of excellent libraries available for analytics applications:
• NumPy + SciPy
• matplotlib
• pandas
• scikits
• Some packages (e.g. pandas) are designed to replicate the ‘feel’ and functionality of analysis operations in R
NumPy + SciPy
• Using NumPy + SciPy + matplotlib provides an experience similar to using an interactive R/Matlab environment
• Supports vectorization and BLAS integration
• Add IPython for a better interactive shell (tab completion, inline help)
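As a rough illustration of what "vectorization and BLAS integration" means in practice, here is a minimal NumPy sketch. The price values are made up for illustration, and whether the matrix product actually reaches an optimized BLAS depends on how NumPy was built.

```python
import numpy as np

# Element-wise arithmetic on whole arrays, with no explicit Python loop.
prices = np.array([100.0, 102.0, 101.0, 105.0])  # made-up values
returns = prices[1:] / prices[:-1] - 1.0

# Boolean masks select elements, much like x[x < 1.5] in R.
x = np.linspace(1, 2, 5)
below = x[x < 1.5]

# For floating-point arrays, NumPy typically hands matrix products
# to the BLAS library it was built against.
M = np.arange(1.0, 5.0).reshape(2, 2)
product = np.dot(M, M)
```

Each of these lines would need a loop (or nested loops) in plain Python; the array form is both shorter and usually much faster.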
Tips
• Use ipython!
• Check out:
• http://pandas.pydata.org/
• http://statsmodels.sourceforge.net/
• http://scikit-learn.org
Comparisons
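R:
x <- 1:10
x <- seq(1, 2, .2)
x <- seq(1, 2, length.out=15)
M <- matrix(1:100, 10, 10)
x[ x < 1.5 ]
X <- cbind(a, b)
Python (NumPy, after "from numpy import *"):
x = arange(1, 11)
x = arange(1, 2, .2)                # excludes the endpoint 2, unlike seq(1, 2, .2)
x = linspace(1, 2, 15)
M = arange(1, 101).reshape(10, 10)  # filled row by row; R's matrix() fills column by column
x[x < 1.5]
X = column_stack((a, b))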
Java/C++
• The ultimate in power/flexibility
• Also the ultimate in development time and effort
• Let's just look at C++ briefly
C++
• Old but still very popular
• Just had a revamp (C++11, was C++0x)
• Mostly competes with Java on the server side
• Everything else (JVM, R, Python) is written in C/C++
• Both R and Python provide easy ways to interface with C/C++ code
• This is used a lot
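One of several ways Python can call into C is the standard-library ctypes module. The sketch below calls strlen from the C standard library; it assumes a POSIX system (Linux or macOS) where the C library can be located or is already loaded into the process, and the wrapper name c_strlen is hypothetical.

```python
import ctypes
import ctypes.util

# Locate the C standard library. If find_library returns None, CDLL(None)
# falls back to the symbols already loaded into the running process
# (this fallback works on Linux and macOS, not on Windows).
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(text):
    """Return the UTF-8 byte length of text as measured by C's strlen."""
    return libc.strlen(text.encode("utf-8"))


length = c_strlen("analytics")
```

Declaring argtypes and restype up front matters: without them ctypes assumes int arguments and return values, which is one of the "hidden traps" when crossing the language boundary.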
Pros
• Flexibility
• Lots of libraries available
• Control of resources for performance-critical apps (e.g. memory)
• C++11 adds a lot of nice stuff (finally)
Cons
• Lots of effort
• Lots of hidden traps for the unwary
• Initial experience may be a large productivity hit
• Effort in porting between systems
• There is “modern” C++ (which is actually pretty nice) and everything else (which isn’t so nice)
Examples
• Let's look at a sample library
• This one is called Armadillo (http://arma.sourceforge.net/)
• Developed in Australia (NICTA / Univ. Queensland)
• Contains functions for numerical applications and some statistical functions
• Modern, efficient use of C++
Armadillo
• Armadillo supports vectorized operations
• Also integrates with a BLAS
• Example (see console)
Simple Example
• Using the Box-Jenkins airline passenger data
• Classic dataset
• 12 years of monthly airline passenger observations (144 in all)
Passenger Dataset
Linear Model
• We will use a simple linear model (explains 85% of the variance of this data)
$$Ax = b, \qquad A = \begin{pmatrix} 1 & t_1 \\ 1 & t_2 \\ 1 & t_3 \\ \vdots & \vdots \end{pmatrix}$$
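With more observations than unknowns, Ax = b is solved in the least-squares sense, where x holds the intercept and slope and b holds the observations. A minimal pure-Python sketch via the normal equations follows; the function names and the six data points are made up for illustration and are not the passenger data or the talk's demo code.

```python
def fit_line(t, y):
    """Least-squares fit of y ~ intercept + slope * t via the normal equations."""
    n = len(t)
    # Entries of A^T A and A^T b for a design matrix A with rows [1, t_i].
    s_t = sum(t)
    s_tt = sum(ti * ti for ti in t)
    s_y = sum(y)
    s_ty = sum(ti * yi for ti, yi in zip(t, y))
    # Solve the 2x2 system (A^T A) x = A^T b in closed form.
    det = n * s_tt - s_t * s_t
    intercept = (s_tt * s_y - s_t * s_ty) / det
    slope = (n * s_ty - s_t * s_y) / det
    return intercept, slope


def r_squared(t, y, intercept, slope):
    """Fraction of the variance of y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * ti)) ** 2 for ti, yi in zip(t, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


t = [1, 2, 3, 4, 5, 6]
y = [5.1, 6.9, 9.2, 10.8, 13.1, 15.0]  # made-up values, not passenger data
intercept, slope = fit_line(t, y)
r2 = r_squared(t, y, intercept, slope)
```

In practice a library routine (lm in R, a least-squares solver in NumPy or Armadillo) would be used instead, since solving the normal equations by hand is numerically less robust for larger models.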
Conclusion
• Use the toolkit that’s most appropriate for you
• A common approach is to use e.g. R for prototyping and model selection and, if required, switch to a higher-performance implementation for production
• If you have time, learn all of them!
Language Map
[Figure: language map placing R and Octave, Python and Ruby, and Java and C/C++ along axes labeled "Performance, complexity", "Interactivity", and "Dynamic Typing" to "Static Typing"]
Resources