Using R for Predictive Analytics / Data Mining / Data Science - Oct 2009

9
Using R for Predictive Analytics Szil´ ard Pafka Predictive Analytics World / DC useR Group October 20, 2009

description

This talk argues already in 2009 that R is a viable tool for the entire lifecycle of data science / data mining / predictive analytics. A few years later R has indeed become the #1 tool used by data scientists [http://www.rexeranalytics.com/].

Transcript of Using R for Predictive Analytics / Data Mining / Data Science - Oct 2009

Page 1: Using R for Predictive Analytics / Data Mining / Data Science - Oct 2009

Using R for Predictive Analytics

Szilard Pafka

Predictive Analytics World / DC useR Group

October 20, 2009

Page 2: Using R for Predictive Analytics / Data Mining / Data Science - Oct 2009

Introduction and agenda

Introduction:

• physics, finance, statistical modeling (PhD)

• data mining at a credit card processor (Chief Scientist)

• co-organizing the Los Angeles area R users group

Agenda:

• using R for predictive analytics

• benefits from R users groups

Page 3: Using R for Predictive Analytics / Data Mining / Data Science - Oct 2009

R and predictive analytics

R:

• data manipulation

• statistical analysis, exploratory data analysis

• computations, simulations

• modeling

• visualization

Predictive analytics (data mining for prediction problems):

• understanding the domain problem and related questions

• data exploration and cleaning

• building predictive models (supervised learning)

• model diagnostics, selection and evaluation

• deployment

Page 4: Using R for Predictive Analytics / Data Mining / Data Science - Oct 2009

R vs alternatives

R:

• open source• packages• latest cutting-edge methods

• command line interaction (vs GUI)• flexible• results easier to reproduce

• learning curve• books• R-help

• cost, multiple platforms

alternatives:

• spreadsheets

• programminglanguages

• data miningsoftware

• similar softwareenvironments

Page 5: Using R for Predictive Analytics / Data Mining / Data Science - Oct 2009

Data exploration

• data frames

• import from various sources (csv files, database connection)

• flexible data manipulation functionalities• subscripting• numerous transformations• text manipulation (e.g. regexp)• data aggregation• specialized packages (reshape, plyr etc.)

• graphics

• sample R code

Page 6: Using R for Predictive Analytics / Data Mining / Data Science - Oct 2009

Visualization

• powerful tool for data exploration, diagnostics etc.

• R graphics: base, lattice, ggplot2

• visual perception, principles of visual design

• ggplot2• based on the grammar of graphics• nice defaults• elements: mappings, geoms, stats, scales, facets• very expressive (sample code)

Page 7: Using R for Predictive Analytics / Data Mining / Data Science - Oct 2009

Building predictive models

• supervised learning: y = f (x1, x2, . . . , xp)

• regression (numeric y), classification (categorical y)

• spam detection example: classification (2 classes, Y/N)

• numerous methods: logistic regression, LDA, nearest neighbors,

decision trees, bagging, boosting, random forests, neural networks,

support vector machines etc.

• example: neural networks with nnet (sample code)• weight decay as regularization• helps convergence and against overfitting

Page 8: Using R for Predictive Analytics / Data Mining / Data Science - Oct 2009

Model evaluation

99% accuracy?

• package caret: unified interface/wrapper for• training models, tuning (complexity parameters)• diagnosing, selecting, evaluating models

• model deployment

• Conclusion: R is a viable tool for predictive analytics

Page 9: Using R for Predictive Analytics / Data Mining / Data Science - Oct 2009

R users groups

• R users groups around the world, in the US: SF, NY, LA, DC...

• Los Angeles: 150+ members, 4 meetings so far

• talks, Q&A, discussions, exchange of knowledge• pointing people where to look for (packages, docs etc.)• also more generally for statistical methodology issues

• networking opportunities