Transcript of "Using R to win Kaggle Data Mining Competitions", Chris Raimondi, November 1, 2012.
Slide 1
Using R to win Kaggle Data Mining Competitions
Chris Raimondi
November 1, 2012
Slide 2
Overview of talk
What I hope you get out of this talk
Life before R
Simple model example
R programming language
Background/Stats/Info
How to get started
Kaggle
Slide 3
Overview of talk
Individual Kaggle competitions
HIV Progression
Chess
Mapping Dark Matter
Dunnhumby's Shoppers Challenge
Online Product Sales
Slide 4
What I want you to leave with
Belief that you don't need to be a statistician to use R - NOR do you need to fully understand Machine Learning in order to use it
Motivation to use Kaggle competitions to learn R
Knowledge on how to start
Slide 5
My life before R
Lots of Excel
Had tried programming in the past, got frustrated
Read NY Times article in January 2009 about R & Google
Installed R, but gave up after a couple of minutes
Months later
Slide 6
My life before R
Using Excel to run PageRank calculations that took hours and were very messy
Was experimenting with Pajek, a Windows-based network/link analysis program
Was looking for a similar program that did PageRank calculations
Revisited R as a possibility
Slide 7
My life before R
Came across R Graph Gallery
Saw this graph
Slide 8
Slide 9
Addicted to R in one line of code
pairs(iris[1:4], main="Edgar Anderson's Iris Data", pch=21, bg=c("red", "green3", "blue")[unclass(iris$Species)])
pairs = function
iris = dataframe
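To see what each piece of that one-liner is doing, the arguments can be pulled apart and inspected on their own. This is only an illustrative sketch; the variable names (measurements, species_code, point_colours) are made up here and are not from the talk.

```r
# iris ships with R, so nothing needs to be installed or loaded.
is.function(pairs)      # pairs is a function (it draws a scatterplot matrix)
is.data.frame(iris)     # iris is a data frame

# iris[1:4] keeps the first four columns: the numeric measurements.
measurements <- iris[1:4]
names(measurements)

# iris$Species is a factor; unclass() exposes its underlying integer codes,
# one code per species level.
species_code <- unclass(iris$Species)

# Indexing the colour vector by those codes gives one fill colour per row,
# which is what the bg= argument of the pairs() call receives.
point_colours <- c("red", "green3", "blue")[species_code]
table(point_colours, iris$Species)
```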
Slide 10
What do we want to do with R?
Machine learning, a.k.a. (or more specifically) making models
We want to TRAIN a model on a set of data with KNOWN answers/outcomes
In order to PREDICT the answer/outcome for similar data where the answer is not known
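A minimal sketch of the train/predict idea, using only base R and the built-in iris data. The 100/50 split, the seed, and the choice of a linear model are assumptions made for illustration, not something taken from the talk.

```r
# Pretend we know the answer (Petal.Width) for 100 flowers and not for the other 50.
set.seed(42)                                  # arbitrary seed, only for reproducibility
known_rows <- sample(nrow(iris), 100)
known   <- iris[known_rows, ]                 # data with KNOWN outcomes
unknown <- iris[-known_rows, ]                # data whose outcome we hide from the model

# TRAIN: learn how Petal.Width relates to Petal.Length from the known rows.
fit <- lm(Petal.Width ~ Petal.Length, data = known)

# PREDICT: apply the trained model to the rows with the answer hidden.
predicted <- predict(fit, newdata = unknown[, "Petal.Length", drop = FALSE])

# We can only check the predictions here because we hid the answers ourselves.
head(data.frame(predicted = predicted, actual = unknown$Petal.Width))
```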
Slide 11
Slide 12
How to train a model
R allows for the training of models using probably over 100 different machine learning methods
To train a model you need to provide:
1. Name of the function (which machine learning method)
2. Name of the dataset
3. What your response variable is and what features you are going to use
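Those three ingredients map directly onto a typical R modelling call. The sketch below uses rpart (recursive partitioning) purely as an example method; it is a recommended package that normally ships with R, and the choice of features is an assumption for illustration.

```r
library(rpart)   # assumed to be available; normally installed alongside R

# 1. the function       -> rpart (which machine learning method)
# 2. the dataset        -> data = iris
# 3. response ~ features -> Species is the response, the two petal columns are the features
fit <- rpart(Species ~ Petal.Length + Petal.Width, data = iris)

# Ask the trained model for a predicted class for each row.
pred <- predict(fit, newdata = iris, type = "class")

# Cross-tabulate predictions against the known species.
table(predicted = pred, actual = iris$Species)
```

Swapping in another method usually means changing only the function name (and loading its package); the dataset and the response ~ features formula tend to stay the same.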
Slide 13
Example machine learning methods available in R
Bagging                       Partial Least Squares
Boosted Trees                 Principal Component Regression
Elastic Net                   Projection Pursuit Regression
Gaussian Processes            Quadratic Discriminant Analysis
Generalized additive model    Random Forests
Generalized linear model      Recursive Partitioning
K Nearest Neighbor            Rule-Based Models
Linear Regression             Self-Organizing Maps
Nearest Shrunken Centroids    Sparse Linear Discriminant Analysis
Neural Networks               Support Vector Machines
Slide 14
Code used to train decision tree
library(party)
irisct