Introduction to Machine Learning using R - Dublin R User Group - Oct 2013
![Page 1: Introduction to Machine Learning using R - Dublin R User Group - Oct 2013](https://reader033.fdocuments.us/reader033/viewer/2022051400/54c64e824a7959ad7b8b458e/html5/thumbnails/1.jpg)
Introduction with Examples
Eoin Brazil - https://github.com/braz/DublinR-ML-treesandforests
![Page 2: Introduction to Machine Learning using R - Dublin R User Group - Oct 2013](https://reader033.fdocuments.us/reader033/viewer/2022051400/54c64e824a7959ad7b8b458e/html5/thumbnails/2.jpg)
Machine Learning Techniques in R
A bit of context around ML
How can you interpret their results?
A few techniques to improve prediction / reduce over-fitting
Kaggle & similar competitions - using ML for fun & profit
Nuts & Bolts - 4 data sets and 6 techniques
A brief tour of some useful data handling / formatting tools
2/50
![Page 3: Introduction to Machine Learning using R - Dublin R User Group - Oct 2013](https://reader033.fdocuments.us/reader033/viewer/2022051400/54c64e824a7959ad7b8b458e/html5/thumbnails/3.jpg)
Types of Learning
3/50
![Page 4: Introduction to Machine Learning using R - Dublin R User Group - Oct 2013](https://reader033.fdocuments.us/reader033/viewer/2022051400/54c64e824a7959ad7b8b458e/html5/thumbnails/4.jpg)
A bit of context around ML
4/50
![Page 5: Introduction to Machine Learning using R - Dublin R User Group - Oct 2013](https://reader033.fdocuments.us/reader033/viewer/2022051400/54c64e824a7959ad7b8b458e/html5/thumbnails/5.jpg)
5/50
![Page 6: Introduction to Machine Learning using R - Dublin R User Group - Oct 2013](https://reader033.fdocuments.us/reader033/viewer/2022051400/54c64e824a7959ad7b8b458e/html5/thumbnails/6.jpg)
Model Building Process
6/50
![Page 7: Introduction to Machine Learning using R - Dublin R User Group - Oct 2013](https://reader033.fdocuments.us/reader033/viewer/2022051400/54c64e824a7959ad7b8b458e/html5/thumbnails/7.jpg)
Model Selection and Model Assessment
7/50
![Page 8: Introduction to Machine Learning using R - Dublin R User Group - Oct 2013](https://reader033.fdocuments.us/reader033/viewer/2022051400/54c64e824a7959ad7b8b458e/html5/thumbnails/8.jpg)
Model Choice - Move from Adaptability to Simplicity
8/50
![Page 9: Introduction to Machine Learning using R - Dublin R User Group - Oct 2013](https://reader033.fdocuments.us/reader033/viewer/2022051400/54c64e824a7959ad7b8b458e/html5/thumbnails/9.jpg)
Interpreting A Confusion Matrix
9/50
![Page 10: Introduction to Machine Learning using R - Dublin R User Group - Oct 2013](https://reader033.fdocuments.us/reader033/viewer/2022051400/54c64e824a7959ad7b8b458e/html5/thumbnails/10.jpg)
Interpreting A Confusion Matrix Example
10/50
![Page 11: Introduction to Machine Learning using R - Dublin R User Group - Oct 2013](https://reader033.fdocuments.us/reader033/viewer/2022051400/54c64e824a7959ad7b8b458e/html5/thumbnails/11.jpg)
Confusion Matrix - Calculations
11/50
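The calculations behind a confusion matrix can be sketched in a few lines of base R. The counts below are hypothetical, not taken from the slides, and the metric names follow common usage rather than any particular package.

```r
# Hypothetical counts for a two-class problem (not taken from the slides).
tp <- 40  # true positives
fn <- 10  # false negatives
fp <- 5   # false positives
tn <- 45  # true negatives

accuracy    <- (tp + tn) / (tp + fn + fp + tn)
sensitivity <- tp / (tp + fn)   # true positive rate (recall)
specificity <- tn / (tn + fp)   # true negative rate
precision   <- tp / (tp + fp)   # positive predictive value
fpr         <- 1 - specificity  # false positive rate, the ROC x-axis

metrics <- c(accuracy = accuracy, sensitivity = sensitivity,
             specificity = specificity, precision = precision, fpr = fpr)
print(round(metrics, 3))
```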
![Page 12: Introduction to Machine Learning using R - Dublin R User Group - Oct 2013](https://reader033.fdocuments.us/reader033/viewer/2022051400/54c64e824a7959ad7b8b458e/html5/thumbnails/12.jpg)
Interpreting A ROC Plot
· A point in this plot is better than another if it is to the northwest (TPR higher, FPR lower, or both)
· "Conservatives" - on the LHS and near the X-axis - only make positive classifications with strong evidence, so they make few FP errors but have low TP rates
· "Liberals" - on the upper RHS - make positive classifications with weak evidence, so nearly all positives are identified, but with high FP rates
12/50
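A minimal sketch of where "conservative" and "liberal" classifiers land in ROC space, assuming a classifier that outputs a score and a threshold that turns it into a class. The scores, labels and thresholds are made up for illustration.

```r
# Hypothetical scores (higher = more likely positive) and true labels.
scores <- c(0.95, 0.85, 0.70, 0.60, 0.55, 0.40, 0.30, 0.20)
labels <- c(1, 1, 0, 1, 0, 1, 0, 0)

# FPR and TPR when everything scoring at or above `threshold` is called positive.
roc_point <- function(threshold) {
  predicted <- as.integer(scores >= threshold)
  c(fpr = sum(predicted == 1 & labels == 0) / sum(labels == 0),
    tpr = sum(predicted == 1 & labels == 1) / sum(labels == 1))
}

conservative <- roc_point(0.80)  # strong evidence required
liberal      <- roc_point(0.25)  # weak evidence accepted
print(rbind(conservative, liberal))
```

With these made-up values the conservative threshold sits on the left edge of the plot (no false positives, half the positives found), while the liberal one sits in the upper right (all positives found, most negatives misclassified).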
![Page 13: Introduction to Machine Learning using R - Dublin R User Group - Oct 2013](https://reader033.fdocuments.us/reader033/viewer/2022051400/54c64e824a7959ad7b8b458e/html5/thumbnails/13.jpg)
ROC Dangers
13/50
![Page 14: Introduction to Machine Learning using R - Dublin R User Group - Oct 2013](https://reader033.fdocuments.us/reader033/viewer/2022051400/54c64e824a7959ad7b8b458e/html5/thumbnails/14.jpg)
Addressing Prediction Error
· K-fold Cross-Validation (e.g. 10-fold)
  - Allows for averaging the error across the models
· Bootstrapping, draw B random samples with replacement from data set to create B bootstrapped data sets with same size as original. These are used as training sets with the original used as the test set.
· Other variations on above:
  - Repeated cross validation
  - The '.632' bootstrap
14/50
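Both resampling schemes can be sketched in base R. This is an illustration under assumptions: the data are simulated, the model is a plain `lm()` fit, and RMSE is used as the error measure; none of these come from the slides.

```r
set.seed(42)  # arbitrary seed so the sketch is repeatable

# Hypothetical data: a noisy linear relationship.
n <- 100
dat <- data.frame(x = runif(n))
dat$y <- 2 * dat$x + rnorm(n, sd = 0.3)

rmse <- function(model, newdata) {
  sqrt(mean((newdata$y - predict(model, newdata))^2))
}

# 10-fold cross-validation: each row is held out exactly once.
k <- 10
fold <- sample(rep(seq_len(k), length.out = n))
cv_errors <- sapply(seq_len(k), function(i) {
  fit <- lm(y ~ x, data = dat[fold != i, ])
  rmse(fit, dat[fold == i, ])
})

# Bootstrap: B resamples of size n drawn with replacement are the training
# sets; the original data set is the test set, as described on the slide.
B <- 25
boot_errors <- replicate(B, {
  idx <- sample(n, size = n, replace = TRUE)
  fit <- lm(y ~ x, data = dat[idx, ])
  rmse(fit, dat)
})

cat("10-fold CV RMSE:", mean(cv_errors), "\n")
cat("Bootstrap RMSE: ", mean(boot_errors), "\n")
```

Averaging over the folds (or over the B resamples) is what gives a steadier error estimate than a single train/test split.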
![Page 15: Introduction to Machine Learning using R - Dublin R User Group - Oct 2013](https://reader033.fdocuments.us/reader033/viewer/2022051400/54c64e824a7959ad7b8b458e/html5/thumbnails/15.jpg)
Addressing Feature Selection
15/50
![Page 16: Introduction to Machine Learning using R - Dublin R User Group - Oct 2013](https://reader033.fdocuments.us/reader033/viewer/2022051400/54c64e824a7959ad7b8b458e/html5/thumbnails/16.jpg)
Kaggle - using ML for fun & profit
16/50
![Page 17: Introduction to Machine Learning using R - Dublin R User Group - Oct 2013](https://reader033.fdocuments.us/reader033/viewer/2022051400/54c64e824a7959ad7b8b458e/html5/thumbnails/17.jpg)
Nuts & Bolts - Data sets and Techniques
17/50
![Page 18: Introduction to Machine Learning using R - Dublin R User Group - Oct 2013](https://reader033.fdocuments.us/reader033/viewer/2022051400/54c64e824a7959ad7b8b458e/html5/thumbnails/18.jpg)
18/50
![Page 19: Introduction to Machine Learning using R - Dublin R User Group - Oct 2013](https://reader033.fdocuments.us/reader033/viewer/2022051400/54c64e824a7959ad7b8b458e/html5/thumbnails/19.jpg)
Aside - How does associative analysis work ?
19/50
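As a rough sketch of the idea behind association analysis, the three usual rule measures (support, confidence, lift) can be computed by hand. The baskets below are hypothetical, and this uses base R rather than the arules package listed later in the deck.

```r
# Hypothetical market-basket transactions (one character vector per basket).
baskets <- list(
  c("bread", "milk"),
  c("bread", "butter", "milk"),
  c("butter", "milk"),
  c("bread", "butter"),
  c("bread", "butter", "milk")
)

# Support: share of baskets that contain every item in `items`.
support <- function(items) {
  mean(sapply(baskets, function(b) all(items %in% b)))
}

# Measures for the rule {bread} => {butter}.
lhs <- "bread"
rhs <- "butter"
rule_support    <- support(c(lhs, rhs))            # both sides together
rule_confidence <- rule_support / support(lhs)     # P(rhs | lhs)
rule_lift       <- rule_confidence / support(rhs)  # > 1 means positive association

print(c(support = rule_support, confidence = rule_confidence, lift = rule_lift))
```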
![Page 20: Introduction to Machine Learning using R - Dublin R User Group - Oct 2013](https://reader033.fdocuments.us/reader033/viewer/2022051400/54c64e824a7959ad7b8b458e/html5/thumbnails/20.jpg)
What are they good for ?
Marketing Survey Data - Part 1
20/50
![Page 21: Introduction to Machine Learning using R - Dublin R User Group - Oct 2013](https://reader033.fdocuments.us/reader033/viewer/2022051400/54c64e824a7959ad7b8b458e/html5/thumbnails/21.jpg)
Marketing Survey Data - Part 2
21/50
![Page 22: Introduction to Machine Learning using R - Dublin R User Group - Oct 2013](https://reader033.fdocuments.us/reader033/viewer/2022051400/54c64e824a7959ad7b8b458e/html5/thumbnails/22.jpg)
Aside - How do decision trees work ?
22/50
![Page 23: Introduction to Machine Learning using R - Dublin R User Group - Oct 2013](https://reader033.fdocuments.us/reader033/viewer/2022051400/54c64e824a7959ad7b8b458e/html5/thumbnails/23.jpg)
What are they good for ?
Car Insurance Policy Exposure Management - Part 1
· Analysing insurance claim details of 67856 policies taken out in 2004 and 2005.
· The model maps each record into one of X mutually exclusive terminal nodes or groups.
· These groups are represented by their average response, where the node number is treated as the data group.
· The binary claim indicator uses 6 variables to determine a probability estimate for each terminal node, i.e. whether an insurance policyholder will claim on their policy.
23/50
![Page 24: Introduction to Machine Learning using R - Dublin R User Group - Oct 2013](https://reader033.fdocuments.us/reader033/viewer/2022051400/54c64e824a7959ad7b8b458e/html5/thumbnails/24.jpg)
Car Insurance Policy Exposure Management - Part 2
· Root node, splits the data set on 'agecat'
· Younger drivers to the left (1-8) and older drivers (9-11) to the right
· N9 splits on the basis of vehicle value
· N10 <= $28.9k, giving 15k records and 5.4% of claims
· N11 > $28.9k, giving 1.9k records and 8.5% of claims
· Left split from root: N2 splits on vehicle body type, then on age (N4), then on vehicle value (N6)
· The n value = number of the overall population and the y value = probability of a claim from a driver in that group
24/50
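A small worked example of how the child nodes relate to their parent, assuming the percentages quoted for N10 and N11 are the claim probabilities within each node (the y value). The node sizes are the rounded figures from the slide, so the result is approximate.

```r
# Node sizes and claim rates as quoted on the slide (rounded there to "15k"
# and "1.9k", so the result is approximate).
n10 <- 15000; rate10 <- 0.054
n11 <- 1900;  rate11 <- 0.085

# The parent node N9 holds both children, so its claim rate is the
# size-weighted average of theirs.
n9    <- n10 + n11
rate9 <- (n10 * rate10 + n11 * rate11) / n9
cat(sprintf("N9: %d records, claim rate about %.1f%%\n", n9, 100 * rate9))
```

The split on vehicle value is useful precisely because the two children sit on either side of the parent's pooled rate.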
![Page 25: Introduction to Machine Learning using R - Dublin R User Group - Oct 2013](https://reader033.fdocuments.us/reader033/viewer/2022051400/54c64e824a7959ad7b8b458e/html5/thumbnails/25.jpg)
What are they good for ?
Cancer Research Screening - Part 1
· Hill et al (2007), models how well cells within an image are segmented, 61 vars with 2019 obs (Training = 1009 & Test = 1010).
  - "Impact of image segmentation on high-content screening data quality for SK-BR-3 cells, Andrew A Hill, Peter LaPan, Yizheng Li and Steve Haney, BMC Bioinformatics 2007, 8:340".
  - b, Well-Segmented (WS)
  - c, WS (e.g. complete nucleus and cytoplasmic region)
  - d, Poorly-Segmented (PS)
  - e, PS (e.g. partial match/es)
25/50
![Page 26: Introduction to Machine Learning using R - Dublin R User Group - Oct 2013](https://reader033.fdocuments.us/reader033/viewer/2022051400/54c64e824a7959ad7b8b458e/html5/thumbnails/26.jpg)
Cancer Research Screening Dataset - Part 2
26/50
![Page 27: Introduction to Machine Learning using R - Dublin R User Group - Oct 2013](https://reader033.fdocuments.us/reader033/viewer/2022051400/54c64e824a7959ad7b8b458e/html5/thumbnails/27.jpg)
Cancer Research Screening - Part 3
"prp(rpartTune$finalModel)" "fancyRpartPlot(rpartTune$finalModel)"
27/50
![Page 28: Introduction to Machine Learning using R - Dublin R User Group - Oct 2013](https://reader033.fdocuments.us/reader033/viewer/2022051400/54c64e824a7959ad7b8b458e/html5/thumbnails/28.jpg)
Cancer Research Screening - Part 4
28/50
![Page 29: Introduction to Machine Learning using R - Dublin R User Group - Oct 2013](https://reader033.fdocuments.us/reader033/viewer/2022051400/54c64e824a7959ad7b8b458e/html5/thumbnails/29.jpg)
Cancer Research Screening Dataset - Part 5
29/50
![Page 30: Introduction to Machine Learning using R - Dublin R User Group - Oct 2013](https://reader033.fdocuments.us/reader033/viewer/2022051400/54c64e824a7959ad7b8b458e/html5/thumbnails/30.jpg)
What are they good for ?
Predicting the Quality of Wine - Part 1
· Cortez et al (2009), models the quality of wines (Vinho Verde), 14 vars with 6497 obs (Training = 5199 & Test = 1298).
· "Modeling wine preferences by data mining from physicochemical properties, P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis, Decision Support Systems 2009, 47(4):547-553".
  - Good (quality score is >= 6)
  - Bad (quality score is < 6)
##
##  Bad Good
##  476  822
30/50
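A minimal sketch of the Good/Bad labelling rule and of the baseline implied by the test-set class counts. The quality scores are hypothetical; only the threshold (>= 6) and the counts 476 / 822 come from the slide.

```r
# Hypothetical quality scores (made up for illustration).
quality <- c(5, 6, 7, 4, 6, 5, 8, 6)

# Good if the score is >= 6, Bad otherwise, as defined on the slide.
label <- factor(ifelse(quality >= 6, "Good", "Bad"), levels = c("Bad", "Good"))
print(table(label))

# Class balance of the test set reported on the slide: 476 Bad, 822 Good.
test_counts <- c(Bad = 476, Good = 822)
baseline <- max(test_counts) / sum(test_counts)  # accuracy of always predicting "Good"
cat(sprintf("Majority-class baseline: %.3f\n", baseline))
```

Any classifier on this split needs to beat roughly 63% accuracy before it is doing better than always answering "Good".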
![Page 31: Introduction to Machine Learning using R - Dublin R User Group - Oct 2013](https://reader033.fdocuments.us/reader033/viewer/2022051400/54c64e824a7959ad7b8b458e/html5/thumbnails/31.jpg)
Predicting the Quality of Wine - Part 2
31/50
![Page 32: Introduction to Machine Learning using R - Dublin R User Group - Oct 2013](https://reader033.fdocuments.us/reader033/viewer/2022051400/54c64e824a7959ad7b8b458e/html5/thumbnails/32.jpg)
Predicting the Quality of Wine - Part 3
32/50
![Page 33: Introduction to Machine Learning using R - Dublin R User Group - Oct 2013](https://reader033.fdocuments.us/reader033/viewer/2022051400/54c64e824a7959ad7b8b458e/html5/thumbnails/33.jpg)
Predicting the Quality of Wine - Part 4 - Problems with Trees
· Deals with irrelevant inputs
· No data preprocessing required
· Scalable computation (fast to build)
· Tolerant of missing values (little loss of accuracy)
· Only a few tunable parameters (easy to learn)
· Allows for human understandable graphic representation

· Data fragmentation for high-dimensional sparse data sets (over-fitting)
· Difficult to fit to a trend / piece-wise constant model
· Highly influenced by changes to the data set and local optima (deep trees might be questionable as the errors propagate down)
33/50
![Page 34: Introduction to Machine Learning using R - Dublin R User Group - Oct 2013](https://reader033.fdocuments.us/reader033/viewer/2022051400/54c64e824a7959ad7b8b458e/html5/thumbnails/34.jpg)
Aside - How does a random forest work ?
34/50
![Page 35: Introduction to Machine Learning using R - Dublin R User Group - Oct 2013](https://reader033.fdocuments.us/reader033/viewer/2022051400/54c64e824a7959ad7b8b458e/html5/thumbnails/35.jpg)
Predicting the Quality of Wine - Part 5 - Random Forest
35/50
![Page 36: Introduction to Machine Learning using R - Dublin R User Group - Oct 2013](https://reader033.fdocuments.us/reader033/viewer/2022051400/54c64e824a7959ad7b8b458e/html5/thumbnails/36.jpg)
Predicting the Quality of Wine - Part 6 - Other ML methods
· K-nearest neighbors
  - Unsupervised learning / non-target based learning
  - Distance matrix / cluster analysis using Euclidean distances.
· Neural Nets
  - Looking at basic feed forward simple 3-layer network (input, 'processing', output)
  - Each node / neuron is a set of numerical parameters / weights tuned by the learning algorithm used
· Support Vector Machines
  - Supervised learning
  - non-probabilistic binary linear classifier / nonlinear classifiers by applying the kernel trick
  - constructs a hyper-plane/s in a high-dimensional space
36/50
![Page 37: Introduction to Machine Learning using R - Dublin R User Group - Oct 2013](https://reader033.fdocuments.us/reader033/viewer/2022051400/54c64e824a7959ad7b8b458e/html5/thumbnails/37.jpg)
Aside - How does k nearest neighbors work ?
37/50
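A rough base R sketch of k nearest neighbours used as a classifier: measure the Euclidean distance from a new point to every training point, then take a majority vote among the k closest. The points, class labels and the helper name `knn_predict` are hypothetical.

```r
# Hypothetical training points (two numeric features) with known classes.
train <- matrix(c(1.0, 1.1,
                  1.2, 0.9,
                  0.8, 1.0,
                  3.0, 3.2,
                  3.1, 2.9,
                  2.9, 3.0), ncol = 2, byrow = TRUE)
classes <- c("Bad", "Bad", "Bad", "Good", "Good", "Good")

knn_predict <- function(new_point, k = 3) {
  # Euclidean distance from the new point to every training point.
  d <- sqrt(rowSums(sweep(train, 2, new_point)^2))
  nearest <- order(d)[seq_len(k)]
  # Majority vote among the k nearest neighbours.
  votes <- table(classes[nearest])
  names(votes)[which.max(votes)]
}

near_first_group  <- knn_predict(c(1.1, 1.0))
near_second_group <- knn_predict(c(2.8, 3.1))
print(c(near_first_group, near_second_group))
```

Because the prediction depends entirely on distances, features on very different scales would normally be centred and scaled first.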
![Page 38: Introduction to Machine Learning using R - Dublin R User Group - Oct 2013](https://reader033.fdocuments.us/reader033/viewer/2022051400/54c64e824a7959ad7b8b458e/html5/thumbnails/38.jpg)
Predicting the Quality of Wine - Part 7 - kNN
38/50
![Page 39: Introduction to Machine Learning using R - Dublin R User Group - Oct 2013](https://reader033.fdocuments.us/reader033/viewer/2022051400/54c64e824a7959ad7b8b458e/html5/thumbnails/39.jpg)
Aside - How do neural networks work ?
39/50
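A minimal sketch of the forward pass through a 3-layer feed-forward network (input, hidden 'processing' layer, output). The weights are fixed by hand here purely for illustration; in practice a learning algorithm tunes them. The layer sizes and the sigmoid activation are assumptions.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Hypothetical weights for a 2-input, 3-hidden-unit, 1-output network.
W1 <- matrix(c( 0.5, -0.3,
                0.8,  0.2,
               -0.6,  0.9), nrow = 3, byrow = TRUE)  # hidden x input
b1 <- c(0.1, -0.2, 0.05)                             # hidden biases
W2 <- matrix(c(1.0, -1.5, 0.7), nrow = 1)            # output x hidden
b2 <- 0.3                                            # output bias

forward <- function(x) {
  hidden <- sigmoid(W1 %*% x + b1)         # 'processing' layer activations
  as.numeric(sigmoid(W2 %*% hidden + b2))  # output squashed into (0, 1)
}

out <- forward(c(1, 2))
print(out)
```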
![Page 40: Introduction to Machine Learning using R - Dublin R User Group - Oct 2013](https://reader033.fdocuments.us/reader033/viewer/2022051400/54c64e824a7959ad7b8b458e/html5/thumbnails/40.jpg)
Predicting the Quality of Wine - Part 8 - NNET
40/50
![Page 41: Introduction to Machine Learning using R - Dublin R User Group - Oct 2013](https://reader033.fdocuments.us/reader033/viewer/2022051400/54c64e824a7959ad7b8b458e/html5/thumbnails/41.jpg)
Aside - How do support vector machines work ?
41/50
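One way to see the kernel trick that lets an SVM build nonlinear classifiers: a kernel returns the inner product in a higher-dimensional feature space without ever constructing that space. The sketch below checks this for the degree-2 polynomial kernel on 2-D inputs; the function names and example vectors are hypothetical.

```r
# Explicit feature map for 2-D input whose inner product equals the
# degree-2 polynomial kernel (x . z)^2.
phi <- function(x) c(x[1]^2, sqrt(2) * x[1] * x[2], x[2]^2)

# The kernel computes the same number without building phi().
poly_kernel <- function(x, z) sum(x * z)^2

x <- c(1, 2)
z <- c(3, -1)

explicit <- sum(phi(x) * phi(z))  # inner product in the 3-D feature space
implicit <- poly_kernel(x, z)     # same value from the original 2-D vectors
print(c(explicit = explicit, implicit = implicit))
```

A hyperplane fitted in the feature space of `phi()` corresponds to a curved boundary in the original two dimensions.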
![Page 42: Introduction to Machine Learning using R - Dublin R User Group - Oct 2013](https://reader033.fdocuments.us/reader033/viewer/2022051400/54c64e824a7959ad7b8b458e/html5/thumbnails/42.jpg)
Predicting the Quality of Wine - Part 9 - SVM
42/50
![Page 43: Introduction to Machine Learning using R - Dublin R User Group - Oct 2013](https://reader033.fdocuments.us/reader033/viewer/2022051400/54c64e824a7959ad7b8b458e/html5/thumbnails/43.jpg)
Predicting the Quality of Wine - Part 10 - All Results
43/50
![Page 44: Introduction to Machine Learning using R - Dublin R User Group - Oct 2013](https://reader033.fdocuments.us/reader033/viewer/2022051400/54c64e824a7959ad7b8b458e/html5/thumbnails/44.jpg)
What are they not good for ?
Predicting the Extramarital Affairs
· Fair, R.C. et al (1978), models the possibility of affairs, 9 vars with 601 obs (Training = 481 & Test = 120).
· "A Theory of Extramarital Affairs, Fair, R.C., Journal of Political Economy 1978, 86:45-61".
  - Yes (affairs is >= 1 in last 6 months)
  - No (affairs is < 1 in last 6 months)
##
##  No Yes
##  90  30
44/50
![Page 45: Introduction to Machine Learning using R - Dublin R User Group - Oct 2013](https://reader033.fdocuments.us/reader033/viewer/2022051400/54c64e824a7959ad7b8b458e/html5/thumbnails/45.jpg)
Extramarital Dataset
45/50
![Page 46: Introduction to Machine Learning using R - Dublin R User Group - Oct 2013](https://reader033.fdocuments.us/reader033/viewer/2022051400/54c64e824a7959ad7b8b458e/html5/thumbnails/46.jpg)
Predicting the Extramarital Affairs - RF & NB
Random Forest
##           Reference
## Prediction No Yes
##        No  90  30
##        Yes  0   0
## Accuracy
##     0.75
Naive Bayes
##           Reference
## Prediction No Yes
##        No  88  29
##        Yes  2   1
## Accuracy
##     0.75
46/50
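The two confusion matrices above can be re-checked with a few lines of base R, which also shows why accuracy alone is misleading here. The counts are those transcribed from the slide; the helper names are hypothetical. From the transcribed counts, the second matrix works out to 89/120, about 0.742.

```r
# Rows are predictions, columns are the reference classes, as on the slide.
first <- matrix(c(90, 30,
                   0,  0), nrow = 2, byrow = TRUE,
                dimnames = list(Prediction = c("No", "Yes"),
                                Reference  = c("No", "Yes")))
second <- matrix(c(88, 29,
                    2,  1), nrow = 2, byrow = TRUE,
                 dimnames = dimnames(first))

accuracy    <- function(cm) sum(diag(cm)) / sum(cm)
sensitivity <- function(cm) cm["Yes", "Yes"] / sum(cm[, "Yes"])  # "Yes" as the positive class

# Share of the majority class: what "always predict No" would score.
no_information_rate <- sum(first[, "No"]) / sum(first)

print(c(acc_first = accuracy(first), sens_first = sensitivity(first),
        acc_second = accuracy(second), sens_second = sensitivity(second),
        nir = no_information_rate))
```

The first model matches the no-information rate exactly because it never predicts "Yes", and the second finds only 1 of the 30 "Yes" cases, so neither has learned much about the minority class.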
![Page 47: Introduction to Machine Learning using R - Dublin R User Group - Oct 2013](https://reader033.fdocuments.us/reader033/viewer/2022051400/54c64e824a7959ad7b8b458e/html5/thumbnails/47.jpg)
Other related tools: OpenRefine (formerly Google Refine) / Rattle
47/50
![Page 48: Introduction to Machine Learning using R - Dublin R User Group - Oct 2013](https://reader033.fdocuments.us/reader033/viewer/2022051400/54c64e824a7959ad7b8b458e/html5/thumbnails/48.jpg)
Other related tools: Command Line Utilities
· http://www.gregreda.com/2013/07/15/unix-commands-for-data-science/
  - sed / awk
  - head / tail
  - wc (word count)
  - grep
  - sort / uniq
· http://blog.comsysto.com/2013/04/25/data-analysis-with-the-unix-shell/
  - join
  - Gnuplot
· http://jeroenjanssens.com/2013/09/19/seven-command-line-tools-for-data-science.html
  - http://csvkit.readthedocs.org/en/latest/
  - https://github.com/jehiah/json2csv
  - http://stedolan.github.io/jq/
  - https://github.com/jeroenjanssens/data-science-toolbox/blob/master/sample
  - https://github.com/bitly/data_hacks
  - https://github.com/jeroenjanssens/data-science-toolbox/blob/master/Rio
  - https://github.com/parmentf/xml2json
48/50
![Page 49: Introduction to Machine Learning using R - Dublin R User Group - Oct 2013](https://reader033.fdocuments.us/reader033/viewer/2022051400/54c64e824a7959ad7b8b458e/html5/thumbnails/49.jpg)
An (incomplete) tour of the packages in R
· caret
· party
· rpart
· rpart.plot
· AppliedPredictiveModeling
· randomForest
· corrplot
· arules
· arulesViz
· C50
· pROC
· kernlab
· rattle
· RColorBrewer
· corrgram
· ElemStatLearn
· car
49/50
![Page 50: Introduction to Machine Learning using R - Dublin R User Group - Oct 2013](https://reader033.fdocuments.us/reader033/viewer/2022051400/54c64e824a7959ad7b8b458e/html5/thumbnails/50.jpg)
In Summary
An idea of some of the types of classifiers available in ML.
What a confusion matrix and a ROC plot mean for a classifier and how to interpret them.
An idea of how to test a set of techniques and parameters to help you find the best model for your data.
Slides, Data, Scripts are all on GH: https://github.com/braz/DublinR-ML-treesandforests
50/50