NO ONE KNOWS WHAT IT’S LIKE TO BE THE BAD MAN:
THE DEVELOPMENT PROCESS FOR THE CARET PACKAGE
Max Kuhn, Director of Statistics at Pfizer R&D
Zachary Deane–Mayer, Cognius
Open Data Science Conference, Boston 2015
@opendatasci
www.opendatascience.com
Model Function Consistency
Since there are many modeling packages in R written by different people, there are inconsistencies in how models are specified and predictions are created.
For example, many models have only one method of specifying the model (e.g. formula method only).
> ## only one way here:
> rpart(y ~ ., data = dat)
>
> ## and both ways here:
> lda(y ~ ., data = dat)
>
> lda(x = predictors, grouping = outcome)
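As a small illustration of the two interfaces, the sketch below fits the same lda model both ways on the built-in iris data and checks that the predicted classes agree. The data set and object names are chosen here for illustration only; note that MASS::lda names its outcome argument grouping in the non-formula interface.

```r
library(MASS)  # provides lda(); a recommended package shipped with R

# Illustrative data: four numeric predictors and a three-level outcome
predictors <- as.matrix(iris[, 1:4])
outcome    <- iris$Species

# Formula interface
fit_formula <- lda(Species ~ ., data = iris)

# Non-formula (matrix) interface
fit_xy <- lda(x = predictors, grouping = outcome)

# Predicted classes from each fit on the training data
pred_formula <- predict(fit_formula, iris)$class
pred_xy      <- predict(fit_xy, predictors)$class

# TRUE when both interfaces give the same predicted classes
same_classes <- all(pred_formula == pred_xy)
```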
Kuhn & Deane–Mayer (Pfizer / Cognius) caret 2 / 26
Generating Class Probabilities Using Different Packages
Function              predict Function Syntax
MASS::lda             predict(obj) (no options needed)
stats::glm            predict(obj, type = "response")
gbm::gbm              predict(obj, type = "response", n.trees)
mda::mda              predict(obj, type = "posterior")
rpart::rpart          predict(obj, type = "prob")
RWeka::Weka           predict(obj, type = "probability")
caTools::LogitBoost   predict(obj, type = "raw", nIter)
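To show what a unified interface over these differing predict calls might look like, here is a minimal sketch of an S3 generic that always returns a matrix of class probabilities. The generic class_probs and its methods are hypothetical names invented for this example; this is not how caret itself is implemented, and only glm and lda are covered.

```r
library(MASS)  # provides lda()

# Hypothetical generic: always return a matrix with one column per class
class_probs <- function(obj, newdata, ...) UseMethod("class_probs")

# glm returns a single vector of probabilities for the second class
# (assumes a binomial model with the outcome coded 0/1)
class_probs.glm <- function(obj, newdata, ...) {
  p <- predict(obj, newdata = newdata, type = "response")
  cbind("0" = 1 - p, "1" = p)
}

# lda returns a list; the probabilities live in the "posterior" element
class_probs.lda <- function(obj, newdata, ...) {
  predict(obj, newdata = newdata)$posterior
}

glm_fit <- glm(am ~ mpg, data = mtcars, family = binomial)
lda_fit <- lda(factor(am) ~ mpg, data = mtcars)

# Same call, same output shape, regardless of the underlying package
glm_probs <- class_probs(glm_fit, mtcars)
lda_probs <- class_probs(lda_fit, mtcars)
```

The point of the design is that callers never need to remember which type argument or list element a given package uses.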
The caret Package
The caret package was developed to:
create a unified interface for modeling and prediction (interfaces to 183 models)
streamline model tuning using resampling
provide a variety of “helper” functions and classes for day-to-day model building tasks
increase computational efficiency using parallel processing
First commits within Pfizer: 6/2005, First version on CRAN: 10/2007
Website: http://topepo.github.io/caret/
JSS Paper: http://www.jstatsoft.org/v28/i05/paper
Model List: http://topepo.github.io/caret/bytag.html
Many computing sections in Applied Predictive Modeling
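To make "model tuning using resampling" concrete, here is a minimal base R sketch of the kind of loop that train automates. It is not caret's implementation: the mtcars data, the polynomial degree as the tuning parameter, and the choice of 5 folds are all assumptions made for illustration.

```r
set.seed(42)

k       <- 5                      # number of cross-validation folds (assumed)
degrees <- 1:3                    # candidate tuning values (assumed)

# Randomly assign each row of mtcars to one of k folds
folds <- sample(rep(seq_len(k), length.out = nrow(mtcars)))

# For each candidate value, average the hold-out RMSE across folds
cv_rmse <- sapply(degrees, function(d) {
  fold_rmse <- sapply(seq_len(k), function(i) {
    train_set <- mtcars[folds != i, ]
    test_set  <- mtcars[folds == i, ]
    fit <- lm(mpg ~ poly(wt, d), data = train_set)
    sqrt(mean((test_set$mpg - predict(fit, test_set))^2))
  })
  mean(fold_rmse)
})

# Pick the tuning value with the lowest resampled error
best_degree <- degrees[which.min(cv_rmse)]
```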
Easily Switching Between Models
> library(doMC)
> registerDoMC(cores=10)
>
> ctlr <- trainControl(classProbs = TRUE, method = "repeatedcv")
> gbm_mod <- train(Class ~ ., data = training,
+ method = "gbm",
+ trControl = ctlr,
+ ## gbm argument:
+ verbose = FALSE)
>
> pls_mod <- train(Class ~ ., data = training,
+ method = "pls",
+ tuneLength = 10,
+ preProc = c("center", "scale", "spatialSign"),
+ trControl = ctlr)
>
> pls_search <- gafs(x = training[, -1], y = training$Class,
+ gafsControl = gafsControl(method = "cv", functions = caretGA),
+ ## train options:
+ method = "pls",
+ tuneLength = 10,
+ preProc = c("center", "scale", "spatialSign"),
+ trControl = ctlr)
Package Dependencies
One thing that makes caret different from most other packages is that it uses code from an abnormally large number (> 80) of other packages.
A refresher:
Depends: required for package to function; attached when caret is loaded
Imports: required for package to function; namespace is loaded but not attached
Suggests: the package uses it sometimes; not loaded until needed
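The difference between a loaded namespace and an attached package can be seen directly in a session. The sketch below uses the base splines package purely as a stand-in (it is not claimed to be one of caret's dependencies): loading the namespace makes functions reachable with ::, while library() additionally puts the package on the search path.

```r
# Imports-style access: load the namespace without attaching it
loaded_ok <- requireNamespace("splines", quietly = TRUE)
is_loaded <- isNamespaceLoaded("splines")

# Whether the package is on the search path at this point
is_attached_before <- "package:splines" %in% search()

# Functions in a loaded namespace are reachable with the :: operator
basis <- splines::bs(1:10, df = 4)

# Depends-style access: attach the package to the search path
library(splines)
is_attached_after <- "package:splines" %in% search()
```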
“Simple” Example (38 Nodes)
[Figure: dependency graph of 38 packages centered on caret (e.g. ggplot2, lme4, Rcpp, MASS); edges are colored by dependency type: Imports, Depends, Suggests, Enhances, LinkingTo.]
Package Dependencies
Originally, these were in the Depends field of the DESCRIPTION file, which caused all of them to be loaded with caret.
They were then moved to Suggests, where they stayed for many years, which solved that issue.
However, their formal dependency in the DESCRIPTION file required CRAN to install hundreds of other packages to check caret. The maintainers were not pleased.
Package Dependencies in caret Version 5.17-07
[Figure: network graph of the several hundred packages that caret version 5.17-07 depended on, directly or indirectly; individual node labels omitted.]
Package Dependencies
This problem was somewhat alleviated at the end of 2013 when custom methods were expanded.
Although this functionality had already existed in the package for some time, it was refactored to be more user-friendly.
In the process, much of the modeling code was moved out of caret’s R files and into R objects, eliminating the formal dependencies.
Right now, the total number of dependencies is much smaller (2 Depends, 7 Imports, and 25 Suggests).
This still affects testing though (described later). Also, users may now see a prompt such as:
1 package is needed for this model and is not installed. (gbm).
Would you like to try to install it now?
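The idea of keeping modeling code in R objects, so that the modeling package is only needed when the model is actually used, can be sketched as follows. The list layout (label, library, fit, prob) is loosely inspired by caret's custom method structure, but lda_spec and fit_from_spec are hypothetical names for this example and the details are assumptions, not caret's actual code.

```r
# A model described as a plain R object rather than hard-coded package code
lda_spec <- list(
  label   = "Linear Discriminant Analysis",
  library = "MASS",                                   # needed only at fit time
  fit     = function(x, y, ...) MASS::lda(x, grouping = y, ...),
  prob    = function(fit, newdata) predict(fit, newdata)$posterior
)

# Hypothetical helper: check the required packages lazily, then fit
fit_from_spec <- function(spec, x, y, ...) {
  available <- vapply(spec$library, requireNamespace, logical(1), quietly = TRUE)
  if (!all(available)) {
    stop("package(s) needed for this model and not installed: ",
         paste(spec$library[!available], collapse = ", "))
  }
  spec$fit(x, y, ...)
}

x <- as.matrix(iris[, 1:4])
y <- iris$Species

fit   <- fit_from_spec(lda_spec, x, y)
probs <- lda_spec$prob(fit, x)
```

Because the package name is only data inside the object, it does not have to appear as a formal dependency of the code that stores it.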
38 Dependencies for caret Version 6.0-47
[Figure: network graph of the package dependencies of caret version 6.0-47; edges are colored by dependency type: Imports, Depends, Suggests, Enhances, LinkingTo. Individual node labels omitted.]
The Basic Release Process
1 create a few dynamic man pages
2 use R CMD check --as-cran to ensure passing CRAN tests and unit tests
3 update all packages (and R)
4 run regression tests and evaluate results
5 send to CRAN
6 repeat
7 repeat
8 install passed caret version
9 generate HTML documentation and sync github io branch
10 profit!
Toolbox
RStudio projects: a clean, self-contained package development environment
  No more setwd('/path/to/some/folder') in scripts
  Keep track of project-wide standards, e.g. code formatting
  An RStudio project was the first thing we added after moving the caret repository to github
devtools: automate boring tasks
testthat: automated unit testing
roxygen2: combine source code with documentation
github: source control
devtools
devtools::install builds package and installs it locally
devtools::check:
1 Builds documentation
2 Runs unit tests
3 Builds tarball
4 Runs R CMD check
devtools::release builds package and submits it to CRAN
devtools::install_github enables non-CRAN code distribution or distribution of private packages
devtools::use_travis enables automated unit testing through travis-CI and test coverage reports through coveralls
Testing
Testing for the package occurs in a few different ways:
unit tests via testthat and travis-CI
regression tests for consistency
Automated unit testing via testthat
devtools::use_testthat
Unit tests prevent new features from breaking old code
All functions should have associated tests
Run during R CMD check --as-cran
Can specify that certain tests be skipped on CRAN
caret is slowly adding more testthat tests
github + travis + coveralls
Travis and coveralls are tools for automated unit testing
Travis reports test failures and CRAN errors/warnings
Coveralls reports % of code covered by unit tests
Both automatically comment on the PR itself
Contributor submits code via a pull request
Travis notifies them of test failures
Coveralls notifies them to write tests for new functions
Automated feedback on code quality allows rapid iteration
Code review once unit tests and R CMD check pass
github supports line-by-line comments
Usually several more iterations here
Writing new unit tests
Coveralls identifies un-tested code:
Developer chooses a file with low or no coverage
Developer writes unit tests
library(testthat)
library(caret)

context('Test Data Splitting Functions')

test_that('createTimeSlices', {
  y <- 1:10
  s1 <- createTimeSlices(y, initialWindow = 5, horizon = 1)
  # fixed windows of length 5 over 10 points give 5 training slices
  expect_equal(length(s1$train), 5)
  expect_equal(s1$train$Training1, 1:5)
  expect_equal(s1$test$Testing1, 6)
})
while(coverage<100%){repeat}
Regression Testing
Prior to CRAN release (or whenever required), a comprehensive set of regression tests can be conducted.
All modeling packages are updated to their current CRAN versions.
For each model accessed by train, rfe, and/or sbf, a set of test cases is computed with the production version of caret and the devel version.
First, test cases are evaluated to make sure that nothing has been broken by updated versions of the constituent packages.
Diffs of the model results are computed to assess any differences in caret versions.
This process takes approximately 3 hours to complete using make -j 12 on a Mac Pro.
Regression Testing
Typical tests include:
tests for different resampling methods (LOOCV!)
formula vs. non-formula interface
predictions
variable importance
ancillary classes/functions (e.g. predictors)
In some cases, we need to correlate results between versions due to random numbers without seed control.
I send a lot of emails/pull requests to package maintainers (e.g. class probabilities don’t sum to 1, predictions fail with n = 1, etc.)
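One way the "diff, else correlate" comparison could look is sketched below. The function compare_versions, its tolerance and correlation thresholds, and the simulated predictions are all assumptions for illustration; the actual regression test scripts are not shown in these slides.

```r
# Hypothetical comparison of predictions from two package versions:
# first try a near-exact match, then fall back to a correlation check
compare_versions <- function(prod, devel, tol = 1e-6, min_cor = 0.99) {
  exact <- isTRUE(all.equal(prod, devel, tolerance = tol))
  list(
    exact      = exact,
    correlated = exact || cor(prod, devel) >= min_cor
  )
}

# Simulated predictions: same signal, different uncontrolled random noise
set.seed(1)
truth      <- rnorm(100)
prod_pred  <- truth + rnorm(100, sd = 0.01)
devel_pred <- truth + rnorm(100, sd = 0.01)

same_run  <- compare_versions(prod_pred, prod_pred)    # identical inputs
close_run <- compare_versions(prod_pred, devel_pred)   # differ only by noise
```

Here same_run passes the exact check, while close_run fails it but still passes the correlation fallback.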
Regression Testing
$ R CMD BATCH move_files.R
$ cd ~/tmp/2015_04_19_09__6.0-41/
$ make -j 12 -i
2015-04-19 09:13:44: Starting ada
2015-04-19 09:13:44: Starting AdaBag
2015-04-19 09:13:44: Starting AdaBoost.M1
2015-04-19 09:13:44: Starting ANFIS
:
make: [FH.GBML.RData] Error 1 (ignored)
:
2015-04-19 12:03:52: Finished WM
2015-04-19 12:04:48: Finished xyf
Documentation
caret originally contained four package vignettes with in-depth descriptions of functionality with examples.
However, this added time to R CMD check and was a general pain for CRAN.
Efforts to make the vignettes more computationally efficient (e.g. reducing the number of examples, resamples, etc.) diminished the effectiveness of the documentation.
Documentation
The documentation was moved out of the package and to the github IO page.
These pages are built using knitr whenever a new version is sent to CRAN. Some advantages are:
longer and more relevant examples are available
update schedule is under my control
dynamic documentation (e.g. D3 network graphs, JS tables)
better formatting
It currently takes about 4 hours to create these (using parallel processing when possible).
Required “Optimizations” for CRAN
There is one check that produces a large number of false positive warnings. For example:
> bwplot.diff.resamples <- function (x, data, metric = x$metric, ...) {
+ ## some code
+ plotData <- subset(plotData, Metric %in% metric)
+ ## more code
+ }
will trigger a warning that “bwplot.diff.resamples: no visible binding for global variable ’Metric’”.
The “solution” is to have a file that is sourced first in the package (e.g. aaa.R) with the line
> Metric <- NULL
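A minimal sketch of why the check fires and one way to sidestep it: subset() evaluates Metric inside the data frame (non-standard evaluation), so static code analysis sees what looks like an undefined global. Indexing with dat$Metric selects the same rows without the unbound symbol. Inside a package, utils::globalVariables("Metric") is another documented way to declare such names. The data frame and function names below are hypothetical.

```r
# Illustrative data in the shape the slide's function works with
plot_data <- data.frame(
  Metric = c("Accuracy", "Kappa", "Accuracy"),
  value  = c(0.90, 0.80, 0.85)
)

# Non-standard evaluation: 'Metric' looks like a global to the code checker
filter_nse <- function(dat, metric) {
  subset(dat, Metric %in% metric)
}

# Standard evaluation: the column is referenced explicitly through 'dat'
filter_se <- function(dat, metric) {
  dat[dat$Metric %in% metric, , drop = FALSE]
}

# In package code, the name could instead be declared once, e.g.:
# utils::globalVariables("Metric")

nse_rows <- filter_nse(plot_data, "Accuracy")
se_rows  <- filter_se(plot_data, "Accuracy")
```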
Judging the Severity of Problems
It’s hard to tell which warnings should be ignored and which should not. There is also the issue of inconsistencies related to who is “on duty” when you submit your package.
It is hard to believe that someone took the time for this:
Description Field: "My awesome R package"
R CMD check: "Malformed Description field: should contain one
or more complete sentences."
Description Field: "This package is awesome"
R CMD check: "The Description field should not start with the
package name, ’This package’ or similar."
Description Field: "Blah blah blah."
R CMD check: PASS
Backup Slides
roxygen2
Simplified package documentation
Automates many parts of the documentation process
Special comment block above each function
Name, description, arguments, etc.
Code and documentation are in the same source file
A must-have for new packages, but existing packages are hard to convert
caret has 92 .Rd files
I’m not in a hurry to re-write them all in roxygen2 format