Bootstrap and Cross-Validation for Evaluating Modelling Strategies | R-bloggers


Bootstrap and cross-validation for evaluating modelling strategies
June 4, 2016 | By Peter's stats stuff - R

(This article was first published on Peter's stats stuff - R, and kindly contributed to R-bloggers)

Modelling strategies

I’ve been re-reading Frank Harrell’s Regression Modelling Strategies, a must-read for anyone who ever fits a regression model, although be prepared – depending on your background, you might get 30 pages in and suddenly become convinced you’ve been doing nearly everything wrong before, which can be disturbing.

I wanted to evaluate three simple modelling strategies in dealing with data with many variables. Using data with 54 variables on 1,785 area units from New Zealand’s 2013 census, I’m looking to predict median income on the basis of the other 53 variables. The features are all continuous and are variables like “mean number of bedrooms”, “proportion of individuals with no religion” and “proportion of individuals who are smokers”. Restricting myself to traditional linear regression with a normally distributed response, my three alternative strategies were:

- use all 53 variables;
- eliminate the variables that can be predicted easily from the other variables (defined by having a variance inflation factor greater than ten), one by one until the main collinearity problems are gone; or
- eliminate variables one at a time from the full model on the basis of comparing Akaike’s Information Criterion of models with and without each variable.

None of these is exactly what I would use for real, but they serve the purpose of setting up a competition of strategies that I can test with a variety of model validation techniques.
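To give a flavour of the second strategy, here is a minimal sketch of the variance inflation factor screen. It is illustrative only: it assumes the cleaned au data frame built in the code section further down, and the complete implementation is the model_process_vif() function there.

# Illustrative sketch of the VIF screening rule (not the full strategy)
library(car)
fit <- lm(MedianIncome ~ ., data = au)        # au is defined in the code section below
head(sort(vif(fit), decreasing = TRUE))       # variables with VIF above ten are candidates to drop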

Validating models

The main purpose of the exercise was actually to ensure I had my head around different ways of estimating the validity of a model, loosely definable as how well it would perform at predicting new data. As there is no possibility of new areas in New Zealand from 2013 that need to have their income predicted, the “prediction” is a thought exercise which we need to find a plausible way of simulating. Confidence in hypothetical predictions gives us confidence in the insights the model gives into relationships between variables.

There are many methods of validating models, although I think k-fold cross-validation has market dominance (not with Harrell though, who prefers varieties of the bootstrap). The three validation methods I’ve used for this post are:

1. ‘simple’ bootstrap. This involves creating resamples with replacement from the original data, of the same size; applying the modelling strategy to the resample; using the model to predict the values of the full set of original data; and calculating a goodness of fit statistic (eg either R-squared or root mean squared error) comparing the predicted value to the actual value. Note – Following Efron, Harrell calls this the “simple bootstrap”, but other authors and the useful caret package use “simple bootstrap” to mean the resample model is used to predict the out-of-bag values at each resample point, rather than the full original sample (see the short sketch after this list).

2. ‘enhanced’ bootstrap. This is a little more involved and is basically a method of estimating the ‘optimism’ of the goodness of fit statistic. There’s a nice step by step explanation by thestatsgeek which I won’t try to improve on.

3. repeated 10-fold cross-validation. 10-fold cross-validation involves dividing your data into ten parts, then taking turns to fit the model on 90% of the data and using that model to predict the remaining 10%. The average of the 10 goodness of fit statistics becomes your estimate of the actual goodness of fit. One of the problems with k-fold cross-validation is that it has a high variance, ie doing it different times you get different results based on the luck of your k-way split; so repeated k-fold cross-validation addresses this by performing the whole process a number of times and taking the average.
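To make the terminology point in method 1 concrete, here is a minimal sketch of the two usages. It is illustrative only: dat, y and fit_fun are hypothetical placeholders for a data frame, its response column and a complete modelling strategy, and RMSE is the helper function from caret.

idx      <- sample(1:nrow(dat), replace = TRUE)      # one bootstrap resample, same size as dat
fit      <- fit_fun(dat[idx, ])                      # whole modelling strategy applied to the resample

# Harrell / Efron "simple bootstrap": score the resample model against the full original data
rmse_all <- RMSE(predict(fit, newdata = dat), dat$y)

# caret-style usage of the term: score only against the out-of-bag rows
oob      <- dat[-unique(idx), ]
rmse_oob <- RMSE(predict(fit, newdata = oob), oob$y)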

As the sample sizes get bigger relative to the number of variables in the model, the methods should converge. The bootstrap methods can give over-optimistic estimates of model validity compared to cross-validation; there are various other methods available to address this issue, although none seem to me to provide an all-purpose solution.

It’s critical that the resampling in the process envelopes the entire model-building strategy, not just the final fit. In particular, if the strategy involves variable selection (as two of my candidate strategies do), you have to automate that selection process and run it on each different resample. That’s because variable selection is one of the highest-risk parts of the modelling process. Running cross-validation or the bootstrap on a final model after you’ve eliminated a bunch of variables is missing the point, and will give materially misleading statistics (biased towards things being more “significant” than there really is evidence for). Of course, that doesn’t stop this being common misguided practice.
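As a minimal sketch of the difference, using only the au data and packages that appear in the code section below (illustrative only, not part of the post's pipeline):

# Misleading: select the variables once on all the data, then cross-validate only the final model
model_chosen <- stepAIC(lm(MedianIncome ~ ., data = au), trace = 0)
cv_wrong <- train(formula(model_chosen), data = au, method = "lm",
                  trControl = trainControl(method = "cv"))      # the selection step is outside the resampling

# Sound: the selection is re-run inside every resample, as the code later in this post does
cv_right <- train(MedianIncome ~ ., data = au, method = "lmStepAIC",
                  trControl = trainControl(method = "cv"), trace = 0)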

Results

One nice feature of statistics since the revolution of the 1980s is that the bootstrap helps you conceptualise what might have happened but didn’t. Here’s the root mean squared error from the 100 different bootstrap resamples when the three different modelling strategies (including variable selection) were applied:

[Figure: ‘Simple’ bootstrap of model fit of three different regression strategies: RMSE for each of the 100 bootstrap resamples, by strategy.]

Notice anything? Not only does it seem to be generally a bad idea to drop variables just because they are collinear with others, but occasionally it turns out to be a really bad idea – like in resamples #4, #6 and around thirty others. Those thirty or so spikes are in resamples where random chance led to one of the more important variables being dumped before it had a chance to contribute to the model.

The thing that surprised me here was that the generally maligned stepwise selection strategy performed nearly as well as the full model, judged by the simple bootstrap. That result comes through for the other two validation methods as well:

[Figure: Three different validation methods of three different regression strategies: estimated root mean square error by validation method and modelling strategy.]

In all three validation methods there’s really nothing substantive to choose between the “full model” and “stepwise” strategies, based purely on results.

Reflections

The full model is much easier to fit, interpret, estimate confidence intervals for and perform tests on than a stepwise model. All the standard statistics for a final model chosen by stepwise methods are misleading, and careful recalculations are needed based on elaborate bootstrapping. So the full model wins hands-down as a general strategy in this case.
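For readers who want to see what those recalculations look like in practice, a sketch using Harrell's own rms package would be along these lines (not used elsewhere in this post, shown only as a pointer):

library(rms)
fit_ols <- ols(MedianIncome ~ ., data = au, x = TRUE, y = TRUE)  # rms equivalent of lm()
validate(fit_ols, method = "boot", B = 100, bw = TRUE)           # bw = TRUE repeats the backwards
                                                                  # step-down inside every resample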

With this data, we have a bit of freedom from the generous sample size. If approaching this for real I wouldn’t eliminate any variables unless there were theoretical / subject matter reasons to do so. I have made the mistake of eliminating the collinear variables from this dataset before, but will try not to do it again. The rule of thumb is to have 20 observations for each parameter (this is one of the most asked and most dodged questions in statistics education; see Table 4.1 of Regression Modelling Strategies for this particular answer), which suggests we can have up to 80 parameters with a bit to spare. This gives us around 30 parameters to use for non-linear relationships and/or interactions, which is the direction I might go in a subsequent post. Bearing that in mind, I’m not bothering to report here the actual substantive results (eg which factors are related to income and how); that can wait for a better model down the track.
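A back-of-envelope version of that arithmetic, using the au data frame defined in the code below (the complete-case count is at most the 1,785 area units, so the numbers are approximate):

nrow(au) / 20                     # a little under 90, hence "up to 80 parameters with a bit to spare"
ncol(au) - 1                      # the 53 candidate explanatory variables already in play
nrow(au) / 20 - (ncol(au) - 1)    # roughly 30-odd left over for non-linearity and interactions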

Data and computing

The census data are ultimately from Statistics New Zealand of course, but are tidied up and available in my nzelect R package, which is still very much under development and may change without notice. It’s only available from GitHub at the moment (installation code below).

I do the bootstrapping with the aid of the boot package, which is generally the recommended approach in R. For repeated cross-validation of the two straightforward strategies (full model and stepwise variable selection) I use the caret package, in combination with stepAIC, which is in the Venables and Ripley MASS package. For the more complex strategy that involved dropping variables with high variance inflation factors, I found it easiest to do the repeated cross-validation old-school with my own for loops.

This exercise was a bit complex and I won’t be astonished if someone points out an error. If you see a problem, or have any suggestions or questions, please leave a comment.

Here’s the code:

#===================setup=======================
library(ggplot2)
library(scales)
library(MASS)
library(boot)
library(caret)
library(dplyr)
library(tidyr)
library(directlabels)

set.seed(123)

# install nzelect package that has census data
devtools::install_github("ellisp/nzelect/pkg")
library(nzelect)

# drop the columns with areas' code and name
au <- AreaUnits2013 %>%
  select(-AU2014, -Area_Code_and_Description)


# give meaningful rownames, helpful for some diagnostic plots later
row.names(au) <- AreaUnits2013$Area_Code_and_Description

# remove some repetition from the variable names
names(au) <- gsub("2013", "", names(au))

# restrict to areas with no missing data. If this was any more complicated (eg
# imputation), it would need to be part of the validation resampling too; but
# just dropping them all at the beginning doesn't need to be resampled; the only
# implication is a slightly smaller sample size.
au <- au[complete.cases(au), ]

#==================functions for two of the modelling strategies=====================
# The stepwise variable selection:
model_process_step <- function(the_data){
  model_full <- lm(MedianIncome ~ ., data = the_data)
  model_final <- stepAIC(model_full, direction = "both", trace = 0)
  return(model_final)
}

# The dropping of highly collinear variables, based on Variance Inflation Factor:
model_process_vif <- function(the_data){
  # remove the collinear variables based on vif
  x <- 20
  while(max(x) > 10){
    mod1 <- lm(MedianIncome ~ . , data = the_data)
    x <- sort(car::vif(mod1), decreasing = TRUE)
    the_data <- the_data[ , names(the_data) != names(x)[1]]
    # message(paste("dropping", names(x)[1]))
  }
  model_vif <- lm(MedianIncome ~ ., data = the_data)
  return(model_vif)
}
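# Optional illustrative check (not part of the original pipeline): how many of the 53
# variables survive the VIF screen when it is applied to the full data set?
# vif_screened <- model_process_vif(au)
# length(coef(vif_screened)) - 1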

# The third strategy, full model, is only a one-liner with standard functions
# so I don't need to define a function separately for it.

#==================Different validation methods=================

#------------------simple bootstrap comparison-------------------------
# create a function suitable for boot that will return the goodness of fit
# statistics testing models against the full original sample.
compare <- function(orig_data, i){
  # create the resampled data
  train_data <- orig_data[i, ]
  test_data <- orig_data # ie the full original sample

  # fit the three modelling processes
  model_step <- model_process_step(train_data)
  model_vif <- model_process_vif(train_data)
  model_full <- lm(MedianIncome ~ ., data = train_data)

  # predict the values on the original, unresampled data
  predict_step <- predict(model_step, newdata = test_data)
  predict_vif <- predict(model_vif, newdata = test_data)
  predict_full <- predict(model_full, newdata = test_data)

  # return a vector of 6 summary results
  results <- c(
    step_R2 = R2(predict_step, test_data$MedianIncome),
    vif_R2 = R2(predict_vif, test_data$MedianIncome),
    full_R2 = R2(predict_full, test_data$MedianIncome),
    step_RMSE = RMSE(predict_step, test_data$MedianIncome),
    vif_RMSE = RMSE(predict_vif, test_data$MedianIncome),
    full_RMSE = RMSE(predict_full, test_data$MedianIncome)
  )
  return(results)
}

# perform bootstrap
Repeats <- 100
res <- boot(au, statistic = compare, R = Repeats)
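# Note: boot() returns the statistics in res$t, a matrix with one row per resample and
# one column per element of the vector returned by compare(); columns 1-3 are the
# R-squared values and columns 4-6 are the RMSE values used just below.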

# restructure results for a graphic showing root mean square error, and for
# later combination with the other results. I chose just to focus on RMSE;
# the messages are similar if R squared is used.
RMSE_res <- as.data.frame(res$t[ , 4:6])
names(RMSE_res) <- c("AIC stepwise selection", "Remove collinear variables", "Use all variables")

RMSE_res %>%
  mutate(trial = 1:Repeats) %>%
  gather(variable, value, -trial) %>%
  # re-order levels:
  mutate(variable = factor(variable, levels = c(
    "Remove collinear variables", "AIC stepwise selection", "Use all variables"
  ))) %>%
  ggplot(aes(x = trial, y = value, colour = variable)) +
  geom_line() +
  geom_point() +
  ggtitle("'Simple' bootstrap of model fit of three different regression strategies",
          subtitle = "Predicting areas' median income based on census variables") +
  labs(x = "Resample id (there is no meaning in the order of resamples)\n",
       y = "Root Mean Square Error (higher is worse)\n",
       colour = "Strategy",
       caption = "Data from New Zealand Census 2013")

# store the three "simple bootstrap" RMSE results for later
simple <- apply(RMSE_res, 2, mean)


#-----------------------enhanced (optimism) bootstrap comparison-------------------
# for convenience, estimate the models on the original sample of data
orig_step <- model_process_step(au)
orig_vif <- model_process_vif(au)
orig_full <- lm(MedianIncome ~ ., data = au)

# create a function suitable for boot that will return the optimism estimates for
# statistics testing models against the full original sample.
compare_opt <- function(orig_data, i){
  # create the resampled data
  train_data <- orig_data[i, ]

  # fit the three modelling processes
  model_step <- model_process_step(train_data)
  model_vif <- model_process_vif(train_data)
  model_full <- lm(MedianIncome ~ ., data = train_data)

  # predict the values on the original, unresampled data
  predict_step <- predict(model_step, newdata = orig_data)
  predict_vif <- predict(model_vif, newdata = orig_data)
  predict_full <- predict(model_full, newdata = orig_data)

  # return a vector of 6 summary optimism results
  results <- c(
    step_R2 = R2(fitted(model_step), train_data$MedianIncome) - R2(predict_step, orig_data$MedianIncome),
    vif_R2 = R2(fitted(model_vif), train_data$MedianIncome) - R2(predict_vif, orig_data$MedianIncome),
    full_R2 = R2(fitted(model_full), train_data$MedianIncome) - R2(predict_full, orig_data$MedianIncome),
    step_RMSE = RMSE(fitted(model_step), train_data$MedianIncome) - RMSE(predict_step, orig_data$MedianIncome),
    vif_RMSE = RMSE(fitted(model_vif), train_data$MedianIncome) - RMSE(predict_vif, orig_data$MedianIncome),
    full_RMSE = RMSE(fitted(model_full), train_data$MedianIncome) - RMSE(predict_full, orig_data$MedianIncome)
  )
  return(results)
}

# perform bootstrap
res_opt <- boot(au, statistic = compare_opt, R = Repeats)

# calculate and store the results for later
original <- c(
  RMSE(fitted(orig_step), au$MedianIncome),
  RMSE(fitted(orig_vif), au$MedianIncome),
  RMSE(fitted(orig_full), au$MedianIncome)
)

optimism <- apply(res_opt$t[ , 4:6], 2, mean)
enhanced <- original - optimism
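# Illustrative arithmetic with made-up numbers: the optimism for RMSE is apparent RMSE
# minus test RMSE, so it is usually negative. If the apparent RMSE were, say, 1700 and
# the mean optimism -150, the corrected estimate would be 1700 - (-150) = 1850, ie a
# worse fit than the naive apparent value suggests.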

#------------------repeated cross-validation------------------
# The number of cross validation repeats is the number of bootstrap repeats / 10:
cv_repeat_num <- Repeats / 10

# use caret::train for the two standard models:
the_control <- trainControl(method = "repeatedcv", number = 10, repeats = cv_repeat_num)
cv_full <- train(MedianIncome ~ ., data = au, method = "lm", trControl = the_control)
cv_step <- train(MedianIncome ~ ., data = au, method = "lmStepAIC", trControl = the_control,
                 trace = 0)
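# Note: caret's "lmStepAIC" method re-runs MASS::stepAIC on the training data of every
# resample, so the variable selection itself sits inside the cross-validation, as argued above.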

# do it by hand for the VIF model:
results <- numeric(10 * cv_repeat_num)
for(j in 0:(cv_repeat_num - 1)){
  cv_group <- sample(1:10, nrow(au), replace = TRUE)
  for(i in 1:10){
    train_data <- au[cv_group != i, ]
    test_data <- au[cv_group == i, ]
    results[j * 10 + i] <- RMSE(
      predict(model_process_vif(train_data), newdata = test_data),
      test_data$MedianIncome)
  }
}
cv_vif <- mean(results)

cv_vif_results <- data.frame(
  results = results,
  trial = rep(1:10, cv_repeat_num),
  cv_repeat = rep(1:cv_repeat_num, each = 10)
)

#===============reporting results===============
# combine the three cross-validation results together and combined with
# the bootstrap results from earlier
summary_results <- data.frame(rbind(
  simple,
  enhanced,
  c(mean(cv_step$resample$RMSE),
    cv_vif,
    mean(cv_full$resample$RMSE))
), check.names = FALSE) %>%
  mutate(method = c("Simple bootstrap", "Enhanced bootstrap",
                    paste(cv_repeat_num, "repeats 10-fold\ncross-validation"))) %>%
  gather(variable, value, -method)

# Draw a plot summarising the results
direct.label(
  summary_results %>%
    mutate(variable = factor(variable, levels = c(
      "Use all variables", "AIC stepwise selection", "Remove collinear variables"
    ))) %>%
    ggplot(aes(y = method, x = value, colour = variable)) +
    geom_point(size = 3) +
    labs(x = "Estimated Root Mean Square Error (higher is worse)\n",
         colour = "Modelling\nstrategy",
         y = "Method of estimating model fit\n",
         caption = "Data from New Zealand Census 2013") +
    ggtitle("Three different validation methods of three different regression strategies",
            subtitle = "Predicting areas' median income based on census variables")
)

To leave a comment for the author, please follow the link and comment on their blog: Peter's stats stuff - R.
