MATH& 146 Lesson 42 Section 6.2 Model Selection

Model Selection

The best model is not always the most

complicated. Sometimes including variables that

are not evidently important can actually reduce the

accuracy of predictions.

However, it is not always clear when a variable

should or should not be included in the final model,

so a strategy needs to be developed that will help

us eliminate from the model variables that are less

important.

Model Selection

The model that includes all available explanatory

variables is often referred to as the full model.

Our goal is to assess whether the full model is the

best model. If it isn't, we want to identify a smaller

model that is preferable.

Model Selection

The table below provides a summary of the

regression output for the full model for the Mario

Kart auction data.

              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    36.2110      1.5140    23.92    0.0000
cond_new        5.1306      1.0511     4.88    0.0000
stock_photo     1.0803      1.0568     1.02    0.3085
duration       -0.0268      0.1904    -0.14    0.8882
wheels          7.2852      0.5547    13.13    0.0000
df = 136, adjusted R² = 0.7108

Model Selection

The last column of the table lists the p-values that can be used to assess hypotheses of the following form:

H0: βi = 0, HA: βi ≠ 0, assuming the other explanatory variables are held constant in the model.

Example 1

The coefficient of cond_new has a point estimate of

b1 = 5.13 and a p-value for its corresponding

hypotheses (H0: β1 = 0, HA: β1 ≠ 0) of about zero. How

can this be interpreted?
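
As a rough illustration of where the t value and Pr(>|t|) columns come from, the sketch below divides each estimate by its standard error and converts the result to a two-sided p-value with 136 degrees of freedom. The helper names (t_pdf, two_sided_p) and the Simpson's-rule integration are assumptions made for this example, not part of the lesson; statistical software would normally do this step for you.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_p(t, df, steps=2000):
    """Two-sided p-value P(|T| >= |t|), using Simpson's rule on [0, |t|]."""
    a = abs(t)
    if a == 0:
        return 1.0
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3  # P(0 <= T <= |t|)
    return max(0.0, 1 - 2 * central_area)

# (estimate, standard error) copied from the full-model table; df = 136.
full_model = {
    "cond_new": (5.1306, 1.0511),
    "stock_photo": (1.0803, 1.0568),
    "duration": (-0.0268, 0.1904),
    "wheels": (7.2852, 0.5547),
}

results = {}
for name, (estimate, std_error) in full_model.items():
    t = estimate / std_error  # test statistic for H0: coefficient = 0
    results[name] = (t, two_sided_p(t, 136))
    print(f"{name:12s} t = {t:6.2f}  p = {results[name][1]:.4f}")
```

The printed values should agree with the table up to rounding: a p-value of about zero for cond_new means the data give strong evidence that its coefficient is not zero when the other variables are in the model.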

Example 2

Identify the p-values for each variable in the model. Is

there strong evidence supporting the connection of

these variables with the total price in the model?

Model Selection

There is no statistically significant evidence that either the stock_photo or the duration variable contributes meaningfully to the model. Next we consider common strategies for pruning such variables from a model.

Model Selection

Two common strategies for adding or removing

variables in a multiple regression model are called

backward-elimination and forward-selection.

These techniques are often referred to as stepwise

model selection strategies, because they add or delete

one variable at a time as they "step" through the

candidate predictors.

Backward-Elimination

The backward-elimination strategy starts with the

model that includes all potential predictor variables.

Variables are eliminated one-at-a-time from the model

until only variables with statistically significant p-values

remain.

The strategy within each elimination step is to drop the

variable with the largest p-value, refit the model, and

reassess the inclusion of all variables.

Example 3

Results corresponding to the full model for the Mario

Kart data are shown below. How should we proceed

under the backward-elimination strategy?

              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    36.2110      1.5140    23.92    0.0000
cond_new        5.1306      1.0511     4.88    0.0000
stock_photo     1.0803      1.0568     1.02    0.3085
duration       -0.0268      0.1904    -0.14    0.8882
wheels          7.2852      0.5547    13.13    0.0000
df = 136, adjusted R² = 0.7108

Example 4

The variable duration has been removed and a new

model fitted. Now how should we proceed under the

backward-elimination strategy?

              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    36.0483      0.9745    36.99    0.0000
cond_new        5.1763      0.9961     5.20    0.0000
stock_photo     1.1177      1.0192     1.10    0.2747
wheels          7.2984      0.5448    13.40    0.0000
df = 137, adjusted R² = 0.7128

Backward-Elimination

Notice that the p-value for stock_photo changed a little from the full model (0.3085) to the model that did not include the duration variable (0.2747).

It is common for the p-value of one variable to change, due to collinearity, after a different variable is eliminated. This fluctuation emphasizes the importance of refitting the model after each elimination step. The p-values tend to change dramatically when the eliminated variable is highly correlated with another variable in the model.

Backward-Elimination

In the latest model, fitted after also removing stock_photo, the two remaining predictors have statistically significant coefficients with p-values of about zero.

Since no remaining variable has a non-significant p-value, there is nothing left to eliminate, and we stop.

              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    36.7849      0.7066    52.06    0.0000
cond_new        5.5848      0.9245     6.04    0.0000
wheels          7.2328      0.5419    13.35    0.0000
df = 138, adjusted R² = 0.7124
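
The backward-elimination steps just traced can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the fit function is a hypothetical stand-in that looks up the p-values printed in the three tables above instead of refitting a regression, and the 0.05 cutoff is assumed to match the level used later in the lesson.

```python
# p-values copied from the three fitted models shown above, keyed by the
# set of predictors in each model. A real analysis would refit the
# regression at every step instead of using a lookup table.
P_VALUES = {
    frozenset({"cond_new", "stock_photo", "duration", "wheels"}): {
        "cond_new": 0.0000, "stock_photo": 0.3085,
        "duration": 0.8882, "wheels": 0.0000,
    },
    frozenset({"cond_new", "stock_photo", "wheels"}): {
        "cond_new": 0.0000, "stock_photo": 0.2747, "wheels": 0.0000,
    },
    frozenset({"cond_new", "wheels"}): {
        "cond_new": 0.0000, "wheels": 0.0000,
    },
}

def fit(predictors):
    """Hypothetical stand-in for refitting: return the tabulated p-values."""
    return P_VALUES[frozenset(predictors)]

def backward_eliminate(predictors, alpha=0.05):
    """Drop the predictor with the largest p-value until all are below alpha."""
    current = list(predictors)
    dropped = []
    while current:
        p_values = fit(current)  # refit after every elimination
        worst = max(current, key=lambda name: p_values[name])
        if p_values[worst] < alpha:
            break  # every remaining predictor is significant
        current.remove(worst)
        dropped.append(worst)
    return current, dropped

final, dropped = backward_eliminate(
    ["cond_new", "stock_photo", "duration", "wheels"]
)
print("dropped, in order:", dropped)
print("final model:", final)
```

With the tabulated p-values, the sketch drops duration first, then stock_photo, and stops with cond_new and wheels, matching the walk-through.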

Example 5

a) Write out our final model for predicting the total auction price.
b) What is the expected price for a new Mario Kart game that included two wheels?
c) What is the expected price for a used Mario Kart game that did not include any wheels?
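
One way to check answers to this example is to plug values into the fitted equation, total price = 36.7849 + 5.5848 × cond_new + 7.2328 × wheels. The sketch below assumes cond_new is coded 1 for a new game and 0 for a used one, which the variable name suggests but the slides do not state; predicted_price is a name made up for this illustration.

```python
def predicted_price(cond_new, wheels):
    """Predicted total price from the final model's rounded coefficients.

    Assumes cond_new is 1 for a new game and 0 for a used game.
    """
    return 36.7849 + 5.5848 * cond_new + 7.2328 * wheels

new_two_wheels = predicted_price(cond_new=1, wheels=2)
used_no_wheels = predicted_price(cond_new=0, wheels=0)
print(f"new, two wheels: {new_two_wheels:.2f}")
print(f"used, no wheels: {used_no_wheels:.2f}")
```

Under that coding, a used game with no wheels is predicted at the intercept alone, and each additional wheel adds about 7.23 to the predicted price.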

Forward-Selection

The forward-selection strategy is the reverse of the

backward-elimination technique.

Instead of eliminating variables one-at-a-time, we

add variables one-at-a-time until we cannot find

any variables that present strong evidence of their

importance in the model.

Forward-Selection

For the Mario Kart data, we would start with (1) the

model that includes no variables.

Model 1       Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    47.4319      0.7675    61.80    0.0000
df = 140, adjusted R² = 0

Forward-Selection

Now we fit each of the possible models with just one

variable. That is, we fit (2) the model including just the

cond_new predictor, then (3) the model including just

the stock_photo variable, then (4) the model with just

duration, and (5) the model with just wheels.

Each of the four models (yes, we fit four models!)

provides a p-value for the coefficient of the predictor

variable.

Forward-Selection

Model 2       Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    42.8711      0.8140    52.67    0.0000
cond_new       10.8996      1.2583     8.66    0.0000
df = 139, adjusted R² = 0.3459

Model 3       Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    44.3272      1.4935    29.68    0.0000
stock_photo     4.1692      1.7307     2.41    0.0173
df = 139, adjusted R² = 0.0332

Forward-Selection

Model 4       Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    52.3736      1.2608    41.54    0.0000
duration       -1.3172      0.2769    -4.76    0.0000
df = 139, adjusted R² = 0.1338

Model 5       Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    37.5020      0.7804    48.06    0.0000
wheels          8.6427      0.5479    15.77    0.0000
df = 139, adjusted R² = 0.6390

Forward-Selection

Out of these four variables, the wheels variable had the smallest p-value and the largest test statistic (in absolute value). Since its p-value is less than 0.05 (the p-value was smaller than 2 × 10^-16), we add the Wii wheels variable to the model.

Once a variable is added in forward-selection, it will be included in all models considered afterward as well as in the final model.

Forward-Selection

Since we successfully found a first variable to add, we

consider adding another. We fit three new models: (6)

the model including just the cond_new and wheels

variables, (7) the model including just the stock_photo

and wheels variables, and (8) the model including only

the duration and wheels variables.

Forward-Selection

Model 6       Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    36.7849      0.7066    52.06    0.0000
wheels          7.2328      0.5419    13.35    0.0000
cond_new        5.5848      0.9245     6.04    0.0000
df = 138, adjusted R² = 0.7124

Model 7       Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    35.3144      1.0512    33.60    0.0000
wheels          8.5384      0.5339    15.99    0.0000
stock_photo     3.0985      1.0305     3.08    0.0031
df = 138, adjusted R² = 0.6587

Forward-Selection

Model 8       Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    39.8029      1.1806    33.71    0.0000
wheels          8.1844      0.5664    14.45    0.0000
duration       -0.4729      0.1848    -2.56    0.0116
df = 138, adjusted R² = 0.6528

Forward-Selection

Of these models, the model with the wheels and cond_new variables had the lowest p-value and highest test statistic for its new variable (the p-value corresponding to cond_new was 1.4 × 10^-8).

Because this p-value is below 0.05, we add the cond_new variable to the model. Now the final model is guaranteed to include both the cond_new and wheels variables.

Forward-Selection

We now repeat the process a third time, fitting two new

models: (9) the model including the stock_photo,

cond_new, and wheels variables and (10) the model

including the duration, cond_new, and wheels

variables.

Forward-Selection

Model 9       Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    36.0483      0.9745    36.99    0.0000
wheels          7.2984      0.5448    13.40    0.0000
cond_new        5.1763      0.9961     5.20    0.0000
stock_photo     1.1177      1.0192     1.10    0.2747
df = 137, adjusted R² = 0.7128

Model 10      Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    37.1750      1.1846    31.38    0.0000
wheels          7.2018      0.5488    13.12    0.0000
cond_new        5.4170      1.0133     5.35    0.0000
duration       -0.0758      0.1843    -0.41    0.6817
df = 137, adjusted R² = 0.7107

Forward-Selection

The p-value corresponding to stock_photo in Model 9 (0.2747) was smaller than the p-value corresponding to duration in Model 10 (0.6817).

However, since this smaller p-value was not below 0.05, there was not strong evidence that stock_photo should be included in the best model. Therefore, neither variable is added and we are finished.
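
The forward-selection walk-through can be sketched the same way. As an assumption for this illustration, the lookup table below stands in for actually fitting Models 2 through 10: it holds the |t| value and p-value of the newly added variable in each model, copied from the tables above. Because several p-values print as 0.0000, the sketch ranks candidates by |t|, which picks the same variable as the smallest p-value when the candidate models have the same degrees of freedom.

```python
# (|t|, p-value) of the candidate variable in Models 2-10, keyed by
# (variables already in the model, candidate variable). Copied from the
# tables above; a real analysis would fit each model instead.
CANDIDATES = {
    (frozenset(), "cond_new"): (8.66, 0.0000),
    (frozenset(), "stock_photo"): (2.41, 0.0173),
    (frozenset(), "duration"): (4.76, 0.0000),
    (frozenset(), "wheels"): (15.77, 0.0000),
    (frozenset({"wheels"}), "cond_new"): (6.04, 0.0000),
    (frozenset({"wheels"}), "stock_photo"): (3.08, 0.0031),
    (frozenset({"wheels"}), "duration"): (2.56, 0.0116),
    (frozenset({"wheels", "cond_new"}), "stock_photo"): (1.10, 0.2747),
    (frozenset({"wheels", "cond_new"}), "duration"): (0.41, 0.6817),
}

def forward_select(variables, alpha=0.05):
    """Add the strongest candidate each round until none is below alpha."""
    selected = []
    remaining = list(variables)
    while remaining:
        base = frozenset(selected)
        # Strongest candidate: largest |t| for its own coefficient.
        best = max(remaining, key=lambda v: CANDIDATES[(base, v)][0])
        if CANDIDATES[(base, best)][1] >= alpha:
            break  # no remaining variable shows strong evidence
        selected.append(best)
        remaining.remove(best)
    return selected

chosen = forward_select(["cond_new", "stock_photo", "duration", "wheels"])
print("variables added, in order:", chosen)
```

With these values the sketch adds wheels, then cond_new, and stops when stock_photo (p-value 0.2747) fails the 0.05 cutoff, which is the same final model that backward-elimination reached.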

Model Selection Summary

The backward-elimination strategy begins with the

largest model and eliminates variables one-by-one

until we are satisfied that all remaining variables are

important to the model.

The forward-selection strategy starts with no variables

included in the model, then it adds in variables

according to their importance until no other important

variables are found.

Model Selection Summary

It is worth noting that there is no guarantee that the

backward-elimination and forward-selection strategies

will arrive at the same final model.

It is also worth noting that there is no guarantee that either strategy will arrive at the overall best model, especially when there are hundreds, thousands, or even millions of variables to check.

Model Selection Summary

For 50 variables, even if you could check 1,000,000 models a second, it would take you about 36 years to look at all 2^50 possible models.

For 100 variables at 1 trillion models a second, it would still take about 40 billion years to check all 2^100 possible models. Good luck!
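
These figures follow from counting subsets: each of k candidate variables is either in a model or out of it, giving 2^k possible models. The short sketch below redoes the arithmetic; the function name and the 365.25-day year are choices made for this illustration.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60

def years_to_check_all(num_variables, models_per_second):
    """Years needed to fit every subset of the candidate variables."""
    num_models = 2 ** num_variables  # each variable is either in or out
    return num_models / models_per_second / SECONDS_PER_YEAR

years_50 = years_to_check_all(50, 1_000_000)
years_100 = years_to_check_all(100, 1e12)
print(f"50 variables:  about {years_50:.0f} years")
print(f"100 variables: about {years_100:.2e} years")
```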

It is often impossible to consider all possible models, so the best model cannot be guaranteed. Fortunately, backward-elimination and forward-selection usually give reasonably good models to use.

Model Selection Summary

It is generally acceptable to use just one strategy. However, if the backward-elimination and forward-selection strategies are both tried and they arrive at different models, choose the model with the larger adjusted R² as a tie-breaker.
