Multiple Regression - Assessing Collinearity
MATH 369
Multiple Regression
How the explanatory variables relate to each
other is fundamental to understanding their
relationship to the response variable.
Relationships between the explanatory
variables are called collinearity. This can be
mild, or severe.
An advantage of multiple regression
Without collinearity, you get essentially the same estimates of the effects of the X variables on Y whether you run separate simple regressions or one multiple regression.
However, with a moderate degree of collinearity, the multiple regression allows you to separate the relative effects of the X variables, while the simple linear regressions do not.
Note that with severe collinearity everything fails.
Multicollinearity isn’t tragic
In most practical datasets there will be some degree of multicollinearity. If the degree of multicollinearity isn’t too bad (more on assessment in a few slides) then it can be safely ignored.
If you have serious multicollinearity, then your goals must be considered and there are various options.
In what follows, we first focus on how to assess multicollinearity, then what to do about it should it be found to be a problem.
Collinearity, effect on p-values
The p-value in a regression provides a test of whether that variable is statistically significantly related to Y, after accounting for everything else.
You will get different p-values for the same variables in different regressions as you add/remove other explanatory variables.
Collinearity, effect on p-values contd
A variable can be significantly related to Y by itself, but not be significantly related to Y after accounting for several other variables. In that case, the variable is viewed as redundant.
If all the X variables are correlated, it is possible that ALL the variables are insignificant, even if each is significantly related to Y by itself.
Collinearity, effect on coefficients
A coefficient in a regression can be interpreted as the “amount of change in Y for every unit increase in X, given all other variables in the regression remain constant”.
Collinearity, effect on coefficients contd
Coefficients of individual explanatory variables can change depending on what other explanatory variables are present.
Increased standard error of estimates of the β’s (decreased reliability)
Assessing Multicollinearity
Remember that multicollinearity refers to
associations between the explanatory
variables.
We discuss three methods for assessing
multicollinearity in this course – correlation
matrices, tolerance and variance inflation
factors.
Correlation Matrices
A correlation matrix is simply a table indicating the correlations between each pair of explanatory variables.
If you see values close to 1 or -1 that indicates variables are strongly associated with each other and you may have multicollinearity problems.
If you see many correlations all greater in absolute value than 0.8, you may also have problems.
Example: a correlation matrix showing mild, not serious, collinearity.
Pearson Correlation Coefficients, N = 507
Prob > |r| under H0: Rho=0

               ankle_diam  waist_girth  forearm_girth   height
ankle_diam        1.00000      0.63697        0.73525  0.68645
                               <.0001         <.0001   <.0001
waist_girth       0.63697      1.00000        0.78079  0.55296
                  <.0001                      <.0001   <.0001
forearm_girth     0.73525      0.78079        1.00000  0.65502
                  <.0001       <.0001                  <.0001
height            0.68645      0.55296        0.65502  1.00000
                  <.0001       <.0001         <.0001
Ignore the p-values when assessing multicollinearity.
None of these correlations is extreme.
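The SAS output above can be mimicked with a small sketch in Python. Everything here is made up for illustration: the variable names and data are invented, with waist constructed to be moderately related to height and forearm unrelated to both.

```python
import numpy as np

# Made-up data: three body measurements for 200 subjects.
rng = np.random.default_rng(0)
height = rng.normal(170, 10, size=200)
waist = 0.5 * height + rng.normal(0, 5, size=200)   # moderately related to height
forearm = rng.normal(25, 2, size=200)               # unrelated to the others

X = np.column_stack([height, waist, forearm])
corr = np.corrcoef(X, rowvar=False)   # 3x3 matrix of pairwise correlations

print(np.round(corr, 2))
# The (height, waist) entry is well away from 0; the forearm entries are near 0.
```

Scanning this matrix for entries large in absolute value is exactly the "correlation matrix" check described above.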
Dis/advantages of correlation matrices
Disadvantage - correlation matrices only
work with two variables at a time. Thus we
can only see pairwise relationships. If a more
complicated relationship exists, the
correlation matrix won’t find it.
Advantage – working with two variables at a
time is simplest and most interpretable. If you
see a high correlation, you can make a
simple scatterplot of the two variables and
see their relationship.
Measure of Tolerance
Tolerance is a measure of collinearity reported by most statistical programs, such as SPSS. A variable's tolerance is 1 - R², where R² comes from regressing that variable on the other explanatory variables.
A small tolerance value indicates that the
variable under consideration is almost a
perfect linear combination of the independent
variables already in the equation and that it
should not be added to the regression
equation.
Tolerance Contd
All variables involved in the linear
relationship will have a small tolerance.
Some suggest that a tolerance value less
than 0.1 should be investigated further.
If a low tolerance value is accompanied by
large standard errors and nonsignificance,
multicollinearity may be an issue.
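As a sketch, the tolerance of one variable can be computed from the auxiliary regression of that variable on the others. The data below are made up, with x3 built to be nearly a linear combination of x1 and x2, so its tolerance should be close to 0.

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
x3 = x1 + x2 + rng.normal(scale=0.1, size=100)   # nearly a linear combination

def tolerance(X, j):
    """Tolerance of column j: 1 - R^2 from regressing X[:, j] on the rest."""
    y = X[:, j]
    A = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])  # intercept + others
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    ss_res = resid @ resid
    ss_tot = ((y - y.mean()) ** 2).sum()
    return ss_res / ss_tot                        # = 1 - R^2

X = np.column_stack([x1, x2, x3])
print(tolerance(X, 2))   # near 0: x3 is almost redundant given x1 and x2
```

A tolerance this far below the 0.1 rule of thumb would flag x3 for further investigation.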
Variance Inflation Factors (VIFs)
The Variance Inflation Factor (VIF) measures the impact of collinearity among the variables in a regression model. The VIF is 1/Tolerance; it is always greater than or equal to 1.
Variance inflation factors measure the relationship of all the variables simultaneously, thus they avoid the “two at a time” disadvantage of correlation matrices.
VIF contd
There is a VIF for each variable.
Loosely, the VIF is based on regressing each variable on the remaining variables. If the remaining variables can predict the variable of interest, then that variable has a high VIF.
Calculating VIF for Variable Xj
The formula:

VIF_j = 1 / (1 - R_j²)

where R_j² is the R² from regressing X_j on the other explanatory variables.
If VIF_j ≥ 10 then there is a problem with multicollinearity.
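A minimal sketch of this formula, assuming made-up data in which two variables (here called waist and chest) are strongly related and a third (height) is unrelated:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
waist = rng.normal(size=n)
chest = 0.9 * waist + rng.normal(scale=0.3, size=n)   # strongly tied to waist
height = rng.normal(size=n)                           # unrelated

def vifs(X):
    """VIF of every column of X: 1 / (1 - R_j^2) from the auxiliary regressions."""
    out = []
    for j in range(X.shape[1]):
        y = X[:, j]
        A = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        r2 = 1 - ((y - A @ beta) ** 2).sum() / ((y - y.mean()) ** 2).sum()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

X = np.column_stack([waist, chest, height])
print(np.round(vifs(X), 2))   # waist and chest inflated, height close to 1
```

Since 1/Tolerance = VIF, this is the same auxiliary-regression calculation as for tolerance, just inverted.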
Using VIFs
Ideally, VIFs are 1, indicating the variable
provides completely independent
information.
In practice, all VIFs are greater than 1. There
is typically some degree of multicollinearity.
VIFs are considered “bad” if they exceed 10.
Illustrating VIFs
The variance inflation factors in this example
are all quite moderate (the largest is 3.58).
Thus, there is no evidence of serious
collinearity in our example.
Variable       Label          DF  Parameter   Standard   t Value  Pr > |t|  Variance
                                  Estimate    Error                         Inflation
Intercept      Intercept       1  -71.56758    3.27692    -21.84    <.0001          0
ankle_diam     ankle diam      1    0.21323    0.22842      0.93    0.3510    2.63183
waist_girth    waist girth     1    0.66749    0.02582     25.85    <.0001    2.62210
forearm_girth  forearm girth   1    1.34978    0.11740     11.50    <.0001    3.58026
height         height          1    0.30009    0.02697     11.13    <.0001    2.08710
A more difficult example
Suppose in a study the variables are: chest girth, shoulder girth, waist girth, and bicep girth (all upper body measurements, which we might expect to be more strongly associated with each other).
Correlation matrix
Pearson Correlation Coefficients, N = 507
Prob > |r| under H0: Rho=0

                waist_girth  bicep_girth  chest_girth  shoulder_girth
waist_girth         1.00000      0.80470      0.88380         0.82345
                                  <.0001       <.0001          <.0001
bicep_girth         0.80470      1.00000      0.90818         0.89519
                     <.0001                    <.0001          <.0001
chest_girth         0.88380      0.90818      1.00000         0.92719
                     <.0001       <.0001                       <.0001
shoulder_girth      0.82345      0.89519      0.92719         1.00000
                     <.0001       <.0001       <.0001
Many of these correlations are near or above 0.9, indicating
problems with multicollinearity
VIFs for second example
The VIFs here are larger than in the first example, with the largest being 12.47, which is above our rule of thumb of 10. Again, this is evidence that multicollinearity is an issue.
Parameter Estimates

Variable        Label           DF  Parameter   Standard   t Value  Pr > |t|  Variance
                                    Estimate    Error                         Inflation
Intercept       Intercept        1  -36.71600    2.39942    -15.30    <.0001         0
waist_girth     waist girth      1    0.59868    0.03952     15.15    <.0001   4.57075
chest_girth     chest girth      1    0.07716    0.07169      1.08    0.2823  12.47291
bicep_girth     bicep girth      1    0.66512    0.12170      5.47    <.0001   6.44810
shoulder_girth  shoulder girth   1    0.29432    0.05568      5.29    <.0001   8.05461
So multicollinearity is an issue – what do you do about it?
Remember, if the multicollinearity is present
but not excessive (no high correlations, no
VIFs above 10), you can ignore it.
If multicollinearity is a big issue, your goal
becomes extremely important.
If your goal is prediction…
If your main goal is prediction (using the available explanatory variables to predict the response), then you can safely ignore the multicollinearity.
If you have multiple variables that provide the same information, using any one of them is sufficient.
If interest centers on the real relationships between the variables…
When you have serious multicollinearity, the
variables are sufficiently redundant that you
cannot reliably distinguish their effects.
There is no simple solution here.
Sometimes a bigger experiment will help, but
often the variables are so intertwined you
cannot distinguish them without a prohibitive
sample size.
If interest centers on the real relationships between the variables…
Often, while you may not be able to
disentangle your explanatory variables, you
may be comfortable with a “composite” (my
term) variable.
For example, if in a sociological study you
find the variables “father’s education level”
and “mother’s education level” are strongly
related, it may be sufficient to simply use one
variable, “parent’s education level”, which is
some function of the two parents.
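A minimal sketch of such a composite, assuming education is coded in years of schooling (all values made up):

```python
# Hypothetical data: education in years of schooling for four families.
fathers_edu = [12, 16, 10, 18]
mothers_edu = [14, 16, 12, 16]

# One simple choice of composite: the average of the two strongly related variables.
parents_edu = [(f + m) / 2 for f, m in zip(fathers_edu, mothers_edu)]
print(parents_edu)   # → [13.0, 16.0, 11.0, 17.0]
```

The composite would replace both original variables in the regression, removing that source of multicollinearity at the cost of no longer distinguishing the two parents.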
Finally…
In many situations you get to set some of the
explanatory variables (in engineering studies
you often get to select almost all of them, in
medical studies you can set drug dosage and
some other variables).
Make sure when you set these experiments up that you do not "install" multicollinearity.
A review, and what to do…
Ok, let’s put some of this together
You have several explanatory variables and a single response Y.
You run the multiple regression first and check the residuals and collinearity measures.
A review, and what to do…
If the residuals look bad, deal with those first (you may need a transformation).
Now suppose you have decent residuals
With decent residuals…
Check the collinearity measures. If these are
problematic (any VIF above 10 or high correlations),
then you must start removing or combining variables
before you can trust the output. This tends to be a
substantive, not a statistical, task.
The variables with the highest VIFs can be targeted for deletion first, if that makes substantive sense; they are the "most redundant".
With decent residuals…
After you do anything, remember to check
the residuals again.
Now suppose you have decent residuals and
collinearity measures…
Look at the F-statistic and its p-value. If it is not
significant, stop. Otherwise continue to the
next slide…
Backward selection – a simple model selection technique.
“Model selection” refers to determining which of the
explanatory variables should be placed in a final model.
Backward selection begins by fitting the “full model”,
where all the explanatory variables are present. If the F-
statistic is not significant, stop and declare no variable
meaningful, otherwise continue below.
Backward selection – a simple model selection technique.
Check the resulting p-values. If any are NOT
significant, remove the variable with the
highest p-value and rerun the model (never
remove the intercept).
Continue removing insignificant variables
until no insignificant variables remain. The
result is your final model.
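The procedure above can be sketched as follows. Everything is made up: the data are simulated, and |t| ≥ 2 is used as a rough stand-in for "significant at the 5% level" (the exact cutoff would come from the t distribution).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
names = ["ankle", "waist", "forearm", "height"]
X = rng.normal(size=(n, 4))
# "ankle" (column 0) has no real effect on y; the rest do.
y = 0.7 * X[:, 1] + 1.3 * X[:, 2] + 0.3 * X[:, 3] + rng.normal(size=n)

def t_stats(X, y):
    """OLS t statistics for each column of X (intercept fit but not reported)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sigma2 = resid @ resid / (len(y) - A.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(A.T @ A)))
    return (beta / se)[1:]          # drop the intercept's t statistic

keep = list(range(X.shape[1]))
while keep:
    t = t_stats(X[:, keep], y)
    worst = np.argmin(np.abs(t))
    if abs(t[worst]) >= 2:          # all remaining variables significant: stop
        break
    keep.pop(worst)                 # drop the least significant variable and refit

print([names[j] for j in keep])    # the noise variable "ankle" is usually dropped
```

Note the intercept column is always kept, matching the rule above that the intercept is never removed.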
Alternatives to backward selection
There are many other model selection
techniques. For example, there is forward
selection, which builds the model up from an
empty model, and stepwise regression, which
alternates forward and backward steps.
Backward selection is the simplest.
Many modern techniques are based on
heavy computation and involve running every
possible model and comparing the results,
potentially of several thousand regressions.
Summary – backward selection is simple and
practical.
Backward selection in our example
In regressing weight on ankle diameter, waist girth,
forearm girth, and height, the full model resulted in a
significant F-statistic and the individual p-values shown below.
There is one non-significant p-value, for ankle
diameter, so we remove ankle diameter from the
model and rerun.
Variable       Label          DF  Parameter   Standard   t Value  Pr > |t|
                                  Estimate    Error
Intercept      Intercept       1  -71.56758    3.27692    -21.84    <.0001
ankle_diam     ankle diam      1    0.21323    0.22842      0.93    0.3510
waist_girth    waist girth     1    0.66749    0.02582     25.85    <.0001
forearm_girth  forearm girth   1    1.34978    0.11740     11.50    <.0001
height         height          1    0.30009    0.02697     11.13    <.0001
Backward selection in our example continued.
The next model with waist girth, forearm
girth, and height results in all significant p-
values. We may stop and declare this our
final model.
Variable       Label          DF  Parameter   Standard   t Value  Pr > |t|
                                  Estimate    Error
Intercept      Intercept       1  -71.53234    3.27628    -21.83    <.0001
waist_girth    waist girth     1    0.67048    0.02562     26.17    <.0001
forearm_girth  forearm girth   1    1.38825    0.10992     12.63    <.0001
height         height          1    0.30998    0.02480     12.50    <.0001
QUESTIONS???