Multiple Regression - Assessing Collinearity
MATH 369
Multiple Regression
How the explanatory variables relate to each
other is fundamental to understanding their
relationship to the response variable.
Relationships between the explanatory
variables are called collinearity. This can be
mild, or severe.
An advantage of multiple regression
Without collinearity, you get essentially the same estimates of the effects of the X variables on Y whether you run separate simple regressions or one multiple regression.
However, with a moderate degree of collinearity, the multiple regression allows you to separate the relative effects of the X variables, while the simple linear regressions do not.
Note that with severe collinearity everything fails.
Multicollinearity isn’t tragic
In most practical datasets there will be some degree of multicollinearity. If the degree of multicollinearity isn’t too bad (more on assessment in a few slides) then it can be safely ignored.
If you have serious multicollinearity, then your goals must be considered and there are various options.
In what follows, we first focus on how to assess multicollinearity, then what to do about it should it be found to be a problem.
Collinearity, effect on p-values
The p-value in a regression provides a test of whether that variable is statistically significantly related to Y, after accounting for everything else.
You will get different p-values for the same variables in different regressions as you add/remove other explanatory variables.
Collinearity, effect on p-values contd
A variable can be significantly related to Y by itself, but not be significantly related to Y after accounting for several other variables. In that case, the variable is viewed as redundant.
If all the X variables are correlated, it is possible that ALL the variables are insignificant, even if each is significantly related to Y by itself.
Collinearity, effect on coefficients
A coefficient in a regression can be interpreted as the “amount of change in Y for every unit increase in X, given all other variables in the regression remain constant”.
Collinearity, effect on coefficients contd
Coefficients of individual explanatory variables can change depending on what other explanatory variables are present.
Increased standard error of estimates of the β’s (decreased reliability)
Assessing Multicollinearity
Remember that multicollinearity refers to
associations between the explanatory
variables.
We discuss three methods for assessing
multicollinearity in this course – correlation
matrices, tolerance and variance inflation
factors.
Correlation Matrices
A correlation matrix is simply a table indicating the correlations between each pair of explanatory variables.
If you see values close to 1 or -1 that indicates variables are strongly associated with each other and you may have multicollinearity problems.
If you see many correlations all greater in absolute value than 0.8, you may also have problems.
Example: a correlation matrix showing mild, not serious, collinearity.
Pearson Correlation Coefficients, N = 507
Prob > |r| under H0: Rho=0

               ankle_diam  waist_girth  forearm_girth   height
ankle_diam        1.00000      0.63697        0.73525  0.68645
                               <.0001         <.0001   <.0001
waist_girth       0.63697      1.00000        0.78079  0.55296
                  <.0001                      <.0001   <.0001
forearm_girth     0.73525      0.78079        1.00000  0.65502
                  <.0001       <.0001                  <.0001
height            0.68645      0.55296        0.65502  1.00000
                  <.0001       <.0001         <.0001
Ignore the p-values when assessing multicollinearity.
None of these correlations is extreme.
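The SAS output above can be mimicked with a small sketch in Python. Everything here is made up for illustration: the variable names and data are invented, with waist constructed to be moderately related to height and forearm unrelated to both.

```python
import numpy as np

# Made-up data: three body measurements for 200 subjects.
rng = np.random.default_rng(0)
height = rng.normal(170, 10, size=200)
waist = 0.5 * height + rng.normal(0, 5, size=200)   # moderately related to height
forearm = rng.normal(25, 2, size=200)               # unrelated to the others

X = np.column_stack([height, waist, forearm])
corr = np.corrcoef(X, rowvar=False)   # 3x3 matrix of pairwise correlations

print(np.round(corr, 2))
# The (height, waist) entry is well away from 0; the forearm entries are near 0.
```

Scanning this matrix for entries large in absolute value is exactly the "correlation matrix" check described above.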
Dis/advantages of correlation matrices
Disadvantage - correlation matrices only
work with two variables at a time. Thus we
can only see pairwise relationships. If a more
complicated relationship exists, the
correlation matrix won’t find it.
Advantage – working with two variables at a
time is simplest and most interpretable. If you
see a high correlation, you can make a
simple scatterplot of the two variables and
see their relationship.
Measure of Tolerance
Tolerance is a measure of collinearity reported by most statistical programs, such as SPSS. A variable's tolerance is 1 - R², where R² comes from regressing that variable on the other explanatory variables.
A small tolerance value indicates that the
variable under consideration is almost a
perfect linear combination of the independent
variables already in the equation and that it
should not be added to the regression
equation.
Tolerance Contd
All variables involved in the linear
relationship will have a small tolerance.
Some suggest that a tolerance value less
than 0.1 should be investigated further.
If a low tolerance value is accompanied by
large standard errors and nonsignificance,
multicollinearity may be an issue.
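As a sketch, the tolerance of one variable can be computed from the auxiliary regression of that variable on the others. The data below are made up, with x3 built to be nearly a linear combination of x1 and x2, so its tolerance should be close to 0.

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
x3 = x1 + x2 + rng.normal(scale=0.1, size=100)   # nearly a linear combination

def tolerance(X, j):
    """Tolerance of column j: 1 - R^2 from regressing X[:, j] on the rest."""
    y = X[:, j]
    A = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])  # intercept + others
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    ss_res = resid @ resid
    ss_tot = ((y - y.mean()) ** 2).sum()
    return ss_res / ss_tot                        # = 1 - R^2

X = np.column_stack([x1, x2, x3])
print(tolerance(X, 2))   # near 0: x3 is almost redundant given x1 and x2
```

A tolerance this far below the 0.1 rule of thumb would flag x3 for further investigation.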
Variance Inflation Factors (VIFs)
The Variance Inflation Factor (VIF) measures the impact of collinearity among the variables in a regression model. The VIF is 1/Tolerance; it is always greater than or equal to 1.
Variance inflation factors measure the relationship of all the variables simultaneously, thus they avoid the “two at a time” disadvantage of correlation matrices.
VIF contd
There is a VIF for each variable.
Loosely, the VIF is based on regressing each variable on the remaining variables. If the remaining variables can predict the variable of interest, then that variable has a high VIF.
Calculating VIF for Variable Xj
The formula:

VIF_j = 1 / (1 - R_j²)

where R_j² is the R² from regressing X_j on the other explanatory variables.
If VIF_j ≥ 10 then there is a problem with multicollinearity.
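A minimal sketch of this formula, assuming made-up data in which two variables (here called waist and chest) are strongly related and a third (height) is unrelated:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
waist = rng.normal(size=n)
chest = 0.9 * waist + rng.normal(scale=0.3, size=n)   # strongly tied to waist
height = rng.normal(size=n)                           # unrelated

def vifs(X):
    """VIF of every column of X: 1 / (1 - R_j^2) from the auxiliary regressions."""
    out = []
    for j in range(X.shape[1]):
        y = X[:, j]
        A = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        r2 = 1 - ((y - A @ beta) ** 2).sum() / ((y - y.mean()) ** 2).sum()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

X = np.column_stack([waist, chest, height])
print(np.round(vifs(X), 2))   # waist and chest inflated, height close to 1
```

Since 1/Tolerance = VIF, this is the same auxiliary-regression calculation as for tolerance, just inverted.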
Using VIFs
Ideally, VIFs are 1, indicating the variable
provides completely independent
information.
In practice, all VIFs are greater than 1. There
is typically some degree of multicollinearity.
VIFs are considered “bad” if they exceed 10.
Illustrating VIFs
The variance inflation factors in this example
are all quite moderate (the largest is 3.58).
Thus, there is no evidence of serious
collinearity in our example.
Variable       Label          DF  Parameter   Standard   t Value  Pr > |t|  Variance
                                  Estimate    Error                         Inflation
Intercept      Intercept       1  -71.56758    3.27692    -21.84    <.0001          0
ankle_diam     ankle diam      1    0.21323    0.22842      0.93    0.3510    2.63183
waist_girth    waist girth     1    0.66749    0.02582     25.85    <.0001    2.62210
forearm_girth  forearm girth   1    1.34978    0.11740     11.50    <.0001    3.58026
height         height          1    0.30009    0.02697     11.13    <.0001    2.08710
A more difficult example
Suppose in a study the variables are: chest girth, shoulder girth, waist girth, and bicep girth (all upper body measurements, which we might expect to be more strongly associated with each other).
Correlation matrix
Pearson Correlation Coefficients, N = 507
Prob > |r| under H0: Rho=0

                waist_girth  bicep_girth  chest_girth  shoulder_girth
waist_girth         1.00000      0.80470      0.88380         0.82345
                                  <.0001       <.0001          <.0001
bicep_girth         0.80470      1.00000      0.90818         0.89519
                     <.0001                    <.0001          <.0001
chest_girth         0.88380      0.90818      1.00000         0.92719
                     <.0001       <.0001                       <.0001
shoulder_girth      0.82345      0.89519      0.92719         1.00000
                     <.0001       <.0001       <.0001
Many of these correlations are near or above 0.9, indicating
problems with multicollinearity
VIFs for second example
The VIFs here are larger than in the first example, with the largest being 12.47, which is above our rule of thumb of 10. Again, this is evidence that multicollinearity is an issue.
Parameter Estimates

Variable        Label           DF  Parameter   Standard   t Value  Pr > |t|  Variance
                                    Estimate    Error                         Inflation
Intercept       Intercept        1  -36.71600    2.39942    -15.30    <.0001         0
waist_girth     waist girth      1    0.59868    0.03952     15.15    <.0001   4.57075
chest_girth     chest girth      1    0.07716    0.07169      1.08    0.2823  12.47291
bicep_girth     bicep girth      1    0.66512    0.12170      5.47    <.0001   6.44810
shoulder_girth  shoulder girth   1    0.29432    0.05568      5.29    <.0001   8.05461
So multicollinearity is an issue – what do you do about it?
Remember, if the multicollinearity is present
but not excessive (no high correlations, no
VIFs above 10), you can ignore it.
If multicollinearity is a big issue, your goal
becomes extremely important.
If your goal is prediction…
If your main goal is prediction (using the available explanatory variables to predict the response), then you can safely ignore the multicollinearity.
If you have multiple variables that provide the same information, using any one of them is sufficient.
If interest centers on the real relationships between the variables…
When you have serious multicollinearity, the
variables are sufficiently redundant that you
cannot reliably distinguish their effects.
There is no simple solution here.
Sometimes a bigger experiment will help, but
often the variables are so intertwined you
cannot distinguish them without a prohibitive
sample size.
If interest centers on the real relationships between the variables…
Often, while you may not be able to
disentangle your explanatory variables, you
may be comfortable with a “composite” (my
term) variable.
For example, if in a sociological study you
find the variables “father’s education level”
and “mother’s education level” are strongly
related, it may be sufficient to simply use one
variable, “parent’s education level”, which is
some function of the two parents.
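A minimal sketch of such a composite, assuming education is coded in years of schooling (all values made up):

```python
# Hypothetical data: education in years of schooling for four families.
fathers_edu = [12, 16, 10, 18]
mothers_edu = [14, 16, 12, 16]

# One simple choice of composite: the average of the two strongly related variables.
parents_edu = [(f + m) / 2 for f, m in zip(fathers_edu, mothers_edu)]
print(parents_edu)   # → [13.0, 16.0, 11.0, 17.0]
```

The composite would replace both original variables in the regression, removing that source of multicollinearity at the cost of no longer distinguishing the two parents.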
Finally…
In many situations you get to set some of the
explanatory variables (in engineering studies
you often get to select almost all of them, in
medical studies you can set drug dosage and
some other variables).
Make sure when you set these experiments up that you do not "install" multicollinearity.
A review, and what to do…
Ok, let’s put some of this together
You have several explanatory variables and a single response Y.
You run the multiple regression first and check the residuals and collinearity measures.
A review, and what to do…
If the residuals look bad, deal with those first (you may need a transformation).
Now suppose you have decent residuals
With decent residuals…
Check the collinearity measures. If these are
problematic (any VIF above 10 or high correlations),
then you must start removing or combining variables
before you can trust the output. This tends to be a
substantive, not a statistical, task.
The variables with the highest VIFs can be targeted for deletion first, if that makes substantive sense; they are the "most redundant".
With decent residuals…
After you do anything, remember to check
the residuals again.
Now suppose you have decent residuals and
collinearity measures…
Look at the F-statistic and its p-value. If it is not
significant, stop. Otherwise continue to the
next slide…
Backward selection – a simple model selection technique.
“Model selection” refers to determining which of the
explanatory variables should be placed in a final model.
Backward selection begins by fitting the “full model”,
where all the explanatory variables are present. If the F-
statistic is not significant, stop and declare no variable
meaningful, otherwise continue below.
Backward selection – a simple model selection technique.
Check the resulting p-values. If any are NOT
significant, remove the variable with the
highest p-value and rerun the model (never
remove the intercept).
Continue removing insignificant variables
until no insignificant variables remain. The
result is your final model.
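The procedure above can be sketched as follows. Everything is made up: the data are simulated, and |t| ≥ 2 is used as a rough stand-in for "significant at the 5% level" (the exact cutoff would come from the t distribution).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
names = ["ankle", "waist", "forearm", "height"]
X = rng.normal(size=(n, 4))
# "ankle" (column 0) has no real effect on y; the rest do.
y = 0.7 * X[:, 1] + 1.3 * X[:, 2] + 0.3 * X[:, 3] + rng.normal(size=n)

def t_stats(X, y):
    """OLS t statistics for each column of X (intercept fit but not reported)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sigma2 = resid @ resid / (len(y) - A.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(A.T @ A)))
    return (beta / se)[1:]          # drop the intercept's t statistic

keep = list(range(X.shape[1]))
while keep:
    t = t_stats(X[:, keep], y)
    worst = np.argmin(np.abs(t))
    if abs(t[worst]) >= 2:          # all remaining variables significant: stop
        break
    keep.pop(worst)                 # drop the least significant variable and refit

print([names[j] for j in keep])    # the noise variable "ankle" is usually dropped
```

Note the intercept column is always kept, matching the rule above that the intercept is never removed.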
Alternatives to backward selection
There are many other model selection
techniques. For example, there is forward
selection, which builds the model up from an
empty model, and stepwise regression, which
alternates forward and backward steps.
Backward selection is the simplest.
Many modern techniques are based on
heavy computation and involve running every
possible model and comparing the results,
potentially of several thousand regressions.
Summary – backward selection is simple and
practical.
Backward selection in our example
In regressing weight on ankle diameter, waist girth,
forearm girth, and height, the full model resulted in a
significant F-statistic and the individual p-values shown below.
There is one non-significant p-value, for ankle
diameter, so we remove ankle diameter from the
model and rerun.
Variable       Label          DF  Parameter   Standard   t Value  Pr > |t|
                                  Estimate    Error
Intercept      Intercept       1  -71.56758    3.27692    -21.84    <.0001
ankle_diam     ankle diam      1    0.21323    0.22842      0.93    0.3510
waist_girth    waist girth     1    0.66749    0.02582     25.85    <.0001
forearm_girth  forearm girth   1    1.34978    0.11740     11.50    <.0001
height         height          1    0.30009    0.02697     11.13    <.0001
Backward selection in our example continued.
The next model with waist girth, forearm
girth, and height results in all significant p-
values. We may stop and declare this our
final model.
Variable       Label          DF  Parameter   Standard   t Value  Pr > |t|
                                  Estimate    Error
Intercept      Intercept       1  -71.53234    3.27628    -21.83    <.0001
waist_girth    waist girth     1    0.67048    0.02562     26.17    <.0001
forearm_girth  forearm girth   1    1.38825    0.10992     12.63    <.0001
height         height          1    0.30998    0.02480     12.50    <.0001
QUESTIONS???