Post on 08-Jul-2020
Diagnostics for
Linear Regression
Models
Prof. David Sibbritt
Session Outline
• Using residuals to check the model
residual plots
• Multicollinearity diagnostics
variance inflation factor (VIF)
Background
• When a regression model is selected, one cannot
usually be certain in advance that the model is
appropriate
• It is important to consider the range of regression
diagnostics available
they allow us to look for flaws that may affect our parameter
estimates
they help us assess whether the assumptions underlying the
model are violated and whether our results are heavily
impacted by influential observations
Using Residuals to check the model
• There are several plots that we can use to determine
whether there are any departures from the regression model
1) The regression function is not linear
plot residuals vs independent variable (or dependent
variable)
residuals should be evenly scattered in a horizontal band
about zero and there should not be any other obvious
pattern (e.g. curved)
2) The error terms do not have constant variance
plot residuals vs independent variable (or dependent
variable)
residuals should be scattered about zero with constant
spread (i.e. no ‘fanning out’)
3) The model fits all but one or a few outlier observations
plot residuals vs independent variable (or dependent
variable)
there should not be any residuals that are positioned far
from zero
4) The error terms are not normally distributed
histogram of residuals and/or normal probability plot
5) One or several important independent variables have
been omitted from the model
plot residuals vs any other independent variables
residuals should not be systematically raised above, or
lowered below, zero across the values of the omitted variable
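Checks 1 and 2 can be illustrated outside SPSS with a small simulation. The sketch below (Python with NumPy; the simulated data, coefficients, and thresholds are invented for illustration) fits a straight line to deliberately curved data, so the residuals show the kind of pattern described in check 1:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: y depends on x quadratically, but we fit a straight
# line, so the residuals should show a curved pattern (check 1 above).
x = np.linspace(0, 10, 100)
y = 2.0 + 0.5 * x + 0.5 * x**2 + rng.normal(0, 1, size=x.size)

# Ordinary least squares fit of the (misspecified) model y = b0 + b1*x
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

# With an intercept in the model, least-squares residuals always average
# to zero, so "scattered about zero" must be judged from the pattern,
# not the mean.
print(abs(residuals.mean()) < 1e-8)  # True

# A curved pattern shows up as a strong correlation between the
# residuals and the centred quadratic term.
curvature = np.corrcoef(residuals, (x - x.mean()) ** 2)[0, 1]
print(curvature > 0.8)  # True: clear pattern -> model not appropriate
```

A plot of `residuals` against `x` would show the U-shaped pattern directly; the correlation check is just a way to detect it numerically.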
• Below are some example residual plots (assuming a
linear regression model) of the kinds that typically result:
(a) shows that all residuals are evenly scattered about
zero, with no obvious pattern or outliers
suggests that the model is appropriate
[Figure: four example residual plots, panels (a)–(d); vertical axis:
residual ei; horizontal axis: X in panels (a)–(c), Time in panel (d)]
(b) shows that as the predictor variable X increases, so does
the variation in the residuals (fanning out)
suggests that the model is not appropriate
(c) shows a definite pattern (curved) in the residuals
suggests that the model is not appropriate (and also
that a curved model would be a better choice)
• In SPSS, we can produce residuals as follows:
[SPSS screenshots not reproduced]
• The residuals should be randomly scattered about zero
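For readers working outside SPSS, a minimal sketch of standardized residuals in Python with NumPy (the simulated data are invented, and this uses the simplest definition, residual divided by the root mean squared error; SPSS also offers refinements such as studentized residuals):

```python
import numpy as np

rng = np.random.default_rng(1)

# Well-specified linear model: standardized residuals should sit in a
# band about zero, mostly between -2 and 2.
x = np.linspace(0, 10, 200)
y = 1.0 + 2.0 * x + rng.normal(0, 3, size=x.size)

X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Standardized residuals: residual / root mean squared error,
# with degrees of freedom n - p for p fitted coefficients.
p = X.shape[1]
rmse = np.sqrt(resid @ resid / (x.size - p))
std_resid = resid / rmse

# Roughly 95% should fall within +/-2 if the errors are normal.
print(np.mean(np.abs(std_resid) < 2) > 0.85)  # True
```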
Notes of caution:
a) Range of observations
even if the fit appears satisfactory for the observations we
have available, the model may not be a good fit
when extended outside the range of past observations
b) Causality
the presence of a regression relation between two
variables does not imply a cause-and-effect relation
between them
Multicollinearity Diagnostics –
Variance Inflation Factor
• In multiple linear regression, problems can arise
when the independent variables being considered
for the regression model are highly correlated
among themselves
i.e. the correlated variables will have a similar
relationship with the dependent variable
• There is a highly useful diagnostic: the variance
inflation factor
• The variance inflation factor (VIF) measures how much
the variances of the estimated regression coefficients
are inflated as compared to when the independent
variables are not linearly related
• The largest VIF value among all X variables is often
used as an indicator of the severity of multicollinearity
a maximum VIF value in excess of 10 is often taken as an
indication that multicollinearity may be unduly influencing
the least squares estimates
• Mean VIF values considerably larger than 1 are
indicative of serious multicollinearity problems
• In general, the VIF for the jth regression coefficient can
be written as

VIFj = 1 / (1 − Rj²)

where Rj² is the coefficient of multiple determination
obtained from regressing Xj on the other regressor variables
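The formula can be applied directly: regress each Xj on the remaining predictors, take the resulting Rj², and invert 1 − Rj². A minimal sketch in Python with NumPy (the simulated predictors and sample size are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

def vif(X, j):
    """VIF for column j of predictor matrix X (no intercept column):
    regress X[:, j] on the remaining columns plus an intercept and
    return 1 / (1 - R^2)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    Z = np.column_stack([np.ones(len(y)), others])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return 1.0 / (1.0 - r2)

n = 500
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)  # nearly collinear with x1
x3 = rng.normal(size=n)                  # unrelated to x1 and x2
X = np.column_stack([x1, x2, x3])

vifs = [vif(X, j) for j in range(3)]
print(vifs[0] > 10, vifs[1] > 10)  # True True: collinear pair, large VIFs
print(vifs[2] < 1.5)               # True: near-orthogonal predictor, VIF near 1
```

This matches the two limiting cases above: the nearly collinear pair has Rj² near unity and very large VIFs, while the unrelated predictor has a VIF close to 1.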
• If Xj is nearly linearly dependent on some of the other
regressors, then Rj² will be near unity and VIFj will be large
• If Xj is orthogonal to the remaining predictors, its VIF will
be 1
• When strong multicollinearity is present, regression models
fit to data by the method of least squares are notoriously
poor prediction equations, and the values of the regression
coefficients are often very sensitive to the data in the
particular sample collected
Example: AIS Athletes Study
• Maximum VIF = 5.01 < 10 (not too bad)
• Mean VIF = 3.73 > 1 (not too bad?)
Coefficients(a)

                Unstandardized         Standardized
                Coefficients           Coefficients                   Collinearity Statistics
Model           B         Std. Error   Beta            t      Sig.    Tolerance   VIF
1 (Constant)    -63.714   47.543                       -1.340  .182
  RCC           -12.359   15.533       -.118           -.796   .427   .211        4.747
  Hg            13.884    5.356        .396            2.592   .010   .200        5.012
  Bfat          -.233     .626         -.030           -.372   .710   .700        1.428
a. Dependent Variable: Ferr
Mean VIF = ([4.747 + 5.012 + 1.428] ÷ 3) = 3.73
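The collinearity statistics in the table are internally consistent: SPSS's tolerance is 1 − Rj², so VIF = 1/tolerance (only approximately when checked from the table, because tolerance is printed to three decimals). A quick check in Python using the values reported above:

```python
# VIF and tolerance values reported in the SPSS table above
vifs = {"RCC": 4.747, "Hg": 5.012, "Bfat": 1.428}
tolerances = {"RCC": 0.211, "Hg": 0.200, "Bfat": 0.700}

# VIF = 1 / tolerance, up to the 3-decimal rounding of tolerance
for name in vifs:
    print(name, abs(vifs[name] - 1 / tolerances[name]) < 0.02)  # True

print(round(max(vifs.values()), 2))              # 5.01 (maximum VIF)
print(round(sum(vifs.values()) / len(vifs), 2))  # 3.73 (mean VIF)
```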