Multicollinearity in Regression Principal Components Analysis
Detecting and reducing multicollinearity
Transcript of Detecting and reducing multicollinearity
Detecting multicollinearity
Common methods of detection
• Realized effects (changes in coefficients, changes in standard errors of coefficients, changes in sequential sums of squares) of multicollinearity.
• Non-significant t-tests for all of the slopes but a significant overall F-test.
• Significant correlations among pairs of predictor variables (correlations, matrix scatter plots).
• Variance inflation factors (VIF).
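The first variance at issue
For the model:
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_{p-1} x_{i,p-1} + \varepsilon_i$$
the variance of the estimated coefficient $b_k$ is:
$$\mathrm{Var}(b_k) = \frac{\sigma^2}{\sum_{i=1}^{n} (x_{ik} - \bar{x}_k)^2} \times \frac{1}{1 - R_k^2}$$
where $R_k^2$ is the $R^2$ value obtained by regressing the kth predictor on the remaining predictors.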
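The second variance at issue
For the model:
$$y_i = \beta_0 + \beta_k x_{ik} + \varepsilon_i$$
the variance of the estimated coefficient $b_k$ is:
$$\mathrm{Var}(b_k)_{\min} = \frac{\sigma^2}{\sum_{i=1}^{n} (x_{ik} - \bar{x}_k)^2}$$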
The ratio of the two variances
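$$\frac{\mathrm{Var}(b_k)}{\mathrm{Var}(b_k)_{\min}} = \frac{\dfrac{\sigma^2}{\sum_{i=1}^{n} (x_{ik} - \bar{x}_k)^2} \times \dfrac{1}{1 - R_k^2}}{\dfrac{\sigma^2}{\sum_{i=1}^{n} (x_{ik} - \bar{x}_k)^2}} = \frac{1}{1 - R_k^2}$$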
Variance inflation factors
The variance inflation factor for the kth predictor is:
$$VIF_k = \frac{1}{1 - R_k^2}$$
where $R_k^2$ is the $R^2$ value obtained by regressing the kth predictor on the remaining predictors.
Variance inflation factors (VIFk)
• A measure of how much the variance of the estimated regression coefficient bk is “inflated” by the existence of correlation among the predictor variables in the model.
• VIFs exceeding 4 warrant investigation.
• VIFs exceeding 10 are signs of serious multicollinearity.
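The definition above can be turned into a short computation. The following is a minimal sketch, assuming NumPy is available: each predictor is regressed on the remaining predictors by least squares, and VIF_k = 1 / (1 - R_k^2) is returned. The function name `variance_inflation_factors` and the simulated predictors are hypothetical and are not part of the original slides.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return VIF_k = 1 / (1 - R_k^2) for each column of X (rows = observations)."""
    X = np.asarray(X, dtype=float)
    n, m = X.shape
    vifs = np.empty(m)
    for k in range(m):
        x_k = X[:, k]
        # Regress the kth predictor on an intercept plus all other predictors.
        design = np.column_stack([np.ones(n), np.delete(X, k, axis=1)])
        coef, *_ = np.linalg.lstsq(design, x_k, rcond=None)
        resid = x_k - design @ coef
        sse = resid @ resid
        ssto = np.sum((x_k - x_k.mean()) ** 2)
        r2_k = 1.0 - sse / ssto
        vifs[k] = 1.0 / (1.0 - r2_k)
    return vifs

# Hypothetical simulated predictors (not data from the slides).
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.3 * rng.normal(size=200)  # built to be strongly correlated with x1
x3 = rng.normal(size=200)             # generated independently of x1 and x2
X_demo = np.column_stack([x1, x2, x3])
vif_demo = variance_inflation_factors(X_demo)
print(np.round(vif_demo, 2))
```

In this sketch the first two simulated predictors should show inflated VIFs, while the third should stay close to 1.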
Blood pressure example
[Figure: matrix scatter plot of BP, Age, Weight, BSA, Duration, Pulse, and Stress]
n = 20 hypertensive individuals
p-1 = 6 predictor variables
Blood pressure example
          BP     Age    Weight  BSA    Duration  Pulse
Age       0.659
Weight    0.950  0.407
BSA       0.866  0.378  0.875
Duration  0.293  0.344  0.201   0.131
Pulse     0.721  0.619  0.659   0.465  0.402
Stress    0.164  0.368  0.034   0.018  0.312     0.506
Blood pressure (BP) is the response.
Regress y = BP on all 6 predictors

Predictor  Coef      SE Coef   T      P      VIF
Constant   -12.870   2.557     -5.03  0.000
Age        0.70326   0.04961   14.18  0.000  1.8
Weight     0.96992   0.06311   15.37  0.000  8.4
BSA        3.776     1.580     2.39   0.033  5.3
Dur        0.06838   0.04844   1.41   0.182  1.2
Pulse      -0.08448  0.05161   -1.64  0.126  4.4
Stress     0.005572  0.003412  1.63   0.126  1.8

S = 0.4072   R-Sq = 99.6%   R-Sq(adj) = 99.4%

Analysis of Variance
Source          DF  SS       MS      F       P
Regression      6   557.844  92.974  560.64  0.000
Residual Error  13  2.156    0.166
Total           19  560.000
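As a quick arithmetic check, the summary statistics reported for this fit can be recovered from the sums of squares and degrees of freedom in the analysis of variance table. This is only a sketch using the rounded values printed in the output, so the F statistic agrees with the reported 560.64 only approximately. The variable names are illustrative.

```python
import math

# Values copied from the analysis of variance table for BP on all 6 predictors.
ssr, df_regression = 557.844, 6
sse, df_error = 2.156, 13
ssto = 560.000

msr = ssr / df_regression   # mean square for regression
mse = sse / df_error        # mean square error
f_stat = msr / mse          # overall F statistic
r_sq = ssr / ssto           # coefficient of determination
s = math.sqrt(mse)          # residual standard deviation S

print(f"MSR = {msr:.3f}, MSE = {mse:.3f}, F = {f_stat:.1f}")
print(f"R-Sq = {100 * r_sq:.1f}%, S = {s:.4f}")
```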
Regress x2 = weight on 5 predictors

Predictor  Coef      SE Coef  T      P      VIF
Constant   19.674    9.465    2.08   0.057
Age        -0.1446   0.2065   -0.70  0.495  1.7
BSA        21.422    3.465    6.18   0.000  1.4
Dur        0.0087    0.2051   0.04   0.967  1.2
Pulse      0.5577    0.1599   3.49   0.004  2.4
Stress     -0.02300  0.01308  -1.76  0.101  1.5

S = 1.725   R-Sq = 88.1%   R-Sq(adj) = 83.9%

Analysis of Variance
Source          DF  SS       MS      F      P
Regression      5   308.839  61.768  20.77  0.000
Residual Error  14  41.639   2.974
Total           19  350.478
The variance inflation factor calculated by its definition
$$\frac{\mathrm{Var}(b_k)}{\mathrm{Var}(b_k)_{\min}} = \frac{1}{1 - R_k^2} = \frac{1}{1 - 0.881} = 8.40$$
The variance of the weight coefficient is inflated by a factor of 8.40 due to the existence of correlation among the predictor variables in the model.
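The same calculation can be sketched in a few lines, starting from the sums of squares reported when weight is regressed on the other five predictors. The values are copied from that output, and the variable names are illustrative.

```python
# Sums of squares from the regression of weight on the other 5 predictors.
ssr_weight = 308.839
ssto_weight = 350.478

r_sq_weight = ssr_weight / ssto_weight   # R_k^2 for the weight predictor
vif_weight = 1.0 / (1.0 - r_sq_weight)   # VIF_k = 1 / (1 - R_k^2)

print(f"R_k^2 = {r_sq_weight:.3f}, VIF = {vif_weight:.1f}")
```

The result agrees, to one decimal place, with the VIF of 8.4 that the software reports for Weight in the full model.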
The pairwise correlations
          BP     Age    Weight  BSA    Duration  Pulse
Age       0.659
Weight    0.950  0.407
BSA       0.866  0.378  0.875
Duration  0.293  0.344  0.201   0.131
Pulse     0.721  0.619  0.659   0.465  0.402
Stress    0.164  0.368  0.034   0.018  0.312     0.506
Blood pressure (BP) is the response.
Regress y = BP on age, weight, duration and stress
Predictor  Coef      SE Coef   T      P      VIF
Constant   -15.870   3.195     -4.97  0.000
Age        0.68374   0.06120   11.17  0.000  1.5
Weight     1.03413   0.03267   31.65  0.000  1.2
Dur        0.03989   0.06449   0.62   0.545  1.2
Stress     0.002184  0.003794  0.58   0.573  1.2

S = 0.5505   R-Sq = 99.2%   R-Sq(adj) = 99.0%

Analysis of Variance
Source          DF  SS      MS      F       P
Regression      4   555.45  138.86  458.28  0.000
Residual Error  15  4.55    0.30
Total           19  560.00
Reducing data-based multicollinearity
Data-based multicollinearity
• Multicollinearity that results from a poorly designed experiment, reliance on purely observational data, or the inability to manipulate the system on which you collect the data.
Some methods
• Modify the regression model by eliminating one or more predictor variables.
• Collect additional data under different experimental or observational conditions.
(Modified!) Allen Cognitive Level (ACL) Study
• Relationship of ACL test to level of pathology in a set of 23 patients in a hospital psychiatry unit:
– Response y = ACL score
– x1 = vocabulary (Vocab) score on Shipley Institute of Living Scale
– x2 = abstraction (Abstract) score on Shipley Institute of Living Scale
– x3 = score on Symbol-Digit Modalities Test (SDMT)
Allen Cognitive Level (ACL) Study on 23 patients
[Figure: matrix scatter plot of ACL, SDMT, Vocab, and Abstract]
Strong correlation between Vocab and Abstract
[Figure: scatter plot of Vocab versus Abstract]
Pearson correlation of Vocab and Abstract = 0.990
Regress y = ACL on SDMT, Vocab, and Abstract
Predictor  Coef     SE Coef  T      P      VIF
Constant   3.747    1.342    2.79   0.012
SDMT       0.02326  0.01273  1.83   0.083  1.7
Vocab      0.0283   0.1524   0.19   0.855  49.3
Abstract   -0.0138  0.1006   -0.14  0.892  50.6

S = 0.7344   R-Sq = 26.5%   R-Sq(adj) = 14.8%

Analysis of Variance
Source          DF  SS       MS      F     P
Regression      3   3.6854   1.2285  2.28  0.112
Residual Error  19  10.2476  0.5393
Total           22  13.9330
Allen Cognitive Level (ACL) Study on 69 patients
[Figure: matrix scatter plot of ACL, SDMT, Vocab, and Abstract]
Plot after having collected more data
[Figure: scatter plot of Vocab versus Abstract]
Pearson correlation of Vocab and Abstract = 0.698
Regress y = ACL on SDMT, Vocab, and Abstract
Predictor  Coef      SE Coef   T      P      VIF
Constant   3.9463    0.3381    11.67  0.000
SDMT       0.027404  0.007168  3.82   0.000  1.6
Vocab      -0.01740  0.01808   -0.96  0.339  2.1
Abstract   0.01218   0.01159   1.05   0.297  2.2

S = 0.6878   R-Sq = 28.6%   R-Sq(adj) = 25.3%

Analysis of Variance
Source          DF  SS       MS      F     P
Regression      3   12.3009  4.1003  8.67  0.000
Residual Error  65  30.7487  0.4731
Total           68  43.0496
Reducing structural multicollinearity
In context of polynomial regression models
Structural multicollinearity
• Multicollinearity that is a mathematical artifact caused by creating new predictors from other predictors, such as creating the predictor x² from the predictor x.
Example
• (General research question) What is the impact of exercise on the human immune system?
• (Specific research question) How is the amount of immunoglobin in the blood (y) related to maximal oxygen uptake (x)?
[Figure: scatter plot of immunoglobin (mg) versus maximal oxygen uptake (ml/kg)]
Scatter plot
A quadratic polynomial regression function
$$y_i = \beta_0 + \beta_1 x_i + \beta_{11} x_i^2 + \varepsilon_i$$
where:
• yi = amount of immunoglobin in blood (mg)
• xi = maximal oxygen uptake (ml/kg)
• typical assumptions about error terms (“INE”)
Estimated quadratic function
[Figure: regression plot of igg versus oxygen with the fitted quadratic curve]
igg = -1464.40 + 88.3071 oxygen - 0.536247 oxygen**2
S = 106.427   R-Sq = 93.8 %   R-Sq(adj) = 93.3 %
Interpretation of the regression coefficients
• If 0 is a possible x value, then b0 is the predicted response at x = 0. Otherwise, b0 has no meaningful interpretation.
• b1 is the slope of the tangent line at x = 0.
• b2 indicates the up/down direction of curve
– b2 < 0 means curve is concave down
– b2 > 0 means curve is concave up
Regress y = igg on oxygen and oxygen²

The regression equation is
igg = - 1464 + 88.3 oxygen - 0.536 oxygensq

Predictor  Coef     SE Coef  T      P      VIF
Constant   -1464.4  411.4    -3.56  0.001
oxygen     88.31    16.47    5.36   0.000  99.9
oxygensq   -0.5362  0.1582   -3.39  0.002  99.9

S = 106.4   R-Sq = 93.8%   R-Sq(adj) = 93.3%

Analysis of Variance
Source          DF  SS       MS       F       P
Regression      2   4602211  2301105  203.16  0.000
Residual Error  27  305818   11327
Total           29  4908029
Structural multicollinearity
[Figure: scatter plot of oxygensq versus oxygen]
Pearson correlation of oxygen and oxygensq = 0.995
“Center” the predictors
OxCent = Oxygen - 50.637
OxCentSq = (Oxygen - 50.637)^2
Mean of oxygen = 50.637

oxygen  oxcent   oxcentsq
34.6    -16.037  257.185
45.0    -5.637   31.776
62.3    11.663   136.026
58.9    8.263    68.277
42.5    -8.137   66.211
44.3    -6.337   40.158
67.9    17.263   298.011
58.5    7.863    61.827
35.6    -15.037  226.111
49.6    -1.037   1.075
33.0    -17.637  311.064
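The centering step can be sketched as follows, assuming NumPy is available. Only the 11 oxygen values listed above are used, together with the reported mean of 50.637 for the full sample, so the correlations this sketch prints describe those 11 rows and will not match the full-sample correlations reported on the slides. The variable names mirror the column names in the table.

```python
import numpy as np

# The 11 oxygen values listed in the table (a subset of the full data set).
oxygen = np.array([34.6, 45.0, 62.3, 58.9, 42.5, 44.3,
                   67.9, 58.5, 35.6, 49.6, 33.0])
mean_oxygen = 50.637  # mean of oxygen as reported on the slide (full sample)

oxcent = oxygen - mean_oxygen   # centered predictor
oxcentsq = oxcent ** 2          # square of the centered predictor

# Pearson correlation between each predictor and its square.
r_raw = np.corrcoef(oxygen, oxygen ** 2)[0, 1]
r_centered = np.corrcoef(oxcent, oxcentsq)[0, 1]

print(np.round(oxcent[:3], 3), np.round(oxcentsq[:3], 3))
print(f"corr(oxygen, oxygen^2) = {r_raw:.3f}")
print(f"corr(oxcent, oxcentsq) = {r_centered:.3f}")
```

The first correlation should be close to 1, and the second should be much smaller in absolute value, which is the effect centering is meant to achieve.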
Wow! It really works!
[Figure: scatter plot of oxcentsq versus oxcent]
Pearson correlation of oxcent and oxcentsq = 0.219
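When a model contains exactly two predictors, the R² from regressing one on the other equals the square of their Pearson correlation r, so VIF = 1 / (1 - r²). The sketch below applies this to the two correlations reported on the slides. The helper name `vif_two_predictors` is hypothetical, and because the correlations are rounded to three decimals, the results only approximate the VIFs in the regression output (99.9 before centering, 1.1 after).

```python
def vif_two_predictors(r):
    """VIF for either predictor in a two-predictor model with correlation r."""
    return 1.0 / (1.0 - r ** 2)

vif_raw = vif_two_predictors(0.995)       # oxygen and oxygensq
vif_centered = vif_two_predictors(0.219)  # oxcent and oxcentsq

print(f"VIF before centering: about {vif_raw:.0f}")
print(f"VIF after centering:  about {vif_centered:.2f}")
```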
A better quadratic polynomial regression function
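$$y_i = \beta_0^* + \beta_1^* (x_i - \bar{x}) + \beta_{11}^* (x_i - \bar{x})^2 + \varepsilon_i$$
or, equivalently:
$$y_i = \beta_0^* + \beta_1^* x_i^* + \beta_{11}^* x_i^{*2} + \varepsilon_i$$
where $x_i^* = x_i - \bar{x}$ denotes the centered predictor
and:
• yi = amount of immunoglobin in blood (mg)
• typical assumptions about error terms (“INE”)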
Regress y = igg on oxcent and oxcent²

The regression equation is
igg = 1632 + 34.0 oxcent - 0.536 oxcentsq

Predictor  Coef     SE Coef  T      P      VIF
Constant   1632.20  29.35    55.61  0.000
oxcent     34.000   1.689    20.13  0.000  1.1
oxcentsq   -0.5362  0.1582   -3.39  0.002  1.1

S = 106.4   R-Sq = 93.8%   R-Sq(adj) = 93.3%

Analysis of Variance
Source          DF  SS       MS       F       P
Regression      2   4602211  2301105  203.16  0.000
Residual Error  27  305818   11327
Total           29  4908029
Interpretation of the regression coefficients
• b0 is predicted response at the predictor mean.
• b1 is the estimated slope of the tangent line at the predictor mean; and, often, similar to the estimated slope in the simple model.
• b2 indicates the up/down direction of curve
– b2 < 0 means curve is concave down
– b2 > 0 means curve is concave up
Estimated regression function
[Figure: regression plot of igg versus oxcent with the fitted quadratic curve]
igg = 1632.20 + 33.9995 oxcent - 0.536247 oxcent**2
S = 106.427   R-Sq = 93.8 %   R-Sq(adj) = 93.3 %
Similar estimates of coefficients from first-order linear model
[Figure: regression plot of igg versus oxcent with the fitted first-order line]
igg = 1557.63 + 32.7427 oxcent
S = 124.783   R-Sq = 91.1 %   R-Sq(adj) = 90.8 %
The relationship between the two forms of the model
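Centered model:
$$\hat{y}_i = b_0^* + b_1^* x_i^* + b_{11}^* x_i^{*2}$$
Original model:
$$\hat{y}_i = b_0 + b_1 x_i + b_{11} x_i^2$$
where:
$$b_0 = b_0^* - b_1^* \bar{x} + b_{11}^* \bar{x}^2$$
$$b_1 = b_1^* - 2 b_{11}^* \bar{x}$$
$$b_{11} = b_{11}^*$$
For the fitted centered model:
$$\hat{y}_i = 1632.2 + 34.0 x_i^* - 0.5362 x_i^{*2}$$
the coefficients of the original model are:
$$b_0 = 1632.2 - 34(50.637) - 0.5362(50.637)^2 = -1464.4$$
$$b_1 = 34 - 2(-0.5362)(50.637) = 88.3$$
$$b_{11} = -0.5362$$
so that:
$$\hat{y}_i = -1464.4 + 88.3 x_i - 0.536 x_i^2$$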
Mean of oxygen = 50.637
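The conversion from the centered coefficients back to the original ones follows from expanding (x - x̄)²: b0 = b0* - b1* x̄ + b11* x̄², b1 = b1* - 2 b11* x̄, and b11 = b11*. A small sketch of that arithmetic, using the coefficients printed in the centered regression plot and the reported mean of oxygen (the variable names are illustrative):

```python
# Coefficients of the fitted centered model and the mean used for centering.
b0_star = 1632.20
b1_star = 33.9995
b11_star = -0.536247
x_bar = 50.637

# Expand b0* + b1*(x - x_bar) + b11*(x - x_bar)^2 and collect powers of x.
b0 = b0_star - b1_star * x_bar + b11_star * x_bar ** 2
b1 = b1_star - 2 * b11_star * x_bar
b11 = b11_star

print(f"igg = {b0:.1f} + {b1:.2f} oxygen + ({b11:.4f}) oxygen**2")
```

Up to rounding, this should reproduce the uncentered fit igg = -1464.40 + 88.3071 oxygen - 0.536247 oxygen**2.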
Model evaluation
[Figure: residuals versus the fitted values (response is igg)]
Model evaluation
[Figure: normal probability plot of the residuals (response is igg)]
Model use: What is predicted IgG if maximal oxygen uptake is 90?
There is an even greater danger in extrapolation when modeling data with a polynomial function, because of changes in direction.
Predicted Values for New Observations
New Obs  Fit     SE Fit  95.0% CI          95.0% PI
1        2139.6  219.2   (1689.8, 2589.5)  (1639.6, 2639.7) XX
X denotes a row with X values away from the center
XX denotes a row with very extreme X values

Values of Predictors for New Observations
New Obs  oxcent  oxcentsq
1        39.4    1549
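The prediction, and the change in direction that makes it risky, can be sketched from the fitted centered model. Setting the derivative of a fitted quadratic to zero puts its turning point at oxcent = -b1*/(2 b11*). The coefficients and the mean below are copied from the output; the variable names are illustrative.

```python
# Fitted centered quadratic and the mean used for centering.
b0_star, b1_star, b11_star = 1632.20, 33.9995, -0.536247
x_bar = 50.637

oxygen_new = 90.0
oxcent_new = oxygen_new - x_bar   # centered value of the new observation
fit = b0_star + b1_star * oxcent_new + b11_star * oxcent_new ** 2

# Turning point of the fitted parabola, converted back to the oxygen scale.
turning_point = x_bar - b1_star / (2 * b11_star)

print(f"oxcent = {oxcent_new:.1f}, oxcentsq = {oxcent_new ** 2:.0f}, fit = {fit:.1f}")
print(f"fitted curve turns downward near oxygen = {turning_point:.1f}")
```

Under this fit the curve peaks at an oxygen value well above the plotted range of roughly 30 to 70 and below 90, so the prediction at 90 lies on the descending branch, where no data were observed.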
The hierarchical approach to model fitting
A widely accepted approach is to fit a higher-order model and then explore whether a lower-order (simpler) model is adequate.
$$Y_i = \beta_0 + \beta_1 x_i + \beta_{11} x_i^2 + \beta_{111} x_i^3 + \varepsilon_i$$
Is a first-order linear model (“line”) adequate?
$$H_0: \beta_{11} = \beta_{111} = 0$$
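One common way to test a hypothesis of this form is a partial (general linear) F-test that compares the error sum of squares of the full cubic model with that of the reduced first-order model. The slides do not say which procedure was used, so the following is only a sketch under that assumption. It uses NumPy, hypothetical function names, and simulated data rather than the study data. The predictor is centered before forming powers for numerical stability, which does not change either fit.

```python
import numpy as np

def sse_of_fit(design, y):
    """Error sum of squares from a least-squares fit of y on the design matrix."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(resid @ resid)

def partial_f_cubic_vs_line(x, y):
    """F statistic for H0: beta_11 = beta_111 = 0 in a cubic polynomial model."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    ones = np.ones(len(y))
    reduced = np.column_stack([ones, xc])                  # first-order model
    full = np.column_stack([ones, xc, xc ** 2, xc ** 3])   # cubic model
    sse_reduced = sse_of_fit(reduced, y)
    sse_full = sse_of_fit(full, y)
    df_reduced, df_full = len(y) - 2, len(y) - 4
    f_stat = ((sse_reduced - sse_full) / (df_reduced - df_full)) / (sse_full / df_full)
    return f_stat, (df_reduced - df_full, df_full)

# Hypothetical simulated data with real curvature (not the study data).
rng = np.random.default_rng(1)
x_sim = np.linspace(30, 70, 30)
y_sim = 1600 + 34 * (x_sim - 50) - 0.5 * (x_sim - 50) ** 2 + rng.normal(0, 100, 30)
f_sim, df_sim = partial_f_cubic_vs_line(x_sim, y_sim)
print(f"F = {f_sim:.2f} on {df_sim} degrees of freedom")
```

A large F statistic relative to the F distribution with the printed degrees of freedom would suggest that the line alone is not adequate.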
The hierarchical approach to model fitting
But then … if a polynomial term of a given order is retained, then all related lower-order terms are also retained.
That is, if a quadratic term was significant, you would use this regression function:
$$E(Y_i) = \beta_0 + \beta_1 x_i + \beta_{11} x_i^2$$
and not this one:
$$E(Y_i) = \beta_0 + \beta_{11} x_i^2$$