Multicollinearity
• Multicollinearity (or intercorrelation) exists when at least some of the predictor variables are correlated among themselves.
• In observational studies, multicollinearity happens more often than not.
• So, we need to understand the effects of multicollinearity on regression analyses.
Example #1
[Figure: scatter plot matrix of BP, Age, Weight, BSA, Duration, Pulse, and Stress.]
n = 20 hypertensive individuals
p-1 = 6 predictor variables
Correlation matrix:

          BP     Age    Weight  BSA    Duration  Pulse
Age       0.659
Weight    0.950  0.407
BSA       0.866  0.378  0.875
Duration  0.293  0.344  0.201   0.131
Pulse     0.721  0.619  0.659   0.465  0.402
Stress    0.164  0.368  0.034   0.018  0.312     0.506
Blood pressure (BP) is the response.
What is the effect on regression analyses if the predictors are perfectly uncorrelated?

x1  x2  y
2   5   52
2   5   43
2   7   49
2   7   46
4   5   50
4   5   48
4   7   44
4   7   43
Pearson correlation of x1 and x2 = 0.000
Regress Y on X1

The regression equation is y = 48.8 - 0.63 x1

Predictor   Coef     SE Coef  T      P
Constant    48.750   4.025    12.11  0.000
x1          -0.625   1.273    -0.49  0.641

Analysis of Variance
Source      DF  SS     MS     F     P
Regression  1   3.13   3.13   0.24  0.641
Error       6   77.75  12.96
Total       7   80.88
Regress Y on X2
The regression equation is y = 55.1 - 1.38 x2

Predictor   Coef     SE Coef  T      P
Constant    55.125   7.119    7.74   0.000
x2          -1.375   1.170    -1.17  0.285

Analysis of Variance
Source      DF  SS     MS     F     P
Regression  1   15.13  15.13  1.38  0.285
Error       6   65.75  10.96
Total       7   80.88
Regress Y on X1 and X2
The regression equation is y = 57.0 - 0.63 x1 - 1.38 x2

Predictor   Coef     SE Coef  T      P
Constant    57.000   8.486    6.72   0.001
x1          -0.625   1.251    -0.50  0.639
x2          -1.375   1.251    -1.10  0.322

Analysis of Variance
Source      DF  SS     MS     F     P
Regression  2   18.25  9.13   0.73  0.528
Error       5   62.63  12.53
Total       7   80.88

Source  DF  Seq SS
x1      1   3.13
x2      1   15.13
Regress Y on X2 and X1

The regression equation is y = 57.0 - 1.38 x2 - 0.63 x1

Predictor   Coef     SE Coef  T      P
Constant    57.000   8.486    6.72   0.001
x2          -1.375   1.251    -1.10  0.322
x1          -0.625   1.251    -0.50  0.639

Analysis of Variance
Source      DF  SS     MS     F     P
Regression  2   18.25  9.13   0.73  0.528
Error       5   62.63  12.53
Total       7   80.88

Source  DF  Seq SS
x2      1   15.13
x1      1   3.13
If predictors are perfectly uncorrelated, then…
• You get the same slope estimates regardless of the first-order regression model used.
• That is, the effect on the response ascribed to a predictor doesn’t depend on the other predictors in the model.
• The sum of squares SSR(X1) is the same as the sequential sum of squares SSR(X1|X2).
• The sum of squares SSR(X2) is the same as the sequential sum of squares SSR(X2|X1).
• That is, the marginal contribution of one predictor variable in reducing the error sum of squares doesn’t depend on the other predictors in the model.
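The claims above can be checked directly on the eight-observation toy data set. The following is a minimal sketch, assuming NumPy is available; the helper name `fit` is made up for this illustration and is not part of the original slides.

```python
import numpy as np

# Toy data from the slides: x1 and x2 form a balanced 2 x 2 layout,
# so their sample correlation is exactly zero.
x1 = np.array([2, 2, 2, 2, 4, 4, 4, 4], dtype=float)
x2 = np.array([5, 5, 7, 7, 5, 5, 7, 7], dtype=float)
y = np.array([52, 43, 49, 46, 50, 48, 44, 43], dtype=float)

def fit(y, *predictors):
    """Least-squares fit with an intercept; returns (coefficients, SSR)."""
    X = np.column_stack([np.ones_like(y), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ coef
    ssr = float(np.sum((fitted - y.mean()) ** 2))  # regression sum of squares
    return coef, ssr

b_x1, ssr_x1 = fit(y, x1)
b_x2, ssr_x2 = fit(y, x2)
b_both, ssr_both = fit(y, x1, x2)

print("slope of x1 alone / with x2:", b_x1[1], b_both[1])
print("slope of x2 alone / with x1:", b_x2[1], b_both[2])
print("SSR(x1) =", ssr_x1, " SSR(x1|x2) =", ssr_both - ssr_x2)
print("SSR(x2) =", ssr_x2, " SSR(x2|x1) =", ssr_both - ssr_x1)
```

Each sequential sum of squares is computed as the difference between the regression sum of squares of the two-predictor model and that of the one-predictor model, which is why it can be compared directly with the single-predictor SSR.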
Same effects for “real data” with nearly uncorrelated predictors?
In the blood pressure data, BSA and Stress are nearly uncorrelated (r = 0.018).
Regress BP on Stress
The regression equation is BP = 113 + 0.0240 Stress

Predictor   Coef      SE Coef   T      P
Constant    112.720   2.193     51.39  0.000
Stress      0.02399   0.03404   0.70   0.490

S = 5.502   R-Sq = 2.7%   R-Sq(adj) = 0.0%

Analysis of Variance
Source      DF  SS      MS     F     P
Regression  1   15.04   15.04  0.50  0.490
Error       18  544.96  30.28
Total       19  560.00
Regress BP on BSA
The regression equation is BP = 45.2 + 34.4 BSA

Predictor   Coef     SE Coef  T     P
Constant    45.183   9.392    4.81  0.000
BSA         34.443   4.690    7.34  0.000

S = 2.790   R-Sq = 75.0%   R-Sq(adj) = 73.6%

Analysis of Variance
Source      DF  SS      MS      F      P
Regression  1   419.86  419.86  53.93  0.000
Error       18  140.14  7.79
Total       19  560.00
Regress BP on BSA and Stress

The regression equation is BP = 44.2 + 34.3 BSA + 0.0217 Stress

Predictor   Coef      SE Coef   T     P
Constant    44.245    9.261     4.78  0.000
BSA         34.334    4.611     7.45  0.000
Stress      0.02166   0.01697   1.28  0.219

Analysis of Variance
Source      DF  SS      MS      F      P
Regression  2   432.12  216.06  28.72  0.000
Error       17  127.88  7.52
Total       19  560.00

Source  DF  Seq SS
BSA     1   419.86
Stress  1   12.26
Regress BP on Stress and BSA

The regression equation is BP = 44.2 + 0.0217 Stress + 34.3 BSA

Predictor   Coef      SE Coef   T     P
Constant    44.245    9.261     4.78  0.000
Stress      0.02166   0.01697   1.28  0.219
BSA         34.334    4.611     7.45  0.000

Analysis of Variance
Source      DF  SS      MS      F      P
Regression  2   432.12  216.06  28.72  0.000
Error       17  127.88  7.52
Total       19  560.00

Source  DF  Seq SS
Stress  1   15.04
BSA     1   417.07
If predictors are nearly uncorrelated, then…
• You get similar slope estimates regardless of the first-order regression model used.
• The sum of squares SSR(X1) is similar to the sequential sum of squares SSR(X1|X2).
• The sum of squares SSR(X2) is similar to the sequential sum of squares SSR(X2|X1).
What happens if the predictor variables are highly correlated?
In the blood pressure data, Weight and BSA are highly correlated (r = 0.875).
Regress BP on Weight
The regression equation is BP = 2.21 + 1.20 Weight

Predictor   Coef      SE Coef   T      P
Constant    2.205     8.663     0.25   0.802
Weight      1.20093   0.09297   12.92  0.000

S = 1.740   R-Sq = 90.3%   R-Sq(adj) = 89.7%

Analysis of Variance
Source      DF  SS      MS      F       P
Regression  1   505.47  505.47  166.86  0.000
Error       18  54.53   3.03
Total       19  560.00
Regress BP on BSA
The regression equation is BP = 45.2 + 34.4 BSA

Predictor   Coef     SE Coef  T     P
Constant    45.183   9.392    4.81  0.000
BSA         34.443   4.690    7.34  0.000

S = 2.790   R-Sq = 75.0%   R-Sq(adj) = 73.6%

Analysis of Variance
Source      DF  SS      MS      F      P
Regression  1   419.86  419.86  53.93  0.000
Error       18  140.14  7.79
Total       19  560.00
Regress BP on BSA and Weight

The regression equation is BP = 5.65 + 5.83 BSA + 1.04 Weight

Predictor   Coef     SE Coef  T     P
Constant    5.653    9.392    0.60  0.555
BSA         5.831    6.063    0.96  0.350
Weight      1.0387   0.1927   5.39  0.000

Analysis of Variance
Source      DF  SS      MS      F      P
Regression  2   508.29  254.14  83.54  0.000
Error       17  51.71   3.04
Total       19  560.00

Source  DF  Seq SS
BSA     1   419.86
Weight  1   88.43
Regress BP on Weight and BSA

The regression equation is BP = 5.65 + 1.04 Weight + 5.83 BSA

Predictor   Coef     SE Coef  T     P
Constant    5.653    9.392    0.60  0.555
Weight      1.0387   0.1927   5.39  0.000
BSA         5.831    6.063    0.96  0.350

Analysis of Variance
Source      DF  SS      MS      F      P
Regression  2   508.29  254.14  83.54  0.000
Error       17  51.71   3.04
Total       19  560.00

Source  DF  Seq SS
Weight  1   505.47
BSA     1   2.81
Effect #1 of multicollinearity
When predictor variables are correlated, the regression coefficient of any one variable depends on which other predictor variables are included in the model.
Variables in model   b1 (Weight)   b2 (BSA)
X1                   1.20          ----
X2                   ----          34.4
X1, X2               1.04          5.83
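The size of the change in b1 follows from a standard least-squares identity, sketched here as an aid and not taken from the slides. Write b1* for the slope when Y is regressed on X1 alone, b1 and b2 for the slopes in the two-predictor model, and d21 for the slope from regressing X2 on X1 (d21 is notation introduced only for this illustration):

$$
b_1^{*} = b_1 + b_2 \, d_{21}
$$

With the reported values, 1.201 = 1.039 + 5.831 d21, which implies d21 of roughly 0.028 BSA units per unit of Weight. That figure is back-calculated from the rounded coefficients, not reported in the output. When X1 and X2 are uncorrelated, d21 = 0 and the two slopes for X1 coincide.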
Even correlated predictors not in the model can have an impact!
• Regression of territory sales on territory population, per capita income, etc.
• Against expectation, coefficient of territory population was determined to be negative.
• Competitor’s market penetration, which was strongly positively correlated with territory population, was not included in model.
• But, competitor kept sales down in territories with large populations.
Effect #2 of multicollinearity
When predictor variables are correlated, the marginal contribution of any one predictor variable in reducing the error sum of squares varies, depending on which other variables are already in the model. Here X1 = Weight and X2 = BSA.
SSR(X1) = 505.47
SSR(X1|X2) = 88.43
SSR(X2) = 419.86
SSR(X2|X1) = 2.81
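These four numbers are tied together by the decomposition of the two-predictor regression sum of squares, which holds for either order of entry:

$$
\begin{aligned}
SSR(X_1, X_2) &= SSR(X_1) + SSR(X_2 \mid X_1) = 505.47 + 2.81 = 508.28 \\
              &= SSR(X_2) + SSR(X_1 \mid X_2) = 419.86 + 88.43 = 508.29
\end{aligned}
$$

Both totals agree with the reported regression sum of squares of 508.29 up to rounding. Because Weight and BSA overlap heavily, whichever predictor enters first absorbs most of the shared explanatory power, and the second one gets credit only for what is left.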
Effect #3 of multicollinearity
When predictor variables are correlated, the precision of the estimated regression coefficients decreases as more predictor variables are added to the model.
Variables in model   se(b1) (Weight)   se(b2) (BSA)
X1                   0.093             ----
X2                   ----              4.69
X1, X2               0.193             6.06
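For the two-predictor case, the standard textbook expression for the estimated variance of a slope makes the inflation explicit. The sketch below uses r12 for the sample correlation between X1 and X2, a symbol not used in the slides:

$$
s^2\{b_1\} = \frac{MSE}{(1 - r_{12}^2)\,\sum_i (X_{i1} - \bar{X}_1)^2}
$$

With r12 = 0.875, the factor 1/(1 - r12^2) is about 4.27, so the standard error is inflated by its square root, about 2.07, before allowing for any change in MSE. For Weight, MSE barely changes (3.03 to 3.04) and the observed ratio 0.1927/0.09297 is about 2.07. For BSA, MSE falls from 7.79 to 3.04, which offsets part of the inflation: 2.07 × sqrt(3.04/7.79) is about 1.29, matching 6.063/4.690.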
What is the effect on estimating mean or predicting new response?
[Figure: scatter plot of Weight versus BSA, with the point (BSA = 2, Weight = 92) marked.]
Predictor values        Fit     SE Fit   95.0% CI            95.0% PI
Weight = 92             112.7   0.402    (111.85, 113.54)    (108.94, 116.44)
BSA = 2                 114.1   0.624    (112.76, 115.38)    (108.06, 120.08)
BSA = 2, Weight = 92    112.8   0.448    (111.93, 113.83)    (109.08, 116.68)
Effect #4 of multicollinearity on estimating mean or predicting Y
High multicollinearity among predictor variables does not prevent good, precise predictions of the response (within scope of model).
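The interval widths in the fitted-value table can be reproduced from the reported SE Fit and MSE values. The following is a minimal sketch: the t critical values are hard-coded from a standard t table, and the names `T_CRIT`, `models`, and `half_widths` are made up for this illustration.

```python
import math

# Two-sided 95% t critical values from a standard t table,
# keyed by error degrees of freedom.
T_CRIT = {18: 2.101, 17: 2.110}

# (model, error df, MSE, SE of fit), copied from the output above.
models = [
    ("Weight only", 18, 3.03, 0.402),
    ("BSA only", 18, 7.79, 0.624),
    ("BSA and Weight", 17, 3.04, 0.448),
]

def half_widths(df, mse, se_fit):
    """Half-widths of the 95% CI for the mean response and the 95% PI for a new response."""
    t = T_CRIT[df]
    ci = t * se_fit                         # confidence interval for the mean
    pi = t * math.sqrt(mse + se_fit ** 2)   # prediction interval for a new observation
    return ci, pi

results = {name: half_widths(df, mse, se) for name, df, mse, se in models}
for name, (ci, pi) in results.items():
    print(f"{name:15s} CI half-width {ci:.2f}   PI half-width {pi:.2f}")
```

Adding the highly correlated BSA to the Weight-only model leaves both half-widths nearly unchanged, which is the point of Effect #4.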
What is the effect on tests of individual slopes?

Regress BP on BSA

The regression equation is BP = 45.2 + 34.4 BSA

Predictor   Coef     SE Coef  T     P
Constant    45.183   9.392    4.81  0.000
BSA         34.443   4.690    7.34  0.000

S = 2.790   R-Sq = 75.0%   R-Sq(adj) = 73.6%

Analysis of Variance
Source      DF  SS      MS      F      P
Regression  1   419.86  419.86  53.93  0.000
Error       18  140.14  7.79
Total       19  560.00

Regress BP on Weight

The regression equation is BP = 2.21 + 1.20 Weight

Predictor   Coef      SE Coef   T      P
Constant    2.205     8.663     0.25   0.802
Weight      1.20093   0.09297   12.92  0.000

S = 1.740   R-Sq = 90.3%   R-Sq(adj) = 89.7%

Analysis of Variance
Source      DF  SS      MS      F       P
Regression  1   505.47  505.47  166.86  0.000
Error       18  54.53   3.03
Total       19  560.00

Regress BP on Weight and BSA

The regression equation is BP = 5.65 + 1.04 Weight + 5.83 BSA

Predictor   Coef     SE Coef  T     P
Constant    5.653    9.392    0.60  0.555
Weight      1.0387   0.1927   5.39  0.000
BSA         5.831    6.063    0.96  0.350

Analysis of Variance
Source      DF  SS      MS      F      P
Regression  2   508.29  254.14  83.54  0.000
Error       17  51.71   3.04
Total       19  560.00

Source  DF  Seq SS
Weight  1   505.47
BSA     1   2.81
Effect #5 of multicollinearity on slope tests
When predictor variables are correlated, hypothesis tests for βk = 0 may yield different conclusions depending on which predictor variables are in the model.
Variables in model   b2 (BSA)   se(b2)   t      P-value
X2                   34.4       4.7      7.34   0.000
X1, X2               5.83       6.1      0.96   0.350
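The t statistics in this table are simply the coefficient divided by its standard error, and the t statistic for BSA in the two-predictor model is tied to the sequential sum of squares SSR(X2|X1). The following is a short sketch using only numbers reported in the output; the variable names are made up for this illustration.

```python
# Values copied from the regression output (X1 = Weight, X2 = BSA).
b2_alone, se_alone = 34.443, 4.690   # BSA as the only predictor
b2_joint, se_joint = 5.831, 6.063    # BSA with Weight also in the model
ssr_x2_given_x1 = 2.81               # sequential SS for BSA entered after Weight
mse_joint = 3.04                     # error mean square of the two-predictor model

t_alone = b2_alone / se_alone
t_joint = b2_joint / se_joint

# Partial F statistic for adding BSA to a model that already contains Weight.
# For a single added predictor it equals the square of its t statistic.
partial_f = ssr_x2_given_x1 / mse_joint

print(f"t for BSA alone:       {t_alone:.2f}")
print(f"t for BSA with Weight: {t_joint:.2f}")
print(f"partial F: {partial_f:.3f}   t squared: {t_joint ** 2:.3f}")
```

The slope test for BSA therefore answers a marginal question: does BSA explain anything beyond what Weight already explains? That is why its conclusion changes once Weight is in the model.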
Summary comments
• Tests for slopes should generally be used to answer a scientific question and not for model building purposes.
• Even then, caution should be used when interpreting results when multicollinearity exists. (Think marginal effects.)
Summary comments (cont’d)
• Multicollinearity has little to no effect on estimation of mean response or prediction of future response.
Diagnosing multicollinearity
• Realized effects (changes in coefficients, changes in sequential sums of squares, etc.) of multicollinearity.
• Scatter plot matrices.
• Pairwise correlation coefficients among predictor variables.
• Variance inflation factors (VIF).
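One common way to obtain variance inflation factors is to take the diagonal of the inverse of the predictor correlation matrix. The following is a minimal sketch of that approach, assuming NumPy is available and using the rounded pairwise correlations from the blood pressure example, so the resulting values are approximate. The function name `vif_from_correlations` is made up for this illustration.

```python
import numpy as np

def vif_from_correlations(R):
    """Return the VIFs: the diagonal of the inverse predictor correlation matrix."""
    R = np.asarray(R, dtype=float)
    return np.diag(np.linalg.inv(R))

names = ["Age", "Weight", "BSA", "Duration", "Pulse", "Stress"]

# Lower triangle of the predictor correlation matrix (BP, the response, is excluded).
lower = [
    [1.0],
    [0.407, 1.0],
    [0.378, 0.875, 1.0],
    [0.344, 0.201, 0.131, 1.0],
    [0.619, 0.659, 0.465, 0.402, 1.0],
    [0.368, 0.034, 0.018, 0.312, 0.506, 1.0],
]

# Fill in a full symmetric matrix from the lower triangle.
R = np.eye(len(names))
for i, row in enumerate(lower):
    for j, r in enumerate(row):
        R[i, j] = R[j, i] = r

vifs = vif_from_correlations(R)
for name, v in zip(names, vifs):
    print(f"{name:9s} VIF = {v:.2f}")
```

With only two predictors the VIF reduces to 1/(1 - r^2), so a correlation of 0.875 alone gives a value of about 4.27. A VIF of 1 means a predictor is uncorrelated with all of the others.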