Multicollinearity

Page 1: Multicollinearity

Multicollinearity

Page 2: Multicollinearity

Multicollinearity

• Multicollinearity (or intercorrelation) exists when at least some of the predictor variables are correlated among themselves.

• In observational studies, multicollinearity happens more often than not.

• So, we need to understand the effects of multicollinearity on regression analyses.

Page 3: Multicollinearity

Example #1

[Figure: matrix plot (scatter plot matrix) of BP, Age, Weight, BSA, Duration, Pulse, and Stress]

n = 20 hypertensive individuals

p-1 = 6 predictor variables

Page 4: Multicollinearity

Example #1

Pairwise correlations:

            BP      Age     Weight  BSA     Duration  Pulse
Age         0.659
Weight      0.950   0.407
BSA         0.866   0.378   0.875
Duration    0.293   0.344   0.201   0.131
Pulse       0.721   0.619   0.659   0.465   0.402
Stress      0.164   0.368   0.034   0.018   0.312     0.506

Blood pressure (BP) is the response.

Page 5: Multicollinearity

What is the effect on regression analyses if the predictors are perfectly uncorrelated?

x1   x2   y
2    5    52
2    5    43
2    7    49
2    7    46
4    5    50
4    5    48
4    7    44
4    7    43

Pearson correlation of x1 and x2 = 0.000

Page 6: Multicollinearity

Regress Y on X1

The regression equation is y = 48.8 - 0.63 x1

Predictor   Coef     SE Coef   T       P
Constant    48.750   4.025     12.11   0.000
x1          -0.625   1.273     -0.49   0.641

Analysis of Variance
Source       DF   SS      MS      F      P
Regression   1    3.13    3.13    0.24   0.641
Error        6    77.75   12.96
Total        7    80.88

Page 7: Multicollinearity

Regress Y on X2

The regression equation is y = 55.1 - 1.38 x2

Predictor   Coef     SE Coef   T       P
Constant    55.125   7.119     7.74    0.000
x2          -1.375   1.170     -1.17   0.285

Analysis of Variance
Source       DF   SS      MS      F      P
Regression   1    15.13   15.13   1.38   0.285
Error        6    65.75   10.96
Total        7    80.88

Page 8: Multicollinearity

Regress Y on X1 and X2

The regression equation is y = 57.0 - 0.63 x1 - 1.38 x2

Predictor   Coef     SE Coef   T       P
Constant    57.000   8.486     6.72    0.001
x1          -0.625   1.251     -0.50   0.639
x2          -1.375   1.251     -1.10   0.322

Analysis of Variance
Source       DF   SS      MS      F      P
Regression   2    18.25   9.13    0.73   0.528
Error        5    62.63   12.53
Total        7    80.88

Source   DF   Seq SS
x1       1    3.13
x2       1    15.13

Page 9: Multicollinearity

Regress Y on X2 and X1

The regression equation is y = 57.0 - 1.38 x2 - 0.63 x1

Predictor   Coef     SE Coef   T       P
Constant    57.000   8.486     6.72    0.001
x2          -1.375   1.251     -1.10   0.322
x1          -0.625   1.251     -0.50   0.639

Analysis of Variance
Source       DF   SS      MS      F      P
Regression   2    18.25   9.13    0.73   0.528
Error        5    62.63   12.53
Total        7    80.88

Source   DF   Seq SS
x2       1    15.13
x1       1    3.13

Page 10: Multicollinearity

If predictors are perfectly uncorrelated, then…

• You get the same slope estimates regardless of the first-order regression model used.

• That is, the effect on the response ascribed to a predictor doesn’t depend on the other predictors in the model.
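The claim can be checked directly on the eight observations from the uncorrelated example. The following is a minimal sketch in plain Python; the helper names (`cross`, `simple_slope`, `two_predictor_slopes`) are illustrative assumptions, and the two-predictor slopes use the standard closed-form solution of the normal equations rather than any particular statistics package.

```python
# Data from the example with perfectly uncorrelated predictors.
x1 = [2, 2, 2, 2, 4, 4, 4, 4]
x2 = [5, 5, 7, 7, 5, 5, 7, 7]
y = [52, 43, 49, 46, 50, 48, 44, 43]


def mean(v):
    return sum(v) / len(v)


def cross(u, v):
    """Sum of cross-products of deviations from the means."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))


def simple_slope(x, y):
    """Least-squares slope of y on a single predictor x."""
    return cross(x, y) / cross(x, x)


def two_predictor_slopes(x1, x2, y):
    """Least-squares slopes of y on x1 and x2 (closed form for two predictors)."""
    s11, s22, s12 = cross(x1, x1), cross(x2, x2), cross(x1, x2)
    s1y, s2y = cross(x1, y), cross(x2, y)
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2


# Pearson correlation between the two predictors.
r12 = cross(x1, x2) / (cross(x1, x1) * cross(x2, x2)) ** 0.5

b1_alone = simple_slope(x1, y)
b2_alone = simple_slope(x2, y)
b1_joint, b2_joint = two_predictor_slopes(x1, x2, y)

print("corr(x1, x2):", r12)
print("slope of x1 alone / with x2:", b1_alone, b1_joint)
print("slope of x2 alone / with x1:", b2_alone, b2_joint)
```

Because the cross-product term between x1 and x2 is zero, the two-predictor formulas reduce to the single-predictor ones, which is why the slopes -0.625 and -1.375 appear in every model.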

Page 11: Multicollinearity

If predictors are perfectly uncorrelated, then…

• The sum of squares SSR(X1) is the same as the sequential sum of squares SSR(X1|X2).

• The sum of squares SSR(X2) is the same as the sequential sum of squares SSR(X2|X1).

• That is, the marginal contribution of one predictor variable in reducing the error sum of squares doesn’t depend on the other predictors in the model.
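The same eight observations can be used to confirm the sums-of-squares statement. This is a minimal sketch in plain Python; the function names are illustrative assumptions, and the sequential sum of squares is computed as the difference SSR(X1, X2) - SSR(X2), which is its usual definition.

```python
# Data from the example with perfectly uncorrelated predictors.
x1 = [2, 2, 2, 2, 4, 4, 4, 4]
x2 = [5, 5, 7, 7, 5, 5, 7, 7]
y = [52, 43, 49, 46, 50, 48, 44, 43]


def cross(u, v):
    """Sum of cross-products of deviations from the means."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))


def ssr_one(x, y):
    """Regression sum of squares for y on a single predictor."""
    return cross(x, y) ** 2 / cross(x, x)


def ssr_two(x1, x2, y):
    """Regression sum of squares for y on both predictors."""
    s11, s22, s12 = cross(x1, x1), cross(x2, x2), cross(x1, x2)
    s1y, s2y = cross(x1, y), cross(x2, y)
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    # SSR equals the sum of each slope times its cross-product with y.
    return b1 * s1y + b2 * s2y


ssr_x1 = ssr_one(x1, y)
ssr_x2 = ssr_one(x2, y)
ssr_both = ssr_two(x1, x2, y)

# Sequential (extra) sums of squares.
ssr_x1_given_x2 = ssr_both - ssr_x2
ssr_x2_given_x1 = ssr_both - ssr_x1

print("SSR(X1), SSR(X1|X2):", ssr_x1, ssr_x1_given_x2)
print("SSR(X2), SSR(X2|X1):", ssr_x2, ssr_x2_given_x1)
```

The values agree with the Seq SS lines in the output above (3.13 and 15.13 after rounding), whichever predictor is entered first.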

Page 12: Multicollinearity

Same effects for “real data” with nearly uncorrelated predictors?

            BP      Age     Weight  BSA     Duration  Pulse
Age         0.659
Weight      0.950   0.407
BSA         0.866   0.378   0.875
Duration    0.293   0.344   0.201   0.131
Pulse       0.721   0.619   0.659   0.465   0.402
Stress      0.164   0.368   0.034   0.018   0.312     0.506

Page 13: Multicollinearity

Regress BP on Stress

The regression equation is BP = 113 + 0.0240 Stress

Predictor   Coef      SE Coef   T       P
Constant    112.720   2.193     51.39   0.000
Stress      0.02399   0.03404   0.70    0.490

S = 5.502   R-Sq = 2.7%   R-Sq(adj) = 0.0%

Analysis of Variance
Source       DF   SS       MS      F      P
Regression   1    15.04    15.04   0.50   0.490
Error        18   544.96   30.28
Total        19   560.00

Page 14: Multicollinearity

Regress BP on BSA

The regression equation is BP = 45.2 + 34.4 BSA

Predictor   Coef     SE Coef   T      P
Constant    45.183   9.392     4.81   0.000
BSA         34.443   4.690     7.34   0.000

S = 2.790   R-Sq = 75.0%   R-Sq(adj) = 73.6%

Analysis of Variance
Source       DF   SS       MS       F       P
Regression   1    419.86   419.86   53.93   0.000
Error        18   140.14   7.79
Total        19   560.00

Page 15: Multicollinearity

Regress BP on BSA and Stress

The regression equation is BP = 44.2 + 34.3 BSA + 0.0217 Stress

Predictor   Coef      SE Coef   T      P
Constant    44.245    9.261     4.78   0.000
BSA         34.334    4.611     7.45   0.000
Stress      0.02166   0.01697   1.28   0.219

Analysis of Variance
Source       DF   SS       MS       F       P
Regression   2    432.12   216.06   28.72   0.000
Error        17   127.88   7.52
Total        19   560.00

Source   DF   Seq SS
BSA      1    419.86
Stress   1    12.26

Page 16: Multicollinearity

Regress BP on Stress and BSA

The regression equation is BP = 44.2 + 0.0217 Stress + 34.3 BSA

Predictor   Coef      SE Coef   T      P
Constant    44.245    9.261     4.78   0.000
Stress      0.02166   0.01697   1.28   0.219
BSA         34.334    4.611     7.45   0.000

Analysis of Variance
Source       DF   SS       MS       F       P
Regression   2    432.12   216.06   28.72   0.000
Error        17   127.88   7.52
Total        19   560.00

Source   DF   Seq SS
Stress   1    15.04
BSA      1    417.07

Page 17: Multicollinearity

If predictors are nearly uncorrelated, then…

• You get similar slope estimates regardless of the first-order regression model used.

• The sum of squares SSR(X1) is similar to the sequential sum of squares SSR(X1|X2).

• The sum of squares SSR(X2) is similar to the sequential sum of squares SSR(X2|X1).

Page 18: Multicollinearity

What happens if the predictor variables are highly correlated?

            BP      Age     Weight  BSA     Duration  Pulse
Age         0.659
Weight      0.950   0.407
BSA         0.866   0.378   0.875
Duration    0.293   0.344   0.201   0.131
Pulse       0.721   0.619   0.659   0.465   0.402
Stress      0.164   0.368   0.034   0.018   0.312     0.506

Page 19: Multicollinearity

Regress BP on Weight

The regression equation is BP = 2.21 + 1.20 Weight

Predictor   Coef      SE Coef   T       P
Constant    2.205     8.663     0.25    0.802
Weight      1.20093   0.09297   12.92   0.000

S = 1.740   R-Sq = 90.3%   R-Sq(adj) = 89.7%

Analysis of Variance
Source       DF   SS       MS       F        P
Regression   1    505.47   505.47   166.86   0.000
Error        18   54.53    3.03
Total        19   560.00

Page 20: Multicollinearity

Regress BP on BSA

The regression equation is BP = 45.2 + 34.4 BSA

Predictor   Coef     SE Coef   T      P
Constant    45.183   9.392     4.81   0.000
BSA         34.443   4.690     7.34   0.000

S = 2.790   R-Sq = 75.0%   R-Sq(adj) = 73.6%

Analysis of Variance
Source       DF   SS       MS       F       P
Regression   1    419.86   419.86   53.93   0.000
Error        18   140.14   7.79
Total        19   560.00

Page 21: Multicollinearity

Regress BP on BSA and Weight

The regression equation is BP = 5.65 + 5.83 BSA + 1.04 Weight

Predictor   Coef     SE Coef   T      P
Constant    5.653    9.392     0.60   0.555
BSA         5.831    6.063     0.96   0.350
Weight      1.0387   0.1927    5.39   0.000

Analysis of Variance
Source       DF   SS       MS       F       P
Regression   2    508.29   254.14   83.54   0.000
Error        17   51.71    3.04
Total        19   560.00

Source   DF   Seq SS
BSA      1    419.86
Weight   1    88.43

Page 22: Multicollinearity

Regress BP on Weight and BSA

The regression equation is BP = 5.65 + 1.04 Weight + 5.83 BSA

Predictor   Coef     SE Coef   T      P
Constant    5.653    9.392     0.60   0.555
Weight      1.0387   0.1927    5.39   0.000
BSA         5.831    6.063     0.96   0.350

Analysis of Variance
Source       DF   SS       MS       F       P
Regression   2    508.29   254.14   83.54   0.000
Error        17   51.71    3.04
Total        19   560.00

Source   DF   Seq SS
Weight   1    505.47
BSA      1    2.81

Page 23: Multicollinearity

Effect #1 of multicollinearity

When predictor variables are correlated, the regression coefficient of any one variable depends on which other predictor variables are included in the model.

Here X1 = Weight and X2 = BSA.

Variables in model   b1     b2
X1                   1.20   ----
X2                   ----   34.4
X1, X2               1.04   5.83

Page 24: Multicollinearity

Even correlated predictors not in the model can have an impact!

• Regression of territory sales on territory population, per capita income, etc.

• Contrary to expectation, the coefficient of territory population was estimated to be negative.

• The competitor’s market penetration, which was strongly positively correlated with territory population, was not included in the model.

• But the competitor kept sales down in territories with large populations.
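The mechanism behind this example can be written down for the simplest case of one included predictor and one omitted predictor. The sketch below ignores the other predictors in the sales model, and the symbols are assumptions introduced only for illustration: X1 is territory population, X2 is the competitor's market penetration, b1 and b2 are the slopes when both are in the model, b1* is the slope of X1 when X2 is left out, and d is the least-squares slope from regressing X2 on X1.

```latex
b_1^{*} = b_1 + b_2\, d,
\qquad
d = \frac{\sum_i (X_{i1} - \bar{X}_1)(X_{i2} - \bar{X}_2)}{\sum_i (X_{i1} - \bar{X}_1)^2}
```

With b1 > 0 (population raises sales), b2 < 0 (the competitor lowers sales), and d > 0 (penetration rises with population), the product b2 d is negative, so b1* turns negative whenever that product is larger in magnitude than b1.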

Page 25: Multicollinearity

Effect #2 of multicollinearity

When predictor variables are correlated, the marginal contribution of any one predictor variable in reducing the error sum of squares varies, depending on which other variables are already in the model.

SSR(X1) = 505.47

SSR(X1|X2) = 88.43

SSR(X2) = 419.86

SSR(X2|X1) = 2.81
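These four numbers are tied together by the definition of a sequential sum of squares as a difference of regression sums of squares. As a worked check, taking X1 = Weight and X2 = BSA and using SSR(X1, X2) = 508.29 from the two-predictor output:

```latex
\begin{aligned}
SSR(X_1 \mid X_2) &= SSR(X_1, X_2) - SSR(X_2) = 508.29 - 419.86 = 88.43,\\
SSR(X_2 \mid X_1) &= SSR(X_1, X_2) - SSR(X_1) = 508.29 - 505.47 = 2.82 \approx 2.81.
\end{aligned}
```

The small gap between 2.82 and the reported 2.81 comes from rounding in the printed output.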

Page 26: Multicollinearity

Effect #3 of multicollinearity

When predictor variables are correlated, the precision of the estimated regression coefficients decreases as more predictor variables are added to the model.

Variables in model   se(b1)   se(b2)
X1                   0.093    ----
X2                   ----     4.69
X1, X2               0.193    6.06

Page 27: Multicollinearity

What is the effect on estimating mean or predicting new response?

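[Figure: scatter plot of Weight (vertical axis, about 85 to 100) versus BSA (horizontal axis, about 1.75 to 2.25), with the point (BSA, Weight) = (2, 92) marked]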

Page 28: Multicollinearity

Model         BSA    Weight   Fit     SE Fit   95.0% CI           95.0% PI
Weight only   ----   92       112.7   0.402    (111.85, 113.54)   (108.94, 116.44)
BSA only      2      ----     114.1   0.624    (112.76, 115.38)   (108.06, 120.08)
BSA, Weight   2      92       112.9   0.448    (111.93, 113.83)   (109.08, 116.68)

Effect #4 of multicollinearity on estimating mean or predicting Y

High multicollinearity among predictor variables does not prevent good, precise predictions of the response (within scope of model).
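The fitted values in the table can be reproduced from the three regression equations reported earlier. This is a minimal sketch in plain Python; the function names are illustrative assumptions, and the coefficients are copied from the Coef columns of the Weight-only, BSA-only, and two-predictor outputs.

```python
# Fitted equations, with coefficients taken from the regression output.
def fit_weight_only(weight):
    return 2.205 + 1.20093 * weight


def fit_bsa_only(bsa):
    return 45.183 + 34.443 * bsa


def fit_bsa_and_weight(bsa, weight):
    return 5.653 + 5.831 * bsa + 1.0387 * weight


# The point marked on the scatter plot: BSA = 2, Weight = 92.
bsa, weight = 2.0, 92.0

fit_w = fit_weight_only(weight)
fit_b = fit_bsa_only(bsa)
fit_bw = fit_bsa_and_weight(bsa, weight)

print(f"Weight only:    {fit_w:.2f}")
print(f"BSA only:       {fit_b:.2f}")
print(f"BSA and Weight: {fit_bw:.2f}")
```

The three fits lie within about 1.4 units of each other even though the individual slopes differ sharply between models, which is the point of Effect #4.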

Page 29: Multicollinearity

What is the effect on tests of individual slopes?

The regression equation is BP = 45.2 + 34.4 BSA

Predictor   Coef     SE Coef   T      P
Constant    45.183   9.392     4.81   0.000
BSA         34.443   4.690     7.34   0.000

S = 2.790   R-Sq = 75.0%   R-Sq(adj) = 73.6%

Analysis of Variance
Source       DF   SS       MS       F       P
Regression   1    419.86   419.86   53.93   0.000
Error        18   140.14   7.79
Total        19   560.00

Page 30: Multicollinearity

What is the effect on tests of individual slopes?

The regression equation is BP = 2.21 + 1.20 Weight

Predictor   Coef      SE Coef   T       P
Constant    2.205     8.663     0.25    0.802
Weight      1.20093   0.09297   12.92   0.000

S = 1.740   R-Sq = 90.3%   R-Sq(adj) = 89.7%

Analysis of Variance
Source       DF   SS       MS       F        P
Regression   1    505.47   505.47   166.86   0.000
Error        18   54.53    3.03
Total        19   560.00

Page 31: Multicollinearity

What is the effect on tests of individual slopes?

The regression equation is BP = 5.65 + 1.04 Weight + 5.83 BSA

Predictor   Coef     SE Coef   T      P
Constant    5.653    9.392     0.60   0.555
Weight      1.0387   0.1927    5.39   0.000
BSA         5.831    6.063     0.96   0.350

Analysis of Variance
Source       DF   SS       MS       F       P
Regression   2    508.29   254.14   83.54   0.000
Error        17   51.71    3.04
Total        19   560.00

Source   DF   Seq SS
Weight   1    505.47
BSA      1    2.81

Page 32: Multicollinearity

Effect #5 of multicollinearity on slope tests

When predictor variables are correlated, hypothesis tests for βk = 0 may yield different conclusions depending on which predictor variables are in the model.

Variables in model   b2     se(b2)   t      P-value
X2                   34.4   4.7      7.34   0.000
X1, X2               5.83   6.1      0.96   0.350
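The t statistics in the table are simply each coefficient divided by its standard error, so the change in conclusion can be traced to the two inputs. This is a minimal sketch in plain Python; the function name is an illustrative assumption, and the inputs are the unrounded Coef and SE Coef values for BSA from the output.

```python
def t_statistic(coef, se):
    """t statistic for testing that a slope equals zero."""
    return coef / se


# BSA as the only predictor.
t_bsa_alone = t_statistic(34.443, 4.690)

# BSA with Weight also in the model.
t_bsa_with_weight = t_statistic(5.831, 6.063)

print(f"BSA alone:       t = {t_bsa_alone:.2f}")
print(f"BSA with Weight: t = {t_bsa_with_weight:.2f}")
```

Adding Weight both shrinks the BSA coefficient (Effect #1) and enlarges its standard error (Effect #3), and the two changes together drive t from about 7.3 to below 1.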

Page 33: Multicollinearity

Summary comments

• Tests for slopes should generally be used to answer a scientific question and not for model building purposes.

• Even then, caution should be used when interpreting results when multicollinearity exists. (Think marginal effects.)

Page 34: Multicollinearity

Summary comments (cont’d)

• Multicollinearity has little to no effect on estimation of mean response or prediction of future response.

Page 35: Multicollinearity

Diagnosing multicollinearity

• Realized effects (changes in coefficients, changes in sequential sums of squares, etc.) of multicollinearity.

• Scatter plot matrices.

• Pairwise correlation coefficients among predictor variables.

• Variance inflation factors (VIF).
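For a model with exactly two predictors, the variance inflation factor depends only on their pairwise correlation r and equals 1 / (1 - r^2). The sketch below applies that special case to correlations from the blood pressure example; the function name is an illustrative assumption, and models with more than two predictors need the R-squared from regressing each predictor on all the others instead.

```python
def vif_two_predictors(r):
    """VIF shared by both predictors in a two-predictor model: 1 / (1 - r^2)."""
    if abs(r) >= 1:
        raise ValueError("perfectly correlated predictors: VIF is undefined")
    return 1.0 / (1.0 - r ** 2)


# Pairwise correlations taken from the correlation table.
vif_weight_bsa = vif_two_predictors(0.875)   # Weight and BSA: highly correlated
vif_bsa_stress = vif_two_predictors(0.018)   # BSA and Stress: nearly uncorrelated
vif_uncorrelated = vif_two_predictors(0.0)   # perfectly uncorrelated predictors

print(f"Weight, BSA: VIF = {vif_weight_bsa:.2f}")
print(f"BSA, Stress: VIF = {vif_bsa_stress:.4f}")
print(f"r = 0:       VIF = {vif_uncorrelated:.1f}")
```

A VIF near 4.27 for Weight and BSA means each slope's variance is roughly quadrupled, so its standard error roughly doubles. That is in line with se(b1) rising from 0.093 to 0.193 when BSA joins Weight, a case where the error mean square barely changes (3.03 versus 3.04).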