Detecting and reducing multicollinearity

Detecting multicollinearity

Common methods of detection

• Realized effects of multicollinearity (changes in coefficients, changes in standard errors of coefficients, changes in sequential sums of squares).

• Non-significant t-tests for all of the slopes but a significant overall F-test.

• Significant correlations among pairs of predictor variables (correlations, matrix scatter plots).

• Variance inflation factors (VIF).

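The first variance at issue

For the model:

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_{p-1} x_{i,p-1} + \varepsilon_i$$

the variance of the estimated coefficient bk is:

$$\mathrm{Var}(b_k) = \frac{\sigma^2}{\sum_{i=1}^{n} (x_{ik} - \bar{x}_k)^2} \times \frac{1}{1 - R_k^2}$$

where $R_k^2$ is the R2 value obtained by regressing the kth predictor on the remaining predictors.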

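The second variance at issue

For the model:

$$y_i = \beta_0 + \beta_k x_{ik} + \varepsilon_i$$

the variance of the estimated coefficient bk is:

$$\mathrm{Var}(b_k)_{\min} = \frac{\sigma^2}{\sum_{i=1}^{n} (x_{ik} - \bar{x}_k)^2}$$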

The ratio of the two variances

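$$\frac{\mathrm{Var}(b_k)}{\mathrm{Var}(b_k)_{\min}} = \frac{\dfrac{\sigma^2}{\sum (x_{ik} - \bar{x}_k)^2} \times \dfrac{1}{1 - R_k^2}}{\dfrac{\sigma^2}{\sum (x_{ik} - \bar{x}_k)^2}} = \frac{1}{1 - R_k^2}$$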

Variance inflation factors

The variance inflation factor for the kth predictor is:

$$\mathrm{VIF}_k = \frac{1}{1 - R_k^2}$$

where $R_k^2$ is the R2 value obtained by regressing the kth predictor on the remaining predictors.

Variance inflation factors (VIFk)

• A measure of how much the variance of the estimated regression coefficient bk is “inflated” by the existence of correlation among the predictor variables in the model.

• VIFs exceeding 4 warrant investigation.

• VIFs exceeding 10 are signs of serious multicollinearity.
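As a rough illustration of the definition above, the sketch below computes VIFk by regressing each predictor on the remaining ones with ordinary least squares. The function name `vif` and the simulated predictors are assumptions made for this example; they are not the blood pressure data.

```python
import numpy as np

def vif(X):
    """Return the variance inflation factor of each column of X.

    X is an (n, p-1) array of predictors, without the intercept column.
    """
    X = np.asarray(X, dtype=float)
    n, m = X.shape
    out = np.empty(m)
    for k in range(m):
        target = X[:, k]
        others = np.delete(X, k, axis=1)
        # Regress the kth predictor on the remaining predictors (with intercept)
        design = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r_sq = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[k] = 1.0 / (1.0 - r_sq)
    return out

# Simulated (hypothetical) predictors: x2 is nearly a copy of x1, x3 is unrelated
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

vifs = vif(X)
print(vifs)
```

In this simulated setup the first two predictors should show large VIFs and the third a VIF close to 1, which mirrors the rule-of-thumb thresholds listed above.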

Blood pressure example

[Figure: matrix scatter plot of BP, Age, Weight, BSA, Duration, Pulse, and Stress]

n = 20 hypertensive individuals

p-1 = 6 predictor variables

Blood pressure example

          BP     Age    Weight  BSA    Duration  Pulse
Age       0.659
Weight    0.950  0.407
BSA       0.866  0.378  0.875
Duration  0.293  0.344  0.201   0.131
Pulse     0.721  0.619  0.659   0.465  0.402
Stress    0.164  0.368  0.034   0.018  0.312     0.506

Blood pressure (BP) is the response.

Regress y = BP on all 6 predictors

Predictor  Coef      SE Coef   T      P      VIF
Constant   -12.870   2.557     -5.03  0.000
Age        0.70326   0.04961   14.18  0.000  1.8
Weight     0.96992   0.06311   15.37  0.000  8.4
BSA        3.776     1.580     2.39   0.033  5.3
Dur        0.06838   0.04844   1.41   0.182  1.2
Pulse      -0.08448  0.05161   -1.64  0.126  4.4
Stress     0.005572  0.003412  1.63   0.126  1.8

S = 0.4072 R-Sq = 99.6% R-Sq(adj) = 99.4%

Analysis of Variance

Source          DF  SS       MS      F       P
Regression      6   557.844  92.974  560.64  0.000
Residual Error  13  2.156    0.166
Total           19  560.000

Regress x2 = weight on 5 predictors

Predictor  Coef      SE Coef  T      P      VIF
Constant   19.674    9.465    2.08   0.057
Age        -0.1446   0.2065   -0.70  0.495  1.7
BSA        21.422    3.465    6.18   0.000  1.4
Dur        0.0087    0.2051   0.04   0.967  1.2
Pulse      0.5577    0.1599   3.49   0.004  2.4
Stress     -0.02300  0.01308  -1.76  0.101  1.5

S = 1.725 R-Sq = 88.1% R-Sq(adj) = 83.9%


The variance inflation factor calculated by its definition

$$\frac{\mathrm{Var}(b_k)}{\mathrm{Var}(b_k)_{\min}} = \frac{1}{1 - R_k^2} = \frac{1}{1 - 0.881} = 8.40$$

The variance of the weight coefficient is inflated by a factor of 8.40 due to the existence of correlation among the predictor variables in the model.
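The arithmetic can be checked directly. This short sketch uses R-Sq = 88.1% from the regression of weight on the other five predictors; the variable names are illustrative only.

```python
# R-squared from regressing Weight on the other five predictors (R-Sq = 88.1%)
r_sq_weight = 0.881

# VIF by definition: 1 / (1 - R_k^2)
vif_weight = 1.0 / (1.0 - r_sq_weight)

# A variance inflated by the VIF means a standard error inflated by its square root
se_inflation = vif_weight ** 0.5

print(round(vif_weight, 2), round(se_inflation, 2))
```

Taking the square root puts the inflation on the scale of the standard error, which is the quantity that appears in the t-tests.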

The pairwise correlations

          BP     Age    Weight  BSA    Duration  Pulse
Age       0.659
Weight    0.950  0.407
BSA       0.866  0.378  0.875
Duration  0.293  0.344  0.201   0.131
Pulse     0.721  0.619  0.659   0.465  0.402
Stress    0.164  0.368  0.034   0.018  0.312     0.506

Blood pressure (BP) is the response.

Regress y = BP on age, weight, duration and stress

Predictor  Coef      SE Coef   T      P      VIF
Constant   -15.870   3.195     -4.97  0.000
Age        0.68374   0.06120   11.17  0.000  1.5
Weight     1.03413   0.03267   31.65  0.000  1.2
Dur        0.03989   0.06449   0.62   0.545  1.2
Stress     0.002184  0.003794  0.58   0.573  1.2

S = 0.5505 R-Sq = 99.2% R-Sq(adj) = 99.0%


Reducing data-based multicollinearity

Data-based multicollinearity

• Multicollinearity that results from a poorly designed experiment, reliance on purely observational data, or the inability to manipulate the system on which you collect the data.

Some methods

• Modify the regression model by eliminating one or more predictor variables.

• Collect additional data under different experimental or observational conditions.

(Modified!) Allen Cognitive Level (ACL) Study

• Relationship of ACL test to level of pathology in a set of 23 patients in a hospital psychiatry unit:

– Response y = ACL score

– x1 = vocabulary (Vocab) score on Shipley Institute of Living Scale

– x2 = abstraction (Abstract) score on Shipley Institute of Living Scale

– x3 = score on Symbol-Digit Modalities Test (SDMT)

Allen Cognitive Level (ACL) Study on 23 patients

[Figure: matrix scatter plot of ACL, SDMT, Vocab, and Abstract]

Strong correlation between Vocab and Abstract

[Figure: scatter plot of Vocab versus Abstract]

Pearson correlation of Vocab and Abstract = 0.990

Regress y = ACL on SDMT, Vocab, and Abstract

Predictor  Coef     SE Coef  T      P      VIF
Constant   3.747    1.342    2.79   0.012
SDMT       0.02326  0.01273  1.83   0.083  1.7
Vocab      0.0283   0.1524   0.19   0.855  49.3
Abstract   -0.0138  0.1006   -0.14  0.892  50.6

S = 0.7344 R-Sq = 26.5% R-Sq(adj) = 14.8%

Analysis of Variance

Source          DF  SS       MS      F     P
Regression      3   3.6854   1.2285  2.28  0.112
Residual Error  19  10.2476  0.5393
Total           22  13.9330

Allen Cognitive Level (ACL) Study on 69 patients

[Figure: matrix scatter plot of ACL, SDMT, Vocab, and Abstract]

Plot after having collected more data

[Figure: scatter plot of Vocab versus Abstract]

Pearson correlation of Vocab and Abstract = 0.698

Regress y = ACL on SDMT, Vocab, and Abstract

Predictor  Coef      SE Coef   T      P      VIF
Constant   3.9463    0.3381    11.67  0.000
SDMT       0.027404  0.007168  3.82   0.000  1.6
Vocab      -0.01740  0.01808   -0.96  0.339  2.1
Abstract   0.01218   0.01159   1.05   0.297  2.2

S = 0.6878 R-Sq = 28.6% R-Sq(adj) = 25.3%

Analysis of Variance

Source          DF  SS       MS      F     P
Regression      3   12.3009  4.1003  8.67  0.000
Residual Error  65  30.7487  0.4731
Total           68  43.0496
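A back-of-the-envelope check shows why the VIFs dropped so sharply. When a model contains only two predictors, the R2 from regressing one on the other is the squared pairwise correlation, so VIF = 1 / (1 - r^2). The ACL models also include SDMT, so the sketch below is only an approximation to the reported VIFs; the function name is hypothetical.

```python
def vif_from_pairwise_r(r):
    """VIF for a model with exactly two predictors whose correlation is r."""
    return 1.0 / (1.0 - r ** 2)

# Pearson correlations of Vocab and Abstract reported for the two samples
vif_23_patients = vif_from_pairwise_r(0.990)
vif_69_patients = vif_from_pairwise_r(0.698)

print(round(vif_23_patients, 1), round(vif_69_patients, 1))
```

The two approximate values land close to the VIFs reported for Vocab and Abstract in the 23-patient and 69-patient outputs, respectively.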

Reducing structural multicollinearity

In context of polynomial regression models

Structural multicollinearity

• Multicollinearity that is a mathematical artifact caused by creating new predictors from other predictors, such as creating the predictor x² from the predictor x.

Example

• (General research question) What is the impact of exercise on the human immune system?

• (Specific research question) How is the amount of immunoglobulin in blood (y) related to maximal oxygen uptake (x)?

[Figure: scatter plot of immunoglobulin (mg) versus maximal oxygen uptake (ml/kg)]

A quadratic polynomial regression function

$$y_i = \beta_0 + \beta_1 x_i + \beta_{11} x_i^2 + \varepsilon_i$$

where:

• yi = amount of immunoglobulin in blood (mg)

• xi = maximal oxygen uptake (ml/kg)

• typical assumptions about error terms (“INE”: independent, normally distributed, equal variance)

Estimated quadratic function

[Figure: regression plot of igg versus oxygen with the fitted quadratic curve]

igg = -1464.40 + 88.3071 oxygen - 0.536247 oxygen**2

S = 106.427 R-Sq = 93.8% R-Sq(adj) = 93.3%

Interpretation of the regression coefficients

• If 0 is a possible x value, then b0 is the predicted response at x = 0. Otherwise, b0 has no meaningful interpretation.

• b1 is the slope of the tangent line at x = 0.

• b2 indicates the up/down direction of curve

– b2 < 0 means curve is concave down

– b2 > 0 means curve is concave up
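These readings follow from differentiating the fitted quadratic. The derivation below writes b11 for the quadratic coefficient that the bullets call b2 (a notational assumption, chosen to match the model equation), and uses the fitted igg function together with the sample mean of oxygen, 50.637, reported in this example:

$$\hat{y} = b_0 + b_1 x + b_{11} x^2 \quad\Rightarrow\quad \frac{d\hat{y}}{dx} = b_1 + 2 b_{11} x, \qquad \frac{d^2\hat{y}}{dx^2} = 2 b_{11}$$

$$\left.\frac{d\hat{y}}{dx}\right|_{x=0} = 88.3071, \qquad \left.\frac{d\hat{y}}{dx}\right|_{x=50.637} = 88.3071 - 2(0.536247)(50.637) \approx 34.0$$

The slope at x = 0 is b1 itself, and the sign of the second derivative, 2 b11, fixes the direction of curvature.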

Regress y = igg on oxygen and oxygen²

The regression equation is
igg = - 1464 + 88.3 oxygen - 0.536 oxygensq

Predictor  Coef     SE Coef  T      P      VIF
Constant   -1464.4  411.4    -3.56  0.001
oxygen     88.31    16.47    5.36   0.000  99.9
oxygensq   -0.5362  0.1582   -3.39  0.002  99.9

S = 106.4 R-Sq = 93.8% R-Sq(adj) = 93.3%

Analysis of Variance

Source          DF  SS       MS       F       P
Regression      2   4602211  2301105  203.16  0.000
Residual Error  27  305818   11327
Total           29  4908029

Structural multicollinearity

[Figure: scatter plot of oxygensq versus oxygen]

Pearson correlation of oxygen and oxygensq = 0.995

“Center” the predictors

$$\text{OxCent} = \text{Oxygen} - 50.637$$

$$\text{OxCentSq} = (\text{Oxygen} - 50.637)^2$$

Mean of oxygen = 50.637

oxygen  oxcent   oxcentsq
34.6    -16.037  257.185
45.0    -5.637   31.776
62.3    11.663   136.026
58.9    8.263    68.277
42.5    -8.137   66.211
44.3    -6.337   40.158
67.9    17.263   298.011
58.5    7.863    61.827
35.6    -15.037  226.111
49.6    -1.037   1.075
33.0    -17.637  311.064

Wow! It really works!

[Figure: scatter plot of oxcentsq versus oxcent]

Pearson correlation of oxcent and oxcentsq = 0.219
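The effect of centering can be reproduced in a few lines. The sketch below uses only the eleven oxygen values listed in the table above and centers them at their own mean, so its correlations will not match the full-sample values of 0.995 and 0.219 exactly; the helper `pearson` is written out by hand for this example.

```python
from math import sqrt

def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    s_uv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    s_uu = sum((a - mean_u) ** 2 for a in u)
    s_vv = sum((b - mean_v) ** 2 for b in v)
    return s_uv / sqrt(s_uu * s_vv)

# Only the oxygen values shown in the table, not the full sample
oxygen = [34.6, 45.0, 62.3, 58.9, 42.5, 44.3, 67.9, 58.5, 35.6, 49.6, 33.0]

# Center at the mean of these listed values (an assumption of this sketch)
mean_oxygen = sum(oxygen) / len(oxygen)
oxcent = [x - mean_oxygen for x in oxygen]

r_raw = pearson(oxygen, [x ** 2 for x in oxygen])
r_centered = pearson(oxcent, [x ** 2 for x in oxcent])

print(round(r_raw, 3), round(r_centered, 3))
```

Squaring a predictor whose values are all positive gives a nearly linear function of it; after centering, negative and positive deviations both map to positive squares, which breaks most of that linear association.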

A better quadratic polynomial regression function

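$$y_i = \beta_0^* + \beta_1^* (x_i - \bar{x}) + \beta_{11}^* (x_i - \bar{x})^2 + \varepsilon_i$$

$$y_i = \beta_0^* + \beta_1^* x_i^* + \beta_{11}^* x_i^{*2} + \varepsilon_i$$

where $x_i^* = x_i - \bar{x}$ denotes the centered predictor

and:

• yi = amount of immunoglobulin in blood (mg)

• typical assumptions about error terms (“INE”)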

Regress y = igg on oxcent and oxcent²

The regression equation is
igg = 1632 + 34.0 oxcent - 0.536 oxcentsq

Predictor  Coef     SE Coef  T      P      VIF
Constant   1632.20  29.35    55.61  0.000
oxcent     34.000   1.689    20.13  0.000  1.1
oxcentsq   -0.5362  0.1582   -3.39  0.002  1.1

S = 106.4 R-Sq = 93.8% R-Sq(adj) = 93.3%

Analysis of Variance

Source          DF  SS       MS       F       P
Regression      2   4602211  2301105  203.16  0.000
Residual Error  27  305818   11327
Total           29  4908029

Interpretation of the regression coefficients

• b0 is predicted response at the predictor mean.

• b1 is the estimated slope of the tangent line at the predictor mean; and, often, similar to the estimated slope in the simple model.

• b2 indicates the up/down direction of curve

– b2 < 0 means curve is concave down

– b2 > 0 means curve is concave up

Estimated regression function

[Figure: regression plot of igg versus oxcent with the fitted quadratic curve]

igg = 1632.20 + 33.9995 oxcent - 0.536247 oxcent**2

S = 106.427 R-Sq = 93.8% R-Sq(adj) = 93.3%

Similar estimates of coefficients from first-order linear model

[Figure: regression plot of igg versus oxcent with the fitted first-order line]

igg = 1557.63 + 32.7427 oxcent

S = 124.783 R-Sq = 91.1% R-Sq(adj) = 90.8%

The relationship between the two forms of the model

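Centered model:

$$\hat{y}_i = b_0^* + b_1^* x_i^* + b_{11}^* x_i^{*2}$$

Original model:

$$\hat{y}_i = b_0 + b_1 x_i + b_{11} x_i^2$$

where:

$$b_0 = b_0^* - b_1^* \bar{x} + b_{11}^* \bar{x}^2$$

$$b_1 = b_1^* - 2 b_{11}^* \bar{x}$$

$$b_{11} = b_{11}^*$$

$$\hat{y}_i = 1632.2 + 34.0 x_i^* - 0.5362 x_i^{*2}$$

$$b_0 = 1632.2 - 34(50.637) - 0.5362(50.637)^2 = -1464.4$$

$$b_1 = 34 - 2(-0.5362)(50.637) = 88.3$$

$$b_{11} = -0.5362$$

$$\hat{y}_i = -1464.4 + 88.3 x_i - 0.536 x_i^2$$

Mean of oxygen = 50.637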
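The conversion from the centered fit back to the original scale comes from expanding (x - mean)² and collecting terms. A minimal sketch, assuming the more precise centered estimates printed on the regression plot (1632.20, 33.9995, -0.536247) and a hypothetical helper name:

```python
def centered_to_original(b0_star, b1_star, b11_star, x_bar):
    """Convert coefficients of the centered quadratic to the uncentered form."""
    b0 = b0_star - b1_star * x_bar + b11_star * x_bar ** 2
    b1 = b1_star - 2.0 * b11_star * x_bar
    b11 = b11_star  # the quadratic coefficient is unchanged by centering
    return b0, b1, b11

# Centered estimates and mean of oxygen taken from the output in this example
b0, b1, b11 = centered_to_original(1632.20, 33.9995, -0.536247, 50.637)

print(round(b0, 1), round(b1, 1), round(b11, 4))
```

Up to rounding, the results agree with the coefficients of the uncentered fit, which confirms that centering changes the parameterization and not the fitted curve.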

Model evaluation

[Figure: residuals versus the fitted values (response is igg)]

[Figure: normal probability plot of the residuals (response is igg)]

Model use: What is predicted IgG if maximal oxygen uptake is 90?

There is an even greater danger in extrapolation when modeling data with a polynomial function, because of changes in direction.

Predicted Values for New Observations

New Obs  Fit     SE Fit  95.0% CI          95.0% PI
1        2139.6  219.2   (1689.8, 2589.5)  (1639.6, 2639.7) XX

X denotes a row with X values away from the center
XX denotes a row with very extreme X values

Values of Predictors for New Observations

New Obs  oxcent  oxcentsq
1        39.4    1549
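The fit at oxygen = 90 and the "change in direction" can both be checked from the centered estimates. The sketch below is illustrative: the constant and function names are made up for this example, and the turning point is simply the vertex of the fitted parabola.

```python
# Centered quadratic estimates and mean of oxygen from the output in this example
B0_STAR, B1_STAR, B11_STAR = 1632.20, 33.9995, -0.536247
MEAN_OXYGEN = 50.637

def predict_igg(oxygen):
    """Predicted igg from the centered quadratic fit."""
    oxcent = oxygen - MEAN_OXYGEN
    return B0_STAR + B1_STAR * oxcent + B11_STAR * oxcent ** 2

fit_at_90 = predict_igg(90.0)

# Vertex of the fitted parabola, expressed on the original oxygen scale
turning_point = MEAN_OXYGEN - B1_STAR / (2.0 * B11_STAR)

print(round(fit_at_90, 1), round(turning_point, 1))
```

The fitted curve reaches its maximum at an oxygen value between 82 and 83 and declines after that, so a prediction at 90 sits on a downward branch lying well beyond the oxygen values shown in the plots (roughly 30 to 70), where no data support that shape.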

The hierarchical approach to model fitting

A widely accepted approach is to fit a higher-order model and then explore whether a lower-order (simpler) model is adequate.

$$Y_i = \beta_0 + \beta_1 x_i + \beta_{11} x_i^2 + \beta_{111} x_i^3 + \varepsilon_i$$

Is a first-order linear model (“line”) adequate?

$$H_0: \beta_{11} = \beta_{111} = 0$$
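One standard way to test a hypothesis of this kind is the general linear F-test, which compares the error sum of squares of the reduced (first-order) model with that of the full (cubic) model. The slides do not spell out the statistic; the notation SSE(R) and SSE(F) below is introduced here as an assumption:

$$F^* = \frac{\bigl[\mathrm{SSE}(R) - \mathrm{SSE}(F)\bigr] / \bigl[(n - 2) - (n - 4)\bigr]}{\mathrm{SSE}(F) / (n - 4)}$$

If the null hypothesis holds and the usual error assumptions are met, F* follows an F distribution with 2 numerator and n - 4 denominator degrees of freedom, so a large F* is evidence that the line alone is not adequate.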

The hierarchical approach to model fitting

But then … if a polynomial term of a given order is retained, then all related lower-order terms are also retained.

That is, if a quadratic term was significant, you would use this regression function:

$$E(Y_i) = \beta_0 + \beta_1 x_i + \beta_{11} x_i^2$$

and not this one:

$$E(Y_i) = \beta_0 + \beta_{11} x_i^2$$
