Multicollinearity
S. K. Bhaumik
Issues for discussion
1. Definition
2. Consequences
3. Tests
4. Some Remedial Measures
Reference: Sankar Kumar Bhaumik, Principles of Econometrics: A
Modern Approach Using EViews, Oxford University Press, 2015, Ch. 6
Definition
❖ While estimating multiple regression models, quite
often we obtain unsatisfactory results.
❖ This happens due to high variances and hence high
values of standard errors of the estimated coefficients.
❖ This is possible when there is little variation in
explanatory variables or high inter-correlations
among the explanatory variables or both.
❖ Multicollinearity represents a situation where the
explanatory variables of the multiple regression model
are highly correlated.
❖ So the multicollinearity problem arises only in the case of
multiple regression.
o Multicollinearity represents lack of independent
movement in the sample data on explanatory variables
→ multicollinearity is a feature of the sample data rather
than of the population.
o When multicollinearity is present, it becomes difficult
to disentangle the separate effects of the explanatory
variables on the dependent variable.
Thus, if $Y_i = f(X_{1i}, X_{2i})$, and $X_{1i}$ and $X_{2i}$ are perfectly
correlated, then either could predict $Y_i$ and the other
would become superfluous.
o Usually perfect correlation between the explanatory
variables is not observed → we observe high
correlation among the explanatory variables →
unable to obtain precise estimates for the unknown
parameters.
Consequences of Multicollinearity
The easiest way to understand the consequences of
multicollinearity problem is to compare simple and
multiple regression models.
Consider Simple Regression Models
$$Y_i = \alpha_1 + \beta_{Y1} X_{1i} + \varepsilon_{1i} \qquad (1)$$
$$Y_i = \alpha_2 + \beta_{Y2} X_{2i} + \varepsilon_{2i} \qquad (2)$$
and Multiple Regression Model
$$Y_i = \alpha + \beta_{Y1.2} X_{1i} + \beta_{Y2.1} X_{2i} + \varepsilon_i \qquad (3)$$
Applying the OLS method, we obtain
$$\hat{\beta}_{Y1} = \frac{\sum x_{1i} y_i}{\sum x_{1i}^2} \qquad (4)$$
$$\hat{\beta}_{Y2} = \frac{\sum x_{2i} y_i}{\sum x_{2i}^2} \qquad (5)$$
$$\hat{\beta}_{Y1.2} = \frac{(\sum x_{1i} y_i)(\sum x_{2i}^2) - (\sum x_{2i} y_i)(\sum x_{1i} x_{2i})}{(\sum x_{1i}^2)(\sum x_{2i}^2) - (\sum x_{1i} x_{2i})^2} \qquad (6)$$
$$\hat{\beta}_{Y2.1} = \frac{(\sum x_{2i} y_i)(\sum x_{1i}^2) - (\sum x_{1i} y_i)(\sum x_{1i} x_{2i})}{(\sum x_{1i}^2)(\sum x_{2i}^2) - (\sum x_{1i} x_{2i})^2} \qquad (7)$$
Here $\hat{\beta}_{Y1}$ and $\hat{\beta}_{Y2}$ are the two estimated slope coefficients
from simple regression models (1) and (2) respectively.
$\hat{\beta}_{Y1.2}$ and $\hat{\beta}_{Y2.1}$ respectively are the two estimated partial
regression coefficients from the multiple regression model (3).
The lowercase letters denote deviations from the sample
means of the variables (e.g., $y_i = Y_i - \bar{Y}$).
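A minimal numerical sketch (in Python; the simulated data and coefficient values are illustrative assumptions, not from the text) of how the simple slopes (4)-(5) and the partial slopes (6)-(7) are computed from these deviations:

```python
import numpy as np

# Simulated data with inter-correlated regressors (illustrative values only)
rng = np.random.default_rng(0)
n = 200
X1 = rng.normal(size=n)
X2 = 0.8 * X1 + 0.6 * rng.normal(size=n)           # X1 and X2 are correlated
Y = 1.0 + 2.0 * X1 - 1.5 * X2 + rng.normal(size=n)

# Lowercase letters: deviations from the sample means
y, x1, x2 = Y - Y.mean(), X1 - X1.mean(), X2 - X2.mean()

# Equations (4) and (5): simple-regression slopes
b_Y1 = (x1 @ y) / (x1 @ x1)
b_Y2 = (x2 @ y) / (x2 @ x2)

# Equations (6) and (7): partial-regression slopes of the multiple regression
den = (x1 @ x1) * (x2 @ x2) - (x1 @ x2) ** 2
b_Y1_2 = ((x1 @ y) * (x2 @ x2) - (x2 @ y) * (x1 @ x2)) / den
b_Y2_1 = ((x2 @ y) * (x1 @ x1) - (x1 @ y) * (x1 @ x2)) / den

print(b_Y1, b_Y2)        # each simple slope mixes the effects of X1 and X2
print(b_Y1_2, b_Y2_1)    # partial slopes roughly recover 2.0 and -1.5
```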
We may define
$$\sum x_{1i} y_i = n\, r_{Y1} S_Y S_1$$
$$\sum x_{2i} y_i = n\, r_{Y2} S_Y S_2$$
$$\sum x_{1i} x_{2i} = n\, r_{12} S_1 S_2$$
$$\sum x_{1i}^2 = n S_1^2$$
$$\sum x_{2i}^2 = n S_2^2$$
where $S_1$ = standard deviation of $X_{1i}$;
$S_2$ = standard deviation of $X_{2i}$;
$S_Y$ = standard deviation of $Y_i$;
$r_{Y1}$ = simple correlation between $Y_i$ and $X_{1i}$;
$r_{Y2}$ = simple correlation between $Y_i$ and $X_{2i}$;
$r_{12}$ = simple correlation between $X_{1i}$ and $X_{2i}$;
$n$ = number of observations.
Putting the above values into the formulas for the estimated
coefficients, we get
$$\hat{\beta}_{Y1} = r_{Y1}\,\frac{S_Y}{S_1} \qquad (8)$$
$$\hat{\beta}_{Y2} = r_{Y2}\,\frac{S_Y}{S_2} \qquad (9)$$
$$\hat{\beta}_{Y1.2} = \left(\frac{r_{Y1} - r_{Y2}\, r_{12}}{1 - r_{12}^2}\right)\frac{S_Y}{S_1} \qquad (10)$$
$$\hat{\beta}_{Y2.1} = \left(\frac{r_{Y2} - r_{Y1}\, r_{12}}{1 - r_{12}^2}\right)\frac{S_Y}{S_2} \qquad (11)$$
We now examine various cases depending upon the value
of $r_{12}^2$.
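Before turning to the cases, a quick numerical cross-check may help (a sketch; the function name and inputs are assumptions, not from the text): applying it to any data set, such as the simulated one above, the correlation-form expressions (8)-(11) reproduce the deviation-form slopes of (4)-(7).

```python
import numpy as np

def slopes_from_correlations(Y, X1, X2):
    """Return (beta_Y1, beta_Y2, beta_Y1.2, beta_Y2.1) via equations (8)-(11)."""
    SY, S1, S2 = Y.std(), X1.std(), X2.std()
    rY1 = np.corrcoef(Y, X1)[0, 1]
    rY2 = np.corrcoef(Y, X2)[0, 1]
    r12 = np.corrcoef(X1, X2)[0, 1]
    b_Y1 = rY1 * SY / S1                                      # eq. (8)
    b_Y2 = rY2 * SY / S2                                      # eq. (9)
    b_Y1_2 = (rY1 - rY2 * r12) / (1 - r12 ** 2) * SY / S1     # eq. (10)
    b_Y2_1 = (rY2 - rY1 * r12) / (1 - r12 ** 2) * SY / S2     # eq. (11)
    return b_Y1, b_Y2, b_Y1_2, b_Y2_1
```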
Case I: Absence of multicollinearity ($r_{12}^2 = 0$)
As $r_{12}^2 = 0$, equation (10) collapses to equation (8) and equation
(11) to equation (9) ⟹ we might abandon the multiple regression
and run two separate simple regressions.
However, there are two important points to remember here:
o How to obtain $\hat{\alpha}$?
We compute $\hat{\alpha}$ as
$$\hat{\alpha} = \bar{Y} - \hat{\beta}_{Y1}\bar{X}_1 - \hat{\beta}_{Y2}\bar{X}_2$$
o In the case of simple regressions, the variances of $\hat{\beta}_{Y1}$ and
$\hat{\beta}_{Y2}$ would be upwardly biased (i.e., greater than the variances
of $\hat{\beta}_{Y1.2}$ and $\hat{\beta}_{Y2.1}$).
Therefore, when we are working with multivariate data, it is
advisable to fit multiple regressions.
Case II: Perfect multicollinearity ($r_{12}^2 = 1$)
o As $r_{12}^2 = 1$, we cannot define equations (10) and (11).
o It would not be possible to obtain OLS estimates for the
unknown parameters of the multiple regression model.
Case III: High Degree of Multicollinearity (also called
Imperfect Multicollinearity) ($r_{12}^2 \to 1$)
o If $r_{12}^2$ is close to unity, we say that there is a high degree
of multicollinearity.
o Here it would be possible to perform OLS estimation
of the multiple regression model, but the variances of
the estimates would be very large.
The above point may be explained using the formulas for the
variances of $\hat{\beta}_{Y1.2}$ and $\hat{\beta}_{Y2.1}$:
$$Var(\hat{\beta}_{Y1.2}) = \frac{\sigma^2}{\sum x_{1i}^2\,(1 - r_{12}^2)}$$
$$Var(\hat{\beta}_{Y2.1}) = \frac{\sigma^2}{\sum x_{2i}^2\,(1 - r_{12}^2)}$$
It is clear that
o $r_{12}^2 = 1$ makes the computation undefined; and
o a high value of $r_{12}^2$ (close to 1) makes the variances of
the estimated partial regression coefficients large.
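A small illustration (the correlation values below are arbitrary) of how the factor $1/(1 - r_{12}^2)$ in these formulas inflates the variances as $r_{12}^2$ approaches 1:

```python
# Variance of a partial slope relative to the no-collinearity benchmark
for r12 in (0.0, 0.5, 0.9, 0.99, 0.999):
    inflation = 1.0 / (1.0 - r12 ** 2)
    print(f"r12 = {r12:5.3f}  ->  variance multiplied by {inflation:9.1f}")
```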
Tests for Multicollinearity
❖ Klein’s Rule: Multicollinearity would be regarded as a
problem if $R_Y^2 < R_k^2$.
Here $R_Y^2$ = squared multiple correlation coefficient between $Y_i$
and the explanatory variables $X_{1i}, X_{2i}, \ldots, X_{ki}$; and
$R_k^2$ = squared multiple correlation coefficient between the $k$th
explanatory variable and the other explanatory variables.
Limitation: It cannot always correctly diagnose the presence of
multicollinearity in data. In particular, it has been found that
there are instances where, in spite of $R_Y^2 < R_k^2$, we still have small
variances of the OLS estimates and hence significant t-ratios.
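One possible way to operationalize Klein's rule (a sketch; the helper `r_squared`, the function name, and the data layout are assumptions, not from the text): compute $R_Y^2$ from the full regression and $R_k^2$ from each auxiliary regression, then flag any variable whose $R_k^2$ exceeds $R_Y^2$.

```python
import numpy as np

def r_squared(y, Z):
    """R^2 from an OLS regression of y on Z (a constant column is added here)."""
    Zc = np.column_stack([np.ones(len(y)), Z])
    beta, *_ = np.linalg.lstsq(Zc, y, rcond=None)
    resid = y - Zc @ beta
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

def klein_flags(Y, X):
    """X: (n, k) matrix of explanatory variables. Flag column k if R_k^2 > R_Y^2."""
    RY2 = r_squared(Y, X)
    flags = []
    for k in range(X.shape[1]):
        Rk2 = r_squared(X[:, k], np.delete(X, k, axis=1))
        flags.append(Rk2 > RY2)
    return RY2, flags
```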
❖ The Variance-Inflation Factor (VIF)
We know that multicollinearity produces high variances for the
OLS estimates and hence low t-ratios and insignificant
regression results → one way to understand the presence of
multicollinearity → compare the variances of OLS estimates for
two situations:
(i) where multicollinearity is absent (“ideal situation”); and
(ii) where multicollinearity is present (“observed situation”).
When multicollinearity is present, the variance of the estimated
coefficient of the $k$th explanatory variable is measured by:
$$Var(\hat{\beta}_k) = \frac{\sigma^2}{\sum x_{ki}^2\,(1 - R_k^2)}$$
Under the “ideal situation”, $R_k^2 = 0$, so that
$$Var(\hat{\beta}_k) = \frac{\sigma^2}{\sum x_{ki}^2}$$
The VIF compares these two situations by taking the ratio of the
two variances:
$$VIF(\hat{\beta}_k) = \frac{\sigma^2 / [\sum x_{ki}^2\,(1 - R_k^2)]}{\sigma^2 / \sum x_{ki}^2} = \frac{1}{1 - R_k^2}$$
Now if $VIF(\hat{\beta}_k) = 1$, we say that there is no multicollinearity,
while $VIF(\hat{\beta}_k) > 1$ indicates its presence.
Rule of thumb: $VIF > 10$ ⟹ serious multicollinearity
involving the corresponding explanatory variable.
Two points to note:
(i) VIF values are computed for each of the estimated slope
coefficients.
(ii) VIF values help to identify the multicollinear variables.
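A compact way to obtain the VIFs in practice (a sketch; `X` is an assumed $(n, k)$ matrix of explanatory variables, not from the text) uses the standard result that, for a model with an intercept, $1/(1 - R_k^2)$ equals the $k$th diagonal element of the inverse of the regressors' correlation matrix:

```python
import numpy as np

def vif(X):
    """Return one VIF per column of X (explanatory variables, no constant column)."""
    R = np.corrcoef(X, rowvar=False)      # correlation matrix of the regressors
    return np.diag(np.linalg.inv(R))      # diagonal of R^{-1} gives 1/(1 - R_k^2)
```

Values near 1 suggest little collinearity; by the rule of thumb above, values above 10 point to a serious problem with the corresponding variable.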
Limitations:
1. It comes more as a complaint that things are not
ideal; and
2. $R_k^2$ is not the only factor responsible for inflating
$Var(\hat{\beta}_k)$; the inflation might also be due to a low value of $\sum x_{ki}^2$.
❖ Tolerance
Tolerance is the reciprocal of VIF:
$$tolerance = \frac{1}{VIF} = 1 - R_k^2$$
Obviously, here the rule of thumb is that tolerance values
of 0.10 or less indicate the presence of serious
multicollinearity.
❖ The Condition Number (CN)
While VIF is computed for each of the estimated
coefficients, the condition number is an overall measure.
It conveys the status or condition of the data matrix X.
The formula used to compute the condition number is:
$$CN = \frac{\text{highest eigenvalue of the matrix } X'X}{\text{lowest eigenvalue of the matrix } X'X}$$
Rules of thumb:
o $CN = 1$ ⟹ no multicollinearity
o $1 < CN < 10$ ⟹ multicollinearity is negligible
o $10 \le CN \le 30$ ⟹ moderate to strong multicollinearity
o $CN > 30$ ⟹ severe multicollinearity
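A direct implementation of this ratio (a sketch; whether to rescale the columns of X beforehand is a modelling choice not specified here):

```python
import numpy as np

def condition_number(X):
    """Ratio of the largest to the smallest eigenvalue of X'X."""
    eigvals = np.linalg.eigvalsh(X.T @ X)   # X'X is symmetric, so eigenvalues are real
    return eigvals.max() / eigvals.min()
```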
Limitations of CN:
• It is also more of a complaint that things are not ideal.
• It has been shown that the CN value may change with a
reparametrization of the variables. It may be brought closer
to 1 with a suitable transformation of the variables.
Important points to remember:
✓ The tests of multicollinearity may provide some broad idea
about its presence. But they are of limited use from the
practical point of view.
✓ To understand the severity of the multicollinearity problem,
we should also look into the values of the standard errors and
t-ratios of the estimated coefficients, as well as their statistical
significance.
✓ Apart from inter-correlations amongst the explanatory
variables, one should look into other aspects like the standard
errors of the estimated coefficients, their t-ratios, $R^2$ and
$\bar{R}^2$, the F-ratio, and so on, to assess the usefulness of the
estimated multiple regression models.
Remedial Measures
• Of all econometric problems, multicollinearity is the most serious one
→ no measure can completely remove it when present in data.
• The measures discussed below only attempt to minimize its impact so
that reasonable regression results are obtained.
1. Increasing Sample Size → helps to reduce the severity of
multicollinearity → this becomes clear from the formulas for
variances of the estimates.
When multicollinearity is present, the variance of the estimated
coefficient of the $k$th explanatory variable is given by
$$Var(\hat{\beta}_k) = \frac{\sigma^2}{\sum x_{ki}^2\,(1 - R_k^2)}$$
Now as the sample size increases, $\sum x_{ki}^2$ increases and $Var(\hat{\beta}_k)$
falls, unless the additional observations on $X_{ki}$ are all equal to the
sample mean $\bar{X}_k$ (so that they add nothing to $\sum x_{ki}^2$), which is
most unlikely to happen.
Of course, we are unsure about what will happen to $R_k^2$ when the sample
size increases. But it is possible that with an increasing sample size $R_k^2$
would also fall, which would further reduce $Var(\hat{\beta}_k)$.
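An illustrative simulation of this point (two regressors; all parameter values are arbitrary assumptions): as $n$ grows, $\sum x_{ki}^2$ grows roughly in proportion, so the variance formula above shrinks even though $r_{12}$ stays about the same.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2 = 1.0                                      # assumed error variance
for n in (30, 100, 1000):
    X1 = rng.normal(size=n)
    X2 = 0.9 * X1 + 0.3 * rng.normal(size=n)      # keeps r12 roughly constant across n
    x1 = X1 - X1.mean()
    r12 = np.corrcoef(X1, X2)[0, 1]
    var_b1 = sigma2 / ((x1 @ x1) * (1 - r12 ** 2))
    print(n, round(var_b1, 5))                    # falls roughly like 1/n
```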
2. Transformation of Variables:
• It has been found that the intensity of multicollinearity gets reduced
when transformed variables (ratio, first-difference etc.) are used
instead of variables in ‘levels’.
• For example, in a three-variable model, although the levels of $X_{1t}$
and $X_{2t}$ might be correlated, there is no a priori reason to believe
that the first differences of these variables, $(X_{1t} - X_{1,t-1})$ and
$(X_{2t} - X_{2,t-1})$, would also be correlated. However, there might be
autocorrelation in the first-difference regression model.
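A small simulation of the point about first differences (the trending series and their slopes are illustrative assumptions): two series that share a time trend are highly correlated in levels but far less so after differencing.

```python
import numpy as np

rng = np.random.default_rng(2)
t = np.arange(200)
X1 = 0.5 * t + rng.normal(scale=5.0, size=200)    # two series sharing an upward trend
X2 = 0.4 * t + rng.normal(scale=5.0, size=200)

print(np.corrcoef(X1, X2)[0, 1])                        # close to 1 in levels
print(np.corrcoef(np.diff(X1), np.diff(X2))[0, 1])      # close to 0 after first-differencing
```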
3. Dropping Variables:
• One of the easiest ways to overcome the multicollinearity problem.
• After identification of the multicollinear variables, the researcher
often drops some of them from the model.
• This is justified if the model includes a large number of
explanatory variables, not all of which are important.
• However, the implications of the dropping-variables approach need to
be understood ⟶ it is necessary to clarify when dropping a
variable (or variables) is justified.
Suppose our three-variable multiple regression model is
$$Y_i = \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i \qquad (12)$$
[We exclude the intercept term for the sake of simplicity of discussion.]
Assume: $X_{1i}$ and $X_{2i}$ are highly correlated and we are interested in
$X_{1i}$, so we drop $X_{2i}$ to avoid multicollinearity. Then our model becomes
$$Y_i = \beta_1 X_{1i} + v_i \qquad (13)$$
Equation (12) → the “complete model”
Equation (13) → the “omitted-variable model”
Let the OLS estimate of $\beta_1$ from the complete model be $\hat{\beta}_1$ and that
from the omitted-variable model be $\tilde{\beta}_1$. As regards $\hat{\beta}_1$, we know that
$$E(\hat{\beta}_1) = \beta_1 \quad \text{and} \quad Var(\hat{\beta}_1) = \frac{\sigma^2}{\sum X_{1i}^2\,(1 - r_{12}^2)}$$
For $\tilde{\beta}_1$, we have to compute $E(\tilde{\beta}_1)$ and $Var(\tilde{\beta}_1)$.
$$\tilde{\beta}_1 = \frac{\sum X_{1i} Y_i}{\sum X_{1i}^2}
= \frac{\sum X_{1i}(\beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i)}{\sum X_{1i}^2}
= \beta_1 + \beta_2 \frac{\sum X_{1i} X_{2i}}{\sum X_{1i}^2} + \frac{\sum X_{1i}\varepsilon_i}{\sum X_{1i}^2}$$
Thus,
$$E(\tilde{\beta}_1) = \beta_1 + \beta_2 \frac{\sum X_{1i} X_{2i}}{\sum X_{1i}^2} \qquad [\because E(\varepsilon_i) = 0]$$
This implies that $\tilde{\beta}_1$ is a biased estimate.
$$Var(\tilde{\beta}_1) = E[\tilde{\beta}_1 - E(\tilde{\beta}_1)]^2
= E\left[\frac{\sum X_{1i}\varepsilon_i}{\sum X_{1i}^2}\right]^2
= \frac{1}{(\sum X_{1i}^2)^2}\, E\Big[\big(\textstyle\sum X_{1i}\varepsilon_i\big)^2\Big]
= \frac{\sigma^2}{\sum X_{1i}^2} \qquad [\because E(\varepsilon_i^2) = \sigma^2]$$
It shows that $\tilde{\beta}_1$, the OLS estimate of $\beta_1$ from the
omitted-variable model, is biased but has a smaller
variance than $\hat{\beta}_1$ (whose variance is inflated by the factor $1/(1 - r_{12}^2)$).
Implication →
• On the basis of the unbiasedness property, $\hat{\beta}_1$ seems
preferable when the two variables are highly
correlated.
• On the basis of the minimum variance property, $\tilde{\beta}_1$ seems
preferable.
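A Monte Carlo sketch of this trade-off (all numbers are illustrative assumptions, not from the text): $\hat{\beta}_1$ from the complete model centres on $\beta_1$ but is noisier, while $\tilde{\beta}_1$ from the omitted-variable model is shifted away from $\beta_1$ but varies less.

```python
import numpy as np

rng = np.random.default_rng(3)
n, beta1, beta2 = 50, 1.0, 0.2
X1 = rng.normal(size=n)
X2 = 0.9 * X1 + 0.2 * rng.normal(size=n)               # X1 and X2 highly correlated

hat, tilde = [], []
for _ in range(5000):
    Y = beta1 * X1 + beta2 * X2 + rng.normal(size=n)   # model (12), no intercept
    b = np.linalg.lstsq(np.column_stack([X1, X2]), Y, rcond=None)[0]
    hat.append(b[0])                                   # complete-model estimate of beta1
    tilde.append((X1 @ Y) / (X1 @ X1))                 # omitted-variable estimate (13)

print(np.mean(hat), np.var(hat))       # centred on beta1, larger variance
print(np.mean(tilde), np.var(tilde))   # biased away from beta1, smaller variance
```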
When one estimate is unbiased but does not have minimum
variance, while the other estimate is biased but has
minimum variance, we face a difficult choice problem.
To overcome this problem, we compare the MSE of the
two estimates → this also helps us choose correctly between the
two alternative estimates.
Let us consider the ratio of the two mean-square errors:
$$\frac{MSE(\tilde{\beta}_1)}{MSE(\hat{\beta}_1)}
= \frac{[bias(\tilde{\beta}_1)]^2 + Var(\tilde{\beta}_1)}{[bias(\hat{\beta}_1)]^2 + Var(\hat{\beta}_1)}
= \frac{[bias(\tilde{\beta}_1)]^2}{Var(\hat{\beta}_1)} + \frac{Var(\tilde{\beta}_1)}{Var(\hat{\beta}_1)}
\qquad [\because bias(\hat{\beta}_1) = 0]$$
Here
$$\frac{[bias(\tilde{\beta}_1)]^2}{Var(\hat{\beta}_1)}
= \frac{\beta_2^2\left(\dfrac{\sum X_{1i} X_{2i}}{\sum X_{1i}^2}\right)^2}{\dfrac{\sigma^2}{\sum X_{1i}^2\,(1 - r_{12}^2)}}
= \frac{(\sum X_{1i} X_{2i})^2}{\sum X_{1i}^2 \sum X_{2i}^2}\cdot\frac{\beta_2^2 \sum X_{2i}^2\,(1 - r_{12}^2)}{\sigma^2}
= r_{12}^2\,\frac{\beta_2^2}{Var(\hat{\beta}_2)}
= r_{12}^2\, t_2^2 \qquad (14)$$
Note that in equation (14), $t_2$ is not the estimated but the ‘true’ t-ratio
for $X_{2i}$.
Further,
$$\frac{Var(\tilde{\beta}_1)}{Var(\hat{\beta}_1)}
= \frac{\sigma^2 / \sum X_{1i}^2}{\sigma^2 / [\sum X_{1i}^2\,(1 - r_{12}^2)]}
= 1 - r_{12}^2 \qquad (15)$$
Now using equations (14) and (15), we can write
$$\frac{MSE(\tilde{\beta}_1)}{MSE(\hat{\beta}_1)} = r_{12}^2 t_2^2 + (1 - r_{12}^2) = 1 + r_{12}^2\,(t_2^2 - 1)$$
• Thus, if $t_2^2 < 1$, $MSE(\tilde{\beta}_1) < MSE(\hat{\beta}_1)$ and $\tilde{\beta}_1$ should be
preferred.
• But as $t_2$ is not known, we use $\hat{t}_2$ (i.e., the estimated t-value of
$\hat{\beta}_2$ from the complete model). Then, as an estimate of $\beta_1$, we may use
the conditional-omitted-variable estimator $\tilde{\tilde{\beta}}_1$, which is
defined as
$$\tilde{\tilde{\beta}}_1 =
\begin{cases}
\hat{\beta}_1 \ (\text{the OLS estimate}) & \text{if } |\hat{t}_2| \ge 1 \\
\tilde{\beta}_1 \ (\text{the omitted-variable estimate}) & \text{if } |\hat{t}_2| < 1
\end{cases}$$
Implication → only if the computed t-value for a variable is less
than 1 in absolute value might we drop that variable and accept the OV model.
Otherwise, continue with the complete model.
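A sketch of this decision rule in the simplified no-intercept setup (the function and variable names are assumptions, not from the text): estimate the complete model, compute $\hat{t}_2$, and fall back to the omitted-variable estimate only when $|\hat{t}_2| < 1$.

```python
import numpy as np

def conditional_ov_estimate(Y, X1, X2):
    """Return the conditional-omitted-variable estimate of beta1."""
    X = np.column_stack([X1, X2])
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ (X.T @ Y)                    # complete-model OLS estimates
    resid = Y - X @ beta_hat
    sigma2_hat = (resid @ resid) / (n - k)
    t2_hat = beta_hat[1] / np.sqrt(sigma2_hat * XtX_inv[1, 1])
    if abs(t2_hat) >= 1.0:
        return beta_hat[0]                            # keep the complete model
    return (X1 @ Y) / (X1 @ X1)                       # drop X2: omitted-variable estimate
```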
Limitation of the OV approach → this approach is not
very attractive when only a limited number of variables have been
used in the model.