10/12/2014
MACT 4231: Applied Regression Methods, Prof Hadi, 9.1
Chapter 9. Analysis of Collinear Data
1. What is collinearity?
2. What are the reasons for collinearity?
3. Why is collinearity a problem?
4. How can collinearity be detected?
5. How do we deal with collinearity when it exists? (Chapter 10)
1. What is Collinearity?
When there is a complete absence of linear relationship among the predictor variables, they are said to be orthogonal.
The existence of strong (not necessarily exact) linear relationships among the predictor variables is referred to as collinearity or multicollinearity.
The interpretation of the multiple regression equation depends on the assumption that the predictor variables are not strongly interrelated.
2. Reasons for Collinearity
Collinearity may be present because:
1. The sample data are deficient. In this case, the sample should be improved with additional observations.
2. The interrelationships among the variables are an inherent characteristic of the process under investigation.
3. Why is Collinearity a Problem?
Collinearity affects statistical inference because it inflates the variance of the regression coefficients. When collinearity is present, X^T X is nearly singular, (X^T X)^(-1) becomes very large, Var(β̂) = σ² (X^T X)^(-1) becomes very large, and the t-statistics become very small.
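This inflation can be seen numerically in a minimal sketch with synthetic data (the variable names and the noise scale 0.05 are hypothetical, not from the lecture): when two columns of X are nearly identical, the diagonal entries of (X^T X)^(-1) for those columns, and hence the coefficient variances, blow up.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Hypothetical synthetic predictors: x2 is almost an exact copy of x1.
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)  # nearly collinear with x1
x3 = rng.normal(size=n)                   # unrelated to x1 and x2

X = np.column_stack([np.ones(n), x1, x2, x3])  # design matrix with intercept
XtX_inv = np.linalg.inv(X.T @ X)

# Var(beta_hat) = sigma^2 * (X^T X)^{-1}: the diagonal entries for the
# collinear columns x1 and x2 dwarf the entry for the orthogonal x3,
# so their standard errors are inflated and their t-statistics shrink.
print(np.diag(XtX_inv).round(4))
```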
Example 1: Equal Educational Opportunity
The variables are:
Y: A measure of student achievement
X1: A measure of faculty credentials
X2: The influence of peer group in the school
X3: School facilities

The objective is to evaluate the effect of school inputs (X3) on achievement (Y):

Y = β0 + β1 X1 + β2 X2 + β3 X3 + ε.
Example 1: Equal Educational Opportunity
Dependent variable is: Y
R squared = 20.6%   R squared (adjusted) = 17.0%
s = 2.07 with 70 − 4 = 66 degrees of freedom
F = 5.72

Variable   Coefficient   s.e. of Coeff   t-ratio   prob
Constant   −0.070        0.251           −0.28     0.7810
X1          1.101        1.411            0.78     0.4378
X2          2.322        1.481            1.57     0.1218
X3         −2.281        2.220           −1.03     0.3080
Inconsistent results: the F-test is significant but all regression coefficients are insignificant! Why?
Example 1: Equal Educational Opportunity
The plot of residuals versus predicted values shows no discernible pattern.

[Figure: scatterplot of Residuals versus Predicted values]
Example 1: Equal Educational Opportunity
The culprit here is collinearity! The predictor variables are so highly correlated that each one may serve as a proxy for the others in the regression equation without affecting the total explanatory power.

[Figure: scatterplot matrix of FAM, PEER, and SCHOOL; pairwise correlations 0.96 (FAM, PEER), 0.99 (FAM, SCHOOL), and 0.98 (PEER, SCHOOL)]
4. Detection of Collinearity
Simple Signs of Collinearity:
1. Large pairwise correlation coefficients
2. Coefficients of variables that are expected to be important have small t-values
3. The algebraic signs of the estimated coefficients are different from prior expectations
4. Large changes in the estimated coefficients when a variable is added or deleted
5. Large changes in the estimated coefficients when a data point is altered, added, or dropped
4. Detection of Collinearity
• The source of collinearity may be more subtle than a simple relationship between two variables.
• A linear relation can involve many of the predictor variables.
• It may not be possible to detect such a relationship with the simple correlation coefficients.
Example 2: Advertising Expenditure Data

The variables are:
A: Advertising expenditure in year t
P: Promotion expenditures in year t
E: Sales expense in year t
S: Aggregate sales in year t

We study the effect of A, P, and E on S.

Since these are time series data, we may consider lagged variables. We consider the model:

S_t = β0 + β1 A_t + β2 P_t + β3 E_t + β4 A_(t-1) + β5 P_(t-1) + ε_t.
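Constructing the lagged predictors amounts to shifting each series by one year and dropping the first observation. A sketch of the mechanics with made-up numbers (the actual series are not reproduced here):

```python
import numpy as np

# Made-up yearly series, only to illustrate the lagging mechanics.
A = np.array([1.0, 1.2, 0.9, 1.5, 1.1, 1.3, 0.8, 1.4])
P = np.array([0.3, 0.5, 0.4, 0.6, 0.2, 0.5, 0.7, 0.4])
E = np.array([0.35, 0.40, 0.45, 0.50, 0.55, 0.60, 0.65, 0.70])
S = np.array([20.0, 21.5, 20.8, 23.0, 21.9, 22.7, 21.4, 23.3])

# Row t pairs S_t with A_t, P_t, E_t, A_{t-1}, P_{t-1};
# lagging costs the first year, leaving n - 1 usable rows.
X = np.column_stack([np.ones(len(S) - 1), A[1:], P[1:], E[1:], A[:-1], P[:-1]])
y = S[1:]

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # 6 coefficients: intercept plus the 5 predictors
```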
Example 2: Advertising Expenditure Data

The plot of residuals versus fitted values and the index plot of residuals (Figures 9.5 and 9.6), as well as other plots of the residuals versus the predictor variables (not shown), do not suggest any problems of misspecification.
Example 2: Advertising Expenditure Data

The correlation coefficients between pairs of predictor variables are small.

[Figure: scatterplot matrix of A_t, P_t, E_t, A_(t-1), and P_(t-1); all pairwise correlations are small, the largest being about 0.5]
Example 2: Advertising Expenditure Data

Let us look at the stability of the regression coefficients:

Variable   Model 1    Model 2
Constant   −14.19      10.51
A_t          5.36      (omitted)
P_t          8.37       3.70
E_t         22.52      22.79
A_(t-1)      3.85      −0.77
P_(t-1)      4.12      −0.97
4. Detection of Collinearity
Two methods for measuring collinearity are the variance inflation factors and the condition indices.

1. Variance Inflation Factor:

Let us look at the regression of each of the predictor variables against all the others. Let X_j be the j-th predictor variable and X_[j] be the set of all the other predictor variables.
1. The Variance Inflation Factor

Let R_j² be the square of the multiple correlation coefficient obtained from fitting the models regressing X_j on X_[j], for j = 1, 2, ..., p.

A large R_j² indicates that X_j is highly related to all the other predictor variables; an indication of collinearity.
1. The Variance Inflation Factor

The Variance Inflation Factor for X_j is defined as:

VIF_j = 1 / (1 − R_j²), j = 1, 2, ..., p.

If X_j has a strong linear relationship with the other predictor variables, R_j² would be close to 1, and VIF_j would be large; an indication of collinearity.
1. The Variance Inflation Factor

In the absence of any linear relationship between the predictor variables (i.e., if the predictor variables are orthogonal), R_j² would be 0 and VIF_j would be 1.

The deviation of the VIF_j value from 1 indicates departure from orthogonality and a tendency toward collinearity.
1. The Variance Inflation Factor

As R_j² tends toward 1, indicating the presence of a linear relationship among the predictor variables, VIF_j tends to infinity.

VIF_j also measures the amount by which the variance of β̂_j is increased due to the linear association of X_j with the other predictor variables, relative to the variance that would result if X_j were not linearly related to them.
1. The Variance Inflation Factor
Values of > 10 is often taken as an indication
of collinearity.
may be used to obtain an expression for the
expected squared distance ( ) of the OLS
estimators from their true values:
VIFj
2D
2 2
1
= VIFp
jj
D
VIFj
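The defining recipe translates directly into code. A minimal sketch (the helper name `vif` and the synthetic data are my own, not from the lecture): regress each column on all the others, take R_j², and form 1/(1 − R_j²).

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 is the squared multiple
    correlation from regressing column j of X on all the other columns."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        tss = np.sum((y - y.mean()) ** 2)
        r2 = 1.0 - np.sum(resid ** 2) / tss
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic check: two nearly identical predictors and one orthogonal one.
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
X = np.column_stack([x1,
                     x1 + 0.1 * rng.normal(size=200),  # near-copy of x1
                     rng.normal(size=200)])            # unrelated column
v = vif(X)
print(v.round(1))  # the first two VIFs exceed 10; the third is near 1
```

`v.mean()` then gives the average VIF used as an overall index of collinearity.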
1. The Variance Inflation Factor

It follows that the ratio:

D² / (pσ²) = (1/p) Σ_{j=1}^{p} VIF_j = average VIF,

which shows that the average of the VIFs measures the squared error in the OLS estimators relative to the size of that error if the data were orthogonal (in the orthogonal case every VIF_j = 1, so D² = pσ²). Hence, the average VIF may also be used as an index of collinearity.
Examples 1 and 2

Advertising            EEO
Variable   VIF         Variable   VIF
A_t        37.4        FAM        37.6
P_t        33.5        PEER       30.2
E_t         1.1        SCHOOL     83.2
A_(t-1)    26.6
P_(t-1)    44.1

Mean VIF:  28.5                   50.3
Example: Import Data
Aggregate data concerning import activity in the
French economy for the years 1949–1966.
The variables are
Y = Imports (IMPORT),
X1 = Domestic production (DOPROD),
X2 = Stock formation (STOCK), and
X3 = Domestic consumption (CONSUM)
All measured in billions of francs.
Use Data Desk
2. The Condition Index

Another measure of the overall collinearity of the variables can be obtained by computing the condition indices of the correlation matrix. The j-th condition index is defined by

κ_j = √(λ_1 / λ_j), j = 1, 2, ..., p,

where λ_1 ≥ λ_2 ≥ ... ≥ λ_p are the eigenvalues of the correlation matrix of the predictor variables.
2. The Condition Index

The first condition index, κ_1, is always 1, but the remaining indices are larger than 1.

The largest condition index,

κ_p = √(λ_1 / λ_p),

is known as the condition number of the correlation matrix.
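As a sketch (the function name and synthetic data are hypothetical), the condition indices come straight from the eigenvalues of the correlation matrix:

```python
import numpy as np

def condition_indices(X):
    """kappa_j = sqrt(lambda_1 / lambda_j), with the eigenvalues of the
    correlation matrix of X sorted in descending order."""
    R = np.corrcoef(X, rowvar=False)
    lam = np.sort(np.linalg.eigvalsh(R))[::-1]
    return np.sqrt(lam[0] / lam)

# Synthetic check: one near-exact linear tie between two predictors.
rng = np.random.default_rng(2)
x1 = rng.normal(size=300)
X = np.column_stack([x1,
                     x1 + 0.05 * rng.normal(size=300),  # near-copy of x1
                     rng.normal(size=300)])             # unrelated column
kappa = condition_indices(X)
print(kappa.round(2))  # kappa_1 is always 1; the condition number is >> 10
```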
2. The Condition Index

The harmful effects of collinearity in the data become strong when the condition number exceeds 10 (which means that λ_1 is more than 100 times λ_p).

Some data sets contain more than one subset of collinear variables. The number of these subsets is indicated by the number of large condition indices.
Examples 1 and 2

       EEO                  Advertising
j      λ_j      κ_j         λ_j      κ_j
1      2.952    1           1.701    1
2      0.040    8.59        1.288    1.15
3      0.008    19.26       1.145    1.22
4                           0.859    1.41
5                           0.007    15.59

Example 3: The Import Data (Use Data Desk)
4. Detection of Collinearity
If a data set is found to be collinear, as indicated by large condition indices, the next question is: Which variables are involved in this collinearity? The answer to this question involves the eigenvectors of the correlation matrix.
4. Detection of Collinearity

We have seen that every correlation matrix of p predictor variables has a set of p eigenvalues

λ_1 ≥ λ_2 ≥ ... ≥ λ_p.

For every eigenvalue λ_j, there exists a corresponding eigenvector, V_j, j = 1, 2, ..., p.

The p eigenvectors V_1, V_2, ..., V_p are pairwise orthonormal. That is, V_j^T V_j = 1 and V_i^T V_j = 0, i ≠ j.
4. Detection of Collinearity

The p eigenvectors can be arranged in a p × p matrix as follows: V = (V_1, V_2, ..., V_p), where V^T V = V V^T = I, and I is the identity matrix.

If κ_j > 10, then the variables involved in this set of collinearity are related by Z V_j ≈ 0, where Z is the matrix of standardized predictor variables.
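A sketch of this diagnostic (the three-variable relation below is invented for illustration): take the eigenvector belonging to the smallest eigenvalue; since Z V_j ≈ 0, its large entries name the variables that take part in the near-linear relation.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
x1 = rng.normal(size=n)
x3 = rng.normal(size=n)
x2 = x1 + x3 + 0.02 * rng.normal(size=n)  # hidden relation among the columns

X = np.column_stack([x1, x2, x3])
Z = (X - X.mean(axis=0)) / X.std(axis=0)  # standardized predictors
R = np.corrcoef(X, rowvar=False)

lam, V = np.linalg.eigh(R)  # eigh returns eigenvalues in ascending order
v_small = V[:, 0]           # eigenvector of the smallest eigenvalue

# Z @ v_small is nearly the zero vector; the large entries of v_small
# (here all three) point to the variables involved in the collinearity.
print(v_small.round(3))
print(np.abs(Z @ v_small).max())  # close to zero
```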
Example: Import Data

           V_1      V_2      V_3
λ_j        1.999    0.998    0.003
κ_j        1        1.42     27.26
DOPROD     0.706    0.036    0.707
STOCK      0.044    0.999   −0.007
CONSUM     0.707    0.026   −0.707

Thus we have, from Z V_3 ≈ 0:

0.707 DOPROD − 0.007 STOCK − 0.707 CONSUM ≈ 0,

or DOPROD ≈ CONSUM.
Example: Advertising Data

           V_1      V_2      V_3      V_4      V_5
λ_j        1.701    1.288    1.145    0.859    0.007
κ_j        1        1.15     1.22     1.41     15.59
A_t        0.532    0.024    0.668    0.074    0.514
P_t        0.232    0.825    0.158    0.037    0.489
E_t        0.389    0.022    0.217    0.895    0.010
A_(t-1)    0.395    0.260    0.692    0.338    0.428
P_(t-1)    0.596    0.501    0.057    0.279    0.559
Example: Advertising Data
Thus we have, from Z V_5 ≈ 0:

0.514 A_t + 0.489 P_t + 0.010 E_t + 0.428 A_(t-1) + 0.559 P_(t-1) ≈ 0,

or, since the coefficient of E_t is negligible,

0.514 A_t + 0.489 P_t + 0.428 A_(t-1) + 0.559 P_(t-1) ≈ 0.