10/12/2014
MACT 4231: Applied Regression Methods, Prof Hadi, 9.1
Chapter 9. Analysis of Collinear Data
1. What is collinearity?
2. What are the reasons for collinearity?
3. Why is collinearity a problem?
4. How can collinearity be detected?
5. How do we deal with collinearity when it exists? (Chapter 10)
1. What is Collinearity?
When there is a complete absence of linear relationship among the predictor variables, they are said to be orthogonal.
The existence of strong (not necessarily exact) linear relationships among the predictor variables is referred to as collinearity or multicollinearity.
The interpretation of the multiple regression equation depends on the assumption that the predictor variables are not strongly interrelated.
2. Reasons for Collinearity
Collinearity may be present because:
1. The sample data are deficient. In this case, the sample should be improved with additional observations.
2. The interrelationships among the variables are an inherent characteristic of the process under investigation.
3. Why is Collinearity a Problem?
Collinearity affects statistical inference because it inflates the variance of the regression coefficients. When collinearity is present, X^T X is nearly singular, (X^T X)^(-1) becomes very large, Var(β̂) = σ² (X^T X)^(-1) becomes very large, and the t-statistics become very small.
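This inflation can be seen numerically in a minimal sketch with synthetic data (the variable names and the noise scale 0.05 are hypothetical, not from the lecture): when two columns of X are nearly identical, the diagonal entries of (X^T X)^(-1) for those columns, and hence the coefficient variances, blow up.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Hypothetical synthetic predictors: x2 is almost an exact copy of x1.
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)  # nearly collinear with x1
x3 = rng.normal(size=n)                   # unrelated to x1 and x2

X = np.column_stack([np.ones(n), x1, x2, x3])  # design matrix with intercept
XtX_inv = np.linalg.inv(X.T @ X)

# Var(beta_hat) = sigma^2 * (X^T X)^{-1}: the diagonal entries for the
# collinear columns x1 and x2 dwarf the entry for the orthogonal x3,
# so their standard errors are inflated and their t-statistics shrink.
print(np.diag(XtX_inv).round(4))
```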
Example 1: Equal Educational Opportunity
The variables are:
Y: A measure of student achievement
X1: A measure of faculty credentials
X2: The influence of peer group in the school
X3: School facilities

The objective is to evaluate the effect of school inputs (X3) on achievement (Y):

Y = β0 + β1 X1 + β2 X2 + β3 X3 + ε.
Example 1: Equal Educational Opportunity
Dependent variable is: Y
R squared = 20.6%   R squared (adjusted) = 17.0%
s = 2.07 with 70 − 4 = 66 degrees of freedom
F = 5.72

Variable   Coefficient   s.e. of Coeff   t-ratio   prob
Constant   −0.070        0.251           −0.28     0.7810
X1          1.101        1.411            0.78     0.4378
X2          2.322        1.481            1.57     0.1218
X3         −2.281        2.220           −1.03     0.3080
Inconsistent results: the F-test is significant but all regression coefficients are insignificant! Why?
Example 1: Equal Educational Opportunity
The plot of residuals versus predicted values shows no discernible pattern.

[Figure: scatterplot of Residuals versus Predicted values]
Example 1: Equal Educational Opportunity
The culprit here is collinearity! The predictor variables are so highly correlated that each one may serve as a proxy for the others in the regression equation without affecting the total explanatory power.

[Figure: scatterplot matrix of FAM, PEER, and SCHOOL; pairwise correlations 0.96 (FAM, PEER), 0.99 (FAM, SCHOOL), and 0.98 (PEER, SCHOOL)]
4. Detection of Collinearity
Simple Signs of Collinearity:
1. Large pairwise correlation coefficients
2. Coefficients of variables that are expected to be important have small t-values
3. The algebraic signs of the estimated coefficients are different from prior expectations
4. Large changes in the estimated coefficients when a variable is added or deleted
5. Large changes in the estimated coefficients when a data point is altered, added, or dropped
4. Detection of Collinearity
• The source of collinearity may be more subtle than a simple relationship between two variables.
• A linear relation can involve many of the predictor variables.
• It may not be possible to detect such a relationship with the simple correlation coefficients.
Example 2: Advertising Expenditure Data

The variables are:
A: Advertising expenditure in year t
P: Promotion expenditures in year t
E: Sales expense in year t
S: Aggregate sales in year t

We study the effect of A, P, and E on S.

Since these are time series data, we may consider lagged variables. We consider the model:

S_t = β0 + β1 A_t + β2 P_t + β3 E_t + β4 A_(t-1) + β5 P_(t-1) + ε_t.
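Constructing the lagged predictors amounts to shifting each series by one year and dropping the first observation. A sketch of the mechanics with made-up numbers (the actual series are not reproduced here):

```python
import numpy as np

# Made-up yearly series, only to illustrate the lagging mechanics.
A = np.array([1.0, 1.2, 0.9, 1.5, 1.1, 1.3, 0.8, 1.4])
P = np.array([0.3, 0.5, 0.4, 0.6, 0.2, 0.5, 0.7, 0.4])
E = np.array([0.35, 0.40, 0.45, 0.50, 0.55, 0.60, 0.65, 0.70])
S = np.array([20.0, 21.5, 20.8, 23.0, 21.9, 22.7, 21.4, 23.3])

# Row t pairs S_t with A_t, P_t, E_t, A_{t-1}, P_{t-1};
# lagging costs the first year, leaving n - 1 usable rows.
X = np.column_stack([np.ones(len(S) - 1), A[1:], P[1:], E[1:], A[:-1], P[:-1]])
y = S[1:]

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # 6 coefficients: intercept plus the 5 predictors
```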
Example 2: Advertising Expenditure Data

The plot of residuals versus fitted values and the index plot of residuals (Figures 9.5 and 9.6), as well as other plots of the residuals versus the predictor variables (not shown), do not suggest any problems of misspecification.
Example 2: Advertising Expenditure Data

The correlation coefficients between pairs of predictor variables are small.

[Figure: scatterplot matrix of A_t, P_t, E_t, A_(t-1), and P_(t-1); all pairwise correlations are small, the largest being about 0.5]
Example 2: Advertising Expenditure Data

Let us look at the stability of the regression coefficients:

Variable   Model 1    Model 2
Constant   −14.19      10.51
A_t          5.36      (omitted)
P_t          8.37       3.70
E_t         22.52      22.79
A_(t-1)      3.85      −0.77
P_(t-1)      4.12      −0.97
4. Detection of Collinearity
Two methods for measuring collinearity are the variance inflation factors and the condition indices.

1. Variance Inflation Factor:

Let us look at the regression of each of the predictor variables against all the others. Let X_j be the j-th predictor variable and X_[j] be the set of all the other predictor variables.
1. The Variance Inflation Factor

Let R_j² be the square of the multiple correlation coefficient obtained from fitting the models regressing X_j on X_[j], for j = 1, 2, ..., p.

A large R_j² indicates that X_j is highly related to all the other predictor variables; an indication of collinearity.
1. The Variance Inflation Factor

The Variance Inflation Factor for X_j is defined as:

VIF_j = 1 / (1 − R_j²), j = 1, 2, ..., p.

If X_j has a strong linear relationship with the other predictor variables, R_j² would be close to 1, and VIF_j would be large; an indication of collinearity.
1. The Variance Inflation Factor

In the absence of any linear relationship between the predictor variables (i.e., if the predictor variables are orthogonal), R_j² would be 0 and VIF_j would be 1.

The deviation of the VIF_j value from 1 indicates departure from orthogonality and a tendency toward collinearity.
1. The Variance Inflation Factor

As R_j² tends toward 1, indicating the presence of a linear relationship among the predictor variables, VIF_j tends to infinity.

VIF_j also measures the amount by which the variance of β̂_j is increased due to the linear association of X_j with the other predictor variables, relative to the variance that would result if X_j were not linearly related to them.
1. The Variance Inflation Factor
Values of > 10 is often taken as an indication
of collinearity.
may be used to obtain an expression for the
expected squared distance ( ) of the OLS
estimators from their true values:
VIFj
2D
2 2
1
= VIFp
jj
D
VIFj
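The defining recipe translates directly into code. A minimal sketch (the helper name `vif` and the synthetic data are my own, not from the lecture): regress each column on all the others, take R_j², and form 1/(1 − R_j²).

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 is the squared multiple
    correlation from regressing column j of X on all the other columns."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        tss = np.sum((y - y.mean()) ** 2)
        r2 = 1.0 - np.sum(resid ** 2) / tss
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic check: two nearly identical predictors and one orthogonal one.
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
X = np.column_stack([x1,
                     x1 + 0.1 * rng.normal(size=200),  # near-copy of x1
                     rng.normal(size=200)])            # unrelated column
v = vif(X)
print(v.round(1))  # the first two VIFs exceed 10; the third is near 1
```

`v.mean()` then gives the average VIF used as an overall index of collinearity.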
1. The Variance Inflation Factor

It follows that the ratio:

D² / (pσ²) = (1/p) Σ_{j=1}^{p} VIF_j = average VIF,

which shows that the average of the VIFs measures the squared error in the OLS estimators relative to the size of that error if the data were orthogonal (in the orthogonal case every VIF_j = 1, so D² = pσ²). Hence, the average VIF may also be used as an index of collinearity.
Examples 1 and 2

Advertising            EEO
Variable   VIF         Variable   VIF
A_t        37.4        FAM        37.6
P_t        33.5        PEER       30.2
E_t         1.1        SCHOOL     83.2
A_(t-1)    26.6
P_(t-1)    44.1

Mean VIF:  28.5                   50.3
Example: Import Data
Aggregate data concerning import activity in the
French economy for the years 1949–1966.
The variables are
Y = Imports (IMPORT),
X1 = Domestic production (DOPROD),
X2 = Stock formation (STOCK), and
X3 = Domestic consumption (CONSUM)
All measured in billions of francs.
Use Data Desk
2. The Condition Index

Another measure of the overall collinearity of the variables can be obtained by computing the condition indices of the correlation matrix. The j-th condition index is defined by

κ_j = √(λ_1 / λ_j), j = 1, 2, ..., p,

where λ_1 ≥ λ_2 ≥ ... ≥ λ_p are the eigenvalues of the correlation matrix of the predictor variables.
2. The Condition Index

The first condition index, κ_1, is always 1, but the remaining indices are larger than 1.

The largest condition index,

κ_p = √(λ_1 / λ_p),

is known as the condition number of the correlation matrix.
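As a sketch (the function name and synthetic data are hypothetical), the condition indices come straight from the eigenvalues of the correlation matrix:

```python
import numpy as np

def condition_indices(X):
    """kappa_j = sqrt(lambda_1 / lambda_j), with the eigenvalues of the
    correlation matrix of X sorted in descending order."""
    R = np.corrcoef(X, rowvar=False)
    lam = np.sort(np.linalg.eigvalsh(R))[::-1]
    return np.sqrt(lam[0] / lam)

# Synthetic check: one near-exact linear tie between two predictors.
rng = np.random.default_rng(2)
x1 = rng.normal(size=300)
X = np.column_stack([x1,
                     x1 + 0.05 * rng.normal(size=300),  # near-copy of x1
                     rng.normal(size=300)])             # unrelated column
kappa = condition_indices(X)
print(kappa.round(2))  # kappa_1 is always 1; the condition number is >> 10
```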
2. The Condition Index

The harmful effects of collinearity in the data become strong when the condition number exceeds 10 (which means that λ_1 is more than 100 times λ_p).

Some data sets contain more than one subset of collinear variables. The number of these subsets is indicated by the number of large condition indices.
Examples 1 and 2

       EEO                  Advertising
j      λ_j      κ_j         λ_j      κ_j
1      2.952    1           1.701    1
2      0.040    8.59        1.288    1.15
3      0.008    19.26       1.145    1.22
4                           0.859    1.41
5                           0.007    15.59

Example 3: The Import Data (Use Data Desk)
4. Detection of Collinearity
If a data set is found to be collinear, as indicated by large condition indices, the next question is: Which variables are involved in this collinearity? The answer to this question involves the eigenvectors of the correlation matrix.
4. Detection of Collinearity

We have seen that every correlation matrix of p predictor variables has a set of p eigenvalues

λ_1 ≥ λ_2 ≥ ... ≥ λ_p.

For every eigenvalue λ_j, there exists a corresponding eigenvector, V_j, j = 1, 2, ..., p.

The p eigenvectors V_1, V_2, ..., V_p are pairwise orthonormal. That is, V_j^T V_j = 1 and V_i^T V_j = 0, i ≠ j.
4. Detection of Collinearity

The p eigenvectors can be arranged in a p × p matrix as follows: V = (V_1, V_2, ..., V_p), where V^T V = V V^T = I, and I is the identity matrix.

If κ_j > 10, then the variables involved in this set of collinearity are related by Z V_j ≈ 0, where Z is the matrix of standardized predictor variables.
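A sketch of this diagnostic (the three-variable relation below is invented for illustration): take the eigenvector belonging to the smallest eigenvalue; since Z V_j ≈ 0, its large entries name the variables that take part in the near-linear relation.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
x1 = rng.normal(size=n)
x3 = rng.normal(size=n)
x2 = x1 + x3 + 0.02 * rng.normal(size=n)  # hidden relation among the columns

X = np.column_stack([x1, x2, x3])
Z = (X - X.mean(axis=0)) / X.std(axis=0)  # standardized predictors
R = np.corrcoef(X, rowvar=False)

lam, V = np.linalg.eigh(R)  # eigh returns eigenvalues in ascending order
v_small = V[:, 0]           # eigenvector of the smallest eigenvalue

# Z @ v_small is nearly the zero vector; the large entries of v_small
# (here all three) point to the variables involved in the collinearity.
print(v_small.round(3))
print(np.abs(Z @ v_small).max())  # close to zero
```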
Example: Import Data

           V_1      V_2      V_3
λ_j        1.999    0.998    0.003
κ_j        1        1.42     27.26
DOPROD     0.706    0.036    0.707
STOCK      0.044    0.999   −0.007
CONSUM     0.707    0.026   −0.707

Thus we have, from Z V_3 ≈ 0:

0.707 DOPROD − 0.007 STOCK − 0.707 CONSUM ≈ 0,

or DOPROD ≈ CONSUM.
Example: Advertising Data

           V_1      V_2      V_3      V_4      V_5
λ_j        1.701    1.288    1.145    0.859    0.007
κ_j        1        1.15     1.22     1.41     15.59
A_t        0.532    0.024    0.668    0.074    0.514
P_t        0.232    0.825    0.158    0.037    0.489
E_t        0.389    0.022    0.217    0.895    0.010
A_(t-1)    0.395    0.260    0.692    0.338    0.428
P_(t-1)    0.596    0.501    0.057    0.279    0.559
Example: Advertising Data
Thus we have, from Z V_5 ≈ 0:

0.514 A_t + 0.489 P_t + 0.010 E_t + 0.428 A_(t-1) + 0.559 P_(t-1) ≈ 0,

or, since the coefficient of E_t is negligible,

0.514 A_t + 0.489 P_t + 0.428 A_(t-1) + 0.559 P_(t-1) ≈ 0.