Endogenous explanatory variables
Violation of the assumption that Cov(xi, ui) = 0 has serious consequences
for the OLS estimator
This is one of the key assumptions needed to establish consistency
When one or more of the explanatory variables is correlated with the error
term ui, we have both E(ui|xi) ≠ 0 and E(xiui) ≠ 0, so the OLS estimator
will be both biased and inconsistent
We will consider two situations where this occurs:
- a linear model with Cov(xi, ui) = 0 is the correct specification, but one
or more of the explanatory variables is measured with error
- a linear model with Cov(xi, ui) = 0 is the correct specification, but one or
more of the explanatory variables is not measured at all, and hence omitted
from the model we can estimate
These are simply two examples of cases where we have simultaneity or
endogeneity, i.e. one or more of the explanatory variables is correlated
with the error term
Measurement error/errors-in-variables
A common concern in applied econometrics is that relevant explanatory
variables may be poorly measured
Examples - survey data on households:
- recall bias: how much time did you spend unemployed last year?
- rounding bias: how much money did you spend on food last week?
Illustrate ‘attenuation bias’ for the case of a single explanatory variable,
measured with error
- the OLS estimator is biased towards zero if the explanatory variable is
measured with error
- this bias does not disappear in large samples (OLS is inconsistent)
Note that measurement error in the dependent variable does not lead to
the same bias and inconsistency problems, provided the measurement error
in yi is uncorrelated with (correctly measured) xi
Consider the model with a single explanatory variable and no intercept
y∗i = x∗iβ + ui for i = 1, ..., N
where y∗i and x∗i denote the true values of these variables, that we may not
observe
To simplify, suppose E(ui) = E(x∗i ) = E(y∗i ) = 0 for i = 1, ..., N (original
variables may be expressed as deviations from their sample means)
We focus on large sample properties, and assume that E(x∗iui) = 0 for
i = 1, ..., N , and we have independent observations, so that β̂OLS would be
a consistent estimator of β if we observed the true values of y∗i and x∗i
First consider additive, mean zero measurement error in the dependent
variable only
yi = y∗i + vi ↔ y∗i = yi − vi
yi is the observed value
y∗i is the true value
vi is the measurement error, with E(vi) = 0 for i = 1, ..., N
The true values x∗i are observed
Substituting this expression for y∗i in the true model
(yi − vi) = x∗iβ + ui
or yi = x∗iβ + (ui + vi)
Consistency requires x∗i to be uncorrelated with the error term (ui + vi)
Given E(x∗iui) = 0, the additional requirement is that E(x∗ivi) = 0 for
i = 1, ..., N
That is, the measurement error in the dependent variable is uncorrelated
with the explanatory variable
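The claim above can be checked with a small simulation. This is only a sketch under assumed values: the sample size, the true slope `beta = 2.0`, the normal distributions and the helper `ols_slope` are illustrative choices, not part of the notes.

```python
import numpy as np

def ols_slope(x, y):
    """OLS slope in a regression through the origin: sum(x*y) / sum(x^2)."""
    return float(x @ y / (x @ x))

rng = np.random.default_rng(0)      # fixed seed so the run is reproducible
N, beta = 200_000, 2.0              # assumed sample size and true slope

x_true = rng.normal(0.0, 1.0, N)    # correctly measured regressor x*
u = rng.normal(0.0, 1.0, N)         # model error, independent of x*
v = rng.normal(0.0, 1.5, N)         # measurement error in y, independent of x*

y_true = beta * x_true + u          # true dependent variable y*
y_obs = y_true + v                  # observed dependent variable y = y* + v

slope_true_y = ols_slope(x_true, y_true)
slope_noisy_y = ols_slope(x_true, y_obs)
print(f"slope using y*: {slope_true_y:.3f}, slope using noisy y: {slope_noisy_y:.3f}")
```

Both slopes should be close to the assumed β: the measurement error in yi only adds noise to the error term, so precision falls but consistency is retained.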
Now consider additive, mean zero measurement error in the explanatory
variable (only)
xi = x∗i + ei ↔ x∗i = xi − ei
Substituting for x∗i in the true model
y∗i = (xi − ei)β + ui
or y∗i = xiβ + (ui − eiβ)
The OLS estimator of β here is biased and inconsistent
- for a given value of x∗i, observed xi and the measurement error ei are
positively correlated, which implies non-zero correlation between xi and the
error term in this model (ui − eiβ)
y∗i = xiβ + (ui − eiβ)
For β > 0, this implies a negative correlation between xi and (ui − eiβ)
For β < 0, this implies a positive correlation between xi and (ui − eiβ)
For β > 0, the OLS estimator of β will be biased downwards
For β < 0, the OLS estimator of β will be biased upwards
In either case, the OLS estimator of β will be biased towards zero
- this is known as ‘attenuation bias’
To analyse this further, we invoke the classical errors-in-variables assumptions
(for i = 1, ..., N)
- $E(x_i^* e_i) = 0$: measurement error is uncorrelated with the true value $x_i^*$
- $E(u_i e_i) = 0$: measurement error is uncorrelated with the true model error $u_i$
- $V(e_i) = \sigma_e^2$: measurement error is homoskedastic
- $V(x_i^*) = \sigma_{x^*}^2$: population variance of the true $x_i^*$ exists and is finite
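Now
$$\hat\beta_{OLS} = (X'X)^{-1}X'y^* = \frac{\sum_{i=1}^{N} x_i y_i^*}{\sum_{i=1}^{N} x_i^2} = \frac{\frac{1}{N}\sum_{i=1}^{N} x_i y_i^*}{\frac{1}{N}\sum_{i=1}^{N} x_i^2}$$
Using xi = x∗i + ei and y∗i = x∗iβ + ui together with the above assumptions,
we obtain
$$\begin{aligned}
\operatorname*{plim}_{N\to\infty} \hat\beta_{OLS}
&= \frac{\operatorname{plim} \frac{1}{N}\sum_{i=1}^{N} (x_i^* + e_i)(x_i^*\beta + u_i)}{\operatorname{plim} \frac{1}{N}\sum_{i=1}^{N} (x_i^* + e_i)^2} \\
&= \frac{\left(\operatorname{plim} \frac{1}{N}\sum_{i=1}^{N} x_i^{*2}\right)\beta + \operatorname{plim} \frac{1}{N}\sum_{i=1}^{N} x_i^* u_i + \left(\operatorname{plim} \frac{1}{N}\sum_{i=1}^{N} x_i^* e_i\right)\beta + \operatorname{plim} \frac{1}{N}\sum_{i=1}^{N} u_i e_i}{\operatorname{plim} \frac{1}{N}\sum_{i=1}^{N} x_i^{*2} + 2\operatorname{plim} \frac{1}{N}\sum_{i=1}^{N} x_i^* e_i + \operatorname{plim} \frac{1}{N}\sum_{i=1}^{N} e_i^2} \\
&= \frac{E(x_i^{*2})\beta + E(x_i^* u_i) + E(x_i^* e_i)\beta + E(u_i e_i)}{E(x_i^{*2}) + 2E(x_i^* e_i) + E(e_i^2)} \\
&= \frac{E(x_i^{*2})\beta + 0 + 0 + 0}{E(x_i^{*2}) + 0 + E(e_i^2)} \\
&= \left(\frac{\sigma_{x^*}^2}{\sigma_{x^*}^2 + \sigma_e^2}\right)\beta = \frac{\beta}{1 + (\sigma_e^2/\sigma_{x^*}^2)} \neq \beta \quad \text{if } \sigma_e^2 > 0
\end{aligned}$$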
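$$\operatorname*{plim}_{N\to\infty} \hat\beta_{OLS} = \frac{\beta}{1 + (\sigma_e^2/\sigma_{x^*}^2)} < \beta \quad \text{for } \beta > 0 \text{ and } \sigma_e^2 > 0$$
$$\operatorname*{plim}_{N\to\infty} \hat\beta_{OLS} = \frac{\beta}{1 + (\sigma_e^2/\sigma_{x^*}^2)} > \beta \quad \text{for } \beta < 0 \text{ and } \sigma_e^2 > 0$$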
The OLS estimator of β is inconsistent, with a bias towards zero that does
not diminish as the sample becomes large
For given $\sigma_{x^*}^2$, the severity of this ‘attenuation bias’ increases with the
variance of the measurement error ($\sigma_e^2$)
The magnitude of the inconsistency depends inversely on the ‘signal-to-noise’
ratio ($\sigma_{x^*}^2/\sigma_e^2$)
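The probability limit derived above can be compared with simulated OLS slopes. The sketch below assumes normally distributed x∗i, ui and ei, a true slope of 2 and a unit variance for x∗i; the function name `attenuation_demo` and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np

def ols_slope(x, y):
    """OLS slope in a regression through the origin: sum(x*y) / sum(x^2)."""
    return float(x @ y / (x @ x))

def attenuation_demo(beta=2.0, sd_x=1.0, sd_e=1.0, n=200_000, seed=0):
    """Return (simulated OLS slope, theoretical plim) under classical errors-in-variables."""
    rng = np.random.default_rng(seed)
    x_true = rng.normal(0.0, sd_x, n)   # true regressor x*
    u = rng.normal(0.0, 1.0, n)         # model error, independent of x* and e
    e = rng.normal(0.0, sd_e, n)        # measurement error, independent of x* and u
    y = beta * x_true + u               # dependent variable (measured without error)
    x_obs = x_true + e                  # observed regressor x = x* + e
    plim = beta / (1.0 + sd_e**2 / sd_x**2)
    return ols_slope(x_obs, y), plim

# The slope shrinks towards zero as the measurement error variance grows
for sd_e in (0.0, 0.5, 1.0, 2.0):
    slope, plim = attenuation_demo(sd_e=sd_e)
    print(f"sd_e = {sd_e:.1f}: OLS slope = {slope:.3f}, plim formula = {plim:.3f}")
```

With a large sample the simulated slope should sit close to the formula in every row, which illustrates that the bias does not vanish as N grows.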
Under the classical errors-in-variables assumptions with homoskedastic
measurement error, the presence of measurement error affects the estimated slope
parameter, but not the linearity of the relationship between y∗i and observed
xi
With heteroskedastic measurement error, the presence of measurement
error may also introduce an incorrect indication of non-linearity in the
relationship
For example, if β > 0 and V (ei) tends to be larger for individuals with
higher values of x∗i, then estimation of a non-linear relationship between y∗i
and observed xi could give an incorrect indication of a concave relationship
(illustrate)
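One way to illustrate this point is a simulation in which the true relationship is exactly linear but the measurement error variance rises with x∗i. Everything specific here is an assumption made for the sketch: the step-function error variance (standard deviation 0.2 below the mean of x∗i and 1.0 above it), β = 1, and the quadratic specification used to detect curvature.

```python
import numpy as np

def quadratic_fit(x, y):
    """OLS of y on (1, x, x^2); returns the three coefficients."""
    X = np.column_stack([np.ones_like(x), x, x**2])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(0)
N, beta = 200_000, 1.0                 # assumed sample size and true slope

x_true = rng.normal(0.0, 1.0, N)       # true regressor x*
u = rng.normal(0.0, 0.5, N)
y = beta * x_true + u                  # exactly linear in x*

# Assumed heteroskedastic measurement error: larger variance at higher x*
sd_e = np.where(x_true < 0.0, 0.2, 1.0)
x_obs = x_true + sd_e * rng.normal(0.0, 1.0, N)

coef_true = quadratic_fit(x_true, y)   # squared term should be close to zero
coef_obs = quadratic_fit(x_obs, y)     # squared term turns negative (spurious concavity)
print("squared-term coefficient with x*:", round(float(coef_true[2]), 3))
print("squared-term coefficient with observed x:", round(float(coef_obs[2]), 3))
```

In this design the coefficient on the squared term is negative when observed xi is used, even though the true relationship contains no curvature at all.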
Multiple regression with errors in variables
y∗i = x∗′i β + ui
x′i = x∗′i + e′i
where x′i, x∗′i and e′i are 1 × K vectors
As before
y∗i = x′iβ + (ui − e′iβ)
In general, the OLS estimator of the K × 1 vector of parameters β will be
biased and inconsistent, since E[xi(ui − e′iβ)] ≠ 0
If only one of the explanatory variables in xi is measured with error, we
can show that
- the OLS estimator of the coefficient on that variable is biased towards
zero
- the OLS estimators of the coefficients on the other explanatory variables
are also biased, in unknown directions
If several explanatory variables are measured with error, it is very difficult
to sign the biases for any of the coefficients
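A two-regressor simulation shows the first case at work. The design is an assumption made for illustration only: both true coefficients equal 1, the two regressors have correlation 0.6, and only the first is observed with error. The direction of the bias on the correctly measured regressor is specific to this design and would change with other correlations or signs.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients via least squares."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(0)
N = 200_000
beta = np.array([1.0, 1.0])                        # assumed true coefficients

x2 = rng.normal(0.0, 1.0, N)                       # measured without error
x1_true = 0.6 * x2 + 0.8 * rng.normal(0.0, 1.0, N) # true x1*, correlated with x2
u = rng.normal(0.0, 1.0, N)
y = beta[0] * x1_true + beta[1] * x2 + u

x1_obs = x1_true + rng.normal(0.0, 1.0, N)         # x1 observed with classical error

b_true = ols(np.column_stack([x1_true, x2]), y)    # infeasible benchmark
b_obs = ols(np.column_stack([x1_obs, x2]), y)      # regression we can actually run
print("using x1*:         ", np.round(b_true, 3))
print("using mismeasured x1:", np.round(b_obs, 3))
```

Here the coefficient on the mismeasured variable is pulled towards zero, while the coefficient on the correctly measured x2 moves away from its true value because x2 partly stands in for the poorly measured x1.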
Omitted variables
Another common concern in applied econometrics is that relevant explanatory
variables may be omitted from the model
Relevant explanatory variables are often unobserved or unobservable
Example
- survey data on individuals do not contain data on characteristics like
ability or motivation
This may make it difficult to attach causal significance to estimated
parameters in linear regression-type models
Illustrate omitted variable bias for the case of a single included variable
and a single omitted variable
- the OLS estimator is biased if the omitted variable is relevant and
correlated with the included regressor
- this bias does not disappear in large samples (OLS is inconsistent)
- the direction of the bias depends on the sign of the correlation between
the included variable and the omitted variable
Consequently, omitted variables (or ‘unobserved heterogeneity’) present a
formidable challenge to drawing causal inferences from cross-section
regressions
There is a serious danger that observed, included explanatory variables
may just be proxying for unobserved, omitted factors - rather than exerting
a direct, causal influence on the outcome of interest
Note that this problem is not confined to empirical research in economics
Beware of medical studies claiming that some activity will help you live
longer
These claims are often based on cross-section correlations
It is difficult to draw causal conclusions unless we are confident that the
study has controlled for all potentially relevant confounding factors
We first consider the model with one included variable (x1i) and one
omitted variable (x2i)
The true model is
yi = x1iβ1 + x2iβ2 + ui for i = 1, 2, ..., N
satisfying E(ui) = E(x1i) = E(x2i) = 0 and E(x1iui) = E(x2iui) = 0
However the model we estimate excludes x2i
yi = x1iβ1 + (ui + x2iβ2) for i = 1, 2, ..., N
Illustration suggests that the OLS estimator β̂1 in the estimated model
will be a biased and inconsistent estimator of β1 in the true model, in cases
where x2i and x1i are correlated, and where β2 ≠ 0
Stack across the N observations to obtain
y = X1β1 + (u +X2β2) (all vectors are N × 1)
The OLS estimator of β1 is
$$\hat\beta_1 = (X_1'X_1)^{-1}X_1'y$$
Substituting for y = X1β1 + X2β2 + u from the true model
$$\begin{aligned}
\hat\beta_1 &= (X_1'X_1)^{-1}X_1'(X_1\beta_1 + X_2\beta_2 + u) \\
&= \beta_1 + \left[(X_1'X_1)^{-1}X_1'X_2\right]\beta_2 + (X_1'X_1)^{-1}X_1'u \\
&= \beta_1 + \hat\delta\beta_2 + (X_1'X_1)^{-1}X_1'u
\end{aligned}$$
where $\hat\delta = (X_1'X_1)^{-1}X_1'X_2$ is the OLS estimator of the coefficient
in a regression of the omitted variable x2i on the included variable x1i, i.e.
x2i = x1iδ + ei
Taking probability limits, and using E(x1iui) = 0, we obtain
$$\operatorname*{plim}_{N\to\infty} \hat\beta_1 = \beta_1 + \left(\operatorname*{plim}_{N\to\infty} \hat\delta\right)\beta_2$$
The OLS estimator of β1 in the model that omits x2i is inconsistent unless
- either plim_{N→∞} δ̂ = 0 (the omitted variable is orthogonal to the included
variable)
- or β2 = 0 (the omitted variable is not a relevant explanatory variable in
the true model)
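The omitted variable formula can be checked numerically. The sketch below assumes β1 = 1, β2 = 2 and a population projection coefficient δ = 0.5, with all variables drawn from normal distributions; these values and the helper `ols_slope` are illustrative assumptions.

```python
import numpy as np

def ols_slope(x, y):
    """OLS slope in a regression through the origin: sum(x*y) / sum(x^2)."""
    return float(x @ y / (x @ x))

rng = np.random.default_rng(0)
N = 200_000
beta1, beta2, delta = 1.0, 2.0, 0.5      # assumed true parameters

x1 = rng.normal(0.0, 1.0, N)             # included variable
x2 = delta * x1 + rng.normal(0.0, 1.0, N)  # omitted variable, correlated with x1
u = rng.normal(0.0, 1.0, N)
y = beta1 * x1 + beta2 * x2 + u          # true model

b1_short = ols_slope(x1, y)              # regression that omits x2
delta_hat = ols_slope(x1, x2)            # regression of omitted on included variable
predicted = beta1 + delta_hat * beta2    # beta1 + delta_hat * beta2

print(f"OLS estimate omitting x2: {b1_short:.3f}")
print(f"beta1 + delta_hat * beta2: {predicted:.3f}")
```

Both numbers should be close to β1 + δβ2 = 2 rather than to the true β1 = 1; setting either `delta` or `beta2` to zero in the sketch should bring the estimate back to β1.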
$$\operatorname*{plim}_{N\to\infty} \hat\beta_1 = \beta_1 + \left(\operatorname*{plim}_{N\to\infty} \hat\delta\right)\beta_2$$
Thus if we omit a relevant explanatory variable (β2 ≠ 0), the only case in
which β̂1 remains a consistent estimator of the true, causal parameter β1 is
the case where plim_{N→∞} δ̂ = 0, i.e. where x1i and x2i are uncorrelated
If x1i and x2i are positively correlated, we have plim_{N→∞} δ̂ > 0
If β2 is also positive, we expect an upward bias in the OLS estimator β̂1
Intuitively, the OLS estimator β̂1 picks up an indirect relationship between
x1i and yi, due to the fact that the included x1i proxies for the omitted x2i,
as well as the direct causal effect of x1i on yi at a given level of x2i, measured
by β1 in the true model (cf. illustration)
Conversely if x1i and x2i are negatively correlated (plim_{N→∞} δ̂ < 0) and
β2 > 0,
or if x1i and x2i are positively correlated (plim_{N→∞} δ̂ > 0) and β2 < 0,
we expect a downward bias in the OLS estimator β̂1
Thus if regression models omit relevant (but perhaps unmeasured) explanatory
factors that are correlated with the included (measured) regressors, we
cannot draw causal inferences from the pattern of partial correlations among
the observed variables
Some examples
- do individuals with lots of education tend to have high earnings because
education raises their productivity and wages, or because intrinsically high
ability (high productivity) individuals also tend to have lots of education?
- do countries with high investment tend to have high per capita income
levels because investment raises income, or because (for example) countries
with good institutions tend to have both high investment and high incomes?
In the first example, causality may run from ability to both education and
earnings, rather than from education to earnings
In the second example, causality may run from institutions to both
investment and incomes, rather than from investment to incomes
Since it is very difficult to control adequately for individual ability or the
quality of national institutions, we should be very cautious about drawing
any causal inference from statistically significant coefficients reported in such
cross-section regression studies
Multiple regression with omitted variables
The analysis proceeds in a similar way
y = X1β1 + (X2β2 + u)
where now X1 is N × K1, β1 is K1 × 1, X2 is N × K2 and β2 is K2 × 1 (i.e.
there are K1 included regressors and K2 omitted variables)
As in the simpler example we can obtain
$$\operatorname*{plim}_{N\to\infty} \hat\beta_1 = \beta_1 + \left(\operatorname*{plim}_{N\to\infty} (X_1'X_1)^{-1}X_1'X_2\right)\beta_2$$
Each column of the K1 × K2 matrix $(X_1'X_1)^{-1}X_1'X_2$ is the K1 × 1 vector of
OLS estimates of the coefficients in a multiple regression of the corresponding
column of X2 on all of the included variables in X1
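The matrix of auxiliary regression coefficients is easy to compute directly. The sketch below uses made-up dimensions (K1 = 3, K2 = 2) and made-up coefficients, and checks a closely related in-sample fact rather than the probability limit itself: the coefficients from the regression that omits X2 equal the coefficients on X1 from the full regression plus the auxiliary matrix times the full-regression coefficients on X2, exactly, because the full-regression residuals are orthogonal to X1.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients; y may be a vector or a matrix of several dependent variables."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(0)
N, K1, K2 = 500, 3, 2                       # assumed dimensions

X1 = rng.normal(size=(N, K1))               # included regressors
X2 = X1 @ rng.normal(size=(K1, K2)) + rng.normal(size=(N, K2))  # omitted, correlated with X1
beta1 = np.array([1.0, -0.5, 2.0])          # assumed true coefficients
beta2 = np.array([0.7, -1.2])
y = X1 @ beta1 + X2 @ beta2 + rng.normal(size=N)

Delta_hat = ols(X1, X2)                     # K1 x K2: column j regresses X2[:, j] on X1
b_short = ols(X1, y)                        # regression omitting X2
b_long = ols(np.column_stack([X1, X2]), y)  # regression including X2
b1_long, b2_long = b_long[:K1], b_long[K1:]

identity_gap = np.max(np.abs(b_short - (b1_long + Delta_hat @ b2_long)))
print("Delta_hat shape:", Delta_hat.shape)
print("max gap in the in-sample identity:", identity_gap)
```

The gap should be zero up to floating-point error, which is the finite-sample counterpart of the probability limit expression.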
The general point is that it is harder to be confident about the direction
of the biases expected in the estimated β1 coefficients
If there is only one omitted variable (K2 = 1), which is correlated with
several of the included explanatory variables, we can show that the bias
in each of the estimated coefficients on the included regressors will depend
on their partial correlation with the omitted variable, not on the simple
correlation between the included regressor and the omitted variable
i.e. direction of biases depends on the sign of coefficients in a multiple
regression of the omitted variable on all the included regressors jointly
- not on the sign of coefficients in a set of simple regressions relating the
omitted variable to each of the included regressors individually
Single omitted variable
$$y_i = \beta_1 + \beta_2 x_{2i} + \dots + \beta_{K-1} x_{K-1,i} + (\beta_K x_{Ki} + u_i)$$
where $E(x_{ki}u_i) = 0$ for k = 1, ..., K (and $x_{1i} = 1$ for all i = 1, ..., N)
[Relation to general model: K1 = K − 1, K2 = 1]
Linear projection of omitted $x_{Ki}$ on the included variables
$$x_{Ki} = \delta_1 + \delta_2 x_{2i} + \dots + \delta_{K-1} x_{K-1,i} + v_i$$
such that $E(x_{ki}v_i) = 0$ for k = 1, ..., K − 1 (by definition of linear projection)
Substitute
$$y_i = (\beta_1 + \beta_K\delta_1) + (\beta_2 + \beta_K\delta_2)x_{2i} + \dots + (\beta_{K-1} + \beta_K\delta_{K-1})x_{K-1,i} + (u_i + \beta_K v_i)$$
Now since $E(x_{ki}(u_i + \beta_K v_i)) = 0$ for k = 1, ..., K − 1, we have
$$\operatorname*{plim}_{N\to\infty} \hat\beta_k = \beta_k + \beta_K\delta_k \quad \text{for } k = 1, ..., K-1$$
(Or equivalently $\operatorname{plim}_{N\to\infty} \hat\beta_k = \beta_k + (\operatorname{plim}_{N\to\infty} \hat\delta_k)\beta_K$)
The inconsistency thus depends on the sign of the partial correlations
(reflected in the sign of the δk coefficients)
- not on the sign of the simple correlations between each included xki and
the omitted variable xKi
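The distinction between partial and simple correlations can be made concrete with a constructed example. In the sketch below (all values are assumptions chosen for illustration) the included regressor x2 is positively correlated with the omitted variable, yet its projection coefficient δ2 is negative, so with βK > 0 the coefficient on x2 is biased downwards.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients via least squares."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(0)
N = 200_000

x2 = rng.normal(0.0, 1.0, N)
x3 = x2 + 0.5 * rng.normal(0.0, 1.0, N)      # strongly correlated with x2

# Assumed linear projection of the omitted variable on (1, x2, x3)
delta = np.array([0.0, -0.5, 2.0])
xK = delta[0] + delta[1] * x2 + delta[2] * x3 + rng.normal(0.0, 1.0, N)

beta = np.array([1.0, 1.0, 1.0])             # assumed coefficients on (1, x2, x3)
beta_K = 1.0                                 # assumed coefficient on the omitted variable
y = beta[0] + beta[1] * x2 + beta[2] * x3 + beta_K * xK + rng.normal(0.0, 1.0, N)

X_incl = np.column_stack([np.ones(N), x2, x3])
b_short = ols(X_incl, y)                     # regression omitting xK

simple_corr = float(np.corrcoef(x2, xK)[0, 1])  # positive in this design
bias_x2 = float(b_short[1] - beta[1])           # close to beta_K * delta_2 = -0.5
print(f"simple correlation between x2 and xK: {simple_corr:.3f}")
print(f"bias in the coefficient on x2: {bias_x2:.3f}")
```

A researcher who signed the bias from the simple correlation alone would predict an upward bias here, which is the opposite of what the regression delivers.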
Multiple omitted variables
If there are several omitted variables, it is very difficult to predict the
direction of the biases
But the OLS estimator β̂1 is a biased and inconsistent estimator of β1 in
the true model, except in the special case where all of the omitted variables
are orthogonal to all of the included variables
Simultaneity bias
Note that both omitted variables and measurement error (in the explanatory
variable(s)) result in correlation between the included explanatory
variable(s) and the error term in the estimated model
These are both examples of the more general phenomenon of ‘simultaneity’
or ‘endogeneity’ - sources of correlation between the explanatory variable(s)
and the error term, such that the OLS estimator is biased and inconsistent
This can also arise naturally in situations where the dependent variable
and at least one of the explanatory variables are chosen jointly as part of
the same decision problem
Example:
- firms choosing inputs and output jointly in models of production, where
the error term includes unobserved (total factor) productivity
- high productivity firms are likely to be larger, using more inputs, as well
as producing more output from given inputs
- expect OLS estimates of coefficients on the inputs to be biased and
inconsistent in production functions (Marschak & Andrews, Econometrica, 1944)
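A stylised simulation conveys the production function example. The set-up is an assumption made for this sketch and is not taken from Marschak & Andrews: a single input, log output y = αl + ω with α = 0.6, productivity ω observed by the firm but not by the econometrician, and a price-taking firm that chooses the input to maximise profit given a firm-specific log input price, which gives l = (ln α + ω − ln W)/(1 − α).

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000
alpha = 0.6                                  # assumed true input coefficient

omega = rng.normal(0.0, 0.5, N)              # log productivity, known to the firm
log_w = rng.normal(0.0, 0.5, N)              # log input price faced by the firm

# Profit maximisation for output exp(omega) * L**alpha and cost W * L
# implies alpha * exp(omega) * L**(alpha - 1) = W, so in logs:
l = (np.log(alpha) + omega - log_w) / (1.0 - alpha)
y = alpha * l + omega                        # log output; omega is the error term

X = np.column_stack([np.ones(N), l])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
alpha_hat = float(coef[1])
print(f"true alpha = {alpha}, OLS estimate = {alpha_hat:.3f}")
```

Because high-productivity firms choose more of the input, l is positively correlated with the error term and the OLS estimate lies above the true α; in this design the probability limit is α + (1 − α)V(ω)/(V(ω) + V(ln W)) = 0.8.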