Econometrics review course: Microeconometrics

Franco Peracchi
University of Rome “Tor Vergata” and EIEF

Fall 2013

Contents

1 M-estimators
  1.1 The class of M-estimators
  1.2 Identification
  1.3 Numerical methods
  1.4 Consistency
  1.5 Asymptotic normality
  1.6 The bootstrap
2 Static regression
3 IV and GMM
4 Linear panel data models
5 Models for discrete and limited dependent variables


1 M-estimators

We begin by studying the sampling properties of the class of M-estimators, obtained by maximizing a data-dependent criterion function over a finite-dimensional parameter space.

This class of estimators is very large and includes most common econometric estimators, such as least squares (LS), least absolute deviations (LAD), maximum likelihood (ML), pseudo or quasi ML, method of moments (MM), generalized method of moments (GMM), minimum distance (MD), and Bayes estimators.


1.1 The class of M-estimators

Let Qn(θ) = Q(Z; θ) be a real function of an n × m data matrix Z = [Z1, . . . , Zn]⊤ and a p-dimensional parameter θ belonging to some parameter space Θ (typically a subset of p-dimensional Euclidean space).

An M-estimator is any solution θn to the problem

    max_{θ∈Θ} Qn(θ).

Thus,

    θn = argmax_{θ∈Θ} Qn(θ).

The term extremum estimator is also used (Amemiya 1985, Newey & McFadden 1994).


Example 1.1 An OLS estimator is an M-estimator corresponding to Z = [X, Y] and

    n Qn(θ) = −(Y − Xθ)⊤(Y − Xθ).

A 2SLS estimator is an M-estimator corresponding to Z = [X, Y, W] and

    n Qn(θ) = −(Y − Xθ)⊤ W(W⊤W)⁻¹ W⊤(Y − Xθ).

In both cases, the function Qn is quadratic in θ, which leads to simple closed-form expressions for these estimators. □
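Since Qn is quadratic here, the closed form can be checked directly. The sketch below uses simulated data (all names are hypothetical, NumPy assumed): the solution of the first-order conditions maximizes the OLS criterion.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta = np.array([1.0, -2.0, 0.5])        # hypothetical "true" parameter
Y = X @ beta + rng.normal(size=n)

def Qn(theta):
    # OLS criterion: n Qn(theta) = -(Y - X theta)'(Y - X theta)
    u = Y - X @ theta
    return -(u @ u) / n

# The FOC of the quadratic criterion gives the closed form (X'X)^{-1} X'Y
theta_ols = np.linalg.solve(X.T @ X, X.T @ Y)

# theta_ols maximizes Qn: any perturbation lowers the criterion
assert Qn(theta_ols) >= Qn(theta_ols + 0.01 * rng.normal(size=p))
```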

Example 1.2 Given a parametric model F_Θ = {f(z; θ), θ ∈ Θ ⊆ ℝ^p} for Z, a ML estimator of θ is an M-estimator obtained by maximizing the sample log-likelihood

    n Qn(θ) = c + ln f(Z; θ),

where c is an arbitrary constant. Given the parametric model F_Θ and a loss function ℓ(t, θ), for example ℓ(t, θ) = (t − θ)² (quadratic loss) or ℓ(t, θ) = |t − θ| (absolute loss), a Bayes point estimator of θ with respect to the loss function ℓ is an M-estimator corresponding to

    Qn(θ) = −∫_Θ ℓ(t, θ) p(t | Z) dt,

where

    p(t | Z) ∝ f(Z; t) p(t)

is the posterior density of θ given Z, obtained by combining, via Bayes’ rule, the sample likelihood f(Z; t) with the prior density p(t) of θ. Except in special cases, no closed-form expression is available for these two types of estimator. □


Two subclasses of M-estimators

One may distinguish between two subclasses of M-estimators:

• Estimators obtained by maximizing a sample average of the form

    Qn(θ) = n⁻¹ Σ_{i=1}^n ρi(θ),

where ρi(θ) is a real-valued function that depends on the data only through the ith data point Zi. This subclass contains LS, LAD, ML and pseudo ML estimators.

• Estimators obtained by minimizing a quadratic form in a vector ηn(θ) of sample averages. This is equivalent to maximizing a criterion function of the form

    Qn(θ) = −ηn(θ)⊤ Dn ηn(θ),

where ηn(θ) is a data-dependent vector of dimension r ≥ p and Dn is a symmetric pd matrix that may also depend on the data. A solution to this problem may be viewed as minimizing the distance of ηn(θ) from the zero vector. This subclass contains MM, GMM and MD estimators.
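A minimal sketch of the second subclass, using hypothetical moment conditions for a normal mean and variance (NumPy; the names are illustrative): with r = p (exact identification) the quadratic-form criterion is maximized exactly where ηn(θ) = 0.

```python
import numpy as np

rng = np.random.default_rng(1)
Z = rng.normal(loc=2.0, scale=1.5, size=500)   # hypothetical data

def eta_n(theta):
    # Sample moment conditions for theta = (mu, sigma2):
    # E[Z] - mu = 0 and E[Z^2] - sigma2 - mu^2 = 0
    mu, sig2 = theta
    return np.array([Z.mean() - mu, (Z**2).mean() - sig2 - mu**2])

D_n = np.eye(2)          # weighting matrix (identity here)

def Qn(theta):
    e = eta_n(theta)
    return -e @ D_n @ e  # criterion: minus a quadratic form in eta_n

# With r = p the maximizer drives eta_n to zero; here it is available
# analytically and equals the sample mean and (uncorrected) variance.
theta_mm = np.array([Z.mean(), (Z**2).mean() - Z.mean()**2])

assert np.allclose(eta_n(theta_mm), 0.0)
assert Qn(theta_mm) >= Qn(theta_mm + np.array([0.1, 0.0]))
```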


Computation and sampling properties

When Qn is quadratic in θ, an M-estimator admits a linear representation. This greatly simplifies its computation and the task of obtaining its sampling properties, either exactly or approximately.

When Qn is not quadratic, an M-estimator is usually defined only implicitly. In this case, two problems arise:

• Computing an M-estimate for a given sample usually requires some numerical method.

• The methods employed to obtain the sampling properties of linear estimators cannot be used.

Being able to derive a closed-form expression for an M-estimator is useful but not necessary in order to establish its sampling properties. Under appropriate regularity conditions:

• Consistency for the target parameter θ0 follows directly from the limiting properties of the criterion function Qn which defines the estimator.

• Asymptotic normality follows from the fact that, just like a regular ML estimator, most M-estimators possess an asymptotic linear representation of the form

    θn = θ0 + n⁻¹ Σ_{i=1}^n ψ(Zi) + op(n^{−1/2}),   (1)

where ψ(Zi) is a zero-mean random vector called the influence function of the estimator.

Although not discussed here, the influence function also plays a crucial role in the study of robustness, for it provides a description of the properties of an estimator under a small amount of data contamination.
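For the sample mean, obtained from ρi(θ) = −(Zi − θ)², the representation (1) holds exactly with ψ(Zi) = Zi − θ0 and zero remainder. A quick numeric illustration (simulated data, NumPy; the parameter values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
theta0 = 3.0                          # hypothetical population mean
Z = rng.normal(loc=theta0, scale=2.0, size=1000)

theta_n = Z.mean()                    # M-estimator for quadratic rho
psi = Z - theta0                      # influence function values psi(Z_i)

# Representation (1) with zero remainder: theta_n = theta0 + mean(psi)
assert np.isclose(theta_n, theta0 + psi.mean())
```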


1.2 Identification

Assessing identification of the population parameter of interest is a fundamental step, for it guarantees that one can distinguish between different parameter points using the data. Loosely speaking, identifiability makes it very unlikely that the same data may be obtained under different values of the population parameter.

Assessing identification of the parameter of interest is therefore preliminary to any attempt at computing an estimator or assessing its sampling properties.

Although identification is typically defined in the context of parametric models estimated by ML, it can easily be extended to the case of general M-estimators.

Parametric identification

Let F_Θ = {f(z; θ), θ ∈ Θ} be a parametric model for the data matrix Z. Following Rothenberg (1973), we introduce the following concepts:

• Two distinct parameter points θ and θ′ in the parameter space Θ are observationally equivalent if they give rise to essentially the same distribution of the data or, formally, if

    f(z; θ) = f(z; θ′)

for almost all values z of Z.

• The parameter point θ0 is identified if there does not exist any other θ ∈ Θ such that θ0 and θ are observationally equivalent.

• The parametric model F_Θ is identified if all parameter points in Θ are identified.

Sometimes a weaker concept of identification is used: the parameter point θ0 is locally identified if there does not exist any other θ in a neighborhood O of θ0 such that θ0 and θ are observationally equivalent.
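As an illustration of observational equivalence, consider a hypothetical model Z ~ N(θ², 1): the density depends on θ only through θ², so θ and −θ are observationally equivalent, and any θ0 ≠ 0 is only locally identified. A sketch (NumPy; the model is invented purely for illustration):

```python
import numpy as np

def logdensity(z, theta):
    # log-density of N(theta**2, 1): depends on theta only through theta**2
    return -0.5 * np.log(2 * np.pi) - 0.5 * (z - theta**2) ** 2

z = np.linspace(-5.0, 5.0, 101)

# theta = 1.3 and theta = -1.3 give identical densities: not (globally) identified
assert np.allclose(logdensity(z, 1.3), logdensity(z, -1.3))

# but distinct values of theta**2 give distinct densities: local identification
assert not np.allclose(logdensity(z, 1.3), logdensity(z, 1.2))
```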


Generalization

Let Q(Z; θ) denote the data-dependent objective function defining some M-estimator. Then the definition of parametric identification may be generalized as follows:

• Two distinct parameter points θ and θ′ in the parameter space Θ are observationally equivalent using Q if they give rise to essentially the same value of Q or, formally, if

    Q(z; θ) = Q(z; θ′)

for almost all values z of Z.

• The parameter point θ0 is identified using Q if there does not exist any other θ ∈ Θ such that θ0 and θ are observationally equivalent.

• The model generating Q(z; θ) is identified using Q if all parameter points in Θ are identified.

This is the same definition as parametric identification if Q is the sample likelihood or the sample log-likelihood, that is, if Q(z; θ) = f(z; θ) or Q(z; θ) = ln f(z; θ).

The definition of local identification using Q is a straightforward generalization of its parametric counterpart.


1.3 Numerical methods

When Qn is not quadratic in θ, some algorithm is needed to compute the value of an M-estimator. Here we discuss ways of finding a root of an implicit equation of the form

    g(θ) = 0,

where g: ℝ^p → ℝ^p. If Qn is a smooth function, this may be interpreted as the set of first-order conditions for the optimization problem defining the M-estimator.

An algorithm is an iterative procedure, that is, a sequence of steps, with the jth step typically defined as

    θ^(j+1) = θ^(j) + λ^(j) Δ^(j),   j = 0, 1, 2, . . . ,

where λ^(j) is a non-negative scalar called the step length, introduced to guarantee that ‖g(θ)‖ does not increase by taking a step of the algorithm, and Δ^(j) is a p-vector called the search direction. Algorithms differ in the way λ^(j) and Δ^(j) are selected.

An important aspect of an algorithm is its stopping rule, that is, the rule for terminating the iterations. If the sequence {θ^(j)} converges to a root of g(θ) = 0, then g(θ^(j)) → 0, ‖θ^(j+1) − θ^(j)‖ → 0 and ‖g(θ^(j+1)) − g(θ^(j))‖ → 0. Thus, given constants ε, δ > 0, iterations may stop when either of the following conditions is met:

• ‖θ^(j+1) − θ^(j)‖ < ε max(δ, ‖θ^(j)‖),

• ‖g(θ^(j+1)) − g(θ^(j))‖ < ε max(δ, ‖g(θ^(j))‖).

In principle, it is possible for one difference to be small while the other is large. It is therefore advisable to use both criteria and continue iterating until both are satisfied.

Two problems that may arise when trying to achieve convergence are:

• g(θ) may have multiple roots. This suggests starting the iterations from several different points, chosen either subjectively or randomly.

• g(θ) may have a unique root but be relatively flat around this root.

These problems are typically associated with lack of identification of the population parameter of interest.


The Newton method

This algorithm requires the function g(θ) to be smooth and employs its Jacobian G(θ) = g′(θ), a p × p matrix.

The basic idea is to locally approximate g by a linear function g* obtained by taking a first-order Taylor expansion about a given point θ^(j):

    g*(θ) = g^(j) + G^(j)(θ − θ^(j)),

where g^(j) = g(θ^(j)) and G^(j) = G(θ^(j)). If G^(j) is nonsingular, then the unique root of g*(θ) = 0 is

    θ^(j+1) = θ^(j) − [G^(j)]⁻¹ g^(j).   (2)

The Newton method consists of a sequence of iterations of the form (2) for j = 0, 1, 2, . . ., starting from some initial guess θ^(0). Notice that, for this method, the search direction is

    Δ^(j) = −[G^(j)]⁻¹ g^(j),

and the step length is λ^(j) = 1.

It is easy to verify that if g(θ) is linear, that is, g(θ) = b + Cθ with C a nonsingular matrix, then the Newton method converges to the unique root θ* = −C⁻¹b in a single iteration, no matter what the starting point is.
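Iteration (2), combined with the stopping rules of the previous subsection, can be sketched as follows (NumPy; the function names and tolerance values are illustrative choices, not part of the notes). In the linear case g(θ) = b + Cθ the root is reached in one step, as noted above.

```python
import numpy as np

def newton(g, G, theta0, eps=1e-8, delta=1e-4, max_iter=50):
    """Newton iterations theta_{j+1} = theta_j - G(theta_j)^{-1} g(theta_j),
    stopping only when BOTH the step and the change in g are small."""
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    for _ in range(max_iter):
        new = theta - np.linalg.solve(G(theta), g(theta))
        small_step = np.linalg.norm(new - theta) < eps * max(delta, np.linalg.norm(theta))
        small_g = np.linalg.norm(g(new) - g(theta)) < eps * max(delta, np.linalg.norm(g(theta)))
        theta = new
        if small_step and small_g:
            break
    return theta

# Linear case g(theta) = b + C theta: the first iteration lands on -C^{-1} b
C = np.array([[2.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -1.0])
root = newton(lambda t: b + C @ t, lambda t: C, theta0=[5.0, 5.0])
```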


1.4 Consistency

The basic argument for consistency of an M-estimator θn is very similar to the one used to establish consistency of a ML estimator: if, after a suitable normalization, the sequence of random functions {Qn} converges in an appropriate sense to a nonrandom function Q which is continuous and attains its unique maximum on Θ at the point θ = θ0, then, under appropriate regularity conditions, the estimator sequence {θn} is consistent for θ0.


Convergence of a sequence of functions

Let {fn} be a sequence of real-valued functions defined on X. The sequence {fn} is said to converge pointwise on X to a function f if, given any x ∈ X and any ε > 0, there exists an N such that |fn(x) − f(x)| < ε for all n ≥ N.

The sequence {fn} is said to converge uniformly on X to a function f if, given any ε > 0, there exists an N such that |fn(x) − f(x)| < ε for all n ≥ N and all x ∈ X.

When X is a finite set, pointwise convergence implies uniform convergence.

Example 1.3 Consider the function fn defined on the closed interval X = [0, 2] by

    fn(x) = nx                if 0 ≤ x ≤ (2n)⁻¹,
            1 − nx            if (2n)⁻¹ < x ≤ n⁻¹,
            0                 if n⁻¹ < x ≤ 1,
            [n/(n+1)](x − 1)  if 1 < x ≤ 3/2,
            [n/(n+1)](2 − x)  otherwise

(Figure 1). As n → ∞, the sequence {fn} converges pointwise to the function

    f(x) = 0      if 0 ≤ x ≤ 1,
           x − 1  if 1 < x ≤ 3/2,
           2 − x  otherwise.

It does not converge uniformly, however, for

    sup_{x∈[0,1]} |fn(x) − f(x)| = .5

for all n. Notice that the function fn attains its maximum at x = (2n)⁻¹, which converges to zero as n → ∞, while the function f attains its maximum at x = 1.5. □
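The sup-norm gap in Example 1.3 can be checked numerically (NumPy; the grid size is an arbitrary choice): the maximal deviation stays at 1/2 for every n, so the convergence is pointwise but not uniform.

```python
import numpy as np

def f_n(x, n):
    # piecewise function f_n from Example 1.3 on [0, 2]
    return np.where(x <= 1/(2*n), n*x,
           np.where(x <= 1/n, 1 - n*x,
           np.where(x <= 1, 0.0,
           np.where(x <= 1.5, n/(n+1)*(x - 1), n/(n+1)*(2 - x)))))

def f_lim(x):
    # pointwise limit f
    return np.where(x <= 1, 0.0, np.where(x <= 1.5, x - 1, 2 - x))

x = np.linspace(0, 2, 200001)
gaps = {n: np.max(np.abs(f_n(x, n) - f_lim(x))) for n in (1, 2, 4, 100)}
# each gap equals .5, attained near x = 1/(2n), where f_n spikes while f = 0
```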


Figure 1: Graph of the functions defined in the previous example.

(plot on [0, 2] of f_1, f_2, f_4 and the limit function f; vertical axis from 0 to .5)


Convergence of a sequence of random functions

We now generalize the above concepts to a sequence of random functions.

Definition 1.1 A sequence {Qn} of random functions defined on a set Θ is said to converge in probability pointwise on Θ to a function Q if, given any θ ∈ Θ and any ε > 0,

    Pr{|Qn(θ) − Q(θ)| < ε} → 1

as n → ∞. □

Pointwise convergence in probability of {Qn} to Q is typically the consequence of some Law of Large Numbers (LLN).

Example 1.4 Given a sequence {Zi} of iid random vectors and a function ρ(z; θ), let

    Qn(θ) = n⁻¹ Σ_i ρi(θ),

where ρi(θ) = ρ(Zi; θ). If E |ρi(θ)| < ∞ for all θ ∈ Θ, then Khinchine’s WLLN implies that

    Qn(θ) →p Q(θ) = E ρi(θ)

for all θ ∈ Θ. □

Definition 1.2 A sequence {Qn} of random functions defined on a set Θ is said to converge in probability uniformly on Θ to a function Q if, for any ε > 0,

    Pr{ sup_{θ∈Θ} |Qn(θ) − Q(θ)| < ε } → 1

as n → ∞. □


Main result

Theorem 1.1 Suppose that:

(i) Θ is a compact subset of ℝ^p;

(ii) the sequence of random functions {Qn} converges in probability uniformly on Θ to a continuous function Q;

(iii) Q attains a unique maximum on Θ at θ0.

Then θn = argmax_{θ∈Θ} Qn(θ) exists and is unique with probability approaching one as n → ∞, and θn →p θ0.

Remarks:

• Assumption (i) requires the parameter space Θ to be bounded and rules out the incidental parameters case, where the dimensionality of Θ increases with the sample size.

• Sufficient conditions for Assumption (ii) will be discussed later.

• Assumption (iii) ensures that the target parameter θ0 is identified. For ML and pseudo ML estimators, this condition is automatically satisfied if the assumed parametric model is identified (for, in this case, the target parameter corresponds to the unique maximum on Θ of the expected log-likelihood).


Proof

First notice that, since Θ is a compact set and Qn converges to a continuous function with probability approaching one as n → ∞, θn exists with probability approaching one as n → ∞.

Given δ > 0, define the set O = {θ ∈ Θ : ‖θ − θ0‖ ≥ δ}. Since O is compact, Q attains a maximum on O. Define

    ε = Q(θ0) − max_{θ∈O} Q(θ),

and denote by An the event that θn exists and |Qn(θ) − Q(θ)| < ε/2 for all θ ∈ Θ. The occurrence of An implies that

    Qn(θ) − ε/2 < Q(θ) < Qn(θ) + ε/2

for all θ ∈ Θ. In particular, it implies that

    Qn(θ0) > Q(θ0) − ε/2.   (3)

Because Qn(θn) ≥ Qn(θ0) if θn exists, the occurrence of An also implies that

    Q(θn) > Qn(θn) − ε/2 ≥ Qn(θ0) − ε/2.   (4)

Combining (3) and (4), the occurrence of An implies that

    Q(θn) > Q(θ0) − ε = max_{θ∈O} Q(θ).

So, the occurrence of An implies that θn ∉ O, that is, ‖θn − θ0‖ < δ. Therefore

    Pr{An} ≤ Pr{‖θn − θ0‖ < δ}.

Since Pr{An} → 1 by uniform convergence in probability of Qn to Q, it follows that Pr{‖θn − θ0‖ < δ} → 1. Because δ is arbitrary, θn →p θ0. Finally, the fact that θ0 is unique implies that θn is unique with probability approaching one as n → ∞.


Stochastic equicontinuity and uniform convergence

A sequence {fn} of real-valued functions is said to be equicontinuous on X if, given any ε > 0, there exists a δ > 0 such that ‖x′ − x‖ < δ implies |fn(x′) − fn(x)| < ε for all x′, x ∈ X and all n.

Equicontinuity requires that small perturbations of x should have uniformly small effects on the sequence of function values {fn(x)}.

If X is a compact set, then the sequence {fn} converges uniformly on X to a continuous function f if and only if: (i) it converges pointwise on X to f, and (ii) it is equicontinuous on X.

We now consider generalizations of this result to a sequence of random functions.

Definition 1.3 A sequence {Qn} of random functions defined on a set Θ is said to be stochastically equicontinuous on Θ if, for any ε, η > 0 and θ ∈ Θ, there exist an integer N, an open neighborhood O of θ, and a sequence {Dn} of random variables such that Pr{|Dn| > ε} < η and

    sup_{θ′∈O} |Qn(θ′) − Qn(θ)| ≤ Dn

for all n ≥ N. □

The next result characterizes uniform convergence in probability to a continuous function on a compact set.

Theorem 1.2 (Newey 1991) A sequence {Qn} of random functions defined on a compact set Θ converges in probability uniformly on Θ to a continuous function Q if and only if it converges in probability pointwise on Θ to Q and is stochastically equicontinuous on Θ.


Sufficient conditions for uniform convergence

The following results imply stochastic equicontinuity and provide sets of relatively simple sufficient conditions for uniform convergence in probability.

Continuity and compactness

The simplest case is the following:

Theorem 1.3 (Hansen 1982) Let {Zi} be a sequence of iid random vectors and let Qn(θ) = n⁻¹ Σ_i ρi(θ), where ρi(θ) = ρ(Zi; θ). Suppose that:

(i) Θ is a compact subset of ℝ^p;

(ii) ρi(·) is continuous on Θ with probability one;

(iii) E[ sup_{θ∈Θ} |ρi(θ)| ] < ∞.

Then {Qn} converges in probability uniformly on Θ to E ρi(·), and E ρi(·) is continuous on Θ.

Concavity

The next result does not require compactness of the parameter space Θ, but replaces continuity of Qn by the stronger condition that Qn is concave.

Theorem 1.4 (Pollard 1991) Suppose that:

(i) Θ is an open convex subset of ℝ^p;

(ii) for all n, Qn is concave on Θ;

(iii) {Qn} converges in probability pointwise on Θ to Q.

Then, for any compact subset K of Θ, {Qn} converges in probability uniformly on K to Q. Further, Q is concave on Θ.


1.5 Asymptotic normality

An estimator sequence {θn} is said to be √n-consistent for the population parameter θ0 if the rescaled difference √n(θn − θ0) has a nondegenerate limiting distribution.

Under certain regularity conditions, the limiting distribution of a √n-consistent M-estimator is Gaussian. The intuition is that, because a regular M-estimator possesses the asymptotic linear representation (1), we have

    √n(θn − θ0) = n^{−1/2} Σ_{i=1}^n ψ(Zi) + op(1),

where the average on the right-hand side obeys some Central Limit Theorem (CLT).

The standard case is when the criterion function Qn is twice differentiable. This is not an innocuous assumption, however, for it rules out important estimators, such as LAD.

Asymptotic normality results are also available that only require the weaker condition that the asymptotic criterion function Q (not Qn) is twice differentiable.


The standard case

The basic argument is similar to that for asymptotic normality of a regular ML estimator, and relies on a second-order Taylor expansion of the criterion function defining the estimator.

Theorem 1.5 Let θn = argmax_{θ∈Θ} Qn(θ) and suppose that:

(i) θn →p θ0;

(ii) θ0 is in the interior of Θ;

(iii) the function Qn is twice continuously differentiable on an open neighborhood O of θ0;

(iv) √n Q′n(θ0) ⇒ Np(0, C0);

(v) the sequence of random matrix functions {−Q′′n} converges in probability uniformly on O to a matrix function B that is symmetric and continuous on O;

(vi) B0 = B(θ0) is a finite pd matrix.

Then √n(θn − θ0) ⇒ Np(0, B0⁻¹ C0 B0⁻¹).

We refer to the conditions of Theorem 1.5 as the standard conditions for asymptotic normality of an M-estimator.


Proof

By (i)–(iii), with probability approaching one as n → ∞, θn is a root of the equation

    0 = Q′n(θ).

By the mean value theorem,

    0 = √n Q′n(θn) = √n Q′n(θ0) + Q′′n(θ⋆n) √n(θn − θ0),

where θ⋆n = (1 − λn)θ0 + λn θn for some λn ∈ [0, 1].

Since ‖θ⋆n − θ0‖ ≤ ‖θn − θ0‖ and θn →p θ0, θ⋆n converges in probability to θ0. Thus, for n sufficiently large,

    ‖Q′′n(θ⋆n) + B0‖ ≤ ‖Q′′n(θ⋆n) + B(θ⋆n)‖ + ‖B0 − B(θ⋆n)‖
                   ≤ sup_{θ∈O} ‖Q′′n(θ) + B(θ)‖ + ‖B0 − B(θ⋆n)‖,

where the first term on the right side of the inequality converges in probability to zero by (v). Hence Q′′n(θ⋆n) →p −B0, and therefore

    √n Q′n(θ0) − B0 √n(θn − θ0) = op(1).

Because B0 is nonsingular, we have

    √n(θn − θ0) − B0⁻¹ √n Q′n(θ0) = op(1).

So √n(θn − θ0) and B0⁻¹ √n Q′n(θ0) have the same limiting distribution, namely Np(0, B0⁻¹ C0 B0⁻¹).


Remarks

• Under our differentiability assumptions, condition (ii) allows the estimator to be represented as a root of a set of first-order conditions. This condition is essential for asymptotic normality, which usually does not hold when θ0 is on the boundary of the parameter space.

• Condition (iii) can be weakened (see Asymptotic normality under nonstandard conditions below).

• Condition (iv) is typically the result of a CLT applied to the asymptotic linear representation (1).

• If the sequence of random functions {Qn} converges uniformly on a neighborhood O of θ0 to a twice differentiable function Q, then the matrix B(θ) in condition (v) is equal to minus the Hessian of Q for each θ ∈ O.

• Given (v), condition (vi) guarantees that Q has a local maximum on O at θ = θ0, that is, θ0 is locally identified.

• Estimates of the asymptotic variance of θn are typically of the form

    ÂV(θn) = Bn⁻¹ Cn Bn⁻¹,

where Bn and Cn are consistent estimates of B0 and C0 respectively. In standard settings,

    Bn = −Q′′n(θn),   Cn = n⁻¹ Σ_{i=1}^n ρ′i(θn) ρ′i(θn)⊤,

the latter being the outer-product estimate when Qn(θ) = n⁻¹ Σ_i ρi(θ). It is easy to verify that, under the assumptions of Theorem 1.5, Bn is pd with probability approaching one as n → ∞, and

    ÂV(θn) →p B0⁻¹ C0 B0⁻¹.

Thus, one may estimate the sampling variance of θn by n⁻¹ ÂV(θn).
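A minimal sketch of the sandwich estimate ÂV(θn) = Bn⁻¹ Cn Bn⁻¹ for a scalar location model with ρi(θ) = −(Zi − θ)² (simulated data, NumPy; the data-generating choice is hypothetical). For this criterion the sandwich collapses to the sample variance, i.e. n⁻¹ ÂV is the familiar estimate of the variance of the sample mean.

```python
import numpy as np

rng = np.random.default_rng(3)
Z = rng.exponential(scale=2.0, size=5000)   # hypothetical (non-normal) data

theta_n = Z.mean()                  # maximizes Qn(theta) = -mean((Z - theta)^2)

# rho_i'(theta) = 2 (Z_i - theta), rho_i''(theta) = -2
scores = 2 * (Z - theta_n)          # individual gradients at theta_n
B_n = 2.0                           # -Q''_n(theta_n)
C_n = np.mean(scores**2)            # outer-product estimate of C0
avar = C_n / B_n**2                 # sandwich B^{-1} C B^{-1} (scalar case)

# here avar equals the sample variance of Z, so the estimated sampling
# variance n^{-1} avar is Var(Z)/n
```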


One-step M-estimators

Consider the general class of Newton-type algorithms, namely those based on a sequence of iterations of the form

    θ^(r+1)_n = θ^(r)_n + [Bn(θ^(r)_n)]⁻¹ Q′n(θ^(r)_n),   r = 0, 1, 2, . . . ,   (5)

where Bn(·) = −Q′′n(·), as in the Newton method, or some approximation to it, as in the scoring or in the Berndt–Hall–Hall–Hausman (BHHH) method.

A one-step M-estimator θ̄n is obtained by starting from some initial value θ^(0)_n and taking only one step of (5) in the direction of a local maximum of Qn, that is,

    θ̄n = θ^(0)_n + [B^(0)_n]⁻¹ Q′n(θ^(0)_n),

where B^(0)_n = Bn(θ^(0)_n). The resulting estimator is also called a linearized M-estimator.

The main advantage of this type of estimator is that it is simpler to compute than a fully iterated one.

The next theorem shows that, under appropriate conditions, the limiting distribution of a one-step M-estimator is the same as that of a fully iterated M-estimator. The crucial requirements are:

• the starting point θ^(0)_n must be a √n-consistent estimator of the target parameter θ0;

• Bn(θ) must be a consistent estimator of the matrix B(θ) defined in Theorem 1.5.

Theorem 1.6 Suppose that the assumptions of Theorem 1.5 hold with Bn replacing −Q′′n. Further suppose that θ^(0)_n − θ0 = Op(n^{−1/2}). Then √n(θ̄n − θn) →p 0.


Proof

From the definition of a one-step estimator,

    √n(θ̄n − θ0) = √n(θ^(0)_n − θ0) + [B^(0)_n]⁻¹ √n Q′n(θ^(0)_n).   (6)

Expanding Q′n(θ^(0)_n) about the target parameter θ0 gives

    Q′n(θ^(0)_n) = Q′n(θ0) − Bn(θ0)(θ^(0)_n − θ0) + op(n^{−1/2}).

Substituting back into (6) gives

    √n(θ̄n − θ0) = √n(θ^(0)_n − θ0) + √n [B^(0)_n]⁻¹ [Q′n(θ0) − Bn(θ0)(θ^(0)_n − θ0)] + op(1),

where, under the stated assumptions,

    B^(0)_n = B0 + op(1),   Bn(θ0) = B0 + op(1).

Thus,

    √n(θ̄n − θ0) = √n(θ^(0)_n − θ0) + √n B0⁻¹ [Q′n(θ0) − B0(θ^(0)_n − θ0)] + op(1)
                 = B0⁻¹ √n Q′n(θ0) + op(1).

The conclusion then follows from the fact that, by the proof of Theorem 1.5,

    √n(θn − θ0) − B0⁻¹ √n Q′n(θ0) = op(1).


Example 1.5 Consider a classical Gaussian linear model with nonscalar error covariance matrix Σ0 = Σ(δ0), where δ0 is a finite-dimensional parameter, functionally unrelated to the regression parameter β0.

The likelihood score for β is

    ∂L/∂β = n⁻¹ X⊤ Σ⁻¹ (Y − Xβ).

Further, the Fisher information I is block-diagonal with respect to β and δ, and the block corresponding to β is

    I_β = n⁻¹ X⊤ Σ⁻¹ X.

Hence, the rth iteration of the scoring method is

    β^(r+1) = β^(r) + [X⊤(Σ^(r))⁻¹X]⁻¹ X⊤(Σ^(r))⁻¹(Y − Xβ^(r))
            = [X⊤(Σ^(r))⁻¹X]⁻¹ X⊤(Σ^(r))⁻¹ Y,

where Σ^(r) = Σ(δ^(r)). In this case, the one-step estimator of β is the feasible GLS estimator based on a consistent estimate of δ0. □
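A sketch of Example 1.5 under a hypothetical skedastic function Σ(δ) = diag(exp(δ wi)) with observed wi (simulated data, NumPy; the log-squared-residual regression used to estimate δ is one common choice, not part of the notes): OLS supplies the √n-consistent starting value, and one scoring step yields the feasible GLS estimator.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400
X = np.column_stack([np.ones(n), rng.normal(size=n)])
w = rng.uniform(0, 2, size=n)                  # hypothetical variance covariate
delta0, beta0 = 1.2, np.array([0.5, -1.0])
sigma2 = np.exp(delta0 * w)                    # Sigma(delta) = diag(exp(delta * w))
Y = X @ beta0 + rng.normal(size=n) * np.sqrt(sigma2)

# Step 0: OLS gives a root-n-consistent starting value for beta
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)

# A consistent estimate of delta from a log-squared-residual regression on w
e2 = (Y - X @ beta_ols) ** 2
delta_hat = np.polyfit(w, np.log(e2 + 1e-12), 1)[0]

# One scoring step = feasible GLS with the estimated Sigma(delta_hat)
Winv = 1.0 / np.exp(delta_hat * w)             # diagonal of Sigma^{-1}
XtWi = X.T * Winv
beta_fgls = np.linalg.solve(XtWi @ X, XtWi @ Y)
```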


Asymptotic normality under nonstandard conditions

Let {θn} be a sequence of M-estimators defined by maximizing a random criterion function Qn. Assume that the conditions of Theorem 1.1 hold and, in particular, that Qn converges in probability uniformly on Θ to a continuous function Q which attains its unique maximum at θ0.

When Qn is smooth, as in Theorem 1.5, θn has the same limiting distribution as the (unfeasible) linearized estimator θ̄n = θ0 − Q′′n(θ0)⁻¹ Q′n(θ0), obtained by maximizing the following local quadratic approximation to Qn:

    Q̄n(θ) = Qn(θ0) + Q′n(θ0)⊤(θ − θ0) + (1/2)(θ − θ0)⊤ Q′′n(θ0)(θ − θ0).

When Qn is not smooth, neither Q′n nor Q′′n exists, but an argument for asymptotic normality of θn may be constructed as follows.

Suppose that θ0 lies in the interior of the parameter space Θ. If Q is smooth, then Q′(θ0) = 0 and so Q(θ) may be approximated by the quadratic function

    Q*(θ) = Q(θ0) − (1/2)(θ − θ0)⊤ B0 (θ − θ0),

where B0 = −Q′′(θ0) and the approximation error is of order o(‖θ − θ0‖²). Now consider approximating Qn(θ) by the locally quadratic function

    Q*n(θ) = Qn(θ0) + Dn⊤(θ − θ0) − (1/2)(θ − θ0)⊤ B0 (θ − θ0),

where Dn is some “approximation” to the gradient of Qn, such as Dn = n⁻¹ Σ_i ψ(Zi), with ψ(Zi) the influence function of the estimator. Since this approximation is typically good for large n and the matrix B0 is pd then, under appropriate conditions, we may obtain the limiting distribution of θn from the asymptotic properties of

    θ*n = argmax_{θ∈Θ} Q*n(θ) = θ0 + B0⁻¹ Dn.

In particular, if E_F ψ(Zi) = 0 and Var_F ψ(Zi) = C0, then √n Dn ⇒ Np(0, C0) by the CLT. So, if we can show that √n(θn − θ*n) = op(1), then we can conclude that √n(θn − θ0) ⇒ Np(0, B0⁻¹ C0 B0⁻¹).


1.6 The bootstrap

This method relies on numerical rather than analytical calculations to ap-proximate the sampling properties of an estimator.

Let Z = (Z1, . . . , Zn) be a sample from a distribution with df F, let θ̂ = θ(Z) be an estimator that does not depend on the order in which the observations are arranged, and let the precision of θ̂ be measured by its sampling variance, which we write as σ²(F) = Var_F θ̂ to stress its dependence on the parent df F.

When F is unknown, the analogy principle suggests estimating σ²(F) by replacing F with some estimate. The classical nonparametric estimator of F is the empirical df (edf)

    F̂(z) = n⁻¹ Σ_{i=1}^n 1{Zi ≤ z}.

Under general conditions, F̂ is known to be strongly uniformly consistent for F, that is,

    sup_{−∞<z<∞} |F̂(z) − F(z)| →a.s. 0

as n → ∞ (Glivenko–Cantelli theorem). Thus, a natural nonparametric estimate of σ²(F) is

    σ²(F̂) = Var_F̂ θ̂.

Example 1.6 Let Z = (Z1, . . . , Zn) be a sample from a distribution with df F and finite moments μh = E Z^h up to order 2k. Let θ̂ be the jth empirical moment μ̂j = n⁻¹ Σ_{i=1}^n Zi^j, 1 ≤ j ≤ k, whose sampling variance is σ²(F) = n⁻¹ Var_F Zi^j. Because Var_F Zi^j = μ_{2j} − μj², a nonparametric estimate of σ²(F) is

    σ²(F̂) = n⁻¹ Var_F̂ Zi^j = n⁻¹ (μ̂_{2j} − μ̂j²).

By Jensen’s inequality, μ̂j² is an upward-biased estimator of μj², so Var_F̂ μ̂j is a downward-biased estimator of the sampling variance of μ̂j, although the bias is negligible for large enough n. □
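For j = 1 the plug-in estimate in Example 1.6 reduces to Var_F̂(Z)/n, the usual estimate of the variance of the sample mean, which can be checked directly (simulated data, NumPy; the distribution is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)
Z = rng.gamma(shape=2.0, scale=1.0, size=1000)   # hypothetical sample

# Plug-in estimate of the sampling variance of the first empirical moment (j = 1)
n = Z.size
mu1_hat = Z.mean()
mu2_hat = np.mean(Z**2)
var_plugin = (mu2_hat - mu1_hat**2) / n          # n^{-1}(mu_hat_{2j} - mu_hat_j^2)

# This equals Var_Fhat(Z)/n, the variance of the mean under the edf
assert np.isclose(var_plugin, Z.var() / n)
```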

More generally, if the functional ψ(F) describes some aspect of the sampling distribution of θ̂ under F, then a nonparametric estimate of ψ(F) is just ψ(F̂). Because F̂ is strongly uniformly consistent for F under general conditions, ψ(F̂) ought to converge to ψ(F) provided that the functional ψ is continuous.


Approximating the bootstrap

Except in special cases, such as Example 1.6, evaluating ψ(F̂) is complicated and some form of approximation becomes necessary.

Example 1.7 Let Z be a sample from a distribution with df F. The analogy principle suggests estimating the expected value of θ̂,

    ψ(F) = E_F θ̂ = ∫ ··· ∫ θ(z1, . . . , zn) dF(z1) ··· dF(zn),

by its sample counterpart

    ψ(F̂) = E_F̂ θ̂ = ∫ ··· ∫ θ(z1, . . . , zn) dF̂(z1) ··· dF̂(zn).

Because the edf F̂ is the df of a discrete distribution that gives probability mass n⁻¹ to each of the n sample values Z1, . . . , Zn, we get

    E_F̂ θ̂ = n⁻ⁿ Σ_{i1=1}^n ··· Σ_{in=1}^n θ(Z_{i1}, . . . , Z_{in}).   (7)

The number of terms in this summation is equal to nⁿ, so it becomes rapidly astronomical. For example, if n = 5 then nⁿ = 3,125, and if n = 10 then nⁿ = 10,000,000,000.

The generic term θ(Z_{i1}, . . . , Z_{in}) on the right-hand side of (7) is just the value of the estimator θ̂ for one of the nⁿ samples of size n that may be obtained by randomly drawing with replacement n elements from the original data Z. Thus, an alternative to (7) consists in randomly selecting only B of the nⁿ possible samples. Denoting these samples by Z*_1, . . . , Z*_B, one may then approximate E_F̂ θ̂ by θ̂(·) = B⁻¹ Σ_b θ̂*_b, where θ̂*_b = θ(Z*_b).

It can be shown that this approximation to the nonparametric estimate E_F̂ θ̂ becomes increasingly accurate as B increases. Of course, this does not necessarily mean that, for a given sample, E_F̂ θ̂ is a good approximation to E_F θ̂. □


The nonparametric bootstrap algorithm

Given a sample Z = (Z1, . . . , Zn), a resample Z* = (Z*_1, . . . , Z*_n) is a random sample of size n drawn with replacement from Z or, equivalently, a sample of size n from the edf F̂. Notice that Z*_i has probability n⁻¹ of being equal to any distinct element of Z, and probability m/n of being equal to any element of Z that is repeated m times. Thus, even when all elements of Z are distinct, a resample may contain repeats.

Given a resample Z*, the estimate θ̂* = θ(Z*) is called a replicate of θ̂. Because the estimator is assumed not to depend on the order in which the data are arranged, some of the nⁿ possible resamples are really indistinguishable. It can be shown that the chance of drawing the same unordered resample more than once is less than (1/2) B(B − 1) n!/nⁿ (Hall 1992).

The argument in Example 1.7 motivates the following algorithm, called the nonparametric bootstrap, for numerically approximating the nonparametric estimate of any aspect of the sampling distribution of θ̂.

Algorithm 1.1

(1) Compute the edf F̂ of the sample Z.

(2) Draw a sample Z* of size n from F̂.

(3) Compute θ̂* = θ(Z*).

(4) Repeat steps (2) and (3) a sufficiently large number B of times, obtaining bootstrap replicates θ̂*_1, . . . , θ̂*_B.

(5) Use the empirical distribution of θ̂*_1, . . . , θ̂*_B to estimate any aspect of the sampling distribution of θ̂.
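Algorithm 1.1 in a few lines (simulated data, NumPy; the median is used as an order-invariant estimator and the value of B is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(6)
Z = rng.lognormal(size=100)        # hypothetical sample
B = 2000                           # number of resamples

def theta(z):
    return np.median(z)            # estimator invariant to the data ordering

# Steps (2)-(4): draw B resamples from the edf and collect the replicates
reps = np.empty(B)
for b in range(B):
    Zstar = rng.choice(Z, size=Z.size, replace=True)   # sample from F-hat
    reps[b] = theta(Zstar)

# Step (5): use the empirical distribution of the replicates
boot_mean = reps.mean()            # estimate of E_Fhat(theta_hat)
boot_var = reps.var()              # estimate of Var_Fhat(theta_hat)
```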


Example 1.8 The estimate of E_F̂ θ̂ based on Algorithm 1.1 is

    θ̂(·) = B⁻¹ Σ_{b=1}^B θ̂*_b,

the estimate of Var_F̂ θ̂ is

    Var_B θ̂ = B⁻¹ Σ_{b=1}^B [θ̂*_b − θ̂(·)]²,

while the estimate of Pr_F̂{θ̂ ≤ c} is

    Pr_B{θ̂ ≤ c} = B⁻¹ Σ_{b=1}^B 1{θ̂*_b ≤ c}.

As B → ∞, θ̂(·) → E_F̂ θ̂, Var_B θ̂ → Var_F̂ θ̂, and Pr_B{θ̂ ≤ c} → Pr_F̂{θ̂ ≤ c}.

The amount of computer time required by Algorithm 1.1 depends linearly on the number B of resamples. According to Efron and Tibshirani (1993), if one seeks estimates of Var_F θ̂, then:

• A number of resamples as small as B = 25 is usually informative, whereas B = 50 is often enough for a good estimate.

• One very seldom needs to draw more than 200 resamples. □


Bootstrap confidence intervals

Since its introduction, a major use of the nonparametric bootstrap has been in constructing confidence intervals that avoid the symmetry implied by the use of asymptotically normal approximations.

One method, called the T method, assumes that there exists a pivotal (or approximately pivotal) statistic of the form

T(Z; θ) = [θ(Z) − θ] / SE,

where θ(Z) is an estimator of θ and SE is an estimate of its standard error. For each resample Z∗, construct the replicate

T∗ = T(Z∗; θ) = (θ∗ − θ) / SE∗,

where SE∗ denotes the standard error estimate computed from the resample. Given replicates T∗1, . . . , T∗B, one may approximate a 2-sided (1 − α)-level confidence interval for θ by the interval

[θ − t_{1−α/2} SE, θ − t_{α/2} SE],

where t_p denotes a pth quantile of the empirical distribution of the T∗b. This confidence interval is not symmetric, unless t_{1−α/2} = −t_{α/2}.

Because α is usually small, for example α = .01 or α = .05, attaining sufficiently accurate estimates of the tail probabilities requires many more replicates than a bootstrap estimate of variance.

Another method, called the percentile method, starts directly from the bootstrap distribution of the estimator θ of θ. Given B replicates θ∗1, . . . , θ∗B, it is reasonable to consider, as an approximate (1 − α)-level confidence interval for θ, the interval [t_{α/2}, t_{1−α/2}], where t_p now denotes a pth quantile of the empirical distribution of the θ∗b. This method tends to be less erratic than the T method, but also less accurate.
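The two constructions can be compared numerically (a sketch, not from the original text; θ is taken to be the sample mean with SE = s/√n, and the simple order-statistic quantile rule is an implementation choice):

```python
import math
import random
import statistics

def mean_and_se(z):
    # Estimate and its standard error: for the sample mean, SE = s/sqrt(n).
    return statistics.mean(z), statistics.stdev(z) / math.sqrt(len(z))

def quantile(xs, p):
    # p-th quantile of an empirical distribution (simple order-statistic rule).
    s = sorted(xs)
    return s[min(int(p * len(s)), len(s) - 1)]

rng = random.Random(1)
z = [rng.gauss(5.0, 2.0) for _ in range(50)]
theta, se = mean_and_se(z)

t_reps, theta_reps = [], []
for _ in range(2000):
    zs = [z[rng.randrange(len(z))] for _ in range(len(z))]
    t_s, se_s = mean_and_se(zs)
    theta_reps.append(t_s)                 # replicate theta*
    t_reps.append((t_s - theta) / se_s)    # replicate T* = (theta* - theta)/SE*

alpha = 0.05
# T method: [theta - t_{1-a/2} SE, theta - t_{a/2} SE]
t_ci = (theta - quantile(t_reps, 1 - alpha / 2) * se,
        theta - quantile(t_reps, alpha / 2) * se)
# Percentile method: [t_{a/2}, t_{1-a/2}] of the theta* distribution
p_ci = (quantile(theta_reps, alpha / 2), quantile(theta_reps, 1 - alpha / 2))
print(t_ci, p_ci)
```

Both intervals are asymmetric around the point estimate in general, which is precisely the point of these constructions.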


The parametric bootstrap

Let Z be a sample from a distribution whose df is known to belong to a parametric family {F(z; θ), θ ∈ Θ}, and let θ = θ(Z) be an ML estimator of the population parameter θ.

In this case, instead of using the edf F, which is a nonparametric estimate of the parent df, the bootstrap may be based on the parametric estimate F(z; θ). This leads to the following algorithm.

Algorithm 1.2

(1) Given θ, compute the parametric estimate F (z; θ).

(2) Draw a sample Z∗ of size n from F (z; θ).

(3) Compute θ∗ = θ(Z∗).

(4) Repeat steps (2)–(3) a sufficiently large number B of times, obtaining bootstrap replicates θ∗1, . . . , θ∗B.

(5) Use the empirical distribution of θ∗1, . . . , θ∗B to estimate any aspect of the sampling distribution of θ.
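A minimal sketch of Algorithm 1.2 (assuming, purely for illustration, a normal parametric family, so the ML estimate is (μ̂, σ̂) and F(z; θ̂) is the fitted normal df):

```python
import math
import random
import statistics

def parametric_bootstrap_normal(z, B=1000, seed=0):
    """Parametric bootstrap (Algorithm 1.2) for an assumed N(mu, sigma^2)
    family: estimate (mu, sigma) by ML, then resample from the fitted df."""
    rng = random.Random(seed)
    n = len(z)
    mu = statistics.mean(z)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in z) / n)  # ML estimate of sigma
    reps = []
    for _ in range(B):
        zs = [rng.gauss(mu, sigma) for _ in range(n)]     # draw from F(z; theta_hat)
        reps.append(statistics.mean(zs))                  # replicate theta* = theta(Z*)
    return reps

rng = random.Random(42)
z = [rng.gauss(0.0, 1.0) for _ in range(40)]
reps = parametric_bootstrap_normal(z, B=1500)
print(statistics.pvariance(reps))  # approximates sigma^2 / n
```

The only difference from Algorithm 1.1 is step (2): resamples come from the fitted parametric df rather than from the edf.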


The jackknife

We now present another method that may be used to estimate the sampling variance of an estimator θ. This method, which may be interpreted as an approximation to the nonparametric bootstrap, works well for estimators that are not far from being linear in the data.

Given a sample Z1, . . . , Zn, the ith jackknife sample is the subset

(Z1, . . . , Zi−1, Zi+1, . . . , Zn)

of n− 1 elements obtained by excluding the ith data point Zi.

Let θ(i) denote the value of the estimator θ for the ith jackknife sample, and let θ(·) = n⁻¹ ∑_{i=1}^n θ(i) be the average of the θ(i) over the n jackknife samples. Tukey (1958) suggested estimating the sampling variance of θ by

VarJ θ = [(n − 1)/n] ∑_{i=1}^n [θ(i) − θ(·)][θ(i) − θ(·)]⊤,

called the jackknife estimate of the sampling variance of θ.

The motivation for this method is easiest to see by considering the problem of estimating the sampling variance of the sample mean.

Example 1.9 Let Z̄ be the mean of a sample Z1, . . . , Zn from a distribution with finite variance. The value of the sample mean for the ith jackknife sample is

Z̄(i) = (n − 1)⁻¹(nZ̄ − Zi).

Because Z̄(·) = Z̄, we have Z̄(i) − Z̄(·) = (n − 1)⁻¹(Z̄ − Zi). The jackknife estimate of the sampling variance of Z̄ is therefore

VarJ Z̄ = [(n − 1)/n] ∑_{i=1}^n [(Zi − Z̄)/(n − 1)]² = [n(n − 1)]⁻¹ ∑_{i=1}^n (Zi − Z̄)² = s²/n.

In this case, the jackknife estimate coincides with the unbiased estimate of Var Z̄. □

When the sample size is small, the jackknife estimate is generally easier to compute than that based on the bootstrap, because it only requires n evaluations. Further, for estimators that are linear in the data, the θ(i) may usually be computed through simple recursive formulae.
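Tukey's formula is direct to implement in the scalar case (a sketch; the data values and the choice of the sample mean as estimator are illustrative):

```python
import statistics

def jackknife_variance(z, theta):
    """Tukey's jackknife estimate of the sampling variance of theta (scalar):
    Var_J = (n-1)/n * sum_i (theta_(i) - theta_(.))^2."""
    n = len(z)
    theta_i = [theta(z[:i] + z[i + 1:]) for i in range(n)]  # leave-one-out values
    theta_dot = sum(theta_i) / n
    return (n - 1) / n * sum((t - theta_dot) ** 2 for t in theta_i)

z = [2.0, 4.0, 6.0, 8.0, 10.0]
vj = jackknife_variance(z, statistics.mean)
# For the sample mean this equals s^2/n exactly, as in Example 1.9.
print(vj, statistics.variance(z) / len(z))
```

Only n evaluations of the estimator are needed, against B (often hundreds) for the bootstrap.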


Problems with the jackknife

The jackknife has two important drawbacks:

• It tends to be conservative, that is, the expectation of VarJ θ tends to exceed the actual sampling variance of θ (Efron & Stein 1981).

• It may fail when θ is a highly nonlinear function of the data. One example is its failure to correctly estimate the sampling variance of the sample median (Efron 1982).

One way of improving the quality of the jackknife estimates is to use the delete-d jackknife, which excludes not a single data point but subsets of d > 1 data points. Because the number of jackknife samples is in this case equal to the binomial coefficient (n choose d), the method loses its simplicity, especially when the sample size is large.

Instead of computing θ for each of the possible jackknife samples, an alternative is to randomly select a subset of them. With this modification, the jackknife tends to resemble the bootstrap.


2 Static regression

2.1 [September 2012] Let (X1, Y1), . . . , (Xn, Yn) be a random sample that satisfies

Yi = β⊤Xi + Ui, i = 1, . . . , n,

where β is an unknown parameter and the error Ui = Yi − β⊤Xi is independent of Xi with variance equal to σ²1 for the n1 observations in group G1 and to σ²2 for the n2 = n − n1 observations in group G2. Discuss the properties (exact or asymptotic) of the following four estimators of β:

a. The Ordinary Least Squares (OLS) estimator

βOLS = (∑_{i=1}^n XiXi⊤)⁻¹ ∑_{i=1}^n XiYi.

b. The Weighted Least Squares (WLS) estimator

βWLS = [∑_{j=1}^2 (1/σ²j) ∑_{i∈Gj} XiXi⊤]⁻¹ ∑_{j=1}^2 (1/σ²j) ∑_{i∈Gj} XiYi.

c. The feasible WLS estimator

βFWLS = [∑_{j=1}^2 (1/σ̂²j) ∑_{i∈Gj} XiXi⊤]⁻¹ ∑_{j=1}^2 (1/σ̂²j) ∑_{i∈Gj} XiYi,

with σ̂²j = (1/nj) ∑_{i∈Gj} Û²i, where Ûi = Yi − β̂⊤Xi is the ith OLS residual.

d. Another feasible WLS estimator

βFWLS = [∑_{j=1}^2 (1/σ̂²j) ∑_{i∈Gj} XiXi⊤]⁻¹ ∑_{j=1}^2 (1/σ̂²j) ∑_{i∈Gj} XiYi,

with σ̂²1 = γ̂ and σ̂²2 = γ̂ + δ̂, where γ̂ and δ̂ are the estimated intercept and slope from a regression of the squared OLS residuals Û²i on a constant and a 0-1 indicator equal to one if an observation is in G2 and to zero otherwise.
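As a quick numerical illustration of estimators a. and c. (not part of the original problem; a scalar Xi, the group sizes, and all parameter values are assumptions made here for simplicity):

```python
import random

rng = random.Random(0)
beta = 2.0
n1, n2 = 100, 100
sig = [1.0] * n1 + [3.0] * n2          # sigma_1 = 1 in G1, sigma_2 = 3 in G2
x = [rng.gauss(0, 1) for _ in range(n1 + n2)]
y = [beta * xi + rng.gauss(0, s) for xi, s in zip(x, sig)]

# OLS (scalar case): sum x_i y_i / sum x_i^2
b_ols = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

# Feasible WLS: estimate sigma_j^2 by the group mean of squared OLS residuals
u2 = [(yi - b_ols * xi) ** 2 for xi, yi in zip(x, y)]
s2_1 = sum(u2[:n1]) / n1
s2_2 = sum(u2[n1:]) / n2
w = [1 / s2_1] * n1 + [1 / s2_2] * n2
b_fwls = (sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
          / sum(wi * xi * xi for wi, xi in zip(w, x)))
print(b_ols, b_fwls)
```

Both estimators are consistent here; the weighted one downweights the noisy group and is the more efficient of the two.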


2.2 [May 2012] Let (X, Y) be a bivariate random variable that satisfies

X ∼ N(0, τ²), Y |X ∼ N(α + βX + γX², σ²),

where 0 < τ², σ² < ∞.

a. Compute the conditional mean function (CMF) of Y given X.

b. Compute the best linear predictor (BLP) of Y given X, E∗(Y |X).

c. Are the CMF and the BLP of Y given X equal? Do you know cases where the CMF and the BLP coincide?

d. Let V = Y − E∗(Y |X). Show that E V = 0 and Var V = σ² + 2γ²τ⁴. Comment.

e. Is V independent of X? Mean independent of X? Uncorrelated with X?

f. Let {(Xi, Yi), i = 1, . . . , n} be a random sample from the distribution of (X, Y), and let

βn = ∑_{i=1}^n (Xi − X̄n)(Yi − Ȳn) / ∑_{i=1}^n (Xi − X̄n)²,

where X̄n = n⁻¹ ∑_{i=1}^n Xi and Ȳn = n⁻¹ ∑_{i=1}^n Yi. Compute E(βn |X1, . . . , Xn). Comment.

g. Show that βn is a consistent estimator of β as n → ∞.

2.3 [January 2012] Let (X1, Y1), . . . , (Xn, Yn) be a random sample that satisfies

Yi = β⊤Xi + Ui, i = 1, . . . , n,

where β is an unknown parameter and Ui = Yi − β⊤Xi is mean independent of Xi with conditional variance σ²i = Var(Ui |Xi). Discuss the properties of the following four estimators of β, focusing on their pros and cons:

a. The Ordinary Least Squares (OLS) estimator

βOLS = (∑_{i=1}^n XiXi⊤)⁻¹ ∑_{i=1}^n XiYi.


b. The Weighted Least Squares (WLS) estimator

βWLS = (∑_{i=1}^n XiXi⊤/σ²i)⁻¹ ∑_{i=1}^n XiYi/σ²i.

c. The feasible WLS estimator

βFWLS = (∑_{i=1}^n XiXi⊤/σ̂²i)⁻¹ ∑_{i=1}^n XiYi/σ̂²i,

where σ̂²i = γ̂ + δ̂Z²i is an estimate of σ²i obtained by assuming that σ²i = γ + δZ²i, Zi is a variable supposed to fully explain the heteroskedasticity, and γ̂ and δ̂ are the estimates of γ and δ from a regression of the squared OLS residuals Û²i on a constant and Z²i.

d. The estimator

β = (∑_{i=1}^n XiXi⊤/Û²i)⁻¹ ∑_{i=1}^n XiYi/Û²i,

where the Û²i are again the squared OLS residuals.

2.4 [September 2011] Consider the linear regression model Yi = β⊤Xi + Ui, i = 1, . . . , n, where β ∈ ℝᵏ. Suppose that:

(i) {(Xi, Yi)} is a sequence of iid random vectors;

(ii) EXiUi = 0;

(iii) E XiXi⊤ = P, a finite pd k × k matrix;

(iv) VarXiUi = D, a finite symmetric pd k × k matrix.

Please answer the following questions:

a. Carefully discuss each of these four assumptions. (What is their meaning? When could they fail?)

b. Prove that the OLS estimator βn = (∑_{i=1}^n XiXi⊤)⁻¹ ∑_{i=1}^n XiYi exists with probability approaching one as n → ∞ and βn →p β.

c. Prove that βn is asymptotically normally distributed, and determine the form of its asymptotic variance AV(βn).


d. How would you construct a consistent estimator of AV(βn)?

e. How would you test a linear hypothesis of the form H0: Rβ = r, where R is a known q × k matrix and r is a known q-vector?

2.5 [January 2011] Let E0(Y |X) = γ0X denote the best proportional predictor (BPP) of Y given X under quadratic loss.

a. Show that γ0 = E(XY)/E(X²).

b. Let U = Y − γ0X denote the prediction error associated with the BPP. Is it true or false that E U = 0 and Cov(U, X) = 0? Explain.

c. If X is uniformly distributed on the interval [−1, 1] and the conditional distribution of Y given X = x is N(x² − x, 2):

c.1 Obtain the BPP of Y given X in this case and compare it with the best linear predictor of Y given X.

c.2 Compute the mean and variance of Y and the population R².

Now consider the problem of estimating γ0 from a random sample (X1, Y1), . . . , (Xn, Yn). The analogy principle suggests using the statistic γ = ∑_i XiYi / ∑_i X²i.

d. Prove that γ is consistent for γ0.

e. If the joint distribution of (X,Y ) is as in (c), is γ unbiased for γ0?

f. Prove that √n (γ − γ0) ⇒ N(0, φ²) as n → ∞, where

φ² = E(U²X²) / [E(X²)]².

g. Propose a consistent estimator of φ² and indicate precisely how you would construct an asymptotic 95% confidence interval for γ0.
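One way parts f. and g. can be illustrated numerically (a sketch, not part of the problem set, under the distributional assumptions of part c.; the plug-in estimator of φ² used here is the natural sample analogue, and the sample size is arbitrary):

```python
import math
import random

rng = random.Random(3)
n = 400
# X ~ U[-1, 1]; Y | X ~ N(x^2 - x, 2), as in part c. (conditional variance 2)
x = [rng.uniform(-1, 1) for _ in range(n)]
y = [xi * xi - xi + rng.gauss(0, math.sqrt(2)) for xi in x]

# Analogy estimator: gamma = sum X_i Y_i / sum X_i^2
sxx = sum(xi * xi for xi in x)
g = sum(xi * yi for xi, yi in zip(x, y)) / sxx

# Plug-in estimate of phi^2 = E(U^2 X^2) / [E(X^2)]^2 and a 95% CI
u = [yi - g * xi for xi, yi in zip(x, y)]
phi2 = (sum((ui * xi) ** 2 for ui, xi in zip(u, x)) / n) / (sxx / n) ** 2
half = 1.96 * math.sqrt(phi2 / n)
print(g, (g - half, g + half))
```

Under this design γ0 = E(XY)/E(X²) = (E X³ − E X²)/E X² = −1, so the estimate should land near −1 with the stated interval around it.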

2.6 [September 2010] Consider the classical linear model Y = Xβ + U, where E U = 0, Var U = In and

X⊤X = [  4  −2
        −2   3 ].

You are offered the choice of two jobs: estimate β1 + β2, or estimate β1 − β2. You will be paid the dollar amount 10 − (θ̂ − θ)², where θ̂ is your estimate and θ is the parameter combination that you have chosen to estimate.


a. To maximize your expected pay, which job should you take?

b. What pay will you expect to receive?

2.7 [September 2010] In a regression analysis of the relation between earnings and various personal characteristics, a researcher includes these explanatory variables along with six others:

X7 = 1 if female, 0 if male;    X8 = 1 if male, 0 if female;

but does not include a constant term.

a. Does the sum of residuals from her OLS regression equal zero?

b. Why did she not also include a constant term?

2.8 [January 2010] Suppose that U and X are continuous random variables and that {(Ui, Xi)} are n random draws from the joint distribution of U and X.

a. Show that, for any i ≠ j, the joint density function of (Ui, Uj, Xi, Xj) can be written as f(ui, xi) f(uj, xj), where f(u, x) is the joint density function of U and X.

b. Show that E(UiUj |Xi, Xj) = E(Ui |Xi) E(Uj |Xj) for i ≠ j.

c. Show that E(Ui |X1, . . . , Xn) = E(Ui |Xi).

d. Show that E(UiUj |X1, . . . , Xn) = E(Ui |Xi) E(Uj |Xj) for i ≠ j.

e. Why are these results important for regression?

f. What can you say about the conditional variance of Ui given X1, . . . , Xn? What about the conditional variance of Ui given Xi?

2.9 [January 2010] Consider the regression model Y = Xβ + Wγ + U, where X and W are, respectively, n × k1 and n × k2 matrices of regressors. Let Xi and Wi denote the ith rows of X and W, and let Yi and Ui denote, respectively, the ith elements of Y and U. Assume that:


(i) E(Ui |Xi, Wi) = δ⊤Wi, where δ is a k2-vector of unknown parameters;

(ii) {(Xi,Wi, Yi)} are i.i.d.;

(iii) (Xi, Wi, Ui) have nonzero finite fourth moments;

(iv) there is no perfect collinearity.

Let β̂ be the OLS estimator in the regression of Y on X and W, and let MW = In − W(W⊤W)⁻¹W⊤.

a. Show that β̂ − β = (n⁻¹X⊤MW X)⁻¹(n⁻¹X⊤MW U).

b. Show that n⁻¹X⊤MW X →p ΣXX − ΣXW ΣWW⁻¹ ΣWX, where ΣXX = E(XiXi⊤), ΣXW = E(XiWi⊤), etc.

c. Show that assumptions (i) and (ii) imply that E(U |X, W) = Wδ.

d. Use (c) and the law of iterated expectations to show that n⁻¹X⊤MW U →p 0.

e. Conclude that β̂ →p β.

f. Interpret this result.

2.10 [January 2009] Consider the simple regression model Yi = β0 + β1Xi + ui, and suppose that it satisfies all the classical assumptions. Suppose that Yi is measured with error, so that the data are Ỹi = Yi + wi, where wi is the measurement error, which is i.i.d. and independent of Yi and Xi with finite fourth moment. Consider the population regression Ỹi = β0 + β1Xi + vi, where vi is the regression error using the mismeasured dependent variable Ỹi.

a. Show that vi = wi + ui.

b. Show that the regression Ỹi = β0 + β1Xi + vi satisfies all the classical regression assumptions.

c. Are the OLS estimators of β0 and β1 from such regression consistent?

d. Can confidence intervals be constructed in the usual way?

e. Evaluate these statements: “Measurement errors in the X’s is a serious problem. Measurement errors in Y is not.”


2.11 [January 2009] Consider the linear regression model with a single regressor Yi = β0 + β1Xi + ui, i = 1, . . . , n, where (X1, Y1), . . . , (Xn, Yn) are i.i.d. random vectors, and the conditional distribution of ui given Xi is normal with mean zero and variance var(ui |Xi) = θ0 + θ1|Xi|, where |Xi| is the absolute value of Xi, θ0 > 0 and θ1 ≥ 0.

a. Is the OLS estimator of β1 BLUE? Explain your answer.

b. Suppose that θ0 and θ1 are known. What is the BLUE estimator of β1?

c. Derive the exact sampling distribution of the OLS estimator, conditional on X1, . . . , Xn.

d. Derive the exact sampling distribution of the WLS estimator (treating θ0 and θ1 as known), conditional on X1, . . . , Xn.

e. What can you say about the sampling properties of the feasible WLS estimator (replacing θ0 and θ1 by estimates)?

2.12 [January 2009] Consider the classical linear regression model Yi = β1Xi + β2Wi + ui, i = 1, . . . , n, where for simplicity the intercept is omitted and all variables are assumed to have a mean of zero. Suppose that Xi is distributed independently of (Wi, ui) but Wi and ui might be correlated, and let β̂1 and β̂2 be the OLS estimators for this model. Show that:

a. Whether or not Wi and ui are correlated, β̂1 is consistent for β1.

b. If Wi and ui are correlated, then β̂2 is inconsistent for β2.

c. Let β̃1 be the OLS estimator from the regression of Y on X (the restricted regression that excludes W). Provide conditions under which β̃1 has a smaller asymptotic variance than β̂1, allowing for the possibility that Wi and ui are correlated.

2.13 [January 2008] Consider the linear model Yi = α + β1X1i + β2X2i + Ui, i = 1, . . . , n.

a. Explain how you would test the null hypothesis that β1 = 0.

b. Explain how you would test the null hypothesis that β2 = 0.

c. Explain how you would test the null hypothesis that β1 = β2 = 0.

d. Why isn’t the result of the third test implied by the results of the first two?


3 IV and GMM

3.1 [September 2013] Suppose you have individual data from a developing country and you want to test whether nutrition affects productivity using the model

Y = δ0 + δ1 exper + δ2 exper² + δ3 educ + α1 calories + α2 protein + U,

where Y is the logarithm of worker productivity, exper is working experience, educ is educational attainment, calories is caloric intake per day, protein is a measure of protein intake per day, and U consists of other unobservable determinants of Y.

a. Explain why calories and protein may be endogenous.

b. The literature has proposed regional prices of various goods (e.g. grains, meats, breads, dairy products, etc.) as possible instruments for calories and protein. Under what circumstances are these instruments valid?

c. What happens if these prices reflect quality of food?

d. If prices are a valid set of instruments, how many prices would you need to identify α1 and α2?

e. How many prices would you need to test the null hypothesis that they are valid instruments?

f. Suppose you have M prices. Explain how you would test the null hypothesis that calories and protein are exogenous.

3.2 [May 2013] Consider a general instrumental variables (IV) estimator of the k-dimensional parameter θ in the structural linear model

Yi = Xi⊤θ + Ui, i = 1, . . . , n, (8)

where Xi and Ui are potentially correlated and there exists a valid set of r instruments Wi. Show how to cast the general IV estimator of model (8) as a GMM estimator. In particular:

a. Show what is the relevant moment function h(z; θ).

b. Show what is the population moment restriction that identifies the population parameter θ0.

c. Show what is the asymptotic variance of hn(θ0) = n⁻¹ ∑_{i=1}^n h(Zi; θ0).


d. Show what is the relevant P matrix. Recall that P = E [h′n(θ0)].

e. Show what is the weighting matrix of the asymptotically efficient GMM estimator and compare it with the weighting matrix of the two-stage least squares estimator of model (8).

3.3 [February 2013] Consider a model for the health of an individual

health = β0 + β1 age + β2 weight + β3 height + β4 male + β5 work + β6 exercise + U, (9)

where health is some quantitative measure of the person's health, age, weight, height and male are self-explanatory, work is weekly hours worked, and exercise is the hours of exercise per week.

a. Why might you be concerned about exercise being correlated with the error term U?

b. Suppose you can collect data on two additional variables, disthome and distwork, the distances from home and from work to the nearest health club or gym. Discuss whether these are likely to be correlated with U.

c. Now assume that disthome and distwork are in fact uncorrelated with U, as are all regressors in equation (9) with the exception of exercise. Write down the reduced form for exercise, and state the conditions under which the parameters of equation (9) are identified.

d. How can the identification assumption in part c. be tested?

3.4 [September 2012] Consider the structural linear model

Yi1 = αYi2 + βXi1 + Ui, i = 1, . . . , n, (10)

where Xi1 is exogenous but Yi2 is suspected to be endogenous. The reduced-form equation for Yi2 is

Yi2 = π1Xi1 + π2Xi2 + Vi, (11)

where Xi2 is another exogenous variable.

a. Give a set of conditions on {(Yi1, Yi2, Xi1, Xi2)}, i = 1, . . . , n, under which the two-stage least squares (2SLS) estimator of model (10) using Xi1 and Xi2 as instruments is consistent for (α, β).


b. Show that the 2SLS estimator of α is the ratio of two ordinary least squares (OLS) estimators, each using Xi1 and Xi2 as regressors. What implications does this have for the sampling properties of the 2SLS estimator of α?

c. Consider estimating first the reduced form (11) by OLS, saving the OLS residuals V̂i, and then running an OLS regression of Yi1 on Yi2, Xi1 and V̂i. Use the rules for partitioned regressions and the properties of OLS residuals to show that the coefficients on Yi2 and Xi1 from this regression are identical to the 2SLS estimates of α and β.

d. Discuss the result in c.
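The algebraic equivalence claimed in part c. can be checked numerically (a sketch with simulated data; the tiny normal-equations solver and all parameter values are illustrative, not part of the problem):

```python
import random

def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

rng = random.Random(7)
n = 500
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [rng.gauss(0, 1) for _ in range(n)]
v = [rng.gauss(0, 1) for _ in range(n)]
u = [0.8 * vi + rng.gauss(0, 1) for vi in v]                 # U correlated with V
y2 = [0.5 * a + 1.0 * b + c for a, b, c in zip(x1, x2, v)]   # reduced form (11)
y1 = [1.0 * a + 2.0 * b + c for a, b, c in zip(y2, x1, u)]   # model (10): alpha=1, beta=2

# First stage: fit the reduced form, form fitted values and residuals Vhat
p1, p2 = ols([[a, b] for a, b in zip(x1, x2)], y2)
y2hat = [p1 * a + p2 * b for a, b in zip(x1, x2)]
vhat = [c - d for c, d in zip(y2, y2hat)]

# 2SLS: regress Y1 on (Y2hat, X1)
a_2sls, b_2sls = ols([[a, b] for a, b in zip(y2hat, x1)], y1)
# Control-function regression: Y1 on (Y2, X1, Vhat)
a_cf, b_cf, _ = ols([[a, b, c] for a, b, c in zip(y2, x1, vhat)], y1)
print(a_2sls, a_cf, b_2sls, b_cf)
```

The two sets of coefficients agree up to floating-point error, which is exactly the identity part c. asks you to prove.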

3.5 [June 2012] Let Z1, . . . , Zn be a sample from the distribution of the random vector Z = (X, W1, W2, Y). Suppose that Z has finite second moments and that

E W1(Y − βX) = E W2(Y − βX) = 0 (12)

for some β ∈ ℝ.

a. Show that (12) suggests two simple IV estimators of the parameter β, say β1n and β2n. Are these estimators consistent?

b. Show that the 2SLS estimator based on the instrument vector (W1, W2) is a linear combination of β1n and β2n. [Hint: The 2SLS estimator may be viewed as a simple IV estimator that uses as instrument the OLS predictor of X.]

c. Give sufficient conditions for βn = (β1n, β2n) to be asymptotically normal, with an asymptotic variance matrix V whose jkth element is

vjk = E[(Y − βX)² Wj Wk] / [(E XWj)(E XWk)], j, k = 1, 2.

d. What conditions are necessary for the 2SLS estimator in b. to be asymptotically efficient in the class of estimators based on the moment conditions (12)?

3.6 [September 2011] Consider the structural linear model Yi = β⊤Xi + Ui, i = 1, . . . , n, where β ∈ ℝᵏ is the parameter of interest, Xi and Ui are potentially correlated, and there exists an r-vector Wi of potential instruments. Suppose that:


(i) {(Wi, Xi, Yi)} is a sequence of iid random vectors.

(ii) EWiUi = 0.

(iii) E WiXi⊤ = P, a finite r × k matrix of rank k ≤ r.

(iv) An →p A, a finite symmetric pd r × r matrix.

(v) Var Wi(Yi − β⊤Xi) = D, a finite symmetric pd r × r matrix.

Please answer the following questions:

a. Carefully discuss each of these five assumptions. (What is their meaning? When could they fail?)

b. Derive the IV estimator βn of β based on the above assumptions.

c. Prove that the IV estimator βn exists with probability approaching one as n → ∞ and βn →p β.

d. Prove that βn is asymptotically normally distributed and determine the form of its asymptotic variance AV(βn).

e. How would you construct a consistent estimator of AV(βn)?

f. Derive an asymptotically optimal estimator in this class and discuss how you would use it to test the moment restriction (ii).

3.7 [September 2010] Let {(Xi, Yi), i = 1, . . . , n} be a random sample from the nonlinear model

E(Yi |Xi) = m(Xi, β0), β0 ∈ B,
Var(Yi |Xi) = σ²0 h(Xi, γ0), γ0 ∈ Γ,

where m and h are known functions. Consider the weighted nonlinear least squares (WNLS) estimator of β0, which solves

min_{β∈B} ∑_{i=1}^n [Yi − m(Xi, β)]² / h(Xi, γ̂),

where γ̂ satisfies √n (γ̂ − γ0) = Op(1) as n → ∞. Suppose that

E [m(Xi, β0) − m(Xi, β)]² / h(Xi, γ0) > 0

for all β ∈ B such that β ≠ β0.


a. Show that β̂ is consistent for β0. State clearly the key conditions that you need.

b. Show that β̂ has asymptotic variance equal to that of the efficient GMM estimator based on the orthogonality condition

E[Yi − m(Xi, β0) |Xi] = 0.

c. When does the (unweighted) NLLS estimator of β0 achieve the efficiency bound derived in part b.?

d. Suppose that, in addition to E(Yi |Xi) = m(Xi, β0), you use the restriction Var(Yi |Xi) = σ²0 for some σ²0 > 0. Write down the two conditional moment restrictions for estimating β0 and σ²0. What are the efficient instrumental variables in this case?

3.8 [May 2010] Consider the model

Yi1 = αYi2 + βZi1 + Ui, i = 1, . . . , n,

where Yi2 is suspected to be endogenous. The reduced-form equation for Yi2 is

Yi2 = π⊤Zi + Vi,

where Zi = (Zi1, Zi2) are the available exogenous variables.

a. Show that a consistent estimator of (α, β) is the 2SLS estimator using Zi as the instrument vector.

b. Now consider estimating first the reduced form by OLS, saving the OLS residuals V̂i, and then running an OLS regression of Yi1 on Yi2, Zi1 and V̂i. Use the rules for partitioned regressions and the properties of OLS residuals to show that the coefficients on Yi2 and Zi1 from this regression are identical to the 2SLS estimates of (α, β).

c. Discuss this result.


3.9 [May 2010] Consider the three-equation model

Yi = βXi + Ui,
Xi = λUi + εi,
Zi = γεi + Vi, i = 1, . . . , n,

where the errors Ui, εi and Vi are mutually independent zero-mean normal random variables with variances σ²U, σ²ε and σ²V respectively.

a. Show that the OLS estimator of β in the regression of Yi on Xi satisfies

plim (βOLS − β) = λσ²U / (λ²σ²U + σ²ε).

b. Derive the correlations between Zi and Ui and Xi respectively, and determine under what conditions Zi is a valid instrument for estimating β.

c. Show that the IV estimator of β using Zi as instrument satisfies

βIV = β + mZU / (λmZU + mZε),

where mZU = n⁻¹ ∑_{i=1}^n ZiUi and mZε = n⁻¹ ∑_{i=1}^n Ziεi.

d. Show that βIV − β →p 1/λ as γ → 0.

e. Show that βIV − β →p ∞ as mZU → −γσ²ε/λ.

f. What do the last two results imply regarding finite-sample biases and the moments of βIV when the instruments are poor?

3.10 [May 2009] Consider the problem of estimating the parameter φ = (φ1, φ2) in the ARMA(2,2) model

Yt = φ1Yt−1 + φ2Yt−2 + εt − θ1εt−1 − θ2εt−2,

where {εt} is a white-noise process with zero mean and finite nonzero variance σ², φ1, φ2 ≠ 0, and there are no common factors.

a. What are the sampling properties of the ordinary least squares (OLS) estimator of φ?


b. Provide a simple instrumental variable (IV) estimator of φ, discuss the conditions for its validity, and derive its sampling properties.

c. Is your proposed IV estimator asymptotically efficient? Discuss.

d. Discuss what problems you may face when trying to construct an asymptotically efficient IV estimator of φ.

3.11 [May 2009] Consider estimating the regression parameter β in the linear model Yi = β⊤Xi + Ui, i = 1, . . . , n, by the method of instrumental variables.

a. Define the problem of “weak” instruments.

b. Define the problem of “too many” instruments.

c. Discuss the relationship between the two problems.

d. Briefly discuss the relevance of the two problems for empirical work.

3.12 [May 2008] Consider the linear model Yi = α + β1X1i + β2X2i + Ui, i = 1, . . . , n, where X2i is correlated with both X1i and the regression error Ui.

a. Discuss the bias and consistency properties of the ordinary least squares (OLS) estimator in the regression of Yi on a constant, X1i and X2i.

b. Discuss what conditions are needed for consistent estimation of β1 using the instrumental variables (IV) method.

c. Discuss what conditions are needed for consistent estimation of both β1 and β2 using the IV method.

d. Discuss the relative efficiency of the estimators proposed in b. and c.

3.13 [May 2008] Discuss the problem of weak instruments. In particular:

a. Analyze the consequences of weak instruments for both the finite-sample and the large-sample properties of instrumental variables (IV) estimators.

b. Discuss ways of testing for the presence of weak instruments.

c. Provide an example of an empirical study where weak instruments may be an issue.


4 Linear panel data models

4.1 [June 2013] Consider observations (Xit, Yit) from the linear panel data model

Yit = αi + βXit + γt + Uit,

where i = 1, . . . , n and t = 1, . . . , T .

(a) Give three good reasons for using the fixed effects (FE) estimator of β.

(b) Give three good reasons for using instead the random effects (RE) estimator of β.

4.2 [June 2012] Consider observations (Xit, Yit) from the linear panel data model

Yit = αi + βXit + γt + Uit,

where i = 1, . . . , n and t = 1, . . . , T .

a. How would you estimate β?

b. Suppose that, instead of being linear, the common time trend is an arbitrary function of time. How would you estimate β in this case?

c. Suppose that, instead of a common time trend, we have an unobserved individual-specific time trend γit. How would you estimate β in this case?

4.3 [June 2011] You managed to collect longitudinal data on the annual investments Yit made by the n firms in a representative sample. Economic theory suggests the following basic model

Yit = αi + βYi,t−1 + γ⊤Xit + Uit, i = 1, . . . , n, t = 1, . . . , T,

where αi is a firm-specific effect, Xit is a vector of exogenous variables, β and γ are unknown parameters, and Uit follows an AR(1) process.

a. Discuss how you would consistently estimate the unknown parameters β and γ. State carefully all your assumptions.

b. What are the sampling properties of the estimator that you intend to use?

c. How would you test the hypothesis of no state dependence?


d. How would you test that your model is correctly specified?

4.4 [June 2010] Consider the panel-data model

Yit = α + βXit + Uit, i = 1, . . . , n, t = 1, . . . , T,

where α and β are scalars.

a. Show that this model implies

Yit − Ȳ = β(Xit − X̄i) + β(X̄i − X̄) + Vit,

where X̄i = T⁻¹ ∑_t Xit, X̄ = n⁻¹ ∑_i X̄i, etc., and Vit = Uit − Ū.

b. For the corresponding unrestricted regression

Yit − Ȳ = β1(Xit − X̄i) + β2(X̄i − X̄) + Vit,

show that the OLS estimator of β1 coincides with the within-group estimator and that of β2 with the between-group estimator.

c. Discuss this result.
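The within-group and between-group estimators in part b. can be computed directly (a simulation sketch, not part of the problem; under the common-intercept model of this exercise both are consistent for β, and all parameter values are illustrative):

```python
import random

rng = random.Random(5)
n, T, beta, alpha = 60, 8, 1.5, 0.5
x = [[rng.gauss(i * 0.1, 1) for _ in range(T)] for i in range(n)]   # Xit
y = [[alpha + beta * x[i][t] + rng.gauss(0, 1) for t in range(T)] for i in range(n)]

xb_i = [sum(r) / T for r in x]   # group means Xbar_i
yb_i = [sum(r) / T for r in y]
xbar = sum(xb_i) / n             # grand means
ybar = sum(yb_i) / n

# Within-group estimator: OLS of (Yit - Ybar_i) on (Xit - Xbar_i)
num_w = sum((x[i][t] - xb_i[i]) * (y[i][t] - yb_i[i])
            for i in range(n) for t in range(T))
den_w = sum((x[i][t] - xb_i[i]) ** 2 for i in range(n) for t in range(T))
b_within = num_w / den_w

# Between-group estimator: OLS of (Ybar_i - Ybar) on (Xbar_i - Xbar)
num_b = sum((xb_i[i] - xbar) * (yb_i[i] - ybar) for i in range(n))
den_b = sum((xb_i[i] - xbar) ** 2 for i in range(n))
b_between = num_b / den_b
print(b_within, b_between)
```

The two estimators use, respectively, only the time-series variation within groups and only the cross-section variation between group means.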

4.5 [June 2010] Consider a model for new capital investment in a particular industry (say, manufacturing), where the cross-section observations are at the regional level and there are T years of data for each region:

log(Investit) = θt + γ⊤Zit + δ1Taxit + δ2Disasterit + αi + Uit.

The variable Taxit is a measure of the marginal tax rate on capital and Disasterit an indicator for a significant natural disaster (for example, a major flood or an earthquake) in region i at time t, the variables in Zit are other factors affecting capital investment, the θt are aggregate time effects, and the αi are region-specific effects.

a. Why is allowing for aggregate time effects in the equation important?

b. What kind of variables are captured in αi?

c. Interpreting the equation in a causal fashion, what sign does economic reasoning suggest for δ1?


d. Explain in detail how you would estimate this model. Be specific about all the assumptions you are making.

e. Discuss whether strict exogeneity is reasonable for the two variables Taxit and Disasterit (assume that neither of them has a lagged effect on capital investment). [Recall that, in a panel context, a vector of variables X is strictly exogenous relative to the outcome Y if E(Yit |Xi1, . . . , XiT) = E(Yit |Xit), and is strictly exogenous conditional on the individual-specific effect αi if E(Yit |Xi1, . . . , XiT, αi) = E(Yit |Xit, αi).]


5 Models for discrete and limited dependent variables

5.1 [June 2013] Consider the latent variable model

Y∗i = α + β⊤Xi + Ui, i = 1, . . . , n,

where Xi is a random k-vector of regressors and the Ui are random errors independently and identically distributed as N(0, σ²) conditionally on the Xi. Suppose you observe only Yi = 0 if Y∗i < γ, Yi = 1 if γ ≤ Y∗i < ξi, and Yi = 2 if Y∗i ≥ ξi, where the ξi are known constants which may differ across individuals, but α and β are unknown.

a. Compute the conditional probabilities that Yi = 0, Yi = 1 and Yi = 2.

b. Which model parameters, or combination thereof, are identifiable?

c. Provide details on an estimation method to consistently estimate the identifiable model parameters.

d. What can you say about the sampling distribution of the proposed estimator?

5.2 [June 2012] Let (X1, Y1), . . . , (Xn, Yn) be a random sample from the joint distribution of (X, Y), with Y = 1{Y∗ > 0}, where 1{A} is the indicator function of the event A. We assume that the latent variable Y∗ satisfies the heteroskedastic linear model

Y∗ = α + βX + exp(γ + δX) U,

where U is distributed independently of X with mean zero and distribution function (df) G, and α, β, γ and δ are unknown parameters.

a. Discuss the properties of this model and the interpretation of the model parameters.

b. Propose a parametric maximum likelihood estimator of α and β. Under what conditions is it consistent?

c. Propose a semiparametric estimator of α and β that does not require knowledge of the scale function exp(γ + δX), nor knowledge of the df G. Under what conditions is it consistent?

d. What can you say about the asymptotic distribution of the two estimators?
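A sketch of the parametric route in part b, under assumptions that are mine: G is the standard normal df, and γ is normalized to zero (with normal U, α and β are only identified up to the scale exp(γ), so some normalization is needed). The simulated design and seed are also assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(2)
n = 10000
a_true, b_true, d_true = 0.3, 1.0, 0.5     # gamma normalized to 0
x = rng.uniform(-1.0, 1.0, size=n)
ystar = a_true + b_true * x + np.exp(d_true * x) * rng.normal(size=n)
y = (ystar > 0).astype(float)

def nll(params):
    a, b, d = params
    # With U ~ N(0,1):  P(Y=1|x) = Phi((a + b x) * exp(-d x))
    p = np.clip(norm.cdf((a + b * x) * np.exp(-d * x)), 1e-12, 1 - 1e-12)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p)).sum()

res = minimize(nll, x0=[0.0, 0.0, 0.0], method="BFGS")
a_hat, b_hat, d_hat = res.x
```

Note how heteroskedasticity enters the choice probability: dividing the index by exp(δx) is what distinguishes this likelihood from an ordinary probit, and ignoring it would generally render the probit estimator of α and β inconsistent.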


5.3 [June 2011] Let (X1, Y1), . . . , (Xn, Yn) be a random sample from the joint distribution of (X,Y ), where Y = 1{α + βX + U > 0}, 1{A} is the indicator function of the event A, α and β are unknown parameters, and U is distributed independently of X with mean zero and distribution function G.

a. Propose a maximum likelihood (ML) estimator of α and β. State carefully all the assumptions you make.

b. Propose a least squares (LS) estimator of α and β. State carefully all the assumptions you make.

c. Under what assumptions are the two estimators consistent for α and β?

d. Under what assumptions is one estimator “more efficient” than the other? Explain exactly what you mean by “more efficient”? 
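The two estimators in parts a and b can be set up side by side. A minimal sketch under the additional assumption (mine, not the problem's) that G is the standard normal df, so both estimators target P(Y = 1 |x) = Φ(α + βx); the design and seed are also assumptions:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(3)
n = 5000
a_true, b_true = 0.2, 1.0
x = rng.normal(size=n)
y = (a_true + b_true * x + rng.normal(size=n) > 0).astype(float)

def prob(th):
    # P(Y = 1 | x) = G(alpha + beta x), here with G = Phi
    return np.clip(norm.cdf(th[0] + th[1] * x), 1e-12, 1 - 1e-12)

# ML: maximize the Bernoulli log likelihood
nll = lambda th: -(y * np.log(prob(th)) + (1 - y) * np.log(1 - prob(th))).sum()
# NLS: minimize the sum of squared deviations of Y from its conditional mean
ssr = lambda th: ((y - prob(th)) ** 2).sum()

ml = minimize(nll, x0=[0.0, 0.0], method="BFGS").x
ls = minimize(ssr, x0=[0.0, 0.0], method="BFGS").x
```

Both are M-estimators in the sense of Section 1: each maximizes a data-dependent criterion over (α, β), and both are consistent here; they differ in their asymptotic variances, which is what part d is after.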

5.4 [June 2010] Consider the latent variable model

Y∗i = β>Xi + Ui, i = 1, . . . , n,

where Xi is a k-vector of regressors (k ≥ 3) and the Ui are random errors independently and identically distributed as N(0, 1) conditionally on the Xi. Suppose you observe only Yi = 1 if Y∗i < ξi and Yi = 0 if Y∗i ≥ ξi, where the ξi are known constants which may differ across individuals.

a. Find E(Yi |Xi).

b. Provide details on an estimation method to consistently estimate β.

c. Suppose that you estimate this model and find that the third regressor X3i has estimated coefficient β3 = 0.2. Provide a meaningful interpretation of β3.

5.5 [June 2010] Consider the latent variable model

Y∗i = β>Xi + Ui, i = 1, . . . , n,

where Xi is a k-vector of regressors (k ≥ 3) and the Ui are random errors independently and identically distributed as N(0, 1) conditionally on the Xi. Suppose you observe only Yi = 2 if Y∗i < α, Yi = 1 if α ≤ Y∗i < ξi, and Yi = 0 if Y∗i ≥ ξi, where the ξi are known constants which may differ across individuals, but α and β are unknown.

a. Compute the conditional probabilities that Yi = 0, Yi = 1 and Yi = 2.

b. Provide details on an estimation method to consistently estimate α and β.


References

Amemiya T. (1985) Advanced Econometrics, Harvard University Press, Cambridge, MA.

Breusch T.S., and Pagan A.R. (1980) The Lagrange Multiplier Test and Its Applications to Model Specification in Econometrics. Review of Economic Studies, 47: 239–253.

Cameron A.C., and Trivedi P.K. (2005) Microeconometrics: Methods and Applications, Cambridge University Press, New York.

Chow G.C. (1960) Tests of Equality Between Sets of Coefficients in Two Linear Regressions. Econometrica, 28: 591–605.

Cook R.D. (1977) Detection of Influential Observations in Linear Regression. Tech-nometrics, 19: 15–18.

Davison A.C., and Hinkley D.V. (1997) Bootstrap Methods and Their Application, Cambridge University Press, Cambridge, UK.

Efron B. (1982) The Jackknife, the Bootstrap and Other Resampling Plans, SIAM,Philadelphia, PA.

Efron B., and Stein C. (1981) The Jackknife Estimate of Variance. Annals ofStatistics, 9: 586–596.

Efron B., and Tibshirani R. (1993) An Introduction to the Bootstrap, Chapman and Hall, New York.

Goldberger A.S. (1991) A Course in Econometrics, Harvard University Press, Cambridge, MA.

Hall P.J. (1992) The Bootstrap and Edgeworth Expansions, Springer, New York.

Hansen L.P. (1982) Large Sample Properties of Generalized Method of Moments Estimators. Econometrica, 50: 1029–1054.

Hausman J.A. (1978) Specification Tests in Econometrics. Econometrica, 46: 1251–1271.

Holland P.W. (1986) Statistics and Causal Inference. Journal of the American Statistical Association, 81: 945–960.


Maddala G.S. (1983) Limited-Dependent and Qualitative Variables in Econometrics, Cambridge University Press, New York.

Manski C.F. (1991) Regression. Journal of Economic Literature, 29: 34–50.

Manski C.F. (1995) Identification Problems in the Social Sciences, Harvard University Press, Cambridge, MA.

Newey W.K. (1991) Uniform Convergence in Probability and Stochastic Equicontinuity. Econometrica, 59: 1161–1167.

Newey W.K., and McFadden D.L. (1994) Large Sample Estimation and Hypothesis Testing. In Engle R.F. and McFadden D.L. (eds.) Handbook of Econometrics, Vol. 4, pp. 2111–2245, North-Holland, Amsterdam.

Newey W.K., and West K. (1987) A Simple, Positive Semi-Definite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix. Econometrica, 55: 703–708.

Peracchi F. (2001) Econometrics, Wiley, Chichester (UK).

Pollard D. (1991) Asymptotics for Least Absolute Deviation Regression Estimators.Econometric Theory, 7: 186–199.

Rothenberg T.J. (1973) Efficient Estimation With A Priori Information, Yale University Press, New Haven.

Stock J.H., and Watson M.W. (2006) Introduction to Econometrics (2nd ed.), Addison Wesley, Boston.

Tukey J. (1958) Bias and Confidence in Not Quite Large Samples. Annals ofMathematical Statistics, 29: 614.

White H. (1980) A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity. Econometrica, 48: 817–838.

White H. (2001) Asymptotic Theory for Econometricians (2nd ed.), Academic Press, San Diego, CA.

Wooldridge J.M. (2010) Econometric Analysis of Cross Section and Panel Data (2nd ed.), MIT Press, Cambridge, MA.
