Locally Robust Semiparametric Estimation
Victor Chernozhukov
MIT
Juan Carlos Escanciano
Indiana University
Hidehiko Ichimura
University of Tokyo
Whitney K. Newey
MIT
James R. Robins
Harvard University
April 2018
Abstract
We give a general construction of debiased/locally robust/orthogonal (LR) moment
functions for GMM, where the derivative with respect to first step nonparametric estima-
tion is zero and equivalently first step estimation has no effect on the influence function.
This construction consists of adding an estimator of the influence function adjustment term
for first step nonparametric estimation to identifying or original moment conditions. We
also give numerical methods for estimating LR moment functions that do not require an
explicit formula for the adjustment term.
LR moment conditions have reduced bias and so are important when the first step is
machine learning. We derive LR moment conditions for dynamic discrete choice based on
first step machine learning estimators of conditional choice probabilities.
We provide simple and general asymptotic theory for LR estimators based on sample
splitting. This theory uses the additive decomposition of LR moment conditions into an
identifying condition and a first step influence adjustment. Our conditions require only
mean square consistency and a few (generally either one or two) readily interpretable rate
conditions.
LR moment functions have the advantage of being less sensitive to first step estimation.
Some LR moment functions are also doubly robust meaning they hold if one first step
is incorrect. We give novel classes of doubly robust moment functions and characterize
double robustness. For doubly robust estimators our asymptotic theory only requires one
rate condition.
Keywords: Local robustness, orthogonal moments, double robustness, semiparametric
estimation, bias, GMM.
JEL classification: C13; C14; C21; D24
1 Introduction
There are many economic parameters that depend on nonparametric or large dimensional first
steps. Examples include dynamic discrete choice, games, average consumer surplus, and treat-
ment effects. This paper shows how to construct moment functions for GMM estimators that are
debiased/locally robust/orthogonal (LR), where moment conditions have a zero derivative with
respect to the first step. We show that LR moment functions can be constructed by adding the
influence function adjustment for first step estimation to the original moment functions. This
construction can also be interpreted as a decomposition of LR moment functions into identifying
moment functions and a first step influence function term. We use this decomposition to give
simple and general conditions for root-n consistency and asymptotic normality, with different
properties being assumed for the identifying and influence function terms. The conditions are
easily interpretable mean square consistency and second order remainder conditions based on
estimated moments that use cross-fitting (sample splitting). We also give numerical estimators
of the influence function adjustment.
LR moment functions have several advantages. LR moment conditions bias correct in a way
that eliminates the large biases from plugging in first step machine learning estimators found
in Belloni, Chernozhukov, and Hansen (2014). LR moment functions can be used to construct
debiased/double machine learning (DML) estimators, as in Chernozhukov et al. (2017, 2018).
We illustrate by deriving LR moment functions for dynamic discrete choice estimation based
on conditional choice probabilities. We provide a DML estimator for dynamic discrete choice that
uses first step machine learning of conditional choice probabilities. We find that it performs well
in a Monte Carlo example. Such structural models provide a potentially important application
of DML, because of potentially high dimensional state spaces. Adding the first step influence
adjustment term provides a general way to construct LR moment conditions for structural
models so that machine learning can be used for first step estimation of conditional choice
probabilities, state transition distributions, and other unknown functions on which structural
estimators depend.
LR moment conditions also have the advantage of being relatively insensitive to small vari-
ation away from the first step true function. This robustness property is appealing in many
settings where it may be difficult to get the first step completely correct. Many interesting and
useful LR moment functions have the additional property that they are doubly robust (DR),
meaning moment conditions hold when one first step is not correct. We give novel classes of DR
moment conditions, including for average linear functionals of conditional expectations and prob-
ability densities. The construction of adding the first step influence function adjustment to an
identifying moment function is useful to obtain these moment conditions. We also give necessary
and sufficient conditions for a large class of moment functions to be DR. We find DR moments
have simpler and more general conditions for asymptotic normality, which helps motivate our
consideration of DR moment functions as special cases of LR ones. LR moment conditions also
help minimize sensitivity to misspecification as in Bonhomme and Weidner (2018).
LR moment conditions have smaller bias from first step estimation. We show that they have
the small bias property of Newey, Hsieh, and Robins (2004), that the bias of the moments is of
smaller order than the bias of the first step. This bias reduction leads to substantial improve-
ments in finite sample properties in many cases relative to just using the original moment con-
ditions. For dynamic discrete choice we find large bias reductions, moderate variance increases
and even reductions in some cases, and coverage probabilities substantially closer to nominal.
For machine learning estimators of the partially linear model, Chernozhukov et al. (2017, 2018)
found bias reductions so large that the LR estimator is root-n consistent but the estimator based
on the original moment condition is not. Substantial improvements were previously also found
for density weighted averages by Newey, Hsieh, and Robins (2004, NHR). The twicing kernel
estimators in NHR are numerically equal to LR estimators based on the original (before twicing)
kernel, as shown in Newey, Hsieh, and Robins (1998), and the twicing kernel estimators were shown
to have smaller mean square error in large samples. Also, a Monte Carlo example in NHR finds
that the mean square error (MSE) of the LR estimator has a smaller minimum and is flatter as
a function of bandwidth than the MSE of Powell, Stock, and Stoker’s (1989) density weighted
average derivative estimator. We expect similar finite sample improvements from LR moments
in other cases.
LR moment conditions have appeared in earlier work. They are semiparametric versions of
Neyman (1959) C-alpha test scores for parametric models. Hasminskii and Ibragimov (1978)
suggested LR estimation of functionals of a density and argued for their advantages over plug-in
estimators. Pfanzagl and Wefelmeyer (1981) considered using LR moment conditions for im-
proving the asymptotic efficiency of functionals of distribution estimators. Bickel and Ritov
(1988) gave a LR estimator of the integrated squared density that attains root-n consistency
under minimal conditions. The Robinson (1988) semiparametric regression and Ichimura (1993)
index regression estimators are LR. Newey (1990) showed that LR moment conditions can be
obtained as residuals from projections on the tangent set in a semiparametric model. Newey
(1994a) showed that derivatives of an objective function where the first step has been "concen-
trated out" are LR, including the efficient score of a semiparametric model. NHR (1998, 2004)
gave estimators of averages that are linear in density derivative functionals with remainder rates
that are as fast as those in Bickel and Ritov (1988). Doubly robust moment functions have
been constructed by Robins, Rotnitzky, and Zhao (1994, 1995), Robins and Rotnitzky (1995),
Scharfstein, Rotnitzky, and Robins (1999), Robins, Rotnitzky, and van der Laan (2000), Robins
and Rotnitzky (2001), Graham (2011), and Firpo and Rothe (2017). They are widely used for
estimating treatment effects, e.g. Bang and Robins (2005). Van der Laan and Rubin (2006) de-
veloped targeted maximum likelihood to obtain a LR estimating equation based on the efficient
influence function of a semiparametric model. Robins et al. (2008, 2017) showed that efficient
influence functions are LR, characterized some doubly robust moment conditions, and developed
higher order influence functions that can reduce bias. Belloni, Chernozhukov, and Wei (2013),
Belloni, Chernozhukov, and Hansen (2014), Farrell (2015), Kandasamy et al. (2015), Belloni,
Chernozhukov, Fernandez-Val, and Hansen (2016), and Athey, Imbens, and Wager (2017) gave
LR estimators with machine learning first steps in several specific contexts.
A main contribution of this paper is the construction of LR moment conditions from any
moment condition and first step estimator that can result in a root-n consistent estimator of
the parameter of interest. This construction is based on the limit of the first step when a data
observation has a general distribution that allows for misspecification, similarly to Newey (1994).
LR moment functions are constructed by adding to identifying moment functions the influence
function of the true expectation of the identifying moment functions evaluated at the first step
limit, i.e. by adding the influence function term that accounts for first step estimation. The
addition of the influence adjustment "partials out" the first order effect of the first step on the
moments. This construction of LR moments extends those cited above for first step density and
distribution estimators to any first step, including instrumental variable estimators. Also, this
construction is estimator based rather than model based as in van der Laan and Rubin (2006)
and Robins et al. (2008, 2017). The construction depends only on the moment functions and
the first step rather than on a semiparametric model. Also, we use the fundamental Gateaux
derivative definition of the influence function to show LR rather than an embedding in a regular
semiparametric model.
The focus on the functional that is the true expected moments evaluated at the first step
limit is the key to this construction. This focus should prove useful for constructing LR moments
in many settings, including those where it has already been used to find the asymptotic variance
of semiparametric estimators, such as Newey (1994a), Pakes and Olley (1995), Hahn (1998), Ai
and Chen (2003), Hirano, Imbens, and Ridder (2003), Bajari, Hong, Krainer, and Nekipelov
(2010), Bajari, Chernozhukov, Hong, and Nekipelov (2009), Hahn and Ridder (2013, 2016), and
Ackerberg, Chen, Hahn, and Liao (2014), Hahn, Liao, and Ridder (2016). One can construct
LR moment functions in each of these settings by adding the first step influence function derived
for each case as an adjustment to the original, identifying moment functions.
Another contribution is the development of LR moment conditions for dynamic discrete
choice. We derive the influence adjustment for first step estimation of conditional choice prob-
abilities as in Hotz and Miller (1993). We find encouraging Monte Carlo results when various
machine learning methods are used to construct the first step. We also give LR moment functions
for conditional moment restrictions based on orthogonal instruments.
An additional contribution is to provide general estimators of the influence adjustment term
that can be used to construct LR moments without knowing their form. These methods estimate
the adjustment term numerically, thus avoiding the need to know its form. It is beyond the scope
of this paper to develop machine learning versions of these numerical estimators. Such estimators
are developed by Chernozhukov, Newey, and Robins (2018) for average linear functionals of
conditional expectations.
Further contributions include novel classes of DR estimators, including linear functionals of
nonparametric instrumental variables and density estimators, and a characterization of (nec-
essary and sufficient conditions for) double robustness. We also give related, novel partial
robustness results where original moment conditions are satisfied even when the first step is not
equal to the truth.
A main contribution is simple and general asymptotic theory for LR estimators that use
cross-fitting in the construction of the average moments. This theory is based on the structure
of LR moment conditions as an identifying moment condition depending on one first step plus
an influence adjustment that can depend on an additional first step. We give a remainder
decomposition that leads to mean square consistency conditions for first steps plus a few readily
interpretable rate conditions. For DR estimators there is only one rate condition, on a product
of sample remainders from two first step estimators, leading to particularly simple conditions.
This simplicity motivates our inclusion of results for DR estimators. This asymptotic theory
is also useful for existing moment conditions that are already known to be LR. Whenever the
moment condition can be decomposed into an identifying moment condition depending on one
first step and an influence function term that may depend on two first steps the simple and
general regularity conditions developed here will apply.
LR moments reduce the smoothing bias that results from first step nonparametric estimation
relative to original moment conditions. There are other sources of bias arising from nonlinearity
of moment conditions in the first step and the empirical distribution. Cattaneo and Jansson
(2017) and Cattaneo, Jansson, and Ma (2017) give useful bootstrap and jackknife methods that
reduce nonlinearity bias. Newey and Robins (2017) show that one can also remove this bias by
cross fitting in some settings. We allow for cross-fitting in this paper.
Section 2 describes the general construction of LR moment functions for semiparametric
GMM. Section 3 gives LR moment conditions for dynamic discrete choice. Section 4 shows how
to estimate the first step influence adjustment. Section 5 gives novel classes of DR moment
functions and characterizes double robustness. Section 6 gives an orthogonal instrument con-
struction of LR moments based on conditional moment restrictions. Section 7 provides simple
and general asymptotic theory for LR estimators.
2 Locally Robust Moment Functions
The subject of this paper is GMM estimators of parameters where the sample moment functions
depend on a first step nonparametric or large dimensional estimator. We refer to these estimators
as semiparametric. We could also refer to them as GMM where first step estimators are “plugged
in” the moments. This terminology seems awkward though, so we simply refer to them as
semiparametric GMM estimators. We denote such an estimator by $\hat{\beta}$, which is a function of the
data $w_1,\dots,w_n$ where $n$ is the number of observations. Throughout the paper we will assume that
the data observations $w_i$ are i.i.d. We denote the object that $\hat{\beta}$ estimates as $\beta_0$, the subscript
referring to the parameter value under the distribution $F_0$ of $w_i$.
To describe semiparametric GMM let $g(w,\beta,\gamma)$ denote an $r\times 1$ vector of functions of the
data observation $w$, parameters of interest $\beta$, and a function $\gamma$ that may be vector valued. The
function $\gamma$ can depend on $w$ and $\beta$ through those arguments of $g$. Here the function $\gamma$ represents
some possible first step, such as an estimator, its limit, or a true function. A GMM estimator
can be based on a moment condition where $\beta_0$ is the unique parameter vector satisfying
$$E[g(w_i,\beta_0,\gamma_0)]=0, \qquad (2.1)$$
and $\gamma_0$ is the true $\gamma$. We assume that this moment condition identifies $\beta$. Let $\hat{\gamma}$ denote some
first step estimator of $\gamma_0$. Plugging in $\hat{\gamma}$ to obtain $g(w_i,\beta,\hat{\gamma})$ and averaging over $i$ results in the
estimated sample moments $\hat{g}(\beta)=\sum_{i=1}^{n}g(w_i,\beta,\hat{\gamma})/n$. For a positive semi-definite weighting
matrix $\hat{W}$, a semiparametric GMM estimator is
$$\hat{\beta}=\arg\min_{\beta\in B}\hat{g}(\beta)'\hat{W}\hat{g}(\beta),$$
where $A'$ denotes the transpose of a matrix $A$ and $B$ is the parameter space for $\beta$. Such
estimators have been considered by, e.g. Andrews (1994), Newey (1994a), Newey and McFadden
(1994), Pakes and Olley (1995), Chen and Liao (2015), and others.
Locally robust (LR) moment functions can be constructed by adding the influence function adjustment for the first step estimator $\hat{\gamma}$ to the identifying or original moment functions
$g(w,\beta,\gamma)$. To describe this influence adjustment let $\gamma(F)$ denote the limit of $\hat{\gamma}$ when $w_i$ has distribution $F$, where we restrict $F$ only in that $\gamma(F)$ exists and possibly other regularity conditions
are satisfied. That is, $\gamma(F)$ is the limit of $\hat{\gamma}$ under possible misspecification, similar to Newey
(1994). Let $H$ be some other distribution and $F_\tau=(1-\tau)F_0+\tau H$ for $0\le\tau\le 1$, where $F_0$ denotes the true distribution of $w_i$. We assume that $H$ is chosen so that $\gamma(F_\tau)$ is well defined for
$\tau>0$ small enough and possibly other regularity conditions are satisfied, similarly to Ichimura
and Newey (2017). The influence function adjustment will be the function $\phi(w,\beta,\gamma,\lambda)$ such that
for all such $H$,
$$\frac{d}{d\tau}E[g(w_i,\beta,\gamma(F_\tau))]=\int\phi(w,\beta,\gamma_0,\lambda_0)H(dw),\qquad E[\phi(w_i,\beta,\gamma_0,\lambda_0)]=0, \qquad (2.2)$$
where $\lambda$ is an additional nonparametric or large dimensional unknown object on which $\phi(w,\beta,\gamma,\lambda)$
depends and the derivative is from the right (i.e. for positive values of $\tau$) and at $\tau=0$.
This equation is the well known definition of the influence function $\phi(w,\beta,\gamma_0,\lambda_0)$ of $\mu(F)=E[g(w_i,\beta,\gamma(F))]$ as the Gateaux derivative of $\mu(F)$, e.g. Huber (1981). The restriction of $F$ so
that $\gamma(F)$ exists allows $\phi(w,\beta,\gamma_0,\lambda_0)$ to be the influence function when $\mu(F)$ is only well defined for certain types of distributions, such as when $\gamma(F)$ is a conditional expectation or density.
The function $\phi(w,\beta,\gamma,\lambda)$ will generally exist when $E[g(w_i,\beta,\gamma(F))]$ has a finite semiparametric
variance bound. Also $\phi(w,\beta,\gamma,\lambda)$ will generally be unique because we are not restricting $F$ very
much. Also, note that $\phi(w,\beta,\gamma,\lambda)$ will be the influence adjustment term from Newey (1994a),
as discussed in Ichimura and Newey (2017).
LR moment functions can be constructed by adding $\phi(w,\beta,\gamma,\lambda)$ to $g(w,\beta,\gamma)$ to obtain new
moment functions
$$\psi(w,\beta,\gamma,\lambda)=g(w,\beta,\gamma)+\phi(w,\beta,\gamma,\lambda). \qquad (2.3)$$
Let $\hat{\lambda}$ be a nonparametric or large dimensional estimator having limit $\lambda(F)$ when $w_i$ has distribution $F$, with $\lambda(F_0)=\lambda_0$. Also let $\hat{\psi}(\beta)=\sum_{i=1}^{n}\psi(w_i,\beta,\hat{\gamma},\hat{\lambda})/n$. A LR GMM estimator can be
obtained as
$$\hat{\beta}=\arg\min_{\beta\in B}\hat{\psi}(\beta)'\hat{W}\hat{\psi}(\beta). \qquad (2.4)$$
As usual a choice of $\hat{W}$ that minimizes the asymptotic variance of $\sqrt{n}(\hat{\beta}-\beta_0)$ will be a consistent
estimator of the inverse of the asymptotic variance $\Omega$ of $\sqrt{n}\hat{\psi}(\beta_0)$. As we will further discuss,
$\psi(w,\beta,\gamma,\lambda)$ being LR will mean that the estimation of $\gamma$ and $\lambda$ does not affect $\Omega$, so that $\Omega=E[\psi(w_i,\beta_0,\gamma_0,\lambda_0)\psi(w_i,\beta_0,\gamma_0,\lambda_0)']$. An optimal $\hat{W}$ also gives an efficient estimator in the wider
sense shown in Ackerberg, Chen, Hahn, and Liao (2014), making $\hat{\beta}$ efficient in a semiparametric
model where the only restrictions imposed are equation (2.1).
The LR property we consider is that the derivative of the true expectation of the moment
function with respect to the first step is zero, for a Gateaux derivative like that for the influence
function in equation (2.2). Define $F_\tau=(1-\tau)F_0+\tau H$ as before, where $H$ is such that both
$\gamma(F_\tau)$ and $\lambda(F_\tau)$ are well defined. The LR property is that for all $H$ as specified,
$$\frac{d}{d\tau}E[\psi(w_i,\beta,\gamma(F_\tau),\lambda(F_\tau))]=0. \qquad (2.5)$$
Note that this condition is the same as that of Newey (1994a) for the presence of $\hat{\gamma}$ and $\hat{\lambda}$ to
have no effect on the asymptotic distribution, when each $F_\tau$ is a regular parametric submodel.
Consequently, the asymptotic variance of $\sqrt{n}\hat{\psi}(\beta_0)$ will be $\Omega$ as in the last paragraph.
To show LR of the moment functions $\psi(w,\beta,\gamma,\lambda)=g(w,\beta,\gamma)+\phi(w,\beta,\gamma,\lambda)$ from equation
(2.3) we use the fact that the second, zero expectation condition in equation (2.2) must hold for
all possible true distributions. For any given $\beta$ define $\bar{g}(\tau)=E[g(w_i,\beta,\gamma(F_\tau))]$ and $\phi(w,\tau)=\phi(w,\beta,\gamma(F_\tau),\lambda(F_\tau))$.
Theorem 1: If i) $d\bar{g}(\tau)/d\tau=\int\phi(w,0)H(dw)$, ii) $\int\phi(w,\tau)F_\tau(dw)=0$ for all $\tau\in[0,\bar{\tau})$,
and iii) $\int\phi(w,\tau)F_0(dw)$ and $\int\phi(w,\tau)H(dw)$ are continuous at $\tau=0$, then
$$\frac{d}{d\tau}E[\phi(w_i,\tau)]=-\frac{d\bar{g}(\tau)}{d\tau}. \qquad (2.6)$$
The proofs of this result and others are given in Appendix B. Assumptions i) and ii) of
Theorem 1 require that both parts of equation (2.2) hold with the second, zero mean condition
being satisfied when $F_\tau$ is the true distribution. Assumption iii) is a regularity condition. The
LR property follows from Theorem 1 by adding $d\bar{g}(\tau)/d\tau$ to both sides of equation (2.6) and
noting that the sum of derivatives is the derivative of the sum. Equation (2.6) shows that the
addition of $\phi(w,\beta,\gamma,\lambda)$ "partials out" the effect of the first step $\gamma$ on the moment by "cancelling"
the derivative of the identifying moment $E[g(w_i,\beta,\gamma(F_\tau))]$ with respect to $\tau$. This LR result for
$\psi(w,\beta,\gamma,\lambda)$ differs from the literature in its Gateaux derivative formulation and in the fact that
$\psi(w,\beta,\gamma,\lambda)$ is not a semiparametric influence function but is the hybrid sum of an identifying moment
function $g(w,\beta,\gamma)$ and an influence function adjustment $\phi(w,\beta,\gamma,\lambda)$.
Another zero derivative property of LR moment functions is useful. If the sets $\Gamma$ and $\Lambda$ of
possible limits $\gamma(F)$ and $\lambda(F)$, respectively, are linear, $\gamma(F)$ and $\lambda(F)$ can vary separately from
one another, and certain functional differentiability conditions hold, then LR moment functions
will have the property that for any $\gamma\in\Gamma$, $\lambda\in\Lambda$, and $\bar{\psi}(\gamma,\lambda)=E[\psi(w_i,\beta_0,\gamma,\lambda)]$,
$$\frac{d}{d\tau}\bar{\psi}((1-\tau)\gamma_0+\tau\gamma,\lambda_0)=0, \qquad \frac{d}{d\tau}\bar{\psi}(\gamma_0,(1-\tau)\lambda_0+\tau\lambda)=0. \qquad (2.7)$$
That is, the expected value of the LR moment function will have a zero Gateaux derivative
with respect to each of the first steps $\gamma$ and $\lambda$. This property will be useful for several results
to follow. Under still stronger smoothness conditions this zero derivative condition will result
in the existence of a constant $C$ such that for a function norm $\|\cdot\|$,
$$\big|\bar{\psi}(\gamma,\lambda_0)\big|\le C\|\gamma-\gamma_0\|^2, \qquad \big|\bar{\psi}(\gamma_0,\lambda)\big|\le C\|\lambda-\lambda_0\|^2, \qquad (2.8)$$
when $\|\gamma-\gamma_0\|$ and $\|\lambda-\lambda_0\|$ are small enough. In Appendix B we give smoothness conditions
that are sufficient for LR to imply equations (2.7) and (2.8). When formulating regularity
conditions for particular moment functions and first step estimators it may be more convenient
to work directly with equation (2.7) and/or (2.8).
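As a minimal numerical illustration of the zero derivative properties in equations (2.5) and (2.7), consider a toy example that is not from the paper: $\beta_0=E[\gamma_0(x)]$ for $\gamma_0(x)=E[q|x]$, with identifying moment $g=\gamma(x)-\beta$ and adjustment $\phi=q-\gamma(x)$. Perturbing the first step in a fixed direction moves the identifying moment but leaves the LR moment unchanged; all design choices below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Toy setup (not from the paper): gamma0(x) = E[q|x] = x^2, beta0 = E[gamma0(x)],
# identifying moment g = gamma(x) - beta, adjustment phi = q - gamma(x).
x = rng.standard_normal(n)
q = x**2 + rng.standard_normal(n)
beta0 = 1.0                      # E[x^2] under the standard normal

def g_bar(tau, h):
    """Sample analogue of E[g(w, beta0, gamma_tau)] at gamma_tau = gamma0 + tau*h."""
    return np.mean(x**2 + tau * h(x) - beta0)

def psi_bar(tau, h):
    """Sample analogue of E[psi] for the LR moment psi = g + phi."""
    gamma_tau = x**2 + tau * h(x)
    return np.mean((gamma_tau - beta0) + (q - gamma_tau))

h = np.cos                       # an arbitrary direction of first step error
for tau in (0.0, 0.1, 0.2):
    print(tau, g_bar(tau, h), psi_bar(tau, h))

# The identifying moment drifts linearly in tau (slope = mean of cos(x)), while
# the LR moment does not move at all: here the adjustment cancels the first
# step exactly, since psi = q - beta.
assert abs(psi_bar(0.2, h) - psi_bar(0.0, h)) < 1e-10
```

In this example the cancellation is exact for every $\tau$, which is stronger than the local (first derivative) property; in general only the derivative at $\tau=0$ is zero.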
The approach of constructing LR moment functions by adding the influence adjustment
differs from the model based approach of using an efficient influence function or score for a
semiparametric model as moment functions. The approach here is estimator based rather than
model based. The influence adjustment $\phi(w,\beta,\gamma,\lambda)$ is determined by the limit $\gamma(F)$ of the first
step estimator $\hat{\gamma}$ and the moment functions $g(w,\beta,\gamma)$, rather than by some underlying semiparametric model. This estimator based approach has proven useful for deriving the influence
function of a wide variety of semiparametric estimators, as mentioned in the Introduction. Here
this estimator based approach provides a general way to construct LR moment functions. For
any moment function $g(w,\beta,\gamma)$ and first step estimator $\hat{\gamma}$ a corresponding LR estimator can be
constructed as in equations (2.3) and (2.4).
The addition of $\phi(w,\beta,\gamma,\lambda)$ does not affect identification of $\beta$ because $\phi(w,\beta,\gamma_0,\lambda_0)$ has
expectation zero for any $\beta$ and true $F_0$. Consequently, the LR GMM estimator will have the
same asymptotic variance as the original GMM estimator when $\sqrt{n}(\hat{\beta}-\beta_0)$ is asymptotically
normal, under appropriate regularity conditions. The addition of $\phi(w,\beta,\gamma,\lambda)$ will change other
properties of the estimator. As discussed in Chernozhukov et al. (2017, 2018), it can even
remove enough bias so that the LR estimator is root-n consistent and the original estimator is
not.
If $F_\tau$ was modified so that $\tau$ is a function of a smoothing parameter, e.g. a bandwidth,
and $\tau$ gives the magnitude of the smoothing bias of $\gamma(F_\tau)$, then equation (2.5) is a small bias
condition, equivalent to
$$E[\psi(w_i,\beta_0,\gamma(F_\tau),\lambda(F_\tau))]=o(\tau).$$
Here $E[\psi(w_i,\beta_0,\gamma(F_\tau),\lambda(F_\tau))]$ is a bias in the moment condition resulting from smoothing that
shrinks faster than $\tau$. In this sense LR GMM estimators have the small bias property considered
in NHR. This interpretation is also one sense in which LR GMM is "debiased."
In some cases the original moment functions $g(w,\beta,\gamma)$ are already LR and the influence
adjustment will be zero. An important class of moment functions that are LR are those where
$g(w,\beta,\gamma)$ is the derivative with respect to $\beta$ of an objective function where nonparametric
parts have been concentrated out. That is, suppose that there is a function $q(w,\beta,\zeta)$ such that
$g(w,\beta,\gamma)=\partial q(w,\beta,\zeta(\beta))/\partial\beta$, where $\zeta(\beta)=\arg\max_{\zeta}E[q(w_i,\beta,\zeta)]$ and $\gamma$ includes $\zeta(\beta)$ and
possibly additional functions. Proposition 2 of Newey (1994a) and Lemma 2.5 of Chernozhukov
et al. (2018) then imply that $g(w,\beta,\gamma)$ will be LR. This class of moment functions includes
various partially linear regression models where $\zeta$ represents a conditional expectation. It also
includes the efficient score for a semiparametric model, Newey (1994a, pp. 1358-1359).
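A sketch of the partially linear case just mentioned, under an assumed toy design: the Robinson (1988) residual-on-residual moment needs no added adjustment term, and a crude series first step (standing in for any nonparametric or machine learning estimator) already yields an accurate estimate. The design and the polynomial first step are illustrative choices, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Toy partially linear design (not from the paper): y = d*beta0 + f0(x) + eps.
x = rng.uniform(-1, 1, n)
d = np.sin(2 * x) + 0.5 * rng.standard_normal(n)    # regressor correlated with x
beta0 = 2.0
y = d * beta0 + np.exp(x) + rng.standard_normal(n)

def cond_mean(target, x, deg=5):
    """Crude polynomial series regression, standing in for any nonparametric
    first step such as a kernel, lasso, or forest estimator."""
    X = np.vander(x, deg + 1)
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return X @ coef

# Robinson (1988) moment: psi = (y - E[y|x] - beta*(d - E[d|x]))*(d - E[d|x]).
# It is LR because it is the derivative in beta of a least squares objective
# with the conditional expectations concentrated out.
ey, ed = cond_mean(y, x), cond_mean(d, x)
beta_hat = np.sum((y - ey) * (d - ed)) / np.sum((d - ed) ** 2)
print(beta_hat)  # close to beta0 = 2.0
```

Because the moment is already LR, first step estimation error only enters the estimate of $\beta_0$ through second order (product) terms.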
Cross fitting, also known as sample splitting, has often been used to improve the properties
of semiparametric and machine learning estimators; e.g. see Bickel (1982), Schick (1986), and
Powell, Stock, and Stoker (1989). Cross fitting removes a source of bias and can be used to
construct estimators with remainder terms that converge to zero as fast as is known to be
possible, as in NHR and Newey and Robins (2017). Cross fitting is also useful for double
machine learning estimators, as outlined in Chernozhukov et al. (2017, 2018). For these reasons
we allow for cross-fitting, where sample moments have the form
$$\hat{\psi}(\beta)=\frac{1}{n}\sum_{i=1}^{n}\psi(w_i,\beta,\hat{\gamma}_{-i},\hat{\lambda}_{-i}),$$
with $\hat{\gamma}_{-i}$ and $\hat{\lambda}_{-i}$ being formed from observations other than the $i$th. This kind of cross fitting
removes an "own observation" bias term and is useful for showing root-n consistency when $\hat{\gamma}$
and $\hat{\lambda}$ are machine learning estimators.
One version of cross-fitting with good properties in examples in Chernozhukov et al. (2018)
can be obtained by partitioning the observation indices into $L$ groups $I_\ell$ $(\ell=1,\dots,L)$, forming
$\hat{\gamma}_\ell$ and $\hat{\lambda}_\ell$ from observations not in $I_\ell$, and constructing
$$\hat{\psi}(\beta)=\frac{1}{n}\sum_{\ell=1}^{L}\sum_{i\in I_\ell}\psi(w_i,\beta,\hat{\gamma}_\ell,\hat{\lambda}_\ell). \qquad (2.9)$$
Further bias reductions may be obtained in some cases by using different sets of observations for
computing $\hat{\gamma}_\ell$ and $\hat{\lambda}_\ell$, leading to remainders that converge to zero as rapidly as known possible
in interesting cases; see Newey and Robins (2017). The asymptotic theory of Section 7 focuses
on this kind of cross fitting.
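The partition in equation (2.9) can be sketched as follows, for a toy functional $\beta_0=E[\gamma_0(x)^2]$ whose influence adjustment $2\gamma(x)[q-\gamma(x)]$ is assumed here for illustration; each group's moments are averaged using first steps fit only on the complementary observations. The design and the linear first step are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n, L = 30_000, 5

# Toy target (not from the paper): beta0 = E[gamma0(x)^2] with gamma0(x) = E[q|x].
# LR moment: psi(w, beta, gamma) = gamma(x)^2 - beta + 2*gamma(x)*(q - gamma(x)).
x = rng.uniform(-1, 1, n)
q = (1 + x) + rng.standard_normal(n)          # gamma0(x) = 1 + x, so beta0 = 4/3

fold = rng.integers(0, L, n)                  # random partition into groups I_1, ..., I_L
psi_sum = 0.0
for l in range(L):
    train, held = fold != l, fold == l
    # First step fit on observations *not* in I_l (a simple linear regression
    # here, standing in for any machine learning first step).
    X = np.column_stack([np.ones(train.sum()), x[train]])
    c, *_ = np.linalg.lstsq(X, q[train], rcond=None)
    g_hat = c[0] + c[1] * x[held]             # first step evaluated on the held-out group
    psi_sum += np.sum(g_hat**2 + 2 * g_hat * (q[held] - g_hat))
beta_hat = psi_sum / n                        # solves the cross-fit moment equation
print(beta_hat)  # close to E[(1 + x)^2] = 4/3
```

Because each $\hat\gamma_\ell$ is independent of the observations it is evaluated on, the "own observation" bias term is absent by construction.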
As an example we consider a bound on average equivalent variation. Let $\gamma_0(x)$ denote the
conditional expectation of quantity $q$ conditional on $x=(p',y)'$, where $p=(p_1,p_2')'$ is a vector
of prices and $y$ is income. The object of interest is a bound on average equivalent variation for
a price change from $\bar{p}_1$ to $\check{p}_1$ given by
$$\beta_0=E\Big[\int\omega(p_1,y)\gamma_0(p_1,p_2,y)dp_1\Big], \qquad \omega(p_1,y)=w(y)1(\bar{p}_1\le p_1\le\check{p}_1)\exp[-B(p_1-\bar{p}_1)],$$
where $w(y)$ is a function of income and $B$ a constant. It follows by Hausman and Newey
(2016) that if $B$ is a lower (upper) bound on the income effect for all individuals, then $\beta_0$ is
an upper (lower) bound on the equivalent variation for a price change from $\bar{p}_1$ to $\check{p}_1$, averaged
over heterogeneity, other prices $p_2$, and income $y$. The function $w(y)$ allows for averages over
income in specific ranges, as in Hausman and Newey (2017).
A moment function that could be used to estimate $\beta_0$ is
$$g(w,\beta,\gamma)=\int\omega(p_1,y)\gamma(p_1,p_2,y)dp_1-\beta.$$
Note that
$$E[g(w,\beta_0,\gamma)]+\beta_0=E\Big[\int\omega(p_1,y)\gamma(p_1,p_2,y)dp_1\Big]=E[\lambda_0(x)\gamma(x)], \qquad \lambda_0(x)=\frac{\omega(p_1,y)}{f_0(p_1|p_2,y)},$$
where $f_0(p_1|p_2,y)$ is the conditional pdf of $p_1$ given $p_2$ and $y$. Then by Proposition 4 of Newey
(1994) the influence function adjustment for any nonparametric estimator $\hat{\gamma}(x)$ of $E[q|x=x]$
is
$$\phi(w,\gamma,\lambda)=\lambda(x)[q-\gamma(x)].$$
Here $\lambda_0(x)$ is an example of an additional unknown function that is included in $\phi(w,\gamma,\lambda)$ but
not in the original moment functions $g(w,\beta,\gamma)$. Let $\hat{\gamma}_\ell(x)$ be an estimator of $E[q|x]$ that
can depend on $\ell$ and $\hat{\lambda}_\ell(x)$ be an estimator of $\lambda_0(x)$, such as $\hat{f}_\ell(p_1|p_2,y)^{-1}\omega(p_1,y)$ for an estimator
$\hat{f}_\ell(p_1|p_2,y)$. The LR estimator obtained by solving $\hat{\psi}(\beta)=0$ for $\hat{\gamma}_\ell(x)$ and $\hat{\lambda}_\ell(x)$ as
above is
$$\hat{\beta}=\frac{1}{n}\sum_{\ell=1}^{L}\sum_{i\in I_\ell}\Big\{\int\omega(p_1,y_i)\hat{\gamma}_\ell(p_1,p_{2i},y_i)dp_1+\hat{\lambda}_\ell(x_i)[q_i-\hat{\gamma}_\ell(x_i)]\Big\}. \qquad (2.10)$$
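A sketch of the estimator in equation (2.10) under simplifying toy assumptions (uniform $p_1$ independent of income, so the conditional density and hence $\lambda_0$ are known; $w(y)=1$, $B=0$, and the price interval are illustrative choices, not from the paper). A deliberately misspecified first step shows the adjustment term restoring the mean:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000

# Toy design (not from the paper): p1 ~ U(0, 1) independent of income y, so
# f0(p1|y) = 1 and lambda0(x) = omega(p1, y); take w(y) = 1, B = 0, and a
# price change over [0.2, 0.8].
p1 = rng.uniform(0, 1, n)
y = rng.uniform(1, 2, n)
gamma0 = lambda p, inc: inc / (1 + p)               # "true" demand E[q | p1, y]
q = gamma0(p1, y) + 0.2 * rng.standard_normal(n)
omega = lambda p: ((p >= 0.2) & (p <= 0.8)).astype(float)

# Deliberately misspecified first step, to show the adjustment term at work.
gamma_hat = lambda p, inc: inc / (1 + p) + 0.3 * (p - 0.2)

edges = np.linspace(0.2, 0.8, 241)                  # midpoint quadrature for the p1 integral
mid = (edges[:-1] + edges[1:]) / 2
plug_in = gamma_hat(mid[None, :], y[:, None]).mean(axis=1) * 0.6
adjust = omega(p1) * (q - gamma_hat(p1, y))         # lambda0(x)*[q - gamma(x)]
beta_hat = np.mean(plug_in + adjust)

beta0 = np.mean(gamma0(mid[None, :], y[:, None]).mean(axis=1) * 0.6)
print(beta_hat, beta0)  # the adjustment cancels the first step error in the mean
```

In this design the plug-in term alone is off by the integrated first step error, while the mean of the adjustment term is exactly its negative, illustrating the double robustness of this moment function in $\gamma$.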
3 Machine Learning for Dynamic Discrete Choice
A challenging problem when estimating dynamic structural models is the dimensionality of
state spaces. Machine learning addresses this problem via model selection to estimate high
dimensional choice probabilities. These choice probability estimators can then be used in
conditional choice probability (CCP) estimators of structural parameters, following Hotz and
Miller (1993). In order for CCP estimators based on machine learning to be root-n consistent
they must be based on orthogonal (i.e. LR) moment conditions; see Chernozhukov et al. (2017,
2018). Adding the adjustment term provides the way to construct LR moment conditions from
known moment conditions for CCP estimators. In this Section we do so for Rust’s (1987)
model of dynamic discrete choice.
We consider an agent choosing among $J$ discrete alternatives by maximizing the expected
present discounted value of utility. We assume that the per-period utility function for an agent
making choice $j$ in period $t$ is given by
$$U_{tj}=u_j(x_t,\beta_0)+\epsilon_{tj} \quad (j=1,\dots,J;\ t=1,2,\dots).$$
The vector $x_t$ consists of the observed state variables of the problem (e.g. work experience, number of
children, wealth) and $\beta_0$ is a vector of unknown parameters. The disturbances $\epsilon_t=(\epsilon_{t1},\dots,\epsilon_{tJ})'$ are not observed by the econometrician. As in much of the literature we assume that $\epsilon_t$ is i.i.d.
over time with known CDF that has support $\mathbb{R}^J$, is independent of $x_t$, and that $x_t$ is first-order
Markov.
To describe the agent’s choice probabilities let $\delta$ denote a time discount parameter, $V(x_t)$
the expected value function, $y_{tj}\in\{0,1\}$ the indicator that choice $j$ is made in period $t$, and $\bar{v}_j(x_t)=u_j(x_t,\beta_0)+\delta E[V(x_{t+1})|x_t,j]$ the expected value function conditional on choice $j$. As in Rust
(1987), we assume that in each period the agent makes the choice $j$ that maximizes the expected
present discounted value of utility $\bar{v}_j(x_t)+\epsilon_{tj}$. The probability of choosing $j$ in period $t$ is then
$$P_j(\bar{v}(x_t))=\Pr(\bar{v}_j(x_t)+\epsilon_{tj}\ge\bar{v}_k(x_t)+\epsilon_{tk};\ k=1,\dots,J), \qquad \bar{v}(x)=(\bar{v}_1(x),\dots,\bar{v}_J(x))'. \qquad (3.1)$$
These choice probabilities have a useful relationship to the structural parameters $\beta_0$ when
there is a renewal choice, where the conditional distribution of $x_{t+1}$ given the renewal choice and
$x_t$ does not depend on $x_t$. Without loss of generality suppose that the renewal choice is $j=1$.
Let $v_j(x)$ denote $v_j(x)=\bar{v}_j(x)-\bar{v}_1(x)$, so that $v_1\equiv 0$. As usual, subtracting $\bar{v}_1(x)$ from each $\bar{v}_j(x)$
in $P_j(\bar{v}(x))$ does not change the choice probabilities, so that they depend only on $v(x)=(v_2(x),\dots,v_J(x))'$.
The renewal nature of $j=1$ leads to a specific formula for $v_j(x)$ in terms of the per period
utilities $u_j(x,\beta_0)$ and the choice probabilities $P(x)=(P_1(x),\dots,P_J(x))'$. As in Hotz
and Miller (1993), there is a function $P^{-1}(\cdot)$ such that $v=P^{-1}(P)$. Let $H(P)$ denote the
function such that
$$H(P)=E[\max_{1\le j\le J}\{P^{-1}_j(P)+\epsilon_j\}\,|\,P]=E[\max_{1\le j\le J}\{v_j+\epsilon_j\}\,|\,v].$$
For example, for multinomial logit $H(P)=.5772-\ln(P_1)$. Note that by $j=1$ being a renewal
we have $E[V(x_{t+1})|x_t,1]=\mu$ for a constant $\mu$, so that
$$V(x)=\bar{v}_1(x)+H(P(x))=u_1(x,\beta_0)+\delta\mu+H(P(x)).$$
It then follows that
$$\bar{v}_j(x)=u_j(x,\beta_0)+\delta E[V(x_{t+1})|x_t=x,j]=u_j(x,\beta_0)+\delta E[u_1(x_{t+1},\beta_0)+H(P(x_{t+1}))|x_t=x,j]+\delta^2\mu \quad (j=1,\dots,J).$$
Subtracting then gives
$$v_j(x)=u_j(x,\beta_0)-u_1(x,\beta_0)+\delta\big\{E[u_1(x_{t+1},\beta_0)+H(P(x_{t+1}))|x_t=x,j]-E[u_1(x_{t+1},\beta_0)+H(P(x_{t+1}))|1]\big\}. \qquad (3.2)$$
This expression for the choice specific value function $v_j(x)$ depends only on $u_j(x,\beta_0)$ $(j=1,\dots,J)$, $H(P(x_{t+1}))$, and
conditional expectations given the state and choice, and so can be used to form semiparametric
moment functions.
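For the multinomial logit example above, both the Hotz and Miller (1993) inversion $P^{-1}$ and the function $H$ have closed forms, which the following sketch verifies numerically (the specific value vector is an arbitrary illustrative choice):

```python
import numpy as np

# Multinomial logit: choice probabilities from value differences v (with
# v_1 = 0 by the renewal normalization) are softmax, and the Hotz-Miller
# inversion has closed form P^{-1}(P)_j = ln(P_j) - ln(P_1), while
# H(P) = 0.5772... - ln(P_1) (Euler's constant).
euler = 0.5772156649

def choice_probs(v):
    """P_j = exp(v_j) / sum_k exp(v_k), with v_1 = 0 stored first."""
    e = np.exp(v - v.max())
    return e / e.sum()

def hotz_miller_invert(P):
    """Recover value differences v_j - v_1 from choice probabilities."""
    return np.log(P) - np.log(P[0])

v = np.array([0.0, 0.7, -0.3, 1.2])
P = choice_probs(v)
print(np.allclose(hotz_miller_invert(P), v))   # the inversion round-trips

# E[max_j {v_j + eps_j}] for extreme value eps is euler + log-sum-exp of v,
# which equals H(P) = euler - ln(P_1) here since v_1 = 0.
H = euler - np.log(P[0])
assert np.isclose(H, euler + np.log(np.exp(v).sum()))
```

Outside the logit case $P^{-1}$ and $H$ generally have no closed form and must be computed numerically from the assumed CDF of the disturbances.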
To describe those moment functions let $\gamma_1(x)$ denote the vector of possible values of the
choice probabilities $E[y_t|x_t=x]$, where $y_t=(y_{t1},\dots,y_{tJ})'$. Also let $\gamma_j(x,\beta,\gamma_1)$ $(j=2,\dots,J)$
denote a possible value of $E[u_1(x_{t+1},\beta)+H(\gamma_1(x_{t+1}))|x_t=x,j]$ as a function of $x$, $\beta$, and $\gamma_1$, and $\gamma_{J+1}(\beta,\gamma_1)$
a possible value of $E[u_1(x_{t+1},\beta)+H(\gamma_1(x_{t+1}))|1]$. Then a possible value of $v_j(x)$ is given by
$$v_j(x,\beta,\gamma)=u_j(x,\beta)-u_1(x,\beta)+\delta[\gamma_j(x,\beta,\gamma_1)-\gamma_{J+1}(\beta,\gamma_1)] \quad (j=2,\dots,J).$$
These value function differences are semiparametric, depending on the function $\gamma_1$ of choice probabilities and the conditional expectations $\gamma_j$ $(j=2,\dots,J)$. Let $v(x,\beta,\gamma)=(v_2(x,\beta,\gamma),\dots,v_J(x,\beta,\gamma))'$
and $A(x)$ denote a matrix of functions of $x$ with $J$ columns. Semiparametric moment functions
are given by
$$g(w,\beta,\gamma)=A(x)[y-P(v(x,\beta,\gamma))].$$
LR moment functions can be constructed by adding the adjustment term for the presence of
the first step $\gamma$. This adjustment term is derived in Appendix A. It takes the form
$$\phi(w,\beta,\gamma,\lambda)=\sum_{j=1}^{J+1}\phi_j(w,\beta,\gamma,\lambda),$$
where $\phi_j(w,\beta,\gamma,\lambda)$ is the adjustment term for $\gamma_j$ holding all other components of $\gamma$ fixed at their
true values. To describe it define
$$P_{vj}(x)=\frac{\partial P(v(x,\beta_0,\gamma_0))}{\partial v_j}, \qquad \pi_1=\Pr(y_{t1}=1), \qquad \lambda_{10}(x)=E[y_{t1}|x_{t+1}=x], \qquad (3.3)$$
$$\lambda_{j0}(x)=E[A(x_t)P_{vj}(x_t)y_{tj}\,|\,x_{t+1}=x] \quad (j=2,\dots,J).$$
Then, writing $\tilde{x}=x_{t+1}$ and $\lambda=(\lambda_1,\dots,\lambda_J)$, let
$$\phi_1(w,\beta,\gamma,\lambda)=-\delta\Big(\sum_{j=2}^{J}\big\{\lambda_j(x)-E[A(x)P_{vj}(x)]\pi_1^{-1}\lambda_1(x)\big\}\Big)\Big[\frac{\partial H(\gamma_1(x))}{\partial P}\Big]'[y-\gamma_1(x)],$$
$$\phi_j(w,\beta,\gamma,\lambda)=-\delta A(x)\frac{\partial P(v(x,\beta,\gamma))}{\partial v_j}\frac{y_j}{P_j(v(x,\beta,\gamma))}\big\{u_1(\tilde{x},\beta)+H(\gamma_1(\tilde{x}))-\gamma_j(x,\beta,\gamma_1)\big\} \quad (j=2,\dots,J),$$
$$\phi_{J+1}(w,\beta,\gamma,\lambda)=\delta\Big(\sum_{j=2}^{J}E\Big[A(x)\frac{\partial P(v(x,\beta,\gamma))}{\partial v_j}\Big]\Big)\pi_1^{-1}y_1\big\{u_1(\tilde{x},\beta)+H(\gamma_1(\tilde{x}))-\gamma_{J+1}(\beta,\gamma_1)\big\}.$$
Theorem 2: If the marginal distribution of $x_t$ does not vary with $t$, then LR moment
functions for the dynamic discrete choice model are
$$\psi(w,\beta,\gamma,\lambda)=A(x)[y-P(v(x,\beta,\gamma))]+\sum_{j=1}^{J+1}\phi_j(w,\beta,\gamma,\lambda).$$
The form of φ(w, θ, γ, λ) is amenable to machine learning. A machine learning estimator of the conditional choice probability vector γ_10(X) is straightforward to compute and can then be used throughout the construction of the orthogonal moment conditions everywhere γ_1 appears. If u_1(X, θ) is linear in θ, say u_1(X, θ) = X_1′θ_1 for subvectors X_1 and θ_1 of X and θ respectively, then machine learning estimators can be used to obtain Ê[X_{1,t+1} | X_t, y_tj = 1] and Ê[H(γ̂_1(X_{t+1})) | X_t, y_tj = 1], (j = 2, ..., J), and a sample average used to form γ̂_{J+1}(γ̂_1). The value function differences can then be estimated as
ṽ_j(X, γ̂, θ) = u_j(X, θ) − u_1(X, θ) + δ{Ê[X_{1,t+1} | X, y_tj = 1]′θ_1 − Ê[X_{1,t+1} | y_t1 = 1]′θ_1 + Ê[H(γ̂_1(X_{t+1})) | X, y_tj = 1] − Ê[H(γ̂_1(X_{t+1})) | y_t1 = 1]}.
Furthermore, denominator problems can be avoided by using structural probabilities (rather
than the machine learning estimators) in all denominator terms.
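As a small illustration of this plug-in construction for a binary choice (J = 2), the sketch below assembles ṽ_2 from first-step regression functions supplied as callables. All functions and parameter values here are hypothetical stand-ins for machine learning fits, not the paper's estimator:

```python
import numpy as np

def vtilde_2(x, th1, u2, ex1_keep, eh_keep, ex1_rep, eh_rep, delta=0.9):
    """v~_2(x) = u_2 - u_1(x) + delta*{(E^[x_{1,t+1}|x,2] - E^[x_{1,t+1}|1]) th1
                                       + E^[H|x,2] - E^[H|1]}
    with u_1(x, th) = x*th1 for a scalar state and u_2 constant.
    ex1_keep/eh_keep are callables standing in for the state-and-choice
    conditional expectation fits; ex1_rep/eh_rep are the renewal-choice
    sample averages. All inputs are illustrative assumptions."""
    u1 = x * th1
    return u2 - u1 + delta * ((ex1_keep(x) - ex1_rep) * th1 + eh_keep(x) - eh_rep)
```

Because ṽ_2 is linear in the first-step objects, the machine learning fits enter only through their predicted values, as the text notes.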
The challenging part of the machine learning for this estimator is the dependence on θ of the reverse conditional expectations in λ_j0(X). It may be computationally prohibitive and possibly unstable to redo machine learning for each θ. One way to deal with this complication is to update θ periodically, with more frequent updates near convergence. It is important that at convergence the θ in the reverse conditional expectations is the same as the θ that appears elsewhere.
With data that is i.i.d. over individuals these moment functions can be used for any t to estimate the structural parameters θ. Also, for data on a single individual we could use a time average Σ_{t=1}^{T−1} ψ(w_t, θ, γ, λ)/(T − 1) to estimate θ. It will be just as important to use LR moments for estimation with a single individual as it is with a cross section of individuals, although our asymptotic theory will not apply to that case.
Bajari, Chernozhukov, Hong, and Nekipelov (2009) derived the influence adjustment for
dynamic discrete games of imperfect information. Locally robust moment conditions for such
games could be formed using their results. We leave that formulation to future work.
As an example of the finite sample performance of the LR GMM we report a Monte Carlo
study of the LR estimator of this Section. The design of the experiment is loosely like the
bus replacement application of Rust (1987). Here X_t is a state variable meant to represent the lifetime of a bus engine. The transition density is
X_{t+1} = X_t + N(.25, 1)², y_t = 1,
X_{t+1} = 1 + N(.25, 1)², y_t = 0,
where y_t = 0 corresponds to replacement of the bus engine and y_t = 1 to nonreplacement. We assume that the agent chooses y_t contingent on the state to maximize
E[Σ_{t=1}^∞ δ^{t−1}{y_t(θ_1√X_t + ε_{1t}) + (1 − y_t)(RC + ε_{0t})}], θ_1 = −.3, RC = −4.
The unconditional probability of replacement in this model is about .18, which is substantially higher than that estimated in Rust (1987). The sample used for estimation was 1000 observations for a single decision maker. We carried out 10,000 replications.
We estimate the conditional choice probabilities by kernel and series nonparametric regression and by logit lasso, random forest, and boosted tree machine learning methods. Logit conditional choice probabilities and derivatives were used in the construction of λ̂ wherever they appear in order to avoid denominator issues. The unknown conditional expectations in the γ̂_j were estimated by series regressions throughout. Kernel regression was also tried but did not work particularly well and so results are not reported.
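The data-generating mechanics of this design can be simulated directly. The sketch below draws a single decision maker's path from the transition law above, with a myopic logit choice rule standing in for the full dynamic-programming solution, so it illustrates the mechanics rather than reproducing the exact experiment; the parameter values are placeholders:

```python
import numpy as np

def simulate(T=1000, th1=-0.3, rc=-4.0, seed=0):
    """Simulate the bus-engine design: keep (y=1) gives x' = x + N(.25,1)^2,
    replace (y=0) resets to x' = 1 + N(.25,1)^2. The choice is a myopic
    logit comparison of th1*sqrt(x) and rc, a simplifying assumption."""
    rng = np.random.default_rng(seed)
    x, xs, ys = 1.0, [], []
    for _ in range(T):
        v_keep, v_replace = th1 * np.sqrt(x), rc
        p_keep = 1.0 / (1.0 + np.exp(v_replace - v_keep))
        y = int(rng.uniform() < p_keep)
        xs.append(x)
        ys.append(y)
        shock = rng.normal(0.25, 1.0) ** 2
        x = x + shock if y == 1 else 1.0 + shock
    return np.array(xs), np.array(ys)
```

The squared-normal shock keeps the engine lifetime state weakly above one, and replacement resets it, mirroring the transition density displayed above.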
Table 1 reports the results of the experiment. Bias, standard deviation, and coverage prob-
ability for asymptotic 95 percent confidence intervals are given in Table 1.
Table 1
LR CCP Estimators, Dynamic Discrete Choice

                    Bias         Std Dev      95% Cov
                  RC    θ1     RC     θ1     RC    θ1
Two step kernel  -.24   .08   .08    .32    .01   .86
LR kernel        -.05   .02   .06    .32    .95   .92
Two step quad    -.00   .14   .049   .33∗   .91   .89
LR quad          -.00   .01   .085   .39    .95   .92
Logit Lasso      -.12   .25   .06    .28    .74   .84
LR Logit Lasso   -.09   .01   .08    .36    .93   .95
Random Forest    -.15  -.44   .09    .50    .91   .98
LR Ran. For.      .00   .00   .06    .44   1.0    .98
Boosted Trees    -.10  -.28   .08    .50    .99   .99
LR Boost Tr.      .03   .09   .07    .47    .99   .97
Here we find bias reduction from the LR estimator in all cases. We also find variance
reduction from LR estimation when the first step is kernel estimation, random forests, and
boosted trees. The LR estimator also leads to actual coverage of confidence intervals being
closer to the nominal coverage. The results for random forests and boosted trees seem noisier
than the others, with higher standard deviations and confidence interval coverage probabilities
farther from nominal. Overall, we find substantial improvements from using LR moments rather
than only the identifying, original moments.
4 Estimating the Influence Adjustment
Construction of LR moment functions requires an estimator φ̂(w, θ) of the adjustment term.
The form of φ(w, θ, γ, λ) is known for some cases from the semiparametric estimation literature.
Powell, Stock, and Stoker (1989) derived the adjustment term for density weighted average
derivatives. Newey (1994a) gave the adjustment term for mean square projections (including
conditional expectations), densities, and their derivatives. Hahn (1998) and Hirano, Imbens,
and Ridder (2003) used those results to obtain the adjustment term for treatment effect esti-
mators, where the LR estimator will be the doubly robust estimator of Robins, Rotnitzky, and
Zhao (1994, 1995). Bajari, Hong, Krainer, and Nekipelov (2010) and Bajari, Chernozhukov,
Hong, and Nekipelov (2009) derived adjustment terms in some game models. Hahn and Ridder
(2013, 2016) derived adjustments in models with generated regressors including control func-
tions. These prior results can be used to obtain LR estimators by adding the adjustment term
with nonparametric estimators plugged in.
For new cases it may be necessary to derive the form of the adjustment term. Also, it is
possible to numerically estimate the adjustment term based on series estimators and other nonparametric estimators. In this Section we describe how to construct estimators of the adjustment term in these ways.
4.1 Deriving the Formula for the Adjustment Term
One approach to estimating the adjustment term is to derive a formula for φ(w, θ, γ, λ) and then plug in γ̂ and λ̂ in that formula. A formula for φ(w, θ, γ, λ) can be obtained as in Newey (1994a). Let γ(F) be the limit of the nonparametric estimator γ̂ when w has distribution F. Also, let F_τ denote a regular parametric model of distributions with F_τ = F_0 at τ = 0 and score (derivative of the log likelihood at τ = 0) equal to S(w). Then under certain regularity conditions φ(w, θ_0, γ_0, λ_0) will be the unique solution to
(d/dτ) ∫ g(w, θ_0, γ(F_τ)) F_0(dw) |_{τ=0} = E[φ(w, θ_0, γ_0, λ_0) S(w)], E[φ(w, θ_0, γ_0, λ_0)] = 0, (4.1)
as F_τ and the corresponding score S(w) are allowed to vary over a family of parametric models where the set of scores for the family has mean square closure that includes all mean zero functions with finite variance. Equation (4.1) is a functional equation that can be solved to find the adjustment term, as was done in many of the papers cited in the previous paragraph.
The influence adjustment can be calculated by taking a limit of the Gateaux derivative as shown in Ichimura and Newey (2017). Let γ(F) be the limit of γ̂ when F is the true distribution of w, as before. Let G_h^w be a family of distributions that approaches a point mass at w as h → 0. If φ(w, θ_0, γ_0, λ_0) is continuous in w with probability one then
φ(w, θ_0, γ_0, λ_0) = lim_{h→0} ( (d/dτ) E[g(w̃, θ_0, γ(F_τ^h))] |_{τ=0} ), F_τ^h = (1 − τ)F_0 + τG_h^w. (4.2)
This calculation is more constructive than equation (4.1) in the sense that the adjustment term
here is a limit of a derivative rather than the solution to a functional equation. In Sections 5
and 6 we use those results to construct LR estimators when the first step is a nonparametric
instrumental variables (NPIV) estimator.
With a formula for φ(w, θ, γ, λ) in hand from either solving the functional equation in equation (4.1) or from calculating the limit of the derivative in equation (4.2), one can estimate the adjustment term by plugging estimators γ̂ and λ̂ into φ(w, θ, γ, λ). This approach to estimating LR moments can be used to construct LR moments for the average surplus described near the end of Section 2. There the adjustment term depends on the conditional density of p_1 given p_2 and y. Let f̂(p_1 | p_2, y) be some estimator of the conditional pdf of p_1 given p_2 and y. Plugging that estimator into the formula for λ_0(x) gives λ̂(x) = ω(p_1, y)/f̂(p_1 | p_2, y). This λ̂(x) can then be used in equation (2.10).
4.2 Estimating the Influence Adjustment for First Step Series Estimators
Estimating the adjustment term is relatively straightforward when the first step is a series
estimator. The adjustment term can be estimated by treating the first step estimator as if it
were parametric and applying a standard formula for the adjustment term for parametric two-
step estimators. Suppose that γ̂ depends on the data through a K × 1 vector β̂ of parameter estimators that has true value β_0. Let γ(·, β) denote γ as a function of β. Suppose that there is a K × 1 vector of functions m(w, β) such that β̂ satisfies
(1/√n̄) Σ_{i∈Ī} m(w_i, β̂) = o_p(1), (4.3)
where Ī is a subset of observations, none of which are included in I, and n̄ is the number of observations in Ī. Then a standard calculation for parametric two-step estimators (e.g. Newey, 1984, and Murphy and Topel, 1985) gives the parametric adjustment term
φ(w, θ, β̂, Ψ̂) = Ψ̂(θ) m(w, β̂), Ψ̂(θ) = −(Σ_{i∈Ī} ∂g(w_i, θ, γ(·, β̂))/∂β′)(Σ_{i∈Ī} ∂m(w_i, β̂)/∂β′)^{−1}, i ∈ I.
In many cases φ(w, θ, β̂, Ψ̂) approximates the true adjustment term φ(w, θ_0, γ_0, λ_0), as shown by Newey (1994a, 1997) and Ackerberg, Chen, and Hahn (2012) for estimating the asymptotic variance of functions of series estimators. Here this approximation is used for estimation of θ instead of just for variance estimation. The estimated LR moment function will be
ψ(w, θ, β̂, Ψ̂) = g(w, θ, γ(·, β̂)) + φ(w, θ, β̂, Ψ̂). (4.4)
We note that if β̂ were computed from the whole sample then Σ_{i=1}^n φ(w_i, θ, β̂, Ψ̂) = 0. This degeneracy does not occur when cross-fitting is used, which removes "own observation" bias and is important for first step machine learning estimators, as noted in Section 2.
We can apply this approach to construct LR moment functions for an estimator of the average surplus bound example that is based on series regression. Here the first step estimator of γ_0(x) = E[q | x], for quantity q, will be that from an ordinary least squares regression of q on a vector p(x) of approximating functions. The corresponding g(w, θ, β) and m(w, β) are
g(w, θ, β) = A(y)′β − θ, m(w, β) = p(x)[q − p(x)′β], A(y) = ∫ ω(p_1, y) p(p_1, p̄_2, y) dp_1.
Let β̂ denote the least squares coefficients from regressing q_i on p(x_i) for observations that are
not included in I. Then the estimator of the locally robust moments given in equation (4.4) is
ψ(w, θ, β̂, Ψ̂) = A(y)′β̂ − θ + Ψ̂ p(x)[q − p(x)′β̂], Ψ̂ = (Σ_{i∈Ī} A(y_i)′)(Σ_{i∈Ī} p(x_i)p(x_i)′)^{−1}.
It can be shown similarly to Newey (1994a, p. 1369) that Ψ̂ estimates the population least squares coefficients from a regression of λ_0(x) on p(x), so that λ̂(x) = Ψ̂ p(x) estimates λ_0(x). In comparison the LR estimator described in the previous subsection was based on an explicit nonparametric estimator of f_0(p_1 | p_2, y), while this λ̂(x) implicitly estimates the inverse of that pdf via a mean-square approximation of λ_0(x) by Ψ̂ p(x).
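The claim that Ψ̂p(x) estimates the least squares projection of λ_0(x) on p(x) can be checked numerically. In the sketch below (our construction, not the paper's), each A_i is built so that Σ_i A_i′ estimates E[λ_0(x)p(x)′]; when λ_0 lies in the span of the approximating functions, the fitted Ψ̂p(x) reproduces λ_0 exactly:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
x = rng.uniform(-1, 1, n)
P = np.column_stack([np.ones(n), x])   # approximating functions p(x) = (1, x)'
lam0 = 2.0 - 3.0 * x                   # a "true" representer, in the span of p

# A_i plays the role of the derivative of g with respect to beta for
# observation i; here it is built so that sum_i A_i' = P' lam0, i.e. it
# estimates E[lam0(x) p(x)'] (an illustrative assumption).
A = lam0[:, None] * P
Psi = np.linalg.solve(P.T @ P, A.sum(axis=0))   # (sum p p')^{-1} (sum A')
lam_hat = P @ Psi                               # Psi' p(x_i)
```

Because least squares projection leaves functions in the span of p(x) unchanged, lam_hat equals lam0 up to floating point error here; with λ_0 outside the span, Ψ̂p(x) would instead give its mean-square approximation.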
Chernozhukov, Newey, and Robins (2018) introduce machine learning methods for choosing the functions to include in the vector p(x). This method can be combined with machine learning methods for estimating E[q | x] to construct a double machine learning estimator of average surplus, as shown in Chernozhukov, Hausman, and Newey (2018).
In parametric models moment functions like those in equation (4.4) are used to "partial out" nuisance parameters β. For maximum likelihood these moment functions are the basis of Neyman's (1959) C-alpha test. Wooldridge (1991) generalized such moment conditions to nonlinear least squares and Lee (2005), Bera et al. (2010), and Chernozhukov et al. (2015) to GMM. What is novel here is their use in the construction of semiparametric estimators and the interpretation of the estimated LR moment functions ψ(w, θ, β̂, Ψ̂) as the sum of an original moment function g(w, θ, γ(·, β̂)) and an influence adjustment φ(w, θ, β̂, Ψ̂).
4.3 Estimating the Influence Adjustment with First Step Smoothing
The adjustment term can be estimated in a general way that allows for kernel density, locally
linear regression, and other kernel smoothing estimators for the first step. The idea is to differentiate with respect to the effect of the i-th observation on sample moments. Newey (1994b) used
a special case of this approach to estimate the asymptotic variance of a functional of a kernel
based semiparametric or nonparametric estimator. Here we extend this method to a wider class
of first step estimators, such as locally linear regression, and apply it to estimate the adjustment
term for construction of LR moments.
We will describe this estimator for the case where γ̂ is a vector of functions of a vector of variables z. Let m(w, z, γ) be a vector of functions of a data observation w, z, and a possible realized value γ of γ_0(z) (i.e. a vector of real numbers). Also let M̂(z, γ) = Σ_{i∈Ī} m(w_i, z, γ)/n̄ be a sample average over a set Ī of observations not included in I, where n̄ is the number of observations in Ī. We assume that the first step estimator γ̂(z) solves
0 = M̂(z, γ).
We suppress the dependence of m and M̂ on a bandwidth. For example, for a pdf γ_0(z) a kernel density estimator would correspond to m(w, z, γ) = K_h(z − w) − γ, and a locally linear regression would be γ̂_1(z) for
m(w, z, γ) = K_h(x − z) (1, (x − z)′)′ [y − γ_1 − (x − z)′γ_2],
where w = (y, x′)′ and γ = (γ_1, γ_2′)′.
To measure the effect of the i-th observation on γ̂ let γ̂_i(ε) be the solution to
0 = M̂(z, γ) + ε · m(w_i, z, γ).
This γ̂_i(ε) is the value of the function obtained from adding the contribution ε · m(w_i, z, γ) of the i-th observation. An estimator of the adjustment term can be obtained by differentiating the average of the original moment function with respect to ε at ε = 0. This procedure leads to an estimated locally robust moment function given by
ψ(w_i, θ) = g(w_i, θ, γ̂) + (1/n̄) Σ_{j∈Ī} (d/dε) g(w_j, θ, γ̂_i(ε)) |_{ε=0}.
This estimator is a generalization of the influence function estimator for kernels in Newey
(1994b).
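For a kernel density first step the perturbed solution γ̂_i(ε) is available in closed form, which makes the construction easy to check numerically. The sketch below (our illustration, with hypothetical values) verifies that the derivative at ε = 0 equals K_h(z − w_i) − f̂(z):

```python
import numpy as np

def Kh(u, h=0.3):
    """Gaussian kernel with bandwidth h."""
    return np.exp(-0.5 * (u / h) ** 2) / (h * np.sqrt(2 * np.pi))

rng = np.random.default_rng(2)
w = rng.normal(size=200)      # observations forming the first step
z, wi = 0.1, 0.7              # evaluation point and the "added" observation

# With m(w, z, gamma) = K_h(z - w) - gamma, the equation
# 0 = M(z, gamma) + eps * m(w_i, z, gamma) solves to
# gamma_i(eps) = (fhat(z) + eps * K_h(z - w_i)) / (1 + eps).
fhat = Kh(z - w).mean()
gamma_i = lambda eps: (fhat + eps * Kh(z - wi)) / (1 + eps)

eps = 1e-6
numeric = (gamma_i(eps) - gamma_i(-eps)) / (2 * eps)   # finite difference
analytic = Kh(z - wi) - fhat                           # closed-form derivative
```

In practice the derivative of g(w_j, θ, γ̂_i(ε)) would be taken by the same kind of finite difference when no closed form is available.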
5 Double and Partial Robustness
The zero derivative condition in equation (2.5) is an appealing robustness property in and of
itself. A zero derivative means that the expected moment functions remain closer to zero than τ as τ varies away from zero. This property can be interpreted as local insensitivity of the moments to the value of γ being plugged in, with the moments remaining close to zero as γ varies away from its true value. Because it is difficult to get nonparametric functions exactly right, especially in high dimensional settings, this property is an appealing one.
Such robustness considerations, well explained in Robins and Rotnitzky (2001), have moti-
vated the development of doubly robust (DR) moment conditions. DR moment conditions have
expectation zero if one first stage component is incorrect. DR moment conditions allow two
chances for the moment conditions to hold, an appealing robustness feature. Also, DR moment
conditions have simpler conditions for asymptotic normality than general LR moment functions
as discussed in Section 7. Because many interesting LR moment conditions are also DR we
consider double robustness.
LR moments that are constructed by adding the adjustment term for first step estimation
provide candidates for DR moment functions. The derivative of the expected moments with
respect to each first step will be zero, a necessary condition for DR. The condition for moments
constructed in this way to be DR is the following:
Assumption 1: There are sets Γ and Λ such that for all γ ∈ Γ and λ ∈ Λ,
E[φ(w, θ_0, γ, λ_0)] = −E[g(w, θ_0, γ)], E[φ(w, θ_0, γ_0, λ)] = 0.
This condition is just the definition of DR for the moment function ψ(w, θ, γ, λ) = g(w, θ, γ) + φ(w, θ, γ, λ), pertaining to the specific sets Γ and Λ.
The construction of adding the adjustment term to an identifying or original moment function leads to several novel classes of DR moment conditions. One such class has a first step that satisfies a conditional moment restriction
E[Y − γ_0(X) | Z] = 0, (5.1)
where X is potentially endogenous and Z is a vector of instrumental variables. This condition is the nonparametric instrumental variable (NPIV) restriction as in Newey and Powell (1989, 2003) and Newey (1991). A first step conditional expectation where γ_0(X) = E[Y | X] is included as a special case with Z = X. Ichimura and Newey (2017) showed that the adjustment term for this step takes the form φ(w, γ, λ) = λ(Z)[Y − γ(X)], so g(w, θ, γ) + λ(Z)[Y − γ(X)] is a candidate for a DR moment function. A sufficient condition for DR is:
Assumption 2: i) Equation (5.1) is satisfied; ii) Λ = {λ(Z) : E[λ(Z)²] < ∞} and Γ = {γ(X) : E[γ(X)²] < ∞}; iii) there is v(X) with E[v(X)²] < ∞ such that E[g(w, θ_0, γ)] = E[v(X){γ(X) − γ_0(X)}] for all γ ∈ Γ; iv) there is λ_0(Z) such that v(X) = E[λ_0(Z) | X]; and v) E[Y²] < ∞.
By the Riesz representation theorem condition iii) is necessary and sufficient for E[g(w, θ_0, γ)] to be a mean square continuous functional of γ with representer v(X). Condition iv) is an additional condition giving continuity in the reduced form difference E[γ(X) − γ_0(X) | Z], as further discussed in Ichimura and Newey (2017). Under this condition
E[g(w, θ_0, γ)] = E[E[λ_0(Z) | X]{γ(X) − γ_0(X)}] = E[λ_0(Z){γ(X) − γ_0(X)}] = −E[φ(w, γ, λ_0)], E[φ(w, γ_0, λ)] = E[λ(Z){Y − γ_0(X)}] = 0.
Thus Assumption 2 implies Assumption 1 so that we have
Theorem 3: If Assumption 2 is satisfied then g(w, θ, γ) + λ(Z)[Y − γ(X)] is doubly robust.
There are many interesting, novel examples of DR moment conditions that are special cases of Theorem 3. The average surplus bound is an example where X = Z = x is the observed vector of prices and income, Λ = Γ is the set of all measurable functions of x with finite second moment, and γ_0(x) = E[q | x] for quantity q. Let p_1 denote the first price and p_2 the vector of other prices and income, so that x = (p_1, p_2′)′. Also let f_0(p_1 | p_2) denote the conditional pdf of p_1 given p_2 and ω(x) = ω(p_1, y) for income y. Let g(w, θ, γ) = ∫ ω(p_1, p_2)γ(p_1, p_2)dp_1 − θ as before. Multiplying and dividing through by f_0(p_1 | p_2) gives, for all γ ∈ Γ and λ_0(x) = f_0(p_1 | p_2)^{−1}ω(x),
E[g(w, θ_0, γ)] = E[∫ ω(p_1, p_2)γ(p_1, p_2)dp_1] − θ_0 = E[E[λ_0(x)γ(x) | p_2]] − θ_0 = E[λ_0(x){γ(x) − γ_0(x)}].
Theorem 3 then implies that the LR moment function for average surplus g(w, θ, γ) + λ(x)[q − γ(x)] is DR. A corresponding DR estimator is given in equation (2.10).
The surplus bound is an example of a parameter where θ_0 = E[m(w, γ_0)] for some linear functional m(w, γ) of γ and for γ_0 satisfying the conditional moment restriction of equation (5.1). For the surplus bound m(w, γ) = ∫ ω(p_1, p_2)γ(p_1, p_2)dp_1. If Assumption 2 is satisfied then choosing g(w, θ, γ) = m(w, γ) − θ, a DR moment condition is m(w, γ) − θ + λ(Z)[Y − γ(X)]. A corresponding DR estimator is
θ̂ = (1/n) Σ_{i=1}^n {m(w_i, γ̂) + λ̂(Z_i)[Y_i − γ̂(X_i)]}, (5.2)
where γ̂(X) and λ̂(Z) are estimators of γ_0(X) and λ_0(Z) respectively. An estimator γ̂ can be constructed by nonparametric regression when Z = X or NPIV in general. A series estimator λ̂(Z) can be constructed similarly to the surplus bound example in Section 4.2. For Z = X Newey and Robins (2017) give such series estimators of λ̂(Z) and Chernozhukov, Newey, and Robins (2018) show how to choose the approximating functions for λ̂(Z) by machine learning. Simple and general conditions for root-n consistency and asymptotic normality of θ̂ that allow for machine learning are given in Section 7.
Novel examples of the DR estimator in equation (5.2) with Z = X are given by Newey and Robins (2017) and Chernozhukov, Newey, and Robins (2018). Also Appendix C provides a generalization to γ̂ and λ̂ that satisfy orthogonality conditions more general than conditional moment restrictions and novel examples of those. A novel example with Z ≠ X is a weighted average derivative of γ_0(X) satisfying equation (5.1). Here m(w, γ) = ω̄(X)∂γ(X)/∂x for some weight function ω̄(x). Let f_0(x) be the pdf of X and v(x) = −f_0(x)^{−1}∂[ω̄(x)f_0(x)]/∂x, assuming that derivatives exist. Assume that ω̄(x)γ(x)f_0(x) is zero on the boundary of the support of X. Integration by parts then gives Assumption 2 iii). Assume also that there exists λ_0 ∈ Λ with v(X) = E[λ_0(Z) | X]. Then for estimators γ̂ and λ̂ a DR estimator of the weighted average derivative is
θ̂ = (1/n) Σ_{i=1}^n {ω̄(X_i)∂γ̂(X_i)/∂x + λ̂(Z_i)[Y_i − γ̂(X_i)]}.
This is a DR version of the weighted average derivative estimator of Ai and Chen (2007). A special case of this example is the DR moment condition for the weighted average derivative in the exogenous case where Z = X given in Firpo and Rothe (2017).
Theorem 3 includes existing DR moment functions as special cases where Z = X, including the mean with randomly missing data given by Robins and Rotnitzky (1995), the class of DR estimators in Robins et al. (2008), and the DR estimators of Firpo and Rothe (2017). We illustrate for the mean with missing data. Let Z = X = (D, V′)′ for an observed data indicator D ∈ {0, 1} and covariates V, g(w, θ, γ) = γ(1, V) − θ, and π_0(V) = Pr(D = 1 | V). Here it is well known that
E[g(w, θ_0, γ)] = E[γ(1, V)] − θ_0 = E[λ_0(X){γ(X) − γ_0(X)}] = −E[λ_0(X){Y − γ(X)}], λ_0(X) = D/π_0(V).
Then DR of the moment function γ(1, V) − θ + [D/π(V)][Y − γ(X)] of Robins and Rotnitzky (1995) follows by Theorem 3.
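The double robustness of this moment function is easy to see in a toy simulation: the estimated mean stays consistent when either the regression γ or the propensity score π is deliberately misspecified. The sketch below (ours, with simple parametric stand-ins for the nonparametric first steps) illustrates both cases:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20000
V = rng.uniform(-1, 1, n)
pi0 = 1 / (1 + np.exp(-V))                   # true propensity Pr(D = 1 | V)
D = (rng.uniform(size=n) < pi0).astype(float)
Y = 1.0 + 2.0 * V + rng.normal(size=n)       # observed only when D = 1
theta0 = 1.0                                 # E[Y], since E[V] = 0

def dr_mean(gamma_hat, pi_hat):
    """Sample DR moment: gamma(1,V) + [D/pi(V)][Y - gamma(X)], averaged."""
    return np.mean(gamma_hat + D / pi_hat * (Y - gamma_hat))

gamma_true = 1.0 + 2.0 * V
# correct regression, badly misspecified propensity: still consistent
est_wrong_pi = dr_mean(gamma_true, np.full(n, 0.9))
# correct propensity, badly misspecified regression: still consistent
est_wrong_gamma = dr_mean(np.zeros(n), pi0)
```

Either first step being correct is enough for the moment to have mean zero at θ_0, which is exactly the two-chances property described above.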
Another novel class of DR moment conditions are those where the first step is a pdf of a function Z of the data observation w. By Proposition 5 of Newey (1994a), the adjustment term for such a first step is φ(w, γ, λ) = λ(Z) − ∫ λ(z)γ(z)dz for some possible λ. A sufficient condition for DR as in Assumption 1 is:
Assumption 3: Z has pdf γ_0(z) and for Γ = {γ : γ(z) ≥ 0, ∫ γ(z)dz = 1} there is λ_0(z) such that for all γ ∈ Γ,
E[g(w, θ_0, γ)] = ∫ λ_0(z){γ(z) − γ_0(z)}dz.
Note that for φ(w, γ, λ) = λ(Z) − ∫ λ(z)γ(z)dz it follows from Assumption 3 that E[φ(w, γ, λ_0)] = ∫ λ_0(z){γ_0(z) − γ(z)}dz = −E[g(w, θ_0, γ)] for all γ ∈ Γ. Also, E[φ(w, γ_0, λ)] = E[λ(Z)] − ∫ λ(z)γ_0(z)dz = 0. Then Assumption 1 is satisfied so we have:
Theorem 4: If Assumption 3 is satisfied then g(w, θ, γ) + λ(Z) − ∫ λ(z)γ(z)dz is DR.
The integrated squared density θ_0 = ∫ γ_0(z)²dz is an example for g(w, θ, γ) = γ(Z) − θ, λ_0 = γ_0, and
ψ(w, θ, γ, λ) = γ(Z) − θ + λ(Z) − ∫ λ(z)γ(z)dz.
This DR moment function seems to be novel. Another example is the density weighted average derivative (DWAD) of Powell, Stock, and Stoker (1989), where g(w, θ, γ) = −2Y ∂γ(Z)/∂z − θ. Let h(z) = E[Y | Z = z]γ_0(z). Assuming that h(z)γ(z) is zero on the boundary and differentiable, integration by parts gives
E[g(w, θ_0, γ)] = −2E[Y ∂γ(Z)/∂z] − θ_0 = ∫ 2[∂h(z)/∂z]{γ(z) − γ_0(z)}dz,
so that Assumption 3 is satisfied with λ_0(z) = 2∂h(z)/∂z. Then by Theorem 4,
θ̂ = (1/n) Σ_{i=1}^n {−2Y_i ∂γ̂(Z_i)/∂z + λ̂(Z_i) − ∫ λ̂(z)γ̂(z)dz}
is a DR estimator. It was shown in NHR (1998) that the Powell, Stock, and Stoker (1989) estimator with a twicing kernel is numerically equal to a leave one out version of this estimator for the original (before twicing) kernel. Thus the DR result for θ̂ gives an interpretation of the
twicing kernel estimator as a DR estimator.
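The integrated squared density example above is simple enough to compute end to end. The sketch below (ours; no cross-fitting or bandwidth tuning) uses a kernel density estimate for both γ̂ and λ̂, so the DR estimator reduces to 2·(1/n)Σ_i f̂(z_i) − ∫ f̂(z)²dz:

```python
import numpy as np

rng = np.random.default_rng(4)
z = rng.normal(size=2000)        # true theta0 = Int phi(u)^2 du = 1/(2 sqrt(pi))
h = 0.3
grid = np.linspace(-6.0, 6.0, 2001)

def kde(points, data, h):
    """Gaussian kernel density estimate evaluated at `points`."""
    u = (points[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u ** 2).mean(axis=1) / (h * np.sqrt(2 * np.pi))

f_at_z = kde(z, z, h)                               # gamma^(z_i) = lambda^(z_i)
dz = grid[1] - grid[0]
int_f2 = (kde(grid, z, h) ** 2).sum() * dz          # Int lambda^ gamma^ (Riemann sum)
theta_dr = 2.0 * f_at_z.mean() - int_f2             # 2 E_n[f^(z)] - Int f^2
theta0 = 1.0 / (2.0 * np.sqrt(np.pi))
```

The correction term −∫ f̂² cancels the first-order smoothing bias of the plug-in ∫ f̂², which is why the estimate lands close to θ_0 even with a fairly crude bandwidth.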
The expectations of the DR moment functions of both Theorems 3 and 4 are affine in γ and λ, holding the other fixed at the truth. This property of DR moment functions is general, as we show by the following characterization of DR moment functions:
Theorem 5: If Γ and Λ are linear then ψ(w, θ, γ, λ) is DR if and only if
(d/dτ) E[ψ(w, θ_0, (1 − τ)γ_0 + τγ, λ_0)] |_{τ=0} = 0, (d/dτ) E[ψ(w, θ_0, γ_0, (1 − τ)λ_0 + τλ)] |_{τ=0} = 0,
and E[ψ(w, θ_0, γ, λ_0)] and E[ψ(w, θ_0, γ_0, λ)] are affine in γ and λ respectively.
The zero derivative condition of this result is a Gateaux derivative, componentwise version of LR. Thus, we can focus a search for DR moment conditions on those that are LR. Also, a DR moment function must have an expectation that is affine in each of γ and λ while the other is held fixed at the truth. It is sufficient for this condition that ψ(w, θ_0, γ, λ) be affine in each of γ and λ while the other is held fixed. This property can depend on how γ and λ are specified. For example the missing data DR moment function γ(1, V) − θ + π(V)^{−1}D[Y − γ(X)] is not affine in the propensity score π(V) = Pr(D = 1 | V) but is in λ(V) = π(V)^{−1}.
In general Theorem 5 motivates the construction of DR moment functions by adding the adjustment term to obtain an LR moment function that will then be DR if it is affine in γ and λ separately. It is interesting to note that in the NPIV setting of Theorem 3 and the density setting of Theorem 4 the adjustment term is always affine in γ and λ. It then follows from Theorem 5 that in those settings LR moment conditions are precisely those where E[g(w, θ_0, γ)] is affine in γ. Robins and Rotnitzky (2001) gave conditions for the existence of DR moment conditions in semiparametric models. Theorem 5 is complementary to those results in giving a complete characterization of DR moments when Γ and Λ are linear.
Assumptions 2 and 3 both specify that E[g(w, θ_0, γ)] is continuous in γ in an integrated squared deviation norm. These continuity conditions are linked to finiteness of the semiparametric variance bound for the functional E[g(w, θ_0, γ)], as discussed in Newey and McFadden (1994) for Assumption 2 with Z = X and for Assumption 3. For Assumption 2 with Z ≠ X Severini and Tripathi (2012) showed for g(w, θ, γ) = v(X)γ(X) − θ with known v(X) that the existence of λ_0(Z) with v(X) = E[λ_0(Z) | X] is necessary for the existence of a root-n consistent estimator of θ. Thus the conditions of Assumption 2 are also linked to necessary conditions for root-n consistent estimation when Z ≠ X.
Partial robustness refers to settings where E[g(w, θ_0, γ̄)] = 0 for some γ̄ ≠ γ_0. The novel DR moment conditions given here lead to novel partial robustness results, as we now demonstrate in the conditional moment restriction setting of Assumption 2. When λ_0(Z) in Assumption 2 is restricted in some way there may exist γ̄ ≠ γ_0 with E[λ_0(Z){Y − γ̄(X)}] = 0. Then
E[g(w, θ_0, γ̄)] = −E[λ_0(Z){Y − γ̄(X)}] = 0.
Consider the average derivative θ_0 = E[∂γ_0(X)/∂x], where g(w, θ, γ) = ∂γ(X)/∂x − θ. Let δ = (E[a(Z)b(X)′])^{−1}E[a(Z)Y] be the limit of the linear IV estimator with right hand side variables b(X) and the same number of instruments a(Z). The following is a partial robustness result that provides conditions for the average derivative of the linear IV estimator to equal the true average derivative:
Theorem 6: If −∂ ln f_0(x)/∂x = C b(x) for a constant matrix C, E[b(X)b(X)′] is nonsingular, and E[a(Z) | X = x] = Π b(x) for a square nonsingular Π, then for δ = (E[a(Z)b(X)′])^{−1}E[a(Z)Y],
E[∂{b(X)′δ}/∂x] = E[∂γ_0(X)/∂x].
This result shows that if the density score is a linear combination of the right-hand side variables b(X) used by linear IV, the conditional expectation of the instruments a(Z) given X is a nonsingular linear combination of b(X), and b(X) has a nonsingular second moment matrix, then the average derivative of the linear IV estimator is the true average derivative. This is a generalization to NPIV of Stoker's (1986) result that linear regression coefficients equal average derivatives when the regressors are multivariate Gaussian.
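Stoker's result in the exogenous case Z = X is easy to verify by simulation: with standard normal regressors the OLS slope equals the average derivative. Here γ_0(x) = x³, whose average derivative is E[3X²] = 3, and by Stein's identity the population OLS slope is E[X·X³] = E[X⁴] = 3 as well. A quick check (ours, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200000
x = rng.normal(size=n)
y = x ** 3                                # gamma_0(x) = x^3, no noise needed
slope = np.cov(x, y)[0, 1] / np.var(x)    # OLS slope of y on x
avg_deriv = np.mean(3 * x ** 2)           # sample average derivative
```

The equality is a population identity, so the two Monte Carlo quantities agree up to sampling error.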
DR moment conditions can be used to identify parameters of interest. Under Assumption 1, θ_0 may be identified from
E[g(w, θ_0, γ)] = −E[φ(w, θ_0, γ, λ_0)]
for any fixed γ when the solution θ_0 to this equation is unique.
Theorem 7: If Assumption 1 is satisfied, λ_0 is identified, and for some γ̄ ∈ Γ the equation E[ψ(w, θ, γ̄, λ_0)] = 0 has a unique solution then θ_0 is identified as that solution.
Applying this result to the NPIV setting of Assumption 2 gives an explicit formula for certain functionals of γ_0(X) without requiring that the completeness identification condition of Newey and Powell (1989, 2003) be satisfied, similarly to Santos (2011). Suppose that v(X) is identified, e.g. as for the weighted average derivative. Since both Y and Z are observed it follows that a solution λ_0(Z) to v(X) = E[λ_0(Z) | X] will be identified if such a solution exists. Plugging γ = 0 into the equation E[ψ(w, θ_0, γ, λ_0)] = 0 gives
Corollary 8: If v(X) is identified and there exists λ_0(Z) such that v(X) = E[λ_0(Z) | X] then θ_0 = E[v(X)γ_0(X)] is identified as θ_0 = E[λ_0(Z)Y].
Note that this result holds without the completeness condition. Identification of θ_0 = E[v(X)γ_0(X)] for known v(X) with v(X) = E[λ_0(Z) | X] follows from Severini and Tripathi (2006). Corollary 8 extends that analysis to the case where v(X) is only identified but not necessarily known and links it to DR moment conditions. Santos (2011) gives a related formula for a parameter θ_0 = ∫ v(x)γ_0(x)dx. The formula here differs from Santos (2011) in being an expectation rather than a Lebesgue integral. Santos (2011) also constructed an estimator; doing so here is beyond the scope of this paper.
6 Conditional Moment Restrictions
Models of conditional moment restrictions that depend on unknown functions are important
in econometrics. In such models the nonparametric components may be determined simulta-
neously with the parametric components. In this setting it is useful to work directly with the
instrumental variables to obtain LR moment conditions rather than to make a first step in-
fluence adjustment. For that reason we focus in this Section on constructing LR moments by
orthogonalizing the instrumental variables.
Our orthogonal instruments framework is based on conditional moment restrictions of the form
E[ρ_j(w, θ_0, γ_0) | Z_j] = 0, (j = 1, ..., J), (6.1)
where each ρ_j(w, θ, γ) is a scalar residual and Z_j are instruments that may differ across j. This model is considered by Chamberlain (1992) and Ai and Chen (2003, 2007) when Z_j is the same for each j and by Ai and Chen (2012) when the set of Z_j includes Z_{j−1}. We allow the residual vector ρ(w, θ, γ) = (ρ_1(w, θ, γ), ..., ρ_J(w, θ, γ))′ to depend on the entire function γ and not just on its value at some function of the observed data w.
In this framework we consider LR moment functions having the form
ψ(w, θ, γ) = A(Z)ρ(w, θ, γ), (6.2)
where A(Z) = [A_1(Z_1), ..., A_J(Z_J)] is a matrix of instrumental variables with the j-th column given by A_j(Z_j). We will define orthogonal instruments to be those that make ψ(w, θ, γ) locally robust. To define orthogonal instrumental variables we assume that γ is allowed to vary over a linear set Γ as F varies. For each ∆ ∈ Γ let
ρ̄(∆) = ((d/dτ)E[ρ_1(w, θ_0, γ_0 + τ∆) | Z_1], ..., (d/dτ)E[ρ_J(w, θ_0, γ_0 + τ∆) | Z_J])′ |_{τ=0}.
This ρ̄(∆) is the Gateaux derivative with respect to γ of the conditional expectations of the residuals in the direction ∆. We characterize A_0(Z) as orthogonal if
E[A_0(Z)ρ̄(∆)] = 0 for all ∆ ∈ Γ.
We assume that ρ̄(∆) is linear in ∆ and consider the Hilbert space of vectors of random variables a(Z) = (a_1(Z_1), ..., a_J(Z_J))′ with inner product ⟨a, ã⟩ = E[a(Z)′ã(Z)]. Let Λ denote the closure of the set {ρ̄(∆) : ∆ ∈ Γ} in that Hilbert space. Orthogonal instruments are those where each row of A_0(Z) is orthogonal to Λ. They can be interpreted as instrumental variables where the effect of estimation of γ has been partialed out. When A_0(Z) is orthogonal then ψ(w, θ, γ) = A_0(Z)ρ(w, θ, γ) is LR:
Theorem 9: If each row of A_0(Z) is orthogonal to Λ then the moment functions in equation (6.2) are LR.
We also have a DR result:
Theorem 10: If each row of A_0(Z) is orthogonal to Λ and ρ(w, θ, γ) is affine in γ ∈ Γ then the moment functions in equation (6.2) are DR for Λ = {a(Z) : E[a(Z)′ρ(w, θ_0, γ_0)ρ(w, θ_0, γ_0)′a(Z)] < ∞}.
There are many ways to construct orthogonal instruments. For instance, given a matrix of instrumental variables A(Z) with J columns one could construct corresponding orthogonal ones A_0(Z) as the matrix where each row of A(Z) is replaced by the residual from the least squares projection of the corresponding row of A(Z) on Λ. For local identification of θ we also require that
rank((∂/∂θ)E[ψ(w, θ, γ_0)] |_{θ=θ_0}) = dim(θ). (6.3)
A model where θ_0 is identified from semiparametric conditional moment restrictions with common instrumental variables is a special case where Z_j is the same for each j. In this case there is a way to construct orthogonal instruments that leads to an efficient estimator of θ_0. Let Σ(Z) denote some positive definite matrix with its smallest eigenvalue bounded away from zero, so that Σ(Z)^{−1} is bounded. Let ⟨a, ã⟩_Σ = E[a(Z)′Σ(Z)^{−1}ã(Z)] denote an inner product and note that Λ is closed in this inner product by Σ(Z)^{−1} bounded. Let ã_k^Σ(Z)′ denote the residual from the least squares projection of the k-th row of A(Z) on Λ with the inner product ⟨·, ·⟩_Σ. Then for all ∆ ∈ Γ,
E[ã_k^Σ(Z)′Σ(Z)^{−1}ρ̄(∆)] = 0,
so that for Ã^Σ(Z) denoting the matrix with k-th row ã_k^Σ(Z)′, the instrumental variables Ã^Σ(Z)Σ(Z)^{−1} are orthogonal. Also, Ã^Σ(Z) can be interpreted as the solution to
min_{B(Z): each row of B(Z) ∈ Λ} E[{A(Z) − B(Z)}Σ(Z)^{−1}{A(Z) − B(Z)}′],
where the minimization is in the positive semidefinite sense.
The orthogonal instruments that minimize the asymptotic variance of GMM in the class of GMM estimators with orthogonal instruments are given by
A*(Z) = Ã^{Σ*}(Z)Σ*(Z)^{−1}, Ā(Z) = (∂/∂θ)E[ρ(w, θ, γ_0) | Z] |_{θ=θ_0}, Σ*(Z) = Var(ρ | Z), ρ = ρ(w, θ_0, γ_0),
where Ã^{Σ*}(Z) is the matrix of residuals from projecting the rows of Ā(Z) on Λ with the inner product ⟨·, ·⟩_{Σ*}.
Theorem 11: The instruments A*(Z) give an efficient estimator in the class of IV estimators with orthogonal instruments.
The asymptotic variance of the GMM estimator with optimal orthogonal instruments is
(E[A*(Z)ρρ′A*(Z)′])^{−1} = (E[Ã^{Σ*}(Z)Σ*(Z)^{−1}Ã^{Σ*}(Z)′])^{−1}.
This matrix coincides with the semiparametric variance bound of Ai and Chen (2003). Estimation of the optimal orthogonal instruments is beyond the scope of this paper. The series estimator of Ai and Chen (2003) could be used for this.
This framework includes moment restrictions with a NPIV first step satisfying E[Y − γ_0(X) | Z] = 0, where we can specify ρ_1(w, θ, γ) = Y − γ(X), Z_1 = Z, ρ_2(w, θ, γ) = m(w, γ) − θ, and Z_2 a constant. It generalizes that setup by allowing for more residuals ρ_j(w, θ, γ), (j ≥ 3), and allowing all residuals to depend on γ.
7 Asymptotic Theory
In this Section we give simple and general asymptotic theory for LR estimators that incorporates the cross-fitting of equation (2.9). Throughout we use the structure of LR moment functions that are the sum ψ(w, θ, γ, λ) = g(w, θ, γ) + φ(w, θ, γ, λ) of an identifying or original moment function g(w, θ, γ) depending on a first step function γ and an influence adjustment term φ(w, θ, γ, λ) that can depend on an additional first step λ. The asymptotic theory will apply to any moment function that can be decomposed into a function of a single nonparametric estimator and a function of two nonparametric estimators. This structure and LR lead to particularly simple and general conditions.
The conditions we give are composed of mean square consistency conditions for first steps and one, two, or three rate conditions for quadratic remainders. We will only use one quadratic remainder rate for DR moment conditions, involving faster than 1/√n convergence of products of estimation errors for γ̂ and λ̂. When E[g(w, θ_0, γ) + φ(w, θ_0, γ, λ_0)] is not affine in γ we will impose a second rate condition that involves faster than n^{−1/4} convergence of γ̂. When E[φ(w, θ_0, γ_0, λ)] is also not affine in λ we will impose a third rate condition that involves faster than n^{−1/4} convergence of λ̂. Most adjustment terms φ(w, θ, γ, λ) of which we are aware, including those for first step conditional moment restrictions and densities, have E[φ(w, θ_0, γ_0, λ)] affine in λ, so that faster than n^{−1/4} convergence of λ̂ will not be required under our conditions. It will suffice for most LR estimators of which we know to have faster than n^{−1/4} convergence of γ̂ and faster than 1/√n convergence of the product of estimation errors for γ̂ and λ̂, with only the latter condition imposed for DR moment functions. We also impose some additional conditions for convergence of the Jacobian of the moments and sample second moments that give asymptotic normality and consistent asymptotic variance estimation for θ̂.
An important intermediate result for asymptotic normality is
\[
\sqrt{n}\,\hat\psi(\beta_0) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\psi(w_i,\gamma_0,\lambda_0,\beta_0) + o_p(1), \tag{7.1}
\]
where $\hat\psi(\beta)$ is the cross-fit sample LR moment of equation (2.9). This result will mean that the presence of the first step estimators has no effect on the limiting distribution of the moments at the true $\beta_0$. To formulate conditions for this result we decompose the difference between the left- and right-hand sides into several remainders. Let $\bar g(\gamma) = E[g(w,\gamma,\beta_0)]$ and $\bar\phi(\gamma,\lambda) = E[\phi(w,\gamma,\lambda,\beta_0)]$, so that $\bar\psi(\gamma,\lambda) = \bar g(\gamma) + \bar\phi(\gamma,\lambda)$. Then adding and subtracting terms gives
\[
\sqrt{n}\Big[\hat\psi(\beta_0) - \frac{1}{n}\sum_{i=1}^{n}\psi(w_i,\gamma_0,\lambda_0,\beta_0)\Big] = \hat R_1 + \hat R_2 + \hat R_3 + \hat R_4, \tag{7.2}
\]
where
\[
\hat R_1 = \frac{1}{\sqrt{n}}\sum_{\ell=1}^{L}\sum_{i\in I_\ell}\big[g(w_i,\hat\gamma_\ell,\beta_0) - g(w_i,\gamma_0,\beta_0) - \bar g(\hat\gamma_\ell)\big] \tag{7.3}
\]
\[
\qquad + \frac{1}{\sqrt{n}}\sum_{\ell=1}^{L}\sum_{i\in I_\ell}\big[\phi(w_i,\hat\gamma_\ell,\lambda_0,\beta_0) - \phi(w_i,\gamma_0,\lambda_0,\beta_0) - \bar\phi(\hat\gamma_\ell,\lambda_0) + \phi(w_i,\gamma_0,\hat\lambda_\ell,\beta_0) - \phi(w_i,\gamma_0,\lambda_0,\beta_0) - \bar\phi(\gamma_0,\hat\lambda_\ell)\big],
\]
\[
\hat R_2 = \frac{1}{\sqrt{n}}\sum_{\ell=1}^{L}\sum_{i\in I_\ell}\big[\phi(w_i,\hat\gamma_\ell,\hat\lambda_\ell,\beta_0) - \phi(w_i,\hat\gamma_\ell,\lambda_0,\beta_0) - \phi(w_i,\gamma_0,\hat\lambda_\ell,\beta_0) + \phi(w_i,\gamma_0,\lambda_0,\beta_0)\big],
\]
\[
\hat R_3 = \sum_{\ell=1}^{L}\frac{n_\ell}{\sqrt{n}}\,\bar\psi(\hat\gamma_\ell,\lambda_0),\qquad
\hat R_4 = \sum_{\ell=1}^{L}\frac{n_\ell}{\sqrt{n}}\,\bar\phi(\gamma_0,\hat\lambda_\ell).
\]
We specify regularity conditions sufficient for each of $\hat R_1$, $\hat R_2$, $\hat R_3$, and $\hat R_4$ to converge in probability to zero, so that equation (7.1) will hold. The remainder term $\hat R_1$ is a stochastic equicontinuity term as in Andrews (1994). We give mean square consistency conditions for $\hat R_1 \overset{p}{\to} 0$ in Assumption 4.
The remainder term $\hat R_2$ is a second order remainder that involves both $\hat\gamma$ and $\hat\lambda$. When the influence adjustment is $\phi(w,\gamma,\lambda) = \lambda(x)[y - \gamma(x)]$, as for conditional moment restrictions, then
\[
\hat R_2 = -\frac{1}{\sqrt{n}}\sum_{\ell=1}^{L}\sum_{i\in I_\ell}[\hat\lambda_\ell(x_i) - \lambda_0(x_i)][\hat\gamma_\ell(x_i) - \gamma_0(x_i)].
\]
$\hat R_2$ will converge to zero when the product of convergence rates for $\hat\gamma(x)$ and $\hat\lambda(x)$ is faster than $1/\sqrt{n}$. However, that is not the weakest possible condition. Weaker conditions for locally linear regression first steps are given by Firpo and Rothe (2017), and for series regression first steps by Newey and Robins (2017). These weaker conditions still require that the product of the biases of $\hat\gamma(x)$ and $\hat\lambda(x)$ converge to zero faster than $1/\sqrt{n}$, but impose weaker conditions on variance terms. We allow for these weaker conditions by allowing $\hat R_2 \overset{p}{\to} 0$ itself as a regularity condition. Assumption 5 gives these conditions.
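As a back-of-the-envelope check of this product condition: if $\hat\gamma$ converges at rate $n^{-a}$ and $\hat\lambda$ at rate $n^{-b}$, then $\sqrt{n}\,n^{-a}n^{-b} \to 0$ exactly when $a + b > 1/2$. The snippet below is a trivial numerical illustration of that trade-off, with the rates as assumed inputs rather than anything estimated:

```python
import math

def scaled_product_remainder(n, a, b):
    """sqrt(n) times the product of first step error rates n^-a and n^-b."""
    return math.sqrt(n) * n ** (-a) * n ** (-b)

# a + b = 0.55 > 1/2: the scaled product vanishes even though a = 0.1 is slow.
vanishing = [scaled_product_remainder(n, 0.1, 0.45) for n in (1e2, 1e4, 1e6)]
# a + b = 0.40 < 1/2: the scaled product diverges.
diverging = [scaled_product_remainder(n, 0.1, 0.30) for n in (1e2, 1e4, 1e6)]
```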
We will have $\hat R_3 = \hat R_4 = 0$ in the DR case of Assumption 1, where $\hat R_1 \overset{p}{\to} 0$ and $\hat R_2 \overset{p}{\to} 0$ will suffice for equation (7.1). In non-DR cases local robustness leads to $\bar\psi(\gamma,\lambda_0) = \bar g(\gamma) + \bar\phi(\gamma,\lambda_0)$ having a zero functional derivative with respect to $\gamma$ at $\gamma_0$, so that $\hat R_3 \overset{p}{\to} 0$ when $\hat\gamma$ converges to $\gamma_0$ at a rapid enough, feasible rate. For example, if $\bar\psi(\gamma,\lambda_0)$ is twice continuously Frechet differentiable in a neighborhood of $\gamma_0$ for a norm $\|\cdot\|$, with zero Frechet derivative at $\gamma_0$ and second derivative bounded by $C$, then
\[
|\hat R_3| \leq C\sum_{\ell=1}^{L}\frac{n_\ell}{\sqrt{n}}\,\|\hat\gamma_\ell - \gamma_0\|^2 \overset{p}{\longrightarrow} 0
\]
when $\|\hat\gamma_\ell - \gamma_0\| = o_p(n^{-1/4})$. Here $\hat R_3 \overset{p}{\to} 0$ when each $\hat\gamma_\ell$ converges to $\gamma_0$ more quickly than $n^{-1/4}$. It may be possible to weaken this condition by bias correcting $\bar\psi(\hat\gamma_\ell,\lambda_0)$, as by the bootstrap in Cattaneo and Jansson (2017), by the jackknife in Cattaneo, Jansson, and Ma (2017), and by cross-fitting in Newey and Robins (2017). Consideration of such bias corrections for $\bar\psi(\hat\gamma_\ell,\lambda_0)$ is beyond the scope of this paper.
In many cases $\hat R_4 = 0$ even though the moment conditions are not DR. For example, that is true when $\gamma_0$ is a pdf or when $\gamma_0$ estimates the solution to a conditional moment restriction. In such cases mean square consistency of the first steps, $\hat R_2 \overset{p}{\to} 0$, and faster than $n^{-1/4}$ consistency of $\hat\gamma$ suffice for equation (7.1); no convergence rate for $\hat\lambda$ is needed. The simplification that $\hat R_4 = 0$ seems to be the result of $\lambda_0$ being a Riesz representer for the linear functional that is the derivative of $\bar g(\gamma)$ with respect to $\gamma$. Such a Riesz representer will enter $\phi(w,\gamma_0,\lambda,\beta_0)$ linearly, leading to $\hat R_4 = 0$. When $\hat R_4 \neq 0$, then $\hat R_4 \overset{p}{\to} 0$ will follow from twice Frechet differentiability of $\bar\phi(\gamma_0,\lambda)$ in $\lambda$ and faster than $n^{-1/4}$ convergence of $\hat\lambda$.
All of the conditions can be easily checked for a wide variety of machine learning and conventional nonparametric estimators. There are well known conditions for mean square consistency for many conventional and machine learning methods. Rates for products of estimation errors are also known for many first step estimators, as are conditions for $n^{-1/4}$ consistency. Thus, the simple conditions we give here are general enough to apply to a wide variety of first step estimators.
The first formal assumption of this section is sufficient for $\hat R_1 \overset{p}{\to} 0$.

Assumption 4: For each $\ell = 1,\dots,L$, i) either $g(w,\gamma,\beta_0)$ does not depend on $\gamma$ or $\int[g(w,\hat\gamma_\ell,\beta_0) - g(w,\gamma_0,\beta_0)]^2F_0(dw) \overset{p}{\to} 0$; ii) $\int[\phi(w,\hat\gamma_\ell,\lambda_0,\beta_0) - \phi(w,\gamma_0,\lambda_0,\beta_0)]^2F_0(dw) \overset{p}{\to} 0$ and $\int[\phi(w,\gamma_0,\hat\lambda_\ell,\beta_0) - \phi(w,\gamma_0,\lambda_0,\beta_0)]^2F_0(dw) \overset{p}{\to} 0$.
The cross-fitting used in the construction of $\hat\psi(\beta_0)$ is what makes the mean square consistency conditions of Assumption 4 sufficient for $\hat R_1 \overset{p}{\to} 0$. The next condition is sufficient for $\hat R_2 \overset{p}{\to} 0$.
Assumption 5: For each $\ell = 1,\dots,L$, either i)
\[
\sqrt{n}\int\big|\phi(w,\hat\gamma_\ell,\hat\lambda_\ell,\beta_0) - \phi(w,\hat\gamma_\ell,\lambda_0,\beta_0) - \phi(w,\gamma_0,\hat\lambda_\ell,\beta_0) + \phi(w,\gamma_0,\lambda_0,\beta_0)\big|F_0(dw) \overset{p}{\longrightarrow} 0,
\]
or ii) $\hat R_2 \overset{p}{\to} 0$.
As previously discussed, this condition allows for just $\hat R_2 \overset{p}{\to} 0$ in order to accommodate the weak regularity conditions of Firpo and Rothe (2017) and Newey and Robins (2017). The first result of this section shows that Assumptions 4 and 5 are sufficient for equation (7.1) when the moment functions are DR.

Lemma 12: If Assumption 1 is satisfied, with probability approaching one $\hat\gamma_\ell \in \Gamma$ and $\hat\lambda_\ell \in \Lambda$, and Assumptions 4 and 5 are satisfied, then equation (7.1) is satisfied.
An important class of DR estimators are those from equation (5.2). The following result gives conditions for asymptotic linearity of these estimators:

Theorem 13: If a) Assumptions 2 and 4 i) are satisfied, with $\hat\gamma_\ell \in \Gamma$ and $\hat\lambda_\ell \in \Lambda$ with probability approaching one; b) $\lambda_0(x)$ and $E[\{y - \gamma_0(x)\}^2|x]$ are bounded; c) for each $\ell = 1,\dots,L$, $\int[\hat\gamma_\ell(x) - \gamma_0(x)]^2F_0(dx) \overset{p}{\to} 0$, $\int[\hat\lambda_\ell(x) - \lambda_0(x)]^2F_0(dx) \overset{p}{\to} 0$, and either
\[
\sqrt{n}\left\{\int[\hat\gamma_\ell(x) - \gamma_0(x)]^2F_0(dx)\right\}^{1/2}\left\{\int[\hat\lambda_\ell(x) - \lambda_0(x)]^2F_0(dx)\right\}^{1/2} \overset{p}{\longrightarrow} 0
\]
or
\[
\frac{1}{\sqrt{n}}\sum_{i\in I_\ell}\{\hat\gamma_\ell(x_i) - \gamma_0(x_i)\}\{\hat\lambda_\ell(x_i) - \lambda_0(x_i)\} \overset{p}{\longrightarrow} 0;
\]
then
\[
\sqrt{n}(\hat\beta - \beta_0) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\big[m(w_i,\gamma_0) - \beta_0 + \lambda_0(x_i)\{y_i - \gamma_0(x_i)\}\big] + o_p(1).
\]
The conditions of this result are simple, general, and allow for machine learning first steps. Conditions a) and b) simply require mean square consistency of the first step estimators $\hat\gamma$ and $\hat\lambda$. The only convergence rate condition is c), which requires that a product of estimation errors for the two first steps go to zero faster than $1/\sqrt{n}$. This condition allows for a trade-off in convergence rates between the two first steps, and can be satisfied even when one of the two rates is not very fast. This trade-off can be important when $\lambda_0(x)$ is not continuous in one of the components of $x$, as in the surplus bound example. Discontinuity in $x$ can limit the rate at which $\lambda_0(x)$ can be estimated. This result extends the results of Chernozhukov et al. (2018) and Farrell (2015) for DR estimators of treatment effects to the whole novel class of DR estimators from equation (5.2) with machine learning first steps. In interesting related work, Athey et al. (2017) show that root-n consistent estimation of an average treatment effect is possible under very weak conditions on the propensity score, under strong sparsity of the regression function. Thus, for machine learning the conditions here and in Athey et al. (2017) are complementary, and one may prefer either depending on whether or not the regression function can be estimated extremely well based on a sparse method. The results here apply to many more DR moment conditions.
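As a concrete illustration of the conditions of Theorem 13, the following simulation sketch applies the DR moment function $m(w,\gamma) - \beta + \lambda_0(x)\{y - \gamma(x)\}$ with $m(w,\gamma) = \gamma(x)$ and $\lambda_0(x) = d/\pi_0(x)$ to estimate a mean outcome observed only when $d = 1$. The data generating process, the least squares outcome regression, and the logit propensity score are all assumptions of this toy example, not estimators from the paper:

```python
import numpy as np

def fit_logit(x, d, iters=25):
    """Logit of d on (1, x) by Newton-Raphson; returns pi_hat(x)."""
    X = np.column_stack([np.ones_like(x), x])
    b = np.zeros(2)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ b))
        b += np.linalg.solve(X.T @ ((p * (1 - p))[:, None] * X), X.T @ (d - p))
    return lambda xn: 1.0 / (1.0 + np.exp(-(b[0] + b[1] * xn)))

def fit_ols_observed(x, y, d):
    """Least squares of y on (1, x) over observed (d = 1) cases; returns gamma_hat(x)."""
    X = np.column_stack([np.ones_like(x[d == 1]), x[d == 1]])
    c = np.linalg.lstsq(X, y[d == 1], rcond=None)[0]
    return lambda xn: c[0] + c[1] * xn

def dr_mean(x, y, d, n_folds=2):
    """Cross-fit DR estimate of E[y*] when y* is observed only if d = 1."""
    folds = np.arange(len(x)) % n_folds
    scores = np.empty(len(x))
    for l in range(n_folds):
        tr, te = folds != l, folds == l
        gamma = fit_ols_observed(x[tr], y[tr], d[tr])  # outcome regression gamma_l
        pi = fit_logit(x[tr], d[tr])                   # propensity entering lambda_l
        scores[te] = gamma(x[te]) + d[te] / pi(x[te]) * (y[te] - gamma(x[te]))
    return scores.mean()

rng = np.random.default_rng(0)
n = 4000
x = rng.normal(size=n)
d = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-x))).astype(float)
y = np.where(d == 1, 1.0 + x + rng.normal(size=n), 0.0)  # true mean of y* is 1
est = dr_mean(x, y, d)
```

Either first step could converge slowly as long as the product condition in c) holds; here both are parametric, so the rate condition is satisfied trivially.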
DR moment conditions have the special feature that $\hat R_3$ and $\hat R_4$ in the decomposition (7.2) are equal to zero. For estimators that are not DR we impose that $\hat R_3$ and $\hat R_4$ converge to zero.

Assumption 6: For each $\ell = 1,\dots,L$, i) $\sqrt{n}\,\bar\psi(\hat\gamma_\ell,\lambda_0) \overset{p}{\to} 0$ and ii) $\sqrt{n}\,\bar\phi(\gamma_0,\hat\lambda_\ell) \overset{p}{\to} 0$.

Assumption 6 requires that $\hat\gamma$ converge to $\gamma_0$ rapidly enough, but places no restrictions on the convergence rate of $\hat\lambda$ when $\bar\phi(\gamma_0,\lambda) = 0$.
Lemma 14: If Assumptions 4-6 are satisfied then equation (7.1) is satisfied.
Assumptions 4-6 are based on the decomposition of LR moment functions into an identifying part and an influence function adjustment. These conditions differ from those in previous work on semiparametric estimation, as in Andrews (1994), Newey (1994a), Newey and McFadden (1994), Chen, Linton, and van Keilegom (2003), Ichimura and Lee (2010), Escanciano et al. (2016), and Chernozhukov et al. (2018), which is not based on this decomposition. The conditions extend Chernozhukov et al. (2018) to many more DR estimators and to estimators that are nonlinear in $\gamma$, while only requiring a convergence rate for $\hat\gamma$ and not for $\hat\lambda$.
This framework helps explain the potential problems with "plugging in" a first step machine learning estimator into a moment function that is not LR. Lemma 14 implies that if Assumptions 4-6 are satisfied for some $\lambda$, then the plug-in moments $\hat g(\beta_0) = \sum_{\ell=1}^{L}\sum_{i\in I_\ell}g(w_i,\hat\gamma_\ell,\beta_0)/n$ satisfy $\sqrt{n}\,\hat g(\beta_0) - \sum_{i=1}^{n}\psi(w_i,\gamma_0,\lambda_0,\beta_0)/\sqrt{n} \overset{p}{\to} 0$ if and only if
\[
\hat R_5 = \frac{1}{\sqrt{n}}\sum_{\ell=1}^{L}\sum_{i\in I_\ell}\phi(w_i,\hat\gamma_\ell,\hat\lambda_\ell,\beta_0) \overset{p}{\longrightarrow} 0. \tag{7.4}
\]
The plug-in method will fail when this equation does not hold. For example, suppose $\gamma_0(x) = E[y|x]$, so that by Proposition 4 of Newey (1994a),
\[
\frac{1}{\sqrt{n}}\sum_{\ell=1}^{L}\sum_{i\in I_\ell}\phi(w_i,\hat\gamma_\ell,\hat\lambda_\ell,\beta_0) = -\frac{1}{\sqrt{n}}\sum_{\ell=1}^{L}\sum_{i\in I_\ell}\hat\lambda_\ell(x_i)[y_i - \hat\gamma_\ell(x_i)].
\]
Here $\hat R_5 \overset{p}{\to} 0$ is an approximate orthogonality condition between the approximation $\hat\lambda_\ell(x)$ to $\lambda_0(x)$ and the nonparametric first stage residuals $y_i - \hat\gamma_\ell(x_i)$. Machine learning uses model selection in the construction of $\hat\gamma_\ell(x)$. If the model selected to approximate $\gamma_0(x)$ is not rich (or dense) enough to also approximate $\lambda_0(x)$, then $\hat\lambda_\ell(x)$ need not be approximately orthogonal to $y_i - \hat\gamma_\ell(x_i)$, and $\hat R_5$ need not converge to zero. In particular, if the variables selected to approximate $\gamma_0(x)$ cannot also be used to approximate $\lambda_0(x)$, then the approximate orthogonality condition can fail. This phenomenon helps explain the poor performance of the plug-in estimator shown in Belloni, Chernozhukov, and Hansen (2014) and Chernozhukov et al. (2017, 2018). The plug-in estimator can be root-n consistent if the only thing being selected is an overall order of approximation, as in the series estimation results of Newey (1994a). General conditions for root-n consistency of the plug-in estimator can be formulated using Assumptions 4-6 and $\hat R_5 \overset{p}{\to} 0$, which we do in Appendix D.
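The role of $\hat R_5$ can be seen in a small least squares example: OLS residuals are exactly orthogonal to the regressors included in the selected model, so the $\hat R_5$-type sum vanishes when $\hat\lambda$ is spanned by the selected variables, but not when $\lambda_0$ depends on a relevant variable the selection step dropped. The two-regressor design below is an assumed toy, not an example from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)          # relevant for y, but "dropped" by selection
y = x1 + x2 + rng.normal(size=n)

# First step: regress y on the selected variable x1 only.
b = (x1 @ y) / (x1 @ x1)
resid = y - b * x1               # y_i - gamma_hat(x_i)

# R5-type term (1/sqrt(n)) sum_i lambda(x_i) * resid_i, for two lambdas:
r5_spanned = (x1 @ resid) / np.sqrt(n)  # lambda in the selected model: vanishes
r5_omitted = (x2 @ resid) / np.sqrt(n)  # lambda depends on omitted x2: large
```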
Another component of an asymptotic normality result is convergence of the Jacobian term $\hat G = \partial\hat\psi(\bar\beta)/\partial\beta$ to $G = E[\partial\psi(w,\gamma_0,\lambda_0,\beta)/\partial\beta]|_{\beta=\beta_0}$. We impose the following condition for this purpose.

Assumption 7: $G$ exists and there are a neighborhood $N$ of $\beta_0$ and a norm $\|\cdot\|$ such that i) for each $\ell$, $\|\hat\gamma_\ell - \gamma_0\| \overset{p}{\to} 0$ and $\|\hat\lambda_\ell - \lambda_0\| \overset{p}{\to} 0$; ii) for all $\|\gamma - \gamma_0\|$ and $\|\lambda - \lambda_0\|$ small enough, $\psi(w,\gamma,\lambda,\beta)$ is differentiable in $\beta$ on $N$ with probability approaching one; iii) there are $\varepsilon_0 > 0$ and $d(w)$ with $E[d(w)] < \infty$ such that for $\beta \in N$ and $\|\gamma - \gamma_0\|$, $\|\lambda - \lambda_0\|$ small enough,
\[
\left\|\frac{\partial\psi(w,\gamma,\lambda,\beta)}{\partial\beta} - \frac{\partial\psi(w,\gamma,\lambda,\beta_0)}{\partial\beta}\right\| \leq d(w)\,\|\beta - \beta_0\|^{\varepsilon_0};
\]
iv) for each $\ell = 1,\dots,L$,
\[
\int\left\|\frac{\partial\psi(w,\hat\gamma_\ell,\hat\lambda_\ell,\beta_0)}{\partial\beta} - \frac{\partial\psi(w,\gamma_0,\lambda_0,\beta_0)}{\partial\beta}\right\|F_0(dw) \overset{p}{\longrightarrow} 0.
\]
The following intermediate result gives Jacobian convergence.

Lemma 15: If Assumption 7 is satisfied, then for any $\bar\beta \overset{p}{\to} \beta_0$, $\hat\psi(\beta)$ is differentiable at $\bar\beta$ with probability approaching one and $\partial\hat\psi(\bar\beta)/\partial\beta \overset{p}{\to} G$.
With these results in place the asymptotic normality of semiparametric GMM follows in a standard way.

Theorem 16: If Assumptions 4-7 are satisfied, $\hat\beta \overset{p}{\to} \beta_0$, $\hat W \overset{p}{\to} W$, $G'WG$ is nonsingular, and $E[\|\psi(w,\gamma_0,\lambda_0,\beta_0)\|^2] < \infty$, then for $\Omega = E[\psi(w,\gamma_0,\lambda_0,\beta_0)\psi(w,\gamma_0,\lambda_0,\beta_0)']$,
\[
\sqrt{n}(\hat\beta - \beta_0) \overset{d}{\longrightarrow} N(0,V),\qquad V = (G'WG)^{-1}G'W\Omega WG(G'WG)^{-1}.
\]
It is also useful to have a consistent estimator of the asymptotic variance of $\hat\beta$. As usual, such an estimator can be constructed as
\[
\hat V = (\hat G'\hat W\hat G)^{-1}\hat G'\hat W\hat\Omega\hat W\hat G(\hat G'\hat W\hat G)^{-1},\qquad
\hat G = \frac{\partial\hat\psi(\hat\beta)}{\partial\beta},\qquad
\hat\Omega = \frac{1}{n}\sum_{\ell=1}^{L}\sum_{i\in I_\ell}\psi(w_i,\hat\gamma_\ell,\hat\lambda_\ell,\hat\beta)\,\psi(w_i,\hat\gamma_\ell,\hat\lambda_\ell,\hat\beta)'.
\]
Note that this variance estimator ignores the estimation of $\gamma$ and $\lambda$, which works here because the moment conditions are LR. The following result gives conditions for consistency of $\hat V$.

Theorem 17: If Assumptions 4 and 7 are satisfied with $E[d(w)^2] < \infty$, $G'WG$ is nonsingular, and
\[
\int\big\|\phi(w,\hat\gamma_\ell,\hat\lambda_\ell,\beta_0) - \phi(w,\hat\gamma_\ell,\lambda_0,\beta_0) - \phi(w,\gamma_0,\hat\lambda_\ell,\beta_0) + \phi(w,\gamma_0,\lambda_0,\beta_0)\big\|^2F_0(dw) \overset{p}{\longrightarrow} 0,
\]
then $\hat\Omega \overset{p}{\to} \Omega$ and $\hat V \overset{p}{\to} V$.
In this section we have used cross-fitting and a decomposition of moment conditions into
identifying and influence adjustment components to formulate simple and general conditions for
asymptotic normality of LR GMM estimators. For reducing higher order bias and variance it
may be desirable to let the number of groups grow with the sample size. That case is beyond
the scope of this paper.
8 Appendix A: Proofs of Theorems
Proof of Theorem 1: By ii) and iii),
\[
0 = (1-\tau)\int\phi(w,\gamma_\tau,\lambda_\tau)F_0(dw) + \tau\int\phi(w,\gamma_\tau,\lambda_\tau)H(dw).
\]
Dividing by $\tau$ and solving gives
\[
\frac{1}{\tau}\int\phi(w,\gamma_\tau,\lambda_\tau)F_0(dw) = -\int\phi(w,\gamma_\tau,\lambda_\tau)H(dw) + \int\phi(w,\gamma_\tau,\lambda_\tau)F_0(dw).
\]
Taking limits as $\tau \longrightarrow 0$, $\gamma_\tau \longrightarrow \gamma_0$, $\lambda_\tau \longrightarrow \lambda_0$, and using i) gives
\[
\frac{\partial}{\partial\tau}\int\phi(w,\gamma_\tau,\lambda_\tau)F_0(dw) = -\int\phi(w,\gamma_0,\lambda_0)H(dw) + 0 = -\frac{\partial\bar g(\gamma_\tau)}{\partial\tau}.\ Q.E.D.
\]
Proof of Theorem 2: We begin by deriving $\phi_1$, the adjustment term for the first step CCP estimation. We use the definitions given in the body of the paper. We also let
\[
H_j = H_j(x_t),\qquad \bar P_1 = \Pr(y_{t1}=1),\qquad \gamma_{10}(x) = E[y_{t1}|x_{t+1}=x],
\]
\[
\lambda_{j0}(x) = E\!\left[\left.\frac{y_{tj}H_j(x_t)}{P_j(x_t)}\,\right|x_{t+1}=x\right]\quad (j = 2,\dots,J).
\]
Consider a parametric submodel as described in Section 4 and let $\gamma_1(x,\tau)$ denote the conditional expectation of $y_{t1}$ given $x_{t+1}$ under the parametric submodel. Note that for $H_j = H_j(x_t)$,
\[
\frac{\partial}{\partial\tau}E\big[H_j y_{tj}E[v(\gamma_1(x_{t+1},\tau))|y_{tj}=1,x_t]\big]
= \frac{\partial}{\partial\tau}E\!\left[\frac{H_j y_{tj}}{P_j(x_t)}v(\gamma_1(x_{t+1},\tau))\right]
\]
\[
= \frac{\partial}{\partial\tau}E\!\left[E\!\left[\left.\frac{H_j y_{tj}}{P_j(x_t)}\right|x_{t+1}\right]v(\gamma_1(x_{t+1},\tau))\right]
= \frac{\partial}{\partial\tau}E[\lambda_{j0}(x_{t+1})v(\gamma_1(x_{t+1},\tau))]
= \frac{\partial}{\partial\tau}E[\lambda_{j0}(x_t)v(\gamma_1(x_t,\tau))]
\]
\[
= E\!\left[\lambda_{j0}(x_t)v_\gamma(\gamma_{10}(x_t))\frac{\partial\gamma_1(x_t,\tau)}{\partial\tau}\right]
= E\big[\lambda_{j0}(x_t)v_\gamma(\gamma_{10}(x_t))\{y_{t1}-\gamma_{10}(x_t)\}S(w)\big],
\]
where the last (sixth) equality follows as in Proposition 4 of Newey (1994a), and the fourth equality follows by equality of the marginal distributions of $x_t$ and $x_{t+1}$. Similarly, for $\bar P_1 = \Pr(y_{t1}=1)$ and $\lambda_{10}(x) = E[y_{t1}|x_{t+1}=x]$ we have
\[
\frac{\partial}{\partial\tau}E[v(\gamma_1(x_{t+1},\tau))|y_{t1}=1]
= \frac{\partial}{\partial\tau}E[\bar P_1^{-1}y_{t1}v(\gamma_1(x_{t+1},\tau))]
= \frac{\partial}{\partial\tau}E[\bar P_1^{-1}\lambda_{10}(x_{t+1})v(\gamma_1(x_{t+1},\tau))]
\]
\[
= \frac{\partial}{\partial\tau}E[\bar P_1^{-1}\lambda_{10}(x_t)v(\gamma_1(x_t,\tau))]
= E\big[\bar P_1^{-1}\lambda_{10}(x_t)v_\gamma(\gamma_{10}(x_t))\{y_{t1}-\gamma_{10}(x_t)\}S(w)\big].
\]
Then combining terms gives
\[
\frac{\partial}{\partial\tau}E[m(w,\beta_0,\gamma_1(\cdot,\tau),\gamma_{-1,0})]
= -\sum_{j=2}^{J}\frac{\partial}{\partial\tau}\Big(E\big[H_j y_{tj}E[v(\gamma_1(x_{t+1},\tau))|y_{tj}=1,x_t]\big] - E[H_j y_{tj}]\,E[v(\gamma_1(x_{t+1},\tau))|y_{t1}=1]\Big)
\]
\[
= -\sum_{j=2}^{J}E\big[\{\lambda_{j0}(x_t) - E[H_j y_{tj}]\bar P_1^{-1}\lambda_{10}(x_t)\}v_\gamma(\gamma_{10}(x_t))\{y_{t1}-\gamma_{10}(x_t)\}S(w)\big]
= E[\phi_1(w,\beta_0,\gamma_0,\lambda_0)S(w)].
\]
Next, we show the result for $\phi_j(w)$, $2 \leq j \leq J$. As in the proof of Proposition 4 of Newey (1994a), for any $a$ we have
\[
\frac{\partial}{\partial\tau}E[a|y_{tj}=1,x_t] = E\!\left[\left.\frac{y_{tj}}{P_j(x_t)}\{a - E[a|y_{tj}=1,x_t]\}S(w)\,\right|x_t\right].
\]
It follows that
\[
\frac{\partial}{\partial\tau}E[m(w,\beta_0,\gamma_j(\cdot,\tau),\gamma_{-j,0})]
= -\frac{\partial}{\partial\tau}E\big[H_j y_{tj}E[y_{1,t+1}+\delta v_{t+1}|y_{tj}=1,x_t]\big]
\]
\[
= -E\!\left[\frac{H_j y_{tj}}{P_j(x_t)}\{y_{1,t+1}+\delta v_{t+1} - \gamma_{j0}(x_t)\}S(w)\right]
= E[\phi_j(w,\beta_0,\gamma_0,\lambda_0)S(w)],
\]
showing that the formula for $\phi_j$ is correct. The proof for $\phi_{J+1}$ follows similarly. Q.E.D.
Proof of Theorem 3: Given in text.
Proof of Theorem 4: Given in text.
Proof of Theorem 5: Let $\bar g(\gamma,\lambda) = E[g(w,\gamma,\lambda,\beta_0)]$. Suppose that $g(w,\gamma,\lambda,\beta)$ is DR. Then for any $\gamma \neq \gamma_0$, $\gamma \in \Gamma$, we have
\[
0 = \bar g(\gamma,\lambda_0) = \bar g(\gamma_0,\lambda_0) = \bar g((1-\tau)\gamma_0+\tau\gamma,\lambda_0)
\]
for any $\tau$. Therefore, for any $\tau$,
\[
\bar g((1-\tau)\gamma_0+\tau\gamma,\lambda_0) = 0 = (1-\tau)\bar g(\gamma_0,\lambda_0) + \tau\bar g(\gamma,\lambda_0),
\]
so that $\bar g(\gamma,\lambda_0)$ is affine in $\gamma$. Also, by the previous equation $\bar g((1-\tau)\gamma_0+\tau\gamma,\lambda_0) = 0$ identically in $\tau$, so that
\[
\frac{\partial}{\partial\tau}\bar g((1-\tau)\gamma_0+\tau\gamma,\lambda_0) = 0,
\]
where the derivative with respect to $\tau$ is evaluated at $\tau = 0$. Applying the same argument with the roles of $\gamma$ and $\lambda$ switched, we find that $\bar g(\gamma_0,\lambda)$ is affine in $\lambda$ and $\partial\bar g(\gamma_0,(1-\tau)\lambda_0+\tau\lambda)/\partial\tau = 0$.

Next, suppose that $\bar g(\gamma,\lambda_0)$ is affine in $\gamma$ and $\partial\bar g((1-\tau)\gamma_0+\tau\gamma,\lambda_0)/\partial\tau = 0$. Then by $\bar g(\gamma_0,\lambda_0) = 0$, for any $\gamma \in \Gamma$,
\[
\bar g(\gamma,\lambda_0) = \frac{\partial}{\partial\tau}\big[\tau\bar g(\gamma,\lambda_0)\big] = \frac{\partial}{\partial\tau}\big[(1-\tau)\bar g(\gamma_0,\lambda_0) + \tau\bar g(\gamma,\lambda_0)\big] = \frac{\partial}{\partial\tau}\bar g((1-\tau)\gamma_0+\tau\gamma,\lambda_0) = 0.
\]
Switching the roles of $\gamma$ and $\lambda$, it follows analogously that $\bar g(\gamma_0,\lambda) = 0$ for all $\lambda \in \Lambda$, so $g(w,\gamma,\lambda,\beta)$ is doubly robust. Q.E.D.
Proof of Theorem 6: Let $\lambda_0(x) = -v_0'\Pi^{-1}p(x)$, so that $E[\lambda_0(x)p(x)'] = -v_0'\Pi^{-1}\Pi = -v_0'$. Then integration by parts gives
\[
E[\phi(w,\gamma_0,\lambda_0)] = E[\lambda_0(x)\{y-\gamma_0(x)\}] = -v_0'\Pi^{-1}E[p(x)\{y-\gamma_0(x)\}] = 0.\ Q.E.D.
\]
Proof of Theorem 7: If $\gamma_0$ is identified, then $\bar g(\beta) = E[g(w,\gamma_0,\lambda,\beta)]$ is identified for every $\beta$. By double robustness,
\[
\bar g(\beta) = 0
\]
at $\beta = \beta_0$, and by assumption this is the only $\beta$ where this equation is satisfied. Q.E.D.
Proof of Corollary 8: Given in text.
Proof of Theorem 9: Note that for $\rho = \rho(w,\gamma_0,\beta_0)$,
\[
\bar\psi(\gamma_0,(1-\tau)\lambda_0+\tau\lambda) = (1-\tau)E[\lambda_0(z)\rho] + \tau E[\lambda(z)\rho] = 0. \tag{8.1}
\]
Differentiating gives the second equality in eq. (2.7). Also, for $\Delta = \gamma - \gamma_0$,
\[
\frac{\partial}{\partial\tau}\bar\psi((1-\tau)\gamma_0+\tau\gamma,\lambda_0) = E[\lambda_0(z)\rho_\gamma(w,\Delta)] = 0,
\]
giving the first equality in eq. (2.7). Q.E.D.

Proof of Theorem 10: The first equality in eq. (8.1) of the proof of Theorem 9 shows that $\bar\psi(\gamma_0,\lambda)$ is affine in $\lambda$. Also,
\[
\bar\psi((1-\tau)\gamma_0+\tau\gamma,\lambda_0) = E[\lambda_0(z)\{(1-\tau)\rho(w,\gamma_0,\beta_0)+\tau\rho(w,\gamma,\beta_0)\}] = (1-\tau)\bar\psi(\gamma_0,\lambda_0) + \tau\bar\psi(\gamma,\lambda_0),
\]
so that $\bar\psi(\gamma,\lambda_0)$ is affine in $\gamma$. The conclusion then follows by Theorem 5. Q.E.D.
Proof of Theorem 11: To see that $b^*(x) = \Sigma^*_\beta(x)'\Sigma^*(x)^{-1}$ minimizes the asymptotic variance, note that for any orthogonal instrumental variable matrix $b(x)$, by the rows of $E[\partial\rho(w,\gamma_0,\beta_0)/\partial\beta\,|x] - \Sigma^*_\beta(x)$ being in $\Lambda$,
\[
G_b = E\!\left[b(x)\frac{\partial\rho(w,\gamma_0,\beta_0)}{\partial\beta}\right] = E[b(x)\Sigma^*_\beta(x)] = E[b(x)\Sigma^*(x)\Sigma^*(x)^{-1}\Sigma^*_\beta(x)] = E[b(x)\varepsilon^*\varepsilon^{*\prime}b^*(x)'].
\]
Since the instruments are orthogonal, the asymptotic variance matrix of the GMM estimator with $\hat W \overset{p}{\to} W$ is the same as if $\hat\gamma = \gamma_0$. Define $U = G_b'Wb(x)\varepsilon^*$ and $U^* = b^*(x)\varepsilon^* = \Sigma^*_\beta(x)'\Sigma^*(x)^{-1}\varepsilon^*$. The asymptotic variance of the GMM estimator with orthogonal instruments $b(x)$ is
\[
(G_b'WG_b)^{-1}G_b'W E[b(x)\varepsilon^*\varepsilon^{*\prime}b(x)']WG_b(G_b'WG_b)^{-1} = (E[UU^{*\prime}])^{-1}E[UU'](E[U^*U'])^{-1}.
\]
The fact that this matrix is minimized in the positive semidefinite sense at $U = U^*$ is well known, e.g., see Newey and McFadden (1994). Q.E.D.
The following result is useful for the results of Section 7:

Lemma A1: If Assumption 4 is satisfied then $\hat R_1 \overset{p}{\to} 0$. If Assumption 5 is satisfied then $\hat R_2 \overset{p}{\to} 0$.
Proof: Define $\hat\Delta_{i\ell} = g(w_i,\hat\gamma_\ell,\beta_0) - g(w_i,\gamma_0,\beta_0) - \bar g(\hat\gamma_\ell)$ for $i \in I_\ell$, and let $W_{\ell^c}$ denote the observations $w_i$ for $i \notin I_\ell$. Note that $\hat\gamma_\ell$ depends only on $W_{\ell^c}$. By construction and independence of $W_{\ell^c}$ and $w_i$, $i \in I_\ell$, we have $E[\hat\Delta_{i\ell}|W_{\ell^c}] = 0$. Also, by independence of the observations, $E[\hat\Delta_{i\ell}\hat\Delta_{j\ell}|W_{\ell^c}] = 0$ for $i \neq j$, $i,j \in I_\ell$. Furthermore, for $i \in I_\ell$, $E[\hat\Delta_{i\ell}^2|W_{\ell^c}] \leq \int[g(w,\hat\gamma_\ell,\beta_0) - g(w,\gamma_0,\beta_0)]^2F_0(dw)$. Then we have
\[
E\Bigg[\bigg(\frac{1}{\sqrt{n}}\sum_{i\in I_\ell}\hat\Delta_{i\ell}\bigg)^2\Bigg|W_{\ell^c}\Bigg] = \frac{1}{n}E\Bigg[\bigg(\sum_{i\in I_\ell}\hat\Delta_{i\ell}\bigg)^2\Bigg|W_{\ell^c}\Bigg] = \frac{1}{n}\sum_{i\in I_\ell}E[\hat\Delta_{i\ell}^2|W_{\ell^c}]
\leq \int[g(w,\hat\gamma_\ell,\beta_0) - g(w,\gamma_0,\beta_0)]^2F_0(dw) \overset{p}{\longrightarrow} 0.
\]
The conditional Markov inequality then implies that $\sum_{i\in I_\ell}\hat\Delta_{i\ell}/\sqrt{n} \overset{p}{\to} 0$. The analogous results also hold for $\hat\Delta_{i\ell} = \phi(w_i,\hat\gamma_\ell,\lambda_0,\beta_0) - \phi(w_i,\gamma_0,\lambda_0,\beta_0) - \bar\phi(\hat\gamma_\ell,\lambda_0)$ and $\hat\Delta_{i\ell} = \phi(w_i,\gamma_0,\hat\lambda_\ell,\beta_0) - \phi(w_i,\gamma_0,\lambda_0,\beta_0) - \bar\phi(\gamma_0,\hat\lambda_\ell)$. Summing across these three terms and across $\ell = 1,\dots,L$ gives the first conclusion.

For the second conclusion, note that under the first hypothesis of Assumption 5,
\[
E\Bigg[\bigg|\frac{1}{\sqrt{n}}\sum_{i\in I_\ell}\big[\phi(w_i,\hat\gamma_\ell,\hat\lambda_\ell,\beta_0) - \phi(w_i,\hat\gamma_\ell,\lambda_0,\beta_0) - \phi(w_i,\gamma_0,\hat\lambda_\ell,\beta_0) + \phi(w_i,\gamma_0,\lambda_0,\beta_0)\big]\bigg|\,\Bigg|W_{\ell^c}\Bigg]
\]
\[
\leq \frac{1}{\sqrt{n}}\sum_{i\in I_\ell}E\big[\big|\phi(w_i,\hat\gamma_\ell,\hat\lambda_\ell,\beta_0) - \phi(w_i,\hat\gamma_\ell,\lambda_0,\beta_0) - \phi(w_i,\gamma_0,\hat\lambda_\ell,\beta_0) + \phi(w_i,\gamma_0,\lambda_0,\beta_0)\big|\,\big|W_{\ell^c}\big]
\]
\[
\leq \sqrt{n}\int\big|\phi(w,\hat\gamma_\ell,\hat\lambda_\ell,\beta_0) - \phi(w,\hat\gamma_\ell,\lambda_0,\beta_0) - \phi(w,\gamma_0,\hat\lambda_\ell,\beta_0) + \phi(w,\gamma_0,\lambda_0,\beta_0)\big|F_0(dw) \overset{p}{\longrightarrow} 0,
\]
so $\hat R_2 \overset{p}{\to} 0$ follows by the conditional Markov and triangle inequalities. The second hypothesis of Assumption 5 is just $\hat R_2 \overset{p}{\to} 0$. Q.E.D.
Proof of Lemma 12: By Assumption 1 and the hypotheses that $\hat\gamma_\ell \in \Gamma$ and $\hat\lambda_\ell \in \Lambda$, we have $\hat R_3 = \hat R_4 = 0$. By Lemma A1 we have $\hat R_1 \overset{p}{\to} 0$ and $\hat R_2 \overset{p}{\to} 0$. The conclusion then follows by the triangle inequality. Q.E.D.
Proof of Theorem 13: Note that for $\varepsilon = y - \gamma_0(x)$,
\[
\phi(w,\hat\gamma_\ell,\lambda_0,\beta_0) - \phi(w,\gamma_0,\lambda_0,\beta_0) = -\lambda_0(x)[\hat\gamma_\ell(x) - \gamma_0(x)],
\]
\[
\phi(w,\gamma_0,\hat\lambda_\ell,\beta_0) - \phi(w,\gamma_0,\lambda_0,\beta_0) = \varepsilon[\hat\lambda_\ell(x) - \lambda_0(x)],
\]
\[
\phi(w,\hat\gamma_\ell,\hat\lambda_\ell,\beta_0) - \phi(w,\hat\gamma_\ell,\lambda_0,\beta_0) - \phi(w,\gamma_0,\hat\lambda_\ell,\beta_0) + \phi(w,\gamma_0,\lambda_0,\beta_0) = -[\hat\lambda_\ell(x) - \lambda_0(x)][\hat\gamma_\ell(x) - \gamma_0(x)].
\]
The first part of Assumption 4 ii) then follows by
\[
\int[\phi(w,\hat\gamma_\ell,\lambda_0,\beta_0) - \phi(w,\gamma_0,\lambda_0,\beta_0)]^2F_0(dw) = \int\lambda_0(x)^2[\hat\gamma_\ell(x) - \gamma_0(x)]^2F_0(dx)
\leq C\int[\hat\gamma_\ell(x) - \gamma_0(x)]^2F_0(dx) \overset{p}{\longrightarrow} 0,
\]
using boundedness of $\lambda_0(x)$. The second part of Assumption 4 ii) follows by
\[
\int[\phi(w,\gamma_0,\hat\lambda_\ell,\beta_0) - \phi(w,\gamma_0,\lambda_0,\beta_0)]^2F_0(dw) = \int[\hat\lambda_\ell(x) - \lambda_0(x)]^2\varepsilon^2F_0(dw)
= \int[\hat\lambda_\ell(x) - \lambda_0(x)]^2E[\varepsilon^2|x]F_0(dx)
\leq C\int[\hat\lambda_\ell(x) - \lambda_0(x)]^2F_0(dx) \overset{p}{\longrightarrow} 0,
\]
using boundedness of $E[\varepsilon^2|x]$. Next, note that by the Cauchy-Schwartz inequality,
\[
\sqrt{n}\int\big|\phi(w,\hat\gamma_\ell,\hat\lambda_\ell,\beta_0) - \phi(w,\hat\gamma_\ell,\lambda_0,\beta_0) - \phi(w,\gamma_0,\hat\lambda_\ell,\beta_0) + \phi(w,\gamma_0,\lambda_0,\beta_0)\big|F_0(dw)
= \sqrt{n}\int\big|[\hat\lambda_\ell(x) - \lambda_0(x)][\hat\gamma_\ell(x) - \gamma_0(x)]\big|F_0(dx)
\]
\[
\leq \sqrt{n}\left\{\int[\hat\lambda_\ell(x) - \lambda_0(x)]^2F_0(dx)\right\}^{1/2}\left\{\int[\hat\gamma_\ell(x) - \gamma_0(x)]^2F_0(dx)\right\}^{1/2}.
\]
Then the first rate condition of Assumption 5 holds under the first rate condition of Theorem 13, while the second condition of Assumption 5 holds under the last hypothesis of Theorem 13. Then eq. (7.1) holds by Lemma 12, and the conclusion follows by rearranging the terms in eq. (7.1). Q.E.D.
Proof of Lemma 14: Follows by Lemma A1 and the triangle inequality. Q.E.D.
Proof of Lemma 15: Let $\psi_\beta(w,\gamma,\lambda,\beta) = \partial\psi(w,\gamma,\lambda,\beta)/\partial\beta$ when it exists, $\hat G_\ell = n_\ell^{-1}\sum_{i\in I_\ell}\psi_\beta(w_i,\hat\gamma_\ell,\hat\lambda_\ell,\beta_0)$, and $G_\ell = n_\ell^{-1}\sum_{i\in I_\ell}\psi_\beta(w_i,\gamma_0,\lambda_0,\beta_0)$. By the law of large numbers, $\sum_{\ell=1}^{L}(n_\ell/n)G_\ell \overset{p}{\to} G$. Also, by Assumption 7 iv), for each $\ell$,
\[
E\big[\|\hat G_\ell - G_\ell\|\,\big|\,W_{\ell^c}\big] \leq \int\big\|\psi_\beta(w,\hat\gamma_\ell,\hat\lambda_\ell,\beta_0) - \psi_\beta(w,\gamma_0,\lambda_0,\beta_0)\big\|F_0(dw) \overset{p}{\longrightarrow} 0.
\]
Then by the conditional Markov inequality, for each $\ell$,
\[
\hat G_\ell - G_\ell \overset{p}{\longrightarrow} 0.
\]
It follows by the triangle inequality that $\sum_{\ell=1}^{L}(n_\ell/n)\hat G_\ell \overset{p}{\to} G$. Also, with probability approaching one we have for any $\bar\beta \overset{p}{\to} \beta_0$, by Assumption 7 iii),
\[
\left\|\frac{\partial\hat\psi(\bar\beta)}{\partial\beta} - \sum_{\ell=1}^{L}\frac{n_\ell}{n}\hat G_\ell\right\| \leq \left(\frac{1}{n}\sum_{i=1}^{n}d(w_i)\right)\|\bar\beta - \beta_0\|^{\varepsilon_0} = O_p(1)o_p(1) \overset{p}{\longrightarrow} 0.
\]
The conclusion then follows by the triangle inequality. Q.E.D.
Proof of Theorem 16: The conclusion follows in a standard manner from the conclusions
of Lemmas 14 and 15. Q.E.D.
Proof of Theorem 17: Let $\hat\psi_i = \psi(w_i,\hat\gamma_\ell,\hat\lambda_\ell,\hat\beta)$ and $\psi_i = \psi(w_i,\gamma_0,\lambda_0,\beta_0)$ for $i \in I_\ell$. By standard arguments (e.g. Newey, 1994b), it suffices to show that $\frac{1}{n}\sum_{i=1}^{n}\|\hat\psi_i - \psi_i\|^2 \overset{p}{\to} 0$. Note that
\[
\hat\psi_i - \psi_i = \sum_{j=1}^{5}\hat\Delta_{ij},\qquad
\hat\Delta_{i1} = \psi(w_i,\hat\gamma_\ell,\hat\lambda_\ell,\hat\beta) - \psi(w_i,\hat\gamma_\ell,\hat\lambda_\ell,\beta_0),\qquad
\hat\Delta_{i2} = g(w_i,\hat\gamma_\ell,\beta_0) - g(w_i,\gamma_0,\beta_0),
\]
\[
\hat\Delta_{i3} = \phi(w_i,\hat\gamma_\ell,\lambda_0,\beta_0) - \phi(w_i,\gamma_0,\lambda_0,\beta_0),\qquad
\hat\Delta_{i4} = \phi(w_i,\gamma_0,\hat\lambda_\ell,\beta_0) - \phi(w_i,\gamma_0,\lambda_0,\beta_0),
\]
\[
\hat\Delta_{i5} = \phi(w_i,\hat\gamma_\ell,\hat\lambda_\ell,\beta_0) - \phi(w_i,\hat\gamma_\ell,\lambda_0,\beta_0) - \phi(w_i,\gamma_0,\hat\lambda_\ell,\beta_0) + \phi(w_i,\gamma_0,\lambda_0,\beta_0).
\]
By standard arguments it suffices to show that for each $\ell$ and $j$,
\[
\frac{1}{n}\sum_{i\in I_\ell}\|\hat\Delta_{ij}\|^2 \overset{p}{\longrightarrow} 0. \tag{8.2}
\]
For $j = 1$ it follows by a mean value expansion and Assumption 7 with $E[d(w)^2] < \infty$ that
\[
\frac{1}{n}\sum_{i\in I_\ell}\|\hat\Delta_{i1}\|^2 = \frac{1}{n}\sum_{i\in I_\ell}\big\|\psi_\beta(w_i,\hat\gamma_\ell,\hat\lambda_\ell,\bar\beta_i)(\hat\beta - \beta_0)\big\|^2 \leq \left(\frac{1}{n}\sum_{i\in I_\ell}d(w_i)^2\right)\|\hat\beta - \beta_0\|^2 \overset{p}{\longrightarrow} 0,
\]
where $\bar\beta_i$ is a mean value that actually differs from row to row of $\psi_\beta(w_i,\hat\gamma_\ell,\hat\lambda_\ell,\bar\beta_i)$. For $j = 2$, note that by Assumption 4,
\[
E\Big[\frac{1}{n}\sum_{i\in I_\ell}\|\hat\Delta_{i2}\|^2\,\Big|\,W_{\ell^c}\Big] \leq \int\|g(w,\hat\gamma_\ell,\beta_0) - g(w,\gamma_0,\beta_0)\|^2F_0(dw) \overset{p}{\longrightarrow} 0,
\]
so eq. (8.2) holds by the conditional Markov inequality. For $j = 3$ and $j = 4$, eq. (8.2) follows similarly. For $j = 5$, it follows from the hypotheses of Theorem 17 that
\[
E\Big[\frac{1}{n}\sum_{i\in I_\ell}\|\hat\Delta_{i5}\|^2\,\Big|\,W_{\ell^c}\Big] \leq \int\big\|\phi(w,\hat\gamma_\ell,\hat\lambda_\ell,\beta_0) - \phi(w,\hat\gamma_\ell,\lambda_0,\beta_0) - \phi(w,\gamma_0,\hat\lambda_\ell,\beta_0) + \phi(w,\gamma_0,\lambda_0,\beta_0)\big\|^2F_0(dw) \overset{p}{\longrightarrow} 0.
\]
Then eq. (8.2) holds for $j = 5$ by the conditional Markov inequality. Q.E.D.
9 Appendix B: Local Robustness and Derivatives of Expected Moments
In this Appendix we give conditions sufficient for the LR property of equation (2.5) to imply
the properties in equations (2.7) and (2.8). As discussed following equation (2.8), it may be
convenient when specifying regularity conditions for specific moment functions to work directly
with (2.7) and/or (2.8).
Assumption B1: There are linear sets $\Gamma$ and $\Lambda$ and a set $\mathcal{G}$ such that i) $\bar\psi(\gamma,\lambda)$ is Frechet differentiable at $(\gamma_0,\lambda_0)$; ii) for all $H \in \mathcal{G}$, the vector $(\gamma(\tau),\lambda(\tau))$ is Frechet differentiable at $\tau = 0$; iii) the closure of $\{(\partial\gamma(\tau)/\partial\tau,\partial\lambda(\tau)/\partial\tau)|_{\tau=0} : H \in \mathcal{G}\}$ is $\Gamma \times \Lambda$.
Theorem B1: If Assumption B1 is satisfied and equation (2.5) is satisfied for all $H \in \mathcal{G}$, then equation (2.7) is satisfied.
Proof: Let $\bar\psi'(\Delta_\gamma,\Delta_\lambda)$ denote the Frechet derivative of $\bar\psi(\gamma,\lambda)$ at $(\gamma_0,\lambda_0)$ in the direction $(\Delta_\gamma,\Delta_\lambda)$, which exists by i). By ii), the chain rule for Frechet derivatives (e.g. Proposition 7.3.1 of Luenberger, 1969), and eq. (2.5), it follows that for $(\Delta_\gamma^H,\Delta_\lambda^H) = (\partial\gamma(\tau)/\partial\tau,\partial\lambda(\tau)/\partial\tau)|_{\tau=0}$,
\[
\bar\psi'(\Delta_\gamma^H,\Delta_\lambda^H) = \frac{\partial}{\partial\tau}\bar\psi(\gamma(\tau),\lambda(\tau))\Big|_{\tau=0} = 0.
\]
By $\bar\psi'(\Delta_\gamma,\Delta_\lambda)$ being a continuous linear function and iii), it follows that $\bar\psi'(\Delta_\gamma,\Delta_\lambda) = 0$ for all $(\Delta_\gamma,\Delta_\lambda) \in \Gamma \times \Lambda$. Therefore, for any $\gamma \in \Gamma$ and $\lambda \in \Lambda$,
\[
\bar\psi'(\gamma - \gamma_0, 0) = 0,\qquad \bar\psi'(0, \lambda - \lambda_0) = 0.
\]
Equation (2.7) then follows by i). Q.E.D.
Theorem B2: If equation (2.7) is satisfied and, in addition, $\bar\psi(\gamma,\lambda_0)$ and $\bar\psi(\gamma_0,\lambda)$ are twice Frechet differentiable in open sets containing $\gamma_0$ and $\lambda_0$, respectively, with bounded second derivatives, then equation (2.8) is satisfied.
Proof: Follows by Proposition 7.3.3 of Luenberger (1969). Q.E.D.
10 Appendix C: Doubly Robust Moment Functions for
Orthogonality Conditions
In this Appendix we generalize the DR estimators for conditional moment restrictions to orthogonality conditions for a general residual $\rho(w,\gamma)$ that is affine in $\gamma$ but need not have the form $y - \gamma(x)$.
Assumption C1: There are linear sets $\Gamma$ and $\Lambda$ of functions $\gamma(x)$ and $\lambda(z)$ that are closed in mean square such that i) for any $\gamma,\tilde\gamma \in \Gamma$ and scalar $\tau$, $E[\rho(w,\gamma)^2] < \infty$ and $\rho(w,(1-\tau)\gamma+\tau\tilde\gamma) = (1-\tau)\rho(w,\gamma) + \tau\rho(w,\tilde\gamma)$; ii) $E[\lambda(z)\rho(w,\gamma_0)] = 0$ for all $\lambda \in \Lambda$; iii) there exists $\lambda_0 \in \Lambda$ such that $E[g(w,\gamma,\beta_0)] = -E[\lambda_0(z)\rho(w,\gamma)]$ for all $\gamma \in \Gamma$.
Assumption C1 ii) could be thought of as an identification condition for $\gamma_0$. For example, if $\Lambda$ is all functions of $z$ with finite mean square, then ii) is $E[\rho(w,\gamma_0)|z] = 0$, the nonparametric conditional moment restriction of Newey and Powell (1989, 2003). Assumption C1 iii) also has an interesting interpretation. Let $\Pi(a)(z)$ denote the orthogonal mean-square projection of a random variable $a(w)$ with finite second moment on $\Lambda$. Then by ii) and iii) we have, for all $\gamma \in \Gamma$,
\[
E[g(w,\gamma,\beta_0)] = -E[\lambda_0(z)\rho(w,\gamma)] = -E[\lambda_0(z)\Pi(\rho(\cdot,\gamma))(z)]
= -E[\lambda_0(z)\{\Pi(\rho(\cdot,\gamma))(z) - \Pi(\rho(\cdot,\gamma_0))(z)\}]
= -E[\lambda_0(z)\Pi(\rho(\cdot,\gamma) - \rho(\cdot,\gamma_0))(z)].
\]
Here we see that $E[g(w,\gamma,\beta_0)]$ is a linear, mean-square continuous function of $\Pi(\rho(\cdot,\gamma) - \rho(\cdot,\gamma_0))(z)$. The Riesz representation theorem will also imply that if $E[g(w,\gamma,\beta_0)]$ is a linear, mean-square continuous function of $\Pi(\rho(\cdot,\gamma) - \rho(\cdot,\gamma_0))(z)$, then there exists $\lambda_0(z)$ satisfying Assumption C1 iii). For the case where $z = x$ this mean-square continuity condition is necessary for the existence of a root-n consistent estimator, as in Newey (1994a) and Newey and McFadden (1994). We conjecture that when $z$ need not equal $x$ this condition generalizes Severini and Tripathi's (2012) necessary condition for existence of a root-n consistent estimator of $\beta_0$.
Noting that Assumption C1 ii) and iii) are the conditions for double robustness, we have

Theorem C1: If Assumption C1 is satisfied, then $\psi(w,\gamma,\lambda,\beta) = g(w,\gamma,\beta) + \lambda(z)\rho(w,\gamma,\beta)$ is doubly robust.
It is interesting to note that $\lambda_0(z)$ satisfying Assumption C1 iii) need not be unique. When the closure of $\{\Pi(\rho(\cdot,\gamma))(z) : \gamma \in \Gamma\}$ is not all of $\Lambda$, there will exist $\tilde\lambda \in \Lambda$ such that $\tilde\lambda \neq 0$ and
\[
E[\tilde\lambda(z)\rho(w,\gamma)] = E[\tilde\lambda(z)\Pi(\rho(\cdot,\gamma))(z)] = 0\ \text{for all}\ \gamma \in \Gamma.
\]
In that case Assumption C1 iii) will also be satisfied for $\lambda_0(z) + \tilde\lambda(z)$. We can think of this case as one where $\gamma_0$ is overidentified, similarly to Chen and Santos (2015). As discussed in Ichimura and Newey (2017), the different $\lambda_0(z)$ would correspond to different first step estimators.
The partial robustness results of the last section can be extended to the orthogonality condition setting of Assumption C1. Let $\Lambda^*$ be a closed linear subset of $\Lambda$, such as a finite dimensional linear set, and let $\gamma^*$ be such that $E[\lambda(z)\rho(w,\gamma^*)] = 0$ for all $\lambda \in \Lambda^*$. Note that if $\lambda_0 \in \Lambda^*$ it follows by Theorem C1 that
\[
E[g(w,\gamma^*,\beta_0)] = -E[\lambda_0(z)\rho(w,\gamma^*)] = 0.
\]
Theorem C2: If $\Lambda^*$ is a closed linear subset of $\Lambda$, $E[\lambda(z)\rho(w,\gamma^*)] = 0$ for all $\lambda \in \Lambda^*$, and Assumption C1 iii) is satisfied with $\lambda_0 \in \Lambda^*$, then
\[
E[g(w,\gamma^*,\beta_0)] = 0.
\]
11 Appendix D: Regularity Conditions for Plug-in Estimators
In this Appendix we formulate regularity conditions for root-n consistency and asymptotic normality of the plug-in estimator $\tilde\beta$ described in Section 2, where $g(w,\gamma,\beta)$ need not be LR. These conditions are based on Assumptions 4-6 applied to the influence adjustment $\phi(w,\gamma,\lambda,\beta)$ corresponding to $g(w,\gamma,\beta)$ and $\hat\gamma$. For this purpose we treat $\hat\lambda$ as any object that can approximate $\lambda_0$, not just as an estimator of $\lambda_0$.

Theorem D1: If Assumptions 4-6 are satisfied, Assumption 7 is satisfied with $g(w,\gamma,\beta)$ replacing $\psi(w,\gamma,\lambda,\beta)$, $\tilde\beta \overset{p}{\to} \beta_0$, $\hat W \overset{p}{\to} W$, $G'WG$ is nonsingular, $E[\|\psi(w,\gamma_0,\lambda_0,\beta_0)\|^2] < \infty$, and
\[
\hat R_5 = \frac{1}{\sqrt{n}}\sum_{\ell=1}^{L}\sum_{i\in I_\ell}\phi(w_i,\hat\gamma_\ell,\hat\lambda_\ell,\beta_0) \overset{p}{\longrightarrow} 0,
\]
then for $\Omega = E[\psi(w,\gamma_0,\lambda_0,\beta_0)\psi(w,\gamma_0,\lambda_0,\beta_0)']$,
\[
\sqrt{n}(\tilde\beta - \beta_0) \overset{d}{\longrightarrow} N(0,V),\qquad V = (G'WG)^{-1}G'W\Omega WG(G'WG)^{-1}.
\]
The condition $\hat R_5 \overset{p}{\to} 0$ was discussed in Section 7. It is interesting to note that $\hat R_5 \overset{p}{\to} 0$ appears to be a complicated condition that seems to depend on the details of the estimator $\hat\gamma$ in a way that Assumptions 4-7 do not. In this way the regularity conditions for the LR estimator seem to be simpler and more general than those for the plug-in estimator.
Acknowledgements
Whitney Newey gratefully acknowledges support by the NSF. Helpful comments were pro-
vided by M. Cattaneo, B. Deaner, J. Hahn, M. Jansson, Z. Liao, A. Pakes, R. Moon, A. de
Paula, V. Semenova, and participants in seminars at Cambridge, Columbia, Cornell, Harvard-
MIT, UCL, USC, Yale, and Xiamen. B. Deaner provided capable research assistance.
REFERENCES
Ackerberg, D., X. Chen, and J. Hahn (2012): "A Practical Asymptotic Variance Estimator
for Two-step Semiparametric Estimators," The Review of Economics and Statistics 94: 481—498.
Ackerberg, D., X. Chen, J. Hahn, and Z. Liao (2014): "Asymptotic Efficiency of Semipara-
metric Two-Step GMM," The Review of Economic Studies 81: 919—943.
Ai, C. and X. Chen (2003): “Efficient Estimation of Models with Conditional Moment Restric-
tions Containing Unknown Functions,” Econometrica 71, 1795-1843.
Ai, C. and X. Chen (2007): "Estimation of Possibly Misspecified Semiparametric Conditional
Moment Restriction Models with Different Conditioning Variables," Journal of Econometrics
141, 5—43.
Ai, C. and X. Chen (2012): "The Semiparametric Efficiency Bound for Models of Sequential
Moment Restrictions Containing Unknown Functions," Journal of Econometrics 170, 442—457.
Andrews, D.W.K. (1994): “Asymptotics for Semiparametric Models via Stochastic Equiconti-
nuity,” Econometrica 62, 43-72.
Athey, S., G. Imbens, and S. Wager (2017): "Efficient Inference of Average Treatment Ef-
fects in High Dimensions via Approximate Residual Balancing," Journal of the Royal Statistical
Society, Series B, forthcoming.
Bajari, P., V. Chernozhukov, H. Hong, and D. Nekipelov (2009): "Nonparametric and
Semiparametric Analysis of a Dynamic Discrete Game," working paper, Stanford.
Bajari, P., H. Hong, J. Krainer, and D. Nekipelov (2010): "Estimating Static Models of
Strategic Interactions," Journal of Business and Economic Statistics 28, 469-482.
Bang, H., and J.M. Robins (2005): "Doubly Robust Estimation in Missing Data and Causal Inference Models," Biometrics 61, 962-972.
Belloni, A., D. Chen, V. Chernozhukov, and C. Hansen (2012): “Sparse Models and
Methods for Optimal Instruments with an Application to Eminent Domain,” Econometrica 80,
2369—2429.
Belloni, A., V. Chernozhukov, and Y. Wei (2013): “Honest Confidence Regions for Logistic
Regression with a Large Number of Controls,” arXiv preprint arXiv:1304.3969.
Belloni, A., V. Chernozhukov, and C. Hansen (2014): "Inference on Treatment Effects
after Selection among High-Dimensional Controls," The Review of Economic Studies 81, 608—
650.
Belloni, A., V. Chernozhukov, I. Fernandez-Val, and C. Hansen (2016): "Program
Evaluation and Causal Inference with High-Dimensional Data," Econometrica 85, 233-298.
Bera, A.K., G. Montes-Rojas, and W. Sosa-Escudero (2010): "General Specification
Testing with Locally Misspecified Models," Econometric Theory 26, 1838—1845.
Bickel, P.J. (1982): "On Adaptive Estimation," Annals of Statistics 10, 647-671.
Bickel, P.J. and Y. Ritov (1988): "Estimating Integrated Squared Density Derivatives: Sharp Best Order of Convergence Estimates," Sankhya: The Indian Journal of Statistics, Series A 50, 381-393.
Bickel, P.J., C.A.J. Klaassen, Y. Ritov, and J.A. Wellner (1993): Efficient and Adaptive
Estimation for Semiparametric Models, Springer-Verlag, New York.
Bickel, P.J. and Y. Ritov (2003): "Nonparametric Estimators Which Can Be "Plugged-in,"
Annals of Statistics 31, 1033-1053.
Bonhomme, S., and M. Weidner (2018): "Minimizing Sensitivity to Misspecification," working
paper.
Cattaneo, M.D., and M. Jansson (2017): "Kernel-Based Semiparametric Estimators: Small
Bandwidth Asymptotics and Bootstrap Consistency," Econometrica, forthcoming.
Cattaneo, M.D., M. Jansson, and X. Ma (2017): "Two-step Estimation and Inference with
Possibly Many Included Covariates," working paper.
Chamberlain, G. (1987): "Asymptotic Efficiency in Estimation with Conditional Moment Restrictions," Journal of Econometrics 34, 305-334.
Chamberlain, G. (1992): “Efficiency Bounds for Semiparametric Regression,” Econometrica 60,
567—596.
Chen, X. and X. Shen (1997): “Sieve Extremum Estimates for Weakly Dependent Data,”
Econometrica 66, 289-314.
Chen, X., O.B. Linton, and I. van Keilegom (2003): “Estimation of Semiparametric Models
when the Criterion Function Is Not Smooth,” Econometrica 71, 1591-1608.
Chen, X., and Z. Liao (2015): "Sieve Semiparametric Two-Step GMM Under Weak Depen-
dence", Journal of Econometrics 189, 163—186.
Chen, X., and A. Santos (2015): “Overidentification in Regular Models,” working paper.
Chernozhukov, V., C. Hansen, and M. Spindler (2015): "Valid Post-Selection and Post-
Regularization Inference: An Elementary, General Approach," Annual Review of Economics 7:
649—688.
Chernozhukov, V., G.W. Imbens and W.K. Newey (2007): "Instrumental Variable Identi-
fication and Estimation of Nonseparable Models," Journal of Econometrics 139, 4-14.
Chernozhukov, V., D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey
(2017): "Double/Debiased/Neyman Machine Learning of Treatment Effects," American Eco-
nomic Review Papers and Proceedings 107, 261-65.
Chernozhukov, V., D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, and J. Robins (2018): "Debiased/Double Machine Learning for Treatment and Structural Parameters," Econometrics Journal 21, C1-C68.
Chernozhukov, V., J.A. Hausman, and W.K. Newey (2018): "Demand Analysis with Many
Prices," working paper, MIT.
Chernozhukov, V., W.K. Newey, J. Robins (2018): "Double/De-Biased Machine Learning
Using Regularized Riesz Representers," arxiv.
Escanciano, J.C., D. Jacho-Chávez, and A. Lewbel (2016): "Identification and Estimation of Semiparametric Two Step Models," Quantitative Economics 7, 561-589.
Farrell, M. (2015): "Robust Inference on Average Treatment Effects with Possibly More Co-
variates than Observations," Journal of Econometrics 189, 1—23.
Firpo, S. and C. Rothe (2017): "Semiparametric Two-Step Estimation Using Doubly Robust
Moment Conditions," working paper.
Graham, B.W. (2011): "Efficiency Bounds for Missing Data Models with Semiparametric Re-
strictions," Econometrica 79, 437—452.
Hahn, J. (1998): "On the Role of the Propensity Score in Efficient Semiparametric Estimation
of Average Treatment Effects," Econometrica 66, 315-331.
Hahn, J. and G. Ridder (2013): "Asymptotic Variance of Semiparametric Estimators With
Generated Regressors," Econometrica 81, 315-340.
Hahn, J. and G. Ridder (2016): "Three-stage Semi-Parametric Inference: Control Variables and Differentiability," working paper.
Hahn, J., Z. Liao, and G. Ridder (2016): "Nonparametric Two-Step Sieve M Estimation and
Inference," working paper, UCLA.
Hasminskii, R.Z. and I.A. Ibragimov (1978): "On the Nonparametric Estimation of Function-
als," Proceedings of the 2nd Prague Symposium on Asymptotic Statistics, 41-51.
Hausman, J.A., and W.K. Newey (2016): "Individual Heterogeneity and Average Welfare,"
Econometrica 84, 1225-1248.
Hausman, J.A., and W.K. Newey (2017): "Nonparametric Welfare Analysis," Annual Review
of Economics 9, 521—546.
Hirano, K., G. Imbens, and G. Ridder (2003): "Efficient Estimation of Average Treatment
Effects Using the Estimated Propensity Score," Econometrica 71: 1161—1189.
Hotz, V.J. and R.A. Miller (1993): "Conditional Choice Probabilities and the Estimation of
Dynamic Models," Review of Economic Studies 60, 497-529.
Huber, P. (1981): Robust Statistics, New York: Wiley.
Ichimura, H. (1993): "Estimation of Single Index Models," Journal of Econometrics 58, 71-120.
Ichimura, H., and S. Lee (2010): "Characterization of the Asymptotic Distribution of Semi-
parametric M-Estimators," Journal of Econometrics 159, 252-266.
Ichimura, H. and W.K. Newey (2017): "The Influence Function of Semiparametric Estima-
tors," CEMMAP Working Paper, CWP06/17.
Kandasamy, K., A. Krishnamurthy, B. Póczos, L. Wasserman, and J.M. Robins (2015):
"Influence Functions for Machine Learning: Nonparametric Estimators for Entropies, Diver-
gences and Mutual Informations," arXiv.
Lee, Lung-fei (2005): "A C(α)-type Gradient Test in the GMM Approach," working paper.
Luenberger, D.G. (1969): Optimization by Vector Space Methods, New York: Wiley.
Murphy, K.M. and R.H. Topel (1985): "Estimation and Inference in Two-Step Econometric
Models," Journal of Business and Economic Statistics 3, 370-379.
Newey, W.K. (1984): "A Method of Moments Interpretation of Sequential Estimators," Eco-
nomics Letters 14, 201-206.
Newey, W.K. (1990): "Semiparametric Efficiency Bounds," Journal of Applied Econometrics 5,
99-135.
Newey, W.K. (1991): "Uniform Convergence in Probability and Stochastic Equicontinuity,"
Econometrica 59, 1161-1167.
Newey, W.K. (1994a): "The Asymptotic Variance of Semiparametric Estimators," Econometrica
62, 1349-1382.
Newey, W.K. (1994b): "Kernel Estimation of Partial Means and a General Variance Estimator,"
Econometric Theory 10, 233-253.
Newey, W.K. (1997): "Convergence Rates and Asymptotic Normality for Series Estimators,"
Journal of Econometrics 79, 147-168.
Newey, W.K. (1999): "Consistency of Two-Step Sample Selection Estimators Despite Misspeci-
fication of Distribution," Economics Letters 63, 129-132.
Newey, W.K., and D. McFadden (1994): "Large Sample Estimation and Hypothesis Testing,"
in Handbook of Econometrics, Vol. 4, ed. by R. Engle, and D. McFadden, pp. 2113-2241. North
Holland.
Newey, W.K., and J.L. Powell (1989): "Instrumental Variable Estimation of Nonparametric
Models," presented at Econometric Society winter meetings, 1988.
Newey, W.K., and J.L. Powell (2003): "Instrumental Variable Estimation of Nonparametric
Models," Econometrica 71, 1565-1578.
Newey, W.K., F. Hsieh, and J.M. Robins (1998): "Undersmoothing and Bias Corrected
Functional Estimation," MIT Dept. of Economics working paper.
Newey, W.K., F. Hsieh, and J.M. Robins (2004): "Twicing Kernels and a Small Bias Property
of Semiparametric Estimators," Econometrica 72, 947-962.
Newey, W.K., and J. Robins (2017): "Cross Fitting and Fast Remainder Rates for Semipara-
metric Estimation," arXiv.
Neyman, J. (1959): "Optimal Asymptotic Tests of Composite Statistical Hypotheses," Probability
and Statistics: The Harald Cramér Volume, ed. U. Grenander, New York: Wiley.
Pfanzagl, J., and W. Wefelmeyer (1982): Contributions to a General Asymptotic Statistical
Theory, Springer Lecture Notes in Statistics.
Pakes, A. and G.S. Olley (1995): "A Limit Theorem for a Smooth Class of Semiparametric
Estimators," Journal of Econometrics 65, 295-332.
Powell, J.L., J.H. Stock, and T.M. Stoker (1989): "Semiparametric Estimation of Index
Coefficients," Econometrica 57, 1403-1430.
Robins, J.M., A. Rotnitzky, and L.P. Zhao (1994): "Estimation of Regression Coefficients
When Some Regressors Are Not Always Observed," Journal of the American Statistical Associ-
ation 89, 846-866.
Robins, J.M. and A. Rotnitzky (1995): "Semiparametric Efficiency in Multivariate Regression
Models with Missing Data," Journal of the American Statistical Association 90, 122-129.
Robins, J.M., A. Rotnitzky, and L.P. Zhao (1995): "Analysis of Semiparametric Regression
Models for Repeated Outcomes in the Presence of Missing Data," Journal of the American
Statistical Association 90, 106-121.
Robins, J.M., and A. Rotnitzky (2001): Comment on "Inference for Semiparametric Models:
Some Questions and an Answer" by P.J. Bickel and J. Kwon, Statistica Sinica 11, 863-960.
Robins, J.M., A. Rotnitzky, and M. van der Laan (2000): "Comment on 'On Profile
Likelihood' by S.A. Murphy and A.W. van der Vaart," Journal of the American Statistical
Association 95, 431-435.
Robins, J., M. Sued, Q. Lei-Gomez, and A. Rotnitzky (2007): "Comment: Performance of
Double-Robust Estimators When Inverse Probability Weights Are Highly Variable," Statistical
Science 22, 544-559.
Robins, J.M., L. Li, E. Tchetgen, and A. van der Vaart (2008): "Higher Order Influence
Functions and Minimax Estimation of Nonlinear Functionals," IMS Collections Probability and
Statistics: Essays in Honor of David A. Freedman, Vol 2, 335-421.
Robins, J.M., L. Li, R. Mukherjee, E. Tchetgen, and A. van der Vaart (2017): "Higher
Order Estimating Equations for High-Dimensional Models," Annals of Statistics, forthcoming.
Robinson, P.M. (1988): "Root-N-Consistent Semiparametric Regression," Econometrica 56,
931-954.
Rust, J. (1987): "Optimal Replacement of GMC Bus Engines: An Empirical Model of Harold
Zurcher," Econometrica 55, 999-1033.
Santos, A. (2011): "Instrumental Variable Methods for Recovering Continuous Linear Function-
als," Journal of Econometrics, 161, 129-146.
Scharfstein, D.O., A. Rotnitzky, and J.M. Robins (1999): Rejoinder to "Adjusting for
Nonignorable Drop-out Using Semiparametric Non-response Models," Journal of the American
Statistical Association 94, 1135-1146.
Severini, T. and G. Tripathi (2006): "Some Identification Issues in Nonparametric Linear
Models with Endogenous Regressors," Econometric Theory 22, 258-278.
Severini, T. and G. Tripathi (2012): "Efficiency Bounds for Estimating Linear Functionals of
Nonparametric Regression Models with Endogenous Regressors," Journal of Econometrics 170,
491-498.
Schick, A. (1986): "On Asymptotically Efficient Estimation in Semiparametric Models," Annals
of Statistics 14, 1139-1151.
Stoker, T. (1986): "Consistent Estimation of Scaled Coefficients," Econometrica 54, 1461-1482.
Tamer, E. (2003): "Incomplete Simultaneous Discrete Response Model with Multiple Equilibria,"
Review of Economic Studies 70, 147-165.
van der Laan, M. and D. Rubin (2006): "Targeted Maximum Likelihood Learning," U.C. Berkeley
Division of Biostatistics Working Paper Series. Working Paper 213.
van der Vaart, A.W. (1991): "On Differentiable Functionals," The Annals of Statistics 19,
178-204.
van der Vaart, A.W. (1998): Asymptotic Statistics, Cambridge University Press, Cambridge,
England.
van der Vaart, A.W. (2014): "Higher Order Tangent Spaces and Influence Functions," Statis-
tical Science 29, 679-686.
Wooldridge, J.M. (1991): "On the Application of Robust, Regression-Based Diagnostics to
Models of Conditional Means and Conditional Variances," Journal of Econometrics 47, 5-46.