Locally Robust Semiparametric Estimation
Victor Chernozhukov
MIT
Juan Carlos Escanciano
Indiana University
Hidehiko Ichimura
University of Tokyo
Whitney K. Newey
MIT
James R. Robins
Harvard University
April 2018
Abstract
We give a general construction of debiased/locally robust/orthogonal (LR) moment
functions for GMM, where the derivative with respect to first step nonparametric estima-
tion is zero and equivalently first step estimation has no effect on the influence function.
This construction consists of adding an estimator of the influence function adjustment term
for first step nonparametric estimation to identifying or original moment conditions. We
also give numerical methods for estimating LR moment functions that do not require an
explicit formula for the adjustment term.
LR moment conditions have reduced bias and so are important when the first step is
machine learning. We derive LR moment conditions for dynamic discrete choice based on
first step machine learning estimators of conditional choice probabilities.
We provide simple and general asymptotic theory for LR estimators based on sample
splitting. This theory uses the additive decomposition of LR moment conditions into an
identifying condition and a first step influence adjustment. Our conditions require only
mean square consistency and a few (generally either one or two) readily interpretable rate
conditions.
LR moment functions have the advantage of being less sensitive to first step estimation.
Some LR moment functions are also doubly robust meaning they hold if one first step
is incorrect. We give novel classes of doubly robust moment functions and characterize
double robustness. For doubly robust estimators our asymptotic theory only requires one
rate condition.
Keywords: Local robustness, orthogonal moments, double robustness, semiparametric
estimation, bias, GMM.
JEL classification: C13; C14; C21; D24
1 Introduction
There are many economic parameters that depend on nonparametric or large dimensional first
steps. Examples include dynamic discrete choice, games, average consumer surplus, and treat-
ment effects. This paper shows how to construct moment functions for GMM estimators that are
debiased/locally robust/orthogonal (LR), where moment conditions have a zero derivative with
respect to the first step. We show that LR moment functions can be constructed by adding the
influence function adjustment for first step estimation to the original moment functions. This
construction can also be interpreted as a decomposition of LR moment functions into identifying
moment functions and a first step influence function term. We use this decomposition to give
simple and general conditions for root-n consistency and asymptotic normality, with different
properties being assumed for the identifying and influence function terms. The conditions are
easily interpretable mean square consistency and second order remainder conditions based on
estimated moments that use cross-fitting (sample splitting). We also give numerical estimators
of the influence function adjustment.
LR moment functions have several advantages. LR moment conditions bias correct in a way
that eliminates the large biases from plugging in first step machine learning estimators found
in Belloni, Chernozhukov, and Hansen (2014). LR moment functions can be used to construct
debiased/double machine learning (DML) estimators, as in Chernozhukov et al. (2017, 2018).
We illustrate by deriving LR moment functions for dynamic discrete choice estimation based
on conditional choice probabilities. We provide a DML estimator for dynamic discrete choice that
uses first step machine learning of conditional choice probabilities. We find that it performs well
in a Monte Carlo example. Such structural models provide a potentially important application
of DML, because of potentially high dimensional state spaces. Adding the first step influence
adjustment term provides a general way to construct LR moment conditions for structural
models so that machine learning can be used for first step estimation of conditional choice
probabilities, state transition distributions, and other unknown functions on which structural
estimators depend.
LR moment conditions also have the advantage of being relatively insensitive to small vari-
ation away from the first step true function. This robustness property is appealing in many
settings where it may be difficult to get the first step completely correct. Many interesting and
useful LR moment functions have the additional property that they are doubly robust (DR),
meaning moment conditions hold when one first step is not correct. We give novel classes of DR
moment conditions, including for average linear functionals of conditional expectations and prob-
ability densities. The construction of adding the first step influence function adjustment to an
identifying moment function is useful to obtain these moment conditions. We also give necessary
and sufficient conditions for a large class of moment functions to be DR. We find DR moments
have simpler and more general conditions for asymptotic normality, which helps motivate our
consideration of DR moment functions as special cases of LR ones. LR moment conditions also
help minimize sensitivity to misspecification as in Bonhomme and Weidner (2018).
LR moment conditions have smaller bias from first step estimation. We show that they have
the small bias property of Newey, Hsieh, and Robins (2004), that the bias of the moments is of
smaller order than the bias of the first step. This bias reduction leads to substantial improve-
ments in finite sample properties in many cases relative to just using the original moment con-
ditions. For dynamic discrete choice we find large bias reductions, moderate variance increases
and even reductions in some cases, and coverage probabilities substantially closer to nominal.
For machine learning estimators of the partially linear model, Chernozhukov et al. (2017, 2018)
found bias reductions so large that the LR estimator is root-n consistent but the estimator based
on the original moment condition is not. Substantial improvements were previously also found
for density weighted averages by Newey, Hsieh, and Robins (2004, NHR). The twicing kernel
estimators in NHR are numerically equal to LR estimators based on the original (before twicing)
kernel, as shown in Newey, Hsieh, and Robins (1998), and the twicing kernel estimators were shown
to have smaller mean square error in large samples. Also, a Monte Carlo example in NHR finds
that the mean square error (MSE) of the LR estimator has a smaller minimum and is flatter as
a function of bandwidth than the MSE of Powell, Stock, and Stoker’s (1989) density weighted
average derivative estimator. We expect similar finite sample improvements from LR moments
in other cases.
LR moment conditions have appeared in earlier work. They are semiparametric versions of
Neyman (1959) C-alpha test scores for parametric models. Hasminskii and Ibragimov (1978)
suggested LR estimation of functionals of a density and argued for their advantages over plug-in
estimators. Pfanzagl and Wefelmeyer (1981) considered using LR moment conditions for im-
proving the asymptotic efficiency of functionals of distribution estimators. Bickel and Ritov
(1988) gave a LR estimator of the integrated squared density that attains root-n consistency
under minimal conditions. The Robinson (1988) semiparametric regression and Ichimura (1993)
index regression estimators are LR. Newey (1990) showed that LR moment conditions can be
obtained as residuals from projections on the tangent set in a semiparametric model. Newey
(1994a) showed that derivatives of an objective function where the first step has been "concen-
trated out" are LR, including the efficient score of a semiparametric model. NHR (1998, 2004)
gave estimators of averages that are linear in density derivative functionals with remainder rates
that are as fast as those in Bickel and Ritov (1988). Doubly robust moment functions have
been constructed by Robins, Rotnitzky, and Zhao (1994, 1995), Robins and Rotnitzky (1995),
Scharfstein, Rotnitzky, and Robins (1999), Robins, Rotnitzky, and van der Laan (2000), Robins
and Rotnitzky (2001), Graham (2011), and Firpo and Rothe (2017). They are widely used for
estimating treatment effects, e.g. Bang and Robins (2005). Van der Laan and Rubin (2006) de-
veloped targeted maximum likelihood to obtain a LR estimating equation based on the efficient
influence function of a semiparametric model. Robins et al. (2008, 2017) showed that efficient
influence functions are LR, characterized some doubly robust moment conditions, and developed
higher order influence functions that can reduce bias. Belloni, Chernozhukov, and Wei (2013),
Belloni, Chernozhukov, and Hansen (2014), Farrell (2015), Kandasamy et al. (2015), Belloni,
Chernozhukov, Fernandez-Val, and Hansen (2016), and Athey, Imbens, and Wager (2017) gave
LR estimators with machine learning first steps in several specific contexts.
A main contribution of this paper is the construction of LR moment conditions from any
moment condition and first step estimator that can result in a root-n consistent estimator of
the parameter of interest. This construction is based on the limit of the first step when a data
observation has a general distribution that allows for misspecification, similarly to Newey (1994).
LR moment functions are constructed by adding to identifying moment functions the influence
function of the true expectation of the identifying moment functions evaluated at the first step
limit, i.e. by adding the influence function term that accounts for first step estimation. The
addition of the influence adjustment "partials out" the first order effect of the first step on the
moments. This construction of LR moments extends those cited above for first step density and
distribution estimators to any first step, including instrumental variable estimators. Also, this
construction is estimator based rather than model based as in van der Laan and Rubin (2006)
and Robins et al. (2008, 2017). The construction depends only on the moment functions and
the first step rather than on a semiparametric model. Also, we use the fundamental Gateaux
derivative definition of the influence function to show LR rather than an embedding in a regular
semiparametric model.
The focus on the functional that is the true expected moments evaluated at the first step
limit is the key to this construction. This focus should prove useful for constructing LR moments
in many settings, including those where it has already been used to find the asymptotic variance
of semiparametric estimators, such as Newey (1994a), Pakes and Olley (1995), Hahn (1998), Ai
and Chen (2003), Hirano, Imbens, and Ridder (2003), Bajari, Hong, Krainer, and Nekipelov
(2010), Bajari, Chernozhukov, Hong, and Nekipelov (2009), Hahn and Ridder (2013, 2016), and
Ackerberg, Chen, Hahn, and Liao (2014), Hahn, Liao, and Ridder (2016). One can construct
LR moment functions in each of these settings by adding the first step influence function derived
for each case as an adjustment to the original, identifying moment functions.
Another contribution is the development of LR moment conditions for dynamic discrete
choice. We derive the influence adjustment for first step estimation of conditional choice prob-
abilities as in Hotz and Miller (1993). We find encouraging Monte Carlo results when various
machine learning methods are used to construct the first step. We also give LR moment functions
for conditional moment restrictions based on orthogonal instruments.
An additional contribution is to provide general estimators of the influence adjustment term
that can be used to construct LR moments without knowing their form. These methods estimate
the adjustment term numerically, thus avoiding the need to know its form. It is beyond the scope
of this paper to develop machine learning versions of these numerical estimators. Such estimators
are developed by Chernozhukov, Newey, and Robins (2018) for average linear functionals of
conditional expectations.
Further contributions include novel classes of DR estimators, including linear functionals of
nonparametric instrumental variables and density estimators, and a characterization of (nec-
essary and sufficient conditions for) double robustness. We also give related, novel partial
robustness results where original moment conditions are satisfied even when the first step is not
equal to the truth.
A main contribution is simple and general asymptotic theory for LR estimators that use
cross-fitting in the construction of the average moments. This theory is based on the structure
of LR moment conditions as an identifying moment condition depending on one first step plus
an influence adjustment that can depend on an additional first step. We give a remainder
decomposition that leads to mean square consistency conditions for first steps plus a few readily
interpretable rate conditions. For DR estimators there is only one rate condition, on a product
of sample remainders from two first step estimators, leading to particularly simple conditions.
This simplicity motivates our inclusion of results for DR estimators. This asymptotic theory
is also useful for existing moment conditions that are already known to be LR. Whenever the
moment condition can be decomposed into an identifying moment condition depending on one
first step and an influence function term that may depend on two first steps the simple and
general regularity conditions developed here will apply.
LR moments reduce the smoothing bias that results from first step nonparametric estimation
relative to original moment conditions. There are other sources of bias arising from nonlinearity
of moment conditions in the first step and the empirical distribution. Cattaneo and Jansson
(2017) and Cattaneo, Jansson, and Ma (2017) give useful bootstrap and jackknife methods that
reduce nonlinearity bias. Newey and Robins (2017) show that one can also remove this bias by
cross fitting in some settings. We allow for cross-fitting in this paper.
Section 2 describes the general construction of LR moment functions for semiparametric
GMM. Section 3 gives LR moment conditions for dynamic discrete choice. Section 4 shows how
to estimate the first step influence adjustment. Section 5 gives novel classes of DR moment
functions and characterizes double robustness. Section 6 gives an orthogonal instrument con-
struction of LR moments based on conditional moment restrictions. Section 7 provides simple
and general asymptotic theory for LR estimators.
2 Locally Robust Moment Functions
The subject of this paper is GMM estimators of parameters where the sample moment functions
depend on a first step nonparametric or large dimensional estimator. We refer to these estimators
as semiparametric. We could also refer to them as GMM where first step estimators are “plugged
in” the moments. This terminology seems awkward though, so we simply refer to them as
semiparametric GMM estimators. We denote such an estimator by $\hat{\beta}$, which is a function of the
data $w_1,\dots,w_n$ where $n$ is the number of observations. Throughout the paper we will assume that
the data observations $w_i$ are i.i.d. We denote the object that $\hat{\beta}$ estimates as $\beta_0$, the subscript
referring to the parameter value under the distribution $F_0$ of $w_i$.
To describe semiparametric GMM let $g(w,\beta,\gamma)$ denote an $r\times 1$ vector of functions of the
data observation $w$, parameters of interest $\beta$, and a function $\gamma$ that may be vector valued. The
function $\gamma$ can depend on $w$ and $\beta$ through those arguments of $g$. Here the function $\gamma$ represents
some possible first step, such as an estimator, its limit, or a true function. A GMM estimator
can be based on a moment condition where $\beta_0$ is the unique parameter vector satisfying
$$E[g(w_i,\beta_0,\gamma_0)]=0, \qquad (2.1)$$
and $\gamma_0$ is the true $\gamma$. We assume that this moment condition identifies $\beta$. Let $\hat{\gamma}$ denote some
first step estimator of $\gamma_0$. Plugging in $\hat{\gamma}$ to obtain $g(w_i,\beta,\hat{\gamma})$ and averaging over $i$ results in the
estimated sample moments $\hat{g}(\beta)=\sum_{i=1}^{n}g(w_i,\beta,\hat{\gamma})/n$. For a positive semi-definite weighting
matrix $\hat{W}$, a semiparametric GMM estimator is
$$\hat{\beta}=\arg\min_{\beta\in B}\hat{g}(\beta)'\hat{W}\hat{g}(\beta),$$
where $A'$ denotes the transpose of a matrix $A$ and $B$ is the parameter space for $\beta$. Such
estimators have been considered by, e.g. Andrews (1994), Newey (1994a), Newey and McFadden
(1994), Pakes and Olley (1995), Chen and Liao (2015), and others.
Locally robust (LR) moment functions can be constructed by adding the influence function adjustment for the first step estimator $\hat{\gamma}$ to the identifying or original moment functions
$g(w,\beta,\gamma)$. To describe this influence adjustment let $\gamma(F)$ denote the limit of $\hat{\gamma}$ when $w_i$ has distribution $F$, where we restrict $F$ only in that $\gamma(F)$ exists and possibly other regularity conditions
are satisfied. That is, $\gamma(F)$ is the limit of $\hat{\gamma}$ under possible misspecification, similar to Newey
(1994). Let $H$ be some other distribution and $F_\tau=(1-\tau)F_0+\tau H$ for $0\le\tau\le 1$, where $F_0$ denotes the true distribution of $w_i$. We assume that $H$ is chosen so that $\gamma(F_\tau)$ is well defined for
$\tau>0$ small enough and possibly other regularity conditions are satisfied, similarly to Ichimura
and Newey (2017). The influence function adjustment will be the function $\phi(w,\beta,\gamma,\lambda)$ such that
for all such $H$,
$$\frac{d}{d\tau}E[g(w_i,\beta,\gamma(F_\tau))]=\int\phi(w,\beta,\gamma_0,\lambda_0)H(dw),\qquad E[\phi(w_i,\beta,\gamma_0,\lambda_0)]=0, \qquad (2.2)$$
where $\lambda$ is an additional nonparametric or large dimensional unknown object on which $\phi(w,\beta,\gamma,\lambda)$
depends and the derivative is from the right (i.e. for positive values of $\tau$) and at $\tau=0$.
This equation is the well known definition of the influence function $\phi(w,\beta,\gamma_0,\lambda_0)$ of $\mu(F)=E[g(w_i,\beta,\gamma(F))]$ as the Gateaux derivative of $\mu(F)$, e.g. Huber (1981). The restriction of $F$ so
that $\gamma(F)$ exists allows $\phi(w,\beta,\gamma_0,\lambda_0)$ to be the influence function when $\mu(F)$ is only well defined for certain types of distributions, such as when $\gamma(F)$ is a conditional expectation or density.
The function $\phi(w,\beta,\gamma,\lambda)$ will generally exist when $E[g(w_i,\beta,\gamma(F))]$ has a finite semiparametric
variance bound. Also $\phi(w,\beta,\gamma,\lambda)$ will generally be unique because we are not restricting $F$ very
much. Also, note that $\phi(w,\beta,\gamma,\lambda)$ will be the influence adjustment term from Newey (1994a),
as discussed in Ichimura and Newey (2017).
LR moment functions can be constructed by adding $\phi(w,\beta,\gamma,\lambda)$ to $g(w,\beta,\gamma)$ to obtain new
moment functions
$$\psi(w,\beta,\gamma,\lambda)=g(w,\beta,\gamma)+\phi(w,\beta,\gamma,\lambda). \qquad (2.3)$$
Let $\hat{\lambda}$ be a nonparametric or large dimensional estimator having limit $\lambda(F)$ when $w_i$ has distribution $F$, with $\lambda(F_0)=\lambda_0$. Also let $\hat{\psi}(\beta)=\sum_{i=1}^{n}\psi(w_i,\beta,\hat{\gamma},\hat{\lambda})/n$. A LR GMM estimator can be
obtained as
$$\hat{\beta}=\arg\min_{\beta\in B}\hat{\psi}(\beta)'\hat{W}\hat{\psi}(\beta). \qquad (2.4)$$
As usual a choice of $\hat{W}$ that minimizes the asymptotic variance of $\sqrt{n}(\hat{\beta}-\beta_0)$ will be a consistent
estimator of the inverse of the asymptotic variance $\Omega$ of $\sqrt{n}\hat{\psi}(\beta_0)$. As we will further discuss,
$\psi(w,\beta,\gamma,\lambda)$ being LR will mean that the estimation of $\gamma$ and $\lambda$ does not affect $\Omega$, so that $\Omega=E[\psi(w_i,\beta_0,\gamma_0,\lambda_0)\psi(w_i,\beta_0,\gamma_0,\lambda_0)']$. An optimal $\hat{W}$ also gives an efficient estimator in the wider
sense shown in Ackerberg, Chen, Hahn, and Liao (2014), making $\hat{\beta}$ efficient in a semiparametric
model where the only restrictions imposed are equation (2.1).
The LR property we consider is that the derivative of the true expectation of the moment
function with respect to the first step is zero, for a Gateaux derivative like that for the influence
function in equation (2.2). Define $F_\tau=(1-\tau)F_0+\tau H$ as before, where $H$ is such that both
$\gamma(F_\tau)$ and $\lambda(F_\tau)$ are well defined. The LR property is that for all $H$ as specified,
$$\frac{d}{d\tau}E[\psi(w_i,\beta,\gamma(F_\tau),\lambda(F_\tau))]=0. \qquad (2.5)$$
Note that this condition is the same as that of Newey (1994a) for the presence of $\hat{\gamma}$ and $\hat{\lambda}$ to
have no effect on the asymptotic distribution, when each $F_\tau$ is a regular parametric submodel.
Consequently, the asymptotic variance of $\sqrt{n}\hat{\psi}(\beta_0)$ will be $\Omega$ as in the last paragraph.
To show LR of the moment functions $\psi(w,\beta,\gamma,\lambda)=g(w,\beta,\gamma)+\phi(w,\beta,\gamma,\lambda)$ from equation
(2.3) we use the fact that the second, zero expectation condition in equation (2.2) must hold for
all possible true distributions. For any given $\beta$ define $\bar{g}(\tau)=E[g(w_i,\beta,\gamma(F_\tau))]$ and $\phi(w,\tau)=\phi(w,\beta,\gamma(F_\tau),\lambda(F_\tau))$.
Theorem 1: If i) $d\bar{g}(\tau)/d\tau=\int\phi(w,0)H(dw)$, ii) $\int\phi(w,\tau)F_\tau(dw)=0$ for all $\tau\in[0,\bar{\tau})$,
and iii) $\int\phi(w,\tau)F_0(dw)$ and $\int\phi(w,\tau)H(dw)$ are continuous at $\tau=0$, then
$$\frac{d}{d\tau}E[\phi(w_i,\tau)]=-\frac{d\bar{g}(\tau)}{d\tau}. \qquad (2.6)$$
The proofs of this result and others are given in Appendix B. Assumptions i) and ii) of
Theorem 1 require that both parts of equation (2.2) hold with the second, zero mean condition
being satisfied when $F_\tau$ is the true distribution. Assumption iii) is a regularity condition. The
LR property follows from Theorem 1 by adding $d\bar{g}(\tau)/d\tau$ to both sides of equation (2.6) and
noting that the sum of derivatives is the derivative of the sum. Equation (2.6) shows that the
addition of $\phi(w,\beta,\gamma,\lambda)$ "partials out" the effect of the first step $\gamma$ on the moment by "cancelling"
the derivative of the identifying moment $E[g(w_i,\beta,\gamma(F_\tau))]$ with respect to $\tau$. This LR result for
$\psi(w,\beta,\gamma,\lambda)$ differs from the literature in its Gateaux derivative formulation and in the fact that
$\psi(w,\beta,\gamma,\lambda)$ is not a semiparametric influence function but is the hybrid sum of an identifying moment
function $g(w,\beta,\gamma)$ and an influence function adjustment $\phi(w,\beta,\gamma,\lambda)$.
Another zero derivative property of LR moment functions is useful. If the sets $\Gamma$ and $\Lambda$ of
possible limits $\gamma(F)$ and $\lambda(F)$, respectively, are linear, $\gamma(F)$ and $\lambda(F)$ can vary separately from
one another, and certain functional differentiability conditions hold, then LR moment functions
will have the property that for any $\gamma\in\Gamma$, $\lambda\in\Lambda$, and $\bar{\psi}(\gamma,\lambda)=E[\psi(w_i,\beta_0,\gamma,\lambda)]$,
$$\frac{d}{d\tau}\bar{\psi}((1-\tau)\gamma_0+\tau\gamma,\lambda_0)=0, \qquad \frac{d}{d\tau}\bar{\psi}(\gamma_0,(1-\tau)\lambda_0+\tau\lambda)=0. \qquad (2.7)$$
That is, the expected value of the LR moment function will have a zero Gateaux derivative
with respect to each of the first steps $\gamma$ and $\lambda$. This property will be useful for several results
to follow. Under still stronger smoothness conditions this zero derivative condition will result
in the existence of a constant $C$ such that for a function norm $\|\cdot\|$,
$$\big|\bar{\psi}(\gamma,\lambda_0)\big|\le C\|\gamma-\gamma_0\|^2, \qquad \big|\bar{\psi}(\gamma_0,\lambda)\big|\le C\|\lambda-\lambda_0\|^2, \qquad (2.8)$$
when $\|\gamma-\gamma_0\|$ and $\|\lambda-\lambda_0\|$ are small enough. In Appendix B we give smoothness conditions
that are sufficient for LR to imply equations (2.7) and (2.8). When formulating regularity
conditions for particular moment functions and first step estimators it may be more convenient
to work directly with equation (2.7) and/or (2.8).
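As a minimal numerical illustration of the zero derivative properties in equations (2.5) and (2.7), consider a toy example that is not from the paper: $\beta_0=E[\gamma_0(x)]$ for $\gamma_0(x)=E[q|x]$, with identifying moment $g=\gamma(x)-\beta$ and adjustment $\phi=q-\gamma(x)$. Perturbing the first step in a fixed direction moves the identifying moment but leaves the LR moment unchanged; all design choices below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Toy setup (not from the paper): gamma0(x) = E[q|x] = x^2, beta0 = E[gamma0(x)],
# identifying moment g = gamma(x) - beta, adjustment phi = q - gamma(x).
x = rng.standard_normal(n)
q = x**2 + rng.standard_normal(n)
beta0 = 1.0                      # E[x^2] under the standard normal

def g_bar(tau, h):
    """Sample analogue of E[g(w, beta0, gamma_tau)] at gamma_tau = gamma0 + tau*h."""
    return np.mean(x**2 + tau * h(x) - beta0)

def psi_bar(tau, h):
    """Sample analogue of E[psi] for the LR moment psi = g + phi."""
    gamma_tau = x**2 + tau * h(x)
    return np.mean((gamma_tau - beta0) + (q - gamma_tau))

h = np.cos                       # an arbitrary direction of first step error
for tau in (0.0, 0.1, 0.2):
    print(tau, g_bar(tau, h), psi_bar(tau, h))

# The identifying moment drifts linearly in tau (slope = mean of cos(x)), while
# the LR moment does not move at all: here the adjustment cancels the first
# step exactly, since psi = q - beta.
assert abs(psi_bar(0.2, h) - psi_bar(0.0, h)) < 1e-10
```

In this example the cancellation is exact for every $\tau$, which is stronger than the local (first derivative) property; in general only the derivative at $\tau=0$ is zero.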
The approach of constructing LR moment functions by adding the influence adjustment
differs from the model based approach of using an efficient influence function or score for a
semiparametric model as moment functions. The approach here is estimator based rather than
model based. The influence adjustment $\phi(w,\beta,\gamma,\lambda)$ is determined by the limit $\gamma(F)$ of the first
step estimator $\hat{\gamma}$ and the moment functions $g(w,\beta,\gamma)$, rather than by some underlying semiparametric model. This estimator based approach has proven useful for deriving the influence
function of a wide variety of semiparametric estimators, as mentioned in the Introduction. Here
this estimator based approach provides a general way to construct LR moment functions. For
any moment function $g(w,\beta,\gamma)$ and first step estimator $\hat{\gamma}$ a corresponding LR estimator can be
constructed as in equations (2.3) and (2.4).
The addition of $\phi(w,\beta,\gamma,\lambda)$ does not affect identification of $\beta$ because $\phi(w,\beta,\gamma_0,\lambda_0)$ has
expectation zero for any $\beta$ and true $F_0$. Consequently, the LR GMM estimator will have the
same asymptotic variance as the original GMM estimator when $\sqrt{n}(\hat{\beta}-\beta_0)$ is asymptotically
normal, under appropriate regularity conditions. The addition of $\phi(w,\beta,\gamma,\lambda)$ will change other
properties of the estimator. As discussed in Chernozhukov et al. (2017, 2018), it can even
remove enough bias so that the LR estimator is root-n consistent and the original estimator is
not.
If $F_\tau$ was modified so that $\tau$ is a function of a smoothing parameter, e.g. a bandwidth,
and $\tau$ gives the magnitude of the smoothing bias of $\gamma(F_\tau)$, then equation (2.5) is a small bias
condition, equivalent to
$$E[\psi(w_i,\beta_0,\gamma(F_\tau),\lambda(F_\tau))]=o(\tau).$$
Here $E[\psi(w_i,\beta_0,\gamma(F_\tau),\lambda(F_\tau))]$ is a bias in the moment condition resulting from smoothing that
shrinks faster than $\tau$. In this sense LR GMM estimators have the small bias property considered
in NHR. This interpretation is also one sense in which LR GMM is "debiased."
In some cases the original moment functions $g(w,\beta,\gamma)$ are already LR and the influence
adjustment will be zero. An important class of moment functions that are LR are those where
$g(w,\beta,\gamma)$ is the derivative with respect to $\beta$ of an objective function where nonparametric
parts have been concentrated out. That is, suppose that there is a function $q(w,\beta,\zeta)$ such that
$g(w,\beta,\gamma)=\partial q(w,\beta,\zeta(\beta))/\partial\beta$, where $\zeta(\beta)=\arg\max_{\zeta}E[q(w_i,\beta,\zeta)]$ and $\gamma$ includes $\zeta(\beta)$ and
possibly additional functions. Proposition 2 of Newey (1994a) and Lemma 2.5 of Chernozhukov
et al. (2018) then imply that $g(w,\beta,\gamma)$ will be LR. This class of moment functions includes
various partially linear regression models where $\zeta$ represents a conditional expectation. It also
includes the efficient score for a semiparametric model, Newey (1994a, pp. 1358-1359).
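A sketch of the partially linear case just mentioned, under an assumed toy design: the Robinson (1988) residual-on-residual moment needs no added adjustment term, and a crude series first step (standing in for any nonparametric or machine learning estimator) already yields an accurate estimate. The design and the polynomial first step are illustrative choices, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Toy partially linear design (not from the paper): y = d*beta0 + f0(x) + eps.
x = rng.uniform(-1, 1, n)
d = np.sin(2 * x) + 0.5 * rng.standard_normal(n)    # regressor correlated with x
beta0 = 2.0
y = d * beta0 + np.exp(x) + rng.standard_normal(n)

def cond_mean(target, x, deg=5):
    """Crude polynomial series regression, standing in for any nonparametric
    first step such as a kernel, lasso, or forest estimator."""
    X = np.vander(x, deg + 1)
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return X @ coef

# Robinson (1988) moment: psi = (y - E[y|x] - beta*(d - E[d|x]))*(d - E[d|x]).
# It is LR because it is the derivative in beta of a least squares objective
# with the conditional expectations concentrated out.
ey, ed = cond_mean(y, x), cond_mean(d, x)
beta_hat = np.sum((y - ey) * (d - ed)) / np.sum((d - ed) ** 2)
print(beta_hat)  # close to beta0 = 2.0
```

Because the moment is already LR, first step estimation error only enters the estimate of $\beta_0$ through second order (product) terms.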
Cross fitting, also known as sample splitting, has often been used to improve the properties
of semiparametric and machine learning estimators; e.g. see Bickel (1982), Schick (1986), and
Powell, Stock, and Stoker (1989). Cross fitting removes a source of bias and can be used to
construct estimators with remainder terms that converge to zero as fast as is known to be
possible, as in NHR and Newey and Robins (2017). Cross fitting is also useful for double
machine learning estimators, as outlined in Chernozhukov et al. (2017, 2018). For these reasons
we allow for cross-fitting, where sample moments have the form
$$\hat{\psi}(\beta)=\frac{1}{n}\sum_{i=1}^{n}\psi(w_i,\beta,\hat{\gamma}_{-i},\hat{\lambda}_{-i}),$$
with $\hat{\gamma}_{-i}$ and $\hat{\lambda}_{-i}$ being formed from observations other than the $i$th. This kind of cross fitting
removes an "own observation" bias term and is useful for showing root-n consistency when $\hat{\gamma}$
and $\hat{\lambda}$ are machine learning estimators.
One version of cross-fitting with good properties in examples in Chernozhukov et al. (2018)
can be obtained by partitioning the observation indices into $L$ groups $I_\ell$ $(\ell=1,\dots,L)$, forming
$\hat{\gamma}_\ell$ and $\hat{\lambda}_\ell$ from observations not in $I_\ell$, and constructing
$$\hat{\psi}(\beta)=\frac{1}{n}\sum_{\ell=1}^{L}\sum_{i\in I_\ell}\psi(w_i,\beta,\hat{\gamma}_\ell,\hat{\lambda}_\ell). \qquad (2.9)$$
Further bias reductions may be obtained in some cases by using different sets of observations for
computing $\hat{\gamma}_\ell$ and $\hat{\lambda}_\ell$, leading to remainders that converge to zero as rapidly as known possible
in interesting cases; see Newey and Robins (2017). The asymptotic theory of Section 7 focuses
on this kind of cross fitting.
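The partition in equation (2.9) can be sketched as follows, for a toy functional $\beta_0=E[\gamma_0(x)^2]$ whose influence adjustment $2\gamma(x)[q-\gamma(x)]$ is assumed here for illustration; each group's moments are averaged using first steps fit only on the complementary observations. The design and the linear first step are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n, L = 30_000, 5

# Toy target (not from the paper): beta0 = E[gamma0(x)^2] with gamma0(x) = E[q|x].
# LR moment: psi(w, beta, gamma) = gamma(x)^2 - beta + 2*gamma(x)*(q - gamma(x)).
x = rng.uniform(-1, 1, n)
q = (1 + x) + rng.standard_normal(n)          # gamma0(x) = 1 + x, so beta0 = 4/3

fold = rng.integers(0, L, n)                  # random partition into groups I_1, ..., I_L
psi_sum = 0.0
for l in range(L):
    train, held = fold != l, fold == l
    # First step fit on observations *not* in I_l (a simple linear regression
    # here, standing in for any machine learning first step).
    X = np.column_stack([np.ones(train.sum()), x[train]])
    c, *_ = np.linalg.lstsq(X, q[train], rcond=None)
    g_hat = c[0] + c[1] * x[held]             # first step evaluated on the held-out group
    psi_sum += np.sum(g_hat**2 + 2 * g_hat * (q[held] - g_hat))
beta_hat = psi_sum / n                        # solves the cross-fit moment equation
print(beta_hat)  # close to E[(1 + x)^2] = 4/3
```

Because each $\hat\gamma_\ell$ is independent of the observations it is evaluated on, the "own observation" bias term is absent by construction.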
As an example we consider a bound on average equivalent variation. Let $\gamma_0(x)$ denote the
conditional expectation of quantity $q$ conditional on $x=(p',y)'$, where $p=(p_1,p_2')'$ is a vector
of prices and $y$ is income. The object of interest is a bound on average equivalent variation for
a price change from $\bar{p}_1$ to $\check{p}_1$ given by
$$\beta_0=E\Big[\int\omega(p_1,y)\gamma_0(p_1,p_2,y)dp_1\Big], \qquad \omega(p_1,y)=w(y)1(\bar{p}_1\le p_1\le\check{p}_1)\exp[-B(p_1-\bar{p}_1)],$$
where $w(y)$ is a function of income and $B$ a constant. It follows by Hausman and Newey
(2016) that if $B$ is a lower (upper) bound on the income effect for all individuals, then $\beta_0$ is
an upper (lower) bound on the equivalent variation for a price change from $\bar{p}_1$ to $\check{p}_1$, averaged
over heterogeneity, other prices $p_2$, and income $y$. The function $w(y)$ allows for averages over
income in specific ranges, as in Hausman and Newey (2017).
A moment function that could be used to estimate $\beta_0$ is
$$g(w,\beta,\gamma)=\int\omega(p_1,y)\gamma(p_1,p_2,y)dp_1-\beta.$$
Note that
$$E[g(w,\beta_0,\gamma)]+\beta_0=E\Big[\int\omega(p_1,y)\gamma(p_1,p_2,y)dp_1\Big]=E[\lambda_0(x)\gamma(x)], \qquad \lambda_0(x)=\frac{\omega(p_1,y)}{f_0(p_1|p_2,y)},$$
where $f_0(p_1|p_2,y)$ is the conditional pdf of $p_1$ given $p_2$ and $y$. Then by Proposition 4 of Newey
(1994) the influence function adjustment for any nonparametric estimator $\hat{\gamma}(x)$ of $E[q|x=x]$
is
$$\phi(w,\gamma,\lambda)=\lambda(x)[q-\gamma(x)].$$
Here $\lambda_0(x)$ is an example of an additional unknown function that is included in $\phi(w,\gamma,\lambda)$ but
not in the original moment functions $g(w,\beta,\gamma)$. Let $\hat{\gamma}_\ell(x)$ be an estimator of $E[q|x]$ that
can depend on $\ell$ and $\hat{\lambda}_\ell(x)$ be an estimator of $\lambda_0(x)$, such as $\hat{f}_\ell(p_1|p_2,y)^{-1}\omega(p_1,y)$ for an estimator
$\hat{f}_\ell(p_1|p_2,y)$. The LR estimator obtained by solving $\hat{\psi}(\beta)=0$ for $\hat{\gamma}_\ell(x)$ and $\hat{\lambda}_\ell(x)$ as
above is
$$\hat{\beta}=\frac{1}{n}\sum_{\ell=1}^{L}\sum_{i\in I_\ell}\Big\{\int\omega(p_1,y_i)\hat{\gamma}_\ell(p_1,p_{2i},y_i)dp_1+\hat{\lambda}_\ell(x_i)[q_i-\hat{\gamma}_\ell(x_i)]\Big\}. \qquad (2.10)$$
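A sketch of the estimator in equation (2.10) under simplifying toy assumptions (uniform $p_1$ independent of income, so the conditional density and hence $\lambda_0$ are known; $w(y)=1$, $B=0$, and the price interval are illustrative choices, not from the paper). A deliberately misspecified first step shows the adjustment term restoring the mean:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000

# Toy design (not from the paper): p1 ~ U(0, 1) independent of income y, so
# f0(p1|y) = 1 and lambda0(x) = omega(p1, y); take w(y) = 1, B = 0, and a
# price change over [0.2, 0.8].
p1 = rng.uniform(0, 1, n)
y = rng.uniform(1, 2, n)
gamma0 = lambda p, inc: inc / (1 + p)               # "true" demand E[q | p1, y]
q = gamma0(p1, y) + 0.2 * rng.standard_normal(n)
omega = lambda p: ((p >= 0.2) & (p <= 0.8)).astype(float)

# Deliberately misspecified first step, to show the adjustment term at work.
gamma_hat = lambda p, inc: inc / (1 + p) + 0.3 * (p - 0.2)

edges = np.linspace(0.2, 0.8, 241)                  # midpoint quadrature for the p1 integral
mid = (edges[:-1] + edges[1:]) / 2
plug_in = gamma_hat(mid[None, :], y[:, None]).mean(axis=1) * 0.6
adjust = omega(p1) * (q - gamma_hat(p1, y))         # lambda0(x)*[q - gamma(x)]
beta_hat = np.mean(plug_in + adjust)

beta0 = np.mean(gamma0(mid[None, :], y[:, None]).mean(axis=1) * 0.6)
print(beta_hat, beta0)  # the adjustment cancels the first step error in the mean
```

In this design the plug-in term alone is off by the integrated first step error, while the mean of the adjustment term is exactly its negative, illustrating the double robustness of this moment function in $\gamma$.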
3 Machine Learning for Dynamic Discrete Choice
A challenging problem when estimating dynamic structural models is the dimensionality of
state spaces. Machine learning addresses this problem via model selection to estimate high
dimensional choice probabilities. These choice probability estimators can then be used in
conditional choice probability (CCP) estimators of structural parameters, following Hotz and
Miller (1993). In order for CCP estimators based on machine learning to be root-n consistent
they must be based on orthogonal (i.e. LR) moment conditions; see Chernozhukov et al. (2017,
2018). Adding the adjustment term provides the way to construct LR moment conditions from
known moment conditions for CCP estimators. In this Section we do so for Rust’s (1987)
model of dynamic discrete choice.
We consider an agent choosing among $J$ discrete alternatives by maximizing the expected
present discounted value of utility. We assume that the per-period utility function for an agent
making choice $j$ in period $t$ is given by
$$U_{tj}=u_j(x_t,\beta_0)+\epsilon_{tj} \quad (j=1,\dots,J;\ t=1,2,\dots).$$
The vector $x_t$ consists of the observed state variables of the problem (e.g. work experience, number of
children, wealth) and $\beta_0$ is a vector of unknown parameters. The disturbances $\epsilon_t=(\epsilon_{t1},\dots,\epsilon_{tJ})'$ are not observed by the econometrician. As in much of the literature we assume that $\epsilon_t$ is i.i.d.
over time with known CDF that has support $\mathbb{R}^J$, is independent of $x_t$, and that $x_t$ is first-order
Markov.
To describe the agent’s choice probabilities let $\delta$ denote a time discount parameter, $V(x_t)$
the expected value function, $y_{tj}\in\{0,1\}$ the indicator that choice $j$ is made in period $t$, and $\bar{v}_j(x_t)=u_j(x_t,\beta_0)+\delta E[V(x_{t+1})|x_t,j]$ the expected value function conditional on choice $j$. As in Rust
(1987), we assume that in each period the agent makes the choice $j$ that maximizes the expected
present discounted value of utility $\bar{v}_j(x_t)+\epsilon_{tj}$. The probability of choosing $j$ in period $t$ is then
$$P_j(\bar{v}(x_t))=\Pr(\bar{v}_j(x_t)+\epsilon_{tj}\ge\bar{v}_k(x_t)+\epsilon_{tk};\ k=1,\dots,J), \qquad \bar{v}(x)=(\bar{v}_1(x),\dots,\bar{v}_J(x))'. \qquad (3.1)$$
These choice probabilities have a useful relationship to the structural parameters $\beta_0$ when
there is a renewal choice, where the conditional distribution of $x_{t+1}$ given the renewal choice and
$x_t$ does not depend on $x_t$. Without loss of generality suppose that the renewal choice is $j=1$.
Let $v_j(x)$ denote $v_j(x)=\bar{v}_j(x)-\bar{v}_1(x)$, so that $v_1\equiv 0$. As usual, subtracting $\bar{v}_1(x)$ from each $\bar{v}_j(x)$
in $P_j(\bar{v}(x))$ does not change the choice probabilities, so that they depend only on $v(x)=(v_2(x),\dots,v_J(x))'$.
The renewal nature of $j=1$ leads to a specific formula for $v_j(x)$ in terms of the per period
utilities $u_j(x,\beta_0)$ and the choice probabilities $P(x)=(P_1(x),\dots,P_J(x))'$. As in Hotz
and Miller (1993), there is a function $P^{-1}(\cdot)$ such that $v=P^{-1}(P)$. Let $H(P)$ denote the
function such that
$$H(P)=E[\max_{1\le j\le J}\{P^{-1}_j(P)+\epsilon_j\}\,|\,P]=E[\max_{1\le j\le J}\{v_j+\epsilon_j\}\,|\,v].$$
For example, for multinomial logit $H(P)=.5772-\ln(P_1)$. Note that by $j=1$ being a renewal
we have $E[V(x_{t+1})|x_t,1]=\mu$ for a constant $\mu$, so that
$$V(x)=\bar{v}_1(x)+H(P(x))=u_1(x,\beta_0)+\delta\mu+H(P(x)).$$
It then follows that
$$\bar{v}_j(x)=u_j(x,\beta_0)+\delta E[V(x_{t+1})|x_t=x,j]=u_j(x,\beta_0)+\delta E[u_1(x_{t+1},\beta_0)+H(P(x_{t+1}))|x_t=x,j]+\delta^2\mu \quad (j=1,\dots,J).$$
Subtracting then gives
$$v_j(x)=u_j(x,\beta_0)-u_1(x,\beta_0)+\delta\big\{E[u_1(x_{t+1},\beta_0)+H(P(x_{t+1}))|x_t=x,j]-E[u_1(x_{t+1},\beta_0)+H(P(x_{t+1}))|1]\big\}. \qquad (3.2)$$
This expression for the choice specific value function $v_j(x)$ depends only on $u_j(x,\beta_0)$ $(j=1,\dots,J)$, $H(P(x_{t+1}))$, and
conditional expectations given the state and choice, and so can be used to form semiparametric
moment functions.
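For the multinomial logit example above, both the Hotz and Miller (1993) inversion $P^{-1}$ and the function $H$ have closed forms, which the following sketch verifies numerically (the specific value vector is an arbitrary illustrative choice):

```python
import numpy as np

# Multinomial logit: choice probabilities from value differences v (with
# v_1 = 0 by the renewal normalization) are softmax, and the Hotz-Miller
# inversion has closed form P^{-1}(P)_j = ln(P_j) - ln(P_1), while
# H(P) = 0.5772... - ln(P_1) (Euler's constant).
euler = 0.5772156649

def choice_probs(v):
    """P_j = exp(v_j) / sum_k exp(v_k), with v_1 = 0 stored first."""
    e = np.exp(v - v.max())
    return e / e.sum()

def hotz_miller_invert(P):
    """Recover value differences v_j - v_1 from choice probabilities."""
    return np.log(P) - np.log(P[0])

v = np.array([0.0, 0.7, -0.3, 1.2])
P = choice_probs(v)
print(np.allclose(hotz_miller_invert(P), v))   # the inversion round-trips

# E[max_j {v_j + eps_j}] for extreme value eps is euler + log-sum-exp of v,
# which equals H(P) = euler - ln(P_1) here since v_1 = 0.
H = euler - np.log(P[0])
assert np.isclose(H, euler + np.log(np.exp(v).sum()))
```

Outside the logit case $P^{-1}$ and $H$ generally have no closed form and must be computed numerically from the assumed CDF of the disturbances.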
To describe those moment functions let $\gamma_1(x)$ denote the vector of possible values of the
choice probabilities $E[y_t|x_t=x]$, where $y_t=(y_{t1},\dots,y_{tJ})'$. Also let $\gamma_j(x,\beta,\gamma_1)$ $(j=2,\dots,J)$
denote a possible value of $E[u_1(x_{t+1},\beta)+H(\gamma_1(x_{t+1}))|x_t=x,j]$ as a function of $x$, $\beta$, and $\gamma_1$, and $\gamma_{J+1}(\beta,\gamma_1)$
a possible value of $E[u_1(x_{t+1},\beta)+H(\gamma_1(x_{t+1}))|1]$. Then a possible value of $v_j(x)$ is given by
$$v_j(x,\beta,\gamma)=u_j(x,\beta)-u_1(x,\beta)+\delta[\gamma_j(x,\beta,\gamma_1)-\gamma_{J+1}(\beta,\gamma_1)] \quad (j=2,\dots,J).$$
These value function differences are semiparametric, depending on the function $\gamma_1$ of choice probabilities and the conditional expectations $\gamma_j$ $(j=2,\dots,J)$. Let $v(x,\beta,\gamma)=(v_2(x,\beta,\gamma),\dots,v_J(x,\beta,\gamma))'$
and $A(x)$ denote a matrix of functions of $x$ with $J$ columns. Semiparametric moment functions
are given by
$$g(w,\beta,\gamma)=A(x)[y-P(v(x,\beta,\gamma))].$$
LR moment functions can be constructed by adding the adjustment term for the presence of
the first step $\gamma$. This adjustment term is derived in Appendix A. It takes the form
$$\phi(w,\beta,\gamma,\lambda)=\sum_{j=1}^{J+1}\phi_j(w,\beta,\gamma,\lambda),$$
where $\phi_j(w,\beta,\gamma,\lambda)$ is the adjustment term for $\gamma_j$ holding all other components of $\gamma$ fixed at their
true values. To describe it define
$$P_{vj}(x)=\frac{\partial P(v(x,\beta_0,\gamma_0))}{\partial v_j}, \qquad \pi_1=\Pr(y_{t1}=1), \qquad \lambda_{10}(x)=E[y_{t1}|x_{t+1}=x], \qquad (3.3)$$
$$\lambda_{j0}(x)=E[A(x_t)P_{vj}(x_t)y_{tj}\,|\,x_{t+1}=x] \quad (j=2,\dots,J).$$
Then, writing $\tilde{x}=x_{t+1}$ and $\lambda=(\lambda_1,\dots,\lambda_J)$, let
$$\phi_1(w,\beta,\gamma,\lambda)=-\delta\Big(\sum_{j=2}^{J}\big\{\lambda_j(x)-E[A(x)P_{vj}(x)]\pi_1^{-1}\lambda_1(x)\big\}\Big)\Big[\frac{\partial H(\gamma_1(x))}{\partial P}\Big]'[y-\gamma_1(x)],$$
$$\phi_j(w,\beta,\gamma,\lambda)=-\delta A(x)\frac{\partial P(v(x,\beta,\gamma))}{\partial v_j}\frac{y_j}{P_j(v(x,\beta,\gamma))}\big\{u_1(\tilde{x},\beta)+H(\gamma_1(\tilde{x}))-\gamma_j(x,\beta,\gamma_1)\big\} \quad (j=2,\dots,J),$$
$$\phi_{J+1}(w,\beta,\gamma,\lambda)=\delta\Big(\sum_{j=2}^{J}E\Big[A(x)\frac{\partial P(v(x,\beta,\gamma))}{\partial v_j}\Big]\Big)\pi_1^{-1}y_1\big\{u_1(\tilde{x},\beta)+H(\gamma_1(\tilde{x}))-\gamma_{J+1}(\beta,\gamma_1)\big\}.$$
Theorem 2: If the marginal distribution of $x_t$ does not vary with $t$, then LR moment
functions for the dynamic discrete choice model are
$$\psi(w,\beta,\gamma,\lambda)=A(x)[y-P(v(x,\beta,\gamma))]+\sum_{j=1}^{J+1}\phi_j(w,\beta,\gamma,\lambda).$$
The form of φ(w, θ, γ, λ) is amenable to machine learning. A machine learning estimator of the conditional choice probability vector γ_10(X) is straightforward to compute and can then be used throughout the construction of the orthogonal moment conditions everywhere γ_1 appears. If u_1(X, θ) is linear in θ, say u_1(X, θ) = X_1′θ_1 for subvectors X_1 and θ_1 of X and θ respectively, then machine learning estimators can be used to obtain Ê[X_{1,t+1} | X_t, y_tj = 1] and Ê[H(γ̂_1(X_{t+1})) | X_t, y_tj = 1], (j = 2, ..., J), and a sample average used to form γ̂_{J+1}(γ̂_1). The value function differences can then be estimated as
ṽ_j(X, γ̂, θ) = u_j(X, θ) − u_1(X, θ) + δ{Ê[X_{1,t+1} | X, y_tj = 1]′θ_1 − Ê[X_{1,t+1} | y_t1 = 1]′θ_1 + Ê[H(γ̂_1(X_{t+1})) | X, y_tj = 1] − Ê[H(γ̂_1(X_{t+1})) | y_t1 = 1]}.
Furthermore, denominator problems can be avoided by using structural probabilities (rather
than the machine learning estimators) in all denominator terms.
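As a small illustration of this plug-in construction for a binary choice (J = 2), the sketch below assembles ṽ_2 from first-step regression functions supplied as callables. All functions and parameter values here are hypothetical stand-ins for machine learning fits, not the paper's estimator:

```python
import numpy as np

def vtilde_2(x, th1, u2, ex1_keep, eh_keep, ex1_rep, eh_rep, delta=0.9):
    """v~_2(x) = u_2 - u_1(x) + delta*{(E^[x_{1,t+1}|x,2] - E^[x_{1,t+1}|1]) th1
                                       + E^[H|x,2] - E^[H|1]}
    with u_1(x, th) = x*th1 for a scalar state and u_2 constant.
    ex1_keep/eh_keep are callables standing in for the state-and-choice
    conditional expectation fits; ex1_rep/eh_rep are the renewal-choice
    sample averages. All inputs are illustrative assumptions."""
    u1 = x * th1
    return u2 - u1 + delta * ((ex1_keep(x) - ex1_rep) * th1 + eh_keep(x) - eh_rep)
```

Because ṽ_2 is linear in the first-step objects, the machine learning fits enter only through their predicted values, as the text notes.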
The challenging part of the machine learning for this estimator is the dependence on θ of the reverse conditional expectations in λ_j0(X). It may be computationally prohibitive and possibly unstable to redo machine learning for each θ. One way to deal with this complication is to update θ periodically, with more frequent updates near convergence. It is important that at convergence the θ in the reverse conditional expectations is the same as the θ that appears elsewhere.
With data that is i.i.d. over individuals these moment functions can be used for any t to estimate the structural parameters θ. Also, for data on a single individual we could use a time average Σ_{t=1}^{T−1} ψ(w_t, θ, γ, λ)/(T − 1) to estimate θ. It will be just as important to use LR moments for estimation with a single individual as it is with a cross section of individuals, although our asymptotic theory will not apply to that case.
Bajari, Chernozhukov, Hong, and Nekipelov (2009) derived the influence adjustment for
dynamic discrete games of imperfect information. Locally robust moment conditions for such
games could be formed using their results. We leave that formulation to future work.
As an example of the finite sample performance of the LR GMM we report a Monte Carlo
study of the LR estimator of this Section. The design of the experiment is loosely like the
bus replacement application of Rust (1987). Here X_t is a state variable meant to represent the lifetime of a bus engine. The transition density is
X_{t+1} = X_t + N(.25, 1)², y_t = 1,
X_{t+1} = 1 + N(.25, 1)², y_t = 0,
where y_t = 0 corresponds to replacement of the bus engine and y_t = 1 to nonreplacement. We assume that the agent chooses y_t contingent on the state to maximize
E[Σ_{t=1}^∞ δ^{t−1}{y_t(θ_1√X_t + ε_{1t}) + (1 − y_t)(RC + ε_{0t})}], θ_1 = −.3, RC = −4.
The unconditional probability of replacement in this model is about .18, which is substantially higher than that estimated in Rust (1987). The sample used for estimation was 1000 observations for a single decision maker. We carried out 10,000 replications.
We estimate the conditional choice probabilities by kernel and series nonparametric regression and by logit lasso, random forest, and boosted tree machine learning methods. Logit conditional choice probabilities and derivatives were used in the construction of λ̂ wherever they appear in order to avoid denominator issues. The unknown conditional expectations in the γ̂_j were estimated by series regressions throughout. Kernel regression was also tried but did not work particularly well and so results are not reported.
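The data-generating mechanics of this design can be simulated directly. The sketch below draws a single decision maker's path from the transition law above, with a myopic logit choice rule standing in for the full dynamic-programming solution, so it illustrates the mechanics rather than reproducing the exact experiment; the parameter values are placeholders:

```python
import numpy as np

def simulate(T=1000, th1=-0.3, rc=-4.0, seed=0):
    """Simulate the bus-engine design: keep (y=1) gives x' = x + N(.25,1)^2,
    replace (y=0) resets to x' = 1 + N(.25,1)^2. The choice is a myopic
    logit comparison of th1*sqrt(x) and rc, a simplifying assumption."""
    rng = np.random.default_rng(seed)
    x, xs, ys = 1.0, [], []
    for _ in range(T):
        v_keep, v_replace = th1 * np.sqrt(x), rc
        p_keep = 1.0 / (1.0 + np.exp(v_replace - v_keep))
        y = int(rng.uniform() < p_keep)
        xs.append(x)
        ys.append(y)
        shock = rng.normal(0.25, 1.0) ** 2
        x = x + shock if y == 1 else 1.0 + shock
    return np.array(xs), np.array(ys)
```

The squared-normal shock keeps the engine lifetime state weakly above one, and replacement resets it, mirroring the transition density displayed above.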
Table 1 reports the results of the experiment. Bias, standard deviation, and coverage prob-
ability for asymptotic 95 percent confidence intervals are given in Table 1.
Table 1
LR CCP Estimators, Dynamic Discrete Choice

                    Bias         Std Dev      95% Cov
                  RC    θ1     RC     θ1     RC    θ1
Two step kernel  -.24   .08   .08    .32    .01   .86
LR kernel        -.05   .02   .06    .32    .95   .92
Two step quad    -.00   .14   .049   .33∗   .91   .89
LR quad          -.00   .01   .085   .39    .95   .92
Logit Lasso      -.12   .25   .06    .28    .74   .84
LR Logit Lasso   -.09   .01   .08    .36    .93   .95
Random Forest    -.15  -.44   .09    .50    .91   .98
LR Ran. For.      .00   .00   .06    .44   1.0    .98
Boosted Trees    -.10  -.28   .08    .50    .99   .99
LR Boost Tr.      .03   .09   .07    .47    .99   .97
Here we find bias reduction from the LR estimator in all cases. We also find variance
reduction from LR estimation when the first step is kernel estimation, random forests, and
boosted trees. The LR estimator also leads to actual coverage of confidence intervals being
closer to the nominal coverage. The results for random forests and boosted trees seem noisier
than the others, with higher standard deviations and confidence interval coverage probabilities
farther from nominal. Overall, we find substantial improvements from using LR moments rather
than only the identifying, original moments.
4 Estimating the Influence Adjustment
Construction of LR moment functions requires an estimator φ̂(w, θ) of the adjustment term.
The form of φ(w, θ, γ, λ) is known for some cases from the semiparametric estimation literature.
Powell, Stock, and Stoker (1989) derived the adjustment term for density weighted average
derivatives. Newey (1994a) gave the adjustment term for mean square projections (including
conditional expectations), densities, and their derivatives. Hahn (1998) and Hirano, Imbens,
and Ridder (2003) used those results to obtain the adjustment term for treatment effect esti-
mators, where the LR estimator will be the doubly robust estimator of Robins, Rotnitzky, and
Zhao (1994, 1995). Bajari, Hong, Krainer, and Nekipelov (2010) and Bajari, Chernozhukov,
Hong, and Nekipelov (2009) derived adjustment terms in some game models. Hahn and Ridder
(2013, 2016) derived adjustments in models with generated regressors including control func-
tions. These prior results can be used to obtain LR estimators by adding the adjustment term
with nonparametric estimators plugged in.
For new cases it may be necessary to derive the form of the adjustment term. Also, it is
possible to numerically estimate the adjustment term based on series estimators and other nonparametric estimators. In this Section we describe how to construct estimators of the adjustment term in these ways.
4.1 Deriving the Formula for the Adjustment Term
One approach to estimating the adjustment term is to derive a formula for φ(w, θ, γ, λ) and then plug in γ̂ and λ̂ in that formula. A formula for φ(w, θ, γ, λ) can be obtained as in Newey (1994a). Let γ(F) be the limit of the nonparametric estimator γ̂ when w has distribution F. Also, let F_τ denote a regular parametric model of distributions with F_τ = F_0 at τ = 0 and score (derivative of the log likelihood at τ = 0) equal to S(w). Then under certain regularity conditions φ(w, θ_0, γ_0, λ_0) will be the unique solution to
(d/dτ) ∫ g(w, θ_0, γ(F_τ)) F_0(dw) |_{τ=0} = E[φ(w, θ_0, γ_0, λ_0) S(w)], E[φ(w, θ_0, γ_0, λ_0)] = 0, (4.1)
as F_τ and the corresponding score S(w) are allowed to vary over a family of parametric models where the set of scores for the family has mean square closure that includes all mean zero functions with finite variance. Equation (4.1) is a functional equation that can be solved to find the adjustment term, as was done in many of the papers cited in the previous paragraph.
The influence adjustment can be calculated by taking a limit of the Gateaux derivative as shown in Ichimura and Newey (2017). Let γ(F) be the limit of γ̂ when F is the true distribution of w, as before. Let G_h^w be a family of distributions that approaches a point mass at w as h → 0. If φ(w, θ_0, γ_0, λ_0) is continuous in w with probability one then
φ(w, θ_0, γ_0, λ_0) = lim_{h→0} ( (d/dτ) E[g(w̃, θ_0, γ(F_τ^h))] |_{τ=0} ), F_τ^h = (1 − τ)F_0 + τG_h^w. (4.2)
This calculation is more constructive than equation (4.1) in the sense that the adjustment term
here is a limit of a derivative rather than the solution to a functional equation. In Sections 5
and 6 we use those results to construct LR estimators when the first step is a nonparametric
instrumental variables (NPIV) estimator.
With a formula for φ(w, θ, γ, λ) in hand from either solving the functional equation in equation (4.1) or from calculating the limit of the derivative in equation (4.2), one can estimate the adjustment term by plugging estimators γ̂ and λ̂ into φ(w, θ, γ, λ). This approach to estimating LR moments can be used to construct LR moments for the average surplus described near the end of Section 2. There the adjustment term depends on the conditional density of p_1 given p_2 and y. Let f̂(p_1 | p_2, y) be some estimator of the conditional pdf of p_1 given p_2 and y. Plugging that estimator into the formula for λ_0(x) gives λ̂(x) = ω(p_1, y)/f̂(p_1 | p_2, y). This λ̂(x) can then be used in equation (2.10).
4.2 Estimating the Influence Adjustment for First Step Series Estimators
Estimating the adjustment term is relatively straightforward when the first step is a series
estimator. The adjustment term can be estimated by treating the first step estimator as if it
were parametric and applying a standard formula for the adjustment term for parametric two-
step estimators. Suppose that γ̂ depends on the data through a K × 1 vector β̂ of parameter estimators that has true value β_0. Let γ(·, β) denote γ as a function of β. Suppose that there is a K × 1 vector of functions m(w, β) such that β̂ satisfies
(1/√n̄) Σ_{i∈Ī} m(w_i, β̂) = o_p(1), (4.3)
where Ī is a subset of observations, none of which are included in I, and n̄ is the number of observations in Ī. Then a standard calculation for parametric two-step estimators (e.g. Newey, 1984, and Murphy and Topel, 1985) gives the parametric adjustment term
φ(w, θ, β̂, Ψ̂) = Ψ̂(θ) m(w, β̂), Ψ̂(θ) = −(Σ_{i∈Ī} ∂g(w_i, θ, γ(·, β̂))/∂β′)(Σ_{i∈Ī} ∂m(w_i, β̂)/∂β′)^{−1}, i ∈ I.
In many cases φ(w, θ, β̂, Ψ̂) approximates the true adjustment term φ(w, θ_0, γ_0, λ_0), as shown by Newey (1994a, 1997) and Ackerberg, Chen, and Hahn (2012) for estimating the asymptotic variance of functions of series estimators. Here this approximation is used for estimation of θ instead of just for variance estimation. The estimated LR moment function will be
ψ(w, θ, β̂, Ψ̂) = g(w, θ, γ(·, β̂)) + φ(w, θ, β̂, Ψ̂). (4.4)
We note that if β̂ were computed from the whole sample then Σ_{i=1}^n φ(w_i, θ, β̂, Ψ̂) = 0. This degeneracy does not occur when cross-fitting is used, which removes "own observation" bias and is important for first step machine learning estimators, as noted in Section 2.
We can apply this approach to construct LR moment functions for an estimator of the average surplus bound example that is based on series regression. Here the first step estimator of γ_0(x) = E[q | x], for quantity q, will be that from an ordinary least squares regression of q on a vector p(x) of approximating functions. The corresponding g(w, θ, β) and m(w, β) are
g(w, θ, β) = A(y)′β − θ, m(w, β) = p(x)[q − p(x)′β], A(y) = ∫ ω(p_1, y) p(p_1, p̄_2, y) dp_1.
Let β̂ denote the least squares coefficients from regressing q_i on p(x_i) for observations that are
not included in I. Then the estimator of the locally robust moments given in equation (4.4) is
ψ(w, θ, β̂, Ψ̂) = A(y)′β̂ − θ + Ψ̂ p(x)[q − p(x)′β̂], Ψ̂ = (Σ_{i∈Ī} A(y_i)′)(Σ_{i∈Ī} p(x_i)p(x_i)′)^{−1}.
It can be shown similarly to Newey (1994a, p. 1369) that Ψ̂ estimates the population least squares coefficients from a regression of λ_0(x) on p(x), so that λ̂(x) = Ψ̂ p(x) estimates λ_0(x). In comparison the LR estimator described in the previous subsection was based on an explicit nonparametric estimator of f_0(p_1 | p_2, y), while this λ̂(x) implicitly estimates the inverse of that pdf via a mean-square approximation of λ_0(x) by Ψ̂ p(x).
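The claim that Ψ̂p(x) estimates the least squares projection of λ_0(x) on p(x) can be checked numerically. In the sketch below (our construction, not the paper's), each A_i is built so that Σ_i A_i′ estimates E[λ_0(x)p(x)′]; when λ_0 lies in the span of the approximating functions, the fitted Ψ̂p(x) reproduces λ_0 exactly:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
x = rng.uniform(-1, 1, n)
P = np.column_stack([np.ones(n), x])   # approximating functions p(x) = (1, x)'
lam0 = 2.0 - 3.0 * x                   # a "true" representer, in the span of p

# A_i plays the role of the derivative of g with respect to beta for
# observation i; here it is built so that sum_i A_i' = P' lam0, i.e. it
# estimates E[lam0(x) p(x)'] (an illustrative assumption).
A = lam0[:, None] * P
Psi = np.linalg.solve(P.T @ P, A.sum(axis=0))   # (sum p p')^{-1} (sum A')
lam_hat = P @ Psi                               # Psi' p(x_i)
```

Because least squares projection leaves functions in the span of p(x) unchanged, lam_hat equals lam0 up to floating point error here; with λ_0 outside the span, Ψ̂p(x) would instead give its mean-square approximation.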
Chernozhukov, Newey, and Robins (2018) introduce machine learning methods for choosing the functions to include in the vector p(x). This method can be combined with machine learning methods for estimating E[q | x] to construct a double machine learning estimator of average surplus, as shown in Chernozhukov, Hausman, and Newey (2018).
In parametric models moment functions like those in equation (4.4) are used to "partial out" nuisance parameters β. For maximum likelihood these moment functions are the basis of Neyman's (1959) C-alpha test. Wooldridge (1991) generalized such moment conditions to nonlinear least squares and Lee (2005), Bera et al. (2010), and Chernozhukov et al. (2015) to GMM. What is novel here is their use in the construction of semiparametric estimators and the interpretation of the estimated LR moment functions ψ(w, θ, β̂, Ψ̂) as the sum of an original moment function g(w, θ, γ(·, β̂)) and an influence adjustment φ(w, θ, β̂, Ψ̂).
4.3 Estimating the Influence Adjustment with First Step Smoothing
The adjustment term can be estimated in a general way that allows for kernel density, locally
linear regression, and other kernel smoothing estimators for the first step. The idea is to differentiate with respect to the effect of the i-th observation on sample moments. Newey (1994b) used
a special case of this approach to estimate the asymptotic variance of a functional of a kernel
based semiparametric or nonparametric estimator. Here we extend this method to a wider class
of first step estimators, such as locally linear regression, and apply it to estimate the adjustment
term for construction of LR moments.
We will describe this estimator for the case where γ̂ is a vector of functions of a vector of variables z. Let m(w, z, γ) be a vector of functions of a data observation w, z, and a possible realized value γ of γ_0(z) (i.e. a vector of real numbers). Also let M̂(z, γ) = Σ_{i∈Ī} m(w_i, z, γ)/n̄ be a sample average over a set Ī of observations not included in I, where n̄ is the number of observations in Ī. We assume that the first step estimator γ̂(z) solves
0 = M̂(z, γ).
We suppress the dependence of m and M̂ on a bandwidth. For example, for a pdf γ_0(z) a kernel density estimator would correspond to m(w, z, γ) = K_h(z − w) − γ, and a locally linear regression would be γ̂_1(z) for
m(w, z, γ) = K_h(x − z) (1, (x − z)′)′ [y − γ_1 − (x − z)′γ_2],
where w = (y, x′)′ and γ = (γ_1, γ_2′)′.
To measure the effect of the i-th observation on γ̂ let γ̂_i(ε) be the solution to
0 = M̂(z, γ) + ε · m(w_i, z, γ).
This γ̂_i(ε) is the value of the function obtained from adding the contribution ε · m(w_i, z, γ) of the i-th observation. An estimator of the adjustment term can be obtained by differentiating the average of the original moment function with respect to ε at ε = 0. This procedure leads to an estimated locally robust moment function given by
ψ(w_i, θ) = g(w_i, θ, γ̂) + (1/n̄) Σ_{j∈Ī} (d/dε) g(w_j, θ, γ̂_i(ε)) |_{ε=0}.
This estimator is a generalization of the influence function estimator for kernels in Newey
(1994b).
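For a kernel density first step the perturbed solution γ̂_i(ε) is available in closed form, which makes the construction easy to check numerically. The sketch below (our illustration, with hypothetical values) verifies that the derivative at ε = 0 equals K_h(z − w_i) − f̂(z):

```python
import numpy as np

def Kh(u, h=0.3):
    """Gaussian kernel with bandwidth h."""
    return np.exp(-0.5 * (u / h) ** 2) / (h * np.sqrt(2 * np.pi))

rng = np.random.default_rng(2)
w = rng.normal(size=200)      # observations forming the first step
z, wi = 0.1, 0.7              # evaluation point and the "added" observation

# With m(w, z, gamma) = K_h(z - w) - gamma, the equation
# 0 = M(z, gamma) + eps * m(w_i, z, gamma) solves to
# gamma_i(eps) = (fhat(z) + eps * K_h(z - w_i)) / (1 + eps).
fhat = Kh(z - w).mean()
gamma_i = lambda eps: (fhat + eps * Kh(z - wi)) / (1 + eps)

eps = 1e-6
numeric = (gamma_i(eps) - gamma_i(-eps)) / (2 * eps)   # finite difference
analytic = Kh(z - wi) - fhat                           # closed-form derivative
```

In practice the derivative of g(w_j, θ, γ̂_i(ε)) would be taken by the same kind of finite difference when no closed form is available.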
5 Double and Partial Robustness
The zero derivative condition in equation (2.5) is an appealing robustness property in and of
itself. A zero derivative means that the expected moment functions remain closer to zero than τ as τ varies away from zero. This property can be interpreted as local insensitivity of the moments to the value of γ being plugged in, with the moments remaining close to zero as γ varies away from its true value. Because it is difficult to get nonparametric functions exactly right, especially in high dimensional settings, this property is an appealing one.
Such robustness considerations, well explained in Robins and Rotnitzky (2001), have moti-
vated the development of doubly robust (DR) moment conditions. DR moment conditions have
expectation zero if one first stage component is incorrect. DR moment conditions allow two
chances for the moment conditions to hold, an appealing robustness feature. Also, DR moment
conditions have simpler conditions for asymptotic normality than general LR moment functions
as discussed in Section 7. Because many interesting LR moment conditions are also DR we
consider double robustness.
LR moments that are constructed by adding the adjustment term for first step estimation
provide candidates for DR moment functions. The derivative of the expected moments with
respect to each first step will be zero, a necessary condition for DR. The condition for moments
constructed in this way to be DR is the following:
Assumption 1: There are sets Γ and Λ such that for all γ ∈ Γ and λ ∈ Λ,
E[φ(w, θ_0, γ, λ_0)] = −E[g(w, θ_0, γ)], E[φ(w, θ_0, γ_0, λ)] = 0.
This condition is just the definition of DR for the moment function ψ(w, θ, γ, λ) = g(w, θ, γ) + φ(w, θ, γ, λ), pertaining to the specific sets Γ and Λ.
The construction of adding the adjustment term to an identifying or original moment function leads to several novel classes of DR moment conditions. One such class has a first step that satisfies a conditional moment restriction
E[Y − γ_0(X) | Z] = 0, (5.1)
where X is potentially endogenous and Z is a vector of instrumental variables. This condition is the nonparametric instrumental variable (NPIV) restriction as in Newey and Powell (1989, 2003) and Newey (1991). A first step conditional expectation where γ_0(X) = E[Y | X] is included as a special case with Z = X. Ichimura and Newey (2017) showed that the adjustment term for this step takes the form φ(w, γ, λ) = λ(Z)[Y − γ(X)], so g(w, θ, γ) + λ(Z)[Y − γ(X)] is a candidate for a DR moment function. A sufficient condition for DR is:
Assumption 2: i) Equation (5.1) is satisfied; ii) Λ = {λ(Z) : E[λ(Z)²] < ∞} and Γ = {γ(X) : E[γ(X)²] < ∞}; iii) there is v(X) with E[v(X)²] < ∞ such that E[g(w, θ_0, γ)] = E[v(X){γ(X) − γ_0(X)}] for all γ ∈ Γ; iv) there is λ_0(Z) such that v(X) = E[λ_0(Z) | X]; and v) E[Y²] < ∞.
By the Riesz representation theorem condition iii) is necessary and sufficient for E[g(w, θ_0, γ)] to be a mean square continuous functional of γ with representer v(X). Condition iv) is an additional condition giving continuity in the reduced form difference E[γ(X) − γ_0(X) | Z], as further discussed in Ichimura and Newey (2017). Under this condition
E[g(w, θ_0, γ)] = E[E[λ_0(Z) | X]{γ(X) − γ_0(X)}] = E[λ_0(Z){γ(X) − γ_0(X)}] = −E[φ(w, γ, λ_0)], E[φ(w, γ_0, λ)] = E[λ(Z){Y − γ_0(X)}] = 0.
Thus Assumption 2 implies Assumption 1 so that we have
Theorem 3: If Assumption 2 is satisfied then g(w, θ, γ) + λ(Z)[Y − γ(X)] is doubly robust.
There are many interesting, novel examples of DR moment conditions that are special cases of Theorem 3. The average surplus bound is an example where X = Z = x is the observed vector of prices and income, Λ = Γ is the set of all measurable functions of x with finite second moment, and γ_0(x) = E[q | x] for quantity q. Let p_1 denote the first price and p_2 the vector of other prices and income, so that x = (p_1, p_2′)′. Also let f_0(p_1 | p_2) denote the conditional pdf of p_1 given p_2 and ω(x) = ω(p_1, y) for income y. Let g(w, θ, γ) = ∫ ω(p_1, p_2)γ(p_1, p_2)dp_1 − θ as before. Multiplying and dividing through by f_0(p_1 | p_2) gives, for all γ ∈ Γ and λ_0(x) = f_0(p_1 | p_2)^{−1}ω(x),
E[g(w, θ_0, γ)] = E[∫ ω(p_1, p_2)γ(p_1, p_2)dp_1] − θ_0 = E[E[λ_0(x)γ(x) | p_2]] − θ_0 = E[λ_0(x){γ(x) − γ_0(x)}].
Theorem 3 then implies that the LR moment function for average surplus g(w, θ, γ) + λ(x)[q − γ(x)] is DR. A corresponding DR estimator is given in equation (2.10).
The surplus bound is an example of a parameter where θ_0 = E[m(w, γ_0)] for some linear functional m(w, γ) of γ and for γ_0 satisfying the conditional moment restriction of equation (5.1). For the surplus bound m(w, γ) = ∫ ω(p_1, p_2)γ(p_1, p_2)dp_1. If Assumption 2 is satisfied then choosing g(w, θ, γ) = m(w, γ) − θ, a DR moment condition is m(w, γ) − θ + λ(Z)[Y − γ(X)]. A corresponding DR estimator is
θ̂ = (1/n) Σ_{i=1}^n {m(w_i, γ̂) + λ̂(Z_i)[Y_i − γ̂(X_i)]}, (5.2)
where γ̂(X) and λ̂(Z) are estimators of γ_0(X) and λ_0(Z) respectively. An estimator γ̂ can be constructed by nonparametric regression when Z = X or NPIV in general. A series estimator λ̂(Z) can be constructed similarly to the surplus bound example in Section 4.2. For Z = X Newey and Robins (2017) give such series estimators of λ̂(Z) and Chernozhukov, Newey, and Robins (2018) show how to choose the approximating functions for λ̂(Z) by machine learning. Simple and general conditions for root-n consistency and asymptotic normality of θ̂ that allow for machine learning are given in Section 7.
Novel examples of the DR estimator in equation (5.2) with Z = X are given by Newey and Robins (2017) and Chernozhukov, Newey, and Robins (2018). Also Appendix C provides a generalization to γ̂ and λ̂ that satisfy orthogonality conditions more general than conditional moment restrictions and novel examples of those. A novel example with Z ≠ X is a weighted average derivative of γ_0(X) satisfying equation (5.1). Here m(w, γ) = ω̄(X)∂γ(X)/∂x for some weight function ω̄(x). Let f_0(x) be the pdf of X and v(x) = −f_0(x)^{−1}∂[ω̄(x)f_0(x)]/∂x, assuming that derivatives exist. Assume that ω̄(x)γ(x)f_0(x) is zero on the boundary of the support of X. Integration by parts then gives Assumption 2 iii). Assume also that there exists λ_0 ∈ Λ with v(X) = E[λ_0(Z) | X]. Then for estimators γ̂ and λ̂ a DR estimator of the weighted average derivative is
θ̂ = (1/n) Σ_{i=1}^n {ω̄(X_i)∂γ̂(X_i)/∂x + λ̂(Z_i)[Y_i − γ̂(X_i)]}.
This is a DR version of the weighted average derivative estimator of Ai and Chen (2007). A special case of this example is the DR moment condition for the weighted average derivative in the exogenous case where Z = X given in Firpo and Rothe (2017).
Theorem 3 includes existing DR moment functions as special cases where Z = X, including the mean with randomly missing data given by Robins and Rotnitzky (1995), the class of DR estimators in Robins et al. (2008), and the DR estimators of Firpo and Rothe (2017). We illustrate for the mean with missing data. Let Z = X = (D, V′)′ for an observed data indicator D ∈ {0, 1} and covariates V, g(w, θ, γ) = γ(1, V) − θ, and π_0(V) = Pr(D = 1 | V). Here it is well known that
E[g(w, θ_0, γ)] = E[γ(1, V)] − θ_0 = E[λ_0(X){γ(X) − γ_0(X)}] = −E[λ_0(X){Y − γ(X)}], λ_0(X) = D/π_0(V).
Then DR of the moment function γ(1, V) − θ + [D/π(V)][Y − γ(X)] of Robins and Rotnitzky (1995) follows by Theorem 3.
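The double robustness of this moment function is easy to see in a toy simulation: the estimated mean stays consistent when either the regression γ or the propensity score π is deliberately misspecified. The sketch below (ours, with simple parametric stand-ins for the nonparametric first steps) illustrates both cases:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20000
V = rng.uniform(-1, 1, n)
pi0 = 1 / (1 + np.exp(-V))                   # true propensity Pr(D = 1 | V)
D = (rng.uniform(size=n) < pi0).astype(float)
Y = 1.0 + 2.0 * V + rng.normal(size=n)       # observed only when D = 1
theta0 = 1.0                                 # E[Y], since E[V] = 0

def dr_mean(gamma_hat, pi_hat):
    """Sample DR moment: gamma(1,V) + [D/pi(V)][Y - gamma(X)], averaged."""
    return np.mean(gamma_hat + D / pi_hat * (Y - gamma_hat))

gamma_true = 1.0 + 2.0 * V
# correct regression, badly misspecified propensity: still consistent
est_wrong_pi = dr_mean(gamma_true, np.full(n, 0.9))
# correct propensity, badly misspecified regression: still consistent
est_wrong_gamma = dr_mean(np.zeros(n), pi0)
```

Either first step being correct is enough for the moment to have mean zero at θ_0, which is exactly the two-chances property described above.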
Another novel class of DR moment conditions are those where the first step is a pdf of a function Z of the data observation w. By Proposition 5 of Newey (1994a), the adjustment term for such a first step is φ(w, γ, λ) = λ(Z) − ∫ λ(z)γ(z)dz for some possible λ. A sufficient condition for DR as in Assumption 1 is:
Assumption 3: Z has pdf γ_0(z) and for Γ = {γ : γ(z) ≥ 0, ∫ γ(z)dz = 1} there is λ_0(z) such that for all γ ∈ Γ,
E[g(w, θ_0, γ)] = ∫ λ_0(z){γ(z) − γ_0(z)}dz.
Note that for φ(w, γ, λ) = λ(Z) − ∫ λ(z)γ(z)dz it follows from Assumption 3 that E[φ(w, γ, λ_0)] = ∫ λ_0(z){γ_0(z) − γ(z)}dz = −E[g(w, θ_0, γ)] for all γ ∈ Γ. Also, E[φ(w, γ_0, λ)] = E[λ(Z)] − ∫ λ(z)γ_0(z)dz = 0. Then Assumption 1 is satisfied so we have:
Theorem 4: If Assumption 3 is satisfied then g(w, θ, γ) + λ(Z) − ∫ λ(z)γ(z)dz is DR.
The integrated squared density θ_0 = ∫ γ_0(z)²dz is an example for g(w, θ, γ) = γ(Z) − θ, λ_0 = γ_0, and
ψ(w, θ, γ, λ) = γ(Z) − θ + λ(Z) − ∫ λ(z)γ(z)dz.
This DR moment function seems to be novel. Another example is the density weighted average derivative (DWAD) of Powell, Stock, and Stoker (1989), where g(w, θ, γ) = −2Y ∂γ(Z)/∂z − θ. Let h(z) = E[Y | Z = z]γ_0(z). Assuming that h(z)γ(z) is zero on the boundary and differentiable, integration by parts gives
E[g(w, θ_0, γ)] = −2E[Y ∂γ(Z)/∂z] − θ_0 = ∫ 2[∂h(z)/∂z]{γ(z) − γ_0(z)}dz,
so that Assumption 3 is satisfied with λ_0(z) = 2∂h(z)/∂z. Then by Theorem 4,
θ̂ = (1/n) Σ_{i=1}^n {−2Y_i ∂γ̂(Z_i)/∂z + λ̂(Z_i) − ∫ λ̂(z)γ̂(z)dz}
is a DR estimator. It was shown in NHR (1998) that the Powell, Stock, and Stoker (1989) estimator with a twicing kernel is numerically equal to a leave one out version of this estimator for the original (before twicing) kernel. Thus the DR result for θ̂ gives an interpretation of the
twicing kernel estimator as a DR estimator.
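The integrated squared density example above is simple enough to compute end to end. The sketch below (ours; no cross-fitting or bandwidth tuning) uses a kernel density estimate for both γ̂ and λ̂, so the DR estimator reduces to 2·(1/n)Σ_i f̂(z_i) − ∫ f̂(z)²dz:

```python
import numpy as np

rng = np.random.default_rng(4)
z = rng.normal(size=2000)        # true theta0 = Int phi(u)^2 du = 1/(2 sqrt(pi))
h = 0.3
grid = np.linspace(-6.0, 6.0, 2001)

def kde(points, data, h):
    """Gaussian kernel density estimate evaluated at `points`."""
    u = (points[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u ** 2).mean(axis=1) / (h * np.sqrt(2 * np.pi))

f_at_z = kde(z, z, h)                               # gamma^(z_i) = lambda^(z_i)
dz = grid[1] - grid[0]
int_f2 = (kde(grid, z, h) ** 2).sum() * dz          # Int lambda^ gamma^ (Riemann sum)
theta_dr = 2.0 * f_at_z.mean() - int_f2             # 2 E_n[f^(z)] - Int f^2
theta0 = 1.0 / (2.0 * np.sqrt(np.pi))
```

The correction term −∫ f̂² cancels the first-order smoothing bias of the plug-in ∫ f̂², which is why the estimate lands close to θ_0 even with a fairly crude bandwidth.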
The expectations of the DR moment functions of both Theorems 3 and 4 are affine in γ and λ, holding the other fixed at the truth. This property of DR moment functions is general, as we show by the following characterization of DR moment functions:
Theorem 5: If Γ and Λ are linear then ψ(w, θ, γ, λ) is DR if and only if
(d/dτ) E[ψ(w, θ_0, (1 − τ)γ_0 + τγ, λ_0)] |_{τ=0} = 0, (d/dτ) E[ψ(w, θ_0, γ_0, (1 − τ)λ_0 + τλ)] |_{τ=0} = 0,
and E[ψ(w, θ_0, γ, λ_0)] and E[ψ(w, θ_0, γ_0, λ)] are affine in γ and λ respectively.
The zero derivative condition of this result is a Gateaux derivative, componentwise version of LR. Thus, we can focus a search for DR moment conditions on those that are LR. Also, a DR moment function must have an expectation that is affine in each of γ and λ while the other is held fixed at the truth. It is sufficient for this condition that ψ(w, θ_0, γ, λ) be affine in each of γ and λ while the other is held fixed. This property can depend on how γ and λ are specified. For example the missing data DR moment function γ(1, V) − θ + π(V)^{−1}D[Y − γ(X)] is not affine in the propensity score π(V) = Pr(D = 1 | V) but is in λ(V) = π(V)^{−1}.
In general Theorem 5 motivates the construction of DR moment functions by adding the adjustment term to obtain an LR moment function that will then be DR if it is affine in γ and λ separately. It is interesting to note that in the NPIV setting of Theorem 3 and the density setting of Theorem 4 the adjustment term is always affine in γ and λ. It then follows from Theorem 5 that in those settings LR moment conditions are precisely those where E[g(w, θ_0, γ)] is affine in γ. Robins and Rotnitzky (2001) gave conditions for the existence of DR moment conditions in semiparametric models. Theorem 5 is complementary to those results in giving a complete characterization of DR moments when Γ and Λ are linear.
Assumptions 2 and 3 both specify that E[g(w, θ_0, γ)] is continuous in γ in an integrated squared deviation norm. These continuity conditions are linked to finiteness of the semiparametric variance bound for the functional E[g(w, θ_0, γ)], as discussed in Newey and McFadden (1994) for Assumption 2 with Z = X and for Assumption 3. For Assumption 2 with Z ≠ X Severini and Tripathi (2012) showed for g(w, θ, γ) = v(X)γ(X) − θ with known v(X) that the existence of λ_0(Z) with v(X) = E[λ_0(Z) | X] is necessary for the existence of a root-n consistent estimator of θ. Thus the conditions of Assumption 2 are also linked to necessary conditions for root-n consistent estimation when Z ≠ X.
Partial robustness refers to settings where E[g(w, θ_0, γ̄)] = 0 for some γ̄ ≠ γ_0. The novel DR moment conditions given here lead to novel partial robustness results, as we now demonstrate in the conditional moment restriction setting of Assumption 2. When λ_0(Z) in Assumption 2 is restricted in some way there may exist γ̄ ≠ γ_0 with E[λ_0(Z){Y − γ̄(X)}] = 0. Then
E[g(w, θ_0, γ̄)] = −E[λ_0(Z){Y − γ̄(X)}] = 0.
Consider the average derivative θ_0 = E[∂γ_0(X)/∂x], where g(w, θ, γ) = ∂γ(X)/∂x − θ. Let δ = (E[a(Z)b(X)′])^{−1}E[a(Z)Y] be the limit of the linear IV estimator with right hand side variables b(X) and the same number of instruments a(Z). The following is a partial robustness result that provides conditions for the average derivative of the linear IV estimator to equal the true average derivative:
Theorem 6: If −∂ ln f_0(x)/∂x = C b(x) for a constant matrix C, E[b(X)b(X)′] is nonsingular, and E[a(Z) | X = x] = Π b(x) for a square nonsingular Π, then for δ = (E[a(Z)b(X)′])^{−1}E[a(Z)Y],
E[∂{b(X)′δ}/∂x] = E[∂γ_0(X)/∂x].
This result shows that if the density score is a linear combination of the right-hand side variables b(X) used by linear IV, the conditional expectation of the instruments a(Z) given X is a nonsingular linear combination of b(X), and b(X) has a nonsingular second moment matrix, then the average derivative of the linear IV estimator is the true average derivative. This is a generalization to NPIV of Stoker's (1986) result that linear regression coefficients equal average derivatives when the regressors are multivariate Gaussian.
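Stoker's result in the exogenous case Z = X is easy to verify by simulation: with standard normal regressors the OLS slope equals the average derivative. Here γ_0(x) = x³, whose average derivative is E[3X²] = 3, and by Stein's identity the population OLS slope is E[X·X³] = E[X⁴] = 3 as well. A quick check (ours, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200000
x = rng.normal(size=n)
y = x ** 3                                # gamma_0(x) = x^3, no noise needed
slope = np.cov(x, y)[0, 1] / np.var(x)    # OLS slope of y on x
avg_deriv = np.mean(3 * x ** 2)           # sample average derivative
```

The equality is a population identity, so the two Monte Carlo quantities agree up to sampling error.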
DR moment conditions can be used to identify parameters of interest. Under Assumption 1, θ_0 may be identified from
E[g(w, θ_0, γ)] = −E[φ(w, θ_0, γ, λ_0)]
for any fixed γ when the solution θ_0 to this equation is unique.
Theorem 7: If Assumption 1 is satisfied, λ_0 is identified, and for some γ̄ ∈ Γ the equation E[ψ(w, θ, γ̄, λ_0)] = 0 has a unique solution then θ_0 is identified as that solution.
Applying this result to the NPIV setting of Assumption 2 gives an explicit formula for certain functionals of γ_0(X) without requiring that the completeness identification condition of Newey and Powell (1989, 2003) be satisfied, similarly to Santos (2011). Suppose that v(X) is identified, e.g. as for the weighted average derivative. Since both Y and Z are observed it follows that a solution λ_0(Z) to v(X) = E[λ_0(Z) | X] will be identified if such a solution exists. Plugging γ = 0 into the equation E[ψ(w, θ_0, γ, λ_0)] = 0 gives
Corollary 8: If v(X) is identified and there exists λ_0(Z) such that v(X) = E[λ_0(Z) | X] then θ_0 = E[v(X)γ_0(X)] is identified as θ_0 = E[λ_0(Z)Y].
Note that this result holds without the completeness condition. Identification of θ_0 = E[v(X)γ_0(X)] for known v(X) with v(X) = E[λ_0(Z) | X] follows from Severini and Tripathi (2006). Corollary 8 extends that analysis to the case where v(X) is only identified but not necessarily known and links it to DR moment conditions. Santos (2011) gives a related formula for a parameter θ_0 = ∫ v(x)γ_0(x)dx. The formula here differs from Santos (2011) in being an expectation rather than a Lebesgue integral. Santos (2011) also constructed an estimator; doing so here is beyond the scope of this paper.
6 Conditional Moment Restrictions
Models of conditional moment restrictions that depend on unknown functions are important
in econometrics. In such models the nonparametric components may be determined simulta-
neously with the parametric components. In this setting it is useful to work directly with the
instrumental variables to obtain LR moment conditions rather than to make a first step in-
fluence adjustment. For that reason we focus in this Section on constructing LR moments by
orthogonalizing the instrumental variables.
Our orthogonal instruments framework is based on conditional moment restrictions of the form
E[ρ_j(w, θ_0, γ_0) | Z_j] = 0, (j = 1, ..., J), (6.1)
where each ρ_j(w, θ, γ) is a scalar residual and Z_j are instruments that may differ across j. This model is considered by Chamberlain (1992) and Ai and Chen (2003, 2007) when Z_j is the same for each j and by Ai and Chen (2012) when the set of Z_j includes Z_{j−1}. We allow the residual vector ρ(w, θ, γ) = (ρ_1(w, θ, γ), ..., ρ_J(w, θ, γ))′ to depend on the entire function γ and not just on its value at some function of the observed data w.
In this framework we consider LR moment functions having the form
ψ(w, θ, γ) = A(Z)ρ(w, θ, γ), (6.2)
where A(Z) = [A_1(Z_1), ..., A_J(Z_J)] is a matrix of instrumental variables with the j-th column given by A_j(Z_j). We will define orthogonal instruments to be those that make ψ(w, θ, γ) locally robust. To define orthogonal instrumental variables we assume that γ is allowed to vary over a linear set Γ as F varies. For each ∆ ∈ Γ let
ρ̄(∆) = ((d/dτ)E[ρ_1(w, θ_0, γ_0 + τ∆) | Z_1], ..., (d/dτ)E[ρ_J(w, θ_0, γ_0 + τ∆) | Z_J])′ |_{τ=0}.
This ρ̄(∆) is the Gateaux derivative with respect to γ of the conditional expectations of the residuals in the direction ∆. We characterize A_0(Z) as orthogonal if
E[A_0(Z)ρ̄(∆)] = 0 for all ∆ ∈ Γ.
We assume that ρ̄(∆) is linear in ∆ and consider the Hilbert space of vectors of random variables a(Z) = (a_1(Z_1), ..., a_J(Z_J))′ with inner product ⟨a, ã⟩ = E[a(Z)′ã(Z)]. Let Λ denote the closure of the set {ρ̄(∆) : ∆ ∈ Γ} in that Hilbert space. Orthogonal instruments are those where each row of A_0(Z) is orthogonal to Λ. They can be interpreted as instrumental variables where the effect of estimation of γ has been partialed out. When A_0(Z) is orthogonal then ψ(w, θ, γ) = A_0(Z)ρ(w, θ, γ) is LR:
Theorem 9: If each row of A_0(Z) is orthogonal to Λ then the moment functions in equation (6.2) are LR.
We also have a DR result:
Theorem 10: If each row of A_0(Z) is orthogonal to Λ and ρ(w, θ, γ) is affine in γ ∈ Γ then the moment functions in equation (6.2) are DR for Λ = {a(Z) : E[a(Z)′ρ(w, θ_0, γ_0)ρ(w, θ_0, γ_0)′a(Z)] < ∞}.
There are many ways to construct orthogonal instruments. For instance, given a matrix of instrumental variables A(Z) with J columns one could construct corresponding orthogonal ones A_0(Z) as the matrix where each row of A(Z) is replaced by the residual from the least squares projection of the corresponding row of A(Z) on Λ. For local identification of θ we also require that
rank((∂/∂θ)E[ψ(w, θ, γ_0)] |_{θ=θ_0}) = dim(θ). (6.3)
A model where θ_0 is identified from semiparametric conditional moment restrictions with common instrumental variables is a special case where Z_j is the same for each j. In this case there is a way to construct orthogonal instruments that leads to an efficient estimator of θ_0. Let Σ(Z) denote some positive definite matrix with its smallest eigenvalue bounded away from zero, so that Σ(Z)^{−1} is bounded. Let ⟨a, ã⟩_Σ = E[a(Z)′Σ(Z)^{−1}ã(Z)] denote an inner product and note that Λ is closed in this inner product by Σ(Z)^{−1} bounded. Let ã_k^Σ(Z)′ denote the residual from the least squares projection of the k-th row of A(Z) on Λ with the inner product ⟨·, ·⟩_Σ. Then for all ∆ ∈ Γ,
E[ã_k^Σ(Z)′Σ(Z)^{−1}ρ̄(∆)] = 0,
so that for Ã^Σ(Z) denoting the matrix with k-th row ã_k^Σ(Z)′, the instrumental variables Ã^Σ(Z)Σ(Z)^{−1} are orthogonal. Also, Ã^Σ(Z) can be interpreted as the solution to
min_{B(Z): each row of B(Z) ∈ Λ} E[{A(Z) − B(Z)}Σ(Z)^{−1}{A(Z) − B(Z)}′],
where the minimization is in the positive semidefinite sense.
The orthogonal instruments that minimize the asymptotic variance of GMM in the class of GMM estimators with orthogonal instruments are given by
A*(Z) = Ã^{Σ*}(Z)Σ*(Z)^{−1}, Ā(Z) = (∂/∂θ)E[ρ(w, θ, γ_0) | Z] |_{θ=θ_0}, Σ*(Z) = Var(ρ | Z), ρ = ρ(w, θ_0, γ_0),
where Ã^{Σ*}(Z) is the matrix of residuals from projecting the rows of Ā(Z) on Λ with the inner product ⟨·, ·⟩_{Σ*}.
Theorem 11: The instruments A*(Z) give an efficient estimator in the class of IV estimators with orthogonal instruments.
The asymptotic variance of the GMM estimator with optimal orthogonal instruments is
(E[A*(Z)ρρ′A*(Z)′])^{−1} = (E[Ã^{Σ*}(Z)Σ*(Z)^{−1}Ã^{Σ*}(Z)′])^{−1}.
This matrix coincides with the semiparametric variance bound of Ai and Chen (2003). Estimation of the optimal orthogonal instruments is beyond the scope of this paper. The series estimator of Ai and Chen (2003) could be used for this.
This framework includes moment restrictions with a NPIV first step satisfying E[Y − γ_0(X) | Z] = 0, where we can specify ρ_1(w, θ, γ) = Y − γ(X), Z_1 = Z, ρ_2(w, θ, γ) = m(w, γ) − θ, and Z_2 a constant. It generalizes that setup by allowing for more residuals ρ_j(w, θ, γ), (j ≥ 3), and allowing all residuals to depend on γ.
7 Asymptotic Theory
In this Section we give simple and general asymptotic theory for LR estimators that incorporates the cross-fitting of equation (2.9). Throughout we use the structure of LR moment functions that are the sum ψ(w, θ, γ, λ) = g(w, θ, γ) + φ(w, θ, γ, λ) of an identifying or original moment function g(w, θ, γ) depending on a first step function γ and an influence adjustment term φ(w, θ, γ, λ) that can depend on an additional first step λ. The asymptotic theory will apply to any moment function that can be decomposed into a function of a single nonparametric estimator and a function of two nonparametric estimators. This structure and LR lead to particularly simple and general conditions.
The conditions we give are composed of mean square consistency conditions for first steps and one, two, or three rate conditions for quadratic remainders. We will only use one quadratic remainder rate for DR moment conditions, involving faster than 1/√n convergence of products of estimation errors for γ̂ and λ̂. When E[g(w, θ_0, γ) + φ(w, θ_0, γ, λ_0)] is not affine in γ we will impose a second rate condition that involves faster than n^{−1/4} convergence of γ̂. When E[φ(w, θ_0, γ_0, λ)] is also not affine in λ we will impose a third rate condition that involves faster than n^{−1/4} convergence of λ̂. Most adjustment terms φ(w, θ, γ, λ) of which we are aware, including those for first step conditional moment restrictions and densities, have E[φ(w, θ_0, γ_0, λ)] affine in λ, so that faster than n^{−1/4} convergence of λ̂ will not be required under our conditions. It will suffice for most LR estimators of which we know to have faster than n^{−1/4} convergence of γ̂ and faster than 1/√n convergence of the product of estimation errors for γ̂ and λ̂, with only the latter condition imposed for DR moment functions. We also impose some additional conditions for convergence of the Jacobian of the moments and sample second moments that give asymptotic normality and consistent asymptotic variance estimation for θ̂.
An important intermediate result for asymptotic normality is
\[
\sqrt{n}\,\hat\psi(\beta_0) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\psi(w_i,\gamma_0,\lambda_0,\beta_0) + o_p(1), \tag{7.1}
\]
where $\hat\psi(\beta)$ is the cross-fit sample LR moment of equation (2.9). This result will mean that the presence of the first step estimators has no effect on the limiting distribution of the moments at the true $\beta_0$. To formulate conditions for this result we decompose the difference between the left- and right-hand sides into several remainders. Let $\bar g(\gamma) = E[g(w,\gamma,\beta_0)]$ and $\bar\phi(\gamma,\lambda) = E[\phi(w,\gamma,\lambda,\beta_0)]$, so that $\bar\psi(\gamma,\lambda) = \bar g(\gamma) + \bar\phi(\gamma,\lambda)$. Then adding and subtracting terms gives
\[
\sqrt{n}\Big[\hat\psi(\beta_0) - \frac{1}{n}\sum_{i=1}^{n}\psi(w_i,\gamma_0,\lambda_0,\beta_0)\Big] = \hat R_1 + \hat R_2 + \hat R_3 + \hat R_4, \tag{7.2}
\]
where
\[
\hat R_1 = \frac{1}{\sqrt{n}}\sum_{\ell=1}^{L}\sum_{i\in I_\ell}\big[g(w_i,\hat\gamma_\ell,\beta_0) - g(w_i,\gamma_0,\beta_0) - \bar g(\hat\gamma_\ell)\big] \tag{7.3}
\]
\[
\qquad + \frac{1}{\sqrt{n}}\sum_{\ell=1}^{L}\sum_{i\in I_\ell}\big[\phi(w_i,\hat\gamma_\ell,\lambda_0,\beta_0) - \phi(w_i,\gamma_0,\lambda_0,\beta_0) - \bar\phi(\hat\gamma_\ell,\lambda_0) + \phi(w_i,\gamma_0,\hat\lambda_\ell,\beta_0) - \phi(w_i,\gamma_0,\lambda_0,\beta_0) - \bar\phi(\gamma_0,\hat\lambda_\ell)\big],
\]
\[
\hat R_2 = \frac{1}{\sqrt{n}}\sum_{\ell=1}^{L}\sum_{i\in I_\ell}\big[\phi(w_i,\hat\gamma_\ell,\hat\lambda_\ell,\beta_0) - \phi(w_i,\hat\gamma_\ell,\lambda_0,\beta_0) - \phi(w_i,\gamma_0,\hat\lambda_\ell,\beta_0) + \phi(w_i,\gamma_0,\lambda_0,\beta_0)\big],
\]
\[
\hat R_3 = \sum_{\ell=1}^{L}\frac{n_\ell}{\sqrt{n}}\,\bar\psi(\hat\gamma_\ell,\lambda_0),\qquad
\hat R_4 = \sum_{\ell=1}^{L}\frac{n_\ell}{\sqrt{n}}\,\bar\phi(\gamma_0,\hat\lambda_\ell).
\]
We specify regularity conditions sufficient for each of $\hat R_1$, $\hat R_2$, $\hat R_3$, and $\hat R_4$ to converge in probability to zero, so that equation (7.1) will hold. The remainder term $\hat R_1$ is a stochastic equicontinuity term as in Andrews (1994). We give mean square consistency conditions for $\hat R_1 \overset{p}{\to} 0$ in Assumption 4.
The remainder term $\hat R_2$ is a second order remainder that involves both $\hat\gamma$ and $\hat\lambda$. When the influence adjustment is $\phi(w,\gamma,\lambda) = \lambda(x)[y - \gamma(x)]$, as for conditional moment restrictions, then
\[
\hat R_2 = -\frac{1}{\sqrt{n}}\sum_{\ell=1}^{L}\sum_{i\in I_\ell}[\hat\lambda_\ell(x_i) - \lambda_0(x_i)][\hat\gamma_\ell(x_i) - \gamma_0(x_i)].
\]
$\hat R_2$ will converge to zero when the product of convergence rates for $\hat\gamma(x)$ and $\hat\lambda(x)$ is faster than $1/\sqrt{n}$. However, that is not the weakest possible condition. Weaker conditions for locally linear regression first steps are given by Firpo and Rothe (2017), and for series regression first steps by Newey and Robins (2017). These weaker conditions still require that the product of the biases of $\hat\gamma(x)$ and $\hat\lambda(x)$ converge to zero faster than $1/\sqrt{n}$, but impose weaker conditions on variance terms. We allow for these weaker conditions by allowing $\hat R_2 \overset{p}{\to} 0$ itself as a regularity condition. Assumption 5 gives these conditions.
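As a back-of-the-envelope check of this product condition: if $\hat\gamma$ converges at rate $n^{-a}$ and $\hat\lambda$ at rate $n^{-b}$, then $\sqrt{n}\,n^{-a}n^{-b} \to 0$ exactly when $a + b > 1/2$. The snippet below is a trivial numerical illustration of that trade-off, with the rates as assumed inputs rather than anything estimated:

```python
import math

def scaled_product_remainder(n, a, b):
    """sqrt(n) times the product of first step error rates n^-a and n^-b."""
    return math.sqrt(n) * n ** (-a) * n ** (-b)

# a + b = 0.55 > 1/2: the scaled product vanishes even though a = 0.1 is slow.
vanishing = [scaled_product_remainder(n, 0.1, 0.45) for n in (1e2, 1e4, 1e6)]
# a + b = 0.40 < 1/2: the scaled product diverges.
diverging = [scaled_product_remainder(n, 0.1, 0.30) for n in (1e2, 1e4, 1e6)]
```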
We will have $\hat R_3 = \hat R_4 = 0$ in the DR case of Assumption 1, where $\hat R_1 \overset{p}{\to} 0$ and $\hat R_2 \overset{p}{\to} 0$ will suffice for equation (7.1). In non-DR cases local robustness leads to $\bar\psi(\gamma,\lambda_0) = \bar g(\gamma) + \bar\phi(\gamma,\lambda_0)$ having a zero functional derivative with respect to $\gamma$ at $\gamma_0$, so that $\hat R_3 \overset{p}{\to} 0$ when $\hat\gamma$ converges to $\gamma_0$ at a rapid enough, feasible rate. For example, if $\bar\psi(\gamma,\lambda_0)$ is twice continuously Frechet differentiable in a neighborhood of $\gamma_0$ for a norm $\|\cdot\|$, with zero Frechet derivative at $\gamma_0$ and second derivative bounded by $C$, then
\[
|\hat R_3| \leq C\sum_{\ell=1}^{L}\frac{n_\ell}{\sqrt{n}}\,\|\hat\gamma_\ell - \gamma_0\|^2 \overset{p}{\longrightarrow} 0
\]
when $\|\hat\gamma_\ell - \gamma_0\| = o_p(n^{-1/4})$. Here $\hat R_3 \overset{p}{\to} 0$ when each $\hat\gamma_\ell$ converges to $\gamma_0$ more quickly than $n^{-1/4}$. It may be possible to weaken this condition by bias correcting $\bar\psi(\hat\gamma_\ell,\lambda_0)$, as by the bootstrap in Cattaneo and Jansson (2017), by the jackknife in Cattaneo, Jansson, and Ma (2017), and by cross-fitting in Newey and Robins (2017). Consideration of such bias corrections for $\bar\psi(\hat\gamma_\ell,\lambda_0)$ is beyond the scope of this paper.
In many cases $\hat R_4 = 0$ even though the moment conditions are not DR. For example, that is true when $\gamma_0$ is a pdf or when $\gamma_0$ estimates the solution to a conditional moment restriction. In such cases mean square consistency of the first steps, $\hat R_2 \overset{p}{\to} 0$, and faster than $n^{-1/4}$ consistency of $\hat\gamma$ suffice for equation (7.1); no convergence rate for $\hat\lambda$ is needed. The simplification that $\hat R_4 = 0$ seems to be the result of $\lambda_0$ being a Riesz representer for the linear functional that is the derivative of $\bar g(\gamma)$ with respect to $\gamma$. Such a Riesz representer will enter $\phi(w,\gamma_0,\lambda,\beta_0)$ linearly, leading to $\hat R_4 = 0$. When $\hat R_4 \neq 0$, then $\hat R_4 \overset{p}{\to} 0$ will follow from twice Frechet differentiability of $\bar\phi(\gamma_0,\lambda)$ in $\lambda$ and faster than $n^{-1/4}$ convergence of $\hat\lambda$.
All of the conditions can be easily checked for a wide variety of machine learning and conventional nonparametric estimators. There are well known conditions for mean square consistency for many conventional and machine learning methods. Rates for products of estimation errors are also known for many first step estimators, as are conditions for $n^{-1/4}$ consistency. Thus, the simple conditions we give here are general enough to apply to a wide variety of first step estimators.
The first formal assumption of this section is sufficient for $\hat R_1 \overset{p}{\to} 0$.

Assumption 4: For each $\ell = 1,\dots,L$, i) either $g(w,\gamma,\beta_0)$ does not depend on $\gamma$ or $\int[g(w,\hat\gamma_\ell,\beta_0) - g(w,\gamma_0,\beta_0)]^2F_0(dw) \overset{p}{\to} 0$; ii) $\int[\phi(w,\hat\gamma_\ell,\lambda_0,\beta_0) - \phi(w,\gamma_0,\lambda_0,\beta_0)]^2F_0(dw) \overset{p}{\to} 0$ and $\int[\phi(w,\gamma_0,\hat\lambda_\ell,\beta_0) - \phi(w,\gamma_0,\lambda_0,\beta_0)]^2F_0(dw) \overset{p}{\to} 0$.
The cross-fitting used in the construction of $\hat\psi(\beta_0)$ is what makes the mean square consistency conditions of Assumption 4 sufficient for $\hat R_1 \overset{p}{\to} 0$. The next condition is sufficient for $\hat R_2 \overset{p}{\to} 0$.
Assumption 5: For each $\ell = 1,\dots,L$, either i)
\[
\sqrt{n}\int\big|\phi(w,\hat\gamma_\ell,\hat\lambda_\ell,\beta_0) - \phi(w,\hat\gamma_\ell,\lambda_0,\beta_0) - \phi(w,\gamma_0,\hat\lambda_\ell,\beta_0) + \phi(w,\gamma_0,\lambda_0,\beta_0)\big|F_0(dw) \overset{p}{\longrightarrow} 0,
\]
or ii) $\hat R_2 \overset{p}{\to} 0$.
As previously discussed, this condition allows for just $\hat R_2 \overset{p}{\to} 0$ in order to accommodate the weak regularity conditions of Firpo and Rothe (2017) and Newey and Robins (2017). The first result of this section shows that Assumptions 4 and 5 are sufficient for equation (7.1) when the moment functions are DR.

Lemma 12: If Assumption 1 is satisfied, with probability approaching one $\hat\gamma_\ell \in \Gamma$ and $\hat\lambda_\ell \in \Lambda$, and Assumptions 4 and 5 are satisfied, then equation (7.1) is satisfied.
An important class of DR estimators are those from equation (5.2). The following result gives conditions for asymptotic linearity of these estimators:

Theorem 13: If a) Assumptions 2 and 4 i) are satisfied, with $\hat\gamma_\ell \in \Gamma$ and $\hat\lambda_\ell \in \Lambda$ with probability approaching one; b) $\lambda_0(x)$ and $E[\{y - \gamma_0(x)\}^2|x]$ are bounded; c) for each $\ell = 1,\dots,L$, $\int[\hat\gamma_\ell(x) - \gamma_0(x)]^2F_0(dx) \overset{p}{\to} 0$, $\int[\hat\lambda_\ell(x) - \lambda_0(x)]^2F_0(dx) \overset{p}{\to} 0$, and either
\[
\sqrt{n}\left\{\int[\hat\gamma_\ell(x) - \gamma_0(x)]^2F_0(dx)\right\}^{1/2}\left\{\int[\hat\lambda_\ell(x) - \lambda_0(x)]^2F_0(dx)\right\}^{1/2} \overset{p}{\longrightarrow} 0
\]
or
\[
\frac{1}{\sqrt{n}}\sum_{i\in I_\ell}\{\hat\gamma_\ell(x_i) - \gamma_0(x_i)\}\{\hat\lambda_\ell(x_i) - \lambda_0(x_i)\} \overset{p}{\longrightarrow} 0;
\]
then
\[
\sqrt{n}(\hat\beta - \beta_0) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\big[m(w_i,\gamma_0) - \beta_0 + \lambda_0(x_i)\{y_i - \gamma_0(x_i)\}\big] + o_p(1).
\]
The conditions of this result are simple, general, and allow for machine learning first steps. Conditions a) and b) simply require mean square consistency of the first step estimators $\hat\gamma$ and $\hat\lambda$. The only convergence rate condition is c), which requires that a product of estimation errors for the two first steps go to zero faster than $1/\sqrt{n}$. This condition allows for a trade-off in convergence rates between the two first steps, and can be satisfied even when one of the two rates is not very fast. This trade-off can be important when $\lambda_0(x)$ is not continuous in one of the components of $x$, as in the surplus bound example. Discontinuity in $x$ can limit the rate at which $\lambda_0(x)$ can be estimated. This result extends the results of Chernozhukov et al. (2018) and Farrell (2015) for DR estimators of treatment effects to the whole novel class of DR estimators from equation (5.2) with machine learning first steps. In interesting related work, Athey et al. (2017) show that root-n consistent estimation of an average treatment effect is possible under very weak conditions on the propensity score, under strong sparsity of the regression function. Thus, for machine learning the conditions here and in Athey et al. (2017) are complementary, and one may prefer either depending on whether or not the regression function can be estimated extremely well based on a sparse method. The results here apply to many more DR moment conditions.
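As a concrete illustration of the conditions of Theorem 13, the following simulation sketch applies the DR moment function $m(w,\gamma) - \beta + \lambda_0(x)\{y - \gamma(x)\}$ with $m(w,\gamma) = \gamma(x)$ and $\lambda_0(x) = d/\pi_0(x)$ to estimate a mean outcome observed only when $d = 1$. The data generating process, the least squares outcome regression, and the logit propensity score are all assumptions of this toy example, not estimators from the paper:

```python
import numpy as np

def fit_logit(x, d, iters=25):
    """Logit of d on (1, x) by Newton-Raphson; returns pi_hat(x)."""
    X = np.column_stack([np.ones_like(x), x])
    b = np.zeros(2)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ b))
        b += np.linalg.solve(X.T @ ((p * (1 - p))[:, None] * X), X.T @ (d - p))
    return lambda xn: 1.0 / (1.0 + np.exp(-(b[0] + b[1] * xn)))

def fit_ols_observed(x, y, d):
    """Least squares of y on (1, x) over observed (d = 1) cases; returns gamma_hat(x)."""
    X = np.column_stack([np.ones_like(x[d == 1]), x[d == 1]])
    c = np.linalg.lstsq(X, y[d == 1], rcond=None)[0]
    return lambda xn: c[0] + c[1] * xn

def dr_mean(x, y, d, n_folds=2):
    """Cross-fit DR estimate of E[y*] when y* is observed only if d = 1."""
    folds = np.arange(len(x)) % n_folds
    scores = np.empty(len(x))
    for l in range(n_folds):
        tr, te = folds != l, folds == l
        gamma = fit_ols_observed(x[tr], y[tr], d[tr])  # outcome regression gamma_l
        pi = fit_logit(x[tr], d[tr])                   # propensity entering lambda_l
        scores[te] = gamma(x[te]) + d[te] / pi(x[te]) * (y[te] - gamma(x[te]))
    return scores.mean()

rng = np.random.default_rng(0)
n = 4000
x = rng.normal(size=n)
d = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-x))).astype(float)
y = np.where(d == 1, 1.0 + x + rng.normal(size=n), 0.0)  # true mean of y* is 1
est = dr_mean(x, y, d)
```

Either first step could converge slowly as long as the product condition in c) holds; here both are parametric, so the rate condition is satisfied trivially.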
DR moment conditions have the special feature that $\hat R_3$ and $\hat R_4$ in the decomposition (7.2) are equal to zero. For estimators that are not DR we impose that $\hat R_3$ and $\hat R_4$ converge to zero.

Assumption 6: For each $\ell = 1,\dots,L$, i) $\sqrt{n}\,\bar\psi(\hat\gamma_\ell,\lambda_0) \overset{p}{\to} 0$ and ii) $\sqrt{n}\,\bar\phi(\gamma_0,\hat\lambda_\ell) \overset{p}{\to} 0$.

Assumption 6 requires that $\hat\gamma$ converge to $\gamma_0$ rapidly enough, but places no restrictions on the convergence rate of $\hat\lambda$ when $\bar\phi(\gamma_0,\lambda) = 0$.
Lemma 14: If Assumptions 4-6 are satisfied then equation (7.1) is satisfied.
Assumptions 4-6 are based on the decomposition of LR moment functions into an identifying part and an influence function adjustment. These conditions differ from those in previous work on semiparametric estimation, as in Andrews (1994), Newey (1994a), Newey and McFadden (1994), Chen, Linton, and van Keilegom (2003), Ichimura and Lee (2010), Escanciano et al. (2016), and Chernozhukov et al. (2018), which is not based on this decomposition. The conditions extend Chernozhukov et al. (2018) to many more DR estimators and to estimators that are nonlinear in $\gamma$, while only requiring a convergence rate for $\hat\gamma$ and not for $\hat\lambda$.
This framework helps explain the potential problems with "plugging in" a first step machine learning estimator into a moment function that is not LR. Lemma 14 implies that if Assumptions 4-6 are satisfied for some $\lambda$, then the plug-in moments $\hat g(\beta_0) = \sum_{\ell=1}^{L}\sum_{i\in I_\ell}g(w_i,\hat\gamma_\ell,\beta_0)/n$ satisfy $\sqrt{n}\,\hat g(\beta_0) - \sum_{i=1}^{n}\psi(w_i,\gamma_0,\lambda_0,\beta_0)/\sqrt{n} \overset{p}{\to} 0$ if and only if
\[
\hat R_5 = \frac{1}{\sqrt{n}}\sum_{\ell=1}^{L}\sum_{i\in I_\ell}\phi(w_i,\hat\gamma_\ell,\hat\lambda_\ell,\beta_0) \overset{p}{\longrightarrow} 0. \tag{7.4}
\]
The plug-in method will fail when this equation does not hold. For example, suppose $\gamma_0(x) = E[y|x]$, so that by Proposition 4 of Newey (1994a),
\[
\frac{1}{\sqrt{n}}\sum_{\ell=1}^{L}\sum_{i\in I_\ell}\phi(w_i,\hat\gamma_\ell,\hat\lambda_\ell,\beta_0) = -\frac{1}{\sqrt{n}}\sum_{\ell=1}^{L}\sum_{i\in I_\ell}\hat\lambda_\ell(x_i)[y_i - \hat\gamma_\ell(x_i)].
\]
Here $\hat R_5 \overset{p}{\to} 0$ is an approximate orthogonality condition between the approximation $\hat\lambda_\ell(x)$ to $\lambda_0(x)$ and the nonparametric first stage residuals $y_i - \hat\gamma_\ell(x_i)$. Machine learning uses model selection in the construction of $\hat\gamma_\ell(x)$. If the model selected to approximate $\gamma_0(x)$ is not rich (or dense) enough to also approximate $\lambda_0(x)$, then $\hat\lambda_\ell(x)$ need not be approximately orthogonal to $y_i - \hat\gamma_\ell(x_i)$, and $\hat R_5$ need not converge to zero. In particular, if the variables selected to approximate $\gamma_0(x)$ cannot also be used to approximate $\lambda_0(x)$, then the approximate orthogonality condition can fail. This phenomenon helps explain the poor performance of the plug-in estimator shown in Belloni, Chernozhukov, and Hansen (2014) and Chernozhukov et al. (2017, 2018). The plug-in estimator can be root-n consistent if the only thing being selected is an overall order of approximation, as in the series estimation results of Newey (1994a). General conditions for root-n consistency of the plug-in estimator can be formulated using Assumptions 4-6 and $\hat R_5 \overset{p}{\to} 0$, which we do in Appendix D.
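The role of $\hat R_5$ can be seen in a small least squares example: OLS residuals are exactly orthogonal to the regressors included in the selected model, so the $\hat R_5$-type sum vanishes when $\hat\lambda$ is spanned by the selected variables, but not when $\lambda_0$ depends on a relevant variable the selection step dropped. The two-regressor design below is an assumed toy, not an example from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)          # relevant for y, but "dropped" by selection
y = x1 + x2 + rng.normal(size=n)

# First step: regress y on the selected variable x1 only.
b = (x1 @ y) / (x1 @ x1)
resid = y - b * x1               # y_i - gamma_hat(x_i)

# R5-type term (1/sqrt(n)) sum_i lambda(x_i) * resid_i, for two lambdas:
r5_spanned = (x1 @ resid) / np.sqrt(n)  # lambda in the selected model: vanishes
r5_omitted = (x2 @ resid) / np.sqrt(n)  # lambda depends on omitted x2: large
```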
Another component of an asymptotic normality result is convergence of the Jacobian term $\hat G = \partial\hat\psi(\bar\beta)/\partial\beta$ to $G = E[\partial\psi(w,\gamma_0,\lambda_0,\beta)/\partial\beta]|_{\beta=\beta_0}$. We impose the following condition for this purpose.

Assumption 7: $G$ exists and there are a neighborhood $N$ of $\beta_0$ and a norm $\|\cdot\|$ such that i) for each $\ell$, $\|\hat\gamma_\ell - \gamma_0\| \overset{p}{\to} 0$ and $\|\hat\lambda_\ell - \lambda_0\| \overset{p}{\to} 0$; ii) for all $\|\gamma - \gamma_0\|$ and $\|\lambda - \lambda_0\|$ small enough, $\psi(w,\gamma,\lambda,\beta)$ is differentiable in $\beta$ on $N$ with probability approaching one; iii) there are $\varepsilon_0 > 0$ and $d(w)$ with $E[d(w)] < \infty$ such that for $\beta \in N$ and $\|\gamma - \gamma_0\|$, $\|\lambda - \lambda_0\|$ small enough,
\[
\left\|\frac{\partial\psi(w,\gamma,\lambda,\beta)}{\partial\beta} - \frac{\partial\psi(w,\gamma,\lambda,\beta_0)}{\partial\beta}\right\| \leq d(w)\,\|\beta - \beta_0\|^{\varepsilon_0};
\]
iv) for each $\ell = 1,\dots,L$,
\[
\int\left\|\frac{\partial\psi(w,\hat\gamma_\ell,\hat\lambda_\ell,\beta_0)}{\partial\beta} - \frac{\partial\psi(w,\gamma_0,\lambda_0,\beta_0)}{\partial\beta}\right\|F_0(dw) \overset{p}{\longrightarrow} 0.
\]
The following intermediate result gives Jacobian convergence.

Lemma 15: If Assumption 7 is satisfied, then for any $\bar\beta \overset{p}{\to} \beta_0$, $\hat\psi(\beta)$ is differentiable at $\bar\beta$ with probability approaching one and $\partial\hat\psi(\bar\beta)/\partial\beta \overset{p}{\to} G$.
With these results in place the asymptotic normality of semiparametric GMM follows in a standard way.

Theorem 16: If Assumptions 4-7 are satisfied, $\hat\beta \overset{p}{\to} \beta_0$, $\hat W \overset{p}{\to} W$, $G'WG$ is nonsingular, and $E[\|\psi(w,\gamma_0,\lambda_0,\beta_0)\|^2] < \infty$, then for $\Omega = E[\psi(w,\gamma_0,\lambda_0,\beta_0)\psi(w,\gamma_0,\lambda_0,\beta_0)']$,
\[
\sqrt{n}(\hat\beta - \beta_0) \overset{d}{\longrightarrow} N(0,V),\qquad V = (G'WG)^{-1}G'W\Omega WG(G'WG)^{-1}.
\]
It is also useful to have a consistent estimator of the asymptotic variance of $\hat\beta$. As usual, such an estimator can be constructed as
\[
\hat V = (\hat G'\hat W\hat G)^{-1}\hat G'\hat W\hat\Omega\hat W\hat G(\hat G'\hat W\hat G)^{-1},\qquad
\hat G = \frac{\partial\hat\psi(\hat\beta)}{\partial\beta},\qquad
\hat\Omega = \frac{1}{n}\sum_{\ell=1}^{L}\sum_{i\in I_\ell}\psi(w_i,\hat\gamma_\ell,\hat\lambda_\ell,\hat\beta)\,\psi(w_i,\hat\gamma_\ell,\hat\lambda_\ell,\hat\beta)'.
\]
Note that this variance estimator ignores the estimation of $\gamma$ and $\lambda$, which works here because the moment conditions are LR. The following result gives conditions for consistency of $\hat V$.

Theorem 17: If Assumptions 4 and 7 are satisfied with $E[d(w)^2] < \infty$, $G'WG$ is nonsingular, and
\[
\int\big\|\phi(w,\hat\gamma_\ell,\hat\lambda_\ell,\beta_0) - \phi(w,\hat\gamma_\ell,\lambda_0,\beta_0) - \phi(w,\gamma_0,\hat\lambda_\ell,\beta_0) + \phi(w,\gamma_0,\lambda_0,\beta_0)\big\|^2F_0(dw) \overset{p}{\longrightarrow} 0,
\]
then $\hat\Omega \overset{p}{\to} \Omega$ and $\hat V \overset{p}{\to} V$.
In this section we have used cross-fitting and a decomposition of moment conditions into
identifying and influence adjustment components to formulate simple and general conditions for
asymptotic normality of LR GMM estimators. For reducing higher order bias and variance it
may be desirable to let the number of groups grow with the sample size. That case is beyond
the scope of this paper.
8 Appendix A: Proofs of Theorems
Proof of Theorem 1: By ii) and iii),
\[
0 = (1-\tau)\int\phi(w,\gamma_\tau,\lambda_\tau)F_0(dw) + \tau\int\phi(w,\gamma_\tau,\lambda_\tau)H(dw).
\]
Dividing by $\tau$ and solving gives
\[
\frac{1}{\tau}\int\phi(w,\gamma_\tau,\lambda_\tau)F_0(dw) = -\int\phi(w,\gamma_\tau,\lambda_\tau)H(dw) + \int\phi(w,\gamma_\tau,\lambda_\tau)F_0(dw).
\]
Taking limits as $\tau \longrightarrow 0$, $\gamma_\tau \longrightarrow \gamma_0$, $\lambda_\tau \longrightarrow \lambda_0$, and using i) gives
\[
\frac{\partial}{\partial\tau}\int\phi(w,\gamma_\tau,\lambda_\tau)F_0(dw) = -\int\phi(w,\gamma_0,\lambda_0)H(dw) + 0 = -\frac{\partial\bar g(\gamma_\tau)}{\partial\tau}.\ Q.E.D.
\]
Proof of Theorem 2: We begin by deriving $\phi_1$, the adjustment term for the first step CCP estimation. We use the definitions given in the body of the paper. We also let
\[
H_j = H_j(x_t),\qquad \bar P_1 = \Pr(y_{t1}=1),\qquad \gamma_{10}(x) = E[y_{t1}|x_{t+1}=x],
\]
\[
\lambda_{j0}(x) = E\!\left[\left.\frac{y_{tj}H_j(x_t)}{P_j(x_t)}\,\right|x_{t+1}=x\right]\quad (j = 2,\dots,J).
\]
Consider a parametric submodel as described in Section 4 and let $\gamma_1(x,\tau)$ denote the conditional expectation of $y_{t1}$ given $x_{t+1}$ under the parametric submodel. Note that for $H_j = H_j(x_t)$,
\[
\frac{\partial}{\partial\tau}E\big[H_j y_{tj}E[v(\gamma_1(x_{t+1},\tau))|y_{tj}=1,x_t]\big]
= \frac{\partial}{\partial\tau}E\!\left[\frac{H_j y_{tj}}{P_j(x_t)}v(\gamma_1(x_{t+1},\tau))\right]
\]
\[
= \frac{\partial}{\partial\tau}E\!\left[E\!\left[\left.\frac{H_j y_{tj}}{P_j(x_t)}\right|x_{t+1}\right]v(\gamma_1(x_{t+1},\tau))\right]
= \frac{\partial}{\partial\tau}E[\lambda_{j0}(x_{t+1})v(\gamma_1(x_{t+1},\tau))]
= \frac{\partial}{\partial\tau}E[\lambda_{j0}(x_t)v(\gamma_1(x_t,\tau))]
\]
\[
= E\!\left[\lambda_{j0}(x_t)v_\gamma(\gamma_{10}(x_t))\frac{\partial\gamma_1(x_t,\tau)}{\partial\tau}\right]
= E\big[\lambda_{j0}(x_t)v_\gamma(\gamma_{10}(x_t))\{y_{t1}-\gamma_{10}(x_t)\}S(w)\big],
\]
where the last (sixth) equality follows as in Proposition 4 of Newey (1994a), and the fourth equality follows by equality of the marginal distributions of $x_t$ and $x_{t+1}$. Similarly, for $\bar P_1 = \Pr(y_{t1}=1)$ and $\lambda_{10}(x) = E[y_{t1}|x_{t+1}=x]$ we have
\[
\frac{\partial}{\partial\tau}E[v(\gamma_1(x_{t+1},\tau))|y_{t1}=1]
= \frac{\partial}{\partial\tau}E[\bar P_1^{-1}y_{t1}v(\gamma_1(x_{t+1},\tau))]
= \frac{\partial}{\partial\tau}E[\bar P_1^{-1}\lambda_{10}(x_{t+1})v(\gamma_1(x_{t+1},\tau))]
\]
\[
= \frac{\partial}{\partial\tau}E[\bar P_1^{-1}\lambda_{10}(x_t)v(\gamma_1(x_t,\tau))]
= E\big[\bar P_1^{-1}\lambda_{10}(x_t)v_\gamma(\gamma_{10}(x_t))\{y_{t1}-\gamma_{10}(x_t)\}S(w)\big].
\]
Then combining terms gives
\[
\frac{\partial}{\partial\tau}E[m(w,\beta_0,\gamma_1(\cdot,\tau),\gamma_{-1,0})]
= -\sum_{j=2}^{J}\frac{\partial}{\partial\tau}\Big(E\big[H_j y_{tj}E[v(\gamma_1(x_{t+1},\tau))|y_{tj}=1,x_t]\big] - E[H_j y_{tj}]\,E[v(\gamma_1(x_{t+1},\tau))|y_{t1}=1]\Big)
\]
\[
= -\sum_{j=2}^{J}E\big[\{\lambda_{j0}(x_t) - E[H_j y_{tj}]\bar P_1^{-1}\lambda_{10}(x_t)\}v_\gamma(\gamma_{10}(x_t))\{y_{t1}-\gamma_{10}(x_t)\}S(w)\big]
= E[\phi_1(w,\beta_0,\gamma_0,\lambda_0)S(w)].
\]
Next, we show the result for $\phi_j(w)$, $2 \leq j \leq J$. As in the proof of Proposition 4 of Newey (1994a), for any $a$ we have
\[
\frac{\partial}{\partial\tau}E[a|y_{tj}=1,x_t] = E\!\left[\left.\frac{y_{tj}}{P_j(x_t)}\{a - E[a|y_{tj}=1,x_t]\}S(w)\,\right|x_t\right].
\]
It follows that
\[
\frac{\partial}{\partial\tau}E[m(w,\beta_0,\gamma_j(\cdot,\tau),\gamma_{-j,0})]
= -\frac{\partial}{\partial\tau}E\big[H_j y_{tj}E[y_{1,t+1}+\delta v_{t+1}|y_{tj}=1,x_t]\big]
\]
\[
= -E\!\left[\frac{H_j y_{tj}}{P_j(x_t)}\{y_{1,t+1}+\delta v_{t+1} - \gamma_{j0}(x_t)\}S(w)\right]
= E[\phi_j(w,\beta_0,\gamma_0,\lambda_0)S(w)],
\]
showing that the formula for $\phi_j$ is correct. The proof for $\phi_{J+1}$ follows similarly. Q.E.D.
Proof of Theorem 3: Given in text.
Proof of Theorem 4: Given in text.
Proof of Theorem 5: Let $\bar g(\gamma,\lambda) = E[g(w,\gamma,\lambda,\beta_0)]$. Suppose that $g(w,\gamma,\lambda,\beta)$ is DR. Then for any $\gamma \neq \gamma_0$, $\gamma \in \Gamma$, we have
\[
0 = \bar g(\gamma,\lambda_0) = \bar g(\gamma_0,\lambda_0) = \bar g((1-\tau)\gamma_0+\tau\gamma,\lambda_0)
\]
for any $\tau$. Therefore, for any $\tau$,
\[
\bar g((1-\tau)\gamma_0+\tau\gamma,\lambda_0) = 0 = (1-\tau)\bar g(\gamma_0,\lambda_0) + \tau\bar g(\gamma,\lambda_0),
\]
so that $\bar g(\gamma,\lambda_0)$ is affine in $\gamma$. Also, by the previous equation $\bar g((1-\tau)\gamma_0+\tau\gamma,\lambda_0) = 0$ identically in $\tau$, so that
\[
\frac{\partial}{\partial\tau}\bar g((1-\tau)\gamma_0+\tau\gamma,\lambda_0) = 0,
\]
where the derivative with respect to $\tau$ is evaluated at $\tau = 0$. Applying the same argument with the roles of $\gamma$ and $\lambda$ switched, we find that $\bar g(\gamma_0,\lambda)$ is affine in $\lambda$ and $\partial\bar g(\gamma_0,(1-\tau)\lambda_0+\tau\lambda)/\partial\tau = 0$.

Next, suppose that $\bar g(\gamma,\lambda_0)$ is affine in $\gamma$ and $\partial\bar g((1-\tau)\gamma_0+\tau\gamma,\lambda_0)/\partial\tau = 0$. Then by $\bar g(\gamma_0,\lambda_0) = 0$, for any $\gamma \in \Gamma$,
\[
\bar g(\gamma,\lambda_0) = \frac{\partial}{\partial\tau}\big[\tau\bar g(\gamma,\lambda_0)\big] = \frac{\partial}{\partial\tau}\big[(1-\tau)\bar g(\gamma_0,\lambda_0) + \tau\bar g(\gamma,\lambda_0)\big] = \frac{\partial}{\partial\tau}\bar g((1-\tau)\gamma_0+\tau\gamma,\lambda_0) = 0.
\]
Switching the roles of $\gamma$ and $\lambda$, it follows analogously that $\bar g(\gamma_0,\lambda) = 0$ for all $\lambda \in \Lambda$, so $g(w,\gamma,\lambda,\beta)$ is doubly robust. Q.E.D.
Proof of Theorem 6: Let $\lambda_0(x) = -v_0'\Pi^{-1}p(x)$, so that $E[\lambda_0(x)p(x)'] = -v_0'\Pi^{-1}\Pi = -v_0'$. Then integration by parts gives
\[
E[\phi(w,\gamma_0,\lambda_0)] = E[\lambda_0(x)\{y-\gamma_0(x)\}] = -v_0'\Pi^{-1}E[p(x)\{y-\gamma_0(x)\}] = 0.\ Q.E.D.
\]
Proof of Theorem 7: If $\gamma_0$ is identified, then $\bar g(\beta) = E[g(w,\gamma_0,\lambda,\beta)]$ is identified for every $\beta$. By double robustness,
\[
\bar g(\beta) = 0
\]
at $\beta = \beta_0$, and by assumption this is the only $\beta$ where this equation is satisfied. Q.E.D.
Proof of Corollary 8: Given in text.
Proof of Theorem 9: Note that for $\rho = \rho(w,\gamma_0,\beta_0)$,
\[
\bar\psi(\gamma_0,(1-\tau)\lambda_0+\tau\lambda) = (1-\tau)E[\lambda_0(z)\rho] + \tau E[\lambda(z)\rho] = 0. \tag{8.1}
\]
Differentiating gives the second equality in eq. (2.7). Also, for $\Delta = \gamma - \gamma_0$,
\[
\frac{\partial}{\partial\tau}\bar\psi((1-\tau)\gamma_0+\tau\gamma,\lambda_0) = E[\lambda_0(z)\rho_\gamma(w,\Delta)] = 0,
\]
giving the first equality in eq. (2.7). Q.E.D.

Proof of Theorem 10: The first equality in eq. (8.1) of the proof of Theorem 9 shows that $\bar\psi(\gamma_0,\lambda)$ is affine in $\lambda$. Also,
\[
\bar\psi((1-\tau)\gamma_0+\tau\gamma,\lambda_0) = E[\lambda_0(z)\{(1-\tau)\rho(w,\gamma_0,\beta_0)+\tau\rho(w,\gamma,\beta_0)\}] = (1-\tau)\bar\psi(\gamma_0,\lambda_0) + \tau\bar\psi(\gamma,\lambda_0),
\]
so that $\bar\psi(\gamma,\lambda_0)$ is affine in $\gamma$. The conclusion then follows by Theorem 5. Q.E.D.
Proof of Theorem 11: To see that $b^*(x) = \Sigma^*_\beta(x)'\Sigma^*(x)^{-1}$ minimizes the asymptotic variance, note that for any orthogonal instrumental variable matrix $b(x)$, by the rows of $E[\partial\rho(w,\gamma_0,\beta_0)/\partial\beta\,|x] - \Sigma^*_\beta(x)$ being in $\Lambda$,
\[
G_b = E\!\left[b(x)\frac{\partial\rho(w,\gamma_0,\beta_0)}{\partial\beta}\right] = E[b(x)\Sigma^*_\beta(x)] = E[b(x)\Sigma^*(x)\Sigma^*(x)^{-1}\Sigma^*_\beta(x)] = E[b(x)\varepsilon^*\varepsilon^{*\prime}b^*(x)'].
\]
Since the instruments are orthogonal, the asymptotic variance matrix of the GMM estimator with $\hat W \overset{p}{\to} W$ is the same as if $\hat\gamma = \gamma_0$. Define $U = G_b'Wb(x)\varepsilon^*$ and $U^* = b^*(x)\varepsilon^* = \Sigma^*_\beta(x)'\Sigma^*(x)^{-1}\varepsilon^*$. The asymptotic variance of the GMM estimator with orthogonal instruments $b(x)$ is
\[
(G_b'WG_b)^{-1}G_b'W E[b(x)\varepsilon^*\varepsilon^{*\prime}b(x)']WG_b(G_b'WG_b)^{-1} = (E[UU^{*\prime}])^{-1}E[UU'](E[U^*U'])^{-1}.
\]
The fact that this matrix is minimized in the positive semidefinite sense at $U = U^*$ is well known, e.g., see Newey and McFadden (1994). Q.E.D.
The following result is useful for the results of Section 7:

Lemma A1: If Assumption 4 is satisfied then $\hat R_1 \overset{p}{\to} 0$. If Assumption 5 is satisfied then $\hat R_2 \overset{p}{\to} 0$.
Proof: Define $\hat\Delta_{i\ell} = g(w_i,\hat\gamma_\ell,\beta_0) - g(w_i,\gamma_0,\beta_0) - \bar g(\hat\gamma_\ell)$ for $i \in I_\ell$, and let $W_{\ell^c}$ denote the observations $w_i$ for $i \notin I_\ell$. Note that $\hat\gamma_\ell$ depends only on $W_{\ell^c}$. By construction and independence of $W_{\ell^c}$ and $w_i$, $i \in I_\ell$, we have $E[\hat\Delta_{i\ell}|W_{\ell^c}] = 0$. Also, by independence of the observations, $E[\hat\Delta_{i\ell}\hat\Delta_{j\ell}|W_{\ell^c}] = 0$ for $i \neq j$, $i,j \in I_\ell$. Furthermore, for $i \in I_\ell$, $E[\hat\Delta_{i\ell}^2|W_{\ell^c}] \leq \int[g(w,\hat\gamma_\ell,\beta_0) - g(w,\gamma_0,\beta_0)]^2F_0(dw)$. Then we have
\[
E\Bigg[\bigg(\frac{1}{\sqrt{n}}\sum_{i\in I_\ell}\hat\Delta_{i\ell}\bigg)^2\Bigg|W_{\ell^c}\Bigg] = \frac{1}{n}E\Bigg[\bigg(\sum_{i\in I_\ell}\hat\Delta_{i\ell}\bigg)^2\Bigg|W_{\ell^c}\Bigg] = \frac{1}{n}\sum_{i\in I_\ell}E[\hat\Delta_{i\ell}^2|W_{\ell^c}]
\leq \int[g(w,\hat\gamma_\ell,\beta_0) - g(w,\gamma_0,\beta_0)]^2F_0(dw) \overset{p}{\longrightarrow} 0.
\]
The conditional Markov inequality then implies that $\sum_{i\in I_\ell}\hat\Delta_{i\ell}/\sqrt{n} \overset{p}{\to} 0$. The analogous results also hold for $\hat\Delta_{i\ell} = \phi(w_i,\hat\gamma_\ell,\lambda_0,\beta_0) - \phi(w_i,\gamma_0,\lambda_0,\beta_0) - \bar\phi(\hat\gamma_\ell,\lambda_0)$ and $\hat\Delta_{i\ell} = \phi(w_i,\gamma_0,\hat\lambda_\ell,\beta_0) - \phi(w_i,\gamma_0,\lambda_0,\beta_0) - \bar\phi(\gamma_0,\hat\lambda_\ell)$. Summing across these three terms and across $\ell = 1,\dots,L$ gives the first conclusion.

For the second conclusion, note that under the first hypothesis of Assumption 5,
\[
E\Bigg[\bigg|\frac{1}{\sqrt{n}}\sum_{i\in I_\ell}\big[\phi(w_i,\hat\gamma_\ell,\hat\lambda_\ell,\beta_0) - \phi(w_i,\hat\gamma_\ell,\lambda_0,\beta_0) - \phi(w_i,\gamma_0,\hat\lambda_\ell,\beta_0) + \phi(w_i,\gamma_0,\lambda_0,\beta_0)\big]\bigg|\,\Bigg|W_{\ell^c}\Bigg]
\]
\[
\leq \frac{1}{\sqrt{n}}\sum_{i\in I_\ell}E\big[\big|\phi(w_i,\hat\gamma_\ell,\hat\lambda_\ell,\beta_0) - \phi(w_i,\hat\gamma_\ell,\lambda_0,\beta_0) - \phi(w_i,\gamma_0,\hat\lambda_\ell,\beta_0) + \phi(w_i,\gamma_0,\lambda_0,\beta_0)\big|\,\big|W_{\ell^c}\big]
\]
\[
\leq \sqrt{n}\int\big|\phi(w,\hat\gamma_\ell,\hat\lambda_\ell,\beta_0) - \phi(w,\hat\gamma_\ell,\lambda_0,\beta_0) - \phi(w,\gamma_0,\hat\lambda_\ell,\beta_0) + \phi(w,\gamma_0,\lambda_0,\beta_0)\big|F_0(dw) \overset{p}{\longrightarrow} 0,
\]
so $\hat R_2 \overset{p}{\to} 0$ follows by the conditional Markov and triangle inequalities. The second hypothesis of Assumption 5 is just $\hat R_2 \overset{p}{\to} 0$. Q.E.D.
Proof of Lemma 12: By Assumption 1 and the hypotheses that $\hat\gamma_\ell \in \Gamma$ and $\hat\lambda_\ell \in \Lambda$, we have $\hat R_3 = \hat R_4 = 0$. By Lemma A1 we have $\hat R_1 \overset{p}{\to} 0$ and $\hat R_2 \overset{p}{\to} 0$. The conclusion then follows by the triangle inequality. Q.E.D.
Proof of Theorem 13: Note that for $\varepsilon = y - \gamma_0(x)$,
\[
\phi(w,\hat\gamma_\ell,\lambda_0,\beta_0) - \phi(w,\gamma_0,\lambda_0,\beta_0) = -\lambda_0(x)[\hat\gamma_\ell(x) - \gamma_0(x)],
\]
\[
\phi(w,\gamma_0,\hat\lambda_\ell,\beta_0) - \phi(w,\gamma_0,\lambda_0,\beta_0) = \varepsilon[\hat\lambda_\ell(x) - \lambda_0(x)],
\]
\[
\phi(w,\hat\gamma_\ell,\hat\lambda_\ell,\beta_0) - \phi(w,\hat\gamma_\ell,\lambda_0,\beta_0) - \phi(w,\gamma_0,\hat\lambda_\ell,\beta_0) + \phi(w,\gamma_0,\lambda_0,\beta_0) = -[\hat\lambda_\ell(x) - \lambda_0(x)][\hat\gamma_\ell(x) - \gamma_0(x)].
\]
The first part of Assumption 4 ii) then follows by
\[
\int[\phi(w,\hat\gamma_\ell,\lambda_0,\beta_0) - \phi(w,\gamma_0,\lambda_0,\beta_0)]^2F_0(dw) = \int\lambda_0(x)^2[\hat\gamma_\ell(x) - \gamma_0(x)]^2F_0(dx)
\leq C\int[\hat\gamma_\ell(x) - \gamma_0(x)]^2F_0(dx) \overset{p}{\longrightarrow} 0,
\]
using boundedness of $\lambda_0(x)$. The second part of Assumption 4 ii) follows by
\[
\int[\phi(w,\gamma_0,\hat\lambda_\ell,\beta_0) - \phi(w,\gamma_0,\lambda_0,\beta_0)]^2F_0(dw) = \int[\hat\lambda_\ell(x) - \lambda_0(x)]^2\varepsilon^2F_0(dw)
= \int[\hat\lambda_\ell(x) - \lambda_0(x)]^2E[\varepsilon^2|x]F_0(dx)
\leq C\int[\hat\lambda_\ell(x) - \lambda_0(x)]^2F_0(dx) \overset{p}{\longrightarrow} 0,
\]
using boundedness of $E[\varepsilon^2|x]$. Next, note that by the Cauchy-Schwartz inequality,
\[
\sqrt{n}\int\big|\phi(w,\hat\gamma_\ell,\hat\lambda_\ell,\beta_0) - \phi(w,\hat\gamma_\ell,\lambda_0,\beta_0) - \phi(w,\gamma_0,\hat\lambda_\ell,\beta_0) + \phi(w,\gamma_0,\lambda_0,\beta_0)\big|F_0(dw)
= \sqrt{n}\int\big|[\hat\lambda_\ell(x) - \lambda_0(x)][\hat\gamma_\ell(x) - \gamma_0(x)]\big|F_0(dx)
\]
\[
\leq \sqrt{n}\left\{\int[\hat\lambda_\ell(x) - \lambda_0(x)]^2F_0(dx)\right\}^{1/2}\left\{\int[\hat\gamma_\ell(x) - \gamma_0(x)]^2F_0(dx)\right\}^{1/2}.
\]
Then the first rate condition of Assumption 5 holds under the first rate condition of Theorem 13, while the second condition of Assumption 5 holds under the last hypothesis of Theorem 13. Then eq. (7.1) holds by Lemma 12, and the conclusion follows by rearranging the terms in eq. (7.1). Q.E.D.
Proof of Lemma 14: Follows by Lemma A1 and the triangle inequality. Q.E.D.
Proof of Lemma 15: Let $\psi_\beta(w,\gamma,\lambda,\beta) = \partial\psi(w,\gamma,\lambda,\beta)/\partial\beta$ when it exists, $\hat G_\ell = n_\ell^{-1}\sum_{i\in I_\ell}\psi_\beta(w_i,\hat\gamma_\ell,\hat\lambda_\ell,\beta_0)$, and $G_\ell = n_\ell^{-1}\sum_{i\in I_\ell}\psi_\beta(w_i,\gamma_0,\lambda_0,\beta_0)$. By the law of large numbers, $\sum_{\ell=1}^{L}(n_\ell/n)G_\ell \overset{p}{\to} G$. Also, by Assumption 7 iv), for each $\ell$,
\[
E\big[\|\hat G_\ell - G_\ell\|\,\big|\,W_{\ell^c}\big] \leq \int\big\|\psi_\beta(w,\hat\gamma_\ell,\hat\lambda_\ell,\beta_0) - \psi_\beta(w,\gamma_0,\lambda_0,\beta_0)\big\|F_0(dw) \overset{p}{\longrightarrow} 0.
\]
Then by the conditional Markov inequality, for each $\ell$,
\[
\hat G_\ell - G_\ell \overset{p}{\longrightarrow} 0.
\]
It follows by the triangle inequality that $\sum_{\ell=1}^{L}(n_\ell/n)\hat G_\ell \overset{p}{\to} G$. Also, with probability approaching one we have for any $\bar\beta \overset{p}{\to} \beta_0$, by Assumption 7 iii),
\[
\left\|\frac{\partial\hat\psi(\bar\beta)}{\partial\beta} - \sum_{\ell=1}^{L}\frac{n_\ell}{n}\hat G_\ell\right\| \leq \left(\frac{1}{n}\sum_{i=1}^{n}d(w_i)\right)\|\bar\beta - \beta_0\|^{\varepsilon_0} = O_p(1)o_p(1) \overset{p}{\longrightarrow} 0.
\]
The conclusion then follows by the triangle inequality. Q.E.D.
Proof of Theorem 16: The conclusion follows in a standard manner from the conclusions
of Lemmas 14 and 15. Q.E.D.
Proof of Theorem 17: Let $\hat\psi_i = \psi(w_i,\hat\gamma_\ell,\hat\lambda_\ell,\hat\beta)$ and $\psi_i = \psi(w_i,\gamma_0,\lambda_0,\beta_0)$ for $i \in I_\ell$. By standard arguments (e.g. Newey, 1994b), it suffices to show that $\frac{1}{n}\sum_{i=1}^{n}\|\hat\psi_i - \psi_i\|^2 \overset{p}{\to} 0$. Note that
\[
\hat\psi_i - \psi_i = \sum_{j=1}^{5}\hat\Delta_{ij},\qquad
\hat\Delta_{i1} = \psi(w_i,\hat\gamma_\ell,\hat\lambda_\ell,\hat\beta) - \psi(w_i,\hat\gamma_\ell,\hat\lambda_\ell,\beta_0),\qquad
\hat\Delta_{i2} = g(w_i,\hat\gamma_\ell,\beta_0) - g(w_i,\gamma_0,\beta_0),
\]
\[
\hat\Delta_{i3} = \phi(w_i,\hat\gamma_\ell,\lambda_0,\beta_0) - \phi(w_i,\gamma_0,\lambda_0,\beta_0),\qquad
\hat\Delta_{i4} = \phi(w_i,\gamma_0,\hat\lambda_\ell,\beta_0) - \phi(w_i,\gamma_0,\lambda_0,\beta_0),
\]
\[
\hat\Delta_{i5} = \phi(w_i,\hat\gamma_\ell,\hat\lambda_\ell,\beta_0) - \phi(w_i,\hat\gamma_\ell,\lambda_0,\beta_0) - \phi(w_i,\gamma_0,\hat\lambda_\ell,\beta_0) + \phi(w_i,\gamma_0,\lambda_0,\beta_0).
\]
By standard arguments it suffices to show that for each $\ell$ and $j$,
\[
\frac{1}{n}\sum_{i\in I_\ell}\|\hat\Delta_{ij}\|^2 \overset{p}{\longrightarrow} 0. \tag{8.2}
\]
For $j = 1$ it follows by a mean value expansion and Assumption 7 with $E[d(w)^2] < \infty$ that
\[
\frac{1}{n}\sum_{i\in I_\ell}\|\hat\Delta_{i1}\|^2 = \frac{1}{n}\sum_{i\in I_\ell}\big\|\psi_\beta(w_i,\hat\gamma_\ell,\hat\lambda_\ell,\bar\beta_i)(\hat\beta - \beta_0)\big\|^2 \leq \left(\frac{1}{n}\sum_{i\in I_\ell}d(w_i)^2\right)\|\hat\beta - \beta_0\|^2 \overset{p}{\longrightarrow} 0,
\]
where $\bar\beta_i$ is a mean value that actually differs from row to row of $\psi_\beta(w_i,\hat\gamma_\ell,\hat\lambda_\ell,\bar\beta_i)$. For $j = 2$, note that by Assumption 4,
\[
E\Big[\frac{1}{n}\sum_{i\in I_\ell}\|\hat\Delta_{i2}\|^2\,\Big|\,W_{\ell^c}\Big] \leq \int\|g(w,\hat\gamma_\ell,\beta_0) - g(w,\gamma_0,\beta_0)\|^2F_0(dw) \overset{p}{\longrightarrow} 0,
\]
so eq. (8.2) holds by the conditional Markov inequality. For $j = 3$ and $j = 4$, eq. (8.2) follows similarly. For $j = 5$, it follows from the hypotheses of Theorem 17 that
\[
E\Big[\frac{1}{n}\sum_{i\in I_\ell}\|\hat\Delta_{i5}\|^2\,\Big|\,W_{\ell^c}\Big] \leq \int\big\|\phi(w,\hat\gamma_\ell,\hat\lambda_\ell,\beta_0) - \phi(w,\hat\gamma_\ell,\lambda_0,\beta_0) - \phi(w,\gamma_0,\hat\lambda_\ell,\beta_0) + \phi(w,\gamma_0,\lambda_0,\beta_0)\big\|^2F_0(dw) \overset{p}{\longrightarrow} 0.
\]
Then eq. (8.2) holds for $j = 5$ by the conditional Markov inequality. Q.E.D.
9 Appendix B: Local Robustness and Derivatives of Expected Moments
In this Appendix we give conditions sufficient for the LR property of equation (2.5) to imply
the properties in equations (2.7) and (2.8). As discussed following equation (2.8), it may be
convenient when specifying regularity conditions for specific moment functions to work directly
with (2.7) and/or (2.8).
Assumption B1: There are linear sets $\Gamma$ and $\Lambda$ and a set $\mathcal{G}$ such that i) $\bar\psi(\gamma,\lambda)$ is Frechet differentiable at $(\gamma_0,\lambda_0)$; ii) for all $H \in \mathcal{G}$, the vector $(\gamma(\tau),\lambda(\tau))$ is Frechet differentiable at $\tau = 0$; iii) the closure of $\{(\partial\gamma(\tau)/\partial\tau,\partial\lambda(\tau)/\partial\tau)|_{\tau=0} : H \in \mathcal{G}\}$ is $\Gamma \times \Lambda$.
Theorem B1: If Assumption B1 is satisfied and equation (2.5) is satisfied for all $H \in \mathcal{G}$, then equation (2.7) is satisfied.
Proof: Let $\bar\psi'(\Delta_\gamma,\Delta_\lambda)$ denote the Frechet derivative of $\bar\psi(\gamma,\lambda)$ at $(\gamma_0,\lambda_0)$ in the direction $(\Delta_\gamma,\Delta_\lambda)$, which exists by i). By ii), the chain rule for Frechet derivatives (e.g. Proposition 7.3.1 of Luenberger, 1969), and eq. (2.5), it follows that for $(\Delta_\gamma^H,\Delta_\lambda^H) = (\partial\gamma(\tau)/\partial\tau,\partial\lambda(\tau)/\partial\tau)|_{\tau=0}$,
\[
\bar\psi'(\Delta_\gamma^H,\Delta_\lambda^H) = \frac{\partial}{\partial\tau}\bar\psi(\gamma(\tau),\lambda(\tau))\Big|_{\tau=0} = 0.
\]
By $\bar\psi'(\Delta_\gamma,\Delta_\lambda)$ being a continuous linear function and iii), it follows that $\bar\psi'(\Delta_\gamma,\Delta_\lambda) = 0$ for all $(\Delta_\gamma,\Delta_\lambda) \in \Gamma \times \Lambda$. Therefore, for any $\gamma \in \Gamma$ and $\lambda \in \Lambda$,
\[
\bar\psi'(\gamma - \gamma_0, 0) = 0,\qquad \bar\psi'(0, \lambda - \lambda_0) = 0.
\]
Equation (2.7) then follows by i). Q.E.D.
Theorem B2: If equation (2.7) is satisfied and, in addition, $\bar\psi(\gamma,\lambda_0)$ and $\bar\psi(\gamma_0,\lambda)$ are twice Frechet differentiable in open sets containing $\gamma_0$ and $\lambda_0$, respectively, with bounded second derivatives, then equation (2.8) is satisfied.
Proof: Follows by Proposition 7.3.3 of Luenberger (1969). Q.E.D.
10 Appendix C: Doubly Robust Moment Functions for
Orthogonality Conditions
In this Appendix we generalize the DR estimators for conditional moment restrictions to orthogonality conditions for a general residual $\rho(w,\gamma)$ that is affine in $\gamma$ but need not have the form $y - \gamma(x)$.
Assumption C1: There are linear sets $\Gamma$ and $\Lambda$ of functions $\gamma(x)$ and $\lambda(z)$ that are closed in mean square such that i) for any $\gamma,\tilde\gamma \in \Gamma$ and scalar $\tau$, $E[\rho(w,\gamma)^2] < \infty$ and $\rho(w,(1-\tau)\gamma+\tau\tilde\gamma) = (1-\tau)\rho(w,\gamma) + \tau\rho(w,\tilde\gamma)$; ii) $E[\lambda(z)\rho(w,\gamma_0)] = 0$ for all $\lambda \in \Lambda$; iii) there exists $\lambda_0 \in \Lambda$ such that $E[g(w,\gamma,\beta_0)] = -E[\lambda_0(z)\rho(w,\gamma)]$ for all $\gamma \in \Gamma$.
Assumption C1 ii) could be thought of as an identification condition for $\gamma_0$. For example, if $\Lambda$ is all functions of $z$ with finite mean square, then ii) is $E[\rho(w,\gamma_0)|z] = 0$, the nonparametric conditional moment restriction of Newey and Powell (1989, 2003). Assumption C1 iii) also has an interesting interpretation. Let $\Pi(a)(z)$ denote the orthogonal mean-square projection of a random variable $a(w)$ with finite second moment on $\Lambda$. Then by ii) and iii) we have, for all $\gamma \in \Gamma$,
\[
E[g(w,\gamma,\beta_0)] = -E[\lambda_0(z)\rho(w,\gamma)] = -E[\lambda_0(z)\Pi(\rho(\cdot,\gamma))(z)]
= -E[\lambda_0(z)\{\Pi(\rho(\cdot,\gamma))(z) - \Pi(\rho(\cdot,\gamma_0))(z)\}]
= -E[\lambda_0(z)\Pi(\rho(\cdot,\gamma) - \rho(\cdot,\gamma_0))(z)].
\]
Here we see that $E[g(w,\gamma,\beta_0)]$ is a linear, mean-square continuous function of $\Pi(\rho(\cdot,\gamma) - \rho(\cdot,\gamma_0))(z)$. The Riesz representation theorem will also imply that if $E[g(w,\gamma,\beta_0)]$ is a linear, mean-square continuous function of $\Pi(\rho(\cdot,\gamma) - \rho(\cdot,\gamma_0))(z)$, then there exists $\lambda_0(z)$ satisfying Assumption C1 iii). For the case where $z = x$ this mean-square continuity condition is necessary for the existence of a root-n consistent estimator, as in Newey (1994a) and Newey and McFadden (1994). We conjecture that when $z$ need not equal $x$ this condition generalizes Severini and Tripathi's (2012) necessary condition for existence of a root-n consistent estimator of $\beta_0$.
Noting that Assumption C1 ii) and iii) are the conditions for double robustness, we have

Theorem C1: If Assumption C1 is satisfied, then $\psi(w,\gamma,\lambda,\beta) = g(w,\gamma,\beta) + \lambda(z)\rho(w,\gamma,\beta)$ is doubly robust.
It is interesting to note that $\lambda_0(z)$ satisfying Assumption C1 iii) need not be unique. When the closure of $\{\Pi(\rho(\cdot,\gamma))(z) : \gamma \in \Gamma\}$ is not all of $\Lambda$, there will exist $\tilde\lambda \in \Lambda$ such that $\tilde\lambda \neq 0$ and
\[
E[\tilde\lambda(z)\rho(w,\gamma)] = E[\tilde\lambda(z)\Pi(\rho(\cdot,\gamma))(z)] = 0\ \text{for all}\ \gamma \in \Gamma.
\]
In that case Assumption C1 iii) will also be satisfied for $\lambda_0(z) + \tilde\lambda(z)$. We can think of this case as one where $\gamma_0$ is overidentified, similarly to Chen and Santos (2015). As discussed in Ichimura and Newey (2017), the different $\lambda_0(z)$ would correspond to different first step estimators.
The partial robustness results of the last section can be extended to the orthogonality condition setting of Assumption C1. Let $\Lambda^*$ be a closed linear subset of $\Lambda$, such as a finite dimensional linear set, and let $\gamma^*$ be such that $E[\lambda(z)\rho(w,\gamma^*)] = 0$ for all $\lambda \in \Lambda^*$. Note that if $\lambda_0 \in \Lambda^*$ it follows by Theorem C1 that
\[
E[g(w,\gamma^*,\beta_0)] = -E[\lambda_0(z)\rho(w,\gamma^*)] = 0.
\]
Theorem C2: If $\Lambda^*$ is a closed linear subset of $\Lambda$, $E[\lambda(z)\rho(w,\gamma^*)] = 0$ for all $\lambda \in \Lambda^*$, and Assumption C1 iii) is satisfied with $\lambda_0 \in \Lambda^*$, then
\[
E[g(w,\gamma^*,\beta_0)] = 0.
\]
11 Appendix D: Regularity Conditions for Plug-in Estimators
In this Appendix we formulate regularity conditions for root-n consistency and asymptotic normality of the plug-in estimator $\tilde\beta$ described in Section 2, where $g(w,\gamma,\beta)$ need not be LR. These conditions are based on Assumptions 4-6 applied to the influence adjustment $\phi(w,\gamma,\lambda,\beta)$ corresponding to $g(w,\gamma,\beta)$ and $\hat\gamma$. For this purpose we treat $\hat\lambda$ as any object that can approximate $\lambda_0$, not just as an estimator of $\lambda_0$.

Theorem D1: If Assumptions 4-6 are satisfied, Assumption 7 is satisfied with $g(w,\gamma,\beta)$ replacing $\psi(w,\gamma,\lambda,\beta)$, $\tilde\beta \overset{p}{\to} \beta_0$, $\hat W \overset{p}{\to} W$, $G'WG$ is nonsingular, $E[\|\psi(w,\gamma_0,\lambda_0,\beta_0)\|^2] < \infty$, and
\[
\hat R_5 = \frac{1}{\sqrt{n}}\sum_{\ell=1}^{L}\sum_{i\in I_\ell}\phi(w_i,\hat\gamma_\ell,\hat\lambda_\ell,\beta_0) \overset{p}{\longrightarrow} 0,
\]
then for $\Omega = E[\psi(w,\gamma_0,\lambda_0,\beta_0)\psi(w,\gamma_0,\lambda_0,\beta_0)']$,
\[
\sqrt{n}(\tilde\beta - \beta_0) \overset{d}{\longrightarrow} N(0,V),\qquad V = (G'WG)^{-1}G'W\Omega WG(G'WG)^{-1}.
\]
The condition $\hat R_5 \overset{p}{\to} 0$ was discussed in Section 7. It is interesting to note that $\hat R_5 \overset{p}{\to} 0$ appears to be a complicated condition that seems to depend on the details of the estimator $\hat\gamma$ in a way that Assumptions 4-7 do not. In this way the regularity conditions for the LR estimator seem to be simpler and more general than those for the plug-in estimator.
Acknowledgements
Whitney Newey gratefully acknowledges support by the NSF. Helpful comments were pro-
vided by M. Cattaneo, B. Deaner, J. Hahn, M. Jansson, Z. Liao, A. Pakes, R. Moon, A. de
Paula, V. Semenova, and participants in seminars at Cambridge, Columbia, Cornell, Harvard-
MIT, UCL, USC, Yale, and Xiamen. B. Deaner provided capable research assistance.
REFERENCES
Ackerberg, D., X. Chen, and J. Hahn (2012): "A Practical Asymptotic Variance Estimator
for Two-step Semiparametric Estimators," The Review of Economics and Statistics 94: 481—498.
Ackerberg, D., X. Chen, J. Hahn, and Z. Liao (2014): "Asymptotic Efficiency of Semipara-
metric Two-Step GMM," The Review of Economic Studies 81: 919—943.
Ai, C. and X. Chen (2003): “Efficient Estimation of Models with Conditional Moment Restric-
tions Containing Unknown Functions,” Econometrica 71, 1795-1843.
Ai, C. and X. Chen (2007): "Estimation of Possibly Misspecified Semiparametric Conditional
Moment Restriction Models with Different Conditioning Variables," Journal of Econometrics
141, 5—43.
Ai, C. and X. Chen (2012): "The Semiparametric Efficiency Bound for Models of Sequential
Moment Restrictions Containing Unknown Functions," Journal of Econometrics 170, 442—457.
Andrews, D.W.K. (1994): “Asymptotics for Semiparametric Models via Stochastic Equiconti-
nuity,” Econometrica 62, 43-72.
Athey, S., G. Imbens, and S. Wager (2017): "Efficient Inference of Average Treatment Ef-
fects in High Dimensions via Approximate Residual Balancing," Journal of the Royal Statistical
Society, Series B, forthcoming.
Bajari, P., V. Chernozhukov, H. Hong, and D. Nekipelov (2009): "Nonparametric and
Semiparametric Analysis of a Dynamic Discrete Game," working paper, Stanford.
Bajari, P., H. Hong, J. Krainer, and D. Nekipelov (2010): "Estimating Static Models of
Strategic Interactions," Journal of Business and Economic Statistics 28, 469-482.
Bang, H., and J.M. Robins (2005): "Doubly Robust Estimation in Missing Data and Causal Inference Models," Biometrics 61, 962-972.
Belloni, A., D. Chen, V. Chernozhukov, and C. Hansen (2012): “Sparse Models and
Methods for Optimal Instruments with an Application to Eminent Domain,” Econometrica 80,
2369—2429.
Belloni, A., V. Chernozhukov, and Y. Wei (2013): “Honest Confidence Regions for Logistic
Regression with a Large Number of Controls,” arXiv preprint arXiv:1304.3969.
Belloni, A., V. Chernozhukov, and C. Hansen (2014): "Inference on Treatment Effects
after Selection among High-Dimensional Controls," The Review of Economic Studies 81, 608—
650.
Belloni, A., V. Chernozhukov, I. Fernandez-Val, and C. Hansen (2016): "Program
Evaluation and Causal Inference with High-Dimensional Data," Econometrica 85, 233-298.
Bera, A.K., G. Montes-Rojas, and W. Sosa-Escudero (2010): "General Specification
Testing with Locally Misspecified Models," Econometric Theory 26, 1838—1845.
Bickel, P.J. (1982): "On Adaptive Estimation," Annals of Statistics 10, 647-671.
Bickel, P.J. and Y. Ritov (1988): "Estimating Integrated Squared Density Derivatives: Sharp Best Order of Convergence Estimates," Sankhya: The Indian Journal of Statistics, Series A 50, 381-393.
Bickel, P.J., C.A.J. Klaassen, Y. Ritov, and J.A. Wellner (1993): Efficient and Adaptive
Estimation for Semiparametric Models, Springer-Verlag, New York.
Bickel, P.J. and Y. Ritov (2003): "Nonparametric Estimators Which Can Be "Plugged-in,"
Annals of Statistics 31, 1033-1053.
Bonhomme, S., and M. Weidner (2018): "Minimizing Sensitivity to Misspecification," working
paper.
Cattaneo, M.D., and M. Jansson (2017): "Kernel-Based Semiparametric Estimators: Small
Bandwidth Asymptotics and Bootstrap Consistency," Econometrica, forthcoming.
Cattaneo, M.D., M. Jansson, and X. Ma (2017): "Two-step Estimation and Inference with
Possibly Many Included Covariates," working paper.
Chamberlain, G. (1987): "Asymptotic Efficiency in Estimation with Conditional Moment Restrictions," Journal of Econometrics 34, 305-334.
Chamberlain, G. (1992): “Efficiency Bounds for Semiparametric Regression,” Econometrica 60,
567—596.
Chen, X. and X. Shen (1997): “Sieve Extremum Estimates for Weakly Dependent Data,”
Econometrica 66, 289-314.
Chen, X., O.B. Linton, and I. van Keilegom (2003): “Estimation of Semiparametric Models
when the Criterion Function Is Not Smooth,” Econometrica 71, 1591-1608.
Chen, X., and Z. Liao (2015): "Sieve Semiparametric Two-Step GMM Under Weak Depen-
dence", Journal of Econometrics 189, 163—186.
Chen, X., and A. Santos (2015): “Overidentification in Regular Models,” working paper.
Chernozhukov, V., C. Hansen, and M. Spindler (2015): "Valid Post-Selection and Post-
Regularization Inference: An Elementary, General Approach," Annual Review of Economics 7:
649—688.
Chernozhukov, V., G.W. Imbens and W.K. Newey (2007): "Instrumental Variable Identi-
fication and Estimation of Nonseparable Models," Journal of Econometrics 139, 4-14.
Chernozhukov, V., D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey
(2017): "Double/Debiased/Neyman Machine Learning of Treatment Effects," American Eco-
nomic Review Papers and Proceedings 107, 261-65.
Chernozhukov, V., D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, and J. Robins (2018): "Debiased/Double Machine Learning for Treatment and Structural Parameters," Econometrics Journal 21, C1-C68.
Chernozhukov, V., J.A. Hausman, and W.K. Newey (2018): "Demand Analysis with Many
Prices," working paper, MIT.
Chernozhukov, V., W.K. Newey, J. Robins (2018): "Double/De-Biased Machine Learning
Using Regularized Riesz Representers," arxiv.
Escanciano, J.C., D. Jacho-Chávez, and A. Lewbel (2016): "Identification and Estimation of Semiparametric Two Step Models," Quantitative Economics 7, 561-589.
Farrell, M. (2015): "Robust Inference on Average Treatment Effects with Possibly More Co-
variates than Observations," Journal of Econometrics 189, 1—23.
Firpo, S. and C. Rothe (2017): "Semiparametric Two-Step Estimation Using Doubly Robust
Moment Conditions," working paper.
Graham, B.W. (2011): "Efficiency Bounds for Missing Data Models with Semiparametric Re-
strictions," Econometrica 79, 437—452.
Hahn, J. (1998): "On the Role of the Propensity Score in Efficient Semiparametric Estimation
of Average Treatment Effects," Econometrica 66, 315-331.
Hahn, J. and G. Ridder (2013): "Asymptotic Variance of Semiparametric Estimators With
Generated Regressors," Econometrica 81, 315-340.
Hahn, J. and G. Ridder (2016): "Three-stage Semi-Parametric Inference: Control Variables and Differentiability," working paper.
Hahn, J., Z. Liao, and G. Ridder (2016): "Nonparametric Two-Step Sieve M Estimation and
Inference," working paper, UCLA.
Hasminskii, R.Z. and I.A. Ibragimov (1978): "On the Nonparametric Estimation of Function-
als," Proceedings of the 2nd Prague Symposium on Asymptotic Statistics, 41-51.
Hausman, J.A., and W.K. Newey (2016): "Individual Heterogeneity and Average Welfare,"
Econometrica 84, 1225-1248.
Hausman, J.A., and W.K. Newey (2017): "Nonparametric Welfare Analysis," Annual Review
of Economics 9, 521—546.
Hirano, K., G. Imbens, and G. Ridder (2003): "Efficient Estimation of Average Treatment
Effects Using the Estimated Propensity Score," Econometrica 71: 1161—1189.
Hotz, V.J. and R.A. Miller (1993): "Conditional Choice Probabilities and the Estimation of
Dynamic Models," Review of Economic Studies 60, 497-529.
Huber, P. (1981): Robust Statistics, New York: Wiley.
Ichimura, H. (1993): "Estimation of Single Index Models," Journal of Econometrics 58, 71-120.
Ichimura, H., and S. Lee (2010): "Characterization of the Asymptotic Distribution of Semi-
parametric M-Estimators," Journal of Econometrics 159, 252-266.
Ichimura, H. and W.K. Newey (2017): "The Influence Function of Semiparametric Estima-
tors," CEMMAP Working Paper, CWP06/17.
Kandasamy, K., A. Krishnamurthy, B. Póczos, L. Wasserman, and J.M. Robins (2015):
"Influence Functions for Machine Learning: Nonparametric Estimators for Entropies, Diver-
gences and Mutual Informations," arXiv.
Lee, Lung-fei (2005): "A C(α)-type Gradient Test in the GMM Approach," working paper.
Luenberger, D.G. (1969): Optimization by Vector Space Methods, New York: Wiley.
Murphy, K.M. and R.H. Topel (1985): "Estimation and Inference in Two-Step Econometric
Models," Journal of Business and Economic Statistics 3, 370-379.
Newey, W.K. (1984): "A Method of Moments Interpretation of Sequential Estimators," Eco-
nomics Letters 14, 201-206.
Newey, W.K. (1990): "Semiparametric Efficiency Bounds," Journal of Applied Econometrics 5,
99-135.
Newey, W.K. (1991): "Uniform Convergence in Probability and Stochastic Equicontinuity,"
Econometrica 59, 1161-1167.
Newey, W.K. (1994a): "The Asymptotic Variance of Semiparametric Estimators," Econometrica
62, 1349-1382.
Newey, W.K. (1994b): "Kernel Estimation of Partial Means and a General Variance Estimator,"
Econometric Theory 10, 233-253.
Newey, W.K. (1997): "Convergence Rates and Asymptotic Normality for Series Estimators,"
Journal of Econometrics 79, 147-168.
Newey, W.K. (1999): "Consistency of Two-Step Sample Selection Estimators Despite Misspeci-
fication of Distribution," Economics Letters 63, 129-132.
Newey, W.K., and D. McFadden (1994): "Large Sample Estimation and Hypothesis Testing,"
in Handbook of Econometrics, Vol. 4, ed. by R. Engle, and D. McFadden, pp. 2113-2241. North
Holland.
Newey, W.K., and J.L. Powell (1989): "Instrumental Variable Estimation of Nonparametric
Models," presented at Econometric Society winter meetings, 1988.
Newey, W.K., and J.L. Powell (2003): "Instrumental Variable Estimation of Nonparametric
Models," Econometrica 71, 1565-1578.
Newey, W.K., F. Hsieh, and J.M. Robins (1998): "Undersmoothing and Bias Corrected
Functional Estimation," MIT Dept. of Economics working paper.
Newey, W.K., F. Hsieh, and J.M. Robins (2004): "Twicing Kernels and a Small Bias Property
of Semiparametric Estimators," Econometrica 72, 947-962.
Newey, W.K., and J. Robins (2017): "Cross Fitting and Fast Remainder Rates for Semipara-
metric Estimation," arXiv.
Neyman, J. (1959): "Optimal Asymptotic Tests of Composite Statistical Hypotheses," Probability
and Statistics: The Harald Cramér Volume, ed. U. Grenander, New York: Wiley.
Pfanzagl, J., and W. Wefelmeyer (1982): Contributions to a General Asymptotic Statistical
Theory, Springer Lecture Notes in Statistics.
Pakes, A. and G.S. Olley (1995): "A Limit Theorem for a Smooth Class of Semiparametric
Estimators," Journal of Econometrics 65, 295-332.
Powell, J.L., J.H. Stock, and T.M. Stoker (1989): "Semiparametric Estimation of Index
Coefficients," Econometrica 57, 1403-1430.
Robins, J.M., A. Rotnitzky, and L.P. Zhao (1994): "Estimation of Regression Coefficients
When Some Regressors Are Not Always Observed," Journal of the American Statistical Associ-
ation 89, 846-866.
Robins, J.M. and A. Rotnitzky (1995): "Semiparametric Efficiency in Multivariate Regression
Models with Missing Data," Journal of the American Statistical Association 90, 122-129.
Robins, J.M., A. Rotnitzky, and L.P. Zhao (1995): "Analysis of Semiparametric Regression
Models for Repeated Outcomes in the Presence of Missing Data," Journal of the American
Statistical Association 90, 106-121.
Robins, J.M., and A. Rotnitzky (2001): Comment on "Inference for Semiparametric Models:
Some Questions and an Answer" by P.J. Bickel and J. Kwon, Statistica Sinica 11, 863-960.
Robins, J.M., A. Rotnitzky, and M. van der Laan (2000): "Comment on 'On Profile
Likelihood' by S.A. Murphy and A.W. van der Vaart," Journal of the American Statistical
Association 95, 431-435.
Robins, J., M. Sued, Q. Lei-Gomez, and A. Rotnitzky (2007): "Comment: Performance of
Double-Robust Estimators When Inverse Probability Weights Are Highly Variable," Statistical
Science 22, 544-559.
Robins, J.M., L. Li, E. Tchetgen, and A. van der Vaart (2008): "Higher Order Influence
Functions and Minimax Estimation of Nonlinear Functionals," IMS Collections Probability and
Statistics: Essays in Honor of David A. Freedman, Vol 2, 335-421.
Robins, J.M., L. Li, R. Mukherjee, E. Tchetgen, and A. van der Vaart (2017): "Higher
Order Estimating Equations for High-Dimensional Models," Annals of Statistics, forthcoming.
Robinson, P.M. (1988): "Root-N-Consistent Semiparametric Regression," Econometrica 56,
931-954.
Rust, J. (1987): "Optimal Replacement of GMC Bus Engines: An Empirical Model of Harold
Zurcher," Econometrica 55, 999-1033.
Santos, A. (2011): "Instrumental Variable Methods for Recovering Continuous Linear Function-
als," Journal of Econometrics, 161, 129-146.
Scharfstein, D.O., A. Rotnitzky, and J.M. Robins (1999): Rejoinder to "Adjusting for
Nonignorable Drop-out Using Semiparametric Non-response Models," Journal of the American
Statistical Association 94, 1135-1146.
Severini, T. and G. Tripathi (2006): "Some Identification Issues in Nonparametric Linear
Models with Endogenous Regressors," Econometric Theory 22, 258-278.
Severini, T. and G. Tripathi (2012): "Efficiency Bounds for Estimating Linear Functionals of
Nonparametric Regression Models with Endogenous Regressors," Journal of Econometrics 170,
491-498.
Schick, A. (1986): "On Asymptotically Efficient Estimation in Semiparametric Models," Annals
of Statistics 14, 1139-1151.
Stoker, T. (1986): "Consistent Estimation of Scaled Coefficients," Econometrica 54, 1461-1482.
Tamer, E. (2003): "Incomplete Simultaneous Discrete Response Model with Multiple Equilibria,"
Review of Economic Studies 70, 147-165.
van der Laan, M. and D. Rubin (2006): "Targeted Maximum Likelihood Learning," U.C. Berkeley
Division of Biostatistics Working Paper Series. Working Paper 213.
van der Vaart, A.W. (1991): "On Differentiable Functionals," The Annals of Statistics 19,
178-204.
van der Vaart, A.W. (1998): Asymptotic Statistics, Cambridge University Press, Cambridge,
England.
van der Vaart, A.W. (2014): "Higher Order Tangent Spaces and Influence Functions," Statis-
tical Science 29, 679-686.
Wooldridge, J.M. (1991): "On the Application of Robust, Regression-Based Diagnostics to
Models of Conditional Means and Conditional Variances," Journal of Econometrics 47, 5-46.