
ERRORS IN THE DEPENDENT VARIABLE OF QUANTILE REGRESSION MODELS

JERRY HAUSMAN, YE LUO, AND CHRISTOPHER PALMER

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

APRIL 2014

*PRELIMINARY AND INCOMPLETE* *PLEASE DO NOT CITE OR CIRCULATE WITHOUT PERMISSION*

Abstract. The usual quantile regression estimator of Koenker and Bassett (1978) is biased if there is an additive error term in the dependent variable. We analyze this problem as an errors-in-variables problem where the dependent variable suffers from classical measurement error and develop a sieve maximum-likelihood approach that is robust to left-hand-side measurement error. After describing sufficient conditions for identification, we employ global optimization algorithms (specifically, a genetic algorithm) to solve the MLE optimization problem. We also propose a computationally tractable EM-type algorithm that accelerates computation in the neighborhood of the optimum. When the number of knots in the quantile grid is chosen to grow at an adequate speed, the sieve maximum-likelihood estimator is asymptotically normal. We verify our theoretical results with Monte Carlo simulations and illustrate our estimator with an application to the returns to education, highlighting important changes over time in the returns to education that have been obscured in previous work by measurement-error bias.

Keywords: Measurement Error, Quantile Regression, Functional Analysis, EM Algorithm

Hausman: Professor of Economics, MIT Economics; [email protected]. Luo: PhD Candidate, MIT Economics; [email protected]. Palmer: PhD Candidate, MIT Economics; [email protected]. We thank Victor Chernozhukov and Kirill Evdokimov for helpful discussions, as well as workshop participants at MIT. Cuicui Chen provided research assistance.


1. Introduction

In this paper, we study the consequences of measurement error in the dependent variable of conditional quantile models and propose an adaptive EM algorithm to consistently estimate the distributional effects of covariates in such a setting. The errors-in-variables (EIV) problem has received significant attention in the linear model, including the well-known results that classical measurement error causes attenuation bias if present in the regressors and has no effect on unbiasedness if present in the dependent variable. See Hausman (2001) for an overview.

In general, the linear model results do not hold in nonlinear models.1 We are particularly interested in the linear quantile regression setting.2 Hausman (2001) observes that EIV in the dependent variable in quantile regression models generally leads to significant bias, a result very different from the linear model intuition.

In general, EIV in the dependent variable can be viewed as a mixture model.3 One way to estimate a mixture distribution is to use the expectation-maximization (EM) algorithm.4 We show that under certain discontinuity assumptions, by choosing the growth speed of the number of knots in the quantile grid, our estimator converges at a fractional polynomial rate in n and is asymptotically normal. We suggest using the bootstrap for inference. However, in practice we find that the EM algorithm is only feasible in a local neighborhood of the true parameter. We therefore recommend global optimization algorithms, such as the genetic algorithm, to compute the sieve ML estimator we propose.

Intuitively, the estimated quantile regression line xiβ(τ) for quantile τ may be far from the observed yi because of LHS measurement error or because the unobserved conditional quantile ui of observation i is far from τ. Our MLE framework, implemented via an EM algorithm, effectively estimates the likelihood that a given εik is large because of measurement error rather than because of an unobserved conditional quantile that is far away from τk. The estimate of the joint distribution of the conditional quantile and the measurement error allows us to weight the log-likelihood contribution of observation i more in the estimation of β(τk) when observation i is likely to have ui ≈ τk. In the case of Gaussian errors in variables, this estimator reduces to weighted least squares, with weights equal to the probability of observing the quantile-specific residual for a given observation as a fraction of the total probability of the same observation's residuals across all quantiles.

1 Schennach (2008) establishes identification and a consistent nonparametric estimator when EIV exists in an explanatory variable. Studies focusing on nonlinear models in which the left-hand-side variable is measured with error include Hausman et al. (1998) and Cosslett (2004), who study probit and tobit models, respectively.
2 Carroll and Wei (2009) propose an iterative estimator for quantile regression when one of the regressors has EIV.
3 A common feature of mixture models under a semiparametric or nonparametric framework is the ill-posed inverse problem; see Evdokimov (2010). We face the ill-posed problem here, and our model specifications are linked to the Fredholm integral equation of the first kind. The inverse of such integral equations is usually ill-posed even if the integral kernel is positive definite. The key symptom of these model specifications is that the high-frequency signal of the object we are interested in is wiped out, or at least shrunk, by the unknown noise if its distribution is smooth. Recovering these signals is difficult, and all feasible estimators have a slower rate of convergence than the usual √n case. The convergence speed of our estimator relies on the decay speed of the eigenvalues of the integral operator. We explain this technical problem in more detail in the related section of this paper.
4 Dempster, Laird, and Rubin (1977) propose the EM algorithm, which performs well in finite mixture models when the EIV follows some (unknown) parametric family of distributions. This algorithm can be extended to infinite mixture models.

An empirical example (extending Angrist et al., 2006) studies the heterogeneity of returns to education across conditional quantiles of the wage distribution. We find that when we correct for likely measurement error in the self-reported wage data, we estimate considerably more heterogeneity across the wage distribution in the returns to education. In particular, the education coefficient for the bottom of the wage distribution is lower than previously estimated, and the returns to education for latently high-wage individuals have been increasing over time and are much higher than previously estimated. By 2000, the returns to education for the top of the conditional wage distribution are over three times larger than the returns for any other segment of the distribution.

The rest of the paper proceeds as follows. In Section 2, we introduce the model specification and identification conditions. In Section 3, we consider the MLE estimation method and analyze its properties. In Sections 4 and 5, we discuss sieve estimation and the EM algorithm. We present Monte Carlo simulation results in Section 6, and Section 7 contains our empirical application. Section 8 concludes. The Appendix contains an extension using the deconvolution method and additional proofs.

Notation: Define the domain of x as X and the space of y as Y. Denote by a ∧ b the minimum of a and b and by a ∨ b the larger of a and b. Let →d denote weak convergence (convergence in distribution), →p convergence in probability, and →d* weak convergence in outer probability. Let f(ε|σ) be the p.d.f. of the EIV ε, parametrized by σ. Assume the true parameters are β0(·) and σ0, the coefficient function of the quantile model and the parameter of the density function of the EIV, respectively. Let dx be the dimension of x, Ω the domain of σ, and dσ the dimension of the parameter σ. Define ||(β0, σ0)|| := (||β0||₂² + ||σ0||₂²)^{1/2} as the L2 norm of (β0, σ0), where || · ||₂ is the usual Euclidean norm. For βk ∈ Rᵏ, define ||(βk, σ0)||² := ||βk||₂²/k + ||σ0||₂².

2. Model and Identification

We consider the standard linear conditional quantile model, where the τth quantile of the dependent variable y∗ is a linear function of x:

Qy∗(τ|x) = xβ(τ).

However, we are interested in the situation where y∗ is not directly observed, and we instead observe y, where

y = y∗ + ε


Figure 1. Check Function ρτ (z)


and ε is a mean-zero, i.i.d. error term independent of y∗, x, and τ. Unlike the linear regression case, where EIV in the left-hand-side variable does not matter for consistency and asymptotic normality, EIV in the dependent variable can lead to severe bias in quantile regression. More specifically, with ρτ(z) denoting the check function (plotted in Figure 1),

ρτ (z) = z(τ − 1(z < 0)),

the minimization problem in the usual quantile regression

β(τ) ∈ argmin_b E[ρτ(y − xb)], (2.1)

is generally no longer minimized at the true β0(τ) when EIV exists in the dependent variable. When there is no EIV in the left-hand-side variable, i.e., y∗ is observed, the FOC is

E[x(τ − 1(y∗ < xβ(τ)))] = 0, (2.2)

where the true β(τ) is the solution to the above system of FOC conditions, as shown by Koenker and Bassett (1978). However, with left-hand-side EIV, the FOC condition determining β(τ) becomes

E[x(τ − 1(y∗ + ε < xβ(τ)))] = 0. (2.3)

For τ ≠ .5, the presence of ε will result in the FOC being satisfied at a different estimate of β than in equation (2.2), even in the case where ε is symmetrically distributed, because of the asymmetry of the check function. In other words, in the minimization problem, observations for


which y∗ ≥ xβ(τ), and which should therefore get a weight of τ, may end up on the left-hand side of the check function, receiving a weight of (1 − τ). Thus, equal-sized differences on either side of zero do not cancel each other out. For median regression, τ = .5 and so ρ.5(·) is symmetric around zero. This means that if ε is symmetrically distributed and β(τ) is symmetrically distributed around τ = .5 (as would be the case, for example, if β(τ) were linear in τ), the expectation in equation (2.3) holds for the true β0(τ). However, for non-symmetric ε, equation (2.3) is not satisfied at the true β0(τ).

Below, we use a simple Monte-Carlo example to demonstrate the bias in quantile regression when there are random disturbances in the dependent variable y.

Example. The Data-Generating Process for the Monte-Carlo results is

yi = β0(ui) + x1iβ1(ui) + x2iβ2(ui) + εi

with the measurement error εi distributed as N(0, σ²) and the unobserved conditional quantile ui of observation i following ui ∼ U[0, 1]. The coefficient function β(τ) has components β0(τ) = 0, β1(τ) = exp(τ), and β2(τ) = √τ. The variables x1 and x2 are drawn from independent log-normal distributions LN(0, 1). The number of observations is 1,000.

Table 1 presents Monte-Carlo results for three cases: when there is no measurement error and when the variance of ε equals 4 and 16. The simulation results show that in the presence of measurement error, the quantile regression estimator is severely biased.5 Comparing the mean bias when the variance of the measurement error increases from 4 to 16 shows that the bias is increasing in the variance of the measurement error. Intuitively, information about the functional parameter β(·) decays as the variance of the EIV becomes larger.
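To make the source of the bias concrete, the following minimal sketch simulates this data-generating process and compares standard quantile regression estimates of β1(τ) with and without measurement error. It is an illustration under the DGP above with σ² = 4; the use of statsmodels' QuantReg and the variable names are our own choices, not part of the paper.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.regression.quantile_regression import QuantReg

rng = np.random.default_rng(0)
n = 1000
u = rng.uniform(size=n)                              # unobserved conditional quantile u_i
x1 = rng.lognormal(0.0, 1.0, size=n)                 # LN(0,1) covariates
x2 = rng.lognormal(0.0, 1.0, size=n)
y_star = 0.0 + x1 * np.exp(u) + x2 * np.sqrt(u)      # true conditional quantile line
y = y_star + rng.normal(0.0, 2.0, size=n)            # add N(0, 4) measurement error

X = sm.add_constant(np.column_stack([x1, x2]))
for tau in (0.1, 0.25, 0.5, 0.75, 0.9):
    b_clean = QuantReg(y_star, X).fit(q=tau).params[1]   # no EIV
    b_noisy = QuantReg(y, X).fit(q=tau).params[1]        # with EIV
    print(f"tau={tau:.2f}  true={np.exp(tau):.3f}  "
          f"no-EIV={b_clean:.3f}  EIV={b_noisy:.3f}")
```

The no-EIV estimates track exp(τ) closely, while the EIV estimates are pulled toward the median-regression coefficient, mirroring the pattern in Table 1.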

2.1. Identification and Regularity Conditions. In the linear quantile model, it is assumed that for any x ∈ X, xβ(τ) is increasing in τ. Suppose x1, ..., xdx are dx linearly independent vectors in int(X). Then Qy∗(τ|xi) = xiβ(τ) must be strictly increasing in τ. Consider the linear transformation of the model with matrix A = [x1, ..., xdx]:

Qy∗(τ|x) = xβ(τ) = (xA⁻¹)(Aβ(τ)). (2.4)

Let x̃ = xA⁻¹ and β̃(τ) = Aβ(τ). The transformed model becomes

Qy∗(τ|x̃) = x̃β̃(τ), (2.5)

with every coefficient β̃i(τ) weakly increasing in τ for 1 ≤ i ≤ dx. Therefore, WLOG, we can assume that the coefficients β(τ) are increasing. Such βi(τ), 1 ≤ i ≤ dx, are called co-monotonic functions. From now on, we will assume that each βi(τ) is an increasing function with the properties of β(τ). All of the convergence and asymptotic results for β̃i(τ) carry over to the parameter βi after applying the inverse transform A⁻¹.

5 A notable exception is median regression. We find that for symmetrically distributed EIV and uniformly distributed β(τ), the median regression results are unbiased.


Table 1. Monte-Carlo Results: Mean Bias

                                        Quantile (τ)
Parameter        EIV Distribution     0.1     0.25    0.5     0.75    0.9
β1(τ) = e^τ      ε = 0                0.006   0.003   0.002   0.000  -0.005
                 ε ∼ N(0, 4)          0.196   0.155   0.031  -0.154  -0.272
                 ε ∼ N(0, 16)         0.305   0.246   0.054  -0.219  -0.391
                 True parameter:      1.105   1.284   1.649   2.117   2.46
β2(τ) = √τ       ε = 0                0.000  -0.003  -0.005  -0.006  -0.006
                 ε ∼ N(0, 4)          0.161   0.068  -0.026  -0.088  -0.115
                 ε ∼ N(0, 16)         0.219   0.101  -0.031  -0.128  -0.174
                 True parameter:      0.316   0.5     0.707   0.866   0.949

Notes: Table reports mean bias (across 500 simulations) of slope coefficients estimated for each quantile τ from standard quantile regression of y on a constant, x1, and x2, where y = x1β1(τ) + x2β2(τ) + ε and ε is either zero (no measurement error case, i.e., y∗ is observed) or distributed normally with variance 4 or 16. The covariates x1 and x2 are i.i.d. draws from LN(0, 1). N = 1,000.

Condition C1 (Properties of β(·)). We assume the following properties of the coefficient vector β(τ):

(1) β(τ) is in the space M[B1 × B2 × ... × Bdx], where the functional space M is defined as the collection of all functions f = (f1, ..., fdx) : [0, 1] → B1 × ... × Bdx, with Bi ⊂ R a closed interval ∀ i ∈ {1, ..., dx}, such that each entry fi : [0, 1] → Bi is monotonically increasing in τ.
(2) Let Bi = [li, ui] so that li < β0i(τ) < ui ∀ i ∈ {1, ..., dx} and τ ∈ [0, 1].
(3) β0 is a vector of C1 functions with derivatives bounded from below by a positive constant ∀ i ∈ {1, ..., dx}.
(4) The domain of the parameter σ is a compact space Ω, and the true value σ0 is in the interior of Ω.

Under assumption C1, it is easy to see that the parameter space D := M × Ω is compact.

Lemma 1. The space M[B1 × B2 × ... × Bdx] is a compact and complete space under Lp, for any p ≥ 1.

Proof. See Appendix B.1.

Monotonicity of βi(·) is important for identification because the likelihood depends on β(·) only through f(y|x) = ∫₀¹ f(y − xβ(u)|σ)du, which is invariant when the distribution of the random variable β(u) is invariant. The function β(·) is therefore unidentified if we do not impose further restrictions. Given the distribution of the Rdx-valued random variable β(u), u ∈ [0, 1], the vector of functions β : [0, 1] → B1 × B2 × ... × Bdx is unique under rearrangement if the functions βi, 1 ≤ i ≤ dx, are co-monotonic.


Condition C2 (Properties of x). We assume the following properties of the design matrix x:

(1) E[x′x] is non-singular.
(2) The domain of x, denoted X, is continuous in at least one dimension, i.e., there exists k ∈ {1, ..., dx} such that for every feasible x−k there is an open set Xk ⊂ R such that (Xk, x−k) ⊂ X.
(3) Without loss of generality, βk(0) ≥ 0.

Condition C3 (Properties of EIV). We assume the following properties of the measurement error ε:

(1) The probability density function f(ε|σ) is differentiable in σ.
(2) For all σ ∈ Ω, there exists a uniform constant C > 0 such that E[| log f(ε|σ)|] < C.
(3) f(·) is non-zero everywhere on R and bounded from above.
(4) E[ε] = 0.
(5) Denote by φ(s|σ) := ∫ exp(isε)f(ε|σ)dε the characteristic function of ε.
(6) For any σ1 and σ2 in the domain Ω, there exists a neighborhood of 0 in which φ(s|σ1)/φ(s|σ2) can be expanded as 1 + Σ_{k≥2} ak(is)^k.

Denote θ := (β(·), σ) ∈ D. For any θ, define the expected log-likelihood function L(θ) as follows:

L(θ) = E[log g(y|x, θ)], (2.6)

where the density function g(y|x, θ) is defined as

g(y|x, θ) = ∫₀¹ f(y − xβ(u)|σ)du. (2.7)

We define the empirical likelihood as

Ln(θ) = En[log g(y|x, θ)]. (2.8)
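As a computational illustration, the mixture density g(y|x, θ) in (2.7) can be approximated by averaging f over a grid of quantiles, which is essentially what the piecewise-constant sieve of Section 4 does. The sketch below assumes a Gaussian EIV density and a user-supplied matrix of sieve coefficients; all names and the example numbers are illustrative, not from the paper.

```python
import numpy as np
from scipy.stats import norm

def log_likelihood(y, X, beta_grid, sigma):
    """Sample log-likelihood E_n[log g(y|x, theta)] with
    g(y|x, theta) ~= (1/K) * sum_k f(y - x*beta(tau_k) | sigma),
    where beta_grid is a (K, d) array of sieve coefficients beta(tau_k)
    and f is the N(0, sigma^2) density of the measurement error."""
    resid = y[:, None] - X @ beta_grid.T           # (n, K) residuals eps_ik
    g = norm.pdf(resid, scale=sigma).mean(axis=1)  # average over the tau grid
    return np.log(g).mean()

# illustrative usage with arbitrary numbers
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(500), rng.lognormal(size=500)])
beta_grid = np.column_stack([np.zeros(9), np.exp(np.linspace(0.1, 0.9, 9))])
y = X @ beta_grid[4] + rng.normal(0, 2, size=500)
print(log_likelihood(y, X, beta_grid, sigma=2.0))
```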

The main identification results rely on the monotonicity of xβ(τ). The global identification condition is the following:

Condition C4 (Identification). There does not exist (β1, σ1) ≠ (β0, σ0) in the parameter space D such that g(y|x, β1, σ1) = g(y|x, β0, σ0) for all (x, y) with positive continuous density or positive mass.

Lemma 2 (Nonparametric Global Identification). Under conditions C1-C3, for any β(·) and f(·) that generate the same density of y|x almost everywhere as the true functions β0(·) and f0(·), it must be that

β(τ) = β0(τ) and f(ε) = f0(ε).

Proof. See Appendix B.1.


We also summarize the local identification condition as follows:

Lemma 3 (Local Identification). Define p(·) and δ as the deviations of a given β and σ from the truth: p(τ) = β(τ) − β0(τ) and δ = σ − σ0. Then there do not exist a function p(τ) ∈ L2[0, 1] and δ ∈ Rdσ such that for almost all (x, y) with positive continuous density or positive mass

∫₀¹ f(y − xβ(τ)|σ) xp(τ)dτ = ∫₀¹ δ fσ(y − xβ(τ))dτ,

except p(τ) = 0 and δ = 0.

Proof. See Appendix B.1.

Condition C5 (Stronger local identification for σ). For any δ ∈ S^{dσ−1},

inf_{p(τ)∈L2[0,1]} ( ∫₀¹ fy(y − xβ(τ)|σ) xp(τ)dτ − ∫₀¹ δ fσ(y − xβ(τ))dτ )² > 0. (2.9)

3. Maximum Likelihood Estimator

3.1. Consistency. The ML estimator is defined as:

(β̂(·), σ̂) ∈ argmax_{(β(·),σ)∈D} En[log g(y|x, β(·), σ)]. (3.1)

The following lemma states the consistency property of the ML estimator.

Lemma 4 (MLE Consistency). Under conditions C1-C3, the random coefficients β(τ) and the parameter σ that determines the distribution of ε are identified in the parameter space D. The maximum-likelihood estimator

(β̂(·), σ̂) ∈ argmax_{(β(·),σ)∈D} En[ log ∫₀¹ f(y − xβ(τ)|σ)dτ ]

exists and converges to the true parameter (β0(·), σ0) under the L∞ norm in the functional space M and the Euclidean norm in Ω with probability approaching 1.

Proof. See Appendix B.2.

The identification theorem is a special version of a general MLE consistency theorem (Van der Vaart, 2000). Two conditions play critical roles here: the co-monotonicity of the β(·) function and the local continuity of at least one right-hand-side variable. If we do not restrict the estimator to the family of monotone functions, then we lose compactness of the parameter space D and the consistency argument fails.

3.2. Ill-posed Fredholm Integration of the First Kind. In the usual MLE setting with a finite-dimensional parameter, the Fisher information matrix I is defined as

I := E[(∂f/∂θ)(∂f/∂θ)′].

In our case, since the parameter is continuous, the informational kernel I(u, v) is defined as6

I(u, v)[p(v), δ] = E[ (fβ(u)/g, gσ/g)′ ( ∫₀¹ (fβ(v)/g) p(v)dv + (gσ/g)δ ) ]. (3.2)

If we assume that we know the true σ, i.e., δ = 0, then the informational kernel becomes

I0(u, v) := E[(fu/g)(fv/g)].

By condition C5, the eigenvalues of I and I0 are all non-zero. Since E[x′x] < ∞, I0(u, v) is a compact (Hilbert-Schmidt) operator. By the Riesz Representation Theorem, I0(u, v) has countable eigenvalues λ1 ≥ λ2 ≥ ... > 0, with 0 the only limit point of this sequence, and it can be written in the following form:

I0(·) = Σ_{i≥1} λi ⟨ψi, ·⟩ ψi,

where {ψi, i = 1, 2, ...} is a system of orthogonal basis functions. For any function s(·), ∫₀¹ I0(u, v)h(v)dv = s(u) is called a Fredholm integral equation of the first kind. Thus, the first-order condition of the ML estimator is ill-posed. In the next subsection we establish convergence rate results for the ML estimator using a deconvolution method.7

Although the estimation problem is ill-posed for the function β, it is not ill-posed for the finite-dimensional parameter σ, given that E[gσgσ′] is non-singular. In the Lemma below, we show that σ̂ converges to σ0 at rate 1/√n.

Lemma 5 (Estimation of σ). If E[gσgσ′] is positive definite and conditions C1-C5 hold, the ML estimator σ̂ has the following property:

σ̂ − σ0 = Op(1/√n). (3.3)

Proof. See Appendix B.2.

6 For notational convenience, we abbreviate ∂f(y − xβ0(τ)|σ0)/∂β(τ) as fβ(τ), g(y|x, β0(τ), σ0) as g, and ∂g/∂σ as gσ.
7 Kuhn (1990) shows that if the integral kernel is positive definite with smoothness degree r, then the eigenvalue λi decays at speed O(i^{−r−1}). However, it is generally more difficult to obtain a lower bound for the eigenvalues. The decay speed of the eigenvalues of the information kernel is essential in obtaining the convergence rate, and from the Kuhn (1990) results we see that the decay speed is linked to the degree of smoothness of the function f: the less smooth the function f is, the slower the decay. Later in the paper we show that by assuming some discontinuity conditions, we can obtain a polynomial rate of convergence.

3.3. Bounds for the Maximum Likelihood Estimator. In this subsection, we use the deconvolution method to establish bounds for the maximum likelihood estimator. Recall that


the maximum likelihood estimator is the solution

(β̂, σ̂) = argmax_{(β,σ)∈D} En[log g(y|x, β, σ)].

Define χ(s0) = sup_{|s|≤s0} |1/φε(s|σ0)|. We use the following smoothness assumptions on the distribution of ε (see Evdokimov, 2010).

Condition C6 (Ordinary Smoothness (OS)). χ(s0) ≤ C(1 + |s0|^λ).

Condition C7 (Super Smoothness (SS)). χ(s0) ≤ C1(1 + |s0|^{C2}) exp(|s0|^λ/C3).

The Laplace distribution L(0, b) satisfies the Ordinary Smoothness (OS) condition with λ = 2. The chi-squared distribution χ²_k satisfies the OS condition with λ = k/2. The Gamma distribution Γ(k, θ) satisfies the OS condition with λ = k. The Exponential distribution satisfies the OS condition with λ = 1. The Cauchy distribution satisfies the Super Smoothness (SS) condition with λ = 1. The Normal distribution satisfies the SS condition with λ = 2.

Lemma 6 (MLE Convergence Speed). Under assumptions C1-C5 and the results in Lemma 4,

(1) if Ordinary Smoothness holds (C6), then the ML estimator of β satisfies, for all τ,

|β̂(τ) − β0(τ)| ≲ n^{−1/(2(1+λ))};

(2) if Super Smoothness holds (C7), then the ML estimator of β satisfies, for all τ,

|β̂(τ) − β0(τ)| ≲ log(n)^{−1/λ}.

Proof. See Appendix B.2.

4. Sieve Estimation

In the last section we demonstrated that the maximum likelihood estimator restricted to the parameter space D converges to the true parameter with probability approaching 1. However, the estimator still lives in a large space, with β(·) a vector of dx co-monotone functions and σ a finite-dimensional parameter. Although such an estimator exists in theory, in practice it is computationally infeasible to search for the maximizer within this large space. Instead, we consider a series estimator of β(·) to mimic the co-monotone functions β(·).

There can be many choices of series, for example, triangular series, (Chebyshev) power series, B-splines, etc. In this paper, we choose splines because of their computational advantages in calculating the Sieve estimator used in the EM algorithm. For simplicity, we use the piecewise-constant Sieve space. Below we introduce the formal definition of our Sieve space.

Definition 1. Define Dk = Θk × Ω, where Θk stands for the space of increasing piecewise-constant functions from [0, 1] to Rdx with knots at i/k, i = 0, 1, ..., k − 1. In other words, for any β ∈ Θk, βj is a piecewise-constant function on the intervals [i/k, (i + 1)/k) for 0 ≤ i ≤ k − 1 and 1 ≤ j ≤ dx.

Furthermore, let Pk denote the product space of all dx-dimensional piecewise-constant functions and Rdσ.
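For concreteness, a piecewise-constant sieve element can be stored as a (k × dx) array of knot values, with monotonicity in τ imposed coordinate-wise. The helper below is a minimal sketch of evaluating such a β(·) at an arbitrary τ and of enforcing co-monotonicity; the names are illustrative and not from the paper.

```python
import numpy as np

def eval_sieve_beta(beta_knots, tau):
    """Evaluate a piecewise-constant sieve coefficient function.

    beta_knots : (k, d) array, row i holds beta on [i/k, (i+1)/k).
    tau        : scalar or array of quantiles in [0, 1].
    """
    k = beta_knots.shape[0]
    idx = np.minimum((np.asarray(tau) * k).astype(int), k - 1)
    return beta_knots[idx]

def make_monotone(raw_knots):
    """Impose co-monotonicity by a cumulative max along the tau grid."""
    return np.maximum.accumulate(raw_knots, axis=0)

beta_knots = make_monotone(np.random.default_rng(0).normal(size=(10, 2)))
print(eval_sieve_beta(beta_knots, [0.05, 0.5, 0.95]))
```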


We know that the L2 distance from the space Dk to the true parameter θ0 satisfies d2(θ0, Dk) ≤ C/k for some generic constant C. The Sieve estimator is defined as follows:

Definition 2 (Sieve Estimator).

(β̂k, σ̂k) = argmax_{(β,σ)∈Dk} En[log g(y|x, β, σ)]. (4.1)

Let d(θ1, θ2)² := E[∫ (g(y|x, θ1) − g(y|x, θ2))²/g(y|x, θ0) dy] be a pseudo-metric on the parameter space D. We know that d(θ, θ0)² ≤ L(θ0|θ0) − L(θ|θ0).

Let || · ||d be a norm such that ||θ0 − θ||d ≤ C d(θ0, θ). Our norm || · ||d here is chosen to be

||θ||d² := ⟨θ, I(u, v)[θ]⟩. (4.2)

By Condition C4, || · ||d is indeed a metric. For the Sieve space Dk and any θk ∈ Dk, define I as the following matrix:

I := E[ ( (1/g)∫_0^{1/k} fτ dτ, (1/g)∫_{1/k}^{2/k} fτ dτ, ..., (1/g)∫_{(k−1)/k}^{1} fτ dτ, gσ/g )′ ( (1/g)∫_0^{1/k} fτ dτ, (1/g)∫_{1/k}^{2/k} fτ dτ, ..., (1/g)∫_{(k−1)/k}^{1} fτ dτ, gσ/g ) ].

Again, by Condition C4, this matrix I is non-singular. Furthermore, if a certain discontinuity condition is assumed, the smallest eigenvalue of I can be shown to be bounded from below at a polynomial rate in k.

Condition C8 (Discontinuity of f). Suppose there exists a positive integer λ such that f ∈ C^{λ−1}(R) and the λth-order derivative of f equals

f^{(λ)}(x) = h(x) + δ(x − a), (4.3)

with h(x) a bounded function that is L1-Lipschitz except at a, and δ(x − a) a Dirac δ-function at a.

Remark. The Laplace distribution satisfies the above assumption with λ = 1. There is an intrinsic link between the above discontinuity condition and the tail property of the characteristic function stated in the (OS) condition. Because φ_{f′}(s) = is·φ_f(s), we know that φ_f(s) = (is)^{−λ} φ_{f^{(λ)}}(s), while φ_{f^{(λ)}}(s) = O(1) under the above assumption. Therefore, Condition C8 implies that χ_f(s) := sup |1/φ_f(s)| ≤ C(1 + |s|^λ). In general, these results hold as long as the number of Dirac functions in f^{(λ)} in (4.3) is finite.

The following Lemma establishes the decay speed of the minimum eigenvalue of I.

Lemma 7. If the function f satisfies condition C8 with degree λ > 0, then

(1) the minimum eigenvalue of I, denoted r(I), satisfies 1/k^λ ≲ r(I);
(2) 1/k^λ ≲ sup_{θ∈Pk} ||θ||d / ||θ||.

Proof. See Appendix B.3.

The following Lemma establishes the consistency of the Sieve estimator.

Lemma 8 (Sieve Estimator Consistency). If conditions C1-C6 and C9 hold, the Sieve estimator defined in (4.1) is consistent.

Given the identification assumptions in the last section, if the problem is identified, then there exists a neighborhood N of θ0 such that for any θ ∈ N with θ ≠ θ0, we have

L(θ0|θ0) − L(θ|θ0) ≥ Ex[ ∫_Y (g(y|x, θ0) − g(y|x, θ))² / g(y|x, θ0) dy ] > 0. (4.4)

Proof. See Appendix B.3.

Unlike the usual Sieve estimation problem, our problem is ill-posed, with the minimum eigenvalue of I decaying at rate k^{−λ}. However, the curse of dimensionality does not arise, because of co-monotonicity: all entries of the vector of functions β(·) are functions of the single variable τ. Therefore it is possible to use Sieve estimation to approximate the true functional parameter with k growing more slowly than √n.

We summarize the tail property of f in the following condition:

Condition C9. For any θk ∈ Θk, assume there exists a generic constant C such that Var((fβk/g0, gσ/g0)) < C for any βk and σ in a fixed neighborhood of θ0.

Remark. The above condition is true for the Normal, Laplace, and Beta Distributions.

Condition C10 (Eigenvector of I). Suppose the smallest eigenvalue of I is r(I) = ck/k^λ. Suppose v is a normalized eigenvector corresponding to r(I), and that k^{−α} ≲ min_i |vi| for some fixed α ≥ 1/2.

Theorem 1. Under conditions C1-C5 and C8-C10, the following results hold for the Sieve-ML estimator:

(1) If the number of knots k satisfies the growth conditions
(a) k^{2λ+1}/√n → 0,
(b) k^{λ+r}/√n → ∞,
then ||θ̂ − θ0|| = Op(k^λ/√n).

(2) If Condition C10 holds and the growth conditions
(a) k^{2λ+1}/√n → 0,
(b) k^{λ+r−α}/√n → ∞
hold, then (a) for every i = 1, ..., k and 1 ≤ j ≤ dx, there exists a number µijk, with µijk/k^r → 0 and k^{λ−α}/µijk = O(1), such that

µijk(β̂k,j(τi) − β0,j(τi)) →d N(0, 1);

and (b) for the parameter σ, there exists a positive definite matrix V of dimension dσ × dσ such that the sieve estimator satisfies

√n(σ̂k − σ0) →d N(0, V).

Proof. See Appendix B.3.

Once we fix the number of interior points, we can perform MLE to obtain the Sieve estimator. We discuss how to compute the Sieve-ML estimator in the next section.

4.1. Inference via Bootstrap. In the last section, we proved asymptotic normality for the Sieve-ML estimator θ̂ = (β̂(τ), σ̂). However, computing the convergence speed µijk for β̂k,j(τi) by explicit formula can be difficult in general. To conduct inference, we recommend using the nonparametric bootstrap.

Define (x_i^B, y_i^B) as a resampling of the data (x_i, y_i) with replacement, and define the estimator

θ̂_k^B = argmax_{θk∈Θk} E_n^B[log g(y_i^B | x_i^B, θk)], (4.5)

where E_n^B denotes the empirical average over the resampled data (x_i^B, y_i^B).
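A minimal sketch of this resampling scheme is given below. It assumes a generic `fit_sieve_mle(y, X)` routine that returns the sieve estimates; that routine name and the number of replications are placeholders, not part of the paper.

```python
import numpy as np

def nonparametric_bootstrap(y, X, fit_sieve_mle, n_boot=200, seed=0):
    """Resample (x_i, y_i) pairs with replacement and refit the sieve MLE.

    Returns an array of bootstrap estimates whose pointwise quantiles
    give percentile confidence intervals for beta(tau) and sigma."""
    rng = np.random.default_rng(seed)
    n = len(y)
    draws = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample indices with replacement
        draws.append(fit_sieve_mle(y[idx], X[idx]))
    return np.asarray(draws)
```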

Chen and Pouzo (2013) establish the validity of the nonparametric bootstrap in semiparametric models for a general functional of a parameter θ. The Lemma below is an application of Theorem 5.1 of Chen and Pouzo (2013).

Lemma 9 (Validity of the Bootstrap). Under conditions C1-C6 and C9, choosing the number of knots k according to the conditions stated in Theorem 1, the bootstrap estimator defined in equation (4.5) has the following property:

µijk(β̂k,j(τ)^B − β̂k,j(τ)) →d* N(0, 1). (4.6)

Proof. See Appendix B.3.

5. Algorithm for Computing the Sieve MLE

5.1. Global Optimization and Flexible Distribution Function. In the maximization problem (4.1), the dimension of the parameter is possibly large. As the number of knots k grows, the dimensionality of the parameter space M becomes large compared to usual practical econometric applications. In addition, the parameter space Ω can be complex if we consider more flexible distribution families for the EIV, e.g., mixtures of normals.

More specifically, assume the number of mixture components is w. Then the parameter space Ω is (3w − 2)-dimensional.

Consider (p, µ, σ) ∈ Ω. Here p ∈ [0, 1]^{w−1} is a (w − 1)-dimensional vector containing the weights of the first w − 1 components; the weight of the last component is 1 − Σ_{i=1}^{w−1} pi. The vector µ ∈ R^{w−1} contains the means of the first w − 1 components; the mean of the last component is −(Σ_{i=1}^{w−1} pi µi)/(1 − Σ_{i=1}^{w−1} pi), so that the mixture has mean zero. Finally, σ ∈ R^w is the vector of standard deviations of the components.

In general, if the number of knots is k and the number of mixture components is w, then the total number of parameters is dx·k + 3w − 2.

Since the number of parameters is potentially large and the maximization problem (4.1) is typically highly nonlinear, the usual (local) optimization algorithms do not generally work well and can be trapped at a local optimum. We recommend using global optimization algorithms, such as the genetic algorithm or the efficient global optimization (EGO) algorithm, to compute the suggested sieve estimator.
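The zero-mean mixture-of-normals family used for the EIV density can be coded directly from this parameterization. The sketch below is illustrative (the function and variable names are ours), assuming w components with the last weight and last mean implied by the constraints; the example call reproduces the three-component mixture used in the Monte Carlo of Section 6.

```python
import numpy as np
from scipy.stats import norm

def mixture_density(eps, p, mu, sd):
    """Zero-mean mixture-of-normals EIV density f(eps | sigma).

    p  : (w-1,) weights of the first w-1 components
    mu : (w-1,) means of the first w-1 components
    sd : (w,)   standard deviations of all w components
    The last weight and mean are implied so that the weights sum to one
    and the mixture mean is zero (E[eps] = 0)."""
    p_last = 1.0 - p.sum()
    mu_last = -(p @ mu) / p_last
    weights = np.append(p, p_last)
    means = np.append(mu, mu_last)
    comps = norm.pdf(eps[..., None], loc=means, scale=sd)
    return comps @ weights

# with p = (0.5, 0.25) and mu = (-3, 2), the implied last component is N(4, 1)
f = mixture_density(np.array([-3.0, 0.0, 3.0]),
                    p=np.array([0.5, 0.25]), mu=np.array([-3.0, 2.0]),
                    sd=np.array([1.0, 1.0, 1.0]))
print(f)
```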

While such global algorithms have good precision, they can be slow. Thus, in the next subsection we provide an EM algorithm that is computationally tractable locally.

5.2. EM Algorithm for Finding Local Optima of the Likelihood Function. In this section, we propose a special EM algorithm for practical applications that generates a consistent estimator.

Definition 3. Define κ(τ|x, y, θ) := f(y − xβ(τ)|σ) / ∫₀¹ f(y − xβ(u)|σ)du as the Bayesian posterior density of τ given (x, y) under parameter θ. Define

H(θ|θ̃) = E_{x,y}[ ∫₀¹ κ(τ|x, y, θ̃) log κ(τ|x, y, θ) dτ ],

the expectation of the log Bayesian likelihood over τ ∈ [0, 1], conditional on the observation (x, y) and evaluated at parameter θ, with the expectation over τ taken under the prior (previous-step) parameter θ̃. Define

Q(θ|θ̃) = E_{x,y}[ ∫₀¹ κ(τ|x, y, θ̃) log f(y − xβ(τ)|σ) dτ ] = L(θ) + H(θ|θ̃).

In the EM algorithm setting, each maximization step weakly increases the likelihood L, and thus the algorithm climbs toward a (local) maximizer of the likelihood function L.

Lemma 10 (Local Optimality of the EM Algorithm). The EM algorithm, which solves

θp+1 ∈ argmax_θ Q(θ|θp), (5.1)

produces a sequence θp given an initial value θ0. If the sequence has the following properties:

(1) the sequence L(θp) is bounded, and
(2) L(θp+1|θp) − L(θp|θp) ≥ 0,

then the sequence θp converges to some θ∗ in D, and by property (2), θ∗ is a local optimum of the log-likelihood function L.

Proof. See Appendix B.3.

This Lemma says that, instead of maximizing the log-likelihood function directly, we can run iterations conditional on the previous estimate of θ to obtain a sequence of estimates θp converging to a local maximizer of the likelihood. The computational steps are much more attractive than direct maximization of the log-likelihood function L(θ), since in each step we only need to do a partial maximization, taking the prior estimates as known parameters. The next subsection presents a simpler version of the algorithm as weighted least squares when the distribution function f(·|σ) is in the normal family.

The maximization of the Q function based on the previous step is simpler than the global optimization algorithm in the following sense: unlike the direct maximization of L(θ) over the space D, the maximization of Q(·|θp) decouples into separate maximization problems for each τ, each involving many fewer parameters. More specifically, the maximization problem in equation (5.1) is equivalent to the following maximization problem for each τ, given fixed σ:

(β(τ), σ) ∈ argmax E_x[ ∫ κ(τ|x, y, θp) log f(y − xβ(τ)|σ) dy ]. (5.2)

The EM algorithm is much easier to implement than searching in the functional space M for the global maximizer of the log-likelihood function L(θ). The EM algorithm is useful for obtaining a local optimum, but not necessarily a global optimum. However, in practice, if we start from a solution of (4.1) obtained with the global optimization algorithm and then apply the EM algorithm, we can improve the speed of computation compared to directly using the genetic algorithm.

5.3. Weighted Least Squares. Under the normality assumption on the EIV term ε, the maximization of Q(·|θ̃) is extremely simple: it reduces to a weighted least squares problem. Suppose the disturbance ε ∼ N(0, σ²). Then the maximization problem (5.1) becomes the following, with parameter θ = [β(·), σ]:

argmax_θ Q(θ|θ̃) = E[ ∫₀¹ κ(τ|x, y, θ̃) ( −(1/2) log(2πσ²) − (y − xβ(τ))²/(2σ²) ) dτ ]. (5.3)

It is easy to see from the above equation that maximizing over β(·) given θ̃ amounts to minimizing a sum of weighted squared residuals. As in standard normal MLE, the FOC for β(·) does not depend on σ². The parameter σ² is solved for after all the β(τ) are solved from equation (5.3). Therefore, the EM algorithm becomes a simple weighted least squares algorithm, which is both computationally tractable and easy to implement in practice.

Given an initial estimate of a weighting matrix W, the weighted least squares estimates of β and σ are

β̂(τk) = (X′WkX)⁻¹X′Wk y,

σ̂ = ( (1/(NK)) Σ_k Σ_i wik ε̂ik² )^{1/2},

where Wk is the diagonal matrix formed from the kth column of W, which has elements wik. Given estimates ε̂k = y − Xβ̂(τk) and σ̂, the weight wik for observation i in the estimation of β(τk) is

wik = φ(ε̂ik/σ̂) / [ (1/K) Σ_k φ(ε̂ik/σ̂) ], (5.4)

where K is the number of τs in the sieve (e.g., K = 9 if the quantile grid is τk ∈ {0.1, 0.2, ..., 0.9}) and φ(·) is the pdf of the standard normal distribution.
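The following sketch implements one pass of this weighted-least-squares EM update under the stated normality assumption. It is a simplified illustration of the iteration described above; the function name and data layout are our own choices, not code from the paper.

```python
import numpy as np
from scipy.stats import norm

def em_wls_step(y, X, beta, sigma):
    """One EM iteration: the E-step computes the weights w_ik of eq. (5.4),
    the M-step runs weighted least squares for each tau_k and updates sigma.

    beta : (K, d) current estimates of beta(tau_k) on the quantile grid."""
    K = beta.shape[0]
    resid = y[:, None] - X @ beta.T                    # eps_ik, shape (n, K)
    dens = norm.pdf(resid / sigma)
    w = dens / dens.mean(axis=1, keepdims=True)        # weights w_ik

    new_beta = np.empty_like(beta)
    for k in range(K):                                 # weighted LS per tau_k
        Xw = X * w[:, [k]]
        new_beta[k] = np.linalg.solve(X.T @ Xw, Xw.T @ y)
    new_resid = y[:, None] - X @ new_beta.T
    new_sigma = np.sqrt(np.mean(w * new_resid**2))     # sigma update
    return new_beta, new_sigma
```

Iterating `em_wls_step` from a starting value (e.g., standard quantile regression coefficients) until the parameters stop changing yields the local optimum described in Lemma 10.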

6. Monte-Carlo Simulations

We examine the properties of our estimator empirically in Monte-Carlo simulations. Let the data-generating process be

yi = β0(ui) + x1iβ1(ui) + x2iβ2(ui) + εi

where N = 1,000, the conditional quantile ui of each individual follows ui ∼ U[0, 1], and the measurement error εi is drawn from a mixed normal distribution:

εi ∼ N(−3, 1) with probability 0.5, N(2, 1) with probability 0.25, and N(4, 1) with probability 0.25.

The covariates are functions of random normals,

x1i = log |z1i| + 1, x2i = log |z2i| + 1,

where z1, z2 ∼ N(0, I2). Finally, the coefficient vector is a function of the conditional quantile ui of individual i:

(β0(u), β1(u), β2(u)) = (1 + 2u − u², (1/2) exp(u), u + 1).
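A compact way to generate this design is to draw the mixture component labels first. The snippet below is an illustrative sketch of the simulation design above, following the coefficient functions as written; variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(42)
N = 1000
u = rng.uniform(size=N)                                   # conditional quantiles u_i
z1, z2 = rng.normal(size=N), rng.normal(size=N)
x1, x2 = np.log(np.abs(z1)) + 1, np.log(np.abs(z2)) + 1   # covariates

# mixed-normal measurement error: component means -3, 2, 4, all with sd 1
comp = rng.choice(3, size=N, p=[0.5, 0.25, 0.25])
eps = rng.normal(loc=np.array([-3.0, 2.0, 4.0])[comp], scale=1.0)

beta0, beta1, beta2 = 1 + 2*u - u**2, 0.5*np.exp(u), u + 1
y = beta0 + x1*beta1 + x2*beta2 + eps                     # observed outcome
```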


Figure 2. Monte Carlo Simulation Results: Mean Bias of β1(τ)

[Figure: mean bias of β1(τ) by quantile; series: Quantile Regression, Bias-Corrected, and Bias-Corrected 95% Confidence Intervals.]

Simulation Results. In Figures 2 and 3, we plot the mean bias of quantile regression of y on a constant, x1, and x2 and contrast it with the mean bias of our estimator. Quantile regression is badly biased, with lower quantiles biased upward toward the median-regression coefficients and upper quantiles biased downward toward the median-regression coefficients. We also plot bootstrapped 95% confidence intervals (dashed green lines) to show the precision of our estimator. Especially in the case of the nonlinear β1(·) in Figure 2, we reject the quantile regression coefficients, which fall outside the 95% confidence interval of our estimates.

Figure 4 shows the true mixed-normal distribution of the measurement error ε as defined above (solid blue line) contrasted with the estimated distribution of the measurement error (dotted red line) and the 95% confidence interval of the estimated density (dashed green line). Despite the bimodal nature of the true measurement error distribution, our algorithm successfully estimates its distribution with very little bias and relatively tight confidence bounds.

7. Empirical Application

Angrist et al. (2006) estimate that the heterogeneity in the returns to education across the conditional wage distribution has increased over time, as seen below in Figure 5, reproduced


Figure 3. Monte Carlo Simulation Results: Mean Bias of β2(τ)

[Figure: mean bias of β2(τ) by quantile; series: Quantile Regression, Bias-Corrected, and Bias-Corrected 95% Confidence Intervals.]

from Angrist et al. Estimating the quantile-regression analog of a simple Mincer regression, they estimate models of the form

qy|X(τ) = β0(τ) + β1(τ)educationi + β2(τ)experiencei + β3(τ)experience²i + β4(τ)blacki (7.1)

where qy|X(τ) is the τth quantile of the conditional (on the covariates X) log-wage distribution, the education and experience variables are measured in years, and black is an indicator variable. They find that in 1980, an additional year of education was associated with a 7% increase in wages across all quantiles, nearly identical to OLS estimates. In 1990, they again find that most of the wage distribution has a similar education-wage gradient, but higher conditional (wage) quantiles saw a slightly stronger association between education and wages. By 2000, the education coefficient was almost seven percentage points higher for the 95th percentile than for the 5th percentile.

We observe a different pattern when we correct for measurement-error bias in the self-reported wages in the Angrist et al. census data, comparing results from our estimation procedure with standard quantile regression.8

8 We are able to replicate the Angrist et al. quantile regression results (shown in the dashed blue line) almost exactly. We estimate β(τ) over a τ grid of 99 knots that stretches to 0.01 and 0.99. However, we do not report results for quantiles τ < .1 or τ > .9 because these results are unstable and noisy.


Figure 4. Monte Carlo Simulation Results: Distribution of Measurement Error

[Figure: density of measurement error; series: True Density, Estimated Density, 95% CI.]

We first use our genetic algorithm to maximize the Sieve likelihood with an EIV density f that is a mean-zero mixture of three normal distributions with unknown variances. The genetic algorithm excels at optimizing over non-smooth surfaces where other algorithms would be sensitive to starting values. However, running the genetic algorithm on our entire dataset is computationally costly, since the data contain over 65,000 observations for each year and our parameter space is large. To speed up convergence, we average coefficients estimated on 100 subsamples of 2,000 observations each. This procedure gives consistent parameter estimates and nonparametric confidence intervals provided the ratio of the subsample size to the original number of observations goes to zero in the limit. The 95% nonparametric confidence intervals are calculated pointwise as the .025 and .975 quantiles of the empirical distribution of the subsampling parameter estimates. Figure 6 plots the estimated distribution of the measurement error (solid blue line) in the 2000 data, and the dashed red lines report the 95% confidence bounds on this estimate, obtained via subsampling. The estimated EIV distributions for the 1980 and 1990 data look similar; we find that the distribution of the measurement error in self-reported log wages is approximately Gaussian.
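The subsample-and-average scheme described above can be sketched as follows; `fit_sieve_mle` is the same placeholder estimator routine used earlier, and the subsample size and count follow the text.

```python
import numpy as np

def subsample_estimates(y, X, fit_sieve_mle, m=2000, n_sub=100, seed=0):
    """Estimate on n_sub random subsamples of size m (without replacement),
    then average the coefficients; pointwise .025/.975 quantiles of the
    subsample estimates give 95% nonparametric confidence bands."""
    rng = np.random.default_rng(seed)
    n = len(y)
    est = np.asarray([fit_sieve_mle(y[idx], X[idx])
                      for idx in (rng.choice(n, size=m, replace=False)
                                  for _ in range(n_sub))])
    point = est.mean(axis=0)
    lo, hi = np.quantile(est, [0.025, 0.975], axis=0)
    return point, lo, hi
```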

Figure 7 plots the education coefficient β1(τ) from estimating equation (7.1) by weighted least squares.9 While not causal estimates of the effect of education on wages, the results

9 Because preliminary testing has shown that the weighted least squares approach improves upon the optimum reached by the genetic algorithm, we continue to refine our genetic algorithm. The coefficient results we report below are from weighted least squares, which assumes normality of the measurement error. As Figure 6 shows, this assumption appears to be consistent with the data.


Figure 5. Angrist, Chernozhukov, and Fernandez-Val (2006) Figure 2

suggest that in 1980, education was more valuable at the middle and top of the conditional income distribution. This pattern contrasts with the standard quantile-regression results, which show constant returns to education across quantiles of the conditional wage distribution (as discussed above). For 1990, the pattern of increasing returns to education for higher quantiles is again visible, with the very highest quantiles seeing a large increase in the education-wage gradient. The estimated return to education for the .9 quantile is over twice as large as estimates obtained from traditional quantile regression. In the 2000 decennial census, the returns to education appear nearly constant across most of the distribution, with two notable exceptions. First, the lowest income quantiles have a noticeably lower wage gain from a year of education than middle-income quantiles. Second, the top-income increases in the education-wage premium that were present in the 1990 results are even higher in 2000. The 90th percentile of the conditional wage distribution in 2000 increases by over 30 log points for every year of education. Over time, especially between 1980 and 1990, we see an overall increase in the returns to education. While this trend is consistent with the observations of Angrist et al. (2006), the distributional story


Figure 6. Estimated Distribution of Wage Measurement Error

[Figure: density of wage measurement error; series: Estimated Density, 95% Confidence Intervals.]

Notes: Graph plots the estimated distribution of the measurement error.

that emerges from correcting for measurement error suggests that the returns to education have become disproportionately concentrated in high-wage professions.

8. Conclusion

In this paper, we develop a methodology for estimating the functional parameter β(·) in quantile regression models when there is measurement error in the dependent variable. Assuming that the measurement error follows a distribution that is known up to a finite-dimensional parameter, we establish general convergence speed results for the MLE-based approach. Under a discontinuity assumption (C8), we establish the convergence speed of the Sieve-ML estimator. We use genetic algorithms to maximize the Sieve likelihood and show that when the distribution of the EIV is normal, the EM algorithm becomes an iterative weighted least squares algorithm that is easy to compute. We demonstrate the validity of bootstrapping based on the asymptotic normality of our estimator and suggest using a nonparametric bootstrap procedure for inference.

Finally, we revisited the Angrist et al. (2006) question of whether the returns to educationacross the wage distribution have been changing over time. We find a somewhat different


Figure 7. Returns to Education Correcting for LHS Measurement Error

[Figure: three panels (1980, 1990, 2000) plotting the education coefficient by quantile; series: Quantile Regression, Bias-Corrected Quantile Regression.]

Notes: Graphs plot education coefficients estimated using quantile regression and our bias-correcting algorithm. The dependent variable is log wage for prime-age males, and other controls include a quadratic in years of experience and a dummy for blacks. The data come from the indicated decennial Census and consist of 40-49 year old white and black men born in America. The number of observations in each sample is 65,023, 86,785, and 97,397 in 1980, 1990, and 2000, respectively.

pattern than prior work, highlighting the importance of correcting for errors in the dependent variable of conditional quantile models. When we correct for likely measurement error in the self-reported wage data, the wages of low-wage men had much less to do with their education levels than they did for those at the top of the conditional-wage distribution. By 2000, the wage-education gradient was nearly constant across most of the wage distribution, with somewhat lower returns for low-wage earners and significant growth in the explanatory power of education for top wages.

References

[1] Angrist, J., Chernozhukov, V., and Fernandez-Val, I. (2006), "Quantile Regression under Misspecification, with an Application to the U.S. Wage Structure," Econometrica, 74, 539-563.
[2] Burda, M., Harding, M., and Hausman, J. (2010), "A Poisson Mixture Model of Discrete Choice," Journal of Econometrics, 166(2), 184-203.
[3] Carroll, R. J., and Wei, Y. (2009), "Quantile Regression With Measurement Error," Journal of the American Statistical Association, 104(487), 1129-1143.
[4] Chen, Songnian (2002), "Rank Estimation of Transformation Models," Econometrica, 70(4), 1683-1697.
[5] Chen, Xiaohong (2009), "Large Sample Sieve Estimation of Semiparametric Models," Cowles Foundation Paper No. 1262.
[6] Chen, Xiaohong, and Pouzo, Demian (2013), "Sieve Quasi Likelihood Ratio Inference on Semi/Nonparametric Conditional Moment Models," Cowles Foundation Paper No. 1897.
[7] Chernozhukov, V., Fernandez-Val, I., and Galichon, A., "Improving Point and Interval Estimates of Monotone Functions by Rearrangement."
[8] Chernozhukov, V., and Hansen, C. (2005), "An IV Model of Quantile Treatment Effects," Econometrica, 73(1), 245-261.
[9] Chernozhukov, V., and Hansen, C. (2008), "Instrumental Variable Quantile Regression: A Robust Inference Approach," Journal of Econometrics, 142(1), 379-398.
[10] Cosslett, S. R. (2004), "Efficient Semiparametric Estimation of Censored and Truncated Regressions via a Smoothed Self-Consistency Equation," Econometrica, 72, 1277-1293.
[11] Cosslett, S. R. (2007), "Efficient Estimation of Semiparametric Models by Smoothed Maximum Likelihood," International Economic Review, 48, 1245-1272.
[12] Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977), "Maximum Likelihood from Incomplete Data via the EM Algorithm," Journal of the Royal Statistical Society, Series B, 39(1), 1-38.
[13] Evdokimov, K., "Identification and Estimation of a Nonparametric Panel Data Model with Unobserved Heterogeneity," mimeo.
[14] Hausman, J. A., Lo, A. W., and MacKinlay, A. C. (1992), "An Ordered Probit Analysis of Transaction Stock Prices," Journal of Financial Economics, 31(3), 319-379.
[15] Hausman, J. A., Abrevaya, J., and Scott-Morton, F. (1998), "Misclassification of a Dependent Variable in a Qualitative Response Model," Journal of Econometrics, 87, 239-269.
[16] Hausman, J. A. (2001), "Mismeasured Variables in Econometric Analysis: Problems from the Right and Problems from the Left," Journal of Economic Perspectives, 15, 57-67.
[17] Horowitz, J. L. (2011), "Applied Nonparametric Instrumental Variables Estimation," Econometrica, 79, 347-394.
[18] Kuhn, T. (1990), "Eigenvalues of Integral Operators with Smooth Positive Definite Kernels," Archiv der Mathematik, 49(6), 525-534.
[19] Shen, Xiaotong (1997), "On Methods of Sieves and Penalization," The Annals of Statistics, 25.
[20] Schennach, S. M. (2008), "Quantile Regression With Mismeasured Covariates," Econometric Theory, 24(4), 1010-1043.


Appendix A. Deconvolution Method

Although the direct MLE method is used to obtain identification and some asymptotic properties of the estimator, the main asymptotic property of the maximum likelihood estimator depends on the discontinuity assumption C8. Evdokimov (2011) establishes a method using deconvolution to solve a similar problem in a panel data setting. He adopts a nonparametric setting and therefore needs kernel estimation methods.

Instead, in the special case of the quantile regression setting, we can exploit the linear structure without using nonparametric methods. Deconvolution offers a straightforward view of the estimation procedure.

The CDF of a random variable w can be computed from its characteristic function φw(s):

F(w) = 1/2 − lim_{q→∞} ∫_{−q}^{q} (e^{−iws}/(2πis)) φw(s) ds. (A.1)
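Numerically, the truncated version of (A.1) can be evaluated on a grid of s values; the sketch below recovers the CDF of a standard normal from its characteristic function as a sanity check. It is an illustrative implementation of the inversion formula, not code from the paper.

```python
import numpy as np
from scipy.integrate import trapezoid

def cdf_from_cf(w, phi, q=50.0, n=10000):
    """Approximate F(w) = 1/2 - int_{-q}^{q} e^{-iws} phi(s) / (2*pi*i*s) ds,
    truncating the integral at |s| = q (the grid skips s = 0, where the
    integrand has a removable singularity)."""
    s_pos = np.linspace(q / n, q, n)
    s = np.concatenate([-s_pos[::-1], s_pos])
    integrand = np.exp(-1j * np.outer(w, s)) * phi(s) / (2j * np.pi * s)
    return 0.5 - trapezoid(integrand.real, s, axis=1)

phi_normal = lambda s: np.exp(-0.5 * s ** 2)   # characteristic function of N(0, 1)
print(cdf_from_cf(np.array([-1.0, 0.0, 1.0]), phi_normal))  # ~ [0.159, 0.5, 0.841]
```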

Then β0(τ) satisfies the following conditional moment condition:

E[F(xβ0(τ)|x)] = τ. (A.2)

Since we only observe y = xβ(τ) + ε, we have φxβ(s) = E[exp(isy)]/φε(s). Therefore β0(τ) satisfies

E[ 1/2 − τ − lim_{q→∞} ∫_{−q}^{q} E[e^{−i(y−xβ0(τ))s}|x] / (2πisφε(s)) ds | x ] = 0. (A.3)

In the above expression, the limit symbol before the integral can not be exchanged with thefollowed conditional expectation symbol. But in practice, they can be exchanged if the integralis truncated.

A simple way to exploit the information in the above conditional moment equation is to consider the following unconditional moment conditions:
\[
E\left[x\left(\frac{1}{2} - \tau + \operatorname{Im}\left(\lim_{q\to\infty}\int_{-q}^{q}\frac{E\bigl[e^{-i(y-x\beta_0(\tau))s}\mid x\bigr]}{2\pi s\,\varphi_\varepsilon(s)}\,ds\right)\right)\right] = 0. \qquad (A.4)
\]

Let $F(x\beta, y, \tau, q) = \frac{1}{2} - \tau - \operatorname{Im}\int_{-q}^{q}\frac{e^{-i(y-x\beta)s}}{2\pi s\,\varphi_\varepsilon(s)}\,ds$.

Given a fixed truncation threshold $q$, the optimal estimator of the form $E_n[f(x)F(x\beta, y, \tau, q)]$ solves
\[
E_n\left[E\left[\frac{\partial F}{\partial\beta}\,\middle|\,x\right]\sigma(x)^{-2}\,F(x\beta, y, \tau, q)\right] = 0, \qquad (A.5)
\]
where $\sigma(x)^2 := E[F(x\beta, y, \tau, q)^2\mid x]$. Another convenient estimator is


\[
\arg\min_{\beta}\; E_n\bigl[F(x\beta, y, \tau, q)^2\bigr]. \qquad (A.6)
\]

These estimators behave similarly in terms of convergence rate, so we only discuss the optimal estimator. Although it cannot be computed directly, because the weight $E[\partial F/\partial\beta\mid x]/\sigma(x)^2$ depends on $\beta$, it can be obtained by an iterative method.

Theorem 2. (a) Under condition OS with $\lambda \geq 1$, the estimator $\hat\beta$ satisfying equation (3.15) with truncation threshold $q_n = CN^{\frac{1}{2\lambda}}$ satisfies
\[
\hat\beta - \beta_0 = O_p\bigl(N^{-\frac{1}{2\lambda}}\bigr). \qquad (A.7)
\]
(b) Under condition SS, the estimator $\hat\beta$ satisfying equation (3.15) with truncation threshold $q_n = C\log(n)^{\frac{1}{\lambda}}$ satisfies
\[
\hat\beta - \beta_0 = O_p\bigl(\log(n)^{-\frac{1}{\lambda}}\bigr). \qquad (A.8)
\]

Appendix B. Proofs of Lemmas and Theorems

B.1. Lemmas and Theorems in Section 2. In this section we prove the lemmas and theorems in Section 2.

Proof of Lemma 1.

Proof. Consider any sequence of monotone functions $f_1, f_2, \ldots, f_n, \ldots$, each mapping $[a, b]$ into some closed interval $[c, d]$. For bounded monotone functions, pointwise convergence implies uniform convergence, so each coordinate space is compact. Hence the product space $B_1 \times B_2 \times \cdots \times B_k$ is compact. It is complete since the $L_2$ function space is complete and the limit of monotone functions is still monotone.

Proof of Lemma 2.

Proof. WLOG, under conditions C.1-C.3, we can assume the variable $x_1$ is continuous. If there exist $\beta(\cdot)$ and $f(\cdot)$ which generate the same density $g(y|x, \beta(\cdot), f)$ as the true parameters $\beta_0(\cdot)$ and $f_0(\cdot)$, then applying the Fourier transformation,
\[
\varphi(s)\int_0^1 \exp(isx\beta(\tau))\,d\tau = \varphi_0(s)\int_0^1 \exp(isx\beta_0(\tau))\,d\tau.
\]

Denote $m(s) = \varphi(s)/\varphi_0(s) = 1 + \sum_{k=2}^{\infty} a_k (is)^k$ in a neighborhood of $0$. Therefore,
\[
m(s)\int_0^1 \exp(isx_{-1}\beta_{-1}(\tau))\sum_{k=0}^{\infty}\frac{(is)^k x_1^k\,\beta_1(\tau)^k}{k!}\,d\tau
= \int_0^1 \exp(isx_{-1}\beta_{0,-1}(\tau))\sum_{k=0}^{\infty}\frac{(is)^k x_1^k\,\beta_{0,1}(\tau)^k}{k!}\,d\tau.
\]

Since $x_1$ is continuous, the coefficients on each power of $x_1$ must be the same on both sides. Namely,


\[
m(s)\frac{(is)^k}{k!}\int_0^1 \exp(isx_{-1}\beta_{-1}(\tau))\,\beta_1(\tau)^k\,d\tau
= \frac{(is)^k}{k!}\int_0^1 \exp(isx_{-1}\beta_{0,-1}(\tau))\,\beta_{0,1}(\tau)^k\,d\tau.
\]

Dividing both sides of the above equation by $(is)^k/k!$ and letting $s$ approach $0$, we get
\[
\int_0^1 \beta_1(\tau)^k\,d\tau = \int_0^1 \beta_{0,1}(\tau)^k\,d\tau.
\]

By assumption C.2, $\beta_1(\cdot)$ and $\beta_{0,1}(\cdot)$ are both strictly monotone, differentiable, and nonnegative. Since all of their integer moments coincide, $\beta_1(\tau) = \beta_{0,1}(\tau)$ for all $\tau \in [0, 1]$.

Now return to the same equation and divide both sides by $(is)^k/k!$:
\[
m(s)\int_0^1 \exp(isx_{-1}\beta_{-1}(\tau))\,\beta_{0,1}(\tau)^k\,d\tau = \int_0^1 \exp(isx_{-1}\beta_{0,-1}(\tau))\,\beta_{0,1}(\tau)^k\,d\tau, \quad \text{for all } k \geq 0.
\]

Since $\{\beta_{0,1}(\tau)^k,\ k \geq 1\}$ is a functional basis of $L_2[0, 1]$, it follows that $m(s)\exp(isx_{-1}\beta_{-1}(\tau)) = \exp(isx_{-1}\beta_{0,-1}(\tau))$ for all $s$ in a neighborhood of $0$ and all $\tau \in [0, 1]$. Differentiating both sides with respect to $s$ and evaluating at $0$ (noting that $m(0) = 1$ and $m'(0) = 0$), we get
\[
x_{-1}\beta_{-1}(\tau) = x_{-1}\beta_{0,-1}(\tau) \quad \text{for all } \tau \in [0, 1].
\]
By assumption C.2, $E[xx']$ is non-singular, so $E[x_{-1}x_{-1}']$ is also non-singular. Hence the above equation implies $\beta_{-1}(\tau) = \beta_{0,-1}(\tau)$ for all $\tau \in [0, 1]$, and therefore $\beta(\tau) = \beta_0(\tau)$. This in turn implies $\varphi(s) = \varphi_0(s)$, and hence $f(\varepsilon) = f_0(\varepsilon)$.

Proof of Lemma 3.

Proof. (a) If the equation stated in (a) holds a.e., then we can apply the Fourier transformation to both sides, conditional on $x$:

\[
\int e^{isy}\int_0^1 f'(y - x\beta(\tau))\,x\,t(\tau)\,d\tau\,dy = \int e^{isy}\int_0^1 f_\sigma(y - x\beta(\tau))\,d\tau\,dy. \qquad (B.1)
\]

Let $q(\tau) = x\beta(\tau)$, which is a strictly increasing function. We can use a change of variables from $\tau$ to $q$.

We get
\[
\int_0^1 e^{isx\beta(\tau)}x\,t(\tau)\int e^{is(y-x\beta(\tau))}f'(y-x\beta(\tau))\,dy\,d\tau
= \int_0^1 e^{isx\beta(\tau)}\int e^{is(y-x\beta(\tau))}f_\sigma(y-x\beta(\tau))\,dy\,d\tau, \qquad (B.2)
\]
or, equivalently,

\[
\int_0^1 e^{isx\beta(\tau)}x\,t(\tau)\,d\tau\,(-is)\varphi_\varepsilon(s) = \int_0^1 e^{isx\beta(\tau)}\,d\tau\;\frac{\partial\varphi_\varepsilon(s)}{\partial\sigma}, \qquad (B.3)
\]
given that $\int f'(y)e^{isy}\,dy = -is\int f(y)e^{isy}\,dy$.


Denote $\int_0^1 e^{isx\beta(\tau)}x\,t(\tau)\,d\tau = q(s, x)$ and $\int_0^1 e^{isx\beta(\tau)}\,d\tau = r(s, x)$. Therefore,
\[
-is\,q(s, x)\,\varphi_\varepsilon(s) = r(s, x)\,\frac{\partial\varphi_\varepsilon(s)}{\partial\sigma}.
\]
First consider the normal distribution: $\varphi_\varepsilon(s) = \exp(-\sigma^2 s^2/2)$, so $\frac{\partial\varphi_\varepsilon(s)}{\partial\sigma} = -\sigma s^2\varphi_\varepsilon(s)$. Therefore $q(s, x) = is\sigma\,r(s, x)$ for all $s$ and all $x$ with positive probability. Since $x\beta(\tau)$ is strictly increasing, let $z(\tau) = x\beta(\tau)$ and $I(y) = 1(x\beta(0) \leq y \leq x\beta(1))$.

So
\[
q(s, x) = \int_{-\infty}^{\infty} e^{isw}\,\frac{x\,t(z^{-1}(w))}{x\beta'(z^{-1}(w))}\,I(w)\,dw, \qquad
r(s, x) = \int_{-\infty}^{\infty} e^{isw}\,\frac{1}{x\beta'(z^{-1}(w))}\,I(w)\,dw.
\]
The relation $q(s, x) = is\sigma\,r(s, x)$ for all $s$ implies a relationship between the integrands. More specifically, it must be that
\[
\frac{x\,t(z^{-1}(w))}{x\beta'(z^{-1}(w))}\,I(w) = \frac{d}{dw}\left[\frac{1}{x\beta'(z^{-1}(w))}\,I(w)\right]. \qquad (B.4)
\]
These cannot be equal on a continuous interval of $x$, because the right-hand side contains a Dirac $\delta$-function component coming from differentiating the indicator $I(w)$.

For a more general distribution function $f$, we have $-is\,q(s, x)\,\varphi_\varepsilon(s) = r(s, x)\,\frac{\partial\varphi_\varepsilon(s)}{\partial\sigma}$ with $\frac{\partial\varphi_\varepsilon(s)}{\partial\sigma} = \varphi_\varepsilon(s)\,s^2 m(s)$ for some $m(s) := \sum_{j=0}^{\infty} a_j(is)^j$. Therefore,
\[
-is\,m(s)\,r(s, x) = q(s, x).
\]

Considering the local expansion in $s$ of both sides, we get
\[
-is\left(\sum_{j=0}^{\infty} a_j(is)^j\right)\left(\sum_{k=0}^{\infty}\int_0^1 (x\beta(\tau))^k\,d\tau\,\frac{(is)^k}{k!}\right)
= \sum_{k=0}^{\infty}\int_0^1 (x\beta(\tau))^k\,x\,t(\tau)\,d\tau\,\frac{(is)^k}{k!}.
\]

Comparing the coefficients on the $k$th power of $s$, we get
\[
\int_0^1 x\,t(\tau)\,d\tau = 0 \quad (k = 0), \qquad
\frac{1}{k!}\int_0^1 (x\beta(\tau))^k\,x\,t(\tau)\,d\tau = -\sum_{j=0}^{k-1}\frac{a_j}{(k-1-j)!}\int_0^1 (x\beta(\tau))^{k-1-j}\,d\tau \quad (k \geq 1).
\]
Let $x_1$ be the continuous variable and $x_{-1}$ the remaining variables. Let $\beta_1$ be the coefficient corresponding to $x_1$ and $\beta_{-1}$ the coefficients corresponding to $x_{-1}$. Similarly, let $t_1$ be the component of $t$ corresponding to $x_1$ and $t_{-1}$ the components corresponding to $x_{-1}$.

Then, comparing the coefficients on the $(k+1)$th power of $x_1$, we get
\[
\int_0^1 \beta_1(\tau)^k\,t_1(\tau)\,d\tau = 0.
\]
Since $\{\beta_1(\tau)^k,\ k \geq 0\}$ forms a functional basis in $L_2[0, 1]$, $t_1(\tau)$ must equal $0$. Comparing the coefficients on the $k$th power of $x_1$, we get
\[
\int_0^1 \beta_1(\tau)^k\,\bigl(x_{-1}t_{-1}(\tau)\bigr)\,d\tau = 0.
\]


So $x_{-1}t_{-1}(\tau) = 0$ for all $\tau \in [0, 1]$. Since $E[x_{-1}x_{-1}']$ has full rank, it must be that $t_{-1}(\tau) = 0$. Hence the local identification condition holds.

B.2. Lemmas and Theorems in Section 3. In this subsection we prove the lemmas and theorems in Section 3.

Lemma 11 (Donskerness of D). The parameter space D is Donsker.

Proof. By Theorem 2.7.5 of Van der Vaart and Wellner (2000), the space of bounded monotone functions, denoted $\mathcal{F}$, satisfies
\[
\log N_{[\,]}(\varepsilon, \mathcal{F}, L_r(Q)) \leq K\frac{1}{\varepsilon},
\]
for every probability measure $Q$ and every $r \geq 1$, with a constant $K$ that depends only on $r$.

Since $\mathcal{D}$ is a product space of bounded monotone functions $\mathcal{M}$ and a finite-dimensional bounded compact set $\Omega$, the bracketing number of $\mathcal{D}$ under any measure $Q$ is also bounded:
\[
\log N_{[\,]}(\varepsilon, \mathcal{D}, L_r(Q)) \leq K d_x\frac{1}{\varepsilon}.
\]
Therefore $\int_0^\delta \sqrt{\log N_{[\,]}(\varepsilon, \mathcal{D}, L_r(Q))}\,d\varepsilon < \infty$, i.e., $\mathcal{D}$ is Donsker.

Proof of Lemma 5.

Proof. $L_n(\hat\theta) \geq L_n(\theta_0)$ by definition, and $L(\theta_0) \geq L(\theta)$ by the information inequality. By Lemma 10, since $\mathcal{D}$ is Donsker, the following stochastic equicontinuity condition holds:
\[
\sqrt{n}\bigl(L_n(\theta) - L_n(\theta_1) - (L(\theta) - L(\theta_1))\bigr) = r_n,
\]
for some generic term $r_n = o_p(1)$, whenever $|\theta - \theta_1| \leq \eta_n$ for $\eta_n$ small enough. Therefore uniform convergence of $L_n(\theta) - L(\theta)$ holds on $\mathcal{D}$. In addition, by Lemma 2, the parameter $\theta$ is globally identified, and $L$ is clearly continuous in $\theta$.

Therefore, the usual consistency argument for the MLE applies here.

Proof of Lemma 6.

Proof. By definition,
\[
E_n[\log g(y|x, \hat\beta, \hat\sigma)] \geq E_n[\log g(y|x, \beta_0, \sigma_0)]. \qquad (B.5)
\]
By the information inequality, $E[\log g(y|x, \hat\beta, \hat\sigma)] \leq E[\log g(y|x, \beta_0, \sigma_0)]$. Therefore,
\[
E[\log g(y|x, \hat\beta, \hat\sigma)] - E[\log g(y|x, \beta_0, \sigma_0)] \geq -\frac{1}{\sqrt{n}}\,\mathbb{G}_n\bigl[\log g(y|x, \hat\beta, \hat\sigma) - \log g(y|x, \beta_0, \sigma_0)\bigr].
\]
By the Donskerness of the parameter space $\mathcal{D}$, the empirical process term on the right-hand side is $O_p(1)$.


Hence,
\[
0 \leq E[\log g(y|x, \beta_0, \sigma_0)] - E[\log g(y|x, \hat\beta, \hat\sigma)] \lesssim_p \frac{1}{\sqrt{n}}. \qquad (B.6)
\]
Because $\hat\sigma$ converges to the true parameter at the $\sqrt{n}$ rate, we simply write $\sigma_0$ in the argument below, since this does not affect the analysis. Let $z(y, x) = g(y|\hat\beta, x) - g(y|\beta_0, x)$. From equation (3.19), the KL discrepancy between $g(y|x, \beta_0)$ and $g(y|x, \hat\beta)$, denoted $D(g(\cdot|\beta_0)\,\|\,g(\cdot|\hat\beta))$, is bounded by $C\frac{1}{\sqrt{n}}$ with probability approaching one. So
\[
E_x\bigl[\|z(y|x)\|_1\bigr]^2 \leq D(g(\cdot|\beta_0)\,\|\,g(\cdot|\hat\beta)) \leq 2\bigl(E[\log g(y|x, \beta_0, \sigma_0)] - E[\log g(y|x, \hat\beta, \hat\sigma)]\bigr) \lesssim_p \frac{1}{\sqrt{n}},
\]
where $\|z(y|x)\|_1 = \int_{-\infty}^{\infty}|z(y|x)|\,dy$.

Now consider the characteristic function of $y$ conditional on $x$:
\[
\varphi_{x\hat\beta}(s) = \frac{\int_{-\infty}^{\infty} g(y|\hat\beta, x)e^{isy}\,dy}{\varphi_\varepsilon(s)}, \qquad
\varphi_{x\beta_0}(s) = \frac{\int_{-\infty}^{\infty} g(y|\beta_0, x)e^{isy}\,dy}{\varphi_\varepsilon(s)}.
\]
Therefore
\[
\varphi_{x\hat\beta}(s) - \varphi_{x\beta_0}(s) = \frac{\int_{-\infty}^{\infty} z(y|x)e^{isy}\,dy}{\varphi_\varepsilon(s)}, \qquad (B.7)
\]
and $|\varphi_{x\hat\beta}(s) - \varphi_{x\beta_0}(s)| \leq \frac{\|z(y|x)\|_1}{|\varphi_\varepsilon(s)|}$.

Let $F_\eta(w) = \frac{1}{2} - \lim_{q\to\infty}\int_{-q}^{q}\frac{e^{-iws}}{2\pi i s}\varphi_\eta(s)\,ds$; then $F_\eta(w)$ is the CDF of the random variable $\eta$. Let $F_\eta(w, q) = \frac{1}{2} - \int_{-q}^{q}\frac{e^{-iws}}{2\pi i s}\varphi_\eta(s)\,ds$. If the p.d.f. of $\eta$ is a $C^1$ function and $\eta$ has a finite second moment, then $F_\eta(w, q) = F_\eta(w) + O(\frac{1}{q^2})$. Since $x$ and $\beta$ are bounded, with $\eta = x\beta(\tau)$ there exists a common $C$ such that $|F_{x\beta}(w, q) - F_{x\beta}(w)| \leq \frac{C}{q^2}$ for all $x \in \mathcal{X}$ and $w \in \mathbb{R}$, for $q$ large enough.

Therefore
\[
|F_{x\hat\beta}(w, q) - F_{x\beta_0}(w, q)| = \left|\int_{-q}^{q}\frac{e^{isw}}{2\pi s}\bigl(\varphi_{x\hat\beta}(s) - \varphi_{x\beta_0}(s)\bigr)\,ds\right|
\leq \left|\int_{-q}^{q}\frac{\|z(y|x)\|_1}{2\pi s\,|\varphi_\varepsilon(s)|}\,ds\right| \lesssim \|z(y|x)\|_1\,\chi(q).
\]

Consider the following moment condition:
\[
E\bigl[x\bigl(F_{x\beta_0}(x\beta_0) - \tau\bigr)\bigr] = 0. \qquad (B.8)
\]

Let $F_0$ denote $F_{x\beta_0}$ and $\hat{F}$ denote $F_{x\hat\beta}$. We replace the moment condition by the truncated function $F(w, q)$. By the truncation property, for any fixed $\delta \in \mathbb{R}^{d_x}$,
\[
\delta' E\bigl[x\bigl(F_0(x\hat\beta) - F_0(x\beta_0)\bigr)\bigr] = \delta' E\bigl[x\bigl(F_0(x\hat\beta) - \hat{F}(x\hat\beta)\bigr)\bigr]
= \delta' E\bigl[x\bigl(F_0(x\hat\beta, q) - \hat{F}(x\hat\beta, q)\bigr)\bigr] + O\Bigl(\frac{1}{q^2}\Bigr). \qquad (B.9)
\]


The first term on the right-hand side is bounded by
\[
E\bigl[|\delta' x|\,\chi(q)\,\|z(y|x)\|_1\bigr] \lesssim E\bigl[\|z(y|x)\|_1\bigr]\chi(q).
\]
Moreover, $\delta' E[x(F_0(x\hat\beta) - F_0(x\beta_0))] = \delta' E[xx' f(x\beta_0(\tau))](\hat\beta(\tau) - \beta_0(\tau))(1 + o(1))$, where $f(x\beta_0(\tau))$ is the p.d.f. of the random variable $x\beta_0(\cdot)$ evaluated at $x\beta_0(\tau)$. Hence,
\[
\hat\beta(\tau) - \beta_0(\tau) = O_p\Bigl(\frac{\chi(q)}{n^{1/4}}\Bigr) + O\Bigl(\frac{1}{q^2}\Bigr).
\]

Under the OS condition, $\chi(q) \lesssim q^\lambda$, so choosing $q = Cn^{\frac{1}{4(\lambda+2)}}$ gives $\|\hat\beta(\tau) - \beta_0(\tau)\|_2 \lesssim_p n^{-\frac{1}{2(\lambda+2)}}$. Under the SS condition, $\chi(q) \lesssim C_1(1 + |q|^{C_2})\exp(|q|^\lambda/C_3)$; choosing $q = C\log(n)^{\frac{1}{4\lambda}}$ gives $\|\hat\beta(\tau) - \beta_0(\tau)\|_2 \lesssim_p \log(n)^{-\frac{1}{2\lambda}}$.
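To see where the OS rate comes from, here is a short worked version of the balancing arithmetic under the bound displayed just above: setting the two error terms to the same order,
\[
\frac{q^{\lambda}}{n^{1/4}} \asymp \frac{1}{q^{2}} \;\Longrightarrow\; q^{\lambda+2} \asymp n^{1/4} \;\Longrightarrow\; q = Cn^{\frac{1}{4(\lambda+2)}},
\]
and the common order of the two terms is then $q^{-2} = n^{-\frac{1}{2(\lambda+2)}}$, which is the rate stated above.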

B.3. Lemmas and Theorems in Sections 4 and 5. In this subsection we prove the lemmas and theorems in Sections 4 and 5.

Proof of Lemma 8.

Proof. Suppose $f$ satisfies discontinuity condition C8 with degree $\lambda > 0$. Denote $l_k = (f_{\tau_1}, f_{\tau_2}, \ldots, f_{\tau_k}, g_\sigma)$.

For any $(p_k, \delta) \in P_k$,
\[
p_k' I_k p_k = E\left[\int_{\mathbb{R}}\frac{(l_k' p_k)^2}{g}\,dy\right] \gtrsim E\left[\int_{\mathbb{R}}(l_k' p_k)^2\,dy\right],
\]
since $g$ is bounded from above. Assume $c := \inf_{x\in\mathcal{X},\,\tau\in[0,1]} x\beta(\tau) > 0$, and consider an interval $[y, y + \frac{1}{2\lambda kc}]$. Let $S(\lambda) := \sum_{i=0}^{\lambda}\bigl(C_\lambda^i\bigr)^2$, where $C_a^b$ denotes the number of ways of choosing $b$ elements from a set of size $a$. So,

\[
S(\lambda)\,E\left[\int_{\mathbb{R}}(l_k' p_k)^2\,dy\right]
= E\left[\sum_{j=0}^{\lambda}\bigl(C_\lambda^j\bigr)^2\int_{\mathbb{R}}\Bigl(l_k'\bigl(y + \tfrac{j}{2\lambda ukc}\bigr)\,p_k\bigl(y + \tfrac{j}{2\lambda ukc}\bigr)\Bigr)^2\,dy\right].
\]

And by the Cauchy-Schwarz inequality,
\[
E\left[\sum_{j=0}^{\lambda}\bigl(C_\lambda^j\bigr)^2\int_{\mathbb{R}}\Bigl(l_k'\bigl(y + \tfrac{j}{2\lambda ukc}\bigr)\,p_k\bigl(y + \tfrac{j}{2\lambda ukc}\bigr)\Bigr)^2\,dy\right]
\geq E\left[\int_{\mathbb{R}}\Bigl(\sum_{j=0}^{\lambda}(-1)^j C_\lambda^j\,l_k'\bigl(y + \tfrac{j}{2\lambda ukc}\bigr)\,p_k\bigl(y + \tfrac{j}{2\lambda ukc}\bigr)\Bigr)^2\,dy\right].
\]

Let $J_k^i := \bigl[a + x\beta\bigl(\tfrac{i+1/2}{k}\bigr) - \tfrac{1}{2\lambda uk},\; a + x\beta\bigl(\tfrac{i+1/2}{k}\bigr) + \tfrac{1}{2\lambda uk}\bigr]$, so

\[
E\left[\sum_{j=0}^{\lambda}\bigl(C_\lambda^j\bigr)^2\int_{\mathbb{R}}\Bigl(l_k'\bigl(y + \tfrac{j}{2\lambda ukc}\bigr)\,p_k\bigl(y + \tfrac{j}{2\lambda ukc}\bigr)\Bigr)^2\,dy\right]
\geq E\left[\int_{J_k^i}\Bigl(\sum_{j=0}^{\lambda}(-1)^j C_\lambda^j\,l_k'\bigl(y + \tfrac{j}{2\lambda ukc}\bigr)\,p_k\bigl(y + \tfrac{j}{2\lambda ukc}\bigr)\Bigr)^2\,dy\right].
\]


By the discontinuity assumption,
\[
\sum_{j=0}^{\lambda}(-1)^j C_\lambda^j\,l_k'\bigl(y + \tfrac{j}{2\lambda ukc}\bigr)\,p_k\bigl(y + \tfrac{j}{2\lambda ukc}\bigr)
= \frac{1}{(uk)^{\lambda-1}}\left(\frac{c}{(\lambda-1)!}\,x_i p_i + \sum_{j=1,\,j\neq i}^{k}\frac{c_j}{uk}\,x_j p_j\right),
\]

with the $c_j$ uniformly bounded from above, since $f^{(\lambda)}$ is $L_1$ Lipschitz except at $a$. Notice that the $J_k^i$ do not intersect one another, so
\[
S(\lambda)\,E\left[\int_{\mathbb{R}}(l_k' p_k)^2\,dy\right]
\geq S(\lambda)\sum_{i=1}^{k}E\left[\int_{J_k^i}(l_k' p_k)^2\,dy\right]
\geq E\left[\frac{c}{2uk}\sum_{i=1}^{k}\frac{1}{(uk)^{\lambda-1}}\left(\frac{c}{(\lambda-1)!}\,x_i p_i + \sum_{j=1,\,j\neq i}^{k}\frac{c_{jk}}{uk}\,x_j p_j\right)^2\right].
\]

For a constant $u$ large enough (depending only on $\lambda$, $\sup c_{jk}$, and $c$),
\[
E\left[\sum_{i=1}^{k}\frac{1}{(uk)^{\lambda-1}}\left(\frac{c}{(\lambda-1)!}\,x_i p_i + \sum_{j=1,\,j\neq i}^{k}\frac{c_{jk}}{uk}\,x_j p_j\right)^2\right]
\geq c(\lambda)\,\frac{1}{k^{2\lambda-1}}\sum_{j=1}^{k}E[x_j^2 p_j^2] \gtrsim \frac{1}{k^{2\lambda}}\,\|p\|_2^2,
\]

with some constant $c(\lambda)$ that depends only on $\lambda$ and $u$. In addition, from condition C5 we know that $\theta' I_k\theta \gtrsim \|\delta\|_2^2$. Therefore $\theta' I_k\theta \gtrsim \|\delta\|_2^2 + \frac{1}{k^{2\lambda}}\|p\|_2^2 \gtrsim \frac{1}{k^{2\lambda}}\|\theta\|_2^2$. Hence the smallest eigenvalue $r(I_k)$ of $I_k$ is bounded from below by $\frac{c}{k^{\lambda}}$, with a generic constant $c$ depending on $\lambda$, $x$, and the $L_1$ Lipschitz coefficient of $f^{(\lambda)}$ on the set $\mathbb{R} - [a - \eta, a + \eta]$.

Proof of Lemma 9.

Proof. The proof of this lemma is based on Chen's (2008) results on sieve extremum estimator consistency. Below we recall the five conditions listed in Theorem 3.1 of Chen (2008), which shows that as long as these conditions are satisfied, the sieve estimator is consistent.

Condition C11. [Identification] (1) $L(\theta_0) < \infty$, and if $L(\theta_0) = -\infty$ then $L(\theta) > -\infty$ for all $\theta \in \Theta_k \setminus \{\theta_0\}$ and all $k \geq 1$.
(2) There are a nonincreasing positive function $\delta(\cdot)$ and a positive function $m(\cdot)$ such that for all $\varepsilon > 0$ and $k \geq 1$,
\[
L(\theta_0) - \sup_{\theta\in\Theta_k:\,d(\theta,\theta_0)\geq\varepsilon} L(\theta) \geq \delta(k)\,m(\varepsilon) > 0.
\]

Condition C12. [Sieve space] $\Theta_k \subset \Theta_{k+1} \subset \Theta$ for all $k \geq 1$; and there exists a sequence $\pi_k\theta_0 \in \Theta_k$ such that $d(\theta_0, \pi_k\theta_0) \to 0$ as $k \to \infty$.

Condition C13. [Continuity] (1) $L(\theta)$ is upper semicontinuous on $\Theta_k$ under the metric $d(\cdot, \cdot)$. (2) $|L(\theta_0) - L(\pi_{k(n)}\theta_0)| = o(\delta(k(n)))$.

Condition C14. [Compact sieve space] $\Theta_k$ is compact under $d(\cdot, \cdot)$.

Condition C15. [Uniform convergence] (1) For all $k \geq 1$, $\operatorname{plim}\sup_{\theta\in\Theta_k}|L_n(\theta) - L(\theta)| = 0$. (2) $c(k(n)) = o_p(\delta(k(n)))$, where $c(k(n)) := \sup_{\theta\in\Theta_k}|L_n(\theta) - L(\theta)|$. (3) $\eta_{k(n)} = o(\delta(k(n)))$.


These assumptions are verified below. For Identification: let $d$ be the metric induced by the $L_2$ norm $\|\cdot\|_2$ defined on $\mathcal{D}$. By assumption C4 and compactness of $\mathcal{D}$, $L(\theta_0) - \sup_{\theta\in\Theta:\,d(\theta,\theta_0)\geq\varepsilon} L(\theta) > 0$ for any $\varepsilon > 0$. So taking $\delta(k) = 1$, the identification condition holds.

For the Sieve space condition: our sieve spaces satisfy $\Theta_k \subset \Theta_{2k} \subset \Theta$. In general we can consider the sequence $\Theta_{2^{k(1)}}, \Theta_{2^{k(2)}}, \ldots, \Theta_{2^{k(n)}}, \ldots$ instead of $\Theta_1, \Theta_2, \Theta_3, \ldots$, with $k(n)$ an increasing function and $\lim_{n\to\infty} k(n) = \infty$.

For the Continuity condition: since we assume $f$ is continuous and $L_1$ Lipschitz, (1) is satisfied; (2) is satisfied by the construction of our sieve space.

The Compact sieve space condition is trivial. The Uniform convergence condition is also easy to verify, since the entropy of the space $\Theta_k$ is finite and uniformly bounded by the entropy of $\Theta$. Therefore we have stochastic equicontinuity, i.e., $\sup_{\theta\in\Theta_k}|L_n(\theta) - L(\theta)| = O_p(\frac{1}{\sqrt{n}})$.

In (3), $\eta_k$ is defined as the error in the maximization procedure. Since $\delta$ is constant, as soon as our maximizer $\hat\theta_k$ satisfies $L_n(\hat\theta_k) - \sup_{\theta\in\Theta_k} L_n(\theta) \to_p 0$, (3) is verified. Hence, the sieve MLE estimator is consistent.

Proof of Lemma 10.

Proof. This result follows directly from Lemma 8. If the neighborhood $N_k$ of $\theta_0$ is chosen as $N_k := \{\theta \in \Theta_k : |\theta - \theta_0|_2 \leq \frac{\mu_k}{k^\lambda}\}$, with $\mu_k$ a sequence converging to $0$, then the nonlinear term in $E_x\bigl[\int_Y \frac{(g(y|x,\theta_0) - g(y|x,\theta))^2}{g(y|x,\theta_0)}\,dy\bigr]$ is dominated by the first-order term $\theta' I_k\theta$.

Proof of Theorem 1.

Proof. Suppose $\hat\theta = (\hat\beta(\cdot), \hat\sigma) \in \Theta_k$ is the sieve estimator. By the consistency of the sieve estimator (Lemma 9), $\|\hat\theta - \theta_0\|_2 \to_p 0$.

Denote $\tau_i = \frac{i-1}{k}$, $i = 1, 2, \ldots, k$.

The first-order conditions give
\[
E_n\bigl[-x f'(y - x\hat\beta(\tau_i)\,|\,\hat\sigma)\bigr] = 0, \qquad (B.10)
\]
and
\[
E_n\bigl[g_\sigma(y|x, \hat\beta, \hat\sigma)\bigr] = 0. \qquad (B.11)
\]
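For intuition about the objects entering these first-order conditions, the following Python sketch evaluates a discretized version of the likelihood and its score components, assuming the mixture form $g(y|x, \theta) = \int_0^1 f_\sigma(y - x\beta(\tau))\,d\tau$ discretized on the quantile grid, with a Gaussian $f_\sigma$. The function names and the Gaussian choice are illustrative assumptions, not the paper's implementation.

import numpy as np
from scipy.stats import norm

def log_likelihood(beta_grid, sigma, y, X):
    # beta_grid: (k, p) values of beta(tau_i) on the grid; returns sum_i log g(y_i | x_i, theta)
    fitted = X @ beta_grid.T                     # (n, k): x_i beta(tau_j)
    dens = norm.pdf(y[:, None] - fitted, scale=sigma)
    g = dens.mean(axis=1)                        # g(y|x) ~ (1/k) sum_j f_sigma(y - x beta(tau_j))
    return np.sum(np.log(g + 1e-300))

def score_beta(beta_grid, sigma, y, X):
    # empirical analogue of E_n[ f_{tau_j}(y|x,theta) / g(y|x,theta) ], one column per knot tau_j
    fitted = X @ beta_grid.T
    resid = y[:, None] - fitted
    dens = norm.pdf(resid, scale=sigma)
    g = dens.mean(axis=1, keepdims=True)
    # d g / d beta(tau_j) = (1/k) f'_sigma(y - x beta(tau_j)) (-x), with f'_sigma(u) = -(u/sigma^2) f_sigma(u)
    weights = dens * resid / sigma**2 / dens.shape[1]
    return X.T @ (weights / g) / len(y)          # (p, k) matrix; (B.10) sets each column to zero

At the sieve optimum, each column of score_beta (and the corresponding derivative with respect to $\sigma$) is zero, which is what (B.10)-(B.11) state.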

Assume that we choose $k = k(n)$ such that: (1) $k^{\lambda+1}\|\hat\theta - \theta_0\|_2 \to_p 0$; (2) $k^r\|\hat\theta - \theta_0\|_2 \to \infty$ with probability approaching one.


And assume that $r > \lambda + 1$. Such a sequence $k(n)$ exists because Lemma 7 gives an upper bound for $\|\hat\theta - \theta_0\|_2$, namely $\|\theta_0 - \hat\theta\|_2 \lesssim_p n^{-\frac{1}{2(2+\lambda)}}$.

From the first-order conditions we know that
\[
E_n\left[\left(\frac{f_{\tau_1}(y|x,\hat\theta),\, f_{\tau_2}(y|x,\hat\theta),\, \ldots,\, f_{\tau_k}(y|x,\hat\theta)}{g(y|x,\hat\theta)},\; \frac{g_\sigma(y|x,\hat\theta)}{g(y|x,\hat\theta)}\right)\right] = 0.
\]
Denote $f_{\hat\theta k} = (f_{\tau_1}(y|x,\hat\theta), f_{\tau_2}(y|x,\hat\theta), \ldots, f_{\tau_k}(y|x,\hat\theta))$. Therefore,

\[
0 = E_n\left[\Bigl(\frac{f_{\hat\theta k}}{g}, \frac{g_\sigma}{g}\Bigr)\right]
= \left(E_n\left[\Bigl(\frac{f_{\hat\theta k}}{g}, \frac{g_\sigma}{g}\Bigr)\right] - E_n\left[\Bigl(\frac{f_{\theta_k,0}}{g_0}, \frac{g_{\sigma,0}}{g_0}\Bigr)\right]\right) + E_n\left[\Bigl(\frac{f_{\theta_k,0}}{g_0}, \frac{g_{\sigma,0}}{g_0}\Bigr)\right].
\]

The first term can be decomposed as
\[
E_n\left[\Bigl(\frac{f_{\hat\theta k}}{g}, \frac{g_\sigma}{g}\Bigr)\right] - E_n\left[\Bigl(\frac{f_{\hat\theta k}}{g_0}, \frac{g_\sigma}{g_0}\Bigr)\right]
+ E_n\left[\Bigl(\frac{f_{\hat\theta k}}{g_0}, \frac{g_\sigma}{g_0}\Bigr) - \Bigl(\frac{f_{\theta_k,0}}{g_0}, \frac{g_{\sigma,0}}{g_0}\Bigr)\right].
\]

By C10, a pointwise CLT holds in the space $\Theta_k$ for $E_n[(\frac{f_{\theta k}}{g_0}, \frac{g_\sigma}{g_0})]$. Define $\Sigma := U[0, 1] \times \mathcal{D}$ and consider the mapping $H : \Sigma \to \mathbb{R}$, $H((u, \theta)) = \sqrt{n}\bigl(E_n[\frac{f_u}{g}] - E[\frac{f_u}{g}]\bigr)$. By the Donskerness of $\mathcal{D}$ and $U[0, 1]$, $\Sigma$ is Donsker, and by C10 we have a pointwise CLT for $H((u, \theta))$. By condition C2, the Lipschitz condition guarantees that $E\bigl[\bigl|\frac{f_u}{g}(\theta) - \frac{f_u}{g}(\theta')\bigr|^2\bigr] \lesssim \|\theta - \theta'\|_2^2$. Therefore, stochastic equicontinuity holds: for $\gamma$ small enough,
\[
\Pr\left(\sup_{|u-u'|\leq\gamma,\,\|\theta-\theta'\|_2\leq\gamma,\,\theta,\theta'\in\Theta}\bigl|H((u, \theta)) - H((u', \theta'))\bigr| \geq \gamma\right) \to 0.
\]

Similarly, we have stochastic equicontinuity for $\sqrt{n}E_n[\frac{g_\sigma}{g}]$.

Notice that $E\bigl[(\frac{f_{\hat\theta k}}{g_0}, \frac{g_\sigma}{g_0}) - (\frac{f_{\theta_k,0}}{g_0}, \frac{g_{\sigma,0}}{g_0})\bigr] = 0$, so
\[
E_n\left[\Bigl(\frac{f_{\hat\theta k}}{g_0}, \frac{g_\sigma}{g_0}\Bigr) - \Bigl(\frac{f_{\theta_k,0}}{g_0}, \frac{g_{\sigma,0}}{g_0}\Bigr)\right]
= o_p\Bigl(\frac{1}{\sqrt{n}}\Bigr) + E\left[\Bigl(\frac{f_{\hat\theta k}}{g_0}, \frac{g_\sigma}{g_0}\Bigr) - \Bigl(\frac{f_{\theta_k,0}}{g_0}, \frac{g_{\sigma,0}}{g_0}\Bigr)\right] = o_p\Bigl(\frac{1}{\sqrt{n}}\Bigr).
\]

Therefore, the first term equals $E_n[(\frac{f_{\hat\theta k}}{g}, \frac{g_\sigma}{g})] - E_n[(\frac{f_{\hat\theta k}}{g_0}, \frac{g_\sigma}{g_0})] + o_p(\frac{1}{\sqrt{n}})$.

Similarly, for the second term, $E_n[(\frac{f_{\theta_k,0}}{g_0}, \frac{g_{\sigma,0}}{g_0})]$, we can establish a functional CLT:
\[
E_n\left[\Bigl(\frac{f_{\theta_k,0}}{g_0}, \frac{g_{\sigma,0}}{g_0}\Bigr)\right] = \frac{1}{\sqrt{n}}\mathbb{G}_n(\tau_1, \tau_2, \ldots, \tau_k) + o_p\Bigl(\frac{1}{\sqrt{n}}\Bigr) \in \mathbb{R}^{krd_x + d_p},
\]
where $\mathbb{G}_n(\cdot) := \sqrt{n}E_n\bigl[\bigl(\frac{f_\tau}{g_0}\big|_{0\leq\tau\leq 1}, \frac{g_{\sigma,0}}{g_0}\bigr)\bigr]$ converges weakly to a tight Gaussian process.

Therefore,
\[
E_n\left[\Bigl(\frac{f_{\hat\theta k}}{g}, \frac{g_\sigma}{g}\Bigr)\right] - E_n\left[\Bigl(\frac{f_{\hat\theta k}}{g_0}, \frac{g_\sigma}{g_0}\Bigr)\right] = -\frac{1}{\sqrt{n}}\mathbb{G}_n(\tau_1, \tau_2, \ldots, \tau_k)(1 + o_p(1)).
\]

Similarly, by stochastic equicontinuity we can replace $f_{\hat\theta k}$ with $f_{\theta_0}$ and $g_\sigma$ with $g_{\sigma_0}$ on the left-hand side of the above equation, and we get:

\[
E\left[(f_{\theta_0}, g_{\sigma_0})'\,\frac{g - g_0}{g_0^2}\right] = \frac{1}{\sqrt{n}}\mathbb{G}_n(\tau_1, \tau_2, \ldots, \tau_k)(1 + o_p(1)), \qquad (B.12)
\]


where $g - g_0 = g(y|x, \hat\theta) - g(y|x, \theta_0) := g(y|x, \hat\theta) - g(y|x, \pi_k\theta_0) + \bigl(g(y|x, \pi_k\theta_0) - g(y|x, \theta_0)\bigr)$. By the construction of the sieve, $g(y|x, \pi_k\theta_0) - g(y|x, \theta_0) = O(\frac{1}{k^{r+1}})$, which depends only on the order of the spline employed in the estimation.

The first term is $g(y|x, \hat\theta) - g(y|x, \pi_k\theta_0) = (\hat\theta - \pi_k\theta_0)'(f_{\tilde\theta}, g_{\tilde\sigma})$, where $\tilde\theta$ and $\tilde\sigma$ are intermediate values between $\hat\theta$ and $\pi_k\theta_0$. Denote $\tilde{I}_k = E[(f_{\theta_0}, g_{\sigma_0})'(f_{\tilde\theta}, g_{\tilde\sigma})]$. The eigenvalue norm satisfies $\|\tilde{I}_k - I_k\|_2 = \|E[(f_{\theta_0}, g_{\sigma_0})'(f_{\tilde\theta} - f_{\theta_0}, g_{\tilde\sigma} - g_{\sigma_0})]\|_2$, while $\|(f_{\tilde\theta} - f_{\theta_0}, g_{\tilde\sigma} - g_{\sigma_0})\|_2 \lesssim \|\hat\theta - \pi_k\theta_0\|_2 + \|\pi_k\theta_0 - \theta_0\|_2 \lesssim \|\hat\theta - \pi_k\theta_0\|_2$. (By the choice of $k$, $\|\hat\theta - \pi_k\theta_0\|_2 \geq \|\hat\theta - \theta_0\|_2 - \|\pi_k\theta_0 - \theta_0\|_2$, and the first term dominates the second because $\|\pi_k\theta_0 - \theta_0\|_2 \lesssim \frac{1}{k^r}$ and $k^r\|\hat\theta - \theta_0\|_2 \to_p \infty$.) Since $\sup_{x,y}\|(f_{\theta_0}, g_{\sigma_0})\|_2 \lesssim k$, we have $\|\tilde{I}_k - I_k\|_2 \lesssim k\|\hat\theta - \pi_k\theta_0\|_2$. Denote $H := \tilde{I}_k - I_k$, so

\[
\|H\|_2 \lesssim k\|\hat\theta - \pi_k\theta_0\|_2 \lesssim k\|\hat\theta - \theta_0\|_2 = o_p\Bigl(\frac{1}{k^{\lambda}}\Bigr). \qquad (B.13)
\]

We have $\tilde{I}_k^{-1} = (I_k + H)^{-1} = I_k^{-1}(I + I_k^{-1}H)^{-1}$. By (B.13), $\|I_k^{-1}H\|_2 \leq k^\lambda\,\frac{o_p(1)}{k^\lambda} = o_p(1)$. Therefore $I + I_k^{-1}H$ is invertible and has eigenvalues bounded from above and below uniformly in $k$. Hence the largest eigenvalue of $\tilde{I}_k^{-1}$ is bounded by $Ck^\lambda$ for some generic constant $C$, with probability approaching one. Furthermore, $\tilde{I}_k^{-1} = I_k^{-1}(1 + o_p(1))$.

Applying our result in (B.12), we get $\hat\theta - \pi_k\theta_0 = \tilde{I}_k^{-1}\frac{1}{\sqrt{n}}\mathbb{G}_n(1 + o_p(1))$, so $\|\hat\theta - \theta_0\|_2 = O_p(\frac{k^\lambda}{\sqrt{n}})$. Moreover, under condition C.11, $\|\hat\theta - \theta_0\|_2 \lesssim_p \frac{k^{\lambda-\alpha}}{\sqrt{n}}$.

Recall that at the beginning we assumed: (1) $k^{\lambda+1}\|\hat\theta - \theta_0\|_2 \to 0$; (2) $k^r\|\hat\theta - \theta_0\|_2 \to \infty$ with probability approaching one; and $\|\theta_0 - \pi_k\theta_0\|_2 = O_p(\frac{1}{k^r})$.

Given that $\|\hat\theta - \theta_0\|_2 = O_p(\frac{k^\lambda}{\sqrt{n}})$, the regularity conditions on $k$ are:
(a) $\frac{k^{2\lambda+1}}{\sqrt{n}} \to 0$;
(b) $\frac{k^{\lambda+r-\alpha}}{\sqrt{n}} \to \infty$.
If $k$ satisfies growth conditions (1) and (2), the sieve estimator $\hat\theta_k$ satisfies
\[
\hat\theta_k - \theta_0 = \hat\theta_k - \pi_k\theta_0 + \pi_k\theta_0 - \theta_0 = O_p\Bigl(\frac{k^\lambda}{\sqrt{n}}\Bigr) + O\Bigl(\frac{1}{k^r}\Bigr) = O_p\Bigl(\frac{k^\lambda}{\sqrt{n}}\Bigr).
\]

Under condition C.11,
\[
\hat\theta - \pi_k\theta_0 = \tilde{I}_k^{-1}\frac{1}{\sqrt{n}}\mathbb{G}_n(1 + o_p(1)) \gtrsim_p \frac{k^{\lambda-\alpha}}{\sqrt{n}} \gg \frac{1}{k^{r}}.
\]

Therefore,
\[
\hat\theta - \theta_0 = \tilde{I}_k^{-1}\frac{1}{\sqrt{n}}\mathbb{G}_n(1 + o_p(1)),
\]
and pointwise asymptotic normality holds.


For the parameter $\sigma$, we recall Chen's (2008) result. In fact we use a result slightly stronger than that in Chen (2008), but the reasoning is similar. For $k$ satisfying growth condition (1), we know that $L(\theta_0) - L(\theta) \asymp \|\theta - \theta_0\|_k^2$ for $\theta \in D_k \cap N_k$, where $\|\cdot\|_k$ is the norm induced by the inner product $\langle\cdot, I_k\cdot\rangle$.

For a functional of interest $h : \Theta \to \mathbb{R}$, define the directional derivative
\[
\frac{\partial h(\theta_0)}{\partial\theta}[\theta - \theta_0] := \lim_{t\to 0}\frac{h(\theta_0 + t[\theta - \theta_0]) - h(\theta_0)}{t}.
\]

Define $V$ to be the space spanned by $\Theta - \theta_0$. By the Riesz representation theorem, there exists $v^* \in V$ such that $\frac{\partial h(\theta_0)}{\partial\theta}[\theta - \theta_0] = \langle\theta - \theta_0, v^*\rangle$, where $\langle\cdot, \cdot\rangle$ is the inner product that induces $\|\cdot\|$. Below are the five conditions needed for asymptotic normality of $h(\hat\theta)$.

Condition C16. (i) There is $\omega > 0$ such that $|h(\theta) - h(\theta_0) - \frac{\partial h(\theta_0)}{\partial\theta}[\theta - \theta_0]| = O(\|\theta - \theta_0\|^\omega)$ uniformly in $\theta \in \Theta_n$ with $\|\theta - \theta_0\| = o(1)$;
(ii) $\|\frac{\partial h(\theta_0)}{\partial\theta}\| < \infty$;
(iii) there is $\pi_n v^* \in \Theta_k$ such that $\|\pi_n v^* - v^*\| \times \|\hat\theta_n - \theta_0\| = o_p(n^{-\frac{1}{2}})$.

This assumption is easy to verify because our functional $h$ is linear in $\sigma$: $h(\theta) = \delta'\sigma$. Therefore the first and third conditions are automatically verified, and the second condition holds by assumption C.6. So $v^* = (0, E[g_{\sigma,0}g_{\sigma,0}']^{-1}\delta)$. Denote $I_\sigma := E[g_{\sigma,0}g_{\sigma,0}']$.

Let $\varepsilon_n$ denote any sequence satisfying $\varepsilon_n = o(n^{-1/2})$, and suppose the sieve estimator converges to $\theta_0$ faster than $\delta_n$. Define the empirical process operator $\mu_n = E_n - E$.

Condition C17.
\[
\sup_{\theta\in\Theta_{k(n)};\,\|\theta-\theta_0\|\leq\delta_n}\mu_n\Bigl(l(\theta, Z) - l(\theta \pm \varepsilon_n\pi_n v^*, Z) - \frac{\partial l(\theta_0, Z)}{\partial\theta}[\pm\varepsilon_n\pi_n v^*]\Bigr) = O_p(\varepsilon_n^2).
\]

In fact this is the stochastic equicontinuity assumption. It can be verified from the P-Donskerness of $\Theta$ and the $L_1$ Lipschitz property that we assume on the function $f$.

Denote K(θ1, θ2) = L(θ1)− L(θ2).

Condition C18.
\[
E\Bigl[\frac{\partial l(\hat\theta_n, Z)}{\partial\theta}[\pi_n v^*]\Bigr] = \langle\hat\theta_n - \theta_0, \pi_n v^*\rangle + o(n^{-1/2}).
\]

We know that $E\bigl[\frac{\partial l(\hat\theta_n, Z)}{\partial\theta}[\pi_n v^*]\bigr] = E\bigl[\frac{g_\sigma(y|x, \hat\theta_n)'}{g}\bigr]I_\sigma^{-1}\delta$. We also know that $|\hat\theta_n - \theta_0|_2 = O_p(\frac{k^\lambda}{\sqrt{n}})$ and $\frac{k^{2\lambda+1}}{\sqrt{n}} \to 0$, so $|\hat\theta_n - \theta_0|_2^2 = o_p(n^{-(\lambda+1)/(2\lambda+1)}) = o_p(n^{-1/2})$. Moreover, $E\bigl[\frac{g_\sigma(y|x, \hat\theta_n)}{g_0}\bigr] = 0$. So


\[
E\Bigl[\frac{g_\sigma(y|x, \hat\theta_n)}{g}\Bigr] = E\Bigl[g_\sigma(y|x, \hat\theta_n)\Bigl(\frac{1}{g} - \frac{1}{g_0}\Bigr)\Bigr].
\]

By the $L_1$ Lipschitz condition, $|g - g_0| \leq C_1(x, y)|\hat\theta - \theta_0|_2$ and $|g_\sigma(\cdot|\hat\theta_n) - g_\sigma(\cdot|\theta_0)| \leq C_2(x, y)|\hat\theta - \theta_0|_2$, and these coefficients have well-behaved tails. Therefore,

\[
E\Bigl[g_\sigma(y|x, \hat\theta_n)\Bigl(\frac{1}{g} - \frac{1}{g_0}\Bigr)\Bigr] = E\Bigl[g_\sigma(y|x, \theta_0)\,\frac{\partial g/\partial\theta}{g^2}\,(\hat\theta_n - \theta_0)\Bigr] + O\bigl(|\hat\theta_n - \theta_0|_2^2\bigr).
\]

Then $E\bigl[\frac{\partial l(\hat\theta_n, Z)}{\partial\theta}[\pi_n v^*]\bigr] = E\bigl[(\hat\theta_n - \theta_0)'\,\frac{\partial g/\partial\theta}{g^2}\,g_\sigma(y|x, \theta_0)'\,I_\sigma^{-1}\delta\bigr] + o_p(n^{-1/2})$, and the first term on the right-hand side equals $\langle\hat\theta_n - \theta_0, \pi_n v^*\rangle$.

Condition C19. (i) $\mu_n\bigl(\frac{\partial l(\theta_0, Z)}{\partial\theta}[\pi_n v^* - v^*]\bigr) = o_p(n^{-1/2})$. (ii) $E\bigl[\frac{\partial l(\theta_0, Z)}{\partial\theta}[\pi_n v^*]\bigr] = o(n^{-1/2})$.

(i) is immediate since $v^* = \pi_n v^*$; (ii) is also immediate since the expectation equals $0$.

Condition C20.
\[
\mathbb{G}_n\,\frac{\partial l(\theta_0, Z)}{\partial\theta}[v^*] \to_d N(0, \sigma_v^2), \qquad \text{with } \sigma_v^2 > 0.
\]

This assumption is also not hard to verify, because $\frac{\partial l(\theta_0, Z)}{\partial\theta}[v^*]$ is a random variable with variance bounded from above, and it is non-degenerate because $E[g_\sigma g_\sigma']$ is positive definite. Then by Theorem 4.3 in Chen (2008), for any $\delta$, our sieve estimator $\hat\sigma_n$ satisfies
\[
\sqrt{n}\,\delta'(\hat\sigma_n - \sigma_0) \to_d N(0, \sigma_v^2).
\]
Since this conclusion holds for arbitrary $\delta$, $\sqrt{n}(\hat\sigma_n - \sigma_0)$ must be jointly asymptotically normal, i.e., there exists a positive definite matrix $V$ such that $\sqrt{n}(\hat\sigma_n - \sigma_0) \to_d N(0, V)$.

B.4. Lemmas and Theorems in Appendix A. In this subsection we prove the results stated in Appendix A.

Proof of Theorem 2.

Proof. For simplicity, we assume that $\varphi_\varepsilon(s)$ is a real function; the proof is similar when it is complex. Then,

\[
F(x\beta, y, \tau, q) = \frac{1}{2} - \tau + \int_0^{q}\frac{\sin((y - x\beta)s)}{\pi s}\,\frac{1}{\varphi_\varepsilon(s)}\,ds.
\]

Under condition OS, $|F(x\beta, y, \tau, q)| \leq C(q^{\lambda-1} + 1)$ if $\lambda \geq 1$, for some fixed $C$ and $q$ large enough. Under condition SS, $|F(x\beta, y, \tau, q)| \leq C_1\exp(q^\lambda/C_2)$ for some fixed $C_1$ and $C_2$. Let
\[
h(x, y) := E\left[\int_0^{q_n}\frac{\cos((y - x\beta)s)}{\pi}\,\frac{ds}{\varphi_\varepsilon(s)}\right].
\]


By a change of the order of integration, $E\bigl[\int_0^{q_n}\frac{\cos((y - x\beta)s)}{\pi}\frac{ds}{\varphi_\varepsilon(s)}\bigr] \leq C$. Let $\sigma(x)^2 = E[F(x\beta, y, \tau, q)^2\mid x]$. Under condition OS, by a similar computation, $C_1 q^{2(\lambda-1)} \leq \sigma(x)^2 \leq C_2 q^{2(\lambda-1)}$; under condition SS, $C_1\exp(q^\lambda/C_3) \leq \sigma(x)^2 \leq C_2\exp(q^\lambda/C_4)$, for some fixed $C_1, C_2, C_3, C_4$. For simplicity, let $m(\lambda) = q^{\lambda-1}$ if the OS condition holds, and $m(\lambda) = \exp(q^\lambda/C)$ if the SS condition holds with constant $C$.

The estimator $\hat\beta$ solves
\[
E_n\left[\frac{x\,h(x)}{\sigma(x)^2}\,F(x\beta, y, \tau, q)\right] = 0. \qquad (B.14)
\]

Adopting the usual local expansion,
\[
E_n\left[\frac{h(x, y)}{\sigma(x)^2}\bigl(F(x\hat\beta, y, \tau, q) - F(x\beta_0, y, \tau, q)\bigr)\right]
= -\left(E_n\left[\frac{h(x, y)}{\sigma(x)^2}F(x\beta_0, y, \tau, q)\right] - E\left[\frac{h(x, y)}{\sigma(x)^2}F(x\beta_0, y, \tau, q)\right]\right)
- E\left[\frac{h(x, y)}{\sigma(x)^2}F(x\beta_0, y, \tau, q)\right]. \qquad (B.15)
\]

The left-hand side equals
\[
(\hat\beta - \beta_0)'\,E\left[xx'\frac{h(x, y)^2}{\sigma(x)^2}\right](1 + o_p(1)) = \frac{q^2}{m(\lambda)^2}\,E[l(x, q)\,xx'](\hat\beta - \beta_0)(1 + o_p(1)),
\]
where $l(x, q)$ is a bounded function converging uniformly to $l(x, \infty)$. The first term on the right-hand side is

\[
-\frac{1}{\sqrt{n}}\,\mathbb{G}_n\left[x\,\frac{h(x, y)}{\sigma(x)}\,\frac{F(x\beta_0, y, \tau, q)}{\sigma(x)}\right] = O_p\left(\frac{1}{\sqrt{n}\,m(\lambda)}\right).
\]

The second term on the right-hand side is
\[
E\left[\frac{h(x)}{\sigma(x)^2}\,F(x\beta_0, y, \tau, q)\right] = O\left(\frac{q}{m(\lambda)^2}\right).
\]

Therefore, $\hat\beta - \beta_0 = O_p\bigl(\frac{m(\lambda)}{\sqrt{n}}\bigr) + O\bigl(\frac{1}{q}\bigr)$.

Under SS, the optimal truncation level is $q = \log(n)^{\frac{1}{\lambda}}$, and the optimal convergence rate is therefore $\log(n)^{-\frac{1}{\lambda}}$. Under the OS condition, the optimal $q = n^{\frac{1}{2\lambda}}$, and the optimal convergence rate is $n^{-\frac{1}{2\lambda}}$.
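For completeness, the OS rate follows from balancing the two terms in the bound just above:
\[
\frac{m(\lambda)}{\sqrt{n}} = \frac{q^{\lambda-1}}{\sqrt{n}} \asymp \frac{1}{q} \;\Longrightarrow\; q^{\lambda} \asymp \sqrt{n} \;\Longrightarrow\; q = n^{\frac{1}{2\lambda}},
\]
so that $\hat\beta - \beta_0 = O_p\bigl(n^{-\frac{1}{2\lambda}}\bigr)$, as stated in Theorem 2(a).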