
On minimum Hellinger distance estimation

Jingjing WU 1* and Rohana J. KARUNAMUNI 2

1 Department of Mathematics and Statistics, University of Calgary, Calgary, Alberta, Canada T2N 1N4
2 Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, Alberta, Canada T6G 2G1

Key words and phrases: Asymptotic efficiency; asymptotic normality; minimum Hellinger distance estimators; robust statistics; semiparametric models.

MSC 2000: Primary 62F10, 62E10; secondary 62F35, 60F05.

Abstract: Efficiency and robustness are two fundamental concepts in parametric estimation problems. It was long thought that there was an inherent contradiction between the aims of achieving robustness and efficiency; that is, a robust estimator could not be efficient and vice versa. It is now known that the minimum Hellinger distance approach introduced by Beran [R. Beran, Annals of Statistics 1977;5:445–463] is one way of reconciling the conflicting concepts of efficiency and robustness. For parametric models, it has been shown that minimum Hellinger distance estimators achieve efficiency at the model density and simultaneously have excellent robustness properties. In this article, we examine the application of this approach in two semiparametric models. In particular, we consider a two-component mixture model and a two-sample semiparametric model. In each case, we investigate minimum Hellinger distance estimators of finite-dimensional Euclidean parameters of particular interest and study their basic asymptotic properties. Small sample properties of the proposed estimators are examined using a Monte Carlo study. The results can be extended to semiparametric models of general form as well. The Canadian Journal of Statistics 37: 514–533; 2009 © 2009 Statistical Society of Canada


* Author to whom correspondence may be addressed. E-mail: [email protected]
This article is based in part on Jingjing Wu's doctoral dissertation, which was selected as a co-winner of the 2007 Pierre Robillard Award of the Statistical Society of Canada. The thesis supervisor was Rohana J. Karunamuni.

1. INTRODUCTION

Statistical inference is based on statistical models for data. During most of the history of the subject, these have been parametric: the mechanism generating the data could be identified by specifying a


few real parameters. However, during the last 30 years nonparametric and semiparametric models have flourished. The main reason has been the rise of computing power, which permits the application of such models to large data sets and has exposed the inadequacy of parametric models. The deficiency in interpretability of nonparametric models was filled by the development of semiparametric models. The main focus of research in this area has been the construction of such models and corresponding statistical procedures in response to particular types of data arising in various disciplines, primarily in biostatistics and econometrics. Well-known semiparametric models include the Cox proportional hazards model in survival analysis, econometric index models, regression models and errors-in-variables models, among many others.

Many authors have considered efficient and adaptive estimation in semiparametric models in the literature (Bickel, 1982; Schick, 1986; van der Vaart, 1998; Forrester et al., 2003, among others). However, robust efficient estimation in semiparametric models has received little attention. Efficiency when the model has been appropriately chosen, and robustness when it has not, are two fundamental concepts in parametric estimation. It was long thought that there was an inherent contradiction between the aims of achieving robustness and efficiency; that is, a robust estimator could not be efficient and vice versa. Among the practical deficiencies of maximum likelihood estimators (MLEs) are the lack of resistance to outliers and the general non-robustness with respect to model misspecification. The need for robust statistics in statistical inference is now widely recognized. Different approaches for finding robust statistics for parametric models have been proposed; see Huber (1980) and Maronna, Martin & Yohai (2007) for summaries of the most important methods. Many robust estimators achieve robustness at some cost in first-order efficiency. This is, however, not the case with the minimum Hellinger distance (MHD) estimators introduced by Beran (1977). Lindsay (1994) has shown that MLE and MHD estimators are members of a larger class of efficient estimators with various second-order efficiency properties. MHD estimators have been shown to have excellent robustness properties in parametric models, such as resistance to outliers and robustness with respect to model misspecification (Beran, 1977; Donoho & Liu, 1988). In fact, Donoho & Liu (1988) have shown the much stronger result that all minimum distance estimators are automatically robust with respect to the stability of the quantity being estimated. Efficiency combined with excellent robustness properties makes MHD estimators appealing in practice. Furthermore, the Hellinger distance has the special attraction that it is dimensionless. For a comparison of MHD estimators with MLEs and the balance between robustness and efficiency of estimators, see Lindsay (1994).

The literature on MHD estimation has been dominated by MHD estimation in fully parametric models; see, for instance, Beran (1977), Tamura & Boos (1986), Simpson (1987), Yang (1991), Ying (1992), Karlis & Xekalaki (1998, 2001), Lu, Hui & Lee (2003), and Woo & Sriram (2006, 2007), among others. Very little research has been done on the application of the MHD methodology to semiparametric models, however, due in large part to computational difficulties caused by the presence of a nuisance parameter, possibly infinite-dimensional. A systematic study of MHD estimation in semiparametric models has been made in the Ph.D. thesis of Wu (2007). This article is based on some results obtained in that thesis.

The layout of this article is as follows. In Section 2, we consider the problem of estimating the mixture proportion θ in the two-component mixture model of the form θF + (1 − θ)G, where F and G are two unknown distribution functions. The distributions F and G are the nuisance parameters in this case, and they are estimated using training samples. In Section 3, we study a two-sample semiparametric model in which the log-ratio of the two underlying density functions has a regression form; that is, hθ(x) = g(x) exp[α + r(x)β] with θ = (α, β) the parameter of interest and g an unknown nuisance parameter. The preceding set-up includes the two-sample location-scale model as a special case and is also closely related to the logistic regression model. In each case, we investigate MHD estimators of the parameter θ and study their basic asymptotic properties. The general conclusion reached from the study is that the proposed MHD estimators


are asymptotically consistent and asymptotically normally distributed with high efficiency properties. Results of a simulation study are presented in Section 4, and some concluding remarks are given in Section 5. Some proofs are given in an appendix.

2. MHD ESTIMATION IN A TWO-COMPONENT MIXTURE MODEL

2.1. Two-Component Mixture Model
Let F and G be two probability distributions and θ be a positive real number between 0 and 1. Then θF + (1 − θ)G defines a two-component mixture distribution with mixture weights θ and (1 − θ). When the component distributions F and G are known to have some specific forms, then θF + (1 − θ)G is a parametric mixture, and if F and G are completely unspecified but are different distributions, then θF + (1 − θ)G is known as a nonparametric mixture. For the literature on parametric mixture models, see Titterington, Smith & Makov (1985), Chen (1992, 1998), Chen & Kalbfleisch (1996), and McLachlan & Peel (2000), among others. Estimation of the mixture parameter θ in a nonparametric mixture model, however, is more formidable due to the lack of identifiability of θ. To overcome the difficulty, Hall (1981) suggested taking training samples. Specifically, he observed three independent samples

\[
\begin{aligned}
X_1, \ldots, X_{n_0} &\ \overset{\text{i.i.d.}}{\sim}\ F, \\
Y_1, \ldots, Y_{n_1} &\ \overset{\text{i.i.d.}}{\sim}\ G, \\
Z_1, \ldots, Z_{n_2} &\ \overset{\text{i.i.d.}}{\sim}\ \theta F + (1-\theta) G,
\end{aligned} \tag{1}
\]

and then estimated the mixture parameter θ, treating F and G as nuisance parameters. For model (1), Hall (1981, 1983) described a minimum distance estimator based on empirical distribution functions, Titterington (1983) considered a minimum distance estimator based on density estimators, and Hall & Titterington (1984) constructed a sequence of multinomial approximations and a related maximum likelihood estimator of θ by grouping data, for a model similar to (1). Qin (1999) developed a confidence interval for θ using an empirical likelihood ratio based statistic, assuming the log-likelihood ratio of the densities of F and G is linear in the observations.

It has been observed that robust methods such as M-estimation are not easily adapted for nonparametric mixtures (Cutler & Cordero-Brana, 1996). Minimum distance estimation is an alternative approach that produces robust estimators. In this section, we propose to estimate the mixture parameter θ using the MHD approach of Beran (1977). The set-up of Beran (1977) assumes that the observed random variables are independent identically distributed (i.i.d.) with some unknown density g which is close in the Hellinger metric to a member of some specified parametric class {fθ : θ ∈ Θ}. The model at (1) is not parametric, but rather it is semiparametric in nature. Thus, the results in this section exhibit an extension of Beran's (1977) MHD technique to a semiparametric model. Furthermore, the combined data set of (1), X1, . . . , Xn0, Y1, . . . , Yn1, Z1, . . . , Zn2, is a collection of independent observations, but not necessarily identically distributed.

There have been very few attempts to estimate the parameters in a mixture problem using the MHD approach. Woodward, Whitney & Eslinger (1995), Cordero-Brana (1994), Cutler & Cordero-Brana (1996) and Lu, Hui & Lee (2003) have studied MHD estimation for finite mixtures. However, their results are for the case that F and G are fully parametric models. More specifically, Woodward, Whitney & Eslinger (1995) have concentrated on estimating the mixture proportions (π1, . . . , πk−1) in a fully parametric model of the form \(\sum_{i=1}^{k} \pi_i f(x \mid \phi_i)\), whereas Cordero-Brana (1994) and Cutler & Cordero-Brana (1996) have assumed that all the mixture parameters (π1, . . . , πk−1, φ1, . . . , φk) are of interest, extending the work of Woodward, Whitney & Eslinger (1995), where f(· | φ1), . . . , f(· | φk) are density functions on the real line and φi ∈ Φ ⊆ IRs, i = 1, . . . , k. Lu, Hui & Lee (2003) have examined MHD estimation for finite


mixtures of Poisson regression models. The present work thus gives an extension of the above work to the case where the distributions F and G in model (1) are completely unknown.

The proposed MHD estimator of θ is given in Section 2.2. Our approach is very natural. We minimize the Hellinger distance between a totally nonparametric adaptive kernel density estimator and a parameterized convex combination of estimated component densities. We study asymptotic properties such as strong consistency and asymptotic normality of the proposed estimator. Asymptotic efficiency properties of the proposed MHD estimator are also examined. This is done by constructing a Cramer-Rao type lower bound for nonparametric estimators of the mixture proportion.

2.2. MHD Estimator of the Mixture Proportion
In order to implement the MHD technique of Beran (1977), we first define a parametric family

\[
h_\theta(x) = \theta f(x) + (1-\theta) g(x), \tag{2}
\]

where f and g denote the two different densities of F and G, respectively. Assume that f and g satisfy the condition \(\int |f(x) - g(x)|\,dx > 0\) (this is required for identifiability of θ). Next define the following adaptive kernel density estimators of f and g, respectively, based on the data X1, . . . , Xn0 and Y1, . . . , Yn1 of (1):

\[
\hat f(x) = \frac{1}{n_0 S_{n_0} b_{n_0}} \sum_{i=1}^{n_0} K_0\!\left(\frac{x - X_i}{S_{n_0} b_{n_0}}\right), \tag{3}
\]

\[
\hat g(x) = \frac{1}{n_1 S_{n_1} b_{n_1}} \sum_{j=1}^{n_1} K_1\!\left(\frac{x - Y_j}{S_{n_1} b_{n_1}}\right), \tag{4}
\]

where K0 and K1 are two smooth density functions, the bandwidths bn0 and bn1 are positive constants such that bni → 0 as ni → ∞, i = 0, 1, and Sn0 = Sn0(X1, . . . , Xn0) and Sn1 = Sn1(Y1, . . . , Yn1) are scale statistics. In applications, bandwidths usually take the form \(b_{n_i} = n_i^{-r}\) with 0 < r < 1 for i = 0, 1. For any t ∈ [0, 1] define

\[
\hat h_t(x) = t \hat f(x) + (1-t) \hat g(x). \tag{5}
\]

Note that ĥθ is a parametric density function with the only unknown parameter being θ. Furthermore, θ is identifiable from (5), since θ1 ≠ θ2 implies hθ1 ≠ hθ2 (Titterington, Smith & Makov, 1985, Section 3.1). Next we define a kernel density estimator based on the Zi's as follows:

\[
\hat h(x) = \frac{1}{n_2 S_{n_2} b_{n_2}} \sum_{i=1}^{n_2} K_2\!\left(\frac{x - Z_i}{S_{n_2} b_{n_2}}\right), \tag{6}
\]

where again K2 is a smooth density function, the bandwidth bn2 is a positive constant such that bn2 → 0 as n2 → ∞, and Sn2 = Sn2(Z1, . . . , Zn2) is a scale statistic.
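To make (3) concrete, here is a minimal Python sketch of the adaptive kernel estimator, assuming the Epanechnikov kernel and the Rousseeuw–Croux scale statistic that appear later in Remark 2.1; the function names are ours.

import numpy as np

def epanechnikov(u):
    # K(x) = (3/4)(1 - x^2) I(|x| <= 1), as in Remark 2.1
    return 0.75 * (1.0 - u ** 2) * (np.abs(u) <= 1.0)

def rousseeuw_croux_s(data):
    # S_n = 1.1926 med_i( med_j |X_i - X_j| ), the robust scale statistic of Remark 2.1
    data = np.asarray(data)
    return 1.1926 * np.median([np.median(np.abs(x - data)) for x in data])

def adaptive_kde(x_eval, data, r=1.0 / 3.0):
    # the estimator (3): bandwidth b_n = n^(-r) with 0 < r < 1, scaled by S_n
    data = np.asarray(data)
    n = data.size
    b = n ** (-r)
    s = rousseeuw_croux_s(data)
    u = (np.asarray(x_eval)[:, None] - data[None, :]) / (s * b)
    return epanechnikov(u).sum(axis=1) / (n * s * b)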

Let H be the set of all densities with respect to Lebesgue measure on the real line. Following Beran (1977), we define an MHD functional T0 : H → [0, 1] such that

\[
T_0(\phi) = \arg\min_{t \in [0,1]} \left\| h_t^{1/2} - \phi^{1/2} \right\|, \tag{7}
\]

where ‖ · ‖ denotes the L2-norm here and hereafter. When ht is known, the MHD estimator of T0(φ) is defined as T0(φ̂), where φ̂ is a nonparametric density estimator of φ. Since ht is unknown


in our model (1), we propose to replace ht with ĥt, the parameterized convex combination of estimated component densities defined by (5). Then an MHD estimator of T0(φ) is defined as T(φ̂), where

\[
T(\phi) = \arg\min_{t \in [0,1]} \left\| \hat h_t^{1/2} - \phi^{1/2} \right\|. \tag{8}
\]

Since the parameter space [0, 1] is compact, T(φ) is attained. However, T(φ) may be multiple-valued, and so we will use the notation T(φ) to indicate any one of the possible values chosen arbitrarily (cf. Beran, 1977). In our situation, φ = hθ and φ̂ = ĥ. Therefore, our proposed MHD estimator of θ is defined as

\[
\hat\theta = T(\hat h), \tag{9}
\]

where ĥ is given by (6) and n = n0 + n1 + n2 is the total sample size. That is, θ̂ is the minimizer of the Hellinger distance between θf̂ + (1 − θ)ĝ and ĥ, with f̂ and ĝ defined by (3) and (4), respectively. In order to study asymptotic properties of θ̂, we let n → ∞ and assume that ni/n → ρi for some positive constants ρi as n → ∞, i = 0, 1, 2. Note that θ̂ depends on n, but this dependence is suppressed for notational convenience.

2.3. Asymptotic Properties
We now discuss asymptotic properties of the proposed MHD estimator θ̂ defined in (9). First, we give some results on the existence, consistency and asymptotic uniqueness of the proposed estimator. The next theorem, which gives conditions for the existence of θ̂ and the continuity of functionals, is analogous to Theorem 1 of Beran (1977).

Theorem 2.1. Suppose that T0 and T are defined by (7) and (8), respectively. Then,

(i) For every φ ∈ H, there exists T(φ) ∈ [0, 1] satisfying (8).
(ii) If T0(φ) is unique, then T(φn) → T0(φ) for any sequences {φn}n∈IN and {ĥt}n∈IN such that ‖φn^{1/2} − φ^{1/2}‖ → 0 and sup_{t∈[0,1]} ‖ĥt^{1/2} − ht^{1/2}‖ → 0 as n → ∞.
(iii) T0(ht) = t uniquely for every t ∈ [0, 1], where ht(x) = tf(x) + (1 − t)g(x).

The following assumptions are made in the next three theorems:

C1. The kernels K0, K1, and K2 in (3), (4), and (6), respectively, are absolutely continuous and have compact supports, and the first derivatives K0^(1), K1^(1), and K2^(1) are bounded.
C2. f and g are uniformly continuous on their support.
C3. The positive constants bn0, bn1, bn2 in (3), (4), and (6), respectively, satisfy bni → 0 and ni^{1/2} bni → ∞ as ni → ∞, i = 0, 1, 2.
C4. Sni → Si in probability as ni → ∞, i = 0, 1, 2.
C5. The sequences of densities {ĥ}n≥1 and {ĥt}n≥1 converge to hθ and ht, respectively, in the sense that ‖ĥ^{1/2} − hθ^{1/2}‖ → 0 and sup_{t∈[0,1]} ‖ĥt^{1/2} − ht^{1/2}‖ → 0 as n → ∞, where θ ∈ (0, 1) and ht = tf + (1 − t)g, with f̂ and ĝ converging to f and g uniformly.
C6. f and g have the same compact support, say W, on which ht(x) ≥ C for some C > 0 and for any t ∈ [0, 1]; and f, g, f̂, ĝ, and ĥ are continuous.
C7. K0, K1 and K2 are symmetric about zero and have compact supports, and the second derivatives K0^(2), K1^(2), and K2^(2) exist and are bounded.
C8. Sθ = (∂/∂θ) hθ^{1/2} has compact support W, on which it is continuous, where hθ is given by (2).


C9. f, g > 0 on W, and the second derivatives f^(2) and g^(2) exist and are bounded.
C10. There exist positive finite constants S0, S1, and S2 depending on f and g such that ni^{1/2}(Sni − Si) are bounded in probability as n → ∞, i = 0, 1, 2.

Consistency of the MHD estimator θ̂ of θ follows from the continuity of functionals in the Hellinger topology. This result is given next.

Theorem 2.2. Suppose that ni/n → ρi for some positive constants ρi as n → ∞, i = 0, 1, 2. Further suppose that assumptions C1 to C4 hold with ĥt, ĥ, and θ̂ given by (5), (6), and (9), respectively. Then C5 holds and θ̂ → θ in probability as n → ∞.

The next theorem gives an expression for the difference θ̂ − θ, which is very useful in obtaining the asymptotic distribution of the proposed MHD estimator.

Theorem 2.3. Suppose that the densities ĥt defined in (5) and ĥ in (6) satisfy assumptions C5 and C6. Define

\[
T(\{h_t\}_{t\in[0,1]}, \phi) = \arg\min_{t\in[0,1]} \left\| h_t^{1/2} - \phi^{1/2} \right\|,
\]

and assume that the functional T is continuous at ({ht}t∈[0,1], hθ) in the sense of Theorem 2.1(ii). Then, it follows that

\[
\begin{aligned}
\hat\theta - \theta
&= T\bigl(\{\hat h_t\}_{t\in[0,1]}, \hat h\bigr) - T\bigl(\{h_t\}_{t\in[0,1]}, h_\theta\bigr) \\
&= \left\{ \left[ \int \frac{(f-g)^2}{2\,(\theta f + (1-\theta)g)^{3/2}}\, h_\theta^{1/2}\, dx \right]^{-1} + \gamma_n \right\}
\left\{ \int \frac{f-g}{(\theta f + (1-\theta)g)^{1/2}} \bigl(\hat h^{1/2} - h_\theta^{1/2}\bigr)\, dx \right. \\
&\qquad + \int \frac{\tfrac{\theta}{2}(f-g) + g}{(\theta f + (1-\theta)g)^{3/2}}\, (\hat f - f)\, \hat h^{1/2}\, dx
- \int \frac{\tfrac{1}{2}(1+\theta)(f-g) + g}{(\theta f + (1-\theta)g)^{3/2}}\, (\hat g - g)\, \hat h^{1/2}\, dx \\
&\qquad \left. + \alpha_n \int (\hat f - f)^2\, dx + \beta_n \int (\hat g - g)^2\, dx \right\},
\end{aligned}
\]

where {αn}, {βn}, and {γn} are bounded sequences of real numbers and γn → 0 as n → ∞.

Under further conditions on the parametric family hθ defined by (2) and on the kernels, the next theorem shows that the MHD estimator θ̂ is asymptotically normally distributed about θ = T0(hθ).

Theorem 2.4. Suppose that ni/n → ρi for some positive constants ρi as n → ∞, i = 0, 1, 2. Suppose that ĥt, ĥ and θ̂ given by (5), (6), and (9), respectively, satisfy assumptions C7 to C10. Then the limiting distribution of n^{1/2}(θ̂ − θ) is N(0, σ²), where σ² is defined as

\[
\left\{ \mathrm{Var}\!\left[ \frac{\partial \log h_\theta(Z_1)}{\partial \theta} \right] \right\}^{-2}
\left\{ \frac{\theta^2}{\rho_0}\, \mathrm{Var}\!\left[ \frac{\partial \log h_\theta(X_1)}{\partial \theta} \right]
+ \frac{(1-\theta)^2}{\rho_1}\, \mathrm{Var}\!\left[ \frac{\partial \log h_\theta(Y_1)}{\partial \theta} \right]
+ \frac{1}{\rho_2}\, \mathrm{Var}\!\left[ \frac{\partial \log h_\theta(Z_1)}{\partial \theta} \right] \right\}.
\]

Remark 2.1. The regularity conditions assumed in Theorems 2.1–2.4 are typical in the MHD estimation context; see, for example, Beran (1977) and Cordero-Brana (1994). For the asymptotic normality of θ̂ established in Theorem 2.4, the compact support assumption on the densities f and g has been made. However, this assumption is not required for the consistency of θ̂; see Theorem 2.2. The compactly supported Epanechnikov kernel function given by K(x) = (3/4)(1 − x²)I(|x| ≤ 1) satisfies condition C1. The bandwidths bn0 = n0^{−1/3}, bn1 = n1^{−1/3}, and bn2 = n2^{−1/3} satisfy condition C3. Many density functions satisfy condition C2, including the location-scale family and, in particular, a normal distribution with mean µ and variance σ². For the scale statistics Sni, i = 0, 1, 2 in (3), (4), and (6), one can use the robust scale estimator proposed by Rousseeuw & Croux (1993), Sn = 1.1926 med_i(med_j(|Xi − Xj|)). The preceding estimator satisfies condition C4.

Remark 2.2. In order to prove the asymptotic normality of θ̂ for the infinite support case of f and g, we must employ a different technique. Note that the asymptotic normality of an estimator is related to the differentiability of the functional T0 defined at (7). A way to extend Theorem 2.4 to the infinite support case is to concentrate on the Hadamard (or compact) differentiability of the functional T0; see Fernholz (1983). It is known that Hadamard differentiability will yield asymptotic normality. Furthermore, the norm chosen on the domain of the functional is a crucial factor for the differentiability and, moreover, it is desirable to have a topology which leads to "robustness" according to Hampel (1971). The weak topology, the uniform topology and the topology induced by the Hellinger metric are all "robust." Thus, in order to achieve our goal we may need to set up the Hadamard differentiability of the functional T0 under the Hellinger norm. This will be investigated in a separate article.

We now study asymptotic efficiency properties of the proposed MHD estimator θ̂. Asymptotic efficiency properties of the MHD and maximum likelihood estimators are well known in parametric models (Beran, 1977; Lindsay, 1994). However, such properties in nonparametric or semiparametric settings are less studied. Hall & Titterington (1984) have derived a Cramer-Rao type lower bound for nonparametric estimators of the mixture proportions and thereby characterized asymptotically optimal procedures for the case of sampling model M2 of Hosmer (1973). [Our model (1) is somewhat close to model M1 of Hosmer (1973).] Furthermore, they have constructed a sequence of maximum likelihood estimators that attain the above-mentioned lower bound and are therefore asymptotically optimal in this sense. Following the ideas of Hall & Titterington (1984), we also obtain a Cramer-Rao type lower bound for nonparametric estimators of the mixture proportion θ. This result is given next.

Theorem 2.5. Let θn denote a nonparametric estimator of θ such that n^{1/2}(θn − θ) converges in distribution to N(0, V(θ, f, g, hθ)) and n Var(θn − θ) → V(θ, f, g, hθ) as n → ∞, where f, g, and hθ denote the densities of the distributions F, G, and θF + (1 − θ)G, respectively, of (1). Suppose that ni/n − ρi → 0, i = 0, 1, 2, and that ρ0/ρ1 = θ/(1 − θ). If V(θ, fn, gn, hn) → V(θ, f, g, hθ) whenever fn → f, gn → g and hn → hθ in the class of uniformly piecewise continuous densities, then

\[
V(\theta, f, g, h_\theta) \ \ge\ \Lambda(\theta, f, g, h_\theta)
\]

for any f ≠ g and θ ∈ (0, 1), where

\[
\Lambda(\theta, f, g, h_\theta) = \frac{1}{\Lambda_2^2}
\left[ \frac{(1-\theta)^4}{\rho_0}\,\Lambda_0 + \frac{(1-\theta)^4}{\rho_1}\,\Lambda_1 + \frac{(1-\theta)^2}{\rho_2}\,\Lambda_2 \right] \tag{10}
\]

with

\[
\Lambda_0 = \int \frac{g^2}{h_\theta^2}\, f\, dx - \left( \int \frac{g}{h_\theta}\, f\, dx \right)^2, \qquad
\Lambda_1 = \int \frac{f^2}{h_\theta^2}\, g\, dx - \left( \int \frac{f}{h_\theta}\, g\, dx \right)^2, \qquad
\Lambda_2 = \int \frac{f^2}{h_\theta}\, dx - 1.
\]

The next result shows that the proposed MHD estimator (9) attains the above lower bound under certain regularity conditions, establishing an asymptotic efficiency property of the proposed MHD estimator.

Theorem 2.6. Assume that the conditions of Theorem 2.5 hold. Then the asymptotic variance of the MHD estimator θ̂ of (9) is equal to Λ(θ, f, g, hθ), where Λ(θ, f, g, hθ) is given in (10). In this sense, θ̂ is asymptotically efficient.


Remark 2.3. The lower bound obtained in Theorem 2.5 is established under the assumption that ρ0/ρ1 = θ/(1 − θ) holds. For other cases, the method employed in Wu (2007) does not work very well, and one may need to seek a different way to find a lower bound for the asymptotic variance. Indeed, for the case ρ0/ρ1 ≠ θ/(1 − θ), the full efficiency of the MHD estimator θ̂ of (9) is unknown and needs further research. Nevertheless, we have shown that θ̂ is n^{1/2}-consistent, that is, n^{1/2}(θ̂ − θ) = OP(1), which demonstrates that θ̂ has good efficiency properties whether ρ0/ρ1 = θ/(1 − θ) holds or not.

Remark 2.4. As in Hall & Titterington (1984), treating the data as multinomial with categories determined by which one of L regions the observations fall into, Karunamuni & Wu (2009) also exhibited a maximum likelihood estimator (MLE) θ̂L of the mixture proportion θ. They showed that their MLE θ̂L is consistent and asymptotically normal under certain regularity conditions. Furthermore, they showed that the limiting variance of θ̂L attains the lower bound in Theorem 2.5 above as the number of regions L goes to infinity. In this sense, θ̂L is nearly efficient for large values of L.

3. MHD ESTIMATION IN A TWO-SAMPLE SEMIPARAMETRIC MODEL

3.1. Two-Sample Semiparametric Model
Consider the following two-sample semiparametric model. Let X1, . . . , Xn0 be i.i.d. random variables with density function g. Independently of the Xi's, let Z1, . . . , Zn1 be i.i.d. random variables with density function h. The two unknown density functions g and h are linked by an "exponential tilt" exp[α + r(x)β]. Thus, we have

\[
\begin{aligned}
X_1, \ldots, X_{n_0} &\ \overset{\text{i.i.d.}}{\sim}\ g(x), \\
Z_1, \ldots, Z_{n_1} &\ \overset{\text{i.i.d.}}{\sim}\ g(x) \exp[\alpha + r(x)\beta],
\end{aligned} \tag{11}
\]

where r(x) = (r1(x), . . . , rp(x)) is a 1 × p vector of functions of x, β = (β1, . . . , βp)^T is a p × 1 parameter vector and α is a normalizing parameter that makes g(x) exp[α + r(x)β] integrate to one. In most applications r(x) = x or r(x) = (x, x²). Estimation of the parameters α and β is of interest here.

For r(x) = x, model (11) encompasses many common distributions, including two exponential distributions with different means and two normal distributions with common variance but different means. Furthermore, model (11) with r(x) = x or r(x) = (x, x²) has wide applications in logistic discriminant analysis (Anderson, 1979) and in case–control studies (Prentice & Pyke, 1979). Suppose Y is a binary response variable and X is the associated covariate; then the (prospective) logistic regression model is of the form

\[
P(Y = 1 \mid X = x) = \frac{\exp[\alpha^* + x\beta]}{1 + \exp[\alpha^* + x\beta]}, \tag{12}
\]

where α* and β are parameters and the marginal distribution of X is not specified. In case–control studies, data are collected retrospectively in the sense that for samples of subjects having Y = 1 ("case") and having Y = 0 ("control"), the value x of X is observed. More specifically, suppose X1, . . . , Xn0 is a random sample from F(x|Y = 0) and, independently of the Xi's, suppose Z1, . . . , Zn1 is a random sample from F(x|Y = 1). If π = P(Y = 1) = 1 − P(Y = 0) and f(x|Y = i) is the conditional density of X given Y = i, i = 0, 1, then it follows from (12) and Bayes' rule that model (11) is satisfied with g(x) = f(x|Y = 0) and h(x) = f(x|Y = 1), α = α* + log[(1 − π)/π] and r(x) = x. Model (11) with r(x) = (x, x²) also coincides with the exponential family of densities considered in Efron & Tibshirani (1996) in the case of two-sample problems. Moreover, model (11) can also be viewed as a biased sampling model with weight function exp[α + r(x)β] depending on the unknown parameters α and β.
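For the reader's convenience, the Bayes-rule step behind the preceding identification is (our addition):

\[
\frac{h(x)}{g(x)} = \frac{f(x \mid Y=1)}{f(x \mid Y=0)}
= \frac{P(Y=1 \mid X=x)}{P(Y=0 \mid X=x)} \cdot \frac{1-\pi}{\pi}
= \exp[\alpha^* + x\beta]\, \frac{1-\pi}{\pi},
\]

which has the form exp[α + xβ] of (11) with α = α* + log[(1 − π)/π].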

Vardi (1985), Gill, Vardi & Wellner (1988), and Qin (1993) discussed estimating distribution functions in biased sampling models with known weight functions. Gilbert, Self & Ashby (1998) have employed model (11) with r(x) = (x, x²) to analyze HIV vaccine trial data for assessing differential vaccine protection against human immunodeficiency virus types. Qin & Zhang (1997) considered a goodness-of-fit test for the logistic regression model (12) based on case–control data by employing the maximum semiparametric likelihood estimator of the distribution function of g to test the validity of model (11) with r(x) = x. Zhang (2000) estimated quartiles of the distribution function of g under model (11). Here we are interested in estimation of the parameters α and β when g(x) is unknown. Since the form of g(x) is not specified, statistical inference based on model (11) with unknown g would be more robust than inference based on a full parametric model in which the form of g(x) is known. Here we propose MHD estimation of the parameters α and β in model (11) when g is unknown.

3.2. MHD Estimators of Parameters α and β

Define θ = (α, β^T)^T, where α and β are defined in (11). Then the model (11) can be written as

\[
\begin{aligned}
X_1, \ldots, X_{n_0} &\ \overset{\text{i.i.d.}}{\sim}\ g(x), \\
Z_1, \ldots, Z_{n_1} &\ \overset{\text{i.i.d.}}{\sim}\ h_\theta(x),
\end{aligned} \tag{13}
\]

where hθ(x) = g(x) exp[(1, r(x))θ], r(x) = (r1(x), . . . , rp(x)) is a 1 × p vector of continuous functions of x on IR¹, β = (β1, . . . , βp)^T is a p × 1 parameter vector and α is a normalizing parameter that makes hθ(x) integrate to one. We assume that θ ∈ Θ and Θ is a compact subset of IRp+1.

We first define the following kernel density estimators of g and hθ, respectively, based on the data X1, . . . , Xn0 and Z1, . . . , Zn1 of (13):

\[
\hat g_{n_0}(x) = \frac{1}{n_0 b_{n_0}} \sum_{i=1}^{n_0} K_0\!\left(\frac{x - X_i}{b_{n_0}}\right), \tag{14}
\]

\[
\hat h_{n_1}(x) = \frac{1}{n_1 b_{n_1}} \sum_{j=1}^{n_1} K_1\!\left(\frac{x - Z_j}{b_{n_1}}\right), \tag{15}
\]

where K0 and K1 are symmetric density functions, and bn0 and bn1 are positive constants such that bni → 0 as ni → ∞, i = 0, 1. For simplicity, here we are using non-adaptive kernel density estimators. However, the results can easily be extended to adaptive kernel density estimators under some additional conditions.

Let H be the set of all densities with respect to Lebesgue measure on the real line. For φ ∈ H, define the MHD functional T0(φ) as

\[
T_0(\phi) = T(\{h_\theta\}_{\theta\in\Theta}, \phi) = \arg\min_{\theta \in \Theta} \left\| h_\theta^{1/2} - \phi^{1/2} \right\|. \tag{16}
\]

If the family {hθ}θ∈Θ is identifiable, then the functional T0 is Fisher consistent, that is, T0(hθ) = θ for any θ ∈ Θ. Since ĥn1 defined by (15) is an estimator of hθ, an MHD estimator of θ would be T0(ĥn1). However, this estimator is not available in application, since g, and hence hθ in (16), are unknown. Naturally, one can use the estimator ĝn0 of g and then apply the plug-in rule to construct a parametric model; that is, one can replace hθ by

\[
\hat h_\theta(x) = \exp[(1, r(x))\,\theta]\, \hat g_{n_0}(x). \tag{17}
\]


Note that ĥθ is a parametric density function with the unknown parameter being θ. Now our proposed MHD estimator of θ is defined as

\[
\hat\theta = T(\hat h_{n_1}) = T\bigl(\{\hat h_\theta\}_{\theta\in\Theta}, \hat h_{n_1}\bigr)
= \arg\min_{\theta \in \Theta} \left\| \hat h_\theta^{1/2} - \hat h_{n_1}^{1/2} \right\|, \tag{18}
\]

where ĥn1 and ĥθ are given by (15) and (17), respectively. That is, θ̂ is the minimizer of the Hellinger distance between the parametric density ĥθ and ĥn1, a nonparametric estimator of the density of the Zi's. This approach is in line with Beran's (1977) original mechanism for obtaining MHD estimators. Thus, we would expect θ̂ to have good robustness and asymptotic efficiency properties. Since T(ĥn1) may be multiple-valued, we will use the notation T(ĥn1) to indicate any one of the possible values chosen arbitrarily. Note that θ̂ depends on n (n = n0 + n1 is the total sample size), but this dependence is suppressed for notational convenience. We are interested in the asymptotic properties of θ̂ as n → ∞.
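A minimal numerical sketch of (18) for r(x) = x, in the same spirit as the sketch in Section 2.2 (again ours, with Gaussian kernels standing in for K0 and K1, and an unconstrained simplex search standing in for minimization over a compact Θ):

import numpy as np
from scipy.optimize import minimize
from scipy.stats import gaussian_kde

def mhd_exponential_tilt(x, z, grid_size=2000):
    """Estimate theta = (alpha, beta) in h(u) = g(u) * exp(alpha + beta * u)."""
    g_hat, h_hat = gaussian_kde(x), gaussian_kde(z)    # stand-ins for (14) and (15)
    lo = min(x.min(), z.min()) - 3.0
    hi = max(x.max(), z.max()) + 3.0
    u, du = np.linspace(lo, hi, grid_size, retstep=True)
    g_u, h_u = g_hat(u), h_hat(u)

    def hellinger_sq(theta):
        alpha, beta = theta
        h_theta_u = np.exp(alpha + beta * u) * g_u     # the plug-in family (17)
        return np.sum((np.sqrt(h_theta_u) - np.sqrt(h_u)) ** 2) * du

    return minimize(hellinger_sq, x0=np.array([0.0, 0.0]), method="Nelder-Mead").x

# usage: g = N(0,1), h = N(1,1), so (alpha, beta) = (-0.5, 1) as in Section 4.2
rng = np.random.default_rng(1)
print(mhd_exponential_tilt(rng.normal(0, 1, 30), rng.normal(1, 1, 20)))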

3.3. Asymptotic Properties
In this section, we give some properties of the proposed MHD estimator θ̂ defined by (18). In particular, we state theoretical results related to the existence, uniqueness, and asymptotic consistency of θ̂. Asymptotic normality and robustness properties of θ̂ are established in Wu (2007). First, we state a few conditions required for the theorems below.

(D1) There exists an ε-neighborhood B(θ, ε) of θ for some ε > 0 such that ht − hθ is bounded by an integrable function for any t ∈ B(θ, ε).
(D2) g and K0 in (13) and (14), respectively, have compact supports.
(D3) sup_{θ∈Θ} sup_x (1, r(x))θ < ∞.

(D4) g in (13) has infinite support, K0 in (14) is a bounded symmetric density with support [−a0, a0], 0 < a0 < ∞, and there exists a sequence {αn} of positive numbers such that αn → ∞ as n → ∞, and

\[
\sup_{\theta\in\Theta} \int I_{\{|x|>\alpha_n\}}\, h_\theta(x)\, dx \to 0, \tag{19}
\]

\[
b_n^2 \sup_{\theta\in\Theta} \int I_{\{|x|>\alpha_n\}}\, h_\theta(x) \sup_{|t|\le a_0} \frac{|g^{(2)}(x + t b_n)|}{g(x)}\, dx \to 0, \tag{20}
\]

\[
n^{-1} b_n^{-1} \sup_{\theta\in\Theta} \int I_{\{|x|\le\alpha_n\}}\, h_\theta(x) \sup_{|t|\le a_0} \frac{g(x + t b_n)}{g^2(x)}\, dx \to 0, \tag{21}
\]

\[
b_n^4 \sup_{\theta\in\Theta} \int I_{\{|x|\le\alpha_n\}}\, h_\theta(x) \sup_{|t|\le a_0} \left[ \frac{g^{(2)}(x + t b_n)}{g(x)} \right]^2 dx \to 0, \tag{22}
\]

where g^(k) denotes the kth derivative of g and I_A denotes the indicator function of a set A.

Theorem 3.1. Suppose that T0 and T are defined by (16) and (18), respectively, and that (D1) holds for all θ ∈ Θ. Then we have the following results:

(i) For every φ ∈ H, there exists T(φ) ∈ Θ satisfying (18) with ĥθ and ĝn0 defined by (17) and (14), respectively. For every φ ∈ H, there exists T0(φ) ∈ Θ satisfying (16).
(ii) Suppose that n0 → ∞ and n1 → ∞ as n → ∞ and that θ0 = T0(φ) is unique. Then θ̂ = T(φn1) → θ0 as n → ∞ for any density sequences {φn1}n1≥1 and {ĥθ}n0≥1, θ∈Θ such that ‖φn1^{1/2} − φ^{1/2}‖ → 0 and sup_{θ∈Θ} ‖ĥθ^{1/2} − hθ^{1/2}‖ → 0 as n → ∞.
(iii) If {hθ}θ∈Θ is identifiable, then T0(hθ) = θ uniquely for any θ ∈ Θ.


Theorem 3.2. Let n0 → ∞ and n1 → ∞ as n → ∞. Suppose that the components of (1, r(x)) are linearly independent, that (D1) holds for any θ ∈ Θ, and that the bandwidths bn0 and bn1 in (14) and (15), respectively, satisfy bni → 0 and ni bni → ∞ as n → ∞, i = 0, 1. Further, suppose that either (D2), (D3), or (D4) holds. Then ‖ĥn1^{1/2} − hθ^{1/2}‖ → 0 and sup_{θ∈Θ} ‖ĥθ^{1/2} − hθ^{1/2}‖ → 0, both in probability, as n → ∞. Furthermore, we have θ̂ → θ in probability as n → ∞, where θ̂ is defined by (18) with ĝn0, ĥn1 and ĥθ given by (14), (15), and (17), respectively.

The proofs of Theorems 3.1 and 3.2 are given in the Appendix.

Remark 3.1. Condition (D1) holds for many families, including normal distributions. Suppose that g(x) and h(x) denote the density functions of the normal distributions N(0, 1) and N(µ, 1), respectively. It is easy to see that h(x) = hθ(x) = exp[(1, r(x))θ]g(x), where r(x) = x and θ = (α, β) = (−µ²/2, µ).
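Indeed (our one-line check),

\[
\frac{h(x)}{g(x)} = \frac{\exp\{-(x-\mu)^2/2\}}{\exp\{-x^2/2\}}
= \exp\!\left(-\frac{\mu^2}{2} + \mu x\right) = \exp[\alpha + \beta x]
\quad\text{with } (\alpha, \beta) = \left(-\frac{\mu^2}{2},\, \mu\right).
\]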

Remark 3.2. If (1, r(x)) are linearly independent, then {hθ}θ∈Θ is identifiable. To see this clearly, note that if hθ1 = hθ2, then (1, r(x))(θ1 − θ2) = 0 for all x, and hence θ1 = θ2 when (1, r(x)) are linearly independent. Therefore, {hθ}θ∈Θ is identifiable for any continuous density function g.

Remark 3.3. Condition (D3) is satisfied when g and hθ are two normal density functions with different standard deviations. Assume that g(x) and h(x) denote the density functions of N(0, 1) and N(µ, σ²), respectively, where σ < 1. It is easy to see that h(x) = hθ(x) = exp[(1, r(x))θ]g(x), where r1(x) = x, r2(x) = x² and θ = (θ0, θ1, θ2) = (−µ²/(2σ²) − log σ, µ/σ², 1/2 − 1/(2σ²)). If the parameter space Θ is such that its projection onto the third coordinate lies to the left of zero, then condition (D3) obviously holds.

Remark 3.4. Condition (D4) holds for many families; one such example is stated in Remark 3.1, that is, g and h are two normal density functions with the same standard deviation. Without loss of generality, suppose the compact parameter space is Θ = [α1, α2] × [β1, β2] for some finite numbers α1 ≤ α2 and β1 ≤ β2. Then it is easy to show that (19)–(22) hold with αn of order log n and any bandwidth bn0 such that bn0 → 0 and n0 bn0 → ∞ as n0 → ∞.

4. SIMULATION STUDIES

In this section, we exhibit some small sample properties of the MHD estimators defined in Sections 2 and 3 using a simulation study.

4.1. Two-Component Mixture Model
Here we report the results of a simulation study for the two-component mixture model considered in Section 2. We considered a mixture of two normal distributions in this numerical study. Specifically, we studied the following four mixture models:

\[
\begin{aligned}
\text{Model I:} \quad & H_\theta = 0.25\,N(0, 1) + 0.75\,N(4.46, 2), \\
\text{Model II:} \quad & H_\theta = 0.25\,N(0, 1) + 0.75\,N(2.96, 2), \\
\text{Model III:} \quad & H_\theta = 0.5\,N(0, 1) + 0.5\,N(4.52, 2), \\
\text{Model IV:} \quad & H_\theta = 0.5\,N(0, 1) + 0.5\,N(3.07, 2),
\end{aligned} \tag{23}
\]

where N(µ, σ²) denotes a normal distribution with mean µ and variance σ². That is, we set the distributions F and G of (1) as N(0, 1) and N(µ, 2), respectively, with µ ≠ 0 as in Models I to IV above. Note that Models I and III have an overlap of 0.03, whereas Models II and IV have an overlap of 0.1. Here the overlap is defined as the probability of misclassification using the following rule: classify an observation x as being from population F if x < xc and from population G if x ≥ xc, where xc is the unique point between 0 and µ such that θf(xc) = (1 − θ)g(xc).
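These overlap values can be reproduced numerically; the following sketch (ours) solves for xc by root finding and evaluates the misclassification probability with standard scipy routines.

import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def overlap(theta, mu, var_g=2.0):
    f = norm(0.0, 1.0)                      # F = N(0, 1)
    g = norm(mu, np.sqrt(var_g))            # G = N(mu, 2)
    # x_c in (0, mu) solves theta * f(x_c) = (1 - theta) * g(x_c)
    xc = brentq(lambda x: theta * f.pdf(x) - (1.0 - theta) * g.pdf(x), 0.0, mu)
    # misclassification probability under the rule: F if x < x_c, G if x >= x_c
    return theta * f.sf(xc) + (1.0 - theta) * g.cdf(xc)

print(overlap(0.25, 4.46))  # Model I: approximately 0.03
print(overlap(0.25, 2.96))  # Model II: approximately 0.1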

We used the compactly supported Epanechnikov kernel function

\[
K(x) = \frac{3}{4}\,(1 - x^2)\, I(|x| \le 1)
\]

for all three kernels K0, K1, and K2 in (3), (4), and (6), respectively. The bandwidths bn0, bn1, and bn2 in (3), (4), and (6), respectively, were taken to be bn0 = n0^{−1/3}, bn1 = n1^{−1/3}, and bn2 = n2^{−1/3}. This selection satisfies the bandwidth assumptions in the theorems of Section 2.3. For the scale statistics Sn0, Sn1, and Sn2 in (3), (4), and (6), respectively, we used the following robust scale estimator proposed by Rousseeuw & Croux (1993):

\[
S_n = 1.1926\ \mathrm{med}_i\bigl(\mathrm{med}_j\bigl(|X_i - X_j|\bigr)\bigr).
\]

These choices of the kernel function, bandwidths and scale estimators satisfy conditions C1, C3, and C4. Thus C5 is satisfied by Theorem 2.2.

We compared our MHD estimator with two maximum likelihood estimators. The ML estimators are based on the following two likelihood functions combined with the data (Z1, . . . , Zn2):

\[
L(\theta) = \prod_{i=1}^{n_2} \bigl[\theta f(Z_i) + (1-\theta) g(Z_i)\bigr]
\quad\text{and}\quad
\tilde L(\theta) = \prod_{i=1}^{n_2} \bigl[\theta \hat f(Z_i) + (1-\theta) \hat g(Z_i)\bigr],
\]

where f̂ and ĝ are the kernel density estimators of f and g defined by (3) and (4), respectively, with f and g as in model (2). In other words, the likelihood L is constructed assuming that the density functions f and g are completely known, whereas L̃ is obtained by replacing f and g by their estimators. Thus, L and L̃ are rather naturally constructed for simulation purposes. We define

\[
\hat\theta_{\rm MLE} = \arg\max_{\theta \in [0,1]} L(\theta) \tag{24}
\]

and

\[
\tilde\theta_{\rm MLE} = \arg\max_{\theta \in [0,1]} \tilde L(\theta) \tag{25}
\]

as the ML estimators of θ based on L and L̃, respectively.

In our simulation, the data were generated from the models defined in (23). For each model,

500 samples with sizes n0, n1, and n2 were obtained from the corresponding distributions. For instance, for Model I, samples of sizes n0 and n1 were obtained from the distributions N(0, 1) and N(4.46, 2), respectively, while a sample of size n2 was obtained from the mixture distribution 0.25N(0, 1) + 0.75N(4.46, 2). For each model considered, we obtained estimates of the bias and mean squared error (MSE) as follows:

\[
\mathrm{Bias} = \frac{1}{N_s} \sum_{i=1}^{N_s} (\hat\mu_i - \mu)
\quad\text{and}\quad
\mathrm{MSE} = \frac{1}{N_s} \sum_{i=1}^{N_s} (\hat\mu_i - \mu)^2,
\]

where Ns is the number of replications (Ns = 500 in our case), and µ̂i denotes the estimate of µ for the ith replication. Here µ = θ and µ̂ denotes either the proposed MHD estimator θ̂, θ̂MLE or θ̃MLE. We have chosen several combinations for the sample sizes (n0, n1, n2); namely, (10, 10, 30), (20, 20, 60), and (50, 50, 150), and have computed the bias and MSE in each case. Our findings are summarized in Table 1.

Table 1: Estimates of the biases and mean squared errors of θ̂, θ̂MLE, and θ̃MLE defined in (9), (24), and (25), respectively.

(n0, n1, n2)   Model   Bias(θ̂)    MSE(θ̂)   Bias(θ̂MLE)   MSE(θ̂MLE)   Bias(θ̃MLE)   MSE(θ̃MLE)
(10,10,30)     I       −0.0065    0.0124    −0.0015       0.0069       −0.0091       0.0127
               II       0.0213    0.0224    −0.0041       0.0107       −0.0348       0.0224
               III     −0.0027    0.0166     0.0004       0.0098       −0.0290       0.0201
               IV      −0.0207    0.0251    −0.0038       0.0112       −0.1370       0.0535
(20,20,60)     I       −0.0032    0.0052    −0.0042       0.0032       −0.0138       0.0054
               II       0.0036    0.0100    −0.0072       0.0050       −0.0887       0.0172
               III      0.0002    0.0057     0.0016       0.0041       −0.0368       0.0098
               IV      −0.0052    0.0104     0.0024       0.0061       −0.1843       0.0548
(50,50,150)    I        0.0015    0.0016     0.0015       0.0013       −0.0319       0.0034
               II       0.0047    0.0039    −0.0022       0.0022       −0.1334       0.0217
               III     −0.0018    0.0023    −0.0006       0.0019       −0.0608       0.0084
               IV       0.0001    0.0037     0.0004       0.0022       −0.2473       0.0721
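The Monte Carlo summaries above can be organized as in the following sketch (ours); `estimator` is any function mapping the three samples to an estimate of θ, for example the MHD sketch given after (9).

import numpy as np

def mc_bias_mse(estimator, theta, mu, sizes=(10, 10, 30), n_rep=500, seed=0):
    # replicate model (1) with F = N(0,1), G = N(mu, 2) and summarize bias and MSE
    rng = np.random.default_rng(seed)
    n0, n1, n2 = sizes
    est = np.empty(n_rep)
    for i in range(n_rep):
        x = rng.normal(0.0, 1.0, n0)
        y = rng.normal(mu, np.sqrt(2.0), n1)
        from_f = rng.random(n2) < theta
        z = np.where(from_f, rng.normal(0.0, 1.0, n2), rng.normal(mu, np.sqrt(2.0), n2))
        est[i] = estimator(x, y, z)
    return est.mean() - theta, ((est - theta) ** 2).mean()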

We observed that the MHD estimator θ̂ performed better than the MLE θ̃MLE (i.e., smaller MSE values) for all four models considered in each combination of the sample sizes. On the other hand, the MLE θ̂MLE, which is based on assuming f and g are known, showed the best performance among the three estimators compared in terms of both minimum estimated MSE and bias in most cases. However, this behaviour is to be expected here, since θ̂MLE employs more information (i.e., knowing f and g, or in other words n0 = ∞ and n1 = ∞) than either θ̃MLE or θ̂. Note that θ̂MLE is not available in practice, and the sole purpose of analyzing it here is to examine the amount of loss in performance when f and g are assumed unknown. The bias and MSE of θ̂ were less affected by the preceding fact compared to those of θ̃MLE. Note that θ̂MLE uses only the mixture sample of size n2, whereas θ̂ and θ̃MLE are based on all three samples of sizes n0, n1, and n2. (Data from f and g are not required for θ̂MLE since it is based on the fact that f and g are known.) Thus, one might argue that a direct comparison between θ̂ and θ̂MLE may not be a fair comparison. In Figure 1, we have also given the normal probability plots of the proposed MHD estimator θ̂ based on the 500 replications for all four models with n0 = 50, n1 = 50, and n2 = 150. Figure 1 demonstrates that the sampling distribution of θ̂ closely approximates a normal curve for each model.

Figure 1: Normal probability plots of the MHD estimator θ̂ for sample sizes n0 = n1 = 30 and n2 = 100, with (•) Models I and III and (◦) Models II and IV.

4.2. Two-Sample Semiparametric Model
In this section, we report the results of a simulation study for the two-sample semiparametric model considered in Section 3. We will demonstrate that the proposed MHD estimator θ̂ defined in (18) has good small sample properties.


Table 2: Estimates of the biases and mean squared errors of θ̂ = (α̂, β̂) defined in (18) and of Zhang's (2000) estimator θ̃ = (α̃, β̃), when g and h are the densities of N(0, 1) and N(1, 1), respectively.

(n0, n1)   Bias(α̂)   MSE(α̂)   Bias(β̂)   MSE(β̂)   Bias(α̃)   MSE(α̃)   Bias(β̃)   MSE(β̃)
(10,40)     0.203      0.106    −0.030      0.320      0.085      0.069      0.125      0.329
(20,30)     0.131      0.074    −0.073      0.199     −0.005      0.058      0.088      0.184
(30,20)     0.099      0.092    −0.053      0.195     −0.038      0.087      0.106      0.204
(40,10)     0.041      0.150    −0.045      0.266     −0.078      0.175      0.098      0.271

In our simulation, we let µ = 1 and therefore θ = (α, β) = (−0.5, 1). For each pair (n0, n1), we generated 500 independent sets of combined random samples of size n = n0 + n1 = 50 from the N(0, 1) and N(1, 1) distributions. Our choices of the pair (n0, n1) were (10, 40), (20, 30), (30, 20), and (40, 10), so that n = 50 in each case. The bandwidths bn0 and bn1 in (14) and (15), respectively, were taken to be bn0 = n0^{−2/5} and bn1 = n1^{−2/5}. We again used the Epanechnikov kernel function for both K0 and K1. For the 500 simulated replications, we compared the performance of the proposed MHD estimator θ̂ defined in (18) with Zhang's (2000) maximum semiparametric likelihood estimator θ̃ = (α̃, β̃), by examining their respective biases and MSE values. Our simulation results are summarized in Table 2. From the values in Table 2, we can see that α̃ is better than α̂ when the estimated biases and MSE values are compared, except for the case of (40, 10). However, β̂ has smaller estimated bias than β̃ uniformly for all pairs of (n0, n1) values considered, while the estimated MSE values of β̂ and β̃ are comparable overall. We believe that β plays a more important role than α in most applications, since α is just a normalizing parameter that makes g(x) exp[α + r(x)β] integrate to one; see Equation (11). For instance, in the Cox model, the value exp(β) can be interpreted as the ratio of the hazards of two individuals whose covariates are Z = 1 and Z = 0, respectively, but who are otherwise identical. In Figure 2, we have also given the normal probability plots of the proposed MHD estimators α̂ and β̂ based on 500 replications for (n0, n1) equal to (10, 40) and (30, 20). The normal probability plots for other (n0, n1) values are similar. Figure 2 demonstrates that the sampling distributions of α̂ and β̂ are approximately normal.

Figure 2: Normal probability plots of the MHD estimator θ̂ = (α̂, β̂), with (•) α̂ and (◦) β̂.

5. CONCLUDING REMARKS

In this article, we have attempted to show that the MHD approach originally developed for parametric models can be applied successfully to certain semiparametric models as well. The proposed MHD estimators in these semiparametric models have been shown to be asymptotically consistent and asymptotically normally distributed with high efficiency properties. The proposed estimators also have excellent robustness properties; see Wu (2007) for more details. The efficiency of MHD estimators has been attributed to their heuristic relationship to MLEs, while the equivalence of the Hellinger metric to the L1-norm is a probable reason for the excellent robustness properties of MHD estimators. The models studied in Sections 2 and 3 are in fact special cases of the more general semiparametric models of the form {fθ,η : θ ∈ Θ ⊆ IRp, η ∈ H}, where Θ is a compact subset of IRp and H is an arbitrary set, typically of infinite dimension. The statistical problem is estimation of the parameter θ, treating η as a nuisance parameter. The MHD estimation problem in the preceding general model has also been investigated in Wu (2007).

Extensive simulation studies have been carried out to examine the small-sample properties of the proposed estimators with respect to empirical robustness and efficiency. However, only a limited simulation study is reported here in order to save space. Interested readers are referred to the thesis of Wu


(2007) for more details on the simulation studies and some real data analyses. Furthermore, most of the theoretical results on robustness of the proposed estimators, and their proofs, are also not included here, again to limit the length of the article. As many authors have pointed out, the robustness of an estimator should ideally be studied by considering what happens to the distribution of the estimator as the distribution of the data is varied. A thorough technical discussion of the robustness properties of the MHD estimators in general semiparametric models {fθ,η : θ ∈ Θ ⊆ IRp, η ∈ H} is given in Wu (2007) using Hellinger neighbourhoods of the assumed semiparametric models.


Wu (2007) has examined most of the issues related to MHD estimation in semiparametric models. However, there are still a number of issues that must be addressed for completeness, such as the computational difficulties inherent in the MHD approach. Nevertheless, we hope that the success of the MHD approach in the models considered in this article and in Wu (2007) will provide some motivation to pursue its further application in a greater variety of models and in many areas. In particular, the development of optimal robust tests of hypotheses and of variance estimation for the parameter θ of the semiparametric models {fθ,η : θ ∈ Θ ⊆ IRp, η ∈ H} would be of great interest. Optimality of test statistics is closely related to efficient estimation, in that the most powerful test statistics for a hypothesis about a parameter are usually based on efficient estimators for that parameter. Therefore, the MHD estimators would be excellent candidates in this context.

APPENDIX

We give a sketch of the proofs of Theorems 3.1 and 3.2 here; detailed proofs can be found in Wu (2007). First, we state two lemmas that are proven in Wu (2007).

Lemma A.1. If (D1) holds for θ ∈ Θ, then d(t) = ‖ht^{1/2} − φ^{1/2}‖ is continuous at the point t = θ for any φ ∈ H.

Lemma A.2. If (D4) holds, then sup_{θ∈Θ} ∫ exp[(1, r(x))θ] [ĝn^{1/2}(x) − g^{1/2}(x)]² dx → 0 in probability as n → ∞.

Proof of Theorem 3.1.

(i) Let d̂n0(t) = ‖ĥt^{1/2} − φ^{1/2}‖. Suppose the sequence {tk} ⊂ Θ and tk → t as k → ∞. Since Θ is compact, t ∈ Θ. From Minkowski's inequality, we have

\[
\bigl| \hat d_{n_0}(t_k) - \hat d_{n_0}(t) \bigr|
\le \left[ \int \bigl| \hat h_{t_k}(x) - \hat h_t(x) \bigr|\, dx \right]^{1/2}
= \left[ \int \bigl| \exp[(1, r(x))t_k] - \exp[(1, r(x))t] \bigr|\, \hat g_{n_0}(x)\, dx \right]^{1/2}.
\]

Since ĝn0 is compactly supported, we obtain using the Dominated Convergence Theorem that d̂n0(tk) → d̂n0(t) as k → ∞; that is, d̂n0(t) is continuous and achieves a minimum over t ∈ Θ. Let d(t) = ‖ht^{1/2} − φ^{1/2}‖. By Lemma A.1, d(t) is continuous in t and therefore achieves a minimum over t ∈ Θ.

(ii) Suppose that ‖φn1^{1/2} − φ^{1/2}‖ → 0 and sup_{θ∈Θ} ‖ĥθ^{1/2} − hθ^{1/2}‖ → 0 as n → ∞. Put d̂n(θ) = ‖ĥθ^{1/2}(x) − φn1^{1/2}(x)‖ and d(θ) = ‖hθ^{1/2}(x) − φ^{1/2}(x)‖. Again by Minkowski's inequality,

\[
\bigl| \hat d_n(\theta) - d(\theta) \bigr|
\le \left\{ \int \left[ \hat h_\theta^{1/2}(x) - \phi_{n_1}^{1/2}(x) - h_\theta^{1/2}(x) + \phi^{1/2}(x) \right]^2 dx \right\}^{1/2}
\le \left\{ 2 \int \left[ \hat h_\theta^{1/2}(x) - h_\theta^{1/2}(x) \right]^2 dx
+ 2 \int \left[ \phi_{n_1}^{1/2}(x) - \phi^{1/2}(x) \right]^2 dx \right\}^{1/2},
\]

and consequently sup_{θ∈Θ} |d̂n(θ) − d(θ)| → 0 as n → ∞. Therefore, as n → ∞, d̂n(θ0) → d(θ0) and d̂n(θ̂) − d(θ̂) → 0. If θ̂ does not converge to θ0, then there exists a subsequence {θ̂ni} ⊆ {θ̂} such that θ̂ni → θ′ ≠ θ0. Since Θ is compact, θ′ ∈ Θ. Lemma A.1 yields that d(θ̂ni) → d(θ′). From the above results we obtain d̂ni(θ̂ni) − d̂ni(θ0) → d(θ′) − d(θ0). By the definition of θ̂ni, d̂ni(θ̂ni) − d̂ni(θ0) ≤ 0. Hence, d(θ′) − d(θ0) ≤ 0. But by the definition and uniqueness of θ0, d(θ′) > d(θ0). This is a contradiction. Therefore, θ̂ → θ0.

(iii) Since {hθ}θ∈Θ is identifiable, we have T0(hθ) = θ uniquely for every θ ∈ Θ.

This completes the proof. ∎

Proof of Theorem 3.2. From Remark 3.2, $\{h_\theta\}_{\theta \in \Theta}$ is identifiable. So if we can prove that $\|h_{n_1}^{1/2} - h_\theta^{1/2}\| \xrightarrow{P} 0$ and $\sup_{\theta \in \Theta} \|\hat h_\theta^{1/2} - h_\theta^{1/2}\| \xrightarrow{P} 0$ as $n \to \infty$, then it follows that $\hat\theta \xrightarrow{P} \theta$ as $n \to \infty$ by Theorem 3.1.

It is easy to show that $g_{n_0} \xrightarrow{P} g$ and $h_{n_1} \xrightarrow{P} h_\theta$ as $n \to \infty$. Since $\int h_\theta(x)\,dx = \int h_{n_1}(x)\,dx = 1$, we have $\int [h_\theta(x) - h_{n_1}(x)]^+\,dx = \int [h_\theta(x) - h_{n_1}(x)]^-\,dx$ and
$$
\|h_{n_1}^{1/2} - h_\theta^{1/2}\|^2 \le \int | h_\theta(x) - h_{n_1}(x) |\,dx = 2 \int [h_\theta(x) - h_{n_1}(x)]^+\,dx.
$$
Combined with $[h_\theta(x) - h_{n_1}(x)]^+ \le h_\theta(x)$ and the Dominated Convergence Theorem, it follows that $\|h_{n_1}^{1/2} - h_\theta^{1/2}\| \xrightarrow{P} 0$ as $n \to \infty$.

Note that
$$
\int \bigl[\hat h_\theta^{1/2}(x) - h_\theta^{1/2}(x)\bigr]^2\,dx = \int \exp[(1, r(x))\theta]\,\bigl[g_{n_0}^{1/2}(x) - g^{1/2}(x)\bigr]^2\,dx \le \int \exp[(1, r(x))\theta]\,| g_{n_0}(x) - g(x) |\,dx.
$$
If (D2) holds, then $g_{n_0} - g$ has a compact support, on which $\exp[(1, r(x))\theta]$ is bounded. Therefore, $\int [\hat h_\theta^{1/2}(x) - h_\theta^{1/2}(x)]^2\,dx \le C_1 \int | g_{n_0}(x) - g(x) |\,dx = 2 C_1 \int [g(x) - g_{n_0}(x)]^+\,dx$ for some positive number $C_1$. Since $g_{n_0} \xrightarrow{P} g$, from the Dominated Convergence Theorem we have $\sup_{\theta \in \Theta} \|\hat h_\theta^{1/2} - h_\theta^{1/2}\| \xrightarrow{P} 0$. If (D3) holds, then $\exp[(1, r(x))\theta]$ is bounded and similarly $\sup_{\theta \in \Theta} \|\hat h_\theta^{1/2} - h_\theta^{1/2}\| \xrightarrow{P} 0$. If (D4) holds, then Lemma A.2 gives that $\sup_{\theta \in \Theta} \|\hat h_\theta^{1/2} - h_\theta^{1/2}\| \xrightarrow{P} 0$. This completes the proof. $\Box$
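As a quick sanity check on the first half of this argument, the short script below (a hypothetical illustration, with two arbitrary densities standing in for $h_\theta$ and $h_{n_1}$) verifies numerically that the positive and negative parts of the difference of two densities integrate to the same value, and that the squared Hellinger distance is bounded by the $L_1$ distance.

```python
import numpy as np

# Hypothetical numerical check of the facts used above: for densities h1, h2,
#   (i)  int (h1 - h2)^+ dx = int (h1 - h2)^- dx, and
#   (ii) ||h1^{1/2} - h2^{1/2}||^2 <= int |h1 - h2| dx = 2 int (h1 - h2)^+ dx.
# h1 and h2 are arbitrary stand-ins on a grid.

x = np.linspace(-10.0, 10.0, 40001)
dx = x[1] - x[0]

def normalize(f):
    return f / (f.sum() * dx)

h1 = normalize(np.exp(-0.5 * x ** 2))          # normal shape
h2 = normalize(np.exp(-np.abs(x - 0.7)))       # shifted Laplace shape

pos = np.clip(h1 - h2, 0.0, None).sum() * dx   # int (h1 - h2)^+
neg = np.clip(h2 - h1, 0.0, None).sum() * dx   # int (h1 - h2)^-
hell_sq = (((np.sqrt(h1) - np.sqrt(h2)) ** 2).sum()) * dx
l1 = np.abs(h1 - h2).sum() * dx

print(f"(h1-h2)^+: {pos:.6f}   (h1-h2)^-: {neg:.6f}")              # equal
print(f"Hellinger^2: {hell_sq:.6f} <= L1: {l1:.6f} = {2*pos:.6f}")
```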

ACKNOWLEDGEMENTS

We wish to thank the Editor, Professor Paul Gustafson, the Associate Editor, and a referee for their very constructive and helpful comments that led to an improved presentation of the article.

BIBLIOGRAPHY

J. A. Anderson (1979). Multivariate logistic compounds. Biometrika, 66, 17–26.
R. Beran (1977). Minimum Hellinger distance estimators for parametric models. Annals of Statistics, 5, 445–463.
P. J. Bickel (1982). On adaptive estimation. Annals of Statistics, 10, 647–671.
J. Chen (1992). Optimal rate of convergence in finite mixture models. Annals of Statistics, 23, 221–234.
J. Chen (1998). Penalized likelihood-ratio test for finite mixture models with multinomial observations. Canadian Journal of Statistics, 26, 583–599.
J. Chen & J. D. Kalbfleisch (1996). Penalized minimum-distance estimates in finite mixture models. Canadian Journal of Statistics, 24, 167–175.
O. I. Cordero-Brana (1994). Minimum Hellinger distance estimation for finite mixture models. Ph.D. dissertation, Utah State University.
A. Cutler & O. I. Cordero-Brana (1996). Minimum Hellinger distance estimation for finite mixture models. Journal of the American Statistical Association, 91, 1716–1723.
D. L. Donoho & R. C. Liu (1988). The "automatic" robustness of minimum distance functionals. Annals of Statistics, 16, 552–586.
B. Efron & R. Tibshirani (1996). Using specially designed exponential families for density estimation. Annals of Statistics, 24, 2431–2461.
L. Fernholz (1983). "Von Mises Calculus for Statistical Functionals," Lecture Notes in Statistics, Vol. 19, Springer-Verlag, New York.
J. Forrester, W. Hooper, H. Peng & A. Schick (2003). On the construction of efficient estimators in semiparametric models. Statistics & Decisions, 21, 109–138.
P. B. Gilbert, S. R. Self & M. Ashby (1998). Statistical methods for assessing differential vaccine protection against human immunodeficiency virus types. Biometrics, 54, 799–814.
R. D. Gill, Y. Vardi & J. A. Wellner (1988). Large sample theory of empirical distributions in biased sampling models. Annals of Statistics, 16, 1069–1112.
P. Hall (1981). On the nonparametric estimation of mixture proportions. Journal of the Royal Statistical Society Series B, 43, 147–156.
P. Hall (1983). Orthogonal series distribution function estimation, with applications. Journal of the Royal Statistical Society Series B, 45, 81–88.
P. Hall & D. M. Titterington (1984). Efficient nonparametric estimation of mixture proportions. Journal of the Royal Statistical Society Series B, 46, 465–473.
F. Hampel (1971). A general qualitative definition of robustness. Annals of Mathematical Statistics, 42, 1887–1896.
D. W. Hosmer (1973). A comparison of iterative maximum likelihood estimates of the parameters of a mixture of two normal distributions under three types of samples. Biometrics, 29, 761–770.
P. J. Huber (1980). "Robust Statistics," Wiley, New York.
D. Karlis & E. Xekalaki (1998). Minimum Hellinger distance estimation for Poisson mixtures. Computational Statistics & Data Analysis, 29, 81–103.
D. Karlis & E. Xekalaki (2001). Robust inference for finite Poisson mixtures. Journal of Statistical Planning and Inference, 93, 93–115.
R. J. Karunamuni & J. Wu (2009). Minimum Hellinger distance estimation in a nonparametric mixture model. Journal of Statistical Planning and Inference, 139, 1118–1133.
B. G. Lindsay (1994). Efficiency versus robustness: the case for minimum Hellinger distance and related methods. Annals of Statistics, 22, 1081–1114.
Z. Lu, Y. V. Hui & A. H. Lee (2003). Minimum Hellinger distance estimation for finite mixtures of Poisson regression models and its applications. Biometrics, 59, 1016–1026.
R. A. Maronna, D. R. Martin & V. Yohai (2007). "Robust Statistics: Theory and Methods," John Wiley, New York.
G. J. McLachlan & D. Peel (2000). "Finite Mixture Models," Wiley, New York.
R. L. Prentice & R. Pyke (1979). Logistic disease incidence models and case-control studies. Biometrika, 66, 403–411.
J. Qin (1993). Empirical likelihood in biased sample problems. Annals of Statistics, 21, 1182–1196.
J. Qin (1999). Empirical likelihood ratio based confidence intervals for mixture proportions. Annals of Statistics, 27, 1368–1384.
J. Qin & B. Zhang (1997). A goodness of fit test for logistic regression models based on case-control data. Biometrika, 84, 609–618.
P. J. Rousseeuw & C. Croux (1993). Alternatives to the median absolute deviation. Journal of the American Statistical Association, 88, 1273–1283.
A. Schick (1986). On asymptotically efficient estimation in semiparametric models. Annals of Statistics, 14, 1139–1151.
D. G. Simpson (1987). Minimum Hellinger distance estimation for the analysis of count data. Journal of the American Statistical Association, 82, 802–807.
R. N. Tamura & D. D. Boos (1986). Minimum Hellinger distance estimation for multivariate location and covariance. Journal of the American Statistical Association, 81, 223–229.
D. M. Titterington (1983). Minimum distance nonparametric estimation of mixture proportions. Journal of the Royal Statistical Society Series B, 45, 37–46.
D. M. Titterington, A. F. M. Smith & U. E. Makov (1985). "Statistical Analysis of Finite Mixture Distributions," Wiley, New York.
A. W. van der Vaart (1998). "Asymptotic Statistics," Cambridge University Press, New York.
Y. Vardi (1985). Empirical distribution in selection bias models. Annals of Statistics, 13, 178–203.
Mi-Ja Woo & T. N. Sriram (2006). Robust estimation of mixture complexity. Journal of the American Statistical Association, 101, 1475–1486.
Mi-Ja Woo & T. N. Sriram (2007). Robust estimation of mixture complexity for count data. Computational Statistics & Data Analysis, 51, 4379–4392.
W. A. Woodward, P. Whitney & P. W. Eslinger (1995). Minimum Hellinger distance estimation of mixture proportions. Journal of Statistical Planning and Inference, 48, 303–319.
J. Wu (2007). Minimum Hellinger distance estimation in semiparametric models. Ph.D. dissertation, University of Alberta, Canada.
S. Yang (1991). Minimum Hellinger distance estimation of parameter in the random censorship model. Annals of Statistics, 19, 579–602.
Z. Ying (1992). Minimum Hellinger-type distance estimation for censored data. Annals of Statistics, 20, 1361–1390.
B. Zhang (2000). Quantile estimation under a two-sample semi-parametric model. Bernoulli, 6, 491–511.

Received 26 August 2008
Accepted 22 July 2009
