Robust Likelihood Cross Validation for Kernel Density Estimation
Ximing Wu∗
Abstract
Likelihood cross validation for kernel density estimation is known to be sensitive to extreme observations and heavy-tailed distributions. We propose a robust likelihood-based cross validation method to select bandwidths in multivariate density estimation. We derive this bandwidth selector within the framework of robust maximum likelihood estimation and establish its connection to minimum density power divergence estimation. The method effects a smooth transition from likelihood cross validation for non-extreme observations to least squares cross validation for extreme observations, thereby combining the efficiency of likelihood cross validation with the robustness of least squares cross validation. An automatic transition threshold is suggested. We demonstrate the finite sample performance and practical usefulness of the proposed method via Monte Carlo simulations and empirical applications to British income data and Chinese air pollution data.
Key Words: Multivariate Density Estimation; Bandwidth Selection; Likelihood Cross Val-
idation; Robust Maximum Likelihood
∗Department of Agricultural Economics, Texas A&M University, College Station, 77843 TX, USA;[email protected]
1 Introduction
The kernel density estimator (KDE) has been the workhorse of nonparametric density estimation for decades. Consider an I.I.D. sample of d-dimensional random vectors {Xi}ni=1 from an absolutely continuous distribution F defined on X with density f. In this study, we are concerned with the following product KDE of multivariate densities:
f(x; h) = (1/n) ∑_{i=1}^n K_h(x − X_i) ≡ (1/n) ∑_{i=1}^n ∏_{s=1}^d K_{h_s}(x_s − X_{i,s}),   (1)
where x = (x1, . . . , xd)′, K : R → R+ is a univariate density function, Kh(·) = K(·/h)/h, and h = (h1, . . . , hd)′ is a vector of positive bandwidths. Kernel estimation depends crucially on the bandwidth. There exist two major approaches to bandwidth selection: the plug-in approach and the classical approach. Readers are referred to Scott (1992) and Wand and Jones (1995) for general overviews of KDE, and to Park and Marron (1990), Sain et al. (1994), Jones et al. (1996), and Loader (1999) for in-depth examinations of bandwidth selection.
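As a concrete illustration, the product estimator in (1) can be evaluated in a few lines. The sketch below is ours, not the paper's code, and assumes a Gaussian kernel in each coordinate:

```python
import numpy as np

def product_kde(x, X, h):
    """Evaluate the product kernel estimate f(x; h) of eq. (1) at a point x,
    using a Gaussian kernel in each coordinate (an illustrative choice)."""
    X = np.atleast_2d(X)                              # (n, d) sample
    u = (np.asarray(x) - X) / h                       # (n, d) standardized differences
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)      # univariate Gaussian kernels
    return float(np.mean(np.prod(K / h, axis=1)))     # average of product kernels

# A single observation at the origin with h = 1 reproduces the N(0, 1) density.
X = np.array([[0.0]])
print(product_kde([0.0], X, np.array([1.0])))         # ~0.3989
```

The double loop implicit in (1) reduces to one vectorized pass because the product kernel factors coordinate-wise.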
This study focuses on the method of Cross Validation (CV), a member of the classical approach and one of the most commonly used methods of bandwidth selection. Some plug-in methods, cf. Sheather and Jones (1991), are known to provide excellent performance. However, these plug-in methods often require complicated derivations and preliminary estimates, and general purpose plug-in methods for multivariate densities are not available in the literature. In contrast, CV entails neither complicated derivations nor preliminary estimates. Furthermore, it works for univariate and multivariate densities alike and has been suggested to be advantageous for multivariate densities (Sain et al. (1994)). See also Loader (1999) on the advantages of CV methods over plug-in approaches.
Habbema et al. (1974) introduced the Likelihood Cross Validation (LCV), which is defined by

max_h (1/n) ∑_{i=1}^n ln f_i(h),   (2)

where f_i(h) = 1/(n − 1) ∑_{j≠i} K_h(X_i − X_j) is the leave-one-out density estimate. Another popular method, the Least Squares Cross Validation (LSCV), proposed by Rudemo (1982) and Bowman (1984), is given by

min_h ∫_X f²(x; h) dx − (2/n) ∑_{i=1}^n f_i(h).   (3)
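For a univariate sample both criteria are simple to compute. The sketch below is illustrative (not the paper's code); it assumes a Gaussian kernel, for which the integrated squared density in (3) has the closed form (1/n²) ∑_{i,j} K_{h√2}(X_i − X_j):

```python
import numpy as np

def phi(u, h):
    """Gaussian kernel with bandwidth h."""
    return np.exp(-0.5 * (u / h)**2) / (h * np.sqrt(2 * np.pi))

def loo_density(X, h):
    """Leave-one-out estimates f_i(h) for a univariate sample X."""
    n = len(X)
    K = phi(X[:, None] - X[None, :], h)   # pairwise kernel evaluations
    np.fill_diagonal(K, 0.0)              # drop the i = j term
    return K.sum(axis=1) / (n - 1)

def lcv(X, h):
    """Objective (2), to be maximized over h."""
    return np.mean(np.log(loo_density(X, h)))

def lscv(X, h):
    """Objective (3), to be minimized over h; for the Gaussian kernel the
    integral of f^2 equals (1/n^2) sum_{i,j} phi(X_i - X_j, h*sqrt(2))."""
    n = len(X)
    int_f2 = phi(X[:, None] - X[None, :], h * np.sqrt(2)).sum() / n**2
    return int_f2 - 2 * np.mean(loo_density(X, h))
```

Maximizing `lcv` or minimizing `lscv` over a grid of candidate h values then yields the two bandwidths.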
Either method has its limitations. LSCV is known to have high variability and tends to undersmooth the data. It is also computationally more expensive than LCV, especially for multivariate densities, due to the calculation of the integrated squared density in (3). On the other hand, LCV suffers from one critical drawback: sensitivity to extreme observations and tail heaviness of the underlying distribution; cf. Schuster and Gregory (1981) and Hall (1987) on this detrimental tail effect. Several remedies have been proposed to alleviate the tail problem of LCV. Marron (1985, 1987) explored trimming of extreme observations. Hall (1987) suggested using a heavy-tailed kernel function for heavy-tailed densities. This method, however, performs poorly for thin- or moderate-tailed densities. All these studies focus on univariate densities.
This study proposes a robust alternative to LCV for multivariate kernel density estimation. The key innovation of our method is to replace the logarithm function in the LCV objective with a function that is robust against extreme observations. In particular, we consider the following piecewise function: for x > 0,

ln*(x; a) = ln x,              if x ≥ a,
            ln a − 1 + x/a,    if x < a,          (4)

where a ≥ 0. For x < a, we replace ln x with its linear approximation at a, which is larger than ln x by the concavity of the logarithm function; see Figure 1 for an illustration of ln*(x; a = 0.1) versus ln x. The two curves coincide for x ≥ 0.1; while ln(x) goes to minus infinity rapidly as x → 0, ln*(x; a) declines linearly for x < 0.1, effectively mitigating the detrimental tail effect associated with LCV. We therefore name our method Robust Likelihood Cross Validation (RLCV).
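A direct, vectorized implementation of (4) is straightforward; this is an illustrative sketch with the threshold a supplied by the user:

```python
import numpy as np

def ln_star(x, a):
    """The robustified logarithm of eq. (4): ln x for x >= a, and its linear
    approximation at a, ln a - 1 + x/a, for x < a."""
    x = np.asarray(x, dtype=float)
    # Clip inside the log so the unused branch never sees values below a.
    return np.where(x >= a, np.log(np.maximum(x, a)), np.log(a) - 1.0 + x / a)
```

Both branches take the value ln a and share the slope 1/a at x = a, so the transition is smooth.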
Figure 1: ln*(x; a = 0.1) depicted by the solid line and ln x by the dashed line
The proposed RLCV is defined by

max_h (1/n) ∑_{i=1}^n ln* f_i(h) − b*(h),
where b*(h) is a bias correction term to be given below. We show that LSCV, LCV and RLCV can all be obtained within a unifying framework of robust maximum likelihood estimation. We also establish a connection between CV-based bandwidth selection and the minimum density power divergence estimator of Basu et al. (1998). RLCV is close in spirit to Huber's (1964) robustification of location estimation, which replaces the least squares objective function ρ(t) = t²/2 with its linear approximation ρ(t) = k|t| − k²/2 when |t| > k for some k ≥ 0. Huber's estimator nests the sample mean (when k → ∞) and the sample median (when k = 0) as limiting cases. Similarly, RLCV can be viewed as a hybrid bandwidth selector that nests LCV (when a = 0) and LSCV (when a → ∞). Loosely speaking, RLCV conducts LSCV on extreme observations, avoiding the tail sensitivity of LCV; at the same time, with a small a, LCV is undertaken on the majority of observations, entailing little efficiency loss. It therefore combines the efficiency of LCV and the robustness of LSCV while eschewing their respective drawbacks.
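For comparison, Huber's loss applies the same "switch to a linear tail" device to the quadratic; a minimal sketch (the tuning constant 1.345 below is a conventional illustration, not a value from this paper):

```python
def huber_rho(t, k):
    """Huber's (1964) loss: t^2/2 for |t| <= k, and the linear continuation
    k|t| - k^2/2 beyond k -- the analogue of what ln*(x; a) does to ln x."""
    return 0.5 * t * t if abs(t) <= k else k * abs(t) - 0.5 * k * k
```

As k → ∞ the quadratic branch applies everywhere (yielding the sample mean); as k → 0 the loss becomes proportional to |t| (yielding the sample median).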
To make the proposed bandwidth selector fully automatic, we further propose a simple rule to select the threshold a in ln*(·; a):

a_n = |Σ|^{−1/2} (2π)^{−d/2} Γ(d/2) (ln n)^{1−d/2} n^{−1},

where Σ is the variance of X and Γ is the Gamma function. No preliminary estimates or additional tuning parameters are required. We conduct a series of Monte Carlo simulations on densities with varying degrees of tail-heaviness and dimension. Our results demonstrate good finite sample performance of RLCV relative to LCV and LSCV. RLCV performs similarly to LCV for thin- and moderate-tailed densities and clearly outperforms LCV for heavy-tailed densities. It also generally performs better than LSCV. We illustrate the usefulness of RLCV via applications to British income data and PM2.5 air pollution data from Beijing.
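The rule for a_n can be computed directly; an illustrative sketch (the function name is ours):

```python
import math
import numpy as np

def rlcv_threshold(n, Sigma):
    """The automatic threshold a_n = |Sigma|^{-1/2} (2 pi)^{-d/2} Gamma(d/2)
    (ln n)^{1 - d/2} / n, where Sigma is the d x d variance of X."""
    Sigma = np.atleast_2d(np.asarray(Sigma, dtype=float))
    d = Sigma.shape[0]
    return (np.linalg.det(Sigma) ** -0.5
            * (2 * math.pi) ** (-d / 2)
            * math.gamma(d / 2)
            * math.log(n) ** (1 - d / 2)
            / n)
```

For d = 1 with unit variance this simplifies to √(ln n / 2)/n, a small positive number that shrinks with the sample size, as the efficiency argument above requires.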
2 Preliminaries
In this section we present a brief introduction to the approach of robust maximum likelihood
estimation. We shall show in the next section that it provides a unifying framework to
explore bandwidth selection via cross validation, including LSCV, LCV and the proposed
RLCV.
Given I.I.D. observations {Xi}ni=1 from an unknown density f defined on X, let us consider a statistical model {g(x; θ) : θ ∈ Θ} of f, where Θ is a parameter space of finite dimensional θ. Eguchi and Kano (2001) presented a family of robust MLEs associated with an increasing, convex and differentiable function Ψ : R → R. Let l(x; θ) = ln[g(x; θ)]. They defined the Ψ-likelihood function as

L_Ψ(θ) = (1/n) ∑_{i=1}^n Ψ(l(Xi; θ)) − b_Ψ(θ),

where

b_Ψ(θ) = ∫_X Ξ(l(x; θ)) dx,   Ξ(z) = ∫_{−∞}^z exp(s) (∂Ψ(s)/∂s) ds.

Since Ψ generally transforms the log likelihood nonlinearly, a bias correction term b_Ψ is introduced to ensure the Fisher consistency of the estimator. The maximum Ψ-likelihood estimator is then given by

max_{θ∈Θ} L_Ψ(θ).
Let ψ(z) = ∂Ψ(z)/∂z and let S(x; θ) = ∂l(x; θ)/∂θ denote the score function. The estimating equation associated with the Ψ-estimator is given by

(1/n) ∑_{i=1}^n ψ(l(Xi; θ)) S(Xi; θ) = (∂/∂θ) b_Ψ(θ),   (5)

with

(∂/∂θ) b_Ψ(θ) = ∫_X ψ(l(x; θ)) S(x; θ) g(x; θ) dx.

Equation (5) can be rewritten as

∫_X ψ(l(x; θ)) S(x; θ) d(F_n(x) − G(x; θ)) = 0,

where F_n and G(·; θ) are the empirical CDF and the CDF associated with f and g(·; θ), respectively. Apparently if f is a member of {g(·; θ)}, this estimating equation is unbiased.
By the monotonicity and convexity of Ψ, ψ(·) ≥ 0 and ψ′(·) ≥ 0. Thus ψ(l(Xi; θ)) can be interpreted as the implicit weight assigned to Xi, which increases with its log-density. Extreme observations, whose log-densities are small, are weighted down in the estimating equation, providing desirable robustness. Note that the classical MLE is a special case of the Ψ-estimator with Ψ(z) = z and ψ(z) = 1. It is efficient in the sense that all observations from an I.I.D. sample are assigned equal weights in the estimating equation. On the other hand, MLE is not robust: since the log-density of an observation tends to minus infinity as its density approaches zero, extreme observations exert unduly large influence on the estimation.
A family of Ψ functions, termed Ψβ functions by Eguchi and Kano (2001), turns out to be particularly useful for the present study. This function is defined as

Ψ_β(z) = (exp(βz) − 1)/β,   β > 0.

The corresponding Ψβ estimator is given by

max_{θ∈Θ} (1/n) ∑_{i=1}^n g^β(Xi; θ)/β − ∫_X g^{β+1}(x; θ)/(β + 1) dx.   (6)
The Ψβ estimator is closely related to the minimum density power divergence estimator of Basu et al. (1998). Let g be a generic statistical model for f; the density power divergence is defined as

Δ_β(g, f) = ∫_X { g^{1+β}(x) − ((β + 1)/β) g^β(x) f(x) + (1/β) f^{1+β}(x) } dx,   β > 0,   (7)

where the third term is constant in g. It nests the Kullback-Leibler divergence as a limiting case:

Δ_{β→0}(g, f) = ∫_X f(x) ln{f(x)/g(x)} dx.   (8)

It is seen that minimizing the density power divergence (7) with respect to g is equivalent to maximizing the Ψβ likelihood (6).
The density power divergence is appealing in that it is linear in the unknown density f and does not require a separate nonparametric estimate of f. To see this, note that the second term in (7) is linear in f and can be readily estimated by its sample analog

(1/n) ∑_{i=1}^n ((β + 1)/β) g^β(Xi).
Most density divergence measures do not afford this advantage. For instance, Csiszár's divergence, with the exception of the Kullback-Leibler divergence, is nonlinear in f; cf. Beran's (1977) minimum Hellinger distance estimator of a parametric model, which requires an additional nonparametric estimate of f. Interested readers are referred to Basu et al. (2011) for a general treatment of minimum density power divergence estimation.
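The two limits of the Ψβ family described above can be checked numerically; an illustrative sketch:

```python
import math

def psi_beta(z, beta):
    """Eguchi and Kano's Psi_beta(z) = (exp(beta z) - 1)/beta, beta > 0.
    expm1 keeps the small-beta limit numerically stable."""
    return math.expm1(beta * z) / beta
```

As β → 0 the map tends to the identity (the ordinary log-likelihood, hence LCV), while β = 1 gives exp(z) − 1, i.e. the density itself up to a constant (hence LSCV).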
3 Robust Likelihood Cross Validation
3.1 Formulation
RLCV is motivated by replacing the logarithm function in the objective function of LCV with its robust alternative ln* to alleviate sensitivity to extreme observations. A naive estimator that maximizes ∑_{i=1}^n ln*(f_i(h)), however, is not consistent. In this section we show that bandwidths selected using LCV or LSCV can be interpreted as robust maximum likelihood estimates, and we derive RLCV within the robust MLE framework.

We recognize that the Ψβ estimator provides a unifying framework to explore cross validation methods for KDE. The basic idea is to select the bandwidth that maximizes the Ψβ likelihood of a KDE. In particular, we replace in (6) the parametric model g(x; θ) with the kernel estimate f(x; h) and the summand g(Xi; θ) with its leave-one-out counterpart f_i(h) (to avoid overfitting), yielding:

max_h (1/n) ∑_{i=1}^n f_i^β(h)/β − ∫_X f^{β+1}(x; h)/(β + 1) dx.
It follows that LSCV and LCV can be viewed as special or limiting cases of this family. Setting β = 1, we obtain

h_1 = arg max_h (1/n) ∑_{i=1}^n f_i(h) − ∫_X f²(x; h)/2 dx,

which coincides with LSCV given in (3). Alternatively, letting β → 0 yields

h_0 = arg max_h (1/n) ∑_{i=1}^n ln f_i(h) − ∫_X f(x; h) dx,

where the second term is constant at unity. This is equivalent to LCV given in (2).
Furthermore, the Ψβ estimator provides a natural framework to derive the proposed RLCV. Define

Ψ*(z) = z,                       if z ≥ ln a,
        ln a − 1 + exp(z)/a,     if z < ln a,

and the corresponding

Ξ*(z) = exp(z),          if z ≥ ln a,
        exp(2z)/(2a),    if z < ln a.

Since Ψ* is increasing, convex and piecewise differentiable, a robust MLE based on Ψ* inherits the properties of the Ψ-estimator. In particular, we define the robustified likelihood associated with ln* as follows:

L*(h) = (1/n) ∑_{i=1}^n ln*(f_i(h)) − b*(h),

where

b*(h) = ∫_X I(f(x; h) ≥ a) f(x; h) dx + (1/2a) ∫_X I(f(x; h) < a) f²(x; h) dx.

The resulting RLCV bandwidth selector is then given by

h* = arg max_h L*(h).   (9)
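Putting the pieces together, L*(h) of (9) can be sketched for a univariate sample. This illustrative version (not the paper's implementation) uses a Gaussian kernel, reuses the ln* branch from (4), and approximates the bias term b*(h) by a Riemann sum on a grid:

```python
import numpy as np

def phi(u, h):
    return np.exp(-0.5 * (u / h)**2) / (h * np.sqrt(2 * np.pi))

def rlcv_objective(X, h, a):
    """L*(h) of eq. (9) for a univariate sample: the mean ln* of the
    leave-one-out densities minus the bias term b*(h), the latter
    approximated numerically on a grid."""
    n = len(X)
    K = phi(X[:, None] - X[None, :], h)
    np.fill_diagonal(K, 0.0)
    f_loo = K.sum(axis=1) / (n - 1)                  # leave-one-out densities
    ln_star = np.where(f_loo >= a, np.log(np.maximum(f_loo, a)),
                       np.log(a) - 1.0 + f_loo / a)
    grid = np.linspace(X.min() - 5 * h, X.max() + 5 * h, 4001)
    f = phi(grid[:, None] - X[None, :], h).mean(axis=1)   # full-sample KDE
    integrand = np.where(f >= a, f, f**2 / (2 * a))       # b*(h) integrand
    b = float(np.sum(integrand) * (grid[1] - grid[0]))
    return float(ln_star.mean() - b)

# Bandwidth selection: maximize L*(h) over a grid of candidate values.
X = np.array([-1.2, -0.4, 0.0, 0.3, 1.1, 4.0])   # note one outlying point
hs = np.linspace(0.2, 2.0, 46)
scores = [rlcv_objective(X, h, a=0.01) for h in hs]
h_star = float(hs[int(np.argmax(scores))])
```

The grid approximation of b*(h) is for illustration only; in higher dimensions one would use a more careful numerical integration scheme.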
Figure 2: Ψ*(z) with a = 0.1, solid; Ψ_{β→0}(z), dash; Ψ_{β=1}(z), dotted; z = ln f, f ∈ [0.001, 5].
To help readers appreciate the role of Ψ in robust MLE, in Figure 2 we plot Ψ_{β→0}(z), Ψ_{β=1}(z) and Ψ*(z) with a = 0.1, where z = ln f signifies log-density with f ∈ [0.001, 5]. Note that Ψ_{β→0}(z) = ln f, corresponding to LCV, tends rapidly towards minus infinity as f declines. In contrast, Ψ_{β=1}(z) = f − 1, corresponding to LSCV, is linear in f and therefore robust against small densities. Lastly, Ψ*(z) coincides with Ψ_{β→0}(z) when f ≥ a but switches to f/a + ln a − 1 when f < a. Thus, like Ψ_{β=1}, Ψ* is linear in f for f < a and robust against small densities.
3.2 Discussions
Define

ψ*(z; a) = 1,           if z ≥ ln a,
           exp(z)/a,    if z < ln a.

It follows readily that the estimating equation associated with RLCV is given by

(1/n) ∑_{i=1}^n ψ*(ln f_i(h)) S(Xi; h) = ∫_X ψ*(ln f(x; h)) S(x; h) f(x; h) dx.   (10)

This estimating equation is asymptotically unbiased provided that f(·; h) is a consistent estimator of f.
Equation (10) can be rewritten as follows:

(1/n) ∑_{i=1}^n { I(f_i(h) ≥ a) + I(f_i(h) < a) f_i(h)/a } S(Xi; h)
   = ∫_X { I(f(x; h) ≥ a) + I(f(x; h) < a) f(x; h)/a } S(x; h) f(x; h) dx,

which aptly captures the main thrust of RLCV. When f_i ≥ a, RLCV executes LCV and ψ*(ln f_i) = 1; when f_i < a, RLCV executes LSCV and ψ*(ln f_i) = f_i/a < 1, which tends to zero with f_i. Given a small positive value for the threshold a, the majority of observations are assigned a unit weight while extreme observations, if any, are assigned smaller weights proportional to their densities. Thus RLCV effectively combines the efficiency of LCV and the robustness of LSCV.
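The implicit weights themselves amount to one line; an illustrative sketch:

```python
import numpy as np

def rlcv_weights(f_loo, a):
    """Implicit weights psi*(ln f_i(h)) in eq. (10): unit weight when
    f_i(h) >= a, and f_i(h)/a (vanishing with the density) otherwise."""
    f_loo = np.asarray(f_loo, dtype=float)
    return np.where(f_loo >= a, 1.0, f_loo / a)
```

For example, with a = 0.01, observations with leave-one-out densities 0.35 and 0.20 keep unit weight, while an extreme one with density 0.002 is down-weighted to 0.2.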
As discussed in the previous section, the Ψβ estimator can be interpreted as a minimum density power divergence estimator. It follows that CV-based bandwidth selectors can also be obtained as minimum density power divergence estimators. In particular, LSCV is obtained by minimizing (7) with β = 1 while LCV is obtained by minimizing (8), the Kullback-Leibler divergence. Define

Δ*(g, f) = ∫_{g(x)≥a} f(x) ln{f(x)/g(x)} dx + (1/2a) ∫_{g(x)<a} { g²(x) − 2g(x)f(x) + f²(x) } dx.   (11)

It is seen that RLCV can also be obtained as a minimum density power divergence estimator by minimizing the hybrid density power divergence (11).
There exists a large body of work on the theoretical properties of cross validation bandwidth selection; cf. Hall (1983), Stone (1984), Burman (1985) on LSCV and Hall (1982, 1987) and Chow et al. (1983) on LCV. Marron (1985) considered, for univariate densities, a modified LCV of the form

max_h (1/n) ∑_{i=1}^n ln f_i(h) I(s1 ≤ Xi ≤ s2) − ∫_{s1}^{s2} f(x; h) dx,   (12)

where f is assumed to be bounded away from zero on the interval [s1, s2]. Marron (1987) proposed a unifying framework to explore the asymptotic optimality of LSCV and the modified LCV (12). Let h be the optimizer of LSCV or modified LCV and define

Average Square Error:           D_A(f̂, f) = (1/n) ∑_{i=1}^n { f̂(Xi; h) − f(Xi) }² w²(Xi)
Integrated Square Error:        D_I(f̂, f) = ∫ { f̂(x; h) − f(x) }² w(x) dx
Mean Integrated Square Error:   D_M(f̂, f) = E[ ∫ { f̂(x; h) − f(x) }² w(x) dx ]

where w(x) is a nonnegative weight function. Marron (1987) established, under some mild regularity conditions, the asymptotic optimality of LSCV and modified LCV as follows:

lim_{n→∞} D(f̂(h), f) / inf_{h∈H_n} D(f̂(h), f) = 1 a.s.,   (13)

where D is any of D_A, D_I or D_M, and H_n is a finite set whose cardinality grows algebraically fast. The asymptotic optimality of LSCV is obtained under w(x) = 1, while setting w(x) = f^{−1}(x) I(s1 ≤ x ≤ s2) gives the desired result for modified LCV.
Below we show that the asymptotic optimality of RLCV can be established in a similar manner. Define a hybrid weight function

w*(x) = f^{−1}(x) I(f(x) ≥ a) + a^{−1} I(f(x) < a).   (14)

We then have the following.

Theorem. Under Assumptions A1-A3 given in the Appendix, if h* is given by (9), then

lim_{n→∞} D(f̂(h*), f) / inf_{h∈H_n} D(f̂(h), f) = 1 a.s.,

where D is any of D_A, D_I or D_M with the weight function w given by (14).

Unlike the modified LCV, RLCV does not require that f be bounded away from zero on a compact support, and the error criterion is minimized over the entire support of the underlying density.
4 Specification of Threshold a
RLCV reduces to LCV if we set a = 0. On the other hand, it is equivalent to LSCV under a sufficiently large a. Between these two extremes, the threshold a controls the trade-off between the efficiency and robustness inherent in RLCV. In this section, we present a simple method to select this threshold.

Our method is motivated by Hall's (1987) investigation into the interplay between the tail-heaviness of the kernel function and that of the underlying density in LCV. Hall (1987) focused on univariate densities and considered a scenario wherein f(x) ∼ c|x|^{−α} as |x| → ∞, c > 0 and α > 1. Suppose that the kernel function takes the form

K(x) = A2 exp(−A1 |x|^κ),   (15)

where A1, A2 and κ are positive constants such that K integrates to unity. Hall showed that if κ > α − 1, LCV selects a bandwidth diverging to infinity and becomes inconsistent. We employ in this study the Gaussian kernel, which is a member of (15) with κ = 2. Hall (1987) suggested that LCV with a Gaussian kernel is consistent when the underlying density is Gaussian or sub-Gaussian (the tails of a sub-Gaussian distribution decay at least as fast as those of a Gaussian distribution).
In practice, the tail-heaviness of the underlying density is usually unknown and often difficult to estimate, especially for multivariate random variables. We therefore opt for a somewhat conservative strategy that uses the Gaussian density as the benchmark. Denote by M_n the extreme observation of an I.I.D. random sample {Zi}ni=1 from the d-dimensional Gaussian distribution N(µ, Σ). We advocate the following simple rule: a = E[φ(M_n; µ, Σ)], where φ(·; µ, Σ) is the density function of N(µ, Σ). Under this rule, if the estimated kernel density of a given observation is smaller than the expected density of the sample extremum under Gaussianity, LCV is deemed vulnerable. When this occurs, RLCV replaces the log-density with its linear approximation, effectively weighting down its influence.

Assuming for simplicity that Σ is non-singular, we define the extremum of a Gaussian sample as follows:

M_n = {Zi : ‖Zi − µ‖_Σ ≥ ‖Zj − µ‖_Σ, 1 ≤ j ≤ n},   (16)

where ‖x‖_Σ = √(x′Σ^{−1}x). We show that the expected density of M_n can be approximated using the following result.

Proposition. Let {Zi}ni=1 be an I.I.D. sample from the d-dimensional Gaussian distribution N(µ, Σ) and let M_n be given by (16). Then as n → ∞,

φ(M_n; µ, Σ)/a_n →p 1,

where a_n = |Σ|^{−1/2} (2π)^{−d/2} Γ(d/2) (ln n)^{1−d/2} n^{−1}.
We have also considered 'data-driven' thresholds that depend on the sample kurtosis or an estimated tail index. Our experiments suggest little benefit from this additional complexity. Furthermore, the overall performance of RLCV appears stable over a fairly wide range of thresholds. Intuitively, the robustness of RLCV stems from replacing LCV with LSCV for small densities; so long as the threshold is bounded away from zero, the desired robustness obtains. At the same time, making the threshold sufficiently small retains the efficiency of LCV.
5 Monte Carlo Simulations
We use Monte Carlo simulations to assess the performance of RLCV and compare it with LCV and LSCV. There exist alternative bandwidth selectors (e.g., the Sheather-Jones plug-in bandwidth) that are known to be competitive. They are, however, generally not available for multivariate densities and are therefore not considered in this study. The experiment designs are summarized in Table 1. We include univariate, bivariate and trivariate densities with various degrees of skewness and kurtosis. We set n = 50 for univariate densities and n = 100 for multivariate densities. The Gaussian kernel is used in all estimations. Each experiment is repeated 1,000 times. Computation of LCV and RLCV is fast, faster than that of LSCV, especially in multivariate estimations.

Table 1: Summary of simulation designs

Density  d  n    Description
f11      1  50   standard Gaussian; skewness=0, kurtosis=3
f12      1  50   (1/2)N(−1, 1) + (1/2)N(1, 1) Gaussian mixture; skewness=0, kurtosis=2.5
f13      1  50   t5; skewness=0, kurtosis=9
f14      1  50   generalized lambda distribution*; skewness=0, kurtosis=126
f15      1  50   standard log-normal; skewness=6.2, kurtosis=113.9
f16      1  50   χ²(2); skewness=2, kurtosis=9
f21      2  100  bivariate Gaussian with ρ = 0.5
f22      2  100  t5 margins, Gaussian copula with ρ = 0.5
f23      2  100  t5 margins, t5-copula with ρ = 0.5
f24      2  100  t5 margins, Clayton copula with Kendall's τ = 0.5
f31      3  100  trivariate Gaussian with ρ = (0.5, 0.6, 0.7)
f32      3  100  t5 margins, Gaussian copula with ρ = (0.5, 0.6, 0.7)
f33      3  100  t5 margins, t5-copula with pairwise ρ = 0.5
f34      3  100  t5 margins, Clayton copula with pairwise Kendall's τ = 0.5

* defined by the quantile function F^{−1}(p) = (1 − p)^λ − p^λ, λ = −0.24, p ∈ (0, 1).
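One cell of such an experiment can be sketched end to end. The sketch below is ours, not the paper's code: it draws a t5 sample as in design f13, sets the d = 1 threshold a_n = √(ln n / 2)/(n σ̂), and maximizes the LCV and RLCV criteria over a common bandwidth grid:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_t(df=5, size=100)        # heavy-tailed t5 sample, as in f13

def phi(u, h):
    return np.exp(-0.5 * (u / h)**2) / (h * np.sqrt(2 * np.pi))

def loo(X, h):
    """Leave-one-out densities f_i(h)."""
    K = phi(X[:, None] - X[None, :], h)
    np.fill_diagonal(K, 0.0)
    return K.sum(axis=1) / (len(X) - 1)

def lcv_obj(X, h):
    # guard against underflow at very small bandwidths
    return np.mean(np.log(np.maximum(loo(X, h), 1e-300)))

def rlcv_obj(X, h, a):
    f = loo(X, h)
    ln_star = np.where(f >= a, np.log(np.maximum(f, a)), np.log(a) - 1 + f / a)
    grid = np.linspace(X.min() - 5 * h, X.max() + 5 * h, 2001)
    fg = phi(grid[:, None] - X[None, :], h).mean(axis=1)
    b = np.sum(np.where(fg >= a, fg, fg**2 / (2 * a))) * (grid[1] - grid[0])
    return ln_star.mean() - b

n = len(X)
a = np.sqrt(np.log(n) / 2) / (n * X.std())    # the a_n rule for d = 1
hs = np.linspace(0.05, 2.0, 79)
h_lcv = float(hs[int(np.argmax([lcv_obj(X, h) for h in hs]))])
h_rlcv = float(hs[int(np.argmax([rlcv_obj(X, h, a) for h in hs]))])
```

Repeating this over many replications and tabulating the means and standard deviations of `h_lcv` and `h_rlcv` reproduces the structure of the comparison below.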
The mean and standard deviation of the estimated bandwidths for each experiment are reported in Table 2. For bivariate and trivariate densities, we report the averaged means and standard deviations of the multiple bandwidths. Some comments are in order. (i) Under regular- or thin-tailed densities (f11, f12, f21 and f31), RLCV bandwidths are very close to LCV bandwidths. (ii) For the heavy-tailed densities (all others), RLCV bandwidths are consistently smaller than their LCV counterparts. This is expected, as LCV tends to select overly large bandwidths in the presence of heavy tails. (iii) RLCV shows smaller variability than LCV. (iv) Consistent with the findings of previous studies (cf. Park and Marron (1990), Jones et al. (1996) and Sain et al. (1994)), LSCV tends to undersmooth and has high variability. Interestingly, the only two cases in which LSCV produces the largest bandwidth among the three selectors are f11 and f12. These are regular- or thin-tailed densities, under which LCV and RLCV outperform LSCV.

Table 2: Averages and standard deviations of estimated bandwidths

                     LCV            RLCV           LSCV
Density  d  n    Mean   S.D.    Mean   S.D.    Mean   S.D.
f11      1  50   0.490  0.123   0.467  0.120   0.526  0.148
f12      1  50   0.444  0.115   0.447  0.124   0.541  0.156
f13      1  50   0.571  0.137   0.446  0.120   0.451  0.137
f14      1  50   0.578  0.142   0.402  0.114   0.364  0.133
f15      1  50   0.344  0.130   0.146  0.064   0.141  0.072
f16      1  50   0.365  0.143   0.189  0.072   0.194  0.089
f21      2  100  0.525  0.100   0.494  0.105   0.491  0.147
f22      2  100  0.600  0.117   0.476  0.105   0.419  0.130
f23      2  100  0.626  0.118   0.474  0.105   0.403  0.128
f24      2  100  0.551  0.118   0.421  0.099   0.379  0.118
f31      3  100  0.595  0.098   0.575  0.101   0.545  0.157
f32      3  100  0.659  0.115   0.549  0.111   0.469  0.147
f33      3  100  0.687  0.109   0.550  0.106   0.441  0.132
f34      3  100  0.593  0.114   0.483  0.104   0.423  0.133
We next examine the performance of the density estimates. Figure 3 reports the mean and error plots for the univariate density estimations. Goodness-of-fit measures in terms of the Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) are reported as black solid lines and red dashed lines respectively. We make the following observations. (i) KDE with a Gaussian kernel is known to be optimal for Gaussian densities. As expected, LCV outperforms LSCV under the Gaussian density f11, while RLCV is only slightly worse than LCV. A similar pattern is observed under the thin-tailed density f12. (ii) Density f13 is symmetric with kurtosis 9. Under this moderate kurtosis, LCV still outperforms LSCV while RLCV in turn beats LCV. (iii) Density f14 is symmetric with a substantial kurtosis of 126. Under this density, LCV is clearly dominated by the other two, manifesting the detrimental effect of tail-heaviness. At the same time, RLCV fares better than LSCV. (iv) Under densities f15 and f16, which are skewed and kurtotic, LCV again suffers from tail-heaviness while RLCV performs the best.
Figure 3: Mean and error plots of univariate estimation (Solid: RMSE; Dash: MAE)

Figure 4 reports the bivariate estimation results. Under density f21, which is bivariate Gaussian with correlation 0.5, LCV and RLCV perform similarly and are better than LSCV. The
next three densities are characterized by heavy-tailed margins and/or a heavy-tailed copula. As in the univariate case, LCV suffers from the tail-heaviness of the underlying density and is dominated by LSCV and RLCV. Between RLCV and LSCV, the former has lower average RMSE and MAE, and noticeably smaller variability. Figure 5 reports the estimation results for the trivariate densities. The overall pattern remains the same.

In summary, RLCV, unlike LCV, is resistant to tail-heaviness, while for thin- or regular-tailed distributions RLCV performs similarly to LCV, indicating little efficiency loss. Furthermore, in all experiments RLCV is comparable to and often outperforms LSCV, especially for multivariate densities. Given that in practice the tail-heaviness of the underlying density is often unknown, RLCV provides a robust alternative to LCV and a viable competitor to LSCV.
6 Empirical Examples
We illustrate the proposed bandwidth selector with two empirical examples. Our first dataset is the British income data studied by Wand et al. (1991). The data contain approximately 7,000 incomes, normalized by the sample mean, of British citizens in the year 1975. The skewness and kurtosis of the data are 1.84 and 13.69 respectively. The estimated bandwidths are 0.177 (LCV), 0.054 (RLCV) and 0.049 (LSCV). The corresponding densities are plotted in the left panel of Figure 6. In the presence of the substantial right tail of this income distribution, LCV selects a relatively large bandwidth and underestimates the sharp peak of the underlying density. On the other hand, RLCV and LSCV produce very similar estimates. We caution that the main purpose of this exercise is to demonstrate that the tail problem of LCV persists even in such a large sample; Wand et al. (1991) showed that a transformation-based kernel estimator produces a pleasantly smooth density that aptly captures the two modes of the density.
Figure 4: Mean and error plots of bivariate estimation (Solid: RMSE; Dash: MAE)
Figure 5: Mean and error plots of trivariate estimation (Solid: RMSE; Dash: MAE)
Figure 6: Left: estimated income densities; Right: estimated PM2.5 densities (blue: LCV; black: RLCV; red: LSCV)
20
LSCV estimate
PM2.5, Time=t−1
PM
2.5,
tim
e=t
1 1
1
1 1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1 1
1
1
1
1
1
1
1
1
1
1 1
1
1
1
1
1
1
1
2
2
2
2
3
0 100 200 300 400
010
020
030
040
0LCV estimate
PM2.5, Time=t−1
PM
2.5,
tim
e=t
0.02
0.04
0.04
0.06
0.08
0.1
0.12
0.14 0.16
0.18 0.2 0.22
0.24 0.26 0.28
0.3 0.32
0.34
0.36
0.38
0 100 200 300 400
010
020
030
040
0
RLCV estimate
PM2.5, Time=t−1
PM
2.5,
tim
e=t
0.1 0.1
0.1
0.2 0.3
0.4
0.5 0.6
0.7
0.8
0.9
1
0 100 200 300 400
010
020
030
040
0
Figure 7: Contours of estimated joint densities of PM2.5 from two consecutive days
We next examine air pollution data from Beijing. Accompanying China's rapid industrialization, there has been widespread environmental degradation, especially air pollution. An important indicator of air quality is the level of fine particulate matter (PM2.5), a main constituent of air pollution. In this study we use PM2.5 data released by the U.S. Embassy in Beijing. For in-depth analyses of these data, see a recent study by Liang et al. (2016) and the references therein. Since severe and sometimes prolonged smog in Beijing mainly occurs in winter, because of meteorological conditions and winter heating effects, we focus our analysis on the winter. To avoid hourly fluctuations in PM2.5 monitoring, we use the daily median level of PM2.5. In particular, our data consist of 92 observations of the daily median PM2.5 level for October, November and December of 2015. The skewness and kurtosis of the data are 2.14 and 9.29 respectively.

A PM2.5 level higher than 100 µg/m³ is customarily used as an indicator of hazardous air pollution. A histogram of the data is reported in the right panel of Figure 6. It is seen that the majority of the observations are below 100 µg/m³; at the same time, there is a significant second mode around 250 µg/m³, signifying severe air pollution. In the same plot, we report the estimated PM2.5 densities. The results suggest that LCV oversmooths the data and underestimates the major mode, while LSCV undersmooths the data and produces a rather rough density. RLCV produces a reasonably smooth density that adequately reflects the overall distribution of the data.
Prolonged severe near-surface air pollution is a major trigger of respiratory diseases. To explore the dynamics of pollutant accumulation, we proceed to estimate the joint density of PM2.5 levels on two consecutive days. The left panel of Figure 7 reports the contours of the estimated density produced by LSCV, with the daily PM2.5 level at time t − 1 on the horizontal axis and that at time t on the vertical axis. LSCV clearly undersmooths the data, and the resulting contour plot closely resembles a scatterplot of the data. The middle panel of Figure 7 plots the contours of the LCV estimate. The density appears to be rather flat, with the height of the estimated mode at 0.40. The RLCV estimate is reported in the right panel. A distinct mode, with an estimated height of 1.13, is produced. Some salient features of the data are readily discernible from the joint density estimated with the RLCV bandwidths: (i) The pronounced mode at a low pollution level indicates that a low-pollution day is likely to be followed by another one; this largely reflects the fact that the majority of the observations are below 100. (ii) The 'ridge' above the 45-degree line for pollution levels above 100 suggests that severe smog tends to result from pollutant accumulation over consecutive days. (iii) The two minor modes at levels around 300 at time t − 1 indicate a bimodal conditional distribution at this high pollution level. There is a simple explanation for this phenomenon. Heavy precipitation and/or strong wind can effectively dissipate PM2.5 in the atmosphere. A high level of PM2.5 is therefore likely either to persist or to decline abruptly, depending on whether significant precipitation and/or wind occurred, giving rise to a bimodal conditional distribution.
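The joint densities in this exercise take the product-kernel form of Eq. (1). A minimal sketch of such an estimator follows; the data pairs and bandwidths below are purely illustrative, not the RLCV selections used in the paper.

```python
import numpy as np

def product_kde(x, data, h):
    """Product Gaussian KDE of Eq. (1): f(x; h) = n^{-1} sum_i prod_s K_{h_s}(x_s - X_{i,s})."""
    x, data, h = (np.asarray(a, dtype=float) for a in (x, data, h))
    u = (x - data) / h                                      # standardized differences, shape (n, d)
    k = np.exp(-0.5 * u ** 2) / (np.sqrt(2.0 * np.pi) * h)  # K_{h_s}(.) = K(./h_s)/h_s, per coordinate
    return float(np.mean(np.prod(k, axis=1)))

# Hypothetical consecutive-day PM2.5 pairs (X_{t-1}, X_t) with illustrative bandwidths.
pairs = [[40.0, 55.0], [55.0, 48.0], [48.0, 120.0], [120.0, 150.0], [30.0, 35.0]]
density = product_kde([50.0, 50.0], pairs, h=[15.0, 15.0])
```

Evaluating `product_kde` over a grid of (t − 1, t) pairs produces the contour surfaces compared across LSCV, LCV, and RLCV above; only the bandwidth vector differs across the three methods.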
References
Basu, A., Harris, I. R., Hjort, N. L., and Jones, M. C. (1998), “Robust and efficient estimation
by minimising a density power divergence,” Biometrika, 85, 549–559.
Basu, A., Shioya, H., and Park, C. (2011), Statistical Inference: The Minimum Distance Approach, Chapman & Hall/CRC.
Beran, R. (1977), “Minimum Hellinger Distance Estimates for Parametric Models,” The
Annals of Statistics, 5, 445–463.
Bowman, A. (1984), “An alternative method of cross-validation for the smoothing of density
estimates,” Biometrika, 71, 353–360.
Burman, P. (1985), “A data dependent approach to density estimation,” Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 69, 609–628.
Chow, Y., Geman, S., and Wu, L. (1983), “Consistent cross-validated density estimation,”
The Annals of Statistics, 11, 25–38.
Eguchi, S. and Kano, Y. (2001), “Robustifying maximum likelihood estimation by psi-divergence,” ISM Research Memorandum, 802.
Embrechts, P., Klüppelberg, C., and Mikosch, T. (2012), Modelling Extremal Events for Insurance and Finance, Springer.
Habbema, J., Hermans, J., and van den Broek, K. (1974), “A stepwise discrimination analysis program using density estimation,” in Compstat 1974: Proceedings in Computational Statistics, ed. Bruckman, G., Physica Verlag, pp. 101–110.
Hall, P. (1982), “Cross validation in density estimation,” Biometrika, 69, 383–390.
— (1983), “Large sample optimality of least squares cross-validation in density estimation,”
The Annals of Statistics, 11, 1156–1174.
— (1987), “On Kullback-Leibler loss and density estimation,” The Annals of Statistics, 15, 1491–1519.
Huber, P. (1964), “Robust Estimation of a Location Parameter,” The Annals of Mathemat-
ical Statistics, 35, 73–101.
Jones, M., Marron, J., and Sheather, S. (1996), “A Brief Survey of Bandwidth Selection for
Density Estimation,” Journal of the American Statistical Association, 91, 401–407.
Liang, X., Li, S., Zhang, S., Huang, H., and Chen, S. (2016), “PM2.5 Data Reliability, Consistency and Air Quality Assessment in Five Chinese Cities,” Journal of Geophysical Research: Atmospheres, 121, 10220–10236.
Loader, C. (1999), “Bandwidth Selection: Classical or Plug-in?” The Annals of Statistics,
27, 415–438.
Marron, J. S. (1985), “An asymptotically efficient solution to the bandwidth problem of kernel density estimation,” The Annals of Statistics, 13, 1011–1023.
— (1987), “A comparison of cross-validation techniques in density estimation,” The Annals of Statistics, 15, 152–162.
Park, B. and Marron, J. (1990), “Comparison of Data-driven Bandwidth Selectors,” Journal
of the American Statistical Association, 85, 66–72.
Rudemo, M. (1982), “Empirical choice of histogram and kernel density estimators,” Scandi-
navian Journal of Statistics, 9, 65–78.
Sain, S., Baggerly, K., and Scott, D. (1994), “Cross-Validation of Multivariate Densities,”
Journal of the American Statistical Association, 89, 807–817.
Schuster, E. and Gregory, G. (1981), “On the nonconsistency of maximum likelihood non-
parametric density estimators,” in Computer Science and Statistics: Proceedings of the
13th Symposium on the Interface, ed. Eddy, W., Springer, pp. 295–298.
Scott, D. (1992), Multivariate Density Estimation: Theory, Practice, and Visualization,
Wiley.
Sheather, S. and Jones, M. (1991), “A Reliable Data-Based Bandwidth Selection Method for Kernel Density Estimation,” Journal of the Royal Statistical Society, Series B, 53, 683–690.
Stone, C. (1984), “An asymptotically optimal window selection rule for kernel density estimates,” The Annals of Statistics, 12, 1285–1297.
Wand, M. and Jones, M. (1995), Kernel Smoothing, Chapman and Hall.
Wand, M., Marron, J., and Ruppert, D. (1991), “Transformation in Density Estimation,”
Journal of the American Statistical Association, 86, 343–353.
Appendix
The following assumptions are needed to establish the asymptotic optimality of RLCV. There exist constants $c_1$, $c_2$ and $\delta$ such that:

(A1) The cardinality $\#(\mathcal{H}_n) \le n^{c_1}$ and $c_1^{-1} n^{\delta-1} \le h_s \le c_1 n^{-\delta}$, $s = 1, \ldots, d$.
(A2) For $h \in \mathcal{H}_n$,
$$\sup_x \Big| \int K_h(x, y) f(y)\,dy - f(x) \Big| \le c_1 n^{-\delta},$$
$$\lim_{n \to \infty} \sup_{h} \bigg| \frac{\int \operatorname{var}[K_h(x, X_i)]\, w^\star(x)\,dx}{c_2 (h_1 \cdots h_d)^{-1}} - 1 \bigg| = 0.$$
(A3) (i) $\sup_x |K_h(x, x)| \le c_1 (h_1 \cdots h_d)^{-1}$; (ii) for $h \in \mathcal{H}_n$ and $f(x) \ge a$, $\sup_{i,h,x} |f_i(x; h) - f(x)| \to 0$ a.s., where $f_i(x; h) = (n-1)^{-1} \sum_{j \ne i} K_h(X_j - x)$.
These assumptions are a subset of those used in Marron (1987) to establish the asymptotic optimality of a general family of delta sequence estimators. Assumption A1 is commonly used in kernel density estimation; for instance, it is also used by Hall (1987) in his investigation of LCV. Assumption A2 is a high-level assumption; it is satisfied, e.g., when the kernel function K and the density f are Hölder continuous. Assumption A3 is needed for the “LCV part” of the RLCV objective function. Note that condition A3(ii) is not required when f(x) ≤ a. One key difference from the requirements in Marron (1987) is that we do not impose the condition that f is bounded away from zero.
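As a concrete reading of condition A3(ii), the leave-one-out estimator $f_i(x;h)$ can be sketched as follows; the Gaussian kernel and the function name are our illustrative choices.

```python
import numpy as np

def loo_kde(i, x, data, h):
    """Leave-one-out estimator f_i(x; h) = (n-1)^{-1} sum_{j != i} K_h(X_j - x)."""
    data, x, h = np.asarray(data, float), np.asarray(x, float), np.asarray(h, float)
    rest = np.delete(data, i, axis=0)                       # drop the i-th observation
    u = (rest - x) / h                                      # shape (n - 1, d)
    k = np.exp(-0.5 * u ** 2) / (np.sqrt(2.0 * np.pi) * h)  # product Gaussian kernel, per coordinate
    return float(np.mean(np.prod(k, axis=1)))
```

This is the quantity averaged in likelihood-based cross-validation criteria: each observation is scored by a density estimate that excludes it.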
Proof of Theorem. Let $u(x) = w(x)f(x)$. Marron (1987) established the asymptotic optimality of LSCV by setting $w(x) = 1$, so that $u(x) = f(x)$, and that of LCV by setting $w(x) = f(x)^{-1}$ on a set where $f$ is bounded away from zero. The asymptotic optimality of RLCV can be established in a similar manner. In particular, define
$$u^\star(x) = I(f(x) \ge a) + \frac{f(x)}{a}\, I(f(x) < a)$$
such that $u^\star(x) = w^\star(x) f(x)$ for all $x \in \mathcal{X}$. Replacing the weight function $w$ used in Theorems 1 and 2 of Marron (1987) with the hybrid weight function $w^\star$ yields the desired result.
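The hybrid weight in the proof translates directly into code; the function names are ours.

```python
def u_star(f_x, a):
    """u*(x) = I(f(x) >= a) + (f(x)/a) I(f(x) < a), evaluated at f_x = f(x)."""
    return 1.0 if f_x >= a else f_x / a

def w_star(f_x, a):
    """w*(x) = u*(x)/f(x): the LCV-type weight 1/f(x) when f(x) >= a, the constant 1/a otherwise."""
    return 1.0 / f_x if f_x >= a else 1.0 / a
```

For $f(x) \ge a$ this recovers the LCV-type weight $1/f(x)$; for $f(x) < a$ it is the constant $1/a$, which is an LSCV-type weight up to scale. This is the smooth transition between the two criteria at the threshold $a$.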
Proof of Proposition. For simplicity, consider first a $d$-dimensional standard Gaussian random vector $X$ with independent components. Let $z = \|x\|^2$, where $\|\cdot\|$ is the Euclidean norm. The density of $X$ is given by
$$f(x) = (2\pi)^{-d/2} \exp(-z/2).$$
It follows that
$$\ln f(x) = -\frac{d}{2} \ln(2\pi) - z/2,$$
where $z = \|X\|^2$ is a $\chi^2_d$ random variable when evaluated at $X$.
Given an I.I.D. sample of standard multivariate Gaussian random vectors $\{X_i\}_{i=1}^n$ with $X_i = (X_{i,1}, \ldots, X_{i,d})$, define the extreme observation as
$$X_{(n)} = \{X_i : \|X_i\|^2 \ge \|X_j\|^2,\ 1 \le i, j \le n\}.$$
The log-density of $X_{(n)}$ is given by
$$-\frac{d}{2}\ln(2\pi) - Z_{(n)}/2,$$
where $Z_{(n)} = \|X_{(n)}\|^2$ is the sample maximum of $n$ I.I.D. $\chi^2_d$ random variables.
Next, for a random variable $Z \sim \chi^2_d$ and a constant $c > 0$, $cZ \sim \Gamma(k = d/2, \theta = 2c)$, where $k$ is the shape parameter and $\theta$ the scale parameter of the Gamma distribution. Setting $c = 1/2$, we have $Z/2 \sim \Gamma(k = d/2, \theta = 1)$. Using Table 3.4.4 (page 156) of Embrechts et al. (2012), we obtain
$$Z_{(n)}/2 - \delta_n \stackrel{d}{\to} \Lambda,$$
where $P(\Lambda < x) = \exp(-\exp(-x))$ and $\delta_n = \ln n + (d/2 - 1)\ln\ln n - \ln\Gamma(d/2)$. Thus $f(X_{(n)}) \approx (2\pi)^{-d/2}\exp(-\delta_n)$. It follows readily that for the general case $X \sim N(\mu, \Sigma)$ with extreme observation $M_n$ given by (16), we have
$$f(M_n) = \phi(M_n; \mu, \Sigma) \approx \frac{\exp(-\delta_n)}{(2\pi)^{d/2}|\Sigma|^{1/2}} = |\Sigma|^{-1/2}(2\pi)^{-d/2}\,\Gamma(d/2)(\ln n)^{1-d/2}\, n^{-1}.$$
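The final equality follows by exponentiating $-\delta_n$ term by term: $\exp(-\delta_n) = n^{-1}(\ln n)^{1-d/2}\,\Gamma(d/2)$. A quick numerical check of this identity (the choices of $n$ and $d$ are arbitrary):

```python
import math

def delta_n(n, d):
    """delta_n = ln n + (d/2 - 1) ln ln n - ln Gamma(d/2)."""
    return math.log(n) + (d / 2.0 - 1.0) * math.log(math.log(n)) - math.lgamma(d / 2.0)

def closed_form(n, d):
    """Gamma(d/2) (ln n)^{1 - d/2} n^{-1}."""
    return math.gamma(d / 2.0) * math.log(n) ** (1.0 - d / 2.0) / n

# exp(-delta_n) agrees with the closed form for any n >= 2, d >= 1.
for n, d in [(10 ** 3, 2), (10 ** 5, 3), (10 ** 4, 6)]:
    lhs, rhs = math.exp(-delta_n(n, d)), closed_form(n, d)
```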