Robust Likelihood Cross Validation for Kernel Density Estimation
Ximing Wu∗
Abstract
Likelihood cross validation for kernel density estimation is known to be sensitive to extreme observations and heavy-tailed distributions. We propose a robust likelihood-based cross validation method to select bandwidths in multivariate density estimation. We derive this bandwidth selector within the framework of robust maximum likelihood estimation and establish its connection to minimum density power divergence estimation. The method effects a smooth transition from likelihood cross validation for non-extreme observations to least squares cross validation for extreme observations, thereby combining the efficiency of likelihood cross validation with the robustness of least squares cross validation. An automatic transition threshold is suggested. We demonstrate the finite sample performance and practical usefulness of the proposed method via Monte Carlo simulations and empirical applications to British income data and Chinese air pollution data.
Key Words: Multivariate Density Estimation; Bandwidth Selection; Likelihood Cross Val-
idation; Robust Maximum Likelihood
∗Department of Agricultural Economics, Texas A&M University, College Station, 77843 TX, USA;[email protected]
1 Introduction
The kernel density estimator (KDE) has been the workhorse of nonparametric density estimation for decades. Consider an I.I.D. sample of d-dimensional random vectors {Xi}ni=1 from an absolutely continuous distribution F defined on X with density f. In this study, we are concerned with the following product KDE of multivariate densities:
f(x; h) = (1/n) ∑_{i=1}^n K_h(x − X_i) ≡ (1/n) ∑_{i=1}^n ∏_{s=1}^d K_{h_s}(x_s − X_{i,s}),   (1)
where x = (x1, . . . , xd)′, K : R → R+ is a univariate density function, Kh(·) = K(·/h)/h, and h = (h1, . . . , hd)′ is a vector of positive bandwidths. Kernel estimation depends crucially on the bandwidth. There exist two major approaches to bandwidth selection: the plug-in approach and the classical approach. Readers are referred to Scott (1992) and Wand and Jones (1995) for general overviews of KDE, and to Park and Marron (1990), Sain et al. (1994), Jones et al. (1996), and Loader (1999) for in-depth examinations of bandwidth selection.
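As a concrete illustration, the product estimator in (1) can be evaluated in a few lines. The sketch below is ours, not the paper's code, and assumes a Gaussian kernel in each coordinate:

```python
import numpy as np

def product_kde(x, X, h):
    """Evaluate the product kernel estimate f(x; h) of eq. (1) at a point x,
    using a Gaussian kernel in each coordinate (an illustrative choice)."""
    X = np.atleast_2d(X)                              # (n, d) sample
    u = (np.asarray(x) - X) / h                       # (n, d) standardized differences
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)      # univariate Gaussian kernels
    return float(np.mean(np.prod(K / h, axis=1)))     # average of product kernels

# A single observation at the origin with h = 1 reproduces the N(0, 1) density.
X = np.array([[0.0]])
print(product_kde([0.0], X, np.array([1.0])))         # ~0.3989
```

The double loop implicit in (1) reduces to one vectorized pass because the product kernel factors coordinate-wise.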
This study focuses on the method of Cross Validation (CV), a member of the classical approach and one of the most commonly used methods of bandwidth selection. Some plug-in methods, cf. Sheather and Jones (1991), are known to provide excellent performance. However, these plug-in methods often require complicated derivations and preliminary estimates, and general purpose plug-in methods for multivariate densities are not available in the literature. In contrast, CV entails neither complicated derivations nor preliminary estimates. Furthermore, it works for univariate and multivariate densities alike and has been suggested to be advantageous for multivariate densities (Sain et al. (1994)). See also Loader (1999) on the advantages of CV methods over plug-in approaches.
Habbema et al. (1974) introduced the Likelihood Cross Validation (LCV), which is defined by

max_h (1/n) ∑_{i=1}^n ln f_i(h),   (2)

where f_i(h) = 1/(n − 1) ∑_{j≠i} K_h(X_i − X_j) is the leave-one-out density estimate. Another popular method, the Least Squares Cross Validation (LSCV), proposed by Rudemo (1982) and Bowman (1984), is given by

min_h ∫_X f²(x; h) dx − (2/n) ∑_{i=1}^n f_i(h).   (3)
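For a univariate sample both criteria are simple to compute. The sketch below is illustrative (not the paper's code); it assumes a Gaussian kernel, for which the integrated squared density in (3) has the closed form (1/n²) ∑_{i,j} K_{h√2}(X_i − X_j):

```python
import numpy as np

def phi(u, h):
    """Gaussian kernel with bandwidth h."""
    return np.exp(-0.5 * (u / h)**2) / (h * np.sqrt(2 * np.pi))

def loo_density(X, h):
    """Leave-one-out estimates f_i(h) for a univariate sample X."""
    n = len(X)
    K = phi(X[:, None] - X[None, :], h)   # pairwise kernel evaluations
    np.fill_diagonal(K, 0.0)              # drop the i = j term
    return K.sum(axis=1) / (n - 1)

def lcv(X, h):
    """Objective (2), to be maximized over h."""
    return np.mean(np.log(loo_density(X, h)))

def lscv(X, h):
    """Objective (3), to be minimized over h; for the Gaussian kernel the
    integral of f^2 equals (1/n^2) sum_{i,j} phi(X_i - X_j, h*sqrt(2))."""
    n = len(X)
    int_f2 = phi(X[:, None] - X[None, :], h * np.sqrt(2)).sum() / n**2
    return int_f2 - 2 * np.mean(loo_density(X, h))
```

Maximizing `lcv` or minimizing `lscv` over a grid of candidate h values then yields the two bandwidths.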
Either method has its limitations. LSCV is known to have high variability and tends to undersmooth the data. It is also computationally more expensive than LCV, especially for multivariate densities, due to the calculation of the integrated squared density in (3). On the other hand, LCV suffers from one critical drawback: sensitivity to extreme observations and tail heaviness of the underlying distribution; cf. Schuster and Gregory (1981) and Hall (1987) on this detrimental tail effect. Several remedies have been proposed to alleviate the tail problem of LCV. Marron (1985, 1987) explored trimming of extreme observations. Hall (1987) suggested using a heavy-tailed kernel function for heavy-tailed densities. This method, however, performs poorly for thin- or moderate-tailed densities. All these studies focus on univariate densities.
This study proposes a robust alternative to LCV for multivariate kernel density estimation. The key innovation of our method is to replace the logarithm function in the LCV objective with a function that is robust against extreme observations. In particular, we consider the following piecewise function: for x > 0,

ln*(x; a) = ln x,              if x ≥ a,
            ln a − 1 + x/a,    if x < a,          (4)

where a ≥ 0. For x < a, we replace ln x with its linear approximation at a, which is larger than ln x by the concavity of the logarithm function; see Figure 1 for an illustration of ln*(x; a = 0.1) versus ln x. The two curves coincide for x ≥ 0.1; while ln(x) goes to minus infinity rapidly as x → 0, ln*(x; a) declines linearly for x < 0.1, effectively mitigating the detrimental tail effect associated with LCV. We therefore name our method Robust Likelihood Cross Validation (RLCV).
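A direct, vectorized implementation of (4) is straightforward; this is an illustrative sketch with the threshold a supplied by the user:

```python
import numpy as np

def ln_star(x, a):
    """The robustified logarithm of eq. (4): ln x for x >= a, and its linear
    approximation at a, ln a - 1 + x/a, for x < a."""
    x = np.asarray(x, dtype=float)
    # Clip inside the log so the unused branch never sees values below a.
    return np.where(x >= a, np.log(np.maximum(x, a)), np.log(a) - 1.0 + x / a)
```

Both branches take the value ln a and share the slope 1/a at x = a, so the transition is smooth.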
Figure 1: ln*(x; a = 0.1) depicted by the solid line and ln x by the dashed line
The proposed RLCV is defined by

max_h (1/n) ∑_{i=1}^n ln* f_i(h) − b*(h),
where b*(h) is a bias correction term to be given below. We show that LSCV, LCV and RLCV can all be obtained within a unifying framework of robust maximum likelihood estimation. We also establish a connection between CV-based bandwidth selection and the minimum density power divergence estimator of Basu et al. (1998). RLCV is close in spirit to Huber's (1964) robustification of location estimation, which replaces the least squares objective function ρ(t) = t²/2 with its linear approximation ρ(t) = k|t| − k²/2 when |t| > k for some k ≥ 0. Huber's estimator nests the sample mean (when k → ∞) and the sample median (when k = 0) as limiting cases. Similarly, RLCV can be viewed as a hybrid bandwidth selector that nests LCV (when a = 0) and LSCV (when a → ∞). Loosely speaking, RLCV conducts LSCV on extreme observations, avoiding the tail sensitivity of LCV; at the same time, with a small a, LCV is undertaken on the majority of observations, entailing little efficiency loss. It therefore combines the efficiency of LCV and the robustness of LSCV while eschewing their respective drawbacks.
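For comparison, Huber's loss applies the same "switch to a linear tail" device to the quadratic; a minimal sketch (the tuning constant 1.345 below is a conventional illustration, not a value from this paper):

```python
def huber_rho(t, k):
    """Huber's (1964) loss: t^2/2 for |t| <= k, and the linear continuation
    k|t| - k^2/2 beyond k -- the analogue of what ln*(x; a) does to ln x."""
    return 0.5 * t * t if abs(t) <= k else k * abs(t) - 0.5 * k * k
```

As k → ∞ the quadratic branch applies everywhere (yielding the sample mean); as k → 0 the loss becomes proportional to |t| (yielding the sample median).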
To make the proposed bandwidth selector fully automatic, we further propose a simple rule to select the threshold a in ln*(·; a):

a_n = |Σ|^{−1/2} (2π)^{−d/2} Γ(d/2) (ln n)^{1−d/2} n^{−1},

where Σ is the variance of X and Γ is the Gamma function. No preliminary estimates or additional tuning parameters are required. We conduct a series of Monte Carlo simulations on densities with varying degrees of tail-heaviness and dimension. Our results demonstrate good finite sample performance of RLCV relative to LCV and LSCV. RLCV performs similarly to LCV for thin- and moderate-tailed densities and clearly outperforms LCV for heavy-tailed densities. It also generally performs better than LSCV. We illustrate the usefulness of RLCV via applications to British income data and PM2.5 air pollution data from Beijing.
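The rule for a_n can be computed directly; an illustrative sketch (the function name is ours):

```python
import math
import numpy as np

def rlcv_threshold(n, Sigma):
    """The automatic threshold a_n = |Sigma|^{-1/2} (2 pi)^{-d/2} Gamma(d/2)
    (ln n)^{1 - d/2} / n, where Sigma is the d x d variance of X."""
    Sigma = np.atleast_2d(np.asarray(Sigma, dtype=float))
    d = Sigma.shape[0]
    return (np.linalg.det(Sigma) ** -0.5
            * (2 * math.pi) ** (-d / 2)
            * math.gamma(d / 2)
            * math.log(n) ** (1 - d / 2)
            / n)
```

For d = 1 with unit variance this simplifies to √(ln n / 2)/n, a small positive number that shrinks with the sample size, as the efficiency argument above requires.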
2 Preliminaries
In this section we present a brief introduction to the approach of robust maximum likelihood
estimation. We shall show in the next section that it provides a unifying framework to
explore bandwidth selection via cross validation, including LSCV, LCV and the proposed
RLCV.
Given I.I.D. observations {Xi}ni=1 from an unknown density f defined on X, let us consider a statistical model {g(x; θ) : θ ∈ Θ} of f, where Θ is a parameter space of finite dimensional θ. Eguchi and Kano (2001) presented a family of robust MLEs associated with an increasing, convex and differentiable function Ψ : R → R. Let l(x; θ) = ln[g(x; θ)]. They defined the Ψ-likelihood function as

L_Ψ(θ) = (1/n) ∑_{i=1}^n Ψ(l(Xi; θ)) − b_Ψ(θ),

where

b_Ψ(θ) = ∫_X Ξ(l(x; θ)) dx,   Ξ(z) = ∫_{−∞}^z exp(s) (∂Ψ(s)/∂s) ds.

Since Ψ generally transforms the log likelihood nonlinearly, a bias correction term b_Ψ is introduced to ensure the Fisher consistency of the estimator. The maximum Ψ-likelihood estimator is then given by

max_{θ∈Θ} L_Ψ(θ).
Let ψ(z) = ∂Ψ(z)/∂z and let S(x; θ) = ∂l(x; θ)/∂θ denote the score function. The estimating equation associated with the Ψ-estimator is given by

(1/n) ∑_{i=1}^n ψ(l(Xi; θ)) S(Xi; θ) = (∂/∂θ) b_Ψ(θ),   (5)

with

(∂/∂θ) b_Ψ(θ) = ∫_X ψ(l(x; θ)) S(x; θ) g(x; θ) dx.

Equation (5) can be rewritten as

∫_X ψ(l(x; θ)) S(x; θ) d(F_n(x) − G(x; θ)) = 0,

where F_n and G(·; θ) are the empirical CDF and the CDF associated with f and g(·; θ), respectively. Apparently if f is a member of {g(·; θ)}, this estimating equation is unbiased.
By the monotonicity and convexity of Ψ, ψ(·) ≥ 0 and ψ′(·) ≥ 0. Thus ψ(l(Xi; θ)) can be interpreted as the implicit weight assigned to Xi, which increases with its log-density. Extreme observations, whose log-densities are small, are weighted down in the estimating equation, providing desirable robustness. Note that the classical MLE is a special case of the Ψ-estimator with Ψ(z) = z and ψ(z) = 1. It is efficient in the sense that all observations from an I.I.D. sample are assigned equal weights in the estimating equation. On the other hand, MLE is not robust: since the log-density of an observation tends to minus infinity as its density approaches zero, extreme observations exert unduly large influence on the estimation.
A family of Ψ functions, termed Ψβ functions by Eguchi and Kano (2001), turns out to be particularly useful for the present study. This function is defined as

Ψ_β(z) = (exp(βz) − 1)/β,   β > 0.

The corresponding Ψβ estimator is given by

max_{θ∈Θ} (1/n) ∑_{i=1}^n g^β(Xi; θ)/β − ∫_X g^{β+1}(x; θ)/(β + 1) dx.   (6)
The Ψβ estimator is closely related to the minimum density power divergence estimator of Basu et al. (1998). Let g be a generic statistical model for f; the density power divergence is defined as

Δ_β(g, f) = ∫_X { g^{1+β}(x) − ((β + 1)/β) g^β(x) f(x) + (1/β) f^{1+β}(x) } dx,   β > 0,   (7)

where the third term is constant in g. It nests the Kullback-Leibler divergence as a limiting case:

Δ_{β→0}(g, f) = ∫_X f(x) ln{f(x)/g(x)} dx.   (8)

It is seen that minimizing the density power divergence (7) with respect to g is equivalent to maximizing the Ψβ likelihood (6).
The density power divergence is appealing in that it is linear in the unknown density f and does not require a separate nonparametric estimate of f. To see this, note that the second term in (7) is linear in f and can be readily estimated by its sample analog

(1/n) ∑_{i=1}^n ((β + 1)/β) g^β(Xi).
Most density divergence measures do not afford this advantage. For instance, Csiszár's divergence, with the exception of the Kullback-Leibler divergence, is nonlinear in f; cf. Beran's (1977) minimum Hellinger distance estimator of a parametric model, which requires an additional nonparametric estimate of f. Interested readers are referred to Basu et al. (2011) for a general treatment of minimum density power divergence estimation.
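The two limits of the Ψβ family described above can be checked numerically; an illustrative sketch:

```python
import math

def psi_beta(z, beta):
    """Eguchi and Kano's Psi_beta(z) = (exp(beta z) - 1)/beta, beta > 0.
    expm1 keeps the small-beta limit numerically stable."""
    return math.expm1(beta * z) / beta
```

As β → 0 the map tends to the identity (the ordinary log-likelihood, hence LCV), while β = 1 gives exp(z) − 1, i.e. the density itself up to a constant (hence LSCV).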
3 Robust Likelihood Cross Validation
3.1 Formulation
RLCV is motivated by replacing the logarithm function in the objective function of LCV with its robust alternative ln* to alleviate sensitivity to extreme observations. A naive estimator that maximizes ∑_{i=1}^n ln*(f_i(h)), however, is not consistent. In this section we show that bandwidths selected using LCV or LSCV can be interpreted as robust maximum likelihood estimates, and we derive RLCV within the robust MLE framework.

We recognize that the Ψβ estimator provides a unifying framework to explore cross validation methods for KDE. The basic idea is to select the bandwidth that maximizes the Ψβ likelihood of a KDE. In particular, we replace in (6) the parametric model g(x; θ) with the kernel estimate f(x; h) and the summand g(Xi; θ) with its leave-one-out counterpart f_i(h) (to avoid overfitting), yielding:

max_h (1/n) ∑_{i=1}^n f_i^β(h)/β − ∫_X f^{β+1}(x; h)/(β + 1) dx.
It follows that LSCV and LCV can be viewed as special or limiting cases of this family. Setting β = 1, we obtain

h_1 = arg max_h (1/n) ∑_{i=1}^n f_i(h) − ∫_X f²(x; h)/2 dx,

which coincides with LSCV given in (3). Alternatively, letting β → 0 yields

h_0 = arg max_h (1/n) ∑_{i=1}^n ln f_i(h) − ∫_X f(x; h) dx,

where the second term is constant at unity. This is equivalent to LCV given in (2).
Furthermore, the Ψβ estimator provides a natural framework to derive the proposed RLCV. Define

Ψ*(z) = z,                       if z ≥ ln a,
        ln a − 1 + exp(z)/a,     if z < ln a,

and the corresponding

Ξ*(z) = exp(z),          if z ≥ ln a,
        exp(2z)/(2a),    if z < ln a.

Since Ψ* is increasing, convex and piecewise differentiable, a robust MLE based on Ψ* inherits the properties of the Ψ-estimator. In particular, we define the robustified likelihood associated with ln* as follows:

L*(h) = (1/n) ∑_{i=1}^n ln*(f_i(h)) − b*(h),

where

b*(h) = ∫_X I(f(x; h) ≥ a) f(x; h) dx + (1/2a) ∫_X I(f(x; h) < a) f²(x; h) dx.

The resulting RLCV bandwidth selector is then given by

h* = arg max_h L*(h).   (9)
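Putting the pieces together, L*(h) of (9) can be sketched for a univariate sample. This illustrative version (not the paper's implementation) uses a Gaussian kernel, reuses the ln* branch from (4), and approximates the bias term b*(h) by a Riemann sum on a grid:

```python
import numpy as np

def phi(u, h):
    return np.exp(-0.5 * (u / h)**2) / (h * np.sqrt(2 * np.pi))

def rlcv_objective(X, h, a):
    """L*(h) of eq. (9) for a univariate sample: the mean ln* of the
    leave-one-out densities minus the bias term b*(h), the latter
    approximated numerically on a grid."""
    n = len(X)
    K = phi(X[:, None] - X[None, :], h)
    np.fill_diagonal(K, 0.0)
    f_loo = K.sum(axis=1) / (n - 1)                  # leave-one-out densities
    ln_star = np.where(f_loo >= a, np.log(np.maximum(f_loo, a)),
                       np.log(a) - 1.0 + f_loo / a)
    grid = np.linspace(X.min() - 5 * h, X.max() + 5 * h, 4001)
    f = phi(grid[:, None] - X[None, :], h).mean(axis=1)   # full-sample KDE
    integrand = np.where(f >= a, f, f**2 / (2 * a))       # b*(h) integrand
    b = float(np.sum(integrand) * (grid[1] - grid[0]))
    return float(ln_star.mean() - b)

# Bandwidth selection: maximize L*(h) over a grid of candidate values.
X = np.array([-1.2, -0.4, 0.0, 0.3, 1.1, 4.0])   # note one outlying point
hs = np.linspace(0.2, 2.0, 46)
scores = [rlcv_objective(X, h, a=0.01) for h in hs]
h_star = float(hs[int(np.argmax(scores))])
```

The grid approximation of b*(h) is for illustration only; in higher dimensions one would use a more careful numerical integration scheme.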
Figure 2: Ψ*(z) with a = 0.1, solid; Ψ_{β→0}(z), dash; Ψ_{β=1}(z), dotted; z = ln f, f ∈ [0.001, 5].
To help readers appreciate the role of Ψ in robust MLE, in Figure 2 we plot Ψ_{β→0}(z), Ψ_{β=1}(z) and Ψ*(z) with a = 0.1, where z = ln f signifies log-density with f ∈ [0.001, 5]. Note that Ψ_{β→0}(z) = ln f, corresponding to LCV, tends rapidly towards minus infinity as f declines. In contrast, Ψ_{β=1}(z) = f − 1, corresponding to LSCV, is linear in f and therefore robust against small densities. Lastly, Ψ*(z) coincides with Ψ_{β→0}(z) when f ≥ a but switches to f/a + ln a − 1 when f < a. Thus, like Ψ_{β=1}, Ψ* is linear in f for f < a and robust against small densities.
3.2 Discussions
Define

ψ*(z; a) = 1,           if z ≥ ln a,
           exp(z)/a,    if z < ln a.

It follows readily that the estimating equation associated with RLCV is given by

(1/n) ∑_{i=1}^n ψ*(ln f_i(h)) S(Xi; h) = ∫_X ψ*(ln f(x; h)) S(x; h) f(x; h) dx.   (10)

This estimating equation is asymptotically unbiased provided that f(·; h) is a consistent estimator of f.
Equation (10) can be rewritten as follows:

(1/n) ∑_{i=1}^n { I(f_i(h) ≥ a) + I(f_i(h) < a) f_i(h)/a } S(Xi; h)
   = ∫_X { I(f(x; h) ≥ a) + I(f(x; h) < a) f(x; h)/a } S(x; h) f(x; h) dx,

which aptly captures the main thrust of RLCV. When f_i ≥ a, RLCV executes LCV and ψ*(ln f_i) = 1; when f_i < a, RLCV executes LSCV and ψ*(ln f_i) = f_i/a < 1, which tends to zero with f_i. Given a small positive value for the threshold a, the majority of observations are assigned a unit weight while extreme observations, if any, are assigned smaller weights proportional to their densities. Thus RLCV effectively combines the efficiency of LCV and the robustness of LSCV.
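The implicit weights themselves amount to one line; an illustrative sketch:

```python
import numpy as np

def rlcv_weights(f_loo, a):
    """Implicit weights psi*(ln f_i(h)) in eq. (10): unit weight when
    f_i(h) >= a, and f_i(h)/a (vanishing with the density) otherwise."""
    f_loo = np.asarray(f_loo, dtype=float)
    return np.where(f_loo >= a, 1.0, f_loo / a)
```

For example, with a = 0.01, observations with leave-one-out densities 0.35 and 0.20 keep unit weight, while an extreme one with density 0.002 is down-weighted to 0.2.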
As discussed in the previous section, the Ψβ estimator can be interpreted as a minimum density power divergence estimator. It follows that CV-based bandwidth selectors can also be obtained as minimum density power divergence estimators. In particular, LSCV is obtained by minimizing (7) with β = 1 while LCV is obtained by minimizing (8), the Kullback-Leibler divergence. Define

Δ*(g, f) = ∫_{g(x)≥a} f(x) ln{f(x)/g(x)} dx + (1/2a) ∫_{g(x)<a} { g²(x) − 2g(x)f(x) + f²(x) } dx.   (11)

It is seen that RLCV can also be obtained as a minimum density power divergence estimator by minimizing the hybrid density power divergence (11).
There exists a large body of work on the theoretical properties of cross validation bandwidth selection; cf. Hall (1983), Stone (1984), Burman (1985) on LSCV and Hall (1982, 1987) and Chow et al. (1983) on LCV. Marron (1985) considered, for univariate densities, a modified LCV of the form

max_h (1/n) ∑_{i=1}^n ln f_i(h) I(s1 ≤ Xi ≤ s2) − ∫_{s1}^{s2} f(x; h) dx,   (12)

where f is assumed to be bounded away from zero on the interval [s1, s2]. Marron (1987) proposed a unifying framework to explore the asymptotic optimality of LSCV and the modified LCV (12). Let h be the optimizer of LSCV or modified LCV and define

Average Square Error:           D_A(f̂, f) = (1/n) ∑_{i=1}^n { f̂(Xi; h) − f(Xi) }² w²(Xi)
Integrated Square Error:        D_I(f̂, f) = ∫ { f̂(x; h) − f(x) }² w(x) dx
Mean Integrated Square Error:   D_M(f̂, f) = E[ ∫ { f̂(x; h) − f(x) }² w(x) dx ]

where w(x) is a nonnegative weight function. Marron (1987) established, under some mild regularity conditions, the asymptotic optimality of LSCV and modified LCV as follows:

lim_{n→∞} D(f̂(h), f) / inf_{h∈H_n} D(f̂(h), f) = 1 a.s.,   (13)

where D is any of D_A, D_I or D_M, and H_n is a finite set whose cardinality grows algebraically fast. The asymptotic optimality of LSCV is obtained under w(x) = 1, while setting w(x) = f^{−1}(x) I(s1 ≤ x ≤ s2) gives the desired result for modified LCV.
Below we show that the asymptotic optimality of RLCV can be established in a similar manner. Define a hybrid weight function

w*(x) = f^{−1}(x) I(f(x) ≥ a) + a^{−1} I(f(x) < a).   (14)

We then have the following.

Theorem. Under Assumptions A1-A3 given in the Appendix, if h* is given by (9), then

lim_{n→∞} D(f̂(h*), f) / inf_{h∈H_n} D(f̂(h), f) = 1 a.s.,

where D is any of D_A, D_I or D_M with the weight function w given by (14).

Unlike the modified LCV, RLCV does not require that f be bounded away from zero on a compact support, and the error criterion is minimized over the entire support of the underlying density.
4 Specification of Threshold a
RLCV reduces to LCV if we set a = 0. On the other hand, it is equivalent to LSCV under a sufficiently large a. Between these two extremes, the threshold a controls the trade-off between the efficiency and robustness inherent in RLCV. In this section, we present a simple method to select this threshold.

Our method is motivated by Hall's (1987) investigation into the interplay between the tail-heaviness of the kernel function and that of the underlying density in LCV. Hall (1987) focused on univariate densities and considered a scenario wherein f(x) ∼ c|x|^{−α} as |x| → ∞, c > 0 and α > 1. Suppose that the kernel function takes the form

K(x) = A2 exp(−A1 |x|^κ),   (15)

where A1, A2 and κ are positive constants such that K integrates to unity. Hall showed that if κ > α − 1, LCV selects a bandwidth diverging to infinity and becomes inconsistent. We employ in this study the Gaussian kernel, which is a member of (15) with κ = 2. Hall (1987) suggested that LCV with a Gaussian kernel is consistent when the underlying density is Gaussian or sub-Gaussian (the tails of a sub-Gaussian distribution decay at least as fast as those of a Gaussian distribution).
In practice, the tail-heaviness of the underlying density is usually unknown and often difficult to estimate, especially for multivariate random variables. We therefore opt for a somewhat conservative strategy that uses the Gaussian density as the benchmark. Denote by M_n the extreme observation of an I.I.D. random sample {Zi}ni=1 from the d-dimensional Gaussian distribution N(µ, Σ). We advocate the following simple rule: a = E[φ(M_n; µ, Σ)], where φ(·; µ, Σ) is the density function of N(µ, Σ). Under this rule, if the estimated kernel density of a given observation is smaller than the expected density of the sample extremum under Gaussianity, LCV is deemed vulnerable. When this occurs, RLCV replaces the log-density with its linear approximation, effectively weighting down its influence.

Assuming for simplicity that Σ is non-singular, we define the extremum of a Gaussian sample as follows:

M_n = {Zi : ‖Zi − µ‖_Σ ≥ ‖Zj − µ‖_Σ, 1 ≤ j ≤ n},   (16)

where ‖x‖_Σ = √(x′Σ^{−1}x). We show that the expected density of M_n can be approximated using the following result.

Proposition. Let {Zi}ni=1 be an I.I.D. sample from the d-dimensional Gaussian distribution N(µ, Σ) and let M_n be given by (16). Then as n → ∞,

φ(M_n; µ, Σ)/a_n →p 1,

where a_n = |Σ|^{−1/2} (2π)^{−d/2} Γ(d/2) (ln n)^{1−d/2} n^{−1}.
We have also considered 'data-driven' thresholds that depend on the sample kurtosis or an estimated tail index. Our experiments suggest little benefit from this additional complexity. Furthermore, the overall performance of RLCV appears stable over a fairly wide range of thresholds. Intuitively, the robustness of RLCV stems from replacing LCV with LSCV for small densities; so long as the threshold is bounded away from zero, the desired robustness obtains. At the same time, making the threshold sufficiently small retains the efficiency of LCV.
5 Monte Carlo Simulations
We use Monte Carlo simulations to assess the performance of RLCV and compare it with LCV and LSCV. There exist alternative bandwidth selectors (e.g., the Sheather-Jones plug-in bandwidth) that are known to be competitive. They are, however, generally not available for multivariate densities and are therefore not considered in this study. The experiment designs are summarized in Table 1. We include univariate, bivariate and trivariate densities with various degrees of skewness and kurtosis. We set n = 50 for univariate densities and n = 100 for multivariate densities. The Gaussian kernel is used in all estimations. Each experiment is repeated 1,000 times. Computation of LCV and RLCV is fast, faster than that of LSCV, especially in multivariate estimations.

Table 1: Summary of simulation designs

Density  d  n    Description
f11      1  50   standard Gaussian; skewness=0, kurtosis=3
f12      1  50   (1/2)N(−1, 1) + (1/2)N(1, 1) Gaussian mixture; skewness=0, kurtosis=2.5
f13      1  50   t5; skewness=0, kurtosis=9
f14      1  50   generalized lambda distribution*; skewness=0, kurtosis=126
f15      1  50   standard log-normal; skewness=6.2, kurtosis=113.9
f16      1  50   χ²(2); skewness=2, kurtosis=9
f21      2  100  bivariate Gaussian with ρ = 0.5
f22      2  100  t5 margins, Gaussian copula with ρ = 0.5
f23      2  100  t5 margins, t5-copula with ρ = 0.5
f24      2  100  t5 margins, Clayton copula with Kendall's τ = 0.5
f31      3  100  trivariate Gaussian with ρ = (0.5, 0.6, 0.7)
f32      3  100  t5 margins, Gaussian copula with ρ = (0.5, 0.6, 0.7)
f33      3  100  t5 margins, t5-copula with pairwise ρ = 0.5
f34      3  100  t5 margins, Clayton copula with pairwise Kendall's τ = 0.5

* defined by the quantile function F^{−1}(p) = (1 − p)^λ − p^λ, λ = −0.24, p ∈ (0, 1).
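One cell of such an experiment can be sketched end to end. The sketch below is ours, not the paper's code: it draws a t5 sample as in design f13, sets the d = 1 threshold a_n = √(ln n / 2)/(n σ̂), and maximizes the LCV and RLCV criteria over a common bandwidth grid:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_t(df=5, size=100)        # heavy-tailed t5 sample, as in f13

def phi(u, h):
    return np.exp(-0.5 * (u / h)**2) / (h * np.sqrt(2 * np.pi))

def loo(X, h):
    """Leave-one-out densities f_i(h)."""
    K = phi(X[:, None] - X[None, :], h)
    np.fill_diagonal(K, 0.0)
    return K.sum(axis=1) / (len(X) - 1)

def lcv_obj(X, h):
    # guard against underflow at very small bandwidths
    return np.mean(np.log(np.maximum(loo(X, h), 1e-300)))

def rlcv_obj(X, h, a):
    f = loo(X, h)
    ln_star = np.where(f >= a, np.log(np.maximum(f, a)), np.log(a) - 1 + f / a)
    grid = np.linspace(X.min() - 5 * h, X.max() + 5 * h, 2001)
    fg = phi(grid[:, None] - X[None, :], h).mean(axis=1)
    b = np.sum(np.where(fg >= a, fg, fg**2 / (2 * a))) * (grid[1] - grid[0])
    return ln_star.mean() - b

n = len(X)
a = np.sqrt(np.log(n) / 2) / (n * X.std())    # the a_n rule for d = 1
hs = np.linspace(0.05, 2.0, 79)
h_lcv = float(hs[int(np.argmax([lcv_obj(X, h) for h in hs]))])
h_rlcv = float(hs[int(np.argmax([rlcv_obj(X, h, a) for h in hs]))])
```

Repeating this over many replications and tabulating the means and standard deviations of `h_lcv` and `h_rlcv` reproduces the structure of the comparison below.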
The mean and standard deviation of the estimated bandwidths for each experiment are reported in Table 2. For bivariate and trivariate densities, we report the averaged means and standard deviations of the multiple bandwidths. Some comments are in order. (i) Under regular- or thin-tailed densities (f11, f12, f21 and f31), RLCV bandwidths are very close to LCV bandwidths. (ii) For the heavy-tailed densities (all others), RLCV bandwidths are consistently smaller than their LCV counterparts. This is expected, as LCV tends to select overly large bandwidths in the presence of heavy tails. (iii) RLCV shows smaller variability than LCV. (iv) Consistent with the findings of previous studies (cf. Park and Marron (1990), Jones et al. (1996) and Sain et al. (1994)), LSCV tends to undersmooth and has high variability. Interestingly, the only two cases in which LSCV produces the largest bandwidth among the three selectors are f11 and f12. These are regular- or thin-tailed densities, under which LCV and RLCV outperform LSCV.

Table 2: Averages and standard deviations of estimated bandwidths

                     LCV            RLCV           LSCV
Density  d  n    Mean   S.D.    Mean   S.D.    Mean   S.D.
f11      1  50   0.490  0.123   0.467  0.120   0.526  0.148
f12      1  50   0.444  0.115   0.447  0.124   0.541  0.156
f13      1  50   0.571  0.137   0.446  0.120   0.451  0.137
f14      1  50   0.578  0.142   0.402  0.114   0.364  0.133
f15      1  50   0.344  0.130   0.146  0.064   0.141  0.072
f16      1  50   0.365  0.143   0.189  0.072   0.194  0.089
f21      2  100  0.525  0.100   0.494  0.105   0.491  0.147
f22      2  100  0.600  0.117   0.476  0.105   0.419  0.130
f23      2  100  0.626  0.118   0.474  0.105   0.403  0.128
f24      2  100  0.551  0.118   0.421  0.099   0.379  0.118
f31      3  100  0.595  0.098   0.575  0.101   0.545  0.157
f32      3  100  0.659  0.115   0.549  0.111   0.469  0.147
f33      3  100  0.687  0.109   0.550  0.106   0.441  0.132
f34      3  100  0.593  0.114   0.483  0.104   0.423  0.133
We next examine the performance of the density estimates. Figure 3 reports the mean and error plots for the univariate density estimations. Goodness-of-fit measures in terms of the Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) are reported as black solid lines and red dashed lines respectively. We make the following observations. (i) KDE with a Gaussian kernel is known to be optimal for Gaussian densities. As expected, LCV outperforms LSCV under the Gaussian density f11, while RLCV is only slightly worse than LCV. A similar pattern is observed under the thin-tailed density f12. (ii) Density f13 is symmetric with kurtosis 9. Under this moderate kurtosis, LCV still outperforms LSCV while RLCV in turn beats LCV. (iii) Density f14 is symmetric with a substantial kurtosis of 126. Under this density, LCV is clearly dominated by the other two, manifesting the detrimental effect of tail-heaviness. At the same time, RLCV fares better than LSCV. (iv) Under densities f15 and f16, which are skewed and kurtotic, LCV again suffers from tail-heaviness while RLCV performs the best.
Figure 3: Mean and error plots of univariate estimation (Solid: RMSE; Dash: MAE)

Figure 4 reports the bivariate estimation results. Under density f21, which is bivariate Gaussian with correlation 0.5, LCV and RLCV perform similarly and are better than LSCV. The
next three densities are characterized by heavy-tailed margins and/or a heavy-tailed copula. As in the univariate case, LCV suffers from the tail-heaviness of the underlying density and is dominated by LSCV and RLCV. Between RLCV and LSCV, the former has lower average RMSE and MAE, and noticeably smaller variability. Figure 5 reports the estimation results for the trivariate densities. The overall pattern remains the same.

In summary, RLCV, unlike LCV, is resistant to tail-heaviness, while for thin- or regular-tailed distributions RLCV performs similarly to LCV, indicating little efficiency loss. Furthermore, in all experiments RLCV is comparable to and often outperforms LSCV, especially for multivariate densities. Given that in practice the tail-heaviness of the underlying density is often unknown, RLCV provides a robust alternative to LCV and a viable competitor to LSCV.
6 Empirical Examples
We illustrate the proposed bandwidth selector with two empirical examples. Our first dataset is the British income data studied by Wand et al. (1991). The data contain approximately 7,000 incomes, normalized by the sample mean, of British citizens in the year 1975. The skewness and kurtosis of the data are 1.84 and 13.69 respectively. The estimated bandwidths are 0.177 (LCV), 0.054 (RLCV) and 0.049 (LSCV). The corresponding densities are plotted in the left panel of Figure 6. In the presence of the substantial right tail of this income distribution, LCV selects a relatively large bandwidth and underestimates the sharp peak of the underlying density. On the other hand, RLCV and LSCV produce very similar estimates. We caution that the main purpose of this exercise is to demonstrate that the tail problem of LCV persists even in such a large sample; Wand et al. (1991) showed that a transformation-based kernel estimator produces a pleasantly smooth density that aptly captures the two modes of the density.
Figure 4: Mean and error plots of bivariate estimation (Solid: RMSE; Dash: MAE)
Figure 5: Mean and error plots of trivariate estimation (Solid: RMSE; Dash: MAE)
Figure 6: Left: estimated income densities; Right: estimated PM2.5 densities (blue: LCV; black: RLCV; red: LSCV)
20
LSCV estimate
PM2.5, Time=t−1
PM
2.5,
tim
e=t
1 1
1
1 1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1 1
1
1
1
1
1
1
1
1
1
1 1
1
1
1
1
1
1
1
2
2
2
2
3
0 100 200 300 400
010
020
030
040
0LCV estimate
PM2.5, Time=t−1
PM
2.5,
tim
e=t
0.02
0.04
0.04
0.06
0.08
0.1
0.12
0.14 0.16
0.18 0.2 0.22
0.24 0.26 0.28
0.3 0.32
0.34
0.36
0.38
0 100 200 300 400
010
020
030
040
0
RLCV estimate
PM2.5, Time=t−1
PM
2.5,
tim
e=t
0.1 0.1
0.1
0.2 0.3
0.4
0.5 0.6
0.7
0.8
0.9
1
0 100 200 300 400
010
020
030
040
0
Figure 7: Contours of estimated joint densities of PM2.5 from two consecutive days
We next examine air pollution data from Beijing. Accompanying China's rapid industrialization, there has been widespread environmental degradation, especially air pollution. An important indicator of air quality is the level of fine particulate matter (PM2.5), a main constituent of air pollution. In this study we use PM2.5 data released by the U.S. Embassy in Beijing. For in-depth analyses of these data, see a recent study by Liang et al. (2016) and the references therein. Since severe and sometimes prolonged smog in Beijing mainly occurs in winter, because of meteorological conditions and winter heating effects, we focus our analysis on the winter. To avoid hourly fluctuations in PM2.5 monitoring, we use the daily median level of PM2.5. In particular, our data consist of 92 observations of the daily median PM2.5 level for October, November and December of 2015. The skewness and kurtosis of the data are 2.14 and 9.29 respectively.

A PM2.5 level higher than 100 µg/m³ is customarily used as an indicator of hazardous air pollution. A histogram of the data is reported in the right panel of Figure 6. It is seen that the majority of the observations are below 100 µg/m³; at the same time, there is a significant second mode around 250 µg/m³, signifying severe air pollution. In the same plot, we report the estimated PM2.5 densities. The results suggest that LCV oversmooths the data and underestimates the major mode, while LSCV undersmooths the data and produces a rather rough density. RLCV produces a reasonably smooth density that adequately reflects the overall distribution of the data.
Prolonged severe near-surface air pollution is a major trigger of respiratory diseases. To explore the dynamics of pollutant accumulation, we proceed to estimate the joint density of PM2.5 levels on two consecutive days. The left panel of Figure 7 reports the contours of the estimated density produced by LSCV, with the daily PM2.5 level at time t − 1 on the horizontal axis and that at time t on the vertical axis. LSCV clearly undersmooths the data, and the resulting contour plot closely resembles a scatterplot of the data. The middle panel of Figure 7 plots the contours of the LCV estimate. The density appears to be rather flat, with the height of the estimated mode at 0.40. The RLCV estimate is reported in the right panel. A distinct mode, with an estimated height of 1.13, is produced. Some salient features of the data are readily discernible from the joint density estimated with the RLCV bandwidths: (i) The pronounced mode at a low pollution level indicates that a low-pollution day is likely to be followed by another one; this largely reflects the fact that the majority of the observations are below 100. (ii) The 'ridge' above the 45-degree line for pollution levels above 100 suggests that severe smog tends to result from pollutant accumulation over consecutive days. (iii) The two minor modes at levels around 300 at time t − 1 indicate a bimodal conditional distribution at this high pollution level. There is a simple explanation for this phenomenon. Heavy precipitation and/or strong wind can effectively dissipate PM2.5 in the atmosphere. A high level of PM2.5 is therefore likely either to persist or to decline abruptly, depending on whether significant precipitation and/or wind occurred, giving rise to a bimodal conditional distribution.
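The joint densities in this exercise take the product-kernel form of Eq. (1). A minimal sketch of such an estimator follows; the data pairs and bandwidths below are purely illustrative, not the RLCV selections used in the paper.

```python
import numpy as np

def product_kde(x, data, h):
    """Product Gaussian KDE of Eq. (1): f(x; h) = n^{-1} sum_i prod_s K_{h_s}(x_s - X_{i,s})."""
    x, data, h = (np.asarray(a, dtype=float) for a in (x, data, h))
    u = (x - data) / h                                      # standardized differences, shape (n, d)
    k = np.exp(-0.5 * u ** 2) / (np.sqrt(2.0 * np.pi) * h)  # K_{h_s}(.) = K(./h_s)/h_s, per coordinate
    return float(np.mean(np.prod(k, axis=1)))

# Hypothetical consecutive-day PM2.5 pairs (X_{t-1}, X_t) with illustrative bandwidths.
pairs = [[40.0, 55.0], [55.0, 48.0], [48.0, 120.0], [120.0, 150.0], [30.0, 35.0]]
density = product_kde([50.0, 50.0], pairs, h=[15.0, 15.0])
```

Evaluating `product_kde` over a grid of (t − 1, t) pairs produces the contour surfaces compared across LSCV, LCV, and RLCV above; only the bandwidth vector differs across the three methods.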
References
Basu, A., Harris, I. R., Hjort, N. L., and Jones, M. C. (1998), “Robust and efficient estimation
by minimising a density power divergence,” Biometrika, 85, 549–559.
Basu, A., Shioya, H., and Park, C. (2011), Statistical Inference: The Minimum Distance Approach, Chapman & Hall/CRC.
Beran, R. (1977), “Minimum Hellinger Distance Estimates for Parametric Models,” The
Annals of Statistics, 5, 445–463.
Bowman, A. (1984), “An alternative method of cross-validation for the smoothing of density
estimates,” Biometrika, 71, 353–360.
Burman, P. (1985), “A data dependent approach to density estimation,” Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 69, 609–628.
Chow, Y., Geman, S., and Wu, L. (1983), “Consistent cross-validated density estimation,”
The Annals of Statistics, 11, 25–38.
Eguchi, S. and Kano, Y. (2001), “Robustifying maximum likelihood estimation by psi-divergence,” ISM Research Memorandum, 802.
Embrechts, P., Klüppelberg, C., and Mikosch, T. (2012), Modelling Extremal Events for Insurance and Finance, Springer.
Habbema, J., Hermans, J., and van den Broek, K. (1974), “A stepwise discrimination analysis program using density estimation,” in Compstat 1974: Proceedings in Computational Statistics, ed. Bruckman, G., Physica Verlag, pp. 101–110.
Hall, P. (1982), “Cross validation in density estimation,” Biometrika, 69, 383–390.
— (1983), “Large sample optimality of least squares cross-validation in density estimation,”
The Annals of Statistics, 11, 1156–1174.
— (1987), “On Kullback-Leibler loss and density estimation,” The Annals of Statistics, 15, 1491–1519.
Huber, P. (1964), “Robust Estimation of a Location Parameter,” The Annals of Mathemat-
ical Statistics, 35, 73–101.
Jones, M., Marron, J., and Sheather, S. (1996), “A Brief Survey of Bandwidth Selection for
Density Estimation,” Journal of the American Statistical Association, 91, 401–407.
Liang, X., Li, S., Zhang, S., Huang, H., and Chen, S. (2016), “PM2.5 Data Reliability, Consistency and Air Quality Assessment in Five Chinese Cities,” Journal of Geophysical Research: Atmospheres, 121, 10220–10236.
Loader, C. (1999), “Bandwidth Selection: Classical or Plug-in?” The Annals of Statistics,
27, 415–438.
Marron, J. S. (1985), “An asymptotically efficient solution to the bandwidth problem of kernel density estimation,” The Annals of Statistics, 13, 1011–1023.
— (1987), “A comparison of cross-validation techniques in density estimation,” The Annals of Statistics, 15, 152–162.
Park, B. and Marron, J. (1990), “Comparison of Data-driven Bandwidth Selectors,” Journal
of the American Statistical Association, 85, 66–72.
Rudemo, M. (1982), “Empirical choice of histogram and kernel density estimators,” Scandi-
navian Journal of Statistics, 9, 65–78.
Sain, S., Baggerly, K., and Scott, D. (1994), “Cross-Validation of Multivariate Densities,”
Journal of the American Statistical Association, 89, 807–817.
Schuster, E. and Gregory, G. (1981), “On the nonconsistency of maximum likelihood non-
parametric density estimators,” in Computer Science and Statistics: Proceedings of the
13th Symposium on the Interface, ed. Eddy, W., Springer, pp. 295–298.
Scott, D. (1992), Multivariate Density Estimation: Theory, Practice, and Visualization,
Wiley.
Sheather, S. and Jones, M. (1991), “A Reliable Data-Based Bandwidth Selection Method for Kernel Density Estimation,” Journal of the Royal Statistical Society, Series B, 53, 683–690.
Stone, C. (1984), “An asymptotically optimal window selection rule for kernel density estimates,” The Annals of Statistics, 12, 1285–1297.
Wand, M. and Jones, M. (1995), Kernel Smoothing, Chapman and Hall.
Wand, M., Marron, J., and Ruppert, D. (1991), “Transformation in Density Estimation,”
Journal of the American Statistical Association, 86, 343–353.
Appendix
The following assumptions are needed to establish the asymptotic optimality of RLCV. There exist constants $c_1$, $c_2$ and $\delta$ such that:

(A1) The cardinality $\#(\mathcal{H}_n) \le n^{c_1}$ and $c_1^{-1} n^{\delta-1} \le h_s \le c_1 n^{-\delta}$, $s = 1, \ldots, d$.
(A2) For $h \in \mathcal{H}_n$,
$$\sup_x \Big| \int K_h(x, y) f(y)\,dy - f(x) \Big| \le c_1 n^{-\delta},$$
$$\lim_{n \to \infty} \sup_{h} \bigg| \frac{\int \operatorname{var}[K_h(x, X_i)]\, w^\star(x)\,dx}{c_2 (h_1 \cdots h_d)^{-1}} - 1 \bigg| = 0.$$
(A3) (i) $\sup_x |K_h(x, x)| \le c_1 (h_1 \cdots h_d)^{-1}$; (ii) for $h \in \mathcal{H}_n$ and $f(x) \ge a$, $\sup_{i,h,x} |f_i(x; h) - f(x)| \to 0$ a.s., where $f_i(x; h) = (n-1)^{-1} \sum_{j \ne i} K_h(X_j - x)$.
These assumptions are a subset of those used in Marron (1987) to establish the asymptotic optimality of a general family of delta sequence estimators. Assumption A1 is commonly used in kernel density estimation; for instance, it is also used by Hall (1987) in his investigation of LCV. Assumption A2 is a high-level assumption; it is satisfied, e.g., when the kernel function K and the density f are Hölder continuous. Assumption A3 is needed for the “LCV part” of the RLCV objective function. Note that condition A3(ii) is not required when f(x) ≤ a. One key difference from the requirements in Marron (1987) is that we do not impose the condition that f is bounded away from zero.
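As a concrete reading of condition A3(ii), the leave-one-out estimator $f_i(x;h)$ can be sketched as follows; the Gaussian kernel and the function name are our illustrative choices.

```python
import numpy as np

def loo_kde(i, x, data, h):
    """Leave-one-out estimator f_i(x; h) = (n-1)^{-1} sum_{j != i} K_h(X_j - x)."""
    data, x, h = np.asarray(data, float), np.asarray(x, float), np.asarray(h, float)
    rest = np.delete(data, i, axis=0)                       # drop the i-th observation
    u = (rest - x) / h                                      # shape (n - 1, d)
    k = np.exp(-0.5 * u ** 2) / (np.sqrt(2.0 * np.pi) * h)  # product Gaussian kernel, per coordinate
    return float(np.mean(np.prod(k, axis=1)))
```

This is the quantity averaged in likelihood-based cross-validation criteria: each observation is scored by a density estimate that excludes it.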
Proof of Theorem. Let $u(x) = w(x)f(x)$. Marron (1987) established the asymptotic optimality of LSCV by setting $w(x) = 1$, so that $u(x) = f(x)$, and that of LCV by setting $w(x) = f(x)^{-1}$ on a set where $f$ is bounded away from zero. The asymptotic optimality of RLCV can be established in a similar manner. In particular, define
$$u^\star(x) = I(f(x) \ge a) + \frac{f(x)}{a}\, I(f(x) < a)$$
such that $u^\star(x) = w^\star(x) f(x)$ for all $x \in \mathcal{X}$. Replacing the weight function $w$ used in Theorems 1 and 2 of Marron (1987) with the hybrid weight function $w^\star$ yields the desired result.
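The hybrid weight in the proof translates directly into code; the function names are ours.

```python
def u_star(f_x, a):
    """u*(x) = I(f(x) >= a) + (f(x)/a) I(f(x) < a), evaluated at f_x = f(x)."""
    return 1.0 if f_x >= a else f_x / a

def w_star(f_x, a):
    """w*(x) = u*(x)/f(x): the LCV-type weight 1/f(x) when f(x) >= a, the constant 1/a otherwise."""
    return 1.0 / f_x if f_x >= a else 1.0 / a
```

For $f(x) \ge a$ this recovers the LCV-type weight $1/f(x)$; for $f(x) < a$ it is the constant $1/a$, which is an LSCV-type weight up to scale. This is the smooth transition between the two criteria at the threshold $a$.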
Proof of Proposition. For simplicity, consider first a $d$-dimensional standard Gaussian random vector $X$ with independent components. Let $z = \|x\|^2$, where $\|\cdot\|$ is the Euclidean norm. The density of $X$ is given by
$$f(x) = (2\pi)^{-d/2} \exp(-z/2).$$
It follows that
$$\ln f(x) = -\frac{d}{2} \ln(2\pi) - z/2,$$
where $z = \|X\|^2$ is a $\chi^2_d$ random variable when evaluated at $X$.
Given an I.I.D. sample of standard multivariate Gaussian random vectors $\{X_i\}_{i=1}^n$ with $X_i = (X_{i,1}, \ldots, X_{i,d})$, define the extreme observation as
$$X_{(n)} = \{X_i : \|X_i\|^2 \ge \|X_j\|^2,\ 1 \le i, j \le n\}.$$
The log-density of $X_{(n)}$ is given by
$$-\frac{d}{2}\ln(2\pi) - Z_{(n)}/2,$$
where $Z_{(n)} = \|X_{(n)}\|^2$ is the sample maximum of $n$ I.I.D. $\chi^2_d$ random variables.
Next, for a random variable $Z \sim \chi^2_d$ and a constant $c > 0$, $cZ \sim \Gamma(k = d/2, \theta = 2c)$, where $k$ is the shape parameter and $\theta$ the scale parameter of the Gamma distribution. Setting $c = 1/2$, we have $Z/2 \sim \Gamma(k = d/2, \theta = 1)$. Using Table 3.4.4 (page 156) of Embrechts et al. (2012), we obtain
$$Z_{(n)}/2 - \delta_n \stackrel{d}{\to} \Lambda,$$
where $P(\Lambda < x) = \exp(-\exp(-x))$ and $\delta_n = \ln n + (d/2 - 1)\ln\ln n - \ln\Gamma(d/2)$. Thus $f(X_{(n)}) \approx (2\pi)^{-d/2}\exp(-\delta_n)$. It follows readily that for the general case $X \sim N(\mu, \Sigma)$ with extreme observation $M_n$ given by (16), we have
$$f(M_n) = \phi(M_n; \mu, \Sigma) \approx \frac{\exp(-\delta_n)}{(2\pi)^{d/2}|\Sigma|^{1/2}} = |\Sigma|^{-1/2}(2\pi)^{-d/2}\,\Gamma(d/2)(\ln n)^{1-d/2}\, n^{-1}.$$
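The final equality follows by exponentiating $-\delta_n$ term by term: $\exp(-\delta_n) = n^{-1}(\ln n)^{1-d/2}\,\Gamma(d/2)$. A quick numerical check of this identity (the choices of $n$ and $d$ are arbitrary):

```python
import math

def delta_n(n, d):
    """delta_n = ln n + (d/2 - 1) ln ln n - ln Gamma(d/2)."""
    return math.log(n) + (d / 2.0 - 1.0) * math.log(math.log(n)) - math.lgamma(d / 2.0)

def closed_form(n, d):
    """Gamma(d/2) (ln n)^{1 - d/2} n^{-1}."""
    return math.gamma(d / 2.0) * math.log(n) ** (1.0 - d / 2.0) / n

# exp(-delta_n) agrees with the closed form for any n >= 2, d >= 1.
for n, d in [(10 ** 3, 2), (10 ** 5, 3), (10 ** 4, 6)]:
    lhs, rhs = math.exp(-delta_n(n, d)), closed_form(n, d)
```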