Bayesian Adaptive Bandwidth Kernel Density Estimation of...

27
Bayesian Adaptive Bandwidth Kernel Density Estimation of Irregular Multivariate Distributions Shuowen Hu, D. S. Poskitt, Xibin Zhang * Department of Econometrics and Business Statistics, Monash University, Australia Abstract In this paper, we propose a new methodology for multivariate kernel density estimation in which data are categorized into low- and high-density regions as an underlying mechanism for assigning adaptive bandwidths. We derive the posterior density of the bandwidth parameters via the Kullback-Leibler divergence criterion and use a Markov chain Monte Carlo (MCMC) sampling algorithm to estimate the adaptive bandwidths. The resulting estimator is referred to as the tail-adaptive density estimator. Monte Carlo simulation results show that the tail-adaptive density estimator outperforms the global- bandwidth density estimators implemented using different global bandwidth selection rules. The inferential potential of the tail-adaptive density estimator is demonstrated by employing the estimator to estimate the bivariate density of daily index returns observed from the USA and Australian stock markets. Keywords: marginal likelihood, Markov chain Monte Carlo, S&P500 index, value-at-risk * Corresponding author. Email: [email protected]. Address: 900 Dandenong Road, Caulfield East, Victoria 3145, Australia. Telephone: +61 3 99032130. Fax: +61 3 99032007. URL: http://users.monash.edu.au/~xzhang/ Preprint submitted to Computational Statistics and Data Analysis July 25, 2011

Transcript of Bayesian Adaptive Bandwidth Kernel Density Estimation of...

Page 1: Bayesian Adaptive Bandwidth Kernel Density Estimation of ...users.monash.edu.au/~xzhang/CSDA.2011.pdf · Kulasekera and Padgett, 2006; de Lima and Atuncar, 2010). In particular, Zhang

Bayesian Adaptive Bandwidth Kernel Density

Estimation of Irregular Multivariate Distributions

Shuowen Hu, D. S. Poskitt, Xibin Zhang∗

Department of Econometrics and Business Statistics, Monash University, Australia

Abstract

In this paper, we propose a new methodology for multivariate kernel density

estimation in which data are categorized into low- and high-density regions

as an underlying mechanism for assigning adaptive bandwidths. We derive

the posterior density of the bandwidth parameters via the Kullback-Leibler

divergence criterion and use a Markov chain Monte Carlo (MCMC) sampling

algorithm to estimate the adaptive bandwidths. The resulting estimator is

referred to as the tail-adaptive density estimator. Monte Carlo simulation

results show that the tail-adaptive density estimator outperforms the global-

bandwidth density estimators implemented using different global bandwidth

selection rules. The inferential potential of the tail-adaptive density estimator

is demonstrated by employing the estimator to estimate the bivariate density

of daily index returns observed from the USA and Australian stock markets.

Keywords: marginal likelihood, Markov chain Monte Carlo, S&P500 index,

value-at-risk

∗Corresponding author. Email: [email protected]. Address: 900 DandenongRoad, Caulfield East, Victoria 3145, Australia. Telephone: +61 3 99032130.Fax: +61 3 99032007. URL: http://users.monash.edu.au/~xzhang/

Preprint submitted to Computational Statistics and Data Analysis July 25, 2011

Page 2: Bayesian Adaptive Bandwidth Kernel Density Estimation of ...users.monash.edu.au/~xzhang/CSDA.2011.pdf · Kulasekera and Padgett, 2006; de Lima and Atuncar, 2010). In particular, Zhang

1. Introduction

Kernel density estimation is one of the most important techniques for

understanding the distributional properties of data. It is understood that the

effectiveness of such approach depends on the choice of a kernel function and

the choice of a smoothing parameter or bandwidth (see for example, Izenman,

1991, for a discussion). Although the two issues cannot be treated separately,

it is widely accepted that the performance of a kernel density estimator is

mainly determined by the bandwidth (see for example, Scott, 1992; Wand

and Jones, 1995), while the impact of kernel choices on the performance of the

resulting density estimator was examined by Marron and Nolan (1988),Vieu

(1999) and Horova et al. (2002). Most investigations have aimed at choosing

a fixed or global bandwidth for a full sample of data (see Jones et al., 1996,

for a survey). Terrell and Scott (1992) and Sain and Scott (1996) proposed

the idea of data-driven adaptive bandwidth density estimation, which allows

the bandwidth to vary at different data points. In this situation, bandwidth

selection remains as an important issue and has been extensively investigated

for univariate data. However, less attention has been paid to investigations

of data-driven methods for estimating adaptive bandwidths in multivariate

data. This motivates the investigation of this paper.

Our investigation is also empirically motivated. Most financial analysts

believe that during the course of the global financial crisis caused by the fall-

out of the USA sub-prime mortgage crisis, the USA stock market has had a

leading effect on most other stock markets world wide. Using a kernel den-

sity estimator of bivariate stock-index returns we can derive the conditional

distribution of the stock-index return in one market for a given value of the

2

Page 3: Bayesian Adaptive Bandwidth Kernel Density Estimation of ...users.monash.edu.au/~xzhang/CSDA.2011.pdf · Kulasekera and Padgett, 2006; de Lima and Atuncar, 2010). In particular, Zhang

stock-index return in the USA market, and therefore we can better under-

stand how the former market was associated with the USA market. However,

the marginal density of stock-index returns often exhibits leptokurtosis. Con-

sequently, the kernel estimation of the bivariate density of stock-index returns

may require different bandwidths to different groups of observed returns.

1.1. Some Background

Let X = (X1, X2, · · · , Xd)′ denote a d-dimensional random vector with

its density function f(x) defined on Rd (see Jacome et al., 2008, for an

example of censored data). Let {x1,x2, · · · ,xn} be a random sample drawn

from f(x). The kernel density estimator of f(x) is (Wand and Jones, 1995)

fH(x) =1

n

n∑i=1

KH(x− xi) =1

n|H|1/2n∑i=1

K(H−1/2(x− xi)), (1)

where K(·) is a multivariate kernel, and H is a symmetric and positive defi-

nite d× d matrix known as the bandwidth matrix. To choose an optimal H,

several methods have been discussed in literature (see for example, Devroye

and Gyorfi, 1985; Marron, 1987). Marron (1992) proposed a bootstrapping

approach to the approximation of the mean integrated squared error (MISE),

and the normal reference rule (NRR) or equivalently the rule-of-thumb that

minimizes the asymptotic MISE was discussed by Scott (1992). The least

squares cross-validation was discussed by Sain et al. (1994) and Duong and

Hazelton (2005), and a plug-in method was suggested by Wand and Jones

(1994) and improved by Duong and Hazelton (2003). Moreover, Zhang et al.

(2006) proposed a Bayesian approach to the estimation of the bandwidth

matrix based on Kullback-Leibler information criterion.

3

Page 4: Bayesian Adaptive Bandwidth Kernel Density Estimation of ...users.monash.edu.au/~xzhang/CSDA.2011.pdf · Kulasekera and Padgett, 2006; de Lima and Atuncar, 2010). In particular, Zhang

However, a problem in using a global bandwidth is that the kernel meth-

ods often produce unsatisfactory results for complex or irregular densities.

Sain and Scott (1996) presented a classic example, in which a bimodal mix-

ture with two modes of equal height but different levels of variation, where

an optimal global bandwidth tends to under-smooth the mode with a large

variation and over-smooth the mode with a small variation. Therefore, it is

necessary to let the bandwidth be adaptive to local observations; a relatively

small bandwidth is needed where observations are densely distributed, and a

large bandwidth is required where observations are sparsely distributed (see

Jones, 1990; Wand and Jones, 1995; Sain, 2002, for similar arguments).

Several versions of the adaptive bandwidth kernel density estimator have

been studied in literature. Mielniczuk et al. (1989) proposed to use a weight-

ing function on data point x in a global bandwidth estimator. A popular

method is to make the bandwidth as a function of data points, and Nolan and

Marron (1989) discussed this issue from a general Delta-sequence estimator

perspective. Loftsgaarden and Quesenberry (1965) suggested to replace H in

(1) is by H(x), which is the bandwidth at data point x. This estimator was

studied by Cao (2001) and Sain and Scott (2002) and is also called the bal-

loon estimator. However, the balloon estimator does not integrate to one and

therefore is not a good choice for density estimation (Terrell and Scott, 1992;

Izenman, 1991). The other estimator proposed by Breiman et al. (1977) is

called the sample-point estimator, which employs H(xi) as the bandwidth

associated with sample point xi. Abramson (1982a,b) suggested to choose

bandwidth as the inverse square root of f(xi). As the sample-point estimator

always integrate to one, we consider this estimator throughout this paper.

4

Page 5: Bayesian Adaptive Bandwidth Kernel Density Estimation of ...users.monash.edu.au/~xzhang/CSDA.2011.pdf · Kulasekera and Padgett, 2006; de Lima and Atuncar, 2010). In particular, Zhang

The difficulty of using sample-point estimator is that the number of band-

widths exceeds the sample size in multivariate data, and this makes diffi-

culties in estimating bandwidths. Sain and Scott (1996) and Sain (2002)

suggested grouping data into different bins and using a constant bandwidth

for each bin. Although this method reduces the number of bandwidths, the

number of bandwidths still grows exponentially with the dimension.

1.2. The Tail-Adaptive Density Estimator

A simple method to control the number of bandwidths to be estimated

for the sample-point kernel density estimator is to put the data into a small

number of groups. In this paper, we propose dividing the observations into

two regions, namely the low-density region (LDR) and high-density region

(HDR), and assigning two different bandwidth matrices to these two regions.

When the true density is unimodal, the low-density region is the tail area

and should be assigned larger bandwidths than the high-density region. We

call this type of kernel density estimator the tail-adaptive density estimator.

It is not new to group the observations into the low- and high-density

regions. Hartigan (1975, 1987) proposed clustering data into different regions

with different density values, and Hyndman (1996) presented an algorithm for

computing and graphing data in different density regions. A comprehensive

review of applications related to the issue of low-and high-density regions

was given in Mason and Polonik (2009). In terms of bandwidth selection,

Samworth and Wand (2010) considered a bandwidth selection method for

univariate high-density region estimation. To our knowledge, there is yet any

other investigation on bandwidth selection for a general multivariate kernel

density estimation, where two different bandwidth matrices are assigned to

5

Page 6: Bayesian Adaptive Bandwidth Kernel Density Estimation of ...users.monash.edu.au/~xzhang/CSDA.2011.pdf · Kulasekera and Padgett, 2006; de Lima and Atuncar, 2010). In particular, Zhang

observations in the low- and high-density regions.

In this paper, we treat the elements of the two bandwidth matrices

as parameters, whose posterior can be approximately derived through the

Kullback-Leibler information. Therefore, bandwidths can be estimated through

a posterior simulator. During the past decade, there have been several inves-

tigations on Bayesian approaches to bandwidth estimation for kernel density

estimation (see for example, Brewer, 2000; Gangopadhyay and Cheung, 2002;

Kulasekera and Padgett, 2006; de Lima and Atuncar, 2010). In particular,

Zhang et al. (2006) derived the posterior of bandwidths for multivariate ker-

nel density estimation with a global bandwidth matrix. Their Monte Carlo

simulation results reveal the advantage of this Bayesian approach over its

competitors including the plug-in method of Duong and Hazelton (2003)

and the NRR. However, Hall (1987) showed that the use of a global band-

width for a long-tailed distribution can mislead Kullback-Leibler informa-

tion. Therefore, in this paper, we extent the sampling algorithm proposed

by Zhang et al. (2006) by incorporating tail-adaptive bandwidth matrices

into the multivariate kernel density estimation.

The rest of this paper is organized as follows. In Section 2, we derive the

posterior of the elements of the tail-adaptive bandwidth matrices describe an

MCMC sampling algorithm. Sections 3 and 4 present the results of Monte

Carlo simulation studies designed to examine the performance of the tail-

adaptive density estimator. In Section 5, we apply the tail-adaptive density

estimator to the estimation of the bivariate density of two asset returns.

Section 6 concludes the paper.

6

Page 7: Bayesian Adaptive Bandwidth Kernel Density Estimation of ...users.monash.edu.au/~xzhang/CSDA.2011.pdf · Kulasekera and Padgett, 2006; de Lima and Atuncar, 2010). In particular, Zhang

2. Bayesian estimation of bandwidths

2.1. Likelihood cross-validation

The Kullback-Leibler information, which measures the discrepancy be-

tween a density estimator and its true density, is defined as

dKL

(f(x), fH(x)

)=

∫Rd

log{f(x)}f(x)dx−∫Rd

log{fH(x)}f(x)dx. (2)

An optimal bandwidth could be derived by minimizing (2), and this is equiv-

alent to the maximization of∫Rd log{fH(x)}f(x)dx with respect to H. As

this integral is the expectation of log{fH(x)} under f(x), it is approximated

by n−1∑n

i=1 log fH,i(xi), where

fH,i(xi) =1

n− 1

n∑j=1j 6=i

|H|−1/2K(H−1/2(xi − xj)

), (3)

known as the leave-one-out estimator of f(xi) (Hardle, 1991). The likelihood

cross-validation method is to maximize∑n

i=1 log fH,i(xi), which is regarded

as the likelihood of {x1,x2, · · · ,xn} for given H, with respect to H.

The bandwidth matrix can be either a full or diagonal matrix. A full

bandwidth matrix can reveal useful features of the resulting density estima-

tor, while the implementation of some computing algorithms for bandwidth

estimation is often very difficult especially when adaptive bandwidth ma-

trices are used. The numerical result obtained by Sain (2002) showed that

the density estimator with a full bandwidth matrix is not smooth in low-

density regions. In this paper, we use a diagonal bandwidth matrix and let

h = (h1, h2, . . . , hd)′ denote the vector of the square roots of the diagonal

elements of the bandwidth matrix. h is also known as the bandwidth vector.

7

Page 8: Bayesian Adaptive Bandwidth Kernel Density Estimation of ...users.monash.edu.au/~xzhang/CSDA.2011.pdf · Kulasekera and Padgett, 2006; de Lima and Atuncar, 2010). In particular, Zhang

2.2. Tail-adaptive kernel density estimator

The concept of grouping observations into low- and high-density regions

has been discussed in literature (see for example, Hartigan, 1975, p205). We

propose to group some observations into the low-density region, inside which

each observation has a density no more than the density of each observation

outside the region. Actually, our definition of the low-density region is in the

same sense of Hyndman (1996) whose defined the highest density region.

Let α be a threshold value that determines the proportion of the low-

density region relative to the sample space. Let L(fα) denote a subset of the

sample space, so that the (100× α)% low-density region is shown as

L(fα) = {x : f(x) ≤ fα},

where fα is the largest constant such that Pr{x ∈ L(fα)} ≤ α.

Let

Ij =

1 if xj ∈ L(fα)

0 otherwise,

for j = 1, 2, · · · , n. Let h(1) denote the bandwidth vector assigned to the

observations in L(fα) and h(0) the bandwidth vector assigned to those ob-

servations outside L(fα). The kernel density estimator is

fh(1),h(0)(x) =1

n

n∑j=1

{IjK

((x− xj)./h

(1))./h(1)

+ (1− Ij)K((x− xj)./h

(0))./h(0)

}, (4)

where ./ denotes division by elements. The leave-one-out estimator is denoted

as fh(1),h(0),i(xi) for i = 1, 2, · · · , n. As the low-density region becomes the tail

area when the underlying density is unimodal, we call (4) the tail-adaptive

8

Page 9: Bayesian Adaptive Bandwidth Kernel Density Estimation of ...users.monash.edu.au/~xzhang/CSDA.2011.pdf · Kulasekera and Padgett, 2006; de Lima and Atuncar, 2010). In particular, Zhang

estimator for simplicity. The value of α can be chosen as either 5% or 10%.

As f(x) is unknown, the initial value of fα can be approximated through the

kernel density estimator of f(x) using a global bandwidth.

2.3. Posterior of bandwidth parameters

Given h(1) and h(0), the approximate likelihood is

`(x1,x2, . . . ,xn|h(1),h(0)) =n∏i=1

fh(1),h(0),i(xi). (5)

As suggested by Zhang et al. (2006), we assume that the prior of each band-

width to be the truncated Cauchy density

p(h(l)k ) =

2

1 + h(l)k × h

(l)k

, for h(l)k > 0,

for k = 1, 2, · · · , d, and l = 0 and 1. The posterior of h(1) and h(0) for given

{x1,x2, . . . ,xn} is

π(h(1),h(0)|x1,x2, · · · ,xn) ∝

{n∏i=1

fh(1),h(0),i(xi)

{d∏

k=1

p(h(1)k )× p(h(0)

k )

}.

(6)

The posterior given by (6) is of non-standard form, and we cannot derive

an analytical expression as the estimate of {h(1),h(0)}. However, we can use

the random-walk Metropolis algorithm to sample {h(1),h(0)} from (6). The

sampling procedure is as follows.

1) Obtain initial low-density regions for a given α based on kernel density

estimator with a global bandwidth vector chosen via NRR.

2) Assign initial values to h(1) and h(0), which are respectively, the band-

width vectors given to observations within the low- and high-density

regions specified in Step 1).

9

Page 10: Bayesian Adaptive Bandwidth Kernel Density Estimation of ...users.monash.edu.au/~xzhang/CSDA.2011.pdf · Kulasekera and Padgett, 2006; de Lima and Atuncar, 2010). In particular, Zhang

3) Let h denote the vector of all elements of h(1) and h(0). Use the random-

walk Metropolis algorithm to update h with the acceptance probability

computed through the posterior given by (6).

4) Derive the low-density region according the density estimator with the

bandwidth vectors updated in Step 3).

5) Repeat Steps 3) and 4) until the simulated chain of h achieves reason-

able mixing performance.

During the above iteration process, we discard the draws during the burn-

in period, and record the draws of h thereafter. Let {h(1), h(2), · · · , h(M)}

denote the recorded draws. The posterior mean (or ergodic average) denoted

as∑M

i=1 h(i)/M , is an estimate of h.

3. A Monte Carlo simulation study

To investigate the performance of the proposed tail-adaptive kernel den-

sity estimator, we approximate Kullback-Leibler information via Monte Carlo

simulation. A large number of random vectors are drawn from f(x) so as to

approximate (2) numerically. A bandwidth estimation method is better than

its competitor if the former produces a smaller Kullback-Leibler information

than the former.

3.1. True densities

We simulate samples from six target bivariate densities labeled A, B, C,

D, E and F1. These densities are of irregular shapes. Density A is a mixture of

1Refer to Hu et al. (2010) for univariate target densities, their contour plots and the

simulation results.

10

Page 11: Bayesian Adaptive Bandwidth Kernel Density Estimation of ...users.monash.edu.au/~xzhang/CSDA.2011.pdf · Kulasekera and Padgett, 2006; de Lima and Atuncar, 2010). In particular, Zhang

two equally weighted normal densities with bimodality: fA (x|µ1,Σ1, µ2,Σ2) =

12φ (x|µ1,Σ1) + 1

2φ (x|µ2,Σ2) , where φ(x|µ,Σ) is a multivariate normal den-

sity with mean µ and variance-covariance matrix

Σ =

1 ρ

ρ 1

. (7)

We used µ1 = (−1.5,−1.5)′, µ2 = (2, 2)′, ρ = 0.3 for Σ1 and ρ = −0.9 for Σ2.

Density B is a mixture of two normal densities with different weights

but an equal height at the modes: fB (x|µ1,Σ1, µ2,Σ2) = 34φ (x|µ1,Σ1) +

14φ (x|µ2,Σ2) , where µ1 = (−1.5,−1.5)′ and µ2 = (1.5, 1.5)′. Σ is defined in

(7) with ρ = 0.5 in Σ1, and Σ2 = 13Σ1.

Density C is a mixture of two skew-normal densities proposed by Azzalini

and Valle (1996): fC (x|µ1, γ1, µ2, γ2,Σ) = 12× 2φ(x|µ1,Σ)Φ (γ′1(x− µ1)) +

12× 2φ(x|µ2,Σ)Φ (γ′2(x− µ2)), where Φ(·) is the cumulative density function

(CDF) of the standard normal distribution. We chose µ1 = (−0.5,−0.5)′,

γ1 = (−9,−9)′, µ2 = (0, 0)′, γ2 = (9, 9) and Σ is defined in (7) with ρ = 0.3.

Note that γ1, γ2 ∈ R2 are the shape parameters determining the skewness.

Density D is a Student t density denoted as td (x|µ,Σ, ν),

Density E is a mixture of two Student t densities with ν = 5:

fE (x|µ1, µ2,Σ, ν) = 0.5 td (x|µ1,Σ1, ν) + 0.5 td (x|µ2,Σ2, ν) ,

where µ1 = (−2, 0)′, µ2 = (2, 0)′, and Σ1 and Σ2 are defined in (7) with

ρ = −0.5 and 0.5, respectively.

Density F is a skew-t density proposed by Azzalini and Capitanio (2003):

fF (x|µ,Σ, α, ν) = 2 td(x|µ,Σ, ν)Td(x|ν + d),

11

Page 12: Bayesian Adaptive Bandwidth Kernel Density Estimation of ...users.monash.edu.au/~xzhang/CSDA.2011.pdf · Kulasekera and Padgett, 2006; de Lima and Atuncar, 2010). In particular, Zhang

where x = γ′ω−1(x − µ)(

ν+d(x−µ)′Σ−1(x−µ)+ν

)1/2

, µ = (0, 0)′, γ = (−2, 0)′, Σ

is an identity matrix. ω is a diagonal matrix with diagonal elements the

same as those of Σ, and Td(·|ν + d) is the CDF of the Student t distribution

with ν + d degrees of freedom. The contour plot of each of the six bivariate

densities can be found in Figure 2 of Hu et al. (2010).

3.2. Accuracy of our Bayesian bandwidth estimation

We generated samples of n = 500, 1000, 2000 from each of the six bivariate

densities. The kernel function is the product of univariate Gaussian kernels.

We estimated the bandwidth vectors for the proposed tail-adaptive kernel

density estimator with α = 0.05 and 0.1. A global bandwidth vector is also

estimated via the Bayesian method of Zhang et al. (2006) and NRR.

We used the random-walk Metropolis algorithm to sample all bandwidths

from their posterior. The burn-in period contains 3,000 iterations, and the

following 10,000 iterations were recorded. We computed the batch-mean

standard deviation discussed by Roberts (1996) and the simulation ineffi-

cient factor (SIF) discussed by Kim et al. (1998) to monitor the mixing per-

formance. Both indicators show that the sampler has achieved a reasonable

mixing performance. Refer to Hu et al. (2010) for graphs on visual inspection

of the mixing performance.

Consider a sample generated from fF (x) with α = 0.05 and sample size

1000. Table 1 presents the MCMC results, where the SIFs are very small and

indicate a reasonable mixing performance of the proposed sampler. The tail-

adaptive density estimator clearly captures the fat-tailed feature of fF . For

example, the estimates of both components of h(1) for observations inside the

12

Page 13: Bayesian Adaptive Bandwidth Kernel Density Estimation of ...users.monash.edu.au/~xzhang/CSDA.2011.pdf · Kulasekera and Padgett, 2006; de Lima and Atuncar, 2010). In particular, Zhang

low-density region are respectively, much larger than the estimates of both

components of h(0) for observations outside this region.

We generatedN=100,000 random vectors from the true density and calcu-

lated the estimated Kullback-Leibler information defined by (2). As shown in

Table 2, among all six densities considered, the tail-adaptive density estima-

tor obviously performs better than the global-bandwidth density estimators

with the bandwidth vector estimated through Bayesian sampling and NRR.

The results also indicate that the performance of the tail-adaptive density

estimator is not very sensitive to different values of α.

The mean integrated squared error (MISE) was also used to examine the

performance of tail-adaptive density estimator. We numerically approximate

the MISE through 200 data sets for each the bivariate densities with sample

size 500, 1000 and 2000. Table 3 shows that the lower-adaptive estimator

always outperforms the global-bandwidth estimator.

4. Tail-adaptive density estimation for high dimensions

Our proposed Bayesian sampling algorithm for estimating bandwidths in

tail-adaptive kernel density estimation is applicable to data of any dimen-

sion. In this section, we aim to examine the performance of the tail-adaptive

estimator in comparison with the global-bandwidth estimator.

4.1. True densities

We consider four target densities labeled G, H, I and J. Density G is a

mixture of two multivariate normal densities:

fG (x|µ1, µ2,Σ1,Σ2) =1

2φ (x|µ1,Σ1) +

1

2φ (x|µ2,Σ2) ,

13

Page 14: Bayesian Adaptive Bandwidth Kernel Density Estimation of ...users.monash.edu.au/~xzhang/CSDA.2011.pdf · Kulasekera and Padgett, 2006; de Lima and Atuncar, 2010). In particular, Zhang

where µ1 = (−1.5,−1.5,−1.5,−1.5,−1.5)′, µ2 = (2, 2, 2, 2, 2)′, and both

variance-covariance matrices have the form of

Σ =1

1− ρ2

1 ρ ρ2 ρ3 ρ4

ρ 1 ρ ρ2 ρ3

. . .

ρ4 ρ3 ρ2 ρ 1

, (8)

with ρ = 0.3 for Σ1 and ρ = −0.9 for Σ2.

Density H is a multivariate skew-normal densities:

fH (x|µ,Σ, α) = 2φ (x|µ,Σ) Φ (γ′(x− µ)) ,

where Σ is given by (8) with ρ = 0.9, µ = (−0.5,−0.5,−0.5,−0.5,−0.5)′,

and γ = (−9,−9,−9,−9,−9)′.

Density I is a mixture of two multivariate Student t densities:

fI (x|µ1, µ2,Σ1,Σ2, ν) = 0.5 td (x|µ1,Σ1, ν) + 0.5 td (x|µ2,Σ2, ν) ,

where µ1 = (−2, 0,−2, 0,−2)′, µ2 = (2, 0, 2, 0, 2)′, ν = 5, and both Σ1 and

Σ1 are defined by (8) with ρ = −0.5 and ρ = 0.5, respectively.

Density J is a multivariate skew-t densities:

fJ (x|µ,Σ, α, ν) = 2td (x|µ,Σ, ν)Td (x|ν + d) ,

where µ = 0, ν = 5, Σ is an identity matrix, and x is defined in a similar

way as that in the bivariate fF (x) with γ = (2, 0, 2, 0, 2)′.

4.2. Accuracy of our Bayesian bandwidth estimation

We generated samples of sizes n = 500, 1000, 2000 from each of the five-

dimensional densities. Table 4 presents the estimated Kullback-Leibler in-

formation between the true density and its estimator resulted from each of

14

Page 15: Bayesian Adaptive Bandwidth Kernel Density Estimation of ...users.monash.edu.au/~xzhang/CSDA.2011.pdf · Kulasekera and Padgett, 2006; de Lima and Atuncar, 2010). In particular, Zhang

the three bandwidth estimation methods. The kernel density estimator with

tail-adaptive bandwidth vectors obviously outperforms its counterpart with a

global bandwidth vector estimated via either the NRR or Bayesian sampling.

This finding is consistent with what we found in the bivariate situation. For

all sample sizes of each density, we found that the tail-adaptive kernel den-

sity estimator with α = 0.1 slightly outperforms the same estimator with

α = 0.05. However, we would be reluctant to make a decision as to whether

the former performs better than the latter because such a is marginal. Due

to the heavy burden required by the numerical computation of MISE for

five-dimensional densities, we have not computed MISE in this section.

5. An application of the tail-adaptive density estimator

In this section, we apply the proposed tail-adaptive kernel density es-

timator to the estimation of bivariate density of stock-index returns. We

downloaded the daily closing index values of the S&P 500 index in the USA

stock market and the All Ordinaries (AORD) in the Australian stock mar-

ket, from the 2nd January 2006 to the 16th September 2010 excluding non-

trading days, with sample size 1155. The AORD return was matched to the

overnight S&P 500 return. Let Pt denote the closing index at date t. The

daily continuously compounded returns in percentage form was computed as

(lnPt− lnPt−1)×100. The sample period covers the period of current global

financial crisis, and there were some extreme observations. The Pearson

correlation coefficient between the two return series is 0.6171.

We used the random-walk Metropolis algorithm to estimate bandwidths

for the tail-adaptive kernel density estimator of the bivariate returns, with

15

Page 16: Bayesian Adaptive Bandwidth Kernel Density Estimation of ...users.monash.edu.au/~xzhang/CSDA.2011.pdf · Kulasekera and Padgett, 2006; de Lima and Atuncar, 2010). In particular, Zhang

α = 0.05. We also estimated the global bandwidth via the NRR and Bayesian

sampling. The results are given in Table 5, where the SIFs indicate reasonable

mixing performance achieved by both sampling algorithms. Moreover, the

use of tail-adaptive bandwidths leads to a much larger marginal likelihood

than the use of a global bandwidth in this bivariate kernel density estimator,

where the marginal likelihoods were calculated through Newton and Raftery

(1994). Thus, the use of tail-adaptive bandwidths is strongly favored against

the use of a global bandwidth. The density surface and contour plot for each

of the three density estimators are presented in Hu et al. (2010), where the

tail-adaptive estimator is shown to be able to capture richer dynamics than

the other tow.

Let xt and yt denote the S&P 500 and AORD daily returns, respectively.

We used the estimated tail-adaptive bandwidths to compute the conditional

density of yt at a given value of xt through f(y|xt = x) = f(y, x)/fx(x),

where f(y, x) is the joint density of (yt, xt), and fx(x) is the marginal density

of xt. Bandwidths estimated for a joint density can be used for the purpose

to compute conditional density (Holmes et al., 2010; Polak et al., 2010). We

computed the conditional density of xt given that yt is at the 10th, 5th and

1st percentiles, which are corresponding to percentage return values of -0.75,

-1.13, and -2.24, respectively.

We are also able to compute the conditional probability Pr{yt ≤ y|xt ≤

x} = Pr{yt ≤ y, xt ≤ x}/Pr{xt ≤ x}. For example, Pr{yt ≤ 0|xt ≤ 0} =

0.67. It means that when the USA stock market finished daily trading with

a negative return, there was a 67% chance that the Australian market would

also drop. Such a percentage suggests that the Australian market followed

16

Page 17: Bayesian Adaptive Bandwidth Kernel Density Estimation of ...users.monash.edu.au/~xzhang/CSDA.2011.pdf · Kulasekera and Padgett, 2006; de Lima and Atuncar, 2010). In particular, Zhang

the USA market during this period. However, this result should not be

interpreted as blaming the problems in Australian stock market to the USA

stock market, while this is simply an evidence of market dependence.

In addition, we can estimate the conditional CDF of yt for given xt = x:

F (y|xt = x) = Pr{yt ≤ y|xt = x} =

∫ y

−∞

f(z, x)

fx(x)dz. (9)

Given that the USA market finished daily trading with the S&P 500 return

being at x%, the probability that the Australian market dropped beyond

the same level is indicated by letting y = x in (9). At the above-mentioned

percentiles of the S&P 500 return, the probability that the AORD return

were below the same levels are 0.27, 0.24 and 0.12, respectively. It indicates

that the probability that the Australian market had a larger drop than the

USA market was no more than 27%. See Hu et al. (2010) for the graphs of

the conditional density and its CDF.

Moreover, we evaluated the performance of our VaR in comparison to the

VaR obtained through IGARCH model (or the Riskmetrics). We followed

the steps described in Bao et al. (2006) and calculated the check function of

Koenker and Bassett Jr (1978). The existing sample was used as the learning

set, and we forecasted daily 5% VaR of AORD from 17th September 2010 to

5th May 2011. The check function computed by IGARCH and our proposed

method are 0.0356 and 0.0344, respectively. This finding indicates that our

approach to VaR estimation is more effective than the Riskmetrics.

6. Conclusion

This paper proposes a kernel density estimator with tail-adaptive band-

widths, which are assigned to the low- and high-density regions, respectively.

17

Page 18: Bayesian Adaptive Bandwidth Kernel Density Estimation of ...users.monash.edu.au/~xzhang/CSDA.2011.pdf · Kulasekera and Padgett, 2006; de Lima and Atuncar, 2010). In particular, Zhang

We have derived the posterior of bandwidth parameters based on Kullback-

Leibler information and presented an MCMC sampling algorithm to estimate

bandwidths. The Monte Carlo simulation study shows that the tail-adaptive

kernel density estimator outperforms its competitor, the global-bandwidth

kernel density estimator. The simulation result also shows that the improve-

ment made by the tail-adaptive kernel density estimator is especially obvious

when the underlying density is fat-tailed. Although the probability of the

low-density region has to be chosen before sampling is carried out, we have

found that performance the tail-adaptive kernel estimator is not sensitive to

different values of this probability. Future search could include this proba-

bility as a parameter to be estimated via the same sampling procedure.

We applied the tail-adaptive kernel density estimator to the estimation

of bivariate density of the paired daily returns of the Australian Ordinary

and S&P 500 indices during a period that covers the global financial crisis.

The tail-adaptive density estimator is strongly favored against the global-

bandwidth density estimator. We derived the estimated conditional density

and distribution of the Australian index return given that the USA market

finished daily trading with different return values. Although the Australian

stock market followed the USA market during the crisis, there was no more

than 27% chance that the former market had a larger drop than the latter.

Our approach can be viewed as mode estimation in clustering analysis.

Even though our algorithm can be implemented to data of any dimension, the

curse of dimensionality is a major concern for kernel density estimation. Fer-

raty and Vieu (2006) suggested attacking the dimensionality problem from

a functional setting even for data of infinite dimensions. It might be possi-

18

Page 19: Bayesian Adaptive Bandwidth Kernel Density Estimation of ...users.monash.edu.au/~xzhang/CSDA.2011.pdf · Kulasekera and Padgett, 2006; de Lima and Atuncar, 2010). In particular, Zhang

ble to explore whether the proposed Bayesian approach is applicable in this

situation. We leave it for future research.

Acknowledgements

We extend our sincere thanks to the Co-Editor Stanley Azen, an associate

editor and two reviewers for their very insightful comments that have led to

a substantially improved paper. Thanks also go to the Victorian Partnership

for Advanced Computing (VPAC) for its quality computing facility.

References

Abramson, I., 1982a. Arbitrariness of the pilot estimator in adaptive kernel

methods. Journal of Multivariate Analysis 12, 562–567.

Abramson, I., 1982b. On bandwidth variation in kernel estimates — a square

root law. The Annals of Statistics 10, 1217–1223.

Azzalini, A., Capitanio, A., 2003. Distributions generated by perturbation

of symmetry with emphasis on a multivariate skew t-distribution. Journal

of the Royal Statistical Society, Series B 65, 367–389.

Azzalini, A., Valle, A., 1996. The multivariate skew-normal distribution.

Biometrika 83, 715–726.

Bao, Y., Lee, T., Saltoglu, B., 2006. Evaluating predictive performance

of value-at-risk models in emerging markets: a reality check. Journal of

Forecasting 25, 101–128.

19

Page 20: Bayesian Adaptive Bandwidth Kernel Density Estimation of ...users.monash.edu.au/~xzhang/CSDA.2011.pdf · Kulasekera and Padgett, 2006; de Lima and Atuncar, 2010). In particular, Zhang

Breiman, L., Meisel, W., Purcell, E., 1977. Variable kernel estimates of

multivariate densities. Technometrics 19, 135–144.

Brewer, M., 2000. A Bayesian model for local smoothing in kernel density

estimation. Statistics and Computing 10, 299–309.

Cao, R., 2001. Relative efficiency of local bandwidths in kernel density esti-

mation. Statistics 35, 113–137.

Devroye, L., Gyorfi, L., 1985. Nonparametric Density Estimation: The L1

View. John Wiley & Sons, New York.

Duong, T., Hazelton, M., 2003. Plug-in bandwidth matrices for bivariate

kernel density estimation. Journal of Nonparametric Statistics 15, 17–30.

Duong, T., Hazelton, M., 2005. Cross-validation bandwidth matrices for

multivariate kernel density estimation. Scandinavian Journal of Statistics

32, 485–506.

Ferraty, F., Vieu, P., 2006. Nonparametric Functional Data Analysis: Theory

and Practice. Springer, New York.

Gangopadhyay, A., Cheung, K., 2002. Bayesian approach to the choice of

smoothing parameter in kernel density estimation. Journal of Nonpara-

metric Statistics 14, 655–664.

Hall, P., 1987. On Kullback-Leibler loss and density estimation. The Annals

of Statistics 15, 1491–1519.

Hardle, W., 1991. Smoothing Techniques: with Implementation in S.

Springer, New York.

20

Page 21: Bayesian Adaptive Bandwidth Kernel Density Estimation of ...users.monash.edu.au/~xzhang/CSDA.2011.pdf · Kulasekera and Padgett, 2006; de Lima and Atuncar, 2010). In particular, Zhang

Hartigan, J.A., 1975. Clustering Algorithms. John Wiley & Sons, New York.

Hartigan, J.A., 1987. Estimation of a convex density contour in two dimen-

sions. Journal of the American Statistical Association 82, 267–270.

Holmes, M.P., Gray, A.G., Isbell Jr, C.L., 2010. Fast kernel conditional

density estimation: A dual-tree Monte Carlo approach. Computational

Statistics & Data Analysis 54, 1707–1718.

Horova, I., Vieu, P., Zelinka, H., 2002. Optimal choice of nonparametric

estimaties of a density and of its derivatives. Statistics & Decisions 20,

355–378.

Hu, S., Poskitt, D., Zhang, X., 2010. Bayesian Adaptive Bandwidth Ker-

nel Density Estimation of Irregular Multivariate Distributions. Monash

Econometrics and Business Statistics Working Papers.

Hyndman, R.J., 1996. Computing and graphing highest density regions. The

American Statistician 50, 120–126.

Izenman, A.J., 1991. Recent developments in nonparametric density estima-

tion. Journal of the American Statistical Association 86, 205–224.

Jacome, M., Gijbels, I., Cao, R., 2008. Comparison of presmoothing methods

in kernel density estimation under censoring. Computational Statistics 23,

381–406.

Jones, M.C., 1990. Variable kernel density estimates and variable kernel

density estimates. Australian & New Zealand Journal of Statistics 32,

361–371.

21

Page 22: Bayesian Adaptive Bandwidth Kernel Density Estimation of ...users.monash.edu.au/~xzhang/CSDA.2011.pdf · Kulasekera and Padgett, 2006; de Lima and Atuncar, 2010). In particular, Zhang

Jones, M.C., Marron, J.S., Sheather, S.J., 1996. A brief survey of band-

width selection for density estimation. Journal of the American Statistical

Association 91, 401–407.

Kim, S., Shepherd, N., Chib, S., 1998. Stochastic volatility: Likelihood

inference and comparison with ARCH models. Review of Economic Studies

65, 361–393.

Koenker, R., Bassett Jr, G., 1978. Regression quantiles. Econometrica 46,

33–50.

Kulasekera, K.B., Padgett, W.J., 2006. Bayes bandwidth selection in kernel

density estimation with censored data. Journal of Nonparametric Statistics

18, 129–143.

de Lima, M., Atuncar, G., 2010. A bayesian method to estimate the optimal

bandwidth for multivariate kernel estimator. Journal of Nonparametric

Statistics 23, 137–148.

Loftsgaarden, D.O., Quesenberry, C.P., 1965. A nonparametric estimate of a

multivariate density function. The Annals of Mathematical Statistics 36,

1049–1051.

Marron, J., 1992. Bootstrap bandwidth selection, in: LePage, R., Billard, L.

(Eds.), Exploring the Limits of Boostrap. John Wiley & Sons, New York,

pp. 249–262.

Marron, J.S., 1987. A comparison of cross-validation techniques in density

estimation. The Annals of Statistics 15, pp. 152–162.

22

Page 23: Bayesian Adaptive Bandwidth Kernel Density Estimation of ...users.monash.edu.au/~xzhang/CSDA.2011.pdf · Kulasekera and Padgett, 2006; de Lima and Atuncar, 2010). In particular, Zhang

Marron, J.S., Nolan, D., 1988. Canonical kernels for density estimation.

Statistics & Probability Letters 7, 195 – 199.

Mason, D.M., Polonik, W., 2009. Asymptotic normality of plug-in level set

estimates. The Annals of Applied Probability 19, 1108–1142.

Mielniczuk, J., Sarda, P., Vieu, P., 1989. Local data-driven bandwidth choice

for density estimation. Journal of Statistical Planning and Inference 23,

53–69.

Newton, M.A., Raftery, A.E., 1994. Approximate Bayesian inference with

the weighted likelihood bootstrap. Journal of the Royal Statistical Society,

Series B 56, 3–48.

Nolan, D., Marron, J., 1989. Uniform consistency of automatic and location-

adaptive delta-sequence estimators. Probability Theory and Related Fields

80, 619–632.

Polak, J., Zhang, X., King, M.L., 2010. Bandwidth selection for kernel condi-

tional density estimation using the MCMC method. Manuscript presented

at Australian Statistical Conference, 6-10 December, Western Australia.

Roberts, G.O., 1996. Markov chain concepts related to sampling algorithms,

in: Gilks, W.R., Richardson, S., Spiegelhalter, D.J. (Eds.), Markov Chain

Monte Carlo in Practice. Chapman & Hall, London, pp. 45–57.

Sain, S.R., 2002. Multivariate locally adaptive density estimation. Compu-

tational Statistics & Data Analysis 39, 165–186.

23

Page 24: Bayesian Adaptive Bandwidth Kernel Density Estimation of ...users.monash.edu.au/~xzhang/CSDA.2011.pdf · Kulasekera and Padgett, 2006; de Lima and Atuncar, 2010). In particular, Zhang

Sain, S.R., Baggerly, K.A., Scott, D.W., 1994. Cross-validation of multivari-

ate densities. Journal of the American Statistical Association 89, 807–817.

Sain, S.R., Scott, D.W., 1996. On locally adaptive density estimation. Jour-

nal of the American Statistical Association 91, 1525–1534.

Sain, S.R., Scott, D.W., 2002. Zero-bias locally adaptive density estimators.

Scandinavian Journal of Statistics 29, 441–460.

Samworth, R.J., Wand, M.P., 2010. Asymptotics and optimal bandwidth

selection for highest density region estimation. The Annals of Statistics

38, 1767–1792.

Scott, D.W., 1992. Multivariate Density Estimation: Theory, Practice, and

Visualization. John Wiley & Sons, New York.

Terrell, G.R., Scott, D.W., 1992. Variable kernel density estimation. The

Annals of Statistics 20, 1236–1265.

Vieu, P., 1999. Multiple kernel procedure: An asymptotic support. Scandi-

navian Journal of Statistics 26, 61–72.

Wand, M.P., Jones, M.C., 1994. Multivariate plug-in bandwidth selection.

Computational Statistics 9, 97–116.

Wand, M.P., Jones, M.C., 1995. Kernel Smoothing. Chapman & Hall, New

York.

Zhang, X., King, M.L., Hyndman, R.J., 2006. A Bayesian approach to band-

width selection for multivariate kernel density estimation. Computational

Statistics & Data Analysis 50, 3009–3031.

24

Page 25: Bayesian Adaptive Bandwidth Kernel Density Estimation of ...users.monash.edu.au/~xzhang/CSDA.2011.pdf · Kulasekera and Padgett, 2006; de Lima and Atuncar, 2010). In particular, Zhang

Table 1: MCMC results obtained based on a sample generated from density F

Bandwidths Mean Standard Batch-mean SIF Acceptance

deviation standard deviation rate

h(1)1 1.1121 0.3184 0.0157 24.32 0.28

h(1)2 1.6432 0.3816 0.0164 18.57

h(0)1 0.2505 0.0469 0.0019 17.13

h(0)2 0.4196 0.0675 0.0018 7.35

Table 2: Estimated Kullback-Leibler information for bivariate densities

Global-bandwidth Tail-adaptive bandwidth

Density n NRR Bayesian α = 0.05 α = 0.10

fA 500 0.2878 0.0858 0.0772 0.0748

1000 0.2382 0.0617 0.0498 0.0467

2000 0.1981 0.0402 0.0339 0.0338

fB 500 0.1201 0.0499 0.0444 0.0442

1000 0.0826 0.0349 0.0332 0.0337

2000 0.0653 0.0256 0.0219 0.0217

fC 500 0.1126 0.0930 0.0783 0.0768

1000 0.0924 0.0689 0.0559 0.0558

2000 0.0900 0.0648 0.0497 0.0498

fD 500 0.1171 0.0946 0.0464 0.0449

1000 0.0809 0.0769 0.0286 0.0312

2000 0.0590 0.0565 0.0242 0.0270

fE 500 0.1436 0.1072 0.0623 0.0530

1000 0.1038 0.1088 0.0328 0.0397

2000 0.0782 0.0666 0.0262 0.0282

fF 500 0.1169 0.1641 0.0520 0.0545

1000 0.0781 0.0657 0.0261 0.0306

2000 0.0708 0.0637 0.0237 0.0242

25

Page 26: Bayesian Adaptive Bandwidth Kernel Density Estimation of ...users.monash.edu.au/~xzhang/CSDA.2011.pdf · Kulasekera and Padgett, 2006; de Lima and Atuncar, 2010). In particular, Zhang

Table 3: Estimated MISE (×100) for bivariate densities

Global-bandwidth Tail-adaptive bandwidth

Density n NRR Bayesian α = 0.05 α = 0.10

fA 500 1.8482 0.4945 0.4415 0.4401

1000 1.5546 0.3354 0.2920 0.2901

2000 1.2875 0.2230 0.1875 0.1854

fB 500 0.6526 0.2718 0.2612 0.2606

1000 0.4927 0.1828 0.1738 0.1727

2000 0.3595 0.1164 0.1101 0.1089

fC 500 0.8812 0.6022 0.5422 0.5422

1000 0.7059 0.4208 0.3694 0.3804

2000 0.5589 0.2868 0.2600 0.2692

fD 500 0.3250 0.6573 0.2782 0.2638

1000 0.2152 0.5121 0.1796 0.1737

2000 0.1489 0.3915 0.1219 0.1245

fE 500 0.4022 0.3722 0.1919 0.1822

1000 0.3039 0.2980 0.1279 0.1242

2000 0.2207 0.2283 0.0840 0.0844

fF 500 0.3331 0.7560 0.3092 0.2966

1000 0.2229 0.6150 0.2054 0.2022

2000 0.1496 0.4856 0.1386 0.1437

26

Page 27: Bayesian Adaptive Bandwidth Kernel Density Estimation of ...users.monash.edu.au/~xzhang/CSDA.2011.pdf · Kulasekera and Padgett, 2006; de Lima and Atuncar, 2010). In particular, Zhang

Table 4: Estimated Kullback-Leibler information for 5-dimensional densities

Global-bandwidth Tail-adaptive bandwidth

Density n NRR Bayesian α = 0.05 α = 0.10

fG 500 0.8923 0.4280 0.4026 0.4004

1000 0.7705 0.3093 0.2848 0.2825

2000 0.6933 0.2489 0.2343 0.2300

fH 500 0.4559 0.3438 0.3212 0.3179

1000 0.4041 0.2892 0.2613 0.2582

2000 0.3355 0.2226 0.2033 0.1987

fI 500 0.5943 0.5674 0.3446 0.3187

1000 0.4994 0.4814 0.2891 0.2666

2000 0.4395 0.4255 0.2274 0.2072

fJ 500 0.6107 0.5755 0.3226 0.3033

1000 0.5969 0.4415 0.2538 0.2284

2000 0.5050 0.3937 0.1971 0.1773

Table 5: A summary of MCMC results obtained through our proposed Bayesian sampling

algorithm to the tail-adaptive kernel density estimator of the S&P500 and AORD returns

Bandwidths Mean Standard SIF Acceptance log marginal

deviation rate likelihood

NRR h1 0.2171

h2 0.1783

Bayesian global h1 0.1795 0.0113 5.63 0.21 -1719.64

bandwidth h2 0.2485 0.0121 5.89

Tail-adaptive h(1)1 0.5533 0.2217 39.30 0.27 -1657.14

bandwidth h(1)2 0.1221 0.0161 15.39

with α = 0.05 h(0)1 0.5552 0.1140 19.97

h(0)2 0.1547 0.0174 13.55

27