Kernel Density Estimation Theory, Aspects of Dimension and Application in Discriminant
Analysis
submitted by:
Thomas Ledl
DIPLOMA THESIS submitted in partial fulfilment of the requirements for the academic degree
Magister rerum socialium oeconomicarumque (Mag. rer. soc. oec.)
(Master of Social and Economic Sciences)
Faculty of Economics and Computer Science,
University of Vienna
Field of study: Statistics
Reviewer:
Univ.-Prof. Dr. Wilfried Grossmann
Vienna, March 2002
I affirm:
that I have written this diploma thesis independently, have not used any sources or aids other than those indicated, and have not made use of any other unauthorized help;
that I have not previously submitted this thesis topic, in Austria or abroad, in any form as an examination paper (to any reviewer for assessment);
that this work is identical to the version assessed by the reviewer.
Thomas Ledl
Preface
The following diploma thesis is intended as a thesis in applied statistics. I state this in the first paragraph of my work because the subject can be treated either from a theoretical or from an applied point of view, although the border between these two areas of statistics cannot be drawn exactly.
The reason why I chose this subject is that, on the one hand, density estimation of a random variable is an elementary and important task in statistics, which is already treated in the first weeks of a statistics programme (and in the basic statistics lectures of all related programmes as well) in its most basic form, the histogram estimate. Density estimates are used both for descriptive purposes (detecting symmetry properties, the number and location of modes, skewness, etc.) and, at least indirectly, in almost every inferential context, which makes a good estimate a powerful tool. I chose a non-parametric approach because, on the one hand, we nowadays have the computing power to calculate, for example, kernel density estimates even for datasets with a large number of observations, and on the other hand it is a flexible setting, since this approach probably provides a better fit to the underlying structure in the case of non-normally distributed variables.
The second topic, the use of non-parametric density estimation for Bayes-rule discriminant analysis, was motivated by one of my last lectures on discriminant analysis, in which it was only briefly mentioned that a non-parametric estimate can be used as an alternative in this setting.
My interest in how well this can be done for certain datasets is exactly the main goal of this thesis. The estimation of the densities, the kernel discrimination and any necessary dimension reduction should be seen in context and not separately. The methods are applied both to self-constructed data and to a real-life dataset. In short, the final result ought to be a kind of "check list" on how best to perform a given discrimination task using Bayes-rule kernel density estimation.
I want to thank all the people who were directly or indirectly involved in making it possible for me to write this thesis as it now stands. First of all I want to mention some people of the Department of Statistics and Decision Support Systems at the University of Vienna. I am especially grateful to Prof. Grossmann, my mentor, who made sure that the work went in the right direction and who provided literature and software for me. I also want to thank Prof. Pflug for providing further sources of literature, and Prof. Neuwirth, who sparked my interest in this topic in the first weeks of my studies through his excellent interactive presentation of the kernel density estimator concept.
Finally, it is a great pleasure for me to thank all members of my family, who gave me financial and emotional support; without them this work would not have been possible in this form. I also thank my fellow students, who provided the right social environment during my studies and with whom I had a really good relationship.
Contents
Table of Diagrams
Table of Tables
Chapter 1: Introduction
Chapter 2: Kernel density estimation during the last 25 years
2.1 Introduction
2.2 The univariate case
2.2.1 From the histogram to the kernel density estimator
2.2.2 The model of the kernel density estimator
2.2.3 Optimization criteria
2.2.4 Calculations of the error criteria
2.2.5 Optimality properties of the kernel
2.2.6 Further developments of the model to improve the estimation
2.2.7 Bandwidth selection
2.3 The multivariate case
2.3.1 The model
2.3.2 Parametrizations
2.3.3 Parameter selection
2.3.4 The curse of dimensionality
2.4 The context to kernel discriminant analysis
Chapter 3: Dimension Reduction and Marginal Transformations
3.1 Introduction
3.2 Marginal transformations
3.3 Dimension reduction
Chapter 4: Simulation Study and Real-life Application
4.1 Introduction
4.2 Preliminaries
4.2.1 The data
4.2.2 The construction of the estimators and the estimation procedure
4.2.3 The performance measure
4.2.4 The software
4.3 Results
4.3.1 LDA versus QDA
4.3.2 LDA and QDA versus the normalized datasets
4.3.3 The multivariate kernel density estimators – differences concerning dimensions
4.3.4 LDA and QDA versus the multivariate kernel density estimators
4.3.5 Results concerning the insurance data
4.4 Computational considerations
4.5 Discussion
Chapter 5: Summary and outlook
References
Appendix
About the used literature
Books
Papers
Notation and common abbreviations
Tables
Table of Diagrams
Figure 1: Non-parametric and parametric density estimates.
Figure 2: The construction of the kernel density estimator.
Figure 3: The original density and two expectations of kernel estimates for different values of h.
Figure 4: Plots of MISE, AMISE (bowl-shaped curves) and their additive components.
Figure 5: MISE, IV, ISB, AMISE, AIV and AISB for the underlying density.
Figure 6: Kernel estimate using different bandwidths. The underlying density (solid) and the concerning kernel density estimate (dotted).
Figure 7: The corresponding density types for Table 1.
Figure 8: Ratio of the AMISE-optimal bandwidth to window widths chosen by different reference distributions.
Figure 9: Contour-plot of the maximum posterior probabilities at each point (LDA).
Figure 10: Contour-plot of the maximum posterior probabilities at each point (kernel estimate).
Figure 11: Difference between the logarithm of a standard normal density and the logarithm of two kernel estimates (Sheather-Jones plug-in and the Rule of Thumb).
Figure 12: Normalization of two bimodal class densities. Density 1 (solid), density 2 (dashed) and their respective normalizations.
Figure 13: Univariate normalizations.
Figure 14: Non-parametric normalization of one variable of the insurance data.
Figure 15: The corresponding scree-plot.
Figure 16: Prototype distributions of the synthetic datasets.
Figure 17: Univariate histograms for the insurance dataset.
Figure 18: Brier-scores for the NN-distributions. Comparison between LDA and QDA.
Figure 19: Brier-score for the SkN-distributions. Comparison between LDA and QDA.
Figure 20: Brier-score for the Bi-distributions. Comparison between LDA and QDA.
Figure 21: Error-rate for the NN-distributions. Comparison within the LDA-method by using normalizations.
Figure 22: Error-rate for the SkN-distributions. Comparison within the LDA-method by using normalizations.
Figure 23: Error-rate for the Bi-distributions. Comparison within the LDA-method by using normalizations.
Figure 24: Error-rate for the NN-distributions. Comparison within the QDA-method by using normalizations.
Figure 25: Error-rate for the SkN-distributions. Comparison within the QDA-method by using normalizations.
Figure 26: Error-rate for the Bi-distributions. Comparison within the QDA-method by using normalizations.
Figure 27: Dependency of the performance of the multivariate kernel density estimator for two datasets.
Figure 28: Brier-score of the datasets having equal correlation matrices. Comparison between the LDA and the bayes-rule-kernel-methods constructed by the LSCV-selector.
Figure 29: Brier-score of the datasets having unequal correlation matrices. Comparison between the QDA and the bayes-rule-kernel-methods constructed by the LSCV-selector.
Figure 30: The Error-rates for the insurance data.
Figure 31: The problem of classification concerning non-equal class observation numbers.
Table of Tables
Table 1: Efficiency on how to estimate different densities.
Table 2: Efficiencies of product kernels relative to radially symmetric kernels when using the multivariate generalization of a beta-kernel.
Table 3: Probability of data in regions which have density values higher than one hundredth the value at the mode of a multivariate normal distribution.
Table 4: Principal component analysis with one of my synthetic datasets.
Table 5: Smallest sample size for each dimension, which satisfies (3.2).
Table 6: Prototype distributions of the synthetic datasets.
Table 7: Description of the used datasets.
Table 8: Average rank of the kernel estimators in dependency on the dimension of the subspace concerning the Error-rates. "1" is the best.
Table 9: Average place of the kernel estimators in dependency on the dimension of the subspace concerning the Brier-score. "1" is the best.
Table 10: Principal component analysis for "SkN11".
Table 11: Principal component analysis for "SkN32".
Table 12: Results of the principal component analysis. The percentage of the explained variance is shown for all datasets.
Table 13: Classification results for the LDA.
Table 14: Classification results for the QDA.
Table 15: Classification results for the estimator "Normal rule" in two and three dimensions.
Table 16: Classification results for the estimator "Normal rule" in four and five dimensions.
Table 17: Classification results for the estimator "LSCV" in two and three dimensions.
Table 18: Classification results for the estimator "LSCV" in four and five dimensions.
Table 19: Estimator "Normal rule - normalized". The normalization is done by the non-parametric kernel estimate.
Table 20: Estimator "Sheather-Jones - normalized". The normalization is done by the non-parametric kernel estimate.
Chapter 1: Introduction
What is actually the most important thing you learn in a statistics programme? During your studies you start with exploratory methods, gaining some knowledge about your dataset by producing basic frequency plots or calculating some useful coefficients (descriptive statistics). You learn about several distributions of random elements, about the fact that your dataset has only a finite number of observations (what you observe are only realisations of an underlying origin), and you always try to fit proper models to your data. Finally you end up with methods for several variables and with theoretical considerations about the goodness of certain statistical tests or estimators. The common element of all these topics is the uncertainty of your observations, which is modelled by the most elementary object in statistics, a random variable. So the most interesting question is actually: given a set of realisations $x_1, \ldots, x_n$ of a random variable $X$, how can the structure of $X$ be discovered appropriately? If you know this structure, you know with which probability a certain event occurs, and thus you should also be able to judge the assignment of a certain observation to one of several distinct populations, the classification task. These two tasks are fundamental in statistics, and therefore they will have my undivided attention throughout this thesis.
In classical statistical theory you learn that the normal distribution is a very commonly used model for such a set of realisations. On the one hand this is based on the central limit theorem, which shows that the (suitably normalized) sum of independent, identically distributed variables converges to a normal distribution, and many things in our life can approximately be treated in that way. On the other hand it has a compact formula with additional nice mathematical properties, which makes it useful as a model for unimodal symmetric distributions. Last but not least, multivariate generalizations can be made in a straightforward way.
However, the greatest drawback of this setting is its small flexibility, owing to the number of parameters. Having only one parameter for the location and one for the scale makes it a poor approximation for data far away from this normality assumption. Regarding the fact that all events which occur in real life are
dependent on each other (although perhaps only in a complex way), a model of independent, identically distributed variables will actually never be "the right one" and is therefore only approximately valid. In addition, there are of course many cases where even an approximation of the idealized view (a sum of independent, identically distributed observations) is not justified. That is the case for life-duration data, income data (both in general highly skewed) or multimodal distributions. Nevertheless, the model of the normal distribution and its consequences are used to this day, for example in standard regression theory and ANOVA, as well as in discriminant analysis (linear and quadratic), as the first choice in many software packages.
Since computing power has increased considerably in the last two decades, it has become feasible to fit more flexible models, and researchers have gained another view of this problem by using non-parametric density estimates. Besides, we have huge data sources available today, which allow us to detect significant deviations from the normality assumption even when they are small.
As a result of this new opportunity for density estimation, there has been an enormous increase in the literature written about this topic. Both discrete and continuous random variables have been considered. Also, the multivariate generalizations for random vectors (either purely continuous ones, purely discrete ones, or mixtures of both types) have been taken into account. Theoretical properties of the estimators have been derived, and simulation studies of several estimators have been produced.
To restrict the large number of methods for non-parametric density estimation, in this work I concentrate only on the kernel density estimator, which is probably the most discussed concept in the existing literature because of its simplicity and clearness. Also, its multivariate generalization is straightforward. Another restriction is that I only treat continuous random variables and purely continuous random vectors, because otherwise the estimated function is not smooth and one cannot speak of a density function.
The papers and books written about these topics mostly treat the discussion of selecting a good smoothing parameter for the estimated function in the univariate case, which is not a satisfactory answer for the discrimination problem, where in general more than one variable is used. There are some statements in certain books underlining the difficulties that occur when estimating in high dimensions, the so-called curse of dimensionality, as well as advice on how a set of several variables can be suitably transformed to a lower-dimensional space. However, reading this gives no idea of how different methods perform at a certain discrimination task. Actually, there is no connection between the problem of estimating the density in a proper way, using a multivariate density estimate for Bayes-rule kernel discriminant analysis, and reducing high dimensions to make a reasonable solution possible at all. Only one study from the early 1980s (Remme et al., 1980) analysed the application of density estimation and compared error rates in classification with the performance of Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA); it thus connected at least two of the subtasks mentioned above. But the smoothing parameter was selected by a method which is no longer considered adequate today. Nevertheless, this study gave the impetus and an idea of how to proceed in this work. This leads to the research assignment of the present thesis.
The thesis should comprise:
1. A discussion of the main ideas in kernel density estimation in the last years, with emphasis on the variety of methods, parametrizations, optimization criteria, etc. (chapter 2).
2. A discussion of how the number of variables can be reduced or the data can be transformed while still getting reasonable results (chapter 3).
3. A simulation study and a real-life application in which the main results of 1. and 2. are applied to discriminant analysis in a proper way, so that the resulting method may even dominate LDA and QDA (chapter 4).
Chapter 2: Kernel density estimation during the last 25
years
2.1 Introduction
Figure 1: Non-parametric and parametric density estimates. A histogram is
shown in a) and in b) a kernel density estimate (dashed) is produced. In both
graphs the normal curve is plotted, too (solid).
The aim of chapter 2 is to give some insight into how kernel density estimation is performed, and into what makes this subject such a huge research area. Regarding the univariate case treated in section 2.2, Figure 1 gives a first impression. On the left is the output you can produce with almost every statistical program package, the well-known histogram, the first and probably simplest realization of a non-parametric estimate. Drawn as a line in the same graphic is its parametric counterpart, the normal-distribution curve, whose parameters have been estimated by the classical unbiased estimators. Since it is obvious that neither the histogram, as a very crude estimator, nor the normal approximation is a proper method in this case, a kernel density estimator
has been performed on the right side, which allows the estimated function to be
smooth as well as to figure out more detailed structure, the bimodality in this case.
As this concept is more flexible, it also leads to many new questions and difficulties, which are pointed out in section 2.2. In particular, the following questions are discussed.
What is a good choice for the kernel shape (section 2.2.5)?
What is a good choice for the bandwidth (section 2.2.7)?
In which sense can a certain bandwidth choice be optimal (section 2.2.3)?
How can the basic concept be improved (section 2.2.6)?
Concerning the multivariate case there are additional difficulties, since one has to deal not only with a vector of different bandwidths, one for every dimension, but also with different orientations of the multivariate kernels, which results in a whole bandwidth matrix. Besides, there are different methods to construct kernels in high dimensions. Questions like these are discussed in section 2.3.
Section 2.4 keeps an eye on the fact that optimality in density estimation is maybe
not the same as optimality in discrimination. Connections between the bayes rule
kernel setting and other common discrimination techniques are going to be
pointed out as well. Additionally, it provides a collection of performance results
of kernel discriminant analysis in the past.
2.2 The univariate case
2.2.1 From the histogram to the kernel density estimator
The most basic way to approach the problem is the one taken in almost every basic lecture in statistics. The classical formula of the histogram estimate is

$$\hat f_h(x) = \frac{1}{nh} \sum_{i=1}^{n} \sum_{j} I(x_i \in B_j)\, I(x \in B_j),$$

where

$$B_j = \big[\,x_0 + (j-1)h,\; x_0 + jh\,\big), \qquad j \in \mathbb{Z}.$$
This leads to the two parameters $x_0$ (origin) and $h$ (binwidth), which completely determine the shape of the estimate. As is known from basic statistics courses, different choices of $x_0$ can lead to quite different impressions of the underlying distribution. This results either in a different number of estimated modes, or in different impressions about skewness and kurtosis. Besides, the binwidth parameter $h$ plays an important role, which is almost the same as in the kernel density estimator setting: depending on its value, it produces either very noisy estimates or estimates so smooth that the structure does not become apparent.
A possibility to make at least one of these problems disappear is to make the estimate almost independent of the origin $x_0$. This can be achieved by computing an average shifted histogram (ASH; Scott, 1992). The idea is to use $M-1$ equidistantly placed points within the interval $[\,x_0, x_0 + h)$ to define shifted sub-histograms; the final estimate is then given by averaging over all $M$ sub-histograms, which amounts to the formula

$$\hat f_h(x) = \frac{1}{M n h} \sum_{l=0}^{M-1} \sum_{i=1}^{n} \sum_{j} I(x_i \in B_{j,l})\, I(x \in B_{j,l}),$$

where

$$B_{j,l} = \Big[\,x_0 + \Big(j - 1 + \frac{l}{M}\Big)h,\; x_0 + \Big(j + \frac{l}{M}\Big)h\,\Big), \qquad l \in \{0, 1, \ldots, M-1\},$$

refers to the $j$-th interval in the original definition of the histogram, "shifted" $l$ times.
In this setting it is clear that the position of $x_i$ within its bin is relevant for the estimate (as long as different $x_i$ that lie in the same origin bin $B_j$ end up in different bins $B_{j,l}$). While this estimate is evidently "smoother" than the histogram, it is nevertheless a step function, which does not give the impression of being the underlying density, but is only a better way of carrying out the basic concept. Even though this estimate is quite simple, and one can obtain good results almost independently of the exact choice of $M$, it is worthwhile to study the case $M \to \infty$. This seemingly more detailed view of the concept was first considered in the 1950s and is the basic form of the kernel density estimator.
2.2.2 The model of the kernel density estimator
The straightforward extension is now to place such a bin, or more generally a special kernel, at every point of the axis, and to count (weight) the data points in the neighbourhood of this point. The first author to consider this model was Rosenblatt (1956). The estimate is given by

$$\hat f_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right),$$

where $K$ denotes the kernel function and $h$ is the bandwidth parameter, which is the analogue of the binwidth in the histogram; therefore the same letter is usually used.
To make sure that this expression is a density function, the kernel function has to satisfy $\int K(u)\,du = 1$.
Other useful and common assumptions for the kernel are
$K(u) = K(-u)$ (symmetry),
$K(u)$ has its maximum at $u = 0$,
$K(u) \ge 0$ for all $u$ (which is, however, not satisfied by higher-order kernels).
In many cases an $N(0, \sigma^2)$ density is used as kernel function, but several other kernels with bounded support satisfying the assumptions above are also used (see section 2.2.5).
The connection between the histogram and the kernel concept arises from the fact that the density estimator using the triangular kernel $K(x) = (1 - |x|)\, I(|x| < 1)$ is the limit of the average shifted histogram as $M \to \infty$, where the binwidth $h$ of the histogram corresponds to the bandwidth $h$ of the kernel setting.
It should also be emphasized that all mathematical properties of the kernel, such as differentiability and smoothness, are inherited by the corresponding kernel density estimate, because the estimate is in fact a convolution of the kernel with the data. The construction of the kernel density estimator using a normal density as kernel (a so-called Gaussian kernel) and different bandwidths is shown in Figure 2.
Figure 2: The construction of the kernel density estimator. Same kernels and
different bandwidths are used in a) and b). The underlying density (dotted),
the estimated density (solid) and its contributions (dashed).
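As a small illustration of this construction, the basic estimator with a Gaussian kernel can be written in a few lines (an illustrative Python/NumPy sketch, not the software used in this thesis; the name kde_gauss is chosen here for convenience and reused in later sketches):

import numpy as np

def kde_gauss(x_grid, data, h):
    # f_hat(x) = 1/(n*h) * sum_i K((x - x_i)/h), with the Gaussian kernel K = phi
    data = np.asarray(data, dtype=float)
    u = (x_grid[:, None] - data[None, :]) / h          # scaled distances, shape (grid, n)
    k = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)   # kernel evaluations
    return k.sum(axis=1) / (len(data) * h)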
2.2.3 Optimization criteria
Looking at the picture above, maybe the first question to ask is whether the right or the left estimate is more appropriate. This leads to the question of which kernel and which bandwidth are best to select, and in particular the second question leads to a very controversial discussion.
As already mentioned in the introduction to this chapter, there is no general way to estimate a density "optimally". Every optimality is always with respect to a certain optimization criterion. This section gives the definitions of different optimization criteria and also points out their properties.
2.2.3.1 Criteria based on the $L_1$-distance
A very common idea for measuring the difference between two functions $f$ and $g$ is to consider the $L_p$-distance, defined by

$$L_p(f, g) = \left( \int |f - g|^p \right)^{1/p},$$

where $p$ is a parameter to choose. For the sake of simplicity one would probably choose $p = 1$, the $L_1$-distance. Though the formula looks quite easy for $p = 1$, it involves some mathematical problems related to the absolute error function, which is not that easy to handle.
Nevertheless, this criterion has one great advantage, namely that it is invariant under bijective transformations of both densities. That means, if $f$ is the density of a random variable $X$ and $g$ is the density of $Y$, then

$$L_1(f, g) = \int |f - g| = \int |f^* - g^*|,$$

where $f^*$ is the density of $T(X)$, $g^*$ is the density of $T(Y)$ and $T$ is a bijective function (Devroye and Györfi, 1985). The value above is also called the integrated absolute error (IAE).
Another property of this criterion can be associated with discriminant analysis. Suppose $f$ and $g$ $(\equiv \hat f_h)$ are the densities of two populations, and a new data point is classified in the following way: assignment to $f$ if $x \in A$ and to $g$ otherwise. Then minimizing the $L_1$-distance is exactly the same as maximizing the classical error rate in discriminant analysis, that is, maximizing the confusion between $f$ and $g$ (Scott, 1992).
The book by Devroye and Györfi (1985) gives a detailed discussion of the $L_1$-criterion. Because comparing different density estimates by the random variable IAE is inconvenient, the expectation of the IAE (the expected integrated absolute error $E(\mathrm{IAE}) = \mathrm{EIAE} = \mathrm{MIAE}$) is considered instead. The ratio of these two values is relatively stable. Roughly speaking, the statement

$$\frac{\mathrm{IAE}}{\mathrm{EIAE}} \in \left[\tfrac{1}{4},\, 3\right]$$
is valid in any case (see Devroye and Györfi, 1985, for more exact mathematical
expressions).
2.2.3.2 Criteria based on the $L_2$-distance
The expression for the $L_p$-distance in the case $p = 2$ contains a root, and therefore the squared criterion

$$\mathrm{ISE}(\hat f_h) = \int \{\hat f_h(x) - f(x)\}^2\, dx,$$

which still keeps the right order when comparing the values of different estimates with each other, is usually considered. ISE stands for integrated squared error. For the same reason as above, the expectation

$$\mathrm{MISE}(\hat f_h) = E \int \{\hat f_h(x) - f(x)\}^2\, dx$$

is used. MISE stands for mean integrated squared error. A comparison between the use of ISE and MISE as error criterion is given by Jones (1991). The main statement is that deriving the optimal bandwidth according to MISE is better than optimizing ISE. The author says that "this is based on the argument that estimating f well from X [the sample, author] alone is often an unrealistic ambition", which may sound a little confusing to the less experienced reader.
As with the ordinary mean squared error

$$\mathrm{MSE}(\hat f_h(x)) = E\{\hat f_h(x) - f(x)\}^2,$$

which measures the error at a certain point $x$, a decomposition into a bias term and a variance term can also be achieved relatively easily for the MISE expression.
Most of the literature on kernel density estimation is about minimizing the MISE criterion and the AMISE criterion (see section 2.2.7), respectively. Nevertheless, there are other criteria which lead to density estimates that seem more appropriate to a "human feeling" of what the real density looks like. Some such concepts, besides the limit of the $L_p$-distance, are briefly discussed in the following section.
2.2.3.3 $L_\infty$ and alternative criteria
Of course, any $p$ can be chosen in the $L_p$-distance, but for values of $p$ other than one or two there are no longer useful properties, as discussed in the last two sections. The only interesting case is $p \to \infty$. This amounts to minimizing the maximal absolute error between $f$ and $g$, or again its expectation. The reader now has an idea of how varying $p$ changes the aims of the estimate. With $p = 1$, the goal is to minimize the area between the curves, regardless of the fact that the same area can arise in several different ways, whereas with $p \to \infty$ a uniformly good performance is the only goal, and it does not matter whether the difference between the curves appears only at a single point or everywhere.
All the measures described so far are strict mathematical criteria, measuring in different ways the similarity of two functions, but this is not necessarily the kind of optimization that points out the structure of the underlying density in a proper way. For example, the MISE criterion, and even more so the $L_\infty$-criterion, ignore the fit of the tails of the distribution, which is especially important in high dimensions (see Figure 11 and section 2.3.4).
In particular, for an exploratory view, one could be interested in the location and
the number of modes in the distribution. Park and Turlach (1992) calculated the
average distance of absolute deviations from the estimated modes to the true
modes of the distribution, as well as the number of modes in the estimated density
compared with the true one.
The problem with such measures is that they can only be evaluated in a simulation study, because the exact positions and the number of the modes have to be known. Unfortunately, optimization with respect to this kind of error criterion becomes more inaccurate as the number of dimensions increases in a multivariate setting.
It is apparent that anyone can create such measures, and as long as it is not defined when a density is estimated well, several criteria should be considered. Marron and Tsybakov (1995) wrote an interesting paper in which they underline the importance of also measuring a kind of horizontal distance between two densities. Since the commonly used criteria take only vertical distances into consideration, the authors emphasize that "the eye uses both horizontal and vertical information". In any case, much research is needed to give better answers than those.
2.2.4 Calculations of the error criteria
This section treats only the calculation of error criteria based on the $L_1$- and the $L_2$-distance, respectively. As the reader might have assumed when reading the last section, there exist almost no parameter selection rules based on criteria other than the $L_p$-distances. Here, other techniques would probably have to be used, and these are either subject to future research or beyond the scope of this work.
2.2.4.1 MISE- and AMISE- calculations
A problem with the MISE expression is that it depends on the bandwidth $h$ in a complicated way; a way of overcoming this problem is to use large-sample approximations of the bias and variance terms which occur in the decomposition suggested above. For this, some assumptions have to be made, which are discussed in Wand and Jones (1995). The most important ones are

$$\lim_{n \to \infty} h = 0 \qquad\text{and}\qquad \lim_{n \to \infty} nh = \infty, \tag{2.1}$$

which means that $h$ approaches zero, but at a rate slower than $n^{-1}$. For the bias calculation at a certain point $x$, a change of variables and a Taylor expansion lead to the asymptotic unbiasedness (as $h \to 0$) of the estimator,

$$E(\hat f_h(x)) - f(x) = \tfrac{1}{2}\, h^2\, \mu_2(K)\, f''(x) + o(h^2), \tag{2.2}$$

where $\mu_2(K) = \int z^2 K(z)\, dz$ denotes a functional (the variance) of the kernel (Wand and Jones, 1995). This expression makes clear that we can reduce the bias by reducing the bandwidth $h$, and that for a fixed $h$ the bias at a certain point $x$ is proportional to the second derivative $f''(x)$ of the unknown density $f$, as is shown in Figure 3, motivated by Härdle (1991, p. 57).
Since typical values of $h$ produced by different estimators lie in the interval $[0.5, 1]$ for $n$ of about 100 to 1000, the graph gives insight into how the bias changes over the range of $X$.
For the variance of the estimate, the following expression can be derived:

$$\mathrm{Var}(\hat f_h(x)) = \frac{1}{nh}\, R(K)\, f(x) + o\!\left(\frac{1}{nh}\right), \tag{2.3}$$

where $R(K) = \int K(x)^2\, dx$.
Figure 3: The original density f(x) and the expectations of kernel estimates for two different values of h (h = 0.5 and h = 1).
A quick look at (2.3) makes the following evident:
Because of the assumptions, the variance tends to zero as $n$ increases.
One can achieve a small variance with large values of $n$ (reasonable) and with large values of $h$.
Summing up (2.3) and the square of (2.2) and integrating leads to

$$\mathrm{MISE}(\hat f_h) = \mathrm{AMISE}(\hat f_h) + o\!\left(\frac{1}{nh} + h^4\right),$$

$$\mathrm{AMISE}(\hat f_h) = \frac{1}{nh}\, R(K) + \frac{1}{4}\, h^4\, \mu_2(K)^2\, R(f''), \tag{2.4}$$

and setting the first derivative of (2.4) with respect to $h$ equal to zero yields the optimal bandwidth

$$h_{\mathrm{AMISE}} = \left[ \frac{R(K)}{\mu_2(K)^2\, R(f'')\, n} \right]^{1/5}. \tag{2.5}$$
This formula for the AMISE shows very clearly the conflict between reducing the variance and the bias simultaneously, since the first term represents the integrated variance and the second one the integrated squared bias. On the one hand, $h$ should be chosen small in order to achieve a small bias. In this case the variance is large: the density estimate fluctuates heavily, depending on the exact positions of the data points. The arithmetic mean of these estimates (if the experiment were repeated several times) is close to $f(x)$, but this is of little use in the individual case. On the other hand, a large value of $h$ is necessary to keep the variance at a low level. The resulting estimate is then a function close to the kernel itself, in general containing a huge bias. A solution to this problem has to respect both sides of the trade-off.
The problem that an AMISE-optimal bandwidth can only be calculated if the underlying density is known is circular and therefore paradoxical. Ways to escape this circle in order to obtain concrete values for $h$ are given in section 2.2.7.
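To illustrate (2.4) and (2.5) numerically, the following sketch (illustrative Python only; it assumes a Gaussian kernel and a standard normal underlying density, for which $R(K) = 1/(2\sqrt{\pi})$, $\mu_2(K) = 1$ and $R(f'') = 3/(8\sqrt{\pi})$ are the known values) evaluates the AMISE over a grid of bandwidths and compares the numerical minimizer with the closed form:

import numpy as np

n   = 200
RK  = 1 / (2 * np.sqrt(np.pi))     # R(K) for the Gaussian kernel
mu2 = 1.0                          # mu_2(K) for the Gaussian kernel
Rf2 = 3 / (8 * np.sqrt(np.pi))     # R(f'') for the standard normal density

def amise(h):
    # integrated variance + integrated squared bias, formula (2.4)
    return RK / (n * h) + 0.25 * h ** 4 * mu2 ** 2 * Rf2

h_grid = np.linspace(0.05, 1.5, 1000)
h_numeric = h_grid[np.argmin(amise(h_grid))]
h_formula = (RK / (mu2 ** 2 * Rf2 * n)) ** 0.2   # closed form (2.5), about 1.06 * n**(-1/5)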
In some cases it is even possible to give an exact expression for the MISE. This is possible if the underlying density $f$ is normal, or at least a mixture of several normal densities, and the Gaussian kernel is used (Wand and Jones, 1995). Since the class of normal mixture distributions allows a large variety of possible densities, this is a rather useful result. Whether such a restriction on the kernel is crucial or not is discussed in section 2.2.5. Some really interesting results about how well the AMISE approximates the MISE were obtained by Marron and Wand (1992).
For several uni- and multimodal normal-mixture densities the authors considered the behaviour of MISE, AMISE and their bias and variance components, as well as the resulting optimal bandwidths $h_{\mathrm{MISE}}$ and $h_{\mathrm{AMISE}}$, respectively. Figure 4 and Figure 5 give an impression of what can happen with the MISE approximation.
Figure 4: Plots of MISE, AMISE (bowl-shaped curves) and their additive
components, the integrated variance (IV), the integrated squared bias (ISB)
and their asymptotic counterparts for n=100 realisations of the underlying
density (small picture).
In Figure 4, $f$ was chosen as $N(0,1)$ and a Gaussian kernel was used. Note that the variance approximation is quite good and uniform, whereas the bias approximation is only good for small $h$ and poor otherwise. Even though the bias approximation is bad, the difference in the resulting bandwidths is negligible.
An even worse approximation is obtained in Figure 5, where the underlying density $f$ is bimodal with several spikes in it, a so-called claw density. This is caused by a very large value of $\int (f'')^2$, to which the asymptotic integrated squared bias is proportional. In both graphs $n = 100$ was chosen. Marron and Wand (1992) also discovered examples where $h_{\mathrm{MISE}}$ is not uniquely defined because the MISE has at least two local minima.
Figure 5: MISE, IV, ISB, AMISE, AIV and AISB for the underlying density
(small picture, n=100).
Although these examples are probably only of academic nature, they give insight into how poorly a bandwidth chosen by minimizing the AMISE can perform in application.
2.2.4.2 MIAE- calculations
While it was stressed in the last section that the calculation of the MISE has to be done numerically and no explicit dependency on $h$ can be given, the situation for calculations concerning the $L_1$-criterion is even worse. A decomposition

$$\mathrm{EIAE} = J(n, h) + o\!\left(h^2 + (nh)^{-1/2}\right),$$

$$J(n, h) = \alpha \int \sqrt{\frac{f}{nh}}\;\psi\!\left(\frac{\beta}{\alpha}\,\sqrt{nh^5}\,\frac{f''}{\sqrt{f}}\right), \tag{2.6}$$

where the values $\alpha$ and $\beta$ are kernel functionals and

$$\psi(u) = \sqrt{\frac{2}{\pi}}\left( u \int_0^u e^{-x^2/2}\,dx + e^{-u^2/2} \right),$$
can be derived as well (Devroye and Györfi, 1985), but it is difficult to see how
the kernel functionals and the bandwidth can be optimized separately.
Nevertheless, the upper bound
$$J(n, h) \le \beta\, h^2 \int |f''| \;+\; \sqrt{\frac{2}{\pi}}\,\alpha \int \sqrt{\frac{f}{nh}}$$

allows again a separation into a bias (first) and a variance term (second). If one optimizes this expression with respect to $h$ and chooses, for example, one popular kernel, the Epanechnikov kernel ($K(x) = \tfrac{3}{4}(1 - x^2)\, I(|x| < 1)$), the optimal bandwidth can be calculated as

$$h_{\mathrm{opt}} = \left[\frac{15}{2\pi}\right]^{1/5} \left[\frac{\int \sqrt{f}}{\int |f''|}\right]^{2/5} n^{-1/5}. \tag{2.7}$$
As in the formula for $h_{\mathrm{AMISE}}$, there is again a dependency $h \sim n^{-1/5}$, and the order terms of the EIAE are the square roots of the order terms of the MISE. That is natural, because the MISE criterion is squared. Again, the unknown density is needed to calculate $h_{\mathrm{opt}}$.
2.2.5 Optimality properties of the kernel
After defining the optimization criteria and getting a first impression of the bandwidth choice, it makes sense to take a closer look at the importance of the kernel choice.
There are many kernels which satisfy the restrictions from section 2.2.2. The formulas and graphs of the most commonly used kernels can be found, e.g., in Härdle (1991). As pointed out in the derivation of the AMISE formula, a functional of the kernel is included in both of the additive terms and is coupled with the bandwidth $h$. Nevertheless, it is possible to separate them from each other by rescaling, which leads to so-called canonical kernels; these have the property that, for a given bandwidth, they all produce roughly the same amount of smoothing. Transforming the functional $R(K)$ in (2.4) to obtain kernel functionals which are independent of $h$ makes a comparison between different kernels, and the differences in how efficiently they use the data, transparent. The kernel which minimizes this criterion is the Epanechnikov kernel (Silverman, 1986), whose definition has already been given above. However, the efficiency in achieving a certain accuracy (in terms of the number of observations) does not decline heavily for most of the other common kernels, and it therefore makes almost no difference which kernel is chosen. Even for the triangular and the uniform kernel, fewer than 10% additional data points are necessary to obtain the same AMISE. The bandwidth transformations needed to achieve the same amount of smoothing with different kernels are listed in Härdle (1991, p. 76).
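These efficiency statements can be checked directly from the kernel constant $C(K) = \{\mu_2(K)^2 R(K)^4\}^{1/5}$ that appears in the AMISE minimum (see section 2.2.6.2). The following sketch (illustrative Python; the values of $R(K)$ and $\mu_2(K)$ are the standard ones for the respective kernels) computes the factor by which the sample size has to grow to match the Epanechnikov kernel:

import numpy as np

kernels = {                                     # (R(K), mu2(K)) for some common kernels
    "epanechnikov": (3 / 5, 1 / 5),
    "gaussian":     (1 / (2 * np.sqrt(np.pi)), 1.0),
    "triangular":   (2 / 3, 1 / 6),
    "uniform":      (1 / 2, 1 / 3),
}
C = {name: (mu2 ** 2 * R ** 4) ** 0.2 for name, (R, mu2) in kernels.items()}
# the AMISE minimum is proportional to C(K), so the relative sample size needed
# to reach the same AMISE is (C(K) / C(K_Epanechnikov))^(5/4)
extra = {name: (c / C["epanechnikov"]) ** 1.25 for name, c in C.items()}
# roughly: gaussian 1.05, triangular 1.01, uniform 1.08 -- all below the 10% mentioned above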
The kernel functionals $\alpha$ and $\beta$ in section 2.2.4.2 are constructed like their counterparts in the AMISE, only their involvement in the formula is more complicated. Therefore one may conjecture that the Epanechnikov kernel is the optimal kernel in $L_1$-theory as well; however, I have found no proof of this so far. Unlike the results of the discussion about optimal bandwidths, this result can be applied without any restriction (given that one accepts minimizing the asymptotic versions of the error criteria and treats such a solution as the best one for revealing the structure of the underlying density).
Until now only non-negative kernels satisfying the assumptions

$$\int z\, K(z)\,dz = 0 \qquad\text{and}\qquad \mu_2(K) = \int z^2 K(z)\,dz \ne 0$$

have been taken into account. On a closer view of (2.4), it is evident that the bias can be reduced by defining kernels having $\mu_2(K) = 0$ and a higher moment different from zero. If this technique (additionally setting $\mu_i(K) = 0$ for $i = 3, 4, 5, \ldots$) is applied again and again, even higher terms of the Taylor approximation (2.2) can be eliminated. Such a kernel, satisfying the assumptions

$$\mu_j(K) = \int x^j K(x)\,dx \;=\; \begin{cases} 1 & \text{for } j = 0, \\ 0 & \text{for } j = 1, 2, \ldots, p-1, \\ \ne 0 & \text{for } j = p, \end{cases}$$

is called a kernel of order $p$. Another effect of higher-order kernels ($p > 2$) is the improvement of the convergence rate of the minimal AMISE from $n^{-4/5}$ to $n^{-2p/(2p+1)}$. This means that one only has to wait sufficiently long (in terms of increasing $n$) to obtain a smaller value of the minimized AMISE than with a kernel of lower order. However, in application one is interested in the question of how large $n$ has to be, and in what happens before the asymptotics take effect. Marron and Wand (1992) also found even worse asymptotic bias approximations than in Figure 5, which makes the faster convergence to 0 rather pointless.
The problem of higher-order kernels ($p > 2$) concerning plausibility is that these moment restrictions are only possible for kernels which are not everywhere non-negative. Since all kernel properties are inherited by the resulting density estimate, the estimate therefore in general no longer has an interpretation as a density function. Generally speaking, this concept seems to be only a theoretical construct, which improves the estimate by only a small amount but pays the price of a loss of interpretability as well as more difficult mathematical tractability. For these reasons I am not going to use it in my application.
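For concreteness, a frequently quoted example of such a construction, built from the Gaussian kernel $\varphi$ (see e.g. Wand and Jones, 1995), is the fourth-order kernel

$$K_{[4]}(u) = \tfrac{1}{2}\,(3 - u^2)\,\varphi(u),$$

which satisfies $\mu_0(K_{[4]}) = 1$, $\mu_2(K_{[4]}) = 0$ and $\mu_4(K_{[4]}) = -3 \ne 0$, but is negative for $|u| > \sqrt{3}$ -- exactly the loss of the density interpretation described above.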
2.2.6 Further developments of the model to improve the estimation
So far, at least the problem of a proper kernel choice for the basic kernel density estimator seems to be solved somewhat satisfactorily. But improvements can easily be thought of, given the fact that only one smoothing parameter is used in all regions of the distribution.
2.2.6.1 Variable bandwidths
One might expect that (for example in the case of skewed distributions) different bandwidths within one estimate lead to more flexible approximations. This is illustrated in Figure 6, where the problem becomes fairly transparent: neither the right nor the left picture seems to fit the curve suitably, because one has to decide whether the bump or the tail should be estimated properly. The bandwidth has to be adapted in different areas of the curve. To get an idea of whether the density in a certain range is high or low, one can take the distance from $x_j$ to its $k$-th nearest neighbour.
This is done in the definition of the variable kernel density estimator (Breiman et al., 1977),

$$\hat f_h(x) = \frac{1}{n} \sum_{j=1}^{n} \frac{1}{h\, d_{j,k}}\, K\!\left(\frac{x - x_j}{h\, d_{j,k}}\right), \tag{2.8}$$
where $d_{j,k}$ denotes exactly this distance. This seems to be quite a good concept, and e.g. the simulation study of Remme et al. (1980) concerning kernel discriminant analysis shows a clear dominance of the variable kernel density estimator over the standard model, for example for skewed distributions. The authors gave ranges for the choice of the new parameter $k$, but they also pointed out that the choice of $k$ is not that important.
Figure 6: Kernel estimate using different bandwidths. The underlying
density (solid) and the concerning kernel density estimate (dotted).
For example, $k$ should be chosen smaller for skewed distributions than for symmetric ones. Nevertheless, no fully automatic procedure exists, and one can never be sure of having the best or at least a good estimate (see Breiman et al., 1977).
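A direct translation of (2.8) into code might look as follows (an illustrative Python sketch with a Gaussian kernel; the function name and the default k are chosen here and are not taken from Breiman et al.):

import numpy as np

def variable_kde(x_grid, data, h, k=5):
    # Variable kernel estimator (2.8): every observation x_j gets its own bandwidth
    # h * d_{j,k}, where d_{j,k} is the distance to its k-th nearest neighbour.
    data = np.asarray(data, dtype=float)
    dist = np.abs(data[:, None] - data[None, :])     # pairwise distances
    d_jk = np.sort(dist, axis=1)[:, k]               # column 0 is the distance to the point itself
    local_h = h * d_jk                               # per-observation bandwidths
    u = (x_grid[:, None] - data[None, :]) / local_h[None, :]
    kern = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    return (kern / local_h[None, :]).sum(axis=1) / len(data)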
If we take another look at (2.7), it can be seen that the smoothing factor $h$ should be chosen large if $1/f$ and $f''$ are small. This means that in areas of high curvature of $f$ (large $f''$) or of smaller values of $f$, smaller bandwidths are required (Devroye and Györfi, 1985). The authors argued that $d_{j,k}$ does not take these facts into consideration, which makes this setting asymptotically suboptimal. Even worse, if the curvature is fixed, higher values of $f$ require larger bandwidths, and not smaller ones as was suggested above (Devroye and Györfi, 1985). However, measures based on functionals (such as $\int \sqrt{f}$) actually say nothing about the local behaviour of $f$. In any case, those results seem rather controversial.
This result also makes one interesting fact apparent. While the optimizations of $h$ with respect to the $L_1$- and the $L_2$-distance have many common properties and lead to similar estimates, there is a striking difference: a functional of the unknown density $f$ itself appears in the numerator of the AMIAE-optimal bandwidth (2.7), which has no counterpart in the AMISE-optimal bandwidth formula (2.5). Maybe this should also be taken into account when talking about optimization.
Another approach allowing non-equal bandwidths is to adapt the bandwidth by defining a function $h(x)$, in order to vary the bandwidth and the resulting kernel at every point of the axis. The formula for the so-called local kernel density estimator is the same as above, only with $h\, d_{j,k}$ replaced by $h(x)$. The additional parameter required is the function $h(\cdot)$ itself; therefore a pilot estimate has to be built up. One method is again to look at the nearest neighbours, in this case neighbours of $x$, which again results in $h(x) \sim 1/f(x)$, but this also has some unsatisfactory properties (Wand and Jones, 1995). Finally, it should not go unmentioned that the resulting local kernel density estimate is in general no longer a density, since it does not integrate to one.
2.2.6.2 Transformed kernel density estimator
Returning to Figure 6 from the last section, the problem of an unsatisfactory fit can also be solved in another way. If one wants to hold on to the basic model with one bandwidth, there is at least one additional technique.
The poor behaviour of several modern bandwidth selectors for this type of distribution is shown in example 6 of Sheather (1992, p. 246), a highly right-skewed distribution: no bandwidth, whichever one chooses for the estimation, will be appropriate.
As an introduction to the following concept, one should again look at a kind of MISE formula, in this case at the minimum of the AMISE, achieved at $h_{\mathrm{AMISE}}$:
$$\inf_{h>0} \mathrm{MISE}(\hat f_h) \approx \inf_{h>0} \mathrm{AMISE}(\hat f_h) = \tfrac{5}{4}\, C(K)\, R(f'')^{1/5}\, n^{-4/5},$$

where $C(K) = \{\mu_2(K)^2 R(K)^4\}^{1/5}$ denotes exactly the functional of the canonical kernel used for the scale-invariant comparison in section 2.2.5 (Wand and Jones, 1995).
Again there is a functional which can be separated from the others, in this case the functional $R(f'')$. This fact allows us to find (for $n$ and $K$ fixed) densities which are "better" to estimate than others, i.e. which produce a lower AMISE minimum.
However, this provides no conclusion about which shapes of densities are preferred, because every density can be made sufficiently smooth by rescaling it. The way to proceed here is to scale this functional by a function depending on $\sigma(f)$, obtaining a scale-invariant functional of $f$, $D(f)$. The values of $D(f)$ for different densities, in proportion to the best one ($D(f^*)$, where $f^*$ is the beta(4,4) density standardized to the interval $[-1, 1]$), are given in Table 1, and the corresponding densities are shown in Figure 7 (Wand and Jones, 1995). The numbers are interpreted as the ratio of sample sizes needed to achieve the same AMISE.
Table 1: Efficiency on how to estimate different densities.

Density                                         D(f*)/D(f)
a) Beta(4,4)                                    1
b) Normal                                       0.908
c) Extreme value                                0.688
d) (3/4) N(0,1) + (1/4) N(3/2, (1/3)^2)         0.568
e) (1/2) N(-1, 4/9) + (1/2) N(1, 4/9)           0.536
f) Gamma(3)                                     0.327
g) (2/3) N(0,1) + (1/3) N(0, 1/100)             0.114
h) Lognormal                                    0.053
Figure 7: The corresponding density types for Table 1.
To get back to the topic of this section, the data can be used much more efficiently by transforming, for example, a skewed, kurtotic or multimodal density into an unskewed, unkurtotic or unimodal one, respectively. After estimating the density on the transformed scale and transforming back, a much better result (in the sense of an AMISE reduction) can be observed.
Ideas on how to proceed in application are given by Devroye and Györfi (1985) and Wand and Jones (1995). A reasonable way in any case is to obtain some characteristics of the distribution, such as variance, skewness, kurtosis, or an idea about the shape (a pilot estimate). However, for a fully automatic transformation procedure one cannot "look" at the distribution. Either a function mapping some sample moments to the coefficients of a parametric transformation family (e.g. the shifted power family described in Wand and Jones (1995)) is needed, or a non-parametric approach, as described in Ruppert and Cline (1994), has to be carried out. The first one immediately provides the inverse function for a back-transformation of the estimated density, but necessarily needs the coefficients, as just described. The second method transforms the data easily without any knowledge of parameters, but requires many more calculations.
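As a minimal sketch of the parametric route for positive, right-skewed data -- assuming a simple logarithmic transformation instead of the shifted power family, and reusing the kde_gauss function sketched in section 2.2.2 -- the back-transformation follows from the change-of-variables formula $f_X(x) = f_Y(\log x)/x$:

import numpy as np

def log_transformed_kde(x_grid, data, h):
    # estimate the density of Y = log(X) and transform back; x_grid and data must be positive
    y = np.log(np.asarray(data, dtype=float))
    f_y = kde_gauss(np.log(x_grid), y, h)   # ordinary kernel estimate on the log scale
    return f_y / x_grid                     # Jacobian of the back-transformation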
2.2.6.3 Density estimation at boundaries
Another problem, which occurs for example with life-time distributions, is, besides the fact that they are often highly skewed, that a large part of the probability mass is often concentrated relatively close to zero, or more generally near a boundary of the support. Thus, the resulting kernel density estimate often has considerable mass at values outside the support of $f$.
Since setting the estimated density equal to zero outside the support and "blowing up" the positive part of the estimate so that it integrates to one is often quite a bad approach, Silverman (1986) proposed a better one. He showed that doubling the data by reflecting the data points at the boundary, estimating the density, cutting the estimate at the reflection point and doubling the resulting values leads to a better estimate. Nevertheless, this also has a main drawback, since there is always a zero slope at the reflection point (the boundary), which occurs nowhere in typical right-skewed life-time distributions like the gamma or Weibull distributions. A more sophisticated approach is described in Wand and Jones (1995), where so-called boundary kernels are used.
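The reflection idea can be sketched in a few lines (illustrative Python, again reusing kde_gauss from section 2.2.2; the boundary defaults to zero as in the life-time example):

import numpy as np

def reflected_kde(x_grid, data, h, boundary=0.0):
    # Reflection method (Silverman, 1986): mirror the data at the boundary, estimate on the
    # doubled sample, double the result on the support and set it to zero beyond the boundary.
    data = np.asarray(data, dtype=float)
    augmented = np.concatenate([data, 2.0 * boundary - data])
    f_aug = kde_gauss(x_grid, augmented, h)
    return np.where(x_grid >= boundary, 2.0 * f_aug, 0.0)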
2.2.7 Bandwidth selection
The question of how to make an appropriate choice of the bandwidth (or "smoothing parameter") $h$ is probably the most extensive and controversial subject in the field of kernel density estimation. A whole host of ideas, approaches and methods have been studied and discarded again. The wide field can be structured into more or less "subjective methods", where at least one parameter has to be chosen by a human, and so-called "data-driven" or "automatic" bandwidth selectors, where everything depends only on the data and no experience is needed to get a reasonable fit. The data-driven selectors can be further separated, with respect to their historical development, into first- and second-generation methods (Jones et al., 1996).
In the context of my simulation study, and regarding the fact that the number of variables and thus the number of bandwidths to select is increasing in many applications nowadays, it is impossible to spend time on each univariate density estimate (or on the several parameters of a multivariate density estimate), which is what happens when the smoothing parameters are chosen interactively. Thus, the following discussion treats almost exclusively the data-driven case.
Nevertheless, the other methods are also quite useful, for example an interactive choice for exploratory purposes, which is the classical subjective method. Varying the smoothing parameter and watching the resulting shape until one has found a compromise between under- and oversmoothing is probably the best thing to do with respect to the "sensitivity of the human eye". This approach is probably better at detecting the number and location of modes (compare section 2.2.3.3 as well as Marron and Tsybakov (1995)), and there is no need to create complicated formulas to achieve the same estimate. Furthermore, this method provides a deeper insight into the data, because one gets different views of the possible structure of the unknown density.
However, for the sake of an objective choice, I now turn to the wide field of automatic bandwidth selection.
2.2.7.1 Referring to standard distributions
For choosing the bandwidth automatically, it is necessary to take another close look at the $L_1$- and $L_2$-minimizing bandwidths ((2.7) and (2.5), respectively). The first thing which becomes apparent is the already mentioned fact that the unknown functionals $R(f'') = \int f''(x)^2\,dx$, $\int |f''(x)|\,dx$ or $\int \sqrt{f(x)}\,dx$ are required. An approach which is quite simple, but often used, is to refer to a standard distribution and set $f$ equal to, for example, the normal distribution with variance $\sigma^2$. This is called the Rule of Thumb, and was suggested by Silverman (1986).
The resulting formula for $h_{opt}$ is
$$h_{opt} \approx 1.06\,\sigma\, n^{-1/5} \qquad (2.9)$$
and can be calculated by plugging in a point estimator $\hat\sigma^2$ for $\sigma^2$, for example the sample variance. As seen in section 2.2.6.2 (Table 1), this method oversmooths in any case of multimodality or skewness of the density, because the bandwidth is chosen close to the upper bound for $h_{AMISE}$, namely $h_{OVERSMOOTH}$ (the resulting bandwidth when setting $f$ equal to the transformed beta(4,4)-density).
Another possibility is the use of a more robust estimate for $\sigma$, the inter-quartile range $R$. Because $R$ is approximately 1.35 times as large as the standard deviation for normal densities, (2.9) has to be adapted to
$$h_{opt} \approx 0.79\, R\, n^{-1/5} \qquad (2.10)$$
A discussion about which one is better can be found in Silverman (1986) and Sheather (1992, sect. 2.3). Finally, maybe the best choice is to take the minimum of $\hat\sigma$ and $\hat R/1.35$ and correct the value by a constant in between those of (2.9) and (2.10), say 0.9 (Silverman, 1986).
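The three reference rules just described can be transcribed directly; the sketch below assumes a Gaussian kernel and uses the constants 1.06, 0.79 and 0.9 from (2.9), (2.10) and the combined rule.

```python
import numpy as np

def rule_of_thumb_bandwidths(x):
    """Normal-reference bandwidths for a univariate sample x."""
    n = len(x)
    sigma = x.std(ddof=1)                                # sample standard deviation
    iqr = np.subtract(*np.percentile(x, [75, 25]))       # inter-quartile range R
    h_sd = 1.06 * sigma * n ** (-1 / 5)                  # (2.9)
    h_iqr = 0.79 * iqr * n ** (-1 / 5)                   # (2.10)
    h_min = 0.90 * min(sigma, iqr / 1.35) * n ** (-1 / 5)  # combined rule
    return h_sd, h_iqr, h_min

rng = np.random.default_rng(2)
print(rule_of_thumb_bandwidths(rng.normal(size=200)))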
Another reference distribution is the transformed beta(4,4)-density (see Table 1), which always oversmooths, but one can inspect graphs obtained by dividing the resulting bandwidth by 2, 3, 4, etc. However, this is actually a special case of a subjective choice where an upper bound is given, and therefore no longer interesting here.
To show how poor those choices are when used for multimodal distributions, Figure 8, reproduced from Silverman (1986), should give an impression.
It shows the ratio of $h_{AMISE}$ (2.5) to the window widths given by the two rules of thumb and by $h_{OVERSMOOTH}$, respectively. The interested reader can find similar graphs for skewness and kurtosis in Silverman (1986). The third Rule of Thumb always equals the first in this case, and the corresponding bandwidth ratios are therefore not plotted.
The fact that $h_{OVERSMOOTH}$ does not always have the largest value (the smallest bandwidth ratio) is due to the different estimates of $\sigma$: $h_{OVERSMOOTH}$ is only the largest when both densities are rescaled either with $\hat\sigma$ or with $\hat R$.
Figure 8: Ratio of the AMISE-optimal bandwidth to window widths chosen by different reference distributions (Rule of Thumb based on the standard deviation, Rule of Thumb based on the interquartile range, and Beta(4,4)), plotted against the distance between the modes. The produced densities were mixtures of two standard normals, whose distance between their modes was varied.
The following methods are already more sophisticated, but still belong to the methods of the first generation.
2.2.7.2 Cross-validation methods
The statistical concept of cross-validation can certainly also be applied in density estimation. As already discussed in section 2.2.3.2, there are different opinions about minimizing with respect to ISE and MISE, respectively (see Jones, 1991). Both are used in the context of cross-validation.
Least-Squares Cross-Validation
This selector, undisputed in the 1980s, concentrates on the minimization of the ISE. To this end the ISE of $\hat f_h$ is separated into its components:
$$\mathrm{ISE}(\hat f_h) = \int \hat f_h(x)^2\,dx - 2\int \hat f_h(x) f(x)\,dx + \int f(x)^2\,dx .$$
Since $\int f(x)^2\,dx$ does not depend on $h$, it is sufficient to keep an eye on the first two terms. Bowman (1984) formulates the minimization problem by giving unbiased estimators for these terms and obtains
$$\mathrm{LSCV}(h) = \int \hat f_h(x)^2\,dx - \frac{2}{n}\sum_{i=1}^{n} \hat f_{h,-i}(x_i), \qquad (2.11)$$
where
$$\hat f_{h,-i}(x) = \frac{1}{(n-1)\,h}\sum_{j \ne i} K\!\left(\frac{x - x_j}{h}\right)$$
denotes the leave-one-out kernel density estimator, i.e. the estimator obtained when data point $i$ is left out. Again, using the Gaussian kernel simplifies the problem and makes the integral solvable analytically (Bowman, 1984).
This selector has one main problem. If the data are highly discretized (to a certain degree every continuous dataset is discretized), which means, roughly speaking, that there are too many equal values, the overall minimizer of $\mathrm{LSCV}(h)$ will be found at $h = 0$ (Silverman, 1986), which is an unreasonable choice. The threshold where this occurs depends on the kernel. Examples for this case are given by Sheather (1992, examples 2-4).
Another weakness, concerning convergence, will be pointed out in the section about asymptotic behaviour. Nevertheless, it is still a good and transparent concept, which can also be easily generalized to the multivariate case, and I therefore use it in chapter 4.
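A brute-force sketch of (2.11) for the Gaussian kernel; the integral term uses the fact that the convolution of two normal kernels is again normal, and a simple grid search replaces a proper optimizer.

```python
import numpy as np

def lscv_score(x, h):
    """Least-squares cross-validation criterion (2.11) for a Gaussian kernel."""
    n = len(x)
    d = x[:, None] - x[None, :]
    # integral of fhat^2: convolution of two normal kernels has sd h*sqrt(2)
    s = h * np.sqrt(2.0)
    int_f2 = np.exp(-0.5 * (d / s) ** 2).sum() / (n**2 * s * np.sqrt(2 * np.pi))
    # leave-one-out term: (1/n) * sum_i fhat_{h,-i}(x_i)
    k = np.exp(-0.5 * (d / h) ** 2) / (h * np.sqrt(2 * np.pi))
    loo = (k.sum(axis=1) - k.diagonal()) / (n - 1)
    return int_f2 - 2.0 * loo.mean()

rng = np.random.default_rng(3)
x = rng.normal(size=200)
hs = np.linspace(0.05, 1.5, 60)
h_lscv = hs[np.argmin([lscv_score(x, h) for h in hs])]
```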
Biased Cross-Validation (BCV)
The difference between LSCV and the biased cross-validation method is that here, minimization is based on the asymptotic MISE (2.4), in which the functional $R(f'')$ is estimated. As the name says, the resulting bandwidth is biased, but it has a smaller variance than LSCV. Sometimes the problem of more than one minimum also occurs. Ideas about which one to choose are discussed in Marron (1993) and Sheather (1992).
(Pseudo-)Likelihood Cross-Validation (LCV)
The LCV selector was maybe the first commonly used automatic bandwidth selector, because it is based on a basic statistical concept, maximum-likelihood optimization. The criterion to maximize is
$$\mathrm{LCV}(h) = \prod_{i=1}^{n} \hat f_{h,-i}(x_i), \qquad (2.12)$$
where $\hat f_{h,-i}$ is defined as above. The leave-one-out estimator is used because otherwise the criterion is maximized at $h = 0$, and the resulting estimate would be a sum of Dirac functions at the data points. One problem is that the window width has to be (in case of kernels with bounded support) at least as big as the distance from the furthest outlier to its closest point, because LCV is zero otherwise. Thus, the resulting bandwidth often tends to oversmooth the function. In this respect, Silverman (1986) pointed out at least a theoretical problem: since the just described distance (often called "gap") does not get smaller as $n$ increases for densities which vanish at an exponential rate or more slowly, $h$ does not converge to zero, which is obligatory for every reasonable estimator. Actually, this concept is nowadays sorted out from the spectrum of bandwidth selectors, but nevertheless the simulation study of Remme et al. (1980), which is the pattern for my work, obtained quite good results with it even 22 years ago, which makes my intention to achieve a dominance over LDA and QDA (applied in discriminant analysis) feasible.
2.2.7.3 Plug-in methods
While Jones (1996) called the selectors in sections 2.2.7.1 and 2.2.7.2 methods of the first generation, the following belong to the second one. A family of methods which follow straightforwardly from BCV are the so-called plug-in methods. They were investigated in the early 1990s and seem to represent the "state of the art" (Bowman and Azzalini, 1997; Jones et al., 1996; Wand and Jones, 1995; Sheather, 1992; Park and Turlach, 1992; Cao et al., 1994; etc.).
The common feature of all plug-in methods is that they include an estimate of the unknown density functional $R(f'')$ in (2.5). This estimate is itself a kernel estimate, but in general it does not use the same smoothing parameter and kernel as for estimating $f$, which distinguishes it from the biased cross-validation concept. One uses the fact that
$$R(f^{(s)}) = \int f^{(s)}(x)^2\,dx = (-1)^s \int f^{(2s)}(x)\, f(x)\,dx .$$
For the estimation of such density derivative functionals it is therefore sufficient to study functionals of the form
$$\psi_r = \int f^{(r)}(x)\, f(x)\,dx = E\{ f^{(r)}(X) \}$$
for $r$ even. This leads to the estimator
$$\hat\psi_r(g) = \frac{1}{n}\sum_{i=1}^{n} \hat f_g^{(r)}(x_i) = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} L_g^{(r)}(x_i - x_j), \qquad (2.13)$$
where $g$ and $L$ are, respectively, a bandwidth and a kernel that are in general different from $h$ and $K$.
To find a good choice of $g$ for estimating $\hat\psi_r(g)$, a reasonable option is $g_{AMSE}$, since this can be calculated in closed form as well (see Sheather and Jones, 1991; Park and Marron, 1990).
Unfortunately, the formula for $g_{AMSE}$ in turn contains a functional of the unknown density which is of one degree higher (e.g. the AMSE-optimal bandwidth for estimating $R(f'') = \psi_4$ needs $\psi_6$). This fact is again evidence that one cannot escape the problem of having to know $f$ in order to estimate $f$ best. The only thing one can do is to shift the problem and derive an AMSE-optimal bandwidth for estimating $\psi_6$, say $g_1$. This is what is done in the direct plug-in approach. However, at a certain stage a pilot estimator has to be used, which is mostly chosen with reference to a standard distribution (see section 2.2.7.1).
So the question arises of how many stages to choose. Wand and Jones (1995) found that while the bias of $h$ grows smaller as the number of stages increases, the variance increases. It is also evident that additional stages cause additional computation time, which is not negligible either. The graph in Wand and Jones (1995, p. 73) leads one to assume that more than two or three stages are not necessary. However, this graph refers only to the estimation of one particular density.
To turn back to the problem of choosing a good bandwidth $h$, the AMISE-optimal formula looks as follows:
$$h = \left[ \frac{R(K)}{n\, \mu_2(K)^2\, R\big(\hat f''_{g(h)}\big)} \right]^{1/5}. \qquad (2.14)$$
This provides insight into how the concept can be improved. Since $h$ appears on both the left and the right side of the equation, solving the equation for $h$ is exactly what is wanted. This is done by the so-called solve-the-equation rules (see Jones et al., 1996; Sheather and Jones, 1991).
These methods require additional computation time, because the connection between $h$ on the right and on the left side is complicated, even when only one stage (referring to the direct plug-in approach) is included. To continue the process from before, one then has to write e.g. $R\big(\hat f''_{g_2(g_1(h))}\big)$ instead of $R\big(\hat f''_{g(h)}\big)$ in the denominator of (2.14).
The dominance of the plug-in methods over the others, at least in the univariate case, will be pointed out in section 2.2.7.5. The basic concept seems to go back to Park and Marron (1990). The improvement by Sheather and Jones (1991) consisted only in calculating the double sum in (2.13) over all $n^2$ pairs, whereas the Park-Marron estimate left out the terms with $i$ and $j$ equal and divided by $n(n-1)$. This improvement by adding a non-stochastic term is only slight. However, it is an improvement and therefore justified.
More detailed mathematical expositions of this topic, as well as similar concepts such as smoothed cross-validation and the smoothed bootstrap, are skipped here because the scope of this work is limited. Nevertheless, the interested reader will find many more details by studying either the indicated sources or at least the appendix of my thesis.
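To make the core plug-in computation concrete, the sketch below estimates ψ₄ = R(f″) by the double sum (2.13) with a Gaussian kernel (whose fourth derivative is known in closed form) and plugs it into (2.14). The pilot bandwidth g used here is a crude normal-scale choice of my own, not a proper multi-stage rule.

```python
import numpy as np

SQRT2PI = np.sqrt(2 * np.pi)

def psi4_hat(x, g):
    """Double-sum estimator (2.13) of psi_4 = R(f'') with a Gaussian kernel L."""
    u = (x[:, None] - x[None, :]) / g
    # fourth derivative of the standard normal density
    phi4 = (u**4 - 6 * u**2 + 3) * np.exp(-0.5 * u**2) / SQRT2PI
    return phi4.sum() / (len(x) ** 2 * g**5)

def direct_plugin_h(x):
    n, sigma = len(x), x.std(ddof=1)
    g = sigma * n ** (-1 / 7)           # crude pilot bandwidth (assumption, see text)
    psi4 = psi4_hat(x, g)
    # (2.14) with the Gaussian kernel K: R(K) = 1/(2*sqrt(pi)), mu_2(K) = 1
    return (1.0 / (2 * np.sqrt(np.pi)) / (n * psi4)) ** (1 / 5)

rng = np.random.default_rng(4)
print(direct_plugin_h(rng.normal(size=300)))   # close to 1.06 * n**(-1/5) for normal data
```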
2.2.7.4 Alternative methods
After discussing the modern plug-in rules in the last section, I briefly want to describe some other estimators which were used in the comparative simulation study of Cao et al. (1994).
As already discussed, there are also criteria other than the $L_p$-distances available for measuring the goodness of fit. One method which uses such an alternative is the IP method. It has its name from its dependence on the number of inflection points. To carry out this method in its initial version, one has to have an idea about the number of inflection points (denoted by $q$) in the density. The estimate $h_{IP}$ is simply defined as
$$h_{IP} = \inf\{\, h : \hat f_h \text{ has at most } q \text{ inflection points} \,\}.$$
Of course, the user does not always have an idea about $q$, and therefore Cao et al. used a second estimator, $h_{EP}$, which gets a value $\hat q$ from a pilot estimation instead of $q$. Unfortunately, this concept is actually unsuitable for generalization to the multivariate case. The univariate results were quite good.
The last method explained in this section is an approach especially adapted to the $L_1$-distance. The bandwidth $h_{DK}$ ("double kernel") minimizes the $L_1$-distance between two kernel density estimates which use the same bandwidth but different kernels. The criterion
$$\mathrm{IAE} = \int \left| \hat f_h(x) - \hat g_h(x) \right| dx$$
uses the kernels $K$ and $L$, respectively, which have to be of different order. To fulfil certain consistency restrictions, the choice of the two kernels has to be in tune with each other.
Since the reader has probably gained an impression of the extent of this research area, the next section should put some methods into context by drawing some comparisons.
2.2.7.5 Asymptotic analysis and results from application
Since Marron and Wand (1992) talk about three tools for understanding the behaviour of non-parametric curve estimators (asymptotic analysis, simulation and numerical calculation of error criteria), I am going to compare some estimators by means of the first and second one. I will skip the third because of its limitation to easy settings.
An often used criterion (Jones et al., 1996) is studying the behaviour of the random variable
$$\frac{\hat h - h_{MISE}}{h_{MISE}}, \qquad (2.15)$$
which tends to zero as $n$ increases for most estimators. Since the limiting distribution of (2.15), corrected by the factor $n^p$, is often a normal distribution, it makes sense to compare the value of $p$ for several estimators.
The distinction between methods of the first and the second generation (Jones et al., 1996) is not only caused by the time of their invention, but also by the better convergence rates of the latter ones. While the values of $p$ for LSCV and BCV are 1/10, better rates are obtained by the second-generation rules. LSCV is probably the bandwidth selector with the highest limiting variance. BCV is biased, but its variance is much smaller. The highest possible rate for $p$ is 0.5, and there exist some estimators (Jones et al., 1991) which reach this "magic" border, though they use no higher-order kernels.
When comparing estimators, one also has to do this with respect to possibly occurring additional parameters, like different scales in the Rule of Thumb or the number of stages in a plug-in approach.
To get an idea whether the advantage concerning the convergence rate can be exploited in practical application, many simulation studies using different reasonable values for $n$ were carried out.
Probably the last compact survey of different bandwidth selectors (Jones, 1996) provided the following main results.
1. $h_{ROT}$ ("Rule of Thumb") oversmooths too often.
2. $h_{BCV}$ has the same tendency and is unstable as well.
3. $h_{LSCV}$ has an unacceptable spread, often in the direction of undersmoothing.
4. $h_{SJ}$ (Sheather-Jones solve-the-equation) is a useful compromise between $h_{ROT}$ and $h_{LSCV}$ and performs acceptably for harder-to-estimate densities as well.
5. $h_{SB}$ (smoothed bootstrap) is very similar to $h_{SJ}$, but slightly worse.
The estimated densities are the same as the already introduced densities of Marron and Wand (1992, see Figure 4 and Figure 5). The fact that estimators based on the minimization of the AMISE happen to perform badly when applied to densities containing spikes or other difficult features is again a consequence of the bad approximation of $h_{MISE}$ by $h_{AMISE}$ (see again Figure 5).
The extensive simulation study of Cao et al. (1994) leads to similar conclusions, but gives very good evidence for $h_{EP}$, the estimator based on the inflection points according to the $L_\infty$-distance.
Sheather (1992) also discovered a superiority of the Sheather-Jones plug-in by analyzing real-life datasets, although this judgement is subjective because the underlying density is unknown.
The theoretically superior estimator described in Jones, Marron and Park (1991), $h_{JMP}$, performs well in their study when the underlying density is normal or similar to it, but is worse than $h_{LSCV}$ for other densities.
On the other hand, $h_{JMP}$ was one of the best estimators in the simulation study of Park and Turlach (1992), where the estimated number and location of modes was another quality criterion. Together with the LSCV criterion it was the best at detecting modes in the multi-modal distributions of this study, as well as, together with $h_{SJ}$, the best with respect to the $L_1$- and $L_2$-distance measures.
My latest reference in this discussion (Bowman and Azzalini, 1997) stresses the fact that even if the conservative LSCV has some undesired properties, it is an estimator which is easy to generalize to the multivariate setting, as well as to a setting using different bandwidths (variable kernel), which should not be overlooked when talking about density estimation for discriminatory purposes.
To gain some intuition and explore the different behaviour of certain data-driven bandwidth selectors for different univariate distributions, the reader is recommended to run the software package "XploRe" (see Härdle et al., 2000), where almost all of the bandwidth selectors described in this section are implemented. However, in more dimensions one probably has to program the preferred estimators oneself.
2.3 The multivariate case
The univariate concept has to be extended in case one wants to discover either
dependencies between several variables, or to carry out classification decisions.
The exploratory use seems to be restricted to bivariate random variables, because the resulting densities can be plotted in three dimensions. However, Scott (1992) provides graphs where densities in higher dimensions have been visualized. Nevertheless, carrying out a classification task by means of kernel density estimation seems to be more relevant when handling multidimensional data, at least in my thesis.
2.3.1 The model
The ($d$-dimensional) multivariate generalization of the kernel density estimator is given by
$$\hat f(\mathbf{x}) = \frac{1}{n}\, |\mathbf{H}|^{-1/2} \sum_{i=1}^{n} K\!\left( \mathbf{H}^{-1/2} (\mathbf{x} - \mathbf{x}_i) \right),$$
where $\mathbf{H}$ is a symmetric positive definite $d \times d$ matrix, the bandwidth matrix, and $K$ is a multivariate kernel satisfying $\int K(\mathbf{x})\, d\mathbf{x} = 1$ (Wand and Jones, 1995).
This kernel function can be derived straightforwardly by generalizing the univariate ones, either by multiplying the univariate kernels (product kernel) or by "rotating" the univariate kernel in the $d$-dimensional space (radially symmetric kernel). The most common choice, a multivariate normal density as kernel, can be derived as either a product or a radially symmetric kernel.
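A minimal sketch of the multivariate estimator with a Gaussian product kernel and a diagonal bandwidth matrix H = diag(h₁², ..., h_d²), one of the simplified parametrizations discussed in section 2.3.2; the per-variable bandwidths below are a crude normal-scale choice.

```python
import numpy as np

def mv_kde(data, points, h):
    """Gaussian product-kernel density estimate.
    data: (n, d) sample, points: (m, d) evaluation points, h: length-d bandwidths."""
    n, d = data.shape
    u = (points[:, None, :] - data[None, :, :]) / h          # (m, n, d)
    kern = np.exp(-0.5 * (u**2).sum(axis=2))                 # product of d Gaussian kernels
    norm = n * np.prod(h) * (2 * np.pi) ** (d / 2)
    return kern.sum(axis=1) / norm

rng = np.random.default_rng(5)
data = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 1]], size=400)
h = 1.06 * data.std(axis=0, ddof=1) * len(data) ** (-1 / 6)  # crude per-variable choice
print(mv_kde(data, np.array([[0.0, 0.0], [2.0, 2.0]]), h))
```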
Of course, the user wants to know which generalization performs better, and as in the univariate case (section 2.2.5), differences in the efficiency of using the data can be calculated. The sample sizes needed to obtain the same AMISE performance are given in Table 2 (Wand and Jones, 1995), where a beta kernel of the form
$$\kappa(x) \propto (1 - x^2)^p\, I(x^2 < 1)$$
is used. The numbers give the efficiencies of the product kernel relative to the efficiencies of the spherically symmetric kernel for different dimensions. Thus, values smaller than 1 denote an asymptotic superiority of the spherical version.
Table 2: Efficiencies of product kernels relative to radially symmetric kernels when using the multivariate generalization of a beta kernel.

p    d = 2    d = 3    d = 4
0    0.955    0.888    0.811
1    0.982    0.953    0.916
2    0.983    0.953    0.915
3    0.984    0.956    0.919
Summing up the knowledge about the kernels, the most effective choice with respect to minimizing either the MISE or the MIAE is supposed to be the radially symmetric Epanechnikov kernel (assuming that the radially symmetric kernels are also superior in the case of the Epanechnikov kernel). However, for reasons I am going to describe in chapter 4, I prefer a kernel with unbounded support in my simulation study.
2.3.2 Parametrizations
Concerning the choice of the smoothing parameter, one is now not only confronted with estimating $d$ bandwidths, as might have been expected, but with a whole matrix, which also contains the information in which directions the function is going to be smoothed; and this setting does not even include a variable generalization like the variable kernel density estimator in the univariate case. So finding a good estimate seems to be an exhaustive procedure.
Strategies to overcome this problem consist of estimating only one bandwidth for all variables, or only a bandwidth vector, ignoring possible correlations between the variables. It is easy to see that the choice of only one smoothing parameter is basically not sufficient; at least a rescaling of the variables, as is also done in other multivariate applications, is desirable. In any case, an additional decision has to be made by the user.
A method described in Silverman (1986), which goes back to Fukunaga (1972), is to carry out a linear transformation of the data so that it has unit covariance matrix. As a next step, the density is smoothed by using a radially symmetric kernel, and finally transformed back. This method corresponds to an adjustment of a multivariate kernel in the "direction" of the data, and therefore only one parameter is needed.
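A sketch of this sphering idea under the assumption of a Gaussian kernel: the data are whitened with the sample covariance, a single bandwidth is used on the transformed scale, and the density value is carried back with the determinant of the transformation.

```python
import numpy as np

def sphered_kde(data, points, h):
    """KDE after a sphering (whitening) transformation with a single bandwidth h.
    Equivalent to a Gaussian kernel with bandwidth matrix H = h^2 * Sigma_hat."""
    mean = data.mean(axis=0)
    cov = np.cov(data, rowvar=False)
    root_inv = np.linalg.inv(np.linalg.cholesky(cov))        # one version of Sigma^{-1/2}
    z_data = (data - mean) @ root_inv.T
    z_pts = (points - mean) @ root_inv.T
    n, d = z_data.shape
    u2 = ((z_pts[:, None, :] - z_data[None, :, :]) ** 2).sum(axis=2) / h**2
    f_z = np.exp(-0.5 * u2).sum(axis=1) / (n * (h * np.sqrt(2 * np.pi)) ** d)
    # back-transformation: multiply by |det(Sigma^{-1/2})| = 1 / sqrt(det(Sigma))
    return f_z / np.sqrt(np.linalg.det(cov))

rng = np.random.default_rng(6)
x = rng.multivariate_normal([0, 0], [[2, 1], [1, 1]], size=500)
print(sphered_kde(x, np.array([[0.0, 0.0]]), h=0.5))
```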
However, Scott (1992, p. 180) emphasizes that even for standardized data, the choice $h_i = h_j$ for different variables is unlikely to be satisfactory. In any case, the user has to be aware of the fact that in the complete setting the number of estimated parameters grows of order $d^2$, and the variances of the matrix elements "explode" as well. The knowledge about the curse of dimensionality (see section 2.3.4) probably restricts this model to only a few dimensions.
2.3.3 Parameter selection
The methods for parameter selection actually follow the same principles as those in the univariate setting. Unfortunately, only a part of the univariate procedures can be generalized well, and I actually found no comparative study of several estimators. Therefore, I only want to briefly explain generalizations of some estimators of section 2.2.7.
2.3.3.1 The normal reference rule
As Scott (1992) gives formulas for the AMISE in the multivariate case, its minimization with respect to a bandwidth matrix should not be that complicated, although an explicit expression for $\mathbf{H}_{AMISE}$ is not possible. However, the author restricts his view to the setting where only a vector of bandwidths has to be estimated.
Since the AMISE formula again contains a functional of the unknown density $f$, the optimal bandwidth vector can be derived by setting $f$ equal to a multivariate normal distribution. In the case of a normal product kernel, the elements of this vector are given by
$$h_i^{*} = \left( \frac{4}{d+2} \right)^{1/(d+4)} \sigma_i\, n^{-1/(d+4)}, \qquad (2.16)$$
where an estimator $\hat\sigma_i$ for $\sigma_i$ is necessary in application. Note that $d = 1$ leads back to (2.9). Scott (1992) describes the extension of $h_{OVERSMOOTH}$ to the multivariate setting as well.
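A direct transcription of (2.16), with the sample standard deviations plugged in:

```python
import numpy as np

def normal_reference_bandwidths(data):
    """Per-variable normal reference bandwidths (2.16) for a normal product kernel."""
    n, d = data.shape
    sigma = data.std(axis=0, ddof=1)
    return (4.0 / (d + 2)) ** (1.0 / (d + 4)) * sigma * n ** (-1.0 / (d + 4))

rng = np.random.default_rng(7)
print(normal_reference_bandwidths(rng.normal(size=(600, 5))))
```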
2.3.3.2 Cross-validation methods
The density estimator in the simulation study of Remme et al. (1980) used a multivariate normal kernel and a bandwidth matrix of the form
$$\mathbf{H} = h^2 \begin{pmatrix} s_1^2 & 0 & \cdots & 0 \\ 0 & s_2^2 & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & s_p^2 \end{pmatrix}, \qquad (2.17)$$
where the $s_i^2$ denote the sample variances of the corresponding variables. The bandwidth $h$ was then computed by the generalization of (2.12), i.e. by maximizing
$$\mathrm{LCV}(h) = \prod_{i=1}^{n} \hat f_{K,h,-i}(\mathbf{x}_i) \qquad (2.18)$$
with respect to $h$. Here $\hat f_{K,h,-i}$ is the density estimate using the multivariate kernel $K$ and leaving out the $p$-dimensional observation vector $\mathbf{x}_i$.
The fact that the minimization of
$$\mathrm{LSCV}(\mathbf{H}) = \int \hat f_{\mathbf{H}}(\mathbf{x})^2\, d\mathbf{x} - \frac{2}{n} \sum_{i=1}^{n} \hat f_{\mathbf{H},-i}(\mathbf{x}_i) \qquad (2.19)$$
(where $\hat f_{\mathbf{H},-i}$ is defined analogously to $\hat f_{K,h,-i}$ in (2.18)) leads to the LSCV-optimal bandwidth matrix (whether $\mathbf{H}$ is restricted to a diagonal matrix or not) is, from this point of view, not surprising (see (2.11), and Wand and Jones, 1995).
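A small sketch of the leave-one-out criterion (2.18) with the bandwidth matrix (2.17); a Gaussian kernel is assumed, the log-likelihood is used instead of the raw product to avoid numerical underflow, and a simple grid search over h replaces a proper optimizer.

```python
import numpy as np

def loo_log_likelihood(data, h):
    """Leave-one-out log-likelihood for H = h^2 * diag(s_1^2, ..., s_p^2), cf. (2.17)-(2.18)."""
    n, d = data.shape
    s = data.std(axis=0, ddof=1)
    u2 = (((data[:, None, :] - data[None, :, :]) / (h * s)) ** 2).sum(axis=2)
    k = np.exp(-0.5 * u2) / ((h ** d) * np.prod(s) * (2 * np.pi) ** (d / 2))
    loo = (k.sum(axis=1) - np.diagonal(k)) / (n - 1)     # leave-one-out density values
    return np.log(loo).sum()

rng = np.random.default_rng(8)
x = rng.normal(size=(300, 3))
hs = np.linspace(0.2, 1.5, 40)
h_lcv = hs[np.argmax([loo_log_likelihood(x, h) for h in hs])]
```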
2.3.3.3 Plug-in
Given the good performance of the plug-in methods, one could also be interested in their multivariate extensions. Wand and Jones (1994) describe in their article how to perform the direct plug-in concept of Sheather and Jones (1991) in $d$ dimensions (there is no direct analogue of the solve-the-equation approach available in the multivariate case). Unfortunately, the problem becomes more and more complicated, since the extensions of the density functionals $\psi_r$ (see section 2.2.7.3) contain functionals of partial derivatives $f^{(\mathbf{r})}(\mathbf{x})$, where $\mathbf{r}$ is a vector of length $d$ containing non-negative integers and the definition is
$$f^{(\mathbf{r})}(\mathbf{x}) = \frac{\partial^{|\mathbf{r}|} f(\mathbf{x})}{\partial x_1^{r_1} \cdots \partial x_d^{r_d}},$$
where $|\mathbf{r}| = \sum_{i=1}^{d} r_i$.
Additionally, the number of density functionals to be estimated increases rapidly as both $d$ and the number of stages $l$ increase.
This would not be a problem either if one had a powerful computer, but another problem arises when a pilot estimation has to be carried out at a certain step. The pilot functional estimator which uses the multivariate normal distribution as reference distribution leads, even for the case $d = 2$, to such a complicated formula that Wand and Jones (1994) stress that "it is too difficult to give succinct expressions for $\psi_m^{N}(\Sigma)$ [the functional, author] so we will restrict attention to the bivariate case". Regarding this fact, the concept is not viable for my simulation study, which I regret, since this concept promises very much and achieved good results for the bivariate case in the cited study. Maybe this technique is only suitable for exploration purposes rather than discrimination, because the visual techniques are essentially restricted to two dimensions as well. Nevertheless, it seems worthwhile to also study the case $d > 2$ in more detail.
2.3.4 The curse of dimensionality
As if the additional problems of the multivariate extension were not big enough so far, a further fundamental problem occurs when estimating densities in high dimensions, which is called the curse of dimensionality. Silverman (1986) describes this in an interesting manner. He considers the importance of the distribution tails in high dimensions. For a univariate density there is no problem when regions where the density $f$ has values below one hundredth of its supremum (the value at the mode) are estimated by $\hat f = 0$ and the corresponding mass is shifted towards the mode. Everyone would call such an estimate good, provided the fit in the main part is appropriate. In ten dimensions, however, for a normal distribution more than one half of the data falls into such regions of low density values, and one does quite badly by ignoring those tails.
Table 3 (Scott, 1992) gives an impression; it lists the values
$$p_d = P\!\left( \frac{f(\mathbf{x})}{f(\mathbf{0})} \ge \frac{1}{100} \right) = P\!\left( \chi_d^2 \le -2 \log \tfrac{1}{100} \right).$$
Table 3: Probability of data in regions which have density values higher than
one hundredth the value at the mode of a multivariate normal distribution.
d         1    2    3    4    5    6    7    8    9    10   15   20
1000 p   998  990  973  944  899  834  762  675  582  488  134   20
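The entries of Table 3 can be reproduced from the χ² expression above; a short check (using scipy) under the standard multivariate normal assumption:

```python
import numpy as np
from scipy.stats import chi2

dims = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20])
p = chi2.cdf(2 * np.log(100), df=dims)    # P(f(x)/f(0) >= 1/100) for N(0, I_d)
print(np.round(1000 * p).astype(int))     # approximately 998 990 973 ... 488 134 20
```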
Already for five to six dimensions the mass in the dense regions of the distribution declines considerably and moves into the sparse regions.
Silverman gives another example when he explains that in a ten-dimensional normal distribution, more than 99% of the data have a distance greater than 1.6 from the center. Every statistician knows that it is almost the other way round in the univariate case, where nearly 90% lie in the interval [-1.6, 1.6].
The paradox here is that the data are not in the regions having high density values, but in the tails, although the tail regions are even more sparsely populated. Roughly speaking, this happens because in more dimensions there is "much more space" for observations, and exactly this fact makes a density in high dimensions difficult to estimate non-parametrically.
Scott (1992) also obtained striking results when he compared the ratio of the volume of a hypercube to that of an inscribed hypersphere as a function of the dimension $d$: in high dimensions almost the whole volume of such a cube lies in its corners, outside the inscribed hypersphere.
This digression into a topic which sounds like a science-fiction story provides very deep insight into the main problem of extending a density estimation model to high dimensions. The fascinated reader is recommended to read more such examples in Scott (1992) and Silverman (1986), respectively. In particular the table on page 94 of the latter source (Table 5 in my thesis), showing the required sample size to achieve a given accuracy for varying dimensions, is of practical importance and leads back to the topic of density estimation.
This curse of dimensionality stresses the unconditional necessity to transform high-dimensional data onto suitable subspaces. Some techniques for how to proceed are discussed in chapter 3.
2.4 The context to kernel discriminant analysis
As already briefly mentioned in the introduction to chapter 2, density estimation is not only an end in itself, but also a means to an end.
One possible field of application which I take into consideration in my thesis is discriminant analysis, often called (Bayes-rule) kernel discriminant analysis. Since the kernel model is more flexible, it seems to be a proper alternative to the model-based approach, although the Bayes-rule-based decision rules assuming multivariate normal distributions are also somewhat justified (e.g. the equality of the LDA and the differently derived Fisher approach).
Denoting the class of a $d$-dimensional data vector $\mathbf{x}_i$ by $k_i$ and the prior probabilities by $p(k)$, it is well known that the Bayes rule tells you to maximize
$$\hat p(k \mid \mathbf{x}) = \frac{ \hat p(k)\, \hat f(\mathbf{x} \mid k) }{ \sum_{k} \hat p(k)\, \hat f(\mathbf{x} \mid k) }$$
over all possible values $k$, and to take the maximizer $\hat k$.
In kernel discriminant analysis, $\hat f(\mathbf{x} \mid k)$ is now not the normal density, but a multivariate kernel density estimate, calculated classwise by using for each class $k$ only the vectors $\mathbf{x}_i$ having $k_i = k$.
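A minimal sketch of this Bayes rule with classwise kernel density estimates; a Gaussian product kernel, per-class normal reference bandwidths and priors estimated by the class proportions are assumptions made here for illustration.

```python
import numpy as np

def mv_kde_at(data, points, h):
    """Gaussian product-kernel density estimate at the given points."""
    n, d = data.shape
    u2 = (((points[:, None, :] - data[None, :, :]) / h) ** 2).sum(axis=2)
    return np.exp(-0.5 * u2).sum(axis=1) / (n * np.prod(h) * (2 * np.pi) ** (d / 2))

def kernel_discriminant(train_x, train_y, test_x):
    """Assign each test point to the class maximizing prior * kernel density estimate."""
    classes = np.unique(train_y)
    scores = []
    for c in classes:
        xc = train_x[train_y == c]
        n, d = xc.shape
        h = (4 / (d + 2)) ** (1 / (d + 4)) * xc.std(axis=0, ddof=1) * n ** (-1 / (d + 4))
        prior = len(xc) / len(train_x)
        scores.append(prior * mv_kde_at(xc, test_x, h))
    return classes[np.argmax(np.stack(scores), axis=0)]

rng = np.random.default_rng(9)
x1 = rng.normal([0, 0], 1, size=(100, 2))
x2 = rng.normal([2, 2], 1, size=(100, 2))
X, y = np.vstack([x1, x2]), np.array([1] * 100 + [2] * 100)
print(kernel_discriminant(X, y, np.array([[0.0, 0.0], [2.0, 2.0]])))
```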
To gain some intuition about the difference, Figures 9 and 10 should be considered. Figure 9 shows a contour plot of the posterior probabilities in a self-constructed dataset together with the underlying data points. The dataset consists of five groups with different means, different covariance matrices and twenty observations each. The goal is to separate those groups and to produce rules for classifying new observations. Note that LDA is very limited, since the separation is carried out by hyperplanes; in particular the different variances of the components of groups 3 and 5 lead to several misclassifications of the latter group. The QDA is often criticized, but through its more flexible approach it provides better results in this case. Figure 10, finally, shows the flexible fit using the same data points, but it also indicates, even in two dimensions, what was meant by the curse of dimensionality (section 2.3.4).
Figure 9: Contour-plot of the maximum posterior probabilities at each point
(LDA).
The rules produced are spurious and have probably (and, since I know the distribution, surely) nothing to do with the underlying density, which stresses again the difficulty of the estimation and the need for transformations and dimension reduction.
Figure 10: Contour-plot of the maximum posterior probabilities at each
point (kernel estimate).
The performance of a classification rule is now measured either by the classical misclassification rate (error rate) or by more modern measures (see Hand, 1997, for a discussion), and no longer by any difference of densities. In case there are more than two populations, it is not really clear how a difference between several estimated densities should be measured anyway. In general, the better the density is estimated, the smaller the misclassification rate should be, but a MISE-optimal bandwidth selector is probably not the best one regarding the misclassification rate, since the theoretical misclassification rate is an $L_1$-based measure. One has to be aware of this fact, and a common optimization of both aims is what is desired (see also Hand, 1997). Note that the misclassification rate does not take different costs into consideration either.
How can one actually make connections between the choice of the smoothing parameter and known discrimination rules? Hand (1982) showed that as $h \to \infty$, the decision surfaces become more and more hyperplanes and end up in a rule which assigns the observation $\mathbf{x}$ to the group whose sample mean is closest (average linkage). That happens because when $h$ becomes large, the resulting estimate is essentially the kernel $K$ itself.
On the other hand, if $h \to 0$, the resulting estimate is a surface consisting of bumps at the position of each data point. Caused by the exponential decrease of the normal density, as $h \to 0$ a new point $\mathbf{x}$ will be classified to its nearest neighbour (measured by the Euclidean norm). Those results assure the user that a choice within this range of $h$ cannot be bad, since those two algorithms are known to perform quite well.
Taking the curse of dimensionality into account again, one is confronted in higher dimensions with the fact that a proper estimation of the distribution tails is particularly important. Ripley (1996, p. 182) therefore suggests inspecting densities on log scales, and emphasizes that discrimination is based on differences in log densities. In his opinion, the proper estimation of the tails is almost completely ignored in the density estimation literature. That is no wonder, regarding the fact that the $L_1$-based parameter selectors (which weight a proper fit in the tails most) are difficult to handle (see again the quite technical book of Devroye and Györfi, 1985) and are therefore widely avoided.
Figure 11: Difference between the logarithm of a standard normal density
and the logarithm of two kernel estimates (Sheather-Jones plug-in and the
Rule of Thumb).
Figure 11 gives an impression of how bad the fit is in the tails. 50 observations from an N(0,1) random variable were generated, and the density was estimated with the Rule-of-Thumb bandwidth selector and with the Sheather-Jones selector, respectively. Then the true and the estimated curves were compared on a log scale. Regarding the fact that a univariate normal density is almost the easiest-to-estimate density possible, this is a rather poor result, and it focuses the view on one of the main problems of kernel discriminant analysis. There is one approach in the literature (Hall and Wand, 1988) which estimates density differences and circumvents the problem of negative estimated values arising with higher-order kernels, but it estimates only the differences and not the log-differences.
The literature about how kernel discriminant analysis performs compared to LDA and QDA is rather old compared to the vast amount of modern bandwidth selection rules for kernel density estimation. Van Ness and Simpson (1976) investigated this area by choosing a multivariate normal and a Cauchy kernel, respectively, and searched for dependencies on the number of dimensions (up to 30), the number of observations per class (10 and 20, respectively) and the distance $\Delta$ between two normal distributions with equal covariance matrices, which represent the class densities to distinguish. For reasons discussed in section 2.3.4, it is in my opinion completely absurd to gain any practical insight from such a setting. It is remarkable that they discovered a dominance of the kernel setting over the LDA, although the LDA is of course the best in this setting (with respect to minimizing the error rate). In this setting there is no convergence effect of the estimates towards the real distribution. The parametric estimates (LDA, QDA) as well as the kernel estimate have such a great variance that any comparisons of error rates are fruitless, even when the error rates were estimated by averaging over 100-300 equal attempts. The fact that this averaged classification rate is unaffected by the choice of the smoothing parameter over a wide range (see Van Ness and Simpson, 1976, p. 185) gives the best evidence. The simulation study of Van Ness (1980), because of the similar results for the normal and the Cauchy kernel in the former study, used only the normal kernel; different covariance matrices were used there as well. Nevertheless, those two studies probably gave one of the first impulses for taking the important role of the dimension in the kernel setting into consideration.
Remme et al. (1980) discuss a much wider range of underlying distributions and restrict the number of dimensions $d$ to the much more reasonable range {2,...,6}. The construction of the estimator is given in (2.17) and (2.18), and the tested densities are multivariate normals, multidimensional lognormals and normal-mixture densities. They used the classical misclassification rate and a measure which also considers the exact estimated posterior probabilities for testing the performance. They also studied a variable kernel model, whose bandwidths are constructed by the multivariate counterpart of (2.8). The number of nearest neighbours to use was studied in a separate simulation study (Habbema et al., 1978), where the same authors underline that the choice of this number is not that crucial over a wide range.
They compared the method to LDA and QDA, respectively, and essentially got the following results.
1. LDA is the best for multivariate normals with equal covariance matrices (not surprising).
2. LDA performs increasingly poorly in case of non-equal covariance matrices, and QDA is not much better.
3. The kernel method was better than the others, or at least as good as them, except in the setting of point 1. This is surprising, regarding the fact that the sample sizes were not very large (n = 15, n = 35).
4. Nevertheless, the results in the lognormal case were disappointing, and the variable kernel estimate performed much better there.
The last point again stresses the need for a more flexible setting or for transformations. In my study I will try to carry out proper transformations. The most remarkable statement in their study is probably the final conclusion that "the present practice of the nearly exclusive use of LDA cannot be justified", which tells us that this was known already 22 years ago and should encourage every software provider of discrimination rules to offer alternatives.
With the current knowledge, one is actually “only“ confronted with a need to
transform marginal distributions and the necessity for a proper projection onto a
subspace, to overcome the curse of dimensionality when analyzing higher-
dimensional datasets. This important task is going to be discussed in the following
chapter.
Chapter 3: Dimension Reduction and Marginal
Transformations
3.1 Introduction
As already briefly discussed in sections 2.3 and 2.4, estimating densities appropriately in high dimensions (where "high" always means high with respect to the sample size) is a difficult, if not impossible, task, and some kind of transformation is inevitable. This is, however, easier said than done if those estimates lead to classification decisions. In this case it is important to know the density values at each test observation $\mathbf{x}$ in the original dimension, because they determine the posterior probabilities, or at least the order of the posterior probabilities of the different classes. Since the transformations have to be carried out classwise, this order may change. While Wand and Jones (1995) care about the back-transformation of the density in their univariate transformations, Scott (1992) does not treat this necessity in the multivariate case.
Section 3.2 now gives advice on how to transform the dataset within the original dimension, whereas questions of reducing dimensions are discussed in section 3.3. The application in chapter 4 will take both aspects into consideration.
3.2 Marginal transformations
Suppose you have data in a high-dimensional space. In most cases at least some of the variables are highly correlated. Scott (1992) suggests in this case a two-stage approach preceded by marginal transformations. He talks about a need to normalize the data. In the context of section 2.2.6.2 this is not a bad goal either, since the normal density is one of the easiest densities to estimate.
The need to back-transform the estimated densities becomes apparent in Figure 12.
Figure 12: Normalization of two bimodal class densities. Density 1 (solid),
density 2 (dashed) and their respective normalizations.
If the point x0 has to be classified, the reasonable choice would be an assignment to group 1, since the value of the first class density is higher there. "Normalization" of each class density changes the assessment totally, and the point would be classified to group 2. That means the transformation rules have to be stored in some way.
So, how to normalize? One method was already named in section 2.2.6.2: to use as transformation function a "member" of the "shifted power family" (Wand and Jones, 1995), which is given by
$$t(x; \lambda_1, \lambda_2) = \begin{cases} (x + \lambda_1)^{\lambda_2}\, \mathrm{sign}(\lambda_2) & \lambda_2 \ne 0 \\ \ln(x + \lambda_1) & \lambda_2 = 0 \end{cases}$$
where $\lambda_1 > -\min(X)$ and $\min(X)$ denotes the lower endpoint of the support of $f$.
This method suffers in application, because one has to know the parameters, and this leads to several optimization steps, regarding the fact that every marginal density has to be transformed.
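A direct transcription of this transformation family as a sketch; the parameter values in the example are purely illustrative, since in practice λ₁ and λ₂ would have to be chosen by the optimization steps mentioned above.

```python
import numpy as np

def shifted_power(x, lam1, lam2):
    """Shifted power family t(x; lambda_1, lambda_2) of Wand and Jones (1995)."""
    if lam2 == 0:
        return np.log(x + lam1)
    return np.sign(lam2) * (x + lam1) ** lam2

rng = np.random.default_rng(10)
x = rng.exponential(scale=1.0, size=500)        # right-skewed sample
y = shifted_power(x, lam1=0.1, lam2=0.3)        # illustrative parameter choice
```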
The non-parametric method of Ruppert and Cline (1994) gives straightforward transformations. They use the fact that if $F$ and $G$ are the cdfs of the densities $f$ and $g$, then $Y = G^{-1}(F(X))$ has density $g$. One is now able to choose any distribution, and the user will probably choose a normal distribution, because normalization is wanted. In application, $t(x) = G^{-1}(\hat F_h(x))$ is taken, where $\hat F_h(x)$ is a (pilot) kernel estimate of $F(x)$ (Ruppert and Cline, 1994). To get the density values for the original data, one has to transform the density back to the original space, and therefore the first derivative of the transformation function is needed. For this reason, one additional point lying "close" to each observation has to be transformed as well, in order to approximate this derivative numerically (note that $f(x) = F'(x)$, since the transformation is based on a kernel estimate of the cdf). This idea is illustrated in Figure 13.
Figure 13: Univariate normalizations. a) shows a density, which has to be
normalized. b) shows the normalizing procedure based on the estimated cdf.
Because of the exhaustive computations, this procedure is probably not suitable for estimating the whole density, but it seems reasonable for discriminatory purposes, where only one function value has to be computed for each test observation.
For the concrete realization of this approach, the formula for transformed densities is used. Let $\mathbf{x}$ be a random vector and $\mathbf{y} = T(\mathbf{x})$ another random vector of the same dimension. Then the equation
$$f_{\mathbf{x}}(\mathbf{x}_0) = f_{\mathbf{y}}\big(T(\mathbf{x}_0)\big)\, \left| \det\big( DT(\mathbf{x}_0) \big) \right| \qquad (3.1)$$
is valid (Bomze, 1998), where $DT(\mathbf{x})$ is the Jacobian matrix of the transformation $T$ at the point $\mathbf{x}$.
Since the values $f_{\mathbf{y}}(T(\mathbf{x}))$ of the multivariate density estimate at the transformed data points are known, it is only necessary to get the value of the Jacobian determinant (or a numerical approximation) at each point of the test dataset for each transformation (class).
Because the transformations are marginal, the Jacobian matrix is diagonal and "only" $d$ values (according to the $d$ dimensions of the original space) have to be calculated.
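A sketch of the whole normalization step for one variable: the transformation t(x) = Φ⁻¹(F̂_h(x)) based on a kernel estimate of the cdf, and the density value of an original test point recovered via (3.1), with the derivative of t approximated numerically by a small increment Δ as described above. The bandwidths are crude normal-scale choices, and scipy's normal cdf and quantile functions are assumed.

```python
import numpy as np
from scipy.stats import norm

def cdf_kde(sample, x, h):
    """Kernel estimate of the cdf (Gaussian kernel)."""
    return norm.cdf((x[..., None] - sample) / h).mean(axis=-1)

def normalize(sample, x, h):
    """t(x) = Phi^{-1}(F_hat(x)): maps the sample approximately to N(0,1)."""
    return norm.ppf(np.clip(cdf_kde(sample, x, h), 1e-12, 1 - 1e-12))

rng = np.random.default_rng(11)
train = rng.gamma(2.0, 1.5, size=600)                  # one skewed training variable
h = 1.06 * train.std(ddof=1) * len(train) ** (-1 / 5)

x0 = np.array([3.0])                                   # a test observation
delta = 1e-3
t0, t1 = normalize(train, x0, h), normalize(train, x0 + delta, h)

# density of the normalized variable at t(x0), estimated from the transformed sample
z = normalize(train, train, h)
h_z = 1.06 * z.std(ddof=1) * len(z) ** (-1 / 5)
f_y = np.exp(-0.5 * ((t0[:, None] - z) / h_z) ** 2).sum(axis=1) / (len(z) * h_z * np.sqrt(2 * np.pi))

# back-transformation (3.1): f_X(x0) = f_Y(t(x0)) * |t'(x0)|
f_x0 = f_y * np.abs(t1 - t0) / delta
```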
If the number of test data points is $n$, the number of dimensions of the original space is $d$, and the number of groups to separate is $g$, then $n \times d \times g$ values are necessary to carry out this approach. This is probably not feasible for "high" dimensions, e.g. $d = 100$, or for "many" test data points to evaluate. However, in this case the dimensions do not have to be reduced as strongly as in the case of multivariate kernel density estimation, and I think a proportion of 95% variance explained, referring to a dimension reduction by principal component analysis, can be achieved easily.
To get an impression of how well this non-parametric normalization performs, Figure 14 should be considered. The variable used comes from my empirical insurance dataset (see chapter 4). The highly skewed variable measures the time since the last delay in payment of each customer. The graph shows that the normalization works quite well, and the user is not forced to select any parameters. The kernel estimate in the right graph probably reminds the user slightly of a standard normal density.
Figure 14: Non-parametric normalization of one variable of the insurance
data.
This method is a first concept for using kernel density estimation for discrimination in high dimensions, which overcomes the problem of multivariate density estimation (as long as the LDA or the QDA is used after the normalization step). The other alternative, involving multivariate kernel estimation, is discussed in the following section.
3.3 Dimension reduction
The briefly mentioned two-stage approach (Scott, 1992) suggests a linear and a non-linear transformation.
The linear one is either a sphering transformation or a transformation into principal components. The sphering transformation is given by
$$\mathbf{Z} = \Sigma^{-1/2} (\mathbf{X} - \mu),$$
where $\Sigma^{-1/2} = \mathbf{A} \Lambda^{-1/2} \mathbf{A}^{T}$, and $\mathbf{A}$ and $\Lambda$ contain the eigenvectors and the eigenvalues (written in the main diagonal, with zeros elsewhere) of the covariance matrix of $\mathbf{X}$, respectively. While this transformation destroys all first- and second-order information, the principal component transformation
$$\mathbf{Y} = \mathbf{A}^{T} (\mathbf{X} - \mu)$$
keeps at least the variance information (Scott, 1992). It is well known that the columns of $\mathbf{Y}$ are uncorrelated and have (starting with the first) the largest possible variance of all linear combinations of the variables. In application, the calculation of $\mathbf{A}$ is based on the eigenvectors and eigenvalues of the maximum likelihood estimate $\hat\Sigma$ of the covariance matrix of $\mathbf{X}$. The author suggests, e.g. in the case of normal data, choosing the subspace dimension $d'$ as the smallest one that satisfies
$$\frac{\sum_{i=1}^{d'} \hat\lambda_i}{\sum_{i=1}^{d} \hat\lambda_i} > 90\%,$$
where $\hat\lambda_i = s_{Y_i}^2$ is the sample variance of the $i$-th transformed column and the $i$-th largest eigenvalue of $\hat\Sigma$, respectively. In case of normalized, originally non-normal data he suggests taking 95% as the threshold to get rid of any dimensions containing no independent linear information, and expects $1 \le d' \le 10$ in practice.
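A sketch of this first (linear) stage: principal components computed from the ML covariance estimate and the smallest subspace dimension d' whose cumulated explained variance exceeds a given threshold (90% or 95% as suggested above). The generated data are purely illustrative.

```python
import numpy as np

def pca_subspace(data, threshold=0.90):
    """Project onto the smallest number of principal components explaining
    at least `threshold` of the total variance."""
    centered = data - data.mean(axis=0)
    cov = centered.T @ centered / len(data)            # ML estimate of the covariance
    eigval, eigvec = np.linalg.eigh(cov)
    order = np.argsort(eigval)[::-1]                   # largest eigenvalues first
    eigval, eigvec = eigval[order], eigvec[:, order]
    explained = np.cumsum(eigval) / eigval.sum()
    d_sub = int(np.searchsorted(explained, threshold) + 1)
    return centered @ eigvec[:, :d_sub], d_sub

rng = np.random.default_rng(12)
x = rng.normal(size=(600, 10)) @ rng.normal(size=(10, 10))   # correlated 10-dim data
scores, d_sub = pca_subspace(x, threshold=0.90)
print(d_sub, scores.shape)
```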
His second stage, to get to the working dimension $d''$, is the use of a projection pursuit technique. The optimization happens with respect to the $d' \times d''$ matrix $\mathbf{P}$, using a multivariate version of $\mu_2(K)^2 R(K)$, to get not the "smoothest" but the most informative, "least smooth" density. That means, denoting the data in $\mathbb{R}^{d'}$ by $\mathbf{X}$ and the transformed data in $\mathbb{R}^{d''}$ by $\mathbf{Y} = \mathbf{XP}$, the matrix $\mathbf{P}$ satisfying
$$\max_{\mathbf{P}}\; \hat\sigma(\mathbf{XP})\, R\{ \hat f_h(\mathbf{y} \mid \mathbf{XP}) \}$$
has to be found. Here $\hat\sigma(\mathbf{XP})$ is the product of the sample standard deviations of the columns of $\mathbf{XP}$, and $\hat f_h(\mathbf{y} \mid \mathbf{XP})$ is a multivariate kernel density estimator based on the data matrix $\mathbf{XP}$ with a bandwidth matrix depending only on one parameter $h$. Although the optimal $\mathbf{P}$ is relatively insensitive to the choice of $h$, the optimization turns out to be non-trivial, and this fact makes the preceding principal component analysis unavoidable. Scott (1992) suggests other criteria as well. Nevertheless, I am not going to use such an exhaustive approach, since my datasets do not have that many dimensions and I already have enough data transformation steps. In addition, it is not included in my software packages.
As seen in section 2.3.4, a radical reduction of dimensions is desired to avoid the great volumes of a high-dimensional space. Unfortunately, a loss of information is included in this reduction; in fact, there are two counteracting aims. Table 4 and Figure 15 show an average output of a principal component analysis with one of my synthetic datasets, and Table 5 (Silverman, 1986) again makes the curse of dimensionality transparent.
Table 4: Principal component analysis with one of my synthetic datasets.

Component    explained variance (%)    cumulated (%)
1            47.674                     47.674
2            21.826                     69.500
3             6.119                     75.619
4             5.655                     81.274
5             4.801                     86.076
6             4.292                     90.367
7             3.291                     93.659
8             3.247                     96.906
9             2.297                     99.203
10            0.797                    100.000
Figure 15: The corresponding
scree-plot.
The values in Table 5 are based on an accuracy measure called the relative mean square error. The sample sizes are calculated to satisfy
$$\frac{E\big\{ (\hat f(\mathbf{0}) - f(\mathbf{0}))^2 \big\}}{f(\mathbf{0})^2} < 0.1 \qquad (3.2)$$
when setting $f$ equal to the standard normal density.
Concerning Figure 15, a reasonable choice for the number of dimensions in the subspace would be three. If the number of observations is e.g. $n = 600$ (as it is for each class in each training dataset of my examples in chapter 4), the accuracy given in Table 5 is achieved, but a projection onto four dimensions is "valid" as well. However, explaining only 81% of the variance does not seem adequate for exact discrimination tasks.
Table 5: Smallest sample size for each dimension which satisfies (3.2).

Dimensionality    Required sample size
1                 4
2                 19
3                 67
4                 223
5                 768
6                 2790
7                 10700
8                 43700
9                 187000
10                842000
The problem now becomes apparent, but an intuitive statement about how to choose the number of dimensions of the subspace is rather difficult.
Kernel density estimation in such a subspace is the alternative to the method discussed in section 3.2, and chapter 4 should give answers to the question: which one is better, and how do they perform compared to the classical methods (LDA and QDA)?
Chapter 4: Simulation Study and Real-life Application
4.1 Introduction
In chapters 2 and 3 I collected many ideas about how to obtain good estimation results and how to carry out the classification. However, essentially this was all theory, and I am certainly interested now in how it can be applied.
The basic intention behind my attempts concerning this topic is to create different distributions which occur in everyday life. We are nowadays confronted with huge datasets, and therefore I want to use higher numbers of observations than most cases in the existing literature, while always keeping an eye on the limitations of the model. My argument is that if the results are good with these observation numbers, one can reduce a data-mining dataset by drawing a sample of reasonable size. The fact that the model is not tractable for data-mining datasets with thousands (or more) of observations will become apparent later in this chapter.
The basic content of this chapter is a detailed description of what has been used as "input", which algorithms were used, and which "output" is going to be generated (section 4.2). The second main point is a depiction of the most interesting results of this analysis, mostly by means of graphs (section 4.3).
Some computational considerations, which represent either my experience or the experience of some former authors, are pointed out in a separate section (section 4.4). Finally, attempts to explain the results of the different estimators are made in section 4.5. The exact results are listed in the appendix.
4.2 Preliminaries
4.2.1 The data
The choice of the data was made with respect to the results of chapters 2 and 3. Each dataset consists of two classes with 600 observations each, and a test dataset containing 200 observations. The two-class problem is probably the most investigated problem in the literature. There was no reason for me to choose more than two classes, because the basic problem would probably not change. Furthermore, in most applications there are only two classes to separate. The number of 600 observations also seems to be a kind of upper bound for using leave-one-out cross-validation methods. Even here the calculations were very demanding, but a reduction to five dimensions seems to be somewhat justified (see Table 5).
The number of dimensions of the original space was chosen as $d = 10$. This is at least twice as high as the highest occurring sub-dimensional dataset, since I plan to project the data onto two to five dimensions. A number of $d = 10$ seems to be enough to get a feeling for the behaviour of the dimension reduction approach.
4.2.1.1 The self-constructed (synthetic) data
The self-constructed data consist of seven different univariate prototypes for the class distributions, which were chosen to model common distributions that might appear in different applications. These prototypes are shown in Table 6 and Figure 16.
I now want to give some short reasons why I chose these prototypes. The studies which concentrated to a great extent on high dimensions (Van Ness and Simpson, 1976; Van Ness, 1980) treated only uncorrelated multinormal distributions, whose groups were separated by only one variable. This is a very limited view of the problem, since section 2.2.6.2 showed that, for example, right-skewed distributions are much more difficult to estimate. In addition, those distributions appear often in practical problems, e.g. in lifespan distributions as well as in income data. Thus, it is essential to have at least one type of those in the simulation.
Table 6: Prototype distributions of the synthetic datasets.

Normal:
  Construction: N(0,1)

Normal with "small noise":
  Construction: 0.8\,N(0,1) + 0.2\,\bigl(\sum_{i=1}^{25}\phi(\mu_i)\bigr)^{-1}\sum_{i=1}^{25}\phi(\mu_i)\,N(\mu_i,\,0.1)
  Means: \mu_1 = -3; \mu_i (i = 2,\dots,25) is created by adding stepwise uniform(0, 0.5) random variables

Normal with "medium noise":
  Construction: 0.7\,N(0,1) + 0.3\,\bigl(\sum_{i=1}^{13}\phi(\mu_i)\bigr)^{-1}\sum_{i=1}^{13}\phi(\mu_i)\,N(\mu_i,\,0.2)
  Means: \mu_1 = -3; \mu_i (i = 2,\dots,13) is created by adding stepwise uniform(0, 1) random variables

Normal with "large noise":
  Construction: 0.5\,N(0,1) + 0.5\,\bigl(\sum_{i=1}^{7}\phi(\mu_i)\bigr)^{-1}\sum_{i=1}^{7}\phi(\mu_i)\,N(\mu_i,\,0.35)
  Means: \mu_1 = -3; \mu_i (i = 2,\dots,7) is created by adding stepwise uniform(0, 2) random variables

Exponential:
  Construction: Exp(1)

Bimodal "close":
  Construction: 0.5 N(0;1) + 0.5 N(2.5;1)

Bimodal "far":
  Construction: 0.5 N(0;1) + 0.5 N(5;1)
On the other hand, it is of course important to have a multi-modal distribution
included. Therefore I chose two bimodal distributions, one with two modes lying
close to each other and one with strictly separated modes. I skipped the use of
several pathological normal mixtures and of normal mixtures having more than two
bumps. A detailed discussion of such distributions with respect to bandwidth
selection in the univariate kernel setting is given by Marron and Wand (1992). In
my opinion, many of the distributions used in that paper do not represent
densities that typically occur in applications.
Figure 16: Prototype distributions of the synthetic datasets (panels: Normal, Normal-noise small, Normal-noise medium, Normal-noise large, Exponential(1), Bimodal - close, Bimodal - far).
The different normal-noise densities are chosen because I want to figure out how
strongly the dominance of the normal-distribution-based decision rules (LDA, QDA)
depends on deviations from this hypothesis. Since in real life practically nothing
is normally distributed, because of unknown dependencies which make the model of
independent observations questionable, one of my interests is how big the "noise"
in the normal distribution has to be before the non-parametric approach becomes
dominant. The parameters μi of the "error bumps" in the normal-noise densities are
chosen such that at least the expected normal-noise density is symmetric around
zero. The mixture coefficients are high in the centre of the distribution and
smaller in the tails, in order to produce features corresponding to my
understanding of "noise". The function φ in Table 6 denotes the
standard normal density function.
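To make the construction of the noise prototypes concrete, the following short sketch in Python (not the software actually used for this study) simulates the "normal with large noise" prototype as reconstructed in Table 6; the function name and all parameter defaults are illustrative assumptions, not part of the thesis.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(1)

    def noisy_normal_sample(n, n_bumps=7, step_high=2.0, w_normal=0.5, bump_sd=0.35):
        # bump centres: mu_1 = -3, then cumulative uniform(0, step_high) increments
        mu = -3.0 + np.concatenate(([0.0], np.cumsum(rng.uniform(0.0, step_high, n_bumps - 1))))
        # bump weights proportional to phi(mu_i): large near the centre, small in the tails
        w_bump = norm.pdf(mu)
        w_bump = (1.0 - w_normal) * w_bump / w_bump.sum()
        weights = np.concatenate(([w_normal], w_bump))      # N(0,1) part plus the bumps
        means = np.concatenate(([0.0], mu))
        sds = np.concatenate(([1.0], np.full(n_bumps, bump_sd)))
        comp = rng.choice(len(weights), size=n, p=weights)   # draw a mixture component per observation
        return rng.normal(means[comp], sds[comp])

    x = noisy_normal_sample(600)   # one column of a "large noise" class sample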
The prototypes just described are linked together in the next step to produce 20
different datasets, each of dimension 1400x10.
The prototype distributions shown in Table 6 are used only for population 1,
because (at least) the multivariate distributions have to differ, since a
discriminant analysis would not be possible otherwise. For population 2, every
normal density is shifted by 0.5 to the right, and the exponential distribution
changes its parameter from λ = 1 to λ = 2. This shift was chosen so that even in
ten dimensions the error rates remain at approximately 20% and are therefore
noticeably higher than zero, so that differences between the methods become
apparent. Table 7 should make everything clear.
Table 7: Description of the used datasets.
Dataset Nr. Abbrev. contains
1 NN1 10 normal distributions with "small noise"
2 NN2 10 normal distributions with “medium noise”
3 NN3 10 normal distributions with “large noise”
4 SkN1 2 skewed (exp-)distributions and 8 normals
5 SkN2 5 skewed (exp-)distributions and 5 normals
6 SkN3 7 skewed (exp-)distributions and 3 normals
7 Bi1 4 normals, 4 skewed and 2 bimodal (close)-dist.
8 Bi2 4 normals, 4 skewed and 2 bimodal (far)-dist.
9 Bi3 8 skewed and 2 bimodal (close)-dist.
10 Bi4 8 skewed and 2 bimodal (far)-dist.
After the "linking" step, those ten datasets have been linearly transformed by ten
10x10 matrices, which are the square roots of ten self-produced correlation
matrices. The datasets 11-20 have been produced in exactly the same way as numbers
1-10, but population 1 and population 2 have been transformed by unequal
transformation matrices. The datasets having equal covariance matrices have "1"
as their last digit, the others have "2". For example, the dataset "Bi42" consists
originally of eight skewed distributions and two bimodal ones (whose bumps are
strongly separated), and the transformation was carried out with unequal
transformation matrices for the two groups.
The 30 correlation matrices occurring here, for their part, have been produced by
assuming a common factor in the ten variables with a regression coefficient whose
absolute value is uniformly distributed between 0.3 and 1. The simulations showed
that there were no essential differences between the results of the
principal component analyses.
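As an illustration of this construction, the following sketch (again Python, with hypothetical function names; the thesis calculations were done in S-Plus and Excel) builds a one-factor correlation matrix with loadings whose absolute values are uniform on (0.3, 1), takes its symmetric square root and applies it to a data matrix of linked prototype columns.

    import numpy as np

    rng = np.random.default_rng(2)

    def one_factor_corr(d=10):
        # loadings of the common factor; random sign, absolute value uniform on (0.3, 1)
        lam = rng.uniform(0.3, 1.0, d) * rng.choice([-1.0, 1.0], d)
        R = np.outer(lam, lam)
        np.fill_diagonal(R, 1.0)          # unit diagonal makes R a correlation matrix
        return R

    def matrix_sqrt(R):
        # symmetric square root via the eigendecomposition of R
        vals, vecs = np.linalg.eigh(R)
        return vecs @ np.diag(np.sqrt(vals)) @ vecs.T

    R = one_factor_corr()
    X = rng.standard_normal((1400, 10))   # stand-in for the linked prototype columns
    X_corr = X @ matrix_sqrt(R)           # induces the correlation structure R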
The produced data now exhibits many different features, including correlations.
The only drawback is that one can retransform the data by estimating the
correlation matrices in order to get uncorrelated variables. In this case
uncorrelatedness means the same as independence between the variables, which is
not the case in reality, since the correlation coefficient measures only linear
dependence; this means that there is no nonlinear information in my data. If that
were always the case, one would not need to think about any multivariate setting,
since instead of a multivariate density the product of univariate density
estimates could be used. The case of different variances is left out, because
rescaling the variables involves no loss of generality.
4.2.1.2 The real-life data
To overcome the drawback just described, the performance of the kernel setting
should also be examined on a real-life dataset. Therefore a sample (of
observations and variables) from an insurance dataset has been chosen.
In this case the observation numbers for the two groups are not equal. The two
groups are separated by the fact of whether or not the insurer had to pay for a
certain insurance policy during the year 1998. The relation was about 20:80.
In order to estimate both class densities with the same accuracy, the training
datasets were built from samples of equal size (again 600 observations for each
group, the same as for the synthetic data). However, the 200-observation sample
which represents the test data was drawn at random from the remaining dataset.
To gain an impression of the dataset, Figure 17 provides histograms of the ten
dimensions.
Figure 17: Univariate histograms for the insurance-dataset (Variables 1 to 10).
The results for the dimension reduction for the insurance-dataset, as well as for
the synthetic datasets are listed in the appendix (Table 12).
4.2.2 The construction of the estimators and the estimation procedure
As suggested in chapter 3, the kernel concept can be applied in different manners.
Estimators 1 and 2 in my study are constructed as described in section 3.2. A
non-parametric normalization is carried out by using kernel estimates for the
univariate cdf's. For this reason, 20 normalizations have to be carried out,
because there are ten variables and two groups in each dataset. Estimator 1 uses
the bandwidth selected by the "Normal rule" ("Rule of Thumb", (2.9)) to smooth
the cdf, a method which is known to oversmooth in most cases. It will be denoted
as "Normal rule – norm(alized)".
As an alternative, the bandwidth used for estimator 2 is constructed by the
solve-the-equation version of the Sheather-Jones plug-in selector, obtained by
solving (2.14) with respect to h. The results concerning this transformation will
be denoted as "Sheather-Jones – norm(alized)".
The aim is not to get an easier-to-estimate density for the kernel setting, but to
be "allowed" to apply LDA and QDA, respectively, in the classification step. After
normalizing, the two class densities are estimated parametrically by the
maximum-likelihood estimators of the parameters of a normal distribution, because
the transformed multivariate distributions are close to normal and a dimension
reduction including a loss of information is not justified. Depending on whether
equal or unequal covariance matrices are assumed, the results are again reported
separately, denoted by LDA and QDA, respectively.
The 200 test points are evaluated and the corresponding normal-distribution values
are transformed by the formula given in (3.1) to get the estimated density values
in the original space. The derivatives have been calculated by a numerical
approximation. Concerning the empirical dataset, the estimated density values were
also multiplied by the prior probabilities of the corresponding group, because the
Bayes rule is wanted and not the maximum-likelihood discriminant rule. In the case
of the synthetic data both rules coincide.
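The following sketch illustrates the marginal normalization step in Python (the thesis used S-Plus); it assumes that the kernel estimate of the univariate cdf is the integrated Gaussian kernel, that the "Normal rule" bandwidth is the usual 1.06*sigma*n^(-1/5) rule of thumb (the constant in (2.9) may differ), and that (3.1) is the standard change-of-variables formula; all function names are hypothetical.

    import numpy as np
    from scipy.stats import norm

    def rot_bandwidth(x):
        # univariate normal-reference ("Rule of Thumb") bandwidth
        return 1.06 * x.std(ddof=1) * len(x) ** (-1.0 / 5.0)

    def kernel_cdf(x0, data, h):
        # smooth cdf estimate: average of Gaussian cdfs centred at the observations
        return norm.cdf((np.asarray(x0, dtype=float)[..., None] - data) / h).mean(axis=-1)

    def normalize(x0, data, h):
        # T(x) = Phi^{-1}(F_hat(x)) maps the variable to an approximately N(0,1) scale
        return norm.ppf(np.clip(kernel_cdf(x0, data, h), 1e-12, 1 - 1e-12))

    def back_transformed_density(x0, data, h, g, delta=1e-4):
        # f_hat(x) = g(T(x)) * |T'(x)|, where g is the normal density fitted on the
        # transformed scale and T' is approximated by a central difference
        t = normalize(x0, data, h)
        dT = (normalize(np.asarray(x0) + delta, data, h)
              - normalize(np.asarray(x0) - delta, data, h)) / (2 * delta)
        return g(t) * np.abs(dT)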
Using a variable kernel density estimator (2.8) as a third alternative suffers
from one or even two parameters (Breiman et al., 1977) which have to be selected
by the user, and since the aim of this study is a fully automatic selection
method, I have to discard this idea.
The approach discussed in section 3.3 is the basis for the next set of estimators.
Here, the data was projected onto the subspace without any preceding manipulation.
As I pointed out in section 3.3, a reasonable choice for the sub-dimension lies in
the range from two to five.
A principal component analysis transformed the data, and the estimation itself was
done by a multivariate kernel density estimator. The kernel used was a Gaussian
kernel, since the theory is almost exclusively developed for this choice and
hardly any alternatives occur in the multivariate case. Furthermore, the normal
kernel has unbounded support, which is necessary to avoid comparing two zero
values when making a classification.
The "product kernel" generalization was chosen to construct the multivariate
kernel. This choice is also reasonable, because the theory of the multivariate
"Normal rule" bandwidth-matrix selector is based on this generalization (see
section 2.3.3.1).
The spectrum of different bandwidth selectors in the univariate case is quite
wide; in the multivariate model for discrimination tasks, however, the
possibilities are restricted. As described in the section about bandwidth
selection, the reference to a normal distribution is the easiest way, and (2.16)
provides the optimal bandwidth selection for the class of bandwidth matrices
constructed from a bandwidth vector. A closer look at the formula makes clear that
there is actually only one parameter to estimate, since the differences in the
bandwidths are caused only by the different standard deviations. I am going to use
this as my third non-parametric estimator and it will be denoted by "Normal rule".
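A minimal sketch of this estimator (in Python rather than S-Plus or XploRe) is given below; it uses one standard form of the multivariate normal-reference rule, h_j = sigma_j * (4 / ((d+2) n))^(1/(d+4)), which may differ from (2.16) in the constant, and the function names are assumptions of mine.

    import numpy as np

    def normal_rule_bandwidths(X):
        # normal-reference bandwidth vector: only the standard deviations differ across coordinates
        n, d = X.shape
        return X.std(axis=0, ddof=1) * (4.0 / ((d + 2) * n)) ** (1.0 / (d + 4))

    def product_kernel_density(x0, X, h):
        # f_hat(x0) = (1/n) sum_i prod_j phi((x0_j - X_ij) / h_j) / h_j, evaluated on the log scale
        z = (np.asarray(x0, dtype=float) - X) / h
        d = X.shape[1]
        logk = -0.5 * np.sum(z**2, axis=1) - np.sum(np.log(h)) - 0.5 * d * np.log(2.0 * np.pi)
        return np.exp(logk).mean()

    # usage sketch: X_train is an (n, d) matrix of one class after the projection step
    # h = normal_rule_bandwidths(X_train); f = product_kernel_density(x_new, X_train, h)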
The field of cross-validation methods provides estimators which are already a
little more sophisticated. Taking again a look at (2.19), the least squares
cross-validation in that form requires many more computations. The occurring
integral can in general not be solved analytically. However, in the case of a
Gaussian kernel one can derive an exact formula (Bowman, 1984), and the
generalization to the multivariate setting gives the function
\mathrm{LSCV}(\mathbf{H}) = \frac{1}{n}\,\phi(\mathbf{0},\,2\mathbf{H}) + \frac{1}{n^{2}} \sum_{i=1}^{n} \sum_{j \neq i} \phi(\mathbf{x}_i - \mathbf{x}_j,\, 2\mathbf{H}) - \frac{2}{n(n-1)} \sum_{i=1}^{n} \sum_{j \neq i} \phi(\mathbf{x}_i - \mathbf{x}_j,\, \mathbf{H})

to be minimized with respect to H, where \phi(\cdot,\,\mathbf{H}) denotes the
multivariate normal density with mean zero and covariance matrix H. Note that this
succinct formula requires, for a certain input H, the calculation of n \times (n-1)
multivariate normal density values for each class. Since a numerical optimization
has to be carried out, one has of course to restrict the matrix H to depend on
only one parameter h, and even then it is a really tough procedure to find a value
near the minimizer. Nevertheless, I tried this bandwidth selector on my datasets
by using the model in (2.17). To be more precise, I started at the corresponding
value h of the "Normal rule" estimator above and investigated the range between
1/4 of that value and 1.5 times that value on 100 logarithmically equidistant
points, since I know that the univariate counterpart of the "Normal rule", h_ROT,
oversmooths almost everywhere. This estimator, my fourth, is going to be denoted
as "LSCV". Both estimators are used for projections onto two to five dimensions.
For reasons discussed in section 2.3.3.3 the multivariate extension of the plug-in
estimators will not be used here. Even if the formulas for more than two
dimensions were derived, the calculations would exceed a reasonable limit of
computation time.
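The grid search just described can be sketched as follows (Python; assuming, as an illustration only, that the one-parameter model (2.17) scales the normal-reference bandwidth vector by a common factor c, which is scanned over 100 logarithmically spaced values between 0.25 and 1.5).

    import numpy as np

    def gaussian_diag_density(diff, var):
        # multivariate normal density with diagonal covariance 'var' (a length-d vector),
        # evaluated at the (n, n, d) array of pairwise differences 'diff'
        d = diff.shape[-1]
        q = np.sum(diff**2 / var, axis=-1)
        return np.exp(-0.5 * q) / np.sqrt((2.0 * np.pi) ** d * np.prod(var))

    def lscv(X, h):
        # exact LSCV criterion for the product Gaussian kernel with bandwidth vector h
        n = X.shape[0]
        diff = X[:, None, :] - X[None, :, :]
        term1 = gaussian_diag_density(diff, 2.0 * h**2)   # from the integral of f_hat^2
        term2 = gaussian_diag_density(diff, h**2)         # leave-one-out part
        off = ~np.eye(n, dtype=bool)
        return term1.sum() / n**2 - 2.0 * term2[off].sum() / (n * (n - 1))

    def lscv_grid_search(X, h_normal_rule):
        grid = np.exp(np.linspace(np.log(0.25), np.log(1.5), 100))
        scores = [lscv(X, c * h_normal_rule) for c in grid]
        return grid[int(np.argmin(scores))] * h_normal_rule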
Finally, of course, the LDA and the QDA have been carried out with the original
data itself, the most common practice among statisticians. Both methods were based
on the Bayes rule and not on the maximum-likelihood rule.
4.2.3 The performance measure
At this stage one has to think about a quality measure for the classifications
carried out by the rules above. A commonly used measure is the classical
Error-rate (misclassification rate). This is probably the easiest measure to
interpret. Unfortunately, it makes no difference whether the relation of the
posterior probabilities is 0.49:0.51 or 0.01:0.99; in both cases the second group
is going to be predicted, and this measure is therefore not really robust.
Nevertheless, it is an often used measure (Remme et al., 1980; Van Ness and
Simpson, 1976; Van Ness, 1980), which is (in the case of two groups) given by
\mathrm{ER} = \frac{1}{n} \sum_{i=1}^{n} \lvert \hat{k}_i - k_i \rvert ,

where \hat{k}_i and k_i are the assignment and the real group of the observation
x_i, respectively (here coded as 0 and 1, so that each misclassified observation
contributes one to the sum). Different costs of misclassification are not taken
into account either. A detailed discussion of several performance measures is
given by Hand (1997).
The second "coefficient" is a little bit more robust, because here the posterior
probabilities are not only used for the assignment of an observation, but also for
measuring the amount of uncertainty. Strictly speaking, the posterior probabilities
p(k|x) themselves are included in the formula for the Brier-score (Hand, 1997),

\mathrm{BS} = \frac{2}{n} \sum_{i=1}^{n} \bigl( p(2 \mid \mathbf{x}_i) - c_i \bigr)^{2} ,

where c_i = 0 if x_i comes from group 1 and c_i = 1 otherwise.
Since the training data and the test data are independent, both measures should
not be biased. For reasons of variance reduction, I calculated the results for the
exclusive use of the LDA and the QDA (in the case of the synthetic datasets)
several times, always leaving out 100 observations per group and using them as
test data. Thus, every combination of leaving out a block of 100 points in the
class-1 set and a block of 100 points in the class-2 set served as test dataset
(observations 1-100, 101-200, etc.). This leads to 7 x 7 = 49 measures, which are
averaged for the Error-rate as well as for the Brier-score.
In all other settings I calculated each measure only once because of the huge
computation times, but the Brier-score is a robust measure anyway.
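The two measures themselves are easy to compute; the following sketch (Python, hypothetical function names) assumes the group labels c_i are coded 0 (group 1) and 1 (group 2) and that p2 holds the estimated posterior probabilities p(2|x_i) on the test set.

    import numpy as np

    def error_rate(p2, c):
        # assign each observation to the group with the larger posterior probability
        predicted = (np.asarray(p2) > 0.5).astype(int)
        return float(np.mean(predicted != np.asarray(c)))

    def brier_score(p2, c):
        # BS = (2/n) * sum_i (p(2|x_i) - c_i)^2
        return float(2.0 * np.mean((np.asarray(p2) - np.asarray(c)) ** 2))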
Both criteria declare that estimator "the best" which minimizes the corresponding
values. Therefore the "best" estimator is actually only the best concerning
classification accuracy. Other aspects, like speed, cost of classification or the
amount of prior knowledge about the data, are ignored in this study, but they are
discussed in Hand (1997). At least the speed drawback of the kernel methods is
obvious and does not have to be measured.
4.2.4 The software
During all my work concerning the calculations in this study I used three
different programs, namely Microsoft Excel 97, S-Plus 4.5 and XploRe 4.2a. Excel,
because it is very useful to gain quick insights into some problems; it was a kind
of auxiliary tool during the whole study. In addition, it served for the
calculation of the LSCV parameters through automated procedures controlled by
Visual-Basic macros. I chose S-Plus because I now have much experience with its
functions and features and it is really useful in handling vectors and matrices.
Unfortunately, the calculation accuracy of both programs is restricted, since it
is not possible to calculate the value of a standard normal cdf at the point
x = 8 (it is rounded to one), which occurs during the normalization of the data.
The only workaround is to compute such a cdf for values x = -8 or smaller (only
possible with S-Plus) and to exploit the symmetry of the normal distribution to
by-pass the problem.
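A small illustration of this accuracy problem (in Python rather than S-Plus or Excel, so the exact cut-off point differs): for large arguments, computing 1 - Phi(x) naively loses all precision, while the dedicated tail routine or the symmetry Phi(-x) = 1 - Phi(x) keeps it.

    from scipy.stats import norm

    x = 10.0
    naive = 1.0 - norm.cdf(x)   # 0.0 in double precision: the upper tail is lost
    tail = norm.sf(x)           # about 7.6e-24, the survival function 1 - Phi(x)
    same = norm.cdf(-x)         # the same value, obtained via symmetry
    print(naive, tail, same)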
S-Plus was also a very good means to program short automated functions such as a
test for multivariate normality.
Since there are many functions concerning kernel density estimators in XploRe, I
chose that program to gather impressions of the behaviour of different estimators.
Furthermore, it served to get some experience with a new program, which is,
however, not so different from S-Plus.
I also used SPSS, but only to carry out a Box test for equal covariance matrices.
4.3 Results
The results are quite detailed and include many aspects. There are 21 datasets (20
synthetic and one empirical), fourteen estimators (LDA, QDA, "Normal rule – norm"
(LDA and QDA), "Sheather-Jones – norm" (LDA and QDA), "Normal rule" (dimensions
2-5) and "LSCV" (dimensions 2-5)) and two performance measures (Error-rate and
Brier-score), which amounts to a total of 21 x 14 x 2 = 588 scores to discuss. The
tables containing all those scores are listed in the appendix. In this section
only the most important results are going to be pointed out.
4.3.1 LDA versus QDA
Figure 18: Brier-scores for the NN-distributions (NN11-NN32). Comparison between LDA and QDA.
Figure 19: Brier-scores for the SkN-distributions (SkN11-SkN32). Comparison between LDA and QDA.
Figure 20: Brier-scores for the Bi-distributions (Bi11-Bi42). Comparison between LDA and QDA.
Since the QDA estimates the covariance matrix for each class separately, it should
have an advantage for datasets having different covariance matrices; those are the
datasets having "2" as their last digit (see Figure 18, Figure 19 and Figure 20).
This assumption is confirmed by the results, since only "SkN32" shows a worse
performance compared to the LDA.
One interesting fact also becomes apparent from the graphs above. While the QDA is
always slightly worse in the case of equal correlation matrices and much better
otherwise, the situation changes when the distribution becomes more different from
a multivariate normal, as for the NN-distributions. Here, the QDA is much worse
than the LDA for equal covariance matrices and only slightly better otherwise.
That means the relative performance of the QDA compared to the LDA gets worse when
the assumption of a multivariate normal is not justified. The distributions
"SkN11" and "SkN12" are close to a multivariate normal, because a test of
multivariate normality (Grossmann and Hudec, 1999) is not rejected for either
class in these datasets.
4.3.2 LDA and QDA versus the normalized datasets
Figure 21: Error-rates for the NN-distributions. Comparison within the LDA-method by using normalizations (LDA, Normal rule - norm (LDA), Sheather-Jones - norm (LDA)).
Figure 22: Error-rates for the SkN-distributions. Comparison within the LDA-method by using normalizations.
Figure 23: Error-rates for the Bi-distributions. Comparison within the LDA-method by using normalizations.
A result of my study was the fact that it does not make much difference whether
one looks at the Error-rate results or at the classification performance measured
by the Brier-score (at least not with respect to the ordering when comparing
different methods). As mentioned above, the Brier-score is more robust, but it
does not really have an interpretation. Nevertheless, I would prefer it for
comparing the estimators.
Figure 21, Figure 22 and Figure 23 give a first impression of how to use the
kernel concept in the multivariate setting. In Figure 22 almost every
normalization procedure ("Normal rule – norm" and "Sheather-Jones – norm")
improves the classification. Only in the case of "SkN11" there is no improvement;
however, this is one of the datasets which are "identified" as multivariate
normal. The normalization seems to have only a random impact on the distributions
which are very close to the multivariate normal distribution ("NN11"-"NN22").
This is not surprising, because what should be normalized if the original dataset
itself is normal?
Figure 24: Error-rates for the NN-distributions. Comparison within the QDA-method by using normalizations (QDA, Normal rule - norm (QDA), Sheather-Jones - norm (QDA)).
Figure 25: Error-rates for the SkN-distributions. Comparison within the QDA-method by using normalizations.
Figure 26: Error-rates for the Bi-distributions. Comparison within the QDA-method by using normalizations.
Compared to the performance of the QDA, the normalization is a fundamental
improvement in almost all datasets, but one has to handle the results carefully.
Since the performance of the QDA was terrible on the datasets having equal
covariance matrices, it is not surprising that the normalized datasets separate
better; no one would use the QDA after taking e.g. Figure 20 into consideration.
However, the results for unequal covariance matrices are pretty good as well. In
any case, preceding tests of multinormality or of equal covariance matrices seem
to be a suitable aid for further decisions.
Another important fact also becomes apparent: it makes almost no difference
whether the Normal rule or the Sheather-Jones plug-in is used as bandwidth for the
kernel estimator. In this context the choice of the bandwidth is as unimportant as
the choice of the kernel in kernel density estimation. This is pleasant, because
the Normal rule needs almost no calculations compared to other bandwidth
selectors, and at least the drawback of exhaustive calculations in the
multivariate case seems to disappear.
4.3.3 The multivariate kernel density estimators – differences concerning
dimensions
The results for the multivariate density estimators are of great interest, since
the extent of the "trade-off" between the information loss from deleting some
dimensions and the advantage of overcoming the curse of dimensionality has to be
quantified. The Error-rates and Brier-scores were sorted and the average rank for
each "group" of datasets ("NN", "SkN" and "Bi") has been calculated. The results
are shown in Table 8 and Table 9.
Table 8: Average rank of the kernel estimators in dependency on the
dimension of the subspace concerning the Error-rates. "1" is the best.

Error-rate          Normal rule (2)  Normal rule (3)  Normal rule (4)  Normal rule (5)
NN                  2,67             1,83             2,33             3,17
SkN                 3,08             1,92             1,75             3,25
Bi                  2,25             2,31             2,38             3,06
Empirical dataset   3,50             1,50             3,50             1,50
Total               2,67             2,02             2,24             3,07

                    LSCV (2)         LSCV (3)         LSCV (4)         LSCV (5)
NN                  2,58             1,67             2,42             3,33
SkN                 3,25             1,83             2,08             2,83
Bi                  2,56             2,31             2,25             2,88
Empirical dataset   1,00             2,00             3,00             4,00
Total               2,69             1,98             2,29             3,05
As the reader might have expected from the section about the curse of
dimensionality, the results for a classification based on a kernel density
estimate in five dimensions are the worst. This space is probably too sparsely
populated for a number of n = 600 observations. Again, one can see almost no
difference between the two estimation methods, and the relative differences
between the Error-rate and the Brier-score are only marginal as well.
In my opinion, the main statement is that dimensions of size three and four fit
best and that a reduction to two dimensions is too radical. Of course, this
depends in each particular case on the principal component analysis for the
corresponding dataset.
Table 9: Average place of the kernel estimators in dependency on the
dimension of the subspace concerning the Brier-score. "1" is the best.

Brier-score         Normal rule (2)  Normal rule (3)  Normal rule (4)  Normal rule (5)
NN                  2,50             2,33             1,83             3,33
SkN                 3,33             1,33             2,50             2,83
Bi                  2,75             1,63             2,25             3,38
Empirical dataset   1,00             4,00             3,00             2,00
Total               2,76             1,86             2,24             3,14

                    LSCV (2)         LSCV (3)         LSCV (4)         LSCV (5)
NN                  2,50             2,33             2,00             3,17
SkN                 3,33             1,33             2,50             2,83
Bi                  2,75             1,63             2,13             3,50
Empirical dataset   2,00             4,00             1,00             3,00
Total               2,81             1,86             2,14             3,19
For this reason I am going to show two examples. In Table 10 and Table 11 two
quite different results of a principal component analysis are listed. The results
for the dataset "SkN32" are much worse, and therefore the classifiers based on the
corresponding sub-datasets work better for more dimensions.
Table 10: Principal component analysis for "SkN11".

Component   explained variance (%)   cumulated (%)
1           59,125                   59,125
2           10,116                   69,241
3            8,946                   78,188
4            5,797                   83,985
5            5,021                   89,006
6            3,255                   92,261
7            2,922                   95,183
8            2,783                   97,966
9            1,395                   99,361
10           0,639                  100,000

Table 11: Principal component analysis for "SkN32".

Component   explained variance (%)   cumulated (%)
1           26,936                   26,936
2           19,320                   46,256
3           11,222                   57,478
4            7,852                   65,330
5            7,318                   72,648
6            6,536                   79,183
7            6,313                   85,497
8            6,248                   91,744
9            4,329                   96,073
10           3,927                  100,000
This fact is shown in Figure 27. Things are different for "SkN11": here an
explained variance of about 70% is already reached in two dimensions, which is
quite good given the way the correlation matrices have been constructed.
Figure 27: Dependency of the performance of the multivariate kernel density estimator for two datasets. The Brier-score is plotted against the dimension of the subspace (2 to 5) for Normal rule - SkN11, LSCV - SkN11, Normal rule - SkN32 and LSCV - SkN32.
However, this tendency is not inherent to all datasets, and one has to be very
careful in drawing such conclusions.
4.3.4 LDA and QDA versus the multivariate kernel density estimators
Figure 28: Brier-score of the datasets having equal correlation matrices (NN11, NN21, NN31, SkN11, SkN21, SkN31, Bi11, Bi21, Bi31, Bi41). Comparison between the LDA and the Bayes-rule kernel methods constructed by the LSCV-selector (LSCV (3), LSCV (4)).
The results of the multivariate kernel estimators compared to the classical
methods, LDA and QDA, are quite disappointing, and the euphoria of the simulation
studies in the past (e.g. Remme et al., 1980) is, from this point of view, not
comprehensible.
In Figure 28 the performance on the datasets having equal correlation matrices is
compared. The LDA is the best in all cases, and the kernel concepts are quite bad
for the datasets where the assumption of a multivariate normal distribution has to
be rejected. This is really interesting, since the (normal-distribution-based)
LDA should actually lose its advantage for those datasets.
Figure 29: Brier-score of the datasets having unequal correlation matrices (NN12, NN22, NN32, SkN12, SkN22, SkN32, Bi12, Bi22, Bi32, Bi42). Comparison between the QDA and the Bayes-rule kernel methods constructed by the LSCV-selector (LSCV (3), LSCV (4)).
The kernel results for the NN-datasets and the Bi-datasets with unequal
correlation matrices are also quite bad compared to their parametric counterparts.
Slight improvements are observable only for the SkN-datasets.
4.3.5 Results concerning the insurance data
Taking another look at the univariate histograms (Figure 17), the reader
identifies several skewed densities. In view of the results in sections 4.3.2 and
4.3.4, this fact might favour a kernel setting. Unfortunately, the two classes of
this dataset cannot be separated well, and Figure 30 makes this problem
transparent. To understand this, the reader has to be reminded that the proportion
between the observation numbers of the two groups is about 4:1. Thus, an
Error-rate of 20% can be achieved simply by always predicting the larger group,
without even looking at the dataset. This is what actually happens in most cases.
The problem is illustrated in Figure 31, where in each graph two slightly
different class densities are plotted, however with a proportion of 4:1 in their
integrals.
Figure 30: The Error-rates for the insurance data, shown for all fourteen estimators (LDA, QDA, Normal rule (2)-(5), LSCV (2)-(5), Normal rule - norm (LDA/QDA) and Sheather-Jones - norm (LDA/QDA)).
Figure 31: The problem of classification concerning non-equal class observation numbers. Panel a) shows two slightly different normal densities, N(0,1) and 0.25 N(0.5,1); in panel b) two exponential densities, Exp(1) and 0.25 Exp(2), are plotted. The densities in both panels are rescaled to an area-proportion of 4:1.
In b) one will never predict the smaller group, since its scaled density value is
smaller at every point. The situation in a) is not much better, but at least a few
points of the support are classified to the smaller group.
Returning to Figure 30, the results for the LDA and especially for the QDA are
disastrous, but I do not really have an explanation for this result.
4.4 Computational considerations
A great drawback of the kernel method is that it involves many calculations, even
in simple models and even in the univariate setting. Nowadays high-speed computers
are available, and this fact might encourage the user to program blindly, without
taking any simplification of the algorithms into account. However, the problems
arising at least in the multivariate setting force the programmer to avoid
unnecessary calculations in any case, and this aspect should be taken into
consideration.
Silverman (1986) considers this problem and gives advice on how to save
computation time. A well-known hint is that common factors should be multiplied in
after summing up the different elements and not within the sum; for the kernel
density estimator this is very important. Another suggestion is to use a kernel
which has a simple formula, since the choice of the kernel is not that crucial for
the performance. Thus he definitely recommends not using a normal kernel, but this
point is again controversial with respect to the fact that unbounded support is
needed in discrimination, and there is not really an unbounded alternative which
contains no exponential function call. Additionally, the theory for the normal
kernel is really easy compared to any alternatives.
The calculations for the LSCV smoothing parameter were a very demanding procedure.
For each dataset, each group and each candidate bandwidth matrix H, the half of a
symmetric 600x600 matrix had to be filled twice, where every entry is a density
value of a two- to five-dimensional multivariate normal. The computation time
"explodes" here for more observations, as well as for more elements of H chosen to
be flexible. This was the reason that my matrix H depended on only one parameter h
(see section 4.2.2). Fortunately, the results concerning the classification showed
almost no difference between methods using different bandwidth selectors (see
section 4.3).
Other difficulties occurred in the calculation of an estimator for the univariate
cdf in order to carry out the marginal transformations. Since the quantiles of the
values 0 and 1 of the cdf are "-∞" and "∞", respectively, the method does not work
for some "outliers". Thus, the author recommends choosing larger bandwidths. This
fact again favours the "Normal rule", and one is well advised to ignore the other
bandwidths in this particular context.
For the numerical calculation of the derivatives in the normalization approach, it
is also necessary to tune the value of Δ (see Figure 13) to the accuracy provided
by the software used, because otherwise the approximation is either very crude or
not calculable at all.
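The point can be illustrated by a small sketch (Python; the rule of thumb delta ≈ eps^(1/3) scaled by the magnitude of x is a common numerical-analysis heuristic and an assumption here, not a prescription from the thesis).

    import numpy as np

    def central_difference(f, x, delta=None):
        # derivative f'(x) by a central difference; if no step is given, match it to
        # the floating-point accuracy: delta ~ eps^(1/3) * max(1, |x|)
        if delta is None:
            delta = np.cbrt(np.finfo(float).eps) * max(1.0, abs(x))
        return (f(x + delta) - f(x - delta)) / (2.0 * delta)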
4.5 Discussion
The quite good performance of the kernel estimation concept in discriminant
analysis in the different studies (Remme et al., 1980; Van Ness and Simpson, 1976;
Van Ness, 1980) has to be seen from a different point of view given the results in
section 4.3. All the studies mentioned used the concept on the original dataset
and did not care about variable reduction. However, the problem in application is
the information loss caused by the inevitable projection of the data onto a
subspace. That is a factor which is absolutely not negligible, and ignoring it
makes such research unsuitable for application.
The great dilemma before I carried out this study seemed to be the necessity of a
huge training sample, which, however, makes the application on a conventional
computer impracticable. This impracticability is caused by an enormous expenditure
of memory for calculating e.g. the marginal normalizations of the variables, or by
the calculation time for the multivariate kernel density estimator and its
parameters. Fortunately, the choice of the parameters was not crucial, and I
therefore strongly recommend the "Normal rule" as the only method in application.
Anyway, most of the other alternatives are not applicable for certain combinations
of dimensions and observation numbers.
The great difference between the two manners of applying the kernel method (the
use for marginal transformations and the estimation of a multivariate density) is
maybe the most important result I discovered. Since the non-parametric estimation
of the multivariate class densities performs quite badly on most of the datasets,
probably the only possible advice for the user is to restrict multivariate density
estimation to, say, three or fewer variables and to use the method only if the
loss in explained variance after the dimension reduction is "acceptable". However,
it is difficult to give numbers for what is meant by "acceptable". If the user has
a huge set of observations (here "huge" means large enough with respect to the
number of dimensions, see again e.g. Table 5), the application is restricted to
parameters which are straightforward to calculate (depending e.g. only on the
sample variance or the observation number).
My expectations about the performance of the kernel method were actually much
higher than the achieved results. Seemingly, the problems of the multivariate
generalizations are worse than I thought, but it is difficult to imagine the
expanse of a ten-dimensional space, even though many comparisons were drawn.
On the contrary, the marginal normalizations improved the estimation considerably
in most cases. Normalizations can also be carried out by other means than a
univariate kernel estimate of the distribution function, but a parametric
transformation probably does not fulfil this task as well, and furthermore its
parameters have to be chosen by the user. Nevertheless, the need for computer
memory "explodes" here as well, and in my opinion normalizations of variables in
distributions which are close to a multivariate normal are not really necessary.
A preceding test to check this hypothesis would not be a bad idea.
Finally, the dominance of the LDA in all cases of equal covariance matrices within
the classes, regardless of whether the univariate distributions are skewed,
bimodal or normal, also makes a preceding test of equal covariance matrices
inevitable. Here the LDA probably has no alternatives, and the user saves much
computation time.
Chapter 5: Summary and outlook
The whole genesis of this diploma thesis was almost exclusively a series of
surprises concerning the power of the concept of kernel density estimation.
I started with the aim of exploring the status of the kernel density estimation
method within the huge field of density estimation. My only knowledge was that
this concept is quite easy and nice, and its flexibility encouraged me to expect
really good results, in estimating the density itself as well as in classifying
objects.
The first experience concerned the fact that optimality can be defined with
respect to many different criteria, and that a density is only "estimated well"
with respect to such a criterion. The variety of non-parametric methods, of
parameter selection methods in kernel density estimation and of the many criteria
which can be optimized probably makes the whole research field hard to survey.
The fact that different applications of this concept also need different criteria
to optimize leads to an even higher degree of complexity.
A very interesting fact appears in the formulas where optimal parameters
(bandwidths and bandwidth matrices) are derived: one recognizes the unknown
density in such an expression. However, this seems to be a problem which is not
only inherent to kernel density estimation, but also to many other methods in
statistics. In my opinion, this makes the limitedness of statistics transparent,
because you always know only a bit of the reality by observing a sample, and
although there were many attempts to overcome this drawback in the existing
literature, the alternatives are actually quite poor. Simply assuming that the
underlying density is a normal distribution (as is done by the Normal rule)
cannot be the "end of the story".
There was another problem that appeared very questionable to me. In the case of
univariate kernel density estimation, there are several approximations and
"corrections", where the inexperienced user does not really know what was finally
optimized.
For example, the L1 theory consists of several "upper bounds" for the error
criteria which are optimized. Another problem is, even if one makes the decision
to optimize measures based on the L2-distance, whether to choose the ISE or the
MISE criterion. In many cases the AMISE criterion is optimized, and nobody knows
how close the AMISE is to the MISE or how close the resulting bandwidths are.
Whole bandwidth-selection philosophies (BCV, plug-in) are based on the
minimization of the AMISE.
In the case of the BCV and the LSCV selector, respectively, the occurrence of
several minima is possible, and the discussions about which one to choose are
quite controversial (Wand and Jones, 1995; Sheather, 1992). The reference to a
certain distribution in order to derive optimal parameters, and the justification
of the L2-distance itself, which is used only because of the need for easier
calculations, are other examples.
The fact that discriminant analysis is based on differences in the logarithms of
densities reveals another problem for somebody who has thought that there are no
more difficulties than those mentioned above. In order to get suitable results,
one has to estimate the log-densities, but the kernel density estimation concept
is not well suited for such a change. Moreover, in high dimensions there are only
"tails", and the literature ignores this completely. These facts probably force
the user to distinguish between density estimation for exploratory purposes in one
or at most two dimensions and other tasks like the use in discriminant analysis
(probably restricted to about five dimensions, depending on the sample sizes).
For the first purpose the kernel method nevertheless performs quite well. For the
second one, a good performance for datasets in high dimensions seems almost
impossible (see section 4.3.4). Some theoretical concepts cannot be generalized,
others suffer from an unacceptable computation time.
The good improvements from using the kernel method for preceding marginal
transformations of the variables seem to be a kind of "happy end" of writing this
thesis. Here one can achieve further improvements by using e.g. a variable kernel
density estimator, but this already results in a procedure that is not fully
automatic, since all concepts concerning this topic require (an) additional
parameter(s). The range of densities occurring in applications is wide, and highly
skewed distributions also have to be taken into account; a fully automatic
transformation procedure has to work for every density. The user can also think of
other improvements, like applying the normalization step several times. However,
this method is again evidence of the need to by-pass the problem of estimating the
densities in a multivariate way.
An outlook for further research cannot be given easily. In the case of univariate
density estimation there seems to be a consensus in the sense that it is important
to fit the density by optimizing some qualitative features and not various
integrals or squared integrals between two curves. Marron and Tsybakov (1995)
showed that there are other aspects of fitting a density in order to get an
intuitively good result. They pointed out that one should emphasize the features
"the eye sees" and not those found by optimizing Lp-distances. Probably the
mathematicians have to be more creative. After that, a consensus about the
question "When is a density estimated well?" has to be found.
The future of multivariate kernel density estimation as an alternative to other
classification algorithms looks somewhat hopeless. This does not concern the
problem of generalizing existing univariate parameter selection methods, since
different selectors do not have a crucial impact on the classification rates. The
main problem is the often discussed "trade-off" between the curse of
dimensionality and the explanation loss caused by reducing dimensions. Only
computers which are much faster than the usual personal computers can help,
provided the observation numbers "fit" the number of variables. One definitely has
to draw a border for observation numbers or variables, with the comment: "Below
those numbers a use of this method is possible and reasonable, above them it is
not!"
References
Bomze, I. M. (1998). Mathematik 3+4 für Statistiker. Lecture notes. Institute for
Statistics and Decision Support Systems, Vienna University.
Bowman, A. W. (1984). An alternative method of cross-validation for the
smoothing of density estimates. Biometrika 71, 353-60.
Bowman, A. W. and Azzalini, A. (1997). Applied Smoothing Techniques for Data
Analysis. Oxford Science Publications.
Breiman, L., Meisel, W. and Purcell, E. (1977). Variable kernel estimates of
multivariate densities. Technometrics 19, 135-44.
Cao, R., Cuevas, A. and Gonzalez-Manteiga, W. (1994). A comparative study of
several smoothing methods in density estimation. Comp. Stat. Data Anal. 17,
153-76.
Devroye, L. and Györfi, L. (1985). Nonparametric Density Estimation: The L1
View. Wiley, New York.
Fukunaga, K. (1972). Introduction to statistical pattern recognition. Academic
Press, New York.
Grossmann, W. and Hudec, M. (1999). Multivariate statistische Verfahren.
Lecture notes. Institute for Statistics and Decision Support Systems, Vienna
University.
Habbema, J. D. F., Hermans, J. and Remme, J. (1978). Variable kernel density
estimation in discriminant analysis. Compstat 1978, Proceedings in
Computational Statistics. Physica Verlag, Vienna
Hall, P. and Wand, M. P. (1988). On nonparametric discrimination using density
differences. Biometrika 75, 541-7.
Hand, D. J. (1982). Kernel Discriminant Analysis. Research Studies Press,
Chichester.
Hand, D. J. (1997). Construction and Assessment of Classification Rules. John
Wiley & Sons, Chichester.
Härdle, W. (1991). Smoothing Techniques with Implementation in S. Springer-
Verlag, New York.
Härdle, W., Klinke, S. and Müller, M. (2000). XploRe - Learning Guide.
Springer-Verlag, Berlin.
Jones, M. C. (1991). The roles of ISE and MISE in density estimation. Statist.
Probab. Lett. 12, 51-56.
Jones, M. C., Marron, J. S. and Park, B. U. (1991). A simple root-n bandwidth
selector. Ann. Statist. 19, 1919-32.
Jones, M. C., Marron, J. S. and Sheather, S. J. (1996). A brief survey of
bandwidth selection for density estimation. Journal of the American
Statistical Association 91, 401-7.
Marron, J. S. (1993). Discussion of ‘Practical performance of several data-driven
bandwidth selectors’ by Park and Turlach. Comput. Statist. 8, 17-9.
Marron, J. S. and Tsybakov, A. B. (1995). Visual Error criteria for qualitative
smoothing. J. Amer. Statist. Assoc. 90, 499-507.
Marron, J. S. and Wand, M. P. (1992). Exact mean integrated squared error. Ann.
Statist. 20, 712-36.
Park, B. U. and Marron, J. S. (1990). Comparison of data-driven bandwidth
selectors. J. Amer. Statist. Assoc. 85, 66-72.
Park, B. U. and Turlach, B. (1992). Practical performance of several data driven
bandwidth selectors (with discussion). Comput. Statist. 7, 251-85.
Ripley, B. D. (1996). Pattern Recognition and Neural Networks, Cambridge
University Press.
Remme, J., Habbema, J. D. F. and Hermans, J. (1980). A simulative comparison
of linear, quadratic and kernel discrimination. J. Statist. Comput. Simul. 11,
87-106.
Rosenblatt, M. (1956). Remarks on some nonparametric estimates of a density
function. Ann. Math. Statist. 27, 832-7.
Ruppert, D. and Cline, D. B. H. (1994). Transformation kernel density estimation
– bias reduction by empirical transformations. Ann. Statist. 22, 185-210.
Scott, D.W. (1992). Multivariate Density Estimation: Theory, Practice and
Visualization. Wiley, New York.
Sheather, S. J. (1992). The performance of six popular bandwidth selection
methods on some real data sets (with discussion). Comput. Statist. 7, 225-50,
271-81.
Sheather, S. J. and Jones, M. C. (1991). A reliable data-based bandwidth-selection
method for kernel density estimation. J. Royal Statist. Soc. Ser. B 53, 683-
90.
Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis.
Chapman and Hall, London.
Van Ness, J. (1980). On the dominance of non-parametric bayes rule discriminant
algorithms in high dimensions. Pattern Recognition 12, 355-68.
Van Ness, J. W. and Simpson, C. (1976). On the effects of dimension in
discriminant analysis. Technometrics 18, 175-87.
Wand, M. P. and Jones, M. C. (1994). Multivariate plug-in bandwidth-selection.
Comput. Statist. 9, 97-117.
Wand, M. P. and Jones, M. C. (1995). Kernel Smoothing. Chapman and Hall,
London.
Appendix
About the used literature
The interested reader is recommended to read the literature listed above. I just
want to give a few comments on the sources used, underlining why I used them.
Books
The book by Silverman (1986) is an excellent overview of non-parametric density
estimation with great emphasis on the kernel setting and its extensions. It is
probably the first major collection of non-parametric density estimation ideas.
A comparable book, which however concentrates only on the kernel setting, is the
one by Wand and Jones (1995). They provide many graphs, which give great insight
into many methods and problems in this setting. A collection of bandwidth
selection ideas is included as well. Both books base their optimality results on
minimizing L2-measures.
A quite technical book, which keeps an eye on the L1-optimal results, is the book
by Devroye and Györfi (1985). Many proofs are included, but it is maybe too
technical for a newcomer to this field.
While the book by Härdle (1991) provides many density estimation algorithms for
the S-Plus software package, Härdle et al. (2000) is a manual for the software
package XploRe, which contains many modern univariate bandwidth selectors.
Bowman and Azzalini (1997) also treat smoothing methods, but do not really focus
on the estimation of densities.
Multivariate density estimation, which builds the bridge to discriminant analysis,
is maybe discussed in most detail in Scott (1992). Here also practical hints about
the necessary task of dimension reduction are given. However, the comments about
the kernel setting in the discrimination context are quite rare. Nevertheless,
some visualization techniques for densities in more than two dimensions are
introduced.
The combination of kernel density estimation and classification is covered by
Hand (1982), even though practical advice for handling high-dimensional datasets
is not given; in return, he discusses categorical data. Ripley (1996) and Hand
(1997) give evidence that the kernel approach is nowadays maybe not the most
popular one for classification tasks, since only a few pages in each of their
great overviews of pattern recognition methods are devoted to kernel density
estimation. While the former criticises the density estimation literature for the
fact that non-parametric density estimation is usually not directed towards good
discrimination results, the latter gives advice concerning the optimality of
several classification rules. It also points out connections between different
classification methods.
Papers
Starting with the first description of the kernel density estimation concept by
Rosenblatt (1956), many suggestions about how to select the bandwidth
appropriately (the most important decision to be made) have been published. The
principle of cross-validation was soon applied in the form of likelihood
cross-validation, which seems to be a natural development, and of least-squares
cross-validation (Bowman, 1984), both based on the leave-one-out principle. LSCV
was for a long time the most popular bandwidth selector, but at the beginning of
the 1990s a great impetus was given by the plug-in methods. A first version (Park
and Marron, 1990) was slightly improved by a second (Sheather and Jones, 1991).
Convergence rates were also taken into account, and the fastest possible AMISE
convergence rate was achieved (see e.g. the estimator of Jones et al., 1991).
Since it was unknown whether better convergence rates translate into better
estimates for reasonable sample sizes, numerous comparative studies were carried
out on synthetic data (e.g. Cao et al., 1994; Jones et al., 1996; Park and
Turlach, 1992; Marron, 1993) or on real-life datasets (Sheather, 1992).
Extended concepts like the variable kernel estimator (Breiman et al., 1977) as
well as possible univariate transformations were also considered (see e.g. the
nonparametric method of Ruppert and Cline, 1994).
The important question of which error criterion is the best to optimize was
treated by Jones (1991), Marron and Wand (1992) and Marron and Tsybakov (1995).
Multivariate density estimation is actually a quite new research area and many
questions are yet to be resolved. Remarkable in this respect is the multivariate
generalization of the Sheather-Jones plug-in (Wand and Jones, 1994), even though
it is very restricted and not really viable for discrimination purposes.
The only studies I found where the kernel setting serves as an alternative to the
classical parametric methods go back to around 1980. Habbema et al. (1978) give
advice about choosing the parameters of a variable kernel setting to obtain
optimal classification results, which formed the basis for the attempts in Remme
et al. (1980). Both parameter selection methods refer to likelihood
cross-validation and are very simple.
Van Ness and Simpson (1976) and Van Ness (1980) investigated the influence of high
dimensions, but used very small sample sizes. Finally, a more recent idea (Hall
and Wand, 1988) seems to concentrate at least partially on a discrimination goal
by estimating differences of densities.
Notation and common abbreviations
f(.) the underlying density
h the bandwidth (smoothing parameter)
σ(.) the standard deviation
H the bandwidth-matrix
K the kernel function in the univariate case
K, κ the multivariate and the univariate kernel in the multivariate setting
f̂_h(.) the density estimate based on a certain bandwidth h
|A|, |b| the determinant of the matrix A (also det(A)) or the absolute value of b
R(g) \int g(x)^2 \, dx
p(k|x) posterior probability
p̂(k) prior probability
f(x|k) conditional density given class k
ASH average shifted histogram
AMIAE asymptotic mean integrated absolute error
AMISE asymptotic mean integrated squared error
AMSE asymptotic mean squared error
ANOVA analysis of variance
BCV biased cross-validation
cdf cumulative distribution function
IAE integrated absolute error
ISE integrated squared error
LDA linear discriminant analysis
LSCV least square cross-validation
MIAE mean integrated absolute error
MISE (EISE) mean (expected) integrated squared error
QDA quadratic discriminant analysis
Tables
Table 12: Results of the principal component analysis. The percentage of the
explained variance is shown for all datasets.
Dataset / Dimension   1      2      3      4      5      6      7      8      9      10
NN11             53,04  62,02  70,18  77,15  83,54  89,54  94,64  97,51  99,27  100
NN12             29,23  50,36  59,99  67,59  74,49  81,07  87,12  92,59  96,49  100
NN21             33,94  45,25  54,06  62,50  70,57  78,18  84,94  91,19  96,26  100
NN22             29,53  54,86  63,35  70,68  77,69  83,80  89,34  93,56  97,22  100
NN31             35,78  45,85  54,37  62,77  71,10  79,03  86,40  92,58  97,74  100
NN32             40,14  57,37  65,59  72,29  78,26  83,93  88,84  93,44  97,57  100
SkN11            59,13  69,24  78,19  83,99  89,01  92,26  95,18  97,97  99,36  100
SkN12            36,45  54,05  63,34  71,43  78,15  84,55  90,40  94,62  97,77  100
SkN21            58,85  68,48  75,55  81,77  86,31  89,70  92,99  95,79  98,21  100
SkN22            42,92  60,80  68,85  74,64  80,21  85,55  89,92  93,93  97,68  100
SkN31            63,88  73,04  79,72  85,61  90,56  94,04  97,20  99,06  99,68  100
SkN32            26,94  46,26  57,48  65,33  72,65  79,18  85,50  91,74  96,07  100
Bi11             59,62  66,76  72,25  77,50  82,57  87,31  91,59  95,12  98,02  100
Bi12             35,41  60,23  69,36  76,38  81,85  87,14  91,33  94,53  97,33  100
Bi21             38,96  50,07  58,46  66,13  73,58  80,42  86,79  92,63  98,14  100
Bi22             47,71  73,17  80,27  86,15  91,26  93,97  96,37  97,90  99,11  100
Bi31             53,25  64,11  71,48  78,15  84,19  89,73  93,92  96,95  99,24  100
Bi32             39,86  67,23  74,22  80,17  85,65  89,88  93,86  96,59  98,62  100
Bi41             70,03  77,98  83,23  87,35  91,32  94,66  97,05  98,68  99,58  100
Bi42             47,67  69,50  75,62  81,27  86,08  90,37  93,66  96,91  99,20  100
Insurance Data   33,38  51,54  61,69  71,38  79,17  86,12  91,73  96,18  98,99  100
Table 13: Classification results for the LDA.
Dataset          Error-rate  std. err.  Brier-score  std. err.
NN11             0,238       0,018      0,317        0,018
NN12             0,191       0,024      0,273        0,024
NN21             0,235       0,020      0,321        0,020
NN22             0,231       0,040      0,331        0,033
NN31             0,280       0,032      0,377        0,027
NN32             0,301       0,028      0,388        0,030
SkN11            0,191       0,032      0,264        0,027
SkN12            0,222       0,024      0,303        0,028
SkN21            0,188       0,023      0,253        0,019
SkN22            0,136       0,022      0,199        0,018
SkN31            0,163       0,023      0,229        0,025
SkN32            0,164       0,022      0,222        0,023
Bi11             0,195       0,029      0,275        0,031
Bi12             0,102       0,027      0,157        0,028
Bi21             0,183       0,019      0,259        0,021
Bi22             0,080       0,016      0,115        0,018
Bi31             0,173       0,019      0,245        0,026
Bi32             0,119       0,021      0,183        0,030
Bi41             0,162       0,022      0,238        0,029
Bi42             0,115       0,018      0,159        0,021
Insurance Data   0,237       0,022      0,358        0,030
Table 14: Classification results for the QDA.
Dataset          Error-rate  std. err.  Brier-score  std. err.
NN11             0,249       0,020      0,330        0,019
NN12             0,104       0,025      0,146        0,030
NN21             0,247       0,018      0,329        0,018
NN22             0,094       0,017      0,134        0,020
NN31             0,300       0,028      0,394        0,026
NN32             0,104       0,018      0,147        0,026
SkN11            0,222       0,036      0,320        0,041
SkN12            0,199       0,020      0,288        0,031
SkN21            0,269       0,028      0,412        0,045
SkN22            0,124       0,025      0,191        0,036
SkN31            0,317       0,025      0,484        0,030
SkN32            0,183       0,026      0,276        0,044
Bi11             0,281       0,027      0,403        0,037
Bi12             0,092       0,023      0,134        0,029
Bi21             0,257       0,024      0,382        0,030
Bi22             0,052       0,013      0,081        0,021
Bi31             0,359       0,020      0,555        0,032
Bi32             0,056       0,017      0,085        0,023
Bi41             0,359       0,016      0,561        0,014
Bi42             0,110       0,017      0,156        0,024
Insurance Data   0,313       0,064      0,496        0,099
Quadratic discriminant analysis (QDA)
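
The two quality measures reported in Tables 13 to 20, the error rate and the Brier score, can be computed from estimated posterior probabilities as sketched below. The sketch is illustrative only: the convention of summing the squared differences over all classes before averaging is an assumption, and the toy posteriors are made up.

    import numpy as np

    def error_rate(posterior, y):
        """Fraction of misclassified observations under the Bayes rule
        (assign each observation to the class with highest posterior)."""
        return np.mean(np.argmax(posterior, axis=1) != y)

    def brier_score(posterior, y):
        """Mean squared difference between the posterior probabilities and
        the 0/1 class indicators, summed over classes (one common convention)."""
        n, K = posterior.shape
        indicator = np.zeros((n, K))
        indicator[np.arange(n), y] = 1.0
        return np.mean(np.sum((posterior - indicator) ** 2, axis=1))

    # toy usage with made-up posteriors for a two-class problem
    posterior = np.array([[0.8, 0.2], [0.3, 0.7], [0.6, 0.4]])
    y = np.array([0, 1, 1])
    print(error_rate(posterior, y), brier_score(posterior, y))
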
Table 15: Classification results for the estimator "Normal rule" in two and
three dimensions.
                 Normal rule (2)           Normal rule (3)
Dataset          Error-rate  Brier-score   Error-rate  Brier-score
NN11             0,270       0,383         0,285       0,393
NN12             0,175       0,258         0,125       0,212
NN21             0,280       0,364         0,275       0,360
NN22             0,200       0,250         0,145       0,231
NN31             0,330       0,445         0,350       0,441
NN32             0,190       0,293         0,185       0,300
SkN11            0,190       0,275         0,180       0,267
SkN12            0,215       0,303         0,200       0,270
SkN21            0,245       0,346         0,230       0,322
SkN22            0,170       0,227         0,115       0,168
SkN31            0,255       0,367         0,210       0,285
SkN32            0,145       0,218         0,160       0,198
Bi11             0,380       0,477         0,310       0,395
Bi12             0,130       0,171         0,120       0,148
Bi21             0,385       0,462         0,365       0,453
Bi22             0,125       0,151         0,215       0,257
Bi31             0,220       0,322         0,240       0,347
Bi32             0,140       0,212         0,160       0,204
Bi41             0,360       0,466         0,420       0,457
Bi42             0,165       0,260         0,135       0,192
Insurance Data   0,200       0,283         0,195       0,318
Table 16: Classification results for the estimator "Normal rule" in four and
five dimensions.
                 Normal rule (4)           Normal rule (5)
Dataset          Error-rate  Brier-score   Error-rate  Brier-score
NN11             0,285       0,387         0,320       0,395
NN12             0,125       0,211         0,145       0,216
NN21             0,310       0,380         0,270       0,374
NN22             0,150       0,215         0,155       0,216
NN31             0,320       0,423         0,375       0,463
NN32             0,215       0,298         0,270       0,336
SkN11            0,190       0,296         0,195       0,290
SkN12            0,195       0,273         0,215       0,282
SkN21            0,330       0,411         0,335       0,426
SkN22            0,140       0,180         0,165       0,209
SkN31            0,195       0,292         0,210       0,307
SkN32            0,140       0,184         0,145       0,182
Bi11             0,305       0,400         0,305       0,398
Bi12             0,105       0,150         0,110       0,159
Bi21             0,245       0,353         0,235       0,370
Bi22             0,270       0,368         0,350       0,487
Bi31             0,340       0,409         0,395       0,541
Bi32             0,160       0,190         0,195       0,268
Bi41             0,485       0,605         0,495       0,760
Bi42             0,195       0,237         0,280       0,382
Insurance Data   0,200       0,313         0,195       0,303
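
The label "Normal rule (d)" refers to kernel density estimates in d dimensions whose bandwidths follow the normal reference rule. A minimal sketch of the textbook multivariate version for a Gaussian product kernel is given below; whether the thesis uses exactly this constant is an assumption.

    import numpy as np

    def normal_rule_bandwidths(X):
        """Per-dimension bandwidths for a Gaussian product kernel according to
        the multivariate normal reference rule
            h_j = (4 / (d + 2))**(1 / (d + 4)) * n**(-1 / (d + 4)) * sigma_j."""
        n, d = X.shape
        sigma = X.std(axis=0, ddof=1)   # per-variable sample standard deviations
        return (4.0 / (d + 2)) ** (1.0 / (d + 4)) * n ** (-1.0 / (d + 4)) * sigma

    # example: bandwidths for 100 observations in three dimensions
    X = np.random.standard_normal((100, 3))
    print(normal_rule_bandwidths(X))
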
Table 17: Classification results for the estimator "LSCV" in two and three
dimensions.
                 LSCV (2)                  LSCV (3)
Dataset          Error-rate  Brier-score   Error-rate  Brier-score
NN11             0,280       0,387         0,280       0,396
NN12             0,190       0,252         0,120       0,211
NN21             0,285       0,366         0,270       0,360
NN22             0,210       0,251         0,155       0,231
NN31             0,330       0,446         0,345       0,437
NN32             0,170       0,284         0,195       0,294
SkN11            0,190       0,276         0,180       0,268
SkN12            0,215       0,296         0,200       0,262
SkN21            0,260       0,347         0,230       0,315
SkN22            0,185       0,227         0,105       0,163
SkN31            0,260       0,366         0,215       0,284
SkN32            0,140       0,214         0,140       0,195
Bi11             0,370       0,477         0,325       0,396
Bi12             0,130       0,175         0,120       0,145
Bi21             0,390       0,462         0,350       0,449
Bi22             0,105       0,141         0,125       0,178
Bi31             0,220       0,325         0,265       0,337
Bi32             0,160       0,216         0,160       0,208
Bi41             0,360       0,448         0,365       0,421
Bi42             0,140       0,211         0,085       0,126
Insurance Data   0,185       0,285         0,195       0,319
Table 18: Classification results for the estimator "LSCV" in four and five
dimensions.
                 LSCV (4)                  LSCV (5)
Dataset          Error-rate  Brier-score   Error-rate  Brier-score
NN11             0,300       0,388         0,340       0,401
NN12             0,125       0,210         0,145       0,216
NN21             0,310       0,383         0,275       0,375
NN22             0,155       0,214         0,160       0,214
NN31             0,310       0,423         0,375       0,462
NN32             0,225       0,291         0,265       0,334
SkN11            0,185       0,299         0,195       0,288
SkN12            0,200       0,269         0,240       0,282
SkN21            0,290       0,385         0,335       0,409
SkN22            0,140       0,176         0,135       0,192
SkN31            0,210       0,288         0,200       0,295
SkN32            0,125       0,180         0,130       0,168
Bi11             0,320       0,406         0,315       0,408
Bi12             0,115       0,148         0,110       0,159
Bi21             0,225       0,336         0,220       0,353
Bi22             0,215       0,282         0,270       0,372
Bi31             0,325       0,388         0,350       0,491
Bi32             0,135       0,160         0,175       0,229
Bi41             0,425       0,576         0,470       0,740
Bi42             0,100       0,137         0,185       0,238
Insurance Data   0,200       0,277         0,205       0,296
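
The tables labelled "LSCV" use bandwidths chosen by least squares cross-validation. For a one-dimensional Gaussian-kernel estimate the criterion can be evaluated exactly and minimised over a grid, as in the sketch below; applying it coordinate-wise for the higher-dimensional product kernels is an assumption about the implementation.

    import numpy as np

    def lscv_score(x, h):
        """Least squares cross-validation criterion for a 1-D Gaussian-kernel
        density estimate with bandwidth h (smaller is better)."""
        x = np.asarray(x, dtype=float)
        n = len(x)
        d = x[:, None] - x[None, :]                    # pairwise differences
        # exact integral of fhat^2 (Gaussian kernel convolved with itself)
        int_f2 = np.exp(-d**2 / (4 * h**2)).sum() / (n**2 * h * 2 * np.sqrt(np.pi))
        # leave-one-out density estimates at the observed points
        k = np.exp(-d**2 / (2 * h**2)) / np.sqrt(2 * np.pi)
        loo = (k.sum(axis=1) - k.diagonal()) / ((n - 1) * h)
        return int_f2 - 2 * loo.mean()

    # pick the bandwidth that minimises the criterion over a grid
    x = np.random.standard_normal(200)
    grid = np.linspace(0.05, 1.0, 60)
    h_lscv = grid[np.argmin([lscv_score(x, h) for h in grid])]
    print(h_lscv)
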
Table 19: Estimator "Normal rule - normalized". The normalization is done
by the non-parametric kernel estimate.
                 LDA normalized            QDA normalized
Dataset          Error-rate  Brier-score   Error-rate  Brier-score
NN11             0,225       0,321         0,240       0,330
NN12             0,200       0,273         0,050       0,101
NN21             0,235       0,295         0,220       0,294
NN22             0,210       0,308         0,065       0,094
NN31             0,245       0,340         0,260       0,358
NN32             0,280       0,361         0,110       0,141
SkN11            0,205       0,276         0,215       0,287
SkN12            0,180       0,254         0,120       0,179
SkN21            0,180       0,269         0,190       0,276
SkN22            0,100       0,145         0,085       0,127
SkN31            0,150       0,242         0,150       0,248
SkN32            0,110       0,153         0,040       0,063
Bi11             0,190       0,270         0,205       0,297
Bi12             0,100       0,144         0,060       0,095
Bi21             0,140       0,216         0,165       0,241
Bi22             0,090       0,117         0,055       0,074
Bi31             0,190       0,290         0,210       0,314
Bi32             0,105       0,179         0,030       0,042
Bi41             0,150       0,213         0,205       0,310
Bi42             0,075       0,114         0,045       0,060
Insurance Data   0,195       0,289         0,210       0,287
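
Tables 19 and 20 concern the "normalized" variants, where the data are first transformed with a non-parametric kernel estimate and then classified by LDA or QDA. One plausible reading, stated here only as an assumption and not necessarily the exact procedure of the thesis, is a coordinate-wise transformation z = Φ⁻¹(F̂(x)), with F̂ a kernel estimate of the marginal distribution function:

    import numpy as np
    from scipy.stats import norm

    def normalize_margins(X, bandwidths):
        """Transform each column towards normality via its kernel-smoothed
        distribution function: z = Phi^{-1}(Fhat(x)). Purely illustrative."""
        n, d = X.shape
        Z = np.empty_like(X, dtype=float)
        for j in range(d):
            # kernel cdf estimate of column j, evaluated at the data points
            u = (X[:, j][:, None] - X[:, j][None, :]) / bandwidths[j]
            F = norm.cdf(u).mean(axis=1)
            F = np.clip(F, 1e-6, 1 - 1e-6)   # keep Phi^{-1} finite
            Z[:, j] = norm.ppf(F)
        return Z
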
Table 20: Estimator "Sheather-Jones - normalized". The normalization is
done by the non-parametric kernel estimate.
                 LDA normalized            QDA normalized
Dataset          Error-rate  Brier-score   Error-rate  Brier-score
NN11             0,225       0,323         0,240       0,332
NN12             0,200       0,275         0,055       0,101
NN21             0,240       0,293         0,220       0,292
NN22             0,215       0,312         0,065       0,094
NN31             0,225       0,335         0,245       0,352
NN32             0,280       0,365         0,110       0,141
SkN11            0,195       0,267         0,195       0,279
SkN12            0,175       0,254         0,120       0,183
SkN21            0,180       0,271         0,195       0,281
SkN22            0,095       0,142         0,085       0,129
SkN31            0,135       0,238         0,145       0,247
SkN32            0,105       0,153         0,030       0,058
Bi11             0,200       0,269         0,200       0,298
Bi12             0,100       0,142         0,050       0,094
Bi21             0,150       0,221         0,165       0,251
Bi22             0,090       0,116         0,055       0,076
Bi31             0,200       0,287         0,210       0,316
Bi32             0,100       0,180         0,025       0,039
Bi41             0,150       0,212         0,205       0,314
Bi42             0,080       0,118         0,050       0,058
Insurance Data   0,195       0,281         0,175       0,273