Kernel Density Estimation Theory, Aspects of Dimension and Application in Discriminant
Analysis
submitted by:
Thomas Ledl
DIPLOMA THESIS submitted in partial fulfilment of the requirements for the academic degree
Magister rerum socialium oeconomicarumque (Mag. rer. soc. oec.)
(Master of Social and Economic Sciences)
Faculty of Economics and Computer Science,
University of Vienna
Field of study: Statistics
Reviewer:
Univ.-Prof. Dr. Wilfried Grossmann
Vienna, March 2002
I affirm:
that I have written this diploma thesis independently, have not used any sources or aids other than those indicated, and have not made use of any other unauthorized help;
that I have not previously submitted this thesis topic, in Austria or abroad, in any form as an examination paper (to any reviewer for assessment);
that this work is identical to the version assessed by the reviewer.
Thomas Ledl
Preface
The following diploma thesis is intended as a thesis in applied statistics. I state this in the first paragraph of my work because the subject can be treated either from a theoretical or from an applied point of view, although the border between these two areas of statistics cannot be drawn exactly.
The reason why I chose this subject is that, on the one hand, density estimation of a random variable is an elementary and important task in statistics, which is already treated in the first weeks of a statistics programme (and in the basic statistics lectures of all related programmes as well) in its most basic form, the histogram estimate. Density estimates are used both for descriptive purposes (detecting symmetry properties, the number and location of modes, skewness, etc.) and, at least indirectly, in almost every inferential context, which makes a good estimate a powerful tool. I chose a non-parametric approach because, on the one hand, we nowadays have the computing power to calculate, for example, kernel density estimates even for datasets with a large number of observations, and on the other hand it is a flexible setting, since this approach probably provides a better fit to the underlying structure in the case of non-normally distributed variables.
The second topic, the use of non-parametric density estimation for Bayes-rule discriminant analysis, was motivated by one of my last lectures on discriminant analysis, in which it was only briefly mentioned that a non-parametric estimate can be used as an alternative in this setting.
My interest in how well this can be done for certain datasets is exactly the main goal of this thesis. The estimation of the densities, the kernel discrimination and any necessary dimension reduction should be seen in context and not separately. The methods are applied both to self-constructed data and to a real-life dataset. In short, the final result ought to be a kind of "check list" on how best to perform a given discrimination task using Bayes-rule kernel density estimation.
I want to thank all the people who were directly or indirectly involved in making it possible for me to write this thesis as it now stands. First of all I want to mention some people of the Department of Statistics and Decision Support Systems at the University of Vienna. I am especially grateful to Prof. Grossmann, my mentor, who made sure that the work went in the right direction and who provided literature and software for me. I also want to thank Prof. Pflug for providing further sources of literature, and Prof. Neuwirth, who sparked my interest in this topic in the first weeks of my studies through his excellent interactive presentation of the kernel density estimator concept.
Finally, it is a great pleasure for me to thank all members of my family, who gave me financial and emotional support; without them this work would not have been possible in this form. I also thank my fellow students, who provided the right social environment during my studies and with whom I had a really good relationship.
Contents
Table of Diagrams
Table of Tables
Chapter 1: Introduction
Chapter 2: Kernel density estimation during the last 25 years
2.1 Introduction
2.2 The univariate case
2.2.1 From the histogram to the kernel density estimator
2.2.2 The model of the kernel density estimator
2.2.3 Optimization criteria
2.2.4 Calculations of the error criteria
2.2.5 Optimality properties of the kernel
2.2.6 Further developments of the model to improve the estimation
2.2.7 Bandwidth selection
2.3 The multivariate case
2.3.1 The model
2.3.2 Parametrizations
2.3.3 Parameter selection
2.3.4 The curse of dimensionality
2.4 The context to kernel discriminant analysis
Chapter 3: Dimension Reduction and Marginal Transformations
3.1 Introduction
3.2 Marginal transformations
3.3 Dimension reduction
Chapter 4: Simulation Study and Real-life Application
4.1 Introduction
4.2 Preliminaries
4.2.1 The data
4.2.2 The construction of the estimators and the estimation procedure
4.2.3 The performance measure
4.2.4 The software
4.3 Results
4.3.1 LDA versus QDA
4.3.2 LDA and QDA versus the normalized datasets
4.3.3 The multivariate kernel density estimators – differences concerning dimensions
4.3.4 LDA and QDA versus the multivariate kernel density estimators
4.3.5 Results concerning the insurance data
4.4 Computational considerations
4.5 Discussion
Chapter 5: Summary and outlook
References
Appendix
About the used literature
Books
Papers
Notation and common abbreviations
Tables
Table of Diagrams
Figure 1: Non-parametric and parametric density estimates.
Figure 2: The construction of the kernel density estimator.
Figure 3: The original density and two expectations of kernel estimates for different values of h.
Figure 4: Plots of MISE, AMISE (bowl-shaped curves) and their additive components.
Figure 5: MISE, IV, ISB, AMISE, AIV and AISB for the underlying density.
Figure 6: Kernel estimate using different bandwidths. The underlying density (solid) and the concerning kernel density estimate (dotted).
Figure 7: The corresponding density types for Table 1.
Figure 8: Ratio of the AMISE-optimal bandwidth to window widths chosen by different reference distributions.
Figure 9: Contour-plot of the maximum posterior probabilities at each point (LDA).
Figure 10: Contour-plot of the maximum posterior probabilities at each point (kernel estimate).
Figure 11: Difference between the logarithm of a standard normal density and the logarithm of two kernel estimates (Sheather-Jones plug-in and the Rule of Thumb).
Figure 12: Normalization of two bimodal class densities. Density 1 (solid), density 2 (dashed) and their respective normalizations.
Figure 13: Univariate normalizations.
Figure 14: Non-parametric normalization of one variable of the insurance data.
Figure 15: The corresponding scree-plot.
Figure 16: Prototype distributions of the synthetic datasets.
Figure 17: Univariate histograms for the insurance dataset.
Figure 18: Brier-scores for the NN-distributions. Comparison between LDA and QDA.
Figure 19: Brier-score for the SkN-distributions. Comparison between LDA and QDA.
Figure 20: Brier-score for the Bi-distributions. Comparison between LDA and QDA.
Figure 21: Error-rate for the NN-distributions. Comparison within the LDA-method by using normalizations.
Figure 22: Error-rate for the SkN-distributions. Comparison within the LDA-method by using normalizations.
Figure 23: Error-rate for the Bi-distributions. Comparison within the LDA-method by using normalizations.
Figure 24: Error-rate for the NN-distributions. Comparison within the QDA-method by using normalizations.
Figure 25: Error-rate for the SkN-distributions. Comparison within the QDA-method by using normalizations.
Figure 26: Error-rate for the Bi-distributions. Comparison within the QDA-method by using normalizations.
Figure 27: Dependency of the performance of the multivariate kernel density estimator for two datasets.
Figure 28: Brier-score of the datasets having equal correlation matrices. Comparison between the LDA and the bayes-rule-kernel-methods constructed by the LSCV-selector.
Figure 29: Brier-score of the datasets having unequal correlation matrices. Comparison between the QDA and the bayes-rule-kernel-methods constructed by the LSCV-selector.
Figure 30: The Error-rates for the insurance data.
Figure 31: The problem of classification concerning non-equal class observation numbers.
Table of Tables
Table 1: Efficiency on how to estimate different densities.
Table 2: Efficiencies of product kernels relative to radially symmetric kernels when using the multivariate generalization of a beta-kernel.
Table 3: Probability of data in regions which have density values higher than one hundredth the value at the mode of a multivariate normal distribution.
Table 4: Principal component analysis with one of my synthetic datasets.
Table 5: Smallest sample size for each dimension, which satisfies (3.2).
Table 6: Prototype distributions of the synthetic datasets.
Table 7: Description of the used datasets.
Table 8: Average rank of the kernel estimators in dependency on the dimension of the subspace concerning the Error-rates. "1" is the best.
Table 9: Average place of the kernel estimators in dependency on the dimension of the subspace concerning the Brier-score. "1" is the best.
Table 10: Principal component analysis for "SkN11".
Table 11: Principal component analysis for "SkN32".
Table 12: Results of the principal component analysis. The percentage of the explained variance is shown for all datasets.
Table 13: Classification results for the LDA.
Table 14: Classification results for the QDA.
Table 15: Classification results for the estimator "Normal rule" in two and three dimensions.
Table 16: Classification results for the estimator "Normal rule" in four and five dimensions.
Table 17: Classification results for the estimator "LSCV" in two and three dimensions.
Table 18: Classification results for the estimator "LSCV" in four and five dimensions.
Table 19: Estimator "Normal rule - normalized". The normalization is done by the non-parametric kernel estimate.
Table 20: Estimator "Sheather-Jones - normalized". The normalization is done by the non-parametric kernel estimate.
Chapter 1: Introduction
What is actually the most important thing you learn in a statistics programme? During your studies you start with exploratory methods, gaining some knowledge about your dataset by producing basic frequency plots or calculating some useful coefficients (descriptive statistics). You learn about several distributions of random elements, about the fact that your dataset has only a finite number of observations (what you observe are only realisations of an underlying origin), and you always try to fit proper models to your data. Finally you end up with methods for several variables and with theoretical considerations about the goodness of certain statistical tests or estimators. The common element of all these topics is the uncertainty of your observations, which is modelled by the most elementary object in statistics, a random variable. So the most interesting question is actually: given a set of realisations $x_1, \ldots, x_n$ of a random variable $X$, how can the structure of $X$ be discovered appropriately? If you know this structure, you know with which probability a certain event occurs, and thus you should also be able to judge the assignment of a certain observation to one of several distinct populations, the classification task. These two tasks are fundamental in statistics, and therefore they will have my undivided attention throughout this thesis.
In classical statistical theory you learn that the normal distribution is a very commonly used model for such a set of realisations. On the one hand this is based on the central limit theorem, which shows that the (suitably normalized) sum of independent, identically distributed variables converges to a normal distribution, and many things in our life can approximately be treated in that way. On the other hand it has a compact formula with additional nice mathematical properties, which makes it useful as a model for unimodal symmetric distributions. Last but not least, multivariate generalizations can be made in a straightforward way.
However, the greatest drawback of this setting is its small flexibility, owing to the number of parameters. Having only one parameter for the location and one for the scale makes it a poor approximation for data far away from this normality assumption. Regarding the fact that all events which occur in real life are
dependent on each other (although perhaps only in a complex way), a model of independent, identically distributed variables will actually never be "the right one" and is therefore only approximately valid. In addition, there are of course many cases where even an approximation of the idealized view (a sum of independent, identically distributed observations) is not justified. That is the case for life-duration data, income data (both in general highly skewed) or multimodal distributions. Nevertheless, the model of the normal distribution and its consequences are used to this day, for example in standard regression theory and ANOVA, as well as in discriminant analysis (linear and quadratic), as the first choice in many software packages.
Since computing power has increased considerably in the last two decades, it has become feasible to fit more flexible models, and researchers have gained another view of this problem by using non-parametric density estimates. Besides, we have huge data sources available today, which allow us to detect significant deviations from the normality assumption even when they are small.
As a result of this new opportunity for density estimation, there has been an enormous increase in the literature written about this topic. Both discrete and continuous random variables have been considered. Also, the multivariate generalizations for random vectors (either purely continuous ones, purely discrete ones, or mixtures of both types) have been taken into account. Theoretical properties of the estimators have been derived, and simulation studies of several estimators have been produced.
To restrict the large number of methods for non-parametric density estimation, in this work I concentrate only on the kernel density estimator, which is probably the most discussed concept in the existing literature because of its simplicity and clearness. Also, its multivariate generalization is straightforward. Another restriction is that I only treat continuous random variables and purely continuous random vectors, because otherwise the estimated function is not smooth and one cannot speak of a density function.
The papers and books written about these topics mostly treat the discussion of selecting a good smoothing parameter for the estimated function in the univariate case, which is not a satisfactory answer for the discrimination problem, where in general more than one variable is used. There are some statements in certain books underlining the difficulties that occur when estimating in high dimensions, the so-called curse of dimensionality, as well as advice on how a set of several variables can be suitably transformed to a lower-dimensional space. However, reading this gives no idea of how different methods perform at a certain discrimination task. Actually, there is no connection between the problem of estimating the density in a proper way, using a multivariate density estimate for Bayes-rule kernel discriminant analysis, and reducing high dimensions to make a reasonable solution possible at all. Only one study from the early 1980s (Remme et al., 1980) analysed the application of density estimation and compared error rates in classification with the performance of Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA); it thus connected at least two of the subtasks mentioned above. But the smoothing parameter was selected by a method which is no longer considered adequate today. Nevertheless, this study gave the impetus and an idea of how to proceed in this work. This leads to the research assignment of the present thesis.
The thesis should comprise:
1. A discussion of the main ideas in kernel density estimation in the last years, with emphasis on the variety of methods, parametrizations, optimization criteria, etc. (chapter 2).
2. A discussion of how the number of variables can be reduced or the data can be transformed while still getting reasonable results (chapter 3).
3. A simulation study and a real-life application in which the main results of 1. and 2. are applied to discriminant analysis in a proper way, so that the resulting method may even dominate LDA and QDA (chapter 4).
Chapter 2: Kernel density estimation during the last 25
years
2.1 Introduction
Figure 1: Non-parametric and parametric density estimates. A histogram is
shown in a) and in b) a kernel density estimate (dashed) is produced. In both
graphs the normal curve is plotted, too (solid).
The aim of chapter 2 is to give some insight into how kernel density estimation is performed, and into what makes this subject such a huge research area. Regarding the univariate case treated in section 2.2, Figure 1 gives a first impression. On the left is the output you can produce with almost every statistical program package, the well-known histogram, the first and probably simplest realization of a non-parametric estimate. Drawn as a line in the same graphic is its parametric counterpart, the normal-distribution curve, whose parameters have been estimated by the classical unbiased estimators. Since it is obvious that neither the histogram, as a very crude estimator, nor the normal approximation is a proper method in this case, a kernel density estimator
has been performed on the right side, which allows the estimated function to be
smooth as well as to figure out more detailed structure, the bimodality in this case.
As this concept is more flexible, it also leads to many new questions and difficulties, which are pointed out in section 2.2. In particular, the following questions are discussed.
What is a good choice for the kernel shape (section 2.2.5)?
What is a good choice for the bandwidth (section 2.2.7)?
In which sense can a certain bandwidth choice be optimal (section 2.2.3)?
How can the basic concept be improved (section 2.2.6)?
Concerning the multivariate case there are additional difficulties, since one has to deal not only with a vector of different bandwidths, one for every dimension, but also with different orientations of the multivariate kernels, which results in a whole bandwidth matrix. Besides, there are different methods to construct kernels in high dimensions. Questions like these are discussed in section 2.3.
Section 2.4 keeps an eye on the fact that optimality in density estimation is maybe
not the same as optimality in discrimination. Connections between the bayes rule
kernel setting and other common discrimination techniques are going to be
pointed out as well. Additionally, it provides a collection of performance results
of kernel discriminant analysis in the past.
2.2 The univariate case
2.2.1 From the histogram to the kernel density estimator
The most basic way to approach the problem is the one taken in almost every basic lecture in statistics. The classical formula of the histogram estimate is

$$\hat f_h(x) = \frac{1}{nh} \sum_{i=1}^{n} \sum_{j} I(x_i \in B_j)\, I(x \in B_j),$$

where

$$B_j = \big[\,x_0 + (j-1)h,\; x_0 + jh\,\big), \qquad j \in \mathbb{Z}.$$
This leads to the two parameters $x_0$ (origin) and $h$ (binwidth), which completely determine the shape of the estimate. As is known from basic statistics courses, different choices of $x_0$ can lead to quite different impressions of the underlying distribution. This results either in a different number of estimated modes, or in different impressions about skewness and kurtosis. Besides, the binwidth parameter $h$ plays an important role, which is almost the same as in the kernel density estimator setting: depending on its value, it produces either very noisy estimates or estimates so smooth that the structure does not become apparent.
A possibility to make at least one of these problems disappear is to make the estimate almost independent of the origin $x_0$. This can be achieved by computing an average shifted histogram (ASH; Scott, 1992). The idea is to use $M-1$ equidistantly placed points within the interval $[\,x_0, x_0 + h)$ to define shifted sub-histograms; the final estimate is then given by averaging over all $M$ sub-histograms, which amounts to the formula

$$\hat f_h(x) = \frac{1}{M n h} \sum_{l=0}^{M-1} \sum_{i=1}^{n} \sum_{j} I(x_i \in B_{j,l})\, I(x \in B_{j,l}),$$

where

$$B_{j,l} = \Big[\,x_0 + \Big(j - 1 + \frac{l}{M}\Big)h,\; x_0 + \Big(j + \frac{l}{M}\Big)h\,\Big), \qquad l \in \{0, 1, \ldots, M-1\},$$

refers to the $j$-th interval in the original definition of the histogram, "shifted" $l$ times.
In this setting it is clear that the position of $x_i$ within its bin is relevant for the estimate (as long as different $x_i$ that lie in the same origin bin $B_j$ end up in different bins $B_{j,l}$). While this estimate is evidently "smoother" than the histogram, it is nevertheless a step function, which does not give the impression of being the underlying density, but is only a better way of carrying out the basic concept. Even though this estimate is quite simple, and one can obtain good results almost independently of the exact choice of $M$, it is worthwhile to study the case $M \to \infty$. This seemingly more detailed view of the concept was first considered in the 1950s and is the basic form of the kernel density estimator.
2.2.2 The model of the kernel density estimator
The straightforward extension is now to place such a bin, or more generally a special kernel, at every point of the axis, and to count (weight) the data points in the neighbourhood of this point. The first author to consider this model was Rosenblatt (1956). The estimate is given by

$$\hat f_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right),$$

where $K$ denotes the kernel function and $h$ is the bandwidth parameter, which is the analogue of the binwidth in the histogram; therefore the same letter is usually used.
To make sure that this expression is a density function, the kernel function has to satisfy $\int K(u)\,du = 1$.
Other useful and common assumptions for the kernel are
$K(u) = K(-u)$ (symmetry),
$K(u)$ has its maximum at $u = 0$,
$K(u) \ge 0$ for all $u$ (which is, however, not satisfied by higher-order kernels).
In many cases an $N(0, \sigma^2)$ density is used as kernel function, but several other kernels with bounded support satisfying the assumptions above are also used (see section 2.2.5).
The connection between the histogram and the kernel concept arises from the fact that the density estimator using the triangular kernel $K(x) = (1 - |x|)\, I(|x| < 1)$ is the limit of the average shifted histogram as $M \to \infty$, where the binwidth $h$ of the histogram corresponds to the bandwidth $h$ of the kernel setting.
It should also be emphasized that all mathematical properties of the kernel, such as differentiability and smoothness, are inherited by the corresponding kernel density estimate, because the estimate is in fact a convolution of the kernel with the data. The construction of the kernel density estimator using a normal density as kernel (a so-called Gaussian kernel) and different bandwidths is shown in Figure 2.
Figure 2: The construction of the kernel density estimator. Same kernels and
different bandwidths are used in a) and b). The underlying density (dotted),
the estimated density (solid) and its contributions (dashed).
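As a small illustration of this construction, the basic estimator with a Gaussian kernel can be written in a few lines (an illustrative Python/NumPy sketch, not the software used in this thesis; the name kde_gauss is chosen here for convenience and reused in later sketches):

import numpy as np

def kde_gauss(x_grid, data, h):
    # f_hat(x) = 1/(n*h) * sum_i K((x - x_i)/h), with the Gaussian kernel K = phi
    data = np.asarray(data, dtype=float)
    u = (x_grid[:, None] - data[None, :]) / h          # scaled distances, shape (grid, n)
    k = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)   # kernel evaluations
    return k.sum(axis=1) / (len(data) * h)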
2.2.3 Optimization criteria
Looking at the picture above, maybe the first question to ask is whether the right or the left estimate is more appropriate. This leads to the question of which kernel and which bandwidth are best to select, and in particular the second question leads to a very controversial discussion.
As already mentioned in the introduction to this chapter, there is no general way to estimate a density "optimally". Every optimality is always with respect to a certain optimization criterion. This section gives the definitions of different optimization criteria and also points out their properties.
2.2.3.1 Criteria based on the $L_1$-distance
A very common idea for measuring the difference between two functions $f$ and $g$ is to consider the $L_p$-distance, defined by

$$L_p(f, g) = \left( \int |f - g|^p \right)^{1/p},$$

where $p$ is a parameter to choose. For the sake of simplicity one would probably choose $p = 1$, the $L_1$-distance. Though the formula looks quite easy for $p = 1$, it involves some mathematical problems related to the absolute error function, which is not that easy to handle.
Nevertheless, this criterion has one great advantage, namely that it is invariant under bijective transformations of both densities. That means, if $f$ is the density of a random variable $X$ and $g$ is the density of $Y$, then

$$L_1(f, g) = \int |f - g| = \int |f^* - g^*|,$$

where $f^*$ is the density of $T(X)$, $g^*$ is the density of $T(Y)$ and $T$ is a bijective function (Devroye and Györfi, 1985). The value above is also called the integrated absolute error (IAE).
Another property of this criterion can be associated with discriminant analysis. Suppose $f$ and $g$ $(\equiv \hat f_h)$ are the densities of two populations, and a new data point is classified in the following way: assignment to $f$ if $x \in A$ and to $g$ otherwise. Then minimizing the $L_1$-distance is exactly the same as maximizing the classical error rate in discriminant analysis, that is, maximizing the confusion between $f$ and $g$ (Scott, 1992).
The book by Devroye and Györfi (1985) gives a detailed discussion of the $L_1$-criterion. Because comparing different density estimates by the random variable IAE is inconvenient, the expectation of the IAE (the expected integrated absolute error $E(\mathrm{IAE}) = \mathrm{EIAE} = \mathrm{MIAE}$) is considered instead. The ratio of these two values is relatively stable. Roughly speaking, the statement

$$\frac{\mathrm{IAE}}{\mathrm{EIAE}} \in \left[\tfrac{1}{4},\, 3\right]$$
is valid in any case (see Devroye and Györfi, 1985, for more exact mathematical
expressions).
2.2.3.2 Criteria based on the $L_2$-distance
The expression for the $L_p$-distance in the case $p = 2$ contains a root, and therefore the squared criterion

$$\mathrm{ISE}(\hat f_h) = \int \{\hat f_h(x) - f(x)\}^2\, dx,$$

which still keeps the right order when comparing the values of different estimates with each other, is usually considered. ISE stands for integrated squared error. For the same reason as above, the expectation

$$\mathrm{MISE}(\hat f_h) = E \int \{\hat f_h(x) - f(x)\}^2\, dx$$

is used. MISE stands for mean integrated squared error. A comparison between the use of ISE and MISE as error criterion is given by Jones (1991). The main statement is that deriving the optimal bandwidth according to MISE is better than optimizing ISE. The author says that "this is based on the argument that estimating f well from X [the sample, author] alone is often an unrealistic ambition", which may sound a little confusing to the less experienced reader.
As with the ordinary mean squared error

$$\mathrm{MSE}(\hat f_h(x)) = E\{\hat f_h(x) - f(x)\}^2,$$

which measures the error at a certain point $x$, a decomposition into a bias term and a variance term can also be achieved relatively easily for the MISE expression.
Most of the literature on kernel density estimation is about minimizing the MISE criterion and the AMISE criterion (see section 2.2.7), respectively. Nevertheless, there are other criteria which lead to density estimates that seem more appropriate to a "human feeling" of what the real density looks like. Some such concepts, besides the limit of the $L_p$-distance, are briefly discussed in the following section.
2.2.3.3 $L_\infty$ and alternative criteria
Of course, any $p$ can be chosen in the $L_p$-distance, but for values of $p$ other than one or two there are no longer useful properties, as discussed in the last two sections. The only interesting case is $p \to \infty$. This amounts to minimizing the maximal absolute error between $f$ and $g$, or again its expectation. The reader now has an idea of how varying $p$ changes the aims of the estimate. With $p = 1$, the goal is to minimize the area between the curves, regardless of the fact that the same area can arise in several different ways, whereas with $p \to \infty$ a uniformly good performance is the only goal, and it does not matter whether the difference between the curves appears only at a single point or everywhere.
All the measures described so far are strict mathematical criteria, measuring in different ways the similarity of two functions, but this is not necessarily the kind of optimization that points out the structure of the underlying density in a proper way. For example, the MISE criterion, and even more so the $L_\infty$-criterion, ignore the fit of the tails of the distribution, which is especially important in high dimensions (see Figure 11 and section 2.3.4).
In particular, for an exploratory view, one could be interested in the location and
the number of modes in the distribution. Park and Turlach (1992) calculated the
average distance of absolute deviations from the estimated modes to the true
modes of the distribution, as well as the number of modes in the estimated density
compared with the true one.
The problem with such measures is that they can only be evaluated in a simulation study, because the exact positions and the number of the modes have to be known. Unfortunately, optimization with respect to this kind of error criterion becomes more inaccurate as the number of dimensions increases in a multivariate setting.
It is apparent that anyone can create such measures, and as long as it is not defined when a density is estimated well, several criteria should be considered. Marron and Tsybakov (1995) wrote an interesting paper in which they underline the importance of also measuring a kind of horizontal distance between two densities. Since the commonly used criteria take only vertical distances into consideration, the authors emphasize that "the eye uses both horizontal and vertical information". In any case, much research is needed to give better answers than those.
2.2.4 Calculations of the error criteria
This section treats only the calculation of error criteria based on the $L_1$- and the $L_2$-distance, respectively. As the reader might have assumed when reading the last section, there exist almost no parameter selection rules based on criteria other than the $L_p$-distances. Here, other techniques would probably have to be used, and these are either subject to future research or beyond the scope of this work.
2.2.4.1 MISE- and AMISE- calculations
A problem with the MISE expression is that it depends on the bandwidth $h$ in a complicated way; a way of overcoming this problem is to use large-sample approximations of the bias and variance terms which occur in the decomposition suggested above. For this, some assumptions have to be made, which are discussed in Wand and Jones (1995). The most important ones are

$$\lim_{n \to \infty} h = 0 \qquad\text{and}\qquad \lim_{n \to \infty} nh = \infty, \tag{2.1}$$

which means that $h$ approaches zero, but at a rate slower than $n^{-1}$. For the bias calculation at a certain point $x$, a change of variables and a Taylor expansion lead to the asymptotic unbiasedness (as $h \to 0$) of the estimator,

$$E(\hat f_h(x)) - f(x) = \tfrac{1}{2}\, h^2\, \mu_2(K)\, f''(x) + o(h^2), \tag{2.2}$$

where $\mu_2(K) = \int z^2 K(z)\, dz$ denotes a functional (the variance) of the kernel (Wand and Jones, 1995). This expression makes clear that we can reduce the bias by reducing the bandwidth $h$, and that for a fixed $h$ the bias at a certain point $x$ is proportional to the second derivative $f''(x)$ of the unknown density $f$, as is shown in Figure 3, motivated by Härdle (1991, p. 57).
Since typical values of $h$ produced by different estimators lie in the interval $[0.5, 1]$ for $n$ of about 100 to 1000, the graph gives insight into how the bias changes over the range of $X$.
For the variance of the estimate, the following expression can be derived:

$$\mathrm{Var}(\hat f_h(x)) = \frac{1}{nh}\, R(K)\, f(x) + o\!\left(\frac{1}{nh}\right), \tag{2.3}$$

where $R(K) = \int K(x)^2\, dx$.
Figure 3: The original density f(x) and the expectations of kernel estimates for two different values of h (h = 0.5 and h = 1).
A quick look at (2.3) makes the following evident:
Because of the assumptions, the variance tends to zero as $n$ increases.
One can achieve a small variance with large values of $n$ (reasonable) and with large values of $h$.
Summing up (2.3) and the square of (2.2) and integrating leads to

$$\mathrm{MISE}(\hat f_h) = \mathrm{AMISE}(\hat f_h) + o\!\left(\frac{1}{nh} + h^4\right),$$

$$\mathrm{AMISE}(\hat f_h) = \frac{1}{nh}\, R(K) + \frac{1}{4}\, h^4\, \mu_2(K)^2\, R(f''), \tag{2.4}$$

and setting the first derivative of (2.4) with respect to $h$ equal to zero yields the optimal bandwidth

$$h_{\mathrm{AMISE}} = \left[ \frac{R(K)}{\mu_2(K)^2\, R(f'')\, n} \right]^{1/5}. \tag{2.5}$$
This formula for the AMISE shows very clearly the conflict between reducing the variance and the bias simultaneously, since the first term represents the integrated variance and the second one the integrated squared bias. On the one hand, $h$ should be chosen small in order to achieve a small bias. In this case the variance is large: the density estimate fluctuates heavily, depending on the exact positions of the data points. The arithmetic mean of these estimates (if the experiment were repeated several times) is close to $f(x)$, but this is of little use in the individual case. On the other hand, a large value of $h$ is necessary to keep the variance at a low level. The resulting estimate is then a function close to the kernel itself, in general containing a huge bias. A solution to this problem has to respect both sides of the trade-off.
The problem that an AMISE-optimal bandwidth can only be calculated if the underlying density is known is circular and therefore paradoxical. Ways to escape this circle in order to obtain concrete values for $h$ are given in section 2.2.7.
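To illustrate (2.4) and (2.5) numerically, the following sketch (illustrative Python only; it assumes a Gaussian kernel and a standard normal underlying density, for which $R(K) = 1/(2\sqrt{\pi})$, $\mu_2(K) = 1$ and $R(f'') = 3/(8\sqrt{\pi})$ are the known values) evaluates the AMISE over a grid of bandwidths and compares the numerical minimizer with the closed form:

import numpy as np

n   = 200
RK  = 1 / (2 * np.sqrt(np.pi))     # R(K) for the Gaussian kernel
mu2 = 1.0                          # mu_2(K) for the Gaussian kernel
Rf2 = 3 / (8 * np.sqrt(np.pi))     # R(f'') for the standard normal density

def amise(h):
    # integrated variance + integrated squared bias, formula (2.4)
    return RK / (n * h) + 0.25 * h ** 4 * mu2 ** 2 * Rf2

h_grid = np.linspace(0.05, 1.5, 1000)
h_numeric = h_grid[np.argmin(amise(h_grid))]
h_formula = (RK / (mu2 ** 2 * Rf2 * n)) ** 0.2   # closed form (2.5), about 1.06 * n**(-1/5)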
In some cases it is even possible to give an exact expression for the MISE. This is possible if the underlying density $f$ is normal, or at least a mixture of several normal densities, and the Gaussian kernel is used (Wand and Jones, 1995). Since the class of normal mixture distributions allows a large variety of possible densities, this is a rather useful result. Whether such a restriction on the kernel is crucial or not is discussed in section 2.2.5. Some really interesting results about how well the AMISE approximates the MISE were obtained by Marron and Wand (1992).
For several uni- and multimodal normal-mixture densities the authors considered the behaviour of MISE, AMISE and their bias and variance components, as well as the resulting optimal bandwidths $h_{\mathrm{MISE}}$ and $h_{\mathrm{AMISE}}$, respectively. Figure 4 and Figure 5 give an impression of what can happen with the MISE approximation.
Figure 4: Plots of MISE, AMISE (bowl-shaped curves) and their additive
components, the integrated variance (IV), the integrated squared bias (ISB)
and their asymptotic counterparts for n=100 realisations of the underlying
density (small picture).
In Figure 4, $f$ was chosen as $N(0,1)$ and a Gaussian kernel was used. Note that the variance approximation is quite good and uniform, whereas the bias approximation is only good for small $h$ and poor otherwise. Even though the bias approximation is bad, the difference in the resulting bandwidths is negligible.
An even worse approximation is obtained in Figure 5, where the underlying density $f$ is bimodal with several spikes in it, a so-called claw density. This is caused by a very large value of $\int (f'')^2$, to which the asymptotic integrated squared bias is proportional. In both graphs $n = 100$ was chosen. Marron and Wand (1992) also discovered examples where $h_{\mathrm{MISE}}$ is not uniquely defined because the MISE has at least two local minima.
Figure 5: MISE, IV, ISB, AMISE, AIV and AISB for the underlying density
(small picture, n=100).
Although these examples are probably only of academic nature, they give insight into how poorly a bandwidth chosen by minimizing the AMISE can perform in application.
2.2.4.2 MIAE- calculations
While it was stressed in the last section that the calculation of the MISE has to be done numerically and no explicit dependency on $h$ can be given, the situation for calculations concerning the $L_1$-criterion is even worse. A decomposition

$$\mathrm{EIAE} = J(n, h) + o\!\left(h^2 + (nh)^{-1/2}\right),$$

$$J(n, h) = \alpha \int \sqrt{\frac{f}{nh}}\;\psi\!\left(\frac{\beta}{\alpha}\,\sqrt{nh^5}\,\frac{f''}{\sqrt{f}}\right), \tag{2.6}$$

where the values $\alpha$ and $\beta$ are kernel functionals and

$$\psi(u) = \sqrt{\frac{2}{\pi}}\left( u \int_0^u e^{-x^2/2}\,dx + e^{-u^2/2} \right),$$
can be derived as well (Devroye and Györfi, 1985), but it is difficult to see how
the kernel functionals and the bandwidth can be optimized separately.
Nevertheless, the upper bound
$$J(n, h) \le \beta\, h^2 \int |f''| \;+\; \sqrt{\frac{2}{\pi}}\,\alpha \int \sqrt{\frac{f}{nh}}$$

allows again a separation into a bias (first) and a variance term (second). If one optimizes this expression with respect to $h$ and chooses, for example, one popular kernel, the Epanechnikov kernel ($K(x) = \tfrac{3}{4}(1 - x^2)\, I(|x| < 1)$), the optimal bandwidth can be calculated as

$$h_{\mathrm{opt}} = \left[\frac{15}{2\pi}\right]^{1/5} \left[\frac{\int \sqrt{f}}{\int |f''|}\right]^{2/5} n^{-1/5}. \tag{2.7}$$
As in the formula for $h_{\mathrm{AMISE}}$, there is again a dependency $h \sim n^{-1/5}$, and the order terms of the EIAE are the square roots of the order terms of the MISE. That is natural, because the MISE criterion is squared. Again, the unknown density is needed to calculate $h_{\mathrm{opt}}$.
2.2.5 Optimality properties of the kernel
After defining the optimization criteria and getting a first impression of the bandwidth choice, it makes sense to take a closer look at the importance of the kernel choice.
There are many kernels which satisfy the restrictions from section 2.2.2. The formulas and graphs of the most commonly used kernels can be found, e.g., in Härdle (1991). As pointed out in the derivation of the AMISE formula, a functional of the kernel is included in both of the additive terms and is coupled with the bandwidth $h$. Nevertheless, it is possible to separate them from each other by rescaling, which leads to so-called canonical kernels; these have the property that, for a given bandwidth, they all produce roughly the same amount of smoothing. Transforming the functional $R(K)$ in (2.4) to obtain kernel functionals which are independent of $h$ makes a comparison between different kernels, and the differences in how efficiently they use the data, transparent. The kernel which minimizes this criterion is the Epanechnikov kernel (Silverman, 1986), whose definition has already been given above. However, the efficiency in achieving a certain accuracy (in terms of the number of observations) does not decline heavily for most of the other common kernels, and it therefore makes almost no difference which kernel is chosen. Even for the triangular and the uniform kernel, fewer than 10% additional data points are necessary to obtain the same AMISE. The bandwidth transformations needed to achieve the same amount of smoothing with different kernels are listed in Härdle (1991, p. 76).
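These efficiency statements can be checked directly from the kernel constant $C(K) = \{\mu_2(K)^2 R(K)^4\}^{1/5}$ that appears in the AMISE minimum (see section 2.2.6.2). The following sketch (illustrative Python; the values of $R(K)$ and $\mu_2(K)$ are the standard ones for the respective kernels) computes the factor by which the sample size has to grow to match the Epanechnikov kernel:

import numpy as np

kernels = {                                     # (R(K), mu2(K)) for some common kernels
    "epanechnikov": (3 / 5, 1 / 5),
    "gaussian":     (1 / (2 * np.sqrt(np.pi)), 1.0),
    "triangular":   (2 / 3, 1 / 6),
    "uniform":      (1 / 2, 1 / 3),
}
C = {name: (mu2 ** 2 * R ** 4) ** 0.2 for name, (R, mu2) in kernels.items()}
# the AMISE minimum is proportional to C(K), so the relative sample size needed
# to reach the same AMISE is (C(K) / C(K_Epanechnikov))^(5/4)
extra = {name: (c / C["epanechnikov"]) ** 1.25 for name, c in C.items()}
# roughly: gaussian 1.05, triangular 1.01, uniform 1.08 -- all below the 10% mentioned above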
The kernel functionals $\alpha$ and $\beta$ in section 2.2.4.2 are constructed like their counterparts in the AMISE, only their involvement in the formula is more complicated. Therefore one may conjecture that the Epanechnikov kernel is the optimal kernel in $L_1$-theory as well; however, I have found no proof of this so far. Unlike the results of the discussion about optimal bandwidths, this result can be applied without any restriction (given that one accepts minimizing the asymptotic versions of the error criteria and treats such a solution as the best one for revealing the structure of the underlying density).
Until now only non-negative kernels satisfying the assumptions

$$\int z\, K(z)\,dz = 0 \qquad\text{and}\qquad \mu_2(K) = \int z^2 K(z)\,dz \ne 0$$

have been taken into account. On a closer view of (2.4), it is evident that the bias can be reduced by defining kernels having $\mu_2(K) = 0$ and a higher moment different from zero. If this technique (additionally setting $\mu_i(K) = 0$ for $i = 3, 4, 5, \ldots$) is applied again and again, even higher terms of the Taylor approximation (2.2) can be eliminated. Such a kernel, satisfying the assumptions

$$\mu_j(K) = \int x^j K(x)\,dx \;=\; \begin{cases} 1 & \text{for } j = 0, \\ 0 & \text{for } j = 1, 2, \ldots, p-1, \\ \ne 0 & \text{for } j = p, \end{cases}$$

is called a kernel of order $p$. Another effect of higher-order kernels ($p > 2$) is the improvement of the convergence rate of the minimal AMISE from $n^{-4/5}$ to $n^{-2p/(2p+1)}$. This means that one only has to wait sufficiently long (in terms of increasing $n$) to obtain a smaller value of the minimized AMISE than with a kernel of lower order. However, in application one is interested in the question of how large $n$ has to be, and in what happens before the asymptotics take effect. Marron and Wand (1992) also found even worse asymptotic bias approximations than in Figure 5, which makes the faster convergence to 0 rather pointless.
The problem of higher-order kernels ($p > 2$) concerning plausibility is that these moment restrictions are only possible for kernels which are not everywhere non-negative. Since all kernel properties are inherited by the resulting density estimate, the estimate therefore in general no longer has an interpretation as a density function. Generally speaking, this concept seems to be only a theoretical construct, which improves the estimate by only a small amount but pays the price of a loss of interpretability as well as more difficult mathematical tractability. For these reasons I am not going to use it in my application.
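For concreteness, a frequently quoted example of such a construction, built from the Gaussian kernel $\varphi$ (see e.g. Wand and Jones, 1995), is the fourth-order kernel

$$K_{[4]}(u) = \tfrac{1}{2}\,(3 - u^2)\,\varphi(u),$$

which satisfies $\mu_0(K_{[4]}) = 1$, $\mu_2(K_{[4]}) = 0$ and $\mu_4(K_{[4]}) = -3 \ne 0$, but is negative for $|u| > \sqrt{3}$ -- exactly the loss of the density interpretation described above.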
2.2.6 Further developments of the model to improve the estimation
So far, at least the problem of a proper kernel choice for the basic kernel density estimator seems to be solved somewhat satisfactorily. But improvements can easily be thought of, given the fact that only one smoothing parameter is used in all regions of the distribution.
2.2.6.1 Variable bandwidths
One might expect that (for example in the case of skewed distributions) different bandwidths within one estimate lead to more flexible approximations. This is illustrated in Figure 6, where the problem becomes fairly transparent: neither the right nor the left picture seems to fit the curve suitably, because one has to decide whether the bump or the tail should be estimated properly. The bandwidth has to be adapted in different areas of the curve. To get an idea of whether the density in a certain range is high or low, one can take the distance from $x_j$ to its $k$-th nearest neighbour.
This is done in the definition of the variable kernel density estimator (Breiman et al., 1977),

$$\hat f_h(x) = \frac{1}{n} \sum_{j=1}^{n} \frac{1}{h\, d_{j,k}}\, K\!\left(\frac{x - x_j}{h\, d_{j,k}}\right), \tag{2.8}$$
where $d_{j,k}$ denotes exactly this distance. This seems to be quite a good concept, and e.g. the simulation study of Remme et al. (1980) concerning kernel discriminant analysis shows a clear dominance of the variable kernel density estimator over the standard model, for example for skewed distributions. The authors gave ranges for the choice of the new parameter $k$, but they also pointed out that the choice of $k$ is not that important.
Figure 6: Kernel estimate using different bandwidths. The underlying
density (solid) and the concerning kernel density estimate (dotted).
For example, $k$ should be chosen smaller for skewed distributions than for symmetric ones. Nevertheless, no fully automatic procedure exists, and one can never be sure of having the best or at least a good estimate (see Breiman et al., 1977).
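A direct translation of (2.8) into code might look as follows (an illustrative Python sketch with a Gaussian kernel; the function name and the default k are chosen here and are not taken from Breiman et al.):

import numpy as np

def variable_kde(x_grid, data, h, k=5):
    # Variable kernel estimator (2.8): every observation x_j gets its own bandwidth
    # h * d_{j,k}, where d_{j,k} is the distance to its k-th nearest neighbour.
    data = np.asarray(data, dtype=float)
    dist = np.abs(data[:, None] - data[None, :])     # pairwise distances
    d_jk = np.sort(dist, axis=1)[:, k]               # column 0 is the distance to the point itself
    local_h = h * d_jk                               # per-observation bandwidths
    u = (x_grid[:, None] - data[None, :]) / local_h[None, :]
    kern = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    return (kern / local_h[None, :]).sum(axis=1) / len(data)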
If we take another look at (2.7), it can be seen that the smoothing factor $h$ should be chosen large if $1/f$ and $f''$ are small. This means that in areas of high curvature of $f$ (large $f''$) or of smaller values of $f$, smaller bandwidths are required (Devroye and Györfi, 1985). The authors argued that $d_{j,k}$ does not take these facts into consideration, which makes this setting asymptotically suboptimal. Even worse, if the curvature is fixed, higher values of $f$ require larger bandwidths, and not smaller ones as was suggested above (Devroye and Györfi, 1985). However, measures based on functionals (such as $\int \sqrt{f}$) actually say nothing about the local behaviour of $f$. In any case, those results seem rather controversial.
This result also makes one interesting fact apparent. While the optimizations of $h$ with respect to the $L_1$- and the $L_2$-distance have many common properties and lead to similar estimates, there is a striking difference: a functional of the unknown density $f$ itself appears in the numerator of the AMIAE-optimal bandwidth (2.7), which has no counterpart in the AMISE-optimal bandwidth formula (2.5). Maybe this should also be taken into account when talking about optimization.
Another approach allowing non-equal bandwidths is to adapt the bandwidth by defining a function $h(x)$, in order to vary the bandwidth and the resulting kernel at every point of the axis. The formula for the so-called local kernel density estimator is the same as above, only with $h\, d_{j,k}$ replaced by $h(x)$. The additional parameter required is the function $h(\cdot)$ itself; therefore a pilot estimate has to be built up. One method is again to look at the nearest neighbours, in this case neighbours of $x$, which again results in $h(x) \sim 1/f(x)$, but this also has some unsatisfactory properties (Wand and Jones, 1995). Finally, it should not go unmentioned that the resulting local kernel density estimate is in general no longer a density, since it does not integrate to one.
2.2.6.2 Transformed kernel density estimator
Returning to Figure 6 from the last section, the problem of an unsatisfactory fit can also be solved in another way. If one wants to hold on to the basic model with one bandwidth, there is at least one additional technique.
The poor behaviour of several modern bandwidth selectors for this type of distribution is shown in example 6 of Sheather (1992, p. 246), a highly right-skewed distribution: no bandwidth, whichever one chooses for the estimation, will be appropriate.
As an introduction to the following concept, one should again look at a kind of MISE formula, in this case at the minimum of the AMISE, achieved at $h_{\mathrm{AMISE}}$:
$$\inf_{h>0} \mathrm{MISE}(\hat f_h) \approx \inf_{h>0} \mathrm{AMISE}(\hat f_h) = \tfrac{5}{4}\, C(K)\, R(f'')^{1/5}\, n^{-4/5},$$

where $C(K) = \{\mu_2(K)^2 R(K)^4\}^{1/5}$ denotes exactly the functional of the canonical kernel used for the scale-invariant comparison in section 2.2.5 (Wand and Jones, 1995).
Again there is a functional which can be separated from the others, in this case the functional $R(f'')$. This fact allows us to find (for $n$ and $K$ fixed) densities which are "better" to estimate than others, i.e. which produce a lower AMISE minimum.
However, this provides no conclusion about which shapes of densities are preferred, because every density can be made sufficiently smooth by rescaling it. The way to proceed here is to scale this functional by a function depending on $\sigma(f)$, obtaining a scale-invariant functional of $f$, $D(f)$. The values of $D(f)$ for different densities, in proportion to the best one ($D(f^*)$, where $f^*$ is the beta(4,4) density standardized to the interval $[-1, 1]$), are given in Table 1, and the corresponding densities are shown in Figure 7 (Wand and Jones, 1995). The numbers are interpreted as the ratio of sample sizes needed to achieve the same AMISE.
Table 1: Efficiency on how to estimate different densities.

Density                                         D(f*)/D(f)
a) Beta(4,4)                                    1
b) Normal                                       0.908
c) Extreme value                                0.688
d) (3/4) N(0,1) + (1/4) N(3/2, (1/3)^2)         0.568
e) (1/2) N(-1, 4/9) + (1/2) N(1, 4/9)           0.536
f) Gamma(3)                                     0.327
g) (2/3) N(0,1) + (1/3) N(0, 1/100)             0.114
h) Lognormal                                    0.053
Figure 7: The corresponding density types for Table 1.
To get back to the topic of this section, the data can be used much more efficiently by transforming, for example, a skewed, kurtotic or multimodal density into an unskewed, unkurtotic or unimodal one, respectively. After estimating the density on the transformed scale and transforming back, a much better result (in the sense of an AMISE reduction) can be observed.
Ideas on how to proceed in application are given by Devroye and Györfi (1985) and Wand and Jones (1995). A reasonable way in any case is to obtain some characteristics of the distribution, such as variance, skewness, kurtosis, or an idea about the shape (a pilot estimate). However, for a fully automatic transformation procedure one cannot "look" at the distribution. Either a function mapping some sample moments to the coefficients of a parametric transformation family (e.g. the shifted power family described in Wand and Jones (1995)) is needed, or a non-parametric approach, as described in Ruppert and Cline (1994), has to be carried out. The first one immediately provides the inverse function for a back-transformation of the estimated density, but necessarily needs the coefficients, as just described. The second method transforms the data easily without any knowledge of parameters, but requires many more calculations.
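As a minimal sketch of the parametric route for positive, right-skewed data -- assuming a simple logarithmic transformation instead of the shifted power family, and reusing the kde_gauss function sketched in section 2.2.2 -- the back-transformation follows from the change-of-variables formula $f_X(x) = f_Y(\log x)/x$:

import numpy as np

def log_transformed_kde(x_grid, data, h):
    # estimate the density of Y = log(X) and transform back; x_grid and data must be positive
    y = np.log(np.asarray(data, dtype=float))
    f_y = kde_gauss(np.log(x_grid), y, h)   # ordinary kernel estimate on the log scale
    return f_y / x_grid                     # Jacobian of the back-transformation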
2.2.6.3 Density estimation at boundaries
Another problem, which occurs for example with life-time distributions, is, besides the fact that they are often highly skewed, that a large part of the probability mass is often concentrated relatively close to zero, or more generally near a boundary of the support. Thus, the resulting kernel density estimate often has considerable mass at values outside the support of $f$.
Since setting the estimated density equal to zero outside the support and "blowing up" the positive part of the estimate so that it integrates to one is often quite a bad approach, Silverman (1986) proposed a better one. He showed that doubling the data by reflecting the data points at the boundary, estimating the density, cutting the estimate at the reflection point and doubling the resulting values leads to a better estimate. Nevertheless, this also has a main drawback, since there is always a zero slope at the reflection point (the boundary), which occurs nowhere in typical right-skewed life-time distributions like the gamma or Weibull distributions. A more sophisticated approach is described in Wand and Jones (1995), where so-called boundary kernels are used.
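The reflection idea can be sketched in a few lines (illustrative Python, again reusing kde_gauss from section 2.2.2; the boundary defaults to zero as in the life-time example):

import numpy as np

def reflected_kde(x_grid, data, h, boundary=0.0):
    # Reflection method (Silverman, 1986): mirror the data at the boundary, estimate on the
    # doubled sample, double the result on the support and set it to zero beyond the boundary.
    data = np.asarray(data, dtype=float)
    augmented = np.concatenate([data, 2.0 * boundary - data])
    f_aug = kde_gauss(x_grid, augmented, h)
    return np.where(x_grid >= boundary, 2.0 * f_aug, 0.0)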
2.2.7 Bandwidth selection
The question of how to make an appropriate choice of the bandwidth (or "smoothing parameter") $h$ is probably the most extensive and controversial subject in the field of kernel density estimation. A whole host of ideas, approaches and methods have been studied and discarded again. The wide field can be structured into more or less "subjective methods", where at least one parameter has to be chosen by a human, and so-called "data-driven" or "automatic" bandwidth selectors, where everything depends only on the data and no experience is needed to get a reasonable fit. The data-driven selectors can be further separated, with respect to their historical development, into first- and second-generation methods (Jones et al., 1996).
In the context of my simulation study, and regarding the fact that the number of variables and thus the number of bandwidths to select is increasing in many applications nowadays, it is impossible to spend time on each univariate density estimate (or on the several parameters of a multivariate density estimate), which is what happens when the smoothing parameters are chosen interactively. Thus, the following discussion treats almost exclusively the data-driven case.
Nevertheless, the other methods are also quite useful, for example an interactive choice for exploratory purposes, which is the classical subjective method. Varying the smoothing parameter and watching the resulting shape until one has found a compromise between under- and oversmoothing is probably the best thing to do with respect to the "sensitivity of the human eye". This approach is probably better at detecting the number and location of modes (compare section 2.2.3.3 as well as Marron and Tsybakov (1995)), and there is no need to create complicated formulas to achieve the same estimate. Furthermore, this method provides a deeper insight into the data, because one gets different views of the possible structure of the unknown density.
However, for the sake of an objective choice, I now turn to the wide field of automatic bandwidth selection.
2.2.7.1 Referring to standard distributions
For choosing the bandwidth automatically, it is necessary to take another close look at the $L_1$- and $L_2$-minimizing bandwidths ((2.7) and (2.5), respectively). The first thing which becomes apparent is the already mentioned fact that the unknown functionals $R(f'') = \int f''(x)^2\,dx$, $\int |f''(x)|\,dx$ or $\int \sqrt{f(x)}\,dx$ are required. An approach which is quite simple, but often used, is to refer to a standard distribution and set $f$ equal to, for example, the normal distribution with variance $\sigma^2$. This is called the Rule of Thumb, and was suggested by Silverman (1986).
The resulting formula for $h_{opt}$ is
$$h_{opt} \approx 1.06\,\sigma\, n^{-1/5} \qquad (2.9)$$
and can be calculated by plugging in a point estimator $\hat\sigma^2$ for $\sigma^2$, for example the sample variance. As seen in section 2.2.6.2 (Table 1), this method oversmooths in any case of multimodality or skewness of the density, because the bandwidth is chosen close to the upper bound for $h_{AMISE}$, namely $h_{OVERSMOOTH}$ (the resulting bandwidth when setting $f$ equal to the transformed beta(4,4)-density).
Another possibility is the use of a more robust estimate for $\sigma$, the inter-quartile range $R$. Because $R$ is approximately 1.35 times as large as the standard deviation for normal densities, (2.9) has to be adapted to
$$h_{opt} \approx 0.79\, R\, n^{-1/5} \qquad (2.10)$$
A discussion about which one is better can be found in Silverman (1986) and Sheather (1992, sect. 2.3). Finally, maybe the best choice is to take the minimum of $\hat\sigma$ and $\hat R/1.35$ and correct the value by a constant in between those of (2.9) and (2.10), say 0.9 (Silverman, 1986).
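The three reference rules just described can be transcribed directly; the sketch below assumes a Gaussian kernel and uses the constants 1.06, 0.79 and 0.9 from (2.9), (2.10) and the combined rule.

```python
import numpy as np

def rule_of_thumb_bandwidths(x):
    """Normal-reference bandwidths for a univariate sample x."""
    n = len(x)
    sigma = x.std(ddof=1)                                # sample standard deviation
    iqr = np.subtract(*np.percentile(x, [75, 25]))       # inter-quartile range R
    h_sd = 1.06 * sigma * n ** (-1 / 5)                  # (2.9)
    h_iqr = 0.79 * iqr * n ** (-1 / 5)                   # (2.10)
    h_min = 0.90 * min(sigma, iqr / 1.35) * n ** (-1 / 5)  # combined rule
    return h_sd, h_iqr, h_min

rng = np.random.default_rng(2)
print(rule_of_thumb_bandwidths(rng.normal(size=200)))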
Another reference distribution is the transformed beta(4,4)-density (see Table 1), which always oversmooths, but one can inspect graphs obtained by dividing the resulting bandwidth by 2, 3, 4, etc. However, this is actually a special case of a subjective choice where an upper bound is given, and therefore no longer interesting here.
To show how poor those choices are when used for multimodal distributions, Figure 8, reproduced from Silverman (1986), should give an impression.
It shows the ratio of $h_{AMISE}$ (2.5) to the window widths given by the two rules of thumb and by $h_{OVERSMOOTH}$, respectively. The interested reader can find similar graphs for skewness and kurtosis in Silverman (1986). The third Rule of Thumb always equals the first in this case, and the corresponding bandwidth ratios are therefore not plotted.
The fact that $h_{OVERSMOOTH}$ does not always have the largest value (the smallest bandwidth ratio) is due to the different estimates of $\sigma$: $h_{OVERSMOOTH}$ is only the largest when both densities are rescaled either with $\hat\sigma$ or with $\hat R$.
Figure 8: Ratio of the AMISE-optimal bandwidth to window widths chosen by different reference distributions (Rule of Thumb based on the standard deviation, Rule of Thumb based on the interquartile range, and Beta(4,4)), plotted against the distance between the modes. The produced densities were mixtures of two standard normals, whose distance between their modes was varied.
The following methods are already more sophisticated, but still belong to the methods of the first generation.
2.2.7.2 Cross-validation methods
The statistical concept of cross-validation can certainly also be applied in density estimation. As already discussed in section 2.2.3.2, there are different opinions about minimizing with respect to ISE and MISE, respectively (see Jones, 1991). Both are used in the context of cross-validation.
Least-Squares Cross-Validation
This selector, undisputed in the 1980s, concentrates on the minimization of the ISE. To this end the ISE of $\hat f_h$ is separated into its components:
$$\mathrm{ISE}(\hat f_h) = \int \hat f_h(x)^2\,dx - 2\int \hat f_h(x) f(x)\,dx + \int f(x)^2\,dx .$$
Since $\int f(x)^2\,dx$ does not depend on $h$, it is sufficient to keep an eye on the first two terms. Bowman (1984) formulates the minimization problem by giving unbiased estimators for these terms and obtains
$$\mathrm{LSCV}(h) = \int \hat f_h(x)^2\,dx - \frac{2}{n}\sum_{i=1}^{n} \hat f_{h,-i}(x_i), \qquad (2.11)$$
where
$$\hat f_{h,-i}(x) = \frac{1}{(n-1)\,h}\sum_{j \ne i} K\!\left(\frac{x - x_j}{h}\right)$$
denotes the leave-one-out kernel density estimator, i.e. the estimator obtained when data point $i$ is left out. Again, using the Gaussian kernel simplifies the problem and makes the integral solvable analytically (Bowman, 1984).
This selector has one main problem. If the data are highly discretized (to a certain degree every continuous dataset is discretized), which means, roughly speaking, that there are too many equal values, the overall minimizer of $\mathrm{LSCV}(h)$ will be found at $h = 0$ (Silverman, 1986), which is an unreasonable choice. The threshold where this occurs depends on the kernel. Examples for this case are given by Sheather (1992, examples 2-4).
Another weakness, concerning convergence, will be pointed out in the section about asymptotic behaviour. Nevertheless, it is still a good and transparent concept, which can also be easily generalized to the multivariate case, and I therefore use it in chapter 4.
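A brute-force sketch of (2.11) for the Gaussian kernel; the integral term uses the fact that the convolution of two normal kernels is again normal, and a simple grid search replaces a proper optimizer.

```python
import numpy as np

def lscv_score(x, h):
    """Least-squares cross-validation criterion (2.11) for a Gaussian kernel."""
    n = len(x)
    d = x[:, None] - x[None, :]
    # integral of fhat^2: convolution of two normal kernels has sd h*sqrt(2)
    s = h * np.sqrt(2.0)
    int_f2 = np.exp(-0.5 * (d / s) ** 2).sum() / (n**2 * s * np.sqrt(2 * np.pi))
    # leave-one-out term: (1/n) * sum_i fhat_{h,-i}(x_i)
    k = np.exp(-0.5 * (d / h) ** 2) / (h * np.sqrt(2 * np.pi))
    loo = (k.sum(axis=1) - k.diagonal()) / (n - 1)
    return int_f2 - 2.0 * loo.mean()

rng = np.random.default_rng(3)
x = rng.normal(size=200)
hs = np.linspace(0.05, 1.5, 60)
h_lscv = hs[np.argmin([lscv_score(x, h) for h in hs])]
```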
Biased Cross-Validation (BCV)
The difference between LSCV and the biased cross-validation method is that here, minimization is based on the asymptotic MISE (2.4), in which the functional $R(f'')$ is estimated. As the name says, the resulting bandwidth is biased, but it has a smaller variance than LSCV. Sometimes the problem of more than one minimum also occurs. Ideas about which one to choose are discussed in Marron (1993) and Sheather (1992).
(Pseudo-)Likelihood Cross-Validation (LCV)
The LCV selector was maybe the first commonly used automatic bandwidth selector, because it is based on a basic statistical concept, maximum-likelihood optimization. The criterion to maximize is
$$\mathrm{LCV}(h) = \prod_{i=1}^{n} \hat f_{h,-i}(x_i), \qquad (2.12)$$
where $\hat f_{h,-i}$ is defined as above. The leave-one-out estimator is used because otherwise the criterion is maximized at $h = 0$, and the resulting estimate would be a sum of Dirac functions at the data points. One problem is that the window width has to be (in case of kernels with bounded support) at least as big as the distance from the furthest outlier to its closest point, because LCV is zero otherwise. Thus, the resulting bandwidth often tends to oversmooth the function. In this respect, Silverman (1986) pointed out at least a theoretical problem: since the just described distance (often called "gap") does not get smaller as $n$ increases for densities which vanish at an exponential rate or more slowly, $h$ does not converge to zero, which is obligatory for every reasonable estimator. Actually, this concept is nowadays sorted out from the spectrum of bandwidth selectors, but nevertheless the simulation study of Remme et al. (1980), which is the pattern for my work, obtained quite good results with it even 22 years ago, which makes my intention to achieve a dominance over LDA and QDA (applied in discriminant analysis) feasible.
2.2.7.3 Plug-in methods
While Jones (1996) called the selectors in sections 2.2.7.1 and 2.2.7.2 methods of the first generation, the following belong to the second one. A family of methods which follow straightforwardly from BCV are the so-called plug-in methods. They were investigated in the early 1990s and seem to represent the "state of the art" (Bowman and Azzalini, 1997; Jones et al., 1996; Wand and Jones, 1995; Sheather, 1992; Park and Turlach, 1992; Cao et al., 1994; etc.).
The common feature of all plug-in methods is that they include an estimate of the unknown density functional $R(f'')$ in (2.5). This estimate is itself a kernel estimate, but in general it does not use the same smoothing parameter and kernel as for estimating $f$, which distinguishes it from the biased cross-validation concept. One uses the fact that
$$R(f^{(s)}) = \int f^{(s)}(x)^2\,dx = (-1)^s \int f^{(2s)}(x)\, f(x)\,dx .$$
For the estimation of such density derivative functionals it is therefore sufficient to study functionals of the form
$$\psi_r = \int f^{(r)}(x)\, f(x)\,dx = E\{ f^{(r)}(X) \}$$
for $r$ even. This leads to the estimator
$$\hat\psi_r(g) = \frac{1}{n}\sum_{i=1}^{n} \hat f_g^{(r)}(x_i) = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} L_g^{(r)}(x_i - x_j), \qquad (2.13)$$
where $g$ and $L$ are, respectively, a bandwidth and a kernel that are in general different from $h$ and $K$.
To find a good choice of $g$ for estimating $\hat\psi_r(g)$, a reasonable option is $g_{AMSE}$, since this can be calculated in closed form as well (see Sheather and Jones, 1991; Park and Marron, 1990).
Unfortunately, the formula for $g_{AMSE}$ in turn contains a functional of the unknown density which is of one degree higher (e.g. the AMSE-optimal bandwidth for estimating $R(f'') = \psi_4$ needs $\psi_6$). This fact is again evidence that one cannot escape the problem of having to know $f$ in order to estimate $f$ best. The only thing one can do is to shift the problem and derive an AMSE-optimal bandwidth for estimating $\psi_6$, say $g_1$. This is what is done in the direct plug-in approach. However, at a certain stage a pilot estimator has to be used, which is mostly chosen with reference to a standard distribution (see section 2.2.7.1).
So the question arises of how many stages to choose. Wand and Jones (1995) found that while the bias of $h$ grows smaller as the number of stages increases, the variance increases. It is also evident that additional stages cause additional computation time, which is not negligible either. The graph in Wand and Jones (1995, p. 73) leads one to assume that more than two or three stages are not necessary. However, this graph refers only to the estimation of one particular density.
To turn back to the problem of choosing a good bandwidth $h$, the AMISE-optimal formula looks as follows:
$$h = \left[ \frac{R(K)}{n\, \mu_2(K)^2\, R\big(\hat f''_{g(h)}\big)} \right]^{1/5}. \qquad (2.14)$$
This provides insight into how the concept can be improved. Since $h$ appears on both the left and the right side of the equation, solving the equation for $h$ is exactly what is wanted. This is done by the so-called solve-the-equation rules (see Jones et al., 1996; Sheather and Jones, 1991).
These methods require additional computation time, because the connection between $h$ on the right and on the left side is complicated, even when only one stage (referring to the direct plug-in approach) is included. To continue the process from before, one then has to write e.g. $R\big(\hat f''_{g_2(g_1(h))}\big)$ instead of $R\big(\hat f''_{g(h)}\big)$ in the denominator of (2.14).
The dominance of the plug-in methods over the others, at least in the univariate case, will be pointed out in section 2.2.7.5. The basic concept seems to go back to Park and Marron (1990). The improvement by Sheather and Jones (1991) consisted only in calculating the double sum in (2.13) over all $n^2$ pairs, whereas the Park-Marron estimate left out the terms with $i$ and $j$ equal and divided by $n(n-1)$. This improvement by adding a non-stochastic term is only slight. However, it is an improvement and therefore justified.
More detailed mathematical expositions of this topic, as well as similar concepts such as smoothed cross-validation and the smoothed bootstrap, are skipped here because the scope of this work is limited. Nevertheless, the interested reader will find many more details by studying either the indicated sources or at least the appendix of my thesis.
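To make the core plug-in computation concrete, the sketch below estimates ψ₄ = R(f″) by the double sum (2.13) with a Gaussian kernel (whose fourth derivative is known in closed form) and plugs it into (2.14). The pilot bandwidth g used here is a crude normal-scale choice of my own, not a proper multi-stage rule.

```python
import numpy as np

SQRT2PI = np.sqrt(2 * np.pi)

def psi4_hat(x, g):
    """Double-sum estimator (2.13) of psi_4 = R(f'') with a Gaussian kernel L."""
    u = (x[:, None] - x[None, :]) / g
    # fourth derivative of the standard normal density
    phi4 = (u**4 - 6 * u**2 + 3) * np.exp(-0.5 * u**2) / SQRT2PI
    return phi4.sum() / (len(x) ** 2 * g**5)

def direct_plugin_h(x):
    n, sigma = len(x), x.std(ddof=1)
    g = sigma * n ** (-1 / 7)           # crude pilot bandwidth (assumption, see text)
    psi4 = psi4_hat(x, g)
    # (2.14) with the Gaussian kernel K: R(K) = 1/(2*sqrt(pi)), mu_2(K) = 1
    return (1.0 / (2 * np.sqrt(np.pi)) / (n * psi4)) ** (1 / 5)

rng = np.random.default_rng(4)
print(direct_plugin_h(rng.normal(size=300)))   # close to 1.06 * n**(-1/5) for normal data
```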
2.2.7.4 Alternative methods
After discussing the modern plug-in rules in the last section, I briefly want to describe some other estimators which were used in the comparative simulation study of Cao et al. (1994).
As already discussed, there are also criteria other than the $L_p$-distances available for measuring the goodness of fit. One method which uses such an alternative is the IP method. It has its name from its dependence on the number of inflection points. To carry out this method in its initial version, one has to have an idea about the number of inflection points (denoted by $q$) in the density. The estimate $h_{IP}$ is simply defined as
$$h_{IP} = \inf\{\, h : \hat f_h \text{ has at most } q \text{ inflection points} \,\}.$$
Of course, the user does not always have an idea about $q$, and therefore Cao et al. used a second estimator, $h_{EP}$, which gets a value $\hat q$ from a pilot estimation instead of $q$. Unfortunately, this concept is actually unsuitable for generalization to the multivariate case. The univariate results were quite good.
The last method explained in this section is an approach especially adapted to the $L_1$-distance. The bandwidth $h_{DK}$ ("double kernel") minimizes the $L_1$-distance between two kernel density estimates which use the same bandwidth but different kernels. The criterion
$$\mathrm{IAE} = \int \left| \hat f_h(x) - \hat g_h(x) \right| dx$$
uses the kernels $K$ and $L$, respectively, which have to be of different order. To fulfil certain consistency restrictions, the choice of the two kernels has to be in tune with each other.
Since the reader has probably gained an impression of the extent of this research area, the next section should put some methods into context by drawing some comparisons.
2.2.7.5 Asymptotic analysis and results from application
Since Marron and Wand (1992) talk about three tools for understanding the behaviour of non-parametric curve estimators (asymptotic analysis, simulation and numerical calculation of error criteria), I am going to compare some estimators by means of the first and second one. I will skip the third because of its limitation to easy settings.
An often used criterion (Jones et al., 1996) is studying the behaviour of the random variable
$$\frac{\hat h - h_{MISE}}{h_{MISE}}, \qquad (2.15)$$
which tends to zero as $n$ increases for most estimators. Since the limiting distribution of (2.15), corrected by the factor $n^p$, is often a normal distribution, it makes sense to compare the value of $p$ for several estimators.
The distinction between methods of the first and the second generation (Jones et al., 1996) is not only caused by the time of their invention, but also by the better convergence rates of the latter ones. While the values of $p$ for LSCV and BCV are 1/10, better rates are obtained by the second-generation rules. LSCV is probably the bandwidth selector with the highest limiting variance. BCV is biased, but its variance is much smaller. The highest possible rate for $p$ is 0.5, and there exist some estimators (Jones et al., 1991) which reach this "magic" border, though they use no higher-order kernels.
When comparing estimators, one also has to do this with respect to possibly occurring additional parameters, like different scales in the Rule of Thumb or the number of stages in a plug-in approach.
To get an idea whether the advantage concerning the convergence rate can be exploited in practical application, many simulation studies using different reasonable values for $n$ were carried out.
Probably the last compact survey of different bandwidth selectors (Jones, 1996) provided the following main results.
1. $h_{ROT}$ ("Rule of Thumb") oversmooths too often.
2. $h_{BCV}$ has the same tendency and is unstable as well.
3. $h_{LSCV}$ has an unacceptable spread, often in the direction of undersmoothing.
4. $h_{SJ}$ (Sheather-Jones solve-the-equation) is a useful compromise between $h_{ROT}$ and $h_{LSCV}$ and performs acceptably for harder-to-estimate densities as well.
5. $h_{SB}$ (smoothed bootstrap) is very similar to $h_{SJ}$, but slightly worse.
The estimated densities are the same as the already introduced densities of Marron and Wand (1992, see Figure 4 and Figure 5). The fact that estimators based on the minimization of the AMISE happen to perform badly when applied to densities containing spikes or other difficult features is again a consequence of the bad approximation of $h_{MISE}$ by $h_{AMISE}$ (see again Figure 5).
The extensive simulation study of Cao et al. (1994) leads to similar conclusions, but gives very good evidence for $h_{EP}$, the estimator based on the inflection points according to the $L_\infty$-distance.
Sheather (1992) also discovered a superiority of the Sheather-Jones plug-in by analyzing real-life datasets, although this judgement is subjective because the underlying density is unknown.
The theoretically superior estimator described in Jones, Marron and Park (1991), $h_{JMP}$, performs well in their study when the underlying density is normal or similar to it, but is worse than $h_{LSCV}$ for other densities.
On the other hand, $h_{JMP}$ was one of the best estimators in the simulation study of Park and Turlach (1992), where the estimated number and location of modes was another quality criterion. Together with the LSCV criterion it was the best at detecting modes in the multi-modal distributions of this study, as well as, together with $h_{SJ}$, the best with respect to the $L_1$- and $L_2$-distance measures.
My latest reference in this discussion (Bowman and Azzalini, 1997) stresses the fact that even if the conservative LSCV has some undesired properties, it is an estimator which is easy to generalize to the multivariate setting, as well as to a setting using different bandwidths (variable kernel), which should not be overlooked when talking about density estimation for discriminatory purposes.
To gain some intuition and explore the different behaviour of certain data-driven bandwidth selectors for different univariate distributions, the reader is recommended to run the software package "XploRe" (see Härdle et al., 2000), where almost all of the bandwidth selectors described in this section are implemented. However, in more dimensions one probably has to program the preferred estimators oneself.
2.3 The multivariate case
The univariate concept has to be extended in case one wants to discover either
dependencies between several variables, or to carry out classification decisions.
The exploratory use seems to be restricted to bivariate random variables, because the resulting densities can be plotted in three dimensions. However, Scott (1992) provides graphs where densities in higher dimensions have been visualized. Nevertheless, carrying out a classification task by means of kernel density estimation seems to be more relevant when handling multidimensional data, at least in my thesis.
2.3.1 The model
The ($d$-dimensional) multivariate generalization of the kernel density estimator is given by
$$\hat f(\mathbf{x}) = \frac{1}{n}\, |\mathbf{H}|^{-1/2} \sum_{i=1}^{n} K\!\left( \mathbf{H}^{-1/2} (\mathbf{x} - \mathbf{x}_i) \right),$$
where $\mathbf{H}$ is a symmetric positive definite $d \times d$ matrix, the bandwidth matrix, and $K$ is a multivariate kernel satisfying $\int K(\mathbf{x})\, d\mathbf{x} = 1$ (Wand and Jones, 1995).
This kernel function can be derived straightforwardly by generalizing the univariate ones, either by multiplying the univariate kernels (product kernel) or by "rotating" the univariate kernel in the $d$-dimensional space (radially symmetric kernel). The most common choice, a multivariate normal density as kernel, can be derived as either a product or a radially symmetric kernel.
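A minimal sketch of the multivariate estimator with a Gaussian product kernel and a diagonal bandwidth matrix H = diag(h₁², ..., h_d²), one of the simplified parametrizations discussed in section 2.3.2; the per-variable bandwidths below are a crude normal-scale choice.

```python
import numpy as np

def mv_kde(data, points, h):
    """Gaussian product-kernel density estimate.
    data: (n, d) sample, points: (m, d) evaluation points, h: length-d bandwidths."""
    n, d = data.shape
    u = (points[:, None, :] - data[None, :, :]) / h          # (m, n, d)
    kern = np.exp(-0.5 * (u**2).sum(axis=2))                 # product of d Gaussian kernels
    norm = n * np.prod(h) * (2 * np.pi) ** (d / 2)
    return kern.sum(axis=1) / norm

rng = np.random.default_rng(5)
data = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 1]], size=400)
h = 1.06 * data.std(axis=0, ddof=1) * len(data) ** (-1 / 6)  # crude per-variable choice
print(mv_kde(data, np.array([[0.0, 0.0], [2.0, 2.0]]), h))
```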
Of course, the user wants to know which generalization performs better, and as in the univariate case (section 2.2.5), differences in the efficiency of using the data can be calculated. The sample sizes needed to obtain the same AMISE performance are given in Table 2 (Wand and Jones, 1995), where a beta kernel of the form
$$\kappa(x) \propto (1 - x^2)^p\, I(x^2 < 1)$$
is used. The numbers give the efficiencies of the product kernel relative to the efficiencies of the spherically symmetric kernel for different dimensions. Thus, values smaller than 1 denote an asymptotic superiority of the spherical version.
Table 2: Efficiencies of product kernels relative to radially symmetric kernels when using the multivariate generalization of a beta kernel.

p    d = 2    d = 3    d = 4
0    0.955    0.888    0.811
1    0.982    0.953    0.916
2    0.983    0.953    0.915
3    0.984    0.956    0.919
Summing up the knowledge about the kernels, the most effective choice with respect to minimizing either the MISE or the MIAE is supposed to be the radially symmetric Epanechnikov kernel (assuming that the radially symmetric kernels are also superior in the case of the Epanechnikov kernel). However, for reasons I am going to describe in chapter 4, I prefer a kernel with unbounded support in my simulation study.
2.3.2 Parametrizations
Concerning the choice of the smoothing parameter, one is now not only confronted with estimating $d$ bandwidths, as might have been expected, but with a whole matrix, which also contains the information in which directions the function is going to be smoothed; and this setting does not even include a variable generalization like the variable kernel density estimator in the univariate case. So finding a good estimate seems to be an exhaustive procedure.
Strategies to overcome this problem consist of estimating only one bandwidth for all variables, or only a bandwidth vector, ignoring possible correlations between the variables. It is easy to see that the choice of only one smoothing parameter is basically not sufficient; at least a rescaling of the variables, as is also done in other multivariate applications, is desirable. In any case, an additional decision has to be made by the user.
A method described in Silverman (1986), which goes back to Fukunaga (1972), is to carry out a linear transformation of the data so that it has unit covariance matrix. As a next step, the density is smoothed by using a radially symmetric kernel, and finally transformed back. This method corresponds to an adjustment of a multivariate kernel in the "direction" of the data, and therefore only one parameter is needed.
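A sketch of this sphering idea under the assumption of a Gaussian kernel: the data are whitened with the sample covariance, a single bandwidth is used on the transformed scale, and the density value is carried back with the determinant of the transformation.

```python
import numpy as np

def sphered_kde(data, points, h):
    """KDE after a sphering (whitening) transformation with a single bandwidth h.
    Equivalent to a Gaussian kernel with bandwidth matrix H = h^2 * Sigma_hat."""
    mean = data.mean(axis=0)
    cov = np.cov(data, rowvar=False)
    root_inv = np.linalg.inv(np.linalg.cholesky(cov))        # one version of Sigma^{-1/2}
    z_data = (data - mean) @ root_inv.T
    z_pts = (points - mean) @ root_inv.T
    n, d = z_data.shape
    u2 = ((z_pts[:, None, :] - z_data[None, :, :]) ** 2).sum(axis=2) / h**2
    f_z = np.exp(-0.5 * u2).sum(axis=1) / (n * (h * np.sqrt(2 * np.pi)) ** d)
    # back-transformation: multiply by |det(Sigma^{-1/2})| = 1 / sqrt(det(Sigma))
    return f_z / np.sqrt(np.linalg.det(cov))

rng = np.random.default_rng(6)
x = rng.multivariate_normal([0, 0], [[2, 1], [1, 1]], size=500)
print(sphered_kde(x, np.array([[0.0, 0.0]]), h=0.5))
```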
However, Scott (1992, p. 180) emphasizes that even for standardized data, the choice $h_i = h_j$ for different variables is unlikely to be satisfactory. In any case, the user has to be aware of the fact that in the complete setting the number of estimated parameters grows of order $d^2$, and the variances of the matrix elements "explode" as well. The knowledge about the curse of dimensionality (see section 2.3.4) probably restricts this model to only a few dimensions.
2.3.3 Parameter selection
The methods for parameter selection actually follow the same principles as those in the univariate setting. Unfortunately, only a part of the univariate procedures can be generalized well, and I actually found no comparative study of several estimators. Therefore, I only want to briefly explain generalizations of some estimators of section 2.2.7.
2.3.3.1 The normal reference rule
As Scott (1992) gives formulas for the AMISE in the multivariate case, its minimization with respect to a bandwidth matrix should not be that complicated, although an explicit expression for $\mathbf{H}_{AMISE}$ is not possible. However, the author restricts his view to the setting where only a vector of bandwidths has to be estimated.
Since the AMISE formula again contains a functional of the unknown density $f$, the optimal bandwidth vector can be derived by setting $f$ equal to a multivariate normal distribution. In the case of a normal product kernel, the elements of this vector are given by
$$h_i^{*} = \left( \frac{4}{d+2} \right)^{1/(d+4)} \sigma_i\, n^{-1/(d+4)}, \qquad (2.16)$$
where an estimator $\hat\sigma_i$ for $\sigma_i$ is necessary in application. Note that $d = 1$ leads back to (2.9). Scott (1992) describes the extension of $h_{OVERSMOOTH}$ to the multivariate setting as well.
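A direct transcription of (2.16), with the sample standard deviations plugged in:

```python
import numpy as np

def normal_reference_bandwidths(data):
    """Per-variable normal reference bandwidths (2.16) for a normal product kernel."""
    n, d = data.shape
    sigma = data.std(axis=0, ddof=1)
    return (4.0 / (d + 2)) ** (1.0 / (d + 4)) * sigma * n ** (-1.0 / (d + 4))

rng = np.random.default_rng(7)
print(normal_reference_bandwidths(rng.normal(size=(600, 5))))
```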
2.3.3.2 Cross-validation methods
The density estimator in the simulation study of Remme et al. (1980) used a multivariate normal kernel and a bandwidth matrix of the form
$$\mathbf{H} = h^2 \begin{pmatrix} s_1^2 & 0 & \cdots & 0 \\ 0 & s_2^2 & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & s_p^2 \end{pmatrix}, \qquad (2.17)$$
where the $s_i^2$ denote the sample variances of the corresponding variables. The bandwidth $h$ was then computed by the generalization of (2.12), i.e. by maximizing
$$\mathrm{LCV}(h) = \prod_{i=1}^{n} \hat f_{K,h,-i}(\mathbf{x}_i) \qquad (2.18)$$
with respect to $h$. Here $\hat f_{K,h,-i}$ is the density estimate using the multivariate kernel $K$ and leaving out the $p$-dimensional observation vector $\mathbf{x}_i$.
The fact that the minimization of
$$\mathrm{LSCV}(\mathbf{H}) = \int \hat f_{\mathbf{H}}(\mathbf{x})^2\, d\mathbf{x} - \frac{2}{n} \sum_{i=1}^{n} \hat f_{\mathbf{H},-i}(\mathbf{x}_i) \qquad (2.19)$$
(where $\hat f_{\mathbf{H},-i}$ is defined analogously to $\hat f_{K,h,-i}$ in (2.18)) leads to the LSCV-optimal bandwidth matrix (whether $\mathbf{H}$ is restricted to a diagonal matrix or not) is, from this point of view, not surprising (see (2.11), and Wand and Jones, 1995).
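A small sketch of the leave-one-out criterion (2.18) with the bandwidth matrix (2.17); a Gaussian kernel is assumed, the log-likelihood is used instead of the raw product to avoid numerical underflow, and a simple grid search over h replaces a proper optimizer.

```python
import numpy as np

def loo_log_likelihood(data, h):
    """Leave-one-out log-likelihood for H = h^2 * diag(s_1^2, ..., s_p^2), cf. (2.17)-(2.18)."""
    n, d = data.shape
    s = data.std(axis=0, ddof=1)
    u2 = (((data[:, None, :] - data[None, :, :]) / (h * s)) ** 2).sum(axis=2)
    k = np.exp(-0.5 * u2) / ((h ** d) * np.prod(s) * (2 * np.pi) ** (d / 2))
    loo = (k.sum(axis=1) - np.diagonal(k)) / (n - 1)     # leave-one-out density values
    return np.log(loo).sum()

rng = np.random.default_rng(8)
x = rng.normal(size=(300, 3))
hs = np.linspace(0.2, 1.5, 40)
h_lcv = hs[np.argmax([loo_log_likelihood(x, h) for h in hs])]
```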
2.3.3.3 Plug-in
Given the good performance of the plug-in methods, one could also be interested in their multivariate extensions. Wand and Jones (1994) describe in their article how to perform the direct plug-in concept of Sheather and Jones (1991) in $d$ dimensions (there is no direct analogue of the solve-the-equation approach available in the multivariate case). Unfortunately, the problem becomes more and more complicated, since the extensions of the density functionals $\psi_r$ (see section 2.2.7.3) contain functionals of partial derivatives $f^{(\mathbf{r})}(\mathbf{x})$, where $\mathbf{r}$ is a vector of length $d$ containing non-negative integers and the definition is
$$f^{(\mathbf{r})}(\mathbf{x}) = \frac{\partial^{|\mathbf{r}|} f(\mathbf{x})}{\partial x_1^{r_1} \cdots \partial x_d^{r_d}},$$
where $|\mathbf{r}| = \sum_{i=1}^{d} r_i$.
Additionally, the number of density functionals to be estimated increases rapidly as both $d$ and the number of stages $l$ increase.
This would not be a problem either if one had a powerful computer, but another problem arises when a pilot estimation has to be carried out at a certain step. The pilot functional estimator which uses the multivariate normal distribution as reference distribution leads, even for the case $d = 2$, to such a complicated formula that Wand and Jones (1994) stress that "it is too difficult to give succinct expressions for $\psi_m^{N}(\Sigma)$ [the functional, author] so we will restrict attention to the bivariate case". Regarding this fact, the concept is not viable for my simulation study, which I regret, since this concept promises very much and achieved good results for the bivariate case in the cited study. Maybe this technique is only suitable for exploration purposes rather than discrimination, because the visual techniques are essentially restricted to two dimensions as well. Nevertheless, it seems worthwhile to also study the case $d > 2$ in more detail.
2.3.4 The curse of dimensionality
As if the additional problems of the multivariate extension were not big enough so far, a further fundamental problem occurs when estimating densities in high dimensions, which is called the curse of dimensionality. Silverman (1986) describes this in an interesting manner. He considers the importance of the distribution tails in high dimensions. For a univariate density there is no problem when regions where the density $f$ has values below one hundredth of its supremum (the value at the mode) are estimated by $\hat f = 0$ and the corresponding mass is shifted towards the mode. Everyone would call such an estimate good, provided the fit in the main part is appropriate. In ten dimensions, however, for a normal distribution more than one half of the data falls into such regions of low density values, and one does quite badly by ignoring those tails.
Table 3 (Scott, 1992) gives an impression; it lists the values
$$p_d = P\!\left( \frac{f(\mathbf{x})}{f(\mathbf{0})} \ge \frac{1}{100} \right) = P\!\left( \chi_d^2 \le -2 \log \tfrac{1}{100} \right).$$
Table 3: Probability of data in regions which have density values higher than
one hundredth the value at the mode of a multivariate normal distribution.
d         1    2    3    4    5    6    7    8    9    10   15   20
1000 p   998  990  973  944  899  834  762  675  582  488  134   20
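The entries of Table 3 can be reproduced from the χ² expression above; a short check (using scipy) under the standard multivariate normal assumption:

```python
import numpy as np
from scipy.stats import chi2

dims = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20])
p = chi2.cdf(2 * np.log(100), df=dims)    # P(f(x)/f(0) >= 1/100) for N(0, I_d)
print(np.round(1000 * p).astype(int))     # approximately 998 990 973 ... 488 134 20
```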
Already for five to six dimensions the mass in the dense regions of the distribution declines considerably and moves into the sparse regions.
Silverman gives another example when he explains that in a ten-dimensional normal distribution, more than 99% of the data have a distance greater than 1.6 from the center. Every statistician knows that it is almost the other way round in the univariate case, where nearly 90% lie in the interval [-1.6, 1.6].
The paradox here is that the data are not in the regions having high density values, but in the tails, although the tail regions are even more sparsely populated. Roughly speaking, this happens because in more dimensions there is "much more space" for observations, and exactly this fact makes a density in high dimensions difficult to estimate non-parametrically.
Scott (1992) also obtained striking results when he compared the ratio of the volume of a hypercube to that of an inscribed hypersphere as a function of the dimension $d$: in high dimensions almost the whole volume of such a cube lies in its corners, outside the inscribed hypersphere.
This digression into a topic which sounds like a science-fiction story provides very deep insight into the main problem of extending a density estimation model to high dimensions. The fascinated reader is recommended to read more such examples in Scott (1992) and Silverman (1986), respectively. In particular the table on page 94 of the latter source (Table 5 in my thesis), showing the required sample size to achieve a given accuracy for varying dimensions, is of practical importance and leads back to the topic of density estimation.
This curse of dimensionality stresses the unconditional necessity to transform high-dimensional data onto suitable subspaces. Some techniques for how to proceed are discussed in chapter 3.
2.4 The context to kernel discriminant analysis
As already briefly mentioned in the introduction to chapter 2, density estimation is not only an end in itself, but also a means to an end.
One possible field of application which I take into consideration in my thesis is discriminant analysis, often called (Bayes-rule) kernel discriminant analysis. Since the kernel model is more flexible, it seems to be a proper alternative to the model-based approach, although the Bayes-rule-based decision rules assuming multivariate normal distributions are also somewhat justified (e.g. the equality of the LDA and the differently derived Fisher approach).
Denoting the class of a $d$-dimensional data vector $\mathbf{x}_i$ by $k_i$ and the prior probabilities by $p(k)$, it is well known that the Bayes rule tells you to maximize
$$\hat p(k \mid \mathbf{x}) = \frac{ \hat p(k)\, \hat f(\mathbf{x} \mid k) }{ \sum_{k} \hat p(k)\, \hat f(\mathbf{x} \mid k) }$$
over all possible values $k$, and to take the maximizer $\hat k$.
In kernel discriminant analysis, $\hat f(\mathbf{x} \mid k)$ is now not the normal density, but a multivariate kernel density estimate, calculated classwise by using for each class $k$ only the vectors $\mathbf{x}_i$ having $k_i = k$.
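A minimal sketch of this Bayes rule with classwise kernel density estimates; a Gaussian product kernel, per-class normal reference bandwidths and priors estimated by the class proportions are assumptions made here for illustration.

```python
import numpy as np

def mv_kde_at(data, points, h):
    """Gaussian product-kernel density estimate at the given points."""
    n, d = data.shape
    u2 = (((points[:, None, :] - data[None, :, :]) / h) ** 2).sum(axis=2)
    return np.exp(-0.5 * u2).sum(axis=1) / (n * np.prod(h) * (2 * np.pi) ** (d / 2))

def kernel_discriminant(train_x, train_y, test_x):
    """Assign each test point to the class maximizing prior * kernel density estimate."""
    classes = np.unique(train_y)
    scores = []
    for c in classes:
        xc = train_x[train_y == c]
        n, d = xc.shape
        h = (4 / (d + 2)) ** (1 / (d + 4)) * xc.std(axis=0, ddof=1) * n ** (-1 / (d + 4))
        prior = len(xc) / len(train_x)
        scores.append(prior * mv_kde_at(xc, test_x, h))
    return classes[np.argmax(np.stack(scores), axis=0)]

rng = np.random.default_rng(9)
x1 = rng.normal([0, 0], 1, size=(100, 2))
x2 = rng.normal([2, 2], 1, size=(100, 2))
X, y = np.vstack([x1, x2]), np.array([1] * 100 + [2] * 100)
print(kernel_discriminant(X, y, np.array([[0.0, 0.0], [2.0, 2.0]])))
```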
To gain some intuition about the difference, Figures 9 and 10 should be considered. Figure 9 shows a contour plot of the posterior probabilities in a self-constructed dataset together with the underlying data points. The dataset consists of five groups with different means, different covariance matrices and twenty observations each. The goal is to separate those groups and to produce rules for classifying new observations. Note that LDA is very limited, since the separation is carried out by hyperplanes; in particular the different variances of the components of groups 3 and 5 lead to several misclassifications of the latter group. The QDA is often criticized, but through its more flexible approach it provides better results in this case. Figure 10, finally, shows the flexible fit using the same data points, but it also indicates, even in two dimensions, what was meant by the curse of dimensionality (section 2.3.4).
Figure 9: Contour-plot of the maximum posterior probabilities at each point
(LDA).
The rules produced are spurious and have probably (and, since I know the distribution, surely) nothing to do with the underlying density, which stresses again the difficulty of the estimation and the need for transformations and dimension reduction.
Figure 10: Contour-plot of the maximum posterior probabilities at each
point (kernel estimate).
The performance of a classification rule is now measured either by the classical misclassification rate (error rate) or by more modern measures (see Hand, 1997, for a discussion), and no longer by any difference of densities. In case there are more than two populations, it is not really clear how a difference between several estimated densities should be measured anyway. In general, the better the density is estimated, the smaller the misclassification rate should be, but a MISE-optimal bandwidth selector is probably not the best one regarding the misclassification rate, since the theoretical misclassification rate is an $L_1$-based measure. One has to be aware of this fact, and a common optimization of both aims is what is desired (see also Hand, 1997). Note that the misclassification rate does not take different costs into consideration either.
How can one actually make connections between the choice of the smoothing parameter and known discrimination rules? Hand (1982) showed that as $h \to \infty$, the decision surfaces become more and more hyperplanes and end up in a rule which assigns the observation $\mathbf{x}$ to the group whose sample mean is closest (average linkage). That happens because when $h$ becomes large, the resulting estimate is essentially the kernel $K$ itself.
On the other hand, if $h \to 0$, the resulting estimate is a surface consisting of bumps at the position of each data point. Caused by the exponential decrease of the normal density, as $h \to 0$ a new point $\mathbf{x}$ will be classified to its nearest neighbour (measured by the Euclidean norm). Those results assure the user that a choice within this range of $h$ cannot be bad, since those two algorithms are known to perform quite well.
Taking the curse of dimensionality into account again, one is confronted in higher dimensions with the fact that a proper estimation of the distribution tails is particularly important. Ripley (1996, p. 182) therefore suggests inspecting densities on log scales, and emphasizes that discrimination is based on differences in log densities. In his opinion, the proper estimation of the tails is almost completely ignored in the density estimation literature. That is no wonder, regarding the fact that the $L_1$-based parameter selectors (which weight a proper fit in the tails most) are difficult to handle (see again the quite technical book of Devroye and Györfi, 1985) and are therefore widely avoided.
Figure 11: Difference between the logarithm of a standard normal density
and the logarithm of two kernel estimates (Sheather-Jones plug-in and the
Rule of Thumb).
Figure 11 gives an impression of how bad the fit is in the tails. 50 observations from an N(0,1) random variable were generated, and the density was estimated with the Rule-of-Thumb bandwidth selector and with the Sheather-Jones selector, respectively. Then the true and the estimated curves were compared on a log scale. Regarding the fact that a univariate normal density is almost the easiest-to-estimate density possible, this is a rather poor result, and it focuses the view on one of the main problems of kernel discriminant analysis. There is one approach in the literature (Hall and Wand, 1988) which estimates density differences and circumvents the problem of negative estimated values arising with higher-order kernels, but it estimates only the differences and not the log-differences.
The literature about how kernel discriminant analysis performs compared to LDA and QDA is rather old compared to the vast amount of modern bandwidth selection rules for kernel density estimation. Van Ness and Simpson (1976) investigated this area by choosing a multivariate normal and a Cauchy kernel, respectively, and searched for dependencies on the number of dimensions (up to 30), the number of observations per class (10 and 20, respectively) and the distance $\Delta$ between two normal distributions with equal covariance matrices, which represent the class densities to distinguish. For reasons discussed in section 2.3.4, it is in my opinion completely absurd to gain any practical insight from such a setting. It is remarkable that they discovered a dominance of the kernel setting over the LDA, although the LDA is of course the best in this setting (with respect to minimizing the error rate). In this setting there is no convergence effect of the estimates towards the real distribution. The parametric estimates (LDA, QDA) as well as the kernel estimate have such a great variance that any comparisons of error rates are fruitless, even when the error rates were estimated by averaging over 100-300 equal attempts. The fact that this averaged classification rate is unaffected by the choice of the smoothing parameter over a wide range (see Van Ness and Simpson, 1976, p. 185) gives the best evidence. The simulation study of Van Ness (1980), because of the similar results for the normal and the Cauchy kernel in the former study, used only the normal kernel; different covariance matrices were used there as well. Nevertheless, those two studies probably gave one of the first impulses for taking the important role of the dimension in the kernel setting into consideration.
Remme et al. (1980) discuss a much wider range of underlying distributions and restrict the number of dimensions $d$ to the much more reasonable range {2,...,6}. The construction of the estimator is given in (2.17) and (2.18), and the tested densities are multivariate normals, multidimensional lognormals and normal-mixture densities. They used the classical misclassification rate and a measure which also considers the exact estimated posterior probabilities for testing the performance. They also studied a variable kernel model, whose bandwidths are constructed by the multivariate counterpart of (2.8). The number of nearest neighbours to use was studied in a separate simulation study (Habbema et al., 1978), where the same authors underline that the choice of this number is not that crucial over a wide range.
They compared the method to LDA and QDA, respectively, and essentially got the following results.
1. LDA is the best for multivariate normals with equal covariance matrices (not surprising).
2. LDA performs increasingly poorly in case of non-equal covariance matrices, and QDA is not much better.
3. The kernel method was better than the others, or at least as good as them, except in the setting of point 1. This is surprising, regarding the fact that the sample sizes were not very large (n = 15, n = 35).
4. Nevertheless, the results in the lognormal case were disappointing, and the variable kernel estimate performed much better there.
The last point again stresses the need for a more flexible setting or for transformations. In my study I will try to carry out proper transformations. The most remarkable statement in their study is probably the final conclusion that "the present practice of the nearly exclusive use of LDA cannot be justified", which tells us that this was known already 22 years ago and should encourage every software provider of discrimination rules to offer alternatives.
With the current knowledge, one is actually “only“ confronted with a need to
transform marginal distributions and the necessity for a proper projection onto a
subspace, to overcome the curse of dimensionality when analyzing higher-
dimensional datasets. This important task is going to be discussed in the following
chapter.
Chapter 3: Dimension Reduction and Marginal
Transformations
3.1 Introduction
As already briefly discussed in sections 2.3 and 2.4, estimating densities appropriately in high dimensions (where "high" always means high with respect to the sample size) is a difficult, if not impossible, task, and some kind of transformation is inevitable. This is, however, easier said than done if those estimates lead to classification decisions. In this case it is important to know the density values at each test observation $\mathbf{x}$ in the original dimension, because they determine the posterior probabilities, or at least the order of the posterior probabilities of the different classes. Since the transformations have to be carried out classwise, this order may change. While Wand and Jones (1995) care about the back-transformation of the density in their univariate transformations, Scott (1992) does not treat this necessity in the multivariate case.
Section 3.2 now gives advice on how to transform the dataset within the original dimension, whereas questions of reducing dimensions are discussed in section 3.3. The application in chapter 4 will take both aspects into consideration.
3.2 Marginal transformations
Suppose you have data in a high-dimensional space. In most cases at least some of the variables are highly correlated. Scott (1992) suggests in this case a two-stage approach preceded by marginal transformations. He talks about a need to normalize the data. In the context of section 2.2.6.2 this is not a bad goal either, since the normal density is one of the easiest densities to estimate.
The need to back-transform the estimated densities becomes apparent in Figure 12.
Figure 12: Normalization of two bimodal class densities. Density 1 (solid),
density 2 (dashed) and their respective normalizations.
If the point x0 has to be classified, the reasonable choice would be an assignment to group 1, since the value of the first class density is higher there. "Normalization" of each class density changes the assessment totally, and the point would be classified to group 2. That means the transformation rules have to be stored in some way.
So, how to normalize? One method was already named in section 2.2.6.2: to use as transformation function a "member" of the "shifted power family" (Wand and Jones, 1995), which is given by
$$t(x; \lambda_1, \lambda_2) = \begin{cases} (x + \lambda_1)^{\lambda_2}\, \mathrm{sign}(\lambda_2) & \lambda_2 \ne 0 \\ \ln(x + \lambda_1) & \lambda_2 = 0 \end{cases}$$
where $\lambda_1 > -\min(X)$ and $\min(X)$ denotes the lower endpoint of the support of $f$.
This method suffers in application, because one has to know the parameters, and this leads to several optimization steps, regarding the fact that every marginal density has to be transformed.
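A direct transcription of this transformation family as a sketch; the parameter values in the example are purely illustrative, since in practice λ₁ and λ₂ would have to be chosen by the optimization steps mentioned above.

```python
import numpy as np

def shifted_power(x, lam1, lam2):
    """Shifted power family t(x; lambda_1, lambda_2) of Wand and Jones (1995)."""
    if lam2 == 0:
        return np.log(x + lam1)
    return np.sign(lam2) * (x + lam1) ** lam2

rng = np.random.default_rng(10)
x = rng.exponential(scale=1.0, size=500)        # right-skewed sample
y = shifted_power(x, lam1=0.1, lam2=0.3)        # illustrative parameter choice
```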
The non-parametric method of Ruppert and Cline (1994) gives straightforward transformations. They use the fact that if $F$ and $G$ are the cdfs of the densities $f$ and $g$, then $Y = G^{-1}(F(X))$ has density $g$. One is now able to choose any distribution, and the user will probably choose a normal distribution, because normalization is wanted. In application, $t(x) = G^{-1}(\hat F_h(x))$ is taken, where $\hat F_h(x)$ is a (pilot) kernel estimate of $F(x)$ (Ruppert and Cline, 1994). To get the density values for the original data, one has to transform the density back to the original space, and therefore the first derivative of the transformation function is needed. For this reason, one additional point lying "close" to each observation has to be transformed as well, in order to approximate this derivative numerically (note that $f(x) = F'(x)$, since the transformation is based on a kernel estimate of the cdf). This idea is illustrated in Figure 13.
Figure 13: Univariate normalizations. a) shows a density, which has to be
normalized. b) shows the normalizing procedure based on the estimated cdf.
Because of the exhaustive computations, this procedure is probably not suitable for estimating the whole density, but it seems reasonable for discriminatory purposes, where only one function value has to be computed for each test observation.
For the concrete realization of this approach, the formula for transformed densities is used. Let $\mathbf{x}$ be a random vector and $\mathbf{y} = T(\mathbf{x})$ another random vector of the same dimension. Then the equation
$$f_{\mathbf{x}}(\mathbf{x}_0) = f_{\mathbf{y}}\big(T(\mathbf{x}_0)\big)\, \left| \det\big( DT(\mathbf{x}_0) \big) \right| \qquad (3.1)$$
is valid (Bomze, 1998), where $DT(\mathbf{x})$ is the Jacobian matrix of the transformation $T$ at the point $\mathbf{x}$.
Since the values $f_{\mathbf{y}}(T(\mathbf{x}))$ of the multivariate density estimate at the transformed data points are known, it is only necessary to get the value of the Jacobian determinant (or a numerical approximation) at each point of the test dataset for each transformation (class).
Because the transformations are marginal, the Jacobian matrix is diagonal and "only" $d$ values (according to the $d$ dimensions of the original space) have to be calculated.
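A sketch of the whole normalization step for one variable: the transformation t(x) = Φ⁻¹(F̂_h(x)) based on a kernel estimate of the cdf, and the density value of an original test point recovered via (3.1), with the derivative of t approximated numerically by a small increment Δ as described above. The bandwidths are crude normal-scale choices, and scipy's normal cdf and quantile functions are assumed.

```python
import numpy as np
from scipy.stats import norm

def cdf_kde(sample, x, h):
    """Kernel estimate of the cdf (Gaussian kernel)."""
    return norm.cdf((x[..., None] - sample) / h).mean(axis=-1)

def normalize(sample, x, h):
    """t(x) = Phi^{-1}(F_hat(x)): maps the sample approximately to N(0,1)."""
    return norm.ppf(np.clip(cdf_kde(sample, x, h), 1e-12, 1 - 1e-12))

rng = np.random.default_rng(11)
train = rng.gamma(2.0, 1.5, size=600)                  # one skewed training variable
h = 1.06 * train.std(ddof=1) * len(train) ** (-1 / 5)

x0 = np.array([3.0])                                   # a test observation
delta = 1e-3
t0, t1 = normalize(train, x0, h), normalize(train, x0 + delta, h)

# density of the normalized variable at t(x0), estimated from the transformed sample
z = normalize(train, train, h)
h_z = 1.06 * z.std(ddof=1) * len(z) ** (-1 / 5)
f_y = np.exp(-0.5 * ((t0[:, None] - z) / h_z) ** 2).sum(axis=1) / (len(z) * h_z * np.sqrt(2 * np.pi))

# back-transformation (3.1): f_X(x0) = f_Y(t(x0)) * |t'(x0)|
f_x0 = f_y * np.abs(t1 - t0) / delta
```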
If the number of test data points is $n$, the number of dimensions of the original space is $d$, and the number of groups to separate is $g$, then $n \times d \times g$ values are necessary to carry out this approach. This is probably not feasible for "high" dimensions, e.g. $d = 100$, or for "many" test data points to evaluate. However, in this case the dimensions do not have to be reduced as strongly as in the case of multivariate kernel density estimation, and I think a proportion of 95% variance explained, referring to a dimension reduction by principal component analysis, can be achieved easily.
To get an impression of how well this non-parametric normalization performs, Figure 14 should be considered. The variable used comes from my empirical insurance dataset (see chapter 4). The highly skewed variable measures the time since the last delay in payment of each customer. The graph shows that the normalization works quite well, and the user is not forced to select any parameters. The kernel estimate in the right graph probably reminds the user slightly of a standard normal density.
Figure 14: Non-parametric normalization of one variable of the insurance
data.
This method is a first concept for using kernel density estimation for discrimination in high dimensions, which overcomes the problem of multivariate density estimation (as long as the LDA or the QDA is used after the normalization step). The other alternative, involving multivariate kernel estimation, is discussed in the following section.
3.3 Dimension reduction
The briefly mentioned two-stage approach (Scott, 1992) suggests a linear and a non-linear transformation.
The linear one is either a sphering transformation or a transformation into principal components. The sphering transformation is given by
$$\mathbf{Z} = \Sigma^{-1/2} (\mathbf{X} - \mu),$$
where $\Sigma^{-1/2} = \mathbf{A} \Lambda^{-1/2} \mathbf{A}^{T}$, and $\mathbf{A}$ and $\Lambda$ contain the eigenvectors and the eigenvalues (written in the main diagonal, with zeros elsewhere) of the covariance matrix of $\mathbf{X}$, respectively. While this transformation destroys all first- and second-order information, the principal component transformation
$$\mathbf{Y} = \mathbf{A}^{T} (\mathbf{X} - \mu)$$
keeps at least the variance information (Scott, 1992). It is well known that the columns of $\mathbf{Y}$ are uncorrelated and have (starting with the first) the largest possible variance of all linear combinations of the variables. In application, the calculation of $\mathbf{A}$ is based on the eigenvectors and eigenvalues of the maximum likelihood estimate $\hat\Sigma$ of the covariance matrix of $\mathbf{X}$. The author suggests, e.g. in the case of normal data, choosing the subspace dimension $d'$ as the smallest one that satisfies
$$\frac{\sum_{i=1}^{d'} \hat\lambda_i}{\sum_{i=1}^{d} \hat\lambda_i} > 90\%,$$
where $\hat\lambda_i = s_{Y_i}^2$ is the sample variance of the $i$-th transformed column and the $i$-th largest eigenvalue of $\hat\Sigma$, respectively. In case of normalized, originally non-normal data he suggests taking 95% as the threshold to get rid of any dimensions containing no independent linear information, and expects $1 \le d' \le 10$ in practice.
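A sketch of this first (linear) stage: principal components computed from the ML covariance estimate and the smallest subspace dimension d' whose cumulated explained variance exceeds a given threshold (90% or 95% as suggested above). The generated data are purely illustrative.

```python
import numpy as np

def pca_subspace(data, threshold=0.90):
    """Project onto the smallest number of principal components explaining
    at least `threshold` of the total variance."""
    centered = data - data.mean(axis=0)
    cov = centered.T @ centered / len(data)            # ML estimate of the covariance
    eigval, eigvec = np.linalg.eigh(cov)
    order = np.argsort(eigval)[::-1]                   # largest eigenvalues first
    eigval, eigvec = eigval[order], eigvec[:, order]
    explained = np.cumsum(eigval) / eigval.sum()
    d_sub = int(np.searchsorted(explained, threshold) + 1)
    return centered @ eigvec[:, :d_sub], d_sub

rng = np.random.default_rng(12)
x = rng.normal(size=(600, 10)) @ rng.normal(size=(10, 10))   # correlated 10-dim data
scores, d_sub = pca_subspace(x, threshold=0.90)
print(d_sub, scores.shape)
```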
His second stage, to get to the working dimension $d''$, is the use of a projection pursuit technique. The optimization happens with respect to the $d' \times d''$ matrix $\mathbf{P}$, using a multivariate version of $\mu_2(K)^2 R(K)$, to get not the "smoothest" but the most informative, "least smooth" density. That means, denoting the data in $\mathbb{R}^{d'}$ by $\mathbf{X}$ and the transformed data in $\mathbb{R}^{d''}$ by $\mathbf{Y} = \mathbf{XP}$, the matrix $\mathbf{P}$ satisfying
$$\max_{\mathbf{P}}\; \hat\sigma(\mathbf{XP})\, R\{ \hat f_h(\mathbf{y} \mid \mathbf{XP}) \}$$
has to be found. Here $\hat\sigma(\mathbf{XP})$ is the product of the sample standard deviations of the columns of $\mathbf{XP}$, and $\hat f_h(\mathbf{y} \mid \mathbf{XP})$ is a multivariate kernel density estimator based on the data matrix $\mathbf{XP}$ with a bandwidth matrix depending only on one parameter $h$. Although the optimal $\mathbf{P}$ is relatively insensitive to the choice of $h$, the optimization turns out to be non-trivial, and this fact makes the preceding principal component analysis unavoidable. Scott (1992) suggests other criteria as well. Nevertheless, I am not going to use such an exhaustive approach, since my datasets do not have that many dimensions and I already have enough data transformation steps. In addition, it is not included in my software packages.
As seen in section 2.3.4, a radical reduction of dimensions is desired to avoid the great volumes of a high-dimensional space. Unfortunately, a loss of information is included in this reduction; in fact, there are two counteracting aims. Table 4 and Figure 15 show an average output of a principal component analysis with one of my synthetic datasets, and Table 5 (Silverman, 1986) again makes the curse of dimensionality transparent.
Table 4: Principal component analysis with one of my synthetic datasets.

Component    explained variance (%)    cumulated (%)
1            47.674                     47.674
2            21.826                     69.500
3             6.119                     75.619
4             5.655                     81.274
5             4.801                     86.076
6             4.292                     90.367
7             3.291                     93.659
8             3.247                     96.906
9             2.297                     99.203
10            0.797                    100.000
Figure 15: The corresponding
scree-plot.
The values in Table 5 are based on an accuracy measure called the relative mean square error. The sample sizes are calculated to satisfy
$$\frac{E\big\{ (\hat f(\mathbf{0}) - f(\mathbf{0}))^2 \big\}}{f(\mathbf{0})^2} < 0.1 \qquad (3.2)$$
when setting $f$ equal to the standard normal density.
Concerning Figure 15, a reasonable choice for the number of dimensions in the subspace would be three. If the number of observations is e.g. $n = 600$ (as it is for each class in each training dataset of my examples in chapter 4), the accuracy given in Table 5 is achieved, but a projection onto four dimensions is "valid" as well. However, explaining only 81% of the variance does not seem adequate for exact discrimination tasks.
Table 5: Smallest sample size for each dimension which satisfies (3.2).

Dimensionality    Required sample size
1                 4
2                 19
3                 67
4                 223
5                 768
6                 2790
7                 10700
8                 43700
9                 187000
10                842000
The problem now becomes apparent, but an intuitive statement about how to choose the number of dimensions of the subspace is rather difficult.
Kernel density estimation in such a subspace is the alternative to the method discussed in section 3.2, and chapter 4 should give answers to the question: which one is better, and how do they perform compared to the classical methods (LDA and QDA)?
Chapter 4: Simulation Study and Real-life Application
4.1 Introduction
In chapters 2 and 3 I collected many ideas about how to obtain good estimation results and how to carry out the classification. However, essentially this was all theory, and I am certainly interested now in how it can be applied.
The basic intention behind my attempts concerning this topic is to create different distributions which occur in everyday life. We are nowadays confronted with huge datasets, and therefore I want to use higher numbers of observations than most cases in the existing literature, while always keeping an eye on the limitations of the model. My argument is that if the results are good with these observation numbers, one can reduce a data-mining dataset by drawing a sample of reasonable size. The fact that the model is not tractable for data-mining datasets with thousands (or more) of observations will become apparent later in this chapter.
The basic content of this chapter is a detailed description of what has been used as "input", which algorithms were used, and which "output" is going to be generated (section 4.2). The second main point is a depiction of the most interesting results of this analysis, mostly by means of graphs (section 4.3).
Some computational considerations, which represent either my experience or the experience of some former authors, are pointed out in a separate section (section 4.4). Finally, attempts to explain the results of the different estimators are made in section 4.5. The exact results are listed in the appendix.
4.2 Preliminaries
4.2.1 The data
The choice of the data was made with respect to the results of chapters 2 and 3. Each dataset consists of two classes with 600 observations each, and a test dataset containing 200 observations. The two-class problem is probably the most investigated problem in the literature. There was no reason for me to choose more than two classes, because the basic problem would probably not change. Furthermore, in most applications there are only two classes to separate. The number of 600 observations also seems to be a kind of upper bound for using leave-one-out cross-validation methods. Even here the calculations were very demanding, but a reduction to five dimensions seems to be somewhat justified (see Table 5).
The number of dimensions of the original space was chosen as $d = 10$. This is at least twice as high as the highest occurring sub-dimensional dataset, since I plan to project the data onto two to five dimensions. A number of $d = 10$ seems to be enough to get a feeling for the behaviour of the dimension reduction approach.
4.2.1.1 The self-constructed (synthetic) data
The self-constructed data consist of seven different univariate prototypes for the class distributions, which were chosen to model common distributions that might appear in different applications. These prototypes are shown in Table 6 and Figure 16.
I now want to give some short reasons why I chose these prototypes. The studies which concentrated to a great extent on high dimensions (Van Ness and Simpson, 1976; Van Ness, 1980) treated only uncorrelated multinormal distributions, whose groups were separated by only one variable. This is a very limited view of the problem, since section 2.2.6.2 showed that, for example, right-skewed distributions are much more difficult to estimate. In addition, those distributions appear often in practical problems, e.g. in lifespan distributions as well as in income data. Thus, it is essential to have at least one type of those in the simulation.
Table 6: Prototype distributions of the synthetic datasets.

Normal:
  Construction: N(0,1)

Normal with "small noise":
  Construction: 0.8\,N(0,1) + 0.2\,\bigl(\sum_{i=1}^{25}\phi(\mu_i)\bigr)^{-1}\sum_{i=1}^{25}\phi(\mu_i)\,N(\mu_i,\,0.1)
  Means: \mu_1 = -3; \mu_i (i = 2,\dots,25) is created by adding stepwise uniform(0, 0.5) random variables

Normal with "medium noise":
  Construction: 0.7\,N(0,1) + 0.3\,\bigl(\sum_{i=1}^{13}\phi(\mu_i)\bigr)^{-1}\sum_{i=1}^{13}\phi(\mu_i)\,N(\mu_i,\,0.2)
  Means: \mu_1 = -3; \mu_i (i = 2,\dots,13) is created by adding stepwise uniform(0, 1) random variables

Normal with "large noise":
  Construction: 0.5\,N(0,1) + 0.5\,\bigl(\sum_{i=1}^{7}\phi(\mu_i)\bigr)^{-1}\sum_{i=1}^{7}\phi(\mu_i)\,N(\mu_i,\,0.35)
  Means: \mu_1 = -3; \mu_i (i = 2,\dots,7) is created by adding stepwise uniform(0, 2) random variables

Exponential:
  Construction: Exp(1)

Bimodal "close":
  Construction: 0.5 N(0;1) + 0.5 N(2.5;1)

Bimodal "far":
  Construction: 0.5 N(0;1) + 0.5 N(5;1)
On the other hand, it is of course important to have a multi-modal distribution
included. Therefore I chose two bimodal distributions, one with two modes lying
close to each other and one with strictly separated modes. I skipped the use of
several pathological normal mixtures and of normal mixtures having more than two
bumps. A detailed discussion of such distributions with respect to bandwidth
selection in the univariate kernel setting is given by Marron and Wand (1992). In
my opinion, many of the distributions used in that paper do not represent
densities that typically occur in applications.
Figure 16: Prototype distributions of the synthetic datasets (panels: Normal, Normal-noise small, Normal-noise medium, Normal-noise large, Exponential(1), Bimodal - close, Bimodal - far).
The different normal-noise densities are chosen because I want to figure out how
strongly the dominance of the normal-distribution-based decision rules (LDA, QDA)
depends on deviations from this hypothesis. Since in real life practically nothing
is normally distributed, because of unknown dependencies which make the model of
independent observations questionable, one of my interests is how big the "noise"
in the normal distribution has to be before the non-parametric approach becomes
dominant. The parameters μi of the "error bumps" in the normal-noise densities are
chosen such that at least the expected normal-noise density is symmetric around
zero. The mixture coefficients are high in the centre of the distribution and
smaller in the tails, in order to produce features corresponding to my
understanding of "noise". The function φ in Table 6 denotes the
standard normal density function.
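To make the construction of the noise prototypes concrete, the following short sketch in Python (not the software actually used for this study) simulates the "normal with large noise" prototype as reconstructed in Table 6; the function name and all parameter defaults are illustrative assumptions, not part of the thesis.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(1)

    def noisy_normal_sample(n, n_bumps=7, step_high=2.0, w_normal=0.5, bump_sd=0.35):
        # bump centres: mu_1 = -3, then cumulative uniform(0, step_high) increments
        mu = -3.0 + np.concatenate(([0.0], np.cumsum(rng.uniform(0.0, step_high, n_bumps - 1))))
        # bump weights proportional to phi(mu_i): large near the centre, small in the tails
        w_bump = norm.pdf(mu)
        w_bump = (1.0 - w_normal) * w_bump / w_bump.sum()
        weights = np.concatenate(([w_normal], w_bump))      # N(0,1) part plus the bumps
        means = np.concatenate(([0.0], mu))
        sds = np.concatenate(([1.0], np.full(n_bumps, bump_sd)))
        comp = rng.choice(len(weights), size=n, p=weights)   # draw a mixture component per observation
        return rng.normal(means[comp], sds[comp])

    x = noisy_normal_sample(600)   # one column of a "large noise" class sample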
The prototypes just described are linked together in the next step to produce 20
different datasets, each of dimension 1400x10.
The prototype distributions shown in Table 6 are used only for population 1,
because (at least) the multivariate distributions have to differ, since a
discriminant analysis would not be possible otherwise. For population 2, every
normal density is shifted by 0.5 to the right, and the exponential distribution
changes its parameter from λ = 1 to λ = 2. This shift was chosen so that even in
ten dimensions the error rates remain at approximately 20% and are therefore
noticeably higher than zero, so that differences between the methods become
apparent. Table 7 should make everything clear.
Table 7: Description of the used datasets.
Dataset Nr. Abbrev. contains
1 NN1 10 normal distributions with "small noise"
2 NN2 10 normal distributions with “medium noise”
3 NN3 10 normal distributions with “large noise”
4 SkN1 2 skewed (exp-)distributions and 8 normals
5 SkN2 5 skewed (exp-)distributions and 5 normals
6 SkN3 7 skewed (exp-)distributions and 3 normals
7 Bi1 4 normals, 4 skewed and 2 bimodal (close)-dist.
8 Bi2 4 normals, 4 skewed and 2 bimodal (far)-dist.
9 Bi3 8 skewed and 2 bimodal (close)-dist.
10 Bi4 8 skewed and 2 bimodal (far)-dist.
After the "linking" step, those ten datasets have been linearly transformed by ten
10x10 matrices, which are the square roots of ten self-produced correlation
matrices. The datasets 11-20 have been produced in exactly the same way as numbers
1-10, but population 1 and population 2 have been transformed by unequal
transformation matrices. The datasets having equal covariance matrices have "1"
as their last digit, the others have "2". For example, the dataset "Bi42" consists
originally of eight skewed distributions and two bimodal ones (whose bumps are
strongly separated), and the transformation was carried out with unequal
transformation matrices for the two groups.
The 30 correlation matrices occurring here, for their part, have been produced by
assuming a common factor in the ten variables with a regression coefficient whose
absolute value is uniformly distributed between 0.3 and 1. The simulations showed
that there were no essential differences between the results of the
principal component analyses.
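As an illustration of this construction, the following sketch (again Python, with hypothetical function names; the thesis calculations were done in S-Plus and Excel) builds a one-factor correlation matrix with loadings whose absolute values are uniform on (0.3, 1), takes its symmetric square root and applies it to a data matrix of linked prototype columns.

    import numpy as np

    rng = np.random.default_rng(2)

    def one_factor_corr(d=10):
        # loadings of the common factor; random sign, absolute value uniform on (0.3, 1)
        lam = rng.uniform(0.3, 1.0, d) * rng.choice([-1.0, 1.0], d)
        R = np.outer(lam, lam)
        np.fill_diagonal(R, 1.0)          # unit diagonal makes R a correlation matrix
        return R

    def matrix_sqrt(R):
        # symmetric square root via the eigendecomposition of R
        vals, vecs = np.linalg.eigh(R)
        return vecs @ np.diag(np.sqrt(vals)) @ vecs.T

    R = one_factor_corr()
    X = rng.standard_normal((1400, 10))   # stand-in for the linked prototype columns
    X_corr = X @ matrix_sqrt(R)           # induces the correlation structure R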
The produced data now exhibits many different features, including correlations.
The only drawback is that one can retransform the data by estimating the
correlation matrices in order to get uncorrelated variables. In this case
uncorrelatedness means the same as independence between the variables, which is
not the case in reality, since the correlation coefficient measures only linear
dependence; this means that there is no nonlinear information in my data. If that
were always the case, one would not need to think about any multivariate setting,
since instead of a multivariate density the product of univariate density
estimates could be used. The case of different variances is left out, because
rescaling the variables involves no loss of generality.
4.2.1.2 The real-life data
To overcome the drawback just described, the performance of the kernel setting
should also be examined on a real-life dataset. Therefore a sample (of
observations and variables) from an insurance dataset has been chosen.
In this case the observation numbers for the two groups are not equal. The two
groups are separated by the fact of whether or not the insurer had to pay for a
certain insurance policy during the year 1998. The relation was about 20:80.
In order to estimate both class densities with the same accuracy, the training
datasets were built from samples of equal size (again 600 observations for each
group, the same as for the synthetic data). However, the 200-observation sample
which represents the test data was drawn at random from the remaining dataset.
To gain an impression of the dataset, Figure 17 provides histograms of the ten
dimensions.
Figure 17: Univariate histograms for the insurance-dataset (Variables 1 to 10).
The results for the dimension reduction for the insurance-dataset, as well as for
the synthetic datasets are listed in the appendix (Table 12).
4.2.2 The construction of the estimators and the estimation procedure
As suggested in chapter 3, the kernel concept can be applied in different manners.
Estimators 1 and 2 in my study are constructed as described in section 3.2. A
non-parametric normalization is carried out by using kernel estimates for the
univariate cdf's. For this reason, 20 normalizations have to be carried out,
because there are ten variables and two groups in each dataset. Estimator 1 uses
the bandwidth selected by the "Normal rule" ("Rule of Thumb", (2.9)) to smooth
the cdf, a method which is known to oversmooth in most cases. It will be denoted
as "Normal rule – norm(alized)".
As an alternative, the bandwidth used for estimator 2 is constructed by the
solve-the-equation version of the Sheather-Jones plug-in selector, obtained by
solving (2.14) with respect to h. The results concerning this transformation will
be denoted as "Sheather-Jones – norm(alized)".
The aim is not to get an easier-to-estimate density for the kernel setting, but to
be "allowed" to apply LDA and QDA, respectively, in the classification step. After
normalizing, the two class densities are estimated parametrically by the
maximum-likelihood estimators of the parameters of a normal distribution, because
the transformed multivariate distributions are close to normal and a dimension
reduction including a loss of information is not justified. Depending on whether
equal or unequal covariance matrices are assumed, the results are again reported
separately, denoted by LDA and QDA, respectively.
The 200 test points are evaluated and the corresponding normal-distribution values
are transformed by the formula given in (3.1) to get the estimated density values
in the original space. The derivatives have been calculated by a numerical
approximation. Concerning the empirical dataset, the estimated density values were
also multiplied by the prior probabilities of the corresponding group, because the
Bayes rule is wanted and not the maximum-likelihood discriminant rule. In the case
of the synthetic data both rules coincide.
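The following sketch illustrates the marginal normalization step in Python (the thesis used S-Plus); it assumes that the kernel estimate of the univariate cdf is the integrated Gaussian kernel, that the "Normal rule" bandwidth is the usual 1.06*sigma*n^(-1/5) rule of thumb (the constant in (2.9) may differ), and that (3.1) is the standard change-of-variables formula; all function names are hypothetical.

    import numpy as np
    from scipy.stats import norm

    def rot_bandwidth(x):
        # univariate normal-reference ("Rule of Thumb") bandwidth
        return 1.06 * x.std(ddof=1) * len(x) ** (-1.0 / 5.0)

    def kernel_cdf(x0, data, h):
        # smooth cdf estimate: average of Gaussian cdfs centred at the observations
        return norm.cdf((np.asarray(x0, dtype=float)[..., None] - data) / h).mean(axis=-1)

    def normalize(x0, data, h):
        # T(x) = Phi^{-1}(F_hat(x)) maps the variable to an approximately N(0,1) scale
        return norm.ppf(np.clip(kernel_cdf(x0, data, h), 1e-12, 1 - 1e-12))

    def back_transformed_density(x0, data, h, g, delta=1e-4):
        # f_hat(x) = g(T(x)) * |T'(x)|, where g is the normal density fitted on the
        # transformed scale and T' is approximated by a central difference
        t = normalize(x0, data, h)
        dT = (normalize(np.asarray(x0) + delta, data, h)
              - normalize(np.asarray(x0) - delta, data, h)) / (2 * delta)
        return g(t) * np.abs(dT)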
Using a variable kernel density estimator (2.8) as a third alternative suffers
from one or even two parameters (Breiman et al., 1977) which have to be selected
by the user, and since the aim of this study is a fully automatic selection
method, I have to discard this idea.
The approach discussed in section 3.3 is the basis for the next set of estimators.
Here, the data was projected onto the subspace without any preceding manipulation.
As I pointed out in section 3.3, a reasonable choice for the sub-dimension lies in
the range from two to five.
A principal component analysis transformed the data, and the estimation itself was
done by a multivariate kernel density estimator. The kernel used was a Gaussian
kernel, since the theory is almost exclusively developed for this choice and
hardly any alternatives occur in the multivariate case. Furthermore, the normal
kernel has unbounded support, which is necessary to avoid comparing two zero
values when making a classification.
The "product kernel" generalization was chosen to construct the multivariate
kernel. This choice is also reasonable, because the theory of the multivariate
"Normal rule" bandwidth-matrix selector is based on this generalization (see
section 2.3.3.1).
The spectrum of different bandwidth selectors in the univariate case is quite
wide; in the multivariate model for discrimination tasks, however, the
possibilities are restricted. As described in the section about bandwidth
selection, the reference to a normal distribution is the easiest way, and (2.16)
provides the optimal bandwidth selection for the class of bandwidth matrices
constructed from a bandwidth vector. A closer look at the formula makes clear that
there is actually only one parameter to estimate, since the differences in the
bandwidths are caused only by the different standard deviations. I am going to use
this as my third non-parametric estimator and it will be denoted by "Normal rule".
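A minimal sketch of this estimator (in Python rather than S-Plus or XploRe) is given below; it uses one standard form of the multivariate normal-reference rule, h_j = sigma_j * (4 / ((d+2) n))^(1/(d+4)), which may differ from (2.16) in the constant, and the function names are assumptions of mine.

    import numpy as np

    def normal_rule_bandwidths(X):
        # normal-reference bandwidth vector: only the standard deviations differ across coordinates
        n, d = X.shape
        return X.std(axis=0, ddof=1) * (4.0 / ((d + 2) * n)) ** (1.0 / (d + 4))

    def product_kernel_density(x0, X, h):
        # f_hat(x0) = (1/n) sum_i prod_j phi((x0_j - X_ij) / h_j) / h_j, evaluated on the log scale
        z = (np.asarray(x0, dtype=float) - X) / h
        d = X.shape[1]
        logk = -0.5 * np.sum(z**2, axis=1) - np.sum(np.log(h)) - 0.5 * d * np.log(2.0 * np.pi)
        return np.exp(logk).mean()

    # usage sketch: X_train is an (n, d) matrix of one class after the projection step
    # h = normal_rule_bandwidths(X_train); f = product_kernel_density(x_new, X_train, h)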
The field of cross-validation methods provides estimators which are already a
little more sophisticated. Taking again a look at (2.19), the least squares
cross-validation in that form requires many more computations. The occurring
integral can in general not be solved analytically. However, in the case of a
Gaussian kernel one can derive an exact formula (Bowman, 1984), and the
generalization to the multivariate setting gives the function
\mathrm{LSCV}(\mathbf{H}) = \frac{1}{n}\,\phi(\mathbf{0},\,2\mathbf{H}) + \frac{1}{n^{2}} \sum_{i=1}^{n} \sum_{j \neq i} \phi(\mathbf{x}_i - \mathbf{x}_j,\, 2\mathbf{H}) - \frac{2}{n(n-1)} \sum_{i=1}^{n} \sum_{j \neq i} \phi(\mathbf{x}_i - \mathbf{x}_j,\, \mathbf{H})

to be minimized with respect to H, where \phi(\cdot,\,\mathbf{H}) denotes the
multivariate normal density with mean zero and covariance matrix H. Note that this
succinct formula requires, for a certain input H, the calculation of n \times (n-1)
multivariate normal density values for each class. Since a numerical optimization
has to be carried out, one has of course to restrict the matrix H to depend on
only one parameter h, and even then it is a really tough procedure to find a value
near the minimizer. Nevertheless, I tried this bandwidth selector on my datasets
by using the model in (2.17). To be more precise, I started at the corresponding
value h of the "Normal rule" estimator above and investigated the range between
1/4 of that value and 1.5 times that value on 100 logarithmically equidistant
points, since I know that the univariate counterpart of the "Normal rule", h_ROT,
oversmooths almost everywhere. This estimator, my fourth, is going to be denoted
as "LSCV". Both estimators are used for projections onto two to five dimensions.
For reasons discussed in section 2.3.3.3 the multivariate extension of the plug-in
estimators will not be used here. Even if the formulas for more than two
dimensions were derived, the calculations would exceed a reasonable limit of
computation time.
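The grid search just described can be sketched as follows (Python; assuming, as an illustration only, that the one-parameter model (2.17) scales the normal-reference bandwidth vector by a common factor c, which is scanned over 100 logarithmically spaced values between 0.25 and 1.5).

    import numpy as np

    def gaussian_diag_density(diff, var):
        # multivariate normal density with diagonal covariance 'var' (a length-d vector),
        # evaluated at the (n, n, d) array of pairwise differences 'diff'
        d = diff.shape[-1]
        q = np.sum(diff**2 / var, axis=-1)
        return np.exp(-0.5 * q) / np.sqrt((2.0 * np.pi) ** d * np.prod(var))

    def lscv(X, h):
        # exact LSCV criterion for the product Gaussian kernel with bandwidth vector h
        n = X.shape[0]
        diff = X[:, None, :] - X[None, :, :]
        term1 = gaussian_diag_density(diff, 2.0 * h**2)   # from the integral of f_hat^2
        term2 = gaussian_diag_density(diff, h**2)         # leave-one-out part
        off = ~np.eye(n, dtype=bool)
        return term1.sum() / n**2 - 2.0 * term2[off].sum() / (n * (n - 1))

    def lscv_grid_search(X, h_normal_rule):
        grid = np.exp(np.linspace(np.log(0.25), np.log(1.5), 100))
        scores = [lscv(X, c * h_normal_rule) for c in grid]
        return grid[int(np.argmin(scores))] * h_normal_rule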
Finally, of course, the LDA and the QDA have been carried out with the original
data itself, the most common practice among statisticians. Both methods were based
on the Bayes rule and not on the maximum-likelihood rule.
4.2.3 The performance measure
At this stage one has to think about a quality measure for the classifications
carried out by the rules above. A commonly used measure is the classical
Error-rate (misclassification rate). This is probably the easiest measure to
interpret. Unfortunately, it makes no difference whether the relation of the
posterior probabilities is 0.49:0.51 or 0.01:0.99; in both cases the second group
is going to be predicted, and this measure is therefore not really robust.
Nevertheless, it is an often used measure (Remme et al., 1980; Van Ness and
Simpson, 1976; Van Ness, 1980), which is (in the case of two groups) given by
\mathrm{ER} = \frac{1}{n} \sum_{i=1}^{n} \lvert \hat{k}_i - k_i \rvert ,

where \hat{k}_i and k_i are the assignment and the real group of the observation
x_i, respectively (here coded as 0 and 1, so that each misclassified observation
contributes one to the sum). Different costs of misclassification are not taken
into account either. A detailed discussion of several performance measures is
given by Hand (1997).
The second "coefficient" is a little bit more robust, because here the posterior
probabilities are not only used for the assignment of an observation, but also for
measuring the amount of uncertainty. Strictly speaking, the posterior probabilities
p(k|x) themselves are included in the formula for the Brier-score (Hand, 1997),

\mathrm{BS} = \frac{2}{n} \sum_{i=1}^{n} \bigl( p(2 \mid \mathbf{x}_i) - c_i \bigr)^{2} ,

where c_i = 0 if x_i comes from group 1 and c_i = 1 otherwise.
Since the training data and the test data are independent, both measures should
not be biased. For reasons of variance reduction, I calculated the results for the
exclusive use of the LDA and the QDA (in the case of the synthetic datasets)
several times, always leaving out 100 observations per group and using them as
test data. Thus, every combination of leaving out a block of 100 points in the
class-1 set and a block of 100 points in the class-2 set served as test dataset
(observations 1-100, 101-200, etc.). This leads to 7 x 7 = 49 measures, which are
averaged for the Error-rate as well as for the Brier-score.
In all other settings I calculated each measure only once because of the huge
computation times, but the Brier-score is a robust measure anyway.
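The two measures themselves are easy to compute; the following sketch (Python, hypothetical function names) assumes the group labels c_i are coded 0 (group 1) and 1 (group 2) and that p2 holds the estimated posterior probabilities p(2|x_i) on the test set.

    import numpy as np

    def error_rate(p2, c):
        # assign each observation to the group with the larger posterior probability
        predicted = (np.asarray(p2) > 0.5).astype(int)
        return float(np.mean(predicted != np.asarray(c)))

    def brier_score(p2, c):
        # BS = (2/n) * sum_i (p(2|x_i) - c_i)^2
        return float(2.0 * np.mean((np.asarray(p2) - np.asarray(c)) ** 2))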
Both criteria declare that estimator "the best" which minimizes the corresponding
values. Therefore the "best" estimator is actually only the best concerning
classification accuracy. Other aspects, like speed, cost of classification or the
amount of prior knowledge about the data, are ignored in this study, but they are
discussed in Hand (1997). At least the speed drawback of the kernel methods is
obvious and does not have to be measured.
4.2.4 The software
During all my work concerning the calculations in this study I used three
different programs, namely Microsoft Excel 97, S-Plus 4.5 and XploRe 4.2a. Excel,
because it is very useful to gain quick insights into some problems; it was a kind
of auxiliary tool during the whole study. In addition, it served for the
calculation of the LSCV parameters through automated procedures controlled by
Visual-Basic macros. I chose S-Plus because I now have much experience with its
functions and features and it is really useful in handling vectors and matrices.
Unfortunately, the calculation accuracy of both programs is restricted, since it
is not possible to calculate the value of a standard normal cdf at the point
x = 8 (it is rounded to one), which occurs during the normalization of the data.
The only workaround is to compute such a cdf for values x = -8 or smaller (only
possible with S-Plus) and to exploit the symmetry of the normal distribution to
by-pass the problem.
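A small illustration of this accuracy problem (in Python rather than S-Plus or Excel, so the exact cut-off point differs): for large arguments, computing 1 - Phi(x) naively loses all precision, while the dedicated tail routine or the symmetry Phi(-x) = 1 - Phi(x) keeps it.

    from scipy.stats import norm

    x = 10.0
    naive = 1.0 - norm.cdf(x)   # 0.0 in double precision: the upper tail is lost
    tail = norm.sf(x)           # about 7.6e-24, the survival function 1 - Phi(x)
    same = norm.cdf(-x)         # the same value, obtained via symmetry
    print(naive, tail, same)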
S-Plus was also a very good means to program short automated functions such as a
test for multivariate normality.
Since there are many functions concerning kernel density estimators in XploRe, I
chose that program to gather impressions of the behaviour of different estimators.
Furthermore, it served to get some experience with a new program, which is,
however, not so different from S-Plus.
I also used SPSS, but only to carry out a Box test for equal covariance matrices.
4.3 Results
The results are quite detailed and include many aspects. There are 21 datasets (20
synthetic and one empirical), fourteen estimators (LDA, QDA, "Normal rule – norm"
(LDA and QDA), "Sheather-Jones – norm" (LDA and QDA), "Normal rule" (dimensions
2-5) and "LSCV" (dimensions 2-5)) and two performance measures (Error-rate and
Brier-score), which amounts to a total of 21 x 14 x 2 = 588 scores to discuss. The
tables containing all those scores are listed in the appendix. In this section
only the most important results are going to be pointed out.
4.3.1 LDA versus QDA
Figure 18: Brier-scores for the NN-distributions (NN11-NN32). Comparison between LDA and QDA.
Figure 19: Brier-scores for the SkN-distributions (SkN11-SkN32). Comparison between LDA and QDA.
Figure 20: Brier-scores for the Bi-distributions (Bi11-Bi42). Comparison between LDA and QDA.
Since the QDA estimates the covariance matrix for each class separately, it should
have an advantage for datasets having different covariance matrices; those are the
datasets having "2" as their last digit (see Figure 18, Figure 19 and Figure 20).
This assumption is confirmed by the results, since only "SkN32" shows a worse
performance compared to the LDA.
One interesting fact also becomes apparent from the graphs above. While the QDA is
always slightly worse in the case of equal correlation matrices and much better
otherwise, the situation changes when the distribution becomes more different from
a multivariate normal, as for the NN-distributions. Here, the QDA is much worse
than the LDA for equal covariance matrices and only slightly better otherwise.
That means the relative performance of the QDA compared to the LDA gets worse when
the assumption of a multivariate normal is not justified. The distributions
"SkN11" and "SkN12" are close to a multivariate normal, because a test of
multivariate normality (Grossmann and Hudec, 1999) is not rejected for either
class in these datasets.
4.3.2 LDA and QDA versus the normalized datasets
Figure 21: Error-rates for the NN-distributions. Comparison within the LDA-method by using normalizations (LDA, Normal rule - norm (LDA), Sheather-Jones - norm (LDA)).
Figure 22: Error-rates for the SkN-distributions. Comparison within the LDA-method by using normalizations.
Figure 23: Error-rates for the Bi-distributions. Comparison within the LDA-method by using normalizations.
A result of my study was the fact that it does not make much difference whether
one looks at the Error-rate results or at the classification performance measured
by the Brier-score (at least not with respect to the ordering when comparing
different methods). As mentioned above, the Brier-score is more robust, but it
does not really have an interpretation. Nevertheless, I would prefer it for
comparing the estimators.
Figure 21, Figure 22 and Figure 23 give a first impression of how to use the
kernel concept in the multivariate setting. In Figure 22 almost every
normalization procedure ("Normal rule – norm" and "Sheather-Jones – norm")
improves the classification. Only in the case of "SkN11" there is no improvement;
however, this is one of the datasets which are "identified" as multivariate
normal. The normalization seems to have only a random impact on the distributions
which are very close to the multivariate normal distribution ("NN11"-"NN22").
This is not surprising, because what should be normalized if the original dataset
itself is normal?
Figure 24: Error-rates for the NN-distributions. Comparison within the QDA-method by using normalizations (QDA, Normal rule - norm (QDA), Sheather-Jones - norm (QDA)).
Figure 25: Error-rates for the SkN-distributions. Comparison within the QDA-method by using normalizations.
Figure 26: Error-rates for the Bi-distributions. Comparison within the QDA-method by using normalizations.
Compared to the performance of the QDA, the normalization is a fundamental
improvement in almost all datasets, but one has to handle the results carefully.
Since the performance of the QDA was terrible on the datasets having equal
covariance matrices, it is not surprising that the normalized datasets separate
better; no one would use the QDA after taking e.g. Figure 20 into consideration.
However, the results for unequal covariance matrices are pretty good as well. In
any case, preceding tests of multinormality or of equal covariance matrices seem
to be a suitable aid for further decisions.
Another important fact also becomes apparent: it makes almost no difference
whether the Normal rule or the Sheather-Jones plug-in is used as bandwidth for the
kernel estimator. In this context the choice of the bandwidth is as unimportant as
the choice of the kernel in kernel density estimation. This is pleasant, because
the Normal rule needs almost no calculations compared to other bandwidth
selectors, and at least the drawback of exhaustive calculations in the
multivariate case seems to disappear.
4.3.3 The multivariate kernel density estimators – differences concerning
dimensions
The results for the multivariate density estimators are of great interest, since
the extent of the "trade-off" between the information loss from deleting some
dimensions and the advantage of overcoming the curse of dimensionality has to be
quantified. The Error-rates and Brier-scores were sorted and the average rank for
each "group" of datasets ("NN", "SkN" and "Bi") has been calculated. The results
are shown in Table 8 and Table 9.
Table 8: Average rank of the kernel estimators in dependency on the
dimension of the subspace concerning the Error-rates. "1" is the best.

Error-rate          Normal rule (2)  Normal rule (3)  Normal rule (4)  Normal rule (5)
NN                  2,67             1,83             2,33             3,17
SkN                 3,08             1,92             1,75             3,25
Bi                  2,25             2,31             2,38             3,06
Empirical dataset   3,50             1,50             3,50             1,50
Total               2,67             2,02             2,24             3,07

                    LSCV (2)         LSCV (3)         LSCV (4)         LSCV (5)
NN                  2,58             1,67             2,42             3,33
SkN                 3,25             1,83             2,08             2,83
Bi                  2,56             2,31             2,25             2,88
Empirical dataset   1,00             2,00             3,00             4,00
Total               2,69             1,98             2,29             3,05
As the reader might have expected from the section about the curse of
dimensionality, the results for a classification based on a kernel density
estimate in five dimensions are the worst. This space is probably too sparsely
populated for a number of n = 600 observations. Again, one can see almost no
difference between the two estimation methods, and the relative differences
between the Error-rate and the Brier-score are only marginal as well.
In my opinion, the main statement is that dimensions of size three and four fit
best and that a reduction to two dimensions is too radical. Of course, this
depends in each particular case on the principal component analysis for the
corresponding dataset.
Table 9: Average place of the kernel estimators in dependency on the
dimension of the subspace concerning the Brier-score. "1" is the best.

Brier-score         Normal rule (2)  Normal rule (3)  Normal rule (4)  Normal rule (5)
NN                  2,50             2,33             1,83             3,33
SkN                 3,33             1,33             2,50             2,83
Bi                  2,75             1,63             2,25             3,38
Empirical dataset   1,00             4,00             3,00             2,00
Total               2,76             1,86             2,24             3,14

                    LSCV (2)         LSCV (3)         LSCV (4)         LSCV (5)
NN                  2,50             2,33             2,00             3,17
SkN                 3,33             1,33             2,50             2,83
Bi                  2,75             1,63             2,13             3,50
Empirical dataset   2,00             4,00             1,00             3,00
Total               2,81             1,86             2,14             3,19
For this reason I am going to show two examples. In Table 10 and Table 11 two
quite different results of a principal component analysis are listed. The results
for the dataset "SkN32" are much worse, and therefore the classifiers based on the
corresponding sub-datasets work better for more dimensions.
Table 10: Principal component analysis for "SkN11".

Component   explained variance (%)   cumulated (%)
1           59,125                   59,125
2           10,116                   69,241
3            8,946                   78,188
4            5,797                   83,985
5            5,021                   89,006
6            3,255                   92,261
7            2,922                   95,183
8            2,783                   97,966
9            1,395                   99,361
10           0,639                  100,000

Table 11: Principal component analysis for "SkN32".

Component   explained variance (%)   cumulated (%)
1           26,936                   26,936
2           19,320                   46,256
3           11,222                   57,478
4            7,852                   65,330
5            7,318                   72,648
6            6,536                   79,183
7            6,313                   85,497
8            6,248                   91,744
9            4,329                   96,073
10           3,927                  100,000
This fact is shown in Figure 27. Things are different for "SkN11": here an
explained variance of about 70% is already reached in two dimensions, which is
quite good given the way the correlation matrices have been constructed.
Figure 27: Dependency of the performance of the multivariate kernel density estimator for two datasets. The Brier-score is plotted against the dimension of the subspace (2 to 5) for Normal rule - SkN11, LSCV - SkN11, Normal rule - SkN32 and LSCV - SkN32.
However, this tendency is not inherent to all datasets, and one has to be very
careful in drawing such conclusions.
4.3.4 LDA and QDA versus the multivariate kernel density estimators
Figure 28: Brier-score of the datasets having equal correlation matrices (NN11, NN21, NN31, SkN11, SkN21, SkN31, Bi11, Bi21, Bi31, Bi41). Comparison between the LDA and the Bayes-rule kernel methods constructed by the LSCV-selector (LSCV (3), LSCV (4)).
The results of the multivariate kernel estimators compared to the classical
methods, LDA and QDA, are quite disappointing, and the euphoria of the simulation
studies in the past (e.g. Remme et al., 1980) is, from this point of view, not
comprehensible.
In Figure 28 the performance on the datasets having equal correlation matrices is
compared. The LDA is the best in all cases, and the kernel concepts are quite bad
for the datasets where the assumption of a multivariate normal distribution has to
be rejected. This is really interesting, since the (normal-distribution-based)
LDA should actually lose its advantage for those datasets.
Figure 29: Brier-score of the datasets having unequal correlation matrices (NN12, NN22, NN32, SkN12, SkN22, SkN32, Bi12, Bi22, Bi32, Bi42). Comparison between the QDA and the Bayes-rule kernel methods constructed by the LSCV-selector (LSCV (3), LSCV (4)).
The kernel results for the NN-datasets and the Bi-datasets with unequal
correlation matrices are also quite bad compared to their parametric counterparts.
Slight improvements are observable only for the SkN-datasets.
4.3.5 Results concerning the insurance data
Taking another look at the univariate histograms (Figure 17), the reader
identifies several skewed densities. In view of the results in sections 4.3.2 and
4.3.4, this fact might favour a kernel setting. Unfortunately, the two classes of
this dataset cannot be separated well, and Figure 30 makes this problem
transparent. To understand this, the reader has to be reminded that the proportion
between the observation numbers of the two groups is about 4:1. Thus, an
Error-rate of 20% can be achieved simply by always predicting the larger group,
without even looking at the dataset. This is what actually happens in most cases.
The problem is illustrated in Figure 31, where in each graph two slightly
different class densities are plotted, however with a proportion of 4:1 in their
integrals.
Figure 30: The Error-rates for the insurance data, shown for all fourteen estimators (LDA, QDA, Normal rule (2)-(5), LSCV (2)-(5), Normal rule - norm (LDA/QDA) and Sheather-Jones - norm (LDA/QDA)).
Figure 31: The problem of classification concerning non-equal class observation numbers. Panel a) shows two slightly different normal densities, N(0,1) and 0.25 N(0.5,1); in panel b) two exponential densities, Exp(1) and 0.25 Exp(2), are plotted. The densities in both panels are rescaled to an area-proportion of 4:1.
In b) one will never predict the smaller group, since its scaled density value is
smaller at every point. The situation in a) is not much better, but at least a few
points of the support are classified to the smaller group.
Returning to Figure 30, the results for the LDA and especially for the QDA are
disastrous, but I do not really have an explanation for this result.
4.4 Computational considerations
A great drawback of the kernel method is that it involves many calculations, even
in simple models and even in the univariate setting. Nowadays high-speed computers
are available, and this fact might encourage the user to program blindly, without
taking any simplification of the algorithms into account. However, the problems
arising at least in the multivariate setting force the programmer to avoid
unnecessary calculations in any case, and this aspect should be taken into
consideration.
Silverman (1986) considers this problem and gives advice on how to save
computation time. A well-known hint is that common factors should be multiplied in
after summing up the different elements and not within the sum; for the kernel
density estimator this is very important. Another suggestion is to use a kernel
which has a simple formula, since the choice of the kernel is not that crucial for
the performance. Thus he definitely recommends not using a normal kernel, but this
point is again controversial with respect to the fact that unbounded support is
needed in discrimination, and there is not really an unbounded alternative which
contains no exponential function call. Additionally, the theory for the normal
kernel is really easy compared to any alternatives.
The calculations for the LSCV smoothing parameter were a very demanding procedure.
For each dataset, each group and each candidate bandwidth matrix H, the half of a
symmetric 600x600 matrix had to be filled twice, where every entry is a density
value of a two- to five-dimensional multivariate normal. The computation time
"explodes" here for more observations, as well as for more elements of H chosen to
be flexible. This was the reason that my matrix H depended on only one parameter h
(see section 4.2.2). Fortunately, the results concerning the classification showed
almost no difference between methods using different bandwidth selectors (see
section 4.3).
Other difficulties occurred in the calculation of an estimator for the univariate
cdf in order to carry out the marginal transformations. Since the quantiles of the
values 0 and 1 of the cdf are "-∞" and "∞", respectively, the method does not work
for some "outliers". Thus, the author recommends choosing larger bandwidths. This
fact again favours the "Normal rule", and one is well advised to ignore the other
bandwidths in this particular context.
For the numerical calculation of the derivatives in the normalization approach, it
is also necessary to tune the value of Δ (see Figure 13) to the accuracy provided
by the software used, because otherwise the approximation is either very crude or
not calculable at all.
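The point can be illustrated by a small sketch (Python; the rule of thumb delta ≈ eps^(1/3) scaled by the magnitude of x is a common numerical-analysis heuristic and an assumption here, not a prescription from the thesis).

    import numpy as np

    def central_difference(f, x, delta=None):
        # derivative f'(x) by a central difference; if no step is given, match it to
        # the floating-point accuracy: delta ~ eps^(1/3) * max(1, |x|)
        if delta is None:
            delta = np.cbrt(np.finfo(float).eps) * max(1.0, abs(x))
        return (f(x + delta) - f(x - delta)) / (2.0 * delta)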
4.5 Discussion
The quite good performance of the kernel estimation concept in discriminant
analysis in the different studies (Remme et al., 1980; Van Ness and Simpson, 1976;
Van Ness, 1980) has to be seen from a different point of view given the results in
section 4.3. All the studies mentioned used the concept on the original dataset
and did not care about variable reduction. However, the problem in application is
the information loss caused by the inevitable projection of the data onto a
subspace. That is a factor which is absolutely not negligible, and ignoring it
makes such research unsuitable for application.
The great dilemma before I carried out this study seemed to be the necessity of a
huge training sample, which, however, makes the application on a conventional
computer impracticable. This impracticability is caused by an enormous expenditure
of memory for calculating e.g. the marginal normalizations of the variables, or by
the calculation time for the multivariate kernel density estimator and its
parameters. Fortunately, the choice of the parameters was not crucial, and I
therefore strongly recommend the "Normal rule" as the only method in application.
Anyway, most of the other alternatives are not applicable for certain combinations
of dimensions and observation numbers.
The great difference between the two manners of applying the kernel method (the
use for marginal transformations and the estimation of a multivariate density) is
maybe the most important result I discovered. Since the non-parametric estimation
of the multivariate class densities performs quite badly on most of the datasets,
probably the only possible advice for the user is to restrict multivariate density
estimation to, say, three or fewer variables and to use the method only if the
loss in explained variance after the dimension reduction is "acceptable". However,
it is difficult to give numbers for what is meant by "acceptable". If the user has
a huge set of observations (here "huge" means large enough with respect to the
number of dimensions, see again e.g. Table 5), the application is restricted to
parameters which are straightforward to calculate (depending e.g. only on the
sample variance or the observation number).
My expectations about the performance of the kernel method were actually much
higher than the achieved results. Seemingly, the problems of the multivariate
generalizations are worse than I thought, but it is difficult to imagine the
expanse of a ten-dimensional space, even though many comparisons were drawn.
On the contrary, the marginal normalizations improved the estimation considerably
in most cases. Normalizations can also be carried out by other means than a
univariate kernel estimate of the distribution function, but a parametric
transformation probably does not fulfil this task as well, and furthermore its
parameters have to be chosen by the user. Nevertheless, the need for computer
memory "explodes" here as well, and in my opinion normalizations of variables in
distributions which are close to a multivariate normal are not really necessary.
A preceding test to check this hypothesis would not be a bad idea.
Finally, the dominance of the LDA in all cases of equal covariance matrices within
the classes, regardless of whether the univariate distributions are skewed,
bimodal or normal, also makes a preceding test of equal covariance matrices
inevitable. Here the LDA probably has no alternatives, and the user saves much
computation time.
Chapter 5: Summary and outlook
The whole genesis of this diploma thesis was almost exclusively a series of
surprises concerning the power of the concept of kernel density estimation.
I started with the aim of exploring the status of the kernel density estimation
method within the huge field of density estimation. My only knowledge was that
this concept is quite easy and nice, and its flexibility encouraged me to expect
really good results, in estimating the density itself as well as in classifying
objects.
The first experience concerned the fact that optimality can be defined with
respect to many different criteria, and that a density is only "estimated well"
with respect to such a criterion. The variety of non-parametric methods, of
parameter selection methods in kernel density estimation and of the many criteria
which can be optimized probably makes the whole research field hard to survey.
The fact that different applications of this concept also need different criteria
to optimize leads to an even higher degree of complexity.
A very interesting fact appears in the formulas where optimal parameters
(bandwidths and bandwidth matrices) are derived: one recognizes the unknown
density in such an expression. However, this seems to be a problem which is not
only inherent to kernel density estimation, but also to many other methods in
statistics. In my opinion, this makes the limitedness of statistics transparent,
because you always know only a bit of the reality by observing a sample, and
although there were many attempts to overcome this drawback in the existing
literature, the alternatives are actually quite poor. Simply assuming that the
underlying density is a normal distribution (as is done by the Normal rule)
cannot be the "end of the story".
There was another problem that appeared very questionable to me. In the case of
univariate kernel density estimation, there are several approximations and
"corrections", where the inexperienced user does not really know what was finally
optimized.
For example, the L1 theory consists of several "upper bounds" for the error
criteria which are optimized. Another problem is, even if one makes the decision
to optimize measures based on the L2-distance, whether to choose the ISE or the
MISE criterion. In many cases the AMISE criterion is optimized, and nobody knows
how close the AMISE is to the MISE or how close the resulting bandwidths are.
Whole bandwidth-selection philosophies (BCV, plug-in) are based on the
minimization of the AMISE.
In the case of the BCV and the LSCV selector, respectively, the occurrence of
several minima is possible, and the discussions about which one to choose are
quite controversial (Wand and Jones, 1995; Sheather, 1992). The reference to a
certain distribution in order to derive optimal parameters, and the justification
of the L2-distance itself, which is used only because of the need for easier
calculations, are other examples.
The fact that discriminant analysis is based on differences in the logarithms of
densities reveals another problem for somebody who has thought that there are no
more difficulties than those mentioned above. In order to get suitable results,
one has to estimate the log-densities, but the kernel density estimation concept
is not well suited for such a change. Moreover, in high dimensions there are only
"tails", and the literature ignores this completely. These facts probably force
the user to distinguish between density estimation for exploratory purposes in one
or at most two dimensions and other tasks like the use in discriminant analysis
(probably restricted to about five dimensions, depending on the sample sizes).
For the first purpose the kernel method nevertheless performs quite well. For the
second one, a good performance for datasets in high dimensions seems almost
impossible (see section 4.3.4). Some theoretical concepts cannot be generalized,
others suffer from an unacceptable computation time.
The good improvements from using the kernel method for preceding marginal
transformations of the variables seem to be a kind of "happy end" of writing this
thesis. Here one can achieve further improvements by using e.g. a variable kernel
density estimator, but this already results in a procedure that is not fully
automatic, since all concepts concerning this topic require (an) additional
parameter(s). The range of densities occurring in applications is wide, and highly
skewed distributions also have to be taken into account; a fully automatic
transformation procedure has to work for every density. The user can also think of
other improvements, like applying the normalization step several times. However,
this method is again evidence of the need to by-pass the problem of estimating the
densities in a multivariate way.
An outlook for further research cannot be given easily. In the case of univariate
density estimation there seems to be a consensus in the sense that it is important
to fit the density by optimizing some qualitative features and not various
integrals or squared integrals between two curves. Marron and Tsybakov (1995)
showed that there are other aspects of fitting a density in order to get an
intuitively good result. They pointed out that one should emphasize the features
"the eye sees" and not those found by optimizing Lp-distances. Probably the
mathematicians have to be more creative. After that, a consensus about the
question "When is a density estimated well?" has to be found.
The future of multivariate kernel density estimation as an alternative to other
classification algorithms looks somewhat hopeless. This does not concern the
problem of generalizing existing univariate parameter selection methods, since
different selectors do not have a crucial impact on the classification rates. The
main problem is the often discussed "trade-off" between the curse of
dimensionality and the explanation loss caused by reducing dimensions. Only
computers which are much faster than the usual personal computers can help,
provided the observation numbers "fit" the number of variables. One definitely has
to draw a border for observation numbers or variables, with the comment: "Below
those numbers a use of this method is possible and reasonable, above them it is
not!"
References
Bomze, I. M. (1998). Mathematik 3+4 für Statistiker. Lecture notes. Institute for
Statistics and Decision Support Systems, Vienna University.
Bowman, A. W. (1984). An alternative method of cross-validation for the
smoothing of density estimates. Biometrika 71, 353-60.
Bowman, A. W. and Azzalini, A. (1997). Applied Smoothing Techniques for Data
Analysis. Oxford Science Publications.
Breiman, L., Meisel, W. and Purcell, E. (1977). Variable kernel estimates of
multivariate densities. Technometrics 19, 135-44.
Cao, R., Cuevas, A. and Gonzalez-Manteiga, W. (1994). A comparative study of
several smoothing methods in density estimation. Comp. Stat. Data Anal. 17,
153-76.
Devroye, L. and Györfi, L. (1985). Nonparametric Density Estimation: The L1
View. Wiley, New York.
Fukunaga, K. (1972). Introduction to statistical pattern recognition. Academic
Press, New York.
Grossmann, W. and Hudec, M. (1999). Multivariate statistische Verfahren.
Lecture notes. Institute for Statistics and Decision Support Systems, Vienna
University.
Habbema, J. D. F., Hermans, J. and Remme, J. (1978). Variable kernel density
estimation in discriminant analysis. Compstat 1978, Proceedings in
Computational Statistics. Physica Verlag, Vienna
Hall, P. and Wand, M. P. (1988). On nonparametric discrimination using density
differences. Biometrika 75, 541-7.
Hand, D. J. (1982). Kernel Discriminant Analysis. Research Studies Press,
Chichester.
Hand, D. J. (1997). Construction and Assessment of Classification Rules. John
Wiley & Sons, Chichester.
Härdle, W. (1991). Smoothing Techniques with Implementation in S. Springer-
Verlag, New York.
Härdle, W., Klinke, S. and Müller, M. (2000). XploRe - Learning Guide.
Springer-Verlag, Berlin.
Jones, M. C. (1991). The roles of ISE and MISE in density estimation. Statist.
Probab. Lett. 12, 51-56.
Jones, M. C., Marron, J. S. and Park, B. U. (1991). A simple root-n bandwidth
selector. Ann. Statist. 19, 1919-32.
Jones, M. C., Marron, J. S. and Sheather, S. J. (1996). A brief survey of
bandwidth selection for density estimation. Journal of the American
Statistical Association 91, 401-7.
Marron, J. S. (1993). Discussion of ‘Practical performance of several data-driven
bandwidth selectors’ by Park and Turlach. Comput. Statist. 8, 17-9.
Marron, J. S. and Tsybakov, A. B. (1995). Visual Error criteria for qualitative
smoothing. J. Amer. Statist. Assoc. 90, 499-507.
Marron, J. S. and Wand, M. P. (1992). Exact mean integrated squared error. Ann.
Statist. 20, 712-36.
Park, B. U. and Marron, J. S. (1990). Comparison of data-driven bandwidth
selectors. J. Amer. Statist. Assoc. 85, 66-72.
Park, B. U. and Turlach, B. (1992). Practical performance of several data driven
bandwidth selectors (with discussion). Comput. Statist. 7, 251-85.
Ripley, B. D. (1996). Pattern Recognition and Neural Networks, Cambridge
University Press.
Remme, J., Habbema, J. D. F. and Hermans, J. (1980). A simulative comparison
of linear, quadratic and kernel discrimination. J. Statist. Comput. Simul. 11,
87-106.
Rosenblatt, M. (1956). Remarks on some nonparametric estimates of a density
function. Ann. Math. Statist. 27, 832-7.
Ruppert, D. and Cline, D. B. H. (1994). Transformation kernel density estimation
– bias reduction by empirical transformations. Ann. Statist. 22, 185-210.
Scott, D.W. (1992). Multivariate Density Estimation: Theory, Practice and
Visualization. Wiley, New York.
Sheather, S. J. (1992). The performance of six popular bandwidth selection
methods on some real data sets (with discussion). Comput. Statist. 7, 225-50,
271-81.
Sheather, S. J. and Jones, M. C. (1991). A reliable data-based bandwidth-selection
method for kernel density estimation. J. Royal Statist. Soc. Ser. B 53, 683-
90.
Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis.
Chapman and Hall, London.
Van Ness, J. (1980). On the dominance of non-parametric bayes rule discriminant
algorithms in high dimensions. Pattern Recognition 12, 355-68.
Van Ness, J. W. and Simpson, C. (1976). On the effects of dimension in
discriminant analysis. Technometrics 18, 175-87.
Wand, M. P. and Jones, M. C. (1994). Multivariate plug-in bandwidth-selection.
Comput. Statist. 9, 97-117.
Wand, M. P. and Jones, M. C. (1995). Kernel Smoothing. Chapman and Hall,
London.
Appendix
About the used literature
The interested reader is recommended to read the literature listed above. I just
want to give a few comments on the sources used, underlining why I used them.
Books
The book by Silverman (1986) is an excellent overview of non-parametric density
estimation with great emphasis on the kernel setting and its extensions. It is
probably the first major collection of non-parametric density estimation ideas.
A comparable book, which however concentrates only on the kernel setting, is the
one by Wand and Jones (1995). They provide many graphs, which give great insight
into many methods and problems in this setting. A collection of bandwidth
selection ideas is included as well. Both books base their optimality results on
minimizing L2-measures.
A quite technical book, which keeps an eye on the L1-optimal results, is the book
by Devroye and Györfi (1985). Many proofs are included, but it is maybe too
technical for a newcomer to this field.
While the book by Härdle (1991) provides many density estimation algorithms for
the S-Plus software package, Härdle et al. (2000) is a manual for the software
package XploRe, which contains many modern univariate bandwidth selectors.
Bowman and Azzalini (1997) also treat smoothing methods, but do not really focus
on the estimation of densities.
Multivariate density estimation, which builds the bridge to discriminant analysis,
is maybe discussed in most detail in Scott (1992). Here also practical hints about
the necessary task of dimension reduction are given. However, the comments about
the kernel setting in the discrimination context are quite rare. Nevertheless,
some visualization techniques for densities in more than two dimensions are
introduced.
The combination of kernel density estimation and classification is covered by
Hand (1982), even though practical advice for handling high-dimensional datasets
is not given; in return, he discusses categorical data. Ripley (1996) and Hand
(1997) give evidence that the kernel approach is nowadays maybe not the most
popular one for classification tasks, since only a few pages in each of their
great overviews of pattern recognition methods are devoted to kernel density
estimation. While the former criticises the density estimation literature for the
fact that non-parametric density estimation is usually not directed towards good
discrimination results, the latter gives advice concerning the optimality of
several classification rules. It also points out connections between different
classification methods.
Papers
Starting with the first description of the kernel density estimation concept by
Rosenblatt (1956), many suggestions about how to select the bandwidth
appropriately (the most important decision to be made) have been published. The
principle of cross-validation was soon applied in the form of likelihood
cross-validation, which seems to be a natural development, and of least-squares
cross-validation (Bowman, 1984), both based on the leave-one-out principle. LSCV
was for a long time the most popular bandwidth selector, but at the beginning of
the 1990s a great impetus was given by the plug-in methods. A first version (Park
and Marron, 1990) was slightly improved by a second (Sheather and Jones, 1991).
Convergence rates were also taken into account, and the fastest possible AMISE
convergence rate was achieved (see e.g. the estimator of Jones et al., 1991).
Since it was unknown whether better convergence rates translate into better
estimates for reasonable sample sizes, numerous comparative studies were carried
out on synthetic data (e.g. Cao et al., 1994; Jones et al., 1996; Park and
Turlach, 1992; Marron, 1993) or on real-life datasets (Sheather, 1992).
Extended concepts like the variable kernel estimator (Breiman et al., 1977) as
well as possible univariate transformations were also considered (see e.g. the
nonparametric method of Ruppert and Cline, 1994).
The important question of which error criterion is the best to optimize was
treated by Jones (1991), Marron and Wand (1992) and Marron and Tsybakov (1995).
Multivariate density estimation is actually a quite new research area and many
questions are yet to be resolved. Remarkable in this respect is the multivariate
generalization of the Sheather-Jones plug-in (Wand and Jones, 1994), even though
it is very restricted and not really viable for discrimination purposes.
The only studies I found where the kernel setting serves as an alternative to the
classical parametric methods go back to around 1980. Habbema et al. (1978) give
advice about choosing the parameters of a variable kernel setting to obtain
optimal classification results, which formed the basis for the attempts in Remme
et al. (1980). Both parameter selection methods refer to likelihood
cross-validation and are very simple.
Van Ness and Simpson (1976) and Van Ness (1980) investigated the influence of high
dimensions, but used very small sample sizes. Finally, a more recent idea (Hall
and Wand, 1988) seems to concentrate at least partially on a discrimination goal
by estimating differences of densities.
Notation and common abbreviations
f(.) the underlying density
h the bandwidth (smoothing parameter)
σ(.) the standard deviation
H the bandwidth-matrix
K the kernel function in the univariate case
K, κ the multivariate and the univariate kernel in the multivariate setting
f̂_h(.) the density estimate based on a certain bandwidth h
|A|, |b| the determinant of the matrix A (also det(A)) or the absolute value of b
R(g) \int g(x)^2 \, dx
p(k|x) posterior probability
p̂(k) prior probability
f(x|k) conditional density given class k
ASH average shifted histogram
AMIAE asymptotic mean integrated absolute error
AMISE asymptotic mean integrated squared error
AMSE asymptotic mean squared error
ANOVA analysis of variance
BCV biased cross-validation
cdf cumulative distribution function
IAE integrated absolute error
ISE integrated squared error
LDA linear discriminant analysis
LSCV least square cross-validation
MIAE mean integrated absolute error
MISE (EISE) mean (expected) integrated squared error
QDA quadratic discriminant analysis
Tables
Table 12: Results of the principal component analysis. The percentage of the
explained variance is shown for all datasets.
Dataset / Dimension   1      2      3      4      5      6      7      8      9      10
NN11             53,04  62,02  70,18  77,15  83,54  89,54  94,64  97,51  99,27  100
NN12             29,23  50,36  59,99  67,59  74,49  81,07  87,12  92,59  96,49  100
NN21             33,94  45,25  54,06  62,50  70,57  78,18  84,94  91,19  96,26  100
NN22             29,53  54,86  63,35  70,68  77,69  83,80  89,34  93,56  97,22  100
NN31             35,78  45,85  54,37  62,77  71,10  79,03  86,40  92,58  97,74  100
NN32             40,14  57,37  65,59  72,29  78,26  83,93  88,84  93,44  97,57  100
SkN11            59,13  69,24  78,19  83,99  89,01  92,26  95,18  97,97  99,36  100
SkN12            36,45  54,05  63,34  71,43  78,15  84,55  90,40  94,62  97,77  100
SkN21            58,85  68,48  75,55  81,77  86,31  89,70  92,99  95,79  98,21  100
SkN22            42,92  60,80  68,85  74,64  80,21  85,55  89,92  93,93  97,68  100
SkN31            63,88  73,04  79,72  85,61  90,56  94,04  97,20  99,06  99,68  100
SkN32            26,94  46,26  57,48  65,33  72,65  79,18  85,50  91,74  96,07  100
Bi11             59,62  66,76  72,25  77,50  82,57  87,31  91,59  95,12  98,02  100
Bi12             35,41  60,23  69,36  76,38  81,85  87,14  91,33  94,53  97,33  100
Bi21             38,96  50,07  58,46  66,13  73,58  80,42  86,79  92,63  98,14  100
Bi22             47,71  73,17  80,27  86,15  91,26  93,97  96,37  97,90  99,11  100
Bi31             53,25  64,11  71,48  78,15  84,19  89,73  93,92  96,95  99,24  100
Bi32             39,86  67,23  74,22  80,17  85,65  89,88  93,86  96,59  98,62  100
Bi41             70,03  77,98  83,23  87,35  91,32  94,66  97,05  98,68  99,58  100
Bi42             47,67  69,50  75,62  81,27  86,08  90,37  93,66  96,91  99,20  100
Insurance Data   33,38  51,54  61,69  71,38  79,17  86,12  91,73  96,18  98,99  100
Table 13: Classification results for the LDA.
Dataset          Error-rate  std. err.  Brier-score  std. err.
NN11             0,238       0,018      0,317        0,018
NN12             0,191       0,024      0,273        0,024
NN21             0,235       0,020      0,321        0,020
NN22             0,231       0,040      0,331        0,033
NN31             0,280       0,032      0,377        0,027
NN32             0,301       0,028      0,388        0,030
SkN11            0,191       0,032      0,264        0,027
SkN12            0,222       0,024      0,303        0,028
SkN21            0,188       0,023      0,253        0,019
SkN22            0,136       0,022      0,199        0,018
SkN31            0,163       0,023      0,229        0,025
SkN32            0,164       0,022      0,222        0,023
Bi11             0,195       0,029      0,275        0,031
Bi12             0,102       0,027      0,157        0,028
Bi21             0,183       0,019      0,259        0,021
Bi22             0,080       0,016      0,115        0,018
Bi31             0,173       0,019      0,245        0,026
Bi32             0,119       0,021      0,183        0,030
Bi41             0,162       0,022      0,238        0,029
Bi42             0,115       0,018      0,159        0,021
Insurance Data   0,237       0,022      0,358        0,030
Table 14: Classification results for the QDA.
Dataset          Error-rate  std. err.  Brier-score  std. err.
NN11             0,249       0,020      0,330        0,019
NN12             0,104       0,025      0,146        0,030
NN21             0,247       0,018      0,329        0,018
NN22             0,094       0,017      0,134        0,020
NN31             0,300       0,028      0,394        0,026
NN32             0,104       0,018      0,147        0,026
SkN11            0,222       0,036      0,320        0,041
SkN12            0,199       0,020      0,288        0,031
SkN21            0,269       0,028      0,412        0,045
SkN22            0,124       0,025      0,191        0,036
SkN31            0,317       0,025      0,484        0,030
SkN32            0,183       0,026      0,276        0,044
Bi11             0,281       0,027      0,403        0,037
Bi12             0,092       0,023      0,134        0,029
Bi21             0,257       0,024      0,382        0,030
Bi22             0,052       0,013      0,081        0,021
Bi31             0,359       0,020      0,555        0,032
Bi32             0,056       0,017      0,085        0,023
Bi41             0,359       0,016      0,561        0,014
Bi42             0,110       0,017      0,156        0,024
Insurance Data   0,313       0,064      0,496        0,099
Quadratic discriminant analysis (QDA)
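
The two quality measures reported in Tables 13 to 20, the error rate and the Brier score, can be computed from estimated posterior probabilities as sketched below. The sketch is illustrative only: the convention of summing the squared differences over all classes before averaging is an assumption, and the toy posteriors are made up.

    import numpy as np

    def error_rate(posterior, y):
        """Fraction of misclassified observations under the Bayes rule
        (assign each observation to the class with highest posterior)."""
        return np.mean(np.argmax(posterior, axis=1) != y)

    def brier_score(posterior, y):
        """Mean squared difference between the posterior probabilities and
        the 0/1 class indicators, summed over classes (one common convention)."""
        n, K = posterior.shape
        indicator = np.zeros((n, K))
        indicator[np.arange(n), y] = 1.0
        return np.mean(np.sum((posterior - indicator) ** 2, axis=1))

    # toy usage with made-up posteriors for a two-class problem
    posterior = np.array([[0.8, 0.2], [0.3, 0.7], [0.6, 0.4]])
    y = np.array([0, 1, 1])
    print(error_rate(posterior, y), brier_score(posterior, y))
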
Table 15: Classification results for the estimator "Normal rule" in two and
three dimensions.
                 Normal rule (2)           Normal rule (3)
Dataset          Error-rate  Brier-score   Error-rate  Brier-score
NN11             0,270       0,383         0,285       0,393
NN12             0,175       0,258         0,125       0,212
NN21             0,280       0,364         0,275       0,360
NN22             0,200       0,250         0,145       0,231
NN31             0,330       0,445         0,350       0,441
NN32             0,190       0,293         0,185       0,300
SkN11            0,190       0,275         0,180       0,267
SkN12            0,215       0,303         0,200       0,270
SkN21            0,245       0,346         0,230       0,322
SkN22            0,170       0,227         0,115       0,168
SkN31            0,255       0,367         0,210       0,285
SkN32            0,145       0,218         0,160       0,198
Bi11             0,380       0,477         0,310       0,395
Bi12             0,130       0,171         0,120       0,148
Bi21             0,385       0,462         0,365       0,453
Bi22             0,125       0,151         0,215       0,257
Bi31             0,220       0,322         0,240       0,347
Bi32             0,140       0,212         0,160       0,204
Bi41             0,360       0,466         0,420       0,457
Bi42             0,165       0,260         0,135       0,192
Insurance Data   0,200       0,283         0,195       0,318
Table 16: Classification results for the estimator "Normal rule" in four and
five dimensions.
                 Normal rule (4)           Normal rule (5)
Dataset          Error-rate  Brier-score   Error-rate  Brier-score
NN11             0,285       0,387         0,320       0,395
NN12             0,125       0,211         0,145       0,216
NN21             0,310       0,380         0,270       0,374
NN22             0,150       0,215         0,155       0,216
NN31             0,320       0,423         0,375       0,463
NN32             0,215       0,298         0,270       0,336
SkN11            0,190       0,296         0,195       0,290
SkN12            0,195       0,273         0,215       0,282
SkN21            0,330       0,411         0,335       0,426
SkN22            0,140       0,180         0,165       0,209
SkN31            0,195       0,292         0,210       0,307
SkN32            0,140       0,184         0,145       0,182
Bi11             0,305       0,400         0,305       0,398
Bi12             0,105       0,150         0,110       0,159
Bi21             0,245       0,353         0,235       0,370
Bi22             0,270       0,368         0,350       0,487
Bi31             0,340       0,409         0,395       0,541
Bi32             0,160       0,190         0,195       0,268
Bi41             0,485       0,605         0,495       0,760
Bi42             0,195       0,237         0,280       0,382
Insurance Data   0,200       0,313         0,195       0,303
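
The label "Normal rule (d)" refers to kernel density estimates in d dimensions whose bandwidths follow the normal reference rule. A minimal sketch of the textbook multivariate version for a Gaussian product kernel is given below; whether the thesis uses exactly this constant is an assumption.

    import numpy as np

    def normal_rule_bandwidths(X):
        """Per-dimension bandwidths for a Gaussian product kernel according to
        the multivariate normal reference rule
            h_j = (4 / (d + 2))**(1 / (d + 4)) * n**(-1 / (d + 4)) * sigma_j."""
        n, d = X.shape
        sigma = X.std(axis=0, ddof=1)   # per-variable sample standard deviations
        return (4.0 / (d + 2)) ** (1.0 / (d + 4)) * n ** (-1.0 / (d + 4)) * sigma

    # example: bandwidths for 100 observations in three dimensions
    X = np.random.standard_normal((100, 3))
    print(normal_rule_bandwidths(X))
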
Table 17: Classification results for the estimator "LSCV" in two and three
dimensions.
                 LSCV (2)                  LSCV (3)
Dataset          Error-rate  Brier-score   Error-rate  Brier-score
NN11             0,280       0,387         0,280       0,396
NN12             0,190       0,252         0,120       0,211
NN21             0,285       0,366         0,270       0,360
NN22             0,210       0,251         0,155       0,231
NN31             0,330       0,446         0,345       0,437
NN32             0,170       0,284         0,195       0,294
SkN11            0,190       0,276         0,180       0,268
SkN12            0,215       0,296         0,200       0,262
SkN21            0,260       0,347         0,230       0,315
SkN22            0,185       0,227         0,105       0,163
SkN31            0,260       0,366         0,215       0,284
SkN32            0,140       0,214         0,140       0,195
Bi11             0,370       0,477         0,325       0,396
Bi12             0,130       0,175         0,120       0,145
Bi21             0,390       0,462         0,350       0,449
Bi22             0,105       0,141         0,125       0,178
Bi31             0,220       0,325         0,265       0,337
Bi32             0,160       0,216         0,160       0,208
Bi41             0,360       0,448         0,365       0,421
Bi42             0,140       0,211         0,085       0,126
Insurance Data   0,185       0,285         0,195       0,319
Table 18: Classification results for the estimator "LSCV" in four and five
dimensions.
                 LSCV (4)                  LSCV (5)
Dataset          Error-rate  Brier-score   Error-rate  Brier-score
NN11             0,300       0,388         0,340       0,401
NN12             0,125       0,210         0,145       0,216
NN21             0,310       0,383         0,275       0,375
NN22             0,155       0,214         0,160       0,214
NN31             0,310       0,423         0,375       0,462
NN32             0,225       0,291         0,265       0,334
SkN11            0,185       0,299         0,195       0,288
SkN12            0,200       0,269         0,240       0,282
SkN21            0,290       0,385         0,335       0,409
SkN22            0,140       0,176         0,135       0,192
SkN31            0,210       0,288         0,200       0,295
SkN32            0,125       0,180         0,130       0,168
Bi11             0,320       0,406         0,315       0,408
Bi12             0,115       0,148         0,110       0,159
Bi21             0,225       0,336         0,220       0,353
Bi22             0,215       0,282         0,270       0,372
Bi31             0,325       0,388         0,350       0,491
Bi32             0,135       0,160         0,175       0,229
Bi41             0,425       0,576         0,470       0,740
Bi42             0,100       0,137         0,185       0,238
Insurance Data   0,200       0,277         0,205       0,296
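
The tables labelled "LSCV" use bandwidths chosen by least squares cross-validation. For a one-dimensional Gaussian-kernel estimate the criterion can be evaluated exactly and minimised over a grid, as in the sketch below; applying it coordinate-wise for the higher-dimensional product kernels is an assumption about the implementation.

    import numpy as np

    def lscv_score(x, h):
        """Least squares cross-validation criterion for a 1-D Gaussian-kernel
        density estimate with bandwidth h (smaller is better)."""
        x = np.asarray(x, dtype=float)
        n = len(x)
        d = x[:, None] - x[None, :]                    # pairwise differences
        # exact integral of fhat^2 (Gaussian kernel convolved with itself)
        int_f2 = np.exp(-d**2 / (4 * h**2)).sum() / (n**2 * h * 2 * np.sqrt(np.pi))
        # leave-one-out density estimates at the observed points
        k = np.exp(-d**2 / (2 * h**2)) / np.sqrt(2 * np.pi)
        loo = (k.sum(axis=1) - k.diagonal()) / ((n - 1) * h)
        return int_f2 - 2 * loo.mean()

    # pick the bandwidth that minimises the criterion over a grid
    x = np.random.standard_normal(200)
    grid = np.linspace(0.05, 1.0, 60)
    h_lscv = grid[np.argmin([lscv_score(x, h) for h in grid])]
    print(h_lscv)
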
Table 19: Estimator "Normal rule - normalized". The normalization is done
by the non-parametric kernel estimate.
                 LDA normalized            QDA normalized
Dataset          Error-rate  Brier-score   Error-rate  Brier-score
NN11             0,225       0,321         0,240       0,330
NN12             0,200       0,273         0,050       0,101
NN21             0,235       0,295         0,220       0,294
NN22             0,210       0,308         0,065       0,094
NN31             0,245       0,340         0,260       0,358
NN32             0,280       0,361         0,110       0,141
SkN11            0,205       0,276         0,215       0,287
SkN12            0,180       0,254         0,120       0,179
SkN21            0,180       0,269         0,190       0,276
SkN22            0,100       0,145         0,085       0,127
SkN31            0,150       0,242         0,150       0,248
SkN32            0,110       0,153         0,040       0,063
Bi11             0,190       0,270         0,205       0,297
Bi12             0,100       0,144         0,060       0,095
Bi21             0,140       0,216         0,165       0,241
Bi22             0,090       0,117         0,055       0,074
Bi31             0,190       0,290         0,210       0,314
Bi32             0,105       0,179         0,030       0,042
Bi41             0,150       0,213         0,205       0,310
Bi42             0,075       0,114         0,045       0,060
Insurance Data   0,195       0,289         0,210       0,287
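
Tables 19 and 20 concern the "normalized" variants, where the data are first transformed with a non-parametric kernel estimate and then classified by LDA or QDA. One plausible reading, stated here only as an assumption and not necessarily the exact procedure of the thesis, is a coordinate-wise transformation z = Φ⁻¹(F̂(x)), with F̂ a kernel estimate of the marginal distribution function:

    import numpy as np
    from scipy.stats import norm

    def normalize_margins(X, bandwidths):
        """Transform each column towards normality via its kernel-smoothed
        distribution function: z = Phi^{-1}(Fhat(x)). Purely illustrative."""
        n, d = X.shape
        Z = np.empty_like(X, dtype=float)
        for j in range(d):
            # kernel cdf estimate of column j, evaluated at the data points
            u = (X[:, j][:, None] - X[:, j][None, :]) / bandwidths[j]
            F = norm.cdf(u).mean(axis=1)
            F = np.clip(F, 1e-6, 1 - 1e-6)   # keep Phi^{-1} finite
            Z[:, j] = norm.ppf(F)
        return Z
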
Table 20: Estimator "Sheather-Jones - normalized". The normalization is
done by the non-parametric kernel estimate.
                 LDA normalized            QDA normalized
Dataset          Error-rate  Brier-score   Error-rate  Brier-score
NN11             0,225       0,323         0,240       0,332
NN12             0,200       0,275         0,055       0,101
NN21             0,240       0,293         0,220       0,292
NN22             0,215       0,312         0,065       0,094
NN31             0,225       0,335         0,245       0,352
NN32             0,280       0,365         0,110       0,141
SkN11            0,195       0,267         0,195       0,279
SkN12            0,175       0,254         0,120       0,183
SkN21            0,180       0,271         0,195       0,281
SkN22            0,095       0,142         0,085       0,129
SkN31            0,135       0,238         0,145       0,247
SkN32            0,105       0,153         0,030       0,058
Bi11             0,200       0,269         0,200       0,298
Bi12             0,100       0,142         0,050       0,094
Bi21             0,150       0,221         0,165       0,251
Bi22             0,090       0,116         0,055       0,076
Bi31             0,200       0,287         0,210       0,316
Bi32             0,100       0,180         0,025       0,039
Bi41             0,150       0,212         0,205       0,314
Bi42             0,080       0,118         0,050       0,058
Insurance Data   0,195       0,281         0,175       0,273