Local Regression
-
Advanced data analysis
M. Gerolimetto, Dip. di Statistica
Università Ca' Foscari Venezia
www.dst.unive.it/margherita
1
-
PART 4: LOCAL REGRESSION
2
-
Definition
Local regression is an approach to fitting curves
and surfaces to data by smoothing. It is called
LOCAL since the fit at a generic point x0 is the
value of a parametric function fitted only to those
observations that are close to x0.
In this sense it can be thought of as a natural
extension of parametric fitting. So far we have
considered models like
$$y_i = \alpha + \beta x_i + \varepsilon_i, \qquad i = 1, \dots, N$$
that can be seen as
$$y_i = m(x_i) + \varepsilon_i, \qquad i = 1, \dots, N$$
where m is linear.
When we assume that m(x) is an element of a
specific parametric class of functions (for example
linear) we are forcing the relationship to have a
certain shape.
3
-
However, it is possible that these models cannot be
applied because of nonlinearity (especially of
unknown form) in the data.
In this sense nonparametric modelling is a good
response, because it is like placing a flexible
curve on the (x, y) scatterplot with no parametric
restrictions on the form of the curve.
Moreover, nonparametric methods can help to see in
the scatterplot the underlying structures of the
data (smoothing).
4
-
Parametric localization
The underlying model for local regression is:
$$y_i = m(x_i) + u_i, \qquad i = 1, \dots, N$$
The distributions of the yi are unknown.
The means m(xi) are unknown.
In practice we must model the data, which means
making certain assumptions on m and other aspects
of the distribution of the yi.
One common assumption is that the yi are
homoskedastic.
As for m, it is supposed that the function can be
locally approximated by a member of a parametric
class, usually chosen to be a polynomial of a
certain degree.
This is the parametric localization: in carrying
out the local regression we use a parametric family
as in global parametric fitting, but we ask only
that the family fit locally and not globally.
5
-
Suppose x0 is a generic point in the support of the
x variable. Suppose we do not know the function
m(x), but we can assume it is differentiable.
To estimate m(x) at x0, we can think of using the
Taylor expansion
$$m(x) = m(x_0) + m'(x_0)(x - x_0) + r$$
where r is a quantity of order smaller than (x − x0).
Any function (under certain regularity conditions)
can be locally approximated by a line.
It is possible to estimate m(x) in a neighborhood of
x0 by minimizing the weighted sum of squared errors
over the pairs (xi, yi), i = 1, ..., N:
$$\min_{\alpha, \beta} \sum_{i=1}^{N} \left\{ y_i - \alpha - \beta (x_i - x_0) \right\}^2 w_i$$
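As an illustration (not part of the original slides), here is a minimal Python sketch of this weighted least squares step at a single point x0. The function name is hypothetical, and the weights wi are taken as given; their choice is discussed below.

```python
import numpy as np

def local_fit(x0, x, y, w):
    """Fit y ~ alpha + beta * (x - x0) by weighted least squares;
    the local estimate of m(x0) is the fitted intercept alpha.
    (Illustrative helper, not from the slides.)"""
    X = np.column_stack([np.ones_like(x), x - x0])
    W = np.diag(w)
    # Solve the normal equations (X' W X) b = X' W y
    alpha, beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return alpha  # fitted value at x0, since x - x0 = 0 there
```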
6
-
The weights wi in the previous formula are often
chosen so that they are bigger when (xi − x0) is
smaller. This means that the closer xi is to the
point x0, the bigger the weight.
This minimization can be thought of from a local
point of view around x0: it is a weighted least
squares problem.
BIG ISSUES:
1. How can the weights be chosen?
2. How large should the neighborhood be?
7
-
The estimate of m that comes from the above
definition is obtained with the following steps:
1. for each fitting point x0, define a neighborhood
based on some metric in the space of the x variable;
2. within this neighborhood, assume that m is
approximated by some member of the chosen parametric
family;
3. estimate the parameters from the observations in
the neighborhood; the local fit at x0 is the fitted
function evaluated at x0.
Very often a weight function w(u) is incorporated
that gives greater weight to the xis that are closer
to x0 and smaller weight to the xis that are further
from x0.
The estimation method used depends on the
assumptions on the yis. If the yis are assumed to
be Gaussian with constant variance, then it makes
sense to base estimation on least squares.
8
-
Once wi and h have been chosen, one is not
interested in calculating the estimate of m at a
single point x0 only, but typically on a set of
values (usually uniformly spaced along the interval
between x1 and xN).
Practically, one creates a grid between x1 and xN
consisting of M points (uniformly spaced) and then
computes the minimization over all points of the
grid.
This corresponds to solving M locally weighted
least squares problems, one for each of the M points
of the grid, which in turn become the centers of the
neighborhoods.
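Continuing the earlier sketch, a minimal Python illustration of this grid evaluation; the Gaussian weights and fixed bandwidth h are assumed choices for the example:

```python
import numpy as np

def local_regression_grid(x, y, h, n_grid=100):
    """Evaluate the local linear fit on a uniform grid: one weighted
    least squares problem per grid point (illustrative sketch)."""
    grid = np.linspace(x.min(), x.max(), n_grid)
    fits = np.empty(n_grid)
    for j, x0 in enumerate(grid):
        w = np.exp(-0.5 * ((x - x0) / h) ** 2)         # Gaussian kernel weights
        X = np.column_stack([np.ones_like(x), x - x0])
        sw = np.sqrt(w)
        # Weighted least squares via the rescaled ordinary problem
        beta, *_ = np.linalg.lstsq(sw[:, None] * X, sw * y, rcond=None)
        fits[j] = beta[0]                              # local fit at x0
    return grid, fits
```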
9
-
Modeling the data
When using local regression the following choices
have to be made:
1. assumptions about the behaviour of m:
   - weight function
   - bandwidth
   - parametric family
2. assumptions about the yis:
   - fitting criterion
Unlike in parametric fitting, we do not rely on a
priori knowledge.
To make the choices listed above we use either
i) graphical analysis of the data or ii) some
automatic method to carry out model selection.
10
-
Trade-off... again!
Modeling m nonparametrically requires a trade-off
between bias and variance, starting from the choice
of the bandwidth (but not only!).
In some applications there is a strong preference
toward rough estimates (smaller bias); in others
there is a preference toward smoother estimates
(bigger bias).
Using model selection criteria, like
cross-validation, has the advantage of an automatic
choice (less subjectivity), but at the same time the
disadvantage of possibly giving a poor answer in a
particular application.
Graphical criteria have the advantage of great
power, but the disadvantage of being labor-intensive.
They are good for picking a small number of
parameters, but in the case of adaptive fitting the
process becomes extremely long.
11
-
Selecting the weight function
Supposing that m is continuous, we will use weight
functions that are peaked around 0 and decay
smoothly as the distance from x0 (let us call the
distance u) increases.
A smooth weight function results in a smoother
estimate than, for example, a rectangular weight
function.
A natural choice is to use Gaussian kernels. Tricube
kernels are also often used, because of the
computational speed of a weight function that beyond
a certain point (but smoothly) gives exactly zero
weight, compared to one that only approaches zero as
u gets larger:
$$w(u) = \begin{cases} (1 - |u|^3)^3 & |u| < 1 \\ 0 & |u| \ge 1 \end{cases}$$
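A one-line Python version of this tricube weight, as an illustrative sketch:

```python
import numpy as np

def tricube(u):
    """Tricube weight: (1 - |u|^3)^3 for |u| < 1, zero otherwise."""
    a = np.abs(u)
    return np.where(a < 1, (1 - a ** 3) ** 3, 0.0)
```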
When a Gaussian kernel is used, local regression
takes the name of kernel regression. When a tricube
kernel is used (plus a nearest neighbours
bandwidth), local regression takes the name of the
LOESS estimator, as we will see later on.
12
-
Selecting the fitting criterion
Virtually any global fitting procedure can be
localized. So local regression could work on the
basis of the same number of distributions as global
parametric fitting.
The simplest case is that of Gaussian yis, where
least squares approaches can be used. An objection
to least squares is that those estimators are not
robust to heavy-tailed residual distributions. Under
these circumstances, ad hoc robustified fitting
procedures are available (LOWESS).
In case other distributions are hypothesized for the
yis, the locally weighted likelihood can be used.
For example, in the case of binary data the
nonparametric estimate is obtained by local
likelihood.
13
-
Selecting the bandwidth and local family
These issues will be discussed together since they
are strongly connected.
Both the choice of the bandwidth parameter and that
of the parametric family are related to the goal of
producing an estimate that is as smooth as possible
without distorting the underlying pattern of
dependence of the response on the independent
variables.
As for kernel estimates of density functions, a
balance between bias and variance must be found.
As for bandwidth selection, both fixed and nearest
neighbors bandwidths will be considered. As for the
parametric family, the choice will be made among
polynomial forms with the degree ranging from 0
to 3.
14
-
Nearest neighbor bandwidths vs fixed bandwidth
The problem with a fixed bandwidth is that it
provokes strong swings in variance in case of large
changes in the density of the data.
The boundary issue plays a major role in the
bandwidth choice. The issue is that using the same
bandwidth at the boundary (where observations can be
more sparse) as in the interior can produce
estimates with large variability. Think of Gaussian
data!
A variable bandwidth (such as nearest neighbors)
appears to perform better overall in applications
because of this variance issue.
Of course nearest neighbors can fail for some
specific examples, but the remedy is not the fixed
bandwidth; it is rather adaptive methods.
15
-
Polynomial degree
The choice of the polynomial degree is also a bias-
variance trade-off: a higher degree will produce a
less biased, but more variable estimate.
In case the degree is 0, the local regression
estimate is:
$$\hat m(x) = \frac{\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right) y_i}{\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)}$$
This choice p = 0 is quite well known in the
nonparametric literature (it is called local
constant regression), because it is the one for
which the asymptotic theory has been derived.
However, this case is, at the same time, the one
that in practice has less frequently shown good
performance.
The problem with local constant regression is that
it cannot reproduce a line, even in the very special
case of equally spaced data away from boundaries.
Reducing the lack of fit to a tolerable level
requires very small bandwidths, which end up
producing a very rough estimate.
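A minimal Python sketch of this local constant (Nadaraya-Watson) estimate, with a Gaussian kernel as an assumed choice:

```python
import numpy as np

def nadaraya_watson(x0, x, y, h):
    """Local constant (p = 0) estimate at x0: a kernel-weighted
    average of the yi (illustrative sketch, Gaussian kernel)."""
    k = np.exp(-0.5 * ((x - x0) / h) ** 2)  # K((x - x_i)/h)
    return np.sum(k * y) / np.sum(k)
```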
16
-
So, by using a polynomial degree greater than zero
it is possible to increase the bandwidth (thus
reducing the roughness) without introducing an
intolerable bias.
In case the degree is 1, the local regression
estimate is:
$$\hat m(x) = \frac{\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right) y_i}{\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)} + (x - \bar X_w)\, \frac{\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right) (x_i - \bar X_w)\, y_i}{\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right) (x_i - \bar X_w)^2}$$
where
$$\bar X_w = \frac{\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right) x_i}{\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)}$$
This choice p = 1 is called local linear regression.
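A minimal Python sketch that follows this closed form directly, again with a Gaussian kernel assumed for illustration:

```python
import numpy as np

def local_linear(x0, x, y, h):
    """Local linear (p = 1) estimate at x0, computed from the
    closed form above (illustrative sketch)."""
    k = np.exp(-0.5 * ((x - x0) / h) ** 2)        # kernel weights
    xw = np.sum(k * x) / np.sum(k)                # weighted mean of the x_i
    level = np.sum(k * y) / np.sum(k)             # local constant term
    slope = np.sum(k * (x - xw) * y) / np.sum(k * (x - xw) ** 2)
    return level + (x0 - xw) * slope
```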
17
-
Notable cases
1. Kernel regression is a local constant regression
(p = 0) where the weighting mechanism is based on
typical kernel functions (in particular the
Gaussian). It is also called Nadaraya-Watson
regression.
2. The LOESS estimator for local regression is
characterized by a tricube weighting mechanism and a
nearest neighbours bandwidth.
18
-
Kernel regression theory
For kernel regression much theory has been
developed, even though it is not the best option in
practice.
The model is
$$y = m(x) + u$$
For a given choice of K and h (fixed), we suppose
that the data are i.i.d. and that the x are not
stochastic.
BIAS
Similarly to kernel density estimators, the kernel
regression estimator has a bias of size O(h²):
$$b(x_0) = h^2 \left( \frac{m'(x_0) f'(x_0)}{f(x_0)} + \frac{1}{2} m''(x_0) \right) \int z^2 K(z)\, dz$$
Given a value for h, the bias varies with the kernel
function that we use, but most of all it depends on
the slope and the curvature of the function m at x0
and on the slope of f(x0), the density of the
regressors. In kernel density estimation, instead,
the bias depends only on f(x).
19
-
LIMIT DISTRIBUTION
The kernel regression estimator has a limit
distribution which is normal:
$$\sqrt{Nh}\, \left( \hat m(x_0) - m(x_0) - b(x_0) \right) \;\to\; N\!\left( 0, \; \frac{\sigma^2}{f(x_0)} \int K(z)^2\, dz \right)$$
Note that the variance of the estimator m(x0)
is inversely related to f(x0), which means that
the variance of m(x0) is bigger in regions where
x is sparse.
BANDWIDTH
The choice of the bandwidth is once more connected
to the bias-variance trade-off.
As in kernel density estimation, the bandwidth can
be determined using different methods; we will see
them in the next slides.
20
-
Choosing the bandwidth: Optimal rule
A value of h that minimizes MISE in an asymptotic
sense would be an optimal bandwidth.
Remember that MSE (mean squared error) measures the
local performance of the estimate of m at x0; in
this case it takes the form:
$$MSE[\hat m(x_0)] = E\left[ \left( \hat m(x_0) - m(x_0) \right)^2 \right]$$
The MISE (mean integrated squared error) is a global
measure of performance:
$$MISE(h) = \int MSE[\hat m(x_0)]\, f(x_0)\, dx_0$$
where f is the density of the regressors.
The optimal bandwidth is obtained by minimizing the
MISE, and this yields $h = O(N^{-1/5})$.
It has been shown that the kernel estimate converges
at a rate that is slower than that of parametric
estimates (with the optimal h, the error is of order
$N^{-2/5}$ rather than the parametric $N^{-1/2}$).
21
-
Choosing the bandwidth: Cross-validation
An empirical estimate of the optimal h can be
obtained using the leave-one-out cross-validation
procedure, thus minimizing:
$$CV(h) = \sum_{i=1}^{N} \left( y_i - \hat m_{-i}(x_i) \right)^2$$
where $\hat m_{-i}$ denotes the estimate computed
leaving out the i-th observation.
The optimality properties derive from the asymptotic
equivalence between minimizing CV(h) and minimizing
MISE(h) or ISE(h), recalling that, similarly to what
was presented in the previous section:
$$ISE(h) = \int \left( \hat m(x_0) - m(x_0) \right)^2 f(x_0)\, dx_0$$
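A minimal Python sketch of this leave-one-out criterion for the Nadaraya-Watson estimator; the Gaussian kernel and the grid search over h are assumptions for the example:

```python
import numpy as np

def cv_score(h, x, y):
    """Leave-one-out CV(h): each y_i is predicted by the kernel
    estimate computed without observation i (illustrative sketch)."""
    n = len(x)
    resid = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i                   # drop observation i
        k = np.exp(-0.5 * ((x[mask] - x[i]) / h) ** 2)
        resid[i] = y[i] - np.sum(k * y[mask]) / np.sum(k)
    return np.sum(resid ** 2)

# e.g. pick h minimizing CV(h) over a candidate grid:
# h_best = min(h_grid, key=lambda h: cv_score(h, x, y))
```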
Plug-in
In the kernel regression context the plug-in method
is usually not used; CV is preferred.
22
-
LOESS estimator
The LOESS estimator is a local regression estimator
where:
1. the weight function used for LOESS is the tricube
weight function;
2. the local polynomials are almost always of first
or second degree (that is, either locally linear or
locally quadratic);
3. the subsets of data used for each weighted least
squares fit in LOESS are determined by a nearest
neighbors algorithm.
23
-
Regarding the third characteristic: the smoothing
parameter q is usually a number between (p + 1)/N
and 1, with p denoting the degree of the local
polynomial.
Large values of q produce the smoothest functions,
which do not react much to fluctuations in the data.
Smaller values of q make the regression function
follow the data more closely.
Note, however, that using too small a value of the
smoothing parameter is not desirable, since the
regression function will eventually start to capture
the random error in the data (too rough!). Good
values of the smoothing parameter typically lie in
the range 0.25 to 0.5 for most LOESS applications.
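Putting the three ingredients together, a minimal Python sketch of one LOESS fit at a point; the function name and details are illustrative assumptions, not the full LOESS algorithm:

```python
import numpy as np

def loess_point(x0, x, y, q=0.4, degree=1):
    """LOESS fit at x0: tricube weights over the fraction q of
    nearest neighbours, then a weighted polynomial fit (sketch)."""
    n = len(x)
    r = max(int(np.ceil(q * n)), degree + 1)       # neighbourhood size
    d = np.abs(x - x0)
    h = np.sort(d)[r - 1]                          # distance to the r-th neighbour
    u = d / h
    w = np.where(u < 1, (1 - u ** 3) ** 3, 0.0)    # tricube weights
    # np.polyfit weights multiply the residuals, so pass sqrt(w)
    coef = np.polyfit(x - x0, y, degree, w=np.sqrt(w))
    return coef[-1]                                # fitted value at x0
```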
24