Robust Regression
By Will Garner
1. Motivation
Under the assumption that the errors in a regression model are normally distributed, the least squares estimate is the most efficient unbiased estimate of the coefficients β.
What happens if the errors are not normally distributed?
1. Motivation
We need regression methods that are not as sensitive to outliers
This leads to Robust Regression
2. Robust Regression
There are two popular remedies for this problem
We can measure the size of a residual in some other way, by replacing the square e² by some other function ρ(e) which reflects the size of the residual in a less extreme way. To be sensible, we should have that:
ρ is symmetric: ρ(e) = ρ(−e)
ρ is positive: ρ(e) ≥ 0 for all e
ρ is monotone: ρ(e1) ≤ ρ(e2) if |e1| ≤ |e2|
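As a quick sanity check, the three conditions can be verified numerically for a candidate ρ. The choice ρ(e) = |e| below is an illustrative assumption (it corresponds to the L1 criterion discussed later):

```python
# Numerically verify that a candidate rho is symmetric, positive, and monotone
# in |e|. The choice rho(e) = |e| is an illustrative assumption.
def rho(e):
    return abs(e)

grid = [i / 10 for i in range(-50, 51)]
assert all(rho(e) == rho(-e) for e in grid)        # symmetric
assert all(rho(e) >= 0 for e in grid)              # positive
mags = sorted(abs(e) for e in grid)
vals = [rho(m) for m in mags]
assert all(a <= b for a, b in zip(vals, vals[1:])) # monotone in |e|
print("rho(e) = |e| satisfies all three conditions")
```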
2. Robust Regression
An example of this type of regression is M-estimation.
Suppose that the observed responses Y_i are independent and have density functions

f_{Y_i}(y) = (1/σ) f((y − x_iβ)/σ)

Note: If f is the standard normal density, then this is just the standard regression model and σ is the standard deviation.
2.1. M-Estimation
The log likelihood is given by

l(β, σ) = −n log σ + Σ_{i=1}^n log f((Y_i − x_iβ)/σ)

Setting ρ = −log f, we have

l(β, σ) = −n log σ − Σ_{i=1}^n ρ((Y_i − x_iβ)/σ)
2.1. M-Estimation
Let e_i(β) = Y_i − x_iβ. Thus, to estimate β and σ using maximum likelihood, we must minimize

n log σ + Σ_{i=1}^n ρ(e_i(β)/σ)

as a function of β and σ.
2.1. M-Estimation
Differentiating gives us:

Σ_{i=1}^n ψ(e_i(β)/σ) x_i = 0

Σ_{i=1}^n ψ(e_i(β)/σ) e_i(β) = nσ

where ψ = ρ′.
2.1. M-Estimation
If we do not have a requirement for f, then we can choose ρ to make the estimate robust, by choosing a ρ for which ψ = ρ′ is bounded. Hence, we can generalize the above to the estimating equations

Σ_{i=1}^n ψ(e_i(β)/σ) x_i = 0

and

Σ_{i=1}^n χ(e_i(β)/σ) = 0

where χ is also chosen to make the scale estimate robust. These estimates are called M-estimates, since their definition is motivated by the maximum likelihood estimating equations above.
2. Robust Regression
Example: Let ρ(x) = (1/2)x². Then ψ(x) = x, and the first estimating equation reduces to the normal equations

Σ_{i=1}^n e_i(β) x_i = 0

whose solution is the least squares estimate (LSE). The second equation gives us the maximum likelihood estimate

σ̂² = (1/n) Σ_{i=1}^n e_i(β̂)²
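A minimal numeric check of this example (the data below are made up): at the least squares fit of a simple regression with an intercept, the estimating equation with ψ(x) = x, i.e. the normal equations, holds exactly.

```python
# Fit simple-regression OLS in closed form, then check the M-estimation
# equations with psi(x) = x: sum(e_i) = 0 and sum(e_i * x_i) = 0.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]      # made-up data
ys = [1.1, 1.9, 3.2, 3.9, 5.1]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) \
    / sum((x - xbar) ** 2 for x in xs)
a = ybar - b * xbar
e = [y - a - b * x for x, y in zip(xs, ys)]
print(abs(sum(e)) < 1e-9)                               # True
print(abs(sum(ei * x for ei, x in zip(e, xs))) < 1e-9)  # True
```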
2. Robust Regression
Example: Let ρ(x) = |x|. We have that a value of β that minimizes the log likelihood also minimizes

Σ_{i=1}^n |e_i(β)|

This is known as the L1 estimate. Note: The L1 estimate is called the LAD (Least Absolute Deviations) estimate. Note: β need not be unique.
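To see the L1 criterion and its possible non-uniqueness concretely, consider the location-model special case (x_i ≡ 1), where minimizing Σ|y_i − m| is solved by the median. The data and search grid below are illustrative assumptions:

```python
import statistics

# Location model: minimize sum |y_i - m| over m. With an even sample size,
# any m between the two middle order statistics attains the minimum, so the
# L1 estimate need not be unique.
ys = [1.0, 2.0, 3.0, 10.0]                 # 10.0 is an outlier
def l1(m):
    return sum(abs(y - m) for y in ys)

grid = [i / 100 for i in range(0, 1101)]   # search m over [0, 11]
best = min(grid, key=l1)
print(best)                                 # some m in [2, 3]: the minimum is flat there
print(statistics.median(ys))                # 2.5 attains the same minimum value
print(abs(l1(best) - l1(statistics.median(ys))) < 1e-9)   # True
```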
2. Robust Regression
Example: Let

ψ_k(x) = −k if x < −k;  x if −k ≤ x ≤ k;  k if x > k

(this is Huber's ψ function). Setting k = 1.5, we have a reasonable compromise between least squares (the greatest efficiency at the normal model) and L1 estimation, which gives more protection from outliers.
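One standard way to compute this estimate is iteratively reweighted least squares (IRLS) with weights w(e) = ψ(e)/e. The sketch below fixes the scale at 1 and uses made-up data; a real implementation would also estimate the scale robustly.

```python
# IRLS sketch for a Huber M-estimate of a simple regression line.
# Assumptions: scale fixed at 1; made-up data with one outlying response.
k = 1.5

def weight(e):                       # w(e) = psi_k(e) / e for the Huber psi
    return 1.0 if abs(e) <= k else k / abs(e)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 2.1, 2.9, 4.2, 5.0, 20.0]     # last response is a gross outlier
a, b = 0.0, 1.0                           # starting values
for _ in range(100):
    e = [y - a - b * x for x, y in zip(xs, ys)]
    w = [weight(ei) for ei in e]
    sw = sum(w)
    xb = sum(wi * x for wi, x in zip(w, xs)) / sw
    yb = sum(wi * y for wi, y in zip(w, ys)) / sw
    b = sum(wi * (x - xb) * (y - yb) for wi, x, y in zip(w, xs, ys)) \
        / sum(wi * (x - xb) ** 2 for wi, x in zip(w, xs))
    a = yb - b * xb

n = len(xs)
ols_b = (n * sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys)) \
    / (n * sum(x * x for x in xs) - sum(xs) ** 2)
print(round(ols_b, 2))   # 3.0: the OLS slope is dragged up by the outlier
print(round(b, 2))       # the Huber slope stays much closer to the bulk of the data
```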
2. Robust Regression
Example: There is also the scale estimate based on the Median Absolute Deviation (MAD), which is found by setting

χ_c(z) = sgn(|z| − 1/c)

where c satisfies 1/c = Φ⁻¹(3/4) ≈ 0.6745, so c ≈ 1.4826.
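In practice the resulting scale estimate is simply 1.4826 times the median absolute deviation of the residuals. A stdlib-only sketch (the residuals below are made up):

```python
import statistics

# Robust scale via the median absolute deviation (MAD):
# sigma_hat = MAD / 0.6745 ~= 1.4826 * MAD, consistent at the normal model.
e = [-1.2, -0.4, 0.1, 0.3, 0.8, 45.0]     # residuals with one gross outlier
med = statistics.median(e)
mad = statistics.median([abs(x - med) for x in e])
sigma_hat = mad / 0.6745
print(round(mad, 2))         # 0.6
print(round(sigma_hat, 2))   # 0.89: the outlier barely inflates the scale
```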
2. Robust Regression
Regression coefficients found using M-estimators are close to least squares estimators if the errors are normal, but are much more robust if the error distribution has heavy tails.
However, M-estimates of regression coefficients are just as vulnerable as least squares estimates to outliers in the explanatory variables.
2. Robust Regression
Another remedy is that we can replace the sum (or the mean) by a more robust measure of location, such as the median or a trimmed mean.
Some examples are least median of squares (LMS) and least trimmed squares (LTS).
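For intuition, here is a brute-force LTS sketch in the location-model special case: minimize the sum of the h smallest squared residuals. For a location estimate the optimal h points are consecutive order statistics, so scanning windows suffices (the data are made up):

```python
# Least trimmed squares (LTS) for a location estimate: keep the h smallest
# squared residuals. Scanning windows of consecutive order statistics suffices.
ys = sorted([1.0, 1.5, 2.0, 2.5, 3.0, 50.0, 60.0])   # two gross outliers
n = len(ys)
h = n // 2 + 1                                        # trim almost half: h = 4
best_score, best_m = float("inf"), None
for i in range(n - h + 1):
    window = ys[i:i + h]
    m = sum(window) / h                               # LS fit on the subset
    score = sum((y - m) ** 2 for y in window)
    if score < best_score:
        best_score, best_m = score, m
print(best_m)        # 1.75: the outliers are trimmed away entirely
```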
2. Robust Regression
These estimates are very robust to outliers in both the errors and the explanatory variables, but can be unstable nonetheless.
Small changes in non-extreme points can make a very large change in the fitted regression. (See Figure 1.)
2. Robust Regression
[Figure 1: two scatterplots of y against x, with point B moved from one position to another.]
Figure 1: Moving B from being collinear with the three points to being collinear with the other three points causes a drastic change in the regression line.
2. Robust Regression
Furthermore, the LMS and LTS estimates are very inefficient if the data is actually normally distributed.
2. Robust Regression
Thus, the more robust the regression estimator, the worse it performs on normal data. But the further the data are from normality, the better the robust estimates will fit the data.
Before we apply any regression analysis, we should first run a QQ-Plot to determine if the data is normally distributed. Depending on the results of the plot, we choose an appropriate model.
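The numeric core of a QQ-plot can be sketched without plotting: correlate the sorted data with standard normal quantiles, where a correlation near 1 is consistent with normality. The sample and the plotting positions (i + 0.5)/n below are illustrative assumptions:

```python
from statistics import NormalDist, mean

# QQ check without a plot: correlation between sorted data and normal quantiles.
data = sorted([-1.3, -0.6, -0.2, 0.1, 0.4, 0.9, 1.5])   # made-up sample
n = len(data)
q = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]

mq, md = mean(q), mean(data)
num = sum((a - mq) * (b - md) for a, b in zip(q, data))
den = (sum((a - mq) ** 2 for a in q) * sum((b - md) ** 2 for b in data)) ** 0.5
r = num / den
print(r > 0.95)     # True: this sample looks close to normal
```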
3. Measuring Robustness
The next logical question to ask is: how do we measure robustness?
There are two common measures. The first is the breakdown point, which measures how much bad data an estimate can resist before it fails.
The second measure is the influence curve, which tells us how much a single outlier affects the estimate.
3. Measuring Robustness
Definition: The breakdown point of an estimate is the smallest fraction of the data that can be changed by an arbitrarily large amount and thereby cause an arbitrarily large change in the estimate.
3. Measuring Robustness
Example: The breakdown point of the sample mean and the least squares estimate is 1/n.
Example: The breakdown of the sample median is almost 1/2.
Example: The breakdown point of the L1 estimator is also 1/n, even though it minimizes absolute deviations (LAD). The same is true of M-estimates.
Example: The LMS and LTS estimates have breakdown points near 1/2.
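These breakdown points are easy to see numerically: corrupting a single observation moves the mean arbitrarily far, while the median barely moves. A small stdlib demonstration with made-up data:

```python
import statistics

# Breakdown in action: one corrupted value ruins the mean but not the median.
clean = [2.0, 3.0, 4.0, 5.0, 6.0]
bad = clean[:-1] + [1e9]                  # replace one observation
print(statistics.mean(clean), statistics.median(clean))   # 4.0 4.0
print(statistics.mean(bad))               # about 2e8: the mean breaks down
print(statistics.median(bad))             # 4.0: the median is unmoved
```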
4. Influence Curves
Suppose that F is a k-dimensional distribution function and θ is a population parameter that depends on F, so θ = T(F). T is called a statistical functional, since it is a function of a function.
4. Influence Curves
The influence curve (IC) of a statistical functional T is the derivative with respect to t of T(F_t), evaluated at t = 0, where F_t = (1 − t)F + t·δ_{z0} is F contaminated by a point mass at z0. It is a measure of the rate at which T responds to a small amount of contamination at z0.
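For the mean this derivative has a closed form: T(F_t) = (1 − t)μ + t·z0, so IC(z0) = z0 − μ, which is unbounded in z0. A finite-difference check (the values of μ and z0 are made-up assumptions):

```python
# Influence curve of the mean at contamination point z0.
# T(F_t) = (1 - t) * mu + t * z0 for the mixture F_t = (1 - t) F + t * delta_z0.
mu = 1.0          # T(F) under the uncontaminated F (assumed value)
z0 = 7.0          # contamination point (assumed value)

def T(t):
    return (1 - t) * mu + t * z0

h = 1e-6
ic = (T(h) - T(0)) / h      # numeric derivative at t = 0
print(round(ic, 6))          # 6.0 = z0 - mu: grows without bound as z0 grows
```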
4. Influence Curves
The mean is highly nonrobust.
The least squares estimate is not robust.
M-estimates are not robust with respect to high-leverage points (outliers in the explanatory variables).
4. Influence Curves
The robust estimators discussed so far are not entirely satisfactory, since those with high breakdown (such as LMS and LTS) have poor efficiency and the efficient M-estimators are not robust in the explanatory variables and have breakdown points of zero.
4. Influence Curves
The natural question to ask is whether there are other estimates that have high breakdown points but much greater efficiency than LMS or LTS. It turns out that there are better estimates. We shall discuss two more estimators.
4. Influence Curves
If we apply a weight function chosen to make the IC bounded, the resulting estimates are called bounded influence estimates or generalized M-estimates (GM-estimates).
4. Influence Curves
To bound the IC, the weights are chosen in such a way as to reduce the impact of high-leverage points. However, including a high-leverage point that is not an outlier increases the efficiency of the estimate.
4. Influence Curves
That is, if we fit a regression line to a set of data and then we get another sample point that is far away from our other data points, but is close to the regression line, then the efficiency of the estimate increases. (See Figure 2.)
4. Influence Curves
[Figure 2: a scatterplot of y against x with one point far from the rest of the data but close to the regression line.]
Figure 2: A good outlier
4. Influence Curves
Hence, we include the weight function in the denominator so that the effect of a small residual at a high-leverage point will be magnified.
The weights can be chosen to minimize the asymptotic variance of the estimates. This leads to weights of the form w(x) = ||Ax||-1, for some matrix A.
4. Influence Curves
Note: The breakdown point of these estimates is better than an M-estimate, but cannot exceed 1/p, where p is the rank of X.
4. Influence Curves
The estimating equation is usually solved iteratively by Newton's method or Fisher scoring, using some other estimate as a starting value.
4. Influence Curves
There are combinations of high breakdown estimates with GM-estimates. These use a high breakdown estimate as a starting value and then take a single step of an iterative method. This is called a one-step GM-estimate. Hence, one gets a breakdown point of roughly 50% together with reasonable efficiency.
5. S-Estimators
We can think of the average size of the residuals as a measure of their dispersion, so we can consider more general regression estimators based on some dispersion or scale estimator s(e1, …, en). This leads to minimizing D(β) = s[e1(β), …, en(β)], where s is an estimator of scale.
5. S-Estimators
We define an S-estimator to be one in which we use s = s(e1, …, en) defined by

(1/n) Σ_{i=1}^n ρ(e_i/s) = K

where K = E[ρ(Z)] for a standard normal Z, and ρ is strictly increasing on [0, c] and constant on (c, ∞).
5. S-Estimators
Note: The breakdown point of such an estimate can be made close to 50% with a suitable choice of ρ. The biweight function

ρ(x) = x²/2 − x⁴/(2c²) + x⁶/(6c⁴) if |x| ≤ c;  c²/6 if |x| > c

is a popular choice. For c = 1.547, the breakdown point is just under 50% and the efficiency at the normal distribution is roughly 29%.
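The defining equation (1/n) Σ ρ(e_i/s) = K can be solved for s by bisection, since the left side decreases as s grows. The sketch below uses the biweight with c = 1.547, computes K = E[ρ(Z)] by Simpson's rule, and applies it to made-up residuals:

```python
import math

c = 1.547

def rho(x):
    # Tukey biweight rho, constant at c^2/6 beyond c
    if abs(x) > c:
        return c * c / 6
    return x * x / 2 - x ** 4 / (2 * c * c) + x ** 6 / (6 * c ** 4)

def expected_rho():
    # K = E[rho(Z)], Z ~ N(0, 1), via Simpson's rule on [-8, 8]
    a, b, m = -8.0, 8.0, 4000
    h = (b - a) / m
    total = 0.0
    for i in range(m + 1):
        x = a + i * h
        wgt = 1 if i in (0, m) else (4 if i % 2 == 1 else 2)
        total += wgt * rho(x) * math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
    return total * h / 3

def s_estimate(e, K):
    # Bisection: the average of rho(e_i / s) is decreasing in s
    lo, hi = 1e-8, 1e6
    for _ in range(200):
        mid = (lo + hi) / 2
        if sum(rho(ei / mid) for ei in e) / len(e) > K:
            lo = mid            # average too large -> s too small
        else:
            hi = mid
    return (lo + hi) / 2

e = [-1.0, -0.5, 0.2, 0.4, 0.9, 30.0]    # made-up residuals with an outlier
K = expected_rho()
s = s_estimate(e, K)
print(round(s, 2))    # a moderate scale: the outlier's contribution is capped
```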
5. S-Estimators
Remark: There is another notable class of estimators, R-estimators.
There is also a blend of the bounded influence estimators and S-estimators. This leads to a generalized S-estimate, as well as the least quartile difference (LQD) estimate and the least trimmed difference (LTD) estimate.