Robust Regression


  • 1

    Robust Regression

    By Will Garner

  • 2

    1. Motivation

    Under the assumption that the errors in a regression model are normally distributed, the least squares estimate is the most efficient unbiased estimate of the coefficients $\beta$.

    What happens if the errors are not normally distributed?

  • 3

    1. Motivation

    We need regression methods that are not as sensitive to outliers

    This leads to Robust Regression

  • 4

    2. Robust Regression

    There are two popular remedies for this problem

    We can measure the size of a residual in some other way, by replacing the square $e^2$ with some other function $\rho(e)$ that reflects the size of the residual in a less extreme way. To be sensible, we should have that $\rho$ be symmetric [$\rho(e) = \rho(-e)$], $\rho$ be positive [$\rho(e) \ge 0$ for all $e$], and $\rho$ be monotone [$\rho(e_1) \le \rho(e_2)$ if $|e_1| \le |e_2|$].

  • 5

    2. Robust Regression

    An example of this type of regression is M-estimation.

    Suppose that the observed responses $Y_i$ are independent and have density functions

    $$f_i(Y_i) = \frac{1}{\sigma}\, f\!\left(\frac{Y_i - x_i^T \beta}{\sigma}\right)$$

    Note: If $f$ is the standard normal density, then this is just the standard regression model and $\sigma$ is the standard deviation.

  • 6

    2.1. M-Estimation

    The log likelihood is given by

    $$\ell(\beta, \sigma) = -n \log \sigma + \sum_{i=1}^n \log f\!\left(\frac{Y_i - x_i^T \beta}{\sigma}\right)$$

    Setting $\rho = -\log f$, we have

    $$\ell(\beta, \sigma) = -n \log \sigma - \sum_{i=1}^n \rho\!\left(\frac{Y_i - x_i^T \beta}{\sigma}\right)$$

  • 7

    2.1. M-Estimation

    Let $s = \sigma$ and $e_i(b) = Y_i - x_i^T b$. Thus, to estimate $\beta$ and $\sigma$ using maximum likelihood, we must minimize

    $$n \log s + \sum_{i=1}^n \rho\!\left(\frac{e_i(b)}{s}\right)$$

    as a function of $b$ and $s$.

  • 8

    2.1. M-Estimation

    Differentiating gives us

    $$\sum_{i=1}^n \psi\!\left(\frac{e_i(b)}{s}\right) x_i = 0 \qquad \text{and} \qquad \sum_{i=1}^n \psi\!\left(\frac{e_i(b)}{s}\right) \frac{e_i(b)}{s} = n,$$

    where $\psi = \rho'$.

  • 9

    2.1. M-Estimation

    If we drop the requirement that $\rho = -\log f$ for some density $f$, then we can make the estimate robust by choosing a $\rho$ for which $\psi = \rho'$ is bounded. Hence, we can generalize the above to the estimating equations

    $$\sum_{i=1}^n \psi\!\left(\frac{e_i(b)}{s}\right) x_i = 0$$

    and

    $$\sum_{i=1}^n \chi\!\left(\frac{e_i(b)}{s}\right) = 0,$$

    where $\chi$ is also chosen to make the scale estimate robust. These estimates are called M-estimates, since their definition is motivated by the maximum likelihood estimating equations above.
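    These estimating equations are usually solved by iteratively reweighted least squares (IRLS): writing $\psi(u) = w(u)\,u$ turns the first equation into a weighted least squares problem. A minimal sketch in Python, assuming a fixed scale $s$ and the Huber $\psi$ introduced on a later slide (the function names here are mine):

    ```python
    import numpy as np

    def huber_psi(u, k=1.5):
        # Huber's psi: identity on [-k, k], clipped to +/- k outside
        return np.clip(u, -k, k)

    def m_estimate(X, y, s, n_iter=50, tol=1e-8):
        """Solve sum_i psi(e_i(b)/s) x_i = 0 by IRLS with weights
        w_i = psi(u_i)/u_i, starting from the least squares estimate."""
        b = np.linalg.lstsq(X, y, rcond=None)[0]
        for _ in range(n_iter):
            u = (y - X @ b) / s
            w = np.where(np.abs(u) > 1e-10, huber_psi(u) / u, 1.0)
            b_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
            if np.max(np.abs(b_new - b)) < tol:
                return b_new
            b = b_new
        return b
    ```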

  • 10

    2. Robust Regression

    Example: Let $\rho(x) = \tfrac{1}{2}x^2$. Then the first estimating equation reduces to the normal equations, $\sum_{i=1}^n e_i(b)\,x_i = 0$, whose solution is the least squares estimate (LSE). The second gives us the maximum likelihood estimate

    $$\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n e_i^2(b).$$

  • 11

    2. Robust Regression

    Example: Let $\rho(x) = |x|$. We have that a value of $b$ that minimizes the log likelihood also minimizes

    $$\sum_{i=1}^n |e_i(b)|.$$

    This is known as the L1 estimate. Note: The L1 estimate is also called the LAD (Least Absolute Deviations) estimate. Note: $b$ need not be unique.
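    For illustration, the LAD fit can be computed by minimizing the L1 objective directly; a sketch using scipy's generic optimizer (in practice LAD is usually solved as a linear program, and because the minimizer need not be unique, different solvers may return different answers):

    ```python
    import numpy as np
    from scipy.optimize import minimize

    def lad_fit(X, y):
        # Minimize sum |e_i(b)|, starting from the least squares estimate
        b0 = np.linalg.lstsq(X, y, rcond=None)[0]
        res = minimize(lambda b: np.sum(np.abs(y - X @ b)), b0,
                       method="Nelder-Mead")
        return res.x
    ```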

  • 12

    2. Robust Regression

    Example: Let

    $$\psi(x) = \begin{cases} -k, & x < -k \\ x, & -k \le x \le k \\ k, & x > k. \end{cases}$$

    Setting $k = 1.5$, we have a reasonable compromise between least squares (the greatest efficiency at the normal model) and L1 estimation, which gives more protection from outliers.
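    This is Huber's $\psi$. Its antiderivative $\rho$ (with $\rho(0) = 0$) is quadratic in the middle and linear in the tails, which is exactly the compromise described; a small sketch:

    ```python
    import numpy as np

    def huber_rho(x, k=1.5):
        # Antiderivative of Huber's psi: x^2/2 for |x| <= k, then linear
        # growth k|x| - k^2/2, so outliers count like |x|, not x^2
        a = np.abs(x)
        return np.where(a <= k, 0.5 * x**2, k * a - 0.5 * k**2)
    ```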

  • 13

    2. Robust Regression

    Example: There is also the Median Absolute Deviation (MAD) scale estimate, which is found by setting

    $$\chi(z) = \operatorname{sgn}(|z| - 1/c),$$

    where $c$ solves $\Phi(1/c) = 3/4$; that is, $1/c = \Phi^{-1}(3/4) \approx 0.6745$, so $c \approx 1.4826$.
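    Numerically, the resulting scale estimate is the rescaled median absolute deviation of the residuals; a sketch (the constant $1.4826 \approx 1/0.6745$ makes the estimate consistent for $\sigma$ under normal errors):

    ```python
    import numpy as np

    def mad_scale(e):
        # Median absolute deviation of the residuals, rescaled so the
        # estimate is consistent for sigma when the errors are normal
        return 1.4826 * np.median(np.abs(e - np.median(e)))
    ```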

  • 14

    2. Robust Regression

    Regression coefficients found using M-estimators are close to least squares estimators if the errors are normal, but are much more robust if the error distribution has heavy tails.

    However, M-estimates of regression coefficients are just as vulnerable as least squares estimates to outliers in the explanatory variables.

  • 15

    2. Robust Regression

    Another remedy is that we can replace the sum (or the mean) by a more robust measure of location, such as the median or a trimmed mean.

    Some examples are least median of squares (LMS) and least trimmed squares (LTS).
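    A crude sketch of LTS, which minimizes the sum of the $h$ smallest squared residuals: fit exact solutions through random $p$-point subsets and keep the best. (Practical implementations such as Rousseeuw and Van Driessen's FAST-LTS refine this with concentration steps; the defaults below are my choices.)

    ```python
    import numpy as np

    def lts_fit(X, y, h=None, n_trials=500, seed=0):
        # Least trimmed squares: keep the fit whose h smallest squared
        # residuals have the smallest sum (outliers land in the trimmed part)
        rng = np.random.default_rng(seed)
        n, p = X.shape
        h = h if h is not None else (n + p + 1) // 2
        best_b, best_obj = None, np.inf
        for _ in range(n_trials):
            idx = rng.choice(n, size=p, replace=False)
            try:
                b = np.linalg.solve(X[idx], y[idx])  # exact fit to p points
            except np.linalg.LinAlgError:
                continue
            obj = np.sort((y - X @ b) ** 2)[:h].sum()
            if obj < best_obj:
                best_b, best_obj = b, obj
        return best_b
    ```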

  • 16

    2. Robust Regression

    These estimates are very robust to outliers in both the errors and the explanatory variables, but can be unstable nonetheless.

    Small changes in non-extreme points can make a very large change in the fitted regression. (See Figure 1.)

  • 17

    2. Robust Regression

    [Figure 1: two scatterplots of y versus x, each containing a point labeled B]

    Figure 1: Moving B from being collinear with the three points to being collinear with the other three points causes a drastic change in the regression line.

  • 18

    2. Robust Regression

    Furthermore, the LMS and LTS estimates are very inefficient if the data is actually normally distributed.

  • 19

    2. Robust Regression

    Thus, the more robust the regression estimator we use on normal data, the worse our estimates become. But the further the data is from normality, the better the robust estimators will fit it.

    Before we apply any regression analysis, we should first run a QQ-plot to determine whether the data is normally distributed. Depending on the results of the plot, we choose an appropriate model.
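    For example, with scipy and matplotlib (the heavy-tailed toy data is just for illustration):

    ```python
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    # Heavy-tailed toy residuals; a normal QQ-plot of these bends away
    # from the reference line in the tails, suggesting a robust fit
    residuals = np.random.default_rng(0).standard_t(df=2, size=200)
    stats.probplot(residuals, dist="norm", plot=plt)
    plt.show()
    ```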

  • 20

    3. Measuring Robustness

    The next logical question to ask is how do we measure robustness?

    There are two common measures. The first is the breakdown point, which measures how much bad data an estimate can tolerate before it fails.

    The second measure is the influence curve, which tells us how much a single outlier affects the estimate.

  • 21

    3. Measuring Robustness

    Definition: The breakdown point of an estimate is the smallest fraction of the data that, when changed by an arbitrarily large amount, can cause an arbitrarily large change in the estimate.

  • 22

    3. Measuring Robustness

    Example: The breakdown point of the sample mean and the least squares estimate is 1/n.

    Example: The breakdown point of the sample median is almost 1/2.

    Example: The breakdown point of the L1 estimator is also 1/n, even though it is based on least absolute deviations. The same is true of M-estimates.

    Example: The LMS and LTS estimates have breakdown points near 1/2.
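    A quick numerical illustration of the first two examples:

    ```python
    import numpy as np

    x = np.arange(10, dtype=float)          # 0, 1, ..., 9
    x_bad = x.copy()
    x_bad[0] = 1e9                          # corrupt one point: fraction 1/n

    print(np.mean(x), np.mean(x_bad))       # 4.5 vs ~1e8: the mean breaks down
    print(np.median(x), np.median(x_bad))   # 4.5 vs 5.5: the median barely moves
    ```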

  • 23

    4. Influence Curves

    Suppose that $F$ is a $k$-dimensional distribution function and $\theta$ is a population parameter that depends on $F$, so $\theta = T(F)$. $T$ is called a statistical functional, since it is a function of a function.

  • 24

    4. Influence Curves

    The influence curve (IC) of a statistical functional $T$ is the derivative with respect to $t$ of $T(F_t)$ evaluated at $t = 0$, where $F_t = (1 - t)F + t\,\delta_{z_0}$ is $F$ contaminated by a small point mass at $z_0$. It is a measure of the rate at which $T$ responds to a small amount of contamination at $z_0$.
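    For the mean functional, $T(F_t) = (1 - t)\,T(F) + t\,z_0$, so the IC is $z_0 - \mu$, which is unbounded in $z_0$. A minimal numeric check (the sample and step size are arbitrary):

    ```python
    import numpy as np

    def ic_mean(sample, z0, t=1e-6):
        # Finite-difference version of d/dt T(F_t) at t = 0 for T = mean,
        # with F_t = (1 - t) F + t * delta_{z0}
        TF = np.mean(sample)
        TFt = (1 - t) * TF + t * z0
        return (TFt - TF) / t

    sample = np.random.default_rng(1).normal(size=1000)
    print(ic_mean(sample, 100.0))  # roughly 100 - mean: grows without bound
    ```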

  • 25

    4. Influence Curves

    The mean is highly nonrobust.

    The least squares estimate is not robust.

    M-estimates are not robust with respect to high-leverage points (outliers in the explanatory variables).

  • 26

    4. Influence Curves

    The robust estimators discussed so far are not entirely satisfactory, since those with high breakdown (such as LMS and LTS) have poor efficiency and the efficient M-estimators are not robust in the explanatory variables and have breakdown points of zero.

  • 27

    4. Influence Curves

    The natural question to ask is whether there are other estimates that have high breakdown points but much greater efficiency than LMS or LTS. It turns out that there are better estimates. We shall discuss two more estimators.

  • 28

    4. Influence Curves

    If we apply a weight function chosen to make the IC bounded, the resulting estimates are called bounded influence estimates or generalized M-estimates (GM-estimates).

  • 29

    4. Influence Curves

    To bound the IC, the weights are chosen in such a way that they reduce the impact of high-leverage points. However, including a high-leverage point that is not an outlier increases the efficiency of the estimate.

  • 30

    4. Influence Curves

    That is, if we fit a regression line to a set of data and then we get another sample point that is far away from our other data points, but is close to the regression line, then the efficiency of the estimate increases. (See Figure 2.)

  • 31

    4. Influence Curves

    Figure 2: A good outlier


  • 32

    4. Influence Curves

    Hence, we include the weight function in the denominator so that the effect of a small residual at a high-leverage point will be magnified.

    The weights can be chosen to minimize the asymptotic variance of the estimates. This leads to weights of the form $w(x) = \|Ax\|^{-1}$, for some matrix $A$.
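    As an illustration, one hypothetical choice (my assumption, not the slide's) is to build $A$ as a whitening matrix from the design, so that $w(x) = \|Ax\|^{-1}$ shrinks for rows far from the bulk of the data:

    ```python
    import numpy as np

    def gm_weights(X):
        # Take A = (X^T X / n)^(-1/2); then ||A x_i|| is a leverage-like
        # distance and w(x_i) = ||A x_i||^(-1) downweights extreme rows
        M = X.T @ X / len(X)
        vals, vecs = np.linalg.eigh(M)
        A = vecs @ np.diag(vals ** -0.5) @ vecs.T
        return 1.0 / np.linalg.norm(X @ A.T, axis=1)
    ```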

  • 33

    4. Influence Curves

    Note: The breakdown point of these estimates is better than an M-estimate, but cannot exceed 1/p, where p is the rank of X.

  • 34

    4. Influence Curves

    The estimating equation is usually solved iteratively by Newton's method or Fisher scoring, using some other estimate as a starting value.

  • 35

    4. Influence Curves

    There are combinations of high breakdown estimates with GM-estimates. These use a high breakdown estimate as a starting value and then take one step of the iterative method; this is called a one-step GM-estimate. Hence, one gets an estimate with a breakdown point of roughly 50% that is also rather efficient.

  • 36

    5. S-Estimators

    We can think of the average size of the residuals as a measure of their dispersion, so we can consider more general regression estimators based on some dispersion or scale estimator $s(e_1, \ldots, e_n)$. This leads to minimizing $D(b) = s[e_1(b), \ldots, e_n(b)]$, where $s$ is an estimator of scale.

  • 37

    5. S-Estimators

    We define an S-estimator to be one in which we use $s = s(e_1, \ldots, e_n)$ defined by

    $$\frac{1}{n} \sum_{i=1}^n \rho\!\left(\frac{e_i}{s}\right) = K,$$

    where $K = E[\rho(Z)]$ for a standard normal $Z$, and $\rho$ is strictly increasing on $[0, c]$ and constant on $(c, \infty)$.

  • 38

    5. S-Estimators

    Note: The breakdown point of such an estimate can be made close to 50% with a suitable choice of $\rho$. The biweight function

    $$\rho(x) = \begin{cases} \dfrac{x^2}{2} - \dfrac{x^4}{2c^2} + \dfrac{x^6}{6c^4}, & |x| \le c \\[1ex] c^2/6, & |x| > c \end{cases}$$

    is a popular choice. For $c = 1.547$, the breakdown point is just under 50% and the efficiency at the normal distribution is roughly 29%.
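    Putting the last two slides together: a sketch that evaluates the biweight $\rho$, computes $K = E[\rho(Z)]$ by numerical integration, and solves the defining equation for $s$ by root-finding (the bracket is my assumption):

    ```python
    import numpy as np
    from scipy import integrate, stats
    from scipy.optimize import brentq

    def biweight_rho(x, c=1.547):
        # Tukey's biweight: polynomial inside [-c, c], constant c^2/6 outside
        inside = x**2 / 2 - x**4 / (2 * c**2) + x**6 / (6 * c**4)
        return np.where(np.abs(x) <= c, inside, c**2 / 6)

    # K = E[rho(Z)] for a standard normal Z
    K, _ = integrate.quad(lambda z: float(biweight_rho(z)) * stats.norm.pdf(z),
                          -np.inf, np.inf)

    def s_scale(e):
        # Solve (1/n) sum rho(e_i/s) = K for s; the left side decreases
        # toward 0 as s grows, so a sign change exists for typical residuals
        f = lambda s: np.mean(biweight_rho(e / s)) - K
        s0 = np.std(e) + 1e-12
        return brentq(f, 1e-6 * s0, 1e6 * s0)
    ```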

  • 39

    5. S-Estimators

    Remark: There is another notable class of estimators, R-estimators.

    There is also a blend of the bounded influence estimators and S-estimators. This leads to a generalized S-estimate, as well as the least quartile difference (LQD) estimate and the least trimmed difference (LTD) estimate.