
Regression

Idea behind Regression

[Figure: a scatter of points, Y plotted against X.]

We have a scatter of points, and we want to find the line that best fits that scatter.

For example, we might want to know the relationship between

Exam score and hours studied, or

Wheat yield and fertilizer usage, or

Job performance and job training, or

Sales revenue and advertising expenditure.

Imagine that there is a true relationship behind the variables in which we are interested. That relationship is known perhaps to some supreme being.

However, we are mere mortals, and the best we can do is to estimate that relationship based on a sample of observations.

$$Y_i = \alpha + \beta X_i + \varepsilon_i$$

The subscript i indicates which observation or which point we are considering. $X_i$ is the value of the independent variable for observation i.

$Y_i$ is the value of the dependent variable.

$\alpha$ is the true intercept. $\beta$ is the true slope. $\varepsilon_i$ is the random error.

Perhaps the supreme being feels that the world would be too boring if a particular number of hours studied was always associated with the same exam score, a particular amount of job training always led to the same job performance, etc. So the supreme being tosses in a random error. Then the equation of the true relationship is:

$$Y_i = \alpha + \beta X_i + \varepsilon_i$$

Our estimated equation is:

$$Y_i = a + b X_i + e_i$$

a is our estimated intercept. b is our estimated slope. $e_i$ is the estimation error.

Let’s look at our regression line and one particular observation.

[Figure: the regression line $\hat{Y}_i = a + bX_i$ and one observation. $\hat{Y}_i$ is the predicted value of the dependent variable, $Y_i$ is the observed value of the dependent variable, and $X_i$ is the observed value of the independent variable.]

The estimated equation of the line is $\hat{Y}_i = a + bX_i$, and $Y_i = \hat{Y}_i + e_i$.

The estimation error, $e_i$, is the gap between the observed value and the predicted value of the dependent variable.

Fitting a scatter of points with a line by eye is too subjective. We need a more rigorous method. We will consider three possible criteria.

Criterion 1: minimize the sum of the vertical errors

$$\sum_{i=1}^{n} e_i = \sum_{i=1}^{n} \left(Y_i - \hat{Y}_i\right)$$

[Figure: the regression line $\hat{Y}_i = a + bX_i$ with an observation $(X_i, Y_i)$ and its vertical error $e_i = Y_i - \hat{Y}_i$.]

Problem: The best fit by this criterion may not be very good. For points below the estimated regression line, we have a negative error $e_i$.

Positive and negative errors cancel each other out. So the points could be far from the line, but we may have a small sum of vertical errors.

Criterion 2: minimize the sum of the absolute values of the vertical errors

$$\sum_{i=1}^{n} |e_i| = \sum_{i=1}^{n} \left|Y_i - \hat{Y}_i\right|$$

[Figure: the same graph, with the vertical errors now measured in absolute value.]

This avoids our previous problem of positive and negative errors canceling each other out.

However, the absolute value function is not differentiable at zero, so using calculus to minimize will not work.

Criterion 3: minimize the sum of the squares of the vertical errors

$$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left(Y_i - \hat{Y}_i\right)^2$$

[Figure: the same graph, with the vertical errors now squared.]

This also avoids the problem of positive and negative errors canceling each other out. In addition, the square function is differentiable, so using calculus to minimize will work.

Minimizing the sum of the squared errors is the criterion that we will be using.

The technique is called least squares or ordinary least squares (OLS).

Using calculus, it can be shown that the values of a and b that give the line with the best fit can be calculated as:

$$\text{slope: } \quad b = \frac{\displaystyle\sum_{i=1}^{n} X_i Y_i - \frac{1}{n}\sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{\displaystyle\sum_{i=1}^{n} X_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} X_i\right)^2}$$

$$\text{intercept: } \quad a = \bar{Y} - b\bar{X}$$
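As an aside for the calculus-minded (a sketch of the standard derivation, which the original slides assert but do not show), these formulas come from setting the partial derivatives of the sum of squared errors to zero:

```latex
% Sketch of the standard least-squares derivation (not from the slides).
% Minimize S(a,b) = \sum_{i=1}^{n} (Y_i - a - b X_i)^2 over a and b.
\begin{align*}
\frac{\partial S}{\partial a} = -2\sum_{i=1}^{n} (Y_i - a - bX_i) = 0
  \;&\Rightarrow\; \textstyle\sum Y_i = na + b\sum X_i \\
\frac{\partial S}{\partial b} = -2\sum_{i=1}^{n} X_i\,(Y_i - a - bX_i) = 0
  \;&\Rightarrow\; \textstyle\sum X_i Y_i = a\sum X_i + b\sum X_i^2
\end{align*}
% Solving these two "normal equations" simultaneously gives the b and a above.
```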

Sometimes we omit the subscripts, since they are understood, and it’s less cumbersome without them. Then the equations are:

$$\text{slope: } \quad b = \frac{\sum XY - \frac{1}{n}\sum X \sum Y}{\sum X^2 - \frac{1}{n}\left(\sum X\right)^2}$$

$$\text{intercept: } \quad a = \bar{Y} - b\bar{X}$$

Another equivalent formula for b that is sometimes used is:

$$\text{slope: } \quad b = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2}$$

You may use either formula for b in this class.

Example: Determine the least squares regression line for Y = wheat yield and X = fertilizer, using the following data.

X     Y    XY        X²
100   40
200   50
300   50
400   70
500   65
600   65
700   80

We need the sums of the X's, the Y's, the XY's, and the X²'s.

X     Y    XY        X²
100   40   4,000     10,000
200   50   10,000    40,000
300   50   15,000    90,000
400   70   28,000    160,000
500   65   32,500    250,000
600   65   39,000    360,000
700   80   56,000    490,000
2800  420  184,500   1,400,000

We also need the means of X and of Y.

$$\bar{X} = \frac{2800}{7} = 400 \qquad \bar{Y} = \frac{420}{7} = 60$$

Next, we calculate the estimated slope b.

$$b = \frac{\sum XY - \frac{1}{n}\sum X \sum Y}{\sum X^2 - \frac{1}{n}\left(\sum X\right)^2} = \frac{184{,}500 - \frac{1}{7}(2800)(420)}{1{,}400{,}000 - \frac{1}{7}(2800)^2} = \frac{184{,}500 - 168{,}000}{1{,}400{,}000 - 1{,}120{,}000} = \frac{16{,}500}{280{,}000} = 0.059$$

Then we calculate the estimated intercept a.

$$a = \bar{Y} - b\bar{X} = 60 - (0.059)(400) = 36.4$$

So our estimated regression line is

$$\hat{Y} = 36.4 + 0.059X$$
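As a check (not part of the original slides), here is a minimal Python sketch that reproduces these estimates from the raw data; the variable names are mine.

```python
# Minimal sketch reproducing the wheat example's OLS estimates (names are mine).
X = [100, 200, 300, 400, 500, 600, 700]
Y = [40, 50, 50, 70, 65, 65, 80]

n = len(X)
sum_x, sum_y = sum(X), sum(Y)                      # 2800, 420
sum_xy = sum(x * y for x, y in zip(X, Y))          # 184,500
sum_x2 = sum(x * x for x in X)                     # 1,400,000
x_bar, y_bar = sum_x / n, sum_y / n                # 400, 60

# Slope, first formula: (sum XY - (1/n) sum X sum Y) / (sum X^2 - (1/n)(sum X)^2)
b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x**2 / n)
# Equivalent formula: (sum XY - n*Xbar*Ybar) / (sum X^2 - n*Xbar^2)
b_alt = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar**2)
a = y_bar - b * x_bar                              # intercept

print(b, b_alt, a)   # b = b_alt ~ 0.0589 (0.059 rounded), a ~ 36.43 (36.4 rounded)
```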

Given certain assumptions, the OLS estimators can be shown to have certain desirable properties. The assumptions are:

• The Y values are independent of each other.

• The conditional distributions of Y given X are normal.

• The conditional standard deviations of Y given X are equal for all values of X.

Gauss-Markov Theorem: If the previous assumptions hold, then the OLS estimators $a$, $b$, and $\hat{Y}$ of $\alpha$, $\beta$, and $\mu_{Y \cdot X}$ are best, linear, unbiased estimators (BLUE).

Linear means that the estimators are linear functions of the observed Y values. (There are no $Y^2$s or square roots of Y, etc.)

Unbiased means that the expected values of the estimators are equal to the parameters you are trying to estimate.

Best means that the estimator has the lowest variance of any linear unbiased estimators of the parameter.

Let’s look at our wheat example using our graph.

Consider the fertilizer amount Xi = 700.

[Figure: the regression line $\hat{Y}_i = a + bX_i$ with the observation at $X_i = 700$, where $Y_i = 80$ and $\hat{Y}_i = 77.7$.]

The predicted value of Y corresponding to X = 700 is

$$\hat{Y} = 36.4 + 0.059(700) = 77.7.$$

The observed value of Y corresponding to X = 700 is Y = 80.

The average of all Y values is $\bar{Y} = 60$.

[Figure: the same graph, now showing the total deviation $(Y_i - \bar{Y})$ split into the explained deviation $(\hat{Y}_i - \bar{Y})$ and the unexplained deviation $(Y_i - \hat{Y}_i)$.]

The difference between the predicted value of Y and the average value is called the explained deviation. The difference between the observed value of Y and the predicted value is the unexplained deviation. The difference between the observed value of Y and the average value is the total deviation.

If we sum the squares of those deviations, we get

$$\text{SST} = \text{sum of squares total} = \sum \left(Y_i - \bar{Y}\right)^2 \quad \text{(from the total deviations)},$$

$$\text{SSR} = \text{sum of squares regression} = \sum \left(\hat{Y}_i - \bar{Y}\right)^2 \quad \text{(from the explained deviations)},$$

$$\text{SSE} = \text{sum of squares error} = \sum \left(Y_i - \hat{Y}_i\right)^2 \quad \text{(from the unexplained deviations)}.$$

It can be shown that SST = SSR + SSE.

The sums of squares are often reported in a regression ANOVA table:

Source of Variation   Sum of Squares                          Degrees of Freedom   Mean Square
Regression            SSR = $\sum (\hat{Y}_i - \bar{Y})^2$    1                    MSR = SSR/1
Error                 SSE = $\sum (Y_i - \hat{Y}_i)^2$        n - 2                MSE = SSE/(n - 2)
Total                 SST = $\sum (Y_i - \bar{Y})^2$          n - 1                MST = SST/(n - 1)

There are two measures of how well our regression line fits our data.

The first measure is the standard error of the estimate or the standard error of the regression, $s_e$ or SER.

The $s_e$ or SER tells you the typical error of fit, or how far the observed value of Y typically is from the predicted value of Y.

The second measure of "goodness of fit" is the coefficient of determination or $R^2$.

The $R^2$ tells you the proportion of the total variation in the dependent variable that is explained by the regression on the independent variable (or variables).

Standard error of the estimate or standard error of the regression:

$$s_e = \text{SER} = \sqrt{\frac{\text{SSE}}{n-2}} = \sqrt{\frac{\sum e_i^2}{n-2}} = \sqrt{\frac{\sum \left(Y_i - \hat{Y}_i\right)^2}{n-2}} = \sqrt{\frac{\sum Y_i^2 - a\sum Y_i - b\sum X_i Y_i}{n-2}}$$

There is a 2 in the denominator, because we estimated 2 parameters, the intercept a and the slope b. Later, we'll have more parameters and this will change.

Coefficient of determination or $R^2$:

$$R^2 = \frac{\text{SSR}}{\text{SST}} = \frac{\text{explained variation}}{\text{total variation}} = 1 - \frac{\text{SSE}}{\text{SST}}$$

$$R^2 = \frac{\sum \left(\hat{Y}_i - \bar{Y}\right)^2}{\sum \left(Y_i - \bar{Y}\right)^2} = \frac{a\sum Y + b\sum XY - n\bar{Y}^2}{\sum Y^2 - n\bar{Y}^2}$$

If the line fits the scatter of points perfectly, the points are all on the regression line and $R^2 = 1$.

If the line doesn't fit at all and the scatter is just a jumble of points, then $R^2 = 0$.

$$0 \le R^2 \le 1$$

Let's return to our data and calculate $s_e$ (SER) and $R^2$.

First, let's add a column for Y².

X     Y    XY        X²          Y²
100   40   4,000     10,000      1,600
200   50   10,000    40,000      2,500
300   50   15,000    90,000      2,500
400   70   28,000    160,000     4,900
500   65   32,500    250,000     4,225
600   65   39,000    360,000     4,225
700   80   56,000    490,000     6,400
2800  420  184,500   1,400,000   26,350

$$\bar{X} = 400 \qquad \bar{Y} = 60$$

Remember that a = 36.4 and b = 0.059. Then

$$s_e = \text{SER} = \sqrt{\frac{\sum Y_i^2 - a\sum Y_i - b\sum X_i Y_i}{n-2}} = \sqrt{\frac{26{,}350 - 36.4(420) - 0.059(184{,}500)}{7-2}} = 5.94$$
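A quick sketch (mine, not the slides') confirming that the computational shortcut above agrees with summing the squared residuals directly. Note that with unrounded a and b both routes give about 5.96; the slides' 5.94 reflects the rounded a = 36.4 and b = 0.059.

```python
# Sketch: two equivalent routes to the SER (variable names are mine).
import math

X = [100, 200, 300, 400, 500, 600, 700]
Y = [40, 50, 50, 70, 65, 65, 80]
n = len(X)

# Unrounded OLS estimates.
b = (sum(x*y for x, y in zip(X, Y)) - sum(X)*sum(Y)/n) / (sum(x*x for x in X) - sum(X)**2/n)
a = sum(Y)/n - b * sum(X)/n

# Residual-based: sqrt( sum (Y_i - Yhat_i)^2 / (n - 2) )
sse = sum((y - (a + b*x))**2 for x, y in zip(X, Y))
ser_residuals = math.sqrt(sse / (n - 2))

# Shortcut: sqrt( (sum Y^2 - a sum Y - b sum XY) / (n - 2) )
sum_y2 = sum(y*y for y in Y)
sum_xy = sum(x*y for x, y in zip(X, Y))
ser_shortcut = math.sqrt((sum_y2 - a*sum(Y) - b*sum_xy) / (n - 2))

print(ser_residuals, ser_shortcut)   # both ~5.96; the slides' 5.94 uses rounded a, b
```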

Again, a = 36.4 and b = 0.059. Then

$$R^2 = \frac{a\sum Y + b\sum XY - n\bar{Y}^2}{\sum Y^2 - n\bar{Y}^2} = \frac{36.4(420) + 0.059(184{,}500) - 7(60)^2}{26{,}350 - 7(60)^2} = 0.846$$

So about 85% of the variation in wheat yield is explained by the regression on fertilizer.

SSR, SSE, and SST for the wheat example

On the previous slide, we found that $R^2 = \frac{\text{SSR}}{\text{SST}} = \frac{973.5}{1150} = 0.846$, so SSR = 973.5 and SST = 1150.

The sum of squares error, SSE, is the difference SSE = SST - SSR = 1150 - 973.5 = 176.5.
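Here is a short sketch (again mine, not the slides') that computes all three sums of squares directly and verifies SST = SSR + SSE:

```python
# Sketch: sums of squares for the wheat data (variable names are mine).
X = [100, 200, 300, 400, 500, 600, 700]
Y = [40, 50, 50, 70, 65, 65, 80]
n = len(X)

b = (sum(x*y for x, y in zip(X, Y)) - sum(X)*sum(Y)/n) / (sum(x*x for x in X) - sum(X)**2/n)
a = sum(Y)/n - b * sum(X)/n
y_bar = sum(Y) / n
y_hat = [a + b*x for x in X]

sst = sum((y - y_bar)**2 for y in Y)               # total:       1150
ssr = sum((yh - y_bar)**2 for yh in y_hat)         # explained:   ~972 (973.5 with slide rounding)
sse = sum((y - yh)**2 for y, yh in zip(Y, y_hat))  # unexplained: ~178 (176.5 with slide rounding)

print(sst, ssr + sse)   # equal: SST = SSR + SSE
print(ssr / sst)        # R^2 ~ 0.845
```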

What is the square root of $R^2$?

It is the sample correlation coefficient, usually denoted by lower case r.

$$\text{If } b > 0, \; r = +\sqrt{R^2}; \qquad \text{if } b < 0, \; r = -\sqrt{R^2}.$$

If you don't already have $R^2$ calculated, the sample correlation coefficient r can also be calculated from this formula.

$$r = \frac{\sum XY - \frac{1}{n}\sum X \sum Y}{\sqrt{\sum X^2 - \frac{1}{n}\left(\sum X\right)^2}\,\sqrt{\sum Y^2 - \frac{1}{n}\left(\sum Y\right)^2}}$$

For example, in our wheat problem, we had a = 36.4 and b = 0.059.

$$r = \frac{184{,}500 - \frac{1}{7}(2800)(420)}{\sqrt{1{,}400{,}000 - \frac{1}{7}(2800)^2}\,\sqrt{26{,}350 - \frac{1}{7}(420)^2}} = \frac{16{,}500}{\sqrt{280{,}000}\,\sqrt{1{,}150}} = 0.92$$

Also, $r = \sqrt{R^2} = \sqrt{0.846} = 0.920$.
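A small sketch (mine) of the two routes to r:

```python
# Sketch: the sample correlation coefficient two ways (names are mine).
import math

X = [100, 200, 300, 400, 500, 600, 700]
Y = [40, 50, 50, 70, 65, 65, 80]
n = len(X)

sxy = sum(x*y for x, y in zip(X, Y)) - sum(X)*sum(Y)/n   # 16,500
sxx = sum(x*x for x in X) - sum(X)**2/n                  # 280,000
syy = sum(y*y for y in Y) - sum(Y)**2/n                  # 1,150

r_direct = sxy / math.sqrt(sxx * syy)                    # the formula above: ~0.92

b = sxy / sxx
r2 = sxy**2 / (sxx * syy)                                # R^2 = Sxy^2 / (Sxx * Syy)
r_from_r2 = math.copysign(math.sqrt(r2), b)              # sign(b) * sqrt(R^2): ~0.92

print(r_direct, r_from_r2)
```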

The sample correlation coefficient r is often used to estimate the population correlation coefficient $\rho$ (rho).

$$\rho = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}, \qquad \text{where } \text{Cov}(X, Y) = E\left[(X - \mu_X)(Y - \mu_Y)\right]$$

is the covariance of X and Y, and $\sigma_X$ and $\sigma_Y$ are the standard deviations of X and Y, respectively.

$\rho = 1$: There is a perfect positive linear relation.

$\rho = -1$: There is a perfect negative linear relation.

$\rho = 0$: There is no linear relation.

The correlation coefficient (and the covariance) tell how the variables move with each other.

$$-1 \le \rho \le 1$$

[Figure: correlation coefficient graphs, scatters of Y against X illustrating $\rho = 1$, $\rho = 0.8$, $\rho = 0.5$, $\rho = 0$, and $\rho = -1$.]

$R^2$ adjusted or corrected for degrees of freedom:

$$R_c^2 = 1 - \frac{\sum \left(Y_i - \hat{Y}_i\right)^2 / (n-2)}{\sum \left(Y_i - \bar{Y}\right)^2 / (n-1)} \qquad \text{or} \qquad R_c^2 = 1 - \left(1 - R^2\right)\left(\frac{n-1}{n-2}\right)$$

It is possible to compare specifications that would otherwise not be comparable by using the adjusted $R^2$.

The "2" is because we are estimating 2 parameters, $\alpha$ and $\beta$. This will change when we are estimating more parameters.

Adjusted $R^2$ for the wheat example:

$$R_c^2 = 1 - \left(1 - R^2\right)\left(\frac{n-1}{n-2}\right) = 1 - (1 - 0.846)\left(\frac{7-1}{7-2}\right) = 0.815$$
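As a one-line check (mine, not the slides'):

```python
# Sketch: adjusted R^2 for k estimated parameters (function name is mine).
def adjusted_r2(r2, n, k=2):
    """R^2 corrected for degrees of freedom; k is the number of estimated parameters."""
    return 1 - (1 - r2) * (n - 1) / (n - k)

print(adjusted_r2(0.846, 7))   # ~0.815, matching the wheat example
```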

Test on the correlation coefficient

Test $H_0: \rho = 0$ versus $H_1: \rho \ne 0$, using the statistic

$$t_{n-2} = \frac{r}{\sqrt{\left(1 - r^2\right)/(n-2)}}.$$

Test $H_0: \rho = 0$ versus $H_1: \rho \ne 0$ at the 5% level for the wheat example. Recall that r = 0.92 and n = 7.

$$t_5 = \frac{0.92}{\sqrt{\left(1 - 0.92^2\right)/(7-2)}} = 5.25$$

[Figure: the $t_5$ distribution with 0.025 in each tail; critical regions beyond -2.571 and 2.571.]

From our t table, we see that for 5 dof, and a 2-tailed critical region, our cut-off points are -2.571 and 2.571.

Since our t value of 5.25 is in the critical region, we reject H0 and accept H1 that the population correlation is not zero.
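A sketch (mine) of this test in Python; I assume scipy is available to look up the critical value rather than a printed t table:

```python
# Sketch: t test on the correlation coefficient (names are mine).
import math
from scipy import stats

r, n = 0.92, 7
t = r / math.sqrt((1 - r**2) / (n - 2))    # ~5.25
t_crit = stats.t.ppf(0.975, df=n - 2)      # ~2.571 (5 dof, 2-tailed 5% test)

print(t, t_crit, abs(t) > t_crit)          # True -> reject H0: rho = 0
```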

If our regression line slope estimate b is close to zero, that would indicate that the true slope $\beta$ might be zero. To test whether $\beta$ equals zero, we need to know the distribution of b. If $\varepsilon$ is normally distributed with mean 0 and standard deviation $\sigma$, then b is normally distributed with mean $\beta$ and standard deviation (or standard error)

$$\sigma_b = \frac{\sigma}{\sqrt{\sum X^2 - \frac{1}{n}\left(\sum X\right)^2}}.$$

Then

$$Z = \frac{b - \beta}{\sigma_b} = \frac{b - \beta}{\sigma \big/ \sqrt{\sum X^2 - \frac{1}{n}\left(\sum X\right)^2}}$$

is a standard normal variable.

Since we usually don't know $\sigma$, we estimate it using SER = $s_e$, and use a $t_{n-2}$ instead of the Z. So for our test statistic, we have

$$t_{n-2} = \frac{b - \beta}{s_b}, \qquad \text{where } s_b = \frac{\text{SER}}{\sqrt{\sum X^2 - \frac{1}{n}\left(\sum X\right)^2}}.$$

For the wheat example, test $H_0: \beta = 0$ vs. $H_1: \beta \ne 0$ at the 5% level.

Recall: b = 0.059, n = 7, SER = 5.94, $\sum X = 2800$, $\sum X^2 = 1{,}400{,}000$.

$$t_5 = \frac{0.059 - 0}{5.94 \big/ \sqrt{1{,}400{,}000 - \frac{1}{7}(2800)^2}} = \frac{0.059}{0.0112} = 5.27$$

[Figure: the $t_5$ distribution again, with 0.025 in each tail; critical regions beyond -2.571 and 2.571.]

From our t table, we see that for 5 dof, and a 2-tailed critical region, our cut-off points are -2.571 and 2.571.

Since our t value of 5.27 is in the critical region, we reject H0 and accept H1 that the slope is not zero.
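A sketch (mine) of the slope test:

```python
# Sketch: t test on the slope (names are mine).
import math

b, n, ser = 0.059, 7, 5.94
sum_x, sum_x2 = 2800, 1_400_000

s_b = ser / math.sqrt(sum_x2 - sum_x**2 / n)   # standard error of b: ~0.0112
t = (b - 0) / s_b                              # ~5.26 (5.27 on the slides, which round s_b)

print(s_b, t)
```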

Notice that the value of the statistic we calculated when testing $H_0: \rho = 0$ vs. $H_1: \rho \ne 0$ was 5.25, which is very close to the value of 5.27 that we found for the statistic when testing $H_0: \beta = 0$ vs. $H_1: \beta \ne 0$.

This is not a coincidence. When dealing with a regression with a single X value on the right side of the equation, testing whether there is a linear correlation between the 2 variables ($\rho = 0$) and testing whether the slope is zero ($\beta = 0$) are equivalent. Our values differ only because of rounding error.

We can do an ANOVA test based on the amount of variation in the dependent variable Y that is explained by the regression. This is referred to as testing the significance of the regression.

$H_0$: there is no linear relationship between X and Y (this is the same thing as $\beta$ equals zero).

$H_1$: there is a linear relationship between X and Y (this is the same thing as $\beta$ is not zero).

The statistic is

$$F_{1,\,n-2} = \frac{\text{MSR}}{\text{MSE}} = \frac{\text{SSR}/1}{\text{SSE}/(n-2)}.$$

Example: Test the significance of the regression in the wheat problem at the 5% level. Recall SSR = 973.5 and SSE = 176.5.

$$F_{1,5} = \frac{\text{MSR}}{\text{MSE}} = \frac{973.5/1}{176.5/5} = 27.58$$

The F table shows that for 1 and 5 degrees of freedom, the 5% critical value is 6.61.

Since our F has a value of 27.58, we reject H0: no linear relation and accept H1: there is a linear relation between wheat yield and fertilizer.

[Figure: the $F_{1,5}$ density; the acceptance region runs up to the 5% critical value 6.61, and our statistic 27.58 falls in the critical region.]
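A sketch (mine) of the F test, again assuming scipy is available for the critical value:

```python
# Sketch: F test for the significance of the regression (names are mine).
from scipy import stats

ssr, sse, n = 973.5, 176.5, 7
f = (ssr / 1) / (sse / (n - 2))               # MSR / MSE ~ 27.58
f_crit = stats.f.ppf(0.95, dfn=1, dfd=n - 2)  # ~6.61

print(f, f_crit, f > f_crit)                  # True -> reject H0: no linear relation
```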

For a regression with just one independent variable X on the right side of the equation, testing the significance of the regression is equivalent to testing whether the slope is zero.

Therefore, you might expect there to be a relationship between the statistics used for these tests, and there is one.

The F-statistic for this test is the square of the t-statistic for the test on $\beta$.

In our wheat example, the t-statistic for the test on $\beta$ was 5.27 and the critical value or cut-off point was 2.571.

For the F-test, the statistic was 27.58 ≈ (5.27)² and the critical value or cut-off point was 6.61 ≈ (2.571)². (The numbers don't match exactly because of rounding error.)

We can also calculate confidence intervals for the slope $\beta$:

$$b - t_{n-2}\,s_b \;\le\; \beta \;\le\; b + t_{n-2}\,s_b$$

Calculate a 95% confidence interval for the slope for the wheat example. Recall that b = 0.059, n = 7, and $s_b$ = 0.0112. We also found that the critical values for a 2-tailed t with 5 dof are -2.571 and 2.571.

$$0.059 - 2.571(0.0112) \;\le\; \beta \;\le\; 0.059 + 2.571(0.0112)$$

$$0.059 - 0.028 \;\le\; \beta \;\le\; 0.059 + 0.028$$

$$0.031 \;\le\; \beta \;\le\; 0.087$$

Our 95% confidence interval, $0.031 \le \beta \le 0.087$, means that we are 95% sure that the true slope of the relationship is between 0.031 and 0.087.

Since zero is not in this interval, the results also imply that for a 5% test level, we would reject $H_0: \beta = 0$ and accept $H_1: \beta \ne 0$.
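A sketch (mine) of the interval arithmetic:

```python
# Sketch: 95% confidence interval for the slope (names are mine).
b, s_b, t_crit = 0.059, 0.0112, 2.571

half_width = t_crit * s_b             # ~0.0288
print(b - half_width, b + half_width) # ~(0.030, 0.088); the slides round to (0.031, 0.087)
```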

Sometimes we want to calculate forecasting intervals for predicted Y values.

For example, perhaps we're working for an agricultural agency. A farmer calls to ask us for an estimate of the wheat yield that might be expected based on a particular fertilizer usage level on the farmer's wheat field. We might reply that we are 95% certain that the yield would be between 60 and 80 bushels per acre. A representative from a cereal company might ask for an estimate of the average wheat yield that might be expected based on that same fertilizer usage level on many wheat fields. To that question, we might reply that we are 95% certain that the yield would be between 65 and 75 bushels per acre. Our intervals would both be centered around the same number (70 in this example), but we can give a more precise prediction for an average of many fields than we can for an individual field.

The width of our forecasting intervals also depends on our level of expertise with the specified value of the independent variable.

Recall that the fertilizer values in our wheat problem had a mean of 400 and were all between 100 and 700. If someone asks about applying 2000 units of fertilizer to a field, we would probably feel less comfortable with our prediction than we would if the person asked about applying 500 units of fertilizer. The closer the value of X is to the mean value of our sample, the more comfortable we are with our numbers, and the narrower the interval required for a particular confidence level.

[Figure: forecasting intervals for the individual case and for the mean of many cases, drawn as bands around the regression line. The upper and lower endpoints for the individual case lie outside those for the mean of many cases, and both bands are narrowest near $\bar{X}$.]

Notice that the intervals for the individual case are wider than those for the mean of many cases. Also, all the intervals are narrower near the sample mean of the independent variable.

For the given level of X requested by our callers, we would have the following.

[Figure: at the given X, the point estimate is 70; the confidence interval for the mean of many cases runs from 65 to 75, while the confidence interval for the individual case runs from 60 to 80.]

Formulae for forecasting intervals

For both of the following intervals, $\hat{Y}_g = a + bX_g$.

Forecasting interval for the individual case:

$$\hat{Y}_g - t_{n-2}\,s_{\text{ind}} \;\le\; Y_g \;\le\; \hat{Y}_g + t_{n-2}\,s_{\text{ind}}, \qquad s_{\text{ind}} = \text{SER}\sqrt{1 + \frac{1}{n} + \frac{\left(X_g - \bar{X}\right)^2}{\sum X^2 - \frac{1}{n}\left(\sum X\right)^2}}$$

Forecasting interval for the mean of many cases:

$$\hat{Y}_g - t_{n-2}\,s_{\text{mean}} \;\le\; \mu_{Y \cdot X_g} \;\le\; \hat{Y}_g + t_{n-2}\,s_{\text{mean}}, \qquad s_{\text{mean}} = \text{SER}\sqrt{\frac{1}{n} + \frac{\left(X_g - \bar{X}\right)^2}{\sum X^2 - \frac{1}{n}\left(\sum X\right)^2}}$$

Example: If 550 pounds of fertilizer are applied in our wheat example, find the 95% forecasting interval for the mean wheat yield if we fertilized many fields.

Recall: a = 36.4, b = 0.059, n = 7, $t_5$ = 2.571, SER = 5.94, $\sum X = 2800$, $\sum X^2 = 1{,}400{,}000$.

$$\hat{Y}_g = a + bX_g = 36.4 + 0.059(550) = 68.85$$

$$s_{\text{mean}} = \text{SER}\sqrt{\frac{1}{n} + \frac{\left(X_g - \bar{X}\right)^2}{\sum X^2 - \frac{1}{n}\left(\sum X\right)^2}} = 5.94\sqrt{\frac{1}{7} + \frac{(550 - 400)^2}{1{,}400{,}000 - \frac{1}{7}(2800)^2}} = 2.81$$

$$68.85 - 2.571(2.81) \;\le\; \mu_{Y \cdot X_g} \;\le\; 68.85 + 2.571(2.81)$$

$$61.6 \;\le\; \mu_{Y \cdot X_g} \;\le\; 76.0$$

Example: If 550 pounds of fertilizer are applied in our wheat example, find the 95% forecasting interval for the wheat yield if we fertilized one field.

Again: a = 36.4, b = 0.059, n = 7, $t_5$ = 2.571, SER = 5.94, $\sum X = 2800$, $\sum X^2 = 1{,}400{,}000$.

$$\hat{Y}_g = a + bX_g = 36.4 + 0.059(550) = 68.85$$

$$s_{\text{ind}} = \text{SER}\sqrt{1 + \frac{1}{n} + \frac{\left(X_g - \bar{X}\right)^2}{\sum X^2 - \frac{1}{n}\left(\sum X\right)^2}} = 5.94\sqrt{1 + \frac{1}{7} + \frac{(550 - 400)^2}{1{,}400{,}000 - \frac{1}{7}(2800)^2}} = 6.56$$

$$68.85 - 2.571(6.56) \;\le\; Y_g \;\le\; 68.85 + 2.571(6.56)$$

$$51.9 \;\le\; Y_g \;\le\; 85.7$$

Comparing the two intervals, $61.6 \le \mu_{Y \cdot X_g} \le 76.0$ and $51.9 \le Y_g \le 85.7$: notice that, as we stated previously, the interval for the mean of many cases is narrower than the interval for the individual case.
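Finally, a sketch (mine) computing both forecasting intervals at $X_g = 550$:

```python
# Sketch: 95% forecasting intervals at X_g = 550 (names are mine).
import math

a, b, n, ser, t_crit = 36.4, 0.059, 7, 5.94, 2.571
sum_x, sum_x2 = 2800, 1_400_000
x_g = 550
x_bar = sum_x / n                                        # 400

y_hat_g = a + b * x_g                                    # 68.85
sxx = sum_x2 - sum_x**2 / n                              # 280,000
core = 1/n + (x_g - x_bar)**2 / sxx                      # term shared by both intervals

s_mean = ser * math.sqrt(core)                           # ~2.81 (mean of many cases)
s_ind = ser * math.sqrt(1 + core)                        # ~6.57 (individual case)

print(y_hat_g - t_crit*s_mean, y_hat_g + t_crit*s_mean)  # ~(61.6, 76.1)
print(y_hat_g - t_crit*s_ind, y_hat_g + t_crit*s_ind)    # ~(52.0, 85.7)
```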