
Chapter 14

Simple Linear Regression

Regression analysis is too big a topic for just one chapter in these notes. If you have an interest in this methodology, I recommend that you consider taking a course on regression. At the UW–Madison, for example, there is an excellent semester course on regression analysis, Statistics 333, that is offered at least once per year.

14.1 The Scatterplot and Correlation Coefficient

For each subject/trial/unit or case, as they tend to be called in regression, we have two numbers, denoted by X and Y. The number of greater interest to us is denoted by Y and is called the response. Predictor is the common term for the X variable. Very roughly speaking, we want to study whether there is an association or relationship between X and Y, with special interest in the question of using a case's value of X to describe or predict its value of Y.

It is very important to remember that the idea of experimental and observational studies introduced in Chapter 9 applies here too, in a way that will be discussed below.

We have data on n cases. When we think of them as random variables we use upper case letters and when we think of specific numerical values we use lower case letters. Thus, we have the n pairs

(X1, Y1), (X2, Y2), (X3, Y3), . . . (Xn, Yn),

which take on specific numerical values

(x1, y1), (x2, y2), (x3, y3), . . . (xn, yn).

The difference between experimental and observational lies in how we obtain the X's. There are two possibilities:

1. Experimental: The researcher deliberately selects the values of X1, X2, X3, . . . , Xn.

2. Observational: The researcher selects units (usually assumed to be at random from a population or to be i.i.d. trials) and observes the values of two random variables per unit.

Here are two brief examples.


1. Experimental: The researcher is interested in the yield, per acre, of a certain crop. Denote the yield by Y. The researcher believes that the yield will be affected by the concentration of a certain fertilizer that will be applied to the plants. The variable X represents the concentration of the fertilizer. The researcher selects n one-acre plots for study and selects n numbers, x1, x2, x3, . . . , xn, for the concentrations of the fertilizer. Finally, the researcher assigns the n selected values of X to the n plots by randomization.

2. Observational: The researcher selects n men at random from a population of men and measures each man's height, X, and weight, Y.

Note that for the experimental setting, the X's are not random variables, but for the observational setting, the X's are random variables.

It turns out that several key components of the desired analysis become impossible for random X's, so the standard practice—and all that we will cover here—is to condition on the values of the X's when they are random and henceforth pretend that they are not random. The main consequence of conditioning this way is that the computations and predictions still make sense, but we must temper our enthusiasm in our conclusions. Just as in Chapter 9, for the experimental setting we can claim causation, but for the observational setting, the most we get is association.

First, we will learn how to draw a picture of our data, called the scatterplot. Below I present eight scatterplots that come from three sets of data.

1. Seven offensive variables were determined for each of the 16 teams in Major League Baseball's National League in 2009. These data are presented in Table 14.1.

2. Three additional variables for the same 16 teams in 2009. (Actually, one variable—runs—is common to both data sets.) These data are presented in Table 14.2.

3. Two variables—the scores on the midterm and final exams—for 36 students in one of my sections of Statistics 371. These data are presented later in this chapter.

In most scientific problems, we are quite sure which variable should be the response Y, but might have a number of candidates for X. For example, for the data in Table 14.1, the obvious choice, to me, for Y is the number of runs scored by the team. (If you are a baseball fan, this position of mine likely makes sense; if you are not a baseball fan, don't worry about this issue.) Any one of the remaining six variables could be taken as X.

Before I proceed, let me digress and mention a topic we will not be considering in this chapter, but that is of great interest in science. This topic is covered in any course devoted to regression analysis. In this chapter, we restrict our attention to problems with exactly one X variable. The use of one predictor is conveyed by the adjective simple and the method we learn is simple regression analysis. It is often desirable in science to allow for two or more predictors; this method is noted by the adjective multiple and the method is referred to as multiple regression analysis. A regression analyst often has either or both of the goals of using the predictor(s) to describe or predict the value of the response. Rather obviously, there is no reason to restrict our attention to only one predictor. (For example, if we want to predict the height of an adult male, it seems sensible to use both of his parents' heights as predictors.)


Table 14.1: Various Team Statistics, National League, 2009.

Team           Runs  Triples  Home Runs   BA    OBP   SLG   OPS
Philadelphia    820     35       224      .258  .334  .447  .781
Colorado        804     50       190      .261  .343  .441  .784
Milwaukee       785     37       182      .263  .341  .426  .767
LA Dodgers      780     39       145      .270  .346  .412  .758
Florida         772     25       159      .268  .340  .416  .756
Atlanta         735     20       149      .263  .339  .405  .744
St. Louis       730     29       160      .263  .332  .415  .747
Arizona         720     45       173      .253  .324  .418  .742
Washington      710     38       156      .258  .337  .406  .743
Chicago Cubs    707     29       161      .255  .332  .407  .738
Cincinnati      673     25       158      .247  .318  .394  .712
NY Mets         671     49        95      .270  .335  .394  .729
San Francisco   657     43       122      .257  .309  .389  .699
Houston         643     32       142      .260  .319  .400  .719
San Diego       638     31       141      .242  .321  .381  .701
Pittsburgh      636     34       125      .252  .318  .387  .705

For the data in Table 14.2 the natural choice for Y is the number of wins achieved by the team during the 162-game regular season. Either of the remaining variables (or both, but not in this chapter) is an attractive candidate for the predictor. As baseball fans know, the total number of runs a team scores is a measure of its offensive performance and its earned run average (ERA) is a measure of the effectiveness of its pitching staff. (Side note: of the ten variables in the two tables that have been discussed, ERA is unique in that it is the only variable for which smaller values reflect a better performance.)

For the last of our three examples, it seems more natural, to me, to let Y denote the student's score on the final and X denote the student's score on the midterm. With this set-up we will learn how to use a particular student's midterm score to predict his/her score on the final exam. But the methods we learn could be applied to the reverse problem: using the score on the final to predict the score on the midterm. For this latter situation, Y would be the midterm score and X the final score.

Take a minute and quickly do a visual scan of the scatterplots in Figures 14.1–14.8. First, I need to explain how to read a scatterplot. Look at Figure 14.1. Locate the circle that is farthest to the left in the picture. You can see that its x (horizontal) coordinate is approximately 20 and its y (vertical) coordinate is approximately 730. Now, look at Table 14.1 again and note that Atlanta has x = 20 and y = 735; thus, this circle in the picture presents the values of x and y for Atlanta. Similarly, each of the 16 circles in the scatterplot represents a different team's values of x and y. Take a minute now, please, and make sure you are able to locate the circles for Philadelphia (x = 35 and y = 820) and Colorado.
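(The following sketch is my addition, not part of the original notes.) If you would like to draw a picture like Figure 14.1 yourself, a few lines of Python will do it, assuming the matplotlib package is installed; the data are the (triples, runs) pairs of Table 14.1.

# Minimal sketch (assumes matplotlib): a scatterplot like Figure 14.1,
# using the (triples, runs) pairs transcribed from Table 14.1.
import matplotlib.pyplot as plt

triples = [35, 50, 37, 39, 25, 20, 29, 45, 38, 29, 25, 49, 43, 32, 31, 34]
runs = [820, 804, 785, 780, 772, 735, 730, 720, 710, 707, 673, 671, 657, 643, 638, 636]

plt.scatter(triples, runs, facecolors="none", edgecolors="black")  # open circles, as in the figures
plt.xlabel("Triples")
plt.ylabel("Runs scored")
plt.title("Runs scored versus triples, National League 2009")
plt.show()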


Table 14.2: Wins, Runs Scored and Earned Run Average for the National League Teams in 2009. (One game, Pittsburgh at Chicago, was canceled; I arbitrarily count it as a victory for Chicago.)

Team            Wins  Runs   ERA    Team          Wins  Runs   ERA
LA Dodgers        95   780   3.50   Milwaukee       80   785   4.84
Philadelphia      93   820   4.15   Cincinnati      78   673   4.18
Colorado          92   804   4.26   San Diego       75   638   4.37
St. Louis         91   730   3.66   Houston         74   643   4.54
San Francisco     88   657   3.55   Arizona         70   720   4.44
Florida           87   772   4.32   NY Mets         70   671   4.46
Atlanta           86   735   3.57   Pittsburgh      62   636   4.59
Chicago Cubs      84   707   3.84   Washington      59   710   5.02

Figure 14.1: Scatterplot of Runs Scored Versus the Number of Triples. [Plot not reproduced: runs (600–840) versus triples (10–60); r = 0.100.]


Figure 14.2: Scatterplot of Runs Scored Versus Batting Average. [Plot not reproduced: runs (600–840) versus BA (.240–.275); r = 0.510.]

Figure 14.3: Scatterplot of Runs Scored Versus the Number of Home Runs. [Plot not reproduced: runs (600–840) versus home runs (90–240); r = 0.756.]


Figure 14.4: Scatterplot of Runs Scored Versus OPS. [Plot not reproduced: runs (600–840) versus OPS (0.690–0.790); r = 0.964.]

Figure 14.5: Scatterplot of Wins Versus Runs Scored. [Plot not reproduced: wins (55–95) versus runs (635–815); r = 0.627.]


Figure 14.6: Scatterplot of Wins Versus Earned Run Average. [Plot not reproduced: wins (55–95) versus ERA (3.45–5.05); r = −0.748.]

Figure 14.7: Final Exam Score Versus Midterm Exam Score for 36 Students. There is a '2' in the Scatterplot Because Two Subjects had (x, y) = (55.5, 96.0). [Plot not reproduced: final (80–100) versus midterm (35–60); r = 0.353.]


Figure 14.8: Final Exam Score Versus Midterm Exam Score for 35 Students, After Deleting One Isolated Case. There is a '2' in the Scatterplot Because Two Subjects had (x, y) = (55.5, 96.0). [Plot not reproduced: final (80–100) versus midterm (35–60); r = 0.464.]

Now look at Figure 14.3, the scatterplot of runs scored versus home runs. Scan the picture, running your eyes from left to right, which corresponds to increasing the value of x (going from the lowest value of x, which is farthest to the left, to the largest value of x, which is farthest to the right). As your eyes are scanning left-to-right, what happens to the circles? Well, there is not a deterministic relationship, such as all the circles being on a line or a curve; but there is a tendency for the circles to rise (i.e., y is becoming larger) as the x values become larger. Also, in my judgment, the tendency looks like a straight-line tendency rather than a (more complicated) curved tendency.

Now look at the remaining three scatterplots of runs versus some X. Here are the conclusions I draw:

1. In all scatterplots the tendency between x and y appears to be linear (a straight line) rather than curved.

2. In the first scatterplot, in which X is the number of triples, whereas the tendency is linear, it is neither increasing nor decreasing; just flat. In the remaining three scatterplots the tendency is definitely increasing. The increasing tendency is strongest for X equal to OPS; is weakest for X equal to BA; with X equal to Home Runs falling between these extremes.

Each scatterplot also presents a number denoted by r, which takes on the values r = 0.100 (for X equal to number of triples); r = 0.510 (BA); r = 0.756 (Home Runs); and r = 0.964 (OPS). This number r is called the (Pearson's product moment) correlation coefficient. There is an algebraic formula for r, but I will not present it for two reasons:

1. It is quite a mess to compute; as a result, we will use a computer to obtain its value.

180

Page 9: Chapter 14 Simple Linear Regressionpages.stat.wisc.edu/~wardrop/courses/371chapter14spr2012.pdf · Simple Linear Regression Regression analysis is too big a topic for just one chapter

2. There is some insight to be gained by examining the algebraic formula for r, but not enough to squeeze this topic into our ever diminishing remaining time.
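(An aside of mine, not part of the original notes.) Since we will rely on a computer anyway, here is a minimal sketch, assuming the numpy package, of how r can be computed from the deviations from the means; applied to the triples and runs columns of Table 14.1 it reproduces the r = 0.100 of Figure 14.1, and numpy's built-in routine provides a check.

# Minimal sketch (assumes numpy): Pearson's correlation coefficient, computed two ways.
import numpy as np

triples = np.array([35, 50, 37, 39, 25, 20, 29, 45, 38, 29, 25, 49, 43, 32, 31, 34])
runs = np.array([820, 804, 785, 780, 772, 735, 730, 720, 710, 707,
                 673, 671, 657, 643, 638, 636])

dx = triples - triples.mean()
dy = runs - runs.mean()
r = np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

print(round(r, 3))                                 # 0.100, as reported in Figure 14.1
print(round(np.corrcoef(triples, runs)[0, 1], 3))  # the same value from numpy's routine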

(I went online and found a presentation of the formula for r on Wikipedia.) Please take a minute to look at the remaining four scatterplots. The values of the correlation coefficient for these data sets are:

• r = 0.627 for victories versus runs scored;

• r = −0.748 for victories versus ERA;

• r = 0.353 for final versus midterm; and

• r = 0.464 for final versus midterm after the deletion of the unusual case with x = 35.5 and y = 95.5.

Note the following.

1. We have our first example of a negative number for the correlation coefficient. This reflects the (visual) observation that the tendency between x and y is decreasing—as the ERA increases (noted earlier, a bad thing), the number of wins decreases.

2. Consider the two scatterplots for the scores on exams. The deletion of only one case (out of 36) results in a substantial increase in the value of the correlation coefficient. In the terminology of Chapter 10, the value of the correlation coefficient is fragile to unusual cases. Also note that both scatterplots contain the numeral '2' because two students had (x, y) = (55.5, 96.0), and, of course, two circles placed at the same place look like one circle.

The correlation coefficient has several important properties. They are listed below, after a bit of terminology in the first item.

1. If the correlation coefficient is greater than zero, the variables Y and X are said to have a positive linear relationship; if it is less than zero, the variables are said to have a negative linear relationship; if it equals zero, the variables are said to have no linear relationship, or to be uncorrelated.

2. The correlation coefficient is not appropriate for summarizing a curved relationship between Y and X. This fact is illustrated in Figure 14.9, in which there is a perfect (deterministic) curved relationship between Y and X, yet the correlation coefficient equals zero. Therefore, it is always necessary to examine a scatterplot of the data to determine whether computation of the correlation coefficient is appropriate.

3. The value of the correlation coefficient is always between −1 and +1. It equals +1 if, and only if, all data points lie on a straight line with positive slope; it equals −1 if, and only if, all data points lie on a straight line with negative slope. (Extra: Why is it that statisticians and scientists are not interested in data sets for which all points lie on a horizontal or vertical line?)


4. The farther the value of the correlation coefficient is from zero, in either direction, the 'stronger' the linear relationship. This fact will be justified and made more precise later in this chapter.

5. The value of the correlation coefficient does not depend on the units of measurement chosen by the experimenter. More precisely, if X is replaced by aX + b and Y is replaced by cY + d, where a, b, c, and d are any numbers with a and c bigger than zero, then the correlation coefficient of the new variables is equal to the correlation coefficient of X and Y. (The numbers a and c are required to be positive to avoid reversing the direction of the relationship. If a and c are both negative, r is unchanged; if exactly one of a and c is negative, r becomes its negative.) This result is true because the correlation coefficient is defined in terms of the standardized values of X and Y (not shown in this text) and these do not change when the units change. Among other examples, this result shows that changing from miles to inches, pounds to kilograms, degrees Celsius to degrees Fahrenheit, or seconds to hours will not change the correlation coefficient.

6. The correlation coefficient is symmetric in X and Y. In other words, if the researcher interchanges the labels predictor and response, the correlation coefficient will not change. In particular, if there is no natural assignment of the labels predictor and response to the two numerical variables, the value of the correlation coefficient is not affected by which assignment is chosen.

Here is an example of item 6: Suppose that in the population of married couples, you want to study the relationship between the husband's IQ and the wife's IQ. To me, there is no natural way to assign the labels response and predictor to these variables.

In view of item 4 in the above list, let's revisit the two scatterplots of wins versus runs and wins versus ERA. Recall that the correlation coefficients are r = 0.627 for runs and r = −0.748 for ERA. By item 4, descriptively, ERA is a better predictor than runs. Sadly, there is no way to compare these two predictors inferentially, for example, with a hypothesis test. This result is beyond the scope of these notes.

14.2 The Simple Linear Regression Model

I am now ready to show you the simple linear regression model. We assume the following relationship between X and Y:

Yi = β0 + β1 Xi + εi,  for i = 1, 2, 3, . . . , n.   (14.1)

It will take some time to examine and explain this model. First, β0 and β1 are parameters of the model; this means, of course, that they are numbers and by changing either or both of their values we get a different model. The actual numerical values of β0 and β1 are known by Nature and unknown to the researcher; thus, the researcher will want to estimate both of these parameters and, perhaps, test hypotheses about them.

The εi's are random variables with the following properties: they are i.i.d. with mean 0 and variance σ². Thus, σ² is the third parameter of the model. Again, its value is known by Nature but unknown to the researcher.


Figure 14.9: An Example of a Curved Relationship Between Y and X. [Plot not reproduced: the points lie exactly on a curve, yet r = 0.000.]

It is very important to note that we assume these εi's, which are called errors, are statistically independent. In addition, we assume that every case's error has the same variance.

Oh, and by the way, not only is σ² unknown to the researcher, the researcher does not get to observe the εi's. Remember this: all that the researcher observes are the n pairs of (x, y) values.
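(A sketch of my own, not in the original notes.) To make the model concrete, the following few lines of Python, assuming numpy, play the role of Nature: they fix β0, β1 and σ (the values below are invented for illustration), generate the errors, and hand the researcher only the (x, y) pairs. The model only requires that the errors have mean 0 and common variance σ²; a normal distribution is used here merely for convenience.

# Minimal sketch (assumes numpy): simulate data from the model of Equation 14.1.
import numpy as np

rng = np.random.default_rng(seed=371)

beta0, beta1, sigma = 68.0, 0.45, 4.6                  # invented values, known only to "Nature"
x = np.array([39.0, 44.0, 49.0, 53.0, 55.5, 59.0])     # predictor values, treated as constants

epsilon = rng.normal(0.0, sigma, size=len(x))          # i.i.d. errors with mean 0 and variance sigma^2
y = beta0 + beta1 * x + epsilon                        # Equation 14.1

# The researcher sees only these pairs, not beta0, beta1, sigma, or the errors.
print(list(zip(x, np.round(y, 1))))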

Now, we look at some consequences of our model. I will be using the rules of means and variances from Chapter 7. It is not necessary for you to know how to recreate the arguments below.

Remember, the Yi's are random variables; the Xi's are viewed as constants. The mean of Yi given Xi is

μ_{Yi|Xi} = β0 + β1 Xi + μ_{εi} = β0 + β1 Xi,   (14.2)

because the mean of a constant is the constant and the mean of each error is 0. The variance of Yi is

σ²_{Yi} = σ²_{εi} = σ²,   (14.3)

because the variance of a constant (β0 + β1 Xi) is 0.

In words, we see that the relationship between X and Y is that the mean of Y given the value of X is a linear function of X with y-intercept given by β0 and slope given by β1.

The first issue we turn to is: How do we use data to estimate the values of β0 and β1?

First, note that this is a familiar question, but the current situation is much more difficult than any we have encountered. It is familiar because we previously have asked the questions: How do we estimate p? Or μ? Or ν? In all previous cases, our estimate was simply the 'sample version' of the parameter; for example, to estimate the proportion of successes in a population, we use the proportion of successes in the sample. There is no such easy analogy for the current situation: for example, what is the sample version of β1?


Instead we adopt the Principle of Least Squares to guide us. Here is the idea. Suppose that we have two candidate pairs for the values of β0 and β1; namely,

1. β0 = c0 and β1 = c1 for the first candidate pair, and

2. β0 = d0 and β1 = d1 for the second candidate pair,

where c0, c1, d0 and d1 are known numbers, possibly calculated from our data. The Principle of Least Squares tells us which candidate pair is better. Our data are the values of the y's and the x's. We know from above that the mean of Y given X is β0 + β1 X, so we evaluate the candidate pair c0, c1 by comparing the value of yi to the value of c0 + c1 xi, for all i. We compare, as we usually do, by subtracting to obtain:

yi − [c0 + c1xi].

This quantity can be negative, zero or positive. Ideally, it is 0, which indicates that c0 + c1 xi is an exact match for yi. The farther this quantity is from 0, in either direction, the worse job c0 + c1 xi is doing in describing or predicting yi. The Principle of Least Squares tells us to take this quantity and square it:

(yi − [c0 + c1 xi])²,

and then sum this square over all cases:

∑_{i=1}^{n} (yi − [c0 + c1 xi])².

Thus, to compare two candidate pairs, c0, c1 and d0, d1, we calculate two sums of squares:

∑_{i=1}^{n} (yi − [c0 + c1 xi])²   and   ∑_{i=1}^{n} (yi − [d0 + d1 xi])².

If these sums are equal, then we say that the pairs are equally good. Otherwise, whichever sum is smaller designates the better pair. This is the Principle of Least Squares, although with only two candidate pairs, it should be called the Principle of Lesser Squares.

Of course, it is impossible to compute the sum of squares for every possible set of candidates, but we don't need to do this if we use calculus. Define the function

f(c0, c1) = ∑_{i=1}^{n} (yi − [c0 + c1 xi])².

Our goal is to find the pair of numbers (c0, c1) that minimizes the function f. If you have studied calculus, you know that differentiation is a method that can be used to minimize a function. If we take the two partial derivatives of f, one with respect to c0 and one with respect to c1; set the two resulting equations equal to 0; and solve for c0 and c1, we find that f is minimized at (b0, b1) with

b1 = ∑(xi − x̄)(yi − ȳ) / ∑(xi − x̄)²   and   b0 = ȳ − b1 x̄.   (14.4)

184

Page 13: Chapter 14 Simple Linear Regressionpages.stat.wisc.edu/~wardrop/courses/371chapter14spr2012.pdf · Simple Linear Regression Regression analysis is too big a topic for just one chapter

We write

ŷ = b0 + b1 x.   (14.5)

This is called the regression line or least squares regression line or least squares prediction line or even just the best line. You will not be required to compute (b0, b1) by hand from raw data. Instead, I will show you how to interpret computer output from the Minitab statistical software system.
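(The following sketch is my addition.) If you do want to check Equation 14.4 numerically, a few lines of Python, assuming numpy, will do it; the data below are invented for illustration, and np.polyfit provides an independent check.

# Minimal sketch (assumes numpy): the least squares estimates of Equation 14.4.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # a small invented data set
y = np.array([2.1, 2.9, 3.6, 4.8, 5.1])

dx = x - x.mean()
b1 = np.sum(dx * (y - y.mean())) / np.sum(dx**2)   # slope
b0 = y.mean() - b1 * x.mean()                      # intercept

print(b1, b0)
print(np.polyfit(x, y, 1))   # returns [slope, intercept]; should agree with b1 and b0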

There is another way to write the expression for b1 given in Equation 14.4. This alternative formula does not help us computationally—we don't compute by hand—but, as we will see later, it gives us additional insight into the regression line.

Remember that every case gives us the values of two numerical variables, X and Y. Regression analysis focuses on finding an association between these variables, but it is certainly allowable to look at them separately. In particular, let x̄ and sx denote the mean and standard deviation of the x's in the data set, and let ȳ and sy denote the mean and standard deviation of the y's in the data set. With this notation, it can be shown that (details not given)

b1 = r(sy/sx), (14.6)

where, recall, r denotes the correlation coefficient of X and Y. Note that sy (sx) is a summary of the y (x) values alone. With this observation, Equation 14.6 has the following interesting implication:

Beyond the separate summaries of the x's and the y's, the correlation coefficient contains all the information about the association between X and Y that we need in order to compute the regression line.

Thus, we see that r has at least one important feature.

In a regression analysis, every case begins with two values: x and y. Applying Equation 14.5 we can obtain a third value for each case, namely its predicted response ŷ. Finally, a fourth value for each case is its residual, denoted by e and defined below:

ei = yi − ŷi = yi − b0 − b1 xi.   (14.7)

Note the following features of residuals:

1. A case's residual can be positive, zero or negative. A residual of zero is ideal in that it means ŷi = yi; in words, the predicted response is equal to the actual response. A positive residual means that yi is larger than ŷi; i.e., the actual response is larger than predicted. Finally, a negative residual means that yi is smaller than ŷi; i.e., the actual response is smaller than predicted.

2. In view of the previous item, the farther the residual is from 0, in either direction, the worse the agreement between the actual and predicted responses.

It turns out that the residuals have two linear constraints:

∑ ei = 0   and   ∑ (xi ei) = 0.   (14.8)
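(Again my addition, not part of the notes.) Continuing the earlier sketch, the fitted values, the residuals, and the two constraints of Equation 14.8 can be checked numerically:

# Minimal sketch (assumes numpy): fitted values, residuals, and the constraints of Equation 14.8.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # same invented data as in the earlier sketch
y = np.array([2.1, 2.9, 3.6, 4.8, 5.1])

dx = x - x.mean()
b1 = np.sum(dx * (y - y.mean())) / np.sum(dx**2)
b0 = y.mean() - b1 * x.mean()

y_hat = b0 + b1 * x        # predicted responses, Equation 14.5
e = y - y_hat              # residuals, Equation 14.7

print(np.isclose(e.sum(), 0.0))         # True: the residuals sum to 0
print(np.isclose((x * e).sum(), 0.0))   # True: the x_i e_i also sum to 0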


14.3 Reading Computer Output

In this section we will study a data set on n = 35 students in my Statistics 371 class. The two variables are: Y, the score on the final exam, and X, the score on the midterm exam. Recall that Figure 14.8 presents a scatterplot of these data and that the correlation coefficient is r = 0.464. Table 14.3 presents output obtained from my statistical software package, Minitab. The output has been edited in order to fit on one page. The deleted item will be presented later; it is not necessary for our analysis. Note that there is some redundancy in this output; the most obvious example is that to the right of the observation numbers 21 and 22, the remaining entries are identical. We will work through the output in some detail.

Before turning to an inspection of this computer output, I want to give you a few facts about these data. I grade my exams in half-point increments. Thus, for example, 88.0, 88.5 and 89.0 are all possible scores on my final exam. The maximum number of possible points was 60.0 on the midterm exam and 100.0 on the final exam. The midterm exam scores ranged from a low of 39.0 to a high of 59.0; the final exam scores ranged from a low of 81.0 to a high of 99.5. The means and standard deviations are below.

x̄ = 52.786,   sx = 5.295,   ȳ = 92.257,   sy = 5.154.   (14.9)

The first information in the output is the line:

The regression equation is: Final = 68.4 + 0.452 Midterm

This is Minitab’s way of telling us that the regression line:

ŷ = b0 + b1 x   is   ŷ = 68.4 + 0.452x.

Thus, we see that the least squares estimates of β0 and β1 are b0 = 68.4 and b1 = 0.452, respectively. Minitab is user friendly in that it uses words (admittedly, provided by me, the programmer) to remind us that the response is the score on the final exam and the predictor is the score on the midterm. But Minitab is quite anachronistic (we will see more of this below) in not having a 'hat' above 'Final.' (Reason: Minitab is a very old software package. It was designed to create output for a typewriter-like machine; i.e., no hats, no subscripts, no Greek letters and no math symbols. For some reason—unknown to me—the package has never been updated—well, at least not on my version, which is a few years old—to utilize modern printers.)

Next in the output are the three lines

Predictor      Coef   SE Coef      T      P
Constant     68.420     7.963    8.59  0.000
Midterm      0.4516    0.1501    3.01  0.005

The first column in this group provides labels for the rows of this table. Its heading, Predictor, seems strange because we only have one predictor, namely Midterm. The output has promoted the intercept to Predictor status because that is a natural thing to do in multiple regression (again, details not given).


Table 14.3: Edited Minitab Output for the Regression of Final Exam Score on Midterm Exam Score for 35 Students. R denotes an observation with a large standardized residual; X denotes an observation whose X value gives it large influence.

The regression equation is: Final = 68.4 + 0.452 Midterm

Predictor      Coef   SE Coef      T      P
Constant     68.420     7.963    8.59  0.000
Midterm      0.4516    0.1501    3.01  0.005

S = 4.635   R-Sq = 21.5%

Obs  Midterm  Final     Fit  SE Fit  Residual  St Resid
  1     39.0   83.5  86.032   2.213    -2.532     -0.62 X
  2     43.0   89.0  87.838   1.665     1.162      0.27
  3     44.0   92.0  88.290   1.534     3.710      0.85
  4     44.5   82.0  88.515   1.470    -6.515     -1.48
  5     46.0   93.0  89.193   1.285     3.807      0.86
  6     48.0   87.0  90.096   1.063    -3.096     -0.69
  7     48.5   91.5  90.322   1.014     1.178      0.26
  8     49.0   99.5  90.548   0.968     8.952      1.98
  9     49.0   88.0  90.548   0.968    -2.548     -0.56
 10     49.0   81.0  90.548   0.968    -9.548     -2.11 R
 11     49.5   91.0  90.773   0.926     0.227      0.05
 12     49.5   95.0  90.773   0.926     4.227      0.93
 13     50.0   97.5  90.999   0.888     6.501      1.43
 14     53.0   83.0  92.354   0.784    -9.354     -2.05 R
 15     53.0   93.0  92.354   0.784     0.646      0.14
 16     53.5   97.0  92.580   0.791     4.420      0.97
 17     53.5   99.0  92.580   0.791     6.420      1.41
 18     54.5   95.5  93.031   0.825     2.469      0.54
 19     54.5   83.0  93.031   0.825   -10.031     -2.20 R
 20     55.0   94.5  93.257   0.851     1.243      0.27
 21     55.5   96.0  93.483   0.883     2.517      0.55
 22     55.5   96.0  93.483   0.883     2.517      0.55
 23     55.5   92.5  93.483   0.883    -0.983     -0.22
 24     56.5   93.5  93.934   0.962    -0.434     -0.10
 25     56.5   89.5  93.934   0.962    -4.434     -0.98
 26     56.5   90.5  93.934   0.962    -3.434     -0.76
 27     57.0   92.0  94.160   1.007    -2.160     -0.48
 28     57.5   97.5  94.386   1.056     3.114      0.69
 29     58.5   99.0  94.838   1.162     4.162      0.93
 30     58.5   95.0  94.838   1.162     0.162      0.04
 31     58.5   91.0  94.838   1.162    -3.838     -0.86
 32     58.5   97.5  94.838   1.162     2.662      0.59
 33     59.0   91.5  95.063   1.218    -3.563     -0.80
 34     59.0   98.5  95.063   1.218     3.437      0.77
 35     59.0   94.0  95.063   1.218    -1.063     -0.24


Figure 14.10: Final Exam Score Versus Midterm Exam Score for 35 Students. [Plot not reproduced: the scatterplot of Figure 14.8 (final, 80–100, versus midterm, 35–60; r = 0.464) with the regression line ŷ = 68.42 + 0.4516x drawn as a solid line and dotted lines at ŷ + s and ŷ − s.]

So, to summarize, don't fret about the heading on the first column above; the important thing is that the row Constant (Midterm) provides information about b0 (b1).

The first feature to note in these three lines is that the computer has given us more precision in the values of b0 and b1; for example, earlier b0 = 68.4, but now b0 = 68.42.
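(The following sketch is my addition.) Readers without Minitab can reproduce essentially the same output with free software; here is a hedged sketch using Python's statsmodels package and the 35 (midterm, final) pairs of Table 14.3.

# Minimal sketch (assumes numpy and statsmodels): reproduce the coefficient table,
# s, and R-Sq of Table 14.3.
import numpy as np
import statsmodels.api as sm

midterm = np.array([39.0, 43.0, 44.0, 44.5, 46.0, 48.0, 48.5, 49.0, 49.0, 49.0,
                    49.5, 49.5, 50.0, 53.0, 53.0, 53.5, 53.5, 54.5, 54.5, 55.0,
                    55.5, 55.5, 55.5, 56.5, 56.5, 56.5, 57.0, 57.5, 58.5, 58.5,
                    58.5, 58.5, 59.0, 59.0, 59.0])
final = np.array([83.5, 89.0, 92.0, 82.0, 93.0, 87.0, 91.5, 99.5, 88.0, 81.0,
                  91.0, 95.0, 97.5, 83.0, 93.0, 97.0, 99.0, 95.5, 83.0, 94.5,
                  96.0, 96.0, 92.5, 93.5, 89.5, 90.5, 92.0, 97.5, 99.0, 95.0,
                  91.0, 97.5, 91.5, 98.5, 94.0])

X = sm.add_constant(midterm)       # the "Constant" column plus the Midterm column
fit = sm.OLS(final, X).fit()

print(fit.params)          # about 68.42 and 0.4516 (Coef)
print(fit.bse)             # about 7.963 and 0.1501 (SE Coef)
print(fit.tvalues)         # about 8.59 and 3.01 (T)
print(fit.pvalues)         # the P column
print(np.sqrt(fit.scale))  # s = 4.635
print(fit.rsquared)        # R-Sq = 0.215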

Next, skip down to the 36 lines of output headed by:

Obs Midterm Final Fit SE Fit Residual St Resid

The first column numbers the observations 1–35 for ease of reference. The second and third columns—Midterm and Final—list the values of x (Midterm) and y (Final) for each of the 35 cases. The fourth column—Fit—lists the value of ŷ for each case. It is important to pause for a moment and make sure you understand the values x, y and ŷ. The values of x and y are the actual exam scores for the case (student). The value of ŷ is the predicted response for the case, which is obtained by plugging the case's value of x into the regression line. Thus, for example, case 1 scored 39.0 on the midterm and 83.5 on the final. The predicted final score for this student is

ŷ = 68.42 + 0.4516(39.0) = 86.0324,

which the computer rounds to 86.032.

Let me digress briefly and mention a curious convention in reporting the value of ŷ. In Chapter 7 we learned how to predict the total number of successes in m future Bernoulli trials. At that time, I said, for example, that a point prediction of 72.3 should be rounded to 72 because it is impossible to obtain a fractional number of successes. Following this principle, you might expect me to round 86.032 to 86.0 because, as mentioned earlier, exam scores could occur only in one-half point increments; i.e., 85.5, 86.0 and 86.5 are all possible scores on the final, but not 86.032. Surprisingly (?) we never round in regression. There are two reasons for this:


1. If we rounded, then the regression line would not really be a straight line—it would be horizontal between jumps, called a step-function. Statisticians won't do this; we won't call it a line when it is a step-function!

2. In many (most?) applications the variable Y is a measurement, for example, a distance or a time or a weight. In these cases we would not have any reason to round. It's confusing to round sometimes and not others. Thus, we don't round.

Let's return to our examination of case 1. This student's final exam score is a disappointment in two ways. First, only four students scored lower than 83.5; thus, relative to the class as a whole, this student did poorly on the final. Second, the student's actual y is smaller than the score predicted by his/her score on the midterm: 83.5 versus ŷ = 86.032. As statisticians, we are much more interested in this second form of disappointment. Indeed, we see that for this student, the residual is

e = y − ŷ = 83.5 − 86.032 = −2.532,

as reported in column 6. (We will learn about the meaning and utility of the entries in column 5 later.)

After a brief reflection, we see that a negative (positive) number for a residual means that the student performed worse (better) than predicted by the regression line. For these data, overall 15 students have negative residuals and 20 have positive residuals. To be fair, we should note that two students with positive residuals (numbers 11 and 30) scored exactly as predicted, if we round off ŷ.

Now, please refer to Figure 14.10. With three additions, this is the scatterplot of final versus midterm that was presented earlier in Figure 14.8. First, note the solid line, which is the graph of our regression line. This allows us to visualize how the regression line performs in making predictions for our data set. Below are a few of many possible things to note.

1. The student with the largest value of y (99.5; case 8 with x = 49.0) is, happily for the student, represented by the circle that is 8.952 points above the regression line's prediction.

2. The two students with residuals closest to 0 (cases 11 and 30, as mentioned earlier) have circles that touch the line. This touching shows the near agreement between y and ŷ.

3. Student 19 has the distinction of having the negative residual farthest from 0. The actual final (y = 83.0) is more than 10 points lower than the prediction based on the fairly high midterm exam score (x = 54.5).

There are n = 35 residuals in our data set with two linear restrictions, given earlier in Equation 14.8. Thus, the residuals have (n − 2) degrees of freedom. I want to examine the spread in the distribution of the residuals. Following the ideas of Chapter 10, I begin with the variance of the residuals:

s² = ∑(e − ē)² / (n − 2) = ∑ e² / (n − 2),


because the mean of the residuals is 0 (because their sum is 0). Minitab does this computation for us and reports the value of the standard deviation of the residuals, which can be found in the computer output immediately above the listing of the cases; s = 4.635 for these data.

The value of s is very important. Here is the reasoning. The residuals, as a group, tell us how well the actual y's agree with their predicted values. If we apply the empirical rule from Chapter 10, we conclude that for approximately 68% of the cases the value of e falls between −s and +s. (Remembering, again, that ē = 0.) In other words, for approximately 68% of the cases the value of y is within s units of its predicted value; i.e., the absolute error in prediction is at most s. Thus, the value of s tells us how good the regression line is at making predictions.

It is natural to wonder whether the empirical rule approximation is any good for our data. It is a reasonable, if rough, approximation; here are two ways to check it.

1. Scan down the numbers in column 6 of Table 14.3. I count seven residuals that lie outside the interval

[−s, +s] = [−4.635, +4.635].

If you want to check this, note that the extreme residuals belong to cases 4, 8, 10, 13, 14, 17 and 19.

2. (This way is more fun!) Look again at Figure 14.10. In addition to the regression line, the picture contains two dotted lines. The 28 circles that lie between the two dotted lines correspond to the 28 cases that have a residual in the interval [−s, +s].

Thus, we see that 28/35 = 0.80, or 80%, of the residuals lie in the interval [−s, +s], which is in rough agreement with the empirical rule's 68% approximation.
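(A small check, added by me and not in the original notes.) The count is easy to automate from the Residual column of Table 14.3, assuming numpy:

# Minimal sketch (assumes numpy): count the residuals of Table 14.3 that fall in [-s, +s].
import numpy as np

s = 4.635
residuals = np.array([-2.532, 1.162, 3.710, -6.515, 3.807, -3.096, 1.178, 8.952,
                      -2.548, -9.548, 0.227, 4.227, 6.501, -9.354, 0.646, 4.420,
                      6.420, 2.469, -10.031, 1.243, 2.517, 2.517, -0.983, -0.434,
                      -4.434, -3.434, -2.160, 3.114, 4.162, 0.162, -3.838, 2.662,
                      -3.563, 3.437, -1.063])

inside = np.sum(np.abs(residuals) <= s)
print(inside, round(inside / len(residuals), 2))   # 28 and 0.80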

To summarize, if asked the question, "How well does the midterm score predict the final score?" I would reply as follows.

For approximately two-thirds of the students the final score can be predicted to within 4.635 points. For the remaining one-third of the students, the prediction errs by more than 4.635 points.

Clearly, in many scientific questions, if s is small enough, the regression is very useful. But if s is too large, the regression is scientifically useless. As George Box, the founder of the Statistics Department at UW–Madison, would say, "Just because something is the best, doesn't mean it's any good." The regression line is the best way (according to the Principle of Least Squares) to use X in a linear way to predict Y. This is a math result. Whether this best line is any good in science must be determined by a scientist.

14.3.1 Inference for Regression

Literally, β0 is the mean value of Y when X = 0. But there are two problems: First, X = 0 is not even close to the range of X values in our data set. Thus, we have no data-based reason to believe that the linear relation in our data—for x values between 39.0 and 59.0—will extend all the way down to X = 0. Second, if a student scored 0 on the midterm he/she would likely drop the course.


In other words, first we don't want to extend our model to X = 0 because it might not be statistically valid to do so and, second, we don't want to because X = 0 is uninteresting scientifically.

Thus, viewed alone, β0, and hence its estimate b0, is not of interest to us. This is not to say that the intercept is totally uninteresting; it is an important component of the regression line. But because β0 alone is not interesting we will not learn how to estimate it with confidence, nor will we learn how to test a hypothesis about its value.

Inference for the Slope. Unlike the intercept, the slope, β1, is always of great interest to the researcher. As you have learned in math, the slope is the change in y for a unit change in x. Still in math, we learned that a slope of 0 means that changing x has no effect on y; a positive (negative) slope means that as x increases, y increases (decreases). Similar comments are true in regression. Now, the interpretation is that β1 is the true change in the mean of Y for a unit change in X, and b1 is the estimated change in the mean of Y for a unit change in X.

In the current example, we see that for each additional point on the midterm, we estimate that the mean number of points on the final increases by 0.4516 points.

Not surprisingly, one of our first tasks of interest is to estimate the slope with confidence. The computer output allows us to do this with only a small amount of work. The confidence interval for β1 is given by

b1 ± t∗ SE(b1). (14.10)

Note that we follow Gosset in using the t curves as our reference. As discussed earlier, the residuals have (n − 2) degrees of freedom and, even though one can't tell from our notation, the 'estimated' in the estimated SE of b1 comes from estimating the unknown σ² by s². Thus, it is no surprise that our reference curve is the t-curve with (n − 2) degrees of freedom, which equals 33 for this study. The SE (estimated standard error) of b1 is given in the output and is equal to 0.1501; the output also gives a more precise value of b1, 0.4516, in the event you want to be more precise; I do. With the help of our online calculator, the t∗ for df = 33 and 95% is 2.035. Thus, the 95% CI for β1 is

0.4516 ± 2.035(0.1501) = 0.4516 ± 0.3055 = [0.1461, 0.7571].

This confidence interval is very wide.

Next, we can test a hypothesis about β1. Consider the null hypothesis β1 = β10, where β10 (read beta-one-zero, not beta-ten) is a known number specified by the researcher. As usual, there are three possible alternatives: β1 > β10; β1 < β10; and β1 ≠ β10. The observed value of the test statistic is given by

t = (b1 − β10) / SE(b1).   (14.11)

We obtain the P-value by using the t-curve with df = n − 2 and the familiar rules relating the direction of the area to the alternative. Here are two examples.

Bert believes that, on average, each additional point on the midterm should result in an additional 5/3 = 1.6667 points on the final. (Can you create a plausible reason why he might believe this?)


Thus, his choice is β10 = 1.6667. He chooses the alternative < because he thinks it is inconceivable that the slope could be larger than 1.6667. Thus, his observed test statistic is

t = (0.4516 − 1.6667)/0.1501 = −8.095.

With the help of our calculator, we find that the area under the t(33) curve to the left of −8.095 is 0.0000; this is Bert's P-value. With a more precise program, I found that the P-value is 0.0000000012, which is just a bit larger than 1 in one billion. Bert's theory looks pretty—how can I say this—dumb.

The next example involves the computer's attempt to be helpful. In many, but not all, applications the researcher is interested in β10 = 0. The idea is that if the slope is 0, then there is no reason to bother using X to predict or describe Y. For this choice the test statistic becomes

t = b1 / SE(b1),

which for this example is t = 0.4516/0.1501 = 3.0087, which, of course, could be rounded to 3.01. Notice that the computer has done this calculation for us! (Look in the 'T' column of the 'Midterm' row.) For the alternative ≠ and our calculator, we find that the P-value is 2(0.0025) = 0.0050, which the computer gives us (located in the output to the right of 3.01).
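(The following sketch is my addition; it assumes the scipy package.) The t∗ multiplier, the confidence interval, and both P-values above can be reproduced from the numbers reported in the output:

# Minimal sketch (assumes scipy): inference for the slope from b1 = 0.4516,
# SE(b1) = 0.1501, and df = 33.
from scipy import stats

b1, se_b1, df = 0.4516, 0.1501, 33

t_star = stats.t.ppf(0.975, df)                   # about 2.035
print(b1 - t_star * se_b1, b1 + t_star * se_b1)   # the 95% CI, roughly [0.146, 0.757]

t_bert = (b1 - 1.6667) / se_b1                    # Bert's test of beta10 = 1.6667, alternative "<"
print(t_bert, stats.t.cdf(t_bert, df))            # about -8.095 and a P-value near 1.2e-9

t_zero = b1 / se_b1                               # the default test of beta10 = 0, alternative "not equal"
print(t_zero, 2 * stats.t.sf(abs(t_zero), df))    # about 3.01 and a P-value near 0.005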

Inference for the mean response for a given value of the predictor. Next, let us consider a specific possible value of X, call it x0. Now given that X = x0, the mean of Y is β0 + β1 x0; call this μ0. We can use the computer output to obtain a point estimate and a CI estimate of μ0.

For example, suppose we select x0 = 49.0. Then, the point estimate is

b0 + b1(49.0) = 68.42 + 0.4516(49.0) = 90.548.

If, however, you look at the output, you will find this value, 90.548, in the column 'Fit' and the rows with Midterm = 49.0. (This is observation 8, 9 or 10.) Of course, this is not a huge aid because, frankly, the above computation of 90.548 was pretty easy. But it's the next entry in the output that is useful. Just to the right of 'Fit' is 'SE Fit,' with, as before, SE the abbreviation for estimated standard error. Thus, we are able to calculate a CI for μ0:

Fit ± t∗ (SE Fit). (14.12)

For the current example, the 95% CI for the mean of Y given X = 49.0 is

90.548 ± 2.035(0.968) = 90.548 ± 1.970 = [88.578, 92.518].

The obvious question is: We were pretty lucky that X = x0 = 49.0 was in our computer output; what do we do if our x0 isn't there? It turns out that the answer is easy: trick the computer. Here is how.

Suppose we want to estimate the mean of Y given X = 51.0. An inspection of the Midterm column in the computer output reveals that there is no row for X = 51.0. Go back to the data set and add a 36th 'student.'


For this student, enter 51.0 for Midterm and a missing value for Final. (For Minitab, my software package, this means you enter a *.) Then rerun the regression analysis. In all of the computations the computer will ignore the 36th student because, of course, it has no value for Y and therefore cannot be used. Thus, the computer output is unchanged by the addition of this extra student. But, and this is the key point, the computer includes observation 36 in the last section of the output, creating the row:

Obs  Midterm  Final     Fit  SE Fit  Residual  St Resid
 36     51.0      *  91.451   0.828         *         *

From this we see that the point estimate of the mean of Y given X = 51.0 is 91.451. (This, of course, is easy to verify.) But now we also have the SE Fit, so we can obtain the 95% CI for this mean:

91.451 ± 2.035(0.828) = 91.451 ± 1.685 = [89.766, 93.136].

I will remark that we could test a hypothesis about the value of the mean of Y for a given X, but people rarely do this. And if you wanted to do it, I believe you could work out the details.

Prediction of the response for a given value of the predictor. Suppose that beyond our n cases for which we have data, we have an additional case. For this new case, we know that X = xn+1, for a known number xn+1, and we want to predict the value of Yn+1. Now, of course,

Yn+1 = β0 + β1 xn+1 + εn+1.

We assume that εn+1 is independent of all previous ε's, and, like the previous errors, has mean 0 and variance σ².

The natural prediction of Yn+1 is obtained by replacing the β's by their estimates and εn+1 by its mean, 0. The result is

ŷn+1 = b0 + b1 xn+1.

We recognize this as the Fit for X = xn+1; as such, its value and its SE are both presented in (or can be made to be presented in) our computer output.

Using the results of Chapter 7 on variances, we find that after replacing the unknown σ² by its estimate s², the estimated variance of our prediction is

s² + [SE(Fit)]².

For example, suppose that xn+1 = 49.0. From our computer output, and our earlier work, the point prediction of yn+1 is 'Fit,' which is 90.548. The estimated variance of this prediction is

(4.635)² + (0.968)² = 22.4202;

thus, the estimated standard error of this prediction is √22.4202 = 4.735. It now follows that we can obtain a prediction interval. In particular, the 95% prediction interval for yn+1 is

90.548 ± 2.035(4.735) = 90.548 ± 9.636 = [80.912, 100.184].

This is not a particularly useful prediction interval. (Why?)
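(A sketch of my own, assuming scipy.) Both intervals above can be reproduced from Fit = 90.548, SE Fit = 0.968, s = 4.635 and df = 33:

# Minimal sketch (assumes scipy): the 95% CI for the mean of Y given X = 49.0,
# and the 95% prediction interval for a new response at X = 49.0.
import math
from scipy import stats

fit, se_fit, s, df = 90.548, 0.968, 4.635, 33
t_star = stats.t.ppf(0.975, df)              # about 2.035

half = t_star * se_fit                       # CI for the mean response (Equation 14.12)
print(fit - half, fit + half)                # roughly [88.578, 92.518]

se_pred = math.sqrt(s**2 + se_fit**2)        # about 4.735
half = t_star * se_pred                      # prediction interval for a new response
print(fit - half, fit + half)                # roughly [80.9, 100.2]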


Table 14.4: The ANOVA Table for the Regression of Final Exam Score on Midterm Exam Scorefor 35 Students.

Analysis of Variance

Source            DF      SS      MS     F      P
Regression         1  194.38  194.38  9.05  0.005
Residual Error    33  708.81   21.48
Total             34  903.19

14.3.2 Some Loose Ends

There are a few loose ends from the computer output and regression analysis in general that are worth mentioning. In the earlier computer output I omitted the Analysis of Variance Table (ANOVA Table) in order to fit the output onto one page. The ANOVA table is presented now in Table 14.4. The first thing to note in this table is that the column heading 'SS' is short for Sum of Squares. There are three sums of squares in this table: one each for regression, (residual) error and total. These are abbreviated by SSR, SSE and SSTO, respectively. The next thing to note is that these sums of squares add:

SSR + SSE = SSTO.

This is actually a fairly amazing result, a Pythagorean Theorem for Statistics.

We begin by considering the n values of the response: y1, y2, . . . , yn. Following Chapter 10, these give rise to n deviations: y1 − ȳ, y2 − ȳ, . . . , yn − ȳ. We then write a generic deviation as follows:

yi − ȳ = (yi − ŷi) + (ŷi − ȳ).   (14.13)

In words, we take the deviation for case i and break it into the sum of two pieces: the deviation of yi from its (regression line) predicted value ŷi, and the deviation of its predicted value from the overall mean. For example, refer back to the computer output in Table 14.3 and examine case 29. For this case, y29 = 99.0 and ŷ29 = 94.838. Recall, also, from Equation 14.9 that ȳ = 92.257. Thus, for case 29, Equation 14.13 becomes

99.0 − 92.257 = (99.0 − 94.838) + (94.838 − 92.257) or

6.743 = 4.162 + 2.581.

In words, the response of 99.0 deviates from the mean of all responses by 6.743. This deviation is broken into two pieces: 99.0 is 4.162 points larger than its predicted value which, in turn, is 2.581 points larger than the mean of all responses. This identity remains true if we sum over all cases:

∑(yi − ȳ) = ∑(yi − ŷi) + ∑(ŷi − ȳ).

What is remarkable, however, is that this last identity remains true if we square all of its terms:

∑(yi − ȳ)² = ∑(yi − ŷi)² + ∑(ŷi − ȳ)².


In words, this last identity is SSTO = SSE + SSR.

The degrees of freedom (DF) in the ANOVA Table also sum, as I will now explain. From Chapter 10, the deviations have n − 1 degrees of freedom, which for our current data set is 35 − 1 = 34. As argued earlier in this chapter, the residuals have two linear constraints and, hence, have n − 2 degrees of freedom, which is 33 for the current data set. The degrees of freedom for regression is a bit trickier: the regression line is determined by two numbers (intercept and slope), which is one more than the number determined by the mean (itself); hence, there is 2 − 1 = 1 degree of freedom for regression.

I will now explain why these sums of squares are so interesting. First, consider SSTO; this sum of squares measures the variation in the responses, ignoring the predictor. For our data, SSTO = 903.19. Of course, we could divide this by its degrees of freedom to obtain s²y, the variance of the y values. And, of course, we could take the square root of the variance to obtain sy, which I reported earlier as equaling 5.154 points. But I don't want to do this! I just want to focus on 903.19. This number measures the total squared variation in the y values. It must be nonnegative. If it were 0 we could infer that all y values are the same. Finally, the larger the value of SSTO the more squared variation we have in our y values.

Now, contrast SSTO with SSE. SSE is the sum of squared residuals—the sum of the squared differences between the actual responses and their respective predicted values. For these data, SSE = 708.81. Next, consider the difference between SSTO and SSE:

903.19 − 708.81 = 194.38 = SSR.

Here is a picturesque way to interpret this difference: 194.38 is the amount of squared error that has been removed by using the predictor X. Removing error is good. (Why?) Thus, this number 194.38 is a measure of how much using X has improved our predictions. The problem, however, is how we interpret 194.38: Is it large? Is it small? Our answer lies in comparing it to something, but what? We compare it to 903.19 by taking a ratio:

194.38 / 903.19 = 0.215, or 21.5%.

Thus, of the total squared error in the response (SSTO = 903.19), 21.5% is removed or explained or accounted for by a linear relationship with the predictor.

The above ratio is called the coefficient of determination; it is denoted by R² and, as above, is often reported as a percentage. Thus,

R² = (SSTO − SSE)/SSTO.   (14.14)

Recall the correlation coefficient, r, discussed earlier. It can be shown that

R² = r².

For example, with our current data set, r = 0.464, which gives

r² = (0.464)² = 0.215 = R².
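(A quick numerical check, added by me.) The sums of squares in Table 14.4 and the two routes to the coefficient of determination can be verified in a few lines of Python:

# Minimal sketch: verify SSR + SSE = SSTO and R-squared = r-squared with the
# values from Table 14.4 and r = 0.464.
ssr, sse, ssto = 194.38, 708.81, 903.19

print(abs((ssr + sse) - ssto) < 0.01)   # True: the sums of squares add
print(round((ssto - sse) / ssto, 3))    # 0.215, Equation 14.14
print(round(0.464**2, 3))               # 0.215 again, since R-squared = r-squared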


I am now able to fulfill an earlier promise to make item 4 in our list of properties of r more precise. We now have an interpretation for the value of r: its square is equal to the coefficient of determination. Thus, the farther r is from 0, the larger the value of r² and, hence, the better X is at predicting/describing Y.

The fact that r² = R² has led many analysts to state that in order to interpret r we must first square it. Indeed, I have heard several people say that it is dishonest to report r instead of r². Here is their reasoning: If one reports, for example, that r = 0.8, this sounds much stronger than the more accurate r² = 0.64. This argument is flawed, in my opinion, for two reasons.

First, whereas I believe that R² is a useful way to summarize one aspect of a regression analysis, we must remember that it is based on squaring errors which, at the very least, is an unnatural activity. My second reason requires a bit more work.

Combining the results in Equations 14.4–14.6, we find that

ŷ = b0 + b1 x = ȳ − b1 x̄ + b1 x = ȳ + b1(x − x̄) = ȳ + r(sy/sx)(x − x̄).

Rewriting ŷ = ȳ + r(sy/sx)(x − x̄), we obtain the following:

(ŷ − ȳ)/sy = r (x − x̄)/sx.   (14.15)

Now this equation is not useful for obtaining ŷ for a given x, but it gives us great insight into the meaning of r, as I will now argue.

Recall that x̄ = 52.786, sx = 5.295, ȳ = 92.257 and sy = 5.154.

Consider a new case whose value of x is

x = x̄ + sx = 52.786 + 5.295 = 58.081.

(Please ignore that such a score is impossible with my one-half point system.) This is a good student; she scored one standard deviation above the mean on the midterm. According to Equation 14.15 her predicted response satisfies the following equation:

(ŷ − ȳ)/sy = r = 0.464.

Rewriting this, we obtain ŷ = ȳ + 0.464 sy.

In words, her predicted response is only 0.464 standard deviations above the mean response! I ended this last sentence with an exclamation point because something quite remarkable is happening. On the midterm, this student is exactly one standard deviation above the mean, but we predict that she will be much closer to the mean on the final. In picturesque language, we predict that only 46.4% of her advantage on the midterm will persist to the final. Thus, only 46.4% of her advantage on the midterm reflects her superior ability; the other 53.6%, for lack of a better term, was due to good luck.


The regression line states that there is no reason to believe that good luck will persist.

The above phenomenon—that not all of the advantage in X is inherited by the predicted value of Y—is referred to as the regression effect.

The regression effect also applies to subjects who score below the mean on X. For example, consider a new case with x equal to x̄ − sx. This student scored one standard deviation below the mean on the midterm, but is predicted to score only 0.464 standard deviations below the mean on the final. In words, a low score on x is partly due to inferior skill, but also due to bad luck on the midterm. There is no reason to expect the bad luck to repeat on the final.

14.3.3 Standardized Residuals

The seventh and final column in Table 14.3 presents the standardized residuals. The main feature to note is that a standardized residual is not what one would expect it to be. Based on the idea of standardizing throughout these notes, one would expect that a standardized residual would be

(e − ē)/s = e/s = e/4.635,

for our data. To see that this is not the case, just look at the first observation. For case 1,

e1/s = −2.532/4.635 = −0.55,

not the value −0.62 in the output. The correct definition of a standardized residual is beyond the scope of this course.
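(For the curious reader; this is my addition and goes slightly beyond the notes.) The usual definition divides the residual by s√(1 − h), where h is the case's leverage. Assuming that Minitab's 'SE Fit' column equals s√h, the −0.62 for case 1 can be reproduced:

# Minimal sketch: a leverage-based standardized residual for case 1 of Table 14.3,
# assuming SE Fit = s * sqrt(h), where h is the leverage of the case.
import math

e, s, se_fit = -2.532, 4.635, 2.213           # residual, s, and SE Fit for case 1

h = (se_fit / s) ** 2                         # leverage implied by SE Fit
print(round(e / (s * math.sqrt(1 - h)), 2))   # -0.62, matching the St Resid column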

Note that the standardized residual for case 1 has an 'X' next to it. This choice of a symbol by Minitab alerts us that the value of the predictor for this case gives it a large influence in our regression analysis. In the original data set of n = 36 students, two cases were marked in this way: the case I subsequently deleted and the first case in the current data set. As discussed earlier, the deletion of the case with x = 35.5 and y = 95.5 resulted in r increasing from 0.353 to 0.464. Also, s was reduced from 4.850 to 4.635.

If we delete the current case 1 to give a data set with n = 34, the value of r changes to 0.387 and s hardly changes at all, becoming 4.679.

In short, the deletion of an influential case—which should never be done lightly—can make the relationship between X and Y stronger or weaker.

Finally, three of the standardized residuals are labeled with 'R' because their absolute values exceed 2. It's worth noting that these cases are far away from the regression line, but I don't recommend any additional thought, let alone action, on this matter.
