Bivariate (Simple) Regression

We return to the unfinished task of association between two continuous variables.

The key to building measures of association for continuous variables is the properties of this type of variable. (1) They have equal intervals throughout their range (this means that the units of measurement are identical); and (2) they have a known and meaningful zero-point (this means that when the variable takes the value of 0.0, the phenomenon that it expresses is absent).

It is the first property—equal intervals—that is the key.


Because of the equal-interval property of continuous variables, one obvious way to describe the relationship between any two such variables (for example, X and Y) is by plotting their association in two-dimensional space.

Every observation (data point) is located simultaneously in reference to the values of each variable calibrated along the two axes in Cartesian space (i.e., the x-axis and the y-axis). These axes define four quadrants or sectors where observations lie.


Here is a simple example to show how this works.

Suppose that we are interested in describing the relationship between daily high temperature (Y) and the time (X) when we first saw sunlight in the morning. Our interest is in the month of June in particular because the afternoon temperature can vary widely from day to day.

We suspect that the earlier we see the sun in the morning (i.e., the earlier the overcast clouds disappear), the warmer the day will be. For several days we note the time when we first saw sunlight and then record the afternoon high temperature. Here are our data for the first three days:


————————————————————————
Day    Time (X)      Temperature (Y)
————————————————————————
 1     10:00 a.m.          76
 2      5:30 a.m.          90
 3      8:00 a.m.          82
————————————————————————
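Before SAS (used later in this handout) can plot or analyze these observations, they must be entered as a SAS data set. Here is a minimal sketch, assuming a temporary data set named work.weather with 5:30 a.m. converted to 5.5 hours; the handout's own examples instead read a permanent data set, weather, from a library on a:\.

DATA work.weather;
   INPUT day time temp;   /* time in hours after midnight: 5:30 a.m. = 5.5 */
   DATALINES;
1 10.0 76
2  5.5 90
3  8.0 82
;
RUN;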


[Hand-drawn scatterplot of Temperature (Y) against Time (X): Day 2 plotted near (5:30, 90), Day 3 near (8:00, 82), and Day 1 near (10:00, 76), with the time axis marked from 5:00 to 10:00.]


This simple device is called a scatterplot, and it tells us two things: (1) the direction of the relationship between X and Y, that is, whether it is DIRECT or INDIRECT (+ or -); and (2) the strength of the relationship, that is, how strong the association is between X and Y. Strength is indicated by how steep an angle the plotted points take as they fall toward the x-axis. This is an important property called the slope of the line. Here, the angle is not very steep, suggesting that the relationship is not very strong. Clearly, we have an inverse (indirect) relationship, because as the values of the X-variable (time of day) INCREASE, the values of the Y-variable (high temperature) DECREASE.


This is suggestive but not very precise. Statisticians prefer to describe association NUMERICALLY. This is done by determining how close the data points come to defining a straight line. If it looks as though the data points would like to be a straight line, then we can use an old trick.

A long time ago, you learned that a straight line is defined by two variables and two constants:

Y = a + bX

where X and Y are the values of two continuous variables such as time and temperature.


More recently, we saw this equation when we dealt with the analysis of variance (ANOVA). There, we confronted the general linear model:

Yij = α + βjXij + εij

Then, the Xij-variable was called the treatment and the Yij-variable was the outcome measure.

[By the way, Excel's Help shows this as y = mx + b.]

Essentially, this says that Y is a function of X,

Y = f(X)

plus the arithmetic effects of two constants.


Let’s deal with the simplified version for now:

Y = a + bX

The first constant, a, is called the y-intercept. It is the value of Y where the line defined by the equation crosses the y-axis, that is, the value of Y when X = 0.0. The second constant, b, is known as the slope. It is called a coefficient (i.e., a constant that we use to multiply the values of X in order to identify the values of Y).

The slope is a way of describing the STRENGTH OF ASSOCIATION between X and Y. If we had a way of calculating the value of b, then we would be onto a numerical description of association, our goal.


First, let's see how SAS would plot our data. This is easy to do with PROC PLOT:

LIBNAME mydata 'a:\';
LIBNAME library 'a:\';

PROC PLOT DATA=mydata.weather;
   PLOT temp*time;
RUN;

In the PLOT statement, you control which variable has its values placed on the vertical (y) axis and which on the horizontal (x) axis: the variable whose mnemonic is listed first, to the LEFT of the asterisk, is plotted on the y-axis, and the variable listed second, to the RIGHT of the asterisk, is plotted on the x-axis.
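So, for example, reversing the order of the mnemonics would flip the axes. A minimal sketch, using the same mydata.weather data set:

PROC PLOT DATA=mydata.weather;
   PLOT time*temp;   /* now TIME goes on the y-axis and TEMP on the x-axis */
RUN;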


[PROC PLOT output: 'Plot of TEMP*TIME', showing the three observations printed as 'A' at roughly (5.5, 90), (8, 82), and (10, 76), with the TIME axis marked from 5 to 10.]


In practice, not all data points will fall perfectly in a straight line. The important point is that the data points suggest that we can best describe the relationship between X and Y by IMPOSING a straight line rather than some other mathematical function (i.e., a curve). In other words, we can describe the relationship with a linear function as long as the points do not look like they define a curve of some sort. Then we could take a yardstick or a ruler and try to COME AS CLOSE AS WE CAN TO ALL THE POINTS. (We might not actually touch any single data point.) The object is to find the straight line that the data look like they are trying to become, the line that BEST FITS all the data points in the scatterplot. This line is called—not surprisingly—the line of best fit.


Algebraically, the line of best fit is expressed as

Yi = a + bXi + εi

where εi is the "error" term, indicating not mistakes in plotting but rather the amount of distance by which the ith individual data point misses falling exactly on the line of best fit. This term is called the residual and is described mathematically as

εi = Yi - Ŷi


Yi is the y-location in two-dimensional space for the data point whose x-location is Xi. In other words, it is the ACTUAL location on the y-axis for the data point located by Yi and Xi. Ŷi (the Y with the caret, or "hat") is the location on the y-axis where this data point WOULD HAVE BEEN FOUND if the relationship between X and Y were EXACTLY as the equation predicted, that is, if this data point FELL EXACTLY ON THE LINE of best fit. Ŷi is therefore called the predicted Y-value. Notice that for data points falling perfectly on the line of best fit, there is NO DIFFERENCE between the actual Y-value and the predicted Y-value, and thus εi = 0.0.

Let’s look at another example, this time from Sirkin.


Now we can write a more precise equation for the line of best fit. It is:

Ŷi = a + bXi

There is NO error term in this equation because it describes the line of best fit itself, the line that the data points were trying to become. Since this is the line describing the predicted Y-values rather than the actual Y-values, all the predicted data points line up perfectly. There is no error. (There are no "misses.")


Now we can give a more precise meaning to "line of best fit." Mathematically, "best" means the straight line that minimizes the squared differences between Y and Ŷ.

We speak of squared differences because of our old friend, the "sum-to-zero" problem. The distance between an actual Y-location in two-dimensional space and the predicted Y-value is like a TWO-DIMENSIONAL DEVIATION. Just as deviations about the mean sum to zero, the distances between actual and predicted Y-values will sum to zero when the (one true) line of best fit has been found. To keep the positive and negative misses from canceling out, we square them and minimize the sum of the squares. As a result, the type of analysis to which this is leading is called ordinary least squares regression (or OLS, for short).
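We can watch the sum-to-zero property happen with our own data by having SAS save the residuals and then sum them. A minimal sketch, assuming the weather data were read into the work.weather data set created in the earlier sketch:

PROC REG DATA=work.weather;
   MODEL temp = time;
   OUTPUT OUT=fits P=predtemp R=resid;   /* save predicted values and residuals */
RUN;
QUIT;

PROC MEANS DATA=fits SUM;
   VAR resid;   /* the residuals sum to (essentially) zero */
RUN;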


To identify the line of best fit, rather than "eye-balling" it, we calculate the two constants that define a straight line, a and b, the Y-intercept and the slope, respectively. We calculate these constants from the values of the two variables, X and Y. Think about a two-dimensional sum of squared deviations as you examine the algorithm for calculating the slope (b):

b = Σ (Xi - X̄)(Yi - Ȳ) / Σ (Xi - X̄)²

where the sums run from i = 1 to N, and (Xi - X̄) and (Yi - Ȳ) are deviations about the X and Y means, respectively.


The numerator is an important statistical idea called the covariance (strictly speaking, the covariance is this sum of cross-products divided by N - 1). Thus, the slope is the covariance between X and Y divided by the sum of squared deviations about the mean of the X-variable. This slope is known to statisticians as the regression coefficient.
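If we want SAS to print these ingredients, PROC CORR with the COV option shows the covariance of X and Y and the variance of X; their ratio equals the slope, because the N - 1 divisors in the two statistics cancel. A sketch, again assuming the work.weather data set:

PROC CORR DATA=work.weather COV;
   VAR time temp;   /* cov(time,temp)/var(time) = about -15.833/5.083 = b */
RUN;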

Calculating the other constant, the Y-intercept (a), is easy once we have the value of the regression coefficient. It is simply

a = Ȳ - bX̄

That is, the Y-intercept is the mean of Y less the product of the regression coefficient and the mean of X.


Let's compute the regression coefficient and the Y-intercept for the example of the time of first appearance of the sun and the afternoon high temperature, even though we have only three pairs of observations. Rather than first calculating the deviations, I am going to use the short-cut equation:

b = [N Σ XiYi - (Σ Xi)(Σ Yi)] / [N Σ Xi² - (Σ Xi)²]

where each sum runs from i = 1 to N. In our example, N = 3; the sum of X (time) is 23.5 [converting 5:30 a.m. to 5.5 hours]; the sum of the squared time values (Σ Xi²) is 194.25; the square of the summed time values ((Σ Xi)²) is 552.25; the sum of Y (temperature) is 248; and the sum of the cross-products (Σ XiYi) is 1911 [(5.5 x 90) + (8 x 82) + (10 x 76)].


Thus, the regression coefficient is:

b = [(3)(1911) - (23.5)(248)] / [(3)(194.25) - 552.25]
b = (5733 - 5828) / (582.75 - 552.25)
b = -95 / 30.5
b = -3.115

Notice that the negative sign confirms our earlier description of the relationship as an inverse one, that is, the GREATER the time value, the LOWER the temperature.


Having found the value of the regression coefficient, we can now find the value of the Y-intercept. The mean of X (time) is 7.833 and the mean of Y (temperature) is 82.667.

a = 82.667 - (-3.115)(7.833)
a = 82.667 - (-24.400)
a = 82.667 + 24.400
a = 107.067

This means that the line of best fit would cross the Y-axis (where X = 0.0) at about 107 degrees Fahrenheit. The intercept is this high because of the steep downward angle (slope) of the line of best fit (-3.115).
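As a quick arithmetic check, we can let SAS evaluate the short-cut formula from the sums we tallied above. A minimal sketch (the sums are hard-coded from our example, so this is an illustration, not a general program):

DATA _NULL_;
   n = 3;
   sumx  = 23.5;     /* sum of time values (hours)  */
   sumx2 = 194.25;   /* sum of squared time values  */
   sumy  = 248;      /* sum of temperatures         */
   sumxy = 1911;     /* sum of cross-products       */
   b = (n*sumxy - sumx*sumy) / (n*sumx2 - sumx**2);
   a = (sumy/n) - b*(sumx/n);
   PUT b= a=;        /* writes about b=-3.1148 a=107.0656 to the log */
RUN;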


How do we interpret the regression coefficient? The regression coefficient is the average amount of increase (or decrease) in values of Y for each one-unit change in X. It is a "bang for the buck" statistic. Notice that the regression coefficient is a captive of the units of measurement of the X and Y variables. That is, here we have a metric of Fahrenheit temperature per hour. This means that the magnitude of the coefficient can look great or small according to the units of measurement of X and Y. Consider the velocity of a vehicle on the freeway. The speedometer of a car going 70 miles per hour will register a much larger number in kilometers per hour. Similarly, velocity measured in terms of feet per second is a much larger number than the same velocity measured in miles per hour.


In theory, the values of the regression coefficient range from a low of 0.0, when there is no association between X and Y, to a maximum value of infinity (∞). In practice, however, the upper values of the regression coefficient are constrained by the range of values of X and Y in the data set.

How would we interpret the value of the regression coefficient in our little example? We would say that, for each hour that cloud cover prevents the sun from shining in the morning (after the appointed hour of sunrise, that is), the afternoon high temperature will be a little more than 3 degrees cooler than otherwise. This is the average difference in temperature associated with a one-hour change in the time of first sunlight.


We can now give a more explicit example of the residual, that is, the error term, εi. Let's say that on the fourth day of our observations, the sun didn't burn through the cloud cover until 9:00 a.m. Based upon our regression results, we would predict that the afternoon high temperature would be

Ŷ = a + bX
Ŷ = 107.067 + (-3.115)(9)   [for 9:00 a.m.]
Ŷ = 107.067 - 28.035
Ŷ = 79.032

That is, our MODEL predicts, based upon the constant relationship between X and Y, that the afternoon high temperature would be 79.032 degrees.
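SAS will make this prediction for us. If we append the fourth day with a missing temperature, PROC REG drops that case from estimation but still computes its predicted value in the OUTPUT data set. A sketch, again assuming the work.weather data set from earlier (the data set and variable names are illustrative):

DATA day4;
   INPUT day time temp;
   DATALINES;
4 9.0 .
;
RUN;

DATA all;
   SET work.weather day4;   /* three estimation days plus the new day 4 */
RUN;

PROC REG DATA=all;
   MODEL temp = time;
   OUTPUT OUT=preds P=predtemp;   /* day 4 gets predtemp of about 79.03 */
RUN;
QUIT;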


Later, if we record the afternoon high temperature at 79 degrees, we have at the same time identified the residual as -0.032 degrees. This -0.032 degrees is the amount by which our model MISPREDICTED the actual temperature; our prediction was this far OVER the actual temperature.


SAS Time and Temperature Example

LIBNAME perm 'a:\';
LIBNAME library 'a:\';

OPTIONS NODATE NONUMBER PS=66;

PROC REG DATA=perm.weather;
   MODEL temp = time;
   TITLE1 'Regression Analysis Example';
   TITLE2;
   TITLE3 'PPD 404';
RUN;


Regression Analysis Example 

PPD 404

Model: MODEL1
Dependent Variable: TEMP

                       Analysis of Variance

                          Sum of        Mean
Source          DF       Squares      Square     F Value    Prob>F

Model            1      98.63388    98.63388    3008.333    0.0116
Error            1       0.03279     0.03279
C Total          2      98.66667

    Root MSE     0.18107     R-square    0.9997
    Dep Mean    82.66667     Adj R-sq    0.9993
    C.V.         0.21904

                       Parameter Estimates

               Parameter      Standard    T for H0:
Variable  DF    Estimate         Error    Parameter=0    Prob > |T|

INTERCEP   1  107.065574    0.45696262      234.298        0.0027
TIME       1   -3.114754    0.05678855      -54.848        0.0116


Regression Analysis Example

For the following data on ten families, estimate a simple linear regression model and answer the questions below.

——————————————————————————————————————————————————————————————————————————————
         Annual Income                   Number of
Family    (in $1,000)     (Xi - X̄)²     Children     (Yi - Ȳ)²    (Xi - X̄)(Yi - Ȳ)
               X                             Y
——————————————————————————————————————————————————————————————————————————————
   1          25                             0
   2          17                             0
   3          20                             1
   4          14                             2
   5          11                             2
   6          10                             3
   7           6                             4
   8           8                             5
   9           8                             6
  10           4                             7
             ---                           ---
       ΣX =                          ΣY =
       X̄  =                          Ȳ  =
——————————————————————————————————————————————————————————————————————————————

1. What is the value of the regression coefficient?   ______________
2. What is the value of the Y-intercept?              ______________
3. Estimate the number of children in a family        ______________
   whose annual income is 15 thousand dollars.


Regression Analysis Example: Answers

For the following data on ten families, estimate a simple linear regression model and answer the questions below.

——————————————————————————————————————————————————————————————————————————————
         Annual Income                   Number of
Family    (in $1,000)     (Xi - X̄)²     Children     (Yi - Ȳ)²    (Xi - X̄)(Yi - Ȳ)
               X                             Y
——————————————————————————————————————————————————————————————————————————————
   1          25           161.29            0            9            -38.1
   2          17            22.09            0            9            -14.1
   3          20            59.29            1            4            -15.4
   4          14             2.89            2            1             -1.7
   5          11             1.69            2            1              1.3
   6          10             5.29            3            0              0.0
   7           6            39.69            4            1             -6.3
   8           8            18.49            5            4             -8.6
   9           8            18.49            6            9            -12.9
  10           4            68.89            7           16            -33.2
             ---                           ---
       ΣX = 123          = 398.1     ΣY = 30           = 54           = -129
       X̄  = 12.3                     Ȳ  = 3.0
——————————————————————————————————————————————————————————————————————————————

1. What is the value of the regression coefficient?   -0.324
2. What is the value of the Y-intercept?               6.985
3. Estimate the number of children in a family         2.125
   whose annual income is 15 thousand dollars.
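We can verify these answers in SAS with a few lines. A minimal sketch (the data set and variable names are my own, chosen for this illustration):

DATA families;
   INPUT family income children;
   DATALINES;
1 25 0
2 17 0
3 20 1
4 14 2
5 11 2
6 10 3
7 6 4
8 8 5
9 8 6
10 4 7
;
RUN;

PROC REG DATA=families;
   MODEL children = income;   /* slope about -0.324, intercept about 6.985 */
RUN;
QUIT;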