8/6/2019 Lecture13 Regression
http://slidepdf.com/reader/full/lecture13-regression 1/54
Linear Regression II
Linear regression
• An estimate of the linear relationship between two variables (X and Y) in terms of their actual scale
  - We find the equation for the line that best fits the data
  - This involves:
    • Finding the intercept - e.g., the value of Y when X = 0
    • Finding the slope - the change in Y given a one-point change in X
Equation of a line
• Y' = a + bX
• The intercept (a) is the point at which the line crosses the Y axis
  - The value of Y when X = 0
• The slope (b) is the amount of increase in Y given a one-point increase in X
• Y' means "predicted Y value"
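The equation above can be sketched as a small function. This is a hypothetical illustration (the function name and numbers are not from the lecture):

```python
# Hypothetical sketch of the regression equation Y' = a + bX.
def predict(a, b, x):
    """Return the predicted Y value (Y') for a given X."""
    return a + b * x

# With intercept a = 2.0 and slope b = 0.5, a one-point increase in X
# raises the prediction by exactly b:
print(predict(2.0, 0.5, 10))  # 7.0
print(predict(2.0, 0.5, 11))  # 7.5
```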
Making predictions
• We use the regression equation to predict what Y will be given some value of X
  - E.g., how tall is someone who weighs 121 lbs?
• Last time, we focused on "perfect predictions"
  - Weight "perfectly predicted" height because the correlation was one...
  - That's usually not the case in real life...
Making predictions
• One way to think about the regression line is in terms of "conditional" averages (means)
  - Given some condition of X, what is the mean of Y?
  - So, given a GMAT score of 640, what is the average income?
The line that "best fits"
• The method behind linear regression involves finding the line that "best fits" the data
  - We won't get into how this is computed in this class
    • It involves matrix algebra
  - But conceptually, the goal is to find a line that minimizes the total distance from all the points
    • Often called "Ordinary Least Squares regression" because you square the distance from each point to the line and find the line that minimizes the sum
      - i.e., find the "least" amount of summed squares
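As a sketch of the least-squares idea, the slope and intercept can be computed in closed form, and any other line then has a larger sum of squared distances. The data and function names below are made up for illustration:

```python
# A minimal sketch of ordinary least squares with invented data.
def ols(xs, ys):
    """Closed-form least-squares slope and intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

def sse(a, b, xs, ys):
    """Sum of squared vertical distances from each point to the line."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]
a, b = ols(xs, ys)
# The fitted line has the least summed squares: nudging the slope or
# the intercept in either direction can only increase the SSE.
assert sse(a, b, xs, ys) <= sse(a, b + 0.1, xs, ys)
assert sse(a, b, xs, ys) <= sse(a + 0.1, b, xs, ys)
```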
Residuals
• The regression line is what we would predict for Y given some X...
  - The regression equation gives us the straight line that minimizes the error involved in making predictions
• Residuals are what we call that error
  - Residuals are the differences between an actual Y value and the predicted Y value
  - The residual is Y - Y'
    • The actual Y value minus the predicted Y value
Variance of the estimate
• We can quantify the amount of error in the prediction by finding the average of all of the squared residuals
  - This is the "variance of the estimate" - e.g., how much do the points vary around the line?

  σ²est = Σ(Y - Y')² / N

  The closer the points are to the line, the smaller the variance of the estimate will be.
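The formula above can be sketched in a few lines; the data here are invented for illustration:

```python
# A sketch of the variance of the estimate: the average squared residual.
def variance_of_estimate(ys, y_preds):
    residuals = [y - yp for y, yp in zip(ys, y_preds)]
    return sum(r ** 2 for r in residuals) / len(ys)
    # (for the sample estimate discussed below, divide by N - 2 instead)

ys      = [25, 30, 29, 25]
y_preds = [28.0, 29.0, 27.0, 26.0]
# Squared residuals: 9 + 1 + 4 + 1 = 15, so the average is 15/4
print(variance_of_estimate(ys, y_preds))  # 3.75
```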
Variance of the Estimate
• When r = 0 (no correlation), the best-fitting line is a horizontal one...
  - Same predicted Y for all values of X
    • The line is doing nothing for us...
  - The variance of the estimate is largest in this case
    • The variance of the predictions around the regression line is just the variance of Y

  σ²est = Σ(Y - Y')² / N

  When r = 0, Y' is the mean of Y, so this becomes

  σ²est = Σ(Y - Ȳ)² / N = σ²Y

(figure: scatterplots of uncorrelated data, where the best-fitting line is horizontal)
Variance of the estimate
• For a sample, we use N - 2 in the denominator to get an unbiased estimate
  - We lose two degrees of freedom (one for the slope, one for the intercept)

  s²est = Σ(Y - Y')² / (N - 2)
Explained vs. unexplained variance
• The difference between the total amount of variance in Y and the variance of the estimate is the amount of variance explained by the regression line
• Explained variance = total variance - unexplained variance
  - Total variance = unexplained variance + explained variance
Coefficient of determination
• This is the "proportion of the total variance that is explained (or determined) by the predictor variable"
• It is (explained variance)/(total variance)
  - This equals r²
  - It is the proportion of the variance in Y that is accounted for by X
Coefficient of non-determination
• This is simply the reverse - the amount of variance in Y that X does not account for
  - An estimate of how much the points don't fall on the line
• It is (unexplained variance)/(total variance), or 1 - r²
The variance of the estimate
• Remember that the variance of the estimate is the unexplained variance
• An easier way to compute the variance of the estimate is to use the coefficient of non-determination:

  σ²est / σ²Y = 1 - r²

  becomes...

  σ²est = σ²Y (1 - r²)
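The shortcut can be checked numerically: fit the least-squares line to a small made-up data set, compute the variance of the estimate directly from the residuals, and compare it with σ²Y(1 - r²). Population (divide-by-N) variances are used throughout:

```python
# Numerical check of the shortcut: variance of the estimate = var(Y)·(1 - r²).
xs = [1, 2, 3, 4]
ys = [1, 3, 2, 5]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
sxx = sum((x - mx) ** 2 for x in xs)
syy = sum((y - my) ** 2 for y in ys)

b = sxy / sxx                  # least-squares slope
a = my - b * mx                # least-squares intercept
r2 = sxy ** 2 / (sxx * syy)    # coefficient of determination
var_y = syy / n                # total variance of Y
var_est = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / n

# The direct computation and the shortcut agree:
assert abs(var_est - var_y * (1 - r2)) < 1e-9
```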
Example
• Relationship between age and verbal comprehension
• We want to use age (in months) to predict scores on a verbal comprehension test
Example
• In our sample of 100 kids from grades 1-6, we have:
  - Mean age of 98.14 months (s = 21.0)
  - Mean test score of 30.35 items correct out of 50 (s = 7.25)
  - A correlation between age and test score of r = .72

Why use regression? Our independent variable is age - a continuous measure. We don't have 2 groups to compare, so we can't use a t-test. We want to look at how increases in age relate to increases (or decreases) in scores.

• How can we make predictions for verbal comprehension given an age?
1. Find the slope of the line
• X is age (the independent variable)
  - Mx = 98.14, sx = 21.0
• Y is test score (the dependent variable)
  - My = 30.35, sy = 7.25
• r = .72

  bYX = (sY / sX) r = (7.25 / 21.0)(.72) = .249
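The slope calculation can be reproduced in a couple of lines of Python, using the sample values given above:

```python
# Slope of the regression line: b = (sY / sX) · r
s_y, s_x, r = 7.25, 21.0, 0.72
b = (s_y / s_x) * r
print(round(b, 3))  # 0.249
```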
2. Find the intercept of the line
• X is age (the independent variable)
  - Mx = 98.14, sx = 21.0
• Y is test score (the dependent variable)
  - My = 30.35, sy = 7.25
• b = .249

  aYX = Ȳ - bYX X̄ = 30.35 - .249(98.14) = 5.91

  For an age of 0 months (X = 0), we predict a score of 5.91 on the test.
Making a prediction
• Y' = a + bX
  - a = 5.91
  - b = .249
• Y' means "predicted Y"
• A child is 10 years old (120 months)
  - His predicted test result will be:
    Y' = 5.91 + .249(120) = 35.8 items correct
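Putting the intercept step and the prediction step together, the slide's numbers can be reproduced directly:

```python
# Intercept: a = Ȳ - b·X̄, then prediction Y' = a + bX for 120 months.
b = 0.249
a = 30.35 - b * 98.14
print(round(a, 2))       # 5.91
y_pred = a + b * 120
print(round(y_pred, 1))  # 35.8
```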
Example
• In our sample of 100 kids from grades 1-6, we have:
  - Mean age of 98.14 months (s = 21.0)
  - Mean test score of 30.35 items correct out of 50 (s = 7.25)

We predict a child at 120 months will get 35.8 items correct. This child is older than the average child in our sample, so he does better than average on the test.
Interpreting: r vs. b
• b (the slope of the line) is the change (amount of points) we predict in Y based on a one-point change in X
  - For each month's increase in age, test scores go up .249
• r (the correlation) is the change (in terms of standard deviations) we predict in Y based on a one standard deviation change in X
  - For every one standard deviation increase in age, test scores will increase by .72 of a standard deviation
The residual
• Our equation is:
  Test score = 5.91 + .249(age in months)
• We have a child who is 92 months old, and she gets 40 questions correct
• We'd predict she would get Y' = 5.91 + .249(92) = 28.82 questions correct

The residual is 40 - 28.82 = 11.18
  - Positive because she did better than our predicted value

If another 92-month-old got 27 questions correct, the residual would be 27 - 28.82 = -1.82
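Both residuals on this slide can be reproduced from the fitted equation:

```python
# Residual = Y - Y' for two 92-month-old children.
a, b = 5.91, 0.249
y_pred = a + b * 92
print(round(y_pred, 2))       # 28.82
print(round(40 - y_pred, 2))  # 11.18  (scored above the prediction)
print(round(27 - y_pred, 2))  # -1.82  (scored below the prediction)
```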
Example: Variance explained
• In our sample of 100 kids from grades 1-6, we have:
  - Mean age of 98.14 months (s = 21.0)
  - Mean test score of 30.35 items correct out of 50 (s = 7.25)

The total variance in test scores is s² = 7.25² = 52.56. How much is explained by the regression line?
Unexplained variance
• If we went through each of our 100 data points, we could calculate the residual - the value of Y we actually got minus the value of Y we predicted from the equation
  - The sum of those squared deviations is everything we didn't explain:

  Σ(Y - Y')²
Age (X)   Score (Y)   Predicted Score Y' = 5.91 + .249(X)   Residual Y - Y'
92        25          5.91 + .249(92) = 28.82               25 - 28.82 = -3.82
100       30          5.91 + .249(100) = 30.81              30 - 30.81 = -.81
84        29          5.91 + .249(84) = 26.83               29 - 26.83 = 2.17
73        25          5.91 + .249(73) = 24.09               25 - 24.09 = .91
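The table can be generated with a short loop over the four children:

```python
# Predicted score and residual for each child in the table.
a, b = 5.91, 0.249
rows = [(92, 25), (100, 30), (84, 29), (73, 25)]
table = []
for age, score in rows:
    y_pred = a + b * age
    table.append((age, score, round(y_pred, 2), round(score - y_pred, 2)))
for row in table:
    print(row)
```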
Unexplained variance
• If we went through each of our 100 data points, we could calculate the residual - the value of Y we actually got minus the value of Y we predicted from the equation
  - The sum of those squared deviations is everything we didn't explain
  - The average of those squared deviations is the variance of the estimate:

  σ²est = Σ(Y - Y')² / N
Explained variance
• The total variance is the variance of Y
• The unexplained variance is the average squared deviation score
• Total variance = explained variance + unexplained variance
  - So all that's left is what we explained by the regression line
  - Explained variance = total variance - unexplained variance
Coefficient of determination
• We know from our example that the correlation between age and test score was .72
  - We can compute the coefficient of determination by squaring it
  - r² = .72² = .52
• Age accounts for 52% of the variance in test scores
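The variance partition for the example follows directly, using r = .72 and the total variance s² = 7.25² ≈ 52.56 from the earlier slide:

```python
# Splitting total variance into explained and unexplained parts.
r = 0.72
total_var = 7.25 ** 2
r2 = r ** 2                          # coefficient of determination
explained = r2 * total_var           # variance explained by age
unexplained = (1 - r2) * total_var   # variance of the estimate
print(round(r2, 2))  # 0.52
```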
Coefficient of non-determination
• This is simply the reverse - the amount of variance in Y that X does not account for
  - An estimate of how much the points don't fall on the line
• It is (unexplained variance)/(total variance), or 1 - r²
  - So 1 - .72² = 1 - .52 = .48
• 48% of the variance in test scores is not accounted for by age