Analysis of Variance for Regression/Multiple Regressiondsmall/stat112-02/handouts/lectslides... ·...

Analysis of Variance for Regression/Multiple Regression

Lecture Notes XVII

Statistics 112, Fall 2002

Announcements

• The second midterm is next Thursday.

• Extra office hours next week: Monday, 1:30-2:30; Wednesday, 9-10 and 1:30-2:30; or by appointment. Usual office hours on Tuesday, 1-2 and 4:30-5:30.

• Haipeng will hold office hours tomorrow from 10:30-12:30.

• A tutor from the university tutor service will hold a review session on Tuesday (Nov. 12) from 6-8 p.m. in Huntsman Hall, Room 250. He is not affiliated with the course.

• I will post a set of exercises for this week’s lectures by Friday night.

Outline

• Analysis of variance for regression.

• Multiple regression.

– Basic model.

– Estimating and interpreting the parameters.

– The impact of lurking variables.

• Reading for this time: Chapter 10.2 (analysis of variance for regression part), Chapter 11.

Analysis of Variance for Regression

• The analysis of variance (ANOVA) provides a convenient method of comparing the fit of two or more models to the same set of data. Here we are interested in comparing

1. A simple linear regression model in which the slope is zero, $\beta_1 = 0$, vs.

2. A simple linear regression model in which the slope is not zero, $\beta_1 \neq 0$.

For both models it is assumed that $\varepsilon_i \sim N(0, \sigma)$, with the $\varepsilon_i$ independent.

• Analysis of variance summarizes information about the sources of variation in the data.

• Total variation in the response $y$ is expressed by the deviations $y_i - \bar{y}$.

Two reasons why $y_i$ does not equal $\bar{y}$:

– Responses $y_i$ correspond to different values of the explanatory variable $x$. The fitted value $\hat{y}_i$ estimates the mean response for each specific $x_i$. The difference $\hat{y}_i - \bar{y}$ reflects variation in mean responses due to differences in $x_i$.

– Individual observations will vary about their mean because of variation within the subpopulation of responses to a fixed $x_i$. This variation is represented by the residuals $y_i - \hat{y}_i$.

Sums of Squares

• Basic idea behind analysis of variance: If $H_0: \beta_1 = 0$ is true, then all variation should be due to individual observations varying about their mean. We can estimate the amount of variation due to the responses $y_i$ corresponding to different values of the explanatory variable $x$ and base our test on this estimate.

• Each deviation from the overall mean splits into two parts:

$$ y_i - \bar{y} = (\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i) $$

Algebraic fact:

$$ \sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2 \qquad (1) $$

• We write (1) as

$$ SST = SSM + SSE $$

SS stands for sum of squares, and T, M and E stand for total, model and error respectively. Total variation SST is the sum of variation due to the straight-line model for the regression function (SSM) and variation due to deviations from this model (SSE).

• If $H_0: \beta_1 = 0$ were true, then SSM should be small.

• Degrees of freedom are associated with each sum of squares. Degrees of freedom can be thought of as the number of independent pieces of information that the sum of squares reflects.

$$ DFT = DFM + DFE $$

$$ DFT = n - 1, \quad DFM = 1, \quad DFE = n - 2 $$

• Mean square:

$$ MS = \frac{\text{sum of squares}}{\text{degrees of freedom}} $$

• Interpretation of $r^2$: fraction of variation in the values of $y$ that is explained by the least squares regression of $y$ on $x$.

$$ r^2 = \frac{SSM}{SST} = \frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2} = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2} $$
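The decomposition SST = SSM + SSE, the degrees of freedom, the mean squares and $r^2$ can be checked numerically. The following is a minimal sketch in plain Python; the data values and the helper names `fit_simple` and `anova_table` are made up for illustration and are not part of the lecture.

```python
# Hypothetical data (made up for illustration): x = years of education,
# y = annual earnings in thousands of dollars.
x = [10, 12, 12, 14, 16, 16, 18, 20]
y = [22.0, 27.0, 25.0, 33.0, 38.0, 41.0, 44.0, 52.0]


def fit_simple(x, y):
    """Least squares intercept b0 and slope b1 for the line y = b0 + b1*x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1


def anova_table(x, y):
    """Sums of squares, degrees of freedom, mean squares and r^2."""
    n = len(y)
    b0, b1 = fit_simple(x, y)
    ybar = sum(y) / n
    yhat = [b0 + b1 * xi for xi in x]                       # fitted values
    sst = sum((yi - ybar) ** 2 for yi in y)                 # total
    ssm = sum((fi - ybar) ** 2 for fi in yhat)              # model
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, yhat))    # error
    return {
        "SST": sst, "SSM": ssm, "SSE": sse,
        "DFT": n - 1, "DFM": 1, "DFE": n - 2,
        "MSM": ssm / 1, "MSE": sse / (n - 2),
        "r2": ssm / sst,
    }


table = anova_table(x, y)
for name, value in table.items():
    print(f"{name:>4} = {value:.4f}")
```

Up to floating-point rounding, `table["SST"]` equals `table["SSM"] + table["SSE"]`, and `table["r2"]` equals `1 - table["SSE"] / table["SST"]`.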

The ANOVA F test

• $H_0: \beta_1 = 0$ ($y$ is not linearly related to $x$) can be tested by comparing MSM with MSE. The ANOVA test statistic is

$$ F = \frac{MSM}{MSE} $$

$F$ will tend to be small when $H_0$ is true and large when $H_a: \beta_1 \neq 0$ is true.

• Under $H_0$, the statistic $F$ has an $F$ distribution with $1$ degree of freedom in the numerator and $n - 2$ degrees of freedom in the denominator (Table E).

• For simple linear regression, the $F$ test is equivalent to the two-sided $t$-test of $H_0: \beta_1 = 0$ versus $H_a: \beta_1 \neq 0$; in fact $F = t^2$.
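The equivalence of the $F$ test and the $t$-test can be seen numerically. The sketch below, on made-up data, computes $F = MSM/MSE$ and the usual slope statistic $t = b_1 / SE_{b_1}$ with $SE_{b_1} = s / \sqrt{\sum (x_i - \bar{x})^2}$, and compares $F$ with $t^2$. It does not compute a P-value; that would come from Table E or from an $F$ distribution function in a statistics library.

```python
import math

# Hypothetical data (made up for illustration).
x = [10, 12, 12, 14, 16, 16, 18, 20]
y = [22.0, 27.0, 25.0, 33.0, 38.0, 41.0, 44.0, 52.0]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

# Least squares line and fitted values.
b1 = sxy / sxx
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]

# ANOVA quantities for simple linear regression.
ssm = sum((fi - ybar) ** 2 for fi in yhat)
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, yhat))
msm = ssm / 1          # DFM = 1
mse = sse / (n - 2)    # DFE = n - 2
F = msm / mse

# t statistic for the slope: b1 divided by its standard error.
s = math.sqrt(mse)
se_b1 = s / math.sqrt(sxx)
t = b1 / se_b1

print(f"F   = {F:.4f}")
print(f"t^2 = {t ** 2:.4f}")
```

The two printed values agree because $SSM = b_1^2 \sum (x_i - \bar{x})^2$, so $t^2 = SSM / MSE = F$.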

Multiple regression

• Consider again the problem of deciding how many years you should stay in school. Suppose that you now have available the joint distribution of earnings ($y$), education ($x_1$) and IQ ($x_2$) for a sample from a population of people like yourself.

• You could just use the regression function $E(y \mid x_1)$ to make your prediction, but given the extra information about IQ in the sample, it is natural to try to use it.

• The natural way to use the extra information is to use the multiple regression function $E(y \mid x_1, x_2)$ to make your prediction (you substitute in your IQ to make the prediction).

• Population regression function: $E(y \mid x_1, \ldots, x_p)$.

• Data for multiple regression:

Person 1: $x_{11}, x_{12}, \ldots, x_{1p}, y_1$

Person 2: $x_{21}, x_{22}, \ldots, x_{2p}, y_2$

...

Person $n$: $x_{n1}, x_{n2}, \ldots, x_{np}, y_n$

Multiple Linear Regression Model

• One possible model for the population regression function is the multiple linear regression model, an analogue of the simple linear regression model:

$$ E(y \mid x_1, \ldots, x_p) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p $$

• Interpretation of $\beta_j$: the change in the mean of $y$ if $x_j$ is increased by one unit and all other explanatory variables, $x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_p$, are held fixed.

• The multiple linear regression model is very flexible. Examples of multiple linear regression models:

� � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � �

� � � � � � � � � � � � � � � � � � � � �� "� � � ��

Probability Model for Multiple Linear Regression

• The statistical model for multiple linear regression is

$$ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i $$

for $i = 1, 2, \ldots, n$.

• The mean response $\mu_y$ is a linear function of the explanatory variables:

$$ \mu_y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p $$

• The errors $\varepsilon_i$ (the deviations of the responses from the mean response) are independent and normally distributed with mean 0 and standard deviation $\sigma$. In other words, they are a simple random sample from a $N(0, \sigma)$ distribution.
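One way to make the probability model concrete is to generate data from it. The sketch below is only an illustration: the parameter values, the $(x_1, x_2)$ rows, the seed and the function names `mean_response` and `simulate_responses` are all assumptions made here, not values from the lecture.

```python
import random


def mean_response(beta, row):
    """mu_y = beta[0] + beta[1]*x1 + ... + beta[p]*xp for one row of x-values."""
    return beta[0] + sum(b * xj for b, xj in zip(beta[1:], row))


def simulate_responses(beta, rows, sigma, rng):
    """Draw y_i = mu_y(x_i) + eps_i, with eps_i independent N(0, sigma)."""
    return [mean_response(beta, row) + rng.gauss(0.0, sigma) for row in rows]


# Hypothetical parameters (beta0, beta1, beta2) and explanatory variables
# (years of education, IQ) for four people.
beta = [5.0, 2.0, 0.1]
rows = [(12, 100), (16, 110), (12, 120), (18, 95)]

rng = random.Random(112)  # fixed seed so the draw is reproducible
y = simulate_responses(beta, rows, sigma=3.0, rng=rng)

for row, yi in zip(rows, y):
    print(row, f"mean = {mean_response(beta, row):.2f}", f"y = {yi:.2f}")
```

Each simulated response scatters around its own mean response; setting `sigma` to 0 returns the mean responses exactly.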

Estimation of the multiple regression parameters $\beta_0, \beta_1, \ldots, \beta_p$

• Let $b_0, b_1, \ldots, b_p$ denote the estimators of $\beta_0, \beta_1, \ldots, \beta_p$.

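• For the $i$th observation the predicted response is

$$ \hat{y}_i = b_0 + b_1 x_{i1} + b_2 x_{i2} + \cdots + b_p x_{ip} $$

• The $i$th residual, the difference between the observed and predicted response, is

$$ e_i = \text{observed response} - \text{predicted response} = y_i - \hat{y}_i = y_i - b_0 - b_1 x_{i1} - \cdots - b_p x_{ip} $$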

• To estimate $\beta_0, \beta_1, \ldots, \beta_p$, we use the method of least squares. Choose the values of the $b$’s that make the sum of the squared residuals as small as possible, i.e., choose $b_0, b_1, \ldots, b_p$ to minimize

$$ \sum (y_i - b_0 - b_1 x_{i1} - \cdots - b_p x_{ip})^2 $$

• The parameter $\sigma^2$ measures the variability of the responses about the population regression equation. We estimate $\sigma^2$ by $s^2$, which is an average of the squared residuals:

$$ s^2 = \frac{\sum (y_i - \hat{y}_i)^2}{n - p - 1} $$

We estimate $\sigma$ by $s = \sqrt{s^2}$. We call $s$ the root mean squared error.
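The least squares estimates and $s$ can be computed directly from these definitions. The following is a minimal sketch in plain Python that solves the normal equations $(X^T X) b = X^T y$ by Gaussian elimination. The data and the helper names `solve_linear` and `least_squares` are made up for illustration; statistical software typically uses a more numerically stable method.

```python
import math


def solve_linear(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    k = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, k):
            factor = M[r][col] / M[col][col]
            for c in range(col, k + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * k
    for r in range(k - 1, -1, -1):  # back substitution
        tail = sum(M[r][c] * z[c] for c in range(r + 1, k))
        z[r] = (M[r][k] - tail) / M[r][r]
    return z


def least_squares(rows, y):
    """Return [b0, b1, ..., bp] minimizing the sum of squared residuals."""
    X = [[1.0] + list(row) for row in rows]  # column of 1s for the intercept
    k = len(X[0])
    XtX = [[sum(xi[a] * xi[c] for xi in X) for c in range(k)] for a in range(k)]
    Xty = [sum(xi[a] * yi for xi, yi in zip(X, y)) for a in range(k)]
    return solve_linear(XtX, Xty)


# Hypothetical data: (years of education, IQ) and earnings in $1000s.
rows = [(10, 95), (12, 100), (12, 110), (14, 105),
        (16, 100), (16, 120), (18, 115), (20, 125)]
y = [22.0, 27.0, 28.0, 33.0, 36.0, 43.0, 45.0, 53.0]

b = least_squares(rows, y)
yhat = [b[0] + sum(bj * xj for bj, xj in zip(b[1:], row)) for row in rows]
resid = [yi - fi for yi, fi in zip(y, yhat)]

n, p = len(y), len(rows[0])
s2 = sum(e * e for e in resid) / (n - p - 1)  # estimate of sigma^2
s = math.sqrt(s2)                             # root mean squared error

print("b =", [round(bj, 4) for bj in b])
print("s =", round(s, 4))
```

The divisor is $n - p - 1$ rather than $n$ because $p + 1$ coefficients are estimated from the data.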

The Impact of Lurking Variables

• You want to predict what your earnings will be if you obtain a certain number of years of education. Let $y$ = earnings, $x_1$ = years of education and $x_2$ = IQ. Suppose that you have a sample that contains only earnings and years of education data; people’s IQs are not recorded.

• IQ is probably a lurking variable, i.e., a variable that has an important effect on the relationship among the variables in a study but is not included among the variables studied.

• Suppose that the population regression function for the expected value of earnings given education and IQ is a multiple linear regression function:

$$ E(y \mid x_1, x_2) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 $$

and also suppose that

$$ E(x_2 \mid x_1) = \gamma_0 + \gamma_1 x_1 $$

• What is $E(y \mid x_1)$?

$$
\begin{aligned}
E(y \mid x_1) &= E(\beta_0 + \beta_1 x_1 + \beta_2 x_2 \mid x_1) \\
&= \beta_0 + \beta_1 x_1 + \beta_2 E(x_2 \mid x_1) \\
&= \beta_0 + \beta_1 x_1 + \beta_2 (\gamma_0 + \gamma_1 x_1) \\
&= (\beta_0 + \beta_2 \gamma_0) + (\beta_1 + \beta_2 \gamma_1) x_1
\end{aligned}
$$

• If we use least squares to estimate the simple linear regression function $E(y \mid x_1)$, the slope of the least squares line will be an unbiased estimate of $\beta_1 + \beta_2 \gamma_1$, which does not generally equal $\beta_1$. Thus, by regressing earnings on only years of education, you will not obtain the right slope for estimating the impact of additional years of education given your fixed IQ.

• Two circumstances in which $\beta_1 + \beta_2 \gamma_1 = \beta_1$:

– $\beta_2 = 0$, i.e., IQ does not help to predict earnings once education is included.

– $\gamma_1 = 0$, i.e., years of education does not help to predict IQ.
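The effect of omitting IQ can be illustrated with a small constructed example. Write the assumed linear regression of IQ on education as $E(x_2 \mid x_1) = \gamma_0 + \gamma_1 x_1$ (the symbols $\gamma_0, \gamma_1$ are labels chosen for this sketch). All numbers below are hypothetical, and the data are built without noise in $y$ so that the arithmetic is exact: the slope from regressing earnings on education alone comes out as $\beta_1 + \beta_2 \gamma_1$, not $\beta_1$.

```python
def slope(x, y):
    """Least squares slope from regressing y on x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sxy / sxx


# Hypothetical population coefficients (y = log earnings).
beta0, beta1, beta2 = 1.0, 0.05, 0.01   # E(y | x1, x2)
gamma0, gamma1 = 70.0, 2.5              # E(x2 | x1)

x1 = [8, 10, 12, 14, 16]                # years of education
# Deviations of IQ from its regression line on education. They are chosen
# to sum to zero and to be uncorrelated with x1 in this sample.
dev = [5, -10, 10, -10, 5]
x2 = [gamma0 + gamma1 * a + d for a, d in zip(x1, dev)]

# Noise-free responses from the multiple regression function.
y = [beta0 + beta1 * a + beta2 * c for a, c in zip(x1, x2)]

short_slope = slope(x1, y)              # IQ omitted
print("slope with IQ omitted:", round(short_slope, 6))
print("beta1 + beta2*gamma1: ", round(beta1 + beta2 * gamma1, 6))
print("beta1:                ", beta1)
```

Setting either `beta2` or `gamma1` to 0 in this sketch makes the slope with IQ omitted equal `beta1`, matching the two circumstances listed above.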

Effect of omitting ability on estimates of the returns to education: $y$ = log earnings. The least squares coefficient on years of schooling, multiplied by 100, approximately measures the percent increase in earnings for one extra year of schooling.

Data Set                                               IQ omitted   IQ included

Male Ph.D.’s, 1958-1960                                0.0205       0.0213

Rejected low-AFQT military trained applicants (1962)   0.0346       0.0171

NLSYM (1969)                                           0.065        0.059

Veterans - CPS (1964)                                  0.0508       0.0433

NLSYM (1973)                                           0.041        0.030

G-77                                                   0.022        0.014
