Chapter 13 Multiple Regression
Transcript of Chapter 13 Multiple Regression
© 2007 Thomson South-Western. All Rights Reserved
Chapter 13: Multiple Regression
• Multiple Regression Model
• Least Squares Method
• Multiple Coefficient of Determination
• Model Assumptions
• Testing for Significance
• Using the Estimated Regression Equation for Estimation and Prediction
• Qualitative Independent Variables
• Residual Analysis
Multiple Regression Model
The equation that describes how the dependent variable $y$ is related to the independent variables $x_1, x_2, \ldots, x_p$ and an error term is called the multiple regression model.
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon$
where $\beta_0, \beta_1, \beta_2, \ldots, \beta_p$ are the parameters, and $\epsilon$ is a random variable called the error term.
Multiple Regression Equation
The equation that describes how the mean value of $y$ is related to $x_1, x_2, \ldots, x_p$ is called the multiple regression equation.
$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$
Estimated Multiple Regression Equation
A simple random sample is used to compute sample statistics $b_0, b_1, b_2, \ldots, b_p$ that are used as the point estimators of the parameters $\beta_0, \beta_1, \beta_2, \ldots, \beta_p$.
The estimated multiple regression equation is:
$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_p x_p$
Estimation Process
• Multiple regression model: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon$
• Multiple regression equation: $E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$
• The unknown parameters are $\beta_0, \beta_1, \beta_2, \ldots, \beta_p$.
• Sample data: observed values of $x_1, x_2, \ldots, x_p$ and $y$.
• Estimated multiple regression equation: $\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_p x_p$
• The sample statistics are $b_0, b_1, b_2, \ldots, b_p$.
• $b_0, b_1, b_2, \ldots, b_p$ provide estimates of $\beta_0, \beta_1, \beta_2, \ldots, \beta_p$.
Least Squares Method
Least Squares Criterion
$\min \sum (y_i - \hat{y}_i)^2$
Computation of Coefficient Values
The formulas for the regression coefficients $b_0, b_1, b_2, \ldots, b_p$ involve the use of matrix algebra. We will rely on computer software packages to perform the calculations.
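The calculation that such software performs can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the textbook's procedure: it solves the normal equations $(X^\top X)\,b = X^\top y$ by Gauss-Jordan elimination in plain Python, using the experience, aptitude score, and salary figures from the programmer salary survey that appears later in this chapter. The helper name `ols` is introduced here for illustration only.

```python
# Programmer salary survey data (20 programmers), as tabulated in this chapter.
exper = [4, 7, 1, 5, 8, 10, 0, 1, 6, 6, 9, 2, 10, 5, 6, 8, 4, 6, 3, 3]
score = [78, 100, 86, 82, 86, 84, 75, 80, 83, 91,
         88, 73, 75, 81, 74, 87, 79, 94, 70, 89]
salary = [24, 43, 23.7, 34.3, 35.8, 38, 22.2, 23.1, 30, 33,
          38, 26.6, 36.2, 31.6, 29, 34, 30.1, 33.9, 28.2, 30]


def ols(columns, y):
    """Return [b0, b1, ..., bp] minimizing sum((y_i - yhat_i)**2)."""
    rows = [[1.0, *xs] for xs in zip(*columns)]  # leading 1 gives the intercept b0
    k = len(rows[0])
    # Augmented normal equations [X'X | X'y].
    m = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    for c in range(k):  # Gauss-Jordan elimination with partial pivoting
        piv = max(range(c, k), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        m[c] = [v / m[c][c] for v in m[c]]
        for r in range(k):
            if r != c:
                m[r] = [v - m[r][c] * p for v, p in zip(m[r], m[c])]
    return [row[k] for row in m]


b = ols([exper, score], salary)
y_hat = [b[0] + b[1] * x1 + b[2] * x2 for x1, x2 in zip(exper, score)]
sse = sum((yi - fi) ** 2 for yi, fi in zip(salary, y_hat))
print("b0, b1, b2 =", [round(v, 3) for v in b])
print("SSE =", round(sse, 2))
```

Solving the normal equations directly is fine for a small, well-behaved data set like this one; production software generally uses more numerically stable factorizations.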
Interpreting the Coefficients
In multiple regression analysis, we interpret each regression coefficient as follows: $b_i$ represents an estimate of the change in $y$ corresponding to a 1-unit increase in $x_i$ when all other independent variables are held constant.
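A short check of this interpretation, written for $b_1$ in a model with two independent variables (the same argument applies to any $b_i$): raise $x_1$ by one unit, hold $x_2$ fixed, and subtract the two estimates.

```latex
\begin{aligned}
\hat{y}(x_1 + 1,\, x_2) - \hat{y}(x_1,\, x_2)
  &= \bigl[b_0 + b_1 (x_1 + 1) + b_2 x_2\bigr] - \bigl[b_0 + b_1 x_1 + b_2 x_2\bigr] \\
  &= b_1
\end{aligned}
```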
Multiple Coefficient of Determination
Relationship Among SST, SSR, SSE
SST = SSR + SSE
$\sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2$
where:
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error
Multiple Coefficient of Determination
$R^2 = \text{SSR}/\text{SST}$
Adjusted Multiple Coefficient of Determination
$R_a^2 = 1 - (1 - R^2)\,\dfrac{n - 1}{n - p - 1}$
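To see how the adjustment behaves, here is a small sketch of the adjusted coefficient, $R_a^2 = 1 - (1 - R^2)(n - 1)/(n - p - 1)$. The function name `adjusted_r2` and the inputs ($R^2 = 0.80$, $n = 20$) are illustrative assumptions, not results from the text.

```python
def adjusted_r2(r2, n, p):
    """Adjusted multiple coefficient of determination.

    r2: multiple coefficient of determination (SSR/SST)
    n:  number of observations
    p:  number of independent variables
    """
    if n - p - 1 <= 0:
        raise ValueError("need more than p + 1 observations")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Hypothetical inputs: the same R^2 is penalized more heavily as p grows.
for p in (2, 3, 5):
    print(p, round(adjusted_r2(0.80, 20, p), 4))
```

The adjusted value is never larger than $R^2$ when $p \ge 1$, which is why it is useful for comparing models with different numbers of independent variables.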
Assumptions About the Error Term $\epsilon$
• The error $\epsilon$ is a random variable with mean of zero.
• The variance of $\epsilon$, denoted by $\sigma^2$, is the same for all values of the independent variables.
• The values of $\epsilon$ are independent.
• The error $\epsilon$ is a normally distributed random variable reflecting the deviation between the $y$ value and the expected value of $y$ given by $\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$.
Testing for Significance
• In simple linear regression, the $F$ and $t$ tests provide the same conclusion.
• In multiple regression, the $F$ and $t$ tests have different purposes.
Testing for Significance: F Test
• The $F$ test is used to determine whether a significant relationship exists between the dependent variable and the set of all the independent variables.
• The $F$ test is referred to as the test for overall significance.
Testing for Significance: t Test
• If the $F$ test shows an overall significance, the $t$ test is used to determine whether each of the individual independent variables is significant.
• A separate $t$ test is conducted for each of the independent variables in the model.
• We refer to each of these $t$ tests as a test for individual significance.
Testing for Significance: F Test
• Hypotheses: $H_0\colon \beta_1 = \beta_2 = \cdots = \beta_p = 0$; $H_a$: one or more of the parameters is not equal to zero.
• Test statistic: $F = \text{MSR}/\text{MSE}$
• Rejection rule: reject $H_0$ if p-value $< \alpha$ or if $F > F_\alpha$, where $F_\alpha$ is based on an $F$ distribution with $p$ d.f. in the numerator and $n - p - 1$ d.f. in the denominator.
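The mean squares in the test statistic are not defined on this slide. Under the standard definitions, each sum of squares is divided by its degrees of freedom, which is where the $p$ and $n - p - 1$ in the rejection rule come from:

```latex
\text{MSR} = \frac{\text{SSR}}{p}, \qquad
\text{MSE} = \frac{\text{SSE}}{n - p - 1}, \qquad
F = \frac{\text{MSR}}{\text{MSE}} = \frac{\text{SSR}/p}{\text{SSE}/(n - p - 1)}
```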
Testing for Significance: t Test
• Hypotheses: $H_0\colon \beta_i = 0$; $H_a\colon \beta_i \neq 0$
• Test statistic: $t = \dfrac{b_i}{s_{b_i}}$
• Rejection rule: reject $H_0$ if p-value $< \alpha$ or if $t < -t_{\alpha/2}$ or $t > t_{\alpha/2}$, where $t_{\alpha/2}$ is based on a $t$ distribution with $n - p - 1$ degrees of freedom.
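The slides leave $s_{b_i}$ to software. As a hedged sketch, the standard result is $s_{b_i} = s\sqrt{c_{ii}}$, where $s^2 = \text{MSE}$ and $c_{ii}$ is the corresponding diagonal entry of $(X^\top X)^{-1}$. The code below applies it to the programmer salary survey data used later in this chapter. The helper `invert` is an illustrative name, and the critical value 2.110 ($t_{.025}$ with 17 d.f.) is copied from a standard $t$ table rather than computed.

```python
from math import sqrt

# Programmer salary survey data (20 programmers), as tabulated in this chapter.
exper = [4, 7, 1, 5, 8, 10, 0, 1, 6, 6, 9, 2, 10, 5, 6, 8, 4, 6, 3, 3]
score = [78, 100, 86, 82, 86, 84, 75, 80, 83, 91,
         88, 73, 75, 81, 74, 87, 79, 94, 70, 89]
salary = [24, 43, 23.7, 34.3, 35.8, 38, 22.2, 23.1, 30, 33,
          38, 26.6, 36.2, 31.6, 29, 34, 30.1, 33.9, 28.2, 30]


def invert(a):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    k = len(a)
    m = [row[:] + [float(i == j) for j in range(k)] for i, row in enumerate(a)]
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        m[c] = [v / m[c][c] for v in m[c]]
        for r in range(k):
            if r != c:
                m[r] = [v - m[r][c] * p for v, p in zip(m[r], m[c])]
    return [row[k:] for row in m]


rows = [[1.0, x1, x2] for x1, x2 in zip(exper, score)]
n, k = len(rows), len(rows[0])  # k = p + 1 coefficients
xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
xty = [sum(r[i] * yi for r, yi in zip(rows, salary)) for i in range(k)]
inv = invert(xtx)
b = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]

sse = sum((yi - sum(bj * xj for bj, xj in zip(b, r))) ** 2
          for r, yi in zip(rows, salary))
mse = sse / (n - k)  # n - p - 1 degrees of freedom
std_err = [sqrt(mse * inv[i][i]) for i in range(k)]
t_stats = [bi / se for bi, se in zip(b, std_err)]

T_CRIT = 2.110  # assumption: t_.025 with 17 d.f., read from a t table (alpha = .05)
for name, bi, se, t in zip(["b1", "b2"], b[1:], std_err[1:], t_stats[1:]):
    verdict = "reject H0" if abs(t) > T_CRIT else "do not reject H0"
    print(f"{name}: estimate={bi:.3f}  s_b={se:.3f}  t={t:.2f}  {verdict}")
```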
Testing for Significance: Multicollinearity
• The term multicollinearity refers to the correlation among the independent variables.
• When the independent variables are highly correlated, it is not possible to determine the separate effect of any particular independent variable on the dependent variable.
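One simple screen for multicollinearity between two independent variables is their sample correlation coefficient. The sketch below computes it for years of experience and aptitude score from the programmer salary survey used later in this chapter. The function name `sample_correlation` is an assumption, and where to draw the line for "highly correlated" is a judgment call that the slides do not specify.

```python
from math import sqrt

# Two independent variables from the programmer salary survey in this chapter.
exper = [4, 7, 1, 5, 8, 10, 0, 1, 6, 6, 9, 2, 10, 5, 6, 8, 4, 6, 3, 3]
score = [78, 100, 86, 82, 86, 84, 75, 80, 83, 91,
         88, 73, 75, 81, 74, 87, 79, 94, 70, 89]


def sample_correlation(x, y):
    """Pearson sample correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


r = sample_correlation(exper, score)
print("correlation between experience and score:", round(r, 3))
```

A pairwise correlation only detects collinearity between two variables at a time; it can miss a variable that is nearly a combination of several others.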
Testing for Significance: Multicollinearity
• If the estimated regression equation is to be used only for predictive purposes, multicollinearity is usually not a serious problem.
• Every attempt should be made to avoid including independent variables that are highly correlated.
Using the Estimated Regression Equation for Estimation and Prediction
• The procedures for estimating the mean value of $y$ and predicting an individual value of $y$ in multiple regression are similar to those in simple regression.
• We substitute the given values of $x_1, x_2, \ldots, x_p$ into the estimated regression equation and use the corresponding value of $\hat{y}$ as the point estimate.
Using the Estimated Regression Equation for Estimation and Prediction
• The formulas required to develop interval estimates for the mean value of $y$ and for an individual value of $y$ are beyond the scope of the textbook.
• Software packages for multiple regression will often provide these interval estimates.
Qualitative Independent Variables
• In many situations we must work with qualitative independent variables such as gender (male, female), method of payment (cash, check, credit card), etc.
• For example, $x_2$ might represent gender where $x_2 = 0$ indicates male and $x_2 = 1$ indicates female.
• In this case, $x_2$ is called a dummy or indicator variable.
Qualitative Independent Variables
Example: Programmer Salary Survey
As an extension of the problem involving the computer programmer salary survey, suppose that management also believes that the annual salary is related to whether the individual has a graduate degree in computer science or information systems.
The years of experience, the score on the programmer aptitude test, whether the individual has a relevant graduate degree, and the annual salary ($1000) for each of the 20 sampled programmers are shown on the next slide.
Qualitative Independent Variables
Exper.  Score  Degr.  Salary      Exper.  Score  Degr.  Salary
   4      78    No     24            9      88    Yes    38
   7     100    Yes    43            2      73    No     26.6
   1      86    No     23.7         10      75    Yes    36.2
   5      82    Yes    34.3          5      81    No     31.6
   8      86    Yes    35.8          6      74    No     29
  10      84    Yes    38            8      87    Yes    34
   0      75    No     22.2          4      79    No     30.1
   1      80    No     23.1          6      94    Yes    33.9
   6      83    No     30            3      70    No     28.2
   6      91    Yes    33            3      89    No     30
Estimated Regression Equation
$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3$
where:
$\hat{y}$ = annual salary ($1000)
$x_1$ = years of experience
$x_2$ = score on programmer aptitude test
$x_3$ = 0 if individual does not have a graduate degree, 1 if individual does have a graduate degree
$x_3$ is a dummy variable.
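The following sketch fits this three-variable equation to the 20 sampled programmers, coding the degree column as the dummy variable $x_3$. It is an illustration under stated assumptions rather than the textbook's software output: the helpers `ols` and `predict` are hypothetical names, and the fit is done by solving the normal equations in plain Python.

```python
# Programmer salary survey data (20 programmers), as tabulated in this chapter.
exper = [4, 7, 1, 5, 8, 10, 0, 1, 6, 6, 9, 2, 10, 5, 6, 8, 4, 6, 3, 3]
score = [78, 100, 86, 82, 86, 84, 75, 80, 83, 91,
         88, 73, 75, 81, 74, 87, 79, 94, 70, 89]
degree = ["No", "Yes", "No", "Yes", "Yes", "Yes", "No", "No", "No", "Yes",
          "Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes", "No", "No"]
salary = [24, 43, 23.7, 34.3, 35.8, 38, 22.2, 23.1, 30, 33,
          38, 26.6, 36.2, 31.6, 29, 34, 30.1, 33.9, 28.2, 30]

x3 = [1 if d == "Yes" else 0 for d in degree]  # dummy variable: 1 = has graduate degree


def ols(columns, y):
    """Return [b0, b1, ..., bp] minimizing sum((y_i - yhat_i)**2)."""
    rows = [[1.0, *xs] for xs in zip(*columns)]
    k = len(rows[0])
    # Augmented normal equations [X'X | X'y].
    m = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    for c in range(k):  # Gauss-Jordan elimination with partial pivoting
        piv = max(range(c, k), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        m[c] = [v / m[c][c] for v in m[c]]
        for r in range(k):
            if r != c:
                m[r] = [v - m[r][c] * p for v, p in zip(m[r], m[c])]
    return [row[k] for row in m]


b0, b1, b2, b3 = ols([exper, score, x3], salary)


def predict(years, test_score, has_degree):
    """Point estimate of annual salary ($1000) from the fitted equation."""
    return b0 + b1 * years + b2 * test_score + b3 * (1 if has_degree else 0)


# With experience and score held constant, the two estimates differ by exactly b3.
gap = predict(5, 80, True) - predict(5, 80, False)
print("b0..b3 =", [round(v, 3) for v in (b0, b1, b2, b3)])
print("estimated salary difference for a graduate degree ($1000):", round(gap, 3))
```

Because $x_3$ takes only the values 0 and 1, $b_3$ is read as the estimated salary difference between programmers with and without a graduate degree who have the same experience and score.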
More Complex Qualitative Variables
• If a qualitative variable has $k$ levels, $k - 1$ dummy variables are required, with each dummy variable being coded as 0 or 1.
• For example, a variable with levels A, B, and C could be represented by $x_1$ and $x_2$ values of (0, 0) for A, (1, 0) for B, and (0, 1) for C.
• Care must be taken in defining and interpreting the dummy variables.
More Complex Qualitative Variables
For example, a variable indicating level of education could be represented by $x_1$ and $x_2$ values as follows:
Highest Degree   $x_1$   $x_2$
Bachelor's         0       0
Master's           1       0
Ph.D.              0       1
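The $k - 1$ coding scheme can be sketched as a small helper. The function name `dummy_code` and the sample observations are hypothetical; the first entry of `levels` is treated as the baseline and coded as all zeros, matching the Bachelor's row of the table.

```python
def dummy_code(values, levels):
    """Encode a qualitative variable with k levels as k - 1 dummy (0/1) columns.

    levels[0] is the baseline level and is coded as all zeros.
    """
    if len(levels) < 2 or len(set(levels)) != len(levels):
        raise ValueError("levels must contain at least two distinct entries")
    unknown = set(values) - set(levels)
    if unknown:
        raise ValueError(f"values not listed in levels: {sorted(unknown)}")
    return [[1 if v == level else 0 for level in levels[1:]] for v in values]


levels = ["Bachelor's", "Master's", "Ph.D."]
sample = ["Master's", "Ph.D.", "Bachelor's", "Master's"]  # hypothetical observations
for highest_degree, (x1, x2) in zip(sample, dummy_code(sample, levels)):
    print(f"{highest_degree:<11} x1={x1} x2={x2}")
```

Each coefficient on a dummy column is then interpreted relative to the baseline level, which is one reason the choice of baseline deserves care.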
Residual Analysis
• For simple linear regression the residual plot against $\hat{y}$ and the residual plot against $x$ provide the same information.
• In multiple regression analysis it is preferable to use the residual plot against $\hat{y}$ to determine if the model assumptions are satisfied.
Standardized Residual Plot Against $\hat{y}$
Standardized residuals are frequently used in residual plots for purposes of:
• Identifying outliers (typically, standardized residuals < -2 or > +2)
• Providing insight about the assumption that the error term $\epsilon$ has a normal distribution
The computation of the standardized residuals in multiple regression analysis is too complex to be done by hand. Excel's Regression tool can be used.
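For reference, the slides do not state the formula. A commonly used definition divides each residual by its estimated standard deviation, where $s = \sqrt{\text{MSE}}$ and the leverage $h_i$ is the $i$th diagonal element of $X (X^\top X)^{-1} X^\top$. The need for $h_i$ is what makes the hand calculation impractical.

```latex
\text{standardized residual}_i
  = \frac{y_i - \hat{y}_i}{s \sqrt{1 - h_i}},
\qquad
h_i = \bigl[\, X (X^\top X)^{-1} X^\top \bigr]_{ii}
```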