Applied Regression Modeling, Third Edition
5. Model Building II: Sections 5.1–5.2
Iain Pardoe
Thompson Rivers University
The Pennsylvania State University
March 2020
Outline

5.1 Influential points
5.2 Regression pitfalls
Influential points

Beware model conclusions that are overly influenced by a small handful of data points, e.g.:
- overall results can be biased if a few unusual points differ dramatically from general patterns in the majority of the data values;
- it is misleading to conclude evidence of a strong association between variables if the evidence is based mainly on a few dominant points.
Focus on two measures of individual data point influence:
- outliers have unusual Y-values relative to their predicted Y-values from a model;
- high leverage points have unusual combinations of X-values relative to general dataset patterns.
Cook's distance is a composite measure of outlyingness and leverage.
Outliers

Outliers have unusual Y-values relative to their predicted Y-values from a model.
In other words, they are observations with a large magnitude residual, ei = Yi − Ŷi.
The computer can calculate studentized residuals to put them on a common scale (see the sketch below).
When the four regression assumptions (zero mean, constant variance, normality, and independence) are satisfied, studentized residuals ≈ N(0, 1).
If we identify an observation with a studentized residual outside (−3, 3), we've either witnessed a very unusual event (one with probability less than 0.003) or we've found an observation with a Y-value that doesn't fit the pattern in the rest of the dataset.
Formally define a potential outlier as an observation with studentized residual < −3 or > 3.
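As a minimal sketch (not from the book) of how studentized residuals can be obtained, the following uses statsmodels on simulated data; the variable names and data are hypothetical stand-ins for a real analysis:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical simulated data standing in for a real dataset.
rng = np.random.default_rng(1)
df = pd.DataFrame({"x1": rng.normal(size=50), "x2": rng.normal(size=50)})
df["y"] = 1 + 2 * df["x1"] - df["x2"] + rng.normal(scale=0.5, size=50)

fit = sm.OLS(df["y"], sm.add_constant(df[["x1", "x2"]])).fit()

# Externally studentized residuals: each residual is scaled by an error
# standard deviation estimated with that observation left out.
stud = fit.get_influence().resid_studentized_external

# Potential outliers: studentized residual outside (-3, 3).
print(np.where(np.abs(stud) > 3)[0])
```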
Dealing with outliers

If we find one or more outliers, investigate why:
- data input mistake (remedy: identify and correct the mistake(s) and reanalyze the data);
- important predictor omitted from the model (remedy: identify potentially useful predictors not included in the model and reanalyze the data);
- regression assumptions violated (remedy: reformulate the model using transformations or interactions, say, to correct the problem);
- potential outliers differ substantively from the other sample observations (remedy: remove the outliers and reanalyze the remainder of the dataset separately).
To gauge outlier influence, exclude the observation with the largest magnitude studentized residual, refit the model to the remaining observations, and see if the regression parameter estimates change substantially (see the sketch below).
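A sketch of this refit-and-compare check, again on hypothetical simulated data (one unusual response value is planted so the comparison has something to show):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data frame with columns y, x1, x2 (stand-ins for a real dataset).
rng = np.random.default_rng(2)
df = pd.DataFrame({"x1": rng.normal(size=50), "x2": rng.normal(size=50)})
df["y"] = 1 + 2 * df["x1"] - df["x2"] + rng.normal(scale=0.5, size=50)
df.loc[0, "y"] += 8  # plant one unusual response value

fit1 = smf.ols("y ~ x1 + x2", data=df).fit()
stud = fit1.get_influence().resid_studentized_external

# Drop the observation with the largest magnitude studentized residual and
# refit; large changes in the estimates signal an influential point.
worst = np.abs(stud).argmax()
fit2 = smf.ols("y ~ x1 + x2", data=df.drop(df.index[worst])).fit()
print(pd.DataFrame({"all data": fit1.params, "outlier removed": fit2.params}))
```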
CARS5 data

Y = city gallons per hundred miles for 50 new U.S. passenger cars in 2011.
X1 = engine size (liters).
X2 = number of cylinders.
X3 = interior volume (hundreds of cubic feet).
Model 1: E(Y) = b0 + b1X1 + b2X2 + b3X3.

Parameters (a)
Model 1        Estimate   Std. Error   t-stat   Pr(>|t|)
(Intercept)       2.542        0.365    6.964      0.000
X1 = Eng          0.235        0.112    2.104      0.041
X2 = Cyl          0.294        0.076    3.855      0.000
X3 = Vol          0.476        0.279    1.709      0.094
(a) Response variable: Y = Cgphm.
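For reference, a sketch of how Model 1 could be fit in statsmodels, assuming the CARS5 data are in a CSV file with the column names shown above (the filename is an assumption):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Assumes a CSV with columns Cgphm, Eng, Cyl, Vol (filename hypothetical).
cars = pd.read_csv("CARS5.csv")

# Model 1: E(Cgphm) = b0 + b1*Eng + b2*Cyl + b3*Vol
model1 = smf.ols("Cgphm ~ Eng + Cyl + Vol", data=cars).fit()
print(model1.summary())  # estimates, standard errors, t-stats, p-values
```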
Model 1 studentized residuals

There is one outlier with studentized residual ≈ −8.

[Figure: histogram of Model 1 studentized residuals (about −8 to 0); the isolated value near −8 is the Lexus HS 250h.]
Remove outlier

Parameters (a)
Model 2        Estimate   Std. Error   t-stat   Pr(>|t|)
(Intercept)       2.905        0.237   12.281      0.000
X1 = Eng          0.259        0.071    3.645      0.001
X2 = Cyl          0.228        0.049    4.634      0.000
X3 = Vol          0.492        0.177    2.774      0.008
(a) Response variable: Y = Cgphm.

Regression parameter estimates and p-values change dramatically.
The outlier was a hybrid and did not fit the pattern of the standard gasoline-powered cars.
Model 2 studentized residuals

No further outliers.

[Figure: histogram of Model 2 studentized residuals, all within about (−2, 2).]
Leverage

High leverage points have unusual combinations of X-values relative to general dataset patterns.
If a point is far from the majority of the sample, it can pull the fitted model toward its Y-value, potentially biasing the results.
Leverage measures the potential for an observation to have undue influence on a model (0–1: low–high).
Rule of thumb:
- if leverage > 3(k + 1)/n, investigate further;
- if leverage > 2(k + 1)/n and isolated, investigate further;
- otherwise, no evidence of undue influence.
To gauge influence, exclude the highest leverage point, refit the model to the remaining observations, and see if the regression parameter estimates change substantially (see the sketch below).
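A sketch of the leverage screen just described, using statsmodels' hat-matrix diagonal on hypothetical data:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical simulated data; substitute the fitted model from your analysis.
rng = np.random.default_rng(3)
X = pd.DataFrame({"x1": rng.normal(size=49), "x2": rng.normal(size=49)})
y = 1 + X["x1"] + X["x2"] + rng.normal(size=49)

fit = sm.OLS(y, sm.add_constant(X)).fit()

# Leverages are the diagonal of the hat matrix H = X(X'X)^(-1)X'.
lev = fit.get_influence().hat_matrix_diag

n, k = len(y), X.shape[1]  # n observations, k predictors
print("investigate (> 3(k+1)/n):", np.where(lev > 3 * (k + 1) / n)[0])
print("investigate if isolated (> 2(k+1)/n):", np.where(lev > 2 * (k + 1) / n)[0])
```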
Results with outlier removed

Parameters (a)
Model 2        Estimate   Std. Error   t-stat   Pr(>|t|)
(Intercept)       2.905        0.237   12.281      0.000
X1 = Eng          0.259        0.071    3.645      0.001
X2 = Cyl          0.228        0.049    4.634      0.000
X3 = Vol          0.492        0.177    2.774      0.008
(a) Response variable: Y = Cgphm.

Leverage thresholds: 3(k + 1)/n = 3(3 + 1)/49 = 0.24 and 2(k + 1)/n = 2(3 + 1)/49 = 0.16.
Model 2 leverages

Three high leverage points exceed the 3(k + 1)/n threshold.

[Figure: leverage versus ID number; the three points above the threshold are the Aston Martin DB9, Dodge Challenger SRT8, and Rolls-Royce Ghost.]
Results with highest leverage point removed

Parameters (a)
Model 3        Estimate   Std. Error   t-stat   Pr(>|t|)
(Intercept)       2.949        0.247   11.931      0.000
X1 = Eng          0.282        0.080    3.547      0.001
X2 = Cyl          0.201        0.064    3.139      0.003
X3 = Vol          0.532        0.188    2.824      0.007
(a) Response variable: Y = Cgphm.

Regression parameter estimates and p-values don't change dramatically.
The highest leverage point had the potential to strongly influence results, but in this case did not do so.
Cook's distance

Cook's distance is a composite measure of outlyingness and leverage.
Rule of thumb:
- observations with a Cook's distance > 1 are often sufficiently influential that they should be removed from the main analysis (investigate further);
- observations with a Cook's distance > 0.5 are sometimes sufficiently influential that they should be removed from the main analysis (investigate further);
- otherwise, unlikely to have undue influence (although not always; see Problem 5.2).
To gauge influence, exclude the observation with the largest Cook's distance, refit the model to the remaining observations, and see if the regression parameter estimates change substantially (see the sketch below).
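A sketch of the Cook's distance screen, again on hypothetical data; statsmodels returns the distances alongside p-values, and only the distances are used here:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical simulated data; substitute the fitted model from your analysis.
rng = np.random.default_rng(4)
X = pd.DataFrame({"x1": rng.normal(size=50), "x2": rng.normal(size=50)})
y = 1 + X["x1"] + X["x2"] + rng.normal(size=50)

fit = sm.OLS(y, sm.add_constant(X)).fit()

# cooks_distance returns (distances, p-values); keep the distances.
cooks = fit.get_influence().cooks_distance[0]

print("max Cook's distance:", cooks.max())
print("above 0.5:", np.where(cooks > 0.5)[0])  # rule-of-thumb screen
print("above 1.0:", np.where(cooks > 1.0)[0])
```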
Results for all observations

Parameters (a)
Model 1        Estimate   Std. Error   t-stat   Pr(>|t|)
(Intercept)       2.542        0.365    6.964      0.000
X1 = Eng          0.235        0.112    2.104      0.041
X2 = Cyl          0.294        0.076    3.855      0.000
X3 = Vol          0.476        0.279    1.709      0.094
(a) Response variable: Y = Cgphm.
Model 1 Cook's distances

The highest Cook's distance exceeds the 0.5 threshold.

[Figure: Cook's distance versus ID number; the point above 0.5 is the Lexus HS 250h.]
Results with highest Cook's distance removed

Parameters (a)
Model 2        Estimate   Std. Error   t-stat   Pr(>|t|)
(Intercept)       2.905        0.237   12.281      0.000
X1 = Eng          0.259        0.071    3.645      0.001
X2 = Cyl          0.228        0.049    4.634      0.000
X3 = Vol          0.492        0.177    2.774      0.008
(a) Response variable: Y = Cgphm.

The car with the highest Cook's distance was the outlier we found before.
Model 2 Cook's distances

The highest Cook's distance is less than the 0.5 threshold.

[Figure: Cook's distance versus ID number; all values fall below 0.5. Labeled points: Aston Martin DB9, Dodge Challenger SRT8, Rolls-Royce Ghost.]
Regression pitfalls

Some of the pitfalls that can cause problems with a regression analysis:
- nonconstant variance: violation of the constant variance assumption;
- autocorrelation (serial correlation): failing to account for time trends in the model;
- multicollinearity: highly correlated predictors causing unstable model results;
- excluding important predictor variables: leading to possibly incorrect conclusions;
- overfitting (the sample data): leading to poor generalizability to the population;
- extrapolation: using model results for predictor values very different from those in the sample;
- missing data: leading to reduced sample sizes at best, misleading results at worst;
- power and sample size: small samples can lead to problems with power.
Nonconstant variance

Regression model residuals can violate the constant variance assumption if there is a fan or funnel pattern in a residual scatterplot.
Tests such as Breusch/Pagan (1979) and Cook/Weisberg (1983) can complement residual plot analysis (see the sketch below).
Nonconstant variance creates technical difficulties with the model, such as inaccurate prediction intervals.
Remedies include response transformations, weighted least squares, generalized least squares, and generalized linear models.
Example: FERTILITY data file with Y = Fertility, X1 = loge(Income), and X2 = Education for 169 countries.
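A sketch of the Breusch-Pagan check for this example, assuming a FERTILITY.csv with columns Fertility, Income, and Education (the filename and exact column names are assumptions based on the slide):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.diagnostic import het_breuschpagan

# Assumed file/columns based on the slide's variable descriptions.
fert = pd.read_csv("FERTILITY.csv")
fert["logIncome"] = np.log(fert["Income"])

fit = smf.ols("Fertility ~ logIncome + Education", data=fert).fit()

# Breusch-Pagan regresses squared residuals on the predictors; a small
# p-value suggests the constant variance assumption is violated.
lm_stat, lm_pval, f_stat, f_pval = het_breuschpagan(fit.resid, fit.model.exog)
print(f"Breusch-Pagan LM p-value: {lm_pval:.4f}")
```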
Model 1 residuals

Model 1 uses Y as the response variable. The plot indicates increasing residual variance as the fitted values increase.

[Figure: studentized residuals versus fitted values for Model 1, with spread that widens as fitted values increase.]
Model 2 residuals

Model 2 uses loge(Y) as the response variable. The constant variance assumption is more reasonable now.

[Figure: studentized residuals versus fitted values for Model 2, with roughly constant spread.]
Autocorrelation

Autocorrelation occurs when regression model residuals violate the independence assumption because they are highly dependent across time.
It can occur when regression data have been collected over time and the model fails to account for any strong time trends.
Dealing with this issue rigorously can require specialized time series and forecasting methods.
Sometimes, however, simple ideas can mitigate autocorrelation problems (see the sketch below).
Example: UNEMP data file contains annual macroeconomic data from 1959 to 2010.
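One simple diagnostic is the Durbin-Watson statistic (values near 2 suggest little lag-1 autocorrelation in the residuals), and one simple remedy, which Model 2 below uses, is first differencing. A sketch with an assumed filename and hypothetical variable names, since the slides do not list the four predictors:

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.stattools import durbin_watson

# Assumed filename; y and x1-x4 are hypothetical stand-ins for the
# response and the four macroeconomic predictors.
unemp = pd.read_csv("UNEMP.csv")

# Model 1 on the raw annual series.
fit1 = smf.ols("y ~ x1 + x2 + x3 + x4", data=unemp).fit()
print("DW, raw series:", durbin_watson(fit1.resid))  # far from 2 signals trouble

# Model 2 on first differences (year-to-year changes).
diffs = unemp.diff().dropna()
fit2 = smf.ols("y ~ x1 + x2 + x3 + x4", data=diffs).fit()
print("DW, first differences:", durbin_watson(fit2.resid))  # closer to 2
```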
Model 1 residuals

Model 1: multiple linear regression with 4 predictors. The residual plot shows clear evidence of autocorrelation.

[Figure: residuals versus time period for Model 1.]
Model 2 residuals

Model 2 uses first-differenced data. The independent errors assumption is more reasonable now.

[Figure: residuals versus time period for Model 2.]
Multicollinearity

Multicollinearity occurs when excessive correlation between quantitative predictors leads to unstable models and inflated standard errors.
Identify it by looking at a scatterplot matrix, calculating bivariate correlations, and calculating variance inflation factors (potential problems if > 4, likely problems if > 10); see the sketch below.
Potential remedies include:
- collect more uncorrelated data (if possible);
- create new combined predictor variables from the highly correlated predictors (if possible);
- remove one of the highly correlated predictors from the model.
Example: SALES3 data file with sales (Y = Sales), TV/newspaper advertising (X1 = Trad), and internet advertising (X2 = Int).
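A sketch of the VIF calculation for this example, assuming a SALES3.csv with the columns named on the slide (the filename is an assumption):

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

sales = pd.read_csv("SALES3.csv")  # assumed filename; columns from the slide

X = sm.add_constant(sales[["Trad", "Int"]])

# VIF for predictor j is 1/(1 - Rj^2), where Rj^2 comes from regressing
# predictor j on the other predictors; > 4 warrants a look, > 10 is a problem.
for j, name in enumerate(X.columns):
    if name != "const":
        print(name, variance_inflation_factor(X.values, j))
```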
Model 1 results

Model 1: E(Y) = b0 + b1X1 + b2X2.

Model Summary
Model 1   Multiple R   R Squared   Adjusted R Squared   Regression Std. Error
           0.987 (a)       0.974                0.968                  0.8916
(a) Predictors: (Intercept), X1 = Trad, X2 = Int.

Parameters (a)
Model 1        Estimate   Std. Error   t-stat   Pr(>|t|)      VIF
(Intercept)       1.992        0.902    2.210      0.054
X1 = Trad         0.767        0.868    0.884      0.400   49.541
X2 = Int          1.275        0.737    1.730      0.118   49.541
(a) Response variable: Y = Sales.

R2 is 0.974, but neither X1 nor X2 is significant (given the presence of the other)!
A VIF > 10 suggests there is a multicollinearity problem.
X1 and X2 highly correlated

Estimates are unstable when both X1 and X2 are in the model.

[Figure: scatterplot matrix of Sales, Trad, and Int; Trad and Int are strongly linearly related.]
Model 2 results

Model 2: E(Y) = b0 + b1(X1 + X2).

Model Summary
Model 2   Multiple R   R Squared   Adjusted R Squared   Regression Std. Error
           0.987 (a)       0.974                0.971                  0.8505
(a) Predictors: (Intercept), X1 + X2.

Parameters (a)
Model 2        Estimate   Std. Error   t-stat   Pr(>|t|)
(Intercept)       1.776        0.562    3.160      0.010
X1 + X2           1.042        0.054   19.240      0.000
(a) Response variable: Y = Sales.

R2 is unchanged, and the combined predictor variable, X1 + X2, is significant.
Note this approach is only possible if it makes sense to create a combined predictor variable (see the sketch below).
It is more common to drop one of the correlated predictors from the model.
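A sketch of how the combined-predictor model can be specified with the formula interface, where I() forms the arithmetic sum Trad + Int as a single term (same assumed file as above):

```python
import pandas as pd
import statsmodels.formula.api as smf

sales = pd.read_csv("SALES3.csv")  # assumed filename; columns from the slide

# I(Trad + Int) tells the formula parser to sum the two columns
# arithmetically rather than treat them as separate model terms.
model2 = smf.ols("Sales ~ I(Trad + Int)", data=sales).fit()
print(model2.params)
print(model2.rsquared)
```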
Excluding important predictor variables

Excluding important predictors sometimes results in models that provide incorrect, biased conclusions about the included predictors.
Strive to include all potentially important predictors, and remove a predictor only if there are compelling reasons to do so (e.g., if it is causing multicollinearity problems and has a high individual p-value).
Example: PARADOX data file with n = 27 high-precision computer components, with component quality (Y) potentially depending on two controllable machine factors, speed (X1) and angle (X2).
Model 1 results

Model 1: E(Y) = b0 + b1X1.

Parameters (a)
Model 1        Estimate   Std. Error   t-stat   Pr(>|t|)
(Intercept)       2.847        1.011    2.817      0.009
X1 = Speed        0.430        0.188    2.288      0.031
(a) Response variable: Y = Quality.

Results suggest a positive association between quality and speed.
In other words, increase the speed of the machine to improve quality.
However, this ignores process information relating to angle.
Model 2 results

Model 2: E(Y) = b0 + b1X1 + b2X2.

Parameters (a)
Model 2        Estimate   Std. Error   t-stat   Pr(>|t|)
(Intercept)       1.638        0.217     7.551     0.000
X1 = Speed       −0.962        0.071   −13.539     0.000
X2 = Angle        2.014        0.086    23.473     0.000
(a) Response variable: Y = Quality.

Results suggest a negative association between quality and speed (for a fixed angle), and a positive association between quality and angle (for a fixed speed).
In other words, increase the angle but decrease the speed of the machine to improve quality (see the sketch below).
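The sign flip can be reproduced on simulated data in the same spirit as the PARADOX file; every number below is made up for illustration. When speed tracks angle and angle has a large positive effect, omitting angle makes speed appear beneficial:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 200
angle = rng.normal(size=n)
speed = angle + 0.5 * rng.normal(size=n)  # speed tracks angle
quality = 1.6 - 1.0 * speed + 2.0 * angle + 0.3 * rng.normal(size=n)
df = pd.DataFrame({"Quality": quality, "Speed": speed, "Angle": angle})

# Omitting Angle: Speed soaks up Angle's positive effect.
print(smf.ols("Quality ~ Speed", data=df).fit().params["Speed"])          # > 0

# Including Angle: Speed's own (negative) effect is recovered.
print(smf.ols("Quality ~ Speed + Angle", data=df).fit().params["Speed"])  # < 0
```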
Paradox explained

Positive association between Y and X1 ignoring X2. Negative association between Y and X1 accounting for X2.

[Figure: Quality versus Speed, with points marked by Angle (1–7); the trend is upward overall but downward within each angle group.]
Overfitting

Overfitting can occur if an overly complicated model tries to account for every possible pattern in the sample data but generalizes poorly to the underlying population.
We should always apply a "sanity check" to make sure the model makes sense from a subject-matter perspective and the conclusions are supported by the data (see the sketch below).

[Figure: two alternative model fits to the same (X, Y) sample, one simple and one highly flexible.]
Which model seems more reasonable?
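A sketch of the overfitting idea on simulated data: a high-degree polynomial beats a straight line on the sample it was fit to, but loses on fresh data from the same population:

```python
import numpy as np

# Simulated illustration: the true relationship is a straight line.
rng = np.random.default_rng(6)
x = np.sort(rng.uniform(-1, 1, 30))
y = 2 + 3 * x + rng.normal(size=30)
x_new = rng.uniform(-1, 1, 1000)  # fresh draw from the same population
y_new = 2 + 3 * x_new + rng.normal(size=1000)

for degree in (1, 12):
    coefs = np.polyfit(x, y, degree)  # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coefs, x) - y) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_new) - y_new) ** 2)
    # The degree-12 fit wins in sample but loses out of sample.
    print(f"degree {degree:2d}: train MSE {train_mse:.2f}, test MSE {test_mse:.2f}")
```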
Extrapolation

Extrapolation occurs when regression model results are used to estimate or predict a response value for an observation with predictor values that are very different from those in the sample.
This can be dangerous because it means making a decision about a situation where there are no data values to support our conclusions.
Example: if we observe an upward trend between two variables, should we assume the trend continues indefinitely at higher values?
Extrapolation example

The straight-line model overshoots the actual Y at the far right, and the quadratic model undershoots (i.e., neither model enables accurate prediction far from the sample data).

[Figure: Production (in 100s) versus Labor (labor hours in 100s), with straight-line and quadratic fits extended beyond the sample.]
Missing data

Missing data occurs when particular values in the dataset have not been recorded for particular variables and observations.
Dealing with the issue rigorously is beyond the scope of the book, but there are some simple ideas that can mitigate some of the major problems.
Example: MISSING data file with n = 30, Y, and X1–X4.
There are no missing values for Y, X1, or X4, but five missing values for X2, one of which is also missing X3:
- any model including X2 will exclude five observations;
- including X3 (but excluding X2) will exclude one observation;
- excluding X2 and X3 will exclude no observations (see the sketch below).
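A sketch of the complete-case counting just described, assuming a MISSING.csv with columns y and x1 through x4 (filename and names assumed):

```python
import pandas as pd

df = pd.read_csv("MISSING.csv")  # assumed filename; n = 30 rows

# Listwise deletion: an observation is usable only if every variable
# in the model is observed for it.
for predictors in (["x1", "x2", "x3", "x4"],
                   ["x2", "x3", "x4"],
                   ["x1", "x3", "x4"],
                   ["x1", "x4"]):
    usable = df[["y"] + predictors].dropna()
    print(predictors, "usable sample size:", len(usable))
```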
Model results

Predictors          Sample size      R2       s
X1, X2, X3, X4               25   0.959   0.865
X2, X3, X4                   25   0.958   0.849
X1, X3, X4                   29   0.953   0.852
X1, X4                       30   0.640   2.300

Ordinarily, we would probably favor the (X2, X3, X4) model.
However, the (X1, X3, X4) model applies to more of the sample.
Thus, in this case, we would probably favor the (X1, X3, X4) model, since R2 and s are roughly equivalent, but the usable sample size is larger.