Lecture 25: Regression Diagnostics for the Multiple Linear Regression Model; Dealing with Influential Observations
Posted: 20-Dec-2015
Lecture 25
• Regression diagnostics for the multiple linear regression model
• Dealing with influential observations for multiple linear regression
• Interaction variables
Distributions: Midterm 2 Scores
[Histogram of Midterm 2 scores, axis 25 to 55]
Moments: Mean 43.916667, Std Dev 4.7580845, Std Err Mean 0.8687034, upper 95% Mean 45.693365, lower 95% Mean 42.139969, N 30
Approximate grade guidelines: a raw exam score of 44+ is in the A range; 35+ is in the B range. The final grade is determined by 40% homework, 20% for each midterm, and 20% for the final. The lower midterm score is replaced by the final exam score if the latter is higher.
Assumptions of Multiple Linear Regression Model
• For each subpopulation (X1, ..., Xp):
– (A-1A) μ{Y|X1,...,Xp} = β0 + β1X1 + ... + βpXp
– (A-1B) Var(Y|X1,...,Xp) = σ² (constant variance)
– (A-1C) The distribution of Y|X1,...,Xp is normal [the distribution of the residuals should not depend on (x1,...,xp)]
• (A-2) The observations are independent of one another
Checking/Refining Model
• Tools for checking (A-1A) and (A-1B):
– Residual plots versus predicted (fitted) values
– Residual plots versus explanatory variables x1, ..., xp
– If the model is correct, there should be no pattern in the residual plots
• Tool for checking (A-1C): histogram of residuals
• Tool for checking (A-2): residual plot versus time or spatial order of observations
Model Building (Display 9.9)
1. Make a scatterplot matrix of the variables (using Analyze, Multivariate). Decide whether to transform any of the explanatory variables. Check for obvious outliers.
2. Fit a tentative model.
3. Check residual plots for whether the assumptions of the multiple regression model are satisfied. Look for outliers and influential points.
4. Consider fitting a richer model with interactions or curvature. See if the extra terms can be dropped.
5. Make changes to the model and repeat steps 2-4 until an adequate model is found.
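Step 2 (fitting a tentative model by least squares) can be sketched outside JMP as well. The sketch below uses made-up data, not the pollution data; the variable names and coefficient values are illustrative assumptions only.

```python
import numpy as np

# Sketch: fit a tentative multiple regression of y on x1, x2 by least squares.
# The data are synthetic (assumed for illustration), not pollutionhc.JMP.
rng = np.random.default_rng(0)
n = 60
x1 = rng.uniform(10, 60, n)           # a precipitation-like predictor
x2 = rng.uniform(9, 12, n)            # an education-like predictor
y = 1000 + 2.0 * x1 - 20.0 * x2 + rng.normal(0, 30, n)

X = np.column_stack([np.ones(n), x1, x2])     # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # least squares coefficients

fitted = X @ beta
resid = y - fitted
r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
print(beta, r2)
```

The residuals from this fit are exactly what steps 3-4 examine for patterns, outliers, and influential points.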
Multiple Regression: Modeling and Outliers, Leverage and Influential Points (Pollution Example)
• Data set pollutionhc.JMP provides information about the relationship between pollution and mortality for 60 cities over 1959-1961.
• The variables are:
– y (MORT) = total age-adjusted mortality in deaths per 100,000 population
– PRECIP = mean annual precipitation (in inches)
– EDUC = median number of school years completed by persons 25 and older
– NONWHITE = percentage of the 1960 population that is nonwhite
– HC = relative pollution potential of hydrocarbons (the product of tons emitted per day per square kilometer and a factor correcting for SMSA dimension and exposure)
Scatterplot Matrix
[Scatterplot matrix of MORTALITY, PRECIP, EDUC, NONWHITE, and HC]
Transformations for Explanatory Variables
• In deciding whether to transform an explanatory variable x, we consider two features of the plot of the response y vs. the explanatory variable x.
1. Is there curvature in the relationship between y and x? This suggests a transformation chosen by Tukey’s Bulging rule.
2. Are most of the x values “crunched together” and a few very spread apart? This will lead to several points being very influential. When this is the case, it is best to transform x to make the x values more evenly spaced and less influential. If the x values are positive, the log transformation is a good idea.
• For the pollution data, reason 2 suggests transforming HC to log HC.
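The effect of reason 2 can be seen numerically: logging a positive variable whose values are crunched together with a few far apart makes the spacing much more even. The HC-like values below are made up for illustration; the actual values are in pollutionhc.JMP.

```python
import numpy as np

# Sketch: a variable with most values "crunched together" and a few far apart.
# These numbers are assumed for illustration only.
hc = np.array([1, 2, 3, 4, 5, 6, 8, 15, 105, 311, 648], dtype=float)

log_hc = np.log(hc)   # natural log; any log base gives the same relative spacing

# On the raw scale the largest values dominate the spread; after logging,
# the values are far more evenly spaced, so no single point is as influential.
print(hc.max() / np.median(hc), log_hc.max() / np.median(log_hc))
```

The ratio of the maximum to the median drops dramatically after the transformation, which is why log HC is used in place of HC below.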
Scatterplot Matrix
[Scatterplot matrix of MORTALITY, PRECIP, EDUC, NONWHITE, and Log HC]
Response MORTALITY
Summary of Fit
  RSquare                 0.62584
  RSquare Adj             0.598628
  Root Mean Square Error  39.40713

Analysis of Variance
  Source    DF  Sum of Squares  Mean Square  F Ratio  Prob > F
  Model      4      142862.38      35715.6   22.9990    <.0001
  Error     55       85410.70       1552.9
  C. Total  59      228273.08

Parameter Estimates
  Term       Estimate   Std Error  t Ratio  Prob>|t|
  Intercept  1043.1145  96.26515    10.84    <.0001
  PRECIP     1.9789185  0.762967     2.59    0.0121
  EDUC       -22.93284  7.003909    -3.27    0.0018
  NONWHITE   2.8382152  0.692279     4.10    0.0001
  Log HC     14.984527  5.401444     2.77    0.0075
If the multiple regression model assumptions are correct, there is strong evidence that mean mortality increases as hydrocarbons increase, holding precipitation, education, and nonwhite fixed. If there are no uncontrolled-for confounding variables, this would suggest that an increase in hydrocarbons causes an increase in mortality.
Residual vs. Predicted Plot
• Useful for detecting nonconstant variance; look for a fan or funnel pattern.
• Plot of residuals e_i = y_i − ŷ_i versus predicted values ŷ_i = μ̂{Y|x_i1,...,x_ip}.
• For the pollution data, there is no strong indication of nonconstant variance.
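A crude numerical companion to the residual vs. predicted plot: if the variance is constant, the size of the residuals should be unrelated to the fitted values. The sketch below uses synthetic data with constant-variance errors; the correlation check is an informal assumption-of-mine, not a formal test.

```python
import numpy as np

# Sketch (synthetic data): compute residuals e_i = y_i - fitted_i and do a crude
# numerical check for a fan/funnel pattern: does |e_i| grow with the fitted value?
rng = np.random.default_rng(1)
n = 60
x = rng.uniform(0, 10, n)
y = 5 + 3 * x + rng.normal(0, 2, n)   # constant-variance errors by construction

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
resid = y - fitted

# Correlation of |residual| with fitted value: near 0 suggests no fan pattern.
fan_check = np.corrcoef(fitted, np.abs(resid))[0, 1]
print(round(fan_check, 3))
```

A markedly positive value of this correlation would correspond to the fan shape one looks for in the plot.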
Residual by Predicted Plot
[Plot of MORTALITY residuals (−100 to 100) versus predicted MORTALITY (750 to 1050)]
Residual plots vs. each explanatory variable
• Make a plot of residuals vs. an explanatory variable by using Fit Model, clicking the red triangle next to the response, selecting Save Columns, and selecting Save Residuals. This creates a column of residuals. Then click Analyze, Fit Y by X and put the residuals in Y and the explanatory variable in X.
• Use these residual plots to check for pattern in the mean of residuals (suggests that we need to transform x or use a polynomial in x) or pattern in the variance of the residuals.
Bivariate Fit of Residual MORTALITY By PRECIP
[Plot of MORTALITY residuals (−100 to 100) versus PRECIP (0 to 70)]
Bivariate Fit of Residual MORTALITY By EDUC
[Plot of MORTALITY residuals (−100 to 100) versus EDUC (8.5 to 12.5)]
Bivariate Fit of Residual MORTALITY By NONWHITE
[Plot of MORTALITY residuals (−100 to 100) versus NONWHITE (0 to 40)]
Bivariate Fit of Residual MORTALITY By Log HC
[Plot of MORTALITY residuals (−100 to 100) versus Log HC (−1 to 7)]
Residual plots look fine. No strong indication of nonlinearity or nonconstant variance.
Check of Normality/Outliers
[Histogram of MORTALITY residuals, −100 to 100]
Normality looks okay. One residual outlier: Lancaster.
Influential Observations
• As in simple linear regression, one or two observations can strongly influence the estimates.
• It is harder to immediately see the influential observations in multiple regression.
• Use Cook's distances (Cook's D Influence) to look for influential observations. An observation has large influence if its Cook's distance is greater than 1.
• You can use Table, Sort to sort observations by Cook's distance or leverage.
• For the pollution data: no observation has high influence.
Strategy for dealing with influential observations
• Use Display 11.8.
• Leverage of a point: a measure of the distance between the point's explanatory variable values and the explanatory variable values in the entire data set.
• Two sources of influence: leverage and the magnitude of the residual.
• General approach: if an influential point has high leverage, omit the point and report conclusions for the reduced range of explanatory variables. If an influential point does not have high leverage, then the point cannot just be removed; report results with and without the point.
Leverage
• Obtaining leverages from JMP: after Fit Model, click the red triangle next to Response, select Save Columns, Hats.
• Leverages are between 1/n and 1. The average leverage is p/n.
• An observation is considered to have high leverage if its leverage is greater than 2p/n, where p = # of explanatory variables. For the pollution data, 2p/n = (2*4)/60 = .133.
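The quantities JMP saves as Hats and Cook's D Influence can be computed directly from the design matrix. The sketch below uses synthetic data with one deliberately extreme point, and it applies the slides' 2p/n cutoff with p counted as the number of explanatory variables (the data and cutoff choice are assumptions for illustration).

```python
import numpy as np

# Sketch: leverages (diagonal of the hat matrix) and Cook's distances,
# computed from the design matrix. Synthetic data, not pollutionhc.JMP.
rng = np.random.default_rng(2)
n, p = 60, 4
X0 = rng.normal(size=(n, p))
X0[0] = [8, 8, 8, 8]                       # one deliberately extreme point
y = X0 @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(size=n)

X = np.column_stack([np.ones(n), X0])      # add intercept column
H = X @ np.linalg.inv(X.T @ X) @ X.T       # hat matrix
h = np.diag(H)                             # leverages, each between 1/n and 1

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
k = X.shape[1]                             # number of regression coefficients
s2 = resid @ resid / (n - k)
cooks_d = resid**2 / (k * s2) * h / (1 - h) ** 2

high_leverage = h > 2 * p / n              # slides' rule of thumb: 2p/n = .133 here
print(high_leverage[0], cooks_d.max())
```

The extreme first row gets flagged as high leverage; whether it is also influential depends on its Cook's distance, which combines leverage with the size of the residual, matching the two sources of influence above.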
Specially Constructed Explanatory Variables
• Interaction variables
• Squared and higher polynomial terms for curvature
• Dummy variables for categorical variables.
Interaction
• Interaction is a three-variable concept. One of these is the response variable (Y) and the other two are explanatory variables (X1 and X2).
• There is an interaction between X1 and X2 if the impact of an increase in X2 on Y depends on the level of X1.
• To incorporate interaction into the multiple regression model, we add the explanatory variable X1*X2. There is evidence of an interaction if the coefficient on X1*X2 is significant (the t-test has p-value < .05).
An experiment to study how noise affects the performance of children tested second grade hyperactive children and a control group of second graders who were not hyperactive. One of the tasks involved solving math problems. The children solved problems under both high-noise and low-noise conditions. Here are the mean scores:
[Bar chart of Mean Mathematics Score (0 to 250) for Control and Hyperactive children under High Noise and Low Noise conditions]
Let Y = Mean Mathematics Score, X1 = Type of Child (0 = Control, 1 = Hyperactive), and X2 = Type of Noise (0 = Low Noise, 1 = High Noise). There is an interaction between type of child and type of noise: the impact of increasing noise from low to high depends on the type of child.
Interaction variables in JMP
• To add an interaction variable in Fit Model in JMP, add the usual explanatory variables first, then highlight X1 in the Select Columns box and X2 in the Construct Model Effects box. Then click Cross in the Construct Model Effects box.
• JMP creates the explanatory variable (X1 − mean of X1)*(X2 − mean of X2).
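The centered product that Cross constructs is straightforward to compute by hand. The values below are made up for illustration; centering each variable at its sample mean before multiplying is the point being shown.

```python
import numpy as np

# Sketch of the Cross construction: the centered product
# (X1 - mean(X1)) * (X2 - mean(X2)) as a new column. Values are assumed.
x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = np.array([10.0, 20.0, 30.0, 40.0])

interaction = (x1 - x1.mean()) * (x2 - x2.mean())
print(interaction)   # -> [22.5  2.5  2.5 22.5]
```

Centering changes the intercept and main-effect coefficients but not the interaction coefficient or its t-test, so the centered and uncentered products give the same evidence about interaction.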
Interaction Model for Pollution Data

μ{Y | precip, educ, nonwhite, log HC} = β0 + β1 precip + β2 educ + β3 nonwhite + β4 log HC + β5 (log HC − 2.75)*(precip − 37.37)

The effect of increasing log HC by one unit, holding the other variables fixed, is

μ{Y | precip = x1, educ = x2, nonwhite = x3, log HC = x4 + 1} − μ{Y | precip = x1, educ = x2, nonwhite = x3, log HC = x4} = β4 + β5 (x1 − 37.37)
Response MORTALITY
Summary of Fit
  RSquare                     0.703313
  RSquare Adj                 0.675842
  Root Mean Square Error      35.41441
  Mean of Response            940.3568
  Observations (or Sum Wgts)  60

Parameter Estimates
  Term                               Estimate   Std Error  t Ratio  Prob>|t|
  Intercept                          989.56166  87.67919    11.29    <.0001
  PRECIP                             1.7266652  0.688946     2.51    0.0152
  EDUC                               -18.72354  6.393312    -2.93    0.0050
  NONWHITE                           2.3882226  0.633573     3.77    0.0004
  Log HC                             25.422693  5.593733     4.54    <.0001
  (Log HC-2.75329)*(PRECIP-37.3667)  1.2550598  0.334228     3.76    0.0004
There is strong evidence (p-value = .0004) of an interaction between hydrocarbons and precipitation. The impact of an increase in hydrocarbons on mean mortality, holding education, nonwhite, and precipitation fixed, is greater for higher precipitation levels.