Regression: Checking the Model
Peter T. Donnan
Professor of Epidemiology and Biostatistics
Statistics for Health Research
Objectives of session
• Recognise the need to check fit of the model
• Carry out checks of assumptions in SPSS for simple linear regression
• Understand predictive model
• Understand residuals
How is the fitted line obtained?
Use method of least squares (LS)
Seek to minimise squared vertical differences between each point and fitted line
Results in parameter estimates or regression coefficients of slope (b) and intercept (a): y = a + bx
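As a minimal sketch of the least-squares idea above, written in Python rather than SPSS: the function name `fit_line` and the toy data are illustrative assumptions, not part of the LDL data set. The slope and intercept that minimise the squared vertical differences can be computed directly from the sums of squares and cross-products.

```python
def fit_line(x, y):
    """Return (a, b) minimising the sum of (y_i - (a + b*x_i))**2."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares about the means
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx              # slope
    a = mean_y - b * mean_x    # intercept: the line passes through the means
    return a, b

# Toy data lying exactly on y = 1 + 2x (made up for illustration)
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
a, b = fit_line(x, y)
print(a, b)
```

Because the toy points lie exactly on a line, the fitted intercept and slope recover that line; with real data the fitted line is the one with the smallest sum of squared vertical differences.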
Consider fitted line of y = a + bx
[Figure: fitted line y = a + bx, with Explanatory (x) on the horizontal axis, Dependent (y) on the vertical axis, and the intercept a marked]
Consider the regression of minimum LDL cholesterol achieved on age
• Select Regression, then Linear…
• Dependent (y) – Min LDL achieved
• Independent (x) – Age_Base
N.B. -0.008 may look very small but represents:
The DECREASE in LDL achieved for each increase of one unit of age, i.e. ONE year
Output from SPSS linear regression
Coefficients (a)
Model               Unstandardized B   Std. Error   Standardized Beta   t        Sig.
1 (Constant)        2.024              .105                             19.340   .000
  Age at baseline   -.008              .002         -.121               -4.546   .000
a. Dependent Variable: Min LDL achieved
H0: slope b = 0
Test t = slope/SE = -4.546 (SPSS works from the unrounded slope and SE; the displayed -0.008/0.002 gives only about -4), with p < 0.001, so statistically significant
Predicted LDL = 2.024 - 0.008 × Age
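The test of the slope can be sketched in a few lines of Python. This is an illustration, not what SPSS does internally. It assumes a normal approximation to the t distribution, which is reasonable here only because the sample is large (N = 1383 in the residuals output). The helper name `two_sided_p_normal` is made up for this example.

```python
import math

slope = -0.008        # B for Age at baseline, as displayed (rounded)
se = 0.002            # Std. Error, as displayed (rounded)
t_reported = -4.546   # t as reported by SPSS from unrounded values

# Ratio of the rounded figures; it differs from the reported t
t_rounded = slope / se

def two_sided_p_normal(t):
    """Two-sided p-value, using the normal approximation to the t distribution."""
    return math.erfc(abs(t) / math.sqrt(2))

p = two_sided_p_normal(t_reported)
print(round(t_rounded, 3), p < 0.001)
```

The negative sign of t simply reflects the negative slope; the p-value depends only on its magnitude.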
So for a man aged 65 the predicted LDL achieved = 2.024 - 0.008 × 65 = 1.504
Prediction Equation from linear regression
Age Predicted Min LDL
45 1.664
55 1.584
65 1.504
75 1.424
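The table above can be reproduced with a short Python sketch of the prediction equation. The function name `predicted_min_ldl` is an illustrative assumption; the coefficients are the rounded values from the SPSS output.

```python
INTERCEPT = 2.024   # (Constant) from the SPSS coefficients table
SLOPE = -0.008      # coefficient for Age at baseline

def predicted_min_ldl(age):
    """Predicted minimum LDL achieved for a given age in years."""
    return INTERCEPT + SLOPE * age

for age in (45, 55, 65, 75):
    print(age, round(predicted_min_ldl(age), 3))
```

Each extra ten years of age lowers the prediction by 0.08, which is the slope multiplied by ten.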
Assumptions of Regression
1. Relationship is linear
2. Outcome variable and hence residuals or error terms are approx. Normally distributed
Use Graphs and Scatterplot to obtain the Lowess line of fit
1. Create Scatterplot and then double-click to enter chart editor
2. Choose icon ‘Add fit line at total’
3. Then select type of fit such as Lowess
Linear assumption: Fitted lowess smoothed line
Comparing the Lowess smoothed line (red) with the linear fit (green) gives a good eyeball examination of the linear assumption
Definition of a residual
A residual is the difference between the actual value and the predicted value (fitted line), i.e. the unexplained variation
r_i = y_i - E(y_i)
Or
r_i = y_i - (a + b x_i)
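A residual under this definition can be computed as in the following Python sketch. It uses the rounded coefficients from the fitted LDL model. The patient is hypothetical and the function name `residual` is an assumption for illustration.

```python
a, b = 2.024, -0.008   # intercept and slope from the fitted model (rounded)

def residual(x, y):
    """Observed value minus fitted value: r = y - (a + b*x)."""
    return y - (a + b * x)

# Hypothetical patient: aged 65 with an observed minimum LDL of 2.0
r = residual(65, 2.0)
print(round(r, 3))
```

A positive residual means the observed value lies above the fitted line; a negative one means it lies below.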
Residuals
To assess the residuals in SPSS linear regression, select Plots…
• Normalised or standardised predicted value of LDL
• Normalised residual
• Select histogram of residuals and normal probability plot
In SPSS linear regression, select Statistics…
• Select confidence intervals for regression coefficients
• Model fit
• Select Durbin-Watson for serial correlation and casewise diagnostics for identification of outliers
Output: Scatterplot of residuals vs. predicted
Note
1) Mean of residuals = 0
2) Most of data lie within + or - 3 SDs of mean
Assumptions of Regression
1. Relationship is linear
2. Outcome variable and hence residuals or error terms are approx. Normally distributed
Output: Histogram of standardised residuals
Plot of residuals with normal curve superimposed
Output: Cumulative probability plot
Look for deviation from diagonal line to indicate non-normality
Output: Description of residuals
Descriptive statistics for residuals
Residuals Statistics (a)
                       Minimum    Maximum     Mean       Std. Deviation   N
Predicted Value        1.314867   1.843205    1.556478   .0878548         1383
Residual               -1.65389   4.0658469   .0000000   .7181448         1383
Std. Predicted Value   -2.750     3.264       .000       1.000            1383
Std. Residual          -2.302     5.660       .000       1.000            1383
a. Dependent Variable: Min LDL achieved
Subjects with standardised residuals > 3
Worth investigation?
Casewise Diagnostics (a)
Case Number   Std. Residual   Min LDL achieved   Predicted Value   Residual
164           5.660           5.5840             1.518153          4.0658471
209           4.395           4.5260             1.368685          3.1573148
250           3.143           3.7875             1.529325          2.2581750
268           3.064           3.8730             1.671664          2.2013357
274           3.227           4.0953             1.777153          2.3180975
362           4.095           4.5350             1.593460          2.9415398
517           3.636           4.3240             1.711788          2.6122125
849           3.968           4.3290             1.478113          2.8508873
1047          4.207           4.4360             1.413686          3.0223141
1075          3.885           4.4040             1.613219          2.7907805
1103          3.519           3.9905             1.462584          2.5279157
1229          3.016           3.7660             1.599254          2.1667456
1290          3.975           4.2345             1.379107          2.8553933
a. Dependent Variable: Min LDL achieved
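The standardised residuals in the casewise output can be checked by hand. The Python sketch below assumes each one is the raw residual divided by the "Std. Error of the Estimate" reported in the Model Summary (.7184048). It uses three of the listed cases. The function and variable names are illustrative.

```python
SE_ESTIMATE = 0.7184048  # "Std. Error of the Estimate" from the Model Summary

# (case number, observed min LDL, predicted value) for three of the listed cases
cases = [
    (164, 5.5840, 1.518153),
    (209, 4.5260, 1.368685),
    (1229, 3.7660, 1.599254),
]

def standardised_residual(observed, predicted):
    """Raw residual divided by the standard error of the estimate."""
    return (observed - predicted) / SE_ESTIMATE

# Keep the cases lying more than 3 SDs from the fitted line
flagged = [
    (case, round(standardised_residual(obs, pred), 3))
    for case, obs, pred in cases
    if abs(standardised_residual(obs, pred)) > 3
]
print(flagged)
```

All three cases exceed the threshold of 3, in line with their appearance in the casewise table. Whether such subjects are data errors or genuine extreme values is a question for investigation, not something the calculation can answer.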
Output: Model fit and serial correlation
Model Summary
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .121(a)   .015       .014                .7184048                     2.034
a. Predictors: (Constant), Age at baseline
R – correlation between min LDL achieved and Age at baseline, here 0.121 in magnitude (the direction is negative, as the slope shows)
R² – % variation explained, here 1.5%, not particularly high
Durbin-Watson test – the statistic should be approximately 2 if there is no serial correlation of residuals; here 2.034
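Both summary quantities are simple to compute. The Python sketch below squares R to recover R Square, and evaluates the Durbin-Watson statistic on a short list of made-up residuals. The real 1383 residuals are not reproduced in these slides, and the function name `durbin_watson` is an illustrative assumption.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    num = sum(
        (residuals[i] - residuals[i - 1]) ** 2
        for i in range(1, len(residuals))
    )
    den = sum(r ** 2 for r in residuals)
    return num / den

r = 0.121
r_squared = r ** 2   # proportion of variation explained

# Made-up residuals that alternate in sign (negative serial correlation)
example = [0.5, -0.3, 0.2, -0.4, 0.1, -0.1]
dw = durbin_watson(example)
print(round(r_squared, 3), round(dw, 2))
```

The statistic lies between 0 and 4. Values well below 2 point to positive serial correlation and values well above 2 to negative serial correlation, so the alternating example lands above 2, while the reported 2.034 is close to the no-correlation value.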
Summary
After fitting any regression model check assumptions:
• Functional form – linearity is default, often not best fit, consider quadratic…
• Check residuals for approx. normality
• Check residuals for outliers (> 3 SDs)
• All accomplished within SPSS
Practical on Model Checking
Read in ‘LDL Data.sav’
1) Fit age squared term in min LDL model and check fit of model compared to linear fit (Hint: Use transform/compute to create age squared term and fit age and age²)
2) Fit separate linear regressions with min Chol achieved with predictors of 1) baseline Chol 2) APOE_lin 3) adherence
Check assumptions and interpret results
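For readers who want to see the idea behind part 1 outside SPSS, here is a rough Python sketch comparing a linear fit with a fit that adds a squared term. Everything in it is an assumption for illustration: the data are made up (not ‘LDL Data.sav’), the helper names `solve` and `fit_polynomial` are hypothetical, and age is centred at 60 purely to keep the arithmetic well behaved. Centring does not change how well either model fits.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    coef = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(aug[i][j] * coef[j] for j in range(i + 1, n))
        coef[i] = (aug[i][n] - s) / aug[i][i]
    return coef

def fit_polynomial(x, y, degree):
    """Least-squares polynomial fit; returns (coefficients, residual sum of squares)."""
    cols = [[xi ** p for xi in x] for p in range(degree + 1)]
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(u * v for u, v in zip(ci, y)) for ci in cols]
    coef = solve(xtx, xty)
    fitted = [sum(c * xi ** p for p, c in enumerate(coef)) for xi in x]
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return coef, rss

# Made-up, deliberately curved data
age = [40, 50, 60, 70, 80]
ldl = [1.9, 1.6, 1.5, 1.6, 1.9]
age_centred = [a - 60 for a in age]

_, rss_linear = fit_polynomial(age_centred, ldl, 1)
_, rss_quadratic = fit_polynomial(age_centred, ldl, 2)
print(rss_linear > rss_quadratic)
```

A smaller residual sum of squares for the quadratic model indicates a closer fit, although adding a term can never increase it, so the comparison in SPSS should also look at whether the squared term is statistically significant.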