The SPSS Sample Problem
To demonstrate these concepts, we will work the sample problem for logistic regression in SPSS
Professional Statistics 7.5, pages 37 - 64. The description of the problem can be found on page 39.
The data for this problem is: Prostate.Sav.
Stage One: Define the Research Problem
In this stage, the following issues are addressed:
Relationship to be analyzed
Specifying the dependent and independent variables
Method for including independent variables
Relationship to be analyzed
The goal of this analysis is to determine the relationship between the dependent variable NODALINV
(whether or not the cancer has spread to the lymph nodes), and the independent variables of AGE (age
of the subject), ACID (a laboratory test value that is elevated when the tumor has spread to certain
areas), STAGE (whether or not the disease has reached an advanced stage), GRADE (aggressiveness of
the tumor), and XRAY (positive or negative xray result).
Specifying the dependent and independent variables
The dependent variable is NODALINV 'Cancer spread to lymph nodes', a dichotomous variable.
The independent variables are:
AGE 'Age of the subject'
ACID 'Laboratory test score'
XRAY 'Positive X-ray result'
STAGE 'Disease reached advanced stage'
GRADE 'Aggressive tumor'
Method for including independent variables
Since we are interested in the relationship between the dependent variable and all of the independent
variables, we will use direct entry of the independent variables.
Stage 2: Develop the Analysis Plan: Sample Size Issues
In this stage, the following issues are addressed:
Missing data analysis
Minimum sample size requirement: 15-20 cases per independent variable
Missing data analysis
There is no missing data in this problem.
Minimum sample size requirement: 15-20 cases per independent variable
The data set has 53 cases and 5 independent variables for a ratio of 10 to 1, short of the requirement that
we have 15-20 cases per independent variable. We should look for opportunities to validate our findings
against other samples before generalizing our results.
Stage 2: Develop the Analysis Plan: Measurement Issues:
In this stage, the following issues are addressed:
Incorporating nonmetric data with dummy variables
Representing Curvilinear Effects with Polynomials
Representing Interaction or Moderator Effects
Incorporating Nonmetric Data with Dummy Variables
All of the nonmetric variables have recoded into dichotomous dummy-coded variables.
Representing Curvilinear Effects with Polynomials
We do not have any evidence of curvilinear effects at this point in the analysis.
Representing Interaction or Moderator Effects
We do not have any evidence at this point in the analysis that we should add interaction or moderator
variables.
Stage 3: Evaluate Underlying Assumptions
In this stage, the following issues are addressed:
Nonmetric dependent variable with two groups
Metric or dummy-coded independent variables
Nonmetric dependent variable having two groups
The dependent variable NODALINV 'Cancer spread to lymph nodes' is a dichotomous variable.
Metric or dummy-coded independent variables
AGE 'Age of the subject' and ACID 'Laboratory test score' are metric variables.
XRAY 'Positive X-ray result', STAGE 'Disease reached advanced stage', and GRADE 'Aggressive
tumor' are nonmetric dichotomous variables.
Stage 4: Estimation of Logistic Regression and Assessing Overall Fit: Model Estimation
In this stage, the following issues are addressed:
Compute logistic regression model
Compute the logistic regression
The steps to obtain a logistic regression analysis are detailed on the following screens.
Requesting a Logistic Regression
First, choose the
Regression > Binary Logistic...' command
from the 'Analyze' menu.
Specifying the Dependent Variable
First, move the dependent variableNODALINV 'Cancer spread to lymph nodes'to the 'Dependent:' text box.
Specifying the Independent Variables
First, move the independent variables to theCovariates:' list box. AGE 'Age of the subject' ACID 'Laboratory test score' XRAY 'Postive Xray result' STAGE 'Disease reached advanced stage' GRADE 'Aggressive tumor'
Specify the method for entering variables
First, accept the 'Enter'default entry method fromthe 'Method:' popup menu.
Specifying Options to Include in the Output
First, click on the 'Options...' button to
open the Logistic Regression: Options
dialog box,
Second, in
the 'Statistics
and Plots' panel, mark
the 'Classificatio
n plots' checkbox,
the 'Hosmer-
Lemeshow
goodness-of-fit' check
box, and the
'Casewise listing of
residuals' option
button with the 'Outliers
outside 2 std. dev.
option. Third, mark the option button
'At each step' on the 'Display'
panel to display output every time a variable is added.
Fifth, accept the other defaults and
click on the Continue button.
Fourth, check the “Iteration history”
to obtain the starting model.
Specifying the New Variables to Save
First, click on the'Save...' button to openthe 'Logistic Regression:Save New Variables'dialog box.
Second, mark the'Cook's' checkbox onthe 'Influence' panel.
Third, click on the'Continue' button.
Complete the Logistic Regression Request
First, click on the OK button to completethe logistic regression request.
Stage 4: Estimation of Logistic Regression and Assessing Overall Fit: Assessing Model Fit
In this stage, the following issues are addressed:
Significance test of the model log likelihood (Change in -2LL)
Measures Analogous to R2:
Cox and Snell R2 and Nagelkerke R
2
Hosmer-Lemeshow Goodness-of-fit
Classification matrices as a measure of model accuracy
Check for Numerical Problems
Presence of outliers
Initial statistics before independent variables are included
The Initial Log Likelihood Function, (-2 Log Likelihood or -2LL) is a statistical measure like total sums
of squares in regression. If our independent variables have a relationship to the dependent variable, we
will improve our ability to predict the dependent variable accurately, and the log likelihood value will
decrease. The initial –2LL value is 70.252 on step 0, before any variables have been added to the
model.
Significance test of the model log likelihood
The difference between these two measures is the model child-square value (22.126 = 70.252 - 48.126)
that is tested for statistical significance. This test is analogous to the F-test for R2 or change in R
2 value
in multiple regression which tests whether or not the improvement in the model associated with the
additional variables is statistically significant.
In this problem the model Chi-Square value of 22.126 has a significance of 0.000, less than 0.05, so we
conclude that there is a significant relationship between the dependent variable and the set of
independent variables.
Measures Analogous to R2
The next SPSS outputs indicate the strength of the relationship between the dependent variable and the
independent variables, analogous to the R2 measures in multiple regression.
The Cox and Snell R
2 measure operates like R
2, with higher values indicating greater model fit.
However, this measure is limited in that it cannot reach the maximum value of 1, so Nagelkerke
proposed a modification that had the range from 0 to 1. We will rely upon Nagelkerke's measure as
indicating the strength of the relationship.
If we applied our interpretive criteria to the Nagelkerke R2 of 0.465, we would characterize the
relationship as strong.
Correspondence of Actual and Predicted Values of the Dependent Variable
The final measure of model fit is the Hosmer and Lemeshow goodness-of-fit statistic, which measures
the correspondence between the actual and predicted values of the dependent variable. In this case,
better model fit is indicated by a smaller difference in the observed and predicted classification. A good
model fit is indicated by a nonsignificant chi-square value.
The goodness-of-fit measure has a value of 5.954 which has the desirable outcome of nonsignificance.
The Classification Matrices as a Measure of Model Accuracy
The classification matrices in logistic regression serve the same function as the classification matrices in
discriminant analysis, i.e. evaluating the accuracy of the model.
If the predicted and actual group memberships are the same, i.e. 1 and 1 or 0 and 0, then the prediction is
accurate for that case. If predicted group membership and actual group membership are different, the
model "misses" for that case. The overall percentage of accurate predictions (77.4% in this case) is the
measure of a model that I rely on most heavily for this analysis as well as for discriminant analysis
because it has a meaning that is readily communicated, i.e. the percentage of cases for which our model
predicts accurately.
To evaluate the accuracy of the model, we compute the proportional by chance accuracy rate and the
maximum by chance accuracy rates, if appropriate.
The proportional by chance accuracy rate is equal to 0.530 (0.623^2 + 0.377^2). A 25% increase over
the proportional by chance accuracy rate would equal 0.663. Our model accuracy race of 77.4% meets
this criterion.
Since one of our groups contains 62.3% of the cases, we might also apply the maximum by chance
criterion. A 25% increase over the largest groups would equal 0.778. Our model accuracy race of
77.4% almost meets this criterion.
SPSS provides a visual image of the classification accuracy in the stacked histogram as shown below.
To the extent to which the cases in one group cluster on the left and the other group clusters on the right,
the predictive accuracy of the model will be higher.
Check for Numerical Problems
There are several numerical problems that can occur in logistic regression that are not detected by SPSS
or other statistical packages: multicollinearity among the independent variables, zero cells for a dummy-
coded independent variable because all of the subjects have the same value for the variable, and
"complete separation" whereby the two groups in the dependent event variable can be perfectly
separated by scores on one of the independent variables.
All of these problems produce large standard errors (over 2) for the variables included in the analysis
and very often produce very large B coefficients as well. If we encounter large standard errors for the
predictor variables, we should examine frequency tables, one-way ANOVAs, and correlations for the
variables involved to try to identify the source of the problem.
The standard errors and B coefficients are not excessively large, so there is no evidence of a numeric
problem with this analysis.
Presence of outliers
There are two outputs to alert us to outliers that we might consider excluding from the analysis: listing
of residuals and saving Cook's distance scores to the data set.
SPSS provides a casewise list of residuals that identify cases whose residual is above or below a certain
number of standard deviation units. Like multiple regression there are a variety of ways to compute the
residual. In logistic regression, the residual is the difference between the observed probability of the
dependent variable event and the predicted probability based on the model. The standardized residual
is the residual divided by an estimate of its standard deviation. The deviance is calculated by taking the
square root of -2 x the log of the predicted probability for the observed group and attaching a negative
sign if the event did not occur for that case. Large values for deviance indicate that the model does not
fit the case well. The studentized residual for a case is the change in the model deviance if the case is
excluded. Discrepancies between the deviance and the studentized residual may identify unusual cases.
(See the SPSS chapter on Logistic Regression Analysis for additional details, pages 57-61).
In the output for our problem, SPSS listed three cases that have may be considered outliers with a
studentized residuals greater than 2:
SPSS has an option to compute Cook's distance as a measure of influential cases and add the score to the
data editor. I am not aware of a precise formula for determining what cutoff value should be used, so we
will rely on the more traditional method for interpreting Cook's distance which is to identify cases that
either have a score of 1.0 or higher, or cases which have a Cook's distance substantially different from
the other. The prescribed method for detecting unusually large Cook's distance scores is to create a
scatterplot of Cook's distance scores versus case id.
Request the Scatterplot
First, select the
'Scatter...' command from the Graphs
menu.
Second, highlight the
thumbnail sketch of the 'Simple' scatterplot.
Third, click on the Define button to
specify the variables for the scatterplot.
Specifying the Variables for the Scatterplot
First, move the variableCOO_1 to the text boxfor the 'Y Axis:' variable.
Second, move thecase variable to the 'XAxis' text box.
Third, click onthe OK buttonto completethe request
The Scatterplot of Cook's Distances
On the plot of Cook's distances, we see a case that exceeds the 1.0 rule of thumb for influential cases.
Scanning the data in the data editor, we find that the case with the large Cook's distance is case 24. If we
study case 24 in the data editor, we will find that this case had the highest score for the acid variable, but
no nodal involvement. Comparing this case to the two cases with the next highest acid score, case 25
with a score of 136 and case 53 with a score of 126, we see that both of these cases had nodal
involvement, suggesting that a high acid score is associated with nodal involvement. We can consider
case 24 as a candidate for exclusion from the analysis.
Stage 5: Interpret the Results
In this section, we address the following issues:
In this section, we address the following issues:
Identifying the statistically significant predictor variables
Direction of relationship and contribution to dependent variable
Identifying the statistically significant predictor variables
The coefficients are found in the column labeled B, and the test that the coefficient is not zero, i.e.
changes the odds of the dependent variable event is tested with the Wald statistic, instead of the t-test as
was done for the individual B coefficients in the multiple regression equation.
Similar to the output for a regression equation, we examine the probabilities of the test statistic in the
column labeled "Sig," where we identity that the variable STAGE 'Disease reached advanced stage' and
the variable XRAY 'Positive X-ray result' have a statistically significant relationship with the
dependent variable.
Direction of relationship and contribution to dependent variable
The signs of both of the statistically significant independent variables are positive, indicating a direct
relationship with the dependent variable. Our interpretation of these variables is that positive (yes or 1)
values to both questions XRAY 'Positive Xray result' and STAGE 'Disease reached advanced stage'
are associated with the positive (yes or 1) category of the dependent variable NODALINV 'Cancer
spread to lymph nodes'.
Interpretation of the independent variables is aided by the "Exp (B)" column which contains the odds
ratio for each independent variable. Thus, we would say that persons with a value of 1 STAGE 'Disease
reached advanced stage' are 4.77 times as likely to have a score of 1 on the dependent variable
NODALINV 'Cancer spread to lymph nodes'. Similarly, persons whose score is 1 on the independent
variable XRAY 'Positive X-ray result' have a 7.73 greater likelihood of having lymph node
involvement.
Stage 6: Validate The Model
When we have a small sample in the full data set as we do in this problem, a split half validation
analysis is almost guaranteed to fail because we will have little power to detect statistical differences in
analyses of the validation samples. In this circumstance, our alternative is to conduct validation
analyses with random samples that comprise the majority of the sample. We will demonstrate this
procedure in the following steps:
Computing the First Validation Analysis
Computing the Second Validation Analysis
The Output for the Validation Analysis
Computing the First Validation Analysis
We set the random number seed and modify our selection variable so that is selects about 75-80% of the
sample.
Set the Starting Point for Random Number Generation
First, select the
'Random Number Seed...' command from
the 'Transform' menu.
Second, click on the 'Set
seed to:' option to access the
text box for the seed number.
Third, accept
the default
'2000000' in the 'Set seed to:'
text box.
Fourth, click on the OK button to complete
this action.
Compute the Variable to Select a Large Proportion of the Data Set
First, select the 'Compute...' command from the Transform menu.
Second, create a new
variable named 'split1’ that
has the values 1 and 0 to divide the sample into two
parts. Type the name 'split1' into the 'Target
Variable:' text box.
Third, type the formula
'uniform(1) > 0.15' in the
'Numeric Expression:' text box. The uniform function will
generate a random number between 0.0 and 1.0 for each
case. If the generated random number is greater
than 0.15, the numeric expression will result in a 1,
since the numeric expression is true. We will include cases
with a split value of 1 in the validation analysis.
Fourth, we click on the
OK button to compute the split variable.
Specify the Cases to Include in the First Validation Analysis
First, select 'Logistic Regression' from the 'Dialog
Recall' drop down menu.
Specify the Value of the Selection Variable for the First Validation Analysis
First, click on the'Select>>" button toexpose the 'SelectionVariable:' text box.
Second, highlightthe 'split1' variableand click on themove button to putit into the'Selection Variable:'text box.
Third, after 'split1=?' appears in the 'SelectionVariable:' text box, click on the Value..' buttonto specify which cases to include in thescreening sample.
Fourth, type a '1' inthe 'Value forSelection Variable:'text box.
Fifth, click on the'Continue' button tocomplete setting thevalue.
Computing the Second Validation Analysis
We reset the random number seed to another value and modify our selection variable so that is selects
about 75-80% of the sample.
Set the Starting Point for Random Number Generation
First, select the
'Random Number Seed...' command from
the 'Transform' menu.
Second, click on the 'Set
seed to:' option to access the text box for
the seed number.
Third, type
'2000001' in the 'Set seed to:'
text box.
Fourth, click on the OK button to complete
this action.
Compute the Variable to Select a Large Proportion of the Data Set
First, select the 'Compute...' command
from the Transform menu.
Second, create a new variable named 'split2' that
has the values 1 and 0 to divide the sample into two
parts. Type the name 'split2' into the 'Target
Variable:' text box.
Third, type the formula
'uniform(1) > 0.15' in the 'Numeric
Expression:' text box. The uniform function will
generate a random number between 0.0 and
1.0 for each case. If the generated random
number is greater than 0.15, the numeric
expression will result in a 1, since the numeric
expression is true. We will include cases with a
split value of 1 in the validation analysis.
Fourth, we click on the
OK button to compute the split variable.
Specify the Cases to Include in the Second Validation Analysis
First, select 'Logistic
Regression' from the 'Dialog Recall' drop down menu.
Specify the Value of the Selection Variable for the Second Validation Analysis
First,
highlight the 'split1=1'
text in the 'Selection
Variable:' text box
and click on the move
button to return the
variable to the list of
variables.
Second, highlight
the 'split2' variable and click on the
move button to put it into the
'Selection Variable:' text box.
Third, after 'split2=?' appears in the 'Selection
Variable:' text box, click on the Value..' button to specify which cases to include in the screening
sample.
Fourth, type a '1' in
the 'Value for Selection Variable:'
text box.
Fifth, click on the 'Continue'
button to complete
setting the value.
Generalizability of the Logistic Regression Model
We can summarize the results of the validation analyses in the following table.
Full Model Split1 = 1 Split2 = 1
Model Chi-Square 22.126, p=.0005 16.275, p=.0061 16.609, p=.0053
Nagelkerke R2 .465 .452 .424
Accuracy Rate for
Learning Sample 77.36% 75.00% 75.56%
Accuracy Rate for
Validation Sample 76.92% 87.50%
Significant
Coefficients
(p < 0.05)
STAGE 'Disease
reached advanced
stage' and the
variable
XRAY 'Positive
Xray result'
STAGE 'Disease
reached advanced
stage' and the
variable (0.0531)
XRAY 'Positive
Xray result'
STAGE 'Disease
reached advanced
stage' and the
variable
XRAY 'Positive
Xray result'
As we can see in the table, the results for each analysis are approximately the same, except that the
variable STAGE 'Disease reached advanced stage' was not quite significant in the first validation
analysis.
Based on the validation analyses, I would conclude that our results are generalizable.
Top Related