MODEL FOR PREDICTING PRISONER
RECIDIVISM
HIRA NADEEM
INTRODUCTION
The data come from Folsom Prison and are used to build a model for predicting prisoner recidivism. Six predictor variables are used to predict recidivism. The response variable is recidivism status, a binary variable: 1 denotes a recidivist, that is, a person who repeats an undesirable behavior after having experienced negative consequences of that behavior or having been treated or trained to extinguish it, and 0 denotes a non-recidivist, a person who does not repeat a crime after receiving the punishment for it.
Age at release is the first predictor variable, describing the prisoner's age at the time of release (after serving the sentence). I want to see whether this variable is related to a person's recidivism status; a younger person might show higher recidivism than an older one. The second predictor variable is Gender, coded as a binary variable: 0 for female and 1 for male. It is included to see whether recidivist behavior is influenced by gender. The third predictor variable is Job status after release, also binary (0 for no job after release, 1 for being hired or employed), and it appears to be an important factor. Many studies and researchers have concluded that one of the main reasons for recidivism is the lack of job opportunities for people with a criminal record. Most companies do not want to hire people with a criminal background, so such people are prone to meet their needs by repeating the same crime or committing a different one. To examine this hypothesis, I have included job status to see how it relates to recidivism status. The fourth predictor variable is the number of thefts/offenses in the criminal history, included to capture how frequently recidivists commit crimes. The fifth variable is the supervision term after release: the number of weeks the prisoner is supervised by the police after being released from prison. There may be a relationship between the supervision term and recidivism; a person supervised for longer may engage in fewer unlawful activities during that time, so a longer supervision term might reduce the number of crimes committed and thereby affect recidivism status. The last independent variable is the number of prior prison commitments, representing the number of times the prisoner has been arrested before. All of these variables plausibly affect a person's recidivism status.
STATISTICAL ANALYSIS
We now look at the descriptive statistics of the data:
Descriptive Statistics

Variable                                           N    Minimum  Maximum   Mean   Std. Deviation
Age of release                                     50      13       60     35.08      12.900
Gender (0 = female, 1 = male)                      50       0        1       .46        .503
Job status after release (0 = No, 1 = Yes)         50       0        1       .50        .505
Number of thefts/offenses in criminal history      50       1        7      2.24       1.847
Supervision term after release (weeks)             50       1       30     13.98       7.438
Number of prior prison commitments                 50       0        5      1.12       1.380
Recidivism (1 = recidivist, 0 = non-recidivist)    50       0        1       .40        .495
Valid N (listwise)                                 50
The above table presents the descriptive statistics of the data at hand. The first column gives the number of observations for each variable. We can see that the data do not suffer from a missing-data problem: there are 50 observations for each predictor and for the response variable. The next two columns show the minimum and maximum value of each variable, and the mean and standard deviation are given in the final two columns.
Now, we will look at the correlations table for these variables to see whether the correlation between the dependent variable and each independent variable is significant, as well as the direction of the correlation (that is, whether it is positively or negatively correlated with the response variable). The table below shows the correlations output:
Correlations (Pearson r; N = 50 for every pair)

                  Recid     Age     Gender    Job    Offenses  Superv    Prior
Recidivism        1       -.702**   .311*   -.572**   .831**  -.580**   .675**
Age of release   -.702**   1       -.405**   .326*   -.646**   .368**  -.599**
Gender            .311*   -.405**   1       -.040     .208    -.090     .360*
Job status       -.572**   .326*   -.040     1       -.503**   .410**  -.381**
Thefts/offenses   .831**  -.646**   .208    -.503**   1       -.500**   .525**
Supervision      -.580**   .368**  -.090     .410**  -.500**   1       -.356*
Prior commitm.    .675**  -.599**   .360*   -.381**   .525**  -.356*    1

Two-tailed significance (p-values):

                  Recid    Age    Gender    Job    Offenses  Superv   Prior
Recidivism          —     .000     .028    .000     .000      .000     .000
Age of release    .000      —      .004    .021     .000      .009     .000
Gender            .028    .004       —     .782     .147      .534     .010
Job status        .000    .021     .782      —      .000      .003     .006
Thefts/offenses   .000    .000     .147    .000       —       .000     .000
Supervision       .000    .009     .534    .003     .000        —      .011
Prior commitm.    .000    .000     .010    .006     .000      .011       —

**. Correlation is significant at the 0.01 level (2-tailed).
*. Correlation is significant at the 0.05 level (2-tailed).

(Recid = Recidivism, 1 = recidivist, 0 = non-recidivist; Gender, 0 = female, 1 = male; Job status after release, 0 = No, 1 = Yes; Offenses = number of thefts/offenses in criminal history; Superv = supervision term after release in weeks; Prior = number of prior prison commitments.)
The first row of the correlation matrix gives the correlation of the response variable with each predictor, along with its significance and the number of observations. The Pearson correlation measures the strength of the linear association between two variables; its value lies between -1.00 (perfect negative correlation) and 1.00 (perfect positive correlation), with 0.00 indicating no correlation. The correlation between Recidivism and Age at release is -.702, which implies a negative relationship: as age increases, the chance of being a recidivist decreases. Since Gender and Job Status are dichotomous variables with only two values rather than continuous ones, the Pearson correlation should be interpreted cautiously for them. The correlation between the number of thefts/offenses in the criminal history and recidivism is positive and quite high at .831, implying that a person with a high number of thefts/offenses is more likely to be a recidivist. The supervision term after release is negatively correlated with recidivism status, so as the number of weeks of supervision increases, the person is less likely to be a recidivist. Lastly, the number of prior prison commitments and recidivism are positively correlated at .675, indicating higher recidivism for people with more prior prison commitments.
The significance rows give the 2-tailed p-values used to judge whether each correlation is significant. Taking the significance level as 0.05, a p-value below 0.05 indicates a statistically significant correlation between two variables, meaning that increases or decreases in one variable relate systematically to increases or decreases in the other; a p-value above 0.05 indicates no statistically significant correlation. In our case, every predictor's correlation with recidivism is significant at the .05 level.
The N entries simply record the number of observations for each pair of variables; every variable has 50 observations in the data set.
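The Pearson correlations and the test statistic behind the 2-tailed p-values can be reproduced directly from the raw columns. A minimal sketch (pure Python; the t statistic with n − 2 degrees of freedom is what SPSS converts into the p-value):

```python
import math

def pearson_r(x, y):
    # Pearson correlation: covariance scaled by the two standard deviations.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_stat(r, n):
    # t = r * sqrt(n - 2) / sqrt(1 - r^2), with df = n - 2.
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)
```

With n = 50, any |r| whose t statistic exceeds the two-tailed critical value (about 2.01 at α = .05, df = 48) is significant, which is why even the modest Gender correlation of .311 attains significance here.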
After looking at the basic statistics describing the data set, we undertake the task of predicting the nominal variable using discriminant analysis. Discriminant function analysis undertakes the same task as multiple linear regression by predicting an outcome, but its response variable must be nominal, with at least two groups or categories and each case belonging to only one group, so that the groups are mutually exclusive and collectively exhaustive. So, we run the following discriminant analysis model in SPSS, with Recidivism as the dependent variable and Age, Gender, Job, Offences, Supervision and Prior as independent variables:
Recidivism = 𝜷𝟏Age + 𝜷𝟐Gender + 𝜷𝟑Job + 𝜷𝟒 Offences + 𝜷𝟓Supervision + 𝜷𝟔Prior
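Conceptually, the discriminant function above is a weighted sum of the predictors whose weights maximize the separation between the two groups. A minimal sketch of how such weights arise for two groups and two hypothetical predictors (not the Folsom data), using a hand-rolled pooled-covariance Fisher discriminant:

```python
import math

def mean_vec(rows):
    # column means of a list of 2-tuples
    n = len(rows)
    return [sum(r[i] for r in rows) / n for i in range(len(rows[0]))]

def sscp(rows, m):
    # 2x2 sums of squares and cross-products about the group mean
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - m[0], r[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def fisher_weights(g0, g1):
    # w = C^{-1} (m1 - m0), where C is the pooled within-group covariance
    m0, m1 = mean_vec(g0), mean_vec(g1)
    s0, s1 = sscp(g0, m0), sscp(g1, m1)
    df = len(g0) + len(g1) - 2
    c = [[(s0[i][j] + s1[i][j]) / df for j in range(2)] for i in range(2)]
    det = c[0][0] * c[1][1] - c[0][1] * c[1][0]
    inv = [[c[1][1] / det, -c[0][1] / det], [-c[1][0] / det, c[0][0] / det]]
    d = [m1[0] - m0[0], m1[1] - m0[1]]
    return [inv[0][0] * d[0] + inv[0][1] * d[1],
            inv[1][0] * d[0] + inv[1][1] * d[1]]
```

A case's discriminant score is the dot product of its predictor values with these weights; SPSS additionally standardizes and centers the function, but the direction of separation is the same.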
We receive the following output by running the discriminant analysis on the data of Folsom Prison:
Analysis Case Processing Summary

Unweighted Cases                                                   N    Percent
Valid                                                             50     100.0
Excluded: missing or out-of-range group codes                      0        .0
Excluded: at least one missing discriminating variable             0        .0
Excluded: both missing/out-of-range codes and a missing variable   0        .0
Excluded: total                                                    0        .0
Total                                                             50     100.0
The above table shows that there is no missing data problem, so we will use the entire data of 50
observations.
Group Statistics

Recidivism = 0 (non-recidivist)                     Mean   Std. Deviation   Valid N (unweighted / weighted)
Age of release                                     42.40       10.559             30 / 30.000
Gender (0 = female, 1 = male)                        .33         .479             30 / 30.000
Job status after release (0 = No, 1 = Yes)           .73         .450             30 / 30.000
Number of thefts/offenses in criminal history       1.00         .000             30 / 30.000
Supervision term after release (weeks)             17.47        6.673             30 / 30.000
Number of prior prison commitments                   .37         .669             30 / 30.000

Recidivism = 1 (recidivist)
Age of release                                     24.10        6.889             20 / 20.000
Gender (0 = female, 1 = male)                        .65         .489             20 / 20.000
Job status after release (0 = No, 1 = Yes)           .15         .366             20 / 20.000
Number of thefts/offenses in criminal history       4.10        1.651             20 / 20.000
Supervision term after release (weeks)              8.75        5.169             20 / 20.000
Number of prior prison commitments                  2.25        1.410             20 / 20.000

Total
Age of release                                     35.08       12.900             50 / 50.000
Gender (0 = female, 1 = male)                        .46         .503             50 / 50.000
Job status after release (0 = No, 1 = Yes)           .50         .505             50 / 50.000
Number of thefts/offenses in criminal history       2.24        1.847             50 / 50.000
Supervision term after release (weeks)             13.98        7.438             50 / 50.000
Number of prior prison commitments                  1.12        1.380             50 / 50.000
The group statistics table shows the mean and standard deviation of each predictor variable within the two groups of the response variable. The rows under the group labeled 0 describe the non-recidivists: for example, the mean age of the non-recidivists is 42.40 with a standard deviation of 10.559, and each of these variables has 30 observations. Similarly, the rows under the group labeled 1 describe the recidivists: their mean age is 24.10 with a standard deviation of 6.889, and there are 20 observations in each predictor variable for this group. The last block of the table pools both groups: the mean age across all observations is 35.08 with a standard deviation of 12.9, and the total number of observations is the sum of the two groups, 50. This last portion agrees with the descriptive statistics table at the start of the paper.
The Tests of Equality of Group Means table below is generated by the selected univariate ANOVAs. It indicates whether there is a statistically significant difference between the group means of the dependent variable for each independent variable. Judging by the significance level, which represents the p-value, all of the independent variables are significant at α = 0.05. Wilks' Lambda is the statistical criterion used here to add or remove variables from the analysis; several other criteria are also available.
Box's Test of Equality of Covariance Matrices

Log Determinants

Recidivism (1 = recidivist, 0 = non-recidivist)   Rank   Log Determinant
0                                                   5         .(a)
1                                                   6        4.295
Pooled within-groups                                6        4.722

The ranks and natural logarithms of determinants printed are those of the group covariance matrices.
a. Singular
Tests of Equality of Group Means

Variable                                        Wilks' Lambda      F      df1   df2   Sig.
Age of release                                      .507        46.650    1    48    .000
Gender (0 = female, 1 = male)                       .903         5.149    1    48    .028
Job status after release (0 = No, 1 = Yes)          .673        23.287    1    48    .000
Number of thefts/offenses in criminal history       .310       106.860    1    48    .000
Supervision term after release (weeks)              .664        24.324    1    48    .000
Number of prior prison commitments                  .544        40.283    1    48    .000
Test Results(a)
Tests the null hypothesis of equal population covariance matrices.
a. No test can be performed with fewer than two nonsingular group covariance matrices.
Summary of Canonical Discriminant Functions
Eigenvalues

Function   Eigenvalue   % of Variance   Cumulative %   Canonical Correlation
1           4.603(a)        100.0           100.0              .906

a. First 1 canonical discriminant functions were used in the analysis.

Wilks' Lambda

Test of Function(s)   Wilks' Lambda   Chi-square   df   Sig.
1                         .178          77.549      6   .000
An eigenvalue indicates the proportion of variance explained (the between-groups sum of squares divided by the within-groups sum of squares). A large eigenvalue is associated with a strong function. The two tables above give the percentage of variance accounted for by the single discriminant function generated, along with its significance (0.000). Because there are only two groups, only one discriminant function was generated.
Wilks’ Lambda is the ratio of within-groups sums of squares to the total sums of squares. This is the
proportion of the total variance in the discriminant scores not explained by differences among groups. A
lambda of 1.00 occurs when observed group means are equal (all the variance is explained by factors
other than difference between those means), while a small lambda occurs when within-groups
variability is small compared to the total variability. A small lambda indicates that group means appear
to differ. The associated significance value indicates whether the difference is significant. Here, the
lambda of 0.178 is significant (Sig. = .000); thus, the group means appear to differ.
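For a single variable and two groups, Wilks' lambda reduces to the ratio of within-group to total sum of squares described above. A minimal sketch with hypothetical values:

```python
def wilks_lambda(group0, group1):
    # Lambda = within-groups SS / total SS for one variable and two groups.
    allv = group0 + group1
    gm = sum(allv) / len(allv)
    total = sum((v - gm) ** 2 for v in allv)
    within = 0.0
    for g in (group0, group1):
        m = sum(g) / len(g)
        within += sum((v - m) ** 2 for v in g)
    return within / total
```

Equal group means give a lambda of 1.00 (no between-group variance), while perfectly separated groups with no within-group spread give a lambda of 0.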
Standardized Canonical Discriminant Function Coefficients

Variable                                        Function 1
Age of release                                    -.278
Gender (0 = female, 1 = male)                      .145
Job status after release (0 = No, 1 = Yes)        -.310
Number of thefts/offenses in criminal history      .669
Supervision term after release (weeks)            -.328
Number of prior prison commitments                 .411
The ‘Canonical Discriminant Function Coefficients’ indicate the unstandardized scores concerning the
independent variables. It is the list of coefficients of the unstandardized discriminant equation. Each
subject’s discriminant score would be computed by entering his or her variable values (raw data) for
each of the variables in the equation.
Functions at Group Centroids

Recidivism (1 = recidivist, 0 = non-recidivist)   Function 1
0                                                   -1.716
1                                                    2.575

Unstandardized canonical discriminant functions evaluated at group means.
‘Functions at Group Centroids’ indicate the average discriminant score for subjects in the two groups.
More specifically, the discriminant score for each group when the variable means (rather than individual
values for each subject) are entered into the discriminant equation. The two values are quite different, with one being negative and the other positive.
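Each centroid is simply the mean discriminant score of the cases in its group. A minimal sketch with hypothetical scores and group labels:

```python
def centroids(scores, labels):
    # Average discriminant score within each group of the dependent variable.
    grouped = {}
    for s, g in zip(scores, labels):
        grouped.setdefault(g, []).append(s)
    return {g: sum(v) / len(v) for g, v in grouped.items()}
```

A case is then assigned to whichever centroid its own discriminant score lies closest to, which is why well-separated centroids of opposite sign signal a useful function.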
Classification Statistics

Classification Processing Summary

Processed                                                 50
Excluded: missing or out-of-range group codes              0
Excluded: at least one missing discriminating variable     0
Used in Output                                            50
Prior Probabilities for Groups

Recidivism (1 = recidivist, 0 = non-recidivist)   Prior   Cases Used in Analysis (unweighted / weighted)
0                                                  .500          30 / 30.000
1                                                  .500          20 / 20.000
Total                                             1.000          50 / 50.000
This is the distribution of observations into the Recidivism groups used as a starting point in the
analysis. The default prior distribution is an equal allocation into the groups, as seen in this example.
SPSS allows users to specify different priors with the priors subcommand.
Classification Function Coefficients

Variable                                        Group 0    Group 1
Age of release                                    .671       .542
Gender (0 = female, 1 = male)                    3.346      4.636
Job status after release (0 = No, 1 = Yes)       4.957      1.784
Number of thefts/offenses in criminal history    2.389      5.151
Supervision term after release (weeks)            .466       .235
Number of prior prison commitments               1.480      3.196
(Constant)                                     -22.827    -24.056

Fisher's linear discriminant functions
Casewise Statistics

For each case, the table reports: Case Number; Actual Group; the Highest Group (Predicted Group, P(D>d | G=g) with its p and df, P(G=g | D=d), and Squared Mahalanobis Distance to Centroid); the Second Highest Group (Group, P(G=g | D=d), Squared Mahalanobis Distance to Centroid); and the Discriminant Score (Function 1).

Original cases:
1 1 1 .901 1 1.000 .015 0 .000 17.360 2.450
2 1 1 .798 1 1.000 .065 0 .000 16.286 2.319
3 0 0 .610 1 .999 .261 1 .001 14.290 -1.206
4 1 1 .664 1 .999 .189 0 .001 14.872 2.140
5 0 0 .712 1 1.000 .136 1 .000 15.384 -1.348
6 0 0 .836 1 1.000 .043 1 .000 20.235 -1.924
7 1 1 .020 1 1.000 5.383 0 .000 43.708 4.895
8 1 1 .020 1 1.000 5.441 0 .000 43.872 4.907
9 1 1 .155 1 .957 2.026 0 .043 8.223 1.151
10 0 0 .911 1 1.000 .013 1 .000 17.464 -1.604
11 0 0 .301 1 1.000 1.071 1 .000 28.363 -2.751
12 0 0 .531 1 .999 .393 1 .001 13.426 -1.090
13 1 1 .146 1 .951 2.110 0 .049 8.056 1.122
14 0 0 .889 1 1.000 .020 1 .000 17.233 -1.577
15 1 1 .030 1 1.000 4.738 0 .000 41.829 4.751
16 1 1 .887 1 1.000 .020 0 .000 19.654 2.717
17 1 0** .101 1 .897 2.697 1 .103 7.016 -.074
18 0 0 .423 1 .997 .642 1 .003 12.180 -.915
19 0 0 .773 1 1.000 .083 1 .000 20.973 -2.005
20 0 0 .560 1 .999 .339 1 .001 13.756 -1.134
21 0 0 .790 1 1.000 .071 1 .000 16.195 -1.450
22 0 0 .742 1 1.000 .109 1 .000 21.350 -2.046
23 0 0 .888 1 1.000 .020 1 .000 19.636 -1.857
24 1 1 .338 1 .994 .919 0 .006 11.103 1.616
25 1 1 .303 1 1.000 1.062 0 .000 28.316 3.605
26 0 0 .978 1 1.000 .001 1 .000 18.176 -1.689
27 0 0 .686 1 1.000 .164 1 .000 22.049 -2.121
28 0 0 .155 1 .957 2.024 1 .043 8.227 -.294
29 0 0 .317 1 1.000 1.000 1 .000 27.994 -2.716
30 1 1 .010 1 1.000 6.659 0 .000 47.218 5.155
31 0 0 .919 1 1.000 .010 1 .000 17.548 -1.615
32 0 0 .660 1 1.000 .193 1 .000 22.375 -2.156
33 0 0 .803 1 1.000 .062 1 .000 16.339 -1.468
34 1 1 .672 1 1.000 .179 0 .000 22.222 2.998
35 1 1 .999 1 1.000 .000 0 .000 18.421 2.576
36 0 0 .917 1 1.000 .011 1 .000 17.533 -1.613
37 0 0 .808 1 1.000 .059 1 .000 20.562 -1.960
38 0 0 .963 1 1.000 .002 1 .000 18.814 -1.763
39 1 1 .883 1 1.000 .022 0 .000 17.172 2.428
40 1 1 .608 1 .999 .263 0 .001 14.277 2.062
41 0 0 .397 1 1.000 .719 1 .000 26.406 -2.564
42 0 0 .776 1 1.000 .081 1 .000 16.050 -1.432
43 0 0 .503 1 1.000 .449 1 .000 24.615 -2.387
44 1 1 .142 1 .948 2.152 0 .052 7.976 1.108
45 1 1 .657 1 .999 .197 0 .001 14.801 2.131
46 0 0 .978 1 1.000 .001 1 .000 18.649 -1.744
47 1 1 .255 1 .987 1.297 0 .013 9.935 1.436
48 0 0 .660 1 .999 .194 1 .001 14.829 -1.276
49 0 0 .927 1 1.000 .008 1 .000 19.202 -1.807
50 0 0 .791 1 1.000 .070 1 .000 20.760 -1.982
Cross-validated cases(b):
1 1 1 .620 6 1.000 4.422 0 .000 20.888
2 1 1 .164 6 .999 9.168 0 .001 23.789
3 0 0 .638 6 .999 4.288 1 .001 17.476
4 1 1 .006 6 .997 17.999 0 .003 29.395
5 0 0 .577 6 .999 4.744 1 .001 19.119
6 0 0 .828 6 1.000 2.844 1 .000 22.548
7 1 1 .020 6 1.000 15.014 0 .000 59.768
8 1 1 .045 6 1.000 12.875 0 .000 57.061
9 1 1 .198 6 .886 8.585 0 .114 12.681
10 0 0 .451 6 1.000 5.755 1 .000 22.359
11 0 0 .743 6 1.000 3.504 1 .000 31.071
12 0 0 .229 6 .997 8.120 1 .003 19.648
13 1 1 .114 6 .838 10.268 0 .162 13.562
14 0 0 .716 6 1.000 3.710 1 .000 20.240
15 1 1 .345 6 1.000 6.743 0 .000 47.026
16 1 1 .966 6 1.000 1.404 0 .000 20.534
17 1 0** .445 6 .986 5.805 1 .014 14.271
18 0 0 .030 6 .988 13.969 1 .012 22.738
19 0 0 .900 6 1.000 2.207 1 .000 22.660
20 0 0 .631 6 .998 4.342 1 .002 16.901
21 0 0 .405 6 .999 6.165 1 .001 21.297
22 0 0 .967 6 1.000 1.382 1 .000 22.212
23 0 0 .488 6 1.000 5.444 1 .000 24.435
24 1 1 .724 6 .991 3.646 0 .009 13.059
25 1 1 .394 6 1.000 6.269 0 .000 33.914
26 0 0 .284 6 1.000 7.417 1 .000 24.692
27 0 0 .178 6 1.000 8.926 1 .000 30.470
28 0 1** .001 6 .510 23.416 0 .490 23.492
29 0 0 .489 6 1.000 5.435 1 .000 32.882
30 1 1 .011 6 1.000 16.512 0 .000 65.364
31 0 0 .612 6 1.000 4.479 1 .000 21.284
32 0 0 .581 6 1.000 4.711 1 .000 26.561
33 0 0 .671 6 1.000 4.043 1 .000 19.568
34 1 1 .901 6 1.000 2.199 0 .000 23.800
35 1 1 .842 6 1.000 2.729 0 .000 20.501
36 0 0 .521 6 1.000 5.183 1 .000 21.912
37 0 0 .812 6 1.000 2.977 1 .000 23.007
38 0 0 .368 6 1.000 6.520 1 .000 24.566
39 1 1 .022 6 .999 14.777 0 .001 29.771
40 1 1 .853 6 .999 2.639 0 .001 15.984
41 0 0 .206 6 1.000 8.469 1 .000 34.635
42 0 0 .774 6 1.000 3.271 1 .000 18.565
43 0 0 .953 6 1.000 1.588 1 .000 25.510
44 1 1 .003 6 .500 19.891 0 .500 19.891
45 1 1 .030 6 .998 13.945 0 .002 25.933
46 0 0 .767 6 1.000 3.325 1 .000 21.383
47 1 1 .293 6 .969 7.315 0 .031 14.233
48 0 0 .214 6 .999 8.350 1 .001 21.586
49 0 0 .091 6 1.000 10.922 1 .000 29.176
50 0 0 .437 6 1.000 5.880 1 .000 26.053
For the original data, squared Mahalanobis distance is based on canonical functions.
For the cross-validated data, squared Mahalanobis distance is based on observations.
**. Misclassified case
b. Cross validation is done only for those cases in the analysis. In cross validation, each case is classified by the
functions derived from all cases other than that case.
In the Casewise Statistics output, SPSS provides the squared Mahalanobis distance to the centroid of each group defined by the dependent variable. If a case has a large squared Mahalanobis distance even to the centroid of the group it is most likely to belong to, it may be an outlier.
Now, if we look at the P(G=g | D=d) column under the Highest Group and Second Highest Group headings, we see that for each subject there are two numbers that add to 1.00. These are the probabilities of group membership assigned by the discriminant analysis model. Thus, our model suggests that the probability of subject #4 being a recidivist is .999 and the complementary probability of being a non-recidivist is .001, so this subject is predicted with high confidence: about 99%, with only a 1% chance of being wrong.
Looking at the same information for subject #17, we see that a recidivist has been wrongly classified as a non-recidivist with a probability of .897, as opposed to a probability of .103 of being a recidivist. Notably, the model remained fairly confident even in this incorrect prediction.
Similarly, we can examine the cross-validation results. Overall, when the model is correct for a particular subject, its confidence is generally high, as for subject #13. On the other hand, for subject #17 the model's prediction was wrong yet made with comparatively high confidence (a probability of .986). For subject #28, by contrast, a non-recidivist was wrongly classified as a recidivist, but the confidence was not high: the two probabilities are close together, and such a case can be called a 'fence-rider'. We could continue examining individual cases to draw a general conclusion about the pattern of predictions for the recidivist and non-recidivist groups.
Classification Results(a,c)

                                         Predicted Group Membership
Recidivism group                              0        1       Total
Original             Count      0            30        0        30
                                1             1       19        20
                     %          0         100.0       .0     100.0
                                1           5.0     95.0     100.0
Cross-validated(b)   Count      0            29        1        30
                                1             1       19        20
                     %          0          96.7      3.3     100.0
                                1           5.0     95.0     100.0

a. 98.0% of original grouped cases correctly classified.
b. Cross validation is done only for those cases in the analysis. In cross validation, each case is classified by the functions derived from all cases other than that case.
c. 96.0% of cross-validated grouped cases correctly classified.
In the last table, we can see that upon cross-validating, we classified 29 of the 30 non-recidivists correctly, while incorrectly labeling 1 non-recidivist as a recidivist, for an accuracy of approximately 97% and an error of 3%. Similarly, 19 of the 20 recidivists were correctly labeled, whereas 1 recidivist was incorrectly labeled as a non-recidivist, for an accuracy of approximately 95% and an error of around 5%. The overall accuracy of the cross-validated model was thus 96%, implying that a large proportion of recidivists and non-recidivists were correctly predicted.
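The per-group and overall hit rates can be recovered directly from the count matrix. A minimal sketch, using the cross-validated counts from the table:

```python
def classification_summary(matrix):
    # matrix[actual][predicted] holds counts; returns per-group hit rates
    # and the overall hit rate.
    per_group = {}
    hits = total = 0
    for actual, row in matrix.items():
        n = sum(row.values())
        h = row.get(actual, 0)          # diagonal cell = correct classifications
        per_group[actual] = h / n
        hits += h
        total += n
    return per_group, hits / total
```

For the cross-validated table, `{0: {0: 29, 1: 1}, 1: {0: 1, 1: 19}}` yields per-group rates of about .967 and .950 and an overall rate of .960, matching the 96.0% figure in footnote c.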
In this context, the hit rate is the proportion of cases classified into their correct group, and its significance can be tested using the z distribution. Under the proportional chance criterion, the null expectation is that hits are proportional to the number of subjects in each group; under the maximum chance criterion, chance is construed as the accuracy that would be obtained by classifying every subject into the larger group. The calculations for the z tests depend on the number of hits and the number of subjects in each group. The program that tests the significance of the hit rates requires the cross-validated results from the discriminant analysis above, predicting recidivists and non-recidivists. Running the program gives the following output:
GROUP NS
30.00 20.00
HITS
29.00 19.00
PROPORTIONAL CHANCE CRITERION
Separate group chance expectation:
18.00 8.00
Separate group accuracies as %
96.67 95.00
Separate group Z/Prob:
4.10 5.02
.000 .000
Total sample hits = 48.00
Total sample hits as % = 96.00
Total sample chance expectation = 26.00
Total sample Z/Prob = 6.23
.000
% improvement over chance for separate groups
91.67 91.67
% improvement over chance for total sample = 91.67
MAXIMUM CHANCE CRITERION
Total sample Z/Prob = 5.20
.000
% improvement over chance for total sample = 90.00
The first four lines of the output show that, of the 30 non-recidivists, 29 are hits, and of the 20 recidivists, 19 are hits. The proportional chance criterion for assessing model fit is calculated by summing the squared proportion that each group represents of the sample. The separate-group chance expectation is higher for the non-recidivist group (18) than for the recidivist group (8), and the separate-group accuracies are very close: 96.67% for the first group versus 95.00% for the second. The first group has a z-value of 4.10 with a p-value of .000, and the second group a z-value of 5.02 with a p-value of .000. The total sample hits are 48.
Looking at the percentage improvement over chance for the separate groups, we see that it is the same for both groups (91.67%). The maximum chance criterion likewise shows that the model beats chance for the total sample, since the percentage improvement is positive (90.00%).
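The separate-group and total-sample z values in the output can be reproduced from the hits, group sizes, and chance proportions (here 0.6 and 0.4, the groups' shares of the sample, and 0.52 = 26/50 for the total). A minimal sketch:

```python
import math

def chance_z(hits, n, p_chance):
    # z test comparing observed hits against the chance expectation n * p_chance,
    # using the binomial standard deviation sqrt(n * p * (1 - p)).
    expected = n * p_chance
    return (hits - expected) / math.sqrt(n * p_chance * (1 - p_chance))
```

With the cross-validated counts, `chance_z(29, 30, 0.6)` ≈ 4.10, `chance_z(19, 20, 0.4)` ≈ 5.02, and `chance_z(48, 50, 0.52)` ≈ 6.23, reproducing the Z/Prob lines of the output.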
As in multiple linear regression models, we would now like to see the unique contribution of a variable or subset of variables in discriminant analysis: how much a variable or subset of variables adds to the classification accuracy of a model containing the remaining predictors. For this, we use McNemar's test for correlated proportions. McNemar's test is a non-parametric method for nominal variables, applied to 2 × 2 contingency tables with a dichotomous trait and matched pairs of subjects, to determine whether the row and column marginal frequencies are equal. It requires the hit rate for the full and restricted models, along with the joint distribution of hits and misses for each model.
However, to apply the program that assesses the unique contribution of a variable or subset of variables, we have to remove the fourth variable (Offences); otherwise the program does not run. Running the program without the fourth variable gives the following output:
***FULL/RESTRICTED MODEL HIT-RATE TEST***
***JOHN D. MORRIS***
***FLORIDA ATLANTIC UNIVERSTIY***
DATA FILE: FPrison1.dat
FORMAT = (6f8.0)
NUMBER OF FULL MODEL PREDICTOR VARIABLES = 5
RESTRICTED MODEL PREDICTOR VARIABLES ARE:
2 3
GROUP INDICES = 0 1
PRIOR PROBABILITIES USED: 0.500 0.500
N = 50
MEANS FOR GROUP 0 (N = 30)
1) 42.4000 2) 0.3333 3) 0.7333 4) 17.4667
5) 0.3667
MEANS FOR GROUP 1 (N = 20)
1) 24.1000 2) 0.6500 3) 0.1500 4) 8.7500
5) 2.2500
CHI-SQUARED FROM BOX-M (DF= 15.00) = 2.6884
P = 0.0007
CALIBRATION SAMPLE ESTIMATES
PROPORTION OF HITS
TOTAL 0 1
FULL MODEL 0.940 0.967 0.900
RESTRICTED MODEL 0.780 0.733 0.850
HITS/MISSES MATRIX FOR GROUP 0
FULL MODEL
MISSES HITS
RESTRICTED HITS 1. 21.
MODEL MISSES 0. 8.
MCNEMAR'S Z = 2.3333
HITS/MISSES MATRIX FOR GROUP 1
FULL MODEL
MISSES HITS
RESTRICTED HITS 0. 17.
MODEL MISSES 2. 1.
MCNEMAR'S Z = 1.0000
HITS/MISSES MATRIX FOR TOTAL SAMPLE
FULL MODEL
MISSES HITS
RESTRICTED HITS 1. 38.
MODEL MISSES 2. 9.
MCNEMAR'S Z = 2.5298
-------------------------------------------------------
LEAVE ONE OUT ESTIMATES
PROPORTION OF HITS
TOTAL 0 1
FULL MODEL 0.900 0.967 0.800
RESTRICTED MODEL 0.780 0.733 0.850
HITS/MISSES MATRIX FOR GROUP 0
FULL MODEL
MISSES HITS
RESTRICTED HITS 1. 21.
MODEL MISSES 0. 8.
MCNEMAR'S Z = 2.3333
HITS/MISSES MATRIX FOR GROUP 1
FULL MODEL
MISSES HITS
RESTRICTED HITS 2. 15.
MODEL MISSES 2. 1.
MCNEMAR'S Z = -0.5774
HITS/MISSES MATRIX FOR TOTAL SAMPLE
FULL MODEL
MISSES HITS
RESTRICTED HITS 3. 36.
MODEL MISSES 2. 9.
MCNEMAR'S Z = 1.7321
McNemar's test uses the z-value to determine whether classification accuracy is greater for the full model than for the restricted model. Since the interest is only in a one-tailed test, we can use a critical z-value of 1.65: if the z obtained in the output exceeds the critical value, we may declare a significant difference in the accuracy of the models (reject the null); otherwise we cannot. So, in the calibration sample estimates, we reject the null hypothesis for group 0 and for the total sample, since their z-values exceed the critical value, and fail to reject it for group 1. The leave-one-out estimates give the same result.
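The z values in the output come from the discordant cells of each 2 × 2 hits/misses table: the cases the full model classifies correctly but the restricted model misses, and vice versa. A minimal sketch:

```python
import math

def mcnemar_z(full_only_hits, restricted_only_hits):
    # McNemar's z for correlated proportions: (b - c) / sqrt(b + c),
    # where b and c are the two discordant cell counts.
    return (full_only_hits - restricted_only_hits) / math.sqrt(
        full_only_hits + restricted_only_hits
    )
```

For group 0 in the calibration sample the discordant cells are 8 (full-model-only hits) and 1 (restricted-model-only hits), giving z = 7/3 ≈ 2.3333; the total-sample cells 9 and 1 give z ≈ 2.5298, matching the output.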
Now, we want to look at the cross-validation classification accuracy of subsets of the model. A defensible method of selecting a subset of predictor variables for a prediction model is to examine all 2^p - 1 possible combinations of the p predictor variables. So, we use the data to predict recidivism status from Age, Gender, Job, Offences, Supervision, and Prior prison commitments, and evaluate the cross-validation classification accuracy of every possible subset. The program we use allows the user to specify how many of the top-performing subsets of predictor variables to display, and how many of the top-performing subsets of each size to display. An interesting view of variable importance also emerges when a specific variable consistently appears in the top-performing models. Running the program gives the following output:
***TWO GROUP PREDICTOR VARIABLE SELECTION***
***JOHN D. MORRIS & ALICE MESHBANE***
***FLORIDA ATLANTIC UNIVERSITY***
***QUESTIONS/COMMENTS TO: [email protected]***
FILE NAME = FPrison.dat
FORMAT = (7F8.0)
PREDICTORS = 6
GROUP ONE INDEX = 0
GROUP TWO INDEX = 1
MISSING DATA CODE = -30.
NUMBER OF BEST SUBSETS = 20
NUMBER OF BEST SUBSETS OF EACH SIZE = 5
PRIORS = 0.5000 0.5000
SUBJECTS READ = 50
GROUP ONE (N = 30) MEANS
1) 42.4000 2) 0.3333 3) 0.7333 4) 1.0000
5) 17.4667 6) 0.3667
GROUP TWO (N = 20) MEANS
1) 24.1000 2) 0.6500 3) 0.1500 4) 4.1000
5) 8.7500 6) 2.2500
The 20 subsets with the highest total hit rate
HITS VARIABLES
#1 #2 Total
30 19 49 4 6
30 19 49 3 4 6
30 19 49 1 3 4 6
30 19 49 2 3 4 6
30 19 49 1 2 3 4 6
30 19 49 3 4 5 6
30 19 49 2 3 4 5 6
30 18 48 4 5 6
30 18 48 2 4 5 6
29 19 48 1 4 6
30 18 48 2 4 6
29 19 48 1 2 3 4 5 6
30 18 48 2 3 4
30 18 48 1 2 3 4
29 18 47 1 2 4 6
29 18 47 1 4 5 6
28 19 47 1 5
29 18 47 1 3 6
29 18 47 1 2 4 5 6
30 17 47 2 4 5
The (up to) 5 best subsets of each size
HITS VARIABLES
#1 #2 Total
30 16 46 4
24 18 42 1
27 14 41 6
23 16 39 5
22 17 39 3
30 19 49 4 6
28 19 47 1 5
30 16 46 4 5
30 16 46 2 4
29 16 45 1 4
30 19 49 3 4 6
29 19 48 1 4 6
30 18 48 2 3 4
30 18 48 2 4 6
30 18 48 4 5 6
30 19 49 1 3 4 6
30 19 49 2 3 4 6
30 19 49 3 4 5 6
30 18 48 2 4 5 6
30 18 48 1 2 3 4
30 19 49 2 3 4 5 6
30 19 49 1 2 3 4 6
30 17 47 1 2 3 4 5
29 18 47 1 2 4 5 6
29 18 47 1 3 4 5 6
29 19 48 1 2 3 4 5 6
We can see that, out of all 20 listed subsets, some have better cross-validation classification accuracy than the full 6-variable model and some have worse. The two-variable model using variables 4 and 6 achieves the best accuracy, better than models with more variables. So, many subsets of variables perform a little better than the original model, but the original model performs well too (its hit count of 48 is only one below the best of 49). Moreover, several of the subset models have the same accuracy as the original model, so using only some of the variables can yield the same accuracy as the full model, depending on which variables are available for a particular case.
Another important aspect of this output is the individual contribution of each variable to the model. No single variable on its own matches the accuracy of the full model; the best single-variable subset (variable 4) achieves 46 hits. Variables 4 and 6 together, however, contribute the most, yielding a higher hit rate than the original model of 6 predictor variables.
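The exhaustive search over subsets is easy to enumerate; a sketch of the 2^p − 1 enumeration (the scoring of each subset, a leave-one-out hit count in the actual program, is left abstract here):

```python
from itertools import combinations

def all_subsets(predictors):
    """Yield every non-empty subset of the predictor indices (2**p - 1 total)."""
    for size in range(1, len(predictors) + 1):
        for combo in combinations(predictors, size):
            yield combo

subsets = list(all_subsets([1, 2, 3, 4, 5, 6]))
print(len(subsets))        # 2**6 - 1 = 63 candidate models
print((4, 6) in subsets)   # the best-performing pair is among them
```

Each of the 63 candidate models would then be scored by its cross-validated hit rate, as in the table above.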
Logistic regression is a type of regression analysis used for predicting the outcome of a categorical (a
variable that can take on a limited number of categories) criterion variable based on one or more
predictor variables. The probabilities describing the possible outcome of a single trial are modeled, as
function of explanatory variables, using a logistic function. Logistic regression, in this case, is similar to
discriminant analysis. However, Logistic Regression has several advantages over Discriminant Analysis: it is more robust, it does not assume a linear relationship between the independent variables and the dependent variable, the dependent variable need not be normally distributed, and there is no homogeneity-of-variance assumption. However, these advantages come at a cost, because Logistic Regression requires much more data to achieve stable and meaningful results.
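The logistic function mentioned above maps any linear combination of the predictors into a probability between 0 and 1. A minimal sketch (the coefficients in the example call are hypothetical, not the fitted values):

```python
from math import exp

def logistic(z):
    """Logistic (sigmoid) function: maps the linear predictor to (0, 1)."""
    return 1.0 / (1.0 + exp(-z))

def prob_recidivist(intercept, coefs, x):
    """P(recidivist) for one case; coefs and x are parallel sequences."""
    z = intercept + sum(b * v for b, v in zip(coefs, x))
    return logistic(z)

# Hypothetical coefficients for (Age, Job): older age and having a job
# both lower the modeled probability of recidivism.
p = prob_recidivist(2.0, [-0.1, -1.5], [30, 1])  # a value in (0, 1)
```

Fitting consists of choosing the intercept and coefficients that maximize the likelihood of the observed 0/1 outcomes.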
Now, we use the same data as in the previous discriminant analysis labs to fit a similar model using the logistic regression technique. Our response variable is the binary variable representing the recidivism status of each person, with 0 for non-recidivist and 1 for recidivist. After running logistic regression in SPSS, we get the following output:
Classification Table(a)

                                            Predicted
                                   Recidivism (1=recidivist,   Percentage
Observed                           0=non-recidivist)           Correct
                                        0          1
Step 1  Recidivism           0         30          0            100.0
                             1          0         20            100.0
        Overall Percentage                                      100.0
a. The cut value is .500
Variables in the Equation
B S.E. Wald df Sig. Exp(B)
Step 1a
Age .061 400.481 .000 1 1.000 1.063
Gender(1) -17.483 12446.184 .000 1 .999 .000
Job(1) -4.857 16612.197 .000 1 1.000 .008
Offences 20.380 6534.399 .000 1 .998 709347781.862
Supervision -1.506 1398.819 .000 1 .999 .222
Prior 10.485 5234.580 .000 1 .998 35786.351
Constant -21.563 30599.392 .000 1 .999 .000
a. Variable(s) entered on step 1: Age, Gender, Job, Offences, Supervision, Prior.
Now, looking at the classification table, we can see that the predicted values were 100% correct, compared to the 96% cross-validated accuracy of the discriminant analysis model. So, in this case, logistic regression appears to classify better than discriminant analysis. However, looking at the second table, we can see that none of the variables is significant; the enormous standard errors alongside perfect classification suggest the two groups are completely separated, which makes the coefficient estimates unstable. So, discriminant analysis might be better than Logistic Regression here.
In conclusion, the overall model predicts recidivism well using the six predictor variables of Age, Gender, Job, Offences, Supervision and Prior. As indicated by the output above, the relationship between Age and Recidivism is negative: as age increases, the likelihood of recidivism decreases. Gender shows that men are more prone to recidivism than women. Further, a prisoner who gets a job after being released from prison is more likely not to be a recidivist. Also, the higher the number of offences, the more likely the person is to be a recidivist. The fifth variable, supervision, indicates that the longer the supervision period, the lower the chance of being a recidivist. Lastly, the more prior prison commitments a prisoner has, the more likely he/she is to be a recidivist. To sum up, we used discriminant analysis to predict recidivism status and to assess how well the model fits.
PART 2 (HOW LONG DOES A RECIDIVIST STAY OUT OF PRISON)
We have seen from the output above how to predict whether a future prisoner returns to prison, i.e. the recidivism status of the person, from the 6 predictor variables, namely Age, Gender, Job status, Offences, Supervision and Prior prison commitments. We would now like to look, for the recidivists only, at how long the prisoner stays out of prison before being caught again. So, we would like to see how the Recidivism Period is related to the same set of variables (same observations): Age of release, Gender, Job status after release, Number of offences/thefts in the criminal history, Supervision term, and Prior prison commitments. To determine this period, we run a multiple linear regression model:
Descriptive Statistics

                                                      Mean   Std. Deviation   N
Recidivism Period in months (time after
release until caught again)                          11.95        5.753       20
Age of release                                       24.10        6.889       20
Gender (0=female, 1=male)                              .65         .489       20
Job Status after release (0=No, 1=Yes)                 .15         .366       20
Number of thefts/offenses in criminal history         4.10        1.651       20
Supervision term after release (weeks)                8.75        5.169       20
Number of prior prison commitments                    2.25        1.410       20
As described before, the descriptive statistics display the mean and standard deviation of all the
variables in the multiple linear regression model. The mean number of months for the recidivism period
is 11.95 for 20 recidivists. Similarly, the first column displays all the means and the second column
represents the standard deviation. The third column represents the number of observations i.e. 20.
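These summary figures are simple to reproduce; a sketch with a small hypothetical sample (SPSS reports the sample standard deviation, i.e. the n − 1 denominator):

```python
import statistics

# Hypothetical recidivism periods in months; the actual file has n = 20 cases.
periods = [5, 8, 12, 15, 20]

mean = statistics.mean(periods)   # 12.0
sd = statistics.stdev(periods)    # sample SD (n - 1 denominator)
n = len(periods)
print(mean, round(sd, 3), n)
```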
Correlations (N = 20 for every pair)

Pearson Correlation
                     Period    Age   Gender    Job   Offenses   Superv.   Prior
Recidivism Period     1.000   .902    -.212   .378      -.276      .442   -.180
Age of release         .902  1.000    -.270   .390      -.339      .366   -.192
Gender                -.212  -.270    1.000   .015      -.150     -.140    .591
Job Status             .378   .390     .015  1.000      -.113      .132   -.178
Thefts/offenses       -.276  -.339    -.150  -.113      1.000     -.077   -.102
Supervision term       .442   .366    -.140   .132      -.077     1.000   -.034
Prior commitments     -.180  -.192     .591  -.178      -.102     -.034   1.000

Sig. (1-tailed)
                     Period    Age   Gender    Job   Offenses   Superv.   Prior
Recidivism Period         .   .000     .185   .050       .119      .026    .224
Age of release         .000      .     .125   .045       .072      .056    .208
Gender                 .185   .125        .   .476       .264      .277    .003
Job Status             .050   .045     .476      .       .317      .290    .226
Thefts/offenses        .119   .072     .264   .317          .      .373    .335
Supervision term       .026   .056     .277   .290       .373         .    .443
Prior commitments      .224   .208     .003   .226       .335      .443       .

Variable labels: Period = Recidivism Period in months (time after release until caught again); Gender (0=female, 1=male); Job Status after release (0=No, 1=Yes); Thefts/offenses = Number of thefts/offenses in criminal history; Supervision term after release (weeks); Prior = Number of prior prison commitments.
The correlation table above shows the correlation of each variable with every other variable in the model. However, we are primarily interested in the relation of the response variable to the 6 predictor variables. Recidivism period is highly positively correlated with Age (.902), which implies that as age increases, the recidivism period increases; i.e. younger prisoners do not stay out of prison for long. As the number of thefts/offenses increases, the recidivism period decreases, i.e. the more offences, the shorter the time the criminal stays out. As the supervision term increases, the recidivism period increases, i.e. the prisoner stays out of prison longer. As the number of prior prison commitments increases, the recidivism period decreases.
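Each entry in the Pearson correlation matrix above is the product-moment correlation of one pair of columns. A minimal sketch of the computation (the sample values here are hypothetical):

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical ages and recidivism periods: strongly positive, close to 1,
# mirroring the .902 correlation in the table above.
r = pearson_r([20, 25, 30, 40], [6, 10, 13, 22])
```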
Model Summary

Model     R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .912a      .832           .755                     2.850
a. Predictors: (Constant), Number of prior prison commitments, Supervision
term after release (weeks), Number of thefts/offenses in criminal history,
Job Status after release (0=No, 1=Yes), Age of release, Gender (0=female, 1=male)
The above table shows the model summary. The R-squared value of approximately 83% shows that about 83% of the variance in the response variable is explained by the 6 predictor variables. This high R-squared suggests the predictors explain the response well.
ANOVAa
Model Sum of Squares df Mean Square F Sig.
1
Regression 523.358 6 87.226 10.739 .000b
Residual 105.592 13 8.122
Total 628.950 19
a. Dependent Variable: Recidivism Period in months(Time Period after release and again caught)
b. Predictors: (Constant), Number of prior prison commitments, Supervision term after release
(weeks), Number of thefts/offenses in criminal history, Job Status after release (0=No, 1 =
Yes), Age of release, Gender (0=female, 1=male)
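The model-summary figures can be recovered directly from the ANOVA sums of squares. A quick check in Python, using the values from the table above (n = 20 cases, p = 6 predictors):

```python
# Sums of squares read from the ANOVA table above
ss_regression = 523.358
ss_residual = 105.592
ss_total = ss_regression + ss_residual   # 628.950
n, p = 20, 6

r_squared = ss_regression / ss_total
adj_r_squared = 1 - (ss_residual / (n - p - 1)) / (ss_total / (n - 1))
f_stat = (ss_regression / p) / (ss_residual / (n - p - 1))

print(round(r_squared, 3), round(adj_r_squared, 3), round(f_stat, 3))
# 0.832 0.755 10.739 -- matching the Model Summary and ANOVA tables
```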
Coefficients(a)

                                  Unstandardized       Standardized
                                  Coefficients         Coefficients
Model                             B       Std. Error      Beta        t      Sig.   Tolerance    VIF
1  (Constant)                   -7.576      4.367                   -1.735   .106
   Age of release                 .727       .123          .870      5.891   .000     .592      1.690
   Gender (0=female, 1=male)      .862      1.788          .073       .482   .638     .559      1.790
   Job Status after release
   (0=No, 1=Yes)                  .259      2.016          .016       .128   .900     .784      1.276
   Number of thefts/offenses
   in criminal history            .128       .438          .037       .293   .774     .817      1.223
   Supervision term after
   release (weeks)                .148       .137          .133      1.084   .298     .857      1.167
   Number of prior prison
   commitments                   -.183       .595         -.045      -.307   .764     .609      1.643
a. Dependent Variable: Recidivism Period in months (time after release until caught again)

The above table shows the coefficients of each of the variables involved in the regression. So, we get the following model:
Recidivism_Period = -7.576 + .727 Age + .862 Gender + .259 Job + .128 Offences + .148 Supervision - .183 Prior
Each partial slope gives the change in the predicted recidivism period for a one-unit change in that predictor, holding all the other variables constant. So, as age increases by one year, the predicted recidivism period increases by .727 months. If the person is female (gender = 0), the gender term contributes nothing; if the person is male, the predicted recidivism period increases by .862 months. Similarly, we can interpret the results for the other variables, holding all the other variables constant.
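Plugging values into the fitted equation gives a point estimate of the recidivism period. A sketch (the case values in the example are hypothetical):

```python
# Fitted multiple regression equation from the coefficients table above
def predicted_period(age, gender, job, offences, supervision, prior):
    """Predicted recidivism period in months from the fitted equation."""
    return (-7.576 + 0.727 * age + 0.862 * gender + 0.259 * job
            + 0.128 * offences + 0.148 * supervision - 0.183 * prior)

# e.g. a 25-year-old male, no job, 4 prior offences, 10 weeks of
# supervision, 2 prior commitments:
print(round(predicted_period(25, 1, 0, 4, 10, 2), 3))  # 13.087 months
```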
Even though the R-squared is quite high, the significance levels show that, apart from Age, all the other variables are insignificant. Still, we can use the model to estimate the recidivism time given all these variables of interest.