MODEL FOR PREDICTING PRISONER
RECIDIVISM
HIRA NADEEM
INTRODUCTION
The data come from Folsom Prison and are used to build a model for predicting prisoner recidivism. Six predictor variables are used to predict recidivism. The response variable is recidivism status, a binary variable: 1 denotes a recidivist, that is, a person who repeats an undesirable behavior after having experienced negative consequences of that behavior or having been treated or trained to extinguish it, and 0 denotes a non-recidivist, a person who does not repeat a crime after receiving the punishment for it.
Age at release is the first predictor variable, describing the prisoner's age at the time of release (after serving the sentence). I want to see whether this variable is related to a person's recidivism status; a younger person might show higher recidivism than an older one. The second predictor variable is Gender, coded as a binary variable: 0 for female and 1 for male. It is included to see whether recidivist behavior is influenced by gender. The third predictor variable is Job status after release, also binary (0 for no job after release, 1 for being hired or employed), and it appears to be an important factor. Many studies and researchers have concluded that one of the main reasons for recidivism is the lack of job opportunities for people with a criminal record. Most companies do not want to hire people with a criminal background, so such people are prone to meet their needs by repeating the same crime or committing a different one. To examine this hypothesis, I have included job status to see how it relates to recidivism status. The fourth predictor variable is the number of thefts/offenses in the criminal history, included to capture how frequently recidivists commit crimes. The fifth variable is the supervision term after release: the number of weeks the prisoner is supervised by the police after being released from prison. There may be a relationship between the supervision term and recidivism; a person supervised for longer may engage in fewer unlawful activities during that time, so a longer supervision term might reduce the number of crimes committed and thereby affect recidivism status. The last independent variable is the number of prior prison commitments, representing the number of times the prisoner has been arrested before. All of these variables plausibly affect a person's recidivism status.
STATISTICAL ANALYSIS
We now look at the descriptive statistics of the data:
Descriptive Statistics

Variable                                           N    Minimum  Maximum   Mean   Std. Deviation
Age of release                                     50      13       60     35.08      12.900
Gender (0 = female, 1 = male)                      50       0        1       .46        .503
Job status after release (0 = No, 1 = Yes)         50       0        1       .50        .505
Number of thefts/offenses in criminal history      50       1        7      2.24       1.847
Supervision term after release (weeks)             50       1       30     13.98       7.438
Number of prior prison commitments                 50       0        5      1.12       1.380
Recidivism (1 = recidivist, 0 = non-recidivist)    50       0        1       .40        .495
Valid N (listwise)                                 50
The above table presents the descriptive statistics of the data at hand. The first column gives the number of observations for each variable. We can see that the data do not suffer from a missing-data problem: there are 50 observations for each predictor and for the response variable. The next two columns show the minimum and maximum value of each variable, and the mean and standard deviation are given in the final two columns.
Now, we will look at the correlations table for these variables to see whether the correlation between the dependent variable and each independent variable is significant, as well as the direction of the correlation (that is, whether it is positively or negatively correlated with the response variable). The table below shows the correlations output:
Correlations (Pearson r; N = 50 for every pair)

                  Recid     Age     Gender    Job    Offenses  Superv    Prior
Recidivism        1       -.702**   .311*   -.572**   .831**  -.580**   .675**
Age of release   -.702**   1       -.405**   .326*   -.646**   .368**  -.599**
Gender            .311*   -.405**   1       -.040     .208    -.090     .360*
Job status       -.572**   .326*   -.040     1       -.503**   .410**  -.381**
Thefts/offenses   .831**  -.646**   .208    -.503**   1       -.500**   .525**
Supervision      -.580**   .368**  -.090     .410**  -.500**   1       -.356*
Prior commitm.    .675**  -.599**   .360*   -.381**   .525**  -.356*    1

Two-tailed significance (p-values):

                  Recid    Age    Gender    Job    Offenses  Superv   Prior
Recidivism          —     .000     .028    .000     .000      .000     .000
Age of release    .000      —      .004    .021     .000      .009     .000
Gender            .028    .004       —     .782     .147      .534     .010
Job status        .000    .021     .782      —      .000      .003     .006
Thefts/offenses   .000    .000     .147    .000       —       .000     .000
Supervision       .000    .009     .534    .003     .000        —      .011
Prior commitm.    .000    .000     .010    .006     .000      .011       —

**. Correlation is significant at the 0.01 level (2-tailed).
*. Correlation is significant at the 0.05 level (2-tailed).

(Recid = Recidivism, 1 = recidivist, 0 = non-recidivist; Gender, 0 = female, 1 = male; Job status after release, 0 = No, 1 = Yes; Offenses = number of thefts/offenses in criminal history; Superv = supervision term after release in weeks; Prior = number of prior prison commitments.)
The first row of the correlation matrix gives the correlation of the response variable with each predictor, along with its significance and the number of observations. The Pearson correlation measures the strength of the linear association between two variables; its value lies between -1.00 (perfect negative correlation) and 1.00 (perfect positive correlation), with 0.00 indicating no correlation. The correlation between Recidivism and Age at release is -.702, which implies a negative relationship: as age increases, the chance of being a recidivist decreases. Since Gender and Job Status are dichotomous variables with only two values rather than continuous ones, the Pearson correlation should be interpreted cautiously for them. The correlation between the number of thefts/offenses in the criminal history and recidivism is positive and quite high at .831, implying that a person with a high number of thefts/offenses is more likely to be a recidivist. The supervision term after release is negatively correlated with recidivism status, so as the number of weeks of supervision increases, the person is less likely to be a recidivist. Lastly, the number of prior prison commitments and recidivism are positively correlated at .675, indicating higher recidivism for people with more prior prison commitments.
The significance rows give the 2-tailed p-values used to judge whether each correlation is significant. Taking the significance level as 0.05, a p-value below 0.05 indicates a statistically significant correlation between two variables, meaning that increases or decreases in one variable relate systematically to increases or decreases in the other; a p-value above 0.05 indicates no statistically significant correlation. In our case, every predictor's correlation with recidivism is significant at the .05 level.
The N entries simply record the number of observations for each pair of variables; every variable has 50 observations in the data set.
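The Pearson correlations and the test statistic behind the 2-tailed p-values can be reproduced directly from the raw columns. A minimal sketch (pure Python; the t statistic with n − 2 degrees of freedom is what SPSS converts into the p-value):

```python
import math

def pearson_r(x, y):
    # Pearson correlation: covariance scaled by the two standard deviations.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_stat(r, n):
    # t = r * sqrt(n - 2) / sqrt(1 - r^2), with df = n - 2.
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)
```

With n = 50, any |r| whose t statistic exceeds the two-tailed critical value (about 2.01 at α = .05, df = 48) is significant, which is why even the modest Gender correlation of .311 attains significance here.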
After looking at the basic statistics describing the data set, we undertake the task of predicting the nominal variable using discriminant analysis. Discriminant function analysis undertakes the same task as multiple linear regression by predicting an outcome, but its response variable must be nominal, with at least two groups or categories and each case belonging to only one group, so that the groups are mutually exclusive and collectively exhaustive. So, we run the following discriminant analysis model in SPSS, with Recidivism as the dependent variable and Age, Gender, Job, Offences, Supervision and Prior as independent variables:
Recidivism = 𝜷𝟏Age + 𝜷𝟐Gender + 𝜷𝟑Job + 𝜷𝟒 Offences + 𝜷𝟓Supervision + 𝜷𝟔Prior
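Conceptually, the discriminant function above is a weighted sum of the predictors whose weights maximize the separation between the two groups. A minimal sketch of how such weights arise for two groups and two hypothetical predictors (not the Folsom data), using a hand-rolled pooled-covariance Fisher discriminant:

```python
import math

def mean_vec(rows):
    # column means of a list of 2-tuples
    n = len(rows)
    return [sum(r[i] for r in rows) / n for i in range(len(rows[0]))]

def sscp(rows, m):
    # 2x2 sums of squares and cross-products about the group mean
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - m[0], r[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def fisher_weights(g0, g1):
    # w = C^{-1} (m1 - m0), where C is the pooled within-group covariance
    m0, m1 = mean_vec(g0), mean_vec(g1)
    s0, s1 = sscp(g0, m0), sscp(g1, m1)
    df = len(g0) + len(g1) - 2
    c = [[(s0[i][j] + s1[i][j]) / df for j in range(2)] for i in range(2)]
    det = c[0][0] * c[1][1] - c[0][1] * c[1][0]
    inv = [[c[1][1] / det, -c[0][1] / det], [-c[1][0] / det, c[0][0] / det]]
    d = [m1[0] - m0[0], m1[1] - m0[1]]
    return [inv[0][0] * d[0] + inv[0][1] * d[1],
            inv[1][0] * d[0] + inv[1][1] * d[1]]
```

A case's discriminant score is the dot product of its predictor values with these weights; SPSS additionally standardizes and centers the function, but the direction of separation is the same.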
We receive the following output by running the discriminant analysis on the data of Folsom Prison:
Analysis Case Processing Summary

Unweighted Cases                                                   N    Percent
Valid                                                             50     100.0
Excluded: missing or out-of-range group codes                      0        .0
Excluded: at least one missing discriminating variable             0        .0
Excluded: both missing/out-of-range codes and a missing variable   0        .0
Excluded: total                                                    0        .0
Total                                                             50     100.0
The above table shows that there is no missing data problem, so we will use the entire data of 50
observations.
Group Statistics

Recidivism = 0 (non-recidivist)                     Mean   Std. Deviation   Valid N (unweighted / weighted)
Age of release                                     42.40       10.559             30 / 30.000
Gender (0 = female, 1 = male)                        .33         .479             30 / 30.000
Job status after release (0 = No, 1 = Yes)           .73         .450             30 / 30.000
Number of thefts/offenses in criminal history       1.00         .000             30 / 30.000
Supervision term after release (weeks)             17.47        6.673             30 / 30.000
Number of prior prison commitments                   .37         .669             30 / 30.000

Recidivism = 1 (recidivist)
Age of release                                     24.10        6.889             20 / 20.000
Gender (0 = female, 1 = male)                        .65         .489             20 / 20.000
Job status after release (0 = No, 1 = Yes)           .15         .366             20 / 20.000
Number of thefts/offenses in criminal history       4.10        1.651             20 / 20.000
Supervision term after release (weeks)              8.75        5.169             20 / 20.000
Number of prior prison commitments                  2.25        1.410             20 / 20.000

Total
Age of release                                     35.08       12.900             50 / 50.000
Gender (0 = female, 1 = male)                        .46         .503             50 / 50.000
Job status after release (0 = No, 1 = Yes)           .50         .505             50 / 50.000
Number of thefts/offenses in criminal history       2.24        1.847             50 / 50.000
Supervision term after release (weeks)             13.98        7.438             50 / 50.000
Number of prior prison commitments                  1.12        1.380             50 / 50.000
The group statistics table shows the mean and standard deviation of each predictor variable within the two groups of the response variable. The rows under the group labeled 0 describe the non-recidivists: for example, the mean age of the non-recidivists is 42.40 with a standard deviation of 10.559, and each of these variables has 30 observations. Similarly, the rows under the group labeled 1 describe the recidivists: their mean age is 24.10 with a standard deviation of 6.889, and there are 20 observations in each predictor variable for this group. The last block of the table pools both groups: the mean age across all observations is 35.08 with a standard deviation of 12.9, and the total number of observations is the sum of the two groups, 50. This last portion agrees with the descriptive statistics table at the start of the paper.
The Tests of Equality of Group Means table below is generated by the selected univariate ANOVAs. It indicates whether there is a statistically significant difference between the group means of the dependent variable for each independent variable. Judging by the significance level, which represents the p-value, all of the independent variables are significant at α = 0.05. Wilks' Lambda is the statistical criterion used here to add or remove variables from the analysis; several other criteria are also available.
Box's Test of Equality of Covariance Matrices

Log Determinants

Recidivism (1 = recidivist, 0 = non-recidivist)   Rank   Log Determinant
0                                                   5         .(a)
1                                                   6        4.295
Pooled within-groups                                6        4.722

The ranks and natural logarithms of determinants printed are those of the group covariance matrices.
a. Singular
Tests of Equality of Group Means

Variable                                        Wilks' Lambda      F      df1   df2   Sig.
Age of release                                      .507        46.650    1    48    .000
Gender (0 = female, 1 = male)                       .903         5.149    1    48    .028
Job status after release (0 = No, 1 = Yes)          .673        23.287    1    48    .000
Number of thefts/offenses in criminal history       .310       106.860    1    48    .000
Supervision term after release (weeks)              .664        24.324    1    48    .000
Number of prior prison commitments                  .544        40.283    1    48    .000
Test Results(a)
Tests the null hypothesis of equal population covariance matrices.
a. No test can be performed with fewer than two nonsingular group covariance matrices.
Summary of Canonical Discriminant Functions
Eigenvalues

Function   Eigenvalue   % of Variance   Cumulative %   Canonical Correlation
1           4.603(a)        100.0           100.0              .906

a. First 1 canonical discriminant functions were used in the analysis.

Wilks' Lambda

Test of Function(s)   Wilks' Lambda   Chi-square   df   Sig.
1                         .178          77.549      6   .000
An eigenvalue indicates the proportion of variance explained (the between-groups sum of squares divided by the within-groups sum of squares). A large eigenvalue is associated with a strong function. The two tables above give the percentage of variance accounted for by the single discriminant function generated, along with its significance (0.000). Because there are only two groups, only one discriminant function was generated.
Wilks’ Lambda is the ratio of within-groups sums of squares to the total sums of squares. This is the
proportion of the total variance in the discriminant scores not explained by differences among groups. A
lambda of 1.00 occurs when observed group means are equal (all the variance is explained by factors
other than difference between those means), while a small lambda occurs when within-groups
variability is small compared to the total variability. A small lambda indicates that group means appear
to differ. The associated significance value indicates whether the difference is significant. Here, the
lambda of 0.178 is significant (Sig. = .000); thus, the group means appear to differ.
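For a single variable and two groups, Wilks' lambda reduces to the ratio of within-group to total sum of squares described above. A minimal sketch with hypothetical values:

```python
def wilks_lambda(group0, group1):
    # Lambda = within-groups SS / total SS for one variable and two groups.
    allv = group0 + group1
    gm = sum(allv) / len(allv)
    total = sum((v - gm) ** 2 for v in allv)
    within = 0.0
    for g in (group0, group1):
        m = sum(g) / len(g)
        within += sum((v - m) ** 2 for v in g)
    return within / total
```

Equal group means give a lambda of 1.00 (no between-group variance), while perfectly separated groups with no within-group spread give a lambda of 0.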
Standardized Canonical Discriminant Function Coefficients

Variable                                        Function 1
Age of release                                    -.278
Gender (0 = female, 1 = male)                      .145
Job status after release (0 = No, 1 = Yes)        -.310
Number of thefts/offenses in criminal history      .669
Supervision term after release (weeks)            -.328
Number of prior prison commitments                 .411
The ‘Canonical Discriminant Function Coefficients’ indicate the unstandardized scores concerning the
independent variables. It is the list of coefficients of the unstandardized discriminant equation. Each
subject’s discriminant score would be computed by entering his or her variable values (raw data) for
each of the variables in the equation.
Functions at Group Centroids

Recidivism (1 = recidivist, 0 = non-recidivist)   Function 1
0                                                   -1.716
1                                                    2.575

Unstandardized canonical discriminant functions evaluated at group means.
‘Functions at Group Centroids’ indicate the average discriminant score for subjects in the two groups.
More specifically, the discriminant score for each group when the variable means (rather than individual
values for each subject) are entered into the discriminant equation. The two values are quite different, with one being negative and the other positive.
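Each centroid is simply the mean discriminant score of the cases in its group. A minimal sketch with hypothetical scores and group labels:

```python
def centroids(scores, labels):
    # Average discriminant score within each group of the dependent variable.
    grouped = {}
    for s, g in zip(scores, labels):
        grouped.setdefault(g, []).append(s)
    return {g: sum(v) / len(v) for g, v in grouped.items()}
```

A case is then assigned to whichever centroid its own discriminant score lies closest to, which is why well-separated centroids of opposite sign signal a useful function.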
Classification Statistics

Classification Processing Summary

Processed                                                 50
Excluded: missing or out-of-range group codes              0
Excluded: at least one missing discriminating variable     0
Used in Output                                            50
Prior Probabilities for Groups

Recidivism (1 = recidivist, 0 = non-recidivist)   Prior   Cases Used in Analysis (unweighted / weighted)
0                                                  .500          30 / 30.000
1                                                  .500          20 / 20.000
Total                                             1.000          50 / 50.000
This is the distribution of observations into the Recidivism groups used as a starting point in the
analysis. The default prior distribution is an equal allocation into the groups, as seen in this example.
SPSS allows users to specify different priors with the priors subcommand.
Classification Function Coefficients

Variable                                        Group 0    Group 1
Age of release                                    .671       .542
Gender (0 = female, 1 = male)                    3.346      4.636
Job status after release (0 = No, 1 = Yes)       4.957      1.784
Number of thefts/offenses in criminal history    2.389      5.151
Supervision term after release (weeks)            .466       .235
Number of prior prison commitments               1.480      3.196
(Constant)                                     -22.827    -24.056

Fisher's linear discriminant functions
Casewise Statistics

For each case, the table reports: Case Number; Actual Group; the Highest Group (Predicted Group, P(D>d | G=g) with its p and df, P(G=g | D=d), and Squared Mahalanobis Distance to Centroid); the Second Highest Group (Group, P(G=g | D=d), Squared Mahalanobis Distance to Centroid); and the Discriminant Score (Function 1).

Original cases:
1 1 1 .901 1 1.000 .015 0 .000 17.360 2.450
2 1 1 .798 1 1.000 .065 0 .000 16.286 2.319
3 0 0 .610 1 .999 .261 1 .001 14.290 -1.206
4 1 1 .664 1 .999 .189 0 .001 14.872 2.140
5 0 0 .712 1 1.000 .136 1 .000 15.384 -1.348
6 0 0 .836 1 1.000 .043 1 .000 20.235 -1.924
7 1 1 .020 1 1.000 5.383 0 .000 43.708 4.895
8 1 1 .020 1 1.000 5.441 0 .000 43.872 4.907
9 1 1 .155 1 .957 2.026 0 .043 8.223 1.151
10 0 0 .911 1 1.000 .013 1 .000 17.464 -1.604
11 0 0 .301 1 1.000 1.071 1 .000 28.363 -2.751
12 0 0 .531 1 .999 .393 1 .001 13.426 -1.090
13 1 1 .146 1 .951 2.110 0 .049 8.056 1.122
14 0 0 .889 1 1.000 .020 1 .000 17.233 -1.577
15 1 1 .030 1 1.000 4.738 0 .000 41.829 4.751
16 1 1 .887 1 1.000 .020 0 .000 19.654 2.717
17 1 0** .101 1 .897 2.697 1 .103 7.016 -.074
18 0 0 .423 1 .997 .642 1 .003 12.180 -.915
19 0 0 .773 1 1.000 .083 1 .000 20.973 -2.005
20 0 0 .560 1 .999 .339 1 .001 13.756 -1.134
21 0 0 .790 1 1.000 .071 1 .000 16.195 -1.450
22 0 0 .742 1 1.000 .109 1 .000 21.350 -2.046
23 0 0 .888 1 1.000 .020 1 .000 19.636 -1.857
24 1 1 .338 1 .994 .919 0 .006 11.103 1.616
25 1 1 .303 1 1.000 1.062 0 .000 28.316 3.605
26 0 0 .978 1 1.000 .001 1 .000 18.176 -1.689
27 0 0 .686 1 1.000 .164 1 .000 22.049 -2.121
28 0 0 .155 1 .957 2.024 1 .043 8.227 -.294
29 0 0 .317 1 1.000 1.000 1 .000 27.994 -2.716
30 1 1 .010 1 1.000 6.659 0 .000 47.218 5.155
31 0 0 .919 1 1.000 .010 1 .000 17.548 -1.615
32 0 0 .660 1 1.000 .193 1 .000 22.375 -2.156
33 0 0 .803 1 1.000 .062 1 .000 16.339 -1.468
34 1 1 .672 1 1.000 .179 0 .000 22.222 2.998
35 1 1 .999 1 1.000 .000 0 .000 18.421 2.576
36 0 0 .917 1 1.000 .011 1 .000 17.533 -1.613
37 0 0 .808 1 1.000 .059 1 .000 20.562 -1.960
38 0 0 .963 1 1.000 .002 1 .000 18.814 -1.763
39 1 1 .883 1 1.000 .022 0 .000 17.172 2.428
40 1 1 .608 1 .999 .263 0 .001 14.277 2.062
41 0 0 .397 1 1.000 .719 1 .000 26.406 -2.564
42 0 0 .776 1 1.000 .081 1 .000 16.050 -1.432
43 0 0 .503 1 1.000 .449 1 .000 24.615 -2.387
44 1 1 .142 1 .948 2.152 0 .052 7.976 1.108
45 1 1 .657 1 .999 .197 0 .001 14.801 2.131
46 0 0 .978 1 1.000 .001 1 .000 18.649 -1.744
47 1 1 .255 1 .987 1.297 0 .013 9.935 1.436
48 0 0 .660 1 .999 .194 1 .001 14.829 -1.276
49 0 0 .927 1 1.000 .008 1 .000 19.202 -1.807
50 0 0 .791 1 1.000 .070 1 .000 20.760 -1.982
Cross-validated cases(b):
1 1 1 .620 6 1.000 4.422 0 .000 20.888
2 1 1 .164 6 .999 9.168 0 .001 23.789
3 0 0 .638 6 .999 4.288 1 .001 17.476
4 1 1 .006 6 .997 17.999 0 .003 29.395
5 0 0 .577 6 .999 4.744 1 .001 19.119
6 0 0 .828 6 1.000 2.844 1 .000 22.548
7 1 1 .020 6 1.000 15.014 0 .000 59.768
8 1 1 .045 6 1.000 12.875 0 .000 57.061
9 1 1 .198 6 .886 8.585 0 .114 12.681
10 0 0 .451 6 1.000 5.755 1 .000 22.359
11 0 0 .743 6 1.000 3.504 1 .000 31.071
12 0 0 .229 6 .997 8.120 1 .003 19.648
13 1 1 .114 6 .838 10.268 0 .162 13.562
14 0 0 .716 6 1.000 3.710 1 .000 20.240
15 1 1 .345 6 1.000 6.743 0 .000 47.026
16 1 1 .966 6 1.000 1.404 0 .000 20.534
17 1 0** .445 6 .986 5.805 1 .014 14.271
18 0 0 .030 6 .988 13.969 1 .012 22.738
19 0 0 .900 6 1.000 2.207 1 .000 22.660
20 0 0 .631 6 .998 4.342 1 .002 16.901
21 0 0 .405 6 .999 6.165 1 .001 21.297
22 0 0 .967 6 1.000 1.382 1 .000 22.212
23 0 0 .488 6 1.000 5.444 1 .000 24.435
24 1 1 .724 6 .991 3.646 0 .009 13.059
25 1 1 .394 6 1.000 6.269 0 .000 33.914
26 0 0 .284 6 1.000 7.417 1 .000 24.692
27 0 0 .178 6 1.000 8.926 1 .000 30.470
28 0 1** .001 6 .510 23.416 0 .490 23.492
29 0 0 .489 6 1.000 5.435 1 .000 32.882
30 1 1 .011 6 1.000 16.512 0 .000 65.364
31 0 0 .612 6 1.000 4.479 1 .000 21.284
32 0 0 .581 6 1.000 4.711 1 .000 26.561
33 0 0 .671 6 1.000 4.043 1 .000 19.568
34 1 1 .901 6 1.000 2.199 0 .000 23.800
35 1 1 .842 6 1.000 2.729 0 .000 20.501
36 0 0 .521 6 1.000 5.183 1 .000 21.912
37 0 0 .812 6 1.000 2.977 1 .000 23.007
38 0 0 .368 6 1.000 6.520 1 .000 24.566
39 1 1 .022 6 .999 14.777 0 .001 29.771
40 1 1 .853 6 .999 2.639 0 .001 15.984
41 0 0 .206 6 1.000 8.469 1 .000 34.635
42 0 0 .774 6 1.000 3.271 1 .000 18.565
43 0 0 .953 6 1.000 1.588 1 .000 25.510
44 1 1 .003 6 .500 19.891 0 .500 19.891
45 1 1 .030 6 .998 13.945 0 .002 25.933
46 0 0 .767 6 1.000 3.325 1 .000 21.383
47 1 1 .293 6 .969 7.315 0 .031 14.233
48 0 0 .214 6 .999 8.350 1 .001 21.586
49 0 0 .091 6 1.000 10.922 1 .000 29.176
50 0 0 .437 6 1.000 5.880 1 .000 26.053
For the original data, squared Mahalanobis distance is based on canonical functions.
For the cross-validated data, squared Mahalanobis distance is based on observations.
**. Misclassified case
b. Cross validation is done only for those cases in the analysis. In cross validation, each case is classified by the
functions derived from all cases other than that case.
In the Casewise Statistics output, SPSS provides the squared Mahalanobis distance to the centroid of each group defined by the dependent variable. If a case has a large squared Mahalanobis distance even to the centroid of the group it is most likely to belong to, it may be an outlier.
Now, if we look at the P(G=g | D=d) column under the Highest Group and Second Highest Group headings, we see that for each subject there are two numbers that add to 1.00. These are the probabilities of group membership assigned by the discriminant analysis model. Thus, our model suggests that the probability of subject #4 being a recidivist is .999 and the complementary probability of being a non-recidivist is .001, so this subject is predicted with high confidence: about 99%, with only a 1% chance of being wrong.
Looking at the same information for subject #17, we see that a recidivist has been wrongly classified as a non-recidivist with a probability of .897, as opposed to a probability of .103 of being a recidivist. Notably, the model remained fairly confident even in this incorrect prediction.
Similarly, we can examine the cross-validation results. Overall, when the model is correct for a particular subject, its confidence is generally high, as for subject #13. On the other hand, for subject #17 the model's prediction was wrong yet made with comparatively high confidence (a probability of .986). For subject #28, by contrast, a non-recidivist was wrongly classified as a recidivist, but the confidence was not high: the two probabilities are close together, and such a case can be called a 'fence-rider'. We could continue examining individual cases to draw a general conclusion about the pattern of predictions for the recidivist and non-recidivist groups.
Classification Results(a,c)

                                         Predicted Group Membership
Recidivism group                              0        1       Total
Original             Count      0            30        0        30
                                1             1       19        20
                     %          0         100.0       .0     100.0
                                1           5.0     95.0     100.0
Cross-validated(b)   Count      0            29        1        30
                                1             1       19        20
                     %          0          96.7      3.3     100.0
                                1           5.0     95.0     100.0

a. 98.0% of original grouped cases correctly classified.
b. Cross validation is done only for those cases in the analysis. In cross validation, each case is classified by the functions derived from all cases other than that case.
c. 96.0% of cross-validated grouped cases correctly classified.
In the last table, we can see that upon cross-validating, we classified 29 of the 30 non-recidivists correctly, while incorrectly labeling 1 non-recidivist as a recidivist, for an accuracy of approximately 97% and an error of 3%. Similarly, 19 of the 20 recidivists were correctly labeled, whereas 1 recidivist was incorrectly labeled as a non-recidivist, for an accuracy of approximately 95% and an error of around 5%. The overall accuracy of the cross-validated model was thus 96%, implying that a large proportion of recidivists and non-recidivists were correctly predicted.
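The per-group and overall hit rates can be recovered directly from the count matrix. A minimal sketch, using the cross-validated counts from the table:

```python
def classification_summary(matrix):
    # matrix[actual][predicted] holds counts; returns per-group hit rates
    # and the overall hit rate.
    per_group = {}
    hits = total = 0
    for actual, row in matrix.items():
        n = sum(row.values())
        h = row.get(actual, 0)          # diagonal cell = correct classifications
        per_group[actual] = h / n
        hits += h
        total += n
    return per_group, hits / total
```

For the cross-validated table, `{0: {0: 29, 1: 1}, 1: {0: 1, 1: 19}}` yields per-group rates of about .967 and .950 and an overall rate of .960, matching the 96.0% figure in footnote c.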
In this context, the hit rate is the proportion of cases classified into their correct group, and its significance can be tested using the z distribution. Under the proportional chance criterion, the null expectation is that hits are proportional to the number of subjects in each group; under the maximum chance criterion, chance is construed as the accuracy that would be obtained by classifying every subject into the larger group. The calculations for the z tests depend on the number of hits and the number of subjects in each group. The program that tests the significance of the hit rates requires the cross-validated results from the discriminant analysis above, predicting recidivists and non-recidivists. Running the program gives the following output:
GROUP NS
30.00 20.00
HITS
29.00 19.00
PROPORTIONAL CHANCE CRITERION
Separate group chance expectation:
18.00 8.00
Separate group accuracies as %
96.67 95.00
Separate group Z/Prob:
4.10 5.02
.000 .000
Total sample hits = 48.00
Total sample hits as % = 96.00
Total sample chance expectation = 26.00
Total sample Z/Prob = 6.23
.000
% improvement over chance for separate groups
91.67 91.67
% improvement over chance for total sample = 91.67
MAXIMUM CHANCE CRITERION
Total sample Z/Prob = 5.20
.000
% improvement over chance for total sample = 90.00
The first four lines of the output show that, of the 30 non-recidivists, 29 are hits, and of the 20 recidivists, 19 are hits. The proportional chance criterion for assessing model fit is calculated by summing the squared proportion that each group represents of the sample. The separate-group chance expectation is higher for the non-recidivist group (18) than for the recidivist group (8), and the separate-group accuracies are very close: 96.67% for the first group versus 95.00% for the second. The first group has a z-value of 4.10 with a p-value of .000, and the second group a z-value of 5.02 with a p-value of .000. The total sample hits are 48.
Looking at the percentage improvement over chance for the separate groups, we see that it is the same for both groups (91.67%). The maximum chance criterion likewise shows that the model beats chance for the total sample, since the percentage improvement is positive (90.00%).
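The separate-group and total-sample z values in the output can be reproduced from the hits, group sizes, and chance proportions (here 0.6 and 0.4, the groups' shares of the sample, and 0.52 = 26/50 for the total). A minimal sketch:

```python
import math

def chance_z(hits, n, p_chance):
    # z test comparing observed hits against the chance expectation n * p_chance,
    # using the binomial standard deviation sqrt(n * p * (1 - p)).
    expected = n * p_chance
    return (hits - expected) / math.sqrt(n * p_chance * (1 - p_chance))
```

With the cross-validated counts, `chance_z(29, 30, 0.6)` ≈ 4.10, `chance_z(19, 20, 0.4)` ≈ 5.02, and `chance_z(48, 50, 0.52)` ≈ 6.23, reproducing the Z/Prob lines of the output.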
As in multiple linear regression models, we would now like to see the unique contribution of a variable or subset of variables in discriminant analysis: how much a variable or subset of variables adds to the classification accuracy of a model containing the remaining predictors. For this, we use McNemar's test for correlated proportions. McNemar's test is a non-parametric method for nominal variables, applied to 2 × 2 contingency tables with a dichotomous trait and matched pairs of subjects, to determine whether the row and column marginal frequencies are equal. It requires the hit rate for the full and restricted models, along with the joint distribution of hits and misses for each model.
However, to apply the program that assesses the unique contribution of a variable or subset of variables, we have to remove the fourth variable (Offences); otherwise the program does not run. Running the program without the fourth variable gives the following output:
***FULL/RESTRICTED MODEL HIT-RATE TEST***
***JOHN D. MORRIS***
***FLORIDA ATLANTIC UNIVERSTIY***
DATA FILE: FPrison1.dat
FORMAT = (6f8.0)
NUMBER OF FULL MODEL PREDICTOR VARIABLES = 5
RESTRICTED MODEL PREDICTOR VARIABLES ARE:
2 3
GROUP INDICES = 0 1
PRIOR PROBABILITIES USED: 0.500 0.500
N = 50
MEANS FOR GROUP 0 (N = 30)
1) 42.4000 2) 0.3333 3) 0.7333 4) 17.4667
5) 0.3667
MEANS FOR GROUP 1 (N = 20)
1) 24.1000 2) 0.6500 3) 0.1500 4) 8.7500
5) 2.2500
CHI-SQUARED FROM BOX-M (DF= 15.00) = 2.6884
P = 0.0007
CALIBRATION SAMPLE ESTIMATES
PROPORTION OF HITS
TOTAL 0 1
FULL MODEL 0.940 0.967 0.900
RESTRICTED MODEL 0.780 0.733 0.850
HITS/MISSES MATRIX FOR GROUP 0
FULL MODEL
MISSES HITS
RESTRICTED HITS 1. 21.
MODEL MISSES 0. 8.
MCNEMAR'S Z = 2.3333
HITS/MISSES MATRIX FOR GROUP 1
FULL MODEL
MISSES HITS
RESTRICTED HITS 0. 17.
MODEL MISSES 2. 1.
MCNEMAR'S Z = 1.0000
HITS/MISSES MATRIX FOR TOTAL SAMPLE
FULL MODEL
MISSES HITS
RESTRICTED HITS 1. 38.
MODEL MISSES 2. 9.
MCNEMAR'S Z = 2.5298
-------------------------------------------------------
LEAVE ONE OUT ESTIMATES
PROPORTION OF HITS
TOTAL 0 1
FULL MODEL 0.900 0.967 0.800
RESTRICTED MODEL 0.780 0.733 0.850
HITS/MISSES MATRIX FOR GROUP 0
FULL MODEL
MISSES HITS
RESTRICTED HITS 1. 21.
MODEL MISSES 0. 8.
MCNEMAR'S Z = 2.3333
HITS/MISSES MATRIX FOR GROUP 1
FULL MODEL
MISSES HITS
RESTRICTED HITS 2. 15.
MODEL MISSES 2. 1.
MCNEMAR'S Z = -0.5774
HITS/MISSES MATRIX FOR TOTAL SAMPLE
FULL MODEL
MISSES HITS
RESTRICTED HITS 3. 36.
MODEL MISSES 2. 9.
MCNEMAR'S Z = 1.7321
McNemar's test uses the z-value to determine whether classification accuracy is greater for the full model than for the restricted model. Since the interest is only in a one-tailed test, we can use a critical z-value of 1.65: if the z obtained in the output exceeds the critical value, we may declare a significant difference in the accuracy of the models (reject the null); otherwise we cannot. So, in the calibration sample estimates, we reject the null hypothesis for group 0 and for the total sample, since their z-values exceed the critical value, and fail to reject it for group 1. The leave-one-out estimates give the same result.
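The z values in the output come from the discordant cells of each 2 × 2 hits/misses table: the cases the full model classifies correctly but the restricted model misses, and vice versa. A minimal sketch:

```python
import math

def mcnemar_z(full_only_hits, restricted_only_hits):
    # McNemar's z for correlated proportions: (b - c) / sqrt(b + c),
    # where b and c are the two discordant cell counts.
    return (full_only_hits - restricted_only_hits) / math.sqrt(
        full_only_hits + restricted_only_hits
    )
```

For group 0 in the calibration sample the discordant cells are 8 (full-model-only hits) and 1 (restricted-model-only hits), giving z = 7/3 ≈ 2.3333; the total-sample cells 9 and 1 give z ≈ 2.5298, matching the output.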
Now, we want to look at the cross-validation classification accuracy of subsets of the model. A defensible method of selecting a subset of predictor variables for a prediction model is to examine all 2^p - 1 possible combinations of the p predictor variables. So, we use the data to predict recidivism status from Age, Gender, Job, Offences, Supervision, and Prior prison commitments, and evaluate the cross-validation classification accuracy of every possible subset. The program we use allows the user to specify how many of the top-performing subsets of predictor variables to display, and how many of the top-performing subsets of each size to display. An interesting view of variable importance also emerges when a specific variable consistently appears in the top-performing models. Running the program gives the following output:
***TWO GROUP PREDICTOR VARIABLE SELECTION***
***JOHN D. MORRIS & ALICE MESHBANE***
***FLORIDA ATLANTIC UNIVERSITY***
***QUESTIONS/COMMENTS TO: [email protected]***
FILE NAME = FPrison.dat
FORMAT = (7F8.0)
PREDICTORS = 6
GROUP ONE INDEX = 0
GROUP TWO INDEX = 1
MISSING DATA CODE = -30.
NUMBER OF BEST SUBSETS = 20
NUMBER OF BEST SUBSETS OF EACH SIZE = 5
PRIORS = 0.5000 0.5000
SUBJECTS READ = 50
GROUP ONE (N = 30) MEANS
1) 42.4000 2) 0.3333 3) 0.7333 4) 1.0000
5) 17.4667 6) 0.3667
GROUP TWO (N = 20) MEANS
1) 24.1000 2) 0.6500 3) 0.1500 4) 4.1000
5) 8.7500 6) 2.2500
The 20 subsets with the highest total hit rate
HITS VARIABLES
#1 #2 Total
30 19 49 4 6
30 19 49 3 4 6
30 19 49 1 3 4 6
30 19 49 2 3 4 6
30 19 49 1 2 3 4 6
30 19 49 3 4 5 6
30 19 49 2 3 4 5 6
30 18 48 4 5 6
30 18 48 2 4 5 6
29 19 48 1 4 6
30 18 48 2 4 6
29 19 48 1 2 3 4 5 6
30 18 48 2 3 4
30 18 48 1 2 3 4
29 18 47 1 2 4 6
29 18 47 1 4 5 6
28 19 47 1 5
29 18 47 1 3 6
29 18 47 1 2 4 5 6
30 17 47 2 4 5
The (up to) 5 best subsets of each size
HITS VARIABLES
#1 #2 Total
30 16 46 4
24 18 42 1
27 14 41 6
23 16 39 5
22 17 39 3
30 19 49 4 6
28 19 47 1 5
30 16 46 4 5
30 16 46 2 4
29 16 45 1 4
30 19 49 3 4 6
29 19 48 1 4 6
30 18 48 2 3 4
30 18 48 2 4 6
30 18 48 4 5 6
30 19 49 1 3 4 6
30 19 49 2 3 4 6
30 19 49 3 4 5 6
30 18 48 2 4 5 6
30 18 48 1 2 3 4
30 19 49 2 3 4 5 6
30 19 49 1 2 3 4 6
30 17 47 1 2 3 4 5
29 18 47 1 2 4 5 6
29 18 47 1 3 4 5 6
29 19 48 1 2 3 4 5 6
We can see that, out of all 20 listed subsets, some have better cross-validation classification accuracy than the full 6-variable model and some have worse. The two-variable model using variables 4 and 6 achieves the best accuracy, better than models with more variables. So, many subsets of variables perform a little better than the original model, but the original model performs well too (its hit count of 48 is only one below the best of 49). Moreover, several of the subset models have the same accuracy as the original model, so using only some of the variables can yield the same accuracy as the full model, depending on which variables are available for a particular case.
Another important aspect of this output is the individual contribution of each variable to the model. No single variable on its own matches the accuracy of the full model; the best single-variable subset (variable 4) achieves 46 hits. Variables 4 and 6 together, however, contribute the most, yielding a higher hit rate than the original model of 6 predictor variables.
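The exhaustive search over subsets is easy to enumerate; a sketch of the 2^p − 1 enumeration (the scoring of each subset, a leave-one-out hit count in the actual program, is left abstract here):

```python
from itertools import combinations

def all_subsets(predictors):
    """Yield every non-empty subset of the predictor indices (2**p - 1 total)."""
    for size in range(1, len(predictors) + 1):
        for combo in combinations(predictors, size):
            yield combo

subsets = list(all_subsets([1, 2, 3, 4, 5, 6]))
print(len(subsets))        # 2**6 - 1 = 63 candidate models
print((4, 6) in subsets)   # the best-performing pair is among them
```

Each of the 63 candidate models would then be scored by its cross-validated hit rate, as in the table above.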
Logistic regression is a type of regression analysis used for predicting the outcome of a categorical (a
variable that can take on a limited number of categories) criterion variable based on one or more
predictor variables. The probabilities describing the possible outcome of a single trial are modeled, as
function of explanatory variables, using a logistic function. Logistic regression, in this case, is similar to
discriminant analysis. However, Logistic Regression has several advantages over Discriminant Analysis: it is more robust, it does not assume a linear relationship between the independent variables and the dependent variable, the dependent variable need not be normally distributed, and there is no homogeneity-of-variance assumption. However, these advantages come at a cost, because Logistic Regression requires much more data to achieve stable and meaningful results.
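The logistic function mentioned above maps any linear combination of the predictors into a probability between 0 and 1. A minimal sketch (the coefficients in the example call are hypothetical, not the fitted values):

```python
from math import exp

def logistic(z):
    """Logistic (sigmoid) function: maps the linear predictor to (0, 1)."""
    return 1.0 / (1.0 + exp(-z))

def prob_recidivist(intercept, coefs, x):
    """P(recidivist) for one case; coefs and x are parallel sequences."""
    z = intercept + sum(b * v for b, v in zip(coefs, x))
    return logistic(z)

# Hypothetical coefficients for (Age, Job): older age and having a job
# both lower the modeled probability of recidivism.
p = prob_recidivist(2.0, [-0.1, -1.5], [30, 1])  # a value in (0, 1)
```

Fitting consists of choosing the intercept and coefficients that maximize the likelihood of the observed 0/1 outcomes.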
Now, we use the same data as in the previous discriminant analysis labs to fit a similar model using the logistic regression technique. Our response variable is the binary variable representing the recidivism status of each person, with 0 for non-recidivist and 1 for recidivist. After running logistic regression in SPSS, we get the following output:
Classification Table(a)

                                            Predicted
                                   Recidivism (1=recidivist,   Percentage
Observed                           0=non-recidivist)           Correct
                                        0          1
Step 1  Recidivism           0         30          0            100.0
                             1          0         20            100.0
        Overall Percentage                                      100.0
a. The cut value is .500
Variables in the Equation
B S.E. Wald df Sig. Exp(B)
Step 1a
Age .061 400.481 .000 1 1.000 1.063
Gender(1) -17.483 12446.184 .000 1 .999 .000
Job(1) -4.857 16612.197 .000 1 1.000 .008
Offences 20.380 6534.399 .000 1 .998 709347781.862
Supervision -1.506 1398.819 .000 1 .999 .222
Prior 10.485 5234.580 .000 1 .998 35786.351
Constant -21.563 30599.392 .000 1 .999 .000
a. Variable(s) entered on step 1: Age, Gender, Job, Offences, Supervision, Prior.
Now, looking at the classification table, we can see that the predicted values were 100% correct, compared to the 96% cross-validated accuracy of the discriminant analysis model. So, in this case, logistic regression appears to classify better than discriminant analysis. However, looking at the second table, we can see that none of the variables is significant; the enormous standard errors alongside perfect classification suggest the two groups are completely separated, which makes the coefficient estimates unstable. So, discriminant analysis might be better than Logistic Regression here.
In conclusion, the overall model predicts recidivism well using the six predictor variables of Age, Gender, Job, Offences, Supervision and Prior. As indicated by the output above, the relationship between Age and Recidivism is negative: as age increases, the likelihood of recidivism decreases. Gender shows that men are more prone to recidivism than women. Further, a prisoner who gets a job after being released from prison is more likely not to be a recidivist. Also, the higher the number of offences, the more likely the person is to be a recidivist. The fifth variable, supervision, indicates that the longer the supervision period, the lower the chance of being a recidivist. Lastly, the more prior prison commitments a prisoner has, the more likely he/she is to be a recidivist. To sum up, we used discriminant analysis to predict recidivism status and to assess how well the model fits.
PART 2 (HOW LONG DOES A RECIDIVIST STAY OUT OF PRISON)
We have seen from the output above how to predict whether a future prisoner returns to prison, i.e. the recidivism status of the person, from the 6 predictor variables, namely Age, Gender, Job status, Offences, Supervision and Prior prison commitments. We would now like to look, for the recidivists only, at how long the prisoner stays out of prison before being caught again. So, we would like to see how the Recidivism Period is related to the same set of variables (same observations): Age of release, Gender, Job status after release, Number of offences/thefts in the criminal history, Supervision term, and Prior prison commitments. To determine this period, we run a multiple linear regression model:
Descriptive Statistics

                                                      Mean   Std. Deviation   N
Recidivism Period in months (time after
release until caught again)                          11.95        5.753       20
Age of release                                       24.10        6.889       20
Gender (0=female, 1=male)                              .65         .489       20
Job Status after release (0=No, 1=Yes)                 .15         .366       20
Number of thefts/offenses in criminal history         4.10        1.651       20
Supervision term after release (weeks)                8.75        5.169       20
Number of prior prison commitments                    2.25        1.410       20
As described before, the descriptive statistics display the mean and standard deviation of all the
variables in the multiple linear regression model. The mean number of months for the recidivism period
is 11.95 for 20 recidivists. Similarly, the first column displays all the means and the second column
represents the standard deviation. The third column represents the number of observations i.e. 20.
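These summary figures are simple to reproduce; a sketch with a small hypothetical sample (SPSS reports the sample standard deviation, i.e. the n − 1 denominator):

```python
import statistics

# Hypothetical recidivism periods in months; the actual file has n = 20 cases.
periods = [5, 8, 12, 15, 20]

mean = statistics.mean(periods)   # 12.0
sd = statistics.stdev(periods)    # sample SD (n - 1 denominator)
n = len(periods)
print(mean, round(sd, 3), n)
```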
Correlations (N = 20 for every pair)

Pearson Correlation
                     Period    Age   Gender    Job   Offenses   Superv.   Prior
Recidivism Period     1.000   .902    -.212   .378      -.276      .442   -.180
Age of release         .902  1.000    -.270   .390      -.339      .366   -.192
Gender                -.212  -.270    1.000   .015      -.150     -.140    .591
Job Status             .378   .390     .015  1.000      -.113      .132   -.178
Thefts/offenses       -.276  -.339    -.150  -.113      1.000     -.077   -.102
Supervision term       .442   .366    -.140   .132      -.077     1.000   -.034
Prior commitments     -.180  -.192     .591  -.178      -.102     -.034   1.000

Sig. (1-tailed)
                     Period    Age   Gender    Job   Offenses   Superv.   Prior
Recidivism Period         .   .000     .185   .050       .119      .026    .224
Age of release         .000      .     .125   .045       .072      .056    .208
Gender                 .185   .125        .   .476       .264      .277    .003
Job Status             .050   .045     .476      .       .317      .290    .226
Thefts/offenses        .119   .072     .264   .317          .      .373    .335
Supervision term       .026   .056     .277   .290       .373         .    .443
Prior commitments      .224   .208     .003   .226       .335      .443       .

Variable labels: Period = Recidivism Period in months (time after release until caught again); Gender (0=female, 1=male); Job Status after release (0=No, 1=Yes); Thefts/offenses = Number of thefts/offenses in criminal history; Supervision term after release (weeks); Prior = Number of prior prison commitments.
The correlation table above shows the correlation of each variable with every other variable in the model. However, we are primarily interested in the relation of the response variable to the 6 predictor variables. Recidivism period is highly positively correlated with Age (.902), which implies that as age increases, the recidivism period increases; i.e. younger prisoners do not stay out of prison for long. As the number of thefts/offenses increases, the recidivism period decreases, i.e. the more offences, the shorter the time the criminal stays out. As the supervision term increases, the recidivism period increases, i.e. the prisoner stays out of prison longer. As the number of prior prison commitments increases, the recidivism period decreases.
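Each entry in the Pearson correlation matrix above is the product-moment correlation of one pair of columns. A minimal sketch of the computation (the sample values here are hypothetical):

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical ages and recidivism periods: strongly positive, close to 1,
# mirroring the .902 correlation in the table above.
r = pearson_r([20, 25, 30, 40], [6, 10, 13, 22])
```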
Model Summary

Model     R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .912a      .832           .755                     2.850
a. Predictors: (Constant), Number of prior prison commitments, Supervision
term after release (weeks), Number of thefts/offenses in criminal history,
Job Status after release (0=No, 1=Yes), Age of release, Gender (0=female, 1=male)
The above table shows the model summary. The R-squared value of approximately 83% shows that about 83% of the variance in the response variable is explained by the 6 predictor variables. This high R-squared suggests the predictors explain the response well.
ANOVAa
Model Sum of Squares df Mean Square F Sig.
1
Regression 523.358 6 87.226 10.739 .000b
Residual 105.592 13 8.122
Total 628.950 19
a. Dependent Variable: Recidivism Period in months(Time Period after release and again caught)
b. Predictors: (Constant), Number of prior prison commitments, Supervision term after release
(weeks), Number of thefts/offenses in criminal history, Job Status after release (0=No, 1 =
Yes), Age of release, Gender (0=female, 1=male)
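The model-summary figures can be recovered directly from the ANOVA sums of squares. A quick check in Python, using the values from the table above (n = 20 cases, p = 6 predictors):

```python
# Sums of squares read from the ANOVA table above
ss_regression = 523.358
ss_residual = 105.592
ss_total = ss_regression + ss_residual   # 628.950
n, p = 20, 6

r_squared = ss_regression / ss_total
adj_r_squared = 1 - (ss_residual / (n - p - 1)) / (ss_total / (n - 1))
f_stat = (ss_regression / p) / (ss_residual / (n - p - 1))

print(round(r_squared, 3), round(adj_r_squared, 3), round(f_stat, 3))
# 0.832 0.755 10.739 -- matching the Model Summary and ANOVA tables
```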
Coefficients(a)

                                  Unstandardized       Standardized
                                  Coefficients         Coefficients
Model                             B       Std. Error      Beta        t      Sig.   Tolerance    VIF
1  (Constant)                   -7.576      4.367                   -1.735   .106
   Age of release                 .727       .123          .870      5.891   .000     .592      1.690
   Gender (0=female, 1=male)      .862      1.788          .073       .482   .638     .559      1.790
   Job Status after release
   (0=No, 1=Yes)                  .259      2.016          .016       .128   .900     .784      1.276
   Number of thefts/offenses
   in criminal history            .128       .438          .037       .293   .774     .817      1.223
   Supervision term after
   release (weeks)                .148       .137          .133      1.084   .298     .857      1.167
   Number of prior prison
   commitments                   -.183       .595         -.045      -.307   .764     .609      1.643
a. Dependent Variable: Recidivism Period in months (time after release until caught again)

The above table shows the coefficients of each of the variables involved in the regression. So, we get the following model:
Recidivism_Period = -7.576 + .727 Age + .862 Gender + .259 Job + .128 Offences + .148 Supervision - .183 Prior
Each partial slope gives the change in the predicted recidivism period for a one-unit change in that predictor, holding all the other variables constant. So, as age increases by one year, the predicted recidivism period increases by .727 months. If the person is female (gender = 0), the gender term contributes nothing; if the person is male, the predicted recidivism period increases by .862 months. Similarly, we can interpret the results for the other variables, holding all the other variables constant.
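Plugging values into the fitted equation gives a point estimate of the recidivism period. A sketch (the case values in the example are hypothetical):

```python
# Fitted multiple regression equation from the coefficients table above
def predicted_period(age, gender, job, offences, supervision, prior):
    """Predicted recidivism period in months from the fitted equation."""
    return (-7.576 + 0.727 * age + 0.862 * gender + 0.259 * job
            + 0.128 * offences + 0.148 * supervision - 0.183 * prior)

# e.g. a 25-year-old male, no job, 4 prior offences, 10 weeks of
# supervision, 2 prior commitments:
print(round(predicted_period(25, 1, 0, 4, 10, 2), 3))  # 13.087 months
```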
Even though the R-squared is quite high, the significance levels show that, apart from Age, all the other variables are insignificant. Still, we can use the model to estimate the recidivism time given all these variables of interest.