Analyses Involving Categorical Dependent Variables

When Dependent Variables are Categorical

Chi-square analysis is frequently used.

Example Question: Is there a difference in likelihood of death in an ATV accident between persons wearing helmets and those without helmets?

Dependent variable is Death: No (0) vs. Yes (1).

Crosstabs

[DataSet0]

So, based on this analysis, there is no significant difference in likelihood of dying between ATV accident victims wearing helmets and those without helmets.
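As a concrete illustration, here is how the same 2x2 chi-square test looks in Python. This is a sketch, not part of the handout, and the cell counts below are hypothetical because the handout's crosstab output is not reproduced in this transcript.

from scipy.stats import chi2_contingency

# Hypothetical 2x2 table. Rows: helmet (no, yes); columns: death (no = 0, yes = 1).
table = [[380, 20],
         [390, 10]]

# correction=False gives the uncorrected Pearson chi-square for a 2x2 table.
chi2, p, df, expected = chi2_contingency(table, correction=False)
print(f"Pearson chi-square = {chi2:.3f}, df = {df}, p = {p:.3f}")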


Comments on Chi-square analyses

What’s good?

1. The analysis is appropriate. It hasn’t been supplanted by something else.

2. The results are usually easy to communicate, especially to lay audiences.

3. A DV with a few more than 2 categories can be easily analyzed.

4. An IV with only a few more than 2 categories can be easily analyzed.

What’s bad?

1. Incorporating more than one independent variable is awkward, requiring multiple tables.

2. Certain tests, such as tests of interactions, can’t be performed when you have more than one IV.

3. Chi-square analyses can’t be done when you have continuous IVs unless you categorize the continuous IVs, and categorizing goes against the recommendation NOT to categorize continuous variables, because categorization loses power.

Alternatives to the Chi-square test. We’ll focus on Dichotomous (two-valued) DVs.

1. Techniques based on linear regression

a. Multiple Linear Regression. Regress the dichotomous DV onto the mix of IVs.

b. Discriminant Analysis (equivalent to MR when DV is dichotomous)

Problems with regression-based methods, when the dependent variable is dichotomous and the independent variable is continuous.

1. Assumption is that the underlying relationship between Y and X is linear. But when Y has only two values, how can that be?

2. Y-hats when Y is continuous are typically realizable values of Y. But when Y has only two values, most of the Y-hats will be values that are not either of those two values. In that case, what are they?

3. Linear techniques assume that variability about the regression line is homogeneous across possible values of X. But when Y has only two values, residual variability will vary as X varies, a violation of the homogeneity assumption.

4. Residuals will probably not be normally distributed.

5. Regression line will extend beyond the more negative of the two Y values in the negative direction and beyond the more positive value in the positive direction.

2. Logistic Regression

3. Probit analysis

Logistic Regression and Probit analysis are very similar. Almost everyone uses Logistic. We’ll focus on it.


The Logistic Regression Equation

Without restricting the interpretation, assume that the dependent variable, Y, takes on two values, 0 or 1.

When you have a two-valued DV it is convenient to think of Y-hat as the likelihood or probability that one of the values will occur. We’ll use that conceptualization in what follows and view Y-hat as the probability that Y will equal 1.

The equation will be presented as an equation for the probability that Y = 1, written simply as P(Y=1). So we’re conceptualizing Y-hat as the probability that Y is 1.

The equation for simple Logistic Regression (analogous to Predicted Y = B0 + B1*X in linear regression)

P(Y=1) = e^(B0 + B1*X) / (1 + e^(B0 + B1*X)) = 1 / (1 + e^-(B0 + B1*X))

The logistic regression equation defines an S-shaped (Ogive) curve, that rises from 0 to 1. P(Y=1) is never negative and never larger than 1.
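A minimal numerical sketch in Python (not part of the original handout) makes the boundedness easy to verify: whatever B0, B1, and X you choose, the function below never leaves the interval (0, 1).

import numpy as np

def p_y1(x, b0, b1):
    # P(Y=1) = 1 / (1 + e^-(B0 + B1*X))
    return 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))

x = np.linspace(-3, 3, 7)
print(p_y1(x, b0=0.0, b1=1.0))  # rises from about .05 to about .95, always strictly between 0 and 1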

The curve of the equation . . .

B0: B0 is analogous to the linear regression “constant”, i.e., intercept parameter. B0 defines the "height" of the curve. B0 is an elevation parameter. Also called a difficulty parameter in some applications.

[Figure: Logistic curves, P(Y) vs. X for X from -3.00 to 3.00. Prob1: B0 = 0; Prob2: B0 = 1; Prob3: B0 = 2, with B1 held constant. Increasing B0 raises the curve.]


B1: B1 is analogous to the slope of the linear regression line. B1 defines the “steepness” of the curve. It is sometimes called a discrimination parameter.

The larger the value of B1, the “steeper” the curve, the more quickly it goes from 0 to 1.

[Figure: Logistic curves, P(Y=1) vs. X for X from -3.00 to 3.00. Prob4: B1 = 1; Prob5: B1 = 2; Prob6: B1 = 3, with B0 held constant. Larger B1 makes the curve steeper.]

Note that there is a MAJOR difference between the linear regression and logistic regression curves - - -

The logistic regression lines asymptote at 0 and 1. They’re bounded by 0 and 1.

But the linear regression lines extend below 0 on the left and above 1 on the right.

If we interpret P(Y) as a probability, the linear regression curves cannot literally represent P(Y) except for a limited range of X values.


Example

P(Y) = .09090909. Odds of Y = .09090909/.90909091 = .1. Y is 1/10th as likely to occur as to not occur.

P(Y) = .50. Odds of Y = .5/.5 = 1. Y is as likely to occur as to not occur.

P(Y) = .8. Odds of Y = .8/.2 = 4. Y is 4 times more likely to occur than to not occur.

P(Y) = .99. Odds of Y = .99/.01 = 99. Y is 99 times more likely to occur than to not occur.


So logistic regression is logistic in probability but linear in log odds.
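To see why, divide P(Y=1) by 1 - P(Y=1) using the logistic equation given earlier; this one-line derivation is added here for completeness:

$$\frac{P(Y=1)}{1-P(Y=1)} \;=\; \frac{e^{B_0+B_1X}/(1+e^{B_0+B_1X})}{1/(1+e^{B_0+B_1X})} \;=\; e^{B_0+B_1X}, \qquad \ln\!\left[\frac{P(Y=1)}{1-P(Y=1)}\right] \;=\; B_0 + B_1X$$

So the odds are an exponential function of X, and the log odds (the logit) is a straight line with intercept B0 and slope B1.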


Why we must fit ogival-shaped curves – the curse of categorization

Here’s a perfectly nice linear relationship between score values, from a recent study. This relationship is of ACT Comp scores to Wonderlic scores.

[DataSet3] G:\MdbR\0DataFiles\BalancedScale_110706.sav

Here’s the relationship when ACT Comp has been dichotomized at 23, into Low vs. High.

When proportions of High scores are plotted vs. WPT value, we get the following.

So, to fit the above curve, we need a model that is ogival. This is where the logistic regression function comes into play.


This means that even if the “underlying” true values are linearly related, the dichotomized values that we may have to work with may not be linearly related to the independent variable.


Crosstabs and Logistic Regression applied to the same 2x2 situation

The FFROSH data.

The data here are from a study of the effect of the Freshman Seminar course on 1st semester GPA and on retention. It involved students from 1987-1992. The data were gathered to investigate the effectiveness of having the freshman seminar course as a requirement for all students. There were two main criteria, i.e., dependent variables – first semester GPA excluding the seminar course and whether a student continued into the 2nd semester.

The dependent variable in this analysis is whether or not a student moved directly into the 2nd semester in the spring following his/her 1st fall semester. It is called RETAINED and is equal to 1 for students who retained to the immediately following spring semester and 0 for those who did not.

The analysis reported here was a serendipitous finding regarding the time at which students register for school. It has been my experience that those students who wait until the last minute to register for school perform more poorly on the average than do students who register earlier. This analysis looked at whether this informal observation could be extended to the likelihood of retention to the 2nd semester.

After examining the distribution of the times students registered prior to the first day of class we decided to compute a dichotomous variable representing the time prior to the 1st day of class that a student registered for classes. The variable was called EARLIREG – for EARLY REGistration. It had the value 1 for all students who registered 150 or more days prior to the first day of class and the value 0 for students who registered within 150 days of the 1st day. (The 150 day value was chosen after inspection of the 1st semester GPA data.)

So the analysis that follows examines the relationship of RETAINED to EARLIREG, retention to the 2nd semester to early registration.

The analyses will be performed using CROSSTABS and using LOGISTIC REGRESSION.

First, univariate analyses . . .

GET FILE='E:\MdbR\FFROSH\Ffroshnm.sav'.

Fre var=retained earlireg.

retained

                 Frequency   Percent   Valid Percent   Cumulative Percent
Valid   .00           552      11.6            11.6                 11.6
        1.00         4201      88.4            88.4                100.0
        Total        4753     100.0           100.0

earlireg

                 Frequency   Percent   Valid Percent   Cumulative Percent
Valid   .00          2316      48.7            48.7                 48.7
        1.00         2437      51.3            51.3                100.0
        Total        4753     100.0           100.0


crosstabs retained by earlireg /cells=cou col /sta=chisq.

Crosstabs

Case Processing Summary

                                           Cases
                          Valid            Missing            Total
                          N      Percent   N       Percent    N      Percent
RETAINED * EARLIREG       4753   100.0%    0       .0%        4753   100.0%

RETAINED * EARLIREG Crosstabulation

                                           EARLIREG
                                      .00        1.00       Total
RETAINED   .00    Count                367         185         552
                  % within EARLIREG  15.8%        7.6%       11.6%
           1.00   Count               1949        2252        4201
                  % within EARLIREG  84.2%       92.4%       88.4%
Total             Count               2316        2437        4753
                  % within EARLIREG 100.0%      100.0%      100.0%

Chi-Square Tests

                               Value     df   Asymp. Sig. (2-sided)   Exact Sig. (2-sided)   Exact Sig. (1-sided)
Pearson Chi-Square            78.832b     1        .000
Continuity Correction a       78.030      1        .000
Likelihood Ratio              79.937      1        .000
Fisher's Exact Test                                                        .000                   .000
Linear-by-Linear Association  78.815      1        .000
N of Valid Cases                4753

a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 268.97.


So, 92.4% of those who registered early sustained, compared to 84.2% of those who registered late.
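As a quick cross-check (mine, not part of the handout), the Pearson chi-square can be reproduced in Python directly from the four cell counts in the crosstab above:

from scipy.stats import chi2_contingency

# Rows: RETAINED (.00, 1.00); columns: EARLIREG (.00, 1.00)
table = [[367, 185],
         [1949, 2252]]

chi2, p, df, expected = chi2_contingency(table, correction=False)
print(f"Pearson chi-square = {chi2:.3f}, df = {df}, p = {p:.4f}")
# -> 78.832; with correction=True you get 78.030, the continuity-corrected value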


The same analysis using Logistic Regression

Analyze -> Regression -> Binary Logistic

logistic regression retained WITH earlireg.

Logistic Regression

Case Processing Summary

Unweighted Cases a                          N      Percent
Selected Cases    Included in Analysis      4753    100.0
                  Missing Cases                0       .0
                  Total                     4753    100.0
Unselected Cases                               0       .0
Total                                       4753    100.0

a. If weight is in effect, see classification table for the total number of cases.

Dependent Variable Encoding

Original Value   Internal Value
.00                    0
1.00                   1

The Logistic Regression procedure fits the logistic regression model to the data. It estimates the parameters of the logistic regression equation.

That equation is P(Y) = 1 / (1 + e^-(B0 + B1X))

It performs the estimation in two stages. The first stage estimates only B0. So the model fit to the data in the first stage is simply

P(Y) = 1 / (1 + e^-B0)

SPSS labels the various stages of the estimation procedure “Blocks”. In Block 0, a model with only B0 is estimated.


The Dependent Variable Encoding display above is a valuable check to make sure that your “1” is the same as the Logistic Regression procedure’s “1”.


Block 0: Beginning Block (estimating only B0)

Classification Table a,b

                                     Predicted
                                RETAINED            Percentage
Observed                        .00      1.00        Correct
Step 0   RETAINED   .00           0       552           .0
                    1.00          0      4201         100.0
         Overall Percentage                            88.4

a. Constant is included in the model.
b. The cut value is .500

Explanation of the above table:

The program computes Y-hat for each case using the logistic regression formula with the estimate of B0. If Y-hat is <= a predetermined cut value of 0.500, that case is recorded as a predicted 0. If Y-hat is > 0.5, the program records that case as a predicted 1. It then creates the above table of number of actual 1’s and 0’s vs. predicted 1’s and 0’s.

The prediction equation for Block 0 is Y-hat = 1/(1 + e^-2.030). (The value 2.030 is shown in the “Variables in the Equation” table below.) Recall that B1 is not yet in the equation. This means that Y-hat is a constant, equal to .8839 for each case. (I got this by entering the prediction equation into a calculator.) Since Y-hat for each case is > 0.5, all predictions are 1, which is why the above table has only predicted 1’s. Sometimes this table is more useful than it was in this case.

Variables in the Equation

                    B      S.E.     Wald      df   Sig.   Exp(B)
Step 0   Constant   2.030  .045   2009.624     1   .000    7.611

The above “Variables in the Equation” box is the Logistic Regression equivalent of the “Coefficients Box” in regular regression analysis.

The test statistic is not a t statistic, as in regular regression, but the Wald statistic. The Wald statistic is (B/SE)². So (2.030/.045)² = 2,035, which would be 2009.624 if the two coefficients were represented with greater precision.

Exp(B) is the odds ratio: e^2.030. More on it later.
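A quick check of these Block 0 numbers in Python (mine, not SPSS output), using only the printed coefficient and standard error:

import math

b0, se = 2.030, 0.045

yhat = 1 / (1 + math.exp(-b0))   # constant-only predicted probability
wald = (b0 / se) ** 2            # the Wald statistic, (B/SE)^2
expb = math.exp(b0)              # Exp(B)

print(round(yhat, 4), round(wald), round(expb, 3))
# -> 0.8839, 2035, 7.614 (SPSS prints 2009.624 and 7.611 from the unrounded coefficient)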

Variables not in the Equation

                               Score     df   Sig.
Step 0   Variables  EARLIREG   78.832     1   .000
         Overall Statistics    78.832     1   .000

The “Variables not in the Equation” gives information on each independent variable that is not in the equation. Specifically, it tells you whether or not the variable would be “significant” if it were added to the equation. In this case, it’s telling us that EARLIREG would contribute significantly to the equation if it were added to the equation, which is what SPSS does next . . .


Block 1: Method = Enter (Adding B1*X to the equation)

Omnibus Tests of Model Coefficients

                  Chi-square   df   Sig.
Step 1   Step       79.937      1   .000
         Block      79.937      1   .000
         Model      79.937      1   .000

Whew – three chi-square statistics. “Step”: Compared to previous step in a stepwise regression. Ignore for now.

“Block”: Tests the significance of the improvement in fit of the model evaluated in this block vs. the previous block. Note that the chi-square is identical to the Likelihood ratio chi-square printed in the Chi-square Box in the CROSSTABS output.

“Model”: Ignore for now.

Model Summary

          -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
Step 1         3334.212              .017                   .033

The value under “-2 Log likelihood” is a test of how well the model fit the data in an absolute sense. Values closer to 0 represent better fit. But goodness of fit is complicated by sample size. The R Square values are measures analogous to “percent of variance accounted for”. All three measures tell us that there is a lot of variability in proportions of persons retained that is not accounted for by this one-predictor model.

Classification Table a

                                     Predicted
                                RETAINED            Percentage
Observed                        .00      1.00        Correct
Step 1   RETAINED   .00           0       552           .0
                    1.00          0      4201         100.0
         Overall Percentage                            88.4

a. The cut value is .500

The above table is the revised version of the table presented in Block 0.

Note that since X is a dichotomous variable here, there are only two y-hat values. They are

P(Y) = 1 / (1 + e^-(B0 + B1*0)) = .842 (see below)

and

P(Y) = 1 / (1 + e^-(B0 + B1*1)) = .924 (see below)

As we’ll see below, in both cases the y-hat was > .5, so predicted Y in the table was 1 for all cases.


Note that the chi-square value is almost the same as the chi-square value from the CROSSTABS analysis.


Variables in the Equation

                       B      S.E.    Wald      df   Sig.   Exp(B)
Step 1 a   EARLIREG   .830    .095    75.719     1   .000   2.292
           Constant  1.670    .057   861.036     1   .000   5.311

a. Variable(s) entered on step 1: EARLIREG.

The prediction equation is Y-hat = 1 / (1 + e^-(1.670 + .830*EARLIREG)).

Since EARLIREG has only two values, those students who registered early will have predicted RETAINED value of 1/(1+e^-(1.670+.830*1)) = .924. Those who registered late will have predicted RETAINED value of 1/(1+e^-(1.670+.830*0)) = 1/(1+e^-1.670) = .842. Since both predicted values are above .5, all the cases were predicted to be retained in the table on the previous page.

Exp(B) is called the odds ratio. It is the ratio of the odds of Y=1 when X=1 to the odds of Y=1 when X=0. Recall that the odds of 1 are P(Y=1)/(1-P(Y=1)). The odds ratio is

               Odds when X=1     .924/(1-.924)    12.158
Odds ratio = ----------------- = --------------- = -------- = 2.29.
               Odds when X=0     .842/(1-.842)     5.329

So a person who registered early had odds of being retained that were 2.29 times the odds of a person registering late being retained.
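The same arithmetic in Python (a check I added; not SPSS output) shows that this odds ratio is exactly e raised to the EARLIREG coefficient:

import math

p_early, p_late = 0.924, 0.842
odds = lambda p: p / (1 - p)

print(round(odds(p_early) / odds(p_late), 2))   # -> 2.28 (2.29 with unrounded probabilities)
print(round(math.exp(0.830), 3))                # -> 2.293, i.e., Exp(B) for EARLIREG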

Graphical representation of what we’ve just found.

The following is a plot of Y-hat vs. X, that is, the plot of predicted Y vs. X. Since there are only two values of X (0 and 1), the plot has only two points. The curve drawn on the plot is the theoretical relationship of y-hat to other hypothetical values of X over a wide range of X values (ignoring the fact that none of them could occur.) The curve is analogous to the straight line plot in a regular regression analysis.

[Figure: Plot of YHAT vs. X for X from -6.00 to 4.00. The two plotted points are the predicted values at the two observed X values; the logistic curve drawn through them covers a wider, hypothetical range of X.]


The two points are the predicted points for the two possible values of EARLIREG.


Discussion

1. When there is only one dichotomous predictor, the CROSSTABS and LOGISTIC REGRESSION give the same significance results, although each gives different ancillary information.

BUT as mentioned above . . .

2. CROSSTABS cannot be used to analyze relationships in which the X variable is continuous.

3. CROSSTABS can be used in a rudimentary fashion to analyze relationships between a dichotomous Y and 2 or more categorical X’s, but the analysis IS rudimentary and is laborious. No tests of interactions are possible. The analysis involves inspection and comparison of multiple tables.

4. CROSSTABS, of course, cannot be used when there is a mixture of continuous and categorical IV’s.

5. LOGISTIC REGRESSION can be used to analyze all the situations mentioned in 2-4 above.

6. So CROSSTABS should be considered for the very simplest situations involving one categorical predictor. But LOGISTIC REGRESSION is the analytic technique of choice when there are two or more categorical predictors and when there are one or more continuous predictors.


Logistic Regression with one Continuous Independent Variable

The data analyzed here represent the relationship of Pancreatitis Diagnosis to measures of Amylase and Lipase. Both Amylase and Lipase levels are tests that can predict the occurrence of Pancreatitis. Generally, it is believed that the larger the value of either, the greater the likelihood of Pancreatitis.

The objective here is to determine which alone is the better predictor of the condition and to determine if both are needed.

Because the distributions of both predictors were positively skewed, logarithms of the actual Amylase and Lipase values were used for this handout and for some of the following handouts.

This handout illustrates the analysis of the relationship of Pancreatitis diagnosis to only Amylase.

The name of the dependent variable is PANCGRP. It is 1 if the person is diagnosed with Pancreatitis. It is 0 otherwise.

Distributions of logamy and loglip – still somewhat positively skewed even though logarithms were taken.

The logamy and loglip scores are highly positively correlated. For that reason, it may be that once either is in the equation, adding the other won’t significantly increase the fit of the model. We’ll test that hypothesis later.


[Figure: Histogram of logamy (Mean = 2.0267, Std. Dev. = 0.50269, N = 306); histogram of loglip (Mean = 2.3851, Std. Dev. = 0.82634, N = 306); scatterplot of loglip vs. logamy showing the strong positive correlation.]


1. Scatterplots with individual cases.

Relationship of Pancreatitis Diagnosis to log(Amylase)

[Figure: Scatterplot of PANCGRP (0 or 1) vs. LOGAMY, with the linear line of best fit drawn through it.]

This graph represents a primary problem with visualizing results when the dependent variable is a dichotomy. It is difficult to see the relationship that may very well be represented by the data. One can see, however, that when log amylase is low, there are more 0’s (no Pancreatitis) and when log amylase is high there are more 1’s (presence of Pancreatitis).

The line through the scatterplot is the linear line of best fit. It was easy to generate. It represents the relationship of probability of Pancreatitis to log amylase that would be assumed if a linear regression were conducted.

But, the logistic regression analysis assumes that the relationship between probability of Pancreatitis to log amylase is different. The relationship assumed by the logistic regression analysis would be an S-shaped curve, called an ogive.

Below are the same data, this time with the line of best fit generated by the logistic regression analysis through it. While neither line fits the observed individual case points well in the middle, it’s easy to see that the logistic line fits better at small and at large values of log amylase.

[Figure: The same scatterplot of Pancreatitis Diagnosis vs. LOGAMY, this time with the logistic (predicted probability) curve drawn through it.]


This graph is of individual cases.

Y values are 0 or 1. X values are continuous.


2. Grouping cases to show a relationship when the DV is a dichotomy.

The plots above were plots of individual cases. Each point represented the DV value of a case (0 or 1) vs. that person’s IV value (log amylase value). The problem was that the plot didn’t really show the relationship because the DV could take on only two values - 0 and 1.

When the DV is a dichotomy, it may be profitable to form groups of cases with similar IV values and plot the proportion of 1’s within each group vs. the IV value for that group.

To illustrate this, groups were formed for every .2 increase in log amylase. That is, the values 1.4, 1.6, 1.8, 2.0, 2.2, 2.4, 2.6, 2.8, 3.0, 3.2, 3.4, 3.6, and 3.8 were used as group mid points. Each case was assigned to a group based on how close that case’s log amylase value was to the group midpoint. So, for example, all cases between 1.5 and 1.7 were assigned to the 1.6 group.

Syntax: compute logamygp = rnd(logamy,.2).

Then the proportion of 1’s within each group was computed. The figure below is a plot of the proportion of 1’s within each group vs. the groups midpoints. Note that the points form a curve, quite a bit like the ogival form from the logistic regression analysis shown on the previous page.
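Here is a sketch of the same grouping in Python (my illustration with synthetic stand-in data, since only the SPSS syntax appears in the handout). The rounding step mimics SPSS’s rnd(logamy, .2):

import numpy as np
import pandas as pd

# Synthetic stand-in data; the real file has 306 cases of logamy and pancgrp.
rng = np.random.default_rng(0)
logamy = rng.uniform(1.0, 4.0, 300)
pancgrp = (rng.random(300) < 1 / (1 + np.exp(-(-16.02 + 6.90 * logamy)))).astype(int)
df = pd.DataFrame({"logamy": logamy, "pancgrp": pancgrp})

df["logamygp"] = (df["logamy"] / 0.2).round() * 0.2   # round to the nearest multiple of .2
probpanc = df.groupby("logamygp")["pancgrp"].mean()    # proportion of 1's in each group
print(probpanc)   # the proportions rise in an ogival fashion across the group midpoints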

[Figure: Left panel – scatterplot of PANCGRP vs. LOGAMY (individual cases). Right panel – PROBPANC, the proportion of 1’s in each group, plotted vs. the LOGAMYGP group midpoints.]

The plot of proportions above suggests that the S-shaped curve of the logistic regression model may better represent the increase in probability of Pancreatitis than the straight line curve of the linear regression model.

The analyses that follow illustrate the application of both analyses to the data.


Note that the plot of proportion of Pancreatitis diagnoses within groups is not linear. The proportions increase in an ogival (S-shaped) fashion, with asymptotes at 0 and 1.

This, of course, is a violation of the linear relationship which linear regression analysis assumes.


3. Linear Regression analysis of the logamy data, just for old time’s sake.

REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT pancgrp
  /METHOD=ENTER logamy
  /SCATTERPLOT=(*ZRESID ,*ZPRED )
  /RESIDUALS HIST(ZRESID) NORM(ZRESID) .

Regression

Variables Entered/Removed b

Model   Variables Entered   Variables Removed   Method
1       LOGAMY a                  .             Enter

Model Summary b

Model     R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .755a      .570           .568                  .2569

a. Predictors: (Constant), LOGAMY
b. Dependent Variable: PANCGRP

ANOVA b

Model 1        Sum of Squares    df   Mean Square      F       Sig.
Regression         22.230          1      22.230     336.706   .000a
Residual           16.770        254        .066
Total              39.000        255

a. Predictors: (Constant), LOGAMY
b. Dependent Variable: PANCGRP

Coefficients a

               Unstandardized Coefficients   Standardized Coefficients
Model 1           B        Std. Error              Beta                    t        Sig.
(Constant)     -1.043         .069                                      -15.125     .000
LOGAMY           .635         .035                 .755                   18.350     .000

a. Dependent Variable: PANCGRP

Thus, the predicted linear relationship of probability of Pancreatitis to log amylase is

Predicted probability of Pancreatitis = -1.043 + 0.635 * logamy.


The linear relationship of pancdiag to logamy is strong.

But as we'll see, the logistic relationship is even stronger.


The following are the usual linear regression diagnostics.

Casewise Diagnostics a

Case Number   Std. Residual   PANCGRP
54                3.016         1.00
77                3.343         1.00
85                3.419         1.00
97                3.218         1.00

a. Dependent Variable: PANCGRP

Residuals Statistics a

                        Minimum   Maximum    Mean    Std. Deviation    N
Predicted Value          -.1044    1.4256    .1875       .2953        256
Residual                 -.5998     .8786    .0000       .2564        256
Std. Predicted Value      -.989     4.193    .000       1.000         256
Std. Residual            -2.334     3.419    .000        .998         256

a. Dependent Variable: PANCGRP

[Figure: Histogram of the regression standardized residuals (Std. Dev = 1.00, Mean = 0.00, N = 256) and Normal P-P plot of observed vs. expected cumulative probabilities. Dependent Variable: PANCGRP.]


Nothing particularly unusual here.

Or here.

Although there is a clear bend from the expected linear line, this is not particularly diagnostic.

The histogram of residuals is not particularly unusual.


[Figure: Scatterplot of regression standardized residuals vs. regression standardized predicted values. Dependent Variable: PANCGRP.]

Computation of y-hats for the groups.

I had SPSS compute the Y-hat for each of the group mid-points discussed on page 3. I then plotted both the observed group proportion of 1’s that was shown on the previous page and the Y-hat for each group. Of course, the Y-hats are in a linear relationship with log amylase. Note that the solid points don’t really represent the relationship shown by the open symbols. Note also that the solid points extend above 1 and below 0. But the observed proportions are bound by 1 and 0.

compute mrgpyhat = -1.043 + .635*logamyvalue.
execute.
GRAPH
  /SCATTERPLOT(OVERLAY)=logamygp logamygp WITH probpanc mrgpyhat (PAIR)
  /MISSING=LISTWISE .

Graph

[Figure: Overlay scatterplot of PROBPANC (observed group proportions, open symbols) and MRGPYHAT (linear regression predictions, solid points) vs. LOGAMYGP.]


This is an indicator that there is something amiss.

The plot of residuals vs. predicted values is supposed to form a classic zero-correlation scatterplot, with no unusual shape. This is clearly unusual.

Predicted proportion of Pancreatitis diagnoses within groups. Note that predictions extend below 0 and above 1.

Observed proportion of Pancreatitis diagnoses within groups.


4. Logistic Regression Analysis of logamy data

logistic regression pancgrp with logamy.

Logistic Regression

Case Processing Summary

Unweighted Cases a                          N      Percent
Selected Cases    Included in Analysis      256      83.7
                  Missing Cases              50      16.3
                  Total                     306     100.0
Unselected Cases                              0        .0
Total                                       306     100.0

a. If weight is in effect, see classification table for the total number of cases.

Dependent Variable Encoding

Original Value          Internal Value
.00 No Pancreatitis           0
1.00 Pancreatitis             1

SPSS’s Logistic regression procedure always performs the analysis in at least two steps, which it calls Blocks.

Recall the Logistic prediction formula is

P(Y) = 1 / (1 + e^-(B0 + B1X))

In the first block, labeled Block 0, only B0 is entered into the equation. In this B0 only equation, it is assumed that the probability of a 1 is a constant, equal to the overall proportion of 1’s for the whole sample.

Obviously this model will generally be incorrect, since typically, we’ll be working with data in which the probability of a 1 increases as the IV increases.

But this model serves as a useful baseline against which to assess subsequent models, all of which do assume that the probability of a 1 increases as the IV increases.


For each block the Logistic Regression procedure automatically prints a 2x2 table of predicted and observed 1’s and 0’s. For all of these tables, a case is classified as a predicted 1 if its Y-hat (predicted probability) exceeds 0.5. Otherwise it’s classified as a predicted 0. Since only the constant is estimated here, the predicted probability for every case is simply the proportion of 1’s in the sample, which is 48/256 = 0.1875. Since that’s less than 0.5, every case is predicted to be a 0 for this constant-only model.

Block 0: Beginning Block

Classification Table a,b

                                                 Predicted
                                       Pancreatitis Diagnosis (DV)
                                       No Pancreatitis   Pancreatitis   Percentage Correct
Step 0   Pancreatitis   No Pancreatitis      208               0             100.0
         Diagnosis (DV) Pancreatitis          48               0                .0
         Overall Percentage                                                    81.3

a. Constant is included in the model.
b. The cut value is .500

Variables in the Equation

                     B      S.E.    Wald     df   Sig.   Exp(B)
Step 0   Constant  -1.466   .160   83.852     1   .000    .231

The test that is recommended is the Wald test. The p-value of .000 says that the value of B0 is significantly different from 0.

The predicted probability of 1 here is

P(1) = 1 / (1 + e^-(-1.466)) = 1 / (1 + 4.332) = 1 / 5.332 = 0.1875, the observed proportion of 1’s.

Variables not in the Equation

                              Score     df   Sig.
Step 0   Variables  LOGAMY   145.884     1   .000
         Overall Statistics  145.884     1   .000

The “Variables not in the Equation” box says that if log amylase were added to the equation, it would be significant.


A case is classified as a Predicted 0 if the y-hat for that case is less than or equal to .5

A case is classified as a Predicted 1 if the y-hat for that case is larger than .5

In the classification table above, the No Pancreatitis row percentage is the specificity and the Pancreatitis row percentage is the sensitivity.


Block 1: Method = Enter

In this block, log amylase is added to the equation.

Omnibus Tests of Model Coefficients

                  Chi-square   df   Sig.
Step 1   Step      151.643      1   .000
         Block     151.643      1   .000
         Model     151.643      1   .000

Step: The procedure can perform stepwise regression from a set of covariates. The Step chi-square tests the significance of the increase in fit of the current set of covariates vs. those in the previous set.

Block: The significance of the increase in fit of the current model vs. the last Block. We’ll focus on this.

Model: Tests the significance of the increase in fit of the current model vs. the “B0 only” model.

Model Summary

          -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
Step 1         95.436                .447                   .722

In the following classification table, for each case, the predicted probability of 1 is evaluated and compared with 0.5. If that probability is > 0.5, the case is a predicted 1, otherwise it’s a predicted 0.

Classification Table a

                                                 Predicted
                                       Pancreatitis Diagnosis (DV)
                                       No Pancreatitis   Pancreatitis   Percentage Correct
Step 1   Pancreatitis   No Pancreatitis      200               8              96.2
         Diagnosis (DV) Pancreatitis          14              34              70.8
         Overall Percentage                                                   91.4

a. The cut value is .500

Specificity: Proportion of Y=0 cases that the test labels as 0. (Percentage of correct predictions for people who don’t have the disease.)

Sensitivity: Proportion of Y=1 cases that the test labels as 1. (Percentage of correct predictions for people who do have the disease.)
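Computed from the table above (my arithmetic, not SPSS output):

tn, fp = 200, 8    # No Pancreatitis row: predicted No, predicted Yes
fn, tp = 14, 34    # Pancreatitis row: predicted No, predicted Yes

specificity = tn / (tn + fp)   # 200/208
sensitivity = tp / (tp + fn)   # 34/48
print(round(specificity, 3), round(sensitivity, 3))   # -> 0.962 0.708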

Variables in the Equation

                       B       S.E.    Wald     df   Sig.   Exp(B)
Step 1 a   LOGAMY     6.898   1.017   45.972     1   .000   990.114
           Constant -16.020   2.227   51.744     1   .000      .000

a. Variable(s) entered on step 1: LOGAMY.


These are the coefficients for the equation:

y-hat = 1 / (1 + e^-(-16.0203 + 6.8978*LOGAMY))

This “Variables in the Equation” box is analogous to the “Coefficients” box in Regression. (For comparison, the Linear Regression R² was .570.)


5. Computing Predicted proportions for the groups defined on page 3.

To show that the relationship assumed by the logistic regression analysis is a better representation of the relationship than the linear, I computed probability of 1 for each of the group midpoints from page 3. The figure below is a plot of those probabilities and the observed proportion of 1’s vs. the group midpoints. Compare this figure with that on page 6 to see how much better the logistic regression relationship fits the data than does the linear relationship.

compute lrgpyhat = 1/(1+exp(-(-16.0203 + 6.8978*logamygp))).

GRAPH /SCATTERPLOT(OVERLAY)=logamygp logamygp WITH probpanc lrgpyhat (PAIR) /MISSING=LISTWISE .

Graph

[Figure: Overlay scatterplot of PROBPANC (observed group proportions) and LRGPYHAT (logistic regression predictions) vs. LOGAMYGP.]

Compare this graph with the one immediately above. Note that the predicted proportions correspond much more closely to the observed proportions here.


Predicted proportions, most of which coincide precisely with the observed proportions.

Observed proportions. Could it be that there were coding errors for this group?


6. Using residuals to distinguish between logistic and linear regression.

I computed residuals for all cases. Recall that a residual is Y – Y-hat. For these data, Y’s were either 1 or 0. Y-hats are probabilities.

First, I computed Y-hats for all cases, using both the linear equation and the logistic equation.

compute mryhat = -1.043 + .635*logamy.
compute lryhat = 1/(1+exp(-(-16.0203 + 6.8978*logamy))).

Now residuals are computed.

compute mrresid = pancdiag - mryhat.
compute lrresid = pancdiag - lryhat.

frequencies variables = mrresid lrresid /histogram /format=notable.

Frequencies

[Figure: Histogram of MRRESID, the linear regression residuals (Std. Dev = .26, Mean = .00, N = 256), alongside the scatterplot of PANCGRP vs. LOGAMY with the straight line of best fit; residuals are distances above (positive) or below (negative) the line.]


This is the distribution of residuals for the linear multiple regression. It's like the plot on page 3, except these are actual residuals, not Z's of residuals.

Note that there are many large residuals - large negative and large positive.

The residuals above are simply distances of the observed points from the best fitting line, in this case a straight line.



[Figure: Histogram of LRRESID, the logistic regression residuals (Std. Dev = .24, Mean = .00, N = 256), alongside the plot of PANCGRP and the logistic predicted values vs. LOGAMY.]

What these two sets of figures show is that the vast majority of residuals from the logistic regression analysis were virtually 0, while for the linear regression, there were many residuals that were substantially different from 0. So the logistic regression analysis has modeled the Y’s better than the linear regression.


This is the distribution of residuals for the logistic regression.

Note that most of them are virtually 0.

The residuals above are simply distances of the observed points from the best fitting line, in this case a logistic line.

The points which are circled are those with near 0 residuals.


Logistic Regression - Logamy revisited: Focus on the Logistic Regression Output

logistic regression variables = pancgrp with logamy.

Logistic Regression

Case Processing Summary

Unweighted Cases a                          N      Percent
Selected Cases    Included in Analysis      256      83.7
                  Missing Cases              50      16.3
                  Total                     306     100.0
Unselected Cases                              0        .0
Total                                       306     100.0

a. If weight is in effect, see classification table for the total number of cases.

Dependent Variable Encoding

Original Value          Internal Value
.00 No Pancreatitis           0
1.00 Pancreatitis             1


All cases have to have valid values of both the dependent variable and the independent variable to be included in the analysis. Some 50 cases had missing values of either one or the other, leaving only 256 valid cases for the analysis.

Be sure to make certain that your “0” and “1” are the same as logistic regression’s “0” and “1”.


Block 0: Beginning Block

Classification Table a,b

                                                 Predicted
                                       Pancreatitis Diagnosis (DV)
                                       No Pancreatitis   Pancreatitis   Percentage Correct
Step 0   Pancreatitis   No Pancreatitis      208               0             100.0
         Diagnosis (DV) Pancreatitis          48               0                .0
         Overall Percentage                                                    81.3

a. Constant is included in the model.
b. The cut value is .500

Specificity: The ability to identify cases that don’t have the disease.

Sensitivity: The ability to identify cases that do have the disease.

Variables in the Equation

                     B      S.E.    Wald     df   Sig.   Exp(B)
Step 0   Constant  -1.466   .160   83.852     1   .000    .231

The Wald statistic is (B/SE)². It tests the null hypothesis that the coefficient (B0 in this case) is 0 in the population. That null is rejected here.

Variables not in the Equation

                              Score     df   Sig.
Step 0   Variables  LOGAMY   145.884     1   .000
         Overall Statistics  145.884     1   .000


A classification table is produced for each model tested. In this case, the model contained only the constant, B0. (See "Variables in the Equation" on the next page.)

Predicted Y = 1 / (1 + e^-B0)

For these data, B0 = -1.4663, (see below) so

P(Y) = 1 / (1 + e^-(-1.4663)) = 1 / (1 + 4.333) = .1875

Each case for which P(Y) <= .5 is predicted to be 0. Each case for which P(Y) > .5 is predicted to be 1.

Specificity = 208/208. Sensitivity = 0/48.


Block 1: Method = Enter

Omnibus Tests of Model Coefficients

                  Chi-square   df   Sig.
Step 1   Step      151.643      1   .000
         Block     151.643      1   .000
         Model     151.643      1   .000

The Step Chi-Square tests the significance of improvement (or decrement) in fit over the immediately previous model. It is applicable when stepwise entry of independent variables within a block has been specified. It will be printed after each variable is entered or removed. Again, larger is better.

The Block Chi-Square tests the significance of improvement (or decrement) in fit over the model specified in the previous block of independent variables, if there was one. It is only applicable when two or more blocks of independent variables have been specified. Again, larger is better. It's analogous to the F-change statistic in linear regression.

The Model Chi-Square statistic tests the significance of the improvement in fit of the current model over a model containing just the constant, B0. For this chi-square, larger is better. It is analogous to the overall F statistic in linear regression output.

Model Summary

          -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
Step 1         95.436                .447                   .722

-2 Log Likelihood is a goodness of fit measure (0 is best) computed using a particular set of assumptions.

The R Square measures are analogous to R² in regular regression. Each is computed using a different set of assumptions, which accounts for the difference in their values.

Classification Table a

                                                 Predicted
                                       Pancreatitis Diagnosis (DV)
                                       No Pancreatitis   Pancreatitis   Percentage Correct
Step 1   Pancreatitis   No Pancreatitis      200               8              96.2
         Diagnosis (DV) Pancreatitis          14              34              70.8
         Overall Percentage                                                   91.4

a. The cut value is .500

Specificity: Ability to predict cases without the disease.

Sensitivity: Ability to predict cases with the disease.

In this classification table, since every case potentially had a different value of logamy, a unique Y-hat was generated for each case. If Y-hat was <= .5, a prediction of 0 was recorded. If Y-hat was > .5, a prediction of 1 was recorded. Note the increase in % of correct classifications over the "constant only" model above.


Specificity = 200/208. Sensitivity = 34/48.

Each chi-square tests the significance of the increase in your ability to predict the dependent variable.


Variables in the Equation

                       B       S.E.    Wald     df   Sig.   Exp(B)
Step 1 a   LOGAMY     6.898   1.017   45.972     1   .000   990.114
           Constant -16.020   2.227   51.744     1   .000      .000

a. Variable(s) entered on step 1: LOGAMY.


The Wald statistic is (Bi / SE of Bi)². Here, (6.898/1.017)² = 46.0, which matches the printed 45.972 up to rounding.

Exp(B) is the ratio of odds when the independent variable increases by 1. Recall that the odds that Y=1 are P(Y=1)/(1-P(Y=1)); Exp(B) is the odds at X + 1 divided by the odds at X.

When logamy increases by 1, the odds of Pancreatitis are 990.114 times greater.
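Both numbers can be checked in a line or two of Python (my check, not SPSS output):

import math

b, se = 6.898, 1.017
print(round((b / se) ** 2, 1))   # -> 46.0 (45.972 with unrounded values)
print(round(math.exp(b), 1))     # -> about 990.4; SPSS prints 990.114 from the unrounded 6.8978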


Logistic Regression: Two Continuous Predictors

LOGISTIC REGRESSION VAR=pancgrp
  /METHOD=ENTER logamy loglip
  /CLASSPLOT
  /CRITERIA PIN(.05) POUT(.10) ITERATE(20) CUT(.5) .

[DataSet3] G:\MdbT\InClassDatasets\amylip.sav

Logistic Regression

Case Processing Summary

Unweighted Cases a                          N      Percent
Selected Cases    Included in Analysis      256      83.7
                  Missing Cases              50      16.3
                  Total                     306     100.0
Unselected Cases                              0        .0
Total                                       306     100.0

a. If weight is in effect, see classification table for the total number of cases.

Dependent Variable Encoding

Original Value          Internal Value
.00 No Pancreatitis           0
1.00 Pancreatitis             1

Block 0: Beginning Block

Classification Table a,b

                                                 Predicted
                                       Pancreatitis Diagnosis (DV)
                                       No Pancreatitis   Pancreatitis   Percentage Correct
Step 0   Pancreatitis   No Pancreatitis      208               0             100.0
         Diagnosis (DV) Pancreatitis          48               0                .0
         Overall Percentage                                                    81.3

a. Constant is included in the model.
b. The cut value is .500

Variables in the Equation

                     B      S.E.    Wald     df   Sig.   Exp(B)
Step 0   Constant  -1.466   .160   83.852     1   .000    .231

Based on the equation with only the constant, B0.

Variables not in the Equation

                              Score     df   Sig.
Step 0   Variables  LOGAMY   145.884     1   .000
                    LOGLIP   161.169     1   .000
         Overall Statistics  165.256     2   .000


Each p-value tells you whether or not the variable would be significant if entered BY ITSELF. That is, each of the above p-values should be interpreted on the assumption that only 1 of the variables would be entered.


Block 1: Method = Enter

Omnibus Tests of Model Coefficients

                  Chi-square   df   Sig.
Step 1   Step      170.852      2   .000
         Block     170.852      2   .000
         Model     170.852      2   .000

Model Summary

          -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
Step 1         76.228                .487                   .787

Classification Table a

                                                 Predicted
                                       Pancreatitis Diagnosis (DV)
                                       No Pancreatitis   Pancreatitis   Percentage Correct
Step 1   Pancreatitis   No Pancreatitis      204               4              98.1
         Diagnosis (DV) Pancreatitis          10              38              79.2
         Overall Percentage                                                   94.5

a. The cut value is .500

Recall: Specificity is the ability to predict cases who do NOT have the disease.

Sensitivity is the ability to predict cases who do have the disease.

Variables in the Equation

                       B       S.E.    Wald     df   Sig.   Exp(B)
Step 1 a   LOGAMY     2.659   1.418    3.518     1   .061    14.286
           LOGLIP     2.998    .844   12.628     1   .000    20.043
           Constant -14.573   2.251   41.907     1   .000      .000

a. Variable(s) entered on step 1: LOGAMY, LOGLIP.

Interpretation of the coefficients . . .

B: Not easily interpretable on a raw probability scale. Expected increase in log odds for a one-unit increase in IV.

SE: Standard error of the estimate of Bi.

Wald: Test statistic.

Sig: p-value associated with test statistic.

Note that LOGAMY does NOT (officially) add significantly to prediction over and above the prediction afforded by LOGLIP.

Exp(B): Odds ratio for a one-unit increase in IV among persons equal on the other IV.

A person one unit higher on the IV will have Exp(B) times greater odds of having Pancreatitis.

So a person one unit higher on LOGLIP will have 20.04 times greater odds of having Pancreatitis.
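A sketch of the two-predictor prediction equation in Python, using the coefficients above (my illustration; the handout does not include this computation). It also confirms that a one-unit increase in LOGLIP, holding LOGAMY fixed, multiplies the odds by Exp(B) = 20.04:

import math

def p_panc(logamy, loglip):
    z = -14.573 + 2.659 * logamy + 2.998 * loglip
    return 1 / (1 + math.exp(-z))

odds = lambda p: p / (1 - p)

# Hold logamy at 2.5 and raise loglip by one unit
p1, p2 = p_panc(2.5, 2.0), p_panc(2.5, 3.0)
print(round(odds(p2) / odds(p1), 2))   # -> 20.04, i.e., e^2.998 = Exp(B) for LOGLIP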


Specificity: 204/208. Sensitivity: 38/48.

Note that LOGAMY does not officially increase predictability over that afforded by LOGLIP.


Classification Plots – a frequency distribution of predicted probabilities categorized by actual classification

             Step number: 1

             Observed Groups and Predicted Probabilities

      80 ┼                                                                                                    ┼         │N                                                                                                   │         │N                                                                                                   │F        │N                                                                                                   │R     60 ┼N                                                                                                   ┼E        │N                                                                                                   │Q        │N                                                                                                   │U        │NN                                                                                                  │E     40 ┼NN                                                                                                  ┼N        │NN                                                                                                  │C        │NNN                                                                                                 │Y        │NNN                                                                                                 │      20 ┼NNN                                                                                                 ┼         │NNN                                                                                                P│         │NNN NN                                                                                             P│         │NNNNNNNNNNNP        N                                                                     P    PP PP│Predicted ─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼──────────  Prob:   0       .1        .2        .3        .4        .5        .6        .7        .8        .9         1  Group:  NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPP

          Predicted Probability is of Membership for Pancreatitis          The Cut Value is .50          Symbols: N - No Pancreatitis                   P - Pancreatitis          Each Symbol Represents 5 Cases.

One aspect of the above plot is misleading because many cases are not represented in it. Only those cases which happened to be so close to other cases that a group of 5 cases could be formed are represented. So, for example, those relatively few cases whose y-hats were close to .5 are not seen in the above plot, because there were not enough of them to make a group of 5 cases.

Classification Plots using dot plots.

Here’s the same information gotten as dot plots of Y-hats with PANCGRP as a Row Panel Variable.


For the most part, the patients who did not get Pancreatitis had small predicted probabilities while the patients who did get it had high predicted probabilities, as you would expect.

There were, however, a few patients who did get Pancreatitis who had small values of Y-hat. Those patients are dragging down the sensitivity of the test. Note that these patients don’t show up on the CLASSPLOT produced by the LOGISTIC REGRESSION procedure.


Classification Plots using Histograms in EXPLORE

Here’s another equivalent representation of what the authors of the program were trying to show.

[Figure: Histogram of predicted probabilities for pancgrp = No Pancreatitis (Mean = 0.0515, Std. Dev. = 0.1242, N = 208) and for pancgrp = Pancreatitis (Mean = 0.7767, Std. Dev. = 0.3312, N = 48).]


Visualizing the equation with two predictors

With one predictor, a simple scatterplot of YHATs vs. X will show the relationship between Y and X implied by the model.

For two predictor models, a 3-D scatterplot is required. Here’s how the graph below was produced. Graphs -> Interactive -> Scatterplot. . .


The graph shows the general ogival relationship of YHAT on the vertical to LOGLIP and LOGAMY. But the relationships really aren’t apparent until the graph is rotated.

Don’t ask me to demonstrate rotation. SPSS now does not offer the ability to rotate the graph interactively. It used to offer such a capability, but it’s been removed. Shame on SPSS.


The same graph but with Linear Regression Y-hats plotted vs. loglip and logamy.


Representing Relationships with a Table – for the PowerPoint slides

compute logamygp2 = rnd(logamy,.5).    <- Rounds logamy to the nearest .5.

logamygp2

Valid         Frequency   Percent   Valid Percent   Cumulative Percent
  1.50             123      40.2         40.2              40.2
  2.00             105      34.3         34.3              74.5
  2.50              46      15.0         15.0              89.5
  3.00              21       6.9          6.9              96.4
  3.50              10       3.3          3.3              99.7
  4.00               1        .3           .3             100.0
  Total            306     100.0        100.0

compute loglipgp2 = rnd(loglip,.5).

loglipgp2

Valid         Frequency   Percent   Valid Percent   Cumulative Percent
   .50               1        .3           .3                .3
  1.00               6       2.0          2.0               2.3
  1.50              45      14.7         14.7              17.0
  2.00             125      40.8         40.8              57.8
  2.50              49      16.0         16.0              73.9
  3.00              30       9.8          9.8              83.7
  3.50              20       6.5          6.5              90.2
  4.00              20       6.5          6.5              96.7
  4.50               8       2.6          2.6              99.3
  5.00               2        .7           .7             100.0
  Total            306     100.0        100.0

means pancgrp yhatamylip by logamygp2 by loglipgp2.


LOGAMY and LOGLIP groups were created by rounding values of LOGAMY and LOGLIP to the nearest .5; the two frequency tables above show the resulting groupings.

The MEANS command produces a very long two-way table of mean Y-hat values for each combination of LOGAMY group and LOGLIP group (too long to reproduce here).

Below, this table is “prettified”.


The above MEANS output, put into a 2 way table in Word

The entry in each cell is the predicted probability of contracting Pancreatitis at the combination of LOGAMY and LOGLIP represented by the cell.

                                        LOGLIP
LOGAMY      .5      1     1.5      2     2.5      3     3.5      4     4.5
  4.0                                                            .       .
  3.5                                           .99    1.00    1.00    1.00
  3.0                                           .97     .98     .99    1.00
  2.5                    .03    .09     .30     .73     .92     .97    1.00
  2.0                    .01    .04     .14     .47     .85
  1.5      .00    .00    .00    .01     .05       .

This table shows the joint relationship of predicted Y to LOGAMY and LOGLIP. Move from the lower left of the table to the upper right.

It also shows the partial relationships of each.

Partial Relationship of YHAT to LOGLIP – Move across any row.

So, for example, if your logamylase were 2.5, your chances of having pancreatitis would be only .03 if your loglipase were 1.5. But at the same 2.5 value of logamylase, your chances would be .97 if your loglipase value were 4.0.

Partial Relationship of YHAT to LOGAMY – Move up any column.

Empty cells show that there are certain combinations of LOGAMY and LOGLIP that are very unlikely.
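As a check on that worked example, here’s a sketch that reproduces the logamylase-2.5 row of the table, using the variables computed above:

* Restrict the next procedure to the logamylase 2.5 group.
temporary.
select if (logamygp2 = 2.5).
* Mean Y-hat at each loglipase level: should run from .03 up to 1.00.
means tables = yhatamylip by loglipgp2.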


Logistic Regression 3: One Categorical IV with 3 categories

The data here are the FFROSH data – freshmen from 1987-1992.

The dependent variable is RETAINED – whether a student went directly to the 2nd semester.

The independent variable is NRACE – the ethnic group recorded for the student. It has three values: 1 = White, 2 = African-American, 3 = Asian-American.

Indicator coding is dummy coding. Here, Category 1 (White) is used as the reference category.


Recall that ALL independent variables are called covariates in LOGISTIC REGRESSION.

We know that categorical independent variables with 3 or more categories must be represented by group coding variables.

LOGISTIC REGRESSION allows us to do that internally, though you could also build the coding variables yourself, as in the sketch below.
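Here’s a minimal sketch of the by-hand alternative (the variable names dumblack and dumasian are hypothetical):

* Build the two indicator variables, with White (nrace = 1) as the
* reference category.
compute dumblack = (nrace = 2).
compute dumasian = (nrace = 3).
execute.
* This model is equivalent to the /CONTRAST = Indicator(1) version below.
logistic regression retained with dumblack dumasian.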


LOGISTIC REGRESSION retained
  /METHOD = ENTER nrace
  /CONTRAST (nrace) = Indicator(1)
  /CRITERIA = PIN(.05) POUT(.10) ITERATE(20) CUT(.5).

Logistic Regression

Case Processing Summary

Unweighted Cases(a)                          N    Percent
Selected Cases    Included in Analysis    4697       98.8
                  Missing Cases             56        1.2
                  Total                   4753      100.0
Unselected Cases                             0         .0
Total                                     4753      100.0

a. If weight is in effect, see classification table for the total number of cases.

Dependent Variable Encoding

Original Value    Internal Value
.00                    0
1.00                   1

Categorical Variables Codings

                                            Parameter coding
                                Frequency     (1)      (2)
nrace (NUMBERIC WHITE/BLACK/ORIENTAL RACE CODE)
      1.00 WHITE                    3987      .000     .000
      2.00 BLACK                     626     1.000     .000
      3.00 ORIENTAL                   84      .000    1.000

Block 0: Beginning Block

Classification Table(a,b)

                                      Predicted
                               retained            Percentage
        Observed               .00      1.00        Correct
Step 0  retained      .00        0       545           .0
                      1.00       0      4152         100.0
        Overall Percentage                            88.4

a. Constant is included in the model.
b. The cut value is .500

Variables in the Equation

                        B     S.E.       Wald   df   Sig.   Exp(B)
Step 0   Constant    2.031    .046   1986.391    1   .000    7.618

Variables not in the Equation

                                Score   df   Sig.
Step 0   Variables   nrace      6.680    2   .035
                     nrace(1)   2.433    1   .119
                     nrace(2)   3.903    1   .048
         Overall Statistics     6.680    2   .035

SPSS first prints p-value information for the collection of group coding variables representing the categorical factor. Then it prints p-value information for each GCV separately. None of the information about categorical variables in this “Variables not in the Equation” box is too useful.


SPSS’s coding of the independent variable here is important. Note that Whites are the 0,0 group. The first group coding variable compares Blacks with Whites. The 2nd compares Asian-Americans with Whites.

This is the syntax generated by the above menus.


Block 1: Method = Enter

Omnibus Tests of Model Coefficients

                  Chi-square   df   Sig.
Step 1   Step          7.748    2   .021
         Block         7.748    2   .021
         Model         7.748    2   .021

Model Summary

        -2 Log         Cox & Snell   Nagelkerke
Step    likelihood     R Square      R Square
1       3364.160(a)    .002          .003

a. Estimation terminated at iteration number 6 because parameter estimates changed by less than .001.

Classification Table(a)

                                      Predicted
                               retained            Percentage
        Observed               .00      1.00        Correct
Step 1  retained      .00        0       545           .0
                      1.00       0      4152         100.0
        Overall Percentage                            88.4

a. The cut value is .500

Variables in the Equation

                         B     S.E.      Wald   df   Sig.   Exp(B)
Step 1(a)  nrace                        6.368    2   .041
           nrace(1)    .237    .143     2.741    1   .098    1.268
           nrace(2)   1.007    .515     3.829    1   .050    2.737
           Constant   1.989    .049  1669.869    1   .000    7.306

a. Variable(s) entered on step 1: nrace.

So the bottom line is that

1) There are significant differences among the three ethnic groups in likelihood of retention to the 2nd semester (p = .041).

2) Blacks are not significantly more likely than Whites to be retained, although the difference approaches significance (p = .098).

3) Asian-Americans are significantly more likely than Whites to be retained (p = .050).
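To see what the Exp(B) column is telling you, here’s a small sketch that turns the Step 1 coefficients above into odds of retention. The coefficient values are copied from the table; the new variable names are mine.

* Odds of retention for the reference group (Whites).
* exp(1.989) = 7.31, the Exp(B) for the constant.
compute oddswhite = exp(1.989).
* Odds for Asian-Americans: add the nrace(2) coefficient.
* exp(1.989 + 1.007) = 20.0 = 7.31 * 2.737, i.e., the Whites'
* odds multiplied by Exp(B) for nrace(2).
compute oddsasian = exp(1.989 + 1.007).
execute.

So Exp(B) for a group coding variable is the factor by which the reference group’s odds are multiplied.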


Logistic Regression: Three Continuous predictors – FFROSH Data

The data used for this are data on freshmen from 1987-1992.

The dependent variable is RETAINED – whether a student went directly into the 2nd semester or not.

Predictors (covariates in logistic regression) are HSGPA, ACT composite, and overall attempted hours in the first semester, excluding the freshman seminar course.

GET FILE='E:\MdbR\FFROSH\ffrosh.sav'.
logistic regression retained with hsgpa actcomp oatthrs1.

Logistic Regression

Case Processing Summary

Unweighted Cases(a)                          N    Percent
Selected Cases    Included in Analysis    4852      100.0
                  Missing Cases              0         .0
                  Total                   4852      100.0
Unselected Cases                             0         .0
Total                                     4852      100.0

a. If weight is in effect, see classification table for the total number of cases.

Block 0: Beginning Block

Classification Table(a,b)

                                      Predicted
                               RETAINED            Percentage
        Observed               .00      1.00        Correct
Step 0  RETAINED      .00        0       620           .0    <- Specificity
                      1.00       0      4232         100.0   <- Sensitivity
        Overall Percentage                            87.2

a. Constant is included in the model.
b. The cut value is .500

Variables in the Equation

                        B     S.E.       Wald   df   Sig.   Exp(B)
Step 0   Constant    1.921    .043   1994.988    1   .000    6.826

Variables not in the Equation

                                 Score   df   Sig.
Step 0   Variables   HSGPA     225.908    1   .000
                     ACTCOMP    44.653    1   .000
                     OATTHRS1  273.644    1   .000
         Overall Statistics    385.437    3   .000


Dependent Variable Encoding

Original Value    Internal Value
.00                    0
1.00                   1

Recall that the p-values are those that would be obtained if a variable were put BY ITSELF into the equation.
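For example, ACTCOMP’s entry above (score = 44.653, p = .000) corresponds to what you’d see if you entered ACTCOMP by itself; a sketch, assuming the same file is open:

* ACTCOMP as the lone covariate - its Block 0 score test above
* approximates the test from this one-predictor model.
logistic regression retained with actcomp.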


Block 1: Method = Enter

Omnibus Tests of Model Coefficients

                  Chi-square   df   Sig.
Step 1   Step        381.011    3   .000
         Block       381.011    3   .000
         Model       381.011    3   .000

Model Summary

        -2 Log        Cox & Snell   Nagelkerke
Step    likelihood    R Square      R Square
1       3327.365      .076          .141

Classification Table(a)

                                      Predicted
                               RETAINED            Percentage
        Observed               .00      1.00        Correct
Step 1  RETAINED      .00       35       585          5.6    <- Specificity
                      1.00      16      4216         99.6    <- Sensitivity
        Overall Percentage                            87.6

a. The cut value is .500

Variables in the Equation

                          B     S.E.      Wald   df   Sig.   Exp(B)
Step 1(a)  HSGPA       1.077    .101   112.767    1   .000    2.935
           ACTCOMP     -.022    .014     2.637    1   .104     .978
           OATTHRS1     .148    .012   146.487    1   .000    1.160
           Constant   -2.225    .308    52.362    1   .000     .108

a. Variable(s) entered on step 1: HSGPA, ACTCOMP, OATTHRS1.

Note that while ACTCOMP would have been significant by itself without controlling for HSGPA and OATTHRS1, when controlling for those two variables, it’s not significant.

So, the bottom line is that

1) Among persons equal on ACTCOMP and OATTHRS1, those with larger HSGPAs were more likely to go directly into the 2nd semester.

2) Among persons equal on HSGPA and OATTHRS1, there was no significant relationship between ACTCOMP and likelihood of going directly into the 2nd semester. Note that there are other variables that could be controlled for, and this relationship might “become” significant when those variables are controlled. (In the full analysis below, it didn’t.)

3) Among persons equal on HSGPA and ACTCOMP, those who attempted more hours in the first semester were more likely to go directly into the 2nd semester. What does this mean? Perhaps that these were more likely to be full-time students.
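As a sketch of how these coefficients combine, here’s the fitted equation turned into a predicted probability for each student. The coefficients are copied from the Variables in the Equation table above; the names logit1 and prob1 are my own.

* Logit from the fitted three-predictor equation.
compute logit1 = -2.225 + 1.077*hsgpa - .022*actcomp + .148*oatthrs1.
* Logistic transform of the logit: predicted probability of retention.
compute prob1 = 1/(1 + exp(-logit1)).
execute.

These values should match what /SAVE = PRED would produce, up to rounding of the coefficients.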


The FFROSH Full Analysis

From the report to the faculty – output from SPSS for the Macintosh, Version 6.

---------------------- Variables in the Equation -----------------------

Variable         B      S.E.      Wald  df   Sig        R   Exp(B)

AGE          -.0950    .0532    3.1935   1  .0739   -.0180   .9094
NSEX          .2714    .0988    7.5486   1  .0060    .0388  1.3118

After adjusting for differences associated with the other variables, males were more likely to enroll in the second semester.

NRACE1       -.4738    .1578    9.0088   1  .0027   -.0436   .6227

After adjusting for differences associated with the other variables, Whites were less likely to enroll in the second semester.

NRACE2        .1168    .1773     .4342   1  .5099    .0000  1.1239
HSGPA         .8802    .1222   51.8438   1  .0000    .1162  2.4114

After adjusting for differences associated with the other variables, those with higher high school GPA's were more likely to enroll in the second semester.

ACTCOMP      -.0239    .0161    2.1929   1  .1387   -.0072   .9764
OATTHRS1      .1588    .0124  164.4041   1  .0000    .2098  1.1721

After adjusting for differences associated with the other variables, those with higher attempted hours were more likely to enroll in the second semester.

EARLIREG      .2917    .1011    8.3266   1  .0039    .0414  1.3387

After adjusting for differences associated with the other variables, those who registered six months or more before the first day of school were more likely to enroll in the second semester.

NADMSTAT     -.2431    .1226    3.9330   1  .0473   -.0229   .7842
POSTSEM      -.1092    .0675    2.6206   1  .1055   -.0130   .8965
PREYEAR2     -.0461    .0853     .2924   1  .5887    .0000   .9549
PREYEAR3      .1918    .0915    4.3952   1  .0360    .0255  1.2114

After adjusting for differences associated with the other variables, those who enrolled in 1991 were more likely to enroll in the second semester than others enrolled before 1990.

POSYEAR2     -.0845    .0977     .7467   1  .3875    .0000   .9190
POSYEAR3     -.1397    .0998    1.9585   1  .1617    .0000   .8696
HAVEF101      .4828    .1543    9.7876   1  .0018    .0459  1.6206

After adjusting for differences associated with the other variables, those who took the freshman seminar were more likely to enroll in the second semester than those who did not.

Constant     -.1075   1.1949     .0081   1  .9283

Variables in the Equation

                          B     S.E.      Wald   df   Sig.   Exp(B)
Step 1(a)  age         -.099    .053     3.461    1   .063     .905
           nsex         .257    .099     6.726    1   .010    1.294
           nrace                        19.394    2   .000
           nrace(1)    -.944    .487     3.749    1   .053     .389
           nrace(2)    -.337    .504      .446    1   .504     .714
           hsgpa        .852    .123    48.204    1   .000    2.344
           actcomp     -.021    .016     1.676    1   .195     .979
           oatthrs1     .159    .012   163.499    1   .000    1.173
           earlireg     .316    .102     9.640    1   .002    1.372
           admstat(1)   .253    .123     4.222    1   .040    1.288
           postsem     -.115    .068     2.880    1   .090     .891
           y1988       -.048    .086      .306    1   .580     .954
           y1989        .177    .092     3.737    1   .053    1.194
           y1991       -.078    .098      .633    1   .426     .925
           y1992       -.124    .101     1.511    1   .219     .884
           havef101     .967    .152    40.364    1   .000    2.629
           Constant    -.032   1.228      .001    1   .979     .968

a. Variable(s) entered on step 1: age, nsex, nrace, hsgpa, actcomp, oatthrs1, earlireg, admstat, postsem, y1988, y1989, y1991, y1992, havef101.


This is from SPSS V15. There are slight differences in the numbers, due not to changes in the program but to slight differences in the data; I believe some cases were dropped between the time the V6 and V15 analyses were performed. NRACE was also coded differently in the V15 analysis.


The full FFROSH Analysis in Version 15 of SPSS

logistic regression retained with age nsex nrace hsgpa actcomp oatthrs1 earlireg admstat postsem y1988 y1989 y1991 y1992 havef101 /categorical nrace admstat.

Logistic Regression

[DataSet1] G:\MdbR\FFROSH\ffrosh.sav

Case Processing Summary

Unweighted Cases(a)                          N    Percent
Selected Cases    Included in Analysis    4781       98.5
                  Missing Cases             71        1.5
                  Total                   4852      100.0
Unselected Cases                             0         .0
Total                                     4852      100.0

a. If weight is in effect, see classification table for the total number of cases.

Dependent Variable Encoding

Original Value    Internal Value
.00                    0
1.00                   1

Categorical Variables Codings(a)

                                            Parameter coding
                                Frequency     (1)      (2)
nrace (NUMBERIC WHITE/BLACK/ORIENTAL RACE CODE)
      1.00 WHITE                    4060     1.000     .000
      2.00 BLACK                     636      .000    1.000
      3.00 ORIENTAL                   85      .000     .000
admstat (NUMERIC ADMISSTION STATUS CODE)
      AP                            3292     1.000
      CD                            1489      .000

a. This coding results in indicator coefficients.


Block 0: Beginning Block

Classification Table(a,b)

                                      Predicted
                               retained            Percentage
        Observed               .00      1.00        Correct
Step 0  retained      .00        0       610           .0
                      1.00       0      4171         100.0
        Overall Percentage                            87.2

a. Constant is included in the model.
b. The cut value is .500

Variables in the Equation

                        B     S.E.       Wald   df   Sig.   Exp(B)
Step 0   Constant    1.922    .043   1966.810    1   .000    6.838

Variables not in the Equation

                                  Score   df   Sig.
Step 0   Variables   age         27.445    1   .000
                     nsex         3.147    1   .076
                     nrace        7.322    2   .026
                     nrace(1)     5.864    1   .015
                     nrace(2)     3.261    1   .071
                     hsgpa      223.532    1   .000
                     actcomp     46.129    1   .000
                     oatthrs1   273.644    1   .000
                     earlireg    86.855    1   .000
                     admstat(1) 119.994    1   .000
                     postsem     13.295    1   .000
                     y1988        1.049    1   .306
                     y1989       11.532    1   .001
                     y1991         .486    1   .486
                     y1992         .102    1   .750
                     havef101    40.186    1   .000
         Overall Statistics     528.012   15   .000


Block 1: Method = Enter

Omnibus Tests of Model Coefficients

                  Chi-square   df   Sig.
Step 1   Step        494.704   15   .000
         Block       494.704   15   .000
         Model       494.704   15   .000

Model Summary

        -2 Log         Cox & Snell   Nagelkerke
Step    likelihood     R Square      R Square
1       3155.842(a)    .098          .184

a. Estimation terminated at iteration number 6 because parameter estimates changed by less than .001.

Classification Table(a)

                                      Predicted
                               retained            Percentage
        Observed               .00      1.00        Correct
Step 1  retained      .00       79       531         13.0
                      1.00      33      4138         99.2
        Overall Percentage                            88.2

a. The cut value is .500

Variables in the Equation

                          B     S.E.      Wald   df   Sig.   Exp(B)
Step 1(a)  age         -.099    .053     3.461    1   .063     .905
           nsex         .257    .099     6.726    1   .010    1.294
           nrace                        19.394    2   .000
           nrace(1)    -.944    .487     3.749    1   .053     .389
           nrace(2)    -.337    .504      .446    1   .504     .714
           hsgpa        .852    .123    48.204    1   .000    2.344
           actcomp     -.021    .016     1.676    1   .195     .979
           oatthrs1     .159    .012   163.499    1   .000    1.173
           earlireg     .316    .102     9.640    1   .002    1.372
           admstat(1)   .253    .123     4.222    1   .040    1.288
           postsem     -.115    .068     2.880    1   .090     .891
           y1988       -.048    .086      .306    1   .580     .954
           y1989        .177    .092     3.737    1   .053    1.194
           y1991       -.078    .098      .633    1   .426     .925
           y1992       -.124    .101     1.511    1   .219     .884
           havef101     .967    .152    40.364    1   .000    2.629
           Constant    -.032   1.228      .001    1   .979     .968

a. Variable(s) entered on step 1: age, nsex, nrace, hsgpa, actcomp, oatthrs1, earlireg, admstat, postsem, y1988, y1989, y1991, y1992, havef101.

The absence of a relationship to ACTCOMP is very interesting. It could be the foundation for a theory of retention.
