Chapter 20


Logistic Regression for Binary Response Variables

The response variable is binary, denoted by 0, 1.

Example 1 Case Study 20.1: Survival in the Donner Party. What is the relationship between age and sex of individuals in the Donner Party and whether or not they survived?

Example 2 Birthweight data: birthweights of 189 newborns were recorded along with a number of other variables concerning the mother: weight, smoker or not, race, etc. Suppose we were interested only in whether the baby was underweight (defined to be under 2500 gm) or not, and what characteristics of the mother, if any, are associated with having underweight babies.

Example 3 Pronghorn data: In a study of winter habitat selection by pronghorn in the Red Rim area in south-central Wyoming (Source: Manly, McDonald, and Thomas 1993, Resource Selection by Animals, Chapman and Hall, pp. 16-24; data from Ryder 1983), presence/absence of pronghorn during the winters of 1980-81 and 1981-82 was recorded for a systematic sample of 256 plots of 4 ha each. Other variables recorded for each plot were: density (in thousands/ha) of sagebrush, black greasewood, Nuttall's saltbush, and Douglas rabbitbrush, and the slope, distance to water, and aspect. What variables are most strongly associated with presence/absence of pronghorn, and can you formulate a model to predict the probability that pronghorn will be present on a plot?

We could use linear regression to regress the 0/1 response on quantitative and categorical explanatory variables. What are the problems with this approach? Better approach: the logistic regression model.

Y = 0/1 response variable; X_1, …, X_p: explanatory variables.

For a particular set of values of the explanatory variables,

μ(Y | X_1, …, X_p) = π = proportion of 1's.


• Rather than model μ(Y) = π as a linear function of the explanatory variables, model

logit(π) = ln(π / (1 − π))

as a linear function of the explanatory variables; that is:

logit(π) = β_0 + β_1·X_1 + … + β_p·X_p

• Why use logit(π)? Could you use another function?

Given the value of η = logit(π), we can calculate

π = e^η / (1 + e^η).
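A minimal sketch (not part of the original notes) of the logit and its inverse, using only Python's math module:

```python
import math

def logit(pi):
    """logit(pi) = ln(pi / (1 - pi)), defined for 0 < pi < 1."""
    return math.log(pi / (1 - pi))

def inv_logit(eta):
    """Inverse of the logit: pi = e^eta / (1 + e^eta)."""
    return math.exp(eta) / (1 + math.exp(eta))

# Round-trip check: any probability maps to a logit and back.
for pi in (0.1, 0.5, 0.9):
    print(pi, logit(pi), inv_logit(logit(pi)))
```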

Example 3: In the pronghorn example, suppose, hypothetically, the true relationship between probability of use and distance to water (in meters) followed the logit model:

logit(π) = 3 - .0015Water

Then

π = e^(3 − .0015·Water) / (1 + e^(3 − .0015·Water)).

Calculate π for the following distances (note that it might be easier to compute 1 − π = 1 / (1 + e^(3 − .0015·Water)), then subtract from 1):

Water = 100 m, 1000 m, 3000 m.
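As a check on the hand calculation, the following sketch (assuming the hypothetical model logit(π) = 3 − .0015·Water above) evaluates π at the three distances:

```python
import math

def prob_use(water_m):
    """pi = e^(3 - .0015*Water) / (1 + e^(3 - .0015*Water))."""
    eta = 3 - 0.0015 * water_m
    return math.exp(eta) / (1 + math.exp(eta))

for water in (100, 1000, 3000):
    print(f"Water = {water:4d} m:  pi = {prob_use(water):.3f}")
# Approximate results: pi(100) ~ 0.945, pi(1000) ~ 0.818, pi(3000) ~ 0.182.
```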


[Figure: probability of use (0 to 1) plotted against distance to water (0 to 5000 m) under the model logit(π) = 3 − .0015·Water; the curve decreases from about .95 at 0 m toward 0 at large distances.]

• What is the interpretation of the coefficient -.0015 for the variable distance to water?

For every 1 m. increase in the distance to water, the log-odds of use decrease by .0015; for every 1 km. increase in distance to water, log-odds of use decrease by 1.5.

• More meaningful: for every 1 m increase in the distance to water, the odds of use change by a multiplicative factor of e^(−.0015) = .999; for every 1 km increase in distance to water, the odds of use change by a multiplicative factor of e^(−1.5) = .223. (We could also reverse these statements; for example, the odds of use increase by a factor of e^(1.5) = 4.48 for every km closer to water a plot is.)
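These multipliers are just the exponentiated coefficients; a quick arithmetic check (a minimal sketch, not part of the original notes):

```python
import math

print(math.exp(-0.0015))  # ~0.999: odds multiplier per 1 m farther from water
print(math.exp(-1.5))     # ~0.223: odds multiplier per 1 km farther from water
print(math.exp(1.5))      # ~4.48:  odds multiplier per 1 km closer to water
```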

Variance

The 0/1 response variable Y is a Bernoulli random variable:

μ(Y | X_1, …, X_p) = π
SD(Y | X_1, …, X_p) = √(π(1 − π))

The variance of Y is not constant. The logistic regression model is an example of a generalized linear model (in contrast to a general linear model, which is the usual regression model with normal errors and constant variance). A generalized linear model is specified by:

1. a link function, which specifies what function of μ(Y) is a linear function of X_1, …, X_p. In logistic regression, the link function is the logit function.


2. the distribution of Y for a fixed set of values of X_1, …, X_p. In the logit model, this is the Bernoulli distribution.

The usual linear regression model is also a generalized linear model. The link function is the identity, that is, f(μ) = μ, and the distribution is normal with constant variance. There are other generalized linear models which are useful (e.g., Poisson response distribution with log link function). A general methodology has been developed to fit and analyze these models.

Estimation of Logistic Regression Coefficients

Estimation of parameters in the linear regression model was by least squares. If we assume normal errors with constant variance, the least squares estimators are the same as the maximum likelihood estimators (MLEs). Maximum likelihood estimation is based on a simple principle: the estimates of the parameters in a model are the values which maximize the probability (likelihood) of observing the sample data we have.

Example: Suppose we select a random sample of 10 UM students in order to estimate what proportion of students own a car. We find that 6 out of the 10 own a car. What is the MLE of the proportion p of all students who own a car? What do we think it should turn out to be? We model the responses from the students as 10 independent Bernoulli trials with probability of success p on each trial. Then the total number of successes, say Y, in 10 trials follows a binomial model:

Pr(Y = y) = C(10, y) · p^y · (1 − p)^(10−y),   y = 0, 1, …, 10,

where C(10, y) = 10! / (y!(10 − y)!) is the binomial coefficient.

The maximum likelihood principle says to find the value of p which maximizes the probability of observing the number of successes we actually observed in the sample; that is, find p to maximize:

Pr(Y = 4) = C(10, 4) · p^4 · (1 − p)^6

This is called the likelihood function. The following is a graph of Pr(Y=4) versus p. At what value of p does it appear to be maximized?

[Figure: Pr(Y = 4) plotted against p for p from 0 to 1; the likelihood rises to a maximum of about 0.25 and then falls, with the peak near p = 0.4.]
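To see where the likelihood peaks without calculus, one could simply evaluate Pr(Y = 4) over a grid of p values; a minimal sketch (not part of the original notes):

```python
from math import comb

def likelihood(p, y=4, n=10):
    """Binomial likelihood Pr(Y = y) = C(n, y) p^y (1 - p)^(n - y)."""
    return comb(n, y) * p**y * (1 - p)**(n - y)

grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=likelihood)
print(p_hat, likelihood(p_hat))   # peak at p = 0.4, where Pr(Y = 4) ~ 0.251
```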

We can also find the exact solution using calculus. Note that finding the value of p to maximize Pr(Y = 4) is equivalent to finding the value of p which maximizes

ln[Pr(Y = 4)] = ln[C(10, 4)] + 4·ln(p) + 6·ln(1 − p).

Take the derivative with respect to p and set it equal to 0:

(d/dp) ln[Pr(Y = 4)] = 4/p − 6/(1 − p) = 0

Now, solve for p: 4(1 − p) = 6p, so p̂ = 4/10 = .4. By repeating this process but replacing 10 by n (the number of trials) and 4 by y (the number of successes), we find a general expression for the MLE of p in n independent Bernoulli trials:

p̂ = y/n

• Back to logistic regression: we use the maximum likelihood principle to find estimators of

the β's in the logistic regression model. The likelihood function is the probability of observing the particular set of failures and successes that we observed in the sample. But there's a difference from the binomial model above: the model says that the probability of success is possibly different for each subject because it depends on the explanatory variables X_1, …, X_p. Dropping the subscript i to avoid confusion, the probability of success for an individual is:


Pr(Y = 1) = π, where

π = e^(β_0 + β_1·X_1 + … + β_p·X_p) / (1 + e^(β_0 + β_1·X_1 + … + β_p·X_p)).

• Note that we can also write the model for a single individual as Pr(Y = y) = π^y (1 − π)^(1−y), y = 0, 1.

• If the responses are independent, then the probability of observing the set of 0/1 responses

we observed is

Pr(Y_1 = y_1, Y_2 = y_2, …, Y_n = y_n) = π_1^(y_1)(1 − π_1)^(1−y_1) · π_2^(y_2)(1 − π_2)^(1−y_2) ⋯ π_n^(y_n)(1 − π_n)^(1−y_n) = ∏ (i = 1 to n) π_i^(y_i)(1 − π_i)^(1−y_i)

where each of the π_i's is, in turn, a function of the β's. Therefore, we maximize this likelihood function as a function of the β's. This cannot be done analytically and must be done numerically using a computer program (a sketch of such a numerical fit follows).
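As a concrete illustration of the numerical maximization, the sketch below fits a one-predictor logistic regression by minimizing the negative log-likelihood with scipy; the toy data are made up for illustration and are not from the notes.

```python
import numpy as np
from scipy.optimize import minimize

# Toy data (hypothetical): x = covariate, y = 0/1 response.
x = np.array([0.5, 1.2, 1.9, 2.4, 3.1, 3.8, 4.5, 5.0])
y = np.array([0,   0,   0,   1,   0,   1,   1,   1  ])

def neg_log_lik(beta):
    """-ln of the likelihood  prod pi_i^y_i (1 - pi_i)^(1 - y_i)."""
    eta = beta[0] + beta[1] * x
    pi = 1 / (1 + np.exp(-eta))
    return -np.sum(y * np.log(pi) + (1 - y) * np.log(1 - pi))

fit = minimize(neg_log_lik, x0=np.zeros(2), method="BFGS")
print("MLEs (b0, b1):", fit.x)
print("Deviance (-2 ln max likelihood):", 2 * fit.fun)
```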

• SPSS can fit logit models: use Analyze…Regression…Binary Logistic. Identify the binary dependent variable and all the covariates.

• If any of the covariates are categorical, then click Categorical to tell SPSS how to create indicator variables (or another way of coding) for each categorical variable. If a binary covariate is already coded as 0/1, it's not necessary to identify it as categorical.

• Interaction terms can be added when specifying the covariates in the logistic regression command in SPSS. Select two or more variables (keep the Ctrl key pressed while selecting the second and subsequent variables), then press the "a*b>" button.

Example: Birthweight data. The response variable was the 0/1 variable "low", which was an indicator of low birthweight. An additive model with all the covariates was fit. The variables were:

Name    Meaning
low     Birthweight < 2500 gm
age     Age (years)
lwt     Weight at last menstrual period (lbs)
race    Race (1 = White, 2 = Black, 3 = Other)
smoke   Smoke (0/1)
ptl     Number of premature labors
ht      Hypertension (0/1)
ui      Uterine irritability (0/1)
ftv     Number of physician visits in first trimester
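For comparison with the SPSS dialog, the same additive model could be fit in Python with statsmodels; this sketch assumes the birthweight data have been saved in a file named birthwt.csv with the columns listed above (the file name is hypothetical).

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical file containing the 189 birthweight records described above.
bw = pd.read_csv("birthwt.csv")

# C(race) requests indicator coding for the 3-level race variable.
model = smf.logit("low ~ age + lwt + C(race) + smoke + ptl + ht + ui + ftv", data=bw)
fit = model.fit()
print(fit.summary())   # coefficients, SEs, Wald z tests
print(-2 * fit.llf)    # -2 ln(maximized likelihood), the deviance
```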


There is a lot of SPSS output. For now, we're interested in only two tables: one that indicates how the categorical variable Race was coded, and the final table of model coefficients (under Block 1).

Categorical Variables Codings

Race (1=White, 2=Black, 3=Other)   Frequency   Parameter coding (1)   Parameter coding (2)
White                              96          .000                   .000
Black                              26          1.000                  .000
Other                              67          .000                   1.000

I chose Indicator coding for Race with the first category (White) as the reference category. The two columns of this table are the two indicator variables for Race.

Variables in the Equation (Step 1a)

Variable    B       S.E.    Wald    df   Sig.   Exp(B)   95% CI Lower   95% CI Upper
age        -.030    .037     .637    1   .425     .971       .903          1.044
lwt        -.015    .007    4.969    1   .026     .985       .971           .998
race                        7.116    2   .028
race(1)    1.272    .527    5.820    1   .016    3.569      1.270         10.033
race(2)     .880    .441    3.990    1   .046    2.412      1.017          5.723
smoke       .939    .402    5.450    1   .020    2.557      1.163          5.624
ptl         .543    .345    2.474    1   .116    1.722       .875          3.388
ht         1.863    .698    7.136    1   .008    6.445      1.642         25.291
ui          .768    .459    2.793    1   .095    2.155       .876          5.301
ftv         .065    .172     .143    1   .705    1.067       .761          1.497
Constant    .481   1.197     .161    1   .688    1.617

a. Variable(s) entered on step 1: age, lwt, race, smoke, ptl, ht, ui, ftv.

The columns in this table are:

1. B: the estimated coefficients.

2. S.E.: The standard errors (SEs) for the estimated coefficients (the β̂_i's) are estimated using asymptotic results for maximum likelihood estimators. An approximate confidence interval for each coefficient can be calculated as β̂_i ± z·SE(β̂_i), where z is the appropriate percentile of the standard normal distribution. We don't use the t-distribution; that only applies to the linear regression model with normal errors.

3. Wald: the Wald statistic is (β̂_i / SE(β̂_i))². It is used as a two-sided test of the hypothesis H_0: β_i = 0, given that all the other variables are in the model. Under this null hypothesis, β̂_i / SE(β̂_i) has approximately a N(0,1) distribution. The square of a N(0,1) is a chi-square distribution with 1 d.f., so the Wald statistic can be compared to a χ²(1) to get a test of H_0: β_i = 0.

4. df: The d.f. for the chi-square distribution which the Wald statistic is compared to. It is 1 except for categorical explanatory variables with more than 2 levels (however, the test for each indicator variable involved in the categorical variable still has 1 d.f.).

5. Sig.: The P-value for the test of H_0: β_i = 0 using the Wald statistic. This is the area to the right of the Wald statistic in a χ²(1) distribution.

6. Exp(B): exp(β̂_i) has the interpretation that it represents how many times the odds of a positive response increase for every one-unit increase in x_i, given that the values of the other variables in the model remain the same. For example, the coefficient for Smoke is 2.557. Since Smoke is 0/1, this means that, according to this model, the odds that a smoker will have a low birthweight baby are estimated to be 2.557 times higher (95% CI: 1.163 to 5.624; see next column) than the odds a non-smoker will, even after adjusting for Last weight, Race, and the other variables in the model. Similarly, every one-year increase in age reduces the odds of a low birthweight baby to .971 of what they were (95% CI: .903 to 1.044), after adjusting for the other variables.

7. 95.0% C.I. for Exp(B): A confidence interval for exp(β_i) (the true value) can be calculated as [exp(β̂_i − z·SE(β̂_i)), exp(β̂_i + z·SE(β̂_i))]. SPSS will calculate a confidence interval for each exp(β̂_i) in the model if, under Options, you select "CI for exp(B)."

Comparing two models

In linear regression, we used the extra sum-of-squares F test to compare two nested models (two models are nested if one is a special case of the other). The analogous test in logistic regression compares the values of –2 ln(Maximized likelihood function). What is this quantity? Recall that


the MLEs of the β_i's are the values that maximize the likelihood function of the data (defined a couple of pages ago). So, we find that maximum value of the likelihood function, take the natural log, and multiply by –2.

• The quantity –2 ln(Maximized likelihood) is also called the deviance of a model since larger values indicate greater deviation from the assumed model. Comparing two nested models by the difference in deviances is a drop-in-deviance test.

SPSS gives the value of –2 ln(Maximized likelihood) in the output. For the full model for low birthweight examined previously:

Model Summary

Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      201.285a            .162                   .228

a. Estimation terminated at iteration number 5 because parameter estimates changed by less than .001.

• The difference between the values of –2ln(Maximized likelihood function) for a full and reduced model has approximately a chi-square distribution if the null hypothesis that the extra parameters are all 0 is true. The d.f. is the difference in the number of parameters for the two models.

Example: Compare the full birthweight model to one without Smoke. The value of –2ln(Maximized likelihood function) for the model without Smoke is:

Model Summary

Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      206.906a            .137                   .192

a. Estimation terminated at iteration number 5 because parameter estimates changed by less than .001.

The test statistic is

Drop-in-deviance = 206.906 – 201.285 = 5.621.

The P-value is the area to the right of 5.621 in a χ²(1) distribution (since there is one less parameter in the reduced model). Thus, P = .018, which indicates moderately strong evidence of an effect of smoking on the probability of having a low-birthweight baby after adjusting for the other variables in the model.
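The chi-square tail area can be checked with any statistical package; a minimal sketch using scipy (not part of the original notes):

```python
from scipy.stats import chi2

drop_in_deviance = 206.906 - 201.285        # = 5.621
p_value = chi2.sf(drop_in_deviance, df=1)   # area to the right of 5.621 in chi-square(1)
print(round(drop_in_deviance, 3), round(p_value, 3))   # 5.621, ~0.018
```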

• The drop-in-deviance test is a likelihood ratio test (LR test) because it is based on the natural log of the ratio of the maximized likelihoods (the difference of logs is the log of the ratio). The extra sum-of-squares F-test in linear regression also turns out to be a likelihood ratio test.


• If the full and reduced models differ by only one parameter, as in this example, then the likelihood ratio test is testing the same thing as the Wald test above. In this example, the test statistic is slightly different than the Wald test value of 5.450 and P = .020. The likelihood ratio test is preferred. The two tests will generally give similar results, but not always.

• The relationship between the Wald test and the likelihood ratio test is analogous to the relationship between the t-test for a single coefficient and the F-test in linear regression. However, in linear regression, the two tests are exactly equivalent. Not so in logistic regression.

One can obtain an LR test for each coefficient in SPSS by selecting "Backward-LR" under Method. Also be sure that Display: At each step is checked under Options. This performs a backward stepwise logistic regression. All we are using it for, though, is the output at the first step, where it reports a likelihood ratio test for each variable with all the other variables in the model. The table we are looking for looks like this. The LR test for each term is listed under Step 1.

Model if Term Removed

Step     Variable   Model Log Likelihood   Change in -2 Log Likelihood   df   Sig. of the Change
Step 1   age        -100.965                .645                          1    .422
         lwt        -103.402               5.520                          1    .019
         race       -104.376               7.468                          2    .024
         smoke      -103.453               5.621                          1    .018
         ptl        -101.916               2.546                          1    .111
         ht         -104.403               7.521                          1    .006
         ui         -102.014               2.742                          1    .098
         ftv        -100.713                .142                          1    .706
Step 2   age        -100.993                .559                          1    .455
         lwt        -103.404               5.381                          1    .020
         race       -104.386               7.344                          2    .025
         smoke      -103.458               5.489                          1    .019
         ptl        -101.974               2.521                          1    .112
         ht         -104.406               7.384                          1    .007
         ui         -102.053               2.680                          1    .102
Step 3   lwt        -104.056               6.126                          1    .013
         race       -105.155               8.325                          2    .016
         smoke      -103.864               5.743                          1    .017
         ptl        -102.108               2.231                          1    .135
         ht         -104.732               7.479                          1    .006
         ui         -102.449               2.912                          1    .088
Step 4   lwt        -105.583               6.949                          1    .008
         race       -106.413               8.609                          2    .014
         smoke      -105.755               7.293                          1    .007
         ht         -105.957               7.698                          1    .006
         ui         -104.124               4.031                          1    .045

Model Selection

Both AIC and BIC can be used as model selection criteria. As with linear regression models, they are only relative measures of fit, not absolute measures of fit.

AIC = Deviance + 2p

BIC = Deviance + p·log(n)


where p is the number of parameters in the model and n is the sample size. Stepwise model selection methods are available in SPSS using likelihood ratio tests or Wald's test. The LR methods are preferred. Other software programs, like S-Plus, have stepwise procedures using AIC or BIC.

Interactions

Interaction terms can be added when specifying the covariates in the logistic regression command in SPSS. Select two or more variables (keep the Ctrl key pressed while selecting the second and subsequent variables), then press the "a*b>" button.

Residuals

There are two standard ways to define a residual in logistic regression.

1. Pearson residual = (y_i − π̂_i) / √(π̂_i(1 − π̂_i)).

This is called a "Standardized Residual" in the Save option under Regression…Binary Logistic in SPSS, but it is labeled "Normalized Residual" in the data sheet. The Pearson residual is the raw (unstandardized) residual y_i − π̂_i standardized by its estimated standard deviation.

2. Deviance residual = √(−2·ln(π̂_i)) if y_i = 1, and −√(−2·ln(1 − π̂_i)) if y_i = 0.

This is called the "Deviance Residual" in SPSS. The deviance residuals have the property that the sum of the squared deviance residuals is the deviance D (= −2 ln(maximized likelihood)) for the model.
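A minimal sketch (not from the notes) of both residual definitions for a single case, given its 0/1 response y and fitted probability π̂; the values passed in are purely illustrative:

```python
import math

def pearson_residual(y, pi_hat):
    """(y - pi_hat) / sqrt(pi_hat * (1 - pi_hat))."""
    return (y - pi_hat) / math.sqrt(pi_hat * (1 - pi_hat))

def deviance_residual(y, pi_hat):
    """sqrt(-2 ln pi_hat) if y = 1, else -sqrt(-2 ln(1 - pi_hat))."""
    if y == 1:
        return math.sqrt(-2 * math.log(pi_hat))
    return -math.sqrt(-2 * math.log(1 - pi_hat))

print(pearson_residual(1, 0.2), deviance_residual(1, 0.2))
print(pearson_residual(0, 0.2), deviance_residual(0, 0.2))
```

Squaring and summing the deviance residuals over all cases reproduces the model deviance, which is why they are a natural way to see each point's contribution to lack of fit.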

The Pearson residual is more easily understood, but the deviance residual directly gives the contribution of each point to the lack of fit of the model. There is also a "Studentized residual" in SPSS (but labeled "Standard residual" in the data sheet). It's usually almost identical to the deviance residual, so we won't consider it here.

The residuals in a logistic regression (either Pearson or deviance) are not as useful as in a linear regression unless the data are grouped (see Chapter 21). The residuals are not assumed to have a normal distribution, as in linear regression. Since the residuals don't have a normal distribution, the usual ±2 or ±3 cutoffs don't necessarily apply. Instead, we can simply look for outliers among the distribution of Pearson or deviance residuals. We can plot the values against the predicted probabilities.

Measures of influence

Measures of influence attempt to measure how much individual observations influence the model. Observations with high influence merit special examination. Many measures of influence have been suggested and are used for linear regression. Some of these have analogies


for logistic regression. However, the guidelines for deciding what is a big enough value of an influence measure to merit special attention are less developed for logistic regression than for linear regression, and the guidelines developed there are sometimes not appropriate for logistic regression.

Cook's Distance: This is a measure of how much the residuals change when the case is deleted. Large values indicate a large change when the observation is left out. Plot D_i against case number or against π̂_i and look for outliers.

The leverage is a measure of the potential influence of an observation. In linear regression, the leverage is a function of how far its covariate vector x_i is from the average; it is a function only of the covariate vector. In logistic regression, the leverage is a function of x_i and of π_i (which must be estimated) – it isn't necessarily the observations with the most extreme x_i's which have the most leverage.

Assessing the overall fit of the model

Assessing the fit of a logistic regression model is difficult. Since the response is 0/1, plots of the response versus explanatory variables are not as useful as when the response is continuous. There is no good measure analogous to R²; SPSS will report a couple of R²-like measures that have been proposed for logistic regression, but these don't have the same interpretation as in linear regression and are not particularly good measures of fit.

In theory, a chi-square goodness-of-fit test could be carried out by comparing the observed count in each category (meaning each combination of the values of the explanatory variables) against the expected count. However, unless there are multiple subjects in each combination of the values of the covariates (the explanatory variables), the observed counts are 0's and 1's; the expected counts are simply the π̂_i's. But the chi-square approximation for the goodness-of-fit statistic does not hold when the expected counts are all less than 1 (most of them should be above 5 to use the chi-square approximation).

A way around this difficulty is to group the data. The Hosmer-Lemeshow test groups individuals by their value of π̂_i into ten groups of about equal size. That is, the tenth of the observations with the lowest values of π̂_i are put in the first group, the next tenth lowest values in the second, and so on. It then totals the π̂_i's for everyone in each group (that is, it totals the expected number of successes) and does the same for the 1 − π̂_i's (the expected number of failures). It then compares the observed and expected counts in the 20 cells using Pearson's X² and compares the result to a chi-square with 8 d.f. The expected counts still need to be large enough for the chi-square approximation to hold, but this is much more likely to happen since we group the data. Since the cases are grouped, this test is "crude" in a sense, but it is a useful tool (a rough sketch of the calculation appears below).

Another goodness-of-fit tool is the classification table. Each observation is classified according to its estimated π: if π̂_i > .5, then it's classified as a "success" (a 1), otherwise as a "failure" (a 0). The table of actual group versus predicted group is used to assess how well the classifier does. This is a very crude way of assessing how well the model fits since it does not measure how far π̂_i is from y_i, only whether it is on the same side of .5 as y_i. In addition, if y_i = 1 is rare, then the estimated probabilities may all be less than .5 and the classification table isn't useful in assessing fit (the same is true if y_i = 0 is rare).
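Here is a rough sketch of the Hosmer-Lemeshow grouping described above; it assumes numpy/scipy and arrays y (0/1 responses) and pi_hat (fitted probabilities), and it is only an illustration of the idea, not SPSS's exact implementation.

```python
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y, pi_hat, g=10):
    """Group cases into g groups by sorted pi_hat; compare observed vs expected counts."""
    y, pi_hat = np.asarray(y, float), np.asarray(pi_hat, float)
    order = np.argsort(pi_hat)
    x2 = 0.0
    for idx in np.array_split(order, g):      # ~equal-size groups of cases
        obs_succ = y[idx].sum()
        exp_succ = pi_hat[idx].sum()          # total of the pi_hat's = expected successes
        obs_fail = len(idx) - obs_succ
        exp_fail = (1 - pi_hat[idx]).sum()    # expected failures
        x2 += (obs_succ - exp_succ) ** 2 / exp_succ + (obs_fail - exp_fail) ** 2 / exp_fail
    return x2, chi2.sf(x2, g - 2)             # compare to chi-square with g - 2 = 8 d.f.
```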


Example: Birthweight data, using only LastWeight, Race, and Smoke. First fit a rich model:

Model Summary

Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      210.491a            .120                   .169

a. Estimation terminated at iteration number 5 because parameter estimates changed by less than .001.

Variables in the Equation (Step 1a)

Variable             B        S.E.    Wald    df   Sig.   Exp(B)
lwt                  -.012    .045     .073    1   .786     .988
race                                  1.733    2   .420
race(1)             -3.481   2.717    1.642    1   .200     .031
race(2)             -2.969   2.907    1.043    1   .307     .051
smoke                -.679   1.933     .123    1   .726     .507
lwt * race                            1.271    2   .530
lwt by race(1)        .017    .021     .634    1   .426    1.017
lwt by race(2)        .025    .022    1.269    1   .260    1.026
lwt by smoke          .008    .016     .257    1   .612    1.008
race * smoke                          1.911    2   .385
race(1) by smoke     1.308    .953    1.882    1   .170    3.698
race(2) by smoke      .667   1.180     .320    1   .572    1.949
lwt2                  .000    .000     .167    1   .683    1.000
Constant             1.849   3.131     .349    1   .555    6.353

a. Variable(s) entered on step 1: lwt, race, smoke, lwt * race, lwt * smoke, race * smoke, lwt2.

Compare to a main effects model

Model Summary

Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      215.015a            .099                   .139

a. Estimation terminated at iteration number 4 because parameter estimates changed by less than .001.

Variables in the Equation (Step 1a)

Variable   B       S.E.   Wald    df   Sig.   Exp(B)
lwt        -.013   .006   4.415    1   .036     .987
race                      8.730    2   .013
race(1)    -.971   .412   5.543    1   .019     .379
race(2)     .320   .526    .370    1   .543    1.377
smoke      1.060   .378   7.850    1   .005    2.886
Constant    .861   .797   1.168    1   .280    2.366

a. Variable(s) entered on step 1: lwt, race, smoke.

• Carry out a drop-in-deviance test (see the sketch below):
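Filling in the calculation (one reading of the two fits above): the rich model has six more estimated coefficients than the main-effects model (lwt by race(1), lwt by race(2), lwt by smoke, race(1) by smoke, race(2) by smoke, and lwt2), so the drop in deviance is compared to a chi-square with 6 d.f. A minimal sketch:

```python
from scipy.stats import chi2

drop_in_deviance = 215.015 - 210.491   # main-effects deviance minus rich-model deviance = 4.524
df = 6                                 # extra parameters in the rich model
print(round(drop_in_deviance, 3))                      # 4.524
print(round(chi2.sf(drop_in_deviance, df), 2))         # P ~ 0.61: little evidence the extra terms are needed
```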


Categorical Variables Codings

Race (1=White, 2=Black, 3=Other)   Frequency   Parameter coding (1)   Parameter coding (2)
White                              96          1.000                  .000
Black                              26          .000                   1.000
Other                              67          .000                   .000

The largest difference is between White and Other. The difference between Black and Other is smaller and not statistically significant. We might consider coding race as simply White/Nonwhite.

Model Summary

Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      215.383a            .097                   .136

a. Estimation terminated at iteration number 4 because parameter estimates changed by less than .001.

Variables in the Equation (Step 1a)

Variable    B        S.E.    Wald    df   Sig.   Exp(B)   95% CI Lower   95% CI Upper
lwt         -.0122   .0060   4.118    1   .042     .988       .976          1.000
Nonwhite    1.077    .373    8.342    1   .004    2.937      1.414          6.102
smoke       1.103    .372    8.775    1   .003    3.013      1.452          6.252
Constant    -.275    .834     .109    1   .742     .760

a. Variable(s) entered on step 1: lwt, Nonwhite, smoke.

Hosmer and Lemeshow Test

Step   Chi-square   df   Sig.
1      7.071        8    .529

Some residual analysis should also be done to see that there are not any outliers or particularly influential points.

• Write the logit equation from the last model for each of the four groups represented by the two categorical explanatory variables:


Compare the odds of a low birthweight baby for a nonwhite smoking mother who weighs 150 pounds versus a white, nonsmoking mother who weighs 120 pounds.
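A sketch of this comparison using the coefficients from the final model above (lwt in pounds; Nonwhite and smoke are 0/1 indicators); the helper function name is ours, not from the notes:

```python
import math

def fitted_logit(lwt, nonwhite, smoke):
    """Fitted logit from the final model: -.275 - .0122*lwt + 1.077*Nonwhite + 1.103*smoke."""
    return -0.275 - 0.0122 * lwt + 1.077 * nonwhite + 1.103 * smoke

# Nonwhite smoking mother weighing 150 lb vs. white nonsmoking mother weighing 120 lb.
log_odds_1 = fitted_logit(150, 1, 1)
log_odds_2 = fitted_logit(120, 0, 0)
odds_ratio = math.exp(log_odds_1 - log_odds_2)
print(round(log_odds_1, 3), round(log_odds_2, 3), round(odds_ratio, 1))  # ~0.075, ~-1.739, ~6.1
```

Under this model, the odds of a low-birthweight baby for the first mother are estimated to be roughly six times the odds for the second.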