7/28/15
Lecture 6: Logistic Regression Analysis Christopher S. Hollenbeak, PhD Jane R. Schubart, PhD The Outcomes Research Toolbox
Review Homework
Overview
• Logistic regression model conceptually • Logistic regression model graphically • Lots of equations • Performing logistic regression in Stata • Performing logistic regression in R • Predicting SSI in liver transplantation
Logistic Regression Model
• Recall we use logistic regression when we have a binary dependent variable
• Independent variables could be continuous, binary, or categorical
• The goal becomes to estimate the effect of covariates on the probability that the dependent variable equals 1
Pr(y_i = 1) = β0 + β1 x1i + ... + βk xki
When to Use Logistic Regression
• When you want to estimate or predict the probability that an event occurs
• Examples:
  – What is the probability of developing a surgical site infection?
  – What is the effect of age on risk of dying in the hospital?
• What do the data look like?
[Figure: plot of Pr(Died) (0 or 1) against Age]
Logistic Regression
• We could fit a linear regression model to these data
• Called a linear probability model – Often done in economics – Almost never done in biomedical research!
• What would it look like?
[Figure: linear probability model fit of Pr(Died) against Age]
What’s Wrong with This?
• Does not fit the assumptions of the linear regression model
  – Errors are not normally distributed
  – Always heteroskedastic
• Predicted probabilities can be < 0 or > 1!
• Need a functional form that fits the data
[Figure: logistic curve fit of Pr(Died) against Age]
Logistic Regression
• This functional form restricts the probability between 0 and 1
• What functions look like this? • Cumulative distribution functions!
Cumulative Distribution Functions
[Figure: several cumulative distribution functions, each bounded between 0 and 1]
Cumulative Distribution Function
• We just need to pick a probability distribution and use its CDF
• Which probability distribution? • The Logistic Distribution – Which is why we call this logistic regression
The Model
• The logistic CDF looks like this:

  Pr(y_i = 1) = e^(β0 + β1 x1i + ... + βk xki) / (1 + e^(β0 + β1 x1i + ... + βk xki))

• If we instead model the ratio, we get something simpler:

  Pr(y_i = 1) / Pr(y_i = 0) = e^(β0 + β1 x1i + ... + βk xki)
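The two equations above are easy to verify numerically. A minimal Python sketch for illustration (the intercept and age coefficient are made-up values, not estimates from the lecture's data):

```python
import math

def logistic_prob(beta0, betas, x):
    """Pr(y_i = 1) under the logistic CDF: e^xb / (1 + e^xb)."""
    xb = beta0 + sum(b * xi for b, xi in zip(betas, x))
    return math.exp(xb) / (1.0 + math.exp(xb))

# hypothetical coefficients: intercept -0.5, age effect 0.02, age = 50
p = logistic_prob(-0.5, [0.02], [50])
odds = p / (1.0 - p)  # Pr(y=1) / Pr(y=0)

# the odds equal e^xb exactly, matching the second equation
assert abs(odds - math.exp(-0.5 + 0.02 * 50)) < 1e-9
print(round(p, 4))  # 0.6225: the predicted probability stays between 0 and 1
```

Whatever values the coefficients take, the CDF form keeps the predicted probability inside (0, 1).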
The Model
• If we now take the natural log of both sides, we get something even simpler:

  ln( Pr(y_i = 1) / Pr(y_i = 0) ) = β0 + β1 x1i + ... + βk xki

• This model is just a linear model of the log odds of risk
Quantifying Risk: One Event
• The best measure of risk that one event occurs is the probability of the event (p)
• Risk is not the only measure
• Could also use the odds of the event
  – Odds is the ratio of the probability that the event occurs to the probability that it doesn't occur:

    p / (1 - p)

  – An odds of 2 (or 2:1) means that the event is twice as likely to occur as not to occur
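As a quick numeric check of the odds formula (the probabilities here are made up):

```python
def odds(p):
    """Odds of an event: probability it occurs over probability it does not."""
    return p / (1.0 - p)

print(odds(2 / 3))  # odds of 2 (2:1): twice as likely to occur as not
print(odds(0.5))    # odds of 1 (1:1): equally likely either way
```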
Quantifying Risk: Two Events
• But what if I have two events (p1 and p2)? – How can I quantify relative risk?
• Risk Ratio:

  p1 / p2

  – If risk ratio < 1, event 1 is less likely to occur
  – If risk ratio > 1, event 1 is more likely to occur
  – If risk ratio = 1, the events are equally likely
Quantifying Risk: Two Events
• Another common alternative is the Odds Ratio:

  [p1 / (1 - p1)] / [p2 / (1 - p2)]

  – If odds ratio < 1, event 1 is less likely to occur
  – If odds ratio > 1, event 1 is more likely to occur
  – If odds ratio = 1, the events are equally likely
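A small Python sketch contrasting the two measures (probabilities are made up). For rare events the odds ratio is close to the risk ratio, but for common events the two can differ noticeably:

```python
def risk_ratio(p1, p2):
    """Ratio of two event probabilities."""
    return p1 / p2

def odds_ratio(p1, p2):
    """Ratio of the odds of two events."""
    return (p1 / (1.0 - p1)) / (p2 / (1.0 - p2))

# rare events: odds ratio is nearly the risk ratio
print(risk_ratio(0.02, 0.01), odds_ratio(0.02, 0.01))  # ~2 vs ~2.02
# common events: the two diverge
print(risk_ratio(0.4, 0.2), odds_ratio(0.4, 0.2))      # 2 vs ~2.67
```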
Logistic Regression Coefficients
• Raw coefficients tell you the effect of a one-unit change in the covariate on the log odds of risk
• But we cannot interpret the log odds scale directly!
• A simple transformation of the coefficient gives us the odds ratio of risk
• In logistic regression we always report and use odds ratios

  ln( Pr(y_i = 1) / Pr(y_i = 0) ) = β0 + β1 x1i + ... + βk xki

  Odds Ratio = e^βk
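The transformation can be checked against the Stata output shown later in this lecture: exponentiating the raw female coefficient reproduces the odds ratio Stata reports with the or option.

```python
import math

# raw logit coefficient for female, taken from the lecture's Stata output
beta_female = -0.3633933
odds_ratio = math.exp(beta_female)
print(round(odds_ratio, 4))  # 0.6953, matching the Odds Ratio column
```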
Interpreting the Coefficients
• Be careful when using and reporting odds ratios • Scale matters!
• Always report baseline probability so the reader knows what the odds apply to
• This will be the Intercept in the logistic regression output
  0.00002 / 0.00001 = 2        0.2 / 0.1 = 2

Both ratios equal 2, but the underlying baseline risks differ by four orders of magnitude.
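The point about scale can be made concrete in a couple of lines of Python: both pairs of probabilities give a ratio of 2, yet the absolute risks involved are wildly different.

```python
# two pairs of event probabilities, each with the same relative ratio of 2
pairs = [(0.00002, 0.00001), (0.2, 0.1)]
for p1, p2 in pairs:
    # ratio is 2.0 both times; absolute differences are 1e-05 and 0.1
    print(f"ratio = {p1 / p2:.1f}, absolute difference = {p1 - p2:g}")
```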
Stata Code
• The Stata command for a logistic regression is:

  logit depvar ind1 ind2 ... , or

• The or option reports odds ratios instead of coefficients
  – Important; otherwise you will have trouble with the interpretation
Example: Risk of SSI in Liver Tx
• Consider an example from our liver transplant data
• What factors are associated with risk of surgical site infection?
logit ssi age4049 age5059 age60 female /// black ab0 ab1 ab2 ab3, or
Example: Risk of SSI
• Without the or option
Logistic regression                               Number of obs   =        777
                                                  LR chi2(9)      =      10.04
                                                  Prob > chi2     =     0.3473
Log likelihood = -509.33314                       Pseudo R2       =     0.0098
------------------------------------------------------------------------------
         ssi |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     age4049 |  -.1787575   .2001108    -0.89   0.372    -.5709675    .2134525
     age5059 |  -.3355092   .2017607    -1.66   0.096     -.730953    .0599346
       age60 |  -.2626236   .2338076    -1.12   0.261    -.7208781     .195631
      female |  -.3633933   .1517322    -2.39   0.017     -.660783   -.0660036
       black |   .0281334   .3712022     0.08   0.940    -.6994095    .7556762
         ab0 |   .4821227    .594928     0.81   0.418    -.6839148     1.64816
         ab1 |  -.2625846    .473966    -0.55   0.580    -1.191541    .6663716
         ab2 |   .0844837   .2251664     0.38   0.708    -.3568343    .5258017
         ab3 |  -.0755279   .1682882    -0.45   0.654    -.4053666    .2543109
       _cons |   -.146457   .1757599    -0.83   0.405    -.4909401    .1980261
------------------------------------------------------------------------------
Example: Risk of SSI
• With the or option
Logistic regression                               Number of obs   =        777
                                                  LR chi2(9)      =      10.04
                                                  Prob > chi2     =     0.3473
Log likelihood = -509.33314                       Pseudo R2       =     0.0098
------------------------------------------------------------------------------
         ssi | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     age4049 |   .8363087   .1673544    -0.89   0.372     .5649786    1.237945
     age5059 |    .714974   .1442537    -1.66   0.096      .48145     1.061767
       age60 |   .7690313   .1798054    -1.12   0.261      .486325    1.216078
      female |   .6953129   .1055014    -2.39   0.017     .5164468    .9361275
       black |   1.028533   .3817936     0.08   0.940     .4968786    2.129051
         ab0 |   1.619508   .9634909     0.81   0.418     .5046376    5.197409
         ab1 |   .7690613   .3645089    -0.55   0.580     .3037529    1.947159
         ab2 |   1.088155   .2450159     0.38   0.708     .6998885    1.691815
         ab3 |   .9272539   .1560458    -0.45   0.654     .6667323    1.289573
       _cons |   .8637629   .1518149    -0.83   0.405     .6120508    1.218994
------------------------------------------------------------------------------
Goodness of Fit
• How can you tell when you have a good model?
• Usual statistic is the "C" statistic
• Comparable to the R2 in linear regression
  – Lower bound of 0.5
  – Upper bound of 1
• Computed as the area under the Receiver Operating Characteristic (ROC) curve
• Stata command: lroc
• Run it on the line right after the logit statement
Goodness of Fit
Logistic model for ssi

number of observations = 777
area under ROC curve   = 0.5607
R Code
• The R function that performs logistic regression is glm()
  – glm = Generalized Linear Model
  – Must specify that the data come from the binomial family of distributions
• Create a glm object, then summarize:
  – lr1 <- glm(dat$depvar ~ dat$indvar1 + ... + dat$indvark, family = "binomial")
  – summary(lr1)
  – confint(lr1)
R Results

Call:
glm(formula = data1$ssi ~ data1$age4049 + data1$age5059 + data1$age60 +
    data1$female + data1$black + data1$abmm, family = "binomial")

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.1589  -0.9771  -0.8660   1.3309   1.5647

Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)    -0.08709    0.31192  -0.279   0.7801
data1$age4049  -0.17923    0.19955  -0.898   0.3691
data1$age5059  -0.33866    0.20120  -1.683   0.0923 .
data1$age60    -0.27206    0.23308  -1.167   0.2431
data1$female   -0.36278    0.15153  -2.394   0.0167 *
data1$black     0.04339    0.36991   0.117   0.9066
data1$abmm     -0.02186    0.08261  -0.265   0.7913
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1028.7  on 776  degrees of freedom
Residual deviance: 1020.1  on 770  degrees of freedom
AIC: 1034.1

Number of Fisher Scoring iterations: 4
R Code
• R will not give you odds ratios
• You have to force them:
  – exp(cbind(OR = coef(lr1), confint(lr1)))
R Results
> exp(cbind(OR = coef(lr1), confint(lr1)))
Waiting for profiling to be done...
                     OR     2.5 %    97.5 %
(Intercept)   0.9165969 0.4952724 1.6866575
data1$age4049 0.8359172 0.5648019 1.2356989
data1$age5059 0.7127235 0.4796774 1.0563229
data1$age60   0.7618093 0.4806089 1.1999588
data1$female  0.6957407 0.5162789 0.9354321
data1$black   1.0443432 0.4940752 2.1331501
data1$abmm    0.9783733 0.8326251 1.1517419
R Code
• For area under the curve, you need an add-on package:
  – install.packages("Deducer")
  – library(Deducer)
• This gives you a new function called rocplot()
• Apply the function to your glm object:
  – rocplot(lr1)
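The AUC that rocplot() reports can also be cross-checked by hand: the C statistic is the probability that a randomly chosen observation with the event gets a higher predicted probability than one without. A minimal pure-Python sketch with made-up outcomes and fitted probabilities:

```python
def c_statistic(y, p):
    """Area under the ROC curve, computed as the fraction of
    (event, non-event) pairs in which the event observation has the
    higher predicted probability (ties count as half)."""
    events = [pi for yi, pi in zip(y, p) if yi == 1]
    non_events = [pj for yj, pj in zip(y, p) if yj == 0]
    wins = sum(1.0 if pi > pj else 0.5 if pi == pj else 0.0
               for pi in events for pj in non_events)
    return wins / (len(events) * len(non_events))

# toy data: binary outcomes and fitted probabilities from some model
y = [1, 1, 0, 0, 0]
p = [0.8, 0.4, 0.5, 0.3, 0.2]
print(c_statistic(y, p))  # 5 of 6 pairs ranked correctly -> ~0.833
```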
R Results
[Figure: ROC curve for the SSI model (Sensitivity vs. 1 - Specificity), AUC = 0.5607]
Reporting Logistic Regression
• Include the odds ratio, its 95% confidence interval, and p-value – Do not include the Intercept!! – Do mention the baseline probability in your text
• For p-values less than 0.0001, indicate with “< 0.0001” rather than the actual p-value
• Indicate the reference group
• Include the C-statistic in the table caption
  – Usually don't include a plot of the ROC curve
The Narrative
Women had 30% lower odds of developing SSI relative to men (OR = 0.70, p = 0.02), and African Americans had 4% greater odds of developing SSI, although this effect was not statistically significant (p = 0.94).
Summary
• Regression analysis for binary dependent variable – Fits the data to a cumulative logistic distribution
• Raw parameters tell you the impact of a one-unit change in the covariates on the log odds of the event of interest
• Raising e to the power of your coefficients gives you their odds ratios, or the impact of a one-unit change in your covariate on the odds of the event
• C-statistic measures goodness of fit
Homework
• Using the Liver Transplant data: – What factors determine the probability of dying?
• Primary interest is in the effect of SSI on risk of death
• Use the other covariates you used in your cost analysis
• Find at least one other *new* variable in the data set that you did not include in your cost analysis
– Is it better to use age as a continuous covariate or categories?
– Is it better to use HLA mismatches as a continuous covariate or categories?