7/28/15
Lecture 6: Logistic Regression Analysis Christopher S. Hollenbeak, PhD Jane R. Schubart, PhD The Outcomes Research Toolbox
Review Homework
Overview
• Logistic regression model conceptually • Logistic regression model graphically • Lots of equations • Performing logistic regression in Stata • Performing logistic regression in R • Predicting SSI in liver transplantation
Logistic Regression Model
• Recall we use logistic regression when we have a binary dependent variable
• Independent variables could be continuous, binary, or categorical
• The goal becomes to estimate the effect of covariates on the probability that the dependent variable equals 1
Pr(y_i = 1) = β0 + β1 x1i + ... + βk xki
When to Use Logistic Regression
• When you want to estimate or predict the probability that an event occurs
• Examples:
  – What is the probability of developing a surgical site infection?
  – What is the effect of age on risk of dying in the hospital?
• What do the data look like?
[Figure: plot of Pr(Died) (0 or 1) against Age]
Logistic Regression
• We could fit a linear regression model to these data
• Called a linear probability model – Often done in economics – Almost never done in biomedical research!
• What would it look like?
[Figure: linear probability model fit of Pr(Died) against Age]
What’s Wrong with This?
• Does not fit the assumptions of the linear regression model
  – Errors are not normally distributed
  – Always heteroskedastic
• Predicted probabilities can be < 0 or > 1!
• Need a functional form that fits the data
[Figure: logistic curve fit of Pr(Died) against Age]
Logistic Regression
• This functional form restricts the probability between 0 and 1
• What functions look like this? • Cumulative distribution functions!
Cumulative Distribution Functions
[Figure: several cumulative distribution functions, each bounded between 0 and 1]
Cumulative Distribution Function
• We just need to pick a probability distribution and use its CDF
• Which probability distribution? • The Logistic Distribution – Which is why we call this logistic regression
The Model
• The logistic CDF looks like this:

  Pr(y_i = 1) = e^(β0 + β1 x1i + ... + βk xki) / (1 + e^(β0 + β1 x1i + ... + βk xki))

• If we instead model the ratio, we get something simpler:

  Pr(y_i = 1) / Pr(y_i = 0) = e^(β0 + β1 x1i + ... + βk xki)
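The two equations above are easy to verify numerically. A minimal Python sketch for illustration (the intercept and age coefficient are made-up values, not estimates from the lecture's data):

```python
import math

def logistic_prob(beta0, betas, x):
    """Pr(y_i = 1) under the logistic CDF: e^xb / (1 + e^xb)."""
    xb = beta0 + sum(b * xi for b, xi in zip(betas, x))
    return math.exp(xb) / (1.0 + math.exp(xb))

# hypothetical coefficients: intercept -0.5, age effect 0.02, age = 50
p = logistic_prob(-0.5, [0.02], [50])
odds = p / (1.0 - p)  # Pr(y=1) / Pr(y=0)

# the odds equal e^xb exactly, matching the second equation
assert abs(odds - math.exp(-0.5 + 0.02 * 50)) < 1e-9
print(round(p, 4))  # 0.6225: the predicted probability stays between 0 and 1
```

Whatever values the coefficients take, the CDF form keeps the predicted probability inside (0, 1).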
The Model
• If we now take the natural log of both sides, we get something even simpler:

  ln( Pr(y_i = 1) / Pr(y_i = 0) ) = β0 + β1 x1i + ... + βk xki

• This model is just a linear model of the log odds of risk
Quantifying Risk: One Event
• The best measure of risk that one event occurs is the probability of the event (p)
• Risk is not the only measure
• Could also use the odds of the event
  – Odds is the ratio of the probability that the event occurs to the probability that it doesn't occur:

    p / (1 - p)

  – An odds of 2 (or 2:1) means that the event is twice as likely to occur as not to occur
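As a quick numeric check of the odds formula (the probabilities here are made up):

```python
def odds(p):
    """Odds of an event: probability it occurs over probability it does not."""
    return p / (1.0 - p)

print(odds(2 / 3))  # odds of 2 (2:1): twice as likely to occur as not
print(odds(0.5))    # odds of 1 (1:1): equally likely either way
```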
Quantifying Risk: Two Events
• But what if I have two events (p1 and p2)? – How can I quantify relative risk?
• Risk Ratio:

  p1 / p2

  – If risk ratio < 1, event 1 is less likely to occur
  – If risk ratio > 1, event 1 is more likely to occur
  – If risk ratio = 1, the events are equally likely
Quantifying Risk: Two Events
• Another common alternative is the Odds Ratio:

  [p1 / (1 - p1)] / [p2 / (1 - p2)]

  – If odds ratio < 1, event 1 is less likely to occur
  – If odds ratio > 1, event 1 is more likely to occur
  – If odds ratio = 1, the events are equally likely
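A small Python sketch contrasting the two measures (probabilities are made up). For rare events the odds ratio is close to the risk ratio, but for common events the two can differ noticeably:

```python
def risk_ratio(p1, p2):
    """Ratio of two event probabilities."""
    return p1 / p2

def odds_ratio(p1, p2):
    """Ratio of the odds of two events."""
    return (p1 / (1.0 - p1)) / (p2 / (1.0 - p2))

# rare events: odds ratio is nearly the risk ratio
print(risk_ratio(0.02, 0.01), odds_ratio(0.02, 0.01))  # ~2 vs ~2.02
# common events: the two diverge
print(risk_ratio(0.4, 0.2), odds_ratio(0.4, 0.2))      # 2 vs ~2.67
```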
Logistic Regression Coefficients
• Raw coefficients tell you the effect of a one-unit change in the covariate on the log odds of risk
• But we cannot interpret the log odds scale directly!
• A simple transformation of the coefficient gives us the odds ratio of risk
• In logistic regression we always report and use odds ratios

  ln( Pr(y_i = 1) / Pr(y_i = 0) ) = β0 + β1 x1i + ... + βk xki

  Odds Ratio = e^βk
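The transformation can be checked against the Stata output shown later in this lecture: exponentiating the raw female coefficient reproduces the odds ratio Stata reports with the or option.

```python
import math

# raw logit coefficient for female, taken from the lecture's Stata output
beta_female = -0.3633933
odds_ratio = math.exp(beta_female)
print(round(odds_ratio, 4))  # 0.6953, matching the Odds Ratio column
```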
Interpreting the Coefficients
• Be careful when using and reporting odds ratios • Scale matters!
• Always report baseline probability so the reader knows what the odds apply to
• This will be the Intercept in the logistic regression output
  0.00002 / 0.00001 = 2        0.2 / 0.1 = 2

Both ratios equal 2, but the underlying baseline risks differ by four orders of magnitude.
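The point about scale can be made concrete in a couple of lines of Python: both pairs of probabilities give a ratio of 2, yet the absolute risks involved are wildly different.

```python
# two pairs of event probabilities, each with the same relative ratio of 2
pairs = [(0.00002, 0.00001), (0.2, 0.1)]
for p1, p2 in pairs:
    # ratio is 2.0 both times; absolute differences are 1e-05 and 0.1
    print(f"ratio = {p1 / p2:.1f}, absolute difference = {p1 - p2:g}")
```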
Stata Code
• The Stata command for a logistic regression is:

  logit depvar ind1 ind2 ... , or

• The or option reports odds ratios instead of coefficients
  – Important; otherwise you will have trouble with the interpretation
Example: Risk of SSI in Liver Tx
• Consider an example from our liver transplant data
• What factors are associated with risk of surgical site infection?
logit ssi age4049 age5059 age60 female /// black ab0 ab1 ab2 ab3, or
Example: Risk of SSI
• Without the or option
Logistic regression                               Number of obs   =        777
                                                  LR chi2(9)      =      10.04
                                                  Prob > chi2     =     0.3473
Log likelihood = -509.33314                       Pseudo R2       =     0.0098
------------------------------------------------------------------------------
         ssi |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     age4049 |  -.1787575   .2001108    -0.89   0.372    -.5709675    .2134525
     age5059 |  -.3355092   .2017607    -1.66   0.096     -.730953    .0599346
       age60 |  -.2626236   .2338076    -1.12   0.261    -.7208781     .195631
      female |  -.3633933   .1517322    -2.39   0.017     -.660783   -.0660036
       black |   .0281334   .3712022     0.08   0.940    -.6994095    .7556762
         ab0 |   .4821227    .594928     0.81   0.418    -.6839148     1.64816
         ab1 |  -.2625846    .473966    -0.55   0.580    -1.191541    .6663716
         ab2 |   .0844837   .2251664     0.38   0.708    -.3568343    .5258017
         ab3 |  -.0755279   .1682882    -0.45   0.654    -.4053666    .2543109
       _cons |   -.146457   .1757599    -0.83   0.405    -.4909401    .1980261
------------------------------------------------------------------------------
Example: Risk of SSI
• With the or option
Logistic regression                               Number of obs   =        777
                                                  LR chi2(9)      =      10.04
                                                  Prob > chi2     =     0.3473
Log likelihood = -509.33314                       Pseudo R2       =     0.0098
------------------------------------------------------------------------------
         ssi | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     age4049 |   .8363087   .1673544    -0.89   0.372     .5649786    1.237945
     age5059 |    .714974   .1442537    -1.66   0.096      .48145     1.061767
       age60 |   .7690313   .1798054    -1.12   0.261      .486325    1.216078
      female |   .6953129   .1055014    -2.39   0.017     .5164468    .9361275
       black |   1.028533   .3817936     0.08   0.940     .4968786    2.129051
         ab0 |   1.619508   .9634909     0.81   0.418     .5046376    5.197409
         ab1 |   .7690613   .3645089    -0.55   0.580     .3037529    1.947159
         ab2 |   1.088155   .2450159     0.38   0.708     .6998885    1.691815
         ab3 |   .9272539   .1560458    -0.45   0.654     .6667323    1.289573
       _cons |   .8637629   .1518149    -0.83   0.405     .6120508    1.218994
------------------------------------------------------------------------------
Goodness of Fit
• How can you tell when you have a good model?
• Usual statistic is the "C" statistic
• Comparable to the R2 in linear regression
  – Lower bound of 0.5
  – Upper bound of 1
• Computed as the area under the Receiver Operating Characteristic (ROC) curve
• Stata command: lroc
• Run it on the line right after the logit statement
Goodness of Fit
Logistic model for ssi

number of observations = 777
area under ROC curve   = 0.5607
R Code
• The R function that performs logistic regression is glm()
  – glm = Generalized Linear Model
  – Must specify that the data come from the binomial family of distributions
• Create a glm object, then summarize:
  – lr1 <- glm(dat$depvar ~ dat$indvar1 + ... + dat$indvark, family = "binomial")
  – summary(lr1)
  – confint(lr1)
R Results

Call:
glm(formula = data1$ssi ~ data1$age4049 + data1$age5059 + data1$age60 +
    data1$female + data1$black + data1$abmm, family = "binomial")

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.1589  -0.9771  -0.8660   1.3309   1.5647

Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)    -0.08709    0.31192  -0.279   0.7801
data1$age4049  -0.17923    0.19955  -0.898   0.3691
data1$age5059  -0.33866    0.20120  -1.683   0.0923 .
data1$age60    -0.27206    0.23308  -1.167   0.2431
data1$female   -0.36278    0.15153  -2.394   0.0167 *
data1$black     0.04339    0.36991   0.117   0.9066
data1$abmm     -0.02186    0.08261  -0.265   0.7913
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1028.7  on 776  degrees of freedom
Residual deviance: 1020.1  on 770  degrees of freedom
AIC: 1034.1

Number of Fisher Scoring iterations: 4
R Code
• R will not give you odds ratios
• You have to force them:
  – exp(cbind(OR = coef(lr1), confint(lr1)))
R Results
> exp(cbind(OR = coef(lr1), confint(lr1)))
Waiting for profiling to be done...
                     OR     2.5 %    97.5 %
(Intercept)   0.9165969 0.4952724 1.6866575
data1$age4049 0.8359172 0.5648019 1.2356989
data1$age5059 0.7127235 0.4796774 1.0563229
data1$age60   0.7618093 0.4806089 1.1999588
data1$female  0.6957407 0.5162789 0.9354321
data1$black   1.0443432 0.4940752 2.1331501
data1$abmm    0.9783733 0.8326251 1.1517419
R Code
• For area under the curve, you need an add-on package:
  – install.packages("Deducer")
  – library(Deducer)
• This gives you a new function called rocplot()
• Apply the function to your glm object:
  – rocplot(lr1)
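The AUC that rocplot() reports can also be cross-checked by hand: the C statistic is the probability that a randomly chosen observation with the event gets a higher predicted probability than one without. A minimal pure-Python sketch with made-up outcomes and fitted probabilities:

```python
def c_statistic(y, p):
    """Area under the ROC curve, computed as the fraction of
    (event, non-event) pairs in which the event observation has the
    higher predicted probability (ties count as half)."""
    events = [pi for yi, pi in zip(y, p) if yi == 1]
    non_events = [pj for yj, pj in zip(y, p) if yj == 0]
    wins = sum(1.0 if pi > pj else 0.5 if pi == pj else 0.0
               for pi in events for pj in non_events)
    return wins / (len(events) * len(non_events))

# toy data: binary outcomes and fitted probabilities from some model
y = [1, 1, 0, 0, 0]
p = [0.8, 0.4, 0.5, 0.3, 0.2]
print(c_statistic(y, p))  # 5 of 6 pairs ranked correctly -> ~0.833
```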
R Results
[Figure: ROC curve for the SSI model (Sensitivity vs. 1 - Specificity), AUC = 0.5607]
Reporting Logistic Regression
• Include the odds ratio, its 95% confidence interval, and p-value – Do not include the Intercept!! – Do mention the baseline probability in your text
• For p-values less than 0.0001, indicate with “< 0.0001” rather than the actual p-value
• Indicate the reference group
• Include the C-statistic in the table caption
  – Usually don't include a plot of the ROC curve
The Narrative
Women had 30% lower odds of developing SSI relative to men (OR = 0.70, p = 0.02), and African Americans had 4% greater odds of developing SSI, although this effect was not statistically significant (p = 0.94).
Summary
• Regression analysis for binary dependent variable – Fits the data to a cumulative logistic distribution
• Raw parameters tell you the impact of a one-unit change in the covariates on the log odds of the event of interest
• Raising e to the power of your coefficients gives you their odds ratios, or the impact of a one-unit change in your covariate on the odds of the event
• C-statistic measures goodness of fit
Homework
• Using the Liver Transplant data: – What factors determine the probability of dying?
• Primary interest is in the effect of SSI on risk of death
• Use the other covariates you used in your cost analysis
• Find at least one other *new* variable in the data set that you did not include in your cost analysis
– Is it better to use age as a continuous covariate or categories?
– Is it better to use HLA mismatches as a continuous covariate or categories?