Post on 13-Jan-2016
April 4
• Logistic Regression
  – Lee Chapter 9
  – Cody and Smith 9:F
HRT Use and Polyps

              Case (Polyps)   Control (No Polyps)   Total
HRT Use             72               175             247
No HRT Use         102               114             216
Total              174               289             463

OR for HRT use (case v control) = (72/102) / (175/114) = 0.46

χ² = 463 × (72×114 − 175×102)² / [(174)(289)(247)(216)] = 16.04
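The arithmetic on this slide can be checked with a short Python sketch. The variable names are illustrative, and the chi-square formula is assumed to be the usual uncorrected Pearson statistic for a 2x2 table, which reproduces the value on the slide.

```python
# 2x2 table from the slide: rows = HRT use / no HRT use, columns = case / control
a, b = 72, 175    # HRT use:    cases (polyps), controls (no polyps)
c, d = 102, 114   # no HRT use: cases (polyps), controls (no polyps)
n = a + b + c + d

# Odds of HRT use among cases divided by odds of HRT use among controls
odds_ratio = (a / c) / (b / d)

# Pearson chi-square for a 2x2 table, without continuity correction
chi_square = n * (a * d - b * c) ** 2 / ((a + c) * (b + d) * (a + b) * (c + d))

print(f"OR = {odds_ratio:.2f}, chi-square = {chi_square:.2f}")
```

Swapping rows and columns (odds of polyps among users versus non-users) gives the same odds ratio, which is why the measure works for case-control data.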
Inference for binary data
• Relative risk, odds ratios, 2x2 tables are limited
  – Can’t adjust for many confounders
  – Limited to categorical predictors
  – Can’t look at multiple variables simultaneously
• Logistic regression
  – Adjust for many confounders
  – Study continuous predictors
  – Model interactions
Linear regression model
Y = β0 + β1X1 + β2X2 + ... + βpXp
Y = dependent variable
Xi = independent variables
Y is continuous, normally distributed
Model the mean response (Y) based on the predictors
β0 is the mean of Y when all Xs are 0
βi is the increase in the mean of Y for a 1-unit increase in Xi
New regression model?
Y? = β0 + β1X1 + β2X2 + ... + βpXp
Y = binary outcome (0 or 1)
Xi = independent variables
Would like to use this type of model for a binary outcome variable
Draw a line ?
What if you had multiple observations at each Score (or you grouped scores)?
Score Proportion Dying
< 10 1/10 = 0.10
11-20 4/15 = 0.27
21-30 5/15 = 0.33
31-40 8/16 = 0.50
Possibilities for Y
Y? = β0 + β1X1 + β2X2 + ... + βpXp
Y = probability of Y = 1 (Problem: Y is bounded between 0 and 1)
Y = odds of Y = 1
Y = log (odds of Y = 1) – Has good properties
Probability, Odds, Log Odds

Probability   Odds    Log(Odds)
0.01           0.01    -4.60
0.10           0.11    -2.20
0.20           0.25    -1.39
0.30           0.43    -0.85
0.40           0.63    -0.41
0.50           1.00     0.00
0.60           1.50     0.41
0.70           2.33     0.85
0.80           4.00     1.39
0.90           9.00     2.20
0.99          99.00     4.60

Probability: bounded between 0 and 1
Odds: extreme values
Log(odds): less extreme values and symmetric about π = 0.5
Log(odds) is nearly a straight line for middle values of P
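A small Python sketch shows how each probability maps to odds and log odds. The function names `odds` and `log_odds` are illustrative choices, not names from the course materials.

```python
import math

def odds(p):
    """Odds corresponding to probability p (0 < p < 1)."""
    return p / (1 - p)

def log_odds(p):
    """Natural log of the odds, also called the logit of p."""
    return math.log(odds(p))

probabilities = [0.01, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 0.99]

print(f"{'P':>5} {'Odds':>7} {'Log odds':>9}")
for p in probabilities:
    print(f"{p:5.2f} {odds(p):7.2f} {log_odds(p):9.2f}")
```

The symmetry is visible in the code as well: `log_odds(p)` and `log_odds(1 - p)` differ only in sign.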
Logistic regression equation
Model log odds of outcome as a linear function of one or more variables
The model is:
log(π / (1 − π)) = β0 + β1x1 + β2x2 + ...
π = probability of the outcome
xi = predictors, independent variables
A Little Math
• The natural LOG and exponential (EXP) functions are inverse functions of each other
  – LOG(a) = b ⇔ EXP(b) = a
  – LOG(1) = 0, EXP(0) = 1
  – LOG(.5) = -0.693, EXP(-.693) = .5
  – LOG(1.5) = .405, EXP(.405) = 1.5
Values such as -0.693 and .405 will be logistic regression betas; values such as .5 and 1.5 will be the odds ratios
Note: Calculators and Excel use LN for natural logarithm
A Little Math
• LOG function
  – Takes values [0 to +infinity] → [-infinity to +infinity]
• EXP function
  – Takes values [-infinity to +infinity] → [0 to +infinity]
A Little Math
• Properties of LOG function
  – log(a*b) = log(a) + log(b)
  – log(a/b) = log(a) – log(b)
• Properties of EXP function
  – exp(a+b) = exp(a) * exp(b)
  – exp(a-b) = exp(a) / exp(b)
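These identities are easy to confirm numerically. The sketch below uses Python's `math.log` (natural log) and `math.exp`; the values of `a` and `b` are arbitrary positive numbers chosen for illustration.

```python
import math

# Values quoted on the slide: LOG(.5) = -0.693 and LOG(1.5) = .405
print(round(math.log(0.5), 3), round(math.log(1.5), 3))

a, b = 1.5, 0.5  # arbitrary positive numbers, not from the slides

# LOG and EXP undo each other
inverse_ok = math.isclose(math.exp(math.log(a)), a)

# Products and ratios become sums and differences on the log scale
log_product_ok = math.isclose(math.log(a * b), math.log(a) + math.log(b))
log_ratio_ok = math.isclose(math.log(a / b), math.log(a) - math.log(b))

# Sums and differences become products and ratios after EXP
exp_sum_ok = math.isclose(math.exp(a + b), math.exp(a) * math.exp(b))
exp_diff_ok = math.isclose(math.exp(a - b), math.exp(a) / math.exp(b))

print(inverse_ok, log_product_ok, log_ratio_ok, exp_sum_ok, exp_diff_ok)
```

The ratio rule is the one that matters most here: a difference in log odds, once exponentiated, is a ratio of odds.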
Differences in log odds and odds ratios
• Differences in log(odds): these will be typical betas from the logistic regression model
• EXP of those differences: these will be the odds ratios
Logistic regression – single binary covariate
The model is:
log(π / (1 − π)) = β0 + β1x
We need to use a dummy variable to code for men and women
x = 1 for women, 0 for men
What do the betas mean? What is the odds ratio, women versus men?
Odds for Men and Women
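For men:
log(π / (1 − π)) = β0 + β1x = β0 + β1(0) = β0
For women:
log(π / (1 − π)) = β0 + β1x = β0 + β1(1) = β0 + β1
Subtracting the two equations gives log(odds for women) – log(odds for men) = β1. Since log(a) – log(b) = log(a/b), after some algebra the odds ratio is equal to:
(odds for women) / (odds for men) = exp(β1)
β1 is the difference in log odds between men and women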
Example - risk of CVD for men vs. women
log(odds) = β0 + β1x
= -2.5504 - 1.0527*x
For females: log(odds) = -2.5504 - 1.0527(1) = -3.6031
For males: log(odds) = -2.5504 - 1.0527(0) = -2.5504
exp(β1) = odds ratio for women vs. men
Here, exp(β1) = exp(-1.0527) = 0.35
Women have 65% lower odds of the outcome than men (OR < 1)
Difference in log odds (women – men) = -1.0527
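The example can be traced step by step in Python. The coefficients are the ones on the slide; the conversion from log odds back to a probability (p = odds / (1 + odds)) is not shown on the slide and is added here only for illustration.

```python
import math

b0, b1 = -2.5504, -1.0527   # coefficients from the slide; x = 1 for women, 0 for men

def log_odds(x):
    """Log odds of the outcome for a given value of the dummy variable x."""
    return b0 + b1 * x

def probability(x):
    """Convert the log odds back to a probability: p = odds / (1 + odds)."""
    o = math.exp(log_odds(x))
    return o / (1 + o)

# Ratio of the odds for women (x = 1) to the odds for men (x = 0)
odds_ratio = math.exp(log_odds(1)) / math.exp(log_odds(0))

print(f"log odds: women {log_odds(1):.4f}, men {log_odds(0):.4f}")
print(f"odds ratio, women vs. men: {odds_ratio:.2f}")
print(f"probability: women {probability(1):.3f}, men {probability(0):.3f}")
```

The ratio of the two odds equals `math.exp(b1)`, so the intercept drops out of the odds ratio entirely.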
Note
• Odds ratio from 2 x 2 table
• EXP(β) from logistic regression for binary risk factor
• These will be equal
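One way to see why the two agree is sketched below, using the HRT and polyps counts from the 2x2 table earlier in these notes. It relies on a standard fact that is not stated on the slides: with a single binary predictor, the maximum likelihood fit reproduces the observed proportions, so the estimates have a closed form. The code then checks that the likelihood equations are satisfied at those values.

```python
import math

# 2x2 counts from the HRT and polyps table earlier in these notes
cases_hrt, controls_hrt = 72, 175        # x = 1 (HRT use)
cases_none, controls_none = 102, 114     # x = 0 (no HRT use)

odds_ratio_table = (cases_hrt / controls_hrt) / (cases_none / controls_none)

# Closed-form estimates for a model with one binary predictor:
# b0 is the log odds when x = 0, b1 is the difference in log odds.
b0 = math.log(cases_none / controls_none)
b1 = math.log(cases_hrt / controls_hrt) - b0

def fitted_probability(x):
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

# Score equations (derivatives of the log likelihood); both are 0 at the MLE.
resid_hrt = cases_hrt - (cases_hrt + controls_hrt) * fitted_probability(1)
resid_none = cases_none - (cases_none + controls_none) * fitted_probability(0)
score_b0 = resid_hrt + resid_none
score_b1 = resid_hrt

print(f"OR from table = {odds_ratio_table:.2f}, exp(b1) = {math.exp(b1):.2f}")
```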
Multiple logistic regression model
log(odds) = β0 + β1X1 + β2X2 + ... + βpXp
log(odds) = logarithm of the odds for the outcome, dependent variable
Xi = predictors, independent variables
βi = log(OR) associated with either
• exposure (for categorical predictors)
• a 1 unit increase in predictor (for continuous)
OR adjusted for other variables in model
Interpretation of coefficients - continuous predictors
Example - effect of age on risk of death in 10 years
log(odds) = -8.2784+ 0.1026*age
β0 = -8.2784, β1 = 0.1026
exp(β1) = exp(0.1026) = 1.108
A one year increase in age is associated with an odds ratio of death of 1.108 (assuming this holds for any 2 consecutive ages)
This is an increase in the odds of approximately 11% (= 1.108 - 1)
Interpretation of coefficients - continuous predictors
What about a 5 year increase in age?
Multiply the coefficient by the change you want to look at:
exp(5*β1) = exp(5*0.1026) = 1.67
A five year increase in age is associated with an odds ratio of death of 1.67
This is an increase in the odds of 67%
Note: exp(5*β1) does not equal 5*exp(β1)
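The note about scaling can be checked directly. This sketch uses the age coefficient from the slide and compares the correct 5-year odds ratio with the incorrect shortcut of multiplying the 1-year odds ratio by 5.

```python
import math

b1 = 0.1026   # log odds ratio per 1-year increase in age, from the slide

or_1_year = math.exp(b1)
or_5_years = math.exp(5 * b1)   # multiply the coefficient, then exponentiate

print(f"1-year OR = {or_1_year:.3f}, 5-year OR = {or_5_years:.2f}")
print(f"(1-year OR)^5 = {or_1_year ** 5:.2f}, 5 * (1-year OR) = {5 * or_1_year:.2f}")
```

Because exp(5*b) = exp(b)^5, odds ratios compound multiplicatively across years, much like interest.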
Parameter Estimation
• How do we come up with estimates for βi?
• Can’t use least squares since outcome is not continuous
• Use Maximum Likelihood Estimation (MLE)
Maximum Likelihood Estimation
• Choose parameter estimates that maximize the probability of observing the data you observed.
• Example for estimating a proportion
  – Observe 7/10 have characteristic
  – P = 0.70 is the estimate
  – P = 0.70 is the MLE of π (Why?)
  – Which value of π maximizes the probability of getting 7 of 10?
  – Answer: 0.70
MLE Simple Example
• Wish to estimate a proportion π
• Sample n = 2
  – Observe 1 of 2 have characteristic
  – L = π(1 − π), up to a constant factor
  – What value of π maximizes L?
  – Answer: π = 0.5, which is p = 1/2
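Both MLE examples can be reproduced with a brute-force search over candidate values of π. This is only a sketch to build intuition: the helper names and the grid of 1001 candidate values are assumptions made for illustration, and real software uses iterative algorithms rather than a grid.

```python
import math

def binomial_likelihood(pi, successes, n):
    """Probability of observing `successes` out of `n` when the true proportion is pi."""
    return math.comb(n, successes) * pi ** successes * (1 - pi) ** (n - successes)

def grid_mle(successes, n, steps=1000):
    """Return the grid value of pi in [0, 1] with the largest likelihood."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda pi: binomial_likelihood(pi, successes, n))

print(grid_mle(7, 10))   # 7 of 10 have the characteristic
print(grid_mle(1, 2))    # 1 of 2 have the characteristic
```

In both cases the maximizing value is the observed proportion, which is why the sample proportion is the MLE.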
Fitted regression line
Curve based on:
log(p / (1 − p)) = β0 + β1x
β0 affects location
β1 affects curvature
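Solving the model equation for p gives the S-shaped fitted curve. The sketch below uses the age coefficients from the earlier slide on risk of death in 10 years; the function name and the ages chosen to trace the curve are illustrative.

```python
import math

def fitted_probability(x, b0, b1):
    """Invert log(p / (1 - p)) = b0 + b1*x to get p = 1 / (1 + exp(-(b0 + b1*x)))."""
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

b0, b1 = -8.2784, 0.1026   # age model from the earlier slide

for age in range(50, 101, 10):   # ages chosen only to trace the curve
    print(age, round(fitted_probability(age, b0, b1), 3))

# The curve crosses p = 0.5 where b0 + b1*x = 0
midpoint = -b0 / b1
print(round(midpoint, 1))
```

Changing `b0` shifts the curve left or right along the x axis, while a larger `b1` makes the rise through p = 0.5 steeper.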
Inference for multiple logistic regression
• Collect data, choose model, estimate β0 and the βi's
• Describe odds ratios, exp(βi), in statistical terms.
  – How confident are we of our estimate?
  – Is the odds ratio different from one due to chance?
Not interested in inference for β0 (related to overall probability of outcome)
Confidence Intervals for logistic regression coefficients
• General form of 95% CI: Estimate ± 1.96*SE
  – bi is the estimate, provided by SAS
  – SE is complicated, provided by SAS
    • Related to variability of our data and sample size
95% Confidence Intervals for the odds ratio
• Based on transforming the 95% confidence interval for the parameter estimates:
  (e^(bi – 1.96*SE), e^(bi + 1.96*SE))
• Supplied automatically by SAS
• Look to see if interval contains 1
  – If it does not: “We have a statistically significant association between the predictor and the outcome controlling for all other covariates”
• Equivalent to a hypothesis test; reject Ho: OR = 1 at alpha = 0.05 if 1 is not in the interval
Hypothesis test for individual logistic regression coefficient
• Null and alternative hypotheses
  – Ho: βi = 0, Ha: βi ≠ 0
• Test statistic: χ² = (bi / SE)², supplied by SAS
• p-values are supplied by SAS
• If p<0.05, “there is a statistically significant association between the predictor and outcome variable controlling for all other covariates” at alpha = 0.05
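The quantities SAS reports for a coefficient can be rebuilt from the estimate and its standard error. The sketch below is one way to do that in Python; `wald_summary` is a hypothetical helper name, and the inputs 0.2034 and 0.0605 are taken from the PROC LOGISTIC output shown later in these notes. The p-value uses the identity that, for 1 degree of freedom, the upper tail of the chi-square distribution equals erfc(sqrt(c / 2)).

```python
import math

def wald_summary(b, se, z=1.96):
    """Odds ratio, 95% CI, Wald chi-square (1 df) and p-value for one coefficient."""
    odds_ratio = math.exp(b)
    ci = (math.exp(b - z * se), math.exp(b + z * se))
    chi_square = (b / se) ** 2
    # For 1 degree of freedom, P(chi-square > c) = erfc(sqrt(c / 2))
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return odds_ratio, ci, chi_square, p_value

odds_ratio, ci, chi_square, p_value = wald_summary(0.2034, 0.0605)
print(f"OR = {odds_ratio:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"Wald chi-square = {chi_square:.2f}, p = {p_value:.4f}")
```

Because the inputs are rounded to four decimals, the chi-square computed this way can differ slightly in the later digits from the value SAS prints.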
PROC LOGISTIC
PROC LOGISTIC DATA = dataset;
  MODEL outcome = list of x variables;
RUN;
• CLASS statement allows for categorical variables with many groups (>2)

DATA temp;
  INPUT apache death @@;
  * Recode so that death is the first ordered level, the one PROC LOGISTIC models by default;
  xdeath = 2;
  IF death = 1 THEN xdeath = 1;
DATALINES;
0 0 2 0 3 0 4 0 5 0
6 0 7 0 8 0 9 0 10 0
11 0 12 0 13 0 14 0 15 0
16 0 17 1 18 1 19 0 20 0
21 1 22 1 23 0 24 1 25 1
26 1 27 0 28 1 29 1 30 1
31 1 32 1 33 1 34 1 35 1
36 1 37 1 38 1 41 0
;
PROC LOGISTIC DATA=temp;
  MODEL xdeath = apache;
RUN;
The LOGISTIC Procedure

Model Information
Data Set                     WORK.TEMP
Response Variable            xdeath
Number of Response Levels    2
Number of Observations       39
Model                        binary logit
Optimization Technique       Fisher's scoring

Response Profile
Ordered Value   xdeath   Total Frequency
1               1        18
2               2        21

Probability modeled is xdeath=1.

Analysis of Maximum Likelihood Estimates
Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept   1    -4.3861    1.3687           10.2686           0.0014
apache      1     0.2034    0.0605           11.3093           0.0008

Odds Ratio Estimates
Effect   Point Estimate   95% Wald Confidence Limits
apache   1.226            1.089    1.380

Point estimate = EXP(0.2034)
Lower limit = EXP(0.2034 – 1.96*.0605)
Upper limit = EXP(0.2034 + 1.96*.0605)
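To make the "Fisher's scoring" line in the output less of a black box, here is a minimal Python sketch of the same kind of iterative fit for one predictor. It is not SAS's implementation: `fit_logistic` is a hypothetical helper, the starting values of zero and the fixed 25 iterations are assumptions, and the data are the 39 (apache, death) pairs from the DATALINES block above.

```python
import math

# APACHE scores 0, 2-38 and 41, as in the DATALINES block above
scores = [0] + list(range(2, 39)) + [41]
died_at = {17, 18, 21, 22, 24, 25, 26, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38}
deaths = [1 if s in died_at else 0 for s in scores]

def fit_logistic(x, y, iterations=25):
    """Newton-Raphson (equivalent to Fisher scoring for the logit model), one predictor."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        p = [1 / (1 + math.exp(-(b0 + b1 * xi))) for xi in x]
        w = [pi * (1 - pi) for pi in p]
        # Score vector: gradient of the log likelihood
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum(xi * (yi - pi) for xi, yi, pi in zip(x, y, p))
        # Information matrix [[i00, i01], [i01, i11]]
        i00 = sum(w)
        i01 = sum(wi * xi for wi, xi in zip(w, x))
        i11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = i00 * i11 - i01 * i01
        # Newton step: add inverse(information) times score
        b0 += (i11 * g0 - i01 * g1) / det
        b1 += (i00 * g1 - i01 * g0) / det
    # Standard errors from the inverse information at the last iteration
    se0 = math.sqrt(i11 / det)
    se1 = math.sqrt(i00 / det)
    return b0, b1, se0, se1

b0, b1, se0, se1 = fit_logistic(scores, deaths)
print(f"Intercept {b0:.4f} (SE {se0:.4f}), apache {b1:.4f} (SE {se1:.4f})")
print(f"Odds ratio per 1-point increase in apache: {math.exp(b1):.3f}")
```

If the data are transcribed correctly, the estimates should land close to those in the Analysis of Maximum Likelihood Estimates table, since both approaches maximize the same likelihood.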
TOMHS – bpstudy sas dataset
• Variable CLINICAL (1=yes, 0 =no) indicates whether patient had a CVD event
• Run logistic regression separately for age and gender to determine if:
– Age is related to CVD
• What is the odds ratio associated with a 1 year increase in age?
• What is the odds ratio associated with a 5 year increase in age?
– Gender is related to CVD
• What is the odds ratio of CVD (women versus men)?
• Run logistic regression for age and gender together
• Note: Download dataset from web-page or use dataset on SATURN