LOGISTIC REGRESSION TUTORIAL - Technology - …course1.winona.edu/bdeppa/STAT 600/Handouts/STAT...


STAT 600 - LOGISTIC REGRESSION

Example 1 - High Dieldrin Levels in Western Australian Breast Feeding Mothers
Data File: Pestmilk.JMP

These data come from a study of breast feeding mothers in Western Australia in 1979-80. Earlier research discovered surprisingly high levels of pesticides in human breast milk. The research conducted in 1979-80 hoped to show that the levels had decreased as a result of stricter government regulations on the use of pesticides on food crops. They did find decreases for several types of pesticides. Levels of the pesticide Dieldrin, however, had substantially increased. These data were collected in the hope of explaining why.

For 45 breast milk donors, we have information on the mother's age in years, whether they lived in a new suburb (0 = no, 1 = yes), whether their house had been treated for termites within the past three years (0 = no, 1 = yes), and whether their breast milk contained above average (> .009 ppm) levels of the pesticide Dieldrin.  By law new homes are treated for termites in Australia.

The variables in the Pestmilk.JMP data file are:
Age - age of mother (yrs.)
NS - new suburb indicator (1 = yes, 0 = no)
HT - house treated for termites in the last 3 years (1 = yes, 0 = no)
HD - high Dieldrin level (1 = yes, 0 = no)
New Suburb (New or Old)
Home Treated (HT = house treated or NT = not treated)
High Dieldrin (High or Low)

Note: For interpretation purposes it is sometimes necessary to reorder the levels of the nominal variables used in the logistic regression model. This can be done by right-clicking at the top of a column and selecting Value Ordering from the Column Info… pull-out menu.

One way to examine the relationship between the response (High Dieldrin) and the predictors (age, New Sub & Treated) is to construct 2×2 contingency tables and compute conditional probabilities, relative risks, and odds ratios. The tables and plots below were obtained in JMP by using Fit Y by X and placing each of the predictors (New Sub & Treated) in the X box and the response (High Dieldrin) in the Y box. The results are shown below.

The plots and the contingency tables, with the conditional probabilities added, suggest that both living in a new suburb (New Sub) and living in a home treated for termites (HT) are associated with an increased risk of having high dieldrin levels in breast milk.

Contingency Analysis of High Dieldrin by Home Treated


OR = (13*16)/(3*11) = 6.30
Mothers living in a home treated for termites have 6.30 times higher odds of having high dieldrin levels in their breast milk when compared to mothers living in homes not treated for termites.

Contingency Analysis of High Dieldrin by New Suburb

OR = (7*22)/(9*5) = 3.42
Mothers living in a new suburb have 3.42 times the odds of having high dieldrin levels in their breast milk when compared to mothers living in an older suburb.
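The two odds ratios above can be reproduced from the cell counts with a short Python sketch. The function name `odds_ratio` is hypothetical, and the assignment of counts to cells is an assumption: the counts are read off the OR formulas above (for example, 13 High and 11 Low among treated homes) rather than copied from the JMP tables.

```python
def odds_ratio(exposed_high, exposed_low, unexposed_high, unexposed_low):
    """Odds ratio for a 2x2 table: odds of High if exposed / odds of High if unexposed."""
    odds_exposed = exposed_high / exposed_low
    odds_unexposed = unexposed_high / unexposed_low
    return odds_exposed / odds_unexposed

# Cell counts inferred from the OR formulas in the text (an assumption).
or_treated = odds_ratio(13, 11, 3, 16)     # Home Treated vs. Not Treated
or_new_suburb = odds_ratio(7, 5, 9, 22)    # New Suburb vs. Old Suburb

print(f"OR, home treated: {or_treated:.2f}")
print(f"OR, new suburb:   {or_new_suburb:.2f}")
```

Writing the OR as a ratio of two odds, rather than as the cross-product (a*d)/(b*c), makes the interpretation "times the odds" explicit; the two forms are algebraically identical.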

Logistic Regression Model


In logistic regression we model the log of odds for success as a function of the predictors using a linear model. For example, consider the logistic regression model for the risk factor New Suburb.

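$$\ln(\text{odds for high dieldrin}) = \ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1\,\text{NewSuburb}$$

where

$$\text{NewSuburb} = \begin{cases} +1 & \text{if mother lives in a new suburb} \\ -1 & \text{if mother lives in an old suburb} \end{cases}$$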

The log odds for a breast feeding mother living in a new suburb is given by

$$\ln(\text{odds for High for mothers living in a new suburb}) = \ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1$$

and for a mother living in an old suburb is given by

$$\ln(\text{odds for High for mothers living in an old suburb}) = \ln\left(\frac{p}{1-p}\right) = \beta_0 - \beta_1$$

The difference in the log odds is equivalent to the log of the odds ratio (OR) because of the following property of logarithms.

$$\ln(x) - \ln(y) = \ln\left(\frac{x}{y}\right)$$

Applying this property here we have

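$$\begin{aligned}
&\ln(\text{odds for High for mothers in a new suburb}) - \ln(\text{odds for High for mothers in an old suburb}) \\
&\quad = \ln\left(\frac{\text{odds for High for mothers in a new suburb}}{\text{odds for High for mothers in an old suburb}}\right) = (\beta_0 + \beta_1) - (\beta_0 - \beta_1) = 2\beta_1
\end{aligned}$$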

This says that the OR associated with living in a new suburb is given by

$$OR = e^{2\beta_1}$$
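To see the +1/-1 coding at work, the following Python sketch solves the two log-odds equations for the intercept and slope and confirms that exponentiating twice the slope recovers the odds ratio. It relies on the fact that a logistic model with a single two-level predictor reproduces the observed log odds in each group. The counts used (7 High and 5 Low in new suburbs, 9 High and 22 Low in old suburbs) are inferred from the OR formula given earlier and are an assumption here, as are the variable names.

```python
import math

# Observed log odds of High in each group (counts are inferred, see lead-in).
log_odds_new = math.log(7 / 5)    # equals beta0 + beta1 (NewSuburb = +1)
log_odds_old = math.log(9 / 22)   # equals beta0 - beta1 (NewSuburb = -1)

# Solve the two equations for the coefficients.
beta0 = (log_odds_new + log_odds_old) / 2
beta1 = (log_odds_new - log_odds_old) / 2

# The difference in log odds is 2*beta1, so the OR is exp(2*beta1).
or_from_beta = math.exp(2 * beta1)
or_from_table = (7 * 22) / (9 * 5)

print(f"beta0 = {beta0:.4f}, beta1 = {beta1:.4f}")
print(f"exp(2*beta1) = {or_from_beta:.2f}, table OR = {or_from_table:.2f}")
```

With 0/1 coding the same odds ratio would be exp(beta1) instead; the factor of 2 appears only because the two levels are two units apart under the +1/-1 coding.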

Fitting the New Suburb Logistic Regression Model in JMP


Select Fit Model and place High Dieldrin in the Y box and New Suburb in the Model Effects box.

Resulting output…

The estimated OR associated with living in a new suburb is then $e^{2\hat\beta_1}$, where $\hat\beta_1$ is the estimated New Suburb coefficient from the output.


We can use JMP to compute the OR’s by selecting Nominal Logistic > Odds Ratio

Similarly for House Treated we have the following logistic regression model.


Finding Predicted Probabilities

The logistic regression model can be used to estimate the probability of “success” given a set of predictor values as follows:

$$p = P(\text{success} \mid X) = \frac{e^{\hat\beta_0 + \hat\beta_1 X}}{1 + e^{\hat\beta_0 + \hat\beta_1 X}}$$

for situations where we have a single predictor, and is given by

$$p = P(\text{success} \mid X) = \frac{e^{\hat\beta_0 + \hat\beta_1 X_1 + \cdots + \hat\beta_p X_p}}{1 + e^{\hat\beta_0 + \hat\beta_1 X_1 + \cdots + \hat\beta_p X_p}}$$

for situations where we have p predictors.

For the example above we can estimate the probability of high dieldrin levels for women living in a home treated for termites as follows:

$$P(\text{High} \mid \text{House Treated}) = \frac{e^{-.7535 + .9205}}{1 + e^{-.7535 + .9205}} = .5417$$

$$P(\text{High} \mid \text{House Not Treated}) = \frac{e^{-.7535 - .9205}}{1 + e^{-.7535 - .9205}} = .1579$$
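The two calculations above can be checked with a minimal Python sketch of the single-predictor formula. The function name `predicted_probability` is hypothetical; the coefficients are the House Treated estimates quoted in the text, with the predictor coded +1 for treated and -1 for not treated.

```python
import math

def predicted_probability(beta0, beta1, x):
    """P(success | x) under a single-predictor logistic model."""
    eta = beta0 + beta1 * x          # linear predictor, i.e. the log odds
    return math.exp(eta) / (1 + math.exp(eta))

# Estimates quoted in the text for the House Treated model (+1 / -1 coding).
b0, b1 = -0.7535, 0.9205

p_treated = predicted_probability(b0, b1, +1)
p_not_treated = predicted_probability(b0, b1, -1)

print(f"P(High | House Treated)     = {p_treated:.4f}")
print(f"P(High | House Not Treated) = {p_not_treated:.4f}")
```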

How do these estimated probabilities compare to those we obtain by using a 2 X 2 contingency table?

We now consider the age effect. Again select Fit Y by X from the Analyze menu and place High Dieldrin in the Y box and age in the X box. The resulting output is given below.


The logistic model using age as a predictor is given by

$$\ln\left(\frac{p}{1-p}\right) = -4.0886156 + .1222 \cdot \text{Age}$$

Note: The response in logistic regression is the natural log of the odds for “success”.

The curve added to the plot gives P(High|Age) = p. For example, for mothers 25 years of age the predicted probability of finding a high dieldrin level in their breast milk is about .25. For mothers 35 years of age this probability increases to around .50. The distance from the top of the plot to the curve represents P(Low|Age). To attach an odds ratio to mother’s age we need to pick an incremental increase of interest; e.g., suppose we wanted to find the odds ratio associated with a 5-year increase in age. The associated odds ratio is found as follows:

$$\text{OR for 5-year increase in age} = e^{5 \cdot .1222} = 1.84$$

Thus for a 5-year increase in age a mother’s odds of having high dieldrin are 1.84 times higher; alternatively, there is an 84% increase in her odds of having high dieldrin levels in her breast milk.
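As a rough illustration of how the increment enters the calculation, here is a small Python sketch. The helper name `odds_ratio_for_increase` is hypothetical; the coefficient .1222 is the age estimate quoted above.

```python
import math

def odds_ratio_for_increase(beta, increment):
    """OR for an `increment`-unit increase in a continuous predictor."""
    return math.exp(beta * increment)

beta_age = 0.1222  # age coefficient quoted in the text

or_1yr = odds_ratio_for_increase(beta_age, 1)
or_5yr = odds_ratio_for_increase(beta_age, 5)

print(f"OR per 1-year increase: {or_1yr:.2f}")
print(f"OR per 5-year increase: {or_5yr:.2f}")
# Percent change in the odds for a 5-year increase.
print(f"Percent increase in odds over 5 years: {100 * (or_5yr - 1):.0f}%")
```

Note that the 5-year OR is the 1-year OR raised to the fifth power, not five times the 1-year OR, because the model is linear on the log-odds scale.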


Predicted Probabilities for Logistic Model Using Age

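We can use the logistic regression model to obtain predicted probabilities of high dieldrin levels as a function of age using:

$$P(\text{High} \mid \text{Age}) = \frac{e^{-4.089 + .1222 \cdot \text{Age}}}{1 + e^{-4.089 + .1222 \cdot \text{Age}}}$$

For example,

$$P(\text{High} \mid \text{Age} = 25) = \frac{e^{-4.089 + .1222 \cdot 25}}{1 + e^{-4.089 + .1222 \cdot 25}} = .2623$$

$$P(\text{High} \mid \text{Age} = 35) = \frac{e^{-4.089 + .1222 \cdot 35}}{1 + e^{-4.089 + .1222 \cdot 35}} = .5469$$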
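The same formula can be evaluated over several ages at once. The sketch below is illustrative only: the function name `p_high_given_age` is hypothetical, the coefficients are the rounded estimates used in the worked examples, and age 30 is added purely for illustration.

```python
import math

def p_high_given_age(age, beta0=-4.089, beta1=0.1222):
    """Predicted P(High | Age) from the age-only model quoted in the text."""
    eta = beta0 + beta1 * age
    return 1 / (1 + math.exp(-eta))   # algebraically equal to e^eta / (1 + e^eta)

# Ages 25 and 35 are the worked examples; 30 is added for illustration.
probabilities = {age: p_high_given_age(age) for age in (25, 30, 35)}

for age, p in probabilities.items():
    print(f"Age {age}: P(High) = {p:.4f}")
```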

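Multiple Logistic Regression Model

Now we consider a logistic regression model that uses all three predictors.

$$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1\,\text{NewSuburb} + \beta_2\,\text{Treated} + \beta_3\,\text{Age}$$

where

$$\text{NewSuburb} = \begin{cases} +1 & \text{if mother lives in a new suburb} \\ -1 & \text{if mother lives in an old suburb} \end{cases}$$

$$\text{Treated} = \begin{cases} +1 & \text{if mother lives in a home treated for termites} \\ -1 & \text{if mother lives in a home not treated for termites} \end{cases}$$

Age = mother’s age in years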

Select Fit Model from the Analyze menu and put the high dieldrin indicator in the Y box and Age, HT, and New Sub in the Effects in Model box as shown at the top of the following page.


The resulting output is shown below.

The Whole Model Test is testing
Ho: The logistic model is NOT useful
Ha: The logistic model is useful.
The p-value = .0013, so here we have evidence to suggest that the model is useful for explaining the presence of high dieldrin levels in a mother’s breast milk.

The Lack of Fit test is testing
Ho: The model is adequate.
Ha: The model is inadequate, i.e. there is lack of fit.
The p-value = .2220, so there is no evidence of lack of fit.


Finding OR’s associated with the predictors

For a dichotomous (two-level) categorical predictor, e.g. new suburb and house treated, in order to find the associated OR we do the following:

$$\text{OR associated with risk factor } i = \exp(2\hat\beta_i) = e^{2\hat\beta_i}$$

Examples:
For New Suburb we have:
For House Treated we have:

To find a crude 95% CI for the OR associated with risk factor i we compute

$$\exp\left(2\left(\hat\beta_i \pm (\text{normal or t-table value}) \cdot SE(\hat\beta_i)\right)\right)$$

which gives lower and upper confidence limits for the true OR associated with the risk factor.

Examples:
For New Suburb we have:

$$\exp\left(2(1.0703 \pm 1.96 \cdot .4678)\right) = (1.359,\ 53.22)$$

For House Treated we have:

$$\exp\left(2(1.2984 \pm 1.96 \cdot .4873)\right) = (1.986,\ 90.65)$$

These intervals are very wide because the sample size (n = 45) is not very big. Typically these types of studies require a much larger sample size to get precise CI’s for OR’s.
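The crude interval calculation can be sketched in Python as follows. The function name `crude_or_ci` is hypothetical; the estimates and standard errors are the ones quoted in the examples above, and 1.96 is the normal-table value for a 95% interval.

```python
import math

def crude_or_ci(beta_hat, se, z=1.96):
    """OR estimate and crude CI under +1/-1 coding: exp(2 * (beta_hat ± z * SE))."""
    lower = math.exp(2 * (beta_hat - z * se))
    upper = math.exp(2 * (beta_hat + z * se))
    return math.exp(2 * beta_hat), lower, upper

# Estimates and standard errors quoted in the text.
new_suburb = crude_or_ci(1.0703, 0.4678)
house_treated = crude_or_ci(1.2984, 0.4873)

for name, (or_hat, lower, upper) in [("New Suburb", new_suburb),
                                     ("House Treated", house_treated)]:
    print(f"{name}: OR = {or_hat:.2f}, 95% CI = ({lower:.3f}, {upper:.2f})")
```

Because the interval is built on the log-odds scale and then exponentiated, it is not symmetric around the OR estimate; the upper limit sits much farther from the estimate than the lower limit does.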

The Parameter Estimates and Effect Likelihood Ratio Tests both contain the results of tests that are used to test the significance of the predictors in the logistic model. Here we see that both the new suburb and house treated indicators are statistically significant at the .05 level, while mother’s age is significant at the .10 level.


We can obtain both the OR’s and their confidence intervals using JMP as follows. Select both the Odds Ratios and Confidence Intervals options. The options of interest are:

Odds Ratios – calculates the odds ratios for all predictors in the model.

Confidence Intervals – provides CI’s for the odds ratios, calculated using a method slightly different from the approach above.

ROC Curve – draws an ROC curve, which is shown and discussed later in the handout.

Save Probability Formula – saves P(High|X) to the data table.

Profiler – examines P(High|X) graphically.

Confusion Matrix – gives a misclassification matrix for classifying into the two response categories.

The Odds Ratios are shown below.


The OR’s associated with living in a home treated for termites and living in a new suburb are considerably larger than those found when examining each factor on its own. The differences arise because the factors themselves are potentially related, and as a result their estimated effects differ when they are placed in a model jointly.

The range odds ratio reported for age is found by using Max(Age) – Min(Age) as the incremental increase. For these data Max(Age) = 37 and Min(Age) = 21, thus a mother who is 37 has 28.055 times higher odds of having high dieldrin levels in her breast milk when compared to a mother who is 21 years of age. It is better to use an increment like 5 years instead, i.e. the OR associated with a 5-year increase in age is calculated as follows:

$$OR = \exp(.2083 \cdot 5) = \exp(1.042) = 2.833$$

The unit odds ratio uses an increment of 1 year.

As stated previously, the confidence intervals for all of the OR’s are quite broad in this study because the sample size is small (n = 45).

Predicted Probabilities Using All Available Predictors

The predicted probabilities of high dieldrin can be found as follows:

$$P(\text{High Dieldrin} \mid \text{House Treated, New Suburb, Age}) = \frac{e^{-6.604 + 1.070\,\text{NewSuburb} + 1.298\,\text{HouseTreated} + .2084\,\text{Age}}}{1 + e^{-6.604 + 1.070\,\text{NewSuburb} + 1.298\,\text{HouseTreated} + .2084\,\text{Age}}}$$

For example, the probability of high dieldrin for a 30 year old mother living in a home treated for termites in an old suburb is estimated to be:

$$P(\text{High} \mid \text{Old Suburb, House Treated, Age} = 30) = \frac{e^{-6.604 - 1.070 + 1.298 + .2084 \cdot 30}}{1 + e^{-6.604 - 1.070 + 1.298 + .2084 \cdot 30}} = .4690$$

For a 25 year old mother living in a home treated for termites located in a new suburb the probability of high dieldrin is estimated to be:

$$P(\text{High} \mid \text{New Suburb, House Treated, Age} = 25) = \frac{e^{-6.604 + 1.070 + 1.298 + .2084 \cdot 25}}{1 + e^{-6.604 + 1.070 + 1.298 + .2084 \cdot 25}} = .7259$$

We can save these probabilities to the data table using the Save Probability Formula option.
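Outside JMP, the same predicted probabilities can be computed directly from the fitted equation. The following Python sketch uses the rounded coefficients quoted in the formula above; the function and constant names are hypothetical, and the True/False arguments are converted to the +1/-1 coding used by the model.

```python
import math

# Coefficients quoted in the text (+1 / -1 coding for the two indicators).
INTERCEPT = -6.604
B_NEW_SUBURB = 1.070
B_HOUSE_TREATED = 1.298
B_AGE = 0.2084

def p_high(new_suburb, house_treated, age):
    """Predicted P(High Dieldrin) for one mother.

    new_suburb, house_treated: True/False, converted to +1/-1 internally.
    """
    ns = 1 if new_suburb else -1
    ht = 1 if house_treated else -1
    eta = INTERCEPT + B_NEW_SUBURB * ns + B_HOUSE_TREATED * ht + B_AGE * age
    return math.exp(eta) / (1 + math.exp(eta))

p_old_treated_30 = p_high(new_suburb=False, house_treated=True, age=30)
p_new_treated_25 = p_high(new_suburb=True, house_treated=True, age=25)

print(f"Old suburb, treated, age 30: {p_old_treated_30:.4f}")
print(f"New suburb, treated, age 25: {p_new_treated_25:.4f}")
```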


Estimates of the P(High Dieldrin|New Suburb, House Treated, Age)

Selecting Save Probability Formula from the Nominal Logistic Fit pull down menu places the predicted probabilities of high and low dieldrin levels in the spreadsheet along with the predicted status. The predicted status is determined by whichever probability is larger, low dieldrin level or high dieldrin level, given their demographics.

Here is a portion of this output which will appear back in the original data spreadsheet.

P(High|X) P(Low|X)

We can compare the predicted dieldrin status to the actual status via a contingency table. Select Fit Y by X from the Analyze menu and place Most Likely High Dieldrin in the X box and High Dieldrin in the Y box. The table and mosaic plot are shown below.

Contingency Analysis of High Dieldrin By MostLikely High Dieldrin

From the table we see that 26.7% of mothers classified as having high dieldrin levels actually had low dieldrin levels; similarly, 17.9% of those classified as having low dieldrin levels actually had high dieldrin levels. In total 9 out of 43 mothers were misclassified for an estimated overall error rate of 20.9%.
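The error rates quoted above follow from the cells of the classification table. The Python sketch below uses counts inferred from the reported percentages (4/15 = 26.7%, 5/28 = 17.9%, 9/43 = 20.9%); these counts are an assumption, not values copied from the JMP output, and the variable names are hypothetical.

```python
# Classification table counts inferred from the percentages in the text
# (an assumption, not copied from the JMP output).
pred_high = {"actual_high": 11, "actual_low": 4}
pred_low = {"actual_high": 5, "actual_low": 23}

n_pred_high = sum(pred_high.values())
n_pred_low = sum(pred_low.values())
n_total = n_pred_high + n_pred_low

# Share of each predicted class that actually belonged to the other class.
error_among_pred_high = pred_high["actual_low"] / n_pred_high
error_among_pred_low = pred_low["actual_high"] / n_pred_low

# Overall misclassification rate.
misclassified = pred_high["actual_low"] + pred_low["actual_high"]
overall_error = misclassified / n_total

print(f"Classified High but actually Low: {100 * error_among_pred_high:.1f}%")
print(f"Classified Low but actually High: {100 * error_among_pred_low:.1f}%")
print(f"Overall error rate: {misclassified}/{n_total} = {100 * overall_error:.1f}%")
```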

Receiver Operating Characteristic (ROC) Curve

The Receiver Operating Characteristic curve plots the true positive probability (sensitivity) vs. the false positive probability. As the sensitivity increases, the false positive rate increases as well, as expected. A good classification rule based upon a logistic model should have an area beneath the ROC curve of .90 or higher. Here we do not quite meet that standard.

Area Under ROC Curve = 0.83449
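To illustrate how an ROC curve and its area are built from predicted probabilities, here is a small Python sketch. The data are hypothetical and made up purely for illustration; they are NOT the Pestmilk data and will not reproduce the area reported above. The function names are likewise hypothetical, and JMP's own computation may differ in detail.

```python
def roc_points(labels, scores):
    """(false positive rate, true positive rate) pairs, one per threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    # Lower the threshold step by step; classify as High when score >= threshold.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points ordered by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical data for illustration only (NOT the Pestmilk data):
# 1 = High, 0 = Low, with made-up predicted probabilities of High.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(labels, scores)
auc = area_under_curve(points)

print(points)
print(f"AUC = {auc:.3f}")
```

The area equals the proportion of (High, Low) pairs in which the High case received the larger predicted probability, which is why values near 1 indicate good separation and .5 indicates none.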


Example 2: Risk Factors for Low Birth Weight

These data come from a case-control study where risk factors for having an infant with low birth weight (< 2500g) were studied. The following information was recorded for each mother in the study:
Low Birth Weight – indicator of birth weight status (Low or Normal)
Prev? – previous history of premature labor (History or None)
Hyper – hypertension during pregnancy (HT or Normal)
Smoke – mother smoked during pregnancy (Cig or No Cig)
Uterine – uterine irritability during pregnancy (Irritation or None)
Minority – minority status of mother (Nonwhite or White)
Age – age of mother
Lwt – mother’s weight at last menstrual cycle

Important JMP Note: For interpretation purposes it is best to code the outcome so that the adverse outcome is alphabetically first. The same is true for risk factors: code them so that the level associated with increased risk is alphabetically first.

To fit the multiple logistic regression model select Analyze > Fit Model and set up the dialog box as shown below.


After using backward elimination to remove non-significant predictors, uterine irritability and mother’s age here, we have the following.

The only predictor which represents something a mother could control or change is smoking during pregnancy. This is the primary factor of interest in this study and the other factors, while interesting, are there for control purposes only. In summarizing the effect of smoking we would see a phrase such as: “adjusting for age, pre-pregnancy weight, race, hypertension, uterine irritability, and previous history of premature labor, we find the OR associated with smoking is OR = 2.66.” This says that, after adjusting for these factors, the odds of having a low birth weight infant are 2.66 times larger for mothers who smoked during pregnancy.
