Page 1

1 MRes
Wednesday 11th March 2009
Logistic regression

Page 2

Programme

• A short talk.

• Break for coffee.

• A short class exercise.

Page 3

Background

• Logistic regression is a special kind of regression designed for a specific type of situation.

• To understand it, you must be clear about some fundamentals of ORDINARY LEAST SQUARES (OLS) regression.

• I’ll review those first, before I talk about logistic regression itself.

Page 4

A study

• In a study of the effects of media violence, children were measured on their Actual violence and their Exposure to screen violence.

• Here is a scatterplot of Actual violence against Exposure.

Page 5

The scatterplot

• Each point in the plot represents one child.

• The coordinates are the child’s scores on Exposure to and Actual violence.

• A statistical ASSOCIATION between Exposure to and Actual violence is evident from the elliptical shape of the cloud of points.

Page 6

Regression

• Regression is a set of statistical techniques enabling the researcher to exploit an association among variables to PREDICT the values of one variable from those of others.

Page 7

Some key terms

• The variable we are trying to predict or account for is the CRITERION, TARGET or DEPENDENT VARIABLE (DV).

• The predictors are the INDEPENDENT VARIABLES (IVs) or REGRESSORS.

• In our current example, the DV is Actual violence and the IV is Exposure to screen violence.

Page 8

The REGRESSION LINE is drawn through the points.

Page 9 (figure: the regression line drawn through the scatterplot)

Page 10

The regression line
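The equation on this slide has not survived the transcript; in the usual notation, the regression line for one IV is

$$\hat{y} = b_0 + b_1 x,$$

where $b_0$ is the intercept (the constant) and $b_1$ is the slope (the regression coefficient).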

Page 11

Filling in the values

Page 12

Interpretation of slope or regression coefficient

• The slope is the average change in the DV that results from an increase of one unit in the IV.

• In our example, slope = .74.

• So an increase of one unit of Exposure produces, on average, an increase of .74 units in Actual violence.

Page 13

The ‘best-fitting’ line

• The regression line of Actual violence upon Exposure is the uniquely ‘best-fitting’ line according to what is known as the LEAST SQUARES criterion.

Page 14

Residuals

• John scored 9 on Exposure and 8 on Actual.

• John’s predicted score from regression ŷ is the point on the line above the value 9 on the x-axis.

• The error in prediction is (y − ŷ), a quantity known as the RESIDUAL score, e.

• John’s residual score is shown.

Page 15 (figure: scatterplot showing John's residual)

Page 16

Least squares criterion of goodness-of-fit

• The sum of the squares of the residuals (see the formulas below) is a minimum.

• There is a mathematical solution to the problem of finding values for the slope and the constant that meet the criterion.
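The formulas referred to above are missing from the transcript. In the usual notation, the criterion is to choose the constant $b_0$ and slope $b_1$ that minimise

$$\sum_i e_i^2 = \sum_i (y_i - \hat{y}_i)^2,$$

and the mathematical solution is

$$b_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1\bar{x}.$$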

Page 17

Ordinary least-squares (OLS) regression

• This approach to regression is known as ORDINARY LEAST SQUARES (OLS) regression.

• There are other kinds of regression (such as LOGISTIC REGRESSION, today’s topic) that do not work in this way.

Page 18

Regression and correlation

• Regression and correlation are two sides of the same associative coin.

• The stronger the association, the narrower will be the elliptical scatterplot, the higher will be the value of the correlation coefficient and the smaller will be the residuals from regression.

• THE CORRELATION AND THE REGRESSION COEFFICIENT ALWAYS HAVE THE SAME SIGN.

• For fixed values of the variances of x and y, the greater the value of r, the steeper will be the slope of the regression line, i.e., the greater will be the value of b1.

• The slope of the regression line b1 and r are related according to …

Page 19

Relation between the regression coefficient and the correlation coefficient
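The formula itself is missing from the transcript; the standard relation is

$$b_1 = r\,\frac{s_y}{s_x},$$

where $s_x$ and $s_y$ are the standard deviations of $x$ and $y$.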

Page 20

The coefficient of determination (r²)

• The square of the Pearson correlation is known as the COEFFICIENT OF DETERMINATION.

• It is so called because r² is the proportion of the variance of y that is accounted for by regression upon x.

Page 21

Coefficient of determination
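The formula on this slide is missing from the transcript; in standard form,

$$r^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2},$$

the proportion of the variance of $y$ accounted for by the regression.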

Page 22

Prediction without regression

• Suppose you know nothing of the association between x and y.

• But you are told that the mean of the target variable y has a certain value, My.

• You are asked to predict values of y for various values of x.

• It can be shown that your best strategy is to guess the value of My, irrespective of the value of x.

• This is termed INTERCEPT-ONLY prediction.

Page 23

A baseline model

• In multiple regression and several other related techniques, the first step is to formulate a baseline model, which takes no account of any association among the variables.

• The baseline model is the equivalent of guessing the mean every time.

• This is ‘Step 0’ in several SPSS regression and modelling routines.

• Step 0 provides a comparison or baseline for the evaluation of later models that include one or more of the IVs.

Page 24

Two or more IVs: multiple regression

• We could try to predict a person’s actual violence not only from exposure to screen violence, but also from additional variables, such as number of years of education and other characteristics of the parents.

• We should then have to determine the relative importance of the various IVs and whether we needed to include all of them in the regression model.

• These are problems in MULTIPLE REGRESSION.

Page 25

Multiple regression
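The equation on this slide is missing from the transcript; with $k$ IVs, the multiple regression model has the standard form

$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k.$$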

Page 26

Partial regression coefficients

• In multiple regression, a PARTIAL REGRESSION COEFFICIENT is the estimated average change in the DV resulting from an increase of one unit in one particular IV with ALL THE OTHER IVs HELD CONSTANT.

Page 27

The multiple correlation coefficient R

• The MULTIPLE CORRELATION COEFFICIENT (R) is the correlation between the target variable y and the predictions ŷ obtained from the regression.

• R can never take a negative value.

Page 28

Coefficient of determination in multiple regression

• In multiple regression, the COEFFICIENT OF DETERMINATION is the square of the multiple correlation coefficient.

Page 29

The case of one IV

• The multiple correlation coefficient is defined even in simple regression, where there is only one IV.

• Here, since R can never be negative, it takes the ABSOLUTE VALUE of the Pearson correlation between x and y, even when r is negative.

• So in SPSS, R is included in the output for simple regression.

Page 30

The coefficient of multiple determination R²

• In multiple regression, the coefficient of determination, the proportion of variance of the target variable y that is accounted for by regression, is R², the square of the multiple correlation coefficient.

Page 31 (figure)

Page 32

What if the DV is a set of categories?

• Simple and multiple OLS regression assume that the DV and IVs consist of measures on an interval scale with units. The term CONTINUOUS VARIABLE is used for this sort of DV.

• But suppose we want to predict whether a person will suffer from a heart attack or contract a certain illness with known risk factors.

• Here, we are predicting not a VALUE, but CATEGORY MEMBERSHIP.

Page 33

Regression with a categorical DV

The two most commonly used techniques are:

1. Logistic regression

2. Discriminant analysis

Page 34

Discriminant analysis

• If all (or most) IVs are continuous, you might consider using DISCRIMINANT ANALYSIS (DA).

• But the DA model makes assumptions about the distributions of the IVs (such as multivariate normality) which data sets often fail to satisfy.

• Moreover, DA doesn’t like qualitative IVs, such as sex or nationality.

Page 35

Logistic regression

• Logistic regression makes fewer assumptions than does discriminant analysis.

• Logistic regression, moreover, is happy with qualitative IVs; in fact, logistic regression is happy even if ALL the IVs are qualitative.

Page 36

A research question

• It is suspected that smoking and drinking are risk factors in the incidence of a pre-morbid blood condition, characterised by the presence of an antibody.

• Records of the incidence of the condition in 100 patients are available, together with estimates of the amount they smoke and drink.

Page 37

The data

Page 38

Let’s find out how many of the patients have the condition.

Page 39 (figure)

Page 40 (figure)

Page 41

Forty-four patients have the condition

Page 42

The regression model assumes …

• Either you have the disease or you don’t. • As smoking and alcohol increase, however, we

assume that the probability of developing the condition increases CONTINUOUSLY as a function of the IVs.

• In logistic regression, we estimate the probability of the condition with the LOGISTIC REGRESSION FUNCTION

• If the estimated probability exceeds a cut-off (usually 0.5), the case is classified by the program as a Yes, rather than a No.
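As a minimal sketch of this classification rule in Python (the values of Z are illustrative, not taken from the lecture's data):

```python
import math

def logistic(z):
    """Logistic function: maps any real-valued z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(p, cutoff=0.5):
    """Classify a case as 'Yes' if its estimated probability exceeds the cut-off."""
    return "Yes" if p > cutoff else "No"

# Illustrative values of the linear predictor Z for three cases
for z in (-2.0, 0.0, 1.5):
    p = logistic(z)
    print(f"Z = {z:+.1f}  ->  p = {p:.3f}  ->  {classify(p)}")
```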

Page 43

A logistic regression function

Page 44

Logistic regression function

Page 45 (figure)

Page 46

The odds

• In an EXPERIMENT OF CHANCE (tossing a coin, rolling a die) the ODDS in favour of an event is the number of ways in which the event could occur, divided by the number of ways in which it could fail to occur.

• If a die is rolled, there is one way of getting a six and there are five ways of not getting a six.

• The odds in favour of a six are therefore 1 to 5, or 1/5.

Page 47

Odds in favour of antibody

• Suppose we know that out of 100 people, 44 have a certain antibody in their blood. We select a person at random from this group.

• There are 44 ways of selecting a person with the antibody; and 56 ways of selecting someone without it.

• The ODDS in favour of the person having the antibody are 44 to 56 or 44/56.

Page 48

The log odds (logit)

• The odds measure suffers from ASYMMETRY OF RANGE.

• Unlikely events have odds between 0 and 1; likely events can have huge odds.

• The LOG ODDS (LOGIT) is the natural logarithm (log to the base e) of the odds.

• Logit = ln(odds) = logₑ(odds).

Page 49

When the logit is zero

• Suppose the odds were 50 to 50 (50/50 = 1).

• Since the log of 1 is zero (e⁰ = 1), a logit of zero means that the odds for are equal to the odds against.

Page 50

Range of the logit

• The logit has a symmetrical range: a positive sign means the odds are in favour; a negative sign means the odds are against.

• The logit has no upper or lower limit: it has an unlimited range of values.

Page 51

Example

• The odds in favour of a case having the antibody are 44/56 = 11/14.

• Logit = ln(11/14) = –.24 • The event is less likely than not, hence the

negative sign. • If the odds in favour were 56/44, the logit

would be ln(56/44) = +.24. • Notice the symmetry of the scale of

magnitude around the neutral point at 1.
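These calculations are easily checked in Python (a sketch; only the 44/56 split comes from the example above):

```python
import math

with_antibody, without_antibody = 44, 56

odds = with_antibody / without_antibody        # 44/56 = 11/14
logit = math.log(odds)                         # natural log of the odds

print(f"odds  = {odds:.3f}")                   # 0.786
print(f"logit = {logit:.3f}")                  # -0.241: odds are against
print(f"logit of 56/44 = {math.log(1 / odds):.3f}")  # +0.241: symmetric about 0
```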

Page 52

Probability

• A probability is a measure of likelihood ranging from 0 (an impossibility) to 1 (a certainty).

• The classical definition of probability, like that of the odds, also arises in the context of an experiment of chance.

• The probability p of an event is the number of ways it can happen divided by the TOTAL number of equally likely outcomes.

• The probability of a six when a die is rolled is 1/6.

Page 53

Relationship between the probability and the odds

• A probability and the odds are both measures of likelihood.

• They are related according to the equations below.
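The equations themselves are missing from the transcript; they are the standard ones:

$$\text{odds} = \frac{p}{1-p}, \qquad p = \frac{\text{odds}}{1+\text{odds}}.$$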

Page 54

Logs and antilogs

Page 55

The antilog function

Page 56

Answers

Page 57

Odds as antilogs

• A number such as the odds can be written as an ANTILOG, that is, the base e to the power of the natural log of the odds (the logit):
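The formula that followed is missing from the transcript; it is simply

$$\text{odds} = e^{\ln(\text{odds})} = e^{\text{logit}}.$$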

Page 58

The logistic regression function revisited

Page 59

Logistic regression function

Page 60

The logit

• The logit is assumed to be a linear function Z of the independent variables.
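In symbols, with $k$ IVs:

$$\text{logit} = \ln\!\left(\frac{p}{1-p}\right) = Z = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_k X_k.$$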

Page 61

Interpretation of a logistic regression coefficient

• The partial regression coefficient is the increase in the LOG ODDS or LOGIT arising from an increase of one unit in the independent variable.

• The log of a product is the SUM of the logs.

• So the antilog of the partial regression coefficient is the factor by which the original odds must be MULTIPLIED to give the new odds when the IV increases by a unit.

Page 62

Change in the odds
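The figure on this slide is missing from the transcript; the relation it presumably illustrated follows from the previous slide:

$$\text{new odds} = \text{old odds} \times e^{b}.$$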

Page 63

Example

• Suppose that for Smoking, b = 1.1. An increase of one smoking unit (e.g. 10 cigarettes) increases the logit (the log odds) by 1.1.

• So the original odds are MULTIPLIED by exp(1.1) ≈ 3.0.

Page 64

Summary

• In terms of the ODDS, an increase of one unit in the IV MULTIPLIES the original odds by the ANTILOG of b, that is, by e^b, or exp(b).

• Exp(1.1) = 3.0.

• So an increase of one smoking unit results in the odds being MULTIPLIED by 3, that is, the event is THREE times as likely to happen (a quick check follows below).
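A quick check of this arithmetic in Python (a sketch; the 44/56 odds are carried over from the earlier antibody example):

```python
import math

b = 1.1                              # coefficient for Smoking
factor = math.exp(b)                 # antilog of b
print(f"exp({b}) = {factor:.3f}")    # 3.004: odds multiplied by about 3

old_odds = 44 / 56                   # odds from the antibody example
new_odds = old_odds * factor         # odds after one extra smoking unit
print(f"odds: {old_odds:.3f} -> {new_odds:.3f}")
```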

Page 65

The problem

• In the logit equation, we must find values of the constant and partial regression coefficients such that correct assignment to categories is maximised.

Page 66

No mathematical solution

• In logistic regression, there is no equivalent of the formulae for the intercept and coefficients in OLS regression.

• A ‘brute force’ computing algorithm is used whereby, starting at arbitrary values of the coefficients, the values are progressively adjusted to try to arrive at a set which maximises the likelihood of obtaining the observed frequencies.

• In a process known as ITERATION, estimates of the parameters are calculated again and again in the hope that they will ‘converge’ to stable values.

• IT DOESN’T ALWAYS HAPPEN!

• We must therefore check that this ‘convergence’ really has been achieved by examining the ITERATION HISTORY in the SPSS output (a sketch of the same idea in code follows below).
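The lecture works in SPSS, but the same iterative fitting, and the need to check convergence, can be sketched in Python with statsmodels. Everything here (data, coefficients) is simulated for illustration:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 100
smoking = rng.normal(10, 3, n)
alcohol = rng.normal(5, 2, n)

# Simulate a binary outcome from a known logit (coefficients are made up)
z = -1.0 + 0.4 * smoking - 0.1 * alcohol
y = rng.binomial(1, 1 / (1 + np.exp(-z)))

X = sm.add_constant(np.column_stack([smoking, alcohol]))

# fit(disp=1) prints the iteration-by-iteration log-likelihood: the
# equivalent of SPSS's iteration history.
result = sm.Logit(y, X).fit(disp=1)

print("Converged:", result.mle_retvals["converged"])
print(result.params)
```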

Page 67

Potential difficulties

• The algorithm will not run successfully if the IVs are too highly correlated. This is the familiar MULTICOLLINEARITY PROBLEM sometimes encountered in OLS regression.

Page 68

Centring

• As with OLS multiple regression, it is a good idea to CENTRE variables, by subtracting the mean from each score.

• Centring leaves the correlations among the variables unchanged.

• This move makes the algorithm more robust to substantial correlations among the variables.
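A minimal sketch of centring in Python (illustrative scores):

```python
import numpy as np

smoking = np.array([12.0, 8.0, 15.0, 3.0, 10.0])   # illustrative raw scores
smoking_centred = smoking - smoking.mean()         # subtract the mean from each score

print(smoking_centred)          # scores expressed as deviations from the mean
print(smoking_centred.mean())   # 0.0: centring changes the location,
                                # not the spread, of the variable
```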

Page 69 (figure)

Page 70

Covariates

In SPSS logistic regression dialogs, IVs that are continuous variables are known as COVARIATES.

Page 71

Always ask for the ITERATION HISTORY, so that you can check whether the algorithm was able to arrive at a stable estimate.

Page 72

Dire warning!

• Should the iteration history show failure to converge, the results of the analysis can be ridiculous!

• The effects of failure to converge are not limited to the IV concerned: they can mess up the whole analysis!

Page 73 (figure)

Page 74 (figure)

Page 75

Fitting a model

• The goodness-of-fit of a model is measured by a log likelihood chi-square statistic.

Page 76

Step 0 in logistic regression

• We know that 44/100 people have the condition.

• Armed only with this fact, and with no knowledge of any associations there might be among the variables, we shall maximise our hit rate if we predict ABSENCE of the condition for ANY person selected at random.

• This, in logistic regression, is the equivalent of intercept-only (no-regression) prediction in OLS regression: you just guess My, whatever the value of x.

Page 77

Here is the logistic regression output for Step 0

Page 78 (figure: SPSS output for Step 0)

Page 79

The Nagelkerke R² statistic

• The Nagelkerke statistic is the counterpart of the coefficient of determination R² in OLS multiple regression.

• It is a measure of the proportion of the total variation in incidence of the blood condition accounted for by regression.
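The formula is not shown in the transcript. Nagelkerke's statistic rescales the Cox and Snell measure so that its maximum is 1:

$$R^2_{CS} = 1 - \left(\frac{L_0}{L_M}\right)^{2/n}, \qquad R^2_{N} = \frac{R^2_{CS}}{1 - L_0^{\,2/n}},$$

where $L_0$ is the likelihood of the baseline (Step 0) model, $L_M$ that of the fitted model, and $n$ the sample size.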

Pages 80–82 (figures)

Page 83

A regression model is now applied. Here is the classification table for Step 1:

Classification Table(a)

                                       Predicted
                                 Blood Condition    Percentage
  Observed                        No       Yes       Correct
  Step 1  Blood Condition  No     51         5         91.1
                           Yes    10        34         77.3
          Overall Percentage                           85.0

a. The cut value is .500

The hit rate using the regression model is 85%. This is obviously much better than the ‘no-regression’ hit rate of 56%.

Page 84

The Wald statistic

• The WALD STATISTIC tests a regression coefficient for significance.

• The null hypothesis is that, in the population, the coefficient is zero.

• The Wald statistic is B²/SE² (not B/SE, as Andy Field says on page 224 of his book) and is distributed approximately as chi-square (a check in code follows below).
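A quick verification using the Smoking coefficient reported on the next slide (B and S.E. are the lecture's values; scipy supplies the chi-square distribution):

```python
from scipy.stats import chi2

B, SE = 2.264, 0.513             # Smoking coefficient and its standard error
wald = B**2 / SE**2              # Wald statistic, approximately chi-square
p = chi2.sf(wald, df=1)          # upper-tail probability on 1 df

print(f"Wald = {wald:.2f}, p = {p:.1e}")   # Wald = 19.48 (SPSS: 19.490), p < .001
```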

Page 85

Variables in the Equation(a)

                     B      S.E.     Wald    df   Sig.   Exp(B)
Step 1   Smoking    2.264   .513   19.490     1   .000    9.623
         Alcohol    -.078   .085     .846     1   .358     .925
         Constant  -1.394   .373   13.979     1   .000     .248

a. Variable(s) entered on step 1: Smoking, Alcohol.

Exp(B) for Smoking is the antilog of the coefficient of Smoking in the logit equation. Increasing Smoking by one unit MULTIPLIES the odds in favour of occurrence by about 10.

Page 86

The logit equation

$$Z = b_0 + b_1 X_1 + b_2 X_2 = -1.394 + 2.264(\text{Smoking}) - .078(\text{Alcohol})$$

Page 87

Logistic function

$$p = \frac{e^{Z}}{1+e^{Z}} = \frac{e^{-1.394 + 2.264(\text{Smoking}) - .078(\text{Alcohol})}}{1 + e^{-1.394 + 2.264(\text{Smoking}) - .078(\text{Alcohol})}}$$
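Putting the fitted equation to work: a minimal sketch that computes the estimated probability for a hypothetical patient. The coefficients are those reported above; the patient's scores are invented for illustration:

```python
import math

b0, b_smoking, b_alcohol = -1.394, 2.264, -0.078   # fitted coefficients

def predict_probability(smoking, alcohol):
    """Estimated probability of the blood condition from the fitted logit."""
    z = b0 + b_smoking * smoking + b_alcohol * alcohol
    return math.exp(z) / (1 + math.exp(z))

p = predict_probability(smoking=1.5, alcohol=4.0)   # invented patient scores
print(f"p = {p:.3f} -> classified as {'Yes' if p > 0.5 else 'No'}")
```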

Page 88 (figure)

Page 89

Conclusion

• The incidence of the blood condition is indeed predictable from regression, which raises the hit rate from 56% to 85%.

• Smoking contributes significantly to the model.

• Alcohol does not contribute significantly to the model.

Page 90

The next step

• This session has been merely an introduction to the technique of logistic regression.

• The next step is to do some further reading.

Page 91

Getting started

• There’s an elementary section on logistic regression in

– Kinnear, P., & Gray, C. (2008). SPSS 16 made simple. Hove: Psychology Press. Chapter 14.

• This is mainly a practical, get-started guide; but there is an outline of the rationale of the technique as well.

Page 92

Sage paperbacks

• Menard, S. (2002). Applied logistic regression analysis (2nd ed.). London: Sage.

• Jaccard, J. (2001). Interaction effects in logistic regression. London: Sage.

Page 93

• Tabachnick, B. G., & Fidell, L. S. (2007). Using multivariate statistics (5th ed.). Boston: Allyn & Bacon. Chapter 10.

• Field, A. (2005). Discovering statistics using SPSS for Windows: Advanced techniques for the beginner (2nd ed.). London: Sage. Chapter 6.