IBM SPSS presentation
Amsterdam, 11th November 2014
Drs. Ing. J.A.C.M. Smit (Jan)
Director of
STATSCONsult, based in Drunen NL
11/11/2014 STATSCONsult, Logistic Regression, IBM SPSS presentation
STATSCONsult
Support, marketing, and sales of software
products for statistical analysis
Courses in Statistics
Consultancy in Data Analysis
Jan Smit worked for SPSS from 1984 until
1989.
STATSCONsult Consultancy
SPSS Intro courses
SPSS assistance in data analysis
SPSS advanced courses
SPSS Risk Analyses (including Weight of
Evidence)
Examples of Logistic Regression
We wish to model the probability of an event
that depends on a number of factors
(predictors):
◦ To predict whether a patient has (or will have) a
given disease
◦ Prediction of a customer's propensity to
purchase an appliance (TV)
◦ Prediction of passing an exam
◦ Prediction of paying back a loan in full
◦ Risk analysis is typically done with Logistic Regression
What are the assumptions of using
Logistic Regression? The predictors are not highly correlated with one another (no severe multicollinearity)
For a continuous predictor, the probability of the dependent variable should be monotone (ascending or descending) in the data
We obtain a (model + error); the residuals (= error) should not dominate
The model should be interpretable, easy to use, and useful for forecasting
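A quick way to screen for the first assumption is a pairwise correlation check. Below is a minimal sketch with made-up values (the `employ` and `address` numbers are purely illustrative, not from the bankloan data); a correlation near 1 between two predictors would signal multicollinearity.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative values only: years with employer and years at address
# often move together, which would signal multicollinearity.
employ  = [1, 3, 5, 8, 12, 20]
address = [0, 2, 6, 7, 11, 18]
print(round(pearson_r(employ, address), 3))
```

In SPSS the same check is done via Analyze > Correlate > Bivariate before fitting the model.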
Logistic regression, application
We analyse the effect of a number of
independent predictors (x1, x2, .., xn) on a
dependent variable Y, where Y ∈ {0, 1}
Covariates are predictors for which we
wish to correct (such as age)
Predictors can be continuous, nominal or
ordinal
◦ Independent variables can be continuous, e.g. Age
◦ or ordinal/nominal, e.g. Level of Education
Data
1500 observations
We wish to model Previous Default (Y=1)
From now on we call Previous Default (the risk of not paying off the bank loan) "Risk"
We interpret the model and use it for prediction,
based on the predictors (Age, .., Household Income)
548 observations have Risk (Y=1) in our data set
90% of the observations are used to fit the model; the remaining observations are used for prediction.
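The 90/10 split can be sketched as below (a minimal illustration assuming random assignment; in the SPSS Logistic Regression dialog this is usually done with a selection variable instead):

```python
import random

random.seed(1)  # reproducible split
observations = list(range(1500))       # stand-ins for the 1500 cases
random.shuffle(observations)

cut = int(0.9 * len(observations))     # 90% for estimation
model_set   = observations[:cut]       # used to fit the model
holdout_set = observations[cut:]       # used only for prediction
print(len(model_set), len(holdout_set))  # 1350 150
```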
What are my odds? We cannot use linear regression directly, though we use much of its
theory.
In logistic regression our model is:
◦ log(P(y=1)/P(y=0)) = a + b1*x1 + b2*x2 + .. + bn*xn
◦ Linear regression: Y = a + b1*x1 + b2*x2 + .. + bn*xn (nearly the same)
◦ Odds: P(y=1)/P(y=0); "my odds are 2 to 1" means P(y=1)/P(y=0) = 2
Log(odds) makes the statistics possible:
◦ P = 2/3: odds = 2; log(2) = 0,69
◦ P = 1/2: odds = 1; log(1) = 0
A coefficient is the change in the log odds when the other factors are held fixed.
Sometimes I have the odds against, odds on, or odds even.
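The two examples above can be reproduced in a few lines (a minimal sketch using Python's standard library; `log` is the natural logarithm, as in the slides):

```python
import math

def odds(p):
    """Odds in favour: P(y=1) / P(y=0)."""
    return p / (1 - p)

# The slide's examples, reproduced:
print(round(odds(2/3), 2), round(math.log(odds(2/3)), 2))  # 2.0 0.69
print(round(odds(1/2), 2), round(math.log(odds(1/2)), 2))  # 1.0 0.0
```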
Bankloan data
We wish to model the chance of paying back a bank loan in full. When Risk = 1, the loan was ultimately not repaid to the bank.
Y : Risk{1=yes, 0=no}
X : A number of factors that may affect Y
Age in years (age)
Level of education (ed)
Years with current employer (employ)
Years at current address (address)
Household income in thousands (income)
Debt to income ratio, x100 (debtinc)
Credit card debt in thousands (creddebt)
Other debt in thousands (othdebt)
Make groups via visual binning
The odds of Risk decrease with
higher values of age
Dependency of Risk on Age
AND IN FORMULAS
We read: Log Odds = log(P(Risk=1) / P(Risk=0))
= Constant + B * age =
1,250 – 0,055 * age
For age=20: LogOdds = 1,25 – 1,10 = 0,15
For age=30: LogOdds = 1,25 – 1,65 = -0,40
For age=40: LogOdds = 1,25 – 2,20 = -0,95
If age = 22,7 the LogOdds = 0
According to the model:
For age=20: Odds = exp(0,15) = 1,16
For age=30: Odds = exp(-0,40) = 0,67
For age=40: Odds = exp(-0,95) = 0,39
Probability:
For age=20: P(Y=1) = 0,54
For age=30: P(Y=1) = 0,40
For age=40: P(Y=1) = 0,28
We conclude that Age can be used as a
predictor for Risk (as Sig.-p < 0,05)
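The chain log-odds → odds → probability above can be checked with a short sketch (coefficients taken from this slide; small differences with the slide's figures are rounding):

```python
import math

def log_odds(age):
    # Coefficients read from the SPSS output on this slide
    return 1.250 - 0.055 * age

for age in (20, 30, 40):
    lo = log_odds(age)
    o = math.exp(lo)            # odds = P(Y=1)/P(Y=0)
    p = o / (1 + o)             # probability P(Y=1)
    print(age, round(lo, 2), round(o, 2), round(p, 2))
```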
Using the dialog in IBM SPSS:
Specify categorical predictors
Output (2) from the initial stage (all
main effects)
The -2 log likelihood is the leading statistic.
Output of predictors and effect on
Risk
What predictors can we use in the
model to estimate "Risk"? If the Sig.-p < 0,05 for a predictor, we may
conclude that this predictor has an effect on the dependent variable (a significant effect).
If the Sig.-p > 0,05 for a predictor, we conclude that we are uncertain whether this predictor has an effect on the dependent variable.
Watch out for pitfalls (remove a variable that has no effect, and re-estimate).
Modelling
With Backward (LR), at each step we
re-estimate the model, leaving out the least
significant predictor:
After this step only the variables age,
employ, debtinc and creddebt are significant
Note that correlations between predictors may
affect the order of inclusion in the model (employ and address are highly correlated)
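The decision rule behind each Backward (LR) step can be sketched as a likelihood-ratio test: refit without a predictor and compare the rise in -2 log likelihood to a chi-square critical value (3,84 for 1 degree of freedom at alpha = 0,05). The -2LL numbers below are illustrative, not from the SPSS output:

```python
# Backward (LR) removes a predictor, refits, and compares the change in
# -2 log likelihood to a chi-square critical value.
def lr_test(m2ll_reduced, m2ll_full, critical=3.84):
    """df=1 chi-square test at alpha=0.05: True means keep the predictor."""
    change = m2ll_reduced - m2ll_full
    return change > critical

# Illustrative -2LL values (hypothetical, not from the slides):
print(lr_test(612.1, 600.4))  # fit worsens a lot without it -> keep
print(lr_test(601.9, 600.4))  # negligible change -> remove
```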
After Backward Deletion
Interpretation of the model
If the coefficient of a predictor is < 0, the odds decrease for larger values (Exp(B) < 1).
Large coefficients (positive or negative) are more important (they go with large Wald statistics and small Sig.-p values).
Here, people with 1. a short period at their current employer ("Change"),
2. high credit card debts ("Spenders"),
3. high debt-to-income ratios ("Have Fun"), and
4. low ages ("Young")
show high risk.
Classification on the data in the model;
if we lower the cut value below the default 0,5, more cases
end up in the Predicted Yes column (raising it yields fewer).
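The effect of the cut value can be sketched as follows (the probabilities in `probs` are made up for illustration):

```python
def classify(p, cut=0.5):
    """Predict Risk=1 when the modelled probability exceeds the cut value."""
    return 1 if p > cut else 0

probs = [0.15, 0.35, 0.55, 0.80]   # illustrative modelled probabilities
print([classify(p) for p in probs])            # default cut 0.5
print([classify(p, cut=0.3) for p in probs])   # lower cut -> more predicted Yes
```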
Model expression
The model is:
Log(P(Risk=1) / P(Risk=0)) = -0,133 (constant)
- 0,213 * employ (from 0 to 50)
+ 0,483 * creddebt (from 0 to 36)
+ 0,102 * debtinc (from 0 to 40)
- 0,040 * age (from 18 to 60)
If this expression > 0, the probability > 0,5
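The final model can be evaluated directly for any applicant; below, the coefficients are taken from this slide, while the applicant's values are hypothetical:

```python
import math

def risk_log_odds(employ, creddebt, debtinc, age):
    # Coefficients read from the final model on this slide
    return (-0.133
            - 0.213 * employ
            + 0.483 * creddebt
            + 0.102 * debtinc
            - 0.040 * age)

# Hypothetical applicant: 2 years employed, 1.5k credit card debt,
# debt-to-income ratio 20, age 25
lo = risk_log_odds(2, 1.5, 20, 25)
p = 1 / (1 + math.exp(-lo))        # logistic transform of the log odds
print(round(lo, 3), round(p, 3))   # log odds > 0, so probability > 0.5
```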
Prediction
Prediction is rather good (102 out of 133 correct)
We apply the model to the remaining observations that were not used to fit it.
65 + 15 were actually classified as "No Risk"
65 + 16 are predicted by the model as "No Risk"
We are able to change the cut-off value of 0,5
Comparison of Classification Trees and
Logistic Regression: if the number of variables is high, the
LR result is still simple, while CT output
becomes large and complex.
CT finds the interactions and segments with the
highest P; with LR, segments with high
probability are determined by the model.
Questions
Jan Smit
+31 416 378 125
http://www.statsconsult.nl/