Lecture 12: Logistic regression (v1) - web.stanford.edu
Transcript of Lecture 12: Logistic regression (v1) - web.stanford.edu
MS&E 226: “Small” DataLecture 12: Logistic regression (v1)
Ramesh [email protected]
Fall 2015
1 / 30
Regression methods for binary outcomes
2 / 30
Binary outcomes
For the duration of this lecture suppose the outcome variableYi ∈ {0, 1} for each i.
(Much of what we cover generalizes to discrete outcomes withmore than two levels, but we focus on the binary case.)
3 / 30
A population model for binary outcomes
What does it mean to have a population model for binaryoutcomes?
Given a covariate vector ~X, the population model specifies theprobability that Yi is either 0 or 1.
Given a sample X and Y, our goal is to construct a goodapproximation of this population model (for prediction, inference,or both).
4 / 30
Linear regression?Why doesn’t linear regression work well for prediction? A picture:
0
1
−20 0 20X
Y
5 / 30
Logistic regression
At its core, logistic regression is a method that directly addressesthis issue with linear regression: it produces fitted values thatalways lie in [0, 1].
Input: Sample data X and Y.
Output: A fitted model f̂(·), where we interpret f̂( ~X) as anestimate of the probability that the corresponding outcome Y isequal to 1.
6 / 30
Logistic regression: Basics
7 / 30
The population modelLogistic regression assumes the following population model:
P(Y = 1| ~X) =exp( ~Xβ)
1 + exp( ~Xβ)= 1− P(Y = 0| ~X).
A plot in the case with only one covariate, and β0 = 0, β1 = 1:
0.00
0.25
0.50
0.75
1.00
−10 −5 0 5 10X
P(Y
= 1
| X
)
8 / 30
Logistic curve
The function g given by:
g(q) = log
(q
1− q
)is called the logit function. It has the following inverse, called thelogistic curve:
g−1(z) =exp(z)
1 + exp(z).
In terms of g, we can write the population model as:1
P(Y = 1| ~X) = g−1( ~Xβ).
1This is one example of a generalized linear model (GLM); for a GLM, g iscalled the link function.
9 / 30
Logistic curve
Note that from this curve we see some important characteristics oflogistic regression:
I The logistic curve is increasing. Therefore, in logisticregression, larger values of covariates that have positivecoefficients will tend to increase the probability that Y = 1.
I When z > 0, then g−1(z) > 1/2; when z < 0, theng−1(z) < 1/2.Therefore, when ~Xβ > 0, Y is more likely to be one thanzero; and conversely, when ~Xβ < 0, Y is more likely to bezero than one.
10 / 30
Interpretations: Log odds ratio
In many ways, the choice of a logistic regression model is a matterof practical convenience, rather than any fundamentalunderstanding of the population: it allows us to neatly employregression techniques for binary data.
One way to interpret the model:
I Note that given a probability 0 < q < 1, q/(1− q) is calledthe odds ratio. The odds ratio lies in (0,∞).
I So logistic regression uses a model that suggests the log oddsratio of Y given ~X is linear in the covariates. Note the logodds ratio lies in (−∞,∞).
11 / 30
Interpretations: Latent variables
Another way to interpret the model is the latent variable approach:
I Suppose given a vector of covariates ~X, a logistic randomvariable Z is sampled independently:
P(Z < z) = g−1(z).
I Define Y = 1 if ~Xβ + Z > 0, and Y = 0 otherwise.
I By this definition, Y = 1 if and only if Z < ~Xβ, an eventthat has probability g−1( ~Xβ) — exactly the logisticregression population model.
12 / 30
Interpretations: Latent variables
The latent variable interpretation is particularly popular ineconometrics, where it is a first example of a discrete choicemodeling.For example, in modeling customer choice over whether or not tobuy a product, suppose:
I Each customer has a feature vector ~X.
I This customer’s utility for purchasing the item is the(random) quantity ~Xβ + Z, and the utility for not purchasingthe item is zero.
I The probability the customer purchases the item is exactly thelogistic regression model.
This is a very basic example of a random utility model forcustomer choice.
13 / 30
Fitting through maximum likelihood
The logistic regression parameters are found through maximumlikelihood.
The (conditional) likelihood for Y given X and parameters β is:
P(Y|β,X) =
n∏i=1
g−1(Xiβ)Yi(1− g−1(Xiβ))
1−Yi .
Let β̂MLE be the resulting solution; these are the logistic regressioncoefficients.2
2For the remainder of this lecture we drop the subscript MLE on thesolution.
14 / 30
Fitting through maximum likelihood [∗]
Unfortunately, in contrast to our previous examples, maximumlikelihood estimation does not have a closed form solution in thecase of logistic regression.
However, it turns out that there are reasonably efficient iterativemethods for algorithmically computing the MLE solution.
One example is an algorithm inspired by weighted least squares,called iteratively reweighted least squares. This algorithmiteratively updates the weights in WLS to converge to the logisticregression MLE solution. See [AoS], Section 13.7 for details.
15 / 30
Fitting and interpreting logistic regressionmodels
16 / 30
Example: The CORIS dataset
Recall: 462 South African males evaluated for heart disease.
Outcome variable: Coronary heart disease (chd).
Covariates:
I Systolic blood pressure (sbp)
I Cumulative tobacco use (tobacco)
I LDL cholesterol (ldl)
I Adiposity (adiposity)
I Family history of heart disease (famhist)
I Type A behavior (typea)
I Obesity (obesity)
I Current alcohol consumption (alcohol)
I Age (age)
17 / 30
Logistic regression
To run logistic regression in R, we use the glm function, as follows:
> fm = glm(formula = chd ~ .,
family = "binomial",
data = coris)
> display(fm)
coef.est coef.se
(Intercept) -6.15 1.31
...
famhist 0.93 0.23
...
obesity -0.06 0.04
...
(As before, standard errors are the standard deviation of thesampling distribution, and can be used to construct confidenceintervals.)
18 / 30
Interpreting the output
Recall that if a coefficient is positive, it increases the probabilitythat the outcome is 1 in the fitted model (since the logisticfunction is increasing).
So, for example, obesity has a negative coefficient. What doesthis mean? Do you believe the implication?
19 / 30
Interpreting the output
In linear regression, coefficients were directly interpretable; this isnot as straightforward for logistic regression.
Some approaches to interpretation of β̂j (holding all othercovariates constant):
I β̂j is the change in the log odds ratio for Y per unit change inXj .
I exp(β̂j) is the change in the odds ratio for Y per unit changein Xj .
20 / 30
The “divide by 4” rule
We can also directly differentiate the fitted model with respect toXj .
By differentating g−1( ~Xβ̂), we find that the change in the (fitted)probability Y = 1 per unit change in Xj is:(
exp( ~Xβ)
[1 + exp( ~Xβ)]2
)β̂j .
Note that the term in parentheses cannot be any larger than 1/4.
Therefore, |β̂j |/4 is an upper bound on the magnitude of thechange in the fitted probability that Y = 1, per unit change in Xj .
21 / 30
ResidualsHere is what happens when we plot residuals vs. fitted values:
−0.5
0.0
0.5
1.0
0.00 0.25 0.50 0.75fitted.coris
resi
dual
s.co
ris
Why is this plot not informative?22 / 30
ResidualsAn alternative approach: bin the residuals based on fitted value,and then plot the average residual in each bin.
E.g., when we divide the data into ≈ 50 bins, here is what weobtain:
0.0 0.2 0.4 0.6 0.8
−0.
2−
0.1
0.0
0.1
0.2
Binned residual plot
Expected Values
Ave
rage
res
idua
l
23 / 30
Logistic regression as a classifier
24 / 30
Classification
Logistic regression serves as a classifier in the following naturalway:
I Given the estimated coefficients β̂, and a new covariate vector~X, compute ~Xβ̂.
I If the resulting value is positive, return Y = 1 as the predictedvalue.
I If the resulting value is negative, return Y = 0 as thepredicted value.
25 / 30
Linear classification
The idea is to directly use the model to choose the most likelyvalue for Y , given the model:
When ~Xβ̂ > 0, the fitted model gives P(Y = 1| ~X, β̂) > 1/2, sowe make the prediction that Y = 1.
Note that logistic regression is therefore an example of a linearclassifier: the boundary between covariate vectors where we predictY = 1 (resp., Y = 0) is linear.
26 / 30
Tuning logistic regression
As with other classifiers, we can tune logistic regression to tradeoff false positives and false negatives.
In particular, suppose we choose a threshold t, and predict Y = 1whenever ~Xβ̂ > t.
I When t = 0, we recover the classification rule on thepreceding slide. This is the rule that minimizes average 0-1loss on the training data.
I What happens to our classifier when t→∞?
I What happens to our classifier when t→ −∞
As usual, you can plot an ROC curve for the resulting family ofclassifiers as t varies.
27 / 30
Additional thoughts on logistic regression
28 / 30
Regularized logistic regression [∗]
As with linear regression, regularized logistic regression is oftenused in the presence of many features.
In practice, the most common regularization technique is to addthe penalty −λ
∑j |β̂j | to the maximum likelihood problem; this is
the equivalent of lasso for logistic regression.
As for linear regression, this penalty has the effect of selecting asubset of the parameters (for sufficient large values of theregularization penalty λ).
29 / 30
Do you believe the model?
Logistic regression highlights one of the inherent tensions ininference:
I Inference means that we are trying to understand and explainthe population model.
I At the same time, it appears that logistic regression is anestimation procedure that largely exists for its technicalconvenience. (Do we really believe that the population modelis logistic?)
At times like this it is useful to remember:
“All models are wrong, but some are more useful thanothers.”
– George E.P. Box
30 / 30