Logistic Regression - MIT OpenCourseWare
Harvard-MIT Division of Health Sciences and Technology
HST.951J: Medical Decision Support, Fall 2005
Instructors: Professor Lucila Ohno-Machado and Professor Staal Vinterbo

6.873/HST.951 Medical Decision Support, Fall 2005

Logistic Regression: Maximum Likelihood Estimation

Lucila Ohno-Machado
Risk Score of Death from Angioplasty
Unadjusted Overall Mortality Rate = 2.1%

[Figure: bar chart of number of cases (0 to 3000) and mortality (0% to 60%) by risk score category (0 to 2, 3 to 4, 5 to 6, 7 to 8, 9 to 10, >10); most cases fall in the lowest-risk categories, while mortality rises steeply with the score, up to 62% in the highest categories.]
Linear Regression: Ordinary Least Squares (OLS)

Minimize the Sum of Squared Errors (SSE) over the n data points, where i is the subscript for each point:

$y_i = \beta_0 + \beta_1 x_i$

$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} [y_i - (\beta_0 + \beta_1 x_i)]^2$

[Figure: scatter of points x1 ... x4 with the fitted regression line.]
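As a quick sketch, the OLS fit above can be computed with a least-squares solve; the data points here are made up for illustration:

```python
import numpy as np

# Minimal OLS sketch: minimize SSE = sum_i (y_i - (b0 + b1*x_i))^2.
# The data points are made up for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

X = np.column_stack([np.ones_like(x), x])      # design matrix [1, x]
b0, b1 = np.linalg.lstsq(X, y, rcond=None)[0]  # least-squares coefficients

sse = np.sum((y - (b0 + b1 * x)) ** 2)
print(b0, b1, sse)
```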
Logit

$p_i = \frac{1}{1+e^{-(\beta_0+\beta_1 x_i)}}$

$p_i = \frac{e^{\beta_0+\beta_1 x_i}}{e^{\beta_0+\beta_1 x_i}+1}$

$\mathrm{logit}: \log\left[\frac{p_i}{1-p_i}\right] = \beta_0 + \beta_1 x_i$

[Figure: the sigmoid curve of p versus x, and the logit, which is linear in x.]
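The logistic function and the logit are inverses of each other, which a few lines of Python can confirm (the coefficient values here are arbitrary):

```python
import math

# Logistic function: maps the linear predictor z = b0 + b1*x to (0, 1).
def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

# Logit: log-odds, the inverse of the logistic function.
def logit(p):
    return math.log(p / (1.0 - p))

b0, b1, x = -1.0, 0.5, 2.0   # arbitrary illustrative values
z = b0 + b1 * x              # linear predictor, here 0
p = logistic(z)              # probability, here 0.5
print(p, logit(p))           # the logit recovers the linear predictor
```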
Increasing β

[Figure: a family of logistic curves of p (0 to 1) versus x (0 to 30); as β increases, the sigmoid becomes steeper.]
Finding β0

• Baseline case (Green, coded 0):

$p_i = \frac{1}{1+e^{-\beta_0}}$

        Blue(1)  Green(0)  Total
Death   28       22        50
Life    45       52        97
Total   73       74        147

$0.297 = \frac{22}{74} = \frac{1}{1+e^{-\beta_0}}$, so $\beta_0 = -0.8616$
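A quick check of the baseline intercept from the raw counts in the table (the slide's −0.8616 comes from first rounding p to 0.297; computing directly from the counts gives about −0.8602):

```python
import math

# Baseline (Green) death rate and the implied intercept:
# p = 1/(1 + exp(-b0))  =>  b0 = log(p/(1-p)) = log(odds).
p_green = 22 / 74                       # 22 Green deaths out of 74 Greens
b0 = math.log(p_green / (1 - p_green))  # equals log(22/52)
print(p_green, b0)
```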
Odds ratio

• Odds: p / (1 − p)
• Odds ratio:

$OR = \frac{p_{death|blue}/(1-p_{death|blue})}{p_{death|green}/(1-p_{death|green})}$

        Blue  Green
Death   28    22     50
Life    45    52     97
Total   73    74     147

$OR = \frac{28/45}{22/52} = 1.47$
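The odds ratio is a one-liner from the table counts:

```python
# Odds of death for Blue vs. Green, from the 2x2 table.
odds_blue = 28 / 45    # 28 deaths, 45 survivors among Blues
odds_green = 22 / 52   # 22 deaths, 52 survivors among Greens
OR = odds_blue / odds_green
print(OR)              # about 1.47, as on the slide
```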
What do coefficients mean?

$e^{\beta_{color}} = OR_{color}$

$OR = \frac{28/45}{22/52} = 1.47$, so $e^{\beta_{color}} = 1.47$ and $\beta_{color} = 0.385$

        Blue  Green
Death   28    22     50
Life    45    52     97
Total   73    74     147

$p_{blue} = \frac{1}{1+e^{-(-0.8616+0.385)}} = 0.383$

$p_{green} = \frac{1}{1+e^{0.8616}} = 0.297$
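Putting the two coefficients together reproduces the observed group rates; this sketch recomputes everything from the raw counts:

```python
import math

# beta_color is the log odds ratio; combined with the baseline intercept,
# the model reproduces the observed death rates in each color group.
b0 = math.log(22 / 52)                     # Green log-odds (baseline)
b_color = math.log((28 / 45) / (22 / 52))  # log OR, about 0.385
p_blue = 1 / (1 + math.exp(-(b0 + b_color)))
p_green = 1 / (1 + math.exp(-b0))
print(b_color, p_blue, p_green)
```

With one binary predictor the model is saturated, so p_blue equals the observed 28/73 and p_green equals 22/74 exactly.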
What do coefficients mean?

$e^{\beta_{age}} = OR_{age}$

$OR = \frac{p_{death|age=50}/(1-p_{death|age=50})}{p_{death|age=49}/(1-p_{death|age=49})}$

        Age49  Age50
Death   28     22     50
Life    45     52     97
Total   73     74     147
Why not search using OLS?

$y_i = \beta_0 + \beta_1 x_i$, $\quad SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$

$\mathrm{logit}: \log\left[\frac{p_i}{1-p_i}\right] = \beta_0 + \beta_1 x_i$

[Figure: a least-squares line fit through a scatter of points x1 ... x4, versus the logit model on binary outcomes.]
P(model | data)?

If only the intercept is allowed, which value would it have?

$p_i = \frac{1}{1+e^{-(\beta_0+\beta_1 x_i)}}$

[Figure: binary outcome data y versus x with candidate logistic curves.]
P(data | model)?

P(data | model) = [P(model | data) P(data)] / P(model)

When comparing models:
• P(model): assume all the same (i.e., the chances of being a model with high coefficients are the same as with low coefficients, etc.)
• P(data): assume it is the same

Then, P(data | model) ∝ P(model | data)
Maximum Likelihood Estimation

• Maximize P(data | model)
• Maximize the probability that we would observe what we observed (given the assumption of a particular model)
• Choose the best parameters for the particular model
Maximum Likelihood Estimation

• Steps:
– Define an expression for the probability of the data as a function of the parameters
– Find the values of the parameters that maximize this expression
Likelihood Function

$L = \Pr(Y)$

$L = \Pr(y_1, y_2, ..., y_n)$

$L = \Pr(y_1)\Pr(y_2)...\Pr(y_n) = \prod_{i=1}^{n} \Pr(y_i)$
Likelihood Function: Binomial

$L = \Pr(Y) = \Pr(y_1, y_2, ..., y_n)$

$L = \Pr(y_1)\Pr(y_2)...\Pr(y_n) = \prod_{i=1}^{n} \Pr(y_i)$

$\Pr(y_i = 1) = p_i$

$\Pr(y_i = 0) = 1 - p_i$

$\Pr(y_i) = p_i^{y_i}(1-p_i)^{1-y_i}$
Likelihood Function

$L = \prod_{i=1}^{n}\Pr(y_i) = \prod_{i=1}^{n} p_i^{y_i}(1-p_i)^{1-y_i}$

$L = \prod_{i=1}^{n}\left(\frac{p_i}{1-p_i}\right)^{y_i}(1-p_i)$

$\log L = \sum_i y_i \log\left(\frac{p_i}{1-p_i}\right) + \sum_i \log(1-p_i)$

Since the model is the logit:

$\log L = \sum_i y_i(\beta x_i) - \sum_i \log(1+e^{\beta x_i})$
Log Likelihood Function

$L = \prod_{i=1}^{n}\Pr(y_i) = \prod_{i=1}^{n} p_i^{y_i}(1-p_i)^{1-y_i}$

$L = \prod_{i=1}^{n}\left(\frac{p_i}{1-p_i}\right)^{y_i}(1-p_i)$

$\log L = \sum_i y_i \log\left(\frac{p_i}{1-p_i}\right) + \sum_i \log(1-p_i)$
Log Likelihood Function

$\log L = \sum_i y_i \log\left(\frac{p_i}{1-p_i}\right) + \sum_i \log(1-p_i)$

Since the model is the logit:

$\log L = \sum_i y_i(\beta x_i) - \sum_i \log(1+e^{\beta x_i})$
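The two forms of the log likelihood are algebraically identical, which is easy to verify numerically on made-up data:

```python
import math

x = [0.5, 1.5, -1.0, 2.0]   # made-up predictor values
y = [1, 1, 0, 1]            # made-up binary outcomes
b = 0.7                     # arbitrary coefficient (no intercept)

ll1 = ll2 = 0.0
for xi, yi in zip(x, y):
    z = b * xi                      # linear predictor
    p = 1 / (1 + math.exp(-z))      # logistic probability
    ll1 += yi * math.log(p / (1 - p)) + math.log(1 - p)  # first form
    ll2 += yi * z - math.log(1 + math.exp(z))            # logit form
print(ll1, ll2)   # the two forms agree
```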
Maximize

$\log L = \sum_i y_i(\beta x_i) - \sum_i \log(1+e^{\beta x_i})$

$\frac{\partial \log L}{\partial \beta} = \sum_i y_i x_i - \sum_i \hat{y}_i x_i = 0, \quad \hat{y}_i = \frac{1}{1+e^{-\beta x_i}}$

Not easy to solve because ŷ is non-linear in β; we need to use iterative methods. The most popular is Newton-Raphson.
Newton-Raphson

• Start with random or zero βs
• "Walk" in the direction that maximizes the log likelihood:
– the direction (from the Gradient, or Score)
– how big a step
Maximizing the Log Likelihood

[Figure: log-likelihood curve with the initial LL at β_i and the first-iteration LL at β_{i+1}.]
Maximizing the Log Likelihood

[Figure: the second iteration moves from the new initial LL at β_i to a higher LL at β_{i+1}.]
Similar iterative method to minimizing the error in gradient descent (neural nets)

[Figure: error surface over the weights, descending from w_initial (initial error) to w_trained (final error) at a local minimum; a negative derivative implies a positive change in the weight.]
Newton-Raphson Algorithm

$\log L = \sum_i y_i(\beta x_i) - \sum_i \log(1+e^{\beta x_i})$

$U(\beta) = \frac{\partial \log L}{\partial \beta} = \sum_i x_i y_i - \sum_i x_i \hat{y}_i$  (Gradient)

$I(\beta) = \frac{\partial^2 \log L}{\partial \beta\,\partial \beta'} = -\sum_i x_i x_i' \hat{y}_i(1-\hat{y}_i)$  (Hessian)

$\beta_{j+1} = \beta_j - I(\beta_j)^{-1}U(\beta_j)$  (a step)
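A minimal Newton-Raphson sketch for one predictor plus an intercept, on made-up data (the 2x2 inverse is written out by hand here; real software would use a matrix solve):

```python
import math

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]  # made-up predictor
ys = [0, 0, 1, 0, 1, 1]              # made-up binary outcome

b0, b1 = 0.0, 0.0
for _ in range(25):
    # Accumulate gradient U = X'(y - p) and X'WX, with W = diag(p(1-p)).
    u0 = u1 = h00 = h01 = h11 = 0.0
    for x, y in zip(xs, ys):
        p = 1 / (1 + math.exp(-(b0 + b1 * x)))
        w = p * (1 - p)
        u0 += y - p
        u1 += (y - p) * x
        h00 += w
        h01 += w * x
        h11 += w * x * x
    # Newton step: beta <- beta - I^{-1} U; since I = -X'WX,
    # this is beta + (X'WX)^{-1} U, via the explicit 2x2 inverse.
    det = h00 * h11 - h01 * h01
    b0 += (h11 * u0 - h01 * u1) / det
    b1 += (h00 * u1 - h01 * u0) / det
print(b0, b1)   # b1 > 0: higher x, higher probability of y = 1
```

At convergence the gradient U(β) is (numerically) zero, which is the condition from the previous slide.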
Convergence

• Criterion:

$\frac{\beta_{j+1} - \beta_j}{\beta_j} < 0.0001$

• Convergence problems: complete and quasi-complete separation
Complete separation

MLE does not exist (i.e., it is infinite)

[Figure: binary outcomes completely separated along x; the iterates β_i, β_{i+1} keep growing and the fitted curve steepens without bound.]
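The problem is easy to demonstrate: on completely separated (made-up) data the likelihood keeps improving as the coefficient grows, so an iterative fit never settles. A plain gradient-ascent sketch:

```python
import math

# Completely separated data: every negative x has y = 0, every
# positive x has y = 1, so any steeper curve fits "better".
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

def fit(steps, lr=0.5):
    b = 0.0   # slope only (no intercept) to keep the sketch short
    for _ in range(steps):
        # gradient of the log likelihood: sum (y - p) * x
        g = sum((y - 1 / (1 + math.exp(-b * x))) * x
                for x, y in zip(xs, ys))
        b += lr * g
    return b

print(fit(100), fit(1000))   # the coefficient just keeps growing
```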
Quasi-complete separation

Same values for the predictors, different outcomes

[Figure: y versus x, separated except for overlapping outcomes at a single x value.]
No (quasi-)complete separation

It is fine to find the MLE

[Figure: overlapping binary outcomes along x.]
How good is the model?

• Is it better than predicting the same prior probability for everyone (i.e., a model with just β0)?
• How well do the training data fit?
• How well does it generalize?
Generalized likelihood-ratio test

• Are β1, β2, ..., βn different from 0?

$L = \prod_{i=1}^{n}\Pr(y_i) = \prod_{i=1}^{n} p_i^{y_i}(1-p_i)^{1-y_i}$

$\log L = \sum_i [y_i \log p_i + (1-y_i)\log(1-p_i)]$

$G = -2\log L_0 + 2\log L_1$

G has a $\chi^2$ distribution

$\text{cross-entropy error} = -\sum_i [y_i \log p_i + (1-y_i)\log(1-p_i)]$
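A sketch of the G statistic, with the null (intercept-only) fit computed from the overall rate and a hypothetical richer model's probabilities plugged in by hand (in practice both would come from fitted models):

```python
import math

ys = [0, 0, 0, 1, 1, 1]   # made-up outcomes

def log_lik(ps, ys):
    # log L = sum [ y*log(p) + (1-y)*log(1-p) ]
    return sum(y * math.log(p) + (1 - y) * math.log(1 - p)
               for p, y in zip(ps, ys))

p0 = sum(ys) / len(ys)                  # intercept-only fit: overall rate
ll0 = log_lik([p0] * len(ys), ys)
ll1 = log_lik([0.1, 0.2, 0.3, 0.7, 0.8, 0.9], ys)  # hypothetical fit
G = -2 * ll0 + 2 * ll1
print(G)   # compare against a chi-square distribution
```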
AIC, SC, BIC

• To compare models
• Akaike's Information Criterion, k parameters:

$AIC = -2\log L + 2k$

• Schwarz Criterion (Bayesian Information Criterion), n cases:

$BIC = -2\log L + k\log n$
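Both criteria are one-liners once the maximized log likelihood is known; the numbers here are hypothetical:

```python
import math

log_l = -45.2   # hypothetical maximized log likelihood from a fit
k, n = 3, 100   # hypothetical number of parameters and cases

aic = -2 * log_l + 2 * k            # AIC = -2 log L + 2k
bic = -2 * log_l + k * math.log(n)  # BIC = -2 log L + k log n
print(aic, bic)
```

Since log n > 2 whenever n > e² ≈ 7.4 cases, BIC penalizes extra parameters more heavily than AIC.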
Summary

• Maximum Likelihood Estimation is used to find parameters for models
• MLE maximizes the probability that the data obtained would have been generated by the model
• Coming up: goodness of fit (how good are the predictions?)
– How well do the training data fit?
– How well does it generalize?