
Slide 1: Logistic Regression

Machine Learning – 10701/15781
Carlos Guestrin
Carnegie Mellon University
September 24th, 2007

©Carlos Guestrin 2005-2007

Slide 2: Generative v. Discriminative classifiers – Intuition

Want to learn h: X → Y
- X – features
- Y – target classes
Bayes optimal classifier: P(Y|X)

Generative classifier, e.g., Naïve Bayes:
- Assume some functional form for P(X|Y), P(Y)
- Estimate parameters of P(X|Y), P(Y) directly from training data
- Use Bayes rule to calculate P(Y|X = x)
- This is a 'generative' model
  - Indirect computation of P(Y|X) through Bayes rule
  - But can generate a sample of the data, since P(X) = ∑y P(y) P(X|y)

Discriminative classifiers, e.g., Logistic Regression:
- Assume some functional form for P(Y|X)
- Estimate parameters of P(Y|X) directly from training data
- This is the 'discriminative' model
  - Directly learn P(Y|X)
  - But cannot obtain a sample of the data, because P(X) is not available


Slide 3: Logistic Regression

Logistic function (or sigmoid): σ(z) = 1 / (1 + e^(−z))

Learn P(Y|X) directly!
- Assume a particular functional form
- Sigmoid applied to a linear function of the data:

  P(Y=1|X) = exp(w0 + Σ_i wi Xi) / (1 + exp(w0 + Σ_i wi Xi)) = 1 / (1 + exp(−(w0 + Σ_i wi Xi)))

Features can be discrete or continuous!

Slide 4: Understanding the sigmoid

σ(w0 + w1 X) for different weights:

[Three plots of the sigmoid over X ∈ [−6, 6]: w0=0, w1=−1; w0=−2, w1=−1; w0=0, w1=−0.5]
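A minimal sketch (not from the slides) that reproduces the three curves, assuming the σ(w0 + w1·X) parameterization above:

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.linspace(-6, 6, 200)
    for w0, w1 in [(0, -1), (-2, -1), (0, -0.5)]:
        # P(Y=1|X) = sigma(w0 + w1*x) = 1 / (1 + exp(-(w0 + w1*x)))
        plt.plot(x, 1.0 / (1.0 + np.exp(-(w0 + w1 * x))),
                 label=f"w0={w0}, w1={w1}")
    plt.xlabel("X")
    plt.ylabel("sigma(w0 + w1 X)")
    plt.legend()
    plt.show()

Varying w1 changes the steepness of the transition, while w0 shifts where the curve crosses 0.5.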


Slide 5: Logistic Regression – a Linear classifier

[Plot: the sigmoid P(Y=1|X) over a linear function of X, from −6 to 6]

Slide 6: Very convenient!

P(Y=1|X) = exp(w0 + Σ_i wi Xi) / (1 + exp(w0 + Σ_i wi Xi))

implies P(Y=0|X) = 1 / (1 + exp(w0 + Σ_i wi Xi))

implies P(Y=1|X) / P(Y=0|X) = exp(w0 + Σ_i wi Xi)

implies ln [P(Y=1|X) / P(Y=0|X)] = w0 + Σ_i wi Xi

→ linear classification rule! Predict Y=1 exactly when w0 + Σ_i wi Xi > 0.


Slide 7: What if we have continuous Xi?

E.g., character recognition: Xi is the i-th pixel.

Gaussian Naïve Bayes (GNB):

P(Xi = x | Y = yk) = 1 / (σik √(2π)) · exp(−(x − µik)² / (2σik²))

Sometimes we assume the variance is:
- independent of Y (i.e., σi),
- or independent of Xi (i.e., σk),
- or both (i.e., σ).
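A minimal sketch (not from the slides) of how these per-feature Gaussians combine with the class prior via Bayes rule; gaussian_pdf and gnb_posterior are hypothetical helper names:

    import numpy as np

    def gaussian_pdf(x, mu, sigma):
        # N(x; mu, sigma^2), the per-feature class-conditional density above
        return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

    def gnb_posterior(x, mu, sigma, prior):
        """P(Y = y_k | x) for a Gaussian Naive Bayes model.

        x: (n,) feature vector; mu, sigma: (K, n) per-class parameters;
        prior: (K,) class priors P(Y = y_k).
        """
        # log P(y_k) + sum_i log P(x_i | y_k); logs for numerical stability
        log_joint = np.log(prior) + np.log(gaussian_pdf(x, mu, sigma)).sum(axis=1)
        log_joint -= log_joint.max()      # guard against underflow
        joint = np.exp(log_joint)
        return joint / joint.sum()        # Bayes rule normalization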

Slide 8: Example: GNB for classifying mental states

- ~1 mm resolution
- ~2 images per sec.
- 15,000 voxels/image
- non-invasive, safe
- measures Blood Oxygen Level Dependent (BOLD) response

[Figure: typical BOLD impulse response, ~10 sec]

[Mitchell et al.]


Slide 9: Learned Bayes Models – Means for P(BrainActivity | WordCategory)

[Figure: mean activation images for 'animal' words vs. 'people' words]

Pairwise classification accuracy: 85% [Mitchell et al.]

Slide 10: Logistic regression v. Naïve Bayes

Consider learning f: X → Y, where
- X is a vector of real-valued features ⟨X1 … Xn⟩
- Y is boolean

Could use a Gaussian Naïve Bayes classifier:
- assume all Xi are conditionally independent given Y
- model P(Xi | Y = yk) as Gaussian N(µik, σi)
- model P(Y) as Bernoulli(θ, 1−θ)

What does that imply about the form of P(Y|X)?

Cool!!!!


Slide 11: Derive form for P(Y|X) for continuous Xi

(Slides 11–13 were worked on the board; a reconstruction of the full derivation appears after Slide 13.)

Slide 12: Ratio of class-conditional probabilities


Slide 13: Derive form for P(Y|X) for continuous Xi (continued)
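A reconstruction of the standard derivation (not verbatim from the slides), under the GNB assumptions of Slide 10 with class-independent variances σi and θ = P(Y=1), using the sigmoid parameterization of Slide 3:

    P(Y=1 \mid X)
      = \frac{P(Y=1)\,P(X \mid Y=1)}{P(Y=1)\,P(X \mid Y=1) + P(Y=0)\,P(X \mid Y=0)}
      = \frac{1}{1 + \exp\left( \ln\frac{1-\theta}{\theta}
            + \sum_i \ln\frac{P(X_i \mid Y=0)}{P(X_i \mid Y=1)} \right)}

    % For Gaussians with class-independent variance \sigma_i,
    % the per-feature log ratio is linear in X_i:
    \ln\frac{P(X_i \mid Y=0)}{P(X_i \mid Y=1)}
      = \frac{\mu_{i0}-\mu_{i1}}{\sigma_i^2}\, X_i
        + \frac{\mu_{i1}^2-\mu_{i0}^2}{2\sigma_i^2}

    % Collecting the linear and constant terms:
    P(Y=1 \mid X) = \frac{1}{1+\exp\left(-\left(w_0+\sum_i w_i X_i\right)\right)},
    \qquad
    w_i = \frac{\mu_{i1}-\mu_{i0}}{\sigma_i^2}, \quad
    w_0 = \ln\frac{\theta}{1-\theta} + \sum_i \frac{\mu_{i0}^2-\mu_{i1}^2}{2\sigma_i^2}

That is, the GNB assumptions force P(Y|X) into exactly the logistic form of Slide 3, with the weights determined by the Gaussian means, variances, and the class prior θ.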

Slide 14: Gaussian Naïve Bayes v. Logistic Regression

Representation equivalence
- But only in a special case!!! (GNB with class-independent variances)

But what's the difference???
- LR makes no assumptions about P(X|Y) in learning!!!
- Loss function!!!
- Optimize different functions ⇒ obtain different solutions

[Diagram: the set of Gaussian Naïve Bayes parameters (feature variance independent of class label) shown against the set of Logistic Regression parameters]


Slide 15: Logistic regression for more than 2 classes

Logistic regression in the more general case, where Y ∈ {y1 … yR}: learn R−1 sets of weights.

Slide 16: Logistic regression more generally

Logistic regression in the more general case, where Y ∈ {y1 … yR}: learn R−1 sets of weights.

For k < R:

  P(Y = yk | X) = exp(wk0 + Σ_i wki Xi) / (1 + Σ_{j<R} exp(wj0 + Σ_i wji Xi))

For k = R (normalization, so no weights for this class):

  P(Y = yR | X) = 1 / (1 + Σ_{j<R} exp(wj0 + Σ_i wji Xi))

Features can be discrete or continuous!
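A minimal sketch (not from the slides) of these R-class probabilities; W, b, and multiclass_lr_probs are hypothetical names, with class R as the weight-free reference class:

    import numpy as np

    def multiclass_lr_probs(W, b, x):
        """P(Y = y_k | x) for k = 1..R, given R-1 weight vectors.

        W: (R-1, n) weights, b: (R-1,) intercepts, x: (n,) features.
        Class R serves as the reference class with no weights.
        """
        scores = np.exp(b + W @ x)             # exp(w_k0 + sum_i w_ki x_i), k < R
        denom = 1.0 + scores.sum()             # the 1 accounts for class R
        return np.append(scores, 1.0) / denom  # probabilities for y_1 .. y_R

The returned vector always sums to 1, e.g. multiclass_lr_probs(np.ones((2, 2)), np.zeros(2), np.array([1.0, -1.0])) gives a valid 3-class distribution.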


Slide 17: Loss functions: Likelihood v. Conditional Likelihood

Generative (Naïve Bayes) loss function: data likelihood

  ln P(D | w) = Σ_j ln P(x^j, y^j | w)

Discriminative models cannot compute P(x^j | w)! But the discriminative (logistic regression) loss function is the conditional data likelihood:

  ln P(D_Y | D_X, w) = Σ_j ln P(y^j | x^j, w)

It doesn't waste effort learning P(X) – it focuses on P(Y|X), all that matters for classification.

Slide 18: Expressing Conditional Log Likelihood
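The expression itself was written on the board; a reconstruction under the binary-Y sigmoid parameterization above, with y^j ∈ {0,1} the j-th label and x^j the j-th example:

    l(w) = \sum_j \left[\, y^j \ln P(Y=1 \mid x^j, w)
                        + (1-y^j) \ln P(Y=0 \mid x^j, w) \,\right]
         = \sum_j \left[\, y^j \Big( w_0 + \sum_i w_i x_i^j \Big)
                        - \ln\Big( 1 + \exp\Big( w_0 + \sum_i w_i x_i^j \Big) \Big) \,\right]

The second line follows by substituting the sigmoid forms of P(Y=1|x) and P(Y=0|x) from Slide 6.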


Slide 19: Maximizing Conditional Log Likelihood

Good news: l(w) is a concave function of w ⇒ no locally optimal solutions.

Bad news: no closed-form solution to maximize l(w).

Good news: concave functions are easy to optimize.

Slide 20: Optimizing a concave function – Gradient ascent

Conditional likelihood for logistic regression is concave ⇒ find the optimum with gradient ascent.

Gradient ascent is the simplest of optimization approaches; e.g., conjugate gradient ascent is much better (see reading).

Gradient:

  ∇_w l(w) = [∂l(w)/∂w0, …, ∂l(w)/∂wn]

Update rule, with learning rate η > 0:

  wi ← wi + η ∂l(w)/∂wi


Slide 21: Maximize Conditional Log Likelihood: Gradient ascent
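The derivative was derived on the board; reconstructed by differentiating the conditional log likelihood of Slide 18 term by term:

    \frac{\partial l(w)}{\partial w_i}
      = \sum_j \left[\, y^j x_i^j
          - x_i^j \frac{\exp\big(w_0+\sum_k w_k x_k^j\big)}
                       {1+\exp\big(w_0+\sum_k w_k x_k^j\big)} \,\right]
      = \sum_j x_i^j \left( y^j - \hat{P}(Y=1 \mid x^j, w) \right)

Each example pushes wi in proportion to its feature value times the prediction error.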

Slide 22: Gradient Ascent for LR

Gradient ascent algorithm: iterate until change < ε

  repeat:
    w0 ← w0 + η Σ_j [y^j − P̂(Y=1 | x^j, w)]
    for i = 1 … n:
      wi ← wi + η Σ_j x_i^j [y^j − P̂(Y=1 | x^j, w)]
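A minimal runnable sketch of this loop (not from the slides); logistic_gradient_ascent, eta, and eps are hypothetical names, and the stopping test uses the size of the parameter change as the slide's "change < ε":

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def logistic_gradient_ascent(X, y, eta=0.1, eps=1e-6, max_iter=10_000):
        """Batch gradient ascent on the conditional log likelihood.
        X: (m, n) feature matrix; y: (m,) labels in {0, 1}."""
        m, n = X.shape
        w0, w = 0.0, np.zeros(n)
        for _ in range(max_iter):
            p = sigmoid(w0 + X @ w)          # P_hat(Y=1 | x_j, w) for every j
            err = y - p                      # y_j - P_hat(Y=1 | x_j, w)
            dw0, dw = err.sum(), X.T @ err   # gradient components
            w0 += eta * dw0                  # w0 <- w0 + eta * sum_j err_j
            w += eta * dw                    # wi <- wi + eta * sum_j x_ij err_j
            if eta * max(abs(dw0), np.abs(dw).max()) < eps:  # change < eps
                break
        return w0, w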


Slide 23: That's all M(C)LE. How about MAP?

One common approach is to define priors on w:
- normal distribution, zero mean, identity covariance
- "pushes" parameters towards zero

Corresponds to regularization:
- helps avoid very large weights and overfitting
- more on this later in the semester

MAP estimate:

  w* = arg max_w ln [ p(w) Π_j P(y^j | x^j, w) ]

Slide 24: M(C)AP as Regularization

Writing the zero-mean Gaussian prior with precision λ as p(w) ∝ exp(−(λ/2) Σ_i wi²), the MAP objective becomes

  ln p(w) + Σ_j ln P(y^j | x^j, w) = Σ_j ln P(y^j | x^j, w) − (λ/2) Σ_i wi² + const

This penalizes high weights; the same idea is also applicable in linear regression.


Slide 25: Gradient of M(C)AP
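Reconstructed from the MCLE gradient (Slide 21) plus the Gaussian prior of Slide 24, using the λ notation above:

    \frac{\partial}{\partial w_i}
      \left[ \ln p(w) + \sum_j \ln P(y^j \mid x^j, w) \right]
      = -\lambda w_i
        + \sum_j x_i^j \left( y^j - \hat{P}(Y=1 \mid x^j, w) \right)

So the regularized update is the MCLE update plus a shrinkage term that pulls each weight towards zero by ηλwi at every step.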

Slide 26: MLE vs MAP

Maximum conditional likelihood estimate:

  w*_MCLE = arg max_w Σ_j ln P(y^j | x^j, w)

Maximum conditional a posteriori estimate:

  w*_MCAP = arg max_w [ ln p(w) + Σ_j ln P(y^j | x^j, w) ]


Slide 27: Naïve Bayes vs Logistic Regression

Consider Y boolean, Xi continuous, X = ⟨X1 … Xn⟩.

Number of parameters:
- NB: 4n + 1 (a mean and a variance per feature per class, plus the class prior)
- LR: n + 1 (one weight per feature, plus w0)

Estimation method:
- NB parameter estimates are uncoupled (each has a closed-form estimate, computed independently)
- LR parameter estimates are coupled (all weights are optimized jointly)

Slide 28: G. Naïve Bayes vs. Logistic Regression 1

Generative and discriminative classifiers.

Asymptotic comparison (# training examples → ∞):
- when the model is correct: GNB and LR produce identical classifiers
- when the model is incorrect:
  - LR is less biased – it does not assume conditional independence
  - therefore LR is expected to outperform GNB

[Ng & Jordan, 2002]


Slide 29: G. Naïve Bayes vs. Logistic Regression 2

Generative and discriminative classifiers.

Non-asymptotic analysis:
- convergence rate of parameter estimates, with n = # of attributes in X
- size of training data needed to get close to the infinite-data solution:
  - GNB needs O(log n) samples
  - LR needs O(n) samples

GNB converges more quickly to its (perhaps less helpful) asymptotic estimates.

[Ng & Jordan, 2002]

Slide 30: Some experiments from UCI data sets

[Plots: error vs. training-set size on several UCI data sets, comparing Naïve Bayes and Logistic Regression]
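The original plots are from Ng & Jordan's experiments. A minimal sketch (not a reproduction) of the same kind of comparison, using scikit-learn on one illustrative data set instead of the full UCI suite:

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Growing training-set sizes: GNB tends to do well with few samples,
    # while LR tends to catch up and pass it as m grows.
    for m in [20, 50, 100, 200, len(X_train)]:
        gnb = GaussianNB().fit(X_train[:m], y_train[:m])
        lr = LogisticRegression(max_iter=5000).fit(X_train[:m], y_train[:m])
        print(f"m={m:4d}  GNB={gnb.score(X_test, y_test):.3f}  "
              f"LR={lr.score(X_test, y_test):.3f}")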


Slide 31: What you should know about Logistic Regression (LR)

- Gaussian Naïve Bayes with class-independent variances is representationally equivalent to LR
  - the solutions differ because of the objective (loss) function
- In general, NB and LR make different assumptions
  - NB: features independent given class ⇒ an assumption on P(X|Y)
  - LR: functional form of P(Y|X), no assumption on P(X|Y)
- LR is a linear classifier
  - the decision rule is a hyperplane
- LR is optimized by conditional likelihood
  - no closed-form solution
  - concave ⇒ global optimum with gradient ascent
  - maximum conditional a posteriori corresponds to regularization
- Convergence rates
  - GNB (usually) needs less data
  - LR (usually) gets to better solutions in the limit