Lecture 4: Logistic Regression
Transcript of Lecture 4: Logistic Regression
Machine Learning, CUNY Graduate Center
Lecture 4: Logistic Regression
Today
• Linear Regression
  – Bayesians v. Frequentists
  – Bayesian Linear Regression
• Logistic Regression
  – Linear Model for Classification
Regularization: Penalize large weights
• Introduce a penalty term in the loss function.
Regularized Regression (L2-Regularization or Ridge Regression)
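The loss function on this slide is not captured in the transcript; a standard form of the ridge objective, assuming a linear model y(x; w) = w^T x over N training pairs (x_n, t_n), is

$$E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\left(t_n - \mathbf{w}^\top \mathbf{x}_n\right)^2 + \frac{\lambda}{2}\lVert \mathbf{w}\rVert_2^2$$

where λ sets how strongly large weights are penalized.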
More regularization
• The penalty term defines the style of regularization.
• L2-Regularization
• L1-Regularization
• L0-Regularization
  – The L0-norm selects the optimal subset of features.
Curse of dimensionality
• Increasing the dimensionality of the features increases the data requirements exponentially.
  – For example, if a single feature can be accurately approximated with 100 data points, optimizing the joint over two features requires 100 * 100 data points.
• Models should be small relative to the amount of available data.
• Dimensionality reduction techniques – feature selection – can help (a short sketch follows).
  – L0-regularization is explicit feature selection.
  – L1- and L2-regularization approximate feature selection.
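Here is a minimal scikit-learn sketch of that last point, contrasting the L2 (Ridge) and L1 (Lasso) penalties; the synthetic data, alpha values, and variable names are illustrative assumptions, not taken from the lecture.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Synthetic data: 100 points, 20 features, only 3 of which matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
true_w = np.zeros(20)
true_w[:3] = [2.0, -1.5, 0.5]
y = X @ true_w + 0.1 * rng.normal(size=100)

# L2 (Ridge) shrinks all weights; L1 (Lasso) drives many weights to exactly
# zero, which is the sense in which it approximates feature selection.
ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("nonzero weights, ridge:", np.sum(np.abs(ridge.coef_) > 1e-6))
print("nonzero weights, lasso:", np.sum(np.abs(lasso.coef_) > 1e-6))
```

The L1 fit typically ends up with only a few nonzero weights, while the L2 fit keeps all of them small but nonzero.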
Bayesians v. Frequentists
• What is a probability?
• Frequentists
  – A probability is the likelihood that an event will happen.
  – It is approximated by the ratio of the number of observed events to the number of total events.
  – Assessment is vital to selecting a model.
  – Point estimates are absolutely fine.
• Bayesians
  – A probability is a degree of believability of a proposition.
  – Bayesians require that probabilities be prior beliefs conditioned on data.
  – The Bayesian approach “is optimal”, given a good model, a good prior and a good loss function. Don’t worry so much about assessment.
  – If you are ever making a point estimate, you’ve made a mistake. The only valid probabilities are posteriors based on evidence given some prior.
Bayesian Linear Regression
• The previous MLE derivation of linear regression uses point estimates for the weight vector, w.
• Bayesians say, “hold it right there”.
  – Use a prior distribution over w to estimate the parameters.
• Alpha is a hyperparameter of the prior over w: alpha is the precision, or inverse variance, of the distribution.
• Now optimize:
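The prior and posterior the slide optimizes are not shown in the transcript; in the standard setup this lecture appears to follow, the zero-mean Gaussian prior with precision alpha and the resulting posterior are

$$p(\mathbf{w}\mid\alpha) = \mathcal{N}(\mathbf{w}\mid\mathbf{0},\,\alpha^{-1}\mathbf{I}), \qquad p(\mathbf{w}\mid\mathbf{t},X,\alpha,\beta) \propto p(\mathbf{t}\mid X,\mathbf{w},\beta)\,p(\mathbf{w}\mid\alpha)$$

where beta is the precision of the Gaussian noise on the targets (an assumption of this reconstruction).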
Optimize the Bayesian posterior
As usual it’s easier to optimize after a log transform.
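The worked equations are not captured; assuming the posterior sketched above, the log transform turns the product into a sum that is easy to differentiate:

$$\ln p(\mathbf{w}\mid\mathbf{t},X,\alpha,\beta) = \ln p(\mathbf{t}\mid X,\mathbf{w},\beta) + \ln p(\mathbf{w}\mid\alpha) + \text{const}$$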
Optimize the Bayesian posterior
Ignoring terms that do not depend on w, this is an IDENTICAL formulation to L2-regularization.
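A hedged reconstruction of the result the slide points at: writing out the two Gaussian log terms and keeping only the parts that depend on w leaves

$$-\frac{\beta}{2}\sum_{n=1}^{N}\left(t_n - \mathbf{w}^\top\mathbf{x}_n\right)^2 \;-\; \frac{\alpha}{2}\,\mathbf{w}^\top\mathbf{w}$$

so maximizing the posterior is the same as minimizing the sum-of-squares error with an L2 penalty of strength λ = α/β.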
Context
• Overfitting is bad.
• Bayesians vs. Frequentists
  – Is one better?
  – Machine Learning uses techniques from both camps.
Logistic Regression
• Linear model applied to classification
• Supervised: target information is available
  – Each data point xi has a corresponding target ti.
• Goal: identify a function that maps each xi to a prediction of its target ti.
Target Variables
• In binary classification, it is convenient to represent ti as a scalar with a range of [0, 1].
  – Interpretation of ti as the likelihood that xi is a member of the positive class.
  – Used to represent the confidence of a prediction.
• For K > 2 classes, ti is often represented as a K-element vector.
  – tij represents the degree of membership in class j.
  – |ti| = 1
  – E.g., a 5-way classification vector for a point in class 3: ti = [0, 0, 1, 0, 0]
Graphical Example of Classification
Decision Boundaries
Graphical Example of Classification
Classification approaches
• Generative
  – Models the joint distribution of c and x.
  – Highest data requirements.
• Discriminative
  – Models the conditional p(c|x) directly.
  – Fewer parameters to approximate.
• Discriminant Function
  – May still be trained probabilistically, but not necessarily modeling a likelihood.
Treating Classification as a Linear model
Relationship between Regression and Classification
• Since we’re classifying two classes, why not set one class to ‘0’ and the other to ‘1’, then use linear regression?
  – Regression ranges over -infinity to infinity, while the class labels are 0 and 1.
• Can use a threshold, e.g.
  – y >= 0.5, then class 1
  – y < 0.5, then class 2
[Figure: decision by thresholding – if f(x) >= 0.5, predict Happy/Good/Class A; otherwise predict Sad/Not Good/Class B.]
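A minimal numpy sketch of this regression-plus-threshold idea; the one-dimensional toy data and all names are made up for illustration, not taken from the slides.

```python
import numpy as np

# Toy 1-D data: class 0 clustered near x = 0, class 1 near x = 3.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0.0, 0.5, 50), rng.normal(3.0, 0.5, 50)])
t = np.concatenate([np.zeros(50), np.ones(50)])

# Ordinary least squares on the 0/1 targets (bias column + feature).
X = np.column_stack([np.ones_like(x), x])
w = np.linalg.lstsq(X, t, rcond=None)[0]

# Threshold the real-valued output at 0.5 to get a class decision.
pred = (X @ w >= 0.5).astype(int)
print("training accuracy:", (pred == t).mean())
```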
Odds-ratio
• Rather than thresholding, we’ll relate the regression to the class-conditional probability.
• Ratio of the odds of the prediction y = 1 versus y = 0:
  – If p(y=1|x) = 0.8 and p(y=0|x) = 0.2
  – Odds ratio = 0.8/0.2 = 4
• Use a linear model to predict the odds rather than a class label.
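Written out, the odds is just the ratio of the two class probabilities, and the numbers below are the slide’s own example:

$$\mathrm{odds}(x) = \frac{p(y=1\mid x)}{p(y=0\mid x)} = \frac{0.8}{0.2} = 4$$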
Logit – Log odds ratio function
• LHS (the odds): 0 to infinity
• RHS (the linear model): -infinity to infinity
• Use a log function.
  – Has the added bonus of dissolving the division, leading to easy manipulation.
Logistic Regression
• A linear model used to predict log-odds ratio of two classes
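The slide’s equation is not captured; the model it describes, with the bias folded into w, is

$$\ln\frac{p(y=1\mid\mathbf{x})}{p(y=0\mid\mathbf{x})} = \mathbf{w}^\top\mathbf{x}$$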
Logit to probability
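The algebra on this slide is not in the transcript; solving the log-odds model above for the probability (a standard manipulation) gives the logistic form:

$$\ln\frac{p}{1-p} = \mathbf{w}^\top\mathbf{x} \;\Longrightarrow\; p(y=1\mid\mathbf{x}) = \frac{1}{1 + e^{-\mathbf{w}^\top\mathbf{x}}}$$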
Sigmoid function
• Squashing function to map the reals to a finite domain.
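For reference (the slide’s formula and plot are not captured), the sigmoid maps the reals to (0, 1) and has two properties that are used later in the optimization:

$$\sigma(a) = \frac{1}{1+e^{-a}}, \qquad \sigma(-a) = 1-\sigma(a), \qquad \frac{d\sigma}{da} = \sigma(a)\bigl(1-\sigma(a)\bigr)$$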
Gaussian Class-conditional
• Assume the data is generated from a Gaussian distribution for each class.
• Leads to a Bayesian formulation of logistic regression.
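The derivation is not captured; in the standard treatment, if the two classes are Gaussian with a shared covariance matrix Σ (an assumption of that derivation), the posterior is exactly a sigmoid of a linear function of x:

$$p(C_1\mid\mathbf{x}) = \sigma\!\left(\mathbf{w}^\top\mathbf{x} + w_0\right), \qquad \mathbf{w} = \Sigma^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$$

with w_0 collecting the means and the class priors.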
Bayesian Logistic Regression
Maximum Likelihood Estimation – Logistic Regression
• Class-conditional Gaussian.
• Multinomial class distribution.
• As ever, take the derivative of this likelihood function w.r.t. the parameters.
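The likelihood being differentiated is not shown in the transcript; under the stated assumptions (Gaussian class-conditionals, class prior π), it is typically written, for binary targets t_n ∈ {0, 1}, as

$$p(\mathbf{t}, X \mid \pi, \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \Sigma) = \prod_{n=1}^{N}\bigl[\pi\,\mathcal{N}(\mathbf{x}_n\mid\boldsymbol{\mu}_1,\Sigma)\bigr]^{t_n}\bigl[(1-\pi)\,\mathcal{N}(\mathbf{x}_n\mid\boldsymbol{\mu}_2,\Sigma)\bigr]^{1-t_n}$$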
Maximum Likelihood Estimation of the prior
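The derivation worked on this slide is not captured; the standard end result is that maximizing the likelihood above with respect to π gives the empirical class fraction,

$$\pi = \frac{1}{N}\sum_{n=1}^{N} t_n = \frac{N_1}{N_1 + N_2}$$

where N_1 and N_2 count the training points in each class.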
Discriminative Training
• Take the derivatives w.r.t. the parameters.
  – Be prepared for this for homework.
• In the generative formulation, we need to estimate the joint distribution of t and x.
  – But we get an intuitive regularization technique.
• Discriminative Training
  – Model p(t|x) directly.
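The discriminative objective itself is not shown; modelling p(t|x) directly with y_n = σ(w^T x_n) gives the usual likelihood and cross-entropy error,

$$p(\mathbf{t}\mid\mathbf{w}) = \prod_{n=1}^{N} y_n^{t_n}\,(1-y_n)^{1-t_n}, \qquad E(\mathbf{w}) = -\sum_{n=1}^{N}\bigl[t_n\ln y_n + (1-t_n)\ln(1-y_n)\bigr]$$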
What’s the problem with generative training?
• Formulated this way, in D dimensions, this function has D parameters.
• In the generative case: 2D means and D(D+1)/2 covariance values.
• Quadratic growth in the number of parameters.
• We’d rather have linear growth.
Optimization
• Take the gradient in terms of w
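The slides that carry this derivation are not captured; for the cross-entropy error above, the standard result is the compact form

$$\nabla_{\mathbf{w}} E(\mathbf{w}) = \sum_{n=1}^{N}\bigl(y_n - t_n\bigr)\mathbf{x}_n, \qquad y_n = \sigma(\mathbf{w}^\top\mathbf{x}_n)$$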
Optimization: putting it together
Optimization
• We know the gradient of the error function, but how do we find the minimum value?
• Setting the gradient to zero is nontrivial.
• Numerical approximation
Gradient Descent
• Take a guess.
• Move in the direction of the negative gradient.
• Jump again.
• In a convex function this will converge to the optimum.
• Other methods include Newton-Raphson.
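A minimal numpy sketch of these steps applied to logistic regression; the learning rate, iteration count, and toy data are illustrative assumptions rather than values from the lecture.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic(X, t, lr=0.1, n_iters=1000):
    """Gradient descent on the cross-entropy error of logistic regression."""
    w = np.zeros(X.shape[1])           # take a guess (all zeros)
    for _ in range(n_iters):
        y = sigmoid(X @ w)             # current predictions
        grad = X.T @ (y - t) / len(t)  # gradient of the (mean) error
        w -= lr * grad                 # move against the gradient, then repeat
    return w

# Toy 2-D problem with a bias column.
rng = np.random.default_rng(0)
X = np.column_stack([
    np.ones(100),
    np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))]),
])
t = np.concatenate([np.zeros(50), np.ones(50)])

w = fit_logistic(X, t)
print("accuracy:", ((sigmoid(X @ w) >= 0.5) == t).mean())
```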
Multi-class discriminant functions
• Can extend to multiple classes
• Other approaches include constructing K-1 binary classifiers.
• Each classifier compares cn to not cn
• Computationally simpler, but not without problems
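The extension formula is not captured on this slide; the standard multi-class model uses one weight vector per class and a normalized exponential (softmax):

$$p(C_k\mid\mathbf{x}) = \frac{\exp(\mathbf{w}_k^\top\mathbf{x})}{\sum_{j=1}^{K}\exp(\mathbf{w}_j^\top\mathbf{x})}$$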
Exponential Model
• Logistic Regression is a type of exponential model.
  – Linear combination of weights and features to produce a probabilistic model.
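The slide’s formula is not in the transcript; a common way to write this log-linear form, with feature functions f_i(x, c) and weights λ_i, is

$$p(c\mid\mathbf{x}) = \frac{1}{Z(\mathbf{x})}\exp\Bigl(\sum_i \lambda_i f_i(\mathbf{x}, c)\Bigr), \qquad Z(\mathbf{x}) = \sum_{c'}\exp\Bigl(\sum_i \lambda_i f_i(\mathbf{x}, c')\Bigr)$$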
Problems with Binary Discriminant functions
K-class discriminant
Entropy
• Measure of uncertainty, or measure of “information”.
• High uncertainty equals high entropy.
• Rare events are more “informative” than common events.
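The definition on the slide is not captured; the standard definition of entropy for a discrete distribution is

$$H(X) = -\sum_{x} p(x)\log_2 p(x)$$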
Entropy
• How much information is received when observing ‘x’?
• If independent, p(x,y) = p(x)p(y).
  – H(x,y) = H(x) + H(y)
  – The information contained in two unrelated events is equal to their sum.
Entropy
• Binary coding of p(x): -log p(x)
  – “How many bits does it take to represent a value p(x)?”
  – How many “decimal” places? How many binary decimal places?
• Expected value of observed information
Examples of Entropy
• Uniform distributions have higher entropy.
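A small numpy check of that statement, comparing a uniform and a peaked distribution over four outcomes; the specific numbers are only illustrative.

```python
import numpy as np

def entropy(p):
    """Shannon entropy, in bits, of a discrete distribution p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # treat 0 * log 0 as 0
    return -np.sum(p * np.log2(p))

uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]

print("uniform:", entropy(uniform))   # 2.0 bits, the maximum for 4 outcomes
print("peaked: ", entropy(peaked))    # about 0.24 bits
```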
Maximum Entropy
• Logistic Regression is also known as Maximum Entropy.
• The entropy objective is concave, so the optimization is convex.
  – Convergence is expected.
• Constrain this optimization to enforce good classification.
• Increase the likelihood of the data while keeping the distribution of weights as even as possible.
  – Include as many useful features as possible.
Maximum Entropy with Constraints
• From the Klein and Manning tutorial.
Optimization formulation
• If we let the weights represent likelihoods of value for each feature, we can write one constraint for each feature i:
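The constraint equations themselves are not captured in the transcript; in the usual maximum-entropy formulation, the model’s expected count of each feature is constrained to match its count in the training data:

$$\max_{p}\; H(p) \quad \text{s.t.} \quad \sum_{x,c}\tilde{p}(x)\,p(c\mid x)\,f_i(x,c) \;=\; \sum_{x,c}\tilde{p}(x,c)\,f_i(x,c) \quad \text{for each feature } i$$

where p̃ denotes the empirical distribution.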
Solving MaxEnt formulation
• Convex optimization with a concave objective function and linear constraints.
• Lagrange Multipliers
The dual of this constrained problem (one Lagrange multiplier per feature i) is the maximum likelihood estimation of Logistic Regression.
Summary
• Bayesian Regularization
  – Introduction of a prior over parameters serves to constrain weights.
• Logistic Regression
  – Log odds to construct a linear model
  – Formulation with Gaussian class conditionals
  – Discriminative Training
  – Gradient Descent
• Entropy
  – Logistic Regression as Maximum Entropy
Next Time
• Graphical Models
• Read Chapter 8.1, 8.2