Transcript of Lecture 08 – Part 01: Model Complexity (sierra.ucsd.edu/cse151a-2020-sp/lectures/08...)
Lecture 08 – Part 01: Model Complexity
[Figure: scatter plot of the example data, x vs. y]
Empirical Risk Minimization
1. Pick a model.
▶ E.g., linear prediction rules.
2. Pick a loss.
▶ E.g., mean squared error.
3. Find a prediction rule minimizing the risk.
Big Decision
▶ Pick a model.
▶ Picking the wrong model causes problems.
Underfitting
▶ Fit 𝐻(𝑥) = 𝑤0 + 𝑤1𝑥?
▶ We have underfit the data.
[Figure: degree 1 (linear) fit underfitting the data]
Overfitting
▶ Fit 𝐻(𝑥) = 𝑤0 + 𝑤1𝑥 + 𝑤2𝑥² + … + 𝑤10𝑥¹⁰?
▶ We have overfit the data.
[Figure: degree 10 polynomial fit overfitting the data]
Model Complexity
▶ Difference? Complexity.
▶ Complex models are highly flexible.
▶ They tend to overfit.
▶ Simple models are stiff.
▶ They tend to underfit.
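To make the contrast concrete before the figures, here is a minimal numpy sketch (an illustration, not from the slides) fitting a degree 1 and a degree 10 polynomial to the same noisy toy data; the data-generating function and all constants are made up:

import numpy as np

# Toy data: a nonlinear trend plus noise (made up for illustration).
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=30)
y = 10 * x**2 + 5 * np.sin(6 * x) + rng.normal(scale=1.0, size=x.size)

# A degree 1 fit is stiff and tends to underfit;
# a degree 10 fit is flexible and tends to overfit.
w_simple = np.polyfit(x, y, deg=1)
w_complex = np.polyfit(x, y, deg=10)

# Evaluate both prediction rules on a fine grid for plotting.
grid = np.linspace(0, 1, 200)
h_simple = np.polyval(w_simple, grid)
h_complex = np.polyval(w_complex, grid)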
Degree 10 Polynomial
[Figure sequence: degree 10 polynomial fits to the data, repeated over several frames]
Degree 1 Polynomial
[Figure sequence: degree 1 polynomial fits to the data, repeated over several frames]
Example: kNN
▶ 1NN: complex model, likely to overfit.
▶ 20NN: less complex model, likely to underfit.
Choosing Model Complexity
▶ How do we choose between two models?
▶ Between degree 10 and degree 1? Between 1NN and 20NN?
▶ Not always obvious.
Bad Idea: Use training MSE
▶ Which has smaller MSE on training data?
▶ Problem: the best degree 10 polynomial will always have smaller MSE on the training data.
Good Idea: Use validation MSE
▶ We care about generalization.
▶ So keep a small amount of data hidden in a validation set.
▶ Fit the model on the training data, compute MSE on the validation set.
▶ Pick whichever model has smaller validation error.
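A minimal sketch of that recipe in numpy; the helper names, the 25% hold-out fraction, and the candidates dictionary are illustrative choices, not part of the lecture:

import numpy as np

def train_val_split(X, y, val_fraction=0.25, seed=0):
    # Randomly hold out a fraction of the data as the validation set.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_val = int(len(y) * val_fraction)
    return X[idx[n_val:]], y[idx[n_val:]], X[idx[:n_val]], y[idx[:n_val]]

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

# Usage sketch: fit each candidate model on the training part only,
# then keep whichever one has the smaller validation MSE.
# X_tr, y_tr, X_val, y_val = train_val_split(X, y)
# val_err = {name: mse(y_val, m.fit(X_tr, y_tr).predict(X_val))
#            for name, m in candidates.items()}
# best = min(val_err, key=val_err.get)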
What do you expect?
▶ You fit a complex model on training data.
▶ Test it on a validation set.
▶ Likely: validation MSE > training MSE.
What do you expect?
▶ You fit a very simple model on training data.
▶ Test it on a validation set.
▶ Likely: validation MSE ≈ training MSE.
Cross-Validation
▶ We want all the training data we can get.
▶ Reserving some of it is wasteful.
▶ Idea: split the data into pieces; each piece takes a turn as the validation set.
k-Fold Cross Validation
1. Split the data set into 𝑘 pieces, 𝑆1, … , 𝑆𝑘.
2. Loop 𝑘 times; on iteration 𝑖:
▶ Use 𝑆𝑖 as the validation set; the rest as training.
▶ Compute the validation error 𝜖𝑖.
3. Overall error: (1/𝑘) ∑ 𝜖𝑖.
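One way to write that loop in numpy; the function name and the fit/predict callables are placeholders standing in for whatever model is being evaluated:

import numpy as np

def k_fold_cv_error(X, y, fit, predict, k=5, seed=0):
    # Estimate error by k-fold cross-validation: each fold S_i takes a
    # turn as the validation set, and the k validation errors are averaged.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)              # the pieces S_1, ..., S_k
    errors = []
    for i in range(k):
        val = folds[i]                           # S_i is the validation set
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])          # fit on the remaining data
        eps_i = np.mean((predict(model, X[val]) - y[val]) ** 2)
        errors.append(eps_i)
    return np.mean(errors)                       # (1/k) * sum of eps_i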
Leave-One-Out Cross Validation
▶ Suppose we have 𝑛 labeled data points.
▶ LOOCV: 𝑘-fold CV with 𝑘 = 𝑛.
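In terms of the hypothetical k_fold_cv_error sketch above, LOOCV is just the 𝑘 = 𝑛 special case:

# Each point gets its own turn as a size-1 validation set.
# loocv_error = k_fold_cv_error(X, y, fit, predict, k=len(y))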
Another Approach
▶ We can control complexity by choosing the model.
▶ Also: via regularization.
Regularization
▶ Let’s fit a complex model: 𝑤0 + 𝑤1𝑥 + … + 𝑤10𝑥¹⁰.
▶ But impose a budget on the weights, 𝑤0, … , 𝑤𝑑.
[Figure: the example data again, x vs. y]
Budgeting Weights
▶ One way to budget: ask that ‖𝑤⃗‖² is small.
▶ Before: minimize
𝑅sq(𝑤⃗) = (1/𝑛) ∑ᵢ₌₁ⁿ (𝑤⃗ ⋅ 𝑥⃗(𝑖) − 𝑦𝑖)²
▶ Now: minimize
𝑅̃sq(𝑤⃗) = (1/𝑛) ∑ᵢ₌₁ⁿ (𝑤⃗ ⋅ 𝑥⃗(𝑖) − 𝑦𝑖)² + 𝜆‖𝑤⃗‖²
Solution
▶ The regularized Normal Equations:
(𝑋ᵀ𝑋 + 𝜆𝐼) 𝑤⃗ = 𝑋ᵀ𝑦⃗
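A minimal numpy sketch of solving those equations directly; whether the bias weight should also be penalized is a detail the slide does not settle, so this version simply penalizes every entry of 𝑤⃗:

import numpy as np

def ridge_fit(X, y, lam):
    # Solve (X^T X + lam * I) w = X^T y for w.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Usage sketch: X is the augmented design matrix, e.g. powers of x
# for the degree 10 polynomial model.
# X = np.vander(x, N=11, increasing=True)   # columns 1, x, x^2, ..., x^10
# w = ridge_fit(X, y, lam=1e-4)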
Example
𝜆 = 0
[Figure: regularized degree 10 fit with 𝜆 = 0]
Example
𝜆 = 1 × 10⁻⁴
[Figure: regularized degree 10 fit with 𝜆 = 1 × 10⁻⁴]
Example
𝜆 = 1
[Figure: regularized degree 10 fit with 𝜆 = 1]
Regularization
▶ As 𝜆 increases, simpler models are preferred.
▶ Pick 𝜆 using cross-validation.
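Putting the two ideas together, a sketch of choosing 𝜆 by cross-validation over a small grid; the grid values are illustrative, and it reuses the hypothetical ridge_fit and k_fold_cv_error helpers sketched earlier:

# Candidate lambdas on a rough log scale (illustrative values).
lambdas = [0.0, 1e-4, 1e-2, 1.0, 10.0]

def cv_error_for_lambda(X, y, lam, k=5):
    # Estimate generalization error of the ridge fit at this lambda.
    return k_fold_cv_error(
        X, y,
        fit=lambda Xtr, ytr: ridge_fit(Xtr, ytr, lam),
        predict=lambda w, Xv: Xv @ w,
        k=k,
    )

# best_lam = min(lambdas, key=lambda lam: cv_error_for_lambda(X, y, lam))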
Other Penalizations
▶ ‖𝑤⃗‖₂² is ℓ2 regularization (explicit solution)
▶ a.k.a. ridge regression
▶ ‖𝑤⃗‖₁ is ℓ1 regularization (no explicit solution)
▶ a.k.a. the LASSO
▶ encourages sparse 𝑤⃗
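If scikit-learn is available, both penalties are easy to try; this sketch assumes its standard Ridge and Lasso estimators, where alpha plays the role of 𝜆 (scaling conventions differ slightly), and the toy data is made up:

import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Toy data: degree 10 polynomial features of a 1-d x (illustrative).
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=40)
y = 10 * x**2 + rng.normal(scale=1.0, size=x.size)
X = np.vander(x, N=11, increasing=True)

ridge = Ridge(alpha=1.0).fit(X, y)                      # l2: shrinks all weights
lasso = Lasso(alpha=0.01, max_iter=100_000).fit(X, y)   # l1: zeros some weights out

print(np.count_nonzero(ridge.coef_), "nonzero ridge weights")
print(np.count_nonzero(lasso.coef_), "nonzero lasso weights")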
Lecture 08 – Part 02: Logistic Regression
Note
▶ The midterm will cover everything up to right now.
▶ Regularization: yes.
▶ Logistic regression: no.
Predicting Heart Disease
▶ Classification problem: Does a patient have heart disease?
▶ Features: blood pressure, cholesterol level, exercise amount, maximum heart rate, sex.
Better idea...
▶ Instead of predicting yes/no...
▶ Give a probability that they have heart disease.
▶ 1 = definitely yes
▶ 0 = definitely no
▶ 0.75 = probably yes
▶ …
Associations
▶ If cholesterol is high, increased likelihood.
▶ Positive association.
▶ If exercise is low, increased likelihood.
▶ Negative association.
The Model
▶ Measure cholesterol (𝑥1), exercise (𝑥2), etc.
▶ Idea: a weighted¹ “vote” for heart disease:
𝑤1𝑥1 + 𝑤2𝑥2 + … + 𝑤𝑑𝑥𝑑
▶ Convention:
▶ A positive number = vote for yes
▶ A negative number = vote for no
¹We’ll learn the weights later.
The Model
▶ Add a “bias” term:
𝑤0 + 𝑤1𝑥1 + 𝑤2𝑥2 + … + 𝑤𝑑𝑥𝑑 = 𝑤⃗ ⋅ Aug(𝑥⃗)
▶ The more positive 𝑤⃗ ⋅ Aug(𝑥⃗), the more likely.
▶ The more negative 𝑤⃗ ⋅ Aug(𝑥⃗), the less likely.
Converting to a Probability
▶ Probabilities are between 0 and 1.
▶ Problem: 𝑤⃗ ⋅ Aug(𝑥⃗) can be anything in (−∞, ∞).
▶ We need to convert it to a probability.
The Logistic Function
𝜎(𝑡) = 1 / (1 + 𝑒^(−𝑡))
[Figure: graph of the logistic function 𝜎(𝑡), increasing from 0 to 1]
The Model
▶ Our simplified model for the probability of heart disease:
𝐻(𝑥⃗) = 𝜎(𝑤⃗ ⋅ Aug(𝑥⃗))
▶ What should 𝑤⃗ be?
▶ To find 𝑤⃗, use the principle of maximum likelihood.
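A direct transcription of this model into numpy; augment adds the constant 1 feature so that 𝑤0 acts as the bias, and the weights and feature values in the usage line are made-up placeholders that maximum likelihood would eventually supply:

import numpy as np

def sigmoid(t):
    # The logistic function: maps any real number into (0, 1).
    return 1.0 / (1.0 + np.exp(-t))

def augment(x):
    # Aug(x): prepend a 1 so that w_0 acts as the bias term.
    return np.concatenate(([1.0], x))

def H(x, w):
    # Predicted probability of heart disease for feature vector x.
    return sigmoid(w @ augment(x))

# Usage sketch (placeholder numbers, not fitted values):
# H(np.array([220.0, 2.5, 150.0]), w=np.array([-1.0, 0.01, -0.3, 0.005]))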
Maximum Likelihood
▶ Suppose you have an unfair coin.
▶ Probability of heads is 𝑝, unknown.
▶ Flip 8 times and see: H, H, T, H, H, H, H, T
▶ Which is more likely: 𝑝 = 0.5 or 𝑝 = 0.75?
Maximum Likelihood
▶ Assume coin flips are independent.
▶ The likelihood of H, H, T, H, H, H, H, T is:
L(𝑝) = 𝑝 ⋅ 𝑝 ⋅ (1 − 𝑝) ⋅ 𝑝 ⋅ 𝑝 ⋅ 𝑝 ⋅ 𝑝 ⋅ (1 − 𝑝) = 𝑝⁶(1 − 𝑝)²
▶ Idea: find 𝑝 maximizing L(𝑝).
▶ Equivalently, find 𝑝 maximizing log L(𝑝).
Maximum Likelihood
Find 𝑝 maximizing log L(𝑝) = log [𝑝⁶(1 − 𝑝)²]:
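The slide leaves the maximization for the board; one way to finish it is to differentiate the log likelihood and set the derivative to zero:

log L(𝑝) = 6 log 𝑝 + 2 log(1 − 𝑝)
d/d𝑝 log L(𝑝) = 6/𝑝 − 2/(1 − 𝑝) = 0  ⟹  6(1 − 𝑝) = 2𝑝  ⟹  𝑝 = 6/8 = 0.75

So of the two candidates, 𝑝 = 0.75 makes the observed sequence more likely than 𝑝 = 0.5.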
Maximum Likelihood
▶ In general, given 𝑛 observed flips 𝐹1 = 𝑓1, … , 𝐹𝑛 = 𝑓𝑛 (say 𝑛1 heads and 𝑛2 tails), maximize:
log [𝑃(𝐹1 = 𝑓1) ⋅ 𝑃(𝐹2 = 𝑓2) ⋯ 𝑃(𝐹𝑛 = 𝑓𝑛)] = ∑ᵢ₌₁ⁿ log 𝑃(𝐹𝑖 = 𝑓𝑖)
Back to Logistic Regression
▶ The probability that person 𝑖 has heart disease:
𝐻(𝑥⃗(𝑖)) = 𝜎(𝑤⃗ ⋅ Aug(𝑥⃗(𝑖)))
▶ Gather a data set, (𝑥⃗(1), 𝑦1), … , (𝑥⃗(𝑛), 𝑦𝑛).
▶ What is the most likely 𝑤⃗?
Maximum Likelihood
▶ Suppose 3 people, (+, −, +).
▶ Likelihood:
𝐻(𝑥⃗(1)) ⋅ (1 − 𝐻(𝑥⃗(2))) ⋅ 𝐻(𝑥⃗(3))
= 𝜎(𝑤⃗ ⋅ Aug(𝑥⃗(1))) ⋅ (1 − 𝜎(𝑤⃗ ⋅ Aug(𝑥⃗(2)))) ⋅ 𝜎(𝑤⃗ ⋅ Aug(𝑥⃗(3)))
= 1/(1 + 𝑒^(−𝑤⃗⋅Aug(𝑥⃗(1)))) ⋅ (1 − 1/(1 + 𝑒^(−𝑤⃗⋅Aug(𝑥⃗(2))))) ⋅ 1/(1 + 𝑒^(−𝑤⃗⋅Aug(𝑥⃗(3))))
Observation
▶ Note:
1 − 1/(1 + 𝑒^(−𝑡)) = 1/(1 + 𝑒^𝑡)
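To verify the identity, combine the left side over a common denominator and then multiply numerator and denominator by 𝑒^𝑡:

1 − 1/(1 + 𝑒^(−𝑡)) = 𝑒^(−𝑡)/(1 + 𝑒^(−𝑡)) = 1/(𝑒^𝑡 + 1)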
Maximum Likelihood
▶ The likelihood:
1/(1 + 𝑒^(−𝑤⃗⋅Aug(𝑥⃗(1)))) ⋅ 1/(1 + 𝑒^(𝑤⃗⋅Aug(𝑥⃗(2)))) ⋅ 1/(1 + 𝑒^(−𝑤⃗⋅Aug(𝑥⃗(3))))
▶ Suppose 𝑦𝑖 = 1 if positive, 𝑦𝑖 = −1 if negative:
1/(1 + 𝑒^(−𝑦1 𝑤⃗⋅Aug(𝑥⃗(1)))) ⋅ 1/(1 + 𝑒^(−𝑦2 𝑤⃗⋅Aug(𝑥⃗(2)))) ⋅ 1/(1 + 𝑒^(−𝑦3 𝑤⃗⋅Aug(𝑥⃗(3))))
Maximum Likelihood
▶ In general, the likelihood is:
L(𝑤⃗) = ∏ᵢ₌₁ⁿ 1/(1 + 𝑒^(−𝑦𝑖 𝑤⃗⋅Aug(𝑥⃗(𝑖))))
▶ The log likelihood is:
log L(𝑤⃗) = −∑ᵢ₌₁ⁿ log [1 + 𝑒^(−𝑦𝑖 𝑤⃗⋅Aug(𝑥⃗(𝑖)))]
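A small numpy sketch for evaluating this log likelihood at a candidate 𝑤⃗, with labels 𝑦𝑖 ∈ {−1, +1}; using np.logaddexp for numerical stability is an implementation choice, not something from the lecture:

import numpy as np

def log_likelihood(w, X_aug, y):
    # log L(w) = -sum_i log(1 + exp(-y_i * w . Aug(x_i)))
    # X_aug: n-by-(d+1) matrix whose rows are Aug(x_i); y: labels in {-1, +1}.
    margins = y * (X_aug @ w)
    # log(1 + e^(-m)) computed stably as logaddexp(0, -m).
    return -np.sum(np.logaddexp(0.0, -margins))

# There is no closed-form maximizer; the next lecture turns to gradient methods.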
Maximizing Likelihood
▶ Goal: find 𝑤⃗ maximizing log L.
▶ Take gradient, set to zero, solve?
▶ Problem: try it, you’ll get stuck.
▶ Unlike least-squares regression, there is no explicit solution.
Next Time
How to maximize the log likelihood with gradient descent.