Linear Regression + Optimization for ML

10-601 Introduction to Machine Learning
Lecture 8, Feb. 07, 2020
Matt Gormley
Machine Learning Department, School of Computer Science, Carnegie Mellon University
Q&A
Q: How can I get more one-on-one interaction with the course staff?
A: Attend office hours as soon after the homework release as possible!
Reminders
• Homework 3: KNN, Perceptron, Linear Regression
  – Out: Wed, Feb. 05 (+1 day)
  – Due: Wed, Feb. 12 at 11:59pm
• Today's In-Class Poll
  – http://p8.mlcourse.org
LINEAR REGRESSION
Regression Problems
Chalkboard
– Definition of Regression
– Linear functions
– Residuals
– Notation trick: fold in the intercept
OPTIMIZATION FOR ML: The Big Picture
Optimization for ML
Not quite the same setting as other fields…
– The function we are optimizing might not be the true goal (e.g., likelihood vs. generalization error)
– Precision might not matter (e.g., the data is noisy, so being optimal up to 1e-16 might not help)
– Stopping early can help generalization error (i.e., "early stopping" is a technique for regularization, discussed more next time)
min vs. argmin
y = f(x) = x² + 1

v* = min_x f(x)
x* = argmin_x f(x)

1. Q: What is v*?
   A: v* = 1, the minimum value of the function
2. Q: What is x*?
   A: x* = 0, the argument that yields the minimum value
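A quick numerical sketch of the distinction; the grid approximation is illustrative, and the exact answers are v* = 1 at x* = 0:

```python
# min vs. argmin for f(x) = x² + 1, approximated on a grid of x values.
import numpy as np

f = lambda x: x ** 2 + 1
xs = np.linspace(-3, 3, 601)     # grid of candidate x values (step 0.01)
ys = f(xs)

v_star = ys.min()                # min:    the smallest value of f
x_star = xs[ys.argmin()]         # argmin: the x that achieves that value
print(v_star, x_star)            # 1.0 0.0
```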
Linear Regression as Function Approximation
Contour Plots
1. Each level curve is labeled with a value
2. The value label indicates the value of the function for all points lying on that level curve
3. Just like a topographical map, but for a function

J(θ) = J(θ1, θ2) = (10(θ1 − 0.5))² + (6(θ2 − 0.4))²

[Contour plot over θ1 (horizontal axis) and θ2 (vertical axis)]
![Page 12: Machine Learning Department School of Computer Science ...mgormley/courses/10601/slides/lecture8-opt.pdf · lecture8-opt Author: Matt Gormley Created Date: 2/12/2020 3:19:23 PM ...](https://reader034.fdocuments.us/reader034/viewer/2022043017/5f39b672ca3535563d1134b6/html5/thumbnails/12.jpg)
Optimization by Random Guessing
Optimization Method #0: Random Guessing
1. Pick a random θ
2. Evaluate J(θ)
3. Repeat steps 1 and 2 many times
4. Return the θ that gives the smallest J(θ)
J(θ) = J(θ1, θ2) = (10(θ1 − 0.5))² + (6(θ2 − 0.4))²

[Contour plot over θ1, θ2 with the four guesses marked]

| t | θ1  | θ2  | J(θ1, θ2) |
|---|-----|-----|-----------|
| 1 | 0.2 | 0.2 | 10.4      |
| 2 | 0.3 | 0.7 | 7.2       |
| 3 | 0.6 | 0.4 | 1.0       |
| 4 | 0.9 | 0.7 | 19.2      |
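A minimal sketch of Method #0 on the example objective above. Sampling θ uniformly from [0,1)² is an assumption (the slide doesn't specify a sampling distribution), and all names are illustrative:

```python
# Optimization Method #0 (Random Guessing) on
# J(θ1, θ2) = (10(θ1 − 0.5))² + (6(θ2 − 0.4))².
import random

def J(theta1, theta2):
    return (10 * (theta1 - 0.5)) ** 2 + (6 * (theta2 - 0.4)) ** 2

def random_guessing(num_guesses=1000, seed=0):
    rng = random.Random(seed)
    best_theta, best_J = None, float("inf")
    for _ in range(num_guesses):
        theta = (rng.random(), rng.random())   # 1. pick a random θ in [0,1)²
        value = J(*theta)                      # 2. evaluate J(θ)
        if value < best_J:                     # 4. keep the smallest J(θ) seen
            best_theta, best_J = theta, value
    return best_theta, best_J

print(random_guessing())  # lands near θ = (0.5, 0.4), where J(θ) = 0
```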
![Page 13: Machine Learning Department School of Computer Science ...mgormley/courses/10601/slides/lecture8-opt.pdf · lecture8-opt Author: Matt Gormley Created Date: 2/12/2020 3:19:23 PM ...](https://reader034.fdocuments.us/reader034/viewer/2022043017/5f39b672ca3535563d1134b6/html5/thumbnails/13.jpg)
Optimization by Random Guessing
Optimization Method #0: Random Guessing
1. Pick a random θ
2. Evaluate J(θ)
3. Repeat steps 1 and 2 many times
4. Return the θ that gives the smallest J(θ)

J(θ) = J(θ1, θ2) = (10(θ1 − 0.5))² + (6(θ2 − 0.4))²

[Same contour plot and table of guesses as on the previous slide]
For Linear Regression:
• the objective function is Mean Squared Error (MSE)
• MSE = J(w, b) = J(θ1, θ2) = (1/N) Σi (y(i) − (w·x(i) + b))²
• contour plot: each line is labeled with its MSE; lower means a better fit
• the minimum corresponds to the parameters (w, b) = (θ1, θ2) that best fit the training dataset
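A short sketch of this MSE objective on illustrative data (the `mse` helper and the toy dataset are not from the lecture):

```python
# MSE for 1-D linear regression: J(w, b) = (1/N) Σi (y(i) − (w·x(i) + b))².
import numpy as np

def mse(w, b, x, y):
    """Mean squared error of the line h(x) = w*x + b on data (x, y)."""
    residuals = y - (w * x + b)          # y(i) − h(x(i)) for every example
    return np.mean(residuals ** 2)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.9, 4.1, 6.0, 8.1])       # roughly y = 2x
print(mse(2.0, 0.0, x, y))               # small MSE: a good fit
print(mse(0.5, 1.0, x, y))               # much larger MSE: a worse fit
```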
![Page 14: Machine Learning Department School of Computer Science ...mgormley/courses/10601/slides/lecture8-opt.pdf · lecture8-opt Author: Matt Gormley Created Date: 2/12/2020 3:19:23 PM ...](https://reader034.fdocuments.us/reader034/viewer/2022043017/5f39b672ca3535563d1134b6/html5/thumbnails/14.jpg)
Linear Regression by Rand. Guessing

Optimization Method #0: Random Guessing
1. Pick a random θ
2. Evaluate J(θ)
3. Repeat steps 1 and 2 many times
4. Return the θ that gives the smallest J(θ)

[Plot: candidate lines h(x; θ(1)), …, h(x; θ(4)) against the unknown target y = h*(x); axes: time vs. # tourists (thousands)]
For Linear Regression:
• the target function h*(x) is unknown
• we only have access to h*(x) through the training examples (x(i), y(i))
• we want an h(x; θ(t)) that best approximates h*(x)
• we enable generalization with an inductive bias that restricts the hypothesis class to linear functions
![Page 15: Machine Learning Department School of Computer Science ...mgormley/courses/10601/slides/lecture8-opt.pdf · lecture8-opt Author: Matt Gormley Created Date: 2/12/2020 3:19:23 PM ...](https://reader034.fdocuments.us/reader034/viewer/2022043017/5f39b672ca3535563d1134b6/html5/thumbnails/15.jpg)
Linear Regression by Rand. Guessing
Optimization Method #0: Random Guessing
1. Pick a random θ
2. Evaluate J(θ)
3. Repeat steps 1 and 2 many times
4. Return the θ that gives the smallest J(θ)

J(θ) = J(θ1, θ2) = (10(θ1 − 0.5))² + (6(θ2 − 0.4))²

[Side by side: the contour plot over θ1, θ2 and the table of guesses from above, next to the candidate lines h(x; θ(1)), …, h(x; θ(4)) plotted against the unknown y = h*(x); axes: time vs. # tourists (thousands)]
![Page 16: Machine Learning Department School of Computer Science ...mgormley/courses/10601/slides/lecture8-opt.pdf · lecture8-opt Author: Matt Gormley Created Date: 2/12/2020 3:19:23 PM ...](https://reader034.fdocuments.us/reader034/viewer/2022043017/5f39b672ca3535563d1134b6/html5/thumbnails/16.jpg)
OPTIMIZATION METHOD #1: GRADIENT DESCENT
Optimization for ML
Chalkboard
– Unconstrained optimization
– Derivatives
– Gradient
Topographical Maps
Gradients
Gradients
These are the gradients that Gradient Ascent would follow.
(Negative) Gradients
These are the negative gradients that Gradient Descent would follow.
(Negative) Gradient Paths
Shown are the paths that Gradient Descent would follow if it were making infinitesimally small steps.
Pros and cons of gradient descent
• Simple and often quite effective on ML tasks
• Often very scalable
• Only applies to smooth (differentiable) functions
• Might find a local minimum rather than a global one

Slide courtesy of William Cohen
Gradient Descent
Chalkboard
– Gradient Descent Algorithm
– Details: starting point, stopping criterion, line search
Gradient Descent
Algorithm 1: Gradient Descent
1: procedure GD(D, θ(0))
2:   θ ← θ(0)
3:   while not converged do
4:     θ ← θ − λ∇θJ(θ)   (λ is the step size)
5:   return θ

In order to apply GD to Linear Regression, all we need is the gradient of the objective function (i.e., the vector of partial derivatives):

∇θJ(θ) = [ dJ(θ)/dθ1, dJ(θ)/dθ2, …, dJ(θ)/dθM ]ᵀ
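A minimal sketch of the loop in Algorithm 1; the step size `lambda_`, tolerance, and iteration cap are illustrative choices, not values from the lecture:

```python
# Generic gradient descent: repeatedly step opposite the gradient.
import numpy as np

def gradient_descent(grad_J, theta0, lambda_=0.01, tol=1e-6, max_iters=10_000):
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iters):
        g = grad_J(theta)                 # evaluate ∇θJ(θ)
        if np.linalg.norm(g) <= tol:      # converged: gradient is tiny
            break
        theta = theta - lambda_ * g       # step opposite the gradient
    return theta

# Example: J(θ) = (10(θ1 − 0.5))² + (6(θ2 − 0.4))², whose gradient is
# [200(θ1 − 0.5), 72(θ2 − 0.4)].
grad = lambda t: np.array([200 * (t[0] - 0.5), 72 * (t[1] - 0.4)])
print(gradient_descent(grad, [0.0, 0.0], lambda_=0.005))  # ≈ [0.5, 0.4]
```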
![Page 27: Machine Learning Department School of Computer Science ...mgormley/courses/10601/slides/lecture8-opt.pdf · lecture8-opt Author: Matt Gormley Created Date: 2/12/2020 3:19:23 PM ...](https://reader034.fdocuments.us/reader034/viewer/2022043017/5f39b672ca3535563d1134b6/html5/thumbnails/27.jpg)
Gradient Descent
[Algorithm 1: Gradient Descent, repeated from the previous slide]
There are many possible ways to detect convergence. For example, we could check whether the L2 norm of the gradient is below some small tolerance:

||∇θJ(θ)||₂ ≤ ε

Alternatively, we could check that the reduction in the objective function from one iteration to the next is small.
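Hedged sketches of both convergence checks; `eps` and the helper names are illustrative:

```python
# Two ways to detect convergence for the gradient descent loop above.
import numpy as np

def converged_by_gradient(g, eps=1e-6):
    """Stop when the L2 norm of the gradient is tiny: ||∇θJ(θ)||₂ ≤ ε."""
    return np.linalg.norm(g) <= eps

def converged_by_objective(J_prev, J_curr, eps=1e-8):
    """Stop when the objective barely decreased between iterations."""
    return (J_prev - J_curr) <= eps
```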
![Page 28: Machine Learning Department School of Computer Science ...mgormley/courses/10601/slides/lecture8-opt.pdf · lecture8-opt Author: Matt Gormley Created Date: 2/12/2020 3:19:23 PM ...](https://reader034.fdocuments.us/reader034/viewer/2022043017/5f39b672ca3535563d1134b6/html5/thumbnails/28.jpg)
GRADIENT DESCENT FOR LINEAR REGRESSION
Linear Regression as Function Approximation
Linear Regression by Gradient Desc.

Optimization Method #1: Gradient Descent
1. Pick a random θ
2. Repeat:
   a. Evaluate the gradient ∇J(θ)
   b. Step opposite the gradient
3. Return the θ that gives the smallest J(θ)

J(θ) = J(θ1, θ2) = (10(θ1 − 0.5))² + (6(θ2 − 0.4))²

[Contour plot over θ1, θ2 showing the path of the iterates]

| t | θ1   | θ2   | J(θ1, θ2) |
|---|------|------|-----------|
| 1 | 0.01 | 0.02 | 25.2      |
| 2 | 0.30 | 0.12 | 8.7       |
| 3 | 0.51 | 0.30 | 1.5       |
| 4 | 0.59 | 0.43 | 0.2       |
Linear Regression by Gradient Desc.

[Same method steps and table of iterates as above; plot of the lines h(x; θ(1)), …, h(x; θ(4)) approaching the unknown y = h*(x); axes: time vs. # tourists (thousands)]
Linear Regression by Gradient Desc.

J(θ) = J(θ1, θ2) = (10(θ1 − 0.5))² + (6(θ2 − 0.4))²

[Same method steps, table of iterates, contour plot, and fitted-lines plot as above]
Linear Regression by Gradient Desc.
[Same table of iterates and fitted-lines plot as above, plus a convergence curve: mean squared error J(θ1, θ2) vs. iteration t]
Linear Regression by Gradient Desc.
J(θ) = J(θ1, θ2) = (10(θ1 − 0.5))² + (6(θ2 − 0.4))²

[Contour plot, table of iterates, fitted lines vs. the unknown y = h*(x), and the convergence curve of mean squared error J(θ1, θ2) vs. iteration t]
Optimization for Linear Regression
Chalkboard
– Computing the gradient for Linear Regression
– Gradient Descent for Linear Regression
Gradient Calculation for Linear Regression
[used by Gradient Descent]
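The derivation itself was worked on the slide/board and isn't captured in this transcript. For reference, a standard form of the MSE gradient with the intercept folded in (so h(x; θ) = θᵀx), consistent with the MSE definition above; some presentations drop or change the constant factor:

```latex
J(\theta) = \frac{1}{N}\sum_{i=1}^{N}\big(y^{(i)} - \theta^\top x^{(i)}\big)^2
\quad\Longrightarrow\quad
\frac{\partial J(\theta)}{\partial \theta_m}
= -\frac{2}{N}\sum_{i=1}^{N}\big(y^{(i)} - \theta^\top x^{(i)}\big)\, x^{(i)}_m
```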
![Page 37: Machine Learning Department School of Computer Science ...mgormley/courses/10601/slides/lecture8-opt.pdf · lecture8-opt Author: Matt Gormley Created Date: 2/12/2020 3:19:23 PM ...](https://reader034.fdocuments.us/reader034/viewer/2022043017/5f39b672ca3535563d1134b6/html5/thumbnails/37.jpg)
GD for Linear Regression

Gradient Descent for Linear Regression repeatedly takes steps opposite the gradient of the objective function.
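Putting the pieces together, a hedged sketch of GD for linear regression with the intercept folded in; `lr`, `n_iters`, and the toy data are illustrative, not values from the lecture:

```python
# Gradient descent on the MSE objective for linear regression.
import numpy as np

def gd_linear_regression(X, y, lr=0.01, n_iters=5000):
    X = np.hstack([np.ones((X.shape[0], 1)), X])   # fold in the intercept
    N, M = X.shape
    theta = np.zeros(M)                            # starting point θ(0)
    for _ in range(n_iters):
        residuals = y - X @ theta                  # y(i) − θᵀx(i)
        grad = -(2.0 / N) * X.T @ residuals        # ∇θ MSE (see derivation above)
        theta -= lr * grad                         # step opposite the gradient
    return theta

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.1, 4.9, 7.2, 8.8])                 # roughly y = 2x + 1
print(gd_linear_regression(X, y))                  # ≈ [1.15, 1.94] (intercept, slope)
```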
![Page 38: Machine Learning Department School of Computer Science ...mgormley/courses/10601/slides/lecture8-opt.pdf · lecture8-opt Author: Matt Gormley Created Date: 2/12/2020 3:19:23 PM ...](https://reader034.fdocuments.us/reader034/viewer/2022043017/5f39b672ca3535563d1134b6/html5/thumbnails/38.jpg)
CONVEXITY
Convexity
Convexity
Convex Function
• Each local minimum is a global minimum

Nonconvex Function
• A nonconvex function is any function that is not convex
• Each local minimum is not necessarily a global minimum
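The slide states the consequences without the formal definition; for reference, the standard definition of convexity:

```latex
f \text{ is convex} \iff
f(\lambda x + (1-\lambda)x') \le \lambda f(x) + (1-\lambda) f(x')
\quad \text{for all } x, x' \text{ and } \lambda \in [0,1]
```

f is strictly convex when the inequality is strict for all x ≠ x′ and λ ∈ (0, 1).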
![Page 41: Machine Learning Department School of Computer Science ...mgormley/courses/10601/slides/lecture8-opt.pdf · lecture8-opt Author: Matt Gormley Created Date: 2/12/2020 3:19:23 PM ...](https://reader034.fdocuments.us/reader034/viewer/2022043017/5f39b672ca3535563d1134b6/html5/thumbnails/41.jpg)
Convexity
Each local minimum of a convex function is also a global minimum.
A strictly convex function has a unique global minimum.
![Page 42: Machine Learning Department School of Computer Science ...mgormley/courses/10601/slides/lecture8-opt.pdf · lecture8-opt Author: Matt Gormley Created Date: 2/12/2020 3:19:23 PM ...](https://reader034.fdocuments.us/reader034/viewer/2022043017/5f39b672ca3535563d1134b6/html5/thumbnails/42.jpg)
CONVEXITY AND LINEAR REGRESSION
Convexity and Linear Regression
The Mean Squared Error function, which we minimize to learn the parameters of Linear Regression, is convex!
…but in the general case it is not strictly convex.
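Why not strictly convex in general? If two features are identical (more generally, if the columns of X are linearly dependent), different θ produce identical predictions and hence the same minimal MSE, so the minimizer is not unique. A small illustrative sketch:

```python
# With a duplicated feature, many different θ give the same predictions,
# hence the same (minimal) MSE: convex, but not strictly convex.
import numpy as np

x = np.array([1.0, 2.0, 3.0])
X = np.column_stack([x, x])               # two identical feature columns
y = 2.0 * x

theta_a = np.array([2.0, 0.0])            # put all weight on column 1
theta_b = np.array([0.5, 1.5])            # split the same total weight
print(np.mean((y - X @ theta_a) ** 2))    # 0.0
print(np.mean((y - X @ theta_b) ** 2))    # 0.0 (a different global minimizer)
```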
![Page 44: Machine Learning Department School of Computer Science ...mgormley/courses/10601/slides/lecture8-opt.pdf · lecture8-opt Author: Matt Gormley Created Date: 2/12/2020 3:19:23 PM ...](https://reader034.fdocuments.us/reader034/viewer/2022043017/5f39b672ca3535563d1134b6/html5/thumbnails/44.jpg)
Regression Loss Functions

In-Class Exercise: Which of the following could be used as loss functions for training a linear regression model? Select all that apply.
![Page 45: Machine Learning Department School of Computer Science ...mgormley/courses/10601/slides/lecture8-opt.pdf · lecture8-opt Author: Matt Gormley Created Date: 2/12/2020 3:19:23 PM ...](https://reader034.fdocuments.us/reader034/viewer/2022043017/5f39b672ca3535563d1134b6/html5/thumbnails/45.jpg)
Solving Linear Regression

Question:
Answer:
![Page 46: Machine Learning Department School of Computer Science ...mgormley/courses/10601/slides/lecture8-opt.pdf · lecture8-opt Author: Matt Gormley Created Date: 2/12/2020 3:19:23 PM ...](https://reader034.fdocuments.us/reader034/viewer/2022043017/5f39b672ca3535563d1134b6/html5/thumbnails/46.jpg)
OPTIMIZATION METHOD #2: CLOSED FORM SOLUTION
Calculus and Optimization
In-Class Exercise: Plot three functions:

Answer Here:
![Page 48: Machine Learning Department School of Computer Science ...mgormley/courses/10601/slides/lecture8-opt.pdf · lecture8-opt Author: Matt Gormley Created Date: 2/12/2020 3:19:23 PM ...](https://reader034.fdocuments.us/reader034/viewer/2022043017/5f39b672ca3535563d1134b6/html5/thumbnails/48.jpg)
Optimization: Closed form solutions
Chalkboard
– Zero Derivatives
– Example: 1-D function
– Example: higher dimensions
CLOSED FORM SOLUTION FOR LINEAR REGRESSION
Linear Regression as Function Approximation
Linear Regression: Closed Form

Optimization Method #2: Closed Form
1. Evaluate θMLE (via the normal equations: θMLE = (XᵀX)⁻¹Xᵀy)
2. Return θMLE

J(θ) = J(θ1, θ2) = (10(θ1 − 0.5))² + (6(θ2 − 0.4))²

| t   | θ1   | θ2   | J(θ1, θ2) |
|-----|------|------|-----------|
| MLE | 0.59 | 0.43 | 0.2       |

[Contour plot over θ1, θ2 marking θMLE; plot of h(x; θ(MLE)) against the unknown y = h*(x); axes: time vs. # tourists (thousands)]
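A small sketch of the closed-form computation; the helper name and toy data are illustrative, and `np.linalg.solve` is used rather than explicitly forming the inverse:

```python
# Closed-form (normal equations) solution: solve (XᵀX) θ = Xᵀy.
import numpy as np

def linreg_closed_form(X, y):
    X = np.hstack([np.ones((X.shape[0], 1)), X])   # fold in the intercept
    return np.linalg.solve(X.T @ X, X.T @ y)       # θ = (XᵀX)⁻¹Xᵀy

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.1, 4.9, 7.2, 8.8])
print(linreg_closed_form(X, y))                    # ≈ [1.15, 1.94], matching GD above
```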
![Page 52: Machine Learning Department School of Computer Science ...mgormley/courses/10601/slides/lecture8-opt.pdf · lecture8-opt Author: Matt Gormley Created Date: 2/12/2020 3:19:23 PM ...](https://reader034.fdocuments.us/reader034/viewer/2022043017/5f39b672ca3535563d1134b6/html5/thumbnails/52.jpg)
Optimization for Linear Regression
Chalkboard
– Closed-form (Normal Equations)
Computational Complexity of OLS

To solve the Ordinary Least Squares problem we compute θ = (XᵀX)⁻¹Xᵀy.

The resulting shapes of the matrices: with X of shape N × M, XᵀX is M × M and Xᵀy is M × 1.

The cost is linear in the # of examples, N, and polynomial in the # of features, M.
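The detailed cost breakdown isn't captured in the transcript; a standard accounting consistent with the slide's claim (linear in N, polynomial in M):

```latex
\underbrace{O(NM^2)}_{\text{form } X^\top X}
+ \underbrace{O(NM)}_{\text{form } X^\top y}
+ \underbrace{O(M^3)}_{\text{solve the } M \times M \text{ system}}
= O(NM^2 + M^3)
```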
![Page 54: Machine Learning Department School of Computer Science ...mgormley/courses/10601/slides/lecture8-opt.pdf · lecture8-opt Author: Matt Gormley Created Date: 2/12/2020 3:19:23 PM ...](https://reader034.fdocuments.us/reader034/viewer/2022043017/5f39b672ca3535563d1134b6/html5/thumbnails/54.jpg)
Gradient Descent

Cases in which to consider gradient descent:
1. What if we cannot find a closed-form solution?
2. What if we can, but it is inefficient to compute?
3. What if we can, but it is numerically unstable to compute?
Convergence Curves

[Figure adapted from Eric P. Xing: training MSE vs. computation for Gradient Descent, SGD, and the closed form (normal equations)]

• SGD reduces MSE much more rapidly than GD
• For GD / SGD, training MSE is initially large due to the uninformed initialization
• Def: an epoch is a single pass through the training data
  1. For GD, there is only one update per epoch
  2. For SGD, there are N updates per epoch, where N = (# train examples)
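To contrast with the batch GD sketch above, a hedged sketch of SGD showing N updates per epoch; `lr`, `n_epochs`, and the toy data are illustrative:

```python
# Stochastic gradient descent for linear regression: one update per
# training example, so N updates per epoch (one pass over the data).
import numpy as np

def sgd_linear_regression(X, y, lr=0.01, n_epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    X = np.hstack([np.ones((X.shape[0], 1)), X])   # fold in the intercept
    N, M = X.shape
    theta = np.zeros(M)
    for _ in range(n_epochs):                      # one epoch = one pass over the data
        for i in rng.permutation(N):               # N updates per epoch
            residual = y[i] - X[i] @ theta
            theta += lr * 2.0 * residual * X[i]    # step on one example's gradient
    return theta

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.1, 4.9, 7.2, 8.8])
print(sgd_linear_regression(X, y))                 # close to [1.15, 1.94]
```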