Applied Machine Learning, Lecture 5-1: Optimization in machine learning
Selpi ([email protected])
These slides are a further development of Richard Johansson's slides.
February 7, 2020
Overview
Optimization in machine learning
SGD for training a linear regression model
introduction to PyTorch
regularized linear regression models
Objective functions for machine learning
- in this lecture, we'll discuss objective functions for our models
- training a model = minimizing (or maximizing) this objective
- often, the objective consists of a combination of
  - a loss function that measures how well the model fits the training data
  - a regularizer that measures how simple/complex the model is
minimizing squared errors
- in the least squares approach to linear regression, we want to minimize the mean (or sum) of squared errors:

  f(w) = (1/N) · Σ_{i=1..N} (w · x_i − y_i)²
- From calculus: if a "nice" function has a maximum or minimum, then the derivative will be zero there
the gradient
- the multidimensional equivalent of the derivative is called the gradient
- if f is a function of n variables, then the gradient is an n-dimensional vector, often written ∇f(x)
- intuition: the gradient points in the uphill direction
- again: the gradient is zero if we have an optimum
- this intuition leads to a simple idea for finding the minimum:
  - take a small step in the direction opposite to the gradient
  - repeat until the gradient is close enough to zero
Gradient descent, pseudocode
- the same thing again, in pseudocode:
  1. set x to some initial value, and select a suitable step size η
  2. compute the gradient ∇f(x)
  3. if ∇f(x) is small enough, we are done
  4. otherwise, subtract η · ∇f(x) from x and go back to step 2
- conversely, to find the maximum we can do gradient ascent: then we instead add η · ∇f(x) to x
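As a concrete illustration (a NumPy sketch, not part of the original slides; the example function and step size are made up):

import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, tol=1e-6, max_iter=10000):
    """Minimize a function, given a function grad_f that computes its gradient."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:   # gradient close enough to zero: done
            break
        x = x - eta * g               # small step in the downhill direction
    return x

# example: minimize f(x) = ||x - (3, -1)||^2, whose gradient is 2 * (x - (3, -1))
print(gradient_descent(lambda x: 2 * (x - np.array([3.0, -1.0])), x0=[0.0, 0.0]))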
Overview
Optimization in machine learning
SGD for training a linear regression model
introduction to PyTorch
regularized linear regression models
stochastic gradient descent (SGD)
- in machine learning, our objective functions are often defined as sums over the whole training set, such as the least squares loss function

  f(w) = (1/N) · Σ_{i=1..N} (w · x_i − y_i)²
- if N is large, it would take a long time to use gradient descent to minimize the loss function
  - because gradient descent takes into account all the instances in the training set
- stochastic gradient descent: simplify the computation by computing the gradient using just a small part
  - in the extreme case, a single training instance
  - or a minibatch: a few instances
SGD: pseudocode
1. set w to some initial value, and select a suitable step size η
2. select a single training instance (x_i, y_i)
3. compute the gradient ∇f_i(w) using this instance only
4. if we are "done", stop
5. otherwise, subtract η · ∇f_i(w) from w and go back to step 2

w = (0, ..., 0)
for each (x_i, y_i) in the training set:
    compute the gradient ∇f_i(w) for the current instance (x_i, y_i)
    w = w − η · ∇f_i(w)
return w
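As an illustration, the loop above could be written generically in Python, with the per-instance gradient passed in as a function (a sketch, not the lecture's code; the names and default values are made up):

import numpy as np

def sgd(grad_i, data, n_features, eta=0.01, n_epochs=10):
    """Generic SGD: grad_i(w, x_i, y_i) returns the gradient for one instance."""
    w = np.zeros(n_features)
    for _ in range(n_epochs):
        for x_i, y_i in data:
            w = w - eta * grad_i(w, x_i, y_i)   # step against the per-instance gradient
    return w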
when to terminate SGD?
- simple solution: fixed number of iterations
- or stop when we've seen no improvement for some time
- or evaluate on a held-out set: early stopping (see the sketch below)
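For the last option, a minimal sketch of early stopping might look like this (the synthetic data, patience value, and hyperparameters are all made-up assumptions for illustration):

import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])
X_train = rng.normal(size=(80, 3)); y_train = X_train @ true_w + 0.1 * rng.normal(size=80)
X_dev = rng.normal(size=(20, 3));   y_dev = X_dev @ true_w + 0.1 * rng.normal(size=20)

w, eta, patience = np.zeros(3), 0.01, 3
best_err, bad_epochs = float("inf"), 0
for epoch in range(100):
    for x_i, y_i in zip(X_train, y_train):    # one epoch of SGD on the least squares loss
        w -= eta * 2 * (w @ x_i - y_i) * x_i
    err = np.mean((X_dev @ w - y_dev) ** 2)   # error on the held-out set
    if err < best_err:
        best_err, bad_epochs = err, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:            # no improvement for a while: stop early
            break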
applying SGD to the least squares loss
- the least squares loss function:

  f(w) = (1/N) · Σ_{i=1..N} (w · x_i − y_i)²
- let's consider just a single instance:

  f_i(w) = (w · x_i − y_i)²

- for this instance, the gradient of the least squares loss with respect to w is

  ∇f_i(w) = 2 · (w · x_i − y_i) · x_i
plugging it back into SGD ⇒ Widrow-Hoff again!

w = (0, ..., 0)
for each (x_i, y_i) in the training set:
    error = w · x_i − y_i
    ∇f_i(w) = 2 · error · x_i
    w = w − η · ∇f_i(w)
return w
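A runnable NumPy version of this update rule might look as follows (a sketch; the synthetic data and hyperparameters are just for illustration):

import numpy as np

def least_squares_sgd(X, y, eta=0.01, n_epochs=10):
    """SGD for the least squares loss (the Widrow-Hoff rule above)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for x_i, y_i in zip(X, y):
            error = w @ x_i - y_i        # prediction error for this instance
            w -= eta * 2 * error * x_i   # gradient step: 2 · error · x_i
    return w

# tiny synthetic check: the learned weights should come close to [1, -2, 0.5]
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=100)
print(least_squares_sgd(X, y))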
Overview
Optimization in machine learning
SGD for training a linear regression model
introduction to PyTorch
regularized linear regression models
the Python machine learning ecosystem (selection)
finding and installing PyTorch
- easy to install using conda
- https://pytorch.org/
what is the purpose of PyTorch?
- on a high level: a library for prototyping ML models
- on a low level: NumPy-like, with automatic gradients
- easy to move computations to a GPU (if available)
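As a small taste of what this looks like (a sketch, not from the slides; the instance and target values are made up):

import torch

w = torch.zeros(3, requires_grad=True)   # weight vector, tracked for automatic gradients
x = torch.tensor([1.0, 2.0, -1.0])       # one (made-up) training instance
y = torch.tensor(0.5)                    # its target value

loss = (torch.dot(w, x) - y) ** 2        # least squares loss for this single instance
loss.backward()                          # autograd computes the gradient with respect to w

print(w.grad)                            # equals 2 · (w · x − y) · x, as derived earlier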
Overview
Optimization in machine learning
SGD for training a linear regression model
introduction to PyTorch
regularized linear regression models
recall: the fundamental tradeoff in machine learning
- goodness of fit: the learned model should describe the examples in the training data
- regularization: the model should be simple
- the least squares loss just takes care of the first part!

  f(w) = (1/N) · Σ_{i=1..N} (w · x_i − y_i)²
what does it mean to "keep the model simple"?
- concretely, how can we add regularization to the linear regression model?
- the most common approach is to add a term that keeps the weights small
- for instance, by penalizing the squared length (norm) of the weight vector:

  ‖w‖² = w1 · w1 + ... + wn · wn = w · w

- this is called an L2 regularizer (or an L2 penalty)
combining the pieces
- we combine the loss and the regularizer:

  (1/N) · Σ_{i=1..N} Loss(w, x_i, y_i) + α · Regularizer(w)
    = (1/N) · Σ_{i=1..N} (w · x_i − y_i)² + α · ‖w‖²

- in this formula, α is a "tweaking" parameter that controls the tradeoff between loss and regularization
ridge regression
- the combination of the least squares loss with L2 regularization is called ridge regression

  (1/N) · Σ_{i=1..N} (w · x_i − y_i)² + α · ‖w‖²

- in scikit-learn: sklearn.linear_model.Ridge
- (unregularized least squares: sklearn.linear_model.LinearRegression)
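A small usage sketch (the synthetic data and the value of alpha are just assumptions for illustration):

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

# made-up regression data, only to show the API
X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

ridge = Ridge(alpha=1.0)        # alpha is the regularization strength α
ridge.fit(X, y)

ols = LinearRegression()        # unregularized least squares, for comparison
ols.fit(X, y)

print(ridge.coef_)              # typically shrunk towards zero compared to ols.coef_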
discussion
(1/N) · Σ_{i=1..N} (w · x_i − y_i)² + α · ‖w‖²

- what happens if we set α to 0?
- what happens if we set α to a huge number?
why "ridge"?
SGD for the ridge regression model
(1/N) · Σ_{i=1..N} (w · x_i − y_i)² + α · ‖w‖²

- the gradient with respect to a single training instance is

  ∇f_i(w) = 2 · error · x_i + 2 · α · w,   where error = w · x_i − y_i

- in the pseudocode below, the constant factor 2 is absorbed into the step size η:

w = (0, ..., 0)
for each (x_i, y_i) in the training set:
    g = w · x_i
    error = g − y_i
    w = w − η · (error · x_i + α · w)
return w
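A runnable NumPy version of this loop might be (a sketch; the hyperparameter values are made up, and the factor 2 is again absorbed into the step size):

import numpy as np

def ridge_sgd(X, y, eta=0.01, alpha=0.1, n_epochs=10):
    """SGD for ridge regression, following the pseudocode above."""
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for x_i, y_i in zip(X, y):
            error = w @ x_i - y_i
            w -= eta * (error * x_i + alpha * w)   # loss gradient + L2 penalty gradient
    return w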
the Lasso and ElasticNet models
- another common regularizer is the L1 norm

  ‖w‖₁ = |w1| + ... + |wn|

- the L1 norm gives the Lasso model
- a combination of L1 and L2 gives the ElasticNet model
- in scikit-learn:
  - sklearn.linear_model.Lasso
  - sklearn.linear_model.ElasticNet
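A small usage sketch (the synthetic data and parameter values are just assumptions for illustration):

from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso

# made-up data where only a few features actually matter
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0)                        # L1 penalty only
elastic = ElasticNet(alpha=1.0, l1_ratio=0.5)   # mix of L1 and L2
lasso.fit(X, y)
elastic.fit(X, y)

# the L1 penalty tends to set many coefficients exactly to zero
print((lasso.coef_ == 0).sum(), "of", len(lasso.coef_), "Lasso coefficients are zero")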
why does Lasso give zero coefficients?
- figure from Hastie et al. book