CSC 311: Introduction to Machine Learning · cross-entropy as the loss function: L CE(y;t) = XK k=1...
Transcript of CSC 311: Introduction to Machine Learning · cross-entropy as the loss function: L CE(y;t) = XK k=1...
![Page 1: CSC 311: Introduction to Machine Learning · cross-entropy as the loss function: L CE(y;t) = XK k=1 t klogy k = t>(logy); where the log is applied elementwise. Just like with logistic](https://reader034.fdocuments.us/reader034/viewer/2022042413/5f2cf1bd59015d6dae4551b7/html5/thumbnails/1.jpg)
CSC 311: Introduction to Machine LearningLecture 5 - Multiclass Classification & Neural Networks
Amir-massoud Farahmand & Emad A.M. Andrews
University of Toronto
Intro ML (UofT) CSC311-Lec5 1 / 46
![Page 2: CSC 311: Introduction to Machine Learning · cross-entropy as the loss function: L CE(y;t) = XK k=1 t klogy k = t>(logy); where the log is applied elementwise. Just like with logistic](https://reader034.fdocuments.us/reader034/viewer/2022042413/5f2cf1bd59015d6dae4551b7/html5/thumbnails/2.jpg)
Overview
Classification: predicting a discrete-valued targetI Binary classification: predicting a binary-valued targetI Multiclass classification: predicting a discrete(> 2)-valued target
Examples of multi-class classificationI predict the value of a handwritten digitI classify e-mails as spam, travel, work, personal
Intro ML (UofT) CSC311-Lec5 2 / 46
![Page 3: CSC 311: Introduction to Machine Learning · cross-entropy as the loss function: L CE(y;t) = XK k=1 t klogy k = t>(logy); where the log is applied elementwise. Just like with logistic](https://reader034.fdocuments.us/reader034/viewer/2022042413/5f2cf1bd59015d6dae4551b7/html5/thumbnails/3.jpg)
Multiclass Classification
Classification tasks with more than two categories:It is very hard to say what makes a 2 Some examples from an earlier version of the net
Intro ML (UofT) CSC311-Lec5 3 / 46
![Page 4: CSC 311: Introduction to Machine Learning · cross-entropy as the loss function: L CE(y;t) = XK k=1 t klogy k = t>(logy); where the log is applied elementwise. Just like with logistic](https://reader034.fdocuments.us/reader034/viewer/2022042413/5f2cf1bd59015d6dae4551b7/html5/thumbnails/4.jpg)
Multiclass Classification
Targets form a discrete set {1, . . . ,K}.It’s often more convenient to represent them as one-hot vectors, ora one-of-K encoding:
t = (0, . . . , 0, 1, 0, . . . , 0)︸ ︷︷ ︸entry k is 1
∈ RK
Intro ML (UofT) CSC311-Lec5 4 / 46
![Page 5: CSC 311: Introduction to Machine Learning · cross-entropy as the loss function: L CE(y;t) = XK k=1 t klogy k = t>(logy); where the log is applied elementwise. Just like with logistic](https://reader034.fdocuments.us/reader034/viewer/2022042413/5f2cf1bd59015d6dae4551b7/html5/thumbnails/5.jpg)
Multiclass Classification
Now there are D input dimensions and K output dimensions, sowe need K ×D weights, which we arrange as a weight matrix W.
Also, we have a K-dimensional vector b of biases.
Linear predictions:
zk =
D∑j=1
wkjxj + bk for k = 1, 2, ...,K
Vectorized:z = Wx + b
Intro ML (UofT) CSC311-Lec5 5 / 46
![Page 6: CSC 311: Introduction to Machine Learning · cross-entropy as the loss function: L CE(y;t) = XK k=1 t klogy k = t>(logy); where the log is applied elementwise. Just like with logistic](https://reader034.fdocuments.us/reader034/viewer/2022042413/5f2cf1bd59015d6dae4551b7/html5/thumbnails/6.jpg)
Multiclass Classification
Predictions are like probabilities: want 1 ≥ yk ≥ 0 and∑
k yk = 1
A natural activation function to use is the softmax function, amultivariable generalization of the logistic function:
yk = softmax(z1, . . . , zK)k =ezk∑k′ e
zk′
The inputs zk are called the logits.
Properties:I Outputs are positive and sum to 1 (so they can be interpreted as
probabilities)I If one of the zk is much larger than the others, softmax(z)k ≈ 1
(behaves like argmax).I Exercise: how does the case of K = 2 relate to the logistic
function?
Note: sometimes σ(z) is used to denote the softmax function; inthis class, it will denote the logistic function applied elementwise.
Intro ML (UofT) CSC311-Lec5 6 / 46
![Page 7: CSC 311: Introduction to Machine Learning · cross-entropy as the loss function: L CE(y;t) = XK k=1 t klogy k = t>(logy); where the log is applied elementwise. Just like with logistic](https://reader034.fdocuments.us/reader034/viewer/2022042413/5f2cf1bd59015d6dae4551b7/html5/thumbnails/7.jpg)
Multiclass Classification
If a model outputs a vector of class probabilities, we can usecross-entropy as the loss function:
LCE(y, t) = −K∑k=1
tk log yk
= −t>(log y),
where the log is applied elementwise.
Just like with logistic regression, we typically combine the softmaxand cross-entropy into a softmax-cross-entropy function.
Intro ML (UofT) CSC311-Lec5 7 / 46
![Page 8: CSC 311: Introduction to Machine Learning · cross-entropy as the loss function: L CE(y;t) = XK k=1 t klogy k = t>(logy); where the log is applied elementwise. Just like with logistic](https://reader034.fdocuments.us/reader034/viewer/2022042413/5f2cf1bd59015d6dae4551b7/html5/thumbnails/8.jpg)
Multiclass Classification
Softmax regression:
z = Wx + b
y = softmax(z)
LCE = −t>(log y)
Gradient descent updates can be derived for each row of W:
∂LCE
∂wk=∂LCE
∂zk· ∂zk∂wk
= (yk − tk) · x
wk ← wk − α1
N
N∑i=1
(y(i)k − t
(i)k )x(i)
Similar to linear/logistic reg (no coincidence) (verify the update)
Intro ML (UofT) CSC311-Lec5 8 / 46
![Page 9: CSC 311: Introduction to Machine Learning · cross-entropy as the loss function: L CE(y;t) = XK k=1 t klogy k = t>(logy); where the log is applied elementwise. Just like with logistic](https://reader034.fdocuments.us/reader034/viewer/2022042413/5f2cf1bd59015d6dae4551b7/html5/thumbnails/9.jpg)
Limits of Linear Classification
Visually, it’s obvious that XOR is not linearly separable. But howto show this?
Intro ML (UofT) CSC311-Lec5 9 / 46
![Page 10: CSC 311: Introduction to Machine Learning · cross-entropy as the loss function: L CE(y;t) = XK k=1 t klogy k = t>(logy); where the log is applied elementwise. Just like with logistic](https://reader034.fdocuments.us/reader034/viewer/2022042413/5f2cf1bd59015d6dae4551b7/html5/thumbnails/10.jpg)
Limits of Linear Classification
Showing that XOR is not linearly separable (proof bycontradiction)
If two points lie in a half-space, line segment connecting them also lie inthe same halfspace.
Suppose there were some feasible weights (hypothesis). If the positiveexamples are in the positive half-space, then the green line segment mustbe as well.
Similarly, the red line segment must line within the negative half-space.
But the intersection can’t lie in both half-spaces. Contradiction!
Intro ML (UofT) CSC311-Lec5 10 / 46
![Page 11: CSC 311: Introduction to Machine Learning · cross-entropy as the loss function: L CE(y;t) = XK k=1 t klogy k = t>(logy); where the log is applied elementwise. Just like with logistic](https://reader034.fdocuments.us/reader034/viewer/2022042413/5f2cf1bd59015d6dae4551b7/html5/thumbnails/11.jpg)
Limits of Linear Classification
Sometimes we can overcome this limitation using feature maps,just like for linear regression. E.g., for XOR:
ψ(x) =
x1x2x1x2
x1 x2 ψ1(x) ψ2(x) ψ3(x) t
0 0 0 0 0 00 1 0 1 0 11 0 1 0 0 11 1 1 1 1 0
This is linearly separable. (Try it!)
Not a general solution: it can be hard to pick good basis functions.Instead, we’ll use neural nets to learn nonlinear hypothesesdirectly.
Intro ML (UofT) CSC311-Lec5 11 / 46
![Page 12: CSC 311: Introduction to Machine Learning · cross-entropy as the loss function: L CE(y;t) = XK k=1 t klogy k = t>(logy); where the log is applied elementwise. Just like with logistic](https://reader034.fdocuments.us/reader034/viewer/2022042413/5f2cf1bd59015d6dae4551b7/html5/thumbnails/12.jpg)
Neural Networks
Intro ML (UofT) CSC311-Lec5 12 / 46
![Page 13: CSC 311: Introduction to Machine Learning · cross-entropy as the loss function: L CE(y;t) = XK k=1 t klogy k = t>(logy); where the log is applied elementwise. Just like with logistic](https://reader034.fdocuments.us/reader034/viewer/2022042413/5f2cf1bd59015d6dae4551b7/html5/thumbnails/13.jpg)
Inspiration: The Brain
Neurons receive input signals and accumulate voltage. After somethreshold they will fire spiking responses.
[Pic credit: www.moleculardevices.com]
Intro ML (UofT) CSC311-Lec5 13 / 46
![Page 14: CSC 311: Introduction to Machine Learning · cross-entropy as the loss function: L CE(y;t) = XK k=1 t klogy k = t>(logy); where the log is applied elementwise. Just like with logistic](https://reader034.fdocuments.us/reader034/viewer/2022042413/5f2cf1bd59015d6dae4551b7/html5/thumbnails/14.jpg)
Inspiration: The Brain
For neural nets, we use a much simpler model neuron, or unit:
Compare with logistic regression: y = σ(w>x + b)
By throwing together lots of these incredibly simplistic neuron-likeprocessing units, we can do some powerful computations!Intro ML (UofT) CSC311-Lec5 14 / 46
![Page 15: CSC 311: Introduction to Machine Learning · cross-entropy as the loss function: L CE(y;t) = XK k=1 t klogy k = t>(logy); where the log is applied elementwise. Just like with logistic](https://reader034.fdocuments.us/reader034/viewer/2022042413/5f2cf1bd59015d6dae4551b7/html5/thumbnails/15.jpg)
Multilayer Perceptrons
We can connect lots ofunits together into adirected acyclicgraph.
Typically, units aregrouped together intolayers.
This gives afeed-forward neuralnetwork. That’s incontrast to recurrentneural networks,which can have cycles.
Intro ML (UofT) CSC311-Lec5 15 / 46
![Page 16: CSC 311: Introduction to Machine Learning · cross-entropy as the loss function: L CE(y;t) = XK k=1 t klogy k = t>(logy); where the log is applied elementwise. Just like with logistic](https://reader034.fdocuments.us/reader034/viewer/2022042413/5f2cf1bd59015d6dae4551b7/html5/thumbnails/16.jpg)
Multilayer Perceptrons
Each hidden layer i connects Ni−1 input units to Ni output units.
In the simplest case, all input units are connected to all output units. We callthis a fully connected layer. We’ll consider other layer types later.
Note: the inputs and outputs for a layer are distinct from the inputs andoutputs to the network.If we need to compute M outputs from Ninputs, we can do so in parallel using matrixmultiplication. This means we’ll be using aM ×N matrix
The output units are a function of the inputunits:
y = f(x) = φ (Wx+ b)
A multilayer network consisting of fullyconnected layers is called a multilayerperceptron. Despite the name, it hasnothing to do with perceptrons!
Intro ML (UofT) CSC311-Lec5 16 / 46
![Page 17: CSC 311: Introduction to Machine Learning · cross-entropy as the loss function: L CE(y;t) = XK k=1 t klogy k = t>(logy); where the log is applied elementwise. Just like with logistic](https://reader034.fdocuments.us/reader034/viewer/2022042413/5f2cf1bd59015d6dae4551b7/html5/thumbnails/17.jpg)
Multilayer Perceptrons
Some activation functions:
Identity
y = z
Rectified LinearUnit
(ReLU)
y = max(0, z)
Soft ReLU
y = log 1 + ez
Intro ML (UofT) CSC311-Lec5 17 / 46
![Page 18: CSC 311: Introduction to Machine Learning · cross-entropy as the loss function: L CE(y;t) = XK k=1 t klogy k = t>(logy); where the log is applied elementwise. Just like with logistic](https://reader034.fdocuments.us/reader034/viewer/2022042413/5f2cf1bd59015d6dae4551b7/html5/thumbnails/18.jpg)
Multilayer Perceptrons
Some activation functions:
Hard Threshold
y =
{1 if z > 00 if z ≤ 0
Logistic
y =1
1 + e−z
Hyperbolic Tangent(tanh)
y =ez − e−z
ez + e−z
Intro ML (UofT) CSC311-Lec5 18 / 46
![Page 19: CSC 311: Introduction to Machine Learning · cross-entropy as the loss function: L CE(y;t) = XK k=1 t klogy k = t>(logy); where the log is applied elementwise. Just like with logistic](https://reader034.fdocuments.us/reader034/viewer/2022042413/5f2cf1bd59015d6dae4551b7/html5/thumbnails/19.jpg)
Multilayer Perceptrons
Each layer computes a function, so the networkcomputes a composition of functions:
h(1) = f (1)(x) = φ(W(1)x + b(1))
h(2) = f (2)(h(1)) = φ(W(2)h(1) + b(2))
...
y = f (L)(h(L−1))
Or more compactly:
y = f (L) ◦ · · · ◦ f (1)(x).
Neural nets provide modularity: we canimplement each layer’s computations as a blackbox.
Intro ML (UofT) CSC311-Lec5 19 / 46
![Page 20: CSC 311: Introduction to Machine Learning · cross-entropy as the loss function: L CE(y;t) = XK k=1 t klogy k = t>(logy); where the log is applied elementwise. Just like with logistic](https://reader034.fdocuments.us/reader034/viewer/2022042413/5f2cf1bd59015d6dae4551b7/html5/thumbnails/20.jpg)
Feature Learning
Last layer:
If task is regression: choosey = f (L)(h(L−1)) = (w(L))Th(L−1) + b(L)
If task is binary classification: choosey = f (L)(h(L−1)) = σ((w(L))Th(L−1) + b(L))Neural nets can be viewed as a way of learning features:
The goal:
Intro ML (UofT) CSC311-Lec5 20 / 46
![Page 21: CSC 311: Introduction to Machine Learning · cross-entropy as the loss function: L CE(y;t) = XK k=1 t klogy k = t>(logy); where the log is applied elementwise. Just like with logistic](https://reader034.fdocuments.us/reader034/viewer/2022042413/5f2cf1bd59015d6dae4551b7/html5/thumbnails/21.jpg)
Feature Learning
Suppose we’re trying to classify images of handwritten digits.Each image is represented as a vector of 28× 28 = 784 pixel values.
Each first-layer hidden unit computes φ(wTi x). It acts as a
feature detector.
We can visualize w by reshaping it into an image. Here’s anexample that responds to a diagonal stroke.
Intro ML (UofT) CSC311-Lec5 21 / 46
![Page 22: CSC 311: Introduction to Machine Learning · cross-entropy as the loss function: L CE(y;t) = XK k=1 t klogy k = t>(logy); where the log is applied elementwise. Just like with logistic](https://reader034.fdocuments.us/reader034/viewer/2022042413/5f2cf1bd59015d6dae4551b7/html5/thumbnails/22.jpg)
Feature Learning
Here are some of the features learned by the first hidden layer of ahandwritten digit classifier:
Intro ML (UofT) CSC311-Lec5 22 / 46
![Page 23: CSC 311: Introduction to Machine Learning · cross-entropy as the loss function: L CE(y;t) = XK k=1 t klogy k = t>(logy); where the log is applied elementwise. Just like with logistic](https://reader034.fdocuments.us/reader034/viewer/2022042413/5f2cf1bd59015d6dae4551b7/html5/thumbnails/23.jpg)
Expressive Power
We’ve seen that there are some functions that linear classifierscan’t represent. Are deep networks any better?
Suppose a layer’s activation function was the identity, so the layerjust computes a affine transformation of the input
I We call this a linear layer
Any sequence of linear layers can be equivalently represented witha single linear layer.
y = W(3)W(2)W(1)︸ ︷︷ ︸,W′
x
I Deep linear networks are no more expressive than linear regression.
Intro ML (UofT) CSC311-Lec5 23 / 46
![Page 24: CSC 311: Introduction to Machine Learning · cross-entropy as the loss function: L CE(y;t) = XK k=1 t klogy k = t>(logy); where the log is applied elementwise. Just like with logistic](https://reader034.fdocuments.us/reader034/viewer/2022042413/5f2cf1bd59015d6dae4551b7/html5/thumbnails/24.jpg)
Expressive Power
Multilayer feed-forward neural nets with nonlinear activationfunctions are universal function approximators: they canapproximate any function arbitrarily well.
This has been shown for various activation functions (thresholds,logistic, ReLU, etc.)
I Even though ReLU is “almost” linear, it’s nonlinear enough.
Intro ML (UofT) CSC311-Lec5 24 / 46
![Page 25: CSC 311: Introduction to Machine Learning · cross-entropy as the loss function: L CE(y;t) = XK k=1 t klogy k = t>(logy); where the log is applied elementwise. Just like with logistic](https://reader034.fdocuments.us/reader034/viewer/2022042413/5f2cf1bd59015d6dae4551b7/html5/thumbnails/25.jpg)
Multilayer Perceptrons
Designing a network to classify XOR:
Assume hard threshold activation function
Intro ML (UofT) CSC311-Lec5 25 / 46
![Page 26: CSC 311: Introduction to Machine Learning · cross-entropy as the loss function: L CE(y;t) = XK k=1 t klogy k = t>(logy); where the log is applied elementwise. Just like with logistic](https://reader034.fdocuments.us/reader034/viewer/2022042413/5f2cf1bd59015d6dae4551b7/html5/thumbnails/26.jpg)
Multilayer Perceptrons
h1 computes I[x1 + x2 − 0.5 > 0]I i.e. x1 OR x2
h2 computes I[x1 + x2 − 1.5 > 0]I i.e. x1 AND x2
y computes I[h1 − h2 − 0.5 > 0] ≡ I[h1 + (1− h2)− 1.5 > 0]I i.e. h1 AND (NOT h2) = x1 XOR x2
Intro ML (UofT) CSC311-Lec5 26 / 46
![Page 27: CSC 311: Introduction to Machine Learning · cross-entropy as the loss function: L CE(y;t) = XK k=1 t klogy k = t>(logy); where the log is applied elementwise. Just like with logistic](https://reader034.fdocuments.us/reader034/viewer/2022042413/5f2cf1bd59015d6dae4551b7/html5/thumbnails/27.jpg)
Expressive Power
Universality for binary inputs and targets:
Hard threshold hidden units, linear output
Strategy: 2D hidden units, each of which responds to oneparticular input configuration
Only requires one hidden layer, though it needs to be extremelywide.Intro ML (UofT) CSC311-Lec5 27 / 46
![Page 28: CSC 311: Introduction to Machine Learning · cross-entropy as the loss function: L CE(y;t) = XK k=1 t klogy k = t>(logy); where the log is applied elementwise. Just like with logistic](https://reader034.fdocuments.us/reader034/viewer/2022042413/5f2cf1bd59015d6dae4551b7/html5/thumbnails/28.jpg)
Expressive Power
What about the logistic activation function?
You can approximate a hard threshold by scaling up the weightsand biases:
y = σ(x) y = σ(5x)
This is good: logistic units are differentiable, so we can train themwith gradient descent.
Intro ML (UofT) CSC311-Lec5 28 / 46
![Page 29: CSC 311: Introduction to Machine Learning · cross-entropy as the loss function: L CE(y;t) = XK k=1 t klogy k = t>(logy); where the log is applied elementwise. Just like with logistic](https://reader034.fdocuments.us/reader034/viewer/2022042413/5f2cf1bd59015d6dae4551b7/html5/thumbnails/29.jpg)
Expressive Power
Limits of universality
I You may need to represent an exponentially large network.
I How can you find the appropriate weights to represent a givenfunction?
I If you can learn any function, you’ll just overfit.
I We desire a compact representation.
Intro ML (UofT) CSC311-Lec5 29 / 46
![Page 30: CSC 311: Introduction to Machine Learning · cross-entropy as the loss function: L CE(y;t) = XK k=1 t klogy k = t>(logy); where the log is applied elementwise. Just like with logistic](https://reader034.fdocuments.us/reader034/viewer/2022042413/5f2cf1bd59015d6dae4551b7/html5/thumbnails/30.jpg)
Training Neural Networks withBackpropagation
Intro ML (UofT) CSC311-Lec5 30 / 46
![Page 31: CSC 311: Introduction to Machine Learning · cross-entropy as the loss function: L CE(y;t) = XK k=1 t klogy k = t>(logy); where the log is applied elementwise. Just like with logistic](https://reader034.fdocuments.us/reader034/viewer/2022042413/5f2cf1bd59015d6dae4551b7/html5/thumbnails/31.jpg)
Recap: Gradient Descent
Recall: gradient descent moves opposite the gradient (the direction ofsteepest descent)
Weight space for a multilayer neural net: one coordinate for each weightor bias of the network, in all the layers
Conceptually, not any different from what we’ve seen so far — justhigher dimensional and harder to visualize!
We want to define a loss L and compute the gradient of the cost dJ /dw,which is the vector of partial derivatives.
I This is the average of dL/dw over all the training examples, so inthis lecture we focus on computing dL/dw.
Intro ML (UofT) CSC311-Lec5 31 / 46
![Page 32: CSC 311: Introduction to Machine Learning · cross-entropy as the loss function: L CE(y;t) = XK k=1 t klogy k = t>(logy); where the log is applied elementwise. Just like with logistic](https://reader034.fdocuments.us/reader034/viewer/2022042413/5f2cf1bd59015d6dae4551b7/html5/thumbnails/32.jpg)
Univariate Chain Rule
We’ve already been using the univariate Chain Rule.
Recall: if f(x) and x(t) are univariate functions, then
d
dtf(x(t)) =
df
dx
dx
dt.
Intro ML (UofT) CSC311-Lec5 32 / 46
![Page 33: CSC 311: Introduction to Machine Learning · cross-entropy as the loss function: L CE(y;t) = XK k=1 t klogy k = t>(logy); where the log is applied elementwise. Just like with logistic](https://reader034.fdocuments.us/reader034/viewer/2022042413/5f2cf1bd59015d6dae4551b7/html5/thumbnails/33.jpg)
Univariate Chain Rule
Recall: Univariate logistic least squares model
z = wx+ b
y = σ(z)
L =1
2(y − t)2
Let’s compute the loss derivatives ∂L∂w ,
∂L∂b
Intro ML (UofT) CSC311-Lec5 33 / 46
![Page 34: CSC 311: Introduction to Machine Learning · cross-entropy as the loss function: L CE(y;t) = XK k=1 t klogy k = t>(logy); where the log is applied elementwise. Just like with logistic](https://reader034.fdocuments.us/reader034/viewer/2022042413/5f2cf1bd59015d6dae4551b7/html5/thumbnails/34.jpg)
Univariate Chain Rule
How you would have done it in calculus class
L =1
2(σ(wx+ b)− t)2
∂L∂w
=∂
∂w
[1
2(σ(wx+ b)− t)2
]=
1
2
∂
∂w(σ(wx+ b)− t)2
= (σ(wx+ b)− t)∂
∂w(σ(wx+ b)− t)
= (σ(wx+ b)− t)σ′(wx+ b)∂
∂w(wx+ b)
= (σ(wx+ b)− t)σ′(wx+ b)x
∂L∂b
=∂
∂b
[1
2(σ(wx+ b)− t)2
]=
1
2
∂
∂b(σ(wx+ b)− t)2
= (σ(wx+ b)− t)∂
∂b(σ(wx+ b)− t)
= (σ(wx+ b)− t)σ′(wx+ b)∂
∂b(wx+ b)
= (σ(wx+ b)− t)σ′(wx+ b)
What are the disadvantages of this approach?
Intro ML (UofT) CSC311-Lec5 34 / 46
![Page 35: CSC 311: Introduction to Machine Learning · cross-entropy as the loss function: L CE(y;t) = XK k=1 t klogy k = t>(logy); where the log is applied elementwise. Just like with logistic](https://reader034.fdocuments.us/reader034/viewer/2022042413/5f2cf1bd59015d6dae4551b7/html5/thumbnails/35.jpg)
Univariate Chain Rule
A more structured way to do it
Computing the loss:
z = wx+ b
y = σ(z)
L =1
2(y − t)2
Computing the derivatives:
dLdy
= y − t
dLdz
=dLdy
dy
dz=
dLdy
σ′(z)
∂L∂w
=dLdz
dz
dw=
dLdz
x
∂L∂b
=dLdz
dz
db=
dLdz
Remember, the goal isn’t to obtain closed-form solutions, but to beable to write a program that efficiently computes the derivatives.
Intro ML (UofT) CSC311-Lec5 35 / 46
![Page 36: CSC 311: Introduction to Machine Learning · cross-entropy as the loss function: L CE(y;t) = XK k=1 t klogy k = t>(logy); where the log is applied elementwise. Just like with logistic](https://reader034.fdocuments.us/reader034/viewer/2022042413/5f2cf1bd59015d6dae4551b7/html5/thumbnails/36.jpg)
Univariate Chain Rule
We can diagram out the computations using a computationgraph.The nodes represent all the inputs and computed quantities, andthe edges represent which nodes are computed directly as afunction of which other nodes.
Computing the loss:
z = wx+ b
y = σ(z)
L =1
2(y − t)2
Intro ML (UofT) CSC311-Lec5 36 / 46
![Page 37: CSC 311: Introduction to Machine Learning · cross-entropy as the loss function: L CE(y;t) = XK k=1 t klogy k = t>(logy); where the log is applied elementwise. Just like with logistic](https://reader034.fdocuments.us/reader034/viewer/2022042413/5f2cf1bd59015d6dae4551b7/html5/thumbnails/37.jpg)
Univariate Chain Rule
A slightly more convenient notation:
Use y to denote the derivative dL/dy, sometimes called the error signal.
This emphasizes that the error signals are just values our program iscomputing (rather than a mathematical operation).
Computing the loss:
z = wx+ b
y = σ(z)
L =1
2(y − t)2
Computing the derivatives:
y = y − tz = y σ′(z)
w = z x
b = z
Intro ML (UofT) CSC311-Lec5 37 / 46
![Page 38: CSC 311: Introduction to Machine Learning · cross-entropy as the loss function: L CE(y;t) = XK k=1 t klogy k = t>(logy); where the log is applied elementwise. Just like with logistic](https://reader034.fdocuments.us/reader034/viewer/2022042413/5f2cf1bd59015d6dae4551b7/html5/thumbnails/38.jpg)
Multivariate Chain Rule
Problem: what if the computation graph has fan-out > 1?This requires the multivariate Chain Rule!
L2-Regularized regression
z = wx+ b
y = σ(z)
L =1
2(y − t)2
R =1
2w2
Lreg = L+ λR
Softmax regression
z` =∑j
w`jxj + b`
yk =ezk∑` e
z`
L = −∑k
tk log yk
Intro ML (UofT) CSC311-Lec5 38 / 46
![Page 39: CSC 311: Introduction to Machine Learning · cross-entropy as the loss function: L CE(y;t) = XK k=1 t klogy k = t>(logy); where the log is applied elementwise. Just like with logistic](https://reader034.fdocuments.us/reader034/viewer/2022042413/5f2cf1bd59015d6dae4551b7/html5/thumbnails/39.jpg)
Multivariate Chain Rule
Suppose we have a function f(x, y) and functions x(t) and y(t).(All the variables here are scalar-valued.) Then
d
dtf(x(t), y(t)) =
∂f
∂x
dx
dt+∂f
∂y
dy
dt
Example:
f(x, y) = y + exy
x(t) = cos t
y(t) = t2
Plug in to Chain Rule:
df
dt=∂f
∂x
dx
dt+∂f
∂y
dy
dt
= (yexy) · (− sin t) + (1 + xexy) · 2t
Intro ML (UofT) CSC311-Lec5 39 / 46
![Page 40: CSC 311: Introduction to Machine Learning · cross-entropy as the loss function: L CE(y;t) = XK k=1 t klogy k = t>(logy); where the log is applied elementwise. Just like with logistic](https://reader034.fdocuments.us/reader034/viewer/2022042413/5f2cf1bd59015d6dae4551b7/html5/thumbnails/40.jpg)
Multivariable Chain Rule
In the context of backpropagation:
In our notation:
t = xdx
dt+ y
dy
dt
Intro ML (UofT) CSC311-Lec5 40 / 46
![Page 41: CSC 311: Introduction to Machine Learning · cross-entropy as the loss function: L CE(y;t) = XK k=1 t klogy k = t>(logy); where the log is applied elementwise. Just like with logistic](https://reader034.fdocuments.us/reader034/viewer/2022042413/5f2cf1bd59015d6dae4551b7/html5/thumbnails/41.jpg)
Backpropagation
Full backpropagation algorithm:Let v1, . . . , vN be a topological ordering of the computation graph(i.e. parents come before children.)
vN denotes the variable we’re trying to compute derivatives of (e.g. loss).
Intro ML (UofT) CSC311-Lec5 41 / 46
![Page 42: CSC 311: Introduction to Machine Learning · cross-entropy as the loss function: L CE(y;t) = XK k=1 t klogy k = t>(logy); where the log is applied elementwise. Just like with logistic](https://reader034.fdocuments.us/reader034/viewer/2022042413/5f2cf1bd59015d6dae4551b7/html5/thumbnails/42.jpg)
Backpropagation
Example: univariate logistic least squares regression
Forward pass:
z = wx+ b
y = σ(z)
L =1
2(y − t)2
R =1
2w2
Lreg = L+ λR
Backward pass:
Lreg = 1
R = LregdLreg
dR= Lreg λ
L = LregdLreg
dL= Lreg
y = L dLdy
= L (y − t)
z = ydy
dz
= y σ′(z)
w= z∂z
∂w+RdR
dw
= z x+Rw
b = z∂z
∂b
= z
Intro ML (UofT) CSC311-Lec5 42 / 46
![Page 43: CSC 311: Introduction to Machine Learning · cross-entropy as the loss function: L CE(y;t) = XK k=1 t klogy k = t>(logy); where the log is applied elementwise. Just like with logistic](https://reader034.fdocuments.us/reader034/viewer/2022042413/5f2cf1bd59015d6dae4551b7/html5/thumbnails/43.jpg)
Backpropagation
Multilayer Perceptron (multiple outputs):
Forward pass:
zi =∑j
w(1)ij xj + b
(1)i
hi = σ(zi)
yk =∑i
w(2)ki hi + b
(2)k
L =1
2
∑k
(yk − tk)2
Backward pass:
L = 1
yk = L (yk − tk)
w(2)ki = yk hi
b(2)k = yk
hi =∑k
ykw(2)ki
zi = hi σ′(zi)
w(1)ij = zi xj
b(1)i = zi
Intro ML (UofT) CSC311-Lec5 43 / 46
![Page 44: CSC 311: Introduction to Machine Learning · cross-entropy as the loss function: L CE(y;t) = XK k=1 t klogy k = t>(logy); where the log is applied elementwise. Just like with logistic](https://reader034.fdocuments.us/reader034/viewer/2022042413/5f2cf1bd59015d6dae4551b7/html5/thumbnails/44.jpg)
Backpropagation
In vectorized form:
Forward pass:
z = W(1)x + b(1)
h = σ(z)
y = W(2)h + b(2)
L =1
2‖y − t‖2
Backward pass:
L = 1
y = L (y − t)
W(2) = yh>
b(2) = y
h = W(2)>y
z = h ◦ σ′(z)
W(1) = zx>
b(1) = z
Intro ML (UofT) CSC311-Lec5 44 / 46
![Page 45: CSC 311: Introduction to Machine Learning · cross-entropy as the loss function: L CE(y;t) = XK k=1 t klogy k = t>(logy); where the log is applied elementwise. Just like with logistic](https://reader034.fdocuments.us/reader034/viewer/2022042413/5f2cf1bd59015d6dae4551b7/html5/thumbnails/45.jpg)
Computational Cost
Computational cost of forward pass: one add-multiplyoperation per weight
zi =∑j
w(1)ij xj + b
(1)i
Computational cost of backward pass: two add-multiplyoperations per weight
w(2)ki = yk hi
hi =∑k
ykw(2)ki
Rule of thumb: the backward pass is about as expensive as twoforward passes.
For a multilayer perceptron, this means the cost is linear in thenumber of layers, quadratic in the number of units per layer.
Intro ML (UofT) CSC311-Lec5 45 / 46
![Page 46: CSC 311: Introduction to Machine Learning · cross-entropy as the loss function: L CE(y;t) = XK k=1 t klogy k = t>(logy); where the log is applied elementwise. Just like with logistic](https://reader034.fdocuments.us/reader034/viewer/2022042413/5f2cf1bd59015d6dae4551b7/html5/thumbnails/46.jpg)
Backpropagation
Backprop is used to train the overwhelming majority of neural netstoday.
I Even optimization algorithms much fancier than gradient descent(e.g. second-order methods) use backprop to compute the gradients.
Despite its practical success, backprop is believed to be neurallyimplausible.
Intro ML (UofT) CSC311-Lec5 46 / 46