Posted on 15-Jan-2017
Link to these slides: https://goo.gl/tKOvr7
YO! I am Sam Abrahams
I am a data scientist and engineer. You can find me on GitHub: @samjabrahams
Buy my book: TensorFlow for Machine Intelligence
Gradient Descent Outline
▣ Problem: fit data
▣ Basic OLS linear regression
▣ Visualize error curve and regression line
▣ Step by step through changes
Simple Start: Linear Regression
▣ Want to find a model that can fit our data
▣ Could do it algebraically…
▣ BUT that doesn’t generalize well
Simple Start: Linear Regression
▣ Step back: what does ordinary linear regression try to do?
▣ Minimize the sum (or average) of the squared errors
▣ How else could we minimize?
Gradient Descent
▣ Start with a random guess
▣ Use the derivative (the gradient, when dealing with multiple variables) to get the slope of the error curve
▣ Adjust our parameters to move down the error curve (sketch below)
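A minimal NumPy sketch of this idea applied to ordinary least-squares linear regression. The toy data, learning rate, and step count are illustrative assumptions, not from the slides:

```python
import numpy as np

# Toy data: y is roughly 3x + 2, plus noise (made-up example)
x = np.random.rand(100)
y = 3 * x + 2 + 0.1 * np.random.randn(100)

w, b = np.random.randn(), np.random.randn()   # start with a random guess
learning_rate = 0.1

for step in range(1000):
    y_hat = w * x + b
    error = y_hat - y
    # Slope of the mean-squared-error curve with respect to each parameter
    grad_w = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)
    # Move the parameters down the error curve
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
```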
Gradient Descent Variants
▣ There are additional techniques that can help speed up (or otherwise improve) gradient descent
▣ The next slides describe some of these!
▣ More details (and some awesome visuals) here: article by Sebastian Ruder
Gradient Descent
▣ Get the true gradient with respect to all examples
▣ One step = one epoch
▣ Slow and generally infeasible for large training sets
Stochastic Gradient Descent
▣ Basic idea: approximate the derivative using only one example
▣ “Online learning”
▣ Update the weights after each example (sketch below)
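A hedged sketch of the same linear-regression fit, this time updating after every single example. The toy data and learning rate are illustrative assumptions:

```python
import numpy as np

x = np.random.rand(100)
y = 3 * x + 2 + 0.1 * np.random.randn(100)
w, b, learning_rate = 0.0, 0.0, 0.05

for epoch in range(20):
    for i in np.random.permutation(len(x)):      # visit examples one at a time, in random order
        error = (w * x[i] + b) - y[i]
        w -= learning_rate * 2 * error * x[i]    # gradient approximated from a single example
        b -= learning_rate * 2 * error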
Mini-Batch Gradient Descent
▣ Similar idea to stochastic gradient descent
▣ Approximate the derivative with a sampled batch of examples
▣ Middle ground between “true” stochastic gradient descent and full gradient descent (sketch below)
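A sketch of the same fit using mini-batches; the batch size of 16 and the other constants are illustrative assumptions:

```python
import numpy as np

x = np.random.rand(100)
y = 3 * x + 2 + 0.1 * np.random.randn(100)
w, b, learning_rate, batch_size = 0.0, 0.0, 0.05, 16

for epoch in range(100):
    perm = np.random.permutation(len(x))
    for start in range(0, len(x), batch_size):
        idx = perm[start:start + batch_size]           # a small random batch of examples
        error = (w * x[idx] + b) - y[idx]
        w -= learning_rate * 2 * np.mean(error * x[idx])
        b -= learning_rate * 2 * np.mean(error)
```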
Momentum
▣ Idea: if we see multiple gradients in a row pointing in the same direction, we should take bigger steps in that direction
▣ Accumulate a “momentum” vector to speed up descent (sketch below)
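A minimal sketch of the classical momentum update. The toy loss (w - 3)², learning rate, and momentum coefficient mu are assumptions for illustration:

```python
grad = lambda w: 2 * (w - 3.0)      # toy gradient: loss is (w - 3)^2, minimum at w = 3 (assumption)
w, velocity = 0.0, 0.0
learning_rate, mu = 0.1, 0.9        # mu is the momentum coefficient

for step in range(100):
    velocity = mu * velocity - learning_rate * grad(w)   # accumulate a "momentum" vector
    w += velocity                                        # step in the accumulated direction
```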
Nesterov Momentum
▣ Idea: before updating our weights, look ahead to where we have accumulated momentum
▣ Adjust our update based on that “future” position (sketch after the figure below)
Nesterov Momentum
[Figure: comparison of standard momentum steps vs. Nesterov steps, showing the momentum vector and the gradient/correction at each update. Source: lecture by Geoffrey Hinton]
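A sketch of the Nesterov variant under the same toy assumptions as the momentum sketch above; the only change is that the gradient is evaluated at the look-ahead point rather than at the current weights:

```python
grad = lambda w: 2 * (w - 3.0)      # same toy gradient as above (assumption)
w, velocity = 0.0, 0.0
learning_rate, mu = 0.1, 0.9

for step in range(100):
    lookahead = w + mu * velocity                               # peek at where momentum is taking us
    velocity = mu * velocity - learning_rate * grad(lookahead)  # gradient at the "future" point
    w += velocity
```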
AdaGrad
▣ Idea: update individual weights differently depending on how frequently they change
▣ Keeps a running tally of the squared gradients for each weight, and divides new updates by the square root of that tally (sketch below)
▣ Downside: for long-running training, the tally only grows, so updates eventually diminish toward zero
▣ Paper on jmlr.org
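A minimal sketch of the AdaGrad per-weight scaling under the same toy assumptions, shown here for a single scalar weight:

```python
grad = lambda w: 2 * (w - 3.0)        # toy gradient: loss is (w - 3)^2 (assumption)
w, learning_rate = 0.0, 0.5
grad_sq_sum = 0.0                     # running tally of squared gradients for this weight

for step in range(100):
    g = grad(w)
    grad_sq_sum += g ** 2                                  # the tally only ever grows...
    w -= learning_rate * g / (grad_sq_sum ** 0.5 + 1e-8)   # ...so the effective step keeps shrinking
```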
AdaDelta / RMSProp
▣ Two slightly different algorithms with the same concept: only keep a window of the previous n gradients when scaling updates
▣ Both seek to reduce AdaGrad’s diminishing-update problem (sketch below)
▣ AdaDelta Paper on arxiv.org
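A sketch of the RMSProp-style update under the same toy assumptions; the decay factor 0.9 is illustrative. AdaDelta adds a further rescaling term that is not shown here:

```python
grad = lambda w: 2 * (w - 3.0)        # toy gradient (assumption)
w, learning_rate, decay = 0.0, 0.1, 0.9
avg_sq = 0.0                          # exponentially decayed average of squared gradients

for step in range(100):
    g = grad(w)
    avg_sq = decay * avg_sq + (1 - decay) * g ** 2   # old gradients fade out of the "window"
    w -= learning_rate * g / (avg_sq ** 0.5 + 1e-8)
```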
Adam
▣ Adam expands on the concepts introduced with AdaDelta and RMSProp
▣ Uses both first-order and second-order moments of the gradient, decayed over time (sketch below)
▣ Paper on arxiv.org
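A sketch of the Adam update under the same toy assumptions; beta1, beta2, and eps are the usual defaults suggested in the paper:

```python
grad = lambda w: 2 * (w - 3.0)        # toy gradient (assumption)
w, learning_rate = 0.0, 0.1
beta1, beta2, eps = 0.9, 0.999, 1e-8
m, v = 0.0, 0.0                       # first- and second-moment estimates

for t in range(1, 101):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g            # decayed first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * g ** 2       # decayed second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)               # bias correction for the early steps
    v_hat = v / (1 - beta2 ** t)
    w -= learning_rate * m_hat / (v_hat ** 0.5 + eps)
```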
Beyond OLS Regression
▣ Can’t do everything with linear regression!
▣ Nor polynomial…
▣ Why can’t we let the computer figure out how to model the data?
Neural Networks: Idea
▣ Chain together non-linear functions
▣ Have lots of parameters that can be adjusted
▣ These “weights” determine the model function
Feed forward neural network
[Diagram: a feed-forward network with layers l(1) input, l(2) hidden 1, l(3) hidden 2, and l(4) output. Inputs x1, x2 plus a bias unit feed two hidden layers of sigmoid (σ) units (each with its own bias unit), which feed a softmax (SM) output layer producing ŷ. Weight matrices W(1), W(2), W(3) connect the layers, producing activations a(2), a(3), a(4).]

xi: input value
ŷ: output vector
+1: bias (constant) unit
a(l): activation vector for layer l
W(l): weight matrix for layer l
z(l): input into layer l
σ: sigmoid (logistic) function
SM: Softmax function
[The following slides repeat the same network diagram, highlighting one piece at a time:]
▣ Layer 1, Layer 2, Layer 3, Layer 4
▣ Biases (constant units)
▣ Input
▣ Weight matrices
▣ Layer inputs: z(l) = W(l-1)a(l-1) + b(l-1)
▣ Activation vectors
▣ Sigmoid activation function
▣ Softmax activation function
▣ Output
Forward Propagation
The input is multiplied by the W(1) weight matrix, and the layer 1 biases are added, to calculate z(2):
z(2) = W(1)x + b(1)
Forward Propagation
The activation value for the second layer is calculated by passing z(2) into some function. In this case, the sigmoid function:
a(2) = σ(z(2))
Forward Propagation
z(3) is calculated by multiplying the a(2) vector by the W(2) weight matrix and adding the layer 2 biases:
z(3) = W(2)a(2) + b(2)
Forward Propagation
Similar to the previous layer, a(3) is calculated by passing z(3) into the sigmoid function:
a(3) = σ(z(3))
Forward Propagation
z(4) is calculated by multiplying the a(3) vector by the W(3) weight matrix and adding the layer 3 biases:
z(4) = W(3)a(3) + b(3)
Forward Propagation
For the final layer, we calculate a(4) by passing z(4) into the Softmax function:
a(4) = SM(z(4))
Forward Propagation
We then make our prediction based on the final layer’s output.
Page of Math
z(2) = W(1)x + b(1)
z(3) = W(2)a(2) + b(2)
z(4) = W(3)a(3) + b(3)
a(2) = σ(z(2))
a(3) = σ(z(3))
a(4) = ŷ = SM(z(4))
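The same page of math as a runnable NumPy sketch; the layer sizes, random weights, and example input are illustrative assumptions, not from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))          # shift for numerical stability
    return e / np.sum(e)

# Toy shapes: 2 inputs, two hidden layers of 3 units, 3 output classes (assumption)
W1, b1 = np.random.randn(3, 2), np.zeros(3)
W2, b2 = np.random.randn(3, 3), np.zeros(3)
W3, b3 = np.random.randn(3, 3), np.zeros(3)

x = np.array([0.5, -1.2])              # a single input example

z2 = W1 @ x + b1                       # z(2) = W(1)x + b(1)
a2 = sigmoid(z2)                       # a(2) = σ(z(2))
z3 = W2 @ a2 + b2                      # z(3) = W(2)a(2) + b(2)
a3 = sigmoid(z3)                       # a(3) = σ(z(3))
z4 = W3 @ a3 + b3                      # z(4) = W(3)a(3) + b(3)
y_hat = softmax(z4)                    # a(4) = ŷ = SM(z(4))
```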
Goal: Find which direction to shift weights
How: Find partial derivatives of the cost with respect to weight matrices
How (again): Chain rule the sh*t out of this mofo
DEEPER
NOTE: “Cancelling out” isn’t how the math actually works. But it’s a handy way to think about it.
Return of Page of Math
z(2) = W(1)x + b(1)
z(3) = W(2)a(2) + b(2)
z(4) = W(3)a(3) + b(3)
a(2) = σ(z(2))
a(3) = σ(z(3))
a(4) = ŷ = SM(z(4))
Back Propagation
[The next slides reuse the network diagram, now ending in a cost node, with the legend as before. For each weight matrix W(l), we want the partial derivative of the cost with respect to W(l), built by chaining local derivatives backwards from the cost toward the input.]
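Continuing the forward-pass sketch from the "Page of Math" section above, a hedged sketch of the backward pass. The slides don't specify a cost function, so a softmax/cross-entropy cost with a one-hot target y_true is assumed here, which makes the output-layer error simply ŷ - y:

```python
# Assumes W1..W3, b1..b3, x, a2, a3, y_hat from the forward-pass sketch above.
y_true = np.array([0.0, 1.0, 0.0])     # one-hot target (assumption)

# With softmax output + cross-entropy cost, the output-layer error simplifies nicely
delta4 = y_hat - y_true                          # ∂cost/∂z(4)
grad_W3 = np.outer(delta4, a3)                   # ∂cost/∂W(3)
grad_b3 = delta4

delta3 = (W3.T @ delta4) * a3 * (1 - a3)         # chain back through σ: ∂cost/∂z(3)
grad_W2 = np.outer(delta3, a2)
grad_b2 = delta3

delta2 = (W2.T @ delta3) * a2 * (1 - a2)         # ∂cost/∂z(2)
grad_W1 = np.outer(delta2, x)
grad_b1 = delta2

# One gradient descent step on every weight matrix and bias
learning_rate = 0.1
W3 -= learning_rate * grad_W3; b3 -= learning_rate * grad_b3
W2 -= learning_rate * grad_W2; b2 -= learning_rate * grad_b2
W1 -= learning_rate * grad_W1; b1 -= learning_rate * grad_b1
```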
Auto-Differentiation: Idea
▣ Use functions that have easy-to-compute derivatives
▣ Compose these functions to create a more complex “super-model”
▣ Use the chain rule to get partial derivatives of the model
What makes a “good” function?
▣ Obvious stuff: differentiable (continuously and smoothly!)
▣ Simple operations: add, subtract, multiply
▣ Reuse previous computation (sketch below)
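A tiny hand-rolled illustration of the idea, using scalars only (the function and values are made up, not from the slides): compose simple operations, store the intermediate value from the forward pass, and chain the easy local derivatives together:

```python
# f(x) = (x * w + b)^2, all scalars (toy example)
x, w, b = 2.0, 3.0, 1.0

# Forward pass: store the intermediate value for reuse in the backward pass
u = x * w + b          # u = xw + b
f = u ** 2             # f = u^2

# Backward pass: easy local derivatives, multiplied together by the chain rule
df_du = 2 * u          # d(u^2)/du
du_dw = x              # d(xw + b)/dw
du_db = 1.0            # d(xw + b)/db
df_dw = df_du * du_dw  # ∂f/∂w
df_db = df_du * du_db  # ∂f/∂b
```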
Store activation values for backprop
[Network diagram, as before: the activation vectors a(2), a(3), a(4) computed in the forward pass are kept for reuse in the backward pass.]
Chain rule takes care of the rest
[Network diagram, as before.]
It’s Over! Any questions?
Email: sam@memdump.io
GitHub: samjabrahams
Twitter: @sabraha
Presentation template by SlidesCarnival
Neural Network terms
▣ Neuron: a unit that transforms input via an activation function and outputs the result to other neurons and/or the final result
▣ Activation function: the transformation applied to produce a(l); typically non-linear (e.g. sigmoid, ReLU)
▣ Bias unit: a trainable scalar shift, typically applied to each non-output layer (think of the y-intercept term in the linear function)
▣ Layer: a grouping of “neurons” and biases that (in general) take in values from the same previous neurons and pass values forward to the same targets
▣ Hidden layer: a layer that is neither the input layer nor the output layer
▣ Input layer: the layer that receives the raw input values
▣ Output layer: the layer that produces the final output, ŷ