Posted on 15-Jan-2017
Link to these slides: https://goo.gl/tKOvr7
YO! I am Sam Abrahams
I am a data scientist and engineer. You can find me on GitHub: @samjabrahams
Buy my book: TensorFlow for Machine Intelligence
Gradient Descent Outline
▣ Problem: fit data
▣ Basic OLS linear regression
▣ Visualize error curve and regression line
▣ Step by step through changes
Simple Start: Linear Regression
▣ Want to find a model that can fit our data
▣ Could do it algebraically…
▣ BUT that doesn’t generalize well
Simple Start: Linear Regression
▣ Step back: what does ordinary linear regression try to do?
▣ Minimize the sum (or average) of the squared errors
▣ How else could we minimize?
Gradient Descent
▣ Start with a random guess
▣ Use the derivative (the gradient, when dealing with multiple variables) to get the slope of the error curve
▣ Adjust our parameters to move down the error curve (sketch below)
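A minimal NumPy sketch of this idea applied to ordinary least-squares linear regression. The toy data, learning rate, and step count are illustrative assumptions, not from the slides:

```python
import numpy as np

# Toy data: y is roughly 3x + 2, plus noise (made-up example)
x = np.random.rand(100)
y = 3 * x + 2 + 0.1 * np.random.randn(100)

w, b = np.random.randn(), np.random.randn()   # start with a random guess
learning_rate = 0.1

for step in range(1000):
    y_hat = w * x + b
    error = y_hat - y
    # Slope of the mean-squared-error curve with respect to each parameter
    grad_w = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)
    # Move the parameters down the error curve
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
```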
Gradient Descent Variants
▣ There are additional techniques that can help speed up (or otherwise improve) gradient descent
▣ The next slides describe some of these!
▣ More details (and some awesome visuals) here: article by Sebastian Ruder
Gradient Descent
▣ Get the true gradient with respect to all examples
▣ One step = one epoch
▣ Slow and generally infeasible for large training sets
Stochastic Gradient Descent
▣ Basic idea: approximate the derivative using only one example
▣ “Online learning”
▣ Update the weights after each example (sketch below)
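A hedged sketch of the same linear-regression fit, this time updating after every single example. The toy data and learning rate are illustrative assumptions:

```python
import numpy as np

x = np.random.rand(100)
y = 3 * x + 2 + 0.1 * np.random.randn(100)
w, b, learning_rate = 0.0, 0.0, 0.05

for epoch in range(20):
    for i in np.random.permutation(len(x)):      # visit examples one at a time, in random order
        error = (w * x[i] + b) - y[i]
        w -= learning_rate * 2 * error * x[i]    # gradient approximated from a single example
        b -= learning_rate * 2 * error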
Mini-Batch Gradient Descent
▣ Similar idea to stochastic gradient descent
▣ Approximate the derivative with a sampled batch of examples
▣ Middle ground between “true” stochastic gradient descent and full gradient descent (sketch below)
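A sketch of the same fit using mini-batches; the batch size of 16 and the other constants are illustrative assumptions:

```python
import numpy as np

x = np.random.rand(100)
y = 3 * x + 2 + 0.1 * np.random.randn(100)
w, b, learning_rate, batch_size = 0.0, 0.0, 0.05, 16

for epoch in range(100):
    perm = np.random.permutation(len(x))
    for start in range(0, len(x), batch_size):
        idx = perm[start:start + batch_size]           # a small random batch of examples
        error = (w * x[idx] + b) - y[idx]
        w -= learning_rate * 2 * np.mean(error * x[idx])
        b -= learning_rate * 2 * np.mean(error)
```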
Momentum
▣ Idea: if we see multiple gradients in a row pointing in the same direction, we should take bigger steps in that direction
▣ Accumulate a “momentum” vector to speed up descent (sketch below)
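A minimal sketch of the classical momentum update. The toy loss (w - 3)², learning rate, and momentum coefficient mu are assumptions for illustration:

```python
grad = lambda w: 2 * (w - 3.0)      # toy gradient: loss is (w - 3)^2, minimum at w = 3 (assumption)
w, velocity = 0.0, 0.0
learning_rate, mu = 0.1, 0.9        # mu is the momentum coefficient

for step in range(100):
    velocity = mu * velocity - learning_rate * grad(w)   # accumulate a "momentum" vector
    w += velocity                                        # step in the accumulated direction
```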
Nesterov Momentum
▣ Idea: before updating our weights, look ahead to where we have accumulated momentum
▣ Adjust our update based on that “future” position (sketch after the figure below)
Nesterov Momentum
[Figure: comparison of standard momentum steps vs. Nesterov steps, showing the momentum vector and the gradient/correction at each update. Source: lecture by Geoffrey Hinton]
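A sketch of the Nesterov variant under the same toy assumptions as the momentum sketch above; the only change is that the gradient is evaluated at the look-ahead point rather than at the current weights:

```python
grad = lambda w: 2 * (w - 3.0)      # same toy gradient as above (assumption)
w, velocity = 0.0, 0.0
learning_rate, mu = 0.1, 0.9

for step in range(100):
    lookahead = w + mu * velocity                               # peek at where momentum is taking us
    velocity = mu * velocity - learning_rate * grad(lookahead)  # gradient at the "future" point
    w += velocity
```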
AdaGrad
▣ Idea: update individual weights differently depending on how frequently they change
▣ Keeps a running tally of the squared gradients for each weight, and divides new updates by the square root of that tally (sketch below)
▣ Downside: for long-running training, the tally only grows, so updates eventually diminish toward zero
▣ Paper on jmlr.org
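A minimal sketch of the AdaGrad per-weight scaling under the same toy assumptions, shown here for a single scalar weight:

```python
grad = lambda w: 2 * (w - 3.0)        # toy gradient: loss is (w - 3)^2 (assumption)
w, learning_rate = 0.0, 0.5
grad_sq_sum = 0.0                     # running tally of squared gradients for this weight

for step in range(100):
    g = grad(w)
    grad_sq_sum += g ** 2                                  # the tally only ever grows...
    w -= learning_rate * g / (grad_sq_sum ** 0.5 + 1e-8)   # ...so the effective step keeps shrinking
```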
AdaDelta / RMSProp
▣ Two slightly different algorithms with the same concept: only keep a window of the previous n gradients when scaling updates
▣ Both seek to reduce AdaGrad’s diminishing-update problem (sketch below)
▣ AdaDelta Paper on arxiv.org
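A sketch of the RMSProp-style update under the same toy assumptions; the decay factor 0.9 is illustrative. AdaDelta adds a further rescaling term that is not shown here:

```python
grad = lambda w: 2 * (w - 3.0)        # toy gradient (assumption)
w, learning_rate, decay = 0.0, 0.1, 0.9
avg_sq = 0.0                          # exponentially decayed average of squared gradients

for step in range(100):
    g = grad(w)
    avg_sq = decay * avg_sq + (1 - decay) * g ** 2   # old gradients fade out of the "window"
    w -= learning_rate * g / (avg_sq ** 0.5 + 1e-8)
```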
Adam
▣ Adam expands on the concepts introduced with AdaDelta and RMSProp
▣ Uses both first-order and second-order moments of the gradient, decayed over time (sketch below)
▣ Paper on arxiv.org
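A sketch of the Adam update under the same toy assumptions; beta1, beta2, and eps are the usual defaults suggested in the paper:

```python
grad = lambda w: 2 * (w - 3.0)        # toy gradient (assumption)
w, learning_rate = 0.0, 0.1
beta1, beta2, eps = 0.9, 0.999, 1e-8
m, v = 0.0, 0.0                       # first- and second-moment estimates

for t in range(1, 101):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g            # decayed first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * g ** 2       # decayed second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)               # bias correction for the early steps
    v_hat = v / (1 - beta2 ** t)
    w -= learning_rate * m_hat / (v_hat ** 0.5 + eps)
```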
Beyond OLS Regression
▣ Can’t do everything with linear regression!
▣ Nor polynomial…
▣ Why can’t we let the computer figure out how to model the data?
Neural Networks: Idea
▣ Chain together non-linear functions
▣ Have lots of parameters that can be adjusted
▣ These “weights” determine the model function
Feed forward neural network
[Diagram: a feed-forward network with layers l(1) input, l(2) hidden 1, l(3) hidden 2, and l(4) output. Inputs x1, x2 plus a bias unit feed two hidden layers of sigmoid (σ) units (each with its own bias unit), which feed a softmax (SM) output layer producing ŷ. Weight matrices W(1), W(2), W(3) connect the layers, producing activations a(2), a(3), a(4).]

xi: input value
ŷ: output vector
+1: bias (constant) unit
a(l): activation vector for layer l
W(l): weight matrix for layer l
z(l): input into layer l
σ: sigmoid (logistic) function
SM: Softmax function
[The following slides repeat the same network diagram, highlighting one piece at a time:]
▣ Layer 1, Layer 2, Layer 3, Layer 4
▣ Biases (constant units)
▣ Input
▣ Weight matrices
▣ Layer inputs: z(l) = W(l-1)a(l-1) + b(l-1)
▣ Activation vectors
▣ Sigmoid activation function
▣ Softmax activation function
▣ Output
Forward Propagation
The input is multiplied by the W(1) weight matrix, and the layer 1 biases are added, to calculate z(2):
z(2) = W(1)x + b(1)
Forward Propagation
The activation value for the second layer is calculated by passing z(2) into some function. In this case, the sigmoid function:
a(2) = σ(z(2))
Forward Propagation
z(3) is calculated by multiplying the a(2) vector by the W(2) weight matrix and adding the layer 2 biases:
z(3) = W(2)a(2) + b(2)
Forward Propagation
Similar to the previous layer, a(3) is calculated by passing z(3) into the sigmoid function:
a(3) = σ(z(3))
Forward Propagation
z(4) is calculated by multiplying the a(3) vector by the W(3) weight matrix and adding the layer 3 biases:
z(4) = W(3)a(3) + b(3)
Forward Propagation
For the final layer, we calculate a(4) by passing z(4) into the Softmax function:
a(4) = SM(z(4))
Forward Propagation
We then make our prediction based on the final layer’s output.
Page of Math
z(2) = W(1)x + b(1)
z(3) = W(2)a(2) + b(2)
z(4) = W(3)a(3) + b(3)
a(2) = σ(z(2))
a(3) = σ(z(3))
a(4) = ŷ = SM(z(4))
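The same page of math as a runnable NumPy sketch; the layer sizes, random weights, and example input are illustrative assumptions, not from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))          # shift for numerical stability
    return e / np.sum(e)

# Toy shapes: 2 inputs, two hidden layers of 3 units, 3 output classes (assumption)
W1, b1 = np.random.randn(3, 2), np.zeros(3)
W2, b2 = np.random.randn(3, 3), np.zeros(3)
W3, b3 = np.random.randn(3, 3), np.zeros(3)

x = np.array([0.5, -1.2])              # a single input example

z2 = W1 @ x + b1                       # z(2) = W(1)x + b(1)
a2 = sigmoid(z2)                       # a(2) = σ(z(2))
z3 = W2 @ a2 + b2                      # z(3) = W(2)a(2) + b(2)
a3 = sigmoid(z3)                       # a(3) = σ(z(3))
z4 = W3 @ a3 + b3                      # z(4) = W(3)a(3) + b(3)
y_hat = softmax(z4)                    # a(4) = ŷ = SM(z(4))
```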
Goal: Find which direction to shift weights
How: Find partial derivatives of the cost with respect to weight matrices
How (again): Chain rule the sh*t out of this mofo
DEEPER
NOTE: “Cancelling out” isn’t how the math actually works. But it’s a handy way to think about it.
Return of Page of Math
z(2) = W(1)x + b(1)
z(3) = W(2)a(2) + b(2)
z(4) = W(3)a(3) + b(3)
a(2) = σ(z(2))
a(3) = σ(z(3))
a(4) = ŷ = SM(z(4))
Back Propagation
[The next slides reuse the network diagram, now ending in a cost node, with the legend as before. For each weight matrix W(l), we want the partial derivative of the cost with respect to W(l), built by chaining local derivatives backwards from the cost toward the input.]
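Continuing the forward-pass sketch from the "Page of Math" section above, a hedged sketch of the backward pass. The slides don't specify a cost function, so a softmax/cross-entropy cost with a one-hot target y_true is assumed here, which makes the output-layer error simply ŷ - y:

```python
# Assumes W1..W3, b1..b3, x, a2, a3, y_hat from the forward-pass sketch above.
y_true = np.array([0.0, 1.0, 0.0])     # one-hot target (assumption)

# With softmax output + cross-entropy cost, the output-layer error simplifies nicely
delta4 = y_hat - y_true                          # ∂cost/∂z(4)
grad_W3 = np.outer(delta4, a3)                   # ∂cost/∂W(3)
grad_b3 = delta4

delta3 = (W3.T @ delta4) * a3 * (1 - a3)         # chain back through σ: ∂cost/∂z(3)
grad_W2 = np.outer(delta3, a2)
grad_b2 = delta3

delta2 = (W2.T @ delta3) * a2 * (1 - a2)         # ∂cost/∂z(2)
grad_W1 = np.outer(delta2, x)
grad_b1 = delta2

# One gradient descent step on every weight matrix and bias
learning_rate = 0.1
W3 -= learning_rate * grad_W3; b3 -= learning_rate * grad_b3
W2 -= learning_rate * grad_W2; b2 -= learning_rate * grad_b2
W1 -= learning_rate * grad_W1; b1 -= learning_rate * grad_b1
```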
Auto-Differentiation: Idea
▣ Use functions that have easy-to-compute derivatives
▣ Compose these functions to create a more complex “super-model”
▣ Use the chain rule to get partial derivatives of the model
What makes a “good” function?
▣ Obvious stuff: differentiable (continuously and smoothly!)
▣ Simple operations: add, subtract, multiply
▣ Reuse previous computation (sketch below)
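A tiny hand-rolled illustration of the idea, using scalars only (the function and values are made up, not from the slides): compose simple operations, store the intermediate value from the forward pass, and chain the easy local derivatives together:

```python
# f(x) = (x * w + b)^2, all scalars (toy example)
x, w, b = 2.0, 3.0, 1.0

# Forward pass: store the intermediate value for reuse in the backward pass
u = x * w + b          # u = xw + b
f = u ** 2             # f = u^2

# Backward pass: easy local derivatives, multiplied together by the chain rule
df_du = 2 * u          # d(u^2)/du
du_dw = x              # d(xw + b)/dw
du_db = 1.0            # d(xw + b)/db
df_dw = df_du * du_dw  # ∂f/∂w
df_db = df_du * du_db  # ∂f/∂b
```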
Store activation values for backprop
[Network diagram, as before: the activation vectors a(2), a(3), a(4) computed in the forward pass are kept for reuse in the backward pass.]
Chain rule takes care of the rest
[Network diagram, as before.]
It’s Over! Any questions?
Email: sam@memdump.io
GitHub: samjabrahams
Twitter: @sabraha
Presentation template by SlidesCarnival
Neural Network terms
▣ Neuron: a unit that transforms input via an activation function and outputs the result to other neurons and/or the final result
▣ Activation function: the transformation applied to produce a(l); typically non-linear (e.g. sigmoid, ReLU)
▣ Bias unit: a trainable scalar shift, typically applied to each non-output layer (think of the y-intercept term in the linear function)
▣ Layer: a grouping of “neurons” and biases that (in general) take in values from the same previous neurons and pass values forward to the same targets
▣ Hidden layer: a layer that is neither the input layer nor the output layer
▣ Input layer: the layer that receives the raw input values
▣ Output layer: the layer that produces the final output, ŷ