Deep Learning: An Engineer’s Dream and a Mathematician’s Nightmare

John McKay

Information Processing & Algorithms Laboratory, Dept. of Electrical Engineering, Pennsylvania State University

Dartmouth Applied Math Seminar, 2017

Contents

1 Intro

2 Neural Network Design
  Convolutional Neural Networks

3 Training Neural Networks
  Backpropagation
  Stochastic Gradient Descent
  Optimization Issues

4 Transfer Learning
  Pretraining
  Fine Tuning

Why Give a Talk to Mathematicians on DL?

• Deep learning is dominating machine learning (image classification, object detection, text parsing, etc.).

• Industry is keeping pace with, or beating, academia in terms of breakthroughs and adoption. We need mathematicians to provide rigor to modern strategies.

Figure: Per-month publication totals of papers on arXiv mentioning convolutional neural networks [Category: Computer Science; 86K articles, 946M words], 1990 to 2014. Topology [Category: Mathematics; 303K articles, 4B words] (orange) provided as a reference.

[Figure: example results from Gatys et al. (2015), Taigman et al. (2014)]

[Figure: courtesy Nvidia, Krause et al. (2016)]

This talk will...

• Provide a comprehensive introduction to Deep Learning
• Cover most of the modern ideas and trends
• Illuminate the issues underpinning popular strategies

Traditional Supervised Image Classification

Training:
1. Input Labeled Data
   ↓
2. Extract Features
   ↓
3. Train Classifier

Training

[Figure: http://www.cs.trincoll.edu]

Feature Extraction

• Popular features: scale-invariant feature transforms (SIFT), speeded-up robust features (SURF), sparse reconstruction-based classification (SRC) coefficients.

• They are hand-crafted from analytical work trying to decipher invariant (translation, rotation, scale, etc.) and discriminatory image characteristics.

• Why not have the computer do it?

Lowe (1999)


Single Hidden Layer Network

• Input x ∈ R^d, NN output a^(2):

  a^(2) = W_2 σ(W_1 x + b_1) + b_2

  where σ(W_1 x + b_1) is the 1st (hidden) layer and the outer map W_2(·) + b_2 is the 2nd (output) layer.

• W_1, W_2 are the weights and b_1, b_2 the biases.
• σ is an activation function (more on this later).
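
As a concrete illustration, a minimal NumPy sketch of this forward pass; the sigmoid activation, layer sizes, and random weights are assumptions made purely for illustration, not anything fixed by the slide.

```python
import numpy as np

def sigmoid(z):
    # elementwise logistic activation
    return 1.0 / (1.0 + np.exp(-z))

d, h, out = 4, 8, 3                      # input, hidden, output sizes (arbitrary)
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(h, d)), np.zeros(h)
W2, b2 = rng.normal(size=(out, h)), np.zeros(out)

x = rng.normal(size=d)                   # input x in R^d
hidden = sigmoid(W1 @ x + b1)            # 1st (hidden) layer
a2 = W2 @ hidden + b2                    # 2nd (output) layer, linear output
print(a2)
```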

[Figure: Nielsen (2015)]

Universal Approximation Theorem

• Universal Approximation Theorem: given enough hidden units/layers and an activation σ from a suitable class of nonlinear functions, a NN with linear output can represent any Borel measurable function from one finite-dimensional space to another.

• Nice, but representation ≠ prediction.

Hornik et al. (1989), Leshno et al. (1993)

More Parameters, More Descriptiveness

[Figure: Li/Karpathy/Johnson, Stanford CS231]

Activation Functions

• Feature spaces require nonlinear activation functions.
• There is no one “best” choice for an activation function, to say nothing of how many layers should share one specific activation function.
• Since 2009, the go-to: Rectified Linear Units (ReLU),

  σ(z) = a such that a_i = max(0, z_i)
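
A small sketch of the ReLU activation and its derivative (the derivative reappears as σ′ in backpropagation later); taking the subgradient at 0 to be 0 is an arbitrary choice here.

```python
import numpy as np

def relu(z):
    # a_i = max(0, z_i), applied elementwise
    return np.maximum(0.0, z)

def relu_prime(z):
    # derivative is 1 where z > 0 and 0 elsewhere (subgradient at 0 taken as 0)
    return (z > 0).astype(z.dtype)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z), relu_prime(z))
```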

[Figure: common activation functions, imiloainf.wordpress.com (her typo)]

Output Layer

• Never mind, sigmoids are okay here.
• For a binary problem, the sigmoid output acts as a probability.
• For multinomial problems, it is popular to use softmax:

  softmax(z_i) = exp(z_i) / Σ_j exp(z_j)

• softmax ≈ argmax function (approximately one-hot vector output)
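
A short softmax sketch; subtracting max(z) before exponentiating is a standard numerical-stability trick added here, not something the slide specifies.

```python
import numpy as np

def softmax(z):
    # exp(z_i) / sum_j exp(z_j), shifted by max(z) for numerical stability
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)
print(p, p.sum())   # probabilities summing to 1, with most mass on the argmax entry
```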


Convolutional Neural Networks

• Convolutions with filters are a way to do weight sharing and exploit sparse interactions.
• Fewer weights? Less to train/save.
• Modern CNNs are the most accurate image-classification algorithms humans have ever devised (by a bit).

LeCun et al. (1998)
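
To make the weight-sharing point concrete, a sketch of a single filter slid over an image (a “valid” correlation with stride 1); real CNN layers add multiple filters, channels, strides, and padding, which are omitted here.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one shared kernel over the image (no padding, stride 1)."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.default_rng(0).normal(size=(8, 8))
edge_filter = np.array([[1.0, 0.0, -1.0]] * 3)    # 3x3 filter: 9 weights reused everywhere
print(conv2d_valid(image, edge_filter).shape)      # (6, 6) feature map from only 9 parameters
```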

Convolutional Neural Network

[Figure: convolutional neural network architecture, shown stage by stage across several slides]

[Figure: Li/Karpathy/Johnson, Stanford CS231]

Recurrent Neural Networks

[Figure: recurrent neural network diagram (source linked in original slides)]

Regularization

[Figure: Li/Karpathy/Johnson, Stanford CS231; Gal and Ghahramani (2015)]


How to train Neural Networks?

• Suppose we are given a set x_1, . . . , x_n with known labels y(x_i) = y_i. Training entails tuning parameters to minimize the error between y and the model’s outputs for each x_i.

• Cost functions are a surrogate for classification error: we indirectly improve classification performance.

• In minimizing the cost function, gradient descent methods have become the go-to strategy. We will later discuss issues with this.

• What is the gradient of a NN w.r.t. its parameters?


Backpropagation

• Calculating the gradient was a significant challenge until 1986, when Rumelhart et al. proposed backpropagation.
• What was the problem before then?
  • Nested nonlinear activation functions and nontrivial cost functions are messy and require a lot of work for slight tweaks.
  • NNs with even a little depth involve several matrix multiplications. If one were to use a finite difference scheme, the many forward passes through the network get expensive.

Rumelhart et al. (1986)

Backpropagation

• As we will show, backpropagation allows us to compute the gradient of a NN at roughly the cost of two forward passes, regardless of the architecture.
• It is mainly a careful application of the chain rule.

Example Network: C cost function, L number of layers, training inputs x ∈ R^m (n in total), y(x) desired output.

  a^l_j = σ( Σ_k w^l_{jk} a^{l-1}_k + b^l_j ),   with a^1_j = x_j

C has two requirements: it can be written as a function of the network outputs a^L, and it can be arranged as a sum of cost functions over the training samples. Example:

  C = (1/2n) Σ_x || y(x) − a^L(x) ||²₂

Writing z^l = W^l a^{l-1} + b^l, the layer update in vector form is

  a^l = σ( W^l a^{l-1} + b^l ) = σ(z^l)

Goal: find ∂C/∂w^l_{jk} and ∂C/∂b^l_j.

Nielsen (2015)

  a^l = σ( W^l a^{l-1} + b^l ) = σ(z^l)

Let δ^l_j = ∂C/∂z^l_j.

• Error in the output layer:

  δ^L_j = (∂C/∂a^L_j)(∂a^L_j/∂z^L_j) = (∂C/∂a^L_j) σ′(z^L_j)   (1)

• δ^l in terms of δ^{l+1}:

  δ^l = ( (W^{l+1})^T δ^{l+1} ) ⊙ σ′(z^l)   (2)

• Partial w.r.t. the bias:

  ∂C/∂b^l_j = δ^l_j   (3)

• Partial w.r.t. the weights:

  ∂C/∂w^l_{jk} = a^{l-1}_k δ^l_j   (4)

Backpropagation Algorithm

1 Input x and set the first-layer activation a^1.
2 Feedforward through the network, i.e. for l = 2, . . . , L find

  z^l = W^l a^{l-1} + b^l ,   a^l = σ(z^l)

3 Find the output error δ^L = ∇_{a^L} C ⊙ σ′(z^L).
4 Backpropagate the error by calculating, for l = L−1, L−2, . . . , 2,

  δ^l = ( (W^{l+1})^T δ^{l+1} ) ⊙ σ′(z^l)

5 The gradients of the cost function are found as

  ∂C/∂b^l_j = δ^l_j ,   ∂C/∂w^l_{jk} = a^{l-1}_k δ^l_j

Nielsen (2015)
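
A compact NumPy sketch of the five steps above for a fully connected network with sigmoid activations and the quadratic cost; the layer sizes, data, and single-sample treatment are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop(x, y, weights, biases):
    """Return (dC/dW, dC/db) for one sample under C = 0.5*||y - a^L||^2."""
    # steps 1-2: feedforward, storing z^l and a^l
    a, activations, zs = x, [x], []
    for W, b in zip(weights, biases):
        z = W @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)
    # step 3: output error delta^L = grad_a C  ⊙  sigma'(z^L)
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    grad_W = [None] * len(weights)
    grad_b = [None] * len(biases)
    grad_W[-1] = np.outer(delta, activations[-2])
    grad_b[-1] = delta
    # step 4: backpropagate delta^l = (W^{l+1}^T delta^{l+1}) ⊙ sigma'(z^l)
    for l in range(len(weights) - 2, -1, -1):
        delta = (weights[l + 1].T @ delta) * sigmoid_prime(zs[l])
        # step 5: dC/dW^l = delta^l (a^{l-1})^T,  dC/db^l = delta^l
        grad_W[l] = np.outer(delta, activations[l])
        grad_b[l] = delta
    return grad_W, grad_b

# tiny 2-3-1 network on one sample
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
biases = [np.zeros(3), np.zeros(1)]
gW, gb = backprop(np.array([0.5, -1.0]), np.array([1.0]), weights, biases)
print([g.shape for g in gW])   # [(3, 2), (1, 3)]
```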

Learned Filters

[Figure: learned filters, courtesy Yann LeCun]


Gradient Descent

• Backpropagation → gradient.
• (Batch) gradient descent:

  W^l = W^l − α_W ∂C/∂W^l
  b^l = b^l − α_b ∂C/∂b^l

• Issue: C is the sum of costs associated with each training sample.
• Many problems use millions of training samples. . .

Stochastic Gradient Descent

• Randomly choose m < n training samples x_1, . . . , x_m and form

  C^(m) = Σ_{i=1}^m C_{x_i}

• Update step:

  W^l_{k+1} = W^l_k − (α_W n/m) Σ_{i=1}^m ∂C_{x_i}/∂W^l_k
  b^l_{k+1} = b^l_k − (α_b n/m) Σ_{i=1}^m ∂C_{x_i}/∂b^l_k

• With g_W = (n/m) Σ_{i=1}^m ∂C_{x_i}/∂W^l_k, we have E[g_W] = ∇_W C: each step follows the full gradient only in expectation, which is why the “descent” is stochastic.

• The learning rate α_W is typically fixed and “small” (but not “too small”).
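
A sketch of the minibatch update applied to a simple least-squares cost (so the per-sample gradient is available in closed form); the synthetic dataset, learning rate, and batch size are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.01 * rng.normal(size=n)

# C = sum_x 0.5*(w.x - y_x)^2 ; the per-sample gradient is (w.x - y_x) x
w = np.zeros(d)
alpha, m = 1e-4, 32                               # fixed "small" learning rate, minibatch size
for step in range(2000):
    batch = rng.choice(n, size=m, replace=False)  # randomly choose m < n samples
    g = np.zeros(d)
    for i in batch:
        g += (w @ X[i] - y[i]) * X[i]
    w = w - alpha * (n / m) * g                   # W_{k+1} = W_k - alpha*(n/m)*sum of sample grads
print(np.linalg.norm(w - w_true))                 # should be small after enough steps
```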

Stochastic Gradient Descent

Figure: GD on left, SGD on right. The noisier SGD path is still acceptable; we just want “close enough.”

Ng, Stanford Machine Learning (Fall 2011)

Summarize Training

[Figure: summary of the training procedure]

Quick Note on C

• Least squares is useful for examples, but usually not suitable in practice (slow learning with sigmoid outputs).
• Cross-entropy:

  C = −(1/n) Σ_x [ yᵀ ln(a^L(x)) + (1 − y)ᵀ ln(1 − a^L(x)) ]

• Such costs are chosen for “nice” attributes, but we have little idea of the resulting topology of the cost surface, which influences everything.
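
A small sketch of the cross-entropy cost as written above; clipping the activations away from 0 and 1 is a practical guard added here, not part of the slide.

```python
import numpy as np

def cross_entropy_cost(A, Y, eps=1e-12):
    """C = -(1/n) * sum_x [ y.ln(a) + (1-y).ln(1-a) ]; A, Y are (n, outputs) arrays."""
    A = np.clip(A, eps, 1.0 - eps)   # avoid log(0)
    n = Y.shape[0]
    return -np.sum(Y * np.log(A) + (1.0 - Y) * np.log(1.0 - A)) / n

Y = np.array([[1.0], [0.0], [1.0]])
A_good = np.array([[0.9], [0.1], [0.8]])
A_bad = np.array([[0.3], [0.7], [0.4]])
print(cross_entropy_cost(A_good, Y), cross_entropy_cost(A_bad, Y))  # good predictions cost less
```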


Optimization Issues

• Training remains one of the least analytically sound aspects of NNs.
• What does the cost function look like?
• Where are its local minima (if there are any)?
• How will SGD jump around?

Local Minima Problem

[Figure]

Sontag and Sussmann (1989), Brady et al. (1989), Gori and Tesi (1992)

Saddle Points

• Suppose H = ∇²C with eigenvalues λ_i.
• If each sign is equally likely, P(λ_i > 0) = 1/2 ⟹ P(λ_1 > 0, . . . , λ_K > 0) = (1/2)^K.
• (1/2)^K → 0 as K → ∞, meaning the probability that a critical point is a local minimum vanishes in high dimensions; almost every critical point is a saddle point.

Dauphin et al. (2014)

[Figure: Sontag and Sussmann (1989), Brady et al. (1989), Gori and Tesi (1992)]

No Minima Problem

[Figure]

In Practice. . .

• Even in cases where a local minimum is not found, results can be state-of-the-art; we just want SGD to find a “very small value” of the cost.
• Initialization of the weights is a major area of research, trying to place the model in a “good” region for finding local minima (more on this later).
• Structural elements have adapted to (experimentally) fix these issues (ReLU, cross-entropy cost, convolutional layers, etc.).

Goodfellow et al. (2016)

Unstable Gradient vs. Architecture

Layers (Parameters*)   Multi-Class Error   Top-5 Error
11 (133)               29.6                10.4
13 (133)               28.7                 9.9
16 (134)               28.1                 9.4
16 (138)               27.3                 8.8
19 (144)               25.5                 8.0

* Number in millions

Simonyan and Zisserman (2014)

Unstable Gradient

• Consider the simplified model shown (a chain with one neuron per layer).
• Typical initialization: w_i ∼ N(0, 1), b_i = 1.

Nielsen (2015)

Vanishing Gradient

• First layers learn slower than later ones.

[Figure: Nielsen (2015)]

Unstable Gradients for FFN

[Figure: Nielsen (2015)]

Overcoming Unstable Gradients

• Initialization (again):
  1 Random Walk Initialization: W̃^l = ω W^l, where for ReLU layers ω_ReLU = √2 · exp( 1.2 / (max(N_l, 6) − 2.4) ) with N_l the layer width (Sussillo and Abbott, 2014).
  2 Orthonormal Initialization (Saxe et al., 2013).
  3 Unit Variance Initialization (He et al., 2015; Mishkin and Matas, 2015).
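
A sketch contrasting a naive N(0, 1) initialization with a unit-variance-style scaling in the spirit of He et al. (2015) on a deep ReLU stack; the depth, widths, and forward-pass-only measurement are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
widths = [512] * 20                      # a deep, uniformly wide ReLU stack (illustrative)

def forward_activation_scale(init):
    """Push one random input through the stack and report the final activation scale."""
    a = rng.normal(size=widths[0])
    for fan_in, fan_out in zip(widths[:-1], widths[1:]):
        W = init(fan_in, fan_out)
        a = np.maximum(0.0, W @ a)       # ReLU layer
    return np.std(a)

naive = lambda fi, fo: rng.normal(size=(fo, fi))                      # N(0,1) weights
he    = lambda fi, fo: rng.normal(size=(fo, fi)) * np.sqrt(2.0 / fi)  # sqrt(2/fan_in) scaling

print(forward_activation_scale(naive))  # blows up by many orders of magnitude with depth
print(forward_activation_scale(he))     # stays on the order of 1
```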

Overcoming Unstable Gradients

• Student-teacher models.
• Drop in layers as you go.
• Change the cost function and/or activation functions?

Romero et al. (2014)


Transfer Learning & Optimization Problems

• Transfer learning: taking a model primed for task X and applying it to an unrelated problem Y. For example, taking a CNN designed to discern amongst vehicles and applying it to classifying different animals without manipulating its parameters.
• Transfer learning and its predecessor, pretraining, are unintentional solutions to the initialization problems.


Pretraining

• Pretraining revived deep neural networks from obscurity.
• Without convolutions or recursion, deep fully connected networks could be trained while avoiding local-minima problems.
• Unsupervised pretraining has mostly been abandoned, but it inspired much of the modern work in transfer learning.

Bengio et al. (2007)

Unsupervised Pretraining

Let f be the identity function, X the input data matrix (1 row per example), K the number of iterations, and L an unsupervised learning algorithm.

for k ∈ {1, . . . , K}:
    f^(k) = L(X)
    f = f^(k) ∘ f
    X = f^(k)(X)

The above is done for each layer; the input to each stage is the previous layer’s output.

Goodfellow et al. (2016)
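
A sketch of the greedy layer-wise loop above. The stand-in learner (a PCA-style projection followed by tanh) is a hypothetical placeholder for the unsupervised algorithm L (e.g., an autoencoder or RBM in Bengio et al. (2007)), chosen only so the sketch runs end to end.

```python
import numpy as np

def learn_layer(X, width):
    """Stand-in for the unsupervised learner L: project onto the top right-singular
    directions of X. A real pretraining stage would fit, e.g., an autoencoder or RBM."""
    _, _, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
    W = Vt[:width]                        # (width, d) projection learned from X alone
    return lambda Z: np.tanh(Z @ W.T)     # one pretrained nonlinear layer

def greedy_pretrain(X, widths):
    """f starts as the identity; each new layer is trained on the previous layer's output."""
    layers = []
    for width in widths:                  # K iterations, one per layer
        f_k = learn_layer(X, width)       # f^(k) = L(X)
        layers.append(f_k)                # f = f^(k) ∘ f
        X = f_k(X)                        # X = f^(k)(X): input for the next stage
    return layers

X = np.random.default_rng(0).normal(size=(200, 30))
stack = greedy_pretrain(X, widths=[20, 10, 5])
out = X
for f in stack:
    out = f(out)
print(out.shape)                          # (200, 5): features from the pretrained stack
```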

Unsupervised Pretraining

Unsupervised pretraining is thought to work because:

1 A model is sensitive to its initializations.
  • This is the least mathematically understood part of pretraining.
  • It may start the model near local minima that are otherwise inaccessible given the shape of the cost function.
  • How much of the initialization survives training?

2 Learning the input distribution is useful.
  • Suppose a car/motorcycle model; will pretraining spot wheels? Will its representation of a wheel be helpful?
  • There is no coherent theory as to if/how/why/when unsupervised features help with NN training.

Goodfellow et al. (2016)

Supervised Pretraining

Train a one-hidden-layer model on the labels, then feed its hidden representation into the next stage, and so on:

  x → σ(W^(1) x + b^(1)) → y(x)

  x^(1) = σ(W^(1) x + b^(1)),   x^(1) → σ(W^(2) x^(1) + b^(2)) → y(x)

  . . .

  x → σ(W^(1) x + b^(1)) → σ(W^(2) x^(1) + b^(2)) → · · · → y(x)

Bengio et al. (2007)

Fine Tuning

[Figure: Oquab et al. (2014)]

Finer Details

[Figure: a pretrained CNN is applied to the input images, activations from a chosen output layer are extracted, and an SVM model is trained on those features (McKay 2017)]
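
A sketch of the pipeline in the figure: frozen “pretrained” features feed a linear SVM, which is the only part trained on the new task. The random-projection feature extractor and the toy two-class images are stand-ins; in practice the features would come from an intermediate layer of a pretrained CNN such as VGG16, and the SVM step assumes scikit-learn is available.

```python
import numpy as np
from sklearn.svm import LinearSVC

def cnn_features(images, W):
    """Stand-in for a pretrained CNN's penultimate-layer output: one fixed random
    projection followed by ReLU. In practice, run the images through a pretrained
    network and keep an intermediate layer's activations instead."""
    flat = images.reshape(len(images), -1)
    return np.maximum(0.0, flat @ W)

rng = np.random.default_rng(0)
# toy "images" from two classes that differ in mean intensity (purely illustrative)
X0 = rng.normal(loc=0.0, size=(100, 16, 16))
X1 = rng.normal(loc=0.5, size=(100, 16, 16))
images = np.concatenate([X0, X1])
labels = np.array([0] * 100 + [1] * 100)

W = rng.normal(size=(16 * 16, 64))       # "pretrained" weights, frozen: never updated here
feats = cnn_features(images, W)

clf = LinearSVC(C=1.0)                   # only this small model is trained on the new task
clf.fit(feats, labels)
print(clf.score(feats, labels))          # training accuracy of the SVM on frozen features
```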

[Figure: precision vs. recall (both 70 to 100%) for VGG-f, VGG16, VGG19, and Alex features (McKay 2017)]

Thank You

Acknowledgements
• Dr. Anne Gelb
• Dr. Suren Jayasuriya
• iPal NN group: Tiep Vu, Hojjat Mousavi, Tiantong Guo

References

Bengio, Y. et al. (2007). “Greedy layer-wise training of deep networks”. In: NIPS.

Brady, M. L., R. Raghavan, and J. Slawny (1989). “Back propagation fails to separate where perceptrons succeed”. In: IEEE Transactions on Circuits and Systems 36.5.

Dauphin, Y. N. et al. (2014). “Identifying and attacking the saddle point problem in high-dimensional non-convex optimization”. In: Advances in Neural Information Processing Systems.

Gal, Y. and Z. Ghahramani (2015). “Dropout as a Bayesian approximation: Representing model uncertainty in deep learning”. In: arXiv preprint arXiv:1506.02142.

Gatys, L. A., A. S. Ecker, and M. Bethge (2015). “A neural algorithm of artistic style”. In: arXiv preprint arXiv:1508.06576.

Goodfellow, I., Y. Bengio, and A. Courville (2016). Deep Learning. http://www.deeplearningbook.org. MIT Press.

Gori, M. and A. Tesi (1992). “On the problem of local minima in backpropagation”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 14.1.

He, K. et al. (2015). “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification”. In: Proceedings of the IEEE International Conference on Computer Vision.

Hornik, K., M. Stinchcombe, and H. White (1989). “Multilayer feedforward networks are universal approximators”. In: Neural Networks 2.5.

Krause, J. et al. (2016). “A hierarchical approach for generating descriptive image paragraphs”. In: arXiv preprint arXiv:1611.06607.

LeCun, Y. et al. (1998). “Gradient-based learning applied to document recognition”. In: Proceedings of the IEEE 86.11.

Leshno, M. et al. (1993). “Multilayer feedforward networks with a nonpolynomial activation function can approximate any function”. In: Neural Networks 6.6.

Lowe, D. G. (1999). “Object recognition from local scale-invariant features”. In: Proceedings of the Seventh IEEE International Conference on Computer Vision. Vol. 2. IEEE.

Mishkin, D. and J. Matas (2015). “All you need is a good init”. In: arXiv preprint arXiv:1511.06422.

Nielsen, M. A. (2015). Neural Networks and Deep Learning. http://neuralnetworksanddeeplearning.com. Determination Press.

Oquab, M. et al. (2014). “Learning and transferring mid-level image representations using convolutional neural networks”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

Romero, A. et al. (2014). “FitNets: Hints for thin deep nets”. In: arXiv preprint arXiv:1412.6550.

Rumelhart, D. E., G. E. Hinton, and R. J. Williams (1986). “Learning representations by back-propagating errors”. In: Nature 323.

Saxe, A. M., J. L. McClelland, and S. Ganguli (2013). “Exact solutions to the nonlinear dynamics of learning in deep linear neural networks”. In: arXiv preprint arXiv:1312.6120.

Simonyan, K. and A. Zisserman (2014). “Very deep convolutional networks for large-scale image recognition”. In: arXiv preprint arXiv:1409.1556.

Sontag, E. D. and H. J. Sussmann (1989). “Backpropagation can give rise to spurious local minima even for networks without hidden layers”. In: Complex Systems 3.1.

Sussillo, D. and L. Abbott (2014). “Random walk initialization for training very deep feedforward networks”. In: arXiv preprint arXiv:1412.6558.

Taigman, Y. et al. (2014). “DeepFace: Closing the gap to human-level performance in face verification”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.