Transcript of: Failures of Gradient-Based Deep Learning
Failures of Gradient-Based Deep Learning
Ohad Shamir, Weizmann Institute
Joint work with Shai Shalev-Shwartz & Shaked Shammah
(Hebrew University & Mobileye)
ICRI-CI Workshop, May 2017
Neural Networks (a.k.a. Deep Learning)
The Fizz Buzz Job Interview Question
Interviewer: OK, so I need you to print the numbers from 1 to 100, except that if the number is divisible by 3 print “fizz”, if it’s divisible by 5 print “buzz”, and if it’s divisible by 15 print “fizzbuzz”
Interviewee: ... let’s talk models. I’m thinking a simple multi-layer perceptron with one hidden layer...
http://joelgrus.com/2016/05/23/fizz-buzz-in-tensorflow/
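For reference, the interviewer’s specification is of course a few lines of ordinary code, no learning required:

```python
def fizzbuzz(n: int) -> str:
    # a number divisible by 15 hits both branches, yielding "fizz" + "buzz"
    word = ("fizz" * (n % 3 == 0)) + ("buzz" * (n % 5 == 0))
    return word or str(n)

for n in range(1, 101):
    print(fizzbuzz(n))
```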
This Talk
Simple problems where standard deep learning either
Does not work at all
Even for “nice” distributions and realizability
Even with over-parameterization (a.k.a. improper learning)
Does not work well
Requires prior knowledge for better architectural/algorithmic choices
Mix of theory and experiments. Code available!
Take-home Message
Even deep learning has limitations. To overcome them, prior knowledge and domain expertise can still be important
Outline
1 Piecewise Linear Curves
2 Linear-Periodic Functions
3 End-to-End vs. Decomposition
4 Flat Activations
Piecewise Linear Curves: Motivation
Piecewise Linear Curves
Problem: Train a piecewise linear curve detector
Input: f = (f(0), f(1), . . . , f(n − 1)) where
f(x) = Σ_{r=1}^{k} a_r [x − θ_r]_+ ,  θ_r ∈ {0, . . . , n − 1}
Output: curve parameters {(a_r, θ_r)}_{r=1}^{k}
Piecewise Linear Curves
Approach 1: Deep autoencoder
min_{v1,v2} E[(M_{v2}(N_{v1}(f)) − f)²]
(3 ReLU layers + linear output; sizes (100, 100, n) and (500, 100, 2k))
(after 500; 10000; 50000 iterations)
Piecewise Linear Curves
Input: f = (f(0), f(1), . . . , f(n − 1)) where
f(x) = Σ_{r=1}^{k} a_r [x − θ_r]_+
Output: {(a_r, θ_r)}_{r=1}^{k}
Approach 2: Linear Regression
Observation: f = Wp, where
W_{i,j} = [i − j + 1]_+ ,  p_j = Σ_{r=1}^{k} a_r · 1{θ_r = j}
Can extract the parameter vector p from f by p = W⁻¹f
Learning approach: Train a one-layer fully connected network on (f, p) examples:
min_U E[(Uf − p)²] = E[(Uf − W⁻¹f)²]
Piecewise Linear Curves
min_U E[(Uf − p)²] = E[(Uf − W⁻¹f)²]
Convex; Realizable
Also, doesn’t work well
(n = 300; after 500, 10000, 50000 iterations)
Piecewise Linear Curves
Explanation: W has a very large condition number
Theorem
λ_max(WᵀW) / λ_min(WᵀW) = Ω(n^3.5)
⇒ SGD requires Ω(n^3.5) iterations to reach U s.t. ‖E[U] − W⁻¹‖ < 1/2
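A quick numerical sketch of the ill-conditioning: the construction W_{i,j} = [i − j + 1]_+ is from the slide, but the particular sizes and the growth check below are only my empirical illustration of the polynomial blow-up.

```python
import numpy as np

def cond_WtW(n: int) -> float:
    # W[i, j] = [i - j + 1]_+ : the lower-triangular matrix with f = W p
    i, j = np.indices((n, n))
    W = np.maximum(i - j + 1, 0).astype(float)
    return np.linalg.cond(W.T @ W)

# The condition number grows like a high-degree polynomial in n,
# which is what slows SGD down on min_U E[(Uf - p)^2].
for n in (10, 20, 40, 80):
    print(n, cond_WtW(n))
```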
Piecewise Linear Curves
Approach 3: Convolutional Networks
p = W⁻¹f
Observation:
W⁻¹ =
[  1   0   0   0  ⋯ ]
[ −2   1   0   0  ⋯ ]
[  1  −2   1   0  ⋯ ]
[  0   1  −2   1  ⋯ ]
[  0   0   1  −2  ⋯ ]
[  ⋮               ⋱ ]
W⁻¹f is a 1D convolution of f with the “line-break” filter (1, −2, 1)
Can train a one-layer convnet to learn the filter (a problem in ℝ³!)
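A small sketch (sizes and random seed are my choices, not from the talk) checking that the (1, −2, 1) filter, i.e. a discrete second difference, recovers the breakpoints from a sampled piecewise linear curve:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 300, 3
theta = rng.choice(np.arange(1, n - 1), size=k, replace=False)  # breakpoints
a = rng.normal(size=k)                                          # slope changes

x = np.arange(n)
f = sum(ar * np.maximum(x - t, 0.0) for ar, t in zip(a, theta))

# Convolving f with the "line-break" filter (1, -2, 1) takes a discrete
# second difference: it equals a_r at x = theta_r and 0 elsewhere.
p = np.convolve(f, [1.0, -2.0, 1.0], mode="same")
```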
Piecewise Linear Curves
(after 500; 10000; 50000 iterations)
Theorem: Condition number reduced to Θ(n³). Convolutions aid geometry!
But: Θ(n³) iterations is still very disappointing for a problem in ℝ³...
Piecewise Linear Curves
Approach 4: Preconditioning
Convolutions reduce the problem to ℝ³. In such low dimension, can easily estimate correlations in f and use them to precondition
(after 500; 10000; 50000 iterations)
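A sketch of the preconditioning idea. The sizes and the specific whitening recipe below are my choices, not the talk’s: estimate the 3×3 correlation matrix of length-3 windows of f and use its inverse square root as a preconditioner, which drives the effective condition number down to 1.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, m = 100, 3, 500  # curve length, pieces, number of training curves (assumed)

def sample_curve():
    theta = rng.choice(np.arange(1, n - 1), size=k, replace=False)
    a = rng.normal(size=k)
    x = np.arange(n)
    return sum(ar * np.maximum(x - t, 0.0) for ar, t in zip(a, theta))

# After the convolutional reduction, each example is a length-3 window of f.
F = np.stack([sample_curve() for _ in range(m)])
windows = np.lib.stride_tricks.sliding_window_view(F, 3, axis=1).reshape(-1, 3)

# Empirical 3x3 correlation matrix and its inverse square root.
C = windows.T @ windows / len(windows)
evals, evecs = np.linalg.eigh(C)
P = evecs @ np.diag(evals ** -0.5) @ evecs.T   # P = C^{-1/2}

# Raw windows are badly conditioned; the whitened problem is not.
print(np.linalg.cond(C), np.linalg.cond(P @ C @ P))
```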
Outline
1 Piecewise Linear Curves
2 Linear-Periodic Functions
3 End-to-End vs. Decomposition
4 Flat Activations
Linear-Periodic Functions
x ↦ ψ(wᵀx), ψ periodic
Closely related to generalized linear models
Implementable with 2-layer networks on any bounded domain
Statistically learnable from data
Computationally learnable from data, at least in some cases
Informal Result
Not learnable with gradient-based methods in polynomial time, for any smooth distribution on ℝᵈ
Even with over-parameterization / arbitrarily complex network
Even if ψ and distribution are known
Case Study
Target function: x ↦ cos(w⋆ᵀx), where x has a standard Gaussian distribution in ℝᵈ
With enough training data, equivalent to
min_w E_{x∼N(0,I)}[(cos(wᵀx) − cos(w⋆ᵀx))²]
Case Study
In 2 dimensions, w? = (2, 2):
No local minima/saddle points
However, extremely flat unless very close to optimum ⇒ difficult for gradient methods
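The flatness is easy to see by Monte Carlo (a sketch; the sample size and the probe points far from w⋆ are my choices): the objective sits on a plateau of height ≈ 1 everywhere except a tiny neighborhood of ±w⋆.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2
w_star = np.array([2.0, 2.0])
X = rng.standard_normal((200_000, d))  # x ~ N(0, I)

def loss(w):
    return np.mean((np.cos(X @ w) - np.cos(X @ w_star)) ** 2)

# Zero at the optimum, but essentially the same constant value at any
# point far from +/- w_star, so gradients carry almost no signal there.
print(loss(w_star), loss(np.array([6.0, 6.0])), loss(np.array([-6.0, 6.0])))
```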
Analysis
Similar issues even for
Arbitrary smooth distributions
Any periodic ψ (not just cosine)
Arbitrary networks
min_{v∈V} F_{w⋆}(v) = E_{x∼ϕ²}[(f(v, x) − ψ(w⋆ᵀx))²]
Theorem
Under mild assumptions, if w⋆ is a random norm-r vector in ℝᵈ, then at any v,
Var_{w⋆}(∇F_{w⋆}(v)) ≤ exp(−Ω(min{d, r²}))
Can be shown to imply that any gradient-based method would require exp(Ω(min{d, r²})) iterations to succeed.
Experiment
[Plot: accuracy vs. training iterations (up to 5·10⁴) for d = 5, 10, 30]
2-layer ReLU network, width 10d
Outline
1 Piecewise Linear Curves
2 Linear-Periodic Functions
3 End-to-End vs. Decomposition
4 Flat Activations
End-to-End vs. Decomposition
Input x: k-tuple of images of random lines
f1(x): For each image, whether slopes up or down
f2(x): Given bit vector, return parity
Important: Will focus on small k (where parity is easy)
Goal: Learn f2(f1(x))
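A sketch of the data-generating process. The image size, line parametrization, and label encoding below are my guesses at a minimal version, not the talk’s exact setup:

```python
import numpy as np

rng = np.random.default_rng(0)
size, k = 8, 3  # image side length and tuple size (assumed small)

def random_line_image():
    s = rng.choice([-1, 1])            # slope sign: the per-image label f1
    b = rng.uniform(2, size - 3)       # intercept (hypothetical parametrization)
    cols = np.arange(size)
    rows = np.clip(np.round(b + 0.5 * s * cols), 0, size - 1).astype(int)
    img = np.zeros((size, size))
    img[rows, cols] = 1.0              # one lit pixel per column
    return img, s

imgs, signs = zip(*(random_line_image() for _ in range(k)))
y = int(np.prod(signs))  # f2: parity of the k slope signs = the end-to-end label
```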
End-to-End vs. Decomposition
End-to-end approach: Train overall network on the primary objective
Decomposition approach: Augment the objective with a loss specific to the first net, using per-image labels
[Plots: performance (0.3 to 1) for k = 1, 2, 3, 4, end-to-end vs. decomposition]
20000 iterations
Analysis
Through gradient variance w.r.t. target function
Under some simplifying assumptions,
Var_{w⋆}(∇F_{w⋆}(v)) ≤ O((√(k/d))^k)
where d = number of pixels
Extremely concentrated already for very small values of k
Through gradient signal-to-noise ratio (SNR)
Ratio of bias and variance of y(x) · g(x) w.r.t. random input x, where y is target and g is gradient at initialization point
[Plot: log(SNR) vs. k ∈ {1, 2, 3, 4}, decaying from roughly −7 to −15]
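A rough Monte Carlo sketch of the SNR point. Everything here is my simplification, not the paper’s exact experiment: the “gradient” g is a stand-in (a single random ReLU feature at initialization), inputs are uniform ±1 bits, and the target is the parity of the first k coordinates.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 30, 200_000
w0 = np.ones(d) / np.sqrt(d)   # a fixed "initialization" direction

snrs = {}
for k in (1, 2, 3, 4):
    X = rng.choice([-1.0, 1.0], size=(m, d))
    y = np.prod(X[:, :k], axis=1)       # parity of the first k bits (the target)
    g = np.maximum(X @ w0, 0.0)         # stand-in for one gradient coordinate
    z = y * g
    snrs[k] = z.mean() ** 2 / z.var()   # signal^2 / noise

print(snrs)  # drops sharply as k grows
```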
Outline
1 Piecewise Linear Curves
2 Linear-Periodic Functions
3 End-to-End vs. Decomposition
4 Flat Activations
Flat Activations
Vanishing gradients due to saturating activations (e.g. in RNNs)
Flat Activations
Problem: Learning x ↦ u(w⋆ᵀx) where u is a fixed step function
Optimization problem:
min_w E_x[(u(N_w(x)) − u(w⋆ᵀx))²]
u′(z) = 0 almost everywhere → can’t apply gradient-based methods
Standard workarounds (smooth approximations; end-to-end; multiclass) don’t work too well either – see paper
Flat Activations
Different approach (Kalai & Sastry 2009; Kakade, Kalai, Kanade, Shamir 2011): Gradient descent, but replace the gradient with something else
min_w E_x[½(u(wᵀx) − u(w⋆ᵀx))²]
True gradient: ∇ = E_x[(u(wᵀx) − u(w⋆ᵀx)) · u′(wᵀx) · x]
Replaced by: ∇̃ = E_x[(u(wᵀx) − u(w⋆ᵀx)) · x]
Interpretation: “Forward only” backpropagation
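A minimal sketch of the “replace the gradient” update. For runnability I use a saturating clip activation (1-Lipschitz, flat outside [−1, 1]) rather than a hard step, Gaussian inputs, and my own step size and sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 5, 20_000
X = rng.standard_normal((m, d))
w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)

u = lambda z: np.clip(z, -1.0, 1.0)  # u' = 0 on the flat parts outside [-1, 1]
y = u(X @ w_star)

w = np.zeros(d)
for _ in range(200):
    # surrogate update: the u'(w^T x) factor of the true gradient is dropped,
    # so the flat regions of u no longer kill the update
    grad = ((u(X @ w) - y)[:, None] * X).mean(axis=0)
    w -= 0.5 * grad

print(np.mean((u(X @ w) - y) ** 2))  # squared error after training
```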
Flat Activations
(linear; 5000 iterations)
Best results, and smallest train+test time
Analysis (KS09, KKKS11): Needs O(L²/ε²) iterations if u is L-Lipschitz
Summary
Simple problems where standard gradient-based deep learningdoesn’t work well (or at all), even under favorable conditions
Not due to local minima/saddle points!
Prior knowledge and domain expertise can still be important
For more details:
“Distribution-Specific Hardness of Learning Neural Networks”: arXiv 1609.01037
“Failures of Gradient-Based Deep Learning”: arXiv 1703.07950
github.com/shakedshammah/failures_of_DL