Transcript of: Failures of Gradient-Based Deep Learning
Failures of Gradient-Based Deep Learning
Ohad Shamir, Weizmann Institute
Joint work with Shai Shalev-Shwartz & Shaked Shammah
(Hebrew University & Mobileye)
ICRI-CI Workshop, May 2017
Neural Networks (a.k.a. Deep Learning)
The Fizz Buzz Job Interview Question
Interviewer: OK, so I need you to print the numbers from 1 to 100, except that if the number is divisible by 3 print “fizz”, if it’s divisible by 5 print “buzz”, and if it’s divisible by 15 print “fizzbuzz”
Interviewee: ... let’s talk models. I’m thinking a simple multi-layer perceptron with one hidden layer...
http://joelgrus.com/2016/05/23/fizz-buzz-in-tensorflow/
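For reference, the interviewer’s specification is of course a few lines of ordinary code, no learning required:

```python
def fizzbuzz(n: int) -> str:
    # a number divisible by 15 hits both branches, yielding "fizz" + "buzz"
    word = ("fizz" * (n % 3 == 0)) + ("buzz" * (n % 5 == 0))
    return word or str(n)

for n in range(1, 101):
    print(fizzbuzz(n))
```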
This Talk
Simple problems where standard deep learning either
Does not work at all
Even for “nice” distributions and realizability
Even with over-parameterization (a.k.a. improper learning)
Does not work well
Requires prior knowledge for better architectural/algorithmic choices
Mix of theory and experiments. Code available!
Take-home Message
Even deep learning has limitations. To overcome them, prior knowledge and domain expertise can still be important
Outline
1 Piecewise Linear Curves
2 Linear-Periodic Functions
3 End-to-End vs. Decomposition
4 Flat Activations
Piecewise Linear Curves: Motivation
Piecewise Linear Curves
Problem: Train a piecewise linear curve detector
Input: f = (f(0), f(1), . . . , f(n − 1)) where
f(x) = Σ_{r=1}^{k} a_r [x − θ_r]_+ ,  θ_r ∈ {0, . . . , n − 1}
Output: curve parameters {(a_r, θ_r)}_{r=1}^{k}
Piecewise Linear Curves
Approach 1: Deep autoencoder
min_{v1,v2} E[(M_{v2}(N_{v1}(f)) − f)²]
(3 ReLU layers + linear output; sizes (100, 100, n) and (500, 100, 2k))
(after 500; 10000; 50000 iterations)
Piecewise Linear Curves
Input: f = (f(0), f(1), . . . , f(n − 1)) where
f(x) = Σ_{r=1}^{k} a_r [x − θ_r]_+
Output: {(a_r, θ_r)}_{r=1}^{k}
Approach 2: Linear Regression
Observation: f = Wp, where
W_{i,j} = [i − j + 1]_+ ,  p_j = Σ_{r=1}^{k} a_r · 1{θ_r = j}
Can extract the parameter vector p from f by p = W⁻¹f
Learning approach: Train a one-layer fully connected network on (f, p) examples:
min_U E[(Uf − p)²] = E[(Uf − W⁻¹f)²]
Piecewise Linear Curves
min_U E[(Uf − p)²] = E[(Uf − W⁻¹f)²]
Convex; Realizable
Also, doesn’t work well
(n = 300; after 500, 10000, 50000 iterations)
Piecewise Linear Curves
Explanation: W has a very large condition number
Theorem
λ_max(WᵀW) / λ_min(WᵀW) = Ω(n^3.5)
⇒ SGD requires Ω(n^3.5) iterations to reach U s.t. ‖E[U] − W⁻¹‖ < 1/2
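A quick numerical sketch of the ill-conditioning: the construction W_{i,j} = [i − j + 1]_+ is from the slide, but the particular sizes and the growth check below are only my empirical illustration of the polynomial blow-up.

```python
import numpy as np

def cond_WtW(n: int) -> float:
    # W[i, j] = [i - j + 1]_+ : the lower-triangular matrix with f = W p
    i, j = np.indices((n, n))
    W = np.maximum(i - j + 1, 0).astype(float)
    return np.linalg.cond(W.T @ W)

# The condition number grows like a high-degree polynomial in n,
# which is what slows SGD down on min_U E[(Uf - p)^2].
for n in (10, 20, 40, 80):
    print(n, cond_WtW(n))
```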
Piecewise Linear Curves
Approach 3: Convolutional Networks
p = W⁻¹f
Observation:
W⁻¹ =
[  1   0   0   0  ⋯ ]
[ −2   1   0   0  ⋯ ]
[  1  −2   1   0  ⋯ ]
[  0   1  −2   1  ⋯ ]
[  0   0   1  −2  ⋯ ]
[  ⋮               ⋱ ]
W⁻¹f is a 1D convolution of f with the “line-break” filter (1, −2, 1)
Can train a one-layer convnet to learn the filter (a problem in ℝ³!)
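A small sketch (sizes and random seed are my choices, not from the talk) checking that the (1, −2, 1) filter, i.e. a discrete second difference, recovers the breakpoints from a sampled piecewise linear curve:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 300, 3
theta = rng.choice(np.arange(1, n - 1), size=k, replace=False)  # breakpoints
a = rng.normal(size=k)                                          # slope changes

x = np.arange(n)
f = sum(ar * np.maximum(x - t, 0.0) for ar, t in zip(a, theta))

# Convolving f with the "line-break" filter (1, -2, 1) takes a discrete
# second difference: it equals a_r at x = theta_r and 0 elsewhere.
p = np.convolve(f, [1.0, -2.0, 1.0], mode="same")
```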
Piecewise Linear Curves
(after 500; 10000; 50000 iterations)
Theorem: Condition number reduced to Θ(n³). Convolutions aid geometry!
But: Θ(n³) iterations is still very disappointing for a problem in ℝ³...
Piecewise Linear Curves
Approach 4: Preconditioning
Convolutions reduce the problem to ℝ³. In such low dimension, can easily estimate correlations in f and use them to precondition
(after 500; 10000; 50000 iterations)
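A sketch of the preconditioning idea. The sizes and the specific whitening recipe below are my choices, not the talk’s: estimate the 3×3 correlation matrix of length-3 windows of f and use its inverse square root as a preconditioner, which drives the effective condition number down to 1.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, m = 100, 3, 500  # curve length, pieces, number of training curves (assumed)

def sample_curve():
    theta = rng.choice(np.arange(1, n - 1), size=k, replace=False)
    a = rng.normal(size=k)
    x = np.arange(n)
    return sum(ar * np.maximum(x - t, 0.0) for ar, t in zip(a, theta))

# After the convolutional reduction, each example is a length-3 window of f.
F = np.stack([sample_curve() for _ in range(m)])
windows = np.lib.stride_tricks.sliding_window_view(F, 3, axis=1).reshape(-1, 3)

# Empirical 3x3 correlation matrix and its inverse square root.
C = windows.T @ windows / len(windows)
evals, evecs = np.linalg.eigh(C)
P = evecs @ np.diag(evals ** -0.5) @ evecs.T   # P = C^{-1/2}

# Raw windows are badly conditioned; the whitened problem is not.
print(np.linalg.cond(C), np.linalg.cond(P @ C @ P))
```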
Outline
1 Piecewise Linear Curves
2 Linear-Periodic Functions
3 End-to-End vs. Decomposition
4 Flat Activations
Linear-Periodic Functions
x ↦ ψ(wᵀx), ψ periodic
Closely related to generalized linear models
Implementable with 2-layer networks on any bounded domain
Statistically learnable from data
Computationally learnable from data, at least in some cases
Informal Result
Not learnable with gradient-based methods in polynomial time, for any smooth distribution on ℝᵈ
Even with over-parameterization / arbitrarily complex network
Even if ψ and distribution are known
Case Study
Target function: x ↦ cos(w⋆ᵀx), where x has a standard Gaussian distribution in ℝᵈ
With enough training data, equivalent to
min_w E_{x∼N(0,I)}[(cos(wᵀx) − cos(w⋆ᵀx))²]
Case Study
In 2 dimensions, w? = (2, 2):
No local minima/saddle points
However, extremely flat unless very close to optimum ⇒ difficult for gradient methods
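The flatness is easy to see by Monte Carlo (a sketch; the sample size and the probe points far from w⋆ are my choices): the objective sits on a plateau of height ≈ 1 everywhere except a tiny neighborhood of ±w⋆.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2
w_star = np.array([2.0, 2.0])
X = rng.standard_normal((200_000, d))  # x ~ N(0, I)

def loss(w):
    return np.mean((np.cos(X @ w) - np.cos(X @ w_star)) ** 2)

# Zero at the optimum, but essentially the same constant value at any
# point far from +/- w_star, so gradients carry almost no signal there.
print(loss(w_star), loss(np.array([6.0, 6.0])), loss(np.array([-6.0, 6.0])))
```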
Analysis
Similar issues even for
Arbitrary smooth distributions
Any periodic ψ (not just cosine)
Arbitrary networks
min_{v∈V} F_{w⋆}(v) = E_{x∼ϕ²}[(f(v, x) − ψ(w⋆ᵀx))²]
Theorem
Under mild assumptions, if w⋆ is a random norm-r vector in ℝᵈ, then at any v,
Var_{w⋆}(∇F_{w⋆}(v)) ≤ exp(−Ω(min{d, r²}))
Can be shown to imply that any gradient-based method would require exp(Ω(min{d, r²})) iterations to succeed.
Experiment
[Plot: accuracy vs. training iterations (up to 5·10⁴) for d = 5, 10, 30]
2-layer ReLU network, width 10d
Outline
1 Piecewise Linear Curves
2 Linear-Periodic Functions
3 End-to-End vs. Decomposition
4 Flat Activations
End-to-End vs. Decomposition
Input x: k-tuple of images of random lines
f1(x): For each image, whether slopes up or down
f2(x): Given bit vector, return parity
Important: Will focus on small k (where parity is easy)
Goal: Learn f2(f1(x))
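A sketch of the data-generating process. The image size, line parametrization, and label encoding below are my guesses at a minimal version, not the talk’s exact setup:

```python
import numpy as np

rng = np.random.default_rng(0)
size, k = 8, 3  # image side length and tuple size (assumed small)

def random_line_image():
    s = rng.choice([-1, 1])            # slope sign: the per-image label f1
    b = rng.uniform(2, size - 3)       # intercept (hypothetical parametrization)
    cols = np.arange(size)
    rows = np.clip(np.round(b + 0.5 * s * cols), 0, size - 1).astype(int)
    img = np.zeros((size, size))
    img[rows, cols] = 1.0              # one lit pixel per column
    return img, s

imgs, signs = zip(*(random_line_image() for _ in range(k)))
y = int(np.prod(signs))  # f2: parity of the k slope signs = the end-to-end label
```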
End-to-End vs. Decomposition
End-to-end approach: Train overall network on the primary objective
Decomposition approach: Augment the objective with a loss specific to the first net, using per-image labels
[Plots: performance (0.3 to 1) for k = 1, 2, 3, 4, end-to-end vs. decomposition]
20000 iterations
Analysis
Through gradient variance w.r.t. target function
Under some simplifying assumptions,
Var_{w⋆}(∇F_{w⋆}(v)) ≤ O((√(k/d))^k)
where d = number of pixels
Extremely concentrated already for very small values of k
Through gradient signal-to-noise ratio (SNR)
Ratio of bias and variance of y(x) · g(x) w.r.t. random input x, where y is target and g is gradient at initialization point
[Plot: log(SNR) vs. k ∈ {1, 2, 3, 4}, decaying from roughly −7 to −15]
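A rough Monte Carlo sketch of the SNR point. Everything here is my simplification, not the paper’s exact experiment: the “gradient” g is a stand-in (a single random ReLU feature at initialization), inputs are uniform ±1 bits, and the target is the parity of the first k coordinates.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 30, 200_000
w0 = np.ones(d) / np.sqrt(d)   # a fixed "initialization" direction

snrs = {}
for k in (1, 2, 3, 4):
    X = rng.choice([-1.0, 1.0], size=(m, d))
    y = np.prod(X[:, :k], axis=1)       # parity of the first k bits (the target)
    g = np.maximum(X @ w0, 0.0)         # stand-in for one gradient coordinate
    z = y * g
    snrs[k] = z.mean() ** 2 / z.var()   # signal^2 / noise

print(snrs)  # drops sharply as k grows
```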
Outline
1 Piecewise Linear Curves
2 Linear-Periodic Functions
3 End-to-End vs. Decomposition
4 Flat Activations
Flat Activations
Vanishing gradients due to saturating activations (e.g. in RNNs)
Flat Activations
Problem: Learning x ↦ u(w⋆ᵀx) where u is a fixed step function
Optimization problem:
min_w E_x[(u(N_w(x)) − u(w⋆ᵀx))²]
u′(z) = 0 almost everywhere → can’t apply gradient-based methods
Standard workarounds (smooth approximations; end-to-end; multiclass) don’t work too well either – see paper
Flat Activations
Different approach (Kalai & Sastry 2009; Kakade, Kalai, Kanade, Shamir 2011): Gradient descent, but replace the gradient with something else
min_w E_x[½(u(wᵀx) − u(w⋆ᵀx))²]
True gradient: ∇ = E_x[(u(wᵀx) − u(w⋆ᵀx)) · u′(wᵀx) · x]
Replaced by: ∇̃ = E_x[(u(wᵀx) − u(w⋆ᵀx)) · x]
Interpretation: “Forward only” backpropagation
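A minimal sketch of the “replace the gradient” update. For runnability I use a saturating clip activation (1-Lipschitz, flat outside [−1, 1]) rather than a hard step, Gaussian inputs, and my own step size and sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 5, 20_000
X = rng.standard_normal((m, d))
w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)

u = lambda z: np.clip(z, -1.0, 1.0)  # u' = 0 on the flat parts outside [-1, 1]
y = u(X @ w_star)

w = np.zeros(d)
for _ in range(200):
    # surrogate update: the u'(w^T x) factor of the true gradient is dropped,
    # so the flat regions of u no longer kill the update
    grad = ((u(X @ w) - y)[:, None] * X).mean(axis=0)
    w -= 0.5 * grad

print(np.mean((u(X @ w) - y) ** 2))  # squared error after training
```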
Flat Activations
(linear; 5000 iterations)
Best results, and smallest train+test time
Analysis (KS09, KKKS11): Needs O(L²/ε²) iterations if u is L-Lipschitz
Summary
Simple problems where standard gradient-based deep learningdoesn’t work well (or at all), even under favorable conditions
Not due to local minima/saddle points!
Prior knowledge and domain expertise can still be important
For more details:
“Distribution-Specific Hardness of Learning Neural Networks”: arXiv 1609.01037
“Failures of Gradient-Based Deep Learning”: arXiv 1703.07950
github.com/shakedshammah/failures_of_DL