Is Depth Needed for Deep Learning? Circuit Complexity in Neural Networks
Ohad Shamir
Weizmann Institute of Science and Microsoft Research
STOC Deep Learning Workshop, June 2017
Neural Networks (a.k.a. Deep Learning)
A single neuron
$x \mapsto \sigma(w^\top x + b)$
Activation σ examples
ReLU: $[z]_+ := \max\{0, z\}$
Feedforward neural network
Deep Networks
$x \,(\in \mathbb{R}^d) \mapsto W_k\, \sigma_{k-1}\big(\cdots \sigma_2(W_2\, \sigma_1(W_1 x + b_1) + b_2) \cdots\big) + b_k$
Depth: $k$. Width: maximal dimension of $W_1, \ldots, W_k$.
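To make the notation concrete, here is a minimal NumPy sketch of such a network's forward pass (my own illustration; the layer sizes and random weights are arbitrary placeholders, not from the slides):

```python
import numpy as np

def relu(z):
    # ReLU activation [z]_+ = max{0, z}, applied elementwise
    return np.maximum(0.0, z)

def forward(x, weights, biases):
    """Depth-k feedforward network, matching the formula above:
    x -> W_k σ(... σ(W_2 σ(W_1 x + b_1) + b_2) ...) + b_k,
    with ReLU activations on all hidden layers."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(W @ h + b)
    return weights[-1] @ h + biases[-1]

# Toy instance: d = 4 input, two hidden layers of width 8, scalar output (depth k = 3).
rng = np.random.default_rng(0)
dims = [4, 8, 8, 1]
weights = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(len(dims) - 1)]
biases = [rng.standard_normal(dims[i + 1]) for i in range(len(dims) - 1)]
print(forward(rng.standard_normal(4), weights, biases))
```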
Deep Learning
Winner of ImageNet challenge 2012: AlexNet, 8 layers
Winner of ImageNet challenge 2014: VGG, 19 layers
Winner of ImageNet challenge 2015: ResNet, 152 layers
Is Depth Needed for Deep Learning?
Overwhelming empirical evidence
Intuitive:
Many tasks are naturally modelled as a pipeline
Deep networks allow end-to-end learning
Pipeline examples: image → hand-crafted features → predictor → "dog"; 明天打电话给我。 → "call me tomorrow"
Is Depth Needed for Deep Learning?
No (in some sense):
Universal Approximation Theorems [Cybenko 1989, Hornik 1991, Leshno et al. 1993, ...]
2-layer networks, with any non-polynomial activation $\sigma$, can approximate any continuous $f : [0,1]^d \to \mathbb{R}$ to arbitrary accuracy
Catch: Construction uses exp(d)-wide networks. What about poly-sized networks?
Is Depth Needed for Deep Learning?
Main Question
Are there real-valued functions which are
Expressible by a depth-h, width-w neural network
Cannot even be approximated by any network of depth < h, unless the width is much larger than w
Approximation metric: Expected loss w.r.t. some data distribution:
$d(n, f) = \mathbb{E}_{x \sim \mathcal{D}}\, \ell(n(x), f(x))$
In this talk: $\ell(y, y') = (y - y')^2$
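As a small illustration of this metric (not from the talk), the squared-loss approximation error of a candidate network $n$ against a target $f$ can be estimated by Monte Carlo sampling; the distribution, target, and "network" below are placeholders chosen only for the example:

```python
import numpy as np

def approx_error(net, target, sampler, num_samples=100_000, seed=0):
    # Monte Carlo estimate of d(n, f) = E_{x~D} (n(x) - f(x))^2
    rng = np.random.default_rng(seed)
    xs = sampler(rng, num_samples)
    diffs = np.array([net(x) - target(x) for x in xs])
    return float(np.mean(diffs ** 2))

# Placeholder example: D = uniform on [0,1]^d, target f(x) = ||x||^2,
# and a crude "network": the best per-coordinate linear fit of x_i^2.
d = 5
sampler = lambda rng, n: rng.uniform(0.0, 1.0, size=(n, d))
target = lambda x: float(np.dot(x, x))
net = lambda x: float(np.sum(x)) - d / 6.0
print(approx_error(net, target, sampler))
```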
Should Sound Familiar...
Same question asked in circuit complexity! (just different motivation)
Boolean circuits (e.g. $AC^0$)
Separation between any two constant depths (Håstad 1986, Rossman et al. 2015)
Threshold circuits (e.g. $TC^0$)
Neural networks with $\sigma(z) = \mathbf{1}\{z \ge 0\}$ activations; Boolean input/output
Separation between depth 2 and 3, if weights are bounded [Hajnal et al. 1987]
Sufficiently larger depths known to hit the natural proofs barrier
Arithmetic circuits
Neural networks computing polynomials
Each neuron computes a sum or product
Should Sound Familiar...
But: Modern neural networks have non-Boolean inputs/outputs, and non-polynomial activations
Not Boolean circuits
Not threshold circuits
Not arithmetic circuits
Unlike work from the 80's/90's (e.g. Parberry [1994]), interested in real-valued inputs/outputs, not just Boolean functions
This Talk
Depth separations for modern neural networks
Nascent field in the machine learning community; some examples of results and techniques
Focus on clean lower bounds and standard activations
Many open questions..
Comments/feedback welcome!
Separating Depth 2 and 3 via Correlations
Depth-2 networks: $x \mapsto w_2^\top \sigma(W_1 x + b_1) + b_2$
Linear combination of neurons $\sigma(w^\top x + b)$
Depth-3 networks: $x \mapsto w_3^\top \sigma(W_2 \sigma(W_1 x + b_1) + b_2) + b_3$
Theorem (Daniely 2017)
Let $(x, y) \mapsto f(x^\top y)$, where $x, y$ are uniform on $\mathbb{S}^{d-1}$ and $f(z) = \sin(\pi d^3 z)$:
$\varepsilon$-approximable by a depth-3 ReLU network of poly(d, 1/ε) width and weight sizes
Not $\Omega(1)$-approximable by any depth-2 ReLU network of $\exp(o(d \log d))$ width and $O(\exp(d))$-sized weights
More generally: other activations; any $f$ which is inapproximable by $O(d^{1+\varepsilon})$-degree polynomials
Lower Bound Proof Idea
Based on harmonic analysis over $\mathbb{S}^{d-1}$
$(x, y) \mapsto f(x^\top y)$ is almost orthogonal to any $(x, y) \mapsto \psi(x^\top w, v^\top y)$ (e.g. one neuron)
Need many neurons (or huge weights) to correlate with $f(x^\top y)$
Comparison to Threshold Circuit Results
Correlation bounds also used for the depth-2/3 separation of threshold circuits [Hajnal et al. 1987]
But, stronger separation: $\exp(\Omega(d \log d))$ vs. $\exp(\Omega(d))$ width
Boolean functions never require more than $O(2^d)$ width...
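To see why low correlation with every single neuron forces a large width, here is the standard counting argument, sketched in my own notation (the normalization and the output bias are simplified, and the constants are illustrative rather than taken from the talk): write the depth-2 network as $n(x) = \sum_{i=1}^N a_i n_i(x)$ with output weights $|a_i| \le B$, and suppose each neuron satisfies $|\langle n_i, f\rangle_{\mathcal{D}}| \le \varepsilon$, where $\langle g, h\rangle_{\mathcal{D}} = \mathbb{E}_{x\sim\mathcal{D}}[g(x)h(x)]$. Then

$$\mathbb{E}_{x\sim\mathcal{D}}\big(n(x) - f(x)\big)^2 = \|n\|_{\mathcal{D}}^2 - 2\langle n, f\rangle_{\mathcal{D}} + \|f\|_{\mathcal{D}}^2 \;\ge\; \|f\|_{\mathcal{D}}^2 - 2\sum_{i=1}^N |a_i|\,\big|\langle n_i, f\rangle_{\mathcal{D}}\big| \;\ge\; \|f\|_{\mathcal{D}}^2 - 2NB\varepsilon,$$

so unless $NB \ge \|f\|_{\mathcal{D}}^2 / (4\varepsilon)$, the squared error stays at least $\|f\|_{\mathcal{D}}^2 / 2$: either the width $N$ or the weight bound $B$ must be huge.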
Weight Restrictions
The result assumes that weights are not too large. Really necessary?
In threshold circuits: a 30-year-old open question
Next: Separating depth-2/3 neural networks, without any weight restrictions, using a different technique
Theorem (Eldan and S., 2016)
$\exists$ function $f$ and distribution on $\mathbb{R}^d$ such that $f$ is
$\varepsilon$-approximable by a 3-layer, poly(d, 1/ε)-wide network
Not $\Omega(1)$-approximable by any 2-layer, $\exp(o(d))$-wide network
Applies to virtually any measurable $\sigma(\cdot)$ s.t. $|\sigma(x)| \le \mathrm{poly}(|x|)$
Proof Idea
Use radial functions:
$f(x) = g(\|x\|^2)$ where $x \in \mathbb{R}^d$, $g : \mathbb{R} \to \mathbb{R}$
If $f$ is Lipschitz, easy to approximate with depth 3:
First layer: approximate $x \mapsto x^2$; hence also $x \mapsto \|x\|^2 = \sum_i x_i^2$
Second layer + output neuron: approximate a univariate function of $\|x\|^2$
With two layers, difficult to do
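A minimal NumPy sketch of this two-stage idea (my own illustration, using simple piecewise-linear ReLU interpolation rather than the construction from the paper; the radial profile, dimension, and knot counts are placeholders):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_interpolant(h, knots):
    """One-hidden-layer ReLU function interpolating h at the given knots
    (a piecewise-linear interpolant written as a sum of ReLU hinges)."""
    knots = np.asarray(knots, dtype=float)
    vals = np.array([h(t) for t in knots])
    slopes = np.diff(vals) / np.diff(knots)
    coeffs = np.diff(slopes, prepend=0.0)      # hinge coefficients
    def pl(t):
        t = np.atleast_1d(np.asarray(t, dtype=float))
        hinges = relu(t[:, None] - knots[None, :-1])
        return vals[0] + hinges @ coeffs
    return pl

def depth3_radial_approx(g_profile, d, r_max=1.0, n_knots=200):
    """Depth-3 sketch: hidden layer 1 approximates t -> t^2 coordinatewise
    (so the sum approximates ||x||^2); hidden layer 2 approximates the
    univariate profile g applied to that sum."""
    sq = relu_interpolant(lambda t: t * t, np.linspace(-r_max, r_max, n_knots))
    prof = relu_interpolant(g_profile, np.linspace(0.0, d * r_max**2, n_knots))
    def net(x):
        sq_norm = np.sum(sq(np.asarray(x, dtype=float)))   # ~ ||x||^2
        return prof(sq_norm)[0]
    return net

# Hypothetical radial profile g and a random test point in d = 10.
d = 10
g = lambda r2: np.sin(5.0 * r2)
net = depth3_radial_approx(g, d)
x = np.random.default_rng(1).uniform(-1, 1, size=d) / np.sqrt(d)
print(net(x), g(float(np.dot(x, x))))   # the two values should be close
```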
Proof Idea
Fourier Transform on $\mathbb{R}^d$
Given a function $f$, $\hat{f}(\xi) = \int f(x) \exp(-2\pi i\, \xi^\top x)\, dx$
If $x$ is sampled from a distribution with density $\varphi^2$,
$$\mathbb{E}_{x\sim\varphi^2}\big[(n(x) - f(x))^2\big] = \int (n(x) - f(x))^2 \varphi^2(x)\, dx = \int \big(n(x)\varphi(x) - f(x)\varphi(x)\big)^2 dx = \int \big(\hat{n}(\xi) * \hat{\varphi}(\xi) - \hat{f}(\xi) * \hat{\varphi}(\xi)\big)^2 d\xi$$
For a two-layer network, $n(x) = \sum_i n_{i,w_i}(x) := \sum_i n_i(w_i^\top x)$, so this equals
$$\int \Big(\sum_i \hat{n}_{i,w_i}(\xi) * \hat{\varphi}(\xi) - \hat{f}(\xi) * \hat{\varphi}(\xi)\Big)^2 d\xi$$
Proof Idea
$$\int \Big(\sum_i \hat{n}_{i,w_i}(\xi) * \hat{\varphi}(\xi) - \hat{f}(\xi) * \hat{\varphi}(\xi)\Big)^2 d\xi$$
For (say) Gaussian $\varphi^2$, $\varphi$ (and hence $\hat{\varphi}$) is Gaussian $\Rightarrow$
[Illustration: each $\hat{n}_{i,w_i}(\xi) * \hat{\varphi}(\xi)$ is concentrated in a thin tube around the line spanned by $w_i$, whereas $\hat{f}(\xi) * \hat{\varphi}(\xi)$ is spread out over $\mathbb{R}^d$]
Intuition: Can't approximate a "fat" function with few "thin" functions in high dimension
Proof Idea
But: hard to handle the Gaussian tail. Idea: use a density $\varphi^2$ s.t.
$\hat{f} * \hat{\varphi}$ is "sufficiently fat"
$\hat{\varphi}$ has bounded support
[Illustration: $\sum_i \hat{n}_{i,w_i}(\xi) * \hat{\varphi}(\xi)$, now supported on a union of thin tubes]
Proof Idea
Explicit construction in $\mathbb{R}^d$:
Density: $\varphi^2(x) = \left(\frac{R_d}{\|x\|}\right)^{d} \cdot \underbrace{J^2_{d/2}(2\pi R_d \|x\|)}_{\text{Bessel function of the first kind}}$
Function: $f(x) = \sum_{i=1}^{\mathrm{poly}(d)} \varepsilon_i \mathbf{1}\{\|x\|^2 \in \Delta_i\}$
where $\varepsilon_i \in \{-1, +1\}$, and $\Delta_i$ are disjoint intervals
Higher Depth
So far: Separations between depths 2 and 3, in terms of dimension
Open Question: Can we show separations for higher depths?
In threshold circuits: Longstanding open problem. Probably very difficult from some constant depth (natural proofs barrier)
But: Not threshold circuits... Perhaps "hard" functions in Euclidean space?
Next: Higher-depth separations, in terms of quantities other than dimension
Highly Oscillatory Functions
Theorem (Telgarsky, 2016)
There exists a family of functions $\{\varphi_k\}_{k=1}^{\infty}$ on $[0, 1]$, s.t. for any $k$,
$\varphi_k$ is expressible by a depth-$k$, $O(1)$-width ReLU network
$\varphi_k$ is not approximable by any $o(k/\log(k))$-depth, $\mathrm{poly}(k)$-width ReLU network
* Approximation w.r.t. uniform distribution on $[0, 1]$
Again, can be generalized to other activations
Construction
$\varphi_1(x) = [2x]_+ - [4x - 2]_+$
$\varphi_2(x) = \varphi_1(\varphi_1(x))$
$\varphi_k(x) = \varphi_1^{k}(x)$, the $k$-fold composition of $\varphi_1$
$\varphi_k$ is expressible by an $O(k)$-depth, $O(1)$-width ReLU network
$\varphi_k$ is composed of $2^{k+1}$ linear segments; it can't be approximated by a piecewise-linear function with $o(2^k)$ segments
A depth-$h$, width-$w$ network expresses at most $(2w)^h$ linear segments
$\Rightarrow$ If $h = o(k/\log(k))$, can't approximate with $w = \mathrm{poly}(k)$ width
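A tiny NumPy sketch (my own illustration) of this construction: build $\varphi_k$ by repeated composition and count its linear segments on a dyadic grid, to see the exponential growth that the counting argument exploits:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def phi1(x):
    # Tent map 0 -> 1 -> 0 on [0,1], written with two ReLU units: [2x]_+ - [4x-2]_+
    return relu(2 * x) - relu(4 * x - 2)

def phi_k(x, k):
    # k-fold composition: a depth-O(k), width-O(1) ReLU network
    for _ in range(k):
        x = phi1(x)
    return x

def count_segments(f, N=2**12):
    # Slopes on a dyadic grid fine enough that every cell lies inside one
    # linear piece (true here for k up to ~11); count slope changes.
    xs = np.arange(N + 1) / N
    slopes = np.diff(f(xs)) * N
    return 1 + int(np.sum(np.abs(np.diff(slopes)) > 1e-9))

for k in range(1, 9):
    # The number of segments doubles with each composition.
    print(k, count_segments(lambda x, k=k: phi_k(x, k)))
```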
Separations in Accuracy
Theorem (Safran and S., 2016)
There exists a large family $\mathcal{F}$ of $C^2$ functions on $[0, 1]^d$ (including $x \mapsto x^2$), s.t. for any $f \in \mathcal{F}$:
Can be $\varepsilon$-approximated with a $\mathrm{polylog}(1/\varepsilon)$ depth and width ReLU network
Cannot be $\varepsilon$-approximated with an $O(1)$-depth ReLU network, unless the width is $\mathrm{poly}(1/\varepsilon)$
* Approximation w.r.t. uniform distribution
$\mathcal{F} \approx$ non-linear functions expressible by a fixed number of additions and multiplications
Note: Broadly similar and independent results in [Yarotsky 2016], [Liang and Srikant 2016]
Proof Idea for $x \mapsto x^2$
Upper bound:
Use $\varphi_1, \varphi_2, \ldots, \varphi_{O(\log(1/\varepsilon))}$ variants to extract the first $O(\log(1/\varepsilon))$ bits of $x$
Given the bit vector, do long multiplication to get the first $O(\log(1/\varepsilon))$ bits of $x^2$
Convert back to $\mathbb{R}$
Representable via an $O(\log(1/\varepsilon))$ depth/width network
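The slide's construction goes through bit extraction and long multiplication; a closely related construction that is easy to code (due to Yarotsky 2016, cited on the previous slide as broadly similar) approximates $x^2$ on $[0,1]$ by subtracting scaled compositions of the tent map, i.e. the piecewise-linear interpolant at $2^m + 1$ dyadic knots, with error $4^{-(m+1)}$ and depth/width $O(m) = O(\log(1/\varepsilon))$. A minimal sketch:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def tent(x):
    # Tent map 0 -> 1 -> 0 on [0,1], two ReLU units
    return relu(2 * x) - relu(4 * x - 2)

def square_approx(x, m):
    """ReLU approximation of x^2 on [0,1] (Yarotsky-style):
    x^2 ≈ x - sum_{k=1}^m tent^{(k)}(x) / 4^k,
    which is exactly the piecewise-linear interpolant of x^2 at the dyadic
    knots j / 2^m; the maximum error is 4^{-(m+1)}."""
    out = np.asarray(x, dtype=float).copy()
    t = np.asarray(x, dtype=float)
    for k in range(1, m + 1):
        t = tent(t)              # k-fold composition of the tent map
        out = out - t / 4**k
    return out

xs = np.linspace(0.0, 1.0, 1001)
for m in (2, 4, 6, 8):
    err = np.max(np.abs(square_approx(xs, m) - xs**2))
    print(m, err)                # error shrinks by ~4x per extra level
```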
Proof Idea for $x \mapsto x^2$
Lower Bound:
If $h$ is linear, $\int_a^{a+\Delta} \big(x^2 - h(x)\big)^2 dx = \Omega(\Delta^5)$
$\Rightarrow$ If $h$ is piecewise-linear with $O(n)$ segments, $\int_0^1 \big(x^2 - h(x)\big)^2 dx = \Omega(n^{-4})$
But: Any $O(1)$-depth, $w$-width network can express only $\mathrm{poly}(w)$ segments
$\Rightarrow$ For $\varepsilon$ approximation, need $\mathrm{poly}(1/\varepsilon)$ width
Similar ideas also in higher dimensions
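A quick numerical check of the $\Omega(n^{-4})$ scaling (my own illustration, with placeholder grid sizes): fit the best linear function to $x^2$ on each of $n$ equal segments independently, which lower-bounds what any piecewise-linear function with those breakpoints can achieve, and watch the total squared error decay like $n^{-4}$:

```python
import numpy as np

def pl_fit_error_sq(n, samples_per_seg=2000):
    """Total squared L2 error of independently fitting a linear function to
    x^2 on each of n equal segments of [0,1] (a lower bound for any
    piecewise-linear approximation with these breakpoints)."""
    total = 0.0
    for j in range(n):
        a, b = j / n, (j + 1) / n
        xs = np.linspace(a, b, samples_per_seg)
        A = np.stack([xs, np.ones_like(xs)], axis=1)
        coef, *_ = np.linalg.lstsq(A, xs**2, rcond=None)
        resid = xs**2 - A @ coef
        total += np.mean(resid**2) * (b - a)   # approximates the integral
    return total

for n in (2, 4, 8, 16, 32):
    print(n, pl_fit_error_sq(n))   # decays like n^{-4} (roughly n^{-4}/180)
```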
Natural Depth Separations
So far: Depth separations for some functions
But for machine learning, this is only 1/3 of the picture!
Expressiveness / Statistical Error / Optimization Error
Depth separations for functions that we can hope to learn with standard optimization methods?
$(x, y) \mapsto f(x^\top y)$, $f$ highly oscillating
$x \mapsto f(\|x\|)$, $f$ highly oscillating
$x \mapsto f(x)$, $f$ highly oscillating
$x \mapsto x^2$: bit-extraction (highly oscillating) + long multiplication networks
First Example: Indicator of L2 Ball
Theorem (Safran & S., 2016)
Let $f(x) = \mathbf{1}\{\|x\| \le 1\}$ on $\mathbb{R}^d$; there exists a distribution s.t. $f$ is
$\varepsilon$-approximable with a depth-3, poly(d, 1/ε)-wide ReLU network
Not $\Omega(d^{-4})$-approximable by any depth-2, $\exp(o(d))$-wide ReLU network
Can be generalized to indicators of any ellipsoid
Proof idea: Reduction from the construction of Eldan and S.
Experiment: Unit L2 ball
d = 100
[Plot: RMSE on a validation set vs. batch number (×1000), comparing a 3-layer width-100 network against 2-layer networks of width 100, 200, 400 and 800]
Second Example: L1 Ball
Theorem (Safran & S., 2016)
Let $f(x) = [\|x\|_1 - 1]_+$ on $\mathbb{R}^d$; there exists a distribution s.t. $f$ is
Expressible with a depth-3, width-$2d$ ReLU network
Not $\varepsilon$-approximable by any depth-2, width $\min\{1/\varepsilon, \exp(d)\}$ ReLU network
Proof Idea
Upper Bound
$$[\|x\|_1 - 1]_+ = \Big[\sum_{i=1}^d \big([x_i]_+ + [-x_i]_+\big) - 1\Big]_+$$
Lower Bound
The function "breaks" along the $2^d$ facets of the $L_1$ ball
For a good approximation, most facets must have a ReLU neuron breaking close to it
Bound can probably be improved...
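The upper-bound identity is easy to check numerically; a minimal sketch (my own illustration) of the corresponding depth-3 network, with $2d$ ReLU units in the first hidden layer and a single unit in the second:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def l1_ball_net(x):
    """Depth-3 ReLU network: layer 1 computes [x_i]_+ and [-x_i]_+ (2d units,
    summing to ||x||_1), layer 2 is a single ReLU applied to ||x||_1 - 1."""
    hidden = np.concatenate([relu(x), relu(-x)])   # 2d units
    return relu(np.sum(hidden) - 1.0)              # one unit

rng = np.random.default_rng(0)
for _ in range(5):
    x = rng.normal(size=7)
    direct = max(np.abs(x).sum() - 1.0, 0.0)       # [||x||_1 - 1]_+
    print(np.isclose(l1_ball_net(x), direct))      # the identity holds exactly
```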
Experiment: L1 Ball
Other Directions
Many other works and directions!
Study architectural properties of neural networks (depth and beyond) using arithmetic-style circuits
Depth separations using metrics other than approximation error
Study realistic architectures via upper bounds
... [Delalleau and Bengio 2011], [Pascanu et al. 2013], [Martens et al. 2013], [Montufar et al. 2014], [Cohen et al. 2015], [Cohen et al. 2016], [Raghu et al. 2016], [Poole et al. 2016], [Arora et al. 2016], [Mhaskar and Poggio 2016], [Shaham et al. 2016], [Mossel 2016], [McCane and Szymanski 2016], [Poggio et al. 2017], [Sharir and Shashua 2017], [Rolnick and Tegmark 2017], [Nguyen and Hein 2017], [Petersen and Voigtlaender 2017], [Lu et al. 2017], [Montanelli and Du 2017], [Telgarsky 2017], [Lee et al. 2017], [Khrulkov et al. 2017], [Serra et al. 2017], [Guss and Salakhutdinov 2017], [Mukherjee and Basu 2017] ...
Summary and Discussion
Depth separations for modern neural networks
Take-Home Message
Similar questions as in circuit complexity, but not standard circuits and a different playing field
Euclidean (not Boolean) input/output
Continuity; large Lipschitz constants; Fourier in $\mathbb{R}^d$...
No clear algebraic structure (as in arithmetic circuits). Use geometric properties instead
Curvature; piecewise linearity; sparsity in Fourier domain...
AFAIK, little study of connections between fields
Open Questions
Separations w.r.t. dimension for depths > 3?
Alternatively, a “natural proof” barrier? How to even define?
Strong separations w.r.t. dimension for $O(1)$-Lipschitz functions?
Circuit complexity techniques to analyze neural networks? And vice versa?
Is there any function which is both (1) provably deep and (2) easily learned with neural networks?
Architecture and expressiveness of modern neural networks beyond depth: Convolutions, pooling, recurrences, skip connections...