Is Depth Needed for Deep Learning? Circuit Complexity in Neural Networks
Ohad Shamir
Weizmann Institute of Science and Microsoft Research
STOC Deep Learning Workshop, June 2017
Neural Networks (a.k.a. Deep Learning)
A single neuron
$x \mapsto \sigma(w^\top x + b)$
Activation σ examples
ReLU: $[z]_+ := \max\{0, z\}$
Feedforward neural network
Deep Networks
$x \,(\in \mathbb{R}^d) \mapsto W_k\, \sigma_{k-1}\big(\cdots \sigma_2(W_2\, \sigma_1(W_1 x + b_1) + b_2) \cdots\big) + b_k$
Depth: $k$. Width: maximal dimension of $W_1, \ldots, W_k$.
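To make the notation concrete, here is a minimal NumPy sketch of such a network's forward pass (my own illustration; the layer sizes and random weights are arbitrary placeholders, not from the slides):

```python
import numpy as np

def relu(z):
    # ReLU activation [z]_+ = max{0, z}, applied elementwise
    return np.maximum(0.0, z)

def forward(x, weights, biases):
    """Depth-k feedforward network, matching the formula above:
    x -> W_k σ(... σ(W_2 σ(W_1 x + b_1) + b_2) ...) + b_k,
    with ReLU activations on all hidden layers."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(W @ h + b)
    return weights[-1] @ h + biases[-1]

# Toy instance: d = 4 input, two hidden layers of width 8, scalar output (depth k = 3).
rng = np.random.default_rng(0)
dims = [4, 8, 8, 1]
weights = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(len(dims) - 1)]
biases = [rng.standard_normal(dims[i + 1]) for i in range(len(dims) - 1)]
print(forward(rng.standard_normal(4), weights, biases))
```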
Deep Learning
Winner of ImageNet challenge 2012: AlexNet, 8 layers
Winner of ImageNet challenge 2014: VGG, 19 layers
Winner of ImageNet challenge 2015: ResNet, 152 layers
Is Depth Needed for Deep Learning?
Overwhelming empirical evidence
Intuitive:
Many tasks are naturally modelled as a pipeline
Deep networks allow end-to-end learning
Pipeline examples: image → hand-crafted features → predictor → "dog"; 明天打电话给我。 → "call me tomorrow"
Is Depth Needed for Deep Learning?
No (in some sense):
Universal Approximation Theorems [Cybenko 1989, Hornik 1991, Leshno et al. 1993, ...]
2-layer networks, with any non-polynomial activation $\sigma$, can approximate any continuous $f : [0,1]^d \to \mathbb{R}$ to arbitrary accuracy
Catch: Construction uses exp(d)-wide networks. What about poly-sized networks?
Is Depth Needed for Deep Learning?
Main Question
Are there real-valued functions which are
Expressible by a depth-h, width-w neural network
Cannot even be approximated by any network of depth < h, unless the width is much larger than w
Approximation metric: Expected loss w.r.t. some data distribution:
$d(n, f) = \mathbb{E}_{x \sim \mathcal{D}}\, \ell(n(x), f(x))$
In this talk: $\ell(y, y') = (y - y')^2$
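As a small illustration of this metric (not from the talk), the squared-loss approximation error of a candidate network $n$ against a target $f$ can be estimated by Monte Carlo sampling; the distribution, target, and "network" below are placeholders chosen only for the example:

```python
import numpy as np

def approx_error(net, target, sampler, num_samples=100_000, seed=0):
    # Monte Carlo estimate of d(n, f) = E_{x~D} (n(x) - f(x))^2
    rng = np.random.default_rng(seed)
    xs = sampler(rng, num_samples)
    diffs = np.array([net(x) - target(x) for x in xs])
    return float(np.mean(diffs ** 2))

# Placeholder example: D = uniform on [0,1]^d, target f(x) = ||x||^2,
# and a crude "network": the best per-coordinate linear fit of x_i^2.
d = 5
sampler = lambda rng, n: rng.uniform(0.0, 1.0, size=(n, d))
target = lambda x: float(np.dot(x, x))
net = lambda x: float(np.sum(x)) - d / 6.0
print(approx_error(net, target, sampler))
```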
Should Sound Familiar...
Same question asked in circuit complexity! (just different motivation)
Boolean circuits (e.g. $AC^0$)
Separation between any two constant depths (Håstad 1986, Rossman et al. 2015)
Threshold circuits (e.g. $TC^0$)
Neural networks with $\sigma(z) = \mathbf{1}\{z \ge 0\}$ activations; Boolean input/output
Separation between depth 2 and 3, if weights are bounded [Hajnal et al. 1987]
Sufficiently larger depths known to hit the natural proofs barrier
Arithmetic circuits
Neural networks computing polynomials
Each neuron computes a sum or product
Should Sound Familiar...
But: Modern neural networks have non-Boolean inputs/outputs, and non-polynomial activations
Not Boolean circuits
Not threshold circuits
Not arithmetic circuits
Unlike work from the 80's/90's (e.g. Parberry [1994]), interested in real-valued inputs/outputs, not just Boolean functions
This Talk
Depth separations for modern neural networks
Nascent field in the machine learning community; some examples of results and techniques
Focus on clean lower bounds and standard activations
Many open questions..
Comments/feedback welcome!
Separating Depth 2 and 3 via Correlations
Depth-2 networks: $x \mapsto w_2^\top \sigma(W_1 x + b_1) + b_2$
Linear combination of neurons $\sigma(w^\top x + b)$
Depth-3 networks: $x \mapsto w_3^\top \sigma(W_2 \sigma(W_1 x + b_1) + b_2) + b_3$
Theorem (Daniely 2017)
Let $(x, y) \mapsto f(x^\top y)$, where $x, y$ are uniform on $\mathbb{S}^{d-1}$ and $f(z) = \sin(\pi d^3 z)$:
$\varepsilon$-approximable by a depth-3 ReLU network of poly(d, 1/ε) width and weight sizes
Not $\Omega(1)$-approximable by any depth-2 ReLU network of $\exp(o(d \log d))$ width and $O(\exp(d))$-sized weights
More generally: other activations; any $f$ which is inapproximable by $O(d^{1+\varepsilon})$-degree polynomials
Lower Bound Proof Idea
Based on harmonic analysis over $\mathbb{S}^{d-1}$
$(x, y) \mapsto f(x^\top y)$ is almost orthogonal to any $(x, y) \mapsto \psi(x^\top w, v^\top y)$ (e.g. one neuron)
Need many neurons (or huge weights) to correlate with $f(x^\top y)$
Comparison to Threshold Circuit Results
Correlation bounds also used for the depth-2/3 separation of threshold circuits [Hajnal et al. 1987]
But, stronger separation: $\exp(\Omega(d \log d))$ vs. $\exp(\Omega(d))$ width
Boolean functions never require more than $O(2^d)$ width...
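To see why low correlation with every single neuron forces a large width, here is the standard counting argument, sketched in my own notation (the normalization and the output bias are simplified, and the constants are illustrative rather than taken from the talk): write the depth-2 network as $n(x) = \sum_{i=1}^N a_i n_i(x)$ with output weights $|a_i| \le B$, and suppose each neuron satisfies $|\langle n_i, f\rangle_{\mathcal{D}}| \le \varepsilon$, where $\langle g, h\rangle_{\mathcal{D}} = \mathbb{E}_{x\sim\mathcal{D}}[g(x)h(x)]$. Then

$$\mathbb{E}_{x\sim\mathcal{D}}\big(n(x) - f(x)\big)^2 = \|n\|_{\mathcal{D}}^2 - 2\langle n, f\rangle_{\mathcal{D}} + \|f\|_{\mathcal{D}}^2 \;\ge\; \|f\|_{\mathcal{D}}^2 - 2\sum_{i=1}^N |a_i|\,\big|\langle n_i, f\rangle_{\mathcal{D}}\big| \;\ge\; \|f\|_{\mathcal{D}}^2 - 2NB\varepsilon,$$

so unless $NB \ge \|f\|_{\mathcal{D}}^2 / (4\varepsilon)$, the squared error stays at least $\|f\|_{\mathcal{D}}^2 / 2$: either the width $N$ or the weight bound $B$ must be huge.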
Weight Restrictions
The result assumes that weights are not too large. Really necessary?
In threshold circuits: a 30-year-old open question
Next: Separating depth-2/3 neural networks, without any weight restrictions, using a different technique
Theorem (Eldan and S., 2016)
$\exists$ function $f$ and distribution on $\mathbb{R}^d$ such that $f$ is
$\varepsilon$-approximable by a 3-layer, poly(d, 1/ε)-wide network
Not $\Omega(1)$-approximable by any 2-layer, $\exp(o(d))$-wide network
Applies to virtually any measurable $\sigma(\cdot)$ s.t. $|\sigma(x)| \le \mathrm{poly}(|x|)$
Proof Idea
Use radial functions:
$f(x) = g(\|x\|^2)$ where $x \in \mathbb{R}^d$, $g : \mathbb{R} \to \mathbb{R}$
If $f$ is Lipschitz, easy to approximate with depth 3:
First layer: approximate $x \mapsto x^2$; hence also $x \mapsto \|x\|^2 = \sum_i x_i^2$
Second layer + output neuron: approximate a univariate function of $\|x\|^2$
With two layers, difficult to do
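A minimal NumPy sketch of this two-stage idea (my own illustration, using simple piecewise-linear ReLU interpolation rather than the construction from the paper; the radial profile, dimension, and knot counts are placeholders):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_interpolant(h, knots):
    """One-hidden-layer ReLU function interpolating h at the given knots
    (a piecewise-linear interpolant written as a sum of ReLU hinges)."""
    knots = np.asarray(knots, dtype=float)
    vals = np.array([h(t) for t in knots])
    slopes = np.diff(vals) / np.diff(knots)
    coeffs = np.diff(slopes, prepend=0.0)      # hinge coefficients
    def pl(t):
        t = np.atleast_1d(np.asarray(t, dtype=float))
        hinges = relu(t[:, None] - knots[None, :-1])
        return vals[0] + hinges @ coeffs
    return pl

def depth3_radial_approx(g_profile, d, r_max=1.0, n_knots=200):
    """Depth-3 sketch: hidden layer 1 approximates t -> t^2 coordinatewise
    (so the sum approximates ||x||^2); hidden layer 2 approximates the
    univariate profile g applied to that sum."""
    sq = relu_interpolant(lambda t: t * t, np.linspace(-r_max, r_max, n_knots))
    prof = relu_interpolant(g_profile, np.linspace(0.0, d * r_max**2, n_knots))
    def net(x):
        sq_norm = np.sum(sq(np.asarray(x, dtype=float)))   # ~ ||x||^2
        return prof(sq_norm)[0]
    return net

# Hypothetical radial profile g and a random test point in d = 10.
d = 10
g = lambda r2: np.sin(5.0 * r2)
net = depth3_radial_approx(g, d)
x = np.random.default_rng(1).uniform(-1, 1, size=d) / np.sqrt(d)
print(net(x), g(float(np.dot(x, x))))   # the two values should be close
```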
Proof Idea
Fourier Transform on $\mathbb{R}^d$
Given a function $f$, $\hat{f}(\xi) = \int f(x) \exp(-2\pi i\, \xi^\top x)\, dx$
If $x$ is sampled from a distribution with density $\varphi^2$,
$$\mathbb{E}_{x\sim\varphi^2}\big[(n(x) - f(x))^2\big] = \int (n(x) - f(x))^2 \varphi^2(x)\, dx = \int \big(n(x)\varphi(x) - f(x)\varphi(x)\big)^2 dx = \int \big(\hat{n}(\xi) * \hat{\varphi}(\xi) - \hat{f}(\xi) * \hat{\varphi}(\xi)\big)^2 d\xi$$
For a two-layer network, $n(x) = \sum_i n_{i,w_i}(x) := \sum_i n_i(w_i^\top x)$, so this equals
$$\int \Big(\sum_i \hat{n}_{i,w_i}(\xi) * \hat{\varphi}(\xi) - \hat{f}(\xi) * \hat{\varphi}(\xi)\Big)^2 d\xi$$
Proof Idea
$$\int \Big(\sum_i \hat{n}_{i,w_i}(\xi) * \hat{\varphi}(\xi) - \hat{f}(\xi) * \hat{\varphi}(\xi)\Big)^2 d\xi$$
For (say) Gaussian $\varphi^2$, $\varphi$ (and hence $\hat{\varphi}$) is Gaussian $\Rightarrow$
[Illustration: each $\hat{n}_{i,w_i}(\xi) * \hat{\varphi}(\xi)$ is concentrated in a thin tube around the line spanned by $w_i$, whereas $\hat{f}(\xi) * \hat{\varphi}(\xi)$ is spread out over $\mathbb{R}^d$]
Intuition: Can't approximate a "fat" function with few "thin" functions in high dimension
Proof Idea
But: hard to handle the Gaussian tail. Idea: use a density $\varphi^2$ s.t.
$\hat{f} * \hat{\varphi}$ is "sufficiently fat"
$\hat{\varphi}$ has bounded support
[Illustration: $\sum_i \hat{n}_{i,w_i}(\xi) * \hat{\varphi}(\xi)$, now supported on a union of thin tubes]
Proof Idea
Explicit construction in $\mathbb{R}^d$:
Density: $\varphi^2(x) = \left(\frac{R_d}{\|x\|}\right)^{d} \cdot \underbrace{J^2_{d/2}(2\pi R_d \|x\|)}_{\text{Bessel function of the first kind}}$
Function: $f(x) = \sum_{i=1}^{\mathrm{poly}(d)} \varepsilon_i \mathbf{1}\{\|x\|^2 \in \Delta_i\}$
where $\varepsilon_i \in \{-1, +1\}$, and $\Delta_i$ are disjoint intervals
Higher Depth
So far: Separations between depths 2 and 3, in terms of dimension
Open Question: Can we show separations for higher depths?
In threshold circuits: Longstanding open problem. Probably very difficult from some constant depth (natural proofs barrier)
But: Not threshold circuits... Perhaps "hard" functions in Euclidean space?
Next: Higher-depth separations, in terms of quantities other than dimension
Highly Oscillatory Functions
Theorem (Telgarsky, 2016)
There exists a family of functions $\{\varphi_k\}_{k=1}^{\infty}$ on $[0, 1]$, s.t. for any $k$,
$\varphi_k$ is expressible by a depth-$k$, $O(1)$-width ReLU network
$\varphi_k$ is not approximable by any $o(k/\log(k))$-depth, $\mathrm{poly}(k)$-width ReLU network
* Approximation w.r.t. uniform distribution on $[0, 1]$
Again, can be generalized to other activations
Construction
$\varphi_1(x) = [2x]_+ - [4x - 2]_+$
$\varphi_2(x) = \varphi_1(\varphi_1(x))$
$\varphi_k(x) = \varphi_1^{k}(x)$, the $k$-fold composition of $\varphi_1$
$\varphi_k$ is expressible by an $O(k)$-depth, $O(1)$-width ReLU network
$\varphi_k$ is composed of $2^{k+1}$ linear segments; it can't be approximated by a piecewise-linear function with $o(2^k)$ segments
A depth-$h$, width-$w$ network expresses at most $(2w)^h$ linear segments
$\Rightarrow$ If $h = o(k/\log(k))$, can't approximate with $w = \mathrm{poly}(k)$ width
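A tiny NumPy sketch (my own illustration) of this construction: build $\varphi_k$ by repeated composition and count its linear segments on a dyadic grid, to see the exponential growth that the counting argument exploits:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def phi1(x):
    # Tent map 0 -> 1 -> 0 on [0,1], written with two ReLU units: [2x]_+ - [4x-2]_+
    return relu(2 * x) - relu(4 * x - 2)

def phi_k(x, k):
    # k-fold composition: a depth-O(k), width-O(1) ReLU network
    for _ in range(k):
        x = phi1(x)
    return x

def count_segments(f, N=2**12):
    # Slopes on a dyadic grid fine enough that every cell lies inside one
    # linear piece (true here for k up to ~11); count slope changes.
    xs = np.arange(N + 1) / N
    slopes = np.diff(f(xs)) * N
    return 1 + int(np.sum(np.abs(np.diff(slopes)) > 1e-9))

for k in range(1, 9):
    # The number of segments doubles with each composition.
    print(k, count_segments(lambda x, k=k: phi_k(x, k)))
```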
Separations in Accuracy
Theorem (Safran and S., 2016)
There exists a large family $\mathcal{F}$ of $C^2$ functions on $[0, 1]^d$ (including $x \mapsto x^2$), s.t. for any $f \in \mathcal{F}$:
Can be $\varepsilon$-approximated with a $\mathrm{polylog}(1/\varepsilon)$ depth and width ReLU network
Cannot be $\varepsilon$-approximated with an $O(1)$-depth ReLU network, unless the width is $\mathrm{poly}(1/\varepsilon)$
* Approximation w.r.t. uniform distribution
$\mathcal{F} \approx$ non-linear functions expressible by a fixed number of additions and multiplications
Note: Broadly similar and independent results in [Yarotsky 2016], [Liang and Srikant 2016]
Proof Idea for $x \mapsto x^2$
Upper bound:
Use $\varphi_1, \varphi_2, \ldots, \varphi_{O(\log(1/\varepsilon))}$ variants to extract the first $O(\log(1/\varepsilon))$ bits of $x$
Given the bit vector, do long multiplication to get the first $O(\log(1/\varepsilon))$ bits of $x^2$
Convert back to $\mathbb{R}$
Representable via an $O(\log(1/\varepsilon))$ depth/width network
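The slide's construction goes through bit extraction and long multiplication; a closely related construction that is easy to code (due to Yarotsky 2016, cited on the previous slide as broadly similar) approximates $x^2$ on $[0,1]$ by subtracting scaled compositions of the tent map, i.e. the piecewise-linear interpolant at $2^m + 1$ dyadic knots, with error $4^{-(m+1)}$ and depth/width $O(m) = O(\log(1/\varepsilon))$. A minimal sketch:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def tent(x):
    # Tent map 0 -> 1 -> 0 on [0,1], two ReLU units
    return relu(2 * x) - relu(4 * x - 2)

def square_approx(x, m):
    """ReLU approximation of x^2 on [0,1] (Yarotsky-style):
    x^2 ≈ x - sum_{k=1}^m tent^{(k)}(x) / 4^k,
    which is exactly the piecewise-linear interpolant of x^2 at the dyadic
    knots j / 2^m; the maximum error is 4^{-(m+1)}."""
    out = np.asarray(x, dtype=float).copy()
    t = np.asarray(x, dtype=float)
    for k in range(1, m + 1):
        t = tent(t)              # k-fold composition of the tent map
        out = out - t / 4**k
    return out

xs = np.linspace(0.0, 1.0, 1001)
for m in (2, 4, 6, 8):
    err = np.max(np.abs(square_approx(xs, m) - xs**2))
    print(m, err)                # error shrinks by ~4x per extra level
```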
Proof Idea for $x \mapsto x^2$
Lower Bound:
If $h$ is linear, $\int_a^{a+\Delta} \big(x^2 - h(x)\big)^2 dx = \Omega(\Delta^5)$
$\Rightarrow$ If $h$ is piecewise-linear with $O(n)$ segments, $\int_0^1 \big(x^2 - h(x)\big)^2 dx = \Omega(n^{-4})$
But: Any $O(1)$-depth, $w$-width network can express only $\mathrm{poly}(w)$ segments
$\Rightarrow$ For $\varepsilon$ approximation, need $\mathrm{poly}(1/\varepsilon)$ width
Similar ideas also in higher dimensions
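A quick numerical check of the $\Omega(n^{-4})$ scaling (my own illustration, with placeholder grid sizes): fit the best linear function to $x^2$ on each of $n$ equal segments independently, which lower-bounds what any piecewise-linear function with those breakpoints can achieve, and watch the total squared error decay like $n^{-4}$:

```python
import numpy as np

def pl_fit_error_sq(n, samples_per_seg=2000):
    """Total squared L2 error of independently fitting a linear function to
    x^2 on each of n equal segments of [0,1] (a lower bound for any
    piecewise-linear approximation with these breakpoints)."""
    total = 0.0
    for j in range(n):
        a, b = j / n, (j + 1) / n
        xs = np.linspace(a, b, samples_per_seg)
        A = np.stack([xs, np.ones_like(xs)], axis=1)
        coef, *_ = np.linalg.lstsq(A, xs**2, rcond=None)
        resid = xs**2 - A @ coef
        total += np.mean(resid**2) * (b - a)   # approximates the integral
    return total

for n in (2, 4, 8, 16, 32):
    print(n, pl_fit_error_sq(n))   # decays like n^{-4} (roughly n^{-4}/180)
```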
Natural Depth Separations
So far: Depth separations for some functions
But for machine learning, this is only 1/3 of the picture!
Expressiveness / Statistical Error / Optimization Error
Depth separations for functions that we can hope to learn with standard optimization methods?
$(x, y) \mapsto f(x^\top y)$, $f$ highly oscillating
$x \mapsto f(\|x\|)$, $f$ highly oscillating
$x \mapsto f(x)$, $f$ highly oscillating
$x \mapsto x^2$: bit-extraction (highly oscillating) + long multiplication networks
First Example: Indicator of L2 Ball
Theorem (Safran & S., 2016)
Let $f(x) = \mathbf{1}\{\|x\| \le 1\}$ on $\mathbb{R}^d$; there exists a distribution s.t. $f$ is
$\varepsilon$-approximable with a depth-3, poly(d, 1/ε)-wide ReLU network
Not $\Omega(d^{-4})$-approximable by any depth-2, $\exp(o(d))$-wide ReLU network
Can be generalized to indicators of any ellipsoid
Proof idea: Reduction from the construction of Eldan and S.
Experiment: Unit L2 ball
d = 100
[Plot: RMSE on a validation set vs. batch number (×1000), comparing a 3-layer width-100 network against 2-layer networks of width 100, 200, 400 and 800]
Second Example: L1 Ball
Theorem (Safran & S., 2016)
Let $f(x) = [\|x\|_1 - 1]_+$ on $\mathbb{R}^d$; there exists a distribution s.t. $f$ is
Expressible with a depth-3, width-$2d$ ReLU network
Not $\varepsilon$-approximable by any depth-2, width $\min\{1/\varepsilon, \exp(d)\}$ ReLU network
Proof Idea
Upper Bound
$$[\|x\|_1 - 1]_+ = \Big[\sum_{i=1}^d \big([x_i]_+ + [-x_i]_+\big) - 1\Big]_+$$
Lower Bound
The function "breaks" along the $2^d$ facets of the $L_1$ ball
For a good approximation, most facets must have a ReLU neuron breaking close to it
Bound can probably be improved...
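The upper-bound identity is easy to check numerically; a minimal sketch (my own illustration) of the corresponding depth-3 network, with $2d$ ReLU units in the first hidden layer and a single unit in the second:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def l1_ball_net(x):
    """Depth-3 ReLU network: layer 1 computes [x_i]_+ and [-x_i]_+ (2d units,
    summing to ||x||_1), layer 2 is a single ReLU applied to ||x||_1 - 1."""
    hidden = np.concatenate([relu(x), relu(-x)])   # 2d units
    return relu(np.sum(hidden) - 1.0)              # one unit

rng = np.random.default_rng(0)
for _ in range(5):
    x = rng.normal(size=7)
    direct = max(np.abs(x).sum() - 1.0, 0.0)       # [||x||_1 - 1]_+
    print(np.isclose(l1_ball_net(x), direct))      # the identity holds exactly
```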
Experiment: L1 Ball
Other Directions
Many other works and directions!
Study architectural properties of neural networks (depth and beyond) using arithmetic-style circuits
Depth separations using metrics other than approximation error
Study realistic architectures via upper bounds
... [Delalleau and Bengio 2011], [Pascanu et al. 2013], [Martens et al. 2013], [Montufar et al. 2014], [Cohen et al. 2015], [Cohen et al. 2016], [Raghu et al. 2016], [Poole et al. 2016], [Arora et al. 2016], [Mhaskar and Poggio 2016], [Shaham et al. 2016], [Mossel 2016], [McCane and Szymanski 2016], [Poggio et al. 2017], [Sharir and Shashua 2017], [Rolnick and Tegmark 2017], [Nguyen and Hein 2017], [Petersen and Voigtlaender 2017], [Lu et al. 2017], [Montanelli and Du 2017], [Telgarsky 2017], [Lee et al. 2017], [Khrulkov et al. 2017], [Serra et al. 2017], [Guss and Salakhutdinov 2017], [Mukherjee and Basu 2017] ...
Summary and Discussion
Depth separations for modern neural networks
Take-Home Message
Similar questions as in circuit complexity, but not standard circuits and a different playing field
Euclidean (not Boolean) input/output
Continuity; large Lipschitz constants; Fourier in $\mathbb{R}^d$...
No clear algebraic structure (as in arithmetic circuits). Use geometric properties instead
Curvature; piecewise linearity; sparsity in Fourier domain...
AFAIK, little study of connections between fields
Open Questions
Separations w.r.t. dimension for depths > 3?
Alternatively, a “natural proof” barrier? How to even define?
Strong separations w.r.t. dimension for $O(1)$-Lipschitz functions?
Circuit complexity techniques to analyze neural networks? And vice versa?
Is there any function which is both (1) provably deep and (2) easily learned with neural networks?
Architecture and expressiveness of modern neural networks beyond depth: Convolutions, pooling, recurrences, skip connections...