Statistical guarantees for deep ReLU networks
Johannes Schmidt-Hieber
1 / 30
Theory for deep neural networks
different approaches:
- studying properties of the algorithms
- statistical properties
2 / 30
Theory for (S)GD
Mathematics:
- (bad) local minima and saddle points of the energy landscape (Kawaguchi '16)
- ability of gradient descent to reach zero training error for over-parametrized networks (Lee et al. '18, Allen-Zhu et al. '18)
- continuous limits of gradient descent
3 / 30
Questions that a theory should be able to answer
- for which datasets or data structures do neural networks perform well?
- how should we choose the network architecture?
- how does the signal processing work in a deep network?
- these problems motivate the statistical approach
4 / 30
Imposing a statistical model
- assume (training) data are generated from a given statistical model
- imposed model can be simpler than deep learning applications
- provides us with an information-theoretic notion of optimality (optimal estimation rates)
- since other methods have been studied for the same statistical model, we have a theoretical framework for comparison of different methods
5 / 30
Simplifications
- (up to now) the statistical analysis studies properties of the global minimum (nothing on the algorithm)
- properties that are induced by the learning algorithm via initialization and regularization techniques have to be built into the framework by constraints on the underlying class of networks
6 / 30
Mathematics
- the statistical approach combines approximation theory with complexity bounds
- good network architectures provide an optimal trade-off between these two quantities
- the statistical analysis provides constraints on the network architectures that work well
7 / 30
Algebraic definition of a deep net
- A neural network builds functions by alternating matrix-vector multiplications and componentwise applications of the activation function σ : R → R,

  f(x) = W_{L+1} σ W_L σ ··· W_2 σ W_1 x

Network parameters:
- matrices W_1, ..., W_{L+1}

Activation function:
- we study the ReLU activation function σ(x) = max(x, 0)
- x ↦ f(x) is a piecewise linear function
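The algebraic definition above can be evaluated directly; a minimal NumPy sketch (the shapes and random weights are illustrative, not from the talk):

```python
import numpy as np

def relu(x):
    """Componentwise ReLU activation sigma(x) = max(x, 0)."""
    return np.maximum(x, 0.0)

def forward(weights, x):
    """Evaluate f(x) = W_{L+1} sigma(W_L sigma(... sigma(W_1 x))).

    `weights` holds [W_1, ..., W_{L+1}]; sigma is applied after every
    matrix-vector multiplication except the last one.
    """
    h = x
    for W in weights[:-1]:
        h = relu(W @ h)
    return weights[-1] @ h

# Toy network with L = 2 hidden layers mapping R^3 -> R.
rng = np.random.default_rng(0)
Ws = [rng.uniform(-1, 1, size=(4, 3)),
      rng.uniform(-1, 1, size=(4, 4)),
      rng.uniform(-1, 1, size=(1, 4))]
y = forward(Ws, np.array([0.5, -0.2, 0.1]))  # a single real output
```

Since every layer is piecewise linear in x and composition preserves piecewise linearity, the resulting x ↦ f(x) is piecewise linear as stated.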
8 / 30
Challenges for a statistical analysis: Depth
Source: Kaiming He, Deep Residual Networks
- Networks are deep
  - a version of ResNet has 152 hidden layers
  - networks become deeper
- highly non-linear in the parameters
9 / 30
Challenges for a statistical analysis: High-dimensionality
Source: arxiv.org/pdf/1605.07678.pdf
- Number of network parameters is larger than the sample size
  - AlexNet uses 60 million parameters for 1.2 million training samples
- high-dimensional statistical problem
10 / 30
Challenges for a statistical analysis: Network sparsity
[Figure: graph of one ReLU unit x ↦ (ax + b)_+, with kink at x = −b/a]

- There is some sort of sparsity on the parameters
- units get deactivated by initialization or by learning
- assume instead sparsely connected networks
11 / 30
Challenges for a statistical analysis: Bounded parameters
- with arbitrarily large network parameters we can approximate the indicator function via

  x ↦ σ(ax) − σ(ax − 1)

- it is common in approximation theory to use networks with network parameters tending to infinity
- in practice, network parameters are typically small

we restrict all network parameters in absolute value by one
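The role of the weight a can be checked numerically. In this sketch (my own illustration, not from the talk) the average deviation of x ↦ σ(ax) − σ(ax − 1) from the indicator 1(x > 0) shrinks as a grows, because the linear ramp is squeezed into the interval [0, 1/a]:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def approx_indicator(x, a):
    """x -> sigma(a*x) - sigma(a*x - 1): ramps linearly from 0 to 1 on [0, 1/a]."""
    return relu(a * x) - relu(a * x - 1.0)

# Mean absolute deviation from the indicator 1(x > 0) on a grid of [-1, 1];
# the error mass lives on an interval of length 1/a, so it decays like 1/a.
xs = np.linspace(-1.0, 1.0, 20001)
errors = {a: np.mean(np.abs(approx_indicator(xs, a) - (xs > 0)))
          for a in (10.0, 100.0, 1000.0)}
```

This is exactly why bounding all parameters by one is a real restriction: the good approximation above needs the weight a to be large.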
12 / 30
Mathematical problem
The data are used to fit a network, i.e. to estimate the network parameters.

How fast does the estimated network converge to the truth f as the sample size increases?
13 / 30
Statistical analysis
- we observe n i.i.d. copies (X_1, Y_1), ..., (X_n, Y_n),

  Y_i = f(X_i) + ε_i,  ε_i ∼ N(0, 1)

- X_i ∈ R^d, Y_i ∈ R
- goal is to reconstruct the function f : R^d → R
- has been studied extensively (kernel smoothing, wavelets, splines, ...)
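A minimal simulation of this regression model; the uniform design on [0, 1]^d and the example regression function are my own choices for illustration:

```python
import numpy as np

def make_sample(f, n, d, rng):
    """Draw n i.i.d. pairs (X_i, Y_i) with Y_i = f(X_i) + eps_i, eps_i ~ N(0, 1)."""
    X = rng.uniform(0.0, 1.0, size=(n, d))   # design distribution: an assumption
    Y = np.apply_along_axis(f, 1, X) + rng.standard_normal(n)
    return X, Y

# Example regression function on [0, 1]^2.
X, Y = make_sample(lambda x: np.sin(x[0]) + x[1] ** 2,
                   n=200, d=2, rng=np.random.default_rng(1))
```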
14 / 30
The estimator
- choose network architecture (L, p) and sparsity s
- denote by F(L, p, s) the class of all networks with
  - architecture (L, p)
  - number of active (e.g. non-zero) parameters equal to s
- our theory applies to any estimator f̂_n taking values in F(L, p, s)
- prediction error

  R(f̂_n, f) := E_f[(f̂_n(X) − f(X))²],

  with X distributed as X_1 and independent of the sample
- study the dependence of R(f̂_n, f) on n
15 / 30
Function class
- classical idea: assume that the regression function is β-smooth
- optimal nonparametric estimation rate is n^{−2β/(2β+d)}
- suffers from the curse of dimensionality
- to understand deep learning this setting is therefore useless
- make a good structural assumption on f
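The curse is already visible in the exponent of the rate; a small numerical illustration (the choice β = 2 is hypothetical):

```python
def rate_exponent(beta, d):
    """Exponent gamma in the optimal nonparametric rate
    n^(-gamma) = n^(-2*beta/(2*beta+d))."""
    return 2.0 * beta / (2.0 * beta + d)

# For fixed smoothness beta = 2 the exponent collapses as the dimension grows,
# so the sample size needed for a given accuracy explodes with d.
exponents = {d: rate_exponent(2.0, d) for d in (1, 10, 100, 1000)}
```

For d = 1 the exponent is 0.8, while for d = 100 it is below 0.04: the rate becomes so slow that the smoothness assumption alone cannot explain deep learning in high dimensions.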
16 / 30
Hierarchical structure
- Important: only few objects are combined on deeper abstraction levels
  - few letters in one word
  - few words in one sentence
- similar ideas have been proposed earlier by the group of Tomaso Poggio
17 / 30
Function class
- We assume that

  f = g_q ∘ ... ∘ g_0

  with
  - g_i : R^{d_i} → R^{d_{i+1}}
  - each of the d_{i+1} components of g_i is β_i-smooth and depends only on t_i variables
  - t_i can be much smaller than d_i
  - effective smoothness

    β*_i := β_i ∏_{ℓ=i+1}^{q} (β_ℓ ∧ 1)

- we show that the rate depends on the pairs (t_i, β*_i), i = 0, ..., q
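The effective smoothness and the induced rate exponent are easy to tabulate; a sketch, where the example values β = (2, 1/2) and t = (5, 1) are hypothetical:

```python
from math import prod

def effective_smoothness(betas):
    """beta*_i = beta_i * prod_{l=i+1}^{q} min(beta_l, 1)."""
    return [betas[i] * prod(min(b, 1.0) for b in betas[i + 1:])
            for i in range(len(betas))]

def rate_exponent(betas, ts):
    """phi_n = n^(-gamma) with gamma = min_i 2*beta*_i / (2*beta*_i + t_i):
    the hardest pair (t_i, beta*_i) determines the rate."""
    stars = effective_smoothness(betas)
    return min(2.0 * bs / (2.0 * bs + t) for bs, t in zip(stars, ts))

# f = g1 ∘ g0 with beta = (2, 1/2) and active variables t = (5, 1):
# beta*_0 = 2 * min(1/2, 1) = 1 and beta*_1 = 1/2.
gamma = rate_exponent([2.0, 0.5], [5, 1])
```

Here the inner function g0 with t_0 = 5 active variables is the bottleneck, giving the exponent 2/7 rather than the 1/2 contributed by g1.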
18 / 30
Main result

Theorem: If

(i) depth ≍ log n,
(ii) width ≥ network sparsity ≍ max_{i=0,...,q} n^{t_i/(2β*_i + t_i)} log n,

then, for any network reconstruction method f̂_n,

  prediction error ≲ φ_n + Δ_n

(up to log n-factors) with

  Δ_n := E[ (1/n) ∑_{i=1}^{n} (Y_i − f̂_n(X_i))² − inf_{f ∈ F(L,p,s)} (1/n) ∑_{i=1}^{n} (Y_i − f(X_i))² ]

and

  φ_n := max_{i=0,...,q} n^{−2β*_i/(2β*_i + t_i)}.

Moreover, φ_n is the fastest obtainable rate.
19 / 30
Consequences
- the assumption that depth ≍ log n appears naturally
- in particular, the depth scales with the sample size
- the networks can have many more parameters than the sample size

what matters for the statistical performance is not the size of the network but the amount of regularization
20 / 30
Consequences (ctd.)
paradox:
- good rate for all smoothness indices
- existing piecewise linear methods only give good rates up to smoothness two
- here the non-linearity of the function class helps

non-linearity is essential!
21 / 30
On the proof of the theorem
Upper bound: can be shown by establishing an oracle inequality

  statistical risk ≲ approximation error over s-sparse networks + (sL log n)/n.

Approximation theory:
- builds on work by Telgarsky (2016), Liang and Srikant (2016), Yarotsky (2017)

Lower bound: reduction to a multiple testing argument
22 / 30
Lower bounds on the network sparsity
- the convergence theorem implies a deterministic lower bound on the network sparsity required to approximate β-smooth functions on [0, 1]^d
- Bölcskei et al. (2017)

Result:
- if, for ε > 0,

  s ≲ ε^{−d/β} / (L log(1/ε)),

  then

  sup_{f_0 β-Hölder} inf_{f s-sparse network} ‖f − f_0‖_∞ ≥ ε.
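Plugging in numbers makes the order of this threshold concrete; a sketch where the specific values of ε, d, β, and L are hypothetical:

```python
from math import log

def sparsity_threshold(eps, d, beta, L):
    """Order of eps^(-d/beta) / (L * log(1/eps)): below this sparsity,
    some beta-Hoelder function on [0,1]^d cannot be approximated to
    sup-norm accuracy eps by an s-sparse depth-L network."""
    return eps ** (-d / beta) / (L * log(1.0 / eps))

# The required sparsity grows quickly as the target accuracy eps shrinks.
thresholds = {eps: sparsity_threshold(eps, d=2, beta=1.0, L=3)
              for eps in (0.1, 0.01)}
```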
23 / 30
Comparison with other nonparametric methods
- there are many well-studied methods for the same statistical model
- we can now compare the statistical guarantees for these methods
- we study wavelets and MARS
24 / 30
Suboptimality of wavelet estimators
- in the regression model, wavelet methods can be shown to achieve optimal estimation rates under all sorts of constraints on the regression function

For function compositions, this is not true:
- f(x) = h(x_1 + ... + x_d) for some α-smooth function h
- rate for DNNs ≲ n^{−α/(2α+1)} (up to logarithmic factors)
- rate for the best wavelet thresholding estimator ≳ n^{−α/(2α+d)}
- reason: the low-dimensional structure does not affect the decay of the wavelet coefficients
25 / 30
MARS
- consider products of ramp functions

  h_{I,t}(x_1, ..., x_d) = ∏_{j∈I} (±(x_j − t_j))_+

- piecewise linear in each component
- MARS (multivariate adaptive regression splines) fits linear combinations of such functions to the data
- greedy algorithm
- has depth- and width-type parameters
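A MARS basis function can be written down directly; a minimal sketch, where the index set, knots, and signs are chosen for illustration:

```python
def mars_basis(coords, knots, signs):
    """Build h_{I,t}(x) = prod_{j in I} (s_j * (x_j - t_j))_+,
    a product of one-dimensional ramp functions."""
    def h(x):
        out = 1.0
        for j, t, s in zip(coords, knots, signs):
            out *= max(s * (x[j] - t), 0.0)
        return out
    return h

# h(x) = (x_1 - 0.5)_+ * (-(x_2 - 0.3))_+ :
# nonzero only on the box x_1 > 0.5, x_2 < 0.3.
h = mars_basis(coords=(0, 1), knots=(0.5, 0.3), signs=(1.0, -1.0))
```

Each factor involves a single coordinate, so the kinks of any MARS function are axis-aligned; this is the structural difference from ReLU networks exploited below.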
26 / 30
Comparison with MARS
- how does MARS compare to ReLU networks?
- functions that can be represented by s parameters with respect to the MARS function system can be represented by s log(1/ε)-sparse DNNs up to sup-norm error ε
27 / 30
Comparison with MARS (ctd.)
Figure: Reconstruction using MARS (left) and networks (right)
- the opposite is not true; one counterexample is

  f(x_1, x_2) = (x_1 + x_2 − 1)_+

- we need ≳ ε^{−1/2} many parameters to get ε-close with MARS functions
- conclusion: DNNs work better for correlated design
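The network side of the counterexample is easy to verify: a single ReLU unit with weight vector (1, 1) and shift −1 represents f exactly, whereas MARS functions, being sums of products of one-coordinate ramps, can only approximate the diagonal kink. A minimal check:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def f(x1, x2):
    """The counterexample f(x1, x2) = (x1 + x2 - 1)_+."""
    return relu(x1 + x2 - 1.0)

def relu_unit(X, w, b):
    """One ReLU unit x -> (w.x + b)_+, applied to each row of X."""
    return relu(X @ w + b)

# The unit with w = (1, 1), b = -1 reproduces f exactly on random
# points of [0, 1]^2.
X = np.random.default_rng(2).uniform(0.0, 1.0, size=(1000, 2))
net_values = relu_unit(X, np.array([1.0, 1.0]), -1.0)
```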
28 / 30
Outlook
Many extensions of these results are possible:
- understanding gradient-based methods (energy landscape, ...)
- other statistical models: classification, ...
- other network/data types: CNNs, RNNs, autoencoders, ...
- theory for generative adversarial networks (GANs)
Thank you for your attention!
29 / 30
References:
- SH (2017). Nonparametric regression using deep neural networks with ReLU activation function. arXiv:1708.06633.
- Eckle, SH (2018). A comparison of deep networks with ReLU activation function and linear spline-type methods. Neural Networks.
30 / 30