Statistical guarantees for deep ReLU networks
Johannes Schmidt-Hieber
1 / 30
Theory for deep neural networks
different approaches:
- studying properties of the algorithms
- statistical properties
2 / 30
Theory for (S)GD
Mathematics:
- (bad) local minima and saddle points of the energy landscape (Kawaguchi '16)
- ability of gradient descent to reach zero training error for over-parametrized networks (Lee et al. '18, Allen-Zhu et al. '18)
- continuous limits of gradient descent
3 / 30
Questions that a theory should be able to answer
- for which datasets or data structures do neural networks perform well?
- how should we choose the network architecture?
- how does the signal processing work in a deep network?
- these problems motivate the statistical approach
4 / 30
Imposing a statistical model
- assume (training) data are generated from a given statistical model
- imposed model can be simpler than deep learning applications
- provides us with an information-theoretic notion of optimality (optimal estimation rates)
- since other methods have been studied for the same statistical model, we have a theoretical framework for comparison of different methods
5 / 30
Simplifications
- (up to now) the statistical analysis studies properties of the global minimum (nothing on the algorithm)
- properties that are induced by the learning algorithm via initialization and regularization techniques have to be built into the framework by constraints on the underlying class of networks
6 / 30
Mathematics
- the statistical approach combines approximation theory with complexity bounds
- good network architectures provide an optimal trade-off between these two quantities
- the statistical analysis provides constraints on the network architectures that work well
7 / 30
Algebraic definition of a deep net
- A neural network builds functions by alternating matrix-vector multiplications and componentwise applications of the activation function σ : R → R,

  f(x) = W_{L+1} σ W_L σ ··· W_2 σ W_1 x

Network parameters:
- matrices W_1, ..., W_{L+1}

Activation function:
- we study the ReLU activation function σ(x) = max(x, 0)
- x ↦ f(x) is a piecewise linear function
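The algebraic definition above can be evaluated directly; a minimal NumPy sketch (the shapes and random weights are illustrative, not from the talk):

```python
import numpy as np

def relu(x):
    """Componentwise ReLU activation sigma(x) = max(x, 0)."""
    return np.maximum(x, 0.0)

def forward(weights, x):
    """Evaluate f(x) = W_{L+1} sigma(W_L sigma(... sigma(W_1 x))).

    `weights` holds [W_1, ..., W_{L+1}]; sigma is applied after every
    matrix-vector multiplication except the last one.
    """
    h = x
    for W in weights[:-1]:
        h = relu(W @ h)
    return weights[-1] @ h

# Toy network with L = 2 hidden layers mapping R^3 -> R.
rng = np.random.default_rng(0)
Ws = [rng.uniform(-1, 1, size=(4, 3)),
      rng.uniform(-1, 1, size=(4, 4)),
      rng.uniform(-1, 1, size=(1, 4))]
y = forward(Ws, np.array([0.5, -0.2, 0.1]))  # a single real output
```

Since every layer is piecewise linear in x and composition preserves piecewise linearity, the resulting x ↦ f(x) is piecewise linear as stated.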
8 / 30
Challenges for a statistical analysis: Depth
Source: Kaiming He, Deep Residual Networks
- Networks are deep
  - a version of ResNet has 152 hidden layers
  - networks become deeper
- highly non-linear in the parameters
9 / 30
Challenges for a statistical analysis: High-dimensionality
Source: arxiv.org/pdf/1605.07678.pdf
- Number of network parameters is larger than the sample size
  - AlexNet uses 60 million parameters for 1.2 million training samples
- high-dimensional statistical problem
10 / 30
Challenges for a statistical analysis: Network sparsity
[Figure: graph of one ReLU unit x ↦ (ax + b)_+, with kink at x = −b/a]

- There is some sort of sparsity on the parameters
- units get deactivated by initialization or by learning
- assume instead sparsely connected networks
11 / 30
Challenges for a statistical analysis: Bounded parameters
- with arbitrarily large network parameters we can approximate the indicator function via

  x ↦ σ(ax) − σ(ax − 1)

- it is common in approximation theory to use networks with network parameters tending to infinity
- in practice, network parameters are typically small

we restrict all network parameters in absolute value by one
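The role of the weight a can be checked numerically. In this sketch (my own illustration, not from the talk) the average deviation of x ↦ σ(ax) − σ(ax − 1) from the indicator 1(x > 0) shrinks as a grows, because the linear ramp is squeezed into the interval [0, 1/a]:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def approx_indicator(x, a):
    """x -> sigma(a*x) - sigma(a*x - 1): ramps linearly from 0 to 1 on [0, 1/a]."""
    return relu(a * x) - relu(a * x - 1.0)

# Mean absolute deviation from the indicator 1(x > 0) on a grid of [-1, 1];
# the error mass lives on an interval of length 1/a, so it decays like 1/a.
xs = np.linspace(-1.0, 1.0, 20001)
errors = {a: np.mean(np.abs(approx_indicator(xs, a) - (xs > 0)))
          for a in (10.0, 100.0, 1000.0)}
```

This is exactly why bounding all parameters by one is a real restriction: the good approximation above needs the weight a to be large.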
12 / 30
Mathematical problem
The data are used to fit a network, i.e. to estimate the network parameters.

How fast does the estimated network converge to the truth f as the sample size increases?
13 / 30
Statistical analysis
- we observe n i.i.d. copies (X_1, Y_1), ..., (X_n, Y_n),

  Y_i = f(X_i) + ε_i,  ε_i ∼ N(0, 1)

- X_i ∈ R^d, Y_i ∈ R
- goal is to reconstruct the function f : R^d → R
- has been studied extensively (kernel smoothing, wavelets, splines, ...)
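A minimal simulation of this regression model; the uniform design on [0, 1]^d and the example regression function are my own choices for illustration:

```python
import numpy as np

def make_sample(f, n, d, rng):
    """Draw n i.i.d. pairs (X_i, Y_i) with Y_i = f(X_i) + eps_i, eps_i ~ N(0, 1)."""
    X = rng.uniform(0.0, 1.0, size=(n, d))   # design distribution: an assumption
    Y = np.apply_along_axis(f, 1, X) + rng.standard_normal(n)
    return X, Y

# Example regression function on [0, 1]^2.
X, Y = make_sample(lambda x: np.sin(x[0]) + x[1] ** 2,
                   n=200, d=2, rng=np.random.default_rng(1))
```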
14 / 30
The estimator
- choose network architecture (L, p) and sparsity s
- denote by F(L, p, s) the class of all networks with
  - architecture (L, p)
  - number of active (e.g. non-zero) parameters equal to s
- our theory applies to any estimator f̂_n taking values in F(L, p, s)
- prediction error

  R(f̂_n, f) := E_f[(f̂_n(X) − f(X))²],

  with X distributed as X_1 and independent of the sample
- study the dependence of R(f̂_n, f) on n
15 / 30
Function class
- classical idea: assume that the regression function is β-smooth
- optimal nonparametric estimation rate is n^{−2β/(2β+d)}
- suffers from the curse of dimensionality
- to understand deep learning this setting is therefore useless
- make a good structural assumption on f
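The curse is already visible in the exponent of the rate; a small numerical illustration (the choice β = 2 is hypothetical):

```python
def rate_exponent(beta, d):
    """Exponent gamma in the optimal nonparametric rate
    n^(-gamma) = n^(-2*beta/(2*beta+d))."""
    return 2.0 * beta / (2.0 * beta + d)

# For fixed smoothness beta = 2 the exponent collapses as the dimension grows,
# so the sample size needed for a given accuracy explodes with d.
exponents = {d: rate_exponent(2.0, d) for d in (1, 10, 100, 1000)}
```

For d = 1 the exponent is 0.8, while for d = 100 it is below 0.04: the rate becomes so slow that the smoothness assumption alone cannot explain deep learning in high dimensions.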
16 / 30
Hierarchical structure
- Important: only few objects are combined on deeper abstraction levels
  - few letters in one word
  - few words in one sentence
- similar ideas have been proposed earlier by the group of Tomaso Poggio
17 / 30
Function class
- We assume that

  f = g_q ∘ ... ∘ g_0

  with
  - g_i : R^{d_i} → R^{d_{i+1}}
  - each of the d_{i+1} components of g_i is β_i-smooth and depends only on t_i variables
  - t_i can be much smaller than d_i
  - effective smoothness

    β*_i := β_i ∏_{ℓ=i+1}^{q} (β_ℓ ∧ 1)

- we show that the rate depends on the pairs (t_i, β*_i), i = 0, ..., q
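The effective smoothness and the induced rate exponent are easy to tabulate; a sketch, where the example values β = (2, 1/2) and t = (5, 1) are hypothetical:

```python
from math import prod

def effective_smoothness(betas):
    """beta*_i = beta_i * prod_{l=i+1}^{q} min(beta_l, 1)."""
    return [betas[i] * prod(min(b, 1.0) for b in betas[i + 1:])
            for i in range(len(betas))]

def rate_exponent(betas, ts):
    """phi_n = n^(-gamma) with gamma = min_i 2*beta*_i / (2*beta*_i + t_i):
    the hardest pair (t_i, beta*_i) determines the rate."""
    stars = effective_smoothness(betas)
    return min(2.0 * bs / (2.0 * bs + t) for bs, t in zip(stars, ts))

# f = g1 ∘ g0 with beta = (2, 1/2) and active variables t = (5, 1):
# beta*_0 = 2 * min(1/2, 1) = 1 and beta*_1 = 1/2.
gamma = rate_exponent([2.0, 0.5], [5, 1])
```

Here the inner function g0 with t_0 = 5 active variables is the bottleneck, giving the exponent 2/7 rather than the 1/2 contributed by g1.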
18 / 30
Main result

Theorem: If

(i) depth ≍ log n,
(ii) width ≥ network sparsity ≍ max_{i=0,...,q} n^{t_i/(2β*_i + t_i)} log n,

then, for any network reconstruction method f̂_n,

  prediction error ≲ φ_n + Δ_n

(up to log n-factors) with

  Δ_n := E[ (1/n) ∑_{i=1}^{n} (Y_i − f̂_n(X_i))² − inf_{f ∈ F(L,p,s)} (1/n) ∑_{i=1}^{n} (Y_i − f(X_i))² ]

and

  φ_n := max_{i=0,...,q} n^{−2β*_i/(2β*_i + t_i)}.

Moreover, φ_n is the fastest obtainable rate.
19 / 30
Consequences
- the assumption that depth ≍ log n appears naturally
- in particular, the depth scales with the sample size
- the networks can have many more parameters than the sample size

what matters for the statistical performance is not the size of the network but the amount of regularization
20 / 30
Consequences (ctd.)
paradox:
- good rate for all smoothness indices
- existing piecewise linear methods only give good rates up to smoothness two
- here the non-linearity of the function class helps

non-linearity is essential!
21 / 30
On the proof of the theorem
Upper bound: can be shown by establishing an oracle inequality

  statistical risk ≲ approximation error over s-sparse networks + (sL log n)/n.

Approximation theory:
- builds on work by Telgarsky (2016), Liang and Srikant (2016), Yarotsky (2017)

Lower bound: reduction to a multiple testing argument
22 / 30
Lower bounds on the network sparsity
- the convergence theorem implies a deterministic lower bound on the network sparsity required to approximate β-smooth functions on [0, 1]^d
- Bölcskei et al. (2017)

Result:
- if, for ε > 0,

  s ≲ ε^{−d/β} / (L log(1/ε)),

  then

  sup_{f_0 β-Hölder} inf_{f s-sparse network} ‖f − f_0‖_∞ ≥ ε.
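Plugging in numbers makes the order of this threshold concrete; a sketch where the specific values of ε, d, β, and L are hypothetical:

```python
from math import log

def sparsity_threshold(eps, d, beta, L):
    """Order of eps^(-d/beta) / (L * log(1/eps)): below this sparsity,
    some beta-Hoelder function on [0,1]^d cannot be approximated to
    sup-norm accuracy eps by an s-sparse depth-L network."""
    return eps ** (-d / beta) / (L * log(1.0 / eps))

# The required sparsity grows quickly as the target accuracy eps shrinks.
thresholds = {eps: sparsity_threshold(eps, d=2, beta=1.0, L=3)
              for eps in (0.1, 0.01)}
```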
23 / 30
Comparison with other nonparametric methods
- there are many well-studied methods for the same statistical model
- we can now compare the statistical guarantees for these methods
- we study wavelets and MARS
24 / 30
Suboptimality of wavelet estimators
- in the regression model, wavelet methods can be shown to achieve optimal estimation rates under all sorts of constraints on the regression function

For function compositions, this is not true:
- f(x) = h(x_1 + ... + x_d) for some α-smooth function h
- rate for DNNs ≲ n^{−α/(2α+1)} (up to logarithmic factors)
- rate for the best wavelet thresholding estimator ≳ n^{−α/(2α+d)}
- reason: the low-dimensional structure does not affect the decay of the wavelet coefficients
25 / 30
MARS
- consider products of ramp functions

  h_{I,t}(x_1, ..., x_d) = ∏_{j∈I} (±(x_j − t_j))_+

- piecewise linear in each component
- MARS (multivariate adaptive regression splines) fits linear combinations of such functions to the data
- greedy algorithm
- has depth- and width-type parameters
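A MARS basis function can be written down directly; a minimal sketch, where the index set, knots, and signs are chosen for illustration:

```python
def mars_basis(coords, knots, signs):
    """Build h_{I,t}(x) = prod_{j in I} (s_j * (x_j - t_j))_+,
    a product of one-dimensional ramp functions."""
    def h(x):
        out = 1.0
        for j, t, s in zip(coords, knots, signs):
            out *= max(s * (x[j] - t), 0.0)
        return out
    return h

# h(x) = (x_1 - 0.5)_+ * (-(x_2 - 0.3))_+ :
# nonzero only on the box x_1 > 0.5, x_2 < 0.3.
h = mars_basis(coords=(0, 1), knots=(0.5, 0.3), signs=(1.0, -1.0))
```

Each factor involves a single coordinate, so the kinks of any MARS function are axis-aligned; this is the structural difference from ReLU networks exploited below.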
26 / 30
Comparison with MARS
- how does MARS compare to ReLU networks?
- functions that can be represented by s parameters with respect to the MARS function system can be represented by s log(1/ε)-sparse DNNs up to sup-norm error ε
27 / 30
Comparison with MARS (ctd.)
Figure: Reconstruction using MARS (left) and networks (right)
- the opposite is not true; one counterexample is

  f(x_1, x_2) = (x_1 + x_2 − 1)_+

- we need ≳ ε^{−1/2} many parameters to get ε-close with MARS functions
- conclusion: DNNs work better for correlated design
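The network side of the counterexample is easy to verify: a single ReLU unit with weight vector (1, 1) and shift −1 represents f exactly, whereas MARS functions, being sums of products of one-coordinate ramps, can only approximate the diagonal kink. A minimal check:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def f(x1, x2):
    """The counterexample f(x1, x2) = (x1 + x2 - 1)_+."""
    return relu(x1 + x2 - 1.0)

def relu_unit(X, w, b):
    """One ReLU unit x -> (w.x + b)_+, applied to each row of X."""
    return relu(X @ w + b)

# The unit with w = (1, 1), b = -1 reproduces f exactly on random
# points of [0, 1]^2.
X = np.random.default_rng(2).uniform(0.0, 1.0, size=(1000, 2))
net_values = relu_unit(X, np.array([1.0, 1.0]), -1.0)
```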
28 / 30
Outlook
Many extensions of these results are possible:
- understanding gradient-based methods (energy landscape, ...)
- other statistical models: classification, ...
- other network/data types: CNNs, RNNs, autoencoders, ...
- theory for generative adversarial networks (GANs)
Thank you for your attention!
29 / 30
References:
- SH (2017). Nonparametric regression using deep neural networks with ReLU activation function. arXiv:1708.06633.
- Eckle, SH (2018). A comparison of deep networks with ReLU activation function and linear spline-type methods. Neural Networks.
30 / 30