Sixth Italian Workshop on Machine Learning and Data Mining (MLDM)
Kernel-based non-parametric activation functions for neural networks
Authors: S. Scardapane, S. Van Vaerenbergh and A. Uncini
Table of contents
1 Introduction
2 Non-parametric activation functions
3 Proposed kernel activation functions
4 Experimental results
5 Conclusions and future work
Basic NN architecture
The basic layer of a neural network alternates a linear projection with a pointwise nonlinearity:

h_l = g_l(W_l h_{l-1} + b_l) ,  (1)
There is a huge literature on the linear component, e.g., initialization, compression, fast multiplication...
In most cases, the matrices {W_l, b_l} are the only adaptable components of the network.
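As a minimal sketch of the layer in Eq. (1) (the shapes and the tanh nonlinearity are illustrative choices, not from the slides):

```python
import numpy as np

# Sketch of the basic layer in Eq. (1): h_l = g_l(W_l h_{l-1} + b_l).
# Shapes and the tanh nonlinearity are illustrative choices, not from the slides.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))    # weight matrix W_l
b = np.zeros(4)                    # bias vector b_l
h_prev = rng.standard_normal(3)    # previous layer's output h_{l-1}

h = np.tanh(W @ h_prev + b)        # pointwise nonlinearity g_l on the linear projection
```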
What about the nonlinearity?
The choice of the nonlinearity is crucial:
• Having differentiable activations was the basic ingredient for back-propagation.
• In the last decade, ReLU functions g(s) = max(0, s) allowed to train deep NNs with hundreds of layers.
• Many recent papers propose new activation functions, e.g., the Swish function [1]:
g(s) = s · sigmoid(s) . (2)
Can we learn the activation functions?
[1] Ramachandran, P., Zoph, B. and Le, Q.V., 2017. Swish: a Self-Gated Activation Function. arXiv preprint arXiv:1710.05941.
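A quick sketch of the Swish function from Eq. (2), using the identity s · sigmoid(s) = s / (1 + exp(−s)):

```python
import numpy as np

# Sketch of the Swish function, Eq. (2): g(s) = s * sigmoid(s).
def swish(s):
    return s / (1.0 + np.exp(-s))  # equivalent to s * sigmoid(s)
```

Unlike ReLU, Swish is smooth everywhere and takes small negative values for negative inputs.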
Parametric activation functions
Making a single activation function parametric is relatively simple, e.g., we can add a learnable scale and bandwidth to a tanh:

g(s) = a (1 − exp{−bs}) / (1 + exp{−bs}) .  (3)
Or learn the slope for the negative part of the ReLU (PReLU):
g(s) = s if s ≥ 0, αs otherwise .  (4)
These parametric AFs have a small number of trainable parameters, but their flexibility is severely limited.
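A minimal PReLU sketch following Eq. (4); the initial value of α below is an illustrative assumption, not from the slides (in practice α is trained by backpropagation):

```python
import numpy as np

# Sketch of PReLU, Eq. (4): identity for s >= 0, learnable slope alpha for s < 0.
# alpha = 0.25 is an illustrative initial value; it would be adapted during training.
def prelu(s, alpha=0.25):
    return np.where(s >= 0, s, alpha * s)
```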
Table of contents
1 Introduction
2 Non-parametric activation functions
3 Proposed kernel activation functions
4 Experimental results
5 Conclusions and future work
Adaptive piecewise linear units
An APL nonlinearity is the sum of S linear segments:

g(s) = max{0, s} + ∑_{i=1}^{S} a_i max{0, −s + b_i} .  (5)
This is non-parametric because S is a user-defined hyper-parameter controlling the flexibility of the unit.
The APL introduces S + 1 points of non-differentiability for each neuron, which may hamper the optimization algorithm. Also, in practice, choosing S > 3 seems to have little effect on the resulting shapes.
[1] Agostinelli, F., Hoffman, M., Sadowski, P. and Baldi, P., 2014. Learning activation functions to improve deep neural networks. arXiv preprint arXiv:1412.6830.
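The APL unit of Eq. (5) can be sketched as follows; the hinge coefficients a_i, b_i below are arbitrary illustrative values (in a network they would be trained per neuron):

```python
import numpy as np

# Sketch of an APL unit, Eq. (5): a ReLU plus S learnable hinges.
# The hinge coefficients a_i, b_i below are arbitrary illustrative values.
def apl(s, a, b):
    out = np.maximum(0.0, s)                      # the max{0, s} term
    for a_i, b_i in zip(a, b):
        out = out + a_i * np.maximum(0.0, -s + b_i)
    return out

a = [0.5, -0.2]   # slopes a_i (S = 2 segments)
b = [0.0, 1.0]    # offsets b_i
```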
Spline activation functions
A SAF uses cubic interpolation over a set of adaptable control points:
[Figure: SAF output as a function of the activation value over [−2, 2]]
However, regularizing the control points is non-trivial, and SAFs cannot be easily accelerated on GPU.
[1] Vecci, L., Piazza, F. and Uncini, A., 1998. Learning and approximation capabilities of adaptive spline activation function neural networks. Neural Networks, 11(2), pp. 259-270.
Maxout neurons
A Maxout replaces an entire neuron by taking the maximum over K separate linear projections:

g(h) = max_{i=1,...,K} { w_i^T h + b_i } .  (6)
With two Maxout neurons, a NN with one hidden layer remains a universal approximator, provided K is sufficiently large.

However, it is impossible to plot the functions for K > 3, and the number of parameters can increase drastically with respect to K.
[1] Goodfellow, I.J., Warde-Farley, D., Mirza, M., Courville, A. and Bengio, Y.,2013. Maxout networks. Proc. 30th Int. Conf. on Machine Learning.
Visualization of a Maxout neuron
[Figure: Maxout activation as a function of a 1D input over [−2, 2]]
Table of contents
1 Introduction
2 Non-parametric activation functions
3 Proposed kernel activation functions
4 Experimental results
5 Conclusions and future work
Basic structure of the KAF
We model each activation function in terms of a kernel expansion over D terms as:

g(s) = ∑_{i=1}^{D} α_i κ(s, d_i) ,  (7)
where:
1 {α_i}_{i=1}^{D} are the mixing coefficients;
2 {d_i}_{i=1}^{D} are the dictionary elements;
3 κ(·, ·) : R × R → R is a 1D kernel function.
To make everything tractable, we only adapt the mixing coefficients, and for the dictionary we sample D values over the x-axis, uniformly around zero.
[1] Scardapane, S., Van Vaerenbergh, S. and Uncini, A., 2017. Kafnets: kernel-based non-parametric activation functions for neural networks. arXiv preprint arXiv:1707.04035.
Kernel selection
For our experiments, we use the 1D Gaussian kernel defined as:
κ(s, d_i) = exp{ −γ (s − d_i)² } ,  (8)
where γ ∈ R is called the kernel bandwidth. Based on some preliminary experiments, we use the following rule-of-thumb for selecting the bandwidth:
γ = 1 / (6Δ²) ,  (9)
where ∆ is the distance between the grid points.
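Putting Eqs. (7)-(9) together, a 1D KAF can be sketched as follows; the dictionary range [−3, 3] and D = 20 are illustrative assumptions, and only the mixing coefficients would be trained:

```python
import numpy as np

# Sketch of a 1D KAF combining Eqs. (7)-(9). The dictionary range [-3, 3] and
# D = 20 are illustrative assumptions; only alpha would be trained.
D = 20
d = np.linspace(-3.0, 3.0, D)                        # fixed dictionary on the x-axis
alpha = np.random.default_rng(0).standard_normal(D)  # trainable mixing coefficients
gamma = 1.0 / (6.0 * (d[1] - d[0]) ** 2)             # rule-of-thumb bandwidth, Eq. (9)

def kaf(s):
    s = np.atleast_1d(s)
    K = np.exp(-gamma * (s[:, None] - d[None, :]) ** 2)  # Gaussian kernel, Eq. (8)
    return K @ alpha                                     # kernel expansion, Eq. (7)
```

Because the kernel is Gaussian, the resulting function is smooth everywhere, in contrast to ReLU-like or APL units.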
Choosing the bandwidth
[Figure: three example KAFs for (a) γ = 2.0, (b) γ = 0.5, (c) γ = 0.1]

Figure 1: Examples of KAFs. In all cases we sample 20 points uniformly on the x-axis, while the mixing coefficients are sampled from a normal distribution. The three plots show three different choices for γ.
Initialization of the mixing coefficients
Other than initializing the mixing coefficients randomly, we can also approximate any initial function using kernel ridge regression (KRR):
α = (K + εI)⁻¹ t ,  (10)

where K ∈ R^{D×D} is the kernel matrix computed between the desired points t and the elements of the dictionary d.
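A sketch of this KRR initialization, Eq. (10), here fitting a tanh target; the dictionary range and size are illustrative assumptions:

```python
import numpy as np

# Sketch of KRR initialization, Eq. (10), fitting a tanh target.
# Dictionary range and size are illustrative assumptions.
D = 20
d = np.linspace(-3.0, 3.0, D)
gamma = 1.0 / (6.0 * (d[1] - d[0]) ** 2)             # bandwidth from Eq. (9)
eps = 1e-6                                           # small ridge term

K = np.exp(-gamma * (d[:, None] - d[None, :]) ** 2)  # D x D kernel matrix over d
t = np.tanh(d)                                       # desired activation values at d
alpha = np.linalg.solve(K + eps * np.eye(D), t)      # Eq. (10)
```

The KAF then starts out close to the chosen target function and is refined during training.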
Examples of initialization
[Figure: two KAFs initialized via KRR, (a) tanh, (b) ELU]

Figure 2: Two examples of initializing a KAF using KRR, with ε = 10⁻⁶. (a) A hyperbolic tangent. (b) The ELU function. The red dots indicate the corresponding initialized values for the mixing coefficients.
Multi-dimensional KAFs
We also consider a two-dimensional variant (2D-KAF), which acts on a pair of activation values:

g(s) = ∑_{i=1}^{D²} α_i κ(s, d_i) ,  (11)

where d_i is the i-th element of the dictionary, and we now have D² adaptable coefficients {α_i}_{i=1}^{D²} sampled over the plane.
In this case, we consider the 2D Gaussian kernel:
κ(s, d_i) = exp{ −γ ‖s − d_i‖₂² } .  (12)
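The 2D-KAF of Eqs. (11)-(12) can be sketched with a D × D dictionary grid over the plane; the grid range, D, and γ below are illustrative choices:

```python
import numpy as np

# Sketch of a 2D-KAF, Eqs. (11)-(12): a D x D dictionary grid over the plane,
# acting on a pair of activation values. Grid range, D, and gamma are illustrative.
D = 5
axis = np.linspace(-2.0, 2.0, D)
gx, gy = np.meshgrid(axis, axis)
dictionary = np.stack([gx.ravel(), gy.ravel()], axis=1)  # D^2 elements d_i
alpha = np.random.default_rng(0).standard_normal(D * D)  # D^2 mixing coefficients
gamma = 1.0

def kaf2d(s):
    diff = dictionary - np.asarray(s, dtype=float)   # (D^2, 2) differences s - d_i
    k = np.exp(-gamma * np.sum(diff ** 2, axis=1))   # 2D Gaussian kernel, Eq. (12)
    return float(k @ alpha)                          # expansion of Eq. (11)
```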
Advantages of the framework
1 Universal approximation properties.
2 Very simple to vectorize and to accelerate on GPUs.
3 Smooth over the entire domain.
4 Mixing coefficients can be regularized easily, including the use of sparse penalties.
Table of contents
1 Introduction
2 Non-parametric activation functions
3 Proposed kernel activation functions
4 Experimental results
5 Conclusions and future work
Visualizing the functions
[Figure: six trained KAF shapes, panels (a)-(f)]

Figure 3: Examples of 6 trained KAFs (with random initialization) on the Sensorless dataset. On the y-axis we plot the output value of the KAF. The KAF after initialization is shown with a dashed red line, while the final KAF is shown with a solid green line. The distribution of activation values is shown as a reference in light blue.
Results on the SUSY benchmark
| Activation function | Testing AUC | Trainable parameters |
| --- | --- | --- |
| ReLU (five hidden layers) | 0.8739 (0.001) | 367201 |
| ELU (five hidden layers) | 0.8739 (0.001) | 367201 |
| SELU (five hidden layers) | 0.8745 (0.002) | 367201 |
| PReLU (five hidden layers) | 0.8748 (0.001) | 368701 |
| Maxout (one layer) | 0.8744 (0.001) | 17401 |
| Maxout (two layers) | 0.8744 (0.002) | 288301 |
| APL (one layer) | 0.8744 (0.002) | 7801 |
| APL (two layers) | 0.8757 (0.002) | 99901 |
| KAF (one layer) | 0.8756 (0.001) | 12001 |
| KAF (two layers) | 0.8758 (0.001) | 108301 |

Table 1: Results on the SUSY benchmark.
Table of contents
1 Introduction
2 Non-parametric activation functions
3 Proposed kernel activation functions
4 Experimental results
5 Conclusions and future work
Conclusions and future work
1 We proposed a novel family of non-parametric activation functions, framed as a kernel expansion of their input value.
2 KAFs combine several advantages of previous approaches, without introducing an excessive number of additional parameters.
3 Networks trained with these activations can obtain a higher accuracy while being significantly smaller.
4 Alternative choices for the kernel expansion are possible, e.g., dictionary selection strategies, alternative kernels (e.g., periodic kernels), and several others.
5 The framework provides a further link between neural networks and kernel methods, opening up a large number of variations with respect to our initial approach.
Thanks for your attention. Questions?