Introduction to Neural Networks
E. Scornet
January 2020
E. Scornet Deep Learning January 2020 1 / 118
1 Introduction
2 History of Neural Networks
   Perceptron
   Multilayer Perceptron - Backpropagation algorithm
3 Hyperparameters
   Activation functions
   Output units
   Loss functions
   Weight initialization
4 Regularization
   Penalization
   Dropout
   Batch normalization
   Early stopping
5 All in all
Supervised Learning
Supervised Learning Framework
- Input measurement X ∈ 𝒳
- Output measurement Y ∈ 𝒴
- (X, Y) ∼ P with P unknown
- Training data: Dn = {(X1, Y1), ..., (Xn, Yn)} (i.i.d. ∼ P)

Often:
- X ∈ R^d and Y ∈ {−1, 1} (classification),
- or X ∈ R^d and Y ∈ R (regression).

A predictor is a function in F = {f : 𝒳 → 𝒴 measurable}.

Goal
Construct a good predictor f from the training data.

We need to specify the meaning of "good". Classification and regression are almost the same problem!
Loss and Probabilistic Framework
Loss function for a generic predictor
The loss function ℓ(Y, f(X)) measures the goodness of the prediction of Y by f(X).
Examples:
- Prediction loss: ℓ(Y, f(X)) = 1_{Y ≠ f(X)}
- Quadratic loss: ℓ(Y, f(X)) = |Y − f(X)|²

Risk function
The risk is the average loss for a new couple:
R(f) = E_{(X,Y)∼P}[ℓ(Y, f(X))]
Examples:
- Prediction loss: E[ℓ(Y, f(X))] = P{Y ≠ f(X)}
- Quadratic loss: E[ℓ(Y, f(X))] = E[|Y − f(X)|²]

Beware: as f depends on Dn, R(f) is a random variable!
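As a concrete illustration of these definitions, here is a short NumPy sketch, with a hypothetical sample and a fixed predictor, computing the empirical average of the two example losses:

```python
import numpy as np

# Hypothetical sample of labels in {-1, +1} and the predictions of a fixed f.
y = np.array([1, -1, 1, 1, -1])
f_x = np.array([1, 1, 1, -1, -1])

# Prediction (0-1) loss: l(Y, f(X)) = 1_{Y != f(X)}
zero_one = (y != f_x).astype(float)

# Quadratic loss: l(Y, f(X)) = |Y - f(X)|^2
quadratic = (y - f_x) ** 2

# Averaging over the sample gives an empirical estimate of the risk R(f).
print(zero_one.mean(), quadratic.mean())   # 0.4 1.6
```

Note that for ±1 labels the quadratic loss is exactly four times the 0-1 loss, as a wrong prediction gives |Y − f(X)|² = 4.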
Supervised Learning
Experience, Task and Performance measure
- Training data: Dn = {(X1, Y1), ..., (Xn, Yn)} (i.i.d. ∼ P)
- Predictor: f : 𝒳 → 𝒴 measurable
- Cost/Loss function: ℓ(Y, f(X)) measures how well f(X) "predicts" Y
- Risk: R(f) = E[ℓ(Y, f(X))] = E_X[E_{Y|X}[ℓ(Y, f(X))]]
- Often ℓ(Y, f(X)) = |f(X) − Y|² or ℓ(Y, f(X)) = 1_{Y ≠ f(X)}

Goal
Learn a rule to construct a predictor f ∈ F from the training data Dn such that the risk R(f) is small on average or with high probability with respect to Dn.
Best Solution
The best solution f∗ (which is independent of Dn) is

f∗ = argmin_{f∈F} R(f) = argmin_{f∈F} E[ℓ(Y, f(X))].

Bayes Predictor (explicit solution)
- In binary classification with the 0−1 loss:
  f∗(X) = +1 if P{Y = +1|X} ≥ P{Y = −1|X}, i.e. if P{Y = +1|X} ≥ 1/2, and −1 otherwise.
- In regression with the quadratic loss:
  f∗(X) = E[Y|X].

Issue: this solution requires knowing E[Y|X] for all values of X!
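To make the Bayes predictor concrete, here is a minimal sketch in an illustrative setting where η(x) = P(Y = +1 | X = x) is known in closed form (a logistic posterior, chosen purely for the example), so f∗ can be written down explicitly:

```python
import numpy as np

# Hypothetical posterior eta(x) = P(Y = +1 | X = x), assumed known for illustration.
def eta(x):
    return 1.0 / (1.0 + np.exp(-2.0 * x))

def bayes_predictor(x):
    # f*(x) = +1 if P(Y = +1 | X = x) >= 1/2, and -1 otherwise
    return np.where(eta(x) >= 0.5, 1, -1)

preds = bayes_predictor(np.array([-1.0, 0.0, 2.0]))
print(preds)   # [-1  1  1]
```

In practice η is unknown, which is precisely why one learns a predictor from Dn instead.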
Examples
Spam detection (Text classification)
Data: email collection
Input: email
Output: Spam or No Spam
Examples
Face Detection
Data: annotated database of images
Input: sub-window in the image
Output: presence or absence of a face
Examples
Number Recognition
Data: annotated database of images (each image is represented by a vector of 28 × 28 = 784 pixel intensities)
Input: image
Output: corresponding number
Machine Learning
A definition by Tom Mitchell (http://www.cs.cmu.edu/~tom/)
"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."
Unsupervised Learning
Experience, Task and Performance measure
- Training data: D = {X1, ..., Xn} (i.i.d. ∼ P)
- Task: ???
- Performance measure: ???

No obvious task definition!

Tasks for this lecture
- Clustering (or unsupervised classification): construct a grouping of the data into homogeneous classes.
- Dimension reduction: construct a map of the data in a low-dimensional space without distorting it too much.
Marketing
Data: base of customer data containing their properties and past buying records

Goal: use the customers' similarities to find groups.

Two directions:
- Clustering: propose an explicit grouping of the customers.
- Visualization: propose a representation of the customers so that the groups are visible.
What is a neuron?
This is a real neuron!
Figure: Real Neuron - diagram
Figure: Real Neuron
The idea of neural networks began, unsurprisingly, as a model of how neurons in the brain function, termed "connectionism", which used connected circuits to simulate intelligent behaviour.
McCulloch and Pitts neuron - 1943
In 1943, the neuron was first portrayed as a simple electrical circuit by neurophysiologist Warren McCulloch and mathematician Walter Pitts.

A McCulloch-Pitts neuron takes binary inputs, computes a weighted sum, and returns 0 if the result is below a threshold and 1 otherwise.
Figure: [“A logical calculus of the ideas immanent in nervous activity”, McCulloch and Pitts 1943]
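A McCulloch-Pitts unit is simple enough to sketch in a few lines; the weights and thresholds below are illustrative choices (not from the original paper) that make the unit compute the logical AND and OR of its two inputs:

```python
import numpy as np

# A McCulloch-Pitts unit: binary inputs, a weighted sum, and a hard threshold.
def mcp_neuron(x, w, threshold):
    return int(np.dot(w, x) >= threshold)

# With weights (1, 1), threshold 2 yields AND while threshold 1 yields OR.
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, mcp_neuron(x, [1, 1], 2), mcp_neuron(x, [1, 1], 1))
```

The key limitation, which the perceptron addresses next, is that these weights are fixed by hand rather than learned.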
Donald Hebb took the idea further by proposing that neural pathways strengthen with each successive use, especially between neurons that tend to fire at the same time.
[The Organization of Behavior: A Neuropsychological Theory, Hebb 1949]
Perceptron - 1958
In the late 50s, Frank Rosenblatt, a psychologist at Cornell, worked on decision systems present in the eye of a fly, which determine its flee response.

In 1958, he proposed the idea of a perceptron, calling it the Mark I Perceptron. It was a system with a simple input-output relationship, modelled on a McCulloch-Pitts neuron.
[“Perceptron simulation experiments”, Rosenblatt 1960]
Perceptron diagram
Figure: True representation of perceptron
The connections between the input and the first hidden layer cannot be optimized!
Perceptron Machine
First implementation: the Mark I Perceptron (1958).
- The machine was connected to a camera (20 × 20 photocells, a 400-pixel image).
- Patchboard: allowed experimentation with different combinations of input features.
- Potentiometers: implemented the adaptive weights.
Parameters of the perceptron
Activation function: σ(z) = 1_{z>0}

Parameters:
- Weights w = (w1, ..., wd) ∈ R^d
- Bias: b ∈ R

The output is given by f_{(w,b)}(x) = σ(⟨w, x⟩ + b).

How do we estimate (w, b)?
The Perceptron Algorithm
To ease notation, we set w = (w1, ..., wd, b) and xi = (xi, 1).

Perceptron Algorithm - the first (iterative) learning algorithm
- Start with w = 0.
- Repeat over all samples:
  - if yi⟨w, xi⟩ < 0, modify w into w + yi xi;
  - otherwise, do not modify w.

Exercise
- What is the rationale behind this procedure?
- Is this procedure related to a gradient descent method?

Gradient descent:
1. Start with w = w0.
2. Update w ← w − η∇L(w), where L is the loss to be minimized.
3. Stop when w does not vary too much.
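The update rule above can be sketched directly in NumPy. This is a minimal implementation on hypothetical separable data; one practical tweak (flagged in the comment) is to update on yi⟨w, xi⟩ ≤ 0 rather than < 0, so that the all-zero starting vector gets updated:

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Perceptron algorithm on augmented vectors w = (w, b), x_i = (x_i, 1)."""
    Xa = np.hstack([X, np.ones((len(X), 1))])   # constant feature carries the bias
    w = np.zeros(Xa.shape[1])
    for _ in range(max_epochs):
        updated = False
        for xi, yi in zip(Xa, y):
            # Update on y_i <w, x_i> <= 0 (not < 0) so the zero start is updated.
            if yi * np.dot(w, xi) <= 0:
                w += yi * xi
                updated = True
        if not updated:          # a full pass with no mistake: converged
            break
    return w

# Hypothetical linearly separable toy data.
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -1.0], [-2.0, 1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
preds = np.sign(np.hstack([X, np.ones((4, 1))]) @ w)
print(preds)
```

On separable data the loop exits with every sample strictly on the correct side, so the predictions match the labels.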
Solution
The perceptron algorithm can be seen as a stochastic gradient descent.

Perceptron Algorithm
Repeat over all samples:
- if yi⟨w, xi⟩ < 0, modify w into w + yi xi;
- otherwise, do not modify w.

A sample is misclassified if yi⟨w, xi⟩ < 0. Thus we want to minimize the loss

L(w) = − Σ_{i∈Mw} yi⟨w, xi⟩,

where Mw is the set of indices misclassified by the hyperplane w.

Stochastic Gradient Descent:
1. Select randomly i ∈ Mw.
2. Update w ← w − η∇Li(w) = w + η yi xi (since ∇Li(w) = −yi xi).

The perceptron algorithm corresponds to η = 1.
Exercise
Figure: From http://image.diku.dk/kstensbo/notes/perceptron.pdf

Let R = max_i ‖xi‖. Let w⋆ be the optimal hyperplane of margin

γ = min_i yi⟨w⋆, xi⟩, with ‖w⋆‖ = 1.

Theorem (Block 1962; Novikoff 1963)
Assume that the training set Dn = {(x1, y1), ..., (xn, yn)} is linearly separable (γ > 0). Start with w0 = 0. Then the number of updates k of the perceptron algorithm is bounded by

k ≤ (1 + R²)/γ².

Exercise: Prove it!
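Before proving the bound, it can be checked empirically. The sketch below generates hypothetical separable data from a known unit-norm hyperplane w_star, discards points too close to it to enforce a margin, and counts perceptron updates until convergence:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical separable data: labels given by a known unit-norm hyperplane w_star.
w_star = np.array([0.6, 0.8, 0.0])              # ||w_star|| = 1, zero bias
X = rng.uniform(-1.0, 1.0, size=(200, 2))
Xa = np.hstack([X, np.ones((200, 1))])          # augmented inputs (x_i, 1)
m = Xa @ w_star
keep = np.abs(m) > 0.2                          # enforce a margin of at least 0.2
Xa, y = Xa[keep], np.sign(m[keep])

R2 = (np.linalg.norm(X[keep], axis=1) ** 2).max()   # R^2 over the raw inputs
gamma = (y * (Xa @ w_star)).min()                   # margin of w_star

# Run the perceptron and count updates until no sample is misclassified.
w, updates, changed = np.zeros(3), 0, True
while changed:
    changed = False
    for xi, yi in zip(Xa, y):
        if yi * np.dot(w, xi) <= 0:
            w += yi * xi
            updates += 1
            changed = True

print(updates, (1 + R2) / gamma ** 2)   # update count vs the Novikoff bound
```

The observed number of updates stays below (1 + R²)/γ², in line with the theorem.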
Solution
Let wk denote the weight vector after k updates, with w0 = 0, and suppose the k-th update is made on a sample i. By construction,

⟨w⋆, wk⟩ = ⟨w⋆, wk−1⟩ + yi⟨w⋆, xi⟩ ≥ ⟨w⋆, wk−1⟩ + γ ≥ · · · ≥ ⟨w⋆, w0⟩ + kγ = kγ.

Moreover, since the update is made on a misclassified sample (yi⟨wk−1, xi⟩ ≤ 0) and since the augmented vectors satisfy ‖xi‖² ≤ R² + 1,

‖wk‖² = ‖wk−1‖² + ‖xi‖² + 2yi⟨wk−1, xi⟩ ≤ ‖wk−1‖² + R² + 1 ≤ · · · ≤ k(1 + R²).

According to the Cauchy-Schwarz inequality, since ‖w⋆‖ = 1,

kγ ≤ ⟨w⋆, wk⟩ ≤ ‖w⋆‖ ‖wk‖ = ‖wk‖ ≤ √(k(1 + R²)),

and therefore

k ≤ (1 + R²)/γ².
Perceptron - Summary and drawbacks
Perceptron algorithm
- We have a data set Dn = {(xi, yi), i = 1, ..., n}.
- We use the perceptron algorithm to learn the weight vector w and the bias b.
- We predict using f_{(w,b)}(x) = 1_{⟨w,x⟩+b>0}.

Limitations
- The decision boundary is linear: the model is too simple.
- The perceptron algorithm does not converge if the data are not linearly separable; in this case, the algorithm must not be used.
- In practice, we do not know whether the data are linearly separable... so the perceptron should never be used!
Perceptron will make machines intelligent... or not!
The Perceptron project led by Rosenblatt was funded by the US Office of Naval Research.
"The Navy revealed the embryo of an electronic computer today that it expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence. Later perceptrons will be able to recognize people and call out their names and instantly translate speech in one language to speech and writing in another language, it was predicted."

Press conference, 7 July 1958, New York Times.
For an extensive study of the perceptron, see [Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms, Rosenblatt 1961].
AI winter
In 1969, Minsky and Papert showed that it was difficult for a perceptron to
- detect parity (the number of activated pixels),
- detect connectedness (are the pixels connected?),
- represent a simple non-linear function like XOR.

"There is no reason to suppose that any of [the virtues of perceptrons] carry over to the many-layered version. Nevertheless, we consider it to be an important research problem to elucidate (or reject) our intuitive judgement that the extension is sterile. Perhaps some powerful convergence theorem will be discovered, or some profound reason for the failure to produce an interesting 'learning theorem' for the multilayered machine will be found."
[“Perceptrons.”, Minsky and Papert 1969]

This book is the starting point of the period known as the "AI winter", a significant decline in funding of neural network research.
Controversy between Rosenblatt, Minsky, Papert:[“A sociological study of the official history of the perceptrons controversy”, Olazaran 1996]
AI Winter: the XOR function
Exercise. Consider two binary variables x1 ∈ {0, 1} and x2 ∈ {0, 1}. The logical function AND applied to (x1, x2) is defined as

AND : {0, 1}² → {0, 1}
(x1, x2) ↦ 1 if x1 = x2 = 1, and 0 otherwise.

1) Find a perceptron (i.e., weights and a bias) that implements the AND function.

The logical function XOR applied to (x1, x2) is defined as

XOR : {0, 1}² → {0, 1}
(x1, x2) ↦ 0 if x1 = x2 = 0 or x1 = x2 = 1, and 1 otherwise.

2) Prove that no perceptron can implement the XOR function.
3) Find a neural network with one hidden layer that implements the XOR function.
Solution
1) The following perceptron, with w1 = w2 = 1 and b = −1.5, implements the AND function: its output is σ(w1 x1 + w2 x2 + b) = 1_{x1 + x2 > 1.5}, which equals 1 exactly when x1 = x2 = 1.

Figure: perceptron diagram — input layer (x1, x2, bias b), output layer computing σ(w1 x1 + w2 x2 + b).

2) The perceptron builds a hyperplane and predicts according to which side of the hyperplane the observation falls on. Unfortunately, the XOR function cannot be represented by a hyperplane in the original space {0, 1}²: no line can separate {(0, 1), (1, 0)} from {(0, 0), (1, 1)}.
Solution
3) The following network implements the XOR function, with first-layer weights

w(1)_{1,1} = −1, w(1)_{1,2} = −1, b(1)_1 = 0.5,
w(1)_{2,1} = 1, w(1)_{2,2} = 1, b(1)_2 = −1.5,

and output-layer weights

w(2)_{1,1} = −1, w(2)_{1,2} = −1, b(2)_1 = 0.5.
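The weights above can be verified numerically; this short sketch evaluates the two-layer network on all four binary inputs:

```python
import numpy as np

step = lambda z: (z > 0).astype(float)    # activation sigma(z) = 1_{z > 0}

# Weights and biases from the solution above.
W1 = np.array([[-1.0, -1.0],
               [ 1.0,  1.0]])
b1 = np.array([0.5, -1.5])
W2 = np.array([[-1.0, -1.0]])
b2 = np.array([0.5])

def network(x):
    a1 = step(W1 @ x + b1)        # hidden layer: computes NOR and AND of the inputs
    return step(W2 @ a1 + b2)[0]  # output: 1 iff neither NOR nor AND fires

table = {x: int(network(np.array(x, dtype=float)))
         for x in [(0, 0), (0, 1), (1, 0), (1, 1)]}
print(table)   # {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
```

The hidden units compute NOR and AND; the output fires exactly when neither does, which is XOR.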
Solution
Figure: network diagram for the XOR solution — input layer (x1, x2), hidden layer L1 with biases b(1)_1 and b(1)_2, output layer with bias b(2)_1.
Moving forward - ADALINE, MADALINE
In 1959 at Stanford, Bernard Widrow and Marcian Hoff developed ADALINE (ADAptive LINear Elements) and MADALINE (Multiple ADALINE), the latter being the first network successfully applied to a real-world problem.
[Adaptive switching circuits, Bernard Widrow and Hoff 1960]

- Loss: square difference between a weighted sum of the inputs and the output.
- Optimization procedure: plain gradient descent.
MADALINE
Many Adalines: a network with one hidden layer composed of many Adaline units.
[“Madaline Rule II: a training algorithm for neural networks”, Winter and Widrow 1988]

Applications:
- Speech and pattern recognition [“Real-Time Adaptive Speech-Recognition System”, Talbert et al. 1963]
- Weather forecasting [“Application of the adaline system to weather forecasting”, Hu 1964]
- Adaptive filtering and adaptive signal processing [“Adaptive signal processing”, Bernard and Samuel 1985]
Neural network with one hidden layer
Generic notations:
- W(ℓ)_{i,j}: weight between the j-th neuron of layer ℓ−1 and the i-th neuron of layer ℓ.
- b(ℓ)_j: bias of the j-th neuron of layer ℓ.
- a(ℓ)_j: output of the j-th neuron of layer ℓ.
- z(ℓ)_j: input of the j-th neuron of layer ℓ, such that a(ℓ)_j = σ(z(ℓ)_j).
How to find weights and bias?
The perceptron algorithm does not work anymore!
Gradient Descent Algorithm
The prediction of the network is given by fθ(x).

Empirical risk minimization:

argmin_θ (1/n) Σ_{i=1}^n ℓ(Yi, fθ(Xi)) ≡ argmin_θ (1/n) Σ_{i=1}^n ℓi(θ)

(Stochastic) gradient descent rule:
While |θt − θt−1| ≥ ε do
- Sample It ⊂ {1, ..., n}
- θt+1 = θt − η (1/|It|) Σ_{i∈It} ∇θ ℓi(θt)

How to compute ∇θℓi efficiently?
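The minibatch update above can be sketched on a toy problem. Here the per-sample loss is the hypothetical quadratic ℓi(θ) = 0.5 (θ − ci)², standing in for ℓ(Yi, fθ(Xi)), whose empirical risk is minimized at the mean of the ci:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy per-sample losses l_i(theta) = 0.5 * (theta - c_i)^2 (illustrative stand-in).
c = rng.normal(loc=3.0, scale=1.0, size=1000)

def grad_li(theta, i):
    return theta - c[i]

theta, eta = 0.0, 0.1
for t in range(500):
    batch = rng.choice(len(c), size=32, replace=False)  # sample the minibatch I_t
    g = np.mean([grad_li(theta, i) for i in batch])     # (1/|I_t|) sum of gradients
    theta -= eta * g                                    # theta_{t+1} = theta_t - eta * g

print(theta)   # fluctuates around c.mean()
```

For a neural network, the only hard part is computing each ∇θℓi(θ); that is exactly what backpropagation provides.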
Backprop Algorithm

A Clever Gradient Descent Implementation
- Popularized by Rumelhart, McClelland and Hinton in 1986.
- Can be traced back to Werbos in 1974.
- Nothing but the chain rule, with a touch of dynamic programming.

- Key ingredient to make neural networks work!
- Still at the core of deep learning algorithms.
Backpropagation equations

Neural network with L layers, vector output, and quadratic cost C = (1/2)‖y − a^(L)‖². By definition,

δ_j^(ℓ) = ∂C/∂z_j^(ℓ).

The four fundamental equations of backpropagation are given by

δ^(L) = ∇_a C ⊙ σ′(z^(L)),
δ^(ℓ) = ((w^(ℓ+1))^T δ^(ℓ+1)) ⊙ σ′(z^(ℓ)),
∂C/∂b_j^(ℓ) = δ_j^(ℓ),
∂C/∂w_{j,k}^(ℓ) = a_k^(ℓ−1) δ_j^(ℓ).
Proof

We start with the first equality

δ^(L) = ∇_a C ⊙ σ′(z^(L)).

Applying the chain rule gives

δ_j^(L) = ∑_k (∂C/∂a_k^(L)) (∂a_k^(L)/∂z_j^(L)),

where z_j^(L) is the input of neuron j of layer L. Since the activation a_k^(L) depends only on the input z_k^(L), only the term k = j remains:

δ_j^(L) = (∂C/∂a_j^(L)) (∂a_j^(L)/∂z_j^(L)).

Besides, since a_j^(L) = σ(z_j^(L)), we have

δ_j^(L) = (∂C/∂a_j^(L)) σ′(z_j^(L)),

which is the componentwise version of the first equality.
Proof

Now, we want to prove

δ^(ℓ) = ((w^(ℓ+1))^T δ^(ℓ+1)) ⊙ σ′(z^(ℓ)).

Again, using the chain rule,

δ_j^(ℓ) = ∂C/∂z_j^(ℓ) = ∑_k (∂C/∂z_k^(ℓ+1)) (∂z_k^(ℓ+1)/∂z_j^(ℓ)) = ∑_k δ_k^(ℓ+1) (∂z_k^(ℓ+1)/∂z_j^(ℓ)).

Recalling that z_k^(ℓ+1) = ∑_j w_{kj}^(ℓ+1) σ(z_j^(ℓ)) + b_k^(ℓ+1), we get

δ_j^(ℓ) = ∑_k δ_k^(ℓ+1) w_{kj}^(ℓ+1) σ′(z_j^(ℓ)),

which concludes the proof.
Proof

Now, we want to prove

∂C/∂b_j^(ℓ) = δ_j^(ℓ).

Using the chain rule,

∂C/∂b_j^(ℓ) = ∑_k (∂C/∂z_k^(ℓ)) (∂z_k^(ℓ)/∂b_j^(ℓ)).

However, only z_j^(ℓ) depends on b_j^(ℓ), and z_j^(ℓ) = ∑_k w_{jk}^(ℓ) σ(z_k^(ℓ−1)) + b_j^(ℓ). Therefore,

∂C/∂b_j^(ℓ) = δ_j^(ℓ).
Proof

Finally, we want to prove

∂C/∂w_{j,k}^(ℓ) = a_k^(ℓ−1) δ_j^(ℓ).

By the chain rule,

∂C/∂w_{j,k}^(ℓ) = ∑_m (∂C/∂z_m^(ℓ)) (∂z_m^(ℓ)/∂w_{j,k}^(ℓ)).

Since z_m^(ℓ) = ∑_k w_{mk}^(ℓ) σ(z_k^(ℓ−1)) + b_m^(ℓ), we have

∂z_m^(ℓ)/∂w_{j,k}^(ℓ) = σ(z_k^(ℓ−1)) 1_{m=j}.

Consequently,

∂C/∂w_{j,k}^(ℓ) = δ_j^(ℓ) σ(z_k^(ℓ−1)) = a_k^(ℓ−1) δ_j^(ℓ).
Backpropagation Algorithm

Let

δ_j^(ℓ) = ∂C/∂z_j^(ℓ),

where z_j^(ℓ) is the input of neuron j of layer ℓ.

Backpropagation Algorithm
- Initialize the weights and biases in the network randomly.
- For each training sample x_i:
  1. Feedforward: let x_i go through the network and store, for each neuron, the value of the activation function and of its derivative.
  2. Output error: compute the neural network error for x_i.
  3. Backpropagation: compute recursively the vectors δ^(ℓ), from ℓ = L down to ℓ = 1.
- Update the weights and biases using the backpropagation equations.
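The algorithm above can be implemented directly from the four backpropagation equations. Here is a minimal sketch for a fully-connected network with sigmoid units and quadratic cost; the layer sizes are arbitrary illustrative choices:

```python
import numpy as np

# Two weight layers, sigmoid activations, cost C = 0.5 * ||y - a^(L)||^2.
rng = np.random.default_rng(0)
sizes = [3, 4, 2]                       # input, hidden and output widths
W = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
b = [rng.normal(size=m) for m in sizes[1:]]
sigma = lambda z: 1.0 / (1.0 + np.exp(-z))
sigma_prime = lambda z: sigma(z) * (1.0 - sigma(z))

def forward(x):
    """Feedforward pass: store the inputs z and activations a of each layer."""
    a, zs, activations = x, [], [x]
    for Wl, bl in zip(W, b):
        z = Wl @ a + bl
        zs.append(z)
        a = sigma(z)
        activations.append(a)
    return zs, activations

def backprop(x, y):
    """Gradients dC/dW and dC/db via the four backpropagation equations."""
    zs, acts = forward(x)
    delta = (acts[-1] - y) * sigma_prime(zs[-1])        # delta^(L)
    dW = [None] * len(W)
    db = [None] * len(b)
    for l in range(len(W) - 1, -1, -1):
        db[l] = delta                                   # dC/db_j = delta_j
        dW[l] = np.outer(delta, acts[l])                # dC/dw_jk = a_k delta_j
        if l > 0:                                       # propagate backwards
            delta = (W[l].T @ delta) * sigma_prime(zs[l - 1])
    return dW, db
```

A finite-difference check on a single weight is the standard way to validate such an implementation.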
Neural Network terminology

Epoch: one forward pass and one backward pass over all training examples.

(Mini-)batch size: number of training examples in one forward/backward pass. The higher the batch size, the more memory you will need.

Number of iterations: number of passes, each pass using [batch size] examples. To be clear, one pass = one forward pass + one backward pass (the forward pass and backward pass are not counted as two different passes).

For example, with 1000 training examples and a batch size of 500, it takes 2 iterations to complete 1 epoch.
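The bookkeeping above fits in one line; `math.ceil` handles a possible last, smaller batch:

```python
import math

# Iterations needed to complete one epoch for a given batch size.
n_examples, batch_size = 1000, 500
iterations_per_epoch = math.ceil(n_examples / batch_size)
```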
Sigmoid

[Plot: sigmoid function on [−5, 5], values in [0, 1].]

Sigmoid function

x ↦ exp(x) / (1 + exp(x))

Problems:
1. The sigmoid is not a zero-centered function → need for rescaling the data.
2. Saturated function: it kills gradients.
3. Plus: exp is a bit computationally expensive.
Tanh

[Plot: tanh function on [−5, 5], values in [−1, 1].]

Hyperbolic tangent function

x ↦ (exp(x) − exp(−x)) / (exp(x) + exp(−x))

Problems:
1. Tanh is a zero-centered function → no need for rescaling the data.
2. Saturated function: it kills gradients.
3. Plus: exp is a bit computationally expensive.

Note: tanh(x) = 2σ(2x) − 1.
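The identity tanh(x) = 2σ(2x) − 1 noted above is easy to verify numerically (a quick check, not from the slides):

```python
import numpy as np

# tanh is a rescaled, shifted sigmoid: tanh(x) = 2 * sigma(2x) - 1.
sigma = lambda x: np.exp(x) / (1 + np.exp(x))
x = np.linspace(-5, 5, 101)
lhs = np.tanh(x)
rhs = 2 * sigma(2 * x) - 1
```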
Rectified Linear Unit

[Plot: ReLU on [−5, 5].]

ReLU / positive part

x ↦ max(0, x)

Properties:
1. Not a saturated function.
2. Computationally efficient.
3. Empirically, convergence is faster than with sigmoid/tanh.
4. Plus: biologically plausible.
A little bit more on ReLU

Popularized by [“Imagenet classification with deep convolutional neural networks”, Krizhevsky et al. 2012] in AlexNet.

Problems:
- Not zero-centered.
- Gradient is null if x < 0.
- If the weights are not properly initialized, the ReLU output can be stuck at zero.

Usually an initial bias is set for ReLUs so that they fire up: useful or not? Mystery...

Related to biology [“Deep sparse rectifier neural networks”, Glorot, Bordes, et al. 2011]:
- Most of the time, neurons are inactive.
- When they activate, their activation is proportional to their input.
Leaky ReLU / Parametric ReLU / Absolute value rectification

[Plot: leaky ReLU on [−5, 5].]

x ↦ max(αx, x)

- Leaky ReLU: α = 0.1 [“Rectifier nonlinearities improve neural network acoustic models”, Maas et al. 2013]
- Absolute Value Rectification: α = −1 [“What is the best multi-stage architecture for object recognition?”, Jarrett et al. 2009]
- Parametric ReLU: α optimized during backpropagation; the activation function is learned. [“Empirical evaluation of rectified activations in convolutional network”, Xu et al. 2015]
ELU

[Plot: ELU on [−5, 5].]

Exponential Linear Unit

x ↦ x if x ≥ 0, α(exp(x) − 1) otherwise

- Negative saturation regime, closer to zero-mean outputs.
- α is set to 1.0.
- Robustness to noise.

[“Fast and accurate deep network learning by exponential linear units (elus)”, Clevert et al. 2015]
Maxout

[Plot: maxout of two affine functions on [−5, 5].]

x ↦ max(w₁x + b₁, w₂x + b₂)

- Linear regime, not saturating, not dying.
- Number of parameters multiplied by 2 (resp. by k for k linear pieces).
- Learns piecewise linear functions (up to k pieces). [“Maxout networks”, Goodfellow, Warde-Farley, et al. 2013] [“Deep maxout neural networks for speech recognition”, Cai et al. 2013]
- Resists catastrophic forgetting. [“An empirical investigation of catastrophic forgetting in gradient-based neural networks”, Goodfellow, Mirza, et al. 2013]
Conclusion on activation functions

- Use ReLU.
- Test Leaky ReLU, maxout, ELU.
- Try out tanh, but do not expect too much.
- Do not use the sigmoid.
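The activations compared in this section can be written as numpy one-liners. A sketch, using the α values from the slides (0.1 for leaky ReLU, 1.0 for ELU) and a hypothetical scalar `maxout` with two pieces:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.1):
    # alpha = -1 gives absolute value rectification.
    return np.maximum(alpha * x, x)

def elu(x, alpha=1.0):
    # Linear for x >= 0, saturates to -alpha for very negative x.
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1))

def maxout(x, w1, b1, w2, b2):
    # Max of two affine pieces; ReLU is the case w1=0, b1=0, w2=1, b2=0.
    return np.maximum(w1 * x + b1, w2 * x + b2)
```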
Output units

Linear output unit:

ŷ = W^T h + b

→ Linear regression based on the new variables h.

Sigmoid output unit, used to predict {0, 1} outputs:

P(Y = 1 | h) = σ(W^T h + b), where σ(t) = e^t / (1 + e^t).

→ Logistic regression based on the new variables h.

Softmax output unit, used to predict {1, . . . , K} outputs:

softmax(z)_i = e^{z_i} / ∑_{k=1}^K e^{z_k},

where each z_i is the activation of one neuron of the previous layer, given by z_i = W_i^T h + b_i.

→ Multinomial logistic regression based on the new variables h.
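A numerically stable softmax subtracts max(z) before exponentiating, which leaves the result unchanged; with K = 2 it reduces to the sigmoid, matching the logistic-regression case above (a sketch, not from the slides):

```python
import numpy as np

# Stable softmax: shifting z by max(z) does not change the output
# but prevents overflow in np.exp.
def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()
```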
Multinomial logistic regression

Generalization of logistic regression to multiclass outputs: for all 1 ≤ k ≤ K,

log( P[Y_i = k] / Z ) = β_k X_i.

Hence, for all 1 ≤ k ≤ K,

P[Y_i = k] = Z e^{β_k X_i},

where

Z = 1 / ∑_{k=1}^K e^{β_k X_i}.

Thus,

P[Y_i = k] = e^{β_k X_i} / ∑_{ℓ=1}^K e^{β_ℓ X_i}.
Biology bonus

Softmax, used with cross-entropy:

−log P(Y = k | z) = −log softmax(z)_k
                  = −z_k + log( ∑_j exp(z_j) )
                  ≈ −z_k + max_j z_j
                  ≈ 0 when z_k is the largest activation:

no contribution to the cost when softmax(z)_y is maximal.

Lateral inhibition: believed to exist between nearby neurons in the cortex. When the difference between the max and the others is large, the winner takes all: one neuron is set to 1 and the others go to zero.

More complex models: conditional Gaussian mixture, when y is multimodal [“On supervised learning from sequential data with applications for speech recognition”; “Generating sequences with recurrent neural networks”, Schuster 1999; Graves 2013].
Cost functions

Mean Squared Error (MSE):

(1/n) ∑_{i=1}^n ℓ(Y_i, f_θ(X_i)) = (1/n) ∑_{i=1}^n (Y_i − f_θ(X_i))²

Mean Absolute Error:

(1/n) ∑_{i=1}^n ℓ(Y_i, f_θ(X_i)) = (1/n) ∑_{i=1}^n |Y_i − f_θ(X_i)|

0−1 Error:

(1/n) ∑_{i=1}^n ℓ(Y_i, f_θ(X_i)) = (1/n) ∑_{i=1}^n 1_{Y_i ≠ f_θ(X_i)}
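The three empirical risks above translate directly into numpy (the function names are mine):

```python
import numpy as np

def mse(y, pred):
    """Mean squared error between targets y and predictions pred."""
    return np.mean((y - pred) ** 2)

def mae(y, pred):
    """Mean absolute error."""
    return np.mean(np.abs(y - pred))

def zero_one(y, pred):
    """Fraction of misclassified examples."""
    return np.mean(y != pred)
```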
Cost functions

Cross-entropy (or negative log-likelihood):

(1/n) ∑_{i=1}^n ℓ(Y_i, f_θ(X_i)) = −(1/n) ∑_{i=1}^n ∑_{k=1}^K 1_{Y_i = k} log( [f_θ(X_i)]_k )

Cross-entropy:
- Very popular!
- Should help to prevent the saturation phenomenon, compared to the MSE. For a sigmoid output unit,

  −log P(Y = y_i | X) = −log σ((2y_i − 1)(W^T h + b)), with σ(t) = e^t / (1 + e^t).

  The sigmoid saturates when |(2y_i − 1)(W^T h + b)| ≫ 1. With the cross-entropy, −log P(Y = y_i | X) is approximately linear in W and b in the saturated regime where the prediction is wrong, which makes the gradient easy to compute, and the gradient descent easy to implement.
- Mean Squared Error should not be used with softmax output units [“Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition”, Bridle 1990]
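The linearity argument above can be made concrete: writing s = (2y − 1)(W^T h + b), the cross-entropy is −log σ(s) = log(1 + e^{−s}), a softplus, which is ≈ −s (linear, slope −1) when s ≪ 0. A sketch using numpy's numerically stable `logaddexp`:

```python
import numpy as np

# -log(sigma(s)) = softplus(-s): close to 0 when s >> 0 (confident and
# correct) and approximately linear in s when s << 0 (saturated but wrong),
# so the gradient does not vanish in the bad regime.
def neg_log_sigmoid(s):
    return np.logaddexp(0.0, -s)
```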
Weight initialization

First idea: set all weights and biases to the same value.

→ All neurons of a layer then compute the same function and receive the same gradient update: the symmetry is never broken.
Small or big weights?

1. First idea: small random numbers, typically N(0, 10⁻⁴), i.e., 0.01 × N(0, 1).
   - Works for small networks.
   - For big networks (∼10 layers), the outputs collapse to a Dirac mass at 0: there is no activation at all.

2. Second idea: “big random numbers”, N(0, 1).
   → Saturating phenomenon.

In any case, no need to tune the biases: they can be initially set to zero.
Other initializations

Idea: the variance of the input should be the same as the variance of the output.

1. Xavier initialization: initialize biases to zero and weights randomly using

   U[ −√6 / √(n_j + n_{j+1}), √6 / √(n_j + n_{j+1}) ],

   where n_j is the size of layer j [“Understanding the difficulty of training deep feedforward neural networks”, Glorot and Bengio 2010].
   → Sadly, it does not work for ReLU (non-activated neurons).

2. He et al. initialization: initialize biases to zero and weights randomly using

   N(0, 2/n_j),

   where n_j is the size of layer j [“Delving deep into rectifiers: Surpassing human-level performance on imagenet classification”, He et al. 2015].

Bonus: [“All you need is a good init”, Mishkin and Matas 2015]
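Both schemes above are one-liners with numpy's random generator. A sketch; the function names and the (n_out, n_in) weight layout are my conventions, and for He initialization N(0, 2/n_j) is read as a variance of 2/n_j:

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(n_in, n_out):
    # Uniform on [-limit, limit], variance 2 / (n_in + n_out).
    limit = np.sqrt(6.0) / np.sqrt(n_in + n_out)
    return rng.uniform(-limit, limit, size=(n_out, n_in))

def he_init(n_in, n_out):
    # Gaussian with variance 2 / n_in (std is the square root).
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))
```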
Exercise

1. Consider a neural network with two hidden layers (containing n₁ and n₂ neurons respectively) and equipped with linear hidden units. Find a simple sufficient condition on the weights so that the variance of the hidden units stays constant across layers.
2. Considering the same network, find a simple sufficient condition on the weights so that the gradient stays constant across layers when applying the backpropagation procedure.
3. Based on the previous questions, propose a simple way to initialize the weights.
Solution

1. Consider neuron i of the second hidden layer. The output of this neuron is

a_i^(2) = ∑_{j=1}^{n₁} W_{ij}^(2) z_j^(1) + b_i^(2)
        = ∑_{j=1}^{n₁} W_{ij}^(2) ( ∑_{k=1}^{n₀} W_{jk}^(1) x_k + b_j^(1) ) + b_i^(2)
        = ∑_{j=1}^{n₁} W_{ij}^(2) ( ∑_{k=1}^{n₀+1} W_{jk}^(1) x_k ) + b_i^(2),

where we set x_{n₀+1} = 1 and W_{j,n₀+1}^(1) = b_j^(1) to ease notations. At fixed x, since the weights and biases are centered, we have

V[ a_i^(2) ] = V[ ∑_{j=1}^{n₁} ∑_{k=1}^{n₀+1} W_{ij}^(2) W_{jk}^(1) x_k + b_i^(2) ] = ∑_{j=1}^{n₁} ∑_{k=1}^{n₀+1} x_k² V[ W_{ij}^(2) W_{jk}^(1) ]
Solution

          = ∑_{j=1}^{n₁} ∑_{k=1}^{n₀+1} x_k² σ₂² σ₁²,

where σ₁² (resp. σ₂²) is the shared variance of all weights between layers 0 and 1 (resp. layers 1 and 2). Since we want V[ a_i^(2) ] = V[ a_i^(1) ], it is sufficient that

n₁ σ₂² = 1.
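The sufficient condition n₁σ₂² = 1 can be checked by simulation: with linear units and second-layer weights of variance 1/n₁, the variance of a second-hidden-layer unit matches that of a first-hidden-layer unit. The layer sizes and trial count below are arbitrary choices:

```python
import numpy as np

# Monte-Carlo check of the condition n1 * sigma2^2 = 1 for linear units.
rng = np.random.default_rng(0)
n0, n1, trials = 20, 50, 20000
x = rng.normal(size=n0)                       # one fixed input
a1_samples, a2_samples = [], []
for _ in range(trials):
    W1 = rng.normal(0.0, np.sqrt(1.0 / n0), size=(n1, n0))
    W2 = rng.normal(0.0, np.sqrt(1.0 / n1), size=(1, n1))   # n1 * sigma2^2 = 1
    a1 = W1 @ x                               # first hidden layer
    a2 = W2 @ a1                              # second hidden layer
    a1_samples.append(a1[0])
    a2_samples.append(a2[0])
ratio = np.var(a2_samples) / np.var(a1_samples)   # should be close to 1
```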
Solution

2. Consider a neural network where the cost function is given, for one training example, by

C = ( y − (1/n_{d+1}) ∑_{j=1}^{n_{d+1}} W_j z_j^(d) )²,

where the W_j are the output weights, of shared variance σ²_{d+1}. This implies

∂C/∂z_j^(d) = −(2W_j/n_{d+1}) ( y − (1/n_{d+1}) ∑_i W_i z_i^(d) ).

Hence,

E[ ∂C/∂z_j^(d) ] = E[ (2W_j²/n²_{d+1}) z_j^(d) ] = (2σ²_{d+1}/n²_{d+1}) E[ z_j^(d) ].

Besides, according to the chain rule, we have

∂C/∂z_k^(d−1) = ∑_{j=1}^{n_{d+1}} (∂C/∂z_j^(d)) (∂z_j^(d)/∂z_k^(d−1)) = ∑_{j=1}^{n_{d+1}} (∂C/∂z_j^(d)) W_{jk}^(d).
Solution

More precisely,

E[ (∂C/∂z_j^(d)) W_{jk}^(d) ]
  = E[ −(2W_j/n_{d+1}) y W_{jk}^(d) + (2W_j/n²_{d+1}) W_{jk}^(d) ∑_i W_i ( ∑_ℓ W_{iℓ}^(d) z_ℓ^(d−1) + b_i^(d) ) ]
  = E[ (2W_j²/n²_{d+1}) W_{jk}^(d) ( ∑_ℓ W_{jℓ}^(d) z_ℓ^(d−1) + b_j^(d) ) ]
  = E[ (2W_j²/n²_{d+1}) (W_{jk}^(d))² z_k^(d−1) ]
  = (2σ²_{d+1} σ²_d / n²_{d+1}) E[ z_k^(d−1) ].

Using the chain rule, we get

E[ ∂C/∂z_k^(d−1) ] = E[ ∂C/∂z_j^(d) ]
⇔ n_{d+1} (2σ²_{d+1} σ²_d / n²_{d+1}) E[ z_k^(d−1) ] = (2σ²_{d+1} / n²_{d+1}) E[ z_j^(d) ]
Solution

⇔ n_{d+1} σ²_d = E[ z_j^(d) ] / E[ z_k^(d−1) ].

This gives the sufficient condition n_{d+1} σ²_d ∼ 1, assuming that the activations of two consecutive hidden layers are similar.
Regularizing to avoid overfitting

Avoid overfitting by imposing some constraints on the parameter space.

This reduces variance and increases bias.
Overfitting

Many different ways to avoid overfitting:

- Penalization (L1 or L2): replace the cost function L by L̃(θ, X, y) = L(θ, X, y) + pen(θ).
- Soft weight sharing (cf. CNN lecture): reduce the parameter space artificially by imposing explicit constraints.
- Dropout: randomly kill some neurons during optimization and predict with the full network.
- Batch normalization: renormalize a layer inside a batch, so that the network does not overfit on this particular batch.
- Early stopping: stop the gradient descent procedure when the error on the validation set increases.
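As an illustration of the dropout idea above, here is a sketch of one common variant, “inverted” dropout (an assumption on my part; the slides do not fix a variant): activations are zeroed with probability p at training time and rescaled by 1/(1 − p) so their expectation is unchanged, and the full network is used as-is at test time.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(a, p=0.5, train=True):
    # Inverted dropout (assumed variant): zero each activation with
    # probability p during training and rescale the survivors by 1/(1-p),
    # so E[output] = a; at test time the activations pass through unchanged.
    if not train:
        return a
    mask = rng.random(a.shape) >= p
    return a * mask / (1.0 - p)
```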
Constrained optimization problem:

min_θ L(θ, X, y)   s.t.   pen(θ) ≤ constant.

Using the Lagrangian formulation, this problem is equivalent to

min_θ L(θ, X, y) + λ pen(θ),

where
- L(θ, X, y) is the loss function (data-driven term),
- pen is a function that increases when θ becomes more complex (penalty term),
- λ ≥ 0 is a constant standing for the strength of the penalty term.

For neural networks, pen only penalizes the weights and not the biases, the latter being easier to estimate than the weights.
Examples of penalization

min_θ L(θ, X, y) + pen(θ)

- Ridge: pen(θ) = λ‖θ‖₂² [“Ridge regression: Biased estimation for nonorthogonal problems”, Hoerl and Kennard 1970]. See also [“Lecture notes on ridge regression”, Wieringen 2015].
- Lasso: pen(θ) = λ‖θ‖₁ [“Regression shrinkage and selection via the lasso”, Tibshirani 1996]
- Elastic Net: pen(θ) = λ‖θ‖₂² + µ‖θ‖₁ [“Regularization and variable selection via the elastic net”, Zou and Hastie 2005]
Simple case: linear regression

Linear regression
The linear regression estimate β̂ is given by

β̂ ∈ argmin_{β∈R^d} ∑_{i=1}^n ( Y_i − ∑_{j=1}^d β_j x_i^(j) )²,

which can be written as

β̂ ∈ argmin_{β∈R^d} ‖Y − Xβ‖₂²,

where X ∈ M_{n,d}(R).

Solution: β̂ = (X′X)⁻¹ X′Y.
Penalized regression

Penalized linear regression
The penalized estimate β̂_{λ,q} is given by

β̂_{λ,q} ∈ argmin_{β∈R^d} ‖Y − Xβ‖₂² + λ‖β‖_q^q.

- q = 2: ridge linear regression
- q = 1: LASSO
Ridge regression, q = 2

Ridge linear regression

The ridge estimate β_{λ,2} is given by

β_{λ,2} ∈ argmin_{β∈ℝᵈ} ‖Y − Xβ‖₂² + λ‖β‖₂².

Solution: β_{λ,2} = (X′X + λI)⁻¹ X′Y.

This estimate has a bias equal to −λ(X′X + λI)⁻¹β and a variance σ²(X′X + λI)⁻¹X′X(X′X + λI)⁻¹. Note that

V[β_{λ,2}] ≤ V[β].

In the case of orthonormal design (X′X = I), we have, for each coordinate j,

(β_{λ,2})ⱼ = βⱼ / (1 + λ) = X′ⱼY / (1 + λ).
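The ridge closed form can also be checked numerically; the sketch below (our notation, simulated data) illustrates the shrinkage effect relative to ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 200, 4, 5.0
X = rng.normal(size=(n, d))
Y = X @ np.array([2.0, -1.0, 0.0, 0.5]) + 0.1 * rng.normal(size=n)

# Ridge closed form: beta = (X'X + lam I)^{-1} X'Y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)

# The ridge estimate is shrunk toward zero: its norm never exceeds the OLS norm
```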
Ridge regression, q = 2

In general, the ridge penalization considers

pen(β) = (1/2)‖β‖₂² = (1/2) ∑_{j=1}^d βⱼ²,

in the optimization problem

min_β L(β, X, y) + pen(β).

It penalizes the “size” of β.

This is the most widely used penalization:
- It is simple and easy to implement.
- It allows one to “deal” with correlated features.
- It actually helps training! With a ridge penalization, the optimization problem is easier.
Sparsity

There is another desirable property of β.

If βⱼ = 0, then feature j has no impact on the prediction: y = sign(x⊤β + b).

If we have many features (d is large), it would be nice if β contained zeros, and many of them:
- it means that only a few features are statistically relevant;
- it means that only a few features are useful to predict the label;
- it leads to a simpler model, with a “reduced” dimension.

How do we enforce sparsity in β?
Sparsity

It is tempting to solve

β_{λ,0} ∈ argmin_{β∈ℝᵈ} ‖Y − Xβ‖₂² + λ‖β‖₀,

where

‖β‖₀ = #{j ∈ {1, . . . , d} : βⱼ ≠ 0}.

To solve this, one must explore all possible supports of β. Too long! (NP-hard)

Instead, find a convex proxy of ‖·‖₀: the ℓ₁-norm ‖β‖₁ = ∑_{j=1}^d |βⱼ|.
LASSO

LASSO linear regression

The LASSO (Least Absolute Shrinkage and Selection Operator) estimate β_{λ,1} is given by

β_{λ,1} ∈ argmin_{β∈ℝᵈ} ‖Y − Xβ‖₂² + λ‖β‖₁.

Solution: no closed form in the general case.

If the Xⱼ are orthonormal, then, for each coordinate j,

(β_{λ,1})ⱼ = X′ⱼY (1 − λ / (2|X′ⱼY|))₊,

where (x)₊ = max(0, x). Thus, in the very specific case of orthogonal design, we can easily show that ℓ₁ penalization yields a sparse vector if the parameter λ is properly tuned.
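Under orthonormal design, this formula is exactly coordinate-wise soft-thresholding; a small sketch (function name and sample values are ours):

```python
import numpy as np

def soft_threshold(z, lam):
    """Coordinate-wise LASSO solution under orthonormal design:
    z * (1 - lam / (2|z|))_+  ==  sign(z) * (|z| - lam/2)_+ ."""
    return np.sign(z) * np.maximum(np.abs(z) - lam / 2.0, 0.0)

z = np.array([3.0, -0.4, 0.1, -2.0])   # stand-ins for the X'_j Y
# With lam = 1, coefficients smaller than lam/2 in absolute value vanish
beta = soft_threshold(z, lam=1.0)
```

Small coefficients are set exactly to zero (sparsity), while large ones are shrunk by λ/2.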
Sparsity - a picture
Why does ℓ₂ (ridge) not induce sparsity? Intuitively, the ℓ₁ ball has corners on the coordinate axes, so the constrained optimum often lands at a point with zero coordinates, whereas the ℓ₂ ball is smooth and does not favor the axes.
Outline

1 Introduction
2 History of Neural Networks
  Perceptron
  Multilayer Perceptron - Backpropagation algorithm
3 Hyperparameters
  Activation functions
  Output units
  Loss functions
  Weight initialization
4 Regularization
  Penalization
  Dropout
  Batch normalization
  Early stopping
5 All in all
Dropout

Dropout refers to dropping out units (hidden and visible) in a neural network, i.e., temporarily removing them from the network, along with all their incoming and outgoing connections.

Each unit is independently retained with probability
- p = 0.5 for hidden units;
- p ∈ [0.5, 1] for input units, usually p = 0.8.
[“Improving neural networks by preventing co-adaptation of feature detectors”, Hinton et al. 2012]
Dropout algorithm

Training step. While not converged:

1 Inside one epoch, for each mini-batch of size m,
   1 Sample m different masks. A mask consists of one Bernoulli variable per node of the network (inner and input nodes, but not output nodes). These Bernoulli variables are i.i.d. Usually,
      - the probability of retaining a hidden node is 0.5;
      - the probability of retaining an input node is 0.8.
   2 For each of the m observations in the mini-batch,
      - do a forward pass on the masked network;
      - compute backpropagation in the masked network;
      - compute the gradient for each weight.
   3 Update the parameters according to the usual formula.

Prediction step. Use all neurons in the network, with the weights given by the previous optimization procedure multiplied by the probability p of being retained (0.5 for inner nodes, 0.8 for input nodes).
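The training/prediction asymmetry above can be sketched for a single linear layer in NumPy (a toy layer of ours, not the authors' code): a Bernoulli mask at training time, weights (equivalently, inputs) scaled by the retention probability at prediction time.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_layer(x, W, p_keep, train):
    """Single linear layer with dropout applied to its inputs."""
    if train:
        mask = rng.random(x.shape) < p_keep   # i.i.d. Bernoulli(p_keep) mask
        return (x * mask) @ W
    # Prediction: keep every unit, scale by p_keep to match the expectation
    return (x * p_keep) @ W

x = rng.normal(size=(1, 10))
W = rng.normal(size=(10, 5))
y_pred = dropout_layer(x, W, p_keep=0.5, train=False)
```

Averaging many masked forward passes recovers the prediction-time output in expectation, which is exactly why the weights are rescaled by p.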
Another way of seeing dropout - Ensemble method

Averaging many different neural networks. “Different” can mean either:

- randomizing the data set on which we train each network (via subsampling).
  Problem: not enough data to obtain good performance...
- building different network architectures and training each large network separately on the whole training set.
  Problem: computationally prohibitive at training time and test time!

Miscellaneous:
[“Fast dropout training”, Wang and Manning 2013]
[“Dropout: A simple way to prevent neural networks from overfitting”, Srivastava et al. 2014]
Exercise: linear units

1 Consider a neural network with linear activation functions. Prove that dropout can be seen as a model averaging method.

2 Given one training example, consider the error of the ensemble of neural networks and that of one random neural network sampled with dropout:

E_ens = (1/2)(y − a_ens)² = (1/2)(y − ∑_{i=1}^d pᵢwᵢxᵢ)²

E_single = (1/2)(y − a_single)² = (1/2)(y − ∑_{i=1}^d δᵢwᵢxᵢ)²,

where δᵢ ∈ {0, 1} represents the presence (δᵢ = 1) or the absence of a connection between the input xᵢ and the output. Prove that

E[∇_w E_single] = ∇_w E_ens + ∇_w [(1/2) ∑_{i=1}^d wᵢ²xᵢ² V(δᵢ)].

Comment.
Solution

1 Simple calculations...

2 The gradient of the ensemble is given by

∂E_ens/∂wᵢ = −(y − a_ens) ∂a_ens/∂wᵢ = −(y − a_ens) pᵢxᵢ,

and that of a single (random) network satisfies

∂E_single/∂wᵢ = −(y − a_single) ∂a_single/∂wᵢ
             = −(y − a_single) δᵢxᵢ
             = −yδᵢxᵢ + wᵢδᵢ²xᵢ² + ∑_{j≠i} wⱼδᵢδⱼxᵢxⱼ.

Taking the expectation of the last expression (using E[δᵢ²] = pᵢ and E[δᵢδⱼ] = pᵢpⱼ for j ≠ i) gives

E[∂E_single/∂wᵢ] = −ypᵢxᵢ + wᵢpᵢxᵢ² + ∑_{j≠i} wⱼpᵢpⱼxᵢxⱼ
                = −(y − a_ens)pᵢxᵢ + wᵢxᵢ²pᵢ(1 − pᵢ).

Therefore, we have

E[∂E_single/∂wᵢ] = ∂E_ens/∂wᵢ + wᵢxᵢ²V[δᵢ].
Solution

In terms of error, we have

E[E_single] = E_ens + (1/2) ∑_{i=1}^d wᵢ²xᵢ² V[δᵢ].

Since V[δᵢ] = pᵢ(1 − pᵢ), the regularization term is maximized for p = 0.5.
Batch normalization

The network converges faster if its inputs are scaled (mean, variance) and decorrelated.
[“Efficient backprop”, LeCun et al. 1998]

It is hard to decorrelate variables: this requires computing the covariance matrix.
[“Batch normalization: Accelerating deep network training by reducing internal covariate shift”, Ioffe and Szegedy 2015]

Ideas:
- Improving gradient flows
- Allowing higher learning rates
- Reducing strong dependence on initialization
- Related to regularization (maybe slightly reduces the need for Dropout)
Algorithm

See [“Batch normalization: Accelerating deep network training by reducing internal covariate shift”, Ioffe and Szegedy 2015]

1 For every unit in the first layer, over a mini-batch B = {x₁, . . . , x_m}:
   1 µ_B = (1/m) ∑_{i=1}^m xᵢ
   2 σ²_B = (1/m) ∑_{i=1}^m (xᵢ − µ_B)²
   3 x̂ᵢ = (xᵢ − µ_B) / √(σ²_B + ε)
   4 yᵢ = γx̂ᵢ + β ≡ BN_{γ,β}(xᵢ)
2 yᵢ is fed to the next layer and the procedure iterates.
3 Backpropagation is performed on the network parameters, including (γ⁽ᵏ⁾, β⁽ᵏ⁾). This returns a network.
4 For inference, compute the average over many training batches B:

E_B[x] = E_B[µ_B]   and   V_B[x] = (m/(m − 1)) E_B[σ²_B].

5 For inference, replace every function x ↦ BN_{γ,β}(x) in the network by

x ↦ (γ/√(V_B[x] + ε)) x + (β − γE_B[x]/√(V_B[x] + ε)).
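The training-time normalization and the folded inference-time affine map can be sketched in NumPy as follows (our function names; a toy layer, not the paper's implementation):

```python
import numpy as np

def batchnorm_train(x, gamma, beta, eps=1e-5):
    """Normalize a mini-batch (one column per unit), then apply the learned
    affine map y = gamma * x_hat + beta. Also returns the batch statistics,
    to be accumulated for inference."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta, mu, var

def batchnorm_infer(x, gamma, beta, mean, var, eps=1e-5):
    """Inference-time BN, folded into a single affine map as on the slide."""
    scale = gamma / np.sqrt(var + eps)
    return scale * x + (beta - scale * mean)

rng = np.random.default_rng(2)
batch = rng.normal(2.0, 3.0, size=(64, 4))       # 4 units, batch size 64
gamma, beta = np.ones(4), np.zeros(4)
y, mu, var = batchnorm_train(batch, gamma, beta)  # ~zero mean, unit variance
```

With the same (mean, variance) plugged in, the folded inference map reproduces the training-time output exactly.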
Early stopping

Idea:
- Store the parameter values that lead to the lowest error on the validation set.
- Return these values rather than the latest ones.
Early stopping algorithm

Parameters:
- patience p of the algorithm: number of times to observe a worsening validation-set error before giving up;
- the number of steps n between evaluations.

1 Start with initial random values θ₀.
2 Let θ* = θ₀, Err* = ∞, j = 0, i = 0.
3 While j < p:
   1 Update θ by running the training algorithm for n steps.
   2 i = i + n.
   3 Compute the error Err(θ) on the validation set.
   4 If Err(θ) < Err*:
      - θ* = θ
      - Err* = Err(θ)
      - j = 0
     else j = j + 1.
4 Return θ* and the optimal number of steps i* = i − np.
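The loop above can be sketched generically in Python; `train_steps` and `valid_error` are assumed user-supplied callbacks (our interface, not from the slides):

```python
import copy

def early_stopping(theta0, train_steps, valid_error, patience, n):
    """Early stopping as in the algorithm above: train in chunks of n steps,
    keep the parameters with the best validation error, and stop after
    `patience` consecutive non-improving evaluations."""
    theta = theta0
    best_theta, best_err = copy.deepcopy(theta0), float("inf")
    i, j = 0, 0
    while j < patience:
        theta = train_steps(theta, n)          # run the optimizer for n steps
        i += n
        err = valid_error(theta)               # evaluate on the validation set
        if err < best_err:
            best_theta, best_err, j = copy.deepcopy(theta), err, 0
        else:
            j += 1
    return best_theta, i - n * patience        # theta* and i* = i - np

# Toy run: "training" moves theta by +1 per step; validation error is (theta - 5)^2
best, i_star = early_stopping(0, lambda t, n: t + n, lambda t: (t - 5) ** 2,
                              patience=3, n=1)
```

In the toy run, the validation error is minimized at θ = 5, reached after 5 steps; training then worsens for `patience` evaluations and stops.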
How to leverage early stopping?
First idea: use early stopping to determine the best number of iterations i? and train onthe whole data set for i? iterations.
Let X (train), y (train) be the training set.
Split X (train), y (train) into X (subtrain), y (subtrain) and X (valid), y (valid).
Run early stopping algorithm starting from random θ using X (subtrain), y (subtrain) fortraining data and X (valid), y (valid) for validation data. This returns i? the optimalnumber of steps.
Set θ to random values again.
Train on X (train), y (train) for i? steps.
How to leverage early stopping?
Second idea: use early stopping to determine the best parameters and the training errorat the best number of iterations. Starting from θ?, train on the whole data set until theerror matches the previous early stopping error.
Let X (train), y (train) be the training set.
Split X (train), y (train) into X (subtrain), y (subtrain) and X (valid), y (valid).
Run early stopping algorithm starting from random θ using X (subtrain), y (subtrain) fortraining data and X (valid), y (valid) for validation data. This returns the optimalparameters θ?.
Set ε = L(θ?,X (subtrain), y (subtrain)).
While L(θ?,X (valid), y (valid)) > ε, train on X (train), y (train) for n steps.
To go further

Early stopping is a very old idea:
- [“Three topics in ill-posed problems”, Wahba 1987]
- [“A formal comparison of methods proposed for the numerical solution of first kind integral equations”, Anderssen and Prenter 1981]
- [“Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping”, Caruana et al. 2001]

But it is also an active area of research:
- [“Adaboost is consistent”, Bartlett and Traskin 2007]
- [“Boosting algorithms as gradient descent”, Mason et al. 2000]
- [“On early stopping in gradient descent learning”, Yao et al. 2007]
- [“Boosting with early stopping: Convergence and consistency”, Zhang, Yu, et al. 2005]
- [“Early stopping for kernel boosting algorithms: A general analysis with localized complexities”, Wei et al. 2017]
More on reducing overfitting: averaging
Soft-weight sharing:[“Simplifying neural networks by soft weight-sharing”, Nowlan and Hinton 1992]
Model averaging:Average over: random initialization, random selection of minibatches,hyperparameters, or outcomes of nondeterministic neural networks.
Boosting neural networks by incrementally adding neural networks to the ensemble[“Training methods for adaptive boosting of neural networks”, Schwenk and Bengio 1998]
Boosting has also been applied interpreting an individual neural network as anensemble, incrementally adding hidden units to the networks[“Convex neural networks”, Bengio et al. 2006]
Pipeline for neural networks

Step 1: Preprocess the data (subtract the mean, divide by the standard deviation). More complex if the data are images.

Step 2: Choose the architecture (number of layers, number of nodes per layer).

Step 3:
1 First, run the network and see if the loss is reasonable (compare with a dumb classifier: uniform for classification, mean for regression).
2 Add some regularization and check that the error on the training set increases.
3 On a small portion of data, make sure you can overfit when turning down the regularization.
4 Find the best learning rate:
   1 The error does not change much → learning rate too small.
   2 The error explodes, NaN → learning rate too high.
   3 Find a rough range, e.g. [10⁻⁵, 10⁻³].
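For step 3.1, the comparison with a dumb classifier can be made concrete: for K balanced classes, uniform predictions give a cross-entropy of log K, so the initial loss of a randomly initialized network should sit near that value (a sketch; the measured loss below is a made-up placeholder):

```python
import numpy as np

K = 10                                   # number of classes
dumb_loss = -np.log(1.0 / K)             # cross-entropy of uniform predictions: log K

# Hypothetical loss measured on the untrained network (placeholder value):
initial_loss = 2.35

# If the initial loss is far from log K, suspect the initialization or the loss code
if abs(initial_loss - dumb_loss) > 0.5:
    print("Warning: initial loss far from the dumb-classifier baseline")
```

For K = 10, log K ≈ 2.30, so an initial loss near 2.3 is a good sign; a value like 15 or 0.01 points to a bug.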
Playing with neural networks:
http://playground.tensorflow.org/
To go further

The kernel perceptron algorithm was already introduced by
[“Theoretical foundations of the potential function method in pattern recognition learning”, Aizerman 1964].

General idea (works for all methods using only dot products): replace the dot product by a more complex kernel function.

A linear separation in the more complex feature space corresponds to a nonlinear separation in the original space.

Margin bounds for the Perceptron algorithm in the general non-separable case were proven by
[“Large margin classification using the perceptron algorithm”, Freund and Schapire 1999]
and then by
[“Perceptron mistake bounds”, Mohri and Rostamizadeh 2013],
who extended existing results and gave new L1 bounds.
Mark A. Aizerman. “Theoretical foundations of the potential function method in pattern recognition learning”. In: Automation and Remote Control 25 (1964), pp. 821–837.

R. S. Anderssen and P. M. Prenter. “A formal comparison of methods proposed for the numerical solution of first kind integral equations”. In: The ANZIAM Journal 22.4 (1981), pp. 488–500.

Yoshua Bengio et al. “Convex neural networks”. In: Advances in Neural Information Processing Systems. 2006, pp. 123–130.

Hans-Dieter Block. “The perceptron: A model for brain functioning. I”. In: Reviews of Modern Physics 34.1 (1962), p. 123.

John S. Bridle. “Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition”. In: Neurocomputing. Springer, 1990, pp. 227–236.

Bernard Widrow and Samuel D. Stearns. Adaptive Signal Processing. Englewood Cliffs, NJ: Prentice-Hall, 1985.

Peter L. Bartlett and Mikhail Traskin. “Adaboost is consistent”. In: Journal of Machine Learning Research 8 (2007), pp. 2347–2368.
Rich Caruana, Steve Lawrence, and C. Lee Giles. “Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping”. In: Advances in Neural Information Processing Systems. 2001, pp. 402–408.

Meng Cai, Yongzhe Shi, and Jia Liu. “Deep maxout neural networks for speech recognition”. In: Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on. IEEE, 2013, pp. 291–296.

Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. “Fast and accurate deep network learning by exponential linear units (ELUs)”. In: arXiv preprint arXiv:1511.07289 (2015).

Yoav Freund and Robert E. Schapire. “Large margin classification using the perceptron algorithm”. In: Machine Learning 37.3 (1999), pp. 277–296.

Xavier Glorot and Yoshua Bengio. “Understanding the difficulty of training deep feedforward neural networks”. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. 2010, pp. 249–256.

Xavier Glorot, Antoine Bordes, and Yoshua Bengio. “Deep sparse rectifier neural networks”. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. 2011, pp. 315–323.
Ian J. Goodfellow, Mehdi Mirza, et al. “An empirical investigation of catastrophic forgetting in gradient-based neural networks”. In: arXiv preprint arXiv:1312.6211 (2013).

Ian J. Goodfellow, David Warde-Farley, et al. “Maxout networks”. In: arXiv preprint arXiv:1302.4389 (2013).

Alex Graves. “Generating sequences with recurrent neural networks”. In: arXiv preprint arXiv:1308.0850 (2013).

Kaiming He et al. “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification”. In: Proceedings of the IEEE International Conference on Computer Vision. 2015, pp. 1026–1034.

D. O. Hebb. The Organization of Behavior: A Neuropsychological Theory. Wiley, 1949.

Geoffrey E. Hinton et al. “Improving neural networks by preventing co-adaptation of feature detectors”. In: arXiv preprint arXiv:1207.0580 (2012).

Arthur E. Hoerl and Robert W. Kennard. “Ridge regression: Biased estimation for nonorthogonal problems”. In: Technometrics 12.1 (1970), pp. 55–67.
Michael Jen-Chao Hu. “Application of the adaline system to weather forecasting”. PhD thesis. Department of Electrical Engineering, Stanford University, 1964.

Sergey Ioffe and Christian Szegedy. “Batch normalization: Accelerating deep network training by reducing internal covariate shift”. In: arXiv preprint arXiv:1502.03167 (2015).

Kevin Jarrett, Koray Kavukcuoglu, Yann LeCun, et al. “What is the best multi-stage architecture for object recognition?” In: Computer Vision, 2009 IEEE 12th International Conference on. IEEE, 2009, pp. 2146–2153.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. “ImageNet classification with deep convolutional neural networks”. In: Advances in Neural Information Processing Systems. 2012, pp. 1097–1105.

Yann LeCun et al. “Efficient backprop”. In: Neural Networks: Tricks of the Trade. Springer, 1998, pp. 9–50.

Llew Mason et al. “Boosting algorithms as gradient descent”. In: Advances in Neural Information Processing Systems. 2000, pp. 512–518.

Andrew L. Maas, Awni Y. Hannun, and Andrew Y. Ng. “Rectifier nonlinearities improve neural network acoustic models”. In: Proc. ICML. Vol. 30. 1. 2013, p. 3.
Dmytro Mishkin and Jiri Matas. “All you need is a good init”. In: arXiv preprint arXiv:1511.06422 (2015).

Warren S. McCulloch and Walter Pitts. “A logical calculus of the ideas immanent in nervous activity”. In: The Bulletin of Mathematical Biophysics 5.4 (1943), pp. 115–133.

Marvin Minsky and Seymour Papert. “Perceptrons.” In: (1969).

Mehryar Mohri and Afshin Rostamizadeh. “Perceptron mistake bounds”. In: arXiv preprint arXiv:1305.0208 (2013).

Steven J. Nowlan and Geoffrey E. Hinton. “Simplifying neural networks by soft weight-sharing”. In: Neural Computation 4.4 (1992), pp. 473–493.

Albert B. Novikoff. On Convergence Proofs for Perceptrons. Tech. rep. Stanford Research Institute, Menlo Park, CA, 1963.

Mikel Olazaran. “A sociological study of the official history of the perceptrons controversy”. In: Social Studies of Science 26.3 (1996), pp. 611–659.

Frank Rosenblatt. “Perceptron simulation experiments”. In: Proceedings of the IRE 48.3 (1960), pp. 301–309.
Frank Rosenblatt. Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Tech. rep. Cornell Aeronautical Lab Inc., Buffalo, NY, 1961.

Holger Schwenk and Yoshua Bengio. “Training methods for adaptive boosting of neural networks”. In: Advances in Neural Information Processing Systems. 1998, pp. 647–653.

Michael Schuster. “On supervised learning from sequential data with applications for speech recognition”. Doctoral dissertation. Nara Institute of Science and Technology, 1999.

Nitish Srivastava et al. “Dropout: A simple way to prevent neural networks from overfitting”. In: The Journal of Machine Learning Research 15.1 (2014), pp. 1929–1958.

L. R. Talbert, G. F. Groner, and J. S. Koford. “Real-time adaptive speech-recognition system”. In: The Journal of the Acoustical Society of America 35.5 (1963), pp. 807–807.

Robert Tibshirani. “Regression shrinkage and selection via the lasso”. In: Journal of the Royal Statistical Society. Series B (Methodological) (1996), pp. 267–288.
Grace Wahba. “Three topics in ill-posed problems”. In: Inverse and Ill-Posed Problems. Elsevier, 1987, pp. 37–51.

Bernard Widrow and Marcian E. Hoff. Adaptive Switching Circuits. Tech. rep. Stanford Electronics Labs, Stanford University, 1960.

Wessel N. van Wieringen. “Lecture notes on ridge regression”. In: arXiv preprint arXiv:1509.09169 (2015).

Sida Wang and Christopher Manning. “Fast dropout training”. In: International Conference on Machine Learning. 2013, pp. 118–126.

Capt. Rodney Winter and B. Widrow. “Madaline Rule II: A training algorithm for neural networks”. In: Second Annual International Conference on Neural Networks. 1988, pp. 1–401.

Yuting Wei, Fanny Yang, and Martin J. Wainwright. “Early stopping for kernel boosting algorithms: A general analysis with localized complexities”. In: Advances in Neural Information Processing Systems. 2017, pp. 6067–6077.

Bing Xu et al. “Empirical evaluation of rectified activations in convolutional network”. In: arXiv preprint arXiv:1505.00853 (2015).
Yuan Yao, Lorenzo Rosasco, and Andrea Caponnetto. “On early stopping in gradient descent learning”. In: Constructive Approximation 26.2 (2007), pp. 289–315.

Tong Zhang, Bin Yu, et al. “Boosting with early stopping: Convergence and consistency”. In: The Annals of Statistics 33.4 (2005), pp. 1538–1579.

Hui Zou and Trevor Hastie. “Regularization and variable selection via the elastic net”. In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67.2 (2005), pp. 301–320.