Transcript of MIT 9.520/6.860, Fall 2018, Class 11: Neural networks – tips, tricks & software (slides: 9.520/fall18/slides/Class11_nn2.pdf)
MIT 9.520/6.860, Fall 2018
Class 11: Neural networks – tips, tricks & software
Andrzej Banburski
Last time - Convolutional neural networks
source: github.com/vdumoulin/conv_arithmetic

More Data and GPUs
AlexNet outmatches the competition at ILSVRC 2012, enabled by large-scale datasets and general-purpose GPUs.
AlexNet [Krizhevsky et al., 2012]
Overview
Initialization & hyper-parameter tuning
Optimization algorithms
Batchnorm & Dropout
Finite dataset woes
Software
Initialization & hyper-parameter tuning
Consider the problem of training a neural network f_θ(x) by minimizing a loss

\[ L(\theta, x) = \sum_{i=1}^{N} l_i(y_i, f_\theta(x_i)) + \lambda \|\theta\|^2 \]

with SGD and mini-batch size b:

\[ \theta_{t+1} = \theta_t - \frac{\eta}{b} \sum_{i \in B} \nabla_\theta L(\theta_t, x_i) \qquad (1) \]

- How should we choose the initial set of parameters θ?
- How about the hyper-parameters η, λ and b?
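As a concrete (if schematic) illustration of update (1), here is a minimal NumPy sketch of one mini-batch SGD step; the gradient function `grad_fn` and the data arrays are placeholders, not part of the slides.

```python
import numpy as np

def sgd_step(theta, grad_fn, X, y, eta=0.1, b=32, lam=1e-4):
    """One step of update (1): theta <- theta - eta/b * sum_{i in B} grad_theta L(theta, x_i).

    grad_fn(theta, x_i, y_i) is a placeholder returning the gradient of the
    per-example loss l_i(y_i, f_theta(x_i)); lam is the L2 penalty weight lambda.
    """
    B = np.random.choice(len(X), size=b, replace=False)      # sample a mini-batch
    g = sum(grad_fn(theta, X[i], y[i]) for i in B) / b        # averaged data gradient
    g = g + 2 * lam * theta                                   # gradient of the L2 term
    return theta - eta * g
```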
Weight Initialization
- First obvious observation: starting with 0 will make every weight update in the same way. Similarly, if the initial weights are too big we can run into NaNs.
- What about θ_0 = ε × N(0, 1), with ε ≈ 10^{-2}?
- For a few layers this would seem to work nicely.
- If we go deeper, however...
- Super slow updates of the earlier layers, of order 10^{-2L} after L layers, for sigmoid or tanh activations – vanishing gradients. ReLU activations do not suffer so much from this.
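A quick numerical illustration of the vanishing-signal effect (a sketch, not from the slides): with ε = 10^{-2} Gaussian initialization and tanh activations, the activations, and hence the backpropagated gradients, shrink rapidly with depth.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((512, 500))              # a batch of 512 inputs, 500 features
for layer in range(10):
    W = 1e-2 * rng.standard_normal((500, 500))   # theta_0 = eps * N(0, 1), eps = 1e-2
    x = np.tanh(x @ W)
    print(f"layer {layer + 1}: std of activations = {x.std():.2e}")
# the standard deviation collapses towards 0, so backpropagated gradients do too
```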
Xavier & He initializations
- For tanh and sigmoid activations, near the origin we deal with a nearly linear function y = Wx, with x = (x_1, ..., x_{n_in}). To stop vanishing and exploding gradients we need

  \[ \mathrm{Var}(y) = \mathrm{Var}(Wx) = \mathrm{Var}(w_1 x_1) + \dots + \mathrm{Var}(w_{n_\mathrm{in}} x_{n_\mathrm{in}}) \]

- If we assume that W and x are i.i.d. and have zero mean, then Var(y) = n_in Var(w_i) Var(x_i).
- If we want the inputs and outputs to have the same variance, this gives us Var(w_i) = 1/n_in.
- A similar analysis for the backward pass gives Var(w_i) = 1/n_out.
- The compromise is the Xavier initialization [Glorot et al., 2010]:

  \[ \mathrm{Var}(w_i) = \frac{2}{n_\mathrm{in} + n_\mathrm{out}} \qquad (2) \]

- Heuristically, ReLU is half of the linear function, so we can take

  \[ \mathrm{Var}(w_i) = \frac{4}{n_\mathrm{in} + n_\mathrm{out}} \qquad (3) \]

  An analysis in [He et al., 2015] confirms this.
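A minimal NumPy sketch of initializations (2) and (3) for a single weight matrix (the function names are illustrative, not from the slides):

```python
import numpy as np

def xavier_init(n_in, n_out):
    std = np.sqrt(2.0 / (n_in + n_out))      # Var(w_i) = 2/(n_in + n_out), eq. (2)
    return np.random.randn(n_in, n_out) * std

def relu_init(n_in, n_out):
    std = np.sqrt(4.0 / (n_in + n_out))      # Var(w_i) = 4/(n_in + n_out), eq. (3)
    return np.random.randn(n_in, n_out) * std

W_tanh = xavier_init(500, 300)   # for a tanh/sigmoid layer
W_relu = relu_init(300, 100)     # for a ReLU layer
```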
Hyper-parameter tuning
How about the hyper-parameters η, λ and b?
- How do we choose optimal η, λ and b?
- Basic idea: split your training dataset into a smaller training set and a cross-validation set.
  – Run a coarse search (on a logarithmic scale) over the parameters for just a few epochs of SGD and evaluate on the cross-validation set.
  – Perform a finer search.
- Interestingly, [Bergstra and Bengio, 2012] show that it is better to run the search randomly than on a grid (a sketch follows below).
source: [Bergstra and Bengio, 2012]
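A sketch of the coarse random search described above, sampling η and λ on a logarithmic scale; `train_eval` is a placeholder for "train for a few epochs and return cross-validation accuracy".

```python
import numpy as np

def random_search(train_eval, n_trials=20):
    """train_eval(eta, lam, b) is assumed to run a few epochs of SGD on the
    smaller training set and return accuracy on the cross-validation set."""
    best = None
    for _ in range(n_trials):
        eta = 10 ** np.random.uniform(-5, -1)   # log-scale sample for the learning rate
        lam = 10 ** np.random.uniform(-6, -2)   # log-scale sample for the weight decay
        b = int(np.random.choice([32, 64, 128, 256]))
        acc = train_eval(eta, lam, b)
        if best is None or acc > best[0]:
            best = (acc, eta, lam, b)
    return best
```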
Decaying learning rate
- To improve convergence of SGD, we have to use a decaying learning rate.
- Typically we use a scheduler – decrease η after some fixed number of epochs.
- This allows the training loss to keep improving after it has plateaued.
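For concreteness, a typical step schedule is sketched below (the constants are illustrative): divide η by 10 every 30 epochs. Frameworks ship equivalents, e.g. torch.optim.lr_scheduler.StepLR in PyTorch.

```python
def step_lr(eta0, epoch, drop=0.1, every=30):
    """Step-decay schedule: eta0 * drop**(epoch // every)."""
    return eta0 * drop ** (epoch // every)

# step_lr(0.1, 5) -> 0.1,  step_lr(0.1, 35) -> 0.01,  step_lr(0.1, 65) -> 0.001
```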
Batch-size & learning rate
An interesting linear scaling relationship seems to exist between the learning rate η and the mini-batch size b:
- In the SGD update they appear as the ratio η/b, with an additional implicit dependence of the sum of gradients on b.
- If b ≪ N, we can approximate SGD by a stochastic differential equation with a noise scale g ≈ ηN/b [Smith & Le, 2017].
- This means that instead of decaying η, we can increase the batch size dynamically (a rough sketch follows below).
- As b approaches N, the dynamics become more and more deterministic and we would expect this relationship to vanish.
source: [Smith et al., 2018]; [Goyal et al., 2017]
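A rough sketch of that equivalence (numbers are illustrative, not from the slides): decaying η by a factor of 10 and growing b by a factor of 10 change the noise scale g ≈ ηN/b by the same amount.

```python
def noise_scale(eta, b, N):
    """SGD noise scale g ~ eta * N / b (for b << N)."""
    return eta * N / b

N = 50_000
print(noise_scale(0.1, 128, N))    # baseline
print(noise_scale(0.01, 128, N))   # decay the learning rate by 10x ...
print(noise_scale(0.1, 1280, N))   # ... or grow the batch size by 10x: same g
```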
SGD is kinda slow...
- GD – use all points each iteration to compute gradient
- SGD – use one point each iteration to compute gradient
- Faster: Mini-Batch – use a mini-batch of points each iteration to compute gradient
Alternatives to SGD
Are there reasonable alternatives outside of Newton's method?

Accelerations:
- Momentum
- Nesterov's method
- Adagrad
- RMSprop
- Adam
- ...
SGD with Momentum
We can try accelerating SGD,

\[ \theta_{t+1} = \theta_t - \eta \nabla f(\theta_t), \]

by adding a momentum/velocity term:

\[ v_{t+1} = \mu v_t - \eta \nabla f(\theta_t), \qquad \theta_{t+1} = \theta_t + v_{t+1} \qquad (4) \]

µ is a new "momentum" hyper-parameter.
source: cs231n.github.io
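A one-function sketch of update (4); `grad` stands for the (mini-batch) gradient ∇f(θ_t) computed elsewhere.

```python
def momentum_step(theta, v, grad, eta=0.01, mu=0.9):
    """SGD with momentum, update (4): v <- mu*v - eta*grad; theta <- theta + v."""
    v = mu * v - eta * grad
    return theta + v, v
```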
Nesterov Momentum
- Sometimes the momentum update can overshoot.
- We can instead evaluate the gradient at the point where momentum takes us:

\[ v_{t+1} = \mu v_t - \eta \nabla f(\theta_t + \mu v_t), \qquad \theta_{t+1} = \theta_t + v_{t+1} \qquad (5) \]
source: Geoff Hinton’s lecture
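The corresponding sketch of update (5); here `grad_fn` is a placeholder that evaluates ∇f at the look-ahead point.

```python
def nesterov_step(theta, v, grad_fn, eta=0.01, mu=0.9):
    """Nesterov momentum, update (5): the gradient is taken at theta + mu*v."""
    v = mu * v - eta * grad_fn(theta + mu * v)
    return theta + v, v
```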
AdaGrad
- An alternative is to automate the decay of the learning rate.
- The Adaptive Gradient algorithm does this by accumulating the squared magnitudes of the gradients (sketched below).
- AdaGrad accelerates in flat directions of the optimization landscape and slows down in steep ones.
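A minimal sketch of the AdaGrad update (the exact form is not spelled out on the slide, so treat the constants as illustrative): a running sum of squared gradients rescales each parameter's step.

```python
import numpy as np

def adagrad_step(theta, cache, grad, eta=0.01, eps=1e-8):
    """AdaGrad sketch: accumulate squared gradients, divide the step by their root."""
    cache = cache + grad ** 2
    theta = theta - eta * grad / (np.sqrt(cache) + eps)
    return theta, cache
```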
RMSProp
Problem: The updates in AdaGrad always decrease the learning rate, so some of the parameters can become un-learnable.
- Fix by Hinton: use an exponentially weighted sum of the squared magnitudes instead (sketched below).
- This assigns more weight to recent iterations. Useful if directions of steeper or shallower descent suddenly change.
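A matching sketch of Hinton's fix (constants again illustrative): the running sum is replaced by an exponentially weighted average, so old gradients are gradually forgotten.

```python
import numpy as np

def rmsprop_step(theta, cache, grad, eta=1e-3, decay=0.9, eps=1e-8):
    """RMSProp sketch: exponentially weighted average of squared gradients."""
    cache = decay * cache + (1 - decay) * grad ** 2
    theta = theta - eta * grad / (np.sqrt(cache) + eps)
    return theta, cache
```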
Adam
Adaptive Moment estimation – a combination of the previous approaches [Kingma and Ba, 2014].
- Ridiculously popular – more than 13K citations!
- Probably because it comes with recommended parameters and came with a proof of convergence (which was later shown to be wrong).
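A sketch of the Adam update with the recommended defaults from [Kingma and Ba, 2014] (β1 = 0.9, β2 = 0.999, ε = 1e-8): a momentum-style first moment combined with an RMSProp-style second moment, plus bias correction.

```python
import numpy as np

def adam_step(theta, m, v, grad, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam sketch with the recommended default hyper-parameters."""
    m = beta1 * m + (1 - beta1) * grad         # 1st moment (momentum-like)
    v = beta2 * v + (1 - beta2) * grad ** 2    # 2nd moment (RMSProp-like)
    m_hat = m / (1 - beta1 ** t)               # bias correction, t starts at 1
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```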
So what should I use in practice?
- Adam is a good default in many cases.
- There exist datasets on which Adam and other adaptive methods do not generalize to unseen data at all! [Wilson et al., 2017, "The Marginal Value of Adaptive Gradient Methods in Machine Learning"]
- SGD with Momentum and a decay rate often outperforms Adam (but requires tuning).
source: github.com/YingzhenLi
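In a framework this choice is one line; a sketch in PyTorch, assuming a model is already defined (the hyper-parameter values are illustrative):

```python
import torch

model = torch.nn.Linear(100, 10)   # stand-in for a real network

# SGD with momentum, weight decay and a step learning-rate schedule (often best, needs tuning)
opt_sgd = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
sched = torch.optim.lr_scheduler.StepLR(opt_sgd, step_size=30, gamma=0.1)

# Adam as a low-tuning default
opt_adam = torch.optim.Adam(model.parameters(), lr=1e-3)
```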
Data pre-processing
Since our non-linearities change their behavior around the origin, it makes sense to pre-process the data to zero mean and unit variance:

\[ \hat{x}_i = \frac{x_i - \mathrm{E}[x_i]}{\sqrt{\mathrm{Var}[x_i]}} \qquad (6) \]

source: cs231n.github.io
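A one-function sketch of pre-processing (6), applied per feature over the training set:

```python
import numpy as np

def standardize(X):
    """Zero-mean, unit-variance pre-processing, eq. (6), per feature (column)."""
    mu = X.mean(axis=0)
    std = X.std(axis=0) + 1e-8    # small constant to avoid division by zero
    return (X - mu) / std
```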
Batch Normalization
A common technique is to repeat this normalization throughout the deep network in a differentiable way [Ioffe and Szegedy, 2015].
In practice, a batchnorm layer is added after a conv or fully-connected layer, but before the activation (sketched below).
- In the original paper the authors claimed that this is meant to reduce internal covariate shift.
- More obviously, it reduces 2nd-order correlations between layers. It was recently shown that batchnorm actually doesn't change covariate shift; instead it smooths out the optimization landscape [Santurkar, Tsipras, Ilyas, Madry, 2018].
- In practice this reduces the dependence on initialization and seems to stabilize the flow of gradient descent.
- Using BN usually nets you a gain of a few % in test accuracy.
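A sketch of that placement in PyTorch (layer sizes are illustrative): conv, then batchnorm, then the activation.

```python
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=False),  # bias is redundant before BN
    nn.BatchNorm2d(128),                                        # normalize over the batch
    nn.ReLU(inplace=True),                                      # activation comes after BN
)
```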
Dropout
Another common technique: during the forward pass, set some of the units (activations) to 0 at random, keeping each with probability p; a typical choice is p = 50%.
- The idea is to prevent co-adaptation of neurons.
- At test time we want to remove the randomness. A good approximation is to multiply the outputs by p (sketched below).
- Dropout is more commonly applied to fully-connected layers, though its use is waning.
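A minimal sketch of a dropout forward pass, with p as the keep probability as above (so with p = 50% half of the units are zeroed at train time and the outputs are multiplied by p at test time):

```python
import numpy as np

def dropout_forward(a, p=0.5, train=True):
    """Dropout sketch: zero units at train time, multiply by p at test time."""
    if train:
        mask = np.random.rand(*a.shape) < p   # keep each unit with probability p
        return a * mask
    return a * p                              # test-time approximation from the slide
```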
Finite dataset woes
While we are entering the Big Data age, in practice we often find ourselves with too little data to properly train our deep neural networks.
- What if collecting more data is slow/difficult?
- Can we squeeze out more from what we already have?
Invariance problem
An often-repeated claim about CNNs is that they are invariant to small translations. Independently of whether this is true, they are not invariant to most other types of transformations:
source: cs231n.github.io
Data augmentation
- We can greatly increase the amount of data by performing (a sketch of a typical pipeline is given below):
  – Translations
  – Rotations
  – Reflections
  – Scaling
  – Cropping
  – Adding Gaussian noise
  – Adding occlusion
  – Interpolation
  – etc.
- Crucial for achieving state-of-the-art performance!
- For example, ResNet improves from 11.66% to 6.41% error on the CIFAR-10 dataset and from 44.74% to 27.22% on CIFAR-100.
source: github.com/aleju/imgaug
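A sketch of a typical augmentation pipeline using torchvision (the transform values are illustrative and the CIFAR-10 normalization statistics are approximate; not from the slides):

```python
import torchvision.transforms as T

train_transform = T.Compose([
    T.RandomCrop(32, padding=4),    # random translations via padded cropping
    T.RandomHorizontalFlip(),       # reflections
    T.ToTensor(),
    T.Normalize((0.49, 0.48, 0.45), (0.25, 0.24, 0.26)),  # approximate per-channel stats
])
```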
Transfer Learning
What if you truly have too little data?
- If your data has sufficient similarity to a bigger dataset, then you're in luck!
- Idea: take a model trained, for example, on ImageNet.
- Freeze all but the last few layers and retrain those on your small dataset (sketched below). The bigger your dataset, the more layers you have to retrain.
source: [Haase et al., 2014]
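A sketch of the freezing recipe in PyTorch/torchvision, assuming an ImageNet-pretrained ResNet-18 and a 10-class target problem (API details may differ between library versions):

```python
import torch.nn as nn
import torchvision.models as models

net = models.resnet18(pretrained=True)        # model trained on ImageNet
for param in net.parameters():
    param.requires_grad = False               # freeze all existing layers
net.fc = nn.Linear(net.fc.in_features, 10)    # fresh, trainable head for the new task
```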
Software overview
A. Banburski
![Page 90: MIT 9.520/6.860, Fall 2018 .5cm Class 11: Neural networks ...9.520/fall18/slides/Class11_nn2.pdf · MIT 9.520/6.860, Fall 2018 Class 11: Neural networks { tips, tricks & software](https://reader034.fdocuments.us/reader034/viewer/2022043021/5f3d3019e24e3b486d658ffd/html5/thumbnails/90.jpg)
Why use frameworks?
I You don’t have to implement everything yourself.
I Many inbuilt modules allow quick iteration of ideas – building a neural network becomes putting simple blocks together, and computing backprop is a breeze.
I Someone else already wrote CUDA code to efficiently run training on GPUs (or TPUs).
A. Banburski
![Page 93: MIT 9.520/6.860, Fall 2018 .5cm Class 11: Neural networks ...9.520/fall18/slides/Class11_nn2.pdf · MIT 9.520/6.860, Fall 2018 Class 11: Neural networks { tips, tricks & software](https://reader034.fdocuments.us/reader034/viewer/2022043021/5f3d3019e24e3b486d658ffd/html5/thumbnails/93.jpg)
Main design difference
source: Introduction to Chainer
A. Banburski
![Page 94: MIT 9.520/6.860, Fall 2018 .5cm Class 11: Neural networks ...9.520/fall18/slides/Class11_nn2.pdf · MIT 9.520/6.860, Fall 2018 Class 11: Neural networks { tips, tricks & software](https://reader034.fdocuments.us/reader034/viewer/2022043021/5f3d3019e24e3b486d658ffd/html5/thumbnails/94.jpg)
PyTorch concepts
Similar in code to numpy.
I Tensor: nearly identical to np.array; can run on the GPU simply by moving it there with .cuda() or .to(device).
I Autograd: package for automatic computation of backprop andconstruction of computational graphs.
I Module: neural network layer storing weights
I DataLoader: class for simplifying efficient data loading
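A small sketch tying these pieces together (the toy data and model are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

# Tensor: behaves much like a NumPy array, but can be moved to the GPU.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
x = torch.randn(100, 10, device=device)
y = torch.randn(100, 1, device=device)

# Module: a layer (or whole network) that stores its weights.
model = nn.Linear(10, 1).to(device)

# DataLoader: batches and shuffles a dataset for you.
loader = DataLoader(TensorDataset(x, y), batch_size=16, shuffle=True)

# Autograd: backward() builds the graph and fills in .grad for each weight.
for xb, yb in loader:
    model.zero_grad()                  # clear gradients from the last batch
    loss = F.mse_loss(model(xb), yb)
    loss.backward()
```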
A. Banburski
![Page 99: MIT 9.520/6.860, Fall 2018 .5cm Class 11: Neural networks ...9.520/fall18/slides/Class11_nn2.pdf · MIT 9.520/6.860, Fall 2018 Class 11: Neural networks { tips, tricks & software](https://reader034.fdocuments.us/reader034/viewer/2022043021/5f3d3019e24e3b486d658ffd/html5/thumbnails/99.jpg)
PyTorch - optimization
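The slide's code listing is not reproduced in this transcript; a minimal sketch of the usual torch.optim loop (model, data, and hyper-parameters are stand-ins) is:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                          # stand-in model
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2,
                            momentum=0.9, weight_decay=1e-4)

x, y = torch.randn(64, 10), torch.randn(64, 1)    # stand-in mini-batch
for step in range(100):
    optimizer.zero_grad()     # clear gradients accumulated from the last step
    loss = loss_fn(model(x), y)
    loss.backward()           # autograd computes d(loss)/d(parameters)
    optimizer.step()          # apply the SGD-with-momentum update
```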
A. Banburski
![Page 100: MIT 9.520/6.860, Fall 2018 .5cm Class 11: Neural networks ...9.520/fall18/slides/Class11_nn2.pdf · MIT 9.520/6.860, Fall 2018 Class 11: Neural networks { tips, tricks & software](https://reader034.fdocuments.us/reader034/viewer/2022043021/5f3d3019e24e3b486d658ffd/html5/thumbnails/100.jpg)
PyTorch - ResNet in one page
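The one-page ResNet itself is not reproduced here; as a sketch, the basic residual block it is built from looks roughly like this in PyTorch:

```python
import torch.nn as nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    """Basic residual block: output = ReLU(x + F(x))."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(x + out)   # the skip connection
```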
source: @jeremyphoward
A. Banburski
![Page 101: MIT 9.520/6.860, Fall 2018 .5cm Class 11: Neural networks ...9.520/fall18/slides/Class11_nn2.pdf · MIT 9.520/6.860, Fall 2018 Class 11: Neural networks { tips, tricks & software](https://reader034.fdocuments.us/reader034/viewer/2022043021/5f3d3019e24e3b486d658ffd/html5/thumbnails/101.jpg)
Tensorflow static graphs
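The figure from the slide is not reproduced; a minimal TensorFlow 1.x-style sketch of the define-then-run (static graph) pattern, with illustrative shapes, is:

```python
import numpy as np
import tensorflow as tf  # TensorFlow 1.x API

# 1) Define the graph symbolically: nothing is computed yet.
x = tf.placeholder(tf.float32, shape=(None, 10))
w = tf.Variable(tf.random_normal((10, 1)))
y = tf.matmul(x, w)

# 2) Run the graph inside a session, feeding in concrete values.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    out = sess.run(y, feed_dict={x: np.random.randn(4, 10).astype(np.float32)})
```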
source: cs231n.github.io
A. Banburski
![Page 102: MIT 9.520/6.860, Fall 2018 .5cm Class 11: Neural networks ...9.520/fall18/slides/Class11_nn2.pdf · MIT 9.520/6.860, Fall 2018 Class 11: Neural networks { tips, tricks & software](https://reader034.fdocuments.us/reader034/viewer/2022043021/5f3d3019e24e3b486d658ffd/html5/thumbnails/102.jpg)
Keras wrapper - closer to PyTorch
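For comparison, a minimal Keras model (architecture and hyper-parameters are illustrative): layers are stacked much like PyTorch Modules, and compile/fit hide the session plumbing.

```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(128, activation='relu', input_shape=(784,)),
    keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='sgd',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(x_train, y_train, batch_size=64, epochs=5)   # with your own data
```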
source: cs231n.github.io
A. Banburski
![Page 103: MIT 9.520/6.860, Fall 2018 .5cm Class 11: Neural networks ...9.520/fall18/slides/Class11_nn2.pdf · MIT 9.520/6.860, Fall 2018 Class 11: Neural networks { tips, tricks & software](https://reader034.fdocuments.us/reader034/viewer/2022043021/5f3d3019e24e3b486d658ffd/html5/thumbnails/103.jpg)
Tensorboard - very useful tool for visualization
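One simple way to hook it up from Keras (the log directory name is arbitrary); the curves can then be viewed with tensorboard --logdir=./logs:

```python
from tensorflow import keras

# Write loss/accuracy curves to ./logs during training.
tb_callback = keras.callbacks.TensorBoard(log_dir='./logs')
# model.fit(x_train, y_train, epochs=5, callbacks=[tb_callback])  # with your data
```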
A. Banburski
![Page 104: MIT 9.520/6.860, Fall 2018 .5cm Class 11: Neural networks ...9.520/fall18/slides/Class11_nn2.pdf · MIT 9.520/6.860, Fall 2018 Class 11: Neural networks { tips, tricks & software](https://reader034.fdocuments.us/reader034/viewer/2022043021/5f3d3019e24e3b486d658ffd/html5/thumbnails/104.jpg)
Tensorflow overview
I Main difference – uses static graphs. Longer code, but more optimized. In practice PyTorch is faster to experiment with.
I With the Keras wrapper, however, the code is more similar to PyTorch.
I Can use TPUs
A. Banburski
![Page 107: MIT 9.520/6.860, Fall 2018 .5cm Class 11: Neural networks ...9.520/fall18/slides/Class11_nn2.pdf · MIT 9.520/6.860, Fall 2018 Class 11: Neural networks { tips, tricks & software](https://reader034.fdocuments.us/reader034/viewer/2022043021/5f3d3019e24e3b486d658ffd/html5/thumbnails/107.jpg)
But
I Tensorflow has added dynamic batching, which makes dynamic graphs possible.
I PyTorch is merging with Caffe2, which will provide static graphs too!
I Which one to choose then?
– PyTorch is more popular in the research community for easy development and debugging.
– In the past the better choice for production was Tensorflow. Still the only choice if you want to use TPUs.
A. Banburski