Lecture 37: ConvNets (Cont’d) and Training
CS 4670/5670, Sean Bell
[http://bbabenko.tumblr.com/post/83319141207/convolutional-learnings-things-i-learned-by]
(Unrelated) Dog vs Food
[Karen Zack, @teenybiscuit]
(Recap) Backprop
From Geoff Hinton’s seminar at Stanford yesterday
(Recap) Backprop
Parameters (all of the weights and biases in the network, stacked together):
$$\theta = \begin{bmatrix} \theta_1 & \theta_2 & \cdots \end{bmatrix}$$
Gradient:
$$\frac{\partial L}{\partial \theta} = \begin{bmatrix} \partial L / \partial \theta_1 \\ \partial L / \partial \theta_2 \\ \vdots \end{bmatrix}$$
Intuition: “How fast would the error change if I changed myself by a little bit?”
(Recap) Backprop
Forward propagation: compute the activations and loss
$$x \rightarrow h^{(1)} \rightarrow \cdots \rightarrow s \rightarrow L$$
(each function in the chain is parameterized by $\theta^{(1)}, \dots, \theta^{(n)}$)
Backward propagation: compute the gradient (“error signal”)
$$\frac{\partial L}{\partial s}, \quad \frac{\partial L}{\partial h^{(1)}}, \quad \dots, \quad \frac{\partial L}{\partial x}, \qquad \text{and} \qquad \frac{\partial L}{\partial \theta^{(1)}}, \dots, \frac{\partial L}{\partial \theta^{(n)}}$$
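To make the recap concrete, here is a minimal numpy sketch of one forward/backward pass through a single linear layer with a squared-error loss (the layer and loss are illustrative choices, not the network from the slides):

```python
import numpy as np

# Forward pass: compute activations h and loss L for one linear layer.
rng = np.random.default_rng(0)
x = rng.standard_normal(4)                # input
W = rng.standard_normal((3, 4))           # weights (part of theta)
b = np.zeros(3)                           # biases (part of theta)
target = rng.standard_normal(3)

h = W @ x + b                             # activation
L = 0.5 * np.sum((h - target) ** 2)       # loss

# Backward pass: chain rule, starting from dL/dh (the "error signal").
dL_dh = h - target
dL_dW = np.outer(dL_dh, x)                # gradient w.r.t. the weights
dL_db = dL_dh                             # gradient w.r.t. the biases
dL_dx = W.T @ dL_dh                       # error signal sent to earlier layers
```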
(Recap) A ConvNet is a sequence of convolutional layers, interspersed with activation functions (and possibly other layer types)
Web demo 1: Convolution
http://cs231n.github.io/convolutional-networks/
[Karpathy 2016]
Web demo 2: ConvNet in a Browser
http://cs.stanford.edu/people/karpathy/convnetjs/demo/mnist.html
[Karpathy 2014]
Convolution: Stride
During convolution, the weights “slide” along the input to generate each output.
[Figure: weights sliding over the input to produce the output]
Recall that at each position, we are doing a 3D sum over (channel, row, column):
$$h_r = \sum_{ijk} x_{rijk} W_{ijk} + b$$
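As a sketch of what this sum looks like in code, here is a naive numpy implementation of one filter convolved over a (channel, row, column) input with a stride (the function name and interface are ours, not from the lecture):

```python
import numpy as np

def conv_forward(x, W, b, stride=1):
    """Convolve one (C, k, k) filter over a (C, H, W) input with a stride.
    At each output position this is the 3D sum over (channel, row, column)."""
    C, H, Wd = x.shape
    _, k, _ = W.shape
    h_out = (H - k) // stride + 1
    w_out = (Wd - k) // stride + 1
    out = np.zeros((h_out, w_out))
    for r in range(h_out):
        for c in range(w_out):
            patch = x[:, r*stride:r*stride+k, c*stride:c*stride+k]
            out[r, c] = np.sum(patch * W) + b   # h = sum_ijk x*W + b
    return out
```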
Convolution: Stride
But we can also convolve with a stride, e.g. stride = 2.
- Notice that with certain strides, we may not be able to cover all of the input
- The output is also half the size of the input
Convolution: Padding
We can also pad the input with zeros. Here, pad = 1, stride = 2.
[Figure: the input bordered by a one-pixel ring of zeros]
Convolution: How big is the output?
With input width $w_{in}$, kernel size $k$, stride $s$, and padding $p$, the output has size:
$$w_{out} = \left\lfloor \frac{w_{in} + 2p - k}{s} \right\rfloor + 1$$
Example: $k = 3$, $s = 1$, $p = 1$:
$$w_{out} = \left\lfloor \frac{w_{in} + 2p - k}{s} \right\rfloor + 1 = \left\lfloor \frac{w_{in} + 2 - 3}{1} \right\rfloor + 1 = w_{in}$$
VGGNet [Simonyan 2014] uses filters of this shape.
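The formula is easy to check in code; a small sketch (the input widths are illustrative):

```python
def conv_output_size(w_in, k, s, p):
    """w_out = floor((w_in + 2p - k) / s) + 1"""
    return (w_in + 2 * p - k) // s + 1

# VGGNet-style filters (k=3, s=1, p=1) preserve the input width:
assert conv_output_size(224, k=3, s=1, p=1) == 224
# A stride-2 convolution roughly halves it:
print(conv_output_size(224, k=3, s=2, p=1))  # 112
```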
Pooling
Figure: Andrej Karpathy
For most ConvNets, convolution is often followed by pooling:
- Creates a smaller representation while retaining the most important information
- The “max” operation is the most common
- Why might “avg” be a poor choice?
Max Pooling
Figure: Andrej Karpathy
What’s the backprop rule for max pooling?
- In the forward pass, store the index that took the max
- In the backward pass, route the gradient to the input at that stored index (all other inputs receive zero gradient)
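A minimal numpy sketch of this rule for a single-channel 2D input (function names and interface are ours):

```python
import numpy as np

def maxpool_forward(x, k=2, stride=2):
    """2D max pooling; stores the argmax index in each window for backprop."""
    H, W = x.shape
    h_out, w_out = (H - k) // stride + 1, (W - k) // stride + 1
    out = np.zeros((h_out, w_out))
    idx = np.zeros((h_out, w_out), dtype=int)   # flat index of each window max
    for r in range(h_out):
        for c in range(w_out):
            win = x[r*stride:r*stride+k, c*stride:c*stride+k]
            out[r, c] = win.max()
            idx[r, c] = win.argmax()
    return out, idx

def maxpool_backward(dout, idx, x_shape, k=2, stride=2):
    """Routes each upstream gradient to the input element that took the max."""
    dx = np.zeros(x_shape)
    h_out, w_out = dout.shape
    for r in range(h_out):
        for c in range(w_out):
            i, j = divmod(idx[r, c], k)         # unflatten the stored index
            dx[r*stride + i, c*stride + j] += dout[r, c]
    return dx
```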
![Page 38: Lecture 37: ConvNets (Cont’d) and Training...ConvNets (e.g. [Krizhevsky 2012]) overfit the data. E.g. 224x224 patches extracted from 256x256 images Randomly reflect horizontally](https://reader033.fdocuments.us/reader033/viewer/2022042409/5f26bdbcfd29af162068576d/html5/thumbnails/38.jpg)
Example ConvNet
Figure: Andrej Karpathy
![Page 39: Lecture 37: ConvNets (Cont’d) and Training...ConvNets (e.g. [Krizhevsky 2012]) overfit the data. E.g. 224x224 patches extracted from 256x256 images Randomly reflect horizontally](https://reader033.fdocuments.us/reader033/viewer/2022042409/5f26bdbcfd29af162068576d/html5/thumbnails/39.jpg)
Example ConvNet
Figure: Andrej Karpathy
![Page 40: Lecture 37: ConvNets (Cont’d) and Training...ConvNets (e.g. [Krizhevsky 2012]) overfit the data. E.g. 224x224 patches extracted from 256x256 images Randomly reflect horizontally](https://reader033.fdocuments.us/reader033/viewer/2022042409/5f26bdbcfd29af162068576d/html5/thumbnails/40.jpg)
Example ConvNet
Figure: Andrej Karpathy
![Page 41: Lecture 37: ConvNets (Cont’d) and Training...ConvNets (e.g. [Krizhevsky 2012]) overfit the data. E.g. 224x224 patches extracted from 256x256 images Randomly reflect horizontally](https://reader033.fdocuments.us/reader033/viewer/2022042409/5f26bdbcfd29af162068576d/html5/thumbnails/41.jpg)
Example ConvNet
Figure: Andrej Karpathy
10 3x3 conv filters, stride 1, pad 1
2x2 pool filters, stride 2
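Applying the output-size formula to this stack, a small sketch of how the spatial size evolves (the 32x32 input size is an assumption for illustration):

```python
def out_size(w_in, k, s, p):
    return (w_in + 2 * p - k) // s + 1

w = 32                            # assume a 32x32 input for illustration
w = out_size(w, k=3, s=1, p=1)    # conv, 10 filters of 3x3 -> 32x32x10
w = out_size(w, k=2, s=2, p=0)    # 2x2 max pool, stride 2 -> 16x16x10
print(w)  # 16
```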
Example: AlexNet [Krizhevsky 2012]
Figure: [Karnowski 2015] (with corrections)
[Figure: AlexNet architecture, from a 3x227x227 input through conv/max/norm/full layers to 1000 class scores]
“max”: max pooling; “norm”: local response normalization; “full”: fully connected
Example: AlexNet [Krizhevsky 2012]
Figure: [Karnowski 2015]
zoom in
Questions?
… why so many layers?
[Szegedy et al, 2014]
How do you actually train these things?
How do you actually train these things?
Roughly speaking:
- Gather labeled data
- Find a ConvNet architecture
- Minimize the loss
Training a convolutional neural network
• Split and preprocess your data
• Choose your network architecture
• Initialize the weights
• Find a learning rate and regularization strength
• Minimize the loss and monitor progress
• Fiddle with knobs
Mini-batch Gradient Descent
Loop:
1. Sample a batch of training data (~100 images)
2. Forward pass: compute loss (averaged over the batch)
3. Backward pass: compute gradient
4. Update all parameters
Note: usually called “stochastic gradient descent” even though SGD has a batch size of 1
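A sketch of this loop in Python, where `model.loss_and_gradient` and `model.params` are a hypothetical interface standing in for the forward and backward passes (not a real API):

```python
import numpy as np

def train(model, X, y, batch_size=100, lr=1e-2, steps=1000):
    n = X.shape[0]
    for step in range(steps):
        # 1. Sample a batch of training data
        batch = np.random.choice(n, batch_size, replace=False)
        # 2-3. Forward pass (loss averaged over the batch) + backward pass
        loss, grads = model.loss_and_gradient(X[batch], y[batch])
        # 4. Update all parameters with a gradient step
        for name, grad in grads.items():
            model.params[name] -= lr * grad
```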
Regularization
Regularization reduces overfitting:
$$L = L_{data} + L_{reg}, \qquad L_{reg} = \lambda \cdot \tfrac{1}{2} \| W \|_2^2$$
[Andrej Karpathy http://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html]
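The penalty and its gradient contribution in numpy (a sketch; note that the gradient of the L2 term is simply λW, which is why it is also called “weight decay”):

```python
import numpy as np

def l2_regularization(W, lam):
    """Returns L_reg = lam * 0.5 * ||W||^2 and its gradient lam * W."""
    return lam * 0.5 * np.sum(W ** 2), lam * W
```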
Overfitting
[Image: https://en.wikipedia.org/wiki/File:Overfitted_Data.png]
Overfitting: modeling noise in the training set instead of the “true” underlying relationship
Underfitting: insufficiently modeling the relationship in the training set
General rule: models that are “bigger” or have more capacity are more likely to overfit
(0) Dataset split
Split your data into “train”, “validation”, and “test”:
Dataset → Train | Validation | Test
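For example, a simple random split in numpy (the 70/15/15 ratios are our assumption, not from the slide):

```python
import numpy as np

def split_dataset(n, seed=0):
    idx = np.random.default_rng(seed).permutation(n)
    n_train, n_val = int(0.70 * n), int(0.15 * n)
    return (idx[:n_train],                      # train
            idx[n_train:n_train + n_val],       # validation
            idx[n_train + n_val:])              # test

train_idx, val_idx, test_idx = split_dataset(1000)
```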
(0) Dataset split
Train: gradient descent and fine-tuning of parameters
Validation: determining hyper-parameters (learning rate, regularization strength, etc.) and picking an architecture
Test: estimate real-world performance (e.g. accuracy = fraction correctly classified)
(0) Dataset split
Be careful with false discovery: once we have used a test set, we should not use it again (but nobody follows this rule, since it’s expensive to collect datasets). Instead, try to avoid looking at the test score until the end.
(0) Dataset split
Cross-validation: cycle which data is used as validation
[Figure: the train/validation boundary rotates across folds while the test set stays fixed]
Average scores across validation splits
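A sketch of this procedure, where `evaluate` is a hypothetical function that trains on one index set and scores on another:

```python
import numpy as np

def cross_validate(n_trainval, evaluate, k=5):
    """Rotate which fold is validation; the test set stays held out."""
    folds = np.array_split(np.arange(n_trainval), k)
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        scores.append(evaluate(train_idx, val_idx))
    return np.mean(scores)   # average scores across validation splits
```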
(1) Data preprocessing
Figure: Andrej Karpathy
Preprocess the data so that learning is better conditioned:
(1) Data preprocessing
Slide: Andrej Karpathy
In practice, you may also see PCA and Whitening of the data:
(1) Data preprocessing
Figure: Alex Krizhevsky
For ConvNets, typically only the mean is subtracted.
A per-channel mean also works (one value per R,G,B).
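A per-channel mean subtraction sketch in numpy for a batch of (N, H, W, 3) images; note the mean should be computed on the training set only:

```python
import numpy as np

def subtract_channel_mean(train_images, images):
    """Subtract one mean per R, G, B channel (computed on the training set)."""
    mean = train_images.mean(axis=(0, 1, 2))   # shape (3,) for (N, H, W, 3)
    return images - mean
```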
(1) Data preprocessing
Augment the data: extract random crops from the input, with slightly jittered offsets. Without this, typical ConvNets (e.g. [Krizhevsky 2012]) overfit the data.
E.g. 224x224 patches extracted from 256x256 images
Randomly reflect horizontally
Perform the augmentation live during training
Figure: Alex Krizhevsky
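A sketch of live augmentation in numpy, a random 224x224 crop from a 256x256 image plus a random horizontal reflection (function name and interface are ours):

```python
import numpy as np

def augment(img, crop=224, rng=None):
    """Random crop with jittered offset plus random horizontal reflection."""
    rng = rng or np.random.default_rng()
    H, W, _ = img.shape                  # e.g. a 256x256x3 image
    r = rng.integers(0, H - crop + 1)    # jittered crop offsets
    c = rng.integers(0, W - crop + 1)
    patch = img[r:r + crop, c:c + crop]
    if rng.random() < 0.5:
        patch = patch[:, ::-1]           # reflect horizontally
    return patch
```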