Machine Learning
Lecture 09: Convolutional Neural Networks
Nevin L. Zhang
Department of Computer Science and Engineering
The Hong Kong University of Science and Technology
This set of notes is based on various sources on the internet and
Stanford CS course CS231n: Convolutional Neural Networks for Visual Recognition. http://cs231n.stanford.edu/
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press. http://www.deeplearningbook.org
Nevin L. Zhang (HKUST) Machine Learning 1 / 45
What are Convolutional Neural Networks?
Outline
1 What are Convolutional Neural Networks?
2 Convolutional Layer
3 Technical Issues with Convolutional Layer
4 Pooling Layer
5 CNN Examples
Convolutional Neural Networks
Convolutional Neural Networks (CNNs, ConvNets) are a specialized kind of neural network for processing data that has a known, grid-like topology.
Examples include
Time-series data, which can be thought of as a 1D grid taking samples at regular time intervals, and
Image data, which can be thought of as a 2D grid of pixels.
We will focus on image data.
CIFAR-10 Dataset
We will use the CIFAR-10 Dataset as a background example.
The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class.
There are 50000 training images and 10000 test images.
CNN for CIFAR-10
A CNN represents a function from the raw image pixels on one end to class scores at the other.
Units at a CNN layer are arranged in 3D volumes with dimensions width, height, depth.
For CIFAR-10, an input image is represented as a volume of units with size 32 x 32 x 3 (32 wide, 32 high, 3 color channels).
The output (10 class scores) is represented as a volume of units with size 1 x 1 x 10.
CNN Layers
A CNN is made up of multiple layers.
Every layer has a simple API: it transforms an input 3D volume to an output 3D volume with some differentiable function that may or may not have parameters.
Three main types of layers are used to build CNN architectures: Convolutional Layer, Pooling Layer, and Fully-Connected Layer (exactly as seen in regular Neural Networks).
Convolutional Layer
Each convolutional unit is connected only to units in a small patch (which is called the receptive field of the unit) of the previous layer.
The receptive field is local in space (width and height), but always full along the entire depth of the input volume.
The parameters of a convolutional unit are the connection weights and the bias. They are to be learned.
Intuitively, the task is to learn weights so that the unit will activate when some type of visual feature such as an edge is present.
A convolutional layer consists of a volume of convolutional units.
All units on a given depth slice share the same parameters, so that we can detect the same feature at different locations. (Edges can be at multiple locations.)
Hence, the set of weights is called a filter or kernel.
Different units on the depth slices are obtained by sliding the filter.
There are multiple depth slices (i.e., multiple filters) so that different features can be detected.
The output of each depth slice is called a feature map.
The output of a conv layer is a collection of stacked feature maps.
Mathematically, a conv layer maps a 3D tensor to another 3D tensor.
Example of Learned Filters
Here are the 96 filters learned in the first convolution layer in AlexNet (with receptive field 11 x 11 x 3; now people mostly use 3 x 3 x 3).
A lot of these filters turn out to be edge detection filters common to human visual systems.
Example of Feature Maps
Computation of a convolutional unit
Let $I$ be a 2D image (one of the three channels), and $K$ be the filter.
$K(m, n) = 0$ when $|m| > r$ or $|n| > r$, where $2r + 1$ is the width and height of the receptive field.
The computation carried out by the convolutional unit at coordinates $(i, j)$ is
$$S(i, j) = \sum_{m,n} I(i + m, j + n)\, K(m, n)$$
This is cross-correlation, although it is referred to as convolution in deep learning (en.wikipedia.org/wiki/Cross-correlation).
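The sum computed by a convolutional unit can be sketched in plain Python. This is only an illustration, not code from the lecture: the function name `cross_correlate`, the toy image and the vertical-edge kernel are assumptions made for the example. The kernel is stored as a nested list so that entry `[m + r][n + r]` holds K(m, n), and outputs are computed only where the receptive field fits entirely inside the image.

```python
def cross_correlate(image, kernel):
    """S(i, j) = sum over m, n of I(i + m, j + n) * K(m, n), at valid positions only."""
    r = len(kernel) // 2  # the kernel covers offsets -r..r in each direction
    height, width = len(image), len(image[0])
    output = []
    for i in range(r, height - r):
        row = []
        for j in range(r, width - r):
            total = 0.0
            for m in range(-r, r + 1):
                for n in range(-r, r + 1):
                    total += image[i + m][j + n] * kernel[m + r][n + r]
            row.append(total)
        output.append(row)
    return output


# Toy 4 x 5 image: dark (0) on the left, bright (1) on the right.
image = [
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
]
# Hypothetical vertical-edge filter: responds where intensity rises from left to right.
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]
feature_map = cross_correlate(image, kernel)
print(feature_map)  # [[3.0, 3.0, 0.0], [3.0, 3.0, 0.0]]
```

The same kernel is applied at every position, so the response is large wherever the dark-to-bright edge falls inside the receptive field and zero in the uniform region.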
NOTE: Cross-correlation vs convolution
Cross-correlation: $\sum_{m,n} I(i + m, j + n)\, K(m, n)$
Convolution: $\sum_{m,n} I(i - m, j - n)\, K(m, n)$
Let us flip the kernel $K$ to get $K'(m, n) = K(-m, -n)$. Then
$$\sum_{m,n} I(i - m, j - n)\, K(m, n) = \sum_{m,n} I(i - m, j - n)\, K'(-m, -n) = \sum_{m,n} I(i + m, j + n)\, K'(m, n)$$
The last step relabels $(m, n)$ as $(-m, -n)$, which does not change a sum taken over all $m$ and $n$.
So, convolution with kernel $K$ is the same as cross-correlation with the flipped kernel $K'$.
And because the kernel is to be learned, it is not necessary to distinguish between the two kernels (and hence between cross-correlation and convolution) in deep learning.
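The identity above can be checked numerically at a single position. The sketch below is an illustration under assumed toy values: the helper names `correlate_at`, `convolve_at` and `flip`, and the 3 x 3 image and kernel, are not from the lecture. Kernels are stored so that entry `[m + r][n + r]` holds K(m, n).

```python
def correlate_at(image, kernel, i, j):
    """Cross-correlation at (i, j): sum of I(i + m, j + n) * K(m, n)."""
    r = len(kernel) // 2
    return sum(
        image[i + m][j + n] * kernel[m + r][n + r]
        for m in range(-r, r + 1)
        for n in range(-r, r + 1)
    )


def convolve_at(image, kernel, i, j):
    """Convolution at (i, j): sum of I(i - m, j - n) * K(m, n)."""
    r = len(kernel) // 2
    return sum(
        image[i - m][j - n] * kernel[m + r][n + r]
        for m in range(-r, r + 1)
        for n in range(-r, r + 1)
    )


def flip(kernel):
    """K'(m, n) = K(-m, -n): reverse the order of the rows and of the columns."""
    return [row[::-1] for row in kernel[::-1]]


image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
kernel = [
    [1, 0, 2],
    [0, 3, 0],
    [4, 0, 5],
]
conv = convolve_at(image, kernel, 1, 1)
corr_flipped = correlate_at(image, flip(kernel), 1, 1)
corr_plain = correlate_at(image, kernel, 1, 1)
print(conv, corr_flipped, corr_plain)  # 55 55 95
```

Convolution with K and cross-correlation with the flipped kernel agree, while cross-correlation with the unflipped K gives a different number because this K is not symmetric.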
Convolutional demo: http://cs231n.github.io/convolutional-networks/
Computation of a convolutional unit
The result of convolution is passed through a nonlinear activation function (ReLU) to get the output of the unit.
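As a small sketch of this step, the output of one unit can be written as ReLU applied to the weighted sum plus the bias. The names `relu` and `unit_output` and all numbers below are illustrative assumptions; the receptive field is flattened into a plain list for brevity.

```python
def relu(z):
    """Rectified linear unit: pass positive values through, clamp negatives to 0."""
    return max(0.0, z)


def unit_output(patch, weights, bias):
    """Weighted sum over the receptive field, plus bias, followed by ReLU."""
    pre_activation = sum(x * w for x, w in zip(patch, weights)) + bias
    return relu(pre_activation)


patch = [1.0, 2.0, 3.0]  # flattened receptive field (toy values)
silent = unit_output(patch, [0.5, -1.0, 0.25], bias=0.5)  # pre-activation is -0.25
active = unit_output(patch, [0.5, 1.0, 0.5], bias=0.5)    # pre-activation is 4.5
print(silent, active)  # 0.0 4.5
```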
Technical Issues with Convolutional Layer
Reduction in Spatial Size
To apply the filter to an image, we can move the filter 1 pixel at a time from left to right and top to bottom until we process every pixel.
And we have a convolutional unit for each location.
With a 3 x 3 filter, the width and height of the array of convolutional units are each reduced by 2 from the width and height of the array of input units (in general, by F − 1 for a filter of width and height F).
Zero Padding
If we want to maintain the spatial dimensions, we can pad the border with extra zeros or replicate the edge of the original image.
Zero padding helps to better preserve information on the edge.
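Zero padding can be sketched as follows. The helper name `zero_pad` and the 3 x 3 toy image are assumptions for illustration; the image is a nested list of rows.

```python
def zero_pad(image, p):
    """Surround a 2D image (list of rows) with p rows and columns of zeros on every side."""
    padded_width = len(image[0]) + 2 * p
    padded = [[0] * padded_width for _ in range(p)]
    for row in image:
        padded.append([0] * p + list(row) + [0] * p)
    padded.extend([0] * padded_width for _ in range(p))
    return padded


image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
padded = zero_pad(image, 1)
for row in padded:
    print(row)
```

The padded image is 5 x 5, so sliding a 3 x 3 filter over it one pixel at a time gives 5 − 3 + 1 = 3 positions per side, the same 3 x 3 size as the unpadded input.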
Stride
For a CNN, sometimes we do not move the filter only by 1 pixel. If we move the filter 2 pixels to the right at each step, we say the stride is equal to 2.
It is uncommon to use stride 3 or more.
Summary of Convolutional Layer
Accepts a volume of size W1 × H1 × D1
Requires four hyperparameters:
Number of filters K, their spatial extent F, the stride S, the amount of zero padding P.
Produces a volume of size W2 × H2 × D2 where:
W2 = (W1 − F + 2P)/S + 1
H2 = (H1 − F + 2P)/S + 1
D2 = K
Need to make sure that (W1 − F + 2P)/S and (H1 − F + 2P)/S are integers.
For example, we cannot have W1 = 10, P = 0, F = 3 and S = 2, since (10 − 3 + 0)/2 = 3.5 is not an integer.
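The output-size formula and the integrality check can be sketched as a small helper. The function name `conv_output_size` is an assumption, as is the choice to raise an error for invalid settings. The 32-pixel width is the CIFAR-10 input size; pairing it with F = 3 is only an illustrative choice.

```python
def conv_output_size(w1, f, p, s):
    """Return (W1 - F + 2P)/S + 1, rejecting settings where the filter does not fit evenly."""
    span = w1 - f + 2 * p
    if span < 0 or span % s != 0:
        raise ValueError(
            f"(W1 - F + 2P) = {span} is not a non-negative multiple of S = {s}"
        )
    return span // s + 1


print(conv_output_size(32, 3, 0, 1))  # 30: no padding, so the width shrinks by F - 1 = 2
print(conv_output_size(32, 3, 1, 1))  # 32: P = 1 preserves the width for F = 3, S = 1

try:
    conv_output_size(10, 3, 0, 2)  # the invalid example from the slide
except ValueError as err:
    print("rejected:", err)
```

The same function applies to the height by passing H1 in place of W1.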
Number of parameters of a convolutional layer
(F·F·D1 + 1)·K parameters (K filters, each with F·F·D1 weights plus 1 bias).
A fully connected layer with the same number of units and no parameter sharing would require (W1·H1·D1 + 1)·W2·H2·D2 parameters, which can be prohibitively large.
AlexNet accepts images of size [227 x 227 x 3].
For the first convolutional layer: F = 11, S = 4, P = 0 and K = 96.
W2 = H2 = 55 and D2 = 96.
The number of parameters is (11 × 11 × 3 + 1) × 96 = 34,944.
If an FC layer were used instead, the number of parameters would be
(227 × 227 × 3 + 1) × 55 × 55 × 96 = 44,892,355,200.
This is a key motivation for using convolutional layers.
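The two counts above can be reproduced with a few lines of Python. The function names `conv_params` and `fc_params` are assumptions for illustration; the formulas and the AlexNet numbers are the ones given on the slide.

```python
def conv_params(f, d1, k):
    """K filters, each with F*F*D1 weights and one bias."""
    return (f * f * d1 + 1) * k


def fc_params(w1, h1, d1, w2, h2, d2):
    """Every output unit has its own weight for every input unit, plus one bias."""
    return (w1 * h1 * d1 + 1) * (w2 * h2 * d2)


conv_count = conv_params(f=11, d1=3, k=96)
fc_count = fc_params(227, 227, 3, 55, 55, 96)
print(f"{conv_count:,}")  # 34,944
print(f"{fc_count:,}")    # 44,892,355,200
```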
Pooling Layer
Pooling
The objective of a pooling layer is to reduce the spatial size of the feature map.
It aggregates a patch of units into one unit.
There are different ways to do this, and MAX pooling is found to work the best.
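A minimal sketch of max pooling on a single feature map follows. The function name `max_pool`, the 2 x 2 patch size with stride 2, and the toy values are assumptions for illustration; the slide does not fix a patch size.

```python
def max_pool(feature_map, size=2, stride=2):
    """Replace each size x size patch of the feature map by its maximum value."""
    height, width = len(feature_map), len(feature_map[0])
    pooled = []
    for i in range(0, height - size + 1, stride):
        row = []
        for j in range(0, width - size + 1, stride):
            patch = [
                feature_map[i + a][j + b]
                for a in range(size)
                for b in range(size)
            ]
            row.append(max(patch))
        pooled.append(row)
    return pooled


feature_map = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]
pooled = max_pool(feature_map)
print(pooled)  # [[6, 8], [3, 4]]
```

With these settings a 4 x 4 feature map becomes 2 x 2. In a CNN, the same operation is applied to each depth slice independently, so the depth is unchanged.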
Pooling also helps to make the representation approximately invariant to small translations of the input.
Invariance to translation means that if we translate the input by a small amount, the values of most of the pooled outputs do not change.
Figure (from the slide): the bottom row shows the values before pooling and the top row shows the max-pooled values. After the input is shifted by one pixel, every value in the bottom row has changed, but only half of the values in the top row have changed.
Invariance to local translation can be a very useful property if we care more about whether some feature is present than exactly where it is.
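The effect can be reproduced on a 1D toy signal. This is an illustrative sketch, not the figure from the slide: the helper name `max_pool_1d`, the window width of 3 with stride 1, and the input values (including the arbitrary value 0.3 that enters on the left after the shift) are all assumptions.

```python
def max_pool_1d(values, width=3):
    """Max over each window of `width` neighbours, moving one position at a time."""
    return [max(values[k:k + width]) for k in range(len(values) - width + 1)]


original = [0.1, 1.0, 0.2, 0.1, 0.0, 0.1]
shifted = [0.3] + original[:-1]  # shift right by one position

pooled_original = max_pool_1d(original)
pooled_shifted = max_pool_1d(shifted)

changed_inputs = sum(a != b for a, b in zip(original, shifted))
changed_outputs = sum(a != b for a, b in zip(pooled_original, pooled_shifted))
print(pooled_original)  # [1.0, 1.0, 0.2, 0.1]
print(pooled_shifted)   # [1.0, 1.0, 1.0, 0.2]
print(changed_inputs, "of", len(original), "inputs changed")          # 6 of 6
print(changed_outputs, "of", len(pooled_original), "outputs changed")  # 2 of 4
```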
Fully-Connected Layer
Neurons in a fully connected layer have full connections to all activations in the previous layer, as seen in regular Neural Networks.
Fully-connected layers are typically used as the last layers.
CNN Examples
ImageNet
The ImageNet project is a large visual database designed for use in visual object recognition software research.
As of 2016, over ten million URLs of images have been hand-annotated by ImageNet to indicate what objects are pictured.
Since 2010, the ImageNet project has run an annual software contest, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC).
AlexNet (in terms of tensor shapes)
AlexNet won the ImageNet ILSVRC challenge in 2012, beating all previous non-DL based methods and starting the surge of research on CNNs.
It has 5 conv layers which are followed by 3 fully connected layers.
In total, it has around 60M parameters, most of which are in the FC layers.
Most operations (ops) in forward propagation take place in the conv layers.
AlexNet (in terms of connections)
AlexNet
First use of ReLU
Heavy data augmentation
Dropout: 0.5; batch size: 128; SGD momentum: 0.9
Learning rate: 0.01, reduced by a factor of 10 manually when validation accuracy plateaus.
L2 weight decay; ensemble of 7 CNNs.
VGGNet (in terms of tensor shapes)
Runner-up in the ImageNet ILSVRC challenge 2014, beating AlexNet.
Key point: Depth of network is a critical component for good performance.
It has 16 layers and always uses 3 x 3 kernels.
In total, 138M parameters.
VGGNet (in terms of connections)
GoogLeNet
Winner of the ImageNet ILSVRC challenge 2014, with 22 layers.
Main contribution was the development of an Inception Module that dramatically reduced the number of parameters in the network (5M, compared to AlexNet with 60M).
No fully connected layers.
The idea of the inception layer is to cover a bigger area, but also keep a fine resolution for small information on the images.
So the idea is to convolve in parallel with filters of different sizes, from the most fine-grained (1 x 1) to a bigger one (5 x 5).
The naive version has too many features (parameters), more than in VGGNet.
So, bottleneck layers (1 x 1 conv layers) are used to reduce the number of features (parameters).
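The saving from a bottleneck can be illustrated with the parameter-count formula (F·F·D1 + 1)·K from the convolutional layer summary. The depths used below (256 input channels, a 1 x 1 bottleneck down to 32 channels, 64 output filters) are hypothetical numbers chosen for the example, not the actual GoogLeNet configuration, and the function name `conv_params` is likewise an assumption.

```python
def conv_params(f, d1, k):
    """K filters, each with F*F*D1 weights and one bias."""
    return (f * f * d1 + 1) * k


input_depth = 256      # hypothetical depth of the incoming volume
bottleneck_depth = 32  # hypothetical depth after the 1 x 1 conv layer
output_filters = 64    # hypothetical number of 5 x 5 filters

# Naive branch: 5 x 5 filters applied directly to the full-depth input.
direct = conv_params(5, input_depth, output_filters)

# Bottleneck branch: 1 x 1 conv to reduce depth, then 5 x 5 filters on the thinner volume.
with_bottleneck = (
    conv_params(1, input_depth, bottleneck_depth)
    + conv_params(5, bottleneck_depth, output_filters)
)

print(f"{direct:,}")           # 409,664
print(f"{with_bottleneck:,}")  # 59,488
```

Under these assumed depths the bottleneck branch needs roughly one seventh of the parameters of the direct branch while producing an output volume of the same depth.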
Full GoogLeNet
In total, 22 layers.
Starts with some vanilla layers (called the stem network),
followed by 9 stacked inception layers.
Auxiliary classification outputs inject additional gradient at lower layers.
No fully connected layers.
ResNet
Winner of the ImageNet ILSVRC challenge 2015, beating all other methods by large margins.
It goes much deeper than previous methods, with 152 layers.
It brings about the revolution of depth.
But how to go deeper? Simply stacking more layers does not work.
Simply Stacking Many Layers?
Stacking 3x3 conv layers.
The 56-layer net has higher training error, as well as higher test error, than a shallower net.
Usually, models with higher capacity should have lower training errors.
It is not the case here. This suggests difficulties in training very deep models.
It is difficult for gradients to propagate back to lower layers.
Plain Modules vs Residual Modules
ResNet stacks residual modules instead of plain modules.
In a plain module, we try to represent a target function H(x) using plain layers.
In a residual module, we have an identity connection from input to output, and use plain layers to represent the difference F(x) = H(x) − x, which is called the residual. The module's output is then F(x) + x = H(x).
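The structure of a residual module can be sketched in a few lines. This is an illustration only: the plain layers are represented by an arbitrary function passed in as `plain_layers`, and the two stand-in functions `zero_layers` and `small_correction` are hypothetical, not trained layers. The sketch assumes F(x) has the same shape as x so that the two can be added element by element.

```python
def residual_module(x, plain_layers):
    """Output H(x) = F(x) + x, where F is computed by the plain layers."""
    residual = plain_layers(x)
    return [r + xi for r, xi in zip(residual, x)]


def zero_layers(x):
    """Stand-in for plain layers whose output is zero everywhere, so F(x) = 0."""
    return [0.0 for _ in x]


def small_correction(x):
    """Stand-in for plain layers that compute a residual F(x) = 0.5 * x."""
    return [0.5 * xi for xi in x]


x = [1.0, -2.0, 3.0]
identity_out = residual_module(x, zero_layers)
corrected_out = residual_module(x, small_correction)
print(identity_out)   # [1.0, -2.0, 3.0]
print(corrected_out)  # [1.5, -3.0, 4.5]
```

When the plain layers output zero, the module passes its input through unchanged, so the plain layers only have to represent the deviation from the identity.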
ResNet
In a plain net, it is difficult for gradients to backpropagate to lower layers.
In ResNet, gradients can backpropagate to lower layers because of the identity connections.
Simply Stacking Many Layers?
With a plain net, higher training and test errors are observed with more layers (for deep architectures).
With ResNet, lower training and test errors are observed with more layers (for deep architectures).
Training of CNNs
CNNs are trained the same way as feedforward networks.
There are some practical tips to improve training and the results.
Data augmentation is usually used to improve generalization.
Data augmentation means creating fake data and adding them to the training set.
Fake images can be created by translating, rotating or scaling real images.
Batch normalization accelerates training and improves generalization.
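Two of the data augmentation transformations mentioned above, translation and rotation, can be sketched on a tiny image stored as a nested list. The helper names `translate_right`, `rotate_90_clockwise` and `augment` are assumptions for illustration, as are the one-pixel shift, the zero fill value and the 90-degree rotation angle. Scaling is omitted to keep the sketch short.

```python
def translate_right(image, shift, fill=0):
    """Shift every row `shift` pixels to the right, filling the vacated pixels with `fill`."""
    return [[fill] * shift + row[:len(row) - shift] for row in image]


def rotate_90_clockwise(image):
    """Rotate a 2D image (list of rows) by 90 degrees clockwise."""
    return [list(column) for column in zip(*image[::-1])]


def augment(image):
    """Return the real image together with two fake images derived from it."""
    return [image, translate_right(image, 1), rotate_90_clockwise(image)]


image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
augmented = augment(image)
for fake in augmented[1:]:
    print(fake)
# [[0, 1, 2], [0, 4, 5], [0, 7, 8]]
# [[7, 4, 1], [8, 5, 2], [9, 6, 3]]
```

Each fake image keeps the class label of the real image it was derived from.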