Machine Learning - Lecture 09: Convolutional Neural...

Post on 09-Aug-2018

218 views 0 download

Transcript of Machine Learning - Lecture 09: Convolutional Neural...

Machine LearningLecture 09: Convolutional Neural Networks

Nevin L. Zhanglzhang@cse.ust.hk

Department of Computer Science and EngineeringThe Hong Kong University of Science and Technology

This set of notes is based on various sources on the internet andStanford CS course CS231n: Convolutional Neural Networks for Visual

Recognition. http://cs231n.stanford.edu/

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT press.http://www.deeplearningbook.org

Nevin L. Zhang (HKUST) Machine Learning 1 / 45

What are Convolutional Neural Networks?

Outline

1 What are Convolutional Neural Networks?

2 Convolutional Layer

3 Technical Issues with Convolutional Layer

4 Pooling Layer

5 CNN Examples

Nevin L. Zhang (HKUST) Machine Learning 2 / 45

What are Convolutional Neural Networks?

Convolutional Neural Networks

Convolutional Neural Networks (CNNs, ConvNets) are a specializedkind of neural network for processing data that has a known, grid-liketopology.

Examples include

Time-series data, which can be thought of as a 1D grid taking samplesat regular time intervals, andImage data, which can be thought of as a 2D grid of pixels

We will focus on image data

Nevin L. Zhang (HKUST) Machine Learning 3 / 45

What are Convolutional Neural Networks?

CIFAR-10 Dataset

We will use the CIFAR-10 Dataset as a background example.

The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes,with 6000 images per class.

There are 50000 training images and 10000 test images.

Nevin L. Zhang (HKUST) Machine Learning 4 / 45

What are Convolutional Neural Networks?

CNN for CIFAR-10

A CNN represents a function from the raw image pixels on one end to classscores at the other

Units at an CNN layer are arranged in 3D volumes with dimensions width,height, depth.

For CIFAR 10, an input image is represented as a volume of units with size32 x 32 x 3 (32 wide, 32 high, 3 color channels).

The output (10 class scores) is represented as a volume of units with size 1x 1 x 10.

Nevin L. Zhang (HKUST) Machine Learning 5 / 45

What are Convolutional Neural Networks?

CNN Layers

An CNN is made up of multiple layers.

Every Layer has a simple API: It transforms an input 3D volume to anoutput 3D volume with some differentiable function that may or may nothave parameters.

Three main types of layers to build CNN architectures: ConvolutionalLayer, Pooling Layer, and Fully-Connected Layer (exactly as seen inregular Neural Networks).

Nevin L. Zhang (HKUST) Machine Learning 6 / 45

What are Convolutional Neural Networks?

CNN Layers

An CNN is made up of multiple layers.

Every Layer has a simple API: It transforms an input 3D volume to anoutput 3D volume with some differentiable function that may or may nothave parameters.

Three main types of layers to build CNN architectures: ConvolutionalLayer, Pooling Layer, and Fully-Connected Layer (exactly as seen inregular Neural Networks).

Nevin L. Zhang (HKUST) Machine Learning 6 / 45

Convolutional Layer

Outline

1 What are Convolutional Neural Networks?

2 Convolutional Layer

3 Technical Issues with Convolutional Layer

4 Pooling Layer

5 CNN Examples

Nevin L. Zhang (HKUST) Machine Learning 7 / 45

Convolutional Layer

Convolutional Layer

Each convolutional unit is connected only to units in a small patch (whichis called the receptive field of the unit) of the previous layer.

The receptive filed is local in space (width and height), but always full alongthe entire depth of the input volume.

Nevin L. Zhang (HKUST) Machine Learning 8 / 45

Convolutional Layer

Convolutional Layer

The parameters of a convolutional unit are the connection weights and thebias. They are to be learned.

Intuitively, the task is to learn weights so that the unit will activate whensome type of visual feature such as an edge is present.

Nevin L. Zhang (HKUST) Machine Learning 9 / 45

Convolutional Layer

Convolutional Layer

A convolutional layer consists of a volume of convolutional units.

All units on a given depth slice share the same parameters,

so that we candetect the same feature at different locations. (Edges can be at multiplelocations)

Hence, the set of weights is called a filter or kernel.Different units on the depth slices are obtained by sliding the filter.

Nevin L. Zhang (HKUST) Machine Learning 10 / 45

Convolutional Layer

Convolutional Layer

A convolutional layer consists of a volume of convolutional units.

All units on a given depth slice share the same parameters,so that we candetect the same feature at different locations. (Edges can be at multiplelocations)

Hence, the set of weights is called a filter or kernel.Different units on the depth slices are obtained by sliding the filter.

Nevin L. Zhang (HKUST) Machine Learning 10 / 45

Convolutional Layer

Convolutional Layer

A convolutional layer consists of a volume of convolutional units.

All units on a given depth slice share the same parameters,so that we candetect the same feature at different locations. (Edges can be at multiplelocations)

Hence, the set of weights is called a filter or kernel.Different units on the depth slices are obtained by sliding the filter.

Nevin L. Zhang (HKUST) Machine Learning 10 / 45

Convolutional Layer

Convolutional Layer

There are multiple depth slices (i.e., multiple filters) so that differentfeatures can be detected.

Nevin L. Zhang (HKUST) Machine Learning 11 / 45

Convolutional Layer

Convolutional Layer

The output of each depth slice is called a feature map.

The output a conv layer is a collection of stacked feature maps.

Mathematically, a conv layer maps a 3D tensor to another 3D tensor.

Nevin L. Zhang (HKUST) Machine Learning 12 / 45

Convolutional Layer

Example of Learned Fitlers

Here are the 96 filters learned in the first convolution layer in AlexNet (withreceptive field: 11 x 11 x 3. Now people mostly use 3 x 3 x 3.)

A lot of these filters turns out to be edge detection filters common tohuman visual systems.

Nevin L. Zhang (HKUST) Machine Learning 13 / 45

Convolutional Layer

Example of Feature Maps

Nevin L. Zhang (HKUST) Machine Learning 14 / 45

Convolutional Layer

Computation of a convolutional unit

Let I be a 2D image (one of the three channels), and K be the filter.

K (m, n) = 0 when |m| > r or |n| > r where 2r is the weight and heightof the receptive field.

Computation carried by the convolutional unit at coordinates (i , j) is

S(i , j) =∑m,n

I (i + m, j + n)K (m, n)

This is cross-correlation, although it is referred to as convolution in deeplearning (en.wikipedia.org/wiki/Cross-correlation).

Nevin L. Zhang (HKUST) Machine Learning 15 / 45

Convolutional Layer

NOTE: Cross-correlation vs convolution

Cross-correlation:∑

m,n I (i + m, j + n)K (m, n)

Convolution:∑

m,n I (i −m, j − n)K (m, n)

Let us flip the kernel K to get K ′(m, n) = K (−m,−n). Then∑m,n

I (i −m, j − n)K (m, n) =∑m,n

I (i −m, j − n)K ′(−m,−n)

=∑m,n

I (i + m, j + n)K ′(m, n)

So, convolution with kernel K is the same as cross-correlation with theflipped kernel K ′.

And because the kernel is to be learned., it is not necessary to distinguishbetween the two (and hence cross-correlation and convolution) in deeplearning.

Nevin L. Zhang (HKUST) Machine Learning 16 / 45

Convolutional Layer

NOTE: Cross-correlation vs convolution

Cross-correlation:∑

m,n I (i + m, j + n)K (m, n)

Convolution:∑

m,n I (i −m, j − n)K (m, n)

Let us flip the kernel K to get K ′(m, n) = K (−m,−n). Then∑m,n

I (i −m, j − n)K (m, n) =∑m,n

I (i −m, j − n)K ′(−m,−n)

=∑m,n

I (i + m, j + n)K ′(m, n)

So, convolution with kernel K is the same as cross-correlation with theflipped kernel K ′.

And because the kernel is to be learned., it is not necessary to distinguishbetween the two (and hence cross-correlation and convolution) in deeplearning.

Nevin L. Zhang (HKUST) Machine Learning 16 / 45

Convolutional Layer

NOTE: Cross-correlation vs convolution

Cross-correlation:∑

m,n I (i + m, j + n)K (m, n)

Convolution:∑

m,n I (i −m, j − n)K (m, n)

Let us flip the kernel K to get K ′(m, n) = K (−m,−n). Then∑m,n

I (i −m, j − n)K (m, n) =∑m,n

I (i −m, j − n)K ′(−m,−n)

=∑m,n

I (i + m, j + n)K ′(m, n)

So, convolution with kernel K is the same as cross-correlation with theflipped kernel K ′.

And because the kernel is to be learned., it is not necessary to distinguishbetween the two (and hence cross-correlation and convolution) in deeplearning.

Nevin L. Zhang (HKUST) Machine Learning 16 / 45

Convolutional Layer

NOTE: Cross-correlation vs convolution

Cross-correlation:∑

m,n I (i + m, j + n)K (m, n)

Convolution:∑

m,n I (i −m, j − n)K (m, n)

Let us flip the kernel K to get K ′(m, n) = K (−m,−n). Then∑m,n

I (i −m, j − n)K (m, n) =∑m,n

I (i −m, j − n)K ′(−m,−n)

=∑m,n

I (i + m, j + n)K ′(m, n)

So, convolution with kernel K is the same as cross-correlation with theflipped kernel K ′.

And because the kernel is to be learned., it is not necessary to distinguishbetween the two (and hence cross-correlation and convolution) in deeplearning.

Nevin L. Zhang (HKUST) Machine Learning 16 / 45

Convolutional Layer

Convolutional Layer

Convolutional demo: http://cs231n.github.io/convolutional-networks/

Nevin L. Zhang (HKUST) Machine Learning 17 / 45

Convolutional Layer

Computation of a convolutional unit

The result of convolution is passed through a nonlinear activation function(ReLU) to get the output of the unit.

Nevin L. Zhang (HKUST) Machine Learning 18 / 45

Technical Issues with Convolutional Layer

Outline

1 What are Convolutional Neural Networks?

2 Convolutional Layer

3 Technical Issues with Convolutional Layer

4 Pooling Layer

5 CNN Examples

Nevin L. Zhang (HKUST) Machine Learning 19 / 45

Technical Issues with Convolutional Layer

Reduction in Spatial Size

To apply the filter to an image, we can move the filter 1 pixel at a time fromleft to right and top to bottom until we process every pixel.

And we have a convolutional unit for each location.

The width and height of the array convolutional units are reduced by 2 fromthe width and height of the array of input units.

Nevin L. Zhang (HKUST) Machine Learning 20 / 45

Technical Issues with Convolutional Layer

Zero Padding

If we want to maintain the spatial dimensions, we can pack extra 0 orreplicate the edge of the original image.

Zero padding helps to better preserve information on the edge.

Nevin L. Zhang (HKUST) Machine Learning 21 / 45

Technical Issues with Convolutional Layer

Stride

For a CNN, sometimes we do not move the filter only by 1 pixel. If we movethe filter 2 pixels to the right, we call the stride equal to 2.

It is uncommon to use stride 3 or more.

Nevin L. Zhang (HKUST) Machine Learning 22 / 45

Technical Issues with Convolutional Layer

Summary of Convolutional Layer

Accepts a volume of size W1 × H1 × D1

Requires four hyperparameters:

Number of filters K , their spatial extent F , the stride S , the amount ofzero padding P.

Produces a volume of size W2 × H2 × D2 where:

W2 = (W1 − F + 2P)/S + 1

H2 = (H1 − F + 2P)/S + 1

D2 = K

Need to make sure that (W1 − F + 2P)/S and (W2 − F + 2P)/S areintegers.

For example, we cannot have W = 10, P = 0, F = 3 and S = 2.

Nevin L. Zhang (HKUST) Machine Learning 23 / 45

Technical Issues with Convolutional Layer

Summary of Convolutional Layer

Accepts a volume of size W1 × H1 × D1

Requires four hyperparameters:

Number of filters K , their spatial extent F , the stride S , the amount ofzero padding P.

Produces a volume of size W2 × H2 × D2 where:

W2 = (W1 − F + 2P)/S + 1

H2 = (H1 − F + 2P)/S + 1

D2 = K

Need to make sure that (W1 − F + 2P)/S and (W2 − F + 2P)/S areintegers.

For example, we cannot have W = 10, P = 0, F = 3 and S = 2.

Nevin L. Zhang (HKUST) Machine Learning 23 / 45

Technical Issues with Convolutional Layer

Summary of Convolutional Layer

Number of parameters of a convolutional layer

(FFD1 + 1)K (K filters each with FFD1 + 1 parameters.)

A fully connected layer with the same number of units and no parametersharing would require (W1H1D1 + 1)W2H2D2, which can be prohibitivelylarge

AlexNet accepts images of size [227 x 227 x 3]

For first convolutional layer: F = 11, S = 4, P = 0 and K = 96.

W2 = H2 = 55 and D2 = 96The number of parameters is: (11× 11× 3 + 1)× 96 = 34, 944parameters

If a FC layer were used instead, the number of parameters would be

(227× 227× 3 + 1)× 55× 55× 96 = 44, 892, 355, 200

This is a key motivation for using convolutional layers.

Nevin L. Zhang (HKUST) Machine Learning 24 / 45

Technical Issues with Convolutional Layer

Summary of Convolutional Layer

Number of parameters of a convolutional layer

(FFD1 + 1)K (K filters each with FFD1 + 1 parameters.)

A fully connected layer with the same number of units and no parametersharing would require (W1H1D1 + 1)W2H2D2, which can be prohibitivelylarge

AlexNet accepts images of size [227 x 227 x 3]

For first convolutional layer: F = 11, S = 4, P = 0 and K = 96.

W2 = H2 = 55 and D2 = 96The number of parameters is: (11× 11× 3 + 1)× 96 = 34, 944parameters

If a FC layer were used instead, the number of parameters would be

(227× 227× 3 + 1)× 55× 55× 96 = 44, 892, 355, 200

This is a key motivation for using convolutional layers.

Nevin L. Zhang (HKUST) Machine Learning 24 / 45

Technical Issues with Convolutional Layer

Summary of Convolutional Layer

Number of parameters of a convolutional layer

(FFD1 + 1)K (K filters each with FFD1 + 1 parameters.)

A fully connected layer with the same number of units and no parametersharing would require (W1H1D1 + 1)W2H2D2, which can be prohibitivelylarge

AlexNet accepts images of size [227 x 227 x 3]

For first convolutional layer: F = 11, S = 4, P = 0 and K = 96.

W2 = H2 = 55 and D2 = 96The number of parameters is: (11× 11× 3 + 1)× 96 = 34, 944parameters

If a FC layer were used instead, the number of parameters would be

(227× 227× 3 + 1)× 55× 55× 96 = 44, 892, 355, 200

This is a key motivation for using convolutional layers.

Nevin L. Zhang (HKUST) Machine Learning 24 / 45

Technical Issues with Convolutional Layer

Summary of Convolutional Layer

Number of parameters of a convolutional layer

(FFD1 + 1)K (K filters each with FFD1 + 1 parameters.)

A fully connected layer with the same number of units and no parametersharing would require (W1H1D1 + 1)W2H2D2, which can be prohibitivelylarge

AlexNet accepts images of size [227 x 227 x 3]

For first convolutional layer: F = 11, S = 4, P = 0 and K = 96.

W2 = H2 = 55 and D2 = 96The number of parameters is: (11× 11× 3 + 1)× 96 = 34, 944parameters

If a FC layer were used instead, the number of parameters would be

(227× 227× 3 + 1)× 55× 55× 96 = 44, 892, 355, 200

This is a key motivation for using convolutional layers.

Nevin L. Zhang (HKUST) Machine Learning 24 / 45

Pooling Layer

Outline

1 What are Convolutional Neural Networks?

2 Convolutional Layer

3 Technical Issues with Convolutional Layer

4 Pooling Layer

5 CNN Examples

Nevin L. Zhang (HKUST) Machine Learning 25 / 45

Pooling Layer

Pooling

The objective of a pooling layer is to reduce the spatial size of the featuremap.

It aggregates a patch of units into one unit.

There are different ways to do this, and MAX pooling is found the workthe best.

Nevin L. Zhang (HKUST) Machine Learning 26 / 45

Pooling Layer

Pooling

Pooling also helps to make the representation approximately invariant tosmall translations of the input.

Invariance to translation means that if we translate the input by a smallamount, the values of most of the pooled outputs do not change.

Every value in the bottom row has changed, but only half of the values in the top row

have changed.

Invariance to local translation can be a very useful property if we care moreabout whether some feature is present than exactly where it is.

Nevin L. Zhang (HKUST) Machine Learning 27 / 45

Pooling Layer

Fully-Connected Layer

Neurons in a fully connected layer have full connections to all activations inthe previous layer, as seen in regular Neural Networks.

Fully-connected layers are typically used as the last layers.

Nevin L. Zhang (HKUST) Machine Learning 28 / 45

CNN Examples

Outline

1 What are Convolutional Neural Networks?

2 Convolutional Layer

3 Technical Issues with Convolutional Layer

4 Pooling Layer

5 CNN Examples

Nevin L. Zhang (HKUST) Machine Learning 29 / 45

CNN Examples

ImageNet

The ImageNet project is a large visual database designed for use in visualobject recognition software research.

As of 2016, over ten million URLs of images have been hand-annotated byImageNet to indicate what objects are pictured

Since 2010, the ImageNet project runs an annual software contest, theImageNet Large Scale Visual Recognition Challenge (ILSVRC)

Nevin L. Zhang (HKUST) Machine Learning 30 / 45

CNN Examples

AlexNet (in terms of tensor shapes)

Alexnet won the ImageNet ILSVRC challenge in 2012, beating all previousnon-DL based method.

Starting the surge of research on CNN.

It has 5 conv layers which are followed by 3 fully connected layers.

Totally, around 60M parameters, most of which are in the FC layers.

Most operations (ops) in forward propagation takes place in the Conv layers.

Nevin L. Zhang (HKUST) Machine Learning 31 / 45

CNN Examples

AlexNet (in terms of tensor shapes)

Alexnet won the ImageNet ILSVRC challenge in 2012, beating all previousnon-DL based method. Starting the surge of research on CNN.

It has 5 conv layers which are followed by 3 fully connected layers.

Totally, around 60M parameters, most of which are in the FC layers.

Most operations (ops) in forward propagation takes place in the Conv layers.

Nevin L. Zhang (HKUST) Machine Learning 31 / 45

CNN Examples

AlexNet (in terms of tensor shapes)

Alexnet won the ImageNet ILSVRC challenge in 2012, beating all previousnon-DL based method. Starting the surge of research on CNN.

It has 5 conv layers which are followed by 3 fully connected layers.

Totally, around 60M parameters, most of which are in the FC layers.

Most operations (ops) in forward propagation takes place in the Conv layers.

Nevin L. Zhang (HKUST) Machine Learning 31 / 45

CNN Examples

AlexNet (in terms of tensor shapes)

Alexnet won the ImageNet ILSVRC challenge in 2012, beating all previousnon-DL based method. Starting the surge of research on CNN.

It has 5 conv layers which are followed by 3 fully connected layers.

Totally, around 60M parameters, most of which are in the FC layers.

Most operations (ops) in forward propagation takes place in the Conv layers.

Nevin L. Zhang (HKUST) Machine Learning 31 / 45

CNN Examples

AlexNet (in terms of tensor shapes)

Alexnet won the ImageNet ILSVRC challenge in 2012, beating all previousnon-DL based method. Starting the surge of research on CNN.

It has 5 conv layers which are followed by 3 fully connected layers.

Totally, around 60M parameters, most of which are in the FC layers.

Most operations (ops) in forward propagation takes place in the Conv layers.

Nevin L. Zhang (HKUST) Machine Learning 31 / 45

CNN Examples

AlexNet (in terms of connections)

Nevin L. Zhang (HKUST) Machine Learning 32 / 45

CNN Examples

AlexNet

First use of ReLU

Heavy data augmentation

Dropout: 0.5 ; Batch size: 128; SGD momentum 0.9

Learning rate: 0.01, reduced by 10 manually when validation accuracyplateaus.

L2 weight decay; 7 CNN ensemble.

Nevin L. Zhang (HKUST) Machine Learning 33 / 45

CNN Examples

AlexNet

First use of ReLU

Heavy data augmentation

Dropout: 0.5 ; Batch size: 128; SGD momentum 0.9

Learning rate: 0.01, reduced by 10 manually when validation accuracyplateaus.

L2 weight decay; 7 CNN ensemble.

Nevin L. Zhang (HKUST) Machine Learning 33 / 45

CNN Examples

AlexNet

First use of ReLU

Heavy data augmentation

Dropout: 0.5

; Batch size: 128; SGD momentum 0.9

Learning rate: 0.01, reduced by 10 manually when validation accuracyplateaus.

L2 weight decay; 7 CNN ensemble.

Nevin L. Zhang (HKUST) Machine Learning 33 / 45

CNN Examples

AlexNet

First use of ReLU

Heavy data augmentation

Dropout: 0.5 ; Batch size: 128;

SGD momentum 0.9

Learning rate: 0.01, reduced by 10 manually when validation accuracyplateaus.

L2 weight decay; 7 CNN ensemble.

Nevin L. Zhang (HKUST) Machine Learning 33 / 45

CNN Examples

AlexNet

First use of ReLU

Heavy data augmentation

Dropout: 0.5 ; Batch size: 128; SGD momentum 0.9

Learning rate: 0.01, reduced by 10 manually when validation accuracyplateaus.

L2 weight decay; 7 CNN ensemble.

Nevin L. Zhang (HKUST) Machine Learning 33 / 45

CNN Examples

AlexNet

First use of ReLU

Heavy data augmentation

Dropout: 0.5 ; Batch size: 128; SGD momentum 0.9

Learning rate: 0.01, reduced by 10 manually when validation accuracyplateaus.

L2 weight decay; 7 CNN ensemble.

Nevin L. Zhang (HKUST) Machine Learning 33 / 45

CNN Examples

AlexNet

First use of ReLU

Heavy data augmentation

Dropout: 0.5 ; Batch size: 128; SGD momentum 0.9

Learning rate: 0.01, reduced by 10 manually when validation accuracyplateaus.

L2 weight decay;

7 CNN ensemble.

Nevin L. Zhang (HKUST) Machine Learning 33 / 45

CNN Examples

AlexNet

First use of ReLU

Heavy data augmentation

Dropout: 0.5 ; Batch size: 128; SGD momentum 0.9

Learning rate: 0.01, reduced by 10 manually when validation accuracyplateaus.

L2 weight decay; 7 CNN ensemble.

Nevin L. Zhang (HKUST) Machine Learning 33 / 45

CNN Examples

VGGNet (in terms of tensor shapes)

Runner-up in the ImageNet ILSVRC challenge 2014,

beating Alexnet .

Key point: Depth of network is a critical component for good performance.

It has 16 layers and always use 3 x 3 kernels.

Totally, 138M parameters.

Nevin L. Zhang (HKUST) Machine Learning 34 / 45

CNN Examples

VGGNet (in terms of tensor shapes)

Runner-up in the ImageNet ILSVRC challenge 2014, beating Alexnet .

Key point: Depth of network is a critical component for good performance.

It has 16 layers and always use 3 x 3 kernels.

Totally, 138M parameters.

Nevin L. Zhang (HKUST) Machine Learning 34 / 45

CNN Examples

VGGNet (in terms of tensor shapes)

Runner-up in the ImageNet ILSVRC challenge 2014, beating Alexnet .

Key point: Depth of network is a critical component for good performance.

It has 16 layers and always use 3 x 3 kernels.

Totally, 138M parameters.

Nevin L. Zhang (HKUST) Machine Learning 34 / 45

CNN Examples

VGGNet (in terms of tensor shapes)

Runner-up in the ImageNet ILSVRC challenge 2014, beating Alexnet .

Key point: Depth of network is a critical component for good performance.

It has 16 layers and always use 3 x 3 kernels.

Totally, 138M parameters.

Nevin L. Zhang (HKUST) Machine Learning 34 / 45

CNN Examples

VGGNet (in terms of tensor shapes)

Runner-up in the ImageNet ILSVRC challenge 2014, beating Alexnet .

Key point: Depth of network is a critical component for good performance.

It has 16 layers and always use 3 x 3 kernels.

Totally, 138M parameters.

Nevin L. Zhang (HKUST) Machine Learning 34 / 45

CNN Examples

VGGNet (in terms of connections)

Nevin L. Zhang (HKUST) Machine Learning 35 / 45

CNN Examples

GoogleNet

Winner the ImageNet ILSVRC challenge 2014, 22 layers.

Main contribution was the development of an Inception Module thatdramatically reduced the number of parameters in the network (5M,compared to AlexNet with 60M).

No fully connected layers.

Nevin L. Zhang (HKUST) Machine Learning 36 / 45

CNN Examples

GoogleNet

Winner the ImageNet ILSVRC challenge 2014, 22 layers.

Main contribution was the development of an Inception Module thatdramatically reduced the number of parameters in the network (5M,compared to AlexNet with 60M).

No fully connected layers.

Nevin L. Zhang (HKUST) Machine Learning 36 / 45

CNN Examples

GoogleNet

Winner the ImageNet ILSVRC challenge 2014, 22 layers.

Main contribution was the development of an Inception Module thatdramatically reduced the number of parameters in the network (5M,compared to AlexNet with 60M).

No fully connected layers.

Nevin L. Zhang (HKUST) Machine Learning 36 / 45

CNN Examples

GoogleNet

The idea of the inception layer is to cover a bigger area, but also keep a fineresolution for small information on the images.

So the idea is to convolve in parallel different sizes from the most accuratedetailing (1x1) to a bigger one (5x5).

The naive version has too many features (parameters), more than inVGGNet.

So, bottleneck layers (1 x 1 conv layers) are used to reduce the number ofthe features ( parameters).

Nevin L. Zhang (HKUST) Machine Learning 37 / 45

CNN Examples

GoogleNet

The idea of the inception layer is to cover a bigger area, but also keep a fineresolution for small information on the images.

So the idea is to convolve in parallel different sizes from the most accuratedetailing (1x1) to a bigger one (5x5).

The naive version has too many features (parameters), more than inVGGNet.

So, bottleneck layers (1 x 1 conv layers) are used to reduce the number ofthe features ( parameters).

Nevin L. Zhang (HKUST) Machine Learning 37 / 45

CNN Examples

GoogleNet

The idea of the inception layer is to cover a bigger area, but also keep a fineresolution for small information on the images.

So the idea is to convolve in parallel different sizes from the most accuratedetailing (1x1) to a bigger one (5x5).

The naive version has too many features (parameters), more than inVGGNet.

So, bottleneck layers (1 x 1 conv layers) are used to reduce the number ofthe features ( parameters).

Nevin L. Zhang (HKUST) Machine Learning 37 / 45

CNN Examples

GoogleNet

The idea of the inception layer is to cover a bigger area, but also keep a fineresolution for small information on the images.

So the idea is to convolve in parallel different sizes from the most accuratedetailing (1x1) to a bigger one (5x5).

The naive version has too many features (parameters), more than inVGGNet.

So, bottleneck layers (1 x 1 conv layers) are used to reduce the number ofthe features ( parameters).

Nevin L. Zhang (HKUST) Machine Learning 37 / 45

CNN Examples

Full GoogleNet

Totally, 22 layers;

Starts with some vanilla layers (called stem network),

Followed by 9 stacked inception layers,

Auxiliary classification outputs to inject additional gradient at lower layers.

No fully connected layers.

Nevin L. Zhang (HKUST) Machine Learning 38 / 45

CNN Examples

Full GoogleNet

Totally, 22 layers;

Starts with some vanilla layers (called stem network),

Followed by 9 stacked inception layers,

Auxiliary classification outputs to inject additional gradient at lower layers.

No fully connected layers.

Nevin L. Zhang (HKUST) Machine Learning 38 / 45

CNN Examples

Full GoogleNet

Totally, 22 layers;

Starts with some vanilla layers (called stem network),

Followed by 9 stacked inception layers,

Auxiliary classification outputs to inject additional gradient at lower layers.

No fully connected layers.

Nevin L. Zhang (HKUST) Machine Learning 38 / 45

CNN Examples

Full GoogleNet

Totally, 22 layers;

Starts with some vanilla layers (called stem network),

Followed by 9 stacked inception layers,

Auxiliary classification outputs to inject additional gradient at lower layers.

No fully connected layers.

Nevin L. Zhang (HKUST) Machine Learning 38 / 45

CNN Examples

Full GoogleNet

Totally, 22 layers;

Starts with some vanilla layers (called stem network),

Followed by 9 stacked inception layers,

Auxiliary classification outputs to inject additional gradient at lower layers.

No fully connected layers.

Nevin L. Zhang (HKUST) Machine Learning 38 / 45

CNN Examples

Full GoogleNet

Nevin L. Zhang (HKUST) Machine Learning 39 / 45

CNN Examples

ResNet

Winner the ImageNet ILSVRCchallenge 2015, beating all othermethods by large margins.

It goes much deeper than previousmethods with 152 layers.

It brings about the revolution ofdepth.

But, how to go deeper? Simplystacking more layers does not work.

Nevin L. Zhang (HKUST) Machine Learning 40 / 45

CNN Examples

ResNet

Winner the ImageNet ILSVRCchallenge 2015, beating all othermethods by large margins.

It goes much deeper than previousmethods with 152 layers.

It brings about the revolution ofdepth.

But, how to go deeper? Simplystacking more layers does not work.

Nevin L. Zhang (HKUST) Machine Learning 40 / 45

CNN Examples

ResNet

Winner the ImageNet ILSVRCchallenge 2015, beating all othermethods by large margins.

It goes much deeper than previousmethods with 152 layers.

It brings about the revolution ofdepth.

But, how to go deeper?

Simplystacking more layers does not work.

Nevin L. Zhang (HKUST) Machine Learning 40 / 45

CNN Examples

ResNet

Winner the ImageNet ILSVRCchallenge 2015, beating all othermethods by large margins.

It goes much deeper than previousmethods with 152 layers.

It brings about the revolution ofdepth.

But, how to go deeper? Simplystacking more layers does not work.

Nevin L. Zhang (HKUST) Machine Learning 40 / 45

CNN Examples

ResNet

Winner the ImageNet ILSVRCchallenge 2015, beating all othermethods by large margins.

It goes much deeper than previousmethods with 152 layers.

It brings about the revolution ofdepth.

But, how to go deeper? Simplystacking more layers does not work.

Nevin L. Zhang (HKUST) Machine Learning 40 / 45

CNN Examples

Simply Stacking Many Layers?

Stacking 3x3 conv layers.

56-layer net has higher training error as well as test error.

Usually, models with higher capacity should have lower training errors.

It is not the case here. This suggests difficulties in training very deep models.

It is difficult for gradients to propagate back to lower layers.

Nevin L. Zhang (HKUST) Machine Learning 41 / 45

CNN Examples

Simply Stacking Many Layers?

Stacking 3x3 conv layers.

56-layer net has higher training error as well as test error.

Usually, models with higher capacity should have lower training errors.

It is not the case here. This suggests difficulties in training very deep models.

It is difficult for gradients to propagate back to lower layers.

Nevin L. Zhang (HKUST) Machine Learning 41 / 45

CNN Examples

Simply Stacking Many Layers?

Stacking 3x3 conv layers.

56-layer net has higher training error as well as test error.

Usually, models with higher capacity should have lower training errors.

It is not the case here. This suggests difficulties in training very deep models.

It is difficult for gradients to propagate back to lower layers.

Nevin L. Zhang (HKUST) Machine Learning 41 / 45

CNN Examples

Simply Stacking Many Layers?

Stacking 3x3 conv layers.

56-layer net has higher training error as well as test error.

Usually, models with higher capacity should have lower training errors.

It is not the case here. This suggests difficulties in training very deep models.

It is difficult for gradients to propagate back to lower layers.

Nevin L. Zhang (HKUST) Machine Learning 41 / 45

CNN Examples

Simply Stacking Many Layers?

Stacking 3x3 conv layers.

56-layer net has higher training error as well as test error.

Usually, models with higher capacity should have lower training errors.

It is not the case here. This suggests difficulties in training very deep models.

It is difficult for gradients to propagate back to lower layers.

Nevin L. Zhang (HKUST) Machine Learning 41 / 45

CNN Examples

Plain Modules vs Residual Modules

ResNet stacks residual modules instead of plain modules

In a plain module, we try to represent a target function H(x) using plainlayers

In a residual module, we have a identity connection from input to output,and use plain layers to represent the difference F (x) = H(x)− x , which iscalled residual.

Nevin L. Zhang (HKUST) Machine Learning 42 / 45

CNN Examples

Plain Modules vs Residual Modules

ResNet stacks residual modules instead of plain modules

In a plain module, we try to represent a target function H(x) using plainlayers

In a residual module, we have a identity connection from input to output,and use plain layers to represent the difference F (x) = H(x)− x ,

which iscalled residual.

Nevin L. Zhang (HKUST) Machine Learning 42 / 45

CNN Examples

Plain Modules vs Residual Modules

ResNet stacks residual modules instead of plain modules

In a plain module, we try to represent a target function H(x) using plainlayers

In a residual module, we have a identity connection from input to output,and use plain layers to represent the difference F (x) = H(x)− x , which iscalled residual.

Nevin L. Zhang (HKUST) Machine Learning 42 / 45

CNN Examples

ResNet

In plain net, it is difficult for gradients to backpropagate to lower layers.

In ResNet, gradients can back propagate to lowerlayers because of the identity connections.

Nevin L. Zhang (HKUST) Machine Learning 43 / 45

CNN Examples

ResNet

In plain net, it is difficult for gradients to backpropagate to lower layers.

In ResNet, gradients can back propagate to lowerlayers because of the identity connections.

Nevin L. Zhang (HKUST) Machine Learning 43 / 45

CNN Examples

Simply Stacking Many Layers?

With plain net, higher training and test errors observed with more layers (fordeep architectures)

With ResNet, higher training and test errors observed with more layers (fordeep architectures).

Nevin L. Zhang (HKUST) Machine Learning 44 / 45

CNN Examples

Simply Stacking Many Layers?

With plain net, higher training and test errors observed with more layers (fordeep architectures)

With ResNet, higher training and test errors observed with more layers (fordeep architectures).

Nevin L. Zhang (HKUST) Machine Learning 44 / 45

CNN Examples

Training of CNNs

CNNs are trained the same ways as feedforward networks.

There are somepractical tips to improve training and the results.

Data augmentation is usually used to improve generalization.

Data augmentation means to create fake data and add them to thetraining set.Fake images can be created by translating, rotating or scaling realimages.

Batch normalization accelerates training and improves generalization.

Nevin L. Zhang (HKUST) Machine Learning 45 / 45

CNN Examples

Training of CNNs

CNNs are trained the same ways as feedforward networks. There are somepractical tips to improve training and the results.

Data augmentation is usually used to improve generalization.

Data augmentation means to create fake data and add them to thetraining set.Fake images can be created by translating, rotating or scaling realimages.

Batch normalization accelerates training and improves generalization.

Nevin L. Zhang (HKUST) Machine Learning 45 / 45

CNN Examples

Training of CNNs

CNNs are trained the same ways as feedforward networks. There are somepractical tips to improve training and the results.

Data augmentation is usually used to improve generalization.

Data augmentation means to create fake data and add them to thetraining set.Fake images can be created by translating, rotating or scaling realimages.

Batch normalization accelerates training and improves generalization.

Nevin L. Zhang (HKUST) Machine Learning 45 / 45

CNN Examples

Training of CNNs

CNNs are trained the same ways as feedforward networks. There are somepractical tips to improve training and the results.

Data augmentation is usually used to improve generalization.

Data augmentation means to create fake data and add them to thetraining set.Fake images can be created by translating, rotating or scaling realimages.

Batch normalization accelerates training and improves generalization.

Nevin L. Zhang (HKUST) Machine Learning 45 / 45