
Machine Learning Lecture 09: Convolutional Neural Networks

Nevin L. Zhang ([email protected])

Department of Computer Science and Engineering, The Hong Kong University of Science and Technology

This set of notes is based on various sources on the internet and Stanford CS course CS231n: Convolutional Neural Networks for Visual Recognition. http://cs231n.stanford.edu/

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press. http://www.deeplearningbook.org


What are Convolutional Neural Networks?

Outline

1 What are Convolutional Neural Networks?

2 Convolutional Layer

3 Technical Issues with Convolutional Layer

4 Pooling Layer

5 CNN Examples


Convolutional Neural Networks

Convolutional Neural Networks (CNNs, ConvNets) are a specialized kind of neural network for processing data that has a known, grid-like topology.

Examples include:

Time-series data, which can be thought of as a 1D grid taking samples at regular time intervals, and

Image data, which can be thought of as a 2D grid of pixels.

We will focus on image data


CIFAR-10 Dataset

We will use the CIFAR-10 Dataset as a background example.

The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class.

There are 50000 training images and 10000 test images.


CNN for CIFAR-10

A CNN represents a function from the raw image pixels on one end to class scores at the other.

Units at a CNN layer are arranged in 3D volumes with dimensions width, height, depth.

For CIFAR-10, an input image is represented as a volume of units with size 32 x 32 x 3 (32 wide, 32 high, 3 color channels).

The output (10 class scores) is represented as a volume of units with size 1 x 1 x 10.


CNN Layers

A CNN is made up of multiple layers.

Every layer has a simple API: it transforms an input 3D volume to an output 3D volume with some differentiable function that may or may not have parameters.

Three main types of layers are used to build CNN architectures: Convolutional Layer, Pooling Layer, and Fully-Connected Layer (exactly as seen in regular Neural Networks).
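As an illustration, here is a minimal sketch of such a stack of layers in PyTorch (an assumed framework choice; the layer sizes are arbitrary and not from these slides), mapping a CIFAR-10-shaped input volume to 10 class scores:

    import torch
    import torch.nn as nn

    # A toy CNN for CIFAR-10-shaped inputs: each layer maps a 3D volume
    # to another 3D volume; the final FC layer produces the class scores.
    model = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, padding=1),   # 3x32x32 -> 16x32x32
        nn.ReLU(),
        nn.MaxPool2d(2),                              # 16x32x32 -> 16x16x16
        nn.Conv2d(16, 32, kernel_size=3, padding=1),  # 16x16x16 -> 32x16x16
        nn.ReLU(),
        nn.MaxPool2d(2),                              # 32x16x16 -> 32x8x8
        nn.Flatten(),                                 # 32x8x8 -> 2048
        nn.Linear(32 * 8 * 8, 10),                    # 2048 -> 10 class scores
    )

    scores = model(torch.randn(1, 3, 32, 32))  # a batch of one image
    print(scores.shape)                        # torch.Size([1, 10])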


Convolutional Layer


Each convolutional unit is connected only to units in a small patch (which is called the receptive field of the unit) of the previous layer.

The receptive field is local in space (width and height), but always full along the entire depth of the input volume.


The parameters of a convolutional unit are the connection weights and the bias. They are to be learned.

Intuitively, the task is to learn weights so that the unit will activate when some type of visual feature such as an edge is present.


A convolutional layer consists of a volume of convolutional units.

All units on a given depth slice share the same parameters, so that we can detect the same feature at different locations. (Edges can be at multiple locations.)

Hence, the set of weights is called a filter or kernel. Different units on a depth slice are obtained by sliding the filter.


There are multiple depth slices (i.e., multiple filters) so that different features can be detected.


The output of each depth slice is called a feature map.

The output of a conv layer is a collection of stacked feature maps.

Mathematically, a conv layer maps a 3D tensor to another 3D tensor.


Example of Learned Filters

Here are the 96 filters learned in the first convolution layer in AlexNet (with receptive field: 11 x 11 x 3; now people mostly use 3 x 3 x 3).

A lot of these filters turn out to be edge detection filters common to human visual systems.


Example of Feature Maps


Computation of a convolutional unit

Let I be a 2D image (one of the three channels), and K be the filter.

K(m, n) = 0 when |m| > r or |n| > r, where 2r + 1 is the width and height of the receptive field.

The computation carried out by the convolutional unit at coordinates (i, j) is

S(i, j) = Σ_{m,n} I(i + m, j + n) K(m, n)

This is cross-correlation, although it is referred to as convolution in deep learning (en.wikipedia.org/wiki/Cross-correlation).
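A minimal NumPy sketch of this computation (illustrative, not from the slides; one channel, no bias or activation, and the filter window anchored at its top-left corner rather than centered, which only shifts the output):

    import numpy as np

    def cross_correlate2d(I, K):
        """S(i, j) = sum over (m, n) of I(i+m, j+n) * K(m, n), at every valid (i, j)."""
        H, W = I.shape
        kh, kw = K.shape
        S = np.empty((H - kh + 1, W - kw + 1))
        for i in range(S.shape[0]):
            for j in range(S.shape[1]):
                S[i, j] = np.sum(I[i:i + kh, j:j + kw] * K)
        return S

    I = np.arange(25.0).reshape(5, 5)
    K = np.array([[1.0, 0.0], [0.0, -1.0]])  # a tiny diagonal-difference filter
    print(cross_correlate2d(I, K))           # a 4 x 4 output map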


NOTE: Cross-correlation vs convolution

Cross-correlation: Σ_{m,n} I(i + m, j + n) K(m, n)

Convolution: Σ_{m,n} I(i − m, j − n) K(m, n)

Let us flip the kernel K to get K′(m, n) = K(−m, −n). Then

Σ_{m,n} I(i − m, j − n) K(m, n) = Σ_{m,n} I(i − m, j − n) K′(−m, −n) = Σ_{m,n} I(i + m, j + n) K′(m, n)

So, convolution with kernel K is the same as cross-correlation with the flipped kernel K′.

And because the kernel is to be learned, it is not necessary to distinguish between the two (cross-correlation and convolution) in deep learning.
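This equivalence is easy to check numerically; a small sketch using SciPy (an assumed dependency, not mentioned in the slides):

    import numpy as np
    from scipy.signal import convolve2d, correlate2d

    rng = np.random.default_rng(0)
    I = rng.standard_normal((6, 6))
    K = rng.standard_normal((3, 3))

    K_flipped = K[::-1, ::-1]  # K'(m, n) = K(-m, -n)

    # Convolution with K equals cross-correlation with the flipped kernel K'.
    print(np.allclose(convolve2d(I, K, mode="valid"),
                      correlate2d(I, K_flipped, mode="valid")))  # True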


Convolutional demo: http://cs231n.github.io/convolutional-networks/


Computation of a convolutional unit

The result of convolution is passed through a nonlinear activation function (ReLU) to get the output of the unit.


Technical Issues with Convolutional Layer


Reduction in Spatial Size

To apply the filter to an image, we can move the filter 1 pixel at a time from left to right and top to bottom until we process every pixel.

And we have a convolutional unit for each location.

The width and height of the array of convolutional units are reduced by 2 from the width and height of the array of input units (by F − 1 in general; a 3 x 3 filter is assumed here).


Zero Padding

If we want to maintain the spatial dimensions, we can pad extra 0s or replicate the edge of the original image.

Zero padding helps to better preserve information on the edge.
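As a small illustration (a NumPy sketch, not from the slides), both options are a single call to np.pad:

    import numpy as np

    I = np.arange(9).reshape(3, 3)        # a tiny 3 x 3 "image"
    print(np.pad(I, 1, mode="constant"))  # zero padding: surround with 0s
    print(np.pad(I, 1, mode="edge"))      # replicate the edge pixels instead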


Stride

For a CNN, sometimes we do not move the filter only by 1 pixel. If we move the filter 2 pixels to the right at a time, we say the stride equals 2.

It is uncommon to use stride 3 or more.


Summary of Convolutional Layer

Accepts a volume of size W1 × H1 × D1

Requires four hyperparameters: the number of filters K, their spatial extent F, the stride S, and the amount of zero padding P.

Produces a volume of size W2 × H2 × D2 where:

W2 = (W1 − F + 2P)/S + 1

H2 = (H1 − F + 2P)/S + 1

D2 = K

Need to make sure that (W1 − F + 2P)/S and (H1 − F + 2P)/S are integers.

For example, we cannot have W1 = 10, P = 0, F = 3 and S = 2.
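These formulas are easy to encode; a small helper sketch (the function name conv_output_size is illustrative) that also flags invalid settings:

    def conv_output_size(W1, F, S, P):
        """Spatial output size of a conv layer: (W1 - F + 2P)/S + 1."""
        if (W1 - F + 2 * P) % S != 0:
            raise ValueError("(W1 - F + 2P) must be divisible by S")
        return (W1 - F + 2 * P) // S + 1

    print(conv_output_size(32, 3, 1, 1))    # 32: size preserved (stride 1, pad 1)
    print(conv_output_size(227, 11, 4, 0))  # 55
    # conv_output_size(10, 3, 2, 0) raises: (10 - 3 + 0)/2 is not an integer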


Number of parameters of a convolutional layer

(F × F × D1 + 1) × K (K filters, each with F × F × D1 + 1 parameters.)

A fully connected layer with the same number of units and no parameter sharing would require (W1 × H1 × D1 + 1) × W2 × H2 × D2 parameters, which can be prohibitively large.

AlexNet accepts images of size [227 x 227 x 3]

For the first convolutional layer: F = 11, S = 4, P = 0 and K = 96.

W2 = H2 = 55 and D2 = 96.

The number of parameters is (11 × 11 × 3 + 1) × 96 = 34,944.

If a FC layer were used instead, the number of parameters would be (227 × 227 × 3 + 1) × 55 × 55 × 96 = 44,892,355,200.

This is a key motivation for using convolutional layers.
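The two counts above can be checked directly (a sketch; the helper names are illustrative):

    def conv_params(F, D1, K):
        """K filters, each with F*F*D1 weights plus one bias."""
        return (F * F * D1 + 1) * K

    def fc_params(W1, H1, D1, W2, H2, D2):
        """One weight per input-output pair, plus one bias per output unit."""
        return (W1 * H1 * D1 + 1) * W2 * H2 * D2

    print(conv_params(11, 3, 96))              # 34944
    print(fc_params(227, 227, 3, 55, 55, 96))  # 44892355200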


Pooling Layer


Pooling

The objective of a pooling layer is to reduce the spatial size of the feature map.

It aggregates a patch of units into one unit.

There are different ways to do this, and MAX pooling is found to work the best.
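A minimal NumPy sketch of 2 x 2 max pooling with stride 2 (an illustrative implementation on a single 2D feature map):

    import numpy as np

    def max_pool2d(X, size=2, stride=2):
        """Replace each size x size patch (taken every stride pixels) by its max."""
        H, W = X.shape
        out = np.empty(((H - size) // stride + 1, (W - size) // stride + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = X[i * stride:i * stride + size,
                              j * stride:j * stride + size].max()
        return out

    X = np.array([[1., 3., 2., 1.],
                  [4., 6., 5., 0.],
                  [7., 2., 9., 8.],
                  [1., 0., 3., 4.]])
    print(max_pool2d(X))  # [[6. 5.] [7. 9.]]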


Pooling also helps to make the representation approximately invariant to small translations of the input.

Invariance to translation means that if we translate the input by a small amount, the values of most of the pooled outputs do not change.

(In the example shown: every value in the bottom row has changed, but only half of the values in the top row have changed.)

Invariance to local translation can be a very useful property if we care more about whether some feature is present than exactly where it is.


Fully-Connected Layer

Neurons in a fully connected layer have full connections to all activations in the previous layer, as seen in regular Neural Networks.

Fully-connected layers are typically used as the last layers.


CNN Examples


ImageNet

The ImageNet project is a large visual database designed for use in visual object recognition software research.

As of 2016, over ten million URLs of images have been hand-annotated by ImageNet to indicate what objects are pictured.

Since 2010, the ImageNet project has run an annual software contest, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC).


AlexNet (in terms of tensor shapes)

AlexNet won the ImageNet ILSVRC challenge in 2012, beating all previous non-DL based methods, and started the surge of research on CNNs.

It has 5 conv layers which are followed by 3 fully connected layers.

In total, around 60M parameters, most of which are in the FC layers.

Most operations (ops) in forward propagation take place in the conv layers.


AlexNet (in terms of connections)


AlexNet

First use of ReLU

Heavy data augmentation

Dropout: 0.5; Batch size: 128; SGD momentum 0.9

Learning rate: 0.01, reduced by 10 manually when validation accuracy plateaus.

L2 weight decay; 7 CNN ensemble.


VGGNet (in terms of tensor shapes)

Runner-up in the ImageNet ILSVRC challenge 2014, beating AlexNet.

Key point: Depth of network is a critical component for good performance.

It has 16 layers and always uses 3 x 3 kernels.

In total, 138M parameters.
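The VGG pattern is a repeated block of 3 x 3 convs followed by 2 x 2 max pooling; a sketch in PyTorch (channel counts are illustrative):

    import torch.nn as nn

    def vgg_block(in_ch, out_ch, n_convs):
        """n_convs 3x3 convs (padding 1, so size is preserved), then a 2x2 max pool."""
        layers = []
        for _ in range(n_convs):
            layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU()]
            in_ch = out_ch
        layers.append(nn.MaxPool2d(2))  # halves width and height
        return nn.Sequential(*layers)

    block = vgg_block(64, 128, 2)  # e.g., two 3x3 convs going from 64 to 128 channels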


VGGNet (in terms of connections)


GoogleNet

Winner of the ImageNet ILSVRC challenge 2014, with 22 layers.

Main contribution was the development of an Inception Module that dramatically reduced the number of parameters in the network (5M, compared to AlexNet with 60M).

No fully connected layers.


The idea of the inception layer is to cover a bigger area, but also keep a fine resolution for small information on the images.

So the idea is to convolve in parallel at different sizes, from the most accurate detailing (1x1) to a bigger one (5x5).

The naive version has too many features (parameters), more than in VGGNet.

So, bottleneck layers (1 x 1 conv layers) are used to reduce the number of features (parameters).
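A rough sketch of the module in PyTorch (channel counts are illustrative, and the pooling branch of the real module is omitted): 1 x 1 bottlenecks shrink the depth before the expensive 3 x 3 and 5 x 5 convs, and the parallel branches are concatenated along the depth axis.

    import torch
    import torch.nn as nn

    class InceptionSketch(nn.Module):
        """Parallel 1x1 / 3x3 / 5x5 branches with 1x1 bottlenecks, concatenated."""
        def __init__(self, in_ch):
            super().__init__()
            self.b1 = nn.Conv2d(in_ch, 16, kernel_size=1)
            self.b3 = nn.Sequential(  # 1x1 bottleneck, then 3x3
                nn.Conv2d(in_ch, 8, kernel_size=1), nn.ReLU(),
                nn.Conv2d(8, 16, kernel_size=3, padding=1))
            self.b5 = nn.Sequential(  # 1x1 bottleneck, then 5x5
                nn.Conv2d(in_ch, 4, kernel_size=1), nn.ReLU(),
                nn.Conv2d(4, 8, kernel_size=5, padding=2))

        def forward(self, x):
            # Every branch preserves the spatial size, so outputs stack along depth.
            return torch.cat([self.b1(x), self.b3(x), self.b5(x)], dim=1)

    y = InceptionSketch(32)(torch.randn(1, 32, 28, 28))
    print(y.shape)  # torch.Size([1, 40, 28, 28])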


Full GoogleNet

In total, 22 layers;

Starts with some vanilla layers (called stem network),

Followed by 9 stacked inception layers,

Auxiliary classification outputs to inject additional gradient at lower layers.

No fully connected layers.


ResNet

Winner of the ImageNet ILSVRC challenge 2015, beating all other methods by large margins.

It goes much deeper than previous methods, with 152 layers.

It brings about the revolution of depth.

But, how to go deeper? Simply stacking more layers does not work.


Simply Stacking Many Layers?

Stacking 3x3 conv layers.

The 56-layer net has higher training error as well as test error than the shallower one.

Usually, models with higher capacity should have lower training errors.

It is not the case here. This suggests difficulties in training very deep models.

It is difficult for gradients to propagate back to lower layers.


Plain Modules vs Residual Modules

ResNet stacks residual modules instead of plain modules

In a plain module, we try to represent a target function H(x) using plain layers.

In a residual module, we have an identity connection from input to output, and use plain layers to represent the difference F(x) = H(x) − x, which is called the residual.
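A minimal sketch of a residual module in PyTorch (illustrative; real ResNet blocks also handle changes in channel count and stride): the plain layers compute F(x), and the identity connection adds x back.

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        """y = ReLU(F(x) + x), where F is two 3x3 conv layers."""
        def __init__(self, ch):
            super().__init__()
            self.f = nn.Sequential(
                nn.Conv2d(ch, ch, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv2d(ch, ch, kernel_size=3, padding=1))

        def forward(self, x):
            return torch.relu(self.f(x) + x)  # the "+ x" is the identity connection

    y = ResidualBlock(16)(torch.randn(1, 16, 8, 8))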


ResNet

In a plain net, it is difficult for gradients to backpropagate to lower layers.

In ResNet, gradients can backpropagate to lower layers because of the identity connections.


Simply Stacking Many Layers?

With a plain net, higher training and test errors are observed with more layers (for deep architectures).

With ResNet, lower training and test errors are observed with more layers (for deep architectures).


Training of CNNs

CNNs are trained the same way as feedforward networks. There are some practical tips to improve training and the results.

Data augmentation is usually used to improve generalization.

Data augmentation means to create fake data and add them to the training set.

Fake images can be created by translating, rotating or scaling real images.

Batch normalization accelerates training and improves generalization.
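For instance, a typical augmentation pipeline (a sketch assuming torchvision; the particular transforms are illustrative choices, not from the slides):

    from torchvision import transforms

    # Each epoch the network sees a randomly flipped, shifted and rotated
    # version of every training image.
    train_transform = transforms.Compose([
        transforms.RandomHorizontalFlip(),
        transforms.RandomCrop(32, padding=4),  # small random translations
        transforms.RandomRotation(10),         # rotations of up to 10 degrees
        transforms.ToTensor(),
    ])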
