Machine Learning - Lecture 09: Convolutional Neural...

Machine LearningLecture 09: Convolutional Neural Networks

Nevin L. [email protected]

Department of Computer Science and EngineeringThe Hong Kong University of Science and Technology

This set of notes is based on various sources on the internet andStanford CS course CS231n: Convolutional Neural Networks for Visual

Recognition. http://cs231n.stanford.edu/

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT press.http://www.deeplearningbook.org

Nevin L. Zhang (HKUST) Machine Learning 1 / 45

What are Convolutional Neural Networks?

Outline

1 What are Convolutional Neural Networks?

2 Convolutional Layer

3 Technical Issues with Convolutional Layer

4 Pooling Layer

5 CNN Examples

Convolutional Neural Networks

Convolutional Neural Networks (CNNs, ConvNets) are a specializedkind of neural network for processing data that has a known, grid-liketopology.

Examples include

Time-series data, which can be thought of as a 1D grid taking samplesat regular time intervals, andImage data, which can be thought of as a 2D grid of pixels

We will focus on image data

CIFAR-10 Dataset

We will use the CIFAR-10 Dataset as a background example.

The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes,with 6000 images per class.

There are 50000 training images and 10000 test images.

CNN for CIFAR-10

A CNN represents a function from the raw image pixels on one end to classscores at the other

Units at an CNN layer are arranged in 3D volumes with dimensions width,height, depth.

For CIFAR 10, an input image is represented as a volume of units with size32 x 32 x 3 (32 wide, 32 high, 3 color channels).

The output (10 class scores) is represented as a volume of units with size 1x 1 x 10.

CNN Layers

An CNN is made up of multiple layers.

Every Layer has a simple API: It transforms an input 3D volume to anoutput 3D volume with some differentiable function that may or may nothave parameters.

Three main types of layers to build CNN architectures: ConvolutionalLayer, Pooling Layer, and Fully-Connected Layer (exactly as seen inregular Neural Networks).

CNN Layers

An CNN is made up of multiple layers.

Every Layer has a simple API: It transforms an input 3D volume to anoutput 3D volume with some differentiable function that may or may nothave parameters.

Three main types of layers to build CNN architectures: ConvolutionalLayer, Pooling Layer, and Fully-Connected Layer (exactly as seen inregular Neural Networks).

Convolutional Layer

Outline

4 Pooling Layer

5 CNN Examples

Convolutional Layer

Each convolutional unit is connected only to units in a small patch (whichis called the receptive field of the unit) of the previous layer.

The receptive filed is local in space (width and height), but always full alongthe entire depth of the input volume.

Convolutional Layer

The parameters of a convolutional unit are the connection weights and thebias. They are to be learned.

Intuitively, the task is to learn weights so that the unit will activate whensome type of visual feature such as an edge is present.

Convolutional Layer

A convolutional layer consists of a volume of convolutional units.

All units on a given depth slice share the same parameters,

so that we candetect the same feature at different locations. (Edges can be at multiplelocations)

Hence, the set of weights is called a filter or kernel.Different units on the depth slices are obtained by sliding the filter.

Convolutional Layer

All units on a given depth slice share the same parameters,so that we candetect the same feature at different locations. (Edges can be at multiplelocations)

Convolutional Layer

All units on a given depth slice share the same parameters,so that we candetect the same feature at different locations. (Edges can be at multiplelocations)

Convolutional Layer

There are multiple depth slices (i.e., multiple filters) so that differentfeatures can be detected.

Convolutional Layer

The output of each depth slice is called a feature map.

The output a conv layer is a collection of stacked feature maps.

Mathematically, a conv layer maps a 3D tensor to another 3D tensor.

Convolutional Layer

Example of Learned Fitlers

Here are the 96 filters learned in the first convolution layer in AlexNet (withreceptive field: 11 x 11 x 3. Now people mostly use 3 x 3 x 3.)

A lot of these filters turns out to be edge detection filters common tohuman visual systems.

Convolutional Layer

Example of Feature Maps

Convolutional Layer

Computation of a convolutional unit

Let I be a 2D image (one of the three channels), and K be the filter.

K (m, n) = 0 when |m| > r or |n| > r where 2r is the weight and heightof the receptive field.

Computation carried by the convolutional unit at coordinates (i , j) is

S(i , j) =∑m,n

I (i + m, j + n)K (m, n)

This is cross-correlation, although it is referred to as convolution in deeplearning (en.wikipedia.org/wiki/Cross-correlation).

Convolutional Layer

NOTE: Cross-correlation vs convolution

Cross-correlation:∑

m,n I (i + m, j + n)K (m, n)

Convolution:∑

m,n I (i −m, j − n)K (m, n)

Let us flip the kernel K to get K ′(m, n) = K (−m,−n). Then∑m,n

I (i −m, j − n)K (m, n) =∑m,n

I (i −m, j − n)K ′(−m,−n)

=∑m,n

I (i + m, j + n)K ′(m, n)

So, convolution with kernel K is the same as cross-correlation with theflipped kernel K ′.

And because the kernel is to be learned., it is not necessary to distinguishbetween the two (and hence cross-correlation and convolution) in deeplearning.

Convolutional Layer

m,n I (i + m, j + n)K (m, n)

Convolution:∑

m,n I (i −m, j − n)K (m, n)

I (i −m, j − n)K (m, n) =∑m,n

I (i −m, j − n)K ′(−m,−n)

=∑m,n

I (i + m, j + n)K ′(m, n)

Convolutional Layer

m,n I (i + m, j + n)K (m, n)

Convolution:∑

m,n I (i −m, j − n)K (m, n)

I (i −m, j − n)K (m, n) =∑m,n

I (i −m, j − n)K ′(−m,−n)

=∑m,n

I (i + m, j + n)K ′(m, n)

Convolutional Layer

m,n I (i + m, j + n)K (m, n)

Convolution:∑

m,n I (i −m, j − n)K (m, n)

I (i −m, j − n)K (m, n) =∑m,n

I (i −m, j − n)K ′(−m,−n)

=∑m,n

I (i + m, j + n)K ′(m, n)

Convolutional Layer

Convolutional demo: http://cs231n.github.io/convolutional-networks/

Convolutional Layer

Computation of a convolutional unit

The result of convolution is passed through a nonlinear activation function(ReLU) to get the output of the unit.

Technical Issues with Convolutional Layer

Outline

4 Pooling Layer

5 CNN Examples

Reduction in Spatial Size

To apply the filter to an image, we can move the filter 1 pixel at a time fromleft to right and top to bottom until we process every pixel.

And we have a convolutional unit for each location.

The width and height of the array convolutional units are reduced by 2 fromthe width and height of the array of input units.

Zero Padding

If we want to maintain the spatial dimensions, we can pack extra 0 orreplicate the edge of the original image.

Zero padding helps to better preserve information on the edge.

Stride

For a CNN, sometimes we do not move the filter only by 1 pixel. If we movethe filter 2 pixels to the right, we call the stride equal to 2.

It is uncommon to use stride 3 or more.

Summary of Convolutional Layer

Accepts a volume of size W1 × H1 × D1

Requires four hyperparameters:

Number of filters K , their spatial extent F , the stride S , the amount ofzero padding P.

Produces a volume of size W2 × H2 × D2 where:

W2 = (W1 − F + 2P)/S + 1

H2 = (H1 − F + 2P)/S + 1

D2 = K

Need to make sure that (W1 − F + 2P)/S and (W2 − F + 2P)/S areintegers.

For example, we cannot have W = 10, P = 0, F = 3 and S = 2.

Accepts a volume of size W1 × H1 × D1

Requires four hyperparameters:

Number of filters K , their spatial extent F , the stride S , the amount ofzero padding P.

Produces a volume of size W2 × H2 × D2 where:

W2 = (W1 − F + 2P)/S + 1

H2 = (H1 − F + 2P)/S + 1

D2 = K

Need to make sure that (W1 − F + 2P)/S and (W2 − F + 2P)/S areintegers.

For example, we cannot have W = 10, P = 0, F = 3 and S = 2.

Number of parameters of a convolutional layer

(FFD1 + 1)K (K filters each with FFD1 + 1 parameters.)

A fully connected layer with the same number of units and no parametersharing would require (W1H1D1 + 1)W2H2D2, which can be prohibitivelylarge

AlexNet accepts images of size [227 x 227 x 3]

For first convolutional layer: F = 11, S = 4, P = 0 and K = 96.

W2 = H2 = 55 and D2 = 96The number of parameters is: (11× 11× 3 + 1)× 96 = 34, 944parameters

If a FC layer were used instead, the number of parameters would be

(227× 227× 3 + 1)× 55× 55× 96 = 44, 892, 355, 200

This is a key motivation for using convolutional layers.

(227× 227× 3 + 1)× 55× 55× 96 = 44, 892, 355, 200

Pooling Layer

Outline

4 Pooling Layer

5 CNN Examples

Pooling Layer

Pooling

The objective of a pooling layer is to reduce the spatial size of the featuremap.

It aggregates a patch of units into one unit.

There are different ways to do this, and MAX pooling is found the workthe best.

Pooling Layer

Pooling

Pooling also helps to make the representation approximately invariant tosmall translations of the input.

Invariance to translation means that if we translate the input by a smallamount, the values of most of the pooled outputs do not change.

Every value in the bottom row has changed, but only half of the values in the top row

have changed.

Invariance to local translation can be a very useful property if we care moreabout whether some feature is present than exactly where it is.

Pooling Layer

Fully-Connected Layer

Neurons in a fully connected layer have full connections to all activations inthe previous layer, as seen in regular Neural Networks.

Fully-connected layers are typically used as the last layers.

CNN Examples

Outline

4 Pooling Layer

5 CNN Examples

CNN Examples

ImageNet

The ImageNet project is a large visual database designed for use in visualobject recognition software research.

As of 2016, over ten million URLs of images have been hand-annotated byImageNet to indicate what objects are pictured

Since 2010, the ImageNet project runs an annual software contest, theImageNet Large Scale Visual Recognition Challenge (ILSVRC)

CNN Examples

AlexNet (in terms of tensor shapes)

Alexnet won the ImageNet ILSVRC challenge in 2012, beating all previousnon-DL based method.

Starting the surge of research on CNN.

It has 5 conv layers which are followed by 3 fully connected layers.

Totally, around 60M parameters, most of which are in the FC layers.

Most operations (ops) in forward propagation takes place in the Conv layers.

CNN Examples

Alexnet won the ImageNet ILSVRC challenge in 2012, beating all previousnon-DL based method. Starting the surge of research on CNN.

CNN Examples

AlexNet (in terms of connections)

CNN Examples

AlexNet

First use of ReLU

Heavy data augmentation

Dropout: 0.5 ; Batch size: 128; SGD momentum 0.9

Learning rate: 0.01, reduced by 10 manually when validation accuracyplateaus.

L2 weight decay; 7 CNN ensemble.

CNN Examples

AlexNet

First use of ReLU

CNN Examples

AlexNet

First use of ReLU

Dropout: 0.5

; Batch size: 128; SGD momentum 0.9

CNN Examples

AlexNet

First use of ReLU

Dropout: 0.5 ; Batch size: 128;

SGD momentum 0.9

CNN Examples

AlexNet

First use of ReLU

CNN Examples

AlexNet

First use of ReLU

CNN Examples

AlexNet

First use of ReLU

L2 weight decay;

7 CNN ensemble.

CNN Examples

AlexNet

First use of ReLU

CNN Examples

VGGNet (in terms of tensor shapes)

Runner-up in the ImageNet ILSVRC challenge 2014,

beating Alexnet .

Key point: Depth of network is a critical component for good performance.

It has 16 layers and always use 3 x 3 kernels.

Totally, 138M parameters.

CNN Examples

Runner-up in the ImageNet ILSVRC challenge 2014, beating Alexnet .

CNN Examples

VGGNet (in terms of connections)

CNN Examples

GoogleNet

Winner the ImageNet ILSVRC challenge 2014, 22 layers.

Main contribution was the development of an Inception Module thatdramatically reduced the number of parameters in the network (5M,compared to AlexNet with 60M).

No fully connected layers.

CNN Examples

GoogleNet

CNN Examples

GoogleNet

CNN Examples

GoogleNet

The idea of the inception layer is to cover a bigger area, but also keep a fineresolution for small information on the images.

So the idea is to convolve in parallel different sizes from the most accuratedetailing (1x1) to a bigger one (5x5).

The naive version has too many features (parameters), more than inVGGNet.

So, bottleneck layers (1 x 1 conv layers) are used to reduce the number ofthe features ( parameters).

CNN Examples

GoogleNet

CNN Examples

GoogleNet

CNN Examples

GoogleNet

CNN Examples

Full GoogleNet

Totally, 22 layers;

Starts with some vanilla layers (called stem network),

Followed by 9 stacked inception layers,

Auxiliary classification outputs to inject additional gradient at lower layers.

CNN Examples

Full GoogleNet

Totally, 22 layers;

CNN Examples

Full GoogleNet

Totally, 22 layers;

CNN Examples

Full GoogleNet

Totally, 22 layers;

CNN Examples

Full GoogleNet

Totally, 22 layers;

CNN Examples

Full GoogleNet

CNN Examples

ResNet

Winner the ImageNet ILSVRCchallenge 2015, beating all othermethods by large margins.

It goes much deeper than previousmethods with 152 layers.

It brings about the revolution ofdepth.

But, how to go deeper? Simplystacking more layers does not work.

CNN Examples

ResNet

CNN Examples

ResNet

But, how to go deeper?

Simplystacking more layers does not work.

CNN Examples

ResNet

CNN Examples

ResNet

CNN Examples

Simply Stacking Many Layers?

Stacking 3x3 conv layers.

56-layer net has higher training error as well as test error.

Usually, models with higher capacity should have lower training errors.

It is not the case here. This suggests difficulties in training very deep models.

It is difficult for gradients to propagate back to lower layers.

CNN Examples

Plain Modules vs Residual Modules

ResNet stacks residual modules instead of plain modules

In a plain module, we try to represent a target function H(x) using plainlayers

In a residual module, we have a identity connection from input to output,and use plain layers to represent the difference F (x) = H(x)− x , which iscalled residual.

CNN Examples

In a residual module, we have a identity connection from input to output,and use plain layers to represent the difference F (x) = H(x)− x ,

which iscalled residual.

CNN Examples

In a residual module, we have a identity connection from input to output,and use plain layers to represent the difference F (x) = H(x)− x , which iscalled residual.

CNN Examples

ResNet

In plain net, it is difficult for gradients to backpropagate to lower layers.

In ResNet, gradients can back propagate to lowerlayers because of the identity connections.

CNN Examples

ResNet

In plain net, it is difficult for gradients to backpropagate to lower layers.

In ResNet, gradients can back propagate to lowerlayers because of the identity connections.

CNN Examples

With plain net, higher training and test errors observed with more layers (fordeep architectures)

With ResNet, higher training and test errors observed with more layers (fordeep architectures).

CNN Examples

With plain net, higher training and test errors observed with more layers (fordeep architectures)

With ResNet, higher training and test errors observed with more layers (fordeep architectures).

CNN Examples

Training of CNNs

CNNs are trained the same ways as feedforward networks.

There are somepractical tips to improve training and the results.

Data augmentation is usually used to improve generalization.

Data augmentation means to create fake data and add them to thetraining set.Fake images can be created by translating, rotating or scaling realimages.

Batch normalization accelerates training and improves generalization.

CNN Examples

Training of CNNs

CNNs are trained the same ways as feedforward networks. There are somepractical tips to improve training and the results.

CNN Examples

Training of CNNs

CNN Examples

Training of CNNs

Machine Learning - Lecture 09: Convolutional Neural...

Documents

Transcript of Machine Learning - Lecture 09: Convolutional Neural...