
Charu C. Aggarwal

IBM T J Watson Research Center

Yorktown Heights, NY

Convolutional Neural Networks

Neural Networks and Deep Learning, Springer, 2018

Chapter 8.1–8.2

Convolutional Neural Networks

• Like recurrent neural networks, convolutional neural networks

are domain-aware neural networks.

– The structure of the neural network encodes domain-specific information.

– Specifically designed for images.

• Images have length, width, and a depth corresponding to the

number of color channels (typically 3: RGB).

• All layers are spatially structured with length, width, and

depth.

History

• Motivated by Hubel and Wiesel’s understanding of the cat’s

visual cortex.

– Particular shapes in the visual field excite neurons ⇒ Sparse connectivity with shared weights.

– Hierarchical arrangement of neurons into simple and complex cells.

– Neocognitron was first variant ⇒ Motivated LeNet-5

• Success in ImageNet (ILSVRC) competitions brought attention to deep learning.

Basic Structure of a Convolutional Neural Network

• Most layers have length, width, and depth.

– The length and width are almost always the same.

– The depth is 3 for color images, 1 for grayscale, and an

arbitrary value for hidden layers.

• Three operations are convolution, max-pooling, and ReLU.

– Maxpooling substituted with strided convolution in recent

years.

– The convolution operation is analogous to the matrix multiplication in a conventional network.

Filter for Convolution Operation

• Let the input volume of layer q have dimensions Lq × Bq × dq.

• The operation uses a filter of size Fq × Fq × dq.

– The filter's spatial dimensions must be no larger than the layer's spatial dimensions.

– The filter depth must match that of the input volume.

– Typically, the filter spatial size Fq is a small odd number like 3 or 5.

Convolution Operation

• Spatially align the top-left corner of the filter with each of (Lq − Fq + 1) × (Bq − Fq + 1) spatial positions.

– Corresponds to the number of positions for the top-left corner in the input volume, so that the filter fully fits inside the layer volume.

• Perform elementwise multiplication between input/filter over all Fq × Fq × dq aligned elements and add.

• Creates a single spatial map in the output of size (Lq − Fq + 1) × (Bq − Fq + 1).

• Multiple filters create depth in output volume.
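
As a concrete illustration of the bullets above, here is a minimal NumPy sketch of the convolution between an input volume and a single filter (the function and variable names are illustrative, not from the book):

```python
import numpy as np

def convolve_valid(volume, filt):
    """Convolve an L x B x d input volume with an F x F x d filter.

    Returns an (L - F + 1) x (B - F + 1) spatial map: the filter is aligned
    with every position where it fully fits inside the volume, and the sum
    of elementwise products is taken over all F * F * d aligned cells.
    """
    L, B, d = volume.shape
    F = filt.shape[0]
    out = np.zeros((L - F + 1, B - F + 1))
    for i in range(L - F + 1):
        for j in range(B - F + 1):
            out[i, j] = np.sum(volume[i:i + F, j:j + F, :] * filt)
    return out

# Example: a 32 x 32 x 3 input and a 5 x 5 x 3 filter give a 28 x 28 map.
volume = np.random.rand(32, 32, 3)
filt = np.random.rand(5, 5, 3)
print(convolve_valid(volume, filt).shape)   # (28, 28)
```

Stacking the maps produced by several different filters along a new axis creates the depth of the output volume.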

Convolution Operation: Pictorial Illustration of Dimensions

[Figure: (a) Input and output dimensions: a 32 × 32 × 3 INPUT convolved with 5 × 5 × 3 FILTERS produces a 28 × 28 × 5 OUTPUT; the depth of the input and filter must match, and the output depth is defined by the number of different filters (5). (b) Sliding the filter: a horizontal edge detecting filter (rows 1 1 1 / 0 0 0 / −1 −1 −1) slid over the image gives high activation at horizontal edges and zero activation in flat regions.]

Convolution Operation: Numerical Example with Depth 1

[Figure: a worked numerical example of convolving a depth-1 input with a single filter; each output entry is obtained by aligning the filter with a region of the input and adding the elementwise products.]

• Add up over the multiple activation maps from the depth

Understanding Convolution

• Sparse connectivity because we are creating a feature from

a region in the input volume of the size of the filter.

– Trying to explore smaller regions of the image to find

shapes.

• Shared weights because we use the same filter across entire

spatial volume.

– Interpret a shape in various parts of the image in the same

way.

Effects of Convolution

• Each feature in a hidden layer captures some properties of a

region of input image.

• A convolution in the qth layer increases the receptive field of

a feature from the qth layer to the (q +1)th layer.

• Consider using a 3× 3 filter successively in three layers:

– The activations in the first, second, and third hidden layers capture pixel regions of size 3 × 3, 5 × 5, and 7 × 7, respectively, in the original input image.

Need for Padding

• The convolution operation reduces the size of the (q +1)th

layer in comparison with the size of the qth layer.

– This type of reduction in size is not desirable in general, because it tends to lose some information along the borders of the image (or of the feature map, in the case of hidden layers).

• This problem can be resolved by using padding.

• By adding (Fq − 1)/2 “pixels” all around the borders of the

feature map, one can maintain the size of the spatial image.

Application of Padding with 2 Zeros

[Figure: the input activation map is padded with two rings of zeros around its borders (PAD); the interior values are unchanged, and the spatial footprint grows by 2 on each side.]

Types of Padding

• No padding: When no padding is used around the borders of the image ⇒ Reduces the size of the spatial footprint by (Fq − 1).

• Half padding: Pad with (Fq − 1)/2 “pixels” ⇒ Maintains the size of the spatial footprint.

• Full padding: Pad with (Fq − 1) “pixels” ⇒ Increases the size of the spatial footprint by (Fq − 1).

2 × (Amount Padded) + (Size Reduction) = Fq − 1    (1)

• Note that the amount padded is on both sides ⇒ Explains the factor of 2

Strided Convolution

• When a stride of Sq is used in the qth layer, the convolution

is performed at the locations 1, Sq + 1, 2Sq + 1, and so on

along both spatial dimensions of the layer.

• The spatial size of the output on performing this convolution has a height of (Lq − Fq)/Sq + 1 and a width of (Bq − Fq)/Sq + 1.

– Exact divisibility is required.

• Strided convolutions are sometimes used in lieu of max-pooling.
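
The no/half/full padding rules and the stride formula can be checked with a small helper (a sketch under the slide's notation, with P denoting the number of pixels padded on each side):

```python
def output_size(L, F, S=1, P=0):
    """Spatial output size for input size L, filter size F, stride S,
    and P padded pixels on each side: (L - F + 2P)/S + 1."""
    assert (L - F + 2 * P) % S == 0, "exact divisibility is required"
    return (L - F + 2 * P) // S + 1

print(output_size(32, 5))          # no padding: 28
print(output_size(32, 5, P=2))     # half padding (F - 1)/2 = 2: 32
print(output_size(32, 5, P=4))     # full padding F - 1 = 4: 36
print(output_size(7, 3, S=2))      # strided convolution: 3
```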

Max Pooling

• The pooling operation works on small grid regions of size Pq × Pq in each layer, and produces another layer with the same depth.

• For each square region of size Pq × Pq in each of the dq activation maps, the maximum of these values is returned.

• It is common to use a stride Sq > 1 in pooling (often we have Pq = Sq).

• The length of the new layer will be (Lq − Pq)/Sq + 1 and the breadth will be (Bq − Pq)/Sq + 1.

• Pooling drastically reduces the spatial dimensions of each activation map.
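
A minimal NumPy sketch of max-pooling on a single activation map (it would be applied independently to each of the dq maps; names are illustrative):

```python
import numpy as np

def max_pool(act_map, P, S):
    """Max-pool an L x B activation map with P x P regions at stride S.
    The output size is (L - P)/S + 1 by (B - P)/S + 1."""
    L, B = act_map.shape
    out = np.zeros(((L - P) // S + 1, (B - P) // S + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = act_map[i * S:i * S + P, j * S:j * S + P].max()
    return out

act = np.random.rand(7, 7)
print(max_pool(act, P=3, S=1).shape)   # (5, 5)
print(max_pool(act, P=3, S=2).shape)   # (3, 3)
```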

Pooling Example

[Figure: 3 × 3 max-pooling at stride 1 applied to an input activation map; each entry of the output is the maximum value over the corresponding 3 × 3 region of the input.]

ReLU

• Use of ReLU is a straightforward one-to-one operation.

• The number of feature maps and the spatial footprint size are retained.

• Often stuck at the end of a convolution operation and not

shown in architectural diagrams.

Fully Connected Layers: Stuck at the End

• Each feature in the final spatial layer is connected to each hidden state in the first fully connected layer.

• This layer functions in exactly the same way as a traditional feed-forward network.

• In most cases, one might use more than one fully connected layer to increase the power of the computations towards the end.

• The connections among these layers are exactly structured like a traditional feed-forward network.

• The vast majority of parameters lie in the fully connected layers.
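
A quick back-of-the-envelope check of the last point (the 7 × 7 × 512 final volume and the 4096-unit fully connected layer are hypothetical but typical sizes, not taken from the slide):

```python
# Parameters in one 3 x 3 convolution keeping depth 512 (excluding biases):
conv_params = 3 * 3 * 512 * 512    # 2,359,296

# Parameters connecting a 7 x 7 x 512 volume to a 4096-unit FC layer:
fc_params = 7 * 7 * 512 * 4096     # 102,760,448

print(conv_params, fc_params)      # the FC layer dominates by roughly 40x
```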

Interleaving Between Layers

• The convolution, pooling, and ReLU layers are typically interleaved in order to increase expressive power.

• The ReLU layers often follow the convolutional layers, just

as a nonlinear activation function typically follows the linear

dot product in traditional neural networks.

• After two or three sets of convolutional-ReLU combinations,

one might have a max-pooling layer.

• Examples (C = convolution, R = ReLU, P = max-pooling; see the sketch below):

CRCRP

CRCRCRP
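
As a concrete illustration, the CRCRP pattern could be written in PyTorch as follows (a sketch with arbitrary channel counts, not taken from the book):

```python
import torch
import torch.nn as nn

# C-R-C-R-P: two convolution + ReLU pairs followed by one max-pooling layer.
crcrp = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),   # C
    nn.ReLU(),                                    # R
    nn.Conv2d(32, 64, kernel_size=3, padding=1),  # C
    nn.ReLU(),                                    # R
    nn.MaxPool2d(kernel_size=2, stride=2),        # P
)

x = torch.randn(1, 3, 32, 32)
print(crcrp(x).shape)   # torch.Size([1, 64, 16, 16])
```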

Example: LeNet-5: Full Notation

[Figure: LeNet-5 in full notation. Input: a 32 × 32 grayscale feature map of pixels. C1: 5 × 5 convolutions → 28 × 28 × 6; S2: 2 × 2 subsampling → 14 × 14 × 6; C3: 5 × 5 convolutions → 10 × 10 × 16; S4: 2 × 2 subsampling → 5 × 5 × 16; C5: 120 units; F6: 84 units; O: 10 outputs. Convolution and subsampling operations alternate.]

• The earliest convolutional network
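
A rough PyTorch rendering of the layout above (a sketch only: the original LeNet-5 used sigmoid/tanh-style activations and subsampling layers, for which ReLU and max-pooling are substituted here for simplicity):

```python
import torch
import torch.nn as nn

lenet5 = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5),    # C1: 32x32x1 -> 28x28x6
    nn.ReLU(),
    nn.MaxPool2d(2),                   # S2: -> 14x14x6
    nn.Conv2d(6, 16, kernel_size=5),   # C3: -> 10x10x16
    nn.ReLU(),
    nn.MaxPool2d(2),                   # S4: -> 5x5x16
    nn.Flatten(),
    nn.Linear(5 * 5 * 16, 120),        # C5
    nn.ReLU(),
    nn.Linear(120, 84),                # F6
    nn.ReLU(),
    nn.Linear(84, 10),                 # O
)

print(lenet5(torch.randn(1, 1, 32, 32)).shape)   # torch.Size([1, 10])
```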

Example: LeNet-5: Shorthand Notation

[Figure: LeNet-5 in shorthand notation. Input: a 32 × 32 grayscale feature map; C1: 5 × 5 convolutions → 28 × 28 × 6; C3: 5 × 5 convolutions → 10 × 10 × 16; C5: 120; F6: 84; O: 10. Subsampling/max-pooling layers are shown implicitly as “SS” or “MP”.]

• Subsampling layers are not explicitly shown

Feature Engineering

[Figure: a horizontal edge detecting filter (rows 1 1 1 / 0 0 0 / −1 −1 −1) applied to an image detects horizontal edges, and a vertical edge detecting filter (rows 1 0 −1 / 1 0 −1 / 1 0 −1) detects vertical edges; a filter in the next layer (whose visualization is uninterpretable) combines them so that a rectangle is detected.]

• The early layers detect primitive features and later layers complex ones.

• During training, the filters will be learned to identify relevant shapes.

Hierarchical Feature Engineering

• Successive layers put together primitive features to create

more complex features.

• Complex features represent regularities in the data that are valuable as features.

• Mid-level features might be honeycombs.

• Higher-level features might be a part of a face.

• The network is a master of extracting repeating shapes in a data-driven manner.

Charu C. Aggarwal

IBM T J Watson Research Center

Yorktown Heights, NY

Backpropagation in Convolutional Neural

Networks and Its Visualization Applications

Neural Networks and Deep Learning, Springer, 2018

Chapter 8.3, 8.5

Backpropagation in Convolutional Neural Networks

• Three operations of convolutions, max-pooling, and ReLU.

• The ReLU backpropagation is the same as any other network.

– Passes gradient to a previous layer only if the original input

value was positive.

• The max-pooling passes the gradient only through the largest cell in each pooled region of the input volume.

• Main complexity is in backpropagation through convolutions.

Backpropagating through Convolutions

• Traditional backpropagation is a transposed matrix multiplication.

• Backpropagation through convolutions is a transposed convolution (i.e., with an inverted filter).

• The derivative of the loss with respect to each cell is backpropagated.

– Elementwise approach of computing which cell in the input contributes to which cell in the output.

– Multiplying with an inverted filter.

• Convert layer-wise derivatives to weight-wise derivatives and add over shared weights.
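
A small NumPy/SciPy check of the inverted-filter rule for a single channel (a sketch, not the book's code): the forward pass is a “valid” cross-correlation, and the gradient with respect to the input is the “full” convolution of the output gradient with the filter, which equals a cross-correlation with the spatially inverted filter.

```python
import numpy as np
from scipy.signal import correlate2d, convolve2d

X = np.random.rand(5, 5)               # single-channel input
W = np.random.rand(3, 3)               # filter
Y = correlate2d(X, W, mode='valid')    # forward pass: 3 x 3 output

dY = np.random.rand(*Y.shape)          # gradient of the loss w.r.t. the output

# Gradient w.r.t. the input: "full" convolution with W
# (convolve2d flips the filter, i.e., uses the inverted filter).
dX = convolve2d(dY, W, mode='full')

# Same thing as a cross-correlation with the explicitly inverted filter:
dX_check = correlate2d(np.pad(dY, 2), W[::-1, ::-1], mode='valid')
print(np.allclose(dX, dX_check))       # True

# Gradient w.r.t. the filter: correlate the input with the output gradient
# (this is where the addition over shared weights happens).
dW = correlate2d(X, dY, mode='valid')
```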

Backpropagation with an Inverted Filter [Single Channel]

[Figure: a 3 × 3 filter with entries a, b, c / d, e, f / g, h, i used during convolution becomes the spatially inverted filter i, h, g / f, e, d / c, b, a during backpropagation.]

• Multichannel case: We have 20 filters for 3 input channels

(RGB) ⇒ We have 20× 3 = 60 spatial slices.

• Each of these 60 spatial slices will be inverted and grouped

into 3 sets of filters with depth 20 (one each for RGB).

• Backpropagate with newly grouped filters.
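
This regrouping amounts to swapping the filter and channel axes and inverting each spatial slice. A NumPy sketch with the shapes described above (5 × 5 filters are assumed for concreteness):

```python
import numpy as np

# Forward filters: 20 filters, each of depth 3 (RGB) and spatial size 5 x 5.
W = np.random.rand(20, 3, 5, 5)

# Backpropagation filters: 3 filters (one per input channel), each of depth 20,
# obtained by swapping the filter/channel axes and inverting each spatial slice.
W_back = np.flip(W.transpose(1, 0, 2, 3), axis=(2, 3))

print(W_back.shape)   # (3, 20, 5, 5): 3 sets of filters with depth 20
```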

Convolution as a Matrix Multiplication

• Convolution can be presented as a matrix multiplication.

– Useful during forward and backward propagation.

– Backward propagation can be presented as transposed

matrix multiplication

Convolution as a Matrix Multiplication

[Figure: worked example.] The 3 × 3 input

1 3 4
9 6 7
5 0 2

is flattened to the 9-dimensional vector f = [1, 3, 4, 9, 6, 7, 5, 0, 2]^T. A 2 × 2 filter with entries a, b / c, d is converted to the 4 × 9 sparse matrix

C = [ a b 0 c d 0 0 0 0
      0 a b 0 c d 0 0 0
      0 0 0 a b 0 c d 0
      0 0 0 0 a b 0 c d ]

Multiplying C · f gives [a+3b+9c+6d, 3a+4b+6c+7d, 9a+6b+5c, 6a+7b+2d]^T, which is reshaped to the 2 × 2 spatial output

a+3b+9c+6d   3a+4b+6c+7d
9a+6b+5c     6a+7b+2d
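
The same construction in a few lines of NumPy (a sketch; a random 2 × 2 filter plays the role of [[a, b], [c, d]]):

```python
import numpy as np

X = np.array([[1., 3., 4.],
              [9., 6., 7.],
              [5., 0., 2.]])
W = np.random.rand(2, 2)            # plays the role of [[a, b], [c, d]]

n, F = X.shape[0], W.shape[0]
m = n - F + 1                       # output spatial size (2 here)

# Build the sparse matrix C: one row per output position.
C = np.zeros((m * m, n * n))
for i in range(m):
    for j in range(m):
        row = np.zeros((n, n))
        row[i:i + F, j:j + F] = W
        C[i * m + j] = row.ravel()

f = X.ravel()                        # flatten the input to a 9-dimensional vector
out = (C @ f).reshape(m, m)          # reshape the product back to 2 x 2

# Direct convolution for comparison.
direct = np.array([[np.sum(X[i:i + F, j:j + F] * W) for j in range(m)]
                   for i in range(m)])
print(np.allclose(out, direct))      # True
```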

Gradient-Based Visualization

• Can backpropagate all the way back to the input layer.

• Imagine the output o is the probability of a class label like

“dog”

• Compute ∂o/∂xi for each pixel xi of each color.

– Compute maximum absolute magnitude of gradient over

RGB colors and create grayscale image.
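
A hedged PyTorch sketch of this computation (the pretrained network, the input tensor, and the class index are placeholders; newer torchvision versions use a `weights=` argument instead of `pretrained=True`):

```python
import torch
from torchvision import models

model = models.resnet18(pretrained=True).eval()       # placeholder classifier
x = torch.randn(1, 3, 224, 224, requires_grad=True)   # stand-in for a real image
class_idx = 207                                        # hypothetical "dog" class index

o = model(x)[0, class_idx]     # output score for the chosen class
o.backward()                   # gives the gradient of o w.r.t. every input pixel

# Maximum absolute gradient magnitude over the RGB channels -> grayscale map.
saliency = x.grad.abs().max(dim=1)[0].squeeze()
print(saliency.shape)          # torch.Size([224, 224])
```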

Gradient-Based Visualization

• Examples of portions of specific images activated by particular class labels. (© 2014 Simonyan, Vedaldi, and Zisserman)

Getting Cleaner Visualizations

• The idea of “deconvnet” is sometimes used for cleaner visualizations.

• Main difference is in terms of how ReLU’s are treated.

– Normal backpropagation passes on gradient through ReLU

when input is positive.

– “Deconvnet” passes on gradient through ReLU when

backpropagated gradient is positive.

Guided Backpropagation

• A variation of backpropagation, referred to as guided backpropagation, is more useful for visualization.

– Guided backpropagation is a combination of gradient-based visualization and “deconvnet.”

– Set an entry to zero, if either of these rules sets an entry

to zero.
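
The three ReLU backward rules can be contrasted in a few lines of NumPy (a sketch; `x` holds the inputs of the ReLU in the forward pass and `g` the gradient arriving from the layer above):

```python
import numpy as np

x = np.array([[ 2., -1.,  3.],
              [-2.,  4.,  1.],
              [ 0.5, -3.,  2.]])      # ReLU inputs in the forward pass
g = np.random.randn(3, 3)             # gradient from the layer above

backprop  = g * (x > 0)               # traditional: pass where the input was positive
deconvnet = g * (g > 0)               # "deconvnet": pass where the gradient is positive
guided    = g * (x > 0) * (g > 0)     # guided: zero if either rule zeroes the entry
```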

Illustration of “Deconvnet” and Guided Backpropagation

[Figure: a 3 × 3 activation map (layer i) passes through a ReLU in the forward pass, zeroing its negative entries (layer i + 1). Given the “gradients” at layer i + 1, traditional backpropagation zeroes the entries at positions where the forward input was negative; the “deconvnet” instead applies a ReLU to the gradient itself, zeroing the negative gradient entries; guided backpropagation zeroes an entry if either of the two rules zeroes it.]

Derivatives of Features with Respect to Input Pixels

[Figure: visualizations of the derivatives of features with respect to input pixels, produced by “deconv” and by guided backpropagation, shown with the corresponding image crops. (© 2015 Springenberg, Dosovitskiy, Brox, Riedmiller)]

Creating a fantasy image that matches a label

• The value of o might be the unnormalized score for “banana.”

• We would like to learn the input image x that maximizes the

output o, while applying regularization to x:

Maximize over x:   J(x) = o − λ ||x||²

• Here, λ is the regularization parameter.

• Update x while keeping weights fixed!
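
A minimal PyTorch sketch of this optimization (the network, class index, learning rate, and λ are placeholder choices; newer torchvision versions use a `weights=` argument): the image x is updated by gradient ascent on J(x) while the network weights stay fixed.

```python
import torch
from torchvision import models

model = models.resnet18(pretrained=True).eval()   # weights are kept fixed
for p in model.parameters():
    p.requires_grad_(False)

class_idx = 954        # hypothetical "banana" class index
lam = 0.01             # regularization parameter lambda
x = torch.zeros(1, 3, 224, 224, requires_grad=True)
opt = torch.optim.SGD([x], lr=1.0)    # optimize the image, not the weights

for _ in range(200):
    opt.zero_grad()
    o = model(x)[0, class_idx]         # unnormalized score for the class
    loss = -(o - lam * x.norm() ** 2)  # maximize J(x) = o - lambda * ||x||^2
    loss.backward()
    opt.step()
```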

Examples

[Figure: fantasy images generated for the labels “cup,” “dalmatian,” and “goose.”]

Generalization to Autoencoders

• Ideas have been generalized to convolutional autoencoders.

• Deconvolution operation similar to backpropagation.

• One can combine pretraining with backpropagation.

Charu C. Aggarwal

IBM T J Watson Research Center

Yorktown Heights, NY

Case Studies of Convolutional Neural

Networks

Neural Networks and Deep Learning, Springer, 2018

Chapter 8.4

AlexNet

• Winner of the 2012 ILSVRC contest

– Brought attention to the area of deep learning

• First use of ReLU

• Use Dropout with probability 0.5

• Use of 3× 3 pools at stride 2

• Heavy data augmentation

AlexNet Architecture

[Figure: AlexNet architecture. INPUT: 224 × 224 × 3. C1: 11 × 11 convolutions → 55 × 55 × 96, followed by MP. C2: 5 × 5 convolutions → 27 × 27 × 256, followed by MP. C3: 3 × 3 convolutions → 13 × 13 × 384. C4: 3 × 3 convolutions → 13 × 13 × 384. C5: 3 × 3 convolutions → 13 × 13 × 256, followed by MP. FC6: 4096. FC7: 4096. FC8: 1000.]

Other Comments on AlexNet

• Popularized the notion of FC7 features

• The features in the penultimate layer are often extracted and

used for various applications

• Pretrained versions of AlexNet are available in most deep

learning frameworks.

• Single network had top-5 error rate of 18.2%

• An ensemble of seven CNNs had a top-5 error rate of 15.4%

ZfNet

             AlexNet                    ZFNet
Volume:      224 × 224 × 3              224 × 224 × 3
Operations:  Conv 11 × 11 (stride 4)    Conv 7 × 7 (stride 2), MP
Volume:      55 × 55 × 96               55 × 55 × 96
Operations:  Conv 5 × 5, MP             Conv 5 × 5 (stride 2), MP
Volume:      27 × 27 × 256              13 × 13 × 256
Operations:  Conv 3 × 3, MP             Conv 3 × 3
Volume:      13 × 13 × 384              13 × 13 × 512
Operations:  Conv 3 × 3                 Conv 3 × 3
Volume:      13 × 13 × 384              13 × 13 × 1024
Operations:  Conv 3 × 3                 Conv 3 × 3
Volume:      13 × 13 × 256              13 × 13 × 512
Operations:  MP, Fully connect          MP, Fully connect
FC6:         4096                       4096
Operations:  Fully connect              Fully connect
FC7:         4096                       4096
Operations:  Fully connect              Fully connect
FC8:         1000                       1000
Operations:  Softmax                    Softmax

• An AlexNet variation ⇒ Its refinement (Clarifai) won ILSVRC 2013 (top-5 error 14.8% / 11.1%)

VGG

• One of the top entries in 2014 (but not winner).

• Notable for its design principle of reduced filter size and increased depth

• All filters had spatial footprint of 3× 3 and padding of 1

• Maxpooling was done using 2× 2 regions at stride 2

• Experimented with a variety of configurations between 11

and 19 layers

Principle of Reduced Filter Size

• A single 7×7 filter will have 49 parameters over one channel.

• Three 3×3 filters will have a receptive field of size 7×7, but

will have only 27 parameters.

• Regularization advantage: A single filter will capture only

primitive features but three successive filters will capture

more complex features.
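
The same parameter count for a depth-C input and output (a quick check; C = 64 is an arbitrary example):

```python
C = 64                                   # example channel count
single_7x7 = 7 * 7 * C * C               # one 7x7 layer: 49 * C^2 parameters
three_3x3 = 3 * (3 * 3 * C * C)          # three 3x3 layers: 27 * C^2 parameters
print(single_7x7, three_3x3)             # 200704 110592
```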

VGG Configurations

Name:      A     A-LRN     B     C     D     E
# Layers:  11    11        13    16    16    19

(C3D64 denotes a 3 × 3 convolution with depth 64, C1D256 a 1 × 1 convolution with depth 256, M max-pooling, FC a fully connected layer, and S the softmax output.)

A:      C3D64, M, C3D128, M, C3D256, C3D256, M, C3D512, C3D512, M,
        C3D512, C3D512, M, FC4096, FC4096, FC1000, S
A-LRN:  Same as A, with an LRN layer after the first C3D64.
B:      C3D64, C3D64, M, C3D128, C3D128, M, C3D256, C3D256, M,
        C3D512, C3D512, M, C3D512, C3D512, M, FC4096, FC4096, FC1000, S
C:      Same as B, with an extra C1D256, C1D512, and C1D512 at the end of the
        third, fourth, and fifth convolutional blocks, respectively.
D:      Same as C, with the 1 × 1 convolutions replaced by C3D256, C3D512, and C3D512.
E:      Same as D, with one more C3D256, C3D512, and C3D512 in the third,
        fourth, and fifth convolutional blocks, respectively.

VGG Design Choices and Performance

• Max-pooling had the responsibility of reducing the spatial footprint.

• The number of filters was often increased by a factor of 2 after each max-pooling

• VGG had top-5 error rate of 7.3%

• Column D was the best architecture [previous slide]

GoogLeNet

• Introduced the principle of inception architecture.

• The initial part of the architecture is much like a traditional

convolutional network, and is referred to as the stem.

• The key part of the network is an intermediate layer, referred

to as an inception module.

• Allows us to capture images at varying levels of detail in

different portions of the network.

Inception Module: Motivation

• Key information in the images is available at different levels

of detail.

– A large filter can capture information in a bigger area containing limited variation.

– A small filter can capture detailed information in a smaller area.

• Piping together many small filters is wasteful ⇒ Why not let

the neural network decide?

Basic Inception Module

[Figure: basic inception module. The previous layer feeds parallel 1 × 1, 3 × 3, and 5 × 5 convolutions and a 3 × 3 max-pooling branch; their outputs are combined by filter concatenation.]

• Main problem is computational efficiency ⇒ First reduce

depth

Computationally Efficient Inception Module

[Figure: computationally efficient inception module. 1 × 1 convolutions are applied to the previous layer before the 3 × 3 and 5 × 5 convolutions, and after the 3 × 3 max-pooling, in order to reduce depth; the branch outputs are combined by filter concatenation.]

• Note the 1× 1 filters
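
A compact PyTorch sketch of such a module (the channel counts are illustrative): the 1 × 1 convolutions reduce depth before the more expensive 3 × 3 and 5 × 5 branches, and the branch outputs are concatenated along the depth dimension.

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 64, kernel_size=1)                 # 1x1 branch
        self.b2 = nn.Sequential(nn.Conv2d(in_ch, 96, kernel_size=1),  # reduce depth
                                nn.Conv2d(96, 128, kernel_size=3, padding=1))
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, 16, kernel_size=1),  # reduce depth
                                nn.Conv2d(16, 32, kernel_size=5, padding=2))
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, 32, kernel_size=1))  # 1x1 after pool

    def forward(self, x):
        # Filter concatenation along the depth dimension.
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

x = torch.randn(1, 192, 28, 28)
print(InceptionModule(192)(x).shape)   # torch.Size([1, 256, 28, 28])
```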

Design Principles of Output Layer

• It is common to use fully connected layers near the output.

• GoogLeNet uses average pooling across the whole spatial

area of the final set of activation maps to create a single

value.

• The number of features created in the final layer will be

exactly equal to the number of filters.

– Helps in reducing parameter footprint

• Detailed connectivity is not required for applications in which

only a class label needs to be predicted.

GoogLeNet Architecture

[Figure: the overall GoogLeNet architecture: a stem of conventional convolutional layers followed by a sequence of inception modules, ending with average pooling and the output layer.]

Other Details on GoogLeNet

• Winner of ILSVRC 2014

• Reached top-5 error rate of 6.7%

• Contained 22 layers

• Several advanced variants with improved accuracy.

• Some have been combined with ResNet

ResNet Motivation

• Increasing depth has advantages but also makes the network

harder to train.

• Even the error on the training data is high!

– Poor convergence

• Selecting an architecture with unimpeded gradient flows is

helpful!

ResNet

• ResNet brought the depth of neural networks into the hundreds.

• Based on the principle of iterative feature engineering rather

than hierarchical feature engineering.

• Principle of partially copying features across layers.

– Where have we heard this before?

• Human-level performance with top-5 error rate of 3.6%.

Skip Connections in ResNet

[Figure: (a) Residual module: the input x passes through two weight layers with a ReLU in between to produce F(x); the identity x is added through a skip connection to give F(x) + x, followed by a ReLU. (b) Partial architecture: 7 × 7 conv, 64, /2; pool, /2; a stack of 3 × 3 conv, 64 layers; then 3 × 3 conv, 128 layers, where the first 128-filter convolution uses stride 2.]

(a) Residual module (b) Partial architecture

Details of ResNet

• A 3 × 3 filter is used at a stride/padding of 1 ⇒ Adopted

from VGG and maintains dimensionality.

• Some layers use strided convolutions to reduce each spatial

dimension by a factor of 2.

– A linear projection matrix reduces the dimensionality.

– The projection matrix defines a set of 1 × 1 convolution

operations with stride of 2.
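
A PyTorch sketch of a basic residual block along these lines (an illustration rather than the exact ResNet code; batch normalization is omitted): when the block halves the spatial size with a stride-2 convolution, a 1 × 1 projection with stride 2 is applied on the skip connection so that the two paths can be added.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1)
        self.relu = nn.ReLU()
        # Projection shortcut: 1x1 convolutions at stride 2 when dimensions change.
        self.project = (nn.Conv2d(in_ch, out_ch, 1, stride=stride)
                        if stride != 1 or in_ch != out_ch else nn.Identity())

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))   # F(x)
        return self.relu(out + self.project(x))      # F(x) + x

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64, 128, stride=2)(x).shape)   # torch.Size([1, 128, 28, 28])
```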

Importance of Depth

Name              Year          Number of Layers    Top-5 Error
-                 Before 2012   ≤ 5                 > 25%
AlexNet           2012          8                   15.4%
ZfNet/Clarifai    2013          8 / > 8             14.8% / 11.1%
VGG               2014          19                  7.3%
GoogLeNet         2014          22                  6.7%
ResNet            2015          152                 3.6%

Pretrained Models

• Many of the models discussed in previous slides are available

as pre-trained models with ImageNet data.

• One can use the features from fully connected layers for

nearest neighbor search.

– A nearest neighbor classifier on raw pixels vs features from

fully connected layers will show huge differences.

• One can train the output layer for other applications like regression while retaining weights in other layers.
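
A hedged sketch of this workflow with torchvision (a pretrained AlexNet is used as an example; newer torchvision versions replace `pretrained=True` with a `weights=` argument):

```python
import torch
from torchvision import models

alexnet = models.alexnet(pretrained=True).eval()

# Features from the penultimate fully connected layer ("FC7 features"):
# drop the final classification layer and keep the rest of the classifier.
alexnet.classifier = torch.nn.Sequential(*list(alexnet.classifier.children())[:-1])

x = torch.randn(1, 3, 224, 224)        # stand-in for a preprocessed image
features = alexnet(x)
print(features.shape)                  # torch.Size([1, 4096])

# These 4096-dimensional features can be used for nearest neighbor search,
# or a new output layer can be trained on them for another task.
```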

Object Localization

• Need to classify the image together with a bounding box (four numbers)

Object Localization

[Figure: object localization network. Convolution layers (with weights fixed for both classification and regression) feed two heads: a fully connected classification head ending in a softmax, trained for classification and outputting class probabilities, and a fully connected regression head ending in a linear layer, trained for regression and outputting the bounding box (four numbers).]

• The fully connected layers for the classification and regression heads are trained separately.

Other Applications

• Object detection: Multiple objects

• Text and sequence processing applications

– 1-dimensional convolutions

• Video and spatiotemporal data

– 3-dimensional convolutions