
Charu C. Aggarwal

IBM T J Watson Research Center

Yorktown Heights, NY

Convolutional Neural Networks

Neural Networks and Deep Learning, Springer, 2018

Chapter 8.1–8.2

Convolutional Neural Networks

• Like recurrent neural networks, convolutional neural networks

are domain-aware neural networks.

– The structure of the neural network encodes domain-specific information.

– Specifically designed for images.

• Images have length, width, and a depth corresponding to the

number of color channels (typically 3: RGB).

• All layers are spatially structured with length, width, and

depth.

History

• Motivated by Hubel and Wiesel’s understanding of the cat’s

visual cortex.

– Particular shapes in the visual field excite neurons ⇒ Sparse connectivity with shared weights.

– Hierarchical arrangement of neurons into simple and complex cells.

– Neocognitron was first variant ⇒ Motivated LeNet-5

• Success in ImageNet (ILSVRC) competitions brought attention to deep learning.

Basic Structure of a Convolutional Neural Network

• Most layers have length, width, and depth.

– The length and width are almost always the same.

– The depth is 3 for color images, 1 for grayscale, and an

arbitrary value for hidden layers.

• Three operations are convolution, max-pooling, and ReLU.

– Maxpooling substituted with strided convolution in recent

years.

– The convolution operation is analogous to the matrix multiplication in a conventional network.

Filter for Convolution Operation

• Let the input volume of layer q have dimensions Lq × Bq × dq.

• The operation uses a filter of size Fq × Fq × dq.

– The filter's spatial dimensions must be no larger than the layer's spatial dimensions.

– The filter depth must match that of the input volume.

– Typically, the filter spatial size Fq is a small odd number like 3 or 5.

Convolution Operation

• Spatially align the top-left corner of the filter with each of (Lq − Fq + 1) × (Bq − Fq + 1) spatial positions.

– Corresponds to the number of positions for the top-left corner in the input volume, so that the filter fully fits inside the layer volume.

• Perform elementwise multiplication between input/filter over all Fq × Fq × dq aligned elements and add.

• Creates a single spatial map in the output of size (Lq − Fq + 1) × (Bq − Fq + 1).

• Multiple filters create depth in output volume.
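
As a concrete illustration of the bullets above, here is a minimal NumPy sketch of the convolution between an input volume and a single filter (the function and variable names are illustrative, not from the book):

```python
import numpy as np

def convolve_valid(volume, filt):
    """Convolve an L x B x d input volume with an F x F x d filter.

    Returns an (L - F + 1) x (B - F + 1) spatial map: the filter is aligned
    with every position where it fully fits inside the volume, and the sum
    of elementwise products is taken over all F * F * d aligned cells.
    """
    L, B, d = volume.shape
    F = filt.shape[0]
    out = np.zeros((L - F + 1, B - F + 1))
    for i in range(L - F + 1):
        for j in range(B - F + 1):
            out[i, j] = np.sum(volume[i:i + F, j:j + F, :] * filt)
    return out

# Example: a 32 x 32 x 3 input and a 5 x 5 x 3 filter give a 28 x 28 map.
volume = np.random.rand(32, 32, 3)
filt = np.random.rand(5, 5, 3)
print(convolve_valid(volume, filt).shape)   # (28, 28)
```

Stacking the maps produced by several different filters along a new axis creates the depth of the output volume.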

Convolution Operation: Pictorial Illustration of Dimensions

[Figure: (a) Input and output dimensions: a 32 × 32 × 3 INPUT convolved with 5 × 5 × 3 FILTERS produces a 28 × 28 × 5 OUTPUT; the depth of the input and filter must match, and the output depth is defined by the number of different filters (5). (b) Sliding the filter: a horizontal edge detecting filter (rows 1 1 1 / 0 0 0 / −1 −1 −1) slid over the image gives high activation at horizontal edges and zero activation in flat regions.]

Convolution Operation: Numerical Example with Depth 1

[Figure: a worked numerical example of convolving a depth-1 input with a single filter; each output entry is obtained by aligning the filter with a region of the input and adding the elementwise products.]

• Add up over the multiple activation maps from the depth

Understanding Convolution

• Sparse connectivity because we are creating a feature from

a region in the input volume of the size of the filter.

– Trying to explore smaller regions of the image to find

shapes.

• Shared weights because we use the same filter across entire

spatial volume.

– Interpret a shape in various parts of the image in the same

way.

Effects of Convolution

• Each feature in a hidden layer captures some properties of a

region of input image.

• A convolution in the qth layer increases the receptive field of

a feature from the qth layer to the (q +1)th layer.

• Consider using a 3× 3 filter successively in three layers:

– The activations in the first, second, and third hidden layers capture pixel regions of size 3 × 3, 5 × 5, and 7 × 7, respectively, in the original input image.

Need for Padding

• The convolution operation reduces the size of the (q +1)th

layer in comparison with the size of the qth layer.

– This type of reduction in size is not desirable in general, because it tends to lose some information along the borders of the image (or of the feature map, in the case of hidden layers).

• This problem can be resolved by using padding.

• By adding (Fq − 1)/2 “pixels” all around the borders of the

feature map, one can maintain the size of the spatial image.

Application of Padding with 2 Zeros

[Figure: the input activation map is padded with two rings of zeros around its borders (PAD); the interior values are unchanged, and the spatial footprint grows by 2 on each side.]

Types of Padding

• No padding: When no padding is used around the borders of the image ⇒ Reduces the size of the spatial footprint by (Fq − 1).

• Half padding: Pad with (Fq − 1)/2 “pixels” ⇒ Maintains the size of the spatial footprint.

• Full padding: Pad with (Fq − 1) “pixels” ⇒ Increases the size of the spatial footprint by (Fq − 1).

2 × (Amount Padded) + (Size Reduction) = Fq − 1    (1)

• Note that the amount padded is on both sides ⇒ Explains the factor of 2

Strided Convolution

• When a stride of Sq is used in the qth layer, the convolution

is performed at the locations 1, Sq + 1, 2Sq + 1, and so on

along both spatial dimensions of the layer.

• The spatial size of the output on performing this convolution has a height of (Lq − Fq)/Sq + 1 and a width of (Bq − Fq)/Sq + 1.

– Exact divisibility is required.

• Strided convolutions are sometimes used in lieu of max-pooling.
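
The no/half/full padding rules and the stride formula can be checked with a small helper (a sketch under the slide's notation, with P denoting the number of pixels padded on each side):

```python
def output_size(L, F, S=1, P=0):
    """Spatial output size for input size L, filter size F, stride S,
    and P padded pixels on each side: (L - F + 2P)/S + 1."""
    assert (L - F + 2 * P) % S == 0, "exact divisibility is required"
    return (L - F + 2 * P) // S + 1

print(output_size(32, 5))          # no padding: 28
print(output_size(32, 5, P=2))     # half padding (F - 1)/2 = 2: 32
print(output_size(32, 5, P=4))     # full padding F - 1 = 4: 36
print(output_size(7, 3, S=2))      # strided convolution: 3
```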

Max Pooling

• The pooling operation works on small grid regions of size Pq × Pq in each layer, and produces another layer with the same depth.

• For each square region of size Pq × Pq in each of the dq activation maps, the maximum of these values is returned.

• It is common to use a stride Sq > 1 in pooling (often we have Pq = Sq).

• The length of the new layer will be (Lq − Pq)/Sq + 1 and the breadth will be (Bq − Pq)/Sq + 1.

• Pooling drastically reduces the spatial dimensions of each activation map.
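
A minimal NumPy sketch of max-pooling on a single activation map (it would be applied independently to each of the dq maps; names are illustrative):

```python
import numpy as np

def max_pool(act_map, P, S):
    """Max-pool an L x B activation map with P x P regions at stride S.
    The output size is (L - P)/S + 1 by (B - P)/S + 1."""
    L, B = act_map.shape
    out = np.zeros(((L - P) // S + 1, (B - P) // S + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = act_map[i * S:i * S + P, j * S:j * S + P].max()
    return out

act = np.random.rand(7, 7)
print(max_pool(act, P=3, S=1).shape)   # (5, 5)
print(max_pool(act, P=3, S=2).shape)   # (3, 3)
```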

Pooling Example

[Figure: 3 × 3 max-pooling at stride 1 applied to an input activation map; each entry of the output is the maximum value over the corresponding 3 × 3 region of the input.]

ReLU

• Use of ReLU is a straightforward one-to-one operation.

• The number of feature maps and the spatial footprint size are retained.

• Often stuck at the end of a convolution operation and not

shown in architectural diagrams.

Fully Connected Layers: Stuck at the End

• Each feature in the final spatial layer is connected to each hidden state in the first fully connected layer.

• This layer functions in exactly the same way as a traditional feed-forward network.

• In most cases, one might use more than one fully connected layer to increase the power of the computations towards the end.

• The connections among these layers are exactly structured like a traditional feed-forward network.

• The vast majority of parameters lie in the fully connected layers.
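
A quick back-of-the-envelope check of the last point (the 7 × 7 × 512 final volume and the 4096-unit fully connected layer are hypothetical but typical sizes, not taken from the slide):

```python
# Parameters in one 3 x 3 convolution keeping depth 512 (excluding biases):
conv_params = 3 * 3 * 512 * 512    # 2,359,296

# Parameters connecting a 7 x 7 x 512 volume to a 4096-unit FC layer:
fc_params = 7 * 7 * 512 * 4096     # 102,760,448

print(conv_params, fc_params)      # the FC layer dominates by roughly 40x
```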

Interleaving Between Layers

• The convolution, pooling, and ReLU layers are typically interleaved in order to increase expressive power.

• The ReLU layers often follow the convolutional layers, just

as a nonlinear activation function typically follows the linear

dot product in traditional neural networks.

• After two or three sets of convolutional-ReLU combinations,

one might have a max-pooling layer.

• Examples (C = convolution, R = ReLU, P = max-pooling; see the sketch below):

CRCRP

CRCRCRP
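
As a concrete illustration, the CRCRP pattern could be written in PyTorch as follows (a sketch with arbitrary channel counts, not taken from the book):

```python
import torch
import torch.nn as nn

# C-R-C-R-P: two convolution + ReLU pairs followed by one max-pooling layer.
crcrp = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),   # C
    nn.ReLU(),                                    # R
    nn.Conv2d(32, 64, kernel_size=3, padding=1),  # C
    nn.ReLU(),                                    # R
    nn.MaxPool2d(kernel_size=2, stride=2),        # P
)

x = torch.randn(1, 3, 32, 32)
print(crcrp(x).shape)   # torch.Size([1, 64, 16, 16])
```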

Example: LeNet-5: Full Notation

[Figure: LeNet-5 in full notation. Input: a 32 × 32 grayscale feature map of pixels. C1: 5 × 5 convolutions → 28 × 28 × 6; S2: 2 × 2 subsampling → 14 × 14 × 6; C3: 5 × 5 convolutions → 10 × 10 × 16; S4: 2 × 2 subsampling → 5 × 5 × 16; C5: 120 units; F6: 84 units; O: 10 outputs. Convolution and subsampling operations alternate.]

• The earliest convolutional network
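
A rough PyTorch rendering of the layout above (a sketch only: the original LeNet-5 used sigmoid/tanh-style activations and subsampling layers, for which ReLU and max-pooling are substituted here for simplicity):

```python
import torch
import torch.nn as nn

lenet5 = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5),    # C1: 32x32x1 -> 28x28x6
    nn.ReLU(),
    nn.MaxPool2d(2),                   # S2: -> 14x14x6
    nn.Conv2d(6, 16, kernel_size=5),   # C3: -> 10x10x16
    nn.ReLU(),
    nn.MaxPool2d(2),                   # S4: -> 5x5x16
    nn.Flatten(),
    nn.Linear(5 * 5 * 16, 120),        # C5
    nn.ReLU(),
    nn.Linear(120, 84),                # F6
    nn.ReLU(),
    nn.Linear(84, 10),                 # O
)

print(lenet5(torch.randn(1, 1, 32, 32)).shape)   # torch.Size([1, 10])
```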

Example: LeNet-5: Shorthand Notation

[Figure: LeNet-5 in shorthand notation. Input: a 32 × 32 grayscale feature map; C1: 5 × 5 convolutions → 28 × 28 × 6; C3: 5 × 5 convolutions → 10 × 10 × 16; C5: 120; F6: 84; O: 10. Subsampling/max-pooling layers are shown implicitly as “SS” or “MP”.]

• Subsampling layers are not explicitly shown

Feature Engineering

[Figure: a horizontal edge detecting filter (rows 1 1 1 / 0 0 0 / −1 −1 −1) applied to an image detects horizontal edges, and a vertical edge detecting filter (rows 1 0 −1 / 1 0 −1 / 1 0 −1) detects vertical edges; a filter in the next layer (whose visualization is uninterpretable) combines them so that a rectangle is detected.]

• The early layers detect primitive features and later layers complex ones.

• During training, the filters will be learned to identify relevant shapes.

Hierarchical Feature Engineering

• Successive layers put together primitive features to create

more complex features.

• Complex features represent regularities in the data that are valuable as features.

• Mid-level features might be honeycombs.

• Higher-level features might be a part of a face.

• The network is a master of extracting repeating shapes in a data-driven manner.

Charu C. Aggarwal

IBM T J Watson Research Center

Yorktown Heights, NY

Backpropagation in Convolutional Neural

Networks and Its Visualization Applications

Neural Networks and Deep Learning, Springer, 2018

Chapter 8.3, 8.5

Backpropagation in Convolutional Neural Networks

• Three operations of convolutions, max-pooling, and ReLU.

• The ReLU backpropagation is the same as any other network.

– Passes gradient to a previous layer only if the original input

value was positive.

• The max-pooling passes the gradient only through the largest cell in each pooled region of the input volume.

• Main complexity is in backpropagation through convolutions.

Backpropagating through Convolutions

• Traditional backpropagation is a transposed matrix multiplication.

• Backpropagation through convolutions is a transposed convolution (i.e., with an inverted filter).

• The derivative of the loss with respect to each cell is backpropagated.

– Elementwise approach of computing which cell in the input contributes to which cell in the output.

– Multiplying with an inverted filter.

• Convert layer-wise derivatives to weight-wise derivatives and add over shared weights.
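
A small NumPy/SciPy check of the inverted-filter rule for a single channel (a sketch, not the book's code): the forward pass is a “valid” cross-correlation, and the gradient with respect to the input is the “full” convolution of the output gradient with the filter, which equals a cross-correlation with the spatially inverted filter.

```python
import numpy as np
from scipy.signal import correlate2d, convolve2d

X = np.random.rand(5, 5)               # single-channel input
W = np.random.rand(3, 3)               # filter
Y = correlate2d(X, W, mode='valid')    # forward pass: 3 x 3 output

dY = np.random.rand(*Y.shape)          # gradient of the loss w.r.t. the output

# Gradient w.r.t. the input: "full" convolution with W
# (convolve2d flips the filter, i.e., uses the inverted filter).
dX = convolve2d(dY, W, mode='full')

# Same thing as a cross-correlation with the explicitly inverted filter:
dX_check = correlate2d(np.pad(dY, 2), W[::-1, ::-1], mode='valid')
print(np.allclose(dX, dX_check))       # True

# Gradient w.r.t. the filter: correlate the input with the output gradient
# (this is where the addition over shared weights happens).
dW = correlate2d(X, dY, mode='valid')
```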

Backpropagation with an Inverted Filter [Single Channel]

[Figure: a 3 × 3 filter with entries a, b, c / d, e, f / g, h, i used during convolution becomes the spatially inverted filter i, h, g / f, e, d / c, b, a during backpropagation.]

• Multichannel case: We have 20 filters for 3 input channels

(RGB) ⇒ We have 20× 3 = 60 spatial slices.

• Each of these 60 spatial slices will be inverted and grouped

into 3 sets of filters with depth 20 (one each for RGB).

• Backpropagate with newly grouped filters.
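
This regrouping amounts to swapping the filter and channel axes and inverting each spatial slice. A NumPy sketch with the shapes described above (5 × 5 filters are assumed for concreteness):

```python
import numpy as np

# Forward filters: 20 filters, each of depth 3 (RGB) and spatial size 5 x 5.
W = np.random.rand(20, 3, 5, 5)

# Backpropagation filters: 3 filters (one per input channel), each of depth 20,
# obtained by swapping the filter/channel axes and inverting each spatial slice.
W_back = np.flip(W.transpose(1, 0, 2, 3), axis=(2, 3))

print(W_back.shape)   # (3, 20, 5, 5): 3 sets of filters with depth 20
```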

Convolution as a Matrix Multiplication

• Convolution can be presented as a matrix multiplication.

– Useful during forward and backward propagation.

– Backward propagation can be presented as transposed

matrix multiplication

Convolution as a Matrix Multiplication

[Figure: worked example.] The 3 × 3 input

1 3 4
9 6 7
5 0 2

is flattened to the 9-dimensional vector f = [1, 3, 4, 9, 6, 7, 5, 0, 2]^T. A 2 × 2 filter with entries a, b / c, d is converted to the 4 × 9 sparse matrix

C = [ a b 0 c d 0 0 0 0
      0 a b 0 c d 0 0 0
      0 0 0 a b 0 c d 0
      0 0 0 0 a b 0 c d ]

Multiplying C · f gives [a+3b+9c+6d, 3a+4b+6c+7d, 9a+6b+5c, 6a+7b+2d]^T, which is reshaped to the 2 × 2 spatial output

a+3b+9c+6d   3a+4b+6c+7d
9a+6b+5c     6a+7b+2d
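
The same construction in a few lines of NumPy (a sketch; a random 2 × 2 filter plays the role of [[a, b], [c, d]]):

```python
import numpy as np

X = np.array([[1., 3., 4.],
              [9., 6., 7.],
              [5., 0., 2.]])
W = np.random.rand(2, 2)            # plays the role of [[a, b], [c, d]]

n, F = X.shape[0], W.shape[0]
m = n - F + 1                       # output spatial size (2 here)

# Build the sparse matrix C: one row per output position.
C = np.zeros((m * m, n * n))
for i in range(m):
    for j in range(m):
        row = np.zeros((n, n))
        row[i:i + F, j:j + F] = W
        C[i * m + j] = row.ravel()

f = X.ravel()                        # flatten the input to a 9-dimensional vector
out = (C @ f).reshape(m, m)          # reshape the product back to 2 x 2

# Direct convolution for comparison.
direct = np.array([[np.sum(X[i:i + F, j:j + F] * W) for j in range(m)]
                   for i in range(m)])
print(np.allclose(out, direct))      # True
```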

Gradient-Based Visualization

• Can backpropagate all the way back to the input layer.

• Imagine the output o is the probability of a class label like

“dog”

• Compute ∂o/∂xi for each pixel xi of each color.

– Compute maximum absolute magnitude of gradient over

RGB colors and create grayscale image.
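
A hedged PyTorch sketch of this computation (the pretrained network, the input tensor, and the class index are placeholders; newer torchvision versions use a `weights=` argument instead of `pretrained=True`):

```python
import torch
from torchvision import models

model = models.resnet18(pretrained=True).eval()       # placeholder classifier
x = torch.randn(1, 3, 224, 224, requires_grad=True)   # stand-in for a real image
class_idx = 207                                        # hypothetical "dog" class index

o = model(x)[0, class_idx]     # output score for the chosen class
o.backward()                   # gives the gradient of o w.r.t. every input pixel

# Maximum absolute gradient magnitude over the RGB channels -> grayscale map.
saliency = x.grad.abs().max(dim=1)[0].squeeze()
print(saliency.shape)          # torch.Size([224, 224])
```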

Gradient-Based Visualization

• Examples of portions of specific images activated by particular class labels. (© 2014 Simonyan, Vedaldi, and Zisserman)

Getting Cleaner Visualizations

• The idea of “deconvnet” is sometimes used for cleaner visualizations.

• Main difference is in terms of how ReLU’s are treated.

– Normal backpropagation passes on gradient through ReLU

when input is positive.

– “Deconvnet” passes on gradient through ReLU when

backpropagated gradient is positive.

Guided Backpropagation

• A variation of backpropagation, referred to as guided backpropagation, is more useful for visualization.

– Guided backpropagation is a combination of gradient-based visualization and “deconvnet.”

– Set an entry to zero, if either of these rules sets an entry

to zero.
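
The three ReLU backward rules can be contrasted in a few lines of NumPy (a sketch; `x` holds the inputs of the ReLU in the forward pass and `g` the gradient arriving from the layer above):

```python
import numpy as np

x = np.array([[ 2., -1.,  3.],
              [-2.,  4.,  1.],
              [ 0.5, -3.,  2.]])      # ReLU inputs in the forward pass
g = np.random.randn(3, 3)             # gradient from the layer above

backprop  = g * (x > 0)               # traditional: pass where the input was positive
deconvnet = g * (g > 0)               # "deconvnet": pass where the gradient is positive
guided    = g * (x > 0) * (g > 0)     # guided: zero if either rule zeroes the entry
```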

Illustration of “Deconvnet” and Guided Backpropagation

[Figure: a 3 × 3 activation map (layer i) passes through a ReLU in the forward pass, zeroing its negative entries (layer i + 1). Given the “gradients” at layer i + 1, traditional backpropagation zeroes the entries at positions where the forward input was negative; the “deconvnet” instead applies a ReLU to the gradient itself, zeroing the negative gradient entries; guided backpropagation zeroes an entry if either of the two rules zeroes it.]

Derivatives of Features with Respect to Input Pixels

[Figure: visualizations of the derivatives of features with respect to input pixels, produced by “deconv” and by guided backpropagation, shown with the corresponding image crops. (© 2015 Springenberg, Dosovitskiy, Brox, Riedmiller)]

Creating a fantasy image that matches a label

• The value of o might be the unnormalized score for “banana.”

• We would like to learn the input image x that maximizes the

output o, while applying regularization to x:

Maximize over x:   J(x) = o − λ ||x||²

• Here, λ is the regularization parameter.

• Update x while keeping weights fixed!
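
A minimal PyTorch sketch of this optimization (the network, class index, learning rate, and λ are placeholder choices; newer torchvision versions use a `weights=` argument): the image x is updated by gradient ascent on J(x) while the network weights stay fixed.

```python
import torch
from torchvision import models

model = models.resnet18(pretrained=True).eval()   # weights are kept fixed
for p in model.parameters():
    p.requires_grad_(False)

class_idx = 954        # hypothetical "banana" class index
lam = 0.01             # regularization parameter lambda
x = torch.zeros(1, 3, 224, 224, requires_grad=True)
opt = torch.optim.SGD([x], lr=1.0)    # optimize the image, not the weights

for _ in range(200):
    opt.zero_grad()
    o = model(x)[0, class_idx]         # unnormalized score for the class
    loss = -(o - lam * x.norm() ** 2)  # maximize J(x) = o - lambda * ||x||^2
    loss.backward()
    opt.step()
```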

Examples

[Figure: fantasy images generated for the labels “cup,” “dalmatian,” and “goose.”]

Generalization to Autoencoders

• Ideas have been generalized to convolutional autoencoders.

• Deconvolution operation similar to backpropagation.

• One can combine pretraining with backpropagation.

Charu C. Aggarwal

IBM T J Watson Research Center

Yorktown Heights, NY

Case Studies of Convolutional Neural

Networks

Neural Networks and Deep Learning, Springer, 2018

Chapter 8.4

AlexNet

• Winner of the 2012 ILSVRC contest

– Brought attention to the area of deep learning

• First use of ReLU

• Use Dropout with probability 0.5

• Use of 3× 3 pools at stride 2

• Heavy data augmentation

AlexNet Architecture

[Figure: AlexNet architecture. INPUT: 224 × 224 × 3. C1: 11 × 11 convolutions → 55 × 55 × 96, followed by MP. C2: 5 × 5 convolutions → 27 × 27 × 256, followed by MP. C3: 3 × 3 convolutions → 13 × 13 × 384. C4: 3 × 3 convolutions → 13 × 13 × 384. C5: 3 × 3 convolutions → 13 × 13 × 256, followed by MP. FC6: 4096. FC7: 4096. FC8: 1000.]

Other Comments on AlexNet

• Popularized the notion of FC7 features

• The features in the penultimate layer are often extracted and

used for various applications

• Pretrained versions of AlexNet are available in most deep

learning frameworks.

• Single network had top-5 error rate of 18.2%

• An ensemble of seven CNNs had a top-5 error rate of 15.4%

ZfNet

             AlexNet                    ZFNet
Volume:      224 × 224 × 3              224 × 224 × 3
Operations:  Conv 11 × 11 (stride 4)    Conv 7 × 7 (stride 2), MP
Volume:      55 × 55 × 96               55 × 55 × 96
Operations:  Conv 5 × 5, MP             Conv 5 × 5 (stride 2), MP
Volume:      27 × 27 × 256              13 × 13 × 256
Operations:  Conv 3 × 3, MP             Conv 3 × 3
Volume:      13 × 13 × 384              13 × 13 × 512
Operations:  Conv 3 × 3                 Conv 3 × 3
Volume:      13 × 13 × 384              13 × 13 × 1024
Operations:  Conv 3 × 3                 Conv 3 × 3
Volume:      13 × 13 × 256              13 × 13 × 512
Operations:  MP, Fully connect          MP, Fully connect
FC6:         4096                       4096
Operations:  Fully connect              Fully connect
FC7:         4096                       4096
Operations:  Fully connect              Fully connect
FC8:         1000                       1000
Operations:  Softmax                    Softmax

• An AlexNet variation ⇒ Its refinement (Clarifai) won ILSVRC 2013 (top-5 error 14.8% / 11.1%)

VGG

• One of the top entries in 2014 (but not winner).

• Notable for its design principle of reduced filter size and increased depth

• All filters had spatial footprint of 3× 3 and padding of 1

• Maxpooling was done using 2× 2 regions at stride 2

• Experimented with a variety of configurations between 11

and 19 layers

Principle of Reduced Filter Size

• A single 7×7 filter will have 49 parameters over one channel.

• Three 3×3 filters will have a receptive field of size 7×7, but

will have only 27 parameters.

• Regularization advantage: A single filter will capture only

primitive features but three successive filters will capture

more complex features.
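
The same parameter count for a depth-C input and output (a quick check; C = 64 is an arbitrary example):

```python
C = 64                                   # example channel count
single_7x7 = 7 * 7 * C * C               # one 7x7 layer: 49 * C^2 parameters
three_3x3 = 3 * (3 * 3 * C * C)          # three 3x3 layers: 27 * C^2 parameters
print(single_7x7, three_3x3)             # 200704 110592
```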

VGG Configurations

Name:      A     A-LRN     B     C     D     E
# Layers:  11    11        13    16    16    19

(C3D64 denotes a 3 × 3 convolution with depth 64, C1D256 a 1 × 1 convolution with depth 256, M max-pooling, FC a fully connected layer, and S the softmax output.)

A:      C3D64, M, C3D128, M, C3D256, C3D256, M, C3D512, C3D512, M,
        C3D512, C3D512, M, FC4096, FC4096, FC1000, S
A-LRN:  Same as A, with an LRN layer after the first C3D64.
B:      C3D64, C3D64, M, C3D128, C3D128, M, C3D256, C3D256, M,
        C3D512, C3D512, M, C3D512, C3D512, M, FC4096, FC4096, FC1000, S
C:      Same as B, with an extra C1D256, C1D512, and C1D512 at the end of the
        third, fourth, and fifth convolutional blocks, respectively.
D:      Same as C, with the 1 × 1 convolutions replaced by C3D256, C3D512, and C3D512.
E:      Same as D, with one more C3D256, C3D512, and C3D512 in the third,
        fourth, and fifth convolutional blocks, respectively.

VGG Design Choices and Performance

• Max-pooling had the responsibility of reducing the spatial footprint.

• The number of filters was often increased by a factor of 2 after each max-pooling

• VGG had top-5 error rate of 7.3%

• Column D was the best architecture [previous slide]

GoogLeNet

• Introduced the principle of inception architecture.

• The initial part of the architecture is much like a traditional

convolutional network, and is referred to as the stem.

• The key part of the network is an intermediate layer, referred

to as an inception module.

• Allows us to capture images at varying levels of detail in

different portions of the network.

Inception Module: Motivation

• Key information in the images is available at different levels

of detail.

– A large filter can capture information in a bigger area containing limited variation.

– A small filter can capture detailed information in a smaller area.

• Piping together many small filters is wasteful ⇒ Why not let

the neural network decide?

Basic Inception Module

[Figure: basic inception module. The previous layer feeds parallel 1 × 1, 3 × 3, and 5 × 5 convolutions and a 3 × 3 max-pooling branch; their outputs are combined by filter concatenation.]

• Main problem is computational efficiency ⇒ First reduce

depth

Computationally Efficient Inception Module

[Figure: computationally efficient inception module. 1 × 1 convolutions are applied to the previous layer before the 3 × 3 and 5 × 5 convolutions, and after the 3 × 3 max-pooling, in order to reduce depth; the branch outputs are combined by filter concatenation.]

• Note the 1× 1 filters
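
A compact PyTorch sketch of such a module (the channel counts are illustrative): the 1 × 1 convolutions reduce depth before the more expensive 3 × 3 and 5 × 5 branches, and the branch outputs are concatenated along the depth dimension.

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 64, kernel_size=1)                 # 1x1 branch
        self.b2 = nn.Sequential(nn.Conv2d(in_ch, 96, kernel_size=1),  # reduce depth
                                nn.Conv2d(96, 128, kernel_size=3, padding=1))
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, 16, kernel_size=1),  # reduce depth
                                nn.Conv2d(16, 32, kernel_size=5, padding=2))
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, 32, kernel_size=1))  # 1x1 after pool

    def forward(self, x):
        # Filter concatenation along the depth dimension.
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

x = torch.randn(1, 192, 28, 28)
print(InceptionModule(192)(x).shape)   # torch.Size([1, 256, 28, 28])
```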

Design Principles of Output Layer

• It is common to use fully connected layers near the output.

• GoogLeNet uses average pooling across the whole spatial

area of the final set of activation maps to create a single

value.

• The number of features created in the final layer will be

exactly equal to the number of filters.

– Helps in reducing parameter footprint

• Detailed connectivity is not required for applications in which

only a class label needs to be predicted.

GoogLeNet Architecture

[Figure: the overall GoogLeNet architecture: a stem of conventional convolutional layers followed by a sequence of inception modules, ending with average pooling and the output layer.]

Other Details on GoogLeNet

• Winner of ILSVRC 2014

• Reached top-5 error rate of 6.7%

• Contained 22 layers

• Several advanced variants with improved accuracy.

• Some have been combined with ResNet

ResNet Motivation

• Increasing depth has advantages but also makes the network

harder to train.

• Even the error on the training data is high!

– Poor convergence

• Selecting an architecture with unimpeded gradient flows is

helpful!

ResNet

• ResNet brought the depth of neural networks into the hundreds.

• Based on the principle of iterative feature engineering rather

than hierarchical feature engineering.

• Principle of partially copying features across layers.

– Where have we heard this before?

• Human-level performance with top-5 error rate of 3.6%.

Skip Connections in ResNet

[Figure: (a) Residual module: the input x passes through two weight layers with a ReLU in between to produce F(x); the identity x is added through a skip connection to give F(x) + x, followed by a ReLU. (b) Partial architecture: 7 × 7 conv, 64, /2; pool, /2; a stack of 3 × 3 conv, 64 layers; then 3 × 3 conv, 128 layers, where the first 128-filter convolution uses stride 2.]

(a) Residual module (b) Partial architecture

Details of ResNet

• A 3 × 3 filter is used at a stride/padding of 1 ⇒ Adopted

from VGG and maintains dimensionality.

• Some layers use strided convolutions to reduce each spatial

dimension by a factor of 2.

– A linear projection matrix reduces the dimensionality.

– The projection matrix defines a set of 1 × 1 convolution

operations with stride of 2.
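
A PyTorch sketch of a basic residual block along these lines (an illustration rather than the exact ResNet code; batch normalization is omitted): when the block halves the spatial size with a stride-2 convolution, a 1 × 1 projection with stride 2 is applied on the skip connection so that the two paths can be added.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1)
        self.relu = nn.ReLU()
        # Projection shortcut: 1x1 convolutions at stride 2 when dimensions change.
        self.project = (nn.Conv2d(in_ch, out_ch, 1, stride=stride)
                        if stride != 1 or in_ch != out_ch else nn.Identity())

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))   # F(x)
        return self.relu(out + self.project(x))      # F(x) + x

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64, 128, stride=2)(x).shape)   # torch.Size([1, 128, 28, 28])
```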

Importance of Depth

Name              Year          Number of Layers    Top-5 Error
-                 Before 2012   ≤ 5                 > 25%
AlexNet           2012          8                   15.4%
ZfNet/Clarifai    2013          8 / > 8             14.8% / 11.1%
VGG               2014          19                  7.3%
GoogLeNet         2014          22                  6.7%
ResNet            2015          152                 3.6%

Pretrained Models

• Many of the models discussed in previous slides are available

as pre-trained models with ImageNet data.

• One can use the features from fully connected layers for

nearest neighbor search.

– A nearest neighbor classifier on raw pixels vs features from

fully connected layers will show huge differences.

• One can train the output layer for other applications like regression while retaining weights in other layers.
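
A hedged sketch of this workflow with torchvision (a pretrained AlexNet is used as an example; newer torchvision versions replace `pretrained=True` with a `weights=` argument):

```python
import torch
from torchvision import models

alexnet = models.alexnet(pretrained=True).eval()

# Features from the penultimate fully connected layer ("FC7 features"):
# drop the final classification layer and keep the rest of the classifier.
alexnet.classifier = torch.nn.Sequential(*list(alexnet.classifier.children())[:-1])

x = torch.randn(1, 3, 224, 224)        # stand-in for a preprocessed image
features = alexnet(x)
print(features.shape)                  # torch.Size([1, 4096])

# These 4096-dimensional features can be used for nearest neighbor search,
# or a new output layer can be trained on them for another task.
```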

Object Localization

• Need to classify the image together with a bounding box (four numbers)

Object Localization

[Figure: object localization network. Convolution layers (with weights fixed for both classification and regression) feed two heads: a fully connected classification head ending in a softmax, trained for classification and outputting class probabilities, and a fully connected regression head ending in a linear layer, trained for regression and outputting the bounding box (four numbers).]

• The fully connected layers for the classification and regression heads are trained separately.

Other Applications

• Object detection: Multiple objects

• Text and sequence processing applications

– 1-dimensional convolutions

• Video and spatiotemporal data

– 3-dimensional convolutions