
Page 1

Convolutional Neural Networks
Lecture 7: Deep Learning Programming

Dr. Hanhe Lin, Universität Konstanz, 04.06.2018

Page 2

1 Introduction

Challenges of object recognition

- Occlusion: real scenes are cluttered with other objects; it is hard to tell which parts belong to which object, and part of an object may be hidden behind other objects
- Lighting: the pixel intensities are determined by the lighting on the objects
- Deformation: objects can deform in a variety of non-affine ways, for example, a handwritten number 7
- Affordances: object classes are often defined by how they are used, e.g., chairs are designed for sitting, so they have a variety of shapes
- Viewpoint: changes in viewpoint cause changes in the images that standard learning methods cannot cope with; in other words, information hops between input dimensions

Page 3

1 Introduction

Ways to achieve viewpoint invariance

- Use redundant invariant features
- Put a box around the object and use normalized pixels
- Use a hierarchy of parts that have explicit poses relative to the camera
- Use replicated features with pooling, i.e., convolutional neural networks
- ...

Page 4

1 Introduction

The invariant features approach

- “Local” approach
- Extract a large, redundant set of features that are invariant under transformations, e.g., SIFT
- But for recognition, we must avoid forming the same features from parts of different objects
- With enough invariant features, we can represent an object with a Bag-of-Words (BoW) model

Page 5

1 Introduction

The judicious normalization approach

- “Global” approach
- Put a box around the object and use it as a coordinate frame for a set of normalized pixels
- This solves the dimension-hopping problem: if we choose the box correctly, the same part of an object always occurs on the same normalized pixels
- The box can provide invariance to many degrees of freedom, e.g., translation, rotation, scale, shear, etc.
- But choosing the box is difficult because of:
  - segmentation errors, occlusion, and unusual orientations
  - we need to recognize the shape to get the box right

Page 6

1 Introduction

The brute force normalization approach

- When training the recognizer, use well-segmented, upright images to fit the correct box
- At test time, try all possible boxes in a range of positions and scales
- This approach is widely used for detecting upright things like faces and house numbers in unsegmented images
- Although it provides a certain degree of freedom, you must train multiple recognizers for the same class, for example, a frontal face and a profile face

Page 7

1 Introduction

Motivation of CNNs

- Use many different copies of the same feature detector with different positions
  - Could also replicate across scale and orientation, but this is tricky and expensive
  - Replication greatly reduces the number of free parameters to be learned
- Use several different types of feature detectors, each with its own feature map
  - This allows each patch of the image to be represented in several ways

Page 8

1 Introduction

Definition of CNNs

- Convolutional neural networks (CNNs) are also known as convolutional networks or ConvNets
- A CNN is a specialized kind of neural network for processing data that has a known grid-like topology
- Examples:
  - 1-D grid: time-series data
  - 2-D grid: gray-scale image
  - 3-D grid: color image or gray-scale video
  - 4-D grid: color video
- CNNs are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers

Page 9

1 Introduction

Convolution vs feedforward

- Fully-connected networks have many parameters and are prone to overfitting (see the parameter count below)
- A CNN arranges its neurons in 3 dimensions: width, height, and depth
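To make the overfitting point concrete, here is a quick back-of-the-envelope count (the layer sizes are chosen here for illustration, not taken from the lecture) comparing a fully-connected layer with a convolutional layer on a small color image:

```python
# Parameter-count comparison for an assumed 32x32x3 input (illustrative numbers).
height, width, depth = 32, 32, 3
n_inputs = height * width * depth            # 3072 input values

# Fully-connected layer with 100 hidden units: one weight per (input, unit) pair.
fc_units = 100
fc_params = n_inputs * fc_units + fc_units   # weights + biases = 307,300

# Conv layer with 100 kernels of size 5x5 spanning the input depth.
kernels, kernel_size = 100, 5
conv_params = kernels * (kernel_size * kernel_size * depth) + kernels  # 7,600

print(f"fully connected: {fc_params:,} parameters")
print(f"convolutional:   {conv_params:,} parameters")
```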

Page 10

2 Convolutional layer

Overview: a simple CNN

Page 11

2 Convolutional layer

The convolution operation

- The convolution operation takes two time-series sequences x[n] and h[n] and produces a third sequence y[n] (see the sketch after this list), defined as:

  $y[n] = x[n] * h[n] = \sum_{k=-\infty}^{\infty} x[k]\, h[n-k], \quad \forall n$

- In digital signal processing, x[n], h[n], and y[n] are referred to as the input, filter, and output, respectively
- In CNN terminology, x[n], h[n], and y[n] are referred to as the input, kernel, and feature map
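As a concrete illustration of the sum above, a minimal NumPy sketch (not the lecture's code) that evaluates it for finite sequences and checks the result against np.convolve:

```python
import numpy as np

def conv1d(x, h):
    """Discrete convolution: y[n] = sum_k x[k] * h[n - k] (the kernel is flipped)."""
    y = np.zeros(len(x) + len(h) - 1)
    for n in range(len(y)):
        for k in range(len(x)):
            if 0 <= n - k < len(h):
                y[n] += x[k] * h[n - k]
    return y

x = np.array([1., 2., 3., 4.])   # input signal
h = np.array([1., 0., -1.])      # filter / kernel
print(conv1d(x, h))              # [ 1.  2.  2.  2. -3. -4.]
print(np.convolve(x, h))         # same result from NumPy's built-in
```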

Page 12

2 Convolutional layer

The convolution operation in CNNs

- If we use a two-dimensional image I as our input, we probably also want to use a two-dimensional kernel K:

  $S(i, j) = (I * K)(i, j) = \sum_m \sum_n I(i+m, j+n)\, K(m, n)$

- When the image is three-dimensional, the kernel should also be three-dimensional
- What is the difference between these two convolution operations?

Page 13

2 Convolutional layer

The convolution operation in CNNs

- If we use a two-dimensional image I as our input, we probably also want to use a two-dimensional kernel K:

  $S(i, j) = (I * K)(i, j) = \sum_m \sum_n I(i+m, j+n)\, K(m, n)$

- When the image is three-dimensional, the kernel should also be three-dimensional
- What is the difference between these two convolution operations?
  - The kernel is not flipped!
  - Many deep learning libraries implement cross-correlation but call it convolution

Page 14

2 Convolutional layer

Hyperparameters in conv layer

- depth: corresponds to the number of kernels we would like to use, each learning to look for something different in the input
- kernel size: a large kernel size increases computation
- stride
- zero-padding
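In most libraries these hyperparameters map directly onto the layer constructor. A hedged tf.keras example (the lecture does not prescribe a framework, and the sizes here are made up):

```python
import tensorflow as tf

# 32 kernels (depth), kernel size 3x3, stride 1; 'same' means enough zero-padding
# to keep the spatial size unchanged ('valid' means no padding).
conv = tf.keras.layers.Conv2D(filters=32, kernel_size=3, strides=1, padding="same")

x = tf.random.normal((1, 28, 28, 1))   # batch of one 28x28 gray-scale image
print(conv(x).shape)                   # (1, 28, 28, 32): the depth becomes 32
```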

Page 15

2 Convolutional layer

Stride

- Stride is the number of pixels by which the kernel is slid; a large stride produces a spatially smaller output

Figure: Stride 1 with a kernel size of 3
Figure: Stride 2 with a kernel size of 3

Page 16

2 Convolutional layer

Zero-padding

- Pad the input volume with zeros around the border so that the input and output width and height are the same

Figure: Stride 1 with zero-padding of 0
Figure: Stride 1 with zero-padding of 1
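Zero-padding itself is a one-line operation; a tiny NumPy illustration (an assumed example, not from the slides):

```python
import numpy as np

x = np.ones((3, 3))
# Pad one ring of zeros around the border; with a 3x3 kernel and stride 1
# this keeps the output the same 3x3 size as the input.
x_padded = np.pad(x, pad_width=1, mode="constant", constant_values=0)
print(x_padded.shape)  # (5, 5)
```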

Page 17

2 Convolutional layer

Conv layer input and output

- Accepts a volume of size W1 × H1 × D1
- Requires four hyperparameters:
  - number of kernels K
  - kernel size F
  - stride S
  - the amount of zero-padding P
- Produces a volume of size W2 × H2 × D2, where:
  - W2 = (W1 − F + 2P)/S + 1
  - H2 = (H1 − F + 2P)/S + 1
  - D2 = K
- With parameter sharing, it introduces F · F · D1 weights per kernel, for a total of (F · F · D1) · K weights and K biases
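The formulas above are easy to wrap in a small helper; the example configuration below is made up for illustration:

```python
def conv_output(W1, H1, D1, K, F, S, P):
    """Output volume and parameter count of a conv layer (formulas from this slide)."""
    W2 = (W1 - F + 2 * P) // S + 1
    H2 = (H1 - F + 2 * P) // S + 1
    D2 = K
    n_params = (F * F * D1) * K + K  # shared weights plus one bias per kernel
    return (W2, H2, D2), n_params

# Example: 32x32x3 input, 10 kernels of size 5, stride 1, padding 2 (keeps the size).
print(conv_output(32, 32, 3, K=10, F=5, S=1, P=2))  # ((32, 32, 10), 760)
```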

Page 18

2 Convolutional layer

What does convolution achieve?

- The convolution operation leverages three important ideas that can help improve a machine learning system:
  - sparse interactions
  - parameter sharing
  - equivariant representations
- Convolution provides a means for working with inputs of variable size

Page 19

2 Convolutional layer

Sparse interactions

- Also referred to as sparse connectivity or sparse weights
- Traditional neural network layers use matrix multiplication by a matrix of parameters, with a separate parameter describing the interaction between each input unit and each output unit
- Making the kernel smaller than the input ⇒ fewer parameters ⇒ lower memory requirements

Page 20

2 Convolutional layer

Sparse interactions (cont.)

- In a deep convolutional network, units in the deeper layers may indirectly interact with a larger portion of the input ⇒ this allows complicated interactions between many variables

Page 21

2 Convolutional layer

Parameter sharing

- Refers to using the same parameter for more than one function in a model
- In a convolutional neural network, each member of the kernel is used at every position of the input
- Less computation and fewer parameters to learn

Page 22

2 Convolutional layer

Equivariant representations

- To say a function is equivariant means that if the input changes, the output changes in the same way
- In image recognition, convolution creates a 2-D map of where certain features appear in the input. If we move the object in the input, its representation will move the same amount in the output
- Invariant knowledge: if a feature is useful in some locations during training, kernels for that feature will be available in all locations during testing
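The equivariance claim can be checked numerically; the sketch below (illustrative only, reusing a hand-rolled cross-correlation) shifts the input, convolves, and compares the result with shifting the feature map:

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """'Valid' cross-correlation as before (kernel not flipped)."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(6, 6))
kernel = rng.normal(size=(3, 3))

shifted = np.roll(image, shift=1, axis=1)        # translate the input one pixel right
a = cross_correlate2d(shifted, kernel)           # convolve the shifted input
b = np.roll(cross_correlate2d(image, kernel), shift=1, axis=1)  # shift the feature map
print(np.allclose(a[:, 1:], b[:, 1:]))           # True (excluding the wrapped column)
```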

Page 23

2 Convolutional layer

Backpropagation

- It is easy to modify the backpropagation algorithm to incorporate linear constraints between the weights
- We compute the gradients as usual, and then modify the gradients so that they satisfy the constraints:
  - To constrain: $w_1 = w_2$
  - We need: $\Delta w_1 = \Delta w_2$
  - Compute $\partial J / \partial w_1$ and $\partial J / \partial w_2$
  - Use $\partial J / \partial w_1 + \partial J / \partial w_2$ for both $w_1$ and $w_2$
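A toy numerical sketch of this tied-weight update (the loss function here is invented purely for illustration):

```python
# Toy loss J(w1, w2) = (3*w1 + 5*w2 - 1)**2 with the constraint w1 = w2.
w = 0.2                      # the single shared value behind w1 and w2
lr = 0.01

for _ in range(100):
    w1 = w2 = w
    residual = 3 * w1 + 5 * w2 - 1
    dJ_dw1 = 2 * residual * 3          # gradient as if w1 were free
    dJ_dw2 = 2 * residual * 5          # gradient as if w2 were free
    w -= lr * (dJ_dw1 + dJ_dw2)        # use the summed gradient for the shared weight

print(w, 3 * w + 5 * w)                # w converges so that 8*w is close to 1
```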

Page 24

2 Convolutional layer

Pool layer

- A pooling function replaces the output of the net at a certain location with a summary statistic of the nearby outputs
- Pooling functions include the average, a weighted average, the L2 norm, the maximum, ...
- In all cases, pooling helps to make the representation approximately invariant to small translations of the input
- “Invariance to local translation can be a useful property if we care more about whether some feature is present than exactly where it is.”

Page 25

2 Convolutional layer

Example of max pooling

Page 26

2 Convolutional layer

Max pooling

- The max pooling operation, which reports the maximum output within a rectangular neighborhood (usually 2x2; see the sketch after this list), is preferred because it works slightly better
- Pros:
  - it reduces the number of inputs to the next layer of feature extraction, thus allowing us to have many more different feature maps
  - it is invariant to small translations of the input
- Cons: after several levels of pooling, we have lost information about the precise positions of things
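A minimal NumPy sketch of 2x2 max pooling with stride 2 (illustrative only); the output spatial size follows the (W1 − F)/S + 1 formula on the next slide:

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    """Max pooling over strided windows of a 2-D feature map."""
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i * stride:i * stride + size,
                          j * stride:j * stride + size].max()
    return out

x = np.array([[1., 1., 2., 4.],
              [5., 6., 7., 8.],
              [3., 2., 1., 0.],
              [1., 2., 3., 4.]])
print(max_pool2d(x))   # [[6. 8.]
                       #  [3. 4.]]
```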

Page 27

2 Convolutional layer

Max pooling layer input and output

- Accepts a volume of size W1 × H1 × D1
- Requires two hyperparameters: the spatial extent F and the stride S
- Produces a volume of size W2 × H2 × D2, where:
  - W2 = (W1 − F)/S + 1
  - H2 = (H1 − F)/S + 1
  - D2 = D1
- Introduces zero parameters since it computes a fixed function of the input

Page 28

2 Convolutional layer

Transformation invariance

- If we pool over the outputs of separately parametrized convolutions, the features can learn which transformations to become invariant to

Page 29

2 Convolutional layer

A simple CNN revisited

- There are a few distinct types of layers
- Each layer accepts an input 3D volume and transforms it to an output 3D volume through a differentiable function (see the sketch below)
- Each layer may or may not have parameters
- Each layer may or may not have additional hyperparameters
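A hedged tf.keras sketch of such a layer stack (the lecture does not fix a framework or an architecture; the layer sizes are illustrative), showing how each layer maps one 3D volume to the next:

```python
import tensorflow as tf

# INPUT -> CONV -> ReLU -> POOL -> flatten -> fully-connected class scores
model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),                           # 32x32x3 volume
    tf.keras.layers.Conv2D(16, kernel_size=3, padding="same",
                           activation="relu"),                   # has parameters -> 32x32x16
    tf.keras.layers.MaxPool2D(pool_size=2),                      # no parameters -> 16x16x16
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10),                                   # class scores
])
model.summary()   # lists each layer's output volume and parameter count
```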

Page 30

3 Summary

Two different terminologies for CNNs

Page 31

3 Summary

Summary

- A CNN is a specialized kind of neural network for processing data with a grid-like topology
- A typical convolutional layer includes convolution, ReLU, and max pooling operations
- The convolution operation has three important properties: sparse interactions, parameter sharing, and equivariant representations
- Max pooling reduces the number of inputs to the next layer and is invariant to small translations of the input
