
Page 1

Convolutional Neural Networks
Lecture 7: Deep Learning Programming

Dr. Hanhe Lin, Universität Konstanz, 04.06.2018

Page 2

1 Introduction

Challenges of object recognition

- Occlusion: real scenes are cluttered with other objects; it is hard to tell which parts belong to which object, and part of an object may be hidden behind other objects
- Lighting: the pixel intensities are determined by the lighting on the objects
- Deformation: objects can deform in a variety of non-affine ways, for example, a handwritten number 7
- Affordances: object classes are often defined by how they are used, e.g., chairs are designed for sitting, so they have a variety of shapes
- Viewpoint: changes in viewpoint cause changes in the images that standard learning methods cannot cope with; in other words, information hops between input dimensions

Page 3

1 Introduction

Ways to achieve viewpoint invariance

- Use redundant invariant features
- Put a box around the object and use normalized pixels
- Use a hierarchy of parts that have explicit poses relative to the camera
- Use replicated features with pooling, i.e., convolutional neural networks
- ...

Page 4

1 Introduction

The invariant features approach

- “Local” approach
- Extract a large, redundant set of features that are invariant under transformations, e.g., SIFT
- But for recognition, we must avoid forming the same features from parts of different objects
- With enough invariant features, we can represent an object with a Bag-of-Words (BoW) model

Page 5

1 Introduction

The judicious normalization approach

- “Global” approach
- Put a box around the object and use it as a coordinate frame for a set of normalized pixels
- This solves the dimension-hopping problem: if we choose the box correctly, the same part of an object always occurs on the same normalized pixels
- The box can provide invariance to many degrees of freedom, e.g., translation, rotation, scale, shear, etc.
- But choosing the box is difficult because of:
  - segmentation errors, occlusion, and unusual orientations
  - we need to recognize the shape to get the box right

Page 6

1 Introduction

The brute force normalization approach

- When training the recognizer, use well-segmented, upright images to fit the correct box
- At test time, try all possible boxes in a range of positions and scales
- This approach is widely used for detecting upright things like faces and house numbers in unsegmented images
- Although it provides a certain degree of freedom, you must train multiple recognizers for the same class, for example, a frontal face and a profile face

Page 7

1 Introduction

Motivation of CNNs

- Use many different copies of the same feature detector with different positions
  - Could also replicate across scale and orientation, but this is tricky and expensive
  - Replication greatly reduces the number of free parameters to be learned
- Use several different types of feature detectors, each with its own feature map
  - This allows each patch of the image to be represented in several ways

Page 8

1 Introduction

Definition of CNNs

- Convolutional neural networks (CNNs) are also known as convolutional networks or ConvNets
- A CNN is a specialized kind of neural network for processing data that has a known grid-like topology
- Examples:
  - 1-D grid: time-series data
  - 2-D grid: gray-scale image
  - 3-D grid: color image or gray-scale video
  - 4-D grid: color video
- CNNs are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers

Page 9

1 Introduction

Convolution vs feedforward

- Fully-connected networks have many parameters and are prone to overfitting (see the parameter count below)
- A CNN arranges its neurons in 3 dimensions: width, height, and depth
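To make the overfitting point concrete, here is a quick back-of-the-envelope count (the layer sizes are chosen here for illustration, not taken from the lecture) comparing a fully-connected layer with a convolutional layer on a small color image:

```python
# Parameter-count comparison for an assumed 32x32x3 input (illustrative numbers).
height, width, depth = 32, 32, 3
n_inputs = height * width * depth            # 3072 input values

# Fully-connected layer with 100 hidden units: one weight per (input, unit) pair.
fc_units = 100
fc_params = n_inputs * fc_units + fc_units   # weights + biases = 307,300

# Conv layer with 100 kernels of size 5x5 spanning the input depth.
kernels, kernel_size = 100, 5
conv_params = kernels * (kernel_size * kernel_size * depth) + kernels  # 7,600

print(f"fully connected: {fc_params:,} parameters")
print(f"convolutional:   {conv_params:,} parameters")
```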

Page 10

2 Convolutional layer

Overview: a simple CNN

Page 11

2 Convolutional layer

The convolution operation

- The convolution operation takes two time-series sequences x[n] and h[n] and produces a third sequence y[n] (see the sketch after this list), defined as:

  $y[n] = x[n] * h[n] = \sum_{k=-\infty}^{\infty} x[k]\, h[n-k], \quad \forall n$

- In digital signal processing, x[n], h[n], and y[n] are referred to as the input, filter, and output, respectively
- In CNN terminology, x[n], h[n], and y[n] are referred to as the input, kernel, and feature map
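As a concrete illustration of the sum above, a minimal NumPy sketch (not the lecture's code) that evaluates it for finite sequences and checks the result against np.convolve:

```python
import numpy as np

def conv1d(x, h):
    """Discrete convolution: y[n] = sum_k x[k] * h[n - k] (the kernel is flipped)."""
    y = np.zeros(len(x) + len(h) - 1)
    for n in range(len(y)):
        for k in range(len(x)):
            if 0 <= n - k < len(h):
                y[n] += x[k] * h[n - k]
    return y

x = np.array([1., 2., 3., 4.])   # input signal
h = np.array([1., 0., -1.])      # filter / kernel
print(conv1d(x, h))              # [ 1.  2.  2.  2. -3. -4.]
print(np.convolve(x, h))         # same result from NumPy's built-in
```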

Page 12

2 Convolutional layer

The convolution operation in CNNs

- If we use a two-dimensional image I as our input, we probably also want to use a two-dimensional kernel K:

  $S(i, j) = (I * K)(i, j) = \sum_m \sum_n I(i+m, j+n)\, K(m, n)$

- When the image is three-dimensional, the kernel should also be three-dimensional
- What is the difference between these two convolution operations?

Page 13

2 Convolutional layer

The convolution operation in CNNs

- If we use a two-dimensional image I as our input, we probably also want to use a two-dimensional kernel K:

  $S(i, j) = (I * K)(i, j) = \sum_m \sum_n I(i+m, j+n)\, K(m, n)$

- When the image is three-dimensional, the kernel should also be three-dimensional
- What is the difference between these two convolution operations?
  - The kernel is not flipped!
  - Many deep learning libraries implement cross-correlation but call it convolution

Page 14

2 Convolutional layer

Hyperparameters in conv layer

- depth: corresponds to the number of kernels we would like to use, each learning to look for something different in the input
- kernel size: a large kernel size increases computation
- stride
- zero-padding
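In most libraries these hyperparameters map directly onto the layer constructor. A hedged tf.keras example (the lecture does not prescribe a framework, and the sizes here are made up):

```python
import tensorflow as tf

# 32 kernels (depth), kernel size 3x3, stride 1; 'same' means enough zero-padding
# to keep the spatial size unchanged ('valid' means no padding).
conv = tf.keras.layers.Conv2D(filters=32, kernel_size=3, strides=1, padding="same")

x = tf.random.normal((1, 28, 28, 1))   # batch of one 28x28 gray-scale image
print(conv(x).shape)                   # (1, 28, 28, 32): the depth becomes 32
```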

Page 15

2 Convolutional layer

Stride

- Stride is the number of pixels by which the kernel is slid; a large stride produces a spatially smaller output

Figure: Stride 1 with a kernel size of 3
Figure: Stride 2 with a kernel size of 3

Page 16

2 Convolutional layer

Zero-padding

- Pad the input volume with zeros around the border so that the input and output width and height are the same

Figure: Stride 1 with zero-padding of 0
Figure: Stride 1 with zero-padding of 1
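Zero-padding itself is a one-line operation; a tiny NumPy illustration (an assumed example, not from the slides):

```python
import numpy as np

x = np.ones((3, 3))
# Pad one ring of zeros around the border; with a 3x3 kernel and stride 1
# this keeps the output the same 3x3 size as the input.
x_padded = np.pad(x, pad_width=1, mode="constant", constant_values=0)
print(x_padded.shape)  # (5, 5)
```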

Page 17

2 Convolutional layer

Conv layer input and output

- Accepts a volume of size W1 × H1 × D1
- Requires four hyperparameters:
  - number of kernels K
  - kernel size F
  - stride S
  - the amount of zero-padding P
- Produces a volume of size W2 × H2 × D2, where:
  - W2 = (W1 − F + 2P)/S + 1
  - H2 = (H1 − F + 2P)/S + 1
  - D2 = K
- With parameter sharing, it introduces F · F · D1 weights per kernel, for a total of (F · F · D1) · K weights and K biases
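The formulas above are easy to wrap in a small helper; the example configuration below is made up for illustration:

```python
def conv_output(W1, H1, D1, K, F, S, P):
    """Output volume and parameter count of a conv layer (formulas from this slide)."""
    W2 = (W1 - F + 2 * P) // S + 1
    H2 = (H1 - F + 2 * P) // S + 1
    D2 = K
    n_params = (F * F * D1) * K + K  # shared weights plus one bias per kernel
    return (W2, H2, D2), n_params

# Example: 32x32x3 input, 10 kernels of size 5, stride 1, padding 2 (keeps the size).
print(conv_output(32, 32, 3, K=10, F=5, S=1, P=2))  # ((32, 32, 10), 760)
```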

Page 18

2 Convolutional layer

What does convolution achieve?

- The convolution operation leverages three important ideas that can help improve a machine learning system:
  - sparse interactions
  - parameter sharing
  - equivariant representations
- Convolution provides a means for working with inputs of variable size

Page 19

2 Convolutional layer

Sparse interactions

- Also referred to as sparse connectivity or sparse weights
- Traditional neural network layers use matrix multiplication by a matrix of parameters, with a separate parameter describing the interaction between each input unit and each output unit
- Making the kernel smaller than the input ⇒ fewer parameters ⇒ lower memory requirements

Page 20

2 Convolutional layer

Sparse interactions (cont.)

- In a deep convolutional network, units in the deeper layers may indirectly interact with a larger portion of the input ⇒ this allows complicated interactions between many variables

Page 21

2 Convolutional layer

Parameter sharing

- Refers to using the same parameter for more than one function in a model
- In a convolutional neural network, each member of the kernel is used at every position of the input
- Less computation and fewer parameters to learn

Page 22

2 Convolutional layer

Equivariant representations

- To say a function is equivariant means that if the input changes, the output changes in the same way
- In image recognition, convolution creates a 2-D map of where certain features appear in the input. If we move the object in the input, its representation will move the same amount in the output
- Invariant knowledge: if a feature is useful in some locations during training, kernels for that feature will be available in all locations during testing
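The equivariance claim can be checked numerically; the sketch below (illustrative only, reusing a hand-rolled cross-correlation) shifts the input, convolves, and compares the result with shifting the feature map:

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """'Valid' cross-correlation as before (kernel not flipped)."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(6, 6))
kernel = rng.normal(size=(3, 3))

shifted = np.roll(image, shift=1, axis=1)        # translate the input one pixel right
a = cross_correlate2d(shifted, kernel)           # convolve the shifted input
b = np.roll(cross_correlate2d(image, kernel), shift=1, axis=1)  # shift the feature map
print(np.allclose(a[:, 1:], b[:, 1:]))           # True (excluding the wrapped column)
```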

Page 23

2 Convolutional layer

Backpropagation

- It is easy to modify the backpropagation algorithm to incorporate linear constraints between the weights
- We compute the gradients as usual, and then modify the gradients so that they satisfy the constraints:
  - To constrain: $w_1 = w_2$
  - We need: $\Delta w_1 = \Delta w_2$
  - Compute $\partial J / \partial w_1$ and $\partial J / \partial w_2$
  - Use $\partial J / \partial w_1 + \partial J / \partial w_2$ for both $w_1$ and $w_2$
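A toy numerical sketch of this tied-weight update (the loss function here is invented purely for illustration):

```python
# Toy loss J(w1, w2) = (3*w1 + 5*w2 - 1)**2 with the constraint w1 = w2.
w = 0.2                      # the single shared value behind w1 and w2
lr = 0.01

for _ in range(100):
    w1 = w2 = w
    residual = 3 * w1 + 5 * w2 - 1
    dJ_dw1 = 2 * residual * 3          # gradient as if w1 were free
    dJ_dw2 = 2 * residual * 5          # gradient as if w2 were free
    w -= lr * (dJ_dw1 + dJ_dw2)        # use the summed gradient for the shared weight

print(w, 3 * w + 5 * w)                # w converges so that 8*w is close to 1
```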

Page 24

2 Convolutional layer

Pool layer

- A pooling function replaces the output of the net at a certain location with a summary statistic of the nearby outputs
- Pooling functions include the average, a weighted average, the L2 norm, the maximum, ...
- In all cases, pooling helps to make the representation approximately invariant to small translations of the input
- “Invariance to local translation can be a useful property if we care more about whether some feature is present than exactly where it is.”

Page 25

2 Convolutional layer

Example of max pooling

Page 26

2 Convolutional layer

Max pooling

- The max pooling operation, which reports the maximum output within a rectangular neighborhood (usually 2x2; see the sketch after this list), is preferred because it works slightly better
- Pros:
  - it reduces the number of inputs to the next layer of feature extraction, thus allowing us to have many more different feature maps
  - it is invariant to small translations of the input
- Cons: after several levels of pooling, we have lost information about the precise positions of things
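A minimal NumPy sketch of 2x2 max pooling with stride 2 (illustrative only); the output spatial size follows the (W1 − F)/S + 1 formula on the next slide:

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    """Max pooling over strided windows of a 2-D feature map."""
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i * stride:i * stride + size,
                          j * stride:j * stride + size].max()
    return out

x = np.array([[1., 1., 2., 4.],
              [5., 6., 7., 8.],
              [3., 2., 1., 0.],
              [1., 2., 3., 4.]])
print(max_pool2d(x))   # [[6. 8.]
                       #  [3. 4.]]
```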

Page 27

2 Convolutional layer

Max pooling layer input and output

- Accepts a volume of size W1 × H1 × D1
- Requires two hyperparameters: the spatial extent F and the stride S
- Produces a volume of size W2 × H2 × D2, where:
  - W2 = (W1 − F)/S + 1
  - H2 = (H1 − F)/S + 1
  - D2 = D1
- Introduces zero parameters since it computes a fixed function of the input

Page 28

2 Convolutional layer

Transformation invariance

- If we pool over the outputs of separately parametrized convolutions, the features can learn which transformations to become invariant to

Page 29

2 Convolutional layer

A simple CNN revisited

- There are a few distinct types of layers
- Each layer accepts an input 3D volume and transforms it to an output 3D volume through a differentiable function (see the sketch below)
- Each layer may or may not have parameters
- Each layer may or may not have additional hyperparameters
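A hedged tf.keras sketch of such a layer stack (the lecture does not fix a framework or an architecture; the layer sizes are illustrative), showing how each layer maps one 3D volume to the next:

```python
import tensorflow as tf

# INPUT -> CONV -> ReLU -> POOL -> flatten -> fully-connected class scores
model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),                           # 32x32x3 volume
    tf.keras.layers.Conv2D(16, kernel_size=3, padding="same",
                           activation="relu"),                   # has parameters -> 32x32x16
    tf.keras.layers.MaxPool2D(pool_size=2),                      # no parameters -> 16x16x16
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10),                                   # class scores
])
model.summary()   # lists each layer's output volume and parameter count
```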

Page 30

3 Summary

Two different terminologies for CNNs

Page 31

3 Summary

Summary

- A CNN is a specialized kind of neural network for processing data with a grid-like topology
- A typical convolutional layer includes convolution, ReLU, and max pooling operations
- The convolution operation has three important properties: sparse interactions, parameter sharing, and equivariant representations
- Max pooling reduces the number of inputs to the next layer and is invariant to small translations of the input
