Deep Learning Programming
Lecture 7: Convolutional Neural Networks
Dr. Hanhe Lin, Universität Konstanz, 04.06.2018
1 Introduction
Challenges of object recognition
- Occlusion: real scenes are cluttered with other objects; it is hard to tell which parts belong to which object, and part of an object may be hidden behind other objects
- Lighting: the pixel intensities are determined by the lighting on the objects
- Deformation: objects can deform in a variety of non-affine ways, for example, a handwritten digit 7
- Affordances: object classes are often defined by how they are used, e.g., chairs are designed for sitting, so they come in a wide variety of shapes
- Viewpoint: changes in viewpoint cause changes in images that standard learning methods cannot cope with; in other words, information hops between input dimensions
2 / 31 04.06.2018 Deep Learning Programming - Lecture 7 Dr. Hanhe Lin
Ways to achieve viewpoint invariance
- Use redundant invariant features
- Put a box around the object and use normalized pixels
- Use a hierarchy of parts that have explicit poses relative to the camera
- Use replicated features with pooling, i.e., convolutional neural networks
- ...
The invariant features approach
- “Local” approach
- Extract a large, redundant set of features that are invariant under transformations, e.g., SIFT
- But for recognition, we must avoid forming the same features from parts of different objects
- With enough invariant features, we can represent an object with a bag-of-words (BoW) model
The judicious normalization approach
- “Global” approach
- Put a box around the object and use it as a coordinate frame for a set of normalized pixels
- This solves the dimension-hopping problem: if we choose the box correctly, the same part of an object always occurs on the same normalized pixels
- The box can provide invariance to many degrees of freedom, e.g., translation, rotation, scale, shear, etc.
- But choosing the box is difficult because of segmentation errors, occlusion, and unusual orientations
- We need to recognize the shape to get the box right
The brute force normalization approach
- When training the recognizer, use well-segmented, upright images to fit the correct box
- At test time, try all possible boxes in a range of positions and scales
- This approach is widely used for detecting upright things like faces and house numbers in unsegmented images
- Although it can handle some degrees of freedom, you must train multiple recognizers for the same class, for example, frontal faces and profile faces
Motivation of CNNs
- Use many different copies of the same feature detector at different positions
- We could also replicate across scale and orientation, but this is tricky and expensive
- Replication greatly reduces the number of free parameters to be learned
- Use several different types of feature detectors, each with its own feature map
- This allows each patch of the image to be represented in several ways
Definition of CNNs
- Convolutional neural networks (CNNs) are also known as convolutional networks or ConvNets
- A CNN is a specialized kind of neural network for processing data that has a known grid-like topology
- Examples:
  - 1-D grid: time-series data
  - 2-D grid: gray-scale image
  - 3-D grid: color image or gray-scale video
  - 4-D grid: color video
- CNNs are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers
Convolution vs feedforward
- Fully-connected networks have many parameters and are prone to overfitting
- A CNN arranges its neurons in 3 dimensions: width, height, and depth
2 Convolutional layer
Overview: a simple CNN
The convolution operation
- The convolution operation takes two time-series sequences x[n] and h[n] and produces a third sequence y[n], defined as:

  y[n] = x[n] ∗ h[n] = Σ_{k=−∞}^{+∞} x[k] h[n−k], for all n

- In digital signal processing, x[n], h[n], and y[n] are referred to as the input, filter, and output, respectively
- In CNN terminology, x[n], h[n], and y[n] are referred to as the input, kernel, and feature map
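This sum can be checked directly with NumPy; `np.convolve` computes exactly this convolution for finite sequences (a minimal sketch, with the implicit zeros outside the sequences dropped):

```python
import numpy as np

# Finite input sequence x[n] and kernel h[n]
x = np.array([1, 2, 3, 4])
h = np.array([1, 0, -1])

# np.convolve flips the kernel and slides it over x,
# computing y[n] = sum_k x[k] * h[n - k]
y = np.convolve(x, h)
print(y)  # [ 1  2  2  2 -3 -4]
```

Each output sample matches the definition, e.g., y[2] = x[0]·h[2] + x[1]·h[1] + x[2]·h[0] = −1 + 0 + 3 = 2.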
The convolution operation in CNNs
- If we use a two-dimensional image I as our input, we probably also want to use a two-dimensional kernel K:

  S(i, j) = (I ∗ K)(i, j) = Σ_m Σ_n I(i+m, j+n) K(m, n)

- When an image is three-dimensional, the kernel should also be three-dimensional
- What is the difference between these two convolution operations?
- The kernel is not flipped! Strictly speaking, this operation is cross-correlation
- Many deep learning libraries implement cross-correlation but call it convolution
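The difference is easy to see numerically. Below is a small NumPy sketch (the helper name `correlate2d_valid` is illustrative, not a library function): cross-correlating with the kernel flipped along both axes yields true convolution.

```python
import numpy as np

def correlate2d_valid(I, K):
    """Cross-correlation S(i, j) = sum_{m,n} I(i+m, j+n) * K(m, n)."""
    kh, kw = K.shape
    H, W = I.shape[0] - kh + 1, I.shape[1] - kw + 1
    S = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            S[i, j] = np.sum(I[i:i+kh, j:j+kw] * K)
    return S

I = np.arange(9, dtype=float).reshape(3, 3)
K = np.array([[1., 0.],
              [0., -1.]])

corr = correlate2d_valid(I, K)              # what CNN libraries compute
conv = correlate2d_valid(I, K[::-1, ::-1])  # true convolution: flip, then correlate
print(corr)  # all entries -4
print(conv)  # all entries  4
```

For a symmetric kernel the two operations coincide, which is one reason the distinction rarely matters for learned kernels.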
Hyperparameters in conv layer
- depth: corresponds to the number of kernels we would like to use, each learning to look for something different in the input
- kernel size: a larger kernel increases computation
- stride
- zero-padding
Stride
- Stride is the number of pixels by which the kernel slides; a larger stride produces a spatially smaller output
Figure: Stride 1 with kernel size of 3 Figure: Stride 2 with kernel size of 3
Zero-padding
- Pad the input volume with zeros around the border so that the input and output width and height are the same
Figure: Stride 1 with zero-padding of 0 Figure: Stride 1 with zero-padding of 1
Conv layer input and output
- Accepts a volume of size W1 × H1 × D1
- Requires four hyperparameters:
  - Number of kernels K
  - Kernel size F
  - Stride S
  - Amount of zero-padding P
- Produces a volume of size W2 × H2 × D2, where:
  - W2 = (W1 − F + 2P)/S + 1
  - H2 = (H1 − F + 2P)/S + 1
  - D2 = K
- With parameter sharing, it introduces F · F · D1 weights per kernel, for a total of (F · F · D1) · K weights and K biases
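These formulas can be packed into a small helper (a sketch; `conv_output_size` is an illustrative name, not a library function):

```python
def conv_output_size(W1, H1, D1, K, F, S, P):
    """Output volume and parameter count of a conv layer,
    assuming square kernels and the formulas above."""
    W2 = (W1 - F + 2 * P) // S + 1
    H2 = (H1 - F + 2 * P) // S + 1
    D2 = K
    params = (F * F * D1) * K + K  # shared weights plus one bias per kernel
    return (W2, H2, D2), params

# Example: 32x32x3 input, 10 kernels of size 5, stride 1, padding 2
shape, params = conv_output_size(32, 32, 3, K=10, F=5, S=1, P=2)
print(shape)   # (32, 32, 10) - padding of 2 preserves the spatial size
print(params)  # 760 = (5*5*3)*10 weights + 10 biases
```

Note that (W1 − F + 2P) must be divisible by S for the kernel to tile the input evenly.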
What does convolution achieve?
- The convolution operation leverages three important ideas that can help improve a machine learning system:
  - sparse interactions
  - parameter sharing
  - equivariant representations
- Convolution also provides a means for working with inputs of variable size
Sparse interactions
- Also referred to as sparse connectivity or sparse weights
- Traditional neural network layers use multiplication by a matrix of parameters, with a separate parameter describing the interaction between each input unit and each output unit
- Making the kernel smaller than the input ⇒ fewer parameters ⇒ lower memory requirement
Sparse interactions (cont.)
- In a deep convolutional network, units in the deeper layers may indirectly interact with a larger portion of the input ⇒ allows complicated interactions between many variables
Parameter sharing
- Refers to using the same parameter for more than one function in a model
- In a convolutional neural net, each member of the kernel is used at every position of the input
- Less computation and fewer parameters to learn
Equivariant representations
- A function is equivariant if, when the input changes, the output changes in the same way
- In image recognition, convolution creates a 2-D map of where certain features appear in the input. If we move the object in the input, its representation moves by the same amount in the output
- Invariant knowledge: if a feature is useful in some locations during training, detectors for that feature are available in all locations during testing
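Translation equivariance can be verified numerically with a toy 1-D example (a sketch; the zero border is chosen so that the circular shift of `np.roll` behaves like a plain translation):

```python
import numpy as np

def corr1d(x, k):
    """Valid 1-D cross-correlation, the 'convolution' used in CNNs."""
    return np.array([np.dot(x[i:i+len(k)], k)
                     for i in range(len(x) - len(k) + 1)])

x = np.array([0., 1., 3., 2., 0., 0., 0.])  # signal with a zero border
k = np.array([1., -1.])                     # simple edge-detector kernel

y = corr1d(x, k)
x_shifted = np.roll(x, 2)        # translate the input by 2
y_shifted = corr1d(x_shifted, k)

# Shifting the input shifts the feature map by the same amount
assert np.allclose(np.roll(y, 2), y_shifted)
```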
Backpropagation
- It’s easy to modify the backpropagation algorithm to incorporate linear constraints between the weights
- We compute the gradients as usual, and then modify them so that they satisfy the constraints:
  - To constrain: w1 = w2
  - We need: Δw1 = Δw2
  - Compute ∂J/∂w1 and ∂J/∂w2
  - Use ∂J/∂w1 + ∂J/∂w2 for both w1 and w2
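A toy numeric version of this recipe (the quadratic loss J here is an illustrative choice, not from the slides):

```python
# Tie w1 = w2 while training y = w1*x1 + w2*x2 with loss J = (y - t)^2
x1, x2, t = 1.0, 2.0, 1.0
w1, w2 = 0.5, 0.5   # start out satisfying the constraint
lr = 0.1

y = w1 * x1 + w2 * x2
g1 = 2 * (y - t) * x1   # dJ/dw1, computed as usual
g2 = 2 * (y - t) * x2   # dJ/dw2, computed as usual

# Apply the summed gradient to BOTH weights
w1 -= lr * (g1 + g2)
w2 -= lr * (g1 + g2)
assert w1 == w2  # the constraint is preserved after the update
```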
Pooling layer
- A pooling function replaces the output of the net at a certain location with a summary statistic of the nearby outputs
- Pooling functions include average, weighted average, L2 norm, maximum, ...
- In all cases, pooling helps make the representation approximately invariant to small translations of the input
- “Invariance to local translation can be a useful property if we care more about whether some feature is present than exactly where it is.”
Example of max pooling
Max pooling
- The max pooling operation, which reports the maximum output within a rectangular neighborhood (usually 2×2), is preferred because it works slightly better in practice
- Pros:
  - Reduces the number of inputs to the next layer of feature extraction, allowing us to have many more different feature maps
  - Invariant to small translations of the input
- Cons: after several levels of pooling, we have lost information about the precise positions of things
Max pooling layer input and output
- Accepts a volume of size W1 × H1 × D1
- Requires two hyperparameters: the spatial extent F and the stride S
- Produces a volume of size W2 × H2 × D2, where:
  - W2 = (W1 − F)/S + 1
  - H2 = (H1 − F)/S + 1
  - D2 = D1
- Introduces zero parameters, since it computes a fixed function of the input
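The formulas can be checked with a minimal max pooling sketch (the `max_pool` helper is illustrative, not a library function):

```python
import numpy as np

def max_pool(x, F=2, S=2):
    """Max pooling over an H x W map with window F and stride S.
    Computes a fixed function of the input, so it has no parameters."""
    H = (x.shape[0] - F) // S + 1
    W = (x.shape[1] - F) // S + 1
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = x[i*S:i*S+F, j*S:j*S+F].max()
    return out

x = np.array([[1., 3., 2., 1.],
              [4., 6., 5., 0.],
              [7., 2., 9., 8.],
              [1., 0., 3., 4.]])
print(max_pool(x))  # [[6. 5.]
                    #  [7. 9.]]
```

A 4×4 map pooled with F = 2 and S = 2 gives W2 = (4 − 2)/2 + 1 = 2, matching the formula above.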
Transformation invariance
- If we pool over the outputs of separately parametrized convolutions, the features can learn which transformations to become invariant to
A simple CNN revisited
- There are a few distinct types of layers
- Each layer accepts an input 3D volume and transforms it to an output 3D volume through a differentiable function
- Each layer may or may not have parameters
- Each layer may or may not have additional hyperparameters
3 Summary
Two different terminologies for CNNs
Summary
- A CNN is a specialized kind of neural network for processing data with a grid-like topology
- A typical convolutional layer includes convolution, ReLU, and max pooling operations
- The convolution operation has three important features: sparse interactions, parameter sharing, and equivariant representations
- Max pooling reduces the number of inputs to the next layer and is invariant to small translations of the input