Computer Vision – Neural Networks
Ing. Ivan Gruber, Ph.D.
Department of Cybernetics, Faculty of Applied Sciences, University of West Bohemia
ESF project of the University of West Bohemia in Pilsen, reg. no. CZ.02.2.69/0.0/0.0/16 015/0002287
Content
• Neural networks
  – General knowledge
  – Artificial neuron
• Neural network properties
  – Activation functions
  – Layers
  – Training
  – Parameters
• Important architectures
Deep neural networks
• The most popular machine learning technique nowadays
• Models are inspired by the biological brain
• Non-linearity is achieved by stacking layers with activation functions
• Many different neural network architectures exist
• Feedforward vs. recurrent
• Supervised vs. unsupervised learning
• Weights are updated via the back-propagation algorithm
• Advantages: end-to-end training (including feature extraction), state-of-the-art results
• Disadvantages: need for a huge amount of training data, choice of the correct architecture (a little bit of an alchemy)
Artificial neuron
• A biological neuron is composed of:
  – Soma - the body of the neuron
  – Axon - the output; each neuron has only one axon
  – Dendrites - the inputs; each neuron can have up to several thousand dendrites
  – Synapses - links between axons and dendrites, one-way gates with different synaptic strength
  – Inputs (electrical impulses) are summed and sent into the axon if the sum is above a certain threshold
• Artificial neuron (see the sketch below):
  – The synaptic strength is modeled by the weights W
  – The threshold is ensured by the activation function f
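A minimal NumPy sketch of a single artificial neuron; the sigmoid activation and the concrete numbers are illustrative choices, not taken from the slides:

```python
import numpy as np

def sigmoid(xi):
    """Sigmoid activation: squashes the pre-activation into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-xi))

def neuron(x, w, b):
    """Single artificial neuron: weighted sum of inputs plus bias, then activation."""
    xi = np.dot(w, x) + b      # the 'summed impulses' weighted by synaptic strengths
    return sigmoid(xi)         # the activation function models the firing threshold

x = np.array([0.5, -1.2, 2.0])   # inputs (dendrites)
w = np.array([0.8, 0.1, -0.4])   # weights W (synaptic strengths)
b = 0.2                          # bias
print(neuron(x, w, b))
```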
Activation function
• Defines the output of the neuron based on the input(s) and some fixed mathematical operation
• Neurons do not have to have an activation function
• Many different activation functions exist
Sigmoid Function
$$f(\xi) = \frac{1}{1 + e^{-\xi}}, \quad \text{where } \xi = \sum_{i=1}^{n} w_i x_i + b, \qquad (1)$$

• Frequently used historically
• Two major drawbacks (illustrated below):
  – Function saturation outside roughly (-5, 5) - causes problems during back-propagation (vanishing gradients)
  – Not zero-centered - the gradients during back-propagation will always be either all positive or all negative
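A small NumPy sketch of the saturation problem: the derivative σ'(ξ) = σ(ξ)(1 − σ(ξ)) shrinks towards zero for large |ξ|, which is what makes the back-propagated gradient vanish (the sample points are arbitrary):

```python
import numpy as np

def sigmoid(xi):
    return 1.0 / (1.0 + np.exp(-xi))

def sigmoid_grad(xi):
    """Derivative of the sigmoid: sigma(xi) * (1 - sigma(xi))."""
    s = sigmoid(xi)
    return s * (1.0 - s)

for xi in [0.0, 2.0, 5.0, 10.0]:
    print(f"xi = {xi:5.1f}   sigmoid = {sigmoid(xi):.5f}   gradient = {sigmoid_grad(xi):.5f}")
# The gradient approaches zero as |xi| grows - the saturation (vanishing gradient) problem.
```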
Tanh activation
$$f(\xi) = \frac{2}{1 + e^{-2\xi}} - 1, \qquad (2)$$

• Zero-centered
• The saturation problem remains
ReLU (Rectified Linear Unit)
$$f(\xi) = \max(0, \xi), \qquad (3)$$

• The most popular activation function
• Computational simplicity
• Danger of creating dead neurons
• Modifications: Leaky ReLU, PReLU (sketch below)
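A minimal NumPy sketch of ReLU and Leaky ReLU; the negative-side slope of 0.01 is a common but arbitrary choice:

```python
import numpy as np

def relu(xi):
    """ReLU: passes positive inputs unchanged and zeroes out negative ones."""
    return np.maximum(0.0, xi)

def leaky_relu(xi, alpha=0.01):
    """Leaky ReLU: a small slope alpha on the negative side keeps the gradient
    non-zero and reduces the danger of dead neurons."""
    return np.where(xi > 0, xi, alpha * xi)

x = np.array([-3.0, -0.5, 0.0, 2.0])
print(relu(x))         # [0. 0. 0. 2.]
print(leaky_relu(x))   # [-0.03 -0.005 0. 2.]
```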
Softmax
$$f(\xi)_j = \frac{e^{\xi_j}}{\sum_{k=1}^{N} e^{\xi_k}}, \qquad (4)$$

• Used in the classification layer
• Converts raw values into posterior probabilities (sketch below)
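An illustrative NumPy implementation; subtracting the maximum before exponentiating is a standard numerical-stability trick (not mentioned on the slide) that does not change the result:

```python
import numpy as np

def softmax(xi):
    """Softmax over a vector of raw scores (logits)."""
    shifted = xi - np.max(xi)     # for numerical stability only
    exps = np.exp(shifted)
    return exps / np.sum(exps)

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs, probs.sum())         # posterior probabilities that sum to 1.0
```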
Layers
• A NN is formed by (acyclically) connecting artificial neurons together
• The final purpose and function of the ANN are determined by these connections (the architecture of the network), by the weights, and by the types of neurons (activation functions)
• Neurons are organized into distinct layers
• The most common ones:
  – Fully-connected layer
  – Convolution layer
  – Pooling layer
  – Regularization layer
Fully-Connected Layer
• Each neuron is connected to all neurons in the previous layer (sketch below)
• The most common layer
• (Optionally) the last few layers in CNNs
• Prone to over-fitting → used together with dropout
• Hyperparameters: number of neurons
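A minimal sketch of the forward pass of a fully-connected (dense) layer, assuming NumPy and a batch of row-vector inputs:

```python
import numpy as np

def fully_connected(x, W, b):
    """Fully-connected layer: every output neuron sees every input.
    x: (batch, n_in), W: (n_in, n_out), b: (n_out,)"""
    return x @ W + b    # an activation function would normally be applied on top

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))             # batch of 4 inputs with 8 features each
W = rng.normal(size=(8, 16)) * 0.01     # 8 inputs -> 16 neurons
b = np.zeros(16)
print(fully_connected(x, W, b).shape)   # (4, 16)
```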
Convolution Layer
• Neurons are connected only to a local region of the previous layer
• The size of this region is a hyperparameter called kernel size (or receptive field)
• The size of the convolutional step is called stride (usually = 1)
• Can be imagined as a set of filters
• The number of filters is called depth
• All the neurons within the same filter share their weights
• Hyperparameters: kernel size, stride, number of filters (sketch below)
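A naive single-channel 2D convolution sketch in NumPy, just to make the kernel size and stride concrete; real frameworks use far more efficient implementations, and the edge-detector kernel is only an example:

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Valid convolution of a 2D image with a single 2D kernel (no padding)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    oh = (ih - kh) // stride + 1
    ow = (iw - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * kernel)   # the same weights are shared at every position
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
vertical_edge = np.array([[1., 0., -1.],
                          [1., 0., -1.],
                          [1., 0., -1.]])        # a simple vertical-edge detector
print(conv2d(image, vertical_edge).shape)        # (3, 3)
```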
Kernel’s properties
• Kernels in the first layers = low-level features (edge detectors, for example)
• Kernels in the middle layers = higher-level features
• Kernels at the end = class-specific features
Pooling Layer
• No trainable weights, no activation function
• Performs a specific mathematical operation over the related region
• Hyperparameters: kernel size, stride
• Typical operations: maximum, average (sketch below)
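A sketch of 2×2 max pooling with stride 2 in NumPy (the most common configuration, assumed here for illustration):

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    """Max pooling: keeps the maximum of each region of a 2D feature map."""
    h, w = x.shape
    oh = (h - size) // stride + 1
    ow = (w - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.max(x[i*stride:i*stride+size, j*stride:j*stride+size])
    return out

feature_map = np.array([[1., 3., 2., 0.],
                        [4., 6., 1., 2.],
                        [7., 2., 9., 5.],
                        [1., 0., 3., 4.]])
print(max_pool2d(feature_map))   # [[6. 2.] [7. 9.]]
```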
Dropout
• Regularization technique
• Prevents over-fitting
• During training, each neuron output has a probability p of being ignored (sketch below)
• Hyperparameter: probability p
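An illustrative inverted-dropout sketch; scaling the survivors by 1/(1 − p) at training time is one common convention, while the slide only specifies the drop probability p:

```python
import numpy as np

def dropout(x, p=0.5, training=True, rng=np.random.default_rng()):
    """Inverted dropout: during training each activation is zeroed with probability p
    and the survivors are rescaled so the expected activation stays the same."""
    if not training or p == 0.0:
        return x                    # dropout is switched off at test time
    mask = (rng.random(x.shape) >= p).astype(x.dtype)
    return x * mask / (1.0 - p)

activations = np.ones((2, 8))
print(dropout(activations, p=0.5))  # roughly half of the values zeroed, the rest doubled
```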
Loss layer
• The last layer of the NN
• In the case of classification or regression we try to find such values of ω_i that will minimize a chosen criterion
• The criterion usually incorporates information from the teacher t

Classification criteria (sketched in code below):
• Binary cross-entropy:

$$E_k = -\sum_i \left[ t_i \log o_i + (1 - t_i) \log(1 - o_i) \right] \qquad (5)$$

• Categorical cross-entropy (with a soft-max layer):

$$E_k = -\sum_i t_{k,i} \log o_{k,i} \qquad (6)$$

Regression criteria:
• Mean squared/absolute error, e.g.:

$$E_k = \sum_i (t_i - o_i)^2 \qquad (7)$$
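A minimal NumPy sketch of the criteria above; the variable names follow the slide (t for the teacher, o for the network output), and the clipping is only a practical guard against log(0):

```python
import numpy as np

def binary_cross_entropy(t, o, eps=1e-12):
    """Binary cross-entropy between targets t in {0, 1} and predictions o in (0, 1)."""
    o = np.clip(o, eps, 1.0 - eps)
    return -np.sum(t * np.log(o) + (1.0 - t) * np.log(1.0 - o))

def categorical_cross_entropy(t, o, eps=1e-12):
    """Categorical cross-entropy; t is a one-hot vector, o a soft-max output."""
    return -np.sum(t * np.log(np.clip(o, eps, 1.0)))

def mean_squared_error(t, o):
    return np.sum((t - o) ** 2)

t = np.array([0.0, 1.0, 0.0])    # teacher (one-hot)
o = np.array([0.1, 0.7, 0.2])    # network output after soft-max
print(categorical_cross_entropy(t, o))   # ~0.357
print(mean_squared_error(t, o))          # 0.14
```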
Back-propagation
• The most common training method for NNs
• Used in conjunction with an optimization method
• Algorithm steps (a worked single-neuron example follows below):
  1. The forward pass - the NN predicts an output
  2. Error calculation based on the loss function
  3. The backward pass - by recursive application of the chain rule, the gradient with respect to the individual parameters (W, b) is calculated = the loss is back-propagated to the individual neurons
  4. Using the gradients, the parameter update is performed based on the optimization method
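To make the chain rule concrete, a hand-derived sketch for a single sigmoid neuron with a squared-error loss; this tiny example and all its numbers are illustrative, not from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 1. Forward pass for one neuron: o = sigmoid(w.x + b)
x = np.array([0.5, -1.0])
w = np.array([0.3, 0.8])
b = 0.1
t = 1.0                                   # teacher value

z = np.dot(w, x) + b
o = sigmoid(z)

# 2. Error calculation: E = (t - o)^2
E = (t - o) ** 2

# 3. Backward pass, chain rule: dE/dw = dE/do * do/dz * dz/dw
dE_do = -2.0 * (t - o)
do_dz = o * (1.0 - o)
dE_dw = dE_do * do_dz * x
dE_db = dE_do * do_dz

# 4. Parameter update (plain gradient descent with a hypothetical learning rate)
lr = 0.1
w = w - lr * dE_dw
b = b - lr * dE_db
print(E, dE_dw, dE_db)
```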
Loss function
• Cross-entropy loss (classification tasks), always used with Softmax:

$$L_{ce} = -\sum_{i}^{N} t_i \log p_i, \qquad (8)$$

where t_i is the teacher label and p_i the predicted (softmax) probability

• Mean-square error (regression tasks):

$$L_2 = \| f - y \|_2^2, \qquad (9)$$

• Others:
  – Contrastive loss
  – Triplet loss
  – Angular Softmax loss
  – Arc loss
Weight initialization
• Before the training process, it is necessary to initialize the parameters
• A non-trivial task
• A popular subject of research
• Common initializers (a Xavier-style sketch follows below):
  – Zeros
  – Normal
  – Gaussian random variables (µ = 0 and σ = 0.01 ... 10^{-5})
  – Xavier (Glorot)
  – LeCun
  – etc.
• The parameter update itself is then performed by an optimizer
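A minimal sketch of Xavier (Glorot) initialization for a single weight matrix; the uniform variant and the layer sizes are my assumptions:

```python
import numpy as np

def xavier_uniform(n_in, n_out, rng=np.random.default_rng()):
    """Xavier/Glorot uniform initialization: weights drawn from U(-limit, limit)
    with limit = sqrt(6 / (n_in + n_out)), keeping the activation variance
    roughly constant across layers."""
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

W = xavier_uniform(256, 128)
b = np.zeros(128)                # biases are commonly initialized to zeros
print(W.shape, round(W.std(), 4))
```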
Stochastic Gradient Descent (SGD)
• First-order optimization algorithm
• Training data are divided into batches (due to memory limitations)
• The gradient descent step is computed over those batches (sketch below):

$$\omega_{t+1} = \omega_t - \gamma_t \sum_{i=1}^{n} \nabla L_i(\omega_t), \qquad (10)$$

• Advantages: low computational time, best results with the right learning-rate policy
• Disadvantages: necessity of finding the right learning-rate policy
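A minimal mini-batch SGD sketch corresponding to Eq. (10); the simple linear least-squares model stands in for a real network and is purely illustrative:

```python
import numpy as np

def sgd_step(omega, grad, lr):
    """One SGD update: move against the mini-batch gradient."""
    return omega - lr * grad

def batch_gradient(omega, x_batch, t_batch):
    """Gradient of a squared-error linear model over one mini-batch
    (a stand-in for the back-propagated gradient of a real network)."""
    preds = x_batch @ omega
    return 2.0 * x_batch.T @ (preds - t_batch) / len(t_batch)

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))
t = x @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=100)
omega = np.zeros(3)

for epoch in range(50):
    for start in range(0, len(x), 16):            # mini-batches of 16 samples
        xb, tb = x[start:start+16], t[start:start+16]
        omega = sgd_step(omega, batch_gradient(omega, xb, tb), lr=0.05)
print(omega)                                       # approaches [1, -2, 0.5]
```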
Learning rate
• Effect of the learning rate
• Learning rate decay:
  – The learning rate is changed during the training
  – Step decay
  – Exponential decay
  – Etc.
Momentum
• Improves results in most cases
• Weighted average between the newly computed gradient and the past gradients (sketch below):

$$\omega_{t+1} = \omega_t + \Delta\omega_t = \omega_t - \gamma_t \nabla L(\omega_t) + \alpha \Delta\omega_{t-1}, \qquad (11)$$
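A sketch of the momentum update from Eq. (11) applied to a toy quadratic loss; the momentum coefficient α = 0.9 and the learning rate are assumed, not taken from the slide:

```python
import numpy as np

def momentum_step(omega, grad, velocity, lr=0.01, alpha=0.9):
    """SGD with momentum: the update mixes the new gradient step with
    the previous update (Eq. 11)."""
    velocity = -lr * grad + alpha * velocity   # this is delta_omega_t
    return omega + velocity, velocity

omega = np.array([5.0, -3.0])
velocity = np.zeros_like(omega)
for _ in range(100):
    grad = 2.0 * omega                          # gradient of the toy loss |omega|^2
    omega, velocity = momentum_step(omega, grad, velocity)
print(omega)                                     # converges towards [0, 0]
```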
Adaptive optimizers
• Change the learning rate adaptively
• Adagrad
• RMSprop
• Adam (sketch below)
• Etc.
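As one example, a compact sketch of the Adam update; the hyperparameters β1 = 0.9, β2 = 0.999, ε = 1e-8 are the commonly used defaults and are not taken from the slide:

```python
import numpy as np

def adam_step(omega, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: running averages of the gradient (m) and of its square (v)
    give each parameter its own effective learning rate."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)        # bias correction for the warm-up phase
    v_hat = v / (1 - beta2 ** t)
    omega = omega - lr * m_hat / (np.sqrt(v_hat) + eps)
    return omega, m, v

omega = np.array([5.0, -3.0])
m = np.zeros_like(omega)
v = np.zeros_like(omega)
for t in range(1, 201):
    grad = 2.0 * omega                   # toy quadratic loss again
    omega, m, v = adam_step(omega, grad, m, v, t)
print(omega)                             # omega is nudged towards the minimum at [0, 0]
```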
AlexNet (2012)
• Winner of ImageNet 2012
• The first time a NN-based approach overcame the other approaches
• 5 convolutional layers + 3 fully-connected layers
• Innovations: ReLU nonlinearity, dropout
VGGNet (2014)
• Usage of small kernels (3×3)
• Constant computational complexity across all convolutional layers
• State-of-the-art results
InceptionNet (2014)
• Large kernels are preferred for more global information, while smaller ones are preferred for local information
• The size of the important objects can vary a lot
• Application of different operations at the same depth = Inception module
• Usage of 1×1 convolutions ('max pooling for the channel dimension')
Fully-Convolutional Networks
• No fully-connected layers
• A fully-connected layer has a huge number of parameters and is prone to overfitting
• Global average pooling is used instead of the last fully-connected layer
• Advantages:
  1. Correspondence between feature maps and categories is enforced
  2. Overfitting avoidance
  3. Global average pooling is more robust to spatial transformations
  4. Fully-convolutional networks have a great ability to encode localization without any further information
ResNet (2016)
• Problems with vanishing and exploding gradients
• The ease of learning is not the same for all transformations
• Inclusion of shortcut connections (sketch below)
• Winner of ImageNet 2015

$$y = F(x, \{W_i\}) + x, \qquad (12)$$
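A minimal sketch of the residual connection from Eq. (12); the residual branch F is reduced to two small fully-connected transforms purely for illustration:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, W2):
    """Residual block: y = F(x, {W1, W2}) + x.
    The identity shortcut lets gradients flow directly past the transformation F."""
    f = relu(x @ W1) @ W2      # the residual branch F(x, {W_i})
    return relu(f + x)         # shortcut added before the final activation

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))
W1 = rng.normal(size=(16, 16)) * 0.01
W2 = rng.normal(size=(16, 16)) * 0.01
print(residual_block(x, W1, W2).shape)   # (4, 16) - same shape as the input
```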
Autoencoders
• Bow-tie structure:
  – Encoder - a fully-convolutional network
  – Decoder - a deconvolutional network
• Used for semantic segmentation tasks
Challenges, codes and examples
• Kaggle
• ImageNet
• Papers with Code
• CS231n: Convolutional Neural Networks for Visual Recognition
• CS231n: YouTube lectures
• Andrew Ng's Coursera courses
• Siraj Raval - YouTube (fraud and thief, but still very informative)
• 3Blue1Brown
• Two Minute Papers
• Deep learning news
Thank you for your attention!
Questions?