
All You Want To Know About CNNs
Yukun Zhu
CSC2523, 2016-01-21

Deep Learning

Image from http://imgur.com/

Deep Learning in Vision

Object detection performance, PASCAL VOC 2010:

● DPM (2010): 33.4
● segDPM (2014): 40.4
● RCNN (2014): 53.7
● RCNN* (Oct 2014): 62.9
● segRCNN (Jan 2015): 67.2
● Fast RCNN (Jun 2015): 70.8

A Neuron

Image from http://cs231n.github.io/neural-networks-1/

A Neuron in a Neural Network

Image from http://cs231n.github.io/neural-networks-1/

Activation Functions

● Sigmoid: f(x) = 1 / (1 + e^(-x))
● ReLU: f(x) = max(0, x)
● Leaky ReLU: f(x) = max(ax, x)
● Maxout: f(x) = max(w0 x + b0, w1 x + b1)

● and many others…
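As a concrete illustration, here is a minimal NumPy sketch of these activations (the two-piece maxout below is an illustrative special case; in general maxout takes the max over k learned affine functions):

```python
import numpy as np

def sigmoid(x):
    # squashes inputs to the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # zero for negative inputs, identity for positive inputs
    return np.maximum(0.0, x)

def leaky_relu(x, a=0.01):
    # small slope `a` for negative inputs instead of a hard zero
    return np.maximum(a * x, x)

def maxout(x, w0, b0, w1, b1):
    # max of two learned affine functions of x (k = 2 case)
    return np.maximum(w0.dot(x) + b0, w1.dot(x) + b1)
```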

Neural Network (MLP)

Image modified from http://cs231n.github.io/neural-networks-1/

The network represents a function y = f(x; w)

Forward Computation

Image and code modified from http://cs231n.github.io/optimization-2/

f(x0, x1) = 1 / (1 + exp(-(w0*x0 + w1*x1 + w2)))

Worked example with w0 = 2.00, x0 = -1.00, w1 = -3.00, x1 = -2.00, w2 = -3.00:

● w0*x0 = -2.00 and w1*x1 = 6.00
● sum: -2.00 + 6.00 = 4.00
● add the bias w2: 4.00 + (-3.00) = 1.00
● negate: -1.00
● exp: 0.37
● add 1: 1.37
● reciprocal: 1/1.37 = 0.73, the sigmoid output
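A minimal sketch of this forward pass in Python (following the cs231n style; variable names are illustrative):

```python
import math

w = [2.0, -3.0, -3.0]   # w0, w1, w2 (bias)
x = [-1.0, -2.0]        # x0, x1

# forward pass: weighted sum, then sigmoid
dot = w[0] * x[0] + w[1] * x[1] + w[2]   # = 1.00
f = 1.0 / (1.0 + math.exp(-dot))          # = 0.73
```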

Loss Function

A loss function measures how well the prediction matches the true value.

Commonly used loss functions:

● Squared loss: (y - y')^2
● Cross-entropy loss: -sum_i (y_i' * log(y_i))
● and many others

Loss Function

During training, we would like to minimize the total loss on a set of training data:

● We want to find w* = argmin_w { sum_i [loss(f(x_i; w), y_i)] }
● Usually we use a gradient-based approach
  ○ w_{t+1} = w_t - a * ∇_w loss
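A minimal sketch of one such gradient update under squared loss, assuming a simple linear model f(x; w) = w·x (illustrative only):

```python
import numpy as np

def sgd_step(w, x, y, lr=0.1):
    # squared loss: L = (f(x; w) - y)^2 with f(x; w) = w.dot(x)
    y_hat = w.dot(x)
    grad = 2.0 * (y_hat - y) * x   # dL/dw
    return w - lr * grad            # w_{t+1} = w_t - a * dL/dw

w = np.zeros(2)
w = sgd_step(w, np.array([1.0, 2.0]), y=1.0)   # one training example
```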

Backward Computation

Image and code modified from http://cs231n.github.io/optimization-2/

f(x0, x1) = 1 / (1 + exp(-(w0*x0 + w1*x1 + w2)))

Backpropagate from the output (gradient 1.00), multiplying by each node's local derivative:

● f = 1/x, df/dx = -1/x^2: gradient into the 1/x node is 1.00 * (-1/1.37^2) = -0.53
● f = x + 1, df/dx = 1: gradient stays -0.53
● f = e^x, df/dx = e^x: gradient becomes -0.53 * 0.37 = -0.20
● f = -x, df/dx = -1: gradient becomes 0.20
● f = x + a, df/dx = 1: the addition nodes pass 0.20 to every branch
● f = a*x, df/dx = a: finally
  ○ dw0 = 0.20 * x0 = -0.20, dx0 = 0.20 * w0 = 0.40
  ○ dw1 = 0.20 * x1 = -0.40, dx1 = 0.20 * w1 = -0.60
  ○ dw2 = 0.20
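A minimal sketch of this backward pass in Python, continuing the forward-pass snippet above (cs231n style; it uses the shortcut that the sigmoid's derivative is f * (1 - f)):

```python
import math

w = [2.0, -3.0, -3.0]
x = [-1.0, -2.0]

# forward pass
dot = w[0] * x[0] + w[1] * x[1] + w[2]
f = 1.0 / (1.0 + math.exp(-dot))        # sigmoid output, = 0.73

# backward pass: d(sigmoid)/d(dot) = f * (1 - f) = 0.20
ddot = f * (1.0 - f)
dw = [x[0] * ddot, x[1] * ddot, 1.0 * ddot]   # [-0.20, -0.40, 0.20]
dx = [w[0] * ddot, w[1] * ddot]               # [0.40, -0.60]
```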

Why NNs?

Universal Approximation Theorem

A feed-forward network with a single hidden layer containing a finite number of neurons can approximate continuous functions on compact subsets of R^n, under mild assumptions on the activation function.

https://en.wikipedia.org/wiki/Universal_approximation_theorem

Stone’s Theorem

● Suppose X is a compact Hausdorff space and B is a subalgebra of C(X, R) such that:
  ○ B separates points.
  ○ B contains the constant function 1.
  ○ If f ∈ B then af ∈ B for all constants a ∈ R.
  ○ If f, g ∈ B, then f + g, max{f, g} ∈ B.
● Then every continuous function in C(X, R) can be approximated as closely as desired by functions in B.

Why CNNs?

Problems of MLP in Vision

For a 10 × 10 input image:

● A 3-layer MLP with 200 hidden units contains ~100k parameters

For a 100 × 100 input image:

● A 1-layer MLP with 20k hidden units contains ~200m parameters
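A rough sketch of where the second figure comes from, counting only the input-to-hidden weights and ignoring biases:

```python
inputs = 100 * 100       # pixels in a 100 x 100 image
hidden = 20000           # hidden units in the single fully-connected layer
print(inputs * hidden)   # 200,000,000, i.e. ~200m parameters
```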


Can We Do Better?

Based on this observation, the MLP can be improved in two ways:

● Locally connected instead of fully connected
● Sharing weights between neurons

We achieve both by using convolutional neurons.

Convolutional Layers

Image from http://cs231n.github.io/convolutional-networks/

Convolutional Layers

Image from http://cs231n.github.io/convolutional-networks/. See this page for an excellent example of convolution.

(Figure: an activation volume with depth, height, and width axes.)
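A minimal NumPy sketch of a single convolutional unit sliding over a one-channel input (no padding, stride 1; a real layer also sums over the depth axis and applies many filters in parallel):

```python
import numpy as np

def conv2d_single(x, w, b):
    # x: (H, W) input, w: (kH, kW) filter, b: scalar bias
    # (technically cross-correlation, as in CNN libraries)
    H, W = x.shape
    kH, kW = w.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # each output value is a dot product of the filter with a
            # local patch of the input, plus the bias
            out[i, j] = np.sum(x[i:i + kH, j:j + kW] * w) + b
    return out
```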

Pooling Layers

Image from http://cs231n.github.io/convolutional-networks/

Pooling Layers Example: Max Pooling

Image from http://cs231n.github.io/convolutional-networks/

Pooling Layers

Commonly used pooling layers:

● Max pooling
● Average pooling

Why pooling layers?

● Reduce activation dimensionality
● Robustness to small shifts
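A minimal NumPy sketch of 2×2 max pooling with stride 2 on a single-channel activation map (illustrative only):

```python
import numpy as np

def max_pool_2x2(x):
    # x: (H, W) with even H and W; each output value is the maximum
    # over a non-overlapping 2x2 window
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

a = np.arange(16.0).reshape(4, 4)
print(max_pool_2x2(a))   # 2x2 map of window maxima
```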

CNN Architecture: An Example

Image from http://cs231n.github.io/convolutional-networks/

Layer Activations for CNNs

Image modified from http://cs231n.github.io/convolutional-networks/

Conv:1 ReLU:1 Conv:2 ReLU:2 MaxPool:1 Conv:3

Layer Activations for CNNs

Image modified from http://cs231n.github.io/convolutional-networks/

MaxPool:2 Conv:5 ReLU:5 Conv:6 ReLU:6 MaxPool:3

Learnt Weights for CNNs: First Conv Layer of AlexNet

Image from http://cs231n.github.io/convolutional-networks/

Why CNNs Work Now?

Convolutional Neural Networks

● Faster heterogeneous parallel computing
  ○ CPU clusters, GPUs, etc.
● Large datasets
  ○ ImageNet: 1.2m images of 1,000 object classes
  ○ COCO: 300k images with 2m object instances
● Improvements in model architecture
  ○ ReLU, dropout, inception, etc.

AlexNet

Krizhevsky, Alex, et al. "Imagenet classification with deep convolutional neural networks." NIPS 2012

GoogLeNet

Szegedy, Christian, et al. "Going deeper with convolutions." arXiv preprint arXiv:1409.4842 (2014).

Quiz

# of parameters for the first conv layer of AlexNet?

Quiz

# of parameters if the first layer is fully-connected?

Quiz

Given a convolution operation written as

f(x_3x3; w_3x3, b) = sum_{i,j} (x_{i,j} * w_{i,j}) + b

Can you derive its gradients (df/dx, df/dw, df/db)?
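For reference, since f is linear in each argument, the gradients follow directly from the summation form above (a sketch of the answer):

```latex
\frac{\partial f}{\partial w_{i,j}} = x_{i,j}, \qquad
\frac{\partial f}{\partial x_{i,j}} = w_{i,j}, \qquad
\frac{\partial f}{\partial b} = 1
```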

Ready to Build Your Own Networks?

Tips and Tricks for CNNs

● Know your data, clean your data, and normalize your data
  ○ A common trick: subtract the mean and divide by its std.

Image from http://cs231n.github.io/neural-networks-2/
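A minimal sketch of this normalization, assuming X holds the training examples as rows and statistics are computed on the training set only:

```python
import numpy as np

def normalize(X, eps=1e-8):
    # subtract the per-feature mean and divide by the per-feature std
    mean = X.mean(axis=0)
    std = X.std(axis=0) + eps   # eps avoids division by zero
    return (X - mean) / std, mean, std
```

At test time, reuse the mean and std computed on the training data rather than recomputing them.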

Tips and Tricks for CNNs

● Augment your data

Tips and Tricks for CNNs

● Organize your data:
  ○ Keep training data balanced
  ○ Shuffle data before batching
● Feed your data in the correct way
  ○ Image channel order
  ○ Tensor storage order

Tips and Tricks for CNNs

(Images: “First Order, in order” vs. “First Order, out of order”, illustrating ordered vs. shuffled data.)

Tips and Tricks for CNNs

Common tensor storage orders:

● BDRC (batch, depth, rows, columns)
  ○ Used in Caffe, Torch, Theano; supported by cuDNN
  ○ Pros: faster for convolution (FFT, memory access)
● BRCD (batch, rows, columns, depth)
  ○ Used in TensorFlow; limited support in cuDNN
  ○ Pros: fast batch normalization, easier batching

Tips and Tricks for CNNs

Designing model architecture:

● Convolution, max pooling, then fully connected layers
● Nonlinearity
  ○ Stay away from sigmoid (except for the output)
  ○ ReLU preferred
  ○ Try Leaky ReLU next
  ○ Use Maxout if most ReLU units die (have zero activation)

Tips and Tricks for CNNs

Setting parameters:

● Weights
  ○ Random initialization with proper variance
● Biases
  ○ For ReLU, prefer a small positive bias so the units start activated

Tips and Tricks for CNNs

Setting hyperparameters:

● Learning rate / momentum (Δw_t* = Δw_t + m * Δw_{t-1})
  ○ Decrease the learning rate while training
  ○ Set momentum to 0.8 - 0.9
● Batch size
  ○ For a large dataset: set to whatever fits your memory
  ○ For a smaller dataset: find a tradeoff between instance randomness and gradient smoothness
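A minimal sketch of the momentum update above, where `lr` stands in for the learning rate a and names are illustrative:

```python
import numpy as np

def momentum_step(w, grad, velocity, lr=0.01, m=0.9):
    # dw_t* = dw_t + m * dw_{t-1}: blend the current gradient step
    # with the previous update, then apply it to the weights
    velocity = m * velocity - lr * grad
    return w + velocity, velocity

w, v = np.zeros(3), np.zeros(3)
for grad in [np.ones(3)] * 5:        # dummy gradients
    w, v = momentum_step(w, grad, v)
```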

Tips and Tricks for CNNs

Monitoring your training:

● Split your dataset into training, validation, and test sets
  ○ Tune hyperparameters on validation and evaluate on test
  ○ Keep track of training and validation loss during training
  ○ Do early stopping if training and validation loss diverge
  ○ Loss doesn’t tell you everything: also try precision, class-wise precision, and more

Tips and Tricks for CNNs

Borrow knowledge from another dataset:

● Pre-train your CNN on a large dataset (e.g. ImageNet)
● Remove / reshape the last few layers
● Fix the parameters of the first few layers, or make their learning rate small
● Fine-tune the parameters on your own dataset
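As one hypothetical illustration of this recipe using PyTorch and torchvision (not mentioned in the slides; resnet18 stands in for any pre-trained backbone and num_classes is an assumed target dataset size): load a network pre-trained on ImageNet, freeze the early layers, replace the last layer, and train only the new part.

```python
import torch.nn as nn
import torch.optim as optim
from torchvision import models

num_classes = 10                                  # hypothetical target dataset
model = models.resnet18(pretrained=True)          # pre-trained on ImageNet

for p in model.parameters():                      # freeze the pre-trained layers
    p.requires_grad = False

model.fc = nn.Linear(model.fc.in_features, num_classes)  # reshape the last layer

# only the new layer is updated; alternatively, give the early layers a
# small learning rate instead of freezing them entirely
optimizer = optim.SGD(model.fc.parameters(), lr=1e-3, momentum=0.9)
```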

Tips and Tricks for CNNs

Debugging:

● import unittest, not import pdb
● Check your gradient numerically [**deprecated**]
● Make your model large enough, and try overfitting the training set
● Check gradient norms, weight norms, and activation norms
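A minimal sketch of a numerical gradient check (centered differences) that can be wrapped in a unit test; f is any scalar-valued function of a 1-D parameter vector w:

```python
import numpy as np

def numerical_gradient(f, w, h=1e-5):
    # centered-difference estimate of df/dw, one coordinate at a time
    grad = np.zeros_like(w)
    for i in range(w.size):
        old = w[i]
        w[i] = old + h; fp = f(w)
        w[i] = old - h; fm = f(w)
        w[i] = old
        grad[i] = (fp - fm) / (2 * h)
    return grad

# compare against an analytic gradient, e.g. for f(w) = sum(w**2):
w = np.array([1.0, -2.0, 3.0])
print(numerical_gradient(lambda v: np.sum(v ** 2), w))   # approx [2, -4, 6]
print(2 * w)                                              # analytic gradient
```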

Talk is Cheap, Show Me Some Code

Image from http://www.linkresearchtools.com/

Fully Convolutional Networks

Long, Jonathan, et al. "Fully convolutional networks for semantic segmentation." arXiv preprint arXiv:1411.4038 (2014).