
Midterm Review - CS230 Spring 2019

Outline
● Forward and Backpropagation
● Optimization Methods
● Bias & Variance
● Batch Normalization
● Adversarial Examples
● GANs
● CNNs

Forward and Backpropagation

Activations:

● ReLU
● Sigmoid
● Tanh

Losses:

● Mean Squared Error
● Cross Entropy

Hints:

● Remember to use the chain rule.
● The gradient of the loss with respect to a parameter has the same dimensions as that parameter.

(1) Forward Propagation
- Compute the output of the model
- Use the output and the label to compute the loss

(2) Backward Propagation
- Take the gradients of the loss with respect to the parameters
- Update the parameters with the gradients
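As a concrete illustration, a minimal NumPy sketch (not from the slides) of one forward/backward pass for a single sigmoid unit with cross-entropy loss; variable names are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(W, b, X):
    """X: (n_features, m) with columns as examples. Returns A: (1, m)."""
    Z = W @ X + b            # linear step
    return sigmoid(Z)        # activation

def loss(A, Y):
    """Binary cross-entropy averaged over the batch."""
    return -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))

def backward(X, Y, A):
    """Chain rule: dL/dW = dL/dA * dA/dZ * dZ/dW = (A - Y) X^T / m."""
    m = X.shape[1]
    dZ = A - Y                          # dL/dZ for sigmoid + cross-entropy
    dW = (dZ @ X.T) / m                 # shape (1, n_features), same as W
    db = np.mean(dZ, keepdims=True)     # shape (1, 1), same as b
    return dW, db

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 5))
Y = (rng.random((1, 5)) > 0.5).astype(float)
W, b = rng.standard_normal((1, 3)) * 0.01, np.zeros((1, 1))

A = forward(W, b, X)                    # (1) forward propagation
dW, db = backward(X, Y, A)              # (2) backward propagation
W, b = W - 0.1 * dW, b - 0.1 * db       # gradient step
```

Note how each gradient (dW, db) has exactly the shape of the parameter it updates.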

Further Reading and proofs


Optimization Methods

● Gradient Descent - update parameters in the opposite direction of their gradient.

○ Stochastic - batch size of 1.
○ Mini Batch - batch size m, with 1 < m < n.
○ Full Batch - batch size of n.

● Momentum - accelerate gradient descent by adding a fraction of the update vector from previous steps to the current update.

● RMSprop - adapts the learning rate per parameter by dividing the gradient by a running average of recent squared gradients.

● Adam - a mixture of momentum and an adaptive learning rate (all four update rules are sketched below).
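A minimal sketch of the update rules, in one common formulation; the (1 - beta) factors and the names lr, beta, eps vary across presentations:

```python
import numpy as np

def gd_step(w, g, lr=0.01):
    return w - lr * g                        # move against the gradient

def momentum_step(w, g, v, lr=0.01, beta=0.9):
    v = beta * v + (1 - beta) * g            # running average of gradients
    return w - lr * v, v

def rmsprop_step(w, g, s, lr=0.01, beta=0.9, eps=1e-8):
    s = beta * s + (1 - beta) * g**2         # running average of squared grads
    return w - lr * g / (np.sqrt(s) + eps), s

def adam_step(w, g, v, s, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    # t is the step count starting at 1 (used for bias correction).
    v = beta1 * v + (1 - beta1) * g          # momentum term
    s = beta2 * s + (1 - beta2) * g**2       # adaptive (RMSprop) term
    v_hat = v / (1 - beta1**t)
    s_hat = s / (1 - beta2**t)
    return w - lr * v_hat / (np.sqrt(s_hat) + eps), v, s
```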


Visualization of optimization methods.

Hyperparameters:
● Batch Size
● Learning Rate
● Learning Rate Decay

Initialization Methods

● Xavier

○ Weights and inputs are centered at zero
○ Weights and inputs are independent and identically distributed
○ Biases are initialized as zeros
○ Activation function is zero-centered, e.g. tanh
○ Var(a) ≈ Var(z)

● He

○ ReLU activations (note that ReLU is not a zero-centered activation)
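A minimal NumPy sketch of both schemes, one common variant of each (other scalings exist):

```python
import numpy as np

def xavier_init(n_in, n_out, rng=np.random.default_rng()):
    # Var(W) = 1/n_in keeps Var(a) ≈ Var(z) for zero-centered
    # activations such as tanh.
    return rng.standard_normal((n_out, n_in)) * np.sqrt(1.0 / n_in)

def he_init(n_in, n_out, rng=np.random.default_rng()):
    # Var(W) = 2/n_in compensates for ReLU zeroing half its inputs.
    return rng.standard_normal((n_out, n_in)) * np.sqrt(2.0 / n_in)
```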

Further Reading and proofs

Bias & Variance

Bias & Variance - Cat classifier

Human/Optimal/Bayes error: 3%

Train set error | Dev set error | Diagnosis
----------------|---------------|--------------------------
4%              | 15%           | High variance
11%             | 12%           | High bias
15%             | 30%           | High bias & high variance
4%              | 4.5%          | Low bias & low variance

Bias & Variance Recipe

(1) High bias? (train set performance)
    Y → Bigger network, train longer, NN architecture search; then re-check (1).
    N → Go to (2).

(2) High variance? (dev set performance)
    Y → More data, regularization, (NN architecture search); then re-check (1).
    N → Done.
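A rough sketch of the recipe as code; bayes_err matches the cat classifier example above, and tol is an illustrative threshold, not from the slides:

```python
def diagnose(train_err, dev_err, bayes_err=0.03, tol=0.02):
    if train_err - bayes_err > tol:    # poor train set performance
        return "high bias: bigger network, train longer, architecture search"
    if dev_err - train_err > tol:      # poor generalization to dev
        return "high variance: more data, regularization"
    return "done"

print(diagnose(0.04, 0.15))   # high variance (row 1 of the table)
print(diagnose(0.11, 0.12))   # high bias (row 2 of the table)
```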

Regularization

- L1, L2 (weight decay)
- Inverted dropout (no dropout during test time! see the sketch below)
- Data augmentation: crop, flip, rotate, …
- Early stopping
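A minimal sketch of inverted dropout; scaling by keep_prob at train time keeps the expected activation unchanged, which is why nothing needs to be done at test time:

```python
import numpy as np

def dropout_forward(A, keep_prob=0.8, train=True, rng=np.random.default_rng()):
    if not train:
        return A                              # test time: identity
    mask = rng.random(A.shape) < keep_prob    # keep each unit w.p. keep_prob
    return A * mask / keep_prob               # "inverted": rescale now
```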

Normalize inputs

Zero-mean: x := x - μ, where μ = (1/m) Σ_i x^(i)

Unit-variance: x := x / σ, where σ² = (1/m) Σ_i (x^(i) - μ)²

Batch Normalization

- Mini-batch Batch Norm
- Usually applied before the activation
- At test time: use running averages of the mean and variance computed during training
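A minimal sketch of a Batch Norm forward pass; Z is (n_features, m) with columns as examples, gamma/beta are the learnable scale and shift, and run_mean/run_var are the running averages used at test time:

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, run_mean, run_var,
                      momentum=0.9, eps=1e-5, train=True):
    if train:
        mu = Z.mean(axis=1, keepdims=True)    # batch mean per feature
        var = Z.var(axis=1, keepdims=True)    # batch variance per feature
        run_mean = momentum * run_mean + (1 - momentum) * mu
        run_var = momentum * run_var + (1 - momentum) * var
    else:
        mu, var = run_mean, run_var           # test: use running averages
    Z_norm = (Z - mu) / np.sqrt(var + eps)    # normalize
    return gamma * Z_norm + beta, run_mean, run_var
```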

Batch Normalization

Why does it work?

- Faster learning: each dimension takes a similar range of values
- Mitigates covariate shift (makes weights in later layers more robust to weight changes in earlier layers)
- Slight regularization effect: the mean and variance of each mini-batch add noise

Adversarial examples

- Creating adversarial examples (Gradient Method, Fast Gradient Sign Method; sketched below)
- Black Box vs. White Box attacks
- Defences against adversarial examples:
  - SafetyNet
  - Train on correctly labelled adversarial examples
  - Adversarial Training (train with a perturbed version of the input)
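A one-line sketch of the Fast Gradient Sign Method; grad_x_loss stands for the gradient of the loss with respect to the input, obtained by running backprop one step past the first layer:

```python
import numpy as np

def fgsm(x, grad_x_loss, epsilon=0.1):
    # Step in the direction that increases the loss the fastest,
    # under an L-infinity constraint of size epsilon.
    return x + epsilon * np.sign(grad_x_loss)
```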

Generative Adversarial Networks (GANs)

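A GAN trains a generator G against a discriminator D: D learns to distinguish real examples x from generated samples G(z), while G learns to fool D. They play the standard minimax game:

min_G max_D  E_{x ~ p_data}[log D(x)] + E_{z ~ p_z}[log(1 - D(G(z)))]

In practice G is often trained to maximize log D(G(z)) instead, which gives stronger gradients early in training.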

Convolutional Neural Networks (CNNs)

Motivation:

An NN architecture suited to high-dimensional inputs and to prediction tasks that benefit from spatial pattern recognition

Why?

● Reduced number of parameters due to parameter sharing and sparsity of connections

● The convolution operation as a way of detecting visual features

Convolutional Neural Networks (CNNs)

Basic building blocks:

● Convolution Layers (CONV)

● Pooling Layers (POOL)

● Fully connected Layers (FC)

Convolutional Neural Networks (CNNs)

Convolution Layers:

● 2D (no depth):
  # learnable parameters = (f * f (+1 for bias)) * nf
  Output shape = (n-f+1) * (n-f+1) * nf

● 3D (e.g. RGB channels):
  # learnable parameters = (f * f * nc (+1 for bias)) * nf
  Output shape = (n-f+1) * (n-f+1) * nf

Hyperparameters: Stride (s), Padding (p), Filter size (f)
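A quick worked example of the 3D formulas with assumed sizes:

```python
# Worked example with assumed sizes: n = 6, f = 3, nc = 3, nf = 2
# (valid convolution: s = 1, p = 0).
n, f, nc, nf = 6, 3, 3, 2
params = (f * f * nc + 1) * nf       # +1 bias per filter -> 56
out = n - f + 1                      # -> 4
print(params, (out, out, nf))        # 56 (4, 4, 2)
```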

Convolutional Neural Networks (CNNs)

Padding and Strided Convolution Layers:

● "Valid" convolutions ⇔ no padding: Output shape = (n-f+1) * (n-f+1) * nf

● "Same" convolutions ⇔ output shape matches input shape: p = (f-1)/2

● Output shape with padding (stride 1) = (n+2p-f+1) * (n+2p-f+1) * nf

● General formula for output shape: ⌊(n+2p-f)/s + 1⌋ * ⌊(n+2p-f)/s + 1⌋ * nf
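A worked example of the general formula with assumed sizes:

```python
import math

# Assumed sizes: n = 7, f = 3, p = 1, s = 2, nf = 4.
n, f, p, s, nf = 7, 3, 1, 2, 4
out = math.floor((n + 2 * p - f) / s) + 1
print((out, out, nf))                # (4, 4, 4)
```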

Convolutional Neural Networks (CNNs)

Pooling Layers:

● General formula for output shape: ⌊(n-f)/s + 1⌋ * ⌊(n-f)/s + 1⌋ * nc (pooling is applied per channel, so nc is unchanged)

Hyperparameters: Stride (s), Filter size (f)
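A worked example of the pooling formula with assumed sizes:

```python
import math

# Assumed sizes: n = 8, f = 2, s = 2, nc = 16.
n, f, s, nc = 8, 2, 2, 16
out = math.floor((n - f) / s) + 1
print((out, out, nc))                # (4, 4, 16)
```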
