Transcript of CS60010: Deep Learning, Lecture 5 (sudeshna/courses/DL18/lec5.pdf)

Page 1:

CS60010: Deep Learning

Sudeshna Sarkar Spring 2018

16 Jan 2018

Page 2:

FFN

• Goal: approximate some unknown ideal function f : X → Y

• Ideal classifier: y = f*(x), mapping each input x to its category y

• Feedforward network: define a parametric mapping y = f(x; θ)

• Learn the parameters θ to get a good approximation to f* from the available samples

• The function f is a composition of many different functions (the layers)

Page 3:

Gradient Descent

[Figure: loss surface over weights W_1 and W_2, with an arrow from the original W pointing along the negative gradient direction]
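A minimal NumPy sketch of the update rule; the function names and the toy quadratic loss are illustrative, not from the slides:

```python
import numpy as np

def gradient_descent(grad_fn, w, step_size=0.1, num_steps=100):
    """Repeatedly step against the gradient of the loss."""
    for _ in range(num_steps):
        w = w - step_size * grad_fn(w)  # move in the negative gradient direction
    return w

# Toy quadratic loss L(w) = ||w||^2, whose gradient is 2w; the minimum is at 0.
w = gradient_descent(lambda w: 2 * w, w=np.array([4.0, -3.0]))
print(w)  # close to [0, 0]
```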

Page 4:

Slides based on cs231n by Fei-Fei Li & Andrej Karpathy & Justin Johnson

The effects of step size (or "learning rate")

Page 5:

Stochastic Gradient Descent

[Figure: loss surface over weights W_1 and W_2, starting from the original W; true gradients in blue, minibatch gradients in red]

Gradients are noisy but still make good progress on average
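A minimal minibatch SGD sketch; all names and the toy objective are illustrative assumptions, not the course's code:

```python
import numpy as np

def sgd(grad_fn, w, data, step_size=0.05, batch_size=32, num_epochs=20):
    """Minibatch SGD: each step uses a noisy gradient estimate from a random batch."""
    n = len(data)
    for _ in range(num_epochs):
        order = np.random.permutation(n)  # reshuffle the data each epoch
        for start in range(0, n, batch_size):
            batch = data[order[start:start + batch_size]]
            w = w - step_size * grad_fn(w, batch)  # noisy, but unbiased on average
    return w

# Toy objective: minimize E[(w - x)^2]; the optimum is the data mean.
data = 5.0 + np.random.randn(1000)
print(sgd(lambda w, b: 2 * np.mean(w - b), w=0.0, data=data))  # close to 5.0
```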

Page 6:

Cost functions:

• In most cases, our parametric model defines a distribution p(y | x; θ)

• Use the principle of maximum likelihood

• The cost function is then often the negative log-likelihood, equivalently described as the cross-entropy between the training data and the model distribution:

J(θ) = −E_{x,y∼p̂_data}[ log p_model(y | x) ]
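A sketch of this cost on a toy batch, assuming the model already outputs class probabilities; the arrays here are made up for illustration:

```python
import numpy as np

def nll_cost(probs, targets):
    """J = -mean log p_model(y|x): the negative log-likelihood of the
    observed labels under the model's predicted class probabilities."""
    return -np.mean(np.log(probs[np.arange(len(targets)), targets]))

# Two examples, three classes; rows are predicted distributions p_model(y|x).
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
targets = np.array([0, 1])
print(nll_cost(probs, targets))  # -(log 0.7 + log 0.8) / 2, about 0.29
```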

Page 7:

Conditional Distributions and Cross-Entropy

J(θ) = −E_{x,y∼p̂_data}[ log p_model(y | x) ]

• The specific form of the cost function changes from model to model, depending on the specific form of log p_model

• For example, if p_model(y | x) = N(y; f(x; θ), I), then we recover the mean squared error cost

J(θ) = (1/2) E_{x,y∼p̂_data} ||y − f(x; θ)||² + const

• For predicting the mean of a Gaussian, this equivalence between maximum likelihood estimation and minimization of mean squared error holds regardless of the function f(x; θ) used to predict that mean

• Specifying a model p(y | x) automatically determines the cost function −log p(y | x)
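A quick numerical check of this reduction; the gaussian_nll helper and the numbers are illustrative, not from the slides:

```python
import numpy as np

# Per-example negative log-likelihood of y under N(y; f(x), I):
#   0.5 * ||y - f(x)||^2 + (d/2) * log(2*pi)
# i.e. the squared error plus a constant that does not depend on theta.
def gaussian_nll(y, f_x):
    d = y.shape[-1]
    return 0.5 * np.sum((y - f_x) ** 2) + 0.5 * d * np.log(2 * np.pi)

y = np.array([1.0, 2.0])
f_x = np.array([0.5, 2.5])
print(gaussian_nll(y, f_x) - 0.5 * np.sum((y - f_x) ** 2))  # (d/2)*log(2*pi), ~1.84
```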

Page 8:

• The gradient of the cost function must be large and predictable enough to serve as a good guide for the learning algorithm; cost functions that flatten out (saturate) provide little guidance.

• The negative log-likelihood helps to avoid this problem for many models: the log undoes the exponentials that make output units saturate.

Page 9:

Learning Conditional Statistics

• Sometimes we merely want to predict some statistic of y conditioned on x, using a specialized loss function

• For example, we may have a predictor f(x; θ) that we wish to employ to predict the mean of y

• With a sufficiently powerful neural network, we can think of the network as being able to represent any function f from a wide class of functions

• We can then view the cost function as a functional rather than just a function; a functional is a mapping from functions to real numbers

• We can thus think of learning as choosing a function rather than a set of parameters

• We can design the cost functional to have its minimum occur at some specific function we desire; for example, at the function that maps x to the expected value of y given x

Page 10:

Learning Conditional Statistics

• Results derived using calculus of variations:

1. Solving the optimization problem

f* = argmin_f E_{x,y∼p_data} ||y − f(x)||²

yields f*(x) = E_{y∼p_data(y|x)}[y], i.e. the mean of y for each x (the mean squared error criterion).

2. Solving

f* = argmin_f E_{x,y∼p_data} ||y − f(x)||_1

yields a function that predicts the median value of y for each x; this is the mean absolute error (MAE) criterion (see the numerical check below).

• MSE and MAE often lead to poor results when used with gradient-based optimization: output units that saturate produce very small gradients when combined with these cost functions

• Thus cross-entropy is popular even when it is not necessary to estimate the distribution p(y | x)
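A small numerical check of both results over a toy sample, restricting attention to constant predictors; the values are illustrative:

```python
import numpy as np

# Over a grid of constant predictions f, the MSE is minimized at the sample
# mean and the MAE at the sample median, matching the two results above.
y = np.array([1.0, 2.0, 4.0, 10.0, 13.0])   # mean 6.0, median 4.0
f = np.linspace(0.0, 15.0, 1501)            # candidate constant predictions
mse = ((y[None, :] - f[:, None]) ** 2).mean(axis=1)
mae = np.abs(y[None, :] - f[:, None]).mean(axis=1)
print(f[mse.argmin()], np.mean(y))    # 6.0 6.0
print(f[mae.argmin()], np.median(y))  # 4.0 4.0
```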

Page 11:

Output Units

1. Linear units for Gaussian output distributions. Linear output layers are often used to produce the mean of a conditional Gaussian distribution p(y | x) = N(y; ŷ, I).

   • Maximizing the log-likelihood is then equivalent to minimizing the mean squared error

   • Because linear units do not saturate, they pose little difficulty for gradient-based optimization algorithms

2. Sigmoid units for Bernoulli output distributions. 2-class classification problem; needs to predict P(y = 1 | x):

ŷ = σ(wᵀh + b)

3. Softmax units for multinoulli output distributions. Any time we wish to represent a probability distribution over a discrete variable with n possible values, we may use the softmax function. Softmax functions are most often used as the output of a classifier, to represent the probability distribution over n different classes. (A sketch of these output units follows.)
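A NumPy sketch of the sigmoid and softmax output units applied to hidden features; h, w, W, b, c are hypothetical learned quantities:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(z):
    e = np.exp(z - z.max())  # shift by the max for numerical stability
    return e / e.sum()

h = np.array([0.5, -1.2, 2.0])              # hidden-layer activations
w, b = np.array([0.3, -0.5, 0.8]), 0.1      # Bernoulli output parameters
print(sigmoid(w @ h + b))                   # y_hat = P(y = 1 | x)

W = np.random.randn(3, 4)                   # multinoulli output parameters
c = np.zeros(4)
print(softmax(W.T @ h + c))                 # probabilities over 4 classes
```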

Page 12:

Softmax output

• In case of a discrete variable with k values, produce a vector ŷ with ŷ_i = P(y = i | x)

• A linear layer predicts unnormalized log probabilities: z = Wᵀh + b, where z_i = log P̃(y = i | x)

softmax(z)_i = exp(z_i) / Σ_j exp(z_j)

• When training the softmax to output a target class i using maximum log-likelihood, maximize

log P(y = i; z) = log softmax(z)_i = z_i − log Σ_j exp(z_j)
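This identity can be computed stably with the standard max-shift (log-sum-exp) trick; the function name below is mine, not from the slides:

```python
import numpy as np

def log_softmax(z):
    """log softmax(z)_i = z_i - log sum_j exp(z_j), computed with the max
    subtracted first so that no exp can overflow."""
    m = z.max()
    return z - (m + np.log(np.sum(np.exp(z - m))))

z = np.array([1000.0, 1001.0, 1002.0])  # naive exp(z) would overflow
print(log_softmax(z))                   # finite: about [-2.41, -1.41, -0.41]
# Training then maximizes log_softmax(z)[y] for the correct class y.
```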

Page 13:

Output Types

Output Type   Output Distribution    Output Layer      Cost Function
-----------   --------------------   ---------------   ----------------------------
Binary        Bernoulli              Sigmoid           Binary cross-entropy
Discrete      Multinoulli            Softmax           Discrete cross-entropy
Continuous    Gaussian               Linear            Gaussian cross-entropy (MSE)
Continuous    Mixture of Gaussians   Mixture density   Cross-entropy
Continuous    Arbitrary              GAN, VAE, FVBN    Various

Page 14:

Sigmoid output with target of 1

[Figure: σ(z), the cross-entropy loss, and the MSE loss plotted against z ∈ [−3, 3] for target 1; the MSE loss flattens where the sigmoid saturates]

Bad idea to use MSE loss with sigmoid unit.
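A quick check of why, comparing the gradients of the two losses with respect to z for target y = 1; the derivatives in the comments are standard calculus, and the sample z values are illustrative:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# For target y = 1:
#   cross-entropy  L = -log sigma(z)      =>  dL/dz = sigma(z) - 1
#   MSE            L = (sigma(z) - 1)^2   =>  dL/dz = 2*(sigma(z)-1)*sigma(z)*(1-sigma(z))
for z in [-10.0, -2.0, 0.0, 2.0]:
    s = sigmoid(z)
    print(f"z={z:5.1f}  CE grad={s - 1:9.5f}  MSE grad={2*(s-1)*s*(1-s):9.5f}")
# At z = -10 the CE gradient is ~-1 but the MSE gradient is ~-9e-5:
# a confidently wrong sigmoid unit barely learns under MSE.
```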

Page 15:

Page 16:

Page 17:

Benefits

Page 18:

Hidden Units

• Rectified linear units (ReLUs) are an excellent default choice of hidden unit: use ReLUs ~90% of the time

• Many other hidden units perform comparably to ReLUs; new hidden units that merely perform comparably are rarely interesting

Page 19:

Rectified Linear Activation

[Figure: graph of g(z) = max{0, z}, which is zero for z < 0 and the identity for z ≥ 0]

Page 20:

ReLU

• Positives:
  • Gives large and consistent gradients when active (does not saturate)
  • Efficient to optimize; converges much faster than sigmoid or tanh

• Negatives:
  • Output is not zero-centered
  • Units can "die": a unit that is never active receives no gradient and will never update (see the sketch below)
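A sketch of ReLU, its gradient, and the leaky-ReLU variant often used to mitigate dying units; alpha = 0.01 is a conventional choice, not one prescribed by the slides:

```python
import numpy as np

def relu(z):
    """g(z) = max{0, z}."""
    return np.maximum(0.0, z)

def relu_grad(z):
    """Slope 1 when active, exactly 0 when inactive: a unit whose
    pre-activation is negative for every input receives no gradient."""
    return (z > 0).astype(float)

def leaky_relu(z, alpha=0.01):
    """Keeps a small slope for z < 0 so inactive units still update."""
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))        # [0.    0.     0.  0.5  2.]
print(relu_grad(z))   # [0.    0.     0.  1.   1.]
print(leaky_relu(z))  # [-0.02 -0.005 0.  0.5  2.]
```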

Page 21:

Architecture Basics

• Depth
• Width

Page 22:

Universal Approximator Theorem

• One hidden layer is enough to represent (not learn) an approximation of any function to an arbitrary degree of accuracy

• So why deeper?
  • A shallow net may need (exponentially) more width
  • A shallow net may overfit more

• Further reading (a small demo follows the links):
  • http://mcneela.github.io/machine_learning/2017/03/21/Universal-Approximation-Theorem.html
  • https://blog.goodaudience.com/neural-networks-part-1-a-simple-proof-of-the-universal-approximation-theorem-b7864964dbd3
  • http://neuralnetworksanddeeplearning.com/chap4.html
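A small illustration (not a proof) of the representation claim: one wide hidden layer of fixed random tanh units plus a least-squares linear readout can closely fit a smooth target. All constants and names here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-np.pi, np.pi, 200)[:, None]
y = np.sin(3 * x).ravel()                   # smooth target function

width = 200                                 # one wide hidden layer
W = rng.normal(scale=3.0, size=(1, width))  # fixed random input weights
b = rng.uniform(-np.pi, np.pi, size=width)
H = np.tanh(x @ W + b)                      # hidden activations, shape (200, width)

v, *_ = np.linalg.lstsq(H, y, rcond=None)   # fit only the linear readout
print(np.max(np.abs(H @ v - y)))            # small max error over the grid
```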