Deep Learning Basics Lecture 1: Feedforward
Princeton University COS 495
Instructor: Yingyu Liang
Motivation I: representation learning
Machine learning 1-2-3
• Collect data and extract features
• Build model: choose hypothesis class ℋ and loss function l
• Optimization: minimize the empirical loss
Features
Example: color histogram over the red, green, and blue channels
Pipeline: x → extract features φ(x) → build hypothesis y = w^T φ(x)
Features: part of the model
Build hypothesis: y = w^T φ(x)
Linear model
Nonlinear model
Example: Polynomial kernel SVM
[Figure: data in the input space (x_1, x_2)]
y = sign(w^T φ(x) + b)
Fixed φ(x)
Motivation: representation learning
• Why don't we also learn φ(x)?
Learn φ(x): x ↦ φ(x), then y = w^T φ(x) with learned w
Feedforward networks
• View each dimension of φ(x) as something to be learned
[Diagram: x → φ(x) → y = w^T φ(x)]
Feedforward networks
• Linear functions φ_i(x) = θ_i^T x don't work: need some nonlinearity
Feedforward networks
• Typically, set φ_i(x) = r(θ_i^T x), where r(⋅) is some nonlinear function
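A minimal NumPy sketch of this construction (not from the slides): a learned feature map φ_i(x) = r(θ_i^T x), with the θ_i stacked into a matrix Θ, followed by the linear hypothesis y = w^T φ(x). The dimensions, random parameters, and the choice of tanh for r(⋅) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

d, k = 4, 3                        # input dimension, number of learned features (illustrative)
Theta = rng.normal(size=(d, k))    # column theta_i defines feature phi_i
w = rng.normal(size=k)

def phi(x, r=np.tanh):
    """Learned feature map: phi_i(x) = r(theta_i^T x)."""
    return r(Theta.T @ x)

def predict(x):
    """Hypothesis y = w^T phi(x)."""
    return w @ phi(x)

x = rng.normal(size=d)
print(predict(x))
```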
Feedforward deep networks
• What if we go deeper?
[Diagram: x → h^1 → h^2 → … → h^L → y]
Figure from Deep Learning, by Goodfellow, Bengio, Courville. Dark boxes are things to be learned.
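A sketch of the deeper version, assuming fully connected layers h^l = r(W_l^T h^(l-1) + b_l) with a tanh nonlinearity on the hidden layers and a linear output; the layer widths and random parameters are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [4, 5, 5, 1]          # x -> h^1 -> h^2 -> y (illustrative widths)
params = [(rng.normal(size=(m, n)), rng.normal(size=n))
          for m, n in zip(sizes[:-1], sizes[1:])]

def forward(x):
    h = x
    for i, (W, b) in enumerate(params):
        z = W.T @ h + b
        # nonlinearity on hidden layers, linear units on the output layer
        h = np.tanh(z) if i < len(params) - 1 else z
    return h

print(forward(rng.normal(size=sizes[0])))
```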
Motivation II: neurons
Motivation: neurons
Figure from Wikipedia
Motivation: abstract neuron model
• Neuron activated when the correlation between the input and a pattern θ exceeds some threshold b
• y = threshold(θ^T x - b), or y = r(θ^T x - b)
• r(⋅) is called the activation function
[Diagram: inputs x_1, x_2, …, x_d feed into a single neuron with output y]
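A small sketch of this abstract neuron model; the pattern θ, the threshold b, and the input are invented values, and the default activation is the hard threshold from the slide.

```python
import numpy as np

def neuron(x, theta, b, r=lambda z: (z >= 0).astype(float)):
    """Abstract neuron: fires when the correlation theta^T x exceeds the threshold b."""
    return r(theta @ x - b)

x = np.array([0.2, 0.9, -0.1])
theta = np.array([1.0, 1.0, 1.0])   # pattern to match (illustrative)
b = 0.5                             # threshold (illustrative)
print(neuron(x, theta, b))          # 1.0: correlation 1.0 exceeds 0.5
```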
Motivation: artificial neural networks
• Put into layers: feedforward deep networks
[Diagram: x → h^1 → h^2 → … → h^L → y]
Components in Feedforward networks
Components
• Representations:
  • Input
  • Hidden variables
• Layers/weights:
  • Hidden layers
  • Output layer
Components
[Diagram: input x → first layer → hidden variables h^1, h^2, …, h^L → output layer → y]
Input
• Represented as a vector
• Sometimes requires some preprocessing, e.g.:
  • Subtract the mean
  • Normalize to [-1, 1]
Output layers
• Regression: y = w^T h + b
• Linear units: no nonlinearity
[Diagram: hidden layer h → output layer → y]
Output layers
• Multi-dimensional regression: y = W^T h + b
• Linear units: no nonlinearity
[Diagram: hidden layer h → output layer → y]
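A brief NumPy sketch of the two linear output layers above (scalar and multi-dimensional regression); the sizes and parameter values are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.normal(size=5)                 # last hidden layer (illustrative size)

# scalar regression: y = w^T h + b
w, b = rng.normal(size=5), 0.1
y_scalar = w @ h + b

# multi-dimensional regression: y = W^T h + b
W, b_vec = rng.normal(size=(5, 3)), rng.normal(size=3)
y_vector = W.T @ h + b_vec

print(y_scalar, y_vector)
```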
Output layers
• Binary classification: y = σ(w^T h + b)
• Corresponds to using logistic regression on h
[Diagram: hidden layer h → output layer → y]
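A sketch of the sigmoid output unit for binary classification, with invented values for h, w, and b; thresholding the probability at 0.5 to get a label is a common convention rather than something stated on the slide.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
h = rng.normal(size=5)
w, b = rng.normal(size=5), 0.0
p = sigmoid(w @ h + b)      # probability of the positive class
print(p, int(p >= 0.5))     # predicted label by thresholding at 0.5
```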
Output layers
• Multi-class classification: y = softmax(z), where z = W^T h + b
• Corresponds to using multi-class logistic regression on h
[Diagram: hidden layer h → linear layer z → softmax → y]
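A sketch of the softmax output layer; the number of classes and the parameters are illustrative, and subtracting the maximum before exponentiating is a standard numerical-stability trick rather than part of the slide's definition.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
h = rng.normal(size=5)
W, b = rng.normal(size=(5, 3)), np.zeros(3)   # 3 classes (illustrative)
z = W.T @ h + b
y = softmax(z)
print(y, y.sum(), y.argmax())        # probabilities sum to 1; argmax gives the predicted class
```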
Hidden layers
• Each neuron takes a weighted linear combination of the previous layer
• So it can be thought of as outputting one value to the next layer
[Diagram: layer h^i → layer h^(i+1)]
Hidden layers
• y = r(w^T x + b)
• Typical activation functions r:
  • Threshold: t(z) = 1[z ≥ 0]
  • Sigmoid: σ(z) = 1 / (1 + exp(-z))
  • Tanh: tanh(z) = 2σ(2z) - 1
[Diagram: x → r(⋅) → y]
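A short sketch of the three activation functions listed above, including a numerical check of the identity tanh(z) = 2σ(2z) - 1.

```python
import numpy as np

def threshold(z):
    return (z >= 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-3, 3, 7)
print(threshold(z))
print(sigmoid(z))
print(np.allclose(np.tanh(z), 2 * sigmoid(2 * z) - 1))   # tanh(z) = 2*sigma(2z) - 1
```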
Hidden layers
• Problem: saturation (the gradient becomes too small)
[Figure: x → r(⋅) → y; borrowed from Pattern Recognition and Machine Learning, Bishop]
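A small numerical illustration of saturation, assuming the sigmoid activation: its derivative σ(z)(1 - σ(z)) shrinks toward zero as |z| grows, which is the "too small gradient" problem.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

for z in [0.0, 2.0, 5.0, 10.0]:
    print(z, sigmoid_grad(z))   # gradient shrinks toward 0 as |z| grows
```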
Hidden layers
• Activation function ReLU (rectified linear unit): ReLU(z) = max{z, 0}
Figure from Deep Learning, by Goodfellow, Bengio, Courville.
Hidden layers
• Activation function ReLU (rectified linear unit): ReLU(z) = max{z, 0}
• Gradient is 0 for z < 0 and 1 for z > 0
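A sketch of ReLU and its (sub)gradient, matching the 0/1 gradient regions noted above; treating the gradient at z = 0 as 0 is just one common convention.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_grad(z):
    # subgradient: 0 for z < 0, 1 for z > 0 (the value at z = 0 is a convention)
    return (z > 0).astype(float)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))       # [0. 0. 0. 0.5 2.]
print(relu_grad(z))  # [0. 0. 0. 1. 1.]
```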
Hidden layers
• Generalizations of ReLU: gReLU(z) = max{z, 0} + α·min{z, 0}
  • Leaky-ReLU(z) = max{z, 0} + 0.01·min{z, 0}
  • Parametric-ReLU(z): α learnable
[Figure: plot of gReLU(z) versus z]
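A sketch of the generalized ReLU family defined above; the test values and the α = 0.2 used to stand in for a learned parametric-ReLU slope are illustrative.

```python
import numpy as np

def generalized_relu(z, alpha):
    """gReLU(z) = max{z, 0} + alpha * min{z, 0}."""
    return np.maximum(z, 0.0) + alpha * np.minimum(z, 0.0)

def leaky_relu(z):
    return generalized_relu(z, alpha=0.01)

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(leaky_relu(z))                    # [-0.02 -0.005 0. 1.5]
print(generalized_relu(z, alpha=0.2))   # parametric ReLU: alpha would be learned
```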