
Introduction to Machine Learning

Introduction to Machine Learning Amo G. Tong 1

Lecture: Deep Learning

• An Introduction

• Deep Learning, Ian Goodfellow, Yoshua Bengio and Aaron Courville

• Some materials are courtesy of .

• All pictures belong to their creators.

Introduction to Machine Learning Amo G. Tong 2


Deep Learning

• What are deep learning methods?

• Using a complex neural network to approximate the function we want to learn.

Story: the ImageNet object recognition contest.

• LeNet-5 (1998): a 7-level convolutional network.

• AlexNet (2012): more layers and filters; trained for 6 days on two GPUs. Error rate: 15.3% (reduced from 26.2%).

• ResNet (2015): 152 layers with residual connections. Error rate: 3.57%.

https://medium.com/analytics-vidhya/cnns-architectures-lenet-alexnet-vgg-googlenet-resnet-and-more-666091488df5

Introduction to Machine Learning Amo G. Tong 7

Deep Learning

• What are deep learning methods?

• Using a complex neural network to approximate the function we want to learn.

• Why do we need deep learning?

• We need to solve complex real-world problems.

• A standard way to parameterize functions.

• Layer by layer.

• Flexible: can be customized for different applications.

• Image processing

• Sequence data

• Standard training methods.

• Backpropagation.

Introduction to Machine Learning Amo G. Tong 8

Deep Learning


• Why now?

• Availability of data

• Hardware

• GPUs,…

• Software

• TensorFlow, PyTorch, …

Introduction to Machine Learning Amo G. Tong 9

Outline

• General Deep Feedforward Network.

• Convolutional Neural Network (CNN)

• Image processing.

• Recurrent Neural Network (RNN)

• Sequence data processing.

Introduction to Machine Learning Amo G. Tong 10

Deep Feedforward Network

[Figure: a feedforward network: input layer → hidden layers → output layer.]

Deep Feedforward Network

Training, Design, and Regularization

Introduction to Machine Learning Amo G. Tong 11

Deep Feedforward Network:

• Feedforward Network (Multi-Layer Perceptron (MLP))

• More Layers.

• Training

• Design

• Regularization


Introduction to Machine Learning Amo G. Tong 12


Deep Feedforward Network: Training

• Cost Function: 𝐽(𝜃)

• Common choice: cross-entropy

• Cross-entropy measures the difference between two distributions P and Q:

• H(P, Q) = −E_{x∼P}[log Q(x)]

• D(P‖Q) = Σ_x P(x) log (P(x) / Q(x))

• J(θ) = −E_{(x,y)∼D_train}[log P_model(y|x)] (negative log-likelihood)

• P_model(y|x) changes from model to model.

• Least-squares error: assume a Gaussian output distribution and use MLE.

Introduction to Machine Learning Amo G. Tong 14
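As a concrete illustration (not from the slides), here is a minimal numpy sketch of the negative log-likelihood / cross-entropy cost above; the probability matrix p_model and label vector y are hypothetical placeholders.

```python
import numpy as np

def cross_entropy(p_model, y):
    """Average negative log-likelihood -E_{(x,y)~D}[log P_model(y|x)].

    p_model: array of shape (m, n_classes), predicted probabilities per example.
    y:       array of shape (m,), integer class labels.
    """
    m = p_model.shape[0]
    # Pick out P_model(y|x) for the observed label of each example.
    likelihoods = p_model[np.arange(m), y]
    return -np.mean(np.log(likelihoods + 1e-12))  # small epsilon for stability

# Toy check: confident, correct predictions give a small loss.
p = np.array([[0.9, 0.1], [0.2, 0.8]])
y = np.array([0, 1])
print(cross_entropy(p, y))  # about 0.16
```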


Deep Feedforward Network: Training

• Gradient-Based Training

• Non-convex because we have nonlinear units.

• Iteratively decrease the cost function; the result is not guaranteed to be globally optimal, or even locally optimal.

• Initializations with small random numbers.

• Backpropagation.

• One issue: the gradient must be large and predictable.

Stochastic gradient descent (SGD) algorithm:
While stopping criterion not met:
    Sample a minibatch of m examples D_m
    Compute the gradient g ← Σ_{x∈D_m} ∇_θ J(θ, x)
    Update θ ← θ − εg

Introduction to Machine Learning Amo G. Tong 16
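A minimal Python sketch of the SGD box above; the gradient function grad_J, the data array, and the hyperparameters are assumed placeholders, not anything prescribed by the slides.

```python
import numpy as np

def sgd(theta, grad_J, data, lr=0.01, batch_size=32, n_iters=1000, rng=None):
    """Minibatch SGD sketch: grad_J(theta, batch) is assumed to return the
    gradient of the cost on a minibatch; data holds one training example per row."""
    rng = rng or np.random.default_rng(0)
    for _ in range(n_iters):                                          # while stopping criterion not met
        idx = rng.choice(len(data), size=batch_size, replace=False)   # sample a minibatch D_m
        g = grad_J(theta, data[idx])                                  # gradient on the minibatch
        theta = theta - lr * g                                        # theta <- theta - eps * g
    return theta
```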

Deep Feedforward Network: Design

• Output units

• A real vector (regression problem): model Pr[y|x].

• Affine transformation: y_out = W^T h + b

• J(θ) = E_{(x,y)∼D_train} ‖y − y_out‖²

• Binary classification

• Sigmoid σ: y_out = σ(W^T h + b)

• J(θ) = −E_{(x,y)∼D_train} log y_out(x)

• Multinoulli (n-class classification)

• Softmax: z = W^T h + b = (z_1, …, z_n)

• softmax(z)_i = exp(z_i) / Σ_j exp(z_j)

Note: the gradient must be large and predictable. The log can help.

Introduction to Machine Learning Amo G. Tong 17
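The three output units above could look roughly as follows in numpy; this is an illustrative sketch, with h, W, w, and b standing for the last hidden activation and the output parameters.

```python
import numpy as np

def linear_output(h, W, b):       # regression: y_out = W^T h + b
    return W.T @ h + b

def sigmoid_output(h, w, b):      # binary classification: y_out = sigma(w^T h + b)
    z = w @ h + b
    return 1.0 / (1.0 + np.exp(-z))

def softmax_output(h, W, b):      # n-way classification: softmax(W^T h + b)
    z = W.T @ h + b
    z = z - z.max()               # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()
```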


Deep Feedforward Network: Design

• Hidden Units

• Rectified linear units (ReLU).

• g(z) = max{0, z}

• h = g(W^T x + b)

• Good point: large gradient when active; easy to train.

• Bad point: not learnable when inactive.

• Generalizations: g(z) = max(0, z) + α · min(0, z)

• Absolute value rectification: α = −1.

• Leaky ReLU: fixed small α (e.g., 0.01).

• Parametric ReLU (PReLU): learnable 𝛼.

• Maxout units: group and take the max.

• Traditional units:

• Sigmoid (consider tanh(z))

• Perceptron


Introduction to Machine Learning Amo G. Tong 20
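A small numpy sketch of the ReLU family above; the test values are arbitrary.

```python
import numpy as np

def relu(z):                              # g(z) = max{0, z}
    return np.maximum(0.0, z)

def generalized_relu(z, alpha):           # g(z) = max(0, z) + alpha * min(0, z)
    return np.maximum(0.0, z) + alpha * np.minimum(0.0, z)

abs_rectifier = lambda z: generalized_relu(z, -1.0)   # absolute value rectification
leaky_relu    = lambda z: generalized_relu(z, 0.01)   # fixed small alpha

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z), leaky_relu(z), abs_rectifier(z))
```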



Deep Feedforward Network: Design

• Architecture Design

• The layer structure.

• h^(1) = g^(1)((W^(1))^T x + b^(1))

• h^(2) = g^(2)((W^(2))^T h^(1) + b^(2))

• …..

• How many layers? How many units in each layer?

• Theoretically, the universal approximation theorem: one hidden layer of sigmoid units can approximate any Borel-measurable function, given enough hidden units.

• But it is not guaranteed that such a function can be learned.

• Practically, (a) follow classic models: CNN, RNN, …; (b) start from a few (2 or 3) layers with a few (16, 32, or 64) hidden units and watch the validation error.

• Greater depth may not be better.

• Other considerations: local connected layers, skip layers, recurrent layers, …


Introduction to Machine Learning Amo G. Tong 25
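A sketch of the layer structure h^(k) = g^(k)((W^(k))^T h^(k−1) + b^(k)); the layer sizes (4, 16, 1), the tanh hidden activation, and the linear output are arbitrary illustrative choices.

```python
import numpy as np

def mlp_forward(x, params, g=np.tanh):
    """Forward pass of a small feedforward network.

    params: list of (W, b) pairs, one per layer; hidden layers use g, the last
    layer is linear (e.g., for regression)."""
    h = x
    for W, b in params[:-1]:
        h = g(W.T @ h + b)        # hidden layers
    W, b = params[-1]
    return W.T @ h + b            # output layer

# A 2-layer network: 4 inputs -> 16 hidden units -> 1 output.
rng = np.random.default_rng(0)
params = [(rng.normal(0, 0.1, (4, 16)), np.zeros(16)),
          (rng.normal(0, 0.1, (16, 1)), np.zeros(1))]
print(mlp_forward(rng.normal(size=4), params))
```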

Deep Feedforward Network: Regularization

• Regularization: methods for reducing over-fitting.

• Over-fitting is a serious problem in deep learning.

• Method 1: Norm Penalties:

• J_new(θ) = J(θ) + α · Ω(θ)

• L2 regularization (weight decay): Ω(θ) = (1/2) ‖θ‖²

• L1 regularization: Ω(θ) = Σ_i |θ_i|

• What is the difference between them?

• Typically, we put a penalty on the weights W but not on the biases.


Introduction to Machine Learning Amo G. Tong 26
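A sketch of how the two penalties enter the gradient update; following the note above, only the weights (not the biases) are assumed to be penalized.

```python
import numpy as np

def penalized_gradient(grad_W, W, alpha, penalty="l2"):
    """Gradient of J_new = J + alpha * Omega(W) for the two penalties above."""
    if penalty == "l2":                    # Omega = (1/2)||W||^2, so dOmega/dW = W
        return grad_W + alpha * W
    if penalty == "l1":                    # Omega = sum_i |W_i|, so dOmega/dW = sign(W)
        return grad_W + alpha * np.sign(W)
    raise ValueError(penalty)

# In the SGD update: W <- W - eps * penalized_gradient(g, W, alpha)
```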

Deep Feedforward Network: Regularization

• Regularization: methods for reducing over-fitting.

• Over-fitting is a serious problem in deep learning.

• Method 2: Norm Penalties as Constrained Optimization:

• J_new(θ, α) = J(θ) + α · (Ω(θ) − k)

• α increases when Ω(θ) > k; α decreases when Ω(θ) < k.

• Take α as another learnable parameter.

• Select k manually.

• Need to control α manually.


Introduction to Machine Learning Amo G. Tong 27

Deep Feedforward Network: Regularization

• Regularization: methods for reducing over-fitting.

• Over-fitting is a serious problem in deep learning.

• Method 3: Explicit Constraints:

• J_new(θ) = J(θ)

• Whenever Ω(θ) > k, project θ to the nearest point in the region Ω(θ) ≤ k.

• After each iteration, check whether the new parameters lie in the region Ω(θ) ≤ k.


Introduction to Machine Learning Amo G. Tong 28
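A sketch of the projection step, assuming Ω(θ) is the L2 norm so that the projection is a simple rescaling; other choices of Ω would need their own projection.

```python
import numpy as np

def project_l2_ball(theta, k):
    """If ||theta||_2 exceeds k, project theta back onto the ball ||theta||_2 <= k."""
    norm = np.linalg.norm(theta)
    if norm > k:
        theta = theta * (k / norm)   # nearest point of the ball in Euclidean distance
    return theta

# After each SGD step: theta <- project_l2_ball(theta - lr * g, k)
```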

Deep Feedforward Network: Regularization

• Regularization: methods for reducing over-fitting.

• Over-fitting is a serious problem in deep learning.

• Method 4: Data Augmentation

• How does over-fitting happen?

• High representability (model capacity) and insufficient data.

• Creating fake but useful training data is possible.

• The target classifier is invariant to some transformations.

• E.g. object detection.

[Figure: a network diagram and two transformed copies of a handwritten “9”, one labeled Good and one labeled Bad.]

Introduction to Machine Learning Amo G. Tong 29
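A toy augmentation sketch for images stored as 2D arrays; flips and small shifts are only examples, and whether a transformation preserves the label depends on the task (as the Good/Bad “9” figure suggests).

```python
import numpy as np

def augment(image, rng):
    """Return a transformed copy of `image` (H x W array) that should keep its label."""
    out = image
    if rng.random() < 0.5:
        out = out[:, ::-1]               # horizontal flip (good for many objects, bad for digits)
    shift = rng.integers(-2, 3)          # small horizontal shift
    out = np.roll(out, shift, axis=1)
    return out
```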

Deep Feedforward Network: Regularization

• Regularization: methods for reducing over-fitting.

• Over-fitting is a serious problem in deep learning.

• Method 5: Multi-Task Learning

• Two models (problems) share some parameters

• The hidden representations can be better learned if two tasks share certain factors.

[Figure: the input feeds a shared hidden representation (“if animal?”), which feeds two task-specific outputs: Output 1 (“if cat?”) and Output 2 (“if dog?”).]

Introduction to Machine Learning Amo G. Tong 30
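A sketch of a shared trunk with task-specific output heads; the ReLU shared layer and the dictionary of heads are illustrative choices, not part of the slide.

```python
import numpy as np

def multitask_forward(x, W_shared, b_shared, heads):
    """Shared hidden representation feeding several task-specific output heads.

    heads: dict mapping a task name to its own (W, b) output parameters."""
    h = np.maximum(0.0, W_shared.T @ x + b_shared)        # shared layer (ReLU)
    return {task: W.T @ h + b for task, (W, b) in heads.items()}

# Both a "cat?" head and a "dog?" head reuse the same shared features of the input.
```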

Deep Feedforward Network: Regularization

• Regularization: methods for reducing over-fitting.

• Over-fitting is a serious problem in deep learning.

• Method 6: Early Stopping

• We may want to stop early, before convergence.

• Validation set error.

• One possible method: stop if no better parameters are found in p (say 10) consecutive rounds of training.

• Simple to implement.

• Need some extra data.

• Use a second training run to fully utilize the data.

• Retraining: adopt the same number of training steps.

• Continued training: start from the obtained parameters.

while j < p:
    update θ for n steps
    if no smaller validation error:
        j = j + 1
    else:
        note the new parameters; j = 0

Introduction to Machine Learning Amo G. Tong 31
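The loop above written out as a runnable sketch; update_n_steps and validation_error are assumed callbacks supplied by the user (train for n steps and return the parameters; report the current validation-set error).

```python
def train_with_early_stopping(update_n_steps, validation_error, p=10, max_rounds=1000):
    """Early stopping: give up after p consecutive rounds without improvement."""
    best_err, best_params = float("inf"), None
    j, rounds = 0, 0
    while j < p and rounds < max_rounds:
        params = update_n_steps()            # update theta for n steps
        err = validation_error()
        if err < best_err:                   # smaller validation error found
            best_err, best_params, j = err, params, 0   # note the new parameters
        else:
            j += 1                           # no improvement this round
        rounds += 1
    return best_params, best_err
```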

Deep Feedforward Network: Regularization

• Regularization: methods for reducing over-fitting.

• Over-fitting is a serious problem in deep learning.

• Method 7: Dropout

• Bagging: train different models with different datasets.

• Bagging neural networks is not effective.

• Dropout:

• Idea: dynamically change the network structure.

• Mute units by setting their outputs to zero.

• Algorithm:

• Before each iteration, randomly mute units.

• Typically, drop an input unit with probability 0.2 and a hidden unit with probability 0.5.

• Pros: computationally cheap; independent of the training algorithm or model.

• Cons: the model size may need to be increased; does not work well for small data sets.

Introduction to Machine Learning Amo G. Tong 32
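A sketch of dropout masks applied during one forward pass; the “inverted dropout” rescaling by the keep probability is a common convention assumed here, not something specified on the slide.

```python
import numpy as np

def dropout_forward(x, params, rng, p_in=0.2, p_hidden=0.5):
    """One forward pass with dropout: units are muted by zeroing their outputs.

    params is a list of (W, b) layers; dividing by the keep probability means
    no extra rescaling is needed at test time (inverted dropout)."""
    keep = 1.0 - p_in
    h = x * (rng.random(x.shape) < keep) / keep           # drop input units
    for W, b in params[:-1]:
        h = np.maximum(0.0, W.T @ h + b)                  # hidden layer (ReLU)
        keep = 1.0 - p_hidden
        h = h * (rng.random(h.shape) < keep) / keep       # drop hidden units
    W, b = params[-1]
    return W.T @ h + b
```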

Deep Feedforward Network: Regularization

• Regularization: methods for reducing over-fitting.

• Over-fitting is a serious problem in deep learning.

• Method 7: Dropout

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15, 1929–1958.

Introduction to Machine Learning Amo G. Tong 33

Deep Feedforward Network: Regularization

• Regularization: methods for reducing over-fitting.

• Over-fitting is a serious problem in deep learning.

• Method 8: Adversarial training

• There can be inputs x_1 ≈ x_2 whose predictions are very different.

• Train on adversarially perturbed examples from the training set.

• Generative adversarial network (GAN)

Goodfellow, I. J., Shlens, J., and Szegedy, C. (2014). Explaining and harnessing adversarial examples. CoRR, abs/1412.6572.

Introduction to Machine Learning Amo G. Tong 34
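A hedged sketch in the spirit of the fast gradient sign method from the cited paper, written for a simple logistic-regression model so the gradient with respect to x is explicit; for a deep network, grad_x would come from backpropagation.

```python
import numpy as np

def fgsm_perturb(x, y, w, b, eps=0.1):
    """Fast-gradient-sign perturbation of x for a logistic-regression classifier.

    Loss J = -log P(y|x) with P(1|x) = sigmoid(w.x + b); the gradient w.r.t. x is
    (P(1|x) - y) * w, and the adversarial example is x + eps * sign(grad)."""
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))
    grad_x = (p - y) * w
    return x + eps * np.sign(grad_x)

# Adversarial training: also train on (fgsm_perturb(x, y, w, b), y) pairs.
```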

Convolutional Neural Networks

Convolutional Neural Networks

Introduction to Machine Learning Amo G. Tong 35

Convolutional Neural Networks

• One large class of neural networks.

• The main unit is the convolutional layer

• Replace the matrix multiplication by convolutional operation.

• Designed for processing grid-like inputs.

Introduction to Machine Learning Amo G. Tong 36

Convolutional Neural Networks

• Convolutional operation in math: (f ∗ g)(t) = ∫_{−∞}^{+∞} f(τ) g(t − τ) dτ

• An average of g(t), weighted by f(−τ).

• Convolutional operation in deep learning: replace the matrix multiplication with a convolution with a kernel K.

• h^(1) = g^(1)((W^(1))^T x + b^(1))

• h^(2) = g^(2)((W^(2))^T h^(1) + b^(2))

• …

• (W^(2))^T h^(1) → K ∗ h^(1)

• 1D example: for h^(1) = (a, b, c, d) and kernel K = (e, f), K ∗ h^(1) = (ae + bf, be + cf, ce + df, de + 0).

[Figure: a 2D example of sliding a kernel K over a 2D input h^(1).]

Introduction to Machine Learning Amo G. Tong 40
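A sketch reproducing the 1D example above; like most deep-learning “convolutions”, it actually computes a cross-correlation (no kernel flip) and pads the right end with zeros.

```python
import numpy as np

def conv1d(h, K):
    """output_i = sum_j K_j * h_{i+j}, with zero padding at the right end."""
    n, k = len(h), len(K)
    padded = np.concatenate([h, np.zeros(k - 1)])
    return np.array([padded[i:i + k] @ K for i in range(n)])

a, b, c, d, e, f = 1.0, 2.0, 3.0, 4.0, 10.0, 20.0
print(conv1d(np.array([a, b, c, d]), np.array([e, f])))
# [a*e + b*f, b*e + c*f, c*e + d*f, d*e + 0] = [ 50.  80. 110.  40.]
```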

Convolutional Neural Networks

• Convolutional operation in deep learning:

• h^(1) = g^(1)(K^(1) ∗ x)

• h^(2) = g^(2)(K^(2) ∗ h^(1))

• …..

• What are the advantages?

• Sparse interactions: each output depends only on a kernel-sized window of the input.

• Fewer parameters.

• Parameter sharing: one set of kernel parameters for each layer.

• Equivariance: the output changes in the same way as the input does.

Introduction to Machine Learning Amo G. Tong 41

Convolutional Neural Networks

• Convolutional operation in deep learning:

• ℎ(1) = 𝑔 1 (𝐾(1) ∗ 𝑥)

• ℎ(2) = 𝑔 2 (𝐾(2) ∗ ℎ(1))

• …..

• Typical structure of a convolutional network:

• Convolution layers+

• Activation functions (e.g. ReLU)+

• Pooling and Sampling

• Pooling layers:

• Replace each node with a summary of its nearby neighbors.

• E.g., Max Pooling: report the maximum output within a rectangular neighborhood.

• Why pooling?

• Theoretically, it makes the model approximately invariant to small translations.

• Intuitively, we care more about whether some feature exists than about exactly where it is.

Introduction to Machine Learning Amo G. Tong 42
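A small sketch of 1D max pooling; note how a small shift of the active inputs barely changes the pooled summary, which is the translation-invariance intuition above.

```python
import numpy as np

def max_pool_1d(x, width=2, stride=2):
    """Max pooling: report the maximum within each (non-overlapping) window."""
    return np.array([x[i:i + width].max() for i in range(0, len(x) - width + 1, stride)])

x = np.array([0.0, 1.0, 1.0, 0.3, 0.0, 0.0])
print(max_pool_1d(x))   # [1. 1. 0.]
```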


Recurrent Neural Network

“In this lecture, we are going to learn deep learning…..”

Recurrent Neural Network

Introduction to Machine Learning Amo G. Tong 46

Recurrent Neural Network

• One large class of neural networks.

• Designed for processing sequence data.

“In this lecture, we are going to learn deep learning…..”

[Figure: an unrolled RNN mapping the input sequence x_1, x_2, x_3, x_4 to outputs y_1, y_2, y_3, y_4.]

Introduction to Machine Learning Amo G. Tong 47


Recurrent Neural Network

• One large class of neural networks.

• Designed for processing sequence data.

[Figure: a typical pattern: at each step, the input updates a hidden state, the hidden state produces an output, and the output is compared against the truth by a loss.]

A typical pattern:

• Stationary model

• Parameter sharing

• Unlimited sequence length

• Powerful: can simulate a universal Turing machine.

Introduction to Machine Learning Amo G. Tong 49


Recurrent Neural Network

• One large class of neural networks.

• Designed for processing sequence data.

• Hidden transmission: a_t = b + W h_{t−1} + U x_t, h_t = tanh(a_t)

• Output: o_t = c + V h_t, y_out,t = softmax(o_t)

• Loss: cross-entropy.

Introduction to Machine Learning Amo G. Tong 52
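The recurrence above as a single numpy step; the parameter shapes are assumptions for illustration (U: hidden × input, W: hidden × hidden, V: output × hidden).

```python
import numpy as np

def rnn_step(x_t, h_prev, U, W, V, b, c):
    """One RNN step: a_t = b + W h_{t-1} + U x_t, h_t = tanh(a_t),
    o_t = c + V h_t, y_t = softmax(o_t)."""
    a_t = b + W @ h_prev + U @ x_t
    h_t = np.tanh(a_t)
    o_t = c + V @ h_t
    e = np.exp(o_t - o_t.max())       # stable softmax
    y_t = e / e.sum()
    return h_t, y_t
```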

Example: Character-Level Language Model (“hello”)

Vocabulary [h,e,l,o]

• Hidden transmission: a_t = b + W h_{t−1} + U x_t, h_t = tanh(a_t)

• Output: o_t = c + V h_t, y_out,t = softmax(o_t)

[Figure: the RNN unrolled over the one-hot-encoded input characters “h, e, l, l”; at each step the output distribution over [h, e, l, o] should put high probability on the next character: “e, l, l, o”.]

Introduction to Machine Learning Amo G. Tong 53
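A usage sketch of rnn_step from the previous block on the “hello” vocabulary; the parameters here are random (untrained) placeholders, so the printed predictions are meaningless until the model is trained with the cross-entropy loss against the targets “e, l, l, o”.

```python
import numpy as np

vocab = ["h", "e", "l", "o"]
one_hot = {ch: np.eye(len(vocab))[i] for i, ch in enumerate(vocab)}

# Hypothetical (untrained) parameters.
rng = np.random.default_rng(0)
n_hidden = 3
U = rng.normal(0, 0.5, (n_hidden, len(vocab)))
W = rng.normal(0, 0.5, (n_hidden, n_hidden))
V = rng.normal(0, 0.5, (len(vocab), n_hidden))
b, c = np.zeros(n_hidden), np.zeros(len(vocab))

h = np.zeros(n_hidden)
for ch in "hell":                                   # feed the input characters one at a time
    h, y = rnn_step(one_hot[ch], h, U, W, V, b, c)  # reuses rnn_step defined above
    print(ch, "->", vocab[int(np.argmax(y))])       # predicted next character
```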


Recurrent Neural Network

• One large class of neural networks.

• Designed for processing sequence data.

Output-guided hidden transmission (teacher forcing), with less information.

• Less powerful in terms of representability.

• Output may not carry sufficient information.

• Training can be parallelized.

Introduction to Machine Learning Amo G. Tong 55


Summary

• General Deep Feedforward Network.

• Training

• Design: output units, hidden units, architecture.

• Regularization

• Convolutional Neural Network

• Convolution Operator

• Grid-like input

• Extract features layer by layer

• Recurrent Neural Network

• Sequence data processing

• Hidden state transmission