Introduction to Machine Learning - udel.eduudel.edu/~amotong/teaching/machine...


Introduction to Machine Learning

Introduction to Machine Learning Amo G. Tong 1


Lecture: Deep Learning

• An Introduction

• Deep Learning, Ian Goodfellow, Yoshua Bengio and Aaron Courville

• Some materials are courtesy of .

• All pictures belong to their creators.


Deep Learning

• What are deep learning methods?

• Using a complex neural network to approximate the function we want to learn.

Story: ImageNet object recognition contest.

LeNet-5 (1998): 7-level convolutional network.

AlexNet (2012): more layers and filters. Trained for 6 days on two GPUs. Top-5 error rate: 15.3% (reduced from 26.2%).

ResNet (2015): 152 layers with residual connections. Top-5 error rate: 3.57%.

https://medium.com/analytics-vidhya/cnns-architectures-lenet-alexnet-vgg-googlenet-resnet-and-more-666091488df5


• Why do we need deep learning?

• We need to solve complex real-world problems.

• A standard way to parameterize functions.

• Layer by layer.

• Flexible to be customized for different applications.

• Image processing

• Sequence data

• Standard training methods.

• Backpropagation.


• Why now?

• Availability of data

• Hardware

• GPUs,…

• Software

• TensorFlow, PyTorch, …


Outline

• General Deep Feedforward Network.

• Convolutional Neural Network (CNN)

• Image processing.

• Recurrent Neural Network (RNN)

• Sequence data processing.


Deep Feedforward Network

[Figure: a feedforward network with an input layer, hidden layers, and an output layer]

Training, Design, and Regularization


Deep Feedforward Network:

• Feedforward Network (Multi-Layer Perceptron (MLP))

• More Layers.

• Training

• Design

• Regularization


Deep Feedforward Network: Training

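• Cost function: $J(\theta)$

• Common choice: cross-entropy

• The difference between two distributions $Q$ and $P$

• $H(P, Q) = -\mathbb{E}_{x \sim P}[\log Q(x)]$

• KL divergence: $D(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$

• $J(\theta) = -\mathbb{E}_{(x, y) \sim D_{\text{train}}}[\log P_{\text{model}}(y \mid x)]$ (negative log-likelihood)

• $P_{\text{model}}(y \mid x)$ changes from model to model

• Least-squares error: assume a Gaussian $P_{\text{model}}(y \mid x)$ and use maximum likelihood estimation (MLE)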
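
The two quantities on this slide can be checked numerically. The following is a minimal sketch for discrete distributions; the function names and the example vectors `p` and `q` are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) log Q(x) for discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms with P(x) = 0 contribute nothing
    return -np.sum(p[mask] * np.log(q[mask]))

def kl_divergence(p, q):
    """D(P || Q) = sum_x P(x) log(P(x) / Q(x))."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

# Hypothetical example distributions over three outcomes.
p = np.array([0.5, 0.25, 0.25])
q = np.array([0.4, 0.4, 0.2])

entropy_p = cross_entropy(p, p)          # H(P, P) is the entropy of P
gap = cross_entropy(p, q) - entropy_p    # equals D(P || Q)
```

Since the entropy of $P$ does not depend on $Q$, minimizing the cross-entropy over $Q$ is the same as minimizing the KL divergence.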


• Gradient-Based Training

• Non-convex because we have nonlinear units.

• Iteratively decrease the cost function; there is no guarantee of reaching a global optimum, or even a local one.

• Initialize with small random numbers.

• Backpropagation.

• One issue: the gradient must be large and predictable.

Stochastic gradient descent (SGD) algorithm:

While stopping criterion not met:
    Sample a minibatch of $m$ examples $D_m$
    Compute the gradient $g \leftarrow \sum_{x \in D_m} \nabla_\theta J(\theta, x)$
    Update $\theta \leftarrow \theta - \epsilon g$
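
The SGD loop can be sketched in a few lines of NumPy. This is an illustration under stated assumptions: the model is a toy linear least-squares problem, the stopping criterion is a fixed number of steps, and the names `grad_J`, `eps`, and `m` are chosen here for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumption): y is an exact linear function of x.
X = rng.normal(size=(200, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta

def grad_J(theta, xb, yb):
    """Gradient of sum over the minibatch of 0.5 * (x . theta - y)^2."""
    return xb.T @ (xb @ theta - yb)

theta = 0.01 * rng.normal(size=3)   # initialize with small random numbers
eps = 0.01                          # learning rate
m = 10                              # minibatch size

for step in range(2000):            # stopping criterion: fixed step budget
    idx = rng.choice(len(X), size=m, replace=False)  # sample a minibatch D_m
    g = grad_J(theta, X[idx], y[idx])                # compute the gradient
    theta = theta - eps * g                          # update
```

The least-squares cost is convex, so this toy problem avoids the non-convexity discussed above; it only shows the mechanics of the update.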


Deep Feedforward Network: Design

• Output units

• A real vector

• Regression Problem: Pr[𝑦|𝑥].

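• Affine transformation: $y_{\text{out}} = W^T h + b$

• $J(\theta) = \mathbb{E}_{(x, y) \sim D_{\text{train}}} \| y - y_{\text{out}} \|^2$

• Binary classification

• Sigmoid $\sigma$

• $y_{\text{out}} = \sigma(W^T h + b)$

• $J(\theta) = -\mathbb{E}_{(x, y) \sim D_{\text{train}}} \log y_{\text{out}}(x)$

• Multinoulli ($n$-class classification)

• Softmax

• $z = W^T h + b = (z_1, \dots, z_n)$

• $\text{softmax}(z)_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}$

Note: the gradient must be large and predictable. The log in the cost function can help, because it undoes the exp inside the sigmoid and softmax.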
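
A small sketch of the sigmoid and softmax output units follows. Subtracting the maximum before exponentiating is a common numerical safeguard and is an addition here, not something the slides state; the example scores `z` are made up.

```python
import numpy as np

def sigmoid(a):
    """Output unit for binary classification."""
    return 1.0 / (1.0 + np.exp(-a))

def softmax(z):
    """softmax(z)_i = exp(z_i) / sum_j exp(z_j)."""
    z = z - np.max(z)          # shifting by a constant does not change softmax
    e = np.exp(z)
    return e / np.sum(e)

def log_softmax(z):
    """log softmax(z), computed without forming tiny probabilities first."""
    z = z - np.max(z)
    return z - np.log(np.sum(np.exp(z)))

z = np.array([2.0, 1.0, -1.0])           # hypothetical scores for 3 classes
probs = softmax(z)
nll = -log_softmax(z)[0]                 # cost if the true class is index 0
big = softmax(np.array([1000.0, 0.0]))   # would overflow without the shift
```

Working with `log_softmax` directly reflects the note above: the log cancels the exp, so the cost stays roughly linear in the scores even when the softmax saturates.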


• Hidden Units

• Rectified linear units (ReLU).

• $g(z) = \max\{0, z\}$

• $h = g(W^T x + b)$

• Good point: large gradient when active, easy to train

• Bad point: cannot learn when inactive, because the gradient is zero

• Generalizations.

• $g(z) = \max(0, z) + \alpha \min(0, z)$

• Absolute value rectification: $\alpha = -1$.

• Leaky ReLU: fixed small $\alpha$ (e.g. 0.01).

• Parametric ReLU (PReLU): learnable 𝛼.

• Maxout units: group and take the max.

• Traditional units:

• Sigmoid (consider tanh(z))

• Perceptron
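
The ReLU family above differs only in the slope $\alpha$ used for negative inputs, which a short sketch makes concrete. The function names and the test vector are assumptions for illustration; the `maxout` helper assumes consecutive units form each group.

```python
import numpy as np

def generalized_relu(z, alpha):
    """g(z) = max(0, z) + alpha * min(0, z)."""
    return np.maximum(0.0, z) + alpha * np.minimum(0.0, z)

def maxout(z, k):
    """Split the units into consecutive groups of size k and keep each group's max."""
    return z.reshape(-1, k).max(axis=1)

z = np.array([-2.0, -0.5, 0.0, 1.5])

relu = generalized_relu(z, 0.0)        # plain ReLU
abs_rect = generalized_relu(z, -1.0)   # absolute value rectification
leaky = generalized_relu(z, 0.01)      # leaky ReLU
grouped = maxout(z, 2)                 # maxout with groups of two units
```

For PReLU, `alpha` would be a learnable parameter updated by the same gradient-based training as the weights.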


• Architecture Design

• The layer structure.

• $h^{(1)} = g^{(1)}(W^{(1)T} x + b^{(1)})$

• $h^{(2)} = g^{(2)}(W^{(2)T} h^{(1)} + b^{(2)})$

• …

• How many layers? How many units in each layer?

• Theoretically, the universal approximation theorem: one hidden layer of sigmoid units can approximate any Borel measurable function, given enough hidden units.

• But it is not guaranteed that it can be learned.

• Practically, (a) follow classic models: CNN, RNN, …; (b) start from a few (2 or 3) layers with a few (16, 32 or 64) hidden units and watch the validation error.

• Greater depth may not be better.

• Other considerations: locally connected layers, skip connections, recurrent layers, …
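
The layer-by-layer structure can be written as a short forward pass. This is a sketch only: the layer sizes, the ReLU hidden units, the affine output, and the helper names `init_layer` and `forward` are all assumptions chosen to match the "start small" advice above.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

def init_layer(n_in, n_out):
    """Weights start as small random numbers; biases start at zero."""
    W = 0.01 * rng.normal(size=(n_in, n_out))
    b = np.zeros(n_out)
    return W, b

# Hypothetical architecture: 4 inputs, two hidden layers of 16 units, 3 outputs.
layer_sizes = [4, 16, 16, 3]
params = [init_layer(a, b) for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(x, params):
    """h^(l) = g(W^(l)T h^(l-1) + b^(l)) for hidden layers, affine output at the end."""
    h = x
    for W, b in params[:-1]:
        h = relu(W.T @ h + b)
    W, b = params[-1]
    return W.T @ h + b

x = rng.normal(size=4)
y_out = forward(x, params)
```

Changing `layer_sizes` is enough to try a different depth or width while watching the validation error.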


Deep Feedforward Network: Regularization

• Regularization: methods for reducing over-fitting.

• Over-fitting is a serious problem in deep learning.

• Method 1: Norm Penalties:

• $J_{\text{new}}(\theta) = J(\theta) + \alpha \cdot \Omega(\theta)$

• $L_2$ regularization (weight decay): $\Omega(\theta) = \frac{1}{2} \|\theta\|_2^2$

• $L_1$ regularization: $\Omega(\theta) = \sum_i |\theta_i|$

• What is the difference between them?

Typically, we put a penalty on $W$ but not on the bias $b$ in $h^{(1)} = g^{(1)}(W^{(1)T} x + b^{(1)})$.
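
One way to see the difference between the two penalties is to look at the extra term each adds to a gradient step. The sketch below assumes a zero data gradient so that only the penalty acts; `penalized_step` and the example weights are hypothetical.

```python
import numpy as np

def penalized_step(W, grad_J, eps, alpha, kind):
    """One gradient step on J(theta) + alpha * Omega(theta), penalizing W only."""
    if kind == "l2":
        grad_omega = W             # gradient of 0.5 * ||W||^2
    elif kind == "l1":
        grad_omega = np.sign(W)    # subgradient of sum_i |W_i|
    else:
        raise ValueError("kind must be 'l1' or 'l2'")
    return W - eps * (grad_J + alpha * grad_omega)

W = np.array([2.0, -0.1, 0.0])
zero_grad = np.zeros_like(W)       # isolate the effect of the penalty

w_l2 = penalized_step(W, zero_grad, eps=0.1, alpha=0.5, kind="l2")
w_l1 = penalized_step(W, zero_grad, eps=0.1, alpha=0.5, kind="l1")
```

The $L_2$ step shrinks every weight in proportion to its size, while the $L_1$ step subtracts the same fixed amount from every nonzero weight, which tends to drive small weights to zero.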


• Method 2: Norm Penalties as Constrained Optimization:

• $J_{\text{new}}(\theta, \alpha) = J(\theta) + \alpha \cdot (\Omega(\theta) - k)$

• $\alpha$ increases when $\Omega(\theta) > k$; $\alpha$ decreases when $\Omega(\theta) < k$

• Take $\alpha$ as another learnable parameter.

• Select $k$ manually.

• Need to control $\alpha$ manually.


• Method 3: Explicit Constraints:

• $J_{\text{new}}(\theta) = J(\theta)$

• Whenever $\Omega(\theta) > k$, project $\theta$ onto the nearest point with $\Omega(\theta) \le k$.

• After each iteration, check whether the new parameters are in the region $\Omega(\theta) \le k$.
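
For a concrete case of the projection, assume $\Omega(\theta) = \frac{1}{2}\|\theta\|_2^2$ (the $L_2$ penalty from Method 1). The feasible region is then a ball, and the nearest feasible point is a rescaled copy of $\theta$. The function names below are illustrative.

```python
import numpy as np

def omega(theta):
    """Assumed constraint function: 0.5 * ||theta||^2."""
    return 0.5 * np.sum(theta ** 2)

def project(theta, k):
    """Return the nearest point with omega(theta) <= k."""
    value = omega(theta)
    if value <= k:
        return theta
    # omega scales with the square of the norm, so rescale by sqrt(k / value).
    return theta * np.sqrt(k / value)

def projected_sgd_step(theta, g, eps, k):
    """Ordinary gradient step on J, followed by the projection."""
    return project(theta - eps * g, k)

theta = np.array([3.0, 4.0])   # omega(theta) = 12.5, outside the region for k = 2
theta_new = projected_sgd_step(theta, g=np.zeros(2), eps=0.1, k=2.0)
```

For other choices of $\Omega$ the projection has a different form and may not be available in closed form.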


• Method 4: Data Augmentation

• How does over-fitting happen?

• High representational capacity and insufficient data.

• Creating fake but useful training data is possible.

• The target classifier is invariant to some transformations.

• E.g. object detection.

[Figure: two transformed examples of the digit 9, one labeled Good and one labeled Bad]


• Method 5: Multi-Task Learning

• Two models (problems) share some parameters

• The hidden representations can be better learned if two tasks share certain factors.

[Figure: one input feeds a shared part of the network, which feeds Output 1 and Output 2. Example questions: If animal? If cat? If dog?]


• Method 6: Early Stopping

• We may want to stop training early, before convergence.

• Monitor the validation set error.

• One possible method: stop if no better parameters are found in $p$ (say 10) consecutive validation checks, with one check every $n$ training steps.

• Simple to implement.

• Need some extra data.

• Use second training to fully utilize the data.

• Retraining. Adopt the same number of training steps.

• Continuous Training. Start from the obtained parameters.

while 𝑗 < 𝑝
    update 𝜃 for 𝑛 steps;
    if no smaller validation error
        𝑗 = 𝑗 + 1;
    else
        note the new parameters;
        𝑗 = 0;
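
The early-stopping loop translates almost directly into Python. In this sketch, `train_n_steps` and `validation_error` are hypothetical callables supplied by the user, and the toy usage replaces real training with a counter so the behavior is easy to follow.

```python
import copy

def early_stopping(train_n_steps, validation_error, theta, p=10):
    """Train until the validation error has not improved for p consecutive checks."""
    best_theta = copy.deepcopy(theta)
    best_error = validation_error(theta)
    j = 0
    while j < p:
        theta = train_n_steps(theta)            # update theta for n steps
        error = validation_error(theta)
        if error < best_error:
            best_error = error
            best_theta = copy.deepcopy(theta)   # note the new parameters
            j = 0
        else:
            j += 1
    return best_theta, best_error

# Toy usage: "training" adds 1 to theta; validation error is lowest at theta = 5.
best_theta, best_error = early_stopping(
    train_n_steps=lambda t: t + 1,
    validation_error=lambda t: (t - 5) ** 2,
    theta=0,
    p=3,
)
```

The function returns the best parameters seen, not the last ones, which is why a copy is stored each time the validation error improves.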


• Method 7: Dropout

• Bagging: train different models with different datasets.

• Bagging large neural networks is not practical: training and evaluating many networks is costly.

• Dropout:

• Idea: dynamically change the network structure.

• Mute units by setting their outputs to zero.

• Algorithm:

• Before each iteration, randomly mute units.

• Typically, drop an input node with probability 0.2 and a hidden node with probability 0.5.

• Pros: computationally cheap; independent of the training algorithm or model.

• Cons: the model may need to be larger; less effective on small datasets.
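
Muting units amounts to multiplying a layer's outputs by a random 0/1 mask. The sketch below also rescales the surviving units by $1 / (1 - \text{drop probability})$, a common variant often called inverted dropout; that rescaling is an assumption added here and is not stated on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, drop_prob, training=True):
    """Randomly set units to zero during training; pass through at test time."""
    if not training or drop_prob == 0.0:
        return h
    keep = rng.random(h.shape) >= drop_prob   # True for units that stay active
    # Rescale so the expected output matches the no-dropout output.
    return h * keep / (1.0 - drop_prob)

h = np.ones(10000)                            # hypothetical hidden-layer outputs
h_train = dropout(h, drop_prob=0.5)           # hidden units: drop with prob 0.5
h_test = dropout(h, drop_prob=0.5, training=False)
```

A fresh mask is drawn on every call, which corresponds to muting a different set of units before each iteration.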


Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15, 1929–1958.


• Method 8: Adversarial training

• There can be inputs $x_1 \approx x_2$ whose predictions are very different.

• Training on adversarially perturbed examples from the training set

• Generative adversarial network (GAN)

Goodfellow, I. J., Shlens, J., and Szegedy, C. (2014b). Explaining and harnessing adversarial examples. CoRR, abs/1412.6572.


Convolutional Neural Networks


• One large class of neural networks.

• The main unit is the convolutional layer

• Replace the matrix multiplication by convolutional operation.

• Designed for processing grid-like inputs.


• Convolutional operation in math: $(f * g)(t) = \int_{-\infty}^{+\infty} f(\tau)\, g(t - \tau)\, d\tau$

• A weighted average of $g$ around $t$: the value $g(t - \tau)$ gets weight $f(\tau)$.


• Convolutional operation in deep learning:

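• $h^{(1)} = g^{(1)}(W^{(1)T} x + b^{(1)})$

• $h^{(2)} = g^{(2)}(W^{(2)T} h^{(1)} + b^{(2)})$

• …

• Replace the matrix product by a convolution with a kernel $K$: $W^{(2)T} h^{(1)} \rightarrow K * h^{(1)}$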


1D example: $h^{(1)} = (a, b, c, d)$ and kernel $K = (e, f)$, with a zero padded after $d$:

$K * h^{(1)} = (ae + bf,\ be + cf,\ ce + df,\ de + 0)$
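
The 1D example can be reproduced with concrete numbers. In the sketch below, the values standing in for $a, b, c, d$ and for the kernel entries are made up. Note that the kernel is applied without flipping, as in the example's output terms; this is what NumPy calls correlation.

```python
import numpy as np

def conv1d_zero_padded(h, K):
    """Slide K over h (no flipping), padding zeros at the end so the output keeps len(h)."""
    h = np.asarray(h, dtype=float)
    K = np.asarray(K, dtype=float)
    padded = np.concatenate([h, np.zeros(len(K) - 1)])
    return np.array([np.dot(K, padded[i:i + len(K)]) for i in range(len(h))])

h1 = [1.0, 2.0, 3.0, 4.0]   # stands in for (a, b, c, d)
K = [5.0, 6.0]              # a two-entry kernel

out = conv1d_zero_padded(h1, K)
# Same result from NumPy's built-in sliding product on the padded input.
reference = np.correlate(np.append(h1, 0.0), K, mode="valid")
```

Only two kernel values are learned here, regardless of how long the input is, which is the parameter saving that convolution offers over a full weight matrix.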

[Figure: 2D example of $W^{(2)T} h^{(1)} \rightarrow K * h^{(1)}$]

• Convolutional layers:

• $h^{(1)} = g^{(1)}(K^{(1)} * x)$

• $h^{(2)} = g^{(2)}(K^{(2)} * h^{(1)})$

• …

• What are the advantages?

• Sparse interactions: each output depends only on a kernel-sized patch of the input.

• Fewer parameters

• Parameter sharing:

• One set of parameters (the kernel) for each layer, reused at every position

• Equivariance

• When the input is shifted, the output shifts in the same way.

[Figure: connections under $W^{(2)T} h^{(1)}$ versus $K * h^{(1)}$]


• Typical structure of a convolutional network:

• Convolution layers +

• Activation functions (e.g. ReLU) +

• Pooling and sampling

• Pooling layers:

• Replace each node with a summary of its nearby neighbors

• E.g. max pooling: reports the maximum output within a rectangular neighborhood.

• Why pooling?

• Theoretically, it makes the model invariant to small translations.

• Intuitively, we care more about whether a feature exists than about where it is.


[Figure: max pooling over three neighborhoods: Max = 1, Max = 1, Max = 0]
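
A short sketch shows how max pooling absorbs a small shift. It assumes 1D features and non-overlapping windows, and the feature values are invented so that the pooled result is (1, 1, 0); none of these choices come from the original figure.

```python
import numpy as np

def max_pool_1d(h, width):
    """Maximum over consecutive, non-overlapping windows of the given width."""
    h = np.asarray(h, dtype=float)
    usable = len(h) - len(h) % width      # drop a trailing partial window
    return h[:usable].reshape(-1, width).max(axis=1)

features = np.array([1, 0, 0, 0, 1, 0, 0, 0, 0])
shifted = np.roll(features, 1)            # every feature moves one step right

pooled = max_pool_1d(features, 3)
pooled_shifted = max_pool_1d(shifted, 3)
```

Both inputs pool to the same result because each feature stays inside its window after the shift. A larger shift, or a feature sitting at a window boundary, would change the output, so the invariance holds only for small translations.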


Recurrent Neural Network


• One large class of neural networks.

• Designed for processing sequence data.

“In this lecture, we are going to learn deep learning…”

[Figure: a sequence of inputs $x_1, x_2, x_3, x_4$ with corresponding outputs $y_1, y_2, y_3, y_4$]


[Figure: a typical pattern, with layers Input, Hidden state, Output, Loss, Truth]

• Stationary model

• Parameter sharing

• Unlimited sequence length

• Powerful: can simulate a universal Turing machine


Hidden transmission:

$a_t = b + W h_{t-1} + U x_t$

$h_t = \tanh(a_t)$

Output:

$o_t = c + V h_t$

$y_{\text{out},t} = \text{softmax}(o_t)$

Loss: cross-entropy
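
These recurrences can be unrolled over a short sequence as follows. This is a sketch under assumptions: the dimensions, the random initialization, the one-hot inputs, and the target indices are all hypothetical, and the loss is summed over time steps.

```python
import numpy as np

def softmax(o):
    o = o - np.max(o)
    e = np.exp(o)
    return e / np.sum(e)

def rnn_forward(xs, targets, U, W, V, b, c, h0):
    """Apply the same parameters at every step and sum the cross-entropy loss."""
    h = h0
    loss = 0.0
    outputs = []
    for x, target in zip(xs, targets):
        a = b + W @ h + U @ x          # hidden transmission
        h = np.tanh(a)
        o = c + V @ h                  # output scores
        y_out = softmax(o)
        loss += -np.log(y_out[target]) # cross-entropy against the true index
        outputs.append(y_out)
    return np.array(outputs), loss

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 4, 3, 4        # hypothetical sizes
U = 0.1 * rng.normal(size=(n_hidden, n_in))
W = 0.1 * rng.normal(size=(n_hidden, n_hidden))
V = 0.1 * rng.normal(size=(n_out, n_hidden))
b = np.zeros(n_hidden)
c = np.zeros(n_out)

xs = np.eye(n_in)[[0, 1, 2, 2]]        # four one-hot inputs
targets = [1, 2, 2, 3]                 # index of the correct output at each step
outputs, loss = rnn_forward(xs, targets, U, W, V, b, c, h0=np.zeros(n_hidden))
```

Because `U`, `W`, `V`, `b`, and `c` do not depend on the step index, the same function handles sequences of any length.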


Example (“hello”) Character-level Language Model

Vocabulary [h,e,l,o]

Values at each step (input, hidden state, and output as in the figure):

Step  Input char  Input (one-hot)  Hidden state       Output                  Predicted  Target
1     h           (1, 0, 0, 0)     (0.3, -0.1, 0.9)   (1.0, 2.2, -3.0, 4.1)   o          e
2     e           (0, 1, 0, 0)     (1.0, -0.3, 0.1)   (0.5, 0.3, -1.0, 1.2)   o          l
3     l           (0, 0, 1, 0)     (0.1, -0.5, -0.3)  (0.1, 0.5, 1.9, -1.1)   l          l
4     l           (0, 0, 1, 0)     (-0.3, 0.9, 0.7)   (0.2, -1.5, -0.1, 2.2)  o          o
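
The output values in this example can be turned into predictions and a loss. The sketch assumes the four-number output vectors are the scores $o_t$ before the softmax, ordered as in the vocabulary [h, e, l, o], and that the targets are the next characters of "hello".

```python
import numpy as np

vocab = ["h", "e", "l", "o"]

# Output scores for the four steps (inputs h, e, l, l), taken from the example.
scores = np.array([
    [1.0, 2.2, -3.0, 4.1],
    [0.5, 0.3, -1.0, 1.2],
    [0.1, 0.5, 1.9, -1.1],
    [0.2, -1.5, -0.1, 2.2],
])
targets = ["e", "l", "l", "o"]   # assumed: next character at each step

# Row-wise softmax, shifted by the row maximum for numerical safety.
shifted = scores - scores.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

predictions = [vocab[i] for i in probs.argmax(axis=1)]
target_idx = [vocab.index(ch) for ch in targets]
loss = -np.log(probs[np.arange(len(targets)), target_idx]).sum()
```

Under these assumptions the first two steps predict the wrong character, so training would adjust the parameters to raise the scores of "e" and "l" at those steps.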


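Output-guided hidden transmission (trained with teacher forcing): the hidden state receives the previous output rather than the previous hidden state, so it has less information.

• Less powerful in terms of representational capacity.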

• Output may not carry sufficient information.

• Training can be parallelized.


Summary

• General Deep Feedforward Network.

• Training

• Design: output units, hidden units, architecture

• Regularization

• Convolutional Neural Network

• Convolution Operator

• Grid-like input

• Extract features layer by layer

• Recurrent Neural Network

• Sequence data processing

• Hidden state transmission