
Introduction to Machine Learning

Introduction to Machine Learning Amo G. Tong 1

Lecture: Deep Learning

• An Introduction

• Deep Learning, Ian Goodfellow, Yoshua Bengio and Aaron Courville

• Some materials are courtesy of .

• All pictures belong to their creators.

Introduction to Machine Learning Amo G. Tong 2


Deep Learning

• What are deep learning methods?

• Using a complex neural network to approximate the function we want to learn.

Story: the ImageNet object recognition contest.

• LeNet-5 (1998): a 7-level convolutional network.

• AlexNet (2012): more layers and filters; trained for 6 days on two GPUs. Error rate: 15.3% (reduced from 26.2%).

• ResNet (2015): 152 layers with residual connections. Error rate: 3.57%.

https://medium.com/analytics-vidhya/cnns-architectures-lenet-alexnet-vgg-googlenet-resnet-and-more-666091488df5

Introduction to Machine Learning Amo G. Tong 7

Deep Learning

• What are deep learning methods?

• Using a complex neural network to approximate the function we want to learn.

• Why do we need deep learning?

• We need to solve complex real-world problems.

• A standard way to parameterize functions.

• Layer by layer.

• Flexible: can be customized for different applications.

• Image processing

• Sequence data

• Standard training methods.

• Backpropagation.

Introduction to Machine Learning Amo G. Tong 8

Deep Learning


• Why now?

• Availability of data

• Hardware

• GPUs,…

• Software

• TensorFlow, PyTorch, …

Introduction to Machine Learning Amo G. Tong 9

Outline

• General Deep Feedforward Network.

• Convolutional Neural Network (CNN)

• Image processing.

• Recurrent Neural Network (RNN)

• Sequence data processing.

Introduction to Machine Learning Amo G. Tong 10

Deep Feedforward Network

[Figure: a feedforward network: input layer → hidden layers → output layer.]

Deep Feedforward Network

Training, Design, and Regularization

Introduction to Machine Learning Amo G. Tong 11

Deep Feedforward Network:

• Feedforward Network (Multi-Layer Perceptron (MLP))

• More Layers.

• Training

• Design

• Regularization


Introduction to Machine Learning Amo G. Tong 12


Deep Feedforward Network: Training

• Cost Function: 𝐽(𝜃)

• Common choice: cross-entropy

• Cross-entropy measures the difference between two distributions P and Q:

• H(P, Q) = −E_{x∼P}[log Q(x)]

• D(P‖Q) = Σ_x P(x) log (P(x) / Q(x))

• J(θ) = −E_{(x,y)∼D_train}[log P_model(y|x)] (negative log-likelihood)

• P_model(y|x) changes from model to model.

• Least-squares error: assume a Gaussian output distribution and use MLE.

Introduction to Machine Learning Amo G. Tong 14
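As a concrete illustration (not from the slides), here is a minimal numpy sketch of the negative log-likelihood / cross-entropy cost above; the probability matrix p_model and label vector y are hypothetical placeholders.

```python
import numpy as np

def cross_entropy(p_model, y):
    """Average negative log-likelihood -E_{(x,y)~D}[log P_model(y|x)].

    p_model: array of shape (m, n_classes), predicted probabilities per example.
    y:       array of shape (m,), integer class labels.
    """
    m = p_model.shape[0]
    # Pick out P_model(y|x) for the observed label of each example.
    likelihoods = p_model[np.arange(m), y]
    return -np.mean(np.log(likelihoods + 1e-12))  # small epsilon for stability

# Toy check: confident, correct predictions give a small loss.
p = np.array([[0.9, 0.1], [0.2, 0.8]])
y = np.array([0, 1])
print(cross_entropy(p, y))  # about 0.16
```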


Deep Feedforward Network: Training

• Gradient-Based Training

• Non-convex because we have nonlinear units.

• Iteratively decrease the cost function; the result is not guaranteed to be globally optimal, or even locally optimal.

• Initializations with small random numbers.

• Backpropagation.

• One issue: the gradient must be large and predictable.

Stochastic gradient descent (SGD) algorithm:
While stopping criterion not met:
    Sample a minibatch of m examples D_m
    Compute the gradient g ← Σ_{x∈D_m} ∇_θ J(θ, x)
    Update θ ← θ − εg

Introduction to Machine Learning Amo G. Tong 16
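A minimal Python sketch of the SGD box above; the gradient function grad_J, the data array, and the hyperparameters are assumed placeholders, not anything prescribed by the slides.

```python
import numpy as np

def sgd(theta, grad_J, data, lr=0.01, batch_size=32, n_iters=1000, rng=None):
    """Minibatch SGD sketch: grad_J(theta, batch) is assumed to return the
    gradient of the cost on a minibatch; data holds one training example per row."""
    rng = rng or np.random.default_rng(0)
    for _ in range(n_iters):                                          # while stopping criterion not met
        idx = rng.choice(len(data), size=batch_size, replace=False)   # sample a minibatch D_m
        g = grad_J(theta, data[idx])                                  # gradient on the minibatch
        theta = theta - lr * g                                        # theta <- theta - eps * g
    return theta
```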

Deep Feedforward Network: Design

• Output units

• A real vector (regression problem): model Pr[y|x].

• Affine transformation: y_out = W^T h + b

• J(θ) = E_{(x,y)∼D_train} ‖y − y_out‖²

• Binary classification

• Sigmoid σ: y_out = σ(W^T h + b)

• J(θ) = −E_{(x,y)∼D_train} log y_out(x)

• Multinoulli (n-class classification)

• Softmax: z = W^T h + b = (z_1, …, z_n)

• softmax(z)_i = exp(z_i) / Σ_j exp(z_j)

Note: the gradient must be large and predictable. The log can help.

Introduction to Machine Learning Amo G. Tong 17
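The three output units above could look roughly as follows in numpy; this is an illustrative sketch, with h, W, w, and b standing for the last hidden activation and the output parameters.

```python
import numpy as np

def linear_output(h, W, b):       # regression: y_out = W^T h + b
    return W.T @ h + b

def sigmoid_output(h, w, b):      # binary classification: y_out = sigma(w^T h + b)
    z = w @ h + b
    return 1.0 / (1.0 + np.exp(-z))

def softmax_output(h, W, b):      # n-way classification: softmax(W^T h + b)
    z = W.T @ h + b
    z = z - z.max()               # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()
```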


Deep Feedforward Network: Design

• Hidden Units

• Rectified linear units (ReLU).

• g(z) = max{0, z}

• h = g(W^T x + b)

• Good point: large gradient when active; easy to train.

• Bad point: not learnable when inactive.

• Generalizations: g(z) = max(0, z) + α · min(0, z)

• Absolute value rectification: α = −1.

• Leaky ReLU: fixed small α (e.g., 0.01).

• Parametric ReLU (PReLU): learnable 𝛼.

• Maxout units: group and take the max.

• Traditional units:

• Sigmoid (consider tanh(z))

• Perceptron


Introduction to Machine Learning Amo G. Tong 20
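A small numpy sketch of the ReLU family above; the test values are arbitrary.

```python
import numpy as np

def relu(z):                              # g(z) = max{0, z}
    return np.maximum(0.0, z)

def generalized_relu(z, alpha):           # g(z) = max(0, z) + alpha * min(0, z)
    return np.maximum(0.0, z) + alpha * np.minimum(0.0, z)

abs_rectifier = lambda z: generalized_relu(z, -1.0)   # absolute value rectification
leaky_relu    = lambda z: generalized_relu(z, 0.01)   # fixed small alpha

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z), leaky_relu(z), abs_rectifier(z))
```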



Deep Feedforward Network: Design

• Architecture Design

• The layer structure.

• h^(1) = g^(1)((W^(1))^T x + b^(1))

• h^(2) = g^(2)((W^(2))^T h^(1) + b^(2))

• …..

• How many layers? How many units in each layer?

• Theoretically, the universal approximation theorem: one hidden layer of sigmoid units can approximate any Borel-measurable function, given enough hidden units.

• But it is not guaranteed that such a function can be learned.

• Practically, (a) follow classic models: CNN, RNN, …; (b) start from a few (2 or 3) layers with a few (16, 32, or 64) hidden units and watch the validation error.

• Greater depth may not be better.

• Other considerations: local connected layers, skip layers, recurrent layers, …


Introduction to Machine Learning Amo G. Tong 25
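A sketch of the layer structure h^(k) = g^(k)((W^(k))^T h^(k−1) + b^(k)); the layer sizes (4, 16, 1), the tanh hidden activation, and the linear output are arbitrary illustrative choices.

```python
import numpy as np

def mlp_forward(x, params, g=np.tanh):
    """Forward pass of a small feedforward network.

    params: list of (W, b) pairs, one per layer; hidden layers use g, the last
    layer is linear (e.g., for regression)."""
    h = x
    for W, b in params[:-1]:
        h = g(W.T @ h + b)        # hidden layers
    W, b = params[-1]
    return W.T @ h + b            # output layer

# A 2-layer network: 4 inputs -> 16 hidden units -> 1 output.
rng = np.random.default_rng(0)
params = [(rng.normal(0, 0.1, (4, 16)), np.zeros(16)),
          (rng.normal(0, 0.1, (16, 1)), np.zeros(1))]
print(mlp_forward(rng.normal(size=4), params))
```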

Deep Feedforward Network: Regularization

• Regularization: methods for reducing over-fitting.

• Over-fitting is a serious problem in deep learning.

• Method 1: Norm Penalties:

• J_new(θ) = J(θ) + α · Ω(θ)

• L2 regularization (weight decay): Ω(θ) = (1/2) ‖θ‖²

• L1 regularization: Ω(θ) = Σ_i |θ_i|

• What is the difference between them?

• Typically, we put a penalty on the weights W but not on the biases.


Introduction to Machine Learning Amo G. Tong 26
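A sketch of how the two penalties enter the gradient update; following the note above, only the weights (not the biases) are assumed to be penalized.

```python
import numpy as np

def penalized_gradient(grad_W, W, alpha, penalty="l2"):
    """Gradient of J_new = J + alpha * Omega(W) for the two penalties above."""
    if penalty == "l2":                    # Omega = (1/2)||W||^2, so dOmega/dW = W
        return grad_W + alpha * W
    if penalty == "l1":                    # Omega = sum_i |W_i|, so dOmega/dW = sign(W)
        return grad_W + alpha * np.sign(W)
    raise ValueError(penalty)

# In the SGD update: W <- W - eps * penalized_gradient(g, W, alpha)
```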

Deep Feedforward Network: Regularization

• Regularization: methods for reducing over-fitting.

• Over-fitting is a serious problem in deep learning.

• Method 2: Norm Penalties as Constrained Optimization:

• J_new(θ, α) = J(θ) + α · (Ω(θ) − k)

• α increases when Ω(θ) > k; α decreases when Ω(θ) < k.

• Take α as another learnable parameter.

• Select k manually.

• Need to control α manually.


Introduction to Machine Learning Amo G. Tong 27

Deep Feedforward Network: Regularization

• Regularization: methods for reducing over-fitting.

• Over-fitting is a serious problem in deep learning.

• Method 3: Explicit Constraints:

• J_new(θ) = J(θ)

• Whenever Ω(θ) > k, project θ to the nearest point in the region Ω(θ) ≤ k.

• After each iteration, check whether the new parameters lie in the region Ω(θ) ≤ k.


Introduction to Machine Learning Amo G. Tong 28
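A sketch of the projection step, assuming Ω(θ) is the L2 norm so that the projection is a simple rescaling; other choices of Ω would need their own projection.

```python
import numpy as np

def project_l2_ball(theta, k):
    """If ||theta||_2 exceeds k, project theta back onto the ball ||theta||_2 <= k."""
    norm = np.linalg.norm(theta)
    if norm > k:
        theta = theta * (k / norm)   # nearest point of the ball in Euclidean distance
    return theta

# After each SGD step: theta <- project_l2_ball(theta - lr * g, k)
```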

Deep Feedforward Network: Regularization

• Regularization: methods for reducing over-fitting.

• Over-fitting is a serious problem in deep learning.

• Method 4: Data Augmentation

• How does over-fitting happen?

• High representability (model capacity) and insufficient data.

• Creating fake but useful training data is possible.

• The target classifier is invariant to some transformations.

• E.g. object detection.

[Figure: a network diagram and two transformed copies of a handwritten “9”, one labeled Good and one labeled Bad.]

Introduction to Machine Learning Amo G. Tong 29
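A toy augmentation sketch for images stored as 2D arrays; flips and small shifts are only examples, and whether a transformation preserves the label depends on the task (as the Good/Bad “9” figure suggests).

```python
import numpy as np

def augment(image, rng):
    """Return a transformed copy of `image` (H x W array) that should keep its label."""
    out = image
    if rng.random() < 0.5:
        out = out[:, ::-1]               # horizontal flip (good for many objects, bad for digits)
    shift = rng.integers(-2, 3)          # small horizontal shift
    out = np.roll(out, shift, axis=1)
    return out
```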

Deep Feedforward Network: Regularization

• Regularization: methods for reducing over-fitting.

• Over-fitting is a serious problem in deep learning.

• Method 5: Multi-Task Learning

• Two models (problems) share some parameters

• The hidden representations can be better learned if two tasks share certain factors.

[Figure: the input feeds a shared hidden representation (“if animal?”), which feeds two task-specific outputs: Output 1 (“if cat?”) and Output 2 (“if dog?”).]

Introduction to Machine Learning Amo G. Tong 30
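A sketch of a shared trunk with task-specific output heads; the ReLU shared layer and the dictionary of heads are illustrative choices, not part of the slide.

```python
import numpy as np

def multitask_forward(x, W_shared, b_shared, heads):
    """Shared hidden representation feeding several task-specific output heads.

    heads: dict mapping a task name to its own (W, b) output parameters."""
    h = np.maximum(0.0, W_shared.T @ x + b_shared)        # shared layer (ReLU)
    return {task: W.T @ h + b for task, (W, b) in heads.items()}

# Both a "cat?" head and a "dog?" head reuse the same shared features of the input.
```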

Deep Feedforward Network: Regularization

• Regularization: methods for reducing over-fitting.

• Over-fitting is a serious problem in deep learning.

• Method 6: Early Stopping

• We may want to stop early, before convergence.

• Validation set error.

• One possible method: stop if no better parameters are found in p (say 10) consecutive rounds of training.

• Simple to implement.

• Need some extra data.

• Use a second training run to fully utilize the data.

• Retraining: adopt the same number of training steps.

• Continued training: start from the obtained parameters.

while j < p:
    update θ for n steps
    if no smaller validation error:
        j = j + 1
    else:
        note the new parameters; j = 0

Introduction to Machine Learning Amo G. Tong 31
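The loop above written out as a runnable sketch; update_n_steps and validation_error are assumed callbacks supplied by the user (train for n steps and return the parameters; report the current validation-set error).

```python
def train_with_early_stopping(update_n_steps, validation_error, p=10, max_rounds=1000):
    """Early stopping: give up after p consecutive rounds without improvement."""
    best_err, best_params = float("inf"), None
    j, rounds = 0, 0
    while j < p and rounds < max_rounds:
        params = update_n_steps()            # update theta for n steps
        err = validation_error()
        if err < best_err:                   # smaller validation error found
            best_err, best_params, j = err, params, 0   # note the new parameters
        else:
            j += 1                           # no improvement this round
        rounds += 1
    return best_params, best_err
```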

Deep Feedforward Network: Regularization

• Regularization: methods for reducing over-fitting.

• Over-fitting is a serious problem in deep learning.

• Method 7: Dropout

• Bagging: train different models with different datasets.

• Bagging neural networks is not effective.

• Dropout:

• Idea: dynamically change the network structure.

• Mute units by setting their outputs to zero.

• Algorithm:

• Before each iteration, randomly mute units.

• Typically, drop an input unit with probability 0.2 and a hidden unit with probability 0.5.

• Pros: computationally cheap; independent of the training algorithm or model.

• Cons: the model size may need to be increased; does not work well for small data sets.

Introduction to Machine Learning Amo G. Tong 32
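A sketch of dropout masks applied during one forward pass; the “inverted dropout” rescaling by the keep probability is a common convention assumed here, not something specified on the slide.

```python
import numpy as np

def dropout_forward(x, params, rng, p_in=0.2, p_hidden=0.5):
    """One forward pass with dropout: units are muted by zeroing their outputs.

    params is a list of (W, b) layers; dividing by the keep probability means
    no extra rescaling is needed at test time (inverted dropout)."""
    keep = 1.0 - p_in
    h = x * (rng.random(x.shape) < keep) / keep           # drop input units
    for W, b in params[:-1]:
        h = np.maximum(0.0, W.T @ h + b)                  # hidden layer (ReLU)
        keep = 1.0 - p_hidden
        h = h * (rng.random(h.shape) < keep) / keep       # drop hidden units
    W, b = params[-1]
    return W.T @ h + b
```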

Deep Feedforward Network: Regularization

• Regularization: methods for reducing over-fitting.

• Over-fitting is a serious problem in deep learning.

• Method 7: Dropout

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15, 1929–1958.

Introduction to Machine Learning Amo G. Tong 33

Deep Feedforward Network: Regularization

• Regularization: methods for reducing over-fitting.

• Over-fitting is a serious problem in deep learning.

• Method 8: Adversarial training

• There can be inputs x_1 ≈ x_2 whose predictions are very different.

• Train on adversarially perturbed examples from the training set.

• Generative adversarial network (GAN)

Goodfellow, I. J., Shlens, J., and Szegedy, C. (2014). Explaining and harnessing adversarial examples. CoRR, abs/1412.6572.

Introduction to Machine Learning Amo G. Tong 34
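A hedged sketch in the spirit of the fast gradient sign method from the cited paper, written for a simple logistic-regression model so the gradient with respect to x is explicit; for a deep network, grad_x would come from backpropagation.

```python
import numpy as np

def fgsm_perturb(x, y, w, b, eps=0.1):
    """Fast-gradient-sign perturbation of x for a logistic-regression classifier.

    Loss J = -log P(y|x) with P(1|x) = sigmoid(w.x + b); the gradient w.r.t. x is
    (P(1|x) - y) * w, and the adversarial example is x + eps * sign(grad)."""
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))
    grad_x = (p - y) * w
    return x + eps * np.sign(grad_x)

# Adversarial training: also train on (fgsm_perturb(x, y, w, b), y) pairs.
```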

Convolutional Neural Networks

Convolutional Neural Networks

Introduction to Machine Learning Amo G. Tong 35

Convolutional Neural Networks

• One large class of neural networks.

• The main unit is the convolutional layer

• Replace the matrix multiplication by convolutional operation.

• Designed for processing grid-like inputs.

Introduction to Machine Learning Amo G. Tong 36

Convolutional Neural Networks

• Convolutional operation in math: (f ∗ g)(t) = ∫_{−∞}^{+∞} f(τ) g(t − τ) dτ

• An average of g(t), weighted by f(−τ).

• Convolutional operation in deep learning: replace the matrix multiplication with a convolution with a kernel K.

• h^(1) = g^(1)((W^(1))^T x + b^(1))

• h^(2) = g^(2)((W^(2))^T h^(1) + b^(2))

• …

• (W^(2))^T h^(1) → K ∗ h^(1)

• 1D example: for h^(1) = (a, b, c, d) and kernel K = (e, f), K ∗ h^(1) = (ae + bf, be + cf, ce + df, de + 0).

[Figure: a 2D example of sliding a kernel K over a 2D input h^(1).]

Introduction to Machine Learning Amo G. Tong 40
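A sketch reproducing the 1D example above; like most deep-learning “convolutions”, it actually computes a cross-correlation (no kernel flip) and pads the right end with zeros.

```python
import numpy as np

def conv1d(h, K):
    """output_i = sum_j K_j * h_{i+j}, with zero padding at the right end."""
    n, k = len(h), len(K)
    padded = np.concatenate([h, np.zeros(k - 1)])
    return np.array([padded[i:i + k] @ K for i in range(n)])

a, b, c, d, e, f = 1.0, 2.0, 3.0, 4.0, 10.0, 20.0
print(conv1d(np.array([a, b, c, d]), np.array([e, f])))
# [a*e + b*f, b*e + c*f, c*e + d*f, d*e + 0] = [ 50.  80. 110.  40.]
```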

Convolutional Neural Networks

• Convolutional operation in deep learning:

• h^(1) = g^(1)(K^(1) ∗ x)

• h^(2) = g^(2)(K^(2) ∗ h^(1))

• …..

• What are the advantages?

• Sparse interactions: each output depends only on a kernel-sized window of the input.

• Fewer parameters.

• Parameter sharing: one set of kernel parameters for each layer.

• Equivariance: the output changes in the same way as the input does.

Introduction to Machine Learning Amo G. Tong 41

Convolutional Neural Networks

• Convolutional operation in deep learning:

• ℎ(1) = 𝑔 1 (𝐾(1) ∗ 𝑥)

• ℎ(2) = 𝑔 2 (𝐾(2) ∗ ℎ(1))

• …..

• Typical structure of a convolutional network:

• Convolution layers+

• Activation functions (e.g. ReLU)+

• Pooling and Sampling

• Pooling layers:

• Replace each node with a summary of its nearby neighbors.

• E.g., Max Pooling: report the maximum output within a rectangular neighborhood.

• Why pooling?

• Theoretically, it makes the model approximately invariant to small translations.

• Intuitively, we care more about whether some feature exists than about exactly where it is.

Introduction to Machine Learning Amo G. Tong 42
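A small sketch of 1D max pooling; note how a small shift of the active inputs barely changes the pooled summary, which is the translation-invariance intuition above.

```python
import numpy as np

def max_pool_1d(x, width=2, stride=2):
    """Max pooling: report the maximum within each (non-overlapping) window."""
    return np.array([x[i:i + width].max() for i in range(0, len(x) - width + 1, stride)])

x = np.array([0.0, 1.0, 1.0, 0.3, 0.0, 0.0])
print(max_pool_1d(x))   # [1. 1. 0.]
```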


Recurrent Neural Network

“In this lecture, we are going to learn deep learning…..”

Recurrent Neural Network

Introduction to Machine Learning Amo G. Tong 46

Recurrent Neural Network

• One large class of neural networks.

• Designed for processing sequence data.

“In this lecture, we are going to learn deep learning…..”

[Figure: an unrolled RNN mapping the input sequence x_1, x_2, x_3, x_4 to outputs y_1, y_2, y_3, y_4.]

Introduction to Machine Learning Amo G. Tong 47


Recurrent Neural Network

• One large class of neural networks.

• Designed for processing sequence data.

[Figure: a typical pattern: at each step, the input updates a hidden state, the hidden state produces an output, and the output is compared against the truth by a loss.]

A typical pattern:

• Stationary model

• Parameter sharing

• Unlimited sequence length

• Powerful: can simulate a universal Turing machine.

Introduction to Machine Learning Amo G. Tong 49


Recurrent Neural Network

• One large class of neural networks.

• Designed for processing sequence data.

• Hidden transmission: a_t = b + W h_{t−1} + U x_t, h_t = tanh(a_t)

• Output: o_t = c + V h_t, y_out,t = softmax(o_t)

• Loss: cross-entropy.

Introduction to Machine Learning Amo G. Tong 52
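The recurrence above as a single numpy step; the parameter shapes are assumptions for illustration (U: hidden × input, W: hidden × hidden, V: output × hidden).

```python
import numpy as np

def rnn_step(x_t, h_prev, U, W, V, b, c):
    """One RNN step: a_t = b + W h_{t-1} + U x_t, h_t = tanh(a_t),
    o_t = c + V h_t, y_t = softmax(o_t)."""
    a_t = b + W @ h_prev + U @ x_t
    h_t = np.tanh(a_t)
    o_t = c + V @ h_t
    e = np.exp(o_t - o_t.max())       # stable softmax
    y_t = e / e.sum()
    return h_t, y_t
```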

Example: Character-Level Language Model (“hello”)

Vocabulary [h,e,l,o]

• Hidden transmission: a_t = b + W h_{t−1} + U x_t, h_t = tanh(a_t)

• Output: o_t = c + V h_t, y_out,t = softmax(o_t)

[Figure: the RNN unrolled over the one-hot-encoded input characters “h, e, l, l”; at each step the output distribution over [h, e, l, o] should put high probability on the next character: “e, l, l, o”.]

Introduction to Machine Learning Amo G. Tong 53
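A usage sketch of rnn_step from the previous block on the “hello” vocabulary; the parameters here are random (untrained) placeholders, so the printed predictions are meaningless until the model is trained with the cross-entropy loss against the targets “e, l, l, o”.

```python
import numpy as np

vocab = ["h", "e", "l", "o"]
one_hot = {ch: np.eye(len(vocab))[i] for i, ch in enumerate(vocab)}

# Hypothetical (untrained) parameters.
rng = np.random.default_rng(0)
n_hidden = 3
U = rng.normal(0, 0.5, (n_hidden, len(vocab)))
W = rng.normal(0, 0.5, (n_hidden, n_hidden))
V = rng.normal(0, 0.5, (len(vocab), n_hidden))
b, c = np.zeros(n_hidden), np.zeros(len(vocab))

h = np.zeros(n_hidden)
for ch in "hell":                                   # feed the input characters one at a time
    h, y = rnn_step(one_hot[ch], h, U, W, V, b, c)  # reuses rnn_step defined above
    print(ch, "->", vocab[int(np.argmax(y))])       # predicted next character
```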


Recurrent Neural Network

• One large class of neural networks.

• Designed for processing sequence data.

Output-guided hidden transmission (teacher forcing), with less information.

• Less powerful in terms of representability.

• Output may not carry sufficient information.

• Training can be parallelized.

Introduction to Machine Learning Amo G. Tong 55


Summary

• General Deep Feedforward Network.

• Training

• Design: output units, hidden units, architecture.

• Regularization

• Convolutional Neural Network

• Convolution Operator

• Grid-like input

• Extract features layer by layer

• Recurrent Neural Network

• Sequence data processing

• Hidden state transmission