ESS391H The Application of Machine Learning and Artificial ...The Application of Machine Learning...

The Application of Machine Learning and Artificial

Neural Networks in Geosciences

Lecture 10 – Deep Learning: Neural Networks

University of Toronto, Department of Earth Sciences,

Hosein Shahnas

1

ESS391H

2

One-vs.-All (OvA)

The perceptron algorithm can be extended to multi-class classification-for

example, through the One-vs.-All (OvA) technique.

Class 1 Class 0

For 3-class problem

1

0 1 0

0 1

Multiclass Problems

5

0 1

4

2

3

6

7

8

9

3

Multiclass Problems

Softmax

This is a generalization of the logistic function to compute meaningful class-

probabilities in multi-class settings (multinomial logistic regression).

F(𝑧𝑗) = 1

1+𝑒−𝑧𝑗

𝑆𝑜𝑓𝑡𝑚𝑎𝑥 𝑧𝑗 =𝑒𝑧𝑗

𝑒𝑧𝑘𝐾𝑘=1

𝐾: 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑐𝑙𝑎𝑠𝑠𝑒𝑠

𝑷𝟏

𝑷𝟐

𝑷𝟑

P3

P1

P2

𝑓𝑜𝑟 𝑗 = 1, 2, 3

2D to 1D array

n images, one per line

1

2

3

…

…

…

…

n

n samples (images)

2D: 28x28 pixels

1D: 784

784

Learning

Image Recognition

Handwritten Digits Classification

Learning numbers 0-9

https://www.google.ca/url?sa=i&rct=j&q=&esrc=s&source=images&cd=&cad=rja&uact=8&ved=0ahUKEwiH9u7P1tnQAhVH0oMKHTOeBhwQjRwIBw&url=http://neuralnetworksanddeeplearning.com/chap1.html&psig=AFQjCNFFX-jtEHJVJgx9SFOp-vUD6CyuUg&ust=1480911267378023

5

Image Recognition

Q

𝑷𝟎

𝑷𝟏

𝑷𝟐

𝑷𝟑

𝑷𝟒

𝑷𝟓

𝑷𝟔

𝑷𝟕

𝑷𝟖

𝑷𝟗

=

𝟎𝟎𝟎𝟎𝟎𝟎𝟎𝟏𝟎𝟎

=

𝑆𝑜𝑓𝑡𝑚𝑎𝑥 𝑧𝑗 =𝑒𝑧𝑗

𝑒𝑧𝑘𝐾𝑘=1

, 𝑗 = 1, 2, . . , 10, 𝐾 = 10 𝑃𝑖𝑖 = 1

𝑄:𝑄𝑢𝑎𝑛𝑡𝑖𝑧𝑒𝑟

6

Let’s explore what we can do with combinations of perceptrons rather than

single ones.

Let’s consider the classification problem for which the perceptron algorithm

failed.

This problem cannot be classified by a single perceptron. But what about

with two perceptrons?

Deep Learning - Combining Perceptrons

x1

x2

x1

x2

h1 h2

x1

x2

7

+

+ -

-

𝝈 ≡ 𝝓(𝒛) = 𝟎, 𝒛 < 𝟎 𝟏, 𝒛 ≥ 𝟎

𝒛 = 𝒘𝒊𝒊 𝒙𝒊 𝝈 = 𝝓 𝟐𝟎 ∗ 𝟎 + 𝟐𝟎 ∗ 𝟎 − 𝟏𝟎 = 0 𝝈 = 𝝓 −𝟐𝟎 ∗ 𝟎 − 𝟐𝟎 ∗ 𝟎 + 𝟑𝟎 = 1 𝝈 = 𝝓 𝟐𝟎 ∗ 𝟎 + 𝟐𝟎 ∗ 𝟏 − 𝟑𝟎 = 0

𝝈 = 𝝓 𝟐𝟎 ∗ 𝟏 + 𝟐𝟎 ∗ 𝟏 − 𝟏𝟎 = 1 𝝈 = 𝝓 −𝟐𝟎 ∗ 𝟏 − 𝟐𝟎 ∗ 𝟏 + 𝟑𝟎 = 0 𝝈 = 𝝓 𝟐𝟎 ∗ 𝟏 + 𝟐𝟎 ∗ 𝟎 − 𝟑𝟎 = 0

𝝈 = 𝝓 𝟐𝟎 ∗ 𝟎 + 𝟐𝟎 ∗ 𝟏 − 𝟏𝟎 = 1 𝝈 = 𝝓 −𝟐𝟎 ∗ 𝟎 − 𝟐𝟎 ∗ 𝟏 + 𝟑𝟎 = 1 𝝈 = 𝝓 𝟐𝟎 ∗ 𝟏 + 𝟐𝟎 ∗ 𝟏 − 𝟑𝟎 = 1

𝝈 = 𝝓 𝟐𝟎 ∗ 𝟏 + 𝟐𝟎 ∗ 𝟎 − 𝟏𝟎 = 1 𝝈 = 𝝓 −𝟐𝟎 ∗ 𝟏 − 𝟐𝟎 ∗ 𝟎 + 𝟑𝟎 = 1 𝝈 = 𝝓 𝟐𝟎 ∗ 𝟏 + 𝟐𝟎 ∗ 𝟏 − 𝟑𝟎 = 1

h1 h2 y

Or Nand And

x2

x1

Deep Learning - Combining Perceptrons

y

x2

x1 h1

h2

w0 = -10

w1 = 20

w2 = 20

w1 = -20

w2 = -20

w0 = 30

w0 = -30

w1 = 20

w2 = 20

8

- - + + + + - -

- - + + + + - -

- - + + + + - -

8-Perceptrons 16-Perceptrons

We solved this problem using feature-transformation before.

We can use neural networks.

Combining Perceptrons - Multilayer Perceptron

9

There are many perceptrons (and therefore many parameters), so optimization

might be a problem (remember that for single perceptron when the data were

nonlinear, we had convergence problem).

Solution:

1- We chose a soft threshold (tanh) rather than a hard threshold (step function).

2- We use SGD

3- And an efficient way to find weight factors w.

𝝓 𝒛 = tanh(z) = 𝒆𝒛 − 𝒆−𝒛

𝒆𝒛+ 𝒆−𝒛

Step function

tanh

Input Hidden layers output

0 1≤ 𝒍 < 𝑳 L

Optimization

10

Neural networks can be thought as Learned Nonlinear Transform. Note that the

nonlinear transformation of features (e.g., polynomial, RBF, etc.) are not learned

transformation.

Since in the hidden layers the features are higher order features (leaned

features), then we can implement a better learning. Indeed the network looks for

weight factors for a proper transform the factors that fits data.

Remark

Hidden layers : higher order features

or learned features

Raw input: features of

dimension d

Output

11

Dropout

Dropout Dropout is a regularization technique for neural network models (Srivastava, et

al. , 2014). This is a simple way to prevent neural networks from overfitting.

Some key points: 1) Use 20%-50% dropout

2) Dropout with larger network in general provides better performance, giving the

model more of an opportunity to learn independent representations.

3) Dropout can be used on visible (input) as well as hidden layers

Standard neural network Neural network with dropout

12

Deep Learning - A Simple Network

Ex. 1: MNIST Database - Handwritten digits

A simple sequential deep learning model for handwritten digits recognition using Keras and

TensorFlow,

784

512

0.2

Input 10

Output

Relu Sigmoid

28×28

tf.keras.layers.Flatten(input_shape=(28, 28)

tf.keras.layers.Dense(512, activation=tf.nn.relu)

tf.keras.layers.Dropout(0.2)

tf.keras.layers.Dense(10, activation=tf.nn.softmax)

13

Deep Learning - A Simple Network

Ex. 2: Pima Indians onset of diabetes dataset

A simple sequential deep learning model for predicting handwritten digits using

Keras,

8 8 12

Input

1

Relu ReLu Sigmoid

model.add(Dense(8, input_dim=8, activation='relu'))

model.add(Dense(12, activation='relu'))

model.add(Dense(1, activation='sigmoid'))

𝒙𝟏

𝒙𝟖

Output

Single Neuron

Binary Class

Shallow Learning

Binary Class

Deep neural Network

Shallow Learning

Multi Class Multi layer

Multi Class

14

Deep Learning

15

Some error metrics MSE (Mean Squared Error): The average error based on the square of the

difference between the original and predicted values.

MAE (Mean Absolute Error): The average error based on the absolute values of

the difference between the original and predicted values.

RMSE (Root Mean Squared Error): The error rate by the square root of MSE.

MSLE (Mean Squared Logarithmic Error): The mean error over the log-

transformed values of the original and the predicted values.

R-squared (Coefficient of determination): A measure of how well the values fit

compared to the original values (0-1). The higher the value represents higher

accuracy.

𝑚𝑠𝑒 = 1

𝑁 𝑦𝑖 − 𝑦 2𝑁

1 𝑦 :𝑚𝑒𝑎𝑛 𝑣𝑎𝑙𝑢𝑒

𝑚𝑎𝑒 = 1

𝑁 𝑦𝑖 − 𝑦 𝑁

1 𝑦 :𝑚𝑒𝑎𝑛 𝑣𝑎𝑙𝑢𝑒

𝑟𝑚𝑠𝑒 = 𝑚𝑠𝑒

𝑚𝑠𝑙𝑒 =1

𝑁 log (𝑦𝑖 + 1) − log (𝑦 + 1) 2𝑁

1

𝑅2 = 1 − 𝑦𝑖−𝑦 2

𝑦𝑖−𝑦 2

Deep Learning

16

Deep Learning - Why more layers?

17

Deep Learning - Large Number of parameters

I: Image

K: ‘Filter‘ or ‘Kernel’ or ‘Feature Detector’

I*K: Convolved Feature (activation map)

𝑾𝒄 = 𝑾−𝑭𝑾

𝑺𝑾 + 1 The size of the convolved field (I*K)

S: stride

Ex.: 𝑾𝒄 = (7-3)/1+1 = 5

𝑾𝒄 F

W

Volume image

18

Convolutional Networks (CNN)

𝑓 ∗ 𝑔 𝑡 = 𝑓 𝜏 𝑔 𝑡 − 𝜏 𝑑𝜏∞

−∞

𝐼 ∗ 𝐾 𝑥𝑦 = 𝐾𝑖𝑗 ∙ 𝐼𝑥+𝑖−1,𝑦+𝑗−1

𝑤

𝑗=1

ℎ

𝑖=1

http://xrds.acm.org/blog/wp-content/uploads/2016/06/Figure1.png

19

https://medium.com/@ageitgey/machine-learning-is-fun-part-3-deep-learning-and-convolutional-neural-networks-f40359318721

Convolutional Networks (CNN) - Filters



























I𝐝𝐞𝐧𝐭𝐢𝐭𝐲 𝟎 𝟎 𝟎𝟎 𝟏 𝟎𝟎 𝟎 𝟎

Edge detection

−𝟏 − 𝟏 − 𝟏−𝟏 𝟖 − 𝟏−𝟏 − 𝟏 − 𝟏

S𝐡𝐚𝐫𝐩𝐞𝐧 𝟎 − 𝟏 𝟎−𝟏 𝟓 − 𝟏𝟎 − 𝟏 𝟎

Box blur 𝟏

𝟗

𝟏 𝟏 𝟏𝟏 𝟏 𝟏𝟏 𝟏 𝟏

Gaussian blur 𝟏

𝟏𝟔

𝟏 𝟐 𝟏𝟐 𝟒 𝟐𝟏 𝟐 𝟏

20

Convolutional Networks (CNN) - Filters

21

Stride and Padding

In convolving an image:

1) The outputs shrink

2) The information on corners of the image is lost

This can be prevented by padding.

32× 32 ×3

36

36

32

32

Conv.

22

Stride and Padding - Color Image

𝑯𝒄 = 𝑯−𝑭𝑯+𝑷

𝑺𝑯 + 1

𝑾𝒄 = 𝑾−𝑭𝑾+𝑷

𝑺𝑾 + 1

http://machinelearninguru.com/computer_vision/basics/convolution/convolution_layer.html

http://machinelearninguru.com/computer_vision/basics/convolution/convolution_layer.html

23

Stride and Padding

W

H

F

P

S

Stride = 1

Stride = 2

𝐂𝐨𝐧𝐯 𝐑𝐞𝐋𝐔

𝐏𝐨𝐨𝐥 ……… . .

𝐂𝐥𝐚𝐬𝐬𝐟𝐢𝐜𝐚𝐭𝐢𝐨𝐧

FReLU 𝑧 = 𝑚𝑎𝑥 0, 𝑧

Rectified Linear Unit (ReLU)

FSReLU 𝑧 = log(1+exp(z))

Analytic approx.


𝐏𝐨𝐨𝐥


𝐏𝐨𝐨𝐥

Z = X.W = wi 𝑥𝑖𝑛1 = w1 x1 + w2 x2 + … + wn xn


24

Convolutional Networks

Max Pooling / Downsampling with CNNs

https://en.wikipedia.org/wiki/File:Rectifier_and_softplus_functions.svg

https://www.google.ca/imgres?imgurl=http://inspirehep.net/record/1300728/files/figures_ArtificialNeuron.png&imgrefurl=http://inspirehep.net/record/1300728/plots&docid=uV74ucaUXTbbRM&tbnid=WvXKGQdlw3xV9M:&vet=10ahUKEwiN2s-6qovcAhVG0oMKHepXD7AQMwhpKCswKw..i&w=1050&h=650&bih=588&biw=1536&q=artificial neuron pictures&ved=0ahUKEwiN2s-6qovcAhVG0oMKHepXD7AQMwhpKCswKw&iact=mrc&uact=8



FSoftmax(zj) = 𝑒−𝑧𝑗

𝑒−𝑧𝑚𝑘𝑚

𝐤: Number of classes

𝐒𝐨𝐟𝐭𝐦𝐚𝐱 𝐟𝐮𝐧𝐜𝐭𝐢𝐨𝐧


Rectified Linear Unit (ReLU)

FSReLU 𝑧 = log(1+exp(z))

Analytic approx.

𝐒𝐨𝐟𝐭𝐦𝐚𝐱

𝐑𝐞𝐋𝐔 𝐑𝐞𝐋𝐔

25


https://medium.com/data-science-group-iitr/building-a-convolutional-neural-network-in-python-with-tensorflow-d251c3ca8117 `

𝐒𝐢𝐱 𝐟𝐢𝐥𝐭𝐞𝐫𝐬

https://en.wikipedia.org/wiki/File:Rectifier_and_softplus_functions.svg




https://medium.com/data-science-group-iitr/building-a-convolutional-neural-network-in-python-with-tensorflow-d251c3ca8117


























26

Network Architecture

Convolution, Filter shape:(5,5,6), Stride=1, Padding=’SAME’

Max pooling (2x2), Window shape:(2,2), Stride=2, Padding=’Same’

ReLU

Convolution, Filter shape:(5,5,16), Stride=1, Padding=’SAME’

Max pooling (2x2), Window shape:(2,2), Stride=2, Padding=’Same’

ReLU

Fully Connected Layer (128)

ReLU

Fully Connected Layer (10)

Softmax


27

3× 𝟑 𝑪𝒐𝒏𝒗.

𝟐𝟖

2× 𝟐 𝑷𝒐𝒐𝒍.

flatten

128

128 ×0.2

10

A Simple CNN Model

Output

Fully connected layers

Ex. 3: MNIST Database - Handwritten digits

A simple CNN deep learning model for handwritten digits recognition using Keras,

Max pooling: Take the maximum number when pooling

Dropout

Dense

Dense 𝐑𝐞𝐋𝐔 𝐒𝐨𝐟𝐭𝐦𝐚𝐱

ESS391H The Application of Machine Learning and Artificial ...The Application of Machine Learning...

Documents

Transcript of ESS391H The Application of Machine Learning and Artificial ...The Application of Machine Learning...