Model detecting learning styles with artificial neural network
ESS391H The Application of Machine Learning and Artificial ...The Application of Machine Learning...
Transcript of ESS391H The Application of Machine Learning and Artificial ...The Application of Machine Learning...
The Application of Machine Learning and Artificial
Neural Networks in Geosciences
Lecture 10 – Deep Learning: Neural Networks
University of Toronto, Department of Earth Sciences,
Hosein Shahnas
1
ESS391H
2
One-vs.-All (OvA)
The perceptron algorithm can be extended to multi-class classification-for
example, through the One-vs.-All (OvA) technique.
Class 1 Class 0
For 3-class problem
1
0 1 0
0 1
Multiclass Problems
5
0 1
4
2
3
6
7
8
9
3
Multiclass Problems
Softmax
This is a generalization of the logistic function to compute meaningful class-
probabilities in multi-class settings (multinomial logistic regression).
F(𝑧𝑗) = 1
1+𝑒−𝑧𝑗
𝑆𝑜𝑓𝑡𝑚𝑎𝑥 𝑧𝑗 =𝑒𝑧𝑗
𝑒𝑧𝑘𝐾𝑘=1
𝐾: 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑐𝑙𝑎𝑠𝑠𝑒𝑠
𝑷𝟏
𝑷𝟐
𝑷𝟑
P3
P1
P2
𝑓𝑜𝑟 𝑗 = 1, 2, 3
2D to 1D array
n images, one per line
1
2
3
…
…
…
…
n
n samples (images)
2D: 28x28 pixels
1D: 784
784
Learning
Image Recognition
Handwritten Digits Classification
Learning numbers 0-9
5
Image Recognition
Q
𝑷𝟎
𝑷𝟏
𝑷𝟐
𝑷𝟑
𝑷𝟒
𝑷𝟓
𝑷𝟔
𝑷𝟕
𝑷𝟖
𝑷𝟗
=
𝟎𝟎𝟎𝟎𝟎𝟎𝟎𝟏𝟎𝟎
=
𝑆𝑜𝑓𝑡𝑚𝑎𝑥 𝑧𝑗 =𝑒𝑧𝑗
𝑒𝑧𝑘𝐾𝑘=1
, 𝑗 = 1, 2, . . , 10, 𝐾 = 10 𝑃𝑖𝑖 = 1
𝑄:𝑄𝑢𝑎𝑛𝑡𝑖𝑧𝑒𝑟
6
Let’s explore what we can do with combinations of perceptrons rather than
single ones.
Let’s consider the classification problem for which the perceptron algorithm
failed.
This problem cannot be classified by a single perceptron. But what about
with two perceptrons?
Deep Learning - Combining Perceptrons
x1
x2
x1
x2
h1 h2
x1
x2
7
+
+ -
-
𝝈 ≡ 𝝓(𝒛) = 𝟎, 𝒛 < 𝟎 𝟏, 𝒛 ≥ 𝟎
𝒛 = 𝒘𝒊𝒊 𝒙𝒊 𝝈 = 𝝓 𝟐𝟎 ∗ 𝟎 + 𝟐𝟎 ∗ 𝟎 − 𝟏𝟎 = 0 𝝈 = 𝝓 −𝟐𝟎 ∗ 𝟎 − 𝟐𝟎 ∗ 𝟎 + 𝟑𝟎 = 1 𝝈 = 𝝓 𝟐𝟎 ∗ 𝟎 + 𝟐𝟎 ∗ 𝟏 − 𝟑𝟎 = 0
𝝈 = 𝝓 𝟐𝟎 ∗ 𝟏 + 𝟐𝟎 ∗ 𝟏 − 𝟏𝟎 = 1 𝝈 = 𝝓 −𝟐𝟎 ∗ 𝟏 − 𝟐𝟎 ∗ 𝟏 + 𝟑𝟎 = 0 𝝈 = 𝝓 𝟐𝟎 ∗ 𝟏 + 𝟐𝟎 ∗ 𝟎 − 𝟑𝟎 = 0
𝝈 = 𝝓 𝟐𝟎 ∗ 𝟎 + 𝟐𝟎 ∗ 𝟏 − 𝟏𝟎 = 1 𝝈 = 𝝓 −𝟐𝟎 ∗ 𝟎 − 𝟐𝟎 ∗ 𝟏 + 𝟑𝟎 = 1 𝝈 = 𝝓 𝟐𝟎 ∗ 𝟏 + 𝟐𝟎 ∗ 𝟏 − 𝟑𝟎 = 1
𝝈 = 𝝓 𝟐𝟎 ∗ 𝟏 + 𝟐𝟎 ∗ 𝟎 − 𝟏𝟎 = 1 𝝈 = 𝝓 −𝟐𝟎 ∗ 𝟏 − 𝟐𝟎 ∗ 𝟎 + 𝟑𝟎 = 1 𝝈 = 𝝓 𝟐𝟎 ∗ 𝟏 + 𝟐𝟎 ∗ 𝟏 − 𝟑𝟎 = 1
h1 h2 y
Or Nand And
x2
x1
Deep Learning - Combining Perceptrons
y
x2
x1 h1
h2
w0 = -10
w1 = 20
w2 = 20
w1 = -20
w2 = -20
w0 = 30
w0 = -30
w1 = 20
w2 = 20
8
- - + + + + - -
- - + + + + - -
- - + + + + - -
8-Perceptrons 16-Perceptrons
We solved this problem using feature-transformation before.
We can use neural networks.
Combining Perceptrons - Multilayer Perceptron
9
There are many perceptrons (and therefore many parameters), so optimization
might be a problem (remember that for single perceptron when the data were
nonlinear, we had convergence problem).
Solution:
1- We chose a soft threshold (tanh) rather than a hard threshold (step function).
2- We use SGD
3- And an efficient way to find weight factors w.
𝝓 𝒛 = tanh(z) = 𝒆𝒛 − 𝒆−𝒛
𝒆𝒛+ 𝒆−𝒛
Step function
tanh
Input Hidden layers output
0 1≤ 𝒍 < 𝑳 L
Optimization
10
Neural networks can be thought as Learned Nonlinear Transform. Note that the
nonlinear transformation of features (e.g., polynomial, RBF, etc.) are not learned
transformation.
Since in the hidden layers the features are higher order features (leaned
features), then we can implement a better learning. Indeed the network looks for
weight factors for a proper transform the factors that fits data.
Remark
Hidden layers : higher order features
or learned features
Raw input: features of
dimension d
Output
11
Dropout
Dropout Dropout is a regularization technique for neural network models (Srivastava, et
al. , 2014). This is a simple way to prevent neural networks from overfitting.
Some key points: 1) Use 20%-50% dropout
2) Dropout with larger network in general provides better performance, giving the
model more of an opportunity to learn independent representations.
3) Dropout can be used on visible (input) as well as hidden layers
Standard neural network Neural network with dropout
12
Deep Learning - A Simple Network
Ex. 1: MNIST Database - Handwritten digits
A simple sequential deep learning model for handwritten digits recognition using Keras and
TensorFlow,
784
512
0.2
Input 10
Output
Relu Sigmoid
28×28
tf.keras.layers.Flatten(input_shape=(28, 28)
tf.keras.layers.Dense(512, activation=tf.nn.relu)
tf.keras.layers.Dropout(0.2)
tf.keras.layers.Dense(10, activation=tf.nn.softmax)
13
Deep Learning - A Simple Network
Ex. 2: Pima Indians onset of diabetes dataset
A simple sequential deep learning model for predicting handwritten digits using
Keras,
8 8 12
Input
1
Relu ReLu Sigmoid
model.add(Dense(8, input_dim=8, activation='relu'))
model.add(Dense(12, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
𝒙𝟏
𝒙𝟖
Output
Single Neuron
Binary Class
Shallow Learning
Binary Class
Deep neural Network
Shallow Learning
Multi Class Multi layer
Multi Class
14
Deep Learning
15
Some error metrics MSE (Mean Squared Error): The average error based on the square of the
difference between the original and predicted values.
MAE (Mean Absolute Error): The average error based on the absolute values of
the difference between the original and predicted values.
RMSE (Root Mean Squared Error): The error rate by the square root of MSE.
MSLE (Mean Squared Logarithmic Error): The mean error over the log-
transformed values of the original and the predicted values.
R-squared (Coefficient of determination): A measure of how well the values fit
compared to the original values (0-1). The higher the value represents higher
accuracy.
𝑚𝑠𝑒 = 1
𝑁 𝑦𝑖 − 𝑦 2𝑁
1 𝑦 :𝑚𝑒𝑎𝑛 𝑣𝑎𝑙𝑢𝑒
𝑚𝑎𝑒 = 1
𝑁 𝑦𝑖 − 𝑦 𝑁
1 𝑦 :𝑚𝑒𝑎𝑛 𝑣𝑎𝑙𝑢𝑒
𝑟𝑚𝑠𝑒 = 𝑚𝑠𝑒
𝑚𝑠𝑙𝑒 =1
𝑁 log (𝑦𝑖 + 1) − log (𝑦 + 1) 2𝑁
1
𝑅2 = 1 − 𝑦𝑖−𝑦 2
𝑦𝑖−𝑦 2
Deep Learning
16
Deep Learning - Why more layers?
17
Deep Learning - Large Number of parameters
I: Image
K: ‘Filter‘ or ‘Kernel’ or ‘Feature Detector’
I*K: Convolved Feature (activation map)
𝑾𝒄 = 𝑾−𝑭𝑾
𝑺𝑾 + 1 The size of the convolved field (I*K)
S: stride
Ex.: 𝑾𝒄 = (7-3)/1+1 = 5
𝑾𝒄 F
W
Volume image
18
Convolutional Networks (CNN)
𝑓 ∗ 𝑔 𝑡 = 𝑓 𝜏 𝑔 𝑡 − 𝜏 𝑑𝜏∞
−∞
𝐼 ∗ 𝐾 𝑥𝑦 = 𝐾𝑖𝑗 ∙ 𝐼𝑥+𝑖−1,𝑦+𝑗−1
𝑤
𝑗=1
ℎ
𝑖=1
19
https://medium.com/@ageitgey/machine-learning-is-fun-part-3-deep-learning-and-convolutional-neural-networks-f40359318721
Convolutional Networks (CNN) - Filters
I𝐝𝐞𝐧𝐭𝐢𝐭𝐲 𝟎 𝟎 𝟎𝟎 𝟏 𝟎𝟎 𝟎 𝟎
Edge detection
−𝟏 − 𝟏 − 𝟏−𝟏 𝟖 − 𝟏−𝟏 − 𝟏 − 𝟏
S𝐡𝐚𝐫𝐩𝐞𝐧 𝟎 − 𝟏 𝟎−𝟏 𝟓 − 𝟏𝟎 − 𝟏 𝟎
Box blur 𝟏
𝟗
𝟏 𝟏 𝟏𝟏 𝟏 𝟏𝟏 𝟏 𝟏
Gaussian blur 𝟏
𝟏𝟔
𝟏 𝟐 𝟏𝟐 𝟒 𝟐𝟏 𝟐 𝟏
20
Convolutional Networks (CNN) - Filters
21
Stride and Padding
In convolving an image:
1) The outputs shrink
2) The information on corners of the image is lost
This can be prevented by padding.
32× 32 ×3
36
36
32
32
Conv.
22
Stride and Padding - Color Image
𝑯𝒄 = 𝑯−𝑭𝑯+𝑷
𝑺𝑯 + 1
𝑾𝒄 = 𝑾−𝑭𝑾+𝑷
𝑺𝑾 + 1
http://machinelearninguru.com/computer_vision/basics/convolution/convolution_layer.html
23
Stride and Padding
W
H
F
P
S
Stride = 1
Stride = 2
𝐂𝐨𝐧𝐯 𝐑𝐞𝐋𝐔
𝐏𝐨𝐨𝐥 ……… . .
𝐂𝐥𝐚𝐬𝐬𝐟𝐢𝐜𝐚𝐭𝐢𝐨𝐧
FReLU 𝑧 = 𝑚𝑎𝑥 0, 𝑧
Rectified Linear Unit (ReLU)
FSReLU 𝑧 = log(1+exp(z))
Analytic approx.
𝐂𝐨𝐧𝐯 𝐑𝐞𝐋𝐔
𝐏𝐨𝐨𝐥
𝐂𝐨𝐧𝐯 𝐑𝐞𝐋𝐔
𝐏𝐨𝐨𝐥
Z = X.W = wi 𝑥𝑖𝑛1 = w1 x1 + w2 x2 + … + wn xn
FReLU 𝑧 = 𝑚𝑎𝑥 0, 𝑧
24
Convolutional Networks
Max Pooling / Downsampling with CNNs
FSoftmax(zj) = 𝑒−𝑧𝑗
𝑒−𝑧𝑚𝑘𝑚
𝐤: Number of classes
𝐒𝐨𝐟𝐭𝐦𝐚𝐱 𝐟𝐮𝐧𝐜𝐭𝐢𝐨𝐧
FReLU 𝑧 = 𝑚𝑎𝑥 0, 𝑧
Rectified Linear Unit (ReLU)
FSReLU 𝑧 = log(1+exp(z))
Analytic approx.
𝐒𝐨𝐟𝐭𝐦𝐚𝐱
𝐑𝐞𝐋𝐔 𝐑𝐞𝐋𝐔
25
Convolutional Networks
https://medium.com/data-science-group-iitr/building-a-convolutional-neural-network-in-python-with-tensorflow-d251c3ca8117 `
𝐒𝐢𝐱 𝐟𝐢𝐥𝐭𝐞𝐫𝐬
26
Network Architecture
Convolution, Filter shape:(5,5,6), Stride=1, Padding=’SAME’
Max pooling (2x2), Window shape:(2,2), Stride=2, Padding=’Same’
ReLU
Convolution, Filter shape:(5,5,16), Stride=1, Padding=’SAME’
Max pooling (2x2), Window shape:(2,2), Stride=2, Padding=’Same’
ReLU
Fully Connected Layer (128)
ReLU
Fully Connected Layer (10)
Softmax
Convolutional Networks
27
3× 𝟑 𝑪𝒐𝒏𝒗.
𝟐𝟖
2× 𝟐 𝑷𝒐𝒐𝒍.
flatten
128
128 ×0.2
10
A Simple CNN Model
Output
Fully connected layers
Ex. 3: MNIST Database - Handwritten digits
A simple CNN deep learning model for handwritten digits recognition using Keras,
Max pooling: Take the maximum number when pooling
Dropout
Dense
Dense 𝐑𝐞𝐋𝐔 𝐒𝐨𝐟𝐭𝐦𝐚𝐱