Transcript of Wulfram Gerstner, Artificial Neural Networks: Lecture 3
Wulfram Gerstner, EPFL, Lausanne, Switzerland. Artificial Neural Networks: Lecture 3
Statistical classification by deep networks
Objectives for today:
- The cross-entropy error is the optimal loss function for classification tasks
- The sigmoidal (softmax) is the optimal output unit for classification tasks
- Multi-class problems and "1-hot coding"
- Under certain conditions we may interpret the output as a probability
- The rectified linear unit (ReLU) for hidden layers
Reading for this lecture:
Bishop 2006, Ch. 4.2 and 4.3, Pattern Recognition and Machine Learning
or Bishop 1995, Ch. 6.7 - 6.9, Neural Networks for Pattern Recognition
or Goodfellow et al., 2016, Ch. 5.5, 6.2, and 3.13 of Deep Learning
Miniproject 1: soon!
You will work with
- regularization methods
- cross-entropy error function
- sigmoidal (softmax) output
- rectified linear hidden units
- 1-hot coding for multiclass
- batch normalization
- Adam optimizer
(covered last week, this week, and next week)
Review: Data base for Supervised learning (single output)
The data base consists of P data points $\{(\mathbf{x}^\mu, t^\mu),\ 1 \le \mu \le P\}$: each input $\mathbf{x}^\mu$ (e.g., an image) comes with a target output, $t^\mu = 1$ for "car = yes" and $t^\mu = 0$ for "car = no".
review: Supervised learning
An input $\mathbf{x}^\mu$ (e.g., an image showing a car) is fed into the classifier, which produces the output $\hat{y}^\mu = 1$; the teacher provides the target output $t^\mu = 1$.
review: Artificial Neural Networks for classification
(Figure: a network with weights $w_{ij}^{(1)}$ maps an input $\mathbf{x}^\mu \in \mathbb{R}^{N+1}$ to outputs for the classes "car" and "dog".)
Aim of learning: adjust the connections such that the output is correct for each input image, even new ones.
Review: Multilayer Perceptron
(Figure: a multilayer perceptron. The input layer and the hidden layer each include a constant bias input of $-1$; hidden units $x_j^{(1)}$ feed, via weights $w_{1,j}^{(2)}$, the output units $\hat{y}_1^\mu, \hat{y}_2^\mu$. Each unit applies a sigmoidal gain function $g(a)$ that rises from 0 to 1 as a function of its total input $a$.)
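To make the review concrete, here is a minimal NumPy sketch of the forward pass of such a multilayer perceptron. The function names and the convention of appending a constant $-1$ bias input are illustrative assumptions, not prescribed by the slides.

```python
import numpy as np

def sigmoid(a):
    """Sigmoidal gain function g(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def mlp_forward(x, W1, W2):
    """Forward pass of a one-hidden-layer perceptron.
    x  : input vector, shape (N,)
    W1 : hidden-layer weights, shape (H, N + 1); last column acts on the bias input
    W2 : output-layer weights, shape (K, H + 1)
    Returns the output vector (y_hat_1, ..., y_hat_K).
    """
    x_ext = np.append(x, -1.0)      # constant bias input
    h = sigmoid(W1 @ x_ext)         # hidden activations x_j^(1)
    h_ext = np.append(h, -1.0)      # bias input of the hidden layer
    return sigmoid(W2 @ h_ext)      # outputs

rng = np.random.default_rng(0)
x = rng.normal(size=4)              # example input with N = 4 components
W1 = rng.normal(size=(3, 5))        # 3 hidden units
W2 = rng.normal(size=(2, 4))        # 2 outputs (e.g. "car", "dog")
print(mlp_forward(x, W1, W2))       # two values between 0 and 1
```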
Review: Example MNIST
- images 28x28
- labels: 0, ..., 9
- 250 writers
- 60 000 images in training set
(Figure: MNIST data samples. Picture: Goodfellow et al., 2016. Data base: http://yann.lecun.com/exdb/mnist/)
review: data base is noisy
- training data is always noisy
- the future data has different noise
- the classifier must extract the essence, not fit the noise!
(Figure: an ambiguous MNIST digit, "9 or 4?". What might be a 9 for reader A might be a 4 for reader B.)
Question for today
May we interpret the outputs of our network as a probability?
(Figure: a network mapping an input image to outputs for the classes "4" and "9".)
Statistical Classification by Deep Networks
1. The statistical view: generative model
1. The statistical view
(Figure: the multilayer network with weights $w_{ij}^{(1)}$, $w_{1,j}^{(2)}$, hidden units $x_j^{(1)}$, input $\mathbf{x}^\mu \in \mathbb{R}^{N+1}$, and outputs $\hat{y}_1^\mu, \hat{y}_2^\mu, \dots$ for the classes "car", "dog", "other".)
Idea: interpret the output as the probability that the novel input pattern should be classified as class k:
$\hat{y}_k^\mu = P(C_k|\mathbf{x}^\mu)$ for a pattern from the data base,
$\hat{y}_k = P(C_k|\mathbf{x})$ for an arbitrary novel pattern.
1. The statistical view: single class
(Figure: the network with a single output $\hat{y}_1$.)
Take the output $\hat{y}_1$ and generate predicted labels $\hat{t}_1$ probabilistically:
$\hat{t}_1 = +1$ with probability $p_+ = \hat{y}_1$,
$\hat{t}_1 = 0$ with probability $p_- = 1 - \hat{y}_1$.
This is a generative model for the class label, with $\hat{y}_1 = P(C_1|\mathbf{x}) = P(\hat{t}_1 = 1|\mathbf{x})$.
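A minimal sketch of this generative model for the class label: draw $\hat{t}_1 = 1$ with probability $\hat{y}_1$ and $\hat{t}_1 = 0$ otherwise. The function name and the example value $\hat{y}_1 = 0.8$ are illustrative choices, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_label(y_hat_1):
    """Generative model for the class label:
    t_hat_1 = 1 with probability y_hat_1, else t_hat_1 = 0."""
    return 1 if rng.random() < y_hat_1 else 0

labels = [generate_label(0.8) for _ in range(1000)]
print(np.mean(labels))   # close to 0.8: the output acts as P(t_hat_1 = 1 | x)
```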
Statistical Classification by Deep Networks
1. The statistical view: generative model
2. The likelihood of data under a model
2. The likelihood of a model (given data)
Overall aim: What is the probability that my set of P data points $\{(\mathbf{x}^\mu, t^\mu),\ 1 \le \mu \le P\}$ could have been generated by my model?

2. The likelihood of a model: Detour
Forget about labeled data, and just think of input patterns. What is the probability that a set of P data points $\{\mathbf{x}^k;\ 1 \le k \le P\}$ could have been generated by my model?
2. Example: Gaussian distribution
$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$
This density depends on 2 parameters, $(w_1, w_2) = (\mu, \sigma)$: the center and the width.
(Figure: https://en.wikipedia.org/wiki/Gaussian_function)

2. Random Data Generation Process
The probability that a random data generation process draws one sample $k$ with value $x^k$ is $p(x^k)$; for the specific case of the Gaussian,
$p(x^k) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x^k-\mu)^2}{2\sigma^2}\right)$
What is the probability to generate P data points?
Blackboard 1: generate P data points
2. Likelihood function (beyond Gaussian)
Suppose the probability for generating a data point $\mathbf{x}^k$ using my model is proportional to $p(\mathbf{x}^k)$. Suppose that data points are generated independently. Then the likelihood that my actual data set $\{\mathbf{x}^k;\ 1 \le k \le P\}$ could have been generated by my model is
$\text{likelihood } L = p(\mathbf{x}^1)\, p(\mathbf{x}^2)\, p(\mathbf{x}^3) \cdots p(\mathbf{x}^P)$
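A small numerical sketch of this likelihood, using the Gaussian of the previous slides as the model density $p(x)$; the data values, $\mu$ and $\sigma$ are made up for illustration. Summing logarithms instead of multiplying the P factors avoids numerical underflow but gives the same maximizer.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """p(x) = 1/(sqrt(2*pi)*sigma) * exp(-(x - mu)^2 / (2*sigma^2))"""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def log_likelihood(data, mu, sigma):
    """ln L = sum_k ln p(x^k), assuming independently generated data points."""
    return np.sum(np.log(gaussian_pdf(data, mu, sigma)))

data = np.array([1.2, 0.7, 1.9, 1.1, 0.4])   # P = 5 example data points
print(log_likelihood(data, mu=1.0, sigma=0.5))
```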
2. Maximum Likelihood (beyond Gaussian)
$\text{likelihood } L = p(\mathbf{x}^1)\, p(\mathbf{x}^2)\, p(\mathbf{x}^3) \cdots p(\mathbf{x}^P)$
BUT this likelihood depends on the parameters of my model: $L = L(w_1, w_2, \dots, w_n)$.
Choose the parameters such that the likelihood is maximal!
2. Example: Gaussian distribution
The likelihood of point $x^k$ is $p(x^k) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x^k-\mu)^2}{2\sigma^2}\right)$
(Figure: the data points marked as x along the axis, with several candidate Gaussian curves.)
Which Gaussian is most consistent with the data?
[ ] green curve  [ ] blue curve  [ ] red curve  [ ] brown-orange curve

2. Example: Gaussian
$\text{likelihood } L = p(x^1)\, p(x^2)\, p(x^3) \cdots p(x^P)$
The likelihood depends on the 2 parameters of my Gaussian: $L = L(w_1, w_2) = L(\mu, \sigma)$.
Exercise 1 NOW! (8 minutes): you have P data points. Calculate the optimal choice of parameter $\mu$: to do so, maximize the likelihood $L$ with respect to $\mu$.
Blackboard 2: Gaussian: best parameter choice for center
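The blackboard derivation itself is not reproduced in this transcript; the standard result is that setting $dLL/d\mu = 0$ yields the sample mean, $\mu^* = \frac{1}{P}\sum_k x^k$. A small numerical check (example data and $\sigma$ chosen arbitrarily) comparing the closed-form estimate with a grid search over the log-likelihood:

```python
import numpy as np

data = np.array([1.2, 0.7, 1.9, 1.1, 0.4])   # example data points
sigma = 0.5                                   # width held fixed here

def log_likelihood(mu):
    """ln L(mu) = sum_k ln p(x^k) for the Gaussian with fixed sigma."""
    return np.sum(-0.5 * ((data - mu) / sigma) ** 2 - np.log(np.sqrt(2 * np.pi) * sigma))

mu_ml = data.mean()                                        # closed-form ML estimate
mus = np.linspace(0.0, 2.0, 201)                           # grid search for comparison
mu_grid = mus[np.argmax([log_likelihood(m) for m in mus])]
print(mu_ml, mu_grid)                                      # both approximately 1.06
```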
2. Maximum Likelihood (general)
$L(w_1, w_2, \dots, w_n) = p(\mathbf{x}^1)\, p(\mathbf{x}^2)\, p(\mathbf{x}^3) \cdots p(\mathbf{x}^P)$
Choose the parameters such that the likelihood is maximal.
Note: instead of maximizing $L(\text{params})$ you can also maximize $\ln L(\text{params})$, because $f(y) = \ln(y)$ is monotonically increasing: $y_1 < y_2$ implies $\ln(y_1) < \ln(y_2)$.
2. Maximum Likelihood (general)
$L(w_1, w_2, \dots, w_n) = p(\mathbf{x}^1)\, p(\mathbf{x}^2)\, p(\mathbf{x}^3) \cdots p(\mathbf{x}^P)$
Choosing the parameters such that the likelihood is maximal is equivalent to maximizing the log-likelihood
$LL(w_1, w_2, \dots, w_n) = \ln(\text{likelihood}) = \sum_k \ln p(\mathbf{x}^k)$
"Maximize the likelihood that the given data could have been generated by your model" (even though you know that the data points were generated by a process in the real world that might be very different).
Note: some people (e.g. David MacKay) use the term "likelihood" ONLY IF we consider $LL(w)$ as a function of the parameters $w$: the "likelihood of the model parameters in view of the data".
Statistical Classification by Deep Networks
1. The statistical view: generative model
2. The likelihood of data under a model
3. Application to artificial neural networks
3. The likelihood of data under a neural network model
Overall aim: What is the likelihood that my set of P data points $\{(\mathbf{x}^\mu, t^\mu),\ 1 \le \mu \le P\}$ could have been generated by my model?

3. Maximum Likelihood for neural networks
(Figure: the network with weights $w_{ij}^{(1)}$, $w_{1,j}^{(2)}$, hidden units $x_j^{(1)}$, and a single output $\hat{y}_1$; the predicted label is $+1$ with probability $p_+ = \hat{y}_1$ and $0$ otherwise.)
Blackboard 3: Likelihood of P input-output pairs
3. Maximum Likelihood for neural networks
(Figure: the same network with single output $\hat{y}_1$ and $p_+ = \hat{y}_1$.)
Minimize the negative log-likelihood:
$E(w) = -LL = -\ln(\text{likelihood})$
$E(w) = -\sum_\mu \left[ t^\mu \ln \hat{y}^\mu + (1 - t^\mu)\ln(1 - \hat{y}^\mu) \right]$
where the parameters $w$ are all weights, in all layers.
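A minimal NumPy sketch of this cross-entropy error; the clipping constant eps is an implementation detail added here to avoid $\ln 0$ for saturated outputs, it is not part of the formula on the slide.

```python
import numpy as np

def cross_entropy_single_class(t, y_hat, eps=1e-12):
    """E(w) = - sum_mu [ t^mu ln y_hat^mu + (1 - t^mu) ln(1 - y_hat^mu) ]
    t     : targets t^mu in {0, 1}
    y_hat : network outputs y_hat^mu in (0, 1)
    """
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.sum(t * np.log(y_hat) + (1 - t) * np.log(1 - y_hat))

t = np.array([1, 0, 1, 1])                    # example targets
y_hat = np.array([0.9, 0.2, 0.7, 0.95])       # example network outputs
print(cross_entropy_single_class(t, y_hat))
```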
3. Cross-entropy error function for neural networks
Suppose we minimize the cross-entropy error function
$E(w) = -\sum_\mu \left[ t^\mu \ln \hat{y}^\mu + (1 - t^\mu)\ln(1 - \hat{y}^\mu) \right]$
Can we be sure that the output $\hat{y}^\mu$ will represent the probability?
Intuitive answer: No, because
A: We will need enough data for training (not just 10 data points for a complex task)
B: We need a sufficiently flexible network (not a simple perceptron for the XOR task)
3. Output = probability ?
Suppose we minimize the cross-entropy error function
$E(w) = -\sum_\mu \left[ t^\mu \ln \hat{y}^\mu + (1 - t^\mu)\ln(1 - \hat{y}^\mu) \right]$
Assume
A: We have enough data for training
B: We have a sufficiently flexible network
Blackboard 4: From cross-entropy to output probabilities
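The blackboard argument is not reproduced in this transcript, but as a sketch of the standard reasoning (cf. Bishop 2006, Ch. 4): with enough data (assumption A) and a sufficiently flexible network (assumption B), the output $\hat{y}(\mathbf{x})$ can be adjusted freely for each input $\mathbf{x}$. Among the patterns with that input, a fraction $q = P(C_1|\mathbf{x})$ has target $t = 1$, so their contribution to the error is proportional to $-[q \ln \hat{y} + (1-q)\ln(1-\hat{y})]$. Setting the derivative with respect to $\hat{y}$ to zero gives $-q/\hat{y} + (1-q)/(1-\hat{y}) = 0$, hence $\hat{y} = q = P(C_1|\mathbf{x})$: at the minimum of the cross-entropy, the output equals the class probability.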
QUIZ: Maximum likelihood solution means
[ ] find the unique set of parameters that generated the data
[x] find the unique set of parameters that best explains the data
[x] find the best set of parameters such that your model could have generated the data

Minimization of the cross-entropy error function for single class output
[x] is consistent with the idea that the output $\hat{y}_1$ of your network can be interpreted as $\hat{y}_1 = P(C_1|\mathbf{x})$
[ ] guarantees that the output $\hat{y}_1$ of your network can be interpreted as $\hat{y}_1 = P(C_1|\mathbf{x})$
Statistical Classification by Deep Networks
1. The statistical view: generative model
2. The likelihood of data under a model
3. Application to artificial neural networks
4. Sigmoidal as a natural output function
4. Why sigmoidal output? (single class)
(Figure: the network with a single output $\hat{y}_1 = g(a)$; the predicted label is $+1$ with probability $p_+ = \hat{y}_1$ and $0$ with probability $p_- = 1 - \hat{y}_1$.)
Observations (single class):
- The probability must be between 0 and 1
- Intuitively: smooth is better
We want $\hat{y}_1 = P(C_1|\mathbf{x}) = P(\hat{t}_1 = 1|\mathbf{x})$.
Blackboard 5: derive optimal sigmoidal
4. Why sigmoidal output? (single class)
Result (from Blackboard 5): the output
$\hat{y}_1 = P(C_1|\mathbf{x}) = P(\hat{t}_1 = 1|\mathbf{x})$ is given by $\hat{y}_1 = g(a) = \frac{1}{1 + e^{-a}}$
where the total input $a$ into the output neuron can be interpreted as a log-probability ratio.

4. Sigmoidal output = logistic function
$g(a) = \frac{1}{1 + e^{-a}}$
(Figure: https://en.wikipedia.org/wiki/Logistic_function)
Rule of thumb: for $a = 3$: $g(3) \approx 0.95$; for $a = -3$: $g(-3) \approx 0.05$.
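A small sketch checking the rule of thumb and the log-probability-ratio interpretation of the total input $a$; the example probability $p = 0.8$ is arbitrary.

```python
import numpy as np

def g(a):
    """Logistic (sigmoidal) output: g(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

print(g(3.0))    # ~0.953, rule of thumb g(3)  ~ 0.95
print(g(-3.0))   # ~0.047, rule of thumb g(-3) ~ 0.05

# If a is the log-odds a = ln[ P(C1|x) / (1 - P(C1|x)) ],
# then g(a) recovers the class probability P(C1|x).
p = 0.8
a = np.log(p / (1.0 - p))
print(g(a))      # 0.8
```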
Statistical Classification by Deep Networks
1. The statistical view: generative model
2. The likelihood of data under a model
3. Application to artificial neural networks
4. Sigmoidal as a natural output function
5. Multi-class problems
5. Multiple Classes
Two situations:
- mutually exclusive classes: the output is either "car" or "dog"
- multiple attributes: the outputs are, e.g., "teeth", "dog", "ears"

5. Multiple Classes: Multiple attributes
Multiple attributes are equivalent to several single-class decisions.
(Figure: a network with weights $w_{ij}^{(1)}$, hidden units $x_j^{(1)}$, and outputs $\hat{y}_1, \hat{y}_2, \hat{y}_3$; each output can independently be 1 or 0, e.g. the target pattern (1, 0, 1).)
5. Multiple Classes: Mutually exclusive classes
Either "car" or "dog": only one can be true, so the outputs interact.

5. Exclusive Multiple Classes
(Figure: a network with outputs $\hat{y}_1, \hat{y}_2, \hat{y}_3$ and $\hat{y}_1 = P(C_1|\mathbf{x}) = P(\hat{t}_1 = 1|\mathbf{x})$.)
1-hot coding: $\hat{t}_k^\mu = 1 \Rightarrow \hat{t}_j^\mu = 0$ for $j \ne k$.
The outputs are NOT independent:
$\sum_{k=1}^{K} t_k^\mu = 1$ (exactly one target output is 1)
$\sum_{k=1}^{K} \hat{y}_k^\mu = 1$ (output probabilities sum to 1)
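A minimal sketch of 1-hot coding of a class label; the function name is an illustrative choice.

```python
import numpy as np

def one_hot(label, num_classes):
    """1-hot coding: t_k = 1 for the correct class k, t_j = 0 for all j != k."""
    t = np.zeros(num_classes)
    t[label] = 1.0
    return t

print(one_hot(2, 5))          # [0. 0. 1. 0. 0.]
print(one_hot(2, 5).sum())    # exactly one entry is 1
```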
5. Why sigmoidal output?
(Figure: the network with outputs $\hat{y}_1, \hat{y}_2, \hat{y}_3$, sigmoidal gain $g(a)$, and $\hat{y}_1 = P(C_1|\mathbf{x}) = P(\hat{t}_1 = 1|\mathbf{x})$.)
Observation (multiple classes): the probabilities must sum to one!
Exercise this week: derive softmax as the optimal multi-class output.
5. Softmax output
(Figure: the network with outputs $\hat{y}_1, \hat{y}_2, \hat{y}_3$, computed from the total inputs $a_1, \dots, a_K$ of the output layer with gain function $g(a)$.)
$\hat{y}_k = P(C_k|\mathbf{x}) = P(\hat{t}_k = 1|\mathbf{x})$
$\hat{y}_k = P(C_k|\mathbf{x}) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}$
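A minimal sketch of the softmax output. Subtracting the maximum of the $a_k$ before exponentiating is a standard numerical-stability trick added here; it cancels in the ratio and does not change the result. The example input values are arbitrary.

```python
import numpy as np

def softmax(a):
    """y_hat_k = exp(a_k) / sum_j exp(a_j)"""
    a = a - np.max(a)            # for numerical stability only
    e = np.exp(a)
    return e / np.sum(e)

a = np.array([2.0, 1.0, -1.0])   # total inputs a_k of the K = 3 output neurons
y_hat = softmax(a)
print(y_hat, y_hat.sum())        # class probabilities; they sum to 1
```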
5. Exclusive Multiple Classes
(Figure: the network with outputs $\hat{y}_1, \hat{y}_2, \hat{y}_3$ and $\hat{y}_1 = P(C_1|\mathbf{x}) = P(\hat{t}_1 = 1|\mathbf{x})$.)
1-hot coding: $\hat{t}_k^\mu = 1 \Rightarrow \hat{t}_j^\mu = 0$ for $j \ne k$; the outputs are NOT independent: $\sum_{k=1}^{K} t_k^\mu = 1$ (exactly one output is 1).
Blackboard 6: Probability of target labels and likelihood function (mutually exclusive classes)
5. Cross entropy error for neural networks: Multiclass
We have a total of K classes (mutually exclusive: either dog or car).
Minimize* the cross-entropy
$E(w) = -\sum_{k=1}^{K} \sum_\mu t_k^\mu \ln \hat{y}_k^\mu$
where the parameters $w$ are all weights, in all layers.
Compare: the KL divergence between outputs and targets,
$KL(w) = -\left\{ \sum_{k=1}^{K} \sum_\mu t_k^\mu \ln \hat{y}_k^\mu - \sum_{k=1}^{K} \sum_\mu t_k^\mu \ln t_k^\mu \right\}$
so that $KL(w) = E(w) + \text{constant}$.
*Minimization under the constraint $\sum_{k=1}^{K} \hat{y}_k^\mu = 1$.
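A minimal sketch of the multiclass cross-entropy with 1-hot targets; the example targets and outputs are made up. Since $t \ln t = 0$ for 1-hot targets, the constant in $KL(w) = E(w) + \text{constant}$ vanishes and the KL divergence equals the cross-entropy in this case.

```python
import numpy as np

def cross_entropy_multiclass(T, Y_hat, eps=1e-12):
    """E(w) = - sum_k sum_mu t_k^mu ln y_hat_k^mu, with 1-hot target rows T."""
    return -np.sum(T * np.log(np.clip(Y_hat, eps, 1.0)))

T = np.array([[1, 0, 0],
              [0, 0, 1]])                # 1-hot targets for P = 2 patterns, K = 3 classes
Y_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])      # softmax outputs (each row sums to 1)
print(cross_entropy_multiclass(T, Y_hat))
```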
Statistical Classification by Deep Networks
1. The statistical view: generative model
2. The likelihood of data under a model
3. Application to artificial neural networks
4. Sigmoidal as a natural output function
5. Multi-class problems
6. Rectified linear for hidden units
6. Modern Neural Networks
(Figure: a network with input $\mathbf{x} \in \mathbb{R}^{N+1}$, hidden units $x_j^{(1)}$ with weights $w_{j1}^{(1)}$, and output $\hat{y}_1$ with weights $w_{1j}^{(2)}$.)
Output layer: use a sigmoidal unit (single-class) or softmax (exclusive multi-class).
Hidden layer: use the rectified linear unit,
$f(x) = x$ for $x > 0$; $f(x) = 0$ for $x \le 0$.
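A one-line sketch of the rectified linear unit used in the hidden layers:

```python
import numpy as np

def relu(x):
    """Rectified linear unit: f(x) = x for x > 0, f(x) = 0 otherwise."""
    return np.maximum(0.0, x)

print(relu(np.array([-2.0, 0.0, 1.5])))   # [0.  0.  1.5]
```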
Preparation for Exercises: Link multilayer networks to probabilities
(Figure: a small network with inputs $x_1^{(0)}, x_2^{(0)}$, bias inputs, and example weight values such as $-12$, $6$, $-2$, and $1.5$; depending on the input region, the output satisfies $g \ge 0.95$, $g \le 0.05$, or $0 \le g \le 0.95$.)
Preparation for Exercises: there are many solutions!
QUIZ: Modern Neural Networks
[ ] piecewise linear units should be used in all layers
[x] piecewise linear units should be used in the hidden layers
[x] a softmax unit should be used for exclusive multi-class in an output layer in problems with 1-hot coding
[x] a sigmoidal unit should be used for single-class problems
[x] two-class problems (mutually exclusive) are the same as single-class problems
[x] multiple-attribute-class problems are treated as multiple single-class problems
[ ] in neural nets we can interpret the output as a probability
[x] if we are careful in the model design, we may interpret the output as the probability that the data belongs to the class: $\hat{y}_1 = P(C_1|\mathbf{x})$
Statistical classification by deep networks
Objectives for today:
- The cross-entropy error is the optimal loss function for classification tasks
- The sigmoidal (softmax) is the optimal output unit for classification tasks
- Exclusive multi-class problems use "1-hot coding"
- Under certain conditions we may interpret the output as a probability
- Piecewise linear units are preferable for hidden layers
Reading for this lecture:
Bishop 2006, Ch. 4.2 and 4.3, Pattern Recognition and Machine Learning
or Bishop 1995, Ch. 6.7 - 6.9, Neural Networks for Pattern Recognition
or Goodfellow et al., 2016, Ch. 5.5, 6.2, and 3.13 of Deep Learning