Supervised Learning in Neural Networks
Sumio Watanabe, Tokyo Institute of Technology
Advanced Topics in Mathematical Information Sciences II
April 24, May 1, 2015
Quick Review
Supervised Learning
[Figure: a supervisor generates samples X_1, X_2, …, X_n with answers Y_1, Y_2, …, Y_n; the learner fits a function y = f(x, w) to them.]
Mathematics of Supervised Learning
[Figure: a true information source q(x, y) generates the training samples X_1, X_2, …, X_n and Y_1, Y_2, …, Y_n, as well as the test sample (X, Y); the neural network models the relation by y = f(x, w).]
One Neuron Model

[Figure: a neuron receiving inputs x_1, x_2, …, x_N through synapse weights w_1, w_2, …, w_N, with bias θ.]

Weighted sum of inputs: ∑_{i=1}^{N} w_i x_i

Output: σ( ∑_{i=1}^{N} w_i x_i + θ )
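As a concrete illustration, here is a minimal sketch of the one-neuron model in Python, assuming the logistic function for σ (the slides do not fix a particular activation):

    import numpy as np

    def sigma(a):
        # Logistic sigmoid, assumed here as the activation function σ.
        return 1.0 / (1.0 + np.exp(-a))

    def neuron(x, w, theta):
        # One-neuron model: output = σ( Σ_{i=1}^{N} w_i x_i + θ ).
        return sigma(np.dot(w, x) + theta)

    x = np.array([0.5, -1.0, 2.0])   # inputs x_1, ..., x_N
    w = np.array([0.8, 0.3, -0.5])   # synapse weights w_1, ..., w_N
    theta = 0.1                      # bias θ
    print(neuron(x, w, theta))       # a single scalar output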
Three-Layered Neural Network
[Figure: a three-layered neural network with input layer x_1, x_2, …, x_M, a hidden layer, and output layer f_1, f_2, …, f_N.]
Contents
1. Deep neural network
2. Sequential learning and auto-encoder
3. Convolution learning
Deep neural network (DNN)
Recently, neural networks which have deep layers are being studied. It is reported that DNNs have better generalization performance.
[Figure: networks of increasing depth, labeled 1960, 1985, and 2015, each mapping inputs x_1, …, x_M to outputs f_1, …, f_N.]
Definition. It is easy to define a deep network.

Simple perceptron:
f_i = σ( ∑_{j=1}^{M} u_{ij} x_j + θ_i )

Three-layer neural network:
f_i = σ( ∑_{j=1}^{H} u_{ij} σ( ∑_{k=1}^{M} w_{jk} x_k + θ_j ) + φ_i )

DNN:
f_i = σ( ∑_{j=1}^{H2} u_{ij} σ( ∑_{k=1}^{H1} w_{jk} σ( ∑_{l=1}^{M} v_{kl} ( … ) ) + θ_j ) + φ_i )
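A minimal sketch in Python of how these nested definitions can be computed, assuming a logistic σ and illustrative layer widths (all concrete numbers below are for illustration only):

    import numpy as np

    def sigma(a):
        return 1.0 / (1.0 + np.exp(-a))

    def forward(x, layers):
        # Apply h -> σ(W h + θ) once per layer: one (W, θ) pair gives the
        # simple perceptron, two pairs the three-layer network, and three
        # or more pairs a DNN.
        h = x
        for W, theta in layers:
            h = sigma(W @ h + theta)
        return h

    rng = np.random.default_rng(0)
    dims = [4, 8, 8, 3]              # M, H1, H2, N (illustrative)
    layers = [(rng.normal(size=(o, i)), np.zeros(o))
              for i, o in zip(dims[:-1], dims[1:])]
    print(forward(rng.normal(size=dims[0]), layers))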
Learning and Generalization
Training Error:
E(w) = (1/n) ∑_{i=1}^{n} ( Y_i − f(X_i, w) )²

Generalization Error:
G(w) = ∫∫ ( y − f(x, w) )² q(x, y) dx dy
The main purpose of learning is to minimize G(w), but we have only the training samples. Minimizing E(w) is not equivalent to minimizing G(w).
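In code, E(w) is computable directly, while G(w) can only be estimated from held-out test samples. A sketch, assuming f is vectorized over a batch of samples:

    import numpy as np

    def training_error(w, f, X, Y):
        # E(w) = (1/n) Σ_{i=1}^{n} (Y_i - f(X_i, w))²
        return np.mean((Y - f(X, w)) ** 2)

    def generalization_error_estimate(w, f, X_test, Y_test):
        # Estimates G(w) = ∫∫ (y - f(x,w))² q(x,y) dxdy by averaging the
        # squared error over test samples drawn from q(x, y).
        return np.mean((Y_test - f(X_test, w)) ** 2)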
Steepest Descent: Error Back-propagation

Square error:
E(w) = (1/2) ∑_{i=1}^{N} ( f_i − y_i )²

Inference:
o_j = σ( ∑_{k=1}^{M} w_{jk} o_k + θ_j ),   f_i = σ( ∑_{j=1}^{H} u_{ij} o_j + φ_i )

Gradient:
∂E/∂w_{jk} = ∑_{i=1}^{N} ( f_i − y_i ) ∂f_i/∂w_{jk},   ∂f_i/∂w_{jk} = (∂f_i/∂o_j)(∂o_j/∂w_{jk})

All parameters can be optimized by steepest descent of E(w), applying the chain rule recursively from the output layer back to the inputs.
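A minimal sketch of one steepest-descent step for the three-layer case, assuming the logistic sigmoid (for which σ' = σ(1 − σ)); variable names follow the slide:

    import numpy as np

    def sigma(a):
        return 1.0 / (1.0 + np.exp(-a))

    def backprop_step(x, y, W, theta, U, phi, eta=0.1):
        # Forward pass (inference):
        o = sigma(W @ x + theta)               # hidden units o_j
        f = sigma(U @ o + phi)                 # outputs f_i
        # Backward pass: propagate the error (f - y) through the layers.
        d_out = (f - y) * f * (1.0 - f)        # error at the output layer
        d_hid = (U.T @ d_out) * o * (1.0 - o)  # error at the hidden layer
        # Steepest-descent updates of all parameters (in place).
        U     -= eta * np.outer(d_out, o)
        phi   -= eta * d_out
        W     -= eta * np.outer(d_hid, x)
        theta -= eta * d_hid
        return 0.5 * np.sum((f - y) ** 2)      # square error E(w)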
Regularization: Ridge and Lasso

E(w) = (1/n) ∑_{i=1}^{n} ( Y_i − f(X_i, w) )² + R(w)

Ridge: R(w) = λ ∑_j |w_j|²
Lasso: R(w) = λ ∑_j |w_j|
λ > 0 : hyperparameter
A DNN has many parameters to be optimized, so regularization terms are necessary.
Remark. It is still difficult to find the optimal hyperparameter.
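A sketch of the regularized objective; λ (lam) is the hyperparameter chosen by the user:

    import numpy as np

    def ridge(w, lam):
        # R(w) = λ Σ_j |w_j|² ; its gradient adds 2 λ w_j to each update.
        return lam * np.sum(w ** 2)

    def lasso(w, lam):
        # R(w) = λ Σ_j |w_j| ; it tends to drive many weights to zero.
        return lam * np.sum(np.abs(w))

    def regularized_error(w, f, X, Y, R, lam):
        # E(w) = (1/n) Σ (Y_i - f(X_i, w))² + R(w)
        return np.mean((Y - f(X, w)) ** 2) + R(w, lam)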
Steepest Descent?

Minimize this error by optimizing all parameters.

[Figure: supervised data are given at the output layer f_1, …, f_N, far above the input layer x_1, …, x_M.]

Outputs are far from inputs. Mathematically speaking, all parameters can be optimized by steepest descent, but it is difficult for a neural network to find the nonlinear relation between distant inputs and outputs.

We need a methodology to build a deep neural network.
Contents
1. Deep neural network
2. Sequential learning and auto-encoder
3. Convolution learning
Deep Learning Methodology
Three methods are being studied:
(1) Sequential layer learning
(2) Auto-encoder
(3) Convolution network
(1) Sequential Layer Learning
[Figure: a shallow network is trained against the supervisor; its weights are copied into a network one layer deeper, which is trained in turn, and so on.]

Synapse weights in the lower layers are copied from a trained shallow network to a deeper one.
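A structural sketch of this procedure; `train(layers, X, Y)` stands for any back-propagation trainer (a hypothetical helper here) that fits the given stack against the supervisor and returns the trained layers:

    import numpy as np

    def new_layer(n_in, n_out, rng):
        # Freshly initialized synapse weights and biases.
        return 0.1 * rng.normal(size=(n_out, n_in)), np.zeros(n_out)

    def sequential_layer_learning(X, Y, dims, train, rng):
        # At each stage, add one hidden layer, copy the already-trained
        # lower layers from the previous stage, attach a fresh output
        # layer, and retrain the whole stack against the supervisor.
        lower, layers = [], []
        for d in range(1, len(dims) - 1):
            hidden = new_layer(dims[d - 1], dims[d], rng)
            output = new_layer(dims[d], dims[-1], rng)
            layers = train(lower + [hidden, output], X, Y)
            lower = layers[:-1]        # keep everything below the output
        return layers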
Parameter Space

[Figure: the landscape of E(w).]

The parameter w lies in a high-dimensional Euclidean space, where E(w) has many local and complicated structures. E(w) is minimized at |w| = ∞, so we need an appropriate finite and local parameter. Sequential layer learning may lead the training result to some appropriate point.
(2) Auto-encoder
First, a bottleneck network is trained with the inputs X_1, X_2, …, X_M serving as their own supervisor (the hidden layer is smaller than M); then its weights are copied into the deep network.

[Figure: the bottleneck network reproduces its input; the trained weights are copied.]
Bottleneck neural network
[Figure: a bottleneck network whose input and output are the same X_1, X_2, …, X_M.]

If the set of inputs lies on a K-dimensional manifold in the M-dimensional Euclidean space, then its essential coordinates can be extracted automatically. This is nonlinear principal component analysis.
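A sketch of auto-encoder pre-training with the same assumed trainer as before: the inputs themselves act as the supervisor, and only the encoder half is kept for copying:

    import numpy as np

    def autoencoder_pretrain(X, M, K, train, rng):
        # Bottleneck network X -> K hidden units (K < M) -> X.
        enc = 0.1 * rng.normal(size=(K, M)), np.zeros(K)   # compress
        dec = 0.1 * rng.normal(size=(M, K)), np.zeros(M)   # reconstruct
        layers = train([enc, dec], X, X)   # supervisor = input itself
        return layers[0]                   # keep only the encoder weights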
Example

Input: 5 × 5 images of "0" and "6" (25 input units).
Training samples: 2000. Testing samples: 2000.
Network: 25 inputs, hidden layers of 8, 6, and 4 units, 2 outputs.
(0) Only error back-propagation

Training Error: mean 213.5, std 414.7
Generalization Error: mean 265.5, std 388.0
Training results strongly depend on initial synapse weights.
(1) Sequential Layer Learning
Training Error: mean 4.1, std 1.8
Testing Error: mean 61.6, std 7.0
(2) Auto-encoder
Training Error: mean 5.3, std 3.4
Test Error: mean 61.3, std 8.1
Contents
1. Deep neural network
2. Sequential learning and auto-encoder
3. Convolution learning
Data structure

In several kinds of data, such as images and time series, neighborhoods have local covariance.
Image: a pixel depends on its neighbors.
Time series: a future value can be predicted from the past.
A convolutional network is useful for analyzing such data.
f_i = σ( ∑_{|i−j|<3} u_{ij} σ( ∑_{|j−k|<3} w_{jk} x_k + θ_j ) + φ_i )

Synapse weights outside of the neighborhoods are zero.
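A sketch of this constraint as a weight mask in Python; note that a full convolutional layer additionally shares the same weights across positions, while the equation above only enforces locality:

    import numpy as np

    def local_mask(n_out, n_in, width=3):
        # 1 where |i - j| < width, 0 elsewhere: synapse weights outside
        # the neighborhood are forced to zero.
        i = np.arange(n_out)[:, None]
        j = np.arange(n_in)[None, :]
        return (np.abs(i - j) < width).astype(float)

    rng = np.random.default_rng(0)
    W = rng.normal(size=(10, 10)) * local_mask(10, 10)  # banded weights
    print(W.round(2))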
Convolutional network
In image analysis, a network is made by nonlinear convolution processing from local to global.

[Figure: local information is aggregated into global information layer by layer.]
Multi-resolution Analysis
Multi-resolution analysis (MRA) is a method of analyzing images by integrating data from local to global scales. A convolution network can be understood as a kind of MRA.
Time Delay Neural Network
Human speech contains local abbreviation, expansion, and contraction. A layered neural network was proposed to adapt to such local nonlinear changes. It is called the time delay neural network (TDNN).

[Figure: a speech sound over time is mapped to a recognition result over time.]
Example: Time series
Time series prediction problem: how to find a nonlinear function

x(t) = f( x(t−1), x(t−2), …, x(t−27) ) + noise,

where { x(t) } is the monthly price of Hakusai (a Japanese vegetable similar to cabbage) from 1970 to 2013.

Before processing, a linear prediction was optimized:

x(t) = a_1 x(t−1) + a_2 x(t−2) + ⋯ + a_27 x(t−27).

Linear prediction: Training Error 1.29, Generalization Error 1.55.
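A sketch of the linear prediction as an ordinary least-squares problem; `x` stands for the monthly price series (the actual data come from e-Stat, cited below):

    import numpy as np

    def fit_linear_prediction(x, p=27):
        # Least-squares fit of x(t) = a_1 x(t-1) + ... + a_p x(t-p).
        T = len(x)
        X = np.column_stack([x[p - k:T - k] for k in range(1, p + 1)])
        y = x[p:T]
        a, *_ = np.linalg.lstsq(X, y, rcond=None)
        return a

    def predict(x, a):
        # One-step-ahead predictions for t = p, ..., T-1.
        p = len(a)
        X = np.column_stack([x[p - k:len(x) - k] for k in range(1, p + 1)])
        return X @ a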
Example
[Figure: training result and test result for the Hakusai price series; red: true price, blue: prediction; horizontal axis: month, vertical axis: price.]

The data from e-Stat of the Japanese Government are used: http://www.e-stat.go.jp/SG1/estat/eStatTopPortal.do
Comparison of DNN and ConvNN
[Figure: predictions of the two networks over time.]

Deep neural network: Training Error 1.01, Generalization Error 1.56
Convolution network: Training Error 1.28, Generalization Error 1.35
Deep Learning and Feature Extraction

(1) Automatic extraction of features
By using a deep neural network, the optimal feature representation may be found. Discovery of unknown structure enables us to "mine data". However, it may be difficult, and even when it is possible, it requires heavy computational cost.

(2) Preparing features by a human
If an appropriate feature is found by a human before training, then the computational cost of learning can be reduced. However, discovery of unknown features does not occur.
Summary
(1) Supervised learning in neural networks is introduced. (April 24th)
 (a) Definitions of training and generalization errors
 (b) Steepest descent as a learning algorithm
(2) Methodology of deep neural networks (May 1st)
 (a) Sequential layer learning
 (b) Auto-encoder
 (c) Convolution network