Neural networks: an introduction
Akito Sakurai
Contents
• Overview – Neurons and models
• Implementation in the early days – Perceptrons
• An introduction to MLP
What are neurons?
• “There is no such thing as a typical neuron”, Arbib, 1997
A typical(?) neuron
Dynamics of neuron activity
Paul M. Churchland (1996), The Engine of Reason, The Seat of the Soul
McCulloch and Pitts
• Warren S. McCulloch and Walter Pitts (1943), "A logical calculus of the ideas immanent in nervous activity", Bulletin of Mathematical Biophysics, 5: 115-133.
• A very simplified (mathematical, computational) model
• Any logical function can be realized by connecting the model neurons.
A model neuron
A model neuron receives an input vector, processes it, and outputs a value.
The models in the early days used the step function as the activation function. Currently, the sigmoid or rectified linear functions are commonly used.
o = f(w · x), where x is the input vector, w the weight vector, and f the activation function
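To make this concrete, here is a minimal Python sketch of a model neuron (an illustration added here, not code from the lecture); the example weights, inputs, and the choice among step, sigmoid, and rectified linear activations are assumptions.

```python
import math

def step(z):                     # activation used in the early models
    return 1.0 if z >= 0 else 0.0

def sigmoid(z):                  # a common modern choice
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):                     # rectified linear function
    return max(0.0, z)

def model_neuron(x, w, activation=step):
    """Weighted sum of the inputs followed by an activation function."""
    net = sum(wi * xi for wi, xi in zip(w, x))
    return activation(net)

# Example with hand-picked weights and inputs (arbitrary values)
print(model_neuron([1.0, 0.5], [0.4, -0.2], activation=sigmoid))
```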
Perceptron: its structure
• Rosenblatt, F. (1957). “The perceptron: A perceiving and recognizing automaton (project PARA).”, Technical Report 85-460-1, Cornell Aeronautical Laboratory.
• Rosenblatt, F. (1962). “Principles of Neurodynamics.”, Spartan Books, New York.
Contents
• Overview – Neurons and models
• Implementation in the early days – Perceptrons
• An introduction to MLP
Perceptron: a polysemic word
Perceptron is used to mean varying concepts:
• Linear threshold unit: the next slide
• Original perceptron: as follows
• Sigmoid unit or any other similar units
• Network of sigmoidal units: called a multi-layer perceptron or MLP
• Network of linear threshold units: MLP, but quite rarely used in this meaning
In this class, we assume it means a sigmoidal network.
[Figure: the original perceptron (Rosenblatt 1962; Minsky and Papert 1969)]
Perceptron: a single-neuron unit model. Alias: Linear Threshold Unit (LTU) or Linear Threshold Gate (LTG)
Net input: the value of a linear function applied to the inputs
Net output: the value of the threshold function applied to the net input (threshold = w0)
The function applied to the net input to obtain the net output is called the activation function
Perceptron networks: a network of perceptrons connected through weighted links w_i
Multi-Layer Perceptron (MLP)
[Figure: a perceptron unit with inputs x1, x2, …, xn, weights w1, w2, …, wn, and a bias input x0 = 1 with weight w0]
net = Σ_{i=0}^{n} w_i x_i
o(x1, x2, …, xn) = 1 if Σ_{i=0}^{n} w_i x_i > 0, and −1 otherwise
Vector notation: o(x) = sgn(w · x), where sgn(y) = 1 if y > 0 and −1 otherwise
Decision boundary of perceptron
Perceptron can easily represent many important functions:
• Logical functions (McCulloch and Pitts, 1943), e.g., with simple integer weights: AND(x1, x2), OR(x1, x2), NOT(x) (a sketch follows below)
• Some functions are not representable, e.g., linearly non-separable functions (which is just a paraphrase)
• To circumvent this: construct a network of perceptrons
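As an illustration of the logical functions above (added here, not from the slides), a minimal Python sketch of a linear threshold unit realizing AND, OR, and NOT with hand-picked integer weights; the particular weight values are assumptions.

```python
def ltu(w, x):
    """Linear threshold unit: 1 if w · x > 0, else -1 (x is prefixed with x0 = 1)."""
    net = sum(wi * xi for wi, xi in zip(w, [1] + list(x)))
    return 1 if net > 0 else -1

# Hand-picked integer weights; w[0] plays the role of the threshold term w0
AND = lambda x1, x2: ltu([-3, 2, 2], (x1, x2))   # fires only when both inputs are 1
OR  = lambda x1, x2: ltu([-1, 2, 2], (x1, x2))   # fires when at least one input is 1
NOT = lambda x1:     ltu([1, -2], (x1,))         # inverts a single 0/1 input

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", AND(a, b), "OR:", OR(a, b))
print("NOT 0:", NOT(0), "NOT 1:", NOT(1))
```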
[Figure: Example A and Example B show sets of + and − points in the (x1, x2) plane, one linearly separable and one not]
Perceptron learning algorithm
Perceptron Learning Rule (Rosenblatt, 1959)
Idea: suppose that for each input vector an output (target) value is given. Then, by updating the weights, the perceptron will become able to output the proper values.
The unit takes binary values; for a perceptron unit, the weight update formula is
w_i ← w_i + Δw_i,   Δw_i = (t − o) x_i
where t = c(x) is the target value for x and o is the current perceptron output. The second formula is sometimes written Δw_i = r (t − o) x_i, where r is called a learning rate. Because for the perceptron any positive r gives equivalent results, r = 1 is used in the formula above.
When the training set D is linearly separable, the algorithm converges in finite time. Some literature requires r to be small enough, but that is not necessary for the perceptron.
In a different way
Initialization: w is any vector.
Repeat
  Select all x in F = F⁺ ∪ F⁻ in sequence, in arbitrary order
    If x ∈ F⁺ and w · x > 0 then continue;
    If x ∈ F⁺ and w · x ≤ 0 then FixPlus and continue;
    If x ∈ F⁻ and w · x < 0 then continue;
    If x ∈ F⁻ and w · x ≥ 0 then FixMinus and continue;
until no errors (neither FixPlus nor FixMinus is called)
FixPlus: w := w + x    FixMinus: w := w − x
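A minimal Python sketch of this procedure (an illustration, not code from the lecture); the toy training set, the ±1 labels, and the zero initial weights are assumptions. FixPlus and FixMinus correspond to the update Δw_i = (t − o) x_i with r = 1.

```python
def predict(w, x):
    """LTU output: 1 if w · x > 0 else -1 (x already includes the bias input x0 = 1)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1

def perceptron_learning(samples, n_features, max_epochs=100):
    """samples: list of (x, t) with x a feature tuple and t in {+1, -1}."""
    w = [0.0] * (n_features + 1)          # "any vector"; zeros for simplicity
    for _ in range(max_epochs):
        errors = 0
        for x, t in samples:
            x = (1.0,) + tuple(x)         # prepend the bias input x0 = 1
            if predict(w, x) != t:
                # FixPlus (t = +1): w := w + x;  FixMinus (t = -1): w := w - x
                w = [wi + t * xi for wi, xi in zip(w, x)]
                errors += 1
        if errors == 0:                   # neither FixPlus nor FixMinus was called
            break
    return w

# Toy example: learn OR (linearly separable), with labels mapped to +1/-1
data = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
print(perceptron_learning(data, n_features=2))
```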
Linearly separable?
Linearly Separable (LS) data set
[Figure: a scatter of + and − points in the (x1, x2) plane that a straight line can separate]
Definition: suppose 0 or 1 is the label f(x) of each x in D. If there exist w and θ such that f(x) = 1 if w1 x1 + w2 x2 + … + wn xn ≥ θ, and f(x) = 0 otherwise, then D is called linearly separable. θ is called a threshold.
Linearly separable? Note: D being linearly separable does not mean its population is so.
• disjunction: c(x) = x1′ ∨ x2′ ∨ … ∨ xm′ (○)
• m of n: c(x) = at least 3 of (x1′, x2′, …, xm′) (○)
• exclusive OR (XOR): c(x) = x1 ⊕ x2 (×)
• DNF: c(x) = T1 ∨ T2 ∨ … ∨ Tm, with Ti = l1 ∧ l2 ∧ … ∧ lk (×)
Transformation of expression: can we transform a linearly non-separable problem into a linearly separable one? If it is possible, is it meaningful? Is it realistic? Is it an important problem?
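To illustrate the XOR case (an addition, not from the slides), a small Python sketch that searches exhaustively over integer weights and thresholds for a separating (w1, w2, θ); the search grid is an arbitrary assumption, but for XOR no (w, θ) exists at all, so the search necessarily fails.

```python
from itertools import product

def separable(labels, grid=range(-3, 4)):
    """Search for integer w1, w2, theta such that, for all four Boolean inputs,
       label(x) == 1 exactly when w1*x1 + w2*x2 >= theta."""
    points = [(0, 0), (0, 1), (1, 0), (1, 1)]
    for w1, w2, theta in product(grid, repeat=3):
        if all((w1 * x1 + w2 * x2 >= theta) == (lab == 1)
               for (x1, x2), lab in zip(points, labels)):
            return (w1, w2, theta)
    return None

# labels given in the input order (0,0), (0,1), (1,0), (1,1)
print("OR :", separable([0, 1, 1, 1]))   # some separating (w1, w2, theta) is found
print("XOR:", separable([0, 1, 1, 0]))   # None: no linear threshold realizes XOR
```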
Convergence of the perceptron algorithm
Perceptron convergence theorem
Claim: if there exists a weight vector w which is consistent with the training dataset (i.e., the dataset is linearly separable), the perceptron learning algorithm converges.
Proof idea: the search space shrinks in an orderly way toward a limit (the width of the wedge forming the search space decreases). Ref.: Minsky and Papert, 11.2-11.3.
Note 1: how many repetitions are necessary until convergence?
Note 2: what happens if the dataset is not linearly separable?
Perceptron cycling theorem
Claim: if the training dataset is not linearly separable, the weight vectors obtained during the perceptron algorithm form a bounded set. If the weights are integers, the set is finite.
Proof idea: if the initial weight vector is long enough, its length can be shown to be unable to grow; proven by mathematical induction on the dimension n. Ref.: Minsky and Papert, 11.10.
How to make it robust? Or how to make it more expressive?
Goal 1: develop an algorithm which finds a better approximation.
Goal 2: develop a new architecture to go beyond the restrictions.
Universal approximation theorem
• A neural network with a single hidden layer can approximate any continuous function to any required accuracy, provided a finite but arbitrarily large number of hidden units may be used.
[Figure: a single-hidden-layer network with input x and hidden-unit outputs y1, …, y4 approximating a function of x]
Any number of units may be used if necessary.
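As a rough illustration of the idea (added here; the target function, the number of hidden units, and the sigmoid steepness are arbitrary assumptions), a hand-built single-hidden-layer network of steep sigmoids approximates a one-dimensional continuous function by a staircase, to within roughly the grid spacing.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)                 # numerically stable for large negative z
    return ez / (1.0 + ez)

def f(x):                            # the continuous target function (arbitrary choice)
    return math.sin(x)

def network(x, n_hidden=50, lo=0.0, hi=6.28, steep=200.0):
    """Each hidden unit is a steep sigmoid that switches on near one grid point;
       its outgoing weight is the increment of f across that step (a staircase fit)."""
    xs = [lo + i * (hi - lo) / n_hidden for i in range(n_hidden + 1)]
    y = f(xs[0])
    for i in range(1, len(xs)):
        y += (f(xs[i]) - f(xs[i - 1])) * sigmoid(steep * (x - xs[i]))
    return y

for x in (0.5, 1.5, 3.0, 5.0):
    print(f"x={x:.1f}  f(x)={f(x):+.3f}  network={network(x):+.3f}")
```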
Perceptron’s ability
• Perceptrons can
– Recognize characters (alphabets)
– Recognize types of patterns (forms, etc.)
– Learn with a splendid learning algorithm
• The perceptron learning algorithm, as we have seen, is able to find a solution to any problem that has a solution representable by a perceptron.
• Note: the existence of a solution does not by itself help us find one.
Unfortunately
[Figure: two network parts, A and B, annotated "learnable" and "how to learn was unknown"]
When an output is not as required, some weights must be updated; that much is easily inferred. But how much they should be changed was not known at the time.
In short
• There exist problems that cannot be solved (by a single perceptron), although a solution exists in the form of a network.
– Parity or XOR problem
– Connectivity of patterns
– In general, linearly non-separable problems
• Marvin L. Minsky and Seymour Papert (1969), "Perceptrons", Cambridge, MA: MIT Press
– Proved that the perceptron is not capable of solving many problems.
– We can construct a network! Yes, but "we" must do it.
• A McCulloch & Pitts neuron network is equivalent to a Turing machine (i.e., "universal"). OK, but does that help us?
– If we do not know how to make it learn, it is useless.
• Does a learning algorithm for such networks exist?
Cognitron
Fig. 4 a-c. Three possible methods for interconnecting layers. The connectable area of each cell is differently chosen in these three methods. Method c is adopted for the cognitron discussed in this paper
K. Fukushima, Cognitron: A self-organizing multilayered neural network, Biological Cybernetics, 1975
Sophisticated structure; specific learning algorithm; too advanced to be popular
Exception at the time
Neocognitron
Fig. 2. Schematic diagram illustrating the interconnections between layers in the neocognitron
K. Fukushima, Neocognitron: A Self-Organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position, Biological Cybernetics (1980)
Exception at the time
S-cell in Neocognitron
K. Fukushima, Neocognitron: A Hierarchical Neural Network Capable of Visual Pattern Recognition, Neural Networks, Vol. 1, pp. 119-130, 1988
Contents
• Overview – Neurons and models
• Implementation in the early days – Perceptrons
• MLP appears
PDP appeared
• The book "Perceptrons" was said to have delayed research in the field by two decades (opinions differ)
• A turning point: D.E. Rumelhart, J.L. McClelland, eds., "Parallel Distributed Processing: Explorations in the Microstructure of Cognition", MIT Press, 1986.
– A compilation of articles, from mathematics to philosophy
– Many successful experiments on multi-layer networks
– Proposed the "error back propagation algorithm": an unexpected influence on learning algorithms.
– [Because techniques similar to BP were found before PDP (Amari, 1967; Werbos, 1974, "dynamic feedback"; Parker, 1982, "learning logic"), reinvention/rediscovery would be a better word to describe it. But it has had a profound influence.]
Error Back Propagation
• Basically it is for multi-layer feed-forward networks, but it can be applied to other types of networks.
• Weights are updated in proportion to back-propagated errors
Two important inventions
Reasons why PDP succeeded
• Changed the activation function of units from the threshold function to the sigmoid function
• Formulated the learning problem as an error-minimization problem, e.g., E(w) = (1/2) Σ_d (t_d − o_d)², where t_d is the target value for sample d and o_d is the current output
• Solved the (nonlinear, multivariate) minimization problem by a naïve method (steepest descent)
Error Minimization
• Do not pursue a complete solution (error = 0)
– Our ability may be limited
– Samples may contain errors
– Samples may come from probabilistic events
• Consider (perhaps too naively) the sum of squares of the differences between targets and current outputs as the error.
• Find the weights that minimize the error.
[Figure: plot of the sigmoid function, rising from 0 to 1]
Sigmoid function
[Figure: a unit with inputs x1, …, xn and x0 = 1, weights w1, …, wn and w0, computing net = Σ_{i=0}^{n} w_i x_i and output o = f(net)]
Quite often used: σ(x) = 1 / (1 + e^{−x})
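A minimal Python sketch (illustration only) of the sigmoid unit and the sum-of-squares error described above; the sample data are made-up assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def output(w, x):
    """Sigmoid unit: o = sigmoid(net), net = sum_i w_i x_i with x0 = 1 prepended."""
    net = sum(wi * xi for wi, xi in zip(w, [1.0] + list(x)))
    return sigmoid(net)

def error(w, samples):
    """Sum of squared differences between targets and current outputs."""
    return 0.5 * sum((t - output(w, x)) ** 2 for x, t in samples)

# made-up samples: (feature vector, target value in [0, 1])
samples = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
print(error([0.0, 0.0, 0.0], samples))
```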
A method of minimization
• Solve the equations obtained by equating the gradients to 0:
∂E/∂w_1 = ∂E/∂w_2 = ∂E/∂w_3 = 0,  where E(w) = (1/2) Σ_d (t_d − o_d)² (t_d: target for d)
• Because f is non-linear, we cannot solve these equations directly.
• An iterative method is therefore sought, i.e., a method that gives us w_1, w_2, w_3 such that the gradients vanish.
Iterative method for minimization
• Many efficient methods have been proposed
• The simplest one is steepest descent
– The steepest ascent method gives us a maximum
[Figure: the error surface E(w) over the weight plane (w0, w1), shown as a 3-D surface and as a contour plot]
The direction of steepest descent is normal to the contours of the error
Mathematically
• Steepest descent: update w a little along the direction −∇E(w)
• w_new = w_current − η ∇E(w_current)
• The learning rate η > 0 is to be chosen appropriately
[Figure: the error surface E(w) over the weight plane (w0, w1), with the steepest-descent step indicated]
Actually
• A method to simplify the calculation of the gradients is needed, because there are hundreds (at the time) or hundreds of thousands of weight variables.
[Figure: the same error surface E(w) over (w0, w1); the update w_new = w_current − η ∇E(w_current) must be computed for every weight]
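To make the update w_new = w_current − η ∇E(w) concrete, here is a short Python sketch (an illustration, not the lecture's code) that trains a single sigmoid unit by steepest descent on the squared error; the data, learning rate, and iteration count are assumptions. For multi-layer networks the same gradients are obtained efficiently by error back propagation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, n_features, eta=0.5, iters=2000):
    """Steepest descent on E(w) = 1/2 * sum_d (t_d - o_d)^2 for a single sigmoid unit."""
    w = [0.0] * (n_features + 1)                        # w[0] is the bias weight (x0 = 1)
    for _ in range(iters):
        grad = [0.0] * len(w)
        for x, t in samples:
            x = [1.0] + list(x)
            o = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            # dE/dw_i = -(t - o) * o * (1 - o) * x_i   (chain rule through the sigmoid)
            for i, xi in enumerate(x):
                grad[i] += -(t - o) * o * (1 - o) * xi
        w = [wi - eta * gi for wi, gi in zip(w, grad)]  # w_new = w_current - eta * grad E
    return w

data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
w = train(data, n_features=2)
for x, t in data:
    o = sigmoid(sum(wi * xi for wi, xi in zip(w, [1.0] + x)))
    print(x, t, round(o, 2))
```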
What was done: sound classification
(Gorman & Sejnowski, Neural Networks, 1(1), 75-89, 1988)
What was done: speech recognition
http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-3/www/ml.html
ALVINN: An Autonomous Land Vehicle In a Neural Network
• Made a tour on the interstate (Dean Pomerleau, 1995)
http://web.mit.edu/6.034/wwwbob/L9/node10.html
Navlab
Problems to be solved
• Inputs: N labeled samples (x_i, y_i), i = 1, …, N
– x_i ∈ R^d is a d-dimensional feature vector
– y_i ∈ R^c is a c-dimensional label
– If x_i belongs to class k, y_i = (0, …, 0, 1, 0, …, 0), where only the k-th element is 1.
• Outputs: a function f
– f: x ∈ R^d ↦ y ∈ R^c
– It classifies the training samples as correctly as possible
– f(x) = (f_1(x), f_2(x), …, f_c(x)); x belongs to the k-th class → f_k(x) > f_j(x) for j ≠ k
Summary and a bit beyond it
A unit in neural networks
• A unit, or a neuron, is the basic construct of neural networks.
• It has a weight vector w and a nonlinear function f.
• It returns a value in [0, 1] by receiving an input vector x, calculating the inner product with the weight vector, and applying the nonlinear function f to the result.
• Note: the returned value becomes larger when the input vector x is more similar to the weight vector w.
[Figure: a unit computing the inner product of its inputs (with x0 = 1) and its weights, then applying a nonlinear function to produce a value in [0, 1]]
Connecting units in parallel
• Each unit returns a large value when its input vector is similar to its weight vector.
• When multiple units run in parallel, multi-class classification becomes possible (a sketch follows below).
[Figure: several units in parallel, each computing an inner product followed by a nonlinearity; the unit whose weight vector is most similar to the input vector produces the largest output]
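A small Python sketch (illustrative; the weight vectors and the input are made up) of units running in parallel: each row of W is one unit's weight vector, and the unit whose weights best match the input produces the largest output, which can be read as the predicted class.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(W, x):
    """One unit per row of W; each output is sigmoid(w · x)."""
    return [sigmoid(sum(wi * xi for wi, xi in zip(w, x))) for w in W]

# Made-up weight vectors: unit k "looks for" input pattern k
W = [[ 2.0, -1.0, -1.0],   # unit 0
     [-1.0,  2.0, -1.0],   # unit 1
     [-1.0, -1.0,  2.0]]   # unit 2

x = [0.1, 0.9, 0.2]        # most similar to unit 1's weight vector
outputs = layer(W, x)
print(outputs, "predicted class:", outputs.index(max(outputs)))
```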
Connecting units in series
• Insert intermediate layers between the input and output layers.
• The input vector is transformed so that it can be discriminated well → linearly non-separable cases can be handled
[Figure: an MLP in which the intermediate layers perform feature extraction/elaboration and the final layer performs classification]
Intermediate layer’s role (1)
[Figure: the two intermediate-layer outputs plotted for each input point]
Class 1: (−2, −2) → (0.18, 0.18); (−1, −1) → (0.12, 0.12); (−1, −2) → (0.06, 0.12); (−2, −1) → (0.04, 0.02)
Class 2: (0, −4) → (0.04, 0.50); (2, 2) → (0.98, 0.98); (0, 6) → (0.99, 0.50); (−3, 4) → (0.40, 0.00)
Intermediate layer’s role (2): from face direction recognition
[Figure: a network with a 30 × 32 image input and four outputs: left, straight (strt), right, up]
http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-3/www/ml.html
Very old story
A result in the intermediate layer
• Weights to the intermediate layer
[Figure: the weights into the intermediate layer visualized over the 30 × 32 input, for the network with outputs left, straight (strt), right, up]
Intermediate layer’s role (3)
• Can this be learned?
Input → Output
10000000 → 10000000
01000000 → 01000000
00100000 → 00100000
00010000 → 00010000
00001000 → 00001000
00000100 → 00000100
00000010 → 00000010
00000001 → 00000001
Result of learning
Input → Intermediate-layer values → Output
10000000 → .89 .04 .08 → 10000000
01000000 → .01 .11 .88 → 01000000
00100000 → .01 .97 .27 → 00100000
00010000 → .99 .97 .71 → 00010000
00001000 → .03 .05 .02 → 00001000
00000100 → .22 .99 .99 → 00000100
00000010 → .80 .01 .98 → 00000010
00000001 → .60 .94 .01 → 00000001
Information compression – the root of encoder network
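As a sketch of how such a result can be obtained (illustrative code added here, not the original experiment; the network size is 8-3-8 as above, but the learning rate, initialization, and iteration count are assumptions), the following Python trains the 8-3-8 encoder with error back propagation and prints the learned intermediate-layer values.

```python
import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

random.seed(0)
N_IN, N_HID, N_OUT = 8, 3, 8
# w_hid[j][i]: weight from input i (index 0 is the bias input) to hidden unit j
w_hid = [[random.uniform(-0.5, 0.5) for _ in range(N_IN + 1)] for _ in range(N_HID)]
w_out = [[random.uniform(-0.5, 0.5) for _ in range(N_HID + 1)] for _ in range(N_OUT)]

patterns = [[1.0 if i == k else 0.0 for i in range(N_IN)] for k in range(N_IN)]
eta = 0.3

for _ in range(5000):
    for x in patterns:                       # target equals input (identity mapping)
        # forward pass
        xb = [1.0] + x
        h = [sigmoid(sum(wi * xi for wi, xi in zip(w, xb))) for w in w_hid]
        hb = [1.0] + h
        o = [sigmoid(sum(wi * hi for wi, hi in zip(w, hb))) for w in w_out]
        # backward pass: output and hidden deltas, then updates proportional
        # to the back-propagated errors
        d_out = [(x[k] - o[k]) * o[k] * (1 - o[k]) for k in range(N_OUT)]
        d_hid = [h[j] * (1 - h[j]) * sum(d_out[k] * w_out[k][j + 1] for k in range(N_OUT))
                 for j in range(N_HID)]
        for k in range(N_OUT):
            for i in range(N_HID + 1):
                w_out[k][i] += eta * d_out[k] * hb[i]
        for j in range(N_HID):
            for i in range(N_IN + 1):
                w_hid[j][i] += eta * d_hid[j] * xb[i]

for x in patterns:
    h = [sigmoid(sum(wi * xi for wi, xi in zip(w, [1.0] + x))) for w in w_hid]
    print("".join(str(int(v)) for v in x), "->", [round(v, 2) for v in h])
```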
Intermediate layer’s role (4): from NETTalk
• NETTalk: pronunciation of words
– Many rules available
– Many exceptions
NETTalk: cluster analysis
Interesting findings
Interesting things were found:
In the intermediate layers, representations that had not been imagined by the researchers were observed.
• The representations are meaningful.
• They perform information compression and extraction.
• This would again have a profound influence on future (now current) research.
MLP: a demo
http://playground.tensorflow.org/