Neural networks: an introduction
Akito Sakurai
Contents
• Overview – Neurons and models
• Implementation in the early days – Perceptrons
• An introduction to MLP
What are neurons?
• “There is no such thing as a typical neuron”, Arbib, 1997
A typical(?) neuron
Dynamics of neuron activity
Paul M. Churchland (1996), The Engine of Reason, The Seat of the Soul
McCulloch and Pitts
• Warren S. McCulloch and Walter Pitts (1943), "A logical calculus of the ideas immanent in nervous activity", Bulletin of Mathematical Biophysics, 5: 115-133.
• A very simplified (mathematical, computational) model
• Any logical function can be realized by connecting the model neurons.
A model neuron
A model neuron receives an input vector, processes it, and outputs a value.
The models in the early days used the step function as the activation function. Currently, the sigmoid or rectified linear functions are commonly used.
o = f(w · x), where x is the input vector, w the weight vector, and f the activation function
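To make this concrete, here is a minimal Python sketch of a model neuron (an illustration added here, not code from the lecture); the example weights, inputs, and the choice among step, sigmoid, and rectified linear activations are assumptions.

```python
import math

def step(z):                     # activation used in the early models
    return 1.0 if z >= 0 else 0.0

def sigmoid(z):                  # a common modern choice
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):                     # rectified linear function
    return max(0.0, z)

def model_neuron(x, w, activation=step):
    """Weighted sum of the inputs followed by an activation function."""
    net = sum(wi * xi for wi, xi in zip(w, x))
    return activation(net)

# Example with hand-picked weights and inputs (arbitrary values)
print(model_neuron([1.0, 0.5], [0.4, -0.2], activation=sigmoid))
```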
Perceptron: its structure
• Rosenblatt, F. (1957). “The perceptron: A perceiving and recognizing automaton (project PARA).”, Technical Report 85-460-1, Cornell Aeronautical Laboratory.
• Rosenblatt, F. (1962). “Principles of Neurodynamics.”, Spartan Books, New York.
Contents
• Overview – Neurons and models
• Implementation in the early days – Perceptrons
• An introduction to MLP
Perceptron: a polysemic word
Perceptron is used to mean varying concepts:
• Linear threshold unit: the next slide
• Original perceptron: as follows
• Sigmoid unit or any other similar units
• Network of sigmoidal units: called a multi-layer perceptron or MLP
• Network of linear threshold units: MLP, but quite rarely used in this meaning
In this class, we assume it means a sigmoidal network.
[Figure: the original perceptron (Rosenblatt 1962; Minsky and Papert 1969)]
Perceptron: a single-neuron unit model. Alias: Linear Threshold Unit (LTU) or Linear Threshold Gate (LTG)
Net input: the value of a linear function applied to the inputs
Net output: the value of the threshold function applied to the net input (threshold = w0)
The function applied to the net input to obtain the net output is called the activation function
Perceptron networks: a network of perceptrons connected through weighted links w_i
Multi-Layer Perceptron (MLP)
[Figure: a perceptron unit with inputs x1, x2, …, xn, weights w1, w2, …, wn, and a bias input x0 = 1 with weight w0]
net = Σ_{i=0}^{n} w_i x_i
o(x1, x2, …, xn) = 1 if Σ_{i=0}^{n} w_i x_i > 0, and −1 otherwise
Vector notation: o(x) = sgn(w · x), where sgn(y) = 1 if y > 0 and −1 otherwise
Decision boundary of perceptron
Perceptron can easily represent many important functions:
• Logical functions (McCulloch and Pitts, 1943), e.g., with simple integer weights: AND(x1, x2), OR(x1, x2), NOT(x) (a sketch follows below)
• Some functions are not representable, e.g., linearly non-separable functions (which is just a paraphrase)
• To circumvent this: construct a network of perceptrons
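As an illustration of the logical functions above (added here, not from the slides), a minimal Python sketch of a linear threshold unit realizing AND, OR, and NOT with hand-picked integer weights; the particular weight values are assumptions.

```python
def ltu(w, x):
    """Linear threshold unit: 1 if w · x > 0, else -1 (x is prefixed with x0 = 1)."""
    net = sum(wi * xi for wi, xi in zip(w, [1] + list(x)))
    return 1 if net > 0 else -1

# Hand-picked integer weights; w[0] plays the role of the threshold term w0
AND = lambda x1, x2: ltu([-3, 2, 2], (x1, x2))   # fires only when both inputs are 1
OR  = lambda x1, x2: ltu([-1, 2, 2], (x1, x2))   # fires when at least one input is 1
NOT = lambda x1:     ltu([1, -2], (x1,))         # inverts a single 0/1 input

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", AND(a, b), "OR:", OR(a, b))
print("NOT 0:", NOT(0), "NOT 1:", NOT(1))
```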
[Figure: Example A and Example B show sets of + and − points in the (x1, x2) plane, one linearly separable and one not]
Perceptron learning algorithm
Perceptron Learning Rule (Rosenblatt, 1959)
Idea: suppose that for each input vector an output (target) value is given. Then, by updating the weights, the perceptron will become able to output the proper values.
The unit takes binary values; for a perceptron unit, the weight update formula is
w_i ← w_i + Δw_i,   Δw_i = (t − o) x_i
where t = c(x) is the target value for x and o is the current perceptron output. The second formula is sometimes written Δw_i = r (t − o) x_i, where r is called a learning rate. Because for the perceptron any positive r gives equivalent results, r = 1 is used in the formula above.
When the training set D is linearly separable, the algorithm converges in finite time. Some literature requires r to be small enough, but that is not necessary for the perceptron.
In a different way
Initialization: w is any vector.
Repeat
  Select all x in F = F⁺ ∪ F⁻ in sequence, in arbitrary order
    If x ∈ F⁺ and w · x > 0 then continue;
    If x ∈ F⁺ and w · x ≤ 0 then FixPlus and continue;
    If x ∈ F⁻ and w · x < 0 then continue;
    If x ∈ F⁻ and w · x ≥ 0 then FixMinus and continue;
until no errors (neither FixPlus nor FixMinus is called)
FixPlus: w := w + x    FixMinus: w := w − x
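A minimal Python sketch of this procedure (an illustration, not code from the lecture); the toy training set, the ±1 labels, and the zero initial weights are assumptions. FixPlus and FixMinus correspond to the update Δw_i = (t − o) x_i with r = 1.

```python
def predict(w, x):
    """LTU output: 1 if w · x > 0 else -1 (x already includes the bias input x0 = 1)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1

def perceptron_learning(samples, n_features, max_epochs=100):
    """samples: list of (x, t) with x a feature tuple and t in {+1, -1}."""
    w = [0.0] * (n_features + 1)          # "any vector"; zeros for simplicity
    for _ in range(max_epochs):
        errors = 0
        for x, t in samples:
            x = (1.0,) + tuple(x)         # prepend the bias input x0 = 1
            if predict(w, x) != t:
                # FixPlus (t = +1): w := w + x;  FixMinus (t = -1): w := w - x
                w = [wi + t * xi for wi, xi in zip(w, x)]
                errors += 1
        if errors == 0:                   # neither FixPlus nor FixMinus was called
            break
    return w

# Toy example: learn OR (linearly separable), with labels mapped to +1/-1
data = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
print(perceptron_learning(data, n_features=2))
```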
Linearly separable?
Linearly Separable (LS) data set
[Figure: a scatter of + and − points in the (x1, x2) plane that a straight line can separate]
Definition: suppose 0 or 1 is the label f(x) of each x in D. If there exist w and θ such that f(x) = 1 if w1 x1 + w2 x2 + … + wn xn ≥ θ, and f(x) = 0 otherwise, then D is called linearly separable. θ is called a threshold.
Linearly separable? Note: D being linearly separable does not mean its population is so.
• disjunction: c(x) = x1′ ∨ x2′ ∨ … ∨ xm′ (○)
• m of n: c(x) = at least 3 of (x1′, x2′, …, xm′) (○)
• exclusive OR (XOR): c(x) = x1 ⊕ x2 (×)
• DNF: c(x) = T1 ∨ T2 ∨ … ∨ Tm, with Ti = l1 ∧ l2 ∧ … ∧ lk (×)
Transformation of expression: can we transform a linearly non-separable problem into a linearly separable one? If it is possible, is it meaningful? Is it realistic? Is it an important problem?
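To illustrate the XOR case (an addition, not from the slides), a small Python sketch that searches exhaustively over integer weights and thresholds for a separating (w1, w2, θ); the search grid is an arbitrary assumption, but for XOR no (w, θ) exists at all, so the search necessarily fails.

```python
from itertools import product

def separable(labels, grid=range(-3, 4)):
    """Search for integer w1, w2, theta such that, for all four Boolean inputs,
       label(x) == 1 exactly when w1*x1 + w2*x2 >= theta."""
    points = [(0, 0), (0, 1), (1, 0), (1, 1)]
    for w1, w2, theta in product(grid, repeat=3):
        if all((w1 * x1 + w2 * x2 >= theta) == (lab == 1)
               for (x1, x2), lab in zip(points, labels)):
            return (w1, w2, theta)
    return None

# labels given in the input order (0,0), (0,1), (1,0), (1,1)
print("OR :", separable([0, 1, 1, 1]))   # some separating (w1, w2, theta) is found
print("XOR:", separable([0, 1, 1, 0]))   # None: no linear threshold realizes XOR
```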
Convergence of the perceptron algorithm
Perceptron convergence theorem
Claim: if there exists a weight vector w which is consistent with the training dataset (i.e., the dataset is linearly separable), the perceptron learning algorithm converges.
Proof idea: the search space shrinks in an orderly way toward a limit (the width of the wedge forming the search space decreases). Ref.: Minsky and Papert, 11.2-11.3.
Note 1: how many repetitions are necessary until convergence?
Note 2: what happens if the dataset is not linearly separable?
Perceptron cycling theorem
Claim: if the training dataset is not linearly separable, the weight vectors obtained during the perceptron algorithm form a bounded set. If the weights are integers, the set is finite.
Proof idea: if the initial weight vector is long enough, its length can be shown to be unable to grow; proven by mathematical induction on the dimension n. Ref.: Minsky and Papert, 11.10.
How to make it robust? Or how to make it more expressive?
Goal 1: develop an algorithm which finds a better approximation.
Goal 2: develop a new architecture to go beyond the restrictions.
Universal approximation theorem
• A neural network with a single hidden layer can approximate any continuous function to any required accuracy, provided a finite but arbitrarily large number of hidden units may be used.
[Figure: a single-hidden-layer network with input x and hidden-unit outputs y1, …, y4 approximating a function of x]
Any number of units may be used if necessary.
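As a rough illustration of the idea (added here; the target function, the number of hidden units, and the sigmoid steepness are arbitrary assumptions), a hand-built single-hidden-layer network of steep sigmoids approximates a one-dimensional continuous function by a staircase, to within roughly the grid spacing.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)                 # numerically stable for large negative z
    return ez / (1.0 + ez)

def f(x):                            # the continuous target function (arbitrary choice)
    return math.sin(x)

def network(x, n_hidden=50, lo=0.0, hi=6.28, steep=200.0):
    """Each hidden unit is a steep sigmoid that switches on near one grid point;
       its outgoing weight is the increment of f across that step (a staircase fit)."""
    xs = [lo + i * (hi - lo) / n_hidden for i in range(n_hidden + 1)]
    y = f(xs[0])
    for i in range(1, len(xs)):
        y += (f(xs[i]) - f(xs[i - 1])) * sigmoid(steep * (x - xs[i]))
    return y

for x in (0.5, 1.5, 3.0, 5.0):
    print(f"x={x:.1f}  f(x)={f(x):+.3f}  network={network(x):+.3f}")
```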
Perceptron’s ability
• Perceptrons can
– Recognize characters (alphabets)
– Recognize types of patterns (forms, etc.)
– Learn with a splendid learning algorithm
• The perceptron learning algorithm, as we have seen, is able to find a solution to any problem that has a solution representable by a perceptron.
• Note: the existence of a solution does not by itself help us find one.
Unfortunately
[Figure: two network parts, A and B, annotated "learnable" and "how to learn was unknown"]
When an output is not as required, some weights must be updated; that much is easily inferred. But how much they should be changed was not known at the time.
In short
• There exist problems that cannot be solved (by a single perceptron), although a solution exists in the form of a network.
– Parity or XOR problem
– Connectivity of patterns
– In general, linearly non-separable problems
• Marvin L. Minsky and Seymour Papert (1969), "Perceptrons", Cambridge, MA: MIT Press
– Proved that the perceptron is not capable of solving many problems.
– We can construct a network! Yes, but "we" must do it.
• A McCulloch & Pitts neuron network is equivalent to a Turing machine (i.e., "universal"). OK, but does that help us?
– If we do not know how to make it learn, it is useless.
• Does a learning algorithm for such networks exist?
Cognitron
Fig. 4 a-c. Three possible methods for interconnecting layers. The connectable area of each cell is differently chosen in these three methods. Method c is adopted for the cognitron discussed in this paper
K. Fukushima, Cognitron: A self-organizing multilayered neural network, Biological Cybernetics, 1975
Sophisticated structure; specific learning algorithm; too advanced to be popular
Exception at the time
Neocognitron
Fig. 2. Schematic diagram illustrating the interconnections between layers in the neocognitron
K. Fukushima, Neocognitron: A Self-Organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position, Biological Cybernetics (1980)
Exception at the time
S-cell in Neocognitron
K. Fukushima, Neocognitron: A Hierarchical Neural Network Capable of Visual Pattern Recognition, Neural Networks, Vol. 1, pp. 119-130, 1988
Contents
• Overview – Neurons and models
• Implementation in the early days – Perceptrons
• MLP appears
PDP appeared
• The book "Perceptrons" was said to have delayed research in the field by two decades (opinions differ)
• A turning point: D.E. Rumelhart, J.L. McClelland, eds., "Parallel Distributed Processing: Explorations in the Microstructure of Cognition", MIT Press, 1986.
– A compilation of articles, from mathematics to philosophy
– Many successful experiments on multi-layer networks
– Proposed the "error back propagation algorithm": an unexpected influence on learning algorithms.
– [Because techniques similar to BP were found before PDP (Amari, 1967; Werbos, 1974, "dynamic feedback"; Parker, 1982, "learning logic"), reinvention/rediscovery would be a better word to describe it. But it has had a profound influence.]
Error Back Propagation
• Basically it is for multi-layer feed-forward networks, but it can be applied to other types of networks.
• Weights are updated in proportion to back-propagated errors
Two important inventions
Reasons why PDP succeeded
• Changed the activation function of units from the threshold function to the sigmoid function
• Formulated the learning problem as an error-minimization problem, e.g., E(w) = (1/2) Σ_d (t_d − o_d)², where t_d is the target value for sample d and o_d is the current output
• Solved the (nonlinear, multivariate) minimization problem by a naïve method (steepest descent)
Error Minimization
• Do not pursue a complete solution (error = 0)
– Our ability may be limited
– Samples may contain errors
– Samples may come from probabilistic events
• Consider (perhaps too naively) the sum of squares of the differences between targets and current outputs as the error.
• Find the weights that minimize the error.
[Figure: plot of the sigmoid function, rising from 0 to 1]
Sigmoid function
[Figure: a unit with inputs x1, …, xn and x0 = 1, weights w1, …, wn and w0, computing net = Σ_{i=0}^{n} w_i x_i and output o = f(net)]
Quite often used: σ(x) = 1 / (1 + e^{−x})
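A minimal Python sketch (illustration only) of the sigmoid unit and the sum-of-squares error described above; the sample data are made-up assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def output(w, x):
    """Sigmoid unit: o = sigmoid(net), net = sum_i w_i x_i with x0 = 1 prepended."""
    net = sum(wi * xi for wi, xi in zip(w, [1.0] + list(x)))
    return sigmoid(net)

def error(w, samples):
    """Sum of squared differences between targets and current outputs."""
    return 0.5 * sum((t - output(w, x)) ** 2 for x, t in samples)

# made-up samples: (feature vector, target value in [0, 1])
samples = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
print(error([0.0, 0.0, 0.0], samples))
```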
A method of minimization
• Solve the equations obtained by equating the gradients to 0:
∂E/∂w_1 = ∂E/∂w_2 = ∂E/∂w_3 = 0,  where E(w) = (1/2) Σ_d (t_d − o_d)² (t_d: target for d)
• Because f is non-linear, we cannot solve these equations directly.
• An iterative method is therefore sought, i.e., a method that gives us w_1, w_2, w_3 such that the gradients vanish.
Iterative method for minimization
• Many efficient methods have been proposed
• The simplest one is steepest descent
– The steepest ascent method gives us a maximum
[Figure: the error surface E(w) over the weight plane (w0, w1), shown as a 3-D surface and as a contour plot]
The direction of steepest descent is normal to the contours of the error
Mathematically
• Steepest descent: update w a little along the direction −∇E(w)
• w_new = w_current − η ∇E(w_current)
• The learning rate η > 0 is to be chosen appropriately
[Figure: the error surface E(w) over the weight plane (w0, w1), with the steepest-descent step indicated]
Actually
• A method to simplify the calculation of the gradients is needed, because there are hundreds (at the time) or hundreds of thousands of weight variables.
[Figure: the same error surface E(w) over (w0, w1); the update w_new = w_current − η ∇E(w_current) must be computed for every weight]
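To make the update w_new = w_current − η ∇E(w) concrete, here is a short Python sketch (an illustration, not the lecture's code) that trains a single sigmoid unit by steepest descent on the squared error; the data, learning rate, and iteration count are assumptions. For multi-layer networks the same gradients are obtained efficiently by error back propagation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, n_features, eta=0.5, iters=2000):
    """Steepest descent on E(w) = 1/2 * sum_d (t_d - o_d)^2 for a single sigmoid unit."""
    w = [0.0] * (n_features + 1)                        # w[0] is the bias weight (x0 = 1)
    for _ in range(iters):
        grad = [0.0] * len(w)
        for x, t in samples:
            x = [1.0] + list(x)
            o = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            # dE/dw_i = -(t - o) * o * (1 - o) * x_i   (chain rule through the sigmoid)
            for i, xi in enumerate(x):
                grad[i] += -(t - o) * o * (1 - o) * xi
        w = [wi - eta * gi for wi, gi in zip(w, grad)]  # w_new = w_current - eta * grad E
    return w

data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
w = train(data, n_features=2)
for x, t in data:
    o = sigmoid(sum(wi * xi for wi, xi in zip(w, [1.0] + x)))
    print(x, t, round(o, 2))
```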
What was done: sound classification
(Gorman & Sejnowski, Neural Networks, 1(1), 75-89, 1988)
What was done: speech recognition
http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-3/www/ml.html
ALVINN: An Autonomous Land Vehicle In a Neural Network
• Made a tour on the interstate (Dean Pomerleau, 1995)
http://web.mit.edu/6.034/wwwbob/L9/node10.html
Navlab
Problems to be solved
• Inputs: N labeled samples (x_i, y_i), i = 1, …, N
– x_i ∈ R^d is a d-dimensional feature vector
– y_i ∈ R^c is a c-dimensional label
– If x_i belongs to class k, y_i = (0, …, 0, 1, 0, …, 0), where only the k-th element is 1.
• Outputs: a function f
– f: x ∈ R^d ↦ y ∈ R^c
– It classifies the training samples as correctly as possible
– f(x) = (f_1(x), f_2(x), …, f_c(x)); x belongs to the k-th class → f_k(x) > f_j(x) for j ≠ k
Summary and a bit beyond it
A unit in neural networks
• A unit, or a neuron, is the basic construct of neural networks.
• It has a weight vector w and a nonlinear function f.
• It returns a value in [0, 1] by receiving an input vector x, calculating the inner product with the weight vector, and applying the nonlinear function f to the result.
• Note: the returned value becomes larger when the input vector x is more similar to the weight vector w.
[Figure: a unit computing the inner product of its inputs (with x0 = 1) and its weights, then applying a nonlinear function to produce a value in [0, 1]]
Connecting units in parallel
• Each unit returns a large value when its input vector is similar to its weight vector.
• When multiple units run in parallel, multi-class classification becomes possible (a sketch follows below).
[Figure: several units in parallel, each computing an inner product followed by a nonlinearity; the unit whose weight vector is most similar to the input vector produces the largest output]
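A small Python sketch (illustrative; the weight vectors and the input are made up) of units running in parallel: each row of W is one unit's weight vector, and the unit whose weights best match the input produces the largest output, which can be read as the predicted class.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(W, x):
    """One unit per row of W; each output is sigmoid(w · x)."""
    return [sigmoid(sum(wi * xi for wi, xi in zip(w, x))) for w in W]

# Made-up weight vectors: unit k "looks for" input pattern k
W = [[ 2.0, -1.0, -1.0],   # unit 0
     [-1.0,  2.0, -1.0],   # unit 1
     [-1.0, -1.0,  2.0]]   # unit 2

x = [0.1, 0.9, 0.2]        # most similar to unit 1's weight vector
outputs = layer(W, x)
print(outputs, "predicted class:", outputs.index(max(outputs)))
```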
Connecting units in series
• Insert intermediate layers between the input and output layers.
• The input vector is transformed so that it can be discriminated well → linearly non-separable cases can be handled
[Figure: an MLP in which the intermediate layers perform feature extraction/elaboration and the final layer performs classification]
Intermediate layer’s role (1)
[Figure: the two intermediate-layer outputs plotted for each input point]
Class 1: (−2, −2) → (0.18, 0.18); (−1, −1) → (0.12, 0.12); (−1, −2) → (0.06, 0.12); (−2, −1) → (0.04, 0.02)
Class 2: (0, −4) → (0.04, 0.50); (2, 2) → (0.98, 0.98); (0, 6) → (0.99, 0.50); (−3, 4) → (0.40, 0.00)
Intermediate layer’s role (2): from face direction recognition
[Figure: a network with a 30 × 32 image input and four outputs: left, straight (strt), right, up]
http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-3/www/ml.html
Very old story
A result in the intermediate layer
• Weights to the intermediate layer
[Figure: the weights into the intermediate layer visualized over the 30 × 32 input, for the network with outputs left, straight (strt), right, up]
Intermediate layer’s role (3)
• Can this be learned?
Input → Output
10000000 → 10000000
01000000 → 01000000
00100000 → 00100000
00010000 → 00010000
00001000 → 00001000
00000100 → 00000100
00000010 → 00000010
00000001 → 00000001
Result of learning
Input → Intermediate-layer values → Output
10000000 → .89 .04 .08 → 10000000
01000000 → .01 .11 .88 → 01000000
00100000 → .01 .97 .27 → 00100000
00010000 → .99 .97 .71 → 00010000
00001000 → .03 .05 .02 → 00001000
00000100 → .22 .99 .99 → 00000100
00000010 → .80 .01 .98 → 00000010
00000001 → .60 .94 .01 → 00000001
Information compression – the root of encoder network
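As a sketch of how such a result can be obtained (illustrative code added here, not the original experiment; the network size is 8-3-8 as above, but the learning rate, initialization, and iteration count are assumptions), the following Python trains the 8-3-8 encoder with error back propagation and prints the learned intermediate-layer values.

```python
import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

random.seed(0)
N_IN, N_HID, N_OUT = 8, 3, 8
# w_hid[j][i]: weight from input i (index 0 is the bias input) to hidden unit j
w_hid = [[random.uniform(-0.5, 0.5) for _ in range(N_IN + 1)] for _ in range(N_HID)]
w_out = [[random.uniform(-0.5, 0.5) for _ in range(N_HID + 1)] for _ in range(N_OUT)]

patterns = [[1.0 if i == k else 0.0 for i in range(N_IN)] for k in range(N_IN)]
eta = 0.3

for _ in range(5000):
    for x in patterns:                       # target equals input (identity mapping)
        # forward pass
        xb = [1.0] + x
        h = [sigmoid(sum(wi * xi for wi, xi in zip(w, xb))) for w in w_hid]
        hb = [1.0] + h
        o = [sigmoid(sum(wi * hi for wi, hi in zip(w, hb))) for w in w_out]
        # backward pass: output and hidden deltas, then updates proportional
        # to the back-propagated errors
        d_out = [(x[k] - o[k]) * o[k] * (1 - o[k]) for k in range(N_OUT)]
        d_hid = [h[j] * (1 - h[j]) * sum(d_out[k] * w_out[k][j + 1] for k in range(N_OUT))
                 for j in range(N_HID)]
        for k in range(N_OUT):
            for i in range(N_HID + 1):
                w_out[k][i] += eta * d_out[k] * hb[i]
        for j in range(N_HID):
            for i in range(N_IN + 1):
                w_hid[j][i] += eta * d_hid[j] * xb[i]

for x in patterns:
    h = [sigmoid(sum(wi * xi for wi, xi in zip(w, [1.0] + x))) for w in w_hid]
    print("".join(str(int(v)) for v in x), "->", [round(v, 2) for v in h])
```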
Intermediate layer’s role (4): from NETTalk
• NETTalk: pronunciation of words
– Many rules available
– Many exceptions
NETTalk: cluster analysis
Interesting findings
Interesting things were found:
In the intermediate layers, representations that had not been imagined by the researchers were observed.
• The representations are meaningful.
• They perform information compression and extraction.
• This would again have a profound influence on future (now current) research.
MLP: a demo
http://playground.tensorflow.org/