Introduction
• Goal: classify objects by learning nonlinearities
– There are many problems for which linear discriminants are insufficient for minimum error
– In previous methods, the central difficulty was the choice of the appropriate nonlinear functions
– A “brute force” approach might be to select a complete basis set, such as all polynomials; such a classifier would require too many parameters to be determined from a limited number of training samples
– There is no automatic method for determining the nonlinearities when no information is provided to the classifier
– In multilayer neural networks, the form of the nonlinearity is learned from the training data
Feedforward Operation and Classification
• A three-layer neural network consists of an input layer, a hidden layer and an output layer, interconnected by modifiable weights represented by links between layers
• A single “bias unit” is connected to each unit other than the input units
• Net activation:
$$ \mathrm{net}_j = \sum_{i=1}^{d} w_{ji} x_i + w_{j0} = \sum_{i=0}^{d} w_{ji} x_i \equiv \mathbf{w}_j^{t}\mathbf{x} $$
where the subscript i indexes units in the input layer and j units in the hidden layer; w_ji denotes the input-to-hidden layer weight at hidden unit j (defining x_0 = 1 absorbs the bias into the sum).
• Each hidden unit emits an output that is a nonlinear function of its activation, that is:
$$ y_j = f(\mathrm{net}_j) = f\bigl(\underbrace{\mathbf{w}_j^{t}\mathbf{x} + w_{j0}}_{\text{defines an oriented “ridge”}}\bigr) $$
The function is constant perpendicular to this line.
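To make the computation concrete, here is a minimal NumPy sketch of a single hidden unit; the input pattern, the weight values, and the tanh nonlinearity are illustrative assumptions, not values from the text:

import numpy as np

x = np.array([0.5, -1.0, 2.0])     # input pattern, d = 3 features (illustrative)
w_j = np.array([0.2, 0.4, -0.1])   # input-to-hidden weights w_ji (illustrative)
w_j0 = 0.3                         # bias weight

net_j = w_j @ x + w_j0             # net_j = sum_{i=1}^{d} w_ji * x_i + w_j0
y_j = np.tanh(net_j)               # y_j = f(net_j), tanh as the nonlinearity
print(net_j, y_j)                  # -0.2 and tanh(-0.2)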
Multi-layer perceptrons: sums of ridge functions
The nonlinear functions are one-dimensional and are mapped along the direction of the weight vector
Figure 6.1 shows a simple threshold function:
$$ f(\mathrm{net}) = \operatorname{sgn}(\mathrm{net}) \equiv \begin{cases} +1 & \text{if } \mathrm{net} \ge 0 \\ -1 & \text{if } \mathrm{net} < 0 \end{cases} $$
• The function f(.) is also called the activation function or “nonlinearity” of a unit.
• Each output unit similarly computes its net activation based on the hidden-unit signals as:
$$ \mathrm{net}_k = \sum_{j=1}^{n_H} w_{kj} y_j + w_{k0} = \sum_{j=0}^{n_H} w_{kj} y_j = \mathbf{w}_k^{t}\mathbf{y} = \mathbf{w}_k^{t} f(W_1 \mathbf{x}) $$
where the subscript k indexes units in the output layer, n_H denotes the number of hidden units, and y_0 = 1 absorbs the output bias.
• When there is more than one output, each output unit is denoted z_k. An output unit computes the nonlinear function of its net activation, emitting
$$ z_k = f(\mathrm{net}_k) $$
• In the case of c outputs (classes), we can view the network as computing c discriminant functions z_k = g_k(x) and classify the input x according to the largest discriminant function g_k(x), k = 1, …, c
• The three-layer network with the weights listed in fig. 6.1 solves the XOR problem
– The hidden unit y1 computes the boundary x1 + x2 + 0.5 = 0:
y1 = +1 if x1 + x2 + 0.5 ≥ 0, and y1 = −1 otherwise
– The hidden unit y2 computes the boundary x1 + x2 − 1.5 = 0:
y2 = +1 if x1 + x2 − 1.5 ≤ 0, and y2 = −1 otherwise
– The final output unit emits z1 = +1 ⇔ y1 = +1 and y2 = +1; since y2 already encodes NOT (x1 AND x2), this gives
z1 = y1 AND y2 = (x1 OR x2) AND NOT (x1 AND x2) = x1 XOR x2,
which provides the nonlinear decision of fig. 6.1 (checked in the sketch below)
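This construction can be checked by running all four ±1 input patterns through such a network. The sketch below uses the two hidden boundaries given above (the second row is negated so that y2 = +1 exactly when x1 + x2 − 1.5 ≤ 0); the output-unit weights are an illustrative choice implementing the AND of the hidden units, not the exact values of fig. 6.1:

import numpy as np

def sgn(a):
    return np.where(a >= 0, 1, -1)

W1 = np.array([[ 1.0,  1.0],     # boundary x1 + x2 + 0.5 = 0
               [-1.0, -1.0]])    # negated so y2 = +1 when x1 + x2 <= 1.5
b1 = np.array([0.5, 1.5])

w2 = np.array([0.7, 0.7])        # output fires only when y1 = +1 and y2 = +1
b2 = -1.0                        # (illustrative AND weights, not from fig. 6.1)

for x1 in (-1, 1):
    for x2 in (-1, 1):
        x = np.array([x1, x2])
        y = sgn(W1 @ x + b1)     # hidden-unit outputs y1, y2
        z = sgn(w2 @ y + b2)     # output unit z1
        print(x1, x2, "->", z)   # z = +1 exactly when x1 XOR x2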
• General Feedforward Operation – case of c output units
– Hidden units enable us to express more complicated nonlinear functions and thus extend the classification power
– The activation function does not have to be a sign function; it is often required to be continuous and differentiable
– We can allow the activation function in the output layer to be different from the activation function in the hidden layer, or have a different activation for each individual unit
– We assume for now that all activation functions are identical
$$ g_k(\mathbf{x}) \equiv z_k = f\!\left( \sum_{j=1}^{n_H} w_{kj}\, f\!\left( \sum_{i=1}^{d} w_{ji} x_i + w_{j0} \right) + w_{k0} \right) \qquad (1) $$
$(k = 1, \dots, c)$
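Equation (1) maps almost line for line onto code. The following minimal NumPy sketch computes g_k(x) for all k and classifies by the largest discriminant; the tanh activation, the layer sizes, and the random weights are illustrative assumptions:

import numpy as np

def feedforward(x, W1, b1, W2, b2, f=np.tanh):
    # Equation (1): g_k(x) = f( sum_j w_kj * f( sum_i w_ji * x_i + w_j0 ) + w_k0 )
    y = f(W1 @ x + b1)   # hidden-unit outputs y_j, j = 1..n_H
    z = f(W2 @ y + b2)   # output values z_k = g_k(x), k = 1..c
    return z

# Illustrative sizes: d = 4 inputs, n_H = 3 hidden units, c = 2 classes.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=3)
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=2)

x = rng.normal(size=4)
g = feedforward(x, W1, b1, W2, b2)
print("predicted class:", np.argmax(g))   # classify by the largest g_k(x)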
• Expressive Power of multi-layer Networks
Q: Can every decision be implemented by a three-layer network described by equation (1)?
A: Yes (A. Kolmogorov). “Any continuous function from input to output can be implemented in a three-layer net, given sufficient number of hidden units n_H, proper nonlinearities, and weights.”
$$ g(\mathbf{x}) = \sum_{j=1}^{2n+1} \Xi_j\!\left( \sum_{i=1}^{n} \psi_{ij}(x_i) \right) \qquad \forall \mathbf{x} \in I^n \;\; (I = [0,1];\; n \ge 2) $$
for properly chosen functions $\Xi_j$ and $\psi_{ij}$. We can always scale the input region of interest to lie in a hypercube, and thus this condition on the feature space is not limiting.
• Each of the 2n+1 hidden units $\Xi_j$ takes as input a sum of n nonlinear functions, one for each input feature x_i
Unfortunately, Kolmogorov’s theorem tells us very little about how to find the nonlinear functions based on data; this is the central problem in network-based pattern recognition.
Moreover, any specific network structure specializes the above formulation, and hence most networks must be viewed as approximating the target function.
Backpropagation Algorithm
• Any function from input to output can be approximated by a three-layer neural network
• How do we learn the approximation?
• Our goal now is to set the interconnection weights based on the training patterns and the desired outputs
• In a three-layer network, it is a straightforward matter to understand how the output, and thus the error, depends on the hidden-to-output layer weights
• The power of backpropagation is that it enables us to compute an effective error for each hidden unit, and thus derive a learning rule for the input-to-hidden weights; this is known as:
The credit assignment problem
• Networks have two modes of operation:
– Feedforward: the feedforward operation consists of presenting a pattern to the input units and passing (or feeding) the signals through the network in order to obtain the outputs (no cycles!)
– Learning: supervised learning consists of presenting an input pattern and modifying the network parameters (weights) to reduce the distance between the computed output and the desired output
• Network Learning
– Let t_k be the k-th target (or desired) output and z_k be the k-th computed output, with k = 1, …, c; w represents all the weights of the network
– The training error:
$$ J(\mathbf{w}) = \frac{1}{2} \sum_{k=1}^{c} (t_k - z_k)^2 = \frac{1}{2}\,\lVert \mathbf{t} - \mathbf{z} \rVert^2 $$
– The backpropagation learning rule is based on gradient descent
• The weights are initialized with pseudo-random values and are changed in a direction that will reduce the error:
$$ \Delta \mathbf{w} = -\eta\,\frac{\partial J}{\partial \mathbf{w}} $$
where η is the learning rate, which indicates the relative size of the change in weights:
$$ \mathbf{w}(m+1) = \mathbf{w}(m) + \Delta \mathbf{w}(m) $$
where m indexes the m-th pattern presented
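In code, this sequential rule is a single update per presented pattern; a minimal sketch, assuming the weights and the gradient are stored as NumPy arrays (gd_step is a hypothetical helper name, not from the text):

import numpy as np

def gd_step(w: np.ndarray, grad_J: np.ndarray, eta: float = 0.1) -> np.ndarray:
    # w(m+1) = w(m) + delta_w(m), with delta_w = -eta * dJ/dw for this pattern
    return w - eta * grad_J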
– Error on the hidden-to-output weights:
$$ \frac{\partial J}{\partial w_{kj}} = \frac{\partial J}{\partial \mathrm{net}_k} \cdot \frac{\partial \mathrm{net}_k}{\partial w_{kj}} = -\delta_k\,\frac{\partial \mathrm{net}_k}{\partial w_{kj}} $$
where the sensitivity of unit k is defined as:
$$ \delta_k = -\frac{\partial J}{\partial \mathrm{net}_k} $$
and describes how the overall error changes with the unit’s net activation:
$$ \delta_k = -\frac{\partial J}{\partial \mathrm{net}_k} = -\frac{\partial J}{\partial z_k} \cdot \frac{\partial z_k}{\partial \mathrm{net}_k} = (t_k - z_k)\, f'(\mathrm{net}_k) $$
Since $\mathrm{net}_k = \mathbf{w}_k^{t}\mathbf{y}$, we have
$$ \frac{\partial \mathrm{net}_k}{\partial w_{kj}} = y_j $$
Conclusion: the weight update (or learning rule) for the hidden-to-output weights is:
$$ \Delta w_{kj} = \eta\,\delta_k\, y_j = \eta\,(t_k - z_k)\, f'(\mathrm{net}_k)\, y_j $$
– Error on the input-to-hidden weights:
$$ \frac{\partial J}{\partial w_{ji}} = \frac{\partial J}{\partial y_j} \cdot \frac{\partial y_j}{\partial \mathrm{net}_j} \cdot \frac{\partial \mathrm{net}_j}{\partial w_{ji}} $$
However,
$$ \frac{\partial J}{\partial y_j} = \frac{\partial}{\partial y_j}\left[ \frac{1}{2} \sum_{k=1}^{c} (t_k - z_k)^2 \right] = -\sum_{k=1}^{c} (t_k - z_k)\,\frac{\partial z_k}{\partial y_j} = -\sum_{k=1}^{c} (t_k - z_k)\,\frac{\partial z_k}{\partial \mathrm{net}_k} \cdot \frac{\partial \mathrm{net}_k}{\partial y_j} = -\sum_{k=1}^{c} (t_k - z_k)\, f'(\mathrm{net}_k)\, w_{kj} $$
Similarly, as in the preceding case, we define the sensitivity for a hidden unit:
$$ \delta_j \equiv f'(\mathrm{net}_j) \sum_{k=1}^{c} w_{kj}\,\delta_k $$
which means that: “The sensitivity at a hidden unit is simply the sum of the individual sensitivities at the output units weighted by the hidden-to-output weights w_kj, all multiplied by f’(net_j)”
Conclusion: the learning rule for the input-to-hidden weights is:
$$ \Delta w_{ji} = \eta\, x_i\, \delta_j = \eta\, \underbrace{\Bigl[\sum_{k=1}^{c} w_{kj}\,\delta_k\Bigr] f'(\mathrm{net}_j)}_{\delta_j}\, x_i $$
– Starting with a pseudo-random weight configuration, the stochastic backpropagation algorithm can be written as:

initialize n_H, w, criterion θ, η, m ← 0
do m ← m + 1
      x^m ← randomly chosen pattern
      w_ji ← w_ji + η δ_j x_i
      w_kj ← w_kj + η δ_k y_j
until ‖∇J(w)‖ < θ
return w
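Below is a runnable NumPy sketch of this loop for a three-layer network with tanh activations (so f'(net) = 1 − f(net)²). The XOR-style training data, layer sizes, learning rate η, and threshold θ are illustrative assumptions, and the per-pattern gradient norm is used as a proxy for the stopping criterion:

import numpy as np

rng = np.random.default_rng(0)
f = np.tanh
fprime = lambda net: 1.0 - np.tanh(net) ** 2   # f'(net) for f = tanh

X = np.array([[-1., -1.], [-1., 1.], [1., -1.], [1., 1.]])   # toy patterns
T = np.array([[-1.], [1.], [1.], [-1.]])                     # targets (XOR)

d, n_H, c = 2, 3, 1                  # layer sizes (illustrative)
eta, theta = 0.1, 1e-3               # learning rate and stopping threshold
W1, b1 = rng.normal(0.0, 0.5, (n_H, d)), np.zeros(n_H)   # w_ji, w_j0
W2, b2 = rng.normal(0.0, 0.5, (c, n_H)), np.zeros(c)     # w_kj, w_k0

for m in range(100_000):
    p = rng.integers(len(X))
    x, t = X[p], T[p]                            # x^m <- randomly chosen pattern
    net_j = W1 @ x + b1; y = f(net_j)            # hidden layer
    net_k = W2 @ y + b2; z = f(net_k)            # output layer
    delta_k = (t - z) * fprime(net_k)            # output sensitivities
    delta_j = fprime(net_j) * (W2.T @ delta_k)   # hidden sensitivities
    W2 += eta * np.outer(delta_k, y); b2 += eta * delta_k   # w_kj updates
    W1 += eta * np.outer(delta_j, x); b1 += eta * delta_j   # w_ji updates
    # Per-pattern gradient norm as a proxy for ||grad J(w)|| < theta.
    g2 = (np.outer(delta_k, y) ** 2).sum() + (np.outer(delta_j, x) ** 2).sum()
    if np.sqrt(g2) < theta:
        break

print([round(f(W2 @ f(W1 @ xx + b1) + b2).item(), 2) for xx in X])  # ~[-1, 1, 1, -1]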
– Stopping criterion
• The algorithm terminates when the change in the criterion function J(w) is smaller than some preset value θ
• There are other stopping criteria that lead to better performance than this one
• So far, we have considered the error on a single pattern, but we want to consider an error defined over the entirety of patterns in the training set
• The total training error is the sum over the errors of the n individual patterns:
$$ J = \sum_{p=1}^{n} J_p \qquad (2) $$
– Stopping criterion (cont.)
• A weight update may reduce the error on the single pattern being presented but can increase the error on the full training set
• However, given a large number of such individual updates, the total error of equation (2) decreases
• Learning Curves
– Before training starts, the error on the training set is high; through the learning process, the error becomes smaller
– The error per pattern depends on the amount of training data and the expressive power (such as the number of weights) of the network
– The average error on an independent test set is always higher than on the training set, and it can decrease as well as increase
– A validation set is used in order to decide when to stop training; we do not want to overfit the network and decrease the generalization power of the classifier:
“we stop training at a minimum of the error on the validation set”
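A sketch of this stopping rule, assuming hypothetical helpers train_epoch and validation_error supplied by the caller (neither is defined in the text):

import copy

def train_with_early_stopping(net, train_epoch, validation_error,
                              max_epochs=1000, patience=10):
    # train_epoch(net) runs one epoch of backpropagation in place;
    # validation_error(net) returns the error on the held-out validation set.
    # Both are hypothetical callables; names and defaults are illustrative.
    best_err, best_net, since_best = float("inf"), copy.deepcopy(net), 0
    for epoch in range(max_epochs):
        train_epoch(net)
        err = validation_error(net)
        if err < best_err:                 # new minimum on the validation set
            best_err, best_net, since_best = err, copy.deepcopy(net), 0
        else:
            since_best += 1
            if since_best >= patience:     # validation error stopped improving
                break
    return best_net                        # weights at the validation minimum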