Learning Processes
Introduction
• Property of primary significance in a NN: the ability to learn from its environment, and to improve its performance through learning.
• Learning takes place through an iterative adjustment of synaptic weights.
• Learning: hard to define.
• One definition, by Mendel and McClaren: "Learning is a process by which the free parameters of a neural network are adapted through a process of stimulation by the environment in which the network is embedded. The type of learning is determined by the manner in which the parameter changes take place."
Learning
• Sequence of events in NN learning:
• The NN is stimulated by the environment.
• The NN undergoes changes in its free parameters as a result of this stimulation.
• The NN responds in a new way to the environment because of the changes that have occurred in its internal structure.
• A prescribed set of well-defined rules for the solution of the learning problem is called a learning algorithm.
• The manner in which a NN relates to its environment dictates the learning paradigm, which refers to a model of the environment operated on by the NN.
Overview
• Five basic learning rules: error correction, Hebbian, memory-based, competitive, and Boltzmann
• Learning paradigms: credit assignment problem, supervised learning, unsupervised learning
• Learning tasks, memory, and adaptation
• Bias-variance tradeoff
Error Correction Learning
• Input x(n), output y_k(n), and desired response or target output d_k(n).
• Error signal: e_k(n) = d_k(n) − y_k(n)
• e_k(n) actuates a control mechanism that gradually adjusts the synaptic weights, to minimize the cost function (or index of performance):
E(n) = (1/2) e_k²(n)
• When the synaptic weights reach a steady state, learning is stopped.
Error Correction Learning: Delta Rule
• Widrow-Hoff rule, with learning rate η:
Δw_kj(n) = η e_k(n) x_j(n)
• With that, we can update the weights:
w_kj(n+1) = w_kj(n) + Δw_kj(n)
• Note: partial derivatives ∂E/∂w_kj give the gradient.
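The delta rule above can be sketched in a few lines of code. This is a minimal illustration, not a full trainer: a single linear neuron, a single made-up training pair x = [1, 2] with target d = 3, and an assumed learning rate eta = 0.1.

```python
def delta_rule_step(w, x, d, eta=0.1):
    """One Widrow-Hoff update: w_j <- w_j + eta * e * x_j, with e = d - y."""
    y = sum(wj * xj for wj, xj in zip(w, x))   # linear neuron output
    e = d - y                                   # error signal e = d - y
    return [wj + eta * e * xj for wj, xj in zip(w, x)]

w = [0.0, 0.0]
for _ in range(100):                            # iterate toward a steady state
    w = delta_rule_step(w, [1.0, 2.0], d=3.0)
print(w)   # weights converge so that the output w.x approaches d = 3
```

With this single pair, each step multiplies the error by (1 − η‖x‖²) = 0.5, so the weights quickly reach a steady state, matching the stopping criterion on the previous slide.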
Memory-Based Learning or Lazy Learning
• All (or most) past experiences are explicitly stored, as input-target pairs {(x_i, d_i)}, i = 1, …, N.
• Given a new input x_test, determine its class based on the local neighborhood of x_test.
• Criterion used for determining the neighborhood.
• Learning rule applied to the neighborhood of the input, within the set of training examples.
• Examples: CMAC, case-based learning.
Memory-Based Learning: Nearest Neighbor
• A set of instances observed so far: {x_1, x_2, …, x_N}.
• Nearest neighbor of x_test: the stored x_i that minimizes d(x_i, x_test), where d(·,·) is the Euclidean distance. x_test is classified as the same class as that nearest neighbor.
• Cover and Hart (1967): the error of the nearest-neighbor rule is at most twice the optimal error (the Bayes probability of error), given that:
• The classified examples are independently and identically distributed.
• The sample size N is infinitely large.
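The nearest-neighbor rule described above is a one-function algorithm. Here is a minimal sketch; the stored points and labels are toy values invented for illustration.

```python
import math

def nearest_neighbor(stored, x_test):
    """Classify x_test by the label of its nearest stored example."""
    def euclid(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    # min over stored (vector, label) pairs by Euclidean distance to x_test
    _, label = min(stored, key=lambda pair: euclid(pair[0], x_test))
    return label

# toy training set of (vector, label) pairs
stored = [([0.0, 0.0], "A"), ([0.1, 0.2], "A"), ([1.0, 1.0], "B")]
print(nearest_neighbor(stored, [0.9, 0.8]))   # nearest stored point is class B
```

Note that all "learning" is just storage; computation is deferred to query time, which is why this family is called lazy learning.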
Hebbian Synapses
• Donald Hebb's postulate of learning appeared in his book The Organization of Behavior (1949).
• "When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic changes take place in one or both cells such that A's efficiency, as one of the cells firing B, is increased."
• Hebbian synapse:
• If two neurons on either side of a synapse are activated simultaneously, the synapse is strengthened.
• If they are activated asynchronously, the synapse is weakened or eliminated. (This part was not mentioned in Hebb.)
Hebbian Learning
• Hebb’s Law provides the basis for learning without a teacher. Learning here is a local phenomenon occurring without feedback from the environment.
[Figure: a layer of input signals feeding a layer of output signals, with the synapse between neurons i and j highlighted]
Hebbian Learning
• Using Hebb's Law we can express the adjustment applied to the weight w_kj at iteration n:
Δw_kj(n) = F(y_k(n), x_j(n))
• As a special case, we can represent Hebb's Law as follows (activity product rule):
Δw_kj(n) = η y_k(n) x_j(n),
where η is the learning rate parameter.
• A nonlinear forgetting factor can be introduced into Hebb's Law:
Δw_kj(n) = η y_k(n) x_j(n) − α y_k(n) w_kj(n),
where α is the (usually small, positive) forgetting factor.
Hebbian Learning
• Hebbian: time-dependent, highly local, heavily interactive.
• Positively correlated activity: strengthen. Negatively correlated: weaken.
• Strong evidence for Hebbian plasticity in the hippocampus (brain region).
• Correlation vs. covariance rule
• Covariance rule (Sejnowski, 1977): Δw_kj = η (y_k − ȳ)(x_j − x̄)
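The activity product rule with a forgetting factor can be sketched as follows. The learning rate η = 0.1, forgetting factor α = 0.05, and the constant input x = [1, 1] are toy values; the `or 1.0` bootstrap for the very first output is purely an illustration device so the loop can start from zero weights.

```python
def hebb_update(w, x, y, eta=0.1, alpha=0.05):
    """Activity product rule with forgetting: dw_j = eta*y*x_j - alpha*y*w_j."""
    return [wj + eta * y * xj - alpha * y * wj for wj, xj in zip(w, x)]

w = [0.0, 0.0]
x = [1.0, 1.0]
for _ in range(50):                                   # repeated co-activation
    y = sum(wj * xj for wj, xj in zip(w, x)) or 1.0   # bootstrap first output
    w = hebb_update(w, x, y)
print(w)   # w grows toward (eta/alpha) * x = [2, 2]
```

Without the forgetting term the weights would grow without bound under repeated co-activation; the −α y w term caps them at (η/α)·x, which is the point of introducing it.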
Competitive Learning
• In competitive learning, neurons compete among themselves to be activated.
• While in Hebbian learning several output neurons can be activated simultaneously, in competitive learning only a single output neuron is active at any time.
• The output neuron that wins the "competition" is called the winner-takes-all neuron.
• The basic idea of competitive learning was introduced in the early 1970s.
• In the late 1980s, Teuvo Kohonen introduced a special class of artificial neural networks called self-organizing feature maps. These maps are based on competitive learning.
Competitive Learning
• Highly suited to discovering statistically salient features (that may aid in classification).
• Three basic elements:
• The same type of neurons with different weight sets, so that they respond differently to a given set of inputs.
• A limit imposed on the strength of each neuron.
• A competition mechanism, to choose one winner: the winner-takes-all neuron.
Competitive Learning
• Inputs and weights can be seen as vectors: x = [x_1, x_2, …, x_m]ᵀ and w_k = [w_k1, w_k2, …, w_km]ᵀ. Note that the weight vector belongs to a certain output neuron k, hence the index.
• Single layer; feedforward excitatory and lateral inhibitory connections.
• Winner selection: y_k = 1 if v_k > v_j for all j, j ≠ k; otherwise y_k = 0.
• Limit: Σ_j w_kj = 1 for all k.
• Adaptation: Δw_kj = η (x_j − w_kj) if neuron k wins, and 0 if it loses.
• w_k = [w_k1, w_k2, …, w_km]ᵀ is moved toward the input vector.
Competitive Learning
• Weight vectors converge toward local input clusters: clustering.
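The winner-takes-all adaptation rule above can be sketched directly. The two initial weight vectors and the two toy input clusters are invented for illustration; η = 0.2 is an assumed learning rate.

```python
def competitive_step(weights, x, eta=0.2):
    """Move only the winning neuron's weight vector toward the input x."""
    # winner = neuron whose weight vector is closest to the input
    win = min(range(len(weights)),
              key=lambda k: sum((xi - wi) ** 2
                                for xi, wi in zip(x, weights[k])))
    weights[win] = [wi + eta * (xi - wi) for wi, xi in zip(weights[win], x)]
    return win

weights = [[0.2, 0.1], [0.8, 0.9]]      # initial weight vectors, one per neuron
data = [[0.0, 0.0], [1.0, 1.0]] * 50    # two obvious toy clusters
for x in data:
    competitive_step(weights, x)
print(weights)   # each weight vector moves toward one cluster centre
```

This is exactly the clustering behavior the slide describes: each weight vector converges toward the centre of the input cluster it keeps winning.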
Boltzmann Learning
• Stochastic learning algorithm rooted in statistical mechanics.
• Recurrent network, binary neurons (on: '+1', off: '−1').
• Energy function:
E = −(1/2) Σ_k Σ_{j, j≠k} w_kj x_k x_j
• Activation:
• Choose a random neuron k.
• Flip its state with a probability (given temperature T):
P(x_k → −x_k) = 1 / (1 + exp(−ΔE_k / T)),
where ΔE_k is the change in E due to the flip.
Boltzmann Machine
• Two types of neurons:
• Visible neurons: can be affected by the environment.
• Hidden neurons: isolated.
• Two modes of operation:
• Clamped: visible neuron states are fixed by environmental input and held constant.
• Free-running: all neurons are allowed to update their activity freely.
Boltzmann Machine: Learning
• Correlation of activity during the clamped condition: ρ⁺_kj
• Correlation of activity during the free-running condition: ρ⁻_kj
• Weight update: Δw_kj = η (ρ⁺_kj − ρ⁻_kj)
• Train the weights with various clamping input patterns.
• After training is completed, present a new clamping input pattern that is a partial input of one of the known vectors.
• Let it run clamped on the new input (a subset of the visible neurons), and eventually it will complete the pattern (pattern completion).
• Note that both ρ⁺_kj and ρ⁻_kj range in value from −1 to +1.
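The stochastic activation rule can be sketched for a toy two-neuron network. The weights, temperature, and iteration count are invented for illustration, and the sign convention assumed here is that a flip which lowers the energy is accepted with high probability.

```python
import math, random

random.seed(0)  # reproducible toy run

def energy(w, x):
    """E = -(1/2) * sum over j != k of w[k][j] * x[k] * x[j]."""
    n = len(x)
    return -0.5 * sum(w[k][j] * x[k] * x[j]
                      for k in range(n) for j in range(n) if j != k)

def boltzmann_step(w, x, T):
    """Pick a random neuron; flip it with probability 1/(1 + exp(-dE/T))."""
    k = random.randrange(len(x))
    flipped = list(x)
    flipped[k] = -flipped[k]
    dE = energy(w, x) - energy(w, flipped)   # energy decrease caused by flip
    if random.random() < 1.0 / (1.0 + math.exp(-dE / T)):
        return flipped
    return x

w = [[0.0, 1.0], [1.0, 0.0]]   # symmetric weights, zero self-connections
x = [1, -1]
for _ in range(200):
    x = boltzmann_step(w, x, T=0.5)
print(x)   # positive coupling favors the aligned states (+1,+1) or (-1,-1)
```

At low temperature the network spends most of its time in low-energy states, which is what makes the clamped vs. free-running correlations ρ⁺_kj and ρ⁻_kj meaningful quantities to compare.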
Learning Paradigms
• How neural networks relate to their environment:
• credit assignment problem
• learning with a teacher
• learning without a teacher
Credit Assignment Problem
• How to assign credit or blame for the overall outcome to the individual decisions made by the learning machine.
• In many cases, the outcomes depend on a sequence of actions.
• Assignment of credit for outcomes of actions (temporal credit-assignment problem): when does a particular action deserve credit?
• Assignment of credit for actions to internal decisions (structural credit-assignment problem): assign credit to the internal structures of actions generated by the system.
• The credit-assignment problem routinely arises in neural network learning: which neuron, which connection to credit or blame?
Learning Paradigms
• Supervised learning
• The teacher has knowledge, represented as input-output examples.
• The environment is unknown to the NN.
• The NN tries to emulate the teacher gradually.
• Error-correction learning is one way to achieve this.
• Error surface, gradient, steepest descent, etc.
Learning Paradigms
• Learning without a teacher: no labeled examples of the function to be learned are available.
1) Reinforcement learning 2) Unsupervised learning
Learning Paradigms
1) Reinforcement learning: The learning of an input-output mapping is performed through continued interaction with the environment in order to minimize a scalar index of performance.
Learning Paradigms
• Delayed reinforcement, which means that the system observes a temporal sequence of stimuli.
• Difficult to perform for two reasons:
• There is no teacher to provide a desired response at each step of the learning process.
• The delay incurred in the generation of the primary reinforcement signal implies that the machine must solve a temporal credit-assignment problem.
• Reinforcement learning is closely related to dynamic programming.
Learning Paradigms
• Unsupervised learning: there is no external teacher or critic to oversee the learning process.
• Provision is made for a task-independent measure of the quality of the representation that the network is required to learn.
• Internal representations for encoding features of the input space.
• A competitive learning rule is needed, such as winner-takes-all.
The Issues of Learning Tasks: Pattern Association
• An associative memory: a brain-like distributed memory that learns by association.
• Autoassociation: a neural network is required to store a set of patterns by repeatedly presenting them to the network. The network is then presented a partial description of an original pattern stored in it, and the task is to retrieve that particular pattern.
• Heteroassociation: it differs from autoassociation in that an arbitrary set of input patterns is paired with another arbitrary set of output patterns.
The Issues of Learning Tasks
• Let x_k denote a key pattern and y_k denote a memorized pattern. The pattern association is described by
x_k → y_k, k = 1, 2, …, q.
• In an autoassociative memory, y_k = x_k.
• In a heteroassociative memory, y_k ≠ x_k.
• Storage phase vs. recall phase.
• q (the number of stored patterns) is a direct measure of the storage capacity.
The Issues of Learning Tasks: Classification
• Pattern recognition: the process whereby a received pattern/signal is assigned to one of a prescribed number of classes.
• Two general types:
• Feature extraction (observation space to feature space; cf. dimensionality reduction), then classification (feature space to decision space).
• Single step (observation space to decision space).
The Issues of Learning Tasks
• Function approximation: consider a nonlinear input-output mapping
𝒅 = 𝒇(𝒙)
The vector 𝒙 is the input and the vector 𝒅 is the output. The function 𝒇(·) is assumed to be unknown. The requirement is to design a neural network F(·) that approximates the unknown function:
‖F(𝒙) − 𝒇(𝒙)‖ < ε for all 𝒙
• System identification
• Inverse system
The Issues of Learning Tasks
• Control: the controller has to invert the plant's input-output behavior.
• Feedback controller: adjust the plant input u so that the output of the plant y tracks the reference signal d. Learning takes the form of free-parameter adjustment in the controller.
The Issues of Learning Tasks
• Filtering: estimate a quantity at time n, based on measurements up to and including time n.
• Smoothing: estimate a quantity at time n − λ, based on measurements up to time n (λ > 0).
• Prediction: estimate a quantity at time n + λ, based on measurements up to time n (λ > 0).
The Issues of Learning Tasks
• Blind source separation: recover the source signal u(n) from the distorted (observed) signal
x(n) = A u(n)
when the mixing matrix A is unknown.
• Nonlinear prediction: given past samples x(n − T), x(n − 2T), …, estimate x̂(n) (x̂ is the estimated value).
The Issues of Learning Tasks: Associative Memory
• q pattern pairs: (x_k, y_k) for k = 1, 2, …, q.
• Input (key vector): x_k = [x_k1, x_k2, …, x_km]ᵀ
• Output (memorized vector): y_k = [y_k1, y_k2, …, y_km]ᵀ
• The weights can be represented as a weight matrix W(k) (see next slide):
y_k = W(k) x_k
Associative Memory
• y_k = W(k) x_k
• With a single W(k), we can only represent one mapping x_k → y_k. For all q pairs (k = 1, 2, …, q), we need q such weight matrices:
y_k = W(k) x_k for k = 1, 2, …, q
• One strategy is to combine all W(k) into a single memory matrix M by simple summation:
M = Σ_{k=1}^{q} W(k)
• Will such a simple strategy work? That is, can the following be possible with M?
y_k = M x_k for k = 1, 2, …, q
Correlation Matrix Memory
• With a fixed set of key vectors e_k, a matrix M can store q arbitrary output vectors y_k.
• Let e_k = [0, …, 0, 1, 0, …, 0]ᵀ, where only the k-th element is 1 and all the rest are 0.
• Construct a memory matrix with each column representing one of the arbitrary output vectors y_k:
M = [y_1, y_2, …, y_q]
• Then, M e_k = y_k for k = 1, 2, …, q.
• But we want the key vectors to be arbitrary too!
Correlation Matrix Memory
• With q pairs (x_k, y_k) for k = 1, 2, …, q, we can construct a candidate memory matrix M̂ that stores all mappings as:
M̂ = y_1 x_1ᵀ + y_2 x_2ᵀ + … + y_q x_qᵀ = Σ_{k=1}^{q} y_k x_kᵀ
• This can be verified easily using partitioned matrices.
Correlation Matrix Memory
• First, consider M̂ = y_1 x_1ᵀ only.
• Check if M̂ x_1 = y_1:
M̂ x_1 = y_1 x_1ᵀ x_1 = y_1 (x_1ᵀ x_1) = c y_1,
where c = x_1ᵀ x_1 is a scalar value (the length of the vector x_1, squared). If all x_k were normalized to have length 1, M̂ x_1 = y_1 will hold!
• Now, back to M̂ = Σ_{k=1}^{q} y_k x_kᵀ. Under what condition will M̂ x_j give y_j for all j? Let's begin by assuming x_kᵀ x_k = 1 (key vectors are normalized).
• We can decompose M̂ x_j as follows:
M̂ x_j = y_j (x_jᵀ x_j) + Σ_{k≠j} y_k (x_kᵀ x_j) = y_j + Σ_{k≠j} y_k (x_kᵀ x_j)
• If all keys are orthogonal, the last (noise) term vanishes, so M̂ x_j = y_j.
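The outer-product construction is easy to check numerically. This sketch uses two orthonormal toy key vectors and two arbitrary memorized vectors, all invented for illustration; with orthonormal keys, recall is exact, as the decomposition above predicts.

```python
def outer(y, x):
    """Outer product y x^T as a list-of-rows matrix."""
    return [[yi * xj for xj in x] for yi in y]

def mat_add(a, b):
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]

def mat_vec(m, x):
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in m]

keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]    # orthonormal key vectors x_k
mems = [[0.5, -0.5, 1.0], [2.0, 0.0, -1.0]]  # arbitrary memorized vectors y_k

M = [[0.0] * 3 for _ in range(3)]
for x, y in zip(keys, mems):
    M = mat_add(M, outer(y, x))              # M += y_k x_k^T

print(mat_vec(M, keys[0]))   # recalls mems[0] exactly, since keys are orthonormal
```

If the keys were merely normalized but not orthogonal, the recalled vector would be the stored one plus the cross-talk (noise) term from the decomposition.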
Correlation Matrix Memory
• We can also ask how many items can be stored in M̂, i.e., its capacity.
• The capacity is closely related to the rank of the matrix M̂.
• The rank is the number of linearly independent column vectors (or row vectors) in the matrix.
Adaptation
• When the environment is stationary (its statistical characteristics do not change over time), supervised learning can be used to obtain a relatively stable set of parameters.
• If the environment is nonstationary, the parameters need to be adapted over time, on an ongoing basis (continuous learning or learning-on-the-fly).
• If the signal is locally stationary (pseudostationary), then the parameters can repeatedly be retrained on a small window of samples, assuming these are stationary: continual training with time-ordered samples.
Probabilistic and Statistical Aspects of the Learning Process
• We do not have knowledge of the exact functional relationship between X and D, so we posit
• D = f(X) + ε, a regressive model
• The mean value of the expectational error ε, given any realization of X, is zero: E[ε | X = x] = 0.
• The expectational error ε is uncorrelated with the regression function f(X).
Probabilistic and Statistical Aspects of the Learning Process
• Bias/Variance Dilemma
• L_av(f(x), F(x, T)) = B²(w) + V(w)
• B(w) = E_T[F(x, T)] − E[D | X = x] (an approximation error)
• V(w) = E_T[(F(x, T) − E_T[F(x, T)])²] (an estimation error)
• NN: small bias and large variance
• Introducing bias can reduce variance
• See other sources
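The decomposition above can be illustrated with a Monte-Carlo sketch. Everything here is a toy setup: the true value f(x) = 2, unit Gaussian noise, an estimator F(x, T) that is just the mean of n = 5 noisy observations, and 20,000 simulated training sets T.

```python
import random

random.seed(0)  # reproducible toy run

f_x = 2.0                                   # true regression value E[D | X = x]

def train_and_predict(n=5):
    """F(x, T): here simply the mean of n noisy observations D = f(x) + eps."""
    return sum(f_x + random.gauss(0.0, 1.0) for _ in range(n)) / n

estimates = [train_and_predict() for _ in range(20000)]   # many training sets T
mean_F = sum(estimates) / len(estimates)
bias_sq = (mean_F - f_x) ** 2                             # B^2(w)
variance = sum((e - mean_F) ** 2 for e in estimates) / len(estimates)  # V(w)
print(bias_sq, variance)   # bias ~ 0; variance ~ 1/n = 0.2 for this estimator
```

The sample mean is unbiased, so essentially all of its average loss is variance; a deliberately biased estimator (e.g. shrinking toward a constant) would trade some bias for lower variance, which is the dilemma the slide names.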