Learning Processes
Introduction
• Property of primary significance in a NN: the ability to learn from its environment, and to improve its performance through learning.
• Learning takes place through an iterative adjustment of synaptic weights.
• Learning: hard to define.
• One definition, by Mendel and McClaren: "Learning is a process by which the free parameters of a neural network are adapted through a process of stimulation by the environment in which the network is embedded. The type of learning is determined by the manner in which the parameter changes take place."
Learning
• Sequence of events in NN learning:
• The NN is stimulated by the environment.
• The NN undergoes changes in its free parameters as a result of this stimulation.
• The NN responds in a new way to the environment because of the changes that have occurred in its internal structure.
• A prescribed set of well-defined rules for the solution of the learning problem is called a learning algorithm.
• The manner in which a NN relates to its environment dictates the learning paradigm, which refers to a model of the environment operated on by the NN.
Overview
• Five basic learning rules: error correction, Hebbian, memory-based, competitive, and Boltzmann
• Learning paradigms: credit assignment problem, supervised learning, unsupervised learning
• Learning tasks, memory, and adaptation
• Bias-variance tradeoff
Error Correction Learning
• Input x(n), output y_k(n), and desired response or target output d_k(n).
• Error signal: e_k(n) = d_k(n) − y_k(n)
• e_k(n) actuates a control mechanism that gradually adjusts the synaptic weights, to minimize the cost function (or index of performance):
E(n) = (1/2) e_k²(n)
• When the synaptic weights reach a steady state, learning is stopped.
Error Correction Learning: Delta Rule
• Widrow-Hoff rule, with learning rate η:
Δw_kj(n) = η e_k(n) x_j(n)
• With that, we can update the weights:
w_kj(n+1) = w_kj(n) + Δw_kj(n)
• Note: partial derivatives ∂E/∂w_kj give the gradient.
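The delta rule above can be sketched in a few lines of code. This is a minimal illustration, not a full trainer: a single linear neuron, a single made-up training pair x = [1, 2] with target d = 3, and an assumed learning rate eta = 0.1.

```python
def delta_rule_step(w, x, d, eta=0.1):
    """One Widrow-Hoff update: w_j <- w_j + eta * e * x_j, with e = d - y."""
    y = sum(wj * xj for wj, xj in zip(w, x))   # linear neuron output
    e = d - y                                   # error signal e = d - y
    return [wj + eta * e * xj for wj, xj in zip(w, x)]

w = [0.0, 0.0]
for _ in range(100):                            # iterate toward a steady state
    w = delta_rule_step(w, [1.0, 2.0], d=3.0)
print(w)   # weights converge so that the output w.x approaches d = 3
```

With this single pair, each step multiplies the error by (1 − η‖x‖²) = 0.5, so the weights quickly reach a steady state, matching the stopping criterion on the previous slide.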
Memory-Based Learning or Lazy Learning
• All (or most) past experiences are explicitly stored, as input-target pairs {(x_i, d_i)}, i = 1, …, N.
• Given a new input x_test, determine its class based on the local neighborhood of x_test.
• Criterion used for determining the neighborhood.
• Learning rule applied to the neighborhood of the input, within the set of training examples.
• Examples: CMAC, case-based learning.
Memory-Based Learning: Nearest Neighbor
• A set of instances observed so far: {x_1, x_2, …, x_N}.
• Nearest neighbor of x_test: the stored x_i that minimizes d(x_i, x_test), where d(·,·) is the Euclidean distance. x_test is classified as the same class as that nearest neighbor.
• Cover and Hart (1967): the error of the nearest-neighbor rule is at most twice the optimal error (the Bayes probability of error), given that:
• The classified examples are independently and identically distributed.
• The sample size N is infinitely large.
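The nearest-neighbor rule described above is a one-function algorithm. Here is a minimal sketch; the stored points and labels are toy values invented for illustration.

```python
import math

def nearest_neighbor(stored, x_test):
    """Classify x_test by the label of its nearest stored example."""
    def euclid(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    # min over stored (vector, label) pairs by Euclidean distance to x_test
    _, label = min(stored, key=lambda pair: euclid(pair[0], x_test))
    return label

# toy training set of (vector, label) pairs
stored = [([0.0, 0.0], "A"), ([0.1, 0.2], "A"), ([1.0, 1.0], "B")]
print(nearest_neighbor(stored, [0.9, 0.8]))   # nearest stored point is class B
```

Note that all "learning" is just storage; computation is deferred to query time, which is why this family is called lazy learning.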
Hebbian Synapses
• Donald Hebb's postulate of learning appeared in his book The Organization of Behavior (1949).
• "When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic changes take place in one or both cells such that A's efficiency, as one of the cells firing B, is increased."
• Hebbian synapse:
• If two neurons on either side of a synapse are activated simultaneously, the synapse is strengthened.
• If they are activated asynchronously, the synapse is weakened or eliminated. (This part was not mentioned in Hebb.)
Hebbian Learning
• Hebb’s Law provides the basis for learning without a teacher. Learning here is a local phenomenon occurring without feedback from the environment.
[Figure: a layer of input signals feeding a layer of output signals, with the synapse between neurons i and j highlighted]
Hebbian Learning
• Using Hebb's Law we can express the adjustment applied to the weight w_kj at iteration n:
Δw_kj(n) = F(y_k(n), x_j(n))
• As a special case, we can represent Hebb's Law as follows (activity product rule):
Δw_kj(n) = η y_k(n) x_j(n),
where η is the learning rate parameter.
• A nonlinear forgetting factor can be introduced into Hebb's Law:
Δw_kj(n) = η y_k(n) x_j(n) − α y_k(n) w_kj(n),
where α is the (usually small, positive) forgetting factor.
Hebbian Learning
• Hebbian: time-dependent, highly local, heavily interactive.
• Positively correlated activity: strengthen. Negatively correlated: weaken.
• Strong evidence for Hebbian plasticity in the hippocampus (brain region).
• Correlation vs. covariance rule
• Covariance rule (Sejnowski, 1977): Δw_kj = η (y_k − ȳ)(x_j − x̄)
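The activity product rule with a forgetting factor can be sketched as follows. The learning rate η = 0.1, forgetting factor α = 0.05, and the constant input x = [1, 1] are toy values; the `or 1.0` bootstrap for the very first output is purely an illustration device so the loop can start from zero weights.

```python
def hebb_update(w, x, y, eta=0.1, alpha=0.05):
    """Activity product rule with forgetting: dw_j = eta*y*x_j - alpha*y*w_j."""
    return [wj + eta * y * xj - alpha * y * wj for wj, xj in zip(w, x)]

w = [0.0, 0.0]
x = [1.0, 1.0]
for _ in range(50):                                   # repeated co-activation
    y = sum(wj * xj for wj, xj in zip(w, x)) or 1.0   # bootstrap first output
    w = hebb_update(w, x, y)
print(w)   # w grows toward (eta/alpha) * x = [2, 2]
```

Without the forgetting term the weights would grow without bound under repeated co-activation; the −α y w term caps them at (η/α)·x, which is the point of introducing it.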
Competitive Learning
• In competitive learning, neurons compete among themselves to be activated.
• While in Hebbian learning several output neurons can be activated simultaneously, in competitive learning only a single output neuron is active at any time.
• The output neuron that wins the "competition" is called the winner-takes-all neuron.
• The basic idea of competitive learning was introduced in the early 1970s.
• In the late 1980s, Teuvo Kohonen introduced a special class of artificial neural networks called self-organizing feature maps. These maps are based on competitive learning.
Competitive Learning
• Highly suited to discovering statistically salient features (that may aid in classification).
• Three basic elements:
• The same type of neurons with different weight sets, so that they respond differently to a given set of inputs.
• A limit imposed on the strength of each neuron.
• A competition mechanism, to choose one winner: the winner-takes-all neuron.
Competitive Learning
• Inputs and weights can be seen as vectors: x = [x_1, x_2, …, x_m]ᵀ and w_k = [w_k1, w_k2, …, w_km]ᵀ. Note that the weight vector belongs to a certain output neuron k, hence the index.
• Single layer; feedforward excitatory and lateral inhibitory connections.
• Winner selection: y_k = 1 if v_k > v_j for all j, j ≠ k; otherwise y_k = 0.
• Limit: Σ_j w_kj = 1 for all k.
• Adaptation: Δw_kj = η (x_j − w_kj) if neuron k wins, and 0 if it loses.
• w_k = [w_k1, w_k2, …, w_km]ᵀ is moved toward the input vector.
Competitive Learning
• Weight vectors converge toward local input clusters: clustering.
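The winner-takes-all adaptation rule above can be sketched directly. The two initial weight vectors and the two toy input clusters are invented for illustration; η = 0.2 is an assumed learning rate.

```python
def competitive_step(weights, x, eta=0.2):
    """Move only the winning neuron's weight vector toward the input x."""
    # winner = neuron whose weight vector is closest to the input
    win = min(range(len(weights)),
              key=lambda k: sum((xi - wi) ** 2
                                for xi, wi in zip(x, weights[k])))
    weights[win] = [wi + eta * (xi - wi) for wi, xi in zip(weights[win], x)]
    return win

weights = [[0.2, 0.1], [0.8, 0.9]]      # initial weight vectors, one per neuron
data = [[0.0, 0.0], [1.0, 1.0]] * 50    # two obvious toy clusters
for x in data:
    competitive_step(weights, x)
print(weights)   # each weight vector moves toward one cluster centre
```

This is exactly the clustering behavior the slide describes: each weight vector converges toward the centre of the input cluster it keeps winning.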
Boltzmann Learning
• Stochastic learning algorithm rooted in statistical mechanics.
• Recurrent network, binary neurons (on: '+1', off: '−1').
• Energy function:
E = −(1/2) Σ_k Σ_{j, j≠k} w_kj x_k x_j
• Activation:
• Choose a random neuron k.
• Flip its state with a probability (given temperature T):
P(x_k → −x_k) = 1 / (1 + exp(−ΔE_k / T)),
where ΔE_k is the change in E due to the flip.
Boltzmann Machine
• Two types of neurons:
• Visible neurons: can be affected by the environment.
• Hidden neurons: isolated.
• Two modes of operation:
• Clamped: visible neuron states are fixed by environmental input and held constant.
• Free-running: all neurons are allowed to update their activity freely.
Boltzmann Machine: Learning
• Correlation of activity during the clamped condition: ρ⁺_kj
• Correlation of activity during the free-running condition: ρ⁻_kj
• Weight update: Δw_kj = η (ρ⁺_kj − ρ⁻_kj)
• Train the weights with various clamping input patterns.
• After training is completed, present a new clamping input pattern that is a partial input of one of the known vectors.
• Let it run clamped on the new input (a subset of the visible neurons), and eventually it will complete the pattern (pattern completion).
• Note that both ρ⁺_kj and ρ⁻_kj range in value from −1 to +1.
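The stochastic activation rule can be sketched for a toy two-neuron network. The weights, temperature, and iteration count are invented for illustration, and the sign convention assumed here is that a flip which lowers the energy is accepted with high probability.

```python
import math, random

random.seed(0)  # reproducible toy run

def energy(w, x):
    """E = -(1/2) * sum over j != k of w[k][j] * x[k] * x[j]."""
    n = len(x)
    return -0.5 * sum(w[k][j] * x[k] * x[j]
                      for k in range(n) for j in range(n) if j != k)

def boltzmann_step(w, x, T):
    """Pick a random neuron; flip it with probability 1/(1 + exp(-dE/T))."""
    k = random.randrange(len(x))
    flipped = list(x)
    flipped[k] = -flipped[k]
    dE = energy(w, x) - energy(w, flipped)   # energy decrease caused by flip
    if random.random() < 1.0 / (1.0 + math.exp(-dE / T)):
        return flipped
    return x

w = [[0.0, 1.0], [1.0, 0.0]]   # symmetric weights, zero self-connections
x = [1, -1]
for _ in range(200):
    x = boltzmann_step(w, x, T=0.5)
print(x)   # positive coupling favors the aligned states (+1,+1) or (-1,-1)
```

At low temperature the network spends most of its time in low-energy states, which is what makes the clamped vs. free-running correlations ρ⁺_kj and ρ⁻_kj meaningful quantities to compare.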
Learning Paradigms
• How neural networks relate to their environment:
• credit assignment problem
• learning with a teacher
• learning without a teacher
Credit Assignment Problem
• How to assign credit or blame for the overall outcome to the individual decisions made by the learning machine.
• In many cases, the outcomes depend on a sequence of actions.
• Assignment of credit for outcomes of actions (temporal credit-assignment problem): when does a particular action deserve credit?
• Assignment of credit for actions to internal decisions (structural credit-assignment problem): assign credit to the internal structures of actions generated by the system.
• The credit-assignment problem routinely arises in neural network learning: which neuron, which connection to credit or blame?
Learning Paradigms
• Supervised learning
• The teacher has knowledge, represented as input-output examples.
• The environment is unknown to the NN.
• The NN tries to emulate the teacher gradually.
• Error-correction learning is one way to achieve this.
• Error surface, gradient, steepest descent, etc.
Learning Paradigms
• Learning without a teacher: no labeled examples of the function to be learned are available.
1) Reinforcement learning 2) Unsupervised learning
Learning Paradigms
1) Reinforcement learning: The learning of an input-output mapping is performed through continued interaction with the environment in order to minimize a scalar index of performance.
Learning Paradigms
• Delayed reinforcement, which means that the system observes a temporal sequence of stimuli.
• Difficult to perform for two reasons:
• There is no teacher to provide a desired response at each step of the learning process.
• The delay incurred in the generation of the primary reinforcement signal implies that the machine must solve a temporal credit-assignment problem.
• Reinforcement learning is closely related to dynamic programming.
Learning Paradigms
• Unsupervised learning: there is no external teacher or critic to oversee the learning process.
• Provision is made for a task-independent measure of the quality of the representation that the network is required to learn.
• Internal representations for encoding features of the input space.
• A competitive learning rule is needed, such as winner-takes-all.
The Issues of Learning Tasks: Pattern Association
• An associative memory: a brain-like distributed memory that learns by association.
• Autoassociation: a neural network is required to store a set of patterns by repeatedly presenting them to the network. The network is then presented a partial description of an original pattern stored in it, and the task is to retrieve that particular pattern.
• Heteroassociation: it differs from autoassociation in that an arbitrary set of input patterns is paired with another arbitrary set of output patterns.
The Issues of Learning Tasks
• Let x_k denote a key pattern and y_k denote a memorized pattern. The pattern association is described by
x_k → y_k, k = 1, 2, …, q.
• In an autoassociative memory, y_k = x_k.
• In a heteroassociative memory, y_k ≠ x_k.
• Storage phase vs. recall phase.
• q (the number of stored patterns) is a direct measure of the storage capacity.
The Issues of Learning Tasks: Classification
• Pattern recognition: the process whereby a received pattern/signal is assigned to one of a prescribed number of classes.
• Two general types:
• Feature extraction (observation space to feature space; cf. dimensionality reduction), then classification (feature space to decision space).
• Single step (observation space to decision space).
The Issues of Learning Tasks
• Function approximation: consider a nonlinear input-output mapping
𝒅 = 𝒇(𝒙)
The vector 𝒙 is the input and the vector 𝒅 is the output. The function 𝒇(·) is assumed to be unknown. The requirement is to design a neural network F(·) that approximates the unknown function:
‖F(𝒙) − 𝒇(𝒙)‖ < ε for all 𝒙
• System identification
• Inverse system
The Issues of Learning Tasks
• Control: the controller has to invert the plant's input-output behavior.
• Feedback controller: adjust the plant input u so that the output of the plant y tracks the reference signal d. Learning takes the form of free-parameter adjustment in the controller.
The Issues of Learning Tasks
• Filtering: estimate a quantity at time n, based on measurements up to and including time n.
• Smoothing: estimate a quantity at time n − λ, based on measurements up to time n (λ > 0).
• Prediction: estimate a quantity at time n + λ, based on measurements up to time n (λ > 0).
The Issues of Learning Tasks
• Blind source separation: recover the source signal u(n) from the distorted (observed) signal
x(n) = A u(n)
when the mixing matrix A is unknown.
• Nonlinear prediction: given past samples x(n − T), x(n − 2T), …, estimate x̂(n) (x̂ is the estimated value).
The Issues of Learning Tasks: Associative Memory
• q pattern pairs: (x_k, y_k) for k = 1, 2, …, q.
• Input (key vector): x_k = [x_k1, x_k2, …, x_km]ᵀ
• Output (memorized vector): y_k = [y_k1, y_k2, …, y_km]ᵀ
• The weights can be represented as a weight matrix W(k) (see next slide):
y_k = W(k) x_k
Associative Memory
• y_k = W(k) x_k
• With a single W(k), we can only represent one mapping x_k → y_k. For all q pairs (k = 1, 2, …, q), we need q such weight matrices:
y_k = W(k) x_k for k = 1, 2, …, q
• One strategy is to combine all W(k) into a single memory matrix M by simple summation:
M = Σ_{k=1}^{q} W(k)
• Will such a simple strategy work? That is, can the following be possible with M?
y_k = M x_k for k = 1, 2, …, q
Correlation Matrix Memory
• With a fixed set of key vectors e_k, a matrix M can store q arbitrary output vectors y_k.
• Let e_k = [0, …, 0, 1, 0, …, 0]ᵀ, where only the k-th element is 1 and all the rest are 0.
• Construct a memory matrix with each column representing one of the arbitrary output vectors y_k:
M = [y_1, y_2, …, y_q]
• Then, M e_k = y_k for k = 1, 2, …, q.
• But we want the key vectors to be arbitrary too!
Correlation Matrix Memory
• With q pairs (x_k, y_k) for k = 1, 2, …, q, we can construct a candidate memory matrix M̂ that stores all mappings as:
M̂ = y_1 x_1ᵀ + y_2 x_2ᵀ + … + y_q x_qᵀ = Σ_{k=1}^{q} y_k x_kᵀ
• This can be verified easily using partitioned matrices.
Correlation Matrix Memory
• First, consider M̂ = y_1 x_1ᵀ only.
• Check if M̂ x_1 = y_1:
M̂ x_1 = y_1 x_1ᵀ x_1 = y_1 (x_1ᵀ x_1) = c y_1,
where c = x_1ᵀ x_1 is a scalar value (the length of the vector x_1, squared). If all x_k were normalized to have length 1, M̂ x_1 = y_1 will hold!
• Now, back to M̂ = Σ_{k=1}^{q} y_k x_kᵀ. Under what condition will M̂ x_j give y_j for all j? Let's begin by assuming x_kᵀ x_k = 1 (key vectors are normalized).
• We can decompose M̂ x_j as follows:
M̂ x_j = y_j (x_jᵀ x_j) + Σ_{k≠j} y_k (x_kᵀ x_j) = y_j + Σ_{k≠j} y_k (x_kᵀ x_j)
• If all keys are orthogonal, the last (noise) term vanishes, so M̂ x_j = y_j.
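The outer-product construction is easy to check numerically. This sketch uses two orthonormal toy key vectors and two arbitrary memorized vectors, all invented for illustration; with orthonormal keys, recall is exact, as the decomposition above predicts.

```python
def outer(y, x):
    """Outer product y x^T as a list-of-rows matrix."""
    return [[yi * xj for xj in x] for yi in y]

def mat_add(a, b):
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]

def mat_vec(m, x):
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in m]

keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]    # orthonormal key vectors x_k
mems = [[0.5, -0.5, 1.0], [2.0, 0.0, -1.0]]  # arbitrary memorized vectors y_k

M = [[0.0] * 3 for _ in range(3)]
for x, y in zip(keys, mems):
    M = mat_add(M, outer(y, x))              # M += y_k x_k^T

print(mat_vec(M, keys[0]))   # recalls mems[0] exactly, since keys are orthonormal
```

If the keys were merely normalized but not orthogonal, the recalled vector would be the stored one plus the cross-talk (noise) term from the decomposition.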
Correlation Matrix Memory
• We can also ask how many items can be stored in M̂, i.e., its capacity.
• The capacity is closely related to the rank of the matrix M̂.
• The rank is the number of linearly independent column vectors (or row vectors) in the matrix.
Adaptation
• When the environment is stationary (its statistical characteristics do not change over time), supervised learning can be used to obtain a relatively stable set of parameters.
• If the environment is nonstationary, the parameters need to be adapted over time, on an ongoing basis (continuous learning or learning-on-the-fly).
• If the signal is locally stationary (pseudostationary), then the parameters can repeatedly be retrained on a small window of samples, assuming these are stationary: continual training with time-ordered samples.
Probabilistic and Statistical Aspects of the Learning Process
• We do not have knowledge of the exact functional relationship between X and D, so we posit
• D = f(X) + ε, a regressive model
• The mean value of the expectational error ε, given any realization of X, is zero: E[ε | X = x] = 0.
• The expectational error ε is uncorrelated with the regression function f(X).
Probabilistic and Statistical Aspects of the Learning Process
• Bias/Variance Dilemma
• L_av(f(x), F(x, T)) = B²(w) + V(w)
• B(w) = E_T[F(x, T)] − E[D | X = x] (an approximation error)
• V(w) = E_T[(F(x, T) − E_T[F(x, T)])²] (an estimation error)
• NN: small bias and large variance
• Introducing bias can reduce variance
• See other sources
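The decomposition above can be illustrated with a Monte-Carlo sketch. Everything here is a toy setup: the true value f(x) = 2, unit Gaussian noise, an estimator F(x, T) that is just the mean of n = 5 noisy observations, and 20,000 simulated training sets T.

```python
import random

random.seed(0)  # reproducible toy run

f_x = 2.0                                   # true regression value E[D | X = x]

def train_and_predict(n=5):
    """F(x, T): here simply the mean of n noisy observations D = f(x) + eps."""
    return sum(f_x + random.gauss(0.0, 1.0) for _ in range(n)) / n

estimates = [train_and_predict() for _ in range(20000)]   # many training sets T
mean_F = sum(estimates) / len(estimates)
bias_sq = (mean_F - f_x) ** 2                             # B^2(w)
variance = sum((e - mean_F) ** 2 for e in estimates) / len(estimates)  # V(w)
print(bias_sq, variance)   # bias ~ 0; variance ~ 1/n = 0.2 for this estimator
```

The sample mean is unbiased, so essentially all of its average loss is variance; a deliberately biased estimator (e.g. shrinking toward a constant) would trade some bias for lower variance, which is the dilemma the slide names.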