
10. Supervised learning and rewards systems

Lecture Notes on Brain and Computation

Byoung-Tak Zhang

Biointelligence Laboratory

School of Computer Science and Engineering

Graduate Programs in Cognitive Science, Brain Science and Bioinformatics

Brain-Mind-Behavior Concentration Program

Seoul National University

E-mail: [email protected]

This material is available online at http://bi.snu.ac.kr/

Fundamentals of Computational Neuroscience, T. P. Trappenberg, 2002.


Outline


10.1 Motor learning and control
10.2 The delta rule
10.3 Generalized delta rules
10.4 Reward learning


10.1 Motor learning and control

Learning acts on a large number of training data without the intention of storing all the specific examples

The learning of motor skills, motor control
- Important for the survival of a species
- Ex) Catching a ball, playing the piano, etc.

The brain must be able to direct the control system
- Visual guidance
- Arm movements with visual signals

Commonly able to adapt to a changed environment within only a few additional trials


10.1.1 Feedback controller

How limb movements could be controlled by the nervous system

Feedback control

How to find and implement an appropriate and accurate motor command generator


Fig. 10.1 Negative feedback control and the elements of a standard control system.
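As a minimal illustration of the negative feedback loop in Fig. 10.1, the following Python sketch closes the loop between a desired state and a sensed state; the first-order plant dynamics, the gain value, and all variable names are illustrative assumptions rather than part of the original notes:

def feedback_control(desired, steps=50, gain=0.5, dt=0.1):
    # Proportional negative feedback control of a simple first-order plant.
    state = 0.0                      # current (sensed) state of the controlled object
    for _ in range(steps):
        error = desired - state      # comparator: desired state minus sensory feedback
        command = gain * error       # motor command generator: proportional to the error
        state += dt * command        # the controlled object integrates the motor command
    return state

print(feedback_control(desired=1.0))   # the state approaches the desired value 1.0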


10.1.2 Forward controller

Refined schemes for motor control with slow sensory feedback

Forward models
- Model the dynamics of the controlled object and the behavior of the sensory system


Fig. 10.2 Forward model controller


10.1.3 Inverse model controller

Refined schemes for motor control with slow sensory feedback

Inverse model controller
- Incorporated as a side-loop to the standard feedback controller
- Learns to correct the computation of the motor command generator


Fig. 10.3 Inverse model controller


10.1.4 The cerebellum and motor control

Adaptive controllers are realized in the brain and are vital for our survival


Fig. 10.4 Schematic illustration of some connectivity patterns in the cerebellum. Note that the output of the cerebellum is provided by Purkinje neurons that make inhibitory synapses. Climbing fibers are specific for each Purkinje neuron and are tightly interwoven with its dendritic tree.


10.2 The delta rule

Forward and inverse models can be implemented by feed-forward mapping networks

How such mapping networks can be trained
- To minimize the mean difference between the output of a feed-forward mapping network and a desired state provided by a teacher
- Objective function or cost function: measures the distance between the actual output and the desired output, E
- The mean square error (MSE)
- r_i^out is the actual output
- y_i is the desired output


E = \frac{1}{2} \sum_i (r_i^{\mathrm{out}} - y_i)^2    (10.1)


10.2.1 Gradient descent

Minimize the error function of a single-layer mapping network
- By changing the weight values
- k, the learning rate

The gradient of the error function

Delta rule


Chain rule:

\frac{\partial f(g(x))}{\partial x} = \frac{\partial f}{\partial g} \frac{\partial g}{\partial x}    (10.2)

Gradient of the error function:

\frac{\partial E}{\partial w_{ij}} = \frac{\partial}{\partial w_{ij}} \frac{1}{2} \sum_i \Big( g\big(\sum_j w_{ij} r_j^{\mathrm{in}}\big) - y_i \Big)^2 = g'(h_i) \Big( g\big(\sum_j w_{ij} r_j^{\mathrm{in}}\big) - y_i \Big) r_j^{\mathrm{in}}    (10.3)

Gradient descent with learning rate k:

\Delta w_{ij} = -k \frac{\mathrm{d}E}{\mathrm{d}w_{ij}}    (10.4)

Delta rule:

\Delta w_{ij} = k (y_i - r_i^{\mathrm{out}}) r_j^{\mathrm{in}}    (10.5)

Fig. 10.5 Illustration of error minimization with a gradient descent method on a one-dimensional error surface E(w).
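The gradient descent step of Eq. (10.4) can be illustrated with a short Python sketch that minimizes a one-dimensional error surface E(w), as in Fig. 10.5; the particular quadratic surface, starting point, and learning rate are assumptions chosen only for illustration:

def minimize_1d(w=0.0, k=0.1, steps=100):
    dE = lambda w: 2.0 * (w - 2.0)   # gradient of an assumed error surface E(w) = (w - 2)^2
    for _ in range(steps):
        w -= k * dE(w)               # weight change Δw = -k dE/dw, Eq. (10.4)
    return w

print(minimize_1d())                 # converges towards the minimum of E(w) at w = 2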


10.2.2 Batch versus online algorithm

Batch algorithm versus Online learning algorithm


Table 10.1 Summary of delta-rule algorithm
1. Initialize weights w_ij to random values
2. Apply a sample pattern to the input nodes: r_i^0 = r_i^in
3. Calculate the rates of the output nodes: r_i^out = g(h_i) = g(\sum_j w_ij r_j^in)
4. Compute the delta term for the output layer: \delta_i = g'(h_i)(y_i - r_i^out)
5. Update the weight matrix by adding the term: \Delta w_ij = k \delta_i r_j^in
6. Repeat steps 2-5 until the error is sufficiently small
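A minimal Python sketch of the online delta rule summarized in Table 10.1; the toy data set, the logistic gain function, the learning rate, and the number of epochs are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
g = lambda h: 1.0 / (1.0 + np.exp(-h))      # gain function of the output node

x = rng.normal(size=(100, 3))               # input patterns r^in
y = g(x @ np.array([1.0, -2.0, 0.5]))       # desired outputs supplied by a "teacher"

w = 0.1 * rng.normal(size=3)                # 1. initialize weights to random values
k = 0.5                                     # learning rate
for epoch in range(200):                    # 6. repeat until the error is sufficiently small
    for r_in, y_i in zip(x, y):             # 2. apply a sample pattern to the input nodes
        r_out = g(w @ r_in)                 # 3. rate of the output node
        delta = y_i - r_out                 # 4. delta term (g' omitted: plain delta rule, Eq. (10.5))
        w += k * delta * r_in               # 5. update the weights by adding Δw = k δ r^in

print(0.5 * np.mean((g(x @ w) - y) ** 2))   # remaining mean square error, Eq. (10.1)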


10.2.3 Supervised learning

The delta learning rule depends on knowledge of the desired output

Supervised learning
- Supplies the network with the desired response
- The training signal

The climbing fiber in the cerebellum could very well supply such an error signal to the Purkinje cells
- The weight changes still take the form of a correlation rule between an error factor and the presynaptic activity

The biological mechanisms underlying synaptic plasticity

Unsupervised learning
- Hebbian learning


10.2.4 Supervised learning in multilayer networks

Generalize the delta rule to multilayer mapping networks
- The error-back-propagation algorithm or generalized delta rule
- The application of multilayer feed-forward mapping networks (multilayer perceptrons)

Difficulties in connecting the computational steps with brain processes
- Strongly restricted number of hidden nodes to achieve good generalization

There might not be the need in the brain to train multilayer mapping networks with supervised learning algorithms such as the generalized delta rule
- Single-layer networks can represent complicated functions
- Expansion recoding


10.3 Generalized delta rules (1)

The gradient of the MSE error function with respect to the output weights

The delta factor

The calculation of the gradients with respect to the weights to the hidden layer

The derivative of the output layer

The delta term of the hidden layer


\frac{\partial E}{\partial w_{ij}^{\mathrm{out}}} = \frac{\partial}{\partial w_{ij}^{\mathrm{out}}} \frac{1}{2} \sum_i (r_i^{\mathrm{out}} - y_i)^2 = g'(h_i^{\mathrm{out}}) \Big( g\big(\sum_j w_{ij}^{\mathrm{out}} r_j^{\mathrm{h}}\big) - y_i \Big) r_j^{\mathrm{h}} = \delta_i^{\mathrm{out}} r_j^{\mathrm{h}}    (10.6)

\delta_i^{\mathrm{out}} = g'(h_i^{\mathrm{out}}) \Big( g\big(\sum_j w_{ij}^{\mathrm{out}} r_j^{\mathrm{h}}\big) - y_i \Big) = g'(h_i^{\mathrm{out}}) (r_i^{\mathrm{out}} - y_i)    (10.7)

\frac{\partial E}{\partial w_{ij}^{\mathrm{h}}} = \frac{\partial}{\partial w_{ij}^{\mathrm{h}}} \frac{1}{2} \sum_i (r_i^{\mathrm{out}} - y_i)^2 = \frac{\partial}{\partial w_{ij}^{\mathrm{h}}} \frac{1}{2} \sum_i \Big( g^{\mathrm{out}}\big(\sum_j w_{ij}^{\mathrm{out}} g^{\mathrm{h}}(\sum_k w_{jk}^{\mathrm{h}} r_k^{\mathrm{in}})\big) - y_i \Big)^2    (10.8)

\delta_i^{\mathrm{h}} = g'(h_i^{\mathrm{h}}) \sum_k w_{ki}^{\mathrm{out}} \delta_k^{\mathrm{out}}    (10.9)

\frac{\partial E}{\partial w_{ij}^{\mathrm{h}}} = \delta_i^{\mathrm{h}} r_j^{\mathrm{in}}    (10.10)


10.3 Generalized delta rules (2)


Table 10.2 Summary of error-back-propagation algorithm
1. Initialize weights w_ij^l to random values
2. Apply a sample pattern to the input nodes: r_i^0 = r_i^in
3. Propagate the input through the network by calculating the rates of nodes in successive layers l: r_i^l = g(h_i^l) = g(\sum_j w_ij^l r_j^{l-1})
4. Compute the delta term for the output layer: \delta_i^out = g'(h_i^out)(y_i - r_i^out)
5. Back-propagate the delta terms through the network: \delta_i^{l-1} = g'(h_i^{l-1}) \sum_j w_ji^l \delta_j^l
6. Update the weight matrix by adding the term: \Delta w_ij^l = k \delta_i^l r_j^{l-1}
7. Repeat steps 2-6 until the error is sufficiently small
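The following Python sketch implements the error-back-propagation steps of Table 10.2 for a network with one hidden layer; the network size, the tanh gain function, the toy regression task, and the learning rate are illustrative assumptions:

import numpy as np

g = np.tanh                                       # gain function
dg = lambda h: 1.0 - np.tanh(h) ** 2              # its derivative g'

rng = np.random.default_rng(1)
x = rng.normal(size=(200, 2))                     # input patterns
y = np.sin(x[:, :1])                              # desired outputs (toy teacher signal)

W_h = 0.5 * rng.normal(size=(4, 2))               # 1. initialize hidden weights
W_out = 0.5 * rng.normal(size=(1, 4))             #    and output weights
k = 0.05                                          # learning rate
for epoch in range(500):                          # 7. repeat until the error is small
    for r_in, y_i in zip(x, y):                   # 2. apply a sample pattern
        h_h = W_h @ r_in                          # 3. propagate through the hidden layer
        r_h = g(h_h)
        h_out = W_out @ r_h                       #    ... and through the output layer
        r_out = g(h_out)
        d_out = dg(h_out) * (y_i - r_out)         # 4. delta term of the output layer
        d_h = dg(h_h) * (W_out.T @ d_out)         # 5. back-propagate the delta terms
        W_out += k * np.outer(d_out, r_h)         # 6. Δw^out = k δ^out r^h
        W_h += k * np.outer(d_h, r_in)            #    Δw^h   = k δ^h  r^in

print(0.5 * np.mean((g(W_out @ g(W_h @ x.T)) - y.T) ** 2))   # remaining mean square error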


10.3.1 Biological plausibility

The back-propagation of error signals is probably the most problematic feature in biological terms

The non-locality of the algorithm, in which a neuron has to gather the back-propagated error from all the other nodes to which it projects
- Synchronization issues
- Disadvantages for true parallel processing

The delta signals are also problematic

How a forward-propagating phase of signals can be separated effectively from the back-propagation phase of the error signals


10.3.2 Advanced algorithms

The basic error-back-propagation algorithm
- Convergence performance problem

The learning in the form of statistical learning theories

Improvements over the basic algorithm
- Initial conditions
- Different error functions
- Various acceleration techniques
- Hybrid methods

The limitation of the basic error-back-propagation algorithm

Alternative learning strategies


10.3.3 Momentum method and adaptive learning rate

The basic gradient descent method
- Typically a rapid initial phase of convergence
- Followed by a phase of very slow convergence
- A shallow part of the error function

Momentum term
- Remembers the changes of the weight in the previous time step

The momentum term has the effect of biasing the direction of the new update vector towards the previous direction

To increase the learning rate when the gradient becomes small


\Delta w_{ij}(t+1) = -k \frac{\partial E}{\partial w_{ij}} + \alpha \Delta w_{ij}(t)    (10.11)
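A short Python sketch of the momentum update of Eq. (10.11); the one-dimensional error surface, the learning rate k, and the momentum parameter α are illustrative assumptions:

def descend_with_momentum(w=0.0, k=0.05, alpha=0.9, steps=200):
    dE = lambda w: 2.0 * (w - 2.0)         # gradient of an assumed surface E(w) = (w - 2)^2
    dw_prev = 0.0
    for _ in range(steps):
        dw = -k * dE(w) + alpha * dw_prev  # Δw(t+1) = -k dE/dw + α Δw(t), Eq. (10.11)
        w += dw
        dw_prev = dw                       # remember the previous weight change
    return w

print(descend_with_momentum())             # approaches the minimum at w = 2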


10.3.4 Different error functions

Shallow areas in the error function depend on the particular choice of the error function

Entropic error function

A proper measure for the information content (or entropy) of the actual output of the multilayer perceptron given the knowledge of the correct output

It is not always obvious which error function should be used; a general strategy for choosing the error function can unfortunately not be given


E = \frac{1}{2} \sum_i \Big[ (1 + y_i) \log\frac{1 + y_i}{1 + r_i^{\mathrm{out}}} + (1 - y_i) \log\frac{1 - y_i}{1 - r_i^{\mathrm{out}}} \Big]    (10.12)
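A small Python sketch that evaluates the entropic error function of Eq. (10.12) for outputs and targets in the interval (-1, 1), as produced for example by tanh output nodes; the example values are illustrative assumptions:

import numpy as np

def entropic_error(y, r_out):
    y, r_out = np.asarray(y), np.asarray(r_out)
    return 0.5 * np.sum((1 + y) * np.log((1 + y) / (1 + r_out))
                        + (1 - y) * np.log((1 - y) / (1 - r_out)))

print(entropic_error(y=[0.9, -0.5], r_out=[0.7, -0.4]))   # small positive error
print(entropic_error(y=[0.9, -0.5], r_out=[0.9, -0.5]))   # zero when outputs match the targets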


10.3.5 High-order gradient methods

The basic line search algorithm of gradient descent is known for its poor performance with shallow error functions

The minimization of an error function

Many other advanced minimization techniques
- Take high-order gradient terms into account

Curvature terms
- The curvature of the error surface in the weight change calculations
- The calculation of the inverse of the Hessian matrix

Natural gradient algorithm

Levenberg-Marquardt method


10.3.6 Local minima and simulated annealing

A general limitation of pure gradient descent methods
- A local minimum of the error surface
- The system is not able to approach a global minimum of the error function

Solution
- Stochastic processes
- Simulated annealing

Add noise to the weight values


Fig. 10.5 Illustration of error minimization with a gradient descent method on a one-dimensional error surface E(w).
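A minimal Python sketch of this idea: gradient descent with noise added to the weight value, where the noise level is gradually reduced as in simulated annealing; the double-well error surface, the noise schedule, and the learning rate are illustrative assumptions:

import numpy as np

def anneal(w=1.0, k=0.05, T0=1.0, steps=2000, seed=0):
    rng = np.random.default_rng(seed)
    # gradient of E(w) = (w^2 - 1)^2 + 0.3 w, which has two minima; the global one is near w = -1
    dE = lambda w: 4.0 * w * (w ** 2 - 1) + 0.3
    for t in range(steps):
        T = T0 * (1 - t / steps)                              # "temperature" (noise level) decreases
        w += -k * dE(w) + np.sqrt(2 * k * T) * rng.normal()   # noisy weight update
    return w

print(anneal())   # often escapes the local minimum near w = +1 and settles near w = -1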


10.3.7 Hybrid methods

A variety of methods utilize the rapid initial convergence of the gradient descent method and combine it with global search strategies

After the gradient descent method slows down below an acceptable level, a new starting point is chosen randomly

Hybrid methods combine the efficient local optimization capabilities of the gradient descent method with the global search abilities of stochastic processes

Genetic algorithms use similar combinations of deterministic minimization and stochastic components


10.4 Reward learning
10.4.1 Classical conditioning and temporal credit assignment problem

Learning with reward signals
- Conditioning


Fig. 10.6 Classical conditioning and temporal credit assignment problem. A subject is required to associate the ringing of a bell with the pressing of a button that will open the door to a chamber with some food reward. In the example the subject has learned to press the left button after the ringing of the bell. This is an example of a temporal credit assignment problem. It is difficult to devise a system that is still open to possible other solutions such as a bigger reward hidden in the right chamber.


10.4.2 Stochastic escape

The experiment: another chamber (with a rodent)
- A larger food reward

Conditioned
- Chance to open the left door after the ringing of the bell

If the rodent always stuck to the initial conditioned situation it would never learn about the existence of the larger food reward

If the rodent is running around randomly in the button chamber before the bell rings, it could still happen that it presses the right button before running to the left button
- Opening the right door and obtaining the larger food reward

Changes the association of the auditory signal to a new motor action
- Stochastic escape that can balance habit versus novelty


10.4.3 Reinforcement models

The implementation of a system
- Learns from reward signals within neural architectures

The input to this node represents a certain input stimulus such as the ringing of the bell

The node gets activated under the right conditions and is therefore able to predict the future reward


Fig. 10.7 (A) Linear predictor node.

P(t) = \sum_i w_i(t) r_i^{\mathrm{in}}(t)    (10.13)


10.4.4 Temporal delta rule

A reward is given at time t + 1
- A scalar value r(t + 1)

A temporal version of the delta rule

Eligibility trace

The node calculates an effective reinforcement signal

Rescorla-Wagner theory

The model can produce one-step-ahead predictions of a reward signal


Fig. 10.7 (B) Neural implementation of the temporal delta rule.

w_i(t+1) = w_i(t) + \epsilon \big( r(t+1) - P(t) \big) r_i^{\mathrm{in}}(t)    (10.14)

\bar{r}(t+1) = r(t+1) - P(t)    (10.15)
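A minimal Python sketch of the temporal delta rule of Eqs. (10.13)-(10.15) in a Rescorla-Wagner-style conditioning setting; the stimulus protocol, the learning rate, and the number of trials are illustrative assumptions:

import numpy as np

eps = 0.2                      # learning rate
w = np.zeros(1)                # weight of a single input channel (the "bell")
for trial in range(50):
    r_in = np.array([1.0])     # stimulus present at time t
    P_t = w @ r_in             # prediction P(t) = Σ_i w_i(t) r_i^in(t), Eq. (10.13)
    reward = 1.0               # reward r(t+1) delivered one time step later
    delta = reward - P_t       # effective reinforcement signal r(t+1) - P(t), Eq. (10.15)
    w += eps * delta * r_in    # weight update, Eq. (10.14)

print(w)                       # approaches 1: the node predicts the upcoming reward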


10.4.5 Reward chain

Learning in the previous model is restricted to the prediction of reward in the next time step

The ability to predict future reward at different time steps or even whole series of reward

V(t): the reinforcement value, taking all the future rewards into account

α_i: allows us to specify the weights we give to the reward at different times

A simple realization of such a model
- 0 ≤ γ < 1, α_i = γ^{i-1}


V(t) = \alpha_1 r(t+1) + \alpha_2 r(t+2) + \alpha_3 r(t+3) + \dots    (10.16)

V(t) = r(t+1) + \gamma r(t+2) + \gamma^2 r(t+3) + \dots    (10.17)
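A short Python sketch of the reinforcement value of Eq. (10.17) as a discounted sum of future rewards; the reward sequence and the value of γ are illustrative assumptions:

def reinforcement_value(rewards, gamma=0.9):
    # V(t) = r(t+1) + γ r(t+2) + γ^2 r(t+3) + ...
    return sum(gamma ** i * r for i, r in enumerate(rewards))

print(reinforcement_value([0.0, 0.0, 1.0]))   # a reward at the third future step contributes 0.9^2 * 1 = 0.81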


10.4.6 Temporal difference learning

Temporal difference learning (advanced reinforcement learning)

Predict the reinforcement value at time t correctly

Predict the correct reinforcement value at previous time step

Minimize the temporal difference error


P(t) = r(t+1) + \gamma r(t+2) + \gamma^2 r(t+3) + \dots    (10.18)

P(t-1) = r(t) + \gamma r(t+1) + \gamma^2 r(t+2) + \dots = r(t) + \gamma \big[ r(t+1) + \gamma r(t+2) + \dots \big]    (10.19)

P(t-1) = r(t) + \gamma P(t)    (10.20)

\bar{r}(t) = r(t) + \gamma P(t) - P(t-1)    (10.21)

Fig. 10.7 (C) Neural implementation of temporal difference learning.
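A minimal Python sketch of temporal difference learning on a small chain of states, driven by the temporal difference error of Eq. (10.21); the chain task, the learning rate, and the number of episodes are illustrative assumptions:

import numpy as np

n_states, gamma, eps = 5, 0.9, 0.1
P = np.zeros(n_states)                               # predicted reinforcement value of each state
for episode in range(2000):
    s = 0
    while s < n_states - 1:
        s_next = s + 1                               # deterministic walk along the chain
        r = 1.0 if s_next == n_states - 1 else 0.0   # reward only when the end of the chain is reached
        td_error = r + gamma * P[s_next] - P[s]      # temporal difference error, Eq. (10.21)
        P[s] += eps * td_error                       # move P(t-1) towards r(t) + γ P(t)
        s = s_next

print(P)   # approaches [γ^3, γ^2, γ, 1, 0] for the states along the chain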


10.4.7 Adaptive critic controller

Temporal difference learning is a method of learning to predict future reward contingencies

Adaptive critic
- Designed to predict the correct motor command for accurate future actions
- Supervises the motor command generator

Fig. 10.8 Adaptive critic controller.


10.4.8 The basal ganglia in the actor-critic scheme


Fig. 10.9 (A) Anatomical overview of the connections within the basal ganglia and the major projections comprising the input and output of the basal ganglia. (B) Organization within the basal ganglia is composed of processing pathways within the striosomal and matrix modules, reflecting an architecture that could implement an actor-critic control scheme. C, cerebral cortex; F, frontal lobe; TH, thalamus; ST, subthalamic nucleus; PD, pallidus; SPm, spiny neurons in the matrix module; SPs, spiny neurons in the striosomal module; DA, dopaminergic neurons.


10.4.9 Other reward mechanisms in the brain

The proposed functional role of the basal ganglia
- Only one hypothesis mentioned in the literature

Several hypotheses
- The details of the biochemical nature of an eligibility trace
- Experimental verifications

The origin of reward learning in the brain is still not well understood

Involves some association of reward contingencies with specific motor actions in the brain
- Amygdala
- Orbitofrontal cortex
- Dopaminergic neurons


Conclusion

Motor learning
- Feedback, forward, and inverse model controllers

The delta rule
- Gradient descent
- Batch algorithm
- Online learning
- Supervised learning
- Generalized delta rule
- Acceleration of the delta rule

Reward learning
- Classical conditioning
- Reinforcement learning

Biological mechanisms of reward learning
