Statistical learning and optimal control: A framework for biological learning and motor control


Lecture 2: Models of biological learning and sensory-motor integration

Reza Shadmehr

Johns Hopkins School of Medicine

Various forms of classical conditioning in animal psychology

Table from Peter Dayan

Not explained by LMS, but predicted by the Kalman filter.

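$$k^{(n)} = \frac{P^{(n|n-1)} x^{(n)}}{x^{(n)T} P^{(n|n-1)} x^{(n)} + \sigma^2}$$

$$w^{(n|n)} = w^{(n|n-1)} + k^{(n)} \left( y^{(n)} - x^{(n)T} w^{(n|n-1)} \right)$$

$$P^{(n|n)} = \left( I - k^{(n)} x^{(n)T} \right) P^{(n|n-1)}$$

$$w^{(n+1|n)} = A w^{(n|n)}$$

$$P^{(n+1|n)} = A P^{(n|n)} A^T + Q$$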

Kalman filter as a model of animal learning

Suppose that x represents inputs from the environment: a light and a tone.

Suppose that y represents rewards, like a food pellet.

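$$w^{*(n)} = w^{*(n-1)} + \varepsilon_w^{(n)}, \quad \varepsilon_w \sim N(0, Q)$$

$$y^{(n)} = x_1^{(n)} w_1^* + x_2^{(n)} w_2^* + \varepsilon_y^{(n)}, \quad \varepsilon_y \sim N(0, \sigma^2)$$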

Animal’s model of the experimental setup

$$\hat{y}^{(n)} = x^{(n)T} w^{(n|n-1)}$$

Animal’s expectation on trial n

Animal’s learning from trial n

Sharing Paradigm

Train: {x1,x2} -> 1

Test: x1 -> ?, x2 -> ?

Result: x1->0.5, x2->0.5
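The sharing result can be reproduced with a minimal sketch of the Kalman filter. The helper name `kalman_trial` and the values of A, Q, sigma2, and the initial P are assumptions chosen for illustration; none of them are given on the slides.

```python
import numpy as np

def kalman_trial(w, P, x, y, A, Q, sigma2):
    """One trial: update on the observed reward, then project to the next trial."""
    k = P @ x / (x @ P @ x + sigma2)             # Kalman gain
    w = w + k * (y - x @ w)                      # correct weights by the prediction error
    P = (np.eye(len(w)) - np.outer(k, x)) @ P    # uncertainty after the observation
    return A @ w, A @ P @ A.T + Q                # prior for the next trial

A = np.eye(2)             # assumed: weights do not decay between trials
Q = 0.001 * np.eye(2)     # assumed state noise
sigma2 = 0.25             # assumed measurement noise variance
w = np.zeros(2)           # assumed prior: no association with reward
P = 0.5 * np.eye(2)       # assumed prior uncertainty

x = np.array([1.0, 1.0])  # light and tone presented together
for _ in range(40):       # Train: {x1, x2} -> 1
    w, P = kalman_trial(w, P, x, 1.0, A, Q, sigma2)

print("w1, w2 after training:", w)
```

Because both stimuli are always present together and start with equal uncertainty, the filter splits the reward equally between them, so each weight approaches 0.5.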

[Figure: sharing paradigm simulated over 40 trials. Traces: y and yhat; P11 and P22. Panels: learning with Kalman gain; LMS.]

Blocking

Kamin (1968) Attention-like processes in classical conditioning. In: Miami symposium on the prediction of behavior: aversive stimulation (ed. MR Jones), pp. 9-33. Univ. of Miami Press.

Kamin trained an animal to continuously press a lever to receive food. He then paired a light (conditioned stimulus) with a mild electric shock to the foot of the rat (unconditioned stimulus). In response to the shock, the animal would reduce the lever-press activity. Soon the animal learned that the light predicted the shock, and therefore reduced lever pressing in response to the light. He then paired the light with a tone when giving the electric shock. After this second stage of training, he observed that when the tone was given alone, the animal did not reduce its lever pressing. The animal had not learned anything about the tone.

Blocking Paradigm

Train: x1 -> 1, {x1,x2} -> 1

Test: x2 -> ?, x1 -> ?

Result: x2 -> 0, x1 -> 1

[Figure: blocking paradigm simulated over 40 trials. Traces include P11 and P22.]

Backwards Blocking Paradigm

Train: {x1,x2} -> 1, x1 -> 1

Test: x2 -> ?

Result: x2 -> 0
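The following sketch contrasts the Kalman filter with LMS on the backwards blocking schedule. The values of A, Q, sigma2, the initial P, the LMS learning rate `eta`, and the choice of 30 trials per phase are all assumptions for illustration. The point is that the Kalman filter carries a covariance between the two weights, so learning about x1 alone also changes the estimate for x2; LMS has no such term.

```python
import numpy as np

A = np.eye(2)             # assumed: no decay of the weights between trials
Q = 0.001 * np.eye(2)     # assumed state noise
sigma2 = 0.25             # assumed measurement noise variance
eta = 0.1                 # assumed LMS learning rate

# Train: 30 trials of {x1, x2} -> 1, then 30 trials of x1 -> 1
trials = [(np.array([1.0, 1.0]), 1.0)] * 30 + [(np.array([1.0, 0.0]), 1.0)] * 30

w_kf, P = np.zeros(2), 0.5 * np.eye(2)   # assumed prior mean and uncertainty
w_lms = np.zeros(2)
for x, y in trials:
    # Kalman filter: the gain uses the full uncertainty matrix, including P12
    k = P @ x / (x @ P @ x + sigma2)
    w_kf = w_kf + k * (y - x @ w_kf)
    P = (np.eye(2) - np.outer(k, x)) @ P
    w_kf, P = A @ w_kf, A @ P @ A.T + Q
    # LMS: only weights whose input is present can change
    w_lms = w_lms + eta * (y - x @ w_lms) * x

print("Kalman w2:", w_kf[1])
print("LMS    w2:", w_lms[1])
```

During the first phase P12 becomes negative, because only the sum of the two weights is observed. In the second phase the gain for x2 is therefore negative, and the Kalman estimate of w2 falls toward 0 while the LMS estimate stays near 0.5.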

[Figure: backwards blocking paradigm simulated over 60 trials. Traces include P11 and P22.]

Different output models

$$w^{*(n)} = w^{*(n-1)} + \varepsilon_w^{(n)}, \quad \varepsilon_w \sim N(0, Q)$$

$$y^{(n)} = b_1 x_1^{(n)} w_1^* + b_2 x_2^{(n)} w_2^*$$

Case 1: the animal assumes an additive model. If each stimulus predicts one reward, then if the two are present together, they predict two rewards.

Suppose that x represents inputs from the environment: a light and a tone.

Suppose that y represents a reward, like a food pellet.

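$$w^{*(n)} = w^{*(n-1)} + \varepsilon_w^{(n)}, \quad \varepsilon_w \sim N(0, Q)$$

$$y^{(n)} = x^{(n)T} w^{*} + \varepsilon_y^{(n)}, \quad \varepsilon_y \sim N(0, \sigma^2)$$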

Case 2: the animal assumes a weighted average model. If each stimulus predicts one reward, then if the two are present together, they still predict one reward, but with higher confidence.

The weights a1 and a2 should be set to the inverse of the variance (uncertainty) with which each stimulus x1 and x2 predicts the reward.

General case of the Kalman filter

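$$w^{*(n)} = A w^{*(n-1)} + \varepsilon_w^{(n)}, \quad \varepsilon_w \sim N(0, Q)$$

$$y^{(n)} = H w^{*(n)} + \varepsilon_y^{(n)}, \quad \varepsilon_y \sim N(0, R)$$

A priori estimate of mean and variance of the hidden variable before I observe the first data point:

$$w^{(1|0)}, \quad P^{(1|0)}$$

Update of the estimate of the hidden variable after I observed the data point:

$$k^{(n)} = P^{(n|n-1)} H^T \left( H P^{(n|n-1)} H^T + R \right)^{-1}$$

$$w^{(n|n)} = w^{(n|n-1)} + k^{(n)} \left( y^{(n)} - H w^{(n|n-1)} \right)$$

$$P^{(n|n)} = \left( I - k^{(n)} H \right) P^{(n|n-1)}$$

Forward projection of the estimate to the next trial:

$$w^{(n+1|n)} = A w^{(n|n)}$$

$$P^{(n+1|n)} = A P^{(n|n)} A^T + Q$$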

Application of Kalman filter to problems in sensorimotor control

[Figure: motor command, state of our body x, and sensory measurement.]

DM Wolpert et al. (1995) Science 269:1880

When we move our arm in darkness, we may estimate the position of our hand based on three sources of information:

• proprioceptive feedback.

• a forward model of how the motor commands have moved our arm.

• a combination of our prediction from the forward model with actual proprioceptive feedback.

Experimental procedures:

Subject holds a robotic arm in total darkness. The hand is briefly illuminated. An arrow is displayed to left or right, showing which way to move the hand. In some cases, the robot produces a constant force that assists or resists the movement. The subject slowly moves the hand until a tone is sounded. They use the other hand to move a mouse cursor to show where they think their hand is located.

DM Wolpert et al. (1995) Science 269:1880

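The generative model, describing actual dynamics of the limb:

$$x^{(n+1)} = A x^{(n)} + B u^{(n)} + \varepsilon_x^{(n)}, \quad \varepsilon_x \sim N(0, Q)$$

$$y^{(n)} = C x^{(n)} + \varepsilon_y^{(n)}, \quad \varepsilon_y \sim N(0, R)$$

[Figure: motor command, state of the body, and sensory measurement.]

The model for estimation of sensory state from sensory feedback:

$$x^{(n+1)} = \hat{A} x^{(n)} + \hat{B} u^{(n)} + \varepsilon_x^{(n)}, \quad \varepsilon_x \sim N(0, Q)$$

$$y^{(n)} = C x^{(n)} + \varepsilon_y^{(n)}, \quad \varepsilon_y \sim N(0, R)$$

$$\hat{B} = 1.4 B$$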

For whatever reason, the brain has an incorrect model of the arm. It overestimates the effect of motor commands on changes in limb position.

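Initial conditions: the subject can see the hand and has no uncertainty regarding its position and velocity.

$$\hat{x}^{(1|0)} = x^{(1)}, \quad P^{(1|0)} = 0$$

Forward model of state change and feedback:

$$\hat{y}^{(1)} = C \hat{x}^{(1|0)}$$

Actual observation:

$$y^{(1)} = C x^{(1)} + \varepsilon_y^{(1)}$$

Estimate of state incorporates the prior and the observation:

$$k^{(1)} = P^{(1|0)} C^T \left( C P^{(1|0)} C^T + R \right)^{-1}$$

$$\hat{x}^{(1|1)} = \hat{x}^{(1|0)} + k^{(1)} \left( y^{(1)} - \hat{y}^{(1)} \right)$$

$$P^{(1|1)} = \left( I - k^{(1)} C \right) P^{(1|0)}$$

Forward model to establish the prior and the uncertainty for the next state:

$$\hat{x}^{(2|1)} = \hat{A} \hat{x}^{(1|1)} + \hat{B} u^{(1)}$$

$$P^{(2|1)} = \hat{A} P^{(1|1)} \hat{A}^T + Q$$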
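The estimation steps above can be sketched for a simple limb model. Everything numerical here is an assumption made for illustration: a damped point mass with state (position, velocity), a 10 ms time step, a constant motor command, position-only feedback, and the values of Q and R. The brain's model of A is assumed correct, and only its model of B is inflated by the factor 1.4 from the slide. Measurements are kept noise-free so that the bias caused by the incorrect forward model is visible directly.

```python
import numpy as np

dt = 0.01                                    # assumed time step (s)
A = np.array([[1.0, dt], [0.0, 1.0 - dt]])   # assumed damped point-mass dynamics
B = np.array([0.0, dt])                      # true effect of the motor command
B_hat = 1.4 * B                              # brain's model overestimates that effect
C = np.array([[1.0, 0.0]])                   # assumed: feedback reports position only
Q = np.diag([1e-6, 1e-4])                    # assumed state noise covariance
R = np.array([[1e-4]])                       # assumed measurement noise variance

x = np.zeros(2)         # true state
x_fm = np.zeros(2)      # forward model alone, never corrected by feedback
x_hat = np.zeros(2)     # Kalman prior estimate for the current step
P = np.zeros((2, 2))    # hand was visible at the start: no initial uncertainty

bias_kf, bias_fm = [], []
for n in range(140):    # 1.4 s of movement
    u = 1.0             # assumed constant motor command
    y = C @ x           # noise-free measurement
    # Combine the prior with the observation
    k = P @ C.T @ np.linalg.inv(C @ P @ C.T + R)
    x_post = x_hat + k @ (y - C @ x_hat)
    P_post = (np.eye(2) - k @ C) @ P
    bias_kf.append(x_post[0] - x[0])
    bias_fm.append(x_fm[0] - x[0])
    # Advance the limb, the uncorrected forward model, and the Kalman prior
    x = A @ x + B * u
    x_fm = A @ x_fm + B_hat * u
    x_hat = A @ x_post + B_hat * u
    P = A @ P_post @ A.T + Q

print("position bias, forward model only:", bias_fm[-1])
print("position bias, Kalman estimate:   ", bias_kf[-1])
```

Under these assumptions the Kalman gain starts at zero, so the early estimate follows the biased forward model. As uncertainty accumulates the gain grows, and feedback keeps the overestimate far smaller than it would be with the forward model alone.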

[Figure: a single movement. Panels plotted against time (sec): actual and estimated position, showing x(t) and the estimate with its SD; the Kalman gain; the motor command u. The time of the “beep” is marked.]

[Figure: for movements of various length. Bias at end of movement (cm) and variance at end of movement (cm^2), plotted against total movement time (sec).]

Puzzling results: Savings and memory despite “washout”

Gain=eye displacement divided by target displacement

Result 1: After changes in gain, monkeys exhibit recall despite behavioral evidence for washout.

Kojima et al. (2004) Memory of learning facilitates saccade adaptation in the monkey. J Neurosci 24:7531.

Result 2: Following changes in gain and a long period of washout, monkeys exhibit no recall.

Result 3: Following changes in gain and a period of darkness, monkeys exhibit a “jump” in memory.

Puzzling results: Improvements in performance without error feedback

Kojima et al. (2004) J Neurosci 24:7531.

The learner’s hypothesis about the structure of the world

1. The world has many hidden states. What I observe is a linear combination of these states.

2. The hidden states change from trial to trial. Some change slowly, others change fast.

3. The states that change fast have larger noise than states that change slowly.

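State transition equation:

$$w^{*(n)} = A w^{*(n-1)} + \varepsilon_w^{(n)}, \quad \varepsilon_w \sim N(0, Q), \quad A = \begin{bmatrix} 0.99 & 0 \\ 0 & 0.50 \end{bmatrix}$$

In A, the first diagonal entry (0.99) is the slow system and the second (0.50) is the fast system.

Output equation:

$$y^{(n)} = x^{(n)T} w^{*(n)} + \varepsilon_y^{(n)}, \quad \varepsilon_y \sim N(0, \sigma^2)$$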

Simulations for savings

$$w^{*(n)} = A w^{*(n-1)} + \varepsilon_w^{(n)}, \quad \varepsilon_w \sim N(0, Q), \quad A = \begin{bmatrix} 0.99 & 0 \\ 0 & 0.50 \end{bmatrix}, \quad Q = \begin{bmatrix} 0.00004 & 0 \\ 0 & 0.01 \end{bmatrix}$$

$$y^{(n)} = x^{(n)T} w^{*(n)} + \varepsilon_y^{(n)}, \quad \varepsilon_y \sim N(0, \sigma^2)$$

[Figure: simulations over 300 trials. Traces: x1, x2, and y.]

The critical assumption is that in the fast system, there is much more noise than in the slow system. This produces a larger learning rate in the fast system.
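A minimal sketch of a savings simulation follows. A and Q are the values from the slide. The measurement noise `sigma2`, the input x = (1, 1), the trial counts, and the washout rule (reverse the perturbation until the predicted output returns to zero) are assumptions made for illustration.

```python
import numpy as np

A = np.diag([0.99, 0.50])       # slow and fast systems, from the slide
Q = np.diag([0.00004, 0.01])    # state noise, from the slide
sigma2 = 0.04                   # assumed measurement noise variance
x = np.array([1.0, 1.0])        # assumed: both hidden states contribute to the output

def step(w, P, y):
    """Predict, update on the observed y, then project to the next trial."""
    yhat = x @ w
    k = P @ x / (x @ P @ x + sigma2)
    w = w + k * (y - yhat)
    P = (np.eye(2) - np.outer(k, x)) @ P
    return A @ w, A @ P @ A.T + Q, yhat

w, P = np.zeros(2), Q.copy()
for _ in range(100):            # baseline with y = 0 so the uncertainty settles
    w, P, _ = step(w, P, 0.0)

first = []
for _ in range(150):            # initial adaptation to y = +1
    w, P, yhat = step(w, P, 1.0)
    first.append(yhat)

n_washout = 0
while x @ w > 0 and n_washout < 500:   # washout: reversed perturbation until output is back at 0
    w, P, _ = step(w, P, -1.0)
    n_washout += 1
w_after_washout = w.copy()

second = []
for _ in range(150):            # re-adaptation to y = +1
    w, P, yhat = step(w, P, 1.0)
    second.append(yhat)

print("slow, fast state after washout:", w_after_washout)
print("output after 5 trials, first vs. second exposure:", first[5], second[5])
```

At the end of washout the output is near zero, yet the slow state is still positive and is masked by a negative fast state. Re-adaptation therefore proceeds faster than the initial adaptation.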

Simulations for spontaneous recovery despite zero error feedback

$$w^{(n|n)} = w^{(n|n-1)} + k^{(n)} \left( y^{(n)} - x^{(n)T} w^{(n|n-1)} \right)$$

[Figure: simulation with the error clamp period marked.]

In the error clamp period, estimates are made yet the weight update equation does not see any error. Therefore, the effect of Kalman gain in the error-clamp period is zero. Nevertheless, weights continue to change because of the state update equations. The fast weights rapidly rebound to zero, while the slow weights slowly decline. The sum of these two changes produces a “spontaneous recovery” after washout.
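The rebound described here can be sketched as follows. A and Q are the values used for the savings simulation on the preceding slide. The measurement noise `sigma2`, the input x = (1, 1), the trial counts, and the washout rule are assumptions for illustration. The error clamp is modeled by setting each observation equal to the current prediction, so the prediction error is exactly zero.

```python
import numpy as np

A = np.diag([0.99, 0.50])       # slow and fast systems
Q = np.diag([0.00004, 0.01])    # state noise
sigma2 = 0.04                   # assumed measurement noise variance
x = np.array([1.0, 1.0])        # assumed: both hidden states contribute to the output

def step(w, P, y):
    """Update on the observed y, then apply the state update equations."""
    k = P @ x / (x @ P @ x + sigma2)
    w = w + k * (y - x @ w)
    P = (np.eye(2) - np.outer(k, x)) @ P
    return A @ w, A @ P @ A.T + Q

w, P = np.zeros(2), Q.copy()
for _ in range(100):            # baseline with y = 0 so the uncertainty settles
    w, P = step(w, P, 0.0)
for _ in range(150):            # adaptation to y = +1
    w, P = step(w, P, 1.0)
n_washout = 0
while x @ w > 0 and n_washout < 500:   # washout by reversed perturbation
    w, P = step(w, P, -1.0)
    n_washout += 1

clamp = []
for _ in range(30):             # error clamp: observation equals prediction
    yhat = x @ w
    clamp.append(yhat)
    w, P = step(w, P, yhat)     # zero error, so only A changes the weights

print("output at start of clamp:", clamp[0])
print("peak output during clamp:", max(clamp))
print("output at end of clamp:  ", clamp[-1])
```

The output starts near zero, rises within a few trials as the negative fast state decays, and then declines slowly with the slow state.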

Changes in representation without error feedback

Target extinguished during recovery: mean gain at start of recovery = 0.86, mean gain at end of recovery = 0.87, % gain change = 1.2%.

Target visible during recovery: mean gain at start of recovery = 0.83, mean gain at end of recovery = 0.95, % gain change = 14.4%.

Seeberger et al. (2002) Brain Research 956:374-379.

Massed vs. Spaced training: effect of changing the inter-trial interval

Learning reaching in a force field

ITI = 8min

ITI = 1min

Rats were trained on an operant conditional discrimination in which an ambiguous stimulus (X) indicated both the occasions on which responding in the presence of a second cue (A) would be reinforced and the occasions on which responding in the presence of a third cue (B) would not be reinforced (X --> A+, A-, X --> B-, B+). Both rats with lesions of the hippocampus and control rats learned this discrimination more rapidly when the training trials were widely spaced (intertrial interval of 8 min) than when they were massed (intertrial interval of 1 min). With spaced practice, lesioned and control rats learned this discrimination equally well. But when the training trials were massed, lesioned rats learned more rapidly than controls.

Training trial (bin size=4)

4 trials per day for 4 days

16 trials in one day

Performance in a water maze

Commins, S., Cunningham, L., Harvey, D. & Walsh, D. (2003) Behav Brain Res 139:215-23

Aboukhalil, A., Shelhamer, M. & Clendaniel, R. (2004) Neurosci Lett 369:162-7.

Cue-dependent saccade gain adaptation

When eyes are looking up, increase saccade gain; when eyes are looking down, decrease gain.

(break period: 1 min)


The learner’s hypothesis about the structure of the world

1. The world has many hidden states. What I observe is a linear combination of these states.

2. The hidden states change from trial to trial. Some change slowly, others change fast.

3. The states that change fast have larger noise than states that change slowly.

4. The state changes can occur more frequently than I can make observations.

Inter-trial interval

[Figure: observation schedules for ITI=2 and ITI=20.]

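$$w^{*(n)} = A w^{*(n-1)} + \varepsilon_w^{(n)}, \quad \varepsilon_w \sim N(0, Q), \quad A = \begin{bmatrix} 0.999 & 0 \\ 0 & 0.40 \end{bmatrix}, \quad Q = \begin{bmatrix} 0.00008 & 0 \\ 0 & \dots \end{bmatrix}$$

$$y^{(n)} = x^{(n)T} w^{*(n)} + \varepsilon_y^{(n)}, \quad \varepsilon_y \sim N(0, \sigma^2)$$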

When there is an observation, the uncertainty for each hidden variable decreases in proportion to its Kalman gain:

$$P^{(n|n)} = \left( I - k^{(n)} x^{(n)T} \right) P^{(n|n-1)}$$

When there are no observations, the uncertainty decreases in proportion to A squared, but increases in proportion to state noise Q:

$$P^{(n+1|n)} = A P^{(n|n)} A^T + Q = \begin{bmatrix} a_{11}^2 P_{11} & a_{11} a_{22} P_{12} \\ a_{11} a_{22} P_{12} & a_{22}^2 P_{22} \end{bmatrix} + Q$$

[Figure: uncertainty for the slow state and uncertainty for the fast state, ITI=20.]

Beyond a minimum ITI, increased ITI continues to increase the uncertainty of the slow state but has little effect on the fast state uncertainty. The longer ITI increases the total learning by increasing the slow state’s sensitivity to error.
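The effect of the inter-trial interval on uncertainty can be checked with a short sketch. A and the first entry of Q are from the slide. The second entry of Q is not legible in the transcript, so the value 0.01 used here is an assumption, as are `sigma2`, the input x = (1, 1), and the function name `prior_uncertainty`. The ITI is modeled as the number of state transitions between successive observations.

```python
import numpy as np

A = np.diag([0.999, 0.40])      # slow and fast systems, from the slide
Q = np.diag([0.00008, 0.01])    # first entry from the slide; second entry ASSUMED
sigma2 = 0.04                   # assumed measurement noise variance
x = np.array([1.0, 1.0])        # assumed: both hidden states contribute to the output

def prior_uncertainty(iti, n_obs=500):
    """Prior P at observation time, after n_obs observations spaced iti transitions apart."""
    P = Q.copy()
    for _ in range(n_obs):
        k = P @ x / (x @ P @ x + sigma2)        # observation shrinks uncertainty
        P = (np.eye(2) - np.outer(k, x)) @ P
        for _ in range(iti):                    # no observations: project forward
            P = A @ P @ A.T + Q
    return P

P_massed = prior_uncertainty(2)
P_spaced = prior_uncertainty(20)

for name, P in [("ITI=2 ", P_massed), ("ITI=20", P_spaced)]:
    k = P @ x / (x @ P @ x + sigma2)
    print(name, "P11:", P[0, 0], "P22:", P[1, 1], "slow gain k1:", k[0])
```

With these values the fast state's uncertainty is already close to its ceiling at ITI=2, so the longer interval mainly raises the slow state's uncertainty and, with it, the slow state's Kalman gain.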

[Figure: massed versus spaced training, plotted against observation number. Panels: yhat; weights w1 and w2; Kalman gains k1 and k2; uncertainties P11, P22, and P12.]

Performance in spaced training depends largely on the slow state. Therefore, spaced training produces memories that decay little with passage of time.

[Figure: results for ITI=2, ITI=14, and ITI=98. Panels: performance during training; test at 1 week; testing at 1 day or 1 week (averaged together). Pavlik, P. I. and Anderson, J. R.]

Spaced training results in better retention in learning a second language

On Day 1, subjects learned to translate written Japanese words into English. They were given a Japanese word (written phonetically), and then given the English translation. This “study trial” was repeated twice. Afterwards, they were given the Japanese word and had to write the translation. If their translation was incorrect, the correct translation was given.

The ITI between word repetition was either 2, 14, or 98 trials.

Performance during training was better when the ITI was short. However, retention was much better for words that were observed with longer ITI. (The retention test involved two groups: one tested at 1 day and the other at 7 days. Performance was slightly better for the 1 day group but the results were averaged in this figure.)

Data fusion

Suppose that we have two sensors that independently measure something. We would like to combine their measures to form a better estimate. What should the weights be?

Suppose that we know that sensor 1 gives us measurement y1 and has Gaussian noise with variance $\sigma_1^2$, and similarly, that sensor 2 gives us measurement y2 and has Gaussian noise with variance $\sigma_2^2$.

A good idea is to weight each sensor inversely proportional to its noise:

$$\hat{x} = \frac{1/\sigma_1^2}{1/\sigma_1^2 + 1/\sigma_2^2} y_1 + \frac{1/\sigma_2^2}{1/\sigma_1^2 + 1/\sigma_2^2} y_2 = \frac{\sigma_2^2}{\sigma_1^2 + \sigma_2^2} y_1 + \frac{\sigma_1^2}{\sigma_1^2 + \sigma_2^2} y_2$$
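As a small worked example of inverse-variance weighting, the sketch below fuses two readings. The function name `fuse` and the numerical values are hypothetical. The function also returns the variance of the fused estimate, using the standard result that inverse variances add for independent Gaussian measurements.

```python
def fuse(y1, var1, y2, var2):
    """Weight each measurement by the inverse of its noise variance."""
    total = 1.0 / var1 + 1.0 / var2
    w1 = (1.0 / var1) / total      # equals var2 / (var1 + var2)
    w2 = (1.0 / var2) / total      # equals var1 / (var1 + var2)
    return w1 * y1 + w2 * y2, 1.0 / total

# Hypothetical readings: sensor 1 is four times less noisy than sensor 2
x_hat, var_hat = fuse(y1=2.0, var1=1.0, y2=5.0, var2=4.0)
print(x_hat, var_hat)
```

Here the weights are 0.8 and 0.2, so the estimate sits much closer to the reading from the more reliable sensor.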

To see why this makes sense, let’s put forth a generative model that describes our hypothesis about how the data that we are observing is generated:

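Hidden variable:

$$x^{*(n+1)} = a x^{*(n)} + \varepsilon_x^{(n)}, \quad \varepsilon_x \sim N(0, q^2)$$

Observed variables:

$$y^{(n)} = \begin{bmatrix} 1 \\ 1 \end{bmatrix} x^{*(n)} + \varepsilon_y^{(n)}, \quad \varepsilon_y \sim N(0, R), \quad R = \begin{bmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{bmatrix}$$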

Data fusion via Kalman filter

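$$y^{(n)} = H x^{*(n)} + \varepsilon_y^{(n)}, \quad \varepsilon_y \sim N(0, R), \quad R = \begin{bmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{bmatrix}, \quad H = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$$

Priors:

$$\hat{x}^{(1|0)}, \quad P^{(1|0)} = \infty$$

Variance of our posterior estimate:

$$P^{(1|1)} = \left( \left( P^{(1|0)} \right)^{-1} + H^T R^{-1} H \right)^{-1} = \left( \frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2} \right)^{-1} = \frac{\sigma_1^2 \sigma_2^2}{\sigma_1^2 + \sigma_2^2}$$

Kalman gain (see homework for this):

$$k^{(1)} = P^{(1|1)} H^T R^{-1} = \begin{bmatrix} \dfrac{\sigma_2^2}{\sigma_1^2 + \sigma_2^2} & \dfrac{\sigma_1^2}{\sigma_1^2 + \sigma_2^2} \end{bmatrix}$$

Our first observation:

$$\hat{x}^{(1|1)} = \hat{x}^{(1|0)} + k^{(1)} \left( y^{(1)} - H \hat{x}^{(1|0)} \right) = \frac{\sigma_2^2}{\sigma_1^2 + \sigma_2^2} y_1^{(1)} + \frac{\sigma_1^2}{\sigma_1^2 + \sigma_2^2} y_2^{(1)}$$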

Notice that after we make our first observation, the variance of our posterior is better than the variance of either sensor.

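$$P^{(1|1)} = \frac{\sigma_1^2 \sigma_2^2}{\sigma_1^2 + \sigma_2^2}$$

$$\frac{\sigma_1^2 \sigma_2^2}{\sigma_1^2 + \sigma_2^2} < \sigma_1^2 \quad \text{because} \quad \sigma_1^2 \sigma_2^2 < \sigma_1^2 \left( \sigma_1^2 + \sigma_2^2 \right)$$

$$\frac{\sigma_1^2 \sigma_2^2}{\sigma_1^2 + \sigma_2^2} < \sigma_2^2 \quad \text{because} \quad \sigma_1^2 \sigma_2^2 < \sigma_2^2 \left( \sigma_1^2 + \sigma_2^2 \right)$$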
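The Kalman route to the same answer can be checked numerically. This is a sketch under stated assumptions: the measurement values and variances are hypothetical, and a large finite prior variance (1e6) stands in for an infinite one. The update is written in the inverse-variance form, which avoids subtracting nearly equal large numbers when the prior variance is huge.

```python
import numpy as np

var1, var2 = 1.0, 4.0               # hypothetical sensor noise variances
H = np.array([[1.0], [1.0]])        # both sensors measure the same hidden variable
R = np.diag([var1, var2])
y = np.array([2.0, 5.0])            # hypothetical first observation

x_prior = np.zeros(1)               # prior mean; its value barely matters here
P_prior = np.array([[1e6]])         # stands in for an infinite prior variance

R_inv = np.linalg.inv(R)
P_post = np.linalg.inv(np.linalg.inv(P_prior) + H.T @ R_inv @ H)
K = P_post @ H.T @ R_inv            # Kalman gain, shape (1, 2)
x_post = x_prior + K @ (y - H @ x_prior)

print("gain:", K)
print("posterior mean:", x_post[0])
print("posterior variance:", P_post[0, 0])
```

The gain comes out close to (0.8, 0.2), which matches the inverse-variance weights, and the posterior variance is close to 0.8, below the variance of either sensor.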

What our sensors tell us

The real world

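[Figure: two panels, each showing the distributions for sensor 1, sensor 2, and the combined estimate. Panel titles: combining equally noisy sensors; combining sensors with unequal noise.]

$$y_1 \sim N\left( y_1^{(1)}, \sigma_1^2 \right), \quad y_2 \sim N\left( y_2^{(1)}, \sigma_2^2 \right)$$

Mean of the posterior, and its variance:

$$\hat{x} \sim N\left( \frac{\sigma_2^2}{\sigma_1^2 + \sigma_2^2} y_1^{(1)} + \frac{\sigma_1^2}{\sigma_1^2 + \sigma_2^2} y_2^{(1)}, \; \frac{\sigma_1^2 \sigma_2^2}{\sigma_1^2 + \sigma_2^2} \right)$$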

[Diagram labels: motor commands, muscles, force, body part, state change, sensory system (proprioception, vision, audition), measured sensory consequences, forward model, predicted sensory consequences, integration, belief.]

What we sense depends on what we predicted

Duhamel et al. Science 255, 90-92 (1992)

The brain predicts the sensory consequences of motor commands

Vaziri, Diedrichsen, Shadmehr (2006) Journal of Neuroscience

Combining sensory predictions with sensory measurements should produce a better spatial estimate of the visual world

Vaziri et al. (2006) J Neurosci

How to set the initial var-cov matrix

$$w^{*(n)} = A w^{*(n-1)} + \varepsilon_w^{(n)}, \quad \varepsilon_w \sim N(0, Q)$$

$$y^{(n)} = H w^{*(n)} + \varepsilon_y^{(n)}, \quad \varepsilon_y \sim N(0, R)$$

In homework, we will show that in general:

$$\left( P^{(n|n)} \right)^{-1} = \left( P^{(n|n-1)} \right)^{-1} + H^T R^{-1} H$$

Now if we have absolutely no prior information on w, then before we see the first data point P(1|0) is infinity, and therefore its inverse is zero. After we see the first data point, we will be using the above equation to update our estimate. The updated estimate will become:

$$P^{(1|1)} = \left( H^T R^{-1} H \right)^{-1}$$

A reasonable and conservative estimate of the initial value of P would be to set it to the above value. That is, set:

$$P^{(1|0)} = \left( H^T R^{-1} H \right)^{-1}$$
