
Jochen Triesch, UC San Diego, http://cogsci.ucsd.edu/~triesch

Organizing Principles for Learning in the Brain

Associative Learning: Hebb rule and variations, self-organizing maps

Adaptive Hedonism: the brain seeks pleasure and avoids pain: conditioning and reinforcement learning

Imitation: is the brain specially set up to learn from other brains? Imitation learning approaches

Supervised Learning: of course the brain has no explicit teacher, but the timing of development may lead to some circuits being trained by others


Classical Conditioning and Reinforcement Learning

Outline:

1. classical conditioning and its variations
2. Rescorla-Wagner rule
3. instrumental conditioning
4. Markov decision processes
5. reinforcement learning

Note: this presentation follows a chapter of “Theoretical Neuroscience” by Dayan&Abbott


Example of embodied models of reward-based learning: Skinnerbots in Touretzky's lab at CMU: http://www-2.cs.cmu.edu/~dst/Skinnerbots/index.html


Project Goals

We are developing computational theories of operant conditioning. While classical (Pavlovian) conditioning has a well-developed theory, implemented in the Rescorla-Wagner model and its descendants (work by Sutton & Barto, Grossberg, Klopf, Gallistel, and others), there is at present no comprehensive theory of operant conditioning. Our work has four components:
1. Develop computationally explicit models of operant conditioning that reproduce classical animal learning experiments with rats, dogs, pigeons, etc.
2. Demonstrate the workability of these models by implementing them on mobile robots, which then become trainable robots (Skinnerbots). We originally used Amelia, a B21 robot manufactured by Real World Interface, as our implementation platform. We are moving to the Sony AIBO.
3. Map our computational theories onto neuroanatomical structures known to be involved in animal learning, such as the hippocampus, amygdala, and striatum.
4. Explore issues in human-robot interaction that arise when non-scientists try to train robots as if they were animals.

also at: http://www-2.cs.cmu.edu/~dst/Skinnerbots/index.html


Classical Conditioning

Pavlov’s classic finding: (classical conditioning)

Initially, sight of food leads to dog salivating:

food → salivating
unconditioned stimulus, US (reward) → unconditioned response, UR

Sound of bell consistently precedes food. Afterwards, bell leads to salivating:

bell → salivating
conditioned stimulus, CS → conditioned response, CR (expectation of reward)


Variations of Conditioning 1

Extinction: the stimulus (bell) is repeatedly shown without the reward (food): the conditioned response (salivating) is reduced.

Partial reinforcement: the stimulus only sometimes precedes the reward: the conditioned response is weaker than in the classical case.

Blocking (2 stimuli): First, stimulus S1 is associated with the reward (classical conditioning). Then, stimuli S1 and S2 are shown together, followed by the reward: the association between S2 and the reward is not learned.


Variations of Conditioning 2

Inhibitory Conditioning (2 stimuli): alternate 2 types of trials:
1. S1 followed by reward.
2. S1+S2 followed by absence of reward.
Result: S2 becomes a predictor of the absence of reward.

To show this, use for example the following 2 methods:
A. Train the animal to predict reward based on S2. Result: learning is slowed.

B. Train the animal to predict reward based on S3, then show S2+S3. Result: the conditioned response is weaker than for S3 alone.


Variations of Conditioning 3

Overshadowing (2 stimuli): repeatedly present S1+S2 followed by reward. Result: often, the reward prediction is shared unequally between the stimuli.

Example (made up): a red light and a high-pitch beep precede pigeon food.

Result: the red light is more effective in predicting the food than the high-pitch beep.

Secondary Conditioning: S1 precedes the reward (classical case). Then S2 precedes S1. Result: S2 leads to a prediction of reward. But: if S1 following S2 is shown too often, extinction will occur.


Summary of Conditioning Findings
(incomplete; conditioning has been studied extensively for decades, and there are many books on the topic)

figure taken from Dayan&Abbott


Modeling Conditioning

The Rescorla-Wagner rule (1972):

Consider a stimulus variable u representing presence (u=1) or absence (u=0) of the stimulus. Correspondingly, a reward variable r represents presence or absence of the reward.

The expected reward v is modeled as “stimulus x weight”:

v = wu

Learning is done by adjusting the weight to minimize the error between predicted reward and actual reward.


Rescorla-Wagner Rule

Denote the prediction error by δ (delta): δ = r-v

Learning rule: w := w + ε δ u ,

where ε is a learning rate.

Q: Why is this useful?
A: This rule performs stochastic gradient descent to minimize the expected squared error <(r-v)²>; w converges to <r>. The Rescorla-Wagner rule is a variant of the "delta rule" used in neural networks.

Note: in psychological terms, the learning rate is a measure of the associability of the stimulus with the reward.

The derivation (gradient of the squared error with respect to w):

d/dw (r - v)² = d/dw (r - wu)² = -2 (r - wu) u = -2 δ u

so the update Δw = ε δ u takes a step down the gradient of the squared error.
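To make the update concrete, here is a minimal simulation sketch of the Rescorla-Wagner update in Python (the trial layout, trial counts, and learning rate are illustrative assumptions, not values from the slides):

def rescorla_wagner(rewards, epsilon=0.1):
    # Scalar Rescorla-Wagner rule for a single stimulus that is present
    # (u = 1) on every trial; `rewards` gives r for each trial.
    w = 0.0
    history = []
    for r in rewards:
        u = 1.0                    # stimulus present
        v = w * u                  # predicted reward
        delta = r - v              # prediction error
        w = w + epsilon * delta * u
        history.append(w)
    return history

# 100 acquisition trials (r = 1) followed by 100 extinction trials (r = 0):
trace = rescorla_wagner([1.0] * 100 + [0.0] * 100)
print(round(trace[99], 3), round(trace[-1], 3))   # w near 1, then decays back toward 0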


Rescorla-Wagner Rule Example

prediction error δ = r-v; learning rule: w := w + ε δ u

figure taken from Dayan&Abbott


Multiple Stimuli

Essentially the same idea/learning rule:

In the case of multiple stimuli: v = w·u (predicted reward = dot product of stimulus vector and weight vector)

Prediction error: δ = r-v

Learning rule: w := w + ε δ u

Again this is stochastic gradient descent on the squared error:

d/dwi (r - w·u)² = -2 (r - w·u) ui = -2 δ ui

so each weight changes by Δwi = ε δ ui.


To what extent does the Rescorla-Wagner rule account for the variants of classical conditioning?
(prediction: v = w·u; error: δ = r-v; learning: w := w + ε δ u)

figure taken from Dayan&Abbott


(prediction: v = w·u; error: δ = r-v; learning: w := w + ε δ u)

Extinction, Partial Reinforcement: o.k., since w converges to <r>

Blocking: during pre-training, w1 converges to r. During training, v = w1u1 + w2u2 = r, hence δ = 0 and w2 does not grow.

Inhibitory Conditioning: on S1-only trials, w1 acquires a positive value. On S1+S2 trials, v = w1 + w2 must converge to zero, hence w2 becomes negative.

Overshadowing: v = w1 + w2 goes to r, but w1 and w2 may become different if there are different learning rates εi for them.

Secondary Conditioning: the Rescorla-Wagner rule predicts a negative S2 weight!

The Rescorla-Wagner rule qualitatively accounts for a wide range of conditioning phenomena, but not for secondary conditioning.
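A minimal sketch of the vector form of the rule, used to check the blocking and inhibitory-conditioning claims above numerically (stimulus coding, trial counts, and the learning rate are illustrative assumptions):

def rw_step(w, u, r, epsilon=0.1):
    # One Rescorla-Wagner step for a stimulus vector u: v = w.u, delta = r - v.
    v = sum(wi * ui for wi, ui in zip(w, u))
    delta = r - v
    return [wi + epsilon * delta * ui for wi, ui in zip(w, u)]

# Blocking: pre-train S1 alone with reward, then train S1+S2 with the same reward.
w = [0.0, 0.0]
for _ in range(200):
    w = rw_step(w, [1, 0], 1.0)      # S1 -> reward: w1 approaches 1
for _ in range(200):
    w = rw_step(w, [1, 1], 1.0)      # S1+S2 -> reward: delta is already ~0
print(w)                             # roughly [1.0, 0.0]: no association for S2

# Inhibitory conditioning: alternate S1 -> reward with S1+S2 -> no reward.
w = [0.0, 0.0]
for _ in range(500):
    w = rw_step(w, [1, 0], 1.0)
    w = rw_step(w, [1, 1], 0.0)
print(w)                             # w2 becomes negative, w1 stays positive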


Temporal Difference Learning

Motivation: we need to keep track of time within a trial.
Idea (Sutton&Barto, 1990): try to predict the total future reward expected from time t onward to the time T at the end of the trial. Assume time runs in discrete steps.

Predicted total future reward from time t (one stimulus case):

R(t) = Σ_{τ=0…T-t} r(t+τ)    (total future reward)

v(t) = Σ_{τ=0…t} w(τ) u(t-τ)    (prediction based on the stimulus)

Problem: how do we adjust the weights? We would like to adjust w(τ) to make v(t) approximate the true total future reward R(t) (the reward that is yet to come), but this is unknown since it lies in the future.


TD Learning cont'd.

R(t) = Σ_{τ=0…T-t} r(t+τ),    v(t) = Σ_{τ=0…t} w(τ) u(t-τ)

Solution (temporal difference learning rule):

w(τ) := w(τ) + ε δ(t) u(t-τ),   with   δ(t) = r(t) + v(t+1) - v(t)

δ(t) is the temporal difference. To see why this makes sense, split off the first term of the sum:

Σ_{τ=0…T-t} r(t+τ) = r(t) + Σ_{τ=0…T-t-1} r(t+1+τ)

We want v(t) to approximate the left-hand side, but v(t+1) should approximate the second term on the right-hand side. Hence:

v(t) ≈ <Σ_{τ=0…T-t} r(t+τ)> ≈ r(t) + v(t+1),   or   r(t) + v(t+1) - v(t) ≈ 0,

which is exactly the quantity δ(t) that the rule drives toward zero.
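A minimal sketch of this tapped-delay-line TD model in simulation (stimulus time, reward time, learning rate, and trial count are illustrative assumptions):

def td_trial(w, stim_time, reward_time, T, epsilon=0.2):
    # One trial: u(t) marks the stimulus, r(t) the reward.
    # v(t) = sum_tau w[tau] * u(t - tau); delta(t) = r(t) + v(t+1) - v(t);
    # w[tau] += eps * delta(t) * u(t - tau).
    u = [1.0 if t == stim_time else 0.0 for t in range(T)]
    r = [1.0 if t == reward_time else 0.0 for t in range(T)]

    def v(t):
        if t >= T:
            return 0.0               # trial ends at time T
        return sum(w[tau] * u[t - tau] for tau in range(t + 1))

    for t in range(T):
        delta = r[t] + v(t + 1) - v(t)
        for tau in range(t + 1):
            w[tau] += epsilon * delta * u[t - tau]
    return w

T = 25
w = [0.0] * T
for _ in range(500):                 # repeat identical trials
    w = td_trial(w, stim_time=5, reward_time=15, T=T)
# After learning, v(t) is ~1 from the stimulus until the reward, and the
# prediction error has moved back to the time of the stimulus.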


TD Learning Rule Example

figure taken from Dayan&Abbott

w(τ) := w(τ) + ε δ(t) u(t-τ);   δ(t) = r(t) + v(t+1) - v(t)

Note: the temporal difference learning rule can also account for secondary conditioning (sorry, no example).

Reward and the time course of the reward are correctly predicted!


Dopamine and Reward Prediction

figure taken from Dayan&Abbott

(VTA = ventral tegmental area (midbrain))

VTA neurons fire for unexpected reward: they seem to represent the prediction error δ.


Instrumental Conditioning

So far we were only concerned with the prediction of reward and did not consider the agent's actions. Reward usually depends on what you do! Skinner boxes, etc.

Distinguish two scenarios:
A. Rewards follow actions immediately (Static Action Choice). Example: n-armed bandit (slot machine)

B. Rewards may be delayed (Sequential Action Choice) Example: playing chess

Goal: choose actions to maximize rewards


Static Action Choice

Consider bee foraging:

The bee can choose to fly to blue or yellow flowers and wants to maximize the nectar volume.

Bees learn to fly to the "better" flower within a single session (~40 flowers).


Simple model of bee foraging

When the bee chooses blue, the reward is drawn from p(rb); when it chooses yellow, from p(ry).

Assume the model bee has a stochastic policy: it chooses to fly to a blue or a yellow flower with probability p(b) or p(y), respectively.

A “convenient” assumption: p(b), p(y) follow softmax decision rule:

p(b) = exp(β mb) / (exp(β mb) + exp(β my)),    p(y) = exp(β my) / (exp(β mb) + exp(β my))

Equivalently, p(b) = σ(β(mb - my)), where σ(x) = 1/(1 + exp(-x)).

Notes: p(b) + p(y) = 1; mb, my are action values to be adjusted; β is an inverse temperature: large β gives nearly deterministic behavior.


Exploration-Exploitation dilemma

Why use softmax action selection? Idea: the bee could also choose the "better" action all the time. But the bee can't be sure that the seemingly better action really is the better action.

The bee needs to test and continuously verify which action leads to higher rewards.

This is the famous exploration-exploitation dilemma of reinforcement learning: you need to explore to know what's good, and you need to exploit what you know is good to maximize reward.

Generalization of softmax to many possible actions:

p(a) = exp(β ma) / Σ_{a'=1…Na} exp(β ma')
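A minimal sketch of softmax action selection over several actions (the action values and β below are arbitrary example numbers):

import math, random

def softmax_choice(m, beta):
    # p(a) = exp(beta * m[a]) / sum_a' exp(beta * m[a'])
    # Large beta -> nearly deterministic (exploit); small beta -> more uniform (explore).
    exps = [math.exp(beta * ma) for ma in m]
    z = sum(exps)
    probs = [e / z for e in exps]
    x, acc = random.random(), 0.0
    for a, p in enumerate(probs):
        acc += p
        if x < acc:
            return a, probs
    return len(m) - 1, probs

action, probs = softmax_choice([0.5, 1.0, 0.0], beta=2.0)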


The Indirect Actor

Question: how do we adjust the action values ma? Idea: have the action values adapt to the average reward for each action:

mb = <rb> and my = <ry>

This can be achieved with a simple delta rule (here after choosing blue; analogously for yellow):

mb := mb + ε δ,  where  δ = rb - mb

This is called the indirect actor, because action choice is mediated indirectly by the expected amounts of reward.
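A minimal sketch of the indirect actor on the two-flower bandit, reusing the softmax_choice function sketched above; the reward (nectar) distributions are illustrative stand-ins, not the values used by Dayan&Abbott:

import random

def indirect_actor(trials=200, epsilon=0.1, beta=1.0):
    # Action values m track the average reward of each action via the delta rule,
    # and actions are drawn from the softmax of m (softmax_choice from above).
    m = [0.0, 0.0]                    # m_b (blue), m_y (yellow)
    mean_nectar = [1.0, 2.0]          # assumed average nectar volumes
    for _ in range(trials):
        a, _ = softmax_choice(m, beta)
        r = random.expovariate(1.0 / mean_nectar[a])   # stochastic nectar volume
        m[a] += epsilon * (r - m[a])  # delta rule: m_a tracks <r_a>
    return m

print(indirect_actor())               # entries approach the average rewards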


Indirect Actor Example

figure taken from Dayan&Abbott


The Direct Actor

figure taken from Dayan&Abbott

Idea: choose action values directly to maximize expected reward

Maximize this expected reward by stochastic gradient ascent:

<r> = p(b) <rb> + p(y) <ry>

d<r>/dmb = β p(b) p(y) (<rb> - <ry>)

This leads to the following learning rule, applied after action a is taken:

mb := mb + ε (δab - p(b)) (r - r̄)

where δab is the Kronecker delta and r̄ is a parameter often chosen to be an estimate of the average reward per unit time.
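A minimal sketch of the direct actor on the same two-flower bandit; the baseline r̄ is kept as a running average, and the reward distributions and parameters are again illustrative assumptions:

import math, random

def direct_actor(trials=200, epsilon=0.1, beta=1.0, nu=0.1):
    # Direct actor: after taking action a and receiving reward r, every action
    # value is updated as m[a'] += eps * (delta_{a a'} - p(a')) * (r - rbar),
    # with rbar a running estimate of the average reward (the baseline).
    m = [0.0, 0.0]
    mean_nectar = [1.0, 2.0]          # assumed average nectar volumes
    rbar = 0.0
    for _ in range(trials):
        exps = [math.exp(beta * mi) for mi in m]
        p = [e / sum(exps) for e in exps]
        a = 0 if random.random() < p[0] else 1
        r = random.expovariate(1.0 / mean_nectar[a])
        for ap in range(2):
            m[ap] += epsilon * ((1.0 if ap == a else 0.0) - p[ap]) * (r - rbar)
        rbar += nu * (r - rbar)       # slow estimate of the average reward
    return m

print(direct_actor())                 # the value of the richer flower grows relative to the other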


Direct Actor Example

figure taken from Dayan&Abbott

again: nectar volumes reversed after the first 100 visits


Sequential Action Choice

So far: immediate reward after each action (n-armed bandit problem). Now: rewards may be delayed, and the agent can be in different states.

Example: Maze Task

figure taken from Dayan&Abbott

The amount of reward after the decision at the second intersection depends on the action taken at the first intersection.


Policy Iteration

There is a big body of research on how to solve this and more complicated tasks, easily filling an entire course by itself. Here we just consider one example method: policy iteration.

Assumption: the state is fully observable (in contrast to only partially observable), i.e. the rat knows exactly where it is at any time.

Idea: maintain and improve a stochastic policy that determines the actions at each decision point (A, B, C) using action values and the softmax decision rule.

Two elements:
critic: use temporal difference learning to predict future rewards from A, B, C if the current policy is followed
actor: maintain and improve the policy

figure taken from Dayan&Abbott


Policy Iteration cont'd.

How to formalize this idea? Introduce a state variable u describing whether the rat is at A, B, or C. Also introduce an action value vector m(u) describing the policy (the softmax rule assigns the probability of action a based on the action values).

Immediate reward for taking action a in state u: ra(u). Expected future reward for starting in state u and following the current policy: v(u) (the state value). The rat's estimate of this is denoted w(u).

Policy Evaluation (critic): estimate w(u) using temporal difference learning.

Policy Improvement (actor): improve the action values m(u) based on the estimated state values.

figure taken from Dayan&Abbott


Policy Evaluation

Initially, assume all action values are 0, i.e. left/right are equally likely everywhere.

The true value of each state can then be found by inspection:
v(B) = ½(5+0) = 2.5;  v(C) = ½(2+0) = 1;  v(A) = ½(v(B)+v(C)) = 1.75.

These values can be learned with the temporal difference learning rule:

w(u) := w(u) + ε δ,   with   δ = ra(u) + v(u') - v(u),   where u' is the state that results from taking action a in state u.

figure taken from Dayan&Abbott
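A minimal sketch of this evaluation step under the random policy, to check the values computed above; the rewards of 5/0 at B and 2/0 at C come from the slide, while the step size and episode count are illustrative:

import random

def td_policy_evaluation(episodes=5000, epsilon=0.05):
    # Evaluate the random (50/50) policy: w(u) += eps * (r + w(u') - w(u)),
    # with terminal value 0 after the second choice.
    w = {'A': 0.0, 'B': 0.0, 'C': 0.0}
    for _ in range(episodes):
        u_next = random.choice(['B', 'C'])               # choice at A, no reward
        w['A'] += epsilon * (0.0 + w[u_next] - w['A'])
        r = random.choice([5.0, 0.0]) if u_next == 'B' else random.choice([2.0, 0.0])
        w[u_next] += epsilon * (r + 0.0 - w[u_next])     # second choice, then the trial ends
    return w

print(td_policy_evaluation())   # roughly {'A': 1.75, 'B': 2.5, 'C': 1.0}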


Policy Evaluation Example

w(u) := w(u) + ε δ   with   δ = ra(u) + v(u') - v(u)

figures taken from Dayan&Abbott


Policy Improvement

δ = ra(u) + v(u') - v(u)    (figures taken from Dayan&Abbott)

How to adjust the action values?

ma'(u) := ma'(u) + ε (δaa' - p(a'; u)) δ

where a is the action actually taken, δaa' is the Kronecker delta, and p(a'; u) is the softmax probability of choosing action a' in state u as determined by ma'(u).

Example: consider starting out from the random policy and assume the state value estimates w(u) are accurate. Considering u = A leads to

δ = 0 + v(B) - v(A) = 0.75 for a left turn

δ = 0 + v(C) - v(A) = -0.75 for a right turn

so the rat will increase the probability of going left at A.
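A minimal sketch of the full actor-critic loop on the maze, combining the critic update above with the actor rule ma'(u) := ma'(u) + ε (δaa' - p(a'; u)) δ. From the example above, left at A leads to B and right leads to C; which arm pays off at B and at C is an assumption here, as are all parameter values:

import math, random

def actor_critic_maze(episodes=3000, epsilon=0.2, beta=1.0):
    w = {'A': 0.0, 'B': 0.0, 'C': 0.0}                 # critic: state value estimates
    m = {u: {'L': 0.0, 'R': 0.0} for u in w}           # actor: action values

    def probs(u):
        e = {a: math.exp(beta * m[u][a]) for a in ('L', 'R')}
        z = sum(e.values())
        return {a: e[a] / z for a in e}

    def step(u, a):
        # Returns (reward, next state or None).  A: left -> B, right -> C.
        # Assumed payoffs: at B left pays 5, right 0; at C left pays 0, right 2.
        if u == 'A':
            return 0.0, ('B' if a == 'L' else 'C')
        if u == 'B':
            return (5.0 if a == 'L' else 0.0), None
        return (2.0 if a == 'R' else 0.0), None

    for _ in range(episodes):
        u = 'A'
        while u is not None:
            p = probs(u)
            a = 'L' if random.random() < p['L'] else 'R'
            r, u_next = step(u, a)
            delta = r + (w[u_next] if u_next else 0.0) - w[u]
            w[u] += epsilon * delta                                  # policy evaluation (critic)
            for ap in ('L', 'R'):                                    # policy improvement (actor)
                m[u][ap] += epsilon * ((1.0 if ap == a else 0.0) - p[ap]) * delta
            u = u_next
    return w, m

w, m = actor_critic_maze()
# The policy comes to prefer the rewarded arms (here L at A and B, R at C),
# and w tracks the values of the improving policy.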


Policy Improvement Example

figures taken from Dayan&Abbott


Some Extensions

- Introduction of a state vector u

- Discounting of future rewards: put more emphasis on rewards in the near future than on rewards that are far away (see the note after this list).
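As a brief, standard illustration of the discounting point (not spelled out on the slides): with a discount factor γ between 0 and 1, the quantity to be predicted becomes

v(t) ≈ <Σ_{τ=0…} γ^τ r(t+τ)>

and the temporal difference error becomes δ(t) = r(t) + γ v(t+1) - v(t); the undiscounted rule used above corresponds to γ = 1.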

Note: reinforcement learning is a big subfield of machine learning. There is a good introductory textbook by Sutton and Barto.


Questions to discuss/think about

1. Even at one level of abstraction there are many different Hebbian or reinforcement learning rules; is it important which one you use? What is the right one?

2. The applications we discussed in Hebbian and Reinforcement learning considered networks passively receiving simple sensory input and learning to code it or behave “well”; how can we model learning through interaction with complex environments? Why might it be important to do so?

3. The problems we considered so far are very “low-level”, no hint of “complex behaviors” yet. How can we bridge this huge divide? How can we “scale up”? Why is it difficult?