Asymmetric Coding of Temporal Difference Errors: Implications for Dopamine Firing Patterns

Y. Niv (1,2), M.O. Duff (2) and P. Dayan (2)
(1) Interdisciplinary Center for Neural Computation, Hebrew University, Jerusalem, [email protected]
(2) Gatsby Computational Neuroscience Unit, University College London






Introduction: What does phasic Dopamine encode?

With asymmetric coding of errors, the mean TD error at the time of reward is proportional to p(1-p) -> indeed maximal at p = 50%
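A short check of this claim (a sketch; d denotes the factor by which negative errors are scaled before summation, d = 1/6 in the model panel below): at reward time the TD error is 1-p on rewarded trials (probability p) and -p on unrewarded trials (probability 1-p), so the measured mean is

$$\bar{\delta}_{\mathrm{reward}} = p(1-p)\cdot 1 + (1-p)(-p)\cdot d = p(1-p)(1-d),$$

which is proportional to p(1-p) and hence maximal at p = 50%.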

[Plot: TD error; y-axis -0.2 to 1, x-axis 0 to 30 timesteps]

[Figure: dopamine-related brain areas – Dorsal Striatum (Caudate, Putamen), Ventral Tegmental Area, Substantia Nigra, Amygdala, Nucleus Accumbens (Ventral Striatum), Prefrontal Cortex]

Classical conditioning paradigm (delay conditioning) using probabilistic outcomes -> generates ongoing prediction errors in a learned task

Single DA cell recordings in VTA/SNc:

• At stimulus time, DA represents the mean expected reward (consistent with the TD hypothesis)

• Surprising ramping of activity in the delay

-> Fiorillo et al.’s hypothesis: Coding of uncertainty

However:

• No prediction error to 'justify' the ramp

• TD learning predicts away any predictable quantity

• Uncertainty not available for control

-> The uncertainty hypothesis seems contradictory to the TD hypothesis

DA single cell recordings from the lab of Wolfram Schultz

Overview
• Substantial evidence suggests that phasic dopaminergic firing represents a temporal difference (TD) error in the predictions of future reward.

• Recent experiments probe the way information about outcomes propagates back to the stimuli predicting them. These use stochastic rewards (e.g., Fiorillo et al., 2003), which allow systematic study of persistent prediction errors even in well-learned tasks.

• We use a novel theoretical analysis to show that across-trials ramping in DA activity may be a signature of this process. Importantly, we address the asymmetric coding in DA activity of positive and negative TD errors, and acknowledge the constant learning that results from ongoing prediction errors.

Selected References
[1] Fiorillo, Tobler & Schultz (2003). Discrete coding of reward probability and uncertainty by dopamine neurons. Science, 299, 1898-1902.
[2] Morris, Arkadir, Nevet, Vaadia & Bergman (2004). Coincident but distinct messages of midbrain dopamine and striatal tonically active neurons. Neuron, 43, 133-143.
[3] Montague, Dayan & Sejnowski (1996). A framework for mesencephalic dopamine systems based on predictive Hebbian learning. J Neurosci, 16, 1936-1947.
[4] Sutton & Barto (1998). Reinforcement Learning: An Introduction. MIT Press.

Acknowledgements
This research was funded by an EC Thematic Network short-term fellowship to YN and by The Gatsby Charitable Foundation.


Simulating TD with asymmetric coding

Unpredicted reward (neutral/no stimulus)
Predicted reward (learned task)

-> DA encodes a temporally sophisticated reward signal

Computational hypothesis – DA encodes reward prediction error:

(Sutton & Barto, 1987; Montague, Dayan & Sejnowski, 1996)

The value of a state is the expected sum of future rewards:

$$V(t) = \left\langle r(t) + r(t+1) + r(t+2) + r(t+3) + \dots \right\rangle$$

Comparing this prediction with the one-step bootstrapped estimate $r(t) + V(t+1)$ gives the temporal difference error:

$$\delta(t) = r(t) + V(t+1) - V(t)$$

which is used to update $V(t)$.

-> Phasic DA encodes reward prediction error

• Precise computational theory for generation of DA firing patterns
• Compelling account for role of DA in appetitive conditioning
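A minimal simulation sketch of this hypothesis (an illustration, not the authors' code; the trial length, stimulus and reward timesteps, and learning rate eta are assumed values):

```python
import numpy as np

T = 30                  # timesteps per trial (assumed)
T_STIM, T_REW = 5, 25   # stimulus and reward timesteps (assumed)

def run_trial(V, rewarded, eta=0.1):
    """One delay-conditioning trial. V[t] estimates the summed future
    reward from timestep t; returns the TD error delta(t) per timestep."""
    delta = np.zeros(T)
    for t in range(T_STIM, T - 1):
        r = 1.0 if (rewarded and t == T_REW) else 0.0
        delta[t] = r + V[t + 1] - V[t]   # delta(t) = r(t) + V(t+1) - V(t)
        V[t] += eta * delta[t]           # learn from the prediction error
    return delta

rng = np.random.default_rng(0)
p, V = 0.5, np.zeros(T)
for _ in range(1000):                    # learn the p = 50% delay task
    run_trial(V, rewarded=rng.random() < p)
```

After learning, the error at reward time remains 1-p on rewarded trials and -p on unrewarded ones, so probabilistic outcomes generate ongoing prediction errors even in a well-learned task.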

Experimental results: measuring propagating errors – Fiorillo et al. (2003)

2 sec visual stimulus indicating reward probability: 100%, 75%, 50%, 25% or 0%

Probabilistic reward (drops of juice)

A TD resolution: Ramps result from backpropagating prediction errors

• Note that according to TD, activity at the time of reward should cancel out on average – but it doesn't. This is because:
• Prediction errors can be positive or negative
• However, firing rate is positive -> encoding of negative errors is relative to baseline activity
• But: baseline activity in DA cells is low (2-5 Hz)
-> asymmetric coding of errors
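One simple way to write this asymmetry (an assumed formalization, consistent with the d = 1/6 scaling used in the model panel below):

$$\mathrm{DA}(t) \propto \begin{cases} \delta(t), & \delta(t) \ge 0 \\ d\,\delta(t), & \delta(t) < 0, \end{cases} \qquad d \approx 1/6,$$

so negative errors are compressed toward the low baseline rather than mirrored symmetrically.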

[Figure: Experiment vs. Model PSTHs; omitted reward (probe trial)]

Negative δ(t) scaled by d=1/6 prior to PSTH summation

Learning proceeds normally (without scaling):

• Necessary to produce the right predictions

• Can be biologically plausible
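Continuing the sketch above, the asymmetric summation can be illustrated as follows (again an assumption-laden illustration; only the scaling factor d = 1/6 is taken from the poster):

```python
def simulated_psth(V, p, n_trials=5000, d=1/6, eta=0.1, seed=1):
    """Average TD errors across trials, scaling negative errors by d
    before summation; learning itself uses the unscaled errors."""
    rng = np.random.default_rng(seed)
    traces = []
    for _ in range(n_trials):
        delta = run_trial(V, rewarded=rng.random() < p, eta=eta)
        traces.append(np.where(delta < 0, d * delta, delta))
    return np.mean(traces, axis=0)

psth = simulated_psth(V, p=0.5)   # V pre-trained as in the sketch above
# The averaged trace ramps up through the delay and is ~p(1-p)(1-d)
# at reward time, while each individual trial contains no ramp.
```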

Conclusion: Uncertainty or Temporal Difference?

Trace conditioning: A puzzle solved
Short visual stimulus

Trace period

Reward (probabilistic) = drops of juice

• Same (if not more) uncertainty
• But: no DA ramping

Morris et al. (2004) (see also Fiorillo et al. (2003))

Solution: The lower learning rate in trace conditioning eliminates the ramp: with a learning rate near zero, values barely change from trial to trial, so there are almost no backpropagating errors to sum.

Indeed: the computed learning rate in Morris et al.'s data is near zero (personal communication)
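In the simulation sketch above (an illustration, not Morris et al.'s analysis), this corresponds to calling simulated_psth with eta near zero: with almost no trial-to-trial updating of V, the within-delay prediction errors vanish and only the reward-time response survives in the average.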

However:
• No need to assume explicit coding of uncertainty – ramping in DA activity is explained by neural constraints.
• Explains the puzzling absence of a ramp in trace conditioning results.
• Experimental tests: Is the ramp a within-trial or a between-trial phenomenon? How does ramp size relate to learning rate (within/between experiments)?

Challenges to TD remain: TD and noise; Conditioned inhibition; additivity…

Visualizing Temporal-Difference Learning:
[Figure: four panels showing TD error and values (0 to 1) over 30 timesteps – after first trial, after third trial, task learned, learning continues (~10 trials)]

[Schematic: stimulus representation x(1), x(2), …; reward r(t); TD error δ(t); values V(1)…V(30)]
[Figure: DA activity vs. δ(t); scaling labels 55% and 270%; data from Bayer and Glimcher and from the Schultz lab]

-> Ongoing (intertwined) backpropagation of asymmetrically coded positive and negative errors causes ramps to appear in the summed PSTH

-> The ramp itself is a between-trial and not a within-trial phenomenon (it results from summation over different reward histories)

[Figure labels: p = 50% traces summed ×50% rewarded and ×50% unrewarded; p = 75% traces summed ×75% rewarded and ×25% unrewarded]