Reinforcement Learning Presentation 2


Transcript of Reinforcement Learning Presentation 2

  • Slide 1/30

    Reinforcement Learning Models

    Peter Sords

  • Slide 2/30

    Reinforcement Learning

    Reinforcement learning theories seek to explain how an agent learns to take actions in an environment so as to maximize some notion of cumulative reward

    Reinforcement learning differs from standard supervised learning in that correct input/output pairs are never presented, nor are sub-optimal actions explicitly corrected

    There is a biological wisdom built into simple observations that gets encoded into our nervous system to compute optimal solutions

  • Slide 3/30

    Reinforcement Learning

    Pavlov in 1927 first demonstrated how cognitive connections can be created or degraded based on predictability in his famous experiment, in which he trained his dog by ringing a bell (stimulus) before feeding it (reward)

    After repeated trials, the dog started to salivate upon hearing the bell, before the food was delivered

    The bell is the conditioned stimulus (CS) and the food is the unconditioned stimulus (US)

    Further experimental evidence led to the hypothesis that reinforced learning depends on the temporal difference (TD) between CS and US and on the predictability of the reward

  • Slide 4/30

    Temporal Difference (TD) Model

    The difference between the actual reward occurrence and the predicted reward occurrence is referred to as the reward prediction error (RPE)

    This concept has been employed in the temporal difference (TD) model to model reinforcement learning

    The TD model uses the reward prediction error of a learned event to recalibrate the odds of that event in the future

    The goal of the model is to compute a desired prediction signal which reflects the sum of future reinforcement

  • Slide 5/30

    Reinforcement Learning Models

    Reinforcement models (like the TD model) are typically composed of four main components (a minimal code sketch follows this list):

    1. A reward function, which attributes a desirability value to a state. The reward function is often a one-dimensional scalar r(t)

    2. A value function (also known as the critic), which determines the long-term desirability of a state

    3. A policy function (also known as the actor), which maps the agent's states to possible actions, using the output of the value function (the reinforcement signal) to determine the best action to choose

    4. A model of the environment, which includes a representation of the environment dynamics required for maximizing the sum of future rewards
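    As an illustration only (the toy chain environment and all names below are hypothetical, not from the slides), the four components might look like this in Python:

    import random

    N_STATES = 5                        # toy chain: states 0..4, state 4 is the goal

    def reward(state):
        """1. Reward function: scalar desirability r(t) of a state."""
        return 1.0 if state == N_STATES - 1 else 0.0

    V = [0.0] * N_STATES                # 2. Value function (critic): long-term desirability

    def policy(state):
        """3. Policy (actor): here a trivial random choice among the available actions."""
        return random.choice([-1, +1])  # step left or right

    def environment_model(state, action):
        """4. Model of the environment: how a state changes under an action."""
        return min(max(state + action, 0), N_STATES - 1)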

  • Slide 6/30

    Dopamine neurons

    Dopaminergic (DA) neurons (i.e., neurons whose primary neurotransmitter is dopamine) have been known to play important roles in behavior, motivation, attention, working memory, reward and learning

    Wolfram Schultz and colleagues (1997) have shown that the reward prediction error of the TD model resembles midbrain dopamine (DA) neuron activity in situations with predictable rewards

    During classical conditioning, unexpected rewards triggered an increase in phasic activity of DA neurons

    As learning progresses, the phasic burst shifts from the time of expected reward delivery to the time the conditioned stimulus (CS) is given

    The conditioned response to CS onset grows over learning trials, whereas the unconditioned response to actual reward delivery declines over learning trials

    After learning, the omission of an expected reward induces a dopamine cell pause, i.e., a depression in firing rate to a below-baseline level, at the time of expected reward delivery

  • Slide 7/30

    DA neurons encode prediction error

    DA activity predicts rewards before they occur rather than reporting them only after the behavior

    Dopamine neural activity was measured while macaque monkeys were presented with distinct visual stimuli that specified both the probability and magnitude of receiving a reward

  • Slide 8/30

    TD Model: The Critic

    The Critic: The TD algorithm is used as an adaptive critic for reinforced learning, to learn an estimate of a value function, the reward prediction V(t), representing expected total future reward from any state

    The reward prediction V(t) is computed as the weighted sum over multiple stimulus representations x_m(t)

    V(t) is adjusted throughout learning by updating the weights w_m(t) of the incoming neuron units according to the TD model

    V(t) = Σ_m w_m(t) x_m(t)
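    A minimal sketch of this weighted sum (function and variable names are mine, purely illustrative):

    def reward_prediction(weights, stimulus):
        """V(t) = sum_m w_m(t) * x_m(t) over the stimulus representations."""
        return sum(w * x for w, x in zip(weights, stimulus))

    # e.g. three stimulus units, only the second one active at time t
    print(reward_prediction([0.2, 0.5, 0.1], [0.0, 1.0, 0.0]))   # -> 0.5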

  • Slide 9/30

    TD Model: The Critic

    The TD model represents a temporal stimulus as multiple incoming signals x_m(t), each multiplied by a corresponding adaptive weight w_m(t)

  • Slide 10/30

    TD Model: The Critic

    The TD model calculates the reward prediction error δ(t) based on the temporal difference between the current discounted value function V(t) and that of the previous time step, V(t-1)

    V(t) = the weighted input from the active unit at time t

    r(t) = the reward obtained at time t

    γ = the discount factor, which reflects the decreased impact of more distant rewards

    δ(t) = r(t) + γV(t) - V(t-1)
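    A one-line sketch of this error term (hypothetical helper, with a made-up discount factor):

    def td_error(r_t, v_t, v_prev, gamma=0.9):
        """delta(t) = r(t) + gamma * V(t) - V(t-1)."""
        return r_t + gamma * v_t - v_prev

    # Unpredicted reward (nothing was forecast): large positive error
    print(td_error(r_t=1.0, v_t=0.0, v_prev=0.0))   # 1.0
    # Fully predicted reward (V(t-1) already anticipated it): error near zero
    print(td_error(r_t=1.0, v_t=0.0, v_prev=1.0))   # 0.0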

  • Slide 11/30

    TD Model: The Value Function

    The prediction error δ(t) is used to improve the estimate of the reward value V(t) for later trials as follows:

    V(t) ← V(t) + β δ(t)

    where β is the learning rate. For a given policy and a sufficiently small β, the TD learning algorithm converges with probability 1

    This adjusts the weights w_m(t) of incoming stimuli:

    w_m(t) = w_m(t-1) + β δ(t) x_m(t-1)

    If the reward prediction V(t) for a stimulus is underestimated, then the prediction error δ(t) is positive and the adaptive weight w_m(t) is increased. If the value of the reward is overestimated, then δ(t) is negative and the synaptic weight is decreased
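    A minimal sketch of the critic's weight update (illustrative names, assuming the stimulus representation from the previous slides):

    def update_critic(weights, x_prev, delta, beta=0.1):
        """w_m <- w_m + beta * delta * x_m(t-1): units active just before an
        unexpectedly good (or bad) outcome have their weights adjusted."""
        return [w + beta * delta * x for w, x in zip(weights, x_prev)]

    weights = [0.0, 0.0, 0.0]
    x_prev  = [0.0, 1.0, 0.0]    # only unit 2 was active at t-1
    delta   = 1.0                # reward was better than predicted
    print(update_critic(weights, x_prev, delta))   # [0.0, 0.1, 0.0]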

  • Slide 12/30

    TD Model: The Actor

    The goal of a TD learning agent, as for every reinforcement learning agent, is to maximize the accumulated reward it receives over time

    The actor module learns a policy π(s,a), which gives the probability of selecting an action a in a state s. A common method of defining a policy is given by the Gibbs softmax distribution:

    π(s,a) = e^p(s,a) / Σ_b e^p(s,b)

    where p(s,a) is known as the preference of action a in state s and the index b runs over all possible actions in state s
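    A small sketch of the softmax policy and of sampling an action from it (helper names are mine):

    import math, random

    def softmax_policy(preferences):
        """pi(s,a) = exp(p(s,a)) / sum_b exp(p(s,b)) over the actions available in s."""
        exps = [math.exp(p) for p in preferences]
        total = sum(exps)
        return [e / total for e in exps]

    def choose_action(preferences):
        probs = softmax_policy(preferences)
        return random.choices(range(len(preferences)), weights=probs, k=1)[0]

    print(softmax_policy([0.0, 1.0, 0.0]))   # action 1 is preferred but not certain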

  • Slide 13/30

    TD Model: The Actor

    The preference of the chosen action a in state s is adjusted to make the selection of this action correspondingly more or less likely the next time the agent visits that state. One possibility to update the preference in the actor-critic architecture is given by:

    p(s_n, a_n) ← p(s_n, a_n) + α δ_n(t)

    where α is another learning rate that relates different states to corresponding actions
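    A corresponding actor-update sketch (hypothetical names; preferences indexed by state and action):

    def update_actor(preferences, state, action, delta, alpha=0.05):
        """p(s,a) <- p(s,a) + alpha * delta: actions followed by a positive prediction
        error become more likely under the softmax policy, and vice versa."""
        preferences[state][action] += alpha * delta

    prefs = {"s0": [0.0, 0.0]}        # two possible actions in state "s0"
    update_actor(prefs, "s0", 1, delta=1.0)
    print(prefs)                      # {'s0': [0.0, 0.05]}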

  • Slide 14/30

    The actor (neuron) learns stimulus-action pairs under the influence of the prediction-error signal of the critic (TD model)

    The actor module learns a policy π(s,a), which gives the probability of selecting an action a in a state s

  • Slide 15/30

    Modeling the DA prediction error signal

    Drawback of the model: runaway synaptic strength

  • Slides 16–21/30

  • Slide 22/30

    Thus, the TD model computes the reward prediction error r(t) from discounted temporal differences in the prediction signal p(t) and from the reward signal with the equation r(t) = reward(t-100) - [p(t-100) - γ·p(t)] (time t in msec, 100 msec is the step size of the model implementation). The reward prediction error is phasically increased above baseline levels of zero for rewards and reward-predicting stimuli if these events are unpredicted, but remains at baseline levels if these events are predicted. In addition, if a predicted reward is omitted, the reward prediction error decreases below baseline levels at the time of the predicted reward, when the predicted reward fails to occur.
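    As a rough numeric illustration of these three cases (the 0.98 discount factor and the prediction values are made up for the example):

    GAMMA = 0.98       # assumed discount factor per 100 msec step

    def rpe(reward_prev, p_prev, p_now):
        """r(t) = reward(t-100) - [p(t-100) - gamma * p(t)]"""
        return reward_prev - (p_prev - GAMMA * p_now)

    print(rpe(1.0, 0.0, 0.0))    # unpredicted reward   -> +1.0  (phasic increase)
    print(rpe(1.0, 0.98, 0.0))   # predicted reward     -> ~0    (baseline)
    print(rpe(0.0, 0.98, 0.0))   # omitted prediction   -> -0.98 (below baseline)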

  • Slide 23/30

    The agent-environment interface

    In reinforcement learning a distinction is made between the elements of a problem that are controllable and those which are only observable. The controllable aspects are said to be modified by an agent, and the observable aspects are said to be sensed through the environment

  • Slide 24/30

    At each point in time, the environment is in some state, s_t, which is one of a finite set of states

    The agent has access to that state and, based on it, takes some action, a_t, which is one of a finite set of actions available at a given state, represented by A(s_t)

    That action has some effect on the environment, which pushes it into its next state, s_{t+1}

    The environment also emits a scalar reward value, r_{t+1} (the loop sketched below illustrates this cycle)
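    A minimal, self-contained sketch of the agent-environment loop (the toy chain environment and the random agent are stand-ins, not from the slides):

    import random

    def env_step(state, action):
        """Toy 1-D chain: states 0..4, goal at state 4; returns s_{t+1} and r_{t+1}."""
        next_state = min(max(state + action, 0), 4)
        reward = 1.0 if next_state == 4 else 0.0
        return next_state, reward

    state = 0
    for t in range(10):
        action = random.choice([-1, +1])              # a_t drawn from A(s_t)
        next_state, reward = env_step(state, action)  # environment emits s_{t+1}, r_{t+1}
        print(t, state, action, reward)
        state = next_state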

  • Slide 25/30

    Future Prospects

    Further development and understanding of learning models and algorithms can be applied in robotics

    Reinforced learning models are useful in computer science to create algorithms that can self-optimize for solving tasks

  • Slide 26/30

    The goal of a reinforcement learning algorithm is to choose the best actions for the agent, i.e., to maximize the total reward

    R = Σ_{t=0}^{∞} r_{t+1}

    It is, however, impossible to reason about rewards delivered for all of time, so rewards are weighted such that those delivered sooner are weighted higher, and rewards delivered very far in the future are ignored entirely. To do this, we add a discounting factor γ < 1 to the above equation:

    R = Σ_{t=0}^{∞} γ^t r_{t+1}
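    A small numeric illustration (the reward sequence and γ = 0.9 are made up for the example):

    GAMMA = 0.9
    rewards = [1.0, 0.0, 0.0, 1.0, 1.0]   # hypothetical r_1 ... r_5

    # R = sum_t gamma^t * r_{t+1}: earlier rewards count more than later ones
    R = sum(GAMMA ** t * r for t, r in enumerate(rewards))
    print(round(R, 3))                    # 1 + 0.9**3 + 0.9**4 = 2.385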

  • Slide 27/30

    Disturbances of Dopamine

    Perhaps the best understood pathology of dopamine excess is drug addiction

    Addictive drugs such as cocaine, amphetamine and heroin all increase dopamine concentrations in the NAc and other forebrain structures

    Disturbances of dopamine function are also known to have a central role in schizophrenia

    Schizophrenia is associated with a hyper-dopaminergic state

    The development of the formal models of dopamine function discussed above, and its interaction with other brain systems, offers hope for a more sophisticated understanding of how dopamine disturbances produce the patterns of clinical psychopathology observed in schizophrenia

  • Slide 28/30

    Fiorillo et al. (2003)

    The study demonstrated that dopamine may also code for uncertainty

    The Fiorillo experiment associated the presentation of five different visual stimuli to macaque monkeys with the delayed, probabilistic (p_r = 0, 0.25, 0.5, 0.75, 1) delivery of juice rewards, where p_r is the probability of receiving the reward

  • Slide 29/30

    Fiorillo et al. (2003)

    They used a delay conditioning paradigm, in which the stimulus persists for a fixed interval of 2 s, with the reward being delivered when the stimulus disappears

    TD theory predicts that the phasic activation of the DA cells at the time of the visual stimuli should correspond to the average expected reward, and so should increase with p_r. This is exactly what is seen to occur.

    What the current TD model fails to explain for results such as these, however, are the ramping activity between the CS and reward and the small response at the expected time of reward

  • Slide 30/30

    Fiorillo et al. (2003)

    The resulting neuron activity, after learning had taken place, showed:

    (i) a phasic burst of activity, or reward prediction error, at the time of the expected reward, whose magnitude increased as probability decreased; and

    (ii) a new slower, sustained activity, above baseline, related to motivationally relevant stimuli, which developed with increasing levels of uncertainty, and varied with reward magnitude

    Dopamine neuron firing is greatest when uncertainty of the reward is at a maximum (i.e. at 50% probability)