Reinforcement Learning Presentation 2
8/2/2019 Reinforcement Learning Presentation 2
Reinforcement Learning Models
Peter Sords
Reinforcement Learning
Reinforcement learning theories seek to explain how an agent learns to take actions in an environment so as to maximize some notion of cumulative reward
Reinforcement learning differs from standard supervised learning in that correct input/output pairs are never presented, nor are sub-optimal actions explicitly corrected
There is a biological wisdom built into simple observations that gets encoded into our nervous system to compute optimal solutions
Reinforcement Learning
Pavlov in 1927 first demonstrated how cognitive connections can be created or degraded based on predictability in his famous experiment, in which he trained his dogs by ringing a bell (stimulus) before feeding them (reward)
After repeated trials, the dogs started to salivate upon hearing the bell, before the food was delivered
The bell is the conditioned stimulus (CS) and the food is the unconditioned stimulus (US)
Further experimental evidence led to the hypothesis that reinforced learning depends on the temporal difference (TD) between CS and US and on their predictability
Temporal Difference (TD) Model
The difference between the actual reward occurrence and the predicted reward occurrence is referred to as the reward prediction error (RPE)
This concept has been employed in the temporal difference (TD) model to model reinforcement learning
The TD model uses the reward prediction error of a learned event to recalibrate the odds of that event in the future
The goal of the model is to compute a desired prediction signal which reflects the sum of future reinforcement
Reinforcement Learning Models
Reinforcement models (like the TD model) are typically composed of four main components:
1. A reward function, which attributes a desirability value to a state. The reward function is often a one-dimensional scalar r(t)
2. A value function (also known as the critic), which determines the long-term desirability of a state
3. A policy function (also known as the actor), which maps the agent's states to possible actions, using the output of the value function (the reinforcement signal) to determine the best action to choose
4. A model of the environment, which includes a representation of the environment dynamics required for maximizing the sum of future rewards
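The four components above can be sketched as a minimal data structure. The class and field names below (RLComponents, reward, value, policy, model) are illustrative assumptions, not part of the slides:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple

@dataclass
class RLComponents:
    # 1. Reward function: maps a state to a scalar desirability r(t)
    reward: Callable[[int], float]
    # 2. Value function (critic): long-term desirability of each state
    value: Dict[int, float] = field(default_factory=dict)
    # 3. Policy (actor): maps (state, action) to a selection probability
    policy: Dict[Tuple[int, int], float] = field(default_factory=dict)
    # 4. Environment model: (state, action) -> next state
    model: Dict[Tuple[int, int], int] = field(default_factory=dict)

# Toy instance: state 3 is rewarding; action 1 in state 0 leads to state 3
components = RLComponents(reward=lambda s: 1.0 if s == 3 else 0.0)
components.value[0] = 0.5
components.model[(0, 1)] = 3
```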
Dopamine neurons
Dopaminergic (DA) neurons (i.e., neurons whose primary neurotransmitter is dopamine) have been known to play important roles in behavior, motivation, attention, working memory, reward and learning
Wolfram Schultz and colleagues (1997) have shown that the reward prediction error of the TD model resembles midbrain dopamine (DA) neuron activity in situations with predictable rewards
During classical conditioning, unexpected rewards triggered an increase in phasic activity of DA neurons
As learning progresses, the phasic burst shifts from the time of expected reward delivery to the time the conditioned stimulus (CS) is given
The conditioned response to CS onset grows over learning trials, whereas the unconditioned response to actual reward delivery declines over learning trials
After learning, the omission of an expected reward induces a dopamine cell pause, i.e., a depression in firing rate to a below-baseline level, at the time of expected reward delivery
DA neurons encode prediction error
DA activity predicts rewards before they occur rather than reporting them only after the behavior
Schultz and colleagues measured dopamine neural activity when macaque monkeys were presented with distinct visual stimuli that specified both the probability and magnitude of receiving a reward
TD Model: The Critic
The Critic: The TD algorithm is used as an adaptive critic for reinforced learning, to learn an estimate of a value function, the reward prediction V(t), representing expected total future reward from any state
Reward prediction V(t) is computed as the weighted sum over multiple stimulus representations x_m(t)
V(t) is adjusted throughout learning by updating the weights w_m(t) of the incoming neuron units according to the TD model

V(t) = Σ_m w_m(t) x_m(t)
TD Model: The Critic
The TD model represents a temporal stimulus as multiple incoming signals x_m(t), each multiplied by corresponding adaptive weights w_m(t)
TD Model: The Critic
The TD model calculates the reward prediction error δ(t) based on the temporal difference between the current discounted value function V(t) and that of the previous time step, V(t−1)
V(t) = the weighted input from the active unit at time t
r(t) = the reward obtained at time t
γ = the discount factor, which reflects the decreased impact of more distant rewards

δ(t) = r(t) + γV(t) − V(t−1)
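The error computation can be written out directly. A minimal sketch, with an illustrative function name and example values:

```python
# TD reward prediction error: delta(t) = r(t) + gamma*V(t) - V(t-1)
def td_error(r_t: float, v_t: float, v_prev: float, gamma: float = 0.9) -> float:
    return r_t + gamma * v_t - v_prev

# Unexpected reward (nothing was predicted): positive error
print(td_error(r_t=1.0, v_t=0.0, v_prev=0.0))  # 1.0
# Fully predicted reward: error cancels to zero
print(td_error(r_t=1.0, v_t=0.0, v_prev=1.0))  # 0.0
# Predicted reward omitted: negative error (the dopamine "pause")
print(td_error(r_t=0.0, v_t=0.0, v_prev=1.0))  # -1.0
```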
TD Model: The Value Function
The prediction error δ(t) is used to improve the estimate of the reward value V(t) for later trials as follows:

w_m(t) = w_m(t−1) + β δ(t) x_m(t−1)
V(t) ← V(t) + β δ(t)

where β is the learning rate
For a given policy and a sufficiently small β, the TD learning algorithm converges with probability 1
This adjusts the weights w_m(t) of incoming stimuli
If reward prediction V(t) for a stimulus is underestimated, then prediction error δ(t) is positive and the adaptive weight w_m(t) is increased. If the value of the reward is overestimated, then δ(t) is negative and the synaptic weight is decreased
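The weighted-sum prediction and the weight update can be sketched together. Function names and values are illustrative:

```python
# Critic prediction: V(t) = sum_m w_m(t) * x_m(t)
def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

# Weight update: w_m(t) = w_m(t-1) + beta * delta(t) * x_m(t-1)
def update_weights(w, x_prev, delta, beta=0.1):
    return [wi + beta * delta * xi for wi, xi in zip(w, x_prev)]

w = [0.0, 0.0]
x_prev = [1.0, 0.0]   # only the first stimulus unit was active
delta = 1.0           # reward was underestimated -> positive error
w = update_weights(w, x_prev, delta)
print(w)  # [0.1, 0.0]: only the active unit's weight increases
```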
TD Model: The Actor
The goal of a TD learning agent, as for every reinforcement learning agent, is to maximize the accumulated reward it receives over time
The actor module learns a policy π(s,a), which gives the probability of selecting an action a in a state s. A common method of defining a policy is given by the Gibbs softmax distribution:

π(s,a) = e^p(s,a) / Σ_b e^p(s,b)

where p(s,a) is known as the preference for action a in state s and the index b runs over all possible actions in state s
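The Gibbs softmax distribution can be computed directly from a list of action preferences. A minimal sketch with illustrative values:

```python
import math

# Gibbs softmax: pi(s, a) = exp(p(s, a)) / sum_b exp(p(s, b))
def gibbs_softmax(preferences):
    exps = [math.exp(p) for p in preferences]
    total = sum(exps)
    return [e / total for e in exps]

# Three actions with preferences 2.0 > 1.0 > 0.0
probs = gibbs_softmax([2.0, 1.0, 0.0])
print(probs)       # highest preference -> highest selection probability
print(sum(probs))  # the probabilities sum to 1
```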
TD Model: The Actor
The preference for the chosen action a in state s is adjusted to make the selection of this action correspondingly more or less likely the next time the agent visits that state. One possibility for updating the preference in the actor-critic architecture is given by:

p(s_n, a_n) ← p(s_n, a_n) + α δ_n(t)

where α is another learning rate that relates different states to corresponding actions
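The preference update is a one-line rule. The function name and the learning rate below are illustrative:

```python
# Actor update: p(s_n, a_n) <- p(s_n, a_n) + alpha * delta_n(t)
def update_preference(pref, delta, alpha=0.05):
    return pref + alpha * delta

p = 0.0
p = update_preference(p, delta=1.0)   # positive error: action reinforced
print(p)  # 0.05
p = update_preference(p, delta=-1.0)  # negative error: action discouraged
print(p)  # 0.0
```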
The actor (neuron) learns stimulus-action pairs under the influence of the prediction-error signal of the critic (TD model)
The actor module learns a policy π(s,a), which gives the probability of selecting an action a in a state s
Modeling the DA prediction error signal
Drawback of the model: runaway synaptic strength
Thus, the TD model computes the reward prediction error r(t) from discounted temporal differences in the prediction signal p(t) and from the reward signal with the equation r(t) = reward(t−100) − [p(t−100) − γ·p(t)] (time t in msec; 100 msec is the step size of the model implementation). The reward prediction error is phasically increased above baseline levels of zero for rewards and reward-predicting stimuli if these events are unpredicted, but remains at baseline levels if these events are predicted. In addition, if a predicted reward is omitted, the reward prediction error decreases below baseline levels at the time of the predicted reward, when the predicted reward fails to occur.
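The behavior described here, with the error burst migrating from reward time to CS time over learning, can be reproduced with a minimal tabular TD sketch. The trial layout, step indices, and learning rate are illustrative assumptions (not the model implementation from the slides), and the discount γ is set to 1 for simplicity:

```python
# One trial = 8 timesteps; a CS at step 2 predicts a reward at step 5.
# Stimulus units are active only from CS onset onward (one weight per
# post-CS timestep), so no prediction can form before the CS.
CS, REWARD_T, T = 2, 5, 8
beta = 0.3               # learning rate (illustrative value)
w = [0.0] * T            # adaptive weights for post-CS timesteps

def V(t):
    return w[t - CS] if t >= CS else 0.0

def run_trial():
    deltas = [0.0] * T
    for t in range(1, T):
        r = 1.0 if t == REWARD_T else 0.0
        delta = r + V(t) - V(t - 1)   # TD error, gamma = 1
        deltas[t] = delta
        if t - 1 >= CS:               # update the weight of the unit
            w[t - 1 - CS] += beta * delta  # active at t-1
    return deltas

for _ in range(200):
    deltas = run_trial()

# After learning, the error has moved from reward time to CS time:
print(round(deltas[CS], 2))        # ~1.0 (burst at the CS)
print(round(deltas[REWARD_T], 2))  # ~0.0 (predicted reward: no error)
```

Setting the reward probability to zero on a single trial after training would reproduce the below-baseline dip at the expected reward time.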
The agent-environment interface
In reinforcement learning a distinction is made between the elements of a problem that are controllable and those which are only observable. The controllable aspects are said to be modified by an agent, and the observable aspects are said to be sensed through the environment
At each point in time, the environment is in some state, s_t, which is one of a finite set of states
The agent has access to that state, and based on it takes some action, a_t, which is one of a finite set of actions available at a given state, represented by A(s_t)
That action has some effect on the environment, which pushes it into its next state, s_{t+1}
The environment also emits a scalar reward value, r_{t+1}
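The interaction loop described above can be sketched with a made-up two-state environment; all names and dynamics here are illustrative assumptions:

```python
# Environment dynamics: (s_t, a_t) -> (s_{t+1}, r_{t+1})
def step(state, action):
    if state == 0 and action == 1:
        return 1, 1.0        # taking action 1 in state 0 yields a reward
    return 0, 0.0            # everything else resets to state 0, no reward

state = 0
total_reward = 0.0
for t in range(5):
    action = 1                           # a fixed placeholder policy
    state, reward = step(state, action)  # environment emits s_{t+1}, r_{t+1}
    total_reward += reward
print(total_reward)  # 3.0: the agent bounces between the two states
```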
Future Prospects
Further development and understanding of learning models and algorithms can be applied in robotics
Reinforced learning models are useful in computer science for creating algorithms that can self-optimize for solving tasks
The goal of a reinforcement learning algorithm is to choose the best actions for the agent, i.e., to maximize the total reward

R = Σ_{t=0}^∞ r_{t+1}

It is, however, impossible to reason about rewards delivered for all of time, so rewards are weighted such that those delivered sooner are weighted higher, and rewards delivered very far in the future are ignored entirely. To do this, we add a discounting factor γ, with 0 ≤ γ < 1, to the above equation:

R = Σ_{t=0}^∞ γ^t r_{t+1}    (5.1)
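The discounted return of equation (5.1) can be computed directly; the helper name and reward sequence are illustrative:

```python
# Discounted return: R = sum over t of gamma^t * r_{t+1}, with 0 <= gamma < 1
def discounted_return(rewards, gamma=0.9):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [1.0, 1.0, 1.0]
print(discounted_return(rewards))           # ~2.71 = 1 + 0.9 + 0.81
print(discounted_return(rewards, gamma=0))  # 1.0: only the first reward counts
```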
Disturbances of Dopamine
Perhaps the best understood pathology of dopamine excess is drug addiction
Addictive drugs such as cocaine, amphetamine and heroin all increase dopamine concentrations in the NAc (nucleus accumbens) and other forebrain structures
Disturbances of dopamine function are also known to have a central role in schizophrenia
Schizophrenia is associated with a hyper-dopaminergic state
The development of the formal models of dopamine function discussed above, and of its interaction with other brain systems, offers hope for a more sophisticated understanding of how dopamine disturbances produce the patterns of clinical psychopathology observed in schizophrenia
Fiorillo et al. (2003)
The study demonstrated that dopamine may also code for uncertainty
The Fiorillo experiment associated the presentation of five different visual stimuli to macaque monkeys with the delayed, probabilistic (p_r = 0, 0.25, 0.5, 0.75, 1) delivery of juice rewards, where p_r is the probability of receiving the reward
Fiorillo et al. (2003)
They used a delay conditioning paradigm, in which the stimulus persists for a fixed interval of 2 s, with the reward being delivered when the stimulus disappears
TD theory predicts that the phasic activation of the DA cells at the time of the visual stimuli should correspond to the average expected reward, and so should increase with p_r. This is exactly what is seen to occur.
What the current TD model fails to explain for results such as these, however, are the ramping activity between the CS and the reward and the small response at the expected time of reward
Fiorillo et al. (2003)
The resulting neuron activity, after learning had taken place, showed:
(i) a phasic burst of activity, or reward prediction error, at the time of the expected reward, whose magnitude increased as probability decreased; and
(ii) a new, slower, sustained activity, above baseline, related to motivationally relevant stimuli, which developed with increasing levels of uncertainty and varied with reward magnitude
Dopamine neuron firing is greatest when uncertainty of the reward is at a maximum (i.e. at 50% probability)