Reinforcement Learning Presentation 2


Transcript of Reinforcement Learning Presentation 2

  • Slide 1/30

    Reinforcement Learning Models

    Peter Sords

  • Slide 2/30

    Reinforcement Learning

    Reinforcement learning theories seek to explain how an agent learns to take actions in an environment so as to maximize some notion of cumulative reward

    Reinforcement learning differs from standard supervised learning in that correct input/output pairs are never presented, nor are sub-optimal actions explicitly corrected

    There is a biological wisdom built into simple observations that gets encoded into our nervous system to compute optimal solutions

  • Slide 3/30

    Reinforcement Learning

    Pavlov in 1927 first demonstrated how cognitive connections can be created or degraded based on predictability in his famous experiment, in which he trained his dog by ringing a bell (stimulus) before feeding it (reward)

    After repeated trials, the dog started to salivate upon hearing the bell, before the food was delivered

    The bell is the conditioned stimulus (CS) and the food is the unconditioned stimulus (US)

    Further experimental evidence led to the hypothesis that reinforced learning depends on the temporal difference (TD) between CS and US and on the predictability of the reward

  • Slide 4/30

    Temporal Difference (TD) Model

    The difference between the actual reward occurrence and the predicted reward occurrence is referred to as the reward prediction error (RPE)

    This concept has been employed in the temporal difference (TD) model to model reinforcement learning

    The TD model uses the reward prediction error of a learned event to recalibrate the odds of that event in the future

    The goal of the model is to compute a desired prediction signal which reflects the sum of future reinforcement

  • Slide 5/30

    Reinforcement Learning Models

    Reinforcement models (like the TD model) are typically composed of four main components (a minimal code sketch follows this list):

    1. A reward function, which attributes a desirability value to a state. The reward function is often a one-dimensional scalar r(t)

    2. A value function (also known as the critic), which determines the long-term desirability of a state

    3. A policy function (also known as the actor), which maps the agent's states to possible actions, using the output of the value function (the reinforcement signal) to determine the best action to choose

    4. A model of the environment, which includes a representation of the environment dynamics required for maximizing the sum of future rewards
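    As an illustration only (the toy chain environment and all names below are hypothetical, not from the slides), the four components might look like this in Python:

    import random

    N_STATES = 5                        # toy chain: states 0..4, state 4 is the goal

    def reward(state):
        """1. Reward function: scalar desirability r(t) of a state."""
        return 1.0 if state == N_STATES - 1 else 0.0

    V = [0.0] * N_STATES                # 2. Value function (critic): long-term desirability

    def policy(state):
        """3. Policy (actor): here a trivial random choice among the available actions."""
        return random.choice([-1, +1])  # step left or right

    def environment_model(state, action):
        """4. Model of the environment: how a state changes under an action."""
        return min(max(state + action, 0), N_STATES - 1)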

  • Slide 6/30

    Dopamine neurons

    Dopaminergic (DA) neurons (i.e., neurons whose primary neurotransmitter is dopamine) have been known to play important roles in behavior, motivation, attention, working memory, reward and learning

    Wolfram Schultz and colleagues (1997) have shown that the reward prediction error of the TD model resembles midbrain dopamine (DA) neuron activity in situations with predictable rewards

    During classical conditioning, unexpected rewards triggered an increase in phasic activity of DA neurons

    As learning progresses, the phasic burst shifts from the time of expected reward delivery to the time the conditioned stimulus (CS) is given

    The conditioned response to CS onset grows over learning trials, whereas the unconditioned response to actual reward delivery declines over learning trials

    After learning, the omission of an expected reward induces a dopamine cell pause, i.e., a depression in firing rate to a below-baseline level, at the time of expected reward delivery

  • Slide 7/30

    DA neurons encode prediction error

    DA activity predicts rewards before they occur rather than reporting them only after the behavior

    Dopamine neural activity was measured while macaque monkeys were presented with distinct visual stimuli that specified both the probability and magnitude of receiving a reward

  • Slide 8/30

    TD Model: The Critic

    The Critic: The TD algorithm is used as an adaptive critic for reinforced learning, to learn an estimate of a value function, the reward prediction V(t), representing expected total future reward from any state

    The reward prediction V(t) is computed as the weighted sum over multiple stimulus representations x_m(t)

    V(t) is adjusted throughout learning by updating the weights w_m(t) of the incoming neuron units according to the TD model

    V(t) = Σ_m w_m(t) x_m(t)
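    A minimal sketch of this weighted sum (function and variable names are mine, purely illustrative):

    def reward_prediction(weights, stimulus):
        """V(t) = sum_m w_m(t) * x_m(t) over the stimulus representations."""
        return sum(w * x for w, x in zip(weights, stimulus))

    # e.g. three stimulus units, only the second one active at time t
    print(reward_prediction([0.2, 0.5, 0.1], [0.0, 1.0, 0.0]))   # -> 0.5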

  • Slide 9/30

    TD Model: The Critic

    The TD model represents a temporal stimulus as multiple incoming signals x_m(t), each multiplied by a corresponding adaptive weight w_m(t)

  • Slide 10/30

    TD Model: The Critic

    The TD model calculates the reward prediction error δ(t) based on the temporal difference between the current discounted value function V(t) and that of the previous time step, V(t-1)

    V(t) = the weighted input from the active unit at time t

    r(t) = the reward obtained at time t

    γ = the discount factor, which reflects the decreased impact of more distant rewards

    δ(t) = r(t) + γV(t) - V(t-1)
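    A one-line sketch of this error term (hypothetical helper, with a made-up discount factor):

    def td_error(r_t, v_t, v_prev, gamma=0.9):
        """delta(t) = r(t) + gamma * V(t) - V(t-1)."""
        return r_t + gamma * v_t - v_prev

    # Unpredicted reward (nothing was forecast): large positive error
    print(td_error(r_t=1.0, v_t=0.0, v_prev=0.0))   # 1.0
    # Fully predicted reward (V(t-1) already anticipated it): error near zero
    print(td_error(r_t=1.0, v_t=0.0, v_prev=1.0))   # 0.0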

  • Slide 11/30

    TD Model: The Value Function

    The prediction error δ(t) is used to improve the estimate of the reward value V(t) for later trials as follows:

    V(t) ← V(t) + β δ(t)

    where β is the learning rate. For a given policy and a sufficiently small β, the TD learning algorithm converges with probability 1

    This adjusts the weights w_m(t) of incoming stimuli:

    w_m(t) = w_m(t-1) + β δ(t) x_m(t-1)

    If the reward prediction V(t) for a stimulus is underestimated, then the prediction error δ(t) is positive and the adaptive weight w_m(t) is increased. If the value of the reward is overestimated, then δ(t) is negative and the synaptic weight is decreased
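    A minimal sketch of the critic's weight update (illustrative names, assuming the stimulus representation from the previous slides):

    def update_critic(weights, x_prev, delta, beta=0.1):
        """w_m <- w_m + beta * delta * x_m(t-1): units active just before an
        unexpectedly good (or bad) outcome have their weights adjusted."""
        return [w + beta * delta * x for w, x in zip(weights, x_prev)]

    weights = [0.0, 0.0, 0.0]
    x_prev  = [0.0, 1.0, 0.0]    # only unit 2 was active at t-1
    delta   = 1.0                # reward was better than predicted
    print(update_critic(weights, x_prev, delta))   # [0.0, 0.1, 0.0]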

  • Slide 12/30

    TD Model: The Actor

    The goal of a TD learning agent, as for every reinforcement learning agent, is to maximize the accumulated reward it receives over time

    The actor module learns a policy π(s,a), which gives the probability of selecting an action a in a state s. A common method of defining a policy is given by the Gibbs softmax distribution:

    π(s,a) = e^p(s,a) / Σ_b e^p(s,b)

    where p(s,a) is known as the preference of action a in state s and the index b runs over all possible actions in state s
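    A small sketch of the softmax policy and of sampling an action from it (helper names are mine):

    import math, random

    def softmax_policy(preferences):
        """pi(s,a) = exp(p(s,a)) / sum_b exp(p(s,b)) over the actions available in s."""
        exps = [math.exp(p) for p in preferences]
        total = sum(exps)
        return [e / total for e in exps]

    def choose_action(preferences):
        probs = softmax_policy(preferences)
        return random.choices(range(len(preferences)), weights=probs, k=1)[0]

    print(softmax_policy([0.0, 1.0, 0.0]))   # action 1 is preferred but not certain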

  • Slide 13/30

    TD Model: The Actor

    The preference of the chosen action a in state s is adjusted to make the selection of this action correspondingly more or less likely the next time the agent visits that state. One possibility to update the preference in the actor-critic architecture is given by:

    p(s_n, a_n) ← p(s_n, a_n) + α δ_n(t)

    where α is another learning rate that relates different states to corresponding actions
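    A corresponding actor-update sketch (hypothetical names; preferences indexed by state and action):

    def update_actor(preferences, state, action, delta, alpha=0.05):
        """p(s,a) <- p(s,a) + alpha * delta: actions followed by a positive prediction
        error become more likely under the softmax policy, and vice versa."""
        preferences[state][action] += alpha * delta

    prefs = {"s0": [0.0, 0.0]}        # two possible actions in state "s0"
    update_actor(prefs, "s0", 1, delta=1.0)
    print(prefs)                      # {'s0': [0.0, 0.05]}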

  • Slide 14/30

    The actor (neuron) learns stimulus-action pairs under the influence of the prediction-error signal of the critic (TD model)

    The actor module learns a policy π(s,a), which gives the probability of selecting an action a in a state s

  • Slide 15/30

    Modeling the DA prediction error signal

    Drawback of the model: runaway synaptic strength

  • Slides 16–21/30

  • Slide 22/30

    Thus, the TD model computes the reward prediction error r(t) from discounted temporal differences in the prediction signal p(t) and from the reward signal with the equation r(t) = reward(t-100) - [p(t-100) - γ·p(t)] (time t in msec, 100 msec is the step size of the model implementation). The reward prediction error is phasically increased above baseline levels of zero for rewards and reward-predicting stimuli if these events are unpredicted, but remains at baseline levels if these events are predicted. In addition, if a predicted reward is omitted, the reward prediction error decreases below baseline levels at the time of the predicted reward, when the predicted reward fails to occur.
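    As a rough numeric illustration of these three cases (the 0.98 discount factor and the prediction values are made up for the example):

    GAMMA = 0.98       # assumed discount factor per 100 msec step

    def rpe(reward_prev, p_prev, p_now):
        """r(t) = reward(t-100) - [p(t-100) - gamma * p(t)]"""
        return reward_prev - (p_prev - GAMMA * p_now)

    print(rpe(1.0, 0.0, 0.0))    # unpredicted reward   -> +1.0  (phasic increase)
    print(rpe(1.0, 0.98, 0.0))   # predicted reward     -> ~0    (baseline)
    print(rpe(0.0, 0.98, 0.0))   # omitted prediction   -> -0.98 (below baseline)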

  • Slide 23/30

    The agent-environment interface

    In reinforcement learning a distinction is made between the elements of a problem that are controllable and those which are only observable. The controllable aspects are said to be modified by an agent, and the observable aspects are said to be sensed through the environment

  • Slide 24/30

    At each point in time, the environment is in some state, s_t, which is one of a finite set of states

    The agent has access to that state and, based on it, takes some action, a_t, which is one of a finite set of actions available at a given state, represented by A(s_t)

    That action has some effect on the environment, which pushes it into its next state, s_{t+1}

    The environment also emits a scalar reward value, r_{t+1} (the loop sketched below illustrates this cycle)
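    A minimal, self-contained sketch of the agent-environment loop (the toy chain environment and the random agent are stand-ins, not from the slides):

    import random

    def env_step(state, action):
        """Toy 1-D chain: states 0..4, goal at state 4; returns s_{t+1} and r_{t+1}."""
        next_state = min(max(state + action, 0), 4)
        reward = 1.0 if next_state == 4 else 0.0
        return next_state, reward

    state = 0
    for t in range(10):
        action = random.choice([-1, +1])              # a_t drawn from A(s_t)
        next_state, reward = env_step(state, action)  # environment emits s_{t+1}, r_{t+1}
        print(t, state, action, reward)
        state = next_state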

  • Slide 25/30

    Future Prospects

    Further development and understanding of learning models and algorithms can be applied in robotics

    Reinforced learning models are useful in computer science to create algorithms that can self-optimize for solving tasks

  • Slide 26/30

    The goal of a reinforcement learning algorithm is to choose the best actions for the agent, i.e., to maximize the total reward

    R = Σ_{t=0}^{∞} r_{t+1}

    It is, however, impossible to reason about rewards delivered for all of time, so rewards are weighted such that those delivered sooner are weighted higher, and rewards delivered very far in the future are ignored entirely. To do this, we add a discounting factor γ < 1 to the above equation:

    R = Σ_{t=0}^{∞} γ^t r_{t+1}
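    A small numeric illustration (the reward sequence and γ = 0.9 are made up for the example):

    GAMMA = 0.9
    rewards = [1.0, 0.0, 0.0, 1.0, 1.0]   # hypothetical r_1 ... r_5

    # R = sum_t gamma^t * r_{t+1}: earlier rewards count more than later ones
    R = sum(GAMMA ** t * r for t, r in enumerate(rewards))
    print(round(R, 3))                    # 1 + 0.9**3 + 0.9**4 = 2.385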

  • Slide 27/30

    Disturbances of Dopamine

    Perhaps the best understood pathology of dopamine excess is drug addiction

    Addictive drugs such as cocaine, amphetamine and heroin all increase dopamine concentrations in the NAc and other forebrain structures

    Disturbances of dopamine function are also known to have a central role in schizophrenia

    Schizophrenia is associated with a hyper-dopaminergic state

    The development of the formal models of dopamine function discussed above, and its interaction with other brain systems, offers hope for a more sophisticated understanding of how dopamine disturbances produce the patterns of clinical psychopathology observed in schizophrenia

  • Slide 28/30

    Fiorillo et al. (2003)

    The study demonstrated that dopamine may also code for uncertainty

    The Fiorillo experiment associated the presentation of five different visual stimuli to macaque monkeys with the delayed, probabilistic (p_r = 0, 0.25, 0.5, 0.75, 1) delivery of juice rewards, where p_r is the probability of receiving the reward

  • Slide 29/30

    Fiorillo et al. (2003)

    They used a delay conditioning paradigm, in which the stimulus persists for a fixed interval of 2 s, with the reward being delivered when the stimulus disappears

    TD theory predicts that the phasic activation of the DA cells at the time of the visual stimuli should correspond to the average expected reward, and so should increase with p_r. This is exactly what is seen to occur.

    What the current TD model fails to explain for results such as these, however, are the ramping activity between the CS and reward and the small response at the expected time of reward

  • Slide 30/30

    Fiorillo et al. (2003)

    The resulting neuron activity, after learning had taken place, showed:

    (i) a phasic burst of activity, or reward prediction error, at the time of the expected reward, whose magnitude increased as probability decreased; and

    (ii) a new slower, sustained activity, above baseline, related to motivationally relevant stimuli, which developed with increasing levels of uncertainty, and varied with reward magnitude

    Dopamine neuron firing is greatest when uncertainty of the reward is at a maximum (i.e. at 50% probability)