Computational Investigations of the Regulative Role of
Pleasure in Adaptive Behavior
Action-Selection Biased by Pleasure-Regulated Simulated
Interaction
Joost Broekens,
Fons J Verbeek,
LIACS, Leiden University, The Netherlands.
Overview
[Architecture diagram: the ENVIRONMENT exchanges actions, percepts and reinforcement with the agent. The distributed-state RL model drives reactive behavior via action-selection; cognitive influence comes from interaction-selection and an emotion process (pleasure, triggered by stimuli), which feed simulated interactions and simulated reinforcement back into the model based on its predicted interactions.]
Distributed-state RL model
Learning
• The agent's memory structure is modelled as a directed graph.
• It constructs a distributed state-prediction of next states.
• It learns through continuous interaction:
– The memory is adapted while the agent interacts with its environment.
– The agent selects an action a, executes it, and combines it with the resulting perception p into a situation s1 = <a, p>. Memory adds s1 if it does not yet exist.
– It performs another action, resulting in s2, adds s2, and connects s1 to s2 by creating an interactron node I1.
– This process is applied recursively; interactrons are used to predict situations.
– (!Encoding situations in this way is too symbolic for real-world applications!)
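The memory-graph construction described above can be sketched as follows; the class and method names (Memory, observe) are illustrative stand-ins, not the authors' code.

```python
# Minimal sketch of the memory graph: situations s = <action, percept> are
# nodes, and consecutive situations are linked by interactron nodes.

class Memory:
    def __init__(self):
        self.situations = set()   # situations s = (action, percept)
        self.interactrons = {}    # (s_prev, s_next) -> interactron id

    def observe(self, action, percept, prev_situation=None):
        """Add situation s = <a, p> if it does not yet exist; link it."""
        s = (action, percept)
        self.situations.add(s)
        if prev_situation is not None:
            # Connect consecutive situations with an interactron node.
            key = (prev_situation, s)
            if key not in self.interactrons:
                self.interactrons[key] = f"I{len(self.interactrons) + 1}"
        return s

mem = Memory()
s1 = mem.observe("up", "path")
s2 = mem.observe("right", "food", prev_situation=s1)
# s1 and s2 are now connected via interactron I1.
```

Applied recursively, interactrons themselves could be linked the same way to predict longer sequences of situations.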
[Figure, panels a–e: stepwise construction of the memory graph, from a single situation s1 to situations s1–s3 connected by interactrons I1–I3.]
Learning: example
• Agent environment: a grid world consisting of
– lava (red), r = -1; the agent can walk on lava but is discouraged from doing so,
– food (yellow), r = 1,
– the agent (black),
– path (white), r = 0.
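A minimal sketch of such a grid world; the cell rewards come from the slide, while the layout itself is an illustrative example.

```python
# Rewards per cell type, as given on the slide.
REWARD = {"lava": -1.0, "food": 1.0, "path": 0.0}

# Example layout (illustrative; the actual grid is not specified here).
grid = [
    ["path", "lava", "path"],
    ["path", "path", "food"],
]

def reward_at(row, col):
    """Reinforcement the agent receives for entering cell (row, col)."""
    return REWARD[grid[row][col]]
```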
Learning: reinforcement
• Learning to predict values follows standard RL techniques (Sutton and Barto, 1998), except that learning is per interactron (node) and there are two prediction values per node.
• Every interactron (representing a sequence of occurred situations) maintains
– a direct reinforcement value, the interactron's own predicted value, updated by the reinforcement r from the environment,
– a lazily propagated indirect reinforcement value that estimates future reinforcement based on the predicted next interactions, and
– a combined value μ = direct + indirect.
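The two per-interactron values can be sketched as below, assuming a standard TD-style update in the spirit of Sutton and Barto (1998); the variable names, learning rate alpha and discount gamma are our own illustrative choices.

```python
# Sketch of the two prediction values maintained per interactron.

class Interactron:
    def __init__(self, alpha=0.1, gamma=0.9):
        self.direct = 0.0    # own predicted value, updated by environment r
        self.indirect = 0.0  # lazily propagated estimate of future reinforcement
        self.alpha, self.gamma = alpha, gamma

    def update_direct(self, r):
        # Move the direct value toward the observed reinforcement r.
        self.direct += self.alpha * (r - self.direct)

    def update_indirect(self, predicted_next):
        # Estimate future reinforcement from the predicted next interactrons.
        if predicted_next:
            best = max(n.mu() for n in predicted_next)
            self.indirect += self.alpha * (self.gamma * best - self.indirect)

    def mu(self):
        # Combined value: mu = direct + indirect.
        return self.direct + self.indirect
```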
Learning: reinforcement example
Lazy reinforcement propagation
Action-Selection
• Integrate distributed predictions into action values.
– Action values result from the parallel inhibition and excitation of actions in the agent's set of actions, A, calculated using the formula below,
– with lt(ah) = the resulting activation level of an action ah ∈ A at time t,
– yi an active interactron, and
– xi^j a predicted interaction of yi that predicts action ah.
• Sum the weighted values of all predicted interactions into actions.
• Action-selection is based on these action values.
– If any lt(ah) > threshold, select aselect such that lt(aselect) = max(lt(a1), …, lt(a|A|)).
– If all lt(ah) < threshold, select aselect stochastically from l(a1), …, l(a|A|).
• Other selection mechanisms are possible, e.g. Boltzmann. (!With a static threshold our selection suffers from a lack of exploration!)
l_t(a_h) = Σ_{i=1}^{k} Σ_{j=1}^{|X_i|} μ(x_i^j) · P(x_i^j | y_i), for all x_i^j that predict a_h
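The threshold-based selection rule above can be sketched as follows; the stochastic fallback weighting is our own simple choice, and the Boltzmann alternative mentioned on the slide is not shown.

```python
import random

def select_action(levels, threshold=0.5, rng=random):
    """Select an action from levels: a dict mapping action -> l_t(a_h)."""
    if any(l > threshold for l in levels.values()):
        # Greedy: some action exceeds the threshold, pick the maximum.
        return max(levels, key=levels.get)
    # Otherwise choose stochastically, weighting by (shifted) activation.
    actions = list(levels)
    lo = min(levels.values())
    weights = [levels[a] - lo + 1e-6 for a in actions]
    return rng.choices(actions, weights=weights)[0]
```

With a static threshold this rule stops exploring once any action reliably exceeds it, which is the lack-of-exploration problem the slide flags.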
Action-Selection: example
Internal Simulation
Thinking as Internal Simulation of Behavior
• Internal simulation of behavior:
– covertly execute and evaluate potential interactions using sensory-motor substrates (Hesslow, 2002; Damasio; Cotterill, 2001), but see also
– "interaction potentialities" (Bickhard), and
– "state anticipation" (Butz, Sigaud, Gérard, 2003).
– Existing mechanisms are the basis for simulation: evolutionary continuity!
Simulation: action-selection bias
At every step, before action-selection, select a subset of the predicted interactions from the reinforcement-learning model and feed it back to the RL model:
1. Interaction-selection: select a subset of predicted interactions.
2. Simulate-and-bias-predicted-benefit: feed the selection back to the model as if it were a real interaction (note that the memory advances to time t+1).
3. Reset-memory-state to time t, to be able to select an appropriate action.
4. Action-selection: select the next action using the action-selection mechanism explained earlier based on the now biased action values.
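The four-step cycle can be sketched as below. The model interface (predicted_interactions, feed_back, snapshot/restore, action_levels) and the DummyModel are hypothetical stand-ins for the distributed-state RL model, not the authors' code.

```python
from types import SimpleNamespace

def simulated_step(model, threshold):
    # 1. Interaction-selection: keep predicted interactions above threshold.
    candidates = model.predicted_interactions()
    selected = [x for x in candidates if x.value >= threshold]

    state = model.snapshot()          # memory will advance to t+1
    for interaction in selected:
        # 2. Simulate-and-bias: feed back as if a real interaction occurred.
        model.feed_back(interaction)
    model.restore(state)              # 3. Reset memory state to time t.

    # 4. Action-selection on the now-biased action values.
    levels = model.action_levels()
    return max(levels, key=levels.get)

class DummyModel:
    """Toy stand-in: feeding back an interaction shifts its action's value."""
    def __init__(self):
        self.levels = {"up": 0.2, "down": -0.5}
        self._preds = [SimpleNamespace(value=0.6, action="down", bias=1.0)]
    def predicted_interactions(self):
        return self._preds
    def snapshot(self):
        return dict(self.levels)
    def restore(self, state):
        pass  # in the real model this rewinds memory, not the action bias
    def feed_back(self, x):
        self.levels[x.action] += x.bias
    def action_levels(self):
        return self.levels
```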
Simulation: example
• Action list before simulation (!hypothetical example!):
– {up=0.2, down=-0.5, right=-1, left=-1}
• Action-selection would have selected "up",
– at least using our naive action-selection mechanism;
– with Boltzmann selection, "up" would get a high probability.
• Simulate all interactions.
– Propagate back the predicted values by simulating interaction with the environment.
– The effect is a "value look-ahead" of 1 step.
• Action list after simulation:
– {up=0.1, down=0.5, right=-1, left=-1}
• Action-selection selects "down".
• In this example, simulating all predicted interactions helps.
[Figure: grid-world example with a roadblock, r = -.5.]
But: Simulating Everything is not Always Best
• This holds even apart from the fact that simulating everything costs mental effort.
• Earlier experiments (Broekens, 2005) showed that
– simulation has benefit, especially when many interactions are simulated; this is not surprising (a better heuristic). However,
– in some cases less simulation resulted in better learning: there is a dynamic relation between the environment and the simulation "strategy" (i.e. the simulation threshold: the percentage of all predicted interactions to be simulated).
• Emotion as metalearning to adapt the amount of internal simulation? (Doya, 2002)
– Pleasure is an indication of the current performance of the agent (Clore and Gasper, 2000). Also,
– high pleasure → top-down thinking, and
– low pleasure → bottom-up thinking (Fiedler and Bless, 2000).
Pleasure Modulates Simulation
• There are many theories of emotion.
• We use the core-affect (or activation-valence) theory of emotion as a basis.
– Two fundamental factors, pleasure and arousal (Russell, 2003):
– pleasure relates to emotional valence, and
– arousal relates to action-readiness, or activity.
• In this study we model pleasure as a simulation threshold.
– We use pleasure to dynamically adapt the number of interactions that are simulated; it thus serves as a dynamic simulation threshold.
– We study the indirect effect of emotion as a metalearning parameter affecting information processing, which in turn influences action-selection.
• Many models study emotion as a direct influence on action-selection (or motivation(-al states)) (Avila-Garcia and Cañamero, 2004; Cañamero, 1997; Velasquez, 1998), or as information (e.g. Botelho and Coelho).
• An example of an exception: Belavkin (2004), on the relation between emotion, entropy and information processing.
Pleasure Modulates Simulation
• Pleasure: an indication of current performance relative to what the agent is used to.
– We tried to capture this with the normalized difference between the short-term average reinforcement signal and the long-term average reinforcement signal:
e_p = ((r_star − r_ltar) + f_ltar) / (2 · f_ltar)
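A sketch of this pleasure signal, computed from running short- and long-term averages of the reinforcement. The window sizes and the use of simple sliding windows are illustrative assumptions; f_ltar is treated as a fixed normalization factor.

```python
from collections import deque

class PleasureSignal:
    def __init__(self, short_window=8, long_window=64, f_ltar=1.0):
        self.short = deque(maxlen=short_window)  # short-term window (assumed size)
        self.long = deque(maxlen=long_window)    # long-term window (assumed size)
        self.f_ltar = f_ltar                     # normalization factor

    def update(self, r):
        self.short.append(r)
        self.long.append(r)

    def pleasure(self):
        r_star = sum(self.short) / len(self.short)  # short-term average
        r_ltar = sum(self.long) / len(self.long)    # long-term average
        # e_p lies in [0, 1]; it is 0.5 when the averages coincide.
        return ((r_star - r_ltar) + self.f_ltar) / (2 * self.f_ltar)
```

Note that e_p settles at 0.5 once short- and long-term performance agree, matching the predicted convergence of pleasure toward .5 in the experiments below.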
• Continuous pleasure feedback:
– High pleasure, going well? Continue the strategy: goal-directed thinking.
• High e_p → high threshold: simulate only the best predicted interactions.
– Low pleasure? Look broader, pay more attention to all predicted interactions.
• Low e_p → low threshold: simulate many interactions.
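The pleasure-to-threshold mapping can be sketched as follows. Using e_p directly as the fraction cut-off is the simplest reading of the slides, an assumption rather than the authors' exact rule.

```python
def simulation_threshold(e_p):
    """Map pleasure e_p in [0, 1] to the simulation threshold."""
    return max(0.0, min(1.0, e_p))

def interactions_to_simulate(predicted, e_p):
    # High pleasure -> simulate few (best) interactions;
    # low pleasure -> simulate many.
    ranked = sorted(predicted, reverse=True)
    n = max(1, round(len(ranked) * (1.0 - simulation_threshold(e_p))))
    return ranked[:n]
```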
Experiments
• To measure the adaptive effect of pleasure-modulated simulation, force the agent to adapt to a new task.
– First the agent has 128 trials to learn task 1, then
– the environment is switched to a new task, with 128 trials to learn task 2.
– Repeat for many different parameter settings (e.g. the windows of the long- and short-term average reinforcement signals, the learning rate, etc.).
• Pleasure predictions:
– Pleasure increases to a value near 1 (the agent gets better at the task),
– then slowly converges down to .5 (the agent gets used to the task).
– At the switch: pleasure drops (new task, drop in performance),
– then increases to a value near 1, and converges down to .5 (the agent gets used to the new task).
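The protocol above reduces to a simple loop; run_trial and the task objects here are hypothetical placeholders for the agent-environment interaction.

```python
def run_experiment(agent, task1, task2, run_trial, trials_per_task=128):
    """128 trials on task 1, then a task switch and 128 trials on task 2."""
    rewards = []
    for task in (task1, task2):  # environment switches after the first task
        for _ in range(trials_per_task):
            rewards.append(run_trial(agent, task))
    return rewards
```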
Results
• The performance of pleasure-modulated simulation is comparable with simulating ALL / the best 50% of predicted interactions (a static simulation threshold), but uses only 30% / 70% of the mental resources.
Results
• Some settings even have a significantly better performance at lower mental cost.
• The predicted pleasure curve was confirmed.
Conclusions
• Simple pleasure feedback can be used to determine the broadness of internal simulation; when simulation is used as an action-selection bias, performance remains comparable while mental effort decreases.
– Since we introduce few new mechanisms for simulation, this is relevant to understanding the evolutionary plausibility of the simulation hypothesis, as increased adaptation at lower cost is an evolutionarily advantageous feature.
• Our results provide clues to a relation between the simulation hypothesis and emotion theory.
Action-selection discussion, and questions.
• Use emotion to:
– vary the action-selection distribution (Doya, 2002), and/or
– vary the interaction-selection distribution (e.g. the temperature of Boltzmann selection, or the threshold of our action-selection mechanism).
• Interplay between covert interaction (simulation) and overt interaction (action-selection):
– Simulate the best interaction, but choose an action stochastically; see also (Gadanho, 2003). This gives extra "drive" to certain actions.
– The inverse? It seems rational too: simulate bad actions for "mental (covert) exploration", choose the best actions for "overt exploitation". Early experiments do not (yet) show a clear benefit.
• How to integrate internal-simulation input into action-selection?
• Questions?