
Computational Investigations of the Regulative Role of Pleasure in Adaptive Behavior

Action-Selection Biased by Pleasure-Regulated Simulated Interaction

Joost Broekens, Fons J. Verbeek, LIACS, Leiden University, The Netherlands.

Overview


[Architecture overview figure: the distributed-state RL model (reactive behavior) receives a percept from Perception and reinforcement from the ENVIRONMENT, and produces an action via Action-selection, executed as an interaction with the ENVIRONMENT. The Emotion process (cognitive influence) receives a stimulus and outputs pleasure, which modulates Interaction-selection over the model's predicted interactions; selected interactions are fed back to the model as simulated interaction and simulated reinforcement.]

Distributed-state RL model

Learning

• The agent's memory structure is modelled as a directed graph.
• It constructs a distributed state-prediction of next states.
• It learns through continuous interaction (a minimal sketch follows the figure below):
– The memory is adapted while the agent interacts with its environment.
– The agent selects an action a, executes it, and combines it with the resulting perception p into a situation s1 = <a, p>. The memory adds s1 if it does not yet exist.
– The agent does another action, resulting in s2; s2 is added, and s1 is connected to s2 by creating an interactron node I1.
– This process is applied recursively; interactrons are used to predict situations.
– (!Encoding situations in this way is too symbolic for real-world applications!)

[Figure, panels a–e: incremental construction of the memory graph; situations s1, s2, s3 are connected by interactron nodes I1, I2, I3.]
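A minimal Python sketch of how such a memory graph could be built from successive situations; all class and method names are hypothetical and only illustrate the construction described above, not the original implementation.

class Node:
    def __init__(self, label):
        self.label = label        # situation <a, p> or interactron identifier
        self.successors = {}      # successor Node -> observed transition count

class Memory:
    def __init__(self):
        self.nodes = {}           # label -> Node
        self.previous = None      # most recently added situation node

    def get_or_add(self, label):
        # Add the node only if it does not yet exist.
        if label not in self.nodes:
            self.nodes[label] = Node(label)
        return self.nodes[label]

    def observe(self, action, percept):
        # Combine executed action a and resulting perception p into situation s = <a, p>.
        s = self.get_or_add(("situation", action, percept))
        if self.previous is not None:
            # Connect the previous situation to s via an interactron node.
            i = self.get_or_add(("interactron", self.previous.label, s.label))
            self.previous.successors[i] = self.previous.successors.get(i, 0) + 1
            i.successors[s] = i.successors.get(s, 0) + 1
        self.previous = s
        return s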


Learning: example


• The agent environment is a grid world consisting of:
– lava (red), r = -1: the agent can walk on lava but is discouraged from doing so,
– food (yellow), r = 1,
– the agent (black),
– path (white), r = 0.
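A small sketch of the example environment under the rewards listed above; the layout shown is hypothetical.

# Hypothetical grid-world layout; rewards follow the slide: lava -1, food +1, path 0.
REWARDS = {"lava": -1.0, "food": 1.0, "path": 0.0}

GRID = [
    ["path", "lava", "path"],
    ["path", "path", "food"],
    ["path", "lava", "path"],
]

def reward_at(x, y):
    # Reinforcement received for standing on cell (x, y); lava is walkable but discouraged.
    return REWARDS[GRID[y][x]]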

Learning: reinforcement

• Learning to predict values follows standard RL techniques (Sutton and Barto, 1998), except that learning is per interactron (node) and that there are two prediction values per node.
• Every interactron (representing a sequence of occurred situations) maintains (see the sketch below):
– a direct reinforcement value: the interactron's own predicted value, updated by the reinforcement r from the environment,
– a lazily propagated indirect reinforcement value that estimates the future reinforcement, based on the predicted next interactions, and
– a combined value μ, the sum of the direct and indirect values.
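A sketch of the per-interactron bookkeeping. The learning rate alpha and discount gamma are assumptions; the slides only specify the three quantities maintained per node.

class Interactron:
    def __init__(self):
        self.direct = 0.0     # direct reinforcement value, updated from environmental r
        self.indirect = 0.0   # lazily propagated estimate of future reinforcement
        self.next = []        # list of (probability, Interactron): predicted next interactions

    @property
    def mu(self):
        # Combined value: sum of the direct and indirect values.
        return self.direct + self.indirect

    def update_direct(self, r, alpha=0.1):
        # Move the interactron's own predicted value toward the received reinforcement r.
        self.direct += alpha * (r - self.direct)

    def update_indirect(self, gamma=0.9):
        # "Lazy" propagation: recomputed only when the node is used, from the
        # combined values of its predicted next interactions.
        if self.next:
            self.indirect = gamma * sum(p * node.mu for p, node in self.next)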


Learning: reinforcement example


Lazy reinforcement propagation

Action-Selection


Action-Selection

• Integrate the distributed predictions into action values (see the sketch after the formula below).
– Action values result from the parallel inhibition and excitation of actions in the agent's set of actions, A, calculated using the formula below,
– with l_t(a_h) the resulting level of activation of an action a_h ∈ A at time t,
– y_i an active interactron, and
– x_i^j a predicted interaction that predicts action a_h;
– the formula sums the weighted values of all predicted interactions into actions.
• Action-selection is based on these action values.
– If any l_t(a_h) > threshold, select a_select such that l_t(a_select) = max(l_t(a_1), …, l_t(a_|A|)).
– If all l_t(a_h) < threshold, select a_select stochastically from l_t(a_1), …, l_t(a_|A|).
• Other selection mechanisms are possible, e.g. Boltzmann selection. (!With a static threshold our selection suffers from a lack of exploration!)


l_t(a_h) = \sum_{i=1}^{k} \sum_{j=1}^{|X_i|} \mu(x_i^j \mid y_i) \, P(x_i^j \mid y_i)

where the sum runs over the k active interactrons y_i and their sets X_i of predicted interactions, and only terms for which x_i^j predicts action a_h contribute.
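A sketch of the action-activation and selection step; the prediction data structure and the threshold value are assumptions.

import random

def action_activations(active_interactrons, actions):
    # l_t(a_h): sum, over active interactrons y_i and their predicted interactions
    # x_i^j that predict a_h, of the combined value mu weighted by P(x_i^j | y_i).
    l = {a: 0.0 for a in actions}
    for y in active_interactrons:
        for prob, x, a in y.predictions:   # assumed: (P(x|y), predicted interactron, predicted action)
            l[a] += prob * x.mu
    return l

def select_action(l, threshold=0.05):
    best = max(l, key=l.get)
    if l[best] > threshold:
        return best                        # greedy when any activation exceeds the threshold
    return random.choice(list(l))          # otherwise stochastic (the slide leaves the distribution open)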

Action-Selection: example



Internal Simulation


Thinking as Internal Simulation of Behavior

• Internal simulation of behavior:
– covertly execute and evaluate potential interactions using sensory-motor substrates (Hesslow, 2002; Damasio; Cotterill, 2001); but see also
– "interaction potentialities" (Bickhard), and
– "state anticipation" (Butz, Sigaud, Gérard, 2003).
– Existing mechanisms are the basis for simulation: evolutionary continuity!

Simulation: action-selection bias

At every step, instead of immediate action-selection, select a subset of the predicted interactions from the reinforcement learning model and feed them back to the RL model:

1. Interaction-selection: select a subset of the predicted interactions.
2. Simulate-and-bias-predicted-benefit: feed each selected interaction back to the model as if it were a real interaction. (Note that the memory then advances to time t+1, so we have to)
3. Reset-memory-state to time t, to be able to select an appropriate action.



4. Action-selection: select the next action using the action-selection mechanism explained earlier, based on the now biased action values. (A minimal sketch of this loop follows below.)
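A minimal sketch of one simulation-biased selection step, reusing the functions sketched earlier. The memory interface (predicted_interactions, save_state, feed_back, restore_state, active_interactrons, actions) and the exact interaction-selection rule are assumptions.

def simulation_biased_action_selection(memory, threshold):
    predicted = memory.predicted_interactions()            # from the distributed-state RL model
    # 1. Interaction-selection: a high threshold keeps only the best predictions,
    #    a low threshold keeps many (the exact rule is an assumption).
    predicted.sort(key=lambda x: x.mu, reverse=True)
    keep = max(1, int(round((1.0 - threshold) * len(predicted))))
    selected = predicted[:keep]

    snapshot = memory.save_state()                         # remember the state at time t
    for x in selected:
        # 2. Simulate-and-bias: feed the interaction back as if it really happened,
        #    so its predicted reinforcement biases the action values.
        memory.feed_back(x)
    memory.restore_state(snapshot)                         # 3. Reset the memory state to time t

    # 4. Action-selection on the now biased action values.
    return select_action(action_activations(memory.active_interactrons(), memory.actions()))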

Simulation: example

• Action list before simulation (!hypothetical example!):
– {up=0.2, down=-0.5, right=-1, left=-1}
• Action-selection would have selected "up",
– at least using our naive action-selection mechanism;
– with Boltzmann selection, "up" would have a high probability.
• Simulate all predicted interactions:
– propagate the predicted values back by simulating interaction with the environment;
– the effect is a "value look-ahead" of one step.
• Action list after simulation:
– {up=0.1, down=0.5, right=-1, left=-1}
• Action-selection now selects "down".
• In this example, simulating all predicted interactions helps.


[Grid-world figure for this example: roadblock cell, r = -0.5.]

But: Simulating Everything is not Always Best

• Even apart from the fact that simulating everything costs mental effort,
• earlier experiments (Broekens, 2005) showed that:
– simulation has a benefit, especially when many interactions are simulated. This is not surprising (a better heuristic). However,
– in some cases less simulation resulted in better learning: there is a dynamic relation between the environment and the simulation "strategy" (i.e. the simulation threshold, which controls the percentage of all predicted interactions to be simulated).
• Emotion as metalearning to adapt the amount of internal simulation? (Doya, 2002)
– Pleasure is an indication of the current performance of the agent (Clore and Gasper, 2000). Also,
– high pleasure → top-down thinking, and
– low pleasure → bottom-up thinking (Fiedler and Bless, 2000).


Pleasure Modulates Simulation


Pleasure Modulates Simulation

• There are many theories of emotion.
• We use the core-affect (or activation-valence) theory of emotion as a basis:
– two fundamental factors, pleasure and arousal (Russell, 2003);
– pleasure relates to emotional valence, and
– arousal relates to action-readiness, or activity.
• In this study we model pleasure as a simulation threshold:
– we use pleasure to dynamically adapt the number of interactions that are simulated; it is thus used as a dynamic simulation threshold;
– we study the indirect effect of emotion as a metalearning parameter that affects information processing, which in turn influences action-selection.
• Many models study emotion as a direct influence on action-selection (or motivation(-al states)) (Avila-Garcia and Cañamero, 2004; Cañamero, 1997; Velasquez, 1998), or as information (e.g. Botelho and Coelho).
• Example of an exception: Belavkin (2004), on the relation between emotion, entropy and information processing.


Pleasure Modulates Simulation

• Pleasure: an indication of the agent's current performance relative to what it is used to.
– We tried to capture this with the normalized difference between the short-term average reinforcement signal and the long-term average reinforcement signal:

e_p = \frac{(r_{star} - r_{ltar}) + f_{ltar}}{2\, f_{ltar}}

with r_star the short-term average reinforcement, r_ltar the long-term average reinforcement, and f_ltar a normalization factor, so that e_p = 0.5 when the two averages are equal.
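A sketch of the pleasure signal computed over running averages. The window sizes and the normalization factor f_ltar are assumptions; the slides only specify a normalized difference between the short- and long-term average reinforcement.

from collections import deque

class PleasureSignal:
    def __init__(self, short_window=10, long_window=100):
        self.short = deque(maxlen=short_window)   # short-term reinforcement history
        self.long = deque(maxlen=long_window)     # long-term reinforcement history

    def update(self, r):
        self.short.append(r)
        self.long.append(r)

    def e_p(self, f_ltar=1.0):
        # Normalized difference between the short- and long-term average
        # reinforcement; equals 0.5 when the two averages agree.
        if not self.long:
            return 0.5
        r_star = sum(self.short) / len(self.short)
        r_ltar = sum(self.long) / len(self.long)
        return ((r_star - r_ltar) + f_ltar) / (2 * f_ltar)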

[Architecture overview figure, with the Emotion process output labelled pleasure, e_p.]

• Continuous pleasure feedback (a sketch of this mapping follows below):
– High pleasure, going well? Continue the current strategy: goal-directed thinking.
• High e_p → high threshold: simulate only the best predicted interactions.
– Low pleasure? Look broader, pay more attention to all predicted interactions.
• Low e_p → low threshold: simulate many interactions.
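A sketch of using pleasure as the dynamic simulation threshold; the linear mapping and its bounds are assumptions.

def simulation_threshold(e_p, low=0.1, high=0.9):
    # High pleasure -> high threshold (simulate only the best predicted interactions),
    # low pleasure -> low threshold (simulate many interactions).
    return low + e_p * (high - low)

# Usage with the simulation step sketched earlier:
# action = simulation_biased_action_selection(memory, simulation_threshold(pleasure.e_p()))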

Experiments

• To measure the adaptive effect of pleasure-modulated simulation, we force the agent to adapt to a new task (a sketch of the protocol follows below):
– first the agent has 128 trials to learn task 1, then
– the environment is switched to a new task, with 128 trials to learn task 2;
– this is repeated for many different parameter settings (e.g. the windows of the long- and short-term average reinforcement signals, the learning rate, etc.).
• Pleasure predictions:
– pleasure increases to a value near 1 (the agent gets better at the task),
– then slowly converges down to 0.5 (the agent gets used to the task);
– at the switch, pleasure drops (new task, drop in performance),
– then increases to a value near 1 and converges down to 0.5 again (the agent gets used to the new task).
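A sketch of the task-switch protocol. The trial counts come from the slide; the agent and task interfaces are hypothetical.

def run_experiment(agent, task1, task2, trials_per_task=128):
    pleasure_trace = []
    for task in (task1, task2):                 # learn task 1, then switch to task 2
        for _ in range(trials_per_task):
            agent.run_trial(task)               # one trial of interaction and learning
            pleasure_trace.append(agent.pleasure())
    # Expected trace: rise towards 1, settle near 0.5, drop at the switch, rise again.
    return pleasure_trace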


Results

• The performance of pleasure-modulated simulation is comparable with simulating ALL / the best 50% of predicted interactions (static simulation thresholds), but uses only 30% / 70% of the mental resources.


Results

• Some settings even yield significantly better performance at lower mental cost.

• The predicted pleasure curve was confirmed.


Conclusions

• Simple pleasure feedback can be used to determine the broadness of internal simulation: when simulation is used as an action-selection bias, performance is comparable while mental effort decreases.
– Since we introduce few new mechanisms for simulation, this is relevant to understanding the evolutionary plausibility of the simulation hypothesis, as increased adaptation at lower cost is an evolutionarily advantageous feature.

• Our results provide clues to a relation between the simulation hypothesis and emotion theory.


Action-selection discussion, and questions.

• Use emotion to:
– vary the action-selection distribution (Doya, 2002), and/or
– vary the interaction-selection distribution (e.g. the temperature of Boltzmann selection, or the threshold of our action-selection mechanism); a sketch follows after this list.

• Interplay between covert interaction (simulation) and overt interaction (action-selection):
– simulate the best interaction, but choose an action stochastically; see also (Gadanho, 2003): this gives extra "drive" to certain actions.
– The inverse seems rational too: simulate bad actions for "mental (covert) exploration", and choose the best actions for "overt exploitation". Early experiments do not (yet) show a clear benefit.
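A sketch of emotion-modulated Boltzmann action-selection, as one way to vary the selection distribution; the pleasure-to-temperature mapping is an assumption and not the mechanism used in this work.

import math
import random

def boltzmann_select(activations, temperature):
    # Sample an action with probability proportional to exp(l(a) / temperature).
    weights = {a: math.exp(v / temperature) for a, v in activations.items()}
    total = sum(weights.values())
    r = random.uniform(0.0, total)
    cumulative = 0.0
    for a, w in weights.items():
        cumulative += w
        if r <= cumulative:
            return a
    return a  # fallback for floating-point edge cases

def temperature_from_pleasure(e_p, t_min=0.1, t_max=1.0):
    # High pleasure -> low temperature (exploit), low pleasure -> high temperature (explore).
    return t_max - e_p * (t_max - t_min)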

• How to integrate internal simulation input into action-selection?
• Questions?
