
Slide 1

Learning Behavior-Selection by Emotions and Cognition in a Multi-Goal Robot Task

Sandra Clara Gadanho

Presented by Jamie Levy

Slide 2

Purpose

Build an autonomous robot controller which can learn to master a complex task when situated in a realistic environment:
- Continuous time and space
- Noisy sensors
- Unreliable actuators

Slide 3

Possible problems for the learning algorithm:
- Multiple goals may conflict with each other.
- Situations in which the agent needs to temporarily overlook one goal to accomplish another.
- Short-term and long-term goals.

Slide 4

Possible problems for the learning algorithm (cont):
- May need a sequence of different behaviors to accomplish one goal.
- Behaviors are unreliable.
- A behavior's appropriate duration is undetermined; it depends on the environment and on its success.

Slide 5

Emotion-based Architecture

Traditional RL adaptive system complemented with an emotion system responsible for behavior switching.

- Innate emotions define goals.
- The agent learns emotion associations of environment-state and behavior pairs to determine its decisions.
- Q-learning is used to learn a behavior-selection policy, which is stored in neural networks (see the sketch below).
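A minimal sketch of this kind of setup, assuming one small network per hand-designed behavior, a state vector of homeostatic and perceptual values, and illustrative hyperparameters; the names (BehaviorQNet, q_learning_step) are mine, not the paper's:

```python
import numpy as np

class BehaviorQNet:
    """One-hidden-layer regressor approximating Q(state, this behavior)."""
    def __init__(self, n_inputs, n_hidden=8, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(scale=0.3, size=(n_inputs, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.w2 = rng.normal(scale=0.3, size=n_hidden)
        self.b2 = 0.0
        self.lr = lr

    def q(self, x):
        h = np.tanh(x @ self.w1 + self.b1)
        return float(h @ self.w2 + self.b2)

    def update(self, x, target):
        # One gradient step on the squared TD error for this (state, behavior) pair.
        h = np.tanh(x @ self.w1 + self.b1)
        err = float(h @ self.w2 + self.b2) - target
        grad_h = err * self.w2 * (1.0 - h ** 2)   # backprop through tanh
        self.w2 -= self.lr * err * h
        self.b2 -= self.lr * err
        self.w1 -= self.lr * np.outer(x, grad_h)
        self.b1 -= self.lr * grad_h

def q_learning_step(nets, state, behavior, reward, next_state, gamma=0.9):
    """Q-learning target: reward + gamma * max over behaviors of Q(next_state, b)."""
    target = reward + gamma * max(net.q(next_state) for net in nets)
    nets[behavior].update(state, target)
```

Here `nets` would hold one network per available behavior (avoid obstacles, seek light, wall following), and `state`/`next_state` are the sensor-derived input vectors.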

Slide 6

ALEC – Asynchronous Learning by Emotion and Cognition

- Augments the EB architecture with a cognitive system, which has explicit rule knowledge extracted from environment interactions.
- Is based on the CLARION model by Sun and Peterson (1998).
- Allows learning the decision rules in a bottom-up fashion.

Slide 7

ALEC architecture (cont)

- ALEC I: cognitive system directly inspired by the top level of the CLARION model.
- ALEC II: has some changes (to be discussed later).
- ALEC III: the emotion system learns about goal states exclusively, while the cognitive system learns about goal-state transitions.
- LEC (Learning by Emotion and Cognition): the non-asynchronous variant, used to test the usefulness of behavior switching.

Slide 8

EB II

Replaces the emotional model with a goal system.

The goal system is based on a set of homeostatic variables that the agent attempts to maintain within certain bounds.

Slide 9

The EB II architecture is composed of two parts:
- Goal System
- Adaptive System

Slide 10

Perceptual Values

- Light intensity
- Obstacle density
- Energy availability: indicates whether a nearby source is releasing energy

Slide 11

Behavior System

Three hand-designed behaviors to select from:
- Avoid obstacles
- Seek light
- Wall following

These behaviors are not designed to be very reliable and may fail (e.g., wall following may lead to a crash).

Slide 12

Goal System

Responsible for deciding when behavior switching should occur.

Goals are explicitly identified and associated with homeostatic variables.

Slide 13

Three different states:
- Target
- Recovery
- Danger

Slide 14

Homeostatic Variables

A variable remains in its target state as long as its values are optimal or acceptable.

A well-being variable is derived from the homeostatic variables.

Each variable has an effect on well-being.

Slide 15

Homeostatic Variables

- Energy: reflects the goal of maintaining the robot's energy.
- Welfare: maintains the goal of avoiding collisions.
- Activity: ensures the agent keeps moving; otherwise its value slowly decreases and the target state is not maintained.

Slide 16

Well-Being

State Change – when a homeostatic variable changes from one state to another, the well-being is positively influenced.

Predictions of State Change – when some perceptual cue predicts the state change of a homeostatic variable, the influence is similar to the above, but lower in value.

These are modeled after emotions and may describe “pain” or “pleasure.”

Slide 17

Well-Being (cont)

cs = state coefficient
rs = influence of the state on well-being

Slide 18

Well-Being (cont)

ct(sh) = state transition coefficient
wh = weight of the homeostatic variable:
- 1.0 for energy
- 0.6 for welfare
- 0.4 for activity

Slide 19

Well-Being (cont)

cp = prediction coefficient
rph = value of the prediction

Predictions are only considered for the energy and activity variables.
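The well-being equation itself appears only as images in the original slides. Combining the coefficient definitions from slides 17-19, one plausible weighted-sum form is sketched below; this is a reconstruction, not the paper's verbatim formula:

```latex
% Hedged reconstruction: weighted sum over homeostatic variables h of a state
% term, a state-transition term and a prediction term.
W_b = \sum_{h} w_h \left( c_s \, r_{s_h} + c_t(s_h) + c_p \, r_{p_h} \right)
```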

Slides 20-21: [no transcribed content]

Slide 22

Well-being calculation (cont)

Slide 23

Well-being calculation - Prediction

Values of rph depend on the strengths of the current predictions and vary between -1 (for predictions of no desirable change) and 1.

If there is no prediction, rph = 0.

Slide 24

Well-being calculation - Prediction

Slide 25

Well-being calculation - Prediction

The activity prediction provides a no-progress indicator, given at regular time intervals when the activity of the robot has been low for a long period: rp(activity) = -1.

There is no prediction for welfare: rp(welfare) = 0.
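A small sketch of how these prediction terms might be computed each step, assuming the energy-availability cue is used directly as the strength of the energy prediction (that mapping, and the function name, are illustrative assumptions):

```python
def prediction_values(energy_availability, no_progress_signal):
    """Return r_p for each homeostatic variable, each in [-1, 1].

    energy_availability: perceptual cue in [0, 1] indicating that a nearby
        source is releasing energy (used here as the prediction strength).
    no_progress_signal: True at the regular intervals when robot activity
        has stayed low for a long period.
    """
    return {
        "energy": energy_availability,                     # predicts a desirable change
        "activity": -1.0 if no_progress_signal else 0.0,   # no-progress indicator
        "welfare": 0.0,                                    # no prediction for welfare
    }
```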

Slide 26

Adaptive System

- Uses Q-learning.
- The state information fed to the neural networks comprises the homeostatic variable values and other perceptual values gathered from the sensors.

Slide 27

Adaptive System (cont)

The developed controller tries to maximize the reinforcement received by selecting one of the available hand-designed behaviors.

Slide 28

Adaptive System (cont)

- The agent may select between performing the behavior that has proven better in the past or an arbitrary one.
- The selection function is based on the Boltzmann-Gibbs distribution (p. 30 in the class textbook); a sketch follows below.
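A minimal sketch of Boltzmann (softmax) selection over the current Q-values; the temperature value here is an illustrative assumption, not the paper's setting:

```python
import numpy as np

def boltzmann_select(q_values, temperature=0.1, rng=None):
    """Pick a behavior index with probability proportional to exp(Q / T)."""
    rng = rng or np.random.default_rng()
    q = np.asarray(q_values, dtype=float)
    logits = (q - q.max()) / temperature   # subtract max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return rng.choice(len(q), p=probs)
```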

Slide 29

EB II Architecture

Slide 30

ALEC Architecture

Slide 31

ALEC I

- Inspired by the CLARION model.
- Each individual rule consists of a condition for activation and a behavior suggestion.
- The activation condition is dictated by a set of intervals, one for each dimension of the input space.
- Six input dimensions, each varying between 0 and 1, with interval granularity of 0.2 (see the sketch below).
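A sketch of such a rule as a data structure: one interval per input dimension (with boundaries on the 0.2 grid) plus a behavior suggestion. Class and field names are illustrative, not the paper's:

```python
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass
class Rule:
    """ALEC-I-style rule: a per-dimension interval condition plus a suggested behavior.

    The six dimensions are energy, activity, welfare, light intensity,
    obstacle density and energy availability, each in [0, 1].
    """
    condition: Dict[str, Tuple[float, float]]   # dimension -> (low, high), on the 0.2 grid
    behavior: str                               # suggested behavior
    successes: int = 0
    applications: int = 0

    def matches(self, state: Dict[str, float]) -> bool:
        # The rule fires when every input value falls inside its interval.
        return all(lo <= state[d] <= hi for d, (lo, hi) in self.condition.items())

    def success_rate(self) -> float:
        return self.successes / self.applications if self.applications else 0.0
```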

Slide 32

ALEC I (cont)

- A condition interval may only start or end at pre-defined points of the input space.
- Since this may lead to a large number of possible states, rule learning is limited to the few cases with successful behavior selection.
- Other cases are left to the emotion system, which uses its generalization abilities to cover the state space.

Slide 33

ALEC I (cont)

Successful behaviors in particular states are used to extract rules corresponding to the decisions made, and these rules are added to the agent's rule set.

If the same decision is made again, the agent updates the Success Rate (SR) of that rule.

Slide 34

ALEC I – Success

- r = immediate reinforcement.
- The difference in Q-value between state x, where decision a was made, and the resulting state y.
- Tsuccess = 0.2 (constant threshold; see the sketch below).
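One plausible reading of this success test; whether and how the next-state value is discounted is not stated on the slide, so the exact form below is an assumption:

```python
T_SUCCESS = 0.2   # constant threshold from the slide

def is_successful(r, q_x_a, max_q_y, gamma=0.9):
    """A decision counts as successful when the immediate reinforcement plus the
    Q-value difference between resulting state y and deciding state x exceeds
    the threshold."""
    return r + gamma * max_q_y - q_x_a > T_SUCCESS
```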

Slide 35

ALEC I (cont) – Rule expansion, shrinkage

- If a rule is often successful, the agent tries to generalize it to cover nearby environmental states.
- If a rule performs very poorly, the agent makes it more specific.
- If it still does not improve, the rule is deleted.
- A maximum of 100 rules is kept.

Slide 36

ALEC I (cont) – Rule expansion, shrinkage

- Statistics are kept for the success rate of every possible one-state expansion or shrinkage of the rule, in order to select the best option.
- A rule is compared to a "match all" rule (rule_all) with the same behavior suggestion, and against itself after the best expansion or shrinkage (rule_exp, rule_shrink).

Slide 37

ALEC I (cont) – Rule expansion, shrinkage

A rule is expanded if it is significantly better than the match-all rule and the expanded rule is better or equal to the original rule.

A rule that is insufficiently better than the match-all rule is shrunk if this results in an improvement or otherwise is deleted.

Slide 38

ALEC I (cont) – Rule expansion, shrinkage

Slide 39

Rule expansion, shrinkage (cont)

Constant thresholds (used in the sketch below):
- Tsuccess = 0.2
- Texpand = 2.0
- Tshrunk = 1.0
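The exact comparison statistic appears only as an image in the slides; the sketch below encodes one plausible reading, comparing success-rate ratios against the thresholds above (written T_EXPAND and T_SHRINK here). Every comparison in it is an assumption for illustration:

```python
# One plausible reading of the expansion/shrinkage decision, assuming the
# comparison statistic is a success-rate ratio against the match-all rule.
T_EXPAND, T_SHRINK = 2.0, 1.0

def revise_rule(sr_rule, sr_all, sr_exp, sr_shrink):
    """Return 'expand', 'shrink', 'delete' or 'keep' for a rule.

    sr_rule   -- success rate of the rule itself
    sr_all    -- success rate of the match-all rule with the same behavior
    sr_exp    -- success rate of the best one-state expansion
    sr_shrink -- success rate of the best one-state shrinkage
    """
    eps = 1e-9
    ratio = sr_rule / (sr_all + eps)
    if ratio >= T_EXPAND and sr_exp >= sr_rule:
        return "expand"   # clearly better than match-all; expansion is not worse
    if ratio < T_SHRINK:
        return "shrink" if sr_shrink > sr_rule else "delete"
    return "keep"
```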

Slide 40

ALEC I (cont) – Rule expansion, shrinkage

- A rule that performs badly is deleted.
- A rule is also deleted if its condition has not been met for a while.
- When two rules propose the same behavior selection and their conditions are sufficiently similar, they are merged into a single rule.
- The success rate is reset whenever a rule is modified by merging, expansion or shrinkage.

Slide 41

Cognitive System (cont)

If the cognitive system has a rule that applies to the current environmental state, then it influences the behavior decision: an arbitrary constant of 1.0 is added to the respective Q-value before the stochastic behavior selection is made (see the snippet below).
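A short sketch of that bias step, reusing the `Rule` and `boltzmann_select` sketches from the earlier slides; only the 1.0 constant comes from the slide, the rest is illustrative:

```python
RULE_BIAS = 1.0   # arbitrary constant from the slide

def select_behavior(q_values, rules, state, behaviors, temperature=0.1):
    """Boltzmann selection over Q-values, biased toward behaviors suggested by matching rules."""
    biased = list(q_values)
    for rule in rules:
        if rule.matches(state):                          # rule applies to current state
            biased[behaviors.index(rule.behavior)] += RULE_BIAS
    return boltzmann_select(biased, temperature)
```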

Slide 42

ALEC Architecture

Slide 43: [no transcribed content]

Slide 44

Example of a Rule – execute "avoid obstacles."

Six input dimensions segmented with 0.2 granularity (0, 0.2, 0.4, 0.6, 0.8, 1):
- energy = [0.6, 1]
- activity = [0, 1]
- welfare = [0, 0.6]
- light intensity = [0, 1]
- obstacle density = [0.8, 1]
- energy availability = [0, 1]
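Expressed with the `Rule` sketch from the ALEC I slides, this example would look roughly like:

```python
avoid_obstacles_rule = Rule(
    condition={
        "energy": (0.6, 1.0),
        "activity": (0.0, 1.0),
        "welfare": (0.0, 0.6),
        "light intensity": (0.0, 1.0),
        "obstacle density": (0.8, 1.0),
        "energy availability": (0.0, 1.0),
    },
    behavior="avoid obstacles",
)
# Fires when energy is fairly high, welfare is relatively low and obstacle density is high.
```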

Slide 45

ALEC II

Instead of the above success function, the agent considers a behavior successful if there is a positive homeostatic variable transition: a variable's state changes to the target state from the danger state.

Slide 46

ALEC III

Same as ALEC II, except that well-being depends neither on state transitions nor on predictions:
- ct(sh) = 0
- cp = 0

Slide 47

Experiments

The goal of ALEC is to allow an agent faced with realistic world conditions to adapt on-line and autonomously to its environment, coping with:
- Continuous time and space
- Limited memory
- Time constraints
- Noisy sensors
- Unreliable actuators

Slide 48

Khepera Robot

- Left and right wheel motors
- 8 infrared sensors that allow it to detect object proximity and ambient light: 6 in the front, 2 in the rear

Slide 49

Experiment (cont)

Slide 50

Goals

- Maintain energy
- Avoid obstacles
- Move around in the environment (not as important as the first two)

Slide 51

Energy Acquisition

- The robot must overlook the goal of avoiding obstacles: it must bump into the source.
- Energy is available for a short period, so the robot must look for new sources.
- Energy being received is indicated by high light values in the rear sensors.

Slide 52

Procedure

Each experiment consisted of:
- 100 different robot trials of 3 million simulation steps each.
- In each trial, a new, fully recharged robot with all state values reset was placed at a randomly selected starting position.
- For evaluation, the trial period was divided into 60 smaller periods of 50,000 steps.

Slide 53

Procedure (cont)

For each of these periods the following were recorded:
- Reinforcement – mean of the reinforcement (well-being) value calculated at each step.
- Energy – mean energy level of the robot.
- Distance – mean value of the Euclidean distance d taken at 100-step intervals (approximately the number of steps needed to move between corners of the environment).
- Collisions – percentage of steps involving collisions.

Slide 54

Results

Pairs of controllers were compared using a randomized analysis of variance (RANOVA) by Piater (1999)

Slide 55

Results (cont)

The most important contribution to reinforcement is the state value.

For the successful accomplishment of the task and goals, all homeostatic variables should be taken into consideration in the reinforcement. Agents without:
- Energy-dependent reinforcement fail in their main task of maintaining energy levels.
- Welfare-dependent reinforcement show increased collisions.
- Activity-dependent reinforcement move only as a last resort (to avoid collisions).

Slide 56

Results (cont)

Predictions of state transitions proved essential for an agent to accomplish its tasks:
- A controller with no energy prediction is unable to acquire energy.
- A controller with no activity prediction will eventually stop moving.

Slide 57

Results – EB, EBII and Random

The first set of graphs deals with three different agents:
- EB – discussed in an earlier paper
- EB II
- Random – selects randomly among the available behaviors at regular intervals

Slide 58

Results – EB, EBII and Random

Slides 59-66: [results graphs; no transcribed content]

Slide 67

Conclusion

The emotion and cognitive systems can improve learning, but they are unable to store and consult all of the single events the agent experiences.

The emotion system gives a "sense" of what is right, while the cognitive system constructs a model of reality and corrects the emotion system when it reaches incorrect conclusions.

Slide 68

Future work

Adding more specific knowledge to the cognitive system, which may then be used for planning more complex tasks.