Learning for Physically Diverse Robot Teams
Robot Teams - Chapter 7
CS8803 Autonomous Multi-Robot Systems
10/3/02
Motivations
• Robots are cool.
• Robot teams are cooler.
• Robots are hard to program/control.
• Robot teams are even harder.
Motivations
• Robotic soccer - hard!
Motivations
• Diagnose and rebuild the transmission of this 1969 Jaguar E-Type - Really Hard!
Motivations
• Answer: Robot Learning!
Motivations
• Challenges:
– Very large state spaces
– Uncertain credit assignment
– Limited training time
– Uncertainty in sensing and shared info
– Non-deterministic actions
– Difficulty in defining appropriate abstractions for learned info
– Difficulty of merging info from different robot experiences
Motivations
• Benefits:
– Increased robustness
– Reduced complexity
– Increased ease of adding new assets to the team
Motivations
• 4 types of learning in robotic systems:
– Learning numerical functions for calibration or parameter adjustment
– Learning about the world
– Learning to coordinate behaviors
– Learning new behaviors
Learning New Cooperative Behaviors
• Inherently cooperative tasks are difficult to learn!
– Utility of a robot's action depends on the actions of the other robots
– Soccer is a good example
– Cooperative Multi-robot Observation of Multiple Moving Targets (CMOMMT)
• Scalable
Learning New Cooperative Behaviors
• CMOMMT Application:
– S: a 2-D bounded, enclosed spatial region
– V: a team of m robot vehicles with 360° field of view and limited sensing range
– O(t): a set of n targets in region S at time t
– B(t): a matrix such that B_ij(t) = 1 if robot i is observing target j at time t
– Sensor coverage is much less than the region area
Learning New Cooperative Behaviors
• Goal: develop algorithm A-CMOMMT
– Maximize the average number of targets observed at any given time (formalized in the sketch below)
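Using the definitions above, the objective can be written as a time-averaged count of observed targets. A hedged LaTeX sketch (the horizon T and the indicator g are notation introduced here, not from the slides):

\[
\max \; A = \frac{1}{T} \sum_{t=1}^{T} \sum_{j=1}^{n} g(B(t), j),
\qquad
g(B(t), j) =
\begin{cases}
1 & \text{if } B_{ij}(t) = 1 \text{ for some robot } i \\
0 & \text{otherwise}
\end{cases}
\]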
Learning New Cooperative Behaviors
• Human-Generated Solution
– Local force vectors (see the sketch below)
• Targets attract
• Teammates repel
– Magnitude depends on distance from the robot
– Weight reduced if the target is already being observed
– Direction given by summing the vectors
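A minimal Python sketch of this force-vector policy; the magnitude profiles, range constants, and per-target weights are illustrative assumptions, not the published parameters:

    import numpy as np

    def attract(dist, sense_range):
        # Illustrative profile: targets attract within sensing range;
        # back off slightly when already very close.
        if dist > sense_range:
            return 0.0
        return -0.5 if dist < 0.1 * sense_range else 1.0

    def repel(dist, comm_range):
        # Teammates repel when nearby, spreading the team across targets.
        return -1.0 if dist < 0.25 * comm_range else 0.0

    def force_vector(robot, targets, weights, teammates,
                     sense_range=10.0, comm_range=20.0):
        """Sum weighted target attraction and teammate repulsion.
        weights[j] < 1 when target j is already being observed."""
        total = np.zeros(2)
        for tgt, w in zip(targets, weights):
            d = tgt - robot
            dist = np.linalg.norm(d)
            if dist > 0:
                total += w * attract(dist, sense_range) * d / dist
        for mate in teammates:
            d = mate - robot
            dist = np.linalg.norm(d)
            if dist > 0:
                total += repel(dist, comm_range) * d / dist
        return total  # desired heading for the next control step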
Learning New Cooperative Behaviors
• Results:
Learning New Cooperative Behaviors
• Distributed, Pessimistic Lazy Q-Learning
– No a priori model
– Reinforcement learning
– Instance-based learning
– Assumes a lower bound on utility
Learning New Cooperative Behaviors
• Q-Learning
– For each state/action pair, initialize Q(s,a) = 0
– Observe state s
– Loop (see the sketch below):
• Select an action and execute it
• Receive reward r
• Observe new state s'
• Update the table entry for Q(s,a)
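A minimal tabular Q-learning loop in Python; the environment interface (reset/step) and the learning-rate and discount values are assumptions for illustration:

    import random
    from collections import defaultdict

    def q_learning(env, actions, episodes=1000, alpha=0.1, gamma=0.9, eps=0.1):
        """Tabular Q-learning with epsilon-greedy exploration."""
        Q = defaultdict(float)                       # Q[(s, a)] starts at 0
        for _ in range(episodes):
            s = env.reset()                          # observe state s
            done = False
            while not done:
                # Select an action and execute it
                if random.random() < eps:
                    a = random.choice(actions)
                else:
                    a = max(actions, key=lambda act: Q[(s, act)])
                s2, r, done = env.step(a)            # receive reward r, observe s'
                # Update the table entry for Q(s, a)
                best_next = max(Q[(s2, act)] for act in actions)
                Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
                s = s2
        return Q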
Learning New Cooperative Behaviors
• Lazy Learning (instance-based learning)
– Delays use of gathered info until necessary
– Randomly built look-up table of (state, action) pairs
[Block diagram in original slides relating the World, State, and Action to the Situation Matcher, Evaluation Function, Reinforcement Function, and the look-up table]
Learning New Cooperative Behaviors
• Pessimistic Algorithm
– Rates the utility of an action based on a lower bound:
• Predict the state following each possible action in the current state
• Compute a lower bound on the utility of each predicted state
• Choose the action with the highest lower bound (see the sketch below)
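A Python sketch of this pessimistic selection step; predict_next and utility_lower_bound are hypothetical helpers standing in for the learned transition model and the instance-based lower-bound estimate:

    def pessimistic_action(state, actions, predict_next, utility_lower_bound):
        """Choose the action whose predicted successor state has the
        highest worst-case (lower-bound) utility."""
        best_action, best_bound = None, float("-inf")
        for a in actions:
            s_next = predict_next(state, a)         # predicted following state
            bound = utility_lower_bound(s_next)     # pessimistic utility
            if bound > best_bound:
                best_action, best_bound = a, bound
        return best_action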
Learning New Cooperative Behaviors
• Results:
– Much better than random
– Not as good as the human-generated solution
– Significant results
Learning New Cooperative Behaviors
• Q-Learning w/ VQQL and GLA
– State space is huge
– Want a generalized algorithm
– 2 phases:
• Learn the quantizer
• Learn the Q function
Learning New Cooperative Behaviors
• Generalized Lloyd Algorithm (GLA)
– Clustering technique
• Converts a continuous state space to a discrete one
– Takes a set T of M states
– Returns a set C of N codeword states
– Stopping criterion: (D_m - D_{m+1}) / D_m < ε, for a small threshold ε (see the sketch below)
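A compact Python sketch of a Lloyd-style iteration with the stopping rule above; random initialization and Euclidean distortion are simplifying assumptions:

    import numpy as np

    def gla(T, N, eps=1e-3, seed=0):
        """Quantize the M states in T (an M x d float array) to N codewords.
        Stops when (D_m - D_{m+1}) / D_m < eps."""
        rng = np.random.default_rng(seed)
        C = T[rng.choice(len(T), N, replace=False)]      # initial codebook
        D_prev = np.inf
        while True:
            # Assign each state to its nearest codeword
            dists = np.linalg.norm(T[:, None, :] - C[None, :, :], axis=2)
            nearest = dists.argmin(axis=1)
            D = dists[np.arange(len(T)), nearest].mean() # current distortion
            if np.isfinite(D_prev) and (D_prev - D) / D_prev < eps:
                return C
            # Move each codeword to the centroid of its assigned states
            for k in range(N):
                members = T[nearest == k]
                if len(members):
                    C[k] = members.mean(axis=0)
            D_prev = D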
Learning New Cooperative Behaviors
• Vector Quantization for Q-Learning (VQQL)
– Obtain a set T of example states
– Design a vector quantizer C from T using GLA
– Learn the Q function:
• Choose an action following an exploration strategy
• Receive the experience tuple <s, a, s', r>
• Quantize the tuple, obtaining <s^, a, s^', r>
• Update the Q table (see the sketch below)
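A sketch of one VQQL learning step, combining the GLA codebook with the tabular update; the quantize helper and the hyperparameter values are assumptions:

    import numpy as np

    def quantize(C, s):
        """Map a continuous state to the index of its nearest codeword."""
        return int(np.linalg.norm(C - s, axis=1).argmin())

    def vqql_update(Q, C, s, a, s2, r, actions, alpha=0.1, gamma=0.9):
        """Quantize <s, a, s', r> to <s^, a, s^', r>, then apply the
        standard Q-learning update on the discrete codes."""
        sq, sq2 = quantize(C, s), quantize(C, s2)
        best_next = max(Q[(sq2, act)] for act in actions)
        Q[(sq, a)] += alpha * (r + gamma * best_next - Q[(sq, a)])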
Learning New Cooperative Behaviors
• 2 experiments:
– Local reward function
– Collaborative reward function
Learning New Cooperative Behaviors
• Results:
– Competitive
– Can handle higher-dimensional state spaces
Learning for Parameter Adjustment
• Need robots to perform life-long tasks:
– Environmental changes
– Variations in robot capabilities
– Heterogeneity
• Overlap in capabilities
• Changes in heterogeneity over time
Learning for Parameter Adjustment
• Problem definition:
– R: set of n robots
– T: set of m tasks
– A_i: set of actions robot i can perform
– H: set of functions H_i : A_i -> T returning the task accomplished by each action
– q(a_ij): quality metric for robot i performing action a_ij
– U_i: set of actions robot i performs in the current mission
Learning for Parameter Adjustment
– Given R, T, A_i, and H, determine the sets of actions U_i that optimize the performance metric (one possible formalization is sketched below)
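One natural reading of this metric, as a hedged LaTeX sketch (the additive quality objective and the full-coverage constraint are assumptions; the slides leave the metric abstract):

\[
\max_{U_1, \dots, U_n} \; \sum_{i=1}^{n} \sum_{a \in U_i} q(a)
\quad \text{subject to} \quad
\bigcup_{i=1}^{n} \{ H_i(a) : a \in U_i \} = T
\]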
Learning for Parameter Adjustment
• ALLIANCE overview
– Completely distributed
– Behaviors grouped into behavior sets
• Activated as a set
• Controlled by high-level motivational behaviors (sketched below)
– Impatience and acquiescence thresholds
– Broadcast communication
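A simplified Python sketch of how a motivational behavior might gate its behavior set; the linear impatience growth and reset rule follow the spirit of ALLIANCE, but the exact update terms here are illustrative assumptions:

    class MotivationalBehavior:
        """Gates one behavior set: motivation grows with impatience and
        activates the set once it crosses the threshold."""
        def __init__(self, impatience_rate, threshold):
            self.impatience_rate = impatience_rate
            self.threshold = threshold
            self.motivation = 0.0

        def update(self, teammate_doing_task, acquiesce):
            if teammate_doing_task or acquiesce:
                self.motivation = 0.0        # reset: yield the task
            else:
                self.motivation += self.impatience_rate
            return self.motivation >= self.threshold   # True => activate set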
Learning for Parameter Adjustment
• L-ALLIANCE overview
– Extension of ALLIANCE
• Automatically updates the motivational behaviors
– 2 problems to solve:
• How to give robots the ability to obtain knowledge about the quality of team members' performance
• How to use that performance knowledge to select a task to pursue
Learning for Parameter Adjustment
– Performance monitors
• One for every behavior set
• Monitor how self and others are performing
Learning for Parameter Adjustment
– Control phases:
• Active learning phase
– Random task choices
– Maximally patient
– Catalog monitor data and update control parameters
• Adaptive learning phase
– Must make a real effort to accomplish the mission
– Acquiesces and becomes impatient quickly
– Still catalogs monitor data and updates control parameters
Learning for Parameter Adjustment
– Action Selection Strategy (see the sketch below)
• At each iteration, robot r_i divides the remaining tasks into two categories:
– Tasks that r_i expects to perform better than all others and that are not currently being done
– All other tasks r_i can do
• Robot r_i repeats the following until no tasks are left:
– Select tasks from the first category, longest first, until none are left
– Then select tasks from the second category, shortest first
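A Python sketch of this greedy two-category loop; the Task fields and the expects_to_beat_all predicate are hypothetical stand-ins for L-ALLIANCE's learned performance estimates:

    from dataclasses import dataclass

    @dataclass
    class Task:
        name: str
        expected_duration: float    # this robot's learned time estimate
        in_progress: bool           # is another robot already doing it?

    def select_next_task(tasks, expects_to_beat_all):
        """Category 1 (robot expects to be best, task unclaimed): longest
        first. Category 2 (all other doable tasks): shortest first."""
        cat1 = [t for t in tasks if expects_to_beat_all(t) and not t.in_progress]
        if cat1:
            return max(cat1, key=lambda t: t.expected_duration)
        if tasks:
            return min(tasks, key=lambda t: t.expected_duration)
        return None   # nothing left to do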
Learning for Parameter Adjustment
• Results - Box Pushing
– Experiment 1
• 2 identical robots
• 1 fails
Learning for Parameter Adjustment
– Experiment 2
• 2 different robots with different capabilities
– L-ALLIANCE is capable of keeping the team working toward the goal despite:
• Changes to team composition
• Changes to robot abilities
Conclusions
• Lots of challenges left
• Rewards are tantalizing
• Learning approaches are not yet superior to human-generated solutions
Questions?