
Reinforcement Learning

HUT Spatial Intelligence course August/September 2004

Bram Bakker

Computer Science, University of Amsterdam

bram@science.uva.nl

Overview day 1 (Monday 13-16)

• Basic concepts
• Formalized model
• Value functions
• Learning value functions
• In-class assignment & discussion

Overview day 2 (Tuesday 9-12)

• Learning value functions more efficiently
• Generalization
• Case studies
• In-class assignment & discussion

Overview day 3 (Thursday 13-16)

• Models and planning
• Multi-agent reinforcement learning
• Other advanced RL issues
• Presentation of home assignments & discussion

What is it? A subfield of Artificial Intelligence: making computers learn tasks rather than programming them directly.

Why is it interesting? Some tasks are very difficult to program, or difficult to optimize, so learning may work better.

Relevance for geoinformatics/spatial intelligence: geoinformatics deals with many such tasks, e.g. transport optimization and water management.

Machine Learning

• Supervised learning: works by instructing the learning system what output to give for each input.

• Unsupervised learning: clustering inputs based on similarity (e.g. Kohonen self-organizing maps).

• Reinforcement learning: works by letting the learning system learn autonomously what is good and bad.

Classes of Machine Learning techniques

• Neural networks: work in a way analogous to brains; can be used with supervised, unsupervised, reinforcement learning, and genetic algorithms.

• Genetic algorithms: work in a way analogous to evolution.

• Ant Colony Optimization: works in a way analogous to ant colonies.

Some well-known Machine Learning techniques

What is Reinforcement Learning?

• Learning from interaction
• Goal-oriented learning
• Learning about, from, and while interacting with an external environment
• Learning what to do (how to map situations to actions) so as to maximize a numerical reward signal

Some Notable RL Applications

• TD-Gammon (Tesauro): world’s best backgammon program
• Elevator control (Crites & Barto): high-performance elevator controller
• Dynamic channel assignment (Singh & Bertsekas; Nie & Haykin): high-performance assignment of radio channels to mobile telephone calls
• Traffic light control (Wiering et al.; Choy et al.): high-performance control of traffic lights to optimize traffic flow
• Water systems control (Bhattacharya et al.): high-performance control of water levels of regional water systems

Relationships to other fields

Reinforcement Learning (RL) draws on and connects to several neighbouring fields: psychology, artificial intelligence (planning methods), control theory and operations research, artificial neural networks, and neuroscience.

Recommended literature

Sutton & Barto (1998). Reinforcement learning: an introduction. MIT Press.

Kaelbling, Littman, & Moore (1996). Reinforcement learning: a survey. Journal of Artificial Intelligence Research, vol. 4, pp. 237-285.

Complete Agent

• Temporally situated
• Continual learning and planning
• Agent affects the environment
• Environment is stochastic and uncertain

[Figure: agent-environment loop; the agent sends actions to the environment and receives state and reward in return.]

Supervised Learning

Inputs → Supervised Learning System → Outputs

Training info = desired (target) outputs
Error = (target output – actual output)

Reinforcement Learning (RL)

Inputs → RL System → Outputs (“actions”)

Training info = evaluations (“rewards” / “penalties”)
Objective: get as much reward as possible

Key Features of RL

• Learner is not told which actions to take
• Trial-and-error search
• Possibility of delayed reward: sacrifice short-term gains for greater long-term gains
• The need to explore and exploit
• Considers the whole problem of a goal-directed agent interacting with an uncertain environment

What is attractive about RL?

• Online, “autonomous” learning without a need for preprogrammed behavior or instruction
• Learning to satisfy long-term goals
• Applicable to many tasks

Some RL History

Three historical threads come together in modern RL:

• Trial-and-error learning: Thorndike (1911), secondary reinforcement in psychology, Minsky, Klopf, Barto et al., Holland.
• Temporal-difference learning: Samuel, Witten, Sutton.
• Optimal control and value functions: Hamilton (physics, 1800s), Shannon, Bellman/Howard (operations research), Werbos, Watkins.

Elements of RL

• Policy: what to do; maps states to actions.
• Reward: what is good.
• Value: what is good because it predicts reward; reflects total, long-term reward.
• Model: what follows what; maps states and actions to new states and rewards.

An Extended Example: Tic-Tac-Toe

[Figure: a game tree of tic-tac-toe positions, branching from the empty board and alternating x’s moves and o’s moves.]

Assume an imperfect opponent: he/she sometimes makes mistakes.

An RL Approach to Tic-Tac-Toe

1. Make a table with one entry per state:

2. Now play lots of games. To pick our moves, look ahead one step from the current state.

The table from step 1 assigns each state an estimated probability of winning, V(s):
• states with three x’s in a row: 1 (win)
• states with three o’s in a row: 0 (loss)
• drawn states: 0
• all other states: 0.5 initially (a guess)

[Figure: the current state and the various possible next states, one step ahead.]

Just pick the next state with the highest estimated probability of winning, i.e. the largest V(s): a greedy move. But 10% of the time pick a move at random: an exploratory move.

RL Learning Rule for Tic-Tac-Toe

[Figure: a sequence of positions during a game, with backups shown after greedy moves and exploratory moves marked separately.]

s  – the state before our greedy move
s′ – the state after our greedy move

We increment each V(s) toward V(s′) – a backup:

V(s) ← V(s) + α [ V(s′) − V(s) ]

where α is a small positive fraction, e.g. 0.1: the step-size parameter.
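As an illustration (not part of the original slides), this backup is only a couple of lines of Python; the "values" dictionary and the 0.5 default for unseen states are assumptions of this sketch:

# Minimal sketch of the tic-tac-toe backup V(s) <- V(s) + alpha*(V(s') - V(s)).
# "values" maps board states to estimated winning probabilities.
def backup(values, s, s_next, alpha=0.1, default=0.5):
    v_s = values.get(s, default)          # unseen states start at 0.5 (a guess)
    v_next = values.get(s_next, default)
    values[s] = v_s + alpha * (v_next - v_s)
    return values[s]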

How can we improve this T.T.T. player?

• Take advantage of symmetries (representation/generalization)
• Do we need “random” moves? Why? Do we always need a full 10%?
• Can we learn from “random” moves?
• Can we learn offline? Pre-training from self-play? Using learned models of the opponent?
• . . .

How is Tic-Tac-Toe easy?

• Small number of states and actions
• Small number of steps until reward
• . . .

RL Formalized

Agent and environment interact at discrete time steps t = 0, 1, 2, ...
• The agent observes the state at step t: s_t ∈ S
• it produces an action at step t: a_t ∈ A(s_t)
• and gets the resulting reward r_{t+1} and resulting next state s_{t+1}.

This generates a trajectory ... s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, a_{t+3}, ...

The Agent Learns a Policy

Policy at step t, π_t: a mapping from states to action probabilities; π_t(s, a) = probability that a_t = a when s_t = s.

Reinforcement learning methods specify how the agent changes its policy as a result of experience.

Roughly, the agent’s goal is to get as much reward as it can over the long run.

Getting the Degree of Abstraction Right

• Time steps need not refer to fixed intervals of real time.
• Actions can be low level (e.g., voltages to motors), high level (e.g., accept a job offer), “mental” (e.g., shift in focus of attention), etc.
• States can be low-level “sensations”, or they can be abstract, symbolic, based on memory, or subjective (e.g., the state of being “surprised” or “lost”).
• Reward computation is in the agent’s environment because the agent cannot change it arbitrarily.

Goals and Rewards

Is a scalar reward signal an adequate notion of a goal?—maybe not, but it is surprisingly flexible.

A goal should specify what we want to achieve, not how we want to achieve it.

A goal must be outside the agent’s direct control—thus outside the agent.

The agent must be able to measure success: explicitly, and frequently during its lifespan.

Returns

Suppose the sequence of rewards after step t is r_{t+1}, r_{t+2}, r_{t+3}, ...  What do we want to maximize?

In general, we want to maximize the expected return E{R_t} for each step t.

Episodic tasks: interaction breaks naturally into episodes, e.g., plays of a game, trips through a maze. In that case

R_t = r_{t+1} + r_{t+2} + ... + r_T,

where T is a final time step at which a terminal state is reached, ending an episode.

Returns for Continuing Tasks

Continuing tasks: interaction does not have natural episodes.

Discounted return:

R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... = Σ_{k=0}^∞ γ^k r_{t+k+1},

where 0 ≤ γ ≤ 1 is the discount rate.

γ near 0: shortsighted;  γ near 1: farsighted.
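As a small illustration (not from the slides), the discounted return for a finite reward sequence can be computed directly; the function name and example rewards below are made up for the example:

# Discounted return R_t = r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ...
def discounted_return(rewards, gamma):
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Shortsighted vs. farsighted weighting of the same delayed reward:
print(discounted_return([0, 0, 1], gamma=0.1))   # 0.01: almost ignores the delayed reward
print(discounted_return([0, 0, 1], gamma=0.9))   # 0.81: still values the delayed reward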

An Example

Avoid failure: the pole falling beyond a critical angle, or the cart hitting the end of the track.

As an episodic task, where an episode ends upon failure:
reward = +1 for each step before failure
⇒ return = number of steps before failure

As a continuing task with discounted return:
reward = −1 upon failure, 0 otherwise
⇒ return = −γ^k, for k steps before failure

In either case, return is maximized by avoiding failure for as long as possible.

Another Example

Get to the top of the hill as quickly as possible.

reward = −1 for each step where not at the top of the hill
⇒ return = −(number of steps before reaching the top of the hill)

Return is maximized by minimizing the number of steps taken to reach the top of the hill.

A Unified Notation

Think of each episode as ending in an absorbing state that always produces a reward of zero.

We can then cover all cases by writing

R_t = Σ_{k=0}^∞ γ^k r_{t+k+1},

where γ can be 1 only if a zero-reward absorbing state is always reached.

The Markov Property

A state should retain all “essential” information, i.e., it should have the Markov Property:

Pr{ s_{t+1} = s′, r_{t+1} = r | s_t, a_t, r_t, s_{t−1}, a_{t−1}, ..., r_1, s_0, a_0 }
  = Pr{ s_{t+1} = s′, r_{t+1} = r | s_t, a_t }

for all s′, r, and histories s_t, a_t, r_t, s_{t−1}, a_{t−1}, ..., r_1, s_0, a_0.

Markov Decision Processes

If a reinforcement learning task has the Markov Property, it is a Markov Decision Process (MDP).

If state and action sets are finite, it is a finite MDP. To define a finite MDP, you need to give:
• state and action sets
• one-step “dynamics” defined by state-transition probabilities:
  P^a_{ss′} = Pr{ s_{t+1} = s′ | s_t = s, a_t = a }   for all s, s′ ∈ S, a ∈ A(s)
• expected rewards:
  R^a_{ss′} = E{ r_{t+1} | s_t = s, a_t = a, s_{t+1} = s′ }   for all s, s′ ∈ S, a ∈ A(s)

Value Functions

The value of a state is the expected return starting from that state; it depends on the agent’s policy.

State-value function for policy π:
V^π(s) = E_π{ R_t | s_t = s } = E_π{ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s }

The value of taking an action in a state under policy π is the expected return starting from that state, taking that action, and thereafter following π.

Action-value function for policy π:
Q^π(s, a) = E_π{ R_t | s_t = s, a_t = a } = E_π{ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s, a_t = a }

Bellman Equation for a Policy π

The basic idea:

R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + γ³ r_{t+4} + ...
    = r_{t+1} + γ ( r_{t+2} + γ r_{t+3} + γ² r_{t+4} + ... )
    = r_{t+1} + γ R_{t+1}

So:

V^π(s) = E_π{ R_t | s_t = s } = E_π{ r_{t+1} + γ V^π(s_{t+1}) | s_t = s }

Or, without the expectation operator:

V^π(s) = Σ_a π(s, a) Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ V^π(s′) ]
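This equation suggests iterative policy evaluation: sweep over the states, replacing each V(s) by the right-hand side until the values stop changing. Below is a hedged sketch, assuming a finite MDP encoded so that P[s][a] is a list of (probability, next state, reward) triples and pi[s][a] gives the action probabilities; these names are placeholders of the sketch, not notation from the slides:

def policy_evaluation(states, actions, pi, P, gamma=0.9, theta=1e-6):
    # Repeatedly apply the Bellman equation for V^pi until the largest change is below theta.
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = sum(pi[s][a] * prob * (reward + gamma * V[s_next])
                        for a in actions(s)
                        for prob, s_next, reward in P[s][a])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V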

Gridworld

Actions: north, south, east, west; deterministic. If an action would take the agent off the grid: no move, but reward = −1. Other actions produce reward = 0, except actions that move the agent out of the special states A and B, as shown.

[Figure: state-value function for the equiprobable random policy; γ = 0.9.]

Optimal Value Functions

For finite MDPs, policies can be partially ordered: π ≥ π′ if and only if V^π(s) ≥ V^{π′}(s) for all s ∈ S.

There is always at least one (and possibly many) policy that is better than or equal to all the others. This is an optimal policy. We denote them all by π*.

Optimal policies share the same optimal state-value function:

V*(s) = max_π V^π(s)   for all s ∈ S

Optimal policies also share the same optimal action-value function:

Q*(s, a) = max_π Q^π(s, a)   for all s ∈ S and a ∈ A(s)

This is the expected return for taking action a in state s and thereafter following an optimal policy.

Bellman Optimality Equation for V*

The value of a state under an optimal policy must equal the expected return for the best action from that state:

V*(s) = max_{a ∈ A(s)} Q*(s, a)
      = max_{a ∈ A(s)} E{ r_{t+1} + γ V*(s_{t+1}) | s_t = s, a_t = a }
      = max_{a ∈ A(s)} Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ V*(s′) ]

V* is the unique solution of this system of nonlinear equations.

Bellman Optimality Equation for Q*

Q*(s, a) = E{ r_{t+1} + γ max_{a′} Q*(s_{t+1}, a′) | s_t = s, a_t = a }
         = Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ max_{a′} Q*(s′, a′) ]

Q* is the unique solution of this system of nonlinear equations.

Why Optimal State-Value Functions are Useful

Any policy that is greedy with respect to V* is an optimal policy.

Therefore, given V*, a one-step-ahead search produces the long-term optimal actions.

E.g., back to the gridworld:

What About Optimal Action-Value Functions?

Given Q*, the agent does not even have to do a one-step-ahead search:

π*(s) = arg max_{a ∈ A(s)} Q*(s, a)

Solving the Bellman Optimality Equation

Finding an optimal policy by solving the Bellman Optimality Equation exactly requires the following:
• accurate knowledge of the environment dynamics;
• enough space and time to do the computation;
• the Markov Property.

How much space and time do we need? Polynomial in the number of states (via dynamic programming methods), BUT the number of states is often huge (e.g., backgammon has about 10^20 states).

We usually have to settle for approximations. Many RL methods can be understood as approximately solving the Bellman Optimality Equation.
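For small finite MDPs, one standard dynamic-programming method is value iteration, which repeatedly applies the Bellman optimality backup. A sketch under the same hypothetical MDP encoding as before (P[s][a] is a list of (probability, next state, reward) triples):

def value_iteration(states, actions, P, gamma=0.9, theta=1e-6):
    # Sweep until the largest change is below theta; a policy that is greedy
    # with respect to the resulting V is (approximately) optimal.
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(sum(prob * (reward + gamma * V[s_next])
                           for prob, s_next, reward in P[s][a])
                       for a in actions(s))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            return V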

Temporal Difference (TD) Learning

• Basic idea: transform the Bellman Equation into an update rule, using two consecutive timesteps.
• Policy evaluation: learn an approximation to the value function of the current policy.
• Policy improvement: act greedily with respect to the intermediate, learned value function.
• Repeating this over and over again leads to approximations of the optimal value function.

Q-Learning: TD-learning of action values

One-step Q-learning:

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]
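A minimal tabular sketch (not from the slides) of this backup; the defaultdict table and the argument names are assumptions of the example:

from collections import defaultdict

Q = defaultdict(float)   # Q[(state, action)] -> estimated action value

def q_learning_update(s, a, r, s_next, next_actions, alpha=0.1, gamma=0.9):
    # Q(s,a) <- Q(s,a) + alpha * [ r + gamma * max_a' Q(s',a') - Q(s,a) ]
    best_next = max((Q[(s_next, a2)] for a2 in next_actions), default=0.0)
    td_error = r + gamma * best_next - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return td_error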

Exploration/Exploitation revisited

Suppose you form action-value estimates Q_t(s, a) ≈ Q*(s, a).

The greedy action at time t is a_t* = arg max_a Q_t(s, a). Choosing a_t = a_t* is exploitation; choosing a_t ≠ a_t* is exploration.

You can’t exploit all the time; you can’t explore all the time. You can never stop exploring, but you should gradually reduce exploration.

ε-Greedy Action Selection

Greedy action selection:  a_t = a_t* = arg max_a Q_t(s, a)

ε-greedy:
  a_t = a_t*  with probability 1 − ε
  a_t = a random action  with probability ε

. . . the simplest way to try to balance exploration and exploitation.
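A short illustrative sketch of ε-greedy selection over a dictionary of action-value estimates (the dictionary format is an assumption of the example):

import random

def epsilon_greedy(q_values, epsilon=0.1):
    # q_values: dict mapping action -> estimated value.
    if random.random() < epsilon:
        return random.choice(list(q_values))     # explore: random action
    return max(q_values, key=q_values.get)       # exploit: greedy action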

Softmax Action Selection

Softmax action selection methods grade action probabilities by estimated values.

The most common softmax uses a Gibbs, or Boltzmann, distribution: choose action a on play t with probability

e^{Q_t(a)/τ} / Σ_{b=1}^{n} e^{Q_t(b)/τ},

where τ is the “computational temperature”.
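An illustrative sketch of softmax (Boltzmann) selection; the dictionary format and the default temperature are assumptions of the example. A high τ makes the choice nearly uniform, a low τ makes it nearly greedy:

import math
import random

def softmax_action(q_values, tau=1.0):
    # q_values: dict mapping action -> estimated value; tau > 0 is the temperature.
    actions = list(q_values)
    prefs = [math.exp(q_values[a] / tau) for a in actions]
    total = sum(prefs)
    return random.choices(actions, weights=[p / total for p in prefs], k=1)[0]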

Pole balancing learned using RL

Improving the basic TD learning scheme

• Can we learn more efficiently?
• Can we update multiple values at the same timestep?
• Can we look ahead further in time, rather than just use the value at the next timestep?

Yes! All of these can be done simultaneously with one extension: eligibility traces.

N-step TD Prediction

Idea: Look farther into the future when you do TD backup (1, 2, 3, …, n steps)

Monte Carlo:

TD: Use V to estimate remaining return

n-step TD: 2 step return:

n-step return:

Mathematics of N-step TD Prediction

Monte Carlo (complete return):
R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... + γ^{T−t−1} r_T

One-step TD return (use V to estimate the remaining return):
R_t^{(1)} = r_{t+1} + γ V_t(s_{t+1})

2-step return:
R_t^{(2)} = r_{t+1} + γ r_{t+2} + γ² V_t(s_{t+2})

n-step return:
R_t^{(n)} = r_{t+1} + γ r_{t+2} + ... + γ^{n−1} r_{t+n} + γ^n V_t(s_{t+n})

Learning with N-step Backups

Backup (on-line or off-line):

ΔV_t(s_t) = α [ R_t^{(n)} − V_t(s_t) ]

Random Walk Example

How does 2-step TD work here? How about 3-step TD?

Forward View of TD(λ)

TD(λ) is a method for averaging all n-step backups, weighting the n-step backup by λ^{n−1} (n−1 is the time since visitation).

λ-return:
R_t^λ = (1 − λ) Σ_{n=1}^∞ λ^{n−1} R_t^{(n)}

Backup using the λ-return:
ΔV_t(s_t) = α [ R_t^λ − V_t(s_t) ]

λ-return Weighting Function

Relation to TD(0) and MC

The λ-return can be rewritten as a weighted sum of the n-step returns until termination plus the complete return after termination:

R_t^λ = (1 − λ) Σ_{n=1}^{T−t−1} λ^{n−1} R_t^{(n)} + λ^{T−t−1} R_t

If λ = 1, you get MC:
R_t^λ = (1 − 1) Σ_{n=1}^{T−t−1} 1^{n−1} R_t^{(n)} + 1^{T−t−1} R_t = R_t

If λ = 0, you get TD(0):
R_t^λ = (1 − 0) Σ_{n=1}^{T−t−1} 0^{n−1} R_t^{(n)} + 0^{T−t−1} R_t = R_t^{(1)}

Forward View of TD(λ) II

Look forward from each state to determine update from future states and rewards:

λ-return on the Random Walk

Same random walk as before, but now with 19 states. Why do you think intermediate values of λ are best?

Backward View of TD(λ)

The forward view was for theory; the backward view is for mechanism.

New variable called the eligibility trace e_t(s). On each step, decay all traces by γλ and increment the trace for the current state by 1 (an accumulating trace):

e_t(s) = γλ e_{t−1}(s)        if s ≠ s_t
e_t(s) = γλ e_{t−1}(s) + 1    if s = s_t

Backward View

Shout the TD error δ_t backwards over time; the strength of your voice decreases with temporal distance by γλ:

δ_t = r_{t+1} + γ V_t(s_{t+1}) − V_t(s_t)

Relation of Backwards View to MC & TD(0)

Using the update rule

ΔV_t(s) = α δ_t e_t(s):

• As before, if you set λ to 0, you get TD(0).
• If you set λ to 1, you get MC, but in a better way: you can apply TD(1) to continuing tasks, and it works incrementally and on-line (instead of waiting until the end of the episode).
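A tabular sketch of one TD(λ) step with accumulating traces, following the backward view above; the dictionary-based V and e tables are assumptions of the example:

def td_lambda_step(V, e, s, r, s_next, alpha=0.1, gamma=0.9, lam=0.8):
    # delta_t = r_{t+1} + gamma * V(s_{t+1}) - V(s_t)
    delta = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)
    e[s] = e.get(s, 0.0) + 1.0                   # accumulate the trace for the current state
    for state in list(e):
        V[state] = V.get(state, 0.0) + alpha * delta * e[state]
        e[state] *= gamma * lam                  # decay all traces by gamma*lambda
    return delta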

Forward View = Backward View

The forward (theoretical) view of TD(λ) is equivalent to the backward (mechanistic) view.

Sutton & Barto’s book shows (the algebra is in the book) that the backward updates sum to the forward updates over an episode:

Σ_{t=0}^{T−1} ΔV_t^{TD}(s) = Σ_{t=0}^{T−1} ΔV_t^λ(s_t) I_{s s_t}

where I_{s s_t} = 1 if s = s_t and 0 otherwise.

Q(λ)-learning

Zero out the eligibility trace after a non-greedy action; do the max when backing up at the first non-greedy choice:

e_t(s, a) = 1 + γλ e_{t−1}(s, a)   if s = s_t, a = a_t, and Q_{t−1}(s_t, a_t) = max_a Q_{t−1}(s_t, a)
e_t(s, a) = 0                      if Q_{t−1}(s_t, a_t) ≠ max_a Q_{t−1}(s_t, a)
e_t(s, a) = γλ e_{t−1}(s, a)       otherwise

Q_{t+1}(s, a) = Q_t(s, a) + α δ_t e_t(s, a)

δ_t = r_{t+1} + γ max_{a′} Q_t(s_{t+1}, a′) − Q_t(s_t, a_t)

Q(λ)-learning

Initialize Q(s, a) arbitrarily and e(s, a) = 0, for all s, a
Repeat (for each episode):
    Initialize s, a
    Repeat (for each step of episode):
        Take action a, observe r, s′
        Choose a′ from s′ using policy derived from Q (e.g. ε-greedy)
        a* ← arg max_b Q(s′, b)   (if a′ ties for the max, then a* ← a′)
        δ ← r + γ Q(s′, a*) − Q(s, a)
        e(s, a) ← e(s, a) + 1
        For all s, a:
            Q(s, a) ← Q(s, a) + α δ e(s, a)
            If a′ = a*, then e(s, a) ← γλ e(s, a)
            else e(s, a) ← 0
        s ← s′; a ← a′
    until s is terminal

Q(λ) Gridworld Example

With one trial, the agent has much more information about how to get to the goal (not necessarily by the best route). This can considerably accelerate learning.

Conclusions: TD(λ)/Q(λ) methods

• Can significantly speed up learning
• Robust against unreliable value estimates (e.g. caused by Markov violations)
• Do have a cost in computation

Generalization and Function Approximation

• Look at how experience with a limited part of the state set can be used to produce good behavior over a much larger part.
• Overview of function approximation (FA) methods and how they can be adapted to RL.

Generalization

[Figure: a lookup table with one value V per state s1, s2, s3, ..., sN, contrasted with a generalizing function approximator that is trained only on some states (“train here”) but changes its estimates elsewhere as well.]

So with function approximation a single value update affects a larger region of the state space

Value Prediction with FA

Before, value functions were stored in lookup tables.

Now the value-function estimate at time t, V_t, depends on a parameter vector θ_t, and only the parameter vector is updated.

E.g., θ_t could be the vector of connection weights of a neural network.

Adapt Supervised Learning Algorithms

Inputs → Supervised Learning System → Outputs

Training info = desired (target) outputs
Error = (target output – actual output)
Training example = {input, target output}

Backups as Training Examples

E.g., the TD(0) backup:

V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]

As a training example:
  input = description of s_t
  target output = r_{t+1} + γ V(s_{t+1})

Any FA Method?

In principle, yes: artificial neural networks, decision trees, multivariate regression methods, etc.

But RL has some special requirements: we usually want to learn while interacting, the method must handle nonstationarity, and possibly others.

Gradient Descent Methods

Assume V_t is a (sufficiently smooth) differentiable function of the parameter vector θ_t = ( θ_t(1), θ_t(2), ..., θ_t(n) )^T, for all s ∈ S.

Assume, for now, training examples of this form:
  { description of s_t, V^π(s_t) }

Performance Measures for Gradient Descent

Many are applicable, but a common and simple one is the mean-squared error (MSE) over a distribution P:

MSE(θ_t) = Σ_{s ∈ S} P(s) [ V^π(s) − V_t(s) ]²

Gradient Descent

Let f be any function of the parameter space. Its gradient at any point θ_t in this space is:

∇_θ f(θ_t) = ( ∂f(θ_t)/∂θ(1), ∂f(θ_t)/∂θ(2), ..., ∂f(θ_t)/∂θ(n) )^T

Iteratively move down the gradient:

θ_{t+1} = θ_t − α ∇_θ f(θ_t)
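As a concrete illustrative special case, with a linear approximator V_t(s) = θ·φ(s) the gradient of V_t with respect to θ is simply the feature vector φ(s), so the TD(0) update with function approximation becomes very short; numpy and the argument names are assumptions of this sketch:

import numpy as np

def linear_td0_update(theta, phi_s, r, phi_s_next, alpha=0.01, gamma=0.9):
    # theta, phi_s, phi_s_next: 1-D numpy arrays of the same length.
    td_error = r + gamma * (theta @ phi_s_next) - (theta @ phi_s)
    theta += alpha * td_error * phi_s    # TD(0) step with linear features
    return td_error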

Control with FA

Learning state-action values. Training examples of the form:
  { description of (s_t, a_t), v_t }

The general gradient-descent rule:

θ_{t+1} = θ_t + α [ v_t − Q_t(s_t, a_t) ] ∇_θ Q_t(s_t, a_t)

Gradient-descent Q(λ) (backward view):

θ_{t+1} = θ_t + α δ_t e_t

where
  δ_t = r_{t+1} + γ max_{a′} Q_t(s_{t+1}, a′) − Q_t(s_t, a_t)
  e_t = γλ e_{t−1} + ∇_θ Q_t(s_t, a_t)

Linear Gradient Descent Q(λ)

Mountain-Car Task

Mountain-Car Results

Summary

• Generalization can be used in those cases where there are too many states for a table.
• Adapting supervised-learning function approximation methods.
• Gradient-descent methods.

Case Studies

• Illustrate the promise of RL
• Illustrate the difficulties, such as long learning times and finding good state representations

TD Gammon

Tesauro 1992, 1994, 1995, ...

• Objective is to advance all pieces to points 19-24
• 30 pieces and 24 locations imply an enormous number of configurations
• Effective branching factor of 400

A Few Details

• Reward: 0 at all times except those in which the game is won, when it is 1
• Episodic (game = episode), undiscounted
• Gradient-descent TD(λ) with a multi-layer neural network
  – weights initialized to small random numbers
  – backpropagation of the TD error
  – four input units for each point; unary encoding of the number of white pieces, plus other features
• Learning during self-play

Multi-layer Neural Network

Summary of TD-Gammon Results

The Acrobot

Spong 1994; Sutton 1996

Acrobot Learning Curves for Q(λ)

Typical Acrobot Learned Behavior

Elevator Dispatching

Crites and Barto 1996

State Space

• 18 hall call buttons: 2^18 combinations

• positions and directions of cars: 18^4 (rounding to nearest floor)

• motion states of cars (accelerating, moving, decelerating, stopped, loading, turning): 6

• 40 car buttons: 2^40

• Set of passengers waiting at each floor, each passenger's arrival time and destination: unobservable. However, 18 real numbers are available giving elapsed time since hall buttons pushed; we discretize these.

• Set of passengers riding each car and their destinations: observable only through the car buttons

Conservatively about 10^22 states.

Control Strategies

• Zoning: divide building into zones; park in zone when idle. Robust in heavy traffic.

• Search-based methods: greedy or non-greedy. Receding Horizon control.

• Rule-based methods: expert systems/fuzzy logic; from human “experts”

• Other heuristic methods: Longest Queue First (LQF), Highest Unanswered Floor First (HUFF), Dynamic Load Balancing (DLB)

• Adaptive/Learning methods: NNs for prediction, parameter space search using simulation, DP on simplified model, non-sequential RL

Performance Criteria

• Average wait time

• Average system time (wait + travel time)

• % waiting > T seconds (e.g., T = 60)

• Average squared wait time (to encourage fast and fair service)

Minimize:

Average Squared Wait Time

Instantaneous cost:

r_τ = Σ_p (wait_p(τ))²   (summed over currently waiting passengers p)

Define the return as an integral rather than a sum (Bradtke and Duff, 1994):

Σ_{t=0}^∞ γ^t r_t   becomes   ∫_0^∞ e^{−βτ} r_τ dτ

Algorithm

Repeat forever :

1. In state x at time tx , car c must decide to STOP or CONTINUE

2. It selects an action using a Boltzmann distribution (with decreasing temperature) based on current Q values

3. The next decision by car c is required in state y at time ty

4. It implements the gradient-descent version of the following backup using backprop:

Q(x, a) ← Q(x, a) + α [ ∫_{t_x}^{t_y} e^{−β(τ − t_x)} r_τ dτ + e^{−β(t_y − t_x)} max_{a′} Q(y, a′) − Q(x, a) ]

5. x ← y,  t_x ← t_y

Neural Networks

• 9 binary: state of each hall down button

• 9 real: elapsed time of hall down button if pushed

• 16 binary: one on at a time: position and direction of car making decision

• 10 real: location/direction of other cars

• 1 binary: at highest floor with waiting passenger?

• 1 binary: at floor with longest waiting passenger?

• 1 bias unit (constant 1)

Inputs as listed above: 47 in total, feeding 20 sigmoid hidden units and 1 or 2 output units.

Elevator Results

Dynamic Channel Allocation

Details in: Singh and Bertsekas 1997

Helicopter flying

• Difficult nonlinear control problem; also difficult for humans
• Approach: learn in simulation, then transfer to the real helicopter
• Uses a function approximator for generalization
• Bagnell, Ng, and Schneider (2001, 2003, ...)


In-class assignment

Think again of your own RL problem, with states, actions, and rewards

This time think especially about how uncertainty may play a role, and about how generalization may be important

Discussion


Homework assignment

• Due Thursday 13-16
• Think again of your own RL problem, with states, actions, and rewards
• Do a web search on your RL problem or related work
• What is there already, and what, roughly, have they done to solve the RL problem?
• Present briefly in class


Overview day 3

• Summary of what we’ve learnt about RL so far
• Models and planning
• Multi-agent RL
• Presentation of homework assignments and discussion


RL summary

• Objective: maximize the total amount of (discounted) reward.
• Approach: estimate a value function (defined over the state space) which represents this total amount of reward.
• Learn this value function incrementally by doing updates based on values of consecutive states (temporal differences).
• After having learnt the optimal value function, optimal behavior can be obtained by taking the action which has, or leads to, the highest value.
• Use function approximation techniques for generalization if the state space becomes too large for tables.

One-step Q-learning:

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]


RL weaknesses

Still “art” involved in defining good state (and action) representations

Long learning times

Planning and Learning

• Use of environment models
• Integration of planning and learning methods

Models

Model: anything the agent can use to predict how the environment will respond to its actions

Models can be used to produce simulated experience

Planning

Planning: any computational process that uses a model to create or improve a policy

Learning, Planning, and Acting

Two uses of real experience:
• model learning: to improve the model
• direct RL: to directly improve the value function and policy

Improving the value function and/or policy via a model is sometimes called indirect RL or model-based RL. Here, we call it planning.

Direct vs. Indirect RL

• Indirect methods make fuller use of experience: they get a better policy with fewer environment interactions.
• Direct methods are simpler and are not affected by bad models.

But they are very closely related and can be usefully combined: planning, acting, model learning, and direct RL can occur simultaneously and in parallel.

The Dyna Architecture (Sutton 1990)

The Dyna-Q Algorithm

[Figure: the Dyna-Q algorithm, combining direct RL, model learning, and planning.]
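A tabular sketch in the spirit of Dyna-Q (simplified, assuming a deterministic environment; the table layout and helper name are assumptions of the example): each real step gives one direct Q-learning update and a model update, followed by n planning updates replayed from the model:

import random
from collections import defaultdict

Q = defaultdict(float)    # Q[(state, action)]
model = {}                # model[(state, action)] = (reward, next_state)

def dyna_q_step(s, a, r, s_next, actions, alpha=0.1, gamma=0.95, n_planning=10):
    # Direct RL from the real transition.
    Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, b)] for b in actions) - Q[(s, a)])
    # Model learning (deterministic model).
    model[(s, a)] = (r, s_next)
    # Planning: n updates from randomly chosen previously observed transitions.
    for _ in range(n_planning):
        (ps, pa), (pr, pn) = random.choice(list(model.items()))
        Q[(ps, pa)] += alpha * (pr + gamma * max(Q[(pn, b)] for b in actions) - Q[(ps, pa)])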

Dyna-Q on a Simple Maze

rewards = 0 until the goal is reached, when the reward is 1

Dyna-Q Snapshots: Midway in 2nd Episode

Using Dyna-Q for real-time robot learning

Before learning After learning (approx. 15 minutes)

Multi-agent RL

So far we have considered only single-agent RL, but many domains have multiple agents!
• a group of industrial robots working on a single car
• robot soccer
• traffic

Can we extend the methods of single-agent RL to multi-agent RL?


Dimensions of multi-agent RL

• Is the objective to maximize individual rewards or to maximize global rewards? Competition vs. cooperation.
• Do the agents share information? Shared state representation? Communication?
• Homogeneous or heterogeneous agents? Do some agents have special capabilities?


Competition

• Like multiple single-agent cases running simultaneously
• Related to game theory: Nash equilibria, etc.
• Research goals:
  – study how to optimize individual rewards in the face of competition
  – study group dynamics


Cooperation

• More different from the single-agent case than competition is.
• How can we make the individual agents work together?
• Are rewards shared among the agents? Should all agents be punished for individual mistakes?

Robot soccer example: cooperation

• Riedmiller’s group in Karlsruhe
• Robots must play together to beat other groups of robots in RoboCup tournaments
• The Riedmiller group uses reinforcement learning techniques to do this

Opposite approaches to cooperative case

• Consider the multi-agent system as a collection of individual reinforcement learners.
  – Design individual reward functions such that cooperation “emerges”.
  – The agents may become “selfish”, or may not cooperate in a desirable way.
• Consider the whole multi-agent system as one big MDP with a large action vector.
  – The state-action space may become very large, but this is perhaps possible with advanced function approximation.

Interesting intermediate approach

• Let agents learn mostly individually.
• Assign (or learn!) a limited number of states where agents must coordinate, and at those points treat those agents as one larger single agent.
• This can be represented and computed efficiently using coordination graphs.
• Guestrin & Koller (2003), Kok & Vlassis (2004)

Robocup simulation league

Kok & Vlassis (2002-2004)

Advanced Generalization Issues

• Generalization over states: tables, linear methods, nonlinear methods
• Generalization over actions
• Proving convergence with generalization methods

Non-Markov case

• Try to do the best you can with non-Markov states
• Partially Observable MDPs (POMDPs)
  – Bayesian approach: belief states
  – construct the state from the sequence of observations

Other issues

• Model-free vs. model-based
• Value functions vs. directly searching for good policies (e.g. using genetic algorithms)
• Hierarchical methods
• Incorporating prior knowledge: advice and hints, trainers and teachers, shaping, Lyapunov functions, etc.

The end!