Chapter 6: Temporal Difference Learning


Page 1: Chapter 6: Temporal Difference Learning

R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction

Chapter 6: Temporal Difference Learning

Objectives of this chapter:

Introduce Temporal Difference (TD) learning
Focus first on policy evaluation, or prediction, methods
Then extend to control methods

Page 2: Chapter 6: Temporal Difference Learning


TD Prediction

Policy Evaluation (the prediction problem): for a given policy π, compute the state-value function V^π

Recall the simple every-visit Monte Carlo method:

V(s_t) ← V(s_t) + α [ R_t − V(s_t) ]

target: the actual return after time t. Must wait until the end of the episode.

The simplest TD method, TD(0):

V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]

target: an estimate of the return, available after one step.
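
To make the difference between the two targets concrete, here is a minimal MATLAB sketch (not part of the original slides) that applies both updates to one state; the step size, discount, transition, and return are hypothetical values chosen only for illustration.

% Sketch: constant-alpha MC update vs. TD(0) update on one observed state
alpha = 0.1; gamma = 1;                  % assumed step size and discount
V = zeros(1,5);                          % value table for 5 hypothetical states
s = 2; s_next = 3;                       % an observed transition s -> s_next
r_next = 0;                              % observed one-step reward
R = 1;                                   % actual return that eventually followed s
V_mc = V; V_td = V;
V_mc(s) = V_mc(s) + alpha*(R - V_mc(s));                            % MC: target is the actual return
V_td(s) = V_td(s) + alpha*(r_next + gamma*V_td(s_next) - V_td(s));  % TD(0): target is the bootstrapped estimate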

Page 3: Chapter 6: Temporal Difference Learning


Simple Monte Carlo

V(s_t) ← V(s_t) + α [ R_t − V(s_t) ]

where R_t is the actual return following state s_t.

[Backup diagram: from s_t, the Monte Carlo update backs up over the entire sampled episode, all the way to the terminal state T.]

Page 4: Chapter 6: Temporal Difference Learning


Simplest TD Method

[Backup diagram: the TD(0) update backs up over a single sampled transition s_t → r_{t+1}, s_{t+1}.]

V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]

Page 5: Chapter 6: Temporal Difference Learning


cf. Dynamic Programming

V(s_t) ← E_π { r_{t+1} + γ V(s_{t+1}) }

[Backup diagram: the DP update backs up over a full one-step expectation of all possible transitions s_t → r_{t+1}, s_{t+1}.]

Page 6: Chapter 6: Temporal Difference Learning


TD Bootstraps and Samples

Bootstrapping: the update involves an existing estimate
MC does not bootstrap
DP bootstraps
TD bootstraps

Sampling: the update does not involve a full expected value
MC samples
DP does not sample
TD samples

Page 7: Chapter 6: Temporal Difference Learning


Example: Driving Home

State                 Elapsed Time (min)   Predicted Time to Go   Predicted Total Time
leaving office                 0                   30                     30
reach car, raining             5                   35                     40
exit highway                  20                   15                     35
behind truck                  30                   10                     40
home street                   40                    3                     43
arrive home                   43                    0                     43

Page 8: Chapter 6: Temporal Difference Learning


Driving Home

Changes recommended by Monte Carlo methods (α = 1)

Changes recommended by TD methods (α = 1)
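
As a sketch of how these two sets of changes can be computed from the driving-home table (assuming α = 1 and working with the predicted total travel time at each state):

% Sketch (alpha = 1): changes recommended by MC vs. TD(0) for the driving-home example
alpha = 1;
V = [30 40 35 40 43 43];                 % predicted total travel time at the six states
final_outcome = 43;                      % actual total travel time
dV_mc = alpha*(final_outcome - V(1:5));  % MC: shift each prediction toward the final outcome
dV_td = alpha*(V(2:6) - V(1:5));         % TD(0): shift each prediction toward the next prediction
disp([dV_mc; dV_td]);                    % first row: MC changes, second row: TD changes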

Page 9: Chapter 6: Temporal Difference Learning


Advantages of TD Learning

TD methods do not require a model of the environment, only experience

TD, but not MC, methods can be fully incremental
You can learn before knowing the final outcome
– Less memory
– Less peak computation
You can learn without the final outcome
– From incomplete sequences

Both MC and TD converge (under certain assumptions to be detailed later), but which is faster?

Page 10: Chapter 6: Temporal Difference Learning


Random Walk Example

Values learned by TD(0) after various numbers of episodes on the 5-state random walk

Episodes terminate on either end, with reward 0 on the left and 1 on the right; all other rewards are 0
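
A small MATLAB sketch of TD(0) prediction on this random walk, using the setup described above; the step size and number of episodes are assumed for illustration.

% Sketch: TD(0) on the 5-state random walk (states A..E = 1..5, start in C)
alpha = 0.1; gamma = 1;
V = 0.5*ones(1,5);                        % initial value 0.5 for each nonterminal state
num_episodes = 100;
for episode = 1:num_episodes
    s = 3;                                % start in the center state C
    while true
        step = 2*(rand > 0.5) - 1;        % move left or right with equal probability
        s_next = s + step;
        if s_next == 0                    % terminated on the left, reward 0
            V(s) = V(s) + alpha*(0 - V(s));
            break;
        elseif s_next == 6                % terminated on the right, reward 1
            V(s) = V(s) + alpha*(1 - V(s));
            break;
        else                              % nonterminal transition, reward 0
            V(s) = V(s) + alpha*(0 + gamma*V(s_next) - V(s));
            s = s_next;
        end
    end
end
disp(V);                                  % true values are 1/6, 2/6, 3/6, 4/6, 5/6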

Page 11: Chapter 6: Temporal Difference Learning


TD and MC on the Random Walk

Data averaged over 100 sequences of episodes

Page 12: Chapter 6: Temporal Difference Learning


Optimality of TD(0)

Batch Updating: train completely on a finite amount of data, e.g., train repeatedly on 10 episodes until convergence.

Compute updates according to TD(0), but only update estimates after each complete pass through the data.

For any finite Markov prediction task, under batch updating, TD(0) converges for sufficiently small α.

Constant-α MC also converges under these conditions, but to a different answer! It converges to the sample averages of the actual returns at each visited state.
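
A sketch of batch TD(0) under an assumed data format: episodes is a cell array of stored episodes, each a matrix of [s, r_next, s_next] rows with s_next = 0 marking a terminal transition. Increments are accumulated over a full pass through the data and applied only at the end of the pass.

% Sketch: batch TD(0) over a fixed set of stored episodes
function V = batch_td0(episodes, num_states, alpha, gamma)
    V = zeros(1, num_states);
    for pass = 1:1000                               % repeated passes until (approximate) convergence
        dV = zeros(1, num_states);
        for e = 1:numel(episodes)
            ep = episodes{e};
            for k = 1:size(ep,1)
                s = ep(k,1); r = ep(k,2); s_next = ep(k,3);
                if s_next == 0
                    target = r;                     % terminal transition: no bootstrapping
                else
                    target = r + gamma*V(s_next);
                end
                dV(s) = dV(s) + alpha*(target - V(s));
            end
        end
        V = V + dV;                                 % apply the accumulated increments
        if max(abs(dV)) < 1e-8, break; end          % stop once a pass changes nothing
    end
end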

Page 13: Chapter 6: Temporal Difference Learning


Random Walk under Batch Updating

After each new episode, all previously seen episodes were treated as a batch, and the algorithm was trained until convergence; the whole experiment was repeated 100 times. TD appears to do better than MC, yet batch MC gives the estimates that minimize mean-squared error against the actual returns. So in what sense can TD be better than this optimum?

Page 14: Chapter 6: Temporal Difference Learning


You are the Predictor

Suppose you observe the following 8 episodes:

A, 0, B, 0
B, 1
B, 1
B, 1
B, 1
B, 1
B, 1
B, 0

V(A)?

V(B)?

Page 15: Chapter 6: Temporal Difference Learning


You are the Predictor

V(A)?

Page 16: Chapter 6: Temporal Difference Learning


You are the Predictor

The prediction that best matches the training data is V(A) = 0. This minimizes the mean-squared error on the training set, and it is what a batch Monte Carlo method gets.

If we consider the sequentiality of the problem, then we would instead set V(A) = 0.75.

This is correct for the maximum-likelihood estimate of a Markov model generating the data,

i.e., if we fit the best Markov model to the data, assume it is exactly correct, and then compute what it predicts (how? see the sketch below).

This is called the certainty-equivalence estimate. It is what TD(0) converges to.
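
A tiny sketch of the certainty-equivalence computation for these eight episodes; the counts are read directly off the data above.

% Sketch: certainty-equivalence estimate of V(A) from the 8 observed episodes
p_A_to_B = 1/1;              % A was seen once and always went to B (with reward 0)
V_B = 6/8;                   % B was seen 8 times; 6 of its terminations gave reward 1
V_A = 0 + p_A_to_B * V_B;    % certainty-equivalence value of A (undiscounted) = 0.75
% batch TD(0) converges to this value; batch MC instead gives V(A) = 0,
% the sample average of A's single observed return.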

Page 17: Chapter 6: Temporal Difference Learning


Learning An Action-Value Function

Estimate Q^π for the current behavior policy π.

After every transition from a nonterminal state s_t, do this:

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ]

If s_{t+1} is terminal, then Q(s_{t+1}, a_{t+1}) = 0.
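
A minimal sketch of this update applied to a single transition, using the same Qsa(action, state) table layout as the windy-gridworld code later in the chapter; the particular states, actions, and parameters are hypothetical.

% Sketch: one application of the action-value (Sarsa-style) TD update
Q = zeros(4,70); alpha = 0.5; gamma = 1;          % assumed table size and parameters
s = 4; a = 2; r = -1; s_next = 11; a_next = 2;    % one hypothetical (s, a, r, s', a') transition
% if s_next were terminal, Q(a_next, s_next) would be taken as 0
Q(a,s) = Q(a,s) + alpha*(r + gamma*Q(a_next,s_next) - Q(a,s));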

Page 18: Chapter 6: Temporal Difference Learning


Sarsa: On-Policy TD Control

Turn this into a control method by always updating the policy to be greedy with respect to the current estimate:

The name Sarsa comes from the quintuple used in the update: (s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})

Page 19: Chapter 6: Temporal Difference Learning


Windy Gridworld

undiscounted, episodic, reward = –1 until goal

Page 20: Chapter 6: Temporal Difference Learning


Results of Sarsa on the Windy Gridworld

Page 21: Chapter 6: Temporal Difference Learning


Results of Sarsa on the Windy Gridworld

Page 22: Chapter 6: Temporal Difference Learning


Results of Sarsa on the Windy Gridworld

% Example 6.5: Windy Gridworld (Sarsa)
% states are numbered from 1 to 70 according to row and column index
% matrix is scanned columnwise
% S_prime(action, state) gives the successor of each state for actions
% 1 = north, 2 = east, 3 = south, 4 = west, with the wind folded in
clear all;
close all;
% rewards
r=0*(1:70)-1;
r(53)=0;
% next state for action - going north
S_prime(1,1)=1;
S_prime(1,2:70)=(1:69);
S_prime(1,1:7:64)=(1:7:64);
Spr=reshape(S_prime,7,10);
% wind effect
Spr(:,4:9)=[Spr(1,4:9); Spr(1:6,4:9)];
Spr(:,7:8)=[Spr(1,7:8); Spr(1:6,7:8)];
S_prime(1,:)=reshape(Spr,1,70);
% next state for action - going east
Spr=(8:70);
Spr=reshape(Spr,7,9);
Spr=[Spr Spr(:,9)];
% wind effect
Spr(:,4:9)=[Spr(1,4:9); Spr(1:6,4:9)];
Spr(:,7:8)=[Spr(1,7:8); Spr(1:6,7:8)];
S_prime(2,:)=reshape(Spr,1,70);
% next state for action - going south
Spr=(1:70);
Spr=reshape(Spr,7,10);
Spr=[Spr(2:7,:); Spr(7,:)];
% wind effect
Spr(:,4:9)=[Spr(1,4:9); Spr(1:6,4:9)];
Spr(1,4:9)=(22:7:57);
Spr(:,7:8)=[Spr(1,7:8); Spr(1:6,7:8)];
S_prime(3,:)=reshape(Spr,1,70);
% next state for action - going west
Spr=(1:70);
Spr=reshape(Spr,7,10);
Spr=[Spr(:,1) Spr(:,1:9)];
% wind effect
Spr(:,4:9)=[Spr(1,4:9); Spr(1:6,4:9)];
Spr(:,7:8)=[Spr(1,7:8); Spr(1:6,7:8)];
S_prime(4,:)=reshape(Spr,1,70);
% initialize variables
Qsa=zeros(4,70);
alpha=0.9;
epsi=0.1;
gamma=0.9; % discount used in this code (the task itself is stated as undiscounted)
num_episodes=170;
t=1;

Page 23: Chapter 6: Temporal Difference Learning


Results of Sarsa on the Windy Gridworld

% repeat for each episode
for episode=1:num_episodes
    epsi=1/t;                     % exploration rate decays as 1/t
    % initialize state (start state)
    s=4;
    % choose action a from s using an epsilon-greedy policy
    chose_policy=rand>epsi;
    if chose_policy
        [val,a]=max(Qsa(:,s));
    else
        a=ceil(rand*(1-eps)*4);   % random action in {1,...,4}; eps is MATLAB's machine epsilon, not epsi
    end;
    % repeat for each step of episode
    episode_size(episode)=0;
    path=[ ];
    random=[ ];
    while s~=53                   % state 53 is the goal
        epsi=1/t;
        % take action a, observe r, s_pr
        s_pr=S_prime(a,s);
        % choose action a_pr from s_pr using an epsilon-greedy policy
        chose_policy_pr=rand>epsi;
        if chose_policy_pr
            [val,a_pr]=max(Qsa(:,s_pr));
            random=[random 0];
        else
            a_pr=ceil(rand*(1-eps)*4);
            random=[random 1];
        end;
        % Sarsa update (the reward is -1 on every step)
        Qsa(a,s)=Qsa(a,s)+alpha*(-1+gamma*Qsa(a_pr,s_pr)-Qsa(a,s));
        % the following selection works better
        % Qsa(a,s)=Qsa(a_pr,s_pr)-1;
        s=s_pr;
        a=a_pr;
        episode_size(episode)=episode_size(episode)+1;
        path=[path s];
        t=t+1;
    end; %while s
end;

Page 24: Chapter 6: Temporal Difference Learning


Results of Sarsa on the Windy Gridworld

plot([0 cumsum(episode_size)],(0:num_episodes));
xlabel('Time step');
ylabel('Episode index');
title('Windy gridworld learning rate with eps=1/t, alpha=0.9');
figure(2);
plot(episode_size(1:num_episodes));
xlabel('Episode index');
ylabel('Episode length in steps');
title('Windy gridworld learning rate with eps=1/t, alpha=0.9');
% determine the optimum policy
for i=1:70
    Optimum_policy(:,i)=(Qsa(:,i)==max(Qsa(:,i)));
end;
norm=sum(Optimum_policy);
for i=1:70
    Optimum_policy(:,i)=Optimum_policy(:,i)./norm(i);
end
% Optimum path
% path =
%   Columns 1 through 13
%     11    18    25    31    37    43    50    57    64    65    66    67    68
%   Columns 14 through 15
%     61    53

Page 25: Chapter 6: Temporal Difference Learning


Q-Learning: Off-Policy TD Control

One-step Q-learning:

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]

The max over next actions in the target makes this different from on-policy TD control (Sarsa), which uses the action actually taken next.
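
A minimal sketch of the one-step Q-learning update on a single transition (table layout and numbers are hypothetical); the target uses the greedy value over next actions regardless of which action will actually be taken.

% Sketch: one application of the one-step Q-learning update
Q = zeros(4,70); alpha = 0.5; gamma = 1;          % assumed table size and parameters
s = 4; a = 2; r = -1; s_next = 11;                % one hypothetical transition
Q(a,s) = Q(a,s) + alpha*(r + gamma*max(Q(:,s_next)) - Q(a,s));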

Page 26: Chapter 6: Temporal Difference Learning


Cliffwalking

ε-greedy with ε = 0.1. Reward is −1 for all moves, and −100 for falling off the cliff.

Sarsa

Q-learning

Page 27: Chapter 6: Temporal Difference Learning


Actor-Critic Methods

Explicit representation of policy as well as value function

Minimal computation to select actions

Can learn an explicit stochastic policy

Can put constraints on policies

Appealing as psychological and neural models

Page 28: Chapter 6: Temporal Difference Learning


Actor-Critic Details

The TD error is used to evaluate actions:

δ_t = r_{t+1} + γ V(s_{t+1}) − V(s_t)

If actions are determined by preferences p(s, a), via the softmax (Gibbs) policy

π_t(s, a) = Pr{ a_t = a | s_t = s } = e^{p(s,a)} / Σ_b e^{p(s,b)} ,

then you can update the preferences like this:

p(s_t, a_t) ← p(s_t, a_t) + β δ_t

So with positive δ_t the preference for the chosen action is increased, and with negative δ_t it is decreased.
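
A sketch of one actor-critic step under these equations; the table sizes, the critic and actor step sizes α and β, and the observed reward and next state are assumed for illustration.

% Sketch: a single tabular actor-critic step
num_states = 5; num_actions = 2;
V = zeros(1, num_states);                            % critic: state values
p = zeros(num_states, num_actions);                  % actor: action preferences
alpha = 0.1; beta = 0.1; gamma = 1;
s = 1;
probs = exp(p(s,:)) / sum(exp(p(s,:)));              % softmax (Gibbs) policy pi(s, .)
a = find(rand < cumsum(probs), 1);                   % sample an action from the policy
r = -1; s_next = 2;                                  % hypothetical observed reward and next state
delta = r + gamma*V(s_next) - V(s);                  % TD error: the critic's evaluation of the action
V(s) = V(s) + alpha*delta;                           % critic update
p(s,a) = p(s,a) + beta*delta;                        % actor update: reinforce a if delta > 0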

Page 29: Chapter 6: Temporal Difference Learning


Dopamine Neurons and TD Error

W. Schultz et al., Université de Fribourg

[Figure: brain regions labeled on the slide: amygdala, nucleus accumbens, hippocampus, striatum (caudate nucleus), VTA and substantia nigra (not shown).]

Page 30: Chapter 6: Temporal Difference Learning


The CAR task

In the conditioned avoidance response (CAR) task, a rat is trained that a conditioned stimulus (CS), such as a light or a sound, precedes an unconditioned stimulus (US), such as an electric shock. Under normal conditions, the rat will learn to do two things:

First, it will learn to escape the shock by moving to the other side of the cage (for example).

Second, it will eventually learn that the CS predicts the shock, and it will then avoid the shock entirely by taking evasive action in response to the CS.

Page 31: Chapter 6: Temporal Difference Learning


A computational model

Following formal models of Reinforcement Learning (RL) (Sutton and Barto, 1998), we translate the CAR task into a Markov Decision Process (MDP). Each circle represents a different state the simulated rat can be in, and the arrows denote the outcome of each of the two possible actions. The shock of the US elicits a punishment of −1. The rat must learn to take the appropriate action in each state so as to minimise overall punishment.

Page 32: Chapter 6: Temporal Difference Learning


R-Learning

R-learning is an off-policy control method for the advanced setting in which one neither discounts nor divides experience into distinct episodes. Instead, it tries to maximize the average reward per time step.

Page 33: Chapter 6: Temporal Difference Learning


Average Reward Per Time Step in R-Learning

Average expected reward per time step under policy π:

ρ^π = lim_{n→∞} (1/n) Σ_{t=1}^{n} E_π{ r_t }

This is the same for every state if the process is ergodic, i.e., there is a nonzero probability of reaching any state from any other under any policy.

Value of a state relative to ρ^π:

Ṽ^π(s) = Σ_{k=1}^{∞} E_π{ r_{t+k} − ρ^π | s_t = s }

Value of a state-action pair relative to ρ^π:

Q̃^π(s, a) = Σ_{k=1}^{∞} E_π{ r_{t+k} − ρ^π | s_t = s, a_t = a }

These are transient values, declining to zero as all states approach their long-run average reward.
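
A sketch of the tabular R-learning updates on one transition, in the spirit of the algorithm described in the book; the table sizes, step sizes, and the particular transition are assumed for illustration. Q holds relative action values and rho is the running estimate of the average reward per step.

% Sketch: R-learning updates for a single transition (s, a, r, s_next)
num_states = 10; num_actions = 2;
Q = zeros(num_states, num_actions);
rho = 0; alpha = 0.1; beta = 0.01;
s = 1; a = 1; r = 2; s_next = 3;                     % one hypothetical transition
Q(s,a) = Q(s,a) + alpha*(r - rho + max(Q(s_next,:)) - Q(s,a));
if Q(s,a) == max(Q(s,:))                             % update rho only when a greedy action was taken
    rho = rho + beta*(r - rho + max(Q(s_next,:)) - max(Q(s,:)));
end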

Page 34: Chapter 6: Temporal Difference Learning


Access-Control Queuing Task

n servers

Customers have four different priorities, which pay a reward of 1, 2, 3, or 4 if served

At each time step, customer at head of queue is accepted (assigned to a server) or removed from the queue

Proportion of randomly distributed high priority customers in queue is h

Busy server becomes free with probability p on each time step

Statistics of arrivals and departures are unknown

n = 10, h = 0.5, p = 0.06

Apply R-learning

Page 35: Chapter 6: Temporal Difference Learning


Afterstates

Usually, a state-value function evaluates states in which the agent can take an action.

But sometimes it is useful to evaluate states after the agent has acted, as in tic-tac-toe.

Why is this useful? We do not know the opponent's response, so determining the value of the pre-move state is difficult. Action values also do not work well, since different states and actions may result in the same board position.

What is this in general?

Page 36: Chapter 6: Temporal Difference Learning


Summary

TD prediction
Introduced one-step, tabular, model-free TD methods
Extend prediction to control by employing some form of GPI (generalized policy iteration)

On-policy control: Sarsa Off-policy control: Q-learning and R-learning

These methods bootstrap and sample, combining aspects of DP and MC methods