ECE-517: Reinforcement Learning in Artificial Intelligence
Lecture 10: Temporal-Difference Learning
Dr. Itamar Arel
College of Engineering, Department of Electrical Engineering and Computer Science
The University of Tennessee, Fall 2011
October 6, 2011
Introduction to Temporal-Difference (TD) Learning & TD Prediction
If one had to identify one idea as central and novel to RL, it would undoubtedly be temporal-difference (TD) learning.
Combination of ideas from DP and Monte Carlo: learns without a model (like MC), bootstraps (like DP).
Both TD and Monte Carlo methods use experience to solve the prediction problem.
We will focus on the prediction problem (a.k.a. policy evaluation): evaluating V(s) for a given policy.
A simple every-visit MC method may be expressed as

    $V(s_t) \leftarrow V(s_t) + \alpha \left[ R_t - V(s_t) \right]$

where the target $R_t$ is the actual return after time t. Let's call this constant-α MC.
TD Prediction (cont.)
Recall that in MC we need to wait until the end of the episode to update the value estimates.
The idea of TD is to do so every time step.
Simplest TD method, TD(0):

    $V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]$

Here the target $r_{t+1} + \gamma V(s_{t+1})$ is an estimate of the return.
Essentially, we are updating one guess based on another.
The idea is that we have a "moving target".
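To make the contrast concrete, here is a minimal sketch of the two update rules (the function names and the dictionary representation of V are illustrative assumptions, not code from the lecture):

```python
def mc_update(V, s_t, R_t, alpha):
    # Constant-alpha MC: V(s_t) <- V(s_t) + alpha * (R_t - V(s_t)).
    # Applied only once the episode has ended and the actual return R_t is known.
    V[s_t] += alpha * (R_t - V[s_t])

def td0_update(V, s_t, r_next, s_next, alpha, gamma):
    # TD(0): V(s_t) <- V(s_t) + alpha * (r_{t+1} + gamma * V(s_{t+1}) - V(s_t)).
    # Applied immediately after each transition, using the current guess V(s_{t+1}).
    V[s_t] += alpha * (r_next + gamma * V[s_next] - V[s_t])
```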
Simple Monte Carlo

[Backup diagram: MC backs up the value of state s_t along the entire sampled trajectory, down to the terminal state T.]

    $V(s_t) \leftarrow V(s_t) + \alpha \left[ R_t - V(s_t) \right]$

where $R_t$ is the actual return following state $s_t$.
Simplest TD Method

[Backup diagram: TD(0) backs up the value of state s_t from the single sampled transition (s_t, r_{t+1}, s_{t+1}) only.]

    $V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]$
Dynamic Programming

    $V(s_t) \leftarrow E_\pi \left\{ r_{t+1} + \gamma V(s_{t+1}) \right\}$

[Backup diagram: DP backs up the value of state s_t over all possible next transitions, taking an expected value under the model rather than a sample.]
Tabular TD(0) for estimating V
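The slide's algorithm box did not survive transcription; what follows is a sketch of tabular TD(0) as described above, assuming an environment object with reset() -> state and step(action) -> (next_state, reward, done), and a policy given as a function from states to actions:

```python
from collections import defaultdict

def tabular_td0(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    V = defaultdict(float)  # arbitrary initial V(s); terminal states stay at 0
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)                        # action from the policy being evaluated
            s_next, r, done = env.step(a)
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])      # TD(0) update
            s = s_next
    return V
```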
TD Methods Bootstrap and Sample

Bootstrapping: the update involves an estimate (i.e. a guess from a guess)
  Monte Carlo does not bootstrap
  Dynamic Programming bootstraps
  Temporal Difference bootstraps

Sampling: the update does not involve an expected value
  Monte Carlo samples
  Dynamic Programming does not sample
  Temporal Difference samples
Example: Driving Home
State                  Elapsed Time (min)   Predicted Time to Go   Predicted Total Time
leaving office         0                    30                     30
reach car, raining     5                    35                     40
exit highway           20                   15                     35
behind truck           30                   10                     40
home street            40                   3                      43
arrive home            43                   0                      43
[Annotations on the table: the elapsed times between states are the rewards, and the predicted total times are the returns from each state.]
Example: Driving Home (cont.)
[Figure: changes to the predicted total time recommended by Monte Carlo methods (α = 1) vs. changes recommended by TD methods (α = 1).]
Value of each state is its expected time-to-go.
Example: Driving Home (cont.)

Is it really necessary to wait until the end of the episode to start learning?
Monte Carlo says it is; TD learning argues that learning can occur on-line.
Suppose, on another day, you again estimate when leaving your office that it will take 30 minutes to drive home, but then you get stuck in a massive traffic jam.
Twenty-five minutes after leaving the office you are still bumper-to-bumper on the highway.
You now estimate that it will take another 25 minutes to get home, for a total of 50 minutes.
Must you wait until you get home before increasing your estimate for the initial state?
In TD you would be shifting your initial estimate from 30 minutes toward 50.
Advantages of TD Learning

TD methods do not require a model of the environment, only experience.
TD, but not MC, methods can be fully incremental:
  Agent learns a "guess from a guess"
  Agent can learn before knowing the final outcome: less memory, reduced peak computation
  Agent can learn without the final outcome, i.e. from incomplete sequences; this helps with applications that have very long episodes
Both MC and TD converge (under certain assumptions to be detailed later), but which is faster?
Currently unknown – generally TD does better on stochastic tasks.
Random Walk Example

In this example we empirically compare the prediction abilities of TD(0) and constant-α MC applied to a small Markov process:

All episodes start in state C.
Proceed one state, right or left, with equal probability.
Termination: exiting right yields R = +1, exiting left yields R = 0.
True values: V(A) = 1/6, V(B) = 2/6, V(C) = 1/2, V(D) = 4/6, V(E) = 5/6.
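As a rough illustration, a minimal simulation of this random walk together with a TD(0) estimate might look like the following (the state encoding and the 0.5 initialization are assumptions consistent with the example, not code from the lecture):

```python
import random

STATES = ["A", "B", "C", "D", "E"]  # five non-terminal states, terminals at both ends

def run_episode():
    """Return a list of (state, reward, next_state) transitions; next_state is None at termination."""
    i = 2  # every episode starts in C
    transitions = []
    while True:
        j = i + random.choice([-1, 1])            # left or right with equal probability
        if j < 0:                                 # exited on the left: reward 0
            transitions.append((STATES[i], 0.0, None))
            return transitions
        if j >= len(STATES):                      # exited on the right: reward +1
            transitions.append((STATES[i], 1.0, None))
            return transitions
        transitions.append((STATES[i], 0.0, STATES[j]))
        i = j

def td0_estimate(num_episodes=100, alpha=0.1, gamma=1.0):
    V = {s: 0.5 for s in STATES}                  # intermediate initial value
    for _ in range(num_episodes):
        for s, r, s_next in run_episode():
            target = r + (gamma * V[s_next] if s_next is not None else 0.0)
            V[s] += alpha * (target - V[s])
    return V  # compare against the true values 1/6, 2/6, 3/6, 4/6, 5/6
```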
Random Walk Example (cont.)

[Figure: learning curves for TD(0) and constant-α MC; data averaged over 100 sequences of episodes.]
Optimality of TD(0)

Suppose only a finite amount of experience is available, say 10 episodes or 100 time steps.
Intuitively, we repeatedly present the experience until convergence is achieved.
Updates are made after a batch of training data – also called batch updating.
For any finite Markov prediction task, under batch updating, TD(0) converges for sufficiently small α.
The MC method also converges deterministically, but to a different answer.
To better understand the difference between MC and TD(0), we'll consider the batch random walk.
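A sketch of what batch updating could look like in code (the episode representation matches the random-walk sketch above; the fixed number of sweeps stands in for "train until convergence" and is an assumption):

```python
from collections import defaultdict

def batch_td0(episodes, alpha=0.01, gamma=1.0, sweeps=1000):
    """Batch TD(0): replay a fixed set of episodes, accumulating TD increments
    over the whole batch before applying them, repeated until V stabilizes."""
    V = defaultdict(float)
    for _ in range(sweeps):
        deltas = defaultdict(float)
        for episode in episodes:                  # episode = [(s, r, s_next), ...]
            for s, r, s_next in episode:
                target = r + (gamma * V[s_next] if s_next is not None else 0.0)
                deltas[s] += alpha * (target - V[s])
        for s, d in deltas.items():               # apply all increments at once
            V[s] += d
    return V
```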
Optimality of TD(0) (cont.)

[Figure: batch-training performance of TD(0) and constant-α MC on the random walk. After each new episode, all previous episodes were treated as a batch, and the algorithm was trained until convergence; all repeated 100 times.]

A key question is: what would explain these two curves?
You are the Predictor

Suppose you observe the following 8 episodes:

1) A, 0, B, 0
2) B, 1
3) B, 1
4) B, 1
5) B, 1
6) B, 1
7) B, 1
8) B, 0

Q: What would you guess V(A) and V(B) to be?
You are the Predictor (cont.)

V(A) = ¾ is the answer that batch TD(0) gives.
The other reasonable answer is simply to say that V(A) = 0 (Why?).
This is the answer that MC gives.
If the process is Markovian, we expect that the TD(0) answer will produce lower error on future data, even though the Monte Carlo answer is better on the existing data.
TD(0) vs. MC

For MC, the prediction that best matches the training data is V(A) = 0.
This minimizes the mean-squared error on the training set.
This is what a batch Monte Carlo method gets.
If we consider the sequentiality of the problem, then we would set V(A) = 0.75.
This is correct for the maximum-likelihood estimate of a Markov model generating the data, i.e., if we do a best-fit Markov model, assume it is exactly correct, and then compute what it predicts.
This is called the certainty-equivalence estimate.
It is what TD(0) yields.
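A quick way to see where V(A) = ¾ comes from: in the best-fit Markov model, A always transitions to B with reward 0, and B terminates with reward 1 in 6 of its 8 occurrences, so V(B) = 6/8 and V(A) = 0 + V(B) = 0.75. The snippet below (illustrative, not from the lecture) checks this:

```python
# The eight observed episodes, each a list of (state, reward) pairs.
episodes = [[("A", 0), ("B", 0)]] + [[("B", 1)]] * 6 + [[("B", 0)]]

# Certainty-equivalence estimate: V(B) is the average reward observed from B,
# and since A always moves to B with reward 0, V(A) = 0 + V(B).
rewards_from_B = [r for ep in episodes for s, r in ep if s == "B"]
V_B = sum(rewards_from_B) / len(rewards_from_B)   # 6/8 = 0.75
V_A = 0 + V_B                                     # 0.75, the batch TD(0) answer
print(V_A, V_B)
```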
Learning an Action-Value Function

We now consider the use of TD methods for the control problem.
As with MC, we need to balance exploration and exploitation.
Again, two schemes: on-policy and off-policy.
We'll start with on-policy, and learn the action-value function.
After every transition from a non-terminal state s_t, do:

    $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]$

If $s_{t+1}$ is terminal, then $Q(s_{t+1}, a_{t+1}) = 0$.
SARSA: On-Policy TD(0) Learning

One can easily turn this into a control method by always updating the policy to be greedy with respect to the current estimate of Q(s,a).
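A minimal SARSA sketch under the same assumed environment interface as before; ε-greedy action selection is one common way to keep exploring, and the helper below is an assumption rather than the lecture's code:

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, s, actions, epsilon):
    # With probability epsilon explore; otherwise act greedily w.r.t. Q.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa(env, num_episodes, alpha=0.1, gamma=1.0, epsilon=0.1):
    Q = defaultdict(float)                        # Q[(s, a)]; terminal pairs stay at 0
    for _ in range(num_episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, env.actions, epsilon)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(Q, s_next, env.actions, epsilon)
            target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next                 # update uses the action actually taken: on-policy
    return Q
```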
Q-Learning: Off-Policy TD Control

One of the most important breakthroughs in RL was the development of Q-learning – an off-policy TD control algorithm (1989):

    $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right]$
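For comparison, a hedged Q-learning sketch reusing the assumed environment interface and the epsilon_greedy helper from the SARSA sketch above; note the target maximizes over actions instead of using the action actually taken next:

```python
from collections import defaultdict

def q_learning(env, num_episodes, alpha=0.1, gamma=1.0, epsilon=0.1):
    Q = defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = epsilon_greedy(Q, s, env.actions, epsilon)  # behavior policy explores
            s_next, r, done = env.step(a)
            best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in env.actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])  # greedy target: off-policy
            s = s_next
    return Q
```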
Q-Learning: Off-Policy TD Control (cont.)

The learned action-value function, Q, directly approximates the optimal action-value function, Q*.
Converges as long as all states are visited and all state-action values are updated.
Why is it considered an off-policy control method?
How expensive is it to implement?