Reinforcement Learning I: The setting and classical
stochastic dynamic programming algorithms

Tuomas Sandholm
Carnegie Mellon University
Computer Science Department
Reinforcement Learning (Ch. 17.1-17.3, Ch. 20)

Learner: passive or active

Sequential decision problems

Approaches:
1. Learn values of states (or state histories) & try to maximize utility of their outcomes.
   • Needs a model of the environment: what operators exist & what states they lead to
2. Learn values of state-action pairs
   • Does not require a model of the environment (except legal moves)
   • Cannot look ahead
Reinforcement Learning …

Deterministic transitions
Stochastic transitions: M^a_ij is the probability of reaching state j when taking action a in state i
[Figure: the 4x3 grid world. Rows 1-3, columns 1-4; the agent starts in the lower-left corner, and the terminal states carry rewards +1 and -1.]
A simple environment that presents the agent with a sequential decision problem. Move cost = 0.04.

(Temporal) credit assignment problem; sparse reinforcement problem.

Offline alg: action sequences determined ex ante.
Online alg: action sequence is conditional on observations along the way; important in stochastic environments (e.g., flying a jet).
Reinforcement Learning …

Transition model M: 0.8 in the direction you want to go, 0.2 perpendicular (0.1 left, 0.1 right).

Policy: mapping from states to actions.
An optimal policy for the stochastic environment [policy arrows omitted]; the utilities of states:

        col 1    col 2    col 3    col 4
row 3   0.812    0.868    0.912     +1
row 2   0.762    (wall)   0.660     -1
row 1   0.705    0.655    0.611    0.388
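The 0.8/0.1/0.1 move dynamics above can be sketched as follows. This is a minimal illustration, not code from the lecture: the coordinate convention (col, row), the wall at (2,2), and all helper names are mine.

```python
# Transition model of the 4x3 world: the intended direction succeeds with
# probability 0.8; the agent slips perpendicular to it (left or right of the
# intended heading) with probability 0.1 each. Bumping into the wall or the
# grid edge leaves the agent where it is.

WALL = (2, 2)
MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
PERP = {"N": ("W", "E"), "S": ("E", "W"), "E": ("N", "S"), "W": ("S", "N")}

def step(state, heading):
    """Deterministic move: stay put on collisions with the wall or the edge."""
    dx, dy = MOVES[heading]
    nxt = (state[0] + dx, state[1] + dy)
    if nxt == WALL or not (1 <= nxt[0] <= 4 and 1 <= nxt[1] <= 3):
        return state
    return nxt

def transition_model(state, action):
    """Return {successor state: probability}, i.e. M^a_i as a dict."""
    probs = {}
    left, right = PERP[action]
    for heading, p in [(action, 0.8), (left, 0.1), (right, 0.1)]:
        j = step(state, heading)
        probs[j] = probs.get(j, 0.0) + p
    return probs
```

For example, `transition_model((1, 1), "N")` gives 0.8 for (1,2), 0.1 for staying at (1,1) (the westward slip hits the edge), and 0.1 for (2,1).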
Environment:
• Observable (accessible): the percept identifies the state
• Partially observable

Markov property: transition probabilities depend on the state only, not on the path to the state. Markov decision problem (MDP). Partially observable MDP (POMDP): percepts do not carry enough info to identify the transition probabilities.
Partial observability in the previous figure: (2,1) vs. (2,3).

U(A) = 0.8 · [U(A) in (2,1)] + 0.2 · [U(A) in (2,3)]

Have to factor in the value of new information obtained by moving in the world.
Observable MDPs

Assume additivity (almost always true in practice):

U_h([S0, S1, …, Sn]) = R0 + U_h([S1, …, Sn])

Utility function on histories.

Policy*(i) = argmax_a Σ_j M^a_ij U(j)

U(i) = R(i) + max_a Σ_j M^a_ij U(j)
Classic Dynamic Programming (DP)

Start from the last step & move backward.

Complexity of naïve search: O(|A|^n)
DP: O(n|A||S|), where |A| = actions per step and |S| = # possible states

Problem: n = ∞ if there are loops or an otherwise infinite horizon.
Unlike dynamic programming, does not require there to exist a "last step".
The utility values for selected states at each iteration step in the application of VALUE-ITERATION to the 4x3 world in our example. [Plot omitted.]

Theorem: As t → ∞, value iteration converges to the exact U even if updates are done asynchronously & i is picked randomly at every step.
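A minimal value-iteration sketch of the update U(i) ← R(i) + max_a Σ_j M^a_ij U(j). The 2-state MDP, its rewards, and its action names are made up for illustration, and a discount factor (previewing the discounting slide) is added so the iteration converges without terminal states.

```python
# Value iteration on a hypothetical 2-state MDP.

R = [0.0, 1.0]                         # immediate reward per state
M = {                                  # M[a][i][j] = P(j | i, a)
    "stay": [[1.0, 0.0], [0.0, 1.0]],
    "go":   [[0.1, 0.9], [0.9, 0.1]],
}
GAMMA = 0.9                            # discount factor (the slides' v)

def value_iteration(R, M, gamma, tol=1e-10):
    U = [0.0] * len(R)
    while True:
        # Bellman backup: U(i) <- R(i) + gamma * max_a sum_j M[a][i][j] U(j)
        U_new = [
            R[i] + gamma * max(
                sum(M[a][i][j] * U[j] for j in range(len(R))) for a in M
            )
            for i in range(len(R))
        ]
        if max(abs(x - y) for x, y in zip(U_new, U)) < tol:
            return U_new
        U = U_new

U = value_iteration(R, M, GAMMA)
```

For this toy model the fixed point can be checked by hand: in state 1 the best action is "stay", so U(1) = 1 + 0.9·U(1) = 10, and U(0) = 0.9·(0.1·U(0) + 0.9·U(1)), giving U(0) = 8.1/0.91 ≈ 8.90.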
When to stop value iteration?
Idea: Value determination (given a policy) is simpler than value iteration
Value Determination Algorithm

The VALUE-DETERMINATION algorithm can be implemented in one of two ways. The first is a simplification of the VALUE-ITERATION algorithm, replacing equation (17.4) with

U_{t+1}(i) = R(i) + Σ_j M^{Policy(i)}_ij U_t(j)

and using the current utility estimates from policy iteration as the initial values. (Here Policy(i) is the action suggested by the policy in state i.)

While this can work well in some environments, it will often take a very long time to converge in the early stages of policy iteration, because the policy will be more or less random, so many steps can be required to reach terminal states.
Value Determination Algorithm …

The second approach is to solve for the utilities directly. Given a fixed policy P, the utilities of states obey a set of equations of the form:

U(i) = R(i) + Σ_j M^{P(i)}_ij U(j)

For example, suppose P is the policy shown in Figure 17.2(a). Then using the transition model M, we can construct the following set of equations:

U(1,1) = 0.8 U(1,2) + 0.1 U(1,1) + 0.1 U(2,1)
U(1,2) = 0.8 U(1,3) + 0.2 U(1,2)

and so on. This gives a set of 11 linear equations in 11 unknowns, which can be solved by linear algebra methods such as Gaussian elimination. For small state spaces, value determination using exact solution methods is often the most efficient approach.

Policy iteration converges to the optimal policy, and the policy improves monotonically for all states. The asynchronous version converges to the optimal policy if all states are visited infinitely often.
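The direct-solution approach above can be sketched as follows: rewrite U = R + M_P U as (I − M_P) U = R and solve by Gaussian elimination, as the slide suggests. The 2-state transition matrix is made up for illustration, and a discount is added so the system has a unique solution without terminal states.

```python
# Policy evaluation by solving the linear system (I - gamma * M_P) U = R.

def solve(A, b):
    """Gaussian elimination with partial pivoting (small dense systems)."""
    n = len(b)
    A = [row[:] for row in A]            # work on copies
    b = b[:]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    x = [0.0] * n                        # back substitution
    for r in range(n - 1, -1, -1):
        s = sum(A[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (b[r] - s) / A[r][r]
    return x

R = [0.0, 1.0]
M_P = [[0.5, 0.5], [0.2, 0.8]]           # M[i][j] under the fixed policy P
GAMMA = 0.9
n = len(R)
A = [[(1.0 if i == j else 0.0) - GAMMA * M_P[i][j] for j in range(n)]
     for i in range(n)]
U = solve(A, R)
```

The solution satisfies each Bellman equation U(i) = R(i) + γ Σ_j M_P[i][j] U(j) exactly, which is the point of solving the system rather than iterating.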
Discounting

Infinite horizon ⇒ infinite U ⇒ policy & value iteration fail to converge.

Also, what is rational: … vs. …

Solution: discounting

U(H) = Σ_i v^i R(i)

Finite if 0 ≤ v < 1.
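A quick numerical check of the finiteness claim (not from the slides): with rewards bounded by R_max and 0 ≤ v < 1, the discounted sum is bounded by the geometric series R_max / (1 − v).

```python
# Discounted return of a (long, effectively infinite) constant reward stream.

def discounted_return(rewards, v):
    return sum((v ** t) * r for t, r in enumerate(rewards))

v = 0.9
stream = [1.0] * 1000              # R_max = 1 at every step
total = discounted_return(stream, v)
bound = 1.0 / (1.0 - v)            # R_max / (1 - v) = 10 here
```

The undiscounted sum of this stream grows without bound as the horizon grows, while the discounted sum stays below 10.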
Reinforcement Learning II: Reinforcement learning (RL) algorithms
(we will focus solely on observable environments in this lecture)

Tuomas Sandholm
Carnegie Mellon University
Computer Science Department
Passive learning

Epochs = training sequences:

(1,1)(1,2)(1,3)(1,2)(1,3)(1,2)(1,1)(1,2)(2,2)(3,2) -1
(1,1)(1,2)(1,3)(2,3)(2,2)(2,3)(3,3) +1
(1,1)(1,2)(1,1)(1,2)(1,1)(2,1)(2,2)(2,3)(3,3) +1
(1,1)(1,2)(2,2)(1,2)(1,3)(2,3)(1,3)(2,3)(3,3) +1
(1,1)(2,1)(2,2)(2,1)(1,1)(1,2)(1,3)(2,3)(2,2)(3,2) -1
(1,1)(2,1)(1,1)(1,2)(2,2)(3,2) -1
Passive learning …

(a) A simple stochastic environment (the 4x3 grid world, started at (1,1)).
(b) Each state transitions to a neighboring state with equal probability among all neighboring states. State (4,2) is terminal with reward -1, and state (4,3) is terminal with reward +1. [Transition-probability diagram omitted.]
(c) The exact utility values:

        col 1     col 2     col 3     col 4
row 3  -0.0380    0.0886    0.2152     +1
row 2  -0.1646    (wall)   -0.4430     -1
row 1  -0.2911   -0.0380   -0.5443   -0.7722
LMS updating [Widrow & Hoff 1960]

function LMS-UPDATE(U, e, percepts, M, N) returns an updated U
  if TERMINAL?[e] then reward-to-go ← 0
  for each e_i in percepts (starting at end) do
    reward-to-go ← reward-to-go + REWARD[e_i]
    U[STATE[e_i]] ← RUNNING-AVERAGE(U[STATE[e_i]], reward-to-go, N[STATE[e_i]])
  end

U[s] is the average reward-to-go that state s has received: a simple average, batch mode.
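The LMS idea above can be sketched as follows: walk backward through one epoch accumulating reward-to-go, and fold it into a running average per visited state. The variable names, the (state, reward) epoch encoding, and the dict-based tables are illustrative, not from the pseudocode.

```python
# LMS update over one training epoch: U[s] becomes the running average of the
# reward-to-go observed from state s; N[s] counts visits to s.

def lms_update(U, N, epoch):
    """epoch: list of (state, reward) pairs, ending at a terminal state."""
    reward_to_go = 0.0
    for state, reward in reversed(epoch):
        reward_to_go += reward
        N[state] = N.get(state, 0) + 1
        old = U.get(state, 0.0)
        U[state] = old + (reward_to_go - old) / N[state]   # running average
    return U, N

# Two hypothetical epochs: one reaching a +1 terminal, one a -1 terminal.
U, N = lms_update({}, {}, [("A", 0.0), ("B", 0.0), ("T", 1.0)])
U, N = lms_update(U, N, [("A", 0.0), ("T", -1.0)])
```

After the first epoch every visited state averages +1; after the second, states visited in both epochs average (1 + (−1)) / 2 = 0.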
Converges slowly to the LMS estimate of the training set.
But utilities of states are not independent!

[Figure: the new state (U = ?) transitions with P = 0.9 to an old state with U = -0.8, and with P = 0.1 toward the +1 terminal.]

An example where LMS does poorly. A new state is reached for the first time, and the agent then follows the path marked by the dashed lines, reaching a terminal state with reward +1.
Adaptive DP (ADP)

Idea: use the constraints (state transition probabilities) between states to speed learning. Solve

U(i) = R(i) + Σ_j M_ij U(j)

using DP = value determination. No maximization over actions, because the agent is passive, unlike in value iteration.

Problem with large state spaces, e.g. backgammon: 10^50 equations in 10^50 variables.
Temporal Difference (TD) Learning

Idea: do ADP backups on a per-move basis, not for the whole state space.

U(i) ← U(i) + α [R(i) + U(j) − U(i)]

Theorem: the average value of U(i) converges to the correct value.

Theorem: if α is appropriately decreased as a function of the number of times a state is visited (α = f(N[i])), then U(i) itself converges to the correct value.
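The per-transition update above can be written in a few lines. The two-state values here are made up for illustration; no transition model is needed, only the observed transition i → j with reward R(i).

```python
# One TD update along an observed transition (i, r, j):
# U(i) <- U(i) + alpha * (R(i) + U(j) - U(i)).

def td_update(U, i, r, j, alpha):
    U[i] = U[i] + alpha * (r + U[j] - U[i])
    return U

U = {"A": 0.0, "B": 1.0}
U = td_update(U, "A", 0.5, "B", alpha=0.1)
```

Here the TD error is 0.5 + 1.0 − 0.0 = 1.5, so U("A") moves from 0 to 0.1 · 1.5 = 0.15; U("B") is untouched, since TD adjusts only the state just left.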
Algorithm TD(λ) (not in Russell & Norvig book)

Idea: update from the whole epoch, not just on one state transition.

U(i_m) ← U(i_m) + α Σ_{k ≥ m} λ^{k−m} [R(i_k) + U(i_{k+1}) − U(i_k)]

Special cases:
λ = 1: LMS
λ = 0: TD

An intermediate choice of λ (between 0 and 1) is best. Interplay with α …
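One common way to implement the sum above is the eligibility-trace (backward) form, which is not shown on the slide: each observed TD error is applied to every earlier state of the epoch, weighted by λ raised to the number of steps since that state was visited. The epoch encoding and names are illustrative.

```python
# TD(lambda) over one epoch of observed transitions (i, r, j), using
# eligibility traces. lam=0 recovers the one-step TD update; lam=1 backs
# each TD error up to every previously visited state undecayed.

def td_lambda_epoch(U, epoch, alpha, lam):
    trace = {}                                   # eligibility of visited states
    for i, r, j in epoch:
        for s in trace:                          # decay all traces by lambda
            trace[s] *= lam
        trace[i] = trace.get(i, 0.0) + 1.0       # bump the current state
        delta = r + U.get(j, 0.0) - U.get(i, 0.0)
        for s, e in trace.items():               # credit delta to the whole past
            U[s] = U.get(s, 0.0) + alpha * e * delta
    return U

U = td_lambda_epoch({"A": 0.0, "B": 0.0, "T": 0.0},
                    [("A", 0.0, "B"), ("B", 1.0, "T")], alpha=0.5, lam=1.0)
```

With λ = 1 the reward observed on the second transition propagates all the way back to "A" in a single epoch; with λ = 0 only "B" would change, exactly as in plain TD.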
Convergence of TD(λ)

Theorem: converges w.p. 1 under certain boundary conditions. Decrease α_i(t) s.t.

Σ_t α_i(t) = ∞   and   Σ_t α_i(t)² < ∞

In practice, a fixed α is often used for all i and t.
Passive learning in an unknown environment

M^a_ij unknown:
• ADP does not work directly
• LMS & TD(λ) operate unchanged

Changes to ADP: construct an environment model (of M^a_ij) based on observations (state transitions) & run DP.

Quick in # epochs, but slow update per example. As the environment model approaches the correct model, the utility estimates converge to the correct utilities.
Passive learning in an unknown environment …

ADP: full backup
TD: one-experience backup

Whereas TD makes a single adjustment (to U) per observed transition, ADP makes as many (to U) as it needs to restore consistency between U and M. A change to M is local, but its effects may need to be propagated throughout U.
Passive learning in an unknown environment …

TD can be viewed as a crude approximation of ADP.

Adjustments in ADP can be viewed as pseudo-experience in TD. A model for generating pseudo-experience can be used in TD directly: DYNA [Sutton].

Cost of thinking vs. cost of acting: approximate <value, policy> iteration directly by restricting the backup after each observed transition. The prioritized-sweeping heuristic prefers to make adjustments to states whose likely successors have just undergone large adjustments in U(j):
- Learns roughly as fast as full ADP (in # epochs)
- Several orders of magnitude less computation, which allows solving problems that are not solvable via ADP
- M is incorrect early on ⇒ use a minimum decreasing adjustment size before recomputing U(i)
Active learning in an unknown environment

The agent considers what actions to take.

Algorithms for learning in this setting (action choice discussed later):

ADP: U(i) = R(i) + max_a Σ_j M^a_ij U(j)
Learn M^a_ij instead of M_ij as before.

TD(λ): unchanged!

Model-based (learn M) vs. model-free (e.g. Q-learning): which is better? Open question; a tradeoff.
Q-learning

Q(a,i):

U(i) = max_a Q(a,i)

Q(a,i) = R(i) + Σ_j M^a_ij max_a' Q(a',j)

The direct approach (ADP) would require learning a model M^a_ij. Q-learning does not. Do this update after each state transition:

Q(a,i) ← Q(a,i) + α [R(i) + max_a' Q(a',j) − Q(a,i)]
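The per-transition Q-update above can be sketched as follows. The states, actions, and initial Q-values are made up for illustration, and (following the slide's formula) no discount factor appears.

```python
# One Q-learning update after observing the transition (i, a, r, j):
# Q(a,i) <- Q(a,i) + alpha * (R(i) + max_a' Q(a',j) - Q(a,i)).
# No transition model M is learned or used.

def q_update(Q, i, a, r, j, actions, alpha):
    best_next = max(Q.get((a2, j), 0.0) for a2 in actions)
    old = Q.get((a, i), 0.0)
    Q[(a, i)] = old + alpha * (r + best_next - old)
    return Q

actions = ["left", "right"]
Q = {("right", "B"): 2.0}
Q = q_update(Q, "A", "left", 1.0, "B", actions, alpha=0.5)
```

Here max_a' Q(a', "B") = 2, so Q("left", "A") moves from 0 toward 1 + 2 = 3, landing at 0.5 · 3 = 1.5.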
Exploration

Tradeoff between exploitation (control) and exploration (identification).

Extremes: greedy vs. random acting (n-armed bandit models)

Q-learning converges to the optimal Q-values if
* every state is visited infinitely often (due to exploration),
* the action selection becomes greedy as time approaches infinity, and
* the learning rate α is decreased fast enough but not too fast (as we discussed in TD learning).
Common exploration methods

1. E.g. in value iteration in an ADP agent: optimistic estimate of utility U+(i):

U+(i) ← R(i) + max_a f( Σ_j M^a_ij U+(j), N(a,i) )

with an exploration fn., e.g.

f(u,n) = R+ if n < N_u, u otherwise

2. E.g. in TD(λ) or Q-learning: choose the best action w.p. p and a random action otherwise.

3. E.g. in TD(λ) or Q-learning: Boltzmann exploration

P(a) = exp( Σ_j M^a_ij U(j) / T ) / Σ_a' exp( Σ_j M^a'_ij U(j) / T )
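The Boltzmann rule above is a softmax over action values; a minimal sketch follows, with the per-action values and temperatures made up for illustration (in the slide's formula each action's value would be Σ_j M^a_ij U(j)).

```python
import math

# Boltzmann exploration: sample actions in proportion to exp(value / T).
# High T explores nearly uniformly; T -> 0 approaches greedy selection.

def boltzmann_probs(values, T):
    m = max(values.values())                 # subtract max for numerical stability
    exps = {a: math.exp((q - m) / T) for a, q in values.items()}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}

probs = boltzmann_probs({"left": 1.0, "right": 2.0}, T=1.0)
```

At T = 1 the better action gets probability 1/(1 + e^−1) ≈ 0.73; at T = 0.01 the same values give it essentially all the probability mass, illustrating the slide's exploitation-exploration dial.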
Reinforcement Learning III: Advanced topics

Tuomas Sandholm
Carnegie Mellon University
Computer Science Department
Generalization

With a table lookup representation (of U, M, R, Q): up to 10,000 states or more.
Chess ~ 10^120, backgammon ~ 10^50, industrial problems: hard to represent & visit all states!

Implicit representation, e.g. U(i) = w1 f1(i) + w2 f2(i) + … + wn fn(i).
Chess: 10^120 states → n weights. This compression does generalization.

E.g. backgammon: observe 1/10^44 of the state space and beat any human.
Generalization …

Could use any supervised learning algorithm for the generalization part:

input sensation → [generalization] → estimate (Q or U …) → update from RL

Convergence results do not apply with generalization.

Pseudo-experiments require predicting many steps ahead (not supported by standard generalization methods).
Convergence results of Q-learning

function approximation   result
tabular                  converges to Q*
state aggregation        converges to Q*
averagers                converges to Q*
linear, on-policy        prediction: converges to Q; control: chatters, bound unknown
linear, off-policy       diverges
general                  diverges

For state aggregation: error in Q ≤ max_{i,j in same class} |Q(i) − Q(j)| / (1 − v)
Applications of RL

• Checkers [Samuel 59]
• TD-Gammon [Tesauro 92]
• World's best downpeak elevator dispatcher [Crites et al. ~95]
• Inventory management [Bertsekas et al. ~95]
  – 10-15% better than industry standard
• Dynamic channel assignment [Singh & Bertsekas, Nie & Haykin ~95]
  – Outperforms best heuristics in the literature
• Cart-pole [Michie & Chambers 68-] with bang-bang control
• Robotic manipulation [Grupen et al. 93-]
• Path planning
• Robot docking [Lin 93]
• Parking
• Football
• Tetris
• Multiagent RL [Tan 93, Sandholm & Crites 95, Sen 94-, Carmel & Markovitch 95-, lots of work since]
• Combinatorial optimization: maintenance & repair
  – Control of reasoning [Zhang & Dietterich IJCAI-95]
TD-Gammon

• Q-learning & backpropagation neural net
• Start with random net
• Learned by playing 1.5 million games against itself
• As good as the best human in the world

• Expert-labeled examples are scarce, expensive & possibly wrong
• Self-play is cheap & teaches the real solution
• Hand-crafted features help

[Plot: performance against Gammontool vs. # hidden units, for TD-Gammon (self-play) and Neurogammon (15,000 supervised learning examples).]
Multiagent RL

• Each agent as a Q-table entry, e.g. in a communication network
• Each agent as an intentional entity
  – The opponent's behavior varies for a given sensation of the agent:
    • The opponent uses a different sensation than the agent, e.g. a longer window or different features (stochasticity in steady state)
    • The opponent learned: sensation → Q-values (nonstationarity)
    • The opponent's exploration policy (Q-values → action probabilities) changed
    • The opponent's action selector chose a different action (stochasticity)

[Figure: Q-storage (Q_coop, Q_def) → deterministic explorer → random process with p(coop), p(def) → action a_n.]

Sensation at step n: <a_{n-1}^me, a_{n-1}^opponent>, reward from step n-1.
Future research in RL

• Function approximation (& convergence results)
• On-line experience vs. simulated experience
• Amount of search in action selection
• Exploration method (safe?)
• Kind of backups
  – Full (DP) vs. sample backups (TD)
  – Shallow (Monte Carlo) vs. deep (exhaustive)
    • λ controls this in TD(λ)
• Macros
  – Advantages
    • Reduce complexity of learning by learning subgoals (macros) first
    • Can be learned by TD(λ)
  – Problems
    • Selection of macro actions
    • Learning models of macro actions (predicting their outcome)
    • How do you come up with subgoals?