MDPs: overview
Markov decision process
Definition: Markov decision process
States: the set of states
s_start ∈ States: starting state
Actions(s): possible actions from state s
T(s, a, s′): probability of s′ if taking action a in state s
Reward(s, a, s′): reward for the transition (s, a, s′)
IsEnd(s): whether at end of game
0 ≤ γ ≤ 1: discount factor (default: 1)
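To make the definition concrete, here is a minimal sketch of an MDP in Python, loosely modeled on the stay/quit game used in the model-based slides below. The class name, method names, the 2/3 and 1/3 probabilities, and the $10 quit reward are illustrative assumptions, not values from the slides; the later code sketches build on this interface.

```python
# A minimal MDP sketch. All numbers are illustrative assumptions:
# "stay" pays $4 and keeps you in with probability 2/3; "quit" pays $10
# and always ends the game.

class DiceGameMDP:
    def states(self):
        return ["in", "end"]

    def start_state(self):
        return "in"

    def actions(self, s):
        return ["stay", "quit"]

    def is_end(self, s):
        return s == "end"

    def discount(self):
        return 1.0  # default: gamma = 1

    def transitions(self, s, a):
        # (next_state, T(s, a, s'), Reward(s, a, s')) triples
        if a == "stay":
            return [("in", 2/3, 4), ("end", 1/3, 4)]
        else:  # quit
            return [("end", 1.0, 10)]
```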
What is a solution?
Search problem: path (sequence of actions)
MDP:
Definition: policy
A policy π is a mapping from each state s ∈ States to an action a ∈ Actions(s).
Example: volcano crossing
s        π(s)
(1,1) S
(2,1) E
(3,1) N
... ...
MDPs: policy evaluation
Discounting
Definition: utility
Path: s0; a1, r1, s1; a2, r2, s2; … (action, reward, new state).
The utility with discount γ is
u1 = r1 + γ r2 + γ² r3 + γ³ r4 + ···
Discount γ = 1 (save for the future):
[stay, stay, stay, stay]: 4 + 4 + 4 + 4 = 16
Discount γ = 0 (live in the moment):
[stay, stay, stay, stay]: 4 + 0 · (4 + ···) = 4
Discount γ = 0.5 (balanced life):
[stay, stay, stay, stay]: 4 + (1/2) · 4 + (1/4) · 4 + (1/8) · 4 = 7.5
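As a sanity check, a few lines of Python reproduce all three cases (a minimal sketch; the helper name is ours):

```python
def utility(rewards, gamma):
    """Discounted utility: u1 = r1 + gamma*r2 + gamma^2*r3 + ..."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

rewards = [4, 4, 4, 4]        # [stay, stay, stay, stay]
print(utility(rewards, 1.0))  # 16.0
print(utility(rewards, 0.0))  # 4.0
print(utility(rewards, 0.5))  # 7.5
```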
Policy evaluation
Definition: value of a policy
Let V_π(s) be the expected utility received by following policy π from state s.
Definition: Q-value of a policy
Let Q_π(s, a) be the expected utility of taking action a from state s, and then following policy π.
[Diagram: s → (s, π(s)) → s′ with probability T(s, π(s), s′); node values V_π(s), Q_π(s, π(s)), V_π(s′).]
Policy evaluation
Plan: define recurrences relating value and Q-value
V_π(s) = 0               if IsEnd(s)
         Q_π(s, π(s))    otherwise.

Q_π(s, a) = Σ_{s′} T(s, a, s′) [Reward(s, a, s′) + γ V_π(s′)]
Policy evaluation
Key idea: iterative algorithm
Start with arbitrary policy values and repeatedly apply recurrences to converge to true values.
Algorithm: policy evaluation
Initialize V_π^(0)(s) ← 0 for all states s.
For iteration t = 1, …, t_PE:
    For each state s:
        V_π^(t)(s) ← Σ_{s′} T(s, π(s), s′) [Reward(s, π(s), s′) + γ V_π^(t−1)(s′)]
        (the right-hand side is Q^(t−1)(s, π(s)))
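A minimal sketch of this loop in Python, written against the hypothetical DiceGameMDP interface above (a policy is just a dict from state to action):

```python
def policy_evaluation(mdp, pi, num_iters=100):
    V = {s: 0.0 for s in mdp.states()}   # V_pi^(0)(s) = 0
    gamma = mdp.discount()
    for t in range(num_iters):           # t = 1, ..., t_PE
        newV = {}
        for s in mdp.states():
            if mdp.is_end(s):
                newV[s] = 0.0
            else:
                # Q^(t-1)(s, pi(s)): expectation over successors s'
                newV[s] = sum(prob * (reward + gamma * V[sp])
                              for sp, prob, reward in mdp.transitions(s, pi[s]))
        V = newV
    return V

# Always staying on the sketch MDP gives V_pi(in) -> 12.
print(policy_evaluation(DiceGameMDP(), {"in": "stay"}))
```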
MDPs: value iteration
Optimal value and policy
Goal: try to get directly at maximum expected utility
Definition: optimal value
The optimal value V_opt(s) is the maximum value attained by any policy.
Optimal values and Q-values
[Diagram: s → (s, a) → s′ with probability T(s, a, s′); node values V_opt(s), Q_opt(s, a), V_opt(s′).]
Optimal value if we take action a in state s:
Q_opt(s, a) = Σ_{s′} T(s, a, s′) [Reward(s, a, s′) + γ V_opt(s′)]
Optimal value from state s:
V_opt(s) = 0                                  if IsEnd(s)
           max_{a ∈ Actions(s)} Q_opt(s, a)   otherwise.
Optimal policies
Given Q_opt, read off the optimal policy:
π_opt(s) = argmax_{a ∈ Actions(s)} Q_opt(s, a)
Value iteration
Algorithm: value iteration [Bellman, 1957]
Initialize V_opt^(0)(s) ← 0 for all states s.
For iteration t = 1, …, t_VI:
    For each state s:
        V_opt^(t)(s) ← max_{a ∈ Actions(s)} Σ_{s′} T(s, a, s′) [Reward(s, a, s′) + γ V_opt^(t−1)(s′)]
        (the quantity maximized is Q_opt^(t−1)(s, a))
Time: O(t_VI · S · A · S′), for S states, A actions per state, and S′ successors per (s, a)
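A minimal sketch of value iteration plus the argmax readoff of π_opt, again against the hypothetical DiceGameMDP interface:

```python
def value_iteration(mdp, num_iters=100):
    V = {s: 0.0 for s in mdp.states()}   # V_opt^(0)(s) = 0
    gamma = mdp.discount()

    def Q(s, a):  # Q_opt(s, a) under the current V
        return sum(prob * (reward + gamma * V[sp])
                   for sp, prob, reward in mdp.transitions(s, a))

    for t in range(num_iters):
        # The comprehension reads the old V before rebinding it.
        V = {s: 0.0 if mdp.is_end(s) else max(Q(s, a) for a in mdp.actions(s))
             for s in mdp.states()}

    # Read off pi_opt(s) = argmax over a of Q_opt(s, a)
    pi = {s: max(mdp.actions(s), key=lambda a: Q(s, a))
          for s in mdp.states() if not mdp.is_end(s)}
    return V, pi

V, pi = value_iteration(DiceGameMDP())
print(V, pi)  # on the sketch MDP, staying is optimal and V(in) -> 12
```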
Convergence
Theorem: convergence
Suppose either
• discount γ < 1, or
• MDP graph is acyclic.
Then value iteration converges to the correct answer.
Example: non-convergence
discount γ = 1, zero rewards
Summary of algorithms
• Policy evaluation: (MDP, π) → V_π
• Value iteration: MDP → (Q_opt, π_opt)
MDPs: reinforcement learning
Unknown transitions and rewards
Definition: Markov decision process
States: the set of states
s_start ∈ States: starting state
Actions(s): possible actions from state s
IsEnd(s): whether at end of game
0 ≤ γ ≤ 1: discount factor (default: 1)
T(s, a, s′) and Reward(s, a, s′) are unknown: reinforcement learning!
MDPs: model-based methods
Model-Based Value Iteration
Data: s0; a1, r1, s1; a2, r2, s2; a3, r3, s3; …; an, rn, sn
Key idea: model-based learning
Estimate the MDP: T(s, a, s′) and Reward(s, a, s′)
Transitions:
T̂(s, a, s′) = (# times (s, a, s′) occurs) / (# times (s, a) occurs)
Rewards:
R̂eward(s, a, s′) = r in the observed (s, a, r, s′)
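A minimal sketch of these counts in Python (the helper and variable names are ours); the example data is chosen to reproduce the 4/7 and 3/7 estimates on the next slide:

```python
from collections import defaultdict

def estimate_mdp(transitions_seen):
    counts = defaultdict(int)   # # times (s, a, s') occurs
    totals = defaultdict(int)   # # times (s, a) occurs
    rewards = {}                # observed r for each (s, a, s')
    for s, a, r, sp in transitions_seen:
        counts[(s, a, sp)] += 1
        totals[(s, a)] += 1
        rewards[(s, a, sp)] = r
    T_hat = {(s, a, sp): c / totals[(s, a)]
             for (s, a, sp), c in counts.items()}
    return T_hat, rewards

data = [("in", "stay", 4, "in")] * 4 + [("in", "stay", 4, "end")] * 3
T_hat, R_hat = estimate_mdp(data)
print(T_hat)  # {('in','stay','in'): 4/7, ('in','stay','end'): 3/7}
```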
Model-Based Value Iteration
[Diagram: from state in, action stay returns to in with estimated probability 4/7 (reward $4) and reaches end with probability 3/7 (reward $4); action quit is never observed, so its probability and reward are unknown (?: $?).]
Data (following policy π(s) = stay):
[in; stay, 4, end]
• Estimates converge to true values (under certain conditions)
• With estimated MDP (T̂, R̂eward), compute policy using value iteration
Problem
Problem: won’t even see (s, a) if a ≠ π(s) (a = quit)
Key idea: exploration
To do reinforcement learning, need to explore the state space.
Solution: need π to explore explicitly (more on this later)
MDPs: model-free methods
From model-based to model-free
Q̂_opt(s, a) = Σ_{s′} T̂(s, a, s′) [R̂eward(s, a, s′) + γ V̂_opt(s′)]
All that matters for prediction is (an estimate of) Q_opt(s, a).
Key idea: model-free learning
Try to estimate Q_opt(s, a) directly.
Model-free Monte Carlo
Data (following policy π):
s0; a1, r1, s1; a2, r2, s2; a3, r3, s3; . . . ; an, rn, sn
Recall:
Q_π(s, a) is the expected utility starting at s, first taking action a, and then following policy π
Utility:
u_t = r_t + γ · r_{t+1} + γ² · r_{t+2} + ···
Estimate:
Q̂_π(s, a) = average of u_t where s_{t−1} = s, a_t = a
(and (s, a) doesn’t occur in s0, …, s_{t−2})
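A minimal sketch of this (first-visit) estimator in Python; episodes are lists of (s, a, r, s′) tuples, and all names are ours:

```python
from collections import defaultdict

def monte_carlo_estimate(episodes, gamma=1.0):
    sums = defaultdict(float)   # running sum of utilities u_t per (s, a)
    counts = defaultdict(int)
    for episode in episodes:
        seen = set()
        for t, (s, a, r, sp) in enumerate(episode):
            if (s, a) in seen:  # first visit only, per the definition above
                continue
            seen.add((s, a))
            # u_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...
            u = sum(gamma**k * step[2] for k, step in enumerate(episode[t:]))
            sums[(s, a)] += u
            counts[(s, a)] += 1
    return {sa: sums[sa] / counts[sa] for sa in counts}

episode = [("in", "stay", 4, "in")] * 3 + [("in", "stay", 4, "end")]
print(monte_carlo_estimate([episode]))  # {('in', 'stay'): 16.0}
```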
Model-free Monte Carlo
[Diagram: the in/stay/quit graph with unknown probabilities and rewards (?: $?); the stay edge is annotated with the running estimate (4 + 8 + 16)/3, the quit edge with ?.]
Data (following policy π(s) = stay):
[in; stay, 4, in; stay, 4, in; stay, 4, in; stay, 4, end]
Note: we are estimating Q_π now, not Q_opt
Definition: on-policy versus off-policy
On-policy: estimate the value of the data-generating policy
Off-policy: estimate the value of another policy
MDPs: SARSA
Using the reward + Q-value
Current estimate: Q̂_π(s, stay) = 11
Data (following policy π(s) = stay):
[in; stay, 4, end] 4 + 0
[in; stay, 4, in; stay, 4, end] 4 + 11
[in; stay, 4, in; stay, 4, in; stay, 4, end] 4 + 11
[in; stay, 4, in; stay, 4, in; stay, 4, in; stay, 4, end] 4 + 11
Algorithm: SARSA
On each (s, a, r, s′, a′):
Q̂_π(s, a) ← (1 − η) Q̂_π(s, a) + η [ r + γ Q̂_π(s′, a′) ]
                                    (data)  (estimate)
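A minimal sketch of one SARSA update in Python (the learning rate η = 0.5 is an illustrative assumption):

```python
from collections import defaultdict

def sarsa_update(Q, s, a, r, sp, ap, gamma=1.0, eta=0.5):
    # target = r (data) + gamma * Q_hat_pi(s', a') (estimate)
    target = r + gamma * Q[(sp, ap)]
    Q[(s, a)] = (1 - eta) * Q[(s, a)] + eta * target

# Starting from the slide's estimate Q_hat_pi(in, stay) = 11:
Q = defaultdict(float)
Q[("in", "stay")] = 11.0
sarsa_update(Q, "in", "stay", 4, "in", "stay")  # target = 4 + 11
print(Q[("in", "stay")])  # 13.0 with eta = 0.5
```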
Model-free Monte Carlo versus SARSA
Key idea: bootstrapping
SARSA uses estimate Q̂_π(s, a) instead of just raw data u.

u                          r + Q̂_π(s′, a′)
based on one path          based on estimate
unbiased                   biased
large variance             small variance
wait until end to update   can update immediately
MDPs: Q-learning
Q-learning
Problem: model-free Monte Carlo and SARSA only estimate Q_π, but we want Q_opt to act optimally

Output   MDP                  reinforcement learning
Q_π      policy evaluation    model-free Monte Carlo, SARSA
Q_opt    value iteration      Q-learning
Q-learning
Bellman optimality equation:
Q_opt(s, a) = Σ_{s′} T(s, a, s′) [Reward(s, a, s′) + γ V_opt(s′)]
Algorithm: Q-learning [Watkins/Dayan, 1992]
On each (s, a, r, s′):
Q̂_opt(s, a) ← (1 − η) Q̂_opt(s, a) + η (r + γ V̂_opt(s′))
              (prediction)            (target)
Recall: V̂_opt(s′) = max_{a′ ∈ Actions(s′)} Q̂_opt(s′, a′)
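A minimal sketch of one Q-learning update in Python; unlike SARSA, the target maximizes over next actions rather than using the action the data-generating policy took (η = 0.5 is again an illustrative assumption):

```python
from collections import defaultdict

def q_learning_update(Q, actions, s, a, r, sp, is_end, gamma=1.0, eta=0.5):
    # V_hat_opt(s') = max over a' in Actions(s') of Q_hat_opt(s', a')
    v_opt = 0.0 if is_end(sp) else max(Q[(sp, ap)] for ap in actions(sp))
    target = r + gamma * v_opt
    Q[(s, a)] = (1 - eta) * Q[(s, a)] + eta * target  # prediction -> target

Q = defaultdict(float)
q_learning_update(Q, lambda s: ["stay", "quit"], "in", "stay", 4, "in",
                  lambda s: s == "end")
print(Q[("in", "stay")])  # 0.5 * (4 + 0) = 2.0
```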
Off-Policy versus On-Policy
Definition: on-policy versus off-policy
On-policy: evaluate or improve the data-generating policy
Off-policy: evaluate or learn using data from another policy

                      on-policy            off-policy
policy evaluation     Monte Carlo, SARSA
policy optimization                        Q-learning
Reinforcement Learning Algorithms
Algorithm                 Estimating   Based on
Model-Based Monte Carlo   T̂, R̂         s0, a1, r1, s1, …
Model-Free Monte Carlo    Q̂_π          u
SARSA                     Q̂_π          r + Q̂_π
Q-Learning                Q̂_opt        r + Q̂_opt