Reinforcement Learning
Markov decision process & Dynamic programming
value function, Bellman equation, optimality, Markov property, Markov decision process, dynamic programming, value iteration, policy iteration.
Vien Ngo, MLR, University of Stuttgart
Outline
• Reinforcement learning problem
– Elements of reinforcement learning
– Markov Process
– Markov Reward Process
– Markov decision process
• Dynamic programming
– Value iteration
– Policy iteration
Reinforcement Learning Problem
Elements of the Reinforcement Learning Problem
• Agent vs. Environment.
• State, Action, Reward, Goal, Return.
• The Markov property.
• Markov decision process.
• Bellman equations.
• Optimality and Approximation.
Agent vs. Environment
• The learner and decision-maker is called the agent.
• The thing it interacts with, comprising everything outside the agent, is called the environment.
• The environment is formally formulated as a Markov Decision Process, which is a mathematically principled framework for sequential decision problems.
(from Introduction to RL book, Sutton & Barto)
The Markov property
A state that summarizes past sensations compactly yet in such a way that all relevant information is retained. This normally requires more than the immediate sensations, but never more than the complete history of all past sensations. A state that succeeds in retaining all relevant information is said to be Markov, or to have the Markov property.
(Introduction to RL book, Sutton & Barto)
• Formally,
Pr(s_{t+1}, r_{t+1} | s_t, a_t, r_t, ..., s_0, a_0, r_0) = Pr(s_{t+1}, r_{t+1} | s_t, a_t, r_t)
• Examples: the current configuration of the chess board for predicting the next moves; the position and velocity of the cart together with the angle of the pole and its rate of change in the cart-pole domain.
Markov Process
• A Markov Process (Markov Chain) is defined as a 2-tuple (S, P).
– S is a state space.
– P is a state transition probability matrix: P_{ss'} = P(s_{t+1} = s' | s_t = s)
Markov Process: Example
Recycling Robot's Markov Chain
(Figure: a two-state Markov chain over Battery: high and Battery: low, with transitions labelled by the actions wait, search, recharge, stop and probabilities 0.9, 0.1, 0.5, 1.0.)
Markov Reward Process
• A Markov Reward Process is defined as a 4-tuple (S, P, R, γ).
– S is a state space of n states.
– P is a state transition probability matrix: P_{ss'} = P(s_{t+1} = s' | s_t = s)
– R is a reward matrix with entries R_s.
– γ is a discount factor, γ ∈ [0, 1].
• The total return is
ρ_t = R_t + γ R_{t+1} + γ^2 R_{t+2} + ...
Markov Reward Process: Example
(Figure: the same recycling-robot chain over Battery: high and Battery: low, with each transition labelled by a pair "probability; reward", e.g. 0.9; 0.0 and 0.5; -1.0.)
Markov Reward Process: Bellman Equations
• The value function V(s):
V(s) = E[ρ_t | s_t = s] = E[R_t + γ V(s_{t+1}) | s_t = s]
• In matrix form V = R + γPV, hence V = (I − γP)^{-1}R.
We will visit this again for MDPs.
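For concreteness, here is a minimal NumPy sketch of this closed-form solution for a small, made-up two-state reward process; the transition matrix and rewards below are illustrative assumptions, not the recycling-robot numbers.

```python
import numpy as np

# Illustrative 2-state Markov Reward Process (numbers are made up, not the
# recycling-robot values).
P = np.array([[0.9, 0.1],     # P[s, s'] = probability of moving from s to s'
              [0.5, 0.5]])
R = np.array([1.0, -1.0])     # R[s] = expected reward received in state s
gamma = 0.9

# Closed form of the Bellman equation V = R + gamma * P V:
V = np.linalg.solve(np.eye(2) - gamma * P, R)
print(V)                      # value of each state under this MRP
```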
Markov Reward Process: Discount Factor?
Many meanings:
• Weighing the importance of differently timed rewards: rewards received sooner are weighted more heavily than rewards received later.
• Representing uncertainty about whether future rewards will be received at all, e.g. a geometric distribution over the remaining lifetime.
• Representing human/animal preferences over the ordering of received rewards.
Markov decision process
Markov decision process
• A reinforcement learning problem that satisfies the Markov property is called a Markov decision process, or MDP.
• MDP = {S, A, T, R, P_0, γ}.
– S: consists of all possible states.
– A: consists of all possible actions.
– T: a transition function which defines the probability T(s', s, a) = Pr(s' | s, a).
– R: a reward function which defines the reward R(s, a).
– P_0: the probability distribution over initial states.
– γ ∈ [0, 1]: a discount factor.
Example: Recycling Robot MDP
(Figure: an agent–environment trajectory s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, ...)
• A policy is a mapping from state space to action space
µ : S ↦ A
• Objective function:
– Expected average reward:
η = lim_{T→∞} (1/T) E[ Σ_{t=0}^{T−1} r(s_t, a_t, s_{t+1}) ]
– Expected discounted reward:
η_γ = E[ Σ_{t=0}^{∞} γ^t r(s_t, a_t, s_{t+1}) ]
• Singh et al. 1994:
η_γ = (1/(1−γ)) η
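For intuition, the discounted objective η_γ could be estimated by Monte Carlo rollouts as in the sketch below; `sample_initial_state` and `step` are hypothetical stand-ins for a simulator of P_0 and the transition/reward model, not anything defined on these slides.

```python
def estimate_discounted_return(policy, sample_initial_state, step,
                               gamma=0.95, horizon=200, episodes=1000):
    """Monte Carlo estimate of eta_gamma = E[ sum_t gamma^t r(s_t, a_t, s_{t+1}) ].

    `sample_initial_state()` draws s_0 ~ P_0 and `step(s, a)` returns (s', r);
    both are hypothetical interfaces assumed for this sketch.
    """
    total = 0.0
    for _ in range(episodes):
        s = sample_initial_state()
        ret, discount = 0.0, 1.0
        for _ in range(horizon):        # truncate the infinite sum
            a = policy(s)
            s, r = step(s, a)
            ret += discount * r
            discount *= gamma
        total += ret
    return total / episodes
```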
Dynamic Programming
Dynamic Programming
• State Value Functions
• Bellman Equations
• Value Iteration
• Policy Iteration
State value function
• The value (expected discounted return) of policy π when started in state s:
V^π(s) = E_π{ r_0 + γ r_1 + γ^2 r_2 + ⋯ | s_0 = s }     (1)
with discount factor γ ∈ [0, 1]
• Definition of optimality: behavior π* is optimal iff
∀s : V^{π*}(s) = V*(s), where V*(s) = max_π V^π(s)
(simultaneously maximising the value in all states)
(In MDPs there always exists (at least one) optimal deterministic policy.)
Bellman optimality equation
V^π(s) = E{ r_0 + γ r_1 + γ^2 r_2 + ⋯ | s_0 = s; π }
       = E{ r_0 | s_0 = s; π } + γ E{ r_1 + γ r_2 + ⋯ | s_0 = s; π }
       = R(π(s), s) + γ Σ_{s'} P(s' | π(s), s) E{ r_1 + γ r_2 + ⋯ | s_1 = s'; π }
       = R(π(s), s) + γ Σ_{s'} P(s' | π(s), s) V^π(s')
• We can write this in vector notation as V^π = R^π + γ P^π V^π
with vectors V^π_s = V^π(s), R^π_s = R(π(s), s) and matrix P^π_{s's} = P(s' | π(s), s)
• For a stochastic policy π(a|s): V^π(s) = Σ_a π(a|s) R(a, s) + γ Σ_{s',a} π(a|s) P(s' | a, s) V^π(s')
• Bellman optimality equation
V*(s) = max_a [ R(a, s) + γ Σ_{s'} P(s' | a, s) V*(s') ]
π*(s) = argmax_a [ R(a, s) + γ Σ_{s'} P(s' | a, s) V*(s') ]
(Sketch of proof: if π selected an action other than argmax_a[·] in some state s, then the policy π' that equals π everywhere except π'(s) = argmax_a[·] would be better.)
• This is the principle of optimality in the stochastic case (related to the Viterbi / max-product algorithm)
Richard E. Bellman (1920-1984)
Bellman's principle of optimality
(Figure: a trajectory passing through points A and B; if the path starting at A is optimal, then its remaining part starting at B is also optimal: A opt ⇒ B opt.)
V*(s) = max_a [ R(a, s) + γ Σ_{s'} P(s' | a, s) V*(s') ]
π*(s) = argmax_a [ R(a, s) + γ Σ_{s'} P(s' | a, s) V*(s') ]
Value Iteration
• Given the Bellman equation
V*(s) = max_a [ R(a, s) + γ Σ_{s'} P(s' | a, s) V*(s') ]
→ iterate
∀s : V_{k+1}(s) = max_a [ R(a, s) + γ Σ_{s'} P(s' | a, s) V_k(s') ]
stopping criterion:
max_s |V_{k+1}(s) − V_k(s)| ≤ ε
• Value Iteration converges to the optimal value function V ∗ (proof below)
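A minimal tabular sketch of this iteration, assuming the model is stored as arrays R[s, a] and P[a, s, s']; this array layout is an assumption made for illustration, not notation from the slides.

```python
import numpy as np

def value_iteration(R, P, gamma=0.9, eps=1e-6):
    """Tabular value iteration.

    R[s, a]     : expected reward for taking action a in state s (assumed layout)
    P[a, s, s2] : probability of moving from s to s2 under action a (assumed layout)
    Returns the value estimate and a greedy policy.
    """
    V = np.zeros(R.shape[0])
    while True:
        # Q[s, a] = R(a, s) + gamma * sum_{s'} P(s'|a, s) V(s')
        Q = R + gamma * np.einsum('ast,t->sa', P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) <= eps:   # stopping criterion
            return V_new, Q.argmax(axis=1)     # greedy policy pi(s)
        V = V_new
```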
2x2 Maze
(Figure: a 2×2 maze with cell rewards 0.0, 1.0 in the top row and 0.0, 0.0 in the bottom row; transition probabilities 80%, 10%, 10%.)
(solved manually in the lecture)
State-action value function (Q-function)
• The state-action value function (or Q-function) is the expected discounted return when starting in state s and taking first action a:
Q^π(a, s) = E_π{ r_0 + γ r_1 + γ^2 r_2 + ⋯ | s_0 = s, a_0 = a }
          = R(a, s) + γ Σ_{s'} P(s' | a, s) Q^π(π(s'), s')
(Note: V^π(s) = Q^π(π(s), s).)
• Bellman optimality equation for the Q-function
Q*(a, s) = R(a, s) + γ Σ_{s'} P(s' | a, s) max_{a'} Q*(a', s')
π*(s) = argmax_a Q*(a, s)
Q-Iteration
• Given the Bellman equation
Q*(a, s) = R(a, s) + γ Σ_{s'} P(s' | a, s) max_{a'} Q*(a', s')
→ iterate
∀a, s : Q_{k+1}(a, s) = R(a, s) + γ Σ_{s'} P(s' | a, s) max_{a'} Q_k(a', s')
stopping criterion:
max_{a,s} |Q_{k+1}(a, s) − Q_k(a, s)| ≤ ε
• Q-Iteration converges to the optimal state-action value function Q∗
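A sketch of Q-Iteration under the same assumed R[s, a] / P[a, s, s'] array layout as the value-iteration sketch above.

```python
import numpy as np

def q_iteration(R, P, gamma=0.9, eps=1e-6):
    """Tabular Q-iteration; same assumed model layout R[s, a], P[a, s, s']."""
    Q = np.zeros(R.shape)
    while True:
        # Q_{k+1}(a, s) = R(a, s) + gamma * sum_{s'} P(s'|a, s) max_{a'} Q_k(a', s')
        Q_new = R + gamma * np.einsum('ast,t->sa', P, Q.max(axis=1))
        if np.max(np.abs(Q_new - Q)) <= eps:    # stopping criterion
            return Q_new, Q_new.argmax(axis=1)  # pi*(s) = argmax_a Q*(a, s)
        Q = Q_new
```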
Proof of convergence
• Let ∆_k = ||Q* − Q_k||_∞ = max_{a,s} |Q*(a, s) − Q_k(a, s)|
Q_{k+1}(a, s) = R(a, s) + γ Σ_{s'} P(s' | a, s) max_{a'} Q_k(a', s')
             ≤ R(a, s) + γ Σ_{s'} P(s' | a, s) max_{a'} [ Q*(a', s') + ∆_k ]
             = [ R(a, s) + γ Σ_{s'} P(s' | a, s) max_{a'} Q*(a', s') ] + γ ∆_k
             = Q*(a, s) + γ ∆_k
similarly: Q_k ≥ Q* − ∆_k ⇒ Q_{k+1} ≥ Q* − γ ∆_k
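The bound ∆_{k+1} ≤ γ ∆_k can also be checked numerically; the sketch below uses a randomly generated illustrative model (not one of the lecture's examples) and the same Bellman backup as in the Q-Iteration sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a, gamma = 5, 3, 0.9

# Randomly generated illustrative model: rewards R[s, a], transitions P[a, s, s'].
R = rng.normal(size=(n_s, n_a))
P = rng.random((n_a, n_s, n_s))
P /= P.sum(axis=2, keepdims=True)              # rows sum to 1

def backup(Q):
    # One application of the Bellman optimality operator on Q.
    return R + gamma * np.einsum('ast,t->sa', P, Q.max(axis=1))

# Approximate Q* by iterating the backup until it is numerically converged.
Q_star = np.zeros((n_s, n_a))
for _ in range(2000):
    Q_star = backup(Q_star)

# Check Delta_{k+1} <= gamma * Delta_k along the iterates from a different start.
Q = rng.normal(size=(n_s, n_a))
for k in range(10):
    delta_k = np.max(np.abs(Q_star - Q))
    Q = backup(Q)
    delta_k1 = np.max(np.abs(Q_star - Q))
    print(k, delta_k1 <= gamma * delta_k + 1e-10)   # expected: True every step
```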
Convergence
• Contraction property: ||U_{k+1} − V_{k+1}|| ≤ γ ||U_k − V_k||,
which guarantees convergence to the same fixed point from different initial values U_0, V_0 of two approximations:
||U_{k+1} − V_{k+1}|| ≤ γ ||U_k − V_k|| ≤ ... ≤ γ^{k+1} ||U_0 − V_0||
• Stopping condition: ||V_{k+1} − V_k|| ≤ ε ⇒ ||V_{k+1} − V*|| ≤ ε γ / (1 − γ)
Proof:
||V_{k+1} − V*|| / γ ≤ ||V_k − V*|| ≤ ||V_{k+1} − V_k|| + ||V_{k+1} − V*||
||V_{k+1} − V*|| / γ ≤ ε + ||V_{k+1} − V*||
Rearranging gives ||V_{k+1} − V*|| ≤ ε γ / (1 − γ).
Policy Evaluation
Value Iteration and Q-Iteration directly compute V* and Q*.
If we want to evaluate a given policy π, we want to compute V^π or Q^π:
• Iterate using π instead of max_a:
∀s : V_{k+1}(s) = R(π(s), s) + γ Σ_{s'} P(s' | π(s), s) V_k(s')
∀a, s : Q_{k+1}(a, s) = R(a, s) + γ Σ_{s'} P(s' | a, s) Q_k(π(s'), s')
• Or, solve the matrix equation
V^π = R^π + γ P^π V^π
V^π − γ P^π V^π = R^π
(I − γ P^π) V^π = R^π
V^π = (I − γ P^π)^{-1} R^π
which requires inversion of an n × n matrix for |S| = n, i.e. O(n^3)
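A sketch of both routes, iterative and direct, for a fixed deterministic policy π, again assuming the R[s, a] / P[a, s, s'] array layout used in the earlier sketches.

```python
import numpy as np

def evaluate_policy_direct(pi, R, P, gamma=0.9):
    """Solve (I - gamma * P^pi) V = R^pi directly (O(n^3) for n states)."""
    n = R.shape[0]
    R_pi = R[np.arange(n), pi]              # R^pi_s = R(pi(s), s)
    P_pi = P[pi, np.arange(n), :]           # P^pi_{s s'} = P(s'|pi(s), s)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)

def evaluate_policy_iterative(pi, R, P, gamma=0.9, eps=1e-8):
    """Iterate V_{k+1}(s) = R(pi(s), s) + gamma * sum_{s'} P(s'|pi(s), s) V_k(s')."""
    n = R.shape[0]
    R_pi = R[np.arange(n), pi]
    P_pi = P[pi, np.arange(n), :]
    V = np.zeros(n)
    while True:
        V_new = R_pi + gamma * P_pi @ V
        if np.max(np.abs(V_new - V)) <= eps:
            return V_new
        V = V_new
```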
Policy Iteration
• How does computing V^π or Q^π help us find the optimal policy?
• Policy Iteration
1. Initialise π_0 somehow (e.g. randomly)
2. Iterate
– Policy Evaluation: compute V^{π_k} or Q^{π_k}
– Policy Improvement: π_{k+1}(s) ← argmax_a Q^{π_k}(a, s)
demo: 2x2 maze
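A compact sketch of the loop, combining exact policy evaluation with greedy improvement; it reuses the assumed R[s, a] / P[a, s, s'] model layout from the earlier sketches (the maze itself is not reproduced here).

```python
import numpy as np

def policy_iteration(R, P, gamma=0.9):
    """Alternate exact policy evaluation and greedy policy improvement."""
    n_states = R.shape[0]
    pi = np.zeros(n_states, dtype=int)               # arbitrary initial policy
    while True:
        # Policy evaluation: V^pi = (I - gamma * P^pi)^{-1} R^pi
        R_pi = R[np.arange(n_states), pi]
        P_pi = P[pi, np.arange(n_states), :]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: pi_{k+1}(s) <- argmax_a Q^{pi_k}(a, s)
        Q = R + gamma * np.einsum('ast,t->sa', P, V)
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):               # policy stable -> optimal
            return pi, V
        pi = pi_new
```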
Convergence proof
The key facts are:
• After policy improvement: V^{π_k} ≤ V^{π_{k+1}} (a proof sketch is given in Rich Sutton's book).
• The policy space is finite: |A|^{|S|}.
• The Bellman operator has a unique fixed point (due to the strict contraction property, 0 < γ < 1, on a Banach space). This property is also used to prove convergence of the VI algorithm to its fixed point.
VI vs. PI
• VI is PI with one step of policy evaluation.
• PI converges surprisingly rapidly, but each iteration is computationally expensive due to the policy evaluation step (waiting for convergence of V^π).
• PI is preferred if the action set is large.
Asynchronous Dynamic Programming
• The value function table is updated asynchronously.
• Computation is significantly reduced.
• If all states continue to be updated infinitely often, convergence is still guaranteed.
• Three simple algorithms:
– Gauss-Seidel Value Iteration
– Real-time dynamic programming
– Prioritised sweeping
Gauss-Seidel Value Iteration
• The standard VI algorithm updates all states in the next iteration using the old values from the previous iteration (each iteration finishes when all states have been updated).
Algorithm 1 Standard Value Iteration
1: while not converged do
2:   V_old = V
3:   for each s ∈ S do
4:     V(s) = max_a { R(s, a) + γ Σ_{s'} P(s' | s, a) V_old(s') }
• Gauss-Seidel VI updates each state using the most recently computed values, i.e. updates made earlier in the same sweep are reused immediately.
Algorithm 2 Gauss-Seidel Value Iteration
1: while not converged do
2:   for each s ∈ S do
3:     V(s) = max_a { R(s, a) + γ Σ_{s'} P(s' | s, a) V(s') }
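A sketch of the in-place (Gauss-Seidel) sweep under the same assumed R[s, a] / P[a, s, s'] model layout; the only difference from standard VI is that updated values are reused immediately within the sweep.

```python
import numpy as np

def gauss_seidel_value_iteration(R, P, gamma=0.9, eps=1e-6):
    """In-place value iteration: each state update immediately reuses the
    freshest values of the other states (assumed layout R[s, a], P[a, s, s'])."""
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        max_change = 0.0
        for s in range(n_states):
            # V(s) = max_a { R(s, a) + gamma * sum_{s'} P(s'|s, a) V(s') }
            v_new = max(R[s, a] + gamma * P[a, s, :] @ V for a in range(n_actions))
            max_change = max(max_change, abs(v_new - V[s]))
            V[s] = v_new                   # overwrite in place (Gauss-Seidel)
        if max_change <= eps:
            return V
```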
Prioritised Sweeping
• Similar to Gauss-Seidel VI, but the order in which states are updated within each iteration is driven by their update magnitudes (Bellman errors).
• Define the Bellman error as E(s; V_t) = |V_{t+1}(s) − V_t(s)|, i.e. the change of s's value after its most recent update.
Algorithm 3 Prioritised Sweeping VI
1: Initialize V_0(s) and priority values H_0(s), ∀s ∈ S.
2: for k = 0, 1, 2, 3, ... do
3:   pick a state to update (the one with the highest priority): s_k ∈ argmax_{s∈S} H_k(s)
4:   value update: V_{k+1}(s_k) = max_{a∈A} [ R(s_k, a) + γ Σ_{s'} P(s' | s_k, a) V_k(s') ]
5:   for s ≠ s_k: V_{k+1}(s) = V_k(s)
6:   update priority values: ∀s ∈ S, H_{k+1}(s) ← E(s; V_{k+1}) (note: the error is w.r.t. the future update)
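A sketch of this prioritised variant under the same assumed R[s, a] / P[a, s, s'] model layout; for simplicity it recomputes all priorities after every update, which matches the pseudocode above but forgoes the bookkeeping that usually makes prioritised sweeping cheap.

```python
import numpy as np

def bellman_errors(V, R, P, gamma):
    """E(s; V): how much each state's value would change if updated now."""
    Q = R + gamma * np.einsum('ast,t->sa', P, V)
    return np.abs(Q.max(axis=1) - V)

def prioritized_sweeping_vi(R, P, gamma=0.9, eps=1e-6, max_updates=100_000):
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    H = bellman_errors(V, R, P, gamma)          # initial priorities
    for _ in range(max_updates):
        if H.max() <= eps:                      # all residuals small -> done
            break
        s = int(np.argmax(H))                   # state with the highest priority
        V[s] = max(R[s, a] + gamma * P[a, s, :] @ V for a in range(n_actions))
        H = bellman_errors(V, R, P, gamma)      # priorities w.r.t. the future update
    return V
```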
Real-Time Dynamic Programming
• Similar to Gauss-Seidel VI, but the sequence of states updated in each iteration is generated by simulating transitions.
Algorithm 4 Real-Time Value Iteration
1: start at an arbitrary s_0, and initialize V_0(s), ∀s ∈ S.
2: for k = 0, 1, 2, 3, ... do
3:   action selection: a_k ∈ argmax_{a∈A} { R(s_k, a) + γ Σ_{s'} P(s' | s_k, a) V_k(s') }
4:   value update: V_{k+1}(s_k) = R(s_k, a_k) + γ Σ_{s'} P(s' | s_k, a_k) V_k(s')
5:   for s ≠ s_k: V_{k+1}(s) = V_k(s)
6:   simulate the next state: s_{k+1} ∼ P(s' | s_k, a_k)
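A sketch of Algorithm 4 under the same assumed R[s, a] / P[a, s, s'] model layout; the number of steps and the random seed are illustrative choices.

```python
import numpy as np

def rtdp(R, P, s0, gamma=0.9, n_steps=10_000, seed=0):
    """Real-time DP: only the states visited along a simulated trajectory are
    updated (assumed layout R[s, a], P[a, s, s'])."""
    rng = np.random.default_rng(seed)
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    s = s0
    for _ in range(n_steps):
        # Greedy action selection and one-step backup at the current state only.
        q = np.array([R[s, a] + gamma * P[a, s, :] @ V for a in range(n_actions)])
        a = int(np.argmax(q))
        V[s] = q[a]
        # Simulate the next state from the model: s_{k+1} ~ P(.|s_k, a_k)
        s = int(rng.choice(n_states, p=P[a, s, :]))
    return V
```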
• So far, we have introduced the basic notions of an MDP and value functions, and methods to compute optimal policies assuming that we know the world (i.e. we know P(s'|a, s) and R(a, s)):
– Value Iteration / Q-Iteration → V*, Q*, π*
– Policy Evaluation → V^π, Q^π
– Policy Improvement: π(s) ← argmax_a Q^{π_k}(a, s)
– Policy Iteration (iterate Policy Evaluation and Policy Improvement)
• Reinforcement Learning?