MDPs and Reinforcement
Learning
Overview
• MDPs
• Reinforcement learning
Sequential decision problems
• In an uncertain environment, find a sequence of actions that balances risks and rewards
• Markov Decision Process (MDP):
– In a fully observable environment we know the initial state S0 and the transition model T(Si, Ak, Sj) = probability of reaching Sj from Si when doing action Ak
– Each state has an associated reward R(Si)
• We can define a policy π that selects an action to perform given a state, i.e., π(Si)
• Applying a policy leads to a history of actions
• Goal: find the policy maximizing the expected utility of the history
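These ingredients are easy to write down concretely. Below is a minimal Python sketch; the two-state chain, its actions, and its numbers are hypothetical and only illustrate the T, R, and π objects just defined:

```python
import random

# A minimal MDP sketch. The two-state chain, its actions, and its numbers are
# hypothetical; they only illustrate the T, R, and pi objects defined above.
T = {  # T[s][a] -> [(s_next, prob)]: transition model
    "s0": {"stay": [("s0", 0.9), ("s1", 0.1)], "go": [("s1", 1.0)]},
    "s1": {"stay": [("s1", 1.0)], "go": [("s0", 1.0)]},
}
R = {"s0": -0.04, "s1": 1.0}             # reward associated with each state
policy = {"s0": "go", "s1": "stay"}      # pi(s) -> action

def step(s, a, rng):
    """Sample a successor state from the distribution T(s, a, .)."""
    succs, probs = zip(*T[s][a])
    return rng.choices(succs, weights=probs)[0]

def rollout(start, steps, seed=0):
    """Apply the policy for a fixed number of steps and return the history."""
    rng = random.Random(seed)
    history, s = [start], start
    for _ in range(steps):
        s = step(s, policy[s], rng)
        history.append(s)
    return history
```

Under this particular policy the rollout happens to be deterministic ("go" from s0 and "stay" at s1 are both certain), so `rollout("s0", 3)` always returns `["s0", "s1", "s1", "s1"]`.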
4x3 Grid World
• Assume R(s) = -0.04 except where marked
• Here's an optimal policy
4x3 Grid World
Different default rewards produce different optimal policies:
• Life = pain: get out quickly
• Life = struggle: go for +1, accept risk
• Life = ok: go for +1, minimize risk
• Life = good: avoid the exits
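These policies can be reproduced with value iteration. The sketch below assumes the standard 4x3 grid-world model, which this excerpt does not spell out: actions succeed with probability 0.8 and slip to each perpendicular direction with probability 0.1, there is a wall at (2,2), and the terminals are +1 at (4,3) and -1 at (4,2).

```python
# Value iteration for the 4x3 grid world. Assumed model (not stated in this
# excerpt): actions move in the intended direction with prob 0.8 and slip to
# each perpendicular direction with prob 0.1; bumping a wall means staying put.
GAMMA = 1.0  # episodic task, no discounting
ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
PERP = {"up": ("left", "right"), "down": ("left", "right"),
        "left": ("up", "down"), "right": ("up", "down")}
STATES = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) != (2, 2)]
TERMINALS = {(4, 3): 1.0, (4, 2): -1.0}

def move(s, a):
    nxt = (s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1])
    return nxt if nxt in STATES else s  # off-grid or into the wall: stay put

def q_value(s, a, V, r_default):
    """Expected utility of doing a in s: R(s) + gamma * sum_s' T(s,a,s') V(s')."""
    return r_default + GAMMA * sum(
        p * V[move(s, d)]
        for d, p in [(a, 0.8), (PERP[a][0], 0.1), (PERP[a][1], 0.1)])

def value_iteration(r_default, iters=100):
    V = {s: TERMINALS.get(s, 0.0) for s in STATES}
    for _ in range(iters):
        for s in STATES:
            if s not in TERMINALS:
                V[s] = max(q_value(s, a, V, r_default) for a in ACTIONS)
    policy = {s: max(ACTIONS, key=lambda a: q_value(s, a, V, r_default))
              for s in STATES if s not in TERMINALS}
    return V, policy
```

`value_iteration(-0.04)` reproduces the optimal policy from the slide; rerunning with a strongly negative default such as `value_iteration(-2.0)` shows the "life = pain" behavior of heading for the nearest exit.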
Finite and infinite horizons
• Finite horizon
– There's a fixed time N when the game is over
– U([s1…sN]) = U([s1…sN…sk]): states beyond the horizon contribute nothing
– Find a policy that takes this into account
• Infinite horizon
– Game goes on forever
• With a finite horizon, the best action in a given state can change over time (the optimal policy is nonstationary): more complicated
Rewards
• The utility of a sequence is usually additive
– U([s0…sn]) = R(s0) + R(s1) + … + R(sn)
• But future rewards might be discounted by a factor γ
– U([s0…sn]) = R(s0) + γ R(s1) + γ^2 R(s2) + … + γ^n R(sn)
• Using discounted rewards
– Solves some technical difficulties with very long or infinite sequences, and
– Is psychologically realistic

shortsighted 0 ← γ → 1 farsighted
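The discounted sum is a one-liner; the helper name below is hypothetical:

```python
def discounted_return(rewards, gamma):
    """U([s0 ... sn]) = R(s0) + gamma*R(s1) + ... + gamma^n * R(sn)."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# gamma near 0 is shortsighted: only early rewards matter.
# gamma near 1 is farsighted: future rewards count almost fully.
print(discounted_return([1, 1, 1], 0.5))    # 1 + 0.5 + 0.25 = 1.75
print(discounted_return([2.0, 100.0], 0.0)) # only the immediate reward: 2.0
```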
Value Functions
State-value function for policy π:

V^π(s) = E_π{ R_t | s_t = s } = E_π{ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s }

Action-value function for policy π:

Q^π(s, a) = E_π{ R_t | s_t = s, a_t = a } = E_π{ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s, a_t = a }

• The value of a state is the expected return starting from that state; it depends on the agent's policy
• The value of taking an action in a state under policy π is the expected return starting from that state, taking that action, and thereafter following π
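Because V^π(s) is an expectation of returns, it can be estimated by averaging sampled returns. A Monte Carlo sketch on a hypothetical two-state MDP, with rewards r_{t+1} earned on entering the next state to match the sum above:

```python
import random

# Monte Carlo estimate of V^pi(s): average the discounted return R_t over
# sampled episodes starting in s. The two-state MDP below is hypothetical.
GAMMA = 0.9
T = {"s0": {"go": [("s1", 1.0)]},        # T[s][a] -> [(s_next, prob)]
     "s1": {"stay": [("s1", 1.0)]}}
R_STEP = {"s0": 0.0, "s1": 1.0}          # r_{t+1}: reward for entering s_{t+1}

def sample_return(s, pi, horizon, rng):
    """One sampled return R_t = sum_k gamma^k r_{t+k+1}, truncated at horizon."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        succs, probs = zip(*T[s][pi[s]])
        s = rng.choices(succs, weights=probs)[0]
        g += discount * R_STEP[s]
        discount *= GAMMA
    return g

def v_estimate(s, pi, episodes=500, horizon=200, seed=0):
    rng = random.Random(seed)
    return sum(sample_return(s, pi, horizon, rng)
               for _ in range(episodes)) / episodes
```

For this chain the estimate approaches the geometric series Σ_{k=0}^∞ 0.9^k = 10, since the agent collects reward 1 on every step after leaving s0.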
Bellman Equation for a Policy
Rt =rt+1 +γrt+2 +γ2rt+3 +γ3rt+4L=rt+1 +γ rt+2 +γrt+3 +γ2rt+4L( )=rt+1 +γRt+1
The basic idea:
So: Vπ (s)=Eπ Rt st =s{ }=Eπ rt+1 +γV st+1( ) st =s{ }
Or, without the expectation operator:
Vπ (s)= π(s,a) Ps ′ s a Rs ′ s
a +γVπ( ′ s )[ ]′ s
∑a∑
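Turning the last equation into an update rule gives iterative policy evaluation: sweep the states, replacing each V^π(s) by the right-hand side until nothing changes. A sketch on a hypothetical two-state MDP and policy:

```python
# Iterative policy evaluation: use the Bellman equation as an update rule and
# sweep states until V^pi stops changing. The MDP and policy are hypothetical.
GAMMA = 0.9
P = {"s0": {"go": [("s1", 1.0)]},                  # P[s][a] -> [(s', prob)]
     "s1": {"go": [("s0", 0.5), ("s1", 0.5)]}}
RSA = {("s0", "go", "s1"): 1.0,                    # R^a_{ss'}: expected reward
       ("s1", "go", "s0"): 0.0,
       ("s1", "go", "s1"): 2.0}
PI = {"s0": {"go": 1.0}, "s1": {"go": 1.0}}        # pi(s, a): action probability

def policy_evaluation(tol=1e-10):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # V(s) = sum_a pi(s,a) sum_s' P[s][a][s'] * (R + gamma * V(s'))
            v = sum(pa * sum(p * (RSA[(s, a, s2)] + GAMMA * V[s2])
                             for s2, p in P[s][a])
                    for a, pa in PI[s].items())
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V
```

For this chain the fixed point is V^π(s0) = V^π(s1) = 10, which you can verify by solving the two linear Bellman equations by hand.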
Values for states in 4x3 world