MDPs and Reinforcement Learning


Page 1:

MDPs and Reinforcement Learning

Page 2:

Overview
• MDPs
• Reinforcement learning

Page 3:

Sequential decision problems
• Find a sequence of actions in an uncertain environment that balances risks and rewards
• Markov Decision Process (MDP):
  – In a fully observable environment we know the initial state S0 and the state-transition model T(Si, Ak, Sj) = probability of reaching Sj from Si when doing Ak
  – Each state has an associated reward R(Si)
• We can define a policy π that selects an action to perform in a given state, i.e., π(Si)
• Applying a policy produces a history of actions
• Goal: find the policy that maximizes the expected utility of the history
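
To make these components concrete, here is a minimal sketch in Python (not part of the slides) of one way to represent an MDP and apply a policy; the variable names, the three-state example, and its probabilities are all made up for illustration.

```python
import random

# A tiny MDP sketch: states, a transition model T, rewards R, and a policy.
# T[s][a] is a list of (next_state, probability) pairs, i.e., T(Si, Ak, Sj);
# R[s] is the reward for being in state s. All values below are illustrative.
states = ["s0", "s1", "s2"]
actions = ["a", "b"]

T = {
    "s0": {"a": [("s1", 0.8), ("s0", 0.2)], "b": [("s2", 1.0)]},
    "s1": {"a": [("s2", 0.9), ("s1", 0.1)], "b": [("s0", 1.0)]},
    "s2": {"a": [("s2", 1.0)], "b": [("s2", 1.0)]},  # s2 is absorbing
}
R = {"s0": -0.04, "s1": -0.04, "s2": 1.0}

# A policy maps each state to an action, i.e., pi(Si).
policy = {"s0": "a", "s1": "a", "s2": "a"}

def sample_next_state(s, a):
    """Sample s' according to T(s, a, s')."""
    next_states, probs = zip(*T[s][a])
    return random.choices(next_states, weights=probs, k=1)[0]

def rollout(policy, s0="s0", horizon=10):
    """Apply the policy from the initial state and return the resulting history."""
    history, s = [s0], s0
    for _ in range(horizon):
        s = sample_next_state(s, policy[s])
        history.append(s)
    return history

print(rollout(policy))  # e.g. ['s0', 's1', 's2', 's2', ...]
```

The goal stated above is then to choose, among all such policies, the one whose histories have the highest expected utility.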

Page 4:

4x3 Grid World

Page 5:

4x3 Grid World
• Assume R(s) = -0.04 except where marked
• Here's an optimal policy

Page 6:

4x3 Grid World
Different default rewards R(s) produce different optimal policies:
• Life = pain: get out quickly
• Life = struggle: go for +1, accept risk
• Life = ok: go for +1, minimize risk
• Life = good: avoid the exits
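
The effect described on this slide can be checked numerically. Below is a rough sketch that solves the 4x3 world with value iteration (an algorithm not covered in these slides) and prints the greedy policy for several default rewards. It assumes the standard textbook dynamics for this example — moves succeed with probability 0.8 and slip to each perpendicular direction with probability 0.1, there is an internal wall at (2,2), and the exits are +1 at (4,3) and -1 at (4,2); those dynamics, and the particular default rewards tried, are assumptions, since the slides only state R(s) = -0.04.

```python
# Value-iteration sketch for the 4x3 grid world (assumed standard dynamics).
GAMMA = 1.0  # this grid-world example is typically presented undiscounted
ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
PERP = {"up": ("left", "right"), "down": ("left", "right"),
        "left": ("up", "down"), "right": ("up", "down")}
STATES = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) != (2, 2)]
TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}

def move(s, a):
    """Deterministic move; stay put when bumping into the wall or the grid edge."""
    nxt = (s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1])
    return nxt if nxt in STATES else s

def transitions(s, a):
    """T(s, a, s'): 0.8 intended direction, 0.1 each perpendicular slip."""
    left, right = PERP[a]
    return [(move(s, a), 0.8), (move(s, left), 0.1), (move(s, right), 0.1)]

def greedy_policy(step_reward, sweeps=1000):
    """Run value-iteration sweeps for a given default reward and return the greedy policy.
    With a positive default reward and no discount the values keep growing, but a
    fixed number of sweeps is enough for the greedy policy to settle."""
    V = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        V = {s: TERMINALS[s] if s in TERMINALS else
             step_reward + GAMMA * max(sum(p * V[s2] for s2, p in transitions(s, a))
                                       for a in ACTIONS)
             for s in STATES}
    return {s: max(ACTIONS, key=lambda a: sum(p * V[s2] for s2, p in transitions(s, a)))
            for s in STATES if s not in TERMINALS}

for r in (-2.0, -0.2, -0.04, +0.01):  # pain, struggle, ok, good (illustrative values)
    print(f"R(s) = {r:+.2f}:", greedy_policy(r))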

Page 7:

Finite and infinite horizons
• Finite horizon
  – There is a fixed time N after which the game is over
  – U([s1, …, sN]) = U([s1, …, sN, …, sk]): states after the horizon do not affect the utility
  – Find a policy that takes this into account
• Infinite horizon
  – The game goes on forever
• With a finite horizon, the best action in a state can depend on how much time is left, so the optimal policy can change over time: more complicated

Page 8:

Rewards
• The utility of a sequence is usually additive
  – U([s0, …, sn]) = R(s0) + R(s1) + … + R(sn)
• But future rewards might be discounted by a factor γ
  – U([s0, …, sn]) = R(s0) + γ*R(s1) + γ^2*R(s2) + … + γ^n*R(sn)
• Using discounted rewards
  – solves some technical difficulties with very long or infinite sequences, and
  – is psychologically realistic

shortsighted 0 ← γ → 1 farsighted
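
As a quick illustration (not from the slides), the discounted utility of a reward sequence can be computed directly from the formula above; the function name and the example sequence are made up.

```python
def discounted_utility(rewards, gamma=0.9):
    """U([s0, ..., sn]) = R(s0) + gamma*R(s1) + gamma^2*R(s2) + ... + gamma^n*R(sn)."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# gamma near 1 weights future rewards almost as much as immediate ones (farsighted);
# gamma near 0 mostly ignores them (shortsighted).
seq = [-0.04, -0.04, -0.04, 1.0]            # illustrative reward sequence
print(discounted_utility(seq, gamma=0.99))  # ≈ 0.85
print(discounted_utility(seq, gamma=0.10))  # ≈ -0.04
```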

Page 9:

Value Functions
• The value of a state is the expected return starting from that state; it depends on the agent's policy.

  State-value function for policy π:
  V^π(s) = E_π[ R_t | s_t = s ] = E_π[ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s ]

• The value of taking an action in a state under policy π is the expected return starting from that state, taking that action, and thereafter following π.

  Action-value function for policy π:
  Q^π(s, a) = E_π[ R_t | s_t = s, a_t = a ] = E_π[ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s, a_t = a ]
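
Read operationally, these definitions suggest a simple (if inefficient) way to estimate a value: average the discounted return over many sampled episodes. The sketch below is illustrative only; the environment hook step(s, a) -> (next_state, reward, done) and the toy example are assumptions, not anything defined in the slides.

```python
import random

def estimate_state_value(s, policy, step, gamma=0.9, episodes=1000, horizon=100):
    """Monte Carlo estimate of V^pi(s): average the discounted return over episodes
    that start in s and follow the policy. Estimating Q^pi(s, a) is analogous,
    except the first action is fixed to a before following the policy."""
    total = 0.0
    for _ in range(episodes):
        state, ret, discount = s, 0.0, 1.0
        for _ in range(horizon):
            state, reward, done = step(state, policy[state])
            ret += discount * reward
            discount *= gamma
            if done:
                break
        total += ret
    return total / episodes

# Toy environment (made up): every step yields reward 1 and ends with probability 0.5.
def toy_step(state, action):
    return state, 1.0, random.random() < 0.5

print(estimate_state_value("s", {"s": "a"}, toy_step))
```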

Page 10:

Bellman Equation for a Policy
The basic idea:

  R_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + γ^3 r_{t+4} + …
      = r_{t+1} + γ ( r_{t+2} + γ r_{t+3} + γ^2 r_{t+4} + … )
      = r_{t+1} + γ R_{t+1}

So: V^π(s) = E_π[ R_t | s_t = s ] = E_π[ r_{t+1} + γ V^π(s_{t+1}) | s_t = s ]

Or, without the expectation operator:

  V^π(s) = Σ_a π(s, a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]
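
The last form of the equation can be turned directly into code: sweep over the states and repeatedly replace V(s) by the right-hand side until the values stop changing (iterative policy evaluation). The sketch below mirrors the slide's P^a_{ss'} and R^a_{ss'} notation with nested dictionaries; the function name, the stochastic policy pi[s][a], and the two-state example are illustrative assumptions.

```python
def policy_evaluation(states, actions, P, R, pi, gamma=0.9, tol=1e-8):
    """Apply V(s) <- sum_a pi(s,a) * sum_s' P[s][a][s'] * (R[s][a][s'] + gamma*V(s'))
    repeatedly until the largest change falls below tol."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v = sum(pi[s][a] * sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                                   for s2 in P[s][a])
                    for a in actions)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V

# Tiny made-up example: "go" moves s0 -> s1 with reward 1; s1 loops on itself with reward 0.
states, actions = ["s0", "s1"], ["go"]
P = {"s0": {"go": {"s1": 1.0}}, "s1": {"go": {"s1": 1.0}}}
R = {"s0": {"go": {"s1": 1.0}}, "s1": {"go": {"s1": 0.0}}}
pi = {"s0": {"go": 1.0}, "s1": {"go": 1.0}}
print(policy_evaluation(states, actions, P, R, pi))  # V(s0) ≈ 1.0, V(s1) = 0.0
```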

Page 11:

Values for states in the 4x3 world