Lecture 3: Model-Free Control
Xi Chen
Stern School of Business, New York University
Slides are based on David Silver’s RL lecture notes and Lihong Li’s RL Tutorial
Xi Chen (NYU) Model-Free Control 1 / 38
Outline
1 Introduction
2 On-Policy Monte-Carlo Control
   Generalised Policy Iteration
   Exploration
   Blackjack Example
3 On-Policy Temporal-Difference Learning
4 Off-Policy Learning
   Q-Learning
Xi Chen (NYU) Model-Free Control 2 / 38
Model-Free Reinforcement Learning
Last lecture:
Model-free prediction: estimate the value function of an unknown MDP
This lecture:
Model-free control: optimize the value function of an unknown MDP
Xi Chen (NYU) Model-Free Control 4 / 38
Uses of Model-Free Control
Many examples that can be modelled as MDPs
Elevator
Parallel Parking
Ship Steering
Bioreactor
Helicopter
Aeroplane Logistics
Robocup Soccer
Quake
Portfolio management
Protein Folding
Robot walking
Game of Go
For most of these problems, either:
MDP model is unknown, but experience can be sampled
MDP model is known, but is too big to use, except by samples
Model-free control can optimize the MDP for these problems.
Xi Chen (NYU) Model-Free Control 5 / 38
On and Off-Policy Learning
On-policy learning
“Learn on the job”
Learn about policy π from experience sampled from π
Use the currently learned policy π to select actions, which improves stability
Off-policy learning
“Look over someone’s shoulder”
Learn about policy π from experience sampled from a different policy µ
Greater data efficiency: the same data can be re-used to evaluate/optimize
Lower costs/risks: e.g., RL in autonomous driving
Xi Chen (NYU) Model-Free Control 6 / 38
Outline
1 Introduction
2 On-Policy Monte-Carlo Control
   Generalised Policy Iteration
   Exploration
   Blackjack Example
3 On-Policy Temporal-Difference Learning
4 Off-Policy Learning
   Q-Learning
Xi Chen (NYU) Model-Free Control 7 / 38
Generalised Policy Iteration (Refresher)
Policy evaluation: estimate vπ, e.g., iterative policy evaluation
Policy improvement: generate π′ ≥ π, e.g., greedy policy improvement
Xi Chen (NYU) Model-Free Control 8 / 38
Generalised Policy Iteration With Monte-Carlo Evaluation
Policy evaluation: Monte-Carlo policy evaluation, V = vπ?
Policy improvement: greedy policy improvement?
Xi Chen (NYU) Model-Free Control 9 / 38
Model-Free Policy Iteration Using Action-Value Function
Greedy policy improvement over V(s) requires a model of the MDP (reward and transition):

π′(s) = argmax_{a∈A} [ R(s, a) + γ ∑_{s′} Pr[s′|s, a] vπ(s′) ]

Greedy policy improvement over Q(s, a) is model-free:

π′(s) = argmax_{a∈A} Q(s, a)
Xi Chen (NYU) Model-Free Control 10 / 38
Generalised Policy Iteration with Action-Value Function
Policy evaluation: Monte-Carlo policy evaluation, Q = qπ
Policy improvement: greedy policy improvement? Being too greedy is bad
Xi Chen (NYU) Model-Free Control 11 / 38
Monte-Carlo policy evaluation for Q = qπ
Sample the kth episode using π: {S1, A1, R2, . . . , ST} ∼ π
For each state St and action At in the episode:

N(St, At) ← N(St, At) + 1
Q(St, At) ← Q(St, At) + (1/N(St, At)) (Gt − Q(St, At))
Xi Chen (NYU) Model-Free Control 12 / 38
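The incremental update above can be sketched in Python. The `sample_episode` callable and the `(state, action, reward)` triple format are assumptions of this sketch, not part of the slides:

```python
from collections import defaultdict

def mc_q_evaluation(sample_episode, policy, num_episodes, gamma=1.0):
    """Every-visit Monte-Carlo evaluation of Q = q_pi.

    sample_episode(policy) is assumed to return one episode generated by
    `policy`, as a list of (state, action, reward) triples.
    """
    N = defaultdict(int)    # visit counts N(S_t, A_t)
    Q = defaultdict(float)  # action-value estimates Q(S_t, A_t)
    for _ in range(num_episodes):
        episode = sample_episode(policy)
        # Compute returns G_t backwards from the end of the episode.
        G = 0.0
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            N[(state, action)] += 1
            # Incremental mean: Q <- Q + (1/N) * (G - Q)
            Q[(state, action)] += (G - Q[(state, action)]) / N[(state, action)]
    return Q, N
```

Computing G backwards avoids re-summing rewards for every time step.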
Example of Greedy Action Selection
There are two doors in front of you.
You open the left door and get reward 0: V(left) = 0
You open the right door and get reward +1: V(right) = +1
You open the right door and get reward +3: V(right) = +2
You open the right door and get reward +2: V(right) = +2
...
Are you sure you’ve chosen the best door?
Xi Chen (NYU) Model-Free Control 13 / 38
Exploration: ε-greedy
At every step, select a random action with small probability, and the current best guess otherwise:

at = { argmax_a Q(st, a)   with prob. 1 − ε
     { random action       with prob. ε
Larger ε, more exploration
Uniform random exploration when ε = 1
Very simple to implement; can be quite effective in practice
Often use a sequence of decaying ε values, reducing the amount of exploration over time.
We will discuss improvements over ε-greedy at the end of this section.
Xi Chen (NYU) Model-Free Control 14 / 38
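The selection rule above can be sketched as a small helper. The dict keyed by (state, action) is an assumption of this sketch:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon, rng=random):
    """Pick argmax_a Q(state, a) with prob. 1 - epsilon, else a random action.

    Q is assumed to be a dict keyed by (state, action) pairs; unseen pairs
    default to 0.
    """
    if rng.random() < epsilon:
        return rng.choice(actions)          # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit
```

With epsilon = 1 this is uniform random exploration; with epsilon = 0 it is pure greedy selection.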
ε-Greedy Policy Improvement
Theorem
For any ε-greedy policy π, the ε-greedy policy π′ with respect to qπ is animprovement, vπ′(s) ≥ vπ(s)
qπ(s, π′(s)) = ∑_{a∈A} π′(a|s) qπ(s, a)
            = (ε/m) ∑_{a∈A} qπ(s, a) + (1 − ε) max_{a∈A} qπ(s, a)
            ≥ (ε/m) ∑_{a∈A} qπ(s, a) + (1 − ε) ∑_{a∈A} ((π(a|s) − ε/m)/(1 − ε)) qπ(s, a)
            = ∑_{a∈A} π(a|s) qπ(s, a) = vπ(s)

where m = |A| is the number of actions. Therefore, by the policy improvement theorem, vπ′(s) ≥ vπ(s).
Xi Chen (NYU) Model-Free Control 15 / 38
Monte-Carlo Control
Every episode:
Policy evaluation: Monte-Carlo policy evaluation, Q ≈ qπ
Policy improvement: ε-greedy policy improvement
Xi Chen (NYU) Model-Free Control 16 / 38
GLIE Monte-Carlo Control
Greedy in the Limit with Infinite Exploration (GLIE)
Sample the kth episode using π: {S1, A1, R2, . . . , ST} ∼ π
For each state St and action At in the episode:

N(St, At) ← N(St, At) + 1
Q(St, At) ← Q(St, At) + (1/N(St, At)) (Gt − Q(St, At))

Improve the policy based on the new action-value function:

ε ← 1/k
π ← ε-greedy(Q)
Xi Chen (NYU) Model-Free Control 17 / 38
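Putting the evaluation step and the ε = 1/k improvement step together gives a minimal GLIE Monte-Carlo control loop. The `run_episode` environment interface is an assumption of this sketch, not part of the slides:

```python
import random
from collections import defaultdict

def glie_mc_control(run_episode, actions, num_episodes, gamma=1.0, rng=random):
    """GLIE Monte-Carlo control with the schedule eps_k = 1/k.

    run_episode(select_action) is assumed to play one episode, calling
    select_action(state) at each decision point, and to return a list of
    (state, action, reward) triples.
    """
    N = defaultdict(int)
    Q = defaultdict(float)

    for k in range(1, num_episodes + 1):
        eps = 1.0 / k  # GLIE schedule: epsilon decays to zero

        def select_action(state):
            if rng.random() < eps:
                return rng.choice(actions)
            return max(actions, key=lambda a: Q[(state, a)])

        # Policy evaluation: incremental every-visit MC update of Q.
        episode = run_episode(select_action)
        G = 0.0
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            N[(state, action)] += 1
            Q[(state, action)] += (G - Q[(state, action)]) / N[(state, action)]
    return Q
```

Policy improvement is implicit: each episode acts ε-greedily with respect to the latest Q.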
GLIE
Definition
All state-action pairs are explored infinitely many times,

lim_{k→∞} Nk(s, a) = ∞

The policy converges to a greedy policy,

lim_{k→∞} πk(a|s) = 1(a = argmax_{a′∈A} Qk(s, a′))

Theorem
GLIE Monte-Carlo control converges to the optimal action-value function, Q(s, a) → q∗(s, a)

For example, ε-greedy is GLIE if ε decays to zero as εk = 1/k
Xi Chen (NYU) Model-Free Control 18 / 38
Monte-Carlo Control in Blackjack
We store Q(s, a) internally, but present only V∗(s) = max_a Q∗(s, a).
No usable ace: HIT if the dealer shows a large value (HIT is also called TWIST)
Xi Chen (NYU) Model-Free Control 20 / 38
Outline
1 Introduction
2 On-Policy Monte-Carlo ControlGeneralised Policy IterationExplorationBlackjack Example
3 On-Policy Temporal-Difference Learning
4 Off-Policy LearningQ-Learning
Xi Chen (NYU) Model-Free Control 21 / 38
MC vs. TD Control
Temporal-difference (TD) learning has several advantages overMonte-Carlo (MC)
Lower variance
Online
Incomplete sequences
Natural idea: use TD instead of MC in our control loop
Apply TD to Q(S, A)
Use ε-greedy policy improvement
Update every time-step
Xi Chen (NYU) Model-Free Control 22 / 38
Updating Action-Value Functions with Sarsa
Sarsa: state-action-reward-state-action
Q(S, A) ← Q(S, A) + α ( R + γ Q(S′, A′) − Q(S, A) )

where R + γ Q(S′, A′) is the TD target, and A′ is the action taken in S′.
Xi Chen (NYU) Model-Free Control 23 / 38
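The Sarsa backup above can be sketched as a one-step helper. The dict-of-(state, action) representation of Q is an assumption of this sketch:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """One Sarsa backup: Q(S,A) <- Q(S,A) + alpha*(R + gamma*Q(S',A') - Q(S,A)).

    Q is assumed to be a dict keyed by (state, action); unseen pairs
    default to 0.
    """
    td_target = r + gamma * Q.get((s_next, a_next), 0.0)
    q_sa = Q.get((s, a), 0.0)
    Q[(s, a)] = q_sa + alpha * (td_target - q_sa)
    return Q[(s, a)]
```

Note that a_next is the action actually taken in s_next, which is what makes Sarsa on-policy.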
On-Policy Control With Sarsa
Every time-step:
Policy evaluation: Sarsa (the update involves S, A, R, S′, A′), Q ≈ qπ
Policy improvement: ε-greedy policy improvement, always based on the current value of Q
Xi Chen (NYU) Model-Free Control 25 / 38
Convergence of Sarsa
Theorem
Sarsa converges to the optimal action-value function,Q(s, a)→ q∗(s, a), under the following conditions:
GLIE sequence of policies πt(a|s) (every state has to be visited)
Robbins-Monro sequence of step-sizes αt (e.g., αt = t^(−a) with a ∈ (1/2, 1]):

∑_{t=1}^∞ αt = ∞    ∑_{t=1}^∞ αt² < ∞
Xi Chen (NYU) Model-Free Control 26 / 38
Windy Gridworld Example
Reward = −1 per time-step until reaching goal
The player chooses standard moves from start (S) to goal (G). (Think: what about king’s moves?)
Each column is labelled with the number of cells the wind pushes the agent upward.
Xi Chen (NYU) Model-Free Control 27 / 38
Sarsa on the Windy Gridworld
At the beginning, a random walk takes about 2000 time steps to finish one episode (reaching G).
Later episodes take fewer time steps as the policy improves.
Xi Chen (NYU) Model-Free Control 28 / 38
Outline
1 Introduction
2 On-Policy Monte-Carlo ControlGeneralised Policy IterationExplorationBlackjack Example
3 On-Policy Temporal-Difference Learning
4 Off-Policy LearningQ-Learning
Xi Chen (NYU) Model-Free Control 29 / 38
Off-Policy Learning
Evaluate target policy π(a|s) to compute vπ(s) or qπ(s, a)
While following behavior policy µ(a|s)
{S1,A1,R2, . . . ,ST} ∼ µ
Why is this important?
Learn from observing humans or other agents
Re-use experience generated from old policies
Learn about the optimal policy while following a behavior policy
Xi Chen (NYU) Model-Free Control 30 / 38
Q-Learning
We now consider off-policy learning of action-values Q(s, a)
The next action is chosen using the behaviour policy, At+1 ∼ µ(·|St), which generates St+1
But we take the target successor action A′ to be the one that maximizes the current Q(St+1, ·)
We allow both the behaviour and target policies to improve
And update Q(St, At) towards the value of the target action:

Q(St, At) ← Q(St, At) + α ( Rt+1 + γ Q(St+1, A′) − Q(St, At) )
Xi Chen (NYU) Model-Free Control 31 / 38
Off-Policy Control with Q-Learning
The behavior policy µ is e.g. ε-greedy w.r.t. Q(s, a)
The target policy π is greedy w.r.t. Q(s, a)
π(S′) = argmax_{a′} Q(S′, a′)

The Q-learning update:

Q(S, A) ← Q(S, A) + α ( R + γ max_{a′} Q(S′, a′) − Q(S, A) )

where R + γ max_{a′} Q(S′, a′) is the Q-learning target.
Xi Chen (NYU) Model-Free Control 32 / 38
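The Q-learning backup above can be sketched as a one-step helper. The dict-of-(state, action) representation of Q and the explicit `actions` list are assumptions of this sketch:

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """One Q-learning backup:
    Q(S,A) <- Q(S,A) + alpha*(R + gamma * max_a' Q(S',a') - Q(S,A)).
    """
    # Off-policy: bootstrap from the greedy (target-policy) successor value,
    # regardless of which action the behaviour policy will actually take.
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    q_sa = Q.get((s, a), 0.0)
    Q[(s, a)] = q_sa + alpha * (r + gamma * best_next - q_sa)
    return Q[(s, a)]
```

The only difference from the Sarsa backup is the max over successor actions, which is exactly what makes Q-learning off-policy.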
Q-Learning as a Stochastic Approximation

Q∗(s, a) = E[r(s, a)] + γ ∑_{s′∈S} Pr[s′|s, a] V∗(s′)
         = E_{s′} [ r(s, a) + γ max_{a′∈A} Q∗(s′, a′) ]

The Q-learning update is a stochastic approximation to this fixed-point equation:

Q(S, A) ← Q(S, A) + α ( R + γ max_{a′} Q(S′, a′) − Q(S, A) )
Theorem
Q-learning control converges to the optimal action-value function,Q(s, a)→ q∗(s, a)
Xi Chen (NYU) Model-Free Control 34 / 38
Q-learning in Direct Mail Campaign Example
Slowly decaying learning rate from 0.05 to 0.0001
Different ε values
Convergence is always guaranteed, but at different speeds
Main lesson
Need efficient exploration/exploitation trade-off!
Xi Chen (NYU) Model-Free Control 35 / 38
Inefficiency of ε-greedy
Combination lock
N states, 2 actions
Reward is 1 when reaching G, and 0 otherwise
The optimal action is always the green action
Q-learning with ε-greedy
Assume Q is initialized to 0
Every step has probability 0.5 of going back to state “1”
It takes Θ(2^N) steps on average to reach G for the first time
Can we do better?
Xi Chen (NYU) Model-Free Control 36 / 38
Exploration Bonus
Selection rule
at = argmax_a { Q(st, a) + B(st, a) }

for some pre-defined exploration bonus function B(st, a).

B(s, a) measures the need for exploration, e.g., some decreasing function of C(s, a), the number of times action a has been chosen in state s
Popular choice: B(s, a) = α/√C(s, a) with parameter α > 0
Also known as Upper Confidence Bound (UCB)
An example of guided exploration (as opposed to ε-greedy)
Xi Chen (NYU) Model-Free Control 37 / 38
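The bonus-based selection rule with B(s, a) = α/√C(s, a) can be sketched as follows. The dict-based Q and C tables and the `bonus_scale` name (playing the role of α) are assumptions of this sketch:

```python
import math

def ucb_action(Q, C, state, actions, bonus_scale=1.0):
    """Select argmax_a { Q(s,a) + B(s,a) } with B(s,a) = alpha / sqrt(C(s,a)).

    Unvisited actions (C = 0) receive an infinite bonus, so each action is
    tried at least once before the bonus starts to decay.
    """
    def score(a):
        c = C.get((state, a), 0)
        if c == 0:
            return float("inf")
        return Q.get((state, a), 0.0) + bonus_scale / math.sqrt(c)
    return max(actions, key=score)
```

Unlike ε-greedy, exploration here is guided: rarely tried actions get a larger bonus and are preferred even when their current Q estimate is lower.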
Exploration Bonus
Epsilon-greedy strategy:

with probability 1 − ε: greedy action from s;
with probability ε: random action.

Epoch-dependent strategy (Boltzmann exploration):

pt(a|s, Q) = e^{Q(s,a)/τt} / ∑_{a′∈A} e^{Q(s,a′)/τt}

τt → 0: greedy selection; larger τt: more random actions.
Xi Chen (NYU) Model-Free Control 38 / 38
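The Boltzmann distribution above can be sketched as follows; the dict-of-(state, action) Q table is an assumption of this sketch:

```python
import math

def boltzmann_probs(Q, state, actions, tau):
    """Boltzmann exploration: p(a) proportional to exp(Q(s, a) / tau)."""
    # Subtract the max before exponentiating for numerical stability;
    # this leaves the normalized probabilities unchanged.
    vals = [Q.get((state, a), 0.0) / tau for a in actions]
    m = max(vals)
    exps = [math.exp(v - m) for v in vals]
    z = sum(exps)
    return [e / z for e in exps]
```

As tau shrinks, the distribution concentrates on the greedy action; as tau grows, it approaches uniform random selection.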