
Lecture 3: Model-Free Control

Xi Chen

Stern School of Business, New York University

Slides are based on David Silver’s RL lecture notes and Lihong Li’s RL Tutorial

Xi Chen (NYU) Model-Free Control 1 / 38

Outline

1 Introduction

2 On-Policy Monte-Carlo Control
    Generalised Policy Iteration
    Exploration
    Blackjack Example

3 On-Policy Temporal-Difference Learning

4 Off-Policy Learning
    Q-Learning

Xi Chen (NYU) Model-Free Control 2 / 38

Model-Free Reinforcement Learning

Last lecture:

Model-free prediction: estimate the value function of an unknown MDP

This lecture:

Model-free control: optimize the value function of an unknown MDP

Xi Chen (NYU) Model-Free Control 4 / 38

Uses of Model-Free Control

Many examples that can be modelled as MDPs

Elevator

Parallel Parking

Ship Steering

Bioreactor

Helicopter

Aeroplane Logistics

Robocup Soccer

Quake

Portfolio management

Protein Folding

Robot walking

Game of Go

For most of these problems, either:

MDP model is unknown, but experience can be sampled

MDP model is known, but is too big to use, except by samples

Model-free control can optimize the MDP for these problems.

Xi Chen (NYU) Model-Free Control 5 / 38

On and Off-Policy Learning

On-policy learning

“Learn on the job”
Learn about policy π from experience sampled from π
Use the currently learned policy π to select actions, which improves stability

Off-policy learning

“Look over someone's shoulder”
Learn about policy π from experience sampled from a different policy µ
Greater data efficiency: re-use the same data to evaluate/optimize
Lower costs/risks, e.g., RL in autonomous driving

Xi Chen (NYU) Model-Free Control 6 / 38

Outline

1 Introduction

2 On-Policy Monte-Carlo Control
    Generalised Policy Iteration
    Exploration
    Blackjack Example

3 On-Policy Temporal-Difference Learning

4 Off-Policy Learning
    Q-Learning

Xi Chen (NYU) Model-Free Control 7 / 38

Generalised Policy Iteration (Refresher)

Policy evaluation: estimate vπ, e.g. iterative policy evaluation

Policy improvement: generate π′ ≥ π, e.g. greedy policy improvement

Xi Chen (NYU) Model-Free Control 8 / 38

Generalised Policy Iteration With Monte-Carlo Evaluation

Policy evaluation: Monte-Carlo policy evaluation, V = vπ?
Policy improvement: greedy policy improvement?

Xi Chen (NYU) Model-Free Control 9 / 38

Model-Free Policy Iteration Using Action-Value Function

Greedy policy improvement over V(s) requires a model of the MDP (reward and transition probabilities)

π′(s) = argmax_{a∈A} [ R(s, a) + γ Σ_{s′} Pr[s′|s, a] vπ(s′) ]

Greedy policy improvement over Q(s, a) is model-free

π′(s) = argmax_{a∈A} Q(s, a)
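To make the contrast concrete, here is a minimal sketch (not from the slides) assuming tabular NumPy arrays: R[s, a] and P[s, a, s'] are an assumed reward/transition model, v and Q are assumed value tables.

```python
import numpy as np

def greedy_from_v(R: np.ndarray, P: np.ndarray, v: np.ndarray, gamma: float) -> np.ndarray:
    # Model-based improvement: needs the reward model R[s, a] and transitions P[s, a, s'].
    return (R + gamma * P @ v).argmax(axis=1)

def greedy_from_q(Q: np.ndarray) -> np.ndarray:
    # Model-free improvement: only needs the learned action-value table Q[s, a].
    return Q.argmax(axis=1)
```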

Xi Chen (NYU) Model-Free Control 10 / 38

Generalised Policy Iteration with Action-Value Function

Policy evaluation: Monte-Carlo policy evaluation, Q = qπ
Policy improvement: greedy policy improvement? Too greedy is bad

Xi Chen (NYU) Model-Free Control 11 / 38

Monte-Carlo policy evaluation for Q = qπ

Sample the kth episode using π: {S1, A1, R2, . . . , ST} ∼ π
For each state St and action At in the episode,

N(St, At) ← N(St, At) + 1
Q(St, At) ← Q(St, At) + (1 / N(St, At)) (Gt − Q(St, At))
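As an illustration, a minimal every-visit incremental MC evaluation step might look like the sketch below; the episode format (a list of (state, action, reward) tuples) and the dict-based tables are assumptions made for the example.

```python
def mc_evaluate_episode(episode, Q, N, gamma=1.0):
    """Every-visit incremental Monte-Carlo update for one episode.

    episode: list of (S_t, A_t, R_{t+1}) tuples; Q, N: dicts keyed by (state, action).
    """
    G, updates = 0.0, []
    for s, a, r in reversed(episode):      # accumulate returns G_t backwards
        G = r + gamma * G
        updates.append((s, a, G))
    for s, a, G in updates:                # incremental mean: Q <- Q + (G - Q) / N
        N[(s, a)] = N.get((s, a), 0) + 1
        Q[(s, a)] = Q.get((s, a), 0.0) + (G - Q.get((s, a), 0.0)) / N[(s, a)]
    return Q, N
```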

Xi Chen (NYU) Model-Free Control 12 / 38

Example of Greedy Action Selection

There are two doors in front of you.

You open the left door and get reward 0: V(left) = 0

You open the right door and get reward +1: V(right) = +1

You open the right door and get reward +3: V(right) = +2

You open the right door and get reward +2: V(right) = +2

...

Are you sure you've chosen the best door?

Xi Chen (NYU) Model-Free Control 13 / 38

Exploration: ε-greedy

At every step, select a random action with small probability, and the current best guess otherwise:

at = { argmax_a Q(st, a),  with prob. 1 − ε
     { random action,      with prob. ε

Larger ε, more exploration

Uniform random exploration when ε = 1

Very simple to implement; can be quite effective in practice

Often use a sequence of decaying ε values, reducing the amount of exploration over time.

Will discuss more improvements over ε-greedy at the end of this section
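A minimal ε-greedy selection sketch (Q_s is the vector of action values Q(s, ·); the NumPy representation is an assumption):

```python
import numpy as np

def epsilon_greedy(Q_s: np.ndarray, epsilon: float, rng: np.random.Generator) -> int:
    """Return argmax_a Q(s, a) with prob. 1 - epsilon, a uniform random action otherwise."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q_s)))   # explore
    return int(np.argmax(Q_s))               # exploit the current best guess
```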

Xi Chen (NYU) Model-Free Control 14 / 38

ε-Greedy Policy Improvement

Theorem

For any ε-greedy policy π, the ε-greedy policy π′ with respect to qπ is an improvement, vπ′(s) ≥ vπ(s)

qπ(s, π′(s)) = Σ_{a∈A} π′(a|s) qπ(s, a)
             = (ε/m) Σ_{a∈A} qπ(s, a) + (1 − ε) max_{a∈A} qπ(s, a)
             ≥ (ε/m) Σ_{a∈A} qπ(s, a) + (1 − ε) Σ_{a∈A} [(π(a|s) − ε/m) / (1 − ε)] qπ(s, a)
             = Σ_{a∈A} π(a|s) qπ(s, a) = vπ(s)

Here m = |A|; the inequality holds because the weights (π(a|s) − ε/m)/(1 − ε) are nonnegative and sum to 1 (since π is ε-greedy), so the weighted average cannot exceed the max.

Therefore, by the policy improvement theorem, vπ′(s) ≥ vπ(s).

Xi Chen (NYU) Model-Free Control 15 / 38

Monte-Carlo Control

Every episode:
Policy evaluation: Monte-Carlo policy evaluation, Q ≈ qπ
Policy improvement: ε-greedy policy improvement

Xi Chen (NYU) Model-Free Control 16 / 38

GLIE Monte-Carlo Control

Greedy in the Limit with Infinite Exploration (GLIE)

Sample the kth episode using π: {S1, A1, R2, . . . , ST} ∼ π
For each state St and action At in the episode,

N(St, At) ← N(St, At) + 1
Q(St, At) ← Q(St, At) + (1 / N(St, At)) (Gt − Q(St, At))

Improve policy based on new action-value function

ε ← 1/k
π ← ε-greedy(Q)
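Putting the pieces together, here is a sketch of the GLIE Monte-Carlo control loop, reusing the epsilon_greedy and mc_evaluate_episode sketches above. The environment interface (reset() → state, step(a) → (next_state, reward, done), n_actions) is an assumption for illustration, not something specified in the slides.

```python
import numpy as np

def glie_mc_control(env, n_episodes: int, gamma: float = 1.0, seed: int = 0):
    rng = np.random.default_rng(seed)
    Q, N = {}, {}

    def q_values(s):
        return np.array([Q.get((s, a), 0.0) for a in range(env.n_actions)])

    for k in range(1, n_episodes + 1):
        eps = 1.0 / k                                     # GLIE schedule: eps_k = 1/k
        s, episode, done = env.reset(), [], False
        while not done:                                   # act eps-greedily w.r.t. current Q
            a = epsilon_greedy(q_values(s), eps, rng)
            s_next, r, done = env.step(a)
            episode.append((s, a, r))
            s = s_next
        Q, N = mc_evaluate_episode(episode, Q, N, gamma)  # policy evaluation
        # Policy improvement is implicit: the next episode acts eps-greedily w.r.t. the updated Q.
    return Q
```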

Xi Chen (NYU) Model-Free Control 17 / 38

GLIE

Definition

All state-action pairs are explored infinitely many times,

lim_{k→∞} Nk(s, a) = ∞

The policy converges on a greedy policy,

lim_{k→∞} πk(a|s) = 1(a = argmax_{a′∈A} Qk(s, a′))

GLIE Monte-Carlo control converges to the optimal action-value function, Q(s, a) → q∗(s, a)

For example, ε-greedy is GLIE if ε decays to zero, e.g., εk = 1/k

Xi Chen (NYU) Model-Free Control 18 / 38

Back to the Blackjack Example

Xi Chen (NYU) Model-Free Control 19 / 38

Monte-Carlo Control in Blackjack

Q(s, a) is stored internally; the figure only presents V∗(s) = max_a Q∗(s, a).

No usable ace: HIT if the dealer shows a large value (HIT is also called TWIST)

Xi Chen (NYU) Model-Free Control 20 / 38

Outline

1 Introduction

2 On-Policy Monte-Carlo ControlGeneralised Policy IterationExplorationBlackjack Example

3 On-Policy Temporal-Difference Learning

4 Off-Policy LearningQ-Learning

Xi Chen (NYU) Model-Free Control 21 / 38

MC vs. TD Control

Temporal-difference (TD) learning has several advantages over Monte-Carlo (MC)

Lower variance
Online
Incomplete sequences

Natural idea: use TD instead of MC in our control loop

Apply TD to Q(S, A)
Use ε-greedy policy improvement
Update every time-step

Xi Chen (NYU) Model-Free Control 22 / 38

Updating Action-Value Functions with Sarsa

Sarsa: state-action-reward-state-action

Q(S, A) ← Q(S, A) + α ( R + γ Q(S′, A′) − Q(S, A) )

where R + γ Q(S′, A′) is the TD target.

Action A′: the action actually taken in the next state S′ (under the current policy).
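As a sketch, the Sarsa backup is a few lines over a dict-based tabular Q (the representation is an assumption for illustration):

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """Move Q(S, A) towards the TD target R + gamma * Q(S', A')."""
    td_target = r + gamma * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (td_target - Q.get((s, a), 0.0))
```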

Xi Chen (NYU) Model-Free Control 23 / 38

Sarsa Algorithm for On-Policy Control
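The pseudocode shown on this slide is an image that is not reproduced in the transcript. A minimal Python sketch of the standard tabular Sarsa loop is given below; the environment interface (reset(), step(a) → (next_state, reward, done), n_states, n_actions) is an assumed, illustrative API.

```python
import numpy as np

def sarsa_control(env, n_episodes, alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
    """On-policy TD control: eps-greedy behaviour, backups use the action actually taken."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.n_states, env.n_actions))

    def pick(s):                                   # eps-greedy w.r.t. the current Q
        if rng.random() < epsilon:
            return int(rng.integers(env.n_actions))
        return int(np.argmax(Q[s]))

    for _ in range(n_episodes):
        s, done = env.reset(), False
        a = pick(s)
        while not done:
            s_next, r, done = env.step(a)
            a_next = pick(s_next) if not done else 0
            target = r if done else r + gamma * Q[s_next, a_next]
            Q[s, a] += alpha * (target - Q[s, a])  # Sarsa backup: uses S, A, R, S', A'
            s, a = s_next, a_next
    return Q
```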

Xi Chen (NYU) Model-Free Control 24 / 38

On-Policy Control With Sarsa

Every time-step:
Policy evaluation: Sarsa (each update involves S, A, R, S′, A′), Q ≈ qπ
Policy improvement: ε-greedy policy improvement, always based on the current value of Q

Xi Chen (NYU) Model-Free Control 25 / 38

Convergence of Sarsa

Theorem

Sarsa converges to the optimal action-value function, Q(s, a) → q∗(s, a), under the following conditions:

GLIE sequence of policies πt(a|s) (every state has to be visited)

Robbins-Monro sequence of step-sizes αt (e.g., αt = t^(−α) with α ∈ (1/2, 1]):

Σ_{t=1}^∞ αt = ∞,   Σ_{t=1}^∞ αt² < ∞

Xi Chen (NYU) Model-Free Control 26 / 38

Windy Gridworld Example

Reward = −1 per time-step until reaching goal

The player uses standard moves to travel from start (S) to goal (G) (think: what about king's moves?)

The number below each column gives how many cells the wind pushes the agent up from that column.
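A compact sketch of this environment with the usual textbook parameters (7 x 10 grid, wind strengths 0,0,0,1,1,1,2,2,1,0, start (3,0), goal (3,7)); it exposes the same illustrative reset/step interface assumed in the Sarsa sketch above, so it can be plugged directly into sarsa_control.

```python
class WindyGridworld:
    """Windy gridworld sketch: -1 reward per step, wind pushes the agent upward."""
    WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]
    ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
    n_states, n_actions = 7 * 10, 4

    def reset(self):
        self.pos = (3, 0)                          # start S
        return self._state()

    def step(self, a):
        dr, dc = self.ACTIONS[a]
        r, c = self.pos
        new_r = min(max(r + dr - self.WIND[c], 0), 6)   # wind of the current column
        new_c = min(max(c + dc, 0), 9)
        self.pos = (new_r, new_c)
        done = self.pos == (3, 7)                  # goal G
        return self._state(), -1.0, done

    def _state(self):
        return self.pos[0] * 10 + self.pos[1]
```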

Xi Chen (NYU) Model-Free Control 27 / 38

Sarsa on the Windy Gridworld

At the beginning, the (nearly) random walk takes about 2000 time steps to finish one episode (reaching G).

As learning progresses, fewer time steps are needed per episode, so more episodes are completed.

Xi Chen (NYU) Model-Free Control 28 / 38

Outline

1 Introduction

2 On-Policy Monte-Carlo ControlGeneralised Policy IterationExplorationBlackjack Example

3 On-Policy Temporal-Difference Learning

4 Off-Policy LearningQ-Learning

Xi Chen (NYU) Model-Free Control 29 / 38

Off-Policy Learning

Evaluate target policy π(a|s) to compute vπ(s) or qπ(s, a)

While following behavior policy µ(a|s)

{S1,A1,R2, . . . ,ST} ∼ µ

Why is this important?

Learn from observing humans or other agents
Re-use experience generated from old policies
Learn about the optimal policy while following a behavior policy

Xi Chen (NYU) Model-Free Control 30 / 38

Q-Learning

We now consider off-policy learning of action-values Q(s, a)

The action is chosen using the behaviour policy, At ∼ µ(·|St), which generates the next state St+1

But the target successor action A′ is taken greedily, i.e., it maximizes the current Q(St+1, ·)
We allow both the behaviour and target policies to improve

And update Q(St, At) towards the value of the target action

Q(St, At) ← Q(St, At) + α ( Rt+1 + γ Q(St+1, A′) − Q(St, At) )
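A sketch of the corresponding tabular backup (same illustrative dict-based Q; `actions` is the set of actions available in S′):

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """Target uses the greedy successor action: R + gamma * max_a' Q(S', a')."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))
```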

Xi Chen (NYU) Model-Free Control 31 / 38

Off-Policy Control with Q-Learning

The behavior policy µ is e.g. ε-greedy w.r.t. Q(s, a)

The target policy π is greedy w.r.t. Q(s, a)

π(S′) = argmax_{a′} Q(S′, a′)

The Q-learning update:

Q(S, A) ← Q(S, A) + α ( R + γ max_{a′} Q(S′, a′) − Q(S, A) )

where R + γ max_{a′} Q(S′, a′) is the Q-learning target.

Xi Chen (NYU) Model-Free Control 32 / 38

Q-Learning Control Algorithm
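The pseudocode on this slide is an image not reproduced in the transcript. A minimal sketch of tabular Q-learning control, with the same assumed environment interface used earlier, is:

```python
import numpy as np

def q_learning_control(env, n_episodes, alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
    """Off-policy TD control: eps-greedy behaviour policy, greedy target policy."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.n_states, env.n_actions))
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            # Behaviour policy mu: eps-greedy w.r.t. the current Q.
            a = int(rng.integers(env.n_actions)) if rng.random() < epsilon else int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Target policy pi: greedy w.r.t. Q, hence the max over successor actions.
            target = r if done else r + gamma * np.max(Q[s_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```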

Xi Chen (NYU) Model-Free Control 33 / 38

Q-Learning as a stochastic approximation

The Bellman optimality equation:

Q∗(s, a) = E[r(s, a)] + γ Σ_{s′∈S} Pr[s′|s, a] V∗(s′)
         = E_{s′}[ r(s, a) + γ max_{a′∈A} Q∗(s′, a′) ]

The Q-learning update is a sample-based stochastic approximation of this fixed-point equation:

Q(S, A) ← Q(S, A) + α ( R + γ max_{a′} Q(S′, a′) − Q(S, A) )

Theorem

Q-learning control converges to the optimal action-value function, Q(s, a) → q∗(s, a)

Xi Chen (NYU) Model-Free Control 34 / 38

Q-learning in Direct Mail Campaign Example

Slowly decaying learning rate, from 0.05 to 0.0001

Different ε values

Convergence is always guaranteed, but at different speeds

Main lesson

Need efficient exploration/exploitation trade-off!

Xi Chen (NYU) Model-Free Control 35 / 38

Inefficiency of ε-greedy

Combination lock

N states, 2 actions

Reward is 1 when reaching G, and 0 otherwise

The optimal action is always the green action, which advances toward G (the other action resets to state “1”)

Q-learning with ε-greedy

Assuming Q initialized to 0

Every step has probability 0.5 of going back to state “1”

It takes Θ(2^N) steps on average to get to G for the first time

Can we do better?

Xi Chen (NYU) Model-Free Control 36 / 38

Exploration Bonus

Selection rule

at = argmax_a { Q(st, a) + B(st, a) }

for some pre-defined exploration bonus function B(st , a).

B(s, a) measures the need for exploration, e.g., some decreasing function of C(s, a), the number of times action a has been chosen in state s

Popular choice: B(s, a) = α / √C(s, a), with parameter α > 0.

Also known as Upper Confidence Bound (UCB)

An example of guided exploration (as opposed to ε-greedy)
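A sketch of count-based selection with the α/√C(s, a) bonus (Q_s and counts_s are the per-state value and count vectors; unvisited actions get an infinite bonus so they are tried first):

```python
import numpy as np

def bonus_action(Q_s: np.ndarray, counts_s: np.ndarray, alpha: float) -> int:
    """Pick argmax_a { Q(s, a) + alpha / sqrt(C(s, a)) }."""
    bonus = np.where(counts_s > 0, alpha / np.sqrt(np.maximum(counts_s, 1)), np.inf)
    return int(np.argmax(Q_s + bonus))
```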

Xi Chen (NYU) Model-Free Control 37 / 38

Exploration Bonus

Epsilon-greedy strategy:

with probability 1 − ε: greedy action from s
with probability ε: random action

Epoch-dependent strategy (Boltzmann exploration):

pt(a|s, Q) = exp(Q(s, a)/τt) / Σ_{a′∈A} exp(Q(s, a′)/τt)

τt → 0: greedy selection
Larger τt: closer to a uniformly random action
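A minimal Boltzmann (softmax) selection sketch; subtracting the max is only for numerical stability:

```python
import numpy as np

def boltzmann_action(Q_s: np.ndarray, tau: float, rng: np.random.Generator) -> int:
    """Sample action a with probability proportional to exp(Q(s, a) / tau)."""
    prefs = np.exp((Q_s - Q_s.max()) / tau)
    probs = prefs / prefs.sum()
    return int(rng.choice(len(Q_s), p=probs))
```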

Xi Chen (NYU) Model-Free Control 38 / 38