
Page 1: Machine Learning and Data Mining Reinforcement Learning ...

Machine Learning and Data Mining

Reinforcement Learning

Markov Decision Processes

Kalev Kask

+

Page 2: Machine Learning and Data Mining Reinforcement Learning ...

Overview

• Intro

• Markov Decision Processes

• Reinforcement Learning

– Sarsa

– Q-learning

• Exploration vs Exploitation tradeoff

2

Page 3: Machine Learning and Data Mining Reinforcement Learning ...

Resources

• Book: Reinforcement Learning: An Introduction, by Richard S. Sutton and Andrew G. Barto

• UCL Course on Reinforcement Learning, David Silver

– https://www.youtube.com/watch?v=2pWv7GOvuf0

– https://www.youtube.com/watch?v=lfHX2hHRMVQ

– https://www.youtube.com/watch?v=Nd1-UUMVfz4

– https://www.youtube.com/watch?v=PnHCvfgC_ZA

– https://www.youtube.com/watch?v=0g4j2k_Ggc4

– https://www.youtube.com/watch?v=UoPei5o4fps

3

Page 4: Machine Learning and Data Mining Reinforcement Learning ...

Lecture 1: Introduction to Reinforcement Learning

About RL: Branches of Machine Learning

[Diagram: Machine Learning branches into Supervised Learning, Unsupervised Learning, and Reinforcement Learning]

4

Page 5: Machine Learning and Data Mining Reinforcement Learning ...

Why is it different?

• No target values to predict

• Feedback in the form of rewards

– May be delayed, not instantaneous

• Has a goal: maximize reward

• Has a timeline: actions unfold along the arrow of time

• Actions affect what data the agent will receive

5

Page 6: Machine Learning and Data Mining Reinforcement Learning ...

Agent-Environment

6

Page 7: Machine Learning and Data Mining Reinforcement Learning ...

Lecture 1: Introduction to Reinforcement Learning

The RL Problem

Environments: Agent and Environment

[Diagram: the agent sends action At to the environment and receives observation Ot and reward Rt back]

At each step t the agent:

Executes action At

Receives observation Ot

Receives scalar reward Rt

The environment:

Receives action At

Emits observation Ot+1

Emits scalar reward Rt+1

t increments at the environment step

7
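
The interaction protocol above maps directly onto a simple loop. Below is a minimal sketch, assuming a hypothetical Env with reset/step methods and an Agent with an act method; these names are illustrative only and are not part of the slides.

# Minimal agent-environment interaction loop (sketch).
# Env and Agent are hypothetical stand-ins for illustration only.
import random

class Env:
    def reset(self):
        return 0                        # initial observation O_0
    def step(self, action):
        obs = random.randint(0, 9)      # next observation O_{t+1}
        reward = random.random()        # scalar reward R_{t+1}
        done = random.random() < 0.05   # episode termination flag
        return obs, reward, done

class Agent:
    def act(self, obs):
        return random.choice([0, 1])    # pick action A_t given the observation

env, agent = Env(), Agent()
obs, total_reward, done = env.reset(), 0.0, False
while not done:
    action = agent.act(obs)                # agent executes action A_t
    obs, reward, done = env.step(action)   # env emits O_{t+1}, R_{t+1}
    total_reward += reward                 # goal: maximize cumulative reward
print(total_reward)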

Page 8: Machine Learning and Data Mining Reinforcement Learning ...

Sequential Decision Making

• Actions have long-term consequences

• Goal: maximize cumulative (long-term) reward

– Rewards may be delayed

– May need to sacrifice short-term reward

• Devise a plan to maximize cumulative reward

8

Page 9: Machine Learning and Data Mining Reinforcement Learning ...

Lecture 1: Introduction to Reinforcement Learning

The RL Problem

Reward: Sequential Decision Making

Examples:

A financial investment (may take months to mature)

Refuelling a helicopter (might prevent a crash in several hours)

Blocking opponent moves (might help winning chances many moves from now)

9

Page 10: Machine Learning and Data Mining Reinforcement Learning ...

Reinforcement Learning

10

Reinforcement Learning: learn a behavior strategy (policy) that maximizes the long-term sum of rewards in an unknown and stochastic environment (Emma Brunskill)

Planning under Uncertainty: learn a behavior strategy (policy) that maximizes the long-term sum of rewards in a known stochastic environment (Emma Brunskill)

Page 11: Machine Learning and Data Mining Reinforcement Learning ...

11

Page 12: Machine Learning and Data Mining Reinforcement Learning ...

Lecture 1: Introduction to Reinforcement Learning

Problems within RL: Atari Example (Reinforcement Learning)

[Diagram: the agent picks joystick actions At and receives screen pixels Ot and the game score Rt from the environment]

Rules of the game are unknown

Learn directly from interactive game-play

Pick actions on the joystick, see pixels and scores

12

Page 13: Machine Learning and Data Mining Reinforcement Learning ...

13

Demos

Some videos

• https://www.youtube.com/watch?v=V1eYniJ0Rnk

• https://www.youtube.com/watch?v=CIF2SBVY-J0

• https://www.youtube.com/watch?v=I2WFvGl4y8c

Page 14: Machine Learning and Data Mining Reinforcement Learning ...

14

Markov Property
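
The slide content was not captured in this transcript; for reference, the standard definition (as in Sutton & Barto and Silver's course), in LaTeX notation:

% Markov property: the future is independent of the past given the present.
\mathbb{P}\left[ S_{t+1} \mid S_t \right] \;=\; \mathbb{P}\left[ S_{t+1} \mid S_1, \ldots, S_t \right]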

Page 15: Machine Learning and Data Mining Reinforcement Learning ...

15

State Transition

Page 16: Machine Learning and Data Mining Reinforcement Learning ...

16

Markov Process

Page 17: Machine Learning and Data Mining Reinforcement Learning ...

17

Student Markov Chain

Page 18: Machine Learning and Data Mining Reinforcement Learning ...

18

Student MC : Episodes

Page 19: Machine Learning and Data Mining Reinforcement Learning ...

19

Student MC : Transition Matrix

Page 20: Machine Learning and Data Mining Reinforcement Learning ...

20

Return

Page 21: Machine Learning and Data Mining Reinforcement Learning ...

21

Value

Page 22: Machine Learning and Data Mining Reinforcement Learning ...

22

Student MRP

Page 23: Machine Learning and Data Mining Reinforcement Learning ...

23

Student MRP : Returns

Page 24: Machine Learning and Data Mining Reinforcement Learning ...

24

Student MRP : Value Function

Page 25: Machine Learning and Data Mining Reinforcement Learning ...

25

Student MRP : Value Function

Page 26: Machine Learning and Data Mining Reinforcement Learning ...

26

Student MRP : Value Function

Page 27: Machine Learning and Data Mining Reinforcement Learning ...

27

Bellman Equation for MRP

Page 28: Machine Learning and Data Mining Reinforcement Learning ...

28

Backup Diagrams for MRP

Page 29: Machine Learning and Data Mining Reinforcement Learning ...

29

Bellman Eq: Student MRP

Page 30: Machine Learning and Data Mining Reinforcement Learning ...

30

Bellman Eq: Student MRP

Page 31: Machine Learning and Data Mining Reinforcement Learning ...

Lecture 2: Markov Decision Processes

Markov Reward Processes

Bellman Equation: Solving the Bellman Equation

The Bellman equation is a linear equation, so it can be solved directly:

v = R + γPv

(I − γP) v = R

v = (I − γP)⁻¹ R

Computational complexity is O(n³) for n states

Direct solution only possible for small MRPs

There are many iterative methods for large MRPs, e.g. dynamic programming, Monte-Carlo evaluation, Temporal-Difference learning

31
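
As a concrete illustration of the direct solution, here is a minimal NumPy sketch; the 3-state transition matrix and rewards below are made up for the example and are not the Student MRP from the slides.

# Direct solution of the MRP Bellman equation: v = (I - γP)^(-1) R
# The transition matrix P and reward vector R below are illustrative only.
import numpy as np

P = np.array([[0.5, 0.5, 0.0],    # transition probabilities P[s, s']
              [0.2, 0.6, 0.2],
              [0.0, 0.0, 1.0]])   # state 2 is absorbing
R = np.array([1.0, -2.0, 0.0])    # expected immediate reward in each state
gamma = 0.9

v = np.linalg.solve(np.eye(3) - gamma * P, R)   # solves (I - γP) v = R, cost O(n³)
print(v)                                        # value of each state under this MRP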

Page 32: Machine Learning and Data Mining Reinforcement Learning ...

32

Markov Decision Process

Page 33: Machine Learning and Data Mining Reinforcement Learning ...

33

Student MDP

Page 34: Machine Learning and Data Mining Reinforcement Learning ...

34

Policies

Page 35: Machine Learning and Data Mining Reinforcement Learning ...

35

MP → MRP → MDP

Page 36: Machine Learning and Data Mining Reinforcement Learning ...

36

Value Function

Page 37: Machine Learning and Data Mining Reinforcement Learning ...

37

Bellman Eq for MDP

Evaluating the Bellman equation translates into a 1-step lookahead
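
For reference, this 1-step lookahead corresponds to the standard Bellman expectation equations (as in Silver's course), written here in LaTeX notation with P and R denoting the MDP's transition and reward functions:

v_\pi(s) = \sum_{a \in A} \pi(a \mid s) \left( R_s^a + \gamma \sum_{s' \in S} P_{ss'}^{a} \, v_\pi(s') \right)

q_\pi(s,a) = R_s^a + \gamma \sum_{s' \in S} P_{ss'}^{a} \sum_{a' \in A} \pi(a' \mid s') \, q_\pi(s',a')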

Page 38: Machine Learning and Data Mining Reinforcement Learning ...

38

Bellman Eq, V

Page 39: Machine Learning and Data Mining Reinforcement Learning ...

39

Bellman Eq, q

Page 40: Machine Learning and Data Mining Reinforcement Learning ...

40

Bellman Eq, V

Page 41: Machine Learning and Data Mining Reinforcement Learning ...

41

Bellman Eq, q

Page 42: Machine Learning and Data Mining Reinforcement Learning ...

42

Student MDP : Bellman Eq

Page 43: Machine Learning and Data Mining Reinforcement Learning ...

43

Bellman Eq : Matrix Form

Page 44: Machine Learning and Data Mining Reinforcement Learning ...

44

Optimal Value Function

Page 45: Machine Learning and Data Mining Reinforcement Learning ...

45

Student MDP : Optimal V

Page 46: Machine Learning and Data Mining Reinforcement Learning ...

46

Student MDP : Optimal Q

Page 47: Machine Learning and Data Mining Reinforcement Learning ...

Lecture 2: Markov Decision Processes

Markov Decision Processes

Optimal Value Functions: Optimal Policy

Define a partial ordering over policies:

π ≥ π′ if vπ(s) ≥ vπ′(s), ∀s

Theorem

For any Markov Decision Process:

There exists an optimal policy π∗ that is better than or equal to all other policies, π∗ ≥ π, ∀π

All optimal policies achieve the optimal value function, vπ∗(s) = v∗(s)

All optimal policies achieve the optimal action-value function, qπ∗(s,a) = q∗(s,a)

47

Page 48: Machine Learning and Data Mining Reinforcement Learning ...

Lecture 2: Markov Decision Processes

Markov Decision Processes

Optimal Value Functions: Finding an Optimal Policy

An optimal policy can be found by maximising over q∗(s, a)

There is always a deterministic optimal policy for any MDP

If we know q∗(s, a), we immediately have the optimal policy

48
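
The slide's own formula was not captured in the transcript; the standard deterministic greedy rule it refers to can be stated in LaTeX notation as:

\pi_*(a \mid s) =
\begin{cases}
1 & \text{if } a = \arg\max_{a \in A} q_*(s,a) \\
0 & \text{otherwise}
\end{cases}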

Page 49: Machine Learning and Data Mining Reinforcement Learning ...

49

Student MDP : Optimal Policy

Page 50: Machine Learning and Data Mining Reinforcement Learning ...

50

Bellman Optimality Eq, V

Page 51: Machine Learning and Data Mining Reinforcement Learning ...

51

Student MDP : Bellman Optimality

Page 52: Machine Learning and Data Mining Reinforcement Learning ...

Lecture 2: Markov Decision Processes

Markov Decision Processes

Bellman Optimality Equation: Solving the Bellman Optimality Equation

The Bellman Optimality Equation is non-linear

No closed-form solution (in general); not easy

Many iterative solution methods: Value Iteration, Policy Iteration, Q-learning, Sarsa

52

Page 53: Machine Learning and Data Mining Reinforcement Learning ...

Lecture 1: Introduction to Reinforcement Learning

Inside An RL Agent: Maze Example

Start

Goal

Rewards: -1 per time-step

Actions: N, E, S, W

States: Agent’s location

53

Page 54: Machine Learning and Data Mining Reinforcement Learning ...

Lecture 1: Introduction to Reinforcement Learning

Inside An RL Agent: Maze Example: Policy

Start

Goal

Arrows represent the policy π(s) for each state s

54

Page 55: Machine Learning and Data Mining Reinforcement Learning ...

Lecture 1: Introduction to Reinforcement Learning

Inside An RL Agent: Maze Example: Value Function

-14 -13 -12 -11 -10 -9

-16 -15 -12 -8

-16 -17 -6 -7

-18 -19 -5

-24 -20 -4 -3

-23 -22 -21 -22 -2 -1

Start

Goal

Numbers represent the value vπ(s) of each state s

55

Page 56: Machine Learning and Data Mining Reinforcement Learning ...

Lecture 1: Introduction to Reinforcement Learning

Inside An RL Agent: Maze Example: Model

[Maze grid from Start to Goal: the agent's internal model, with each modelled cell annotated with immediate reward -1]

The agent may have an internal model of the environment

Dynamics: how actions change the state

Rewards: how much reward comes from each state

The model may be imperfect

Grid layout represents the transition model P^a_ss'

Numbers represent the immediate reward R^a_s from each state s (same for all a)

56

Page 57: Machine Learning and Data Mining Reinforcement Learning ...

Algorithms for MDPs

57

Page 58: Machine Learning and Data Mining Reinforcement Learning ...

Lecture 1: Introduction to Reinforcement Learning

Inside An RL Agent: Model

58

Page 59: Machine Learning and Data Mining Reinforcement Learning ...

Algorithms cont.

59

[Table of algorithms, organised by Prediction vs. Control]

Page 60: Machine Learning and Data Mining Reinforcement Learning ...

Lecture 1: Introduction to Reinforcement Learning

Problems within RL: Learning and Planning

Two fundamental problems in sequential decision making

Reinforcement Learning:

The environment is initially unknown

The agent interacts with the environment

The agent improves its policy

Planning:

A model of the environment is known

The agent performs computations with its model (without any external interaction)

The agent improves its policy

a.k.a. deliberation, reasoning, introspection, pondering, thought, search

60

Page 61: Machine Learning and Data Mining Reinforcement Learning ...

Lecture 1: Introduction to Reinforcement Learning

Inside An RL Agent: Major Components of an RL Agent

An RL agent may include one or more of these components:

Policy: agent's behaviour function

Value function: how good is each state and/or action

Model: agent's representation of the environment

61

Page 62: Machine Learning and Data Mining Reinforcement Learning ...

Dynamic Programming

62

Page 63: Machine Learning and Data Mining Reinforcement Learning ...

Requirements for DP

63

Page 64: Machine Learning and Data Mining Reinforcement Learning ...

Applications for DPs

64

Page 65: Machine Learning and Data Mining Reinforcement Learning ...

Lecture 3: Planning by Dynamic Programming

Introduction: Planning by Dynamic Programming

Dynamic programming assumes full knowledge of the MDP

It is used for planning in an MDP

For prediction:

Input: MDP (S, A, P, R, γ) and policy π, or MRP (S, Pπ, Rπ, γ)

Output: value function vπ

Or for control:

Input: MDP (S, A, P, R, γ)

Output: optimal value function v∗ and optimal policy π∗

65

Page 66: Machine Learning and Data Mining Reinforcement Learning ...

Lecture 3: Planning by Dynamic Programming

Policy Evaluation

Iterative Policy Evaluation: Policy Evaluation (Prediction)

Problem: evaluate a given policy π

Solution: iterative application of Bellman expectation backup

v1 → v2 → ... → vπ

Using synchronous backups: at each iteration k + 1, for all states s ∈ S, update vk+1(s) from vk(s′), where s′ is a successor state of s

We will discuss asynchronous backups later

Convergence to vπ can be proven

66
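
A minimal tabular sketch of synchronous iterative policy evaluation. The array conventions P[s, a, s'] and R[s, a] are my own illustrative choices, not the slides' notation; a concrete gridworld instance is sketched after the Small Gridworld slide below.

# Synchronous iterative policy evaluation (sketch).
# P[s, a, s'] = transition probability, R[s, a] = expected reward, pi[s, a] = policy.
import numpy as np

def policy_evaluation(P, R, pi, gamma, n_iters=100):
    n_states = P.shape[0]
    v = np.zeros(n_states)
    for _ in range(n_iters):
        # Bellman expectation backup: compute v_{k+1}(s) from v_k(s') for all s
        q = R + gamma * P @ v          # q[s, a] = R[s, a] + γ Σ_s' P[s, a, s'] v[s']
        v = np.sum(pi * q, axis=1)     # average over actions under π
    return v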

Page 67: Machine Learning and Data Mining Reinforcement Learning ...

Iterative Policy Evaluation

68

Page 68: Machine Learning and Data Mining Reinforcement Learning ...

Lecture 3: Planning by Dynamic Programming

Policy Evaluation

Example: Small Gridworld: Evaluating a Random Policy in the Small Gridworld

Undiscounted episodic MDP (γ = 1)

Nonterminal states 1, ..., 14

One terminal state (shown twice as shaded squares)

Actions leading out of the grid leave state unchanged

Reward is −1 until the terminal state is reached

Agent follows uniform random policy

π(n|·) = π(e|·) = π(s|·) = π(w|·) = 0.25

69
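
A self-contained sketch of how this small gridworld could be encoded and evaluated; the state indexing, terminal-state handling, and iteration count are my own illustrative choices, not the slides'.

# 4x4 gridworld from the slides: terminal corner states, reward -1 per step,
# uniform random policy, γ = 1. Encoding choices below are illustrative only.
import numpy as np

N = 16                              # states 0..15 on a 4x4 grid; 0 and 15 are terminal
actions = ['n', 'e', 's', 'w']
P = np.zeros((N, 4, N))             # P[s, a, s'] transition probabilities
R = np.full((N, 4), -1.0)           # reward -1 until the terminal state is reached
for s in range(N):
    if s in (0, 15):                # terminal states: absorbing, zero reward
        P[s, :, s], R[s, :] = 1.0, 0.0
        continue
    r, c = divmod(s, 4)
    for a, d in enumerate(actions):
        nr = r - 1 if d == 'n' else r + 1 if d == 's' else r
        nc = c - 1 if d == 'w' else c + 1 if d == 'e' else c
        if 0 <= nr < 4 and 0 <= nc < 4:
            P[s, a, nr * 4 + nc] = 1.0
        else:
            P[s, a, s] = 1.0        # moves leading out of the grid leave the state unchanged

pi = np.full((N, 4), 0.25)          # uniform random policy
v = np.zeros(N)
for _ in range(500):                # iterative policy evaluation with γ = 1
    v = np.sum(pi * (R + P @ v), axis=1)
print(v.reshape(4, 4).round(1))     # converges towards vπ for the random policy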

Page 69: Machine Learning and Data Mining Reinforcement Learning ...

Policy Evaluation : Grid World

70

Page 70: Machine Learning and Data Mining Reinforcement Learning ...

Policy Evaluation : Grid World

71

Page 71: Machine Learning and Data Mining Reinforcement Learning ...

Policy Evaluation : Grid World

72

Page 72: Machine Learning and Data Mining Reinforcement Learning ...

Policy Evaluation : Grid World

73

Page 73: Machine Learning and Data Mining Reinforcement Learning ...

74

Most of the story in a nutshell:

Page 74: Machine Learning and Data Mining Reinforcement Learning ...

Finding Best Policy

75

Page 75: Machine Learning and Data Mining Reinforcement Learning ...

Lecture 3: Planning by Dynamic Programming

Policy Iteration: Policy Improvement

Given a policy π

Evaluate the policy π

vπ(s) = E [Rt+1 + γRt+2 + ...|St = s]

Improve the policy by acting greedily with respect to vπ

π' = greedy(vπ)

In the Small Gridworld the improved policy was optimal, π′ = π∗

In general, need more iterations of improvement / evaluation

But this process of policy iteration always converges to π∗

76
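
A compact sketch of this evaluate-then-act-greedily loop, reusing the illustrative array conventions P[s, a, s'] and R[s, a] assumed in the earlier sketches (and assuming γ < 1 or an episodic MDP so that evaluation behaves).

# Policy iteration (sketch): alternate policy evaluation and greedy improvement.
import numpy as np

def policy_iteration(P, R, gamma, eval_iters=100):
    n_states, n_actions = R.shape
    states = np.arange(n_states)
    policy = np.zeros(n_states, dtype=int)           # start from an arbitrary policy
    while True:
        # Policy evaluation: estimate vπ for the current (deterministic) policy
        v = np.zeros(n_states)
        for _ in range(eval_iters):
            v = R[states, policy] + gamma * (P[states, policy] @ v)
        # Policy improvement: act greedily with respect to vπ
        q = R + gamma * P @ v                        # q[s, a] one-step lookahead
        new_policy = np.argmax(q, axis=1)
        if np.array_equal(new_policy, policy):       # stable: π' = π, so π is optimal
            return policy, v
        policy = new_policy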

Page 76: Machine Learning and Data Mining Reinforcement Learning ...

Policy Iteration

77

Page 77: Machine Learning and Data Mining Reinforcement Learning ...

Lecture 3: Planning by Dynamic Programming

Policy Iteration: Policy Iteration

Policy evaluation: estimate vπ (iterative policy evaluation)

Policy improvement: generate π′ ≥ π (greedy policy improvement)

78

Page 78: Machine Learning and Data Mining Reinforcement Learning ...

Jack’s Car Rental

79

Page 79: Machine Learning and Data Mining Reinforcement Learning ...

Policy Iteration in Car Rental

80

Page 80: Machine Learning and Data Mining Reinforcement Learning ...

Lecture 3: Planning by Dynamic Programming

Policy Iteration

Policy Improvement

81

Page 81: Machine Learning and Data Mining Reinforcement Learning ...

Lecture 3: Planning by Dynamic Programming

Policy Iteration

Policy Improvement: Policy Improvement (2)

If improvements stop,

qπ(s, π′(s)) = max_{a∈A} qπ(s, a) = qπ(s, π(s)) = vπ(s)

then the Bellman optimality equation has been satisfied:

vπ(s) = max_{a∈A} qπ(s, a)

Therefore vπ(s) = v∗(s) for all s ∈ S, so π is an optimal policy

82

Page 82: Machine Learning and Data Mining Reinforcement Learning ...

Lecture 3: Planning by Dynamic Programming

Contraction Mapping: Some Technical Questions

How do we know that value iteration converges to v∗?

Or that iterative policy evaluation converges to vπ ?

And therefore that policy iteration converges to v∗?

Is the solution unique?

How fast do these algorithms converge?

These questions are resolved by the contraction mapping theorem

83

Page 83: Machine Learning and Data Mining Reinforcement Learning ...

Lecture 3: Planning by Dynamic Programming

Contraction Mapping: Value Function Space

Consider the vector space V over value functions

There are |S| dimensions

Each point in this space fully specifies a value function v (s)

What does a Bellman backup do to points in this space?

We will show that it brings value functions closer

And therefore the backups must converge on a unique solution

84

Page 84: Machine Learning and Data Mining Reinforcement Learning ...

Lecture 3: Planning by Dynamic Programming

Contraction Mapping: Value Function ∞-Norm

We will measure the distance between state-value functions u and v by the ∞-norm, i.e. the largest difference between state values:

||u − v||∞ = max_{s∈S} |u(s) − v(s)|

85

Page 85: Machine Learning and Data Mining Reinforcement Learning ...

Lecture 3: Planning by Dynamic Programming

Contraction Mapping: Bellman Expectation Backup is a Contraction

86
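
The proof on this slide was not captured in the transcript; the standard argument (as in Silver's course) defines the Bellman expectation backup operator and shows it shrinks ∞-norm distances by a factor γ, in LaTeX notation:

% Bellman expectation backup operator for a fixed policy π
T^{\pi}(v) = R^{\pi} + \gamma P^{\pi} v

% Contraction: since every row of P^π is a probability distribution (non-negative, summing to 1),
% applying P^π cannot increase the ∞-norm.
\lVert T^{\pi}(u) - T^{\pi}(v) \rVert_\infty
  = \gamma \, \lVert P^{\pi}(u - v) \rVert_\infty
  \le \gamma \, \lVert u - v \rVert_\infty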

Page 86: Machine Learning and Data Mining Reinforcement Learning ...

Lecture 3: Planning by Dynamic Programming

Contraction Mapping: Contraction Mapping Theorem

Theorem (Contraction Mapping Theorem)

For any metric space V that is complete (i.e. closed) under an operator T(v), where T is a γ-contraction:

T converges to a unique fixed point

at a linear convergence rate of γ

87

Page 87: Machine Learning and Data Mining Reinforcement Learning ...

Lecture 3: Planning by Dynamic Programming

Contraction Mapping: Convergence of Iterative Policy Evaluation and Policy Iteration

The Bellman expectation operator T π has a unique fixed point

vπ is a fixed point of T π (by Bellman expectation equation)

By contraction mapping theorem

Iterative policy evaluation converges on vπ

Policy iteration converges on v∗

88

Page 88: Machine Learning and Data Mining Reinforcement Learning ...

Lecture 3: Planning by Dynamic Programming

Contraction Mapping: Bellman Optimality Backup is a Contraction

Define the Bellman optimality backup operator T∗,

T∗(v) = max_{a∈A} (R^a + γ P^a v)

This operator is a γ-contraction, i.e. it makes value functions closer by at least a factor γ (similar to the previous proof):

||T∗(u) − T∗(v)||∞ ≤ γ ||u − v||∞

89

Page 89: Machine Learning and Data Mining Reinforcement Learning ...

Lecture 3: Planning by Dynamic Programming

Contraction Mapping: Convergence of Value Iteration

The Bellman optimality operator T∗ has a unique fixed point

v∗ is a fixed point of T∗ (by the Bellman optimality equation)

By the contraction mapping theorem, value iteration converges on v∗

90

Page 90: Machine Learning and Data Mining Reinforcement Learning ...

91

Most of the story in a nutshell:

Page 91: Machine Learning and Data Mining Reinforcement Learning ...

92

Most of the story in a nutshell:

Page 92: Machine Learning and Data Mining Reinforcement Learning ...

93

Most of the story in a nutshell:

Page 93: Machine Learning and Data Mining Reinforcement Learning ...

Lecture 3: Planning by Dynamic Programming

Policy Iteration

Extensions to Policy Iteration: Modified Policy Iteration

Does policy evaluation need to converge to vπ?

Or should we introduce a stopping condition, e.g. ε-convergence of the value function?

Or simply stop after k iterations of iterative policy evaluation?

For example, in the small gridworld k = 3 was sufficient to achieve the optimal policy

Why not update the policy every iteration, i.e. stop after k = 1?

This is equivalent to value iteration (next section)

94

Page 94: Machine Learning and Data Mining Reinforcement Learning ...

Lecture 3: Planning by Dynamic Programming

Policy Iteration

Extensions to Policy Iteration: Generalised Policy Iteration

Policy evaluation: estimate vπ (any policy evaluation algorithm)

Policy improvement: generate π′ ≥ π (any policy improvement algorithm)

95

Page 95: Machine Learning and Data Mining Reinforcement Learning ...

Lecture 3: Planning by Dynamic Programming

Value Iteration

Value Iteration in MDPs: Value Iteration

Problem: find the optimal policy π∗

Solution: iterative application of the Bellman optimality backup

v1 → v2 → ... → v∗

Using synchronous backups: at each iteration k + 1, for all states s ∈ S, update vk+1(s) from vk(s′)

Convergence to v∗ follows from the contraction mapping theorem (shown earlier)

Unlike policy iteration, there is no explicit policy

Intermediate value functions may not correspond to any policy

96
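
A minimal sketch of synchronous value iteration under the same illustrative array conventions as the earlier sketches (P[s, a, s'], R[s, a]); the tolerance-based stopping rule is my own choice and assumes γ < 1.

# Synchronous value iteration (sketch): repeated Bellman optimality backups.
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    v = np.zeros(P.shape[0])
    while True:
        q = R + gamma * P @ v            # one-step lookahead over all actions
        v_new = q.max(axis=1)            # v_{k+1}(s) = max_a q(s, a)
        if np.max(np.abs(v_new - v)) < tol:
            break
        v = v_new
    policy = np.argmax(R + gamma * P @ v, axis=1)   # extract a greedy deterministic policy
    return v, policy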

Page 96: Machine Learning and Data Mining Reinforcement Learning ...

Lecture 3: Planning by Dynamic Programming

Value Iteration

Value Iteration in MDPsValue Iteration (2)

97

Page 97: Machine Learning and Data Mining Reinforcement Learning ...

Lecture 3: Planning by Dynamic Programming

Extensions to Dynamic Programming

Asynchronous Dynamic Programming

DP methods described so far used synchronous backups

i.e. all states are backed up in parallel

Asynchronous DP backs up states individually, in any order

For each selected state, apply the appropriate backup

Can significantly reduce computation

Guaranteed to converge if all states continue to be selected

99

Page 98: Machine Learning and Data Mining Reinforcement Learning ...

Lecture 3: Planning by Dynamic Programming

Extensions to Dynamic Programming

Asynchronous Dynamic Programming

Three simple ideas for asynchronous dynamic programming:

In-place dynamic programming

Prioritised sweeping

Real-time dynamic programming

100

Page 99: Machine Learning and Data Mining Reinforcement Learning ...

Lecture 3: Planning by Dynamic Programming

Extensions to Dynamic Programming

Asynchronous Dynamic Programming: In-Place Dynamic Programming

101

Page 100: Machine Learning and Data Mining Reinforcement Learning ...

Lecture 3: Planning by Dynamic Programming

Extensions to Dynamic Programming

Asynchronous Dynamic Programming: Prioritised Sweeping

102

Page 101: Machine Learning and Data Mining Reinforcement Learning ...

Lecture 3: Planning by Dynamic Programming

Extensions to Dynamic Programming

Asynchronous Dynamic Programming: Real-Time Dynamic Programming

Idea: back up only the states that are relevant to the agent

Use the agent's experience to guide the selection of states

After each time-step St, At, Rt+1

Back up the state St

103

Page 102: Machine Learning and Data Mining Reinforcement Learning ...

Lecture 3: Planning by Dynamic Programming

Extensions to Dynamic Programming

Full-width and sample backups: Full-Width Backups

DP uses full-width backups

For each backup (sync or async), every successor state and action is considered

Using knowledge of the MDP transitions and reward function

DP is effective for medium-sized problems (millions of states)

For large problems DP suffers from Bellman's curse of dimensionality

The number of states n = |S| grows exponentially with the number of state variables

Even one backup can be too expensive

104

Page 103: Machine Learning and Data Mining Reinforcement Learning ...

Lecture 3: Planning by Dynamic Programming

Extensions to Dynamic Programming

Full-width and sample backups: Sample Backups

In subsequent lectures we will consider sample backups

Using sample rewards and sample transitions (S, A, R, S′)

Instead of the reward function R and transition dynamics P

Advantages:

Model-free: no advance knowledge of the MDP required

Breaks the curse of dimensionality through sampling

Cost of a backup is constant, independent of n = |S|

105

Page 104: Machine Learning and Data Mining Reinforcement Learning ...

Lecture 3: Planning by Dynamic Programming

Extensions to Dynamic Programming

Approximate Dynamic Programming

106

Page 105: Machine Learning and Data Mining Reinforcement Learning ...

Monte Carlo Learning

107

Page 106: Machine Learning and Data Mining Reinforcement Learning ...

Lecture 4: Model-Free Prediction

Monte-Carlo Learning: Monte-Carlo Reinforcement Learning

MC methods learn directly from episodes of experience

MC is model-free: no knowledge of MDP transitions / rewards

MC learns from complete episodes: no bootstrapping

MC uses the simplest possible idea: value = mean return

Caveat: can only apply MC to episodic MDPs

All episodes must terminate

MC methods can solve the RL problem by averaging sample returns

MC is incremental episode-by-episode, but not step-by-step

Approach: adapting general policy iteration to sample returns

First policy evaluation, then policy improvement, then control

108

Page 107: Machine Learning and Data Mining Reinforcement Learning ...

Lecture 4: Model-Free Prediction

Monte-Carlo Learning: Monte-Carlo Policy Evaluation

Goal: learn vπ from episodes of experience under policy π

S1, A1, R2, ..., Sk ∼ π

Recall that the return is the total discounted reward:

Gt = Rt+1 + γRt+2 + ... + γ^(T−1) RT

Recall that the value function is the expected return:

vπ(s) = Eπ[Gt | St = s]

Monte-Carlo policy evaluation uses the empirical mean return instead of the expected return, because we do not have the model

109
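
A first-visit Monte-Carlo evaluation sketch. The episode format (a list of (state, reward) pairs, with reward being the reward received after leaving that state) and the function name are assumptions for illustration, not the slides' notation.

# First-visit Monte-Carlo policy evaluation (sketch).
from collections import defaultdict

def mc_policy_evaluation(episodes, gamma):
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        # record the first time-step at which each state is visited
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            if s not in first_visit:
                first_visit[s] = t
        # compute returns backwards: G_t = R_{t+1} + γ G_{t+1}
        G_at = [0.0] * (len(episode) + 1)
        for t in reversed(range(len(episode))):
            G_at[t] = episode[t][1] + gamma * G_at[t + 1]
        # accumulate the return following the first visit to each state
        for s, t in first_visit.items():
            returns_sum[s] += G_at[t]
            returns_count[s] += 1
    # value estimate = empirical mean return from each state
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}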

Page 108: Machine Learning and Data Mining Reinforcement Learning ...

Every Visit MC Policy Evaluation

110

Page 109: Machine Learning and Data Mining Reinforcement Learning ...

111

Page 110: Machine Learning and Data Mining Reinforcement Learning ...

112

Page 111: Machine Learning and Data Mining Reinforcement Learning ...

113

Page 112: Machine Learning and Data Mining Reinforcement Learning ...

114

Page 113: Machine Learning and Data Mining Reinforcement Learning ...

115

Page 114: Machine Learning and Data Mining Reinforcement Learning ...

116

Page 115: Machine Learning and Data Mining Reinforcement Learning ...

117

Page 116: Machine Learning and Data Mining Reinforcement Learning ...

118

Page 117: Machine Learning and Data Mining Reinforcement Learning ...

119

Page 118: Machine Learning and Data Mining Reinforcement Learning ...

120

Page 119: Machine Learning and Data Mining Reinforcement Learning ...

121

Page 120: Machine Learning and Data Mining Reinforcement Learning ...

122

Page 121: Machine Learning and Data Mining Reinforcement Learning ...

123

Page 122: Machine Learning and Data Mining Reinforcement Learning ...

124

Page 123: Machine Learning and Data Mining Reinforcement Learning ...

125

Page 124: Machine Learning and Data Mining Reinforcement Learning ...

126

Page 125: Machine Learning and Data Mining Reinforcement Learning ...

127

Page 126: Machine Learning and Data Mining Reinforcement Learning ...

128

Page 127: Machine Learning and Data Mining Reinforcement Learning ...

129

Page 128: Machine Learning and Data Mining Reinforcement Learning ...

130

Page 129: Machine Learning and Data Mining Reinforcement Learning ...

131

Page 130: Machine Learning and Data Mining Reinforcement Learning ...

132

Page 131: Machine Learning and Data Mining Reinforcement Learning ...

SARSA

133
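
The slide content was not captured in the transcript; for reference, a sketch of the standard tabular Sarsa update (on-policy TD control) with an ε-greedy policy. The env interface (reset/step) is the same hypothetical stand-in used in the earlier interaction-loop sketch.

# Tabular Sarsa (sketch): on-policy TD control.
# Q(S,A) <- Q(S,A) + α [R + γ Q(S',A') - Q(S,A)], acting ε-greedily.
import random
from collections import defaultdict

def epsilon_greedy(Q, state, n_actions, eps):
    if random.random() < eps:
        return random.randrange(n_actions)                          # explore
    return max(range(n_actions), key=lambda a: Q[(state, a)])       # exploit

def sarsa(env, n_actions, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, n_actions, eps)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = epsilon_greedy(Q, s2, n_actions, eps)   # next action chosen by the same policy
            target = r + (0.0 if done else gamma * Q[(s2, a2)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])    # Sarsa update
            s, a = s2, a2
    return Q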

Page 132: Machine Learning and Data Mining Reinforcement Learning ...

134

Page 133: Machine Learning and Data Mining Reinforcement Learning ...

135

Page 134: Machine Learning and Data Mining Reinforcement Learning ...

136

Page 135: Machine Learning and Data Mining Reinforcement Learning ...

Q-Learning

137
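
Likewise, a sketch of the standard tabular Q-learning update (off-policy TD control: the target uses the greedy max over next actions, while behaviour stays ε-greedy), under the same hypothetical env interface.

# Tabular Q-learning (sketch): off-policy TD control.
# Q(S,A) <- Q(S,A) + α [R + γ max_a' Q(S',a') - Q(S,A)].
import random
from collections import defaultdict

def q_learning(env, n_actions, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            if random.random() < eps:                    # ε-greedy behaviour policy
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda x: Q[(s, x)])
            s2, r, done = env.step(a)
            best_next = max(Q[(s2, x)] for x in range(n_actions))
            target = r + (0.0 if done else gamma * best_next)   # greedy target (off-policy)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q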

Page 136: Machine Learning and Data Mining Reinforcement Learning ...

Q-Learning vs. Sarsa

138

Page 137: Machine Learning and Data Mining Reinforcement Learning ...

139

Page 138: Machine Learning and Data Mining Reinforcement Learning ...

140

Page 139: Machine Learning and Data Mining Reinforcement Learning ...

142

Page 140: Machine Learning and Data Mining Reinforcement Learning ...

143

Page 141: Machine Learning and Data Mining Reinforcement Learning ...

Monte Carlo Tree Search

144

Page 142: Machine Learning and Data Mining Reinforcement Learning ...

145

Page 143: Machine Learning and Data Mining Reinforcement Learning ...

146

Page 144: Machine Learning and Data Mining Reinforcement Learning ...

147

Page 145: Machine Learning and Data Mining Reinforcement Learning ...

148

Page 146: Machine Learning and Data Mining Reinforcement Learning ...

149