Q-learning Tutorial, CSC411

Geoffrey Roeder

Slides adapted from lecture: Rich Zemel, Raquel Urtasun, Sanja Fidler, Nitish Srivastava

Transcript of CSC411_Fall16/tutorial_RL.pdf (Zemel, Urtasun, Fidler, UofT; CSC 411: 19-Reinforcement Learning)



Tutorial Agenda

• Refresh RL terminology through Tic Tac Toe

• Deterministic Q-Learning: what and how

• Q-learning Matlab demo: Gridworld

• Extensions: non-deterministic reward, next state

• More cool demos


Tic Tac Toe Redux


[Figure: Tic Tac Toe board positions illustrating the terminology.]

State: $s_t$ = the current board position

Policy: $\pi : S \to A$, $\pi(s_t) \mapsto a$

Value function: $V^\pi : S \to \mathbb{R}$, $V^\pi(s_t) \mapsto r_{\text{future}}$

Reward: $R = -1$ (Lose), $0$ (Tie), $+1$ (Win)


RL & Tic-Tac-Toe

Each board position (taking into account symmetry) has some probability of winning

Simple learning process:

  • start with all values = 0.5
  • policy: choose the move with the highest probability of winning, given the legal moves from the current state
  • update entries in the table based on the outcome of each game
  • after many games the value function will represent the true probability of winning from each state

Can try an alternative policy: sometimes select moves randomly (exploration)
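A minimal Python sketch of this table-based scheme. The board encoding, the `next_state(state, move)` helper, the step size, and the Monte-Carlo-style nudge toward the game's outcome are illustrative assumptions; the slide only specifies the 0.5 initialization, the greedy-over-values policy, and the per-game update.

```python
import random

values = {}        # state -> estimated probability of winning
ALPHA = 0.1        # step size for the per-game table update (assumed value)

def value(state):
    return values.setdefault(state, 0.5)        # start with all values = 0.5

def choose_move(state, legal_moves, next_state, explore=0.1):
    # Greedy policy: pick the move whose successor state has the highest
    # estimated probability of winning; occasionally move at random (exploration).
    if random.random() < explore:
        return random.choice(legal_moves)
    return max(legal_moves, key=lambda m: value(next_state(state, m)))

def update_after_game(visited_states, outcome):
    # Update table entries from the game outcome, encoded here as
    # 1.0 (win), 0.5 (tie), 0.0 (loss): nudge each visited state's value
    # toward that outcome.
    for s in visited_states:
        values[s] = value(s) + ALPHA * (outcome - value(s))
```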



MDP Refresher

Familiar? Skip?


MDP Formulation

Goal: find policy $\pi$ that maximizes expected accumulated future rewards $V^\pi(s_t)$, obtained by following $\pi$ from state $s_t$:

$$V^\pi(s_t) = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots = \sum_{i=0}^{\infty} \gamma^i r_{t+i}$$

Game show example:

  • assume a series of questions, increasingly difficult, but with increasing payoff
  • choice: accept accumulated earnings and quit; or continue and risk losing everything

Notice that:

$$V^\pi(s_t) = r_t + \gamma V^\pi(s_{t+1})$$
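As a quick sanity check on the two expressions above, a small Python sketch (the reward sequence and $\gamma = 0.9$ are made-up values, and the infinite sum is truncated to the given rewards):

```python
def discounted_return(rewards, gamma):
    """V^pi(s_t) = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ..."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

rewards = [1.0, 0.0, 2.0, 5.0]   # r_t, r_{t+1}, r_{t+2}, r_{t+3}
gamma = 0.9

v_t = discounted_return(rewards, gamma)
v_t_next = discounted_return(rewards[1:], gamma)

# The recursion V^pi(s_t) = r_t + gamma * V^pi(s_{t+1}) holds:
assert abs(v_t - (rewards[0] + gamma * v_t_next)) < 1e-9
print(v_t)   # approximately 6.265
```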



What to Learn

We might try to learn the function $V$ (which we write as $V^*$):

$$V^*(s) = \max_a \left[ r(s, a) + \gamma V^*(\delta(s, a)) \right]$$

Here $\delta(s, a)$ gives the next state if we perform action $a$ in current state $s$

We could then do a lookahead search to choose the best action from any state $s$:

$$\pi^*(s) = \arg\max_a \left[ r(s, a) + \gamma V^*(\delta(s, a)) \right]$$

But there's a problem:

  • This works well if we know $\delta(\cdot)$ and $r(\cdot)$
  • But when we don't, we cannot choose actions this way
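For concreteness, a small Python sketch of the known-model case: a made-up three-state deterministic chain, $V^*$ computed by iterating the first equation, and the lookahead rule used to pick actions. Everything about the toy MDP (states, rewards, iteration count) is an illustrative assumption, not part of the slide.

```python
GAMMA = 0.9
STATES = ["s0", "s1", "goal"]
ACTIONS = ["left", "right"]

def delta(s, a):
    """Known deterministic transition function; 'goal' is absorbing."""
    if s == "goal":
        return "goal"
    i = STATES.index(s)
    i = max(i - 1, 0) if a == "left" else min(i + 1, len(STATES) - 1)
    return STATES[i]

def r(s, a):
    """Known reward function: 100 for entering the goal, 0 otherwise."""
    return 100.0 if s != "goal" and delta(s, a) == "goal" else 0.0

# Iterate V*(s) = max_a [ r(s,a) + gamma * V*(delta(s,a)) ] to a fixed point.
V = {s: 0.0 for s in STATES}
for _ in range(100):
    V = {s: max(r(s, a) + GAMMA * V[delta(s, a)] for a in ACTIONS) for s in STATES}

def pi_star(s):
    # One-step lookahead search through the known model.
    return max(ACTIONS, key=lambda a: r(s, a) + GAMMA * V[delta(s, a)])

print(V)              # {'s0': 90.0, 's1': 100.0, 'goal': 0.0}
print(pi_star("s0"))  # 'right'
```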



Q Learning

Deterministic rewards and actions


Q Learning

Define a new function, very similar to $V^*$:

$$Q(s, a) = r(s, a) + \gamma V^*(\delta(s, a))$$

If we learn $Q$, we can choose the optimal action even without knowing $\delta$!

$$\pi^*(s) = \arg\max_a \left[ r(s, a) + \gamma V^*(\delta(s, a)) \right] = \arg\max_a Q(s, a)$$

$Q$ is then the evaluation function we will learn.
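Acting greedily from a learned table is then a one-liner. A tiny sketch (the table contents are made up, and 0.0 stands in for unseen state-action pairs):

```python
Q = {
    ("s1", "up"): 81.0,
    ("s1", "right"): 90.0,
    ("s1", "down"): 72.9,
}

def greedy_action(state, actions):
    # pi*(s) = argmax_a Q(s, a)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

print(greedy_action("s1", ["up", "right", "down"]))   # 'right'
```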



Training Rule to Learn Q

$Q$ and $V^*$ are closely related:

$$V^*(s) = \max_a Q(s, a)$$

So we can write $Q$ recursively:

$$Q(s_t, a_t) = r(s_t, a_t) + \gamma V^*(\delta(s_t, a_t)) = r(s_t, a_t) + \gamma \max_{a'} Q(s_{t+1}, a')$$

Let $\hat{Q}$ denote the learner's current approximation to $Q$

Consider the training rule

$$\hat{Q}(s, a) \leftarrow r(s, a) + \gamma \max_{a'} \hat{Q}(s', a')$$

where $s'$ is the state resulting from applying action $a$ in state $s$
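The training rule as a single Python function (a sketch: the table is a plain dict keyed by (state, action), `actions_in(s2)` is an assumed helper listing the actions available in the observed next state, and unseen entries read as 0):

```python
GAMMA = 0.9

def q_update(Q, s, a, reward, s2, actions_in):
    # Q_hat(s, a) <- r(s, a) + gamma * max_{a'} Q_hat(s', a')
    best_next = max((Q.get((s2, a2), 0.0) for a2 in actions_in(s2)), default=0.0)
    Q[(s, a)] = reward + GAMMA * best_next
```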



Q Learning for Deterministic World

For each $s$, $a$ initialize the table entry $\hat{Q}(s, a) \leftarrow 0$

Start in some initial state $s$

Do forever:

  • Select an action $a$ and execute it
  • Receive immediate reward $r$
  • Observe the new state $s'$
  • Update the table entry for $\hat{Q}(s, a)$ using the Q-learning rule:
    $$\hat{Q}(s, a) \leftarrow r(s, a) + \gamma \max_{a'} \hat{Q}(s', a')$$
  • $s \leftarrow s'$

If we get to an absorbing state, restart at the initial state and run through the "Do forever" loop until we again reach an absorbing state.
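A compact Python sketch of this loop on a made-up 3x3 gridworld with a single absorbing goal (reward 100 on entering it, 0 otherwise). The grid, the uniformly random action selection, and the episode count are illustrative assumptions, not part of the slide; the exploration/exploitation slide later discusses better action-selection choices.

```python
import random

GAMMA = 0.9
SIZE = 3
GOAL = (2, 2)
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Deterministic transition delta(s, a) and reward r(s, a)."""
    dr, dc = ACTIONS[action]
    nxt = (min(max(state[0] + dr, 0), SIZE - 1),
           min(max(state[1] + dc, 0), SIZE - 1))
    return nxt, (100.0 if nxt == GOAL and state != GOAL else 0.0)

Q = {}   # (state, action) -> Q_hat; missing entries read as 0

for episode in range(500):
    s = (0, 0)                                  # restart at the initial state
    while s != GOAL:                            # run until the absorbing state
        a = random.choice(list(ACTIONS))        # select an action and execute it
        s2, r = step(s, a)                      # receive reward, observe new state
        best_next = max(Q.get((s2, a2), 0.0) for a2 in ACTIONS)
        Q[(s, a)] = r + GAMMA * best_next       # Q-learning update rule
        s = s2                                  # s <- s'

# Once visited enough, entries match the true Q of this deterministic gridworld:
print(Q.get(((2, 1), "right"), 0.0))   # 100.0  (stepping into the goal)
print(Q.get(((2, 0), "right"), 0.0))   # 90.0 = 0 + 0.9 * 100
```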



Updating Estimated Q

Assume the robot is in state $s_1$; some of its current estimates of $Q$ are as shown [figure not captured in the transcript]; it executes a rightward move:

$$\hat{Q}(s_1, a_{\text{right}}) \leftarrow r + \gamma \max_{a'} \hat{Q}(s_2, a') = r + 0.9 \max\{63, 81, 100\} = 90$$

Important observation: at each time step (taking action $a$ in state $s$), only one entry of $\hat{Q}$ will change (the entry $\hat{Q}(s, a)$)

Notice that if rewards are non-negative, then $\hat{Q}$ values only increase from 0 and approach the true $Q$
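The arithmetic in that update, checked in Python (the immediate reward must be 0 here for the result to be 90, consistent with the gridworld figure):

```python
r, gamma = 0.0, 0.9
q_hat_s2 = [63.0, 81.0, 100.0]    # current Q_hat(s2, a') estimates
print(r + gamma * max(q_hat_s2))  # 90.0
```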



Q Learning: Summary

Training set consists of a series of intervals (episodes): sequences of (state, action, reward) triples, ending at an absorbing state

Each executed action $a$ results in a transition from state $s_i$ to $s_j$; the algorithm updates $\hat{Q}(s_i, a)$ using the learning rule

Intuition for a simple grid world with reward only upon entering the goal state: $Q$ estimates improve backwards from the goal state

  1. All $\hat{Q}(s, a)$ start at 0
  2. First episode: only the $\hat{Q}(s, a)$ for the transition leading to the goal state is updated
  3. Next episode: if we go through this next-to-last transition, we update $\hat{Q}(s, a)$ another step back
  4. Eventually the information from transitions with non-zero reward propagates throughout the state-action space



Gridworld Demo


Extensions

Non-deterministic reward and actions


Q Learning: Exploration/Exploitation

Have not specified how actions are chosen (during learning)

Can choose actions to maximize $\hat{Q}(s, a)$

Good idea?

Can instead employ stochastic action selection (policy):

$$P(a_i \mid s) = \frac{\exp(k \hat{Q}(s, a_i))}{\sum_j \exp(k \hat{Q}(s, a_j))}$$

Can vary $k$ during learning:

  • more exploration early on, shift towards exploitation
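A Python sketch of this selection rule (the Q values and action names are made up; the maximum is subtracted inside the exponentials purely for numerical stability, which does not change the probabilities):

```python
import math
import random

def softmax_action(q_values, k):
    """q_values: dict mapping action -> Q_hat(s, action); returns one sampled action."""
    q_max = max(q_values.values())
    weights = {a: math.exp(k * (q - q_max)) for a, q in q_values.items()}
    total = sum(weights.values())
    probs = [weights[a] / total for a in q_values]
    return random.choices(list(q_values), weights=probs, k=1)[0]

# Small k: near-uniform choice (exploration). Large k: near-greedy (exploitation).
q_at_s = {"up": 81.0, "right": 90.0, "down": 72.9}
print(softmax_action(q_at_s, k=0.05))   # often, but not always, 'right'
print(softmax_action(q_at_s, k=5.0))    # almost always 'right'
```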



Non-deterministic Case

What if reward and next state are non-deterministic?

We redefine $V$, $Q$ based on probabilistic estimates (their expected values):

$$V^\pi(s) = E_\pi\left[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots\right] = E_\pi\left[\sum_{i=0}^{\infty} \gamma^i r_{t+i}\right]$$

and

$$Q(s, a) = E\left[r(s, a) + \gamma V^*(\delta(s, a))\right] = E\left[r(s, a) + \gamma \sum_{s'} p(s' \mid s, a) \max_{a'} Q(s', a')\right]$$
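One way to read the second equation, for a known stochastic model, as a single backup in Python (the transition probabilities, states, and actions are made up for illustration; the expectation over the reward is passed in directly):

```python
GAMMA = 0.9

def expected_backup(Q, s, a, expected_reward, transition_probs, actions):
    """Q(s,a) = E[r(s,a)] + gamma * sum_{s'} p(s'|s,a) * max_{a'} Q(s',a')."""
    future = sum(p * max(Q.get((s2, a2), 0.0) for a2 in actions)
                 for s2, p in transition_probs.items())
    return expected_reward + GAMMA * future

Q = {("s2", "stay"): 100.0}
print(expected_backup(Q, "s1", "right",
                      expected_reward=0.0,
                      transition_probs={"s2": 0.8, "s1": 0.2},
                      actions=["right", "stay"]))   # 0.9 * (0.8*100 + 0.2*0) = 72.0
```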



Non-deterministic Case: Learning Q

The previous training rule does not converge (it can keep changing $\hat{Q}$ even if initialized to the true $Q$ values)

So modify the training rule to change more slowly:

$$\hat{Q}(s, a) \leftarrow (1 - \alpha_n)\,\hat{Q}_{n-1}(s, a) + \alpha_n \left[ r + \gamma \max_{a'} \hat{Q}_{n-1}(s', a') \right]$$

where $s'$ is the state we land in after $s$, and $a'$ indexes the actions that can be taken in state $s'$, with

$$\alpha_n = \frac{1}{1 + \text{visits}_n(s, a)}$$

where $\text{visits}_n(s, a)$ is the number of times action $a$ has been taken in state $s$
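As a Python sketch (the dict layout, the `actions_in(s2)` helper, and the choice to count the current visit in visits_n are assumptions made for illustration):

```python
GAMMA = 0.9
Q = {}        # (state, action) -> Q_hat
visits = {}   # (state, action) -> number of times action a was taken in state s

def q_update_stochastic(s, a, reward, s2, actions_in):
    visits[(s, a)] = visits.get((s, a), 0) + 1
    alpha = 1.0 / (1.0 + visits[(s, a)])   # alpha_n = 1 / (1 + visits_n(s, a))
    best_next = max((Q.get((s2, a2), 0.0) for a2 in actions_in(s2)), default=0.0)
    # Q_hat <- (1 - alpha) * Q_hat_{n-1} + alpha * [ r + gamma * max_a' Q_hat_{n-1}(s', a') ]
    Q[(s, a)] = (1.0 - alpha) * Q.get((s, a), 0.0) + alpha * (reward + GAMMA * best_next)
```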



More Cool Demos


Super Mario World - https://www.youtube.com/watch?v=L4KBBAwF_bE

Model-based RL: Pole Balancing - https://www.youtube.com/watch?v=XiigTGKZfks

Other Examples:


Learn how to fly a Helicopter

• http://heli.stanford.edu/

• Formulate as an RL problem

• State - Position, orientation, velocity, angular velocity

• Actions - Front-back pitch, left-right pitch, tail rotor pitch, blade angle

• Dynamics - Map actions to states. Difficult!

• Rewards - Don't crash; do interesting things.

Slide credit: Nitish Srivastava