Transcript of 580.691 Learning Theory Reza Shadmehr & Jörn Diedrichsen

Page 1: 580.691  Learning Theory Reza Shadmehr & Jörn Diedrichsen

580.691 Learning Theory

Reza Shadmehr & Jörn Diedrichsen

Reinforcement Learning 1: Generalized policy iteration

Page 2: 580.691  Learning Theory Reza Shadmehr & Jörn Diedrichsen

In a given situation, the action an organism performs can be followed by a certain outcome, which might be rewarding or non-rewarding to the organism. The action also brings the organism into a new state, from which there may again be the possibility to obtain reward.

How do we learn to choose actions so as to maximize reward? This is the problem addressed by Reinforcement learning. In contrast to supervised learning, the reward we see only tells us whether the action we chose was good or bad, not what the "correct" action would have been. Also, the actions we take and the reward can be separated temporally, so the problem arises of how to assign the reward signal to the actions. Thus, reinforcement learning has some aspects of supervised learning, but with a very "poor" teacher.

Page 3: 580.691  Learning Theory Reza Shadmehr & Jörn Diedrichsen

Reinforcement Learning: The lay of the land

Say you are a rat in a maze. At any given place you can go left or right, and you might get food as a result:

[Figure: maze diagram with states (big circles) and actions (small circles) labeled left, right, and wait]

The big circles are states. We say that the state at time t has the value s_t, indicating that the rat is at a certain location. The small circles are actions, which the rat can take at any given state. Actions transport the actor into a new state. We define a policy π, a probabilistic mapping of states to actions:

$$\pi : s \to a, \qquad \pi(s, a) = P(a_t = a \mid s_t = s)$$

Page 4: 580.691  Learning Theory Reza Shadmehr & Jörn Diedrichsen

Reinforcement Learning: The lay of the land

[Figure: the same maze with transition probabilities; going left reaches the next state with P=0.9 and falls into the trap with P=0.1; from the trap, wait leads back out with P=0.2 and stays in the trap with P=0.8]

Often, the outcome of actions is not fully certain. For example, when going left the rat might have a 10% chance of falling through a trap door. Once in the trap, the experimenter might free the rat with a probability of 20% on every time step:

Thus, we define the transition probability that you go to state s' when you are in state s and perform action a:

$$P_{ss'}^{a} = P(s_{t+1} = s' \mid s_t = s, a_t = a)$$
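As an illustration (not from the slides), the transition probabilities of the trap-door example above could be stored in a nested table; the state names here are made up:

```python
# A minimal sketch: transition probabilities P[s][a][s'] for the hypothetical
# trap-door example. State and action names are placeholders.
P = {
    "corridor": {
        "left":  {"food_left": 0.9, "trap": 0.1},   # 10% chance to fall into the trap
        "right": {"food_right": 1.0},
    },
    "trap": {
        "wait": {"corridor": 0.2, "trap": 0.8},     # freed with probability 0.2 per step
    },
}

# Each row of transition probabilities must sum to 1.
for s, actions in P.items():
    for a, successors in actions.items():
        assert abs(sum(successors.values()) - 1.0) < 1e-9
```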

Page 5: 580.691  Learning Theory Reza Shadmehr & Jörn Diedrichsen

Reinforcement Learning: reward

In a discrete episode, the Return is defined as the sum of all the rewards we get from now until the final time T:

$$R_t = r_{t+1} + r_{t+2} + \dots + r_T$$

If you are in a state s at time t and take action a that brings you to another state at time t+1, then you receive the reward r_{t+1}.

We can also define the expected reward.

$$R_{ss'}^{a} = E\left(r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'\right)$$

[Figure: the same maze with the transition probabilities as before and rewards r=1, r=6, r=6, and r=8 attached to the transitions]

The goal of reinforcement learning is to find the policy π that maximizes the expected Return from each state. This is the optimal policy.

Page 6: 580.691  Learning Theory Reza Shadmehr & Jörn Diedrichsen

Now, life does not come in discrete episodes, but rather as a continuous stream of behavior. If we want to define Return in a continuous case like this (where there is no end T), we need to introduce temporal discounting. That is, reward we get right now is worth more than reward we will get tomorrow. The value of a reward may decrease exponentially with its delay:

$$R_t = r_{t+1} + \gamma\, r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k\, r_{t+k+1}$$
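A minimal sketch of this sum in code, with a made-up reward sequence and γ = 0.9 (my own illustration, not from the slides):

```python
# Discounted return R_t = sum_k gamma^k * r_{t+k+1} for a made-up reward sequence.
def discounted_return(rewards, gamma=0.9):
    return sum(gamma**k * r for k, r in enumerate(rewards))

print(discounted_return([1, 0, 6, 8]))  # 1 + 0.9*0 + 0.81*6 + 0.729*8 = 11.692
```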

Temporal discounting can be demonstrated in humans and animals: given the choice between 2 food pellets now and 4 food pellets in 2 hrs, which one does the rat prefer? Would you rather have $10 now or $12 in a month?

Reinforcement Learning: episodic and continuous environments

With temporal discounting we can write every episodic environment as a continuous one, by introducing nodes for the terminal states that have a transition probability of 1 onto themselves.

Page 7: 580.691  Learning Theory Reza Shadmehr & Jörn Diedrichsen

Value function and Bellman equations

Most reinforcement learning algorithms are based on estimating the Value function. The Value function is the expected Return of a state under a certain policy:

$$\begin{aligned}
V^{\pi}(s) &= E_{\pi}\left(R_t \mid s_t = s\right) = E_{\pi}\left(\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s\right) \\
&= E_{\pi}\left(r_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \,\Big|\, s_t = s\right) \\
&= \sum_{a} \pi(s, a) \sum_{s'} P_{ss'}^{a} \left[ R_{ss'}^{a} + \gamma\, E_{\pi}\left(\sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \,\Big|\, s_{t+1} = s'\right) \right] \\
&= \sum_{a} \pi(s, a) \sum_{s'} P_{ss'}^{a} \left[ R_{ss'}^{a} + \gamma\, V^{\pi}(s') \right]
\end{aligned}$$

This iterative definition of the Value function is known as the Bellman equation. We can write down a Bellman equation for each state; the Value function is then the unique solution to this system of equations. A Value function also has an action-value function attached, defined as the expected Return when performing action a from state s. Correspondingly, we define Q, the action-value function:

$$Q^{\pi}(s, a) = E_{\pi}\left(R_t \mid s_t = s, a_t = a\right) = E_{\pi}\left(\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s, a_t = a\right) = \sum_{s'} P_{ss'}^{a} \left[ R_{ss'}^{a} + \gamma\, V^{\pi}(s') \right]$$
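Because the Bellman equations are linear in V, a small Markov decision process can be evaluated exactly with linear algebra. The following sketch is my own illustration; the policy-averaged transition matrix P_pi and reward vector r_pi are made-up numbers, not taken from the maze example:

```python
# Given P_pi[s, s'] = sum_a pi(s,a) * P^a_{ss'} and r_pi[s] = expected one-step
# reward under pi, the value function satisfies V = r_pi + gamma * P_pi @ V.
import numpy as np

gamma = 0.9
P_pi = np.array([[0.0, 0.9, 0.1],     # made-up policy-averaged transition matrix
                 [0.0, 0.0, 1.0],
                 [0.2, 0.0, 0.8]])
r_pi = np.array([0.5, 6.0, 0.0])      # made-up expected one-step rewards

V = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)
print(V)                               # exact value of each state under pi
```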

Page 8: 580.691  Learning Theory Reza Shadmehr & Jörn Diedrichsen

Evaluating policies

The first question a learner has to answer is how good the current policy is. That is, we need a method that evaluates the value function of a policy:

Dynamic Programming

$$\begin{aligned}
&\text{initialize } V_0(s) = 0\\
&\text{repeat}\\
&\qquad V_{i+1}(s) = \sum_{a} \pi(s, a) \sum_{s'} P_{ss'}^{a} \left[ R_{ss'}^{a} + \gamma\, V_i(s') \right] \quad \text{for all } s\\
&\text{until } \max_{s} \left| V_{i+1}(s) - V_i(s) \right| < \theta
\end{aligned}$$

Dynamic programming works nicely to find a solution using a simple iterative scheme. For large state spaces it can be beneficial to update only a subset of states in each iteration and then use these to update other states.

BUT: in general dynamic programming requires that we know the transition probabilities P and the expected reward R.
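A sketch of this iterative scheme in code, assuming states and actions are indexed from 0, P[a] and R[a] are S×S arrays of transition probabilities and expected rewards, and pi is an S×A policy matrix (these conventions are mine, not the slides'):

```python
import numpy as np

def policy_evaluation(pi, P, R, gamma=0.9, theta=1e-6):
    n_states, n_actions = pi.shape
    V = np.zeros(n_states)
    while True:
        V_new = np.zeros(n_states)
        for s in range(n_states):
            for a in range(n_actions):
                # expected one-step backup, summed over successor states s'
                V_new[s] += pi[s, a] * np.sum(P[a][s] * (R[a][s] + gamma * V))
        if np.max(np.abs(V_new - V)) < theta:
            return V_new
        V = V_new
```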

Page 9: 580.691  Learning Theory Reza Shadmehr & Jörn Diedrichsen

Evaluating policies: Experience Monte Carlo

Well, the learner often does not know the environment. So how can we learn the value-function?

$$\begin{aligned}
&\text{follow policy } \pi \text{ for } N \text{ steps; keep track of the rewards } r_t \text{ and of the visits to each state } s\\
&\text{for each visit } i \text{ to } s \text{ (at time } t_i\text{), calculate the return that followed it:}\\
&\qquad R_i = \sum_{t = t_i}^{N} \gamma^{\,t - t_i}\, r_t\\
&\hat{V}^{\pi}(s) = \frac{1}{k} \sum_{i=1}^{k} R_i \quad \text{(average over the first } k \text{ visits to } s\text{)}
\end{aligned}$$

Experience sampling using Monte Carlo can be quite time-intensive, but does not require any knowledge of the environment.
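A sketch of first-visit Monte Carlo evaluation, assuming a simplified environment interface env.reset() / env.step(a) and a sampling function policy(s); these names are placeholders, not from the slides:

```python
from collections import defaultdict

def mc_evaluate(env, policy, episodes=1000, gamma=0.9):
    returns = defaultdict(list)
    for _ in range(episodes):
        s, done, trajectory = env.reset(), False, []
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)   # assumed to return (state, reward, done)
            trajectory.append((s, r))
            s = s_next
        # discounted return that followed each time step, computed backwards
        G, G_at = 0.0, [0.0] * len(trajectory)
        for t in reversed(range(len(trajectory))):
            G = trajectory[t][1] + gamma * G
            G_at[t] = G
        # first-visit estimate: record the return after the first visit to each state
        seen = set()
        for t, (s, _) in enumerate(trajectory):
            if s not in seen:
                seen.add(s)
                returns[s].append(G_at[t])
    return {s: sum(Rs) / len(Rs) for s, Rs in returns.items()}
```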

Page 10: 580.691  Learning Theory Reza Shadmehr & Jörn Diedrichsen

Optimal Value function

Now that we have a value function, we can define the optimal value function. This is the value function under the best policy, so that

$$V^{*}(s) \ge V^{\pi}(s) \quad \text{for all } \pi \text{ and all } s.$$

Written out with the Bellman equation, the optimal value function is

$$V^{*}(s) = \max_{a \in A(s)} Q^{\pi^{*}}(s, a) = \max_{a} E\left(r_{t+1} + \gamma V^{*}(s_{t+1}) \,\middle|\, s_t = s, a_t = a\right) = \max_{a} \sum_{s'} P_{ss'}^{a} \left[ R_{ss'}^{a} + \gamma\, V^{*}(s') \right]$$

The optimal action-value function would be:

$$Q^{*}(s, a) = E\left(r_{t+1} + \gamma \max_{a'} Q^{*}(s_{t+1}, a') \,\middle|\, s_t = s, a_t = a\right) = \sum_{s'} P_{ss'}^{a} \left[ R_{ss'}^{a} + \gamma \max_{a'} Q^{*}(s', a') \right]$$

How do we find the optimal value function if we only know the value function of the current policy?

Page 11: 580.691  Learning Theory Reza Shadmehr & Jörn Diedrichsen

Optimizing the policy

How do we find the optimal value function if we only know the value function of the current policy? The key is to realize that if we change the policy at one state s from π to π',

$$\pi'(s, a) \ne \pi(s, a),$$

such that $Q^{\pi}(s, \pi'(s)) \ge V^{\pi}(s)$, and otherwise follow π, then:

$$V^{\pi'}(s) \ge V^{\pi}(s) \quad \text{for all } s.$$

This is known as the policy improvement theorem. For example, we can improve the policy greedily, by choosing for every state s the action a that maximizes the action value:

$$\pi'(s) = \arg\max_{a} Q^{\pi}(s, a)$$

We can now iterate policy evaluation and policy improvement until the policy does not change anymore. By the definition of the optimal value function,

$$V^{*}(s) = \max_{a \in A(s)} Q^{\pi^{*}}(s, a),$$

a policy that no longer changes under greedy improvement is optimal. By iterating on policy evaluation and policy improvement we therefore find the optimal policy and value function. This is known as generalized policy iteration.

Page 12: 580.691  Learning Theory Reza Shadmehr & Jörn Diedrichsen

Generalized policy iterations

We can alternate policy evaluations (E) and policy improvements (I), until convergence:

[Figure: diagram of alternating evaluation (E) and improvement (I) steps acting on the policy π and the value function V(s)]
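A sketch of generalized policy iteration under the same conventions as the policy_evaluation sketch above (and reusing that function); this is my own illustration, not the authors' code:

```python
import numpy as np

def policy_iteration(P, R, n_states, n_actions, gamma=0.9):
    pi = np.full((n_states, n_actions), 1.0 / n_actions)    # start from a uniform policy
    while True:
        V = policy_evaluation(pi, P, R, gamma)               # E: evaluate current policy
        # I: improve greedily with respect to the action values Q(s, a)
        Q = np.array([[np.sum(P[a][s] * (R[a][s] + gamma * V))
                       for a in range(n_actions)]
                      for s in range(n_states)])
        pi_new = np.zeros_like(pi)
        pi_new[np.arange(n_states), Q.argmax(axis=1)] = 1.0
        if np.array_equal(pi_new, pi):
            return pi, V                                     # policy stopped changing
        pi = pi_new
```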

Page 13: 580.691  Learning Theory Reza Shadmehr & Jörn Diedrichsen

Exploration vs. Exploitation

When the organism does not know the environment, but has to rely on sampling, a greedy policy can get in the way of estimating the value function of the policy. That is, Exploitation can get in the way of Exploration.

This will be your homework 1.

Solution 1: instead of being maximally greedy, be ε-greedy. That is, go for the maximum in 1−ε of the cases and choose one of the other options in ε of the cases.

Solution 2: Do not always go for the maximum, but choose each action with a probability given by a softmax function:

$$p(a_t = a \mid s_t = s) = \frac{e^{\,Q(a, s)/\tau}}{\sum_{b} e^{\,Q(b, s)/\tau}}$$

Sometimes this is called the Gibbs/Boltzmann distribution. The parameter τ determines how "soft" the selection is. As τ → 0, the softmax function approaches the maximally greedy selection.

τ is often called the temperature of the distribution. As the temperature decreases, the distribution "crystallizes" around one point; when the temperature rises, the distribution becomes more and more diffuse.
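Both exploration schemes are easy to state in code. This is a minimal sketch, assuming q holds the action values Q(s, a) of the current state:

```python
import numpy as np

def epsilon_greedy(q, epsilon=0.1, rng=np.random.default_rng()):
    # with probability 1 - epsilon pick the best action, otherwise a random one
    if rng.random() < epsilon:
        return int(rng.integers(len(q)))
    return int(np.argmax(q))

def softmax_action(q, tau=1.0, rng=np.random.default_rng()):
    # Gibbs/Boltzmann selection; small tau -> nearly greedy, large tau -> nearly uniform
    prefs = np.exp((q - np.max(q)) / tau)   # subtract the max for numerical stability
    p = prefs / prefs.sum()
    return int(rng.choice(len(q), p=p))
```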

Page 14: 580.691  Learning Theory Reza Shadmehr & Jörn Diedrichsen

For computational purposes, we can write the policy and the transition probabilities as matrices. We borrow a formalism that is widely used for discrete stochastic processes. The state s becomes a vector of indicator variables:

$$\mathbf{s}_t = \begin{pmatrix} I(s_t = 1) \\ \vdots \\ I(s_t = S) \end{pmatrix}$$

The policy becomes an $A \times S$ matrix $\boldsymbol{\pi}$ with entries $\pi_{as} = p(a_t = a \mid s_t = s)$, and the transitions become an $S \times A$ matrix $\mathbf{A}$ with entries $A_{s'a} = p(s_{t+1} = s' \mid a_t = a)$. The expected state then evolves as

$$\mathbf{s}_{t+1} = \mathbf{A}\,\boldsymbol{\pi}\,\mathbf{s}_t$$

To do this we have to be careful how we define our actions. The probability of a transition to s' must depend ONLY on the action taken, not on the last state. That is, when we have 5 states and can go left or right from each of them, we need to define 10 actions (go left from state 1, go right from state 1, etc.).
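A toy numerical illustration of this formalism (my own numbers, not from the slides), with 2 states and 4 state-specific actions:

```python
import numpy as np

# policy: A x S, column s gives p(a | s)
Pi = np.array([[0.5, 0.0],    # left-from-state-1
               [0.5, 0.0],    # right-from-state-1
               [0.0, 0.5],    # left-from-state-2
               [0.0, 0.5]])   # right-from-state-2

# transitions: S x A, column a gives p(s' | a); depends only on the action taken
A = np.array([[1.0, 0.0, 0.0, 1.0],
              [0.0, 1.0, 1.0, 0.0]])

s_t = np.array([1.0, 0.0])    # indicator vector: currently in state 1
s_next = A @ Pi @ s_t         # expected state distribution at t+1
print(s_next)                 # [0.5, 0.5]
```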