Bayesian Reinforcement Learning, Machine Learning RCC, 16th June 2011.


Bayesian Reinforcement Learning

Machine Learning RCC

16th June 2011


Outline

• Introduction to Reinforcement Learning

• Overview of the field

• Model-based BRL

• Model-free RL


References

• ICML-07 Tutorial – P. Poupart, M. Ghavamzadeh, Y. Engel

• Reinforcement Learning: An Introduction – Richard S. Sutton and Andrew G. Barto


Machine Learning

Unsupervised Learning

Reinforcement Learning

Supervised Learning


Definitions

• State

• Action

• Reward

• Policy

• Reward function


Markov Decision Process

[Diagram: an MDP unrolled in time. In state x0 the policy selects action a0; the transition probability determines the next state x1; the reward function generates reward r0. The process then repeats with a1, r1, and so on.]


Value Function
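
In standard notation (matching the x, a, r variables of the MDP slide), the state-value function of a policy π and the Bellman equation it satisfies are:

```latex
% Discounted value of following policy \pi from state x,
% and the Bellman (fixed-point) equation it satisfies
V^{\pi}(x) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t} r_t \;\Big|\; x_0 = x,\; \pi\Big]
           = \mathbb{E}_{a \sim \pi(\cdot \mid x),\; x' \sim P(\cdot \mid x, a)}
             \big[\, r(x, a) + \gamma\, V^{\pi}(x') \,\big]
```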


Optimal Policy

• Assume one optimal action per state

• The optimal value function is unknown a priori

• Given a known model, it can be computed by Value Iteration
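
As a minimal sketch (the array layout and names are my own, not from the slides), value iteration repeatedly applies the Bellman optimality backup until the value function stops changing:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """P[a] is the |S|x|S| transition matrix of action a; R[s, a] is the reward."""
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        # Bellman optimality backup: Q(s,a) = R(s,a) + gamma * E[V(s') | s, a]
        Q = R + gamma * np.stack([P[a] @ V for a in range(n_actions)], axis=1)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)   # optimal values and greedy policy
        V = V_new

# Tiny two-state example: action 1 switches state and pays reward 1 when
# leaving state 0; action 0 stays put for nothing.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],    # action 0: stay
              [[0.0, 1.0], [1.0, 0.0]]])   # action 1: switch state
R = np.array([[0.0, 1.0], [0.0, 0.0]])
V, policy = value_iteration(P, R)          # greedy policy: action 1 everywhere
```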


Reinforcement Learning

• RL Problem: Solve an MDP when the reward/transition models are unknown

• Basic Idea: Use samples obtained from the agent's interaction with the environment


Model-Based vs Model-Free RL

• Model-Based: Learn a model of the reward/transition dynamics, then derive the value function/policy from it

• Model-Free: Learn the value function/policy directly


RL Solutions

• Value Function Algorithms
– Define a form for the value function
– Sample a state-action-reward sequence
– Update the value function
– Extract the optimal policy

• SARSA, Q-learning

Page 13: Bayesian Reinforcement Learning Machine Learning RCC 16 th June 2011.

RL Solutions

• Actor-Critic
– Define a policy structure (the actor)
– Define a value function (the critic)
– Sample state-action-reward sequences
– Update both actor and critic
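
The actor-critic loop can be sketched on a toy two-state problem (the dynamics, step sizes, and tabular representation here are illustrative assumptions, not from the talk): the critic learns V by TD, and its TD error tells the softmax actor which actions to reinforce:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 2, 2
H = np.zeros((n_states, n_actions))   # actor: action preferences
V = np.zeros(n_states)                # critic: state-value estimates
gamma, alpha_v, alpha_h = 0.9, 0.1, 0.1

def softmax(h):
    e = np.exp(h - h.max())
    return e / e.sum()

def env_step(state, action):
    # Toy dynamics (assumed): action 1 always pays 1, action 0 pays nothing
    reward = 1.0 if action == 1 else 0.0
    return (state + 1) % n_states, reward

s = 0
for _ in range(5000):
    pi = softmax(H[s])
    a = rng.choice(n_actions, p=pi)
    s2, r = env_step(s, a)
    delta = r + gamma * V[s2] - V[s]                        # TD error
    V[s] += alpha_v * delta                                 # critic update
    H[s] += alpha_h * delta * (np.eye(n_actions)[a] - pi)   # actor update
    s = s2
# After training, the actor strongly prefers action 1 in both states.
```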

Page 14: Bayesian Reinforcement Learning Machine Learning RCC 16 th June 2011.

RL Solutions

• Policy Search Algorithms
– Define a form for the policy
– Sample a state-action-reward sequence
– Update the policy

• Example: PEGASUS (Policy Evaluation-of-Goodness And Search Using Scenarios)

Page 15: Bayesian Reinforcement Learning Machine Learning RCC 16 th June 2011.

Online vs Offline

• Offline
– Use a simulator
– Policy fixed for each 'episode'
– Updates made at the end of the episode

• Online
– Directly interact with the environment
– Learning happens step by step

Page 16: Bayesian Reinforcement Learning Machine Learning RCC 16 th June 2011.

Model-Free Solutions

1. Prediction: Estimate V(x) or Q(x,a)

2. Control: Extract a policy
• On-Policy
• Off-Policy

Page 17: Bayesian Reinforcement Learning Machine Learning RCC 16 th June 2011.

Monte-Carlo Predictions

The driving example: predict the total remaining (negative) reward at each stage of a journey into Cambridge.

State:               Leave car park | Get out of city | Motorway | Enter Cambridge
Reward received:            -13     |       -15       |   -61    |      -11
Predicted value:            -90     |       -83       |   -55    |      -11
Updated value (MC):        -100     |       -87       |   -72    |      -11

The Monte-Carlo update moves each state's prediction to the actual return observed from that state onwards, e.g. -13 - 15 - 61 - 11 = -100 for 'Leave car park'.
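
A Monte-Carlo version of this table can be sketched as follows (the episode generator is a stand-in that simply replays the journey shown on the slide):

```python
# States: 0 = Leave car park, 1 = Get out of city, 2 = Motorway, 3 = Enter Cambridge
REWARDS = [-13.0, -15.0, -61.0, -11.0]   # reward received on leaving each state

def run_episode():
    """Stand-in simulator: every episode replays the slide's journey."""
    return list(range(4)), list(REWARDS)

returns = {s: [] for s in range(4)}
for _ in range(100):
    states, rewards = run_episode()
    G = 0.0
    for s, r in zip(reversed(states), reversed(rewards)):
        G += r                        # undiscounted return from state s onwards
        returns[s].append(G)

# MC estimate: average the observed returns from each state
V = {s: sum(g) / len(g) for s, g in returns.items()}
# V matches the slide's MC-updated row: {0: -100, 1: -87, 2: -72, 3: -11}
```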


Temporal Difference Predictions

The same driving example, updated by temporal differences instead of complete returns.

State:               Leave car park | Get out of city | Motorway | Enter Cambridge
Reward received:            -13     |       -15       |   -61    |      -11
Predicted value:            -90     |       -83       |   -55    |      -11
Updated value (TD):         -96     |       -70       |   -72    |      -11

The TD update moves each prediction towards the immediate reward plus the current prediction for the next state, e.g. -13 + (-83) = -96 for 'Leave car park'.
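
One TD(0) sweep over the episode reproduces the slide's updated row; each prediction is pulled towards the immediate reward plus the current estimate of the successor state (step size α = 1 here purely for illustration):

```python
rewards = [-13.0, -15.0, -61.0, -11.0]   # rewards from the driving example
V = [-90.0, -83.0, -55.0, -11.0]         # initial predictions from the slide
alpha = 1.0
for s in range(3):
    # TD(0): bootstrap from the *current* estimate of the next state
    V[s] += alpha * (rewards[s] + V[s + 1] - V[s])
# V is now [-96.0, -70.0, -72.0, -11.0], the slide's TD-updated row
```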


Advantages of TD

• No model of the rewards/transitions is needed

• Online and fully incremental

• Proven to converge under standard conditions on the step-size

• "Usually" faster than MC methods


From TD to TD(λ)

[Figure: backup diagrams ranging from one-step TD to the full Monte-Carlo return; each backs up rewards through successive states to the terminal state, and TD(λ) averages over these n-step returns.]
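
TD(λ) interpolates between these extremes with eligibility traces; a tabular sketch on the same driving episode (the α and λ values are my own illustrative choices):

```python
rewards = [-13.0, -15.0, -61.0, -11.0]
V = [-90.0, -83.0, -55.0, -11.0, 0.0]    # index 4: terminal state, value 0
e = [0.0] * 5                            # eligibility trace per state
alpha, gamma, lam = 0.5, 1.0, 0.8
for s in range(4):
    delta = rewards[s] + gamma * V[s + 1] - V[s]   # one-step TD error
    e[s] += 1.0                                    # accumulating trace on visit
    for x in range(5):
        V[x] += alpha * delta * e[x]   # every recently visited state shares credit
        e[x] *= gamma * lam            # traces decay geometrically
# With lam = 0 this reduces to TD(0); as lam -> 1 it approaches Monte-Carlo.
```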



SARSA & Q-learning

TD-Learning splits into two families:

• SARSA (On-Policy): estimates the value function of the current policy

• Q-Learning (Off-Policy): estimates the value function of the optimal policy
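
The only difference between the two updates is the bootstrap target, which the following sketch makes explicit (tabular Q, with assumed step size and discount):

```python
import numpy as np

alpha, gamma = 0.5, 0.9

def sarsa_update(Q, s, a, r, s2, a2):
    # On-policy: bootstrap from the action a2 the current policy actually took
    Q[s, a] += alpha * (r + gamma * Q[s2, a2] - Q[s, a])

def q_learning_update(Q, s, a, r, s2):
    # Off-policy: bootstrap from the greedy action, whatever was executed
    Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])

# Same transition, different targets when the behaviour action is not greedy:
Q = np.zeros((2, 2)); Q[1] = [0.0, 2.0]
sarsa_update(Q, 0, 0, 1.0, 1, 0)         # target 1 + 0.9 * Q[1,0] = 1.0
Qb = np.zeros((2, 2)); Qb[1] = [0.0, 2.0]
q_learning_update(Qb, 0, 0, 1.0, 1)      # target 1 + 0.9 * max Q[1] = 2.8
```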


GP Temporal Difference

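
As a sketch of the GPTD model of Engel, Mannor and Meir, which these slides appear to follow (notation hedged, and possibly differing in detail from the originals): a Gaussian-process prior is placed on the value function, and each observed reward is treated as a noisy measurement of the difference between successive values.

```latex
% GPTD generative model: GP prior over values, rewards as noisy
% temporal differences of successive values
\begin{align*}
  V(\cdot) &\sim \mathcal{GP}\!\big(0,\, k(x, x')\big), \\
  r_t &= V(x_t) - \gamma V(x_{t+1}) + \epsilon_t,
        \qquad \epsilon_t \sim \mathcal{N}(0, \sigma^2).
\end{align*}
% Stacking the T observed rewards as \mathbf{r} = H \mathbf{v} + noise,
% where H is the T x (T+1) band matrix with H_{t,t} = 1, H_{t,t+1} = -\gamma,
% standard GP regression gives the posterior mean value at a new state x:
\[
  \hat{V}(x) = \mathbf{k}(x)^{\top} H^{\top}
               \big( H K H^{\top} + \sigma^{2} I \big)^{-1} \mathbf{r}.
\]
```

Because the observation model is linear in the (jointly Gaussian) values, the posterior over V remains a Gaussian process and can be updated online as transitions arrive.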