www.igi.tu-graz.ac.at/ril-toolbox
The Reinforcement Learning Toolbox – Reinforcement Learning in Optimal Control Tasks
Gerhard Neumann
Master Thesis, 2005
Institut für Grundlagen der Informationsverarbeitung (IGI)
Master Thesis:
Reinforcement Learning Toolbox
  General software tool for Reinforcement Learning
Benchmark tests of Reinforcement Learning algorithms on three Optimal Control Problems
  Pendulum Swing Up
  Cart-Pole Swing Up
  Acro-Bot Swing Up
RL Toolbox: Features
Software: C++ Class System
Open Source / Non Commercial
Homepage: www.igi.tu-graz.ac.at/ril-toolbox
Class Reference, Manual
Runs under Linux and Windows
> 40,000 lines of code, > 250 classes
RL Toolbox: Features
Learning in discrete or continuous State Space
Learning in discrete or continuous Action Space
Different kinds of Learning Algorithms
  TD-lambda learning
  Actor critic learning
  Dynamic Programming, Model based learning, planning methods
  Continuous time RL
  Policy search algorithm
  Residual / Residual gradient Algorithm
Use Different Function Approximators
  RBF-Networks
  Linear Interpolation
  CMAC-Tile Coding
  Feed Forward Neural Networks
Learning from other (self coded) Controllers
Hierarchical Reinforcement Learning
Structure of the Learning System
The Agent and the environment
  The agent tells the environment which action to execute; the environment performs the internal state transitions
  The environment defines the learning problem
Structure of the learning system
Linkage to the learning algorithms
  All algorithms need the tuple <s_t, a_t, s_t+1> for learning
  The algorithms are implemented as listeners
  The algorithms adapt the agent controller to learn the optimal policy
  The agent informs all listeners about each step and about the start of a new episode
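The listener structure described above can be sketched roughly as follows. This is an illustration in Python, not the toolbox's C++ class system: the names `Agent`, `TransitionLogger`, `new_episode` and `next_step` are hypothetical, and the environment is reduced to a single `transition` callable.

```python
from typing import Callable, List, Tuple

Transition = Tuple[float, int, float]  # (s_t, a_t, s_t+1)


class TransitionLogger:
    """Hypothetical listener that simply records every step it is told about."""

    def __init__(self) -> None:
        self.episodes: List[List[Transition]] = []

    def new_episode(self) -> None:
        self.episodes.append([])

    def next_step(self, state: float, action: int, next_state: float) -> None:
        self.episodes[-1].append((state, action, next_state))


class Agent:
    """Hypothetical agent: picks actions and notifies all listeners."""

    def __init__(self, controller: Callable[[float], int],
                 transition: Callable[[float, int], float]) -> None:
        self.controller = controller    # maps state -> action
        self.transition = transition    # stands in for the environment
        self.listeners: list = []

    def add_listener(self, listener) -> None:
        self.listeners.append(listener)

    def run_episode(self, start_state: float, steps: int) -> float:
        for listener in self.listeners:
            listener.new_episode()
        state = start_state
        for _ in range(steps):
            action = self.controller(state)
            next_state = self.transition(state, action)
            # Every listener sees the same <s_t, a_t, s_t+1> tuple.
            for listener in self.listeners:
                listener.next_step(state, action, next_state)
            state = next_state
        return state
```

A learning algorithm would be registered the same way as the logger and would update the controller inside `next_step`.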
Reinforcement Learning:
Agent:
  State Space S
  Action Space A
  Transition Function
Agent has to optimize the future discounted reward
Many possibilities to solve the optimization task:
  Value based Approaches
  Genetic Search
  Other Optimization algorithms
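As a small illustration of the objective, the sketch below computes a discounted return for a finite reward sequence. It assumes the usual discrete-time definition (sum of gamma^t * r_t); the function name `discounted_return` is made up for this example, and the continuous-time formulation used elsewhere in the thesis is not covered.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum of gamma**t * rewards[t], accumulated backwards for stability."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total
```

For example, three rewards of 1 with gamma = 0.5 give 1 + 0.5 + 0.25 = 1.75.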
Short overview of the algorithms:
Value-based algorithms
  Calculate the goodness of each state
Policy-search algorithms
  Represent the policy directly, search in the policy parameter space
Hybrid Methods
  Actor-Critic Learning
Value Based Algorithms
Calculate either:
  Action value function (Q-Function):
    Directly used for action selection
  Value Function
    Needs the transition function for action selection
    E.g. do state prediction or use the derivative of the transition function
Representation of the V or Q Function is in most cases independent of the learning algorithm
  We can use any function approximator for the value function
  Independent V-Function and Q-Function interfaces
Different Algorithms: TD-Learning, Advantage Learning, Continuous Time RL
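The difference between the two cases can be sketched as follows. The helper names (`greedy_from_q`, `greedy_from_v`) and the toy integer state space are assumptions for illustration only; the point is that selecting an action from V requires a call to the transition function, while selecting from Q does not.

```python
from typing import Callable, Dict, Sequence, Tuple


def greedy_from_q(q: Dict[Tuple[int, int], float], state: int,
                  actions: Sequence[int]) -> int:
    """Q-Function: compare the stored action values directly."""
    return max(actions, key=lambda a: q[(state, a)])


def greedy_from_v(v: Callable[[int], float], state: int, actions: Sequence[int],
                  transition: Callable[[int, int], int],
                  reward: Callable[[int, int, int], float],
                  gamma: float) -> int:
    """V-Function: predict the next state with the model, then compare."""
    def one_step_value(action: int) -> float:
        next_state = transition(state, action)
        return reward(state, action, next_state) + gamma * v(next_state)
    return max(actions, key=one_step_value)
```

With a toy value function `v(s) = -abs(3 - s)` and transition `s + a`, both helpers pick the step towards state 3.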
Policy Search / Policy Gradient Algorithms
Directly climb the value function with a parameterized policy
Calculate the values of N given initial states per simulation (PEGASUS, [NG_2000])
Use standard optimization techniques like gradient ascent, simulated annealing or genetic algorithms
  Gradient ascent is used in the Toolbox
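A rough sketch of this idea: the policy is evaluated on a fixed set of initial states with pre-sampled noise, so the value estimate is a deterministic function of the policy parameter and can be climbed by plain gradient ascent. Everything concrete here is an assumption made for illustration (the one-dimensional system, the linear policy u = -theta * x, the reward -x^2, the finite-difference gradient); it is not the toolbox's implementation.

```python
import random
from typing import List, Sequence, Tuple

DT = 0.1        # assumed integration step
HORIZON = 20    # assumed episode length

Scenario = Tuple[float, List[float]]  # (initial state, fixed noise sequence)


def make_scenarios(initial_states: Sequence[float], seed: int = 0,
                   noise_std: float = 0.1) -> List[Scenario]:
    """Sample the noise once so that every later evaluation reuses it."""
    rng = random.Random(seed)
    return [(x0, [rng.gauss(0.0, noise_std) for _ in range(HORIZON)])
            for x0 in initial_states]


def rollout_return(theta: float, x0: float, noise: Sequence[float]) -> float:
    x, total = x0, 0.0
    for w in noise:
        total += -x * x          # reward: stay close to the origin
        u = -theta * x           # parameterized policy
        x = x + DT * (u + w)     # toy dynamics with the fixed noise
    return total


def estimate_value(theta: float, scenarios: Sequence[Scenario]) -> float:
    """Average return over the N given initial states."""
    return sum(rollout_return(theta, x0, n) for x0, n in scenarios) / len(scenarios)


def gradient_ascent(theta: float, scenarios: Sequence[Scenario], lr: float = 0.01,
                    iterations: int = 100, eps: float = 1e-4) -> float:
    for _ in range(iterations):
        # Central finite difference; valid because the estimate is deterministic.
        grad = (estimate_value(theta + eps, scenarios)
                - estimate_value(theta - eps, scenarios)) / (2.0 * eps)
        theta += lr * grad
    return theta
```

Because the scenarios are fixed, two evaluations of the same parameter give exactly the same value, which is what makes a standard optimizer applicable.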
Actor Critic Methods:
Learn the value function and an extra policy representation
Discrete actor critic
  Stochastic Policies
  Represent the action selection probabilities directly
  Similar to TD-Q Learning
Continuous actor critic
  Directly output the continuous control vector
  Policy can be represented by any function approximator
  Stochastic Real Values (SRV) algorithm ([Gulla_1992])
  Policy-Gradient Actor-Critic (PGAC) algorithm
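The discrete case can be illustrated with a minimal sketch. It assumes a single state with one-step episodes, a softmax over action preferences as the actor, and a scalar critic; the function names and learning rates are hypothetical, and the update shown is the textbook TD-error form rather than the toolbox's code.

```python
import math
import random
from typing import List, Sequence, Tuple


def softmax(preferences: Sequence[float]) -> List[float]:
    m = max(preferences)
    exps = [math.exp(p - m) for p in preferences]
    total = sum(exps)
    return [e / total for e in exps]


def train_actor_critic(rewards: Sequence[float], episodes: int = 500,
                       alpha: float = 0.1, beta: float = 0.1,
                       seed: int = 0) -> Tuple[List[float], float]:
    """rewards[a] is the (deterministic) reward for action a in the only state."""
    rng = random.Random(seed)
    preferences = [0.0] * len(rewards)   # actor: action preferences
    value = 0.0                          # critic: V of the single state
    for _ in range(episodes):
        probs = softmax(preferences)     # stochastic policy
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        # TD error; the next state is terminal, so its value is 0.
        td_error = rewards[action] + 0.0 - value
        value += alpha * td_error                # critic update
        preferences[action] += beta * td_error   # actor update
    return preferences, value
```

Actions that do better than the critic's current estimate get a higher preference, so the selection probability of the better action grows over time.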
Policy-Gradient Actor-Critic Algorithm
Learn the V-Function with a standard algorithm
Calculate the gradient of the value within a certain time window (k steps in the past, l steps in the future)
Gradient is then estimated by:
Again, an exact model is needed
Second Part: Benchmark Tests
Pendulum Swing Up: Easy Task
Cart-Pole Swing Up: Medium Task
Acro-Bot Swing Up: Hard Task
Benchmark Problems
Common problems in non-linear control
  Try to find an unstable fixed point
  2 or 4 continuous state variables
  1 continuous control variable
  Reward: height of the end point at each time step
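For the simplest of the three tasks, the setup can be sketched as below. The physical constants, the Euler integration and the angle convention (theta = 0 is upright) are assumptions chosen for illustration; the thesis may use different values. The reward is the height of the pendulum's end point, as stated above.

```python
import math
from typing import Tuple

# Illustrative constants, not taken from the thesis.
MASS = 1.0
LENGTH = 1.0
GRAVITY = 9.81
FRICTION = 0.05
DT = 0.02


def step(theta: float, omega: float, torque: float) -> Tuple[float, float]:
    """One Euler step; theta is measured from the upright position."""
    accel = (GRAVITY / LENGTH) * math.sin(theta) \
        + (torque - FRICTION * omega) / (MASS * LENGTH ** 2)
    omega = omega + DT * accel
    theta = theta + DT * omega
    return theta, omega


def reward(theta: float) -> float:
    """Height of the end point: +LENGTH when upright, -LENGTH when hanging."""
    return LENGTH * math.cos(theta)
```

The upright position is a fixed point of `step` with zero torque, but an unstable one: a small initial deviation grows from step to step.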
Benchmark Tests:
Test the algorithms on the benchmark problems with different parameter settings
  Compare the sensitivity to the parameter settings
Use different Function Approximators (FA)
  Linear FAs (e.g. RBF-Networks)
    Typical local representation
    Curse of dimensionality
  Non-linear FAs (e.g. Feed-Forward Neural Networks)
    No exponential dependency on the input state dimension
    Harder to learn (no local representation)
Compare the algorithms with respect to their features and requirements
  Is the exact transition function needed?
  Can the algorithm produce continuous actions?
  How much computation time is needed?
Use hierarchical RL, directed exploration strategies or planning methods to boost learning
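The trade-off for linear FAs can be made concrete with a small sketch of a grid of normalized Gaussian RBFs. The grid layout, the normalization and the function names are assumptions for illustration, not the toolbox's RBF implementation. The number of centers grows as (centers per dimension)^(state dimension), which is the curse of dimensionality mentioned above.

```python
import itertools
import math
from typing import List, Sequence, Tuple


def grid_centers(centers_per_dim: int, dim: int,
                 low: float = -1.0, high: float = 1.0) -> List[Tuple[float, ...]]:
    """Regular grid of RBF centers over [low, high]^dim."""
    spacing = (high - low) / (centers_per_dim - 1)
    axis = [low + i * spacing for i in range(centers_per_dim)]
    return list(itertools.product(axis, repeat=dim))


def rbf_features(state: Sequence[float], centers: Sequence[Sequence[float]],
                 width: float) -> List[float]:
    """Normalized Gaussian activations; only nearby centers respond strongly."""
    acts = [math.exp(-sum((s - c) ** 2 for s, c in zip(state, center))
                     / (2.0 * width ** 2))
            for center in centers]
    total = sum(acts)
    return [a / total for a in acts]


def linear_value(weights: Sequence[float], features: Sequence[float]) -> float:
    """The value is linear in the weights, which keeps learning simple."""
    return sum(w * f for w, f in zip(weights, features))
```

With 5 centers per dimension, a 2-dimensional state needs 25 basis functions and a 4-dimensional state already needs 625.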
Benchmark Tests: Cart-Pole Task, RBF-network
Planning boosts performance significantly
  Very time intensive (search depth 5: 120 times longer computation time)
PG-AC approach can compete with the standard V-Learning approach
  Cannot represent sharp decision boundaries
Benchmark: PG-AC vs V-Planning, Feed Forward NN
Learning with FF-NN using the standard planning approach is almost impossible (very unstable performance)
PG-AC with RBF critic (time window = 7 time steps) manages to learn the task in almost 1/10 of the episodes of the standard planning approach
[Figure panels: PG-AC, V-Planning]
V-Planning
Cart-Pole Task: Higher Search Depths could improve performance significantly, but at exponential cost of computation time
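The exponential cost can be seen in a minimal sketch of depth-limited planning with a learned V-Function. It assumes a deterministic model and a small discrete action set; `lookahead` and `plan_action` are hypothetical names, not toolbox functions. The second return value counts how many leaf states are evaluated.

```python
from typing import Callable, Sequence, Tuple


def lookahead(state: int, depth: int, actions: Sequence[int],
              transition: Callable[[int, int], int],
              reward: Callable[[int, int, int], float],
              value: Callable[[int], float],
              gamma: float) -> Tuple[float, int]:
    """Best achievable value within `depth` steps and the number of leaves visited."""
    if depth == 0:
        return value(state), 1           # fall back on the learned V-Function
    best, leaves = float("-inf"), 0
    for action in actions:
        next_state = transition(state, action)
        sub_value, sub_leaves = lookahead(next_state, depth - 1, actions,
                                          transition, reward, value, gamma)
        best = max(best, reward(state, action, next_state) + gamma * sub_value)
        leaves += sub_leaves
    return best, leaves


def plan_action(state: int, depth: int, actions: Sequence[int],
                transition: Callable[[int, int], int],
                reward: Callable[[int, int, int], float],
                value: Callable[[int], float],
                gamma: float) -> int:
    """Pick the first action of the best depth-limited plan (depth >= 1)."""
    def score(action: int) -> float:
        next_state = transition(state, action)
        sub_value, _ = lookahead(next_state, depth - 1, actions,
                                 transition, reward, value, gamma)
        return reward(state, action, next_state) + gamma * sub_value
    return max(actions, key=score)
```

With three actions, depth 1 evaluates 3 leaves and depth 4 already evaluates 81, so the work grows as (number of actions)^(search depth).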
Hierarchical RL
Cart-Pole Task: The Hierarchical Sub-Goal Approach (alpha = 0.6) outperforms the flat approach (alpha = 1.0)
Other general results
The Acro-Bot Task could not be learned with the flat architecture
  The hierarchical architecture manages to swing up, but could not stay on top
Nearly all algorithms managed to learn the first two tasks with linear function approximation (RBF networks)
Non-linear function approximators are very hard to learn
  Feed Forward NNs have very poor performance (no locality), but can be used for larger state spaces
  Very restrictive parameter settings
Approaches which use the transition function typically outperform the model-free approaches
The Policy Gradient algorithm (PEGASUS) only worked with the linear FAs; with non-linear FAs it could not recover from local maxima
Literature
[Sutt_1999] R. Sutton and A. Barto: Reinforcement Learning: An Introduction. MIT Press
[NG_2000] A. Ng and M. Jordan: PEGASUS: A policy search method for large MDPs and POMDPs
[Doya_1999] K. Doya: Reinforcement Learning in continuous time and space
[Baxter_1999] J. Baxter: Direct gradient-based reinforcement learning: 2. Gradient ascent algorithms and experiments
[Baird_1999] L. Baird: Reinforcement Learning Through Gradient Descent. PhD thesis
[Gulla_1992] V. Gullapalli: Reinforcement Learning and its application to control
[Coulom_2000] R. Coulom: Reinforcement Learning using Neural Networks. PhD thesis