On Linking Reinforcement Learning with Unsupervised Learning
Cornelius Weber, FIAS
presented at Honda HRI, Offenbach, 17th March 2009
for taking action, we need only the relevant features
[Figure: scene with features x, y, z]
unsupervised learning in cortex; reinforcement learning in basal ganglia (Doya, 1999)
[Figure: cortex provides the state space, basal ganglia the actor]
a 1-layer RL model of BG (go left? go right?) ... is too simple to handle complex input (cortex)
need another layer(s) to pre-process complex data
[Figure: lower layer for feature detection (state space), upper layer for action selection (actor)]
models’ background:
- gradient descent methods generalize RL to several layers: Sutton & Barto, RL book (1998); Tesauro (1992; 1995)
- reward-modulated Hebb: Triesch, Neur Comp 19, 885-909 (2007); Roelfsema & van Ooyen, Neur Comp 17, 2176-214 (2005); Franz & Triesch, ICDL (2007)
- reward-modulated activity leads to input selection: Nakahara, Neur Comp 14, 819-44 (2002)
- reward-modulated STDP: Izhikevich, Cereb Cortex 17, 2443-52 (2007); Florian, Neur Comp 19/6, 1468-502 (2007); Farries & Fairhall, J Neurophysiol 98, 3648-65 (2007); ...
- RL models learn partitioning of input space: e.g. McCallum, PhD Thesis, Rochester, NY, USA (1996)
scenario: bars controlled by actions ‘up’, ‘down’, ‘left’, ‘right’;
reward given if the horizontal bar is at a specific position
[Figure: agent loop of sensory input, reward, and action]
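The bars scenario can be sketched as a small environment. This is an illustrative reconstruction under assumptions (class and parameter names are invented): a horizontal bar moved by ‘up’/‘down’ is task-relevant, a vertical bar moved by ‘left’/‘right’ is a distractor, and reward arrives when the horizontal bar reaches a target row.

```python
import numpy as np

class BarsEnv:
    """Toy bars world (illustrative sketch, not the original code).

    The observation is an N x N binary image with one horizontal bar
    (task-relevant) and one vertical bar (a distractor).  Actions
    'up'/'down' move the horizontal bar; 'left'/'right' move the
    vertical bar.  Reward is given when the horizontal bar sits at
    an assumed target row.
    """

    ACTIONS = ('up', 'down', 'left', 'right')

    def __init__(self, size=12, target_row=0, rng=None):
        self.size = size
        self.target_row = target_row
        self.rng = rng or np.random.default_rng(0)
        self.reset()

    def reset(self):
        self.h_row = int(self.rng.integers(self.size))  # horizontal bar row
        self.v_col = int(self.rng.integers(self.size))  # vertical bar column
        return self.observe()

    def observe(self):
        img = np.zeros((self.size, self.size))
        img[self.h_row, :] = 1.0   # horizontal (relevant) bar
        img[:, self.v_col] = 1.0   # vertical (irrelevant) bar
        return img.ravel()         # flattened sensory input

    def step(self, action):
        if action == 'up':
            self.h_row = max(self.h_row - 1, 0)
        elif action == 'down':
            self.h_row = min(self.h_row + 1, self.size - 1)
        elif action == 'left':
            self.v_col = max(self.v_col - 1, 0)
        elif action == 'right':
            self.v_col = min(self.v_col + 1, self.size - 1)
        reward = 1.0 if self.h_row == self.target_row else 0.0
        return self.observe(), reward
```

Only the horizontal bar’s position ever affects the reward, which is what makes the vertical bar an irrelevant feature the model should learn to ignore.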
model that learns the relevant features:
- top layer: SARSA RL
- lower layer: winner-take-all feature learning
- both layers: modulate learning by δ
[Figure: input → feature weights → state space → RL weights → action]
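The two-layer scheme above can be sketched in a few lines. This is a hedged reconstruction, not the original implementation: a WTA lower layer feeds a linear SARSA layer, the TD error δ modulates the weight updates in both layers, and the weights are clipped to be non-negative (the constraint is assumed to apply to both layers; learning rates and ε are invented).

```python
import numpy as np

def train_step(x, a, r, x_next, W, Q, eps=0.1, gamma=0.9, lr=0.01, rng=None):
    """One δ-modulated SARSA update on a two-layer network (sketch).

    W: feature weights (n_features x n_inputs), WTA lower layer.
    Q: RL action weights (n_actions x n_features), top layer.
    Both layers learn from the same TD error δ.
    """
    rng = rng or np.random.default_rng()

    def wta(v):                      # winner-take-all feature code
        h = np.zeros_like(v)
        h[np.argmax(v)] = 1.0
        return h

    def policy(h):                   # ε-greedy action selection
        q = Q @ h
        return int(rng.integers(len(q))) if rng.random() < eps else int(np.argmax(q))

    h = wta(W @ x)
    h_next = wta(W @ x_next)
    a_next = policy(h_next)

    delta = r + gamma * (Q[a_next] @ h_next) - (Q[a] @ h)  # SARSA TD error

    Q[a] += lr * delta * h                 # δ-modulated RL weight update
    j = int(np.argmax(h))                  # winning feature unit
    W[j] += lr * delta * Q[a, j] * x       # δ-modulated feature learning
    np.maximum(W, 0.0, out=W)              # non-negativity constraint
    np.maximum(Q, 0.0, out=Q)              # (assumed for both layers)
    return a_next, delta
```

Because δ gates the lower layer as well, features only strengthen when they help predict the state-action value, which is how the model ends up learning just the task-relevant features.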
SARSA with WTA input layer
note: non-negativity constraint on weights
Energy function: estimation error of state-action value
identities used: [equations for the RL action weights and the feature weights lost in extraction]
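The slide’s equations did not survive extraction; a plausible reconstruction under standard SARSA notation (all symbols here are assumptions, not the slide’s own):

```latex
% Energy: squared estimation error of the state-action value
E = \tfrac{1}{2}\,\delta^2, \qquad
\delta = r + \gamma\, Q(s', a') - Q(s, a)

% Value via the WTA feature code h = \mathrm{WTA}(Wx):
Q(s, a) = \sum_j q_{aj}\, h_j(s)

% Gradient descent on E gives \delta-modulated updates in both layers:
\Delta q_{aj} \propto \delta\, h_j, \qquad
\Delta w_{jk} \propto \delta\, q_{aj}\, h_j\, x_k
```

This matches the text: both the RL action weights \(q\) and the feature weights \(w\) are modulated by the same TD error \(\delta\).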
learning the ‘short bars’ data
short bars in 12x12; average # of steps to goal: 11
[Figure: data samples, reward, action]
[Figure: learned RL action weights and feature weights; input, reward, 2 actions not shown; data samples]
learning the ‘long bars’ data — comparison:
- WTA, non-negative weights
- SoftMax, non-negative weights
- SoftMax, no weight constraints
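The three variants compared differ only in the lower layer’s activation rule and weight constraint; a minimal sketch of the two activation rules (the `beta` gain is an assumed parameter, not from the slides):

```python
import numpy as np

def wta(v):
    """Winner-take-all: a one-hot code for the most active feature unit."""
    h = np.zeros_like(v)
    h[np.argmax(v)] = 1.0
    return h

def softmax(v, beta=1.0):
    """SoftMax: a graded, normalized feature code."""
    e = np.exp(beta * (v - np.max(v)))   # shift by max for numerical stability
    return e / e.sum()
```

WTA commits to a single discrete feature per input, while SoftMax distributes activity across units; combined with or without the non-negativity constraint, this is what distinguishes the three conditions above.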
Discussion
- simple model: SARSA on winner-take-all network with δ-feedback
- learns only the features that are relevant for action strategy
- theory behind it: derived (approximately) from value function estimation
- non-negative coding aids feature extraction
- link between unsupervised and reinforcement learning
- demonstration with more realistic data needed
Sponsors:
- Bernstein Focus Neurotechnology, BMBF grant 01GQ0840
- EU project 231722 “IM-CLeVeR”, call FP7-ICT-2007-3
- Frankfurt Institute for Advanced Studies, FIAS
thank you ...