Abstract
This presentation questions the need for reinforcement learning and related paradigms from machine learning when trying to optimise the behaviour of an agent. We show that it is fairly simple to teach an agent complicated and adaptive behaviours under a free-energy principle. This principle suggests that agents adjust their internal states and sampling of the environment to minimise their free-energy. In this context, free-energy represents a bound on the probability of being in a particular state, given the nature of the agent, or more specifically the model of the environment an agent entails. We show that such agents learn causal structure in the environment and sample it in an adaptive and self-supervised fashion. The result is a policy that reproduces exactly the policies that are optimised by reinforcement learning and dynamic programming. Critically, at no point do we need to invoke the notion of reward, value or utility. We illustrate these points by solving a benchmark problem in dynamic programming — namely, the mountain-car problem — using just the free-energy principle. The ensuing proof of concept is important because the free-energy formulation also provides a principled account of perceptual inference in the brain and furnishes a unified framework for action and perception.
Action and active inference: A free-energy formulation
Perception, memory and attention (Bayesian brain)
Action and value learning (optimum control)
Optimum control (action and value learning): $a = \arg\max_a V(s(a))$
Bayesian brain (perception): $q = \arg\min_q D(q(\vartheta) \,\|\, p(\vartheta \mid \tilde{s}))$
[Figure: perception as inference on causes $\vartheta$ of sensory input via prediction error, with recognition density $q(\vartheta \mid \tilde{s})$; action as S–R and S–S learning, with conditioned stimulus (CS), reward (US) and action.]
The free-energy principle
$(a, \mu) = \arg\min_{a,\mu} F(\tilde{s}(a), \mu)$ — both action $a$ and internal states $\mu$ minimise free-energy.
Overview
- The free-energy principle and action
- Active inference and prediction error
- Orientation and stabilisation
- Intentional movements
- Cued movements
- Goal-directed movements
- Autonomous movements
- Forward and inverse models
Exchange with the environment
[Figure: an agent $m$ and its environment, separated by a Markov blanket — external states, internal states, sensation and action.]
Environment (generative process): $\tilde{s} = g(x, a) + w$, $\dot{x} = f(x, a) + z$
Agent: $a = \arg\min_a F(\tilde{s}, \mu)$ (action); $\mu = \arg\min_\mu F(\tilde{s}, \mu)$ (perception)
The free-energy principle
$F = -\langle \ln p(\tilde{s}, \vartheta \mid m)\rangle_q + \langle \ln q(\vartheta)\rangle_q = -\ln p(\tilde{s} \mid m) + D(q(\vartheta) \,\|\, p(\vartheta \mid \tilde{s}))$

Action to minimise a bound on surprise
$F = -\langle \ln p(\tilde{s}(a) \mid \vartheta, m)\rangle_q + D(q(\vartheta) \,\|\, p(\vartheta))$
$a = \arg\min_a F \;\Rightarrow\; a = \arg\max_a \langle \ln p(\tilde{s}(a) \mid \vartheta, m)\rangle_q$

Perception to optimise the bound
$F = -\ln p(\tilde{s} \mid m) + D(q(\vartheta) \,\|\, p(\vartheta \mid \tilde{s}))$
$q = \arg\min_q F \;\Rightarrow\; q(\vartheta) \to p(\vartheta \mid \tilde{s})$

The conditional density and separation of scales
$q(\vartheta) = q(u)\,q(\theta)\,q(\gamma)$ — perceptual inference (states, $u$), perceptual learning (parameters, $\theta$) and perceptual uncertainty (precisions, $\gamma$)
$\mu_u = \arg\min_u F$, $\mu_\theta = \arg\min_\theta F$, $\mu_\gamma = \arg\min_\gamma F$
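The bound above can be made concrete with a toy example. The following sketch (our own illustration, not from the slides) uses a one-dimensional Gaussian model — $v \sim N(0,1)$, $s \mid v \sim N(v,1)$ — to show that the free-energy of an arbitrary recognition density $q(v)$ upper-bounds surprise, with equality when $q$ equals the true posterior:

```python
import numpy as np

# Hedged sketch: F = -<ln p(s, v)>_q - H[q] = -ln p(s) + KL[q(v) || p(v|s)].
# Toy model (an assumption for illustration): v ~ N(0, 1), s | v ~ N(v, 1).

def free_energy(s, mu_q, var_q):
    """Free energy of q(v) = N(mu_q, var_q) given an observation s."""
    Ev2 = var_q + mu_q**2                       # E_q[v^2]
    # Energy: -E_q[ln p(s|v)] - E_q[ln p(v)]
    energy = 0.5 * ((s**2 - 2*s*mu_q + Ev2) + Ev2) + np.log(2*np.pi)
    entropy = 0.5 * np.log(2*np.pi*np.e*var_q)  # H[q]
    return energy - entropy

s = 1.0
# Marginal p(s) = N(0, 2), so surprise = -ln p(s); posterior p(v|s) = N(s/2, 1/2)
surprise = 0.5*np.log(2*np.pi*2) + s**2/4
F_opt = free_energy(s, 0.5, 0.5)   # q = true posterior: F equals surprise
F_bad = free_energy(s, 0.0, 1.0)   # any other q gives a strictly larger bound
```

Minimising $F$ over $q$ therefore tightens the bound until free-energy equals surprise, which is the sense in which perception "optimises the bound".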
Active inference and prediction error
Hierarchical models
Generative model (for levels $i = 1 \ldots n$, with $v^{(0)} = s$):
$v^{(i-1)} = g^{(i)}(x^{(i)}, v^{(i)}) + z^{(i,v)}$
$\dot{x}^{(i)} = f^{(i)}(x^{(i)}, v^{(i)}) + z^{(i,x)}$
Recognition dynamics — top-down messages (predictions) and bottom-up messages (prediction errors):
$\dot{\tilde{\mu}}_v^{(i)} = D\tilde{\mu}_v^{(i)} - \partial_{\tilde{\mu}_v}\tilde{\varepsilon}^{(i)T}\xi^{(i)} - \xi_v^{(i+1)}$
$\dot{\tilde{\mu}}_x^{(i)} = D\tilde{\mu}_x^{(i)} - \partial_{\tilde{\mu}_x}\tilde{\varepsilon}^{(i)T}\xi^{(i)}$
Prediction errors:
$\xi_v^{(i)} = \Pi_v^{(i)}(\tilde{\mu}_v^{(i-1)} - g^{(i)}(\tilde{\mu}^{(i)}))$
$\xi_x^{(i)} = \Pi_x^{(i)}(D\tilde{\mu}_x^{(i)} - f^{(i)}(\tilde{\mu}^{(i)}))$
[Figure: a hierarchical network with sensory inputs $s_1 \ldots s_4$, hidden states $x^{(1)}, x^{(2)}$ and cause $v^{(1)}$, exchanging predictions and prediction errors between levels.]
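The recognition dynamics can be illustrated for a single static level. In this sketch (a deliberately reduced, one-state version of the scheme above; the linear mapping $g(\mu) = \mu$ and the prior mean $\eta$ are toy assumptions), perception is a gradient descent on precision-weighted prediction error:

```python
import numpy as np

# mu_dot = pi_s * (s - g(mu)) - pi_v * (mu - eta), with g(mu) = mu.
# The conditional expectation settles where sensory and prior errors balance.

def infer(s, eta=0.0, pi_s=4.0, pi_v=1.0, dt=0.05, steps=400):
    mu = eta
    for _ in range(steps):
        eps_s = s - mu          # sensory prediction error
        eps_v = mu - eta        # prior prediction error
        mu += dt * (pi_s * eps_s - pi_v * eps_v)
    return mu

mu = infer(s=1.0)  # settles at the precision-weighted mean (pi_s*s + pi_v*eta)/(pi_s + pi_v) = 0.8
```

The fixed point is the Bayes-optimal (precision-weighted) compromise between sensory evidence and the prior — the static analogue of the hierarchical message passing above.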
Active inference: closing the loop (synergy)
Perception optimises predictions; action changes what is sampled: $\dot{a} = -\partial_a F = -(\partial_a \tilde{s})^T \xi_s$.
Action needs access to sensory-level prediction error:
$\xi_s = \Pi_s(\tilde{s}(a) - g(\tilde{\mu}))$
$\dot{a} = -(\partial_a \tilde{s})^T \xi_s$

From reflexes to action
[Figure: spinal reflex arc — predictions $g(\tilde{\mu})$ descend via the ventral horn and root, sensory input $\tilde{s}(a)$ enters via the dorsal root and horn, and action discharges prediction error at the sensory level.]
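The reflex equation above is easy to simulate. In this minimal sketch (our own illustration; the identity sensory mapping $s(a) = a$ and the target value are toy assumptions), action performs gradient descent on sensory prediction error until the prediction is fulfilled:

```python
# Action as gradient descent on sensory prediction error:
# a_dot = -(ds/da)^T * pi_s * (s(a) - g(mu)), here with s(a) = a so ds/da = 1.

def act(target=0.7, a0=0.0, pi_s=8.0, dt=0.05, steps=200):
    a = a0
    for _ in range(steps):
        s = a                    # sensory input generated by action
        eps = s - target         # sensory prediction error s(a) - g(mu)
        a -= dt * pi_s * eps     # descend the error by moving
    return a

a_final = act()  # the limb ends up where the prediction said it would be
```

Note that action never needs the value of a state — only the sensory prediction error — which is the point the slide is making about the juxtaposition of motor and sensory systems.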
Orientation and stabilisation
Active inference under flat priors (movement with percept)
[Figure: a visual stimulus delivered over sensory channels to a two-level model with hidden states $x^{(1)}$ and cause $v^{(1)}$; panels show sensory prediction and error, hidden states (location), the cause (perturbing force), and perturbation and action over time.]
Active inference under tight priors (no movement or percept)
[Figure: the same simulation with tight (precise) priors; panels show sensory prediction and error, hidden states (location), the cause (perturbing force), and perturbation and action — both the percept and the movement are suppressed.]
Retinal stabilisation or tracking induced by priors
[Figure: visual stimulus; displacement of the real and perceived stimulus, with action and the perceived and true perturbation, under flat versus tight priors — tight priors induce tracking that stabilises the retinal image.]
Intentional movements
Active inference under tight priors (movement and percept)
[Figure: proprioceptive input to the same two-level model; panels show sensory prediction and error, hidden states (location), the cause (prior), and perturbation and action. The movement is robust to perturbation and to changes in motor gain.]
Self-generated movements induced by priors
[Figure: displacement trajectories over time (real and perceived), with action and causes (action, perceived cause/prior, exogenous cause), for three simulations.]
Cued movements
From reflexes to action: a jointed arm
Sensory input: $s_{pro} = x + w_{pro}$ (proprioception), $s_{vis} = J(x) + w_{vis}$ (vision of the finger position $J(x)$).
Recognition dynamics:
$\dot{\tilde{\mu}}_v = D\tilde{\mu}_v - \partial_{\tilde{\mu}_v}\tilde{\varepsilon}^T\xi - \xi_v$
$\dot{\tilde{\mu}}_x = D\tilde{\mu}_x - \partial_{\tilde{\mu}_x}\tilde{\varepsilon}^T\xi$
Prediction errors:
$\xi_s = \Pi_s(\tilde{s} - g(\tilde{\mu}))$
$\xi_x = \Pi_x(D\tilde{\mu}_x - f(\tilde{\mu}))$
$\xi_v = \Pi_v(\tilde{\mu}_v - \eta)$
[Figure: a two-jointed arm with joint angles $x_1, x_2$, joint positions $J_1$ and $J_2$, shoulder at $(0, 0)$ and visual target $v = (v_1, v_2)$.]
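The visual mapping $J(x)$ for the arm above is ordinary two-link forward kinematics. A sketch (unit link lengths are an assumption — the slide only specifies the geometry qualitatively):

```python
import numpy as np

# Forward kinematics for a two-jointed arm: joint angles x = (x1, x2)
# map to the elbow position J1 and the finger position J2.

def forward_kinematics(x1, x2, l1=1.0, l2=1.0):
    j1 = np.array([l1*np.cos(x1), l1*np.sin(x1)])                  # elbow
    j2 = j1 + np.array([l2*np.cos(x1 + x2), l2*np.sin(x1 + x2)])   # finger
    return j1, j2

j1, j2 = forward_kinematics(np.pi/2, -np.pi/2)  # upper arm up, forearm to the right
```

Under active inference this mapping only ever appears in the generative (forward) direction — the inverse problem of finding angles for a target is solved implicitly by descending prediction error.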
Cued movements and sensorimotor integration
[Figure: reaching to a cued target $v = (v_1, v_2)$; panels show prediction and error, hidden states ($x_1, x_2$), causal states ($v_{1,2}, v_3$), perturbation and action ($a_1, a_2$), and the trajectory of the finger $J(x, t)$ in position space.]
Cued reaching with noisy proprioception
[Figure: reaching trajectories and conditional precisions ($\pi_{pro}$, $\pi_{vis}$) under noisy proprioception versus noisy vision — Bayes-optimal integration of sensory modalities weights each modality by its precision.]
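The Bayes-optimal integration shown above reduces, for Gaussian noise, to a precision-weighted average: the posterior precision is the sum of the sensory precisions, and the posterior mean weights each modality by its precision. A sketch (the numerical values are illustrative, not taken from the simulations):

```python
# Precision-weighted fusion of two sensory estimates of the same state.

def fuse(mu_pro, pi_pro, mu_vis, pi_vis):
    pi_post = pi_pro + pi_vis                           # precisions add
    mu_post = (pi_pro*mu_pro + pi_vis*mu_vis) / pi_post # precision-weighted mean
    return mu_post, pi_post

# Noisy proprioception (low precision) combined with sharp vision (high precision):
mu, pi = fuse(mu_pro=0.0, pi_pro=1.0, mu_vis=1.0, pi_vis=4.0)  # mu = 0.8, pi = 5.0
```

This is why, in the simulations, degrading one modality shifts the conditional estimate toward the other: the conditional precisions re-weight the fusion automatically.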
Goal-directed movements
The mountain car problem
Equations of motion:
$f(x, x') = \begin{pmatrix}\dot{x}\\\dot{x}'\end{pmatrix} = \begin{pmatrix}x'\\ v - G(x) - \tfrac{1}{8}x'\end{pmatrix}$
$G(x) = \partial_x H(x) = \begin{cases}2x + 1 & : x \le 0\\ (1 + 5x^2)^{-1/2} - 5x^2(1 + 5x^2)^{-3/2} & : x > 0\end{cases}$
[Figure: the height $H(x)$ of the landscape, the forces $G(x)$ acting on the car, the desired location, and the null-clines of the flow in position–velocity space.]
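The equations of motion above are straightforward to code. In this sketch the gradient of the height $G(x)$ follows the piecewise form on the slide; the $\tfrac{1}{8}x'$ friction term follows the reconstruction above and should be read as an assumption:

```python
# Mountain-car dynamics: dx/dt = x', dx'/dt = v - G(x) - x'/8,
# where G(x) is the gradient of the height of the landscape.

def height_gradient(x):
    if x <= 0:
        return 2*x + 1
    return (1 + 5*x**2)**-0.5 - 5*x**2 * (1 + 5*x**2)**-1.5

def flow(x, xdot, v):
    """State velocities (dx/dt, dx'/dt) given position, velocity and force v."""
    return xdot, v - height_gradient(x) - xdot/8.0

# At rest at the bottom of the well (x = -0.5) with no force, nothing moves:
dx, ddx = flow(x=-0.5, xdot=0.0, v=0.0)
```

The bottom of the well at $x = -0.5$ is a fixed point of the unforced flow, which is why an under-powered car must first move away from the goal to gather momentum — the feature that makes this a benchmark problem.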
[Figure: flow and density null-clines in position–velocity space for the uncontrolled, controlled and expected dynamics.]
A controlled (training) environment can be constructed by minimising the cross-entropy between the ensemble density and a desired density $Q(x \mid m)$:
$D(Q(x \mid m) \,\|\, p(x \mid m)) = \int Q(x \mid m)\ln\dfrac{Q(x \mid m)}{p(x \mid m)}\,dx$
where the ensemble density $p(x \mid m)$ is the principal eigensolution of the Fokker–Planck operator describing the flow $f$. Under active inference, action then simply minimises free-energy: $a = \arg\min_a F$.
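The divergence being minimised can be illustrated on a discretised state space. In this sketch (the Gaussian densities are toy stand-ins for the controlled and uncontrolled ensemble densities on the slide):

```python
import numpy as np

# D(Q || p) = sum_x Q(x) ln(Q(x)/p(x)) over a discretised state space.

def kl(q, p):
    q = q / q.sum()                 # normalise both densities
    p = p / p.sum()
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

x = np.linspace(-2, 2, 401)
Q = np.exp(-0.5 * ((x - 1.0) / 0.2)**2)   # desired density, peaked at the goal x = 1
p = np.exp(-0.5 * (x / 1.0)**2)           # broad uncontrolled ensemble density
d = kl(Q, p)   # positive; zero only when the densities coincide
```

Shaping the flow so that $d$ falls to zero is exactly what makes the controlled environment a teacher: its equilibrium density becomes the desired one.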
Learning in the controlled environment / Active inference in the uncontrolled environment
[Figure: prediction and error, hidden states, level-2 causes, and perturbation and action over time — first while learning in the controlled environment, then under active inference in the uncontrolled environment.]
Using just the free-energy principle and a simple gradient ascent scheme, we have solved a benchmark problem in optimal control theory in a handful of learning trials. At no point did we use reinforcement learning or dynamic programming.
Goal-directed behaviour and trajectories
[Figure: prediction and error, hidden states, behaviour and action over time, with position–velocity trajectories of the car.]

Action under perturbation
[Figure: behaviour and action when the car is perturbed (at around 8 s) — the learnt policy still reaches the goal.]
Simulating Parkinson's disease?
[Figure: hidden states over time for three simulations (labelled ×5.5, ×4 and ×6), showing progressively altered movement.]
Autonomous movements
Learning autonomous behaviour
[Figure: position–velocity trajectories and densities of the controlled flow, before and after learning, with the desired density $Q$.]
Autonomous behaviour under random perturbations
[Figure: prediction and error and hidden states over time, the learnt position–velocity trajectory, and the perturbation and action.]
Forward and inverse models
Forward-inverse formulation:
- Inverse model (control policy): desired and inferred states $\{x, v\} \to a$ (motor command)
- Forward model: $\{x, v, a\} \to s$, supported by an efference copy of the motor command
- Environment: $\{x, v, a\} \to s$; sensory prediction error $\varepsilon(s, x, v, a)$ updates the inferred states
Free-energy formulation:
- Forward (generative) model only: $\{x, v\} \to s$, with corollary discharge conveying predictions
- Sensory prediction error is resolved by action itself — no explicit inverse model is needed
- Environment: $\{x, v, a\} \to s$
Summary
• Free-energy can be minimised by action (through changes in the states generating sensory input) or by perception (through optimising the predictions of that input).
• The only way action can suppress free-energy is by reducing prediction error at the sensory level (speaking to a juxtaposition of motor and sensory systems).
• Action fulfils expectations. This can manifest as explaining away prediction error by resampling sensory input (e.g., visual tracking), or as intentional movement fulfilling expectations furnished by empirical priors.
• In an optimum-control setting, a training environment can be constructed by minimising the cross-entropy between the ensemble density and some desired density; the resulting behaviour can be learnt and reproduced under active inference.