An Introduction to Reinforcement Learning (RL)

Posted on 01-Dec-2014


Slides from Aditya Tarigoppula's talk at NYC Machine Learning on December 13th.

Transcript of An Introduction to Reinforcement Learning (RL)

An Introduction to Reinforcement Learning (RL) and RL Brain Machine Interface (RL-BMI)

Aditya Tarigoppula www.joefrancislab.com

SUNY Downstate Medical Center

Outline

RL Examples

Environment

Value functions

Optimality

Methods for attaining optimality

DP MC TD

BMI & RL-BMI

Eligibility Traces

START / END

RL Examples

Stanford Autonomous Helicopter http://heli.stanford.edu/

Reinforcement Learning Brain Machine Interface (Joe Francis Lab)

Environment model - Markov decision process

1) States 'S'

2) Actions 'A'

3) State transition probabilities

4) Reward

5) Deterministic, non-stationary policy

RL Problem: The decision maker (the 'agent') needs to learn the optimal policy in an 'environment' so as to maximize the total amount of reward it receives over the long term.

State transition probabilities: $P^{a}_{ss'} = \Pr\{ s_{t+1} = s' \mid s_t = s, a_t = a \}$

Discount factor: $0 \le \gamma \le 1$

Discounted return: $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$

Finite-horizon (episodic) return: $R_t = r_{t+1} + r_{t+2} + r_{t+3} + \dots + r_T$

Policy: $\pi : s \to a$

• Agent performs the action under the policy being followed.

• Environment is everything else other than the agent.

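To make these pieces concrete, here is a minimal sketch (not from the slides) of a finite MDP in Python; the two states, two actions, transition probabilities, rewards, discount factor, and policy below are all made-up illustrative values.

```python
import random

# A toy finite MDP: every number and name here is hypothetical.
states = ["s1", "s2"]
actions = ["a1", "a2"]
gamma = 0.9  # discount factor, 0 <= gamma <= 1

# P[s][a] is a list of (next_state, probability); each row sums to 1.
P = {
    "s1": {"a1": [("s1", 0.2), ("s2", 0.8)], "a2": [("s1", 0.9), ("s2", 0.1)]},
    "s2": {"a1": [("s1", 0.7), ("s2", 0.3)], "a2": [("s1", 0.0), ("s2", 1.0)]},
}

# R[s][a] is the expected immediate reward for taking action a in state s.
R = {"s1": {"a1": 0.0, "a2": 1.0}, "s2": {"a1": 5.0, "a2": -1.0}}

def step(s, a):
    """Sample the environment: return (next_state, reward)."""
    next_states, probs = zip(*P[s][a])
    s_next = random.choices(next_states, weights=probs, k=1)[0]
    return s_next, R[s][a]

# A deterministic policy is just a mapping from states to actions.
policy = {"s1": "a2", "s2": "a1"}
print(step("s1", policy["s1"]))
```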

Value Functions:

State Value Function:
$V^{\pi}(s) = E_{\pi}\{ R_t \mid s_t = s \} = E_{\pi}\Big\{ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\Big|\, s_t = s \Big\} = \sum_{a} \pi(s,a) \sum_{s'} P^{a}_{ss'} \big[ R^{a}_{ss'} + \gamma V^{\pi}(s') \big]$

State-Action Value Function:
$Q^{\pi}(s,a) = E_{\pi}\{ R_t \mid s_t = s, a_t = a \} = E_{\pi}\Big\{ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\Big|\, s_t = s, a_t = a \Big\}$
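Both value functions are expectations of the same discounted return. A tiny sketch (illustrative only, with a hypothetical reward sequence) of computing that return from sampled rewards:

```python
def discounted_return(rewards, gamma=0.9):
    """R_t = r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Hypothetical reward sequence observed after time t.
print(discounted_return([0.0, 0.0, 1.0, 5.0], gamma=0.9))  # 0.81 + 3.645 = 4.455
```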

Optimal Value Function:

Optimal Policy: a policy that is better than or equal to all other policies (in the sense of maximizing expected reward) is called an optimal policy.

Optimal state value function: $V^{*}(s) = \max_{\pi} V^{\pi}(s)$

Optimal state-action value function: $Q^{*}(s,a) = \max_{\pi} Q^{\pi}(s,a)$

Bellman optimality equations:
$V^{*}(s) = \max_{a} E\{ r_{t+1} + \gamma V^{*}(s_{t+1}) \mid s_t = s, a_t = a \}$
$Q^{*}(s,a) = E\{ r_{t+1} + \gamma \max_{a'} Q^{*}(s_{t+1}, a') \mid s_t = s, a_t = a \}$
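One practical consequence worth spelling out (a sketch with hypothetical Q-values, not from the slides): once Q* is available, an optimal policy is obtained simply by acting greedily with respect to it, i.e. pi*(s) = argmax_a Q*(s, a).

```python
# Hypothetical optimal Q-values for two states and two actions.
Q_star = {
    "s1": {"a1": 2.0, "a2": 3.5},
    "s2": {"a1": 1.0, "a2": 0.5},
}

def greedy_policy(Q, s):
    """pi*(s) = argmax_a Q*(s, a)."""
    return max(Q[s], key=Q[s].get)

print(greedy_policy(Q_star, "s1"))  # -> 'a2'
print(greedy_policy(Q_star, "s2"))  # -> 'a1'
```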

At time = t: acquire brain state, pass it to the decoder, select an action (trying to execute an optimum action), and execute the action.

At time = t+1: observe the reward and update the decoder.

EXAMPLE

Transition probabilities under the chosen action: Pr = 0.8, Pr = 0.1, Pr = 0.1, Pr = 0.

$V^{\pi}(s_1) = 0.8\,[R(s_1,a_1) + \gamma V^{\pi}(s_2)] + 0.1\,[R(s_1,a_1) + \gamma V^{\pi}(s_3)] + 0.1\,[R(s_1,a_1) + \gamma V^{\pi}(s_4)]$

(Gridworld states $s_1, s_2, s_3, s_4$.)

Prof. Andrew Ng, Lecture 16, Machine Learning
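A small numeric sketch of the expansion above; the discount factor, reward, and neighbouring value estimates are made-up placeholders, used only to show how the probability-weighted backup is computed:

```python
gamma = 0.99      # hypothetical discount factor
R_s1_a1 = -0.02   # hypothetical immediate reward for taking a1 in s1
V = {"s2": 0.9, "s3": 0.6, "s4": 0.3}  # hypothetical current value estimates

# V(s1) = 0.8*[R + gamma*V(s2)] + 0.1*[R + gamma*V(s3)] + 0.1*[R + gamma*V(s4)]
V_s1 = (0.8 * (R_s1_a1 + gamma * V["s2"])
        + 0.1 * (R_s1_a1 + gamma * V["s3"])
        + 0.1 * (R_s1_a1 + gamma * V["s4"]))
print(round(V_s1, 4))
```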

Outline

Environment

Value functions

Optimality

Methods for attaining optimality

DP MC TD

BMI & RL-BMI

Eligibility Traces

START / END

We're here !

RL Examples

Solution Methods for the RL Problem

◦ Dynamic Programming (DP): a method for optimizing problems that exhibit overlapping subproblems and optimal substructure.

◦ Monte Carlo method (MC): requires only experience, i.e. sample sequences of states, actions, and rewards from interaction with an environment.

◦ Temporal Difference learning (TD): combines the better aspects of DP (bootstrapped estimation) and MC (learning from experience) without incorporating the 'troublesome' aspects of either.

DYNAMIC PROGRAMMING

Dynamic Programming: Policy Evaluation

Dynamic Programming: Policy Improvement
$Q^{\pi}(s, \pi'(s)) \ge V^{\pi}(s)$

Policy Iteration:
$\pi_0 \xrightarrow{E} V^{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} V^{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E} \dots \xrightarrow{I} \pi^{*} \xrightarrow{E} V^{*}$
(E: policy evaluation, I: policy improvement)

Value Iteration: replace the entire policy-evaluation sweep with a single max backup:
$V(s) \leftarrow \max_{a} \sum_{s'} P^{a}_{ss'} \big[ R^{a}_{ss'} + \gamma V(s') \big]$
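A minimal value-iteration sketch on a hypothetical two-state MDP (the transition probabilities, rewards, and gamma are invented for illustration); each sweep applies the max backup above, and a greedy policy is read off the converged values:

```python
gamma = 0.9
states = ["s1", "s2"]
actions = ["a1", "a2"]
# P[(s, a)] = list of (next_state, probability); R[(s, a, s_next)] = reward.
P = {("s1", "a1"): [("s1", 0.5), ("s2", 0.5)], ("s1", "a2"): [("s2", 1.0)],
     ("s2", "a1"): [("s1", 1.0)],              ("s2", "a2"): [("s2", 1.0)]}
R = {("s1", "a1", "s1"): 0.0, ("s1", "a1", "s2"): 1.0, ("s1", "a2", "s2"): 0.0,
     ("s2", "a1", "s1"): 2.0, ("s2", "a2", "s2"): 0.5}

V = {s: 0.0 for s in states}
for sweep in range(100):
    # Synchronous sweep: V(s) <- max_a sum_s' P[R + gamma*V(s')]
    V = {s: max(sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in P[(s, a)])
                for a in actions)
         for s in states}

# Greedy (optimal) policy extracted from the converged values.
policy = {s: max(actions,
                 key=lambda a: sum(p * (R[(s, a, s2)] + gamma * V[s2])
                                   for s2, p in P[(s, a)]))
          for s in states}
print(V, policy)
```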

Monte Carlo vs. DP

◦ The estimates for each state are independent; in other words, MC methods do not "bootstrap".

◦ The DP backup diagram includes only one-step transitions, whereas the MC backup diagram goes all the way to the end of the episode.

◦ The computational expense of estimating the value of a single state is independent of the number of states, which is attractive when one requires the value of only a subset of the states.

Monte Carlo Policy Evaluation

Every-visit MC / First-visit MC

-> Without a model, we need Q-value estimates.
-> All state-action pairs should be visited.
-> Exploration techniques: 1) Exploring starts 2) ε-greedy policy (see the sketch below)
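A minimal first-visit MC policy-evaluation sketch; the episodes below are hypothetical (state, reward) sequences assumed to come from some fixed policy, and returns are averaged over the first visit to each state:

```python
from collections import defaultdict

gamma = 0.9
returns = defaultdict(list)   # state -> list of first-visit returns
V = {}

def evaluate_first_visit_mc(episodes):
    """episodes: list of [(state, reward received after leaving that state), ...]."""
    for episode in episodes:
        G = 0.0
        first_visit_returns = {}
        # Walk the episode backwards, accumulating the discounted return.
        for s, r in reversed(episode):
            G = r + gamma * G
            first_visit_returns[s] = G   # overwritten until the earliest visit remains
        for s, G_s in first_visit_returns.items():
            returns[s].append(G_s)
            V[s] = sum(returns[s]) / len(returns[s])

# Two hypothetical episodes generated under some fixed policy.
evaluate_first_visit_mc([
    [("s1", 0.0), ("s2", 1.0), ("s1", 5.0)],
    [("s2", 0.0), ("s1", 2.0)],
])
print(V)
```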

Next Slide: MONTE CARLO

As promised, this is the "NEXT SLIDE"!

MONTE CARLO

Temporal Difference Methods
◦ Like MC, TD methods can learn directly from raw experience without a model of the environment's dynamics. Like DP, TD methods update estimates based in part on other learned estimates, without waiting for a final outcome (they bootstrap).

$V(s_t) \leftarrow V(s_t) + \alpha \big[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \big]$
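A tabular TD(0) sketch of the update above; the step size alpha, gamma, and the sample transition are illustrative:

```python
from collections import defaultdict

alpha, gamma = 0.1, 0.9
V = defaultdict(float)

def td0_update(s, r_next, s_next):
    """V(s_t) <- V(s_t) + alpha * [r_{t+1} + gamma*V(s_{t+1}) - V(s_t)]."""
    td_error = r_next + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error
    return td_error

# One hypothetical observed transition.
print(td0_update("s1", 1.0, "s2"), V["s1"])
```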

TD(λ)

λ: trace-decay parameter

As λ increases, bias decreases and variance increases: a bias-variance tradeoff.

Intuition: start with a large λ and then decrease it over time.
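A sketch of how the trace-decay parameter is commonly used in backward-view tabular TD(λ) with accumulating eligibility traces; all constants and the short trajectory are hypothetical:

```python
from collections import defaultdict

alpha, gamma, lam = 0.1, 0.9, 0.8   # lam is the trace-decay parameter lambda
V = defaultdict(float)
e = defaultdict(float)              # eligibility traces

def td_lambda_update(s, r_next, s_next):
    """Backward-view TD(lambda): every state is updated in proportion to its trace."""
    td_error = r_next + gamma * V[s_next] - V[s]
    e[s] += 1.0                      # accumulating trace for the visited state
    for state in list(e.keys()):
        V[state] += alpha * td_error * e[state]
        e[state] *= gamma * lam      # traces decay by gamma*lambda each step

# A short hypothetical trajectory s1 -> s2 -> s3 with rewards.
for s, r, s_next in [("s1", 0.0, "s2"), ("s2", 1.0, "s3")]:
    td_lambda_update(s, r, s_next)
print(dict(V))
```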

SARSA

Q Learning

Difference
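The difference is in the bootstrapped target: SARSA (on-policy) uses the Q-value of the action actually taken next, while Q-learning (off-policy) uses the greedy max over next actions. A tabular sketch with hypothetical states, actions, and constants:

```python
from collections import defaultdict

alpha, gamma = 0.1, 0.9
Q = defaultdict(float)            # Q[(state, action)]
actions = ["a1", "a2"]

def sarsa_update(s, a, r, s_next, a_next):
    """On-policy target: r + gamma * Q(s', a') for the action a' actually selected."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def q_learning_update(s, a, r, s_next):
    """Off-policy target: r + gamma * max_a' Q(s', a'), regardless of the next action taken."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# One hypothetical transition updated both ways (in practice, on separate Q-tables).
sarsa_update("s1", "a1", 1.0, "s2", "a2")
q_learning_update("s1", "a1", 1.0, "s2")
print(dict(Q))
```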

Outline

Environment

Value functions

Optimality

Methods for attaining optimality

DP MC TD

BMI & RL-BMI

Eligibility Traces

START / END

We're here !

RL Examples

Eligibility Traces


Outline

Environment

Value functions

Optimality

Methods for attaining optimality

DP MC TD

BMI & RL-BMI

Eligibility Traces

START / END

We're here !

RL Examples

Online/closed loop RL-BMI architecture

action_output = index of $\max_i [\, Q_i(s_t) \,]$, so $Q(s_t, a_t) = Q_{a_t}(s_t)$

Output nonlinearity: tanh(·)

Reward: $r_{t+1}$

$TD\_err = r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)$

$delta = TD\_err \times e\_trace$

‘delta’ used for updating the weights through back-propagation

NEURAL SIGNAL
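A rough sketch of the kind of update loop the architecture above describes; this is not the lab's actual decoder. It uses a single tanh layer (one output per action), ε-greedy selection over the outputs, a SARSA-style TD error, and an eligibility trace on the weights; the network size, learning rate, ε, and the random 'neural' inputs are all made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n_neurons, n_actions = 32, 4          # hypothetical sizes: neural features in, one Q output per action
alpha, gamma, lam, epsilon = 0.01, 0.9, 0.8, 0.1

# Single-layer decoder with tanh outputs, one output per action (a simplification).
W = rng.normal(scale=0.1, size=(n_actions, n_neurons))
e_trace = np.zeros_like(W)            # eligibility trace on the weights

def q_values(x):
    return np.tanh(W @ x)             # Q_i(s_t) for each action i

def select_action(x):
    """Epsilon-greedy over the decoder outputs: action = index of max_i Q_i(s_t)."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(q_values(x)))

def update(x_t, a_t, r_t1, x_t1, a_t1):
    """TD error on Q(s_t, a_t), scaled by the eligibility trace, used to update W."""
    global e_trace
    q_t, q_t1 = q_values(x_t), q_values(x_t1)
    td_err = r_t1 + gamma * q_t1[a_t1] - q_t[a_t]
    # Gradient of the tanh output for the chosen action with respect to its weight row.
    grad = np.zeros_like(W)
    grad[a_t] = (1.0 - q_t[a_t] ** 2) * x_t
    e_trace = gamma * lam * e_trace + grad
    W += alpha * td_err * e_trace     # delta = TD_err * e_trace drives the weight update

# One hypothetical closed-loop step: brain state in, action out, reward observed, decoder updated.
x_t = rng.normal(size=n_neurons); a_t = select_action(x_t)
x_t1 = rng.normal(size=n_neurons); a_t1 = select_action(x_t1)
update(x_t, a_t, r_t1=1.0, x_t1=x_t1, a_t1=a_t1)
```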

Scott, S. H. (1999). "Apparatus for measuring and perturbing shoulder and elbow joint positions and torques during reaching." J Neurosci Methods 89(2): 119-27.

BMI SETUP

Autonomous Helicopter (Stanford University) http://heli.stanford.edu/papers/iser04-invertedflight.pdf

State: position, orientation, velocity and angular velocity
$(x, y, z, \phi, \theta, \omega, \dot{x}, \dot{y}, \dot{z}, \dot{\phi}, \dot{\theta}, \dot{\omega})$

(Diagram: states S1, S2 with actions a1, a2 and rewards R1, R2, generated by the dynamics model plus a random generator.)

Actor-Critic Model

http://drugabuse.gov/researchreports/methamph/meth04.gif

References

Richard S. Sutton & Andrew G. Barto, Reinforcement Learning: An Introduction

Prof. Andrew Ng's Machine Learning lectures, http://heli.stanford.edu

Dr. Joseph T. Francis, www.joefrancislab.com

Prof. Peter Dayan

Dr. Justin Sanchez Group, http://www.bme.miami.edu/nrg/