Inverse Reinforcement Learning in Partially Observable Environments


Transcript of Inverse Reinforcement Learning in Partially Observable Environments

Inverse Reinforcement Learning in Partially Observable Environments. Jaedeug Choi and Kee-Eung Kim, Korea Advanced Institute of Science and Technology.

JMLR Jan, 2011

2

Basics

Reinforcement Learning (RL)
Markov Decision Process (MDP)

3

Reinforcement Learning

[Diagram: agent-environment loop with Actions, Reward, Internal State, and Observation]

4

Inverse Reinforcement Learning

[Diagram: agent-environment loop with Actions, Reward, Internal State, and Observation]

5

Why the reward function?

It solves the more natural problem.

It is the most transferable representation of the agent's behaviour!

6

Example 1

Reward

7

Example 2

8

Agent

Name: Agent

Role: Decision making

Property: Principle of rationality

9

Environment

Markov Decision Process (MDP)

Partially Observable Markov Decision Process (POMDP)

10

MDP

Sequential decision-making problem.
States are directly perceived.

11

POMDP

Sequential decision-making problem.
States are perceived only through noisy observations ("Seems like I am near a wall!").
Concept of belief.

12

Policy

Explicit policy

Trajectory

13

IRL for MDP\R

From policies
From trajectories: linear approximation, QCP, projection method, apprenticeship learning

14

Using Policies

Ng and Russell, 2000

Any policy that deviates from the expert's policy should not yield a higher value (a checking sketch follows below).

15
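As a concrete illustration of this constraint, here is a minimal sketch assuming the standard Ng and Russell condition for a state-only reward and an expert who always plays the same action a1: a candidate reward R is consistent with the expert iff (P_a1 - P_a)(I - γ P_a1)^{-1} R ≥ 0 element-wise for every other action a. The function name, toy MDP, and discount factor are illustrative, not from the slides.

```python
# Hedged sketch of the Ng & Russell (2000) consistency check for IRL in an MDP:
# a candidate reward R over states is consistent with an expert who always plays
# action a1 iff (P_a1 - P_a) (I - gamma * P_a1)^{-1} R >= 0 for every other action a.
import numpy as np

def consistent_with_expert(P, expert_action, R, gamma=0.95, tol=1e-8):
    """P[a] is the |S| x |S| transition matrix of action a; R is a reward vector over states."""
    n_states = P[expert_action].shape[0]
    future = np.linalg.inv(np.eye(n_states) - gamma * P[expert_action]) @ R
    for a, P_a in enumerate(P):
        if a == expert_action:
            continue
        # A one-step deviation to action a must not increase the value at any state.
        if np.any((P[expert_action] - P_a) @ future < -tol):
            return False
    return True

# Toy 2-state MDP: action 0 stays put, action 1 swaps the two states.
P = [np.eye(2), np.array([[0.0, 1.0], [1.0, 0.0]])]
print(consistent_with_expert(P, expert_action=0, R=np.array([1.0, 0.0])))  # False: in state 1 swapping is better
print(consistent_with_expert(P, expert_action=0, R=np.array([0.0, 0.0])))  # True: R = 0 satisfies every constraint
```

Note how R = 0 passes trivially; this is exactly the degenerate solution that makes IRL ill-posed.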

Using Sampled Trajectories

Linear approximation of the reward function:

R(s,a) = α₁φ₁(s,a) + α₂φ₂(s,a) + … + α_d φ_d(s,a) = αᵀφ(s,a)

where α ∈ [-1,1]^d and φ : S×A → [0,1]^d are basis functions (a small sketch follows below).

16
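To make the parameterisation concrete, here is a minimal sketch with toy indicator basis functions; make_phi, the state and action counts, and the random weights are illustrative assumptions, not the paper's features.

```python
# Hedged sketch of the linear reward model R(s, a) = alpha^T phi(s, a),
# with alpha in [-1, 1]^d and phi : S x A -> [0, 1]^d (toy indicator features).
import numpy as np

def make_phi(n_states, n_actions):
    """One indicator feature per (state, action) pair, so feature values lie in [0, 1]."""
    d = n_states * n_actions
    def phi(s, a):
        f = np.zeros(d)
        f[s * n_actions + a] = 1.0
        return f
    return phi, d

phi, d = make_phi(n_states=3, n_actions=2)
alpha = np.random.uniform(-1.0, 1.0, size=d)  # weights in [-1, 1]^d, to be recovered by IRL

def reward(s, a):
    return float(alpha @ phi(s, a))           # R(s, a) = alpha^T phi(s, a)

print(reward(1, 0))
```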

Using Linear Programming

17

Apprenticeship Learning

Learn a policy from the expert's demonstrations.
Does not compute the exact reward function.

18

Using QCP

Approximated using the Projection Method (sketched below)!

19
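The projection method referred to here is, in Abbeel and Ng's apprenticeship-learning setting, a simple geometric update on feature expectations. The sketch below shows one such update; mu_E, mu_bar, mu_i and the toy numbers are illustrative assumptions.

```python
# Hedged sketch of one iteration of the projection (PRJ) update that approximates
# the apprenticeship-learning QCP: project the expert's feature expectation mu_E
# onto the line through the previous projection mu_bar and the latest policy's mu_i.
import numpy as np

def projection_step(mu_E, mu_bar, mu_i):
    direction = mu_i - mu_bar
    coef = direction @ (mu_E - mu_bar) / (direction @ direction)
    mu_bar_new = mu_bar + coef * direction   # closest point to mu_E on that line
    w = mu_E - mu_bar_new                    # next reward weights: R(s, a) = w^T phi(s, a)
    t = np.linalg.norm(w)                    # margin; stop iterating once t is small
    return mu_bar_new, w, t

mu_E = np.array([0.8, 0.6])   # expert's feature expectations (toy numbers)
mu_0 = np.array([0.1, 0.1])   # feature expectations of an initial policy
mu_1 = np.array([0.7, 0.2])   # feature expectations of the next computed policy
print(projection_step(mu_E, mu_0, mu_1))
```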

IRL in POMDP

Ill-posed problem: existence, uniqueness, stability (e.g. R = 0 is always a solution).

Computationally intractable: exponential increase in problem size!

20

IRL for POMDP\R (extending IRL for MDP\R)

From policies: comparing Q functions, the DP update (Howard's policy improvement theorem), the witness theorem
From trajectories: the MMV method, the MMFE method, the PRJ method

21

Comparing Q functions

Constraint:

Disadvantage: for each node n ∈ N there are |A||N|^|Z| ways to deviate one step from the expert. Over all |N| nodes there are |N||A||N|^|Z| ways to deviate, so the constraint set grows exponentially! (A toy count follows below.)

22
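A toy count of those one-step deviations, with made-up sizes, shows how quickly the constraint set blows up:

```python
# Illustrative only: |A| actions, |N| FSC nodes, |Z| observations.
A, N, Z = 4, 10, 5
per_node = A * N ** Z    # |A| * |N|^|Z| one-step deviations from a single node
total = N * per_node     # |N| * |A| * |N|^|Z| deviations over all nodes
print(per_node, total)   # 400000 4000000
```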

DP Update Based Approach

Comes from the generalized Howard's policy improvement theorem.

Hansen, 1998

If an FSC Policy is not optimal, the DP update transforms it into an FSC policy with a value function that is as good or better for every belief state and better for some belief state.

24

Comparison

25


MMV Method

26

MMFE Method

27

Approximated using Projection (PRJ) Method !!!

28

Experimental Results

Tiger, 1D Maze, 5×5 Grid World, Heaven/Hell, Rock Sample

29

Illustration

30

Characteristics

31

Results from Policy

32

Results from Trajectories

33

Questions?

34

Backup slides !

35

Inverse Reinforcement Learning

Given: measurements of an agent's behaviour over time, in a variety of circumstances; measurements of the sensory inputs to that agent; and a model of the physical environment (including the agent's body).

Determine: the reward function that the agent is optimizing.

Russell (1998)

36

Partially Observable Environment

Mathematical framework for single-agent planning under uncertainty.

Agent cannot directly observe the underlying states.

Example: studying global warming from your grandfather's diary!

37

Advantages of IRL

Natural way to examine animal and human behaviors.

Reward function – most transferable representation of agent’s behavior.

38

MDP

Models a sequential decision-making problem.

Five-tuple <S, A, T, R, γ>:
S – finite set of states
A – finite set of actions
T – state transition function, T: S×A → Π(S)
R – reward function, R: S×A → ℝ
γ – discount factor in [0, 1)

Q^π(s,a) = R(s,a) + γ ∑_{s'∈S} T(s,a,s') V^π(s') (a numeric sketch follows below)

39
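A minimal sketch of the Q definition above, with T stored as a (|S|, |A|, |S|) array; the function name and the toy MDP are illustrative.

```python
# Q^pi(s, a) = R(s, a) + gamma * sum_{s'} T(s, a, s') * V^pi(s')
import numpy as np

def q_from_v(R, T, V, gamma):
    """R: (|S|, |A|) rewards, T: (|S|, |A|, |S|) transitions, V: (|S|,) state values."""
    return R + gamma * T @ V   # broadcasting yields a (|S|, |A|) Q table

R = np.array([[1.0, 0.0], [0.0, 0.5]])                 # toy 2-state, 2-action MDP
T = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.5, 0.5], [0.0, 1.0]]])
V = np.array([2.0, 1.0])
print(q_from_v(R, T, V, gamma=0.9))
```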

POMDP

Partially observable environment.

Eight-tuple <S, A, Z, T, O, R, b₀, γ>:
Z – finite set of observations
O – observation function, O: S×A → Π(Z)
b₀ – initial state distribution b₀(s)

Belief b – b(s) is the probability that the state is s at the current time step. The belief summarizes the action-observation history, reducing the complexity introduced by that history (an update sketch follows below).

40
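The belief is maintained by a Bayes update after each action-observation pair; a minimal sketch follows, where the array shapes, names, and toy model are assumptions.

```python
# Belief update: b'(s') is proportional to O(s', a, z) * sum_s T(s, a, s') * b(s).
import numpy as np

def belief_update(b, a, z, T, O):
    """T: (|S|, |A|, |S|) transitions, O: (|S|, |A|, |Z|) observation probabilities."""
    b_pred = b @ T[:, a, :]        # predict: sum_s T(s, a, s') * b(s)
    b_new = O[:, a, z] * b_pred    # weight by the likelihood of observing z
    return b_new / b_new.sum()     # renormalise to a probability distribution

T = np.ones((2, 1, 2)) * 0.5                  # toy model: one action, uniform transitions
O = np.array([[[0.8, 0.2]], [[0.3, 0.7]]])    # O(s', a, z)
print(belief_update(np.array([0.5, 0.5]), a=0, z=0, T=T, O=O))   # -> [0.727..., 0.272...]
```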

Finite State Controller (FSC)

A policy in a POMDP is represented by an FSC: a directed graph <N, E> in which each node n ∈ N is associated with an action a ∈ A and has one outgoing edge e ∈ E per observation z ∈ Z.

π = <ψ, η>, where ψ is the action strategy and η is the observation strategy.

Q^π(<n,b>, <a,os>) = ∑_s b(s) Q^π(<n,s>, <a,os>) (a short sketch follows below)

41
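The last identity on this slide says that the value of a node paired with a belief is the belief-weighted sum of its node-state values; a tiny sketch, with an assumed precomputed table Q_ns for one fixed deviation <a, os>:

```python
# Q^pi(<n, b>, <a, os>) = sum_s b(s) * Q^pi(<n, s>, <a, os>)
import numpy as np

def q_node_belief(Q_ns, n, b):
    """Q_ns: (|N|, |S|) node-state values for one fixed <a, os>; b: belief over states."""
    return float(b @ Q_ns[n])

Q_ns = np.array([[1.0, 0.5],    # node 0: values at states 0 and 1
                 [0.2, 0.9]])   # node 1
print(q_node_belief(Q_ns, n=0, b=np.array([0.3, 0.7])))   # 0.3*1.0 + 0.7*0.5 = 0.65
```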

Using Projection Method

PRJ Method

42