Reinforcement Learning and the Reward Engineering Principle

Daniel Dewey

daniel.dewey@philosophy.ox.ac.uk; AAAI Spring Symposium Series 2014

A modest aim:

What role do goals play in AI research?

…through the lens of reinforcement learning.

Reinforcement learning and AI

Definitions: “control” “dominance”

The reward engineering principle

Conclusions

RL and AI

“…one can define AI as the problem of designing systems that do the right thing. Now we just need a definition for ‘right.’”

Stuart Russell, “Rationality and Intelligence”

Reinforcement learning provides a definition: maximize total rewards.

RL and AI

[Diagram: the RL loop. The Agent sends an action to the Environment; the Environment returns a reward.]
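The loop in the diagram, and the "maximize total rewards" definition of "right", can be sketched in a few lines of Python. This is a toy illustration only: `total_reward`, `parity_reward`, and the two policies are hypothetical names, and the environment (state equals the time step, reward 1 when the action matches the step's parity) is an assumption made purely for the example.

```python
from typing import Callable

def total_reward(policy: Callable[[int], int],
                 reward_fn: Callable[[int, int], int],
                 steps: int = 8) -> int:
    """Run the agent-environment loop and return the sum of rewards."""
    total = 0
    for state in range(steps):             # toy assumption: the state is the time step
        action = policy(state)             # the agent sends an action
        total += reward_fn(state, action)  # the environment returns a reward
    return total

def parity_reward(state: int, action: int) -> int:
    """Hypothetical environment: reward 1 when the action matches the step's parity."""
    return 1 if action == state % 2 else 0

def always_zero(state: int) -> int:
    return 0

def match_parity(state: int) -> int:
    return state % 2

print(total_reward(always_zero, parity_reward))   # 4
print(total_reward(match_parity, parity_reward))  # 8
```

Under this definition the second policy is the "right" one simply because its total is larger; nothing else about the agent's behaviour enters the comparison.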

RL and AI

Understand and Exploit

Inference, Planning, Learning, Metareasoning, Concept formation, etc…

RL and AI

Advantages:

• Simple and cheap

• Flexible and abstract

• Measurable

“worse is better”

…and used in natural neural nets (brains!)

RL and AI

Outside the frame: Some behaviours cannot be elicited (by any rewards!)

As RL AI becomes more general and autonomous, it becomes harder to get good results with RL.

Key concepts: Control and dominance

Conclusions

Definitions: “control”

A user has control when the agent’s received rewards equal the user’s chosen rewards.

[Diagrams: the Agent-Environment loop (action, reward), then the same loop with the Environment split into Environment 1 and Environment 2 and a User added (state, action, reward). Two cases are labelled: “user chooses reward” and “env. ‘chooses’ reward”.]
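The definition of control can be read as a simple check on two reward sequences. The sketch below is one possible reading, not the author's formalism: `has_control` and `loss_of_control_steps` are hypothetical helper names, and representing rewards as equal-length lists of numbers is an assumption.

```python
from typing import List, Sequence

def loss_of_control_steps(received: Sequence[int], chosen: Sequence[int]) -> List[int]:
    """Time steps at which the received reward differs from the user's chosen reward."""
    return [t for t, (r, c) in enumerate(zip(received, chosen)) if r != c]

def has_control(received: Sequence[int], chosen: Sequence[int]) -> bool:
    """The user has control when every received reward equals the chosen reward."""
    return len(received) == len(chosen) and not loss_of_control_steps(received, chosen)

chosen = [1, 0, 1, 1]
print(has_control([1, 0, 1, 1], chosen))            # True
print(has_control([1, 1, 1, 0], chosen))            # False
print(loss_of_control_steps([1, 1, 1, 0], chosen))  # [1, 3]
```

In the second call the environment, not the user, determined the rewards at steps 1 and 3.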

Definitions: “dominance”

Why does control matter?

Loss of control can create situations where no possible sequence of rewards can elicit the desired behaviour.

These behaviours are dominated by other behaviours.

A “behaviour” (sequence of actions) is a policy.

Rewards for each action a1 … a8 (? = user-chosen reward; 0 or 1 = env.-chosen reward, i.e. loss of control):

      a1 a2 a3 a4 a5 a6 a7 a8
P1:    1  ?  0  ?  ?  ?  0  ?
P2:    1  0  ?  1  ?  ?  1  1

Can rewards make either better?

Choose all of P1's rewards 1 and all of P2's rewards 0:

P1:    1  1  0  1  1  1  0  1    Max. reward = 6
P2:    1  0  0  1  0  0  1  1    Min. reward = 4

Choose all of P1's rewards 0 and all of P2's rewards 1:

P1:    1  0  0  0  0  0  0  0    Min. reward = 1
P2:    1  0  1  1  1  1  1  1    Max. reward = 7
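The minimum and maximum totals in this example can be reproduced with a short Python sketch. It assumes, as the slides' "all rewards 0" and "all rewards 1" suggest, that each user-chosen reward lies between 0 and 1; `reward_bounds` is a hypothetical helper, and `None` stands for a "?" slot.

```python
from typing import Optional, Sequence, Tuple

def reward_bounds(rewards: Sequence[Optional[int]],
                  low: int = 0, high: int = 1) -> Tuple[int, int]:
    """Return (min, max) total reward; None marks a user-chosen slot."""
    fixed = sum(r for r in rewards if r is not None)  # env.-chosen rewards
    free = sum(1 for r in rewards if r is None)       # user-chosen slots
    return fixed + free * low, fixed + free * high

P1 = [1, None, 0, None, None, None, 0, None]
P2 = [1, 0, None, 1, None, None, 1, 1]

print(reward_bounds(P1))  # (1, 6)
print(reward_bounds(P2))  # (4, 7)
```

Because P1's maximum (6) exceeds P2's minimum (4), and P2's maximum (7) exceeds P1's minimum (1), rewards can be chosen to make either policy the better one.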

      a1 a2 a3 a4 a5 a6 a7 a8
P1:    1  ?  0  ?  ?  ?  0  ?
P3:    1  1  1  1  1  ?  1  1

P1:    1  1  0  1  1  1  0  1    Max. reward = 6    (dominated by P3)
P3:    1  1  1  1  1  0  1  1    Min. reward = 7    (dominates P1)

A dominates B if no possible assignment of rewards causes R(B) > R(A).

Dominated policies are unelicitable: no series of rewards can prompt them. (A less obvious result: every unelicitable policy is dominated.)
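The P1 and P3 example can be checked by brute force: enumerate every way of filling the user-chosen slots and confirm that P1's total never exceeds P3's. This is a sketch under stated assumptions rather than the author's formal definition: rewards are taken to be 0 or 1, the two policies' slots are filled independently, and `totals` and `dominates` are hypothetical names.

```python
from itertools import product
from typing import Optional, Sequence, Set

def totals(rewards: Sequence[Optional[int]], choices=(0, 1)) -> Set[int]:
    """All total rewards reachable by filling the None (user-chosen) slots."""
    free = [i for i, r in enumerate(rewards) if r is None]
    reachable = set()
    for fill in product(choices, repeat=len(free)):
        filled = list(rewards)
        for i, value in zip(free, fill):
            filled[i] = value
        reachable.add(sum(filled))
    return reachable

def dominates(a: Sequence[Optional[int]], b: Sequence[Optional[int]]) -> bool:
    """True if no assignment of user-chosen rewards gives b a higher total than a."""
    return all(rb <= ra for ra in totals(a) for rb in totals(b))

P1 = [1, None, 0, None, None, None, 0, None]
P2 = [1, 0, None, 1, None, None, 1, 1]
P3 = [1, 1, 1, 1, 1, None, 1, 1]

print(dominates(P3, P1))  # True
print(dominates(P1, P3))  # False
print(dominates(P2, P1))  # False
```

Exhaustive enumeration is only practical for tiny examples like this one; comparing the minimum total of one policy with the maximum total of the other gives the same answer here with far less work.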

Control is sometimes lost;

Loss of control enables dominance;

Dominance makes some policies unelicitable.

All of this is outside the “RL AI frame”

…but is clearly part of the AI problem (do the right thing!)

Generality: the range of policies an agent has reasonably efficient access to.

= better chance of finding dominant policies

Autonomy: ability to function in environments with little interaction from users.

= more frequent loss of control

Additional factors

Conclusions

Reward Engineering Principle

As RL AI becomes more general and autonomous, it becomes both more difficult and more important to constrain the environment to avoid loss of control.

…because general / autonomous RL AI has

• better chance of dominant policies;

• more unelicitable policies;

• more significant effects

Conclusions

RL AI users:

Heed the Reward Engineering Principle.

• Consider existence of dominant policies

• Be as rigorous as possible in excluding them

• Remember what’s outside the frame!

AI Researchers:

Expand the frame! Make goal design a first-class citizen.

Consider alternatives: manually coded utility functions, preference learning, …?

Watch out for dominance relations (e.g. in “dual” motivation systems, between intrinsic and extrinsic)

Thank you!

Work supported by the Alexander Tamas Research Fellowship

Thanks to Toby Ord, Seán Ó hÉigeartaigh, and two anonymous judges for comments.
