Reinforcement Learning for the People and/or by the People
Emma Brunskill
Stanford University
NIPS 2017 Tutorial
Amazing Reinforcement Learning Progress
Overview
● RL introduction
● RL for people
● RL by the people
Audience
• If you are:
– Interested in a quick overview of RL (section 1)
– Want to learn about the RL technical challenges involved in people-facing applications (section 2)
– Want to learn about how people can help RL systems (section 3)
*Caveats
• Not trying to cover all the domain-specific methods for tackling these questions
• Focus will be on RL setting, and new technical challenges for RL with people and using people
• Always delighted to learn about new areas or key references I might have missed; email me at [email protected]
Reinforcement Learning for People
(Diagram: agent gives a math exercise to a student and observes the student's answer)

Reinforcement Learning for People
(Diagram: agent gives a math exercise, observes the student's answer, and receives a reward when the student passes the exam)

Reinforcement Learning for People
(Diagram: agent suggests a product to a customer, observes whether the customer buys the product, and receives revenue as reward)
Policy: Prior Recommendations & Purchases → Product Ad
Goal: Choose actions to maximize expected revenue

Reinforcement Learning
(Diagram: agent takes an action, receives an observation and a reward)
Policy: Map Observations → Actions
Goal: Choose actions to maximize expected rewards
Overview
● RL introduction● RL for people● RL by the people
RL Basics
● MDP
● POMDP
● Planning
● 3 views
○ Model
○ Model-free
○ Policy search
● Exploration vs exploitation: how do we get data?
Markov Decision Process (MDP)
• Set of states S
• Set of actions A
• Stochastic transition/dynamics model T(s,a,s′)
– Probability of reaching s′ after taking action a in state s
• Reward model R(s,a) (or R(s) or R(s,a,s′))
• Maybe a discount factor γ or horizon H
Markov Decision Process (MDP)
• Set of states S
• Set of actions A
• Stochastic transition/dynamics model T(s,a,s′)
– Probability of reaching s′ after taking action a in state s
• Reward model R(s,a) (or R(s) or R(s,a,s′))
• Maybe a discount factor γ or horizon H
• Policy π: s → a
• Optimal policy is one with the highest expected discounted sum of future rewards
Partially Observable MDP
• Set of states S
• Set of actions A
• Set of observations Z
• Stochastic transition/dynamics model T(s,a,s′)
– Probability of reaching s′ after taking action a in state s
• Reward model R(s,a) (or R(s) or R(s,a,s′))
• Observation model P(z|s′,a)
• Policy π: history (z,a,z′,a′,...) → a
• Optimal policy is one with the highest expected discounted sum of future rewards
Model-Based Policy Evaluation
• Given an MDP and a policy π: s → a
• How good is this policy?
• Want to compute the expected sum of discounted rewards if we follow it (starting from some initial states)
• Could do Monte Carlo rollouts and average
• But if the domain is Markov, we can do better...
Model-Based Policy Evaluation with Dynamic Programming
• Given an MDP and a policy π: s → a
• Set V(s) = 0 for all s
• Iteratively update the value:
• V(s) ← R(s) + γ Σs′ T(s, π(s), s′) V(s′)
(first term: immediate reward; second term: discounted sum of future rewards)
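The update above can be sketched in a few lines of Python. This is a minimal illustration, not from the tutorial: the 2-state chain MDP and its numbers are made up, and the transition matrix is the one induced by the fixed policy.

```python
import numpy as np

def evaluate_policy(T_pi, R, gamma, n_iters=500):
    """Iterate V(s) <- R(s) + gamma * sum_s' T_pi[s, s'] * V(s'),
    where T_pi[s, s'] is the transition matrix induced by the fixed policy."""
    V = np.zeros(len(R))
    for _ in range(n_iters):
        V = R + gamma * T_pi @ V
    return V

# Toy chain: state 0 stays with prob 0.9, moves to state 1 with prob 0.1;
# state 1 is absorbing. Reward 1 in state 0, reward 0 in state 1.
T_pi = np.array([[0.9, 0.1],
                 [0.0, 1.0]])
R = np.array([1.0, 0.0])
V = evaluate_policy(T_pi, R, gamma=0.9)
```

Because the update is a contraction, repeated application converges to the unique fixed point; here V(0) approaches 1/(1 − 0.9·0.9).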
Planning & Bellman Equation
• Given an MDP (includes dynamics and reward model)
• Compute the optimal policy
• Bellman optimality equation:
– V(s) = R(s) + γ Σs′ T(s, π*(s), s′) V(s′)
– Q(s,a) = R(s) + γ Σs′ T(s,a,s′) V(s′)
– π*(s) = argmaxa [R(s) + γ Σs′ T(s,a,s′) V(s′)] = argmaxa Q(s,a)
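The Bellman equations above can be turned into value iteration directly. A minimal sketch on a hypothetical 2-state, 2-action MDP (all numbers are illustrative, not from the tutorial):

```python
import numpy as np

def value_iteration(T, R, gamma, n_iters=500):
    """T[a, s, s2] = P(s2 | s, a); R[s] is the per-state reward.
    Returns Q[s, a] and the greedy policy pi*(s) = argmax_a Q(s, a)."""
    V = np.zeros(T.shape[1])
    for _ in range(n_iters):
        Q = R[None, :] + gamma * (T @ V)   # Q[a, s] = R(s) + gamma * sum_s' T V
        V = Q.max(axis=0)                  # Bellman backup: V(s) = max_a Q(s, a)
    return Q.T, Q.argmax(axis=0)

# Toy MDP: action 0 stays put, action 1 jumps to state 1; state 1 is
# absorbing and pays reward 1, state 0 pays reward 0.
T = np.array([[[1.0, 0.0], [0.0, 1.0]],   # action 0
              [[0.0, 1.0], [0.0, 1.0]]])  # action 1
R = np.array([0.0, 1.0])
Q_sa, pi = value_iteration(T, R, gamma=0.9)
```

With γ = 0.9 the absorbing state is worth 1/(1 − γ) = 10, so the greedy policy in state 0 is to jump (action 1), worth γ·10 = 9.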
Reinforcement Learning
• Typically still assume an MDP
• Know the set of states S
• Know the set of actions A
• Maybe a discount factor γ or horizon H
• Don't know the dynamics and/or reward model!
• Still want to take actions to yield high reward
RL: 3 Common Approaches
Image from David Silver
Model-Based RL
• Typically still assume an MDP
• As we interact in the world, observe states, actions and rewards
• Use machine learning to estimate a model of the dynamics and/or rewards from this experience
• Now have (one or more) estimated MDP models
• Now can do planning to compute an optimal policy
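The estimation step above is simple in the tabular case: maximum-likelihood counts over observed (s, a, r, s′) transitions. A minimal sketch with made-up data (a real agent would then plan in the estimated model):

```python
import numpy as np

def estimate_model(transitions, n_states, n_actions):
    """Tabular MLE of T(s, a, s') and R(s, a) from (s, a, r, s') tuples."""
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sums = np.zeros((n_states, n_actions))
    for s, a, r, s_next in transitions:
        counts[s, a, s_next] += 1
        reward_sums[s, a] += r
    n_sa = counts.sum(axis=2)                          # visits to each (s, a)
    T_hat = counts / np.maximum(n_sa, 1)[:, :, None]   # empirical T(s, a, s')
    R_hat = reward_sums / np.maximum(n_sa, 1)          # empirical R(s, a)
    return T_hat, R_hat

# Illustrative experience: (s, a, r, s') tuples.
data = [(0, 0, 1.0, 0), (0, 0, 1.0, 0), (0, 0, 0.0, 1), (1, 0, 0.0, 1)]
T_hat, R_hat = estimate_model(data, n_states=2, n_actions=1)
```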
Model Free: Q-Learning
• Initialize Q(s,a) for all (s,a) pairs
Model Free: Q-Learning
• Initialize Q(s,a) for all (s,a) pairs
• On observing transition <st, at, rt, st+1>
– Calculate temporal difference error: δ(st,at) = rt + γ maxa′ Q(st+1,a′) − Q(st,at)
– Difference between what was observed and the current estimate of long-term expected reward (Q)
– Uses the Markov property and bootstraps (didn't observe reward until the end of the episode, so use Q(st+1,a′) as a proxy)
Model Free: Q-Learning
• Initialize Q(s,a) for all (s,a) pairs
• On observing transition <st, at, rt, st+1>
– Calculate TD-error: δ(st,at) = rt + γ maxa′ Q(st+1,a′) − Q(st,at)
– Use it to update the estimate of Q(st,at): Q(st,at) ← Q(st,at) + α δ(st,at)
• Slowly moves estimate toward observed experience
Model Free: Q-Learning
• Initialize Q(s,a) for all (s,a) pairs
• On observing transition <st, at, rt, st+1>
– Calculate TD-error: δ(st,at) = rt + γ maxa′ Q(st+1,a′) − Q(st,at)
– Use it to update the estimate of Q(st,at): Q(st,at) ← Q(st,at) + α δ(st,at)
• Computationally cheap
• But only propagates experience one step
– Replay can help fix this
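The tabular Q-learning update above is a one-liner in code. A minimal sketch; the 2-state environment and the single transition below are made up for illustration:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """TD error: delta = r + gamma * max_a' Q(s', a') - Q(s, a).
    Move Q(s, a) a step of size alpha toward the observed target."""
    delta = r + gamma * Q[s_next].max() - Q[s, a]
    Q[s, a] += alpha * delta
    return Q

Q = np.zeros((2, 2))
# One observed transition: from state 0, action 1 yields reward 1, lands in state 1.
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)
```

After this single update, only Q(0, 1) has moved, by α·δ = 0.1·1 = 0.1; the one-step nature of the propagation is exactly the limitation the slide notes.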
Policy Search
• Directly search the policy space for argmaxπ Vπ
• Parameterize the policy and do stochastic gradient descent
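A minimal sketch of the parameterize-and-do-SGD idea: a softmax policy over two arms trained with the score-function (REINFORCE) gradient. The bandit, its reward means, and the learning rate are all illustrative, not from the tutorial.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                  # policy parameters: one logit per arm
true_means = np.array([0.2, 0.8])    # hypothetical reward means; arm 1 is better

for _ in range(2000):
    p = np.exp(theta) / np.exp(theta).sum()   # softmax policy pi(a; theta)
    a = rng.choice(2, p=p)
    r = rng.normal(true_means[a], 0.1)
    grad_log_pi = -p
    grad_log_pi[a] += 1.0            # d/dtheta log pi(a) = e_a - p
    theta += 0.05 * r * grad_log_pi  # stochastic gradient ascent step

p = np.exp(theta) / np.exp(theta).sum()
```

After training, the policy concentrates on the better arm; as the slide notes, this is only a local ascent, though the toy problem here has no bad local optima.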
RL: 3 Common Approaches
Image from David Silver
Exploration vs Exploitation
● In online reinforcement learning, learn about world through acting
● Trade off between
○ Learning more about how the world works
○ Using that knowledge to maximize reward
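The simplest instance of this trade-off is epsilon-greedy action selection: with probability ε take a random action (learn more), otherwise take the current best guess (use that knowledge). The Q-values below are illustrative.

```python
import numpy as np

def epsilon_greedy(Q_row, epsilon, rng):
    """Pick an action for one state: explore with prob epsilon, else exploit."""
    if rng.random() < epsilon:
        return rng.integers(len(Q_row))   # explore: uniform random action
    return int(np.argmax(Q_row))          # exploit: current greedy action

rng = np.random.default_rng(0)
Q_row = np.array([0.1, 0.5, 0.3])         # made-up Q-values for one state
actions = [epsilon_greedy(Q_row, 0.1, rng) for _ in range(1000)]
```

With ε = 0.1 the greedy action is chosen about 93% of the time, but every action keeps being tried occasionally, so the estimates can still improve.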
Overview
● RL introduction
● RL for people
● RL by the people
Reinforcement Learning Progress
(Game and simulation successes: cheap to try things, or simulate) ≠ (people-facing applications: high stakes, hard to model)
Technical Challenges in RL for People
1. Sample-efficient learning
2. What do we want to optimize?
3. Batch (purely offline) reinforcement learning
Sample Efficiency Through Transfer
(Spectrum: tasks all different ↔ tasks all the same)
Assume Finite Set of Models
(Diagram: MDP Y with TY, RY; MDP R with TR, RR; MDP G with TG, RG)
Sample a task from a finite set of MDPs (shared S & A space)
Assume Finite Set of Groups: Multi-Task Reinforcement Learning
(Diagram: MDP Y with TY, RY; MDP R with TR, RR; MDP G with TG, RG; ...)
Series of tasks; act in each task for H steps
Multi-Task Reinforcement Learning
(Diagram: a series of tasks drawn from MDPs Y, R, G)
Multi-Task Reinforcement Learning
(Diagram: a series of tasks drawn from MDPs Y, R, G, but now with T = ? and R = ? unknown)
Multi-Task Reinforcement Learning
• Captures a number of settings of interest
• Our primary contributions have been showing that this can provably speed learning (Brunskill and Li UAI 2013; Brunskill and Li ICML 2014; Guo and Brunskill AAAI 2015)
• Limitations: focused on discrete state and action spaces, impractical bounds, optimizing for average performance
Assume Related Meta-Groups
Wilson, A., Fern, A., Ray, S., & Tadepalli, P. (2007, June). Multi-task reinforcement learning: a hierarchical Bayesian approach. In Proceedings of the 24th international conference on Machine learning (pp. 1015-1022). ACM.
• Encouraging empirically
• No guarantees
• Scalability?
Hidden Parameter MDP
● Allow for smooth linear parameterization of dynamics model
Doshi-Velez, F., & Konidaris, G. (2016). Hidden parameter Markov decision processes: A semiparametric regression approach for discovering latent task parametrizations. In IJCAI 2016 (p. 1432).
Hidden Parameter MDP++
● Use Bayesian neural nets for dynamics
● Benefits for an HIV treatment simulation
● Each episode is a new patient
TW Killian, G Konidaris, F Doshi-Velez. Robust and Efficient Transfer Learning with Hidden Parameter Markov Decision Processes. NIPS 2017.
Sample Efficient Online RL: Policy Search
• Policy search is a hugely popular technique in RL
– Deep Reinforcement Learning through Policy Optimization, Schulman and Abbeel NIPS 2016 tutorial: https://people.eecs.berkeley.edu/~pabbeel/nips-tutorial-policy-optimization-Schulman-Abbeel.pdf
• Not always associated with data efficiency
• But can be used to tune a small set of parameters efficiently
Thomas and Brunskill, ICML 2016
Policy Search as Function Optimization
Figure from Wilson et al. JMLR 2014
π = f(θ): θ defines the policy class
(Figure: performance of policy as a function of θ)
Policy Search as Function Optimization: Gradient Approaches
• Use structure of sequential decision making
• Only find local optima
π = f(θ): θ defines the policy class
(Figure: performance of policy as a function of θ, from Wilson et al. JMLR 2014)
Policy Search as Bayesian Optimization: Treats Each Evaluation as Costly
π = f(θ): θ defines the policy class
(Figure: performance of policy as a function of θ)
• Explicitly represents uncertainty, finds global optima
• Generally ignores sequential structure
Figure from Wilson et al. JMLR 2014
Policy Search as Bayesian Optimization
● Data efficient policy search
● Can leverage Markovian structure (e.g. Wilson et al. 2014)
● Doesn’t have to make assumptions about world model
● Can combine with off policy evaluation to further speed up learning (in terms of amount of data required)
Goel, Dann and Brunskill IJCAI 2017
Policy Search for Optimal Stopping Problems
● Can leverage full trajectories to retrospectively evaluate alternate optimal stopping policies
● Substantial increase in data efficiency over generic bounds (Ng and Jordan UAI 2000)
● But Gaussian Processes struggle in high dimensions
Goel, Dann and Brunskill IJCAI 2017
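The retrospective-evaluation idea above can be sketched concretely: because a stopping decision does not change the underlying trajectory, one logged trajectory lets us score every threshold policy exactly. The "offer sequence" framing and all numbers below are illustrative, not from the paper.

```python
import numpy as np

def stopping_value(offers, threshold):
    """Reward of the policy 'stop at the first offer >= threshold'
    (take the last offer if none qualifies)."""
    for x in offers:
        if x >= threshold:
            return x
    return offers[-1]

# Made-up logged trajectories of offers, fully observed to the horizon.
trajectories = [[0.2, 0.7, 0.4], [0.9, 0.1, 0.3], [0.3, 0.5, 0.6]]
thresholds = [0.0, 0.5, 0.9]
avg_value = {t: np.mean([stopping_value(traj, t) for traj in trajectories])
             for t in thresholds}
best_threshold = max(avg_value, key=avg_value.get)
```

Every candidate policy is evaluated on every trajectory, so no importance weighting is needed; this is the source of the data-efficiency gain over generic off-policy bounds.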
Assuming Finite Set of Policies to Speed Learning in a New Task
● Fernández, F., & Veloso, M. (2006, May). Probabilistic policy reuse in a reinforcement learning agent. In Proceedings of the fifth international joint conference on Autonomous agents and multiagent systems (pp. 720-727). ACM.
● Talvitie, E., & Singh, S. P. (2007, January). An Experts Algorithm for Transfer Learning. In IJCAI (pp. 1065-1070).
● Azar, M. G., Lazaric, A., & Brunskill, E. (2013, September). Regret bounds for reinforcement learning with policy advice. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases (pp. 97-112).
● Focus on the discrete state and action space setting
Multi-Task Policy Learning
Ammar, H. B., Eaton, E., Luna, J. M., & Ruvolo, P. (2015). Autonomous cross-domain knowledge transfer in lifelong policy gradient reinforcement learning. In IJCAI.
Multi-Task Policy Learning
Goel, Dann and Brunskill IJCAI 2017
● Set of policies with a shared basis set of parameters
● Can be used to do cross-domain transfer (different states & actions)
Multi-Task Policy Learning with Shared Policy and Feature Parameters
● Descriptor features and policy features
● Primarily tackling continuous control tasks
● Scaling to enormous state spaces and performance for a single trajectory less clear
Isele, D., Rostami, M., & Eaton, E. (2016, July). Using Task Features for Zero-Shot Knowledge Transfer in Lifelong Learning. In IJCAI (pp. 1620-1626).
Deep RL Transfer
● Deep reinforcement learning to find good shared representation (Finn, Abbeel, Levine ICML 2017)
● Fast transfer by encouraging shared representation learning across tasks
Direct Policy Search
• Most of these are trying to speed online learning in a new task
• Not making guarantees on performance for a single task
(Some) Recent Sample Efficient RL References
Brunskill, E., & Li, L. Sample Complexity of Multi-task Reinforcement Learning. UAI 2013.
Wilson, A., Fern, A., Ray, S., & Tadepalli, P. (2007, June). Multi-task reinforcement learning: a hierarchical Bayesian approach. In Proceedings of the 24th international conference on Machine learning (pp. 1015-1022). ACM.
Doshi-Velez, F., & Konidaris, G. (2016). Hidden parameter Markov decision processes: A semiparametric regression approach for discovering latent task parametrizations. In IJCAI 2016 (p. 1432).
Killian, T. W., Konidaris, G., & Doshi-Velez, F. Robust and Efficient Transfer Learning with Hidden Parameter Markov Decision Processes. NIPS 2017.
Goel, Dann, & Brunskill. Sample Efficient Policy Search for Optimal Stopping Domains. IJCAI 2017.
Guo, Zhaohan, and Emma Brunskill. "Concurrent PAC RL." AAAI 2015.
Jiang, N., Krishnamurthy, A., Agarwal, A., Langford, J., & Schapire, R. E. (2017). Contextual Decision Processes with Low Bellman Rank are PAC-Learnable. ICML.
Silver, D., Newnham, L., Barker, D., Weller, S., & McFall, J. (2013, February). Concurrent reinforcement learning from customer interactions. In International Conference on Machine Learning (pp. 924-932)
Carlos Diuk, Lihong Li, and Bethany R. Leffler. The adaptive k-meteorologists problem and its application to structure learning and feature selection in reinforcement learning. In ICML, 2009.
Kirill Dyagilev, Shie Mannor, and Nahum Shimkin. Efficient reinforcement learning in parameterized models: Discrete parameter case. In European Workshop on Reinforcement Learning, 2008
Liu, Y., Guo, Z., & Brunskill, E. PAC Continuous State Online Multitask Reinforcement Learning with Identification. AAMAS 2016.
Osband, I., Russo, D., & Van Roy, B. (2013). (More) efficient reinforcement learning via posterior sampling. NIPS
Osband, I., & Van Roy, B. (2014). Near-optimal Regret Bounds for Reinforcement Learning in Factored MDPs. NIPS.
Zhou, L., & Brunskill, E. Latent Contextual Bandits and Their Application to Personalized Recommendations for New Users. IJCAI 2016.
Ammar, H. B., Eaton, E., Luna, J. M., & Ruvolo, P. (2015, July). Autonomous cross-domain knowledge transfer in lifelong policy gradient reinforcement learning. In International Joint Conference on Artificial Intelligence (pp. 3345-3351)
Isele, D., Rostami, M., & Eaton, E. (2016, July). Using Task Features for Zero-Shot Knowledge Transfer in Lifelong Learning. In IJCAI (pp. 1620-1626).
Finn, C., Abbeel, P., & Levine, S. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. ICML 2017.
Beyond Expectation
• If interacting with people, sometimes we may care about more than expected performance averaged across many rounds
• Work on risk-sensitive policies includes
– García, J., & Fernández, F. (2015). A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1), 1437-1480.
– Castro, D. D., Tamar, A., & Mannor, S. (2012). Policy gradients with variance related risk criteria. In ICML (pp. 935-942).
– Doshi-Velez, F., Pineau, J., & Roy, N. (2012). Reinforcement learning with limited reinforcement: Using Bayes risk for active learning in POMDPs. Artificial Intelligence, 187, 115-132.
– Prashanth, L. A., & Ghavamzadeh, M. (2013). Actor-critic algorithms for risk-sensitive MDPs. In NIPS (pp. 252-260).
– Chow, Y., & Ghavamzadeh, M. (2014). Algorithms for CVaR optimization in MDPs. In NIPS (pp. 3509-3517).
– Delage, E., & Mannor, S. (2007). Percentile optimization in uncertain Markov decision processes with application to efficient exploration. In ICML (pp. 225-232).
• Work on safe exploration includes
– Geramifard, A. (2012). Practical reinforcement learning using representation learning and safe exploration for large scale Markov decision processes (Doctoral dissertation, Massachusetts Institute of Technology).
– Moldovan, T. M., & Abbeel, P. (2012). Safe exploration in Markov decision processes. arXiv preprint arXiv:1205.481
– Turchetta, M., Berkenkamp, F., & Krause, A. (2016). Safe exploration in finite Markov decision processes with Gaussian processes. In NIPS (pp. 4312-4320).
– Gillula, J. H., & Tomlin, C. J. (2012). Guaranteed safe online learning via reachability: tracking a ground target using a quadrotor. In ICRA (pp. 2723-2730).
• Many make assumptions on structural regularities of the world model
(Typical RL successes: limited prior data) ≠ (people-facing applications: often lots of prior data!)
Batch (Entirely Offline) RL
A Classrooms Avg Score: 95
B Classrooms Avg Score: 92
What should we do for a new student?
Comes Up in Many Domains: e.g. Equipment Maintenance Scheduling
Comes Up in Many Domains: e.g. Patient Treatment Ordering
Challenge: Counterfactual Reasoning
(Given A Classrooms Avg Score 95 and B Classrooms Avg Score 92: what score would classrooms have gotten under the other condition?)

Challenge: Generalization to Untried Policies
(Given A Classrooms Avg Score 95 and B Classrooms Avg Score 92: how well would a policy we never tried perform?)
Counterfactual Estimation
● Active area of research in many fields
● Today focus on this in the context of reinforcement learning (sequential decision making)
● A few pointers to some other work and settings:
○ Saria and Schulam. ML Foundations and Methods for Precision Medicine and Healthcare. NIPS 2016 Tutorial.
○ T. Joachims, A. Swaminathan, Tutorial on Counterfactual Evaluation and Learning for Search, Recommendation and Ad Placement, ACM Conference on Research and Development in Information Retrieval (SIGIR), 2016.
○ Wager, S., & Athey, S. (2017). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, (to appear)
○ Austin, P. C. (2011). An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate behavioral research, 46(3), 399-424.
○ T. Schnabel, A. Swaminathan, A. Singh, N. Chandak, T. Joachims, Recommendations as Treatments: Debiasing Learning and Evaluation, International Conference on Machine Learning (ICML), 2016.
○ Rubin, D. B. (2005). Causal inference using potential outcomes: Design, modeling, decisions. Journal of the American Statistical Association, 100(469), 322-331.
Batch Data Policy Evaluation
Data on decisions & outcomes

Batch Data Policy Evaluation
Data on decisions & outcomes → candidate policies: Policy 1, Policy 2, Policy 3, ..., Policy N

Batch Data Policy Evaluation
Data on decisions & outcomes → for each Policy i, an Estimated Performance i (policy evaluation)

Batch Data Policy Selection
Data on decisions & outcomes → estimated performance for each policy (policy evaluation) → choose a Policy i (policy selection)
Batch RL for People
• May want to do only policy evaluation
• May want asymptotic guarantees on policy evaluation estimators
– Valid confidence intervals
– Consistency
• May want a measure of confidence in values: confidence intervals
• Measure of confidence in the policy
• Empirically good performance
• Robustness to assumptions in the estimator
• Lots of great work on off-policy policy learning
– Here assume we only get access to a fixed batch of data (no more learning) and care about accuracy of the result
• ...
Policy: Player state → level
Goal: Maximize engagement
Old data: ~11,000 students
Build Predictive Model
Data on decisions & outcomes → predictive statistical model of player behavior

Compute Policy that Optimizes Expected Rewards for this Model
(Plan against the predictive statistical model of player behavior)

Use Models as a Simulator
(Simulate action / observation / reward interactions with the predictive model)
Problem: Model May Not be Accurate… Yields Poor Estimate of Policy Performance
Worse: More Accurate Models Can Yield Even Poorer Performing Policies?
Mandel, Liu, B and Popovic AAMAS 2014
Using Estimators that Rely on Model Class Being Correct Can Fail
Data on decisions & outcomes → predictive statistical model → compute best policy for model
● Much prior work: if the model is good, the policy is good
● Challenge: the model class may be wrong
○ May model as Markov when the domain is not Markov
○ If we compute a policy assuming the model is Markov, the resulting value estimate is not valid
Using Estimators that Rely on Model Class Being Correct Can Fail
Data on decisions & outcomes → predictive statistical model → compute best policy for model
● Much prior work: if the model is good, the policy is good
● Challenge: the model class may be wrong
● How can we identify if the model type is wrong?
○ Sometimes feasible, sometimes hard
● Relates to
○ Sim2Real problem
○ Adversarial examples
Prior Work: Estimate Model from Data
Data on decisions & outcomes → predictive statistical model → compute best policy for model
• Strengths
– Low variance estimator of policy performance
• Weaknesses
– Not an unbiased estimator (the model may be poor!)
– Not a consistent estimator
Build & Estimate a Model
• Define states and actions
• Dynamics p(s’|s,a)
• Observation model p(z|s,a)
• Rewards r(s,a,z)
Historical Data
History1, R1 = Σi ri1
History2, R2 = Σi ri2
…
Key Challenge: Distribution Mismatch
• Rewards r(s,a,z)
• Policy maps history (a,z,r,a′,z′,…) → a
Behavior policy → histories → E[Σi ri]
New policy → histories? → E[Σi ri]?
Importance Sampling (IS) (e.g. Precup et al. 2002, Mandel et al. 2014)
Historical Data
History1, R1 = Σi ri1
History2, R2 = Σi ri2
…
Estimate of behavior policy πb performance → estimate of evaluation policy πe performance
Importance Sampling (IS) (e.g. Precup et al. 2002, Mandel et al. 2014)
• Strengths
– Unbiased estimator of policy performance
– Strongly consistent estimator of policy performance
• Weaknesses
– High variance estimator*
*Many extensions, including Retrace (NIPS 2016), but most are focused on the online off-policy setting
Historical Data
History1, R1 = Σi ri1
History2, R2 = Σi ri2
…
Estimated evaluation policy πe performance
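A minimal per-trajectory importance-sampling sketch: reweight each logged return by the product of evaluation-to-behavior action-probability ratios along the trajectory. The two-action data and both policies below are made up for illustration.

```python
import numpy as np

def is_estimate(trajectories, pi_e, pi_b):
    """Each trajectory is (actions, total_return); pi_e / pi_b map
    action -> probability (state-independent here for simplicity)."""
    vals = []
    for actions, ret in trajectories:
        w = np.prod([pi_e[a] / pi_b[a] for a in actions])  # importance weight
        vals.append(w * ret)
    return np.mean(vals)

pi_b = {0: 0.5, 1: 0.5}   # uniform behavior policy that generated the data
pi_e = {0: 0.9, 1: 0.1}   # evaluation policy we never actually ran
data = [([0], 1.0), ([1], 0.0), ([0], 1.0), ([1], 0.2)]
v_hat = is_estimate(data, pi_e, pi_b)
```

The estimate is unbiased, but the weights (1.8 and 0.2 here) already hint at the variance problem: with longer trajectories the product of ratios can explode.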
We used off-policy evaluation to find a policy with 30% higher
engagement (Mandel et al. AAMAS 2014)
Thomas, Theocharous, & Ghavamzadeh AAAI 2015 (slide from Thomas)
High Confidence Off-Policy Policy Evaluation
High Confidence Off-Policy Policy Evaluation
Thomas, Theocharous, & Ghavamzadeh AAAI 2015 (slide modified from Thomas)
High Confidence Off-Policy Policy Evaluation
Thomas, Theocharous, & Ghavamzadeh AAAI 2015
High Confidence Off-Policy Policy Evaluation
~2 million trajectories
Thomas, Theocharous, & Ghavamzadeh AAAI 2015
High Confidence Off-Policy Policy Improvement
• Use approximate confidence intervals
• Policy evaluation interleaved with running policy for high confidence iterative improvement
• Domain horizon (10) still quite small
Thomas, Theocharous, & Ghavamzadeh ICML 2015
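The high-confidence idea can be sketched as "deploy only if a lower confidence bound on the new policy beats the current one." Thomas et al. use tighter concentration bounds; this sketch uses a crude one-sided normal approximation, and the per-trajectory importance-sampled returns and baseline value are made up.

```python
import numpy as np

def lower_bound(is_returns, z=1.645):
    """~95% one-sided lower bound on the mean, normal approximation."""
    x = np.asarray(is_returns, dtype=float)
    return x.mean() - z * x.std(ddof=1) / np.sqrt(len(x))

# Hypothetical per-trajectory IS estimates for the candidate policy.
is_returns = [0.9, 1.1, 0.8, 1.2, 1.0, 0.7, 1.3, 1.0]
v_current = 0.5                       # performance of the deployed policy
deploy = lower_bound(is_returns) > v_current
```

Only when the pessimistic bound clears the incumbent's value do we switch, which is what makes the improvement step "high confidence" rather than best-guess.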
Two Extremes of Offline Reinforcement Learning
ML model + plan: + data efficient, − biased
Importance sampling: − data intensive, + unbiased
Image: https://upload.wikimedia.org/wikipedia/commons/f/f0/DARPA_Big_Data.jpg
Doubly Robust Estimation
ML model + plan (+ data efficient, − biased) combined with importance sampling (− data intensive, + unbiased)
Image: https://upload.wikimedia.org/wikipedia/commons/f/f0/DARPA_Big_Data.jpg
Doubly Robust (DR) Estimation
• Model + IS-based estimator
• Bandits (Dudik et al. 2011)
(The estimator combines the reward actually received with a model of the reward)
Doubly Robust Estimation for RL
• Jiang and Li (ICML 2016) extended DR to RL
• Limitation: the derived estimator is unbiased, and an unbiased estimator need not have the lowest mean squared error
(The DR estimator combines importance weights, the actual rewards in the dataset, and model-based estimates of Q and V)
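The per-trajectory DR recursion can be sketched as follows: start from the model's value estimate and correct it with importance-weighted TD-like terms, working backward through the trajectory (in the spirit of Jiang and Li's estimator). The model values, importance ratios, and trajectory below are all made up for illustration.

```python
def dr_trajectory(steps, V_hat, Q_hat, gamma=1.0):
    """steps: list of (s, a, r, rho), where rho = pi_e(a|s) / pi_b(a|s).
    V_hat[s] and Q_hat[(s, a)] are model-based value estimates."""
    v = 0.0
    for s, a, r, rho in reversed(steps):
        # Model baseline plus an importance-weighted correction term.
        v = V_hat[s] + rho * (r + gamma * v - Q_hat[(s, a)])
    return v

V_hat = {0: 1.0, 1: 0.5}                     # hypothetical model values
Q_hat = {(0, 0): 1.2, (1, 0): 0.5}
traj = [(0, 0, 1.0, 1.5), (1, 0, 0.4, 1.0)]  # (s, a, r, rho) per step
v_dr = dr_trajectory(traj, V_hat, Q_hat)
```

If the model is accurate the correction terms are small (low variance); if the importance ratios are accurate the corrections remove the model's bias, which is the "doubly robust" property.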
Tight Estimate Often Better than Unbiased: Measure with Mean Squared Error
Thomas and Brunskill, ICML 2016
• Trade off bias and variance
(Importance sampling estimator: unbiased but high variance; model-based estimator: low variance but biased)
Blend IS-Based & Model Based Estimators to Directly Min Mean Squared Error
(Blend the 1-step, 2-step, …, N-step estimates with weights x1, x2, …, xN to trade bias against variance)
Thomas and Brunskill, ICML 2016
Model and Guided Importance Sampling combining (MAGIC) Estimator
Estimated policy value using a particular weighting of the model estimate and the importance sampling estimate
Thomas and Brunskill, ICML 2016
• Solve quadratic program
• Strongly consistent*
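A heavily simplified one-dimensional version of this idea (not the paper's quadratic program over step-wise estimates): blend a model estimate with the IS estimate using the weight that minimizes a plug-in MSE estimate. The model value and IS returns below are illustrative.

```python
import numpy as np

def blend(model_est, is_returns):
    """Weight the (biased, low-variance) model estimate against the
    (unbiased, high-variance) IS mean to minimize estimated MSE."""
    x = np.asarray(is_returns, dtype=float)
    is_mean = x.mean()
    var = x.var(ddof=1) / len(x)            # variance of the IS mean
    bias_sq = (model_est - is_mean) ** 2    # crude plug-in bias estimate
    lam = var / (var + bias_sq)             # weight on the model estimate
    return lam * model_est + (1 - lam) * is_mean

v_hat = blend(model_est=0.8, is_returns=[0.0, 2.1, 0.3, 1.8, 0.1, 1.9])
```

When the IS estimate is noisy the blend leans on the model; as data accumulates, the variance term shrinks and the blend approaches the unbiased IS estimate, which is the intuition behind strong consistency.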
MAGIC Can Yield Orders of Magnitude Better Estimates of Policy Performance
(Figure, log scale: estimation error vs. number of histories for the IS-based, Model, DR, MAGIC, and MAGIC-B estimators)
Thomas and Brunskill, ICML 2016
MAGICal Policy Search
Thomas and Brunskill, EWRL 2016
Variance of Importance Sampling Can Grow Exponentially with Time Horizon
Trading Bias & Variance for Long Horizon Off Policy Policy Evaluation
● Importance sampling based estimator
● Dropping weights (covariance) reduces variance at the cost of bias
● Strongly consistent
Guo, Thomas and Brunskill, NIPS 2017
(Split the importance weights: weights for the first part of the trajectory, weights for the second part of the trajectory)
Trading Bias & Variance for Long Horizon Off Policy Policy Evaluation
Promising for some forms of evaluation in Atari
But subtle: depends on particular form of rewards and policies
*Note: updated since original tutorial given a bug found in original plots
Guo, Thomas and Brunskill, NIPS workshop 2017
High Confidence Bounds for Weighted Doubly Robust Estimation
● High confidence bounds for weighted doubly robust off-policy policy estimation (Hanna, Stone, Niekum AAMAS 2017)
Fairness of Importance Sampling for Policy Selection
• Though IS policy estimates are unbiased, policy selection using them can be unfair
• Define unfair as: selecting the worse policy as the higher-performing one more than 50% of the time
• With finite data, selection can be biased towards certain policies
– e.g. more myopic policies / shorter trajectories
(Figure: estimated values for Policy 1 and Policy 2)
Doroudi, Thomas and Brunskill, UAI 2017, Best Paper
Re-thinking Models: Robust Matrix Evaluation
• When data is very limited and the horizon is long
• IS estimators still have too high variance
• Train multiple models & simulate policies on each
• Can use for minimax policy selection
Doroudi, Aleven and Brunskill, Learning at Scale, 2017
Model \ Policy Policy 1 Policy 2
Model 1 8 1
Model 2 10 27
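The minimax selection over the table above is a two-line computation: score each policy by its worst-case value across the trained models, then pick the policy with the best worst case (the values are taken from the table; the code is just a sketch of the selection rule).

```python
import numpy as np

# Rows: models, columns: policies (values from the table above).
values = np.array([[8.0, 1.0],
                   [10.0, 27.0]])
worst_case = values.min(axis=0)       # worst model for each policy
chosen = int(np.argmax(worst_case))   # minimax choice (0 = Policy 1)
```

Here Policy 2 looks much better under Model 2, but its worst case (1) is far below Policy 1's worst case (8), so the minimax rule prefers Policy 1.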
Re-thinking Models
• Which models to include?
• What are good models with limited data?
• Bayesian Neural Networks seem promising
– Applications to stochastic dynamical systems (e.g. Depeweg, Hernández-Lobato, Doshi-Velez, & Udluft, ICLR 2017)
Overview
● RL introduction
● RL for people
● RL by the people
RL By the People
● Reward specification
● Demonstrations
● Rewarding
● Advice / labeling / critiquing
● Teaching
● Shaping the space
*Also see Section 4 of great tutorial by Taylor, Kamar and Hayes: http://interactiveml.net/
Human Specifies Reward
● Is the true reward the best reward for the agent to learn from? E.g.
○ Singh, S., Lewis, R. L., & Barto, A. G. Where do rewards come from. CogSci 2009.
● Given a (computationally) bounded agent, it may not be
● A human may not write down the reward they really want
● May want constraints on behavior (Thomas, Castro da Silva, Barto, Brunskill arXiv 2017)
Inverse Reward Design (Hadfield-Menell et al., NIPS 2017)
● Reward is an observation of the true desired reward function
● Compute a Bayesian posterior over the true reward
● Compute a risk-averse policy
● Avoids new bad things
● Also avoids new good things
Image credit: http://www.infocuriosity.com/king-midas-and-his-golden-touch-an-ancient-greek-mythology-for-children/
Human Specifies Reward
● May be hard for people to specify a reward
● Can use it as a partial specification / shaping
● Do RL on top of the specified reward
Human Provides Demonstrations
● Imitation / IRL / learning from demonstration / apprenticeship learning
● Enormously influential, especially in robotics
● Recent tutorial at ICRA: http://lasa.epfl.ch/tutorialICRA16/
● Key idea: a human provides demonstrations and the agent uses these to learn the task
● Here assume we get access to an initial set of trajectories & then have no more interactions with the person
● Goal is to learn a good policy
Behavioral Cloning
● Key challenge: data is not iid
Figure from Ross, S., & Bagnell, J. A. (2012). Agnostic system identification for model-based reinforcement learning. ICML.
Human Provides Demonstrations: Inverse Reinforcement Learning
● Learn a reward function from human demonstrations
● Develop a policy given that reward function
○ Often involves assuming access to a dynamics model or the ability to try new policies online
○ IRL is ill-specified: a reward of 0 everywhere is always sufficient
○ Try to match state features
○ The max-entropy approach has been very influential
○ Meta-learning
○ Learning from different demonstrations
○ Doing RL on top can be useful
○ A number of recent efforts to combine with deep learning
Human Provides Demonstrations: Guided Cost Learning
Finn, C., Levine, S., & Abbeel, P. (2016, June). Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning (pp. 49-58).
Human Provides Demonstrations: Generative Adversarial Imitation Learning
Ho, J., & Ermon, S. (2016). Generative adversarial imitation learning. In Advances in Neural Information Processing Systems (pp. 4565-4573).
(Figure: number of trajectories on the x-axis, performance on the y-axis)
Some Benefits & Limitations to Learning from Demonstrations
● Requires access to experts performing the task; demonstrations alone can be limited, so they are often combined with online RL, which is not always feasible
● In some cases it is expensive/slow to gather data (teach a student across a year? provide customer recommendations across months?)
● May not help if the solution we need is radically far from the demonstrations; we still have to handle the exploration/exploitation trade-off online afterwards
● Still a question of how to capture expertise (e.g., which state features to use)
● But generally very promising
○ Demonstrations can be prior data collected for other purposes
○ Great for automating, scaling up, & fine-tuning good solutions
Humans Providing Online Rewards
● Sophie’s Kitchen
● Human trainer can award a scalar reward signal r ∈ [−1, 1]
Thomaz, A. L., & Breazeal, C. (2008). Teachable robots: Understanding human teaching behavior to build more effective robot learners. Artificial Intelligence Journal, 172:716-737.
Humans Providing Online Rewards
W. Bradley Knox and Peter Stone. Combining Manual Feedback with Subsequent MDP Reward Signals for Reinforcement Learning. In Proceedings of the Ninth International Conference on Autonomous Agents and Multiagent Systems (AAMAS), May 2010. (Best student paper)
● TAMER framework
● Uses human reward feedback to help shape the agent’s reward
● Can be used in addition to other reward signals from the domain
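A minimal sketch of the combination idea (a toy chain MDP with assumed names, not Knox & Stone's actual code): a learned human-feedback term H(s, a), scaled by β, is simply added to the environment reward inside an ordinary Q-learning update.

```python
import random

N_STATES, ACTIONS = 5, (-1, 1)                 # chain 0..4; move left/right
H = {(s, 1): 1.0 for s in range(N_STATES)}     # trainer likes moving right

def step(s, a):
    s2 = min(max(s + a, 0), N_STATES - 1)
    r = 10.0 if s2 == N_STATES - 1 else 0.0    # environment goal at right end
    return s2, r

def train(beta, episodes=200, alpha=0.5, gamma=0.9, eps=0.1):
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    rng = random.Random(0)
    for _ in range(episodes):
        s = 0
        for _ in range(20):
            a = (rng.choice(ACTIONS) if rng.random() < eps
                 else max(ACTIONS, key=lambda x: Q[(s, x)]))
            s2, r = step(s, a)
            # TAMER-style combination: environment reward + scaled human term.
            shaped = r + beta * H.get((s, a), 0.0)
            Q[(s, a)] += alpha * (shaped
                                  + gamma * max(Q[(s2, b)] for b in ACTIONS)
                                  - Q[(s, a)])
            s = s2
    return Q

Q = train(beta=1.0)
greedy = [max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES)]
```

Here the human term mostly speeds up learning; with β = 0 the agent would still eventually find the goal from the environment reward alone.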
Human Provides Advice / Labels / Feedback
● Human will continue to provide advice about policies over time
Ross, S., Gordon, G. J., & Bagnell, D. (2011). A reduction of imitation learning and structured prediction to no-regret online learning. In International Conference on Artificial Intelligence and Statistics (pp. 627-635).
Human Provides Advice
Griffith, S., Subramanian, K., Scholz, J., Isbell, C., & Thomaz, A. L. (2013). Policy shaping: Integrating human feedback with reinforcement learning. In Advances in Neural Information Processing Systems (pp. 2625-2633).
● Advise: a Bayesian approach for policy shaping
● Agent may get a “right”/“wrong” label after performing an action
● C: feedback is consistent with the optimal policy with probability 0 < C < 1
● Update a posterior over the policy from the human feedback
● Combine the policy learned from the agent’s own experience with the policy estimated from the human
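The Bayesian update above can be sketched directly (assumed names; a sketch of the Advise-style form, where Δ is the count of "right" minus "wrong" labels for an action and C is the feedback-consistency probability):

```python
def prob_optimal(n_right, n_wrong, C=0.8):
    """P(action is optimal | labels): C^Delta / (C^Delta + (1-C)^Delta)."""
    delta = n_right - n_wrong
    return C**delta / (C**delta + (1.0 - C)**delta)

def shaped_policy(pi_agent, feedback, C=0.8):
    """Combine the agent's own action distribution with the human estimate
    by multiplying the two and renormalizing (sketch)."""
    combined = {a: pi_agent[a] * prob_optimal(*feedback[a], C)
                for a in pi_agent}
    z = sum(combined.values())
    return {a: p / z for a, p in combined.items()}

# A few "right" labels quickly concentrate the policy on that action:
pi = {"left": 0.5, "right": 0.5}
fb = {"left": (0, 2), "right": (3, 0)}   # (n_right, n_wrong) per action
shaped = shaped_policy(pi, fb)
```

Note the behavior at Δ = 0: the human term is uninformative (probability 0.5), so the agent falls back on its own experience.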
Discussion: Human Provides Advice / Labels / Feedback
● Performance gains are often significant
● Challenge: it is expensive to have someone in the loop, and often not feasible (the number of advice examples can run into the hundreds)
● In some cases (a household robot, asking a supervisor for help with a tricky case) it could be realistic, and easier than an expert providing demonstrations
● There might be some interesting merged cases:
○ Learning from demonstrators who can finish the agent’s demonstration once it is stuck?
● Could also potentially be good for unknown unknowns
○ A human may help the agent realize there is a problem/limitation even when the agent would not have
Humans Teaching Agents
● How do people try to teach?
● Evidence that (at least for animal training) people may expect their reward feedback to be interpreted as an action label
○ This differs from a learner trying to maximize its long-term reward
○ Ho, M. K., Littman, M. L., Cushman, F., & Austerweil, J. L. (2015). Teaching with rewards and punishments: Reinforcement or communication? In CogSci.
How do People Teach
● People can perform tasks differently when trying to show the task than when doing it at a very high level of performance
○ Ho, M. K., Littman, M., MacGlashan, J., Cushman, F., & Austerweil, J. L. (2016). Showing versus doing: Teaching by demonstration. In Advances in Neural Information Processing Systems (pp. 3027-3035)
How do People Teach
● Whether human trainers give positive or negative feedback for a decision is influenced by the learner’s current policy, and this policy-dependence changes which learning algorithm is best
○ MacGlashan, J., Ho, M. K., Loftin, R., Peng, B., Wang, G., Roberts, D. L., Taylor, M. E., & Littman, M. L. (2017). Interactive Learning from Policy-Dependent Human Feedback. In Proceedings of the 34th International Conference on Machine Learning.
Cooperative Inverse RL
Hadfield-Menell, D., Russell, S. J., Abbeel, P., & Dragan, A. (NIPS 2016).
● Game where human and the agent get rewards determined by the same reward function.
● If the learner has different capabilities from teacher, teacher should teach in a way to help learning agent learn its optimal policy (which may be different than if the teacher performed the task herself)
How Should We Teach Agents
● Machine teaching: the optimal way to teach an agent given knowledge of the agent’s learning algorithm
○ Zhu, X. (2015). Machine Teaching: An Inverse Problem to Machine Learning and an Approach Toward Optimal Education. In AAAI (pp. 4083-4087).
● How to teach a sequential decision-making algorithm, e.g.:
○ Cakmak, M., & Lopes, M. (2012). Algorithmic and Human Teaching of Sequential Decision Tasks. In AAAI.
○ Walsh, T. J., & Goschin, S. (2012). Dynamic Teaching in Sequential Decision Making Environments. In UAI.
○ Peng, B., MacGlashan, J., Loftin, R., Littman, M. L., Roberts, D. L., & Taylor, M. E. (2017). Curriculum Design for Machine Learners in Sequential Decision Tasks. In AAMAS.
Overview
● Reward specification
● Demonstrations
● Rewarding
● Advice / labeling / critiquing
● Teaching
● Shaping the space
*Also see Section 4 of great tutorial by Taylor, Kamar and Hayes: http://interactiveml.net/
Histogram Tutor
At the end, a post-test
Action Correct/Wrong
Continually Improving Tutoring System
Action
Improving Across Many Students
Action Action
Over Time, the Tutoring System Stopped Giving Some Problems to Students
System Self-Diagnosed that Problems Weren’t Helping Student Learning
Rules of the World Are Not Fixed
≠
Humans are Invention Machines
New actions New sensors
Reward
Action Observation
Goal: Choose actions to maximize expected rewards
Human in the Loop Reinforcement Learning
Reward
Action Observation
Goal: Choose actions to maximize expected rewards
Human in the Loop Reinforcement Learning
Add Action
Reward
Action Observation
Goal: Choose actions to maximize expected rewards
Add Action
Direct Human Effort for Adding New Actions
Mandel, Liu, Brunskill & Popovic, AAAI 2017
Expected Local Improvement
ELI(s) = Pr(human gives you action a_h for state s) × (improvement in value/outcomes at state s if action a_h is added)
Mandel, Liu, Brunskill & Popovic, AAAI 2017
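A rough sketch of that quantity (assumed names and array shapes, not the paper's exact estimator): for each state, weight each candidate action's value gain by the probability the human would actually supply it, then query the human at the state with the largest expected gain.

```python
import numpy as np

def expected_local_improvement(p_human, q_new, v_current):
    """p_human[s, a]: est. prob. the human supplies action a at state s.
    q_new[s, a]: est. value of state s if action a were added.
    v_current[s]: value of s under the current action set."""
    # Gain from an added action is how much it raises the state's value;
    # never negative, since the old actions are kept.
    gains = np.maximum(q_new - v_current[:, None], 0.0)
    return (p_human * gains).sum(axis=1)   # expectation over human's answer

# Two states, two candidate actions (illustrative numbers):
p = np.array([[0.7, 0.3],
              [0.5, 0.5]])
q = np.array([[1.0, 4.0],
              [2.0, 2.0]])
v = np.array([2.0, 2.0])
eli = expected_local_improvement(p, q, v)
best_state = int(np.argmax(eli))           # direct human effort here
```

The point of the weighting is that a state where a huge improvement exists but the human is unlikely to volunteer the right action may be a worse place to spend human effort than a state with a modest but likely gain.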
Mostly Bad Human Input
Mandel, Liu, Brunskill & Popovic, AAAI 2017
Ask people to add new hints where they might help
Reward
Action
Observation
Add Feature
Is there a Latent Feature that if System Knew It, Could Make Better Decisions?
Ongoing work with Ramtin Keramati
Ongoing work with Ramtin Keramati
Ask for Feature if Exists Latent Variable Model of a State That Could Change Policy
Ongoing work with Ramtin Keramati
Ask for Feature if Exists Latent Variable Model of a State That Could Change Policy
Ongoing work with Ramtin Keramati
Learn Model That Supports Making Good Decisions, Not Perfect Model
Reward
Action
Observation
Requests
Directing Human Experts to Change Actions & Observation Space
Ongoing work with Ramtin Keramati
Reward
Action
Observation
Requests
Another Way to Have Humans Teach Agents
Ongoing work with Ramtin Keramati
Ask people to add new hints where they might help
But system isn’t improving much… what’s going on?
Humans Teaching Agents: Questions
● How do we teach humans to teach agents?
○ People may not have a good model of (a) the agent/learner, or (b) access to content (actions) that can change the agent’s/learner’s state
○ May need guidance about how best to help the agent/learner
○ Just like people have to learn to be effective human teachers…
○ In our group, a new project on teaching the teachers...
● Lots of open directions here
● Involving teachers is still expensive relative to leveraging prior trajectories of demonstrations or instructions
● Can we make better algorithms that leverage past teaching demonstrations (YouTube has many lecture videos) rather than expert demonstrations? COACH (on the prior slide; MacGlashan et al.) is one step in this direction
RL with help from People
● Reward specification
● Demonstrations
● Rewarding
● Advice / labeling / critiquing
● Teaching
● Shaping the space
*Also see Section 4 of great tutorial by Taylor, Kamar and Hayes: http://interactiveml.net/
Overview
● RL introduction
● RL for people
● RL by the people
● Lots of applications that could benefit from RL
● Lots of ways people can help make RL systems better
● Interested in discussing a postdoc opportunity? Email me at [email protected]
Thanks to
and Karan Goel, Travis Mandel, Yun-En Liu, Ramtin Keramati, NSF, ONR, Microsoft, Google, Yahoo & IES